FOUNDATIONS OF DEPENDABLE COMPUTING System Implementation
THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE
OFFICE OF NAVAL RESEARCH Advanced Book Series
Consulting Editor: André M. van Tilborg
Other titles in the series:

FOUNDATIONS OF DEPENDABLE COMPUTING: Models and Frameworks for Dependable Systems, edited by Gary M. Koob and Clifford G. Lau, ISBN: 0-7923-9484-4

FOUNDATIONS OF DEPENDABLE COMPUTING: Paradigms for Dependable Applications, edited by Gary M. Koob and Clifford G. Lau, ISBN: 0-7923-9485-2

PARALLEL ALGORITHM DERIVATION AND PROGRAM TRANSFORMATION, edited by Robert Paige, John Reif and Ralph Wachter, ISBN: 0-7923-9362-7

FOUNDATIONS OF KNOWLEDGE ACQUISITION: Cognitive Models of Complex Learning, edited by Susan Chipman and Alan L. Meyrowitz, ISBN: 0-7923-9277-9

FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning, edited by Alan L. Meyrowitz and Susan Chipman, ISBN: 0-7923-9278-7

FOUNDATIONS OF REAL-TIME COMPUTING: Formal Specifications and Methods, edited by André M. van Tilborg and Gary M. Koob, ISBN: 0-7923-9167-5

FOUNDATIONS OF REAL-TIME COMPUTING: Scheduling and Resource Management, edited by André M. van Tilborg and Gary M. Koob, ISBN: 0-7923-9166-7
FOUNDATIONS OF DEPENDABLE COMPUTING System Implementation
edited by
Gary M. Koob
Clifford G. Lau
Office of Naval Research
KLUWER ACADEMIC PUBLISHERS
Boston / Dordrecht / London
Distributors for North America: Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061 USA

Distributors for all other countries: Kluwer Academic Publishers Group, Distribution Centre, Post Office Box 322, 3300 AH Dordrecht, THE NETHERLANDS
Library of Congress Cataloging-in-Publication Data

Foundations of dependable computing. System implementation / edited by Gary M. Koob, Clifford G. Lau.
p. cm. -- (The Kluwer international series in engineering and computer science ; SECS 0285)
Includes bibliographical references and index.
ISBN 0-7923-9486-0
1. Electronic digital computers--Reliability. 2. Real-time data processing. 3. Fault-tolerant computing. 4. Systems engineering. I. Koob, Gary M., 1958- . II. Lau, Clifford. III. Series: Kluwer international series in engineering and computer science ; SECS 0285.
QA76.5.F624 1994
004.2'2--dc20
94-29138
CIP
Copyright © 1994 by Kluwer Academic Publishers
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061.
Printed on acid-free paper. Printed in the United States of America
CONTENTS
Preface ................................................................. vii

Acknowledgements ....................................................... xiii

1. DEPENDABLE COMPONENTS .................................................. 1

1.1 Self-Checking and Self-Exercising Design for Hierarchic Long-Life
    Fault-Tolerant Systems ................................................ 3
    D.A. Rennels and H. Kim

1.2 Design of Self-Checking Processors Using Efficient Berger Check
    Prediction Logic ..................................................... 35
    T.R.N. Rao, G-L. Feng, and M.S. Kolturu

2. DEPENDABLE COMMUNICATIONS ............................................. 69

2.1 Network Fault-Detection and Recovery in the Chaos Router ............ 71
    K.W. Bolding and L. Snyder

2.2 Real-Time Fault-Tolerant Communication in Distributed Computing
    Systems .............................................................. 87
    K.G. Shin and Q. Zheng

3. COMPILER SUPPORT ..................................................... 133

3.1 Speculative Execution and Compiler-Assisted Multiple Instruction
    Recovery ............................................................ 135
    W.K. Fuchs, N.J. Alewine, and W-M. Hwu

3.2 Compiler Assisted Synthesis of Algorithm-Based Checking in
    Multiprocessors ..................................................... 159
    P. Banerjee, V. Balasubramanian, and A. Roy-Chowdhury

4. OPERATING SYSTEM SUPPORT ............................................. 213

4.1 Application-Transparent Fault Management in Fault-Tolerant
    Mach ................................................................ 215
    M. Russinovich, Z. Segall, and D.P. Siewiorek

4.2 Constructing Dependable Distributed Systems Using Consul ........... 243
    R.D. Schlichting, S. Mishra, and L.L. Peterson

4.3 Enhancing Fault-Tolerance of Real-Time Systems Through Time
    Redundancy .......................................................... 265
    S.R. Thuel and J.K. Strosnider

Index ................................................................... 319
PREFACE
Dependability has long been a central concern in the design of space-based and military systems, where survivability for the prescribed mission duration is an essential requirement, and is becoming an increasingly important attribute of government and commercial systems, where reduced availability may have severe financial consequences or even lead to loss of life.

Historically, research in the field of dependable computing has focused on the theory and techniques for preventing hardware and environmentally induced faults through increasing the intrinsic reliability of components and systems (fault avoidance), or surviving such faults through massive redundancy at the hardware level (fault tolerance). Recent advances in hardware, software, and measurement technology, coupled with new insights into the nature, scope, and fundamental principles of dependable computing, however, contributed to the creation of a challenging new research agenda in the late eighties aimed at dramatically increasing the power, effectiveness, and efficiency of approaches to ensuring dependability in critical systems.

At the core of this new agenda was a paradigm shift spurred by the recognition that dependability is fundamentally an attribute of applications and services, not platforms. Research should therefore focus on (1) developing a scientific understanding of the manifestations of faults at the application level in terms of their ultimate impact on the correctness and survivability of the application; (2) innovative, application-sensitive approaches to detecting and mitigating this impact; and (3) hierarchical system support for these new approaches.

Such a paradigm shift necessarily entailed a concomitant shift in emphasis away from inefficient, inflexible, hardware-based approaches toward higher level, more efficient and flexible software-based solutions. Consequently, the role of hardware-based mechanisms was redefined to that of providing and implementing the abstractions required to support the higher level software-based mechanisms in an integrated, hierarchical approach to ultradependable system design.

This shift was furthermore compatible with an expanded view of "dependability," which had evolved to mean "the ability of the system to deliver the specified (or expected) service." Such a definition encompasses not only survival of traditional single hardware faults and environmental disturbances but more complex and less well understood phenomena as well: Byzantine faults, correlated errors, timing faults, software design and process interaction errors, and, most significantly, the unique issues encountered in real-time systems in which faults and transient overload conditions must be detected and handled under hard deadline and resource constraints.
As sources of service disruption multiplied and focus shifted to their ultimate effects, traditional frameworks for reasoning about dependability had to be rethought. The classical fault/error/failure model, in which underlying anomalies (faults) give rise to incorrect values (errors), which may ultimately cause incorrect behavior at the output (failures), required extension to capture timing and performance issues. Graceful degradation, a long-standing principle codifying performance/dependability trade-offs, must be more carefully applied in real-time systems, where individual task requirements supersede general throughput optimization in any assessment. Indeed, embedded real-time systems, often characterized by interaction with physical sensors and actuators, may possess an inherent ability to tolerate brief periods of incorrect interaction, either in the values exchanged or the timing of those exchanges. Thus, a technical failure of the embedded computer does not necessarily imply a system failure. The challenge of capturing and modeling dependability for such potentially complex requirements is matched by the challenge of successfully exploiting them to devise more intelligent and efficient, as well as more complete, dependability mechanisms.

The evolution to a hierarchical, software-dominated approach would not have been possible without several enabling advances in hardware and software technology over the past decade:

(1) Advances in VLSI technology and RISC architectures have produced components with more chip real estate available for incorporation of efficient concurrent error detection mechanisms and more on-chip resources permitting software management of fine-grain redundancy;

(2) The emergence of practical parallel and distributed computing platforms possessing inherent coarse-grain redundancy of processing and communications resources, also amenable to efficient software-based management by either the system or the application;
(3) Advances in algorithms and languages for parallel and distributed computing, leading to new insights in and paradigms for problem decomposition, module encapsulation, and module interaction, potentially exploitable in refining redundancy requirements and isolating faults;

(4) Advances in distributed operating systems allowing more efficient interprocess communication and more intelligent resource management;

(5) Advances in compiler technology that permit efficient, automatic instrumentation or restructuring of application code, program decomposition, and coarse- and fine-grain resource management; and

(6) The emergence of fault-injection technology for conducting controlled experiments to determine the system- and application-level manifestations of faults and for evaluating the effectiveness or performance of fault-tolerance methods.
In response to this challenging new vision for dependable computing research, the advent of the technological opportunities for realizing it, and its potential for addressing critical dependability needs of Naval, Defense, and commercial systems, the Office of Naval Research launched a five-year basic research initiative in 1990, Ultradependable Multicomputers and Electronic Systems, to accelerate and integrate progress in this important discipline. The objective of the initiative was to establish the fundamental principles as well as practical approaches for efficiently incorporating dependability into critical applications running on modern platforms. More specifically, the initiative sought increased effectiveness and efficiency through (1) intelligent exploitation of the inherent redundancy available in modern parallel and distributed computers and VLSI components; (2) more precise characterization of the sources and manifestations of errors; (3) exploitation of application semantics at all levels (code, task, algorithm, and domain) to allow optimization of fault-tolerance mechanisms to both application requirements and resource limitations; (4) hierarchical, integrated software/hardware approaches; and (5) development of scientific methods for evaluating and comparing candidate approaches.

Implementation of this broad mandate as a coherent research program necessitated focusing on a small cross-section of promising application-sensitive paradigms (including language-, algorithm-, and coordination-based approaches), their required hardware, compiler, and system support, and a few selected modeling and evaluation projects. In scope, the initiative emphasizes dependability primarily with respect to an expanded class of hardware and environment (both physical and operational) faults. Many of the efforts furthermore explicitly address issues of dependability unique to the domain of embedded real-time systems.

The success of the initiative and the significance of the research is demonstrated by the ongoing associations that many of our principal investigators have forged with a variety of military, Government, and commercial projects whose critical needs are leading to the rapid assimilation of concepts, approaches, and expertise arising from this initiative. Activities influenced to date include the FAA's Advanced Automation System for air traffic control, the Navy's AX project and Next Generation Computing Resources standards program, the Air Force's Center for Dependable Systems, the OSF/1 project, the space station Freedom, the Strategic
Defense Initiative, and research projects at GE, DEC, Tandem, the Naval Surface Warfare Center, and the MITRE Corporation.

This book series is a compendium of papers summarizing the major results and accomplishments attained under the auspices of the ONR initiative in its first three years. Rather than providing a comprehensive text on dependable computing, the series is intended to capture the breadth, depth, and impact of recent advances in the field, as reflected through the specific research efforts represented, in the context of the vision articulated here. Each chapter does, however, incorporate appropriate background material and references. In view of the increasing importance and pervasiveness of real-time concerns in critical systems that impact our daily lives, ranging from multimedia communications to manufacturing to medical instrumentation, the real-time material is woven throughout the series rather than isolated in a single section or volume.

The series is partitioned into three volumes, corresponding to the three principal avenues of research identified at the beginning of this preface. While many of the chapters actually address issues at multiple levels, reflecting the comprehensive nature of the associated research project, they have been organized into these volumes on the basis of the primary conceptual contribution of the work. Agha and Sturman, for example, describe a framework (reflective architectures), a paradigm (replicated actors), and a prototype implementation (the Screed language and Broadway runtime system). But because the salient attribute of this work is the use of reflection to dynamically adapt an application to its environment, it is included in the Frameworks volume.

Volume I, Models and Frameworks for Dependable Systems, presents two comprehensive frameworks for reasoning about system dependability, thereby establishing a context for understanding the roles played by specific approaches presented throughout the series. This volume then explores the range of models and analysis methods necessary to design, validate, and analyze dependable systems.

Volume II, Paradigms for Dependable Applications, presents a variety of specific approaches to achieving dependability at the application level. Driven by the higher level fault models of Volume I and built on the lower level abstractions implemented in Volume III, these approaches demonstrate how dependability may be tuned to the requirements of an application, the fault environment, and the characteristics of the target platform. Three classes of paradigms are considered: protocol-based paradigms for distributed applications, algorithm-based paradigms for parallel applications, and approaches to exploiting application semantics in embedded real-time control systems.

Volume III, System Implementation, explores the system infrastructure needed to support the various paradigms of Volume II. Approaches to implementing
support mechanisms and to incorporating additional appropriate levels of fault detection and fault tolerance at the processor, network, and operating system levels are presented. A primary concern at these levels is balancing cost and performance against coverage and overall dependability. As these chapters demonstrate, low-overhead, practical solutions are attainable and not necessarily incompatible with performance considerations. The section on innovative compiler support, in particular, demonstrates how the benefits of application specificity may be obtained while reducing hardware cost and run-time overhead.

This third volume in the series completes the picture established in the first two volumes by presenting detailed descriptions of techniques for implementing the dependability infrastructure of the system at the operating system, run-time environment, communications, and processor levels.

Section 1 presents design approaches for implementing concurrent error detection in processors and other hardware components. Rennels and Kim apply the principles of self-checking, self-exercising design at the processor level. Rao et al. use an extension of the well-known Berger code as the mathematical foundation for the construction of self-checking ALUs. These components provide the fundamental building blocks of dependable systems and the abstractions required by higher level software-implemented protocols.

The field of fault-tolerant computing was once dominated by concerns over the dependability of the processor. In modern parallel and distributed systems, the reliability of the network is just as critical. Communications dependability, like processor dependability, is more appropriately addressed at the lower layers of the system to minimize impact on performance and to simplify higher-level protocols and algorithms. In the first chapter of Section 2, Bolding and Snyder describe how the inherent attributes of chaotic routing, an approach proposed for its performance advantages in parallel systems, may be adapted to support dependability as well. The use of non-determinism in chaotic routing is the key to realizing its goal of high throughput, but is inappropriate for real-time systems where low latency and predictability are the primary concerns. Shin and Zheng offer the concept of redundant real-time channels as a solution to the problem of dependable end-to-end communications in distributed systems under hard real-time constraints.

Compiler technology has emerged within the past decade as a dominant factor in effectively mapping application demands onto available resources to achieve high utilization. The chapters in Section 3 explore the intersection of compiler optimizations for high performance and compiler transformations to support dependability goals. Fuchs et al. describe the similarities between compiler transformations intended to support instruction-level recovery from transient errors and the state management requirements encountered when speculative execution is employed in superscalar architectures. Banerjee et al. adapt parallelizing compiler technology to the
automation of algorithm-based fault tolerance. The approach considers not only transformations for generating checks, but also partitioning, mapping, and granularity adjustment strategies that balance performance and dependability requirements.

Operating system support is central to the design of modern dependable systems. Support is required for service abstractions, communication, checkpointing and recovery, and resource/redundancy management. All of these are considered in Section 4. Russinovich et al. describe how various error detection and recovery mechanisms may be efficiently integrated into an operating system built on the popular Mach microkernel, in a manner transparent to the application itself but customizable to its requirements. Protocol-based paradigms for distributed systems all seem to rely on a number of fundamental protocols such as atomic multicast and group membership. Schlichting et al. have organized these operations into a comprehensive communications "substrate"; by exploiting the interdependencies among them, an efficient implementation has been constructed. The problem of resource management in a real-time distributed system becomes even more complex when deadline-constrained redundancy and fault management must be supported. Thuel and Strosnider present a framework for managing redundancy, exceptions, and recovery under hard real-time constraints.

Gary M. Koob
Mathematical, Computer and Information Sciences Division
Office of Naval Research
Clifford G. Lau
Electronics Division
Office of Naval Research
ACKNOWLEDGEMENTS
The editors regret that, due to circumstances beyond their control, two planned contributions to this series could not be included in the final publications: "Compiler Generated Self-Monitoring Programs for Concurrent Detection of Run-Time Errors," by J.P. Shen, and "The Hybrid Fault Effects Model for Dependable Systems," by C.J. Walter, M.M. Hugue, and N. Suri. Both represent significant, innovative contributions to the theory and practice of dependable computing, and their omission diminishes the overall quality and completeness of these volumes.

The editors would also like to gratefully acknowledge the invaluable contributions of the following individuals to the success of the Office of Naval Research initiative in Ultradependable Multicomputers and Electronic Systems and this book series: Joe Chiara, George Gilley, Walt Heimerdinger, Robert Holland, Michelle Hugue, Miroslaw Malek, Tim Monaghan, Richard Scalzo, Jim Smith, André van Tilborg, and Chuck Weinstock.
SECTION 1
DEPENDABLE COMPONENTS
SECTION 1.1
Self-Checking and Self-Exercising Design for Hierarchic Long-Life Fault-Tolerant Systems

David Rennels and Hyeongil Kim

Abstract

This research deals with fault-tolerant computers capable of operating for extended periods without external maintenance. Conventional fault-tolerance techniques such as majority voting are unsuitable for these applications, because performance is too low, power consumption is too high, and an excessive number of spares must be included to keep all of the replicated systems working over an extended life. The preferred design approach is to operate as many different computations as possible on single computers, thus maximizing the amount of processing available from limited hardware resources. Fault tolerance is implemented in a hierarchic fashion. Fault recovery is either done locally within an afflicted computer or, if that is unsuccessful, by the other working computers when one fails. Concurrent error detection is required in the computers making up these systems, since errors must be quickly detected and isolated to allow recovery to begin. This chapter discusses ways of implementing concurrent error detection (i.e., self-checking) and, in addition, providing self-exercising capabilities that can rapidly expose dormant faults and latent errors. The fundamentals of self-checking design are presented along with an example -- the design of a self-checking self-exercising memory system. A new methodology for implementing self-checking in asynchronous subsystems is discussed, along with error simulation results to examine its effectiveness.
1.1.1 Introduction

There is a class of multicomputer applications that require long unmaintained operation in the range of a decade or more. The obvious example is remote sensing, e.g., space satellites. Important new applications of this type are expected for computers that are embedded in long-life host systems where maintenance is expensive or inconvenient, e.g., transportation systems and patient monitoring.

1. Computer Science Department, University of California at Los Angeles. This work was supported by the Office of Naval Research, grant N00014-91-J-1009.

They are becoming practical because the rapidly decreasing cost of hardware makes it cost-effective to employ redundancy during the construction of a computer to avoid maintenance costs later and to avoid the inconvenience and danger of later breakdowns. Many of these applications also require minimization of power, weight, and volume. In spacecraft this is due to limited resources, and in ground-based systems this may be due to battery limitations or the wish to reduce heat to improve reliability of densely packaged systems.

Technology for fabricating these systems is moving toward very densely packaged systems using stacked multichip modules and stacked chips. These packaging techniques provide greatly improved performance while minimizing power, weight, and volume, and we expect that they will also become a preferred way of making systems when their high volume leads to low cost. One side effect of extremely densely packaged computers is that they are very expensive to take apart and repair. Conversely, the cost of adding redundant chips is relatively small. Thus the trend will be to add in the needed redundancy at the time of fabrication to guarantee long reliable life. This technology leads to building highly modular systems in which many small computer modules work together to solve large computing problems. Here the designers aim at as high a performance as possible, and this means that as many computers as possible should be doing different parts of a problem.

To optimize the design of highly reliable long-life systems it is necessary to use hardware in a highly efficient fashion. If one uses the classical fault-tolerance techniques of triplication and voting, performance is reduced by an unacceptable factor of three. Furthermore, for long-life unmaintained applications, a sufficient amount of spare hardware must be included to guarantee that a triplicated system (i.e., three systems) will be operational at the end of mission. This is unacceptable in terms of power, weight, volume, and performance, so we must look for a more hardware-efficient way of providing fault tolerance and long unmaintained life [1].
1.1.2 Hierarchic Fault-Tolerant Designs

One approach to doing this is to design hierarchic fault-tolerant systems in which the amount of redundancy is varied to match the criticality of the various computations in a computer system. Critical computations such as the operating system and a subset of critical applications programs are replicated (i.e., run on two or more computers) to allow computations to continue uninterrupted in the presence of faults. Less critical applications are run in single computers with rollback or checkpointing for recovery, in order to maximize performance.
Figure 1.1.1 shows the recommended approach in a graphic fashion. The area represents the complete set of programs that are needed for the fault-tolerant system and is divided into three sub-areas 1, 2, and 3, which we will call levels since they are associated with different levels of protection.

• Level 1 programs (or area 1) provide critical functions that require very strong fault-tolerance protection. These include operating system and fault-recovery functions as well as applications programs that cannot tolerate delays or data loss when a fault occurs. Executive and fault-recovery programs of level one are used to manage recovery from faults in the lower-level programs of areas 2 and 3. These programs must be run redundantly on replicated hardware to allow continued operation when a module fails.
FIGURE 1.1.1: Applications in an Hierarchic Fault-Tolerant System. [Figure labels: Level 1, massive redundancy (voted or duplex self-checking) with recovery and no program interruption or delay; Level 2, standby redundancy with rollback/roll-forward and recovery by program rollback (tens of milliseconds delay); Level 3, standby redundancy with recovery by restart/roll-forward and re-computed data (seconds of delay plus data loss).]
• Level 2 consists of those programs that can accept delays in recovery on the order of milliseconds. Here programs can be run singly, but program rollback techniques are required to quickly restart computations from a recent point where the state has been saved. To do this, concurrent error detection is required in all processors. That is, error detection circuits must be placed in all of the computer modules so that errors can be detected as soon as they occur. This prevents damaged data from propagating to external modules or across rollback points before an error is detected, which would make recovery very difficult. Program rollback imposes extra complexity in program writing and requires duplicate storage of rollback recovery variables.
• Level 3 consists of less critical programs that can accept considerable fault-recovery times (e.g., seconds), and which can be restarted after a fault is detected and corrected. These programs can be run on single machines, and fault recovery does not add particularly difficult problems in programming. Concurrent error detection hardware is highly recommended so that hardware errors can be reliably detected and isolated.
Of course, the diagram of Figure 1.1.1 is a simplification. The number of programs and their required fault-tolerance, speed, and memory resources varies with different applications. But many of the computers in long-life dedicated systems have only a minority of programs that require the highest levels of redundancy for fault protection, and a majority of programs can be run in simplex using the more heavily protected executive functions to manage their recovery. This can lead to a fault-tolerant design that uses hardware most efficiently -- getting a maximum of computation out of the hardware available and thus maintaining performance over long time periods with a limited amount of spare resources. However, an underlying requirement in these applications is concurrent error detection. The computer modules and interconnection structures should be designed so that they can detect internal logic errors as soon as they manifest themselves -- before they multiply and propagate. Thus the modules can be safely depended upon to execute programs singly, yet quickly detect errors and notify more heavily protected higher-level functions to institute recovery actions should a fault occur.
1.1.3 Design of Modules with Concurrent Error Detection

The basic techniques and theory for implementing self-checking concurrent error detection were developed a number of years ago by Carter et al. at IBM Research [2]. They defined a technique for designing self-checking checkers that could not only detect errors in the circuit being checked, but could also detect faults in the checkers themselves. This solved the fundamental problem of who checks the checker. The approach was to use dual-rail logic where check signals are represented as two wires. The values 0,1 and 1,0 on the pairs indicate correct operation, while the values 1,1 or 0,0 indicate either an error in the circuit being checked or an error by the checker. When no errors occur, the check signals representing a "good" circuit alternate between 0,1 and 1,0 in a way that exercises and checks the check circuits. This is best explained by the examples shown in Figure 1.1.2. Most concurrent error checks consist of checking separable codes on data (e.g., parity and arithmetic codes) or comparing the outputs of duplicated circuits. For self-checking designs the checkers are modified to present a set of output pairs that take on the 0,1 and 1,0 values for correct data and the 0,0 or 1,1 values for errors, as shown in Figures 1.1.2(a) and 1.1.2(b). Carter et al. demonstrated what they called a "Morphic-AND" circuit that reduces two 2-wire pairs of complementary signals to a single pair of complementary signals that also takes on values of 0,1 and 1,0 (see Figure 1.1.2(c)). If an input error occurs (an error in one of the circuits being checked), 0,0 or 1,1 occurs in one of the input pairs,
and the outputs will also take on the error value of 0,0 or 1,1. This circuit has the self-checking property that for every stuck-at signal in the checker there is at least one correct set of (complementary 0,1 or 1,0) input pairs that will cause a 1,1 or 0,0 output. That means that not only does the checker detect errors in the circuit being checked; there is also good data that will flush out an error in the checker by causing an error signal of 1,1 or 0,0 to occur in the outputs. A tree of Morphic-AND circuits (see Figure 1.1.2(d)) will reduce a set of error checks to a single complementary pair that takes on values 1,0 or 0,1 if no error occurs, or 1,1 or 0,0 if an error occurs in the circuits being checked or in the checker.
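To make the dual-rail convention concrete, the following small C sketch (ours, not from the chapter) models a Morphic-AND cell and a checker tree built from it; the type and function names are illustrative only.

    #include <stdio.h>

    /* One dual-rail ("morphic") signal: valid codewords are (0,1) and (1,0).
       (0,0) and (1,1) indicate an error in the checked circuit or checker. */
    typedef struct { int t, f; } pair_t;

    /* Carter-style two-rail checker cell: combines two dual-rail pairs.
       If both inputs are valid complementary pairs, the output is also
       complementary; if either input is (0,0) or (1,1), the output takes
       on that same error value. */
    pair_t morphic_and(pair_t x, pair_t y) {
        pair_t z;
        z.t = (x.t & y.t) | (x.f & y.f);
        z.f = (x.t & y.f) | (x.f & y.t);
        return z;
    }

    /* Reduce an array of check pairs to a single pair; a linear chain is
       logically equivalent to the balanced tree of Figure 1.1.2(d). */
    pair_t checker_tree(const pair_t *p, int n) {
        pair_t z = p[0];
        for (int i = 1; i < n; i++)
            z = morphic_and(z, p[i]);
        return z;
    }

    int main(void) {
        pair_t checks[3] = { {0,1}, {1,0}, {0,1} };   /* all valid */
        pair_t z = checker_tree(checks, 3);
        printf("no fault:   z=(%d,%d)\n", z.t, z.f);  /* complementary  */

        checks[1] = (pair_t){1,1};                    /* inject an error */
        z = checker_tree(checks, 3);
        printf("with fault: z=(%d,%d)\n", z.t, z.f);  /* (1,1) detected */
        return 0;
    }

Running the sketch shows the key property: a valid set of inputs yields a complementary output pair, while any injected 0,0 or 1,1 pair propagates to the tree's output rather than being absorbed.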
FIGURE 1.1.2: Self-Checking Checkers. [Panels: (a) a self-checking odd parity check; (b) duplication and comparison of a logic module; (c) a Morphic-AND reduction circuit; (d) a tree of Morphic-AND circuits.]

This design methodology resulted in a large number of papers that showed how to implement self-checking concurrent error detection in a variety of error detection circuits. The author developed a self-checking computer module that provided concurrent error detection using this type of self-checking logic.
1.1.3.1 The Fault-Tolerant Building Block Computer

As an example of a modular multicomputer architecture that uses self-checking logic design to provide computer modules with high-coverage concurrent error detection, we can turn to the JPL Fault-Tolerant Building-Block Computer. This system, built fifteen years ago at the Jet Propulsion Laboratory, used Self-Checking Computer Modules (SCCMs) [3]. It is still a good example of the type of computer architecture that can be used to support hierarchic applications in a cost-effective fashion. The SCCM, shown in Figure 1.1.3, was experimentally verified by inserting errors.
FIGURE 1.1.3: A Self-Checking Computer Module. [Figure labels: external busses (1553A); bus interface building blocks with bus adaptors (BA) and a bus controller (BC); redundant memory with Hamming correction (16 data bits, 6 Hamming bits, 2 spare bits); memory interface building block; internal bus; core building block with duplicated CPUs, bus check, processor check, bus arbiter, reset/rollback, interrupt, output inhibit, and internal fault indicators; I/O building blocks with DMA request/grant and priority daisy chains.]
The SCCM is a small 16-bit computer which is capable of detecting its own malfunctions. It contains I/O and bus interface logic which allows it to be connected to other SCCMs to form fault-tolerant multicomputers. The SCCM contains commercially available microprocessors, memories, and four types of building-block circuits, as shown in Figure 1.1.3. The building blocks are: 1) an error detecting (and correcting) memory interface building block (MI-BB), 2) a programmable bus interface building block (BI-BB), 3) a core building block (Core-BB), and 4) an I/O building block (IO-BB). A typical SCCM consists of 2 microprocessors, 23 RAMs, 1 MI-BB, 3 BI-BBs, 2 IO-BBs, and a single Core-BB. Although these circuits were designed before large ASICs became commercially available, they could now be implemented as single-chip ASICs.

The building-block circuits control and interface the various processor, intercommunication, memory, and I/O functions to the SCCM's internal bus. Each building block is responsible for detecting faults in its associated circuitry and then signaling the fault condition to the Core-BB by means of an internal fault indicator. The MI-BB implements fault detection and correction in the memory, as well as providing detection of faults in its own internal circuitry. Similarly, the BI-BB and IO-BB provide intercommunications and I/O functions, along with detecting faults within themselves and their associated communications circuitry. The Core-BB checks the processing function by running two CPUs in synchronism and comparing their outputs. It is also responsible for fault collection and fault handling within the SCCM.

The approach to concurrent error detection uses the Carter self-checking design methodology. For data paths and memory, error codes are checked as shown in Figure 1.1.2(a) to obtain 2-wire complementary error signals. Irregular circuits are duplicated and compared, with the outputs inverted on one module to obtain complementary pairs as shown in Figure 1.1.2(b). Finally, the error signals are combined as shown in Figure 1.1.2(c) to obtain a self-checked master fault indicator.

Local fault tolerance is handled by the Core-BB. It receives two-wire 1-out-of-2 coded fault indicators from the other building-block circuits, and it also checks internal bus information for proper coding. Upon detecting an error, the Core-BB disables the external bus interface and I/O functions, isolating the SCCM from its surrounding environment. The Core-BB then attempts a rollback or restart of the processor. Repeated errors result in the disabling of the faulty SCCM by its Core-BB. Most transient errors can be corrected locally within the SCCM, and RAM chip faults can be handled using the SEC/DED code and spare bit-plane replacement. Faults that cannot be corrected within an SCCM must be handled by redundant SCCM modules in the system.
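The Core-BB's local fault-handling policy described above can be summarized in a short sketch. This is our illustration, not the actual building-block logic; the indicator encoding, the retry threshold, and all names are assumptions.

    #include <stdio.h>

    enum { MAX_RETRIES = 3 };   /* hypothetical threshold */

    /* A 1-out-of-2 coded fault indicator: valid only when t != f. */
    typedef struct { int t, f; } indicator_t;

    static int indicator_bad(indicator_t in) { return in.t == in.f; }

    /* Returns 0 if the module recovered, -1 if it was taken off line. */
    int core_bb_handle(const indicator_t inds[], int n, int *retries) {
        for (int i = 0; i < n; i++) {
            if (!indicator_bad(inds[i])) continue;
            /* fault detected: isolate the SCCM from the external bus/IO */
            printf("fault on indicator %d: isolating SCCM\n", i);
            if (++*retries > MAX_RETRIES) {
                printf("repeated errors: disabling faulty SCCM\n");
                return -1;   /* recovery falls to redundant SCCMs */
            }
            printf("attempting processor rollback/restart\n");
            return 0;
        }
        return 0;   /* no fault indicated */
    }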
Modular computer architectures containing self-checking processor modules with duplex-and-compare CPUs, coding on buses and memories, and redundant memory chips have been further developed by several manufacturers (Harris, Honeywell, IBM, and TRW) for DoD systems.
1.1.4 Self-Checking, Self-Exercising Logic

The methodology of designing self-checking (or at least mostly self-checking) logic circuits with high-coverage error detection is well established. Given that concurrent error detection is used in a design, remaining difficulties in achieving fault tolerance include the problems of fault dormancy and error latency. Fault dormancy occurs in circuits that have failed, but whose failed state happens to correspond to the currently correct value of data. Therefore the fault will not be discovered until the data value is changed in the normal course of processing. Examples include a stuck-at-one memory cell that is currently supposed to hold a one, or a pattern-sensitive logic error that will not show up until surrounding circuits take on a particular pattern. Error latency occurs when there is a data error, but a delay occurs before it is detected by a checking circuit. Bit-flips in seldom-read memory locations fall into this category. There is a danger that when an error is detected, another dormant fault or latent error may already exist in a fault-tolerant system, which will upset the recovery algorithm and cause the system to fail. Therefore it is a good practice to design a system in such a way that dormant faults and latent errors will be flushed out quickly. We call this approach self-checking and self-exercising (SCSE) design. Our research shows that SCSE design is relatively inexpensive, given that concurrent error detection logic is already in place. The additional cost is one of introducing periodic test patterns to expose these errors and faults. Design examples are discussed below.

1.1.4.1 A Self-Checking Self-Exercising Memory System

In the last several years a SCSE memory system was designed, laid out for VLSI, and simulated at UCLA [4]. As shown in Figure 1.1.4, it consists of a set of N x 1-bit RAM chips and a Memory Interface Building Block (MIBB) chip. The memory system employs an odd-weight SEC/DED code for error detection and correction, and each bit position is stored in a separate chip so that any single chip failure will at most damage one bit in any word. Two spare bit planes (i.e., sets of RAM chips that can replace those in any failed bit position) are included for long-life applications. Each RAM is organized as N one-bit words. The cell array is square, meaning that the upper half of the address bits selects a row address (RAS) of sqrt(N) cells, and the lower half of the address bits selects a bit in the row, i.e., a column address (CAS).
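For readers unfamiliar with SEC/DED behavior, here is a minimal C sketch of an extended Hamming code on a single data byte. It is only an analogy: the chapter's memory uses a 32-data-bit odd-weight-column SEC/DED code, and the bit positions and names below are our own.

    #include <stdint.h>
    #include <stdio.h>

    /* (13,8) extended Hamming: bit 0 = overall parity (for DED),
       bits 1,2,4,8 = Hamming check bits, data at 3,5,6,7,9,10,11,12. */
    static const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};

    static uint16_t encode(uint8_t d) {
        uint16_t w = 0;
        for (int i = 0; i < 8; i++)
            if ((d >> i) & 1) w |= (uint16_t)1 << data_pos[i];
        for (int p = 1; p <= 8; p <<= 1) {     /* set each check bit so  */
            int par = 0;                       /* its group parity is 0  */
            for (int b = 3; b <= 12; b++)
                if ((b & p) && ((w >> b) & 1)) par ^= 1;
            if (par) w |= (uint16_t)1 << p;
        }
        int ov = 0;                            /* overall parity bit     */
        for (int b = 1; b <= 12; b++) ov ^= (w >> b) & 1;
        if (ov) w |= 1;
        return w;
    }

    /* Returns 0 = clean, 1 = single error corrected, 2 = double error. */
    static int decode(uint16_t *w) {
        int syn = 0, ov = 0;
        for (int p = 1; p <= 8; p <<= 1) {
            int par = 0;
            for (int b = 1; b <= 12; b++)
                if ((b & p) && ((*w >> b) & 1)) par ^= 1;
            if (par) syn |= p;
        }
        for (int b = 0; b <= 12; b++) ov ^= (*w >> b) & 1;
        if (syn == 0 && ov == 0) return 0;
        if (ov) { *w ^= (uint16_t)1 << syn; return 1; } /* fix bit 'syn' */
        return 2;   /* even overall parity + nonzero syndrome: 2 errors */
    }

    int main(void) {
        uint16_t w = encode(0xA5);
        w ^= 1 << 6;                  /* one flip: corrected  */
        printf("one flip:  %d\n", decode(&w));
        w ^= (1 << 6) | (1 << 9);     /* two flips: detected  */
        printf("two flips: %d\n", decode(&w));
        return 0;
    }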
FIGURE 1.1.4: A Self-Checking Self-Exercising Memory System. [Figure labels: RAM chip array with odd and even column parity; control bus to the RAM chips (R/W, RAS, CAS, CHECK); address and data buses; 32 information bits, 7 parity bits, 2 spare bits; Memory Interface Building Block (MIBB) providing SEC/DED, sweeping, spare RAM replacement, and control; data and address bus from CPU (32D+2P); control bus from CPU (R/W, MST, CPL); address parity error, uncorrectable error, and correctable error interrupt signals.]

The novel feature of this design is that both the RAM chips and the MIBB contain additional circuitry to provide concurrent self-testing and self-checking. Two parity bits are added to each row of storage cells inside the memory chips; one parity bit is used to check all odd-numbered bits in the row, and the other parity bit checks all even bits in the row. The extra parity bits add two columns to the array as shown in the figure. This on-chip RAM checking allows background scrubbing to be carried out by the MIBB by interleaving check cycles with normal program execution, as is summarized below.
All the RAM chips can be commanded by the MIBB to perform a CHECK CYCLE (typically taking less than a microsecond) during which a row of each RAM's memory cell array (specified by the most significant half of the address, RAS) is read out into a data register, the parity is checked, and the data bits are stored back into the row either unchanged or with one of three permutations to be described later. If a chip detects an internal error, it is signalled via its data line. The MIBB interleaves these check cycles with normal reads and writes (approximately every 100th cycle), and they check each row in sequence.

It is not possible using row parity to determine which cell in the RAM's row (and therefore which word) has an error. The data that is read out of a row of cells inside each RAM corresponds to sqrt(N) words, and an error exists in one of the words (bits). Therefore the MIBB-directed recovery sequence consists of accessing every word containing bits in the erroneous RAM row and reading it out one at a time. This is done by holding the upper address bits (RAS) constant and sequencing through all values of the lower address bits (CAS). During these accesses, the SEC/DED code across all RAMs is used to find the error and write back corrected information. Both the SWEEP checking and the correction process (when needed) are interleaved with normal program execution. A typical interleaving sequence is shown in Table 1.1.1. Note that the correction cycles are only invoked if a RAM chip has signalled a row error. Using this technique, transient errors in the RAMs can be detected and corrected within a few milliseconds, without significantly affecting ongoing processing.

    Sweep with no errors          Sweep with an error in row 3
    SWEEP   ROW    CYCLE          SWEEP   ROW              CYCLE
      1       1      100            1       1                100
      2       2      200            2       2                200
      3       3      300            3       3 (ERROR)        300
      4       4      400            Correct Cycle R3,W1      400
      5       5      500            Correct Cycle R3,W2      500
     ...     ...     ...           ...
     128     128   12800            Correct Cycle R3,W128  13200
      1       1    12900            4       4              13300
      2       2    13000            5       5              13400

Table 1.1.1: Interleaving Sweep and Correction Cycles
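The following C sketch (ours) illustrates what one check cycle verifies inside a RAM chip: the two column-parity bits of a row are recomputed and compared against the stored values. The row width and all names are hypothetical.

    #include <stdint.h>
    #include <stdio.h>

    #define ROW_BITS 64   /* illustrative row width */

    /* A stored row: data cells plus the two on-chip column-parity cells. */
    typedef struct {
        uint8_t cell[ROW_BITS];
        uint8_t parity_even;   /* parity over even-numbered columns */
        uint8_t parity_odd;    /* parity over odd-numbered columns  */
    } row_t;

    static void column_parity(const row_t *r, uint8_t *even, uint8_t *odd) {
        *even = *odd = 0;
        for (int c = 0; c < ROW_BITS; c++) {
            if (c & 1) *odd  ^= r->cell[c];
            else       *even ^= r->cell[c];
        }
    }

    /* One CHECK CYCLE on one row: verify both column parities.  On a
       mismatch the chip signals the MIBB via its data line; the MIBB then
       reads every word sharing this row address (fixed RAS, all CAS
       values) and uses the word-wide SEC/DED code to locate and correct
       the failing bit.  Returns 1 if the row checks clean. */
    int check_cycle(const row_t *r) {
        uint8_t even, odd;
        column_parity(r, &even, &odd);
        return (even == r->parity_even) && (odd == r->parity_odd);
    }

    int main(void) {
        row_t r = {0};
        column_parity(&r, &r.parity_even, &r.parity_odd); /* regenerate */
        printf("clean row:   %s\n", check_cycle(&r) ? "ok" : "ERROR");
        r.cell[5] ^= 1;                                   /* inject flip */
        printf("flipped bit: %s\n", check_cycle(&r) ? "ok" : "ERROR");
        return 0;
    }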
When the memory is initially loaded, sweeping is turned off and data is stored in non-permuted form. A series of special Regenerate Parity (RP) check cycles are executed to initialize the parity bits in all rows in the memory chips. Then sweeping is turned on and the MIBB executes sweeps consisting of a series of interleaved check cycles as previously described. During any single sweep, the RAMs can be commanded to permute the data (i.e., flip selected bits before writing a row back) in the specified row in one of four ways. One of the four row permutations is chosen for writing back even rows, and one (possibly different) is chosen for writing back odd rows.

Permutations: (1) No Change, (2) Invert Odd Bits, (3) Invert Even Bits, (4) Invert All Bits

We will refer to a set of two row permutations (one for odd rows and one for even rows) as a sweep permutation, or simply a permutation. The permutation capability is used to test for permanent faults, as discussed in the two cases below (a short code sketch of the write-back permutations follows the list).
• Stuck-at-One/Stuck-at-Zero Cells - If a cell is stuck at the same value as the data stored in it, there will not be an error until an attempt is made to store the other value (one or zero) that it is incapable of storing. To detect this type of fault, sweep cycles are carried out that invert all bits in each row as they are written back. Thus the memory contents alternate between true and complement form, exposing the stuck cells in at most two sweeps.
• Coupling (e.g., shorts) Between Cells - This is the situation where a cell is sensitive to the value in a neighboring cell. With a series of four permutations it is possible to exercise each cell in the following way: (i) its neighbors (above, below, right, and left) take on opposite logical values; (ii) then the cell takes on the opposite logic value; (iii) then the neighbor cells return to their original logic values; and (iv) finally the cell returns to its original state.
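Modeling a row as an array of single-bit cells, the four write-back permutations can be sketched as follows (our illustration; the names are hypothetical):

    /* The four write-back permutations a RAM row can be given during a
       sweep cycle; 'row' is the row image and 'n' its width. */
    enum perm { NO_CHANGE, INVERT_ODD, INVERT_EVEN, INVERT_ALL };

    void apply_permutation(unsigned char *row, int n, enum perm p) {
        for (int c = 0; c < n; c++) {
            int odd = c & 1;
            if (p == INVERT_ALL ||
                (p == INVERT_ODD  &&  odd) ||
                (p == INVERT_EVEN && !odd))
                row[c] ^= 1;   /* flipping the cell exposes stuck-at and
                                  coupling faults on later check cycles */
        }
    }

Alternating INVERT_ALL sweeps implements the stuck-cell test above, while cycling through the other permutations drives each cell and its odd/even neighbors through the four-step coupling test.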
At any given time there are two permutations in memory, separated by the next row scheduled to have a sweep cycle. The MIBB must keep track of the Next Row to be Swept (NRS); the last permutation will be found in the NRS and following rows, and the new permutation in rows already swept (RAS < NRS). On a normal read or write, the MIBB must determine whether the word being accessed is stored in true or complement form, since the effect of permutations is to complement the bits at some addresses but not at others. To do this it must determine what permutation has been applied to the row being accessed, and whether the bit being accessed is in an odd- or even-numbered column. Using the permutation information, the MIBB then determines whether the word (i.e., the corresponding cell position) is inverted or not, and if necessary, re-inverts the data as it arrives at the MIBB.
The process is quite simple. A simple 4-bit state machine determines the current permutation that is being applied, and the last permutation is saved. The four bits indicate bit inversions in: i) odd rows, odd columns; ii) odd rows, even columns; iii) even rows, odd columns; and iv) even rows, even columns. When a normal read or write cycle is initiated, the row address (RAS) portion of the address is compared with NRS to determine which permutation to use. Then the least significant bits of the RAS and CAS portions of the address (which specify whether the row and column addresses in the memory array are odd or even) are used to index the four bits that specify that permutation. The correct bit in the state machine is then accessed to determine whether the word is or is not inverted. (A code sketch of this bookkeeping appears below.)

In order to prevent speed degradation, row-parity checking and generation are only done in the RAMs during sweep cycles. They are not done in the RAMs during normal reads and writes. During reads, the SEC/DED code will detect and correct errors. During writes, the parity bit being stored is exclusive-OR'd with the appropriate (even- or odd-position-covering) parity bit to update it.

There are several advantages of this self-checking self-exercising memory system architecture. First, it provides the standard self-checking design to detect errors and a Hamming code to correct single errors. Second, transient errors can be detected and corrected within a few milliseconds, allowing recovery in very high transient and burst-error environments. This has been modeled and its error recovery properties discussed by Meyer [5]. Finally, it reduces the latency of permanent faults by rapidly exposing them. This can be done about a thousand times faster than with software, and at relatively low cost.
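Here is a sketch of the 4-bit inversion bookkeeping described above, assuming each permutation is held as a four-entry table indexed by the row/column parity of the address (our own encoding, not the actual hardware):

    #include <stdint.h>

    /* One sweep-permutation as four bits, one per (row parity, column
       parity) class: a set bit means cells in that class are stored
       inverted.  Index = 2*(row is odd) + (column is odd). */
    typedef struct {
        uint8_t inverted[4];
    } perm_state_t;

    typedef struct {
        perm_state_t current;  /* applies to rows already swept (RAS < NRS) */
        perm_state_t last;     /* still in effect for rows >= NRS           */
        unsigned nrs;          /* Next Row to be Swept                      */
    } mibb_sweep_t;

    /* On a normal read or write: is the bit at (ras, cas) stored inverted?
       If so, the MIBB re-inverts the data as it passes through. */
    int bit_is_inverted(const mibb_sweep_t *m, unsigned ras, unsigned cas) {
        const perm_state_t *p = (ras < m->nrs) ? &m->current : &m->last;
        unsigned idx = ((ras & 1u) << 1) | (cas & 1u);
        return p->inverted[idx];
    }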
1.1.5 Concurrent Error Detection Using Self-Timed Logic

It is difficult to avoid noticing the similarity between the complementary-pair coding used to detect completion in Differential Cascode Voltage Switch (DCVS) self-timed logic and the complementary signalling used in the classic paper describing the self-checking logic of Carter et al. [2]. It has been recognized by many that the redundancy inherent in self-timed logic can be used for testing and error detection. Fault modeling and simulation of DCVS circuits has been discussed, and the capability of DCVS circuits to provide on-line testability with their complementary outputs has been explored [6,7]. The testability of DCVS EX-OR gates and DCVS parity trees has been analyzed, and methods for testing DCVS one-count generators and adders have been presented [8,9]. Easily testable DCVS multipliers have been presented in which all detectable stuck-at, stuck-on, and stuck-open faults are detected with a small set of test vectors, and the impact of multiple faults on DCVS circuits has been explored [10,11]. And a
technique for designing self-checking circuits using DCVS logic was recently presented [12].

There has also been a great deal of interest lately in asynchronous logic design in the VLSI community [13,14]. Self-timed logic is typically more complex than synchronous logic with similar functionality, but it offers potential advantages. These include higher speed, the avoidance of clock skew in large chips, better layout topology, and the ability to function correctly with slow components. For several years we have been studying the applicability of a modified form of DCVSL logic for use in designing processors and associated logic with concurrent error detection. Although this logic is not self-checking in the formal sense, for all practical purposes it provides the same capability of detecting faults in the checkers as well as faults in the circuits being checked.

One of the primary motivations for this is the need for lower power in spacecraft applications. The current way of providing high-coverage concurrent error detection in processors is to duplicate and compare them. This presents a considerable power overhead. Synchronous processors consume relatively high power (at least several watts) in clock distribution and output drivers, and that power is doubled in a duplex configuration. And as chips get larger and contain more circuitry the clocking problem becomes more severe. A single self-checking asynchronous chip is expected to require considerably less power than two duplicated synchronous chips.

Another reason we began this study was to explore the feasibility of using asynchronous logic in the interface/controller chip of the self-checking self-exercising memory system previously described at FTCS-21 [4]. The synchronous nature of that design caused what we felt were unnecessary delays in synchronizing between independent clocks, so we decided to examine an asynchronous design as an alternative. A key requirement of the design was concurrent error detection. Thus our approach has been from an architectural viewpoint.

1.1.5.1 The Starting Point - Synchronous Circuits with Concurrent Error Detection

A common way to make synchronous hardware systems with concurrent error detection is to duplicate the circuit modules, run them with the same clock, and compare their outputs. Duplication and comparison has become widely accepted and is supported in chip sets by commercial manufacturers (e.g., Intel) and DoD-supported projects such as the GVSC and RH-32 processors. One way to do this in a self-checking fashion is to invert the outputs of one module to obtain morphic (1,0 or 0,1) pairs
and compare them with the corresponding outputs of the other module using a tree of self-checking comparators of the type introduced by Carter et al. many years ago [15,16]. These checkers are self-checking with respect to stuck-at faults and will, in most cases, detect transient errors that cause the coding to be incorrect at clock transitions. They may fail under Byzantine conditions where noise or marginal circuit conditions cause the checker to see correct coding while the circuit receiving the output data sees an incorrect value. When one examines conventional duplicated synchronous systems, the cost of concurrent error detection is a doubling of active circuits plus the addition of comparators. The overhead of error detection in asynchronous designs should be similar to the synchronous case, and it is already an integral part of the design. This is discussed below.

1.1.5.2 The Analogous Asynchronous Design Style -- Differential Cascode Voltage Switch Logic (DCVSL)

An asynchronous design requires redundant encoding that can provide completion information as part of the logic signals. A logic module is held in an initial condition until a completion signal arrives from a previous module indicating that its inputs are ready. Then it is started, and when its outputs indicate completion, other modules may be started in turn. In general this requires a form of encoding that allows the receiver to verify that the data is ready. One way to do this in self-timed designs is to use a form of 1-out-of-2 coding. Output signals from various logic modules are sent as a set of two-wire pairs, taking on the values 0,0 before starting and 0,1 or 1,0 after completion. Such logic can be implemented using differential cascode voltage switch logic (DCVSL). A typical DCVSL gate is shown in Figure 1.1.5(a). DCVSL is a differential precharged form of logic. When the Req (request) signal is low, the PMOS pull-up transistors precharge points a and c to Vdd. At this time the circuit is in an initial state, and its outputs are 0,0. The circuit block B contains two complementary functions. One pulls down, and the other is an open circuit. When the input signals are ready, Req is raised, the pull-ups are turned off, the NMOS pull-down transistor connects points b and d to ground, and the circuit computes an output value. The side that forms a closed circuit forms a zero, and the side that remains an open circuit remains precharged. The outputs, driven by inverters, go from 0,0 to either 1,0 or 0,1, and the completion signal CPL is generated from either a logical OR or the exclusive OR of the outputs. As in the case of duplex self-checking synchronous circuits, the functions are duplicated in true and complement form, but here the state 0,0 on a
signal pair is a valid setup signal, and the arrival of complementary values signals completion.
FIGURE 1.1.5: Differential Cascode Voltage Switch Logic. [Panels: (a) a single circuit; (b) an iterative circuit.]

It is intuitively obvious that DCVSL can provide a degree of error detection. Consider a single DCVSL circuit (Figure 1.1.5(a)). The circuit block B can be designed so that a fault or error will only affect one of the two sides (true or complement) and thus will only affect one output [6]. The fault will cause a detectable output pair of 0,0 (detected by timeout due to no completion) or 1,1. Given that faults in individual DCVSL circuits produce detectable (1,1 or 0,0) outputs, it is helpful to examine a combinational network made up of a number of these circuits.

1.1.5.3 A DCVSL Combinational Network -- An Adder Example

Figure 1.1.6 shows the DCVSL circuits used in a ripple-carry adder. The sum and carry circuits are shown in Figures 1.1.6(a) and 1.1.6(b). The complete adder is shown as a multi-module circuit in Figure 1.1.6(c), and a 4-bit adder of this type is used as the function block in the circuit simulations to be described later.
FIGURE 1.1.6: A DCVSL Combinational Function Block (Adder). [Panels: (a) DCVSL carry gate; (b) DCVSL sum gate; (c) a multi-bit adder. Inputs A, B, Cin, and Req; dual-rail outputs C, ~C and S, ~S.]
Fault Effects in Multi-Module DCVSL Circuits - When a good circuit receives incorrect signal input pairs from a faulty circuit (i.e., 0,0 or 1,1), it will produce either the correct output or an output of 0,0 or 1,1, because DCVSL circuits have paths corresponding to each minterm that pull down one side or the other. A (0,0) input pair will disable some minterms and can only prevent a side from being pulled down, producing an error output of 0,0. Similarly, a (1,1) on an input pair will activate additional minterms and can only produce an error output of 1,1. These errors will be either masked or propagated through multiple levels of DCVSL and be detectable at the output.

1.1.5.4 An Asynchronous System with Concurrent Error Detection

We now show how these DCVSL function blocks are combined into a larger sequential system. Figure 1.1.7 shows a 2-stage micropipeline. Each stage begins with a register made up of two latches for each complementary pair of input signals received from DCVSL circuits. The register data is sent to a DCVSL combinational circuit that we will call a computational block (a DCVSL 4-bit adder in our simulation studies). A checker is provided at the output of the computational block.
FIGURE 1.1.7: A 2-Stage Asynchronous Circuit (Micropipeline). [Figure labels: data in; registers R1 and R2; computational blocks B1 and B2; checkers C1 and C2; control with Rin, BUSY, CPL, Aout, and ERROR signals; data out.]
The checker circuit is shown in Figure 1.1.8(c). It is logically equivalent to a tree of the morphic AND gates used in synchronous self-checking checkers by Carter et al. The outputs (z,~z) are complementary if all of the input pairs are complementary, and if any input pair is 0,0 or 1,1 the output takes on that same value.
FIGURE 1.1.8: The Checker Circuits. (a) The Basic Checker Circuit; (b) A Simple RC Timer; (c) The Checker Tree. (The CPL output is augmented for delay-insensitive operation.)
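Behaviorally, the checker tree reduces all the output pairs of a computational block to one pair (z,~z). The Python sketch below is our simplification of that behavior; the precedence between simultaneous 0,0 and 1,1 violations is an arbitrary choice of the sketch, and the timed_out flag stands in for the RC timer of Figure 1.1.8(b).

```python
def morphic_reduce(pairs):
    """Reduce dual-rail pairs the way the checker tree does:
    a 0,0 pair prevents completion, a 1,1 pair propagates as an
    explicit error, otherwise the result is a complementary pair."""
    if any(p == (0, 0) for p in pairs):
        return (0, 0)
    if any(p == (1, 1) for p in pairs):
        return (1, 1)
    return (1, 0)   # some complementary pair; its exact value is immaterial here

def checker_status(pairs, timed_out=False):
    z = morphic_reduce(pairs)
    if z == (1, 1):
        return "ERROR"                            # explicit error signal
    if z == (0, 0):
        return "TIMEOUT" if timed_out else "waiting"
    return "COMPLETE"

print(checker_status([(1, 0), (0, 1)]))           # COMPLETE
print(checker_status([(1, 0), (0, 0)], True))     # TIMEOUT
print(checker_status([(1, 1), (0, 1)]))           # ERROR
```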
The checker provides a completion signal if all the complementary input pairs contain a one. If one of the input circuits fails to complete and generates a 0,0 signal pair, then (z,~z) = 0,0, completion is prevented, and the checker uses a simple time-out counter (see Figure 1.1.8(b)) to signal the error. If one of the computational circuits generates a 1,1 pair, then (z,~z) = 1,1 and the error signal is generated.

Note that the checker is partially self-checking because (since it is logically a tree of Carter's self-checking morphic AND gates) for any stuck-at signal in the tree, there is a set of "good" inputs that will cause that internal signal pair to take on the values 0,0 or 1,1. Any 0,0 pair generated in the checker tree (due to a checker fault) will result in a 0,0 output from the checker and a timeout error. Similarly, a 1,1 pair generated anywhere in the checker tree will result in a 1,1 output from the tree and an error signal. One of the reasons that it is not fully self-checking is that there are error conditions that can generate a premature completion signal when an internal variable is stuck at one. If a member of a complementary pair w,x inside the tree (say w) is stuck at one, and the input signals would normally generate w=1 and x=0, the tree can generate a premature completion signal without waiting for the circuits preceding w,x. This premature completion signal may cause a data word to be sent to a succeeding logic stage before some of the circuits have finished setting up. This leads to 0,0 pairs being loaded into the registers of the following circuit, causing it to halt and time out.

The following is a simplified view of the control sequence (see Figure 1.1.7). Rin is raised if input data is ready and the first stage is not busy, and this loads the input register R1. The arrival of data in R1 causes the computational block B1 to be started. If no error has occurred, the checker signals completion when data is available from B1. An interlock is then set if stage 2 is busy. When stage 2 finishes, it loads the data out of B1 into R2, and the DATA-IN detect from R2 causes B1 and C1 to be precharged and R1 to be cleared to all zeros. Then the first stage is released by removing its BUSY signal, allowing it to accept more data.

1.1.5.5 Control Circuits

A detailed view of the control circuits is shown in Figure 1.1.9. The circuits in Figure 1.1.9 can be viewed as multiple stages of logic where each stage consists of a latch followed by a block of DCVSL computational logic. The stages can operate concurrently and form a pipeline. We started with an interconnection and synchronization circuit by Meng [17]. The objective was to add whatever was necessary to provide concurrent error detection in the system. In the original design, the registers were implemented with edge-triggered flip-flops whose inputs used only the "true" signal
from each input pair. The Q and ~Q flip-flop outputs provided the complementary signals for the following DCVSL computational block. This had two problems. First, the one-out-of-two checking code was lost at the registers, so a flip-flop error was undetectable. Second, loss of a clock to one or more flip-flops was also undetectable.
FIGURE 1.1.9: The Handshaking Control Circuits

The modified circuit in Figure 1.1.9 uses essentially the same handshaking conventions as Meng. However, to improve the testability and fault tolerance of the control and synchronization circuit, we modified it in the following fashion. The register has two gated latches for each DCVSL output pair, and thus it accepts 2-wire complementary inputs instead of single-line inputs. All latches are reset (to illegal 0,0 pairs) after their contents have been used, to give a better chance of detecting errors in the registers if they are clocked when data is changing, or if some of the latch pairs fail to be reloaded. The dual gated latches are simpler than the single positive-edge-triggered flip-flops used in the original design. The main synchronization signals are:
• Rin - a data-ready signal generated by the checker, signalling completion of a computational block.
• Aout - a register-loaded signal; it indicates that every signal pair in a register has at least one "one". It is the logical AND of the OR of each latch pair of signals.

• Rout (or Request) - a signal releasing the pull-ups and causing computation in the DCVSL computational logic. This signal is interlocked by the C-gate (modeled in the sketch following this list) so that the logic cannot be started until the input register is loaded and the following register is free. Similarly, it cannot be released to reset the computational block until data has been loaded into the output register and the input register is reset.

• Ain - the same as Aout; it is the busy signal from the next latch.
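The C-gate in the Rout interlock is, in the usual asynchronous-design reading, a Muller C-element: its output follows the inputs only when they agree and holds its previous value otherwise. The following is our minimal stateful model, assuming that standard behavior.

```python
class CElement:
    """Muller C-gate: the output follows the inputs only when they agree."""
    def __init__(self, state=0):
        self.state = state

    def step(self, a, b):
        if a == b:              # both high or both low: output follows
            self.state = a
        return self.state       # inputs disagree: hold the previous value

c = CElement()
print(c.step(1, 0))  # 0 - held
print(c.step(1, 1))  # 1 - both inputs high
print(c.step(0, 1))  # 1 - held
print(c.step(0, 0))  # 0 - both inputs low
```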
FIGURE 1.1.10: Transition Graph for a Full Handshake

The transition graph of the signals in Figure 1.1.9 is shown in Figure 1.1.10, which shows a full handshake between function blocks as explained in Meng [17]. This is the synchronizing function performed by the control circuits. The positive and negative values of the control signals are shown by the superscripts + and -. Arrows show signal conditions that must be true before the following transition is allowed to proceed. A careful examination of the graph shows that this provides the appropriate interlocking, so that a module on the right has to complete before the module on the left is allowed to take the next computational step. The Rout- to Rout+ step then provides the reset to pull up the DCVSL functional block before the next computation is started.

Stuck-at Faults - The interlocking nature of the feedback control signals causes the circuit to "hang up" and stop if one of the signals Ain, Aout, Rin, Completion, etc., sticks
at a one or zero value (see Figure 1.1.10). A time-out counter is employed to detect the stopped condition.

In nearly all cases, stuck-at values in a register or computational block will cause a detectable value of 0,0 or 1,1 to appear at the checker. This occurs because the dual-rail DCVSL logic block circuits pass on an uncoded (0,0 or 1,1) output when an uncoded input (0,0 or 1,1) occurs. When input signals occur that would normally cause a stuck circuit to go to the other value, its complementary circuit takes on the same value, generating an uncoded signal that passes through the computational block to the checker.

The reset signal sets all register pairs to 0,0 to enable detection of faults caused by the inability to clock one or more sets of latches. It is redundant, so that if it sticks at zero, a second fault must occur before an error is generated. If it sticks at one, the register will be permanently reset to 0,0 pairs, Aout will never go high, and the circuit will stop.

As soon as the latch-complete signal goes high, the load signal to the register goes low so that the latched data are not disturbed by changing data from the preceding stage. If the load sticks at zero, the registers will be permanently reset and Aout will not be generated, halting the circuit. A load signal stuck at one will cause the register not to be held constant while the computational block is working. The C-gate preceding a register normally prevents the register from being reloaded while the outputs of the circuit that sent it inputs are being reset to 0,0. The stuck-at-one load will allow the register to change while the following computational block is using its data. The results, though difficult to predict, are likely to produce a detectable coding error in the following stage.
1.1.6 Simulation of the Self-Timed Micropipeline

There is no easy way to analyze the effect of errors in an asynchronous circuit. To help in analyzing the response of the circuit to transient errors, we simulated the 2-stage micropipeline circuit by injecting randomly generated faults at the points shown in Figure 1.1.11. The circuit was divided into two sections of control logic and data logic. The control logic is the same as that shown in Figure 1.1.9, and the data path logic at each stage is a 4-bit DCVSL adder. Transient error insertion points in the control logic are shown as exclusive-OR gates in the diagram. In addition, two types of errors were injected into the data path section: i) data latched in the register, and ii) data generated by the computational block. The simulation was done using Lsim, Mentor Graphics Corp.'s mixed-mode simulator. The simulation circuit is built in the netlist and in the M modeling language.
FIGURE 1.1.11: Experimental Fault Insertion Points

To apply randomly generated transient errors and data patterns to the circuit, an input deck generation program was written in C. The timing and duration of the transient errors are determined from the random numbers and are in the range of 10-90 nsec. We assume that there is only one transient error at a time. Lsim generates time-trace outputs. Inputs to the adder circuit are randomly varied, and we can easily determine the expected time sequence of the output variables. The effects of errors on the values or the timing of the outputs can then be analyzed.

As currently implemented, analysis of the output traces is partially done manually; therefore the number of fault insertions is relatively small. We are currently working on a more automated way of analyzing the data, and more extensive testing will be done in the future. However, the results are highly encouraging, with no undetected errors in the faults simulated so far. Since we are trying to determine the fault detection coverage of this circuit, a successful test occurs if: i) a fault produces a correct output, though it may delay the circuit by less than the timeout count, or ii) a fault produces bad output and it is detected. An unsuccessful test occurs if an undetected error is found.
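The input deck generator itself was written in C; the following Python analogue is purely illustrative (the insertion-point names, inter-fault gap range, and deck format are our assumptions), showing how non-overlapping 10-90 ns transients might be scheduled.

```python
import random

# Hypothetical fault-insertion points, in the spirit of Figure 1.1.11:
# control-signal taps plus register and computational-block data bits.
POINTS = ([f"ctl_{s}" for s in ("Rin", "Rout", "Aout", "load", "reset")] +
          [f"reg{stage}_bit{i}"   for stage in (1, 2) for i in range(8)] +
          [f"block{stage}_bit{i}" for stage in (1, 2) for i in range(8)])

def make_deck(n_faults, horizon_ns, seed=0):
    """One transient at a time: schedule faults back-to-back with gaps."""
    rng = random.Random(seed)
    deck, t = [], 0
    for _ in range(n_faults):
        start    = t + rng.randint(50, 200)        # gap before next transient
        duration = rng.randint(10, 90)             # 10-90 ns, as in the text
        deck.append({"point": rng.choice(POINTS),
                     "value": rng.choice((0, 1)),  # stuck value during transient
                     "start_ns": start,
                     "end_ns": start + duration})
        t = start + duration                       # no overlapping transients
        if t > horizon_ns:
            break
    return deck

for fault in make_deck(5, 5000):
    print(fault)
```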
The following are the results of the simulations so far. These results are divided into several categories:

i) no effect,
ii) tolerated-delayed - no errors; the circuit halted a short time until the fault was removed,
iii) tolerated - an error occurred in an internal variable but it was not used by the following stage (e.g., the data had already been taken),
iv) error detected - explicitly detected by a timeout or an error signal from the checker.

1.1.6.1 Error Simulation Results

Transient errors were simulated in the data section as described below.

Simulation of "0" transient errors in the data path - There are 32 data bits in the 2-stage simulated circuit. Each stage has 4 bits of latched data, 4 bits of complement latched data, 4 bits of output data (from the computational block), and 4 bits of complement output data. In the simulation, one of the 32 data bits is randomly selected and temporarily stuck at 0 at random times, with random duration of less than 90 time steps. Normally each calculation cycle is about 100 time steps, and 20 transient errors are injected during 5000 time steps. If the circuit is idle for more than 200 time steps, the monitoring circuit issues a timeout signal. There was no undetected error in the simulation, and the effects of the transient errors are classified as:

a. no effect: 515 cases
b. tolerated (delayed): 158 cases
c. tolerated: 102 cases
d. timeout error detected: 22 cases

Simulation of logic "1" transient errors in the data path - This simulation setup is almost the same as for the "0" transient errors, but this time the selected data bit is temporarily stuck at 1. There were no undetected errors in the simulation, and the transient errors are classified as:

a. no effect: 285
b. tolerated (ignored): 285
c. tolerated (delayed): 136
d. error detected: 254
Control Section Errors - This was a simulation of transient errors in selected control signals. The duration of each transient error was selected randomly between 10 ns and 30 ns. There were no undetected errors.

a. no effect: 58
b. tolerated-delayed: 592
c. timeout error detected: 60
d. error detected by checker: 29

1.1.6.2 Observations on Error Effects

After examining in detail the traces of logic signals in the error simulations, we have made the following observations about the effects of the fault categories modeled so far. There were many cases where the inserted transient held a signal at its correct value, so there was no error. Those cases are uninteresting, so we will look at the cases where a real error occurred.

Transient Errors Making Data Bits 0

1. Errors in Registers/Latches
• If a "zero" transient error occurs in data being latched to a register, the OR-AND circuit delays issuing a latch completion signal (Aout) until the transient error disappears. Either correct data is eventually loaded or a timeout occurs. That means this error is tolerated by the circuit.
• If the error occurs after the latch completion signal (Aout) is generated, the following computational block starts, but there is one 0,0 pair in its input data. Therefore the computational block generates a 0,0 output, and the checker also produces a 0,0, preventing a completion signal. This error is eventually detected by the timer circuit, which generates a time-out signal.
• If a latched data bit is affected after the following computational block generates a correct output and a completion signal, then the computational block produced a correct output even though one of its input data bits later changed to 0, and the error has no effect.
2. Computational Block Errors

• If one of the output bits of a computational block is pulled to zero before the completion signal is generated, then the generation of the completion signal is delayed until the transient error disappears, and then a correct output is generated. This error is tolerated and causes only a small time delay in the circuit.
• If a transient error occurs after a correct output has been generated by the computational block but before the data is latched in the register of the next stage, then the latch completion signal (Aout) of the next stage is not generated until the transient error disappears and a correct output is generated and latched into the register. If the transient error lasts long enough to activate the timeout circuit, then a timeout signal is generated. So we can say this error is tolerated or detected.
• If the error happens after the output data is latched in the register of the next stage and before the computational block is initialized (the evaluation signal is high), then the error has no effect on the circuit. This error is also tolerated by the circuit.
As explained above, it appears that all the transient errors that make a bit in the data path circuits go to 0 can be tolerated or detected by the timeout circuit.

Transient Errors Making Data Bits One

1. Errors in Register Latches

• If a data bit in the register is flipped to "one" during initialization, it cannot cause Aout to be asserted, since the other register pairs are zeros. Therefore the error will be overwritten when the register is loaded and there will be no effect.
• When a data bit in the register is hit by the error after initialization of the register, the affected data bit remains 1 and, depending on the incoming data, this error will have no effect (if the corresponding bit of the incoming data is 1) or it can cause a 1,1 input pair, leading to an incorrect output (1,1) from the computational block and a (1,1) from the checker, which signals an error.
2. Errors in the Computational Blocks

• If the computational block is hit by the transient error during the precharge state, the affected bit is restored to 0 by the precharge, and the error is tolerated.
• If a data bit of the computational block is affected by the error during evaluation, then the output of the computational block will take on (1,1) and be detected by the checker. If a data bit of the computational block is hit before the precharge state and after the output data is latched in the following stage, then the checker generates an error signal even though the error was not passed on to the next stage.
As shown above, all the logic "1" transient errors inserted so far have been tolerated - either causing no error (other than short delays) or being detected by a checker.
Transient Errors in the Control Signals - We have not yet done an exhaustive analysis of the effects of errors on all of the control signals, but we have looked at the ones on which errors were inserted. They are briefly (and informally) described below.

1. Transient Errors in the Rin Signal - These are tolerated due to the characteristics of the C-gate and the AND gate generating the load signal. If there is a transient error in the Rin signal when the request signal (Rout) is high, then the output of the C-gate cannot change. If it occurs when the Rout signal is low, this means that the next stage is awaiting data. The output of the C-gate goes high and the load signal is generated prematurely. Here, the register will wait until coded data arrives before generating an Aout signal and starting the next circuit. Thus the error is masked. If the error happens when the Aout signal is high, the transient error in the Rin signal is masked by the AND gate. If Rin goes to zero prematurely, the C-gate does not allow the latch signal to drop until the register is loaded (when Aout is asserted). If Rin is held at either 0 or 1 for a long period of time, the circuit simply stops computing until the transient goes away. We are finding many cases where the circuit simply stops when an error occurs. If the error goes away, the computations continue without error. If it lasts too long, a timeout error is generated.

2. Transient Errors on the Load Signal - During error-free operation, this signal should be generated when correct morphic input data is available at the input port of the register and the following computational block is reset. At this time, the register has been previously reset to all-zero pairs. We find that if it is raised prematurely (while the following computational block is reset and waiting for data), the circuit simply waits until correctly coded data arrives, because Aout will not be asserted. If an error causes load to be raised while the following computational block is evaluating, a detectable coding error will be created in the register. (If any incoming bit pair is non-zero and has a different value from the current register contents, a detectable 1,1 value will be latched.) If the load signal goes low because of a transient error before correct data is latched into the register, then the generation of the Aout signal is delayed until the load signal goes back to high and correct data is latched.

3. Data Register Reset - The data register of the circuit is reset after the use of the data in the register and before the new data is latched. If a faulty reset signal is applied after the register has been reset and before new data arrives, then there is no effect on the circuit. If the register has data arriving but the completion signal has not been set, and the reset signal goes high because of a transient error, then the circuit will wait for the reset transient to go away and for the data to arrive. If it takes too long, a timeout will occur. If a reset occurs after the computational block has started, the computational
block cannot generate a correct output. In that case a timeout error is detected. Thus we can say that a transient error on the reset signal is tolerated or detected by the circuit. If the reset signal fails to go to "one" due to a transient, the Aout signal fails to be reset, and this is also detected because the circuit times out.

4. Transients in Aout - The Aout signal of the current stage is the Ain signal of the previous stage. A logic "one" transient error on the Aout signal can cause an early start of a following computational block by prematurely raising its Rout signal when the computational block has already evaluated the previous data and precharged the functional circuit. In this case the computational block does not generate morphic outputs until correct input data comes from the register. Thus the circuit waits until correct data arrives or times out. If there is a "one" transient error on the Aout signal when the Rout signal of the previous stage is high, and the output of the functional block of the previous stage has not been generated yet, then the register of the previous stage is reset, forcing the computational block inputs to zero, and eventually a timeout error occurs.

5. Transients on Rout - The Rout signal starts evaluation of the computational block. If a "one" transient error affects the Rout signal when the Rout signal should be low, then the error does not have any effect on the circuit (the input register is presenting 0,0 pairs, so the circuit will remain precharged). If the Rout signal is high when a "zero" transient error hits the signal, then the effect of the error depends on the timing. If the Rout signal is just beginning to go to one, then some computational block outputs will be 0,0; no completion will be generated, and the circuit waits for the transient to go away. But if the Rout signal is hit by a transient error at the end of evaluation, then evaluation starts again and never finishes. In this case a timeout error is detected.

At least in the cases of the transient errors in the control signals that we have studied so far, they are tolerated - producing delays or error signals.
1.1.7 Conclusions

Having done experimental fault insertions on both the JPL-STAR and Fault-Tolerant Building Block Computers, the author certainly understands that inserting a few hundred errors does not adequately determine coverage for any system [18]. But the fact that we have found no undetected errors so far tends to indicate that the basic design approach is sound. We will not be surprised to find (as occurred in the STAR machine) a few signals that are not adequately covered, and to have to modify the design to improve their coverage.
This study indicates that self-timed design techniques can be adapted to fault-tolerant systems, and that they offer considerable potential in the implementation of modules that have concurrent error detection. Of course, self-timed logic is a matter of religion to many, and it is not clear to what degree it will ever displace conventional clocked CMOS designs. We make no projections here, but only note that asynchronous design is very interesting, and its fault-tolerance properties need to be explored from an architectural perspective. The cost of this approach is reasonable, and we are optimistic that this design style will become more important as fault-tolerant systems are made from larger chips with smaller feature sizes.

There are many interesting problems still unexplored. First, more extensive simulation experiments and analysis are needed to prove the effectiveness of these design techniques. Another interesting problem is to compare the power consumption in watts per MIP of duplicated synchronous processors vs. self-checking self-timed designs. Another intriguing question is the possibility of implementing error recovery in the form of microrollback in micropipelines. By latching old values at each stage, it may be possible to restart and correct computations when an error has occurred. But these interesting problems must be left for subsequent investigations.
1.1.8 References

1. Rennels, D. and J. Rohr, "Fault-Tolerant Parallel Processors for Avionics with Reduced Maintenance," Proc. 9th Digital Avionics Systems Conference, Virginia Beach, Virginia, October 1990.
2. W.C. Carter, A.B. Wadia, and D.C. Jessep Jr., "Computer Error Control by Testable Morphic Boolean Functions - A Way of Removing Hardcore," Proc. 1972 Int. Symp. Fault-Tolerant Computing, pages 154-159, Newton, Massachusetts, June 1972.
3. Rennels, D., "Architectures for Fault-Tolerant Spacecraft Computers," Proc. of the IEEE, 66-10:1255-1268, October 1978.
4. David A. Rennels and Hyeongil Kim, "VLSI Implementation of a Self-Checking Self-Exercising Memory System," Proc. 21st Int. Symp. Fault-Tolerant Computing, pages 170-177, Montreal, Canada, June 1991.
5. Meyer, J. and L. Wei, "Influence of Workload on Error Recovery in Random Access Memories," IEEE Trans. Computers, April 1988, pp. 500-507.
6. Z. Barzilai, V.S. Iyengar, B.K. Rosen, and C.M. Silberman, "Accurate Fault Modeling and Efficient Simulation of Differential CVS Circuits," International Test Conference, pages 722-729, Philadelphia, PA, Nov 1985.
7. R.K. Montoye, "Testing Scheme for Differential Cascode Voltage Switch Circuits," IBM Technical Disclosure Bulletin, 27(10B):6148-6152, Mar 1985.
8. Niraj K. Jha, "Fault Detection in CVS Parity Trees: Application to SSC CVS Parity and Two-Rail Checkers," Proc. 19th Int. Symp. Fault-Tolerant Computing, pages 407-414, Chicago, IL, June 1989.
9. Niraj K. Jha, "Testing of Differential Cascode Voltage Switch One-Count Generators," IEEE Journal of Solid-State Circuits, 25(1):246-253, Feb 1990.
10. Andres R. Takach and Niraj K. Jha, "Easily Testable DCVS Multiplier," IEEE International Symposium on Circuits and Systems, pages 2732-2735, New Orleans, LA, June 1990.
11. N. Kanopoulos and N. Vasanthavada, "Testing of Differential Cascode Voltage Switch (DCVS) Circuits," IEEE Journal of Solid-State Circuits, 25(3):806-813, June 1990.
12. N. Kanopoulos, Dimitris Pantzartzis, and Frederick R. Bartram, "Design of Self-Checking Circuits Using DCVS Logic: A Case Study," IEEE Transactions on Computers, 41(7):891-896, July 1992.
13. Alain J. Martin, Steven M. Burns, T.K. Lee, Drazen Borkovic, and Pieter J. Hazewindus, "The Design of an Asynchronous Microprocessor," Technical Report Caltech-CS-TR-89-2, CSD, Caltech, 1989.
14. Gordon M. Jacobs and Robert W. Brodersen, "A Fully Asynchronous Digital Signal Processor Using Self-timed Circuits," IEEE Journal of Solid-State Circuits, 25(6):1526-1537, Dec 1990.
15. W.C. Carter and P.R. Schneider, "Design of Dynamically Checked Computers," Proc. IFIP Congress 68, pages 878-883, Edinburgh, Scotland, Aug 1968.
16. Richard M. Sedmak and Harris L. Liebergot, "Fault Tolerance of a General Purpose Computer Implemented by Very Large Scale Integration," IEEE Transactions on Computers, 29(6):492-500, June 1980.
17. Teresa H. Meng, Synchronization Design for Digital Systems, Kluwer Academic Publishers, 1991.
18. A. Avizienis and D. Rennels, "Fault-Tolerance Experiments with the JPL-STAR Computer," Dig. of the 6th Annual IEEE Computer Society Int. Conf. (COMPCON), San Francisco, 1972, pp. 321-324.
SECTION 1.2
DESIGN OF SELF-CHECKING PROCESSORS USING EFFICIENT BERGER CHECK PREDICTION LOGIC

T. R. N. Rao, Gui-Liang Feng, and Mahadev S. Kolluru

Abstract

Processors with concurrent error detection (CED) capability are called self-checking processors. CED is a very important and necessary feature in VLSI microprocessors that are integral to ultradependable real-time applications. The design of a self-checking reduced instruction set computer (RISC) requires state-of-the-art techniques in computer architecture, implementation, and self-checking design. Among the components of a processor, the most difficult circuits to check are the arithmetic and logic units (ALUs). In this chapter, we shall concentrate on the design of a self-checking ALU. We introduce a new totally self-checking (TSC) ALU design scheme called Berger check prediction (BCP). Using the BCP, the self-checking processor design can be made very efficient. Also, we discuss the theory involving the use of a reduced Berger code for a more efficient BCP design. A novel design for a Berger code checker based on a generalized code partitioning scheme is discussed here, and is used to efficiently implement the Berger code checking.
Key words: Self-Checking, Berger Code, Berger Code Partitioning, Fault-Tolerance, Self-Checking ALU, BCP, Reduced Berger Check.
The authors are with the Center for Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA 70504. This work was supported by the Office of Naval Research under Grant N00014-91-J-1067.
1.2.1 Introduction

The complexity of an IC chip increases significantly as a result of the advent of very large scale integration (VLSI) technology. A modern microprocessor built on a single VLSI IC chip is much more complex than a medium-scale computer built just a few years ago. The high density and small feature size contribute to the increasing vulnerability to alpha particles. Since future VLSI circuits will be denser with smaller feature sizes, permanent and transient faults are more likely to occur in future VLSI circuits than in those of the present time. Concurrent error detection (CED) is thus vital for the success of future development in VLSI.

The high-performance (in terms of speed, throughput, etc.) requirement of processors makes it necessary to adopt a RISC-based processor design. The design philosophy of RISC architectures is to analyze the target applications to determine the most frequently used operations, and then to optimize the data path design to execute these instructions as quickly as possible. This philosophy of RISC is applicable to special-purpose processors as well as to large general-purpose computers. The design of a self-checking reduced instruction set computer (RISC) requires state-of-the-art techniques in computer architecture, implementation, and self-checking design.

A self-checking processor (SCP) is a processor that is designed to have the concurrent error detection (CED) capability. CED is an error detection process that is designed to operate concurrently with the normal processor operations. CED is a very important and necessary feature in the VLSI microprocessors that are integral to real-time applications, since the error latency time will be very small. The incorporation of CED capability enables fast error recovery and also helps in preventing system crashes. An SCP can be very effective in fault-tolerant computer system design. The important classes of SCP include the totally self-checking (TSC) processor and the strongly fault-secure (SFS) processor. A typical TSC or SFS processor consists of a TSC or SFS functional circuit, and a TSC or strongly code-disjoint (SCD) code checker. The study of a self-checking processor utilizes such self-checking circuits to form a complete CPU.

Due to its operative nature, the arithmetic and logic unit (ALU) is the most difficult functional circuit to check among the components of a processor. It is well known that in the presence of a fault, arithmetic operations may generate arithmetic errors, which are of the form S' = S ± 2^i, where S' denotes the corrupted sum, S is the uncorrupted sum, and i is the faulty bit position. Residue codes and AN codes are the two major classes of arithmetic error-control codes [15, 17]. Several non-arithmetic codes, such as parity codes [19], parity-based linear codes [5],
two-rail codes [6], and Berger codes [4, 8], have also been studied for applications in arithmetic operations. Reed-Muller codes are the only non-duplicated codes that can handle logical operations. A number of codes, such as the parity codes [19], residue codes [16], two-rail codes, and Berger codes [8], have also been studied to protect logical operations from failures. Among the existing codes and designs, two-rail code encoded ALUs and Berger encoded ALUs are the only known self-checking ALUs. The designs of TSC two-rail encoded ALUs are widely used in self-checking processor designs, such as the ones in [6, 10, 11, 12]. The SFS Berger check prediction ALU [8] is the only known technique for a self-checking ALU, other than the duplication or two-rail methods. A self-checking processor with an SFS Berger check prediction ALU is more efficient in terms of redundancy than one with a two-rail encoded ALU.

The Berger code is the only known systematic all-unidirectional error detecting code. It has two different encoding schemes: B0 and B1. The B0 encoding scheme uses the binary representation of the number of 0's in the information bits as the check symbol, and the B1 encoding scheme uses the 1's complement of the number of 1's in the information bits as the check symbol. Both encoding schemes are valid for the detection of all-unidirectional errors. Several papers have dealt with the design of Berger code checkers [1, 2, 3, 9, 14]. In this discussion, we examine the application of a reduced Berger code to the design of a self-checking ALU. The reduced Berger code may use only two check bits, regardless of the information length. Since a Berger code requires ⌈log_2(n + 1)⌉ check bits for n information bits (where ⌈x⌉ denotes the smallest integer that is greater than or equal to x), the application of the reduced Berger code yields a more efficient implementation.

Section 1.2.2 presents a brief review of the required terminology most commonly used in the areas of self-checking and coding theory. In Section 1.2.3, we discuss the Berger codes and the Berger check prediction scheme. Section 1.2.4 presents the formulation of the check prediction equations of the reduced Berger code for various arithmetic and logical operations. The circuit design of the proposed reduced Berger check prediction ALU will also be given in Section 1.2.4. Then, Section 1.2.5 demonstrates the VLSI design of a reduced Berger code encoded Manchester carry chain ALU. In Section 1.2.6, we present the theory of generalized Berger code partitioning, which provides a foundation for the design presented here. Section 1.2.7 describes the design alternatives based on the generalized Berger code partitioning, followed finally by some concluding remarks.
1.2.2 Conceptual Overview

It is desirable to design circuits that will indicate any malfunction during normal operation and will not produce an erroneous result without an error indication. Some of the arithmetic and non-arithmetic types of codes are briefly reviewed in this section.
1.2.2.1 Definitions and Terminology

We present here some commonly encountered terminology and a few definitions from checker design and coding theory.

Definition 1: Concurrent Error Detection (CED) is an error detection capability that is designed to operate concurrently with the normal processor operations.

Definition 2: A Self-Checking Processor (SCP) is a processor that is designed to have the concurrent error detection capability.

A model for the self-checking circuit G is illustrated in Figure 1.2.1. The circuit comprises a functional circuit L and a check circuit CK.
Figure 1.2.1. Model for a self-checking circuit

Let us consider a combinational circuit L that produces an output vector Y(X, f), which is a function of the input vector X and a fault f ∈ F, a specified set of faults
in the circuit. The absence of a fault is termed a null fault and is denoted by λ. If the circuit L has n inputs, then the input space Ω_X of L is the set of all 2^n input vectors. Similarly, if the circuit L has r outputs, then the set of all 2^r output vectors is the output space Ω_Y of L. Further, let N be the subset of inputs (referred to as the input codespace) received by the logic circuit under normal operation, and S be the subset of outputs (referred to as the output codespace) of the circuit under normal operation.

Definition 3: A circuit L is fault-secure (FS) for an input set N and a fault set F if, for any input X in N and for any fault f in F, Y(X, f) = Y(X, λ) or Y(X, f) ∉ S. A fault-secure circuit is depicted in Figure 1.2.2.
Figure 1.2.2. Fault-secure circuit
Definition 4: A circuit L is self-testing (ST) for an input set N and a fault set F if, for every fault f in F, there is some input X in N such that Y(X, f) is not in S. The model for a self-testing circuit is shown in Figure 1.2.3.
Figure 1.2.3. Self-testing circuit

Definition 5: A circuit L is said to be totally self-checking (TSC) if it is both self-testing and fault-secure.

If each input in N occurs during the normal course of operation of the circuit, then the self-testing property guarantees that all faults in F produce detectable errors during normal operation. In general, self-testing is a more difficult condition to satisfy than fault-secureness. This is due to the fact that in some cases the inputs in N required to detect every fault in F do not exist.

The concept of strongly fault-secure (SFS) circuits was proposed in [20]. According to this concept, a fault sequence {f_1, f_2, ..., f_n}, where f_i ∈ F, 1 ≤ i ≤ n, is defined to represent the event where fault f_1 occurs, followed by the occurrence of fault f_2, and so on until f_n occurs. At this instant the effect of the entire fault sequence is present on the system. A line stuck at 0 or 1 is assumed to remain stuck at that value; further, the faults are assumed to occur one at a time, and the time interval between any two such fault occurrences is assumed to be sufficient for all the input code combinations to be applied to the circuit.

Definition 6: Assume that the circuit L always gives correct codewords for a sequence of fewer than m faults, 2 ≤ m ≤ n:

Y(X, {f_1, f_2, ..., f_{m-1}}) = Y(X, λ) .

Also assume that for the m-fault sequence {f_1, f_2, ..., f_{m-1}, f_m} there is some X ∈ N such that

Y(X, {f_1, f_2, ..., f_m}) is not in S .

Then L is said to be strongly fault-secure (SFS) for {f_1, f_2, ..., f_m}.

We can verify that any SFS circuit satisfies the TSC conditions. Furthermore, if a circuit is not SFS, it is always possible to produce an erroneous code output prior to a noncode output. Hence, SFS circuits form the largest class of circuits that satisfy the TSC conditions.

Fault-secureness and self-testing specify only the behavior of the circuit for codeword inputs (the inputs are always assumed to be error free). However, we also need to study the circuit behavior for noncodeword inputs.

Definition 7: A circuit L with output codespace S is error-secure (ES) for input noncodespace Ω_X - N if, for any input X_e in Ω_X - N, where X_e = X_p + E, X_p ∈ N, E ≠ 0,

Y(X_e, λ) ∉ S, or Y(X_e, λ) = Y(X_p, λ) .
Figure 1.2.4. Error-secure circuit
This behavior is illustrated in Figure 1.2.4. If an error-secure circuit receives an erroneous input, it either passes a noncodeword on to the subsequent circuit blocks, or masks (i.e., corrects) the error in the input word.

Definition 8: A circuit L with output codespace S is error-preserving (EP) or code disjoint (CD) for input noncodespace Ω_X - N if, for any input X in Ω_X - N,

Y(X, λ) ∉ S .
Also assume that for m-faults sequence {f\,fi,-,fm-\,fm This implies,
} circuit L is self-testing.
"^i^, f/i ./2.--./m }) is not in S for some X e N Then ihc circuit L is said to be strongly code disjoint (SCD) for {f\,f2,--,fm }•
1.2.2.2 Arithmetic Codes Codes used for aritlimetic operations are modeled as integers in some ring 1^ , v\ ilh nioduUis M. The codewords are to be closed under addition of integers. Any integer N can be expressed as a polynomial in radix r as A' = fln.ir""' -i-a„_2r""^ -i-.... +a^r + ai^ for «, e {0, /,.., r-/jf. The integer is written in the fomi of an «-/M/?/e,
43
Definition 10: The arithmetic weight W_ar(N) of an integer N is the smallest number of nonzero terms in a minimal expression for N of the form

N = ± a_{n-1} r^{n-1} ± a_{n-2} r^{n-2} ± .... ± a_1 r ± a_0

where a_i ∈ {0, 1, ..., r-1} and r is the radix of the system. For example, the integer 31 in radix 2 is written as 31 = (11111)_2, and its arithmetic weight, obtained from the minimal form of expression 31 = 2^5 - 2^0, is W_ar(31) = 2. We can clearly see that the arithmetic weight of an integer N depends on the radix r of the system.

Definition 11: The arithmetic distance between two integers N_1 and N_2, denoted D_ar(N_1, N_2), is given by the arithmetic weight of N_1 - N_2. For example, D_ar(31, 39) = W_ar(31 - 39) = W_ar(-2^3) = 1. The arithmetic distance so defined corresponds with the errors that occur in arithmetic units, based on the error propagation. A code C with minimum arithmetic distance d_min provides the following properties:

1. Code C can detect up to d arithmetic errors for d < d_min.
2. Code C can correct t or fewer arithmetic errors if and only if d_min ≥ 2t + 1.
3. Code C can correct up to t errors and detect up to d errors with d > t if and only if d_min ≥ d + t + 1.
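For radix r = 2, a minimal signed-digit expression is given by the non-adjacent form of the integer, so the arithmetic weight and distance defined above can be computed directly; a small Python sketch:

```python
def arithmetic_weight(n):
    """W_ar(n) in radix 2: the number of nonzero digits in the
    non-adjacent form, a minimal signed-digit expression of n."""
    n, weight = abs(n), 0
    while n:
        if n & 1:                  # odd: emit a signed digit +1 or -1
            digit = 2 - (n % 4)    # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= digit
            weight += 1
        n >>= 1
    return weight

def arithmetic_distance(a, b):
    return arithmetic_weight(a - b)

print(arithmetic_weight(31))        # 2, since 31 = 2^5 - 2^0
print(arithmetic_distance(31, 39))  # 1, since 31 - 39 = -2^3
```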
1.2.2.3 Arithmetic Code Classes

For arithmetic codewords, carries occur between information digits and check digits and may cause problems in handling. We should hence make a clear distinction between systematic and separate codes.
Definition 12: An arithmetic code is said to be systematic if for each codeword (of n digits) there are k specified positions called information digits, and the remaining n - k positions are known as the check digits.

Definition 13: An arithmetic code is said to be a separate code if the code is systematic and the addition structure for the code has separate adders for the information and check digits. This implies that no carries occur between the information and the check digits during addition.

There are three distinct classes of arithmetic codes: (1) AN codes, which are non-systematic; (2) residue codes, which are also referred to as separate codes; and (3) systematic AN codes. Systematic AN codes are not separate. This indicates that codewords are integers and there could be carry propagation between the information and check parts. We shall briefly describe the AN codes and the residue codes. Also, parity codes and two-rail codes are briefly reviewed here.
1.2.2.4 AN Codes

In AN codes, A denotes the generator of the code and N is the represented information. Every codeword is an integer of the form AN. For information N ∈ Z_m, where M = A × m, the integers m and M are called the range of information and the modulus of the code, respectively. There are m codewords { 0, A, 2A, ..., (m - 1)A }. Each codeword is represented as a binary n-tuple, and hence n satisfies the following inequality:

2^{n-1} < A(m - 1) < 2^n

and n is called the length of the code. For two codewords AN_1 and AN_2, their sum

R = | AN_1 + AN_2 |_M = A × | N_1 + N_2 |_m

is also a codeword. Thus, AN codes are also linear, since the sum of two codewords is also a codeword. If an error e is introduced in the addition, then the erroneous result is given by the relation:

R' = | AN_3 + e |_M
To check for errors, we find the syndrome of R', denoted as follows:

S(R') = | R' |_A = | AN_3 + e |_A = | e |_A .
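A concrete sketch may help; with the illustrative choice A = 3 (our assumption, not taken from the text), every single arithmetic error ±2^i is detected, since no power of 2 is divisible by 3:

```python
A = 3            # code generator; codewords are multiples of A
M = 3 * 16       # modulus: m = 16 information values

def encode(n):
    return (A * n) % M

def syndrome(r):
    """Zero for every codeword; nonzero for any single error +/-2^i."""
    return r % A

r = (encode(5) + encode(7)) % M      # sum of codewords is a codeword
assert syndrome(r) == 0
bad = (r + 2**3) % M                 # inject an arithmetic error
print(syndrome(bad))                 # nonzero -> error detected
```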
1.2.2.5 Residue Codes

Systematic codes that have separate adders for the information and check parts are called separate or residue codes.

Error detection using residue codes: In this case, the applied codewords are of the form [N, C(N)]. The code is closed or preserved under the addition operation if and only if C(N) is a residue check of N for some base b, and the operation * in the checker is an addition modulo b. The separate adder and checker circuit is depicted in Figure 1.2.5.
Figure 1.2.5. Separate Adder and Checker circuit

For a detailed study of arithmetic codes, the reader is advised to refer to [15].
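A minimal software sketch of the arrangement in Figure 1.2.5, assuming an illustrative check base b = 3 and an 8-bit information adder (both our choices):

```python
B = 3                      # check base; C(N) = N mod B
WIDTH = 2**8               # 8-bit information adder

def check(n):
    return n % B

def separate_add(n1, c1, n2, c2):
    s  = (n1 + n2) % WIDTH           # information adder
    cs = (c1 + c2) % B               # independent check adder (mod B)
    if n1 + n2 >= WIDTH:             # account for information-adder wraparound
        cs = (cs - WIDTH) % B
    error = (check(s) != cs)         # error detector of Figure 1.2.5
    return s, cs, error

n1, n2 = 100, 200
s, cs, err = separate_add(n1, check(n1), n2, check(n2))
print(s, cs, err)                    # 44, 2, False: residues agree
s_bad = s ^ 0b100                    # flip one sum bit
print(check(s_bad) != cs)            # True -> detected
```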
1.2.2.6 Parity Codes

Codes which use the concept of parity are termed parity codes or parity-based codes. Here, we shall briefly review the concept of parity. The parity of an
integer can be either odd or even. An even parity is referred to as parity 0, while parity 1 denotes an odd parity. The parity of a certain group of bits refers to the parity of the number of elements of value 1 in the group. Furthermore, the parity of the sum of two blocks (of digits) is equal to the sum of their parities. A block of bits of length k can be converted to a block of length k + 1 having a desired parity 0 or 1 by adding a 0 or 1 parity bit. The parity bit is obtained by XORing all the k bits of the original block.

Let us briefly see how the parity concept is used in determining the existence of an odd number of errors in a received block. Let the information block I to be transmitted be of length k. The transmitter appends the parity bit to the information part of length k, thus converting it into a block of length n = k + 1 having parity 0. The receiver knows that any transmitted block (of length n = k + 1) has parity 0. Upon receiving a block, the receiver checks its parity, and the received block is determined to be erroneous if and only if its parity is 1. The process performed at the transmitting end is termed the encoding process, while that performed at the receiving end is referred to as the decoding process. Figure 1.2.6 illustrates the encoding and decoding processes.
Figure 1.2.6. Basic Error Detection scheme
1.2.2.7 Two-Rail Codes

One of the techniques used to encode information bits is the two-rail encoding scheme. In a two-rail encoding scheme, bits of information occur as complementary pairs, (0, 1) and (1, 0); the pairs (0, 0) and (1, 1) denote errors. Thus, the same circuit that checks a two-rail encoding can be used to combine several self-checking checker output pairs into one pair. A self-checking two-rail code checker will map m input
pairs, referred to as {(a_0, b_0), (a_1, b_1), ..., (a_{m-1}, b_{m-1})}, to one output pair referred to as (z_0, z_1). The output pair must be complementary if and only if each of the input pairs is complementary. Figure 1.2.7 shows the schematic for a two-rail code checker with duplication check. The duplicated check is sometimes designed in a complementary form to prevent identical failure states in both the functional circuit and the duplicated circuit. In Figure 1.2.7, the output of one of the two identical circuits is inverted and fed into a self-checking two-rail code checker. The circuit acts as a self-testing comparator.
Figure 1.2.7. Two-rail code checker for duplication check
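The reduction from m pairs to one pair can be modeled with the classic two-rail checker cell; the Python sketch below is behavioral (the gate equations are as commonly given for this cell, not read off Figure 1.2.7). The cell's output is complementary exactly when both of its input pairs are, and 0,0/1,1 violations propagate to the root of the tree.

```python
from functools import reduce

def two_rail_cell(p, q):
    """Classic TSC two-rail checker cell combining two pairs into one."""
    (x1, y1), (x2, y2) = p, q
    z0 = (x1 and x2) or (y1 and y2)
    z1 = (x1 and y2) or (y1 and x2)
    return (int(z0), int(z1))

def two_rail_tree(pairs):
    """Reduce m input pairs to a single output pair (z0, z1)."""
    return reduce(two_rail_cell, pairs)

print(two_rail_tree([(0, 1), (1, 0), (0, 1)]))  # complementary: inputs OK
print(two_rail_tree([(0, 1), (0, 0), (1, 0)]))  # (0, 0): violation flagged
```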
1.2.3 Berger Codes

The Berger code is the only known systematic all-unidirectional error detecting code. It has two different encoding schemes: B0 and B1. The B0 encoding scheme uses the binary representation of the number of 0's in the information bits as the check symbol, and the B1 encoding scheme uses the 1's complement of the number of 1's in the information bits as the check symbol. Both encoding schemes are valid for all-unidirectional error detection. Let us consider the B0 encoding scheme here.
Using the standard (n, k) block code approach, for the codeword [a_0, a_1, ..., a_{k-1}, a_k, ..., a_{n-1}], (a_0, ..., a_{k-1}) = I is the information part and (a_k, ..., a_{n-1}) = β(I) is the check part, referred to as the Berger check. The Berger check β(I) is the binary representation of the number of 0's in I. The number of check bits r required is given by r = n - k = ⌈log_2(k + 1)⌉ (where ⌈x⌉ denotes the smallest integer that is greater than or equal to x). As an example, for information I = 101101, the number of 0's is 2, and hence the Berger check is β(I) = 010. Thus, (101101 010) is its codeword.

Consider a unidirectional error channel with only 1-errors. The number of 0's in the information part may only increase (if errors occur); hence, the Berger check must increase for the result to be a codeword. However, the check cannot increase due to errors, because of the assumption that only 1-errors can occur. Therefore, any number of 1-errors cannot take a codeword into another codeword. Similarly, we can show that multiple 0-errors are also detected by this code. The most well known valid TSC Berger code checker is usually referred to as the normal checker and is illustrated in Figure 1.2.8.
Figure 1.2.8. Berger code normal checker (a 1's counter C1 whose output is compared with the check bits in a TSC two-rail checker C2, producing the error indicator)
A Berger code is a maximal-length code if and only if the information length k = 2^r - 1, where r is the number of check bits. For every non-maximal-length Berger code we can construct an equivalent separable code for which the checker of Figure 1.2.8 can be self-testing. An implementation of self-testing Berger code checkers arises from the idea that any Berger code can be constructed from u = ⌈(k+1)/2⌉ m-out-of-n codes, where k is the number of information bits, m = 1, 3, 5, ..., 2u - 1, and n = k + 1 [14]. This is called Berger code partitioning. Table 1.2.2(b) gives an example of the partitioning of a Berger code with k = 7 and check bit length r = 3, denoted as C(7,3).
An improved checker design, which uses one counter circuit consisting of modular adders, was proposed in [7]. Our recently developed reduced Berger check prediction technique may use only two check bits, regardless of the information length. The application of reduced Berger codes yields a more efficient implementation, and is discussed in the subsequent section.
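Before turning to check prediction, a small Python sketch of B0 encoding and checking (the function names are ours) may make the definitions concrete:

```python
from math import ceil, log2

def berger_check(info_bits):
    """B0 scheme: the check symbol is the binary count of 0's."""
    r = ceil(log2(len(info_bits) + 1))         # r = n - k check bits
    zeros = info_bits.count(0)
    return [int(b) for b in format(zeros, f"0{r}b")]

def encode(info_bits):
    return info_bits + berger_check(info_bits)

def is_codeword(word, k):
    return word[k:] == berger_check(word[:k])

info = [1, 0, 1, 1, 0, 1]
print(encode(info))                  # [1,0,1,1,0,1, 0,1,0]: two 0's -> 010
word = encode(info)
word[1] = 1                          # a unidirectional 0 -> 1 error
print(is_codeword(word, len(info)))  # False: detected
```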
1.2.4 Reduced Berger Check Prediction

We now apply the reduced Berger check prediction scheme to derive the equations governing some of the ALU functions. Without loss of generality, we will assume an ALU with the following functions available: additions and subtractions, and AND, OR, and Exclusive-OR (XOR) logical operations. In the following, we briefly describe the corresponding reduced Berger check prediction equations.
1.2.4.1 Reduced Berger Check Prediction for Arithmetic and Logic Operations

Let x = (x_n, ..., x_2, x_1) be an information vector, W(x) be the Hamming weight of x, i.e., the number of 1's in x, and B(x) be the number of 0's in x; then we have

W(x) + B(x) = n .   (1.2.1)

Definition 14: Let R(x) be a binary expression of B(x) mod 2^d. R(x) is defined as a reduced Berger check of x, and x* = (x, R(x)) is defined as a reduced Berger code.

Obviously, if 2^d > n, then a reduced Berger code is just a Berger code. Let us consider the addition of two n-bit numbers, x = (x_n, ..., x_2, x_1) and y = (y_n, ..., y_2, y_1), to obtain a sum s = (s_n, ..., s_2, s_1) with internal carries c = (c_n, ..., c_2, c_1), where the input carry is c_in = c_0 and the output carry is c_out = c_n. For each 1 ≤ i ≤ n, we have

x_i + y_i + c_{i-1} = 2c_i + s_i .

Summing this identity over i = 1, ..., n, and noting that W(x) = Σ_{i=1}^{n} x_i, we have

W(x) + W(y) + W(c) + c_0 - c_out = 2W(c) + W(s)   (1.2.2)

Since B(x) = n - W(x), and from Eq. (1.2.2), we have

B(s) = B(x) + B(y) - B(c) - c_0 + c_out   (1.2.3)

From Definition 14, we have

R(s) = ( R(x) + R(y) - R(c) - c_0 + c_out ) mod 2^d .   (1.2.4)

In the two's complement ALU design, the subtraction s = x - y is handled by performing the addition s = x + ~y + 1. However, if a carry input is required for the subtraction, then the carry input to the adder must be inverted to obtain the result s = x - y - c_in. Thus, we assume that the inputs to the ALU are x, y, and c_in for the general case, and x, ~y, and ~c_in during the two's complement subtraction. Similarly, we have

R(s) = ( R(x) - R(y) - R(c) + ( n mod 2^d ) - ~c_0 + c_out ) mod 2^d   (1.2.5)

If ( n mod 2^d ) = 0, then we have

R(s) = ( R(x) - R(y) - R(c) - ~c_0 + c_out ) mod 2^d   (1.2.6)
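Equation (1.2.4) is easy to validate in software. The following Python sketch is ours (d = 2 and an 8-bit width are illustrative choices): it recomputes the internal carries with a bit-serial ripple model and compares the predicted reduced Berger check against the actual one.

```python
import random

D = 2                                     # two check bits, as in the text
N = 8                                     # illustrative word width

def R(bits):
    """Reduced Berger check: number of 0's, mod 2^d."""
    return bits.count(0) % (2 ** D)

def ripple_add(x, y, c0):
    """Bit-serial ripple model; returns sum bits, internal carries
    c_1..c_n, and c_out (lists are LSB first)."""
    s, c, carry = [], [], c0
    for xi, yi in zip(x, y):
        total = xi + yi + carry
        s.append(total & 1)
        carry = total >> 1
        c.append(carry)
    return s, c, carry

for _ in range(1000):
    x  = [random.randint(0, 1) for _ in range(N)]
    y  = [random.randint(0, 1) for _ in range(N)]
    c0 = random.randint(0, 1)
    s, c, cout = ripple_add(x, y, c0)
    assert R(s) == (R(x) + R(y) - R(c) - c0 + cout) % (2 ** D)
print("equation (1.2.4) holds on 1000 random additions")
```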
The RBCP scheme can be easily extended to other arithmetic operations such as the 1's complement subtraction, sign-magnitude subtraction, multiplication, and division.

There are 16 possible logical operations on two operands, including six trivial operations, 0, 1, x, y, ~x, ~y, and ten nontrivial operations. Here we only examine three basic logical operations: AND (∩), OR (∪), and XOR (⊕). We can easily verify the following:

x_i ∩ y_i = x_i + y_i - ( x_i ∪ y_i )   (1.2.7)

x_i ∪ y_i = x_i + y_i - ( x_i ∩ y_i )   (1.2.8)

x_i ⊕ y_i = x_i + y_i - 2 ( x_i ∩ y_i )   (1.2.9)

Similarly, we have

for the AND operation, R(s) = ( R(x) + R(y) - R(x ∪ y) ) mod 2^d   (1.2.10)

for the OR operation, R(s) = ( R(x) + R(y) - R(x ∩ y) ) mod 2^d   (1.2.11)

for the XOR operation, R(s) = ( R(x) + R(y) - 2 R(x ∩ y) + ( n mod 2^d ) ) mod 2^d   (1.2.12)

If ( n mod 2^d ) = 0, then equation (1.2.12) reduces to

R(s) = ( R(x) + R(y) - 2 R(x ∩ y) ) mod 2^d   (1.2.13)
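The logical-operation equations can be spot-checked the same way; in this sketch of ours (again with the illustrative d = 2 and n = 8, so that n mod 2^d = 0), equations (1.2.10), (1.2.11), and (1.2.13) are verified on random operands:

```python
import random

D, N = 2, 8                              # n mod 2^d == 0

def R(bits):
    return bits.count(0) % (2 ** D)

for _ in range(1000):
    x = [random.randint(0, 1) for _ in range(N)]
    y = [random.randint(0, 1) for _ in range(N)]
    x_and = [a & b for a, b in zip(x, y)]
    x_or  = [a | b for a, b in zip(x, y)]
    x_xor = [a ^ b for a, b in zip(x, y)]
    assert R(x_and) == (R(x) + R(y) - R(x_or)) % (2 ** D)       # (1.2.10)
    assert R(x_or)  == (R(x) + R(y) - R(x_and)) % (2 ** D)      # (1.2.11)
    assert R(x_xor) == (R(x) + R(y) - 2 * R(x_and)) % (2 ** D)  # (1.2.13)
print("check prediction verified for AND, OR, XOR")
```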
In the following, we describe the design of a circuit to implement these equations. For practical reasons, we consider a 32-bit ALU.
1.2.4.2 The Design of the RBCP ALU

The design of a reduced Berger check prediction circuit for a typical arithmetic and logic unit (ALU) will be given in this section. The ALU has three control signals: a0, a1, and a2. When a0 is "0", the ALU performs the arithmetic operations. When a0 is "1", the ALU performs the logic operations. The signals a1 and a2 select a specified arithmetic or logical operation to be performed. The operations determined by these three control signals are listed in Table 1.2.1.
Table 1.2.1. Function Table for the RBCP ALU in Figure 1.2.9. The external controls (a0, a1, a2) select the function: with a0 = 0, the arithmetic operations S = X + Y + c_in and S = X - Y - c_in; with a0 = 1, the logical operations S = X ∩ Y, S = X ⊕ Y, S = X ∪ Y, and S = X. For each function the table also lists the settings of the internal control signals (i1-i4 and the pseudo variable δ) used by the check prediction circuit. (X: don't care.)
The RBCP ALU circuit is shown in Figure 1.2.9. The circuit translating the external signals (a0, a1, a2) into the internal control signals is implemented as a control PLA (see Figure 1.2.9).
Figure 1.2.9. Reduced Berger Check Prediction Unit

Let us verify the correctness of the proposed circuit. During the arithmetic operations, the MUX always selects the internal carries of the ALU, since, from Table
53 1.2.1,flo=0 3nci /i=0 during arithmetic operations. Since the "X2" operation is not needed for the arithmetic operations, we set ;2 =0. Also, i^ is set to 1 because both Y^ and C<. are required for arithmetic operations. Finally, ti, is set to "0" for addition and set to " 1 " for 2's complement subtraction. The pseudo variable 5 is generated according to Table 1.2.1 for the correct result from the MCSA. The result from MCSA is compared with that of the ALU in a TSC checker. The result of ALU (denoted by S in Figure 1.2.9) is Berger encoded through the O's counter, and then the two least significant bits of the Berger check bits of S are compared with the 2 bits from the MCSA, to detect tlie possible presence of a single fault in tiie ALU. To illustrate the logical operations, let us use the operation S = X©Y as an example. The Berger check for this operation is predicted using equation (1.2.13). By setting Oo = 1 and ' i =0, XnY can be obtained at tlie output of the MUX. Then tlic output of the modulo 4 O's counter will be (Xr\Y)^. Hence, we set ^2=1 to get l(Xr\Y)^. With /3=1, /4=0, and 5=1, the output of the MSCA is X, + Y, + l{Xn,Y),, as defined in Table 1.2.1.
1.2.5 Design of a Strongly Fault-Secure ALU

In this section, we will analyze a strongly fault-secure ALU.
1.2.5.1 Fault Model

Since components at both the gate and switch levels are involved in this design, there is a need for a fairly comprehensive fault model. The faults considered are:

1. Stuck-at-0 and stuck-at-1 faults at individual signal lines.
2. Stuck-open faults of nMOS or pMOS FETs.
3. Stuck-on faults of nMOS or pMOS FETs.

The stuck-at faults at the signal lines are applied to both gate-level and switch-level circuits. The many shorts and breaks that could occur at the switch level are described by stuck-open and stuck-on faults of FETs. This level of abstraction is adequate for practical designs.

In a typical ALU design, there is no distinction between the circuit that handles the arithmetic operations and the circuit that handles the logical operations. Consequently, a single fault related to the carry propagation circuit may induce multiple random errors during the logical operations. For example, [12] shows that a single stuck-at-1 fault in an ALU design with a Manchester carry chain results in double
random errors in S during the XOR operation. This type of error cannot be handled by the reduced Berger code. This situation can be avoided by separating the arithmetic and logic units, as shown in Figure 1.2.10. The configuration of Figure 1.2.10 ensures that any single fault in an arithmetic unit results in an arithmetic error (i.e., of the form S' = S ± 2^i), and any single fault in a logic unit results in a single error. Furthermore, any single fault in the multiplexer or control signals results in unidirectional errors.
_t_ Logic Unit
Anthmetic Unit
I
Multiple net
Figure 1.2.10. ALU Organization that eliminates multiple random errors
Figure 1.2.11 shows a bit slice of an ALU with Manchester carry chaui which is implemented based on tlie organization of Figure 1.2.10.
Figure 1.2.11. Example Bit-Slice of ALU with Manchester Carry Chain
55 1.2.5.2 Self-Checking Properties In order to apply the proposed reduced Berger check prediction (RBCP) ALU to the design of an SCP, the proposed RBCP ALU must be proved to have the proper self-checking properties. In this section, we will show that the RBCP ALU is strongly fault-secure with respect to any single fault in the ALU. Furthermore, for any design of the RBCP circuit, the proposed RBCP ALU will be shown to be strongly fault-secure (SFS). We will consider an ALU using a ripple carry adder and an ALU using a Manchester carry chain adder. In order to show the fault-secure property, it is not sufficient to know tlie errors on the ALU outputs. We also need to know the errors occurring on the ALU carries. This is due to the fact that the RBCP circuit uses the ALU carries in order to compute the reduced Berger check. For a ripple carry adder, we consider a fault affecting a single adder shce at a time, such that the output S/ and the carry c, can take any possible erroneous values. The error on C; can be propagated to tlie higher order slices, giving multiple bit errors on the carries and the outputs of these slices. -» -> Theorem 1: Let s' and c' be the corrupted sum and the corrupted carry vector of the ripple carr^ adder, respectively^ due to a single fault. Let R(.s') be the reduced Berger check oi s'. The value of R(5*) be computed using the RBCP of equation (1.2.4). R(s*) will be equal to R(s') if and only if no error occurs, and d > 2. This implies that the RBCP with 2 check bits can detect any single fault in an ALU using a ripple carry adder. Proof: We assume that the single fault is located at the m-th bit slice, where 1 < m < n. Subsequently, all the internal carries are partitioned into two subgroups, c^,/ = (c„, c„_i, ... , c„ + i) and 'c/^ = (c„, c ^ . ] , . . . , Cj). Let s„ and c„ be the erroneous sum and carry outputs of the m-th bit slice. Let c'^ = (c^, c^-i,... ,c„ + i) and c'i Since C; =
= W{^0 - c„^ c„; W(Pi^) = W(?i_) -s^ + s„ (1.2.14)
On the other hand, from equations (1.2.2) and (1.2.14)
W{P„) = W ( ^ ) + W(^H) - W(?H)
+ Cm - c;„, .
56 Thus we have, W{P)
= W(^
= W{T'H)
+
W(PL)
+ W(y) - W(P) - c„„, + c,„ - s„ + s„ - 2ic„ - c„,} .
Therefore, from equations (1.2.3) and (1.2.14) we have B(P) = fi(t) + 5(7) - B(P)
+ c;„, - c,„ + s„ - s„ + 2(c„ - („,) .
and from the above equation and equation (1.2.4), we have R(P) = iR(^
+ R(f} - R{c') + c;,„ - c,„ +
Sm - s'm + 2{c„ - c„) )
modi''
(1.2.15) —> However, R(s*) is generated by the RBCP circuit. From equation (1.2.4) we have as follows: R(s*) = ( R(t)
+ R(f> - R(c') + c;„, " c,„ ) mod 2'' .
Thus, R{s*) - R(P) = ( 2(c^-c„) + s'^-s„) mod 2''. Since (c'„-cj = 1 or -1 or 0, and s„ - .y^ = 1 or -1 or 0, I 2(c„-c„) + j„ - 5„ I < 3. Hence, when d > 2, i R(5*) mod 2'' = I 2{c„-c„) + s'^-s^ I. When d > 2, R(.y^) I = I (_2(c„-c„) + s„-s„)\ R(s*) = R(^) if and only if c„ = c„ and s„ = s„. Therefore, the RBCP witli 2 check bits can detect any single fault in an ALU using a ripple carry adder, n For a ripple carry adder, the optimal reduced Bcrger code is a 2-UED Bosc-Lui code, which can achieve fault-secureness. —* Theorem 2: The Berger check generated by the RBCP circuit, R{s*), will be equal to R(5'), if and only if d > 2 and no error occurs in an ALU that uses a Manchester carry chain. That is, tlie RBCP with 2 check bits can detect any single fault in a Manchester carry chain adder. Proof: The ALU is first partitioned into four parts: tlie first part contains the (irst to the p-lh bit slices, the second part contains the (p+l)-th to the (q-I)-th bit slices, the tliird part has the erroneous q-th bit slice and the fourth part contains the (q+I)-th to the n-th bit slices, where 1 < p < q < n. In Figure 1.2.12, n = 6, p = 2, and q = 4.
57 Prft C
"••i
>
? ;i>=::?^^::^^:::^^=:-
Figure 1.2.12. Backward error propagauon in Manchester Carry Chain due to a single stuck-on fault at a pass transistor
Based on the error charactensuc, we have: c,-0
c,-l
^,-0
G^-1
c,,.,-! c',.,-0
c,.,-0 c',.1-1
/',-i-l
G,_i-0
*1-1 c' p*i-0
^Z'*!
-0
^'p*
1-1
^?
c,-0 c',-0
^ * l
^ ^
-1
G . . 1-0
G^-x
5,-0 J',-0
-1 1-0
Cj-2-l c',-2-0
c»-2-0
^,-1-1
C'T-2-1
5%-I-0
c' .-0
c,-0 c',-1
^ ' . * 1-0
Cp-l-X
-/»-'"
J,-<:,.,
-1
<=p
^?*>
-1
DO change
ID Figure 1.2.12. the single pass transistor stuck-oo fault occun at the q-tb bit slice. Tberefore, the first part is fault free with a contct ouqwt. the second part is fault-^ee but is influenced by the backward error propagation and geaentes tbe corrupted sum, the third part is under the fault and generates a corrupted sion. and the fourth part is fault-free but has a corrupted carry inpuL
58 Siiinmarizing, there are (q-p) changes from 0 to 1 among <:,_i, c^_2 c and Ihcre are (q-p-l) changes from 1 to 0 among 9q_i,i^_2,.... .Sp + i. Thus, in 5' there are (q-p-1) changes from 1 to 0. Since s'c = ^e. number of O's in s' andl^c = the [uinibcrof 0"s in?, s\- -~sc =q-p-l. On the other hand, R(5') =l'c mod T^ = (^c +^c - ^c - '^in + <^out) mod 2'' and R(s*) = (s*c=^c+'9c-^'c -^m +Coui )mod2'^. Thus, we have R(5*) - R(5^ = ( Q- - c'c ) mod 2'' = ( ? c - (t^c - (q - p ) ) ) mod 2'' = ( q - p ) mod 2''. Combining these two equations, we have, R{s*) « ( ? ) = 1 mod 2^ ^ 0. n For a Mancliester carry chain adder, the optimal reduced Berger code is a 2UED Bosc-Lin code, which can achieve fault-secureness. Based on the above Uieorems, we can prove that the RBCP ALU is faultsecure with respect to the arithmetic and logical operation as follows. For any single fault in an adder, the RBCP circuit guarantees the outputs,"? and R(?), to be a noncode word. With regard to tlie logical operations, if a single fault occurs, tlie RBCP circuit also guarantees a noncode word output, since the corrupted output of the logical operations may have either a single error or multiple unidirectional errors, (up to d-1 unidirectional errors or less), while a fault-free RBCP circuit will generate a correct reduced Berger check. Hence, the RBCP ALU is fault-secure for a single fault occurring in the ALU. The following theorem shows that the RBCP ALU is also I'anlt-sccurc for any single fault affecting the RBCP circuit. Theorem 3: The RBCP ALU is fault-secure for any single fault affecting tlie RBCP circuit. Proof: Since tiie fault has occurred in the RBCP circuit, the output S of the ALU is correct. If the RBCP ])roduccs the correct value 5^, then S and 5',. fomi a code word and no enor lias occurred in the system. If the RBCP produces any erroneous value S'^ ^ Se, then S and .S',. Ibmi a noncode word. This completes the proof of faultsecureness. Q
1.2.6 Generalized Berger Code Partitioning In this section, we introduce the concept of generalized Berger code partitioning in
59 sufficient detail. Definition 15: For the information part of a codeword, for each 0 < M < ^ - 1 , if .x„ = 1, then we say tJiat J:„ has "weight" 1, otherwise its "weight" is 0. For the check part of a codeword, for each 0 < v < r - l , i f C v = l, then we say that c^ has "weight" 2", otherwise its "weight" is 0. For each (JCQ, JCJ, ..., ^ i - i , Cp_i, c ^ . j , ..., CQ). its "weight" is the siimmation of the "weights" of x,, for w = 0, l,...,k-l and "weights" of cI, for v = 0, 1, ..., p-1. Hereinafter, weight always means "weight". Theorem 4: For the BQ encoding scheme of any Berger code with k information bits and r check bits, where r = riog2(fc+ 1)1, all the 2*" codewords can be partitioned into [jc/2''} + I groups, for each of which (XQ, Jc,, ..., j;(.-i-Cp_i, c,,_2 CQ) has the same weight, where 0
for
Cv , x„ e -{O,! (• .
Thus, for any 0 < p < r, ^u=o ^u ^ v ==00 Cc^-2 v ' 2 ^ +-f 2^M=0
-
^ ~ ^C = n C v ' 2 ^ .
This means that the weight of (Xo. •'^i. •••. Xk-\> c^-i, Cp_2, •-, <^o) is k - T.yjp c^-l^. Obviously, 0 < k - -L'^-Jp c^-T
C^&\Q,\\.
(1.2.16)
Now wc calculate the number of these distinct weights. Equation (1.2.16) is equivalent to 0 < lklT\
- E ; ; r ' c,+p-2^ < U/2^J ,
(1.2.17)
Since k
60 For convenience, the notations in [14] are used. C^i. ,^ is the representation of a Berger code having k information bits and r check bits; C„/„ represents an m-outof-« code checker; the notation C„/„xi P \ implies that €„,„ is concatenated with a vertex \ P \. Therefore, a Berger code C^^.n can be represented as k
C(k,r) - ^
C,7(jt + i) X i f; f ,
t =0
where P, is the bit by bit complemented binary representation of i. For/) = 1, our Theorem 4 reduces to the Theorem 1 by Piestrak in [14]. Corollary 1: For the BQ encoding scheme of any Berger code, all the 2^ codewords can be partitioned into Lk/ZJ + 1 groups, for each of which (.XQ, Xi,..., X/r^i, CQ) has the same number of 1 's. Forp = r, let the inputs be XQ, ... , x^.i, c^_^x2'"'\ ... , c^xl, CQ. Then the number of inputs is/: + 2''"' + - - - + 2 + l = A : + 2 * ^ - 1 . We have the following corollary; Corollary 2: For the BQ encoding scheme of any Berger code, all the 2* codewords have the same weight, k. Furthemiore, a Ct/(j.+2'_i) code checker can be used as a C^.^, code checker. In the following, we show two examples to illustrate the application of Theorem 4. Example 1.2.1: Consider the Berger code C(7 3,, that is, each codeword has fc = 7 infomiation bits and r = 3 check bits. Forp = 0: Cn.y) = Con X ^ 111 !> u
C1/7 x i UO} KJ C^n x "I 101 \ u Cj,, x •! lOOl-
u €4,7 X -I o i l (• u Cin X -I 010 I- u
Cf,:-, x "I 001 !• u €7,7 x ^ 000 I'
This is a general Berger code check form as shown in (a) of Table 1.2.2. For p = 1 .• C(7.3) = C,;s X ^ 11 i- u Cy^ x\
\Qi\ KJ Cs,i x\
01 } ^j C7;8 X ^ 00 > f .
This is a Berger partitioning of [14], as shown in (b) of Table 1.2.2. For p = 2 :
61 C(7.3) = CsMO X ^ 1 (• U
C 7 , | 0 X ^ 0 f-
This partitioning scheme is shown in (c) of Table 1.2.2. For p - 3 ; C(7.3)
^7/14
•
From this generalized partitioning, we know that any TSC CT^^ code checker can be used as a TSC checker of the Berger code C(7.3). This generalized partitioning scheme is shown in (d) of Table 1.2.2. Xo
Xi
Xj
Xj
X4
Xs
X«
C:
c,
Co
Paititioning
0
0
0
0
0
0
0
1
1
1
C(V7X <111)
0
0
0
0
0
0
1
1
0
C„7X <110>
0
0
0
0
0
1
0
1
C n x <101>
0
0
0
0
1
0
0
Cv7 X (100>
0
0
0
1
0
1
1
CAT,
0
0
1
1
0
1
0
Cin X <010)
0
1
1
1
0
0
1
CM
I
1
1
1
0
0
0
Cm X (000)
X <01 1 >
X (001)
(a)
Xo
Xi Xi XJ X4 X5 X
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
1
t
1
0
1
1
1
1
1
1
1
1
1
I
1
c, 1
Co X 1 E - v . + C0XI I X 1
Partitioning
1
C,.8 X {11>
3
C3,8 X { 1 0 }
5
C5,8 X {01 >
7
Cvi
0 1
0
1 X 1
0
1
0
1
1 X 1
0 0
0
1 X 1
0
(b)
X {00 >
62 Xo
X,
Xi
Xj
X*
X,
X< Ci
C,x2
0
0
0
0
0
0
0
1X 2
0
0
0
0
0
0
Co X 1
Y^X, + Coxl +C^x2
0 1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
1
0
1 X 1
1 x2
I X 1
0
1
1
1
1
1
1
1
i
C5.;o X <1>
7
C7.,0 X <0>
0
0 0
0
Partitioning
1 X I
0
1 X 1
0 (c)
Xo
X,
Xz
X,
X.
X,
X«
0
0
0
0
0
0
0
0
0
0
0
0
0
C2 X 4
C, X 2
Co X 1
1 X2
1 X1
^ . V , +CoxHC,x2 + Coxl
PartJtioninf
0 1 X 4
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
1
0
0
I X1
0 7 1 x2
C? 14
1 X I
0 0
0
1
1
1
!
1
1
1
0
1 X1
0 (d)
Table 1.2.2 Generalized Berger Code Partitioning tor C Now we show an example for the case, k - 2'"K Kxaniple 1.2.2; Consider the Berger code C,s.4). that is, each codeword has fe = 8 iiiloniiation bits and r = 4 check bits. ['or p - 0 ; we have C(8,4, - Co/8 X -I 1000 t- u
C,,,s X -i 0111 (• u C.;8 x ^ 0110 <•
63 u C^i X i 0101 i u C4/8 X i 0100 i' u Cs/s x -j 0011 I' u
C6/8 X i 0010 )• u C7/8 X -i 0001 1^ u C8,8 X ^ 0000 I •
I'orp = 1 ; we have C(8,4) = Co/9 X
C2/9 X ^ o i l }• u C4/9 X
u C6/9 X -! 001 (> u C8/9 X -I 000 I' . For p = 2 : we have C(8,4) = CoMi X
C4,„ X ^ 01 I' u Cg,,, X ^ 00
r
For p = 3 .• we have C(8,4) = Co/15 X -I 1 (• U Cg/ij X -i 0 !> . For p = 4 : we have C(8,4) = Cg;23 • Thus, a Berger code checker can be realized using an 8-oiit-of-23 code checker.
1.2.7
Implementations
Based on the generalized Berger code partitioning, we may utilize Ilie existing /n-out-ol"-n code checker designs to construct the TSC Berger code checkers. Note that a TSC converter is required to convert tlie binary value of the ciieck bits to a corrcs])onding weight. For instance, in example 1.2.1, a circuit is required to generate C2X4 + c, x2 + CQXI, as in Table 1.2.2 (d). This circuit is trivial: Cj has four tan-outs, c, has two fan-outs and Co has one fan-out. A TSC m-out-ol-2/n code checker can be constructed from adders based on decomi)osition fl3). When applying such TSC m-out-of-2w code checker to the ])rojwsed TSC Berger code checker design, we need to realize that the weight of the clieck bit is already in the binary form. Also, due to the nature of decomjiosition, any ^-bit Berger code checker may be constructed from a it-oul-of-2A: code checker of this type of constniction. For the TSC Berger code checkers, we give a construction j^rocedure as follows :
64 1. Let 8„ + 8,, = 2"" -1 - k, where 8„ = \(2' - 1 - k)li\ and Sj = 1(2' - 1 - k)l2\ and let C be the number represented by tlie cheek bits, such that Efr^ ,v; + C + 8„ + 8fc = 2' - 1 ,
(1.2.18)
since ZfrJ .x, + C = ^ according to Corollary 2, forp = r. 2. Construct an adder network to compute Zf^o J^,. Then, replace the appropriate half adders by MHA < j , or M//A,2) to obtain a sum of Ef JQ J^, + SO • The modified half adders (MHA's)are defined as follows: function is
MHA^^
2(r,' + i,- = a,- + i),- + 1
(1.2.19) and ihe function realized is 2c,- + s^ = O; + 1.
M//A,2 C;
(1.2.20) 3. Use MHA(i), MHAi2) and regular half adders to obtain the sum C + 5;, . For Ihe most significant bit position, use MHA^^, which is defined as follows: MHAf:,
5; = a, © i-,-
, Ihe function realized is i',- = a,•+ /?, .
(1.2.21) The above procedure is a modification of that given in [13]. When k = 2' - I and 8„ =5;, =0, the above construction is identical to that proposed in [9]. Example 1.2.3: Figure 1.2.13 shows a TSC Berger code checker for C(8,4) based on the above procedure, using MHA^y-,, MHA^2) and MHA(^^. Sincefc= 8, we have r = 4, 5„ =4 or ^0100K and 8;, = 3 or "100111". The adder network is constructed to compute Sflo •'^i . In computing ZfrJ .v, + 8„, we note that the only nonzero bit in 8^ is at bit position 2 and has a weight equal to 2". The half adder that generates the final sum for this weighted position is replaced by an M//A(i,. For the check bits
65 portion, we need to add a 1 at the bit position 0, and a 1 at the bit position 1. Hence, we use an MHAfj) at the bit position 0, an MHAfi^ at bit ix)sition 1, a regular lialf adder at bit {wsition 2 and an MHA(y:, at bit position 3. Note that tlie most significant bits of 5a and 5* are always zero. Tlie modified l"s counter ( evaluating E .t, + S^ ), generates values ranging from 4 to 12, while the computation of C + 6* generates values from 11 to 3. Obviously, the TSC two-rail checker receives sufficient input combinations to guarantee the self-testing property. For k = Y - 1, the hardware cost of this scheme is identical to tliat of the scheme in [9]. For k*2' - \ and k*2'"\ the proposed scheme uses more Jiardware than that in [9], due to the need to |)erfomi check hit modification.
Figure 1.2.13.
Adder Realization of TSC Berger code checker for C(g 4^
66 For k = 2*^"^ there exists no checker tor comparison.
Conclusive Remarks In this chapter, we have presented tlie tlieory and tlie design of an SFS ALU using the reduced Berger code. A reduced Berger code is used to encode both operands and computation results. This code uses only two check bits regardless of the information length, and hence the ajJi^lication of reduced Berger code to ALU yields a more efficient implementation of a strongly fault-secure ALU than the designs proposed in [8, 10, 11). Further, we have discussed the concept of generalized Berger code partitioning. Based on this concept, we may utilize the well-known TSC m-out-of-« code checker designs for constructing the TSC Berger code checkers. When k = 2'' - I, we have /w = 2'' - 1 and n = 2m. If a technique similar to tliat proposed in [13] is used, then only TSC k-Qut-oi-2k code checkers are needed. This is due to the decomposition nature of the method. The most significant contribution discussed in tliis chapter is the newly found design [18] of a two-output TSC Berger code checkers for A: = 2"^"'. It is hoped tliat tliese design methods are fore runners for practical self-checking processor designs of the future.
REFERENCES [1]
M. J. Ashjaee and S. M. Reddy, 'Totally Self-Checking Checkers for a class of Separable Codes," I'roc. J2ih Annual Allerton Conf. Circuit and System Tfieory, pj). 238-242, October 1974.
|2
M. J. Ashjaee, 'Totally Self-Checking Check Circuits for Separable Codes," Ph.D. dissertation. Univ. of Iowa, Iowa, 1976.
[3!
M. J. Ashjaee and S. M. Reddy, "On Totally Self-Checking Checkers for Separable Codes," IEEE Trans., on Computers, Vol. C-26, No. 8, pp. 737744. Aug., 1977.
[4]
J. M. Berger, "A note on an eiTor detection code for asymmetric channels,"//!/ Control, pp. 68-73, March 1961.
[5]
E. Fujiwara and K. Haruta, "Fault-Tolerant Arithmetic Logic Unit Using Parity-Based Codes," Trails. Inst. Electron. Commun. Eng. Jap., Vol. E64,
67 No. 10, pp. 653-660, October 1981. [6]
M. P. Halbert and S. M. Bose, "Design Approach for a VLSI Self-Checking MIL-STD-1750A Microprocessor," Proc. 14th Inl'l Symp. Fault-Tolerant Comput., pp. 254-259, June 1984.
[7]
J. C. Lo and S. Thanawastien, "The Design of Fast Totally Self-Checking Berger Code Checkers Based on Berger Code Partitioning," Proc. FTCS-19 pp. 226-231, June 1988.
[8]
J. C. Lo, S. Thanawastien, T. R. N. Rao, and M. Nicolaidis, "An SFS Berger Check Prediction ALU and Its Application to Self-Checking Processor Designs," IEEE Trans. Computer-Aided Design, Vol. 11, No. 4. April 1992.
[9]
M. A. Marouf and D. A. Friedman, "Design of Self-Checking Checkers for Berger Codes," Proc. 8th Symp. Fault-Tolerant Comput., pp. 179-184, June 1978.
[10]
T. Nanya and T. Kawaniura, "Error Secure/Propagating Concept and its Application to the Design of Strongly Fault Secure Processors," IEEE Trans, on Comput., Vol. 37, No. 1, pp. 14-24, January 1988.
[11]
T. Nanya and M. Uchida "A Strongly Fault-Secure and Strongly CodeDisjoint Realization of Combinational Circuits," Proc. Int. Symp. on FaultTolerant Computing, pp. 390-397, 1989.
[12]
M. Nicolaidis, "Evaluation of a Self-Checking Version of the MC68000 Microprocessor," Proc. 15th Int'l Symp. on Fault-Tolerant Comput., pp. 350-356, June 1985.
[13]
A. M. Paschalis, D. Nikolos, and C. Halatsis, "Efficient Modular Design of TSC Checkers for M-out-of-2M Codes," IEEE Trans., on Computers, Vol. C-37, No. 3, pp. 301-309, March, 1988.
[14]
S. J. Piestrak, "Design of Fast Self-Testing Checkers for a Class of Berger Codes," IEEE Trans., on Computers, Vol. C-36, No. 5, pp. 629-634, May, 1987.
[15]
T. R. N. Rao, Error Coding for Arithmetic Processors, Academic Press, 1989.
[16]
T. R. N. Rao and H. J. Reinheimer, "Fault-Tolerant Modularized Aritlimetic Logic Units," Proc. ofNatioiwl Computer Conferences, AFIPS, pp. 703-710, 1977.
[17]
T. R. N. Rao and E. Fujivvara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.
68 [181
T. R. N. Rao, G. L. Feng, Mahadev S. Kolluni and J. C. Lo, "Novel Totally Self-Checking Berger Code Checker Designs Based on Generalized Berger Code Partitioning," IEEE Trans., on Computers, Vol. 42, No. 8, Aug., 1993.
[191
F. F. Sellers, M. -Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, New York: McGraw-Hill.
[201
J. E. Smith and G. Mertze, "Strongly Fault Secure Logic Networks," IEEE Trans., on Computers, C-27, pp. 495-499, June 1978.
SECTION 2
DEPENDABLE COMMUNICATIONS
SECTION 2.1
Network Fault Detection and Recovery in the Chaos Router Kevin Bolding and Lawrence Snyder^ Abstract Chaotic routing, which allows packets to follow non-minimal routes, provides a basic level of fault-tolerance by allowing messages to be routed around faults without requiring a priori knowledge of their locations. However, the mechanisms for doing this can be slow and clumsy at times. We augment Chaotic routing with a limited amount of hardware to support fault detection, identification, and reconfiguration so that the network can automatically reconfigure itself when faults occur. We present a high-level design of these mechanisms, driven by the goal of achieving reasonable reliability without exorbitant cost.
2.1.1
Introduction
Due to the never-ending quest for greater computational power, the size of multicomputer networks is rapidly growing. Networks containing hundreds or thousands of processing nodes are moving out of research labs and into the marketplace. As the size of multicomputers grows, the frequency of faults grows as well. The interconnection network linking all of the nodes of a multicomputer into one coherent system is a critical resource of great complexity which must provide some level of fault tolerance for successful operation. Interconnection networks are especially subject to faults because of the large number of external connections necessary. For common k-aiy rf-cube networks, where nodes are connected into radix k, dimension d structures, there are A' = k'' nodes and 2Nd network links^. For example, a 1024-node network arranged as a two-dimensional torus has 4096 network links, and, arranged as a ten-dimensional binary hypercube, it has 10240 network links. Each of these 'Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195. This work is supported in part by Office of Naval Research grant N00014-91-J1007 and National Science Foundation grant MIP9213469. •^For the case with d = 2, a binary hypercube, there are only Nd links.
72 links is usually implemented using either external cables or traces in a printed circuit board. These connections, particularly those implemented with cables, can break or disconnect, especially when the machine cabinet is opened for servicing. Thus, the failure of interconnection network links is likely to be relatively common due to the sheer number of links and the vulnerability of the hardware. Because network failures are expected to be relatively common, the interconnection network must be able to perform in the presence of faulty links. However, the ubiquitous diine.nsion-order ohln'ioufs routing scheme [8, 9, 11, 20] and more recent minimal adaptive [7, 19, 13] routing schemes for network communication suffer significant losses of communication ability in the presence of even a single link failure. On the other hand, non-mmtmal adaptive routing schemes [18, 15, 6, 10, 21] provide a natural mechanism for providing a basic level of fault-tolerance. Chaotic routing, a non-minimal adaptive routing scheme whicli has been sliovvn to provide superior performance with a practical design [15, 3, 12, 2], is augmented with several siiDple mechanisms to i^rovide on-line fault detection and simple off-line fault location and recovery mechanisms.
2.1.2
Fault Tolerant Routing
We define "fault tolerant routing" as a method of providing reliable message delivery between any nodes in the network connected by a functional physical l)ath. Interconnection networks vary in the number of redundant paths they provide between any pair of nodes. However, in any network, multiple failures may render coiiinuiuicatioii between two nodes impossible regardless of the routing algorithm. The goal of network fault tolerance has three parts: • Provide reliable message delivery between all pairs of nodes that have some non-faulty path between them. • Detect when a fault has occurred, and alert the operating system. • Provide mechanisms to re-configure the network around faulty areas to ensure quick and reliable message delivery to all non-faulty areas of the network. VVe use a relatively high-level fault model in this work: a fault occurs when a packet is injected into the network and cither is not delivered to its destination, or is delivered with corrupted data. Such faults may be either transient in nature, with only temporary effects, or persistent, causing repeatable errors.
73
Processing Node
Network Channels Router
Communication > co-frocessor J*>,o,essor i Channel Figvire 2.1.1: A node, router, and network links. Failures may occur in any of the components or links.
Detection of faults relies on discovering t h a t packets are lost or corrupted. Thus, if a fault occurs which causes neither lost nor corrupted packets, it m a y not be detected by the mechanisms introduced in this work. We consider three basic classes of faults; link failures, node failures, and router failures, corresponding to each of the components listed in the basic node-router model shown in P'igure 2.1.1. Link failures refer to a fault in the physical network link between two routers. This class of failures includes common physical failures such as the unplugging of a routing cable or unseating of router chip from its socket. Link failures result in either corruption of d a t a between two routers or loss of communication between two routers and are easily detected. If a link is found to be faulty, it will be removed from service by the reconfiguration methods, but the routers attached to it will continue to operate. Node failures refer to any failure which causes the processing node attached to a router to refuse to accept packets. In other words, the node appears "dead" to the network when such a fault occurs. Such faults are easily detected, although it is not possible to isolate the fault further than identifying the processing node as "faulty." Such failures can be either in the node itself, the communication co-processor, or the links on the path from the node to the router. If a node is found to be faulty, it is removed from service. T h e router attached to it can continue to operate so t h a t network service is unaffected. A router may fail in different ways, each resulting in varying loss of service. A failure in which packets are routed successfully with proper data, but use an incorrect channel or experience unnecessary delay is difficult to detect because such behavior is within the realm of normal operation. Fortunately, such failures are not catastrophic because they do not result in corruption or loss of data. Some failures may result in certain channels of the router being inoperable, and
74 will appear the same as link failures and be treated as such. Finally, a router may fail to operate at all. This is the most severe form of fault in the network because this results in the inoperability of all network links to the router, as well as severing the attached processing node from the rest of the network. Fortunately, router failures are expected to be more rare than link or node failures because they are not as vulnerable to physical effects as the physical links are, and they have a much simpler design than an entire processing node does. Related Work The many routing schemes that appear in the literature and in actual multicomputers vary significantly in their levels of fault-tolerance. T h e most corrunon routing scheme is the dimension-order oblivious routing scheme [8, 9, 11, 20] which provides efficient routing without undue complexity. However, the ohlitlous nature of the routing decision means t h a t there is only one possible (predetermined) path between any particular source and destination pair. This has the following two consequences: performance is limited, and faults cause total communication failure. Specifically, Borodin and Hopcroft show that any oblivious router for an iV-node netvvork with degree d must have worst-case routing time of Q I W 4- 1 [4]. T h u s , any network using an oblivious router is guaranteed to exhibit delays proportional to the square-root of the number of nodes for some traffic pattern, regardless of the network diameter. Moreover, since there is only one staticallydetermined path for any source and destination pair, any failure of any link or router on this path results in a loss of communication, regardless of the availability of alternative paths in the network. Minimal adaptive routing schemes [7, 19, 13] provide better performance and some fault tolerance by allowing a wider range of paths between a source and destination node. These schemes allow packets to travel on an minimal-length path from their source to their destination. While this allows some faults to be avoided, the scheme fails when there is only one minimal-length p a t h between the source and destination node (i.e., the source and destination addresses differ in only a single dimension in a k-ary d-cube). Several schemes have been introduced to deal specifically with faults in networks [16, 5], but these require both more complicated routing algorithms and packets to carry information on faults with them. Since these are likely to increase significantly the time for packets to be routed in non-faulty cases as well, they are impractical for use in real applications. Non-minimal adaptive routers, including those which have packet storage
75 capability [18, 6] and deflection routers, which have no storage capability [21, 10] can take advantage of the schemes presented in this work. However, chaotic routing has been shown to be simpler than other non-minimal adaptive schemes [2] and has higher performance than deflection-based schemes when using multiword packets [3].
2.1.3
Chaotic Routing
Chaotic routing [15] is a form of non-minimal adaptive routing which uses randomization to prevent livelock without requiring complex protection hardware. It has been shown to be deadlock and livelock free on hypercube [15] and many other types of networks [1], including all ^-ary rf-cubes. Chaotic routing provides superior routing performance [3, 12] with a practical design [2]. Figure 2.1.2 shows a data path diagram for a two-dimensional (mesh and torus) chaos router. A chaos router has three main components: input/output frames, a crossbar, and the muHiqueue Each frame and multiqueue slot is capable of holding one fixed-size (20-flit) packet. The basic operation of the chaos router is similar to a typical oblivious cut-through packet router [14]: packets enter the router into an input frame, are connected through the crossbar to an output frame and have their header updated on the way out to reflect their progress. Virtual cut-through allows packets to proceed through the router as soon as their header is received and decoded; it is not necessary for the entire packet to be buffered before moving to the next router. Two critical enliancements to the oblivious router design distinguish the chaos router. First the routing decision is adaptive so that a set of equally profitable channels is specified rather than a single channel. The first available profitable channel will be chosen for routing. The second enhancement is the addition of a small (five packet) buffer, the multiqneue, which holds packets for which no profitable channels are immediately available. By moving packets out of input frames to the multiqueue, critical channel resources are freed up so other packets can use them. Packets in the multiqueue are stored until a profitable output frame becomes free. Packets in chaos routers are moved from input bufi"ers into the multiqueue on two occasions; packet exchanges for deadlock prevention and "stalling" [3]. The packet-exchange deadlock prevention protocol mandates that if routers on either side of shared link have packets to send to each other, both packets must be sent, rather than only one side sending a packet. In order to guarantee this, the chaos router implements the following protocol: if a packet is sent to the output frame for channel i and the input frame for channel i has an incoming
76 +X Input Frame +A Input n
Qulpui Frai
• dec -"fn
-4fTTr^ - X l i put Frame
J
•-e
U^THffl
-t-V Iripiil Frnmi;
•^fitt
dec
Inpul It'ranit - YV Input Election Frame
Injection Frame
MtiltiQueae
Figure 2.1.2: Two-dimensional Chaos router d a t a p a t h diagram.
packet, the packet in input frame i is moved to the queue.'* The second occasion on wliich packets are moved into the queue is when they have stalled for a •'reasonably long" time at an input frame while waiting for a profitable o u t p u t frame. For performance reasons, stalled packets are moved to the queue to free up the input channel for other packets. The chaos router determines t h a t a packet is stalled when the entire packet is stored in a single input frame, i.e. it can no longer cut-through. Packets in the queue have priority over packets in input frames when competing for outgoing channels. When an o u t p u t frame desired by one or more packets in the multiqueue becomes available, a packet is routed out of the multiqueue to t h a t output frame. When more than one such packet exists, F I F O priority is used; also, all packets in the nuiltiqueue have precedence over packets in input frames. If tiie router is in a situation wltere a new packet must be read into the multiqueue, but the multiqueue is full, a packet is randomly selected from the multiqueue to be derouied to the next available o u t p u t frame, making room for the new packet. Derouting of packets is the only mechanism for routing packets on non-minimal paths in the network. "^Packets from the injection frame do not enter the queue due to deadlock prevention considerations.
77
2.1.4
Fault Tolerance in Chaotic Routing
Although the design of the chaos router was conceived with the primary goals of high performance and low complexity, a basic level of fault tolerance is a natural result of the design as well. When either a link or a router in a chaotic network persistently fails to respond to its inputs (i.e., becomes "stuck"), the output frames in neighboring routers connected to the faulty components will fail to empty, as the only way for packets to leave them is through the failed component. Since packets are moved to output frames only when there is space available in the output frame, the routers will not route any more packets toward the faulty components. Packets which have other profitable channels besides the faulty channel will be automatically routed out an alternative minimal path. Those which have no other profitable choices will be, after a short time, moved into the multiqueue. When the multiqueue fills to capacity, packets will be selected on a random basis to be derouted out the next available output channel. Eventually, packets desiring failed channels will be derouted off of their original path and out a nonprofitable, but working, channel. At this point, a new path will be embarked upon from the packet's new location. Thus, even though the minimal path is blocked by a faulty component, chaotic routing allows communication to continue by means of the derouting mechanism. Although chaotic routing provides some fault tolerance by this method, there is much room for improvement. First of all, there will be some packets stuck in the output frames waiting for faulty channels. Moreover, packets may be corrupted or lost by transient faults. Packets in either of these categories will not be delivered properly by the original algorithm. The fault detection mechanisms presented in Section 2.1.5 will be able to detect these lost and corrupted packets. Another difficulty is the "clumsiness" of the derouting mechanism in providing alternative paths. In order for a packet to be derouted, even when its path is blocked by a faulty channel, it must first wait to be read into the multiqueue, then wait for the multiqueue to fill up with other packets'*, and then wait to be randomly chosen to be derouted. Mechanisms to identify the locations of faulty hardware are introduced in Section 2.1.6 and methods of streamlining the derouting process when packets are waiting on faulty channels are discussed in Section 2.1.7. ''In the case of light traffic, this may not happen at all.
78
2.1.5
Fault Detection
The "natural" fault tolerance in chaotic routing stems from the fact that a faulty component appears the same as a very congested component from the point of view of a router. Since chaotic routing is designed to route packets around congestion, it will also route packets around faults. In order to provide more robust fault tolerance, though, it is necessary to detect when a fault has occurred, so t h a t corrective action may be taken. Three fault detection mechanisms, presented below, are employed in chaotic routing to raise a "redalert" alarm signal whenever a fault is likely to have occurred. Corrupted Packet Checking In order to ensure reliable transmission of data across the network, a mechanism must be implemented to detect and possibly correct errors in transmitted data. This can be accomplished through simple parity checking (which can be done at each hop in a packet's path) or packet-based checksums which are checked when each packet is received at its destination. Adding an additional word for a checksum to each packet consumes bandwidth, but the additional d a t a security is worth the cost. If a single bit of a packet is in error, then the source of the error is probably a transient fault and it can be resolved by either using d a t a correction techniques, if applicable, or directing the source to re-send the packet. On the other hand, if multiple words are in error, or if words are mi.ssing or added to the packet, a more serious error is indicated. Multiple faulty words in a packet indicate t h a t there is a persistent fault in some component of the network. Packets with more or fewer words than expected indicate a problem in the control mechanisms of either a router or channel. In either case, the error is serious enough to warrant further attention. Thus, when such a corrupted packet is found, a "red-alert" is signalled to warn the system that a persistent fault may exist. Lost P a c k e t C h e c k i n g T h e methods discussed in Section 2.1.5 detect errors in packets which arrive at their destination. However, some network failures will result in the delivery of a packet to an incorrect destination or complete loss of the packet. It is necessary to provide ways to detect such errors so t h a t the sender can be notified and the network reconfigured if necessary. Acknowledgment messages may be used to perform this service, but there arc two drawbacks to this method. The first is lliat message traffic is doubled in the worst case since every message requires an acknowledgment. The second
79 is t h a t , because chaotic routing provides no deterministic bound of message delivery time, the sender cannot at any time be certain that the acknowledgment it is waiting for will not arrive in the future. Because of these drawbacks, we present an alternative solution based on the packet re-ordering mechanism. P a c k e t R e - o r d e r i n g . An unfortunate property of adaptive routing is that packets may arrive at their destination out of order. This may occur when a packet is sent on a shorter or less congested path than its predecessors were. Since most applications desire in-order packet delivery, the network interface must rc-order the packets upon delivery. T h e network interface described by McKenzie [17] provides a packet reordering mechanism by tagging each outgoing packet with a sequence number. T h e destination network interface keeps track of packets received so they can be reassembled in order to form the intended messages. Because the interface keeps track of packets, it can be used to ensure that all packets from a message have successfully arrived. If a packet is lost, the transmission will never be completed. By keeping track of the time since the last arrival of a packet, the receiving network interface unit can raise an alarm after a specified timeout period indicating that a packet has not been received. R e - o r d e r i n g T i m e - o u t s . Although the detection of lost packets using the packet re-ordering mechanism is straightforward, it is not perfect. "False alarms" may be raised if a timer expires because a message is substantially delayed in transit, but not lost. To prevent such events, the time-out should be as large as possible. On the other hand, the time-out period should be short enough to provide quick response to lost packets. T h u s , the time-out period should be sized large enough that the number of false alarms will be very small, since servicing each alarm is costly, but small enough to raise alarms quickly when needed. Finding the proper size is difficult because the degree to which packets are out of order, when packets are not lost, varies with the size of the network and the traffic pattern applied. Larger networks will produce more out-of-order packets due to the longer average path lengths of packets: the longer a packet is in the network, the more opportunities other packets have to pass it. Also, traffic patterns in which nodes inject packets infrequently and uniformly present fewer re-ordering problems than when nodes inject long streams of packets into a network with non-uniform traffic.
80 Cliannel Time-outs Because overflow of the re-ordering buffer may not detect all lost packets in a timely manner, another meciianism is provided to detect packets which are "stuck" in the network. The time from the entry of a packet into an output frame until it is moved across the channel to its neighboring router's input frame is bounded by the design of the packet-exchange protocol [18] and of the chaos router [15]. Thus, if a packet is waiting in an output frame for an extraordinarily long time, it is an indication of a failure somewhere in the network. By including timers with each output frame which count the number of cycles elapsed since a packet has entered the output frame, it can be determined if the network is faulty. The counters are reset whenever a packet leaves an o u t p u t frame, and are incremented each cycle that a packet is present in the output frame. Once the counters reach some pre-determined value, a red alert is signalled. Although the location of the channel which timed-out provides a cine to where the faulty component in the network is located, it does not pinpoint it because the time-out could be the result of a failed component several network hops away. Thus, the fault location techniques must still be applied to track down the location of the faulty hardware.
2.1.6
Fault Location
Once a red-alert has been signalled by one of the three fault detection mechanisms, the location of the fault must be pinpointed. To do this, the network must be taken off-line and diagnostics run. The fault location diagnostics consist of two phases: system drain and network testing. S y s t e m Drain When a red-alert is signalled, the first action ncKxled is to drain all packets out of the network. First, all nodes are signalled to cease injecting new packets, and the system is run until all packets arrive at their destinations. If the network has no faulty components, all packets should be delivered at this point. However, if faults exist in the network, packets may be lodged somewhere and must be removed by other means. Packets which remain after the system has been drained must be saved to be re-injected after the system is re-started. Lodged packets may be stuck behind faulty channels or at the delivery frames to dead nodes. If a node or its interface is not responding, packets destined to it may be discarded. Nodes may be inferred to be dead if the ejection frame of the node's router still is full after the completion of the system drain. The ejection channels of routers
81 Router
Router
^IIT^ ~X^
sT^'^
rhT
Router
.r^I^ /he-
Router \m
Router
tr^-v
.
\
^ < < N
1^ Node (a)
(b)
Figure 2.1.3: Fault diagnostic testing, (a) One-hop testing path, which requires nodes to be functional for the link to pass the test, (b) Two-hop testing path allows links to pass the test even if nodes fail.
connected to dead nodes can be instructed to discard their packets at this time. Any packets lodged behind these packets will then be freed and can continue to their destinations. Packets remaining in the network at this point are stuck behind faulty channels and must be removed from the network for later reinjection. By adding a bus which connects the output frame's o u t p u t s back to the ejection frame's input, the packets can be removed from the router and saved in the attached node. At this point, all packets should be cleared from the network since those destined for dead nodes have been deleted and those lodged behind failed channels have been cleared from the network and saved. Roto-routing: Network Testing Once the network hats been cleared of packets, diagnostics can begin. T h e objective of the first test is to determine which network links and nodes are functional. The communications co-processor attached to each router sends a series of test packets to each of its neighboring nodes. At the same time, the co-processor confirms the correct arrival of each test packet and, after all have arrived from a particular neighbor, sends an acknowledgment packet back to t h a t neighbor. The testing path is shown in Figure 2.1.3(a). When a node receives an acknowledgment packet, it can deduce that the channel the packet was received on is functional. Because this test relies not only on the channel being functional, but also on the neighboring co-processors to be operating properly, it may improperly indicate channels as faulty. A second test, similar to the first one, is then performed. In this test, though, packets are exchanged between all pairs of two-hop neighbors (those nodes which can be reached in two network hops), as shown in Figure 2.1.3(b). In this test, even if a neighboring router's co-processor
82 is faulty, if its router is functional, the test packets will be delivered properly. If a link passes either the first or second test, it is considered correctly operating. Links which fail both tests are considered faulty and will be masked off during the fault recovery process. Once the faulty links have been established, a test is made to find nodes which are still functioning. This is done by having each node send a packet to a single designated "master" node. For each packet received, the master infers t h a t the source node must be alive. A map of all of the functional nodes in the network is then made and distributed as appropriate.
2.1.7
Fault Recovery
Once faulty components have been located, the network can be re-configured and re-started. Reconfiguration involves identifying dead or faulty nodes to the operating system and marking failed channels. If there are any dead nodes, the operating system is notified so that it can reconfigure the system before restarting. Since it will probably be necessary to rc-start the system from some checkpoint before reconfiguration in this case, all of the saved messages can be dropped since their contents will be invalid anyway. Faulty links identified in the fault location phase must now be masked out of the system. This can be done by marking each output frame connected to a failed link as permanently "full," so that no packets are routed to the faulty link. To provide better performance, packets that have only faulty links as their profitable links can be immediately derouted upon entering a node, avoiding the wait for the multiqueue to fill up first. If any routers have failed completely, their attached node, and all attached channels can be considered dead and treated as above. Once the faulty links have been masked out, the messages stored in the processors during the system drain phase can be re-injected into the network and the system re-started.
2.1.8
Architectural Support
In order to provide the fault ditions must be made to the network interface. A primary a manner which has minimal faults are not present. There the "on-line" phase, and fault
tolerant functions described above, several adchaos router chip architecture and to the chaos goal of this work is to support fault tolerance in or no impact on the router's performance when are two areas of concern: fault detection during location during the "off-line" diagnostic phase.
83 Detection Because the vast majority of operation during the normal, on-line, phase will be without error, the fault detection mechanisms are designed to be implemented without causing any additional delays for fault-free traffic. The first two methods, checking for corrupted or lost packets, require no additional hardware. The third method, channel time-outs, requires a small amount of hardware. Finally, communicating a red-alert message requires a single pin per routing node. Packets sent over any communication network are subject to some number of random transient data faults due to common effects such as electro-magnetic interference and alpha particles. Thus, it is necessary to provide some sort of data correctness validation, via cyclic redundancy checks (CRCs) or simple checksums for each packet. Hence, the overhead to protect the data must be spent regardless of the presence of fault-detection mechanisms, so corrupted packet detection comes with no extra cost. The second method of fault detection, lost message detection, utilizes the already necessary re-ordering buffer. The only modification is that the buffer should be sized small enough to allow timely detection of lost packets. The final detection method, channel time-outs, requires additional hardware to be added to each router. Each output frame must have a timer associated with it which counts the number of cycles the output frame has been full and signals an alarm when a pre-determined limit has been exceeded. The hardware for this is very small and can be implemented without affecting the timing of packets traveling through the routers. Finally, to communicate a red-alert situation, a single pin must be allocated to each router and communications co-processor that is connected to a systemwide bus which can be driven by any node signalling a red-alert. Diagnostics Once a red-alert is signalled and the diagnostics phase is entered, most of the work is done by software in the operating system combined with the communications co-processor. A node must be appointed as the system master, which controls the pace of the diagnostic procedures. This node can communicate with the network via a bit-serial data bus which is connected to all of the routers and communication co-processors in the network. The system drain requires that the routers have a mode in which all packets in a them are routed to the ejection frame. This mode can be entered based on communication from the bit-serial control bus and requires only additional control logic in the router. The system drain also requires that packets can be "backed out" of output frames and delivered via the ejection frame. This requires a bus to be added which connects the outputs of all of the output frames
84 to tlie input of the ejection frame, and control logic to govern its operation. Although both of these mechanisms will increase the area of the router chip, they should not interfere with its normal operation at all. T h e roto-routing phase of the diagnostics can be accomplished via programming in the communications co-processor, and requires no special hardware in the router chip. T h e co-processor simply injects and receives test packets, checking to make sure that they arrive correctly. The final phase, reconfiguration, requires that individual channels in routers be disabled. This can be accomplished by sending a message via the bit-serial bus to disable the faulty channels. Each faulty channel will have its o u t p u t frame always appear full, so that packets will not be routed to it. Also, any packets entering a router which have only disabled packets as profitable channels will be immediately derouted. The latter step is a minor change in the control logic, and both steps must be implemented even in a fault-free network which may have missing nodes or "edges."
2.1.9
Conclusion
As multicomputer networks grow larger, the frequency of faults occurring will increase substantially. To handle this, networks must be able to provide communication even in the presence of faulty components. Chaotic routing provides mechanisms which will route packets around faulty links, preserving the communications capabilities between all parts of the network whenever possible. Chaotic routing is augmented with simple and cost-effective methods of detecting when faults have occurred so that the operating system can be notified of failures in a timely manner and reconfiguration can take place. A set of diagnostic procedures which will locate faulty components has been proposed. Once this information is known, the faulty components can be isolated from the network and procedures implemented to allow packets to be quickly and reliably routed around the disabled components.
References [1]
Kevin Holding. Chaotic Routing: Design and Implementation of an Adaptive Multicomputer Network Router. P h D thesis. University of Washington, Seattle, WA, July 1993.
[2]
Kevin Holding, Sen-Ching Cheung, Sung-Eun Choi, Carl Ebeling, Soha Hassoun, Ton Anh Ngo, and Robert Wille. T h e chaos router chip: Design and implementation of an adaptive router, In Proceedings of the IFIP Conf. on VLSI, pages 311-320, September 1993.
85 [3] Kevin Bolding and Lawrence Snyder. Mesh and torus chaotic routing. In Advanced Research in VLSI and Parallel Systems: Proceedings of the 1992 Brown/MIT Conference, pages 333-347, March 1992. [4] A. Borodin and J. E. Hopcroft. Routing, merging and sorting on parallel models of computation. Journal of Computer and System Sciences, 30:130145, 1985. [5] Ming-Syan Chen and Kang G. Shin. Adaptive fault-tolerant routing in hypercube multicomputers. IEEE Trans, on Computers, 39(12):1406-1416, December 1990. [6]
Bill Coates, Al Davis, and Ken Stevens. The post office experience: Designing a large asynchronous chip. In Proceedings of the HICSS, 1993.
[7]
Robert Cypher and Luis Gravano. Adaptive, deadlock-free packet routing in torus networks with minimal storage. In Proc. Int. Conf. on Parallel Processing, pages 204-211, 1992.
[8] W. Dally. Wire-efficient VLSI multiprocessor communication networks, In Paul Losleben, editor, Proceedings of the Stanford Conference on Advanced Research m VLSI, pages 391-415. MIT Press, March 1987. [9] W. Dally and C. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans, on Computers, C-36(5):547-553, May 1987. [10] Chien Fang and Ted Szymanski. An analysis of deflection routing in multidimensional regular mesh networks. In Proceedings of IEEE INFOCOM '91, pages 859-868, April 1991. [11] C. Flaig. VLSI mesh routing systems. Master's thesis, California Institute of Technology, May 1987. [12] Melanie L. Fulgham and Lawrence Snyder. Performance of chaos and oblivious routers under non-uniform traffic. Technical Report CSE-93-0601, University of Washington, Seattle, WA, June 1993. [13] Christopher J. Glass and Lionel M. Ni. The turn model for adaptive routing. In Proc. Int. Symp. on Computer Architecture, 1992. [14] P. Kermani and L. Kleinrock. Virtual cut-through: A new computer communication switching technique. Computer Networks, 3:267-286, 1979.
86 [15] Smaragda Konstantinidou and Lawrence Snyder. The chaos router: A practical application of randomization in network routing. In Proc. Symp. on Parallel Algorithms and Architectures, pages 21-30, 1990. [16] D. H. Linder and J. C. Hardin. An adaptive and fault tolerant wormhole routing strategy for /:-ary n-cubes. IEEE Trans, on Computers, C-40(l):212, January 1991. [17] Neil McKenzie, Kevin Bolding, Carl Ebeling, and Lawrence Snyder. CRANIUM: An interface for message passing on adaptive packet routing networks. In Proc. Parallel Computer Routing and Communication Workshop, May 1994. [18] J. Y. Ngai and C. L. Seitz. A framework for adaptive routing in multicomputer networks. In Proc. Symp. on Parallel Algorithms and Architectures, pages 1-9, 1989. [19] Gustavo D. Pifarre, Luis Gravano, Sergio A. Felperin, and Jorge L. C Sanz. Fully-adaptive minimal deadlock-free packet routing in hypercubes, meshes and other networks. In Proc. Symp. on Parallel Algorithms and Architectures, pages 278-290, 1991. [20] Charles L. Seitz and Wen-King Su. A family of routing and communication chips based on the Mosaic. In Symp. on Integrated Systems: Proc. of the 1993 Washington Conf., pages 320-337, 1993. [21] B. J. Smith. Architecture and applications of the HEP multiprocessor computer system. In Proceedings of SPIE, pages 241-248, 1981.
SECTION 2.2 Real-Time Fault-Tolerant Communication in Distributed Computing Systemsi Kang G. Shin^ and Qin Zheng^ Abstract Distributed systems with point-to-point interconnection networks are natural candidates for real-time fault-tolerant communication because parallel processing and communication as well as fault-tolerance can be achieved using multiple processors and interconnection paths between every pair of nodes. However, due to the contention among randomly-arriving messages at each node/link and multi-hops between the source and destination that a message must travel, it is difficult to guarantee the timely delivery of the messages. The goal of this chapter is to remove this difficulty and simultaneously realize the potential of distributed systems for high performance and high reliability.
2.2.1
Introduction
In this chapter, we will first show how real-time communication in point-topoint packet-switching networks can be achieved by using the abstraction of real-time channel which makes 'soft' reservation of network resources to ensure the timely delivery of real-time messages. Like a dedicated circuit in a ^ The work reported in this paper was supported in part by the Office of Naval Research under Grant N00014-92-J-1080 and the National Science Foundation under Grants M I P 9012549 and MIP-9203895. Any opinions, findings, and recommendations expressed in this publication are those of the authors, and do not necessarily reflect the views of the funding agencies. •^ K. G. Shin is with the Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122. ''Q. Zheng is with Mitsubishi Electric Reseeirch Labs., Inc., Cambridge Research Center, Cambridge, MA 02139.
circuit-switching network, a real-time channel guarantees the end-to-end delay bound as long as the source node does not exceed the pre-specified limits of message length and generation rate. However, unlike a dedicated circuit, a real-time channel does not waste any transmission bandwidth. The links are free to transmit other packets whenever there are no real-time packets to be transmitted. Fault-tolerant communication is traditionally achieved through acknowledgment and retransmission schemes, where one gains the reliability at the cost of timeliness. Simultaneous realization of timeliness and fault-tolerant communication is very difficult and has rarely been addressed. In the second part of this chapter, we discuss how this problem can be solved by using real-time channels and exploring the spatial redundancy of a given network topology. Specifically, we will show how Single Failure Immune (SFI) real-time channels can be established which guarantee the timely delivery of messages as long as each message encounters no more than one link/node failure on its way to the destination node. We will also discuss how a class of more reliable real-time channels, called the Isolated Failure Immune (IFI) real-time channels, can be established in networks with the hexagonal mesh topology. For a real-time communication system that can tolerate rare short-period breakdowns, backup channels can be established; this is a less expensive way to increase the reliability of real-time channels. This method is also applicable to cases where a SFI circuit cannot be established due to the poor connectivity of the network.
2.2.2
Real-time communication
Real-time communication is a key to distributed real-time systems. Task deadlines cannot be guaranteed if inter-task communication delays are unpredictable. An easy way to implement real-time communication would be to use the circuit-switched transmission. Given the user's maximum traffic generation rate, one can use a dedicated circuit between two end systems with an adequate bandwidth to guarantee the timely delivery of all messages. If the bandwidth of a link is greater than that a single channel requires, several channels can be established through the link using either frequency-division multiplexing
89 (FDM) or time-division multiplexing (TDM) techniques, as is usually done for telecommunication systems. These multiplexing techniques, however, do not exploit the bursty nature of data traffic (thus resulting in inefficient use of link capacity) and the different requirements of traffic types (resulting in an inflexible allocation of link bandwidth), and thus are rarely used in computer networks. The packet-switched transmission, on the other hand, uses the link capacity more efficiently and flexibly by dynamically allocating link bandwidth according to the traffic demands. However, due to the queueing delays at transmission links (e.g., when more than one packet need to be transmitted over a link, only one can be transmitted immediately and all the others have to wait), message delivery times are not predictable. To counter this problem, we use a new transfer mode, called the real-time channel, to support real-time communication in packet-switched networks [1, 2, 3]. Real-time channels use two phases to guarantee end-to-end message delay bound: admission control of channels establishment and deadline scheduling of packet transmissions. Like the circuit-switched transmission, admission control requires each task requesting real-time communication services to establish a connection (i.e., a real-time channel) before starting packet transmissions. Non real-time messages are transmitted in an ordinary way without needing a channel establishment phase. A real-time channel establishment request may be accepted or rejected, depending on the current network-load condition. Admission control is necessary because message delay bounds cannot be guaranteed without controlling the traffic load of the network. If a channel establishment request is rejected, the requesting process has the following three options:
1. Wait for a period of time and repeat the same request. Since real-time channels are established or removed from time to time, a channel which cannot be established at one time may be establishable at some other time.
90 This is usually used for critical real-time applications like in a distributed control/manufacturing system where an application should not be started until adequate communication channels are established. Reduce the quality of service requirement, e.g., increase the requested end-to-end delay bound and try to establish the channel again. This is sometimes useful for non critical real-time applications like interactive voice/video transmissions. When a channel-establishment request is not granted, the user may be willing to accept a connection with a lower quality than the previously requested. Transmit messages as non real-time ones without establishing a real-time channel. In this c£ise, there will be no guarantee on the message delivery delay bound. This method is suitable for randomly-generated urgent messages for which transmission without any delivery-time guarantee is better than holding them at the source node until a delivery-time guarantee can be made.
Each message is divided into packets. Packet transmissions are scheduled as follows. Real-time packets (i.e., packets which require explicit delivery-delay guarantees) are given higher transmission priority than non real-time packets. Each real-time packet is assigned a deadline over each link it traverses which is calculated from the requested end-to-end delay bound and the generation time of the packet. When several real-time packets contend for use of the same link, the packet with the earliest deadline is transmitted first. There are two advantages of using the deadline scheduling.
•
Minimal effects of queueing delays: For real-time applications, we need to control the maximum packet delay. The queueing delays at transmission links have significant effects on the maximum packet delays, even when the traffic load is not heavy. The deadline scheduling policy can minimize the effects of queueing delays in the sense that given a set of packets with deadlines, if they are schedulable under any scheduling policy (i.e., every packet can be transmitted before its deadline), so can they under the deadline scheduling policy [4]. Thus, the deadline scheduling policy gives
91 the communication network more capacity of accommodating real-time channels than any other scheduling policies. In other words, the deadline scheduling minimizes the probability of rejecting channel establishment requests. •
Channel protection: When establishing a new real-time channel, the worstcase queueing delay is calculated for each link under the assumption t h a t all existing real-time channels over this link will not exceed their a priori specified m a x i m u m message generation rate and m a x i m u m message length. In practice, however, it is possible that some channels exceed their pre-specified limits. By properly assigning deadlines to the packets over each transmission link they traverse, the deadline scheduling policy can ensure t h a t those channels exceeding their pre-specified limits will not affect the timely delivery of the other channels' packets. Other scheduling policies, like the F i r s t - I n - F i r s t - O u t (FIFO), Exhaustive Round Robin or Priority Scheduling, do not possess this property.
In summary, the real-time channels established in a packet-switched network guarantee the timely delivery of messages just like dedicated circuits in a circuitswitched network.
Each real-time channel guarantees the e n d - t o - e n d delay
bound as long as the source node does not exceed the pre-specified limits of message length and generation rate. However, unlike a dedicated circuit in a circuit-switched network, a real-time channel does not reserve any transmission bandwidth. The links are free to transmit other packets whenever there are no real-time packets to be transmitted, thus allowing efficient use of network bandwidth. Details of the real-time channel approach is described below. Channel establishment/removal: Channel establishment can be done in a centralized [5] or decentralized manner [1]. We discuss a centralized version here which is easier to manage and implement than the decentralized one, though it may become a performance or reliability bottleneck. To establish a real-time channel, the requesting node must first determine three channel parameters, (T,M,D),
where T is the minimum message inter-
92 generation time, M is the maximum message size, and D is the requested end-to-end message delay bound. The three parameter model is a natural way to describe the traffic characteristics for most real-time applications. In case of motion video transmissions, each video frame is a message, and T = 1/30 second if the frames are transmitted at a rate of 30 frames/second, M is the maximum frame size which depends on the physical size, pixel resolution of the frames and the compression algorithm used, and D — T if one requires that each frame reach the destination before generating the next frame. The source node then sends a message of requesting a channel to be established that contains the channel parameters together with the addresses of the source and destination nodes to a special node containing the Network Manager (NM), which manages all the real-time channels in the network. After receiving a channel-establishment request, the NM first selects a route from the source to the destination over which the real-time channel is to be established. The minimum hop routing is usually preferred since it is easier to guarantee an endto-end delay bound over a shorter route than a longer one, and it requires the least amount of network resources. The NM then executes a real-time channel establishment algorithm (to be presented later in this section) to determine whether or not the requested channel can be established over the selected route. Then, the NM sends a reply message back to the source node notifying the acceptance or denial of the channel request. If the channel request is accepted, the reply message also contains the route information and the link delay bound di over each link on the route which will be used during the run-time message transmission. After all messages have been transmitted, the source node sends a channel removal message to the NM which deletes the channel from its channel table. A message is also sent to the destination node informing it of the end of transmission. Runtime message transmission: The source node can start message transmission any time after receiving a positive reply from the NM. A message must be
93 packetized first. In other words, a large message needs to be broken into packets of the size fitting the network specification, and a header is added to each packet. The packet header contains the information for routing (an outgoing link over which the packet is to be forwarded) and transmission scheduling (the link delay bound) for each intermediate node. Recall t h a t the deadline scheduling policy is used for transmitting real-time channels' packets. Each packet is assigned a deadline, and when several packets contend for the same transmission link, the one with the earliest deadline is t r a n s m i t t e d first. T h e deadline of a packet is the time by which the last bit of the message to which the packet belongs must be transmitted. We define the arrival time of a message to be the time when the last bit of the message's first packet arrives. In a store-and-forward packet-switched network, this is the time the first packet of the message becomes eligible for transmission. A straightforward way of deadline assignment would be to set the deadlines of a message's packets over link t.i to be the message's arrival time plus the link delay bound c?,. However, this simple method causes problems. As will be clear later, in verifying whether a requested real-time channel can be established or not, the worstcase e n d - t o - e n d delay is calculated from the worst-case link delays, which are in turn calculated based on the assumption that message inter-arrival times will not be smaller than T and message sizes will not be larger than M.
This
assumption is not always true because; (1) the source node may generate messages faster and larger than the specified values T and A/, and (2) even if the source node does not violate its traffic specification, the message inter-arrival times at an intermediate link can still be smaller than T due to the uneven queueing delays at upstream links. To solve this problem, we use logical arrival times to calculate the message deadlines. If a message arrives (or is generated) too early, its logical arrival time is set to be later than its actual arrival time as if it had arrived on time. If a message of too large size is generated, part of the message will be treated as p a r t of the next message. T h e message's deadline is then calculated cis its logical arrival time plus di, and packet transmissions are scheduled according to
94 their deadlines (which are based on logical arrival times). Usage of the logical arrival times has the following advantages.
1. Messages generated faster or larger than the specified values by a malicious or faulty channel will not affect the timely transmission of other channels' messages since these violating messages will be assigned deadlines as if they were generated as specified. W i t h the deadlines calculated from the logical arrival times, the deadline scheduling of packet transmissions effectively provides guaranteed protection between channels. This kind of channel protection makes a real-time channel behave like a dedicated circuit without making "hard" reservation of network resources, which is unachievable with other policies like FIFO, Round-Robin, or priority scheduling. 2. T h e variation of packet inter-arrival times at an intermediate node due to uneven queueing delays at upstream links is automatically taken care of by logical inter-arrival times. Thus, we can always use the minimum message inter-arrival time to calculate the worst-case delay. Also, using logical inter-arrival times will not cause unnecessary message delays since early messages are eligible for transmission as long as there are no other messages with earlier deadlines.
W i t h the introduction of logical arrival times, we can now present the message packetization and transmission processes. In addition to the message type, source and destination addresses, and packet sequence number, the header of a real-time packet contains the following fields exclusively for real-time channels;
transmission link IDs £i and link delay
bounds di, i — 1, • • •, A:, where k is the number of links of the path over which the channel runs, logical arrival time t\, and deadline td. The source node uses variables tp, tc to record the times the previous and current messages of the channel are generated, and rrtr, rUa to record the size (in bits) of the remaining part of the message (i.e., the part not yet packetized) and the accumulated message size (to detect oversized messages), respectively.
95 Recall t h a t the channel parameters are (T, M, D), and let P be the m a x i m u m packet size of the network. Initialize tp := —oo, in-r := 0, iria '•= 0. Suppose a message of m bits is generated at time t, then the source node does the following.
S t e p 1: Set the current message arrival time tc := t, and the remaining message size ?7)r :— m. S t e p 2: Assemble a packet of size p = min{P, m ^ } . Fill in the packet header fields, £i's and di's, with the values obtained from the NM. Update the accumulated message size rria := rria + p. ti := ma.x{tc,tp
+ T} + [rna/{M
Set the logical arrival time
+ 1)JT and the deadline t^i :- ti + di.
S t e p 3 : Forward the assembled packet to the transmission link identified by the link £i. Update the remaining message size nir := mr — p. If m^ > 0, goto Step 2 to assemble the next packet. max{ma —M, 0} and tp := max{tc, tp+T)
Otherwise, update ma :=
(message packetization is done).
T h e packets forwarded to an outgoing link are scheduled according to their deadlines. The readers are referred to [6, 7] for an example implementation of this scheduling process.
If the clocks of the nodes are not synchronized,
we assume t h a t the transmitter will put a timestamp tt at the header of each packet when it is transmitted. At an intermediate node, say node i with the outgoing link ^,, a packet arrived at time t is processed as follows.
S t e p 1': Calculate the clock skew between node i — I and node i: tg = t — ttSince the e n d - t o - e n d propagation delay can be pre-subtracted from the user requested e n d - t o - e n d delay bound, ts also removes the link propagation delay. We ignore the timestamping time (since hardware can be used for this). Suppose link ti has a transmission bandwidth Ri, then u p d a t e the fields of the logical arrival time and the deadline in the packet header as ti := ti +ts + di-i - max{0, {M - P)/Ri},
td = ti + di.
96 (i
"l-l
node i-1 time
'l
-(M-PVR.,-*
1 Figure 2.2.1
L_j 1
node i
Calculation of a message's logical arrival time at node i.
S t e p 2 ' : Remove ^i_i and rfi_i from the packet header and forward the packet over the link £;.
T h e calculation of the logical arrival time ti in Step I' can be explained with Fig. 2.2.1 which shows the logical arrival times of a message at node i — 1 and node i. A message's link delay is defined as the time between the logical arrival lime of the message and the time when the last bit of the message is transmitted. Then, a message's logical arrival time at node i equals the message's logical arrival time at node i — 1 plus the worst-case link delay rf,-i minus the time needed to transmit M ~ P bits (i.e., the m a x i m u m message minus the first packet). One important fact is that if all channels are established through the NM, which ensures the messages' actual link delays not to exceed the worstcase link delays dfs, then the actual message arrival times at a node will never be later than the corresponding logical arrival times. If this is violated, then something must have gone wrong: either the NM is not functioning properly or a source node is sending real-time messages without getting an approval from the NM. Channel establishment algorithm: As discussed earlier, the Network Manager (NM) needs an algorithm to (i) check if a requested channel can be established over a given route, and (ii) calculate the delay bound of each link in the route if the channel can be established. We first consider the message delay over a single link. For the convenience of presentation, we re])lace the channel parameter A/ (i.e., the m a x i m u m message
97 size), with a parameter C which equals the transmission time of a maximumsize message, e.g., C = M/R if the link has a transmission rate R. Also, let Cp denote the time needed to transmit a packet. A set of real-time channels r, = (T,, C,, d,), i = 1, 2, • • •, n is said to be schedulablt over a link if for all 1 < i < n, the maximum link delay experienced by channel r^'s messages is not greater than d,-. Then, we have the following schedulability condition.
Theorem 2.2.1 {Channel SchedulabiUty Condition) . In the presence of non real-time packets, a set of real-time channels Ti = {Ti,Ci,di), i — 1,2, •••,71, are schedulable over a link under the deadline scheduling policy if and only if
^ [ ( < - dj)/Tj] + Cj -\-CpKt,
V < > d,mn
where d^riin = minfd, : 1 < J < n}.
Readers are referred to [2, 3] for the proof of the theorem. For a set of channels r,- = [Ti, Ci, di), j = 1, 2, •• •, n, define imax = maxjdi, • • •, dn, (^"_j (1 — di/Ti)Ci)/{\ — X^"_i Ci/Ti)]. Then, it is easy to prove that one only needs to check the inequality of Theorem 2.2.1 for a finite set 5 — UjL^ld.-l-nTi ; n = 0, 1, • • •, \(tmax — d,)/TiJ}. Then, the worst-case message link delay can be calculated as stated in the next theorem.
T h e o r e m 2.2.2 (Worst-Case Link Delay) . Let f{t, dn) = 'Y^l-x \{t — dj)/Tj'\'^Cj -f- Cp and S be the set defined above with dn = Cn + Cp. Then, dn = Cn -\- Cp is the worst-case delay if f{t,Cn) < t, \/( € Si. Otherwise, the worst-case delay is dn = maxjrf' :t £ G], where G = Sr\{t : f{t, Cn) > t} and d' is computed as d* = C„ + H r „ +c\ -{-e], with
98 k) = [{f{t,Cn)-t)/C„\,
i)=f{t,Cn)-t~k)C„,
t\^t-Cn--k\Tn,
k\
L(«-c„)/r„j.
The proof of Theorem 2.2.2 can be found in [2, 3] and thus omitted. Now we calculate the end-to-end message delays from link delays. The end-to-end message delivery delay is defined as the time period between the generation of the message at the source node and the arrival of the last bit of the message at the destination node. By the definition of link delay, the end-to-end delay no longer equals the summation of link delays. The following theorem gives a formula for the worst-case end-to-end delay from link delays.
Theorem 2.2.3 (Worst-Case E n d - t o - e n d Delay) . Suppose a real-time channel with a maximum message transmission time C runs over n links which guarantee the worst-case link delays di, • • • ,d„, respectively. Then, ignoring the propagation delay, the worst-case end-to^end message delivery delay can be calculated as
D=zY^di-{n-
l)max{0, ( C - C p ) } ,
1= 1
where Cp is the packet transmission time.
The proof of the theorem can be found in [8, 3]. In summary, we have the following algorithm for the establishment of a realtime channel in a general packet-switched network. Algorithm 2.2.1 (Real-time channel establishment) . Step 1. Suppose a new channel is to run over the route from the source to destination that contains k links t\,- • • ,(.jc. Using Theorem 2.2.2, calculate the worst-case message link delay d^^^^ over link £j, j = 1, •• • ,k.
99 Step
2. Calculate the worst-case end-to-end
message delay Dmax using
Theo-
rem 2.2.3. If Dmax < D, the requested real-time channel can he established. Assign the delay bound of ij to be d' = d{^g^.j.-\-{D — Dmax)/k. the channel establishment
2.2.3
Otherwise,
request is rejected.
Fault-Tolerant Real-Time Channels
T h e real-time channels discussed so far are not fault-tolerant. All messages of a channel are transmitted along a static path, so a single component failure could disable the entire channel. A natural way to increase the reliability of a real-time channel is to expand the channel with some extra links and nodes such t h a t packets can be re-routed around faulty components on the original channel. One extreme is to use all links and nodes in the network such t h a t packets can be successfully routed in a timely manner as long as the network remains connected.
However, this method is very expensive. T h e high cost
comes not only from the large number of links/nodes involved, but also from the delay bound requirements over the links. For a real-time channel which remains operational as long as the network is connected, the number of hops a packet may traverse in the worst case is very large. The large number of hops from the source to destination node requires each link on the path to provide a very small delay bound so that the e n d - t o - e n d delay bound can be guaranteed. Requesting real-time channels with very small delay bounds over a link are very likely to be rejected. By making a tradeoff between reliability and cost, we first establish Failure Immune
Single
(SFI) real-time channels which guarantee the timely delivery
of packets as long as each packet encounters no more than one link/node failure on its way to the destination node. Then, we discuss how a class of more reliable real-time channels, called the Isolated Failure Immune (IFI) real-time channels, can be established in networks with the hexagonal mesh topology. For a realtime communication system t h a t can tolerate rare short-period breakdowns, we discuss how backup channels can be established: this is a less expensive way to increase the reliability of real-time channels. This method is also applicable
100 to Ccises where a SFI circuit cannot be established due to the poor connectivity of the network.
Single Failure Immune Real-time
Channels
For convenience of presentation, we introduce some definitions first. A communication network is modeled as a directed graph N — {V, J?}, where V is a set of nodes and £" is a set of directed links. A hasic circuit from node VQ to node Vk in a network is defined as a sequence Cb — VQe\V\e'i. . .ekVt, where Vi's are nodes and e; =Vi^iVi is a directed link from «;;_! to ;;;.
Definition 2.2.1 (Single Failure Immune (SFI) circuit) . A SFI circuit, denoted by C,, from node VQ to node Vk is defined as a basic circuit Cb = voeiVie2 • • -ei-Vk augmented with some extra nodes and links, which are catted the detour ofCb, such that when node Vi (I < i < k) or link e, (1 < i < k) IS removed from Cb, there exists a basic circuit from ii,_i to v^ m the remaining d.
With the above definition, when a packet arrives at Di_i and finds that either I'i or Cj is faulty, it can always be re-routed over a detour link leading to the destination node. Thus a SFI circuit guarantees that a packet will always be delivered to the destination node as long cis no more than one link or node (except the source and destination nodes) is faulty. Figure 2.2.2 shows two SFI circuits in a mesh network, where the solid arrows represent the basic circuits and the dashed arrows represent the detour. To establish a SFI real-time channel, the first step is to find a SFI circuit on which the real-time channel is to be established. A straightforward way to find a SFI circuit from VQ to Vk is given as follows. First, select a basic circuit Cb — VQCX • • •, I'jt. Then, for i — 1, • • •/:, remove Vi (e; if i — k) from Cb and establish a basic circuit from ii,_i to Vk in the remaining network. Clearly, the union of all the basic circuits forms a SFI circuit from VQ to Vk- Failure of this algorithm means that no SFI circuits exist with the selected Cj.
101 RtSTlNATION NODt 1
DHSTINATION NODE 2
o
SOURCli NODB 2
F i g u r e 2.2.2
Two optimal SFI circuits in a mesh networli.
If a SFI circuit is to be used for the establishment of a SFI real-time channel, some extra features are desirable. An established real-time channel over a link affects the link's ability to accommodate future channels, so it is desirable to minimize this negative influence on future channels to be established. There are two ways to measure this influence of a real-time channel. First, the more links a real-time channel traverses, the more pronounced the influence will become. This is because a larger number of links are involved in establishing a long channel than a short channel. Secondly, over a single link, the smaller the link delay bound the more the influence is. Thus, to reduce a real-time channel's influence on the network's ability to establish future channels, one should run it through as few links as possible, and make the requested packet delivery delay over each link as large as possible. Note that the second objective is consistent with the first, because the more links a real-time channel traverses, the smaller the requested packet delivery delay per link would become. Hence, the minimum-hop routes are best suited for real-time channels. Another advantage of minimum-hop routing for real-time channels is the reduction of real-time packets' influence over non real-time packets. If each real-time
102 packet traverses through a minimum number of links, the total real-time traffic in the network would be minimized. Since transmission priority is usually given to real-time packets over non real-time packets, minimizing real-time traffic effectively minimizes its influence on non real-time packets in the network. For SFI real-time channels, the influence of the existing real-time channels on future channel establishment also includes that over detour paths. T h u s , one should establish a SFI real-time channel over a SFI circuit which needs a minimum number of extra links. In summary, we have the following two goals in selecting a SFI circuit for the establishment of a SFI real-time channel;
G l : T h e basic circuit is a m i n i m u m - h o p route from the source to the destination node. In the case of link/node breakdown, the detour should also be the m i n i m u m - h o p routes in the remaining network. G 2 : Under the constraint of G l , the total number of links on the detour of a SP'I circuit should be as small as possible.
A SFI circuit is said to be optimal if it achieves the above two goals. It is not difficult to find an optimal SFI circuit in some widely-used regular networks like meshes and hypercubes. Figure 2.2.2 gives an example of two optimal SFI circuits in a mesh network. For an arbitrary-topology network, however, the optimal SFI circuits are not always readily obtainable. The difficulty comes from the existence of multiple m i n i m u m - h o p basic circuits between two nodes in a network. Different choices of the basic circuit and detour could result in different numbers of extra links needed for a SFI circuit. We propose a heuristic algorithm for finding a SFI circuit as follows.
A l g o r i t h m 2 . 2 . 2 ( C o n s t r u c t i o n of S F I c i r c u i t s )
103 Step 1. Set up a minimum-hop basic circuit Cj = woeit'i •••e^Vk from the source node VQ to the destination node vjc. Step 2. Initialize the set of extra nodes and links C := 0. For i = 1,. .., i:, do the following. Step 2.1. Remove from the original network node i', if i < k. and a link Ck if 2 = k. Step 2.2. Establish a minimum-hop basic circuit d from t;i_i to Vk in the remaining network. At any node, if there are two directions both leading to a minimum-hop circuit, the one to a node which is closer to Ci UC is selected. Break a tie using the following rules: (1) choose the one which does not introduce a new link, (2) choose the one which is closer to Cj, and (3) break the tie arbitrarily. If a basic circuit from VQ to I'lc does not exist, go to Step 3. Step 2.3. Suppose C,- intersects node Vj in C. If there is node Vn in Cj such that (1) there is a link VjVn in C, (2) the basic circuit from Vj to Vie in d does not contain any node vi, I < n, and (3) the number of hops from Vn to t'^ in Cj plus 1 is not smaller than the number of hops from Vj to Vk in Ci, then remove VjVn from C. Update C:=CU(C,/Cj). Step 3. If the algorithm fails at Step 2, there does not exist a SFI circuit from VQ to Vk- Otherwise, CU Cb is a SFI circuit connecting VQ and v^ with Cj as its basic circuit and C as the set of extra links and nodes.
The heuristic used in the above algorithm is that a detour route C, should be as close to the existing routes Ci U C as possible. In this way, d will most likely intersect Ct,UC, thus reducing the number of links needed. The purpose of Step 2.3 is to remove any redundant links and nodes. As an example, we show how a SFI circuit from SOURCE NODE 1 to DESTINATION NODE 1 in Figure 2.2.2 can be established using Algorithm 2.2.2. The relevant nodes and links and their labels are shown in Figure 2.2.3. The sequences of the SFI circuit establishment steps are shown below.
104 DESTINATION NODE I
®^-- —
Jk
i-"-^ (o>-.A.|^ SOURCE NODE 1
F i g u r e 2.2.3
A n e x a m p l e of u s i n g A l g o r i t h m 3.
Step 1. Set up a minimum-hop basic circuit Cj = fof ii'ie2t'2e3f3Step 2. Initialize the set of extra nodes and links C ;= 0. i = 1. Step 2.1 Remove node vi from the original network. Step 2.2 Establish a minimum-hop basic circuit from t'o to t'3 in the remaining network: Ci = i'oe4f4e5i'5e6''76ii'^'2'°3i'3- At node ug, both en and e? lead to a minimum-hop circuit, but e'u is chosen since it leads to a node closer to Ci,. Step 2.3 No nodes and links can be removed from C since C = 0. Update C :=CU{C\/Ct,} = {e4,e5,e6,eii,i'4,r'5,t^6}Step 2. i - 2. Step 2.1 Remove node V2 from the original network. Step 2.2 Establish a minimum-hop basic circuit from v^ to vj, in the remaining network: Cg = ''ie9t'5e6t'6e7t'7e8i'3-
105 S t e p 2.3 C-2 intersects node VQ in C and there is node v-^ in Cf, such t h a t (1) e n directs from ug to D2, (2) the basic circuit from VQ to vz in C2 does not contain any node Vi, / < 2, and (3) the number of hops from V2 to ti3 in Cj plus 1 equals the the number of hops from v^ to vz in €2- Thus, remove e n from C2. Update C := C U {C^/Ci}
—
{64, 65, £6, e?, eg, eg, t^4, t's, ^6, ^iv}-
Step 2. i = 3. S t e p 2.1 Remove link 63 from the original network. S t e p 2.2 Establish a minimum-hop basic circuit from V2 to v^ in the remaining network: C3 = V2t\QVQejv-jesV3. S t e p 2 . 3 Update C := CUJCs/Ci,} = {e4, eg, ee, ey, eg, eg, eio, 1)4,1^5, ^6, ^7}S t e p 3 Then, C U Cj is a SFI circuit from VQ to 1)3 with Cb as its basic circuit and C as the set of extra links and nodes.
Comparing with Fig. 2.2.2, we see t h a t an optimal SFI circuit is constructed using Algorithm 2.2.2. Although this is not always true, the algorithm performs relatively well for most networks.
We now develop algorithms for the establishment of a SFI real-time channel over a SFI circuit. Let Cs = C U C4 be a SFI circuit on which a SFI real-time channel is to be established, where Cj = foeifi • • -Sk^k is the basic circuit from the source node VQ to the destination node Vk and C is the set of extra links and nodes. Suppose there are m links in C,. Label the links in C as ejt+i, • • •, e ^ . Denote Co = Cb- For i = l,2,---,k,
let C, be the m i n i m u m - h o p circuit from
VQ to Vk in Cs when link Vi (ejt if i — k) is removed. Define a (A; -|- I) x m circuit-link matrix M — {rnij)kxm
as follows:
J 1
if Ci contams e,-4.1
i = 0, • • •, ^,
I 0
otherwise
j = 0, • • •, m — 1
Suppose link e,- guarantees a delay bound di, and let Cp denote the time needed to transmit a maximum-size packet. Then, from Theorem 2.2.3, a SFI real-
106 time channel can be established with an end-to-end delay bound D if and only if the following delay inequality is satisfied:
M
("^^ ]<
i ^' \
^ dm j
\ Dm )
(2.2.1)
where
( ^' \
\ Dm j
1 ( ^ ] ( ^ ]\ / M max{0, ( C - C p ) } + \ \ [ 1J [ w /
^ \ (2,2.2)
D /
Then, there are two ways to establish a SFI real-time channel over a SFI circuit.
Algorithm 2.2.3 (Establishment of SFI Real time Channels (1)) .
Step 1. For i = 1, 2, • • •, m, assign a link delay bound d, over e^, such that the inequality (2.2.1) is satisfied. Step 2. Using Theorem 2.2.1, check the schedulability of the channel over each link. If all the checks are positive, the requested real-time channel can be established with the Jissigned link delays (f,'s. Otherwise, the channel establishment request is rejected.
Algorithm 2.2.4 (Establishment of R e a l - t i m e Channels (2))
Step 1. Using Theorem 2.2.2, calculate the minimum packet delay bound dmin i over link e,-, i = I, • •• ,k.
107 S t e p 2 . Set cf, •.= dmin,i, i = 1, • • •, ?n. If the inequality (2.2.1) is satisfied, the requested real-time channel can be established. Assign the link delay over ej to be di =: dmin,i + ^i, where di's satisfy
i h \
I ^^ \ <
M \6m
/
dmin.l (2.2.3)
M
\^rn J
y ^min,m J
Otherwise, the channel establishment request is rejected.
Since the inequality (2.2.3) can not determine a unique solution, we need to give a rule to choose one of the solutions. T h e reason for increasing the delay bound over link e, from dmin.i to dmin,% + Si in Algorithm 2.2.4 is to reduce the channel's influence on the link's ability to establish more real-time channels. T h e value of ^j- represents the degree of the influence reduced.
Thus, with
respect to a single link e,, one should set 6i as large as possible. However, since the maximizing of 6i must be done under the constraint of inequality (2.2.3), increase of Si may cause the decrease of 6j of another link Cj. To make the whole network evenly loaded, we use the following max-min rule to choose a solution from the inequality (2.2.3).
M a x - m i n R u l e : Among all solutions satisfying the inequality (2.2.3), choose the one whose smallest element, i.e., mini
T h e m a x - m i n rule can be easily implemented with the following algorithm.
A l g o r i t h m 2.2.5 ( D i s t r i b u t i o n o f Link D e l a y s ) .
S t e p 1. Initialize the set of variables to be determined as S — {6i, • • • ,6rn}'
108 Step 2. For all 6, £ S, replace (5, with a single variable 6 in inequality (2.2.3). Calculate the maximum value of 5 satisfying inequality (2.2.3). Notice that Eq. (2.2.3) contains 771 inequalities, all elements of M are either 0 or 1, thus the maximum value of 6 makes at least one inequality become an equality. Remove all 6i"s from S which are contained in the equality and set them to be the obtained maximum value of 6. Step 3. If 5 = 0, stop. Otherwise, goto Step 2. As an example of using Algorithms 2.2.4 and 2.2.5, we show how a SFI real-time channel can be established over the SFI circuit from source node 1 to destination node 1 in Fig. 2.2.2. Suppose this real-time channel is to be established in an otherwise idle network and has the minimum message inter-arrival time T = 100, maximum message transmission time C = 5, and requested end-toend message delivery delay bound D — 60. Suppose Cp = C. The labels of the links on the SFI circuit are shown in Fig. 2.2.2. Then, the circuit-link matrix is / 1 1 1 0 0 0 0 0 0 0 \ 0 0 0 1 1 1 1 1 0 0 M = 1 0 0 0 0 1 1 1 1 0 Vi 1 0 0 0 0 1 1 0 1 /
(2.2.4)
From Theorem 2.2.2, we have dmin,i = 5 for 1 < i < 10. Thus, the right-hand side of inequality (2.2.3) equals (45 35 35 35)-^. Using Algorithm 2.2.5, we first set 5 := {^1, •-•, (5io}- Replacing all elements in S with a single variable 6, inequality (2.2.3) becomes / 45 \ 3 \ 35 5 <5< 35 5 \ 5 / \ 35 /
(2.2.5)
The maximum value d is then 7 and with which the second, third and fourth inequalities of Eq. (2.2.3) become equalities which contain all i5,'s except fi^.
109 Thus, at the next iteration of Step 2 in Algorithm 2.2.5, S = {^3}. Replacing 63 with 5 and set all other variables in Eq. (2.2.3) to be 7, it becomes 14+6 < 45. The maximum value of i5 is 31. Thus, the solution from algorithm 2.2.5 is (i5i, ^2,^311^4,1^5, ^6,67, ^g.f^g, 610) — (7,7,31,7,7,7,7,7,7,7). Using Algorithm 2.2.4, the SFI real-time channel is established with link delay bounds {di, 6.2, da, d^, ds,rfg,d-, dg, d^, dio) = (12,12,36,12,12,12,12,12,12,12).
Isolated Failure Immune Real-Time
Channels
Making a real-time channel more robust than just tolerating a single failure is usually very difficult and requires reservation of significantly more network resources. We discuss in this subsection how the problem can be solved by choosing a proper network topology. If a network has a wrapped hexagonal mesh topology [9], one can readily establish Isolated Failure Immune (IFI) real-time channels. An IFI real-time channel guarantees the timely delivery of messages in the presence of network component failures as long as the failures are isolated with respect to the channel. Node failures are said to be isolated with respect to a real-time channel if the source and destination nodes of the channel are not faulty and any two faulty nodes in the channel are not adjacent. Link failures (a link failure is caused by either the failure of the link itself or the failure of the node which the link leads to) are said to be isolated if any two faulty links are not originated from a same non faulty node or directed to the destination node. Figure 2.2.4 shows four types of non-isolated failures; (a) two faulty nodes which are adjacent, (b) two faulty links which originate from the same node, (c) same as (b) except that one link is made unusable (thus regarded faulty) by the failure of another node, and (d) two incoming faulty links of the destination node. Another two types of non-isolated failures are the failures of the source and destination nodes, respectively. Figure 2.2.5 shows an example of an IFI channel from node 1 to node 6 and one pattern of tolerable isolated failures.
110
(a)
(b)
(0
(d)
desiination
Figure 2.2.4
Four types of non-isolated component failures.
destination
source Figure 2.2.5
An IFI channel and one pattern of tolerable link/node failures.
Ill T h e isolated failure immune communication problem for undirected
neiworks
was first discussed in [10] where the authors proved that a 2-tree'^ is a minimum IFI network.
In other words, any IFI network must contain a spanning 2-
tree. This result excludes almost all commonly-used network topologies (e.g., rings with more than 3 nodes, rectangular meshes, and hypercubes) from the candidate set of the IFI networks, except for the hexagonal mesh. An IFI real-time channel has the following advantages over a basic real-time channel:
H i g h R e l i a b i l i t y : The channel can tolerate a large number of component failures as long as they are isolated. For example, the IFI channel shown in Figure 2.2.5 can tolerate as many as 7 faulty links and 2 faulty nodes, which represent 70% of the links and 33% of the nodes that the channel runs through. E a s y F a i l u r e D e t e c t i o n : Non-isolated failures in the network can be easily detected using only local information, i.e., the status of a node's own links and its neighbor nodes. This makes the system maintenance extremely easy. A node can safely shut down one of its links or itself by checking the status of its links and the neighboring nodes. T r a n s m i s s i o n of E m e r g e n c y M e s s a g e s : Notice that a path between any pair of nodes in a network can always be constructed using only those links whose failure will not cause non isolated failures.
So, in the ab-
sence of network component failures, an emergency message can always be transmitted from a source node to a destination node using the full link bandwidth on its path without interrupting existing real-time channels.
At the R e a l - T i m e Computing Laboratory, the University of Michigan, an experimental distributed real-time fault-tolerant systems called HARTS [11] is currently being built. HARTS has a wrapped hexagonal
mesh interconnection
network as shown in Figure 2.2.6 which can be defined as follows. ^A 2-tree can be constructed as follows. Two nodes connected by a link is a 2-tree. A new node can be added to a 2-tree by connecting it to two neighboring nodes in the 2-tree.
112
U
F i g u r e 2.2.6
15
15
16
16
n
A wrapped hexagonal mesh of size 3.
Definition 2.2.2 Let [a]i denote a mod b. Then a wrapped hexagonal mesh of Size n (or the number of nodes on each peripheral edge) is composed of N = 3n(n— 1)+ 1 nodes, labeled from Q to N — i, such that each node s has six neighbors [s + 1]A', [s + 3Ti(n — l)];v, [« + 3 n — 2];v, [« +Sn"^ — 6n + 3]jv, [s + 3n^— Qn + 2]N, and [s + ^n— 1]^, m the X,—X,Y,—Y,Z,—Z directions, respectively.
One important result obtained from the HARTS project is the routing algorithm in a wrapped hexagonal mesh network. It was proved in [9] that a wrapped hexagonal mesh is homogeneous. Consequently, any node can view itself as the center of the mesh. Let rux^riiy, and m. be, respectively, the number of hops (negative values mean the moves in negative directions) from the source node to the destination node along the X, Y, and Z directions on a shortest path. The following routing algorithm [9] determines the values of vrix, my, and rn^ for the shortest paths from a source node s to a destination node d m a wrapped hexagonal mesh of size n:
113 A l g o r i t h m 2.2.6 ( R o u t i n g in H A R T S ) .
S t e p 0. Set m^. := 0, my :— 0, m^ := 0. Let p — 3n^ — 3n — 1, k = (d — s) mod p, r = (k — n) div {3n — 2), t = [k — n) mod (3n — 2). S t e p 1. U k < n then set rrij; := fc, stop. Else if ^ > 3n^ — 4n + 1 then set rrix — k — 3n^ + 3n — 1, stop. Else goto Step 2. S t e p 2. If < < n + r - 1 then •
If t < r then set m^ := t — r, m^ := n — r — 1, stop.
•
If t > n — 1 then set m^ := t — n + 1, rriy := r + 1 — n, stop.
•
Else set my ::= r — t, m^ :— n — t — 1, stop.
else •
If ^ < 2n — 2 then set nix '•= t + 2 — 2n, my := r + 1, stop.
•
I f < > 2 n + r — 1 then set rUj; := t — 2n — r + I, m^ := —r— 1, stop.
•
Else set my := 2n + r — t — 1, m^ := 2n — t — 2, stop.
We now discuss how real-time channels can be enhanced to be IFI in H A R T S . T h e first step is to find an IFI path, which is defined as a subnetwork containing a directed path from the source to the destination in the presence of any isolated failures. Let ds{vi, ^2) denote the minimum number of hops (i.e., distance) from node Vi to node V2 in a network S.
The following theorem gives a sufficient
condition for S to be an IFI path from a source node v, to a destination node
T h e o r e m 2 . 2 . 4 A network S containing
the source node v, and the
destina-
tion node Vd is an IFI path from v, to Vd if
CI
Every node v £ S,v ^ Vd, has at least two outgoing links to two other nodes, say v\ and V2, such that ds{v\, Vd) < ds{v, dd), ds{v2, Vd) < ds(v, Vd), and I'l, i'2 are
adjacent,
114 C2
There is no loop in S whose nodes are all of the same distance d > \ to the destination
node v^.
Proof: From C I , every node v E S except the destination node lias two outgoing links i\ and ^'2 which lead to a pair of adjacent nodes v\ and V2, respectively. Then, a packet will be blocked at node v only if (1) both £1 and (^2 are disabled, or (2) both vi and V2 are disabled, or (3) i\ and V2 are disabled, or (4) ^2 or !;i are disabled. All these situations represent non-isolated failures. Thus, in the absence of non-isolated failures, a packet from the source node can always progress unless it has reached the destination. Further, CI ensures a packet will not move away from the destination and C2 ensures that a packet will not move around forever without reaching the destination node or circling in a loop in which each node is directly connected to v^. Since Vd cannot have more than one faulty incoming link, we conclude t h a t a packet from the source node can always reach the destination node.
D
From the above theorem, we see that each node in an IFI path needs only two outgoing links. We call one of them the primary secondary
link and the other the
link.
The primary link is the one which leads to a node closer to the destination. One can choose the primary link from the shortest path as determined by Algorithm 2.2.6. In case there exist multiple choices (i.e., more than one of the mx,my,ini
are non-zero), we will use the following algorithm to select a
primary link L.
A l g o r i t h m 2.2.7 ( S e l e c t i o n of t h e p r i m a r y link L) . Let abs{x) and sign{x)
denote the absolute value and the sign of x, respectively,
and let A", —A', V, —V, Z, —Z denote the outgoing links of a node along the six different directions. Then, If abs(m!r) > 1 then set L ;= else if abs{my)
sign{mx)X
> 1 then set L :=
sign{my)Y
else if a6s(m,-) > 1 then set L :=
sign{m^)Z
115 else if a6s(mx) = 1 then set L — sign[m^)X else if abs(my) — 1 then set L — sign{my)Y else if ahs(mz) — 1 then set L = sign{mi)Z.
As will be clear later, the selection of the primary links in the way specified by Algorithm 2.2.7 will facilitate the determination of the secondary links and reduce the number of nodes/links of the resulting IFI channel. To ensure that the secondary link does not lead to a node which is farther away from the destination, it must be either 60 degree above or 60 degree below the primary link.^ We use the notation L + 1 to denote the link which is 60 degree above L, and L — I the one which is 60 degree below L. For example, if L = X, then X + l = ~Z and A' - 1 = ~Y. Let node[i] denote the zth node of an IFI path, and node[i].p and node[i].s denote the node's primary and secondary links, respectively. We can use the following algorithm to construct an IFI path from v, to I'd-
Algorithm 2.2.8 (Construction of an IFI path) Step 1. Calculate m,c,my,m,z for all shortest paths from the source node t;., to the destination node v^ using Algorithm 2.2.6. Notice that at most two of them can be non-zero. Step 2. Set i := 1 and noiie[l] := v,. Set the initial rotating direction for the secondary link R := I if one of the following is true: (1) abs{m.y) > a6s(mi) = 1, (2) a6s(m^) > abs{my) — 1, (3) abs{mx) > 1, m^ ^ 0, and (4) abs(mx) = a6s(m^) = 1. Otherwise, set R ;= —1. Step 3. Calculate the primary link L{i) using Algorithm 2.2.7. If i > 1, L(i) ^ L{i — 1), and node[i — 1] is not adjacent to Vd, set R := —R. 'Here "above" means counter-clockwise and "below" means clockwise.
116 S t e p 4. Set node[i].p := L{i),
node(i).s
:= L(i) + R, and set node[i + 1] to be
the node which the secondary link of^ node[i] leads to. Update
mj.,my,niz
for node[i + 1]. S t e p 5. If node[i + 1] = node[i — 1], then set node[i + 1] ;= v^ and stop. T h e destination node has been reached. Otherwise, set i ;= i + 1, R :— —R, goto Step 3.
T h e correctness of Algorithm 2.2.8 is proved by the following theorem.
T h e o r e m 2 . 2 . 5 The subnetwork
obtained from Algorithm
2.2.8 is an IFI path
from !,'., to v^.
Proof: We prove t h a t the resulting subnetwork S satisfies C I and C2 of Theorem 2.2.4. For any node[i] 9^ v^ in S, let vi and v^ be the two respective nodes which links node[i].p and node[i\.s enter. From the algorithm, node[i + 1] = t'2- T h u s 1)2 G S. To show t h a t vi is also in S, and Vi and V2 are adjacent, we first prove that there is a link in S from V2 to 1^1. Since a secondary link will never lead to the destination node, V2 ^ vj. Thus, 7iode[i + 1] always has two outgoing links node[i + l].p and node[i + l].s in S.
Assume node[i].s is 60 degree above node[i].p.
from the direction of node[i].p
As shown in Figure 2.2.7,
(which is on the shortest path from noc/e[z]
to I'd), node[i -f l].p (i-e., the shortest path from node[i + I] to DJ) has only three choices: ^3,^4,^5.
We claim that node[i + [].p can not take £3 since
otherwise, from Algorithm 2.2.7, node[i].p would have taken (2 instead of £1. If node[i + l].p = ^5, the primary link of node[i + 1] is the link from V2 to Vi. Otherwise, node[i + l].p — £4. From Algorithm 2.2.8, node[i -f l].s should be 60 degree below node[i + \].p since node[i\.p and node[i + l].p have the same direction and node[i] is not adjacent to D^ (node[i + l].p would otherwise have taken £5). T h u s node[i+
l]..s = £5 is the link from V2 to t;i. Similarly, it can be
proved t h a t there is a link from 1*2 to V\ in S when node[i].s is 60 degree below node[i].p.
117
node(i).s = 1
F i g u r e 2.2.7
We now prove t h a t vi 6 S.
Proof of the adjacency of vi and V2.
If node[i + l].s = £5, then vi = node[i + 2] €
5". Otherwise, from the above proof, node[i -\- l].p = ^5. If v^ = Vd, from Algorithm 2.2.8, node[i+l].s directs back to nod€[i]. Then, vi — node[i+2] 6 S. Otherwise, as shown in Figure 2.2.7, V3 = node[i+2].
Continuing this induction,
we can conclude t h a t either DJ £ 5 , or the six neighbors of I'l all have primary links directed to vy. T h e latter case implies vi — V4. Thus, vi £ S. Since there is a link in 5 from V2 to vi, vi and V2 are adjacent in 5 . Further, since node[i].p is on the shortest path, ds{vi,Vti) 1 < ds{node[i],Vci). ds{v\,Vd)
= ds{node[i],Vd)
—
Since there exists a link in S from V2 to vi, ds(v2,Vd)
<
+ 1 = ds{node[i],Vd).
Thus C I is proved.
We now prove t h a t there does not exist any loop all of whose nodes are of a constant distance d > 1 to fd by contradiction.
First, notice t h a t such a
loop contains only secondary links since a primary link connects two nodes of different distances to Vd- Then, all the primary links of the nodes in the loop must lead to a common node v. This is from the fact proved above that either node[i+
l].p or node[i+
node[i+
l].s can not lead to t; since it must lead to a node of the same distance
l].s must lead to node v which node[i].p leads to. But
118 to Vd as that of node[i]. This is possible only if u = vj, i.e., (f = 1, Thus, C2 is proved. Q
We make several remarks on Algorithm 2.2.8 as follows.
1. In Step 4, the address of node[i + 1] can be obtained from that of norfe[?] using Definition 2.2.2, which gives the addresses of the six neighboring nodes of a node in six directions. The values of m^, my, m^ for node[i + 1] can be updated directly with Algorithm 2.2.6 using the address oinode[i + 1]. But a simpler way of doing this is as follows. Let w be the direction of link node[i].s and v,w be the remaining two directions. Let * = 1 if link L(i) + .R is at the positive direction of u and s = — 1 otherwise. Then, if [niu = rriy — 0 and sm^. > 0) or (m^ = mu, = 0 and srut, > 0), update rrit, := rrty — s,m^ ;= m^, — s. Otherwise, update m^ ;= rrii^ — s. The correctness of this algorithm can be verified by placing the destination node Vfi at the center of the wrapped hexagonal mesh and checking the changes of /TI^ , rriy, iriy as one moves from node[i] to node[i + 1] along link node[i].s. 2. In Step 2, the initial rotating direction R for the secondary link is chosen such that if noc/e[l] has two links both on shortest paths® to the destination nodes, node[l].s will take one of them, In this way, the resulting IFI path needs less links and nodes than when doing otherwise. The way in which the primary link is chosen in Algorithm 2.2.7 also serves this purpose. 3. Since the primary links are always on the shortest path to the destination, they form a shortest path sinking tree to the destination. In other words, if a packet generated at any node in 5 is always forwarded using the primary links, it will take a minimum number of hops to the destination. This fact results in the following routing policy at each node: an arriving packet should be forwarded via the primary link whenever possible. The secondary link is used only if the primary link is down. ' N o t e that there could be multiple shortest paths between a pair of nodes.
119 We now discuss how the IFI real-time channel can be established over an IFI path obtained from Algorithm 2.2.8. The procedures to establish an IFI realtime channel are composed of the following three steps.
S t e p 1. Calculate the message delay bound over each link of the channel. S t e p 2. Calculate the e n d - t o - e n d delay bound using the link delay bounds. S t e p 3 . If the e n d - t o - e n d delay bound is not larger than the requested one, the channel can be established.
Calculate the link delay bounds to be
assigned to the channel. Otherwise, the channel establishment request is rejected.
Theorem 2.2.2 can be used for the calculation of the link delay bounds in Step 1. Let node[i],i
— l , - - - , i ' be the nodes of an IFI path obtained from
Algorithm 2.2.8, where no(ie[l] is the source node and node[k] is the destination node. Let d[i].p and d[i].s be the delay bounds over the primary and secondary links of node[i], respectively.
Then the e n d - t o - e n d message delivery delay
bound in Step 2 can be calculated using the following algorithm.
A l g o r i t h m 2.2.9 ( C a l c u l a t i o n o f t h e m e s s a g e delivei-y d e l a y b o u n d s ) T h e message delivery delay bound d[i] from node[i] to the destination node node[k] for a real-time channel with a m a x i m u m message transmission time C can be calculated as follows:
d[k-
d[k-2]
1] = msix{d[k-
l].p, d[k-
l].s-hd[/t-2].p-max{0,(C-Cp)}},
= max{d[fc-2].p, d[k - 2].s + d[k - l ] . p - m a x { 0 , ( C - Cp)}},
d[i] = ma.x{d[i].p+
d[ip], d[i].s + d[is]] — max{0, ( C — Cp)}
i = ^ — 3, • • •, 1.
where node[ip], node[is] are the nodes to which the primary and secondary links of node[i] lead, respectively, and Cp is the packet transmission time.
120
iKxlc|k-2'
inHJc[k-l].p
(b)
(a)
F i g u r e 2.2.8
Calculation of [i[j]'s.
The correctness of Algorithm 2.2.9 can be verified as follows. From the proof of Theorem 2.2.5, the connections between not/e[fc —2], node[k — l], and node[k] are shown in Figure 2.2.8(b), from which the first two equations can be obtained using Theorem 2.2.3. For \ < i < k — 3, node[i] is connected to node[ip] and node[is] in the way shown in Figure 2.2.8(a), which proves the remaining A; — 3 equations. Since ip and is are always larger than i for i < k — 2, the maximum delay bound from node[i] to node[k] can be obtained from the above equations. If d[\] < D, the IFI real-time channel can be established, and we need to determine the link delay bounds to be assigned to the channel. As discussed in Section 5.2, the link delay bounds of the channel should be set as large as possible to reduce the channel's influence on the links' ability to establish more real-time channels in future. This can be done using the following algorithm.
Algorithm 2.2.10 (Assignment of link delay bounds) . Step 1. In Algorithm 2.2.9, for i = k — I, • • • ,1, record the link (i.e., the primary or secondary link) l[i] on which the maximum is achieved for d[i]. Notice that there could be two links iov i = k — 2 or t = k — 1. Step 2. Record all the links traversed as one goes from no(/e[l] to node[k] using only the links recorded in Step 1. This gives a critical path from the
121 source to the destination which has the e n d - t o - e n d delay bound d[l] as calculated from Algorithm 2.2.9. S t e p 3 . Let N be the total number of links on the critical p a t h . For each link €_,• on the critical path, set the channel's delay bound d^ := dj +{D — d[j])/N, where dj is the delay bound calculated using Theorem 2.2.2 for fj. S t e p 4 . Recalculate
In summary, we have the following algorithm for the establishment of an IFI real-time channel.
A l g o r i t h m 2 . 2 . 1 1 ( E s t a b l i s h m e n t of a n I F I r e a l - t i m e c h a n n e l ) .
Step
1. Using Theorem
2.2.2, calculate the rmnimum
d[i].pmin and d[i].Smin over the primary
message
delay
bounds
and secondary links of node{i\, i —
l,---,k-\. Step
2. Calculate the end-to-end
Step
3. If d\l\ is larger than the user-requested channel request is rejected.
delay bound d[l] from Algorithm
Otherwise,
end-to-end
2.2.9.
delay bound D, the
the channel can be established
the link delay hounds calculated from Algorithm
with
2.2.10.
We now give an example to demonstrate the above ideas. Figure 2.2.9 shows a portion of a hexagonal mesh. We want to establish an IFI real-time channel from node 1 to node 8 with channel parameters {T,C,D)
= (100,5,70). For
simplicity, assume Cp > C. We first construct an IFI path from node 1 to node 8 using Algorithm 2.2.8. For i = 1, n-ode[l] = node 1. (m.j;,my,m^) = ( 2 , 0 , - 2 ) . The initial rotating direction for the secondary link /? = 1 since abs(m.^) > 1 and m^ ^ 0. From
122 10 (20)
-*i 3 }. — - # - — > 10
35 (70) 25(50)
Figure 2.2.9
ZOC'W)
An IFI real-time channel from node 1 to node 8.
Algorithm 2,2.7, the primary link is calculated to be node[l].p and the secondary link is node[l].s = L(l) + 1 = —Z.
L(l) = A',
Set the next node to one which link ~Z leads to, then node[2] — node 2. Update rrix, rriy, m^ for no(ie[2] as follows. The direction of —Z is Z, so u = Z, and V = X,w = Y. Also, s = —1. Since rriui = 0 and sm^, = —2 < 0, we only need to update m-u := m^ — s — — 2 + 1 = —1. Thus, for node 2, ( m i , m j , , m ^ ) = (2,0, - 1 ) .
Repeating the above procedure, we get an IFI path as shown in Figure 2.2.9, where the primary links are denoted by solid arrows and the secondary links by dashed arrows. It is not difficult to see that a packet can be transmitted from node 1 to node 8 in the presence of any isolated failures. Also, all the primary links and the nodes form a shortest path sinking tree to the destination node. We now establish an IFI real-time channel over the IFI path thus obtained by assigning delay bounds to the links using Algorithm 2.2.11. Suppose there is no other real-time traffic in the network. Then, for i — 1, • • -,8, (i[i].pm,,j = f^iJl-Smm = C = 5. Using Algorithm 2.2.9, rf[i]'s are calculated and shown near
123 each node in Figure 2.2.9. The requested real-time channel can be established since d[l] = 35 < D = 70. The critical path can be determined by recording the links over which the maximum is achieved in Algorithm 2.2.9, which is in this example the ones marked by " / / " in Figure 2.2.9. There are a total of A' = 7 links on the critical path. The channel's delay bounds over the links of the critical path are thus d^ = dj + (D - d[l])/N = 5 + (70 - 35)/7 = 10. The updated values of d[i]'s calculated from Algorithm 2.2.9 are shown in the parentheses near each node. Then, the channel's delay bounds on the other links can be calculated as the differences of d[?]'s of the nodes they connect, which are shown near each link in Figure 2.2.9.
Backup Channels For real-time communication that can tolerate rare short-period breakdowns, a less expensive way to increase the reliability of a real-time channel is to set up backup channels. This method is also applicable to cases where SFI circuits can not be established due to the poor connectivity of a network. The idea of the backup channels works as follows:
•
Each real-time channel is composed of a primary channel and a number of backup channels. Under the normal circumstance, the primary channel is used for packet transmission while keeping the backup channels unused. In case the primary channel gets disabled by a component failure, one of the backup channels is promoted to the primary channel (thus taking over the transmission task).
•
Both the primary and backup channels are established simultaneously using the procedures described earlier. In case a primary or backup channel cannot be established, the system is allowed to tear down some of the existing backup channels bcised on a scheme that will be described later.
124 Using backup channels, the long channel re-establishment overhead can be avoided when the primary channel is disabled. The cost of backup channels is minimal because: •
Backup channels are not used to transmit redundant real-time messages under the normal circumstance, thus they do not affect the transmission of other channels as well as non real-time traffic, and
•
The number of real-time channels that a network can accommodate is not reduced since the backup channels can be removed whenever there is a shortage of network resources in establishing a new real-time channel.
As compared to the case of replicating real-time channels, there are two problems in implementing the idea of backup channels. First, there is a delay in switching to a backup channel when a fault occurs to the primary channel. The main source of this delay is associated with fault detection and channel switching. The real-time packets transmitted during this period could be lost. Second, there is no guarantee on the number of backup channels that each realtime channel can have. As more and more real-time channels are established in a network, the number of backup channels of a real-time channels may reduce and even become 0. In the rest of this section we will address how these two problems can be alleviated. The fault detection time can be reduced by using an "acknowledgment channel" for each real-time channel. A channel fault can be detected c(uickly if the source node does not receive an ack in a certain period after transmitting a packet. Note that an acknowledgment channel usually costs far less than a real-time channel since it deals with a shorter packet size, a larger requested delivery delay (depending on the required fault detection time), and/or a longer packet inter-arrival time (several packets can be acknowledged at a time). The channel switch time is the time needed to find a non faulty backup channel and promote it to the primary channel. If there is at least one non faulty backup channel, this process is quite fast. One can send multiple copies of a packet through all the backup channels of the now-disabled real-time channel.
125 and choose the one which deUvered the packet correctly. If, unfortunately, all the backup channels are faulty, there is no choice but to execute the t i m e consuming channel establishment procedure. T h e second problem of using backup channels comes from the fact t h a t a backup channel may be removed in order to accommodate future real-time channels. A real-time channel may have many backup channels when the network is lightly loaded (in the sense that there do not exist many real-time channels in the network).
As more and more real-time channels are added, the number of
backup channels may decrease or even become zero. Thus, the use of backup channels provide no guaranteed fault-tolerance for the primary channels. An i m p o r t a n t question is then how to manage the establishment and removal of backup channels such t h a t the primary channels can be backed up as much as possible. To this end, the following three questions must be answered:
Q\:
For each real-time channel, how many backup channels should be established?
Q'i- How should the channels be routed? Q3: Which channel removal policy should be used?
A straightforward answer to Qi is t h a t the more backup channels the better. A real-time channel is more likely to find a non faulty backup channel if it is backed up by many channels. However, there is a limitation to the number of backup channels which can be established. Establishing too many backup channels will complicate the channel establishment procedure. Also, the system gains little by establishing channels which have a large number of common links and nodes, because a single component failure would bring all of them down. For this reason, we restrict that the primary and backup channels of a real-time channel run through disjoint
paths, and the number of backup channels to be
established is thus the m a x i m u m number of disjoint paths minus one. As to the routing problem, we argued in Section 5.1 t h a t the m i n i m u m - h o p routing is preferable for real-time channels. The primary channel and backup
126 channels can be set up sequentially by establishing one at a time, then removing all the intermediate nodes and links it passes through from consideration for setting up next channels, and establishing the next backup channel in the remaining network, and so on. T h e third question consists of two parts: (i) which channels are allowed to be removed? and (ii) which of the removable channels should actually be removed? As to part (i), establishment of a new channel should only be allowed to remove only those channels which are less important than itself. There are two ways to compare the relative importance of backup channels: (1) the backup channels of a critical real-time channel is more important than those of a less critical one, and (2) for two real-time channels of the same criticality, the backup channels of the one with more backup channels are less important than those of the other real-time channel. To this end, one may assign a criticality number C to each real-time channel and then assign a rank R ~ f{C,k)
to its k-th
backup
channel, where / ( C , k) is an increasing function of C and a decreasing function of k.
One such choice is f{C,k}
= C — k.
So, we have a channel removal
strategy which allows for tearing down only lower-rank backup channels. Since the establishment of a primary channel is always allowed to tear down backup channels, the rank of all primary channels is assigned to be oo. T h e answer to the second part of the question is t h a t one should remove as few i m p o r t a n t backup channels as possible. In other words, if there is a choice between two backup channels, the one with lower rank should be removed. This can be done with the following algorithm.
A l g o r i t h m 2.2.12 (Channel Removal Strategy) .
S t e p 1.
Establish the new channel without removing any existing backup
channels. Terminate if successful. Otherwise, goto Step 2 S t e p 2.
Remove all the backup channels on the route having lower ranks than
the new one, except those with link delay bounds larger than i„,, the time t h a t achieves the m a x i m u m of dn = max{d'
: t £ G} in Theorem 2 (the
127 removal of these channels will not change the link-delay bound of the new channel). Step 3. Establish the new channel. If it is still not successful, the new channel cannot be established and those channels removed in Step 2 are restored. Otherwise, goto Step 4. Step 4. Starting from the backup channels with the highest rank, re-establish the backup channels removed in Step 2. This step reduces the number of backup channels removed.
One possible application domain of the backup channels is in a dual-ring network (e.g., the FDDI) where both the SFI and IFI real-time channels can not be established. We give an example showing how backup channels can be established. Consider a ring network with 5 stations connected by 5 duplex links (link i connects station i and [i+i) mod 5). All real-time channels have the same minimum message inter-arrival time T,- = 100 and maximum message transmission time Ct = 5. Suppose three backup channels TH, T2t, r^i, have been established in the network with the following information on the source {src) and destination {dst) stations, user-requested end-to-end delay Di, channel rank Ra, and link-delay bound d^. (To show the procedure of Algorithm 2.2.12 clearly, we do not consider the existing primary channels since the primary channels are not allowed to be torn down.)
T3b
src=3 src=2 src=3
dst=0 dst=0 dst=0
£>! = 15 D2 = 10 1)3 = 30
Rn = 1 i?2t = 2 /?36 = 1
d?j = 10 dL 2b = — b " dh=7 36 — '
c/Jj = 5 c/°. "2(. = 5 dL "36 = — 12 ^^
"36
d°=\\.
Now, we want to establish a new real-time channel T^ of criticality 4 with src = 3, dst = 0, D4 = 15.
128 The primary channel of T4 takes the m i n i m u m - h o p route (3 -^ (.4. Using Algorithm 2, one can establish the primary channel as:
r4p:
src=3
dst,=0
D^ = 15
R^p = 00
d\^ = 5
d\^, = 10.
After removing ^3 and (.4 from the network, the network contains only one more connection between station 3 and station 0. So, T^ can have at most one backup channel 7-44,. Since the criticality of T/^ is 4, the rank of its first backup channel is R.xb = 4 — 1 = 3 . Setting up r^j on the route C2 —' h —* (0 with Algorithm 2 results in rejecting the channel request. Hence, all other backup channels on the route with lower ranks lower than T4, i.e., r26 and r^t,, are removed. After this removal, T41, can be successfully established as;
7-4),:
src=3
dsl=0
^ 4 = 15
R^b = 3
^4^ = 5
^4^ = 5
dljj, = 5.
T h e next step is to re-establish those backup channels removed, starting from the one with the highest rank. Using Algorithm 2, the establishment of r j j is rejected. So T21, must be removed and r^f, is re-established as:
T3t,:
src=3
dst=0
D3 = 30
Rab = 1
^Ij = 10
c/^ft = 10
c'si = 10.
Consequently, after establishing a new real-time channel T4, the network has the following four established channels:
dst=0
Di = 15
Rib =1
offj = 10
d^^ = 5
TSb
src: 33 src=3
dst=0
D3 = 30
^36 = 1
c^sj = 10
d^^ = 10
d°j = 10
T4b
src=3
dst=0
D4 = 15
R^b = 3
Ab = 5
<6 = 5
Up
src=3
dst=0
D4, - 15
Ri,p = 00
4(- = ^ rffp
= 5
d\^ = 10.
129
2.2.4
Conclusion
This chapter addressed the problem of fault-tolerant real-time communication in distributed computing systems. The end-to-end message delivery delay can be controlled to be below a pre-specified value by establishing a real-time channel, and unlike traditional reliable communication protocols like the TCP which uses a message retransmission scheme (i.e., time redundancy), spatial redundancy was used to achieve fault-tolerant communication which preserves timeliness. With methods described in this chapter, real-time communication in point-to-point networks can be made to tolerate any single failure, or isolated failures in case of a hexagonal mesh topology. One can also establish backup channels with the minimal cost to which a broken real-time channel can quickly switch. The proposed fault-tolerant real-time channel approaches are currently being implemented on an experimental distributed system called HARTS at the Realtime Computing Laboratory of the University of Michigan [11]. The second version of the HARTS operating system called HARTOS includes the real-time channel establishment procedure described in [5, 12]. The wrapped hexagonal topology of HARTS will allow for experiments ranging from the basic real-time channels to the highly reliable IFI real-time channels.
130
REFERENCES
[1] D. Ferrari and D. C. Verma, "A scheme for real-time channel establishment in wide-area networks," IEEE Journal on Selected Areas in Communications, vol, SAC-8, no, 3, pp, 368-379, April 1990, [2] Q, Zheng and K. G. Shin, "On the ability of establishing real-time channels in point-to-point packet-switched networks," IEEE Transactions on Communication (in press), 1993. [3] Q, Zheng, Real-time Fault-tolerant Communication in Computer Networks, PhD thesis, University of Michigan, 1993, PostScript version of the thesis is available via anonymous FTP from ftp,eecs.umich,edu in directory outgoing/zheng, [4] C. L. Liu and J, W, Layland, "Scheduling algorithms for multiprogramming in a hard real-time environment," Journal of the ACM, vol, 20, no, 1, pp. 46-61, January 1973, [5] D, D. Kandlur, K. G. Shin, and D. Ferrari, "Real-time communication in multi-hop networks," in Proc. Int. Conf. on Distributed Computer Systems, pp. 300-307. IEEE, May 1991. [6] Q. Zheng and K. G. Shin, "Real-time communication in local area ring networks," in Conference on Local Computer Networks, pp. 416-425, September 1992. [7] A. Indiresan and Q. Zheng, "Design and evaluation of a fast deadline scheduling switch for multicomputers," RTCL working document, December 1991. [8] Q. Zheng, K. G. Shin, and E. Abram-Profeta, "Transmission of compressed digital motion video over computer networks," in Digest of COMECON Sprtng'93, pp. 37-46, February 1993, [9] M.-S. Chen, K. G. Shin, and D. D. Kandlur, "Addressing, routing and broadcasting in hexagonal mesh multiprocessors," IEEE Trans. Computers, vol, 39, no, 1, pp. 10-18, January 1990.
131 [10] A. M. Farley, "Networks immune to isolated failures," Networks, vol. 11, pp. 255-268, 1981. [11] K. G. Shin, "HARTS: A distributed real-time architecture," IEEE Computer, vol. 24, no. 5, pp. 25-36, May 1991. [12] D. D. Kandlur and K. G. Shin, "Design of a communication subsystem for HARTS," Technical Report CSE-TR-109-91, CSE Division, Department of EECS, The University of Michigan, October 1991.
SECTION 3
COMPILER SUPPORT
SECTION 3.1
Speculative Execution and Compiler-Assisted Multiple Instruction Recovery^ W. Kent Fuchs^, Neal J. Alewine^, and Wen-mei Hwu^
Abstract Multiple instruction rollback is a technique developed for recovery from transient processor failures. Speculative execution is a method to increase instruction level parallelism which can be exploited by both super-scalar and VLFW architectures. The key to a successful general speculation strategy is a repair mechanism to handle mispredicted branches and accurate reporting of exceptions for speculated instructions. This chapter describes compiler-assisted multiple instruction rollback schemes and their applicability to speculative execution repair. Performance measurements across ten application programs are presented. The results indicate that techniques used in compiler-assisted rollback recovery are effective for handling branch and exception repair in support of speculative execution.
3.1.1 Introduction Multiple instruction rollback recovery is particularly appropriate when error detection latencies are greater than a single instruction cycle. Multiple instruction retry can be implemented so that reexecution is within a sliding window of a few instructions [1,2, 3,4], or re-execution of a few cycles [5]. The issues associated with instruction retry are similar to the issues encountered with exception handling in an out-of-order instruction execution architecture. If an instruction is to write to a register and N is the maximum error detection latency (or exception latency), two copies of the data must be maintained for N cycles. Hardware schemes such as reorder buffers, history buffers, future files [6], and micro-rollback 'This research was supported in part by the Office of Naval Research under grant N00014-91-J-1283. Portions of this chapter have appeared in previous publications by the authors [3,4,25]. ^Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801 ^International Business Machines, Boca Raton, FL
136 Table 3.1.1 Hardware-based Single and Multiple Instruction Rollback Schemes. Rollback Scheme
Checkpoint Type
IBM 4341 [7] IBM 3081 [8] VAX 8600 [9] IBM patent [10] IBM patent [11] micro-rollback [2] history buffer [6] history file [6] VAX 9000 [12] IBM E/S 9000(1]
full full full full incremental incremental incremental incremental full incremental
Rollback Distance single instr. 10-20 instr. single instr. variable single instr. variable variable variable single instr. variable
Location of Data Primary Redundant register file shadow file register file shadow file register file not required register file shadow file register file shadow files write buffer register file register file history buffer register file shadow file register file not required virtual file physical file
[2] differ in where the updated and old values reside, circuit complexity, CPU cycle times, and rollback efficiency. Table 3.1.1 gives a description of various hard ware-based methods to restore the general purpose register file contents during single or multiple instruction rollback. In the VAX 8600 and VAX 9000, errors are detected prior to the completion of a faulty instruction. For most VAX instructions, updates to the system state occur at the end of the instruction. If the error is detected prior to the updating of the system state, the instruction can be rolled back and re-executed. If the system state has changed prior to detection of the error, a flag is set to indicate that instruction rollback cannot be accomplished. Redundant data storage is not required for the VAX 8600 and VAX 9000. The IBM 4341, IBM 3081, IBM patent 4,912,707, IBM patent 4,044,337, and history file all require shadow file structures to maintain redundant data. This data is used to restore the system state during rollback recovery. Shadow file structures can add significant circuit overhead, although the level sensitive scan design [13] of the IBM 4341 and IBM 3081 provides this feature without additional cost over that incurred to obtain testability.'* The VAX 8600 and VAX 9000 schemes avoid shadow files, however, require an error detection latency of only one instruction. The micro-rollback scheme also avoids shadow files by using a delayed write buffer to prevent old data from being overwritten until the error detection latency has expired; •*The 126 scan rings of the IBM 3081 contains 35,000 bits of dataf8].
137 ensuring that the new data is fault-free. In a delayed write scheme, the most recent write values are contained in the delayed write buffer, and bypass circuitry is required to forward this data on subsequent reads. The performance impact introduced by the bypass circuitry is a function of the register file size and the maximum rollback distance [2]. The history buffer scheme maintains redundant data in a separate push-down array and therefore does not require bypass circuitry [6]. The history buffer does however require an extra register file port which complicates the file design and can impact performance by increasing file access times. In an effort to increase the register file size while maintaining down-level code compatibility relative to the 16 architectural registers, the IBM E/S 9000 has introduced a virtual register management (VRM) system [14]. The VRM circuitry dynamically maps the eight architectural registers into 32 physical registers. When the data in a physical register becomes obsolete, the physical register is released for reassignment as a new virtual register. Although the VRM system was primarily intended to reduce register pressure and therefore improve system performance, it has been extended to provide data redundancy to assist in rollback recovery. In the VRM extension, remapping of a physical register to a new virtual register is postponed until the error detection latency has been exceeded for the data contained in the physical register. Super-scalar and VLIW architectures have been shown effective in exploiting instruction level parallelism (ILP) [15, 16, 17]. Creating additional ILP in applications has been thesubjectof study in recent years [18, 19, 20]. Code motion within a basic block is insufficient to unlock the full potential of super-scalar and VLIW processors with issue rates greater than two [17]. Given a trace of the most frequently executed basic blocks, limited code movement across block boundaries can create additional ILP at the expense of requiring complex compensation code to ensure program correctness [21]. Combining multiple basic blocks into superblocks permits code movement within the superblock without the compensation code required in standard trace scheduling [17]. General upward and downward code movement across trace entry points (joins) and general downward code motion across trace exit points (branches, or forks) is permitted without the need for special hardware support [21]. Sophisticated hardware support is required, however, for upward code motion across a branch boundary. Such code motion is referred to as speculative execution and has been shown to substantially enhance performance over nonspeculated architectures [22, 23, 24]. The remainder of this chapter presents a summary of compiler-assisted multiple instruction retry techniques and their extended application to speculative execution. Compiler and hardware support are described for recovery from speculated instruc-
138 tions (referred to as exception repair) and mispredicted branches (referred to as branch repair). We demonstrate that data hazards which result from exception and branch repair are similar to data hazards that result from multiple instruction rollback for transient error recovery, and that techniques used to resolve rollback data hazards are applicable to exception and branch repair [25].
3.1.2 Compiler-Assisted Multiple Instruction Rollback Recovery Compiler-Based Instruction Rollback Recently, compiler-based approaches to multiple instruction rollback (MIR) recovery have been investigated [4, 26]. Compiler-based MIR uses data-flow manipulations to remove data hazards that result from multiple instruction rollback. Rollback data hazards (or just hazards) are identified by antidependencies^ of length < A'^, where A^ represents the maximum rollback distance. Antidependencies are removed at three levels: 1) pseudo-code level, or the code level prior to variables being assigned to physical registers, 2) machine-code level, or the code level in which variables are assigned to physical registers, and 3) post-pass level, which represents assembler-level code emitted by the compiler. Compiler-based multiple instruction rollback reduces the requirement for data redundancy logic present in hardware-based instruction rollback approaches.
Compiler-Assisted Instruction Rollback Compiler-based multiple instruction rollback resolves all data hazards using compiler transformations. Compiler-assisted instruction rollback uses dedicated data redundancy hardware to resolve one type of rollback data hazard while relying on compiler assistance to resolve the remaining hazards. Experimental results indicate that by exploiting the unique characteristics of differing hazard types, compiler-assisted MIR design can achieve superior performance to either a hardware-only or compiler-based instruction rollback scheme.
Hazard Classification Within a general error model, data hazards resulting from instruction retry are of two types [4, 28]. On-path hazards are those encountered when the instruction path after rollback is the same as the initial path and branch hazards are those encountered when the instruction path after rollback is different than the initial path. As shown in Figure 3.1.1 (a), Tj. represents an on-path hazard where during the initial instruction sequence 'For a complete presentation of data-flow properties and manipulation method.s, see [27].
139
i~-... r IS live
N
rollback
rollback
N '
r = r +1 X
C
error '•""^ detected (a) On-path Hazard
error detected (b) Branch Hazard Figure 3.1.1 Data hazards.
TJ: is written and after rollback is read prior to being re-written. As shown in Figure 3.1.1 (b), r-j, represents a branch hazard where the initial instruction sequence writes r-y and after rollback Vy is read prior to being re-written however this tiine not along the original path.
On-path Hazard Resolution Using a Read Buffer Hardware support consisting of a read buffer of size 27V, as shown in Figure 3.1.2, has been shown to be effective in resolving on-path hazards [4, 28]. The read buffer maintains a window of register read history. If an on-path hazard is present, then prior to writing over the old value of the hazard register, a read of that value must have taken place within the last A'^ instructions (else after rollback of < A'^, a read of the hazard register would not occur before a redefinition). Key to this scenario is the fact that the original path is repeated after rollback. Branch hazard resolution is left to the compiler. At rollback, the read buffer is flushed back to the general purpose register file (GPRF), restoring the register file to a restartable state. The primary advantage of the read buffer is that it does not require an additional read port as with a history buffer, duplication of the GPRF as with the future file, or bypass logic as with the reorder buffer or delayed write buffer [2, 6].
Branch Hazard Removal Compiler Transformations Compiler transformations have been shown to be effective in resolving branch hazards [4]. Branch hazard resolution occurs at three levels; 1) pseudo code, 2) machine code, and 3) post-pass. Resolution at the pseudo code level would be accomplished by renaming the pseudo register ry of instruction /, (Figure 3.1.1) to r,. Node
140
Register File
t A B
Read Buffer
^
Figure 3.1.2 Read buffer. splitting, loop expansion and loop protection transformations aid in breaking pseudo register equivalence relationships so that renaming can be performed. After the pseudo registers are mapped to physical registers, some branch hazards could re-appear. This is prevented at the machine code level by adding hazard constraints to live range constraints prior to register allocation. Branch hazards that remain after the first two levels can be resolved by either creating a "covering" on-path hazard or by inserting nop instructions ahead of the hazard instruction until the rollback is guaranteed to be under the branch. Given the branch hazard of Figure 3.1.1, a covering on-path hazard is created by inserting an MOV ry,ry instruction immediately before the instruction in which r^ is defined. This guarantees that the old value of r^ is loaded into the read buffer and is available to restore the register file during rollback.
3.1.3 Speculative Execution Figure 3.1.3 illustrates the two basic problems encountered when attempting upward code motion across a branch. First, if the speculated instruction (i.e., an instruction moved upward past one or more branches) modifies the system state, and due to the branch outcome the speculated instruction should not have been executed, program correctness could be affected. Second, if the speculated instruction causes an exception, and again due to the branch outcome, the excepting instruction should not have been executed, program performance or even program correctness could be affected.
141 • • •
• • •
(oh ^2 +0 • • •
m
'l--
m
branch taken
o
• • i
'•; = ''2 +
0
• • •
'4='-5
• • •
e
-ra
rj in livejout of taken path
"O"
trap occurs
• • • branch taken
0
t
• • •
= MEM( r, ) -^
i
• • •
i ' / ' = MEM( • • •
• • •
r^ )•
• • •
speculated instruction traps
Figure 3.1.3 Speculative execution.
Branch Repair Figure 3.1.4 shows an original instruction schedule and a new schedule after speculation. Instructions d, i, and / have been speculated above branches c and g from their respective fall-through paths.^ Speculated instructions are inarked "(s)." The inotivation for such a schedule might be to hide the load delay of the speculated instructions or to allow more time for the operands of the branch instructions to become available. If c commits to the taken path (i.e., it is mispredicted by the static scheduler), some changes to the system state that have resulted from the execution of rf, ;', and / , may have to be undone. No update is required for the program counter (PC); execution simply begins at j . If instead, c commits to the fall-through path but g commits to the taken path, then only i's changes to the system state may have to be undone. Not all changes to the system state are equally important. If for example, d writes to register r^ and Vj. ^ tiveJn(j) (i.e., along the path starting at j , a redefinition of r^ will be encountered prior to a use of Vj; [27]), then the original value of r,. does not have to be restored. Inconsistencies to the system state as a result of mispredicted branches exhibit similarities to branch hazards in multiple instruction rollback [4]. Given this similarity between branch hazards due to instruction rollback and branch hazards due to speculative execution, compiler-driven data-flow manipulations, similar to those developed to eliminate branch hazards for MIR [4], can be used to resolve branch hazards that result from speculation. Such compiler transformations have been proposed for branch misprediction handling [23]. Since re-execution of speculated ''For this example it is assumed that the fall-through paths are the most likely outcome of the branch decisions at c and 5.
142 a b
EKJ d e f
1:[3— k h
a
RB_c: d
(s)^
e f
(S)j
b
jump LI
(S)jr
[IKJ e
U}—k
RB_g: h i jump L2
L2: Original Schedule
Speculated Schedule
Recovery Blocks
Figure 3.1.4 Branch repair. instructions is not required for branch misprediction, compiler resolution of branch hazards becomes a sufficient branch repair technique. Exception
Repair
Figure 3.1.4 also demonstrates the handling of speculated trapping instructions. If d is a trapping instruction and an exception occurred during its execution, handling of the exception must be delayed until c commits so that changes to the system state are minimized, and in some cases to ensure that repair is possible in the event that c is mispredicted. If c commits to the taken path, the exception is ignored and d is handled like any other speculated instruction given a branch mispredict. If c was correctly predicted, three exception repair strategies are possible. The first is to undo the effects of only those instructions speculated above c (i.e., d, i, and / ) and then branch to a recovery block RB-C [24] as shown in Figure 3.1.4. The address of the recovery block can be obtained by using the PC value of the excepting instruction as an index into a hash table. This strategy ensures precise interrupts [6, 29] relative to the nonspeculated schedule but not relative to the original schedule. Recovery blocks can cause significant code growth [24]. The second strategy undoes the effects of all instructions subsequent to d (i.e., i, b, and / ) , handles the exception, and resumes execution at instruction i [23]. This latter strategy provides restartable states and does not require recovery blocks. A third exception repair strategy undoes the effects of only those subsequent instructions that are speculated above c (i.e., only i and / ) , handles the exception, and resumes execution at instruction i, however, this time only executing speculated instructions until c is reached. The improved efficiency of
143 strategy 3 over that of strategy 2 comes at the cost of slightly more complex exception repair hardware. When a branch commits and is mispredicted, the exception repair hardware must perform three functions: 1) determine whether an exception has occurred during the execution of a speculated instruction, 2) if an exception has occurred, determine the PC value of the excepting instruction, and 3) determine which changes to the system state must be undone. Functions 1 and 2 are similar to error detection and location in multiple instruction rollback. Function 3 is similar to on-path hazard resolution in multiple instruction rollback [3, 4]. On-path hazards assume that after rollback the initial instruction sequence from the faulty instruction to the instruction where the error was detected is repeated. Figure 3.1.5 illustrates the speculation of a group of instructions and re-execution strategy 3. The load instruction traps, but the exception is not handled until the branch instruction commits to the fall-through path. Control is then returned to the trapping instruction. This scenario is identical to multiple instruction rollback where an error occurs during the load instruction and is detected during the branch instruction. For this example, only r\ must be restored during rollback since 7-4 and r? will be rewritten prior to use during re-execution. Figure 3.1.5 shows that exception repair hazards in speculative execution are the same as on-path hazards in multiple instruction rollback, and a read buffer as described in Section 3.1.2 can be used to resolve these hazards. The depth of the read buffer is the maximum distance from /j to /„ along any backwards walk^, where /„ is a trapping instruction that was speculated above branch instruction
h. Schedule Reconstruction Assumed in Figures 3.1.4 and 3.1.5 are mechanisms to identify speculative instructions, determine the PC value of excepting speculated instructions, and determine how many branches a given instruction has been speculated above. An example of the latter case is shown in Figure 3.1.4 where instructions d, i, and / , are undone if c is mispredicted; however, only i must be undone if g is mispredicted. If the hardware had access to the original code schedule, the design of these mechanisms would be straightforward. Unfortunately, static scheduling reorders instructions at compile-time and information as to the original code schedule is lost. To enable recovery from mispredicted branches and proper handling of speculated exceptions, some information relative to the original instruction order must be present in the compiler-emitted instructions. This will be referred to as schedule reconstruction. ^ A walk is a sequence of edge Iraversals in a graph where the edge.s visited can be repeated [30].
144
r^ = UEM{7~)\ trap occurs
\
©
from below branch
rollback
® branch not taken
1
Figure 3.1.5 Exception repair. By limiting the flexibility of the scheduler, less information about the original schedule is required. For example, if speculation is limited to one level only (i.e., above a single branch), a single bit in the opcode field is sufficient to indicate that the instruction has been moved above the next branch [22]. The hardware would then know exactly which instruction effects to undo (i.e., the ones with this bit set). Also, removing branch hazards directly with the compiler permits general speculation with no schedule reconstruction for branch repair [23].
3.1.4 Implicit Index Schedule Reconstruction Implicit index scheduling supports general speculation of regular and trapping instructions. The scheme was inspired by the handling of stores in the sentinel scheduling scheme [23] and was designed to exploit the unique properties of the read buffer hardware design described in Section 3.1.2. Schedule reconstruction is accomplished by marking each instruction speculated or nonspeculated by including a bit in the opcode field, and using this encoding to maintain an operand history of speculated instructions in a FIFO queue called a speculation read buffer (SRB). The SRB operates similar to a read buffer with additional provisions for exception handling.
145 Exception Repair Using a Speculation Read Buffer Figure 3.1.6 shows an original code schedule and two speculative schedules, along with the contents of the SRB at the time branches Ic and Ig commit. Instructions Id and // have been speculated above branch instruction Ic, and /; has been speculated above both 7^ and Ic. The encoding of speculated instructions informs the hardware that the source operands are to be saved in the SRB, along with the source operand values, corresponding register addresses, and the PC of the speculated instruction. Speculated instructions execute normally unless they trap. If a speculated instruction traps, the exception bit in the SRB which corresponds to the trapping instruction is set and program execution continues. Subsequent instructions that use the result of the trapping instruction are allowed to execute normally. A chk-except(k) instruction is placed in the home block of each speculated instruction. Only one chk-except(k) instruction is required for a home block. As the name implies, chk^except(k) checks for pending exceptions. The command can simultaneously interrogate each location in the SRB by utilizing the bit field k. As shown in schedule 1 of Figure 3.1.6, chk.except(OOII 11) in /J checks exceptions for instructions Id and Id- If a checked exception bit is set, the SRB is flushed in reverse order, restoring the appropriate register and PC values. Execution can then begin with the excepting instruction. Figure 3.1.6 illustrates several on-path hazards which are resolved by the SRB. In schedule 1, if 7, traps and the branch Ic commits to the taken path, 7, has corrupted r2 and If has corrupted rj. Flushing the SRB up through 7, restores both registers to their values prior to the initial execution of 7,. Note that register r(, is also corrupted but not restored by the SRB, since after rollback r^ will be rewritten with a correct value before the corrupted value is used. As an alternative to checking for exceptions in each home block, the exception could be handled when the exception bit reaches the bottom of the SRB. This is similar to the reorder buffer used in dynamic scheduling [6] and eliminates the cost of the chk.except(k) command, however, increases the exception handling latency which can impact performance depending on the frequency of exceptions. Implicit index scheduling derives its name from the ability of the compiler to locate a particular register value within the SRB. This is possible only if the dynamically occurring history of speculated instructions is deterministic at branch boundeiries. Superblocks guarantee this by ensuring that the sole entry into the superblock is at the header and by limiting speculation to within the superblock. For standard blocks, bookkeeping code [21] can be used to ensure this deterministic behavior.
146 Original Schedule
Speculated Schedule
Speculated Schecule 2
: bne r^, r^, I.
r
'•7 =
'•7 -^ ^
,•• b n e r^. K
'•ft
;.: chk_except(001111)
r^. 1^
,
T: chk_except(l 10011)
'•«='•»+'*
='"6+^
r, = MEM(/-,)
^: bne r^, r^, I^ chk_except(001100)
',: chk_except(l 10000) /,•
PC
V V f I u s h
e c o r d
2N
v ,/ / i_
'-6 =
'•ft +
^
Except bit Reg. No.
-
PC
ll
valuefrj) 7 valuefr^) 8 2N
0
•'
r
h d
value(r^) 2
0
-
0
value(r^) 2
'•
value(rg) 8
'rf 'rf
ll
value{r^} 7
'i
value(ry) 7
-
-
V V
0
Except bit Reg. No.
value( r J 7
L--'
r<
1
\ SRB Contents
->
SRB Contents
Figure 3.1.6 Exception repair using a speculation read buffer (SRB).
_,
147 Branch Repair Using a Speculation Read Buffer As described in Section 3.1.2, branch repair can be handled by resolving branch hazards with the compiler. Branch hazard resolution in multiple instruction rollback can be assisted by the read buffer when "covering" on-path hazards are present, reducing the performance cost of variable renaming [4]. In a similar fashion, the SRB can assist in branch repair. Figure 3.1.7 shows the original code schedule and the two speculative schedules of Figure 3.1.6. For this example, it is assumed that rj, rj, r^, and rj are elements in both liveJn(Ij) and liveJn(Ik). As shown in schedule 1, if branch instruction /^ commits to the taken path, r2, re, and rj, which were modified in /,, Id, and Ij, respectively, must be restored. If instead, I^ commits to the fall-through path and Ig commits to the taken path, only rj must be restored. Registers rj and rj are rollback hazards that result from exception repair; therefore, the SRB contains their unmodified values. By including aflush(k) command at the target of Ic and Ig, the SRB can be used to restore r2 and/or rj given a misprediction of Ic or Ig. Theflush(k) command selectively flushes the appropriate register values given a branch misprediction. For example, in schedule 2 of Figure 3.1.7, if Ic is predicted correctly and Ig is mispredicted, the SRB is flushed in reverse order up through 7,, restoring valueirj) from /, but not restoring value(.ry) from Ij. Since speculation is always from the most probable branch path, the flush(k) command is always placed on the most improbable branch path, minimizing the performance penalty. Not all branch hazards are resolved by the presence of on-path hazards. These remaining hazards can be resolved with compiler transformations.
3.1.5 Performance Evaluation Evaluation Methodology In this section, results of a read buffer flush penalty evaluation are presented. The instrumentation code segments of Figure 3.1.8 call a branch error procedure which performs the following functions: 1. Update the read buffer model. 2. Force actual branch errors during program execution, allowing execution to proceed along an incorrect path for a controlled number of instructions.
148 Speculated Schedule
Origina Schedule
K- '•/ = '•2* h- 0 = '•4+
Speculated Schedule 2
0
'•5
I : bne r,, r^.
\.
c
•rf^
'6 = ' • 7 *
V '^ =r , +
'•»
4
bne r^, r,, I
V '7 =r^-y 4 I ; bne s
h' 'r
\
r^+4
V
'•« =
L:
rj = MEM(rj)
bne r^, r^, I^
flush(lOlllO)
I.: flush(l 11010)
Except bit
s r h d
SRB Contents
SRB Contents
Figure 3.1.7 Branch repair using a speculation read buffer (SRB).
149
. . ^ Instrumentation code "_•; Original s-code instructions
Figure 3.1.8 Instrumentation code placement. 3. Terminate execution along the incorrect path and restore the required system state from the simulated read buffer. 4. Measure the resulting flush cycles during the branch repair. 5. Begin execution along the correct path until the next branch is encountered. When a branch instruction in the original application program is encountered, an armJiranch flag is set. Prior to the execution of the next application instruction, the arm-branch flag is checked, and if set, the branch decision made by the application program is set aside. The branch is predicted by the branch prediction model. Four models are used in the evaluation: 1) predict taken, 2) predict not taken, 3) dynamic prediction, and 4) static prediction from profiling information. The dynamic prediction model is derived from a two bit counter branch target buffer (BTB) design [31] and is the only model that requires updating with each prediction outcome. After the branch is predicted, the prediction is checked against the actual branch path taken by the application program. If the prediction was correct, execution proceeds normally. If the prediction was incorrect, the correct branch path is loaded into the recovery queue along with a branch error detection (BED) latency, and the predicted path is loaded into the PC. The BED latency indicates how long the execution of instructions is to continue along the incorrect path. The branch error timejout flag is set when the BED latency is reached. When a branch error is detected, the register file state is repaired using the read buffer contents. The PC value of the correct branch path is obtained from the recovery queue. During branch error rollback recovery, the number of cycles required to flush the read buffer during branch repair is recorded.
150 Table 3.1.2 Application Programs. Program QUEEN WC QSORT CMP GREP PUZZLE COMPRESS LEX YACC CCCP
Static Size 148 181 252 262 907 932 1826 6856 8099 8775
Description eight-queen program UNIX utility quick sort algorithm UNIX utility UNIX utility simple game UNIX utility lexical analyzer parser-generator preprocessor for gnu C compiler
It is assumed for this evaluation that two read buffer entries can be flushed in a single cycle. This corresponds to a split-cycle-save assumption of the general purpose register file. Performance overhead due to read buffer flushes (% increase) is computed as Flush.OH=\00*
flush-cycles totaLcycles
All instructions are assumed to require one cycle for execution. This assumption produces conservative Flush-OH results since the MIPS processor used for the evaluation requires two cycles for a load. The additional cycles would increase the total-cycles and thereby reduce the observed performance overhead. In addition to accurately measuring flush costs, the evaluation verifies the operation of the read buffer and its ability to restore the appropriate system state over a wide range of applications. The instrumentation insertion transformation operates on the s-code emitted by the MIPS code generator of the IMPACT C compiler [17]. The transformation determines which operands require saving in the read buffer and inserts calls to the initialization, branch error, and summary procedures. The resulting s-code modules are then compiled and run on a DECstation 3100. For the evaluation, BED latencies from 1 to 10 were used. Table 3.1.2 lists the ten application programs evaluated. Static Size is the number of assembly instructions emitted by the code generator, not including the library routines and other fixed overhead.
Evaluation of Results Experimental measurements of read buffer flush overhead (Flush OH) for various BED latencies are shown in Figures 9 through 13. The four branch prediction strategies used
151 for the evaluation are: 1) predict taken (P-Taken), 2) predict not taken (PJ^^Taken), 3) dynamic prediction based on a branch target buffer {Dyti-Pred), and 4) static branch prediction using profiling data (Prof-Pred). Flush costs were closely related to branch prediction accuracies, i.e., the more often a branch was mispredicted, the more often flush costs were incurred. In a speculative execution architecture, branch prediction inaccuracies result in performance impacts in addition to the impacts from the branch repair scheme. Branch misprediction increases the base run time of an application by permitting speculative execution of unproductive instructions. Increased levels of speculation increase the performance impacts associated with branch prediction inaccuracies. Only the performance impacts associated with read buffer flushes are shown in Figures 9 through 13. For nine of the ten applications, PJ^.Taken was significantly more accurate or marginally more accurate in predicting branch outcomes than P-Taken. For QSORT, P-Taken was significantly more accurate than P-N-Taken. This result demonstrates that in a speculative execution architecture, it is difficult to guarantee optimal performance across a range of applications given a choice between predict-taken and predict-not-taken branch prediction strategies. For all but one application, Prof-Pred was more accurate than either P-Taken or P-N-Taken. For CMP, Prof-Pred, PJ^-Taken, and Dyn-Pred were nearly perfect in their prediction of branch outcomes. Prof-Pred marginally outperformed Dyn-Pred in all applications except LEX. The purpose of measuring read buffer flush costs given the recovery from injected branch errors is to establish the viability of using a read buffer design for branch repair for speculative execution. Although in such a speculative schedule only static prediction strategies would be applicable, the Dyn-Pred model was included to assess how varying branch prediction strategies impact flush costs. Overall, the accuracy of Dyn-Pred fell between P-Taken!P-N-Taken and Prof-Pred. Over the ten applications studied, read buffer flush overhead ranged from 49.91% for the P-Taken strategy in CCCP to .01% for the PJ^-Taken strategy for CMP given a BED of ten. It can be seen from Figures 9 through 13 that a good branch prediction strategy is key to a low read buffer flush cost. The results show that given a static branch prediction strategy using profiling data, an average BED of ten produces flush costs no greater than 14.8% and an average flush cost of 8.1% across the ten applications studied. Given a maximum BED of ten and an average BED of less than ten, the flush costs of the read buffer would be less than that of a delayed write buffer, since a delayed write buffer is designed for a worst-case BED and the flush penalty of a read buffer is based on the average BED. The observed flush costs are small in comparison to the
152 Flush OH (%) 50 P_Taken: - o P_N_Taken:- a 40 Dyn_Pred: x Prof Pred: - A -
Flush OH {%)
50
P_Taken: - » P_N_Taken:- a 40 H Dyn_Pred: x Prof Pred: ^ ^
30
30
1
1
1
1
1
1
1
1
r
1
1
1
1
r
2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 BED Latency BED Latency Figure 3.1.9 Flush penalty: QUEEN, WC.
^lush OH {%)
50- P Taken: P N Taken 40- Dyn Pred: Prof Pred:
Flush OH (%) 50- P_Taken: -^ P_N_Taken:-o40- Dyn_Pred: x Prof Pred: - ^
30H
30
20-
20-
1 1 \ 1 r 0 - ^ ^ f » » iji..^- i^.4 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 BED Latency BED Latency Figure 3.1.10 Flush penalty: COMPRESS, CMF
T
substantial performance gain of speculated architectures over that of nonspeculated architectures [22, 23, 24]. The BED for a given branch in this evaluation corresponds to the number of instructions moved above a branch in a speculative schedule. The results of the evaluation indicate that if the average number of instructions speculated above a given branch is < 10, then the read buffer becomes a viable approach to handling branch repair.
153 Flush OH 50i P_Taken: -^ P_N_Taken:-a 40 Dyn_Preci; x Prof_Pred: -^
Flush OH (%) 50i P_Taken: -^ P_N_Taken>^40- Dyn_Pred: x Prof Pred: ^ ^
30H
30-
20
20
10
10
0
0
Flush OH
(%) 50 P_Taken:
40 30-
~i
1
I
I
]
1
1
I
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 BED Latency BED Latency Figure 3.1.11 Flush penalty: PUZZLE, QSORT.
-aP_N_Taken:--DDyn_Pred: x Prof Pred:
r
10
Flush OH (%) 50 H P_Taken: -^ P_N_Taken:-o40 Dyn_Pred: x Prof Pred: ^ 30
"I
1
1
r
T
1
r
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 BED Latency BED Latency Figure 3.1.12 Flush penalty; GREP, LEX.
154 Flush OH (%) 50H P_Taken: -oP_N_Taken:- a 40 Dyn_Pred: x Prof Pred: ^ ^
Flush OH (%) 50H 40
30 30
P_Taken: P_N_Taken: Dyn_Pred: Prof Pred:
2010~i
1
r
4 5 5 7 8 9 10 1 2 3 4 5 6 7 8 9 10 BED Latency BED Latency Figure 3.1.13 Flush penalty: YACC, CCCP
3.1.6 Summary This chapter showed that branch hazards resulting from branch mispredictions in speculative execution are similar to branch hazards in multiple instruction rollback developed for processor error recovery. It was shown that compiler techniques previously developed for error recovery can be used as an effective branch repair scheme in a speculative execution architecture. It was also shown that data hazards that result in rollback due to exception repair are similar to on-path hazards suggesting a read buffer approach to exception repair. Implicit index scheduling was introduced to exploit the unique characteristics of rollback recovery using a read buffer approach. The read buffer design was extended to include PC values to aid in rollback from excepting speculated instructions. Read buffer flush penalties were measured by injecting branch errors into ten applications and measuring the flush cycles required to recover from the branch errors using a simulated read buffer. It was shown that with a static branch prediction strategy using profiling data, flush costs under 15% are achievable. The results of these evaluations indicate that compiler-assisted multiple instruction rollback is viable for branch and exception repair in a speculative execution architecture.
155
Acknowledgements The authors wish to thank Shyh-Kwei Chen and C.-C. Jim Li. Shyh-Kwei and Jim did much of the early work on developing and implementing compiler-based multiple instruction recovery.
References [1] L. Spainhower, J. Isenberg, R. Chillarege, and J. Berding, "Design for FaultTolerance in System ES/9000Model 900," in Proc. 22th Int. Symp. Fault-Tolerant Comput., pp. 38-47, July 1992. [2] Y. Tamir and M. Tremblay, "High-Performance Fault-Tolerant VLSI Systems Using Micro Rollback," IEEE Trans. Comput., vol. 39, pp. 548-554, Apr. 1990. [3] C.-C. J. Li, S.-K. Chen, W. K. Fuchs, and W.-M. W. Hwu, "Compiler-Assisted Multiple Instruction Retry," IEEE Trans. Comput., To appear; 1994. [4] N. J. Alewine, S.-K. Chen, C.-C. J. Li, W. K. Fuchs, and W.-M. W. Hwu, "Branch Recovery with Compiler-Assisted Multiple Instruction Retry," in Proc. 22th Int. Symp. Fault-Tolerant Comput., pp. 66-73, July 1992. [5] Y. Tamir, M. Liang, T. Lai, and M. Tremblay, "The UCLA Mirror Processor: A Building Block for Self-Checking Self-Repairing Computing Nodes," in Proc. 21thlnt. Symp. Fault-Tolerant Comput., pp. 178-185, June 1991. [6] J. E. Smith and A. R. Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Trans. Comput., vol. 37, pp. 562-573, May 1988. [7] M. L. Ciacelli, "Fault Handling on the IBM 4341 Processor," in Proc. 11th Int. Symp. Fault-Tolerant Comput., pp. 9~12, June 1981. [8] M. S. Pittler, D. M. Powers, and D. L. Schnabel, "System Development and Technology Aspects of the IBM 3081 Processor Complex," IBM J. Res. Dev, vol.26, pp. 2-11, Jan. 1982. [9] W. F. Bruckert and R. E. Josephson, "Designing Reliability into the VAX 8600 System," Digital Tech. J. Digital Equip. Corp., vol. 1, no. 1, pp. 71-77, Aug. 1985. [10] P. M. Kogge, K. T Truong, D. A. Richard, and R. L. Schoenike, "Checkpoint Retry Mechanism." United States Patent, no. 4912707, Mar. 1990. Assignee: International Business Machines Corporation, Armonk, N.Y.
156 [11] G. L. Hicks, D. Howe, Jr., and A. Zurla, Jr., "Insruction Retry Mechanism for a Data Processing System." United States Patent, no. 4044337, Aug. 1977. Assignee: International Business Machines Corporation, Armonk, N.Y. [12] D. B. Fite, T. Fossum, and D. Manley, "Design Strategy for the VAX 9000 System," Digital Tech. J. Digital Equip. Corp., vol. 2, no. 4, pp. 13-24, Fall 1990. [13] E. B. Eichelberger and T. W. Williams, "A Logic Design Structure for LSI Testability," in Proc. I4th Design Autom. Conf., pp. 462^68, 1977. [14] J. S. Liptay, "The ES/9000 High End Processor Design," IBM J. Res. Dev., vol. 36, no. 3, May 1992. [15] R. P Colweli, R. P Nix, J. O'Donnell, D. B. Papworth, and R K. Rodman, "A VLIW Architecture for a Trace Scheduling Compiler," in Proc. 2nd Int. Conf. Architecture Support Programming Languages andOperating Syst., pp. 105-111, Oct, 1987. [16] J. C. Dehnert, P. Y. Hsu, and J. P. Bratt, "Overlapped Loop Support in the Cydra 5," in Proc. 3rd Int. Conf. Architecture Support Programming Languages and Operating Syst., pp. 26-38, April 1989. [17] R Chang, W. Chen, N. Warter, and W.-M. W. Hwu, "IMPACT: An Architecture Framework for Multiple-Instruction-Issue Processors," in Proc. 18th Annu. Symp. Comput. Architecture, pp. 266-275, May 1991. [18] B. R. Rau and C. D. Glaeser, "Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing," in Proc. 20th Annu. Workshop Microprogramming Microarchitecture, pp. 183-198, Oct. 1981. [ 19] M. S. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines," in Proc. ACM SIGPLAN1988 Conf. Programming Language Design Implementation, pp. 318-328, June 1988. [20] A. Aiken and A. Nicolau, "Optimal Loop Parallelization," in Proc. ACM SIGPLAN 1988 Conf. Programming Language Design Implementation, pp. 308-317, June 1988. [21] J. A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Trans. Comput., vol. c-30, no. 7, pp. 478-490, July 1981. [22] M. D. Smith, M. S. Lam, and M. Horowitz, "Boosting Beyond Scalar Scheduling in a Superscalar Processor," in Proc. 17th Annu. Symp. Comput. Architecture, pp. 344-354, May 1990.
157 [23] S. A. Mahike, W. Y. Chen, W.-M. W. Hwu, B. R. Rao, and M. S. Schlansker, "Sentinel Scheduling for VLIW and Superscalar Processors," in Proc. 5th Int. Conf. Architecture Support Programming Languages and Operating Syst., pp. 238-247, Oct. 1992. [24] M. D. Smith, M. A. Horowitz, and M. S. Lam, "Efficient Superscalar Performance Through Boosting," in Proc. 5th Int. Conf. Architecture Support Programming Languages and Operating Syst., pp. 248-259, Oct. 1992. [25] N. J. Alewine, W. K. Fuchs, and W.-M. Hwu, "Application of Compiler-Assisted Rollback Recovery to Speculative Execution Repair," in Hardware and Software Architectures for Fault Tolerance, (New York), Springer-Verlag, 1994. [26] C.-C, J. Li, S.-K. Chen, W. K. Fuchs, and W,-M. W. Hwu, "Compiler-Assisted Multiple Instruction Retry," Tech. Rep. CRHC-91 -31, Coordinated Science Laboratory, University of Illinois, May 1991. [27] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley, 1986. [28] N. J. Alewine, S.-K. Chen, W. K. Fuchs, and W.-M. W. Hwu, "Compiler-assisted MultiplelnstructionRollbackRecovery usingaRead Buffer,"Tech. Rep. CRHC93-11, Coordinated Science Laboratory, University of Illinois, May 1993. [29] M. Johnson, Superscalar Microprocessor Design. Prentice-Hall, Inc., 1991.
Englewood Cliffs, NJ:
[30] J. A. Bondy and U. Murty, Graph Theory with Applications. London, England; Macmillan Press Ltd., 1979. [31] J. K. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," Computer, vol. 17, no. 1, pp. 6-22, Jan. 1984.
Section 3.2
Compiler Assisted Synthesis of Algorithm-Based Checking in Multiprocessors^ Prithviraj Banerjee^ Vijay Balasubramanian'' A m b e r Roy-Chowdhury^ Abstract
In this section we describe a compile-time approach to synthesizing algorithmbased checks for numerical programs. The compiler is used to identify linear transformations within loops and for restructuring nonlinear program statements to introduce more linearity into the program. The data manipulated linear statements are then checked by the introduction of checksums, which can often he done more cheaply than replication. We discuss the implementation of a source-to-source restructuring compiler based on the above approach, and present results of applying this compiler to routines from UNPACK, EISPACK and the Perfect Benchmark Suite.
3.2.1
Introduction
The advent of cost-effective VLSI components in the past few years has made feasible the commercial development of multiprocessor systems. Since the probability of one or more processors failing in such multiprocessor systems is quite large, it is desirable to build some fault tolerance feature into them. In this section we will discuss some cost-effective techniques for fault tolerance in multiprocessor architectures. T h e requirements for high performance and fault tolerance are seemingly contradictory: parallel architectures and algorithms ' T h i s research was supported in part by the Office of Naval Research under contract N00014-91-J-1096 ^Center for Reliable and High Performance Computing, University of Illinois, Urbana, IL 61801 •'Computer Science Dept., Xavier University, New Orleans, LA 70125
160 developed for high performance a t t e m p t to achieve maxim\im utilization of each of the processors, while fault tolerance requires redundant computations and checks to ensure that the results of the computations are correct. T h e result is that conventional fault tolerance techniques are very expensive when applied to multiprocessor architectures. There are two basic approaches to achieving fault tolerance in multiprocessors: static or masking redundancy, and dynamtc or standby redundancy. In the static or masking redundancy approach, one uses N copies of a module and votes on the results. One can combine the scheme with the use of a disagreement detector and a switching unit to produce a hybrid redundant system. Clearly, in a multiprocessor, one can apply this at several levels. One can replicate each processor and vote on the result of each processor's computation, or one can replicate the entire multiprocessor and vote on the combined result. A third option is to divide the P processors of the multiprocessor into P / N groups of N processors, each group voting on its results before communicating to other groups. T h e first two options incur a large hardware overhead (at least 200% if using triple modular redundancy, N = 3), while the third option has a large time overhead (upto 200% in time for N = 3 ) . In order to provide error masking, all critical transactions must be replicated and voted upon, where the m i n i m u m degree of replication is triplication. Such an approach ha.s been used in the FTMP[1], the C.VMP[2], and the S I F T multiprocessor[3]. The F T M P and C . V M P performed the voting on triplicated set of computations in hardware, while the S I F T performed the voting on triplicated set of computations in .software. A lower cost fault tolerance technique in multiprocessors is to use dynamic or standby redundancy, where one first uses some technique to detect a faulty processor and reconfigures the system by bringing in a spare processor. In this approach, a mechanism has to be provided to produce an indication of an error occurrence during the system operation. This can be achieved using either hardware or time redundancy. Once the presence of a faulty processor is detected, a fault location or diagnosis procedure is used. T h e faulty processor is now replaced by a spare processor through reconfiguration techniques. Finally, error recovery is performed, whereby the spare processor takes over the computations of the faulty processor from where it left off using some checkpointed information. This approach is usually the most cost effective approach to achieving fault tolerance in general purpose multiprocessors. This section deals with low cost techniques for detecting errors in multiprocessors in a dynamic redundancy approach. The issues of reconfiguration and
161 recovery are also important, but not discussed in this section. There are several approaches to performing the error detection in multiprocessors. (1) Use of standard circuit level or module level coding and self-checking techniques on components of a multiprocessor. (2) Duplication and comparison using space and time redundancy. (3) Algorithm-based methods. Use of the first techniques is well-known [4] and will not be discussed in this section. We will now review the second and third techniques.
Error Detection through Duplication and Comparison If the multiprocessor system does not provide any lower level techniques for error detection, one can resort to some general system level techniques for error detection using duplication and comparison. In order to provide on-line error detection, all critical transactions performed in a multiprocessor can be duplicated and compared. The duplication and comparison operations can be performed using space or time redundancy. There are several ways this can be achieved. One can duplicate each processor of the multiprocessor and compare the results before communicating to the processor pairs. For example, duplication and comparison in hardware is performed in the Stratus n'iultiprocessor[5], and in the Intel 432 system [6]. An equivalent way of viewing this option is to divide the P processors of a multiprocessor into P/2 pairs. The global common memory consisting of M memory modules can again be divided into M/2 pairs. Comparators can be kept inside each processor and memory module and results of both computations have to match for an operation to be executed. The comparison of results of computations can be performed in hardware at each machine cycle, during each memory operation. If an error is detected by a processor pair, both processors of the pair are powered off, and the computations can proceed on the P — 2 remaining processors, configured as (P — 2)/2 pairs of processors. A second approach is to duplicate the entire multiprocessor, and compare the results of the parallel computations. This option is really not used in practice. In the third approach, one uses duplication and comparison using time redundancy. Such a technique is needed when one does not wish to spend the 100% overhead that is needed in the space redundancy approaches. Specifically, the P processors duplicate the same parallel computation in time, on a different set of processors, such that each processor only affects one copy of the result. This can readily be achieved by dividing the P processors into P/2 groups of
162
1
2
3
4
5
6
7
(a) Original task graph mapping
Id
1
2,5
3,7
4,6
Id
2d
2d,5c
3d
3d, 74
4d
4d,6c
(b) Duplicated task graph mapping F i g u r e 3.2.1 procei*sors
Example of mapping duplicated task graphs on disjoint sets of
two processors (pairs), each of which 'compares tlie results of tlie pair before communicating with other processors. Figure 3.2.1 sliows an example task graph that is duplicated and compared for detection of faults. Case (a) shows that original task graph that is mapped onto 8 processors. Case (b) sliows that the the task grapii is duplicated, and each copy is mapped onto a disjoint set of 4 processors. The results of the two copies are compared twice, each time by a processor from each subgroup. The above
163 approach in the worst case may require 100% time overhead. However, it may actually result in less time overhead than that due to idleness in processors in the original task mapping. Clearly, if there were no task dependencies in the original task graph, one can m a p tasks efficiently onto all processors of a multiprocessor, and keep all processors busy all the time. Hence it is possible to get a perfect speedup of P on P processors on that task graph. However, in the presence of task dependencies, one often finds processors that arc idle, since there are no ready tasks. Such situations give rise to speedups that arc less than P. In such situations, one can m a p the original task graph on P / 2 processors (as shown in Case (b)), get better processor utilization, and use the remaining P/'l processors to perform the duplicate computation of the task graph. Hence, in real task graphs, one can observe less than 100% time overhead. T h e comparison operation can also be performed in software, except t h a t now, the granularity of computation before a comparison can be performed has to be larger than a machine cycle since software comparison takes longer than hardware comparisons. The duplication of tasks and their allocation on different processors of a multiprocessor can be performed such that the duplicated tasks and their comparison tasks get executed on different processors.
Algorithm-Based
Error
Detection
From the above discussions, it is clear that general purpose error detection techniques in multiprocessors can require an overhead of about 100%; in hardware or time. This is because when one uses P processors in a multiprocessor to get P times the throughput of a uniprocessor, in order to achieve fault detection, one is cither forced to use twice the number of processors, to keep the same throughput, or use the same number of processors and get half the throughput. For certain specific parallel applications running on these multiprocessors, it is often possible to use some algorithm-based schemes for error detection at much lower cost in hardware or time. Another advantage of algorithm-based techniques is that it is possible to gain a lot of fault tolerance at low cost with off-the-shelf hardware that has little or no concurrent error detection capability, and most multiprocessors that are currently built fit this class. Algorithm-based fault tolerance (ABFT) was proposed originally by Huang and A b r a h a m [7] for matrix computations on array processors. Since then, algorithm-based fault tolerance (ABFT) and algorithm-based error detection (ABED) techniques have been proposed for numerous applications using array processors and systolic arrays. All of them deal with low cost fault tolerance/error detection techniques specific to the algorithms being executed.
164 4
5
1
8
3
20
4
7
18
2
4
5
8
19
6
3
4
1
14
13
18
21
19
71
•
. Failed Row Check
! Failed Column Check F i g u r e 3.2.2
Checksum matrix encoding
We will now provide a brief review of the algorithm-based fault tolerance method. Conventional d a t a encoding is done at the word level in order to protect against errors which affect bits in a word. Since a faulty proces.sor in a nuiltiproccssor could affect all the bits of a word it is operating on, we need to encode the d a t a at a higher level. This can be done by considering the set of input d a t a to the algorithm and encoding this set. The original algorithm must then be redesigned to operate on this encoded d a t a and to produce encoded o u t p u t data. The redundancy in the encoding would enable the correct d a t a to be recovered or, at least, to recognize t h a t the d a t a are erroneous. We illustrate the application of an algorithm-based checking technique by an e.xample: the tmiltiplication of two N x N matrices [7]. In the checksum encoding, an extra row and an extra column is appended to the original matrix which are the sums of the elements of the columns and rows, respectively [7], For each row or column in the matrix, considered as a vector, (oiao • • • a^), the check elements CS is appended to the vector where
CS = ^ a ,
(3.2.1)
Figure 3.2.2 shows an example of a 4 x 4 matrix, augmented by the row and column checkum encoding. When two such matrices are multiplied, the resul-
165 tant matrix preserves the checksum property. If there is an error in the result matrix element (i,j), it will be identified by verifying the equality of the sum of the row elements with the checksum for row i, and by verifying the equality of the sum of the column elements with the checksum for column j . Suppose the m a t r i x in Figure 3.2.2 is the result of a matrix multiplication, and element (2,2) is erroneous. Then the error will be detected in a failed row 2 check, and a failed column 2 check. Once the erroneous element is identified, the correct element can be reconstructed by taking the sum of all elements of t h a t row (column) except the erroneous element and subtracting this sum from the row (column) checksum. This is illustrated in Figure 3.2.3 which shows a 5 x 4 row checksum encoded m a t r i x multiplied by a 4 x 5 column checksum encoded matrix on a 5 x 5 processor array having row and column broadcasting capability to produce a 5 x 5 full checksum matrix. T h e operation of the array is as follows. For the m a t r i x elements a, j on the left-hand side of the array, the top four rows consist of the information part of the matrix A, and the fifth row is the s u m m a t i o n vector; the elements aij are broadcast from the left boundary of the array to processors in the fth row at time _;'. For the matrix elements bj^ on the top of the array, the leftmost columns are the information part of the matrix B and the fifth column is the s u m m a t i o n vector. The elements bj^k arc broadcast from the top boundary of the array to processors in the kth column at time j . At time j , the processor Pij; performs the product of Ojj- with bjk^ and accumulates the product in a register. Thus after n steps, each processor calculates an element of the C matrix. Subsequently, one performs a verification of the checksum property for each row i and each column j . The verification of elements in row i is performed by processors in row i + 1. If the processor (i,j) in the array is faulty, it will affect the output element («, j ) of the matrix C, hence the checks will fail for row i and column j , correctly identifying the faulty processor. The erroneous d a t a element {i,j) can be corrected by subtracting the sum of all elements of the row i except element (i, j ) from the row checksum for row i. For an A' x N array of processors, the original matrix multiplication can be performed in n time steps. The fault tolerant matrix multiplication needs an (jV + 1) X (jV + 1) array. T h e checksum verification can be performed in parallel among processors of the rows and columns using a tree based s u m m a t i o n , which would take 2log{N) steps. The redundancy overhead is 2/A' in hardware and 2/log{N) in time. For large values of A'^ ( > 100), the overheads are less than 5-10%. The above scheme assumed a processor array of size A' x A'' for performing a matrix multiplication of matrices of size N x N. In reality, since the number of
166
KEY = processor
3
3
14
a a
24 34
^4 3
3
13
a
23
a
33
^43 3
54
a a
3
12
11
22
a 21
32
31 i
^42
*41 ""li
8
3
53 52
F i g u r e 3.2.3
a
51
H M^
Cliecksiim matrix multiplication on an orthogonal array
167 processors is limited (to say F x P array), a partitioned matrix multiplication scheme can be used. The basic idea involves the partitioning of each matrix into an N x P — I submatrix, and performing multiplication of partitioned submatrices with checksum rows or columns for each partition. In order to avoid the effect of a single faulty processor corrupting multiple d a t a elements in the same row or column of the larger result matrix, each partitioned matrix multiplication is performed by rotating the matrix one row and column position at each step. Details of the algorithm can be found in [7]. 'J'he idea of algorithm-based fault tolerance was proposed initially for systolic arrays. While these proposed techniques were interesting, none of the results were practically applicable since there were not many commercially available systolic array processors for actual evaluation of the schemes. Recently, Bancrjee, et. al., have reported on the application of the ABF'T techniques to general purpose multiprocessors such as hypercube multiprocessors [8]. While previous techniques for A B F T ignored the effects of finite precision roundoff errors on the encodings, the actual studies of A B F T reported on an Intel iPSC hypercube measured the real error coverage in the presence of finite precision arithmetic. More detailed evaluations with regard to error coverages, timing overheads, and error latency were reported in [24]. The schemes did not involve any hardware modifications or overhead. We now describe the application of algorithm-based fault tolerance techniques for the m a t r i x multiplication problem on a general purpose hypercube multiprocessor. T h e host computer of the hypercube partitions the matrix ^4 into a number of rectangular strips by rows equal to the number of processors in the hypercube and sends one strip to each processor. T h e complete matrix B is also sent to each processor. Each processor Pi performs the submatrix multiplication Ci = Ai X B using a sequential algorithm. At the end, each processor sends the submatrices of the result matrix back to the host. T h e algorithm is modified to include system level checks based on the checksum matrix encoding as shown in Figure 3.2.4. The elements of each column of the matrix A are summed together to form the row CC(A), called the column checksum of A. The matrix multiplications C = A x B and D = CC{A) x B are performed. T h e column checksum of matrix C, CC(C), is computed and compared to D. In the absence of faults and round-off errors, CC{C) and D should be identical. The A strips are duplicated among processor pairs, such t h a t the two processors (hereafter, referred to as nodes) in a pair are mutual neighbors (or mates) in the hypercube. Node i computes the column checksum of the A strip of its neighbor
168 Node (1
*n IX-(-\, 1
'o U
* i
Node 2
;
'i
* 2
U "3 (A j^}
' • 1
B CC ( * J 1
M:
'•l
Node 3
i -^ ,
._.i'j._..
1
'•' "3
'^3
...i'j-.
.
<^1
F i g u r e 3.2.4 Algorithm-based fault detection for matrix multiplication on a general purpose nuiUiprocessor
CC(Amateii)), and then the D strip of its m a t e , D,nate(i) = CC(Ajnate(i)) x -ONode i then obtains the C strip of its mate, Cmate{i)- Node i checks its m a t e by comparing CC(C mate{i), to D,mate{i) and sends the result (pass or fail) to the host. Finally, the host judges the corriputation error-free if all nodes "pass" otherwise, it judges the computation erroneous. If there is a fault in a node during the regular matrix multiplication computation, it will be detected by the row check with a high probability, since the checksum row for a strip C, is calculated by the node that is the neighbor of Node i. The above checksum encoding scheme was implemented on a 16 processor Intel i P S C / 1 liypercube for cubes of various sizes (4, 8 and 16 nodes), using randomly generated 64 x 64 matrices. It gave high error coverage (75-96% for bit level errors, and 100% for word level errors) while maintaining low time overheads (10-30%). Details of the implementation as well as the experimental results can be found in [8]. Algorithm-based fault tolerance techniques have been applied to a wide range of applications including matrix vector multiplication [9], matrix solvers [10], Fast Fourier Transform [9, 11, 12] Q R Factorization [12], and adaptive filtering [13].
169
3.2.2
Compiler Assisted Synthesis of Algorithm-based Checking
Although algorithm-based fault tolerance (ABFT) and algorithm-based error detection (ABED) are viable approaches in view of their low time and memory overhead, their practical application is limited in that all the parallel applications developed so far have to be reprogrammed manually to provide error detection. It would be desirable to have an automated (e.g. compiler-based) approach to the synthesis of ABED schemes to relieve the user of the task of devising these checks. The basic idea is to have the compiler analyze high-level language programs, identify linear transformations in DO loops, and automatically insert system-level checks based on this property. Most of the ideas regarding compiler-assisted generation of algorithm-based checks were developed in [14]. The target programs for this approach of automatic check insertion by the compiler are parallel Fortran programs, since there is a large body of existing software written in Fortran and it continues to be the preferred language for scientific programs. In subsection 3.2.8 we discuss the implementation of a compiler pass for automatic check insertion using Parafrase-2 [15] as a core. Parafrase-2 is an existing source-to-source parallelizing and vectorizing package developed at the University of Illinois. In this section, we discuss rules by which one may identify linear transformations in loop operations encountered in Fortran programs. Although the implementation discussed here is specifically for Fortran programs written for the Intel iPSC/2 hypercube multicomputer, the theory and implementation are general enough to be applied to parallel programs written for other general purpose multicomputers. Subsection 3.2.3 describes the linearity property and its identification. Subsection 3.2.4 presents compiler transformations to introduce linearity in statements. Subsection 3.2.5 discusses the exploitation of the linearity property to produce system level checks for a multiprocessor system. Subsection 3.2.6 discusses the concepts of symbolic linearity and program linearity. Subsection 3.2.7 discusses various tradeoffs involved in linearity-based checking. Subsection 3.2.8 discusses the implementation of a compiler-assisted check generator. Subsection 3.2.9 presents the results of running several programs through the automatic check generator. Subsection 3.2.10 presents a summary of the discussion in this section.
170 D0I=1,M DOJ= 1,N DOK=l,L C(I,J) = C(I,J) + A(I,K) * B(K,J)
ENDDO ENDDO ENDDO F i g u r e 3.2.5
3.2.3
Matrix multiplication code
Linear Transformations in Fortran Loops
In this section, we address the issue of linear transformations in Fortran DO loops and their identification at a symbolic level. We restrict ourselves to operations on two-dimensional arrays in order to relate to practical numerical algorithms; the extension to multi-dimensional arrays is very simple, and will be discussed later. We consider indexed program statements of the form 5 ( / ' , / " , . . . , /*), where k denotes the number of distinct loops surrounding S and P is the index of the Jth loop. For example, in the matrix multiplication code shown in Fig, 3.2,5, the single indexed statement is of the form S{I, J, K) and has ^ = 3.
The Linearity
Property
We would like to identify statements S which are linear transformations, i.e., satisfy the linearity property. In order to present the most general form of such statements, we introduce some notation. We assume that the output is a two dimensional array involving loop indices I and / . (1) A[l, J] denotes a two dimensional array variable whose subscripts are real functions of 7 or J; hence, it is equivalent to A{fi{I), /i'( J)) or A{f2(J), / i ( / ) ) , where /i and f^ are arbitrary real functions. Examples of such variables would be Ail + i,2J + 1) and A(J^,[). (2) A[I, J] denotes a set of such variables, {>li[I, J], A2[l, J],. . •}.
171 (3) A[i\ denotes an array variable which involves only / as a subscript. It is where A ' c a n equivalent t o / ! ( / ( / ) , A'), A{K,/{!)), A{fi(I), MJ)). OT A{f{I)), be a constant or a function not involving / or J . A[i] is similarly defined. (4) A[I] and B[J] denote sets of array variables A[\] and B[3], respectively. T h e general form of the potentially linear statement can now be written as D[\, J] = / ( / , J, A[I], B[J], C[I, J])
(3.2.2)
where / is some real-valued function involving array variables similar to the ones defined above, and possibly, the variables / and J, independently. Note that the above form does not exclude £'[1, J] from appearing in the R.H.S. as well, which would then be a part of C[I, J ] , When this statement is placed inside a multiply nested DO loop structure with loops having / and J as indices, array D gets updated according to the values of / for allowed / and J values (and values of indices corresponding to other loops, if present). As an example, consider the following statement D{I + 3, J ) = A{I, 2) * 5 ( L , J + 4) + {B(1,2I)f/C{J,
I)
(3.2.3)
This is consistent with the general form where £)[!, ,1] = £)(/ + 3, J ) , A [I] = {.4(7, 2), 5 ( 7 , 27)}, B[J] = B{L, J + 4 ) , and C[I,J] = C(,7, 7). It can be similarly verified t h a t the indexed statement in matrix multiplication code of Fig. 3.2.5 conforms to the general form. Let 7t:, 1 < Ar < m be values (not necessarily all distinct) t h a t 7 takes in course of the loop iterations, and let Wk, ^
=
(b = i
m
m
m
/(^u;i7fc,J,^«;fcA[Ik],B[J],^u;,C[Iu,J]) k=l
k=l
(3.2.4)
k=l
In the above equation, s u m m a t i o n over an index of a set, say A , is assumed to yield a set whose elements are values obtained by s u m m a t i o n s over the index under consideration of the arrays belonging to set A. Note that the L.H.S. of the above equation is nothing but X^StLi'"*^['k, J]- In some algorithms, I may be restricted to the first dimension of all array variables; it is then a row variable and 7-linearity then translates to roM>-linearity. \Vc can similarly
172 define J-linearity (or co/umn-linearity in specific cases). Z + J-linearity (or rowco/«mn-linearity in specific cases) means that the function / is linear both in / and J. For example, matrix multiplication is a linear operation, as the statement inside the triple loop satisfies the linearity test for 7, with rn = M , as shown below
J2w,{Cih,J)
+ A(h,K)*BiK,J))
= ^»;,C(/,,J)
k=\
k=l M
+(Y^WkA{hJ<))*B{K,J)
(3.2.5)
k= l
The above relation holds because of the distributivity of multiplication over addition. It can be similarly shown t h a t the statement is ./-linear as well. This / + ./-linearity translates to row-column linearity as I is limited t o the first dimension and J t,o the second. In general, m equals the number o f / - l o o p iterations, i.e., the linearity property is applied across all allowable I values. In addition, if in is equal t o the size of the input and o u t p u t arrays (indexed by / ) in the /-dimension, then this is called full tineartiy. As an example, the linearity test for matrix multiplication shown above employs full linearity as the number of /-loop iterations is equal to the row size of matrices A and C However, for a particular algorithm we might need to apply the linearity property only for a subset of the / values; we are then using a suh-linearity property. This is true if / , in course of its iterations, covers only part of the relevant arrays in the /-dimension. Such a case often arises when the / loop bound(s) is not a constant but depends on an outer loop index or the loop stride is not unity. The extent of linearity is also determined by the range of J values for which / is linear in / ; usually this range covers all the values t h a t J takes in course its loop iterations. It can be easily seen t h a t the application of the linearity concept to statements involving multi-dimensional arrays requires just a simple extension of the above definitions. In this case, the o u t p u t variable can have an arbitrary number of loop indices in its subscripts and the statement is linear with respect t o those indices t h a t satisfy the linearity property as defined above. Note t h a t a statement which writes to a single dimensional array is just a special case of the form of Eq. 3.2.4 where one of the loop indices in the o u t p u t variable is replaced by a constant.
173
Detection of Linearity One can determine whether a given transformation is linear by applying the above linearity test to it. However, the following similar looking statement Dil, J) = C{I, J) + Ail, K) * A{I, J+ 2)
(3.2.6)
is not /-linear since both A{I, K) and A{I, J + 2) vary with / and hence the distributive law is no longer applicable. A brute force application of the linearity test is not feasible for lengthy and complex statements; moreover, it involves algebraic manipulations and consumes valuable time. We therefore discuss automatic and fast detection of linearity by the compiler. We thus need to address the problem at a symbolic (structural) level. Let expi and exp2 be two expressions occurring as parts of the R.H.S. expression of an assignment. Using the commutative and associative propertics of addition, and the distributivity of multiplication over addition, one can translate the linearity test to the following three recursive linearity rules: (Rl) expi ± exp'2 is linear in I{J) if and only if both expi and exp2 are linear in7(/). (R2) expi * exp'j is linear in / ( J ) if and only if either expi is linear in I{J) and expo does not involve / ( J ) , or vice versa. (R3) expi/exp2 is linear in I{J) if and only it expi is linear in I{J) and exp2 does not involve / ( J ) . In addition, we have the following four fairly obvious linearity rules for simple expressions and functions: (R4) A constant is not linear in any variable. (R5) A simple variable is linear in itself. (R6) An array variable is linear in its subscript variables. Finally, we have the following linearity rule for functions: (R7) A function of an expression is linear in the variables the expression is linear in, if and only if the function itself is linear. Examples of non-linear functions are trigonometric functions, square roots, etc.
174 In order to determine whether a particular statement is linear, the compiler needs to check the satisfiability of the above rules Rl — Rl by the statement. For example, the statement C ( / , J ) = ^ ( J ) + r>(/ + 2,J)
(3.2.7)
is not /-linear (from rule /21), since A{J) does not involve / and so is not /-linear (from rule RQ). However, the statement is J-linear as the R.H.S. expression satisfies rule R\ w.r.t. J. The statement C{1, J) = A{I, J) * B{I) 4- D{I)
(3.2.8)
is neither / nor J-linear. A{1, J) and B{1) belong to the same product term and both involve /. Hence, this product term violates rule R2. Moreover, D(I) does not involve J at all. It can be similarly verified that statement 3.2,3 is neither / nor J-linear (rule RZ violated), statement 3.2.6 is J but not /-linear and the matrix multiplication assignment statement is linear in both / and J.
3.2.4
Compiler Transformations to Introduce Linearity
Many assignment statements encountered in Fortran programs do violate rules Rl — Rl for both / and J and hence are not linear. However, it is possible to restructure a non-linear statement such that it satisfies these rules for at least one loop index which then qualifies it as a linear transformation. Again, this restructuring needs to be done at a symbolic level. The following four simple techniques can be used by the compiler to introduce linearity into statements (TI) If an expression of the form expi -f- exp2 is such that one component does not contain the loop index in question, the variable can be expanded to accommodate that index. This creates an instance of the variable for every iteration of the DO loop corresponding to that index. For example, statement 3.2.8 can be made J-linear by expanding D{I) to /?(/, J ) . The trade-off here is the introduction of linearity at the cost of additional memory. However, scalar expansion is a technique employed by most parallelizing compilers [16] to reduce edges in the dependence graph, so in fact, the task check generation at compile time may be made easier after parallelization by a parallelizing compiler. Thus, a parallelizing compiler often unknowingly introduces linearity in some statements. (T2) If an expression of the form exp\ * exp2 is such that both exp\ and exp^ contain the index in question, variable substitution can be performed by substi-
175 tuting this expression by a single variable containing the relevant loop indices. An additional statement needs to be placed before the restructured statement that simply assigns the multiple variable term to the single variable. For example, the statement E{I.,J)
=
B{I,K)*C(K,J)
+
+A{IJ<)*A{I,K
+ 2)
B(I,J)*{D(L.J))(3.2.9)
is not /-linear due to the presence of the expression .4(7, A')*.4(7, 7\' + 2 ) which has both components involving 7. Restructuring this statement produces the following code where the second statement is now /-linear 7^(7, A') = ^ ( 7 , A') * A{I, K + 2) E{L J) = S ( / , K) * C(A', J ) + B{I, J) * (D(L, J)f
(3.2.10) + F{L A )
(3.2.11)
Very often, the extra computation due to the substitution process is small compared to what is saved by introduction of linearity since this can avoid duplication of the original statement. (T3) If an expression of the form exp\/exp2 violates rule 7?3, then variable substitution can be performed similar to the one described above, which removes variables containing the loop index in question from the denominator. For example, statement 3.2.3 can be made J-linear by substituting 1/C(7, J ) by say, £"(7, J). Again an additional statement needs to be introduced which effects this substitution. (T4) Violation of rule Rl can be similarly taken care of by substituting the non-linear expression by a simple array variable. It may be noted t h a t technique T l is also applicable to expressions where the non-linear component is not a simple expression. In such cases, the above techniques can be applied to this component. Such a recursive approach is discussed in detail in subsection 3.2.8. It is i m p o r t a n t to note t h a t restructuring techniques T2, T 3 and TA might increase the program length (and hence computation) by an amount sufficient to offset any gain that can be obtained by introduction of linearity. Hence, they should be used with discretion.
176
3.2.5
Linearity-Based System-Level Checking
In this section, we will discuss how linearity can be exploited to provide lowcost system-level checks for concurrent error detection in multiprocessors by the use of the checksum encoding.
The Checksum Encoding Scheme Once a statement is identified to be a linear transformation (or is restructured to satisfy the linearity property), it is possible to use this information to develop system-level checks for detecting errors during computations. Once an index in which a statement is linear has been discovered by the compiler, or the statement has been restructured so that it is linear in a particular index, we may use Eq. 3.2.4 to compute and compare the L.H.S. and R.H.S. values. In the absence of errors, these values should be equal. As an example, let the statement be /-linear. First compute XIT^i'^'Jt^f^k, J], for a particular value of J. We will denote this as the L.H.S. computation. Also compute / ( Z l i ' L i ^^k:h,J, YJk = i "'* A[Ik], B[J], YJk = i WtC[I|(, J]), by first performing the s u m m a t i o n s of the array variables belonging to sets A and C and possibly the s u m m a t i o n of the I values, and then computing the value of the function / for the same J. This is the R.H.S. computation. These values are then compared with each other; a mismatch indicates error(s) during normal computations. This procedure is repeated for other values of J. The role of the compiler is to detect program statements which are linear, restructure some of the nonlinear ones and then insert the above system-level checks automatically into the program. T h e sums of the array variables computed above, can be regarded as encodings on the i n p u t / o u t p u t data, and are referred to as weighted checksums. It is customary to use unity weights Wk for computing these checksums, which are then known as unity checksums or just checksums. If we restrict / as a row variable and J as a column variable, then summations which exploit the Ilinearity property would involve array values belonging to the same column and are called column checksums (since / values change but J remains the same); s u m m a t i o n s for ./-linearity would involve values in the same row and are called row checksums. These terms carry over directly from the checksum encoding proposed for matrix multiplication [7]. Sometimes a statement may fail the linearity condition only because of the presence of a constant in the R.H.S. expression (rule R4). A straightforward application of restructuring technique T\ will assign the constant to a single
177 dimensional array. During the checking phase, the array values will be summed over 1 to m. A more efficient way of performing this computation is to simply multiply the constant by m; hence, technique Tl need not be applied at all for such non-linear statements.
ParaUelization Issues We first briefly review some basic concepts regarding data dependencies in program statements. Let ^i and ^2 be two statements inside a loop. S\ and 52 are said to be involved in a flow dependence (denoted by S16S2), if the result of 5i is used by 5? in the current or a subsequent iteration. If in addition, we have S26S1, then these two statements are involved in a cycle and the loop has to be executed in a serial fashion. In general, cycles involve other kinds of dependencies as well; details of the dependence notions can be found in [17]. Absence of cycles enables the iterations of the loop to be executed completely in parallel. Note that these concepts can be extended to statement blocks also [16]. Once parallelism has been discovered in a loop, it can be marked as DO, DOALL or DOACROSS, depending on whether the iterations need to be executed serially, can be executed completely in parallel or have a serial execution with some degree of overlap between successive iterations [17]. DOALL and DOACROSS loops can be executed in parallel machines by partitioning the set of iterations and assigning a different set to each processor. This is called loop concurrentization [18]. We refer to the concurrent loop index as the parallehzation variable. Concurrentization often requires the insertion of synchronization instructions within the body of the concurrent loop [19]. These might be needed in critical regions of the loop body where processors access shared data, or for ensuring proper ordering of the loops during execution. In distributed memory machines synchronization is performed by communication of proper data among processors. We will focus on parallehzation and data integrity checking for a distributed memory multicomputer. A very important issue in parallehzation is the consistency of data distribution with the parallel loop constructs in the program. This is especially true for programs with multiple DO loops one after the other. The parallehzation of one loop might demand distribution of the input matrices by rows whereas it might be necessary to have a column distribution for another loop. A simple solution is to have the input data distribution governed by the largest parallel DO loop in the program.
178
D0ALLI=1,M
D0J=1,N
D[U] = / ( A[I], B [J], C [U]) ENDDO
ENDDOALL
F i g u r e 3.2.6
Parallel loop under study
Checking on a Distributed Memory
Multicomputer
Let us now apply the checksum encoding scheme discussed earlier to a specific distributed memory multicomputer using unity checksums. T h e procedure will be different depending on whether the checking variable and the parallelization variable are the same or not. In order to clearly illustrate the concepts involved, we assume a multiple loop structure where the /-loop is normalized (the lower bound and loop stride are both unity) and is completely parallelizable. In fact, compilers do normalize loops before optimizing them [16]. We further assume without loss of generality, that the J-loop is inside the /-loop. The parallel loop structure is shown in Fig. 3.2.6. The outer DOALL loop is concurrentized so t h a t the iterations are assigned to the processors in a wraparound fashion; here / is the parallelization variable. The input d a t a is distributed accordingly. Hence, if there are P processors numbered 0, 1, ..., P-\ assuming t h a t M is divisible by P, processor A' gets M/P iterations, and A[I] and C[I, J] values, corresponding to / = K + \,K + P+\,...,K + M — P-\-\. For the following discussion we let Z denote the set 1, 2, ..., M, and let ZK denote the set /i + 1, A ' + P + 1,. . . , / \ + M - P - f 1. I-lineariiy
based
Checking
Now suppose t h a t the statement inside the J-loop is detected to be /-linear. We employ / as the checking variable. The sublinearity property may be used
179
Processor 1 1,5
I-iteralions 2.6
Norma] computations:
Normal computations:
I-iterations
D0J = 1,N Dll.Jl = f(A(l|,B|Jl,C|I.Jll D|5.Jl = f|A|5),B(J|.C|5,Jl)
DOJ= l.N D(2,Jl = f(A|2|.B|J|,C(2,Jl) D|6,J| = f(A|6|,B|Jl.C|6.J|) ENDDO
ENDDO
Check computations:
Check computations:
DOJ = l.N E(J>=D|1,J] + D|5,J| F(J) = f|A|2)+A16),BU),C(2.J)+C16,J)) ENDDO
Figure 3.2.7
DO J = 1 ,N E(J) = D[2,J| +D|6,J1 F(J) = f|A|ll+A|,51,B|J),Cil,Jl+C15.J)) ENDIXI
An I-linearity based checking scheme
to perform the checking. On processor A', we compute XT/g^ • ^[^^ J]^ ^"'^ / ( S / E Z K • ' ^ W ' ^ C ' ' ] ' XTIEZK *-'t^'J])' f^"" different values of J . The check computations are split across processor pairs, preferably neighboring processors; they cooperate with each other to obtain and compare the checksums. Let us illustrate the sub-linearity based checking scheme for M = 8 and P — 4. Fig. 3.2.7 shows the normal and check computations performed in processors 0 and 1, with respect to the linear statement only. T h e execution of other statements proceeds as in the normal algorithm, and is not explicitly shown here. We consider a row cyclic mapping, where processors 0 and 1 get d a t a corresponding to iterations {1,5} and {2,6},respectively. Each of these d a t a sets is operated on by a set of checksum calculations. The L.H.S and R.H.S. calculations for the sets {1,5} and {2,6} need to be performed on different processors. In this
180 case, the L.H.S calculation for set {1,5} is performed on processor 0; processor 1 performs the R.H.S. calculation. The roles are reversed for set {2,6}. This requires both the processors to have each other's d a t a sets before the check computations can proceed; this requires a modified mapping which duplicates d a t a in pairs of processors. The values at the end of the checking phase are then mutually exchanged and compared, which requires additional communications. Suppose a fault occurred in processor 0 during these computations. It produces an error with high probability in D[l, J] + D[5, J] a n d / o r f{A[2] + A[6], B[J], C[2, J] + C[6, J ] } . Since their counterparts in processor 1 are calculated correctly, the comparisons in that processor will produce a mismatch and signal a fault. T h e role of the compiler is to a u t o m a t e the checking process by inserting statements inside the node program that perform the check computations and communications together with the comparison checks. T h e time required for the check computations is much lower than the normal computation time, especially for large input arrays (high M and A') and functions / which require a significant amount of computation. Each function / is computed N * M/P times during the normal computation phase in each processor, whereas it is computed only N times during the checking phase. This is evident from Fig. 3.2.7, although the overhead for this example is considerable since we have chosen small dimensions for ease of explanation. Two extra communication steps are required for exchanging the E and F arrays. T h u s , the relative overhead for the / computations is 0(P/M), which is very low for high M. The time required for the computation of the various checksums, which consist of addition operations, is usually much less than the time consumed by the / computations which may involve several floating point operations. jjpt us illustrate these ideas for matrix multiplication. Fig. 3.2.5 shows that the / (J) iterations are completely independent of each other; hence, we can employ I (J) as the parallelization variable. Let us assume without loss of generality that the 7 iterations are performed in parallel. Concurrentizing the 7-loop is equivalent to distributing the A matrix by rows to the hypercube processors. Also, the o u t p u t C matrix resides in the processors in a row distributed fashion. This is because the A and C array variables are functions of / . The B matrix is replicated on all the processors; note t h a t the B variable does not involve / . If there are P processors (assume that M is divisible by P ) , then each processor gels M/P iterations of the I loop, i.e., a submatrix of A of size M/P x L. Each processor then multiplies its own submatrix with the complete B m a t r i x in parallel, and produces a submatrix of C of size M/P x N. These submatrices are then .sent to the host, which reorganizes them to generate the complete C matrix.
181 i.ie + '5.1
"5,2
+ Cl6,I CC(A)^
ZA
LA
F i g u r e 3.2.8
CC(C) = E C
Column checksum encoding for matrix niultiplication
The normal algorithm requires no row communications; hence, there is no particular advantage in using row checksums. We then opt for an I sub-linearity based column checksum approach, for purposes of illustration. Let M = N = L — 16, and P — 4. Fig. 3.2.8 shows the computations (normal and check) for the submatrix multiplication in processor 0, which receives the /-iterations 1, 5,9, and Processor 0 calculates the column checksums of the submatrix of C and stores them in an array CC'(C). The column checksums of the submatrix of A are calculated in the neighboring processor of A, and stored in an array CC{A). The neighbor also performs the multiplication CC{A) x B and stores the resultant vector in array D. The arrays CC{C) and D are then compared element by element; any mismatch indicates errors during computation. The normal computation in each processor requires 0{MNL/P) operations, whereas the check computation requires only 0(ML/P + LN) operations. A constant number of communication steps are needed for the exchange operations for comparison. The relative overhead is, thus, 0{l/N+P/M). Hence, the checksum encoding scheme described here is, indeed, a low overhead scheme. J-hneartty based Checking Suppose that the statement inside the J-loop is linear in J. One could then use J as the checking variable and employ a J-linearity based checking scheme. Using a full linearity property necessitates the computations ^j^y ^i^' "^li ^^^ /(A[I],EieYB[J],Ej€YC[I,J]), for different values of/. Here, Y = 1,2, ..., N is the set of all values that J takes in course of its iterations. Since, our loop distribution scheme assigns all iterations of J for a particular value of / to the same processor (as is clear from Fig. 3.2.6, the entire check computation for that value of / can be performed in the same processor. However, a fault in the processor will invalidate the results. Hence, the L.H.S and the R.H.S.
182 part of the computation is split among neighboring processors and the results at the end are exchanged and compared. It can be easily seen that the relative overhead is 0{\/N)^ and is low for high A'^. Since the matrix multiplication assignment statement is linear in J, the above scheme is applicable. This results in a row checksum encoding of the matrices B and C where each processor computes the row checksums of its B matrix and its C submatrix. The rest of the computations are very similar to the column-checksum approach and are not discussed here. It is worthwhile mentioning here that it is possible to have loop constructs where the parallelization variable is neither I nor J, which means that the data distribution is governed by some other loop index. The / (J) iterations may or may not be distributed across processors which again depends on the relative location of the / {J) loop with respect to the parallel loop. But the techniques illustrated above are still applicable where the extent of linearity and the number of processors participating in the check depends on the checking variable loop distribution. The checking schemes described above are applicable to program statements which are linear in either / or J (or are restructured to be linear). Statements not satisfying the linearity property are duplicated on neighboring processors, and the results are exchanged and compared. In any case, duplication of a particular segment of the program might be preferred over linearity-based checking (if applicable) when that code is executed only in certain processors in the normal algorithm and other processors are idle at that time. This is often true for program segments outside the DOALL loops. In such a situation, duplication does not incur any computation overhead though some communication needs to be done to exchange results of code.
3.2.6
Symbolic Linearity and Program Linearity
The successful application of the checksum encoding scheme to a linear statement in a program depends not only on the statement's structure, but also on the program context of the statement. This is best illustrated by a simple example. Let us consider the I + J linear statement in the double loop structure of Fig. 3.2.9. Using / as the checking variable, one cannot perform the summation of the B array values for different values of / since K is also dependent on / (the distributive law is not applicable). However, the use of J
183 D0K=1,L
D0ALLI=1,M
D 0 J = 1,N
D[U] = / (
A [ I ] , B [J], C
[I,J])
ENDDO
ENDDOALL ENDDO F i g u r e 3.2.9
Example loop illustrating program linearity
as the checking variable circumvents the above problem as K is independent of J. We introduce the concept of program linearity of a statement to address such situations. A symbolically linear statement is said to be program linear in the context of a particular program, if this property can be directly exploited to develop a checksum-based encoding scheme for on-line error detection. Thus, our example statement, although symbolically linear in both / and J , is program linear only in J but not / . In the future sections, unless otherwise indicated, we use the term "linearity" to indicate symbolically linear statements which are program linear as well. Typically, we check a statement in a loop body for symbolic linearity with respect to the loop index variable. However, the statement may involve additional variables which are different from the loop index (potentially linearizing) variable but are nevertheless dependent on it, perhaps because they are altered within the body of the loop. Such variables are called induction variables. In essence, the idea behind checking for program linearity is to replace all induction variables by the loop index variable and checking the resulting statement for symbolic linearity with respect to the induction variable in the usual manner. If the transformed statement is found to be symbolically linear, the original statement is program linear.
184 We discuss a procedure for determining whether a statement is program linear in a later subsection.
3.2.7
Trade-offs in Algorithm-based Checking
This section discusses various trade-offs involved in checking related to the nature and extent of the checksum encodings. These trade-offs are essentially options to the compiler and depend on the program structure as will be described below. The objectives are: (1) high error coverage, (2) low performance overhead, and (3) low error latency. These objectives are mutually conflicting: realization of (2) can only be achieved at the expense of (1) and (3), and viceversa. For a better understanding of the issues involved, we assume that / is a row variable and J is a column variable, which is indeed true for many algorithms.
Location of Encoding T h e location of encoding depends on the nature of linearity in the statements inside the multiple loop structure. If the statement is /-linear, then a column checksum encoding would be employed. On the other hand, J-linearity implies that a row checksum encoding scheme be used. As before we assume that / is the parallelization variable which means t h a t the input arrays are distributed by rows to the processors. C o m p u t a t i o n of column checksums in general, require elements from different processors. Moreover, the computed values might have to be exchanged. All this implies additional communication. However, error propagation is higher as a fault in any processor can manifest itself as an error in the checksum value. In contrast, row checksums can be computed within processors. In fact, they can be treated as additional array elements and appended to the rows. The advantage is t h a t if the normal algorithm requires communication of rows between processors, then these checksum values can be communicated along with them thus saving in start-up time. T h e disadvantage is error masking: during a checksum computation, an erroneous processor might create errors in opposite directions which could cancel out and produce a correct checksum value. However, the possibility of such error masking is quite low in realistic situations.
185
Extent of Chexking One could use the full linearity property (if the checking variable covers the full range in its dimension) or a sub-linearity property. Again we assume that / is the parallelization variable so that the input matrices are distibuted by rows. This is particularly relevant in the case of column checksum encodings. Full linearity requires that all processors contribute to the checksum values thus increasing error propagation, as described earlier. However, additional communication is required. Sub-linearity restricts the summations to elements within processors. There is a trade-off here in the form of reduced communication overhead at the expense of error masking. A sub-linearity based column checksum encoding behaves very similar to a row checksum encoding in terms of fault propagation and the time overhead values. Note that sub-linearity based encodings also require communication for the exchange operations, but its localized nature suggests a lower overhead than that for the full linearity based encodings. Sometimes, sub-linearity is the only solution when conditional statements are present inside loops. It is possible that the execution of a linear statement inside a loop structure is governed by a conditional prior to the statement but inside the loop structure. This implies that the statement might be executed only for certain values of the checking variable. Hence, the extent of checking is automatically limited and sub-linearity needs to be employed. The extent of checking is also determined by the number of column (row) checksums computed during checking; the encoding can be applied to all columns (rows) or to particular columns (rows) only. Moreover, if some statements are both / and J linear then I + J linearity can be used, and both row and column checksums can be employed. The trade-off here is an increase in fault detection due to the additional checks at the expense of greater overhead.
Frequency of Checking It is possible that the checking variable loop occurs inside a serial loop in which case this loop needs to be executed repeatedly inside all processors. The frequency of checking then refers to how often the checksum calculations are done during the course of the algorithm. They can be performed after each iteration of the loop or at the end of some iterations only. The higher the frequency of checking, the higher the fault detection capability and the higher the time overhead. Let J be the checking variable in the example loop structure of Fig. 3.2.10. Then the check computation could be done after the normal
186 D0K=1,L DOALLI=l,M
DOJ=l,N
D[U] = / ( A [ I ] , B [J], C [ U ] ) ENDDO
ENDDOALL ENDDO F i g u r e 3.2.10
Example loop illustrating checking frequency
computation for every iteration of the K loop; the other extreme is when the checking is performed at the end i.e., after the normal computation in the Lth iteration.
Generation of Encoding Many programs have multiple loop structures containing cycles where the output array variable in a statement is also present as input. For example, consider the loop structure of Fig. 3.2,10 again where Z?[I,J] € C[I,J], and is involved in a cycle with respect to loop index K. We might desire to perform checking after every iteration of the K loop. Instead of computing the input row checksums for the D array for each iteration, it is desirable to store the output checksums computed in the previous iteration and then use them for the current iteration. Thus, the D array could be expanded to accommodate the row checksum column as £>[M+1, J] = /(A[I], Ylj^i B[J], Ylj=i C[I, J]). The checksums are treated as part of the input array and included in the normal computations. Contrast this to the schemes described earlier, where the checksums are computed explicitly.
187 Another advantage of storing the checksums is t h a t checking need not be performed after every iteration and yet errors during normal computation for any iteration can be detected. This is because any error during normal computation destroys the linearity property and invalidates some (or all) of the checksums; note t h a t if this computation is performed in a single processor, a fault in it could cause error masking and keep the checksums validated which means t h a t the error will not be detected. T h e checksums remain invalidated for subsequent iterations where there are no errors and (absence of) linearity is preserved. This can be detected when a checking operation is performed after some iteration. Two errors in two different iterations can mask each other and revalidate the checksums; however, the probability of such an event is low.
Memory
Requirements
T h e classic trade-off between memory required and time taken is evident in variable expansion for program restructuring and checksum storage. Variable expansion of a variable introduces linearity into statements, which can be exploited to reduce the checking time overhead, but increases the dimension of the variable by one, which increases storage requirements. For example, expansion of A(I) to A{I,J) adds M * (N — I) elements to the original array A, where M and TV are the upper bounds of the / and J loops respectively. Variable expansion is a practical optimization only as long as the added memory is within permissible limits. On a similar note, checksum storage to reduce computation time is feasible as long as the memory requirements are reasonable.
3.2.8
Implementation of a Compiler for Synthesis of Algorithm-Based Checks
In this section, we discuss the implementation of a compiler for a u t o m a t i c detection of linear statements in Fortran programs, restructuring of non-linear statements, and subsequent insertion of suitable code to perform on-line checking, based on the concepts discussed in the previous sections. This compiler is essentially an enhanced version of Parafrase-2 [15][20], an existing vectorizing/parallelizing compiler developed at the University of Illinois, which has been implemented as a source to source code restructurer. This compiler, termed C R A F T (for CompileR Assisted Fault-Tolerance) has been discussed by Balasubramanian in [14, 21]. T h e results of application of this package to two sample routines each from the LINPACK and EISPACK libraries, and to two Perfect
188 Benchmark routines, are presented. T h e results of experimental evaluation are presented on an Intel i P S C / 2 hypercube.
Review of Parafrase-2 Parafrase-2 provides a reliable, portable and efficient research tool for experimentation with program transformations and other compiler techniques for parallel supercomputers. One of the major features is the ability to allow different source languages as input to the compiler. This is accomplished by means of an intermediate representation used by the core of Parafrase-2 and hence transparent to the user. A property of the intermediate representation is the ability to recreate the source program. In order to make this intermediate representation transparent to the user, a preprocessor is used to transform each input language to the intermediate representation. T h e input to Parafrase-2 is a set of d a t a structures, organized in the form of a parse tree [22] so as to allow recreation of source code. This set of d a t a structures is constructed from the input program by the preprocessor associated with the corresponding language. A pass of Parafrase-2 operates on the d a t a structures to transform the program to a form suitable for parallel execution. Passes can be executed in a number of orders. This is achieved by insisting t h a t tlie form of the system d a t a structures be left invariant by each pass. T h e o u t p u t of Parafrase-2 is the modified intermediate form, which is acted upon by a post-processor to produce the requisite o u t p u t language. Access to any fields of the standard d a t a structures is exclusively via macros [20]. Of particular interest to us are Fortran assignment statements and the expressions and macros associated with them. As an example, the expression corresponding to the statement K=:K
+ L*M
(3.2,12)
is stored as the expression tree shown in Fig. 3.2.11. Each node in the tree is an expression and points to zero or more children. T h e ciiildren are ordered and stored in a linked list, and are referred to as operands of the expression. For example, the operands of the parent expression of this tree are A' and K + L*M, where the second expression again has two operands.
189
F i g u r e 3.2.11
The LINTEST
Example expression tree
Pass
We now describe the LINTEST pass, which performs automatic detection of linear statements in a given program followed by linearization of other statements. Since it operates actively only on statements inside DO loops, it needs information about the nesting level of each statement. This information is supplied by the 'donest' pass of Parafrase-2 [20]. LINTEST is executed in two phases or sub-pa-sses. The first of these is the GETLIN phase, which goes through the entire program and identifies symbolically linear statements. The second phase uses the above information and calls one of the following two procedures, depending on the nature of the statement that is being currently processed: (1) the PROGLIN procedure, which determines whether a symbolically linear statement is program linear as well; (2) the MAKELIN procedure, which selectively linearizes a non-linear statement.
The GETLIN
Phase
This phase performs step-by-step detection of symbolic linearity for all assignment statements inside loops. The recursive nature of the expression tree corresponding to a particular statement suggests a direct application of the recursive linearity rules discussed earlier. Note that rule Rl also applies to assignment expressions i.e., expi = exp2 is linear in a particular variable if and only it expi and expo are individually linear in that variable. Each node of the
190
F i g u r e 3.2.12
Example expression tree for illustrating GETLIN phase
expression tree has a lineariiy set associated with it, which consists of ail simple variables that the node expression depends on, including additional information about the linearity of this expression with respect to these variables. First, the linearity sets of the leaves are generated by using the linearity rules R4 — R6. This requires accesses to the leaf expressions which is performed in a top-down manner starting from the root of the expression tree. The linearity sets for the non-leaf nodes are generated by an appropriate merge of the linearity sets of its children. This merge depends on the type of expression associated with the node and the corresponding linearity rule to be selected from HI — Ri. Consider the following statement A{I,J)
= BiI,J)
+
K*C{I+l)
(3.2.13)
Fig. 3.2.12 illustrates the above bottom-up approach adopted for /-linearity detection of this statement. Shown in the figure is a subset of the linearity set at each node, to indicate only the linearity variables i.e., the variables in which the expression is linear in. The linearity set generated for the parent expression of each assignment statement, is saved for further use by the remaining phase of the LINTEST pass and by other passes. Another activity that is concurrently performed with symbolic linearity detection is the generation of an outset for each loop in the program.
191 which consists of the output variables of all assignment statements inside t h a t particular loop. These outsets are employed for program linearity detection, as will be clear from the description below.
The PROGLIN
Procedure
We return to the problem of determining which symbolically linear statements are also program linear. Recall that we introduced the notion of program linearity of a statement in subsection 3.2.6. The key step in the identification of program linearity of a symbolically linear statement is to determine whether the input variables to the statement are functions of the iteration variables (the loop indices) associated with loops t h a t the statement is nested in. This, in the strictest sense, requires a detailed flow analysis of the entire loop body for each of these loops. Parafrase-2, although targeted to achieve this objective, currently does not implement certain necessary optimizations such as forward substitution, induction variable substitution, etc. However, we observe t h a t , in general, o u t p u t variables of assignment statements inside a particular loop do depend on t h a t loop index, unless they are assigned to a constant expression; this is a relatively rare occurrence. Based on this observation, the outset of each loop is generated in the GETLIN phase itself. Any variable occurring in the outset of a particular loop is assumed to be dependent on the iteration variable of t h a t loop. T h e P R O G L I N procedure operates on each statement t h a t has been identified to be symbolically linear by the G E T L I N phase. It uses linearity set of the s t a t e m e n t currently under test, and the outsets of the various loops to determine the program linearity of the statement. The basic steps are as follows: (1) For each linearity variable of the statement, access the body of the innermost loop in which the statement is nested in, to check for program linearity in relation to the corresponding loop index. (2) Proceed only if the linearity variable is an induction variable i.e., is or depends on, the loop index. Otherwise, access the next surrounding loop, if present. (3) Determine whether an input variable to the statement (other than the linearity variable) is, or depends on, the loop index. (4) If the above condition is satisfied, check whether the input variable occurs independently of the loop index in the statement. If this is true, the statement
192
SI S2
K= 1 DOI= 1,M L = P2 DOJ= 1,N A(I,J) = A{I,J) + K*C(I) A(K,J) = B(K,J) + C(J)*D(K,L) ENDDO K=K + 2 ENDDO
F i g u r e 3.2.13
[L.A.B.K]
(UJ lA,Bj 11 J,KI
^Kjx; (Kl
Example loop for illustrating PROGLIN procedure
is not program linear in relation to the cvirrent loop index and the next outer loop is considered in turn. (5) However, if the condition in (3) is not satisfied, check whether the arrays, if any, in which this variable occurs independently of the loop index, depend on the iteration variable. Again, only a positive result indicates that the s t a t e m e n t is not program linear and the next loop is accessed. To facilitate the execution of steps (4) and (5), each linearity set contains additional information about arrays in which the elements of the set are involved. This is obtained in the GETLIN phase. Consider the loop structure of Fig. 3.2.13. Indicated on the right corner are the linearity sets of the relevant assignment s t a t e m e n t s and the outsets of the two loops with the linearity variables in boldface type, Statement 5 1 is symbolically linear in / (and not J, A'), as shown. However, K varies with / , which is indicated by its presence in the outset of the /-loop. In fact, K = 2.7-1-1, since K is incremented by two at the end of every iteration of the 7-loop. Thus, statement 5 1 is not program linear in / . Statement 5 2 , on the other hand, is program linear in both A' and J. Although A' and L both depend on I, L occurs in the same array as A", thus preserving the program linearity with respect to A'. T h e linearity with respect to J is not affected by the independent presence of either A' or L (in array D), as both remain constant while the ./-loop is being executed.
193
The MAKELIN
Procedure
This procedure operates actively on all assignment statements which are not symbolically linear, except those that satisfy two conditions simultaneously: (1) their R.H.S. expressions are non-linear, and (2) they write to scalars or nonlinear expressions. In other words, if the L.H.S. of an assignment statement is a scalar or an array expression involving only constants (e.g. A(2)), then it will be ignored, unless its R.H.S. expression is linear in some variable; in such a Ccise, the output variable is expanded to accommodate this linearizing variable. If the output variable is, however, a linear array expression, the R.H.S. expression is linearized with respect to the linearity variable. If multiple linearity variables exist in either of these cases, then the one associated with the innermost surrounding loop of the parent statement is chosen. The linearizing technique adopted, directly exploits the recursive nature of the R.H.S. expression tree to search for the smallest sub-expression (expression associated with a node furthest away from the root) that when linearized, guarantees the linearity of the parent expression. At any stage, the transformation techniques Tl — TA, as discussed in subsection 3.2.4, are used to determine which of the two operands of the current expression needs to be selected for linearization. If this search process ends in a simple expression (i.e. at the leaf), variable expansion is performed. Otherwise, the selected expression is substituted by a single array involving the linearizing variable and other variables obtained from the inset of the substituted expression. This is followed by the generation of a new statement denoting the substitution. Note that this substitution is aborted if the expression so chosen is the R.H.S. expression itself. This procedure is illustrated on the expression tree of Fig. 3.2.14, which corresponds to the following non-linear statement: C{J) = C{J) + A{I,J)*B{J)*2
(3.2,14)
Since the output variable is linear in / , the R.H.S. expression is chosen to be linearized with respect to J. The first operand of this sum expression viz., C{J), is already linear in J and so, the second operand is selected for linearization. The expression tree of this product term indicates that it is sufficient to linearize the first operand viz. A{I, J) * B{J) since the other operand does not involve J. This follows from the linearity rule R2 described in subsection 3.2.3. At this stage, transformation technique T2 is directly applied to substitute the expression by an array expression involving / and J (shown as TEMP{I, J) in
194
(ZZ TEMP(I,J)
F i g u r e 3.2.14
Example expression tree for illustrating MAKELIN phase
the figure). The pair of statements, thus, generated are TEMP{I,J)
=
A{I,J)*B{J)
C{J) = C(J) + TEMP{I,
J) * 2
(3.2.15) (3.2.16)
Variable expansion or expression substitution requires modification of the expression tree corresponding to the original statement. List manipulation and expression building functions, provided by Parafrase-2 [20], are used here. The actual modification is carried out in the following manner; (1) First, the operand list of the new array is appropriately generated to include the linearizing variable as well as the input variables of the the expression to be expanded or substituted. (2) Next, the expression corresponding to the array is constructed by using the above operand list and the symbol table pertaining to the symbol of this array. Note that substitution might first require the generation of a new symbol table entry for the array.
195 (3) Finally, a new statement to denote the substitution is constructed and appropriately included within the program. T h e linearized statement at the end of step (2) is tested for program linearity by re-executing the various steps of the PROGLIN procedure. If it fails the test, the MAKELIN procedure is termed to be unsuccessf\il and the modification of step (2) is undone. Also, step (3) is no longer executed. From the description of the MAKELIN phase, thus, it is apparent that information about the linearity of the sub-expressions at various nodes of the statement expression tree, as well as the their insets, is needed. This is generated by using the appropriate routines of the G E T L I N phase, as and when required.
The ADDCHECK
Pass
T h e A D D C H E C K pass operates on the linearized version of the original program and generates suitable check code for providing on-line detection of errors that may occur during actual execution of the program. This pass is placed after the L I N T E S T pass in the passlist. Currently, the pass operates actively on all program linear and program linearized statements that have at least one linearity variable occurring as the iteration variable of some surrounding loop. In the event that multiple linearity variables exist, the one associated with the innermost surrounding loop is selected to act as the checking variable. Thus, each loop having the checking variable a^ its index plays a key role in this pass and will be subsequently referred to as the check loop. The check code for each program linear (or linearized) statement, includes three sets of statements: (1) summation statements, each denoting the s u m m a t i o n of the various values of an array variable (or scalar) involving the linearity variable, for different values of this iteration variable, (2) check statemenis, t h a t denote the modified computation of the output using these s u m m a t i o n s , and the comparison t h a t follows, and (3) initialization statements, t h a t denote the proper initialization of the summation variables before they are used to store the s u m m a t i o n values. We now describe the generation of each set. Summation
Statements
The generation of s u m m a t i o n statements first involves identification of the leaves of the statement expression tree and selection of only those expressions that involve the linearity variable. For each such expression (represented gener-
196 ically by exp), a s u m m a t i o n statement is built and included in the program, as follows; (1) Assume for the moment that exp is an array. First, a suitable operand list for the s u m m a t i o n variable is created. This consists of all variables t h a t occur along with the linearity variable in exp, but which are associated only with loops inside the check loop. Note t h a t this operand list would be empty if exp was just a scalar, or if the check loop was the innermost loop itself as is true for many of the loop structures encountered in actual programs. (2) Next, an expression corresponding to the new s u m m a t i o n variable, say sum, is built. T h e nature of the expression building function used for this purpose depends on the status of the operand list created above. In other words, if the operand list is empty, a scalar e.xpression is created; otherwise the resulting expression is an array with the given operand list. (3) Using the exp and sum expressions, a summation statement of the form sum — sum -f exp is built. Note t h a t the occurrences of sum on the L.H.S. and on the R.H.S. actually correspond to different expressions t h a t share a common symbol table entry. This requires a duplication of the sum expression created in (2), before the statement can be built. (4) Finally, the s u m m a t i o n statement is inserted just before or immediately after the linear statement in question, depending on whether exp corresponds to an input variable or an output variable, respectively. Such an insertion guarantees t h a t the linear statement shares all its parent conditional statements with the s u m m a t i o n statement, so that there is a one-to-one correspondence between the executions of the linear and s u m m a t i o n statement. For example, if the linear statement was to be executed only during certain iterations of the check loop, then the s u m m a t i o n statement would also be executed only in these iterations, thus, preserving the validity of the summation values. Another major advantage in having the summations thus placed is t h a t variable expansions can be totally avoided in the MAKELIN procedure, since the the value of each such variable in the current iteration of the check loop will be included in the corresponding s u m m a t i o n before it acquires a possible new value for the next iteration. Fig. 3.2.15 shows the check code inserted by the A D D C H E C K pass for statement 52 in the loop structure of Fig. 3.2.13. Note t h a t ,7 has been employed as the checking variable, since it is associated with the inner loop. Statement 52 has three array expressions t h a t involve
197
DOI=l,M 11 12 13
SUMO = 0 SUM1=0 SUM2 = 0 D0J=1,N
SMI SM2 S2 SMS
SUMl = SUMl + B(K,J) SUM2 = SUM2 + C(J) A(K,J) = B(K,J) + C(J) * D(K,L) SUMO = SUMO + A(K,J) ENDDO SUM3 = SUMl + SUM2 * D(K,L) CALL COMPARE(SUM3, SUMO)
CI C2
ENDDO Figure 3.2.15
Illustration of the ADDCHECK pass on the loop of Fig. 3.2.13
198 J viz., A{K,J), B{K,J), and C(J). Hence, three summation statements are inserted, each denoting the summation of values of one of the three arrays for [ — J < = N. These are numbered SMI — SM3 in the figure. Check and Inittattzation
Statements
We now describe the construction of the check statements. T h e first of these denotes the alternate computation of the o u t p u t s u m m a t i o n using the input s u m m a t i o n s . The structure of this statement is similar to that of the original linear statement, but for the replacement of each input array (or scalar) involving the checking variable with the corresponding s u m m a t i o n variable, and the presence of a new s u m m a t i o n variable at the o u t p u t . This is shown in Fig. 3.2.15 where check statement C I is obtained from 5 1 by replacing B{K, ,7) and C(J) with SUMl and SUM2, respectively. T h e recursive generation of the expression tree for this statement is performed concurrently with the exploration of the original expression tree, which is needed anyway to generate the s u m m a t i o n statements. The next check statement is actually a subroutine call for comparison of the two o u t p u t summations to check for possible mismatches. This is indicated as C 2 in the figure. It is possible t h a t the check loop is not the innermost loop in a multiply nested loop structure, in which case, all loops inside the check loop that surround the linear statement, need to be duplicated and placed around the check statements to ensure the validity of the comparisons. Such a situation is illustrated in the next section which discusses the check code for matrix multiplication, T h e generation of initialization statements is fairly simple. This involves a duplication of the expressions corresponding to the various s u m m a t i o n variables so t h a t they may be used in the construction of assignment statements having their R.H.S. expressions set to the constant, zero. These statements are placed before the check loop and are indicated as 71 — 73 in the figure. As in the case of check statements, all loops occurring inside the check loop t h a t aff"ect the linear statement must be duplicated and placed around the initialization statements.
Example Application of Passes to Matrix Multiplication We now illustrate the L I N T E S T and A D D C H E C K passes on the node program for parallel matrix multiplication on the Intel i P S C / 2 hypercube. Fig. 3.2.16 shows the relevant code for the node program obtained by using 7 as the paral-
199 DOI= 1,M /P D O J = 1 ,N
S1
DOK= 1,L C(I,J) = C(I,J) + A(I,K) * B(K,J) ENDDO ENDDO ENDDO
F i g u r e 3.2.16
Parallel node program for matrix multiplication
lelization variable in the program of Fig. 3.2.5. Note that each processor would execute M/P iterations of the /-loop. T h e assignment statement S\ is identified to be symbolically / + • / - l i n e a r by the G E T L I N phase; the recursive linearity rules R\ and R'2 are used here. This s t a t e m e n t is then acted upon by the P R O G L I N procedure, which concludes t h a t it is also program linear in both / and J since the only o u t p u t variable occurring in this loop structure viz., the C array, involves both / and J . Note t h a t the MAKELIN procedure is not employed at all. Fig. 3.2.17 shows the same program with on-line s u m m a t i o n , check and initialization statements. T h e scheme employs J as the checking variable. The summation statement for B{K, J) viz., S M I , is generated and inserted. The values stored in the SUMK array at the end of the J-iterations are nothing but the row checksums of the B array. T h e procedure for C{I,J) is, however, modified to exploit a special property of the loop structure viz., repeated update and write operations on the C array. A s u m m a t i o n statement is generated for only the outpul C{I. J) and is placed outside the A'-loop eis indicated by SM2. Hence, errors during the actual execution of the entire A'-loop (for a particular value of ,7), will be reflected in the value stored in SU MO. In fact, this method aims to limit the amount of checking t h a t needs to be done, and is, therefore, a frequencyrelated optimization. Note t h a t after the J-loop has been completely executed. the values stored in the SUM2 array are nothing but the row checksums of the B matrix: SUMl holds the row checksum of the / t h row of the C submatrix.
200
11 12
SMI SI SM2
CI
C2
D O I = 1,M/P D O K = 1,L SUM2(K) = 0 ENDDO SUMO = 0 D O J = 1,N D0K=1,L SUM2(K) = SUM2(K) + B(K,J) C(I,J) = C(I,J) + A(I,K)*B(K,J) ENDDO SUMO = SUMO + C(I,J) ENDDO D0K=1,L SUMl = SUMI + A'(I,K)*SUM2(K) ENDDO CSEND(...SUM1...) CRECV(,..SUMr...) CALL COMPARE(SUM 1' ,SUMO) ENDDO
F i g u r e 3.2.17
Parallel node program for matrix multiplication with checks
201 T h e procedure for the check statements is modified accordingly. Statement C I is obtained from 5 1 by replacement of B{K, J) with its summation variable, and the o u t p u t and input C{I, J ) ' s by a new variable. It is then placed outside the J-loop with a A'-loop around it. The complete execution of the /\'-loop would then result in an alternate computation of the o u t p u t checksum viz., the R.H.S. computation as per the notation adopted in subsection 3.2.5. Since, the program under consideration is a node program, A(I,K) is replaced by J 4 ' ( / , A'), where the A' array stores the A submatrix values for a neighboring processor. This results in the computation being performed for this neighbor. After an exchange of the R.H.S. computation values with the neighbor, the comparison routine is finally called. T h e initialization statements II and 72 are inserted as shown in the figure and are obtained by a straightforward implementation of the steps discussed in the previous section. It is worthwhile mentioning here t h a t the complete check code is placed inside the /-loop and is, therefore, executed repeatedly for all the /-iterations.
3.2.9
Results of Applying the Compiler on Real Programs
This section presents the results of application of the compiler to real scientific Fortran programs. Six sample routines from the LINPACK and EISPACK libraries, and from the Perfect Club Benchmark Suite, are considered first. These routines are essentially sequential programs meant to run on uniprocessors. However, results for sequential programs can be expected to carry over to their parallel versions since the fraction of linearizable statements remains relatively unchanged. We then consider parallel versions of two of the above programs developed for a hypercube multiprocessor, by running the compiler on the corresponding node programs.
Results on Sequential Programs Table 3.2.1 shows the results of applying the compiler passes for detection and generation of linear statements to six application programs. The first two routines are from the LINPACK library. DGEFA, factors a real matrix by Gaussian Elimination, which is then used by DGESL to solve a corresponding system of linear equations. The results presented here are for expanded versions of the above two routines. This was done to expose the linearity property of
202
T a b l e 3.2.1
Program DGEFA DGESL TRED2 TQL2 TRFD INTGRL OLDA MDG MDMAIN BNDRY INITIA CORREC INTRAF INTERF POTENG
Linearity characteristics of six application programs
Exec 43 49 92 84 67 28 100 100 61 13 97 10 59 108 69
Number of Statements Assign Lin Pglin Mklin 4 16 4 5 3 18 3 9 12 12 46 16 2 14 3 47 2 1 3 23 11 2 0 0 13 6 35 6 2 2 6 2 3 11 3 0 0 6 0 6 34 12 0 0 4 2 0 0 4 4 5 47 9 2 9 81 2 0 43 0
Pslin 5 9 16 14 0 2 13 2 0 6 12 2 4 2 2
% lin-checkable Static Dynamic 56.3 98.0 74.2 66.7 66.4 60.9 34.0 95.0 12.5 57.4 18.2 38.1 54.3 97.1 66.7 99.9 27.3 30.0 100.0 100.0 35.3 41.4 50.0 50.0 18.8 29.7 13.6 11.0 4.7 10.9
203 certain statements inside the expanded subroutines, so t h a t they could be easily detected by the compiler. Next we have a routine form EISPACK , TR.ED2, which reduces a real symmetric matrix to a symmetrical tridiagonal matrix using and accumulating orthogonal similarity transforms. This m a t r i x is then used by the EISPACK routine, T Q L 2 , to find the eigenvalues and eigenvectors. T h e remaining two programs are from the Perfect Club Benchmark Suite. The program T R F D simulates the computational aspects of a two-electron integral transformation. It employs two subroutines, I N T G R L and OLDA, the results for which are also provided in the table. T h e code for MDG provides a molecular dynamics model for water molecules in the liquid state at room t e m p e r a t u r e and pressure. Again, the results for most of the important subroutines used in these programs are also are shown in the table. The next six columns of the table show the results obtained at the end of the L I N T E S T pass for each of the above six programs along with their subroutines, if present. For a particular program; (1) Exec indicates the total number of executable statements inside the program. Hence, declarations and other similar statements are excluded. (2) Assign denotes the number of assignment statements present within loops. These are the key statements for the L I N T E S T pass. (3) Lin indicates the number of assignments (out of (2)) that were detected to be symbolically linear by the GETLIN phase. (4) Pglin pertains to candidate statements for the P R O G L I N procedure, and shows the number of symbolically linear statements t h a t are program linear EIS well. (5) Mklin represents the results of the MAKELIN procedure, and shows the number of statements from (2) (but not in (3)) that were transformed to be symbolically linear and were program linear as well. (6) Pslin denotes the number of statements (out of (5)) that are attributed to variable expansions. Such expansions can be avoided because of the way in which the s u m m a t i o n s are carried out during actual checking. Hence, these statements represent a special class of non-linear statements that are amenable to checking without the necessity for any restructuring, and are called psuedoItnear.
204 T h e last two columns portray the combined picture t h a t emerges from the above six columns. These results are a measure of the amount of linearity t h a t is present in, or can be successfully extracted from, the original programs using our compiler. Here, the term lin-checkable is used to denote all such statements which lend themselves to linearity-based checking. Note that statements t h a t are not lin-checkable can be checked by the simple duplication-and-comparison approach. T h e total number of lin-checkable statements per row is nothing but the total number of program linear and linearized statements ( (3) -f (4) ) for the same row. Each number in the static column is obtained by dividing this number by the total number of assignments statements t h a t were present initially or created during substitutions ( (2) -|- (5)-(4) ). In other words, the static results show the total number of lin-checkable statements relative to the total number of assignment statements in the final linearized versions of the original programs, and are, thus, indicative of the inherent structure of these programs. The dynamic results present a more realistic picture as they are an estimate of the total number of executions of all lin-checkable statements with respect to the total number of executions of all assignment statements, for a given input d a t a size. Hence, programs with a high dynamic percent count would most likely have a high percentage of lin-checkable statements inside multiple loops. This is indeed true for highly structured programs such as the DGEFA, T Q L 2 , and OLDA routines, which show a dramatic rise in their dynamic counts when compared to the static counts. In fact, the OLDA subroutine dominates the execution of the entire T R F D program with the result t h a t the net dynamic count is around 93%. The above values allow us to make the following inferences. Since any error that occurs during the execution of a lin-checkable statement is detected by the corresponding check code with a high probability, one can expect programs having reasonably high dynamic counts (the first five programs) to give high error coverages while executing on real error-prone machines. Correspondingly, programs with very low dynamic counts such as some subroutines in the MDG program cannot be expected to give high error coverages, if linearity-based checking were the sole concurrent error detection scheme to be used. T h e coverages for on-line error detection of such programs can be boosted up by the intelligent use of duplication techniques. T h e results for DGEFA, DGESL and T Q L 2 were estimated using an input matrix size of 100, whereas an input size of 1000 was used for the T R E D 2 routine. The parameter values for the T R F D and MDG routines were defaults provided by the Benchmark Suite.
205 Table 3.2.2
Program PDGEFA PTRED2
Exec 72 362
Linearity characterisitcs of two parallel node programs for iPSC/2
Number of statements Assign Lin Pglin Mklin 25 5 5 10 138 8 32 8
Pslin 10 32
% lin-checkable Dynamic Static P^4 P = 16 P=8 84.1 60.0 89.1 93.3 52.4 29.0 56.5 58.9
Results for Parallel Programs This section presents results for parallel versions of the DGEFA and TRED2 routines. The former is the Fortran version of the parallelized DGEFA program which has been implemented on an Intel iPSC-2 hypercube and is described in [14]. The program for the TRED2 routine is a parallel implementation on the iPSC-2 hypercube using a row cyclic distribution of the input matrices, and is described in [23]. Table 3.2.2 shows the results for the processor node programs viz., PDGEFA and PTRED2, presented in a manner similar to the previous table. Since DGEFA is a highly structured program as mentioned in the previous section, the static count for PDGEFA remains more or less the same indicating the fact that the linearity of the program as a whole has not been affected by parallelization. The dynamic count drops down a little bit as expected. Since the program parallelization causes the number of iterations of the parallel loop to decrease by a factor equal to the number of processors, the relative number of executions of lin-checkable statements inside the parallel loop also goes down by the same factor. Parallelization of the TRED2 routine, however, introduces lots of non-linear statements because of the somewhat loosely structured nature of the original program. Hence, the static percent is decreased quite a bit as is evident from the Tables 3.2.1 and 3.2.2. The drop in the dynamic count is comparatively lower, as many of these non-linear statements occur outside the most frequently executed loops in the program.
Results on Error Coverages and Time Overheads In order to observe error coverage results of the compiler-assisted data integrity checking scheme, a version of the DGEFA routine with data integrity checks
206 T a b l e 3.2.3
Percent error coverage in the DGEFA checking scheme
Error type Transient bit error Transient word error Permanent bit error Permanent word error
Unit adder multiplier adder multiplier adder multiplier adder multiplier
Number of processors 4 8 16 92.97 91.41 88.28 86.72 89.84 84.38 100.00 100,00 100.00 100.00 100.00 100.00 92.97 92.97 90.63 89.84 86.72 89.84 100.00 100.00 100.00 100.00 100.00 100.00
was implemented for the Intel iPSC/2 hypercube multicomputer. Experiments were run on randomly generated 100x100 matrices for hypercubes with 4, 8 and 16 processors. Errors were injected at the source code level and could be of four types - bit-level transient errors, bit-level permanent errors, word-level transient errors and word-level permanent errors. The errors were assumed to affect either floating point additions or floating point multiplications. Injecting these errors at the source code level involved replacing the output of every floating point addition or multiplication by a call to an appropriate error injection routine, which would then corrupt the result in accordance with the error being simulated. Details of this method may be found in [8, 24]. Error coverages are illustrated in Table 3.2.3. These are quite high and in particular are always 100% for word-level errors, which can be expected to cause more data corruption than bit-level errors. Time overheads are illustrated in Fig. 3.2.18. Timing overheads decrease with an increase in problem size since normal computations increase much more rapidly than check computations as problem size increases.
3.2.10
Summary
In this section, we have addressed the task of synthesizing algorithm-based checking techniques for general applications. We have approached the problem at the compiler level by identifying linear transformations in Fortran DO loops, restructuring program statements to convert non-linear transformations
207
30
A.
A 16 processors Q — 8 processors -0 4 processors
20 Percent Overhead 10
100
F i g u r e 3.2.18
200 300 Matrix Row Size
Time overhead of the DGEFA checking scheme
400
208 to linear ones, and have proposed system-level checks based on this property. We have illustrated our approach on the example problem of matrix multiplication. We have discussed certain trade-offs involved in algorithm-based checking which are related to the nature and extent of the checksum encodings. Next, we have discussed the implementation of a source-to-source restructuring compiler t h a t can synthesize suitable low-cost checks for providing on-line detection of errors occurring during execution of a variety of numerical Fortran programs. This compiler consists of two passes. T h e first of these is the L I N T E S T pass which performs automatic detection of linear statements and selective linearization of others. T h e second pass is the A D D C H E C K pass which performs actual insertion of check code using the information generated in the previous pass. We have illustrated these passes on the node program for matrix multiplication on the hypercube to generate a version with on-line checks. We have then presented the results of application of this compiler to six important routines from the LINPACK and EISPACK libraries, and the Perfect Benchmark Suite. We have also presented error coverage and time overhead results for a selected parallel prgram.
209
REFERENCES
[1] A. L. Hopkins, I. T. B. Smith, and J. H. Lala, " F T M P : A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft," Proc. IEEE, vol. 66, pp. 12211239, Oct. 1978. [2] D. P. Siewiorek, V. Kini, H. Mashburn, S. McConnel, and M. Hsao, "A Case Study of C . m m p , Cm*, and C.vmp. I. Experiences with Fault Tolerance in Multiprocessor Systems," Proc. IEEE, vol. 66, pp. 1178-1199, Oct. 1978. [3] J. H. Wensley, L. Lamport, J. Goldberg, M. W. Green, K. N. Levitt, P. M. Melliar-Smith, R. E. Shostak, and C. B. Weinstock, "SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control," Proc. IEEE, vol. 66, no. 10, pp. 1240-1255, Oct. 1978. [4] D. P. Siewiorek and R. S. Swartz, Theory and Practice of Reliable Design. Bedford, MA: Digital Press, 1982. [5] O. Serlin, "Fault-Tolerant Systems in Commercial Applications," Computer, pp. 19-30, August 1984.
System
IEEE
[6] D. Johnson, "The Intel 432: A VLSI Architecture for Fault-Tolerant Computer Systems," IEEE Computer, pp. 40-48, August 1984. [7] K.-H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Trans. Comput., vol. C-33, pp. 518-528, J u n e 1984. [8] P. Banerjee, J. T. Rahmeh, C. Stunkel, V. S. Nair, K. Roy, V. Balasubr a m a n i a n , and J. A. Abraham, "Algorithm-based fault tolerance on a hypercube multiprocessor," IEEE Trans. Comput., vol. 39, pp. 1132-1145, September 1990. [9] J.-Y. Jou and J. A. Abraham, "Fault-tolerant matrix operations on multiple processor systems using weighted checksums," SPIE Proceedings, vol. 495, August 1984.
210 [10] F. Luk, "Algorithm-Baaed Fault Tolerance for Parallel Matrix Solvers," J'roc. SPIE Real-Time Signal Processing VIll, vol. 564, 1985. [11] M. Malek and Y. H. Choi, "A fault-tolerant fft processor," IEEE Comput., May 1988.
7Vn«.s.
[12] A. L. N. Rcddy and P. Banerjcc, "Algorithm-based fault detection techni(iues in signal processing applications," IEEE Trans. Compui., vol. 39. pp. 1304-1308, October 1990. [13] H,. B. Mueller-Thuiis, D. McFarland, and P. Baiierjee. "Algorithm-Based Fault Tolerance for Adaptive Least Squares Lattice Filtering on a Hypercube Multiprocessor," Pror. Int. Conf. on Parnllel Prore.ssing, pp. 177 189, Aug. 1989. [14] v . Balasubramanian, ""The .Analysis and Synthesis of Efficient AlgorithiiiBased Error Detection Schemes for Ilypercube Multiprocessors.'' P h . D . dissertation, Uuiv. of Illinois, Urbana-Champaign, February 1991. Tech. Report no. C R H C 91-6, U I L l I - E N G - 9 1 - 2 2 1 0 . [15] W. Harrison, "/In Overview of the Structure of Parafrase.'' Univ. of Illinois, Urbana-Champaign, July 1985. CSRD Tech. Report no. 501, PR-85-2. UILU-FNC.-85-8002. [16] B. Leasure, "I'hc Parafrase project's Fortran analyzer major module documentation." Univ. of Illinois, LIrbana-Champaign, July 1985. C S R D Tech. Report no. 504, PR-85-5, UILU KNG--85-8005. [17] C. D. Polychronopoulos, "(^ompilcr optimizations for enhancing parallelism and their impact on architecture design," IEEE Trans. Comput., vol. 37, pp. 991-1004, August 1988. [18] D. A. P a d u a and M. J. Wolfe, "Advanced compiler optimizations for supercomputers," Commun. ACM, vol. 29, pp. 1184-1201, December 1986. [19] S. P. Midkiff and D. A. Padua, "Compiler algorithms for synchronization." IEEE Trans. Comput.. vol. C-36, pp. 1485 1495. December 1987. [20] C. D. Polychronopoulos, M. Cirkar. M. H. Hagighat, ('. L. Lee. B. Leung, and D. Schouten, "Parafrase-2 mantial." CSRD, Univ. of Illinois, UrbanaChampaign, 1990. [21] V. Balasubramanian and P. Banerjee, " G R A F T : A Compiler for Synthesizing ALgorithm-Based Fault Tolerance in Ilypercube Multiprocessors," Proc. Int. Conf Parallel Processing (ICPP-91), Aug. 1991.
211 [22] A. v . Aho, R. Srtlii, and J. D. lUlinan, ComjnUrs: and Tools. Reading, MA: Addison-Wesley, 1988.
Pnuciplfs,
Tcchiuqins
[2.'^] M. (Jupla, '"Aiitoniatic d a t a partitioning on (.UstTihutod incMiiory niulliprocessors." Univ. of Illinois, Urbana-Chainpaign, 0('tol)er 1990. 'IVrli. Report no. C R H ( > 9 0 ~ 1 4 . [24] A. Roy-C'howdhury, "Evaluation of Algorithm Based Faull-Tolcranre fccliniques on Mulliplc Fault Classes in the Presence of Finite Precision Arithmetic.'' M.S. Thesi.s. Univ. of Illinois, Urbana-Champaign, Aiignst 1992. Terh. Report no. C R H C - 9 2 - 1 5 . lULU-K.NG-92-2228.
SECTION 4
OPERATING SYSTEM SUPPORT
SECTION 4.1 Application Transparent Fault Management in Fault Tolerant Mach Mark Russinovich, Zary Segall and Dan Siewiorek Abstract: Fault detection and fault tolerance has become an increasingly important aspect of all computer system designs, from PC's to high-end workstations and embedded critical systems. Since operating systems are common to all cornputers and it is at the operating system level where there is maximum system visibility and control, it is appropriate for the operating system to provide policies which detect, contain and tolerate faults. These policies form an operating system's "fault management." A mechanism to provide support for operating system fault management has been designed and unplementedfor a UNIX 43 BSD server running on the Mach 3.0 microkernel. The mechanism, called the sentry mechanism, consists of fault management control placed at all operating system entry and exit points. The suitability of the mechanism is determined through demonstration of its ability to support diverse, commonly accepted policies efficiently, where efficiency is measured in terms of implementation complexity and performance. Several sentry policies have been implemented including monitoring, assertions, checkpoint/checkpoint recovery and journaling/jourruil replay. This paper presents the sentry mechanism, its implementation and the design and implementation of the mentioned policies.
4.1.1 Introduction A number of operating systems have had policies to detect, contain and recover from faults included in their specification. The policies have always been integral parts of the systems, usually satisfying a narrow range of dependability and availability needs which, as an inherent assumption, all clients of the qxrating system desire. The current trend in operating system design, however, is not to provide a custom design for each class of computer or application environment, but rather to design one operating system which is applicable across the entire range of computer systems, from PC's to high end multiprocessors. Current examples of such operating systems include Mach 3.0[9][17] and Windows NT[6]. The application domains that these computers address are diverse in their fault management requirements, and a one This rEseaich has been supported by the Office of Naval Research under contract N00014-91-J-4139. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of ONR, Carnegie Mellon University or the U. S. Government. The authors are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213.
216 solution approach is clearly no longer acceptable. The recent trend towards distributed systems has lead to the development of "high-lever operating systems which run on top of the native operating systems of the distributed system's nodes. In many cases, these systems provide fault management primitives that are able to take advantage of the hardware redundancy introduced. However, these systems assume that faulty nodes behave in a fail-fast or failsilent manner which is an assumption not validateid by practice. These trends have created a need for an operatmg system fault management mechanism which is powerful enough to support diverse policies in a manner that can tune the system for the specific needs of an application. The Fault Tolerant Mach (FTM) project's [18][19] primary goal is to define such a mechanism. Ptoof of the mechanism's power can only be demonstrated by example so we have chosen several policies for implementation which are representative of state-of-the-art fault management in terms of complexity, coverage, and diversity. They also serve to satisfy the additional goals of FTM which are to: • design a user-tunable, user-accessible fault management mechanism • provide an open-systems interface to the mechanism • provide system monitoring policy example(s) • provide fault detection policy example(s) • provide fault recovery policy example(s) By providing an open-systems interface to the policy, this approach hopes to encourage other research groups to implement policies for the mechanism that build a library of solutions for diffo^nt {^plication needs. An extremely important goal related to the development of example policies is that they must aU be application transparent Application transparency implies that applications do not have to be modified in any way to use the policies. This restriction serves not only to demonstrate the power of policies implemental with the sentry mechanism, but it also enables any application whichfitsa policy's £q)pIication model use the policies. Policies which are application dependent require the development of specially tailored software, a time consuming, error-prone process preventing the vast collection of off-the-shelf to take advantage of the policies. In the experiments presented later, only standard unmodified workstation software is used. In this work the policies implemented using the mechanism are: • monitoring • assertions • joumaling/joumal recovery • checkpoint/checkpoint recovery For pragmatic reasons the research presented here concentrates on a uniprocessor environment running in multiuser mode. Parallel research is being conducted to explore the sentry mechanism's applicability to scalable fault management in a multicomputer domain. The rest of this chapter is organized as follows: Section 4.1.2 defines the sentry mechanism and presents its implementation. The next three sections present the monitoring and assertion policies (presented together), the joumaling policy and the checlqpointing policy. These are followed by a concluding section which discusses the current slate of the project and its direction.
217
The primary goal of FTM is to provide a user-tunable, user-accessible fault management mechaJiism pjwerful enough to support diYersc policies. TTie mechanism which will accomplsh this is called the sentry mechanism (SM). This section presents the concepts behind the mechanism and temninology associated with it It then covers the implementation of SM for the UX (BSD 4.3 [24]) operating system sewer and details of the user interface to the SM.
These goals irst require that control points for the policies be placed at al locations in the operating system where a policy may be required to execute. The central idea of the sentry concept is that the operating system entry and exit points provide sufficient visibility and control to support the majority of standard fault detection and tolerance techniques. These points allow a policy to encapsulate operating system services such as system calls, page faults, intennpts, etc., effectively isolating faults before they cross the operating system/cKent boundary as well as allowing a policy to replace an existing service with its own version. When sentries are enabled for a service, that service is said to be guarded. Figure4.L1 shows a service being gnarded by one sentry policy. Request
Operating System Serace
Reply Figure 4.1.1 The sentry mechanism Examples where isolation is important is in fault detection/correction policies. A policy can perform consistency and validity checks on important operating system data structures and may perform correction in the case of faults before a service is performed which would cause the operating system to fail. The exit point allows faults to
218 be hidden firoin, or reported to, an ^plication. System service replacement is necessary for calls which provide enhanced or modified vo^ions of a service. For example, a fault recovery policy based on joumaling will need to r^licate system call behavior based on the results which the service provided in the initial run in cader to recreate iL The policy must provide its own version of the service which read the service results fnxn a journal and respond with them to the application to recreate the initial run's outcome. Several characteristics of use of this mechanism are: • the majority of fault management policies require no modification of the operating system internals. This is important because modification of the existing code increases risk of introduced faults. In addition, hard-wiring of fault management in the service precludes the advantage described below. • policies can be dynamically assigned to guard or leave a service unguarded Thisflexibilitymeans that only processes that desire the policies will have them in place. An operating system can be implemented with sentry policies of varying scape and cost Applications can select the policies based on their own specific fault management and cost/performance needs. An operating system which has the sentry mechanism in place should provide a selection of fault management policies and an activation mechanism for selecting policies to guard specified services. Each policy should be characterized in terms of its detection or tolerance capability and the cost for each service guarded. It should also allow for multiple policies to be simultaneously enabled for an application. Policies can then work in conjunction with one another. A fault recovery policy can be designed to worii with a fault detection policy, for example, or complemen^ry fault detection policies can provide increased coverage. Figure 4.1.2 depicts a service sandwiched between sentries from two policies.
4.1.2.2 Sentry Implementation To understand the implementation of the sentry mechanism, some background is necessary to describe the target operating system which is Mach 3.0. The development platform of this project is an i486 processor (50MHz intemal/25MHz external) with 16MB of main memory. This section first describes the Mach 3.0 and then the sentry mechanism implementation and activation mechanism.
4.1.2.2.1 Mach 3.0 Architecture The software architecture used in this research is the Mach 3.0 (MK) microkernel. It was selected because it is representative of the current trend toward microkernel based operating systems. In microkernel operating systems, system services are divided between a user-level server, which provides much of the visible operating system interface, and a supervisor-level microkernel that implements low-level resource management. The microkernel provides enough low-level support so that various servers can be implemented to run on top of it. For example, servers that have been implemented for Mach 3.0 include DOS, Unix 4.3BSD (UX) and MacOS. Unix 4.3 was chosen as the initial platform for the sentry mechanism. Factors behind this decision include the fact that Unix is widely used in the research community and that Unix is more complex than either DOS or MacOS, and therefore, results
219 Request
Reply Figure 4 . U Multiple policies encapsulating a service generated should be migratablc to iiese other operatiiig systems. In general, both microkernel and UX sarices are avaflable to Unix applications, but this work assumes that Unix applicaliais wiB only make UX requests. In Macb 3.0 the microkernel povides an IPC mechanism, virtual memory management, low le¥el device drivers, and lask/ihread scheduling. UX implements al Unix 4.3 system calls including signals, die UFS (Unix File System), and Unix process management. Where nccessaiy UX calls on MK services such as task creation and deletion. While there are several types of services povidcd by UX including signal delivery, system calls, and memory management, system calls are by far die most common. Fipire 4.1.3 shows control iow for a system caM. MK takes appHcation system call traps and bounces control into a special UX portion of the application address space called die emulation library. This library takes the system call parameters and number, and packages an MK IPC message which is sent to UX. hi UX threads (light-weight pKKesses) wait for mcomingrequestsand after decoding die message, call the appropriate fimction. After therequesthas been serviced, UX packages a reply message and sends it off to the application. The emulation library unpackages the reply parameters from the message and control is returned to the instruction following the trap.
4.1.2.2.2 Sentrj Mechanism Implementation
/^AJ\J
Afferent opeialing systems include different services, so the sentiy mechanism must be tailored for different environments. In UX, the services which have sentry points placed at their entries and/or exits are: • System calls • Signal creation • Process creation and deletion • Interrupts The principal of operation for all of these points is the same, so for the purposes
Emulation Libnir>'
Apj3lication @
syscalJ
l.Tap 2. Bounce back to user space 3. MK message with UX request information 4. BrfK message with UX reply information 5. Return to point of trap Figure 4 . U Flow of control for system call In Maci 3 J/UX of demonstration. Figure 4.1.4 depicts control iow when a policy is enabled for a system calL The same path is followed until tJie point where the UX thread has received the request message. At this point, the thread wil check to see if any policies are enabled for the requestor. If any are enabled, theif entry sentries are executed. These entry sentries mayrequirethat the UX system call code be bypassed. In this case, control iow is passed to the code which executes enabled exit sentries. This alternate low is not shown in the diagram. After the exit sentries are executed, the reply message is deli¥ered to the application. TTic bypass mechanism, which exists at all sentry entry points, is what allows a policy to implement its own service in lieu of the exist-
221 ing operating system service code. The activation mechanism is composed of several system calls which have been added to UX. They provide the ability to enable a policy for a particular service (i.e. a specific system call, signal, etc.), and also for a policy to be enabled for a particular process. In the UX implementation all services are numbered. System calls arc identified by their native BSD numbers and other services, such as signals, are assigned arbitrary id numbers. An example of how the interface is used will be presented after the system calls are described. The parentheses are specific to the complementary forms of the calls (un)guardserv(service, policymask, policyp) This call selects the policies flagged in the policymask to guard or stop guarding the specified service. The policyp parameter is an array which has an entry for each flagged policy. Each entry in the array contains a string and an integer. These components allow policy specific information to be passed to the policy. The string parameters can be used to specify a directory in which the policy should store information, or a file name, for example. Degree of coverage, or a mask of option bits, can be stwed in the integer parameters. (un)guardproc(pid, policymask) This call selects the policies flagged in the policymask to guard or stop guarding the specified process. sentryonO, sentryoffQ these calls are used to turn the global sentry mechanism on and off. ftexecve(command, argp, envp, policymask, policyp) This call is identical to the Unix 4.3 execvc system call, but it has two additional paramenters for the sentry mechanism. The policymask indicates which policies should be enabled when the command is executed. As an example of how the activation mechanism is used, lets say that it is necessary to monitOT all read system calls for an application. The user would first enable the monitoring policy for the read system call using the guardserv command. If the application was already running, the user would enable the monitoring policy for the application by executing the guardproc command. To start the process running with the monitoring policy the ftexecve command would be entered. Child processes inherit the same enablings the parent had at the time it was forked.
4.1.3 Monitoring and Assertions Because of its wide applicability and ease of implementation, a system call monitoring policy was developed. The monitoring policy was used to discover the cause of a system crash which was brought on by a simple robustness evaluation program. This section describes the monitoring policy and its visualization, and then the robustness program.
4.1.3.1 The Monitoring Policy The first policy implemented for the sentry mechanism was the monitoring policy. When the monitoring policy is enabled for a system call, all of the call's input and ouq)ut parameters are recorded to a file associatwl with the process making the call. The writes to the monitoring file can either be synchronous or asynchronous and is determined by a flag passed to the policy when it is enabled for a process. Call information can be used to debug appUcation behavior, debug operating system behavior.
222
Emulation Library gggagggggjg
Application
(D
syscall
l.Tiap
2. Bounce back to user space 3. MK message with UX request information 4. Process system caD 5. Execute exit sentry 6. MK message with UX reply information 7. Return to point of trap Figure 4.1.4 Flow of control with sentry enabled analyze system performance, or can be used inttendanalysis. Several programs have been written to profile the data and calculate statistics related to performance, accountin|j and frequency. As a debugging aid a visualization tool based on the PEEilS] visualizer was designed to display generated monitoring data. By using this tool, the FTSCOPE, the user is able to quickly see the system call activity of each process involved in an application and to print the input and output parameters for a selected system call. T'*A#«j'#^
M.. l i e / j L & o d I H J H Ji t J t l l r f V
As a task related to the FTM project, robustness benchmarks are being developed to qualitatively and quantitatively evaluate system dependability. One of Aese benchmarks is a system call tester. This program spawns a user-speciied number of processes which make random system calls ^ssing in buffers containing arbittary information as parameters. When this progrmn was run on UX it crashed every time. To determine which system call was causing the operating system crash,flieprogram was run with the monitoring policy enabled.
223 The FTSCOPE view of the crash is shown in Figure 4.1.5. In the top portion of
1
«rl|^ *iii* wli* «U«k
t» to ft» to
tftti^Mrt* i(^t*i** if^i«ift# i™i1««t»
rlrtft NnnJ *4fm •? mmt r-ar^j*,. l»rt iMfwS wtes" e* www f«r<«, - t a w •*« t l«ri h * ^ m«f« **• »WE r*nf*, Iwft Hat^ *%# **• ^«s« r«r^]^ ^ ^ ^ £
t i ^ « . t ^ « « » , ««.i^ «= f , « ^ # * , tfclSM- — 0,«
Figure 4.1 J FTSCOPE view of UX crash tlie of the FTSCOPE are pull down buttons; below ifiem are type buttons used to filter types of system calls being displayed. Unix calls have retoted system calls. An additional type called Error Detect has been added that is produced by fault detection sentries when they register a fault Hie portion of the window containing bars shows concurrent processes on the Y-axis and the time line on the X-axis. Each process is identified by its process id and the different colored rectangles represent then- system call activity. If a system caU rectangle is clicked on by the mouse, textual information is printed in the bottom portion of the window. This includes the system call type, the input and output parameters and the executioE times of the call. In this igure the last (rightmost) rectangle has been clicked on reveaing that it was a rccvfroni call wMch neverreturned.This was the last call made before the operating system crashed. A policy was developed which has one sentry implemenled for the recvfrom call. This sentry performs assertion Ksts[2][3][ll][16] on each of the recvfrom input parameters. If an assertion fails, the call is bypassed and the exit sentry returns with an Error Detected value and a number identifying the failed assertion. The same program was run without crashing the operating system wiih this pohcy enabled. An
224 FTSCOPE ¥iew of a ran is shown in Fipii-e 4.1.6. The .same caM that crashed the
U l i i A to i s r f l « * # J # f t fti««i »«%» *rf » « | « | i l * k * • i j ^ ^ M * * i - | ^ ( t lrta*ii • i ^ p ' «# Si*
- »
% » - • "C tffl a
I to^irtjj
Figure 4.1J FTSCOPE view of afoided crash operating system when the policy was not in efiiect is shown as the dark rectangle in the bottom center of the window. Its textual infonmatioii shows that it has the same input parameters, and the negative return vahie flag indicates that an assertion failed in a detection policy. This experiment demonstrate-d the power and efficiency of the se-ntry mechanism for a fault detection. It abo highlightexl ihe usefulness of system monitoring and visualization for isolating faulty t3ehavior. The example is somewhat trivial in that the information obtained by the monitoring library should be used to correct UK instead of writing an assertion poUcy, but the idea.s presented are directly appUcabie to more sophisticated fault detection techniques.
4.1.4 The Journaling and Journal Recovery Policies In order to t»th evaluate the sentry mechanism and pro¥ide a practical, desirable fault tolerant policy, joumaling[5][7][8][10][12][14][22][21] and journal recoYcry policies have tseen developed. The idea behind joumaHng is that during an appEcation's execution, enough information is saved that, in case of failure, the appication can be restarted and brought back to the exact state it was in at the time of failure. In other words, an application given the same inputs at the same time will always behave in the same way. This method provides fault tolerance for transient faults in the hardware, the operating system and the application. Of course, the fault must be detected.
225
The development of this policy does not address that issue, but relies on the assumption that a lavil is detected in the native operating system fault detection mechanisms, by the user, by the hardware, or through a fault detection policy implemented for the sentry mechanism (such as assertions). The applicaticm model used for this policy is a concurrent system where the application process' creation dependency graph is a tree. A node is the father of all nodes that are below it in the tree and these nodes are called children. The vast majcx*ity of single processor {plications fit into this model including the X window system. The reason that it is imposed will be discussed later. An example of an application fitting this model is shown in Figure 4.1.7. root 1
y \
A" 2
- * - • •
Nk 3 \
"1 4
1 "
-
1 process communication
application boundary Figure 4.1.7 Example of application which Fits application model There are several issues that needed to be addressed during the development of this policy. They arc: • stable stcffage medium • external input • order of events • journal structure • process identifiers • asynchronous signal delivery • disk recovery This section will present the approach taken for each issue in the order listed. Journal structure is discussed after external inputs and order of events because it is related to these issues. Thefinalpart of this section presents some performance measurements of the joumaling and joumaling recovery policy.
226
4.1.4.1 Journaling Issues 4.1.4.1.1 Stable Storage An important consideration is the medium on which the journal is stored. The fault model being used must assume that the data on the medium will maintain its integrity in the event of a failure. The most common stable storage that exists for systems today are mirrored hard disks, battery backed RAM and flash memory. If the fault model is relaxed, it can be asserted that the data on a single hard disk will survive a system failure. Although hard disks axe generally available in installed systems, performance penalties involved in disk journaling could be many orders of magnitude greater than for memory journaling. This is due to the fact that if the user desires that no information be lost in a failure (i.e. one can recover up to the point of the failure), then journal entries must be saved synchronously to the diisk. An alternative to synchronous disk journaling is to buffer the journal in memory and asynchronously write it to disk. This will usually improve performance, but the price paid is that a failure will result in the loss of some of the execution. This may or may not be acceptable depending on the characteristics of the system. If asynchronous journaling is implemented, potential data loss must be clearly defined. One asynchronous approach is toflushthe journal after some defined interval. For instance, if the buffer was flushed afto- two minutes of accumulated data, the guarantee would be that, in the case of a failure, a maximum of two minutes of work would be lost, and that on average, one minute would be unrecoverable. Other criteria forflushingcould be that a journal buffer becomes full or that some number of events have occurred.
4.1.4.1.2 External Inputs The information needed to recover an £q}plication includes any external input the application has received. In the application model described earlier, the only information received from an external source are device inputs that are not automatically and identically recreated when an ^plication is re-run. A Mouse and a keyboard are examples of these input devices because during a recov^, it cannot be guaranteed that a user will input data in precisely the same way they did in the original run. In order to record necessary information, the journaling policy has actions specific to system calls which may receive inputs from devices. This group of system calls includes: • read • select • recv • recvfrom • ioctl When an exit sentry for one of these calls executes, the journal determines if the call queried an external device for data and in such cases, saves the data to the journal. During a recovery, the entry sentry which acts on behalf of the recovery policy for these system calls checks for device queries in the same way. If a device is being access&d the sentry retrieves the reply parameters from the journal and bypasses the
227
system call.
4.1.4.1 J ETcnt Order External input alone is not sufficient to force conrect recreation of a concurrent application daring recovery. Actions performed by a process that have an effect on tlie execution of another process must also be replayed so that causality is preserved. In Unix, events with interprocess effect are made up entirely of interprocess communication system calls (signals are handled specially and arc discussed later). The Mst of these calls is made up of: • select •read, write • rcadv, writev
•receive,send • recvfiroin, sendto Process I Process! Process 3 1
b
B
Process 1 Process 2 Process 3
Figure 4.1J A) Event order, B) dependency graph and C) event sequencing
228 Execution of these calk crraites dependency graphs. An example execution is shown at the top of Figure 4.1.8. The communication pattern creates the dependency graph shown in the middle of the ignre. Joomaling algorithms in distributed systems are fcquircd to maintain these graphs which are used during recovery to replay the events in a correct order. Parts of the giaph that arc on the same horizontal level can be replayed k paralel. Unfortunately, maintaining these dependency graphs Ixcomes ineficient when there is a high degree of communication. An alternative approach is to impose a totaltime-precedencebased order on events disregarding their dependencies. Distributed recovery schemes do not use this ordering because it reqoiies a notion of global time, which is expensive to maintain in a distributed environmenL In a uniprocessor, however, a precedence ordering is easiy obtained just by maintaining a counter. Each event is atomically identified with a sequence number. The l»ttom of Figij«54.L8 shows this identification for the example events. These identifications are known as the global sequence number and each application has a global sequence number which is shared by all of the processes in the application. If events are replayed witii the same global ordering then all dependency graphs will be automatically recreated. Atomicity must be enforced between assignment of a sequence numter and the actual event being assigned so that the event's sequence number is an accurate identiicalion of its time ordering. In many cases the locking in UX does not guarantee atomicity whai the sequence number is incremented in a sentry. For these situations, enough of the serYicc is replicated in the sentry so that locMng can be added to enforce the required atomicity. FiguiB 4.1.9 shows an event chart on the left for the events of an original run, and on the right side shows how the event order is preserved during a recovery. In die
time
rn
n+l
Original run
n+l Recover/ run system call entry or exit event se-qiience increment blocking
Figure 4.1 J Event seqnencing and its enforcement during recovery
229 original run, process B must wait to execute the system call event and the associated sequence increment, while A executes its atomic event and sequence increment During recovery, process B enters its system callfirst,but it must wait until the sequence number becomes n before it can continue.
4.1.4.1.4 Journal Structure There are several options for structuring a journal. For example, each process of an application could have a sq>arate journal. Multiple journals are relatively easy to implement in a file system, but if the stable storage is RAM, the issue of how the memory should be divided among the journals becomes important. In addition, if a buffering optimization is used aflushof the journal means that several journals must be independendyflushedwhich can lead to performance degredation associated with multiple disk opertions. For these reasons, we have chosen to have one journal serve for each application. A single journal introduces the problem of locating a piece of needed data. When a system call occurs during a recovery which has external inputs that must be retrieved from the journal, the sentry must correctly find the place in the journal where the information is stored. Our approach is to assign sequence numbers to all events stored in the journal and to enforce sequencing as described above. A process executing a call that has external inputs stored in the journal will block until its turn to increment the sequence number. When the sequence number is incremented, the process doing so reads the next entry out of the journal. Entries are tagged with the process identifier of the process that stored it If Uie currently executing process is not the one that corresponds to the process that wrote the entry in the original run, it notifies the correspondmg process. The corresponding process can then read the journal when it arrives at the point where it must do so. A global sleep queue is maintained for processes which must block waiting for their turn. An optimization on the size of the journal can be made by noting that there is unnecessary information in the journal. Processes occasionally make a series of entries in the journal that is uninterrupted by other processes. This series is known as a run. The records after thefirstof a run can beremovedfrom the journal without loss of infcHination because during recovery the process that owns thefirstentry of a run can assume it is responsible for the missing entries as well. Note that entries that have data such as device input cannot beremoved,however. These journal entries that are missing must actually be written to the journal, but are then over-written by subsequent entries of a run. This is due to the fact that the application must know when a recovery is complete and normal execution can begin. If the records are never stored in the journal, the application may believe that recovery hasfinishedbefore it should be and then cause a different global ordering. By overwriting, we insure that the last record of an execution will always be in the journal. To better demonstrate how run-length compression works, see Figure 4.1.10. Process A begins a run that has itsfirstrecord written to position 1 of the journal. Subsequent entries of its run are all written to position 2. When process B starts its run, its first record is written to position 2 and subsequent records are written position 3. The basic journal entry is 12 bytes in length. Entries consist of the process identifier, the system call type of the event, and sequencing information.
£^-^J\I
B time
jouiTia!
Figure 4.1.19 Eiin length optimttation
4.1,4.1.5 Process Hentiieri and Time Some system cais retum infonmation that does not come from external devices and that is not autoniaticaMy recreated in a t&:mttj. This class is made up of calls which query the system for information that wil be different in a recovery, such as the time of day or process identifiers. When a process is foited in recovery, il is informed of the process identiier that its corresponding process to the original run owned. Any getpid system raJls made during recovery are bypassed and a sentiy returns the original process id as the return parameter. In this way the application uses the same process identifiers during recovery. Another system call which requires joumaling is the wait call. This call is used to wait for the statns of a chfld process to change and also to query the status of a child proceK. During a recovery, it cannot be guaranteed that when a process queries the status of another process, the target process wil be in the same state it was in when the action was perfonned in the original run. Joumaling the wait call insures that the process will perceive the status to be the same. Many programs have control iows that depend on system time or elapsed system time. The mformation returned by these types of calls are also joumaled and replayed with system call bypass during recoveiy. Calls in this group include gettlmeofday, getitimer. and getriisage. During development it was noticed that some applications, particularly the X window system, make a great deal of cals to gettimeofday that retum the same value. Joumaling these calls with retum parameters is expensive in terms of time, but is especially expensive in teniis of the size of the journal. As an optimization, the journahng poKcy keeps track of times returned by gettinieofdaj. If the time is the same for more than one call, the journal entries for the second and subsequent calls are special records that do not include the reply parameters. During a replay, the time
231 returned by the journal is also maintained as the application global time. When one of the special gettimeofday records is read from the journal, the sentry returns the current application global time. This optimization is called the gettime optimization.
4.1.4.1.6 Asynchronous Signals Signals are events that consist of two types: • self generated • externally generated A process self generates a signal when it performs some action that directly causes a signal to be delivered to itself. For example, executing an illegal instruction causes an illegal instruction signal delivery; accessing an illegal address causes a segmentation fault signal to be delivered, etc. Externally generated signals are delivered because of some action that another process has taken. A wakeup signal that is delivered when the process has a message arrival on a socket is one type of external signal. Another is a process kill signal that is sent from another process. It is clear that if an application follows the same controlflow,it will always create the same self generated signals at precisely the same point in its execution. For this reason, these types of signals require no special treatment in joumaling and arc merely ignored. Externally generated signals cannot be ignored. It cannot be guaranteed that when a process sends a signal to another process during a recovery, that the signal will arrive at the target process at the same point in its execution as it did in an original run. Assigning sequence numbers to signal creating calls does not solve the problem because signal delivery is not instantaneous and may be delayed. The solution taken is to provide maximum control over external signal delivery by having processes execute a sentry delivering signals to itself at the appropriate times. In UX a process will recognize all externally generated signals only on a system call boundary. This means that it is always possible to replay a signal at precisely the same point in the execution of a process during a recovery. The signal delivery point is treated as an operating system entry point and thus, a sentry point exists there so that signals are visible to policies if desired. Signals are joumaled to a separate signal journal that is created for each process of an application receiving external signals. When a process starts a recovery, it checks to see if a signal exists in this file and then prepares to send it on the correct system call. After a signal is delivered a new signal (if one exists) is read from the signal file. Only signals arriving on the correct system call are handled with all others being ignored. Most externally generated signals will arrive twice - once from the self delivery and once from the external source - but will only be handled once and on the correct system call. Handled specially is the "put a process to sleep" signal. To explain this situation lets consider the following: a process is put to sleep and then awoken because some event occurs. In the recovery, if the process puts itself to sleep it will never wake up to deliver itself the wakeup signal! For this reason, all slcep/wakeup pairs of signals are disregarded during recovery playback.
232
4.1.4.1.7 Disk Recovery Once a failure occurs, the application must be restarted with all of its initial conditions the same. In the application model assumed, the only peripheral that must be restored is the disk. Any modifications to diskfilesmust be undone when the application is restarted. The approach taken for the joumaling policy is to keep track of disk modifications on afilebyfilebasis. Figiire 4.1.11 presents the algorithm used in high level pseudo-code. When a system call is made that opens or creates a file or directory, an entry sentry for the call checks to see if the application has accessed thefilepreviously. If it is thefirstopen for thefilea note is msule of the existence of thefileand its size. Sentries for calls that modifyfilesor directories check to see if the part of thefilethey are modifying existed at the time of thefirstopen. If it did, it means that the system call is changing a part of the file that must be restored to its original state at Ae beginning of a recovery. Parts of thefilethat have not already been backed up are copied into a special subdirectory of the journal directory for file storage. If a file is being appended, or a part which didn't exist at the start of the application is being modified, then no action is taken. Finally, before afileis truncated or deleted, thefileis moved (not copied) into the file subdirectory. Sentry Action for File Operations open: if (file has not been opened before by the aptplication) lecord the size of the file modify: if (offset < initial size) if (offset range has not been saved) save part that has not been saved already to backup area deleteAruncate: move file to backup area At Start of Recovery If (file was deleted) move the backup to the original location If (file size > original size) truncate to original size Restore parts of file which exist in backup area
Figure 4.1.11 Disk restoring algorithm In this way, information is saved that allows afileto be reconstructed to its original state at the beginning of a recovery. Figure 4.1.12 shows a simple example. The application performs several actions on afilethat is originally 40 bytes in size. At the open call the initial size of the file is noted. At thefirstwrite (numbered 2 in the figure), the sentry determines that the modification is to an original part of the file and backs up this part (as pictured with the hatching). The second write call is to part of the original file and also appends 10 bytes to thefile.Only the part of the file that is original is saved. When the last call is made to delete thefile,the entire file is moved to the backup area. To reconstruct the disk state, each file is reconstructed. Files which were deleted are moved to the position in thefilesystem they occupied before deletion. The blocks
233
1. open - initial size is 40 2. write(0-20) WA 3. write(30-50)
4.delete
11
Figure 4.1.12 Example of disk restoring algorithm backing up modiications are written back to their positions in tlic file. Aftef this is completed the ile will!«identical to the original ile at the start of the original run.
4.1.4.2 Joiirnaling/EecoYery Performance This section presents performance mcasurcrnents taken for the joumaling policy and the jomnal reco¥ery poicy. To date, there arc no standard user-interface benchmarks so we have selected several scenarios tliat provide an oveiaM view of the performance. The vast majority of UNIX workstations in use run the X window system. For this reason, we make al measurements in the X en¥ironraent. The aH5licatioo model encompasses X with any programs run under X such as xterm, csh, and gnuemacs. In evaluating the joumaling pojicy, the cost of thi«e components must be determined First, interprocess communication causes joumaling in order to record event order. Secondly, external dcYiix input causes joumaling with the two types of external input on most workstations l)eing a keyboard and a mouse. With this information in mind we have selected four tests: • X initialization • mouse movement • keyboard input • X session
234
This section describes each test and presents 5 runs of each tesc • No policy enabled • Basic joumaling • Run-length optimization • Gettime optimization • Both optimizations In each case, times are shown for the joumaling policy and the recovery policy. In addition, the size of the journal, the number of joumaled calls, and the number of pro-' cesses involved in the test are listed. All table time values are in seconds and all size values are in bytes. For each individual joumaling test, times are shown for both synchronous journaling to the disk and asynchronous joumaling. Synchronous joumaUng is used if full recovery is desired and there is no stable memory. Asynchronous journaling approximates results that would be obtained if the journal was buffered in stable memory before being written to the disk.
4.1.4.2.1 X Initialization The start-up of X involves on the order of 20 processes. X start-up occurs when the xinit program is mn and is considered complete when an xclock and an xterm window have been created and a cursor appears in the xterm indicating that input can commence. The measurements taken in this test serve to show degradation due to sequencing and event ordering of communication calls and are shown in Ibble 4.1.1. Both times and journal sizes are averages of several runs and rounded to the nearest second and 100 bytes respectively.
X Imtulizaticn
«of procenea
DO joumaling
aaynchnnaua time
journal
time
journal
5
-
-
-
21500
basic joumaling 20 gettime apt
lyncfaroooua
6
20600
reeoveiy time
-
22000 55
22000
lua-Iength cpt
8300
19700
bolhapu
7500
19400
5
Table 4.1.1 X start-up performance The reason that journal sizes are larger for synchronous joumaling is that flushing on each record causes more context switches and therefore fewer chances for runlength compression. It also results in fewer gettimeofday calls that return the same value so the gettime optimization cannot be used. This eS^ect is seen in all of the tests. The times in the table for asynchronous joumaling and no joumaling indicate a
235
roughly 20 percent degradation due to joumaling. This number is misleading, however, because it is an average time, and in some cases the start-up time with journaling was smaller than without joumaling. Results like these are due to the fact that times are dependent on a large variety of factors such as the state of the unix disk buffers, the state of the cache and the paging file as well as activity by system daemon processes. All of these will vary from run to run.
4.1.4.2.2 Mouse Movement In the mouse movement test the mouse is moved in a continuous motion up and down the screen from the top to the bottom and back for 20 seconds. Degradation due to joumaling overiiead is not evident in latency of moving the mouse from one position on the screen to anothw, but on the number of times the mouse is sampled per unit of time. A lower sampling rate results in jerky mouse motion. Another type of joumaling is introduced here csJIed hybrid jourmlmg. Hybrid joumaling takes advantsynchranous Mouse Input
proceues
hybrid
synchronous
event
journal
event
journal
event
journal
820
-
-
-
-
-
700
83600
20
58200
350
58500
gecdmeopc
715
75(500
20
56100
410
53900
run-length opt
765
63000
20
37200
390
39172
both opts
800
66500
20
17300
420
40900
no joumaling ba
time (a/i/h/r)
20/25/ 20/2
Table 4.1.2 Mouse input performance tage of the fact that keyboard and mouse input is the only information which is critical. If the inputs are recorded and the event sequence up to the point where input occurred is recorded, then an application can be restored to an equivalent state in a recovery. Therefore, only when an external input arrives is the journal buffer flushed. In the case of a failure no external input will be lost, but message ordering may be different following the last input replayed from the joumal since this will not have been recorded during a failure. No measurements are shown for hybrid joumaling in Table 4.1.1 because there is no input during this test. If the test were mn with hybrid journaling the results would be identical to the asynchronous case. Table 4.1.2 shows times measured for different types of joumaling. Times for various operations are shown in the last column of the table. Thefirstthree numbers are the timefromthe moment the mouse began to move to the time it stopped and are for the asynchronous, synchronous and hybrid caserespectively.The last number is the replay time. The table shows that the sampling rate is reduced by a factor of forty when moving from asynchronous joumaling to synchronous joumaling. Hybrid joumaling is
236 only a fifty percent degradation with respect to the synchronous case. It was found during the tests that this rate is sufficient for a sense of accurate and timely control over the movements of the mouse cursor.
4.1.4.3 Keyboard Input The keyboard input test measures the cost of joumaling and replaying keystrokes. The data is presented in Table 4.1.4. Keyboard joumaling cost is visible as the delay between keystrokes and the appearance of the character on the screen. For each run in this test, 80 keyboard characters were entered. The typing rate wasreducedto the rate of character output. Thus, the higher the overhead, the longer the time to type the characters. uynchronotu Keyboud Input
processet
nojounuding
tunc
joumd
time
joumtl
time
joimttl
7
-
-
-
-
-
25700
b u c Joumaling 3 gettinieapC
hybrid
fynduoncus
7
24900
3S8Q0
-
25800
36300 95
recovery time
7
24100
mn-length opt
13400
32700
13300
both opts
13100
31500
13000
2
Table 4.1.3 Keyboard input performance The recovery time is only 30 percent of the original run time because device input is read directly from the journal. No blocking is done waiting for the input.
4.1.4.3.1 X Session This final test consists of a sample interactive session using elements from all of the above tests. The X window system is started, a gnu-emacs editor is opened and a file is edited requiring a number of modifications. Then the editor is closed and X is exited. Keyboard input is again limited to the display rate as in the keyboard test. This example serves to show the performance of the system under practical working conditions. The measurements are shown in Table 4.1.4. Under the hybrid case, which provides fault tolerant guarantees, the degradation is visible as a slight lag in feedback when typing or moving the mouse.
4.1.5 The Checkpointing and Checkpointing Recovery Policy There are two problems that prevent joumaling from being a truly practical stand-alone policy. They are journal growth and recovery latency. As events occur in
237
X Session
«of processes
time 40
nojouniftlizig buic joumaling
jounud •
40
69000
hybrid
time
Joiunil
time
joumsl
-
-
-
-
63S00
70000 21
geoimeopt
synchionous
uynchronous
no
62900
twovery time
-
57300 40
54500
nin-length opt
3SC00
55700
27600
boihopu
34800
52500
27500
20
Table 4.1.4 X session perTormance an application the journal will continue to grow, and given enough events, will consume all available disk space. The second issue becomes a problem for long running applications. For a recovery there is no time compression for compute bound segments of code. As the length of the original run grows, so does the recovery time; and, as the recovery time grows, so does the unlikelihood that it will be useful to recover the state of the application using the journal. For instance, if an application has run for 24 hours, a recovery that takes 22 hours may be unaccq)table. A checkpointing policy[5][8][12][13][14][21][22][23] solves both of these problems. If an application's state is ijerioidically saved, the joumal can be cleaned. At the time of recovery the an)lication is restored to the most recent checlqxiint and joumal recovery is performed to Ixing the application back to the state which existed at the time of the failure. The same application model assumed for joumaling is assumed for checkpointing. As with joumaling, there are several issues that must be addressed for a checkpointing implementation. They include the four checkpoint components: • memory checkpointing • disk checkpointing • operating system checkpointing • device initialization as well as the checkpoint interval. This section discusses each of these areas. Stable storage issues are the same as for joumaling and are not addressed further. At press time, the checkpointing implementation is in progress so no performance data is available.
4.1.5.1 Memory Checkpointing Modem ^plications have become very memory intensive in comparison with the applications of just a few years ago. Applications such as X, which are made up of many processes, can have on the order of a hundred megabytes of virtual memory in use at a given point in time. Saving the entire address space of each process at every checkpoint is extremely inefficient. Incremental checlqx)inting is a technique where only parts of the system which have changed since the previous checkpoint are saved.
238
Mach 3.0 allows a ppcess to use a pager that is external to the kernel to manage its memory. Our checkpointing implementation relies on this feature by adding memory management to UX. Processes run with the checkpoint policy in effect using the UX pager to keep track of page modifications. An implied checkpoint exists at the start of the application. After a process has started, the pager keeps track of any modifications on a page by page basis. At a checlqioint these modified pages are saved to stable storage. The checkpoint is complete when all modifications have been saved and at that point the journal can be cleared and the previous checkpoint erased. At the time of recovery, a process is loaded as it would be if it were started for die first time. Then its address space is allocated and copied from the checkpointed memory information. After this has been done for each process in the application, the entire memory space of the application has been restored.
4.1.5.2 Disk Checkpointing The mechanism which performs disk checkpointing works in conjunction with the journal disk recovery algorithm. At checkpoints, all information that has been saved by the joumaling policy is deleted, and all data structures that monitor the data that was saved are cleared. FUes that are open are noted for their size and this is then used as the initial size when rolling back to the checkpoint. To recover the disk to a checkpoint, the same procedures are followed as in the journal recovery policy.
4.1.5.3 Operating System Checkpointing The final component of a checkpoint is operating system data structures associated with each process of the application. When a process is checkpointed, operating system data structures such as the process control block, communications structures and file descriptOTS opened by the process, are saved to a file. For checkpoint recovery, firet the disk is restored, then the application's memory space, and finally the operating system data structures.
4.1.5.4 Device Initialization During application execution the state of external devices may be changed. An example is a video display card which is set to a certain mode of operation. If no provision is made for these operations, then recovery to a checkpoint may result in the device being in an incorrect state. To address this we assume that all devices were initialized to a known starting state prior to the application start and is reinitialized to this state during recovery. System calls that have changed the operating statefromthis initial one during an application's execution must bereplayedat the start of a recovery in order torestorethe device to the state it had at the time of a checkpoint. In the case of the video card, ioctl system calls and their parameters that access the device arc joumaled to special checkpoint journals which are never cleared. To recover the devices all offiiesecalls are executed by the checkpoint recovery policy as the final step of a recovery. When the recovery is complete devices will be in a correct state.
239
4.1.5.5 Checkpoint Interval Checkpointing implies a transactional saving of state to the stable medium. Because this is usually an expensive operation, checkpoints must be spaced in time. The issues involved in detennining the checkpoint interval include the cost of checkpoint, limits on the size of the journal, and recovery time limits. Each factor is influenced by both the implementation and by application requirements. Because our chectooint implementation is not complete, it is impossible to determine the tradeoffs. Our plans are to implement aflexiblecheckpoint policy where the interval can be set at run-time with some of the interval specification options being: • Time interval based • Journal size limit based • Application or user directed • Event determined Measurements of checkpoint overhead will give user's guidelines that can be followed to achieve specific application needs. A summary of the entire checkpointing algorithm is shown in Figure 4.1.13. At Checkpoint 1. Checkpoint data structures in UX related to the application 2. Do a memory checkpoint of the application. This mean^ that all pages modified since the last checkpoint are stored as well as the memory map of the application 3. Do a disk checkpoint. This means that the disk backup information is deleted At System Calls Modifying Rles Follow disk joumaling algorithm At System Calls Changing State of Devices Journal the operation At Recovery 1. Reboot MK and UX 2. Create a new processes 3. Restore checlqwinted UX information for each process 4. Restore the memory map of each process 5. Load the memory map with checlqxjinted data 6. Restore the disk by applying the disk restore algorithm 7. Initialize devices
Figure 4.1.13 Checkpoint Algorithm
4.1.6 Summary and Future Directions A mechanism has been defined and implemented that serves as a framework for fault management policies. To demonstrate the power of the mechanism several pobcies have been designed and implemented including monitoring, assertions, journal-
240
ing/joumal recovery, and checkpointing/checkpoint recovery. These policies are representative of practical, commonly used fault detection and fault tolerance techniques. The joumaling and checkpointing policies can be used to provide a workstation with the ability to tolerate transient faults in the hardware or software. The current direction of the project is the completicHi of the checkpointing policies which will result in a study of joumaling versus checkpointing trade-offs. Future directions include implementation of more policies, such as assertions, and primary/ backup fault tolerance. Integration of the techniques and policies that allow the ones presented to function in a distributed computing environment will be investigated. Finally, policies will be developed that offer a scalable solution to the fault tolerance problems that confront massively parallel architectures.
4.1.7 Acknowledgments The authors would like to thank the members of the Fault Tolerance Mach project for their input at strategy meetings. The members are Chuck Weinstock, Walt Heimerdinger, Chris Dingman and Arup Mukherjee, In addition we would like to thank Ashok Agrawala for his comments on early FTM work.
4.1.8 References [I] M. Accetta, R. Baron, W. Bolosky, D. Golub and R. Rashid, "A New Kernel Foundation for UNIX Development", USENfX 86, July 1986. [2] D. M. Andrews, "Software Fault Tolerance Through Executable Assertions", 12tk Asilomar Conference on Circuits and Systems and Computers, Pacific Grove, CA.. pp, 6641-645, Nov. 1978. [3] D. M. Andrews, "Using Executable Assertions for Testing and Fault Tolerance", FTCS-9, Madison, WI, June 20-22, pp. 102-105.1979. [4] J. F. Bartlett, "A Nonstop Kernel", Eigth Symposium on Operating Systems Principles, Asilomar, CA.. pp 22-29. Dec. 1981. [5] B. Bhargava Shu-Renn Lian, "Independent Checkpointing and Concurrent Rollback Recovery in Distributed Systems -an Optimistic Approach", Seventh Symposium on Reliable Distributed Systems, Colombus, OH, pp. 3-12, Oct. 1988. [6] H. Custer. Inside Windows NT, Microsoft Press, Redmond, WA. 1993. [7] E. N. Elnozahy, W. Zwaenepoel, "Manetho: Transparent Roll Back-Recovery With Low Overhead, Limited Rollback, and Fast Output Commit", IEEE Transactions on Computers, Vol. 41, No. 5., pp. 526-531, May 1992. [8] T. M. Frazier, Y. Tamir, "Application-Transparent Error-Recovery Techniques for Multicomponent", Fourth Conference on Hypercubes, Concurrent Computers and Applications, Monterey, CA., pp. 103-108, March 1989. [9] D. Golub, R. Dean, A. Fonn and R. Rashid, "Unix as an Application Program". USENIX Summer Conference, Anaheim, CA, June 11-15, 1990. [10] J. Gray and D. P. Siewiorek, "High-Availability Computer Systems", IEEE Computer, September. 1991. [II] D. Jewitt, "Integrity S2: A Fault-Tolerant Unix Platform". FTCS-21, Montreal. Canada, pp. 512-519, June. 1991. [12] T. T. Juang, S. Venkatesan, "Crash Recovery With Little Overhead", 11 th International Conference on Distributed Computing Systems, Arlington, TX, pp. 454-
241 461, May, 1991. [13] T. T. Juang, S. Venkatesan, "Efficient Algorithms for Crash Recovery in Distributed Systems", Tenth Conference on Foundations of Software Technology and Theoretical Computer Science, Bangalore, India, pp. 17-19, 1990. [14] R. Koo and S. Toueg, "Checkpointing and Rollback Recovery for Distributed Systems", IEEE Transactions on Software Engineering, Vol. 13, Jan. 1987. [15] T. Lehr, Z. Segall, D. Vrsalovic, E. Caplan, A. Chung, and C. Fineman, "Visualizing Performance Debugging", IEEE Computer, pp. 38-51, Oct 1989. [16] A. Mahmood, D. J. Lu, and E. J. McCluskey, "Executable Assertions and Flight Software", AIAAIIEEE 6th Digital Avionics Systems Conference, Baltimore, MD, pp. 346-351. Dec. 1984. [17] R. Rashid, R. Baron, A. Forin, D. Golub, M. Jones, D. Julin, D. Orr, and R. Sanzi, "Mach: A Foundation For Open Systems", Proceedings of the Second Workshop on Workstation Operating Systems, Pacific Grove, CA, Sept. 27-29, 1989. [18] M. Russinovich, Z. Segall, "Open System Fault Management - Fault Tolerant Mach", CMU Research Report, CMUCDS-92-8,1992. [19] M. Russinovich, Z. Segall, and D. P. Siewiorek, "AppUcation Transparent Fault Management in Fault Tolerant Mach", FTCS-23, Toulouse, France, pp. 10-19, June 22-24,1993. [20] D. Siewiorek and R. Swara, Reliable Computer Systems: Design and Evaluation, Digital Press, BurUngton, MA. 1992. [21] R. E. Strom, D. F. Bacon, S. A. Yemini, "Volatile Logging in n-Fault Tolerant Distributed Systems", FTCS-18, Tokyo, Japan, pp.27-30, 1988. [22] R. E. Strom, D. F. Bacon, S. A. Yemini, "Towards Self Recovering Operating Systems", International Cortference on Reliable Systems, Los Angeles, CA, pp. 59-71, April 21-23, 1975. [23] Z. Tong, R. Y. Kain, W. T. Tsai, "A Low Overhead Checkpointing and Rollback Recovery Scheme For Distributed Systems", Eigth Symposium on Reliable Distributed Systems, Seattle, WA, pp. 12-20, Oct. 1989. [24] UNIX Programmer's Manual, USENIX Association. March 1986.
SECTION 4.2
Constructing Dependable Distributed Systems Using Consul Richard D. Schlichting^f Shivakant Mi.shra,tt and Larry L. Petersont Abstract Constructing tiie software for a distributed system that can continue to provide dependable service despite failures is a complex task. Consul is a communication substrate that simplifies this task by providing a collection of fundamental abstractions for implementing replicated processing. These include provisions for transmitting messages atomically and in some consistent order to a group of processes (atomic multicast), for detecting failures and agreeing on the resulting system composition (membership), and for reestablishing a consistent process state following failure (recovery). This chapter outlines the features provided by Consul and its implementation using the .v-kemel.
4.2.1, Introduction Computers are increasingly being used in applications where dependable service must be provided despite failures in the underlying computing platform. While tolerating processor crashes and network problems is undoubtedly difficult, a number of well-understood techniques have been developed that can be used to make key computing services fault-tolerant. One such technique is to replicate processing activity on separate processors in a distributed system and then coordinate their Portions previously appeared in: Mishra, Peterson, and Schlichting, Consul: A Communication Suhslralo for Fault-Tolerant Distributed Programs, Distributed Systems Eiif>ineeri/ii;, sol. 1, pp. X7-103. 19^3; and Mishra, Peterson, and Schlichting, Experience with Modularity in Consul. Softwaie—Praciice ami Experience, vol. 23, pp. 1059-1075, Oct. 1993 (reprinted by permission of John Wiley and Sons. Ltd.). + Dept. of Computer Science, Univ. of Arizona, Tucson. AZ 85721. i't Dept. of Computer Science and Eng., Univ. of California San Diego, La Jolla. CA 92093. This work supported in pan by the National Science Foundation under grant CCR-9O031()l and the Office of Naval Research under grant NOOO14-91 -J-1015
244 execution so that they appear to the user of the service as a single logical entity. Thus, depending on the degree of replication and the type of faults to be tolerated, some number of processors can fail without the service being lost. This approach has been formalized as the replicated state machine approach 11 ]. Consul is a communication substrate that facilitates the construction of software structured in this way by providing implementations of a collection of fundamental fault-tolerance abstractions. These abstractions include a multicast service to deliver messages to a collection of processes reliably and in some consistent order, a membership service to maintain a consistent system-wide view of which processes are functioning and which have failed, and a recovery service to recover a failed process. Consul provides these abstractions in the form of a unified collection of communication protocols. These protocols have been implemented using the vkernel. an operating system kernel designed for easy implementation and composition of communication protcKols (2|. This chapter provides an overview of Consul and its implementation. In Section 2, the abstract services realized in Consul are first described, followed by a more detailed examination of the major protocols. Section ?i then describes the system implementation using the .v-kernel. Finally, Section 4 highlights related work, while Section 3 offers some conclusions.
4,2.2. The Design of Consul 4,2.2.1. Abstract Services From the application's perspective. Consul provides a collection of abstractions that collectively support the state machine model of distributed computing. In this approach, the application maintains state variables that are modified in response to commands received from other state machines. Execution of a command is deterministic and atomic with respect to other commands. The output of a state machine, that is, the sequence of commands to other state machines or the environment, is completely determined by the sequence of input commands. A fault-tolerant version of a state machine is implemented by replicating the state machine and running each replica in parallel on a different processor in a distributed system. Key requirements for implementing the state machine approach include maintaining replica consistency at all times and integrating repaired replicas following failure. The abstract services found in Consul are designed specifically to support these requirements. For example, the multicast service provides atomic (i.e..
245
all or nothing) message delivery and a consistent ordering among all recipients, which makes it idea! for disseminating commands to state machine replicas. Figure 4.2.1 illustrates the services found in Consul and the fundamental dependencies among them. In this figure, the rectangles are services, with an arrow from service S. to service S-, indicating that the correctness of Sj depends on the correctness of S~, [3J; the edge labels indicate the property that induces the dependency. At the top of the figure is the state machine that represents the application program; it depends directly on two services; multicast and recovery. As already
State Machine
{recoverabllltyj {replica coordination}
Recovery
Multicast
Membership {total order)
{order}
{recoverability}
Stable Storage
Time
Figure 4.2.1 — Fault-tolerant services and dependencies
246 mentioned, multicast is a communication service that allows a message to be transmitted asynchronously to a group of processes atomically and in a consistent order, while recovery deals with restoring the state of a failed state-machine replica upon restart. Membership provides a consistent view of which processors are functioning and which have failed at any given moment in time. Membership is used by the recovery service when a replica recovers and by the multicast service to implement a consistent total order; it also depends on multicast to disseminate messages to instances of the membership service on other machines. The time service provides the abstraction of a common time base on all the machines in a distributed system despite the lack of a single physical clock. In Consul, this service is realized using logical clocks [4], and is used by multicast to consistently order messages. Finally, it should be noted that these services are widely recognized as fundamental to the construction of fault-tolerant distributed systems, with variants being used in a large number of systems [5, 6, 7, 8, 9]. The dependencies are fundamental as well since they are induced by the properties of the services and not by the specific way in which they are realized in Consul. Further discussion of these abstraction, their interrelationships, and the systems that use them can be found in Reference [10].
4.2.2.2. Protocol Overview We now turn our attention from the abstract services provided by Consul to the set of protocol modules that realize these services. A copy of these protocols resides on each machine in a distributed system, and provides an interface between the application program in the form of the state machine replicas and the underlying network. The communication network is assumed to be asynchronous, with no bound on the transmission delay for a message between any two machines. Messages may be lost or delivered out-of-order, but it is assumed that they are never corrupted. Furthermore, machines are assumed to suffer fail-silent semantics, i.e., they fail by crashing without making any incorrect state transitions. Finally, Consul assumes that stable storage is available to each machine, and that data written to stable storage survives crashes [11]. The mapping from fault-tolerant services to protocols is primarily l-to-1 or 1to-few; that is, the services are implemented independently of one another as individual protocols or as a small set of protocols, rather than together in one monolithic system. Figure 4.2.2 illustrates the detailed architecture of a typical Consul protocol configuration. In this figure, the rectangles are protocols, with an arrow from
247
State Machine
Recovery h—^ Membership
Stable Storage
FailureDetection
Order
Network
Figure 4.2.2 — Consol protocols and dependencies
protocol P. to protocol P, indicating that Pj invokes operations on Pj to implement its functionality. The mapping from fault4olerant service to protocols is as follows. The Recovery protocol implements the recovery service, the Membership and FailureDetection protocols collectively implement the membership service, and a combination of the Psync and Order protocols implement the multicast and time services. In this figure, the stable storage and network protocols are shaded to indicate
248 that they are provided externally, and hence, assumed by Consul. 4.2.2.3. Psync Psync is the main communication mechanism in Consul [12]. It provides a multicast facility that maintains the partial order of messages exchanged in the system. Specihcally, it supports a conversation abstraction through which a collection of processes such as the state machine replicas exchange messages. A conversation is explicitly opened by specifying a set of participating processes called the membership list. ML. A message sent to the conversation is multicast to all processes in ML. Fundamentally, each process sends a message in the conte.xt of those messages it has already .sent or received, a relation that defines a partial ordering on the messages exchanged through the conversation. Psync explicitly maintains the partial order, which has also been called causal order |6], in the form of a directed acyclic graph called a context graph. Psync provides operations for sending and receiving messages, as well as for inspecting the context graph. The multicast message delivery implemented by Psync is atomic, i.e., either all the non-failed processes in ML receive the message, or none do. Psync maintains a copy of a conversation's context graph at each processor on which a participant in the conversation resides. A distinct copy of ML is also maintained at each such processor. Each time a process sends a message, Psync propagates a copy of the message to each of these other processors; this is done by sending either several point-to-point messages in a point-to-point communication network, or one multicast message in a broadcast network. This propagated message contains the identifiers of all the messages upon which the new message depends; i.e., it identifies the nodes to which the new message is to be attached in the context graph. Psync recovers from network failures by recognizing when a new message is to be attached to a message that is not present in the local copy of the graph. In this case, Psync asks the processor that sent the new message to retransmit the missing message. That processor is guaranteed to have a copy of the missing message because it just sent a messiage in the context of it. Mechanisms are provided for pruning the context graph or spooling messages off-line to stable storage under control of higher-level protocols. Psync itself provides only minimal support for recovering from a processor failure. Specifically, Psync supports two operations that affect the local definition of ML. Note that these operations are purely local; they do not affect the definition of ML at other processors. First, the local process can tell Psync to maskout a certain
249 process. This operation removes the process from the local definition of ML. It also causes Psync to stop accepting messages from that process. Second, the local process can tell Psync to maskin a certain process. This has the effect of returning the process to the local definition of ML and accepting future messages sent by that process. In addition, Psync supports a restart operation that is invoked by a process upon being restarted following a failure. Execution of this operation has two effects: to inform other processes that the invoking process has restarted and to initiate reconstruction of the local copy of the context graph. Psync accomplishes this by broadcasting a special restart message. When this message is received at a processor, the local instance of Psync performs two actions. First, it generates a local notification of the restart event; this is implemented as an out-of-band control message that is delivered to the local process. Second, it transmits its current set of leaf nodes to the process that generated the restart message. The restarting process then uses a combination of messages spooled to stable storage and the standard lost message protocol to reconstruct the local copy of the context graph.
4.2.2.4. Order Consul's Order protocol enforces consistency on the order in which replicas receive messages, a property that is used to guarantee that replicas process commands in a consistent way. This protocol is chosen from a suite of different and independent protocols, each providing a different kind of consistent message ordering using the partial ordering provided by Psync as a base. To date, we have designed two Order protocols. One is a total order protocol, in which the stream of message is delivered to each replica is exactly the same order. When combined with the atomic message delivery guarantees of Psync, this gives the effect of an atomic broadcast [13, 14, 15, 16]. The other protocol enforces a semantic dependent order. With this scheme, the semantics of the state machine commands contained in the messages are exploited to allow the delivery order of messages to vary between replicas, while still maintaining a consistent state. The specific protocol that has been designed and implemented, called SemOrder, exploits the commutativity of operations used in certain applications. Here, an operation is defined to be conwnitativc if the execution of two or more consecutive invocations of that operation in any order gives the same result, and noncommutative if not.
250 To illustrate the details of SemOrder, consider a replicated directory that implements delete, insert, and update operations. Among these operations, note that multiple invocations of delete are commutative assuming that deleting a nonexistent entry is treated as a no-op. That is, executing a collection of such operations in any order leaves the object in the same state. Invocations of insert and update are not commutative, however, because applying them in different orders may leave the object in a different state. In addition to this static relationship among the operations, there is also a dynamic relationship among the operations based on when they were invoked. Suppose Figure 4.2.3 represents the partial order of invocations of operations as given by a Psync context graph, where the subscripts are used to distinguish between different invocations of the same operation. Here d, i, and u represent delete, insert, and update, respectively. In this example, operations dy d2, dy ('-,, and Uj were invoked at the same logical time: that is, these operations were invoked concurrently on different machines, implying that there is no causal relationship among the operations. SemOrder first orders the operations based on the partial ordering—e.g., /| is executed before d. because it was invoked first—and then takes advantage of the commutativity of the operations to enhance concurrency. For example, because invocations of delete are commutative and because d., d^, and d.^ were invoked at the same logical time, they can be executed in any order, and in fact, in a different order at each replica. Furthermore, because insert and update are not commutative, ('T and //. must be executed in the same order by each manager process and they must be totally ordered with respect to the group of commutative delete operations. In other words, we assign a precedence to the operations and then use this precedence to break ties between operations that were invoked at the same logical time. For example, if delete is preferred to insert, which is in turn preferred to update, then the set of operations J j , c/^, dy /-,, ii^ can be executed in any of the following orders: dy d-y, dy (.,, Hj d., dy dj.',,
u
d-i. d.. dy i-y, u
dydydyi^, u dy dy d-„ i^, u. dyd^, dy i-,, M| In contrast, a solution based on a total order would have limited each replica to just one total ordering, i.e., one of these six or some other that has /, and/or u^ earlier in
251
Figure 4.2.3 — Example partial ordering of operation invocations
the ordering.
4.2.2.5. Membership Consul's membership service is implemented by a pair of protocols: FailureDetection, which handles local detection of failure and recovery events, and Membership, which coordinates agreement among group members about the event. The FailureDetection protocol handles its task by monitoring the messages exchanged in the system, and upon suspecting a change of state of a process, initiating the
252 membership protocol by submitting a distinguished message to the conversation. In Consul, a failure is typically susp)ected when no message has been received from a given replica in some interval of time, while recovery is based on the asynchronous notification generated when the recovering process executes the Psync restart primitive. The Failure Detect! on protocol also tran.smits dummy messages whenever the replica on its machine is silent for some period of time. This serves two purposes. First, it ensures that every message received by the process is acknowledged within some interval of time. Second, it reduces the possibility of an idle process being suspected as failed. Membership is used to reach agreement among the currently executing processes that a failure or recovery event has occurred so that all replicas maintain a consistent view. This protocol is based on the partial order provided by Psync. As a result, it requires less synchronization overhead and performs esp)ecially well in the presence of multiple failures. In particular. Membership sits on top of Psync and coordinates the way in which processes modify their local membership list using the Psync priinitives inaskin and maskout. To ensure the correctness of the application in the presence of failures. Membership guarantees the following two properties: all functioning processes receive the same set of messages in partial order, and the decisions taken by the application based on these messages are consistent even in the presence of failures. In the situation where only one replica at a time is assumed to fail. Membership is relatively straightforward. Assume ML initially contains n processes. The Membership protocol is based on the effect that the failure has on the context graph. In particular, since a process obviously sends no messages once it has failed, it can be guaranteed there is no message from the failed process at the same logical time as initiation message sent by FailureDetection. If, on the other hand, there is a message from the suspect process at the same logical time as the initiation message, then it can be viewed as evidence that the process has in fact not failed. In this case, it is likely that the original suspicion of process failure was caused by the process or network being "slow" rather than an actual failure. Membership uses this heuristic to establish the failure of a process. The goal of the protocol is to establish an agreement among the n-1 alive processes about the failure or recovery of the n'^ process. The basic strategy is to agree on the failure of the process if and only if none of the n-1 processes have received a message from the suspect process at the same logical time as the protocol initiation message. A process sends a positive acknowledgement, if it has not
253 received any message from the suspect process at the same logical time as the initiation message. Otherwise, it sends a negative acknowledgement. A process is decided to have failed, ifn-1 positive acknowledgements are received in response to the protocol initiation message. In case of recovery, the process is incorporated in the membership list once all the remaining n-l processes have acknowledged its recovery. The protocol outlined above does not work in the presence of multiple concurrent failures. In the presence of such concurrent events, the protocol becomes much more complex. Perhaps the predominant reason for this is the inherent lack of knowledge about the set of processes that participate in the membership agreement process itself. That is, processes may fail or recover at any time and, in particular, they may fail or recover while the membership protocol itself is in progress. Another source of complexity stems from the requirement that a consistent order of removal or incorporation of processes in the membership list be maintained. A complete description of the Consul membership protocol that tolerates multiple concurrent failures can be found in (17|.
4.2.2.6. Recovery The Recovery protocol is concerned with bringing the state of a failed replica back into synchronization with the remainder of the replicas upon restart. The strategy currently implemented in Consul to realize this service is based on a combination independent checkpoiutlmessage logging technique. In this technique, replicas write checkpoints without attempting to coordinate with other replicas and messages are stored into a log on stable storage for later retrieval. When a failure occurs, the state of the replica is restored to the most recent checkpoint, with logged messages being used to make the hnal transformation to the current state. In the case of Consul, however, no explicit log is needed since messages are implicitly logged in the context graph when they are transmitted in what is, in essence, automatic sender-based logging 118]. This general technique is most applicable in situations where a failed replica restarts relatively quickly, either on its original processor after a reboot or some other functioning processor. For a replica that is down for an extended period of time, recovery based on state transfer from a functioning replica such as is done in ISIS [6] is more efficient. Such a strategy could be configured into Consul as an alternative recovery protocol. Given this checkpointing and message logging strategy, the recovery service goes through three stages. The first stage restores the substrate to the checkpointed
254 state, the second stage restores Consul and the state machine replica to the current state of the system, and the third stage initiates the membership protocol to incorporate the process back into the membership list. Since the context graph and its role as implicit message log make it perhaps the key component in Consul's recovery strategy, we outline these stages by focusing on the context graph. As shown in Figure 4.2.4, the graph can be divided into four regions. Define wave(fi) to be the wave containing node n. The first portion is from the root of the graph down to and including the nodes in wave(n-view), where n-view is any node in the newly restored view. Any commands in this portion have already been applied. The second portion is from below the restored view down to and including the nodes in waveln-failed), where n-failed is the node corresponding to the message generated
Stage 1
wave(n-view)
Stage 2
wave(n-failecl)
wave(n-restart) Stage 3 Figure 4.2.4 — Context graph during recovery
255 by the process that detected the failure of p. Any commands in this region may or may not have been applied. Similarly, the third portion is from below \vave(n-fallcJ) down to and including the nodes in wave(n-reslartj, where n-restan is the node corresponding to the message generated when p recovers. Operations in this area were missed due to the failure of p. The fourth and final portion of graph consists of those nodes below wave(n-restart). Any commands here are new commands that p will apply once recovery is complete. The stages of the Recovery protocol are based on these different segments of the context graph. After Psync reconstructs the local copy of the graph from stable storage or other hosts, the recovering process begins reading messages starting at wave(n-view). It processes these messages as it would normally, but with two exceptions. First, all the protocols except Psync function in a passive mode, that is, they do not send any messages during this time. The intuition is thai, although the recovering site is replaying the past, it was not an active participant in the decisions and hence should not send messages. Second, the recovering site does not accept new operation requests from client programs. Such messages are associated with normal processing, and so are deferred until the recovery phase is complete. The final stage is determining when to incorporate the process into the membership list and determining when the recovering process can start participating actively in the system. Intuitively, the recovering process starts participating actively in the system after it has been incorporated into the membership list by every alive process, thus ensuring that it has received all messages prior to this time. The Recovery protocol detects this condition by monitoring the message flow for the final message in the membership protocol from every other process.
4.2.3. Implementation We now consider how Consul's protocols were actually implemented in a particular object infrastructure—the x-kemel. The implementation consists of approximately 10,000 lines of C code, of which 3,500 is Psync. Consul currently uses a version of the A'-kemel that runs standalone on Sun-3 workstations; a port to a version running on the Mach microkernel is in progress. Two small prototype applications have been constructed, a replicated directory server and a replicated word search game; the Mach version is also being used to implement a replicated tuple space for a faulttolerant version of the Linda coordination language [19].
256 4.2.3.1. Configuration The v-kernel provides an object-oriented framework designed to support the rapid implementation of efficient network protocols. It does this by providing a uniform protocol interface and support library that allows the programmer to configure individual protocol objects into'dprotocol ^raph that realizes the required functionality. Each node in this graph corresponds to a protocol object, and the edges represent a "uses" relationship. That is, an edge from protocol P. to protocol P^ indicates that P. opens P^ to send and receive messages on its behalf. Note, however, that in the case of Consul, the actual message flow has been optimized and hence, does not always follow these edges. This is discussed in more detail below. Figure 4.2.5 illustrates the protocol graph that implements Consul, where each protocol described in the preceding section is implemented as an v-kemel protocol object. Accordingly, there are single protocol objects implementing Psync, Membership, FailureDetection, and Recovery; there is a set of objects for Order, each of which implements a different ordering discipline, hi addition, the Dispatcher protocol maps messages onto commands—it tags each outgoing message with a command tag and calls the application-level command corresponding to the tag on each incoming message. Two final protocols—Divider and (Re)Start—are required for configuration purposes. Divider demultiplexes incoming messages to the appropriate high-level protocols; its specific role is described in more detail below. (Re)Start establishes a connection among various protocols needed by an application for proper communication, and reestablishes them after a failure; this protocol remains quiescent at other times. Neither protocol object is a real protocol in the sense that it exchanges messages with a peer on another machine: they exist only to "manage" the protocol graph. As before, the stable storage and network protocols represent facilities that are provided externally.
4.2.3.2. Opening Connections Connections among the various protocol objects, as well as between the application and Consul, must be explicitly created at initialization time. From the perspective of the application, this occurs when the state machine replicas, each identified by a well-known port, decide to open connections among themselves in order to exchange messages. To do this, each replica process opens the top-most object in its protocol graph, specifying the well-known port and host addresses of the other replicas. This protocol then opens lower-level protocols, and so on. When a high-level protocol object opens a low-level protocol object, the low-level protocol returns an v-kernel
257
State Machine
(Re)Start
Dispatcher
Recovery
Order
Stable Storage
Network
Figure 4.2.5 — Consul protocol objects in an x-kcmel graph
258 session object. This session object represents the end-point of a connection and can be used by the high-level protocol object to send and receive messages. Consider now the process of opening connections through the protocol graph in more detail. First, the replica process opens the (Re)Start protocol object multiple times—once for each command supported by the replica. The (Re)Start protocol then opens the Divider protocol which, in turn, opens the Psync protocol. Because there is a one-to-one relationship between the replica and the corresponding Psync session, the session identifier returned by Psync serves as an internal replica id; it is passed as an argument when each subsequent protocol is opened to ensure that the appropriate sessions are properly connected, (Re)Start then opens the Recovery, FailureDetection, and Membership protocols exactly once, and the Dispatch protocol once for every application open invocation (i.e., once for each replica-level command) and the Dispatch protocol in turn opens Order exactly once. Finally, the Recovery, Membership, FailureDetection, and Order protocols each open the Divider protocol. Divider knows which Psync .session to associate the.se protocols with because of the unique id mentioned above.
4.2.3.3. Sending and Receiving Messages Now consider how messages flow through the session objects that represent the dynamic configuration of Consul. Outbound messages generated by the replica process—which are the commands issued by the replica—are first sent to the Dispatch session that corresponds to the command being issued. The Dispatch session tags the message with a command id, and sends it out through the Order session. Finally, Order sends the message to the Psync session, which multicasts it over the network. Notice that the Divider never processes outgoing messages. For inbound messages that correspond to incoming commands issued by other replicas, the Psync session hands the message to the Divider protocol, which passes it up to Order. Order then delivers it to Dispatch, which finally invokes the appropriate command in the replica process. Notice that in both cases, the (Re)Start protocol does not process messages; it only exists to manage the process of opening protocols. The flow of messages just described corresponds to commands that are issued and received by the replica process. In addition, the other protocol objects in Consul exchange messages with their peers on other machines. For example, the Membership protocol exchanges messages to reach agreement that a replica has failed. Also, FailureDetection receives a copy of all incoming messages. This is how it learns that remote replicas are still functioning. In short, each incoming mes.sage is potentially
259 received by multiple protocols. The fact that a set of protocol objects on the same machine have to cooperate so closely means that they need to have common knowledge about what messages look like, i.e., they share message representation information. Specifically, all protocols except for Psync recognize two types of messages: OT (operation type) message and MT (monitoring type) messages. Each OT message is a replica-generated message that invokes a specific command on each replica. MT messages, on the other hand, are used internally by Consul's protocols to exchange information. In other words, OT and MT messages roughly correspond to data and control messages in a traditional monolithic protocol, the only difference being that they are shared by a set of protocols. Each protocol object receives one or both message types. For example, Order receives both the OT and MT messages, while Membership receives only MT messages. The protocols specify which messages they expect to receive to the Divider upon initialization, which in turn delivers the appropriate messages as they are received. As already noted, this means that the Divider may deliver a single incoming message to multiple high-level protocols.
4.2.3.4. Restoring Connections Processor failures cause the connections among various protocol and session objects to be lost in addition to their states. As a result, when a replica recovers, all these objects and interconnections must be recreated. To restore these connections, every protocol and session object stores information in stable storage at a well-known logical address. Typically, a protocol object stores the number of its associated .session objects, and for each of its sessions, the logical addresses in stable storage where that session's state is checkpointed, while each session object stores its state. This is performed during the periodic checkpointing that every session performs while the system is operating. After this checkpoint is read, connections among protocol and session objects are restored by the (Re)Start protocol. There is, however, an additional complexity that must be dealt with: the session states cannot be fully restored given only the information stored by the previous incarnation of the session object since these states also depend on the checkpoints taken by the other protocols. This problem is solved as follows. First, the (Re)Start protocol gathers the relevant checkpoints from all the protocols; these checkpoints include the internal replica id. The (Re)Start protocol then instructs the Divider protocol to restore the sessions corresponding to each unique id. The Divider protocol.
260 in turn, invokes the Psync protocol object to reconstruct the session state corresponding to the session identifier retrieved from stable storage. The Psync protocol object creates a Psync session, reconstructs the context graph from the stable storage and returns the new unique id to the Divider protocol, which returns it to the (Re)Start protocol. (Re)Start then invokes the FailureDetection, Membership, Dispatch, and Recovery protocol objects to recover their appropriate session states, while the Dispatch protocol in turn invokes the Order protocol to recover the state of each of its sessions. This completes restoration of the connections among various protocol and session objects of the communication substrate. The connection between the application process and the substrate is restored when the application invokes the (Re)Start protocol with the appropriate port.
4.2.4. Related Work Considerable attention has been given to the design of various fault-tolerant protocols and systems. For providing consistent message ordering, related approaches may be classified into two categories. The first category includes those protocols where the .semantics of the operations are not exploited and a total order is imposed to implement replicated objects or a related constructs. Examples of this approach include (20, 21, 22]. In the second category, the semantics of the application have been exploited to come up with a solution, as was the ca.se with our SemOrder protocol. Examples here include [23, 24, 25]. In the area of membership services, protocols have been proposed for systems with and without synchronized physical clocks. Examples of the former include [3, 8]. All protocols of this type make use of synchronized clocks to maintain a consistent view of which processes are functioning at every clock tick. Examples of the latter include |5, 26, 27]. These protocols maintain a consistent view of the configuration, but have the property that the complete protocol has to be restarted when a process fails while the protocol is in progress. This is in contrast with our protocol, where subsequent failures are handled incrementally. Consul can also be contrasted with other comprehensive fault-tolerant systems. These include MARS (8], AAS 17], Delta-4 [9], and ISIS [6]. Both MARS and AAS are distributed real-time systems that employ synchronized clocks to implement various fault-tolerant services provided by the system. MARS is a system designed for distributed real-time process control applications, while AAS is designed to replace the present en-route and terminal approach U.S. air traffic
261 control computer systems. Because these systems use synchronized clocks, the algorithms for various protocols in these systems cannot make use of the partial order among various events in the distributed system and hence they resort to more expensive total order. On the other hand, in Consul, partial order has been used to provide more efficient algorithms for these protocols, but no real time guarantees are made. ISIS and Delta-4 do not make use of the synchronized clocks for implementing various fault-tolerant protocols. The Delta-4 project seeks to define a dependable distributed, real-time operating system that allows integration of heterogeneous computing elements, while ISIS is a distributed programming environment that provides tools for building fault-tolerant applications. Both of these systems provide causal ordering but they do not preserve the context graph and present it to the application. As a result, these systems cannot provide fault-tolerant algorithms that make use of the communication history of the system. In particular, weaker orderings such as semantic dependent ordering, that make use of the communication history, cannot be implemented in these systems.
4.2.5. Conclusions Constructing a dependable system is a complex task due to the inherent uncertainties that result when failures must be considered. Consul is designed to simplify the task of building programs that use replicated processing to realize dependability by presenting the higher level software with abstractions that simplify the programming process. These include an atomic multicast facility, protocols that ensure consistent message ordering, a membership service, and recovery facilities. The novelty of the system is in the new algorithms that have been developed for these services, as well as the way they have been realized as a modular and configurable implementation in the .v-kemel. The net result is a system that provides the application designer with important new tools for constructing software of the type required for many highlydependable applications.
4.2.6. References [1]
F. Schneider, "Implementing fault-tolerant services using the state machine approach: A tutorial," ACM Computing Surveys, vol. 22, pp. 299-319, Dec. 1990.
|2)
N. Hutchinson and L. Peterson, "The .v-kernel: An architecture for implementing network protocols," IEEE Trans, on Software Engineering, vol. SE-17, pp. 64-76, Jan. 1991.
262 [3|
F. Cristian, "Understanding fault-tolerant distributed systems," Commun. ACM, vol. 34. pp. 56-78, Feb. 1991.
[4]
L. Lamport, "Time, clocks, and the ordering of events in a distributed systems," Commun. ACM, vol. 21, pp. 558-565, July 1978.
|5]
K. Birman and T. Joseph, "Reliable communication in the presence of failures," ACM Trans, on Computer Systems, vol. 5, pp. 47-76, Feb. 1987.
[6]
K. Birman, A. Schiper, and P. Stephenson, "Lightweight causal and atomic group multicast," ACM Trans, on Computer Systems, vol. 9, pp. 272-314, Aug. 1991.
[7]
F. Cristian, B. Dancey, and J. Dehn, "Fault-tolerance in the Advanced Automation System," in Proc. 20th Symp. on Fault-Tolerant Computing, Newcastle-upon-Type, UK, pp. 6-17, June 1990.
[8]
H. Kopelz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C. Senft, and R. Zainlinger, "Distributed fault-tolerant real-time systems: The Mars approach," IEEE Micro, pp. 25-40, Feb. 1989.
[9|
D. Powell, ed., Delta-4: A Generic Architecture for Dependable Computing, Research Reports ESPRIT, Vol. 1, Springer-Verlag, 1991.
|10| S. Mishra and R. Schlichting, "Abstractions for constructing dependable distributed systems," Technical report 92-19, Dept. of Computer Science, University of Arizona, 1992. [Ill
B. Lampson, "Atomic transactions," in Distributed Systems-Architecture and Implementation (B. Lampson, M. Paul, and H. Seigert, eds.), ch. 11, pp. 246265, Springer-Verlag, Berlin, 1981.
[12] L. Peterson, N. Buchholz, and R. Schlichting, "Preserving and using context information in interprocess communication," ACM Trans, on Computer Systems, vol. 7, pp. 217-246, Aug. 1989. 113) F. Cristian, H. Aghili, R. Strong, and D. Dolev, "Atomic broadcast: From simple message diffusion to Byzantine agreement," in Proc. 15th Symp. on Fault-Tolerant Computing, Ann Arbor, ML pp. 200-206, June 1985. [14] M. Kaashoek, A. Tanenbaum, S. Hummel, and H. Bal, "An efficient reliable broadcast protocol," Operating Systems Review, vol. 23, pp. 5-19, Oct. 1989. |15| P. Melliar-Smith and L. Moser, "Fault-tolerant distributed systems based on broadcast communication," in Proc. 9th Conf. on Distributed Computing Systems, Newport Beach, CA, pp. 129-134, June 1989.
263 [16] P. Verissimo, L. Rodrigues, and M. Baptista, "AMp: A highly parallel atomic multicast protocol," in Proc. SIGCOMM '89, Austin, TX, pp. 83-93, Sept. 1989. [17] S. Mishra, L, Peterson, and R. Schlichting, "Consul: A communication substrate for fault-tolerant distributed programs," Distributed Systems Eiigineer/«^, vol. 1, pp. 87-103, 1993. [18] D. Johnson and W. Zwaenepoel, "Sender based message logging."" in Proc. 17th S\mp. on Fault-Tolerant Computing, Pittsburgh, PA, pp. 14-19, July 1987. " [19] D. Bakken and R. Schlichting, "Supporting fault-tolerant parallel programming in Linda,"' IEEE Trans, on Parallel and Distributed Systems, to appear. 1994. [20] K. Birman, T. Joseph, T. Raeuchle, and A. El Abbadi, "Implementing faulttolerant distributed objects," IEEE Trans, on SoftH'are Engineering^, vol. SE11, pp. 502-508, June 1985. [21] A. Birrell, R. Levin, R. Needham, and M. Schroeder. "Grapevine: An exercise in distributed computing," Commun. ACM, vol. 25, pp. 260-274, Apr. 1982. [22] B. Oki and B. Liskov, "Viewstamped replication: A new primary copy method to support highly-available distributed systems," in Proc. 7th ACM Symp. on Principles of Distributed Computing, Toronto, Canada, pp. 8-17, Aug. 1988. [23] D. Daniels and A. Spector, "An algorithm for replicated directories,'" in Proc. 2nd ACM Symp. on Principles of Distributed Computing, Montreal. Canada, pp. 104-1 n ! Aug. 1983. [24] M. Herlihy, "Extending multiversion time-stamping protocols to exploit type information," IEEE Trans, on Computers, vol. C-36, pp. 443-448. Apr. 1987. [25] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat, "Providing high availability using lazy replication," ACM Trans, on Computer Systems, vol. 10, pp. 360391, Nov. 1992. [26] J. Chang and N. Maxemchuk, "Reliable broadcast protocols," ACM Trans, on Computer Systems, vol. 2, pp. 251-273, Aug. 1984. [27] A. Ricciardi and K. Birman, "Using process groups to implement failure detection in asynchronous environments," in Proc. 10th ACM Symp. on Principles of Distributed Computing, Montreal, Canada, pp. 341-353, Aug. 1991.
S E C T I O N 4.3
Enhancing Fault Tolerance of Real-Time Systems through Time Redundancy^ Sandra R. ThueP Jay K. Strosnider"'
ABSTRACT Fault-tolerant, real-time systems require correct, time-constrained results in the presence of faults. Missed deadlines in many high dependability systems can result in significant property damage or loss of human life. Historically, designers relied almost exclusively upon massive hardware replication to achieve their dependability goals. Research suggests that not only is this approach inadequate for dealing with certain fault classes, but also that it is inappropriate for many applications with strict space, weight, and cost constraints. Alternatively, time redundancy can be used to complement replication as a means to improve fault coverage and reduce the required level of replication for fault-tolerant system design. Although previous work has advocated the use of time redundancy to provide protection against hardware and software faults, there exists no formal methodology for allocating and managing such time. This chapter provides an overview of recent work in developing a comprehensive analytical framework for allocating and managing time redundancy to preserve the timing correctness of priority-driven, real-time systems in the presence of faults.
^Research supported in part by Office of Naval Research under Contract N00014-92-J-1524 and by ATfoT Bell Laboratories under the Cooperative Research Fellowship Progrcun. •^AT&T Bell Laboratories, Holmdel, NJ 07733. This work was done while the author was at Carnegie Mellon University, Pittsburgh, PA. •'Department of Electrical and Computer Engineering, C€irnegie Mellon University, Pittsburgh, PA 15213
266
4.3.1
Introduction
Fault Tolerance in real-time systems is the ability of a system to provide a service in a timely manner even in the presence of failures, [l] Over the past decade, real-time systems have emerged in a wide range of application domains including medical technology, automated control and manufacturing, air and land transportation facilities, and banking systems, to name a few. Real-time systems have taken many forms, ranging from simple instrumentation or monitoring devices to complex computerized control systems such as those encountered in nuclear power plants. The underlying characteristic of such diverse systems is the requirement that services not only be performed correctly, but t h a t they be delivered within a given time. In hard real-time systems, a late answer is a wrong answer [2]. In addition to complying with stringent timing requirements, real-time systems are usually required to meet demanding dependability specifications. T h e need to satisfy simultaneously the criteria of real-time performance and high dependability exacerbates the already difficult task of designing general-purpose fault tolerant systems. T h e s t a n d a r d approach to achieve fault tolerance is redundancy. Fundamentally, redundancy can only be exploited in either the time domain or the space domain. In a general sense, redundancy in space a t t e m p t s to meet a desired level of fault tolerance by increasing the number of hardware resources, whereas redundancy in time strives to make an efficient use of time on a given set of hardware resources in order to provide protection against faults. In a more concrete sense, three types of redundancy are commonly identified: hardware redundancy, software redundancy and time redundancy. Hardware redundancy refers to the introduction of additional hardware components which can serve as replacements for components manifesting faulty behavior. Similarly, .software redundancy consists of introducing extra software components which are identical or functionally equivalent in performing a required operation. Allocating e x t r a time to retry a failed operation, to execute a redundant piece of software or to perform any time-consuming action for the tolerance of faults is referred to as tiine redundancy. Note that hardware redundancy is a manifestation of redundancy in the space domain whereas redundant software can be bound to either the time or the space domain, depending on how the software components are m a p p e d onto the architecture. While redundancy provides the basic foundation for designing a reliable system, management policies for handling redundant assets, be they software, hardware, or time, play a critical role in determining a system's response to faulty conditions. Hence, it is important t h a t management policies are effective in ensuring
267 t h a t the system has a high probability of successful recovery from failures. In addition, it is desirable that these policies are not too costly to implement and t h a t any vulnerabilities they may introduce into the system are minimized. To meet these challenges, it seems inevitable t h a t real-time systems evolve to a point in which hardware, software and time redundancy methods can be carefully coordinated through efficient management schemes. Previous research on fault tolerance has focused primarily on the exploitation of hardware and software redundancy techniques. Redundant software is usually m a p p e d onto redundant processors, with the intent of obviating the use of time redundancy. As a result, very little work has been done on time redundancy m e t h o d s . Meanwhile, research also indicates that the conventional approaches to fault tolerance have certain pitfalls which may only be overcome through the use of time redundancy. This chapter addresses this need by exploring time redundancy mechanisms for enhancing the fault tolerance of real-time systems. In particular, our work pursues two objectives: first, to introduce formal methods for allocating redundant time for fault recovery; and second, to provide insight into the effectiveness of various time allocation mechanisms for maximizing the ability of real-time systems to recover from faulty conditions. T h e chapter is organized cis follows. The remainder of this section gives further insight into the potential and challenges of exploiting time redundancy for fault tolerance, discusses related research, and presents the motivation for this work. Section 2 provides introductory background material and discusses the assumptions of the work presented in this chapter. Section 3 introduces two algorithms for scheduling the redundant execution of faulty tasks by using time reserved for recovery before run-time. These algorithms are referred to as the Private Reservation Algonthm and the Communal Reservation Algorithm. Section 4 introduces a novel algorithm for the dynamic or on-line allocation of time for recovery, referred to as the Myopic Stack Management Algorithm. In Section 5, the general allocation algorithms developed in the previous sections are subject to a series of simulation experiments in order to measure their ability to schedule recovery operations while avoiding timing failures. Finally, Section 6 provides a summary and conclusions.
4.3.1.1
The Role of Time Redundancy
We are convinced that a certain amount of temporal displacement between
268 redundant computations detection coverage. [3]
is required in order to realize a very high error
Kopetz's words allude to the importance of considering time in the process of allocating redundant software operations to the processing resources. This observation challenges conventional approaches to system-level error detection based solely on the comparison of results generated concurrently by redundant processors executing redundant software tasks. Such conventional approaches yield systems in which real-time tasks are highly correlated in time, making it difficult to detect and locate transient faults. To circumvent this problem, Kopetz proposes the use of time redundancy for error detection in the MARS (Maintainable Real-Time System) architecture. MARS allocates redundant time to allow the two-fold execution of critical software tasks and the comparison of their results for error detection. Estimates indicate t h a t a significant reliability improvement (over an order of magnitude, in some cases) can be obtained by such an exploitation of time redundancy. T h e MARS architecture illustrates one promising way in which time redundancy can enhance the fault tolerance of real-time systems, by reducing the system's susceptibility to transient hardware faults. In general, time redundancy can: (a) provide a means to enhance a system's tolerance of both hardware and software faults by complementing existing software and hardware redundancy methods; and (b) offer more cost-effective solutions to the design of some fault tolerant systems by trading off time on processing resources for massive hardware redundancy. We now take a closer look at these areas and provide some insight into the potential of time redundancy methods.
4.3.1.1.1
E n h a n c i n g T o l e r a n c e of H a r d w a r e F a u l t s
Protection against [correlated transient failures] requires time redundancy, i.e., the presence in a schedule of sufficient slack time to re-execute any critical computations disrupted by transient failures and still meet hard deadlines. [A] T h e conventional approach to designing real-time architectures with high dependability requirements is to massively allocate redundant hardware resources, to which redundant software components are assigned. Software components are usually assigned to the processing resources and scheduled so that similar operations are executed concurrently, in a synchronized feishion. T h e results obtained in each processor are then compared every so often, depending on the granularity of the synchronization barrier, which can be at an instruction level (as in F T M P [5]), or at a software module level (as in SIFT [6, 7]). When
269 results are compared, some form of consensus is reached among the processors and faulty results are masked out, sometimes allowing the detection of the faulty processor and triggering a recovery action. This conventional approach is attractive in t h a t it can protect the system against individually occurring hardware failures, whether transient or permanent. Further, it yields architectures which negligibly affect the temporal behavior of real-time application software. Unfortunately, these conventional architectures are susceptible to correlated failures, a generic term which usually refers to hardware failures exhibiting proximity in time. Two types of correlated failures commonly referred to in the literature are simultaneous and near-coincident faults. Simultaneous faults are possible due to environmental disturbances such as electromagnetic interference, temporary blockouts such as power outages, or t e m p e r a t u r e fluctuations. Upon detection of simultaneous faults which cannot be masked out by the redundant hardware and software components, most systems invoke safe shutdown procedures. Near-coincident faults refer to the situation in which a second fault is detected before the recovery procedures for a prior fault have been completed. T h e occurrence of near-coincident faults can lead to a quick exhaustion of spares and a subsequent shutdown [8]. Research indicates that correlated failures which are transient in nature are much more likely to occur t h a n permanent failures [9, 10, 11]. As a result, some researchers have proposed time redundancy as an alternative to handling correlated failures [4]. Thus, instead of invoking drastic measures such as an architectural reconfiguration consuming spares a n d / o r shutdown procedures, time redundancy may be used to retry the failed software operation. If the correlated failures are transient, it is highly probable that a retry of the operation may be successful. Hence, drastic recovery measures may be averted in situations which would otherwise cause an unnecessary loss of service.
4.3.1.1.2
E n h a n c i n g T o l e r a n c e of S o f t w a r e Faults
Although current practices in designing reliable hardware are considered very robust, the development of reliable software has not experienced a commensurate evolution. To date, software faults have become the dominant source of failures, and projections indicate t h a t the gap between software and hardware reliability will continue to widen [12]. In order to address this problem, better m e t h o d s must be developed to detect and correct errors in the software design stage and to detect and tolerate software faults during system operation.
270 One of the primary difficulties in providing software fault tolerance is that application-dependent knowledge is usually required to effectively detect errors and decide on adequate recovery procedures. T h e need to include applicationdependent knowledge makes detection and recovery at a circuit-level infeasible. As a result, software fault tolerance techniques at the system or application level have become the typical way of providing coverage for software design faults [13]. T h e most popular methods for software fault tolerance are the Recovery Block approach, proposed by Randell in 1975 [14], and N-Version Programming, introduced by Avizienis in 1984 [15]. These approaches are briefly described as follows. T h e Recovery Block (RB) scheme entails the construction of a set of alternative software modules, referred to as blocks, to accomplish a desired operation. Blocks are designed in such a way t h a t they produce the same or similar results, with possible variations in result quality a n d / o r in execution time. T h e most desirable result is generated by what is known as the primary block, and all other options are considered a.lterna.te blocks. Alternate blocks are never executed unless a fault is detected in the primary block. To determine whether a fault occurred in the primary block, an acceptance test is executed. An acceptance test is a short software test which a t t e m p t s to validate the functional correctness of the results generated by the block. It is sometimes called a reasonabiliiy check. Acceptance tests may entail range or plausibility checks on d a t a or the verification that the o u t p u t is in some desired state. For instance, if the function of the primary block is to sort a set of numbers, the acceptance test may consist of verifying that the sorted set is in the desired order. If the primary block fails the acceptance test, alternate blocks are invoked in succession, according to some predetermined order of preference until an alternate block yields an accepted result or all alternate blocks are exhausted. T h e N-Version Programming (N-VP) approach consists of using N, or a multiple number of versions of the application software which are concurrently executed. Each one of the versions is functionally equivalent, but they are designed and developed by different teams. T h e main idea is that software developed by different teams will be less likely to have bugs in the same piece of code, a situation referred to as a common mode failure. Note t h a t common mode failures can be viewed as the software counterpart to hardware correlated failures. To further reduce the probability of common mode failures, the N-versions are sometimes developed in different programming languages, with different development tools. Communications between the development teams are kept to a bare minimum and are carefully monitored to ensure that there are no information leaks which can introduce undesirable biases in the design process. T h e
271 multiple or so-called diverse versions generated for each software component are then executed on different processors and results are compared to reach some form of consensus. Similar to the case of hardware faults, the consensus or voting process a t t e m p t s to mask out any software design faults. The N-VP and RB approaches are similar in spirit in that they both require t h a t diverse or alternate software modules be generated to fulfill a desired operation. However, these approaches differ in a number of ways. The first diff"crence is that a fixed number of redundant software versions in N-VP are usually executed in parallel, whereas a variable number of versions are executed serially in the RB scheme. Thus, N-VP usually relies on redundant hardware while RB relies on redundant time. Second, note t h a t their approach to error detection is different. Error detection in N-VP can only be done through a comparison of results among versions. An error may be detected in a version if its result disagrees with those generated by the other versions (assuming all other versions agree). Therefore, there is no absolute check on the functional or semantic correctness of the results; the check for correctness is relative. This differs from the RB approach in which an explicit test for correctness is executed at the end of each block. One advantage of the RB scheme is t h a t the self-checking capability of each block makes the approach suitable for the use of time redundancy methods. However, the RB scheme has a major drawback: its success hinges on the ability to design adequate acceptance tests for error detection. Acceptance tests are very difficult to generate because they must explicitly contain knowledge about the functional or semantic correctness of the application. N-VP has a different set of drawbacks such as the high cost of developing the redundant versions, the cost of redundant hardware, and the susceptibility to common mode failures. However, the relative simplicity and small overhead of error detection have made N-VP the method of choice in the design of many ultrareliable real-time systems (e.g., Boeing 737 commercial transport aircraft [16]). Another approach to software fault tolerance proposed recently is the Certification Trails tecimique [17]. This approach is similar to the two-fold execution of application tasks and subsequent comparison of results for error detection. To reduce the processing overhead of a two-fold execution, the first time the task is executed it leaves behind a trail of d a t a which allows the second execution of the task to be faster. Hence, the main objective is to obtain error detection coverage through a duplicate and compare strategy but without having to pay the performance penalty of a two-fold execution. (Note that the viability of using this technique in real-time systems relies on the availability of redundant
272 time for the second execution of tasks.) The primary challenge faced by this technique is in developing strategies for generating the trail which effectively address a wide class of applications. So far, the technique has shown promise for algorithms which perform data-structure operations such as those carried out using balanced binary trees and heaps [18], The previous discussion exposed some of the numerous issues associated with enhancing a system's tolerance of software faults. The benefits and disadvantages of the proposed approaches are still a point of much controversy. Thus, it is still unclear under which conditions the current approaches are adequate and which are the situations that can greatly benefit from novel solutions. What is becoming increasingly clear is that software fault tolerance techniques promise to include a tradeoff between hardware and time redundancy. However, in order to consider time redundancy as a viable alternative, efficient mechanisms for the allocation and usage of redundant time must be in place.
4.3.1.1.3
Providing Cost-Effective Fault Tolerance
There are many applications in which space, weight and cost constraints discourage or preclude the use of massive hardware redundancy. These include sophisticated control systems such as those encountered in spacecraft or satellites, where space and weight are a significant concern. It also includes real-time service systems where cost is a major issue, such as multimedia systems and real-time communication networks. In all these systems, there is a need to ensure that timing requirements are met or that a desired quality of service is maintained even in the presence of failures. Of course, all of this must be done without having to resort to an increase in the number of hardware resources. This motivates the development of efficient ways of utilizing available resources and to reason about both application level timing requirements and fault tolerance objectives. The development of efficient lime redtmdancy mechanisms provides the vehicle to accomplish this objective.
4.3.1.2
T h e P r o b l e m of Exploiting Time Redundancy
Perhaps the most challenging issue concerning the exploitation of time redundancy in real-time systems is the development of efficient scheduling policies. These scheduling policies must be able to resolve resource contention conflicts
273 [ Error | I DeieciiuH j f Real-Timei I Workloadi j •^
^ r"~A ( Real-1 iine | I Workload J
f Fault \ { Model J
r ^ ( Rccovcrv | Iworkloadj
Scheduler (a)
(b)
Figure 3.2.1: Factors affecting real-time sclieduling complexity: (a) Traditional Real-Time Sclieduling View; (b) A Fault Tolerant Real-Time Scheduling Perspective among different application tasks requesting access to any shared resource, sucli as a processor. In situations in which the recovery operations for faulty application tasks can also compete for resources, a scheduler must strive to meet two competing objectives, namely, • to ensure that the timing requirements of the real-time application workload are met, and • to ensure that the timing requirements of any recovery operations triggered by the detection of errors are met. This problem challenges the traditional view of hard real-time scheduling, in which the scheduler only needs to cope with a real-time workload that is assumed to be known a-priori, as depicted in Figure 3.2.La. Because hard realtime systems are commonly used in applications requiring regular interactions with an external environment, real-time application tasks are often periodic in nature and can be characterized before run-time. Due to this deterministic and periodic nature, all scheduling decisions can be pre-planned or relatively simple algorithms for on-line contention resolution ca.n be used, so that all timing requirements are ensured to be met. Consequently, the scheduler of hard real-time .systems is usually designed so that the timing requirements of the real-time application workload arc always guaranteed to be met, as long as no failures occur. Unfortunately, the occurrence of faults and the subsequent need for recovery can unexpectedly threaten to destroy the timeliness of a system. Unlike realtime application tasks, recovery tasks do not lend themselves to an a-priort characterization. Recovery operations tend to be unpredictable, bursty and
274 time-critical. Their timing characteristics are influenced by numerous run-time conditions such as error detection latency, the diagnosed fault type, the extent of the damage caused to the application tasks, etc. Hence, real-time scheduling in the presence of faults is a complex function of the real-time workload, the error detection mechanism, the fault model, and the observed recovery workload, as shown in Figure 3.2.1.b. Because of the non-deterministic and time-critical nature of recovery operations, it is i m p o r t a n t t h a t strategies be developed to allocate redundant time for their service, while guaranteeing that the timeliness of the most critical realtime application tcisks is never compromised. The next section presents some solutions that have been proposed to address this problem.
4.3.1.3
Related Work
We first introduce some basic terminology, graphically depicted in Figure 3.2.2. In general, there are only two ways in which redundant time can be allocated: either before run-time or at run-time. Hence, we refer to the mechanisms for allocating time a-prion as reservation or static allocation schemes and those which allocate time on-line as dynamic allocation schemes. Furthermore, dynamic allocation schemes can be passive or aggressive. A dynamic allocation scheme is passive if it a t t e m p t s to allocate time without interfering with the temporal behavior of the real-time workload in any way; that is, if it only a t t e m p t s to schedule recovery operations in a way that will not impact the timing requirements of tasks in the periodic execution stream. An example of a passive dynamic allocation scheme is to service recovery operations only when the resource would otherwise be idle. On the other hand, dynamic allocation schemes are aggressive if they can alter the temporal behavior of the real-time workload; t h a t is, if recovery operations are allowed to displace some periodic tasks. Depending on whether or not the temporal interference causes deadlines of fault-free periodic tasks to be missed, the aggressiveness can be transparent or intrusive. T h e feasibility of any of these allocation approaches is significantly influenced by the underlying scheduling philosophy of the real-time system. Two scheduling philosophies are distinguished. One consists of generating a pre-planned schedule of all activities to be executed at run-time. The function of the scheduler is simply to perform a table lookup. Hence, these are referred to as table-driven
275 (
Time Redundancy
J
I Method of Allocating Time
C
Static J
(
(Before run-time)
DynamicJ (At run-time) I
(
Passive J
(No temporal interference) Recovery tasks run in the 'shadow' of (he periodic execution stream.
(Temporal interference) I Recovery tasks can displace the periodic execution stream.
,
(Transparent j Deadlines of all fault-free periodic tasks are guaranteed.
\
,
( Intrusive j Deadlines of some fault-free periodic tasks are violated.
Figure 3.2.2: Taxonomy of Methods for Allocating Redundant Time systems^. The other philosophy advocates the use of priority mechanisms to resolve contention conflicts among real-time tasks at run-time. These priority mechanisms are formally developed to ensure that the timing requirements for all tasks are met, without the need to store a table explicitly describing the schedule. The function of the scheduler in this case is to resolve resource contentions using priorities. To remain competitive with table-driven systems, the priority mechanisms are designed so that the scheduling overhead is low. Real-time systems embracing this scheduling philosophy are referred to as prtority driven systems. Scheduling mechanisms based on priorities are sometimes referred to as algorithmic scheduling in the literature, while those based on a table-lookup are referred to as timeline scheduling. Table-driven systems are attractive in that the scheduling overhead is negligible. However, the memory overhead for storing the table may be very large. In addition, the process of generating the table usually requires the use of heuristics and handcrafting to ensure the timing correctness of all tasks. Due to these shortcomings, these tables are hard to modify during the design stage and offer Table-driven systems are sometimes referred to as Time-driven systems in the literature and the lookup table is often called a cyclic executive.
276 little to no flexibility at run-time [19]. Priority-driven systems eliminate handcrafting by providing analytical techniques for the evaluation of timing correctness. As a result, it is much ecisier to evaluate the impact of any modifications to the timing characteristics of the workload. This makes priority-driven methods attractive for the design of real-time systems which must exhibit some degree of run-time adaptability. The main drawback of priority-driven systems is that the analytical techniques for timing verification cannot yet handle all possible classes of real-time workloads, that is, real-time workloads of arbitrary timing complexity. In addition, priority-driven scheduling, unlike table-driven scheduling, may not allow the processing capacity to be fully utilized under some circumstances. Another concern is that the scheduling overhead is higher than that of table-driven systems. Thus, care must be exercised in ensuring that the scheduling overhead is kept to a minimum. The majority of real-time systems developed to date are table-driven. Examples include the MARS system [20], the MARUTI system [21], the Spring project [22], FTMP [23], and many traditional in-flight control applications. Prioritydriven systems have started to flourish recently. Although the number of actual priority-driven real-time systems is relatively small, it is increasing. The NASA Space Station Freedom, for instance, has recently embraced a priority-driven scheduling paradigm [24]. The development of more encompassing analytical techniques for the evaluation of timing correctness and the run-time adaptability of priority-driven systems makes them prime candidates for coping with environments subject to increasing complexity and change. Consequently, prioritydriven systems, notwithstanding their shortcomings, seem to offer a greater potential for achieving robust solutions to the scheduling challenges posed by the integration of real-time and fault tolerance objectives. Very little work has been done on the development of time redundancy mechanisms for priority-driven systems. On the other hand, a variety of approaches for allocating redundant time in table-driven systems have been proposed. For an overview of the work on table-driven systems see [25]. A brief overview of the time redundancy methods proposed for priority-driven systems is presented next.
277 4.3.1.3.1
T i m e R e d u n d a n c y in P r i o r i t y - D r i v e n S y s t e m s
There are two general classes of priority-driven systems, namely, fixed priority and dynamic priority systems. In fixed "priority systems, real-time tcisks are assigned a priority before run-time which remains invariant during system operation. In dynamic priority systems, the priority of each real-time task is assigned at run-time and as such, is subject to change. T h e few research efforts to address the time redundancy problem for priority-driven systems have focused exclusively on dynamic priority mechanisms. The most salient contributions to such research efforts have been recently proposed by Ghetto and Ghetto [26], and followed by Schvvan and Zhou [27]. Ghetto and Ghetto proposed a dynamic allocation algorithm for real-time systems in which the periodic real-time tasks are scheduled according to the Earliest Deadline scheduling algorithm [28]. Their algorithm a t t e m p t s to maximize the a m o u n t of time that can be dedicated to the service of a time-critical aperiodic task (e.g., a recovery operation), between the time of the task's arrival until it's deadline. A major shortcoming of Ghetto and Ghetto's algorithm is that they assume t h a t no aperiodic request ever arrives before the completion of any prior pending requests. Moreover, the scheduling and memory overhead of their algorithm is large. Schwan and Zhou circumvent these shortcomings by developing an elegant scheduling algorithm which uses efficient data structures to allocate time for servicing a time-critical operation. The complexity of this algorithm is 0 ( n log n), where n is the number of ta.sks comprising the real-time workload.
4.3.1.4
Motivation and S u m m a r y
The time redundancy mechanisms proposed in the literature for table-driven and priority-driven systems are a promising sign of progress. However, it is clear t h a t we are far from being able to count on having robust solutions, with well-characterized performance potential and strong engineering appeal. T h e main thrust of the work presented in this chapter is to enhance the scope of proposed time redundancy schemes to address the particular needs of fixedpriority real-time systems. In the context of these systems, we propose a comprehensive set of solutions which include algorithms for both the reservation and dynamic allocation of redundant time for recovery. These algorithms are developed on a sound scheduling-theoretic foundation which ensures that the
278 timing behavior of the real-time system is analyzable and predictable before, during, and after recovery. Our dynamic allocation algorithms are aggressive and transparent. Hence, we a t t e m p t to maximize the aggressiveness of the scheduler in allocating time for recovery while still guaranteeing that the deadlines of all fault-free periodic tasks are met. Although we do not explicitly investigate dynamic allocation methods t h a t are intrusive, care is taken to ensure t h a t the proposed dynamic algorithms can be easily extended to support intrusive allocation strategies. Our reservation algorithms represent the first a t t e m p t to formalize some of the heuristics used in allocating time redundancy in table-driven systems and applying such ideas in the context of priority-driven systems. In a similar vein, our dynamic allocation schemes represent the first effort to introduce policies for the on-line allocation of time redundancy akin to those proposed for dynamic priority systems. Another goal of this work is to evaluate the performance of our proposed algorithms relative to their potential for responsively servicing recovery operations. To fulfill this goal, we conducted a series of fault injection experiments which prompted the scheduling software to be exercised under a wide range of faulty conditions. T h e performance of our algorithms is compared by measuring the recovery coverage, an indication of the algorithm's effectiveness in meeting the timing requirements of the recovery workload. F'ven though the performance evaluation studies are conducted for a restricted class of recovery workloads on fixed-priority real-time systems, the evaluation methodology is general and can be applied to broad set of time-consuming recovery problems. In particular, we intend our tripartite emphasis on analytical modeling, performance evaluation, and impJementation cost assessment to serve as a model for evaluating the effectiveness of other time redundancy mechanisms. It is our belief t h a t one of the greatest weaknesses of the time redundancy mechanisms proposed in the literature is their inability to bridge the gap between the problem abstraction level and the implementation level. The work presented here is an initial step to overcome this shortcoming.
279
4.3.2
Background and Assumptions
Consider a real-time system with n periodic tasks, TI, ... , T „ . Each task, r,, has a worst-case computation requirement Ci, a period Ti, an initiation time
0 or offset relative to some time origin, and a deadline Di, assumed to satisfy Di < Ti. T h e parameters d, Ti, (pi, and D,- are known deterministic quantities. We require t h a t these tasks be scheduled according to a fixed priority algorithm, such as the deadline monotonic algorithm, in which tasks with small values of Di are given relatively high priority [29]. We assume t h a t the periodic tasks are indexed in priority order with ri having highest priority and r„ having lowest priority. For simplicity, we refer to those levels as 1, ...,n with 1 indicating highest priority and n the lowest. A periodic task, say r,;, gives rise to an infinite sequence of jobs. T h e A;"" such job is ready at time (p, + {k — 1)T, and its C, units of required execution must be completed by time (f>i+{k — l)Ti+Di or else a timing fault will occur. A task set in which all job deadlines arc guaranteed to be met is said to be schedulable. Liu and Layland [28] proved t h a t a task r; is guaranteed to be schedulable if the deadline for its first job is met when it is initiated at the same time as all higher priority tasks, i.e., (j)k — 0, for k = 1 , . . . , « . This is because the time between the arrival of a task's job and the completion of its service, referred to as its response time, is maximized when it arrives at the same instant at which all tasks of equal and higher priority arrive. The phasing scenario in which the initiation times for all tasks are equal is known as the critical instant, which is the worst-case phasing. It follows that a workload is schedulable under a fixed-priority assignment if the deadline for the first job of every task starting at a critical instant is met. Liu and Layland also developed a sufficient test for the schedulability of a task set in which task deadlines are equal to the periods, D, = Tj. They proved that if the workload utilization is less than 69%, a fixed-priority assignment exists for the tasks which guarantees t h a t the tcisk set is schedulable. Assigning the priorities to the tasks according to the Rate Monotonic (RM) algorithm was shown to be optimal in t h a t no other fixed-priority assignment can guarantee the schedulability of a task set which cannot be scheduled with a RM priority assignment. RM scheduling assigns priorities to tasks according to their periods, such t h a t Ti has higher priority than TJ \{ Ti < Tj. Ties are resolved arbitrarily. Lehoczky, et.ai, extended these results by deriving a necessary and sufficient schedulability criterion for fixed-priority workloads under critical instant phas-
280 ing. This criterion dictates that a task r,- is schedulable iff min^o
< 1,
(1)
where Wi{t) is the cumulative woric that has arrived from priority levels 1 to i in the time interval [0,/] under critical instant pheising and is computed as i
Writ) = Yl Cr\t/Tj]-
(2)
Intuitively, task r; is schedulable if Wi{t)/t < 1, because there exists some time t before the task's deadline D, for which the elapsed time t is at least as great as the time required to complete all the work that has arrived, including the d units for r,. It follows that the entire task set is schedulable if the maximum value of Wi{t)/t over the minima computed for each task r^, i = 1,. . ., n, is also less than or equal to one, as indicated by max{i
< 1.
(3)
This analytical framework allows us to evaluate the schedulability of a task set assuming that no processing time is reserved for recovery. The following sections will extend this framework to explore and evaluate the timing impact caused to a real-time workload under two reservation and one dynamic allocation algorithm, which will also be introduced in the chapter. Our allocation algorithms are developed under the following assumptions: • Al: All operating system overheads such as context switching, preemption, etc., are assumed to be zero and any task can be instantly preempted. • A 2: Tasks are ready at the start of their period and do not suspend themselves or synchronize with any other task. • A3: The recovery of any faulty periodic task r; involves retrying the faulty task or executing an alternate tcisk. In either case the retry has a known worst-case execution time Crec,- (To simplify our analysis, we further assume that the retry execution time is proportional to the primary execution time of the task, i.e., Crec, — k x Ci, where k is a. constant recomputation scale factor greater than zero.)
281 • A4: Each recovery task must be completed by the deadline of the associated periodic task found to be faulty.
4.3.3
Static Allocation Strategies
In this section, we introduce two algorithms for static allocation in fixed-priority systems: the Private Reservation Algorithm and the Communal Reservation Algorithm. These algorithms represent an attempt to formally embody the fundamental strategies possible in statically allocating time for recovery in realtime systems.
4.3.3.1
Time Partitioning Approaches
The reservation of processing time for recovery requires that two basic design decisions be made. First, one must decide whether the reserved time should be partitioned in a shared or in a dedicated manner. A dedicated partitioning approach means that when time is reserved, it is statically bound to the recovery of individual real-time tasks. On the other hand, a shared partitioning approach reserves a pool of recovery time which is dynamically allocated to failed tasks on a contention basis. Clearly, hybrid partitioning schemes are also possible. Second, one must decide whether the access privileges conferred to each application task for the allocation of recovery time will be fair or unfair. Fairness implies that all tasks will have an equal share, in principle, of the reserved recovery time. (Note that the issue of fairness applies to both shared and dedicated partitioning strategies.) On the contrary, the allocation is unfair if recovery time is explicitly biased in favor of some tasks, at the expense of others. Hence, unfair allocation policies are those which tend to discriminate between tasks by favoring the timely recovery of some tasks over others. The issues of time partitioning and allocation fairness are fundamental to the design of any reservation strategy. In this chapter, we are interested in exploring both shared and dedicated time partitioning strategies for reservation. However, we intend to focus only on fair allocation schemes. Note that embracing a fair allocation policy is not a limiting assumption in that it does not compromise the generality of this work. The same underlying methodology proposed for fair allocation can be used to
282 support unfair allocation schemes. Investigating unfair allocation schemes would require the use of some discriminatory information about the tasks. For example, if one would have information on the expected failure characteristics of individual tasks, then tasks t h a t are more susceptible to failures may warrant a larger share of the reserved time. Likewise, information about the semantic importance or relative criticalness of the tasks can be used to favor the recovery of those tasks which are more imp o r t a n t to system survival. We limit our discussion to the case in which all tasks have equal failure probability and equal criticalness. Even when a fair allocation scheme is desired, there are certain situations where it is unattainable. For example, when the utilization of the real-time workload is fairly high, the application tasks have little slack so it may not be possible to reserve equal time for the recovery of all tasks. Since there may not be enough processing time to provide equal recovery opportunities for all tasks, some tasks may have to be given an unfair advantage over others. As a result, some discrimination among tasks is unavoidable in this situation, as will be shown later. T h e next sections introduce two reservation algorithms for fixed-priority systems. One algorithm adheres to a dedicated time partitioning approach while the other algorithm follows a shared partitioning approach. Hence, the former is called the Private Reservation Algorithm, while the latter is referred to as the Communal Reservation Algorithm.
4.3.3.2
Private Reservation Algorithm
T h e Private Reservation Algorithm ( P R A ) a t t e m p t s to reserve time for the specific recovery of each task TJ, for « = 1 , . . . , n. The objective of this algorithm is to guarantee t h a t each job for task r; can be retried xi times without violating the timing correctness of any other task. The number of times each task can be retried, x';, is referred to as the number of retry tickets. Because the reserved time is dedicated to each task so t h a t there is no on-line contention for recovery time, it is said t h a t x',- retry operations are unconditionally guaranteed lor each task. Since we are limiting our attention to fair allocation strategies, it is desired t h a t all tasks be given the same number of retry tickets, i.e., Xi — X2 = ... = XnIf this is the case, it is said t h a t all tasks have equal recovery privileges. As
283 stated earlier, when the utilization of the real-time workload is high, it may be impossible to grant equal recovery privileges to all tasks. In this case, a mechanism for discriminately conferring retry tickets to tasks is needed. For a detailed treatment of this latter case see [25]. To guarantee that each job for task Ti can be retried x, times without violating the timing correctness of any other task is equivalent to saying that the processing time to be devoted to r; is given by C, + Xi X Crec, = Ci{l
+ XiX k}.
(4)
Recall that k is a constant scale factor greater than zero denoting the portion of the primary processing time for each task, C,, required to retry the task. This implies that the schedulability test for task r, must now consider not only the processing requirements of all fault-free higher priority tasks, as in Equation (2), but also the time it would take to service all their retry operations, should they fail. These observations lead to the following theorem. Theorem 1 A task Ti m a real-time workload in which any failed jobs can unconditionally be retried Xi times is schedulable iff max{i
(5)
t
Wl{t) = ^ ( 1
+xjxk)Cr\t/T,].
(6)
Proof: According to Liu and Layland's results, the response time of r; is maximized when it arrives at the same instant in which all fault-free tasks of higher priority arrive. If any higher priority task TJ fails, we wish to unconditionally guarantee that it can be retried up to Xj times without violating the schedulability of Ti. Hence, if TJ fails sometime prior to r^'s completion, the response-time of Ti will be further increased by the time required to service the Xj retries for TJ , which amounts to (xj x k x Cj) time units. Hence, the worst-case response time of Ti, under faulty operation, occurs when all tasks of higher and equal priority, including TJ, fail and exhaust their maximum number of retry tickets.• Equation (6) gives us a schedulability test for a task set with some guaranteed recovery properties. Hence, given the timing requirements of a task set and some desirable recovery properties, this test tells us if they can simultaneously
284 be satisfied. It would also be useful to answer the reverse question: W h a t are the constraints on a task set for which some desired recovery properties can be guaranteed? This question is answered in the ensuing discussion. T h e breakdown utilization of a task set is the utilization at which the real-time application tasks fully utilize the processor [28]; that is, the task set is schedulable but any proportional increase in the computation times of all tasks will make it unschedulable. The breakdown utilization of a tjisk set gives us an upper bound on the processing capacity attainable if no processing time is reserved for recovery. Lehoczky, e l a/[30] introduced a closed-form expression for computing the breakdown utilization of a task set. The breakdown utilization UBD of a task set with utilization Uw — Yll=i\^i/'^i' i^ given by
UBD
= Uw
X A*
(7)
A* = [ m a i - { i < K „ } min|o<«
]"'
Reserving time for recovery will obviously reduce the processing capacity available to the application workload. Hence, if some processing time is reserved, the breakdown utilization is unattainable under fault-free operation. As a result, we need to define upper utilization bounds which demarcate the m a x i m u m utilization attainable by a fault-free application workload after a portion of the processing capacity is reserved for recovery. These bounds are referred to as fault-free utilization bounds. We now compute the fault-free utilization bounds given by the P R A for the case in which all tasks have an equal number of retry tickets. Note t h a t for this case, Wlit) is further simplified to i
Wl[t) = (\ + xxk)Y,
Cr\tlTi}.
(8)
j=i
Similar to the computation of the breakdown utilization given in Equation (7), we can compute the fault-free utilization bound for a given number of retry tickets X, or Ug^ as
U'BD
= UW y- [maar{i<,<„} mm{o
(9)
285 Task T\ T2
T-3
n
Period 4 6 10 14
Exec, time 0.25 0.50 1.50 1.25
Deadline 4 6 10 14
Utilization 6.25% 8.33% 15.00% 8.92%
Table 1: Timing Requirements for Example
X
U'BD
0 0.856
1
2
0.428
0.285
3 0.214
4 0.171
5 0.143
Table 2: Fault-Free Utilization Bounds for Various Numbers of Retry Ticlcets
Hence, for a given recomputation scale factor k and number of retry tickets a;, U'gjj represents a corresponding fault-free utilization bound. Note t h a t for the special case in which tasks have no retry tickets, x is equal to 0 and no time is reserved for recovery. Hence, the fault-free utilization bound is equal to the task set's breakdown utilization, as expected. We now illustrate the framework presented so far for the P R A with an example. Consider a set of 4 tasks, r i , r 2 , r 3 , and 7-4, with timing requirements shown in Table 1. This task set has a utilization Uw of 38.5% which is below Liu and Layland's least upper utilization bound of 69% so it is schedulable. T h e breakdown utilization of this task set is found to be 85.6%, using Equation (7). If we assume that each task retry is equivalent to re-executing the entire primary task, then Crec, = Ci and ^ = 1. Using Equation (10) we compute the fault-free utilization bounds for different numbers of retry tickets (assuming all tasks have an equal number of tickets). As shown in Table 2, each value of U'gjj is simply obtained as 0.856 U'BD
=
( l + x
(11)
Given t h a t the utilization of the task set is known, we can now look at Table 2 to determine how many retry tickets can be given to the tasks without violating schedulability. One retry ticket may be given to each task as long cis the
286 utilization does not exceed 42.8%. Moreover, to reserve two retry tickets per task the utilization cannot exceed 28.5%. Hence, for a workload utilization of 38.5%, we find that a maximum of one retry ticket can be given to each task. Now let us imagine that this same task set is being executed on a processor that runs at | the speed, so that the execution times for all tasks are longer. The utilization of the task set then becomes 51.3%. Consulting Table 2 we find that no retry tickets can be given to all the tasks, because the workload utilization exceeds the fault-free utilization bound for the reservation of one retry ticket per task, that is, Ug^ — 42.8%. Hence, it is impossible to guarantee equal recovery privileges for all tasks. However, it may still be possible to reserve some retry tickets for one or more tasks. At this point, we are inevitably faced with the decision of choosing which of the tasks will have dedicated recovery time. Consequently, some strategy for discriminating among the tasks is needed. Heuristics for this discrimination are presented in [25].
4.3.3.3
C o m m u n a l Reservation Algorithm
The Communal Reservation Algorithm (CRA) attempts to reserve a shared pool of recovery time which is dynamically allocated to failed tasks on a contention basis. The objective of this algorithm is to guarantee that a total of X retry operations may be serviced within a given time interval [tai't], with no specific number of retries guaranteed for any particular task. Stated otherwise, the cumulative number of retry operations serviced for arbitrary tasks in any interval of time [ta, ti,] must not exceed X. Hence, X denotes the total number of retry tickets that may be assigned to arbitrary tasks. Since retry tickets are now shared, there is no notion of retry tickets per tcisk, x,, as in the Private Reservation Algorithm (PRA). Therefore, it is said that retry operations are conditionally guaranteed for each task, depending on the number of shared retry tickets available. Three things are needed to fully specify the CRA, namely, the interval of time [ta,tb], the number of retry tickets X which can be assigned over the interval, and the contention resolution policy for the shared tickets. We choose an interval of time equal to the duration of the largest task period, T^, because this gives us a clear way of evaluating the effect of altering the number of retry tickets on the workload schedulability. Having chosen the duration of the interval, we can analytically compute the maximum number of retry tickets which can be assigned subject to guaranteed schedulability. Finally, the contention resolution
287 policy determines how tickets should be given to faulty tasks as they request to be retried. We assume retry tickets are assigned on a first-come first-serve (FCFS) basis. As in the PRA, we focus on a fair allocation strategy. Thus, it is desired that, in principle, all tasks be given the same opportunity to contend for the shared retry tickets. This means that none of the tasks should be prevented from contending for tickets in the shared pool. All tasks should have equal contention privileges. However, we observe that when the workload utilization is high, the contention privileges for some tasks may have to be withdrawn in order to maintain schedulability. As with the PRA, only the case of equal contention privileges is summarized here. For a detailed description see [25. 31]. In order to guarantee that X retry operations can be serviced without a deadline violation, we must make sure that all the tasks can withstand the worst-case timing impact of such a recovery workload. This means that each task T, must be able to tolerate the largest delay which can be caused by the service of A" retry operations. Recalling that Wi{t) is defined in Equation (2) and k is a recomputation scale factor greater than zero, we present the following theorem. Theorem 2 A task r,- in a real-iime workload tn which any failed jobs can condtttonally be reined X times is schedulable iff
max{i
(Wi(t) i —
h
X xk xmaxWi
j^
\j =l,...,i})
, „, •— > < 1 (12)
Proof: Assume that there exists a set of A' retry operations which when serviced, cause task r; to miss its deadline. Let us refer to the cumulative processing time of these X retry operations as Wr- If r,- misses its deadline, it must be the case that there is no scheduling point t < Di for which the cumulative fault-free periodic work, Wi{t) and the recovery workload Wr can be serviced. Now let us subject the schedulability of r; to the constraint that it can withstand a delay equal to the largest value of Wr possible and still find a satisfying solution in its set of scheduling points. Having met this constraint, it is clearly impossible that r, misses its deadline.• Equation (12) gives us a schedulability test for a task set which allocates a maximum of X retry tickets under the CRA. We now wish to compute the fault-free utilization bounds associated with this algorithm.
288
X U'B[J
0 0.856
1 0.654
2 0.553
3 0.479
4 0.416
5 0.358
Table 3: Fault-Free Utilization Bounds for Various Numbers of Retry Tickets
Recall that the breakdown utilization for a task set can be computed by multiplying the workload utilization Uw by a scale factor A*, shown in Equation (7). Similarly, the fault-free utilization bound for a given number of retry tickets A' can be computed by multiplying the workload utilization by a scale factor A^given by Wi{t) maX{i
X X k X 7nax{Cj
+
\j=l,...,i}
-1
(13)
Hence, a fault-free utilization bound is equivalent to
m.BD
i'w X A^-.
(14)
Note t h a t the special case in which there are no retry tickets A' = 0 and the fault-free utilization bound is equal to the task set's breakdown utilization. Let us revisit the task set example examined under the P R A and described in Table 1. Using Equations (13) and (14), we compute the fault-free utilization bounds for values of A' ranging from 0 to 5. Results are shown in Table 3. Note that these bounds are higher than the comparable values computed for the P R A as depicted in Table 2. This illustrates that the cost of reserving recovery time is higher if the partitioning is dedicated rather than shared. However, when the recovery time is dedicated, the retry operations for any given task are unconditionally guaranteed because there is no contention for tickets. With a utilization of 38.5%, 4 shared retry tickets can be reserved for the task set (A = 4). Even if the workload utilization is increased to 51.3%, 2 retry tickets can still be reserved. Now consider that the task set in Table 1 is executed on a processor t h a t is half as fast, so that the utilization is doubled to 77%. At this utilization we can no longer grant all tasks equal contention privileges because it exceeds the fault-free utilization bound for A' = 1, or 65.4%. Once again we are faced with the need to provide discriminatory treatment to some tasks in
289 order to guarantee that schedulability is maintained when recovery operations are being serviced. In summary, two algorithms for the reservation of time for recovery in fixedpriority real-time systems have been introduced. The Private Reservation Algorithm ( P R A ) reserves time which is statically bound to the recovery of individual real-time tasks while the Communal Reservation Algorithm (CRA) reserves a pool of recovery time which is dynamically allocated to failed tasks on a contention basis. The expected performance of these algorithms against the dynamic allocation algorithm presented in the next section will be evaluated in Section 5.
4.3.4
Dynamic Allocation Strategies
In this section we investigate the dynamic allocation of redundant time for recovery. In this approach there is no a-priori notion of processing time reserved for the service of fault recovery requests. Rather, the scheduler allocates time to service recovery operations as they arrive at run-time. As a result, the execution time of a recovery operation need not be known until the time at which the request is issued. This contrasts with static allocation methods, which require that a specific characterization of the recovery workload be known before runtime. When a recovery request is triggered, dynamic allocation methods must find time to service the request prior to its deadline. Therefore, recovery requests behave like hard-deadline aperiodic tasks. When a hard aperiodic task arrives in a system, it is said t h a t an acceptance test is executed to determine whether or not the scheduler can meet its deadline. To do so, dynamic allocation methods compute on-line the total slack available for aperiodic processing prior to the aperiodic deadline. If the slack available is at least as large as the execution time for the aperiodic, then it is accepted for service. Otherwise, it is rejected and denied service by the system. Our work on scheduling hard aperiodic tasks leverages heavily off of the Slack Stealing Algorithm [32], which dynamically allocates time for the service of softdeadline aperiodic tasks. The Slack Stealing algorithm was shown to be optimal in the sense t h a t it provides the largest amount of high priority execution time for soft aperiodic processing subject to meeting all the deadlines for the faultfree periodic tasks. Its performance is compared to state-of-the-art methods
290 for soft aperiodic service, resulting in significant performance improvements in certain circumstances. Although the performance of the Slack Stealing algorithm is impressive compared with current methods, its implementation overhead may be high. In addition, modifications to the algorithm are needed to service hard-deadline aperiodic tasks, which more adequately represent the nature of fault recovery operations. To solve these problems, we introduce the Myopic Slack Management Algorithm, or the Myopic Slack Manager. As its name suggests, the Myopic Slack Manager is an approximate algorithm, which strives to achieve a performance comparable to the slack stealing algorithm while reducing its implementation overhead.
4.3.4.1
T h e Myopic P r o p e r t y
T h e slack stealing algorithm proposed for hard aperiodic scheduling in [33] is capable of computing the m a x i m u m slack available for aperiodic processing in an arbitrary interval of time [
291
,'!j:>:iiMMf:>::iinMt7si,,,M|7Si,J!t^s]|J!) 5
10
1
20
25
30
35
40
45
50
55
50
55
60
65
70
I
\
^i,Js^,,,,,|,,^,,,F?T^,,i:l,f>:>:ii 5
I
5
10
10
1J
I
1t
20
25
30
35
40
45
^,,,n,,M( 60
65
70
I
4.^^ 20
25
I " " I " I ' I" ' , ^ , , , ^ 30 35 40 45 50 55
, .,-p-n-i-^^ 60
65
70 •
t=3
t=15
Figure 3.2.3: Illustrating slack estimation intervals for times t = 3 and < = 15 ables. These worst-case slack values ensure that any slack stolen for aperiodic processing will never violate the timing correctness of any of the periodic tasks. However, worst-case slack values are conservative and will yield fewer opportunities for aperiodic processing than are actually available. As a result, we attempt to obtain less conservative slack estimates on-line by improving upon these worst-case estimates. In order to make slack estimation feasible, it is important to keep its processing overhead low. We do so by limiting the interval of time into the future over which slack is computed for each task, referred to as the slack esiimatton interval. In particular, we limit the estimation of slack for each task r, to an interval of time with an upper bound equal to a future time ti corresponding to Ti's nearest unsatisfied deadline. For example, using the task set of Table 4, assume that we need to estimate the slack available for each teisk in an interval of time [^o, tb) where ta = 3. Figure 3.2.3 shows that at time i = 3 the first jobs of all tasks are still active, so the nearest unsatisfied deadline for each task is the deadline of its first job. Hence, tf, = 10 for ri, <;, = 14 for r2, and tt, = 70 for TQ. Similarly, if <„ = 15, the nearest unsatisfied deadline for n is 30 since ri2's deadline is satisfied at time 14 and the nearest unsatisfied deadlines for T2 and T3 are at time 28 and 70, respectively. By constraining the slack estimation intervals in this manner, any slack which appears beyond the nearest unsatisfied deadline for each task is not accounted for at the time of the initiation of the interval, or ta- As a result, the dynamic allocation strategy is nearsighted in its ability to accumulate slack into the future. This is referred to as the myopic property.
292 Task
n ''2 73
Period 10 14 70
Exec, time 4 5 2
Deadline 10 14 70
Table 4; Example Timing Requirements
D.next ^2= '^a ^ i~ '^curr ' curr Figure 3.2.4: Clarifying Concept of A Slack Estimation Interval Being myopic does not impose any performance limitations on dynamic allocation for the service of task retry operations which must be completed by the end of the task's period. Any slack made available beyond the retry's deadline cannot be used to service the retry anyway, so it is of no use to accumulate slack into any future time beyond the task's period. This may not be the case, however, for the general case of scheduling arbitrary hard aperiodic tasks which may have deadlines that are much greater than the task's periods. We reduce the scheduling overhead of searching through all the priority levels by limiting the search to a maximum of two priority levels. When a periodic task fails and issues a retry request, the two priority levels considered for servicing such a recovery operation are the priority level of the task that failed and the deadline monotonic priority level. This restriction reduces the worst-case search overhead to 0{1). In the next section, we introduce the Myopic Slack Management (MSM) algorithm, a dynamic allocation algorithm which combines the aforementioned techniques for reducing the overhead of hard aperiodic scheduling.
293 4.3.4.2
T h e Myopic Slack Management Algorithm
Let us first define a level-? busy period as a time interval during which the processor is busy with tasks of priority i or higher. Alternatively, a level-?' inactivity period is a time interval during which the processor does not execute tasks of priority i or higher. T h e MSM algorithm bases all its aperiodic scheduling decisions on run-time slack estimates for all the tasks. We define a slack estimate as follows: A slack estimate, Si, is a prediction of the total amount of level-?' inactivity expected to occur within an interval of time [ta, tt,), where ta is the current time and ti, is some future time (i.e., ti, > ta)As its definition implies, every slack estimate has: a magnitude (amount of level-? inactivity) and an interval of applicability (the slack estimation interval). In this section we explain how slack estimates are initialized and updated at run-time, then we define how the MSM algorithm uses these slack estimates to perform an aperiodic acceptance test.
4.3.4.2.1
I n i t i a l i z i n g a n d U p d a t i n g R u n - t i m e Slack E s t i m a t e s
The slack estimation interval cissociated with an estimate always has a lower bound ta equal to the current time. As stated earlier, we constrain the upper bound tb of the interval for r; to be a future time which corresponds to TV'S nearest unsatisfied deadline, as shown in Figure 3.2.3. Suppose the deadline for the last j o b of r, to arrive prior to ta is Dcurr and the deadline for the next job of Ti to arrive after ta is Dnext, as illustrated in Figure 3.2.4. For times ta where ti < ta < ^2, tb can adopt one of two values, Dcurr or Dngxi, depending on whether or not TJ'S current deadline has been satisfied by time ta- If r^ has pending periodic work at time ta, then its current deadline has not been satisfied so ti = Dcurr • If Ti has no pending periodic nor recovery work at time ta, then its current deadline has been satisfied so t\, = Dnext- On the other hand, if r; heis no pending periodic work but does have pending recovery work at time ta, ij can adopt either value Dcurr or D„ext depending on the priority level at which its recovery work is being serviced. If the recovery work for r,is being serviced at priority level i, then ti, is equal to Dcurr', otherwise, <j is equal to Dnext- This point will be revisited during the discussion of the MSM algorithm.
294 ^1L^
^1L^
^1L^
^1|
[ > i l I 11111 p 1*11 111 11 ' i > l I 111 11 [ 0
5
10
15
20
25
5
10
15
(a)
20
25
^1l
^11
^1.
I n 11111 M > i i I 11111 S 1*111111 i * i > I
30
0
30
0
l,,Fy?^i,,,:f^nfii 0
^^U^
p
5
10
15
20
25
30
35
,^j,,„|„P^,,„^t 5
10
15
20
25
30
35
(b)
Figure 3.2.5: Example Illustrating Slack Estimation The magnitude of a slack estimate £i for a task TJ represents the amount of time available for servicing aperiodic recovery work at priority level ?', ignoring any periodic work of priority lower than i which must be done. The aperiodic recovery work to be done could be due with the recovery of r, itself or the recovery of a lower priority task that failed. Determining the magnitude of a slack estimate, Si, for a given task r; requires that two parameters be computed before run-time. These parameters are; the initial slack value, i^,(0), and a worst-case job slack or Smin,- At time t=0, Si is set to Si(0), Vi, where Si{0) is the amount of slack associated with the first job of r,- to arrive. Smin, is the worst-case slack associated with any of TJ'S jobs within a hyperperiod. Simply stated, Smin, tells us the minimum amount of any of r,'s jobs can be delayed without violating their deadlines. Thus, Smin, is a conservative estimate of the slack of any of r,'s jobs. For example, consider two tasks, ri and r2, with periods and execution times (Ti,Ci) = (10,3) and (T2,C2) = (15,6). Assume all deadlines are equal to the task periods and that tasks are initiated at a critical instant so that (f>i=(p2= 0. The schedule for this task set is shown in Figure 3.2.5.a. All three jobs of Ti in the hyperperiod of 30 have a slack of 7, so TI has a worst-case slack Smin, — min{7, 7, 7} = 7. The first job of T2 has an exact slack of 3 and its second job has an exact slack of 6 so its worst-case slack Smin^ = min{3, 6} = 3. Since the tasks are initiated at a critical instant, ^i(O) = Smini and ^2(0) = ^min2
•
Now consider the case in which the t£isks are not initiated at a critical instant. Assume 2 = 5, as shown in Figure 3.2.5.b. As in the previous
295 case, Smin-, = 7, £min:! - min{6, 3} = 3, and ^i(O) = Smini = 7. However, the initial slack estimate for rj must now consider the fact that its first job will not arrive until time 5. As a result, £2(0) = level-2 inactivity prior to the arrival of T21 + worst-case slack for T21 = level-2 inactivity in interval [0, 5) + Smin^ €2(0) = 2 + 3 = 5. In general, the worst-case slack of a task, Smin,, corresponds to the largest value such that the following equality holds: min[Q
Pi{t))/t]
= 1,
(15)
where Pi(J-) is the periodic ready work of level i or higher in [0,<], or Pi(t) =
To compute the initial slack of a task r;, or ^i(O), we need to compute the total amount of level-?' inactivity which occurs prior to the first arrival of Ti. This may be computed by finding the largest size aperiodic task that can arrive at time 0 at priority level i and be completed by time 4>i. We compute this as follows. Create an aperiodic teisk r,' with priority i and deadline D[ equal to the arrival of r,-, or <^j. Assume that r/ arrives at time 0 and its execution time, CI, satisfies 0 < C[ <4>i. The objective is to find the largest value of C\ such that r/'s deadline is met. We conduct a binary search over the range of values for C,'. For each given value of C-, we compute the response time of r/ by the following recursive formulation {x = 1,2,...): t-i
max{0,tc,_^ - 4>k) /t = l
Tk
The initial conditions and convergence criterion for the recursion are given by Initial Conditions:
Convergence Criterion:
When the recursions converge to a result, tc^ will contain the response time for T-. Once the largest value of C- is found, ^,(0) = C-+ Smin,- Recall that for the special case in which
296 So far we have discussed how to initialize the run-time slack estimates for all tasks and how to compute a worst-case bound on the slack of all jobs of a periodic task. We proceed to discuss how the magnitude of the slack estimates change at run-time. At time 0, all slack estimates Si are initialized to £i{0). There are two types of run-time events which can cause slack estimates to change: • slack consumption events- Events which cause one or more slack estimates to be decremented. These are: - the occurrence of level-i inactivity, and - the acceptance of a recovery operation. Level-J inactivity results in the reduction of all slack estimates of priority higher than i, or £ ^ i , . . . f i _ i . Conversely, the acceptance of a recovery operation to be serviced at priority level i results in the reduction of slack estimates for tcisks of priority equal to or lower than i, or Si,.. . £„• • slack replenishment events- Events which cause one or more slack estimates to be incremented. These are: - the successful completion of a periodic job, and - the start of a level-i busy period. When a periodic job completes and no errors are detected, its current deadline, Dcum is considered to be satisfied. Thus, the slack estimation interval for Ti can be extended to its next deadline, Dnext, and the worstcase slack for the next job of r^ is added to its slack estimate Si. When a new level-?' busy period starts, all inactive periodic tasks of priority lower than or equal to i must have slack estimates which are no smaller t h a n their worst-case values, Smin,, i-e., Si = ma,x.{Smin,, Si}. This observation is the basis for the slack replenishment policy of the MSM algorithm, formally stated in the following theorem. (Interested readers are referred to [25] for a proof.)
T h e o r e m 3 If a penodtc task TJ IS inactive at the start of a level-i busy period, where i is a priority higher than or equal to j (i < j), then the slack available for Tj at the start of the level-i busy period must be no smaller than the worst-case slack over all of TJ 'S jobs within a hyperperiod.
297 4.3.4.2.2
Illustrative Example:
Consider the task set shown in Figure 3.2.5.a. Let us illustrate the evolution of run-time slack estimates over the first 15 units of execution, assuming t h a t no tasks fail, t h a t is, no aperiodic recovery operations arrive. The exact slack values and estimated slack values for TI and T2 change at six points in time within the interval [0, 15], as shown in Table 5. The status column indicates the activation status of a task at time t; A stands for an active status, meaning t h a t the task arrives or has pending work at time t, while an I, or inactive status, indicates t h a t the contrary is true. Since tasks are initiated at a critical instant, the slack estimates for both tasks are set to their worst-case values, Smin,- When TI completes at time 3 its slack is replenished by the worst-case slack estimate for its next job, which is 7 units. By time 9, 6 units of level-1 inactivity have elapsed, so the slack estimate for r, is reduced by this amount. At time 9, T^ completes and its slack is replenished by the worst-case slack estimate for its next job, which is 3 units. Note t h a t the actual slack of r2's second job is 6 and not 3, so the slack estimate at time 9 underestimates the exact slack by 3 units of time. At time 10, r i ' s second job arrives and the slack estimates for both TI and r2 are reduced by 1 unit to reflect the fact t h a t both priority levels have been inactive since time 9. Since time 10 marks the start of a busy period for both rj and T2, we must ensure t h a t their slack estimates are at least as large as their corresponding Emin,- TI completes its execution at time 13 and its slack is immediately replenished by the worst-case slack for its next job. When T2'S second job arrives at time 15, the slack estimates for both ri and rj are reduced by the 2 units of inactivity which occur between time 13 and 15. Because a level-2 busy period starts at time 15, the slack for T2 is checked to be no smaller than £min2This example illustrates the shortcoming of slack estimation. Although the large table of exact slack values used by the slack stealing algorithm is no longer needed, the run-time slack estimates tend to underestimate the actual slack by yielding slack values t h a t are smaller than those actually available. This is the essential tradeoff made by MSM algorithm. T h e following section will discuss how these run-time slack estimates are used by the MSM algorithm in allocating time to service recovery operations.
298 Status
Exact
Estimate
t
n
T2
n
T-l
n
T-2
0
A
A
7
3
7
3
3
I
A
14
3
14
3
9
I
I
8
9
8
6
10
A
I
7
8
7
5
13
I
I
14
8
14
5
15
I
A
12
6
12
3
Computations for Estimates ^1 ^^mtrii
^
'
^•2 — i ' m t f i j ^
3
t-l — '-I T t-mini
t2
— c.2-\-
Cmin-i
€i = max {^1 - 1, Emini] €2 = m a x { ^ 2 - 1 , ^ 1 — t^l
^min2}
it'Tnini
^^2 = max {£2 - 2, Smin^} Table 5: Illustrating Difference between Exact Slack and Slack Estimates
4.3.4.2.3
Algorithm Description
Suppose that a recovery operation for task τ_i is triggered at some arbitrary time t_a. The execution time for the recovery operation is given by C_rec, and its deadline D_rec is a time displacement relative to the time the recovery operation is requested. Since the recovery operation must be completed by the end of the period of the failed task, we assume D_rec ≤ T_i ≤ T_n. If the recovery request is accepted by the MSM algorithm, it is to be serviced at a priority level given by i_rec, where i_rec can adopt one of two values: either the priority of the failed periodic task, i, or a deadline monotonic priority, i_DM. The deadline monotonic priority of the recovery request is the lowest priority level at which the processor services tasks with a period no greater than the deadline for the request. If no such tasks are found, the deadline monotonic priority level is the highest, that is, priority level 1. For instance, if a recovery operation with a deadline D_rec = 25 is issued at time 0 in a set of five tasks with periods T_1 = 10, T_2 = 15, T_3 = 23, T_4 = 38, T_5 = 54, then the deadline monotonic priority level is 3. If, on the other hand, D_rec = 8, the deadline monotonic priority level is 1. In general, the deadline monotonic priority level for a recovery request is computed as:

    i_DM = max{ j : T_j ≤ D_rec, 1 ≤ j ≤ i }   if some T_j ≤ D_rec;
    i_DM = 1                                    otherwise.
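As a quick illustration of this formula, a straightforward scan suffices; the sketch below (function name and indexing convention are ours) computes i_DM for the five-task example above:

    #include <stdio.h>

    /* T[1..n]: task periods in increasing order (deadline monotonic indexing) */
    static int dm_priority(const double T[], int n, double D_rec)
    {
        int i_dm = 1;                    /* level 1 if no period <= D_rec   */
        for (int j = 1; j <= n; j++)
            if (T[j] <= D_rec)
                i_dm = j;                /* lowest level with T_j <= D_rec  */
        return i_dm;
    }

    int main(void)
    {
        const double T[] = {0, 10, 15, 23, 38, 54};   /* T[0] unused */
        printf("%d\n", dm_priority(T, 5, 25.0));      /* prints 3    */
        printf("%d\n", dm_priority(T, 5, 8.0));       /* prints 1    */
        return 0;
    }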
Observe that the deadline monotonic priority level for a recovery request is never lower than the priority of the failed task that issued the request, i.e., i_DM ≤ i, ∀i. Since it is desirable that a recovery operation be serviced at the lowest priority level such that its deadline is met, service at priority level i is preferable over service at a deadline monotonic priority level, if the latter is higher. Let us now present the schedulability analysis for a recovery request. For a recovery operation to be schedulable at any priority level, it is necessary that the following condition be met:

Condition 1: The deadline for the recovery operation must be at least as large as its execution time.
If this necessary condition is met, a more detailed schedulability analysis at the possible priority levels for service, i or i_DM, must be carried out. Let us first state a sufficient condition for the schedulability of a recovery request at an arbitrary priority level i_rec. A recovery operation with a deadline no greater than the largest task period, T_n, is schedulable at an arbitrary priority level i if the slack available at this priority level is at least as large as the recovery execution time, C_rec. Thus, a sufficient condition for the schedulability of a recovery request can be expressed as follows.

Condition 2: A recovery operation of size C_rec is schedulable at priority level i if

    C_rec ≤ min{ E_j : i ≤ j ≤ n }.
Let us refer to F as the set of tasks of priority lower than or equal to i_rec which have a slack estimate smaller than the recovery execution time C_rec. If the above schedulability condition is met when i_rec = i, then F must be a null set, and the recovery operation is accepted for service at priority level i. If, on the contrary, the above condition is not met, the recovery operation is unschedulable at priority level i, because there is at least one element in the F set. A recovery operation which is unschedulable at priority level i may still be schedulable at priority level i_DM if the F set has exactly one element, corresponding to the task that failed and issued the recovery operation, i.e., F = {τ_i}. If the F set contains any task other than τ_i, servicing the recovery request at a higher priority level i_DM will cause that task's deadline to be missed, so the recovery request must be rejected.
Task τ_i is an exception because of the following observations. In order for τ_i's recovery request to be schedulable at priority level i (i_rec = i), the slack estimation interval for τ_i cannot go beyond D_curr. Any slack that becomes available at priority level i beyond D_curr cannot be used to service the recovery operation at priority level i. However, if the recovery operation is serviced at a deadline monotonic priority level i_DM, which is a higher priority level than i, then there is no longer a need to constrain the slack estimation interval for τ_i to D_curr in order to guarantee that the deadline for the recovery operation will be met. Providing service at a deadline monotonic priority is sufficient to guarantee the deadline of the recovery request. As a result, the slack estimation interval for τ_i can be extended until its next deadline, D_next, and the slack for τ_i may be increased by E_min_i. This may remove the violation E_i < C_rec and remove τ_i from the F set. The recovery operation is then accepted if the schedulability condition given above is met when i_rec = i_DM. A pseudocode version of the acceptance test for the MSM algorithm is given in Figure 3.2.6; it invokes the routine DMsched of Figure 3.2.7 to evaluate the schedulability of the recovery request at the deadline monotonic priority level.
4.3.4.3 Implementing the Myopic Slack Manager
4.3.4.3.1 Memory Requirements
The MSM algorithm needs a total of three variables per periodic task, namely, a worst-case slack value E_min_i, a run-time slack estimate E_i, and a binary flag indicating the upper bound for the task's slack estimation interval (0 to indicate the current deadline, D_curr; 1 to indicate the next deadline, D_next). This results in a total of 3n variables for a set of n tasks, as summarized in Table 6. The MSM has the benefit of a significantly smaller memory overhead than that associated with the slack stealing algorithm. By only storing the minimum slack values for each task, the MSM algorithm eliminates the large table of exact slack values used by the slack stealing algorithm. We also note that the MSM algorithm does not keep flags to describe the activation status of the tasks. Instead, it uses flags to point to the upper bound on the slack estimation interval associated with a task's run-time slack estimate.

Data Structure   Description
Counters         n run-time slack estimates E_i
Binary Flags     n, one per task, to indicate the upper bound on its slack estimation interval
Vector           worst-case slack values E_min_i over all of τ_i's jobs within a hyperperiod (vector size = n)

Table 6: Memory Requirements for the Myopic Slack Management Algorithm
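In C, this per-task state might be packaged as follows. This is only a sketch of the 3n variables listed in Table 6, with names of our own choosing:

    /* Per-task MSM state: 3 variables per task, 3n in all (sketch). */
    enum slack_bound { BOUND_D_CURR = 0,    /* estimate valid up to D_curr */
                       BOUND_D_NEXT = 1 };  /* estimate valid up to D_next */

    struct msm_task {
        double   e_min;   /* worst-case slack over the hyperperiod, E_min_i */
        double   e;       /* run-time slack estimate, E_i                   */
        unsigned bound;   /* upper bound of the slack estimation interval   */
    };

    struct msm_task tasks[6];   /* one entry per periodic task (n = 6 here) */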
    /* Inputs: i (priority of failed periodic task), C_rec and D_rec */
    If C_rec <= D_rec then
        If slack estimation interval for tau_i = [t, D_curr] then
            /* E_av is the slack available at levels below i */
            E_av = min{ E_j : i < j <= n }
            If E_av >= C_rec then
                /* E_i is the current slack for tau_i */
                If E_i >= C_rec then
                    status = ACCEPTED
                Else
                    E_i = E_i + E_min_i
                    slack estimation interval for tau_i = [t, D_next]
                    status = DMsched(C_rec, D_rec)
                Endif E_i >= C_rec
            Else
                status = REJECTED
            Endif E_av >= C_rec
        Else
            status = DMsched(C_rec, D_rec)
        Endif slack estimation interval
    Else
        status = REJECTED
    Endif C_rec <= D_rec
    Return(status)

Figure 3.2.6: Acceptance Test for the Myopic Slack Management Algorithm

    /* Inputs: C_rec, D_rec */
    Compute i_DM
    If i_DM != i then
        i_rec = i_DM
        E_av = min{ E_j : i_rec <= j <= n }
        If E_av >= C_rec then
            status = ACCEPTED
        Else
            status = REJECTED
        Endif E_av >= C_rec
    Else
        status = REJECTED
    Endif i_DM != i
    Return(status)

Figure 3.2.7: DMsched routine to evaluate schedulability of a recovery request at the deadline monotonic priority level
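Read together, the two figures amount to the C routine below. This is our own consolidation, not the original implementation: it assumes global arrays E[1..n], Emin[1..n], T[1..n] and a per-task bound flag as in the data structures of the next subsection, folds the i_DM computation into DMsched, and uses helper and status names of our own choosing.

    enum { REJECTED = 0, ACCEPTED = 1 };

    extern double E[], Emin[], T[]; /* slack estimates, worst-case slacks, periods */
    extern int    bound[];          /* 0 = interval ends at D_curr, 1 = at D_next  */
    extern int    n;                /* number of periodic tasks                    */

    static double min_slack(int lo)           /* min E_j over lo <= j <= n */
    {
        double m = E[lo];
        for (int j = lo + 1; j <= n; j++)
            if (E[j] < m) m = E[j];
        return m;
    }

    static int dm_sched(int i, double C_rec, double D_rec)   /* Figure 3.2.7 */
    {
        int i_dm = 1;
        for (int j = 1; j <= n; j++)
            if (T[j] <= D_rec) i_dm = j;
        if (i_dm == i) return REJECTED;       /* no higher level available   */
        return (min_slack(i_dm) >= C_rec) ? ACCEPTED : REJECTED;
    }

    static int msm_accept(int i, double C_rec, double D_rec) /* Figure 3.2.6 */
    {
        if (C_rec > D_rec) return REJECTED;          /* Condition 1 fails    */
        if (bound[i] == 1)                           /* interval already     */
            return dm_sched(i, C_rec, D_rec);        /* extends to D_next    */
        if (i < n && min_slack(i + 1) < C_rec)       /* some tau_j, j > i,   */
            return REJECTED;                         /* lacks slack          */
        if (E[i] >= C_rec) return ACCEPTED;          /* service at level i   */
        E[i] += Emin[i];                             /* extend tau_i's       */
        bound[i] = 1;                                /* interval to D_next   */
        return dm_sched(i, C_rec, D_rec);            /* try level i_DM       */
    }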
4.3.4.3.2 Functional Requirements
Some work needs to be done to update slack management variables any time the scheduler is invoked. The amount of work required depends on the event which caused the scheduler to be invoked, called a scheduling event. There are four types of scheduling events, namely, a periodic job arrival, a periodic job completion, an aperiodic job arrival, and an aperiodic job completion. Note that the arrivals of periodic or aperiodic jobs, which occur at scheduling points, describe only two of the four possible scheduling events. Further, note that all slack consumption and replenishment events occur at the time of a scheduling event. A scheduling event has one of two possible outcomes: either a context switch occurs or the scheduler resumes execution of the same job. Every scheduling event, regardless of the outcome, requires that some work be done to update the slack management variables. If the outcome of a scheduling event is a context switch, some additional work is required. We now describe the processing overhead associated with updating the slack management variables at each scheduling event:
1. If a context switch occurs, we must account for level-i inactivity (see the C sketch following this list). Similar to the slack stealing algorithm, we distinguish n + 1 different activities at run-time. Activities 1, ..., n refer to periodic or aperiodic task processing at the corresponding priority level, and activity n + 1 refers to the processor being idle. At each instant t_k at which a new context switch occurs, the processor has finished an activity j, 1 ≤ j ≤ n + 1, which it had initiated at time t_{k-1}, the previous context switch time. Then:

       if j = 1, do nothing;
       if 2 ≤ j ≤ n + 1, set E_i = max{0, E_i - (t_k - t_{k-1})} for 1 ≤ i ≤ j - 1.

   After level-i inactivity is accounted for, check whether the task τ_j which will be swapped in for execution has a higher priority than the activity which the processor has just finished. If so, this marks the start of a new level-j busy period. Consequently, the slack estimates for all inactive tasks of priority lower than or equal to j must be made at least as large as their corresponding worst-case slack estimates: E_k = max{E_k, E_min_k}, k = j, ..., n.

2. If a periodic job τ_{i,j} arrives, set the status for periodic task τ_i to active. Then set the upper bound on its slack estimation interval to its current deadline, D_curr.

3. If a periodic job τ_{i,j} completes, set the status for periodic task τ_i to inactive. If the completed job is not faulty, increase the slack estimate for the task, E_i, by its worst-case slack value, E_min_i.

4. If an aperiodic job arrives, run the MSM algorithm to determine whether the job can be scheduled. If so, the priority level for its service is set to i_rec, and C_rec is subtracted from the slack estimates of all periodic tasks of priority lower than or equal to i_rec. (Note that if i_rec ≠ i, where i is the priority of the failed periodic task, the MSM algorithm changes τ_i's slack estimation interval to the deadline for its next job, D_next.)

5. If an aperiodic job completes successfully, that is, with no faults, check the upper bound on the slack estimation interval of the faulty periodic task which spawned it. If this upper bound is set to the task's current deadline, D_curr, change it to its next deadline, D_next, and increase the task's slack estimate by its worst-case slack value. Otherwise, do nothing.
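For item 1, the context-switch bookkeeping can be sketched in C as follows. Array and function names are ours; E[] and Emin[] hold the run-time and worst-case slack values. For brevity the sketch floors every level from the incoming priority down, whereas the text restricts the floor to inactive tasks:

    extern double E[], Emin[];
    extern int    n;

    /* The processor finished activity j (levels 1..n; n+1 = idle) that it
       started at t_prev; the task at priority level 'next' is dispatched. */
    static void on_context_switch(int j, double t_prev, double t_now, int next)
    {
        /* Levels above the finished activity were inactive during
           [t_prev, t_now]: charge the elapsed time against their slack.
           When j = 1 the loop body never runs, i.e., do nothing.        */
        for (int i = 1; i <= j - 1; i++) {
            E[i] -= (t_now - t_prev);
            if (E[i] < 0.0) E[i] = 0.0;
        }
        /* A higher-priority dispatch starts a new level-'next' busy period:
           floor the estimates of levels next..n at their worst-case values. */
        if (next < j)
            for (int k = next; k <= n; k++)
                if (E[k] < Emin[k]) E[k] = Emin[k];
    }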
In summary, an algorithm called the Myopic Slack Management (MSM) algorithm was introduced in this section to overcome the implementation shortcomings of slack stealing algorithms. The memory overhead is reduced by a slack estimation approach wherein the scheduler obtains conservative run-time estimates of the slack available for each periodic task. To make slack estimation feasible, we constrain the accumulation of available slack to relatively short intervals of time into the future; that is, the MSM algorithm is nearsighted in its ability to accumulate slack into the future. The scheduling overhead of the slack stealing methods is reduced by constraining the service of hard aperiodic tasks to a maximum of two priority levels, namely, the priority level of the periodic task that failed and issued the aperiodic recovery request and the deadline monotonic priority level for the aperiodic. These techniques reduce the memory and scheduling overheads to a worst-case complexity of O(n). The tendency to underestimate the available slack and the limitations imposed on the priority levels considered for service may, in principle, reduce the MSM algorithm's performance relative to a direct implementation of the slack stealing methods. However, the MSM trades off some of the performance of slack stealing methods for a scheduling solution with significantly less overhead.
4.3.5 Performance Evaluation
The dynamic and static allocation strategies presented so far give us a methodology for scheduling recovery operations in such a way that the timing correctness of fault-free real-time tasks is not disturbed. Static allocation strategies are very simple to implement, but they tend to be pessimistic in their reservation of processing time for recovery. Dynamic allocation strategies, on the other hand, are more complex but tend to make better use of the processing capacity. Beyond this qualitative description of their differences, how can these strategies be compared quantitatively? The objective of this section is to address this question.
We evaluate the performance of the three allocation algorithms proposed in the previous sections, namely, the Communal Reservation Algorithm, the Private Reservation Algorithm, and the Myopic Slack Management Algorithm. To estimate the effectiveness of these algorithms, we simulate the injection of a series of faults into a few real-time workloads, representative of applications found in navigation, avionics, and multimedia systems. These faults act as triggers that generate a stream of recovery operations requiring service prior to the deadlines of the periodic tasks which were found faulty. The recovery operations result in transient load increases which exercise the scheduler's ability to maintain the timing correctness of the periodic tasks while striving to maximize timely recovery. We then gauge an algorithm's performance by measuring the probability that a recovery operation is successfully scheduled under a wide range of transient recovery loads. In addition to highlighting performance differences among the algorithms, our simulation studies validate their timing correctness.
4.3.5.1 Modeling Recovery Loads
Given a particular real-time application workload, the timing characteristics of its recovery operations may be determined by an arbitrarily complex mesh of factors. Among numerous others, these include the fault type and persistence, application-level semantics, the severity of the damage caused by the fault, and the error detection latency. To the extent of our knowledge, there is no data in the literature for modeling the stochastic nature of recovery workloads arising in practical real-time applications. We do not intend to address this problem by proposing models for real-time recovery workloads which attempt to characterize the complex set of actions that may ultimately lead to a transient recovery load. Our objective is to subject real-time systems to recovery loads which stimulate the aperiodic scheduling software, allowing its strengths and weaknesses to be exposed and measured. A fairly simplistic recovery workload model adequately serves this purpose. To this effect, we assume that:

• A1: Errors are detected at the completion time of a task, the time at which the acceptance test to validate the results of the task is executed.

• A2: Error detection coverage is 100%. Therefore, all injected faults cause recovery requests to be triggered.
• A3: If a real-time task is found to be faulty, it immediately issues a retry request which must be completed by the task's deadline.

• A4: There are no constraints imposed on the number of successive retry requests that a task may issue prior to its deadline.

Note that A1 and A3 are consistent with the Recovery Block model, in which error detection is performed at the end of a primary block, followed by the execution of a secondary block in the event of an error. A2 and A4 are clearly unrealistic in practice. However, the efficiency of error detection mechanisms and the amount of software redundancy available per application task are not issues being scrutinized by this research. Therefore, we justify assumptions A2 and A4 on the basis that: (a) they have no direct bearing on the relative performance of the scheduling algorithms we intend to evaluate; and (b) they allow us to accelerate the generation of transient recovery workloads.
4.3.5.2 The Experimental Design
The primary tools used to conduct our simulation studies are shown in Figure 3.2.8. The heart of our simulation environment is a discrete event simulation tool called DERTsim (Dependable Real-Time scheduling simulator). DERTsim was developed with a simulation programming language called Simscript II.5 [34]. DERTsim models the scheduling actions and processor activity that take place during a desired observation interval, as real-time application tasks and any required recovery operations contend for processing time. The inputs to DERTsim are:

• a description of the timing requirements of the real-time application workload (i.e., periods, worst-case execution times, and fixed priorities),

• a set of pre-computed data structures needed to support our proposed static and dynamic allocation algorithms,

• and a set of pre-determined recovery operations to be triggered at run-time.

The pre-computed data structures include: the number of retry tickets per task for the Private Reservation Algorithm, the number of shared retry tickets for the Communal Reservation Algorithm, and the worst-case and initial slack estimates for the Myopic Slack Management algorithm.

(Simscript II.5 is a registered trademark of CACI, Inc.-Federal.)
[Figure 3.2.8: Simulation Tools. Block diagram: the RT workload and a fault injection profile feed the recovery workload generator and the static timing analysis program START.c; their outputs drive DERTsim, the Dependable RT Scheduling simulator (built on the Simscript II.5 discrete event simulation tool), which records processor activity in a Scheduling Event Log and recovery events in a Recovery Log; a log parser feeds the RTview.c X Windows schedule display, and a log analyzer produces recovery statistics.]
These data structures are determined by a C program called START.c, which also verifies that the input real-time workload is schedulable in the absence of faults.

The transient recovery loads to be triggered at run-time are created by a recovery workload generator. The recovery workload generator requires a fault injection profile describing a mean fault interarrival time and a fault distribution. These parameters are then used by the recovery workload generator to compute the probability that any job for a periodic task is found faulty. Given this information, the recovery workload generator creates a fault injection table for each application task, which lists all of its jobs that are tagged as faulty. Every job included in these tables will fail its acceptance test and issue at least one retry request. (The actual number of retry requests a job will generate is listed in the tables.) Hence, any time a job is completed during a simulation, DERTsim consults the fault injection table for the corresponding application task to determine whether the job passes or fails the reasonability check for validating its outputs.

The output produced by DERTsim is captured in two types of logs. One is a Scheduling Event Log, which contains a detailed trace of all processing activity observed during a simulation. The other is a Recovery Log, which contains records describing every recovery event observed. A recovery event is described by logging its request time, acceptance or rejection status, and an identification of the faulty periodic job which issued the request.

To assist in the exploration of the massive amounts of data contained in scheduling event logs, we developed a visualization tool called RTview, which runs on X Windows. RTview displays timelines that depict all periodic and aperiodic processing activity, highlighting the occurrence of faults and the outcome of any attempts to allocate time for recovery. This tool was instrumental in providing quick visual feedback on the scheduling actions of a given time allocation strategy, significantly reducing the effort to verify and debug DERTsim. A schedule parser postprocesses every scheduling event log, generating a compact log which is readable by RTview. The performance results for a simulation are finally generated by invoking a log analyzer. The log analyzer computes a number of recovery statistics which measure the scheduler's ability to responsively handle transient recovery loads.
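As an illustration of how such a fault injection table might be generated, the sketch below assumes (our assumption, consistent with the geometric retry count used in Section 4.3.5.4.3) that a job of execution time C is tagged faulty with probability 1 - e^{-λC} and that each successive retry fails again with the same probability; all constants are example values:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const double lambda = 0.002;   /* fault rate (faults per microsecond) */
        const double C      = 10.0;    /* job execution time in microseconds  */
        const int    njobs  = 1000;    /* jobs in the observation interval    */
        const double p_fail = 1.0 - exp(-lambda * C);

        srand(1);                      /* fixed seed for a repeatable table   */
        for (int job = 0; job < njobs; job++) {
            int retries = 0;
            while ((double)rand() / RAND_MAX < p_fail)
                retries++;             /* each attempt fails independently    */
            if (retries > 0)
                printf("job %d: %d retries\n", job, retries);
        }
        return 0;
    }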
309 4.3.5.3
T h e Performance Metrics
Meeting the deadline of a real-time application job requires that its outputs be semantically and temporally correct, i.e., its outputs must be correct and they must be delivered prior to the job's deadline. If an output error is detected, the job issues a recovery request which must be serviced prior to the job's deadline and must generate an acceptable output. If the recovery operation, in turn, fails to deliver an acceptable output by its deadline, a timing failure is said to have occurred. Allocation algorithms address the timing correctness of recovery operations. The function of an allocation algorithm is to maximize the probability that the timing correctness of a recovery operation is met, thus minimizing the probability of timing failures. Since the outputs of a recovery operation may also be faulty, there is no guarantee that meeting the timing requirements for recovery will successfully avert a timing failure. However, an effective allocation algorithm can greatly enhance the chances of successful recovery.

We define a recovery coverage metric to quantitatively characterize the effectiveness of recovery time allocation algorithms. Specifically, recovery coverage is the conditional probability that the deadline for a recovery operation is guaranteed, given that a real-time task has failed and issued a recovery request. This quantitative definition of coverage is analogous to that given by Bouricius et al. for recovery from permanent hardware faults [35], where coverage was defined as the conditional probability of recovery, given that a fault has occurred. The concept of coverage has widespread use in the literature, usually referring to the effectiveness of error detection rather than recovery mechanisms.
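Measured over a simulation run, the metric reduces to the ratio of accepted recovery requests to requests issued. A sketch over the Recovery Log records (field and function names are ours):

    struct recovery_event { double t_request; int accepted; int task_id; };

    /* recovery coverage = accepted requests / requests issued */
    static double recovery_coverage(const struct recovery_event *log, int nev)
    {
        int accepted = 0;
        for (int k = 0; k < nev; k++)
            accepted += log[k].accepted;
        return nev > 0 ? (double)accepted / nev : 1.0;
    }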
4.3.5.4 INS Task Set Study
In this section we present a representative case study which quantitatively compares the performance of the static and dynamic time redundancy management techniques developed in this research. For a complete description of the full set of performance studies conducted, see [25]. The case study focuses on a real-time application representative of an Inertial Navigation System (INS) [36], described in Table 7. First, we illustrate the input data structures computed for this task set to support each algorithm, and then discuss our simulation results.
Task   Period (µs)   Execution Time (µs)   Description
τ_1    2.5           1.18                  Update Ship Attitude
τ_2    40            4.28                  Update Ship Displacement
τ_3    62.5          10.28                 Send Attitude Message
τ_4    1000          20.28                 Send Navigation Message
τ_5    1000          100.28                Update Status on Screen
τ_6    1250          25                    Update Ship Position

Table 7: Timing Requirements for the Inertial Navigation System
4.3.5.4.1 Data Structures for the Reservation Algorithms
Figure 3.2.9 shows the fault-free utilization bounds for the INS task set as the number of retry tickets reserved by the Private Reservation Algorithm (PRA) and the Communal Reservation Algorithm (CRA) increases from 0 to 15. The fault-free utilization bounds demarcate the upper utilization attainable by a fault-free task set once reservation is considered. A horizontal line in the figure shows the 88% utilization of the 6 periodic tasks in the INS task set, referred to as the nominal periodic load. Note that the CRA has a fault-free utilization bound for 1 shared retry ticket which is above 88%. This means that at an 88% load, the CRA can reserve 1 retry ticket to be allocated to any one failed task at run-time. Hence, one recovery operation can be serviced during any interval of time of duration equal to the largest task period, T_6 = 1250 µs. All tasks have equal contention privileges, so they can all contend for this retry ticket. On the contrary, the fault-free utilization bound for 1 retry ticket per task is about 50% for the PRA, which is below the 88% task set utilization. Since the task set utilization is above this bound, the PRA cannot reserve 1 retry ticket per task without violating the schedulability of the task set. Hence, it is necessary that the PRA confer retry privileges to some but not all of the tasks. The retry privileges given to the tasks may be represented by a ticketing vector of the form [x_1 x_2 ... x_n], where x_i is 1 or 0 depending on whether or not the corresponding task τ_i has a reserved retry ticket. Using the heuristics given in [25], we compute the fault-free utilization bounds, U_BD, for a set of ticketing vectors in which a subset of the tasks holds a ticket. These fault-free utilization bounds are shown in Table 8.
[Figure 3.2.9: Fault-Free Utilization Bounds for the INS task set, plotted against the number of retry tickets (0 to 15).]

The rightmost column of Table 8 indicates the number of the task which is associated with the fault-free utilization bound in that row. For instance, the fault-free utilization bound for τ_5 is 85.4%, so τ_5 will not hold a retry ticket if the utilization of the task set exceeds 85.4%. Since the utilization of the task set is 88%, the PRA finds that the most protection it can provide to the tasks is given by the ticketing vector [000101]. Therefore, tasks τ_4 and τ_6 each hold 1 retry ticket, while the other tasks hold none.
Ticketing Vector   U_BD    i
[000000]           0.994   None
[000001]           0.967   6
[000101]           0.946   4
[000111]           0.854   5
[010111]           0.774   2
[011111]           0.777   3
[111111]           0.497   1

Table 8: Fault-Free Utilization Bounds for the INS task set under the PRA
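The selection the PRA makes from Table 8 can be mimicked by scanning the candidate vectors from most to least protective and taking the first one whose fault-free utilization bound covers the 88% nominal load. A sketch, with the bounds transcribed from the table:

    #include <stdio.h>

    int main(void)
    {
        /* candidate ticketing vectors, most protective first (Table 8) */
        const char  *vec[] = {"111111", "011111", "010111", "000111",
                              "000101", "000001", "000000"};
        const double ubd[] = {0.497, 0.777, 0.774, 0.854,
                              0.946, 0.967, 0.994};
        const double load  = 0.88;            /* nominal periodic load  */

        for (int k = 0; k < 7; k++)
            if (ubd[k] >= load) {             /* first feasible vector  */
                printf("ticketing vector [%s]\n", vec[k]); /* [000101] */
                break;
            }
        return 0;
    }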
Task           τ_1    τ_2     τ_3     τ_4      τ_5      τ_6
E_min_i (µs)   1.32   16.84   14.16   236.24   113.96   110.96

Table 9: Worst-Case Slack Values for the INS Task Set
4.3.5.4.2 Data Structures for the Myopic Slack Management Algorithm

The Myopic Slack Management (MSM) algorithm requires as input the worst-case slack values for each task. The worst-case slack value E_min_i for a task τ_i is the least slack associated with any of τ_i's jobs within a hyperperiod. The procedure for computing these values is given in [25]. The worst-case slack values computed for the INS task set are summarized in Table 9. The initial slack values for the tasks were computed prior to each simulation, based on the initial phasing conditions chosen.
4.3.5.4.3 Simulation Results
A comparison of the coverage provided by the Private Reservation Algorithm (PRA), the Communal Reservation Algorithm (CRA), and the Myopic Slack Management (MSM) algorithm for the INS task set is shown in Figure 3.2.10. The lower horizontal axis shows the mean retry rate, expressed as the average number of retries per second. The retry rate is equal to the fault rate λ, which is an independent variable. The upper horizontal axis shows the mean recovery load resulting from a burst of task retries, expressed as a percentage of processor utilization. The mean recovery load is computed analytically for each retry rate λ as follows. The mean number of times the processor will be requested to execute a job for task τ_i is equal to the mean of a geometric distribution, given by e^{λC_i}. If the execution time of a retry for task τ_i is C_i × k, where k is a scale factor satisfying 0 < k ≤ 1.0, the recovery load is given by

    W_rec = Σ_{i=1}^{n} (e^{λC_i} - 1) (C_i × k) / T_i        (16)
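Evaluating equation (16) numerically for the INS task set of Table 7, with k = 1 (retries as long as the primary executions, as assumed in the experiments reported below), reproduces the recovery loads on the figure's upper axis. A sketch:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double C[] = {1.18, 4.28, 10.28, 20.28, 100.28, 25.0};  /* us */
        const double T[] = {2.5, 40.0, 62.5, 1000.0, 1000.0, 1250.0}; /* us */
        const double k      = 1.0;    /* retry time = primary execution time */
        const double lambda = 0.002;  /* 2000 retries/s = 0.002 per us       */
        double w_rec = 0.0;

        for (int i = 0; i < 6; i++)
            w_rec += (exp(lambda * C[i]) - 1.0) * (C[i] * k) / T[i];
        printf("mean recovery load = %.2f%%\n", 100.0 * w_rec); /* ~2.96% */
        return 0;
    }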
[Figure 3.2.10: Coverage results for the INS task set. Coverage (0.0 to 1.0) is plotted against the retry rate (0 to 8000 retries per second, lower axis) and the corresponding mean recovery load (2.96% to 15.3%, upper axis) for the PRA, CRA, and MSM algorithms.]

The mean number of recovery jobs executed on behalf of τ_i is (e^{λC_i} - 1) because the primary execution of a job for τ_i is not considered a recovery operation. It is important to note that although the mean recovery loads shown on the figure were determined analytically, we verified these values empirically. The vertical axis depicts the coverage obtained from a simulation experiment. Coverage can adopt any real value between 0 and 1.0. A coverage value of 1.0 indicates that the allocation algorithm was able to schedule all recovery requests triggered during a simulation, successfully avoiding the occurrence of any timing failures. Ideally, the coverage provided by an allocation algorithm should be as close to unity as possible.

The retry rates for the INS task set fluctuated between 80 and 8000 retries per second, resulting in transient recovery workload increases of up to 15.37%, as shown in Figure 3.2.10. The time to retry each task τ_i was assumed to be equal to the primary execution time for the task, C_i. Observe that the MSM algorithm provides the highest coverage among the three algorithms shown. It is the only algorithm which reaches perfect coverage at low recovery loads. Moreover, its performance degrades slowly as the retry rate is increased. Even when the joint processing load is about 104%, the MSM algorithm provides a coverage of 0.87.
The coverage provided by the CRA is competitive with the MSM algorithm at very low recovery loads. The leftmost point shown on the graph corresponds to a recovery load of about 0.11%. As the retry rate increases, the performance of the CRA drops significantly. When the joint processing load reaches 104%, the coverage provided by the CRA has dropped to 0.107. The explanation for this behavior is that the coverage provided by the CRA is highly sensitive to the degree of contention for the shared retry tickets. As long as the demand for retry tickets never exceeds the maximum supply of tickets, all retry requests get accepted and coverage is 1.0. However, if the demand tends to exceed the maximum supply of tickets, the allocator is usually saturated. This means that the number of retry operations contending for tickets surpasses the maximum number of tickets available on average. Consequently, beyond the saturation point, an increase in the retry rate causes a proportional increase in the rejection rate. These performance characteristics remind us of the behavior of contention protocols for medium access control, such as the CSMA contention protocol for packet radio networks [37]. As the load on the network increases, the number of retransmissions due to collisions increases. As a result, the communication throughput is inversely proportional to the load on the network.

In summary, the coverage provided by the MSM algorithm is superior to the coverage provided by the PRA and CRA algorithms. This performance advantage was consistently observed for the other task sets evaluated, as reported in [25]. The high coverage provided by the MSM algorithm indicates that it does an excellent job of meeting the response-time requirements for the recovery operations while guaranteeing that the deadlines of all fault-free periodic tasks are also met.
4.3.6 Conclusions
This chapter provided an overview of a coherent analytical framework for evaluating the timing correctness of priority-driven, real-time systems in the presence of faults via the exploitation of time redundancy. A set of static and dynamic algorithms was introduced for the management of redundant task executions caused by faults. The dynamic algorithms are based on the concept of slack stealing, developed in this research and proven to be optimal for scheduling soft aperiodic tasks. Algorithms for time redundancy management were then developed which exploit the performance advantages of slack stealing with significantly less overhead. Simulation studies consistently highlight the superiority of dynamic allocation methods, even in situations in which run-time overheads are considered. This demonstrates that dynamic, on-line time redundancy allocation strategies have a greater potential for providing high recovery coverage for real-time applications than static, off-line methods.
4.3. References

[1] Farnam Jahanian. State restoration in real-time fault-tolerant systems. Complex Systems Engineering Synthesis and Assessment Technology Workshop, pages 21-29, July 1992.
[2] Tom Hand. Real-time systems need predictability. Computer Design RISC Supplement, pages 57-59, August 1989.
[3] H. Kopetz, H. Kantz, G. Grunsteidl, P. Puschner, and J. Reisinger. Tolerating transient faults in MARS. In International Symposium on Fault-Tolerant Computing, pages 466-473, Newcastle upon Tyne, U.K., June 1990.
[4] C.M. Krishna and A.D. Singh. Modelling correlated transient failures in fault-tolerant systems. In 1989 International Symposium on Fault-Tolerant Computing, pages 374-381, Chicago, Illinois, June 1989.
[5] A.L. Hopkins, T.B. Smith III, and J.H. Lala. FTMP: A highly reliable fault-tolerant multiprocessor for aircraft. Proceedings of the IEEE, 66(10):1221-1239, October 1978.
[6] J. Goldberg et al. Development and analysis of the software implemented fault-tolerance (SIFT) computer. Technical report, NASA CR-172146, 1984.
[7] J.H. Wensley et al. SIFT: The design and analysis of a fault-tolerant computer for aircraft control. Proceedings of the IEEE, 66(10), October 1978.
[8] J.H. Lala and L.S. Alger. Hardware and software fault tolerance: A unified architectural approach. In 1988 International Symposium on Fault-Tolerant Computing, pages 240-245, Tokyo, Japan, June 1988.
[9] Y.K. Malaiya. Linearly correlated intermittent failures. IEEE Transactions on Reliability, R-31(2), 1982.
[10] S.R. McConnel, D.P. Siewiorek, and M.M. Tsao. The measurement and analysis of transient errors in digital computing systems. In Digest of Papers, Ninth Annual International Conference on Fault-Tolerant Computing, pages 67-70, 1979.
[11] Ting-Ting Y. Lin. Design and Evaluation of an On-line Predictive Diagnostic System. PhD thesis, Carnegie Mellon University, May 1988.
[12] Jim Gray. Why do computers stop and what can be done about it? In Fifth Symposium on Reliability in Distributed Software and Database Systems, pages 374-381, Los Angeles, California, Jan. 1986.
[13] Daniel P. Siewiorek. Architecture of fault-tolerant computers: An historical perspective. In Proceedings of the IEEE, volume 79, pages 1-25, December 1991.
[14] B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, pages 220-232, June 1975.
[15] A. Avizienis and J. Kelly. Fault tolerance by design diversity: Concepts and experiments. IEEE Computer, August 1984.
[16] L.J. Yount. Architectural solutions to safety problems for commercial transports. In Proceedings of the 6th AIAA/IEEE Digital Avionics Systems Conference, December 1984.
[17] G.F. Sullivan and G.M. Masson. Using certification trails to achieve software fault tolerance. In Proceedings of the IEEE 1990 Fault-Tolerant Computing Symposium, pages 423-431, 1990.
[18] G.F. Sullivan and G.M. Masson. Certification trails for data structures. Technical Report JHU 90/17, Johns Hopkins University, MD, 1990.
[19] P. Hood and V. Grover. Designing real-time systems in Ada. Technical Report 1123-1, SofTech Inc., 460 Totten Pond Road, Waltham, MA 02254-9197, January 1986.
[20] H. Kopetz et al. Distributed fault-tolerant real-time systems: The MARS approach. IEEE Micro, 9(1):25-40, February 1989.
[21] V. Nirkhe and W. Pugh. A partial evaluator for the Maruti hard real-time system. In Real-Time Systems Symposium, pages 64-73, Dec. 1991.
[22] J. Stankovic and K. Ramamritham. The Spring kernel: A new paradigm for real-time operating systems. ACM Operating Systems Review, 23(3), July 1989.
[23] T.B. Smith III. The Fault-Tolerant Multiprocessor Computer. Noyes Publications, 1986.
[24] James Gafford. Rate monotonic scheduling. IEEE Micro, pages 34-38, June 1991.
[25] Sandra Ramos Thuel. Enhancing Fault Tolerance of Real-Time Systems through Time Redundancy. PhD thesis, Carnegie Mellon University, May 1993.
[26] H. Chetto and M. Chetto. Some results of the earliest deadline scheduling algorithm. IEEE Transactions on Software Engineering, 15(10):466-473, 1989.
[27] K. Schwan and H. Zhou. Dynamic scheduling of hard real-time tasks and real-time threads. IEEE Transactions on Software Engineering, 18(8):736-748, 1992.
[28] C.L. Liu and J.W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the Association for Computing Machinery, 20(1):46-61, January 1973.
[29] J. Y.-T. Leung and J. Whitehead. On the complexity of fixed-priority scheduling of periodic real-time tasks. Performance Evaluation, 2:237-250, 1982.
[30] John Lehoczky, Lui Sha, and Ye Ding. The rate-monotonic scheduling algorithm: Exact characterization and average case behavior. In Real-Time Systems Symposium, pages 166-171, 1989.
[31] Sandra Ramos-Thuel and Jay K. Strosnider. Scheduling fault recovery operations for time-critical applications. In Proceedings of Dependable Computing for Critical Applications, January 1994.
[32] Sandra Ramos-Thuel and John P. Lehoczky. An optimal algorithm for scheduling soft-aperiodic tasks in fixed-priority preemptive systems. In Real-Time Systems Symposium, pages 100-110, December 1992.
[33] Sandra Ramos-Thuel and John P. Lehoczky. On-line scheduling of hard deadline aperiodic tasks in fixed-priority systems. In Proceedings of the Real-Time Systems Symposium, pages 160-171, December 1993.
Building
Simulation
Models with SIMSCRIPT
[35] W . G . Bouricius. Reliability modeling for fault-tolerant computers. Transactions on Computers, G-20:1306-1311, Nov. 1971.
II.5.
IEEE
[36] K. Fowler. Inertial navigation system simulator: Top-level design. Technical Report CMU/SEI-89-TR-38, Software Engineering Institute, January 1989.
[37] W. Stallings. Data and Computer Communications. Macmillan, New York, 1985.
INDEX

algorithm-based fault tolerance 163
ALU 51
assertion 222
atomic broadcast 249
atomicity 228

backup channel 123

check prediction 49
checkpoint 236, 253
checksum 164, 176
code disjoint 42
code partitioning 48, 58
code, AN 44
code, arithmetic 42
code, Berger 47
code, dual-rail 6, 46
code, parity 11, 45
code, residue 45
code, SEC/DED 10
compiler transformations 139, 174
concurrent error detection 6, 38
coverage 25, 205
cyclic redundancy check 83

data dependence 138, 177
DCVSL 16
deadline 90, 279, 299
deadlock 75
dual-rail logic 6

error propagation 55
exception 142

fail-silent 246
failure, common mode 270
false alarm 79
fault coupling 13
fault effects 19
fault location 80
fault, stuck-at 13, 23
fault, stuck-on 53
fault, stuck-open 53
fault-secure 39
Fault-Tolerant Building Block Computer 8
fault/error injection 24, 206, 308

guard 217

Inertial Navigation System 309
instruction retry 136
interconnection network 71

latency 10, 135, 149

m-out-of-n checker 63
Mach 218, 255
message delivery time 79, 98
morphic circuits 6
multicast 248

N-version programming 270

ordering, event 227
ordering, message 249
ordering, packet 79

packet 72, 90
parallelization 177
pipeline 24
protocol graph 256
protocol, membership 251

rate monotonic analysis 279
real-time channel 89
real-time communication 88
real-time system 36, 88, 260, 266
recovery 5, 82, 232, 253, 289, 305
recovery block 270
redundancy 5, 160, 267
replicated hardware 5
restart 249
retry 136, 282
RISC 36
robustness benchmarks 222
rollback 136, 138
routing 72
routing, Chaotic 75

schedulability 97, 279
scheduling, instruction 143
scrubbing 11
self-checking 6, 38
self-checking checker 7
self-exercising 10
self-testing 39
self-timed logic 14
sentry 217
slack stealing 289
speculative execution 140
stable storage 226, 246
state-machine 244
super-scalar 137
synchronization circuit 21

testability 14
testing 81
time-out 79

virtual cut-through 75
VLIW 137
VLSI 15, 36, 159