Lecture Notes in Computer Science Edited by G. Goos and J. Hartmanis
448 B. Simons A. Spector (Eds.)
Fault-Tolerant Distributed Computing
Springer-Verlag New York Berlin Heidelberg London Paris Tokyo Hong Kong Barcelona
Editorial Board
D. Barstow, W. Brauer, P. Brinch Hansen, D. Gries, D. Luckham, C. Moler, A. Pnueli, G. Seegmüller, J. Stoer, N. Wirth

Editors
Barbara Simons
IBM Almaden Research Center, Dept. K53/802
650 Harry Road, San Jose, CA 95120-6099, USA

Alfred Spector
Transarc Corporation, The Gulf Tower
707 Grant Street, Pittsburgh, PA 15219, USA
CR Subject Classification (1987): D.4, C.2.4, F.1.1, F.2.0, C.3-4

ISBN 3-540-97385-0 Springer-Verlag Berlin Heidelberg New York
ISBN 0-387-97385-0 Springer-Verlag New York Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data banks. Duplication of this publication or parts thereof is only permitted under the provisions of the German Copyright Law of September 9, 1965, in its current version, and a copyright fee must always be paid. Violations fall under the prosecution act of the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1990
Printed in Germany

Printing and binding: Druckhaus Beltz, Hemsbach/Bergstr.
2145/3140-543210 - Printed on acid-free paper
Preface
The goal of the Asilomar Workshop on Fault-Tolerant Distributed Computing was to facilitate interaction between theoreticians and practitioners. To achieve this goal, speakers were invited and topics chosen so that a broad overview of the field would be presented. Because the attendance at the Workshop was diverse, the presentations/papers were in many instances designed to appeal to a general audience. The presentations were also designed to span a body of research from the theoretical to the pragmatic. Since the material seemed ideal for a book, this book was planned together with the planning of the workshop.

Held in the spring of 1986, the workshop brought together approximately 70 active researchers from academia and industry. Most of the chapters were written following the workshop and subsequently revised. Consequently, some of the results that are described were obtained after the workshop. However, six of the chapters (those by Bernstein, Cohn, Finkelstein, Liskov, Spector, and Wensley) were recorded, transcribed, and edited for inclusion. This transcription format makes for easy-to-read, though somewhat chatty, articles. The chapters are arranged in the order of presentation at the workshop.

Given the scope of this book, we feel that it should be of use to students, researchers, and developers. We hope that it will promote greater understanding within the world of fault-tolerant research and development.
Barbara Simons Alfred Spector
Workshop Program Committee: Chair: Barbara Simons, Flaviu Cristian, Danny Dolev, Michael Fischer, Jim Gray, Leslie Lamport, Nancy Lynch, Marshall Pease, Fred Schneider, Ray Strong.
We would like to thank IBM Almaden Research Center and the Office of Naval Research for their support. We also appreciate the help of the several volunteers who assisted at the workshop.
Contents

A Theoretician's View of Fault Tolerant Distributed Computing ............. 1
    M. J. Fischer
A Comparison of the Byzantine Agreement Problem and the
Transaction Commit Problem ............................................... 10
    J. Gray
The State Machine Approach: A Tutorial ................................... 18
    F. B. Schneider
A Simple Model for Agreement in Distributed Systems ...................... 42
    D. Dolev, R. Strong
Atomic Broadcast in a Real-Time Environment .............................. 51
    F. Cristian, D. Dolev, R. Strong, H. Aghili
Randomized Agreement Protocols ........................................... 72
    M. Ben-Or
An Overview of Clock Synchronization ..................................... 84
    B. Simons, J. L. Welch, N. Lynch
Implementation Issues in Clock Synchronization ........................... 97
    M. Beck, T. K. Srikanth, S. Toueg

Systems Session I
Argus ................................................................... 108
    B. Liskov
TABS .................................................................... 115
    A. Z. Spector
Communication Support for Reliable Distributed Computing ................ 124
    K. P. Birman, T. A. Joseph
Algorithms and System Design in the Highly Available
Systems Project ......................................................... 138
    S. J. Finkelstein

Easy Impossibility Proofs for Distributed Consensus Problems ............ 147
    M. J. Fischer, N. A. Lynch, M. Merritt
An Efficient, Fault-Tolerant Protocol for Replicated
Data Management ......................................................... 171
    D. Skeen, A. El Abbadi, F. Cristian

Systems Session II
Arpanet Routing ......................................................... 192
    S. Cohn
On the Relationship Between the Atomic Commitment and
Consensus Problems ...................................................... 201
    V. Hadzilacos
The August System ....................................................... 209
    J. Wensley
The Sequoia System ...................................................... 217
    P. Bernstein
Fault Tolerance in Distributed UNIX ..................................... 224
    A. Borg, W. Blau, W. Oberle, W. Graetsch
Faults and Their Manifestation .......................................... 244
    D. P. Siewiorek
The "Engineering" of Fault-Tolerant Distributed Computing Systems ....... 262
    Ö. Babaoğlu
Bibliography for Fault-Tolerant Distributed Computing ................... 274
    B. A. Coan
A Theoretician's View of Fault Tolerant Distributed Computing*

Michael J. Fischer
Department of Computer Science
Yale University
Box 2158 Yale Station
New Haven, CT 06517
1 Introduction
Distributed computing systems have been objects of study ever since computers began to communicate with each other, and achieving reliability has always been a major problem of such systems. The theory of distributed computing is a relative newcomer, both to the field of distributed computing and to the general area of theoretical computer science. This paper is intended as a non-technical introduction to the rôle played by theory in the study of fault tolerant distributed computing systems. Rather than focus on particular accomplishments of theory, we will try to illustrate the theoretical approach and to point out its strengths and weaknesses as a paradigm for understanding distributed systems.
2 The Theoretical Paradigm
Theory could be called the science of asking (and answering) precise questions. A practitioner might be satisfied at finding a system that performs "well". A theoretician would want to know precisely how well. The two would also differ on what they considered acceptable evidence that a system performed as advertised. The practitioner is likely to find the observed behavior of the system under practical operation most compelling. A theoretician would find such evidence unconvincing since not all of the relevant variables could be controlled or understood, so the true causes of the observed behavior would remain in doubt. He would prefer instead a mathematical proof.

*This work was supported in part by the National Science Foundation under grant DCR-8405478 and by the Office of Naval Research under Contract N00014-82-K-0154.
The power of theory is that it forces one to think clearly about the problem one is trying to solve. Until a question can be stated precisely, there is no theoretical question. The process of stating the question leads one to identify relevant variables, state explicitly any assumptions being made, and so forth. These very factors are often instrumental in leading one to a solution, and identifying them forces one to pay attention to them.

A common criticism of theory is that it is too abstract. It deals with simplified 'ideal worlds' and ignores many complicating aspects of the 'real world'. Abstraction, however, is the real power of theory, for by ignoring the irrelevant details, one focuses attention on the relevant properties, enabling them to be understood to far greater depth than would otherwise be possible. Abstraction also produces generality, for results that depend on fewer assumptions are more widely applicable. Indeed, abstraction lies at the heart of scientific understanding, for to have an understanding of something is to have abstracted the essential features in a way that they can be applied to a variety of related situations.

Of course, the insight gained from studying a particular abstraction is only as good as the abstraction itself. The results are only applicable in practice if the details ignored by the abstraction really are irrelevant. Determining whether that is so is inherently a non-theoretical question which can only be answered like any other practical question--by experimentation and observation. Nevertheless, the power of the theoretical approach is that it confines the practical problem to that of validating the abstraction. Once that has been done, the theoretical results can be applied with confidence.
3 Constructing a Formal Theory
A theory of distributed computing involves three parts: a formal model, a formal problem statement, and a complexity measure for comparing one solution with another.
3.1 Formal Models
A formal model consists of a collection of set-theoretic objects that represent the various elements of the system being modeled. For example, if a distributed system consists of processes, communication links, algorithms, fault assumptions, message assumptions, and so forth, then the model will have corresponding formal elements. Just as a Turing machine can be defined, abstractly, as a set of quintuples, so can a distributed system be defined as an appropriate tuple.

The purpose of a formal model is to insure completeness of the specification. By saying that a formal model is a particular set-theoretic object, we have reduced the problem of specifying an abstract system to that of specifying a particular mathematical object, a problem for which good mathematical techniques have been developed over the years. Understanding the implications of a particular formal model may not be easy, and that is the work of the theoretician, but at least there is little room for misunderstanding about what the formal model is.
3.2 Formal Problem Statements
Along with a formal statement of the computational model, one needs a formal statement of the problem to be solved. Separating the problem from its solution is one of the real contributions of the theoretical approach, for it opens the door to alternative solutions. Often the problem to be solved is very basic and fundamental: choose a leader from among a collection of identical processes, reach an agreement, carry out an election, and so forth. In these cases, the problem itself is abstract, and a solution to such a problem can often be used as a building block in solving more complicated problems.

But in many interesting cases, defining the problem precisely can be quite difficult, for it may not be at all clear exactly what one wants to achieve. Consider for example the problem of building a distributed name server to maintain a mapping between names and values (which might in practice be network addresses, user ids, mailboxes, and so forth). What properties should be required of the solution? One would like it to be reasonably efficient, highly available, reasonably robust against failures, remain reasonably current and consistent in the face of updates, and so forth, but to require perfection in any of these properties may make the problem impossible or prohibitively costly to solve. We have not one but many problems depending on the importance one attaches to the various properties, and a solution that is good in one respect might be very bad in another. The difficulty lies not in making a formal problem statement but in finding an abstract problem that reflects our intuition about the practical problem.
3.3 Complexity Measures
A complexity measure provides a means of measuring the goodness of the various possible solutions to a problem. Typical quantities to measure in a distributed system are the time to solve the problem, the total number of messages sent, the number of faults tolerated, and so forth.

As in sequential complexity theory, one can analyze either the worst case or the average case complexity. For the latter to be meaningful, one must know the underlying probability distributions on the choice of inputs, schedules, failure patterns, and so forth over which the "average" is being taken. In the absence of such information, one instead performs a worst case analysis, thus obtaining a pessimistic guarantee on the behavior of the system. Even when the distributions are known, the worst case behavior might be more important than the average case. For example, in a real-time system, response within a fixed amount of time might be required always, not just on the average.

In carrying out a worst case analysis, one often pretends that choices are being made by a malicious adversary who is trying to maximize the "cost" of the run. By definition, the worst-case complexity is the largest amount of the measured resource that the adversary can cause the system to use. However, this approach does not require that the choices actually be under the control of a malicious adversary, nor is it invalid in those situations where the choices are obviously made by non-malicious means such as by random coin flips. Rather, a worst-case analysis is exactly what the name implies--it tells one the worst that can possibly happen. Whether or not that is likely to happen in practice is a separate question that may well require further assumptions to answer.
4 Theory of Distributed Computing
Distributed computing is the study of distributed systems and what they compute. We are thus led to the question, "What is a distributed system?" The obvious answer, that it is a collection of communicating, concurrent processes, is too broad and doesn't distinguish distributed systems from parallel systems. Intuitively, a distributed system is a collection of geographically separated computers or nodes that communicate over a relatively low-speed network. However, from a theoretical point of view, this definition is unsatisfactory for a number of reasons: The physical geometry of the system is rarely relevant to the problems studied and is one of the first features to be discarded in building an abstract model. Detailed timing considerations are difficult to deal with in an abstract model and are often not relevant anyway when one is concerned only with what can or cannot be achieved. At a more abstract level, distributed systems and parallel computers seem to look quite similar, and one can reasonably ask if there really is any qualitative difference between the two.

Upon further examination, one sees that the characteristic features of distributed systems mentioned above do have a qualitative effect that we can identify--namely, they all lead to greater degrees of uncertainty. Geographical separation makes it more difficult to manage the individual nodes, leading to greater uncertainty about their status. Low-speed communication restricts the flow of information around the network, leading to greater uncertainty about the global state of the system. Typical communication channels are subject to various forms of unreliability, and one cannot realistically assume error-free communication. Thus, what distinguishes parallel computers from distributed systems is the degree of uncertainty to which they are subject and the extent to which one must take explicit account of such uncertainty.
In studying a parallel computer, it is often reasonable to assume that the machine is working correctly, that one has full control over the code to be run on each of the processors, that the interconnection topology is stable and known to all of the processors, and that communication is reliable and occurs within a predictable amount of time. In the study of distributed systems, one often cannot make such simplifying assumptions and still obtain realistic results. Individual nodes may be faulty, the network may lose or corrupt messages, one cannot always control what programs other nodes run, node speeds can vary wildly from one another, and so forth.

As a result of these differences, a major concern in the study of parallel computation is the performance of algorithms, whereas a major concern in the study of distributed computing is dealing with uncertainty. This is not to say that there is a clear dichotomy between the two disciplines but rather to identify two ends of a spectrum. Thus, a parallel system becomes more and more distributed as uncertainty factors become more and more significant in its operation. We now look in greater detail at some of the sources of uncertainty in distributed computing systems.
4.1 System Configuration
In a distributed system, the eventual system configuration might not be known when the processes are designed, so the goal is to design processes that will work when embedded in a wide range of systems. Of course, some assumptions must be made about the rest of the system in order to say anything meaningful about system behavior. We look at some natural assumptions that have been considered.

A process might have only partial information about the global structure of the system--how many processes there are altogether and how they are interconnected. For example, a process might know that the processes are connected together in a ring but not know the size of the ring, or it might know that the interconnection graph is connected but not know the topology, or it might know that the graph is fully connected but not know the identities of the processes attached to its ports. One can deal with such uncertainties either by finding algorithms to obtain the unknown information or by finding ways to accomplish a given task despite the uncertainty.

A process might not know what behavior to expect from another process because it does not know the program being run by that process. In analyzing such a system, one thinks of the other process as a malicious adversary who does his best to disrupt the system. However, as in any worst-case analysis, this does not imply that the unknown process must possess some malevolent intelligence; only that in the absence of information to the contrary, one must assume the worst. This is the assumption that underlies much of the work in cryptographic protocols.
4.2 Faults
A fault is an event that causes "incorrect" operation of a system component. Whether a particular behavior is considered a fault or simply normal behavior is rather arbitrary and depends on one's expectations. For example, failure to obtain exclusive access to a shared resource is expected if the resource is already in use, but it might be considered a fault if the resource is idle. As with uncertainty in system configuration, one must make some assumptions
about the kinds of faults that can occur in order to say anything meaningful about the system's overall behavior. These assumptions result in a fault model that describes the kinds of faults that are anticipated, and a system that operates in the presence of such faults is called fault tolerant. In the formal system, these are the only kinds of faults that can occur. In practice, other kinds of faults might occur, but if they do, the formal results do not apply.

Unlike the uncertainty due to system configuration discussed above, faults are often assumed to be benign, random events controlled by nature. Nevertheless, it is difficult to come up with plausible assumptions that govern such faults, and one generally treats faults as being controlled by an adversary (though perhaps a restricted one). Thus, both unpredictability of processes and failures of processes are treated the same by the theory, even though they arise from very different considerations.

Many kinds of communication faults can be considered. Depending on the structure of the underlying communication system, messages may be delayed, lost, duplicated, reordered, or corrupted. For example, a simple point-to-point link may inject random noise into the data it carries, leading to lost or corrupted messages, but messages, if delivered correctly, are delivered only once and in the order sent. However, with more elaborate store-and-forward message systems, other behaviors are possible: messages may not necessarily arrive in the order sent, and multiple copies of the same message, possibly traveling along different paths, can be delivered to the receiver.

Adding a link protocol to the communication system can radically alter its external behavior, and the level at which one models a system depends on one's goals. For example, a link protocol implementing checksums and message retransmission can make the eventual delivery of corrupted messages highly unlikely at the expense of increased communication delay. Assuming reliable but slow message delivery might be reasonable when studying protocols built on top of such a link protocol but would not be reasonable when studying link protocols themselves.
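The kind of link protocol just described can be illustrated with a small simulation. The sketch below is not drawn from any chapter in this volume; the function name, loss and corruption rates, and stop-and-wait discipline are all illustrative assumptions. Each frame carries a sequence bit and a CRC, and the sender simply retransmits until the frame gets through:

```python
import random
import zlib

def send_over_lossy_link(payloads, loss=0.3, corrupt=0.2, rng=None):
    """Stop-and-wait sketch: each frame carries a sequence bit and a CRC.
    The channel randomly drops or corrupts frames; the receiver discards
    frames whose CRC does not match, so corrupted data is never delivered,
    and the sender retransmits until the frame is acknowledged."""
    rng = rng or random.Random(42)
    delivered, transmissions = [], 0
    expected = 0                         # receiver's next sequence bit
    for seq, data in enumerate(payloads):
        bit = seq % 2
        while True:                      # retransmit until acknowledged
            transmissions += 1
            fbit, fdata, fcrc = bit, data, zlib.crc32(data)
            if rng.random() < loss:
                continue                 # frame lost in transit
            if rng.random() < corrupt:   # noise flips one bit of the payload
                fdata = bytes([fdata[0] ^ 0x01]) + fdata[1:]
            if zlib.crc32(fdata) != fcrc:
                continue                 # checksum mismatch: receiver drops it
            if fbit == expected:         # fresh frame: deliver exactly once
                delivered.append(fdata)
                expected ^= 1
            break                        # ack returns (assumed reliable here)
    return delivered, transmissions

msgs = [b"alpha", b"beta", b"gamma"]
out, sent = send_over_lossy_link(msgs)
assert out == msgs        # delivery is reliable, ordered, and exactly-once
assert sent >= len(msgs)  # ...paid for with retransmissions, i.e. delay
```

The trade-off Fischer mentions is visible directly: lowering the loss and corruption rates brings the number of transmissions toward the number of messages, while raising them increases delay but never causes corrupted or duplicated delivery. A fuller model would let acknowledgements be lost as well, which is what the duplicate-discarding sequence bit is for.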
4.3 Timing Considerations
Time can be a major source of uncertainty in distributed systems. Unless processes operate off of the same clock, they will not proceed at exactly the same speed, and it is difficult to make reasonable assumptions about their relative speeds. For example, a process being run on a time-shared computer might be interrupted at an arbitrary point of its execution and suspended for a long and variable length of time.

Quite separate from the question of whether all processes proceed at the same rate is whether or not clocks are available. Even if processes do not proceed in lockstep, the ability to read a common clock might be of considerable use in coordinating their activities. Lacking a common global clock, processes might have available local clocks that are accurate within known bounds. Nevertheless, keeping clocks synchronized is not easy in practice, and one often looks for algorithms that make no assumptions about time, thus leading one to consider the fully asynchronous model. It isn't so much that one really believes that every behavior is possible, but only that one doesn't know where to draw the line between those that are possible and those that are not. The fully asynchronous model is the common denominator with respect to timing assumptions.
5 Sample Results
This whole discussion has itself been pretty abstract. To give a flavor of the kinds of theoretical results that have been obtained, we look at the reliable broadcast problem, sometimes known as the Byzantine Generals problem.
5.1 The Reliable Broadcast Problem
The model consists of a fixed number n of processes in a completely connected, reliable network. At most f of the processes are assumed to fail during an execution of the protocol, but which ones are faulty is not known (or even necessarily determined) in advance. The system may be synchronous or asynchronous. Faulty processes are constrained according to a particular fault model:

Failstop: A faulty process ceases operation and other processes are notified of the fault.

Crash: A faulty process ceases operation but other processes are not notified.

Omission: A faulty process omits sending some of its messages but otherwise continues to operate correctly.

Byzantine: A faulty process may continue to operate but can send arbitrary messages.

One process, called the sender, has an initial binary input value which it wants to broadcast to the other processes. The problem is to find a fault-tolerant protocol such that each reliable process decides on a value satisfying the following two conditions:

Validity: If the sender is nonfaulty, then each non-faulty process decides on the sender's value.

Agreement: Whether or not the sender is faulty, all non-faulty processes decide on the same value.

Here are just a few of the many results that have been obtained on this problem:

1. With Byzantine faults, there is a deterministic synchronous solution iff f < n/3 [PSL80].
2. With crash faults, every synchronous deterministic solution requires at least f + 1 rounds of communication [DS82].

3. No asynchronous deterministic solution can tolerate even a single crash fault [FLP85].

4. With Byzantine faults, there are both synchronous and asynchronous randomized solutions that use an expected constant number of rounds [FM88].
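To make the flavor of these results concrete, here is a small simulation of the classical f+1-round flooding protocol for crash faults, the setting of result 2. This is an illustrative sketch, not code from the text; the process counts, crash schedules, and default value are invented assumptions:

```python
import random

DEFAULT = 0   # value decided when nothing is heard from the sender

def reliable_broadcast(n, f, value, crash_round, rng=None):
    """f+1-round flooding under crash faults.  Process 0 is the sender.
    crash_round maps a faulty pid to the round in which it crashes; in
    that round its message reaches only a random subset of the others."""
    rng = rng or random.Random(0)
    seen = [set() for _ in range(n)]     # values each process has received
    seen[0].add(value)                   # the sender's initial value
    alive = set(range(n))
    for rnd in range(1, f + 2):          # rounds 1 .. f+1
        inbox = [set() for _ in range(n)]
        for p in sorted(alive):
            peers = [q for q in range(n) if q != p]
            if crash_round.get(p) == rnd:   # crashes mid-round: its message
                rng.shuffle(peers)          # reaches only a prefix of peers
                peers = peers[:rng.randrange(n)]
                alive.discard(p)
            for q in peers:
                inbox[q] |= seen[p]         # relay everything seen so far
        for q in range(n):
            seen[q] |= inbox[q]
    # decide: the sender's value if one was heard, otherwise the default
    return {p: (min(seen[p]) if seen[p] else DEFAULT) for p in sorted(alive)}

decisions = reliable_broadcast(n=5, f=2, value=1, crash_round={0: 1, 3: 2})
assert len(set(decisions.values())) == 1   # Agreement among survivors
```

The argument behind the protocol is the one that makes result 2's bound tight: with f+1 rounds and at most f crashes, some round is crash-free; in that round every surviving process relays everything it has seen to everyone else, after which all surviving processes hold the same set and therefore decide the same value.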
5.2 Insights Gained
What do such results tell us about practical distributed computing systems?

First off, even simple-sounding problems may be much more difficult than they appear at first sight, and great caution is called for. The reliable broadcast problem--to agree on a single bit--is one of the simplest coordination problems imaginable, yet solutions only exist under certain conditions, and even when they do, they can be quite costly.

Secondly, the conditions under which solutions are not possible set useful boundary conditions on the search for solutions. One does not need to look for 3-process solutions that tolerate a single Byzantine processor fault, for they do not exist. If someone purports to have such a solution, one knows that it is erroneous, even before the bug can be demonstrated.

Third, such results help guide the refinement process in designing a system. Knowing that agreement problems are costly suggests that they be avoided where possible. Perhaps weaker properties are enough.

Finally, the theory itself might indicate a solution to the very problems it poses. One can notice that the time lower bound in result 2 above applies only to deterministic systems. Removing the determinism restriction led to the much more efficient randomized solutions of result 4.
6 Conclusion
The theory of distributed systems so far lacks the cohesiveness that can only come with further development. In the future, we hope to see work on a greater variety of problems and on better models with the goal of eventually obtaining greater generality, sharper results, and better insight.
Acknowledgements: We are grateful to Lenore Zuck and Barbara Simons for many helpful comments and suggestions.
References

[DS82] D. Dolev and H. R. Strong. Polynomial algorithms for multiple processor agreement. In Proc. 14th ACM Symposium on Theory of Computing, pages 401-407, 1982.

[FLP85] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374-382, April 1985.

[FM88] P. Feldman and S. Micali. Optimal algorithms for Byzantine agreement. In Proc. 20th ACM Symposium on Theory of Computing, pages 148-161, 1988.

[PSL80] M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. Journal of the ACM, 27(2):228-234, 1980.
A Comparison of the Byzantine Agreement Problem and the Transaction Commit Problem
Jim Gray
Tandem Computers, Cupertino, CA, USA
Abstract: Transaction Commit algorithms and Byzantine Agreement algorithms solve the problem of multiple processes reaching agreement in the presence of process and message failures. This paper summarizes the computation and fault models of these two kinds of agreement and discusses the differences between them. In particular, it explains that Byzantine Agreement is rarely used in practice because it involves significantly more hardware and messages, yet does not give predictable behavior if there are more than a few faults.

1. Introduction
The Workshop on Fault Tolerant Distributed Computing met at Asilomar, California on March 16-19, 1986. It brought together practitioners and theorists. The theorists seemed primarily concerned with variations of the Byzantine Generals problem. The practitioners seemed primarily concerned with techniques for building fault-tolerant and distributed systems. Prior to the conference, it was widely believed that the Transaction Commit Problem faced by distributed systems is a degenerate form of the Byzantine Generals Problem. One useful consequence of the conference was to show that these two problems have little in common.
2. The Transaction Commit Problem
The Transaction Commit Problem was first solved by Niko Garzado, who noticed and solved it while working for IBM on a distributed system for the Italian Social Security Department in 1971. The problem was folklore for several years. Five years later, descriptions and solutions began to appear in the open literature [2], [5], [7]. Solutions to the Commit Problem make a collection of actions atomic -- either all happen or none happens. Atomicity is easy if all goes well, but the Commit Problem requires atomicity even if there are failures. Today, commit algorithms are a key element of most transaction processing systems. Maintenance of replicated objects (all copies must be the same) and maintenance of consistency within a system (a mail message must arrive at one node if it leaves a second one) require atomic agreement among computers.
To state the Commit Problem more precisely, one needs a model of computation and a model of failures. Lampson and Sturgis formulated a simple and elegant model now widely embraced by practitioners [5]. It is impossible to do the model justice here; their paper is well worth reading. In outline:

The Lampson-Sturgis computation model consists of:
• storage, which may be read and written.
• processes, which execute programs composed of three kinds of actions:
  • change process state,
  • send or receive a message, and
  • read or write storage.
Processes run at arbitrary speed, but eventually make progress.

The Lampson-Sturgis fault model postulates that:
• Storage writes may fail, or may corrupt another piece of storage. Such faults are rare, but when they happen, a subsequent read can detect the fault.
• Storage may spontaneously decay, but such faults are rare. In particular, a pair of storage units will not both decay within the repair time for one storage unit. When decayed storage is read, the reader can detect the corruption. These are the fault assumptions of duplexed discs.
• Processes may lose state and be reset to a null state, but such faults are rare and detectable.
• Messages may be delayed, corrupted, or lost, but such faults are rare. Corrupted messages are detectably corrupted.

Based on these computation and fault models, Lampson and Sturgis showed how to build single-fault tolerant stable storage which does not decay (duplexed discs), and stable processes which do not reset (process pairs). This single-fault tolerance is based on the assumptions of rare faults and eventual progress (Mean Time To Repair is orders of magnitude less than Mean Time To Failure).
The cost model implicit in the Lampson-Sturgis computation model is:
• Computation cost is proportional to the number of storage accesses and messages.
• Time cost, i.e. delay, is proportional to the number of serialized storage accesses plus serialized messages.
Good algorithms have low cost and low delay in the average case. In addition they tolerate arbitrarily many lost messages, process resets, and storage decays.
Given this computation and fault model, the Commit Problem may be stated as:

The Commit Problem: Given N stable processes, each with the state diagram:
An algorithm solves the Commit Problem if it forces ALL processes to the COMMITTED state or ALL to the ABORTED state, depending on the input to the algorithm.

The Commit Problem is easily solved; a stable process just sends the decision to each process, and keeps resending it until the process changes state and acknowledges. Because messages may be lost and processes may run slowly, there is no limit on how long the algorithm may take. But eventually all processes will agree and the algorithm will terminate. The expected cost is proportional to 2N messages plus N stable (i.e. single-fault tolerant) writes. Assuming constant service times, the expected delay is two message delays plus one stable write delay. There is a more general version of the Commit Problem called the Two-Phase Commit Problem. It allows an ACTIVE process to unilaterally abort its part of the transaction, by entering the ABORTED state, and consequently aborting the whole transaction. To allow consensus among the processes, a PREPARED state is introduced. After entering the PREPARED state an active process abdicates its right to unilateral abort. By contrast, the One-Phase Commit Problem does not allow processes to unilaterally abort. Unilateral abort is very convenient; the CANCEL keys of most user interfaces are examples of unilateral abort. The Two-Phase Commit Problem may be stated as:

The Two-Phase Commit Problem: Given N stable processes, each with the state diagram:
An algorithm solves the Two-Phase Commit Problem if it forces ALL the processes to the COMMITTED state or ALL to the ABORTED state, depending on the input to the algorithm and on whether any process has already spontaneously made the ACTIVE to ABORTED transition.
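The one-phase algorithm described above -- a stable process resends the decision until every participant acknowledges -- can be sketched in Python. The function names and the simulated lossy channel are our own illustrative assumptions, not part of the Lampson-Sturgis model:

```python
import random

def one_phase_commit(decision, participants, send):
    """Drive every participant to `decision` ('COMMIT' or 'ABORT').

    `send(p, decision)` returns True if participant p received the
    message and acknowledged, False if the message was lost.  The
    coordinator resends until all have acknowledged, so the run may
    take unboundedly long, but it terminates if loss is intermittent.
    """
    pending = set(participants)
    messages = 0
    while pending:                       # keep resending to laggards
        for p in list(pending):
            messages += 1
            if send(p, decision):        # ack received: p changed state
                pending.discard(p)
    return messages

# A lossy channel that drops 30% of messages at random.
state = {}
def lossy_send(p, decision):
    if random.random() < 0.3:
        return False                     # message lost; will be resent
    state[p] = decision                  # p records the decision (stable write)
    return True                          # acknowledgement

random.seed(1)
msgs = one_phase_commit("COMMIT", ["p1", "p2", "p3"], lossy_send)
print(state)                             # every participant reaches COMMIT
```

With no losses the cost is the expected 2N messages (N sends plus N acknowledgements); with losses the loop simply runs longer, matching the unbounded-time property of the problem.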
Algorithms for two-phase commit are fancier and costlier than the one-phase algorithms described first. Needless to say, three-phase algorithms, so-called non-blocking commit protocols, have been invented and implemented, but it has stopped there. All of the Commit Problems have the following properties:
• All processes are assumed to correctly obey the state machine and commit/abort messages.
• There may be arbitrarily many storage failures, process failures, and message failures.
• Eventually, all processes agree.

3. The Byzantine Generals Problem

The Byzantine Generals Problem grew out of attempts to build a fault-tolerant computer to control an unstable aircraft. The goal was to find a design that could tolerate arbitrary faults in the computer. The designers assumed that they could verify the computer software, so the only remaining problem was faulty hardware. Faulty computer hardware can execute even correct programs in crazy ways. So, the designers postulated that most computers functioned correctly, but some functioned in the most malicious way conceivable. Thus, the Byzantine Generals Problem was formulated [6], [4]. The computation and cost models of the Byzantine Generals Problem are similar to the Lampson-Sturgis model for processes and messages (storage is not explicitly modeled). Processes communicate with one another via messages. The Byzantine fault model is quite different from the Lampson-Sturgis fault model. The Byzantine Generals Problem assumes that some processors are good and some are faulty. The faulty ones can forge messages, delay messages sent via them, send conflicting or contradictory messages, and masquerade as others. If a message from a good process is lost or damaged, then the good process is treated as a bad one. The Lampson-Sturgis model assumes processes execute correctly or detectably fail, and that messages are delivered, detectably corrupted, or lost. Forgeries or undetected corruption are defined as "impossible" (i.e.
very rare squared) by Lampson and Sturgis; actually, they just define such an event as a catastrophe. The Byzantine Generals Problem is intended for process control applications which must respond within a fixed time in order to fly the airplane or control the reactor. So, the problem statement insists that any solution must have a bounded execution time. This in turn implies a bounded number of messages and faults (fault rates are finite). These bounded-time, bounded-fault assumptions are the key distinction between the Commit and Byzantine problems.
The Byzantine Generals Problem is defined as:

The Byzantine Generals Problem: Given N generals (processes), each with the state diagram:
An algorithm solves the Byzantine Generals Problem if it gets ALL the good generals to agree YES or ALL to agree NO within a bounded time.

The gist of the theory on solutions to the Byzantine Generals Problem is:
• If at least 1/3 of the generals are bad, then the good generals cannot reliably agree [6].
• If fewer than 1/3 of the generals are bad, then there are many algorithms.
• Solutions have polynomial cost (e.g. ≈N² messages) and, assuming constant service time for a broadcast, have constant delay (typically ≈N³).

4. Comparing the Problems

The Commit and the Byzantine Generals problems are identical in the fault-free case -- this is hinted at by the similarity in their state transition diagrams. In the fault-free case all the participants must agree. In the typical case (no faults), Commit algorithms send many fewer messages than the Byzantine Generals algorithms because they need not guard against ambiguity or forgery. Fundamental differences between the problems and their solutions emerge when there are faults. The basic differences are:
• Commit protocols tolerate finitely many faults. Byzantine protocols tolerate at most N/3 faults.
• ALL processors agree in the Commit case. SOME processors agree in the Byzantine case.
• Commit algorithms are fail-fast. They give either a common answer or no answer. Byzantine algorithms give random answers without warning if the fault threshold is exceeded.
• Commit agreement may require unbounded time. Byzantine agreement terminates in bounded time.
• Commit algorithms require no extra processors and few extra messages. Byzantine algorithms require many messages and processors.
The following examples show that neither of these problems is especially realistic.

Byzantine ATMs: Consider an Automated Teller Machine (ATM) doing a debit of a bank account. Three computers storing the account plus the ATM give the requisite four processors needed for Byzantine agreement. If the phone line to the ATM fails, then the three computers quickly agree to debit the account but the ATM refuses to dispense the money. This is Byzantine agreement.
Commit ATMs: If the same ATM is controlled by a single computer and they obey a commit protocol, then they will eventually agree (debit plus dispense, or no debit and no dispense). But the customer may have to wait at the ATM in the PREPARED state for many days before a faulty phone line is fixed and the money is dispensed. That is commitment!
Neither of these solutions is particularly nice. Most "real" systems focus on single-fault tolerance because single faults are rare and double faults are very rare (rare squared). Commit algorithms are geared to this single-fault view. They give single-fault tolerance by duplexing fail-fast modules [3]. Byzantine algorithms require at least four-plexed modules to tolerate a single fault. This high processor cost and related high message cost is the probable reason no commercial system uses Byzantine algorithms. Look at the two pictures above and judge for yourself. The Lampson-Sturgis model distinguishes between message failure and process failure because long-haul messages typically are corrupted once an hour while processors typically fail once a month. This is an important practical distinction -- especially since Byzantine
messages outnumber processors by a polynomial multiplier. The Byzantine fault model typically equates message failures with process failures. As the number of nodes grows, the number of message failures grows polynomially and produces a system much less reliable than a single processor. Paradoxically, both Byzantine and Commit systems have worse mean-time-between-failures (MTBF) than a single processor system, but for different reasons and in different ways. Large Byzantine systems fail because of polynomial message failures [8], [1]. Commit systems fail because a single processor failure may introduce a delay for all (unless three-phase algorithms are used). Commit algorithms are fail-fast while Byzantine algorithms give an answer in any case. Each non-faulty processor executing a Byzantine Generals algorithm gives an answer within a fixed time -- ≈N³ message delays for a typical algorithm. If there are few faults, then all the non-faulty processors will give the same answer. If at least N/3 processors are faulty or if at least N/3 messages are damaged, then two "correct" processors may give different answers. By contrast, commit algorithms eventually get all the processes to give the same answer. This may take a very long time and many messages. No one wants to wait forever for the right answer. Unfortunately, there is no solution to this dilemma. If all processors must agree, and if there is no finite bound on the number of message or processor faults, then the processors must be prepared to wait an unbounded time. The salient properties of the two problems are summarized in the chart below. It shows that there is little overlap between the two problems.

                     Degree of Fault Tolerance
    Agreement     Limited Time & Errors   Unlimited Time & Errors
    -------------------------------------------------------------
    Some Agree    Byzantine               Inferior
    All Agree     Ideal                   Commit
Of course the ideal solution would combine the best of both: all agree within a fixed time limit. One can prove that there is no ideal solution [2]. Similarly, an algorithm which has no time limit and does not get universal agreement is uniformly inferior to both Byzantine agreement and Commitment, and so is not interesting.
In summary, based on these comparisons between the two problems, practitioners have embraced the Commit Problem over the Byzantine Generals Problem because it has an efficient solution in the single-fault case, gives correct answers in the multi-fault case, and has good no-fault performance.
5. Acknowledgments

Phil Garrett, Fritz Graf, Pat Helland, Pete Homan, Fritz Joern, and Barbara Simons made valuable comments on this paper.

6. References

[1] Babaoglu, O., "On the Reliability of Consensus-Based Fault-Tolerant Distributed Computer Systems", ACM TOCS, V. 5.4, 1987.
[2] Gray, J., "Notes on Database Operating Systems", Operating Systems: An Advanced Course, Lecture Notes in Computer Science, V. 60, Springer-Verlag, 1978.
[3] Gray, J., "Why Do Computers Stop, and What Can We Do About It?", Tandem TR 85.7, Tandem Computers, 1985.
[4] Lamport, L., Shostak, R., Pease, M., "The Byzantine Generals Problem", ACM TOPLAS, V. 4.3, 1982.
[5] Lampson, B.W., Sturgis, H., "Atomic Transactions", Distributed Systems: Architecture and Implementation, An Advanced Course, Lecture Notes in Computer Science, V. 105, Springer-Verlag, 1981.
[6] Pease, M., Shostak, R., Lamport, L., "Reaching Agreement in the Presence of Faults", Journal of the ACM, V. 27.2, 1980.
[7] Rosenkrantz, D.J., Stearns, R.D., Lewis, P.M., "System Level Concurrency Control for Database Systems", ACM TODS, V. 3.2, 1977.
[8] Tay, Y.C., "The Reliability of (k,n)-Resilient Distributed Systems", Proc. 4th Symposium on Reliability in Distributed Software and Database Systems, IEEE, 1984.
The State Machine Approach: A Tutorial*
Fred B. Schneider
Department of Computer Science, Cornell University, Ithaca, New York 14853
Abstract. The state machine approach is a general method for achieving fault tolerance and implementing decentralized control in distributed systems. This paper reviews the approach and identifies abstractions needed for coordinating ensembles of state machines. Implementations of these abstractions for two different failure models -- Byzantine and fail-stop -- are discussed. The state machine approach is illustrated by programming several examples. Optimization and system reconfiguration techniques are explained.
1. Introduction

The state machine approach is a general method for managing replication. It has broad applicability for implementing distributed and fault-tolerant systems. In fact, every protocol we know of that employs replication -- be it for masking failures or simply to facilitate cooperation without centralized control -- can be derived using the state machine approach. Although few of these protocols actually were obtained in this manner, viewing them in terms of state machines helps in understanding how and why they work. When the state machine approach is used for implementing fault tolerance, a computation is replicated on processors that are physically and electrically isolated. This permits the effects of failures to be masked by voting on the outputs produced by these independent replicas. Triple-modular redundancy (TMR) is a familiar example of this scheme. Although when the approach is used additional

*This material is based on work supported in part by the Office of Naval Research under contract N00014-86-K-0092, the National Science Foundation under Grant No. CCR-8701103, and Digital Equipment Corporation. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not reflect the views of these agencies.
coordination is necessary to distribute inputs and collect outputs from replicas, failures cannot increase task completion times. This makes the approach ideally suited for real-time systems, where deadlines must be met and timing is critical. Other approaches, such as failure detection and retry, are ill suited for real-time applications because unpredictable delays can be observed in response to a failure. The state machine approach permits separation of fault tolerance from other aspects of system design. The programmer is not forced to think in terms of a particular programming abstraction, such as transactions [Liskov 85] [Spector 85], fault-tolerant actions [Schlichting & Schneider 83], reliable objects [Birman 85], replicated remote procedure calls [Cooper 84], or the multitude of other proposals that have appeared in the literature. Instead, a programming abstraction suited for the application at hand can be defined and used; the state machine approach is employed to realize a fault-tolerant implementation of that abstraction. This paper is a tutorial on the state machine approach. It describes the approach and its implementation for two representative environments. Small examples suffice to illustrate the points; however, the approach has been successfully applied to larger examples. Section 2 describes how a system can be viewed in terms of a state machine, clients, and output devices. Measures of fault tolerance are discussed in section 3. Achieving fault tolerance is the subject of the following three sections: Section 4 discusses implementing fault-tolerant state machines; section 5 discusses tolerating faulty output devices; and section 6 discusses coping with faulty clients. An important class of optimizations -- based on the use of time -- is discussed in section 7. Optimizations possible by making assumptions about failures are discussed in section 8. Section 9 describes dynamic reconfiguration. Related work is discussed in section 10.

2. State Machines

A state machine consists of state variables, which implement its state, and commands, which
transform its state. Each command is implemented by a deterministic program; execution of the command is atomic with respect to other commands and modifies the state variables and/or produces some output. A client of the state machine makes a request to specify execution of a command. The request names a state machine, names the command to be performed, and contains any information needed by the command. Output from request processing can be to an actuator (e.g. in a process-control system), to some other peripheral device (e.g. a disk or terminal), or to clients awaiting responses from prior requests. The name "state machine" is a poor one, since it is suggestive of a finite-state automaton. Our state machines are more powerful than finite-state automata because they contain program variables and, therefore, need not be finite state. However, state machines intended for execution on real machines should be finite state because real computers have finite memories. Our state machines are also easier to specify than finite-state automata because any programming notation can be used. "State machine" is used in this paper for historical reasons -- it is the term used in the literature. State machines can be described by explicitly listing state variables and commands. As an example, state machine memory of Figure 2.1 implements a mapping from locations to values. A read command permits a client to determine the value associated with a location, and a write command associates a new value with a location. Observe that there is little difference between our state machine description of memory and an abstract data type or (software) module for such an object. This is deliberate -- it makes it clear that state machines are a general programming construct. In fact, a state machine can be implemented in a variety of ways. It can be implemented as a collection of procedures that share data,
memory: state_machine
    var store : array [0..n] of word

read: command(loc : 0..n)
    send store[loc] to client
end read;

write: command(loc : 0..n, value : word)
    store[loc] := value
end write
end memory
Figure 2.1. A memory
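The memory state machine of Figure 2.1 can be transliterated into an ordinary programming language; the Python rendering below is our own sketch (class and method names are illustrative), underscoring the point that a state machine is just state variables plus deterministic commands:

```python
class Memory:
    """State machine of Figure 2.1: a mapping from locations to values.

    Each command is a deterministic program; output (here, the read
    result) is returned to the requesting client.
    """
    def __init__(self, n):
        self.store = [0] * (n + 1)      # state variable: array [0..n] of word

    def read(self, loc):
        return self.store[loc]          # "send store[loc] to client"

    def write(self, loc, value):
        self.store[loc] = value         # modifies the state variable

mem = Memory(100)
mem.write(100, 16.2)
print(mem.read(100))   # 16.2
```

As the text notes, this is indistinguishable from an abstract data type or module; the same commands could equally be implemented as a message-handling process or as interrupt handlers.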
as in a module; it can be implemented as a process that awaits messages containing requests and performs the actions they specify; and it can be implemented as a collection of interrupt handlers, in which case a request is made by causing an interrupt. (Disabling interrupts permits each command to be executed to completion before the next is started.) For example, the state machine of Figure 2.2 implements commands to ensure that at all times at most one client has been granted access to some resource.¹ It would likely be implemented as a collection of interrupt handlers as part of the kernel of an operating system. Requests are processed by a state machine one at a time, in an order consistent with causality. Therefore, clients of a state machine can be programmed under the assumptions that

O1:
requests issued by a single client to a given state machine sm are processed by sm in the order they were issued, and
O2:
if the fact that request r was made to a state machine sm by client c could have caused a request r′ to be made by a client c′ to sm, then sm processes r before r′.
In this paper, for expository simplicity, client requests are specified as tuples of the form
(state_machine.command, arguments) and the return of results is done using message passing. For example, a client could execute
(memory.write, 100, 16.2); (memory.read, 100); receive v from memory

to set the value of location 100 to 16.2, request the value of location 100, and await that value, setting v to it upon receipt.

¹We use x∘y to denote the result of appending y to the end of list x.
mutex: state_machine
    var user : client_id init Φ;
        waiting : list of client_id init Φ

acquire: command
    if user = Φ → send OK to client; user := client
    □ user ≠ Φ → waiting := waiting ∘ client
    fi
end acquire

release: command
    if waiting = Φ → user := Φ
    □ waiting ≠ Φ → send OK to head(waiting);
                    user := head(waiting); waiting := tail(waiting)
    fi
end release
end mutex
Figure 2.2. A resource allocator
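The resource allocator of Figure 2.2 can likewise be sketched in Python. The `send` callback standing in for message delivery is our own assumption; in the figure, output goes to clients via messages:

```python
class Mutex:
    """State machine of Figure 2.2: grants a resource to at most one
    client at a time, queueing the rest in FIFO order."""
    def __init__(self, send):
        self.user = None                 # None plays the role of the empty value
        self.waiting = []                # list of waiting client ids
        self.send = send                 # stand-in for message delivery

    def acquire(self, client):
        if self.user is None:
            self.send(client, "OK")      # resource free: grant immediately
            self.user = client
        else:
            self.waiting.append(client)  # waiting := waiting ∘ client

    def release(self):
        if not self.waiting:
            self.user = None             # resource becomes free
        else:
            nxt = self.waiting.pop(0)    # head(waiting) ... tail(waiting)
            self.send(nxt, "OK")
            self.user = nxt

granted = []
m = Mutex(lambda c, msg: granted.append(c))
m.acquire("a"); m.acquire("b"); m.release(); m.release()
print(granted)   # ['a', 'b']
```

Note that each command runs to completion before the next starts, mirroring the interrupt-disabling implementation suggested in the text.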
The defining characteristic of a state machine is not its syntax, but that it specifies a deterministic computation that reads a stream of requests and processes each, occasionally producing output:

Semantic Characterization of a State Machine. Outputs of a state machine are completely determined by the sequence of requests it processes, independent of time and any other activity in a system.

Any program satisfying this definition will be considered a state machine for the purposes of this paper. For example, the following program solves a simple process-control problem in which an actuator is adjusted repeatedly based on the value of a sensor. Periodically, a client reads a sensor and communicates the value read to state machine pc:

monitor: process
    do true → val := sensor;
        (pc.adjust, val);
        delay D
    od
end monitor

State machine pc adjusts an actuator based on past adjustments saved in state variable q, the sensor reading, and a control function F.
pc: state_machine
    var q : real;

adjust: command(sensorval : real)
    q := F(q, sensorval);
    send q to actuator
end adjust
end pc

Although it is tempting to structure pc as a single command that loops -- reading from the sensor, evaluating F, and writing to actuator -- if the value of the sensor is time-varying, then the result would not satisfy the semantic characterization given above and therefore would not be a state machine. This is because values sent to actuator (the output of the state machine) would not depend solely on the requests made to the state machine but would, in addition, depend on the execution speed of the loop. In the structure used above, this problem has been avoided by moving the loop into monitor. Having to structure a system in terms of state machines and clients does not constitute a restriction. Anything that can be structured in terms of procedures and procedure calls can also be structured using state machines and clients -- a state machine implements the procedure, and requests implement the procedure calls. In fact, state machines permit more flexibility in system structure than is usually available with procedure calls. With state machines, a client making a request is not delayed until that request is processed, and the output of a request can be sent someplace other than to the client making the request. We have not yet encountered an application that could not be programmed cleanly in terms of state machines and clients.

3. Fault-Tolerance

A component is faulty once its behavior is no longer consistent with its specification. In this paper, we consider two representative classes of faulty behavior from a spectrum of possible ones:

Byzantine Failures. The component can exhibit arbitrary and malicious behavior, perhaps involving collusion with other faulty components [Lamport et al 82].

Fail-stop Failures. In response to a failure, the component changes to a state that permits other components to detect that a failure has occurred and then stops [Schneider 84].

Byzantine failures can be the most disruptive, and there is anecdotal evidence that such failures do occur in practice. Allowing Byzantine failures is the weakest possible assumption that could be made about the effects of a failure. Since a design based on assumptions about the behavior of faulty components runs the risk of failing if these assumptions are not satisfied, it is prudent that life-critical systems tolerate Byzantine failures. However, for most applications, it suffices to assume fail-stop failures. A system consisting of a set of distinct components is f fault-tolerant if it satisfies its specification provided that no more than f of those components become faulty during some interval of interest.² Fault tolerance traditionally has been specified in terms of MTBF (mean-time-between-failures), probability of failure over a given interval, and other statistical measures [Siewiorek & Swarz 82]. While it is clear that such characterizations are important to the users of a system, there are advantages in

²An f fault-tolerant system might continue to operate correctly if more than f failures occur, but correct operation cannot be guaranteed.
describing fault tolerance of a system in terms of the maximum number of component failures that can be tolerated over some interval of interest. Asserting that a system is f fault-tolerant makes explicit the assumptions required for correct operation; MTBF and other statistical measures do not. Moreover, f fault-tolerance is unrelated to the reliability of the components that make up the system and therefore is a measure of the fault tolerance supported by the system architecture, in contrast to fault tolerance achieved simply by using reliable components. Of course, MTBF and other statistical reliability measures of an f fault-tolerant system will depend on the reliability of the components used in constructing that system -- in particular, the probability that there will be f or more failures during the operating interval of interest. Thus, f should be chosen based on statistical measures of component reliability. Once f has been chosen, it is possible to derive MTBF and other statistical measures of reliability by computing the probabilities of various configurations of 0 through f failures and their consequences [Babaoglu 86].
4. Fault-Tolerant State Machines

An f fault-tolerant state machine can be implemented by replicating it and running a copy on each of the processors in a distributed system. Provided each copy being run by a non-faulty processor starts in the same initial state and executes the same requests in the same order, each will do the same thing and produce the same output. If we assume that each failure can affect at most one processor, hence one state machine copy, then by combining the output of the state machine copies in this ensemble, the output for an f fault-tolerant state machine can be obtained. When processors can experience Byzantine failures, an ensemble implementing an f fault-tolerant state machine must have at least 2f + 1 copies, and the output of the ensemble is the output produced by the majority of the state machine copies. This is because with 2f + 1 copies, the majority of the outputs remain correct even after as many as f failures. If processors experience only fail-stop failures, then an ensemble containing f + 1 copies suffices, and the output of the ensemble can be the output produced by any of its members. This is because only correct outputs are produced by fail-stop processors, and after f failures one non-faulty copy will remain among the f + 1 copies. Our scheme for implementing an f fault-tolerant state machine is based on fault-tolerant implementations of two abstractions.

Agreement. Every non-faulty copy of the state machine receives every request.

Order. Requests are processed in the same order by every non-faulty copy of the state machine.

However, knowledge of command semantics sometimes permits weaker (i.e., cheaper to implement) abstractions to be used. For example, when fail-stop processors are used, a request whose processing does not modify state variables need only be sent to a single non-faulty state machine copy, thus permitting relaxation of Agreement. It is also possible to exploit request semantics to relax Order.
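The two output-combining rules (majority of 2f + 1 copies for Byzantine failures, any one of f + 1 copies for fail-stop failures) can be sketched as follows. This is our own illustrative sketch of the combining step only, not a complete replication protocol:

```python
from collections import Counter

def combine_byzantine(outputs, f):
    """With an ensemble of 2f+1 replicas, take the majority output.
    It remains correct provided at most f replicas are faulty."""
    assert len(outputs) >= 2 * f + 1, "need at least 2f+1 replica outputs"
    value, count = Counter(outputs).most_common(1)[0]
    assert count >= f + 1, "no majority: fault assumption violated"
    return value

def combine_fail_stop(outputs, f):
    """With f+1 fail-stop replicas, any produced output is correct,
    and after f failures at least one replica still produces one."""
    assert len(outputs) >= 1, "all f+1 replicas failed: assumption violated"
    return outputs[0]

# Tolerating f=1 Byzantine failure requires 2f+1 = 3 replicas:
print(combine_byzantine(["yes", "yes", "garbage"], f=1))  # yes
# Tolerating f=1 fail-stop failure requires only f+1 = 2 replicas,
# and the single survivor's output suffices:
print(combine_fail_stop(["yes"], f=1))                    # yes
```

The contrast in required replica counts is exactly the cost difference the text attributes to the two failure models.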
Two requests r and r′ commute in a state machine if the sequence of outputs that would result from processing r followed by r′ is the same as would result from processing r′ followed by r. Not surprisingly, the schemes outlined above for combining outputs of the members of an ensemble work even when two requests that commute are processed in different orders by different state machine copies in an ensemble, thus permitting relaxation of Order. An example of a state machine where Order is not necessary appears in Figure 4.1. State machine tally determines the first from among a set of alternatives to receive at least MAJ votes and sends this choice to SYSTEM. If clients cannot vote more than once and the number of clients Cno satisfies 2MAJ > Cno, then every request commutes with every other. Thus, implementing Order would be
tally: state_machine
    var votes : array[candidate] of integer init 0

cast_vote: command(choice : candidate)
    votes[choice] := votes[choice] + 1;
    if votes[choice] ≥ MAJ → send choice to SYSTEM; halt
    □ votes[choice] < MAJ → skip
    fi
end cast_vote
end tally
Figure 4.1. Election
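The claim that cast_vote requests commute when no client votes twice and 2MAJ > Cno can be checked exhaustively on a small instance. The sketch below (our own) runs tally over every permutation of a fixed set of ballots and confirms the output is order-independent:

```python
from itertools import permutations

def run_tally(requests, maj):
    """Process cast_vote requests in the given order; return the first
    candidate to reach maj votes (the state machine then halts)."""
    votes = {}
    for choice in requests:
        votes[choice] = votes.get(choice, 0) + 1
        if votes[choice] >= maj:
            return choice               # "send choice to SYSTEM; halt"
    return None

# 5 clients (Cno = 5), MAJ = 3, so 2*MAJ > Cno holds.
ballots = ["A", "B", "A", "A", "B"]
outcomes = {run_tally(p, 3) for p in permutations(ballots)}
print(outcomes)   # {'A'} -- the same output for every processing order
```

Because at most one candidate can reach MAJ votes under these assumptions, every copy elects the same winner regardless of request order, which is why Order can be relaxed here.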
unnecessary -- different copies of the state machine will produce the same outputs even if they process requests in different orders. On the other hand, if clients can vote more than once or 2MAJ
IC1: All non-faulty processors agree on the same value.

IC2: If the transmitter is non-faulty, then all non-faulty processors use its value as the one they agree on.
If whenever a client makes a request, it employs a protocol satisfying IC1 and IC2 to disseminate that request to all copies of the state machine, then Agreement is achieved. Thus, we have:

Agreement Implementation. Clients disseminate requests using a protocol that establishes IC1 and IC2.

Notice that the Agreement Implementation does not require a client to be the transmitter. A client might send its request to a single copy of the state machine and let that copy serve as the transmitter in further distributing the request to the other members of the ensemble. This permits clients to be simpler. However, a request can be lost or corrupted if the client sends it directly to only a single copy of the state machine and that copy is being executed by a faulty processor. Protocols to establish IC1 and IC2 have received much attention in the literature. If digital signatures are available and processors can exhibit Byzantine failures, or if processors are restricted to fail-stop failures, then f + 1 processors are sufficient in order to tolerate f failures; otherwise 3f + 1 processors are necessary to tolerate f failures. See [Strong & Dolev 83] for a survey of protocols that can tolerate
Byzantine processor failures and [Schneider et al 84] for a (significantly cheaper) protocol that can tolerate (only) fail-stop processor failures.

4.2. Order

The Order abstraction can be implemented by having clients assign unique identifiers to requests and having state machine copies process requests according to a total ordering relation on these unique request identifiers. However, simply having each state machine copy process in ascending order the requests it has received does not imply that every state machine will process requests in the same order. Two requests could be delivered to one state machine in one order and to another state machine in the other order. We must devise a way to avoid this problem. We shall say that a request is stable at p once no request from a correct client and bearing a lower unique identifier can be subsequently delivered to the state machine copy at processor p. Given an implementation of stability, the following is an implementation for the Order abstraction:

Order Implementation. Stable requests are processed by a state machine in ascending order by unique identifier.

The choice of request identifiers is further constrained when clients of a state machine are programmed under the assumption that requests are processed in an order consistent with potential causality (i.e., O1 and O2 of section 2). Now, processing requests in ascending order by unique identifier must be in an order consistent with O1 and O2. One way to produce unique request identifiers satisfying O1 and O2 is to use logical clocks; a second way is to use approximately synchronized real-time clocks.

Using Logical Clocks
A logical clock [Lamport 78a] is a mapping Λ from events to the integers. Λ(E), the "time" assigned to an event E by logical clock Λ, is such that for any two distinct events E and F, either Λ(E) < Λ(F) or Λ(F) < Λ(E), and if E might be responsible for causing F then Λ(E) < Λ(F). It is a simple matter to implement logical clocks in a distributed system. Associated with each process p is a counter λp. A timestamp is included in each message sent by p; this timestamp is the value of λp when that message is sent. In addition, λp is changed according to:

CU1:
3.p is incremented after each event at p.
CU2:
Upon receipt of a message with timestamp x, process p resets kp: ~'e :=max(~p, x) + 1.
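Taken together, CU1 and CU2 keep λp running ahead of every timestamp the process has seen, which is what makes the causality property hold. The rules can be sketched as follows; this is an illustrative sketch rather than code from the chapter, and the class name LogicalClock and the (counter, processor-id) representation of unique identifiers are assumptions.

```python
class LogicalClock:
    """One processor's Lamport logical clock (sketch of CU1/CU2)."""

    def __init__(self, proc_id):
        self.proc_id = proc_id  # stands in for the fixed-length bit string
        self.counter = 0        # the counter called lambda-p in the text

    def tick(self):
        # CU1: increment the counter after each event at p.
        self.counter += 1

    def send(self):
        # A message carries the sender's counter value as its timestamp.
        self.tick()
        return self.counter

    def receive(self, timestamp):
        # CU2: on receipt, advance the counter past the message timestamp.
        self.counter = max(self.counter, timestamp) + 1

    def uid(self):
        # Unique identifier: counter value with the processor id appended,
        # compared lexicographically, so ties are broken by processor id.
        return (self.counter, self.proc_id)


p, q = LogicalClock(1), LogicalClock(2)
t = p.send()       # p sends a message with timestamp t
q.receive(t)       # q's receive event is ordered after p's send
uid_q = q.uid()
assert p.uid() < uid_q   # causality is reflected in the identifier order
```

A message send and its receipt thus always receive identifiers in causal order, while causally unrelated events are ordered arbitrarily but totally.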
The value of λ(E) for an event E that occurs at processor p is constructed by appending a fixed-length bit string that uniquely identifies p to the value of λp when E occurs. A logical clock can be used to implement a mapping from requests to unique identifiers with a total ordering that satisfies O1 and O2: λ(E) is used as the unique identifier associated with a request whose issuance corresponds to event E. Therefore, all that remains for an implementation of the Order abstraction is to formulate a test for stability. It is pointless to implement a stability test in an asynchronous system3 where Byzantine failures are

3A system is asynchronous if message delivery delay or the relative speeds of processors is unbounded; it is synchronous if delivery delay and relative processor speeds are bounded.
possible. This is because no deterministic protocol can achieve IC1 and IC2 under these conditions [Fischer et al 85], so it is impossible to implement Agreement.4 Since it is impossible to implement Agreement, there is no point in implementing Order. The case where processors are synchronous is equivalent to assuming that they have synchronized real-time clocks and will be considered shortly. This leaves the case where processors are asynchronous and can exhibit fail-stop failures. We now turn to that case.

By attaching sequence numbers to the messages between every pair of processors, it is trivial to ensure that the following property holds of communications channels.

FIFO Channels. Messages between a pair of processors are delivered in the order sent.

We also assume:

Failure Detection Assumption. A processor p detects that a fail-stop processor q has failed only after p has received the last message sent to p by q.

The Failure Detection Assumption is consistent with FIFO Channels, since the failure event for a fail-stop processor necessarily happens after the last message sent by that processor and, therefore, should be received after all other messages.

Using logical clocks to generate request identifiers implies that a request made by a client must have a larger unique identifier than was assigned to any previous request made by that client. Thus, assuming FIFO Channels, once a request from a client c is received by a copy smi of the state machine, no request from c with a smaller unique identifier can be received by smi. Moreover, if smi detects that c has failed, then no request from c with a smaller unique identifier can be received by smi, due to the nature of fail-stop failures and the Failure Detection Assumption. Combining these restrictions, we deduce that if c1, c2, ...,
cn are all the clients that have not failed and request rk, with unique request identifier uid(rk), denotes the last request received by smi from ck, then any request subsequently received by smi must have a unique identifier that is larger than min_uid, where

    min_uid = min{ uid(rk) : 1 ≤ k ≤ n }.
This means that every request with unique identifier at most min_uid must be stable. There is one remaining flaw in this scheme, however: a non-faulty client that does not make requests--for whatever reason--will prevent requests from becoming stable. This problem can be avoided by requiring that otherwise inactive clients periodically make null requests. Summarizing, we get:
Logical Clock Stability Test Tolerating Fail-stop Failures. Every client periodically makes some request to the state machine. A request is stable at smi if a request with larger timestamp has been received by smi from every client running on a non-faulty fail-stop processor.
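Given FIFO Channels and the Failure Detection Assumption, this test reduces to comparing each pending identifier against the smallest last-received identifier. A minimal sketch, with hypothetical names (stable_requests, last_uid_from) and identifiers represented as comparable (counter, client) tuples:

```python
def stable_requests(pending, last_uid_from):
    """Return pending requests that are stable at this state machine copy.

    pending       -- list of (uid, request) pairs received but not yet processed
    last_uid_from -- uid of the last request received from each client that is
                     running on a non-faulty processor

    Once every such client has sent a request with a larger uid, FIFO
    channels guarantee that no smaller uid can arrive later.
    """
    if not last_uid_from:
        return []
    min_uid = min(last_uid_from.values())
    # The Order implementation: process stable requests in ascending uid order.
    return sorted((u, r) for u, r in pending if u <= min_uid)


pending = [((3, 'a'), 'write x'), ((1, 'b'), 'read y'), ((9, 'a'), 'write z')]
last_uid_from = {'a': (9, 'a'), 'b': (4, 'b')}
out = stable_requests(pending, last_uid_from)   # ((9,'a'), ...) is not yet stable
```

Here client 'b' has only advanced to uid (4, 'b'), so the request with uid (9, 'a') must wait; the two smaller requests are stable and ordered.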
Synchronized Real-Time Clocks

A second way to produce unique request identifiers satisfying O1 and O2 is with approximately synchronized real-time clocks. Define Tp(E) to be the value of the real-time clock at processor p when

4The result of [Fischer et al 85] is actually stronger than this. It states that IC1 and IC2 cannot be achieved by a deterministic protocol in an asynchronous system with a single processor that fails in an even less restrictive manner--by simply halting.
event E occurs. We use Tp(E) followed by a fixed-length bit string that uniquely identifies p as the unique identifier associated with a request made as event E by a client running on processor p. To ensure that O1 and O2 (of section 2) hold for unique identifiers generated in this manner, two restrictions are required. O1 follows provided no client makes two or more requests between successive clock ticks. If processor clocks have resolution ρ, then each client can make at most one request every ρ seconds. O2 follows provided the degree of clock synchronization is better than the minimum message delivery time. If clocks on different processors are synchronized to within δ seconds, then it must take more than δ seconds for a message from one client to reach another; otherwise, O2 would be violated because a request r made by one client could have a unique identifier that was smaller than that of a request r' made by another, even though r was caused by a message sent after r' was made.

A number of protocols to achieve clock synchronization while tolerating Byzantine failures have been proposed; they are surveyed in [Schneider 86]. The protocols require that known bounds exist for the execution speed and clock rates of non-faulty processors and for message delivery delays along non-faulty communications links. These requirements do not constitute a restriction in practice. Clock synchronization achieved by the protocols is proportional to the variance in message delivery delay, making it possible to satisfy the restriction--necessary to ensure O2--that message delivery delay exceeds clock synchronization. A stability test can be implemented by exploiting synchronized real-time processor clocks and the bounds on delivery delays.
If requests are disseminated using a protocol employing a fixed number of rounds, like the ones cited above for establishing IC1 and IC2, then there will exist a constant Δ such that a request r with unique identifier uid(r) will be received by every correct processor no later than time uid(r)+Δ according to the local clock at the receiving processor.5 Thus, once the clock on a processor p reaches x, p cannot subsequently receive a request r such that uid(r) < x−Δ. Therefore, a stability test is:

Real-time Clock Stability Test Tolerating Byzantine Failures I. A request r is stable at a state machine copy smi being executed by processor p if the local clock at p reads x and uid(r) < x−Δ.

One disadvantage of this stability test is that it forces the state machine to lag behind its clients by Δ, where Δ is proportional to the worst-case message delivery delay. This disadvantage can be avoided. Due to property O1 of the total ordering on request identifiers, if communications channels satisfy FIFO Channels, then a state machine copy that has received a request r from a client c can subsequently receive from c only requests with unique identifiers greater than uid(r). Thus, a request r is also stable at a state machine copy provided a request with a larger unique identifier has been received from every client.

Real-time Clock Stability Test Tolerating Byzantine Failures II. A request r is stable at a state machine copy smi being executed by processor p if a request with a larger unique identifier has been received from every client.

This second stability test is foiled by a single faulty processor that refuses to make requests. However, by combining the first and second tests, so that a request is considered stable when it satisfies either test,

5In general, Δ will be a function of the variance in message delivery delay, the maximum message delivery delay, and the degree of clock synchronization. See [Cristian et al 85] for a detailed derivation of Δ in a variety of environments.
the result is a stability test that lags clients by Δ only when faulty processors or network delays force it to.
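The combined test can be sketched directly; the function name and argument layout are assumptions, with uid values and clock readings treated as numbers on a common scale:

```python
def is_stable(uid, local_clock, delta, last_uid_from):
    """Combined real-time-clock stability test (Tests I and II).

    Test I : uid < local_clock - delta -- the request is too old for any
             correct processor still to deliver a smaller one.
    Test II: every client has already sent a request with a larger uid
             (valid given FIFO Channels and property O1).
    Stability under either test suffices, so the state machine lags its
    clients by delta only when a faulty or silent client forces it to.
    """
    test1 = uid < local_clock - delta
    test2 = bool(last_uid_from) and all(u > uid for u in last_uid_from.values())
    return test1 or test2
```

For example, with delta = 10 a request with uid 15 is already stable at clock time 20 if every client has sent something with a larger uid; otherwise it must wait until the clock reads past 25.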
5. Tolerating Faulty Output Devices

When implementing an f fault-tolerant system, there are problems with using a single voter to combine the outputs of an ensemble of state machine copies into a single output. In particular, a single failure--of the voter--can prevent the system from producing the correct output. The solution to this problem depends on how the output of the state machine implemented by the ensemble is used.
Outputs Used Outside the System

If the output of the state machine is sent to an output device, then that device is already a single component whose failure cannot be tolerated. Thus, being able to tolerate a faulty voter is not sufficient--the system must also be able to tolerate a faulty output device. The usual solution to this problem is to replicate the output device and voter. Each voter combines the outputs of the state machine copies, producing a signal that drives one output device. Whatever reads the outputs of the system is assumed to combine the outputs of the replicated devices. This reader, which is not considered part of the computing system, implements the critical voter.

If output devices can exhibit Byzantine failures, then by taking the output produced by the majority of the devices, 2f+1-fold replication permits up to f faulty output devices to be tolerated. For example, a flap on an airplane wing might be designed so that when the 2f+1 actuators that control it do not agree, the flap always moves in the direction of the majority (rather than twisting, which would be a voter failure). If output devices exhibit only fail-stop failures, then only f+1-fold replication is necessary to tolerate f failures, because any output produced by a fail-stop output device can be assumed correct. For example, terminals usually present information with enough redundancy that they can be treated as fail-stop--failure detection is implemented by the viewer. With such an output device, a human user can look at one of the f+1 devices, decide whether the output is faulty, and only if it is faulty look at another, and so on.
Outputs Used Inside the System

If the output of the state machine is sent to a client, then the client itself can combine the outputs of the state machine copies in the ensemble. Here, the voter--a part of the client--is faulty exactly when the client is, so the fact that an incorrect output is read by the client due to a faulty voter is irrelevant. When Byzantine failures are possible, the client waits until it has received f+1 identical responses, each from a different member of the ensemble, and takes that as the response from the f fault-tolerant state machine. When only fail-stop failures are possible, the client waits until it has received the first response from any member of the ensemble and takes that as the response from the f fault-tolerant state machine.

When the client is executed on the same processor as one of the state machine copies and Byzantine failures can occur, an optimization of client-implemented voting is possible.6 Now, the local state machine copy is correct exactly when the client is. Therefore, the response produced by the state

6Care must be exercised when analyzing the fault tolerance of such a system, because a single processor failure can now cause two system components to fail. Implicit in most of our discussions is that system components fail independently. It is not always possible to transform an f fault-tolerant system in which clients and state machine copies have independent failures into one in which they share processors.
machine copy running locally can be used as that client's response from the f fault-tolerant state machine, and we have:

Dependent-Failures Output Optimization. If a client and a state machine copy run on the same processor, then even when Byzantine failures are possible, the client need not gather a majority of responses to its requests to the state machine. It can use the single response produced locally.
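The two client-side combination rules above can be sketched as one hypothetical helper (the name vote and the representation of responses are assumptions):

```python
from collections import Counter

def vote(responses, f, byzantine):
    """Combine state machine copies' responses at the client.

    With Byzantine failures, wait for f+1 identical responses, each from a
    different ensemble member; with fail-stop failures, the first response
    received is already correct.  Returns the agreed response, or None if
    more responses are still needed.
    """
    if not byzantine:
        return responses[0] if responses else None
    if not responses:
        return None
    answer, count = Counter(responses).most_common(1)[0]
    return answer if count >= f + 1 else None
```

With f = 1, two matching responses suffice in the Byzantine case, while a single response is accepted immediately in the fail-stop case.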
6. Tolerating Faulty Clients

Implementing an f fault-tolerant state machine is not sufficient for implementing an f fault-tolerant system. Faulty clients must not be able to make requests that cause the state machine to produce erroneous output or that corrupt the state machine so that subsequent requests from non-faulty clients are incorrectly processed. When a client is itself structured as a state machine or can be restructured as one, the approach of section 4 can be used to implement an f fault-tolerant client. Unfortunately, such restructuring is not always possible, and other, application-dependent, techniques must sometimes be employed.

6.1. Sensor Replication

A client c that obtains input from a time-varying input source can be restructured as a state machine sm(c) by isolating that time-varying input source and making it a client c' of sm(c). While sm(c) can be made f fault-tolerant (using the approach of section 4), this appears to bring us no closer to solving the original problem--the new client c' is still not fault tolerant and still obtains input from a time-varying input source. One solution to this dilemma is to restructure sm(c), obtaining a state machine sm'(c) that reads its input from multiple sensors and therefore does not depend on the correctness of any single sensor. To accomplish this, client c' and its sensor are each replicated; every copy of c' reads from a different sensor. Where sm(c) obtained a single value from the sensor, sm'(c) obtains values from the copies of c' and combines them. To summarize:

Fault-tolerant Sensor. Given a client c that reads from a time-varying input source, the input source is replicated and c is restructured as a state machine sm'(c) and a collection of clients. Each client reads from a different copy of the time-varying input source; sm'(c) reads from all the clients.
If a sensor can exhibit Byzantine failures, then it must be replicated 2f+1-fold; sm'(c) chooses the median value. This works because even when as many as f copies of the client (c') or sensor are faulty, the median of these 2f+1 values is guaranteed to be either a value from a correct sensor or bounded by values from correct sensors. If sensors exhibit only fail-stop failures, then f+1-fold replication suffices, and sm'(c) can choose any sensor value that is known not to be faulty.

It is possible to optimize the implementation of a fault-tolerant sensor when the same processors are being used both to run copies of the state machine of which c is a client and to run copies of sm'(c). Since the output of sm'(c) is destined for another state machine--say m--we could use the output produced by the single local copy of sm'(c) as input to m instead of combining the outputs produced by copies of sm'(c), as described in the Dependent-Failures Output Optimization of section 5. Moreover, we can merge the copy of sm'(c) and m, obtaining a single state machine. For example, when this scheme is applied to the process control system of section 2, monitor is c' and is replicated, and pc is the result of combining m and sm'(c). This is shown for the case where Byzantine failures are possible in Figure 6.1.
monitor[i]: process
    do true -> val := sensor[i];
               <pc.adjust, i, val>;
               delay D
    od
end monitor[i]

pc: state_machine
    var q : real;
        resp : set of client_id init ∅;
        val_rcd : array[1..2t+1] of real;
    adjust: command(cid, sensor_val)
        resp := resp ∪ {cid};
        val_rcd[cid] := sensor_val;
        if |resp| ≤ ⌈(2t+1)/2⌉ -> skip
        [] |resp| > ⌈(2t+1)/2⌉ -> q := F(q, median({val_rcd[c] : c ∈ resp}));
                                  resp := ∅;
                                  send q to actuator
        fi
    end adjust
end pc
Figure 6.1. Revised process control system
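The median rule used by pc in Figure 6.1 can be sketched in isolation (the function name and argument layout below are assumptions made for illustration):

```python
import statistics

def fused_reading(values, f, byzantine):
    """Combine replicated sensor readings (Fault-tolerant Sensor sketch).

    Byzantine sensors: with 2f+1 replicas, the median is either a correct
    reading or bounded by correct readings, even if f replicas lie.
    Fail-stop sensors: f+1 replicas suffice; any reading from a sensor not
    known to have failed can be used.
    """
    if byzantine:
        assert len(values) >= 2 * f + 1, "need 2f+1 replicas for f Byzantine faults"
        return statistics.median(values)
    assert values, "need at least one surviving fail-stop sensor"
    return values[0]


# One lying sensor out of 2f+1 = 3 cannot drag the median outside the
# range spanned by the two correct readings.
reading = fused_reading([20.1, 999.0, 19.8], f=1, byzantine=True)
```

Even the wildly wrong reading of 999.0 leaves the fused value between the two correct readings.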
6.2. Defensive Programming

Sometimes a client cannot be restructured as a state machine, and thus cannot be made f fault-tolerant using the approach just described. If it were possible for state machine copies to agree on the identities of faulty clients, then tolerating faulty clients would be simple--ignore requests from them. Unfortunately, this is not always possible: when Byzantine failures can occur, not all failures produce identifiable symptoms. Without restricting the possible failure modes, there is no way for a state machine to identify faulty clients. However, careful design of a state machine can limit the effects of faulty requests. For example, memory (Figure 2.1) permits any client to write to any location. Therefore, a faulty client can overwrite all locations, destroying valuable information in state variables. This problem could be prevented by restricting each client's write access to only certain memory locations--the state machine can enforce this.

Including tests in commands is another way to design a state machine that cannot be corrupted by requests from faulty clients. For example, mutex, as specified in Figure 2.2, will execute a release command made by any client--even one that does not have access to the resource. Consequently, a faulty client could issue such a request and cause mutex to grant a second client access to the resource before
the first has relinquished access. A better formulation of mutex ignores release commands from all but the client to which exclusive access has been granted. This is implemented by changing the release in mutex to:
release: command
    if user ≠ client -> skip
    [] waiting = ∅ ∧ user = client -> user := ∅
    [] waiting ≠ ∅ ∧ user = client -> send OK to head(waiting);
                                      user := head(waiting);
                                      waiting := tail(waiting)
    fi
end release

Sometimes, a faulty client not making a request can be just as catastrophic as one making an erroneous request. For example, if a client of mutex failed and stopped while it had exclusive access to the resource, then no client could be granted access to the resource. Of course, unless we are prepared to bound the length of time that a correctly functioning process can retain exclusive access to the resource, there is little we can do about this problem: there is no way for a state machine to distinguish between a client that has stopped executing because it has failed and one that is executing very slowly. However, given an upper bound B on the interval between an acquire and the following release, mutex can automatically schedule a release on behalf of a client. This is done by having the acquire command automatically schedule the release request. We introduce the notation

    schedule <REQUEST> for +x

to specify scheduling <REQUEST> with a unique identifier x greater than the identifier on the request being processed. Such a request is called a timeout request and becomes stable at some time in the future, according to the stability test being used for client-generated requests. Unlike requests from clients, requests that result from executing schedule need not be distributed to all state machine copies of the ensemble, because each state machine copy will independently schedule its own (identical) copy of the request. We can now modify acquire so that a release operation is automatically scheduled:7
7This means that mutex might process two release commands on behalf of a client: one from the client itself and one generated by its acquire request. The new state variable time_granted permits such superfluous commands to be ignored.
acquire: command
    if user = ∅ -> send OK to client;
                   user := client;
                   time_granted := NOW;
                   schedule <mutex.timeout, time_granted> for +B
    [] user ≠ ∅ -> waiting := waiting ∘ client
    fi
end acquire

timeout: command(when_granted : integer)
    if when_granted ≠ time_granted -> skip
    [] waiting = ∅ ∧ when_granted = time_granted -> user := ∅
    [] waiting ≠ ∅ ∧ when_granted = time_granted -> send OK to head(waiting);
                                                    user := head(waiting);
                                                    time_granted := NOW;
                                                    waiting := tail(waiting)
    fi
end timeout

7. Using Time to Make Requests

A client need not explicitly send a message to make a request. Not receiving a request can trigger execution of a command--in effect, allowing the passage of time to transmit a request from client to state machine [Lamport 84]. Transmitting a request using time instead of messages can be advantageous because protocols that implement IC1 and IC2 can be costly, both in total number of messages exchanged and in delay. Unfortunately, using time to transmit requests has only limited applicability, since the client cannot specify parameter values.

The use of time to transmit a request was employed in section 6 when we revised the acquire command of mutex to foil clients that failed to release the resource. There, a release request was automatically scheduled by acquire on behalf of a client being granted the resource. A client transmits a release request to mutex simply by permitting B (logical clock or real-time clock) time units to pass. It is only to increase utilization of the shared resource that a client might use messages to transmit a release request to mutex before B time units have passed.

A more dramatic example of using time to transmit a request is illustrated in connection with tally of Figure 4.1. Assume that
• all clients and state machine copies have (logical or real-time) clocks synchronized to within Γ, and

• the election starts at a time Strt that is known to all clients and state machine copies.

Using time, a client can cast a vote for a default by doing nothing; only when a client casts a vote different from its default do we require that it actually transmit a request message. Thus, we have:

Transmitting a Default Vote. If a client has not made a request by time Strt+Γ, then a request with that client's default vote has been made.

Notice that the default need not be fixed, nor even known at the time the vote is cast. For example, the default vote could be "choose the first client that votes for itself". In that case, only one client--one that
votes for itself--need actually use message transmission to cast its vote. The result is a state machine to implement an election in which only the winner actually does something.
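A tally that treats silence after time Strt+Γ as the default vote might be sketched as follows; the function name, the representation of votes, and the particular default used in the example are all assumptions made for illustration:

```python
def tally_votes(explicit_votes, electorate, now, strt, gamma, default):
    """Tally an election in which the passage of time transmits default votes.

    explicit_votes -- {client: vote} for clients that actually sent requests
    electorate     -- all clients entitled to vote
    default        -- function giving each silent client's default vote

    Before time strt + gamma, silence is not yet meaningful and the tally
    stays open (returns None); afterwards, every silent client is deemed
    to have voted its default.
    """
    if now < strt + gamma:
        return None
    return {c: explicit_votes.get(c, default(c)) for c in electorate}


# Only client 'a' transmits a message; 'b' and 'c' vote by doing nothing.
votes = tally_votes({'a': 'a'}, ['a', 'b', 'c'],
                    now=12, strt=0, gamma=10, default=lambda c: 'a')
```

Only the winner actually sends a message; everyone else's vote is inferred once the deadline Strt+Γ has passed.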
8. Exploiting Assumptions about Failures

Optimization of a state machine is frequently possible when assumptions can be made about the number and types of failures that can occur. The easiest assumption to make about failures is that they do not happen. Given a fault-free processor on which to execute a state machine, replication of the state machine becomes unnecessary, and the Agreement and Order abstractions have trivial implementations: a client simply sends its request to the single state machine copy. Of course, the assumption that failures do not happen is not very realistic.

More realistic assumptions about failures permit somewhat less dramatic optimizations. To illustrate these, we consider various solutions to the database commit problem. A commit protocol permits an update to be performed on all or no copy of a replicated database, according to a commit rule and information provided by the sites maintaining database copies. We can formulate a solution to the commit problem by using a state machine commit[tid] (see Figure 8.1) for each transaction tid and defining a client for each copy of the database. Each client involved in processing tid registers a suggested outcome--commit or abort--with commit[tid] and awaits a response. State machine commit[tid] runs forever, so that a client that has failed and restarted can ascertain the outcome of transactions it processed but neither committed nor aborted. In commit[tid], the commit rule is implemented by function
commit[tid]: state_machine
    var sugs : array[1..maxclients] of [commit, abort, undecided];
        wait_ans : set of client_id init ∅;
        outcome : [commit, abort, undecided] init undecided;
    status: command(c_sug : [commit, abort])
        sugs[client] := c_sug;
        if outcome = undecided -> outcome := Commit_Rule(sugs)
        [] outcome ≠ undecided -> skip
        fi;
        if outcome ≠ undecided -> send outcome to client;
                                  forall pid ∈ wait_ans: send outcome to pid
        [] outcome = undecided -> wait_ans := wait_ans ∪ {client}
        fi
    end status
end commit[tid]
Figure 8.1. Commit
Commit_Rule(sugs), which returns commit, abort, or undecided, based on the values in sugs. Note that it may also be necessary to employ timeout transitions in commit[tid] so that a faulty processor that does not register a suggested outcome cannot unconditionally delay the decision to commit or abort tid.

When commit[tid] is replicated and a copy is executed at each site running a client, the decentralized commit protocol of [Skeen 82] results. Other commit protocols that have appeared in the literature can be derived from commit[tid] by making assumptions about failures. For example, the 2-phase commit protocol described in [Gray 78] uses a single copy of the commit[tid] state machine and is based on two assumptions:

(1) The processor executing this state machine does not fail.

(2) Client failures are detectable (i.e., clients are fail-stop).
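The chapter leaves Commit_Rule abstract. One plausible instance, sketched here as an assumption for illustration, is a unanimity rule: abort as soon as any client suggests abort, commit once every client has suggested commit, and remain undecided otherwise.

```python
COMMIT, ABORT, UNDECIDED = 'commit', 'abort', 'undecided'

def commit_rule(sugs, clients):
    """A unanimity Commit_Rule over the registered suggestions.

    sugs    -- {client: suggestion} for clients that have registered so far
    clients -- all clients involved in processing the transaction
    """
    if any(sugs.get(c) == ABORT for c in clients):
        return ABORT
    if all(sugs.get(c) == COMMIT for c in clients):
        return COMMIT
    return UNDECIDED
```

Until every client has registered, the rule stays undecided, which is why timeout transitions are needed to keep a silent faulty client from delaying the decision forever.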
If the assumptions are violated, then the protocol may not work. In particular, if commit[tid] exhibits a fail-stop failure in the midst of sending outcome to clients in wait_ans, then (correct) clients might be unable to decide on an outcome, a phenomenon sometimes referred to as the "window of vulnerability" of this protocol; and if commit[tid] exhibits a Byzantine failure, then (correct) clients could receive conflicting information, causing some to commit the transaction and others to abort it. The 3-phase commit protocol of [Skeen 82] and the 4-phase protocol of [Hammer and Shipman 80] result when more than a single copy of the commit[tid] state machine is run. These protocols can tolerate fail-stop failures of processors running copies of commit[tid] and of clients. However, they do not employ sufficient replication to tolerate Byzantine failures.

9. Reconfiguration

An ensemble of state machine copies can tolerate more than f faults if it is possible to remove state machine copies running on faulty processors from the ensemble and add copies running on repaired processors. (A similar argument can be made for being able to add and remove copies of clients and output devices.) Let P(τ) be the total number of processors at time τ that are executing copies of some state machine of interest, and let F(τ) be the number of them that are faulty. In order for the ensemble to produce the correct output, we must have

Combining Condition: P(τ) − F(τ) > Enuf for all 0 ≤ τ,

where

    Enuf = P(τ)/2  if Byzantine failures are possible,
           0       if only fail-stop failures are possible.

A processor failure can cause the Combining Condition to be violated by increasing F(τ), thereby decreasing P(τ) − F(τ). When Byzantine failures are possible, if a faulty processor can be identified, then removing it from the ensemble decreases Enuf without further decreasing P(τ) − F(τ); this can prevent the Combining Condition from being violated. When only fail-stop failures are possible, increasing the number of non-faulty processors--by adding one that has been repaired--is the only way to prevent the Combining Condition from being violated, because increasing P(τ) is the only way to keep P(τ) − F(τ) > 0. Therefore, provided the following conditions hold, it may be possible to maintain the Combining Condition forever and thus tolerate an unbounded total number of faults over the life of the system.

F1:
If Byzantine failures are possible, then state machine copies being executed by faulty processors are identified and removed from the ensemble before the Combining
Condition is violated. F2:
State machine copies running on repaired processors are added to the ensemble so that the Combining Condition is not violated.
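The Combining Condition can be checked mechanically at any instant; the function below is a sketch (its name and the per-instant formulation are assumptions), with Enuf following the two cases given above:

```python
def combining_condition_holds(P, F, byzantine):
    """Check P(tau) - F(tau) > Enuf at one instant tau.

    Enuf is P/2 when Byzantine failures are possible (a majority of
    ensemble members must be correct) and 0 when only fail-stop failures
    are possible (a single correct copy suffices).
    """
    enuf = P / 2 if byzantine else 0
    return P - F > enuf
```

For example, an ensemble of 3 copies tolerates 1 Byzantine fault but not 2, whereas a fail-stop ensemble produces correct output as long as one correct copy survives.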
F1 and F2 constrain the rates at which failures and repairs occur.

Removing faulty processors from an ensemble of state machines can also improve system performance. This is because the number of messages that must be sent to achieve Agreement is usually proportional to the number of state machine copies that must agree on the contents of a request. In addition, some protocols that implement Agreement execute in time proportional to the number of processors that are faulty. Removing faulty processors clearly reduces both the message complexity and the time complexity of such protocols.

Adding or removing a client from the system is simply a matter of changing the state machine so that henceforth it responds to or ignores requests from that client. Adding an output device is also straightforward--the state machine starts sending output to that device. Removing an output device from a system is achieved by disabling the device. This is done by putting the device in a state that prevents it from affecting the environment. For example, a CRT terminal can be disabled by turning off the brightness so that the screen can no longer be read; a hydraulic actuator controlling the flap on an airplane wing can be disabled by opening a cutoff valve so that the actuator exerts no pressure on that control surface. However, as these examples show, it is not always possible to disable a faulty output device: turning off the brightness might have no effect on the screen, and the cutoff valve might not work. Thus, there are systems in which no more than a total of t actuator faults can be tolerated, because faulty actuators cannot be disabled.
The configuration of a system structured in terms of a state machine and clients can be described using three sets: the clients C, the state machine copies S, and the output devices O. S is used by the implementation of the Agreement abstraction and therefore must be known to clients and state machine copies. It can also be used by an output device to determine which send operations made by state machine copies should be ignored. C and O are used by state machine copies to determine from which clients requests should be processed and to which devices output should be sent. Therefore, C and O must be available to state machine copies.

Two problems must be solved to support changing the system configuration. First, the values of C, S, and O must be available when required. Second, whenever a client, state machine copy, or output device is added to the configuration, the state of that element must be updated to reflect the current state of the system. These problems are considered in the following two subsections.
9.1. Managing the Configuration

The configuration of a system can be managed using the state machine in that system. Sets C, S, and O are stored in state variables and changed by commands. Each configuration is valid for a collection of requests--those requests r such that uid(r) is in the range defined by two successive configuration-change requests. Thus, whenever a client, state machine copy, or output device performs an action connected with processing r, it uses the configuration that is valid for r. This means that a configuration-change request must schedule the new configuration for some point far enough in the future that clients, state machine copies, and output devices can find out about the new configuration before it actually comes into effect.
There are various ways to make configuration information available to the clients and output devices of a system. (The information is already available to the state machine.) One is for clients and output devices to query the state machine periodically for information about relevant pending configuration changes. Obviously, communication costs for this scheme are reduced if clients and output devices share processors with state machine copies. Another way to make configuration information available is for the state machine to include information about configuration changes in the messages it sends to clients and output devices in the course of normal processing. Doing this requires regular and periodic communication between the state machine and clients and between the state machine and output devices.

Requests to change the configuration of the system are made by a failure/recovery detection mechanism. It is convenient to think of this mechanism as a collection of clients, one for each element of C, S, or O. Each of these configurators is responsible for detecting the failure or repair of the single object it manages and, when such an event is detected, for making a request to alter the configuration. A configurator is likely to be part of an existing client or state machine copy and might be implemented in a variety of ways. When elements are fail-stop, a configurator need only check the failure-detection mechanism of that element. When elements can exhibit Byzantine failures, detecting failures is not always possible. When it is possible, a higher degree of fault tolerance can be achieved by reconfiguration. A non-faulty configurator satisfies two safety properties.

C1:
Only a faulty element is removed from the configuration.
C2:
Only a non-faulty element is added to the configuration.
However, a configurator that does nothing satisfies C1 and C2. Changing the configuration enhances fault tolerance only if F1 and F2 also hold. For F1 and F2 to hold, a configurator must also (1) detect faults and cause elements to be removed and (2) detect repairs and cause elements to be added. Thus, the degree to which a configurator enhances fault tolerance is directly related to the degree to which (1) and (2) are achieved. Here, the semantics of the application can be helpful. For example, to infer that a client is faulty, a state machine can compare requests made by different clients or by the same client over a period of time. To determine that a processor executing a state machine copy is faulty, the state machine can monitor messages sent by other state machine copies during execution of an Agreement protocol. And, by monitoring aspects of the environment being controlled by actuators, a state machine copy might be able to determine that an output device is faulty. Some elements, such as processors, have internal failure-detection circuitry that can be read to determine whether that element is faulty or has been repaired and restarted. A configurator for such an element can be implemented by having the state machine periodically poll this circuitry.

In order to analyze the fault tolerance of a system that uses configurators, failure of a configurator can be considered equivalent to the failure of the element that the configurator manages. This is because, with respect to the Combining Condition, removal of a non-faulty element from the system or addition of a faulty one is the same as that element failing. Thus, in an f fault-tolerant system, the sum of the number of faulty configurators that manage non-faulty elements and the number of faulty components with non-faulty configurators must be bounded by f.
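The bound stated in the last sentence can be written down directly. The following check is an illustration only; the bookkeeping representation (pairs of fault flags per element) is invented for this sketch.

```python
def within_fault_bound(elements, f):
    """Sketch of the bound from section 9.1: the number of faulty
    configurators managing non-faulty elements plus the number of
    faulty elements with non-faulty configurators must not exceed f.
    `elements` maps an element name to a hypothetical pair
    (element_faulty, configurator_faulty)."""
    bad_cfg_good_elem = sum(1 for e, c in elements.values() if c and not e)
    bad_elem_good_cfg = sum(1 for e, c in elements.values() if e and not c)
    return bad_cfg_good_elem + bad_elem_good_cfg <= f
```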
9.2. Integrating a Repaired Object

Not only must an element being added to a configuration be non-faulty, it must also have the correct state so that its actions will be consistent with those of the rest of the system. Define e[ri] to be the state that a non-faulty system element e should be in after processing requests r0 through ri. An element e joining the configuration immediately after request rjoin must be in state e[rjoin] before it can participate in the running system.

An element is self-stabilizing [Dijkstra 74] if its current state is completely defined by the previous k inputs it has processed, for some fixed k. Obviously, running such an element long enough to ensure that it has processed k inputs is all that is required to put it in state e[rjoin]. Unfortunately, the design of self-stabilizing state machines is not always possible.

When elements are not self-stabilizing, processors are fail-stop, and logical clocks are implemented, cooperation of a single state machine copy smi is sufficient to integrate a new element e into the system. This is because state information obtained from smi must be correct. In order to integrate e at request rjoin, smi must have access to enough state information so that e[rjoin] can be assembled and forwarded to e.

• When e is an output device, e[rjoin] is likely to be only a small amount of device-specific set-up information--information that changes infrequently and can be stored in state variables of smi.

• When e is a client, the information needed for e[rjoin] is frequently based on recent sensor values read and can therefore be determined by using information provided to smi by other clients.

• And, when e is a state machine copy, the information needed for e[rjoin] is stored in the state variables and pending requests at smi.
The protocol for integrating a client or output device e is simple: e[rjoin] is sent to e before the output produced by processing any request with a unique identifier larger than uid(rjoin). The protocol for integrating a state machine copy smnew is a bit more complex. It is not sufficient for smi simply to send the values of all its state variables and copies of any pending requests to smnew. This is because some client request might have been received by smi after it sent e[rjoin] but delivered to smnew before its repair. Such a request would neither be reflected in the state information forwarded by smi to smnew nor received by smnew directly. Thus, smi must, for a time, relay to smnew requests received from clients.8 Since requests from a given client are received by smnew in the order sent and in ascending order by request identifier, once smnew has received a request directly (i.e., not relayed) from a client c, there is no need for requests from c with larger identifiers to be relayed to smnew. If smnew informs smi of the identifier on a request received directly from each client c, then smi can know when to stop relaying to smnew requests from c. The complete integration protocol is summarized in the following.

Integration with Fail-stop Processors and Logical Clocks. A state machine copy smi can integrate an element e at request rjoin into a running system as follows.

8Duplicate copies of some requests might be received by smnew.
If e is a client or output device, smi sends the relevant portions of its state variables to e and does so before sending any output produced by requests with unique identifiers larger than the one on rjoin. If e is a state machine copy smnew, then smi

(1) sends the values of its state variables and copies of any pending requests to smnew, and

(2) sends to smnew every subsequent request r received from each client c such that uid(r) < uid(rc), where rc is the first request smnew received directly from c after being restarted.
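The relay rule in step (2) can be sketched as follows. This is an illustration only: the class, the transport callback, and the per-client bookkeeping are invented names, and the sketch assumes (as the text does) that requests from a given client arrive in ascending uid order.

```python
class RelayingIntegrator:
    """Sketch of step (2): smi relays requests from client c to smnew
    until smnew reports the identifier rc of the first request it
    received directly from c."""

    def __init__(self, send_to_new):
        self._send = send_to_new      # transport to smnew (assumed given)
        self._first_direct = {}       # client -> uid of first direct request

    def note_first_direct(self, client, uid):
        # smnew informs smi of the identifier on the first request it
        # received directly from this client.
        self._first_direct[client] = uid

    def on_request(self, client, uid, request):
        # Relay unless smnew already receives this client's requests
        # directly with identifiers at least this large.
        cutoff = self._first_direct.get(client)
        if cutoff is None or uid < cutoff:
            self._send(client, uid, request)
```

Once `note_first_direct` has been called for every client, no further relaying occurs and smnew is fully integrated.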
The existence of synchronized real-time clocks permits this protocol to be simplified, because smi can determine when to stop relaying messages based on the passage of time. Suppose, as in section 4, there exists a constant Δ such that a request r with unique identifier uid(r) will be received by every (correct) state machine copy in the configuration no later than time uid(r)+Δ according to the local clock at the receiving processor. Let smnew join the configuration at time τjoin. By definition, smnew is guaranteed to receive every request that was made after time τjoin on the requesting client's clock. Since unique identifiers are obtained from the real-time clock of the client making the request, smnew is guaranteed to receive every request r such that uid(r) > τjoin. Any request not received directly by smnew must be received by smi by time τjoin+Δ according to its clock. Therefore, every request received by smi after τjoin+Δ must also be received directly by smnew. Clearly, smi need not relay such requests, and we have the following protocol.
Integration with Fail-stop Processors and Real-time Clocks. A state machine copy smi can integrate an element e at request rjoin into a running system as follows. If e is a client or output device, then smi sends the relevant portions of its state variables to e and does so before sending any output produced by requests with unique identifiers larger than the one on rjoin. If e is a state machine copy smnew, then smi

(1) sends the values of its state variables and copies of any pending requests to smnew, and

(2) sends to smnew every request received during the next interval of size Δ.
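With real-time clocks, step (2) reduces to a time window. A minimal sketch, assuming an illustrative representation of smi's receipt log as (local receipt time, request) pairs:

```python
def relay_window(join_time, delta, requests):
    """Sketch of step (2) with synchronized real-time clocks: smi
    relays only the requests it receives during the interval of size
    delta after the join time; any request arriving later is
    guaranteed to reach smnew directly."""
    return [req for t, req in requests if join_time <= t <= join_time + delta]
```

The point of the simplification is that smi needs no per-client bookkeeping: the cutoff is determined by its own clock alone.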
When processors can exhibit Byzantine failures, a single state machine copy smi is not sufficient for integrating a new element into the system. This is because state information furnished by smi is not necessarily correct--smi might be executing on a faulty processor. To tolerate f failures in a system with 2f+1 state machine copies, f+1 identical copies of the state information and f+1 identical copies of relayed messages must be obtained. Otherwise, the protocol is as described above for real-time clocks.
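The f+1 requirement can be sketched as a simple vote: a value is accepted only once f+1 identical copies have arrived, since at least one of them must then come from a non-faulty copy. The function name and representation are invented for this illustration.

```python
from collections import Counter

def accept_state(copies, f):
    """Sketch: accept state information (or a relayed message) only
    when f+1 identical copies have been received; with at most f
    faulty copies, at least one of the f+1 is from a non-faulty
    state machine copy.  Returns None if no value has enough copies."""
    if not copies:
        return None
    value, n = Counter(copies).most_common(1)[0]
    return value if n >= f + 1 else None
```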
Stability Revisited

The stability tests of section 4 do not work when requests made by a client can be received from two sources--the client and via a relay. During the interval in which messages are being relayed, smnew, the state machine copy being integrated, might receive a request r directly from c but later receive r', another request from c, with uid(r) > uid(r'), because r' was relayed by smi. The solution to this problem is for smnew to consider requests received directly from c stable only after no relayed requests from
c can arrive. Thus, the stability test must be changed:

Stability Test During Restart. A request r received directly from a client c by a restarting state machine copy smnew is stable only after the last request from c relayed by another processor has been received by smnew.

An obvious way to implement this is for a message to be sent to smnew when no further requests from c will be relayed.

10. Related Work

The state machine approach was first described in [Lamport 78a] for environments in which failures could not occur. It was generalized to handle fail-stop failures in [Schneider 82], a class of failures between fail-stop and Byzantine failures in [Lamport 78b], and full Byzantine failures in [Lamport 84]. The various abstractions proposed for these models are unified in [Schneider 85]. A critique of the approach for use in database systems appears in [Garcia-Molina et al 84]. Experiments evaluating the performance of various of the stability tests in a network of SUN Workstations are reported in [Pittelli & Garcia-Molina 87].

The state machine approach has been used in the design of significant fault-tolerant process control applications [Wensley et al 78]. It has also been used to implement distributed synchronization--including read/write locks and distributed semaphores [Schneider 80] and input/output guards for CSP and conditional Ada SELECT statements [Schneider 82]--and, more recently, in the design of fail-stop processor approximations in terms of processors that can exhibit arbitrary behavior in response to a failure [Schlichting & Schneider 83] [Schneider 84]. The state machine approach is rediscovered with depressing frequency, though rarely in its full generality. For example, the (late) Auragen 4000 series system described in [Borg et al 83] and the Publishing crash recovery mechanism [Powell & Presotto 83] both use variations of the approach.
A stable storage implementation described in [Bernstein 85] exploits properties of a synchronous broadcast network to avoid explicit protocols for Agreement and Order and employs Transmitting a Default Vote (as described in section 7). The notion of Δ-common storage, suggested in [Cristian et al 85], is a state machine implementation of memory that uses the Real-time Clock Stability Test. The method of implementing highly available distributed services in [Liskov & Ladin 86] uses the state machine approach, with clever optimizations of the stability test and Agreement abstraction that are possible due to the semantics of the application and the use of fail-stop processors. The ISIS project [Birman & Joseph 87] has recently been investigating fast protocols to support fault-tolerant process groups--in the terminology of this paper, state machines in a system of fail-stop processors. Their ABCAST protocol is a packaging of our Agreement and Order abstractions based on the Logical Clock Stability Test Tolerating Fail-stop Failures; CBCAST allows more flexibility in message ordering and permits designers to specify when requests commute. Another project at Cornell, the Realtime-Reliability testbed, is investigating semantics-dependent optimizations to state machines. The goal of that project is to systematically develop efficient, fault-tolerant process control software for a hard real-time environment. Starting with a system structured as state machines and clients, various optimizations are performed to combine state machines, thereby obtaining a fast, yet provably fault-tolerant, distributed program.
Acknowledgments Discussions with O. Babaoglu, K. Birman, and L. Lamport over the past 5 years have helped me to formulate these ideas. Helpful comments on a draft of this paper were provided by J. Aizikowitz, O. Babaoglu, A. Bernstein, K. Birman, D. Giles, and B. Simons.
References

[Babaoglu 86] Babaoglu, O. On the reliability of consensus-based fault-tolerant distributed systems. ACM TOCS 5, 4 (Nov. 1987), 394-416.
[Bernstein 85] Bernstein, A.J. A loosely coupled system for reliably storing data. IEEE Trans. on Software Engineering SE-11, 5 (May 1985), 446-454.
[Birman 85] Birman, K.P. Replication and fault tolerance in the ISIS system. Proc. Tenth ACM Symposium on Operating Systems Principles, (Orcas Island, Washington, Dec. 1985), ACM, 79-86.
[Birman & Joseph 87] Birman, K.P. and T. Joseph. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76.
[Borg et al 83] Borg, A., J. Baumbach, and S. Glazer. A message system supporting fault tolerance. Proc. Ninth ACM Symposium on Operating Systems Principles, (Bretton Woods, New Hampshire, October 1983), ACM, 90-99.
[Cooper 84] Cooper, E.C. Replicated procedure call. Proc. Third ACM Symposium on Principles of Distributed Computing, (Vancouver, Canada, August 1984), ACM, 220-232.
[Cristian et al 85] Cristian, F., H. Aghili, H.R. Strong, and D. Dolev. Atomic broadcast: From simple message diffusion to Byzantine agreement. Proc. Fifteenth International Conference on Fault-tolerant Computing, (Ann Arbor, Mich., June 1985), IEEE Computer Society.
[Dijkstra 74] Dijkstra, E.W. Self-stabilization in spite of distributed control. CACM 17, 11 (Nov. 1974), 643-644.
[Fischer et al 85] Fischer, M., N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process. JACM 32, 2 (April 1985), 374-382.
[Garcia-Molina et al 84] Garcia-Molina, H., F. Pittelli, and S. Davidson. Application of Byzantine agreement in database systems. TR 316, Department of Computer Science, Princeton University, June 1984.
[Gray 78] Gray, J. Notes on data base operating systems. Operating Systems: An Advanced Course, Lecture Notes in Computer Science, Vol. 60, Springer-Verlag, New York, 1978, 393-481.
[Hammer & Shipman 80] Hammer, M. and D. Shipman. Reliability mechanisms for SDD-1: A system for distributed databases. ACM TODS 5, 4 (December 1980), 431-466.
[Lamport 78a] Lamport, L. Time, clocks and the ordering of events in a distributed system. CACM 21, 7 (July 1978), 558-565.
[Lamport 78b] Lamport, L. The implementation of reliable distributed multiprocess systems. Computer Networks 2 (1978), 95-114.
[Lamport 84] Lamport, L. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (April 1984), 254-280.
[Lamport et al 82] Lamport, L., R. Shostak, and M. Pease. The Byzantine generals problem. ACM TOPLAS 4, 3 (July 1982), 382-401.
[Liskov 85] Liskov, B. The Argus language and system. Distributed Systems: Methods and Tools for Specification, Lecture Notes in Computer Science, Vol. 190, Springer-Verlag, New York, N.Y., 1985, 343-430.
[Liskov & Ladin 86] Liskov, B. and R. Ladin. Highly-available distributed services and fault-tolerant distributed garbage collection. Proc. Fifth ACM Symposium on Principles of Distributed Computing, (Calgary, Alberta, Canada, August 1986), ACM, 29-39.
[Pittelli & Garcia-Molina 87] Pittelli, F.M. and H. Garcia-Molina. Efficient scheduling in a TMR database system. Proc. Seventeenth International Symposium on Fault-tolerant Computing, (Pittsburgh, Pa., July 1987), IEEE.
[Powell & Presotto 83] Powell, M. and D. Presotto. PUBLISHING: A reliable broadcast communication mechanism. Proc. Ninth ACM Symposium on Operating Systems Principles, (Bretton Woods, New Hampshire, October 1983), ACM, 100-109.
[Schlichting & Schneider 83] Schlichting, R.D. and F.B. Schneider. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS 1, 3 (August 1983), 222-238.
[Schneider 80] Schneider, F.B. Ensuring consistency on a distributed database system by use of distributed semaphores. Proc. International Symposium on Distributed Data Bases, (Paris, France, March 1980), INRIA, 183-189.
[Schneider 82] Schneider, F.B. Synchronization in distributed programs. ACM TOPLAS 4, 2 (April 1982), 179-195.
[Schneider 84] Schneider, F.B. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May 1984), 145-154.
[Schneider 85] Schneider, F.B. Paradigms for distributed programs. Distributed Systems: Methods and Tools for Specification, Lecture Notes in Computer Science, Vol. 190, Springer-Verlag, New York, N.Y., 1985, 343-430.
[Schneider 86] Schneider, F.B. A paradigm for reliable clock synchronization. Proc. Advanced Seminar on Real-Time Local Area Networks, (Bandol, France, April 1986), INRIA, 85-104.
[Schneider et al 84] Schneider, F.B., D. Gries, and R.D. Schlichting. Fault-tolerant broadcasts. Science of Computer Programming 4 (1984), 1-15.
[Siewiorek & Swarz 82] Siewiorek, D.P. and R.S. Swarz. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass., 1982.
[Skeen 82] Skeen, D. Crash Recovery in a Distributed Database System. Ph.D. Thesis, University of California at Berkeley, May 1982.
[Spector 85] Spector, A.Z. Distributed transactions for reliable systems. Proc. Tenth ACM Symposium on Operating Systems Principles, (Orcas Island, Washington, Dec. 1985), ACM, 127-146.
[Strong & Dolev 83] Strong, H.R. and D. Dolev. Byzantine agreement. Intellectual Leverage for the Information Society, Digest of Papers, (Compcon 83, IEEE Computer Society, March 1983), IEEE Computer Society, 77-82.
[Wensley et al 78] Wensley, J., et al. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct. 1978), 1240-1255.
A Simple Model for Agreement in Distributed Systems

Danny Dolev and Raymond Strong
IBM Almaden Research Center, 650 Harry Rd, San Jose, CA 95120
1 Introduction
The goal of our research into fault-tolerant algorithms has been to discover and prove the best results possible in any environment consisting of processing elements that communicate by means of messages via some communication medium. Thus, rather than settle for the existence of a practical algorithm that accomplishes some task, we explore the whole range of practical and impractical algorithms that could exist, attempting to establish trade-offs and lower bounds that explain why some problems are inherently more difficult to solve than others. Rather than settle for algorithms that tolerate a single fault of a type that is deemed likely to occur, the ultimate aim of our explorations is to provide the best possible multiple-fault-tolerant algorithms and to provide families of solutions to various problems, so that the best solution for each context is available. Thus we are not satisfied with tolerating a single or a simple fault until we understand how to tolerate more complex varieties, how to determine the most cost-effective solution, and how to understand the possible trade-offs involved.

A particular application has driven much of the research reported here. Many algorithms for maintaining the consistency of distributed data are designed to tolerate the failure of a participating processing element (processor) by blocking, i.e. by holding up the completion of ongoing operations until the failed component is repaired. Such a strategy can obviously interfere with the timely completion of a distributed task. Instead, we study strategies that allow a cluster of processors, communicating by messages, to behave like a single highly available processor so that data-consistency-preserving algorithms are prevented from blocking in the presence of faults. We are not directly concerned with improved algorithms for distributed data management.
The algorithms studied here should not be compared directly with algorithms for preserving the atomicity of distributed transactions, such as two phase commit algorithms. Instead we wish to make two phase commit and other algorithms
more robust by providing systems of components that guarantee the assumptions under which they perform optimally. While studying solutions to this problem, we have discovered that techniques for fault tolerance in real-time process control can be quite relevant, especially since we are interested in algorithms that are correct even in the worst case of coincidental multiple failure. Thus we have focused attention on a paradigm that was introduced as a real-time process control problem called the Byzantine Generals Problem ([LSP]). In this paradigm, a source participant (or general) must communicate some message to the other participants. These participants must either all receive and act identically on the same message or all take some default action. What makes the problem interesting, and often impossible, is that the coordination implicit in the problem must be carried out in the presence of faulty components, including possibly the source. The word "Byzantine" was applied to this paradigm by Leslie Lamport to indicate that the behavior of faulty components could be arbitrary and worst case.

We treat the Byzantine Generals Problem as a problem of reaching agreement among processors in a distributed system. This kind of agreement problem is similar to but simpler than the problem of atomic commit of a distributed database transaction. The distributed commit problem requires that either all sites managing parts of a distributed database commit the changes prescribed by a transaction or that no site commits the changes. To ensure progress, there must also be fault-free conditions under which the changes are committed. Moreover, there must be some way for faulty sites to recover and commit changes committed by others and undo changes not committed by others.
Within the context of the reasonable assumption that eventually any faulty component crashes and is repaired, a solution to the distributed commit problem must include commit and recovery algorithms that tolerate faults and allow progress in the absence of faults. Atomicity must be ensured among correctly behaving components and among faulty components once they have been repaired. But faults may postpone or "block" progress indefinitely. Unlike the distributed commit problem, the agreement or atomic broadcast problem requires atomicity only among the components that have not failed and does not refer explicitly to recovery. However, for the agreement problem faults are not allowed to block progress; the correct components must agree on some action and take that action, independent of the faulty components.

Now suppose that we obtained a solution to the agreement problem that can be used by a cluster of processors so that they appear to be a single highly available processor in solutions to the distributed commit problem. Using it we can make any distributed commit solution more reliable by making it less vulnerable to blocking failures of components ([Sch]). Moreover, many algorithms for distributed commit are not designed to tolerate worst-case types of faulty behavior. For example, standard two phase commit algorithms tolerate what are called omission faults, but not more Byzantine, two-faced behavior ([MSF]). Thus, if we can provide ways to prevent or mask such behavior, we have again made distributed commit solutions potentially more reliable.
One problem with some agreement algorithms is that the increased reliability may be more expensive. But these algorithms provide an alternative that is available for those times when it is needed. Another problem with some agreement algorithms is that they are efficient and practical in small clusters but not necessarily feasible in large networks. We claim, however, that rather than compare distributed commit and agreement algorithms directly, one should compare the reliability and performance of distributed commit over large networks, enhanced by agreement over local clusters, with that of distributed commit alone. This approach may still have problems if the cluster is to provide the services of a highly available node in some distributed commit protocol. These problems include the detection of cluster failure (no agreement algorithm can tolerate a partition of correctly functioning processors) and recovery after cluster failure ([S85]).
2 The Simple Model
In what follows we shall discuss only a simple version of the problem, so that we can focus on the inherent difficulties and costs of its solution. It is beyond the scope of this paper to cover the more realistic models for agreement or the history of actual implementations of solutions. Instead, we survey theoretical results obtained in what we call the simple model. Our characterization of the simple model is taken from [S85]. The general context is a network of processors communicating by means of messages over links. Variants of the simple model may be found in [PSL], [DS], [DFFLS], [FLM], [TPS], [C], and many others. Papers discussing more realistic models include [CASD], [DHSS], and [S86]. In [AGKS] and [GS] the reader can find descriptions of a recent implementation based on some of these theoretical algorithms.

In the simple model we make the following assumptions:

• Complete connectivity: the network of processors is fully connected.

• Perfect communication: messages are never altered or lost by communication links.

• Perfect synchronization: one absolutely reliable clock synchronizes all processors.

• Isolation: only one protocol is executed at a time (no other concurrent processes can interfere).

• Perfect information: the identities of the participants and the time and place of origin of the process under study are common knowledge in advance, as is the exact time required for transmission and processing of each message. (The latter time is called the message delay.)
The assumptions of perfect synchronization and perfect information make it possible for algorithms in the simple model to be organized into rounds of message
exchange. Each round consists of the sending of messages followed by the receipt and processing of messages and the preparation of messages to be sent at the next round. In this model, time is usually measured in rounds. The relevant performance characteristics for an algorithm in this model include time, number of messages required, number of bits per message, and number of faults of various types tolerated. A more detailed examination of performance might also include the size and complexity of a program implementing the algorithm, the amount of information that must be stored by each processor (either temporarily or permanently), and the amount of work performed by each processor in processing its messages.
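The round structure described above can be sketched as a small synchronous-execution skeleton. This is an illustrative skeleton only, not a particular algorithm: each processor is modeled by a step function mapping (state, received messages) to (new state, outgoing messages), and the perfect-communication and perfect-synchronization assumptions appear as lossless, round-aligned delivery.

```python
def run_rounds(procs, inputs, num_rounds):
    """Sketch of round-structured execution in the simple model:
    in each round every processor sends messages, then every message
    sent in that round is received and processed in the next.
    `procs` maps a processor id to a step function
    (state, msgs) -> (new_state, [(dest, msg), ...])."""
    states = dict(inputs)
    inboxes = {p: [] for p in procs}
    for _ in range(num_rounds):
        outboxes = {p: [] for p in procs}
        for p, step in procs.items():
            states[p], to_send = step(states[p], inboxes[p])
            for dest, msg in to_send:
                outboxes[dest].append((p, msg))
        inboxes = outboxes   # perfect, synchronous delivery
    return states
```

For example, a step function that broadcasts its state and keeps the maximum value seen makes all processors converge on the global maximum in a constant number of rounds (in a fault-free run).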
3 Fault Models
The introduction of a hierarchy of fault models of increasingly complex and rare types has provided explicit understanding of trade-offs available to the system designer, replacing more traditional rules of thumb. We need not describe the goal of fault tolerance as the masking of a single failure, except for bizarre unexpected behavior. Instead, we can classify component failures in terms of the behavior experienced and thus in terms of the measures needed to mask their occurrences, whether they be single or coincidental. We now have the tools to express the number and types of failures that can be tolerated.

Many distributed database management and other distributed algorithms were originally designed to tolerate failstop failures. In particular, a component is considered "fail-stop" if upon failing that component is guaranteed not only to halt immediately before violating any of its input-output specifications, but also to notify all components directly communicating with it of its failure ([Sch]). We, however, classify the fault rather than the component. We assume that each component has a complete input-output specification that describes its correct behavior. To be faithful to the origin of the term, we say that a fault that causes a component to stop functioning after notifying all its correspondents that it is about to stop is a failstop fault. If the component simply ceases functioning without notification, we call the fault a crash fault. There has been some confusion about these terms, to which we unfortunately have contributed, so we hope that this more precise terminology will be adopted. Many of the algorithms that were implicitly designed to tolerate failstop or crash faults can easily be converted to tolerate a much broader class of fault. The term omission was introduced by [MSF] to classify the set of faults in which some action is omitted.
That is, if the specification for a component requires a particular action and the component fails to perform this action but otherwise continues to function correctly, then the fault is classified as an omission fault. Note that the fault is classified according to the manner in which the component fails to meet its input-output specification. Thus, if component A is supposed to send messages X and Z after receipt of message Y, and component A fails to send message X after receiving message Y but does send message Z, then component A is said to have suffered an
omission (or output omission) fault. If component B fails to receive message X and behaves as if message X had not arrived, then component B is said to have suffered an input omission fault. Input omission faults are more powerful and harder to tolerate than output omission faults.

The class of crash faults is a subset of the class of (output) omission faults, which in turn is a subset of the class of timing faults. Suppose that if component A receives message X, it is supposed to send message Y immediately and to send message Z between 5 and 15 milliseconds thereafter. If, instead, component A receives message X and sends both messages Y and Z immediately, it is said to have suffered an early timing fault. Alternately, if A sends message Z 20 milliseconds after message Y, it is said to have suffered a late timing fault. Input timing faults are defined in analogy with input omission faults. A more general class of faults is those that can be ascribed to the arbitrary behavior of a subcomponent clock. If a component that contains a clock as a subcomponent behaves correctly except for an arbitrary malfunction of its clock, the component is said to have suffered a clock fault.

The most general fault class encompasses all possible faults, including arbitrary, even malicious, behavior. This class is called Byzantine faults. There is a special subclass of Byzantine faults that has a great deal of practical significance. This is the class of faults that allows for "almost" arbitrary behavior; some authentication or error detection protocol is specifically excluded from corruption. When an algorithm is designed to tolerate any fault that does not corrupt an authentication protocol, the algorithm is referred to as an authenticated algorithm and the class of faults tolerated is called Byzantine with authentication. To emphasize that worst-case faults could corrupt an authentication protocol, we sometimes refer to such faults as unauthenticated Byzantine faults.
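The containments stated in this section (crash ⊆ omission ⊆ timing, timing faults being a special case of clock faults, and everything being a subclass of Byzantine faults) can be recorded as a small class hierarchy. A sketch only, with invented class names; failstop faults are listed alongside crash faults rather than in the chain, since a failstop fault additionally notifies its correspondents.

```python
# Each subclass denotes a strictly more benign (easier to tolerate)
# fault class than its parent.
class ByzantineFault: pass                               # arbitrary, possibly malicious
class AuthenticatedByzantineFault(ByzantineFault): pass  # cannot corrupt authentication
class ClockFault(ByzantineFault): pass                   # arbitrary clock-subcomponent behavior
class TimingFault(ClockFault): pass                      # required action too early or too late
class OmissionFault(TimingFault): pass                   # required action never performed
class CrashFault(OmissionFault): pass                    # ceases functioning, no notification
class FailstopFault(CrashFault): pass                    # also notifies correspondents before stopping
```

A tolerance claim such as "tolerates omission faults" then covers crash and failstop faults automatically, which mirrors how the hierarchy is used in the text.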
4 Problems
The specific problems discussed below are all variants on the notion of agreement and require some kind of consistency among the outputs of correct processors. When there is one input (to one processor, called the source or originator), we refer to the consistency problem as an agreement problem. When each participant has an input that is to affect all outputs, we refer to the consistency problem as a consensus problem. Each of the variants discussed below has two correctness criteria: (1) output consistency, which may be equality or approximate equality in the presence of uniformly correct behavior on the part of all participants; and (2) progress, which prevents trivial solutions by forcing the output to be a function of the input. A third criterion that sometimes allows finer distinctions is (3) termination, which may be specified as synchronous or asynchronous, depending on whether or not all correct processors are to produce their outputs at the same round.

The Byzantine Generals' problem is a case of simple synchronous agreement. There is one input. All correct processors must (1) agree on identical outputs. And,
if the source functions correctly, then all correct processors must (2) produce the input as output ([LSP]). Implicit in the statement of the problem is the third requirement that all correct processors must (3) produce their outputs at the same round ([DRS]). In the corresponding consensus problem, each processor has its own input. The output consistency criterion is the same as for the Byzantine Generals' problem. The usual criterion for progress states that if all inputs to processors are identical, then the input value must be produced as output. Asynchronous agreement and consensus have also been studied. When considering these problems in a less synchronous model, we replace the terms synchronous and asynchronous by simultaneous and eventual, respectively. The weak Byzantine Generals problem has the same consistency criterion, but it requires progress only if there are no faults ([L83]). It easily generalizes to four flavors corresponding to the pairs <agreement, consensus> and <simultaneous, eventual>.
Crusader Agreement represents a weakening of the consistency requirement so that correct processors need produce the same outputs only if the source is correct ([D]). It is not easy to generalize Crusader to a consensus problem, though it is possible (cf. [ST]). Alternatively, a problem that is more easily defined in the consensus context is that of approximate consensus (often called approximate agreement) [DLPSW, MS]. Here the consistency requirement is that the range of output values be smaller than the range of inputs unless all inputs are identical, in which case the outputs must be equal to the inputs. Corresponding to progress is a requirement that outputs come from the range of the inputs. Many variations on these criteria are possible. Finally, the firing squad problem emphasizes the termination criterion, requiring simultaneous or synchronous termination. One simple formulation of the problem is a variant of weak simultaneous consensus on a binary value. This variant requires that if any input is 1 and if all processors are correct, then all processors output the value 1. Other versions of the problem are beyond the scope of our simple model. Each of the problems described above can be further specified by enumerating the numbers and types of faults that are to be tolerated. For example, one can consider Byzantine agreement, authenticated Byzantine consensus, omission crusader agreement, etc.
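To make the approximate consensus criteria concrete, here is a minimal sketch of one averaging round loosely in the spirit of [DLPSW]-style algorithms. The function name and the discard parameter F are ours, and this simplification omits the message exchange; it only shows why the output stays within the range of the inputs.

```python
def approximate_round(values: list[float], F: int) -> float:
    """One averaging round: discard the F lowest and F highest collected
    values (those potentially contributed by faulty processors), then
    average the rest.  The result always lies in the range of the inputs."""
    assert len(values) > 2 * F, "need more values than twice the fault bound"
    trimmed = sorted(values)[F : len(values) - F]
    return sum(trimmed) / len(trimmed)
```

Repeating such rounds shrinks the range of values held by correct processors, which is the sense in which the range of outputs becomes smaller than the range of inputs.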
5
Results
Results in the simple model are summarized in [F], [SD], and [FLM]. In particular, [FLM] contains a unifying proof technique for lower bounds on the number and connectivity of processors required to tolerate unauthenticated Byzantine faults. As a special case, they reprove the well known result that neither Byzantine agreement nor consensus can be achieved if at least one-third of the processors are faulty ([PSL]). This result does not hold in the authenticated model. Thus, if faults are constrained not to corrupt an authentication protocol (or some error detection scheme), then the
one-third limitation does not exist. Lower bounds on the number of rounds required for Byzantine agreement (and consensus) can be found in [DRS]. In fact, these lower bounds hold for crash faults as well as for Byzantine faults. Denote by n the number of processors and by F the upper bound on the number of faults to be tolerated. Any algorithm that guarantees simultaneous agreement in the presence of up to F < n - 1 crash faults has scenarios that use at least F + 1 rounds of message exchange; any algorithm that guarantees eventual agreement in the presence of up to F < n - 1 crash faults has, for each 0 ≤ f < F, scenarios in which there are at most f faults and f + 2 rounds of message exchange. Agreement protocols tolerant of different types of faults are described in other chapters of this book. Here we focus on a simple protocol that might be used in the application emphasized in our introduction. Assume that we have a small cluster of n completely connected processors. For authentication, we presume some simple error detection scheme such as combining a processor id with a checksum on the message. Suppose that the clocks of these processors are synchronized. An agreement protocol in which each processor signs and sends each new message received to all its neighbors can be designed to tolerate any number of omission failures up to a partition of the network. The number of messages required per input can be (n - 1)², and the time required can be n - 1 message delays. A proportionately shorter time is required to tolerate fewer than n - 2 faults. Provided neither authentication nor clock synchronization protocols are corrupted, Byzantine faults can be tolerated at a cost of only twice as many messages and with no additional message delay required. So, at least for this synchronous case, the cost of tolerating any imaginable fault is only double the cost of tolerating only omission failures.
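The (n - 1)² message count quoted above can be checked with a small failure-free simulation of the signing-and-forwarding protocol. This is a hypothetical harness of our own; signing and omission handling are elided, and only the relay rule (forward each new message once, to everyone except its sender) is modeled.

```python
from collections import deque

def flood_message_count(n: int) -> int:
    """Simulate flooding one signed input through a fully connected,
    failure-free cluster of n processors.  Each processor relays the
    first copy it receives to every processor except its sender."""
    messages = 0
    seen = {0}                                   # processor 0 is the source
    queue = deque((0, q) for q in range(1, n))   # (sender, receiver) in flight
    messages += len(queue)                       # source sends to all neighbors
    while queue:
        sender, receiver = queue.popleft()
        if receiver in seen:
            continue                             # relay each input only once
        seen.add(receiver)
        relays = [(receiver, q) for q in range(n) if q not in (receiver, sender)]
        messages += len(relays)
        queue.extend(relays)
    return messages
```

With no failures, the source sends n - 1 messages and each of the other n - 1 processors relays once to n - 2 processors, giving (n - 1) + (n - 1)(n - 2) = (n - 1)² messages, in agreement with the bound in the text.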
References
[AGKS]
H. Aghili, A. Griefer, R. Kistler, R. Strong, "Highly available communication," Proceedings of the IEEE International Conference on Communication, 1439-1443, Toronto, 1986.
[C]
B. Coan, "A communication-efficient canonical form for fault-tolerant distributed protocols," Proc. 5th ACM Symp. on the Principles of Distributed Computing, Calgary, Canada, August, 1986.
[CASD]
F. Cristian, H. Aghili, R. Strong, and D. Dolev, "Atomic broadcast: from simple message diffusion to Byzantine agreement," Proceedings of the 15th Int. Conf. on Fault Tolerant Computing, June 1985.
[D]
D. Dolev, "The Byzantine generals strike again," Journal of Algorithms 3:1, 1982.
[DFFLS]
D. Dolev, M. Fischer, R. Fowler, N. Lynch, and R. Strong, "An efficient algorithm for Byzantine agreement without authentication," Information and Control 52:3, 257-274, March 1982.
[DHSS]
D. Dolev, J. Halpern, B. Simons, and R. Strong, "Fault tolerant clock synchronization," Proc. 3rd ACM Symp. on the Principles of Distributed Computing, Vancouver, 1984.
[DLPSW] D. Dolev, N. Lynch, S. Pinter, E. Stark, and W. Weihl, "Reaching approximate agreement in the presence of faults," Journal of the ACM 33, 499-516, 1986. [DRS]
D. Dolev, R. Reischuk, and R. Strong, "Early stopping in Byzantine agreement," IBM Research Report RJ3915, June, 1983.
[DS]
D. Dolev and R. Strong, "Authenticated algorithms for Byzantine agreement," SIAM Journal of Computing 12:4, 656-666, 1983.
[F]
M. Fischer, "The consensus problem in unreliable distributed systems," Proc. of the International Conference on Foundations of Computing Theory, Sweden, 1983; see also Yale University Report YALEU/DCS/RR-273, June, 1983.
[FLM]
M. Fischer, N. Lynch, and M. Merritt, "Easy impossibility proofs for distributed consensus problems," Distributed Computing 1, 26-39, 1986.
[GS]
A. Griefer and R. Strong, "DCF: Distributed communication with fault tolerance," Proc. 7th ACM Symp. on the Principles of Distributed Computing, Vancouver, 1988.
[LSP]
L. Lamport, R. Shostak, and M. Pease, "The Byzantine generals problem," ACM TOPLAS 4:3, 382-401, July, 1982.
[L83]
L. Lamport, "The weak Byzantine generals problem," JACM 30, 668-676, 1983.
[MS]
S. Mahaney and F. Schneider, "Inexact agreement: accuracy, precision, and graceful degradation," Proc. 4th ACM Symp. on the Principles of Distributed Computing, Minaki, 1985.
[MSF]
C. Mohan, R. Strong, S. Finkelstein, "Method for distributed commit and recovery using Byzantine agreement within clusters of processors," Proc. 2nd ACM Symp. on the Principles of Distributed Computing, Montreal, 1983.
[PSL]
M. Pease, R. Shostak, and L. Lamport, "Reaching agreement in the presence of faults," JACM 27:2, 228-234, 1980.
[Sch]
F. Schneider, "Byzantine generals in action: implementing fail-stop processors," ACM TOCS 2:2, 146-154, May, 1984.
[ST]
T. Srikanth and S. Toueg, "Simulating authenticated broadcasts to derive simple fault-tolerant algorithms," Distributed Computing, 2:2, 80-94, August 1987.
[SD]
R. Strong and D. Dolev, "Byzantine agreement," Digest of Papers from Spring COMPCON, IEEE Computer Society Press, 1983, see also IBM Research Report RJ3714, December, 1982.
[S85]
R. Strong, "Problems in fault tolerant distributed systems," Digest of Papers from Spring COMPCON, IEEE Computer Society Press, 1985, see also IBM Research Report RJ4220, December, 1984.
[S86]
R. Strong, "Problems in maintaining agreement," Proc. 5th IEEE Symp. on Reliability in Distributed Software and Database Systems, 20-27, Los Angeles, January, 1986.
[TPS]
S. Toueg, K. Perry, and T. Srikanth, "Fast distributed agreement," SIAM J. Comp. 16, 445-457, 1987.
ATOMIC BROADCAST IN A REAL-TIME ENVIRONMENT Flaviu Cristian
Danny Dolev Ray Strong IBM Research Almaden Research Center 650 Harry Road San Jose, CA 95120
Houtan Aghili*
Abstract
This paper presents a model for real-time distributed systems that is intermediate in complexity between the simple, perfectly synchronous model, in which there are rounds of communication exchange among processors in a completely connected network, and an asynchronous model, in which there is no reasonable upper bound on the time required for transmission and processing of messages. In this model, algorithms are described for atomic broadcast that can be used to update synchronous replicated storage, a distributed storage that displays the same contents at every correct processor as of any clock time. The algorithms are all based on a simple communication paradigm and differ only in the additional checking required to tolerate different classes of failures.
1
Introduction
The fundamental characteristic of a real-time distributed operating system is a known upper bound on the time required to transmit a message from one processor to another and to process the message at the receiver, processing being assumed to include the preparation of any responsive messages. This bound in turn provides a bound on the time required to propagate information throughout the system, provided there is a known bound on the number of processors in the network and provided the network remains sufficiently connected. In such an environment, small numbers of failures can be tolerated by a distributed system that manages to provide logically synchronous replicated storage. The use of synchronous replicated storage considerably simplifies the programming of distributed processes, since a programmer is not confronted with inconsistencies *H. Aghili is now with the IBM T. J. Watson Research Center, Hawthorne, New York.
among local knowledge states that can result from random communication delays or faulty processors and links. Moreover, it allows a programmer to assume a shared clock in addition to the assumed shared memory. It is easy to adapt known concurrent programming paradigms for shared storage environments to distributed environments that provide the abstraction of a synchronous replicated storage. Several examples of such adaptations are given in [L]. The objective of this paper is to discuss fault tolerant protocols for updating synchronous replicated storage in an arbitrary point-to-point network and to contrast these protocols and their performance characteristics with protocols that might be obtained by a straightforward lifting from a simple model based on rounds of communication in a completely connected network [DS]. The real-time model and protocols presented here are based on work that appeared in [CASD]. In our model, processor clocks are synchronized to within some given precision. To implement the state machine approach, global system state information is replicated in all processors. Updates to this global system state may originate at any processor in the network. These updates are disseminated by means of an atomic broadcast protocol, so that all correct processors have identical views of the global state at identical local clock times. An atomic broadcast protocol is a communication protocol that possesses a fixed termination time Δ and satisfies three properties. Atomicity: every update whose broadcast is initiated by some processor at time T on its clock is either delivered at each correct processor at local clock time T + Δ or is never delivered at any correct processor. Order: all updates delivered at correct processors are delivered in the same order at each correct processor. Termination: every update whose broadcast is initiated by a correct processor at time T on its clock is delivered at all correct processors.
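The order and atomicity properties above can be checked mechanically on per-processor delivery logs. The following toy checker is our own sketch, not part of the protocols: it ignores the T + Δ delivery times and looks only at which updates were delivered and in what order at each correct processor.

```python
def check_atomic_broadcast(logs: dict[str, list[str]]) -> bool:
    """`logs` maps each correct processor's name to its list of
    delivered updates, in delivery order.
    Order: all logs show the same delivery sequence.
    Atomicity: every update delivered anywhere is delivered everywhere."""
    seqs = list(logs.values())
    same_order = all(seq == seqs[0] for seq in seqs)
    all_or_none = all(set(seq) == set(seqs[0]) for seq in seqs)
    return same_order and all_or_none
```

For example, two correct processors that both deliver "a" then "b" satisfy the checks, while logs that disagree on order, or on which updates were delivered at all, do not.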
The Highly Available System project at the IBM Almaden Research Center designed an atomic broadcast protocol to update replicated system directories and reach agreement on the failure and recovery of system components (v. [Or], [GS]). Much previous work on atomic broadcast protocols has been performed within the Byzantine Generals framework [LSP] (v. [F], [SD] for surveys of this work). This framework includes guaranteed communication in a completely connected network of perfectly synchronized processors that communicate in synchronous rounds of information exchange. This framework is called the simple model. In the simple model, processors send messages at the beginning of each round, and every processor has time to receive and process all messages sent to it during the round before the end of the round. Note that a round must have a time duration as great as the worst case delay in transmission and processing from one end of the network to the other. In the real-time model, networks have arbitrary topology and are subject to link as well as processor failures. Moreover, processors may respond to messages immediately
rather than wait for the beginning of the next round. The real-time model has only approximately synchronized clocks, but retains the upper bound on message transmission and processing time. Atomic broadcast protocols in the real-time model are not limited by any structure of rounds, so they are generally faster and more efficient than protocols based on rounds. Indeed, a straightforward translation of a round based protocol (e.g. [DS]) into our model would require that routing be used to achieve full "logical" connectivity among processors and that each round include not only the worst case time for sending a message between any two correct processors, but also an extra delay corresponding to the worst case duration between the end of a round on one processor clock and the end of the same round on another processor clock. Such protocols clearly send more messages and take longer than necessary.
2
Failure Classification
The real-time model is composed of components called processors and links. A set of input events and a set of output actions are associated with each component. Included in the output actions of processors is the set of possible message transmissions. Likewise, included in the set of input events is the set of possible message receipts. (For purposes of this failure classification, messages are not decomposed into constituent bits.) In addition to message receipt, the passage of a specific time duration also constitutes an input event for a processor. Input and output events for links are analogous. Each component is assumed to have an input-output specification describing its correct response (output) in relation to a history of previous inputs and outputs. For example, a link connecting processor s to processor r is specified to deliver a message sent by s to r at some time between the time s sent the message and a fixed number of time units later. Any output of a correct component depends only on its history of previous inputs and outputs and is consistent with its specification. A component specification prescribes both the output that should occur in response to a sequence of input events and the real-time interval within which this output should occur. A component failure occurs when a component does not behave in the manner specified. An omission failure occurs when, in response to an input event sequence, a component never gives the specified output. A timing failure occurs when, in response to a trigger event sequence, a component either omits the specified output or gives it too early or too late. A Byzantine failure [LSP] occurs when a component does not behave in the manner specified: either no output occurs, or the output is outside the real-time interval specified, or some output different from the one specified occurs.
An important subclass of Byzantine failures is that for which any resulting corruption of output messages is detectable by message authentication protocols. These failures are called authentication-detectable Byzantine failures. Error detecting codes [PW] and public-key cryptosystems based on digital signatures [RSA] are two examples of well-known authentication techniques.
A processor crash, a link breakdown, a processor that occasionally does not forward a message that it should, and a link that occasionally loses messages are examples of omission failures. An excessive message transmission or processing delay due to a processor or network overload is an example of a late timing failure. Another example of a late timing failure is the delivery of messages out of order by a link specified to deliver them in first-in first-out order. When some coordinated action is taken too soon by a processor (perhaps because of a faulty internal timer), an early timing failure occurs. A message alteration by a processor or a link (because of a random fault) is an example of a Byzantine failure that is neither an omission nor a timing failure. If the authentication protocol employed enables the receiver of the message to detect the alteration, then the failure is an authentication-detectable Byzantine failure. Crash failures are a proper subclass of omission failures (a crash failure occurs when a component systematically omits all outputs from some time on). Omission failures are a proper subclass of timing failures. Timing failures are a proper subclass of authentication-detectable Byzantine failures. Finally, authentication-detectable Byzantine failures are a proper subclass of the class of all possible failures, the Byzantine failures. The nested nature of the failure classes defined above makes it easy to compare "the power" of fault-tolerant protocols: a protocol A that solves some problem is "less fault-tolerant" than a protocol B that solves the same problem if A tolerates only a subclass of the failures that B tolerates. Observe that a failure cannot be classified without reference to a component specification.
Moreover, the type of failure depends on the decomposition into components: if one component is made up of others, then a failure of one type in one of its constituent components can lead to a failure of another type in the containing component. For example, a clock on which the "time" never changes is an example of a crash failure. If that clock is part of a processor that is specified to associate different timestamps with different replicated synchronous storage updates, then the processor may be classed as experiencing a Byzantine failure. In our decomposition of a distributed system into processors and links, neither type of component is considered part of the other. Also, when considering output behavior, we do not decompose messages, so a message is either correct or incorrect as a whole. With these conventions we can classify failures unambiguously. We are not concerned with directly tolerating or handling the failures experienced by such sub-components as clocks. We discuss fault tolerance in terms of the survival and correct functioning of processors that meet their specifications in an environment in which some other processors and some links may not meet theirs (usually because they contain faulty sub-components). Thus when we speak of tolerating omission failures, we mean tolerating omission failures on the part of other processors or links, not tolerating omission failures on the part of sub-components like timers or clocks that might cause much worse behavior on the part of their containing processors.
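The proper nesting of failure classes described above can be expressed as a simple severity scale, which makes the "less fault-tolerant" comparison a one-line check. The encoding below is our own sketch.

```python
# Each class properly contains all of the ones listed before it.
FAILURE_CLASSES = ["crash", "omission", "timing",
                   "auth-detectable Byzantine", "Byzantine"]

def less_fault_tolerant(protocol_a: str, protocol_b: str) -> bool:
    """Protocol A is less fault-tolerant than protocol B if the class A
    tolerates is a proper subclass of the class B tolerates."""
    return (FAILURE_CLASSES.index(protocol_a)
            < FAILURE_CLASSES.index(protocol_b))
```

For instance, an omission-tolerant protocol is less fault-tolerant than a Byzantine-tolerant one, while the reverse comparison fails, mirroring the subclass relations in the text.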
3
The Model
The real-time model consists of a connected network G of arbitrary topology, with n processors and m point-to-point links. Processors that share a link are called neighbors. Each processor p possesses a clock Cp that reads Cp(t) at real time t. We will use upper case letters for clock times and lower case letters for real times. The model is characterized by the following assumptions. 1. All processor names in G are distinct and there is a total order on processor names. 2. The clocks of correct processors are monotone increasing functions of real time, and the resolution of processor clocks is fine enough that separate clock readings yield different values (this ensures that no correct processor issues the same timestamp twice). 3. The clocks of correct processors are approximately synchronized within a known, constant, maximum deviation ε. That is, for any correct processors p and q, and for any real time t, |Cp(t) - Cq(t)| ≤ ε. (Clock synchronization protocols tolerant of omission, late timing, and authentication-detectable Byzantine failures are presented in [CAS, DHSS]. For a survey see [Sc], and "An Overview of Clock Synchronization" in this book.) 4. For the message types used in our protocols, transmission and processing delays (as measured on any correct processor's clock) are bounded by a constant δ. This assumption can be stated formally as follows. Let p and q be two correct neighbors linked by a correct link and let r be any correct processor. If p sends a message m to q at real time u, and q receives and processes m at real time v, then 0 ≤ Cr(v) - Cr(u) ≤ δ. (We assume that processing time is negligible, so δ covers the interval of time from the transmission of a message to the time of subsequent transmission of any message resulting from processing the message at a receiving neighbor.) 5. The underlying operating system provides a "schedule(A,T,p)" command that allows a task A to be scheduled for execution at time T with input parameters p.
An invocation of "schedule(A,T,p)" at a local time U > T has no effect, and multiple invocations of "schedule(A,T,p)" have the same effect as a single invocation.
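The semantics assumed for "schedule(A,T,p)" can be captured in a small toy model, which we sketch below. All names are ours, and the `now` argument stands in for a reading of the local clock.

```python
class Scheduler:
    """Toy model of the assumed schedule(A, T, p) command: invocations
    at a local time U > T have no effect, and duplicate invocations
    collapse into a single pending execution (idempotence)."""
    def __init__(self):
        self.pending = {}   # (task, T, params) -> scheduled exactly once

    def schedule(self, task, T, params, now):
        if now > T:
            return                              # invoked too late: no effect
        self.pending[(task, T, params)] = True  # re-invocation adds nothing

    def run_due(self, now):
        """Execute every pending task whose time has arrived."""
        due = sorted((k for k in self.pending if k[1] <= now),
                     key=lambda k: k[1])
        for task, T, params in due:
            del self.pending[(task, T, params)]
            task(params)
```

Idempotence is what lets the protocols below call schedule once per received copy of a message without causing multiple deliveries of the same update.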
4
Protocols
We consider three properly nested failure classes: (1) omission failures, (2) timing failures, and (3) authentication-detectable Byzantine failures. For each one of these classes, we present an atomic broadcast protocol that tolerates up to π processor and up to λ link failures in that class, provided these failures do not disconnect G. The termination time of each protocol is computed as a function of the failure class tolerated, of the π and λ parameters, of the known constants δ and ε, and of the largest diameter d of a surviving communication network G - F, for all possible subnetworks F containing up to π processors and λ links. All protocols are based on a common communication technique called information diffusion: (1) when a correct processor learns a piece of information of the appropriate type, it propagates the information to its neighbors by sending messages to them, and (2) if a correct neighbor does not already know that piece of information, it in turn propagates the information to its neighbors by sending them messages. This ensures that, in the absence of network partitions, information diffuses throughout the network to all correct processors. A possible optimization is to eliminate messages to neighbors that are already known to possess the information.
5
First Protocol: Tolerance of Omission Failures
Each message diffused by our first protocol carries its initiation time (or timestamp) T, the name of the source processor s, and a replicated storage update σ. Each atomic broadcast is uniquely identified by its timestamp T and its initiator's name s (by assumptions 1, 2, and 3). As diffused messages are received by (the atomic broadcast layer of) a processor, they are stored in a local history log H until delivery (to the local synchronous storage layer). The order property required of atomic broadcasts is achieved by letting each processor deliver the updates it receives in the order of their timestamps, by ordering the delivery of updates with identical timestamps in increasing order of their initiator's name, and by ensuring that no correct processor begins the delivery of updates with timestamp T before it is certain that it has received all updates timestamped T that it may ever have to deliver (to satisfy the atomicity requirement). For omission failures, the local time by which a processor is certain it has received copies of each message timestamped T that could have been received by some correct processor is T + πδ + dδ + ε. We call this time the delivery deadline for updates with timestamp T. The intuition behind this deadline is as follows. The term πδ is the worst case delay between the initiation of a broadcast (T, s, σ) and the moment a first correct processor r learns of that broadcast. It corresponds to the case in which the broadcast source s is a faulty processor, between s and r there is a path of π faulty processors which all forward just one message (T, s, σ) on one outgoing link, and each of these messages experiences a maximum delay of δ clock time units. The term dδ is the time sufficient for r to diffuse information about the broadcast (T, s, σ) to any correct
processor p in the surviving network. The last term ensures that any update accepted for delivery by a correct processor q whose clock is in advance of the sender's clock is also accepted by a correct processor p whose clock is behind the sender's clock. We assume all processors know the protocol termination time Δ = πδ + dδ + ε. To keep the number of messages needed for diffusing an update finite, each processor p that receives a message (T, s, σ) relays the message (to all its neighbors except the one that sent the message) only if it receives (T, s, σ) for the first time. If p inserts all received messages in its local history H (and never removes them), p can easily test whether a newly arrived message m was or was not seen before by evaluating the test m ∈ H. We call this test the "deja vu" acceptance test for the message m. The main drawback of the "deja vu" solution described above is that it causes local histories to grow arbitrarily large. To keep the size of H bounded, a history garbage collection rule is needed. A possible solution is to remove from H a message (T, s, σ) as soon as the deadline T + Δ for delivering σ passes on the local clock. However, a simple-minded application of the above garbage-collection rule would not be sufficient for ensuring that local histories remain bounded, since it is possible that copies of a message (T, s, σ) continue to be received by a correct processor p after the delivery deadline T + Δ has passed on p's clock. Such duplicates would then pass the "deja vu" acceptance test and would be inserted again in the history of p. Since such "residual" duplicates will never be delivered (see assumption 5), they would cause p's history to grow without bound. To prevent such residual messages from accumulating in local histories, we introduce a "late message" acceptance test. This test discards a message (T, s, σ) if it arrives at a local time U past the delivery deadline T + Δ, i.e. if U > T + Δ.
The "deja vu" and "late message" acceptance tests together ensure that updates require only a finite number of messages and that local histories stay bounded (provided, of course, processors broadcast only a bounded number of updates per time unit). A detailed description of our first atomic broadcast protocol is given in Figures 1, 2, and 3. Each processor has three concurrent tasks: a Start broadcast task (Figure 1) that initiates an atomic broadcast, a Relay task (Figure 2) that forwards atomic broadcast messages to neighbors, and an End task (Figure 3) that delivers broadcast updates (to be performed on the synchronous replicated storage). In what follows we refer to line j of Figure i as (i.j). A user of the atomic broadcast layer triggers the broadcast of an update σ of some type Text by sending it to the local Start task. The command "take" is used by this task to take σ as input (1.4). The broadcast of σ is identified by the local time T at which σ is received (1.4) and the identity of the sending processor, obtained by invoking the function "myid" (1.5). This function returns different processor identifiers when invoked on distinct processors. The broadcast of σ is initiated by invoking the "send-all" command, which sends messages on all outgoing links (1.5). We do not assume that this command is atomic with respect to failures: a processor failure can prevent messages from being sent on some links. The fact that the broadcast of σ has been initiated is then recorded in a local history variable H shared by all broadcast layer tasks:
1 task Start;
2 const Δ = (π + d)δ + ε;
3 var T: Time; σ: Text; s: Processor-Name;
4 cycle take(σ); T ← clock;
5 send-all(T, myid, σ);
6 H ← H ⊕ (T, myid, σ);
7 schedule(End, T + Δ, T);
8 endcycle;
Figure 1: Start Task of the first protocol
1 task Relay;
2 const Δ = (π + d)δ + ε;
3 var U, T: Time; σ: Text; s: Processor-Name;
4 cycle receive((T, s, σ), l); U ← clock;
5 [U > T + Δ: "late message" iterate];
6 [T ∈ dom(H) & s ∈ dom(H(T)): "deja vu" iterate];
7 send-all-but(l, (T, s, σ));
8 H ← H ⊕ (T, s, σ);
9 schedule(End, T + Δ, T);
10 endcycle;
Figure 2: Relay Task of the first protocol

var H: Time → (Processor-Name → Text).

We assume that H is initialized to the empty function at processor start. The variable H keeps track of ongoing broadcasts by associating with instants T in Time a function H(T) (of type Processor-Name → Text). The domain of H(T) consists of names of processors that have initiated atomic broadcasts at local time T. For each such processor p, H(T)(p) is the update broadcast by p at T. We define the following two operators on histories. The update "⊕" of a history H by a message (T, s, σ) yields a (longer) history, denoted H ⊕ (T, s, σ), that contains all the facts in H, plus the fact that s has broadcast σ at local time T. The deletion "\" of some instant T from a history H yields a (shorter) history, denoted H \ T, which does not contain T in its domain, i.e. everything about the broadcasts that were initiated at time T is deleted. Once the history H is updated (1.6), the End task is scheduled to start at local clock time T + Δ to deliver the update σ (1.7). The Relay task uses the command "receive" to receive messages formatted as (T, s, σ) from neighbors (2.4). In describing this task, we use double quotes to delimit comments and the syntactic construct "[B: iterate]" to mean "if Boolean condition B is true, then terminate the current iteration and begin the next iteration". After a message is received (2.4), the parameter l contains the identity of the link over which
1 task End(T: Time);
2 var p: Processor-Name; val: Processor-Name → Text;
3 val ← H(T);
4 while dom(val) ≠ {}
5 do p ← min(dom(val));
6 deliver(val(p));
7 val ← val \ p;
8 od;
9 H ← H \ T;
Figure 3: End Task
the message arrived. If the message is a duplicate of a message that was already received (2.6) or delivered (2.5), then it is discarded. A message is accepted if it passes the acceptance tests of the Relay task. If (T, s, σ) is accepted (i.e. passes the "late message" (2.5) and "deja vu" (2.6) tests), then it is relayed on all outgoing links except l using the command "send-all-but" (2.7), it is inserted in the history variable (2.8), and the End task is scheduled to start at local time T + Δ to deliver the received update (2.9). The End task (Figure 3) starts at clock time T + Δ to deliver updates timestamped T in increasing order of their sender's identifier ((3.5)-(3.8)) and to delete from the local history H everything about broadcasts initiated at time T (3.9). A proof of correctness for the protocol may be found in [CASD].
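The Relay task's two acceptance tests and the history insertion can be sketched in a few lines of modern code. This is our rendering of lines (2.5), (2.6), and (2.8), with the history H represented as a dictionary from timestamps to {sender: update} maps; it is illustrative only.

```python
def relay_accept(msg, local_time, H, Delta):
    """Apply the "late message" and "deja vu" tests to msg = (T, s, sigma);
    on acceptance, record the message in the history H."""
    T, s, sigma = msg
    if local_time > T + Delta:
        return False                  # "late message": past delivery deadline
    if s in H.get(T, {}):
        return False                  # "deja vu": this broadcast already seen
    H.setdefault(T, {})[s] = sigma    # insert into the history (2.8)
    return True                       # caller relays on all links but l (2.7)
```

The End task's garbage collection then corresponds to deleting the whole entry for T (H.pop(T) in this representation), which is what keeps the history bounded once the late-message test blocks residual duplicates.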
6
Second Protocol: Tolerance of Timing Failures
The first protocol is not tolerant of timing failures because there is a fixed clock time interval, independent of the number of faulty processors, during which a message is unconditionally accepted by a correct processor. This creates a real time "window" during which a message might be "too late" for some (early) correct processors and "in time" for other (late) correct processors. To achieve atomicity in the presence of timing failures, we must ensure that if a first correct processor p accepts a message m, then all correct neighbors to which p relays m also accept m. A neighbor q does not know whether the message source p is correct or not. However, if p is correct, q must accept m if the information stored in it tells q that the clock time delay between p and q is at least −ε (as when p's clock is very close to being ε time units behind q's and the message propagation delay between p and q is very close to being 0) or at most δ + ε (as when p's clock is very close to being ε time units in advance of q's and the message from p to q takes δ time units). To be able to evaluate the time a message spends between two neighbors, we store in the message the number h of hops traversed by it. This leads to the following timeliness acceptance test: a correct processor q accepts a message timestamped T with hop count h if it receives it at a
1 task Start;
2   const Δ = π(δ + ε) + dδ + ε;
3   var T: Time; a: Text; s: Processor-Name;
4   cycle take(a); T ← clock;
5     send-all(T, myid, 1, a);
6     H ← H ⊕ (T, myid, a);
7     schedule(End, T + Δ, T);
8   endcycle;

Figure 4: Start Task of the second protocol

local time U such that: T − hε ≤ U ≤ T + h(δ + ε).
Since by hypothesis there can be at most a path of π faulty processors from a (faulty) sender s to a first correct processor p, and the message accepted must pass the above test at p, it follows that a message can spend at most π(δ + ε) clock time units in the network before being accepted by a first correct processor. From that moment, it needs at most dδ clock time units to reach all other correct processors. Given the ε uncertainty on clock synchrony, the termination time of the second protocol is therefore: Δ = π(δ + ε) + dδ + ε. The Start task of the second protocol (Figure 4) is identical to that of the first except for the addition of a hop count (initially set to 1 (4.5)) to all messages. In addition to the tests used for providing tolerance of omission failures (5.7, 5.8), the Relay task of the second protocol (Figure 5) also contains the timeliness tests discussed above (5.5, 5.6). The hop count h carried by messages is incremented (5.9) every time a message is relayed. The End task of the second protocol is identical to that of the first protocol. A proof of correctness for the protocol together with an indication of why it will not tolerate Byzantine faults may be found in [CASD].
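The timeliness acceptance test can be stated compactly. The following sketch (function and parameter names are ours) checks the two bounds exactly as the Relay task of Figure 5 does:

```python
# Second protocol's timeliness test: a message timestamped T with hop count h,
# received at local clock time U, is timely iff  T - h*eps <= U <= T + h*(delta + eps).

def timely(T, h, U, delta, eps):
    """True iff the message passes both the "too early" and "too late" tests."""
    if U < T - h * eps:            # too early: clocks cannot differ this much
        return False
    if U > T + h * (delta + eps):  # too late: exceeded the worst-case per-hop budget
        return False
    return True
```

Each hop widens the window by δ + ε on the late side and ε on the early side, which is why the hop count h must travel with the message.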
7
Tolerance of Authentication-Detectable Byzantine Failures
A "Byzantine" processor can confuse a network of correct processors by forwarding appropriately altered messages on behalf of correct processors at appropriately chosen moments. One way of preventing this phenomenon is to authenticate the messages exchanged by processors during a broadcast [DS], [LSP], so that messages corrupted by "Byzantine" processors can be recognized and discarded by correct processors. In this way~ we are able to handle authentication-detectable Byzantine failures in a
 1 task Relay;
 2   const Δ = π(δ + ε) + dδ + ε;
 3   var U, T: Time; a: Text; s: Processor-Name; h: Integer;
 4   cycle receive((T, s, h, a), l); U ← clock;
 5     [U < T − hε: "too early" iterate];
 6     [U > T + h(δ + ε): "too late" iterate];
 7     [U > T + Δ: "late message" iterate];
 8     [T ∈ dom(H) & s ∈ dom(H(T)): "deja vu" iterate];
 9     send-all-but(l, (T, s, h + 1, a));
10     H ← H ⊕ (T, s, a);
11     schedule(End, T + Δ, T);
12   endcycle;

Figure 5: Relay Task of the second protocol

manner similar to the way we handle timing failures. Ignoring (for simplicity) the increase in message processing time due to authentication, we set the termination time of the third protocol to be the same as the termination time of the second protocol: Δ = π(δ + ε) + dδ + ε. The detailed implementation of our third protocol is given in Figures 6-12. We assume that each processor p possesses a signature function σ_p which, for any string of characters z, generates a string of characters y = σ_p(z) (called the signature of p on z). Every processor q knows the names of all other processors in the communication network, and for each p ∈ G, q has access to an authentication predicate Θ(z, p, y) which yields true if and only if y = σ_p(z). We assume that if processor q receives a message (z, p, y) from processor p, and Θ(z, p, y) is true, then p actually sent that message to q. (If the authentication predicate fails to detect message forgery, then our last protocol can no longer guarantee atomicity in the presence of Byzantine failures.) The proper selection of the σ_p and Θ functions for a given environment depends on the likely cause of message corruption. If the source of message corruption is unintentional (e.g., transmission errors due to random noise on a link or hardware malfunction), then simple signature and authentication functions like the error detecting/correcting codes studied in [PW] are appropriate.
If the source of message corruption is intentional, e.g., an act of sabotage, then more elaborate authentication schemes like those discussed in [RSA] should be used. In any case there is always a small but non-zero probability that a corrupted message will be accepted as authentic. In practice this probability can be reduced to acceptably small levels by choosing signature and authentication functions appropriate for the (adverse) environment in which a system is intended to operate. We implement message authentication by using three procedures "sign", "cosign", and "authenticate", and a new signed message data type "Smsg" (Figure 6). These are all described in a Pascal-like language supporting recursive type declaration and the exception mechanism of [C].
1 type Smsg = record
    case tag: (first, relayed) of
      first: (timestamp: Time; update: Text);
      relayed: (incoming: Smsg);
    procid: Processor-Name;
    signature: string;
  end;

Figure 6: The Signed-Message data type

1 procedure sign(in T: Time; a: Text; out z: Smsg);
2 begin z.tag ← 'first'; z.timestamp ← T;
3   z.update ← a; z.procid ← myid;
4   z.signature ← σ_myid(z.tag, T, a);
5 end;

Figure 7: The sign procedure

A signed message (of type Smsg) that has been signed by k processors p_1, ..., p_k has the structure

(relayed, ... (relayed, (first, T, a, p_1, s_1), p_2, s_2), ... p_k, s_k)
where T and a are the timestamp and update inserted by the message source p_1, and s_i, i ∈ 1, ..., k, are the signatures appended by the processors p_i that have accepted the message. The sign procedure (Figure 7) is invoked by the originator of a broadcast (T, s, a) to produce a message z containing the originator's signature. The co-sign procedure (Figure 8) is invoked by a processor r which forwards an incoming message z already signed by other processors; it yields a new message y with r's signature appended to the list of signatures on z. The authenticate procedure (Figure 9) verifies the authenticity of an incoming message. If no alteration of the original message content is detectable, it returns the timestamp and original update of the message as well as the sequence S of processor names that have signed the message. The identity of the initiator is the first element of the sequence, denoted first(S), and the
1 procedure co-sign(in z: Smsg; out y: Smsg);
2 begin y.tag ← 'relayed'; y.incoming ← z;
3   y.procid ← myid; y.signature ← σ_myid(y.tag, z);
4 end;

Figure 8: The co-sign procedure
1 procedure authenticate(in z: Smsg; out T: Time; a: Text; S: Sequence-of-Processor-Name) [forged];
2 begin
3   [z.tag = 'first' & ¬Θ((z.tag, z.timestamp, z.update), z.procid, z.signature): signal forged];
4   [z.tag = 'relayed' & ¬Θ((z.tag, z.incoming), z.procid, z.signature): signal forged];
5   if z.tag = 'first' then T ← z.timestamp; a ← z.update; S ← <>
6   else authenticate(z.incoming, T, a, S) [forged: signal forged] fi;
7   S ← append(S, z.procid);
8 end;

Figure 9: The authenticate procedure
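The sign/co-sign/authenticate procedures of Figures 7-9 can be sketched in Python. This is an illustrative sketch only: per-processor HMAC keys stand in for the signature functions σ_p and the predicate Θ (a symmetric-key stand-in, not the error-detecting codes or public-key schemes the text mentions), and all names (KEYS, Forged, _sig) are our assumptions.

```python
# Sketch of nested signed messages and recursive authentication. HMAC over a
# JSON encoding of the message body models sigma_p / Theta; verifiers here
# share the keys, which a real Byzantine-tolerant scheme would avoid.
import hashlib
import hmac
import json

KEYS = {}  # processor name -> secret key

def _sig(p, payload):
    """Signature of processor p on payload (stand-in for sigma_p)."""
    return hmac.new(KEYS[p], json.dumps(payload).encode(), hashlib.sha256).hexdigest()

def sign(p, T, a):
    """Originator p signs (T, a): the 'first' variant of Smsg."""
    body = ['first', T, a, p]
    return body + [_sig(p, body)]

def co_sign(r, z):
    """Relayer r wraps an already-signed message z: the 'relayed' variant."""
    body = ['relayed', z, r]
    return body + [_sig(r, body)]

class Forged(Exception):
    """Raised when a signature check fails (the 'forged' exception)."""

def authenticate(z):
    """Return (T, a, S), S being the signer sequence; raise Forged if corrupted."""
    if z[0] == 'first':
        _tag, T, a, p, s = z
        if not hmac.compare_digest(s, _sig(p, ['first', T, a, p])):
            raise Forged
        return T, a, [p]
    _tag, inner, p, s = z
    if not hmac.compare_digest(s, _sig(p, ['relayed', inner, p])):
        raise Forged
    T, a, S = authenticate(inner)   # recurse into the wrapped message
    return T, a, S + [p]
```

Note how the recursion mirrors the nested structure (relayed, ... (first, T, a, p_1, s_1) ...): each level checks one signature and appends one name to S.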
1 task Start;
2   const Δ = π(δ + ε) + dδ + ε;
3   var T: Time; a: Text; z: Smsg;
4   cycle take(a); T ← clock; sign(T, a, z);
5     send-all(z);
6     H ← H ⊕ (T, myid, a);
7     schedule(End, T + Δ, T);
8   endcycle;

Figure 10: Start Task of the third protocol

number of hops (i.e., the number of intermediate links) traversed by the message is the length of the sequence, denoted |S|. If the message is determined to be corrupted, the "forged" exception is signalled. Except for the change concerning the authentication of messages, the structure of the Start task of the third protocol (Figure 10) is the same as that of the second protocol. In order to handle the case in which a faulty processor broadcasts several updates with the same timestamp, the type of the history variable H is changed to var H: Time → (Processor-Name → (Text ∪ {θ})), where the symbol θ denotes a "null" update (θ ∉ Text). Specifically, if a processor receives several distinct updates with the same broadcast identifier, it associates the null update with that broadcast. Thus, a null update in the history is an indication of a faulty sender. The Relay task of the third protocol (Figure 11) works as follows. Upon receipt of a message (11.4), the message is checked for authenticity (11.5) and if corrupted,
 1 task Relay;
 2   const Δ = π(δ + ε) + dδ + ε;
 3   var T, U: Time; a: Text; s: Processor-Name; z: Smsg; S: Sequence-of-Processor-Name;
 4   cycle receive(z, l); U ← clock;
 5     authenticate(z, T, a, S) [forged: iterate];
 6     [duplicates(S): "duplicate signatures" iterate];
 7     [U < T − |S|ε: "too early" iterate];
 8     [U ≥ T + |S|(δ + ε): "too late" iterate];
 9     [U > T + Δ: "late message" iterate];
10     s ← first(S);
11     if T ∈ dom(H) & s ∈ dom(H(T)) then
12       [a = H(T)(s): "deja vu" iterate];
13       [H(T)(s) = θ: "faulty sender" iterate];
14       H(T)(s) ← θ;
15     else
16       H ← H ⊕ (T, s, a);
17       schedule(End, T + Δ, T);
18     fi;
19     cosign(z, y);
20     send-all-but(l, y);
21   endcycle;

Figure 11: Relay Task of the third protocol
1 task End(T: Time);
2   var p: Processor-Name; val: Processor-Name → Text;
3   val ← H(T);
4   while dom(val) ≠ {}
5   do p ← min(dom(val));
6     if val(p) ≠ θ then deliver(val(p)) fi;
7     val ← val \ p;
8   od;
9   H ← H \ T;
Figure 12: End Task of the third protocol

the message is discarded. Then, the sequence of signatures of the processors that have accepted the message is examined to ensure that there are no duplicates; if there are any duplicate signatures, the message is discarded (11.6). Since processor signatures are authenticated, the number of signatures |S| on a message can be trusted and can be used as a hop count in determining the timeliness of the message (11.7, 11.8). No confusions such as those illustrated in the previous counterexample can occur unless the authentication scheme is compromised. If the incoming message is authentic, has no duplicate signatures, and is timely, then the history variable H is examined to determine whether the message is the first of a new broadcast (11.11). If this is the case, the history variable H is updated with the information that the sender s = first(S) has sent update a at time T (11.16), the End task is scheduled to start processing and possibly delivering the received update at (local clock) time T + Δ (11.17), and the received message is cosigned and forwarded (11.19, 11.20). If the received update a has already been recorded in H (because it was received via an alternate path), it is discarded (11.12). If a is a second update for a broadcast identified by (T, s), then the sender must be faulty. This fact is recorded by setting H(T)(s) to the null update (11.14). The message is then cosigned and forwarded so that other correct processors also learn of the sender's failure (11.19, 11.20). Finally, if a is associated with a broadcast identifier to which H has already associated the null update (i.e., it is already known that the originator of the broadcast (T, s) is faulty), then the received update is simply discarded (11.13). The End task (Figure 12) delivers at local time T + Δ all updates broadcast correctly at time T.
If exactly one update has been accepted for a broadcast initiated at clock time T, then that update is delivered (12.6); otherwise no update is delivered. In either case, the updates associated with broadcasts (T, s), for all processors s, are deleted from H (12.9) to ensure H stays bounded.
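The history rules just described (11.11-11.17) can be sketched as a small Python function. The names (record, NULL, schedule_end) are ours, and the cosign-and-forward step is reduced to a boolean result indicating whether the caller should forward the message.

```python
# Sketch of the third protocol's history update: on a timely, authentic update
# `a` for broadcast (T, s), either record a new broadcast, drop a duplicate,
# or mark the sender faulty with the null update.

NULL = object()   # the "null" update theta (deliberately not a Text value)

def record(history, T, s, a, schedule_end, DELTA):
    """Update history for (T, s, a); return True iff the message should be
    cosigned and forwarded on the remaining links."""
    if T in history and s in history[T]:
        if history[T][s] == a:        # deja vu (11.12): drop
            return False
        if history[T][s] is NULL:     # sender already known faulty (11.13): drop
            return False
        history[T][s] = NULL          # second distinct update (11.14): faulty sender
        return True                   # forward so others learn of the failure
    history.setdefault(T, {})[s] = a  # first message of a new broadcast (11.16)
    schedule_end(T + DELTA, T)        # (11.17)
    return True
```

The End task then delivers only entries different from NULL, which is exactly the "exactly one update accepted" rule above.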
8
Performance: Messages and Termination Time
In the absence of failures, the initiator s of an atomic broadcast sends d_s messages to its neighbors, where d_s denotes the degree of s (i.e. the number of its adjacent links). Each processor q ≠ s that receives a message from a processor p sends d_q − 1 messages to all its neighbors (except p). Since the sum of all node degrees of a network is twice the number of network links, it follows that each atomic broadcast costs 2m − (n − 1) messages: one message for each link of a spanning tree in G and two messages (one in each direction) for all the other links in G. For example, an atomic broadcast among 8 processors arranged in a 3-dimensional cube requires 17 messages in the absence of failures. The message cost of a diffusion based atomic broadcast protocol compares favorably to that of a round-based Byzantine agreement protocol designed for a fully connected network. To achieve the requirement that in each round each processor communicates with every other processor, if the underlying physical network is not fully connected, a message routing service can be used to implement the "logically" fully connected network required by the round structure. The "logical" messages sent by processors are then implemented as sequences of (one-hop) messages sent among neighbors, using some message forwarding technique. The full connectivity required by a round structure has its cost: some of the messages sent in each round will be redundant. Indeed, if a "logical" message has to be sent from a processor s to a non-neighbor processor r, and p is the neighbor of s on the path to r selected by the message routing algorithm used, then the message s sends to p to be forwarded to r is redundant with the message that s sends to p for direct consumption. For the example of 8 processors arranged in a 3-dimensional cube, a round of logical messages sent by one processor to the 7 others costs 12 (one-hop) messages.
Thus, for π > 1, a round based agreement protocol tolerant of timing or authentication-detectable Byzantine failures sends in the absence of failures at least 12 + 7·12 = 216 messages, compared to the 17 messages needed by a diffusion based protocol for any π > 1. The termination time for an atomic broadcast depends on the network topology and on the class of failures to be tolerated. In the absence of information about the network topology except that the number of processors is bounded above by n, n − 1 can be taken to be an upper bound on π + d. Clock synchronization algorithms which can provide an ε close to dδ are investigated in [CAS], [DHSS]. For simplicity, we assume here an ε of (π + d)δ. Thus, for omission failures, the termination time of an atomic broadcast is linear in n: Δ = 2(π + d)δ is bounded above by 2(n − 1)δ. For timing and Byzantine failures, the termination time is proportional to the product of the number of processors and the number of processor failures to be tolerated: Δ = (π + 2)(π + d)δ is bounded above by (π + 2)(n − 1)δ. As a numerical example, consider the case of 8 processors arranged in some arbitrary way to form a network. Assume that the link delay bound δ is 0.01 seconds and that we want to tolerate up to two processor failures. The termination time for omission failures is 0.14 seconds, and for timing (or authentication-detectable Byzantine) failures is 0.28 seconds. If more information about network topology is available, then a better expression can be
computed for the network diffusion time dδ. Note that the expression π + d corresponds to a worst case path consisting of π hops between faulty processors followed by d hops along a shortest path in the surviving network of correct processors and links. For example, if the eight processors above are arranged in a 3-dimensional cube and we need tolerate no link failures, the approximate termination times for omission and timing (or authentication-detectable Byzantine) failures are cut to 0.10 and 0.20 seconds respectively. This is because π + d is bounded above by 5: if the two faulty processors are adjacent, then the diameter of the surviving network is at most 3; if they are not adjacent, the diameter can be 4, but 2 faulty processors cannot be encountered on a path before a correct processor is encountered. We now show that our protocols dominate round based protocols in speed. A straightforward translation into our system model of the algorithms designed for the simpler round-based model (fully connected network, exactly synchronized clocks) would require that each round include not only the worst case time for sending a message from any correct processor to any other, but also an extra delay corresponding to the worst case duration between the end of a round on one processor clock and the end of the same round on another processor clock. Thus, the length of a round is at least dδ + ε (in our example, dδ + (π + d)δ) clock time units. To tolerate π failures, a round based protocol needs at least π + 1 rounds, that is at least (π + 1)(π + 2d)δ time units. This time is always equal to or greater than the termination time (π + 2)(π + d)δ of a diffusion based protocol (with equality for a fully connected surviving network with d = 1).
For example, to atomically broadcast in a 3-dimensional cube with δ = 0.01 seconds despite up to two timing or authentication-detectable Byzantine failures, a round based protocol needs 0.3 seconds, compared to the 0.2 seconds sufficient for a diffusion based protocol.
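The numerical claims of this section can be checked directly; this small sketch (function names are ours) encodes the message-cost and termination-time formulas from the text:

```python
# Numeric check of this section's performance formulas. Only the formulas come
# from the text; the helper names are ours.

def diffusion_messages(n, m):
    """Failure-free message cost of a diffusion broadcast: 2m - (n - 1)."""
    return 2 * m - (n - 1)

def omission_termination(pi, d, delta):
    """Delta = 2(pi + d)delta, taking eps = (pi + d)delta."""
    return 2 * (pi + d) * delta

def byzantine_termination(pi, d, delta):
    """Delta = (pi + 2)(pi + d)delta for timing or auth-detectable Byzantine failures."""
    return (pi + 2) * (pi + d) * delta

# 3-cube: n = 8 processors, m = 12 links
print(diffusion_messages(8, 12))                    # 17
# Arbitrary 8-node topology: pi = 2, pi + d <= n - 1 = 7, delta = 0.01 s
print(round(omission_termination(2, 5, 0.01), 2))   # 0.14
print(round(byzantine_termination(2, 5, 0.01), 2))  # 0.28
# 3-cube with pi + d <= 5
print(round(omission_termination(2, 3, 0.01), 2))   # 0.1
print(round(byzantine_termination(2, 3, 0.01), 2))  # 0.2
```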
9
Conclusion
This paper presented an investigation of the atomic broadcast problem in a real-time model, proposed a classification of the failures observable in distributed systems, described three protocols for atomic broadcast in systems with bounded transmission delays and no partition failures, and discussed and contrasted their performance with that of round based protocols designed for a simpler model. Atomic broadcast simplifies the design of distributed fault-tolerant programs by enabling correct processes to access global state information in synchronous replicated storage. The beauty of this notion is that it reduces distributed programming to "shared storage" programming without having a single point of system failure. In the Highly Available Systems prototype, we use synchronous replicated storage to store crucial system configuration information that must remain available despite arbitrary numbers of processor crashes. Our practical programming experience with synchronous replicated storage indicates that it leads to very simple and elegant decentralized program structures. Because of its nested structure, the proposed failure classification provides a good
basis for comparing the "power" of various published fault-tolerant distributed protocols. One of the most frustrating aspects to the reader of papers that present fault-tolerant distributed protocols is that the class of failures tolerated is rarely precisely specified. This makes the comparison of different protocols for solving a given problem difficult. We feel that the adoption of the failure classification proposed in this paper as a standard would simplify the task of comparing the power of various fault-tolerant distributed algorithms. The three protocols presented above share the same specification and have the same diffusion-based structure. They differ in the classes of failures tolerated, ranging from omission failures to authentication-detectable Byzantine failures. Clearly, the complexity increases as more failures are tolerated, but the complexity of the final protocol that handles authentication-detectable Byzantine failures is not orders of magnitude greater than that of the initial protocol. A variant of this protocol (which uses error correcting codes to authenticate messages) has been implemented and runs on a prototype system designed by the Highly Available Systems project at the IBM Almaden Research Center [GS]. The experience accumulated during the implementation and test of this prototype demonstrates that the failures most likely to be observed in distributed systems based on general purpose operating systems such as VM or Unix are performance (or late timing) failures caused by random variations in system load. Our protocols are based on a relatively realistic communication model (arbitrary network topology, approximately synchronized clocks, unreliable communication links). Abandoning the simple rounds of communication model has led to better performance. Further improvements in performance may be obtained using a probabilistic clock synchronization approach like that of [Cri].
At the time when our protocols were invented (1983), we were unaware of other protocols for atomic broadcast designed for system models more realistic than those assumed in the Byzantine agreement literature [F], [LSP], [SD]. Since then, several other protocols for atomic broadcast in system models similar to ours have been proposed (e.g. [BJ], [BSD], [Ca], [CM], [D], [PG]). All protocols proposed so far can be divided into two classes: diffusion based protocols providing bounded termination times even in the presence of concurrent failures, and acknowledgement-based protocols that do not provide bounded termination times if failures occur during a broadcast. Examples of protocols in the first class (other than those given in this paper) are [BSD] and [PG]. Examples of acknowledgement-based protocols are [BJ], [Ca], [CM], and [D]. Although the acknowledgement-based protocols can tolerate the late timing failures that can cause a logical network partitioning for diffusion protocols, they provide the additional tolerance at the cost of sacrificing bounded termination time. We have investigated methods for detecting and reconciling inconsistencies caused by partitions in systems using diffusion based atomic broadcast (e.g. [SSCA]), but such "optimistic" approaches cannot be used in applications in which there are no natural compensation actions for the actions taken by some processors while their state was inconsistent with the state of other processors. The existence of these two classes of protocols poses a serious dilemma to distributed system designers:
either avoid network partitioning by using massive network redundancy and real-time operating systems to guarantee bounded reaction time to events in the presence of failures, or accept partitioning as an unavoidable evil (for example because the operating systems are not hard real-time) and abandon the requirement that a system should provide bounded reaction times to events when failures occur.
10
Acknowledgements
We would like to thank Shel Finkelstein, Joe Halpern, Nick Littlestone, Fred Schneider, Mario Schkolnik, Dale Skeen, Barbara Simons, and Irv Traiger for a number of useful comments and criticisms.
References [BSD]
O. Babaoglu, P. Stephenson, R. Drummond: "Reliable Broadcasts and Communication Models: Tradeoffs and Lower Bounds", Distributed Computing, No. 2, 1988, pp. 177-189.
[BJ]
K. Birman, T. Joseph: "Reliable Communication in the Presence of Failures", ACM Transactions on Computer Systems, Vol. 5, No. 1, February 1987, pp. 47-76.
[Ca]
R. Carr, "The Tandem Global Update Protocol", Tandem Systems Review, June 1985, pp. 74-85.
[C]
F. Cristian, "Correct and Robust Programs," IEEE Transactions on Software Engineering, vol. SE-10, no. 2, pp. 163-174, 1984.
[CAS]
F. Cristian, H. Aghili, and R. Strong, "Clock Synchronization in the Presence of Omission and Performance Faults, and Processor Joins," 16th Int. Conf. on Fault-Tolerant Computing, Vienna, Austria, 1986.
[CASD]
F. Cristian, H. Aghili, R. Strong, and D. Dolev, "Atomic Broadcast: from simple message diffusion to Byzantine agreement," IBM Research Report RJ5244, July 30, 1986.
[Cr]
F. Cristian, "Issues in the Design of Highly Available Computing Services," Invited paper, Annual Symposium of the Canadian Information Processing Society, Edmonton, Alberta, 1987, pp. 9-16 (also IBM Research Report RJ5856, July 1987).
[Cri]
F. Cristian, "Probabilistic Clock Synchronization", IBM Research Report RJ6432, September 1988 (also in Proc. 8th Int. Conf. on Distributed Computing, June 1989).
[CM]
J.M. Chang, and N.F. Maxemchuk, "Reliable Broadcast Protocols," ACM Transactions on Computer Systems, vol. 2, no. 3, pp. 251-273, 1984.
[D]
"The Delta-4: Overall System Specification", D. Powell, editor, January 1989.
[DS]
D. Dolev, and R. Strong, "Authenticated Algorithms for Byzantine Agreement," SIAM Journal on Computing, vol. 12, no. 4, pp. 656-666, 1983.
[DHSS]
D. Dolev, J. Halpern, B. Simons, and R. Strong, "Dynamic Fault-Tolerant Clock Synchronization," IBM Research Report RJ6722, March 3, 1989. See also "Dynamic Fault-Tolerant Clock Synchronization," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, 1984.
[F]
M. Fischer, "The Consensus Problem in Unreliable Distributed Systems," Proceedings of the International Conference on Foundations of Computing Theory, Sweden, 1983.
[GS]
A. Griefer, and H. R. Strong, "DCF: Distributed Communication with Fault-tolerance," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, 1988.
[L]
L. Lamport, "Using Time instead of Time-outs in Fault-Tolerant Systems," ACM Transactions on Programming Languages and Systems, vol. 6, no. 2, pp. 256-280, 1984.
[LSP]
L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem," ACM Transactions on Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, July 1982.
[PG]
F. Pittelli, H. Garcia-Molina, "Recovery in a Triple Modular Redundant Database System", Technical Report CS-076-87, Princeton University, January 1987.
[PW]
W. Peterson, and E. Weldon, Error-Correcting Codes (2nd Edition), MIT Press, Massachusetts, 1972.
[RSA]
R. Rivest, A. Shamir, and L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," CACM, vol. 21, no. 2, pp. 120-126, 1978.
[S]
F. Schneider: "Abstractions for Fault Tolerance in Distributed Systems," Invited paper, Proceedings IFIP Congress '86, September 1986.
[SC]
F. Schneider: "Understanding Protocols for Byzantine Clock Synchronization," Technical Report 87-859, Cornell University, August 1987.
[SD]
R. Strong, and D. Dolev, "Byzantine Agreement," Proceedings of COMPCON, Spring 1983.
[SSCA]
R. Strong, D. Skeen, F. Cristian, H. Aghili, "Handshake Protocols," 7th Int. Conf. on Distributed Computing, September 1987, pp. 521-528.
Randomized Agreement Protocols Michael Ben-Or Institute of Mathematics and Computer Science The Hebrew University, Jerusalem/Israel
Introduction

Reaching agreement in the presence of faults is one of the most important problems in fault-tolerant distributed computation, and it is also a beautiful example of the power of randomized algorithms. This problem, first introduced by Pease, Shostak and Lamport [PSL80] as the "Byzantine Agreement Problem", considers a situation in which each process P_i, i = 1, ..., n, holds an initial value M_i, and they have to agree on a common value M, such that (1) Consistency: All correct processes agree on the same value, and (2) Meaningfulness: If all correct processes hold the same initial value M_i = M, then the correct processes will agree on M. Moreover, these properties should hold even if some processes are maliciously faulty, and skillfully coordinate their actions so as to foil agreement among the correct processes. The problem of Byzantine Agreement has been studied extensively in the literature. In particular, Pease, Shostak and Lamport [PSL80] proved that a solution exists if and only if the number of faulty processes, f, satisfies f < n/3. Fischer and Lynch [LF82] (see also [DS82]) have shown that any deterministic solution tolerating f faults may need f + 1 communication rounds to reach agreement. Another important result of Fischer, Lynch and Paterson [FLP82] shows that the Byzantine Agreement problem has no deterministic solution in the asynchronous case. This impossibility result holds even if only a single process may fail by stopping. The negative results described above collapse when we consider randomized protocols. Allowing each process to sometimes use a coin flip to determine its next step, we can solve the Byzantine Agreement problem for asynchronous systems. Moreover, we can sometimes reach agreement within a constant expected number of rounds.
This proves that randomized protocols are strictly more powerful than deterministic protocols in the area of fault-tolerant distributed computation, and in some cases, can solve problems which have no deterministic solutions. Another important feature of these randomized protocols is that they are much simpler to describe and prove correct than most deterministic Byzantine Agreement protocols. In this paper, we describe the main ideas behind these randomized agreement protocols. This is not a survey of the many research papers on this topic, as we combine old and recent results to simplify the protocols presented here. In particular, Cryptography is not used throughout this paper. In an effort to highlight the main ideas and to avoid technical difficulties, we have adopted here weaker versions of the
best known results. We direct the reader to the forthcoming survey [CD], and to the original papers for details of the best solutions known.
The Model of Computation

In a deterministic system, we can model the worst case behavior by selecting in advance f processes to be faulty. This does not model the worst case behavior of a non-deterministic system. One cannot rule out the possibility that a process may become faulty following some particular sequence of coin flips but remain correct otherwise. The correct worst case assumption here is to assume that at any time during the computation, an adversary can select further processes to become faulty, provided that the total number of faulty processes does not exceed f. We allow our adversary to select faulty processes and coordinate their action using complete information about the system. The adversary cannot predict, of course, a future coin flip of some correct process. However, once this coin flip is performed, the adversary can base its action on the outcome. This strong notion of adversary models the fact that any change in the global state of the system may result in a different faulty behavior. In the next section, we present a randomized asynchronous Byzantine Agreement protocol that will reach agreement with probability 1 against any such adaptive adversary. Assuming that the adversary can adaptively select the faulty processes, but requiring that its decision be based only on information held locally by the current faulty processes, we obtain the "Weak Adversary" model. Here the adversary cannot access the internal state of, or listen to the messages passed between, correct processes. This model is a reasonable restriction of the general adversary if we assume that our processes reside in separate processors and, therefore, a local action of some correct process cannot affect the behavior of some remote faulty process. We shall present here fast agreement protocols for this weak adversary model tolerating f = O(n) faults.
The reader should note that care must be taken when applying the protocols in the weak adversary model. Consider a situation where each process is a compound process that resides on several processors, where different processes may share common processors. A failure of one processor may cause one process to fail while the other remains correct, tolerating this fault. In this case, the local action of a correct process can clearly affect the faulty one, and solutions in the weak adversary model may not be applied. In the next section, we describe a very simple randomized agreement protocol in the strong adversary model.
The Simple Two Threshold Scheme

Let P_1, ..., P_n be an asynchronous system of n processes where each process P_i can directly exchange messages with every other P_j. We assume that messages sent by a correct process P_i will eventually arrive at their destination but may experience
arbitrary finite delay (controlled by our adversary). To simplify the presentation of our protocol we assume that the maximal number of faulty processes f is less than n/16. A somewhat more complicated version tolerating up to f < n/3 faults can be found in [Br84]. Let M(i) be the input value of process P_i. Our protocol consists of rounds of Polling, where each P_i polls the other processes on their value of the message. After each such poll, P_i updates its value M(i) to be one of the initial messages or else the default value 0. Since we may have several polling rounds, we assume that each process adds the round number r to all its messages. Initially r = 0. Polling: For each process P_i:
Step 1: Set r := r + 1; Broadcast (M(i), r) (to all other processes); Collect incoming messages from this step until n - f messages arrive (including own value). Let N(i) be the maximal number of times a value appears among the incoming messages, and let

    Temp(i) = most common value, if N(i) ≥ n/2
    Temp(i) = 0 (default value), otherwise
Step 2: Broadcast an "End Poll r" message to all processes and wait for n - f such messages to arrive. (This delay step is needed, as we explain below.) Both Temp(i) and N(i) represent P_i's view of the poll. Since each process misses at most f messages from correct processes and receives at most f messages from bad processes, we have

Fact (I): For any two correct processes P_i and P_j, |N(i) - N(j)| ≤ 2f.
From this we get

Fact (II): If some correct P_i has N(i) > n/2 + 2f, then for any other correct P_j we have Temp(i) = Temp(j).

Our first goal is to speed up the agreement when the system is trying to reach agreement on a message M that has been broadcast by a correct process. In such a case, all the correct processes start the polling with the same message M(i) = M. Since n > 16f, each correct P_i will have N(i) ≥ n - 2f > n/2 + 4f, and set Temp(i) = M. We can therefore use

Rule (A): (High threshold) If N(i) > n/2 + 4f then set M(i) = Temp(i) and decide on M(i).¹

¹P_i decides on M but will continue to participate in the next round by broadcasting its value one more time.
While this rule guarantees that if all correct processes start with the same value they all decide in one round, it forces us to set a second rule to handle the case when this does not hold. Note that if some correct P_i decides on M by rule (A), then any other correct P_j must have N(j) > n/2 + 2f, and also Temp(j) = M. Setting

Rule (B): (Low threshold) If N(i) > n/2 + 2f then set M(i) = Temp(i) and continue to the next polling round.

ensures that if some correct P_i decides on M at some round, all other correct P_j will set M(j) = M at this same polling round. Therefore, coming into the next round all correct processes will start with the same value M and, by rule (A), will decide on this value within the next round. Rules (A) and (B) make sure that if some process decides, then all other processes will decide on the same value by the end of the next round. These rules do not cover the case of N(i) ≤ n/2 + 2f. Knowing that no deterministic rule can help, we make our first use of randomization.

Definition: Let 0 < p < 1/2 and let Coin be a distributed protocol that terminates within a constant number of communication rounds, in which each correct process P_i selects a bit b_i. We say that Coin is an f-resilient Global Coin Protocol with probability p if, for any adversary and any b ∈ {0, 1}, with probability at least p, all the correct processes P_i select b_i = b. Intuitively, a "global coin" may be viewed as a random and unpredictable bit over which the faulty processes, or in our model the adversary, do not have complete control. We continue the description of our agreement protocol assuming the existence of such a global coin protocol Coin, by adding the following step to the Polling protocol.

Step 3: Run the protocol Coin, and let b_i be the bit selected by P_i.

We can now complete the description of our randomized agreement protocol by adding
Rule (C): If N(i) ≤ n/2 + 2f then set

    M(i) = Temp(i), if b_i = 1
    M(i) = 0,       if b_i = 0
and continue to the next round. The full Byzantine agreement protocol can now be described as follows: each P_i repeats the polling procedure, followed by the coin flipping protocol Coin, then sets the new value of M(i) according to rules (A), (B) or (C) as appropriate, and returns to polling with the new value, until a decision is reached.
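The per-round update rules above can be sketched as follows. This is a minimal illustration, not the authors' code: the name `update_value` is our own, and message collection and the Coin protocol are assumed to happen elsewhere.

```python
from collections import Counter

def update_value(messages, n, f, coin_bit):
    """One polling round's update for process i. messages: the n - f values
    collected (including its own); coin_bit: i's bit from protocol Coin.
    Returns (new value of M(i), decided?)."""
    counts = Counter(messages)
    value, N_i = counts.most_common(1)[0]      # most common value and its count N(i)
    temp = value if N_i >= n / 2 else 0        # Temp(i), default 0
    if N_i > n / 2 + 4 * f:                    # Rule (A): high threshold -> decide
        return temp, True
    if N_i > n / 2 + 2 * f:                    # Rule (B): low threshold -> adopt value
        return temp, False
    return (temp if coin_bit == 1 else 0), False   # Rule (C): follow the global coin
```

For example, with n = 17 and f = 1, a process that sees the same value in all 16 collected messages clears the high threshold n/2 + 4f and decides at once.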
Theorem 1: Let f < n/16, and let Coin be a global coin protocol with probability p > 0. The protocol described above guarantees that all correct processes will reach agreement with probability 1 (against any adaptive adversary!). Furthermore, the expected number of rounds to reach agreement is O(1/p).
Proof: By the remarks above it is clear that if some correct process decides on some value, then all other processes will decide on the same value by the next round. In a similar manner, if for some round r at least n - 2f correct processes begin the round with the same value, then any correct process will tally this value at least n - 4f > n/2 + 4f times, and therefore, by rule (A), decide on this value during this round. In particular, if all correct processes begin with a common value, all processes will decide (within one round) on that value. It remains to show that the probability that the adversary can prevent agreement for infinitely many rounds is 0. Let P_i be the first process to finish its r-th iteration of Polling. At this time, P_i has accumulated n - f notices from other processes that they have finished step 1 of the poll and have gone on to step 2. Of these, at least n - 2f are correct. Therefore no correct process will begin the coin flipping protocol before n - 2f correct processes have finished their poll. (This is the reason for the delay in step 2. The adversary can delay the messages to at most f correct processes, so that their poll will be determined after the execution of Coin, and their poll may therefore depend on Coin's outcome.) First assume that among these n - 2f correct processes some P is now forced to set its value to M by rule (B). By Fact (II), all other processes P_j have Temp(j) = M. By our assumption on Coin, with probability at least p all the correct processes P_i will have b_i = 1. In this case, according to all our rules, including rule (C), all the processes begin the next round with the same value M, and will therefore decide on M by round r + 1. Otherwise, before any correct process starts the protocol Coin, we know that at least n - 2f correct processes will set their value by rule (C). By our assumption on Coin, with probability at least p the coin will be a unanimous 0.
By rule (C), in the next round we have at least n - 2f correct processes that will broadcast the value 0, and all correct processes will therefore decide on the value 0 by round r + 1. We have seen that for any adversary schedule, on each round r, with probability at least p all correct processes reach agreement by round r + 1. Therefore the probability of not reaching agreement after r + 1 rounds is less than (1 - p)^r → 0 as r → ∞, and the expected number of rounds to reach agreement is O(1/p). QED.
A Simple Global Coin Protocol

We now complete our simple two threshold scheme by presenting a very simple global coin protocol that is resilient against any adaptive adversary. Assuming that each process can flip a local unbiased coin, our goal is to generate a global coin from the local coins of the processes, while minimizing the influence of the faulty processes, or the adversary, on the outcome. A simple way to achieve this goal is to let each
process generate and broadcast a random bit, and take the majority of these values as a "global coin".

Protocol Simple-Coin

Each process P_i:
Step 1: Generate an unbiased bit r_i and broadcast r_i to all other processes.

Step 2: Wait for n - f of the r_j values. Set b_i = majority of the r_j's. Output b_i.

We have

Lemma 1:
(a) Let f < n/16. Simple-Coin is an f-resilient global coin protocol with probability p ≥ 1/2^n.

(b) If f = O(√n) then Simple-Coin is an f-resilient protocol with probability p_0 > 0, where p_0 is a constant not depending on n.

Proof: (a) Let b ∈ {0, 1}. If all the correct processes flip the value r_i = b, then in step 2 they will all compute the same majority b_i = b. Since this event has probability at least p = 1/2^n, we are done. To prove (b) we note that the expected deviation from n/2 among n independent coin flips is O(√n). Therefore if f = c√n and b ∈ {0, 1}, there is some constant probability p_0 = p(c) that among n independent coin flips the value b will appear more than n/2 + 2f times. If this happens, then in step 2 all the correct processes will set b_i = b. QED.

Combining Theorem 1 and Lemma 1 we have

Theorem 2: Let f < n/16.
There is a completely asynchronous f-resilient randomized Byzantine agreement protocol that guarantees agreement with probability 1 against any adaptive adversary that acts with complete information. Furthermore, if f = O(√n) then the expected number of rounds to reach agreement is constant.
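The Simple-Coin protocol can be sketched from a single process's viewpoint as follows. This is our own illustrative code: it uses a benign schedule in which the process happens to see the first n - f broadcast bits, and does not model the adversarial scheduling analyzed above.

```python
import random

def simple_coin(n, f, rng=random):
    """One process's view of Simple-Coin: every process flips a local unbiased
    bit; this process waits for n - f of them (here simply the first n - f)
    and outputs their majority as its global coin value b_i."""
    bits = [rng.randint(0, 1) for _ in range(n)]   # each process's local flip
    seen = bits[: n - f]                           # the n - f values that arrive
    return 1 if sum(seen) * 2 > len(seen) else 0   # majority (ties -> 0)
```

When all local flips happen to agree (probability at least 1/2^n per value), every process sees the same majority regardless of which n - f bits it receives.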
This randomized solution is by no means practical. Since p in our proof can be as small as 2^(-n), the expected number of rounds to reach agreement may be exponential! Nevertheless, this does show that randomized algorithms give us the power to solve problems that have no deterministic solution, and it raises the possibility of better randomized solutions. Our protocol behaves much better when the number of faulty processes f is bounded by O(√n). In fact, for synchronous systems, our protocol will still have
the same worst case constant expected time, while any deterministic protocol may require up to f + 1 rounds to reach agreement. The Simple-Coin protocol described above is resilient to an adversary that selects and coordinates the action of the faulty processes based on complete information about the system, including the internal states of all the processes. Much simpler and more efficient "global coin" protocols exist if we severely restrict the adversary's information, thereby restricting the type of faults we allow. We can, for example, hand out to all the processes a common long list of random bits as part of their protocol, and use the r-th bit as the r-th global coin flip. This would provide a reasonable solution if we could assume that the faults in the system do not depend in any way on the values on this list. An even simpler solution is to use a common random number generator as our "global coin". Under this assumption we have an unbiased global coin and therefore a constant expected time agreement protocol "tolerating" f = O(n) faulty processes. It is clear that our protocol is no longer valid if we allow the faulty processes to act based on future "coin flips". It is therefore hard to justify such simple minded Coin protocols, because we must make the unreasonable assumption that the faulty processes, though not acting according to the protocol, do not make use of some information they have. In the next section we show, following Rabin [Ra83], how to distribute pre-dealt random coins among all the n processes in such a way that any set of at most f < n/5 faulty processes will not have any information about the r-th coin before it is time to reveal it.
Secret Sharing Protocol

Let s be a secret. We want to share the secret s among n processes so that (1) any set of at most f processes will have no information about s, and (2) the secret can be reconstructed from all the pieces even if some of the pieces have been changed by faulty processes. To this end, let p > n be a prime number and let Z_p = {0, ..., p-1} be the field of p elements with addition and multiplication modulo p. Let g ∈ Z_p be a primitive element of Z_p, so that the points a_k = g^(k-1), k = 1, ..., n, are all distinct. To share a secret s ∈ Z_p, the dealer selects a random f-degree polynomial S(z) whose constant term is S(0) = s. The dealer then hands process P_k the value s_k = S(a_k) [Sh79,BGW88]. It is easy to see that any set of f players has no information about the secret, because the values they hold, together with any value of the secret, define a unique f-degree polynomial interpolating through these f + 1 points. Moreover, by our special choice of the evaluation points a_k, the possible sequences (s_1, ..., s_n) form a generalized BCH error correcting code that can correct up to (n - f - 1)/2 wrong s_k (see [PW72]). Thus, for f < n/3 we can correct f errors, and for f < n/5 we can correct up to 2f errors. To reveal the secret, all the processes broadcast their values.
Each correct process receives at most f wrong values and can therefore correct the errors, using one of the well-known error correction algorithms [PW72, page 286], and compute the correct interpolating polynomial to recover the secret s. From this point on we shall restrict our attention to the weak adversary model. As discussed above, instead of handing the random bits directly to all the processes, we can prepare a long list of random bits and share their values using the robust secret-sharing scheme described above. By our assumptions, the weak adversary has no information about the value of the next bit on the list before some correct process begins the secret recovery procedure and broadcasts its value. Combining this with Theorem 1 we have

Theorem 3: (Weak adversary, pre-dealt coins) In a properly initialized system, under the weak adversary model, there is a constant expected number of rounds Byzantine Agreement protocol tolerating f = O(n) faults.

Theorem 3 does not address the problem of generating unbiased secret coins by the system itself. We must therefore rely on a trusted "dealer" to prepare in advance enough secret coins for the lifetime of the system. In the next section we describe how the system of processors can prepare the needed unbiased random coins without any outside help.
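The sharing and (error-free) reconstruction steps can be sketched as follows. This is our own illustration: the small field p = 31 and primitive element g = 3 are chosen for readability, and the BCH error correction step is omitted, so reconstruction here assumes f + 1 correct shares.

```python
import random

P, G = 31, 3   # illustrative prime field and primitive element; real use needs p > n

def share(secret, n, f, rng=random):
    """Deal shares s_k = S(a_k), a_k = G^(k-1), of a random degree-f
    polynomial S with constant term S(0) = secret, all mod P."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(f)]
    def S(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [S(pow(G, k - 1, P)) for k in range(1, n + 1)]

def reconstruct(points):
    """Lagrange interpolation at 0 from f + 1 error-free points (a_k, s_k)."""
    secret = 0
    for j, (xj, sj) in enumerate(points):
        num = den = 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        secret = (secret + sj * num * pow(den, P - 2, P)) % P   # den^-1 mod P
    return secret
```

A full implementation would first run the error correcting decoder of [PW72] on the broadcast shares before interpolating.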
Generating Secret Coins

To generate an unbiased coin we let each process P_i select a secret random number s_i, 0 ≤ s_i < p, and share this secret s_i among all the other processes using the secret sharing procedure described above. Having done this, we can at a later step of the protocol reveal all the secrets and take the parity of the sum (modulo p) of all the secrets as our common "global coin". If all the secrets are properly shared, the sum will be random if at least one correct process selected its number randomly, and therefore our global coin will be unbiased. Before attempting to use these secrets we must verify that the secret shares sent by a possibly faulty process D are shares of a real secret and not some n random numbers. We want to do so without revealing any information about the secret itself. This is easily done using the following zero knowledge proof technique, first introduced by [GMR85]. Let us assume that the system can reach Byzantine Agreement, and let S(z) be the f-degree polynomial used by process D to share the secret s. The dealer of the secret, D, generates n additional random f-degree polynomials R_1, ..., R_n, and sends to each P_k its share of the secret s_k = S(a_k), along with the values r_{j,k} = R_j(a_k), j = 1, ..., n, of all the polynomials at the point a_k. At this point, each process P_k picks a random bit b_k ∈ {0, 1} and broadcasts its value to all the other processes. After reaching Byzantine Agreement on the values of all the broadcast bits b_k (missing values take the default value 0), the dealer of the secret, D, broadcasts the polynomials F_j(z), where F_j(z) = R_j(z) if b_j = 0 and F_j(z) = S(z) + R_j(z) if b_j = 1, for j = 1, ..., n (second Byzantine Agreement). At this point each process
P_k checks that the polynomials F_j(z) are of degree at most f, and that at the point a_k the shares it received satisfy the required equations, that is, r_{j,k} + b_j s_k = F_j(a_k) for all j. If some P_k finds an error, it broadcasts a complaint against the dealer D (third Byzantine Agreement). If f + 1 or more processes file a complaint, we decide that the dealer D is faulty and take the default value 0 as the value of its secret and all its shares.

Claim: Let G be the set of correct processes that did not complain against D's secret. Let S_G(z) be the interpolation polynomial through the points in G of the original polynomial S(z). Then with probability exponentially close to 1, S_G(z) is a polynomial of degree ≤ f.
Proof: (Sketch) Let R_j^G(z) be the interpolation polynomial through the points in G of R_j(z). Assume that deg S_G > f. Then if deg R_j^G ≤ f and b_j = 1, no polynomial of degree f will fit the points r_{j,k} + b_j s_k for k ∈ G. If, on the other hand, deg R_j^G > f and b_j = 0, then there is no f-degree polynomial fitting all the points r_{j,k} for k ∈ G. Thus if b_j was picked at random, with probability at least 1/2 some of the processes in G would file a complaint. Since at least n - f of the b_k were selected randomly by correct processes, the probability that deg S_G > f is exponentially small. QED.

Note that if n ≥ 5f + 1 then our secret sharing scheme can correct up to 2f errors. If a secret is accepted by the system, then at most f good processes may have values not on the polynomial S_G(z). This, together with at most f additional wrong values coming from the faulty processes, gives altogether at most 2f errors. Thus, in this case the secret is well defined (as the free coefficient of S_G) and there is a simple procedure to recover its value from the shares using the error correcting procedure. Our scheme to generate unbiased secret coins requires, in itself, three rounds of Byzantine Agreements. It is therefore hard to see how such a scheme can be of any use for the Byzantine Agreement itself. The scheme is helpful only because we can prepare many unbiased coins at a time. By distributing m secrets together we can prepare m independent secret coins using the same three rounds of Byzantine Agreements. This leads to the following modification of Theorem 3:

• Initialize the system with m = 100n² secret random coins.
• Use these coins to run the fast agreement protocol of Theorem 3.
• Whenever the number of random coins falls below m/2, generate in parallel m additional secret coins. This will take only a constant expected number of communication rounds, using some of the remaining m/2 random coins.
It is easy to see that the probability of ever running out of coins is exponentially small (in m). If this ever happens, we can revert to slower Byzantine Agreement protocols [PSL80,Be83] to generate m random coins again, bringing the system back to its initialized state. Likewise, if there is no "trusted dealer" to initialize the system,
we can start up the system with a slow initialization stage, prepare m coins, and only then start the system. Summarizing this section we have

Theorem 4: (Weak Adversary) After an initialization stage by the system or by a trusted dealer, there is a constant expected number of rounds Byzantine Agreement protocol tolerating f = O(n) faults.
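Process P_k's local verification of the dealer's broadcast polynomials can be sketched as follows, in our own notation: polynomials as coefficient lists, arithmetic mod p, and the function name `check_share` is illustrative.

```python
def check_share(F_j, a_k, s_k, r_jk, b_j, f, p):
    """P_k's check on one broadcast polynomial F_j (either R_j or S + R_j).
    F_j: coefficient list, constant term first. Returns True iff
    deg F_j <= f and r_{j,k} + b_j * s_k == F_j(a_k) (mod p)."""
    if len(F_j) > f + 1:                     # degree must be at most f
        return False
    lhs = (r_jk + b_j * s_k) % p             # what the shares predict
    rhs = sum(c * pow(a_k, i, p) for i, c in enumerate(F_j)) % p
    return lhs == rhs
```

For instance, over Z_31 with f = 1, S(z) = 2 + 3z and R_j(z) = 4 + 5z, the point a_k = 3 gives s_k = 11 and r_{j,k} = 19, and both challenge answers (F_j = R_j for b_j = 0, F_j = S + R_j for b_j = 1) pass the check, while a tampered polynomial fails it.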
The Feldman-Micali Protocol
The problem of initializing the system in a constant number of rounds without the aid of a "trusted dealer" was recently solved by Feldman and Micali [FM88]. In this section we briefly describe the main ideas underlying this beautiful protocol. The reader can find the full details and proof of this protocol in the original papers [FM88] and [F88]. Returning to the original Two Threshold scheme, we recall that if process P sends a message to all other processes, then in just one more round we have (a) if P is correct, then all correct processes agree on that message, and (b) even if P is faulty, if any correct process accepts the message, then all other processes receive the value of this same message (but may or may not accept it). Running this constant round protocol, we say that process P has Cast the message to all other processes. In a similar way, we can replace the broadcasts, and the needed Byzantine Agreements, in our secret sharing protocol by the simple constant round Casting protocol. Doing this carefully we get a constant round Secret-Casting protocol, such that if P distributes a secret to all the players then (a) if P is correct, all the processes agree that a good secret has been shared, and (b) even if P is faulty, if any good process accepts the secret as good, then all the good players know that there is some well defined secret, and furthermore, this unique secret is recoverable from all the pieces. The crucial idea of the constant round initialization protocol is how to use the Casting and Secret-Casting protocols to generate a Common Global Coin protocol. To generate the coin, the processes all participate in an election where each process is assigned a random number in the range [0, n³], and the process with the smallest number is selected. As the common global coin we can take the parity of this smallest number. To guarantee that the number assigned to each process P is indeed random, all the processes participate in the choice of this number.
Each process selects a random number in the range [0, n³] and uses the Secret-Casting protocol to share its pieces among the other processes. At this point our process P selects n - f secrets that it has accepted, and announces its selection using the Casting protocol. As P's number we take the sum of the secrets it has selected, modulo n³ + 1. This is done concurrently for each process P. A good player will accept P's selection only if it knows that all the secrets on the list are well defined. At this point all the shares of the secrets are opened, the secrets are recovered, and our global coin is determined. Note that if P is correct, then the secrets it included in its list are known to all the players and they are all recoverable; therefore P's number is well defined. If P is faulty, its number may not be well defined if it includes nonrecoverable secrets. A correct process will accept P's selection only if all the secrets on the list are recoverable; therefore all the correct processes that do accept P's choice get the same value. Since each accepted list must contain n - f recoverable secrets, some of which were randomly selected by correct processes, the number assigned to P is indeed random. Since all the numbers assigned to all the processes are random, with probability at least 1 - f/n a correct process's number will be the minimum. In this case all the processes have the same minimal number and therefore the same common coin. In the unlucky case where a faulty process draws the minimal number, some of the processes may not receive this number and may come up with a different coin. Since this happens only with probability less than f/n, combining this Global Coin Protocol with Theorem 1 we finally have

Theorem 5: (Weak Adversary) There is a constant expected number of rounds Byzantine Agreement protocol tolerating f = O(n) faults.

The Feldman and Micali protocol is much more complicated than the simple protocol of, say, Theorem 3, and the constants involved are considerably bigger.
It is therefore best to use this fast agreement protocol to perform the initialization step needed in Theorem 4, and after this first stage continue with the protocol of Theorem 4.
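The final tally of the election step can be sketched as follows. This is a toy illustration of the last step only: Casting, Secret-Casting and recovery of the secrets are assumed to have already produced each process's well defined number, and the function names are our own.

```python
def assigned_number(selected_secrets, modulus):
    """P's number: the sum of the n - f secrets it selected, mod the range size."""
    return sum(selected_secrets) % modulus

def election(numbers):
    """numbers: dict mapping process id -> its recovered random number.
    Returns the elected process and the global coin (parity of the minimum)."""
    winner = min(numbers, key=lambda p: numbers[p])
    return winner, numbers[winner] % 2
```

Since each accepted list contains secrets randomly chosen by correct processes, each process's number is uniform, and with probability at least 1 - f/n the minimum belongs to a correct process, so all processes see the same winner and the same coin.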
Final Remarks

We have presented here the main ideas underlying the best to date randomized agreement protocols for the strong and weak adversary models. In the weak adversary model, these protocols give us the added benefit of being able to generate within the system a global unbiased coin. This by itself is an important and nontrivial task that may find other applications. In particular, just as the power of Byzantine Agreement allows the system to carry out any deterministic computation in a consistent way, the ability to generate a global unbiased coin provides the system with the ability to carry out any randomized computation. In the strong adversary model our solution is much less satisfactory. For constant expected time agreement our solution tolerates only f = O(√n) faults, leaving the question of fast agreement when f = O(n) open to further research.
References

[Be83] M. Ben-Or, Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols, Proc. 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 27-30, 1983.
[BGW88] M. Ben-Or, S. Goldwasser and A. Wigderson, Completeness Theorems for Non-Cryptographic Fault-Tolerant Computation, Proc. 20th Annual ACM Symposium on Theory of Computing, pp. 1-10, 1988.

[Br84] G. Bracha, An Asynchronous (n-1)/3-Resilient Consensus Protocol, Proc. 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 154-162, 1984.
[CD] B. Chor and C. Dwork, Randomization in Byzantine Agreement, to appear.
[DS82] D. Dolev and R. Strong, Polynomial Algorithms for Multiple Processor Agreement, Proc. 14th Annual ACM Symposium on Theory of Computing, pp. 401-407, 1982.
[F88] P. Feldman, Optimal Algorithms for Byzantine Agreement, MIT Ph.D. Thesis, 1988.
[FM88] P. Feldman and S. Micali, Optimal Algorithms for Byzantine Agreement, Proc. 20th Annual ACM Symposium on Theory of Computing, pp. 148-161, 1988.
[FLP83] M. Fischer, N. Lynch and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process, JACM 32, pp. 374-382, 1985.
[LF82] M. Fischer and N. Lynch, A Lower Bound for the Time to Assure Interactive Consistency, Information Processing Letters 14, pp. 183-186, 1982.
[GMR85] S. Goldwasser, S. Micali and C. Rackoff, The Knowledge Complexity of Interactive Proof Systems, Proc. 17th Annual ACM Symposium on Theory of Computing, pp. 291-304, 1985.

[PSL80] M. Pease, R. Shostak and L. Lamport, Reaching Agreement in the Presence of Faults, JACM 27, pp. 228-234, 1980.
[PW72] W. W. Peterson and E. J. Weldon, Error Correcting Codes, Second Ed., MIT Press, 1972.
[Ra83] M. Rabin, Randomized Byzantine Generals, Proc. 24th Annual Symposium on Foundations of Computer Science, pp. 403-409, 1983.
[Sh79] A. Shamir, How to Share a Secret, CACM 22, pp. 612-613, 1979.
An Overview of Clock Synchronization Barbara Simons, IBM Almaden Research Center Jennifer Lundelius Welch, GTE Laboratories Incorporated Nancy Lynch, MIT
1 Introduction

A distributed system consists of a set of processors that communicate by message transmission and that do not have access to a central clock. Nonetheless, it is frequently necessary for the processors to obtain some common notion of time, where "time" can mean either an approximation to real time or simply an integer-valued counter. The technique that is used to coordinate the notion of time is known as
clock synchronization. Synchronized clocks are useful for many reasons. Often a distributed system is designed to realize some synchronized behavior, especially in real-time processing in factories, aircraft, space vehicles, and military applications. If clocks are synchronized, algorithms can proceed in "rounds", and algorithms that are designed for a synchronous system can be employed. In database systems, version management and concurrency control depend on being able to assign timestamps and version numbers to files or other entities. Some algorithms that use timeouts, such as communication protocols, are very time-dependent. One strategy for keeping clocks synchronized is to give each processor a receiver and to use time signals sent by satellite. There are obvious questions of reliability and cost with this scheme. An alternative approach is to use software and to design synchronization algorithms. This paper discusses the software approach to clock synchronization, using deterministic algorithms. The results surveyed in this paper are classified according to whether the distributed system being modeled is asynchronous or partially synchronous, reliable or unreliable. An asynchronous model is one in which relative processor speeds and message delivery times are unbounded. Partially synchronous can be interpreted in several ways: processors may have real-time clocks that are approximately the same, or that move at about the same rate, or that drift slightly. The message delivery time may always be within some bounds, or it may follow a probability distribution. A reliable system is one in which all components are assumed to operate correctly. In an unreliable system, communication faults such as sporadic message losses and link failures may occur, or processors may exhibit a range of faulty behavior.
85 This paper presents some of the theoretical results involving clock synchronization. A more thorough discussion of our basic assumptions and definitions, especially concerning faults, is contained in section 2. In section 3 we discuss the completely asynchronous, reliable model. Section 4 deals with asynchronous, unreliable models. In section 5, we discuss partially synchronous, reliable models. Section 6 is the longest and contains descriptions of several algorithms to synchronize clocks in some partially synchronous, unreliable models. In section 7 some problems closely related to the clock synchronization problem of the previous section are mentioned. We close with open problems in section 8.
2 Basic Assumptions

We assume that we are given a distributed system, called a network, of n processors (or nodes) connected by communication links. The processors do not have access to a source of random numbers, thus ruling out probabilistic algorithms. We allow the network to have up to f faults, where a fault can be either a faulty processor or a faulty link. We say that a system is reliable if f is always 0. Otherwise, the system is unreliable or faulty. Although there is some work on fault tolerance that distinguishes between node faults and link faults (e.g., see [DHSS]), for simplicity we shall assume that only node faults occur. If a link is faulty, we can arbitrarily choose one of the two nodes that are the endpoints of the faulty link and label that node as faulty. This is clearly a conservative assumption, since the node that is selected to be faulty might be the endpoint of many nonfaulty links, all of which are now considered faulty. Having limited ourselves to node faults, there remains a variety of different models in which to work. The simplest of these models, called fail safe, is based on the assumption that the only type of failure is a processor crash. There is the further assumption that just before a processor crashes, it informs the system that it is about to crash. This is the only model in which the faulty processor is thoughtful enough to so inform the others. A more insidious form of failure is unannounced processor crashes, sometimes called a fail-stop fault. Next in the hierarchy of faults is the omission fault model. In this case a processor might simply omit sending or relaying a message. A processor that has crashed will of course omit sending all its messages.
Timing faults can be more complicated than omission faults, especially when dealing with the problem of clock synchronization. The class of timing faults is itself divided into the subcases of only late messages, and of both early and late messages. For many systems the types of faults most frequently encountered are processor crashes (without necessarily notifying the other processors), omission faults, and late timing faults. Finally, a fault that does not fall into any of the above categories is called a
Byzantine fault. (For a more thorough discussion of Byzantine faults, see the article by Dolev and Strong in this book.) This includes faults that might appear to the outside observer to be malicious. For an example of such a fault that brought down the ARPANET for several hours, see the article by Cohn in this book.
3 Asynchronous Reliable Model
We assume in this section that message delays are unbounded, and that neither processors nor the message delivery system is faulty. For this environment we examine the differences caused by whether relative processor speeds are lockstep or unbounded, i.e., whether processors are synchronous or asynchronous. Lamport [L1] presents a simple algorithm allowing asynchronous processors to maintain a discrete clock that remains consistent with communication. When processor i sends a message to processor j, i tags the message with the current time on i's clock, say t_i. Processor j receives the message at time t_j on its own clock. If t_j < t_i, processor j updates its clock to read time t_i. Otherwise, processor j does nothing to its clock. Note that this algorithm depends heavily on the assumption that there are no faults in the system, since clearly a faulty processor could force correct processors to set their clocks to arbitrary times. The Lamport algorithm can be used to assign timestamps for version management. It can also provide a total order on events in a distributed system, which is useful for solving many problems, such as mutual exclusion [L1]. The power of local processor clocks in an otherwise asynchronous system is further explored by Arjomandi, Fischer and Lynch [AFL]. They prove that there is an inherent difference in the time required to solve a simple problem, depending on whether or not processors are synchronous (i.e., whether or not processors have synchronized clocks). The problem is that of synchronizing output events in real time: there is a sequence of events, each of which must occur at each processor and each taking unit time, with the constraint that event i cannot occur at any processor until event i - 1 has occurred at all processors. With synchronous processors, the time for k events is k, and no communication is needed. With asynchronous processors, a tight bound on the time for k events is k times the diameter of the network.
Note that since Lamport clocks can be used to make a completely asynchronous system appear to the processors to have synchronous processors, the problem presented in [AFL] is of necessity one of external synchronization.
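The Lamport update rule described above can be sketched as follows (a minimal sketch; the class and method names are ours, not from [L1]):

```python
class LamportClock:
    """Discrete logical clock maintained by one processor, following the
    rule described above (a sketch of [L1], not Lamport's own code)."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Each local step advances the clock by one tick.
        self.time += 1
        return self.time

    def send(self):
        # Tag an outgoing message with the current local time.
        self.time += 1
        return self.time

    def receive(self, tag):
        # If the local clock is behind the tag on an arriving message,
        # jump forward to the tag; otherwise leave the clock alone.
        if self.time < tag:
            self.time = tag
        return self.time
```

Because every message moves the receiver's clock at least up to the sender's send time, the clock values define an order consistent with communication; breaking ties by processor identifier yields the total order on events used, for example, for mutual exclusion in [L1].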
4 Asynchronous Unreliable Models
Devising algorithms for a model in which faults may occur can be much more difficult than devising algorithms for the comparable reliable model. In fact, there might not even exist an algorithm for the unreliable version, as is the case for the agreement problem [FLP]. In particular, it is possible for all (correct) processors to reach agreement on some value in an asynchronous reliable model, but not in an asynchronous
unreliable one. By contrast, there exist methods [A] to convert algorithms designed for a synchronous reliable system into algorithms that are correct for an asynchronous reliable system. Welch [W] has shown that a system with asynchronous processors and asynchronous reliable communication can simulate a system with synchronous processors and asynchronous reliable communication, in the presence of various kinds of processor faults. The method used in the simulation is a variant of Lamport clocks: each message is tagged with the sender's time, and the recipient of a message delays processing the message until its local time is past the time tag on the message. One application of this simulation is that the result of Dolev, Dwork, and Stockmeyer [DDS], that the agreement problem is impossible in an unreliable model with synchronous processors and asynchronous communication, follows directly from the result of Fischer, Lynch, and Paterson [FLP], that agreement is impossible in an unreliable model in which both processors and communication are asynchronous. (Neiger and Toueg [NT] independently developed the same simulation, but they did not consider faults, and they studied different problems.) A subtle point is determining exactly what is preserved by this transformation. (Cf. [NT] for the fault-free case.) Since a partially synchronous system and an asynchronous system appear quite different when viewed externally, the behavior preserved by this simulation is that which is observed locally by the processors. Thus, the transformation cannot be used in the asynchronous model to create simultaneous events at remote processors, even though this is easy to do in the model with synchronous processors and asynchronous communication. It is also possible to design Lamport-like clocks for an asynchronous system that tolerate some number, say f, of Byzantine faults.
A common technique is to wait until hearing from f + 1 (or all but f) of the processors that time i has passed, before setting one's clock to time i + 1. This type of clock imposes a round structure on an asynchronous computation, and is used in some probabilistic agreement algorithms. (See the article by Ben-Or in this book, and also [Be, Br].) Dwork, Lynch and Stockmeyer [DLS] solve the agreement problem in unreliable models that lie somewhere between strictly asynchronous and synchronous. Their algorithms use interesting discrete clocks reminiscent of Lamport clocks, but more complicated.
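The round-advance rule just described can be sketched as follows (function and parameter names are ours; real algorithms embed this test in a message protocol):

```python
def next_round(current, reports, f):
    """Advance the round clock from `current` to current + 1 once at
    least f + 1 distinct processors have reported that time `current`
    has passed.  Since at most f processors are faulty, at least one of
    the f + 1 reporters is correct, so a correct processor never
    advances its clock on the word of faulty processors alone."""
    if len(set(reports)) >= f + 1:
        return current + 1
    return current
```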
5 Partially Synchronous Reliable Models
Several researchers have considered a partially synchronous, reliable model in which processors have real-time clocks that run at the same rate as real time, but are arbitrarily offset from each other initially. In addition, there are known upper and lower bounds on message delays. The goal is to prove limits on how closely clocks can be synchronized (or, how close in time remote events can be synchronized). In a completely connected network of n processors, Lundelius and Lynch [LL] show that the (tight) lower bound is ε(1 - 1/n), where ε is the difference between the bounds on the message delay. This work was subsequently extended by Halpern, Megiddo and Munshi [HMM] to arbitrary networks. A version of the Lamport clocks algorithm for real-time clocks has been analyzed [L1] in a different reliable, partially synchronous model, one in which clock drift rate and message uncertainty are bounded, to obtain upper bounds on the closeness of the clocks. Together with the results mentioned in the previous paragraph, we have upper and lower bounds on closeness imposed by uncertainty in system timing. Marzullo [M] also did some work in the same reliable, partially synchronous model as [L1]. The key idea is for each processor to maintain an upper bound on the error of its clock. This bound allows an interval to be constructed that includes the correct real time. Periodically each processor requests the time from each of its neighbors. As each response is received, the processor sets its new interval to be the intersection of its current one with the interval received in response, after adjusting for further error that could be introduced by message delays.
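In the fault-free setting, the interval-refinement step described above can be sketched as follows (a sketch under our own naming, with the delay adjustment simplified to a symmetric widening; [M] gives the actual algorithm and its analysis):

```python
def refine_interval(own, responses, delay_error):
    """Intersect a processor's current interval (low, high), which is
    assumed to contain the correct real time, with each neighbor's
    reported interval, after widening the reported interval by the
    error a message in transit can introduce."""
    lo, hi = own
    for r_lo, r_hi in responses:
        lo = max(lo, r_lo - delay_error)
        hi = min(hi, r_hi + delay_error)
    if lo > hi:
        raise ValueError("empty intersection: an assumption was violated")
    return lo, hi
```

Each intersection can only shrink (never grow) the interval, so repeated exchanges tighten every processor's bound on the correct real time.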
6 Partially Synchronous Unreliable Models
There has been much work done on the problem of devising fault-tolerant algorithms to synchronize real-time clocks that drift slightly in the presence of variable message delays [LM, M, WL, HSSD, MS, ST]. Although most of the algorithms are simple to state, the analyses tend to be very complicated, and comparisons between algorithms are difficult to make. The difficulty arises from the different assumptions, some of which are hidden in the models, and from differing notations. There has been some work by Schneider [S] attempting to unify all these algorithms into a common framework and common proof. Our goal in this section is simply to describe some of these algorithms and attempt some comparisons. First, though, we discuss the assumptions, notations and goals.
6.1 Assumptions
Recall that n is the total number of processors in the system, and f is the maximum number of faulty processors to be tolerated. The required relationship between n and f in order for the clock synchronization problem to be solvable depends on the type of faults to be tolerated, the desired capabilities of the algorithm, and what cryptographic resources are available, as we now discuss. To overcome the problem in the case of Byzantine faults of deciding what message a processor actually sent to some other processor, algorithms may use authentication. The assumption customarily made for an authenticated algorithm is that there exists a secure encryption system such that if processor A tells processor B that processor C said X, then B can verify that X is precisely what C said. Dolev, Halpern and Strong [DHS] show that without authentication, n must be greater than 3f in order to synchronize clocks in the presence of Byzantine faults.
With authentication, any number of Byzantine faults can be tolerated. The paper [DHS] also shows that without authentication, the connectivity of the network must be greater than 2f in order to synchronize clocks in the presence of Byzantine faults. (See [FLM] for simpler proofs of the lower bounds in [DHS].) Even if authentication is used, clearly each pair of processors must be connected by at least f + 1 distinct paths (i.e., the network is (f + 1)-connected), since otherwise f faults could disconnect a portion of the network. Some of the algorithms in the literature assume that the network is totally connected, i.e. every processor has a direct link to every other processor in the network. In such a model a processor can poll every other processor directly and does not have to rely on some processor's say-so as to what another processor said. The assumption of total connectivity often results in elegant algorithms, but it is, unfortunately, an unrealistic assumption if the network is very large. Consequently, there are other algorithms that assume only that the network has connectivity f + 1 (and use authentication). One assumption that all of the algorithms make is that the processors' real-time (or hardware) clocks do not keep perfect time. We shall refer to the upper bound on the rate at which processor clocks "drift" from real time as p. In particular, the assumption is usually made that there is a "linear envelope" bounding the amount by which a correct processor's (hardware) clock can diverge from real time. In the formulations of this condition given below, C represents the hardware clock, modeled as a function from real time to clock time; u, v and t are real times. The papers [HSSD, DHS, ST] use the following condition, for u < v:

(v - u)/(1 + p) <= C(v) - C(u) <= (v - u)(1 + p)

The paper [WL] uses the following (very similar) condition:

1/(1 + p) <= dC(t)/dt <= 1 + p
A necessary assumption is that there is a bound on the transmission delay along working links, and that this bound is known beforehand. Two common notations for transmission delay are TDEL, for the case in which one assumes that the transmission time can be anywhere from 0 to TDEL, and δ ± ε, for the case in which the transmission delay can be anywhere from δ - ε to δ + ε. Clearly, if δ = ε, then the two notations are equivalent. Some algorithms assume that the times of synchronization are predetermined and known beforehand, while others allow a synchronization to be started at any time. If the model allows for Byzantine faults, then a problem with the laissez-faire approach to clock synchronization is that a faulty processor might force the system to constantly resynchronize. The deviation between clocks will then be small indeed, but no other work will be completed by the system, because the clock synchronization monopolizes the system resources. A commonly made assumption is that messages sent between processors arrive in the same order as that in which they were sent. This is not a limiting assumption, since it can be implemented easily by numbering messages and by ignoring a message with a particular number until after all messages with a smaller number have arrived. Another common, but not essential, assumption is that in the initial state of the
system all the correct clocks start close together. Some of the papers present algorithms to achieve this synchronization initially, although there are some subtle points in switching from one of these start-up algorithms to an algorithm that maintains synchronization.
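The standard message-numbering construction mentioned among the assumptions above can be sketched as follows (class and method names are ours):

```python
class FifoReceiver:
    """Reconstruct sender order on the receiving side: buffer any
    message that arrives ahead of its turn and release messages only in
    increasing sequence-number order, as described in the text."""

    def __init__(self):
        self.expected = 0   # next sequence number to deliver
        self.held = {}      # out-of-order messages, keyed by number

    def receive(self, seq, msg):
        """Record an arrival; return the messages that become deliverable."""
        self.held[seq] = msg
        ready = []
        while self.expected in self.held:
            ready.append(self.held.pop(self.expected))
            self.expected += 1
        return ready
```

One such buffer per sender suffices to give every pair of processors an order-preserving channel.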
6.2 Goals
The main goal of a clock synchronization algorithm is to ensure that the clocks of nonfaulty processors never differ by more than some fixed amount, usually referred to as DMAX or γ. This is sometimes called the agreement condition. Another requirement sometimes imposed is the validity or accuracy condition, which is the requirement that the clocks stay close to real time, i.e. that the drift of the clocks away from real time be limited. Yet another common goal is that of minimizing the number of messages exchanged during the synchronization algorithm. In order to avoid unpleasant discontinuities, such as skipping jobs that are triggered at fixed clock times, the size of the adjustments made to the clocks should be small. Similarly, many applications require that the clocks never be set back. The latter is not a serious constraint, thanks to known techniques for spreading the adjustment over an interval (see the paper by Beck, Srikanth and Toueg in this book). It should be easy for a repaired processor or a new processor to synchronize its clock with those of the old group of processors, a procedure called joining or reintegration. If one wishes to implement a bounded join, that is, a join which is guaranteed to be completed within an amount of time that has been previously determined, then a necessary condition in the Byzantine model is that there be more synchronized processors than processors that are trying to join, even if authentication is available [HSSD]. A requirement that so far has been addressed only in [MS] is achieving graceful degradation, ensuring that even if the bound on the number of faults is exceeded, there are still some limits on how badly performance is affected. Yet another possible goal is that the synchronization should not disrupt other activities of the network, for instance by occurring too frequently, or requiring too many resources.
(See comments by Beck, Srikanth, and Toueg in this book about the trade-off between accuracy and the priority of the synchronization routine).
6.3 Algorithms
We now briefly compare the algorithms of [LM, WL, HSSD, M, MS, ST]. The different assumptions made by the authors are pointed out, and various performance measures are discussed. All the algorithms handle Byzantine processor faults, as long as n > 3f (except where noted). They also all require that the processors be initially synchronized and that there be known bounds on the message delays and clock drift. Finally, they
all run in rounds, or successive periods of resynchronization activity (necessitated by clock drift). For the rest of this subsection, we divide the algorithms into two groups: those that need a fully connected network, and those that do not. The algorithms in [LM, WL, MS] assume a fully connected network. Since each processor broadcasts at each round, n² messages are sent in every round. At every round of the interactive convergence algorithm of [LM], each processor obtains a value for each of the other processors' clocks, and sets its clock to the average of those values that are not too different from its own. The closeness of synchronization achieved is about 2nε (recall that ε is the uncertainty in the message delay). Accuracy is close to that of the underlying hardware clocks (although it is not explicitly discussed). The size of the adjustment is about (2n + 1)ε. Reintegration and initialization are not discussed in [LM]. The algorithm in [WL] also collects clock values at each round, but they are averaged using a fault-tolerant averaging function based on those in [DLPSW] to calculate an adjustment. It first throws out the f highest and f lowest values, and then takes the midpoint of the range of the remaining values. Clocks stay synchronized to within about 4ε. The synchronized clocks' rate of drift does not exceed by very much the drift of the underlying hardware clocks. The size of the adjustment at each round is about 5ε. Superficially this performance looks better than that of [LM]; however, in converting between the different models, it may be the case that ε in the [WL] model equals nε in the [LM] model.
The reason is that in the [LM] algorithm a processor can obtain another processor's clock value by sending the other processor a request and busy-waiting until that processor replies, whereas in the [WL] algorithm a processor can receive a clock value from any processor during an interval, forcing the processor to cycle through polling n queues for incoming messages (this argument is expanded on in [LM]). This is an example of the many pitfalls encountered in comparing clock synchronization algorithms. Rejoining is easy, but can only happen at resynchronization intervals, which are relatively far apart. A variant of the algorithm works when clocks are not initially synchronized. The algorithms of Mahaney and Schneider [MS] are also based on the interactive convergence algorithm of [LM]. At each round, clock values are exchanged. All values that are not close enough to n - f other values (and thus are clearly faulty) are discarded, and the remaining values are averaged. However, the performance is analyzed in different terms, with more emphasis on how the clock values are related before and after a single round, so agreement, accuracy, and adjustment size values are not readily available for comparison. Reintegration and initialization are not discussed. A pleasing and novel aspect of this algorithm is that it degrades gracefully if more than a third of the processors fail. The next set of algorithms (those in [M, HSSD, ST]) do not require a fully connected network. Again, every processor communicates with all its neighbors at each round, but since the network is not necessarily fully connected, the message complexity per round could be less than O(n²). The estimates of agreement, accuracy, and adjustment size presented in the rest of this subsection for these algorithms are made
assuming n = 3f + 1, and a fully connected network with no link failures, in order to facilitate comparison (although, as mentioned above, the algorithms do not require that these conditions hold). Marzullo [M] extended his algorithm (discussed in Section 5) to handle Byzantine faults without authentication by calculating the new interval in a more complicated, and thus fault-tolerant, manner, and altering the clock rates in addition to the clock times. Since the algorithm's performance is analyzed probabilistically, assuming various probability distributions for the clock rates over time, it is difficult to compare results with the analyses of the other algorithms, which make worst-case assumptions. The algorithm of Halpern, Simons, Strong and Dolev [HSSD] can tolerate any number of processor and link failures as long as the nonfaulty processors can still communicate. However, the price paid for this extra fault tolerance is that authentication is needed. When a processor's clock reaches the next in a series of values (decided on in advance), the processor begins the next round by broadcasting that value. If a processor receives a message containing the value not too long before its clock reaches the value, it updates its clock to the value and relays the message. The closeness of synchronization achievable is about δ + ε. By sending messages too early, the faulty processors can cause the nonfaulty ones to speed up their clocks, and the slope of the synchronized clocks can exceed 1 by an amount that increases as f increases. The size of the adjustment is about (f + 1)(δ + ε), again depending on f. An algorithm to reintegrate a repaired processor is mentioned; although it is complicated, it has the nice property of not forcing the processor to wait possibly many hours until the next resynchronization, but instead starting as soon as the processor requests it. No system initialization is discussed.
(In the revised version of their paper [to appear], they present a simpler reintegration algorithm that joins processors at predetermined fixed times that occur with much greater frequency than the predetermined fixed standard synchronization times.) The algorithm of Srikanth and Toueg [ST] is very similar to that of [HSSD], but it handles only fewer than n/2 processor failures and does not handle link failures. However, they can relax the necessity of authentication (if n > 3f). Agreement, as in [HSSD], is about δ + ε. Accuracy is optimal, i.e., is that provided by the underlying hardware clocks. The size of the adjustment is about 3(δ + ε). There are twice as many messages per round as in [HSSD] when digital signatures are not used. Reintegration is based on the method in [WL]. A simple modification to the algorithm gives an elegant algorithm for initially synchronizing the clocks.
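The two convergence functions described in this subsection can be sketched as follows (minimal sketches; parameter names are ours, and the real algorithms read remote clocks under careful timing assumptions rather than operating on a ready-made list of values):

```python
def interactive_convergence(own, others, threshold):
    """[LM]-style rule: average the clock values, replacing any value
    that differs from one's own by more than `threshold` with one's own
    value."""
    values = [own] + list(others)
    kept = [v if abs(v - own) <= threshold else own for v in values]
    return sum(kept) / len(kept)


def fault_tolerant_midpoint(values, f):
    """[WL]-style rule based on [DLPSW]: discard the f highest and the
    f lowest values, then take the midpoint of the range that remains."""
    trimmed = sorted(values)[f:len(values) - f]
    return (trimmed[0] + trimmed[-1]) / 2
```

Discarding the f extremes guarantees that, with enough processors, the surviving range is bracketed by values held by correct processors, which is what bounds the effect of Byzantine clock reports.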
7 Related Problems
Several interesting problems are related to that of synchronizing clocks in unreliable models. In the approximate agreement problem [DLPSW, MS, F] each processor begins with a real number. The goal is for each nonfaulty processor to decide on a real number that is close to the final real number of every other nonfaulty processor and within the range of the initial real numbers. Solutions to this problem are used
in the clock synchronization algorithms of [WL] and [MS]. In Sections 3 and 5 we discussed the problem of achieving synchronized remote actions in reliable models. If instead one considers unreliable models, the problem, dubbed the Byzantine firing squad problem, becomes more difficult. Burns and Lynch [BL] consider the situation in which the message delay is known, every processor's clock runs at the same rate but the clocks are skewed arbitrarily, and Byzantine processor faults are allowed. The algorithm they obtain can be thought of as simulating multiple parallel instances of an agreement algorithm, one per real time unit, until one succeeds. Since most of the time nothing happens, most messages sent are the null message, similarly to Lamport's "time vs. timeout" ideas [L2]. Coan, Dolev, Dwork and Stockmeyer [CDDS] obtain upper and lower bounds for versions of this problem that have other fault and timing assumptions. Fischer, Lynch and Merritt [FLM] consider a class of problems including clock synchronization, firing squad, and agreement in the synchronous unreliable model with Byzantine processor faults and without authentication. They observe that all feasible solutions to these problems have similar constraints. In particular, they demonstrate why 3f + 1 processors and 2f + 1 connectivity are necessary and sufficient to solve these problems in the presence of up to f Byzantine faults. (See the article by the same authors in this book.)
8 Future Research
We define the precision of a system as being the maximum difference in real time between when any two clocks read the same clock time T. Clearly we want the precision to be as small as possible and bounded above by a constant. One interesting open question is to determine the trade-off between precision and accuracy (see Section 6.2). Is it possible to achieve optimal precision and optimal accuracy simultaneously? What is the trade-off between precision and accuracy in terms of messages exchanged? Another open question is whether one can achieve an unbounded join (see Section 6.2) if at least half the processors are faulty. (Dolev has conjectured that this is possible [D].) No lower bounds on closeness of synchronization have yet been determined for the case when clocks can drift and processors can fail. How does this situation compare to a totally asynchronous system? What are minimal conditions that would allow some sort of clock simulation in an asynchronous system? What would it mean to be fault tolerant in such a model? Finally, much work remains to be done to quantify the relationships between different time, fault, and system models.
References
[A]
B. Awerbuch, "Complexity of Network Synchronization," J. ACM, vol. 32, no. 4, pp. 804-823, 1985.
[AFL]
E. Arjomandi, M. Fischer, and N. Lynch, "Efficiency of Synchronous vs. Asynchronous Distributed Systems," J. ACM, vol. 30, no. 3, pp. 449-456, July 1983.
[Be]
M. Ben-Or, "Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols," Proc. 2nd Ann. ACM Symp. on Principles of Distributed Computing, pp. 27-30, 1983.
[BL]
J. Burns and N. Lynch, "The Byzantine Firing Squad Problem," Advances in Computing Research: Parallel and Distributed Computing, vol. 4, JAI Press, 1987.
[Br]
G. Bracha, "An O(log n) Expected Rounds Randomized Byzantine Generals Algorithm," Proc. 17th Ann. ACM Symp. on Theory of Computing, pp. 316-326, 1985.
[CDDS]
B. Coan, D. Dolev, C. Dwork, and L. Stockmeyer, "The Distributed Firing Squad Problem," Proc. 17th Ann. ACM Symp. on Theory of Computing, pp. 335-345, May 1985.
[D]
D. Dolev, private communication.
[DDS]
D. Dolev, C. Dwork, and L. Stockmeyer, "On the Minimal Synchronism Needed for Distributed Consensus," J. ACM, vol. 34, no. 1, pp. 77-97, Jan. 1987.
[DHS]
D. Dolev, J. Halpern, and H. R. Strong, "On the Possibility and Impossibility of Achieving Clock Synchronization," Journal of Computer and System Sciences, vol. 32, no. 2, pp. 230-250, 1986.
[DHSS]
D. Dolev, J. Halpern, B. Simons, and H. R. Strong, "A New Look at Fault-Tolerant Network Routing," Information and Computation, vol. 72, no. 3, pp. 180-198, March 1987.
[DLPSW]
D. Dolev, N. Lynch, S. Pinter, E. Stark, and W. Weihl, "Reaching Approximate Agreement in the Presence of Faults," J. ACM, vol. 33, no. 3, pp. 499-516, July 1986.
[DLS]
C. Dwork, N. Lynch, and L. Stockmeyer, "Consensus in the Presence of Partial Synchrony," J. ACM, vol. 35, no. 2, pp. 288-323, 1988.
[F]
A. Fekete, "Asynchronous Approximate Agreement," Proc. 6th Ann. ACM Symp. on Principles of Distributed Computing, pp. 64-76, Aug. 1987.
[FLM]
M. Fischer, N. Lynch, and M. Merritt, "Easy Impossibility Proofs for Distributed Consensus Problems," Distributed Computing, vol. 1, no. 1, pp. 26-39, 1986.
[FLP]
M. Fischer, N. Lynch, and M. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, pp. 374-382, 1985.
[HMM]
J. Halpern, N. Megiddo and A. Munshi, "Optimal Precision in the Presence of Uncertainty," Journal of Complexity, vol. 1, pp. 170-196, 1985.
[HSSD]
J. Halpern, B. Simons, R. Strong, and D. Dolev, "Fault-Tolerant Clock Synchronization," Proc. 3rd Ann. ACM Symp. on Principles of Distributed Computing, pp. 89-102, Aug. 1984.
[L1]
L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[L2]
L. Lamport, "Using Time Instead of Timeout for Fault-Tolerant Distributed Systems," Computer Networks, vol. 2, pp. 95-114, 1978.
[LL]
J. Lundelius and N. Lynch, "An Upper and Lower Bound for Clock Synchronization," Information and Control, vol. 62, nos. 2/3, pp. 190-204, Aug./Sept. 1984.
[LM]
L. Lamport and P. Melliar-Smith, "Synchronizing Clocks in the Presence of Faults," J. ACM, vol. 32, no. 1, pp. 52-78, Jan. 1985.
[M]
K. Marzullo, Loosely-Coupled Distributed Services: A Distributed Time Service, Ph.D. Thesis, Stanford Univ., 1983.
[MS]
S. Mahaney and F. Schneider, "Inexact Agreement: Accuracy, Precision and Graceful Degradation," Proc. 4th Ann. ACM Symp. on Principles of Distributed Computing, pp. 237-249, Aug. 1985.
[NT]
G. Neiger and S. Toueg, "Substituting for Real Time and Common Knowledge in Asynchronous Distributed Systems," Proc. 6th Ann. ACM Symp. on Principles of Distributed Computing, pp. 281-293, 1987.
[S]
F. Schneider, "A Paradigm for Reliable Clock Synchronization," Proc. Advanced Seminar on Real-Time Local Area Networks, Bandol, France, April 1986.
[ST]
T.K. Srikanth and S. Toueg, "Optimal Clock Synchronization," J. ACM, vol. 34, no. 3, pp. 626-645, July 1987.
[W]
J. Lundelius Welch, "Simulating Synchronous Processors," Information and Computation, vol. 74, no. 2, pp. 159-171, Aug. 1987.
[WL]
J. Lundelius Welch and N. Lynch, "A New Fault-Tolerant Algorithm for Clock Synchronization," Information and Computation, vol. 77, no. 1, pp. 1-36, April 1988.
Implementation Issues in Clock Synchronization

Micah Beck†, T. K. Srikanth, Sam Toueg‡
Computer Science Department, Cornell University, Ithaca, NY 14853

ABSTRACT We present some results from an experimental implementation of a recent clock synchronization algorithm. This algorithm was designed to overcome arbitrary processor failures, and to achieve optimal accuracy, i.e., the accuracy of synchronized clocks (with respect to real time) is as good as that specified for the underlying hardware clocks. Our system was implemented on a set of workstations on a local area broadcast network. Initial results indicate that this algorithm can form the basis of an accurate, reliable, and practical distributed time service.

1. Introduction
An important problem in distributed computing is that of synchronizing clocks in spite of faults [Dole84, Drum86, Guse84, Halp84, Lamp85, Lund84, Marz84]. Given "hardware" clocks whose rate of drift from real time is within known bounds, synchronization consists of maintaining logical clocks that are never too far apart. Processors maintain these logical clocks by computing periodic adjustments to their hardware clocks. Although the underlying hardware clocks have a bounded rate of drift from real time, the drift of logical clocks can exceed this bound. In other words, while synchronization ensures that logical clocks are close together, the accuracy of these logical clocks (with respect to real time) can be lower than that specified for hardware clocks. This reduction in accuracy might appear to be an inherent consequence of synchronization in the presence of arbitrary processor failures and variation in message delivery times. All previous synchronization algorithms exhibit this reduction in clock accuracy. In a recent paper [Srik87], we showed that accuracy need not be sacrificed in order to achieve synchronization. We presented a synchronization algorithm where logical clocks have the same accuracy as the underlying physical clocks. We showed that no synchronization algorithm can achieve a better accuracy, and therefore our algorithm is optimal in this respect.
†This author was supported in part by a Hewlett Packard Faculty Development Fellowship. ‡This author was supported in part by a National Science Foundation Grant DCR-8601864.
In previous results, a different algorithm has been derived for each model of failure. In contrast, the algorithm that we describe in [Srik87] is a unified solution for several models of failure: crash, omission, or arbitrary (i.e., "Byzantine") failures with and without message authentication. [See article by Dolev & Strong in this volume]. With simple modifications, our solution also provides for initial clock synchronization and for the integration of new clocks. In this paper, we describe some initial experimental results from an implementation of this algorithm on a collection of workstations connected by a local area broadcast network. We implemented a version of the algorithm that overcomes arbitrary processor and clock failures, for an Ethernet network. Our experience shows that this algorithm is simple, efficient, and easy to implement. Furthermore, the results indicate that this algorithm can form the basis of an accurate, reliable, and practical distributed time service. A simpler version of the algorithm, one that overcomes only processor omission faults, is described in [Srik87]. The paper is organized as follows. We describe the system model in Section 2. In Section 3, we describe the synchronization algorithm. Section 4 presents our implementation. Other implementation issues are discussed in Section 5.

2. The model

We consider a system of distributed processors that communicate through a reliable, error-free and fully connected message system. Each processor has a physical "hardware" clock and computes its logical time by adding a locally determined adjustment to this physical clock. The notation used here closely follows that in [Halp84]. Variables and constants associated with real time are in lower case, and those corresponding to the logical time of a processor are in upper case. The following assumptions are made about the system: A1. The rate of drift of physical clocks from real time is bounded by a known constant p > 0.
That is, if R_i(t) is the reading of the physical clock of processor i at time t, then for all t2 > t1,

    (1 + ρ)^{-1} (t2 − t1) ≤ R_i(t2) − R_i(t1) ≤ (1 + ρ) (t2 − t1)

Thus, correct physical clocks are within a linear envelope of real time. [See article by Lundelius, Lynch, & Simons in this volume].

A2. There is a known upper bound t_del on the time required for a message to be prepared by a processor, sent to all processors, and processed by the recipients of the message.

A processor is faulty if it deviates from its algorithm or if its physical clock violates assumption A1; otherwise it is said to be correct. Faulty processors may also collude to prevent correct processors from achieving synchronization (i.e., processors and clocks are subject to "Byzantine" failures). We use the term "correct clock" to refer to the logical
clock of a correct processor. Resynchronization proceeds in rounds, periods of time in which processors exchange messages and reset their clocks. To simplify the presentation and analysis, we adopt the standard convention that a processor i starts a new logical clock, denoted C_i^k, after the k-th resynchronization. In practice, the discontinuity between successive logical clocks may cause some ambiguity as to which clock a processor should use when an external application requests the time (in particular, when the application uses the returned times to accurately measure elapsed times between events and to synchronize actions of different processors to within D_max of each other). In Section 5, we solve this problem by showing how each processor can maintain a single continuous logical clock. Given the above assumptions, the algorithm presented in [Srik87] satisfies the following synchronization conditions. Define end^k to be the real time at which the last correct processor starts its k-th clock. For all correct clocks i and j, all k ≥ 1, and t ∈ [end^k, end^{k+1}]:
1. Agreement: There exists a constant D_max such that

    |C_i^k(t) − C_j^k(t)| ≤ D_max.

2. Optimal accuracy: For any execution of the algorithm, for all correct clocks i, all k ≥ 1, and t ∈ [end^k, end^{k+1}],

    (1 + ρ)^{-1} t + a ≤ C_i^k(t) ≤ (1 + ρ) t + b

for some constants a and b which depend on the initial conditions of this execution.
The agreement condition asserts that the maximum deviation between correct logical clocks is bounded. The optimal accuracy condition states that correct logical clocks are within a linear envelope of real time, and that their rate of drift from real time is no worse than that specified for the physical clocks. In [Srik87] we show that better accuracy cannot be achieved.
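Assumption A1 can be stated operationally. As a small illustration (the function name and calling convention are ours, not the paper's), the following sketch checks whether a pair of physical-clock readings is consistent with the drift bound:

```python
def within_drift_envelope(t1, t2, r1, r2, rho):
    # Assumption A1: over a real-time interval [t1, t2], a correct
    # physical clock's advance r2 - r1 must satisfy
    # (1 + rho)^-1 * (t2 - t1) <= r2 - r1 <= (1 + rho) * (t2 - t1).
    elapsed = t2 - t1
    advance = r2 - r1
    return elapsed / (1.0 + rho) <= advance <= (1.0 + rho) * elapsed

# A clock losing 17 seconds over a day, like the slowest Sun clocks in
# Section 4, is still correct for rho = 2e-4 (17/86400 is about 1.97e-4).
print(within_drift_envelope(0.0, 86400.0, 0.0, 86400.0 - 17.0, 2e-4))  # True
```

A clock losing, say, 50 seconds a day would fall outside the envelope and the processor owning it would be classified as faulty.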
3. The authenticated version of the algorithm

The following is an informal description of our authenticated synchronization algorithm. We assume a system with n processors of which at most f can exhibit arbitrary faults. The algorithm requires that n ≥ 2f + 1 and that messages are authenticated. Informally, authentication prevents a faulty processor from changing a message it relays or from introducing a new message into the system and claiming to have received it from some other processor. An initial synchronization (described below) ensures that all correct processors i start their 0-th logical clock, C_i^0, with the same value and at about the same time (within one message delivery time of each other). The algorithm for the k-th resynchronization round, for k ≥ 1, is as follows. Let P be the logical time between resynchronizations. A processor
i expects the k-th resynchronization at time kP on its logical clock C_i^{k-1}. When C_i^{k-1}(t) = kP, it broadcasts a signed message of the form (round k), indicating that it is ready to resynchronize. When a processor receives such a message from f + 1 distinct processors, it knows that at least one correct processor is ready to resynchronize. It is then said to accept the message, and decides to resynchronize, even if its logical clock has not yet reached kP. A processor i resynchronizes by starting its k-th clock C_i^k and setting it to kP + a, where a, like P, is a global constant known by all processors. The constant a can be easily computed from the system parameters to ensure that clocks are never set back [see Srik87]. For all processors i, a is greater than the increase in C_i^{k-1} from the time i sent a (round k) message (when C_i^{k-1} = kP) to the time it starts C_i^k. After resynchronizing, a processor also relays the f + 1 signed (round k) messages to all other processors to ensure that they also resynchronize. The algorithm is described in Figure 1. In [Srik87], we show that it achieves agreement and, with simple modifications, optimal accuracy. There are several ways to trigger the 0-th synchronization. One way is to use a one-time initialization broadcast that causes each processor i to start its 0-th logical clock C_i^0 by setting it to a. Another approach is described in [Srik87].
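The round structure just described can be sketched as a single processor's state machine. This is our own simplified illustration (all class, method, and message names are assumptions, not the paper's); signatures, transport, and the relaying of the f + 1 accepted messages are abstracted away:

```python
# A single-node sketch of one processor's behavior in the authenticated
# algorithm (Section 3 / Figure 1). Names here are ours, not the paper's.

class Synchronizer:
    def __init__(self, proc_id, f, P, a):
        self.proc_id = proc_id
        self.f = f              # maximum number of faulty processors
        self.P = P              # logical time between resynchronizations
        self.a = a              # forward jump applied at each resynchronization
        self.adjustment = 0.0   # logical time = physical time + adjustment
        self.round = 1          # next resynchronization round k
        self.signers = set()    # distinct signers of (round k) seen so far
        self.sent = False       # already broadcast our own (round k)?

    def logical_time(self, physical):
        return physical + self.adjustment

    def on_tick(self, physical, broadcast):
        # Ready to resynchronize when C^{k-1}(t) reaches kP.
        if not self.sent and self.logical_time(physical) >= self.round * self.P:
            self.sent = True
            broadcast(('round', self.round, self.proc_id))

    def on_message(self, msg, physical):
        kind, k, signer = msg
        if kind != 'round' or k != self.round:
            return
        self.signers.add(signer)
        # Accept after f+1 distinct signed (round k) messages: at least
        # one of them is from a correct processor whose clock reached kP.
        if len(self.signers) >= self.f + 1:
            # Start C^k by setting the clock to kP + a.
            self.adjustment = self.round * self.P + self.a - physical
            self.round += 1
            self.signers = set()
            self.sent = False
```

A processor that accepts before its own clock reaches kP simply jumps forward; the choice of a guarantees the jump is never backwards.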
4. Our Implementation

We implemented our clock synchronization algorithm as a set of processes running at the user level under 4.2 BSD UNIX†. On each processor, one process implemented the algorithm as outlined above. We ran experiments on a group of ten Sun workstations (3 Sun-1, 7 Sun-2), using a DEC VAX 11/750 to record messages passing over the network. The VAX uses a quartz crystal clock to generate ticks, and seems to miss very few clock interrupts. By comparison, the Suns' clocks run uniformly slow, possibly indicating lost interrupts. Because the clock on the VAX is more accurate than the Suns' (differing from an external time service by approximately 3 seconds in a day), we used it as "real time,"
cobegin
    if C_i^{k-1}(t) = kP                /* ready to start C_i^k */
        -> sign and broadcast (round k)
    fi
//
    if accepted the message (round k)   /* received f + 1 signed (round k) messages */
        -> C_i^k(t) := kP + a;          /* start C_i^k */
           relay all f + 1 signed messages to all
    fi
coend

Figure 1. An authenticated clock synchronization algorithm for processor i and round k ≥ 1.
† UNIX is a trademark of AT&T Bell Laboratories.
and plotted network events against it. Our system was implemented on an Ethernet broadcast network, providing us with full interconnection between processors. We assumed the broadcast network to be reliable, and hence that every message is guaranteed to be received by all correct processors. This obviated the need for message relaying, and thus eliminated the possibility of faulty processors tampering with, and then forwarding, relayed messages. Furthermore, the Ethernet network driver marks each outgoing message with the unique internet address of the sending processor. Given our assumption that the network is reliable, the receiving processor can use these addresses to safely identify the sender of each message. In other words, the Ethernet broadcast hardware provides all the properties of authentication needed to run the authenticated version of our algorithm.

4.1. Our Experiment

Figure 2 is a plot of the aggregate behavior of the processors' clocks, both synchronized and unsynchronized. After some preliminary experiments, we decided to assume a maximum message delivery time t_del = 0.1 seconds, and a maximum clock drift rate of ρ = 2 × 10^-4. We wanted to achieve a maximum difference between synchronized clocks D_max = 0.4 seconds. This choice of parameters led us to use a period between synchronizations P = 30 minutes, and a = 0.5 sec. With a total number of synchronizing processors n = 10, we configured the protocol to withstand f = 3 processors with arbitrary faults. The graph represents the range and mean of the difference between the processors' local clocks and the VAX's "real time", C(t) − t, at the moment that each processor resynchronizes. Note that the horizontal axis of Figure 2 is the round number, as opposed to real time. The time between successive synchronization rounds in our experiment is measured in hours, while synchronization typically lasts for less than a second.
If plotted on the same time scale, the times of events within a synchronization round would be indiscernible. From Figure 2, we can see that during the 10 hour experiment the slowest unsynchronized clocks drifted from real time at a rate close to 17 sec/day, so the rate of drift was bounded by ρ = 2 × 10^-4. During this experiment, unsynchronized clocks drifted increasingly apart from each other. The synchronized clocks that satisfied the assumed specifications on ρ and t_del deviated by less than 0.4 sec from each other, as desired. However, two processors experienced message delivery times as high as t_del = 0.5 sec. The maximum difference between all clocks in the experiment, including these faulty processors, was less than 0.8 sec. Furthermore, all synchronized clocks were considerably closer to real time than the slowest unsynchronized clocks. As a reference point, the Berkeley Time Synchronization Protocol [Guse84], which synchronizes every 4 minutes, reportedly achieves a maximum difference between clocks of less than 0.04 sec. This shows that the performance of our experimental system,
[Figure 2: Drift From Real Time — synchronized vs. unsynchronized clocks. The vertical axis is C(t) − t in seconds; each curve shows the max, average, and min over the processors. Experimental parameters: n = 10, f = 3, t_del = 0.1 sec, P = 30 min, D_max = 0.4 sec, a = 0.5 sec.]
running at a low user-level priority, is within an order of magnitude of a highly tuned production system. Our experiment demonstrated an important aspect of the robustness of our algorithm: it periodically recovered the unsynchronized clocks of some faulty processors, and gracefully reintegrated these clocks into the system of synchronized clocks. Specifically, "lazy" or temporarily faulty processors that occasionally "missed" a resynchronization round (e.g., by losing or reacting slowly to a resynchronization message) were often reintegrated and correctly resynchronized at the next resynchronization. In fact, our algorithm ensures that all faulty processors are asked to start a new logical clock at the correct time (by the f + 1 correct processors) at each resynchronization round, independent of their past faulty behavior. In other words, at each resynchronization round, the correct processors give all the faulty processors the opportunity to correct their clocks by resynchronizing on time. Thus, the clocks of processors that experience transient failures are never "lost" forever: instead, they usually correct themselves and fully recover at the next resynchronization round. For example, we observed a consistently slow clock that was forced to "pull itself
up" and synchronize with the correct clocks at every resynchronization. The same type of periodic correction would also happen to a consistently fast clock. Several previous synchronization algorithms exhibit the following undesirable behavior: once a processor becomes unsynchronized because of a transient failure, it may forever refuse to accept synchronizing messages from correct processors, because its own clock has already strayed "too far" from the correct clocks. With such algorithms, unsynchronized clocks have to be detected and then explicitly reintegrated into the system.
5. Other Implementation Issues

Our experience in building an experimental system was that the algorithm itself was straightforward to implement. The whole program consisted of under 250 lines of C code, and was written and debugged in under 2 weeks. While our experiments were not a complete implementation of a fault-tolerant, synchronized time utility, our experience sheds some light on the task of building one. We have considered, but not yet solved, the problems of overcoming high variance in the message delivery time (Section 5.2), and of changing common algorithm parameters dynamically to achieve greater fault tolerance (Section 5.3).
5.1. Maintaining continuous clocks

To simplify the presentation and analysis, we adopted the standard convention that a processor starts a new logical clock after each resynchronization [Halp84]. When a new clock is started, it is set to a value greater than that shown by the previous clock, thus ensuring that clocks are never set back. If clocks are set backwards, the intuitive relationship between time and causality is lost. Since logical time is often used to order events in real systems (e.g., file system timestamps), monotonicity of clocks should be preserved. For some applications, this scheme of starting a new logical clock at every resynchronization creates some problems. For example, an application may repeatedly ask for the current time, and then use the returned clock values to accurately measure elapsed time. If the system always returns the time according to the current (latest) logical clock, and several logical clocks are started during the execution of that application, the elapsed times measured will not be accurate: every time the system switches to a new logical clock, the logical time jumps forward. On the other hand, if we force the application to stick to the same logical clock across several resynchronizations, then this clock will deviate from the clocks of other processors beyond the allowed limit D_max, preventing close synchronization of actions between different processors. A simple and elegant solution to the above problem is to maintain a single continuous logical clock at each processor, thus removing the ambiguity about which clock to use. As Lamport and Melliar-Smith noted in [Lamp85], an algorithm for discontinuously resynchronizing clocks can be transformed into one where logical clocks are continuous. This can be achieved by spreading out each
resynchronization adjustment over the next resynchronization period. We now show how to use our algorithm to implement a single continuous logical clock for each processor. Each processor i runs the algorithm described in Section 3 to start its logical clocks C_i^k (the initialization process starts C_i^0). For all k ≥ 0, let t_i^k be the real time of the k-th resynchronization of processor i, i.e., the time at which processor i starts the new clock C_i^k. Let Δ_i^k be the forward adjustment that processor i makes to its logical clock at the k-th resynchronization, namely Δ_i^k = C_i^k(t_i^k) − C_i^{k-1}(t_i^k) for k ≥ 1, and Δ_i^0 = a. Let R_i(t) be the value of the physical clock of processor i at time t. Using the forward adjustments Δ_i^k and the underlying physical clock R_i(t), we can easily implement a single continuous clock C_i for processor i as follows:

    C_i(t_i^0) = 0

    C_i(t) = C_i(t_i^k) + x_i^k(t) Δ_i^k + R_i(t) − R_i(t_i^k),

    where x_i^k(t) = min( 1, [R_i(t) − R_i(t_i^k)] / (P − a − D_max) ),

    for t_i^k ≤ t ≤ t_i^{k+1} and all k ≥ 0.
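This update rule can be transcribed almost directly into code. The following is a sketch under our own naming (C_at_tk is C_i(t_i^k), delta_k the k-th forward adjustment, R_t and R_tk physical-clock readings at t and t_i^k); the caller is assumed to track the resynchronization state:

```python
# A sketch of the continuous clock of Section 5.1, valid for t in
# [t_i^k, t_i^{k+1}]. Argument names are ours, not the paper's.

def continuous_clock(C_at_tk, delta_k, R_t, R_tk, P, a, D_max):
    # Spread the forward jump delta_k linearly over the first
    # P - a - D_max units of physical time after the k-th
    # resynchronization; after that window, the full jump is absorbed.
    x = min(1.0, (R_t - R_tk) / (P - a - D_max))
    return C_at_tk + x * delta_k + (R_t - R_tk)
```

With the parameters of Section 4 (P = 1800 s, a = 0.5 s, D_max = 0.4 s), each adjustment is spread over 1799.1 seconds of physical time, so the clock never jumps and never runs backwards.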
For all k ≥ 1, at the k-th resynchronization, the continuous clock C_i matches the logical clock C_i^{k-1}. That is, C_i(t_i^k) = C_i^{k-1}(t_i^k). In [Srik87], it is shown that the continuous clock C_i(t) satisfies both the agreement and optimal accuracy properties.

5.2. Imprecise Synchronization
Some processors occasionally experienced a message delivery time τ_del greater than 0.1 seconds, the bound we assumed in our implementation. As a result, the clocks of these processors sometimes differed from the clocks of other processors by more than the permitted D_max. This imprecise synchronization can be accounted for by classifying those processors that experience larger than allowed delays as faulty. We now examine some causes of the observed large variance in message delivery time. We also consider possible solutions to this problem. τ_del is a random variable with unknown distribution over the interval [0, t_del]. The variance of τ_del is a function of the system's hardware and software architecture and of the level at which clock synchronization is implemented. Our experiments put the clock synchronization protocol in a user-level Unix process. On receipt of a message, the network driver has to move the message into memory and then wake up the process, which will usually be paged out. Any process with higher scheduling priority can preempt clock synchronization, and kernel-level activities are performed before running user processes. Accordingly, the most erratic processor in our experimental system was the file

    † We can show that, for k ≥ 1, C_i(t_i^k) = kP + a − Δ_i^k, and Δ_i^k = P − [R_i(t_i^k) − R_i(t_i^{k-1})]. Thus, processor i can easily compute its single continuous clock C_i from the values of its physical clock R_i at the two most recent resynchronization times.
server for a cluster of 6 Suns. Its unusually high load of kernel-level work intermittently caused large delays in the delivery of messages to the user level. This made precise synchronization of the server impossible. One obvious solution to this problem (maybe the most logical one) is to implement the clock synchronization protocol at a lower architectural level, such as in the kernel. It could even be offloaded to an intelligent network handler. This would considerably reduce the maximum message delivery time t_del, and especially the variance of τ_del, although network contention could still cause unpredictable delays. The standard release of 4.3 BSD Unix includes TSP, the Berkeley Time Synchronization Protocol, which is a distributed clock synchronization scheme with limited fault tolerance [Guse84]. TSP uses a statistical technique to compensate for the large variance in τ_del. An estimate of the difference between two processors' clocks is obtained by timestamping messages which are passed between them. From these timestamps, an estimate of the difference is obtained, which contains an error term depending on τ_del, the message delivery time. TSP reduces this error term by sampling the estimated difference in clocks many times and deriving a statistical estimate which greatly reduces the expected error. Since the longest possible message delay can be more than a second (several orders of magnitude larger than the average message delay), it is unacceptably expensive to assume such a large conservative value for t_del. However, using a smaller t_del is an incorrect characterization of the network that may result in a communication fault: τ_del > t_del. Since our system model assumes that no communication errors occur, the fault is attributed to the processor that experiences the late arrival of a message, namely, the recipient. A more complete model, which explicitly took communication errors into account, would make it possible to attribute late messages to a fault in the network and to avoid labeling correct processors as faulty.
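The statistical idea behind TSP — estimate the clock difference from many timestamped exchanges and suppress the occasional long delay — can be sketched roughly as follows. This is our own illustration of the general technique (a median over round-trip samples), not TSP's actual message format or estimator:

```python
import random
import statistics

def estimate_offset(samples):
    # Each sample is (t_send, t_remote, t_recv) from one timestamped
    # round trip: local send time, remote clock reading, local receive
    # time. A single sample estimates the remote clock's offset with an
    # error of (d1 - d2)/2, where d1 and d2 are the one-way delays; the
    # median over many samples suppresses occasional large delays.
    estimates = [t_remote - (t_send + t_recv) / 2.0
                 for (t_send, t_remote, t_recv) in samples]
    return statistics.median(estimates)

# Simulated exchanges with a true offset of +0.25 s, usual delays under
# 50 ms, and an occasional delay of up to a second on the reply path.
random.seed(1)
true_offset = 0.25
samples = []
for _ in range(99):
    d1 = random.uniform(0.0, 0.05)
    d2 = random.uniform(0.0, 0.05)
    if random.random() < 0.1:
        d2 += random.uniform(0.5, 1.0)   # rare second-long delay
    t_send = random.uniform(0.0, 100.0)
    samples.append((t_send, t_send + d1 + true_offset, t_send + d1 + d2))

print(round(estimate_offset(samples), 2))  # close to 0.25
```

A mean would be dragged down by the long-delay samples; the median is the simplest of the robust estimators one might use here.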
A large message delay, and the resulting imprecise synchronization, is just one of a class of faults which a processor can sometimes detect. If a processor can detect that it is "faulty" according to the specification of the clock synchronization algorithm, then it can attempt to take corrective action. For example, a processor that must consistently set its clock forward by an excessive amount at each resynchronization can surmise that its hardware clock is slow, and apply a speed-up factor. Such self-correction can enhance the long-term stability of the system. While our experimental system has no such facility for fault detection, we can see its usefulness in a production system.

5.3. Initialization and Joining

The correctness of fault-tolerant clock synchronization algorithms is usually proved in the context of a static specification of robustness: "given a network of n processors, at most f of them faulty." While some condition on n and f (e.g., n ≥ 2f + 1 in our case) must hold for any algorithm to operate correctly, the parameter n is not referenced by our algorithm.
This is helpful, in that it eliminates the need for agreement on the value of n. However, f is a common parameter of our synchronization protocol, and so the situation becomes more complex if we wish to allow its value to change dynamically. Our algorithm allows a processor to join the network without any adjustment to common parameters, using a technique similar to the one in [Lund84].
However, if n increases, greater fault tolerance can be achieved by increasing f to ⌊(n − 1)/2⌋. If n decreases below 2f + 1, the parameter f must be decreased in order to preserve the condition n ≥ 2f + 1 necessary for the correctness of our algorithm. The question of how to dynamically adjust the common value of f as n changes (i.e., as processors join or leave the network) is a complex one which we have not addressed in our implementation. In comparison, the algorithm presented in [Halp84] has no explicit reference to a common value for either n or f. It includes a Byzantine Agreement based protocol for processors to join an existing cluster of synchronized processors. The joining protocol is also used during initialization. While the synchronization algorithm itself requires only that n > f, the joining protocol requires that n ≥ 2f + 1.

6. Conclusions

We described some initial experimental results from an implementation of a fault-tolerant synchronization algorithm on a collection of workstations connected by a local area broadcast network. The algorithm overcomes arbitrary (i.e., "Byzantine") processor and clock faults, without using expensive Byzantine Agreement protocols. Our experience shows that Byzantine faults are not necessarily expensive to overcome: the algorithm is simple, efficient, and easy to implement. The results indicate that it can form the basis of an accurate, reliable, and practical distributed time service.
References

[Dole84] D. Dolev, J.Y. Halpern, and R. Strong, On the possibility and impossibility of achieving clock synchronization, Proceedings of the Sixteenth Annual ACM STOC, Washington, D.C., pp. 504-511, April 1984. Also to appear in JCSS.

[Drum86] R. Drummond, Impact of communication networks on fault-tolerant distributed computing, Ph.D. thesis, Cornell University, 1986.

[Guse84] R. Gusella and S. Zatti, TEMPO - A network time controller for a distributed Berkeley UNIX system, Distributed Processing Tech. Comm. Newsletter, vol. 6, no. SI-2, pp. 7-15, IEEE, June 1984.

[Halp84] J.Y. Halpern, B. Simons, R. Strong, and D. Dolev, Fault-tolerant clock synchronization, Proceedings Third Annual ACM Symposium on Principles of Distributed Computing, Vancouver, Canada, pp. 89-102, Aug. 1984.

[Lamp85] L. Lamport and P.M. Melliar-Smith, Synchronizing clocks in the presence of faults, Journal of the ACM, vol. 32, no. 1, pp. 52-78, Jan. 1985.

[Lund84] J. Lundelius and N. Lynch, A new fault-tolerant algorithm for clock synchronization, Proceedings Third Annual ACM Symposium on Principles of Distributed Computing, Vancouver, Canada, pp. 75-88, Aug. 1984.

[Marz84] K. Marzullo, Maintaining the time in a distributed system: an example of a loosely-coupled distributed service, Ph.D. thesis, Department of Electrical Engineering, Stanford University, 1984.

[Srik87] T.K. Srikanth and S. Toueg, Optimal clock synchronization, Proceedings Fourth Annual ACM Symposium on Principles of Distributed Computing, Minaki, Canada, pp. 71-86, Aug. 1985. Also to appear in the Journal of the ACM, vol. 34, no. 3, July 1987.
Argus

Barbara Liskov
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
Moderator Lindsay: I would like to present Professor Barbara Liskov of MIT, who will give you an overview of her group's ambitious work in support of distributed recoverable abstract data types. It's a very ambitious project, and now we are beginning to see the results. I think this will be an exciting and interesting presentation.

Liskov: Argus is a programming language and system that supports the development and execution of distributed programs. It is intended to run on a collection of processors joined together by a network, rather than by shared memory. We purposely set out to avoid any assumptions about the topology of the network or the nodes themselves. To translate our assumptions about this hardware into the terminology that has been used at this conference, we assume that there can be omission faults and timing faults in our equipment, but no Byzantine faults. For example, if you put a message on the network, it might not get there, it might get there very late, it might get there out of order, it might get there corrupted, but it would be detectably corrupted. However, we assume a message can't arrive with an undetectable error. Similarly, writes of data to disks may fail, but we can tell if they fail, perhaps by reading the data to check its validity. With respect to processors, either they are running or they are not, but they may be running very slowly due to a high load. In Argus, we wanted to provide a language that would allow people to write applications that could make use of a distributed collection of hardware. We were particularly interested in applications that require high availability and that would make use of redundancy to provide protection against faults of individual components. While Argus itself does not provide for automatic availability of the various components in the system, it does provide two basic mechanisms that should make availability easier to achieve. The first mechanism is called a guardian, which is a program module.
The second mechanism, transactions, will be discussed later. When you build a program in Argus, it is composed of many guardians that are placed at various locations in the network. Each guardian looks like an abstract data object and provides external operations, called handlers. If a client wants to use the guardian, it calls one of the handlers; it uses a remote procedure call for this.
Outside a guardian, no module can see anything about the inside of the guardian; in particular, there is no way to obtain an address of data that is inside of the guardian. Therefore, there is a completely encapsulated local address space for each guardian. Inside a guardian are data and processes that are manipulating the data. Potentially, there is a lot of concurrency inside a guardian. For example, every time a call on a handler comes in, a new process is started to execute it. It is also possible for the process running a handler call to fork sub-processes, which can all run in parallel. Finally, a guardian can have some additional processes that execute background code. This code does work that is not in response to any particular handler call. The data that are in a guardian are divided into two parts: stable data and volatile data. Stable data are written periodically to stable storage devices. When a guardian's node crashes and recovers, its volatile data and any processes running at the time of the crash are lost, but the stable data survive. Argus brings the guardian back to life with the stable data in the same state as when they were last written to stable storage. Then, a special recovery process is begun that runs user-defined code to bring the guardian's volatile state back up to a state that is consistent with the stable state. After this job is completed, the guardian is back in business and ready to respond to new handler calls. Application programmers can replicate information explicitly if desired. This is achieved by providing several copies of a guardian, each running in a different place and having its own copy of the data. Argus doesn't do this automatically. Our approach is for application programmers to use our tools to build replicated objects on top of Argus. Our decision that we were not going to solve the replicated object problem within the Argus language was made very early in the project.
We didn't feel we sufficiently understood which replication protocols to use to automatically preserve replicated data in a consistent state. This now seems like a smart decision, because the problems associated with replicated data remain difficult even today. In fact, the more I look at it, the more I see application-specific techniques that in many cases provide much greater efficiency than would the general-purpose algorithms needed if Argus did replication automatically. Perhaps one day we really will know just the right replication algorithms that should be automatically provided by a system. Guardians alone are not sufficient to support an application with state spread over many nodes. This leads us to the second mechanism that's in Argus, namely, atomic actions or transactions. A program (running in some guardian) can begin an atomic transaction and provide the code that is to be run as part of that transaction. That code may contain within it many remote calls to handlers of other guardians. At its end, the program that began the transaction can authorize it to commit or abort, and the effect will be that either the transaction happens everywhere, making the results available to other transactions, or the transaction is completely undone everywhere. Thus, the transaction exhibits the properties of serializability and recoverability. These two properties are supported by special data objects called atomic objects.
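The commit/abort semantics described here can be sketched for a single object. This is a bare simplification in our own terms (an exclusive lock held until commit or abort standing in for strict two-phase locking, and a private version installed only at commit), not the Argus implementation, which also handles nesting, distribution, and stable storage:

```python
class AtomicCell:
    # A single atomic object: a transaction writes to a private version,
    # which replaces the committed state only on commit (recoverability).
    # An exclusive lock held until commit/abort gives serializability.
    def __init__(self, value):
        self.committed = value
        self.version = None   # pending copy, if some transaction holds the lock
        self.owner = None     # transaction currently holding the lock

    def write(self, txn, value):
        if self.owner not in (None, txn):
            raise RuntimeError('locked by another transaction')
        self.owner = txn
        self.version = value          # modify a copy, not the object itself

    def read(self, txn):
        if self.owner == txn and self.version is not None:
            return self.version       # a writer sees its own pending version
        return self.committed

    def commit(self, txn):
        if self.owner == txn:
            self.committed = self.version          # install the new version
            self.version, self.owner = None, None

    def abort(self, txn):
        if self.owner == txn:
            self.version, self.owner = None, None  # discard the copy
```

Until a transaction commits, other transactions see only the old committed state; an abort simply discards the pending version, leaving the object untouched.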
Atomic objects are like ordinary objects in that they provide collections of operations by which computations can manipulate them. However, these operations support the synchronization and recovery of transactions. In particular, we use strict two-phase locking to support serializability and we use versions to support recoverability. In other words, we don't make modifications directly to the objects. We always make a copy (called a version) and do the modifications to the copy. Q: I'm still unsure about how data is replicated within guardians. A: The guardian's stable objects are written to stable storage devices and so they survive crashes of the node on which the guardian resides. The stable copy provides a kind of replication. However, the data of a guardian are not replicated at other nodes of the network. Instead, one can explicitly provide replicas of an entire guardian at other nodes. Each of these replicas is itself a guardian. A guardian has a state, and it has to represent that state in terms of a number of objects. The rule is that objects that represent the guardian's state are always atomic objects. They might be built-in atomic objects or they might be user-defined objects, but they must be atomic; that is, they must support recovery and serializability. Guardians themselves are user-defined atomic objects. Therefore, you can construct an object (within a guardian) that refers to all these replica guardians. So you can build distributed, replicated, atomic data in this way. Q: Does that mean that guardians can be nested? A: Guardians can certainly call handlers in other guardians. Guardians cannot be nested in the sense that one guardian has direct access to the state of another guardian; instead all inter-guardian communication must occur through handler calls. Q: Is there a log? A: While a transaction is running, we keep track of its changes to objects in primary memory. In other words, the versions created for it are in main memory. 
This means that if crashes occur while a transaction is still active, we won't be able to complete the transaction, because we will have lost information. When a transaction commits, we use a standard two-phase commit protocol, and as part of two-phase commit, we write to stable storage the new versions of all the objects that were modified by that transaction. We use a logging scheme for that. Each version is written to a log belonging to its guardian, that is, to the guardian that contains its object. So, there is a stable log per guardian. Q: Do you write whole objects to the log? A: Yes. However, if the object refers to other atomic objects, we do not write these other objects to the log. For example, if the modified object is an atomic array of integers, we will copy the entire array to the log. However, if it is an atomic record one of whose components is an atomic array, we will just write the record; the array would be written only if it were also modified by the transaction. Thus we try to write as little as we can. Programmers can control the amount of writing by how they structure the objects. For example, a huge set can be represented by a tree structure, and only that part of the tree that actually changed would need to be written.
Writing to the log happens when a transaction commits. At this point, Argus carries out two-phase commit. The Argus runtime system visits all the guardians that took part in this transaction. They have all kept track of which of their local objects were used by that transaction. Further, they know which ones were read and which ones were modified. Objects that have been modified have their versions written to stable storage. Q: One question to make sure I understand: What does the application do and what does the system do? Are the guardians the system? A: Allow me to give you an example application: a mail system. This would be implemented as a collection of guardians. These guardians run code that is provided by the application programmer. A guardian implementation consists of code that defines the guardian state, the processing of handler calls, the background code, and so on.
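As a rough sketch of what the application programmer writes, a mail guardian boils down to some state plus handler procedures. This is hypothetical pseudo-code, not Argus syntax (Argus is its own language); the point is that logging and two-phase commit do not appear anywhere in it — they are handled by the runtime:

```python
class MailGuardian:
    """Toy sketch of a guardian definition: stable state plus handlers.
    (In Argus the runtime, not this code, performs logging and commit.)"""
    def __init__(self):
        self.mailboxes = {}            # stable state: user -> list of messages

    def add_user(self, user):          # handler
        self.mailboxes.setdefault(user, [])

    def send(self, user, message):     # handler: modifies guardian state
        if user not in self.mailboxes:
            raise KeyError("unknown user")
        self.mailboxes[user].append(message)

    def receive(self, user):           # handler: reads guardian state
        return list(self.mailboxes.get(user, []))
```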
To get the system started, a user creates guardians and installs them at various nodes. A new guardian contains a copy of the code of its guardian definition. The state of a newly created guardian in the mail system would contain no mail, users, etc. The guardian's code runs when users of the mail system call its handlers to send and receive mail. As part of executing a handler call, the code modifies the state of its guardian, for example, to contain the new mail. However, the code of the guardian need not write modifications to the log nor carry out two-phase commit. This is done by the system under the covers. The Argus system runs within each guardian. For example, the state of the mail system survives crashes without the programmer needing to be concerned with writing and reading data from stable storage. Q: Can you describe in more detail the versions of objects that you store? A: I used the word "version" as did DeVries. However, we don't have a whole history of versions. When an action commits, we throw away the previous version of a modified object and install the new one as the true state of the object. I think of logging as an implementation of this technique. Each object has a unique identifier and we use this when we write the version to the log so we know exactly how to put objects back together again after a crash. Another perfectly correct way of implementing objects is to write a new object version to stable storage as soon as the object is modified. Then, there's almost nothing to do at commit time. There is an implementation trade-off here. We decided to keep data in volatile storage and write it to stable storage only during the prepare phase of the two-phase commit protocol. This means that if the action aborts, we do not write unnecessarily. Now I'd like to talk about a feature of Argus that has not yet been discussed: We allow actions to be nested; there is a notion of an action that contains a bunch of children actions.
These children actions in turn may contain children of their own, and so forth. There are two reasons why nesting is a good idea. First, it's good for parallelism; it gives us a nice way of taking a single action and breaking it up into
a bunch of subparts that can run simultaneously. Second, and more important, it gives us a checkpoint mechanism so that you can do a piece of work and abort it, and get back to where you were, and then try something else. We use nested actions in performing handler calls. For example, you can do many handler calls in parallel, and even if some of them don't work, the enclosing unit of work can still commit. That, of course, assumes that it only needed some of the calls to complete successfully. An important performance point is that we only do two-phase commit when
topactions (non-nested actions) commit. Thus, it isn't expensive to support nesting. Let me now speak a little bit about some of the algorithms that we use inside the Argus implementation. I should say that Argus runs on six Vaxes connected by a local area net and we have started to experiment with applications. I am primarily interested right now in expressiveness rather than performance. The performance of Argus is O.K. considering that it's a prototype running on top of UNIX. What's more interesting to me, as a designer of the language, is whether we omitted things that are needed by users. We intend to find this out by building a lot of applications and seeing where we have problems. Concerning stable storage, we use very standard notions here. In this area, we haven't tried to add to the state of the art. Q: How long does it take to write a log record? A: At present we use a disk over the network since we do not have enough disk space at the machines where the guardians run. This means a write requires a roundtrip message plus the actual write to disk. We are planning to put more disks on the nodes where guardians run so that we can do the writes locally and reduce the cost. Getting back to algorithms, we use the standard two-phase commit. The coordinator is the guardian at which the topaction began. As the topaction wanders around the network by making handler calls, we accumulate information about where it went. At commit time, Argus knows all the guardians that are participants. The coordinator sends out "prepare" messages to all participating guardians. They then write some information to stable storage and respond affirmatively to the coordinator. Then, phase two of the commit protocol begins. The delay that a user sees due to commit processing is for the prepare message and its acknowledgment, and the writing of the prepared record by the participant and the committed record by the coordinator.
This is all the visible delay, because as soon as the coordinator has written the commit record, it can let the user program continue. Phase two is done in the background. Thus, usually the delay consists of one message roundtrip and two disk writes. There would be more delay if a transaction had modified a large object since then more than one disk write might be needed at the participant to record the new version. This delay can be reduced by doing "early prepare." This means the new version is written to stable storage early and the writing may be done before phase one of the commit happens. However, we have not implemented early prepare.
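The commit flow just described might be sketched like this. It is a deliberately simplified toy model with invented names (`Participant`, `two_phase_commit`), omitting crash handling; in Argus, the phase-two notifications happen in the background, whereas here they run inline for clarity:

```python
class Participant:
    """Toy participant guardian: forces records to its own stable log."""
    def __init__(self, name, ok=True):
        self.name, self.ok, self.log = name, ok, []

    def prepare(self):
        if self.ok:
            self.log.append("prepared")   # prepared record forced to log
        return self.ok

    def commit(self):
        self.log.append("committed")

    def abort(self):
        self.log.append("aborted")

def two_phase_commit(coordinator_log, participants):
    """Toy two-phase commit: the user-visible delay ends once every
    participant has written its prepared record and the coordinator has
    written the committed record; the rest is cleanup."""
    # Phase one: prepare
    for p in participants:
        if not p.prepare():
            for q in participants:
                q.abort()
            coordinator_log.append("aborted")
            return False
    coordinator_log.append("committed")   # commit point (forced write)
    # Phase two (background in Argus): tell participants the outcome
    for p in participants:
        p.commit()
    return True
```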
Argus doesn't try very hard to propagate information about aborts. We made a decision to work hard on commit, but for abort, only to send courtesy messages that may not arrive. Argus does support a query mechanism whereby participants can ask the coordinator what has happened to a transaction. This sort of ties the system together. Q: How is an abort performed? A: Well, it's easy. Nothing much happens at abort time because the old value of the object (its "base" version) is maintained in volatile memory until the action terminates (commits or aborts). Therefore, to abort an action, any new versions created for the aborting action are simply discarded, and any locks held by that action are released. Q: What about orphans? A: We have mechanisms for orphan detection. Orphans in Argus are any processes that are running that we know will ultimately have to be aborted. For example, if a call is made and then the caller loses interest, well, you may still be processing the call at the called guardian; that processing is an orphan. In Argus, orphans are primarily problems because they waste resources. They ultimately will abort so they won't cause any permanent damage to the state. We have an algorithm that allows us to detect them in a timely fashion before they can ever observe that they are actually orphans. Q: What about two-phase locking? A: We use two-phase locking for built-in atomic objects, but two-phase locking can lead to concurrency problems. For example, if a transaction modifies just one element of an array, the entire array is locked until the transaction terminates. One solution is to use short transactions. The other thing you can do is use Argus' mechanism for building user-defined atomic types, which are types that provide much higher concurrency than what you can get from the built-in kind.
With these types, the programmer can control both the amount of concurrency and also the amount of information that must be written to stable storage when a transaction commits. I'd like to make one final point about algorithms, and talk about the implementation of internal services used in Argus. For example, we plan to allow guardians to move from one place to another. For this you need a service that enables you to locate a guardian. The technique we are planning to use for this seems to be a good one. Such services are logically centralized, but highly-available because we replicate them at various places. But, we don't replicate them everywhere. We just replicate them at a few sites. Q: Do you have any highly-available applications or performance information? A: We haven't built highly-available applications yet, and we've not yet done detailed performance measurements. Q: Would an application use existing guardians or write new ones? A: A new application will probably require the writing of some new guardian
definitions. However, applications might well make use of existing guardians. One possibility is to reuse code that has been written already. One can instantiate different copies of a guardian definition on different nodes of a network, so the application can have its own private guardians using the already-written code. As another paradigm, a particular guardian or guardians may implement a service that is made available to many clients across the network, and this service might be useful in the application. In this case, there is sharing of pre-existing guardians. Q: If there were a lot of guardians, wouldn't there be a lot of overhead? A: Guardians are expensive. Each guardian is implemented as a UNIX process. This means, for example, that they can't share code, and it's expensive for them to communicate with one another. If I had a machine I could start on from scratch, I would put them all in one address space and have all the guardians that exist on one node share code. There are a lot of expenses in our implementation that are due to the system that we're using. To begin to conclude, it should be clear by now that Argus provides a measure of reliability. Guardians are highly reliable. They don't lose information. But Argus doesn't provide any availability as far as people writing applications are concerned. So, what we expect to see in the applications is that people will use replication to make applications highly available. Replication is, of course, also good for lowering response time and reducing bottlenecks. I already mentioned providing increased concurrency by using user-defined atomic types. From the perspective of this workshop, Argus doesn't depend a lot on the fancy synchronization algorithms that are being discussed here. We do depend on the availability of loosely synchronized clocks. I suspect, though, that when we start to look at some applications of Argus, we will see the need for more sophisticated algorithms.
For example, if a user wants to build a highly-available system, where perhaps one guardian acts as the primary and the others are spares, then you have the problem of determining how to select a new primary from the set of backups. I think some of those more sophisticated algorithms will have to be used in those kinds of problems.
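As a deliberately naive illustration of the primary-selection problem (not a mechanism Argus provides), one common approach is to have every replica deterministically pick, say, the lowest-numbered live member, so that all replicas with the same view of membership pick the same primary. The hard part — which this sketch assumes away — is agreeing on who is actually alive:

```python
def choose_primary(replicas, is_alive):
    """Pick the lowest-numbered live replica as primary. Every node that
    applies this rule to the same membership view picks the same one."""
    live = [r for r in replicas if is_alive(r)]
    if not live:
        raise RuntimeError("no live replicas")
    return min(live)
```

In a real system, partitions can give two replicas different views of liveness, which is why the membership and agreement problems raised later in this panel matter.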
TABS Alfred Z. Spector Transarc Corporation The Gulf Tower 707 Grant St. Pittsburgh, PA 15219
Lindsay: We will now hear a talk about TABS by Alfred Spector. TABS is a distributed transaction management facility implemented at Carnegie-Mellon University under his guidance and direction. Though he didn't write any of the code, he knows where all the semicolons are and he counts all the lines. Spector: I must first correct the distinguished moderator. Prior to the (10th) Symposium on Operating Systems Principles, we were measuring the performance of TABS, and it was slower than we had anticipated. So, with Jeff Eppinger's help, I wrote about 40 lines of code that sped up the message passing system. Now, that is about 40/50,000 of the system... But, they were a very important 40 lines. But to the point, the goal of TABS was to provide building blocks for reliable, available systems. TABS is based on many of the same notions that Barbara Liskov has already described in her discussion of Argus. That is, we thought we would provide tools for supporting reliability and synchronization, and thereby give other developers good tools whereby they could layer reliable and available applications on top of them. More specifically, TABS provides transactional underpinnings that provide failure atomicity, permanence of effect, and synchronization. This makes it possible for users to provide data integrity. Availability would be then achieved by application-level use of replication. To begin, let's first consider a TABS node, as implemented on Perq workstations. The lowest-level component is the Accent Kernel, developed by my colleague Rick Rashid and his team. Accent provides processes with disjoint virtual address spaces and message passing. It has some other functions related to managing recoverable virtual memory, but I'll get back to those. The communication manager relays messages from one node to another over the network. That is, it extends the local message passing mechanism to be global over a network of nodes. The name server is a name dissemination mechanism.
It provides operations such as "Register", which adds a new name and associates it with a communication channel, and "Lookup", which returns one or more communication channels associated with a given name. In essence, the name server is a blackboard for storing and retrieving naming information. The TABS name server does not guarantee consistency. Higher levels of the system will detect this fact and attempt to get more up-to-date information. For example, there could be communication channels for objects that no longer exist. The name server is flexible, however, and it does permit multiple ports (or communication channels) for a given object. The recovery manager provides centralized recovery facilities for transaction aborts, that is, aborts initiated by a transaction. It also provides rather efficient facilities for crash recovery, due to its built-in write-ahead logging techniques. The recovery manager does not support media recovery; the CMU Perqs only had a single disk so we never did the work necessary to duplex log writes or to recover data after a disk crash. Adding media recovery would not change the data logged during normal processing, however, or affect forward processing performance. Finally, there is a single transaction manager per node, which is responsible for committing and aborting transactions. It is particularly involved in distributed transactions because it implements the two-phase commit protocol. The transaction manager uses the facility of the recovery manager to write log records to stable storage. It also works with the recovery manager to provide information on the state of transactions during crash recovery. On top of these basic TABS components, programmers build abstract objects. These objects are within Data Servers; they are similar to the Guardians described by Liskov. There are slight differences, but both abstractions encapsulate objects and permit operations to be performed on them.
To support recovery and failure atomicity, programmers constructing Data Servers use system libraries, many of which make calls on the TABS components I have just described. In order to build a Data Server, programmers may use the facilities of system-provided library routines to do locking. Users are also permitted to write their own locking code if they want to do some form of type-specific locking, perhaps to get higher performance in a B-tree or perhaps to prevent deadlocks when updating records. Programmers may also use the facilities of the recovery manager to provide failure atomicity. To do so, object implementers need to understand our recovery algorithms. One of these is an old value/new value recovery algorithm where the old and new value of objects are recorded in the log. The other is an operation recovery algorithm. The operation recovery algorithm stores the operations performed on objects and their inverses, thereby permitting you to go forward and backward. Both are compatible, in the sense that a node can be using both types of recovery at the same time. Both recovery algorithms involve periodic checkpointing to reduce the amount of time to recover from node failure.
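A minimal sketch of the old value/new value idea (invented names; the real TABS recovery manager writes records to a shared stable log and integrates with virtual memory): each record carries both values, so recovery can run the log backward to undo or forward to redo.

```python
class ValueLog:
    """Toy old-value/new-value recovery log: each record holds both the
    old and new value, so recovery can redo committed work and undo
    aborted work."""
    def __init__(self):
        self.records = []

    def update(self, store, key, new_value):
        # record (key, old, new) before applying the change
        self.records.append((key, store.get(key), new_value))
        store[key] = new_value

    def undo(self, store):
        for key, old, _new in reversed(self.records):   # backward pass
            if old is None:
                store.pop(key, None)    # key did not exist before
            else:
                store[key] = old

    def redo(self, store):
        for key, _old, new in self.records:             # forward pass
            store[key] = new
```

The operation-recovery alternative mentioned above would instead log the operation and its inverse (e.g., "insert k" / "delete k"), which can be far more compact for large objects.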
Some programmers just use the TABS/Accent remote procedure call facilities, either locally or across the network, to perform operations on abstract objects. Here's an outline of an example of a replicated abstraction that we built on top of these facilities. On each of three workstations, there is a data server called the directory representative server. Each is used to store a replica of information in a directory abstract object, which has operations insert, delete, lookup, and update. Assume we are using a weighted voting or quorum consensus algorithm for replication, as discussed by David Gifford or Maurice Herlihy, though Joshua Bloch in my group has invented a special variant which works well with record-oriented directories. Applications then can get high availability access to the directory by accessing only two out of three of the directory representative servers. So even if one is down, they can continue to run. If one crashes in the middle of the transaction, at worst the transaction aborts and the transaction can be reissued to another pair of directory representatives. [In the presentation, Spector showed a graphical, button-oriented user interface to an example application that permits a user to begin a transaction, commit it, abort it, and issue abstract operations on a replicated directory.] Q. Does the application look up the servers in the TABS name server each time? A. No, there is caching in the application. That is, an application caches ports for particular servers and continues to use them until it encounters a failure. Then, the application goes back to the name server and re-issues a lookup request to obtain additional ports. In this particular replication example (using the Bloch variant of the Gifford voting algorithm), there is a command called "Display Representatives" that causes all the representatives to display their contents.
The representatives show that they are storing version numbers, values, and keys for each record in the directory and also storing version numbers for gaps that represent the version numbers of information between records stored in the directory representative. The gap version numbers are there to make it possible to correctly answer a question as to whether or not a particular entry has been deleted. As a completely different kind of example use of TABS, we developed a transactional window manager server. With this server, we could do output to the screen and then if a transaction aborted, that server would be told about the abort and it would draw lines through the output indicating that that conversational transaction never had occurred. Transactions in progress are illustrated with dots through the text. Output from transactions that commit is displayed without alteration. So this is another example of the kind of thing we can easily do with TABS. Well, this has been by way of introduction, so I should now get to the primary point of my talk for this workshop. As should be clear, the TABS model is of a network with multiple independent processing nodes - nodes which have non-volatile and volatile storage, and also access
to stable storage. (Stable storage could be on a node or across a network; true stable storage was never implemented in TABS. [Ed.]) For many uses of TABS, there really need to be multiple networks (communication channels) to reduce the risk of partition. (It is possible to ensure that a partition will never cause inconsistencies, though the partition may cause lowered availability.) TABS is concerned with these failure modes: hardware crashes of the fail-stop processing nodes, less frequent destruction of non-volatile storage, transaction aborts due to timeout or deadlock, and lost or re-ordered network packets. TABS also could be used to provide some resiliency in the face of fail-stop, transient software problems that do not affect stable storage. The transaction model in general (and TABS in specific) does not provide complete protection against software problems, however. For example, the replication example I described previously has relatively immature code in it, and the code has caused many crashes. Yet, the underlying stored data was not harmed. In fact, when crashes were due to timing dependencies, only one replica might crash, thereby only minimally reducing system availability. So what were the key ideas in building this system? First, we used the client-server model, so that all of our data is encapsulated within servers. Objects having user-defined types can be developed by applications programmers using TABS primitives. A type of RPC is used in order to call the servers. Resource location is by a distributed name service of the type that I mentioned in the introduction. In TABS we implemented a form of nested transactions. Nested transactions, as in the Argus and Moss work at MIT, have independent recovery contexts. But, TABS does not do a good job on intra-transaction synchronization. Now that we've seen how the Argus people have done it, we think we can do it in much the same way in our follow-on system, called Camelot.
However, unlike Argus, we won't insist that every remote call create a nested transaction. TABS supports distributed replication as its availability strategy. Regarding failure detection, we detect crashes by timeouts, or when nodes are restarted. There are timeouts of many types. We have timeouts at the RPC level. We have timeouts at the session level. We have timeouts within the servers to prevent a transaction from running too long. This is the computation model for high availability: Many applications run in parallel. Each application is a process with its own state. Processes execute transactions. Transactions are assumed to access replicated data, so server crashes have minimal impact on the system. If we had an application checkpoint mechanism and the ability to access a log from another node, applications could be restarted after node crashes using information stored on a log. Perhaps, this model can be adapted for use in real-time systems. Here is how it might work. Multiple replicated sensors store their data on a queue, each datum being enqueued from within a transaction. Imagine that these queues are replicated. Then, there could be a collection of application processes competing to remove and process
information from these replicated queues. Thus, there would be processes on multiple nodes that are competing to read and process the next sensor values. Some of these processes, in turn, place their information on other queues. Finally, some of these processes actually correspond to physical actuators, e.g., the force-fight actuators that are used in the space shuttle. (These actuators use physical replication.) Certainly, this paradigm doesn't recover from certain types of failures, such as inconsistent sensor values that are sent to different queues. That would be a problem. Also, incorrect operation of processors would not be detected since there is no voting on computations. It would be possible to add this type of thing, and it might be worth exploring how to leverage off of the transaction model. Overall, TABS is more or less complete from an experimental systems perspective. We have measured it extensively, and we are building applications on it; e.g., we're exploring how to best use operation logging. Are there any questions? Q. I'm really struck by the similarities between Argus and TABS. Can you say what you think are the most important differences between the two? Spector: I'll answer that, and perhaps Barbara Liskov and Bruce Lindsay will also answer with respect to Argus and R*. I think that one major difference is the recovery techniques that are used. TABS has a single log per node. This permits a single log force per transaction per node for local transactions; if implemented correctly, TABS would require only a single force per node for distributed transactions. Objects are made recoverable by explicitly logging information about operations. Argus uses a collection of built-in atomic types and some specialized objects as the basis for developing recoverable objects. Also, the recovery mechanism in TABS (which uses write-ahead, log-based recovery that is integrated with the virtual memory system) is potentially very efficient.
Another difference is that, quite frankly, TABS has a greatly inferior programming interface. We didn't put time into that part of the system, though we believe there could be a clean library interface for Pascal or C. Also, TABS did not deal with the full generality of the Argus nested transaction model, nor did we insist on all RPCs being embedded in their own transaction. We're pleased with the latter decision, but believe we would implement a more complete nested transaction model if we had it to do over again. Liskov: I think the major difference, the most important difference is that Argus is a language as opposed to a system. Of course there's a system there that makes the language run. But it was designed as a language first with a system afterwards and it's just a different way of approaching a problem. Lindsay: There are many differences between R* and these other systems. I guess the major one is that R* is dealing with a single abstraction where both TABS and Argus support multiple abstractions across a network. With R*, the object was to provide a single abstraction with a distributed implementation. That is, really one abstraction but it's all over the place.
The R* process model is also quite different in that R* creates a remote process to serve a given user and retains that process until the user gets tired and goes home. Whereas in Argus and TABS, a process is created per request. Spector: Bruce, how about deadlock detection? Lindsay: Yes, we use distributed deadlock detection. But you guys would use that too if you got serious. There are no time-outs in R*. Liskov: In Argus, we cannot do deadlock detection in general. We can do deadlock detection for the built-in atomic types. But for the user defined types, we don't know how to do it because, first off, they typically can get into livelock rather than deadlocks. And secondly, there are often ways for future events to occur that could cause a deadlock, but we have no idea if they will actually happen. I think there's a fundamental problem here when you start to go to user-defined objects with unknown semantics. Q: Barbara, why can't type implementers provide entry points that say in an application-specific way for whom they are waiting, and have the system call upon those in piecing together a wait-for graph? Liskov: Imagine a process waiting on a queue: (Another process will place information on the queue). How do you know for what the process is waiting? We have datatypes like this, so we use timeout. Lindsay: It's worth mentioning the design of R* is such that we could support another abstraction running alongside the database, but none has been implemented and none is contemplated. Q: One of the questions I wanted to ask you (Spector) and also Liskov, concerns the underlying operating system on which you built your system. What was good about Accent and UNIX, and what wasn't? A: On Accent, we liked the ports. They make it very easy to implement remote procedure calls and we think they can be implemented very efficiently, though the Perq is such a slow machine that it is sometimes hard to tell. We modified Accent to support our recoverable storage abstraction.
What we did was define a new type of virtual memory in Accent called "Recoverable Storage". A data server process can then allocate an area of recoverable storage which is backed by a file, rather than the paging area on disk. Whenever you modify an object in recoverable storage the changes get written directly back to the disk. However, they only get written back to the disk after the log records pertaining to those pages have been forced to the log. This was only a slight modification to Accent. We think there are some efficiency advantages to this approach. One nice thing is that Data Server implementers handle all of the long-term data just as if it's in the virtual memory of the process. Liskov: We took it as a given that we weren't going to modify the UNIX kernel, and so I haven't looked carefully at the problem. Thus, we haven't considered what we would do differently, but I can tell you where our two biggest problems are. UNIX processes are a real problem for us. Especially for doing remote procedure
calls. They cause us to do an extra level of copying on both sides. Both on the sending side and the receiving side. The other big problem is blocking I/O. We implement a guardian as a single UNIX process. But, of course, a guardian has lots of sub-processes inside it. It was a good thing we did it that way because it would have been grossly inefficient had we chosen to do anything else. But, still if one of those processes waits for some I/O, we would like to go ahead and run some of the other ones; UNIX does not make it easy for us to do this. Q: I would like to make a comment. This morning we saw or heard a number of attempts to summarize some theoretical results. Now, we've heard a bunch of case studies from individuals who have been branded practitioners. It would be really nice if there could be some summary of the common points among the systems. It seems like everyone is describing their favorite animal. I guess I'm just expressing a desire that there could be a nice system theory describing these systems. Spector: I actually think that there's one bright spot, at least among R*, Argus, and TABS. There is a substantial amount of commonality. While we may each choose a different technique for aborting a transaction or writing information to a log, we all use a variant of locking. We all use transactions as the underlying technique for providing atomicity. We all provide encapsulations of objects using an abstract data type model. So I believe there is some commonality. Perhaps, we haven't managed to convey this, but that is the nature of a case studies panel. Now, I would admit that in the presentations by Ken Birman and Shel Finkelstein, you were seeing systems that explored different approaches, with different goals. Q: Alfred, so there is commonality. However, what we have here is a workshop of practitioners and theoreticians.
In the morning, the practitioners said things like, "this is not reasonable; this isn't my question; this really doesn't address the problem." All of that is fair. Now you're telling us, the nice thing is that these three systems exhibit some commonality. My question is: are there any real problems theoreticians should be worried about, or is this commonality really saying that the problems are solved? If not, how would you, having the experience many of us don't, characterize some of the crucial problems that should be addressed? Are there other things, other than Byzantine agreement, that the theory community should be looking at? Spector: Well, I think there are open problems in how you structure highly available replicated services on top of systems like the ones we're talking about. I think we don't understand replication as well as we might. I don't think we understand yet what trade-offs we ought to make between the strength of the consistency constraints that we place on the data versus the parallelism and availability that would be permitted. Now, perhaps, this is not highly theoretical work. But, I suspect there is a lot more formalization and analysis that is needed. Lindsay: I agree. Barbara was quite right when they decided not to fool around
with replication at a low level, because there was insufficient understanding.

Liskov: We'd also like to know what's the truth about partitions. I always assumed that there are partitions, and that I can't detect partitions. And I would like to understand if there's anything better than that that I can do.

Ken Birman: In ISIS we had to assume that partitioning doesn't occur, simply because it seems like a very difficult problem to deal with.

Q: One thing I hear is that there needs to be more clean mathematical formalisms describing high-level abstractions on which applications can count. The other question being asked concerns the implementation tradeoffs on the basis of which you decide how to do things, like supporting replication and consistency. Perhaps there is a way that mathematics can express these tradeoffs formally.

Lindsay: I'm not sure that engineering reduces to mathematics in all cases. We are making tough choices. In R*, for example, we picked blocking commit protocols because we don't want two extra messages per transaction to handle an occasional failed coordinator.

Shel Finkelstein: I think that whether or not you believe in atomic broadcast, the problems of things like the consensus algorithms, aspects of clock synchronization in certain environments, and membership protocols are important for a number of different situations.

Jim Gray: Depending on what sorts of consistency conditions you have, you may be able to get by with some different algorithms than the transaction model. I'm not trying to suggest that two-phase commit is the wrong thing to use. What I'm saying is that the problem of membership is a problem that really does seem to show up all the time in all different situations.

Spector: I think one of the things to do is to look at some of the end-user applications that we have been explicitly looking at today.
For example, this is an application of interest at Carnegie Mellon: we are deploying a system that will ultimately have between five and ten thousand high-power personal workstations. And we would like those workstations to work like the timesharing systems of yesterday, except with high performance and availability. So we would like people to be able to go to one workstation and access their files, then go to another one and continue to access their files, and be able to survive arbitrary crashes with no availability problems. There are distributed approaches for replication that one could use. Will they scale to a system of that size? I don't know, and I think there are problems associated with trying to handle really huge structures with really large loads like this. There was a conference held here six months ago or so looking at what happens when you try to support a thousand or more transactions per second on a very large distributed database. So I think some of those problems could yield interesting theoretical subproblems. The theoreticians might let their minds wander to this problem. People for a long time have been looking at synchronization, and they've come up with all sorts of synchronization mechanisms that might work better than locking
under some circumstances. Perhaps, with big problems, fast computers, and large networks, some of these algorithms might make sense. It would be useful to show that on real practical problems these new synchronization techniques would be useful. In this panel, we haven't talked much about real-time systems. There aren't many practitioners here who represent the real-time arena. But there is some reason to believe that many more of the algorithms that have been described by the theoreticians apply to real-time systems - at least more than they apply to transaction processing systems.

Lindsay: I would like to thank Alfred and the rest of the panel.
COMMUNICATION SUPPORT FOR RELIABLE DISTRIBUTED COMPUTING*

Kenneth P. Birman
Thomas A. Joseph
Department of Computer Science, Cornell University, Ithaca, New York

We describe a collection of communication primitives integrated with a mechanism for handling process failure and recovery. These primitives facilitate the implementation of fault-tolerant process groups, which can be used to provide distributed services in an environment subject to non-malicious crash failures.
1. Introduction

At Cornell, we recently completed a prototype of the ISIS system, which transforms abstract type specifications into fault-tolerant distributed implementations, while insulating users from the mechanisms by which fault-tolerance is achieved [Birman-a]. A wide range of reliable communication primitives has been proposed in the literature, and we became convinced that by using such primitives when building the ISIS system, complexity could be avoided. Unfortunately, the existing protocols, which range from reliable and atomic broadcast [Chang] [Cristian] [Schneider] to Byzantine agreement [Strong], either do not satisfy the ordering constraints required for many fault-tolerant applications or satisfy a stronger constraint than necessary at too high a cost. In particular, these protocols have not attempted to minimize the latency (delay) incurred before message delivery can occur. In ISIS, latency appears to be a major factor that limits performance.
Fault-tolerant distributed systems also need a way to detect failures and recoveries consistently, and we found that this could be integrated into the communication layer in a manner that reduces the synchronization burden on higher level algorithms.

*This work was supported by the Defense Advanced Research Projects Agency (DoD) under ARPA order 5378, Contract MDA903-85-C-0124, and by the National Science Foundation under grant DCR-8412582. The views, opinions and findings contained in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.
These observations motivated the development of a new collection of primitives, which we present below. Our broadcast primitives are designed to respect several sorts of ordering constraints, and have cost and latency that vary depending on the nature of the constraint required [Birman-b] [Joseph-a] [Joseph-b]. Failure and recovery are integrated into the communication subsystem by treating these events as a special sort of broadcast issued on behalf of a process that has failed or recovered. The primitives are presented in the context of fault-tolerant process groups: groups of processes that cooperate to implement some distributed algorithm or service, and which need to see consistent orderings of system events in order to achieve mutually consistent behavior. Our primitives provide flexible, inexpensive support for process groups of this sort. By using these primitives, the ISIS system achieved both high levels of concurrency and surprisingly good performance. Equally important, its structure was made surprisingly simple, making it feasible to reason about the correctness of our algorithms. In the remainder of this paper we summarize the issues and alternatives that the designer of a distributed system is presented with, focusing on two styles of support for fault-tolerant computing: remote procedure calls coupled with a transactional execution facility, and the fault-tolerant process group mechanism mentioned above. Next, our primitives are described. We conclude by speculating on future directions in which this work might be taken.

2. Goals and assumptions

The difficulty of constructing fault-tolerant distributed software can be traced to a number of interrelated issues. The list that follows is not exhaustive, but attempts to touch on the principal considerations that must be addressed in any such system:
1. Synchronization. Distributed systems offer the potential for large amounts of concurrency, and it is usually desirable to operate at as high a level of concurrency as possible. However, when we move from a sequential execution environment to a concurrent one, it becomes necessary to synchronize actions that may conflict in their access to shared data or entail communication with overlapping sets of processes. Additional problems that can arise in this context include deadlock avoidance or detection, livelock avoidance, etc.
2. Fault detection. It is usually necessary for a fault-tolerant application to have a consistent picture of which components fail, and in what order. Timeout, the most common mechanism for detecting failure, is unsatisfactory, because there are many situations in which a healthy component can time out with respect to one component without this being detected by another. Failure detection under more rigorous requirements requires an agreement protocol that is related to Byzantine agreement [Strong] [Hadzilacos].
3. Consistency. When a group of processes cooperate in a distributed system, it is necessary to ensure that the operational processes have consistent views of the state of the group as a whole. For example, if process p believes that some property P holds, and on the basis of this interacts with process q, the state of q should not contradict the fact that p believes P to be true. This problem is closely related to notions of knowledge and consistency in distributed systems [Halpern] [Lamport]. In our context, P will often be the assertion that a broadcast has been received by q, or that q saw some sequence of events occur in the same order as did p.
4. Serializability. Many distributed systems are partitioned into data manager processes, which implement shared variables, and transaction manager processes, which issue series of requests to data managers [Bernstein]. If transaction managers can execute concurrently, it is often desirable to ensure that transactions produce serializable outcomes [Eswaran] [Papadimitriou]. Serializability is increasingly viewed as an important property in "object-oriented" distributed systems that package services as abstract objects with which clients communicate by remote procedure calls (RPC). On the other hand, there are systems for which serializability is either too strong a constraint, or simply inappropriate.

Jointly, these problems render the design of fault-tolerant distributed software daunting. The correctness of any proposed design and of its implementation become serious, if not insurmountable, concerns. We faced this range of problems in our work on the
ISIS system, and rapidly became convinced that in the absence of some systematic approach for dealing with them, a correct implementation of ISIS could never be constructed. In Sec. 6, we will show how the primitives of Sec. 5 provide such an approach.

The failure model that one adopts has considerable impact on the structure of the resulting system. We adopted the model of fail-stop processors [Schneider]: when failures occur, a processor simply stops (crashes), as do all the processes executing on it. We rejected the extremely pessimistic assumptions of the malicious Byzantine failure models because they lead to slower, more redundant software, and because the probability that a system failure will be undetectably malicious seems vanishingly small in practice. Work based on Byzantine assumptions is described in [Lamport] and [Schlichting].
We also assume that the communication network is reliable but subject to unbounded delay. Although network partitioning is an important problem, we do not address it here. Further assumptions are sometimes made about the availability of synchronized real-time clocks. Here, we adopt the position that although reasonably accurate elapsed-time clocks are normally available, closely synchronized clocks frequently are not. For example, the 60Hz "line" clocks commonly used on current workstations are only accurate to 16ms. On the other hand, 4-8ms inter-site message transit times are common and 1-2ms are reported increasingly often. Thus, it is impossible to synchronize clocks to better than 32-48ms, enough time for a pair of sites to exchange between 4 and 50 messages. Thus, we assume that clock skew is "large" compared to inter-site message latency.

3. Alternatives

Two different approaches to reliable distributed computing have become predominant. The first approach involves the provision of a communication primitive, such as atomic broadcast, which can be used as the framework on which higher level algorithms
are designed. Such a primitive seeks to deliver messages reliably to some set of destinations, despite the possibility that failures might occur during the execution of the protocol. We term this the process group approach, since it lends itself to the organization of cooperating processes into groups, as described in the introduction. Process groups are an extremely flexible abstraction, and have been employed in the V Kernel [Cheriton] as well as in the ISIS system. The idea of using process groups to address the problems raised in the previous section seems to be new.

A higher level approach is to provide mechanisms for transactional interactions between processes that communicate using remote procedure calls [Birrell]. This has led to work on nested transactions (due to nested RPC's) [Moss], support for transactions at the language level [Liskov], transactions within an operating system kernel [Spector] [Allchin] [Popek] [Lazowska], and transactional access to higher-level replicated services, such as resilient objects in ISIS or relations in database systems. The primitives in a transactional system provide mechanisms for distributing the request that initiates the transaction, accessing data (which may be replicated), performing concurrency control, and implementing commit or abort. Additional mechanisms are normally needed for orphan termination, deadlock detection, etc. The issue then arises of how these mechanisms should themselves be implemented. Our work in ISIS leads us to believe that transactions are easily implemented on top of fault-tolerant process groups; lacking such a mechanism, a number of complicated protocols are needed and the associated system support can be substantial. Moreover, transactions represent a relatively heavyweight solution to the problems surveyed in the previous section. We now believe that transactions are inappropriate for casual interactions between processes in typical distributed systems. The remainder of this paper is therefore focused on the process group approach.
4. Existing broadcast primitives

The considerations outlined above motivated us to examine reliable broadcast primitives. Previous work has been reported on this problem, under assumptions comparable with those of Sec. 2, and we begin by surveying this research. In [Schneider], an implementation of a reliable broadcast primitive is described. Such a primitive ensures that a designated message will be transmitted from one site to all other operational sites in a system; if a failure occurs but any site has received the message, all will eventually do so. [Chang] and [Cristian] describe implementations for atomic broadcast, which is a reliable broadcast with the additional property that messages are delivered in the same order at all overlapping destinations, and this order preserves the transmission order if messages originate in a single site.

Atomic broadcast is a powerful abstraction, and essentially the same behavior is provided by one of the primitives we discuss in the next section. However, it has several drawbacks which made us hesitant to adopt it as the only primitive in the system. Most serious is the latency that is incurred in order to satisfy the delivery ordering property. Without delving into the implementations, which are based on a token scheme in [Chang] and an acknowledgement protocol in [Schneider], we observe that the delaying of certain messages is fundamental to the establishment of a unique global delivery ordering; indeed, it is easy to prove that this must always be the case. In [Chang] a primary goal is to minimize the number of messages sent, and the protocol given performs extremely well in this regard. However, a delay occurs while waiting for tokens to arrive, and the delivery latency that results may be high. [Cristian] assumes that clocks are closely synchronized and that message transit times are bounded by well-known constants, and uses this to derive atomic broadcast protocols tolerant of increasingly severe classes of failures. The protocols explicitly delay delivery to achieve the desired global ordering on broadcasts. Hence for poorly synchronized clocks (which are typical of existing workstations), latency would be high in comparison to inter-site message transit times.

Another drawback of the atomic broadcast protocols is that no mechanism is provided for ensuring that all processes observe the same sequence of failures and recoveries, or for ensuring that failures and recoveries are ordered relative to ongoing broadcasts. We decided to look more closely at these issues.
5. Our broadcast primitives

We now describe three broadcast protocols - GBCAST, BCAST, and OBCAST - for transmitting a message reliably from a sender process to some set of destination processes. Details of the protocols and their correctness proofs can be found in [Birman-b]. The protocols ensure "all or nothing" behavior: if any destination receives a message, then unless it fails, all destinations will receive it.

5.1. The GBCAST primitive

GBCAST (group broadcast) is the most constrained, and costly, of the three primitives. It is used to transmit information about failures and recoveries to members of a process group. A recovering member uses GBCAST to inform the operational ones that it has become available. Additionally, when a member fails, the system arranges for a GBCAST to be issued to group members on its behalf, informing them of its failure. Arguments to GBCAST are a message and a process group identifier, which is translated into a set of destinations as described below (Sec. 5.6).

Our GBCAST protocol ensures that if any process receives a broadcast B before receiving a GBCAST G, then all overlapping destinations will receive B before G. This is true regardless of the type of broadcast involved. Moreover, when a failure occurs, the corresponding GBCAST message is delivered after any other broadcasts from the failed process. Each member can therefore maintain a view listing the membership of the process group, updating it when a GBCAST is received. Although views are not updated simultaneously in real time, all members observe the same sequence of view changes. Since GBCAST's are ordered relative to all other broadcasts, all members receiving a given broadcast will have the same value of view when they receive it.¹ Members of a process group can use this value to pick a strategy for processing an incoming request, or to react to failure or recovery without having to run any special protocol first. Since the GBCAST ordering is the same everywhere, their actions will all be consistent.

Notice that when all the members of a process group may have failed, GBCAST also provides an inexpensive way to determine the last site that failed: process group members simply log each new view that becomes defined on stable storage before using it; a simplified version of the algorithm in [Skeen-a] can then be executed when recovering from failure.

¹ A problem arises if a process p fails without receiving some message after that message has already been delivered to some other process q: q's view when it received the message would show p to be operational; hence, q will assume that p received the message, although p is physically incapable of doing so. However, the state of the system is now equivalent to one in which p did receive the message, but failed before acting on it. In effect, there exists an interpretation of the actual system state that is consistent with q's assumption.

5.2. The BCAST primitive

The GBCAST primitive is too costly to be used for general communication between process group members. This motivates the introduction of weaker (less ordered) primitives, which might be used in situations where a total order on broadcast messages is not
necessary. Our second primitive, BCAST, satisfies such a weaker constraint. Specifically, it is often desired that if two broadcasts are received in some order at a common destination site, they be received in that order at all other common destinations, even if this order was not predetermined. For example, if a process group is being used to maintain a replicated queue and BCAST is used to transmit queue operations to all copies, the operations will be done in the same order everywhere; hence the copies of the queue will remain mutually consistent. The primitive BCAST(msg, label, dests), where msg is the message and label is a string of characters, provides this behavior. Two BCAST's having the same label are delivered in the same order at all common destinations. On the other hand, BCAST's with different labels can be delivered in arbitrary order, and since BCAST is not used to propagate information about failures, no flushing mechanism is needed. The relaxed synchronization results in lower latency.
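To make the per-label ordering guarantee concrete, here is a minimal sketch (hypothetical Python, not the ISIS code) of the classical two-phase ordering idea that Sec. 7 attributes to Skeen: each destination proposes a priority for an incoming message, the sender fixes the maximum proposal as the final priority, and each destination delivers pending messages in final-priority order. Ties are broken here by message identifier; a real protocol would break them by sender identifier.

```python
class Destination:
    """One destination of a BCAST-like two-phase ordering protocol."""
    def __init__(self):
        self.clock = 0        # largest priority proposed or observed so far
        self.pending = {}     # msg_id -> (priority, is_final)
        self.delivered = []   # delivery order at this destination

    def propose(self, msg_id):
        """Phase 1: assign a tentative priority and report it to the sender."""
        self.clock += 1
        self.pending[msg_id] = (self.clock, False)
        return self.clock

    def commit(self, msg_id, final):
        """Phase 2: fix the final priority, then deliver whatever is ready."""
        self.clock = max(self.clock, final)
        self.pending[msg_id] = (final, True)
        while self.pending:
            # Smallest (priority, msg_id) pair. A non-final priority can only
            # grow, never shrink, so if the smallest pending entry is final it
            # is safe to deliver it.
            mid, (prio, is_final) = min(self.pending.items(),
                                        key=lambda kv: (kv[1][0], kv[0]))
            if not is_final:
                break
            self.delivered.append(mid)
            del self.pending[mid]

def bcast(msg_id, dests):
    final = max(d.propose(msg_id) for d in dests)   # phase 1: collect proposals
    for d in dests:                                 # phase 2: announce maximum
        d.commit(msg_id, final)
```

Even when the phase-1 messages of two broadcasts interleave differently at different destinations, every destination ends up delivering in the same final-priority order, which is exactly the property the replicated-queue example relies on.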
5.3. The OBCAST primitive

Our third primitive, OBCAST (ordered broadcast), is the weakest in the sense that it involves less distributed synchronization than GBCAST or BCAST. OBCAST(msg, dests) atomically delivers msg to each operational dest. If an OBCAST is potentially causally dependent on another, then the former is delivered after the latter at all overlapping destinations. A broadcast B2 is potentially causally dependent on a broadcast B1 if both broadcasts originate from the same process and B2 is sent after B1, or if there exists a chain of message transmissions and receptions or local events by which knowledge could have been transferred from the process that issued B1 to the process that issued B2 [Lamport]. For causally independent broadcasts, the delivery ordering is not constrained.

OBCAST is valuable in systems like ISIS, where concurrency control algorithms are used to synchronize concurrent computations. In these systems, if two processes communicate concurrently with the same process, the messages are almost always independent ones that can be processed in any order: otherwise, concurrency control would have caused one to pause until the other was finished. On the other hand, order is clearly important within a causally linked series of broadcasts, and it is precisely this sort of order that OBCAST respects.
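The "potential causal dependence" relation can be made concrete with vector timestamps, a standard device for tracking Lamport's happened-before relation. The following is a hypothetical sketch, not the ISIS implementation (which, as Sec. 7 describes, floods message copies rather than shipping timestamps):

```python
class Process:
    """A process that stamps each broadcast with a vector clock."""
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n          # one counter per process in the system

    def send(self, payload):
        self.vc[self.pid] += 1     # local event: issuing a broadcast
        return (self.pid, list(self.vc), payload)

    def receive(self, msg):
        _, ts, _ = msg             # merge the knowledge the message carries
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]

def causally_precedes(b1, b2):
    """True iff broadcast b1 potentially causally precedes broadcast b2."""
    t1, t2 = b1[1], b2[1]
    return t1 != t2 and all(a <= b for a, b in zip(t1, t2))
```

OBCAST must deliver b1 before b2 at overlapping destinations whenever causally_precedes(b1, b2) holds; when neither broadcast precedes the other they are independent, and either delivery order is acceptable.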
5.4. Other broadcast primitives

A weaker broadcast primitive is reliable broadcast, which provides all-or-nothing delivery, but no ordering properties. The formulation of OBCAST in [Birman-b] actually includes a mechanism for performing broadcasts of this sort, hence no special primitive is needed for the purpose. Additionally, there may be situations in which BCAST protocols that also satisfy an OBCAST ordering property would be valuable. Although our BCAST primitive could be changed to respect such a rule, when we considered the likely uses of the primitives it seemed that BCAST was better left completely orthogonal to OBCAST. In situations needing hybrid ordering behavior, the protocols of [Birman-b] could easily be modified to implement BCAST in terms of OBCAST, and the resulting protocol would behave as desired.
5.5. Synchronous versus asynchronous broadcast abstractions

Many systems employ RPC internally, as a lowest level primitive for interaction between processes. It should be evident that all of our broadcast primitives can be used to implement replicated remote procedure calls [Cooper]: the caller would simply pause until replies have been received from all the participants (observation of a failure constitutes a reply in this case). We term such a use of the primitives synchronous, to distinguish it from an asynchronous broadcast in which no replies, or just one reply, suffices. In our work on ISIS, GBCAST and BCAST are normally invoked synchronously, to implement a remote procedure call by one member of an object on all the members of its process group. However, OBCAST, which is the most frequently used overall, is almost never invoked synchronously.

Asynchronous OBCAST's are the source of most concurrency in ISIS: although the delivery ordering is assured, transmission can be delayed to enable a message to be piggybacked on another, or to schedule I/O within the system as a whole. While the system cannot defer an asynchronous broadcast indefinitely, the ability to defer it a little, without delaying some computation by doing so, permits load to be smoothed. Since OBCAST respects the delivery orderings on which a computation might depend, and is ordered with respect to failures, the concurrency introduced does not complicate higher level algorithms. Moreover, the protocol itself is extremely cheap.

A problem is introduced by our decision to allow asynchronous broadcasts: the atomic reception property must now be extended to address causally related sequences of asynchronous messages. If a failure were to result in some broadcasts being delivered to all their destinations but others that precede them not being delivered anywhere, inconsistency might result even if the destinations do not overlap. We therefore extend the atomicity property as follows: if process t receives a message m from process s, and s subsequently fails, then unless t fails as well, any message m' received by s before it sent m must be delivered to its remaining destinations. This is because the state of t may depend on such a message m'. The costs of the protocols are not affected by this change.

A second problem arises when the user-level implications of this atomicity rule are considered. In the event of a failure, any suffix of a sequence of asynchronous broadcasts could be lost and the system state would still be internally consistent. A process that is about to take some action that may leave an externally visible side-effect will need a way
to pause until it is guaranteed that such broadcasts have actually been delivered. For this purpose, a flush primitive is provided. Occasional calls to flush do not eliminate the benefit of using OBCAST asynchronously. Unless the system has built up a considerable backlog of undelivered broadcast messages, which should be rare, flush will only pause while transmission of the last few broadcasts completes.

5.6. Group addressing protocol

Since group membership can change dynamically, it may be difficult for a process to compute a list of destinations to which a message should be sent, for example, as is needed to perform a GBCAST. In [Birman-b] we report on a protocol for ensuring that a given broadcast will be delivered to all members of a process group in the same view. This view is either the view that was operative when the message transmission was initiated, or a view that was defined subsequently. The algorithm is a simple iterative one that costs nothing unless the group membership changes, and permits the caching of possibly inaccurate membership information near processes that might want to communicate with a group. Using the protocol, a flexible message addressing scheme can readily be supported.

5.7. Example

Figure 1 illustrates a pair of computations interacting with a process group while its membership changes dynamically.
One client issues a pair of OBCAST's, then uses BCAST to perform a third request on the group. A second client interacts only once, using BCAST. Note that unless the first client invoked flush before issuing the BCAST, the BCAST might be received before the prior OBCAST's at some sites. Arrows showing reply messages have been omitted to simplify the figure, but it would normally be the case that one or more group members reply to each request.

6. Using the primitives

The reliable communication primitives described above dramatically simplify the solution of the problems cited in Sec. 2:
1. Synchronization. Many synchronization problems are subsumed into the primitives themselves. For example, consider the use of GBCAST to implement recovery. A recovering process would issue a GBCAST to the process group members, requesting that state information be transferred to it. In addition to sending the current state of the group to the recovering process, group members update the process group view at this time. Subsequent messages to the group will be delivered to the recovered process, with all necessary synchronization being provided by the ordering properties of GBCAST. In situations where other forms of synchronization are needed, BCAST
provides a simple way to ensure that several processes take actions in the same order, and this form of low-level synchronization simplifies a number of higher-level synchronization problems. For example, if BCAST is used to request write-locks from lock-manager processes, two write-lock requests on the same item can never deadlock by being granted in different orders by a pair of managers.

[Figure 1: Client processes interacting with a process group]
2. Fault detection. Consistent failure (and recovery) detection is trivial using our primitives: a process simply waits for the appropriate process group view to change. This facilitates the implementation of algorithms in which one process monitors the status of another process. A process that acts on the basis of a process group view change does so with the assurance that other group members will (eventually) observe the same event and will take consistent actions.
3. Consistency. We believe that consistency is generally expressible as a set of atomicity and ordering constraints on message delivery, particularly causal ones of the sort provided by OBCAST. Our primitives permit a process to specify the communication properties needed to achieve a desired form of consistency. Continued research will be needed to understand precisely how to pick the weakest primitive in a designated situation.
4. Serializability. To achieve serializability, one implements a concurrency control algorithm and then forces computations to respect the serialization order that this algorithm chooses. The BCAST primitive, as observed above, is a powerful tool for establishing an order between concurrent events. Having established such an order, OBCAST can be used to distribute information about the computation and also its termination (commit or abort). Any process that observes the commit or abort of a computation will only be able to interact with data managers that have received messages preceding the commit or abort, hence a highly asynchronous transactional execution results. This problem is discussed in more detail in [Birman-a] [Joseph-a] [Joseph-b].
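The write-lock example in point 1 can be sketched directly (hypothetical Python; the replica below simply processes requests in the order BCAST is assumed to have fixed for that item's label). Because every replica sees the requests for an item in the same order, all replicas build identical wait queues and can never grant two conflicting locks in opposite orders:

```python
from collections import deque

class LockManagerReplica:
    """One member of a replicated lock-manager process group."""
    def __init__(self):
        self.queues = {}    # item -> deque of waiting transaction ids

    def on_lock_request(self, item, txn):
        # Invoked in BCAST delivery order for the label `item`.
        q = self.queues.setdefault(item, deque())
        q.append(txn)
        return q[0] == txn              # granted iff first in line

    def on_release(self, item):
        q = self.queues[item]
        q.popleft()
        return q[0] if q else None      # next lock holder, if any
```

Feeding two replicas the same delivery order yields identical grant decisions at both, which is the deadlock-freedom argument made above.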
7. Implementation The communication primitives can be built in layers, starting with a bare network providing unreliable datagrams.
A site-to-site acknowledgement protocol converts this
into a sequenced, error-free message abstraction, using timeouts to detect apparent failures. An agreement protocol is then used to order the site-failures and recoveries consistently. If timeouts cause a failure to be detected erroneously, the protocol forces the affected site to undergo recovery. Built on this is a layer that supports the primitives themselves. OBCAST has a very light-weight implementation, based on the idea of flooding the system with copies of a message: Each process buffers copies of any messages needed to ensure the consistency of its view of the system. If message m is delivered to process p, and m is potentially causally dependent on a message m', then a copy of m' is sent to p as well (duplicates are discarded).
A garbage collector deletes superfluous copies after a message has
reached all its destinations.
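The flooding rule above can be sketched in a few lines. This is a hedged illustration, not the ISIS implementation: the message and process structures are invented, and the real OBCAST tracks potential causal dependence rather than carrying explicit dependency lists with each message.

```python
# Illustrative sketch (invented structures, not ISIS code) of OBCAST-style
# flooding: a message carries copies of the messages it causally depends on,
# and a recipient delivers those dependencies first, discarding duplicates.

class Process:
    def __init__(self):
        self.delivered = []   # message ids, in delivery order
        self.seen = set()

    def receive(self, msg):
        # Deliver causal predecessors first; duplicate copies are discarded.
        for dep in msg.get("deps", []):
            self.receive(dep)
        if msg["id"] not in self.seen:
            self.seen.add(msg["id"])
            self.delivered.append(msg["id"])

# m2 is potentially causally dependent on m1, so a copy of m1 rides along.
m1 = {"id": "m1", "deps": []}
m2 = {"id": "m2", "deps": [m1]}

p = Process()
p.receive(m2)          # p never received m1 directly; the flooded copy suffices
assert p.delivered == ["m1", "m2"]
```

The guarantee illustrated is exactly the one in the text: if m is delivered to p and m depends on m', then m' is delivered at p first.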
By using extensive piggybacking and a simple scheduling
algorithm to control message transmission, the cost of an OBCAST is kept low -- often, less than one packet per destination. BCAST employs a two-phase protocol based on one suggested to us by Skeen [Skeen-b].
This protocol has higher latency than OBCAST
because delivery can only occur during the second phase; BCAST is thus inherently synchronous. In ISIS, however, BCAST is used rarely; we believe that this would be the case in other systems as well. GBCAST is implemented using a two-phase protocol similar to the one for BCAST, but with an additional mechanism that flushes messages from a failed process before delivering the GBCAST announcing the failure.
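The two-phase structure can be illustrated as follows. This is a hedged sketch of a Skeen-style ordered broadcast, not the ISIS code: recipients propose priorities in phase one, the maximum proposal becomes the final priority in phase two, and delivery follows the agreed priorities. (A real protocol also breaks priority ties, e.g., by process identifier, and holds delivery of a message back until no undecided message could precede it.)

```python
# Hedged sketch of a Skeen-style two-phase ordered broadcast (names invented).
# Phase 1: each recipient proposes a priority from a local logical clock.
# Phase 2: the sender announces the maximum proposal as the final priority,
# and recipients deliver messages in final-priority order.

class Recipient:
    def __init__(self):
        self.clock = 0
        self.delivered = []        # (final_priority, msg), kept sorted

    def propose(self, msg):
        self.clock += 1            # phase 1: propose a priority
        return self.clock

    def commit(self, msg, final_priority):
        self.clock = max(self.clock, final_priority)
        self.delivered.append((final_priority, msg))
        self.delivered.sort()      # phase 2: deliver in agreed order

def bcast(msg, recipients):
    proposals = [r.propose(msg) for r in recipients]   # phase 1
    final = max(proposals)
    for r in recipients:
        r.commit(msg, final)                           # phase 2

group = [Recipient(), Recipient()]
bcast("a", group)
bcast("b", group)
assert group[0].delivered == group[1].delivered        # same order everywhere
```

The latency cost mentioned in the text is visible here: no recipient can deliver until the second phase has completed.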
Although GBCAST is slow, it is used even less often than BCAST.
Preliminary performance
figures appear in [Birman-b].

8. Applications of the approach
Our work with communication primitives has convinced us that the resilient objects provided by the ISIS system exist at too high a level for many sorts of distributed application. For example, consider the cognac still shown in Figure 2. If independent, nonidentical computer systems were used to control distillation, two aspects would have to be addressed.
First, it would be necessary to design the hardware itself in a way that
admits safe actions in all possible system states. Second, however, one would need to implement the control software in each processor in a way that ensures mutual
consistency of the operational computing units.

[Figure 2: An automated cognac still. Legend: 1. Pressure/temp sensors; 2. Flow valve #1; 3. Wine; 4. Flow valve #2; 5. Bottles; 6. Heater.]
That is, given that the specification
describes a sequence of actions to take in some scenario (for example, detection of excessive pressure in the distillation vessel), can we be assured that the operational processors will jointly act to avert a disastrous spill of cognac? We believe that fault-tolerant process groups provide a simple, elegant way to address problems such as this one. We plan to complete an implementation of the protocols by the summer of 1986, and then to develop a collection of software subsystems running on top of them.

9. Acknowledgement
The authors are grateful to Pat Stephenson and Fred Schneider for many suggestions that are reflected in the presentation of this material, and to Dale Skeen, with whom we collaborated on many aspects of the work reported here.

10. References
[Allchin] Allchin, J., McKendry, M. Synchronization and recovery of actions. Proc. 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing, Montreal, Canada, 1983.
[Babaoglu] Babaoglu, O., Drummond, R. The streets of Byzantium: Network architectures for
fast reliable broadcast. IEEE Trans. on Software Engineering TSE-11, 6 (June 1985).
[Bernstein] Bernstein, P., Goodman, N. Concurrency control algorithms for replicated database systems. ACM Computing Surveys 13, 2 (June 1981), 185-222.
[Birman-a]
Birman, K. Replication and fault-tolerance in the ISIS system. Proc. 10th ACM SIGOPS Symposium on Operating Systems Principles. Orcas Island, Washington,
Dec. 1985, 79-86.
[Birman-b]
Birman, K., Joseph, T. Reliable communication in an unreliable environment. Dept. of Computer Science, Cornell Univ., TR 85-694, Aug. 1985.
[Birrell]
Birrell, A., Nelson, B. Implementing remote procedure calls. ACM Transactions on Computer Systems 2, 1 (Feb. 1984), 39-59.
[Chang]
Chang, J., Maxemchuk, N. Reliable broadcast protocols. ACM TOCS 2, 3 (Aug. 1984), 251-273.
[Cheriton]
Cheriton, D. The V Kernel: A software base for distributed systems. IEEE Software 1, 2 (1984), 19-43.
[Cooper]
Cooper, E. Replicated procedure call. Proc. 3rd ACM Symposium on Principles of Distributed Computing, August 1984, 220-232.
[Cristian]
Cristian, F., et al. Atomic broadcast: From simple diffusion to Byzantine agreement. IBM Technical Report RJ 4540 (48668), Oct. 1984.
[Eswaren]
Eswaran, K.P., et al. The notion of consistency and predicate locks in a database system. Comm. ACM 19, 11 (Nov. 1976), 624-633.
[Hadzilacos] Hadzilacos, V. Byzantine agreement under restricted types of failures (not telling the truth is different from telling of lies). Tech. Rep. TR-19-83, Aiken Comp. Lab., Harvard University (June 1983).
[Halpern]
Halpern, J., and Moses, Y. Knowledge and common knowledge in a distributed environment. Tech. Report RJ 4421, IBM San Jose Research Laboratory, 1984.
[Joseph-a]
Joseph, T. Low cost management of replicated data. Ph.D. dissertation, Dept. of Computer Science, Cornell Univ., Ithaca (Dec. 1985).
[Joseph-b]
Joseph, T., Birman, K. Low cost management of replicated data in fault-tolerant distributed systems. ACM TOCS 4, 1 (Feb 1986), 54-70.
[Lamport]
Lamport, L. Time, clocks, and the ordering of events in a distributed system. CACM 21, 7, July 1978, 558-565.
[Lazowska] Lazowska, E., et al. The architecture of the EDEN system. Proc. 8th Symposium on Operating Systems Principles, Dec. 1981, 148-159.
[Liskov]
Liskov, B., Scheifler, R. Guardians and actions: Linguistic support for robust, distributed programs. ACM TOPLAS 5, 3 (July 1983), 381-404.
[Moss]
Moss, E. Nested transactions: An approach to reliable, distributed computing. Ph.D. thesis, MIT Dept of EECS, TR 260, April 1981.
[Papa]
Papadimitriou, C. The serializability of concurrent database updates. JACM 26, 4 (Oct. 1979), 631-653.
[Popek]
Popek, G. et al. Locus: A network transparent, high reliability distributed system. Proc. 8th Symposium on Operating Systems Principles, Dec. 1981, 169-177.
[Schlichting] Schlichting, R., Schneider, F. Fail-stop processors: An approach to designing fault-tolerant distributed computing systems. ACM TOCS 1, 3, August 1983, 222-238.
[Schneider] Schneider, F., Gries, D., Schlichting, R. Reliable broadcast protocols. Science of Computer Programming 3, 2 (March 1984).
[Skeen-a]
Skeen, D. Determining the last process to fail. ACM TOCS 3, 1, Feb. 1985, 15-30.
[Skeen-b]
Skeen, D. A reliable broadcast protocol. Unpublished.
[Spector]
Spector, A., et al. Distributed transactions for reliable systems. Proc. 10th ACM SIGOPS Symposium on Operating Systems Principles, Dec. 1985, 127-146.
[Strong]
Strong, H.R., Dolev, D. Byzantine agreement. Digest of papers, Spring Compcon 83, San Francisco, CA, March 1983, 77-81.
Algorithms and System Design in the Highly Available Systems Project
Sheldon J. Finkelstein*
IBM Almaden Research Center, 650 Harry Rd, San Jose, CA 95120
I'm going to describe the applications of ideas described by Ray Strong, Flaviu Cristian and Danny Dolev. Ray, Houtan Aghili, Ruth Kistler, and Al Griefer designed and implemented the Distributed Communication Facility I'll describe. Flaviu and Dale Skeen designed the Auditor. Others, including Danny, also contributed to the algorithms used in our prototype. Our systems environment consists of a loosely-connected collection of IBM 4300 processors which we call a cluster. These processors communicate via channel-to-channel and bisync links. Although we're also looking into protocols for broadcast media, the work that I'm describing is for point-to-point networks. Multiple processors may share a disk. We use a server/client model; clients within a cluster may request services from servers in the same cluster. Servers are sometimes called resource managers. File systems and databases are examples of resources which servers may manage. To provide availability, the cluster needs redundancy, but that redundancy should not be visible to users. Moreover, whenever possible, redundant system capabilities should not be wasted. For example, extra processors should be doing useful noncritical work when the system is behaving normally. If there are failures of processors, servers or links, they should be detected automatically, and the cluster should reconfigure itself or restart the failed components as soon as possible. Since servers may migrate (due to failures, load balancing or cluster growth), the communication system must provide location transparency, where servers are accessed by server names that do not depend on their physical locations. You've already heard about many of our distributed algorithms from others, and I'll describe how they are used in our prototype.
Although our protocols involve communication between "virtual machines" (which you can think of as process groups) running on each processor, I'll informally speak of the algorithms as if they ran between the processors in the cluster. *Currently at Tandem Computers, 10100 N. Tantau Rd., Cupertino, CA 95014
Network reconfiguration algorithms handle addition and deletion of links as well as start-up and failure of processors and links, and include automatic reconfiguration. Sessions (or virtual circuits) between clients and servers are not broken as long as we can find some path from client to server. We use atomic broadcast [CASD] both for name registry and for cluster membership. "Cluster membership" refers to the question of determining which processors are in the cluster, so that we can automatically handle processor join (becoming part of the cluster) and departure (leaving the cluster, e.g., due to failure). Clock synchronization is the basis for atomic broadcast. Handshake protocols are used to detect partitions and react to them. Before I describe the facilities in our prototype, let's review the definitions of atomic broadcast and synchronous replicated storage. Atomic broadcast is a protocol that sends messages to all reachable processors in the cluster and ensures three properties: atomicity, order and termination. The atomicity property says that for each message broadcast, either every processor in the cluster delivers it or none of the processors delivers it. (Delivery is different from receipt. When a recipient gets a broadcast message, it may choose to act on it at a specific time, or may choose not to act on it at all, perhaps because the broadcast arrived too late. Acting on the message constitutes delivery.) The order property says that all broadcasts are delivered in the same sequence at all processors in the cluster. The termination property says that any broadcast that succeeds is delivered to all processors within time Delta of the time the broadcast initiated, where Delta is a dynamic cluster parameter. These three properties (atomicity, order and termination) are guaranteed up to partition.
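As a toy illustration (not the [CASD] protocol itself), the three properties can be checked against a simulated cluster in which a single global log stands in for the real ordering machinery; the class and method names below are invented:

```python
# Toy model of atomic broadcast's guarantees: every processor delivers the
# same broadcasts in the same global sequence (order), and a broadcast that
# fails reaches no one (atomicity). Termination within Delta is not modeled.

class Cluster:
    def __init__(self, names):
        self.log = []                       # global total order of broadcasts
        self.procs = {n: [] for n in names} # per-processor delivery sequence

    def abcast(self, payload, succeeds=True):
        if not succeeds:
            return                          # atomicity: nobody delivers it
        self.log.append(payload)
        for deliveries in self.procs.values():
            deliveries.append(payload)      # order: same sequence everywhere

c = Cluster(["P1", "P2", "P3"])
c.abcast("join P4")
c.abcast("lost message", succeeds=False)    # fails: delivered nowhere
c.abcast("claim DB1")
assert c.procs["P1"] == c.procs["P2"] == c.procs["P3"] == ["join P4", "claim DB1"]
```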
Timing-dependent cluster parameters such as Delta, the time for atomic broadcast to complete, must sometimes be changed dynamically while the cluster is operational. Delta depends on link transmission time. We determine whether or not a link is up by sending "I am alive" messages across the link, and we also determine transmission time the same way. But messages may travel more slowly because of increased load. To deal with this, we automatically change link transmission time parameters, which in turn affects Delta. But all processors must change system parameters together, so these parameters must be changed using our agreement protocol, atomic broadcast. But atomic broadcast depends on the parameters being changed! If the load changes gradually, we can handle dynamic parameter change without causing a partition; if load changes rapidly, we may temporarily partition the cluster. Building on atomic broadcast, we implement an abstraction called synchronous replicated storage (SRS). This abstraction used to be called Delta Common Storage [CASD]. Synchronous replicated storage is identical across all processors in the cluster. Since the processors receive the same atomic broadcasts, and receive them in the same order, SRS remains identical across all processors. Thus implementing SRS allows us to treat loosely-coupled processors as if they had shared memory. However, writes to this storage take time Delta, so the abstraction is not cheap to implement. There are three major software facilities in our prototype. The function of the Distributed Communication Facility is to provide highly available communications. The Auditor provides highly-available services (e.g. those provided by a file server or a database server). Heartbeat exists to try to ensure that resources never have more
than one manager at any given time. DCF, the Distributed Communication Facility [GS], provides fault-tolerant distributed communications service. It manages transparent naming, allows processors and links to be added (or deleted) as part of the cluster while it is running, automatically deals with links and processors starting or failing, and automatically reconfigures the network. The Auditor is built on DCF and its design is based on synchronous replicated storage. Auditor monitors processors and local servers and updates configuration tables by writes to synchronous replicated storage. It detects failures, and coordinates takeovers by backups. Assuming that partition has not occurred, the Auditor representatives on all processors make the same decisions because they run the same code on the same tables in SRS. The Heartbeat facility is needed because our loosely-coupled processors may share disks. Auditor can't tell the difference between a processor failure (after which Auditor should make back-up servers take over for servers running on the failed processor) and a processor that has not failed but is unable to communicate (because all links are down or because load is unusually heavy). If a back-up takes over management of a resource (such as a database) on a shared disk while the old server is still operating, the integrity of the resource might be compromised. Heartbeat reduces that possibility by writing a notice to the disk every n seconds (with a changing timestamp in it; any changing value would do). By monitoring the heartbeat, a back-up can see that the server is still running even if all communication links are down. Let's go through some scenarios showing how our prototype works.
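The heartbeat mechanism can be sketched with an ordinary file standing in for the shared disk; the file name and helper functions here are invented for illustration:

```python
# Hedged sketch of the Heartbeat idea: the active server periodically writes
# a changing value to a known location on the shared disk; a back-up concludes
# the server is dead only when the value stops changing between two reads.

import os
import tempfile

def write_heartbeat(path, counter):
    with open(path, "w") as f:
        f.write(str(counter))       # any changing value would do

def read_heartbeat(path):
    with open(path) as f:
        return f.read()

path = os.path.join(tempfile.gettempdir(), "db1.heartbeat")  # stand-in for disk
write_heartbeat(path, 1)
first = read_heartbeat(path)
write_heartbeat(path, 2)            # server is still alive and beating
second = read_heartbeat(path)

server_alive = (first != second)    # changed value => back-up must NOT take over
assert server_alive
```

The point of this design, as the talk notes, is that the check works even when every communication link between the processors is down, because it goes through the shared disk.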
Before DCF existed, clients could connect to a server running on the same processor either by giving the physical name of the virtual machine running the server or by giving a logical name that was mapped in a specially implemented way to the virtual machine running the server. The SQL/DS database management system had used the latter approach. In a DCF cluster, however, there is a uniform mechanism by which servers identify themselves and claim names. Clients connect to servers using these names, which have logical, not physical, meanings. Thus if a server identifies itself as providing services corresponding to the name DB1, any client who wants to use those services can uniformly connect to DB1, whether it is local (on the same processor as the server) or non-local. There is one representative of DCF running on each processor; part of this Facility is in the operating system, and part is in a virtual machine. Communication is less expensive when a client speaks to a local server, since the operating system handles the connection without involving the DCF virtual machine. For non-local sessions, the DCF virtual machines act as intermediaries, communicating with the client and server as if the connection were local. The DCF virtual machines cooperate together, using atomic broadcast to exchange information about what processors are in the cluster, which names have been claimed where, etc. Thus DCF both implements and uses synchronous replicated storage. Now suppose that there is a session between server with the name DB1 and a client, where DB1 names a database manager on Processor 1 and the client is on Processor
2. If the link between Processor 1 and Processor 2 fails, DCF detects the failure and uses some other path (e.g., from Processor 1 to Processor 3 to Processor 2) without interrupting the session. If Processor 4 comes up, it automatically joins the cluster, and clients on that processor have access to all servers in the cluster. Moreover, all clients in the cluster can access the servers on Processor 4. If a processor fails, we automatically detect the failure and react to it. If a link is added to the cluster, the processors connected by that link will be in the same cluster if the link is operational. Links can also be deleted from the cluster. Q: Wouldn't it be simpler to have all communication between clients and servers handled through the DCF virtual machine, whether or not the server was on the same processor as the client? A: Yes, it would be simpler, but it would also be less efficient. Note that this is a question about implementation, not the external protocol (or application program interface) of DCF. The DCF virtual machines function as intermediaries and collaborate with other DCF virtual machines so that clients and servers can behave as if the cluster were a single processor. A client and server follow the same protocol whether the client is local to the server or non-local. If we switched to the implementation you suggest, the performance for local communication would degrade, since there would be an unnecessary intermediary (the DCF virtual machine). In our implementation, the component of DCF in the operating system distinguishes between local and non-local messages. For local communication, the operating system portion of DCF resolves the server name when a connection is established, so that communication is established directly between client and server.
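A minimal sketch of this naming scheme, with invented names and data structures (the real registry lives in SRS and spans processors), showing both the one-claimant rule and the local versus non-local connection paths:

```python
# Illustrative sketch of DCF-style location-transparent naming.
# Servers claim logical names; clients connect by name, and only
# non-local sessions go through the DCF virtual machine intermediaries.

registry = {}        # logical name -> processor (held in SRS in the real DCF)

def claim(name, processor):
    if name in registry:
        raise ValueError("name already claimed")   # one claimant at a time
    registry[name] = processor

def connect(client_processor, name):
    server_processor = registry[name]              # resolve the logical name
    if server_processor == client_processor:
        return "direct"                # OS connects client and server locally
    return "via DCF virtual machines"  # forwarded, possibly over multiple hops

claim("DB1", "P1")
assert connect("P1", "DB1") == "direct"
assert connect("P2", "DB1") == "via DCF virtual machines"
```

The client-side call is identical in both cases; only the implementation path differs, which is the point of the answer above.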
Messages from a client to a non-local server go to the DCF virtual machine, which forwards them to the DCF virtual machine on the server's processor (which may involve multiple hops in the point-to-point network). That DCF virtual machine sends messages to the server as if they came directly from the client. Messages from the server back to the client are handled analogously. Q: Is the assumption that this is a local network, or could it be a long-haul network? A: We originally designed the system with a "glasshouse", a machine room, as the design point. However, DCF would work over either a local network or a long-haul network, although cluster parameters (e.g., Delta, the time for atomic broadcast to complete) would differ. Q: Some of your algorithms seem to require N squared messages, where N is the number of processors. Are these algorithms used so infrequently that performance is not an issue? How often do you expect them to be used? A: Atomic broadcast can take up to 2*E messages, where E is the number of links (undirected edges) in the network. Since E can be as large as N squared, it is true that the protocol can be quadratic in N, but it's more useful to think in terms of E. The network connecting the cluster may be sparse, particularly when the nodes are geographically dispersed. We prototyped DCF for a maximum of eight processors. We're interested in extending DCF to handle a much larger number of nodes, but
eight is the design point. We don't expect atomic broadcast to be used frequently. We use it for name registry and for membership operations including processor join (when a new processor comes up) and processor departure (when a processor fails or is taken off-line). It's also used by the Auditor to handle operator commands and "adverse environment" commands which are registered in synchronous replicated storage; I'll say more about this later. Atomic broadcast is NOT used for ordinary communication sessions between clients and servers. In steady state for the cluster, broadcasts might be expected to occur every ten or fifteen minutes, perhaps less often than that. They would be more frequent when the cluster begins operation or the cluster encounters (or recovers from) certain failures. Let's continue describing scenarios for our prototype. Like DCF, Auditor has one representative on each processor in the cluster. Auditor handles server failures (or processor failures, which imply failure of all servers on those processors). If a server fails, there may be a local back-up (which might be preferred over a non-local backup since load remains balanced); sometimes it is possible to recover the failed server itself. Non-local back-ups are also necessary for several reasons, including processor failures, overload when a processor handles both server and backup, and dynamic load balancing. Now let's suppose that Processor 1 crashed, where DB1 was (the name of) a database manager running on that processor. A non-local backup would have to take over. That back-up might be cold (meaning that it was not monitoring information from the active server). The back-up might also be warm (meaning that it monitored information from the active server but did not keep its state current), or hot (meaning that it was actively monitoring and maintaining state).
Decisions about temperatures for back-ups should be a matter of installation policy, requiring a trade-off between work done when nothing goes wrong and work after failures. For simplicity, assume that the database managed by DB1 resides on a single disk. That disk may be shared; that is, other processors may be able to access that disk. A cold back-up for DB1 must be able to access that disk; a warm or hot back-up may not need to do so (if it has been maintaining state). Let's assume that a cold back-up takes over for DB1. The database is automatically restored to a transaction-consistent state when the back-up takes over, just as it would be if the original database manager failed and were re-started. In-flight transactions (that is, transactions that had not committed at the time of the failure) are rolled back, and applications using the database are informed of this. New transactions can be addressed to the database by these or other applications. DCF has undergone a series of iterations as a prototype, becoming more realistic. It began with a simple model which assumed that there was a fully synchronized clock across the cluster, and it originally didn't handle communication failures, only processor failures. The first prototype also assumed that processors ran phases of protocols in lockstep. As the prototype became more advanced, protocols became clearer, more efficient and more realistic, and they also handled a more general class
of failures. The current prototype uses a link-based protocol (atomic broadcast) rather than a route-based protocol. It tolerates clock failures, communication failures (up to partition) and handles asynchronous (rather than just lockstep) execution of protocols. Synchronous replicated storage is maintained via atomic broadcast, and it is used for three purposes in our prototype. First, we use it for name registry. All identifications (claiming names) and revocations (dropping names) are handled using atomic broadcast to update SRS, so that each name has only one claimant at a time. In my judgement, it is not clear that a name registry should be managed this way. Resources still have to be protected using Heartbeat (because partitions can occur), so uniqueness of names should not be the responsibility of the name registry. (Uniqueness is not even always desirable; sometimes several servers may provide equivalent services.) Second, SRS is used for cluster membership protocols, so that all processors agree on the members of the cluster and on cluster parameters. This is a more realistic use of SRS, but I'm not sure that it would be a worthwhile primitive for a distributed system if that were its only use. The third use of SRS in our prototype is the most significant. Auditor uses synchronous replicated storage to maintain configuration tables describing the set of services provided in the cluster and the locations of the servers providing them. Thus at any given logical clock time, the Auditor representatives on all processors in the cluster agree on where all the active and back-up servers are supposed to be. When there is either an operator command to the Auditor or a command issued by the "adverse environment" (i.e., something has crashed), that command should be handled by the Auditor representative on some processor. The command is registered as an event using atomic broadcast.
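The coordinator-free decision-making that this enables can be sketched as a deterministic function over the replicated configuration tables; the table layout and names below are invented for illustration:

```python
# Sketch of the Auditor's coordinator-free decision rule: every Auditor
# representative runs the same deterministic code over the same SRS tables,
# so each processor independently computes the same takeover decision.

config = {                          # configuration table held in SRS
    "DB1": {"active": "P1", "backups": ["P3", "P2"]},
}

def decide_takeover(table, failed_processor):
    # Deterministic rule: first listed back-up not on the failed processor wins.
    actions = {}
    for service, entry in table.items():
        if entry["active"] == failed_processor:
            candidates = [b for b in entry["backups"] if b != failed_processor]
            actions[service] = candidates[0]
    return actions

# Every processor evaluates the same function on the same replicated table ...
decisions = [decide_takeover(config, "P1") for _ in ("P1", "P2", "P3")]
# ... and all reach the same conclusion, with no coordinator needed.
assert decisions[0] == decisions[1] == decisions[2] == {"DB1": "P3"}
```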
As a result, the configuration tables which the Auditor manages as synchronous replicated storage are updated. Although there is no decision coordinator, each processor can make an independent decision about what actions every processor (including itself) should do, and the same decision will be made by all the other processors. Hence each processor can take appropriate actions (e.g., telling a back-up server that it should become active) without conflict with other processors' decisions. A previous version of Auditor was designed using centralized control, with one of the Auditor representatives designated as coordinator. The centralized approach is a less robust solution, since it must specially handle coordinator failure, with another representative becoming coordinator. Not only is our current decentralized approach more uniform, it also allows decisions to be made more quickly (in general) than the centralized approach. Q: Synchronized replicated storage guarantees that all correct processors will be in agreement on the contents of tables in SRS. But a processor might be incorrect and issue commands based on erroneous SRS. What happens then? A: No user code runs in Auditor, just our own code. That doesn't guarantee that there won't be mistakes, of course, but it does contain the problems that can arise. Auditors can become incorrect because of partitions. We have "handshake" protocols in which the Auditor representatives on neighboring processors exchange descriptors to decide if they've lost broadcasts. These protocols are built on top of SRS. (DCF runs its own handshake protocol.) Inconsistencies are automatically detected; the Auditors react to them either by healing SRS (if possible) or by causing a rejoin of an inconsistent processor. Dealing with partitions turns out to be a more subtle problem than we had expected. When we chose the formulas for cluster parameters such as "expected link delays" and "the time for atomic broadcast to complete", we encountered a difficult trade-off. When we made those parameters larger, partition was less likely, but it took more time for "normal" events (e.g., writes to SRS, which are done using atomic broadcast) to complete. When we made those parameters smaller, partition was more likely, so we had to do more rejoins. Intuitively, it seemed right to set the parameters conservatively, making them very high so that partitions almost never occurred. But that made normal events so expensive that it turned out to be better to set the parameters at intermediate values. That is, instead of trying to prevent unlikely partitions, we designed the prototype to tolerate them, by detecting them and reacting to them quickly. Q: What do you mean by a partition? A: Permanent partitions occur when one or more processors are unable to communicate with the rest of the cluster for physical reasons (such as communication links being down). Transient partitions occur when communication is temporarily slowed so that messages cannot be received in a timely manner (e.g., because links are overloaded). We define "partition" functionally. Partition occurs when one or more broadcasts arrive at some processors but not at others.
Thus, if there is a short-lived condition in which processors can't communicate but our broadcasts eventually get through in a sufficiently timely fashion, that does not constitute a partition for us. On the other hand, even if all links are operational but some processor fails to handle a broadcast for local reasons (e.g., because high load delays delivery), there can be a partition. Q: What do you do when you detect a partition? In particular, what happens if different pieces of synchronous replicated storage diverged while a partition existed? How do you reconcile SRS after this? A: We may have to do a rejoin. Rejoin resembles the join of a new processor; however, unlike a new processor, a rejoined processor may have non-empty SRS, which must be merged with the SRS of processors in the existing cluster. The procedure for merging SRS depends on the semantics associated with data in SRS. Let's consider rejoin for DCF. (Rejoin also exists for Auditor.) The DCF name registry is maintained in SRS, identifying the names that have been claimed and the processors that have claimed them. If there are no conflicts between the name registry of the rejoiner and that of the other processors, their registries can easily be merged. If there is a name conflict, someone has to win and someone has to lose, since DCF maintains uniqueness of names. The loser (who can no longer own the name) will typically be a server on the rejoining processor. DCF notifies the loser that it no longer owns that
name, and the loser has the responsibility to react appropriately to that event. Q: I've asked the theory people to provide measures for their algorithms so that we systems people can analyze them. You said that you believe that synchronous replicated storage is the right technique for use in your Auditor. Have you made comparisons against competing algorithms using, say, transactions for recovery, to show that your algorithms using SRS have fewer messages, better response time, or greater simplicity in programming? If not, how would you persuade skeptics that synchronous replicated storage is good? A: No, we haven't done the comparisons you suggest, but transaction protocols such as two- and three-phase commit solve different problems than atomic broadcast, so comparisons may not make sense. These protocols handle problems which appear similar, so it's easy to confuse them. The purpose of atomic broadcast is to ensure that all correct processors (or more precisely, particular representatives on those processors) deliver the same information in the same order in a timely manner. The purpose of commit protocols is to ensure that for each transaction, each database associated with that transaction permanently registers any aspects of the transaction for which it has responsibility. Atomic broadcast has no obligation to processors which aren't in the cluster, even if they previously were in the cluster; commit protocols have enduring obligations to the databases associated with specific transactions (although the managers managing those databases may change). A transaction involving multiple databases can take place on a single processor (if both database managers are running on that processor); atomic broadcasts always go to every correct processor in the cluster. For example, suppose that processor P is providing service S, and processor P crashes. There is no coordinator designated to notify other processors of this event. (Commit protocols have coordinators.)
One or more neighbors of P will notice the failure and initiate the separation protocol. The Auditor's synchronous replicated storage will be updated to reflect P's absence, so that some server elsewhere will perform appropriate recovery actions and provide service S. Synchronous replicated storage maps exceptionally difficult decision-making problems for loosely-coupled distributed systems into (still difficult, but far more manageable) problems of managing concurrent access to a shared memory (with fast read access and slow write access). The price for this abstraction is the time Delta to perform updates. For distributed systems such as Auditor, where that cost is acceptable (because updates are infrequent and agreement among processors in the cluster is essential), synchronous replicated storage is an exciting programming paradigm for distributed systems, and it deserves further attention. Note: Since the Asilomar conference, the Distributed Communication Facility prototype was the basis for the Transparent Services Access Facility (TSAF), which is part of IBM's VM/SP operating system. TSAF is used for remote access to SQL/DS database systems, PROFS calendar managers and other servers. Flaviu Cristian is technical leader in the design of a new U.S. Air Traffic Control System; this design was influenced by the designs of both TSAF and Auditor.
Bibliography
[CASD] F. Cristian, H. Aghili, R. Strong and D. Dolev, "Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement," Proc. 15th International Conference on Fault-Tolerant Computing Systems, Ann Arbor, Michigan, 1985. Revised as IBM Research Report RJ5244, July 30, 1986.
[GS] A. Griefer and R. Strong, "DCF: Distributed Communication with Fault Tolerance," Proceedings of the 7th ACM Symp. on PODC, 42-51, Toronto, August, 1988. Also IBM Research Report RJ6361, August 3, 1988.
Easy Impossibility Proofs for Distributed Consensus Problems¹ Michael J. Fischer Department of Computer Science Yale University P.O. Box 2158, New Haven, Connecticut 06520 USA Nancy A. Lynch Laboratory for Computer Science Massachusetts Institute of Technology 545 Technology Square Cambridge, Massachusetts 02139, USA Michael Merritt AT&T Bell Laboratories 600 Mountain Avenue Murray Hill, New Jersey 07974 USA
Abstract Easy proofs are given of the impossibility of solving several consensus problems (Byzantine agreement, weak agreement, Byzantine firing squad, approximate agreement and clock synchronization) in certain communication graphs. It is shown that, in the presence of f faults, no solution to these problems exists for communication graphs with fewer than 3f+1 nodes or less than 2f+1 connectivity. While some of these results had been proven previously, the new proofs are much simpler, provide considerably more insight, apply to more general models of computation, and (particularly in the case of clock synchronization) significantly strengthen the results.
1. Introduction In this paper we present easy proofs for the impossibility of solving several consensus problems in particular communication graphs. We prove results for Byzantine agreement, weak agreement, the Byzantine firing squad problem, approximate agreement and clock synchronization. The bounds are all the same: tolerating f faults requires at least 3f+1 nodes and at least 2f+1 connectivity in the communication graph. (The connectivity of a graph is the minimum number of nodes whose removal disconnects the graph. Also, we assume throughout that graphs have at least three nodes.) For a given value of f, we call graphs with fewer than 3f+1 nodes or less than 2f+1 connectivity inadequate graphs. Each of our proofs is an argument by contradiction. We assume that a given problem can be
¹Earlier versions of this paper appeared in the ACM Conference Proceedings of PODC 1985, and in Distributed Computing, volume 1, number 1; reprinted by permission.
solved in a system with an inadequate communication graph. We then construct a set of system behaviors which cannot satisfy the correctness conditions for the given problem, even though they are required to do so. Versions of many of the results, with proofs of this same general form, were already known. Our proofs differ from the earlier proofs in the technique we use to construct the set of behaviors. Our technique is simpler and applies to more general models of distributed computation.
For Byzantine agreement, both bounds were already known [PSL,D]. The 3f+1 node lower bound in [PSL] was proved only for a particular synchronous model of computation. Although carefully done, the proof is somewhat complicated and not as intuitive as one might like. In contrast, our proof is simple, transparent, and applies to general models of computation. A proof of the 2f+1 connectivity lower bound was presented informally in [D]; we prove that bound more formally and for more general models.
For weak Byzantine agreement, the requirement of 3f+1 nodes was known [L], but it was proved using a complicated construction. The new proof is easy and extends to more general models (although not as general as those for Byzantine agreement and approximate agreement). The 2f+1 connectivity requirement was previously unknown.
The result for the Byzantine firing squad problem follows from a reduction to weak agreement in [CDDS]. We provide a direct proof.
For approximate agreement, the 3f+1 bound was noted, but not proved, in [DLPSW], while the 2f+1 connectivity requirement was previously unknown.
For clock synchronization, the 3f+1 node bound was proved in [DHS] using a complicated proof. The authors of [DHS] claimed that they knew how to prove the corresponding 2f+1 connectivity lower bound, but we presume that such a proof also would be complicated.
We prove both the 3f+1 node and the 2f+1 connectivity bounds, for a much more general notion of clock synchronization than in [DHS]. These synchronization bounds assume that there is no direct method, other than by reading their inaccurate hardware clocks, for nodes to measure the passage of time. Since we obtain the same lower bounds for each problem, one might think that the problems are equivalent in some sense. This is not the case. The bounds for the different problems require different assumptions about the underlying model. For example, the lower bounds for Byzantine and approximate agreement work with virtually any reasonable computational model, while the lower bound for weak agreement requires a special assumption, placing a bound on the rate of propagation of information through the system. The bound for clock synchronization requires a different assumption, about how devices can measure time. Many of the results are sensitive to small perturbations of the underlying assumptions, about such factors as communication delay or the behaviors of faulty nodes.
2. A Model of Distributed Systems To make the impossibility results clear, concise, and general, we introduce a simple model of distributed systems.
A communication graph is a directed graph G with node set nodes(G) and edge set edges(G), such that the directed edges occur in pairs; edge (u,v) ∈ edges(G) if and only if (v,u) ∈ edges(G). (We consider a pair of directed edges rather than a single undirected edge to model the communication in each direction separately.) We call the edge (u,v) an outedge of u, and an inedge of v. Let U be a subset of nodes(G). Then the subgraph G_U induced by U is the graph containing all the nodes in U and all the edges between nodes in U. The inedge border of G_U is the set of edges from nodes outside U into U; that is, edges(G) ∩ ((nodes(G)\U) × U). A system G is a communication graph G with an assignment of a device and an input to each node of G. Devices are undefined primitive objects. The specific inputs we consider are either encodings of Booleans or real numbers or real-valued functions of time (e.g. local clocks). The particular type of input depends on the agreement problem addressed. If a node is assigned device A in system G, we say that the node runs A. A subsystem U of G is any subgraph G_U of G with the associated devices and inputs. Every system G has a system behavior, E, which is a tuple containing a behavior of every node and edge in G. (We also describe E as a behavior of the communication graph G. Note that a system has exactly one behavior, while a graph may have several, depending on the devices and inputs assigned to the nodes.) The restriction of a system behavior E to the behaviors of the nodes and edges of a subgraph G_U of G is the scenario E_U of G_U in E. For now, we take node and edge behaviors as primitives. In more concrete and familiar models, a node or edge behavior might be a finite or infinite sequence of states, or a mapping from the positive reals to some state set, denoting state as a function of time.
(We use the latter interpretation for later results.) Less familiar models might interpret behaviors as mappings from reals to states, or from transfinite ordinals to states. To obtain our first results, the precise interpretation of node and edge behaviors is unimportant. We need restrict our model only so that the following two axioms hold. (We assume these two axioms throughout the paper. Some of the later results require additional assumptions.)
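These graph-theoretic definitions are mechanical enough to state in code. A minimal Python sketch (ours, not the paper's; all function names are invented for illustration):

```python
# Sketch (not from the paper): the communication-graph definitions in Python.
# Nodes are strings; edges are stored as directed pairs, always in both directions.

def make_graph(undirected_edges):
    """Build a communication graph: each undirected edge becomes a pair of directed edges."""
    nodes, edges = set(), set()
    for u, v in undirected_edges:
        nodes |= {u, v}
        edges |= {(u, v), (v, u)}
    return nodes, edges

def induced_subgraph(nodes, edges, U):
    """Subgraph G_U: the nodes of U and all edges between nodes in U."""
    return set(U), {(u, v) for (u, v) in edges if u in U and v in U}

def inedge_border(nodes, edges, U):
    """Edges from nodes outside U into U: edges(G) ∩ ((nodes(G)\\U) × U)."""
    return {(u, v) for (u, v) in edges if u not in U and v in U}

# The triangle graph G of Section 3:
nodes, edges = make_graph([("a", "b"), ("b", "c"), ("c", "a")])
print(inedge_border(nodes, edges, {"b", "c"}))  # the edges from a into {b, c}
```

The inedge border is exactly the information a subsystem receives from the rest of the system, which is what the Locality axiom below constrains.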
Locality Axiom
Let G and G' be systems with behaviors E and E', respectively, and isomorphic subsystems U and U' (with vertex sets U and U'). If the corresponding behaviors of the inedge borders of U and U' in E and E' are identical, then the scenarios E_U and E_U' are identical.
The Locality axiom says that communication takes place only over the edges of the communication graph. In particular, it expresses the following property: The only parameters affecting the behavior of any local portion of a system are the devices and inputs at each local node, together with any information incoming over edges from the remainder of the system. If
these parameters are the same in two behaviors, the local behaviors (scenarios) are the same.² Clearly, if some such locality property did not hold, then agreement would be trivially achievable by having devices read other devices' inputs directly.
Fault Axiom
Let A be any device. Let E_1, ..., E_d be d edge behaviors, such that each E_i is the behavior of the i-th outedge of a node running A in some system behavior. Let u be any node with d outedges (u,v_1), ..., (u,v_d). There is a device F such that in any system in which u runs F, the behavior of each outedge (u,v_i) is E_i.
In this case, we write F_A(E_1, ..., E_d) for F. This axiom expresses a powerful masquerading capability of failed devices. Any behavior exhibited by a device over different edges in different system behaviors can be exhibited by a failed device in a single system behavior. When this axiom is significantly weakened (say, by adding an unforgeable signature assumption), then consensus is possible [LSP,PSL]. For establishing the relevance of our impossibility results to more concrete models of distributed systems, it is sufficient to interpret our definitions in the particular model and then to demonstrate that the Locality and Fault axioms hold under the particular interpretation. Our proofs utilize the graph-theoretic notion of a covering. For any graph G, let neighbors = {(u,V) | u is a node of G and V is the set of all nodes v such that there is an edge from v to u in G}. A graph S covers G if there is a mapping φ from the nodes of S to the nodes of G that preserves neighbors. That is, if node u of S has d neighbors v_1, ..., v_d, and φ(u) = w for a node w of G, then w has d neighbors x_1, ..., x_d and φ(v_i) = x_i for 1 ≤ i ≤ d. Under such a mapping, S looks locally like G. Graph coverings play an important role in our understanding of the interaction of network topology and distributed computation. A discussion appears in [A], and indeed, some of our proofs are surprisingly similar to Angluin's. Similar techniques also appear in [IR], [B] and elsewhere.
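The covering condition can be checked mechanically. A small Python sketch (ours, not the paper's) that verifies the hexagon-over-triangle covering used in the next section, under the mapping φ(u)=φ(x)=a, φ(v)=φ(y)=b, φ(w)=φ(z)=c:

```python
# Sketch (not from the paper): checking that a map phi between graphs preserves
# neighbors, i.e. that S covers G. Graphs are adjacency dicts; "hexagon" and
# "triangle" are the six-node covering graph S and the triangle G of Section 3.1.

def is_covering(S, G, phi):
    """True if phi maps each node's neighbor multiset in S onto
    the neighbor multiset of its image in G."""
    for u, nbrs in S.items():
        if sorted(phi[v] for v in nbrs) != sorted(G[phi[u]]):
            return False
    return True

triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
hexagon = {"u": ["v", "z"], "v": ["u", "w"], "w": ["v", "x"],
           "x": ["w", "y"], "y": ["x", "z"], "z": ["y", "u"]}
phi = {"u": "a", "v": "b", "w": "c", "x": "a", "y": "b", "z": "c"}
print(is_covering(hexagon, triangle, phi))  # True: S looks locally like G
```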
3. Byzantine Agreement We say that Byzantine agreement is possible in a graph G (with n nodes) if there exist n devices A_1, ..., A_n (which we call agreement devices), with the following properties. Each agreement device A_u takes a Boolean input and chooses 1 or 0 as a result. (To model choosing a result, assume there is a function CHOOSE from behaviors of nodes running agreement devices to the set {0,1}.) A node u of G is correct in a behavior E of G if node u runs A_u in E. Any system behavior E of G in which at least n−f nodes are correct is a correct
²For weak agreement and the firing squad problem, we need to extend this locality property to include time.
system behavior. Correct system behaviors must satisfy the following conditions.
Agreement. Every correct node chooses the same value.
Validity. If all the correct nodes have the same input, that input must be the value chosen.
Theorem 1: Byzantine agreement is not possible in inadequate graphs.
3.1. Number of Nodes We begin with the lower bound of 3f+1 for the number of nodes required for Byzantine agreement. First consider the case where |G| = n = 3 and f = 1. Assume that the problem can be solved for the communication graph G consisting of three nodes fully connected by communication edges. Let the three nodes of G be a, b and c, and assume that they run agreement devices A, B and C, respectively. We represent each pair of directed edges by a single undirected edge, and label the nodes with the devices they run.
[Figure: the triangle graph G, its three nodes labeled with the devices A, B and C.]
The covering graph S is as follows.
[Figure: the six-node ring S, with nodes u, v, w, x, y, z running devices A, B, C, A, B, C, respectively.]
This graph looks locally like G under the mapping φ defined by φ(u) = φ(x) = a, φ(v) = φ(y) = b and φ(w) = φ(z) = c.
Now specify the system by assigning devices and inputs for the nodes in S as follows.
[Figure: the ring S with devices and inputs: nodes u, v, w (running A, B, C) have input 0; nodes x, y, z (running A, B, C) have input 1.]
By this we mean that node u runs device A with input 0, node v runs B with input 0, and so on. Let S denote the resulting behavior of the system; S includes a behavior for each of the six nodes and twelve directed edges in S. Now consider scenarios Svw, Swx and Sxy in S, where each consists of the behaviors of the two indicated nodes in S, along with the activity over the two connecting edges. We argue that each of these scenarios is identical to a scenario in a correct behavior of G. The first scenario Svw is shown below.
[Figure: scenario Svw (the behaviors of nodes v and w in S), shown alongside behavior E1 of G, in which b and c run B and C with input 0 and node a runs the faulty device F.]
This scenario is the behavior in S of nodes v and w, together with that of the communication edges between v and w. Now consider the behavior E1 of G in which node b runs B on input 0, node c runs C on input 0, and node a runs a device that mimics node u in talking to b, and mimics node x in talking to c. Formally, if E(u,v) and E(x,w) are the indicated edge behaviors in S, node a runs device F_A(E(u,v), E(x,w)) (we have written just F in the figure). This device exists by the Fault axiom, and in the resulting behavior, the edges from node a to node b and to node c have behaviors E(u,v) and E(x,w), respectively. By the Locality axiom, the scenario containing b's and c's behaviors in E1 is identical to Svw. The validity requirement ensures that nodes b and c must choose 0 in E1. Since their behavior is identical in S, v and w choose 0 in S.
Next, consider scenario Swx.
[Figure: scenario Swx, shown alongside behavior E2 of G, in which a and c run A and C with inputs 1 and 0 and node b is faulty.]
This scenario includes the behavior of nodes w and x in S. It is also the behavior of nodes a and c in a behavior E2 of G which results when they run their devices A and C on inputs 1 and 0, respectively, and node b is faulty, exhibiting to node c the same behavior that v exhibits to w in S, and to node a the same behavior that y exhibits to x in S. The behavior of node c in E2 is identical to that of node w in S, so node c chooses 0 in E2, by the argument above. By agreement, node a decides 0 in E2. Thus node x decides 0 in S. Now consider the third scenario, Sxy.
[Figure: scenario Sxy, shown alongside behavior E3 of G, in which a and b run A and B with input 1 and node c is faulty.]
This scenario is the behavior of nodes x and y in S. It is also the behavior of nodes a and b in a correct behavior E3 of G which results when they both run their devices on input 1, and node c is faulty, exhibiting the same behavior to node a that w exhibits to x in S, and the same behavior to node b that z exhibits to y in S. The validity requirement ensures that nodes a and b must choose 1. Thus nodes x and y choose 1. But we have already established that node x must choose 0, a contradiction. Now consider the general case of |G| = n ≤ 3f. Partition the nodes of G into three sets, a, b and c, so that a, b and c each have at least 1 and at most f nodes. This means that any two sets
together contain at least n−f nodes. The nodes in each set are running agreement devices, and we denote by A the set of devices running at the nodes in a, and similarly for B and C. Now construct the covering graph S in the obvious way. Briefly, take two copies of G, and label the sets a, b and c by u, v and w, respectively, in one copy, and x, y and z in the other. Now replace the edges between nodes in u and w and between nodes in x and z by corresponding edges between u and z and between x and w. Assign devices to nodes of S according to their corresponding node in G. We represent the covering graph S and assigned devices exactly as above, so that the edges depicted between two sets of nodes in S, say sets u and v, are now a shorthand representation for all the edges in S between nodes in set u and nodes in set v. The inputs depicted for the sets of devices A, B and C are assigned to all the devices in the respective sets. The arguments proceed exactly as in the preceding pictures. We consider only one in detail.³
[Figure: the general-case version of scenario Svw, with the sets b and c running their devices with input 0 and the set a faulty.]
This scenario is now the behavior of the sets of nodes v and w in the behavior S. It is the same as the behavior of the sets b and c in a behavior E1 of G in which all nodes in both sets run their devices with input 0, and the nodes in set a exhibit the same behavior to members of b that the corresponding nodes in set u exhibit to the members of v in S, and the same behavior to nodes in c that the corresponding nodes in set x exhibit to the members of w in S. Since sets b and c together contain at least n−f correct nodes, E1 is a correct behavior of G. Thus, all the nodes in b and c must decide 0, by the validity condition, and c contains at least one node, by construction.
³An alternative proof of the general bound may be obtained by a direct reduction to the case f = 1. Given a system S and a partitioning of its communication graph G into subgraphs, there is a natural construction of a new system S', obtained by collapsing the subgraphs into single nodes. The devices in S' are the (indexed) sets of devices running in each subgraph of G, the node behaviors of S' are the subsystem behaviors of the corresponding subgraphs in G, and the edge behaviors in S' are the corresponding sets of edge behaviors in S. Then the devices and behaviors in S' satisfy the Locality and Fault axioms if the underlying devices and behaviors in S do. Now, if Byzantine agreement were possible in a graph G with |G| ≤ 3f for some value of f ≥ 1, this construction would imply that Byzantine agreement is possible for f = 1 in a subgraph of the triangle graph, contradicting Theorem 1. This is essentially the proof strategy for the general bound given in [PSL].
3.2. Connectivity Now we carry out the 2f+1 connectivity lower bound proof. Let c(G) denote the connectivity of G. We assume we can achieve Byzantine agreement in a graph G with c(G) ≤ 2f, and derive a contradiction. For now, we consider the case f = 1 and the communication graph G of four nodes a, b, c and d, running devices A, B, C and D, as indicated below.
[Figure: the four-node cycle G: nodes a, b, c, d running devices A, B, C, D, with edges a–b, b–c, c–d, d–a.]
The connectivity of G is two; the two nodes b and d disconnect G into two pieces, the nodes a and c. We consider the following system, with the eight-node graph S and devices and inputs as indicated.
[Figure: the eight-node ring S, two copies of G spliced together; the devices read A, B, C, D, A, B, C, D around the ring, with the nodes running A, B and C in one copy given input 0 and those in the other copy input 1.]
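The claim that G has connectivity two can be checked by brute force over all candidate node cuts. A Python sketch (ours, not the paper's):

```python
# Sketch (not from the paper): brute-force check that the four-node cycle G of
# this section has connectivity two, and that {b, d} separates a from c.
from itertools import combinations

G = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}

def connected(adj):
    """Depth-first search: is the graph connected (an empty graph counts as connected)?"""
    if not adj:
        return True
    seen, stack = set(), [next(iter(adj))]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj[u])
    return seen == set(adj)

def remove_nodes(adj, cut):
    return {u: {v for v in nbrs if v not in cut}
            for u, nbrs in adj.items() if u not in cut}

def connectivity(adj):
    """Smallest number of nodes whose removal disconnects the graph."""
    for size in range(1, len(adj)):
        for cut in combinations(adj, size):
            rest = remove_nodes(adj, set(cut))
            if len(rest) > 1 and not connected(rest):
                return size
    return len(adj) - 1

print(connectivity(G))                         # 2
print(connected(remove_nodes(G, {"b", "d"})))  # False: {b, d} separates a from c
```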
The resulting behavior of the system is S. We consider three scenarios in S: S1, S2 and S3.
The first scenario, S1, is shown below.
[Figure: scenario S1, shown alongside behavior E1 of G.]
This is also a scenario in a correct behavior E1 of G. In E1, nodes a, b and c are correct. Node d is faulty, exhibiting the same behavior to node a as one of the nodes running D in the covering graph, and the same behavior to b and c as the other node running D exhibits in the covering graph. Then nodes a, b and c must choose 0 in E1, and so must the nodes running A, B and C in S1. Now consider the second scenario, S2.
[Figure: scenario S2, shown alongside behavior E2 of G.]
This scenario in S is also a scenario in a correct behavior E2 of G in which nodes c, d and a are correct. This time, node b is faulty, exhibiting the same behavior to nodes c and d as one node running B in the covering graph, and the same behavior to node a as the other node running B. So nodes a, c and d must agree in E2, and so do the corresponding nodes in S2. Since the node running C chooses 0 by the argument above, the nodes running D and A in S2 choose 0, too.
Finally, consider the last scenario, S3.
[Figure: scenario S3, shown alongside behavior E3 of G.]
This scenario is again the same as a scenario in a behavior E3 of G in which nodes a, b and c are correct, but have input 1. Node d is faulty, exhibiting the same behavior to node a that one node running D in the covering graph exhibits, and the same behavior to nodes b and c as the other D in the covering exhibits. Then nodes a, b and c choose 1 in E3, and so must the nodes running A, B and C in S3, contradicting the argument above that the node running A chooses 0.
The general case for arbitrary c(G) ≤ 2f is an easy generalization of the case for f = 1. The same pictures are used. Just choose b and d to be sets consisting of at most f nodes each, such that removing the nodes in b and d from G disconnects two nodes u and v of G. Let G' be the graph obtained by removing b and d from G, let the set a contain those nodes connected to u, and let the set c contain the remaining nodes of G' (c contains at least one node, v). Construct S as before, by taking two copies of G and rearranging edges between the 'a' sets and their neighbors. The nodes and edges in our figures are now a shorthand for the actual nodes and edges of G and S. This completes the proof of Theorem 1.
The succeeding impossibility results for other consensus problems follow the same general form as the two arguments above. We assume a problem can be solved by specific devices in an inadequate graph G, install the devices in a graph S that covers G, and provide appropriate inputs. Using the Locality and Fault axioms, we argue the existence of a sequence of correct behaviors of G that have node and edge behaviors identical to some of those in the behavior of S. (This sequence was (E1, E2, E3) in the arguments above.) By the agreement condition, correct nodes in each of the behaviors of G have to agree. Because each successive pair of system behaviors has a correct node behavior in common, all of the correct nodes in all the behaviors in the sequence have to agree.
But by the validity condition, correct nodes in the first behavior in the sequence must choose different values than those in the last behavior, a contradiction.
As we indicated in the introduction, a less general version of Theorem 1 was previously known, and the structure of our proof is very similar to that of earlier proofs [PSL], [D]. Our proof differs in the construction of the system behaviors E1, E2 and E3. Earlier results construct these behaviors inductively, using less general models of distributed systems; the detailed assumptions of those models are necessary to carry out the tedious and involved constructions. Rather than construct the behaviors explicitly, we build them from pieces (node and edge behaviors) extracted from actual runs of the devices in a covering graph. The Locality and Fault axioms imply that scenarios in the covering graph are found in correct behaviors of the original inadequate graph.
The model used to obtain these results is an extremely general one, but it does assume that systems behave deterministically. (For every set of inputs, a system has a single behavior.) By considering a system and inputs as determining a set of behaviors, nondeterminism may be introduced in a straightforward manner. One changes the Locality axiom to express the following: if there exist behaviors of two systems in which the inedge borders of two isomorphic subsystems are identical, then there exist such behaviors in which the behaviors of the subsystems are also identical. Using this axiom, the same proofs suffice to show that nondeterministic algorithms cannot guarantee Byzantine agreement.
4. Weak Agreement Now we give our impossibility results for the weak agreement problem. As in the Byzantine agreement case, nodes have Boolean inputs and must choose a Boolean output. The agreement condition is the same as for Byzantine agreement: all correct nodes must choose the same output. The validity condition, however, is weaker.
Agreement. Every correct node chooses the same value.
Validity. If all nodes are correct and have the same input, that input must be the value chosen.
The weaker validity condition has an interesting impact on the agreement problem. If any correct node observes disagreement or faulty behavior, then all are free to choose a default value, so long as they still agree. Lamport notes that there are devices for reaching a form of approximate weak consensus which work when |G| ≤ 3f. Running these for an infinite time produces exact consensus (in the limit) [L]. In such infinite behaviors, any correct node observing disagreement or faulty behavior has plenty of time to notify the others before they choose a value. Thus, strengthening the choice condition by prohibiting such infinite solutions is necessary to obtain the lower bound. If the communication delays are not bounded away from zero, a similar type of infinite behavior is possible. In fact, if there is no lower bound on transmission delay, and if devices can control the delay and have synchronized clocks, then we can construct an algorithm for reaching
weak consensus. This algorithm requires at most two broadcasts per node, each having non-zero transmission delay, and works with any number of faults. Again, this is because any correct node which observes disagreement or faulty behavior has plenty of time to notify the others before they choose a value.⁴ In more realistic models it is impossible to reach weak consensus in inadequate graphs. To show this, the minimal semantics introduced in the previous sections must be extended to exclude infinitary solutions. We do this as follows. Previously, behaviors of nodes and edges were elements of some arbitrary set. Henceforth, we consider them to be mappings from [0,∞) (our definition of time) to arbitrary state sets. Thus, if E_u is a behavior of node u, then u is in state E_u(t) at time t. We add the following condition to the weak agreement problem.
Choice. A correct node must choose 0 or 1 after a finite amount of time. This means there is a function CHOOSE from behaviors of nodes running weak agreement devices to {0,1}, with the following property: every such behavior E has a finite prefix E_t (E restricted to the interval [0,t]) such that all behaviors E' extending E_t have CHOOSE(E) = CHOOSE(E'). This choice condition prohibits Lamport's infinite solution. To prohibit the second solution, we bound the rate at which information can traverse the network. To do so, we add the following stronger locality axiom to our model.
Bounded-Delay Locality Axiom. There exists a positive constant δ such that the following is true. Let G and G' be systems with behaviors E and E', respectively, and isomorphic subsystems U and U' (with vertex sets U and U'). If the corresponding behaviors of the inedge borders of U and U' in E and E' are identical through time t, then the scenarios E_U and E_U' are identical through time t+δ.
Thus, news of events k edges away from some subgraph G' takes time at least kδ to arrive at G'. In a model with explicit messages, this axiom would hold if the transmission delay is at least δ; the edge behaviors in our model would correspond to state descriptions of the transmitting end of each communication link.
Theorem 2: Weak agreement is not possible in inadequate graphs for models satisfying the Bounded-Delay Locality axiom.
Again, we first sketch the 3f+1 node bound. In this case the previously published proof [L]
⁴Nodes start at time 0, and decide at time 1. They broadcast their value at time 0, specifying it to arrive at time 1/2. If a node first detects disagreement or failure (at time 1−t), it broadcasts a "failure detected, choose default value" message, specifying it to arrive at time 1−t/2. The obvious decision is made by everyone at time 1.
was very difficult. As before, we restrict our attention to the case |G| = n = 3, f = 1. (The case for general f follows immediately, just as above.) Assume there are weak agreement devices A, B and C for the triangle graph G containing nodes a, b and c. Consider the two behaviors of G in which all nodes are correct, and all have input 0 or all have input 1. Let t' be an upper bound on the time it takes all nodes to choose 0 or 1 in both behaviors. Choose k > t'/δ to be a multiple of 3. The covering graph S consists of 4k nodes, arranged in a ring and assigned devices and inputs as follows:
[Figure: the covering graph S: 4k nodes arranged in a ring, running devices A, B, C, A, B, C, ... around the ring; the 2k nodes in one half have input 1 and the 2k nodes in the other half have input 0.]
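The ring construction can be built and checked mechanically. A Python sketch (ours, not the paper's; k is kept small for illustration):

```python
# Sketch (not from the paper): the 4k-node ring of the weak-agreement proof.
# Devices cycle A, B, C around the ring; half the nodes get input 1, half input 0.

def build_ring(k):
    """4k nodes in a ring; devices cycle A, B, C; the first 2k nodes get input 1."""
    assert k % 3 == 0, "k must be a multiple of 3 so the device pattern closes"
    n = 4 * k
    devices = ["ABC"[i % 3] for i in range(n)]
    inputs = [1 if i < 2 * k else 0 for i in range(n)]
    return devices, inputs

def covers_triangle(devices):
    """Each node's two ring neighbors must run the other two devices,
    just as each node's neighbors do in the triangle graph G."""
    n = len(devices)
    return all(
        {devices[(i - 1) % n], devices[i], devices[(i + 1) % n]} == {"A", "B", "C"}
        for i in range(n)
    )

devices, inputs = build_ring(3)   # k = 3: a 12-node ring
print(covers_triangle(devices))   # True: the ring covers the triangle
```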
Consider the resulting behavior S, and each pair of successive two-node scenarios, such as the two below.
[Figure: two successive two-node scenarios in S, overlapping in the behavior of the node running B.]
As before, each scenario is identical to a scenario in a behavior in G of the appropriate two weak consensus devices. Since each pair of successive scenarios overlaps in one node behavior (here, that of the node running B), the agreement condition requires that all the nodes in both scenarios choose the same value, in G and in S. Thus, every node in S must choose the same value. Without loss of generality, assume they choose 1. Consider the k scenarios indicated below.
[Figure: the k scenarios S1, S2, ..., Sk, successive two-node scenarios within the input-0 half of the ring.]
Let E be the behavior of G in which a, b and c are correct and each has input 0, and denote the resulting behaviors of a, b and c by Ea, Eb and Ec, respectively.
Lemma 3: The behavior in scenario Si of a node running device A (or B or C) is identical to Ea (or Eb or Ec) through time iδ.
Proof: An easy induction using the Bounded-Delay Locality axiom, essentially arguing that no device in Si can hear from a device with input 1 until after time iδ.
By Lemma 3, the nodes running devices C and A in scenario Sk have behaviors identical to Ec and Ea through time kδ > t'. Since nodes c and a in G have chosen output 0 by this time, so have the corresponding nodes in Sk, a contradiction. The general case of |G| ≤ 3f and the connectivity bound follow as for Byzantine agreement.
There are strong similarities between this argument and a proof by Angluin concerning leader elections in rings and arbitrarily long lines of processors [A]. Both results depend crucially on the existence of a lower bound on the rate of information flow. Under this assumption, devices in different communication networks can be shown to observe the same local behavior for some fixed time.
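The hop-counting at the heart of Lemma 3 can be made concrete: in the 4k-node ring, the node in the middle of the input-0 half is k hops from the nearest input-1 node, so under the Bounded-Delay Locality axiom its behavior agrees with E through time at least kδ. A Python sketch (ours, not the paper's):

```python
# Sketch (not from the paper): the hop-count behind Lemma 3. In the 4k-node ring,
# a node in the middle of the input-0 half is k hops from the nearest input-1 node,
# so by the Bounded-Delay Locality axiom its behavior matches the all-0 behavior E
# of G through time at least k*delta.
from collections import deque

def hops_to_input_one(inputs, start):
    """BFS distance in the ring from `start` to the nearest node with input 1."""
    n = len(inputs)
    seen, frontier, d = {start}, deque([start]), 0
    while frontier:
        for _ in range(len(frontier)):
            u = frontier.popleft()
            if inputs[u] == 1:
                return d
            for v in ((u - 1) % n, (u + 1) % n):
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        d += 1
    return None  # no input-1 node in the ring

k = 6
inputs = [1] * (2 * k) + [0] * (2 * k)  # the ring's input assignment
middle_of_zero_half = 3 * k             # midpoint of the input-0 arc
print(hops_to_input_one(inputs, middle_of_zero_half))  # k, so time at least k*delta
```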
5. Byzantine Firing Squad
The Byzantine firing squad problem addresses a form of synchronization in the presence of Byzantine failures. The problem is: given an input stimulus, synchronize the response of entering a designated FIRE state. This problem was studied originally in [BL]. In [CDDS], a reduction of weak agreement to the Byzantine firing squad problem demonstrates that the latter is impossible to solve in inadequate graphs. We provide a direct proof that a simple variant of the original problem is impossible to solve in inadequate graphs. (In the original version, the stimulus can arrive at any time. We require that it arrive either at time 0, or not at all. Our validity condition is slightly different.) The proof is very similar to that for weak agreement. One or more devices may receive a stimulus at time 0. We model the stimulus as an input of 1, and the absence of the stimulus as an input of 0. Correct executions must satisfy the following conditions.
Agreement. If a correct node enters the FIRE state at time t, every correct node enters the FIRE state at time t.
Validity. If all nodes are correct and the stimulus occurs at any node, all nodes enter the FIRE state after some finite delay. If the stimulus does not occur and all nodes are correct, no node ever enters the FIRE state.
As in the case of weak agreement, solutions to the Byzantine firing squad problem exist in models in which there is no minimum communication delay. Thus, the following result requires the Bounded-Delay Locality axiom, in addition to the Fault axiom.
Theorem 4: The Byzantine firing squad problem cannot be solved in inadequate graphs for models satisfying the Bounded-Delay Locality axiom.
We sketch the 3f+1 node bound. As before, we examine the case |G| = n = 3, f = 1. Assume there are Byzantine firing squad devices A, B and C for the triangle graph G containing nodes a, b and c. Consider the two behaviors of G in which all nodes are correct, and all have input 0 or all have input 1. Let t be the time at which the correct devices enter the FIRE state in the case that the stimulus occurred (the input 1 case). Choose k ≥ t/δ to be a multiple of 3. (Recall that δ is the minimum transmission delay defined in the Bounded-Delay Locality axiom.) The covering graph S consists of 4k nodes, arranged in a ring and assigned devices and inputs as follows:
[Figure: the covering graph S, a ring of 4k nodes with devices A, B and C assigned cyclically around the ring; a contiguous segment of nodes receives input 1 and the complementary segment receives input 0.]
Similarly to the proof for weak agreement, the middle two devices receiving the stimulus enter the FIRE state at time t, as their behavior through time t is the same as that of the correct nodes in G which have received the stimulus and fire at time t. Because of the communication delay, there is not enough time for "news" from the distant nodes to reach these devices. By repeated use of the agreement property, all the devices in S must fire at time t. But through time t, the middle two devices not receiving the stimulus behave exactly as correct nodes in G which do not receive the stimulus (the input 0 case). Thus, they do not fire at time t, a contradiction.
6. Approximate Agreement
Next, we turn to two versions of the approximate agreement problem [DLPSW,MS]. We call them simple approximate agreement and (ε,δ,γ)-agreement. In these problems nodes have real values as inputs and choose real numbers as a result. The goal is to have the results close to each other and to the inputs. To obtain the strongest possible impossibility result, we formulate very weak versions of the problems.
For the following two theorems we use only the Locality and Fault axioms. We do not need the Bounded-Delay Locality axiom used for the weak agreement and firing squad results.
6.1. Simple Approximate Agreement
First, we examine a version of the simple approximate agreement problem [DLPSW]. Each correct node has a real value from the interval [0,1] as input, runs its device, and chooses a real value. Correct behaviors (those in which at least n − f nodes are correct) must satisfy the following conditions.
Agreement. The maximum difference between values chosen by correct nodes must be strictly smaller than the maximum difference between the inputs, or be equal to the latter difference if it is zero.
Validity. Each correct node chooses a value within the range of the inputs of the nodes.
Theorem 5: Simple approximate agreement is not possible in inadequate graphs.
The proof is almost exactly that for Byzantine agreement. Here, we consider devices which take as inputs numbers from the interval [0,1], and choose a value from [0,1] to output. (Outputs are modeled by a function CHOOSE from behaviors of nodes running the devices to the interval [0,1].) As before, assume simple approximate agreement can be reached in the triangle graph G. Consider the following three scenarios from the indicated behavior in the covering graph S.
[Figure: a behavior of the covering graph S, with devices A, B and C assigned around the ring; inputs are 0 at one end of the segment and 1 at the other, and three overlapping two-node scenarios are indicated.]
Again, each scenario is also a scenario in a correct behavior of G. In the first scenario, the only value C can choose is 0. In the third, the only value A can choose is 1. This means the values chosen by A and C in the second scenario are 0 and 1, so that the outputs are no closer than the inputs, violating the agreement condition. The general case of |G| ≤ 3f and the connectivity bounds follow as for Byzantine agreement.
6.2. (ε,δ,γ)-Agreement
This version of approximate agreement is based on that in [MS]. Let ε, δ and γ be positive real numbers. The correct nodes receive real numbers as inputs, with rmin and rmax the smallest and largest such inputs, respectively. These inputs are all at most δ apart (i.e. the interval of inputs [rmin, rmax] has length at most δ). They must choose a real number as output, such that correct behaviors (those in which at least n − f nodes are correct) satisfy the following conditions.
Agreement. The values chosen by correct nodes are all at most ε apart.
Validity. Each correct node chooses a value in the interval [rmin − γ, rmax + γ].
Note that if ε ≥ δ, (ε,δ,γ)-agreement can be achieved trivially by choosing the input value as output.
Theorem 6: If ε < δ, (ε,δ,γ)-agreement is not possible in inadequate graphs.
Proof: Let ε, δ and γ be positive real numbers with ε < δ. We prove only the 3f+1 bound on the number of nodes. Assume that devices A, B and C exist which solve the (ε,δ,γ)-agreement problem in the complete graph G on three nodes for particular values of ε, δ and γ, where ε < δ. Choose k sufficiently large that δ > 2γ/(k−1) + ε, and k+2 is divisible by three. The covering graph S contains k+2 nodes arranged in a ring, with devices and inputs assigned to create the following system.
[Figure: the covering graph S, a ring of k+2 nodes with devices A, B and C assigned cyclically; the node with index i (for i = 0, 1, ..., k, k+1) receives input iδ.]
Let Si, for 0 ≤ i ≤ k, denote the two-node scenario in S containing the behaviors of nodes i and i+1. By the Fault axiom, each scenario Si is a scenario of a correct behavior of G, in which the largest input value to a correct node is (i+1)δ.
Lemma 7: For 0 ≤ i < k, the value chosen by the device at node i+1 is at most δ + γ + iε.
Proof: The proof is a simple induction. By validity applied to scenario S0, the device at node 1 chooses at most δ + γ. Assume inductively that the device at node i chooses at most δ + γ + (i−1)ε, for 1 ≤ i < k. By agreement applied to scenario Si, the device at node i+1 chooses at most δ + γ + iε.
In particular, Lemma 7 implies the device at node k chooses at most δ + γ + (k−1)ε. But validity applied to scenario Sk implies the device at node k chooses at least kδ − γ. So kδ − γ ≤ δ + γ + (k−1)ε. This implies δ ≤ 2γ/(k−1) + ε, a contradiction. The general case of |G| ≤ 3f and the connectivity bounds follow as in previous proofs.
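The arithmetic behind Theorem 6 can be checked numerically. The sketch below is our own illustration (not part of the proof): it picks ε < δ, chooses k exactly as the proof does, and confirms that the upper bound from Lemma 7 is incompatible with the lower bound forced by validity in scenario Sk.

```python
# Numeric sanity check of the contradiction in Theorem 6 (our own
# illustration). Ring inputs are 0, delta, 2*delta, ..., (k+1)*delta.
def check_contradiction(eps, delta, gamma):
    assert eps < delta
    # Choose k as in the proof: delta > 2*gamma/(k-1) + eps, k+2 divisible by 3.
    k = 2
    while not (delta > 2 * gamma / (k - 1) + eps and (k + 2) % 3 == 0):
        k += 1
    upper = delta + gamma + (k - 1) * eps   # Lemma 7: node k chooses at most this
    lower = k * delta - gamma               # validity in S_k: node k chooses at least this
    return k, upper, lower

k, upper, lower = check_contradiction(eps=0.1, delta=1.0, gamma=2.0)
print(k, upper < lower)   # → 7 True: the two bounds are incompatible
```

With ε = 0.1, δ = 1.0 and γ = 2.0, the smallest admissible k is 7; node k must then choose both at most 3.6 and at least 5, which is impossible.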
7. Clock Synchronization
Each node has a hardware clock and maintains a logical clock. The hardware clocks are real-valued, invertible and increasing functions of time. Since different hardware clocks run at different rates, it may be necessary to synchronize the logical clocks more closely than the hardware clocks. In addition, logical clocks should be reasonably close to real time; setting them to be constantly zero, for example, should be forbidden. Thus, we require the logical clocks to stay within some envelope of the hardware clocks. [See the paper by Lundelius, Lynch and Simons in this volume.]
This problem was studied in [DHS] for the case of linear clock and envelope functions, where it was shown that it is impossible to synchronize to within a constant in inadequate graphs. Some more general synchronization issues were raised, such as that diverging linear clocks can be synchronized to within a constant if nodes can run their logical clocks as the logarithm of their hardware clocks. For a large class of clock and envelope functions (increasing and invertible clocks, non-decreasing envelopes), we can characterize the best synchronization possible in inadequate graphs. This synchronization requires no communication whatsoever.
We model node i's hardware clock, Di, as an input to the device at node i that has value Di(t) at time t. The value of the hardware clock at time t is assumed to be part of the state of the node at time t. The time on node i's logical clock at real time t is given by a function of the entire state of node i. Thus, if Ei is a behavior of node i (such that node i is in state Ei(t) at time t), then we express i's logical clock value at time t as Ci(Ei(t)).
We assume that any aspect of the system which is dependent upon time (such as transmission delay, minimum step time, maximum rate of message transmission) is a function of the states of the hardware clocks. Having made this assumption, it is clear that speeding up or slowing down the hardware clocks uniformly in a behavior E cannot be observable to the nodes, so that the only impact on E should be to change the (unobserved) real times at which events occur. To formalize this assumption, we need to talk about scaling clocks and behaviors.
Let h be any invertible function of time. If E is a behavior (of an edge or node), then Eh, the behavior E scaled by h, is such that Eh(t) = E(h(t)), for all times t. Similarly, Dh is the hardware clock D scaled by h: Dh(t) = D(h(t)). If E is a system behavior or scenario, Eh is the system behavior or scenario obtained by scaling every node and edge behavior in E by h. Similarly, if S is a system, then Sh is the system obtained by scaling every clock in S by h. Intuitively, a scaled clock or behavior is in the state at time t that the corresponding unscaled clock or behavior is in at time h(t).
Scaling Axiom. If E is the behavior of system S, then Eh is the behavior of system Sh.
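The scaling definitions can be exercised directly. In this sketch (our own illustration; the particular clocks are our choice, not the paper's), p is the identity, q runs twice as fast, and h = p^-1 q, so scaling the fast clock by h^-1 recovers the slow one, i.e. qh^-1 = p.

```python
# Illustrative sketch of the scaling definitions. A scaled clock Dh is in
# the state at time t that D is in at time h(t): (Dh)(t) = D(h(t)).
def scale(D, h):
    return lambda t: D(h(t))

p = lambda t: t            # slower hardware clock (the identity)
q = lambda t: 2.0 * t      # faster hardware clock
h = lambda t: 2.0 * t      # h = p^-1 q; here p is the identity, so h = q
h_inv = lambda t: t / 2.0  # h^-1 = q^-1 p

# Scaling the fast clock by h^-1 recovers the slow clock: q h^-1 = p.
q_scaled = scale(q, h_inv)
print([q_scaled(t) == p(t) for t in (0.0, 1.0, 3.5, 10.0)])   # → [True, True, True, True]
```

A node reading only its (scaled) clock cannot tell the two systems apart, which is precisely what the Scaling axiom asserts.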
If this axiom is significantly weakened, as by bounding the transmission delay, clock synchronization may be possible in inadequate graphs.
In the following we use the Locality, Fault and Scaling axioms. We do not need the Bounded-Delay Locality axiom used for the weak agreement and firing squad results.
The synchronization problem can be stated as follows. Let correct hardware clocks run either at p(t) or q(t), where p and q are increasing, invertible functions, with p(t) ≤ q(t), for all t. Let the envelope functions l and u be non-decreasing functions such that l(t) ≤ u(t), for all t. Consider what happens if all logical clocks are run at the lower envelope, C(E(t)) = l(D(t)). Then the logical clocks are synchronized to within l(q(t)) − l(p(t)). The goal, then, is to improve this trivial synchronization. We show that logical clocks cannot be synchronized to within l(q(t)) − l(p(t)) − α, for any positive α. That is, nontrivial synchronization is achieved by synchronization devices in G if there exist a positive constant α and time t' such that every correct system behavior E satisfies the following conditions.
Agreement. For any two correct nodes i and j in E, |Ci(Ei(t)) − Cj(Ej(t))| ≤ l(q(t)) − l(p(t)) − α, for all times t ≥ t'.
Validity. For any correct node i in E, with hardware clock Di and resulting behavior Ei, l(p(t)) ≤ Ci(Ei(t)) ≤ u(q(t)).
Theorem 8: Nontrivial synchronization is not possible in inadequate graphs for models satisfying the Scaling axiom.
The argument is very similar to the proof of Theorem 6. We construct a covering graph and assign hardware clocks in such a way that each node seems to be running slow relative to one neighbor, and fast relative to the other. As usual, the Fault axiom allows us to draw conclusions about behaviors in the covering system from required behavior in the original graph. The agreement condition requires the nodes to be synchronized with their faster neighbor. An induction shows that the slowest node (whose fast neighbor is synchronized with a faster neighbor, in turn synchronized with an even faster neighbor, and so on) must run so fast as to violate the upper envelope condition. Specifically, we show that for every integer k > 2, there is a behavior E of G in which node i is correct, has hardware clock Di = p (that is, Di(t) = p(t)), and in which Ci(Ei(t')) > l(p(t')) + kα. For k big enough, this violates the upper envelope condition, Ci(Ei(t')) ≤ u(q(t')).
Define h = p^-1 q. (That is, h(t) = p^-1(q(t)).) Then h^-1 = q^-1 p. Note that h(t) ≥ t for all t, since p(t) ≤ q(t).
We begin with the three node, one fault case. Assume the existence of devices A, B and C, time t' and positive constant α such that logical clocks of correct nodes obey the agreement and validity conditions:
|Ci(Ei(t)) − Cj(Ej(t))| ≤ l(q(t)) − l(p(t)) − α, for all times t ≥ t'.
l(p(t)) ≤ Ci(Ei(t)) ≤ u(q(t)), for all times t.
Choose an integer k > 2, such that k+2 is a multiple of three, and such that l(p(t')) + kα > u(q(t')).
The covering graph S contains k+2 nodes arranged in a ring, with devices and clock inputs assigned to create the following system.
[Figure: the covering graph S, a ring of k+2 nodes with devices A, B and C assigned cyclically. The node with index i (for i = 0, 1, ..., k, k+1) has hardware clock qh^-i (so node 0 has clock q, node 1 has clock qh^-1 = p, and node k+1 has clock qh^-(k+1)) and behavior Ei.]
Let S be the behavior of this system. An initially troubling concern is that the hardware clocks of most of the devices in S are much slower than they would be in a correct behavior in G. But consider Si, the two-node scenario containing the behaviors of nodes i and i+1, where 0 ≤ i ≤ k.
[Figure: scenario Si, containing the adjacent nodes with indices i and i+1, hardware clocks qh^-i and qh^-(i+1), and behaviors Ei and Ei+1.]
Now consider Si h^i, the scenario Si scaled by h^i.
[Figure: scenario Si h^i, containing the nodes with indices i and i+1, hardware clocks q and p, and behaviors Ei h^i and Ei+1 h^i.]
In this scenario, the hardware clocks have values within the constraints for correct behaviors of G. Thus we have the following.
Lemma 9: Scenario Si h^i, for 0 ≤ i ≤ k, is a scenario containing the behaviors of two correct nodes in a correct behavior of G.
Lemma 10: For all i, 0 ≤ i ≤ k, and all t ≥ h^i(t'), |Ci+1(Ei+1(t)) − Ci(Ei(t))| ≤ l(q(h^-i(t))) − l(p(h^-i(t))) − α.
Proof: Fix t ≥ h^i(t'). Then h^-i(t) ≥ t'. By Lemma 9, i and i+1 are correct in Si h^i, so by the agreement assumption |Ci+1(Ei+1 h^i(h^-i(t))) − Ci(Ei h^i(h^-i(t)))| ≤ l(q(h^-i(t))) − l(p(h^-i(t))) − α. The result is immediate.
Let time t'' = h^k(t'). Note that t'' ≥ h^i(t'), for i ≤ k.
Lemma 11: For all i, 1 ≤ i ≤ k+1, Ci(Ei(t'')) ≥ l(qh^-i(t'')) + (i−1)α.
Proof: The proof is by induction on i. By Lemma 9, scenario S0 is a scenario in G of correct nodes a and b with hardware clocks q and p, respectively. From the validity condition, for all t, C1(E1(t)) ≥ l(p(t)). Setting t = t'', and substituting qh^-1 for p, we have the basis step: C1(E1(t'')) ≥ l(qh^-1(t'')).
Now make the inductive assumption Ci(Ei(t'')) ≥ l(qh^-i(t'')) + (i−1)α, for 1 ≤ i ≤ k. Since t'' ≥ h^i(t'), from Lemma 10 we know |Ci+1(Ei+1(t'')) − Ci(Ei(t''))| ≤ l(qh^-i(t'')) − l(ph^-i(t'')) − α. This implies Ci+1(Ei+1(t'')) ≥ Ci(Ei(t'')) − l(qh^-i(t'')) + l(ph^-i(t'')) + α. Substituting for Ci(Ei(t'')) using the inductive assumption gives us Ci+1(Ei+1(t'')) ≥ l(qh^-i(t'')) − l(qh^-i(t'')) + l(ph^-i(t'')) + iα = l(ph^-i(t'')) + iα. Noting that p = qh^-1, we have the result, Ci+1(Ei+1(t'')) ≥ l(qh^-(i+1)(t'')) + iα.
Proof of Theorem 8: Lemma 11 implies Ck+1(Ek+1(t'')) ≥ l(qh^-(k+1)(t'')) + kα. Since t'' = h^k(t'), we have Ck+1(Ek+1(t'')) = Ck+1(Ek+1(h^k(t'))) = Ck+1(Ek+1 h^k(t')) ≥ l(qh^-(k+1) h^k(t')) + kα = l(p(t')) + kα. But the upper envelope constraint for the scaled scenario Sk h^k (in which node k+1 is correct and has hardware clock p(t)) implies that Ck+1(Ek+1 h^k(t')) ≤ u(q(t')). Thus, l(p(t')) + kα ≤ u(q(t')). This violates the assumed bound on k, l(p(t')) + kα > u(q(t')).
Once again, the general case of |G| ≤ 3f is a simple extension of this argument. The connectivity bound also follows easily, as with the earlier results.
7.1. Linear Envelope Synchronization and Other Corollaries
Linear envelope synchronization, as defined in [DHS], examines the synchronization problem when the clocks and envelope functions are linear functions (q(t) = rt, p(t) = t, l(t) = at + b and u(t) = ct + d). It requires correct logical clocks to remain within a constant of each other, so that the agreement condition bounds |Ci(Ei(t)) − Cj(Ej(t))| by a constant, for all times t, instead of our weaker condition |Ci(Ei(t)) − Cj(Ej(t))| ≤ art − at − α, for all times t ≥ t'. Our validity condition is slightly weaker as well. Thus, the proof of [DHS] shows that logical clocks cannot be synchronized to within a constant; we show that the synchronization of logical clocks cannot be improved by a constant over the synchronization (art − at) that can be achieved trivially. Thus the following corollary follows immediately from Theorem 8. (Each of the four corollaries below holds for models satisfying the Scaling axiom.)
Corollary 12: Linear envelope synchronization is not possible in inadequate graphs.
By choosing specific values for the clock and lower envelope functions, we get the following additional results immediately from Theorem 8. Note that the particular choice of the upper envelope function does not affect the minimal synchronization possible in inadequate graphs, although the existence of some upper envelope function is necessary to obtain our impossibility proofs.
Corollary 13: If p(t) = t, q(t) = rt, and l(t) = at + b, no devices can synchronize a constant closer than art − at in inadequate graphs.
Corollary 14: If p(t) = t, q(t) = t + c and l(t) = at + b, no devices can synchronize a constant closer than ac in inadequate graphs.
Corollary 15: If p(t) = t, q(t) = rt and l(t) = log2(t), no devices can synchronize a constant closer than log2(r) in inadequate graphs.
In general, the best possible synchronization in inadequate graphs can be achieved without any communication at all. The best that nodes can do is run their logical clocks as slowly as they are permitted, C(E(t)) = l(D(t)).
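The trivial synchronization bound l(q(t)) − l(p(t)) behind Corollaries 13-15 is easy to tabulate. The small numeric sketch below is ours (the parameter values are arbitrary): the bound grows with t for linear l, but collapses to the constants ac and log2(r) in the other two cases.

```python
import math

# Trivial synchronization C(E(t)) = l(D(t)): the skew between a node whose
# hardware clock runs at q and one whose clock runs at p is l(q(t)) - l(p(t)).
def trivial_skew(l, p, q, t):
    return l(q(t)) - l(p(t))

r, a, b, c = 3.0, 2.0, 1.0, 5.0
# Corollary 13: p(t)=t, q(t)=rt, l(t)=at+b  ->  skew = art - at (grows with t)
s13 = trivial_skew(lambda t: a * t + b, lambda t: t, lambda t: r * t, 10.0)
# Corollary 14: p(t)=t, q(t)=t+c, l(t)=at+b  ->  skew = ac (constant)
s14 = trivial_skew(lambda t: a * t + b, lambda t: t, lambda t: t + c, 10.0)
# Corollary 15: p(t)=t, q(t)=rt, l(t)=log2(t)  ->  skew = log2(r) (constant)
s15 = trivial_skew(lambda t: math.log2(t), lambda t: t, lambda t: r * t, 10.0)
print(s13, s14, s15)   # s13 grows with t; s14 and s15 are the constants ac and log2(r)
```

This is the sense in which logarithmic logical clocks tame diverging linear hardware clocks: log2(rt) − log2(t) = log2(r), independent of t.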
8. Conclusion
Most of the results we have presented were previously known. Although our proofs are simpler than earlier proofs and apply to more general models, these are not the main contributions. The simplicity and generality are welcome byproducts of our attempt to identify the fundamental issues and assumptions behind a collection of similar results.
One important contribution is to elucidate the relationship between the unrestricted or Byzantine failure assumption and inadequate graphs. As is clear from our proofs, this fault assumption permits faulty nodes to mimic executions of disparate network topologies. If the network is inadequate, a covering graph can be constructed so that correct devices cannot distinguish the execution in the original graph from one in the covering graph.
A second contribution is related to the generality of our results. Nowhere do we restrict state sets or transitions to be finite, or even to reflect the outcome of effective computations. The inability to solve consensus problems in inadequate graphs has nothing to do with computation per se, but rather with distribution. It is the distinction between local and global state, and the uncertainty introduced by the presence of Byzantine faults, which results in this limitation.
Finally, we have identified a small, natural set of assumptions upon which the impossibility results depend. For example, in the case of weak agreement and the firing squad problem, the correctness conditions are sensitive to the actions of faulty nodes. Instantaneous notification of the detection of fault events would allow one to solve these problems. An assumption that there are minimum delays in discovering and relaying information about faults is sufficient to make these problems unsolvable.
9. References
[A] D. Angluin, "Local and Global Properties in Networks of Processors," Proc. of the 12th STOC, April 30-May 2, 1980, Los Angeles, CA, pp. 82-93.
[B] J. Burns, "A Formal Model for Message Passing Systems," TR-91, Indiana University, September 1980.
[BL] J. Burns, N. Lynch, "The Byzantine Firing Squad Problem," submitted for publication.
[CDDS] B. Coan, D. Dolev, C. Dwork and L. Stockmeyer, "The Distributed Firing Squad Problem," Proc. of the 17th STOC, May 6-8, 1985, Providence, R.I.
[D] D. Dolev, "The Byzantine Generals Strike Again," Journal of Algorithms, 3, 1982, pp. 14-30.
[DHS] D. Dolev, J. Halpern, H. Strong, "On the Possibility and Impossibility of Achieving Clock Synchronization," Proc. of the 16th STOC, April 30-May 2, 1984, Washington, D.C., pp. 504-510.
[DLPSW] D. Dolev, N. A. Lynch, S. Pinter, E. Stark and W. Weihl, "Reaching Approximate Agreement in the Presence of Faults," Proc. of the 3rd Annual IEEE Symp. on Distributed Software and Databases, 1983.
[IR] A. Itai, M. Rodeh, "The Lord of the Ring or Probabilistic Methods for Breaking Symmetry in Distributive Networks," RJ-3110, IBM Research Report, April 1981.
[L] L. Lamport, "The Weak Byzantine Generals Problem," JACM, 30, 1983, pp. 668-676.
[LSP] L. Lamport, R. Shostak, M. Pease, "The Byzantine Generals Problem," ACM Trans. on Programming Lang. and Systems 4, 3 (July 1982), pp. 382-401.
[MS] S. Mahaney, F. Schneider, "Inexact Agreement: Accuracy, Precision, and Graceful Degradation," Proc. of the 4th Annual ACM Symposium on Principles of Distributed Computing, August 5-7, 1985, Minaki, Ontario.
[PSL] M. Pease, R. Shostak, L. Lamport, "Reaching Agreement in the Presence of Faults," JACM 27, 2, 1980, pp. 228-234.
AN EFFICIENT, FAULT-TOLERANT PROTOCOL FOR REPLICATED DATA MANAGEMENT
Dale Skeen, Teknekron Software Systems, Palo Alto, California
Amr El Abbadi, Computer Science Department, University of California, Santa Barbara
Flaviu Cristian, IBM Almaden Research Center, San Jose, California
ABSTRACT
A data management protocol for executing transactions on a replicated database is presented. The protocol ensures one-copy serializability, i.e., the concurrent execution of transactions on a replicated database is equivalent to some serial execution of the same transactions on a non-replicated database. The protocol tolerates a large class of failures, including: processor and communication link crashes, partitioning of the communication network, lost messages, and slow responses of processors and communication links. Processor and link recoveries are also handled. The protocol implements the reading of a replicated object efficiently by reading the nearest available copy of the object. When reads outnumber writes, the protocol performs better than other known protocols.
1. INTRODUCTION
The objective of data replication in a distributed database system is to increase data availability and decrease data access time. By data replication, we mean maintaining several physical copies, usually at distinct locations, of a single logical database object. To make the replication transparent to the user of an object, a replica control protocol is needed to coordinate physical accesses to the copies of a logical data object and to
guarantee that they exhibit behavior equivalent to that of a single-copy object [BGb]. Such a protocol translates a logical write of a data object x into a set of physical writes on copies of x, and translates a logical read of x into a set of reads on one or more physical copies of x. To increase data availability effectively, a replica control protocol must be tolerant of commonly occurring system component failures. To minimize the overhead caused by replication, the protocol should minimize the number of physical accesses required for implementing one logical access.
This paper outlines a replica control protocol that tolerates a large class of failures: processor and communication link failures, partitioning of the communication network, lost messages, and slow responses of processors and communication links. It also handles any number of possibly simultaneous processor and link recoveries. The major strength of the protocol is that it implements the reading of a logical object very efficiently: a read of a logical object, when permitted, is accomplished by accessing only the nearest available physical copy of the object. In applications where reads outnumber writes, this strategy will reduce the total cost of accessing replicated data objects.
There are two possible approaches to developing a fault-tolerant replica control protocol in such an environment. The first is the status-oblivious approach, of which the quorum consensus algorithm [G] is a well-known example. In this protocol a processor executing a logical operation first sends out messages to all processors containing copies of the object, requesting the execution of the operation. It then waits for a response from a quorum. If, as a result of failures, a quorum of processors does not respond, the operation is aborted. The second is the status-dependent approach.
The execution of operations depends on the knowledge by each processor of the communication topology, and this knowledge is used to decide on the appropriate set of sites with which to communicate when executing a logical operation. In this paper we use a status-dependent approach where each processor maintains a view of the current communication topology. Views are used to optimize the translation of logical data operations into physical data accesses. Ideally, views should reflect the actual communication topology, but instantaneous detection of failures and recoveries is not possible. In our protocol, a processor's view is an approximation of the set of sites with which it can communicate. Views are maintained by a sub-protocol of the replica control algorithm. This protocol guarantees that views satisfy a set of well-defined properties. In [ESC], it was proven that these properties are sufficient for the replica control protocol to exhibit to users the behavior of a database where each logical object is implemented by a single copy.
The protocol compares favorably with other proposed replica control protocols. It tolerates the same failure classes as majority voting [T] and quorum consensus [G]. It requires fewer accesses to copies, assuming that read requests outnumber write requests and that failure occurrences are rare events. It also tolerates the same failure classes as the "missing write" protocol [ES], but, unlike that protocol, uses a "read-one" rule for reading logical data objects even in the presence of failures. Our protocol is also simpler than the "missing write" protocol. In particular, it does not require the extra logging of transaction information that is required by that protocol when failures occur.
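The access-count claim can be illustrated with back-of-the-envelope arithmetic. The sketch below is entirely ours (the workload numbers and the write-all-copies-in-a-full-view rule are assumptions for illustration, not the protocol's specification): with five copies and a read-heavy workload, a read-one rule touches far fewer copies than majority-quorum reads.

```python
# Back-of-the-envelope comparison of physical accesses (illustrative only).
def majority(n):
    return n // 2 + 1        # size of a majority quorum

n = 5                        # copies per logical object
reads, writes = 90, 10       # assumed read-heavy workload

# Quorum consensus: every logical read and write contacts a majority.
quorum_total = reads * majority(n) + writes * majority(n)

# View-based "read-one": a read touches one copy; a write (here assumed to
# update every copy in the view, with a full view) touches all n copies.
read_one_total = reads * 1 + writes * n

print(quorum_total, read_one_total)   # → 300 140
```

The advantage, of course, reverses as writes come to dominate, which is why the comparison is conditioned on reads outnumbering writes.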
For a more detailed description of the replica control protocol and for a proof of correctness, the reader is referred to [ESC].
2. FAILURE ASSUMPTIONS
System components (processors, links) can fail in many ways, from occasional processor crashes and lost messages to Byzantine failures, where components may act in arbitrary, even malicious, ways. We consider only those failures that have a reasonable chance of occurring in practical systems and that can be handled by algorithms of moderate complexity and cost. The most general failure classes satisfying these criteria are omission failures and performance failures [CASD,H]. An omission failure occurs when a component does not respond to a service request. Typical examples of such failures include processor crashes, occasional message losses due to transmission errors or overflowing buffers, and communication link crashes. A performance failure occurs when a system component fails to respond to a service request within the time limit specified for the delivery of that service. Message delays caused by overloaded processors and network congestion are examples of performance failures. An important subclass of omission and performance failures is the class of partition failures. A partition failure divides a system into two or more disjoint sets of processors, where no member of one set can communicate in a timely manner with a member of another set. Our objective is to design a replica control protocol that is tolerant of any number of omission and performance failures.
3. SYSTEM MODEL
A distributed system consists of a finite set of processors, P = {1, 2, ..., n}, connected by a communication network. In the absence of failures the network provides the service of routing messages between any two processors. Processors or links may fail, leading to an inability to communicate within reasonable delays. Failed processors and links can recover spontaneously or because of system maintenance. Thus, the system of processors that can communicate with each other is a dynamically evolving system. In the following discussion, we will not be concerned with the details of the physical interconnection of the processors (e.g. a point-to-point versus a bus-oriented interconnection) or with the detailed behavior of the message routing algorithm. Instead, we consider only whether two processors are capable of communicating through messages.
We model the current can-communicate relation between processors by a communication graph. The nodes of the graph represent processors, and an undirected edge between two nodes a, b in P indicates that if a and b send messages to each other, these are received within a specified time limit. We call a connected component of a communication graph a communication cluster. A communication clique is a communication cluster which is totally connected, that is, there is an edge in the communication graph between every pair of processors in the cluster. We do not assume that the can-communicate relation is
transitive. Thus, it is possible that a and b can communicate, and b and c can communicate, but a and c cannot communicate. (Note that if the can-communicate relation is transitive, then all communication clusters are cliques.) In a system where failure occurrences lead quickly to the establishment of new communication routes that avoid the failed system components, communication clusters can be expected to be cliques most of the time. In the absence of failures, a communication graph is a single clique. The crash of a processor p results in a graph that contains two clusters: a trivial cluster consisting of the single node p, and a cluster consisting of all other nodes. A partition failure results in a graph containing two or more clusters.
For the purpose of adapting to changes in the communication topology, each processor maintains a local "view" of the can-communicate relation. Each processor's view is that processor's current estimate of the set of processors with which it believes that communication is possible. The function view: P -> 2^P (where 2^P denotes the powerset of P) gives the current view of each processor p in P.
A replicated database consists of a set of logical data objects L. Each logical object l in L is implemented by a nonempty set of physical data objects (the copies of l) that are stored at different processors. The copy of l stored at processor p is denoted by lp. The function copies: L -> 2^P gives for each logical object l the set of processors that possess physical copies of l. Transactions issue read and write operations on logical objects. A replicated data management protocol is responsible for implementing logical operations (as they occur in transactions) in terms of physical operations on copies.
The term event is used to denote a primitive atomic action in the system. A primitive action is an operation that is executed locally on a site, such as the reading and writing of a physical object and the sending and receiving of messages.
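The graph-theoretic definitions translate directly into code. This sketch is our illustration (the function names are not from the paper): it computes communication clusters as connected components and tests the clique property, using the non-transitive example above (a-b and b-c but not a-c).

```python
# Sketch: clusters = connected components of the can-communicate graph;
# a cluster is a clique iff every pair of its members is adjacent.
def clusters(nodes, edges):
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in nodes:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return adj, comps

def is_clique(comp, adj):
    return all(b in adj[a] for a in comp for b in comp if a != b)

# a and b communicate, b and c communicate, but a and c do not; d is crashed.
adj, comps = clusters(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
print(comps)                            # two clusters: {a, b, c} and the trivial {d}
print(is_clique({"a", "b", "c"}, adj))  # → False (the a-c edge is missing)
```

The isolated node d models a crashed processor: the crash splits the graph into a trivial cluster and the cluster of all remaining nodes.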
An execution of a set of transactions is a finite set of events partially ordered by the happens-before relation studied in [L]. We assume that the set of events restricted to a given processor is totally ordered by the happens-before relation. That is to say, if e and f are events occurring at the same processor, then either e happens-before f or f happens-before e. Consequently, the operations executed on a given physical object are totally ordered. An execution is serial if its set of events is totally ordered by the happens-before relation, and if, for every pair of transactions t1 and t2, either all physical data operations of t1 happen-before all physical operations of t2, or vice versa.
For a replicated data management protocol to be correct, the database system must exhibit the same externally observable behavior as a system executing transactions serially in a nonreplicated database system [TGGL]. This property is known as one-copy serializability [BGb].
One popular approach for designing a replicated data management protocol is to decompose the algorithm into two parts: a replica control protocol that translates each logical operation into one or more physical operations, and a concurrency control protocol that
synchronizes the execution of physical operations [BGb]. The concurrency control protocol ensures that an execution of the translated transactions (in which logical access operations are replaced by physical operations) is serializable, that is, equivalent to some serial execution. But the concurrency control protocol does not ensure one-copy serializability, since it knows nothing about logical objects. (It may, for example, permit two distinct transactions to update different copies of the same logical object in parallel.) The replica control protocol ensures that transaction execution is one-copy serializable.
4. REPLICA CONTROL
Following the decomposition outlined above, we now derive a protocol for correctly managing replicated data in the presence of any number of omission and performance failures. In this section, the emphasis is on formulating the requirements for a replica control protocol and on showing that any implementation satisfying these requirements satisfies our correctness criteria. In the next section, we describe in some detail one protocol and show that it exhibits the desired properties. Ideally, we would like to design a replica control protocol that can be combined with any concurrency control protocol that ensures serializability. However, this seems to be difficult to achieve given our performance objectives. Consequently, we will restrict the class of allowable concurrency control protocols to those ensuring a stronger property known as conflict-preserving serializability [H]. Two physical operations conflict if they operate on the same physical object and at least one of them is a write. An execution E of a set of transactions T is conflict-preserving (CP) serializable if there exists an equivalent serial execution Es of T that preserves the order of execution of conflicting operations (i.e. if op1 and op2 are conflicting physical operations and op1 happens-before op2 in E, then op1 happens-before op2 in Es) [EGLT,H]. Henceforth in our discussion, we will assume the existence of a concurrency control protocol ensuring that
(A1) The execution of any set of transactions (viewed as a set of physical operations) is conflict-preserving serializable.
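The conflict relation just defined is simple enough to state as code. The following Python sketch is ours, not the chapter's; the operation representation (a transaction id, an operation kind, and a physical copy name) is an illustrative assumption.

```python
# Sketch (not from the chapter): two physical operations conflict iff they
# operate on the same physical copy and at least one of them is a write.
from collections import namedtuple

# Hypothetical representation: kind is "r" (read) or "w" (write).
Op = namedtuple("Op", ["txn", "kind", "copy"])

def conflicts(op1, op2):
    """True iff op1 and op2 touch the same physical object and
    at least one of them is a write."""
    return op1.copy == op2.copy and "w" in (op1.kind, op2.kind)

# A write and a read on the same copy conflict; two reads never do,
# nor do operations on different copies.
assert conflicts(Op(1, "w", "x_p"), Op(2, "r", "x_p"))
assert not conflicts(Op(1, "r", "x_p"), Op(2, "r", "x_p"))
assert not conflicts(Op(1, "w", "x_p"), Op(2, "w", "x_q"))
```

A CP-serializable execution is then one whose equivalent serial execution preserves the order of every pair satisfying this predicate.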
Practically speaking, restricting the class of concurrency control protocols to those enforcing CP-serializability is inconsequential, since all published, general-purpose concurrency control protocols are members of this class. This includes two-phase locking [EGLT], optimistic concurrency control [KR], timestamp ordering [BSR], and all distributed protocols surveyed by Bernstein and Goodman [BGa]. Our performance objective is to provide cheap read access while offering a high level of data availability. In order to understand better what is attainable, let us first consider a "clean" failure environment in which two simplifying assumptions hold. The first assumption is that the can-communicate relation is transitive:
(A2) All communication clusters are cliques.
The second assumption (unrealistically) posits that changes in the communication topology (resulting from failures and recoveries) are instantly detected by all affected processors.
(A3) The view of each processor contains the processors adjacent to it in the current communication graph, itself, and no other processors.
Thus, from A2 and A3 we can conclude that the views of processors in the same communication cluster are equal and the views of processors in different clusters are disjoint. Given the above assumptions, the following rules can be used to control access to logical objects. When processor p executes a read or a write operation on a logical object l, it first checks whether a (possibly weighted) majority of the copies of l reside on processors in its local view. If not, it aborts the operation. Otherwise, for a read, it reads the nearest copy which resides on a processor in its view; and for a write, it writes all copies on processors in its view. When integrated with an appropriate cluster initialization protocol, which ensures that all copies of a logical object accessible in a newly established cluster have the most up-to-date value assigned to that object, the above rules can form the basis of a correct replica control protocol. The "majority rule" ensures that only one cluster can access a logical object at a time, and the "read-one/write-all rule" ensures that the copies of an object in a cluster act as a single copy. Together, these rules ensure that all executions are one-copy serializable. The above rules are simple, intuitive, and ensure a high level of data availability, provided the communication information maintained by the processors is accurate. Unfortunately, the correctness of the rules depends heavily on assumptions A2 and A3. If either is relaxed, non-one-copy-serializable executions can result.
Example 1. Figure 1 gives a possible communication graph for three processors when assumption (A2) is relaxed.
The graph indicates that processors A and B are no longer able to communicate due to, for example, failures that have occurred in the message routing protocol. Both processors however are able to communicate with C, and C with them. We thus have: view(A)={A,C}, view(B)={B,C} and view(C)={A,B,C}. Let each processor contain a copy of a logical data object x initialized to 0. Assuming that all copies are weighted equally, each processor will consider x to be accessible, since each has a majority of the copies in its view. Now, let A and then B execute a transaction that increments x by 1. Based on its own view, processor A reads its local copy of x and updates both its copy and C's copy. Similarly, B reads its local copy of x (which still contains 0) and updates both its copy and C's copy. Observe that after two successive increments, all copies of x contain 1. Clearly, the execution of these transactions is not one-copy serializable.
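The majority and read-one/write-all rules whose breakdown Example 1 illustrates can be sketched in a few lines. The Python below is our illustration, not the chapter's; the data layout (view, copies, weight, and store maps) is an assumption made for the example.

```python
# Sketch of the "clean"-environment access rules: the (possibly weighted)
# majority test plus read-one/write-all. Data layout is illustrative.

def accessible(l, p, view, copies, weight):
    """Majority rule: l is accessible from p iff a weighted majority of
    l's copies reside on processors in p's view."""
    total = sum(weight[(l, q)] for q in copies[l])
    in_view = sum(weight[(l, q)] for q in copies[l] if q in view[p])
    return 2 * in_view > total

def logical_read(l, p, view, copies, weight, store):
    if not accessible(l, p, view, copies, weight):
        raise RuntimeError("abort: majority of copies not in view")
    q = next(q for q in copies[l] if q in view[p])   # read-one
    return store[(l, q)]

def logical_write(l, p, val, view, copies, weight, store):
    if not accessible(l, p, view, copies, weight):
        raise RuntimeError("abort: majority of copies not in view")
    for q in copies[l]:                              # write-all (in view)
        if q in view[p]:
            store[(l, q)] = val

# Three equally weighted copies of x; a full view sees a majority.
view   = {"A": {"A", "B", "C"}}
copies = {"x": ["A", "B", "C"]}
weight = {("x", q): 1 for q in "ABC"}
store  = {("x", q): 0 for q in "ABC"}
logical_write("x", "A", 1, view, copies, weight, store)
assert logical_read("x", "A", view, copies, weight, store) == 1
```

Running this with the overlapping views of Example 1 (each of A and B seeing a two-of-three majority) reproduces the lost update described above.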
Figure 1.

Example 2. Consider an initially partitioned system that undergoes repartitioning as shown in Figure 2. Two processors B and D detect the occurrence of the new partition immediately and update their views. The other two processors (A and C) do not detect it until later. Table 1 shows the intermediate system state after the view updates in B and D and before the view updates in A and C.
Figure 2.
Processor | Original view | Current view
A         | A,B           | A,B
B         | A,B           | B,C
C         | C,D           | C,D
D         | C,D           | A,D

Table 1.
Assume that, while the views are inconsistent, each processor p executes a transaction tp. Table 2 gives the transaction executed at each processor, and also the data objects stored there.
Processor | Copies  | Transaction
A         | a^2, b  | tA: read(b), write(a)
B         | b^2, c  | tB: read(c), write(b)
C         | c^2, d  | tC: read(d), write(c)
D         | d^2, a  | tD: read(a), write(d)

Table 2.
The superscripts on the objects denote "weights". Consider the execution of tA at processor A. Since B ∈ view(A), A can read its local copy of b. Since A's copy of a has weight 2, A can update it, and, furthermore, A will not attempt to update D's copy since D ∉ view(A). Hence, in the execution of the transaction, A accesses its local copies only. The executions of transactions tB, tC, and tD proceed similarly, with each processor accessing only its local copies. Since no accesses to remote copies are made, the inconsistency between processor views is not detected. The resulting execution is serializable but not one-copy serializable. As Example 1 illustrates, the correctness of the simple replica control protocol critically depends on the property that no two processors with different views be able to access a common set of copies. Example 2 illustrates that even in a well-behaved communication network, where transitivity of the can-communicate relation is assured, processors cannot independently and asynchronously update their views. The principal idea in our replica control protocol is to use the majority and the read/write rules mentioned above, but to circumvent the anomalies illustrated in Examples 1 and 2 by placing appropriate restrictions on when and how processors may update their views.
Toward this goal, we introduce the notion of a virtual partition. Roughly speaking, a virtual partition is a set of communicating processors that have agreed on a common view and on a common way to test for membership in the partition. For the purposes of transaction processing, only processors that are assigned to the same virtual partition may communicate. Hence, a virtual partition can be considered a type of "abstract" communication cluster where processors join and depart in a disciplined manner. In contrast, in a real communication cluster processors join and depart abruptly (and often inopportunely) because of failures and recoveries. It is desirable, of course, for virtual partitions to approximate the real communication capabilities of a system. The common view of the members in a virtual partition represents a shared estimation of the set of processors with which communication is believed possible. When a processor detects an inconsistency between its view and the can-communicate relation (by not receiving an expected message or by receiving a message from a processor not in its view), it can unilaterally depart from its current virtual partition. Note that the capability for a processor to depart unilaterally is an important capability: since the departing processor may no longer be able to communicate with the other members of its virtual partition, it must be able to depart autonomously, without communicating with any other processor. After departing, the processor can invoke a protocol to establish a new virtual partition. This protocol, which is part of the replica control protocol, creates a new virtual partition, assigns a set of processors to the partition, and updates those processors' views. An objective of the protocol is for the new virtual partition to correspond to a maximal set of communicating processors.
However, since failures and recoveries can occur during the execution of the view update protocol, it is possible that a virtual partition resulting from a protocol execution only partially achieves this objective. We identify virtual partitions by unique identifiers, and we denote the set of virtual partition identifiers by V. At any time, a processor is assigned to at most one virtual partition. The instantaneous assignment of processors to virtual partitions is given by the partial function vp: P -> V, where vp is not defined for a processor p if p is not assigned to any virtual partition. We use the total function defview: P -> {true,false} to characterize the domain of vp. That is, defview(p) is true if p is currently assigned to some virtual partition, and is false otherwise. The function members: V -> 2^P yields for each virtual partition v the set of processors that were at some point in time (but not necessarily contemporaneously) assigned to v. In order to ensure that the simple "read-one/write-all" rules achieve one-copy serializability, we require the following properties from any protocol managing processors' views and their assignment to virtual partitions. If p and q are arbitrary processors, the first two properties are
(S1) View consistency: If defview(p) and defview(q) and vp(p)=vp(q), then view(p)=view(q).
(S2) Reflexivity: If defview(p), then p ∈ view(p).
Property S1 states the requirement that processors assigned to the same virtual partition have the same view. With a slight abuse of notation, we let view(v) denote the view common to all members of virtual partition v. Property S2 enforces the requirement that every processor should be able to communicate with itself. From S1 and S2, one can infer that the view of a processor (when defined) is a superset of the processors in its virtual partition, and thereby, a superset of the processors with which it may actually communicate during transaction processing. The final property restricts the way a processor may join a new virtual partition. Let p denote any processor and let v and w denote arbitrary virtual partitions. Let join(p,v) denote the event where p changes its local state to indicate that it is currently assigned to v. Similarly, let depart(p,v) denote the event of p changing its local state to indicate that it is no longer assigned to v. Join and departure events, in addition to physical read and write events, are recorded in the execution of transactions. The third property is
(S3) Serializability of virtual partitions: For any execution E produced by the replicated data management protocol, the set of virtual partition identifiers occurring in E can be totally ordered by a relation << which satisfies the condition: if v<<w and p ∈ members(v) ∩ view(members(w)), then depart(p,v) happens-before join(q,w) for any q ∈ members(w).
This property says that before a new partition w can be formed, every processor p in the view of processors in w must first depart from its current virtual partition. Note that the property does not require p to eventually join w.
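The per-processor state that properties S1-S3 constrain is small. The following Python sketch of it is our illustration (not the chapter's); the field names mirror the functions vp, defview, and view defined above, and the join/depart methods correspond to the join and depart events.

```python
# Minimal sketch of a processor's partition-related local state.
# Hypothetical class; field names mirror vp, defview, and view.

class ProcessorState:
    def __init__(self, pid):
        self.pid = pid
        self.vp = None        # current virtual partition id, or None
        self.view = set()     # current estimate of reachable processors

    @property
    def defview(self):
        # defview(p) is true iff p is assigned to some virtual partition
        return self.vp is not None

    def join(self, v, common_view):
        # S1: every member of v adopts the same common view.
        # S2: the adopted view must contain the processor itself.
        assert self.pid in common_view, "S2 (reflexivity) violated"
        self.vp, self.view = v, set(common_view)

    def depart(self):
        # Departure is unilateral: no communication is required.
        self.vp = None

p = ProcessorState("A")
p.join(v=(1, "A"), common_view={"A", "B"})
assert p.defview and p.view == {"A", "B"}
p.depart()
assert not p.defview
```

S3, in contrast, is a global property over whole executions and cannot be checked locally; it is what the creation protocol of Section 5 enforces.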
S3 prevents anomalies of the type illustrated in Example 2, where a processor (B) detects changes in the can-communicate relation, adopts a new view, and begins processing transactions before another processor (C) in the same communication cluster detects the changes. Any ordering over the virtual partitions occurring in a given execution that satisfies condition S3 is called a legal creation order for the virtual partitions. In general, there will be many legal creation orders for a given execution. Comparing requirements S1-S3 to requirements A2-A3, we find that the former ones are considerably weaker than the latter. All that S1-S3 ensure is that processors agree on a common view when in a virtual partition, and that joining and departing from such virtual partitions is done in a disciplined way. A3 requires that processors' views reflect the current can-communicate relation. S1-S3, on the other hand, do not imply any relationship between the processors' views and the can-communicate relation. A3 implies that a processor will always know the members of its communication cluster. In general, a processor will never know the set of processors in its virtual partition (but it will know a superset: the processors in its view). Under S1-S3, the views of processors in different virtual partitions may overlap; whereas, under A2, the views of processors in different clusters must be disjoint. Finally and most significantly, requirements S1-S3 are
attainable under realistic failure assumptions (see Section 5). Requirements A2-A3 are not.
The intuitive replica control rules mentioned previously for the "clean" failure environment can now be reformulated in terms of the weaker notion of a virtual partition. The first four rules are
(R1) Majority rule: A logical object l is accessible from a processor p assigned to a virtual partition only if a (possibly weighted) majority of copies of l reside on processors in view(p).
(R2) Read rule: Processor p implements the logical read of l by checking if l is accessible from it and, if so, sending a physical read request to any processor q ∈ view(p) ∩ copies(l). If q does not respond, then the physical read can be retried at another processor or the logical read can be aborted.
(R3) Write rule: Processor p implements the logical write of l by checking if l is accessible from it and, if so, sending physical write requests to all processors q ∈ view(p) ∩ copies(l) which are accessible and have copies of l. If any physical write request cannot be honored, the logical write is aborted.
(R4) Same partition rule: All physical operations carried out on account of a transaction t must be executed by processors belonging to the same virtual partition v. In this case we say that t executes in v.
Rules R1-R3 are straightforward interpretations of the simple replica control rules. Rule R4 expresses the communication restriction that is placed on virtual partitions. In particular, a processor p accepts a physical access request from processor q only if p and q are assigned to the same virtual partition (that is, if defview(p) and defview(q) and vp(p) = vp(q)). Observe that R4 requires a transaction to be aborted if any of the processors executing it joins a new partition. (In Section 6, we will discuss an optimization that avoids this most of the time.) The final rule concerns propagating the up-to-date values of logical objects to copies on processors that were previously in different partitions.
(R5) Partition initialization rule: Let p be a processor that has joined a new virtual partition v, and let l_p be a copy of a logical object l that is accessible in v.
The first operation on l_p must be either a write of l_p performed on behalf of a transaction executing in v, or a recover(l_p) operation that writes into l_p the most recent value of l written by a transaction executing in some virtual partition u such that u<<v.
R1), and a majority of l's copies is written by each logical write of l (by R3). The above properties can be used to design a correct replica control protocol.
Theorem 1. Let R be a replica control protocol obeying properties S1-S3 and R1-R5 and let C be a concurrency control protocol that ensures CP-serializability of physical operations. Any execution of transactions produced by R and C is one-copy serializable.
The proof of this theorem, which uses notions introduced in [BGb], is given in [ESC]. The proof actually proves a stronger result.
Theorem 1'. Let R and C be protocols as above. For every execution E produced by R and C, and every legal creation order << over the virtual partitions in E, there exists a serial execution Es, equivalent to E, with the property: if v<<w, then all transactions executing in v occur in Es before any transaction executing in w.
Hence, with regard to serializability, we can consider transactions to execute in an order consistent with a creation order of virtual partitions. That is to say, for an execution with a creation order <<, if transaction t1 executes in v and transaction t2 executes in w and v<<w, then we can consider t1 to "execute before" t2.
Although a replica control protocol satisfying S1-S3 and R1-R5 produces only one-copy serializable executions, the executions may exhibit a curious and, for some applications, undesirable property. Specifically, transactions may not read the most up-to-date (in real time) copies of logical objects. This can occur because views of different virtual partitions that exist simultaneously can overlap. Consequently, the same logical object can be simultaneously accessible in these partitions. Note that only one of these partitions will be able to write the object, since a logical write requires a majority of the copies to reside on processors actually in the same partition. However, a logical read is required only to access one copy, hence many partitions (with views that are actually a superset of their virtual partitions) may be able to read the object and, therefore, will be able to read out-of-date values. This phenomenon is not detectable by applications executing transactions since, by design, applications cannot send messages across partition boundaries. However, this might be detected by a user that moves from a processor in one partition to a processor in another. The capability of a processor to read stale data indicates that the processor's view no longer accurately reflects the can-communicate relation. This situation arises most often when a processor is slow to detect the occurrence of a failure. For most systems, such situations can be expected to be short-lived. If desired, a distributed system can periodically send probe messages in order to bound the staleness of the data (the next section discusses probing in more detail). Nonetheless, there appears to be no practical way to completely eliminate the reading of stale data when using the majority rule and read-one/write-all rules. On the other hand, as mentioned before, one-copy serializability is guaranteed in all cases.
5. THE REPLICA CONTROL PROTOCOL We now sketch a replica control protocol satisfying the properties presented in the previous section. A detailed description appears in [ESC]. We have chosen to emphasize clarity over performance; consequently, for any real implementation, a number of significant optimizations are possible. Some of these are discussed in the next section. The development of the replica control protocol in the previous section suggests an implementation consisting of two layers:
(1) the virtual partition management layer, which assigns processors to virtual partitions, and
(2) the replica management layer, which controls accessibility of logical objects, translates the logical operations issued by transactions into physical read/write operations, and ensures that the copies of an accessible logical object in a newly formed partition have the most up-to-date values.
The virtual partition management layer enforces rules S1-S3; whereas, the replica management layer enforces rules R1-R5. Rules S1-S3 and R1-R5 are sufficient for ensuring one-copy serializability. Nevertheless, S1-S3 is not by itself a satisfactory specification for a useful virtual partition management layer, since it does not require that the assignment of processors to virtual partitions mirror the current communication capabilities of a system. In fact, a trivial protocol that never assigns any processor to a virtual partition, or one that constantly assigns each processor to its own virtual partition, satisfies the specification S1-S3. Clearly, the availability of logical objects is influenced by how closely the views of the processors of a virtual partition mirror the current can-communicate relation. If a processor excludes from its view a processor with which it can communicate, then some logical objects may unnecessarily be deemed inaccessible by the majority rule (R1). If, on the other hand, a processor p includes in its view a processor q with which it cannot communicate, then p will not be able to write the logical objects with copies on q (because of the write-all rule). To ensure high logical data availability, we introduce a supplementary liveness constraint on the discrepancy between processor views and reality. Let us say that a failure or recovery affects a processor if it results in the deletion or addition of an edge incident to that processor in the communication graph. A failure or recovery is said to affect a set of processors if it affects any processor in the set.
The liveness constraint is
(L1) Let C be a nontrivial clique in the communication graph existing at time t. For an appropriate constant Δ, if no failures or recoveries affecting C occur during the time interval [t,t+Δ], then at t+Δ the view of each processor in C contains all processors in C.
The constant Δ depends, among other things, on the worst-case message transmission delay between any two processors.
5.1. The Virtual Partition Management Layer
Enforcement of rules S1-S3 lies principally with the virtual partition creation protocol, which is invoked whenever a processor suspects a change in the can-communicate relation. Enforcement of the liveness constraint lies principally with the probe protocol, which detects changes in communication capability through the use of periodic probe messages. When a change in communication capability is detected, the creation of a new virtual partition is attempted. A key aspect of the virtual partition creation protocol is the implementation and generation of virtual partition identifiers. These identifiers consist of two fields: a sequence number (seqno) and a processor identifier (pid). Virtual partition identifiers are totally ordered by the relation "<<": id1 << id2 iff (id1.seqno < id2.seqno) or (id1.seqno = id2.seqno and id1.pid < id2.pid).
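This ordering is simply lexicographic order on (seqno, pid) pairs, which Python's tuple comparison gives directly. The sketch below is our illustration; the concrete pid values are arbitrary.

```python
# The "<<" total order on virtual partition identifiers, represented as
# (seqno, pid) tuples. Python compares tuples lexicographically, which is
# exactly the order defined in the text.

def vpid_less(id1, id2):
    """id1 << id2: compare sequence numbers first; break ties by pid."""
    return id1 < id2

assert vpid_less((3, "A"), (4, "B"))       # smaller seqno wins
assert vpid_less((4, "A"), (4, "B"))       # equal seqno: pid breaks the tie
assert not vpid_less((5, "A"), (4, "B"))
```

Because pids are unique, any two distinct identifiers are comparable, so "<<" is a total order.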
INITIATOR:
    vpid := generate-new-vpid();
    for each processor p: send("invite", vpid) to p;
    new-view := {p : p replied "accept"};
    for each p replying "accept": send("join", vpid, new-view);

OTHERS:
    receive("invite", vpid) from initiator;
    if vpid > maximum-vpid-ever-seen then
        depart current virtual partition;
        reply("accept", vpid);
    receive("join", vpid, new-view) from initiator;
    if vpid > maximum-vpid-ever-seen then
        join vpid and adopt new-view;

Figure 3. The virtual partition creation protocol.
message, and they handle the case where several processors in the same cluster simultaneously attempt to establish new partitions. In the absence of additional failures, only the processor generating the highest numbered partition identifier will succeed. Furthermore, the protocol satisfies rules S1 through S3. Satisfaction of requirement S1 is ensured by the use of a single processor, the initiator, to determine the view of a virtual partition. Satisfaction of requirement S2 is easy to verify. Satisfaction of requirement S3 follows from the observation that the relation << defined over virtual partition identifiers is indeed a legal creation order. An execution of the protocol requires approximately 3δ time units, where δ is the maximum time to send, transmit, and receive a message from one processor to another. Changes in the can-communicate relation can be detected by the following naive probe protocol. At regular intervals, each processor sends a probe message to each of the other processors. A discrepancy between the can-communicate relation and a processor's view is detected whenever the processor does not receive a periodic message from a processor in its view or does receive a message from a processor not in its view. If probes are sent every π time units, then a discrepancy will be detected within π+2δ time units.
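The two-phase invite/join exchange of Figure 3 can be simulated in one process. The following Python sketch is ours, not the chapter's: message passing is replaced by direct calls, timeouts are ignored, and, because this sketch records the maximum identifier at invite time, the join-phase test becomes an equality check rather than the figure's strict comparison.

```python
# Single-process simulation of the Figure 3 creation protocol.
# Virtual partition ids are (seqno, pid) tuples; bigger tuple wins.

class Proc:
    def __init__(self, pid):
        self.pid = pid
        self.max_vpid = (0, "")   # maximum-vpid-ever-seen
        self.vp = None
        self.view = set()

    def on_invite(self, vpid):
        if vpid > self.max_vpid:
            self.max_vpid = vpid
            self.vp = None        # depart current virtual partition
            return "accept"
        return "reject"

    def on_join(self, vpid, new_view):
        if vpid == self.max_vpid:  # no higher invitation intervened
            self.vp, self.view = vpid, set(new_view)

def create_vp(initiator_pid, procs, seqno):
    vpid = (seqno, initiator_pid)            # generate-new-vpid()
    accepted = [p for p in procs if p.on_invite(vpid) == "accept"]
    new_view = {p.pid for p in accepted}
    for p in accepted:
        p.on_join(vpid, new_view)
    return new_view

procs = [Proc(x) for x in "ABC"]
assert create_vp("A", procs, seqno=1) == {"A", "B", "C"}
assert all(p.vp == (1, "A") for p in procs)
# A stale (lower-numbered) invitation is rejected by everyone.
assert create_vp("B", procs, seqno=0) == set()
```

Racing initiators resolve exactly as the text describes: the invitation carrying the highest (seqno, pid) identifier displaces the others.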
The implementation also satisfies the liveness condition (L1) when the parameter Δ is set to π+8δ. This value is computed as follows. After a clique C forms, it may take as long as 3δ time units for all ongoing instances of "Create-VP" to finish. It may take another π time units for the processors in C to send probes, and an additional 2δ to receive the acknowledgements. The results of probing may cause additional invocations of "Create-VP", and these will take 3δ to complete. Now if no failures or recoveries affecting C occur during any of this, then within π+8δ time units all processors in C have committed to the invocation of "Create-VP" generating the highest virtual partition identifier.
5.2. The Replica Management Layer
The implementation of the majority rule (R1) and the read-one/write-all rules (R2-R3) is straightforward and will not be discussed further. Rule R4, which requires that all physical operations of a transaction be executed in the same virtual partition, can be enforced by assigning a virtual partition identifier to a transaction when it enters the system, and by including this identifier in all request messages for physical operations. A processor ignores (or returns a "negative acknowledgement" to) requests with identifiers different than its own partition identifier. The remaining rule to be implemented is the partition initialization rule (R5). After a processor is assigned to a new virtual partition, it has to make sure that all physical copies accessible in the new virtual partition possess the most recent value assigned to the logical object they represent (rule R5). To determine the most recent value of a logical object, we will make use of the fact that the relation << is a legal creation order. From this and Theorem 1', it follows that the most recent write operation on a logical object l is executed in the highest numbered virtual partition among those partitions containing logical writes to l. For the purpose of identifying the value produced by the most recent write of l, each processor stores with its local copy of l the virtual partition identifier associated with the latest logical write of l. On each processor p, the partial functions (suitably initialized)
    value: L -> Value,
    date: L -> V,
yield the value and the most recent assignment date for the logical objects stored on p. That is, value(l) denotes the value of the local copy of the logical object l and date(l) denotes the virtual partition identifier (or logical date) current when the local copy of l was last updated. To recover the most recent value of an object l, a processor can request from all processors in its view the current value of date(l). The processor then requests one of the processors returning the highest date to send value(l), which is the most recent value of l. Between the time that a processor is assigned to a new virtual partition and the assignment of the most recent value to one of its copies, all reads and writes to the copy must be delayed.
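The recovery step just described is a poll-then-fetch. The Python below is our sketch of it; the date and value maps stand in for the per-processor functions of the same names, and the concrete dates are illustrative.

```python
# Sketch of recover(l_p): poll date(l) on all processors in the view,
# then fetch value(l) from a processor holding the highest logical date.

def recover(l, view, date, value):
    """Return the most recent value of l among the copies in `view`."""
    holders = [q for q in view if (l, q) in date]
    newest = max(holders, key=lambda q: date[(l, q)])   # highest vpid wins
    return value[(l, newest)]

# B's copy was last written in a later virtual partition (higher logical
# date, compared as a (seqno, pid) tuple), so its value is the one recovered.
date  = {("l", "A"): (2, "A"), ("l", "B"): (5, "C")}
value = {("l", "A"): "old",    ("l", "B"): "new"}
assert recover("l", {"A", "B"}, date, value) == "new"
```

Correctness rests on Theorem 1': since << is a legal creation order, the highest date really does identify the most recent logical write.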
6. OPTIMIZATIONS AND EXTENSIONS
The replica control protocol described in the previous section may be implemented in several ways that satisfy the properties S1-S3 and R1-R5. In this section we present optimizations for efficient implementations in an environment containing a large number of processors and large data objects. Specifically, we present an efficient way for updating "out-of-date" copies. We also present a generalization of our accessibility rules to allow, in general, reading of an object in more than just the partition it is being written in. Our first optimization takes advantage of the fact that all that rule R5 requires from a processor p upon joining virtual partition v is to bring its copy of an accessible object l "up-to-date" by reading a copy of l written in a virtual partition with the largest identifier less than v. In the simple implementation given, a processor p finds this copy by reading all copies on processors in its view. However, p can optimize its search for an up-to-date copy by making use of the recent partition assignment history of each processor in view(p). Let previous_v(q) denote the largest virtual partition less than v that q was a member of. The optimized search strategy is for p to consider the processors in view(p) in decreasing order of their previous_v values. The desired up-to-date value of l is found at a processor q such that:
    previous_v(q) = max{previous_v(r) | r ∈ view(p) and l was accessible in previous_v(r)}
Now, if processor p satisfies the role of q in the above condition, then p holds an up-to-date copy of l and no initialization for l_p is necessary. The values of previous_v(q) for all q ∈ view(v) can be collected by the initiator in the first phase of the protocol creating v, and this set of values can be distributed to all members of v in the second phase of the protocol at no extra cost in messages or time.
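The optimized search reduces to a maximum over the candidates that could have seen l. The Python below is our sketch; previous_v and the per-partition accessibility sets are illustrative inputs that the creation protocol would supply.

```python
# Sketch of the optimized search for an up-to-date copy: among processors
# in view(p) whose previous partition made l accessible, pick the one with
# the largest previous_v (compared as a (seqno, pid) tuple).

def find_up_to_date(l, p, view, previous_v, accessible_in):
    """Return a processor in view(p) holding an up-to-date copy of l."""
    candidates = [q for q in view[p]
                  if l in accessible_in[previous_v[q]]]
    return max(candidates, key=lambda q: previous_v[q])

previous_v    = {"A": (3, "A"), "B": (4, "B"), "C": (2, "C")}
accessible_in = {(3, "A"): {"l"}, (4, "B"): {"l"}, (2, "C"): set()}
# B was most recently in a partition where l was accessible, so B's copy
# is the one to read.
assert find_up_to_date("l", "A", {"A": {"A", "B", "C"}},
                       previous_v, accessible_in) == "B"
```

If the maximizing processor is p itself, no copying is needed at all, which is exactly the split-off special case discussed next.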
Further optimizations can be made where a subset of the members of a virtual partition v splits off and forms a new virtual partition w. This occurs frequently, for example, when some members of v detect the failure of another member of v. In this case, all members of w contain the up-to-date copies of all accessible objects. Consequently, no initialization is required. This special case can be detected using the values of previous_w; specifically, in this case previous_w(p) = v for every p that is a member of w. Another optimization can be made in the partition initialization protocol of Section 5, where a copy is brought up-to-date by reading another copy in its entirety. If the object is large, a more economical approach is to apply to the out-of-date copy all of the writes that it missed. This, however, requires an efficient procedure for specifying and extracting the values of the missed writes. Specification of the missing writes is made easy by applying Theorem 1'. Consider a copy of l residing on a processor currently assigned to partition w, but that was last written in partition v. Theorem 1' tells us that the copy missed the writes of transactions executing in virtual partitions with identifiers greater than v and less than or equal to w. Thus, this out-of-date copy can be brought up-to-date efficiently if the system can
support a query on an arbitrary copy of the form: retrieve (the values of) all physical writes on copy c by any transaction executing in u such that v<<u<<w. Our second extension generalizes the accessibility rule (R1) by replacing the majority with read and write quorums: a logical object l is read-accessible (respectively, write-accessible) from a processor p only if the copies of l residing on processors in view(p) constitute a read (respectively, write) quorum, where the quorums are chosen so that read quorum + write quorum > total number of copies of l. In addition, it is often desirable that writes to l are not performed concurrently by processors with different views, although this is not required to ensure serializability. Choosing the write quorum to be greater than a majority of the copies precludes concurrent writes. Even for the modified rules of the replica control algorithm, a logical read is translated into exactly one read and a logical write into physical writes on all copies in a partition. Hence quorums are used to control accessibility but, other than that, are not used in translating from logical operations to physical operations. This differs significantly from
their use in other algorithms (including their use in [G]), where quorums define the minimum number of copies that must actually be accessed in order to perform an operation.
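The log-based catch-up described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes each physical write on a copy is logged together with the identifier u of the virtual partition in which the writing transaction executed, and that partition identifiers are totally ordered. The log layout and field names are assumptions.

```python
# Sketch of the log-based catch-up described above (illustrative only).
# Each log entry records the virtual partition u in which the write occurred.

def missed_writes(log, v, w):
    """Return the writes a copy last written in partition v missed,
    i.e. those performed in partitions u with v < u <= w (Theorem 1')."""
    return [entry for entry in log if v < entry["partition"] <= w]

def bring_up_to_date(copy, log, v, w):
    # Apply the missed writes in partition order; by Theorem 1' these are
    # exactly the writes the out-of-date copy is missing.
    for entry in sorted(missed_writes(log, v, w), key=lambda e: e["partition"]):
        copy[entry["item"]] = entry["value"]
    return copy

log = [
    {"partition": 3, "item": "x", "value": 10},
    {"partition": 5, "item": "x", "value": 20},
    {"partition": 6, "item": "y", "value": 7},
]
copy = {"x": 5, "y": 0}                 # copy last written in partition v = 3
bring_up_to_date(copy, log, v=3, w=6)   # applies only the writes from partitions 5 and 6
```

For a large object this transfers only the missed writes rather than the entire copy, which is the economy the optimization is after.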
7. CONCLUSION
We have presented a fault-tolerant replica control protocol that ensures one-copy serializability when combined with a standard concurrency control protocol. The protocol tolerates omission and performance failures, even if such failures disrupt the communication connectivity of the network. The implementation of logical read operations is efficient even in the presence of failures. The novelty of our protocol lies in the fact that the virtual partition management subprotocol makes omission and performance failures look like "clean" communication failures that partition the network. This allows us to introduce the notion of virtual partitions and to base our replica control protocol on the intuitive rules that (1) processors in a virtual partition can access a logical object if the partition contains a majority of the object's copies, and (2) logical operations are translated into physical operations on the copies within a virtual partition using the "read-one/write-all" rule. We identify a set of properties (S1-S3) that are satisfied by the virtual partition management protocol, and that are sufficient for our replica control protocol to ensure one-copy serializability of all transaction executions. The virtual partition protocol is itself fault tolerant, and can be easily implemented. One interesting result of our approach for dealing with failures is that protocols that have been designed for partition failures can be used in conjunction with our virtual partition management protocol in a more general and realistic processing environment. For example, many proposed data management schemes (e.g. [BGRCK, D, SW]) for partitioned systems require partition detection and, furthermore, assume A2 and A3. It can be shown that properties S1 through S3 are sufficient requirements for the correctness of these schemes. Therefore, these schemes can use the virtual partition management protocol to "detect" virtual partitions and operate on them as if they were real partitions.
ACKNOWLEDGMENTS
We would like to thank Phil Bernstein, Ken Birman, Shel Finkelstein, Jim Gray, Tommy Joseph, Bruce Lindsay, Gil Neiger, Thomas Raeuchle, B. Simons, T.K. Srikanth, and Irv Traiger for a number of useful comments.
REFERENCES
[BGa]
Bernstein, P., and Goodman, N., "Concurrency Control in Distributed Database Systems," ACM Computing Surveys 13, 2 (June 1981), 185-222.
[BGb]
Bernstein, P., and Goodman, N., "The Failure and Recovery Problem for Replicated Databases," Proc. 2nd ACM Symp. on Princ. of Distributed Computing, Montreal, Quebec, August 1983, 114-122.
[BGRCK] Blaustein, B.T., Garcia-Molina, H., Ries, D.R., Chilenskas, R.M., and Kaufman, C.W., "Maintaining Replicated Databases Even in the Presence of Network Partitions," EASCON, 1983. [BSR]
Bernstein, P., Shipman, D., and Rothnie, Jr., J., "Concurrency Control in a System for Distributed Databases (SDD-1)," ACM Transactions on Database Systems 5, 1 (March 1980), 18-51.
[C]
Cristian, F., "Correct and Robust Programs," IEEE Trans. on Software Engineering SE-10, 2 (March 1984), 163-174.
[CASD]
Cristian, F., Aghili, H., Strong, R., and Dolev, D., "Fault-Tolerant Atomic Broadcasts: From Simple Message Diffusion to Byzantine Agreement," Tech. Report, IBM Research, San Jose, 1984.
[D]
Davidson, S., "Optimism and Consistency in Partitioned Distributed Database Systems," ACM Transactions on Database Systems 9, 3 (September 1984), 456-482.
[EGLT]
Eswaran, K., Gray, J., Lorie, R., and Traiger, I., "The Notions of Consistency and Predicate Locks in a Database System," Comm. of the ACM 19, 11 (November 1976), 624-633.
[ES]
Eager, D., and Sevcik, K., "Achieving Robustness in Distributed Data Base Systems," ACM Transactions on Database Systems 8, 3 (September 1983), 354-381.
[ESC]
El Abbadi, A., Skeen, D., and Cristian, F., "An Efficient, Fault-Tolerant Protocol for Replicated Data Management," Proc. 4th ACM Symp. on Princ. of Database Systems, Portland, Oregon, March 1985, 215-229 (a revised version is to appear in ACM Transactions on Database Systems).
[GSCDFR] Goodman, N., Skeen, D., Chan, A., Dayal, U., Fox, S., and Ries, D., "A Recovery Algorithm for a Distributed Database System," Proc. 2nd ACM Symp. on Princ. of Database Systems, Atlanta, Georgia, March 1983, 8-15.
[G]
Gifford, D., "Weighted Voting for Replicated Data," Proc. of the 7th Symposium on Operating Systems Principles, Dec. 1979.
[GMBLLPPT] Gray, J., McJones, P., Blasgen, M., Lindsay, B., Lorie, R., Price, T., Putzolu,
F., and Traiger, I., "The Recovery Manager of the System R Database Manager," ACM Computing Surveys 13, 2 (June 1981), 223-242.
[H]
Hadzilacos, V., "Issues of Fault Tolerance in Concurrent Computations," Tech. Report 11-84, Harvard University, Center for Research in Computing Technology, Cambridge, Massachusetts (June 1984).
[KR]
Kung, H., and Robinson, J., "On Optimistic Methods for Concurrency Control," ACM Transactions on Database Systems 6, 2 (June 1981), 213-226.
[L]
Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. of the ACM 21, 7 (July 1978), 558-565.
[P]
Papadimitriou, C.H., "Serializability of Concurrent Database Updates," Journal of the ACM 26, 4 (October 1979), 631-653.
[SW]
Skeen, D., and Wright, D., "Increasing Availability in Partitioned Database Systems," Proc. 3rd ACM Symp. on Princ. of Database Systems, Waterloo, Canada, April 1984, 290-299. Also TR 83-581, Dept. of Computer Science, Cornell University, Ithaca, NY 14853.
[T]
Thomas, R.H., "A Majority Consensus Approach to Concurrency Control for Multiple Copy Data Bases," ACM Transactions on Database Systems 4, 2 (June 1979) 180-209.
[TGGL]
Traiger, I.L., Gray, J.N., Galtieri, C.A., and Lindsay, B.G., "Transactions and Consistency in Distributed Database Systems," ACM Transactions on Database Systems 7, 3 (September 1982), 323-342.
Arpanet Routing

Stephen Cohn
Bolt Beranek and Newman
10 Fawcett St.
Cambridge, MA 02138
Cohn: I'm going to talk a little bit about the routing algorithm on the ARPANET, which has been running now for about 7 years. The three components of routing are: delay measurement on links, information dissemination throughout the network, and the routing calculation itself. I'll also talk about some of the failures on the ARPANET and tell you, in detail, exactly what happened when the ARPANET crashed. I would then like to focus on a few of the failures that have occurred on various instances of the network in the past several years. I'd also like to talk briefly about the ARPANET's origins, evolution, and progeny. The ARPANET is no longer a single network; in fact, there are a number of ARPANET-like networks around the world. As a note, I suppose that most of you who have used the ARPANET for mail and telnet have also recognized failures more frequently than the kinds of things I'm going to be talking about. What you recognize as failures are really "features" of the current routing algorithm. We understood those features when it was built. Now that the ARPANET is sometimes overloaded, you see the consequences of that. Most of the algorithms operating in the ARPANET were developed for the ARPANET project and are not based on the formal agreement algorithms that are being discussed here, although they address many similar problems. The ARPANET started in 1969. It was an experimental wide-area packet-switched network using 50 kilobit/second analog circuits and Honeywell 516 processors. It was taken over by Bolt, Beranek, and Newman, and more recently, by BBN Communications. The ARPANET has remained fairly homogeneous: mostly 56 kb/s circuits. The 516 processors have been replaced by C30s, which are manufactured by BBN. A principal difference between the C30s and the Honeywell 516s is that the C30s rarely lose memory bits. The initial ARPANET algorithms were tailored to handle bit errors that were introduced by the Honeywell 516.
By 1983 the ARPANET had become largely a military operational network. So, in 1984, the ARPANET was physically split into two separate backbones. As of the date of this presentation, the ARPANET is at approximately 50 nodes and MILNET
is growing up towards 160 nodes. In addition to numerous other networks all using the same ARPANET architecture throughout BBN and various other government agencies, BBN has installed a number of commercial networks that use the same sub-network technology, but a different interface protocol (CCITT X.25) than the DOD standard. The ARPANET has traffic matrices that are fairly well populated. This is in contrast to some of the commercial networks, which have single or multiple main processor sites and fairly sparse traffic matrices. However, the same routing algorithms have been running on all of these networks. We had certain objectives for our first routing algorithm. Our principal objective was to minimize the individual end-to-end packet delay for transactions. Initially, most transactions were for single-character packets (requiring remote echo to support remote terminals), so this was important. Additionally, the ARPANET was intended to be highly responsive to changing delay patterns in the network. It was theoretically designed to route around local congestion. There is no active congestion control mechanism in the ARPANET networks except for the end-to-end flow control. For example, when flows converge in the middle of the network, packets may be dropped when all CPU and buffer resources are exhausted. Given the poor quality of communications circuits, a lot of attention was paid to being able to recover easily from partitions and line outages. There have been many partitions; it's something that the network responds to trivially. Our final objective was, of course, to provide reliable performance. "New" routing became operational in 1979, and it's been running for about seven years. It's autonomous, which means that there's no mechanism for controlling the routing algorithm. As Jim Gray has said, operational control of network computer systems is one of the principal sources of trouble.
In this case, there's no mechanism for operators to get into the guts of the network, and I think that's been one of the factors which has made the ARPANET as reliable as it has been. There is also no way to configure it. It configures itself to the extent that it needs to, with the exception that the algorithm needs to know the propagation delays and line speeds of the network lines. If these are configured incorrectly, strange things will occur. Typically that hasn't been a problem, because these parameters are computed automatically by measurement facilities deployed occasionally by network operators. The ARPANET operates continuously - the networks never shut down. I'll discuss the catastrophic event which crashed the network a little later. The ARPANET continues to operate during topology changes. Individual nodes can be taken in and out of service and are taken down for software upgrades, but the network itself has been running. The ARPANET has run through the whole period without being taken down for software upgrades, changes in the topology, and partitions. How does the ARPANET routing work? Basically, there are three components.
First, there is the delay measurement of each link in the network, performed by the node at the source. Each link is considered to be simplex although, in fact, the Telco circuits are duplex. Second, those delay measurements are processed and disseminated by each node at appropriate times. The disseminated updates are then received by all nodes in the network and stored into their databases. Third, there is the routing calculation based upon the delay information. How does the delay measurement work? For each packet sent on the line, the node at the sending site has tabled the propagation and transmission time for a packet of that size. Queuing delays are measured from the difference in time stamps between when the packet is received and when it is last sent. That is, it may take several transmissions to actually get the packet across the line; you consider those retransmissions in determining the delay. These delays are accumulated per line and averaged over a 10 second interval. The minimum interval at which you can send an update is 10 seconds, which is also the averaging interval. The purpose here is to limit the amount of CPU time that can be exhausted in the entire network as a result of a single node issuing updates. The maximum value between update transmissions was actually specified at 60 seconds; however, it was implemented at 50 seconds. The purpose here is to make sure that, during the course of bringing a new link into service, every node issues an update. It takes a minute to bring up a link; most of that time is devoted to measuring the quality of the link during that minute. This ensures that all network nodes have consistent knowledge of the network topology before any start passing traffic across the new link. This procedure avoids forming routing loops. In between the minimum and maximum limits, there's a threshold decay scheme that reports large changes quickly and small changes at the 60 second interval.
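The reporting policy just described can be sketched roughly as follows. The 10-second averaging interval, the roughly 60-second maximum, and the decaying threshold come from the talk; the concrete threshold values and the decay step are illustrative assumptions, not BBN's actual parameters.

```python
# Illustrative sketch of the update-reporting policy described above.
# The 10/60-second limits and the decaying threshold are from the talk;
# the specific INITIAL_THRESHOLD and DECAY values are made-up assumptions.

class LinkReporter:
    MIN_INTERVAL = 10       # seconds: averaging interval, minimum gap between updates
    MAX_INTERVAL = 60       # seconds: an update goes out at least this often
    INITIAL_THRESHOLD = 64  # ms: assumed initial "large change" threshold
    DECAY = 12              # ms: assumed drop in the threshold each interval

    def __init__(self):
        self.last_reported_delay = 0
        self.since_last_update = 0
        self.threshold = self.INITIAL_THRESHOLD

    def end_of_interval(self, avg_delay_ms):
        """Called once per 10-second averaging interval; returns True if an
        update should be flooded now."""
        self.since_last_update += self.MIN_INTERVAL
        change = abs(avg_delay_ms - self.last_reported_delay)
        if change >= self.threshold or self.since_last_update >= self.MAX_INTERVAL:
            self.last_reported_delay = avg_delay_ms
            self.since_last_update = 0
            self.threshold = self.INITIAL_THRESHOLD
            return True
        # Decay the threshold so ever-smaller changes eventually get reported.
        self.threshold = max(0, self.threshold - self.DECAY)
        return False

r = LinkReporter()
print(r.end_of_interval(100))  # True: a large delay swing is reported at once
```

A large swing is reported at the next 10-second boundary, a small persistent change is reported once the threshold decays down to it, and in any case an update goes out within about a minute.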
Of course, if there is a topology change, that change is immediately reported. From the perspective of reliability, the key part of the whole system is the dissemination of these updates. Basically, the routing calculation uses the shortest path first algorithm, which is attributed to Dijkstra. For the algorithm not to generate looping routes, it is essential that every node calculate its routes on the same database. Otherwise, it's possible for two neighbor nodes to think that the other is the shortest path to a particular destination. To make sure the databases are uniform, we use "flooding" to reliably distribute the databases. I gather other people call this diffusion or broadcast. The updates are small and use little bandwidth, so on average flooding isn't an expensive procedure. It uses a fairly small part of the link bandwidth. To give you a sense of the overhead involved: for the 160-node MILNET, we're using less than 2% of the 50 kb/s line for those updates. Update packets are also handled at highest priority, so that we can flood across the whole network on the order of 100 to 125 milliseconds. A node receiving a database update on an incoming line echoes it on its incoming
line, forwards it on its outgoing lines, installs it in its database, and then passes the database off to the routing software. This allows the update itself to act as its own acknowledgment. Typically, the topologies of ARPANET-like networks have been degree three, so there are several different paths by which an update might reach any particular node; this enhances the reliability of the broadcast scheme. At each hop, there is some consistency checking on the update. We have positive acknowledgment, and we also have a timeout and retransmissions, so that in any case where the update is lost, we retransmit, and that retransmission occurs quite soon. The measurements that we've made on performance show that every node in the network gets an update quite reliably within this 125 millisecond figure. Of course, this figure depends on network diameter. One of the key aspects of update dissemination is to identify the most recent update. As any update makes previous updates obsolete, you can begin to see why this is so important. To keep the bandwidth requirements down, short sequence numbers are used. The other requirement on the system is that when a node has gone down and perhaps lost its memory of what update sequence number it was using, or what the update sequence numbers that other nodes in the network were using, when it rejoins the network it must accept updates from other nodes and they must accept its updates. To accomplish this, updates age while they are sitting in processors. This is done so that after a minute any update received is fresher than the updates in the database. Thus, if a node restarts or a partition is resolved, a node will accept all subsequent updates that flow to it. Q: Could you tell us what you keep in the routing tables? A: The data that is kept is the source node, destination node, and the delay. It is stored by link, so there's an entry in the table for each link. Consider node 1, which is connected to node 2 and to node 3.
Node 1 sends an update concerning the delay on links 2 and 3. Its update contains only that information. The tables in all other nodes eventually get these delay values. As for the size of this table, we're considering designing networks in the range of 1000 nodes having degree 4. The size of the database needed to store this is small as compared to the status structure required for maintaining connection information. At the moment, neither the CPU processing nor the database size is a real concern. The real concern is how much link bandwidth is consumed by passing the updates around. Q: What happens if you increase the number of nodes on the network to more like a thousand? It would seem the bandwidth for propagating updates goes way up. A: Yes, it looks like it would increase to more like 8 or 10 kb/s. There are a number of solutions that we are considering. One is a hierarchical routing scheme which is like the present system at each level of the hierarchy. Another is just slowing the present algorithm down. It's not clear that updates every 50 seconds are necessary. The algorithm tends to oscillate when it's overloaded. So, the other possibility is to look at just the physical properties as opposed to the measured delay and slow the whole thing down. That's an easy way to get a factor of 5 or 10 and send an update
every 10 minutes and whenever anything important changes. Now, as to packet routing, I won't have time to go into the routing algorithm per se. However, a node looks up the destination in the routing table and determines on which modem to send it out. To work, this depends upon the information calculated by nodes along the path being consistent. That is, if node one shows that node two is the next node on the route to node six, node two makes a decision about its next hop based on the same database. Now, it's clear that there is transitory looping in the system, but our measurements show it only at the 1% level. The reason why there is little looping is that round-trip times are about 200 milliseconds. (This number includes the time for the acknowledgment.) Because the routing tree changes much less often, the routing tables become consistent quickly enough. Q: Did anybody prove that the loops don't last for more than a small amount of time? A: I think there was a good deal of analysis going into that. We have certainly measured the looping and we have seen only limited looping. Q: Do you use a hop count to prevent looping? A: We don't have a hop count, and it hasn't been necessary. I'd like to now discuss the performance and failures of the network. I'm talking about seven years of service, beginning in May of 1979, when this particular routing algorithm began service. I've counted the networks having at least twenty-five nodes with which I am familiar, and I came up with about 22 network years of service. Because the algorithm does not have human control, there is obviously no controller intervention. It has delivered reliable operation without reconfiguration because you can't configure it. It has worked with a wide variety of different kinds of traffic flows. In particular, the commercial networks tend to have one or two database activities to or from which most traffic emanates. The government traffic tends to be much more distributed.
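The hop-by-hop forwarding just described can be sketched generically. This is a standard shortest-path-first next-hop computation, not BBN's actual code, and the toy topology is made up; the point is that nodes running the calculation over the same link database produce consistent next hops along a path.

```python
# Generic sketch of the per-node routing calculation described above: each
# node runs Dijkstra's shortest-path-first algorithm over the SAME link
# database, then forwards by looking up only the next hop. Topology is made up.
import heapq

def next_hops(links, source):
    """links: {(a, b): delay} for each directed link. Returns, for every
    reachable destination, the neighbor of `source` to forward to."""
    adj = {}
    for (a, b), d in links.items():
        adj.setdefault(a, []).append((b, d))
    dist, hop = {source: 0}, {}
    pq = [(0, source, 0)]  # (distance, node, first hop out of source)
    while pq:
        d, node, first = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # The first hop out of `source` is inherited along the path.
                hop[nbr] = nbr if node == source else first
                heapq.heappush(pq, (nd, nbr, hop[nbr]))
    return hop

# Toy 4-node network; because every node computes over the same database,
# node 1 forwards traffic for node 4 to node 2, and node 2 agrees that its
# own next hop toward node 4 is node 4 itself.
links = {(1, 2): 5, (2, 1): 5, (1, 3): 10, (3, 1): 10,
         (2, 4): 5, (4, 2): 5, (3, 4): 5, (4, 3): 5}
print(next_hops(links, 1))  # -> {2: 2, 3: 3, 4: 2}
```

The transitory looping mentioned above corresponds to the short interval in which two nodes' databases differ before an update has flooded to both.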
We have observed four different kinds of routing failures. The first is a network-wide routing catastrophe. The second is that bad updates have caused multiple processors to halt. The third is degraded network performance due to overload. I would describe this overload as (an unfortunate) feature of the algorithm. The fourth was a naming problem. Obviously, for routing of this variety to work, each processor has to be uniquely named, and once this assumption was violated. Before I was at BBN, there was a case in which the commercial network and the ARPANET were connected together and the result was pandemonium. One particular day goes down in history, at least for us. We refer to it as "Black Monday." It occurred on 27 October 1980. The entire ARPANET became unusable for about four hours. What happened? How quickly did we recover? It took an hour for the developers to figure out exactly what was happening and then three hours to get control over the network. It turned out that all the processors
were devoted to routing. At that time, the system was strictly priority scheduled, and routing was a higher priority process than the processes that dispatched and forwarded packets. What had happened was a runaway in the number of update packets. Each such packet resulted in additional CPU usage, and there was nothing in the scheduler to prevent the routing calculation from using up all the CPU. What triggered this? It all began with a Honeywell 516 that was having some memory problems. In fact, it was having trouble communicating with its neighbor. It ended up generating a series of updates that had invalid sequence numbers, and there was no internal protection for the sequence counter. There were three sequence numbers and their corresponding updates simultaneously active in the network at the same time. By the way, all nodes periodically checksum their code. If any failure in the checksum of the routing code is detected, a node will crash. The idea is that it's more important to protect the network than a node. And any such peculiarity would suggest that something is seriously broken. But this type of checksumming does not detect this problem. You must understand how to determine the relative age of an update. In this case we have a sequence number space of 64 (6 bits), and half the window is acceptable at any particular time. We ended up with three updates having sequence numbers 8, 40, and 44. When one processed 8, packets in the range of 9-40 were acceptable. Hence 40 was acceptable, and this increased the range to include 44. When one then processed 44, 8 was acceptable. Thought had been given to protection against two sequence numbers being outstanding simultaneously, but there was no protection against three. You notice if you look at the binary representations of these three values that what probably happened was that the original sequence number was 44 and we lost one or two bits to get these other two sequence numbers.
That was the characteristic way that the Honeywell 516's core memory failed. Q: But how often are updates generated? A: They are only generated every ten seconds. But what went haywire was that a node generated these three updates, and once generated, they were flooded. No testing beyond the comparison of the sequence numbers that I described above was done before the flooding process continued. It's very important to do as little as possible before getting those packets across the network. Why didn't any of the checksums help? Link checksums only correct communication failures, and the integrity of the stored programs was ok. This was corruption of the internal memory. Interestingly enough, the node that actually generated these sequence numbers crashed a few seconds later; it wasn't on the network at all. The reason it took a long time (4 hours) to fix this problem was that while it was possible to take down one node, isolate it, and fix it, it immediately got the sequence of updates again as soon as it was reconnected to the network. Most of these nodes were at unattended sites, so they had to be controlled via
the network. The network was very slow because so much of the CPU and buffer resources were being consumed by the runaway update dissemination activity. The ultimate fix was to install a software patch, slowly node by node, that told the nodes to ignore updates from this particular node number. Unfortunately, it took three hours to do that because it was very hard to get a node to respond to the control packets. Needless to say, this was all very embarrassing. After this occurred, we paid attention to fixing this problem so it couldn't occur again. The principal fix here was simply to halt if you got more than N updates on your queue. Under typical operations you never have a lot of routing updates, since you're dealing with them in a high priority task. This type of solution, halting before flooding, has worked in the case of other calamities. Q: Do sequence numbers identify a node? A: The updates for every node have a unique sequence number space. Every node maintains a sequence number for each other node in the network. Q: If that is the case, why not use synchronized time? A: We don't have a notion of synchronized time in the nodes. There is also a class of events that have occurred several times; these were caused by link hardware that corrupted packets, combined with a microcode bug that caused the hardware CRC to be ignored in the instance that the clock disappeared. It wasn't that all packets in the network were not being checked; just in that particular instance. There was also a software check applied on top of the CRC, so that there was still fairly good protection, but we knew we were running with some vulnerability. Well, occasionally one type of control packet, formatted a lot like routing updates except for a different code, was corrupted and appeared to be an update packet.
But the software is designed to do some checking on those packets and identify them as being bogus, primarily because they had some non-zero field that should have been zero, and they were somewhat the wrong length. Those nodes were designed to halt upon detecting this type of inconsistency. Unfortunately, the original construction of the software resulted in flooding first, followed by checking later. And so, what happened was that you would flood the bad update and then crash! So that was fixed, by checking for some consistencies before flooding and by ignoring bad updates. This was an interesting movement away from the fail-fast strategy to one of tolerating the errors. Partially, we did this because the hardware environment had changed; we were dealing with C30s and not Honeywell 516s, and we had many fewer of the faulty computations that had caused the original design paranoia in the routing software. Due to lack of time, I will skip discussing congestion instabilities; that is, the inability of the routing algorithm to deal with overloads well. As I mentioned, duplicate node number conflicts are another example of the failures that have occurred.
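The check-before-flood ordering adopted after these incidents can be sketched as follows. The packet fields, the expected length, and the specific validity checks are illustrative assumptions, not BBN's actual packet format; what matters is the ordering (validate, then flood) and the shift from halting to dropping bad updates.

```python
# Illustrative sketch of the "check before flooding" fix described above.
# Field names and validity rules are made up; the point is the ordering:
# validate first, then flood, and ignore (rather than halt on) bad updates.

EXPECTED_LENGTH = 16  # assumed fixed update-payload length

def valid_update(pkt):
    """Cheap consistency checks run BEFORE the packet is re-flooded."""
    return (pkt.get("must_be_zero") == 0
            and len(pkt.get("payload", b"")) == EXPECTED_LENGTH)

def handle_update(pkt, outgoing_lines, database):
    if not valid_update(pkt):
        return []                   # tolerate: drop the bogus update, keep running
    database[pkt["node"]] = pkt["payload"]
    return list(outgoing_lines)     # the lines the update is flooded on

db = {}
good = {"node": 7, "must_be_zero": 0, "payload": b"x" * 16}
bad  = {"node": 7, "must_be_zero": 3, "payload": b"x" * 15}
print(handle_update(good, ["line-a", "line-b"], db))  # flooded on both lines
print(handle_update(bad, ["line-a", "line-b"], db))   # [] - dropped, no crash
```

Under the original fail-fast design, `handle_update` would instead have flooded first and halted on the failed check, propagating the bad update before dying.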
If a node is trying to come up on the network, and it receives an update from a node that has the same name, it will stay in its loader and not come up. However, if you have a partitioned network which has nodes of the same name and you connect that network, then the nodes will each detect inconsistent updates when they do their own calculation, and their response is to halt. And this has actually occurred; it's easy to fix. We've had at least two instances of connecting together two networks and having them crash. I think the right answer is to have a network itself be named and to not allow different networks to connect. I doubt any routing schemes work well without unique names for distinct nodes. Q: Can you say something about how frequently the network partitions? A: The most notable partition was as we were approaching the physical split between ARPANET and MILNET. Both networks were on the order of 50 nodes, and there were only two or three lines between them. As we were snipping lines to slowly sever the two networks, there was an accidental partition. The consequence was that all connections that were going across the partition failed. But we have not paid a lot of attention to measuring the occurrence of partitions, because the algorithm just breezes through them. Q: How frequent are partitions? A: Large partitions aren't frequent, because line performance is fairly good - at the ninety-nine percent level. Also, the networks are typically at least degree 2, and often 3. A current operational problem is that telephone circuits in Italy are questionable, and there are two nodes with only a single connection between them. We found there that the algorithm that we use to decide whether a line was acceptable or not was a little bit too stringent; the result was that it was rather difficult as a user to read mail on a mail host there. On the other hand, I suppose there are quite a few partitions on the small scale, where two or three nodes can partition.
I would say there's probably a partition every week. Q: How do you detect partitions? A: The monitoring center, which doesn't get involved at all in day-to-day operations, has a full database, and it can tell which nodes it can reach and which nodes it cannot reach, and thus it knows about some partitions. Q: Can you contrast the new algorithm to the old routing algorithm? A: Basically, the old algorithm was a distributed computation of the routing delays, whereas the new algorithm distributes the database and does a local computation of the routes. The old protocol had some real problems. It was very hard to break large loops. In addition, the algorithm's costs grow as N-squared, because of tables that are passed around the network. I think it's generally believed that this protocol is vastly superior to the old routing. Q: What are the average delays on a lightly loaded network? A: Right now on MILNET, which has an average circuit utilization on the order of 10 or 15 percent, there's very little queuing. Now, of course, the reason for
that is that MILNET has been overdesigned because there are some large military systems that are going to be coming on to it. But, in that environment the average round trip delay for the whole network is 200 milliseconds. Historically, that's low. Typically, before the split it was about 275.
ON THE RELATIONSHIP BETWEEN THE ATOMIC COMMITMENT AND CONSENSUS PROBLEMS
Vassos Hadzilacos
Department of Computer Science and Computer Systems Research Institute
University of Toronto
Toronto, Canada M5S 1A4
1. INTRODUCTION
A critical task in the execution of a transaction in a distributed database system is to ensure its consistent termination. The sites whose databases were updated by the transaction must agree on whether to commit it (in which case the transaction's updates will take effect at all these sites) or abort it (in which case none of the transaction's updates will take effect at any site). The consistency of the decisions (commit or abort) reached at different sites must be safeguarded even in the presence of various types of failures. This, in a nutshell, is the atomic commitment problem. A number of protocols have been devised to solve this problem including, notably, the two-phase commit protocol (Lampson and Sturgis [1976], Gray [1978]). From this sketchy description it is apparent that atomic commitment is a problem that involves the achievement of some sort of agreement in a fault-tolerant manner. Distributed consensus in its various formulations, the focus of voluminous investigation in recent years, is another problem that entails fault-tolerant agreement among processes. The purpose of this paper is two-fold. First, we offer a precise specification of the atomic commitment problem. Note the emphasis on the word "problem". We wish to specify, not some protocol that solves the problem, but the problem itself. The specification of the atomic commitment problem is given in the form commonly employed to specify the consensus problem and its variants, namely as a list of conditions that must be satisfied. This facilitates the second purpose of the paper, which is to investigate the relationship between the atomic commitment and consensus problems. We find that the problems, though closely related, are not equivalent, and that the theoretical work on the consensus problem has some interesting and important implications regarding atomic commitment.
The specification of atomic commitment and the study of its relationship to consensus are presented in the two remaining sections of the paper.
2. SPECIFICATION OF ATOMIC COMMITMENT
We assume the following model of computation. There is a collection of processes capable of communicating with each other by exchanging messages. The behaviour of each process is specified by a programme. In effect, this programme stipulates how the process should change its local state, what messages to send to other processes, and whether to accept messages from other processes. At any time a process is either up or down. While it is up, the process is following exactly the actions specified by its programme. While it is down, the process takes no actions at all. In the transition from being up to going down, the process' local state is destroyed except for a part, called the stable state.† The transition from up to down can be effected at any time. When this transition takes place, we say that a process failure occurred. The transition from down to up can also be taken at any time, but after coming up the process must first execute a special programme, called the recovery procedure, before it does anything else. The purpose of this procedure is to construct a state from which the process can continue executing its programme. This state must be constructed on the basis of the stable state, which is the only information available to the process when it recovers. When the transition from down to up takes place, we say that the process failure has been repaired. A process is correct if it has never been down; otherwise, it is faulty.‡ (Notice that, by this definition, a process may be both faulty and up.) In addition to process failures, we also admit the possibility of communication failures. When process p expects a message from process q, it sets a period of time, called the timeout period, which limits how long p will wait for the message. We say that a process p cannot communicate with process q if p does not receive a message it is expecting from q within the timeout period, as measured on p's clock.
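The up/down behaviour and the role of the stable state can be sketched as follows. This is an illustrative model only; the class, its method names and the dictionary representation are assumptions made for the sketch, not anything the paper prescribes.

```python
# A minimal sketch of the process model described above: volatile state is
# lost on failure, and only a "stable state" survives, from which the
# recovery procedure rebuilds a usable state.

class Process:
    def __init__(self):
        self.up = True
        self.volatile = {}   # destroyed on failure
        self.stable = {}     # survives failure (e.g. saved on disk)

    def write_stable(self, key, value):
        # In practice this would force the value to non-volatile storage.
        self.stable[key] = value

    def fail(self):
        # Transition from up to down: the local state is destroyed
        # except for the stable part.
        self.up = False
        self.volatile = {}

    def recover(self):
        # The recovery procedure: reconstruct a state from the stable
        # state alone, then resume executing the programme.
        self.volatile = dict(self.stable)
        self.up = True

p = Process()
p.volatile["step"] = 3
p.write_stable("decision", "Undefined")
p.fail()
assert p.volatile == {}                        # volatile state destroyed
p.recover()
assert p.volatile["decision"] == "Undefined"   # rebuilt from stable state
```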
A communication failure occurs when two processes are both up and one is unable to communicate with the other. Note the caveat "are both up": the inability of p to communicate with q because p or q is down counts as a process, not a communication, failure. Communication failures may be attributed to a variety of causes: network partitions, unanticipated message delays, fast clocks measuring timeout periods, etc. We do not care what the cause is; what characterises a communication failure is the effect, namely the inability of processes that are up to communicate. We say that a communication failure is repaired if at some earlier time p could

† In practical terms, the stable state consists of the information saved by the process in non-volatile storage (e.g. disk).
‡ Given this definition of a "correct" process, one might be led to conclude that if processes are subject to failure with non-zero probability, eventually they will all become faulty. This is not true because, for our purposes, processes have limited lifetimes. That is, a process is a particular execution of a programme which, in the absence of failures, terminates.
not communicate with q, even though both were up, but at the present time p can communicate with q. As was mentioned above, each process measures the timeout period on its own clock. A process sets a timeout period on the basis of certain assumptions such as how fast other processes are and how long it takes messages to traverse communication links. To the extent that these assumptions are valid and the process' clock is not too fast relative to some externally defined "absolute time", the process will succeed in receiving the messages it is waiting for within the timeout period it has set. The fact that processes' clocks must have some relationship to a common measure of time means that these clocks must, in some sense, be "synchronised". We shall not attempt to define "synchrony" in this paper, as this would require a long digression with no direct bearing on the subsequent discussion. As far as our model is concerned, violations of "synchrony" (for any reasonable definition of the term) will be manifested as communication failures. (For a discussion of synchrony in the context of the agreement and consensus problems, see Fischer et al. [1985], Dolev et al. [1983], Dwork et al. [1984].)

Given these assumptions on the model of computation, we can specify the atomic commitment problem as follows. We have a set of processes each of which has one input variable, which we shall call the "vote", and one output variable, which we shall call the "decision". The vote is initially set to one of two values: Yes or No. The decision has one of three values: Undefined, Commit or Abort. The decision is a write-once variable and its initial value is Undefined. If a process' decision is not Undefined, we say that the process has reached a decision. We say that a process decides to Commit (respectively Abort) if it sets its decision to Commit (respectively Abort).
The atomic commitment (AC) problem is to devise a protocol (i.e. define programmes for these processes) so that the following conditions are satisfied:
AC1 No two processes reach different decisions.
AC2 Commit is decided only if all votes are Yes.
AC3 If there are no failures and all votes are Yes, then all processes decide to Commit.
AC4 If all existing failures are repaired and no new failures occur for a sufficiently long period of time, then all processes will reach a decision.

Some discussion of these conditions is now in order. First, note that we do not require all processes to decide. This would be an unattainable goal because a process might go down and never recover. There is nothing we can require of such a process since, by definition, a process cannot do anything while it is down. Given that we can't require all processes to decide, a natural idea is to require that

NB All correct processes reach a decision.
Unfortunately, in the computational model we have described, NB is also unattainable; therefore, we do not make NB part of the AC specification. More specifically, it can be shown that for any AC protocol, communication failures can happen in such a manner that correct processes cannot reach a decision without risking a violation of AC1 relative to processes with which they are unable to communicate. This is a consequence of Gray's "Generals' Paradox" (cf. Gray [1978], Skeen [1982], Halpern and Moses [1986]). However, if we assume away communication failures from our model of computation so that only process failures may occur, there exist AC protocols that satisfy NB (cf. Skeen [1982], Dolev and Strong [1982], Lamport and Fischer [1982], Dwork and Skeen [1983], Hadzilacos [1984]). An AC protocol that satisfies NB is called a non-blocking protocol. AC4 is the condition that identifies the circumstances under which processes are required to reach a decision. Fortunately, this condition is attainable!

AC2 states that the decision to Commit must be reached by consensus. Notice that there is no similar requirement for Abort. This "asymmetry" between the two possible decision values (which, as we shall see in the next section, is not present in the consensus problem) is a reflection of the difference in their meaning. The decision to Commit implies the responsibility to carry out an action, namely, to record a transaction's updates in the database. Thus, for this decision to be reached, every process that will be required to carry out a piece of the action must be consulted to ensure that it is, in fact, in a position to carry it out.† The votes are used by processes to indicate their ability (vote = Yes) or inability (vote = No) to carry out the action. Thus, the decision to Commit may be reached only if all votes are Yes.
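The two-phase commit protocol cited in the introduction is the classic solution. The following failure-free sketch shows how its decision rule relates to AC1-AC3; the function and variable names are illustrative, and the timeouts and stable logging that the real protocol needs for AC4 are omitted.

```python
# A failure-free sketch of two-phase commit. The coordinator collects every
# participant's vote (phase 1) and broadcasts a single decision (phase 2),
# so all participants decide identically (AC1).

def two_phase_commit(votes):
    """votes: list of 'Yes'/'No' inputs, one per participant.
    Returns the common decision sent to all participants."""
    # Commit only if all votes are Yes (AC2); with no failures and
    # unanimous Yes votes this forces Commit (AC3). Otherwise Abort.
    return "Commit" if all(v == "Yes" for v in votes) else "Abort"

assert two_phase_commit(["Yes", "Yes", "Yes"]) == "Commit"  # AC3
assert two_phase_commit(["Yes", "No", "Yes"]) == "Abort"    # AC2
```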
By contrast, the decision to Abort engenders no responsibility of carrying out an action and can, therefore, be reached no matter what the votes are. Note that we do not require the converse of AC2; it is possible that all votes are Yes, yet the decision is to Abort. The reason for this becomes clear if we consider a scenario in which a process, say p, goes down before other processes have learned its vote, and all the other processes' votes are Yes. If we insisted on enforcing the converse of AC2, the other processes would have to wait for the recovery of p before any decision could be reached. Rather than waiting, possibly forever, since p might never recover, it is generally considered much preferable to allow the other processes to decide Abort, even though the decision to Commit was in principle possible (since p's vote might have been Yes as well). AC3 is a partial converse to AC2. It ensures that there are circumstances under which the decision to Commit must be reached, thereby excluding "solutions" in which all processes always decide to Abort.

† In practical terms, "being in a position to carry out the action" means that the process supervising the transaction's execution at a site has saved the transaction's updates at that site in its stable state, so that subsequent failures will not result in their loss.

3. ATOMIC COMMITMENT AND CONSENSUS COMPARED
We now turn to the question of how the atomic commitment problem, as specified previously, relates to the consensus problem. To facilitate this, let us translate the latter into the terms (votes, decisions, etc.) we used to specify AC. The problem is to devise a protocol that will guarantee the following two conditions:
A [Agreement] All non-faulty processes reach the same decision; and
V [Validity] If all non-faulty processes' votes are Yes then they will all decide to Commit; if all non-faulty processes' votes are No then they will all decide to Abort.

Two weaker versions of this problem which, in fact, are more relevant to our comparison, have also been studied in the literature. These are obtained by replacing V with either

WV [Weak Validity] If there are no failures then V holds; or
VWV [Very Weak Validity] Both Commit and Abort are possible decision values; i.e. there is an execution in which correct processes decide to Commit and an execution in which correct processes decide to Abort.

Satisfaction of A and V is the consensus problem. Satisfaction of A and WV is the weak consensus problem. Satisfaction of A and VWV is the very weak consensus problem. The main differences between AC and the various versions of consensus concern (a) the decisions reached by faulty processes; and (b) the strength of the conditions required.
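To make these conditions concrete, here is a small illustrative sketch (the names are assumed, not from the paper) that checks A and V against the outcome of one execution. Note that WV and VWV quantify over executions (failure-free runs, existence of runs) and so cannot be checked against a single outcome in this way.

```python
# 'votes' and 'decisions' map each non-faulty process to its vote/decision;
# faulty processes are simply absent, matching the consensus conditions,
# which say nothing about them.

def agreement(decisions):
    # A: all non-faulty processes reach the same decision.
    return len(set(decisions.values())) == 1

def validity(votes, decisions):
    # V: unanimous Yes forces Commit; unanimous No forces Abort.
    if all(v == "Yes" for v in votes.values()):
        return all(d == "Commit" for d in decisions.values())
    if all(v == "No" for v in votes.values()):
        return all(d == "Abort" for d in decisions.values())
    return True  # mixed votes: V places no constraint on the decision

votes = {"p": "Yes", "q": "Yes"}
decisions = {"p": "Commit", "q": "Commit"}
assert agreement(decisions) and validity(votes, decisions)

# With mixed votes, V allows either unanimous decision:
assert validity({"p": "Yes", "q": "No"}, {"p": "Abort", "q": "Abort"})
```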
Decisions of faulty processes

In the consensus problems the conditions that must be attained do not mention the votes or decisions of faulty processes. On the other hand, AC1 requires that all decisions be consistent, be they reached by correct or by faulty processes. Also, AC2 states that the decision to Commit is reached only if the votes of all processes, including faulty ones, are Yes. Consequently, it is perfectly reasonable to discuss consensus problems in a model that places no restrictions on the possible behaviour of faulty processes (known as the "arbitrary" or "malicious" or "Byzantine" failure model). By contrast, AC is unattainable in such a model. If failures are unrestricted, a faulty process can set its decision variable to an arbitrary value, in particular, a value different from the decision of other processes, contradicting AC1. In fact, AC is attainable only under the assumption that process failures are of the benign variety mentioned in the beginning of the previous section. That is, processes can fail by stopping (whereupon they lose their state, except for
the stable part) and can later recover. By contrast, in the consensus problems there is no such thing as "process recovery". Once a process has gone down, it has become faulty and its actions are therefore irrelevant as far as the consensus problem specification is concerned. This difference has its roots in the different computing paradigms that motivate the two problems. Consensus originated from a problem in real-time process control (cf. Pease et al. [1980], Wensley et al. [1978]). In this scenario process decisions are used to trigger some "real world action", e.g. open a valve, which must be carried out within strict deadlines; there is no time to wait for faulty processes to recover and reach a decision. Faulty processes are therefore ignored in the hope that there will be enough correct processes, so that using their decisions alone, the "real world action" will be correctly carried out. AC, on the other hand, emerged from the transaction processing paradigm. In this case, reaching inconsistent decisions means that the effects of a transaction will not be consistently recorded at all the sites affected by the transaction. This implies that the database itself would become inconsistent. Inconsistent decisions are unacceptable in this case. On the other hand, in the transaction processing paradigm, there are no critical real-time deadlines to be met. Therefore, delayed decisions by processes when failures occur are (grudgingly) accepted.
Strength of the conditions

The second major difference between AC and consensus concerns the strength of the required conditions. It is not difficult to see that conditions AC2, 3 and 4 imply WV. AC3 says that if there are no failures and all processes' votes are Yes, all processes will decide to Commit; AC2 and AC4 say that if there are no failures and some process' vote is No, all processes will decide to Abort. Thus if AC2, 3 and 4 hold, so does WV. It is noteworthy that the converse is not true. For instance, under WV it is legitimate to have correct processes decide to Commit when some processes' votes are No and others' are Yes, while this is not allowed under AC2.

Now, if AC1, possibly in combination with AC2-AC4, implied A, we would have that the AC conditions are stronger than those of weak consensus. Unfortunately, that's not quite so. The reason is that in weak consensus all correct processes are required to reach a decision; while in AC correct processes are not required to reach a decision unless all failures are repaired and no new failures occur (for sufficiently long). We'll get rid of this problem by assuming that the proviso under which processes are required to reach a decision in AC will eventually be true in any execution. More precisely, we shall assume that the following axiom holds:

NC [No-Catastrophe Axiom] Eventually, all process or communication failures are repaired and no new failures occur for a period of time δ, for any δ > 0.
For the transaction processing type of applications where AC is used, this is a realistic axiom. In fact, given the parameters of a particular system (such as maximum message delays and accuracy of processes' clocks) and a particular AC protocol, it will be sufficient that NC hold for some particular value of δ.† It is now immediate that AC1, AC4 and NC imply A: AC4 says that if failures are repaired and no new ones happen for sufficiently long (which NC ensures will eventually be the case) then all processes, and, in particular, all correct processes, will reach a decision; AC1 then says that these decisions will be consistent. We have therefore shown that the AC conditions are stronger than those of the weak consensus problem, provided that the No-Catastrophe Axiom holds. Consequently, all lower bound results that are known about the weak (or very weak) consensus problem are, a fortiori, applicable to AC. This is especially important in view of the fact that several lower bound results have been obtained about these problems.‡

A consensus protocol is necessarily non-blocking, since A implies NB. Like atomic commitment with NB, weak consensus is unattainable in the presence of communication failures. In hindsight, this may well be the reason why communication failures are routinely excluded from consideration in the agreement and consensus literature: otherwise, the problems addressed there become unsolvable.

ACKNOWLEDGMENTS

I thank Barbara Liskov and Barbara Simons for their comments on an earlier draft of this paper.
† NC is not reasonable for systems that are either too large or are made up of fairly unreliable components. In such systems it may well be that there is never a time during which there are no failures. In fact, NC is stronger than necessary. We could get by with the following, weaker axiom: [Finite Progress Axiom] Failures cannot forever prevent processes from making progress towards the execution of their programmes. In particular, a process cannot remain down forever; when it recovers it can execute at least one step (that will affect its stable state) before going down again; and a message need only be sent a finite number of times before it is received.
‡ In fairness, it should be pointed out that not all of these bounds are relevant, since some refer specifically to the arbitrary failure model which is far weaker than the failure model in which AC is attainable (cf. earlier discussion on this matter).
REFERENCES
1. Dolev D., C. Dwork and L. Stockmeyer [1983]. "On the minimal synchronism needed for distributed consensus". Proc. of the 24th Symp. on Foundations of Computer Science, pp. 393-402, 1983.
2. Dolev D. and H.R. Strong [1982]. "Distributed commit with bounded waiting". Proc. of the 2nd Symp. on Reliability in Distributed Software and Database Systems, pp. 53-60, 1982.
3. Dwork C., N.A. Lynch and L. Stockmeyer [1984]. "Consensus in the presence of partial synchrony". Proc. of the 3rd Annual ACM SIGACT-SIGOPS Symp. on Principles of Distributed Computing, pp. 103-118, 1984.
4. Dwork C. and D. Skeen [1983]. "The inherent cost of non-blocking commitment". Proc. of the 2nd Annual ACM SIGACT-SIGOPS Symp. on Principles of Distributed Computing, pp. 1-11, 1983.
5. Fischer M.J., N.A. Lynch and M.S. Paterson [1985]. "Impossibility of distributed consensus with one faulty process". Journal of the ACM, 32(2):374-382 (1985).
6. Gray J.N. [1978]. "Notes on database operating systems". In R. Bayer, R.M. Graham and G. Seegmuller (eds.), Operating Systems: An Advanced Course. Lecture Notes in Computer Science, vol. 60. Springer-Verlag, 1978.
7. Hadzilacos V. [1984]. Issues of fault tolerance in concurrent computations. PhD dissertation, Harvard University, 1984.
8. Halpern J.Y. and Y. Moses [1986]. "Knowledge and common knowledge in a distributed environment". IBM Research Report RJ 4421, 1986 (revised).
9. Lamport L. and M.J. Fischer [1982]. "Byzantine generals and transaction commit protocols". Computer Science Laboratory, SRI International, Op. 62, 1982.
10. Lampson B. and H. Sturgis [1976]. "Crash recovery in a distributed data storage system". Unpublished memorandum, Xerox PARC, 1976.
11. Pease M., R. Shostak and L. Lamport [1980]. "Reaching agreement in the presence of faults". Journal of the ACM, 27(2):228-234 (1980).
12. Skeen D. [1982]. Crash recovery in a distributed database system. PhD dissertation, University of California at Berkeley, 1982.
13. Wensley J.H., L. Lamport, J. Goldberg, M.W. Green, K.N. Levitt, P.M. Melliar-Smith, R.E. Shostak and C.B. Weinstock [1978]. "SIFT: The design and analysis of a fault-tolerant computer for aircraft control". Proc. of the IEEE, 66(10):1240-1255 (1978).
The August System
John Wensley
13710 Southwest Knaus Rd., Lake Oswego, OR 97037
Wensley: I'm here to talk about August Systems' computers. Our computers are very different from the ones we've been hearing about here. We make industrial control computers that are meant for nuclear power plants, wind tunnels, space shuttle ground facilities and other similar applications. Two types of machines have been delivered. The first, the 300, is hardly delivered any more. The second, the 330, is being delivered now and is similar to the 300. We have a new machine under development, and I'll talk about that a little later.

First of all, I'd like to talk about fault tolerance. What do we mean by a failure? Failure to us is one bit being wrong in one result. We typically have to concern ourselves that failure may drop a control rod into a reactor and shut it down. Every time that happens, it costs someone a million dollars. Failure can also mean that one bit goes wrong and crucial cooling water may be shut off. In that case a reactor could blow up. Again, it is very serious. We do not concern ourselves with availability. We typically have to satisfy failure rate requirements between 10^-4, which is relatively easy to satisfy in a small system, to 10^-14. The latter rate is something like a mean time between failures of a billion years. There is a misconception that a mean time between failures of a billion years implies a life of a billion years. What it means is that if you had ten million of them out there, there is a ten percent chance that one of them may fail in a year.

In a process control computer, the computing function itself is typically only 10% of the total system in terms of electronics. Most of the computer is actually the electronic structures connected to the process itself. As these structures are the major part of the electronics, most of the failures you will experience occur here. Also, there is an implication for a distributed system.
If you have a distributed system and you want to achieve fault tolerance by moving a job from one place to another, that's okay if you can move that job. However, if the job is to open a valve, it must be at the place where the valve is located; you cannot move that function. Distribution in a process control system is of limited worth.

In the process control world, people typically talk about three functions within a control plant: the control system, the safety system, and the alarm system. The control system is intended to maintain the physical plant in some desirable condition. The safety system follows the rule that if the control fails, it (the safety system) will
bring the plant down to a safe condition. The alarm system is supposed to notify operators of what is going wrong in the plant. In the past, these three functions were fulfilled by different vendors, who employed different technologies. We are starting to see a new trend where these three functions are integrated with microprocessors. Previously, these functions were separated. In fact, in certain industries, it is mandated that they be totally separate. However, this doesn't really preclude you from shipping data from one machine to another as long as there is not a hard physical electrical connection that can cause a failure at one end to lead to a failure at the other.

As a rule, control engineers rarely know much about either fault tolerance techniques or computers. The fault tolerance must be totally transparent to the user. The user is either the engineer who designs the application or the operator who is running the physical plant. These are very firm requirements. If you do use a distributed system, the distribution must also be transparent.

One important factor is to achieve total hot repair throughout the whole system. You must be able to go out to the system, remove whatever is faulty (a board, perhaps), and replace it. We are able to do this by allocating a redundant slot. When we want to remove a faulty board, we put in a new one, and by hitting a button, tell the operating system to switch to the new board. The operating system checks the new board, verifies its type and that it is working properly. After that, it uses the good board and we remove the faulty one. This is all done without disturbance, while inputs and outputs are varying.

Latent fault detection is incredibly important. As you move further out from a processor and toward input/output units, it becomes more and more difficult to know whether a piece of equipment is going to work when you need it to.
How do you know that a circuit that has been in one particular condition for many months hasn't had a failure in it? This is an exceedingly important problem and it's a problem for which there are not always good solutions. However, in fault tolerant systems, the more you hide faults from visible effects, the greater the chance that you will have a latent fault. As an example, there was an incident in a nuclear reactor in one of the northeastern states. There were two safety systems intended to drop the control rod and inject cooling water in the event of loss of primary cooling water. They had been running for many, many months (reactors are shut down infrequently). The trouble was that the backup systems were accumulating faults and nobody knew about them. No one could know about them because they didn't test for faults. Well, once, they lost a pump in the primary system, and both backup circuits totally failed. The human operator however was attentive and shut the reactor down within one second. That human operator deserves a lot of credit; he must have been sitting with his hand right over the big red button to do it in one second. All this occurred because of the problems with latent faults. Another, simpler, latent fault situation concerns driving. How do you know that the emergency parking brake on your car can operate on both rear wheels to be able to stop you if you are on the freeway? I would guarantee that very few of us have
ever taken our cars out and tested the emergency brake. It could be working on only one rear wheel for all we know.

Whatever it is that we must do to achieve fault tolerance - repair, diagnostics, error reporting, etc. - must cause no interruption in the process. Process control must be very fast (ten to one hundred milliseconds) and it must also be entirely deterministic. You can only tolerate recovery that takes very small amounts of time. Therefore, you can't use typical transactional techniques, and you can't use any virtual memory. For example, you typically don't use disks. You put all the software in RAM, which has deterministic timing.

How does my company achieve these goals? The machine we deliver at the moment provides fault tolerance using a number of techniques in a number of ways. Software is responsible for detecting errors, for diagnosing where errors are, for correcting errors, for reporting errors to an operator, and for managing things like latent fault detection or handling repair. The hardware's features are all concerned with the basic replication; it's a triplicated system throughout, except on the output voters. If you have a line that is going to a single valve, eventually you have to come down from three streams to one stream. You come down through one line going to one valve and at that point you have to build fault tolerance into the hardware of the voter itself. The voter must be built so that not only does it vote, but an internal fault within the voter does not corrupt its output. This is a very tricky issue because once you do that you immediately build in a latent fault situation. You may have a fault in there, but the fault does not corrupt the output. How then do you know about the fault?

Certainly, we need the ability to shut down the process control system. The shut down itself is a sequence of events.
For example, if you have a very large compressor forcing air in some particular direction and you just shut down the compressor and do nothing else, the compressed air will come back through the compressor and damage it. Therefore, you must stop when the pressure is equalized and not when the flow stops, and then you drop the shutdown valve. This is very tricky and if you've got multiple valves, it gets very complex.

We also need the ability to warm start. Warm starting is the process of bringing a newly inserted processor back after repair and putting it in the same state as the others. Of course, this is all done while the process control system is running. So the new processor and existing processors must learn all about each other's state while everything is dynamically changing. We have some difficulties doing this with our present machine, but I'll discuss that a little later.

We frequently use multiple triplicated machines. As I mentioned, our basic machine is a triple modular redundant system, but we build multiple triples for many different reasons. First, we need distribution for operational reasons. We want the equipment to be near the things that it controls. Secondly, we need it to provide greater fault tolerance. The greatest reliability requirement we have is for one function. The computer has one binary bit output. There are five analog inputs and each of four triples calculates the same function. If the function calculates that the plant
should shut down, it makes sense to shut down the plant when any two of the four vote to shut down. (This would be done in a nuclear reactor.) Another reason to use multiple triples is to increase performance. Frequently, we will combine multiple systems when there is not enough computer power inside one triple. In effect, we spread the job as one would do in a typical distributed system. But the triples are usually not physically separated by large distances, and the amount of data flow between the machines in the control environment is not large. We have one system with six triples, and the data flow between each of them is 1.6 baud. I'm not talking kilobaud. One point six baud! That's a hundred bits every minute.

What about the topics of measurements and evaluation? I think probably only the telephone company can actually give you good strong data about how their fault tolerance systems really behave in the field. Let me give you some qualitative things, starting with failures. We do have some failures during initial commissioning. We will get failures because valves break, pumps break, humans do stupid things, etc. We rank the things that cause our failures: application software; operator or management errors; our software; and our hardware. I'd like to discuss them in order. First is application software. This most likely cause of system failure is when someone writes a special program to improve the application. Second, as an example of operator or management errors, consider a triplicated system which reports that processor one failed for some reason and, by mistake, the operator restarts system three. This is just sheer stupidity. The operator was told in about four different ways about this: there were red lights on the failed processor, messages on the screens, messages on printers, etc. Despite all of this the operator still did the wrong thing. The third type of problem is with our own software.
The last problem is hardware failures, and we've had only one of these; we have about 80 systems out, with the oldest system being five years old.

How do we evaluate our systems? When we bid our system on any project, we routinely supply a complete reliability analysis. As far as I know, we are the only manufacturer who does this on a regular basis. We use standard techniques to predict failure rates and we back it up with more data from vendors: life test data and the like. We also use the IEEE standard 500, which is strictly intended for nuclear facilities. The IEEE 500 standard is valuable to us because it contains data on devices external to the computer. We often have to worry about such things, for example, valves, switches, circuit breakers, and similar devices. Also, the standard gives a lot of data about failure modes as well as failure rates. The standard is certainly lacking, in that it contains a limited failure model. Ideally, we would like much more complete information about failure modes. What we have to do is to assume that any circuit will fail in the worst possible way. This is a pessimistic assumption, but routine. Currently, we do not include the operator in our reliability analysis. I guess there is now a lot of work going on in Star Wars research to try and model other software and operator errors.
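The voting logic that recurs in this talk (the internal 2-of-3 voters and the 2-out-of-4 shutdown function) is m-of-n majority logic. A minimal illustrative sketch, with an assumed boolean encoding of the votes:

```python
# m-of-n voting: the output asserts (True) as soon as at least m of the
# n voters assert. With m=2, n=4 this is the nuclear shutdown rule above;
# with m=2, n=3 it is the majority vote inside one triple.

def m_of_n_vote(votes, m=2):
    """votes: list of booleans, True = 'shut down'.
    Returns True when at least m voters demand shutdown."""
    return sum(votes) >= m

assert m_of_n_vote([True, True, False, False]) is True    # 2 of 4: shut down
assert m_of_n_vote([True, False, False, False]) is False  # 1 of 4: keep running
```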
Q: Who does the reliability analysis? A: I often do the reliability analysis. I make various simplifying assumptions that only change things in an insignificant digit position. Then I can easily compute the results. Sometimes it's as simple as a series/parallel reliability model that you only need a desk calculator to compute. Q: You mentioned the failure modes of valves and other mechanical devices. What information do you have about the failure modes of computer components? A: The electronics companies give us the best data they can, but it's tough to get good information. Companies will quote the failures that they have experienced over the life of parts and the failure modes. But you really don't know if the information will be applicable to the parts you have. But we have substantial knowledge about some parts. For example, on a silicon controlled rectifier, we know that it will almost invariably fail with a short circuit. We've never been refused data, and the companies give us the best data they can. Our current difficulties include latent faults, which I mentioned earlier, and software validation. We experience very few problems with our operating system and the I/O subsystem, because we leave them alone, and we do very careful testing of any changes that we make. However, we worry a lot about software - all the way up to the top management of the company. So, we would like to find new ways of validating our software. For our particular machine, we would also like more flexibility in performance. We have triplicated systems, and that gives us a certain amount of computing power. Then we start adding more and more triples, which can be uncompetitive, as it just costs too much money. We'd like more flexibility and we're trying to aim for that. Also, a triplicated system is satisfactory as long as the mean time to repair is reasonably short, namely a few hours or maybe even a few days. 
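A series/parallel model of the kind that fits on a desk calculator can be sketched in a few lines. The following Python sketch is ours, not the speaker's actual analysis; it assumes independent component failures, and the numbers are purely illustrative:

```python
# Minimal series/parallel reliability bookkeeping.
# All reliabilities are probabilities of working over some mission time,
# assumed independent.

def series(*rs):
    # A series system works only if every component works.
    p = 1.0
    for r in rs:
        p *= r
    return p

def parallel(*rs):
    # A parallel (redundant) system fails only if every component fails.
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

def tmr(r):
    # Triple modular redundancy with a perfect voter:
    # works if at least two of the three modules work.
    return 3 * r**2 - 2 * r**3

r = 0.95  # illustrative single-module reliability
print(series(r, r))    # ~0.9025: two modules in series are worse than one
print(parallel(r, r))  # ~0.9975: two in parallel are better
print(tmr(r))          # ~0.99275: a triple with a perfect voter
```

Note that a triple is only better than a single module while r stays above 0.5; below that, majority voting makes things worse.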
When you need totally unattended operation, triplicated systems suffer from the fact that after the first fault you need to get to the system right away; you can't tolerate another fault after that. Therefore, if you have a system that you want to leave for six months and never visit, triplication alone will not achieve the kind of reliability you might need. So, something with triplication and built-in spares (a total of 4, 5, 6, or 7 machines) would be desirable. At this point, I should go on to what we would do next time, if there is a next time. Well, we are currently in the middle of a "next time." We are building a system which provides us with much higher computing power, but leaves us all the rest of the I/O structures. We want to retain our current development base because it works, and also because it's a valued item that sells. We are basing the system on the 68020, which will run at sixteen megahertz. You will be able to put multiple triples within the same crate to get a multiprocessor. On the first fault, a triple will reduce to a dual. Our objective is that on a second fault, we have a 95% chance to go successfully down to a single machine. We are using mainly hardware fault tolerance and trying to get away from software
fault tolerance. Part of the reason for this is to increase performance. These days, it is economical to build custom ICs for special functions, making hardware fault tolerance more attractive. The machine will start looking like Siewiorek's C.vmp, with voters between memory and processors, so that every exchange of data between memory and processor will go through a two-out-of-three vote. There will be a special voter that you can condition so that you can go down from three units to two, to one with reasonable probability. You can never go from two to one with certainty. Q: If you have a design bug in your software, you can repair it fairly easily; how do you solve that problem if you have all the logic in ICs? A: A bug in a crucial IC would be a disaster to our company. What saves us is that many of the ICs are fairly simple, and we're just doing an extreme amount of testing and simulation before we build. But it would be very nasty if we had to wait a couple of months to get the replacements. Q: Can you say a bit more about the voting on the new machine? A: We'll be able to have three processors and three memories and three voters in between them. So each memory, when it's feeding data to a processor, will feed it to all three voters. The voter chip is 16 bits wide, and we use two of them side by side to produce a 32-bit wide voter. The voter on the 330 is itself fault tolerant. You can go up to any component in that voter, short it, open circuit it, move the control connection or whatever, and it won't change the output. Basically, the voter is built out of six transistors. It's one of the places where we don't use ICs, because we don't want correlated failures. Q: Can you go in and pull out the individual transistors dynamically without shutting down the system? A: You could do that if you wanted to, but I don't know why you'd do it. 
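The two-out-of-three vote the hardware performs on every memory/processor exchange is a simple bitwise function. A sketch in Python (ours; the real machine does this in dedicated voter chips, not software):

```python
def vote(a: int, b: int, c: int) -> int:
    # Bitwise two-out-of-three majority: each output bit takes the value
    # held by at least two of the three input words.
    return (a & b) | (b & c) | (a & c)

# One corrupted copy is outvoted by the two good ones:
good = 0b1011_0010
bad  = 0b0011_1010
print(bin(vote(good, good, bad)))  # same bits as `good`
```

Note why a triple cannot degrade from two units to one with certainty: with only two copies left, a disagreement tells you something failed but not which copy is right.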
To change the voter board, it would be better to put a replacement board in a spare slot, tell the system to use that, and then pull the other board out. Then, you could change a transistor. Q: John, in your applications, do you see any tendency toward increasing the amount of state that has to be preserved in the event of a fault within your system? A: We feel it is very important to save a lot of state. I have made this point many times: if, on the detection of an error, you merely reconfigure and get going again, you may lose crucial information. If it was a transient error, you recover and restart the process. But you don't have enough data to know whether another error occurring a little bit later is another occurrence of the same type. If you don't record enough state data, you can never say, "These are not two isolated transients; these are two instances of an intermittent fault producing an error, and they're the same, and occurred two hours apart." So, I think it's very important to know what's happening inside your machine when there's a fault. Back in the physical part that's being controlled, it's important to know what's happening to a boiler or reactor, because faults do exist and the control is the only thing that knows what's happening. Industry is worrying more and more about this;
in general, companies want to repair these things before they lead to a crisis. Q: It seems to me that you have the possibility of making the computers much more reliable than the pumps, valves, etc. Isn't this unnecessary? A: There are two things. We usually talk about time periods of one hundred to five hundred years between failures, and there are two reasons for that. The first is that the authority of the control system is much greater than that of a pump or a valve. The design of a process plant will withstand localized failures. For example, if they need cooling water, there may be four pumps and multiple valves to pipe the water. The control system, however, has the potential to do much more damage. It could shut off all the valves. For example, we have a system on a big oil platform down in the Gulf, about twenty miles out. By law, when a hurricane gets to a certain severity, the platform must be evacuated. The platform, which was our first platform application, was the only platform that continued operating when hurricane Danny went through, and it generated 3 million dollars in revenue. That system was totally unattended and is connected throughout the whole platform. It could have done almost anything: shut the platform down, blow it up, or almost anything in the middle. The other thing is that if something happens to a part of your process equipment, how do you deal with it? The human operator deals with it by going to his control system. He needs to be able to count on it, more than the equipment he's working with. We are often asked whether or not these problems change as our company takes on larger and larger tasks. Obviously, a nuclear reactor is a very large system. We find that our early systems may have been controlling a compressor, and that our later ones might be controlling six compressors. However, they're all basically the same jobs. We do find difficulty in the operator workstations. 
When plants get really big, there's just a lot of data to look at. It's very easy to imagine a situation where an alarm goes off, and the operator becomes very confused very quickly, particularly if the control system isn't fault tolerant. Q: How does the operator know which sensor data to believe, if there are conflicts? A: There are a number of solutions. If there are two output displays, and they disagree, the operator knows in advance which one to take as the reading, because it is the safer one. If I'm landing an aircraft and I'm looking at two altitude sensors, each with different readings, I should take the safer reading. If three readings are present, I should take the middle value. If there are four readings, then typically there are policies that say, "if two readings say to shut down, then shut down." Operators are trained in sensible and conservative policies. The NRC does legislate the number of sensors to put into critical situations. It's almost an industry standard; four is a common number for critical processes. I've talked about the control system versus the safety system. There are circumstances where the control will put something into a state that the safety system is too slow to fix. For example, imagine a very fast chemical reactor. Suppose the control system injects some new material that would cause an explosion. The safety system would not be fast enough to do anything about it. Stopping an explosion is very difficult. There's presently a lot of thinking about integrating the safety with the control and thus providing more viable control. We have one system in which there are three triples on the control system for performance reasons and one other triple for safety. They all communicate; for example, the safety system needs to know the state of the process if it's going to shut it down, because the shutdown procedure may need to vary. The entire industry is slowly becoming more and more sophisticated in this respect. Q: When you have multiple sensors, say three, I assume that each sensor must report to each of the processors in your triple modular redundancy scheme. Then, I assume each processor would go off and use a middle select algorithm, or whatever. In principle, each correct processor should come up with the same value. Do you ever worry about the case that sensors could report different values to different processors? A: You are referring to a consistency problem. In our particular machine, there are places where you have to worry about interactive consistency. Reading analog data is one. The real-time clock, which is a one millisecond clock, causes another. Thirdly, there is periodically the need to synchronize three loosely synchronized machines. You are right that all these problems are tricky, but we've paid close attention to them. There's no time to go into the algorithms, but we do have a fair amount of logic in our software to deal with these issues.
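The sensor-selection policies described above are easy to state precisely. A Python sketch (ours, not from the talk) of middle-of-three selection and the two-out-of-four shutdown rule:

```python
def mid_value(r1, r2, r3):
    # With three readings, take the middle value: a single faulty sensor
    # can never drag the selected value outside the range spanned by the
    # two good readings.
    return sorted((r1, r2, r3))[1]

def shutdown_2_of_4(votes):
    # "If two readings say to shut down, then shut down."
    # `votes` is a list of four booleans, one per sensor.
    return sum(votes) >= 2

print(mid_value(99.8, 100.1, 512.0))                # 100.1 -- the outlier is ignored
print(shutdown_2_of_4([True, False, True, False]))  # True
```

The two-out-of-four rule is deliberately conservative: it trades some availability (a double sensor fault can force an unnecessary shutdown) for safety.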
The Sequoia System
Phil Bernstein
Digital Equipment Corp.
Cambridge Research Lab
One Kendall Square, Bldg 700
Cambridge, MA 02139
Moderator: Phil Bernstein is here to talk about a system that he worked on when he was at Sequoia. Bernstein: The Sequoia system uses a combination of hardware and software fault tolerance. The Sequoia system was designed by Jack Stiffler, founder of Sequoia and its chief architect. I arrived at Sequoia late in the product design phase, so I'm mostly describing the good ideas of others. I'm describing the current version of the product (as of March, 1986).[1] Sequoia is a tightly coupled multiprocessor. Each processor is a 68000-based unit that has access to all of the memory boxes and all of the I/O units on a shared bus. I'll explain more about the hardware in a moment. This system is in contrast to most of the other systems described at this workshop. The others are loosely coupled; each processor has local memory and I/O capabilities, and it communicates with other system resources by sending messages. One of the main advantages normally associated with tightly coupled architectures is more efficient memory utilization, because code can be shared among all the processors. Also, memory is not dedicated to any particular processor, so it can be shared. The processing load itself is automatically balanced, because each processor can grab its jobs off of the same ready queue. As long as there's work to do in the system, all processors can contribute to doing that work. On a loosely coupled system, it is more difficult to do load balancing. The Sequoia system is also highly configurable, because systems can have variable numbers of processors, memory boxes, and I/O units. Clearly, there are many costs associated with designing an architecture like this, which I'm not going to delve into very much. But I would be happy to answer some questions. There are a number of parameters that are relevant to the design of fault-tolerant systems that are strongly related to performance. 
[1] A more formal presentation of the material in this paper appears in "Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing," IEEE Computer 21, 2 (Feb. 1988), pp. 37-46.
One is that contention in tightly
coupled systems is traditionally a problem. Because of this, the Sequoia system has two independent, high speed busses, each of which has an effective bandwidth of 32 megabytes per second. Also, each processor has a large cache to reduce the number of times a processor has to go across the bus. Each cache is 128 kilobytes of static RAM. It takes four micro-seconds to move a 128 byte block between cache and memory in either direction. Another source of contention in a shared memory multiprocessor is the need to synchronize the processor caches. In the Sequoia system, the processor cache is not write through. Software flushes the cache explicitly. All the dirty blocks are flushed together. To maintain cache consistency, the software uses a large number of hardware-supplied test and set locks. The cache hit ratio seems very high, though we don't yet have a direct measurement of it. Apparently, each processor uses about four percent of one bus, though I cannot claim a very thorough study across a wide range of applications. I do know that in a particular eight processor system which we've been running for a long time, the busses have never been seen as a bottleneck of any kind. The belief is that we probably can go up to 25 or 30 processors before bottlenecks would be an issue. We had hoped that the busses would support 64 processors, and that may be possible if we make some optimizations in the current setup. Hardware is used to detect all faults. Each module containing processor, memory, and I/O can detect its own faults. Watch-dog timers are used to detect processor failures. All memory and bus units detect and correct failures using error detecting and correcting codes. However, much recovery is done in software. The main reason for doing hardware fault-detection is speed. The software mechanism I'm going to describe can recover from a fault, but it takes a few seconds, perhaps ten. 
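As an aside, the cache-to-memory figure quoted above is consistent with the stated bus bandwidth. A quick check (in Python, ours):

```python
# Moving a 128-byte block between cache and memory takes 4 microseconds.
block_bytes = 128
transfer_us = 4
bytes_per_us = block_bytes // transfer_us
# 1 byte per microsecond = 10^6 bytes per second = 1 MB/s, so 32 bytes/us
# is 32 MB/s -- exactly the stated effective bandwidth of one bus.
print(bytes_per_us)  # 32
```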
We believed this was fast enough, and that it was not worth the extra cost to have a full hardware recovery scheme, as in the Stratus computer. (In that scheme, a standby processor can cut in within a few clock cycles.) Today, the processor is a ten megahertz 68010. It's got a memory mapping unit and a large cache with access to both busses. The switch between busses at the moment is relatively static, and the cache uses only one bus at a time. The bus is switched after some failure occurs. Q: Could you say a little bit more about the fault detection technique used within the processor? A: There's quite a bit of care at that level of logic to make sure that there's no single point of failure that could cause compensating errors. The people who built this system had built a system for the space program as their last job. Each logical processor is made up of two 68010s on two identical boards, with half of the comparator logic on each board. In addition, the memories, beyond the cache on the processors, are split so that half of each byte is on each board and no two bits of the same nibble are on the same chip. This includes the error-correcting bits. This way, there is no single package failure that could cause self-compensating errors that
would not get detected by one of the codes. Perhaps we were a little too careful in the design. For example, there is a heavy use of error correcting codes for double bit errors, and that may not be a big enough problem to warrant the cost. Error isolation is also done in hardware. That is, the module goes into an error state and immediately disconnects itself from the bus, and therefore cannot pollute the rest of the system. Thus, errors don't propagate. This is a major part of the technique for providing a fail-fast computer model. Q: So, you have a mechanism for disconnecting a failed processor. How do you know that the mechanism works - that is, that it is capable of disconnecting a processor if that should become necessary? A: First, let me note there is a lot of self-checking logic built into the system. The comparators are self-checking. Second, the operating system exercises them at periodic intervals, a few times per day, to make sure that they haven't failed. You need simultaneous compensating failures, relative to a fairly well-protected error-correcting code, in two comparators for this scheme to malfunction. With respect to the disconnect logic of the processor, I don't know if the operating system exercises it, but I assume that it does. Cache coherence is the main problem for shared memory systems like the Sequoia, particularly since the operating system can run in parallel on any processor. The Sequoia operating system is a UNIX™-compatible kernel designed and implemented by Sequoia. When you do a system call on a processor, you enter into the operating system and do that system call based on the contents of structures in shared memory. The obvious problem is that two processors may go to that shared memory at the same time, unless they are prevented from doing so. 
For example, if two processors, A and B, both read a control structure, X, at the same time, and then both modify it, the updates of one of the processors would be lost, and this cannot be allowed to happen. The way we solved that problem without compromising fault tolerance is to use test-and-set locks and cache invalidation: before the operating system accesses a shared control structure, it first executes what amounts to a P operation on a semaphore. The semaphores are implemented using shared, hardware-provided spin locks that are accessible to all the processors. After the test-and-set is executed, the cache is then invalidated if it is dirty. This is to ensure that any data structures that are referenced are up-to-date. If this were not done, the operating system might access a stale version of some data in the cache. When the operating system is done with its critical section, which is usually only a few instructions, it then flushes the cache. Note that you obviously cannot invalidate dirty cache entries before their values have been written out to shared memory, as you would lose updates. Flushing the cache is all done in hardware. By issuing a single instruction, the hardware scans all of the dirty bits on the cache blocks. In parallel, it starts pushing out the dirty ones to the memory. It costs 15 micro-seconds to scan those dirty bits, and then it costs four micro-seconds times the number of blocks you're moving out
to memory in order to complete the flush. After the flush is complete, the operating system can then release the spin lock. Basically, this is the conventional, mutual-exclusion, low-level synchronization mechanism that you find in a lot of operating systems. The locks are not held across context switches, nor in user mode. In other words, when you return from a system call, you never own any of these spin locks. We implement higher-level locking primitives, like ordinary software locks, by protecting the control structures which implement them using the low-level locking scheme. In the configuration we have now, there are probably about fifty spin locks that are used for all the shared control structures in the processor system. Early on, we observed that we were flushing the cache too much on exit from these locking blocks. Very often software is nested, and it gets locks as it travels down through some structure. Then, as it pops back, it starts releasing locks and flushing at each level. Of course, the first release got rid of all the dirty data, but the system flushed over and over again. One easy optimization is to distinguish between read locks and write locks. We do not need to bother flushing the cache for read locks. As a second optimization, we only logically release write locks. When we release the last lock, we really release all the locks and flush the cache. This is like dropping all locks at the end of a transaction. The benefit from this is substantial. Q: Do you allow two processes to hold read locks at the same time? A: No. Read locks are really exclusive; we distinguish between them only to determine when cache flushing is needed. The obvious question is, what if these locks become a bottleneck? The answer is that all of the kernel data structures are implemented as hash tables partitioned on a primary key, and they can be dynamically partitioned into as many partitions as is necessary to reduce contention. 
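The locking-and-flushing discipline just described, including the flush-on-last-release optimization, can be sketched as follows. This is a much-simplified, sequential Python model of our own; all of the names are illustrative, and the real test-and-set, invalidation, and flush are single hardware operations:

```python
class Cache:
    # Toy model of one processor's cache in front of shared memory.
    def __init__(self, memory):
        self.memory = memory     # shared memory: addr -> value
        self.lines = {}          # this processor's cached copies
        self.dirty = set()

    def invalidate_clean(self):
        # Drop clean lines on lock entry, so subsequent reads see
        # up-to-date shared memory (dirty lines must not be dropped,
        # or updates would be lost).
        self.lines = {a: v for a, v in self.lines.items() if a in self.dirty}

    def flush(self):
        # Push all dirty lines back to shared memory.
        for a in self.dirty:
            self.memory[a] = self.lines[a]
        self.dirty.clear()

class KernelLocks:
    # Flush-on-last-release: read locks never flush; nested write locks
    # are released only logically, and one real flush happens when the
    # lock count drops to zero.
    def __init__(self, cache):
        self.cache = cache
        self.held = 0
        self.wrote = False

    def acquire(self, write=False):
        # (A real kernel spins on a hardware test-and-set here.)
        self.held += 1
        self.wrote |= write
        self.cache.invalidate_clean()

    def release(self):
        self.held -= 1
        if self.held == 0 and self.wrote:
            self.cache.flush()
            self.wrote = False
```

In this model a nested sequence acquire(write=True), acquire(), release(), release() performs exactly one flush, at the final release, which is the behavior the optimization is after.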
Take the file descriptors. If it turns out that you are pounding on the lock for a file descriptor table, you can split the file descriptors into two tables and hash on the descriptor to determine which table to use. In that way, you can cut down on contention. We've done a fair bit of analysis of this. Essentially, since each processor is only accessing one such structure at a time, you're never going to need any more partitions than you have processors in the system. If you think about a 60 processor system, you need only have about 60 partitions. There are about ten or twelve such partitions in a typical operating system. Q: What happens to code in the cache? A: We distinguish between data in the cache that needs to be flushed and data that doesn't. Code and a user's private data never need to be flushed. Obviously, there are no cache consistency problems associated with read-only data. Now, a question that arises is what happens if the system crashes. There are two cases: a crash when there is no cache flush in progress and a crash when there is a cache flush in progress. For the first case, you can just resume the process on another CPU, provided that process flushes all information about I/O's synchronously. That
is, you can just pick up the old memory state and execute the process on another CPU. However, if a crash occurs while a cache flush is occurring, only some of the blocks may have been written out, and the state in memory after the crash may be inconsistent. Q: What happens when the cache overflows, and you need to flush a cache entry to free up more cache blocks? A: Indeed, cache overflow may also cause a cache flush. To maintain process state consistency, you need to flush the whole cache. Anyway, to solve the problem of a crash during a cache flush, we mirror memory. Every writeable physical page of main memory is in two different memory boxes. Thus, we actually flush the cache twice. We flush once to the primary copy of all the pages. We wait for that to complete and then we flush again to the secondary copy. Thus, if a processor crashes while it's flushing its cache, we have either a before-flush or an after-flush consistent state. There are a couple of technical difficulties with this. The first is making absolutely sure that you don't repeat or lose I/O's. Essentially, you write something into memory that tells the I/O processor to do an I/O, and you must make sure that the request is not honored until the flush is complete. The very last action at the conclusion of the flush flips a bit that says, "O.K. I know that I did an I/O, now that I'm done with the flush." It, in turn, flips a hardware bit that says, "It's O.K. for the I/O processor to go at the I/O control block." The problem is that the processor might crash right after it finishes the flush and just before it flips the bit. The I/O processor may have initiated the I/O and there would be no record of it. To recover from this possibility, the operating system on each processor running recovery has to ask the I/O processor, "Did you see the bit flip for I/O control block numbered such and such? Because I don't want to send it to you again if you've already seen it." 
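Stepping back to the mirrored flush itself: a toy model (ours, in Python) shows why flushing the two mirror copies strictly in sequence guarantees that a crash at any point leaves at least one copy wholly before-flush or wholly after-flush:

```python
# Toy model of the mirrored-memory flush.  `crash_after` simulates a
# processor crash after that many block writes have completed.

def flush_mirrored(dirty, primary, secondary, crash_after=None):
    writes = 0
    for target in (primary, secondary):   # primary strictly first
        for addr, value in dirty.items():
            if writes == crash_after:
                return                    # crash mid-flush
            target[addr] = value
            writes += 1

old = {0: "old0", 1: "old1"}
dirty = {0: "new0", 1: "new1"}
primary, secondary = dict(old), dict(old)
flush_mirrored(dirty, primary, secondary, crash_after=3)
# Crash during the second flush: the primary is already fully new, the
# secondary is only partially written -- recovery uses the primary.
print(primary)  # {0: 'new0', 1: 'new1'}
```

Had the crash come during the first flush instead (crash_after < 2), the primary would be the inconsistent copy and the untouched secondary would still hold the complete before-flush state.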
It turns out that the hardware interface makes this difficult, and it took our best operating system developer many months to get the protocol right and the I/O processor really fast. Q: Doesn't the Stratus machine use shared memory? A: The Stratus machine uses six processors attached to a common memory. Each of the processors is duplicated for fault detection. I don't know how they do their caching, but I believe they use write-through caches to minimize cache coherence problems. Q: Why didn't you make the I/O idempotent and then just redo any in-doubt I/O's after a crash? You could do that by having the I/O processor check the serial number of an I/O before undertaking it. A: The question is how much computing power you want the I/O processor to expend. In this case, it would have to really read the serial number on what's going out. Unfortunately, reading the sequence number reduces the possible pipelining, and pipelining is the major technique we use to make the I/O go fast. Adding extra instructions to process the sequence number in an I/O block as it comes through is a significant
issue in terms of throughput. Part of the reason for this in the case of the Sequoia system is that there are several stages in the I/O subsystem. We know this because an earlier version of the operating system made I/O idempotent, and it contributed to a serious I/O bottleneck until we corrected it. By the way, the I/O subsystem is a 68000-based fault tolerant computer. Its on-board memory is connected by a Multibus™ to a bus adapter, which is in turn connected to the main system bus. The data must come through the controller, into the I/O processor, and then be staged into the bus adapter, where it gets DMA'd. The bus adapter is just an MSI special purpose unit. With all of those stages, you can end up with queues building up if you are not careful. I was asked to talk a little bit about some of the practical experiences with the Sequoia machine. While the marketing literature says that we can handle up to 64 main processors and 96 I/O processors, we haven't built a system bigger than eight. Twelve processors have been run experimentally for testing purposes. The eight processor systems that are being used at the company now have four I/O processors. The memory modules are two megabytes each. System calls cost about 150 micro-seconds and process switches about one millisecond. The context switches are relatively long because it takes a while for the cache to reach steady state after a process switch. All in all, these basic operating system functions do not seem to me to be a bottleneck. Q: Turning to more mundane things, what is the cache hit rate? A: I don't know for sure, but it's probably over 98%. I don't have a good answer because we haven't done a careful study. Our only information comes from running an INGRES application, measuring the performance, and working the numbers backward, making certain assumptions about the processor speed. 
Obviously, this is not a direct measurement of what was going on in the cache, and the measurement was done only for that application. Q: Do you have any reliability statistics on the system? A: We do observe transient failures. One of the nice things about this system is that if a processor fails, it's usually just a transient failure. After a processor failure, we go into a five second recovery activity. If the processor checks out OK, then it immediately cuts back into the system, reintegrates, and if it runs OK without another failure, we just log it, and that's the end of the problem. Though I don't have exact numbers, I would guess that we have observed a transient failure every couple of weeks in one module or another. Some of this may be because Sequoia is not in full-scale production. Some of these boards are early versions, and they probably are more error prone than they would be after all of the manufacturing quality assurance associated with volume production. Hence, I'm not sure that statistics now would be terribly relevant. The overall design of the basic Sequoia machine strikes me as another way of doing transaction processing: the database is the process state, the transaction is the time between cache flushes, and mirroring is the technique used for recovery. When I
made this observation, I looked at a lot of other systems and noticed that they were all doing similar things. That is, there are two or three basic techniques for making memory states fault tolerant, and everybody uses one of them. You can go very deep into the hardware of fault tolerant computers and basically find these techniques being used at all levels. The people who build fault tolerant computers tell me that most of the work is in error detection and isolation; recovery is assumed to be easy if you do isolation and detection well. The reason for this is that you are always in a consistent state, from which it is obvious what to do after a failure. Nevertheless, to my knowledge, no one has ever gone through the various kinds of fault tolerant systems and expressed their findings in transaction recovery terms. Moreover, those transaction commit points (such as the cache flushes in the Sequoia case) in no way precisely match the transaction or sub-transaction commits in a database system that may use the underlying fault tolerant system. We end up having a whole different set of mechanisms, in all systems I know of, to do locking and logging and the other primitives of transaction processing. This feels wrong to me. It seems like there are two completely different mechanisms doing virtually the same thing. For example, all the hardware to keep the system from going down, or from staying down for a long time, doesn't really help in terms of transaction management. Basically, there is still a buffer pool, and the need to force information to disk at the usual times, and there's nothing particularly beneficial in the computer to help in these functions. In general, I think this notion of unifying the recovery techniques is a research problem that we really need to understand. 
There may well be a fair bit of economy of mechanism that can be achieved by understanding how these lower level mechanisms work and then trying to reduce the number of layers, and thereby improve the overall performance of the system. There's a lot of benefit to the Sequoia design in terms of system maintenance. Every fault tolerant vendor has talked quite a lot about these benefits. It's terrific to have the manufacturing people take out a flaky board, replace it immediately, and then fix the flaky board, without rebooting the operating system. No one need ever notice that it has happened. This is a wonderful benefit of fault tolerant designs.
Faults and Their Manifestation Daniel P. Siewiorek School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213-3890
1
Introduction
Reliable computing has historically been the domain of military, aerospace and communication applications in which the consequence of computer failure is significant economic impact and/or loss of life. Reliability has grown in importance as our dependence on computing systems has grown. At the same time computers have been installed in harsh environments, often operated by novice users. Multiple computers frequently are coupled together by a network to form a distributed system dedicated to a single service such as making reservations, transferring funds, and retrieving information. Commencing with [19], theorists have studied the problem of reaching agreement among a pool of processors in the presence of failures. A large variety of algorithms have been developed to ensure agreement assuming one of a small set of failure models. These models include:
• Failed-notified. The failure causes the processor to stop all execution. The other processors are immediately notified of the failure.
• Failed-stop. The failure causes the processor to stop all execution. The other processors are not notified of the failure.
• Message Omission. A scheduled message is either not sent, not transmitted, or not received.
• Timing. Messages are arbitrarily delayed and/or arrive out of order.
• Byzantine Failure. The faulty processor continues execution and maliciously lies when asked for information. This is a worst-case model, since the faulty processor can generate misleading information, causing a maximum of confusion.

The question is: "How accurately do these models reflect real failures?" This paper presents failure data from several sources in an attempt to bridge the gap between theoretical failure models and actual failures. Section 2 describes the physical hierarchy and temporal stages of a computer system, which serve as a framework for discussing failure data. Section 3 presents the failure data according to the framework of Section 2.
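The failure models enumerated above form a rough hierarchy of severity: a protocol that tolerates Byzantine behavior also tolerates the weaker models. The following sketch of that ordering is our own illustration (the class names and the `covers` helper are not from the chapter, and the exact ordering of the intermediate models varies in the literature):

```python
from enum import Enum, auto

class FailureModel(Enum):
    """The five high-level failure models, ordered roughly from benign to malicious."""
    FAILED_NOTIFIED = auto()   # processor halts; peers are told immediately
    FAILED_STOP = auto()       # processor halts; peers are not told
    MESSAGE_OMISSION = auto()  # a scheduled message is lost somewhere
    TIMING = auto()            # messages arbitrarily delayed or reordered
    BYZANTINE = auto()         # processor keeps running and may lie

# Illustrative strength ordering: each model subsumes the ones before it
# (a crashed processor is one that omits all of its remaining messages, etc.).
STRENGTH = [FailureModel.FAILED_NOTIFIED, FailureModel.FAILED_STOP,
            FailureModel.MESSAGE_OMISSION, FailureModel.TIMING,
            FailureModel.BYZANTINE]

def covers(assumed: FailureModel, actual: FailureModel) -> bool:
    """True if a protocol designed for `assumed` also handles `actual`."""
    return STRENGTH.index(assumed) >= STRENGTH.index(actual)

print(covers(FailureModel.BYZANTINE, FailureModel.FAILED_STOP))  # True
print(covers(FailureModel.FAILED_STOP, FailureModel.TIMING))     # False
```

The question the chapter asks is whether real failures actually land inside any of these neat classes.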
2
Levels & Stages in the Life of a Digital System
2.1 Levels in a Digital System 1

Digital computer systems are enormously complex. To make them more comprehensible, it is necessary to divide the system into several levels. One can then proceed upward from the most primitive level to the highest conceptual level through a series of abstractions. Each abstraction contains only information important to its level and suppresses unnecessary information about lower ones. Because system designers utilize the hierarchical concept to manage the complexity of a digital system, the levels frequently coincide with the system's physical boundaries. Table 2-1 describes a typical set of levels for a digital computer.
Circuit Level. The circuit level consists of such components as resistors, capacitors, inductors, and power sources. The metrics of system behavior include voltage, current, flux, and charge.

Logic Level. The logic level is unique to digital systems. The switching-circuit sublevel is composed of such things as gates and data operators built out of gates. The switching-circuit sublevel is further subdivided into combinatorial and sequential logic circuits, the fundamental difference being the absence of memory elements in combinatorial circuits. The Register Transfer (RT) sublevel deals with the next higher level of abstraction, namely, registers and functional transfers of information among registers. A register is a digital device that remembers the state of a set of binary digits. The RT sublevel is frequently further subdivided into a data part and a control part. The data part is composed of registers, operators, and data paths. The control part provides the time-dependent stimuli that cause transfers between registers to take place. In some computers, the control part is implemented as a hard-wired state machine. With the availability of low-cost Read-Only Memories (ROMs), microprogramming is also a popular way to implement the control function.
Program Level. The program level is unique to digital computers. At this level a sequence of instructions in the device is interpreted and causes action upon a data structure. This is the Instruction Set Processor (ISP) sublevel. The ISP description is used in turn to create software components that are easily manipulated by programmers: the high-level-language sublevel. The result is software, such as operating systems, run-time systems, application programs, and application systems.

PMS Level. Finally, the various elements (input/output devices, memories, mass storage, communications, and processors) are interconnected to form a complete system.

1 This discussion is adapted from D. Siewiorek, G. Bell, and A. Newell, "Computer Structures: Principles and Examples," McGraw-Hill, New York, NY, 1982.
2.2 Stages in the Life of a Digital System 2
Not only are system levels important for describing a digital computer; a time dimension is also required. At what point data is collected during the life cycle of a system may be more important than at what physical level. From a user's viewpoint, a digital system can be treated as a "black box" that produces outputs in response to input stimuli. Table 2-2 lists the numerous stages in the life of the box as it progresses from concept to final implementation. These stages include specification of input/output relationships, logic design, prototype debugging, manufacturing, installation, and field operation. Deviations from intended behavior, or errors, can occur at any stage as a result of incomplete specifications, incorrect implementation of a specification in a logic design, and assembly mistakes during prototyping or manufacturing. During the system's operational life, errors can result from changes in the physical state of, or damage to, the hardware. Physical changes may be triggered by environmental factors such as fluctuations in temperature or power supply voltage, static discharge, and even alpha particle emissions. Inconsistent states can also be caused by operator errors and by design errors in hardware or software. Design errors, whether in hardware or software, are those caused by improper translation of a concept into an operational realization. Closely tied to the human creative process, design errors are difficult to predict. Gathering statistical information about the phenomenon is also difficult, because each design error occurs only once per system. Any source of error can appear at any stage; however, it is usually assumed that certain sources of error predominate at particular stages.
2 This discussion is adapted from D. Siewiorek and R. Swarz, "The Theory and Practice of Reliable System Design," Digital Press, Bedford, MA, 1982.
Table 2-1: Levels of abstraction for digital computers

Level     Sublevel              Components
PMS       -                     Processors; Memories; Switches; Controllers;
                                Transducers; Data operators; Links
Program   High-level language   Software
          ISP                   Memory state; Processor state; Effective address
                                calculation; Instruction decode; Instruction execution
Logic     Register transfer     Data: Data paths; Registers; Data operators
                                Control: Hardwired (sequential logic machines);
                                Microprogrammed (Microsequencer; Microstore)
          Switching circuit     Sequential: Flip-flops; Latches; Delays
                                Combinatorial: Gates; Encoders/Decoders; Data operators
Circuit   -                     Resistors; Capacitors; Inductors; Power sources;
                                Diodes; Transistors
Table 2-2: Stages in the development of a system

Stage                  Error Sources
Specification/design   Algorithm design; Formal specifications
Prototype              Algorithm design; Wiring and assembly; Timing; Component failure
Manufacturing          Wiring and assembly; Component failure
Installation           Assembly; Component failure
Operational life       Component failure; Operator errors; Environmental fluctuations
3
Failure Data

Through the years, a variety of meanings have been assigned to the terms failure, fault, and error, as well as to their temporal duration. In this paper we adopt the terminology of [1, 18]:

• Permanent. Describes a failure, fault, or error that is continuous and stable. In hardware, permanent failure reflects an irreversible physical change. The word hard is used interchangeably with permanent.
• Intermittent. Describes a fault or error that is only occasionally present due to unstable hardware or varying hardware or software states (for example, as a function of load or activity).
• Transient. Describes a fault or error resulting from temporary environmental conditions. The word soft is used interchangeably with transient.
• Failure. Occurs when the delivered service deviates from the specified service. Service can be viewed from several levels of abstraction: it can be delivered by a chip, or from the system level as viewed by the user.

3 Adapted from Chapter 2, "Faults and Their Manifestations," in "The Theory and Practice of Reliable System Design," D. Siewiorek and R. Swarz, Digital Press, Bedford, MA, 1982.
• Fault. Erroneous state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design.
• Error. Manifestation of a fault within a program or data structure. The error may occur some distance from the fault site.

The distinction between intermittent and transient faults is not always made in the literature [16, 31]. The dividing line is the applicability of repair [3, 17, 20, 27]. Faults resulting from physical conditions of the hardware, incorrect hardware or software design, or unstable but repeated environmental conditions are potentially detectable and repairable by replacement or redesign; faults due to temporary environmental conditions, however, cannot be repaired because the hardware is physically undamaged. It is this attribute of transient faults that magnifies their importance: even in the absence of all physical defects, including those manifested as intermittent faults, errors will still occur.
3.1 Specification/Design/Prototype Errors

As indicated above, errors during the early stages of system life are the most difficult to collect data on and to generalize about. At the same time, hardware and software errors in these stages have the most in common. Indeed, all software errors occur in these stages, since there is no counterpart in software to hardware manufacturing and aging. Table 3-1 lists the sources and distribution of errors during the design and integration phases of a mid-range mainframe computer built by ICL [12]. As expected, logical errors represent the majority of problems prior to hardware integration. After integration of the boards into the prototype, logical errors still remain but occur roughly equally with specification and physical errors (e.g., violations of physical design rules). In the design of the IBM 3081 it was estimated that one error was encountered for every 4,000 circuits in data path (i.e., regular) design and one error for every 1,000 circuits in control (i.e., random) design [23].
Table 3-1: Distribution of errors in ICL hardware design in the stages of logic design and system integration

Source                        Percent Before        Percent After
                              System Integration    System Integration
Logic Error                   52                    21
Testability/Maintainability   17                    12
Compatibility                 15                    4
Technology Rules Violated     6                     25
Clock Fault                   6                     2
Specification Error           4                     19
Management                    -                     10
Noise                         -                     4
Design Automation             -                     2
Performance                   -                     1
A series of studies to validate the behavior of FTMP, a prototype fault-tolerant multiprocessor, indicated the following types of errors [6, 7, 8, 9, 10, 11, 13, 15]:

• Many of the exception interrupts did not have proper interrupt handlers. These exceptions included arithmetic overflow, write protection violation, illegal opcode, stack overflow, privileged instruction violation, and privileged mode call. Generation of these exceptions in user mode caused a halt instruction to be executed. Furthermore, the interrupt vector for the divide exception was not implemented: a division by zero in user mode crashed the machine. To avoid stalling the system due to exceptions, all application software was executed in privileged mode, in which interrupts are ignored.

• All tasks were specified to execute within a 40-millisecond "frame". The frame boundaries were considered to be hard deadlines, and tasks were not to be allowed to execute beyond a frame boundary. The task dispatcher required 15, 66, and 40 milliseconds to schedule three consecutive tasks. This repeatable scheduling pattern was never satisfactorily explained. As a result, application tasks (including the dispatcher time) required 40, 110, and 90 milliseconds to execute. Again the sequence was repetitive.

• In order to provide adequate time for a task to complete, a "frame stretching" mechanism was provided whereby a task could ask for more time. Repeated use of the frame stretching mechanism allowed a task to monopolize a processor and to lock out all other tasks. Arranging for the first task in the dispatcher table to point to itself as the next task also caused an infinite loop in which the single task monopolized the processor.
While a majority of these FTMP design errors had manifestations that mapped into the "failed-stop", "message omission", or "timing" high-level failure models, no single failure model is adequate to describe all the failure modes.
3.2 Manufacture/Installation/Operational Life
Physical defects are the lowest level in the hierarchy of failures. There are numerous ways in which a semiconductor chip can fail. Some failures result from defects in the manufacturing process. Others are due to stress during normal operation. The Reliability Analysis Center (RAC) of the Rome Air Development Center (RADC) collects reliability data from government and industry on all phases of component development, assembly, testing, and field operation. The data are summarized in publications dealing with digital ICs, hybrid circuits, linear/interface devices, memory/LSI, discrete transistors/diodes, and nonelectronic parts. Table 3-2 summarizes observed defects for CMOS and TTL technology.
To determine the effect of failures on logic functions, physical data such as those given in Table 3-2 must be used to generate circuit-level fault classes, which in turn are used to formulate logic-level fault classes. The abstraction process prevents proliferation of details. The following logic-level fault models have been used successfully as abstractions of the physical defect mechanisms:
• Stuck-at. Logical values on lines, gates, pins, and the like are permanently constrained to a value of 1 (s-a-1) or 0 (s-a-0).
• Bridging. Two or more adjacent signal lines are physically shorted together. In some logic families this introduces an additional "wired-AND" or "wired-OR" function. Sometimes bridging faults turn a combinatorial circuit into a sequential circuit.
• Short or Open. These correspond to missing (open) or additional (short) connections.
• Unidirectional. Due to the geometric nature of circuits, some single failures can affect multiple signal lines. An open circuit in a memory-select line may cause a word to be incorrectly read as all 1s. The multiple bits in error are all in the same logical direction (that is, correct 0s have been transformed into incorrect 1s).
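The stuck-at model lends itself to mechanical exploration: clamp one named line of a gate-level netlist to 0 or 1 and compare the faulty output against the fault-free output over all input vectors. The two-gate circuit below is our own toy example, not one from the text:

```python
from itertools import product

def evaluate(inputs, stuck=None):
    """Evaluate a tiny net: out = (a AND b) OR (NOT a AND sel).
    `stuck` maps a line name to a forced constant, modeling s-a-0 / s-a-1."""
    stuck = stuck or {}
    clamp = lambda name, val: stuck.get(name, val)
    a = clamp("a", inputs["a"])
    b = clamp("b", inputs["b"])
    sel = clamp("sel", inputs["sel"])
    n1 = clamp("n1", a & b)          # internal line n1 = a AND b
    n2 = clamp("n2", (1 - a) & sel)  # internal line n2 = NOT a AND sel
    return clamp("out", n1 | n2)

def detecting_vectors(fault):
    """Input vectors on which the faulty circuit disagrees with the good one."""
    return [(a, b, sel)
            for a, b, sel in product((0, 1), repeat=3)
            if evaluate({"a": a, "b": b, "sel": sel}) !=
               evaluate({"a": a, "b": b, "sel": sel}, stuck=fault)]

# Only vectors that both excite n1 and make it observable expose the fault.
print(detecting_vectors({"n1": 0}))  # [(1, 1, 0), (1, 1, 1)]
```

A bridging fault could be modeled the same way by clamping two lines to the AND (or OR) of their fault-free values.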
Table 3-2: Die-related defect summary for SSI, MSI, and LSI standard CMOS and TTL.

Defect Classification       Relative Percent
                            CMOS    TTL
Surface                     38      16
Oxide                       32      14
Input/Output Circuits       9       4
Metalization                8       51
Bulk                        7       7
Diffusion                   6       8

Table 3-3: The logical effect of physical defects in NMOS circuits.

Defect Effect               Relative Percent
Line stuck-at faults        28
Transistor stuck-at faults  15
Floating line faults        21
Bridging faults             30
Miscellaneous faults        6
Table 3-3 gives the distribution of logic-level faults obtained by simulating manufacturing defects on representative circuits [29]. Now consider the system (PMS) level in the digital design hierarchy. Transient and intermittent faults are already a major source of errors in systems. An early study for the U.S. Air Force [26] showed that 80 percent of the electronic failures in computers are intermittent. Another study by IBM [2] indicated that "intermittents comprised over 90 percent of field failures." Table 3-4 depicts the ratio of measured Mean Time To Failure (MTTF) to Mean Time Between Errors (MTBE) for several systems [30, 24, 21]. The last row of this table is the estimate of permanent and transient failure rates for a one-megaword, 37-bit memory composed of 4K MOS RAMs [14, 25]. In this case, transient errors are caused by alpha particles emitted by the decay of trace radioactivity in the semiconductor packaging materials. As they pass through the semiconductor material, alpha particles create sufficient hole-electron pairs to add charge to or remove charge from bit cells. By exposing MOS RAMs to artificial alpha particle sources, the operational life error rate can be determined as a function of RAM density, voltage, and cycle time [4].
Table 3-4: Ratios of transient to permanent errors.

System/Technology    Mechanism    Processor MTTF   Processor MTBE   MTTF/MTBE
CMUA PDP-10, ECL     Parity       800-1,600 hrs.   44 hrs.          18-36
Cm* LSI-11, NMOS     Diagnostics  4,200 hrs.       128 hrs.         33
C.vmp TMR LSI-11     Crash        4,900 hrs.       97-328 hrs.      15-51
Telettra, TTL        Mismatch     1,300 hrs.       80-170 hrs.      8-16
1M x 37 RAM, MOS     (Parity)     1,450 hrs.       106 hrs.         14
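The ratios in the last column of Table 3-4 follow directly from the two middle columns; a quick sketch reproducing the scalar rows (figures transcribed from the table, variable names our own):

```python
# (MTTF hours, MTBE hours) for the scalar rows of Table 3-4.
table_3_4 = {
    "Cm* LSI-11, NMOS": (4200.0, 128.0),
    "Telettra, TTL":    (1300.0, 80.0),   # low end of the 80-170 hr MTBE range
    "1M x 37 RAM, MOS": (1450.0, 106.0),
}

for system, (mttf, mtbe) in table_3_4.items():
    ratio = mttf / mtbe  # hard failures per transient error, inverted
    print(f"{system}: transients ~{ratio:.0f}x as frequent as hard failures")
```

Rows quoted as ranges (CMUA PDP-10, C.vmp) yield the tabulated range when the endpoints are divided pairwise.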
Transient errors have also been observed in microprocessor chips [4]. Transient errors will become even more of a problem in the future with shrinking device dimensions, lower energy levels for indicating logical values, and higher-speed operation. The data in Table 3-4 indicate that transients occur 20 to 50 times more often than hard failures. Table 3-5 gives the distribution of system crashes between hardware and software for a variety of systems. Table 3-6 depicts the allocation of system downtime to various sources in the telephone Electronic Switching Systems (ESS) [32]. Two observations can be made from these data. First, downtime is attributed to several sources, and in order to improve downtime one must focus on all the sources, not just one. Second, hardware and software are not the major contributors to downtime. Operator errors and the inability to detect and/or correctly diagnose a failure are major sources of downtime. Since the ESS systems use duplication and matching for error detection, one might think the error detection and diagnosis would be highly effective. A NASA study [22] which fault-injected a flight control computer indicated that over 60 percent of the faults were not detected. The low detection rate was due to the fact that the repetitive flight control software did not exercise the entire machine. These undetected, hence latent, faults mean that when a fault is detected, the common assumption of the presence of a single failure is violated. Thus, recovery algorithms could be confused into incorrect diagnosis and/or recovery. Latent faults are not represented by any of the high-level fault models listed in Section 1.
Table 3-5: Distribution of the causes of crashes.

                              Percent of Crashes Attributed to:
System                        Hardware   Software   Unknown
B5500                         39         8          53
Univac 1108                   45         55         -
IBM 370/165                   65         32         3
IBM 3081                      74         26         -
Univac Large Size Systems     51         42         7
Univac Medium Size Systems    57         41         2
Univac Small Size Systems     88         9          3
C.mmp                         55         26         19
DEC PDP-10                    40         60         -

Table 3-6: Sources of downtime at AT&T Electronic Switching Systems.

                        Percent
Hardware                20
Software                15
Recovery Deficiencies   35
Procedural              30
The manifestations of intermittent and transient faults and of incorrect hardware and software design are much harder to determine than those of permanent faults. The permanent fault model often can be applied to the intermittent case; however, because the fault is present only temporarily and because most contemporary computer systems do not have substantial on-line error detection, the normal manifestations of intermittent faults are at the system level (such as a system crash or I/O channel retry). Transient faults and incorrect designs do not have a well-defined, bounded, basic fault model. Transients are a combination of local phenomena (such as ground loops, static electricity discharges, power lines, and thermal distributions) and universal phenomena (such as cosmic rays, alpha particles, power supply characteristics, and mechanical design). Even if models could be developed for transients and incorrect designs, they would quickly become obsolete because of the rapid changes in technology. Consider now the type of system-level manifestations that might be expected from intermittent faults, transient faults, and incorrect design. The experience reported below, derived from an extensive study of system crashes on C.mmp, a multiprocessor
in which 16 processors converse with 16 memories through a crosspoint switch, indicates that system-level fault behavior is complex [30]. Table 3-7 depicts the sources of crashes in C.mmp. Memory parity failures were the most common failure mode. Most were transient, but permanent errors occurred with regularity. Often the memory failure rate largely determined the mean time to crash (MTTC).
Table 3-7: Sources of crashes in C.mmp.

Error Source          Percent
Parity                32
System error          31
No response           18
False NXM             15
Illegal instruction   2
Other                 2
Transient errors were an especially large problem on C.mmp, since there were few error detection mechanisms in the logic. A similar weakness was evident in the software: often information about a failure was lost by the operating system, making recording of the conditions for transients unreliable. A transient error that eluded solution was the problem of "false NXMs." The processor reported a nonexistent memory (NXM) exception, but subsequent analysis showed that the memory was responding, and the instruction, registers, and index words were well-formed. No exception should have resulted. Timing problems were suspected, but there was insufficient information available to isolate the failure. Other long-standing transient errors were related to stack operations. These errors usually appeared as an incorrect execution of subroutine call/return instructions or interrupt entry/exit mistakes. The most common form of the error was having one too many (or too few) words pushed onto (or popped from) the stack. This transient error was relatively rare, and no method of recovery from it was ever developed. In order to bridge the gap between observed system behavior and logic-level fault models, carefully designed experiments must be conducted. Table 3-8 depicts the results of injecting transient stuck-at-one (zero) faults of various durations into different lines of the processor-memory bus of a Motorola 68000 [28]. The longer-duration faults were more likely to be detected, although the fault latency (time from fault injection until detection) was approximately constant. Faults in the instruction fetch cycle were much more likely to be detected than data faults, since the MC68000 detects illegal operation codes. Thus, a high percentage of even simple incorrect one (zero) faults have a long error detection latency, or go completely undetected. Table 3-9 lists the distribution of injected faults by the mechanisms that detected them.
A little over 47 percent of the faults were detected by the built-in error detection mechanisms in the MC68000 (e.g., odd address, illegal address, etc.). A further 8 percent of the faults caused a permanent change in the processor state, yet did not alter the output of the benchmark program. Next, the same faults were injected into one copy of three identical computations which used software voting to tolerate faults. Approximately 9 percent more of the faults were not only tolerated but did not produce disagreement among the voted outputs (and hence were not detected). A further 33 percent were detected and corrected by the voter. A little less than 2 percent of the faults were not correctable, even by voting. There is a large gap between logic-level fault models and system-level manifestations. Much work remains to be done before an acceptable system-level model can be developed. While the data in Tables 3-6 to 3-9 indicate that over half the manifestations of faults would fit one or more of the high-level fault models described in Section 1, a significant portion of observed failures do not fit these models. Even triplication and voting are insufficient to detect or recover from 100% of the faults.
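The triplication-with-voting experiment can be mimicked in a few lines: run three copies of a computation, flip a bit in the output of one (or two) copies, and take a bitwise 2-of-3 majority. The harness below is a toy illustration of the voting idea, not the instrumentation used in the study:

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority over three integer results."""
    return (a & b) | (a & c) | (b & c)

def run_replicated(compute, x, flips=()):
    """Run three copies of `compute`; `flips` lists (copy_index, bit) faults."""
    outs = [compute(x) for _ in range(3)]
    for copy, bit in flips:
        outs[copy] ^= 1 << bit          # inject a single-bit output error
    voted = majority_vote(*outs)
    detected = len(set(outs)) > 1       # the voter sees a disagreement
    return voted, detected

square = lambda x: x * x
# A single faulty copy is detected and outvoted: the voted result is correct.
print(run_replicated(square, 7, flips=[(0, 3)]))            # (49, True)
# The same bit flipped in two copies defeats 2-of-3 voting: the disagreement
# is detected, but the voted result (57) is wrong, matching the observation
# that even triplication and voting cannot recover from all faults.
print(run_replicated(square, 7, flips=[(0, 3), (1, 3)]))    # (57, True)
```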
Table 3-8: Faults injected into the bus of a Motorola 68000 processor.

                   Percent Detected   Average Latency (milliseconds)
Fault Duration
  1 Cycle          41                 13
  2 Cycles         61                 12
  3 Cycles         68                 14
Bus Cycle Type
  Instruction      69                 4.1
  Data             23                 82
So far, we have only considered average system behavior. What does the distribution of faults look like? Analysis of hard failures [30] indicates a constant failure rate with time. If λ is the constant failure rate, the system reliability as a function of time becomes the familiar exponential:

R(t) = e^(-λt)    (1)
Table 3-9: Percentage of faults detected by various mechanisms.

Effect of Fault                                            Percentage
Detected by MC68000                                        47
Detected by MC68000 (tolerated with permanent changes)     8
No voted data error                                        -
No voted data error (tolerated with permanent changes)     -
Voted data error                                           31
Voted data error (tolerated with permanent changes)        2
Unrecoverable errors (unmaskable errors in data or
  inability to completely execute benchmarks)              -
A constant failure rate is easy to model mathematically using an extensive battery of tools such as Markov models. Transient faults, on the other hand, have a decreasing failure rate with time. Figure 3-1 depicts the probability of system crash on a DEC System 20 at Carnegie Mellon University as a function of time since the last system crash. The longer the elapsed time since the last crash, the lower the probability that there will be a crash in the next hour. The probability of system crash over time can be modeled by a Weibull function:

R(t) = e^(-(λt)^α)    (2)

where λ is the scale parameter and α is the shape parameter. Note that for α = 1, Equation (2) reduces to the exponential given in Equation (1). For comparison, the Weibull function whose α, λ parameters best fit the data (i.e., the maximum-likelihood estimates, or MLE) is also plotted in Figure 3-1. Table 3-10 indicates that a wide range of transient phenomena - ranging from hardware (i.e., parity errors) to complex hardware/software systems (i.e., crashes), from single processors (i.e., DEC System 10s and 20s) to multiprocessors (i.e., Cm*, a 50-processor system) to fault-tolerant systems (i.e., C.vmp, a triplicated and voted system) - follow a Weibull distribution. Thus, modeling systems with transient faults by the exponential function can lead to substantial errors [5].
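Equations (1) and (2) are easy to compare numerically. The sketch below uses the DEC System 20 crash parameters from Table 3-10 (α = 0.286, λ = 0.0826 per hour) to exhibit the decreasing failure rate described in the text; the helper names are our own:

```python
import math

def exponential_R(t, lam):
    """Constant-failure-rate reliability, Equation (1): R(t) = exp(-lambda*t)."""
    return math.exp(-lam * t)

def weibull_R(t, lam, alpha):
    """Weibull reliability, Equation (2): R(t) = exp(-(lambda*t)**alpha)."""
    return math.exp(-((lam * t) ** alpha))

# With alpha = 1 the Weibull reduces exactly to the exponential.
assert abs(weibull_R(100, 0.02, 1.0) - exponential_R(100, 0.02)) < 1e-12

# DEC System 20 crash parameters from Table 3-10.
lam, alpha = 0.0826, 0.286

def hazard(t, dt=1.0):
    """P(crash in the next dt hours | system alive at time t)."""
    return 1.0 - weibull_R(t + dt, lam, alpha) / weibull_R(t, lam, alpha)

# alpha < 1 gives a decreasing failure rate: the longer the system has been
# up, the lower the chance of a crash in the next hour.
print(hazard(1.0) > hazard(10.0) > hazard(100.0))  # True
```

An exponential model (α = 1) would predict the same crash probability in every hour regardless of uptime, which is exactly the error warned against in [5].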
Figure 3-1: Distribution of TOPS-20 system reloads.

[Figure: histogram of crash probability (0.00-0.14) versus time since the last reload (0-120 hours), with the maximum-likelihood Weibull fit overlaid.]
Table 3-10: Weibull parameters for a variety of systems.

System          Error Detection   Mean Time To Error (hours)   α       λ
PDP-10          Parity            110                          0.481   0.0203
PDP-10          System Crash      13.4                         0.639   0.106
DEC System 20   System Crash      13.5                         0.286   0.0826
Cm*             Diagnostics       40.6                         0.779   0.0288
C.vmp           System Crash      96.5                         0.654   0.0146
IBM 3081        System Crash      69.2                         0.647   0.0205
4
Conclusion

The data in Section 3 indicate that system failures are predominantly transient in nature and follow a decreasing failure rate function (i.e., Weibull) rather than a constant failure rate function (i.e., exponential). System failures have diverse manifestations and diverse causes, ranging from errors in design to component aging. A substantial gap remains between actual system failures and the system failure models. Perhaps as little as 50 percent of the observed system failures fall into one or more of the high-level fault models defined in Section 1. Much work remains to effectively bridge this gap.
References

[1] Avizienis, A., "Architecture of Fault-Tolerant Computer Systems," 5th International Symposium on Fault-Tolerant Computing, IEEE, Paris, France, pp 3-16, 1975.
[2] Ball, M. O. and F. Hardie, "Effects and Detection of Intermittent Failures in Digital Systems," IBM 67-825-2137, 1967.
[3] Breuer, M. A., "Testing for Intermittent Faults in Digital Circuits," IEEE Transactions on Computers, Vol. C-22, pp 241-246, March 1975.
[4] Brodsky, M., "Hardening RAMs Against Soft Errors," Electronics, Vol. 53, April 24, 1980, McGraw-Hill.
[5] Castillo, X., S. R. McConnel, and D. P. Siewiorek, "Derivation and Calibration of a Transient Error Reliability Model," IEEE Transactions on Computers, Vol. C-31(7), pp 658-671, July 1982.
[6] Clune, E., "Analysis of the Fault Free Behavior of the FTMP Multiprocessor System," Technical Report CMU-CS-84-130, Carnegie Mellon University, 1984.
[7] Czeck, E., D. P. Siewiorek, and Z. Segall, "Fault Free Performance Validation of a Fault-Tolerant Multiprocessor: Baseline and Synthetic Workload Measurements," Technical Report CMU-CS-85-177, Carnegie Mellon University, Nov. 1985.
[8] Czeck, Edward W., Frank E. Feather, Ann Marie Grizzaffi, George B. Finelli, Zary Z. Segall, and Daniel P. Siewiorek, "Fault-Free Performance Validation of Avionic Multiprocessors," 7th Digital Avionics Systems Conference, Dallas, TX, October 1986.
[9] Czeck, Edward W., Frank E. Feather, Ann Marie Grizzaffi, Zary Z. Segall, and Daniel P. Siewiorek, "Fault-Free Performance Validation of a Fault-Tolerant Multiprocessor," NASA CR-178236, January 1987.
[10] Czeck, Edward W., Daniel P. Siewiorek, and Zary Z. Segall, "Software Implemented Fault Insertion: An FTMP Example," NASA CR-17823, October 1987.
[11] Czeck, Edward W., Daniel P. Siewiorek, and Zary Z. Segall, "Predeployment Validation of Fault-Tolerant Systems Through Software-Implemented Fault Insertion," NASA CR-4244, July 1989.
[12] Faulkner, T. L., C. W. Bartlett, and M. Small, "Hardware Logic Design Faults - a Classification and Some Measurements," 12th Annual International Symposium on Fault-Tolerant Computing, pp 377-380, Santa Monica, CA, June 1982.
[13] Feather, Frank, "Validation of a Fault-Tolerant Multiprocessor: Baseline Experiments and Workload Implementation," Technical Report CMU-CS-85-145, Carnegie Mellon University, July 1985.
[14] Geilhufe, M., "Soft Errors in Semiconductor Memories," Digest of Papers, COMPCON Spring 79, IEEE Computer Society, 1979.
[15] Grizzaffi, Ann Marie, "Fault Free Performance Validation of Fault-Tolerant Multiprocessors," Technical Report CMU-CS-86-127, Carnegie Mellon University, Nov. 1985.
[16] Kamal, S., "An Approach to the Diagnosis of Intermittent Faults," IEEE Transactions on Computers, Vol. C-24, pp 461-467, May 1975.
[17] Kamal, S. and C. V. Page, "Intermittent Faults: A Model and Detection Procedure," IEEE Transactions on Computers, Vol. C-23, pp 173-179, July 1974.
[18] Laprie, J.-C., "Dependable Computing and Fault Tolerance: Concepts and Terminology," IEEE 15th Annual International Symposium on Fault-Tolerant Computing, Ann Arbor, Michigan, pp 2-11, June 1985.
[19] Lamport, L., "Proving the Correctness of Multiprocess Programs," IEEE Transactions on Software Engineering, Vol. SE-3, No. 7, pp 125-133, March 1977.
[20] Losq, J., "Testing for Intermittent Failures in Combinational Circuits," Third USA-Japan Computer Conference, AFIPS-IPSJ, pp 165-170, 1978.
[21] McConnel, S. R., D. P. Siewiorek, and M. M. Tsao, "Transient Error Data Analysis," Technical Report, Carnegie-Mellon University, Department of Computer Science, May 1979.
[22] McGough, J. G., F. Swern, and S. J. Bavuso, "New Results in Fault Latency Modeling," Proceedings of the IEEE EASCON Conference, pp 299-306, August 1983.
[23] Monachino, M., "Design Verification System for Large-Scale LSI Designs," IBM Journal of Research and Development, Vol. 26, No. 1, pp 78-88, January 1982.
[24] Morganti, M., personal communication to the author, 1978.
[25] Ohm, V. J., "Reliability Considerations for Semiconductor Memories," Spring Digest of Papers, COMPCON, IEEE Computer Society, pp 207-209, 1979.
[26] Roth, J. P., W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Phase II of an Architectural Study for a Self-Repairing Computer," SAMSO-TR-67-106, U.S. Air Force Space and Missile Division, El Segundo, CA, 1967.
[27] Saris, J., "Testing for Intermittent Failures in Combinational Circuits by Minimizing the Mean Testing Time for a Given Test Quality," Third USA-Japan Computer Conference, AFIPS-IPSJ, pp 155-161, 1978.
[28] Schuette, M. A., J. P. Shen, D. P. Siewiorek, and Y. X. Zhu, "Experimental Evaluation of Two Concurrent Error Detection Schemes," IEEE 16th Annual International Symposium on Fault-Tolerant Computing, Vienna, Austria, pp 138-143, July 1986.
[29] Shen, J. P., W. Maly, and F. Joel Ferguson, "Inductive Fault Analysis of MOS Integrated Circuits," IEEE Design and Test of Computers, December 1985.
[30] Siewiorek, D. P., V. Kini, H. Mashburn, S. McConnel, and M. Tsao, "A Case Study of C.mmp, Cm*, and C.vmp: Part I - Experiences with Fault Tolerance in Multiprocessor Systems," Proceedings of the IEEE, pp 1178-1199, October 1978.
[31] Tasar, O. and V. Tasar, "A Study of Intermittent Faults in Digital Computers," AFIPS Conference Proceedings, Vol. 46, pp 807-811, Montvale, NJ, 1977.
[32] Toy, W. N., "Fault-Tolerant Design of Local ESS Processors," Proceedings of the IEEE, Vol. 66, No. 10, pp 1126-1145, October 1978.
The "Engineering" of Fault-Tolerant Distributed Computing Systems* Özalp Babaoğlu† Department of Computer Science, Cornell University, Ithaca, New York 14853-7501
ABSTRACT We view the design of fault-tolerant computing systems as an engineering endeavor. As such, this activity requires understanding the theoretical limitations and the scope of the feasible designs. We survey the impact that various environment characteristics and design choices have on the resultant system properties. We propose a single metric--the system reliability--as an appropriate measure for exploring tradeoffs within a potentially large design space.
1. Introduction
Continued and correct operation in the presence of failures are required attributes for an increasing number of computing systems [Kim84, Spec84]. Unfortunately, given a finite amount of hardware, it is impossible to construct a computing system that never fails. The best we can hope to achieve are systems that continue correct operation "with high probability." There are two complementary strategies for coping with failures. The first is to construct a computing system from components that are less likely to fail--the fault-avoidance approach. The second is to construct a system that continues to function correctly despite failures--the fault-tolerance approach [RLT78]. The fault-avoidance approach by itself is limited only by technological and economic factors, and presents no real conceptual challenges at the system design level. On the other hand, the fault-tolerance approach requires techniques for transparently masking failures or restarting the computation from some past state. In this paper, we consider only fault tolerance through replication in computing systems that are viewed at a level where the "components" consist of processors (each with associated memory and input/output devices) and communication hardware. We do not consider failures due to faulty software. Informally, a distributed computing system is a collection of autonomous processors that share no memory, do not have access to a global clock, and communicate only by exchanging messages. Recent developments in hardware technology, along with the inherently distributed nature of many applications, have made distributed systems a cost-effective alternative to centralized computing systems. Distributed algorithms rely on cooperation among processors. The lack of shared memory and random communication delays contribute to the difficulty of programming distributed systems. When processors are allowed to fail, this task becomes even more difficult.
* Partial support for this work was provided by the National Science Foundation under Grant DCR-86-01864 and AT&T under a Foundation Grant.
† Author's current address: Department of Mathematics, University of Bologna, Piazza Porta San Donato, 40127 Bologna, Italy.
Recently, considerable effort has been devoted to identifying the appropriate primitives and structures towards a "methodology" of fault-tolerant computing [Lamp84, SL85]. We now have the understanding necessary to incorporate fault tolerance into a large class of applications without having to reinvent algorithms for each special case. Our goal is to extend this methodology for designing fault-tolerant distributed computing systems so that it resembles the traditional "engineering" endeavor. This entails solving the appropriate sub-problems, understanding the interactions between the proposed solutions and the properties of the resulting system, and developing the measures necessary for exploring as wide a range of designs as possible in search of trade-offs. We make the analogy to engineering an aircraft. This activity is based on theoretical foundations including physics, aerodynamics, fluid mechanics, etc. It requires an understanding of the interactions between the numerous design parameters (e.g., construction material, wingspan, number of engines, etc.) and the properties of the resultant design (e.g., maximum speed, payload capacity, maximum altitude, etc.). Whether the final design resembles a hang glider or a wide-body Airbus is simply due to the tradeoffs that were made among the many possible designs that satisfy the requirements. In engineering fault-tolerant computing systems, we have the necessary theoretical foundations. What we lack is an understanding of the interactions between the design parameters and the system properties. This is essential if we hope to effectively explore the design space in search of tradeoffs.
Typically, the properties of a fault-tolerant system design are given in terms of the maximum number of processors that can fail (resiliency), the total number of messages that are exchanged (communication complexity), the number of rounds of message exchanges (time complexity) and the amount of computation performed during each round (computational complexity). Unfortunately, it is rarely the case that one design dominates another with respect to all four of these measures. In most cases, one cost can be traded for another cost. Consequently, it is quite difficult to say when one design is "better" than another design. Ultimately, the design goal for a fault-tolerant system is to guarantee a lower bound on the probability that it will perform the right action after it has been operating for some length of time. This metric, which we call the system reliability, subsumes all four cost measures defined above, making it suitable for exploring tradeoffs. Certainly, for many applications, other requirements such as total cost and performance may be equally important. For simplicity, we consider only the system reliability in evaluating alternative solutions. In the next section we outline the replicated system structure that constitutes the core of the fault-tolerant computing system design methodology. In comparing alternative solutions for the same problem, we distinguish between effects that are due to the characteristics of the external environment (and, thus, are usually beyond the control of the designer) and those that are due to choices made for the design parameters. In Section 3, we survey the impact that the external environment characteristics have on fault-tolerant systems. Then, in Section 4, we examine the alternative solutions that are possible by varying the design parameters. The reliability metric we propose reveals some subtle and counterintuitive interactions between certain system design parameters and the resulting fault tolerance.
2. System Structure Our goal is to reliably execute a single program that performs some arbitrary computation. Without loss of generality, we assume that the application cyclically reads data from an input source, performs a deterministic computation based on this data, and writes outputs. This structure for a computation has been called a state machine [Lamp78, Schn88]. To achieve fault tolerance, the state machine (including the input sources) is replicated and each instance executes on its own processor [Lamp84, SL85, GPD84]. Fault tolerance requirements usually dictate that the replicated system not rely on the correctness of any single component for its correct operation. Furthermore, the independence of processor failures necessitates that the processors be isolated, both physically and electrically. Consequently, when viewed at an appropriate level of abstraction, the replicated system has the same properties as a distributed system---autonomous processors with no shared resources that communicate by message exchange.
In the rest of the paper, we refer to a processor executing a state machine instance simply as a processor. The entire ensemble is called the system. A system-dependent threshold Ψ dictates the minimum number of correct processors that must exist to maintain system correctness in the presence of failures. Recall that processors compute results based on their input. If the same result is to be generated by all correct processors, they must use the same input value for their computation. This seemingly simple requirement is not always easy to achieve. For example, at any given time, correct analog sensors may differ slightly in their readings. In addition, processors may read their associated sensors at slightly different times. Therefore, since the processors cannot simply compute based on local data, they must execute a protocol to exchange local inputs and to decide on a common value to use as the input for the computation. In the presence of processor failures, a protocol that achieves this goal is called a distributed consensus protocol [PSL80, Fisc83, SD83, also many of the articles in this book]. Formally, in a system with n processors where each processor i has an initial local value vi, a protocol achieves consensus if upon termination each processor computes an n-element vector such that the following two conditions are satisfied:
C1: (Agreement) All correct processors compute a common vector W = (w1, w2, ..., wn);
C2: (Validity) The jth component of the vector is equal to the initial local value of processor j if j is correct (i.e., wj = vj for each j corresponding to a correct processor).
Typically, the common input value to be used for the computation is obtained by each processor's applying the same deterministic function (such as the arithmetic mean or median) to the vector W.
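As an illustration (ours, not from the paper), the final step of applying the same deterministic function to the agreed vector W might look like the following sketch; the median is one of the combining functions the text mentions:

```python
# Sketch: deriving the common input from the consensus vector W.
# Assumes the consensus protocol has already produced the same
# vector on every correct processor (conditions C1 and C2).

def combine(w):
    """Apply a deterministic function (here: the median) to the
    agreed n-element vector to obtain the common input value."""
    s = sorted(w)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

# Because every correct processor holds the same vector W and applies
# the same deterministic function, all correct processors compute the
# same input; an outlier from a faulty sensor is effectively masked.
w = [4.9, 5.1, 5.0, 97.0]   # one reading is from a faulty sensor
print(combine(w))            # -> 5.05
```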
This protocol for input dissemination can be easily extended to ensure that all elements of the consensus vector correspond to input values for the same iteration of the state machines [Schn82]. The correctness specification for each state machine replica is given as an input-output relation defined over all possible input vectors W. We assume that each processor is correctly programmed in the sense that it implements this relation in the absence of failures. We say that a system is correct if at least Ψ processors generate an output for the computation that is consistent with the input-output specification. The reliability of a system at time T is the probability that it is correct at time T given that it was initially correct. Each of the n processors in the system can be in one of two states: correct and faulty. Each processor starts out in the correct state, in which its behavior conforms to the specification encoded as a program. After some time, a processor may fail; thereafter it is called faulty. Note that "correctness" and "faultiness" are only classifications of the internal state of a processor. A faulty processor may deviate from its specification. We classify failures according to the deviations that are permitted. We assume that the times processors spend in the correct state before becoming faulty are independent and identically distributed exponential random variables with rate λ. In other words, each processor fails independently of all the others at a common constant failure rate. The exponential distribution assumption for failure times has been empirically justified for a large class of components, including electronic hardware, that do not "age" [SS82]. The independence of failures is typically achieved through physical and electrical isolation of the processors. Without loss of generality, we assume that λ = 1 in the rest of the paper by scaling all time variables in our discussion by 1/λ.
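This failure model is straightforward to simulate. The following sketch (our illustration, not from the paper) estimates the reliability of an ensemble by Monte Carlo: each of n processors draws an exponential failure time with rate λ = 1, so time is in units of the single-processor mean time to failure, and the system is counted correct at time T if at least Ψ processors (written `psi` below) are still correct:

```python
import random

def reliability(n, psi, T, trials=100_000, seed=1):
    """Estimate P(at least psi of n processors survive past time T),
    with i.i.d. exponential failure times of rate 1 (lambda = 1)."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        # A processor is still correct at T iff its failure time exceeds T.
        survivors = sum(1 for _ in range(n) if rng.expovariate(1.0) > T)
        if survivors >= psi:
            ok += 1
    return ok / trials

# A single processor survives past T with probability e^(-T);
# a majority-threshold ensemble can beat or lose to this depending
# on T (compare Section 4.1).
print(reliability(1, 1, 0.2))    # close to e^(-0.2), about 0.82
print(reliability(13, 7, 0.2))   # majority threshold, n = 13
```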
Since 1/λ is the expected value of an exponential random variable with rate λ, the unit of time in our results is the mean-time-to-failure interval of a single processor. We do not consider the possibility of repairing faulty processors.
3. Environment Characteristics
The global system structure implied by the methodology presented in the previous section is as follows:
• An application which is not tolerant to processor failures is replicated on each of the n processors.
• Each processor reads a value from its local input source.
• The ensemble executes a consensus protocol to exchange input values and to agree on the unique value to be used for the computation.
• Each processor independently computes the output corresponding to the input value.
• If there are sufficiently many processors generating correct outputs, the system state is correct.
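Putting the steps above together, one iteration at a single replica might be organized as in this sketch (ours; `read_sensor`, `consensus`, and `send_output` are hypothetical placeholders for the local input source, the consensus protocol, and the output channel, not an API from the paper):

```python
# Sketch of one iteration at a single replica in the replicated
# state-machine structure described above.

def median(w):
    """The same deterministic combining function on every replica."""
    s = sorted(w)
    return s[len(s) // 2]

def replica_step(state, read_sensor, consensus, compute, send_output):
    local = read_sensor()               # read a value from the local input source
    vector = consensus(local)           # agree on the vector of all inputs
    value = median(vector)              # same deterministic combiner everywhere
    state, out = compute(state, value)  # deterministic state-machine step
    send_output(out)                    # every correct replica emits this output
    return state

# Toy run: a perfect "consensus" that returns the same vector on every
# replica, and a state machine that accumulates its inputs.
outputs = []
state = replica_step(
    state=0,
    read_sensor=lambda: 5.0,
    consensus=lambda v: [5.0, 5.1, 4.9],
    compute=lambda s, v: (s + v, s + v),
    send_output=outputs.append,
)
print(state, outputs)   # -> 5.0 [5.0]
```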
While the computation itself is performed by each processor in isolation, the consensus protocol requires cooperation among the processors and is highly sensitive to the properties of the environment in which it is formulated. We survey these properties below.
3.1. Synchrony
We say that a system is synchronous if the relative processor speeds and the network message delivery delays are bounded and these bounds are known by the processors. A system is called completely asynchronous if neither the relative processor speeds nor the message delivery delays are bounded. A deterministic protocol is one that takes only deterministic steps in its computation. Alternatively, a randomized protocol can take computational steps based on non-deterministic events (such as coin tosses). It is known that the consensus problem has no deterministic solution in completely asynchronous systems in the presence of even a single failure [FLP85].† Fortunately, most realistic systems do exhibit bounded relative processor speeds and message delivery delays. We consider only synchronous systems in the rest of our discussions.
3.2. Faulty Processor Behavior
Before we can proceed with the design of a fault-tolerant system, we must decide on an adequate characterization of the behavior of faulty processors. We describe such behavior as deviations from the protocol that they are supposed to be executing. Based on this classification, we distinguish three failure models:
1. Crash Failure: A processor stops executing its protocol, never to resume again. Note that, since the only externally-visible behavior of processors in a distributed system is the messages they send, a crash failure is equivalent to a processor ceasing to send any more messages. Internally, it could be doing any arbitrary computation (e.g., executing an infinite loop).
2. Omission Failure: A processor fails to send some of the messages it is supposed to send. The messages it does send are always correct.
3. Byzantine (Malicious) Failure: A faulty processor exhibits arbitrary behavior. In particular, it may send messages not prescribed by its protocol, fail to send others, collude with other faulty processors to confound the system, and behave in any other arbitrary manner.
While crash failures are more restrictive than omission failures, the two failure models are equivalent with respect to the consensus problem in the sense that a solution for one can be transformed into a solution for the other [Hadz84, NT88]. For both failure models, consensus can be achieved in the presence of any number of faulty processors. The lower and upper bounds for the time complexity of consensus under these fault models coincide at f+1 rounds, where f is the protocol resiliency [Fisc83, Hadz84]. Byzantine failures represent a "non-assumption" about the behavior of faulty processors. Achieving consensus in the presence of Byzantine processor failures is considerably more expensive than with the other two failure models. In general, at least 3f+1 processors are required to tolerate up to f faulty ones [PSL80]. Any protocol must execute for at least f+1 rounds to reach consensus [FL82]. While f+1 rounds is a lower bound, currently no consensus protocol for Byzantine failures achieves this time complexity while exchanging only a polynomial number of bits.
† Recently, Dolev et al. have given a much finer characterization of the minimal synchrony necessary for the solvability of the consensus problem [DDS87].
The reliability of the system when subjected to crash or omission failures is simply the probability that at least one out of the n processors generates an output at the end of the computation. In other words, we have the correctness threshold Ψ=1. Therefore, the reliability of a replicated system for these failure models will typically be greater than that of a single processor, as long as the distributed consensus phase is negligible in length with respect to the rest of the computation time. On the other hand, the system reliability in the presence of Byzantine processor failures is given as the probability that the consensus protocol succeeds and that at least Ψ of the processors remain correct throughout the computation phase. We will evaluate these expressions in Section 4. Note that if the desired response from the system is a single output to the external world, despite up to f Byzantine failures, the replicated system cannot have a correctness threshold less than a majority (i.e., Ψ>f) and it must rely on a single voter to implement it [GPD84].
3.3. Authentication
The results we presented above are relevant for systems in which a faulty processor can both undetectably forge messages on behalf of other processors and also tamper with the contents of messages it is forwarding. If the system supports an authentication mechanism (such as digital signatures [DH76]), then the undetectable behavior of even Byzantine processors is severely restricted. In fact, in a system with authentication, a consensus protocol can tolerate any number of Byzantine processor failures [LSP82]. Furthermore, consensus can be reached in f+1 rounds, thus matching the lower bound for time complexity [DS83]. Let us examine the reliability of the state machine ensemble with authentication and Byzantine processor failures. While the consensus protocol for input dissemination can tolerate any number of failures, the correctness of the system can be guaranteed only if the number of faulty processors by the end of the computation phase represents a minority of the total number of processors. Unlike crash or omission failures, a processor exhibiting Byzantine failure can still lie about its own state even with authentication.
3.4. Communication Model
Up to this point, we have assumed that the network used for communication among processors is a perfectly-reliable, fully-connected point-to-point graph. Even if the network remains perfectly reliable, we cannot achieve consensus in systems that communicate through sparsely-connected networks. It is known that the connectivity of the network must be at least 2f+1 without authentication and f+1 with authentication [Dole82]. One simple technique to deal with unreliable network components is to model the failure of a link between two processors as the failure of one of the processors. While not requiring any new mechanisms, this technique results in overly-pessimistic designs, since f is now the sum of the number of processor and link failures. More realistic designs can be obtained by introducing new failure models that explicitly deal with communication failures or account for them more accurately [PT86, Hadz88]. Recently, we have studied the time complexity of consensus protocols in systems that communicate through networks other than point-to-point graphs [BSD88]. A typical architecture for distributed systems consists of several processor clusters on a common network in which the intra-cluster communication takes place over a shared, multiple-access medium that supports broadcasts [Stal84]. This broadcast network-based architecture encompasses a wide range of designs. For example, the clusters could represent geographically distant local area networks connected through gateways (such as the Xerox Internet comprising a large number of Ethernets [MB76]). Alternatively, each cluster could be a single, tightly-coupled multiprocessor with an internal bus interconnect and inter-cluster links implemented as bus adaptors. Regardless of the physical realization of the architecture, we can abstract the behavior of communication in such systems as follows:
BNP (Broadcast Network Property): In response to a broadcast, all processors that receive a message receive the same message.
This property ensures that for all possible failures, a processor cannot send conflicting messages to other processors in a single broadcast. For a given broadcast network, the set of processors that receive the (same) message in response to a broadcast is called the receiving set. The broadcast degree of a network is defined to be a lower bound on the size of the receiving sets for all broadcasts in the network. The receiving sets may vary from one broadcast message to another as well as from one sender to another. For a network to have broadcast degree b, all we require is that each of these sets contain at least b processors. We assume that every processor receives its own broadcasts regardless of failures. Note that communication failures manifest themselves in defining a particular broadcast degree for the network. Given this framework, we have shown that in a system with n processors where the broadcast degree is b, consensus can be achieved in 2 rounds in the presence of Byzantine failures and no authentication as long as b > f + n/2 [BD85]. For broadcast degrees smaller than f, achieving consensus in the presence of Byzantine failures requires f+1 rounds, just as it does in point-to-point networks [BSD88]. For omission failures, the condition b ≥ f suffices to achieve a 2-round solution. Furthermore, in the case of omission failures, we were able to obtain a parametrized solution that achieves consensus in f−b+3 rounds for any broadcast degree 2 ≤ b < f [BSD88]. The networks that are characterized by this range of broadcast degrees span the entire spectrum from point-to-point graphs to full broadcast networks. These results reveal a new dimension in the design space of fault-tolerant systems--performance and resiliency can be traded for network cost and complexity.
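As an illustration, the round counts quoted in this subsection can be transcribed directly (our sketch; the conventions for values of b outside the quoted ranges are our own reading of the text):

```python
# Rounds to reach consensus on a network of broadcast degree b,
# transcribing the results quoted above from [BD85, BSD88].

def omission_rounds(f, b):
    """Rounds needed under omission failures."""
    if b >= f:
        return 2              # b >= f suffices for a 2-round solution
    if 2 <= b < f:
        return f - b + 3      # parametrized solution for 2 <= b < f
    return f + 1              # point-to-point case

def byzantine_rounds(n, f, b):
    """Rounds needed under Byzantine failures, no authentication."""
    if b > f + n / 2:
        return 2              # fast 2-round consensus [BD85]
    return f + 1              # otherwise, as in point-to-point networks

print(omission_rounds(f=5, b=3))          # -> 5
print(byzantine_rounds(n=10, f=2, b=8))   # -> 2
```

Note how the omission-failure formula interpolates smoothly: at b = 2 it costs f+1 rounds (the point-to-point bound), and each unit of extra broadcast degree saves one round until the 2-round regime is reached.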
4. Design Parameters
Having defined the context in which a fault-tolerant system is to be designed, we next examine the properties of the resultant system as a function of the choices that are made for the various design parameters.
4.1. Replication Level
Perhaps the first question that the designer of a fault-tolerant computing system must answer is: how many times should the state machine be replicated? Although this question has economic as well as reliability implications, we will address only the reliability issues. Recall that if the design specification of the system is given in terms of the resiliency (f), then the characteristics of the external environment dictate the minimum level of replication. For example, with Byzantine failures and no authentication, n > 3f. If the system supports authentication, then n > 2f. If the failures are restricted to crash or omission, then the replication level need only be one greater than the resiliency (n > f). If the design specification of the system is given in terms of reliability, the answer to the replication-level question is not so clear. It is well known that replicated systems can have worse reliability than their non-replicated counterparts for certain system parameters [SS82]. Intuitively, as there are more processors in the system due to replication, there are more sources of failures. Consequently, the probability that enough processors will be correct arbitrarily late in the computation phase will be smaller than the probability of a single processor's being correct. The reliability of state machine ensembles indeed exhibits this behavior as a function of the replication level. Let α denote the time required for the system to implement a round. In Figure 1, we plot the reliability of systems subject to Byzantine failures with authentication and with various levels of replication and round lengths, where the correctness threshold, Ψ, is a majority. In the figures, m denotes the number of rounds that the consensus protocol is executed to disseminate the input. For the derivations of the expressions used to generate the figures in this paper, we refer the reader to [Baba87b].
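Ignoring the cost of the consensus phase itself (a simplification; the paper's full derivations are in [Baba87b]), the qualitative effect of replication can be illustrated with a simple binomial computation: each processor survives a computation of length T with probability e^(-T), and a majority-threshold ensemble is correct if at least Ψ processors survive. Our sketch:

```python
import math

def ensemble_reliability(n, psi, T):
    """P(at least psi of n processors survive past T), each surviving
    independently with probability exp(-T). The consensus-phase cost
    is ignored in this simplification."""
    p = math.exp(-T)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(psi, n + 1))

# For short computations, replication (n=13, majority psi=7) beats a
# single processor; for sufficiently long computations, the single
# processor has higher reliability, as the text observes.
for T in (0.2, 1.5):
    print(T, round(ensemble_reliability(13, 7, T), 4),
          round(math.exp(-T), 4))
```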
Recall that the unit of time in our results is the mean-time-to-failure of a single processor. With today's technology, it is trivial to achieve mean-time-to-failures for processors in excess of 100 hours. Typically, processor and communication network speeds allow rounds to be realized in at most several seconds.
[Figure 1: Reliability of state machine ensembles. Panel (a): n=3, m=2; panel (b): n=13, m=5. Each panel plots system reliability against computation time for several round lengths α (from α=0.00001 upward), together with the non-replicated "single" alternative.]
Consequently, realistic normalized (i.e., after scaling by the mean-time-to-failure) values for α are of the order 10^-5. The figures include larger values for the round length simply to explore as large a design space as possible. The reliability of the non-replicated design alternative is labeled "single" in the figures. We note that for different parameter values, there are distinct intervals for the computation time in which the non-replicated single processor has higher reliability than the replicated system. In fact, for sufficiently long computation times (perhaps unrealistically long), the single processor has uniformly higher reliability than the replicated system. As the replication level is increased, the computation time that delineates highly reliable systems from highly unreliable systems becomes more sharply defined. We can formally show that as n → ∞ and α → 0, the system remains perfectly reliable during computation times up to ln(n/(Ψ−1)) and is incorrect with certainty for computations that last longer than this time [Baba87b]. In the case of a majority threshold, this transition time has the limit ln(2) ≈ 0.693.
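The sharpening of this transition with increasing replication can be checked numerically with the same simplified binomial model as before (our illustration; the consensus-phase cost is again ignored, and Ψ is taken as the smallest majority):

```python
import math

def ensemble_reliability(n, psi, T):
    """P(at least psi of n processors survive past T), each surviving
    independently with probability exp(-T)."""
    p = math.exp(-T)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(psi, n + 1))

# As n grows, reliability jumps from near 1 to near 0 around the
# limiting majority-threshold transition time ln(2) = 0.693: compare
# T = 0.6 (just below) with T = 0.8 (just above) as n increases.
for n in (13, 101, 1001):
    psi = n // 2 + 1                      # smallest majority
    print(n, round(ensemble_reliability(n, psi, 0.6), 3),
          round(ensemble_reliability(n, psi, 0.8), 3))
```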
4.2. Consensus Protocol Running Time
In a system with n processors and Byzantine failures, an m-round consensus protocol execution is at least (m−1)-resilient in the sense that consensus is guaranteed provided there are no more than m−1 faulty processors (i.e., f < m). Given our model where processors fail independently according to an exponential distribution, there could be m-round executions where consensus is achieved despite the fact that the protocol resiliency is violated (i.e., f ≥ m). In [Baba87a] we characterize such executions and derive expressions for the probability of their occurrence. Using these results, in this section we explore the tradeoffs that are involved in selecting the running time of the input consensus protocol. All other factors being equal, an (m+1)-round execution has a higher probability of achieving consensus than an m-round execution. Figure 2a displays the probability with which an m-round execution achieves consensus in a system with 13 processors and various round lengths. Note that for small (realistic) values of α, a 2 to 3-round execution defines the point of diminishing returns--the system achieves consensus with high probability for this execution length and any additional rounds of execution result in negligible gain. However, for large values of α, each additional round of execution increases the probability of consensus by a significant amount. Next we investigate the effect of input consensus protocol execution time on the reliability of a state machine ensemble. Although the consensus protocol itself always benefits from additional rounds of execution, this benefit is gained at the expense of having fewer correct processors available for the application computation. Thus, the increase in the probability of the consensus phase succeeding may be offset by the decrease in the probability of the system achieving the correctness threshold.
Figure 2b shows the reliability of a system with 13 processors as a function of the input consensus protocol running time and various computation times. When the computation time is short with respect to the consensus phase, the system reliability increases for up to about 5 rounds of consensus execution and then starts to decrease. For computation times that are longer, the consensus protocol execution length that maximizes system reliability becomes shorter. In fact, for sufficiently long computations, the reliability is a monotone decreasing function of m, suggesting that the system is better off not wasting its time with consensus and commencing the computation immediately after a single round of input data exchange.
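The tradeoff can be sketched with a crude Monte Carlo experiment (ours, and far cruder than the analysis of [Baba87a]: we count a run as successful only if at most m−1 processors fail during the m·α consensus phase and at least Ψ processors survive the whole run, ignoring the executions in which consensus succeeds despite more failures):

```python
import random

def run_reliability(n, psi, m, alpha, T, trials=20_000, seed=7):
    """Estimate system reliability as a function of the number of
    consensus rounds m: the consensus phase lasts m*alpha and must see
    at most m-1 failures; the whole run lasts m*alpha + T and must end
    with at least psi correct processors."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        fails = [rng.expovariate(1.0) for _ in range(n)]
        during_consensus = sum(1 for t in fails if t <= m * alpha)
        survivors = sum(1 for t in fails if t > m * alpha + T)
        if during_consensus <= m - 1 and survivors >= psi:
            ok += 1
    return ok / trials

# More rounds help the consensus phase succeed but leave fewer correct
# processors for the computation; for long T the optimal m shrinks.
for m in (1, 3, 6, 12):
    print(m, run_reliability(n=13, psi=7, m=m, alpha=0.03, T=0.6))
```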
4.3. Fault Detection
In Section 4.2 we saw that in a system with Byzantine failures, a state machine ensemble became arbitrarily unreliable after sufficiently long computation times. This is due to the accumulation of faulty processors in the system, such that there comes a time when the failure of a single additional processor tips the scale away from the correctness threshold. Intuitively, we need to be able to periodically identify the processors that have failed and remove them from the system. In this section, we study system reliability in the presence of a protocol that can detect all processors that have failed by θ time units into the application computation and remove them from the system.
[Figure 2: Effect of consensus protocol running time for n=13. Panel (a): probability of consensus as a function of the number of rounds, for various round lengths α. Panel (b): system reliability as a function of the number of rounds (α=0.03), for computation times T=0.3, T=0.6, and T=0.9.]
[Figure 3: The effect of fault detection, for n=13, α=0.00001, m1=3. Curves show system reliability against computation time for several fault-detection times θ and corresponding protocol lengths m2 (e.g., θ=0.1 with m2=5, θ=0.2 with m2=6, θ=0.3 with m2=7), together with the non-replicated "single" alternative.]

In [Baba87b] we discuss methods that are suitable for implementing such a fault detector based on consensus protocols. In Figure 3 we illustrate the effect of executing a fault detection protocol θ units into the application computation for a system with 13 processors. The consensus protocol used for input dissemination is always executed for m1 rounds (m1=3 in the example), while that used for fault detection is executed a variable number of rounds (denoted m2) depending on the value of θ. Note that the system in Figure 3 has reliabilities similar to those depicted in Figure 1b until time T=θ. At this point, the system maintains its current reliability level for some time and then resumes its deterioration. If fault detection is performed early enough in the computation, high system reliability can be maintained much longer than is possible without fault detection. Obviously, the technique could be extended to perform fault detection every θ units of the computation rather than only once.

5. Conclusions
We have presented the reliability of a fault-tolerant computing system as a metric that is more suitable than resiliency in evaluating different designs. Not only is the reliability metric more in tune with the design specifications for most applications, it also captures most of the subtle and counterintuitive interactions between design parameters and system properties. We surveyed the implications of external environment characteristics such as failure models, synchrony, authentication and communication models on fault-tolerant system properties. We then examined the properties of alternative solutions due to design decisions. We showed that increases in the replication level serve to more sharply delineate computation times that separate highly reliable systems from highly unreliable ones. One of our counterintuitive results involves selecting the running time of the consensus protocol that is used to disseminate the inputs to the state machine replicas. For certain system parameters, we have seen that increasing the protocol running time actually decreases the overall system reliability. Finally, we have shown that the ability to detect faulty processors and remove them from the system is extremely effective with respect to reliability.
Our results are a first step towards completing the design methodology for building fault-tolerant computing systems. Although we were able to evaluate several design choices individually using the system reliability as the criterion, a realistic design endeavor has to cope with alternatives resulting from varying several parameters simultaneously. Furthermore, the evaluation criterion has to include measures other than reliability such as performance. We feel that our approach is suitable for extension in any one of these directions.
Acknowledgments
I am grateful to Fred Schneider, Rogério Drummond and Pat Stephenson for discussions that helped me formulate many of the ideas surveyed in this paper. Comments from Barbara Simons helped improve the presentation.
References
Baba87a  O. Babaoglu, Stopping times of distributed consensus protocols: a probabilistic analysis. Information Processing Letters, vol. 25, no. 3, pp. 163-169, May 1987.
Baba87b  O. Babaoglu, On the reliability of consensus-based fault-tolerant distributed computing systems. ACM Trans. on Computer Systems, vol. 5, no. 4, pp. 394-416, November 1987.
BD85  O. Babaoglu and R. Drummond, Streets of Byzantium: Network architectures for fast reliable broadcasts. IEEE Trans. Software Eng., vol. SE-11, no. 6, pp. 546-554, June 1985.
BSD88  O. Babaoglu, P. Stephenson and R. Drummond, Reliable broadcast protocols and communication models: Tradeoffs and lower bounds. Springer-Verlag Distributed Computing, (to appear).
DH76  W. Diffie and M. Hellman, New directions in cryptography. IEEE Trans. on Inf. Theory, vol. IT-22, pp. 644-654, 1976.
Dole82  D. Dolev, The Byzantine Generals strike again. Journal of Algorithms, vol. 3, no. 1, pp. 14-30, 1982.
DDS87  D. Dolev, C. Dwork and L. Stockmeyer, On the minimal synchronism needed for distributed consensus. Journal of the ACM, vol. 34, no. 1, pp. 77-97, January 1987.
DS83  D. Dolev and H. R. Strong, Authenticated algorithms for Byzantine Agreement. SIAM J. Comput., vol. 12, no. 4, pp. 656-666, November 1983.
Fisc83  M. J. Fischer, The consensus problem in unreliable distributed systems (a brief survey). Tech. Rep. YALEU/DCS-RR-273, Dept. of Computer Science, Yale University, New Haven, Connecticut, June 1983.
FL82  M. J. Fischer and N. A. Lynch, A lower bound for the time to assure interactive consistency. Information Processing Letters, vol. 14, no. 4, pp. 183-186, April 1982.
FLP85  M. J. Fischer, N. A. Lynch and M. S. Paterson, Impossibility of distributed consensus with one faulty process. Journal of the ACM, vol. 32, no. 2, pp. 374-382, April 1985.
GPD84  H. Garcia-Molina, F. Pittelli and S. Davidson, Applications of Byzantine Agreement in database systems. Tech. Rep. TR 316, Princeton University, Princeton, New Jersey, June 1984.
Hadz84  V. Hadzilacos, Issues of fault tolerance in concurrent computations. Ph.D. Thesis, Tech. Rep. TR-11-84, Aiken Computation Laboratory, Harvard University, Cambridge, Mass., June 1984.
Hadz88  V. Hadzilacos, Connectivity requirements for Byzantine Agreement under restricted types of failures. Springer-Verlag Distributed Computing, (to appear).
Kim84  W. Kim, Highly available systems for database applications. ACM Computing Surveys, vol. 16, no. 1, pp. 71-98, March 1984.
Lamp84  L. Lamport, Using time instead of timeout for fault-tolerant distributed systems. ACM Trans. on Programming Languages and Systems, vol. 6, no. 2, pp. 254-280, April 1984.
MB76  R. Metcalfe and D. R. Boggs, Ethernet: Distributed packet switching for local computer networks. Commun. ACM, vol. 19, no. 7, pp. 396-403, July 1976.
NT88  G. Neiger and S. Toueg, Automatically increasing the fault-tolerance of distributed systems. Proc. of the 7th ACM Symposium on Principles of Distributed Computing, Toronto, Canada, August 1988 (to appear).
PSL80  M. Pease, R. Shostak and L. Lamport, Reaching agreement in the presence of faults. Journal of the ACM, vol. 27, no. 2, pp. 228-234, April 1980.
PT86  K. J. Perry and S. Toueg, Distributed agreement in the presence of processor and communication faults. IEEE Trans. on Software Engineering, vol. SE-12, no. 3, pp. 477-482, March 1986.
RLT78  B. Randell, P. A. Lee, and P. C. Treleaven, Reliability issues in computing system design. ACM Computing Surveys, vol. 10, no. 2, pp. 123-166, June 1978.
Schn82  F. B. Schneider, Synchronization in distributed programs. ACM Trans. Programming Languages and Systems, vol. 4, pp. 125-148, April 1982.
Schn88  F. B. Schneider, The state machine approach: A tutorial. This volume.
SL85  F. B. Schneider and L. Lamport, Paradigms for distributed programs. In Distributed Systems: Methods and Tools for Specification, M. Paul and H. J. Siegert (Eds.), Springer-Verlag Lecture Notes in Computer Science, vol. 190, 1985.
SS82  D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass., 1982.
Spec84  A. Z. Spector, Computer software for process control. Scientific American, vol. 251, no. 3, pp. 174-187, September 1984.
Stal84  W. Stallings, Local networks. ACM Computing Surveys, vol. 16, no. 1, pp. 3-41, March 1984.
SD83  H. R. Strong and D. Dolev, Byzantine agreement. In Digest of Papers, Spring Compcon 83, San Francisco, California, pp. 77-81, March 1983.
Bibliography for Fault-Tolerant Distributed Computing
Compiled by Brian A. Coan
Bell Communications Research, 435 South St., Morristown, N.J. 07960
1 Analysis & Applications of Consensus Protocols

• O. Babaoglu, "Stopping Time of Distributed Consensus Protocols: A Probabilistic Analysis," Information Processing Letters, vol. 25, no. 3, pp. 163-169, May 1987.
• O. Babaoglu, "On the Reliability of Consensus-Based Fault-Tolerant Distributed Computing Systems," ACM Transactions on Computer Systems, vol. 5, no. 4, pp. 394-416, Nov. 1987.
• M. J. Fischer, "The Consensus Problem in Unreliable Distributed Systems (A Brief Survey)," Technical Report YALEU/DCS-RR-273, Yale University, June 1983.
• H. Garcia-Molina, F. Pittelli, and S. B. Davidson, "Is Byzantine Agreement Useful in a Distributed Database?" Proceedings of the 3rd Annual ACM Symposium on Principles of Database Systems, pp. 61-69, Mar. 1984.
• J. N. Gray, "The Cost of Messages," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 1-7, Aug. 1988.
• Y. C. Tay, "The Reliability of (k, n)-Resilient Distributed Systems," Proceedings of the 4th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 119-122, Oct. 1984.
• J. H. Wensley, L. Lamport, J. Goldberg, M. W. Green, K. N. Levitt, P. M. Melliar-Smith, R. E. Shostak, and C. B. Weinstock, "SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control," Proceedings of the IEEE, vol. 66, no. 10, pp. 1240-1255, Oct. 1978.
2 Atomic Registers & Wait-Free Synchronization

• K. Abrahamson, "On Achieving Consensus Using a Shared Memory," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 291-302, Aug. 1988.
• B. Bloom, "Constructing Two-Writer Atomic Registers," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 249-259, Aug. 1987.
• J. E. Burns and G. L. Peterson, "Constructing Multi-Reader Atomic Values from Non-Atomic Values," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 222-231, Aug. 1987.
• B. Chor, A. Israeli, and M. Li, "On Processor Coordination Using Asynchronous Hardware," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 86-97, Aug. 1987.
• A. Gottlieb, B. D. Lubachevsky, and L. Rudolph, "Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors," ACM Transactions on Programming Languages and Systems, vol. 5, no. 2, pp. 164-189, Apr. 1983.
• M. P. Herlihy, "Impossibility and Universality Results for Wait-Free Synchronization," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 276-290, Aug. 1988.
• C. P. Kruskal, L. Rudolph, and M. Snir, "Efficient Synchronization of Multiprocessors with Shared Memory," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 218-228, Aug. 1986.
• L. Lamport, "Concurrent Reading and Writing," Communications of the ACM, vol. 20, no. 11, pp. 806-811, Nov. 1977.
• L. Lamport, "On Interprocess Communication," parts I and II, Distributed Computing, vol. 1, no. 2, pp. 77-101, 1986.
• M. C. Loui and H. H. Abu-Amara, "Memory Requirements for Agreement Among Unreliable Asynchronous Processes," Advances in Computing Research: Parallel and Distributed Computing, vol. 4, JAI Press, pp. 163-183, 1987.
• R. Newman-Wolfe, "A Protocol for Wait-Free, Atomic, Multi-Reader Shared Variables," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 232-249, Aug. 1987.
• G. L. Peterson, "Concurrent Reading while Writing," ACM Transactions on Programming Languages and Systems, vol. 5, no. 1, pp. 46-55, Jan. 1983.
• G. L. Peterson and J. E. Burns, "Concurrent Reading while Writing II: The Multi-Writer Case," Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science, pp. 383-392, Oct. 1987.
• S. A. Plotkin, "Sticky Bits and Universality of Consensus," manuscript, 1988.
• R. Schafer, "On the Correctness of Atomic Multi-Writer Registers," B.S. Thesis, Massachusetts Institute of Technology, June 1988. (Available as Technical Report MIT/LCS/TM-364.)
• A. K. Singh, J. H. Anderson, and M. G. Gouda, "The Elusive Atomic Register Revisited," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 206-221, Aug. 1987.
• P. Vitanyi and B. Awerbuch, "Atomic Shared Register Access by Asynchronous Hardware," Proceedings of the 27th Annual IEEE Symposium on Foundations of Computer Science, pp. 223-243, Oct. 1986. (See also errata in SIGACT News, vol. 18, no. 4, Summer 1987.)
3 Asynchronous Consensus Protocols

• C. Attiya, A. Bar-Noy, D. Dolev, D. Koller, D. Peleg, and R. Reischuk, "Achievable Cases in an Asynchronous Environment," Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science, pp. 337-346, Oct. 1987.
• C. Attiya, D. Dolev, and J. Gil, "Asynchronous Byzantine Consensus," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 119-133, Aug. 1984.
• M. Ben-Or, "Fast Asynchronous Byzantine Agreement," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 149-151, Aug. 1985.
• M. Ben-Or, "Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 27-30, Aug. 1983.
• G. Bracha, "Asynchronous Byzantine Agreement Protocols," Information and Computation, vol. 75, no. 2, pp. 130-143, Nov. 1987.
• G. Bracha and S. Toueg, "Resilient Consensus Protocols," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 12-26, Aug. 1983.
• G. Bracha and S. Toueg, "Asynchronous Consensus and Broadcast Protocols," Journal of the ACM, vol. 32, no. 4, pp. 824-840, Oct. 1985.
• C. Dwork, N. A. Lynch, and L. Stockmeyer, "Consensus in the Presence of Partial Synchrony," Journal of the ACM, vol. 35, no. 2, pp. 288-323, Apr. 1988.
• A. D. Fekete, "Asynchronous Approximate Agreement," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 64-76, Aug. 1987.
4 Broadcast Protocols

• O. Babaoglu and R. Drummond, "Time-Communication Tradeoffs for Reliable Broadcast Protocols," Technical Report TR-85-687, Cornell University, Dec. 1985.
• O. Babaoglu and R. Drummond, "Streets of Byzantium: Network Architectures for Fast Reliable Broadcasts," IEEE Transactions on Software Engineering, vol. SE-11, no. 6, pp. 546-554, June 1985.
• O. Babaoglu and P. Stephenson, "Reliable Broadcasts through Partial Broadcasts," Technical Report TR-85-720, Cornell University, Dec. 1985.
• R. Bar-Yehuda, O. Goldreich, and A. Itai, "On the Time Complexity of Broadcast in Radio Networks: An Exponential Gap Between Determinism and Randomization," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 98-108, Aug. 1987.
• K. P. Birman and T. A. Joseph, "Reliable Communication in the Presence of Failures," ACM Transactions on Computer Systems, vol. 5, no. 1, pp. 47-76, Feb. 1987.
• J. M. Chang and N. F. Maxemchuk, "Reliable Broadcast Protocols," ACM Transactions on Computer Systems, vol. 2, no. 3, pp. 251-273, Aug. 1984.
• B. Chor and M. O. Rabin, "Achieving Independence in Logarithmic Number of Rounds," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 260-268, Aug. 1987.
• F. Cristian, H. Aghili, H. R. Strong, and D. Dolev, "Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement," 15th International Conference on Fault-Tolerant Computing, June 1985.
• F. B. Schneider, D. Gries, and R. D. Schlichting, "Fault-Tolerant Broadcasts," Science of Computer Programming, vol. 4, no. 1, pp. 1-15, Jan. 1984.
5 Clocks and Clock Synchronization

• F. Cristian, H. Aghili, and H. R. Strong, "Clock Synchronization in the Presence of Omission and Performance Faults and Processor Joins," 16th International Conference on Fault-Tolerant Computing, July 1986.
• D. Dolev, J. Y. Halpern, and H. R. Strong, "On the Possibility and Impossibility of Achieving Clock Synchronization," Journal of Computer and System Sciences, vol. 32, no. 2, pp. 230-250, Apr. 1986.
• J. Y. Halpern, B. Simons, H. R. Strong, and D. Dolev, "Fault-Tolerant Clock Synchronization," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 89-102, Aug. 1984.
• L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, vol. 21, no. 7, pp. 558-565, July 1978.
• L. Lamport and P. M. Melliar-Smith, "Byzantine Clock Synchronization," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 68-74, Aug. 1984.
• J. Lundelius and N. A. Lynch, "An Upper and Lower Bound for Clock Synchronization," Information and Control, vol. 62, no. 2, pp. 190-204, Aug. 1984.
• K. Marzullo, Loosely-Coupled Distributed Services: A Distributed Time Service, Ph.D. Thesis, Stanford University, 1983.
• K. Marzullo and S. Owicki, "Maintaining the Time in a Distributed System," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 295-305, Aug. 1983.
• F. B. Schneider, "A Paradigm for Reliable Clock Synchronization," Proceedings of the Advanced Seminar on Real-Time Local Area Networks, Bandol, France, Apr. 1986.
• T. K. Srikanth and S. Toueg, "Optimal Clock Synchronization," Journal of the ACM, vol. 34, no. 3, pp. 626-645, July 1987.
• J. L. Welch and N. A. Lynch, "A New Fault-Tolerant Algorithm for Clock Synchronization," Information and Computation, vol. 77, no. 1, pp. 1-36, Apr. 1988.
6 Concurrency Control, Transaction Commitment, and Recovery

• P. A. Bernstein and N. Goodman, "Concurrency Control in Distributed Database Systems," ACM Computing Surveys, vol. 13, no. 2, pp. 185-222, June 1981.
• P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, Reading, Massachusetts, 1987.
• H. Breitwieser and M. Leszak, "A Distributed Transaction Processing Protocol Based on Majority Consensus," Proceedings of the 1st Annual ACM Symposium on Principles of Distributed Computing, pp. 224-237, Aug. 1982.
• D. Cheung and T. Kameda, "Site-Optimal Termination Protocols for a Distributed Database Under Network Partitioning," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 111-121, Aug. 1985.
• B. A. Coan and J. Lundelius, "Transaction Commit in a Realistic Fault Model," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 40-51, Aug. 1986.
• S. B. Davidson, "Optimism and Consistency in Partitioned Distributed Database Systems," ACM Transactions on Database Systems, vol. 9, no. 3, pp. 456-481, Sept. 1984.
• D. Dolev and H. R. Strong, "Distributed Commit with Bounded Waiting," Proceedings of the 2nd IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 53-60, July 1982.
• C. Dwork and D. Skeen, "The Inherent Cost of Nonblocking Commitment," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 1-11, Aug. 1983.
• D. Eager and K. Sevcik, "Achieving Robustness in Distributed Database Systems," ACM Transactions on Database Systems, vol. 8, no. 3, pp. 354-381, Sept. 1983.
• K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger, "The Notion of Consistency and Predicate Locks in a Database System," Communications of the ACM, vol. 19, no. 11, pp. 624-633, Nov. 1976.
• J. N. Gray, "Notes on Database Operating Systems," Lecture Notes in Computer Science, vol. 60, Goos and Hartmanis, eds., Springer-Verlag, Berlin, 1978.
• J. N. Gray, "A Transaction Model," Technical Report RJ2895, IBM Corp., Aug. 1980.
• J. N. Gray, P. McJones, M. Blasgen, B. G. Lindsay, R. A. Lorie, T. Price, F. Putzolu, and I. L. Traiger, "The Recovery Manager of the System R Database Manager," ACM Computing Surveys, vol. 13, no. 2, pp. 223-242, June 1981.
• T. Härder and A. Reuter, "Principles of Transaction-Oriented Database Recovery," ACM Computing Surveys, vol. 15, no. 4, pp. 287-318, Dec. 1983.
• M. P. Herlihy, "Optimistic Concurrency Control for Abstract Data Types," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 206-217, Aug. 1986.
• D. B. Johnson and W. Zwaenepoel, "Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 171-181, Aug. 1988.
• W. H. Kohler, "A Survey of Techniques for Synchronization and Recovery in Decentralized Computer Systems," ACM Computing Surveys, vol. 13, no. 2, pp. 149-183, June 1981.
• C. Mohan and B. G. Lindsay, "Efficient Commit Protocols for the Tree of Processes Model of Distributed Transactions," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 76-88, Aug. 1983.
• C. Mohan, B. G. Lindsay, and R. Obermarck, "Transaction Management in the R* Distributed Database Management System," ACM Transactions on Database Systems, vol. 11, no. 4, pp. 378-396, Dec. 1986.
• C. Mohan, H. R. Strong, and S. Finkelstein, "Method for Distributed Transaction Commit and Recovery Using Byzantine Agreement within Clusters of Processors," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 89-103, Aug. 1983.
• C. H. Papadimitriou, "The Serializability of Concurrent Database Updates," Journal of the ACM, vol. 26, no. 4, pp. 631-653, Oct. 1979.
• C. H. Papadimitriou and M. Yannakakis, "The Complexity of Reliable Concurrency Control," SIAM Journal on Computing, vol. 16, no. 3, pp. 538-553, June 1987.
• D. Skeen, "A Decentralized Termination Protocol," Proceedings of the 1st Annual ACM Symposium on Principles of Distributed Computing, pp. 27-32, Aug. 1981.
• D. Skeen, "Nonblocking Commit Protocols," SIGMOD Proceedings, May 1981.
• D. Skeen, Crash Recovery in a Distributed Database System, Ph.D. Thesis, University of California at Berkeley, May 1982. (Available as Technical Report UCB/ERL M82/45.)
• D. Skeen, "Quorum-Based Commit Protocols," Proceedings of the 5th Berkeley Workshop on Networks and Database Systems, Feb. 1982.
• D. Skeen, "Determining the Last Process to Fail," ACM Transactions on Computer Systems, pp. 15-30, Feb. 1985.
• D. Skeen and M. Stonebraker, "A Formal Model of Crash Recovery in a Distributed System," IEEE Transactions on Software Engineering, vol. SE-9, no. 3, pp. 219-228, May 1983.
• D. Skeen and D. Wright, "Increasing Availability in Partitioned Database Systems," Technical Report TR-83-581, Cornell University, Mar. 1984.
• I. L. Traiger, J. N. Gray, C. A. Galtieri, and B. G. Lindsay, "Transactions and Consistency in Distributed Database Systems," ACM Transactions on Database Systems, vol. 7, no. 3, pp. 323-342, Sept. 1982.
• W. E. Weihl, "Data Dependent Concurrency Control and Recovery," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 63-75, Aug. 1983.
• W. E. Weihl, "Distributed Version Management for Read-Only Actions," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 122-135, Aug. 1985.
• D. Wright, "Managing Distributed Databases in Partitioned Networks," Technical Report TR-83-572, Cornell University, Sept. 1983.
7 Distributed Systems (Not Transaction Based)

• K. P. Birman and T. A. Joseph, "Exploiting Virtual Synchrony in Distributed Systems," Proceedings of the 11th ACM Symposium on Operating Systems Principles, pp. 123-138, Nov. 1987.
• A. D. Birrell, R. Levin, R. M. Needham, and M. D. Schroeder, "Grapevine: An Exercise in Distributed Computing," Communications of the ACM, vol. 25, no. 4, pp. 260-274, Apr. 1982.
• A. D. Birrell and B. J. Nelson, "Implementing Remote Procedure Calls," ACM Transactions on Computer Systems, vol. 2, no. 1, pp. 38-59, Feb. 1984.
• D. R. Cheriton, "The V Kernel: A Software Base for Distributed Systems," IEEE Software, vol. 1, no. 2, pp. 19-42, Apr. 1984.
• E. C. Cooper, "Replicated Procedure Call," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 220-232, Aug. 1984.
• E. C. Cooper, "A Replicated Procedure Call Facility," Proceedings of the 4th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 11-24, Oct. 1984.
• E. C. Cooper, Replicated Distributed Programs, Ph.D. Thesis, University of California at Berkeley, May 1985. (Available as Technical Report UCB/CSD 85/231.)
• A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry, "Epidemic Algorithms for Replicated Database Maintenance," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 1-12, Aug. 1987.
• A. D. Griefer and H. R. Strong, "DCF: Distributed Communication with Fault Tolerance," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 18-27, Aug. 1988.
• R. Ladin, B. H. Liskov, and L. Shrira, "A Technique for Constructing Highly Available Services," Algorithmica, vol. 3, no. 3, pp. 393-420, 1988.
• B. W. Lampson, "Designing a Global Name Service," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 1-10, Aug. 1986.
• K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 229-239, Aug. 1986.
• N. A. Lynch, B. Blaustein, and M. Siegel, "Correctness Conditions for Highly Available Replicated Databases," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 11-28, Aug. 1986.
• R. Rashid and G. Robertson, "Accent: A Communication Oriented Network Operating System Kernel," Proceedings of the 8th ACM Symposium on Operating Systems Principles, Dec. 1981.
• S. Sarin, "Robust Application Design in Highly Available Distributed Databases," Proceedings of the 5th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 87-94, Jan. 1986.
• M. D. Schroeder, A. D. Birrell, and R. M. Needham, "Experience with Grapevine: The Growth of a Distributed System," ACM Transactions on Computer Systems, vol. 2, no. 1, pp. 3-23, Feb. 1984.
• A. S. Tanenbaum and R. Van Renesse, "Distributed Operating Systems," ACM Computing Surveys, vol. 17, no. 4, pp. 419-470, Dec. 1985.
8 Distributed Systems (Transaction Based)

• J. Allchin, An Architecture for Reliable Decentralized Systems, Ph.D. Thesis, Georgia Institute of Technology, Sept. 1983. (Available as Technical Report GIT-ICS-83/23.)
• J. Allchin and M. McKendry, "Synchronization and Recovery of Actions," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 31-44, Aug. 1983.
• T. Anderson, P. Lee, and S. Shrivastava, "A Model of Recoverability in Multilevel Systems," IEEE Transactions on Software Engineering, vol. SE-4, no. 6, pp. 486-494, Nov. 1978.
• P. A. Bernstein, D. W. Shipman, and J. B. Rothnie, "Concurrency Control in a System for Distributed Databases (SDD-1)," ACM Transactions on Database Systems, vol. 5, no. 1, pp. 18-51, Mar. 1980.
• P. A. Bernstein and D. W. Shipman, "The Correctness of Concurrency Control in a System for Distributed Databases (SDD-1)," ACM Transactions on Database Systems, vol. 5, no. 1, pp. 52-68, Mar. 1980.
• K. P. Birman, T. A. Joseph, T. Raeuchle, and A. El Abbadi, "Implementing Fault-Tolerant Distributed Objects," IEEE Transactions on Software Engineering, vol. 11, no. 6, pp. 502-508, June 1985.
• M. Hammer and D. W. Shipman, "Reliability Mechanisms for SDD-1: A System for Distributed Databases," ACM Transactions on Database Systems, vol. 5, no. 4, pp. 431-466, Dec. 1980.
• R. Haskin, Y. Malachi, W. Sawdon, and G. Chan, "Recovery Management in QuickSilver," ACM Transactions on Computer Systems, vol. 6, no. 1, pp. 82-108, Feb. 1988.
• M. P. Herlihy and J. M. Wing, "Specifying Graceful Degradation in Distributed Systems," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 167-177, Aug. 1987.
• E. D. Lazowska, H. M. Levy, G. T. Almes, M. J. Fischer, R. J. Fowler, and S. C. Vestal, "The Architecture of the EDEN System," Proceedings of the 8th ACM Symposium on Operating Systems Principles, pp. 148-159, Dec. 1981.
• B. G. Lindsay, L. M. Haas, C. Mohan, P. F. Wilms, and R. A. Yost, "Computation and Communication in R*: A Distributed Database Manager," ACM Transactions on Computer Systems, vol. 2, no. 1, pp. 24-38, Feb. 1984.
• B. H. Liskov, "On Linguistic Support for Distributed Programs," IEEE Transactions on Software Engineering, vol. SE-8, no. 3, pp. 203-210, May 1982.
• B. H. Liskov, "Distributed Programming in Argus," Communications of the ACM, vol. 31, no. 3, pp. 300-313, Mar. 1988.
• B. H. Liskov, D. Curtis, P. Johnson, and R. W. Scheifler, "Implementation of Argus," Proceedings of the 11th ACM Symposium on Operating Systems Principles, pp. 111-122, Nov. 1987.
• B. H. Liskov, M. P. Herlihy, P. Johnson, G. Leavens, R. W. Scheifler, and W. E. Weihl, "Preliminary Argus Reference Manual," Programming Methodology Group Memo 39, MIT Laboratory for Computer Science, Oct. 1983.
• B. H. Liskov and R. W. Scheifler, "Guardians and Actions: Linguistic Support for Robust, Distributed Programs," ACM Transactions on Programming Languages and Systems, vol. 5, no. 3, pp. 381-404, July 1983.
• M. S. McKendry and M. P. Herlihy, "Time-Driven Orphan Elimination," Proceedings of the 5th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 42-48, Jan. 1986.
• E. T. Mueller, J. D. Moore, and G. Popek, "A Nested Transaction Mechanism for LOCUS," Proceedings of the 9th ACM Symposium on Operating Systems Principles, pp. 71-89, Oct. 1983.
• G. Popek, B. Walker, J. Chow, D. Edwards, C. Kline, G. Rudisin, and G. Thiel, "LOCUS: A Network Transparent High Reliability Distributed System," Proceedings of the 8th ACM Symposium on Operating Systems Principles, pp. 169-177, Dec. 1981.
• J. B. Rothnie, P. A. Bernstein, S. Fox, N. Goodman, M. Hammer, T. A. Landers, C. Reeve, D. W. Shipman, and E. Wong, "Introduction to a System for Distributed Databases (SDD-1)," ACM Transactions on Database Systems, vol. 5, no. 1, pp. 1-17, Mar. 1980.
• A. Z. Spector, "Distributed Transactions for Reliable Systems," Proceedings of the 10th ACM Symposium on Operating Systems Principles, pp. 127-146, Dec. 1983.
• A. Z. Spector, J. Butcher, D. S. Daniels, D. J. Duchamp, J. L. Eppinger, C. E. Fineman, A. Heddaya, and P. M. Schwartz, "Support for Distributed Transactions in the TABS Prototype," Proceedings of the 4th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 186-206, Oct. 1984.
• A. Z. Spector, D. S. Daniels, D. J. Duchamp, J. L. Eppinger, and R. Pausch, "Distributed Transactions for Reliable Systems," Proceedings of the 10th ACM Symposium on Operating Systems Principles, pp. 127-146, Dec. 1985.
• A. Z. Spector, J. L. Eppinger, D. S. Daniels, R. Draves, J. J. Bloch, D. Duchamp, R. F. Pausch, and D. Thompson, "High Performance Distributed Transaction Processing in a General Purpose Computing Environment," manuscript, Sept. 1987.
• A. Z. Spector and P. M. Schwartz, "Transactions: A Construct for Reliable Distributed Computing," ACM Operating Systems Review, vol. 17, no. 2, pp. 18-35, Apr. 1983.
• A. Z. Spector, D. Thompson, R. F. Pausch, J. L. Eppinger, D. Duchamp, R. Draves, D. S. Daniels, and J. J. Bloch, "Camelot: A Distributed Transaction Facility for Mach and the Internet - An Interim Report," Technical Report CMU-CS-87-129, Carnegie-Mellon University, June 1987.
• R. E. Strom and S. Yemini, "NIL: An Integrated Language and System for Distributed Programming," ACM SIGPLAN Notices, vol. 18, no. 6, pp. 73-82, June 1983.
• R. E. Strom and S. Yemini, "Optimistic Recovery: An Asynchronous Approach to Fault-Tolerance in Distributed Systems," Proceedings of the 14th International Conference on Fault-Tolerant Computing, pp. 374-379, June 1984.
• R. E. Strom and S. Yemini, "Optimistic Recovery in Distributed Systems," ACM Transactions on Computer Systems, vol. 3, no. 3, pp. 204-226, Aug. 1985.
• L. Svobodova, "Resilient Distributed Computing," IEEE Transactions on Software Engineering, vol. SE-10, no. 3, pp. 257-268, May 1984.
• B. Walker, G. Popek, R. English, C. Kline, and G. Thiel, "The LOCUS Distributed Operating System," Proceedings of the 9th ACM Symposium on Operating Systems Principles, pp. 49-70, Oct. 1983.
9 Fail-Stop Systems

• R. D. Schlichting and F. B. Schneider, "Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems," ACM Transactions on Computer Systems, vol. 1, no. 3, pp. 222-238, Aug. 1983.
• F. B. Schneider, "Byzantine Generals in Action: Implementing Fail-Stop Processors," ACM Transactions on Computer Systems, vol. 2, no. 2, pp. 145-154, May 1984.
10 Formal Models for Distributed Systems

• S. Aggarwal, D. Barbara, and K. Z. Meth, "A Software Environment for the Specification and Analysis of Problems of Coordination and Concurrency," IEEE Transactions on Software Engineering, vol. SE-14, no. 3, pp. 280-290, Mar. 1988.
• S. Aggarwal, D. Barbara, and K. Z. Meth, "SPANNER: A Tool for the Specification, Analysis, and Evaluation of Protocols," IEEE Transactions on Software Engineering, vol. SE-13, no. 12, pp. 1218-1237, Dec. 1987.
• M. Alford, J. Ansart, G. Hommel, L. Lamport, F. B. Schneider, and B. H. Liskov, Distributed Systems: Methods and Tools for Specification (Chapter 8), Lecture Notes in Computer Science, vol. 190, Springer-Verlag, New York, 1985.
• B. Alpern and F. B. Schneider, "Proving Boolean Combinations of Deterministic Properties," Proceedings of the 2nd Annual ACM Symposium on Logic in Computer Science, June 1987.
• H. Barringer, R. Kuiper, and A. Pnueli, "Now You May Compose Temporal Logic Specifications," Proceedings of the 16th Annual ACM Symposium on Theory of Computing, pp. 51-63, Apr. 1984.
• J. E. Burns, "A Formal Model for Message Passing Systems," Technical Report TR-91, Indiana University, Sept. 1980.
• E. M. Clarke and O. Grumberg, "Avoiding the State Explosion Problem in Temporal Logic Model Checking Algorithms," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 294-303, Aug. 1987.
• N. Francez, Fairness, Springer-Verlag, Berlin, 1986.
• D. Harel, "Statecharts: A Visual Formalism for Complex Systems," Science of Computer Programming, vol. 8, no. 3, pp. 231-274, June 1987.
• C. A. R. Hoare, Communicating Sequential Processes, Prentice-Hall International, Englewood Cliffs, New Jersey, 1985.
• S. S. Lam and A. U. Shankar, "Protocol Verification via Projections," IEEE Transactions on Software Engineering, vol. SE-10, no. 4, pp. 325-342, July 1984.
• L. Lamport, "Solved Problems, Unsolved Problems, and Non-Problems in Concurrency," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 1-11, Aug. 1984.
• L. Lamport, "Specifying Concurrent Program Modules," ACM Transactions on Programming Languages and Systems, vol. 5, no. 2, pp. 190-222, Apr. 1983.
• L. Lamport and F. B. Schneider, "The 'Hoare Logic' of CSP, and All That," ACM Transactions on Programming Languages and Systems, vol. 6, no. 2, pp. 281-296, Apr. 1984.
• N. A. Lynch and M. J. Fischer, "On Describing the Behavior and Implementation of Distributed Systems," Theoretical Computer Science, vol. 13, no. 1, pp. 17-43, Jan. 1981.
• N. A. Lynch and M. Merritt, Introduction to the Theory of Nested Transactions, Technical Report MIT/LCS/TR-367, Massachusetts Institute of Technology, July 1986. (Revised version to appear in Theoretical Computer Science.)
• N. A. Lynch and M. R. Tuttle, "Hierarchical Correctness Proofs for Distributed Algorithms," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 137-151, Aug. 1987. (Expanded version available as Technical Report MIT/LCS/TR-387, Massachusetts Institute of Technology.)
• T. A. Joseph, T. Raeuchle, and S. Toueg, "State Machines and Assertions: An Integrated Approach to Modeling and Verification of Distributed Systems," Science of Computer Programming, vol. 7, no. 1, pp. 1-22, July 1986.
• R. Milner, A Calculus of Communicating Systems, Lecture Notes in Computer Science, vol. 92, Springer-Verlag, Berlin, 1980.
• Z. Manna and A. Pnueli, "Verification of Concurrent Programs: Temporal Proof Principles," Logic of Programs, D. Kozen, ed., Lecture Notes in Computer Science, vol. 131, pp. 200-252, Springer-Verlag, Berlin, 1981.
• S. Owicki and D. Gries, "An Axiomatic Proof Technique for Parallel Programs I," Acta Informatica, vol. 6, no. 4, pp. 319-340, Aug. 1976.
• E. W. Stark, Foundations of a Theory of Specification of Distributed Systems, Ph.D. Thesis, Massachusetts Institute of Technology, Aug. 1984. (Available as Technical Report MIT/LCS/TR-342.)
• C.-T. Chou and E. Gafni, "Understanding and Verifying Distributed Algorithms Using Stratified Decomposition," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 44-65, Aug. 1988.
• J. L. Welch, L. Lamport, and N. A. Lynch, "A Lattice-Structured Proof of a Minimum Spanning Tree Algorithm," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 28-43, Aug. 1988.
11 Impossibility Results for Consensus
• D. Angluin, "Local and Global Properties in Networks of Processors," Proceedings of the 12th Annual ACM Symposium on Theory of Computing, pp. 82-93, Apr. 1980.
• O. Biran, S. Moran, and S. Zaks, "A Combinatorial Characterization of the Distributed Tasks which are Solvable in the Presence of One Faulty Processor," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 263-275, Aug. 1988.
• M. F. Bridgland and R. J. Watro, "Fault-Tolerant Decision Making in Totally Asynchronous Distributed Systems," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 52-63, Aug. 1987.
• B. A. Coan and C. Dwork, "Simultaneity is Harder than Agreement," Proceedings of the 5th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 141-150, Jan. 1986.
• R. DeMillo, N. A. Lynch, and M. Merritt, "Cryptographic Protocols," Proceedings of the 14th Annual ACM Symposium on Theory of Computing, pp. 383-400, May 1982.
• D. Dolev, C. Dwork, and L. Stockmeyer, "On the Minimal Synchronism Needed for Distributed Consensus," Journal of the ACM, vol. 34, no. 1, pp. 77-97, Jan. 1987.
• D. Dolev, R. Reischuk, and H. R. Strong, "Eventual is Earlier than Immediate," Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, pp. 196-203, Nov. 1982.
• D. Dolev, R. Reischuk, and H. R. Strong, "Early Stopping in Byzantine Agreement," IBM Research Report RJ5406, Dec. 1986.
• M. J. Fischer and N. A. Lynch, "A Lower Bound for the Time to Assure Interactive Consistency," Information Processing Letters, vol. 14, no. 4, pp. 183-186, June 1982.
• M. J. Fischer, N. A. Lynch, and M. Merritt, "Easy Impossibility Proofs for Distributed Consensus Problems," Distributed Computing, vol. 1, no. 1, pp. 26-39, Jan. 1986.
• M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," Journal of the ACM, vol. 32, no. 2, pp. 374-382, Apr. 1985.
• N. A. Lynch, Y. Mansour, and A. D. Fekete, "Data Link Layer: Two Impossibility Results," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 149-170, Aug. 1988.
• J. L. Welch, "Simulating Synchronous Processors," Information and Computation, vol. 74, no. 2, pp. 159-171, Aug. 1987.
12 Knowledge and Common Knowledge
• C. Dwork and Y. Moses, "Knowledge and Common Knowledge in a Byzantine Environment I: Crash Failures," Proceedings of the 1986 Conference on Theoretical Aspects of Reasoning About Knowledge, pp. 149-170, Mar. 1986. (Revised version to appear in Information and Computation.)
• M. J. Fischer and N. Immerman, "Foundations of Knowledge for Distributed Systems," Proceedings of the 1986 Conference on Theoretical Aspects of Reasoning About Knowledge, pp. 171-185, Mar. 1986.
• J. Y. Halpern and R. Fagin, "A Formal Model of Knowledge, Action, and Communication in Distributed Systems," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 224-236, Aug. 1985.
• J. Y. Halpern and Y. Moses, "Knowledge and Common Knowledge in a Distributed Environment," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 50-61, Aug. 1984. (Revised as Technical Report IBM-RJ-4421, IBM Corp., Jan. 1986.)
• J. Y. Halpern and L. D. Zuck, "A Little Knowledge Goes a Long Way: Simple Knowledge-Based Derivations and Correctness Proofs for a Family of Protocols," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 269-280, Aug. 1987.
• S. Katz and G. Taubenfeld, "What Processors Know: Definitions and Proof Methods," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 249-262, Aug. 1986.
• R. Koo and S. Toueg, "Effects of Message Loss on the Termination of Distributed Protocols," Information Processing Letters, vol. 27, no. 4, pp. 181-188, Apr. 1988.
• Y. Moses, D. Dolev, and J. Y. Halpern, "Cheating Husbands and Other Stories: A Case Study of Knowledge, Action, and Communication," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 215-223, Aug. 1985.
• Y. Moses and M. R. Tuttle, "Programming Simultaneous Actions Using Common Knowledge," Algorithmica, vol. 3, no. 1, pp. 121-169, 1988.
• G. Neiger and S. Toueg, "Substituting for Real Time and Common Knowledge in Asynchronous Distributed Systems," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 281-293, Aug. 1987.
• P. Panangaden and K. Taylor, "Concurrent Common Knowledge: A New Definition of Agreement for Asynchronous Systems," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 197-209, Aug. 1988.
13 Non-Stop Systems
• J. Bartlett, "A NonStop Kernel," Proceedings of the 8th ACM Symposium on Operating Systems Principles, pp. 22-29, Dec. 1981.
• J. Bartlett, "A 'NonStop' Operating System," Proceedings of the Eleventh Hawaii International Conference on System Sciences, pp. 103-117, Jan. 1978.
• A. Borg, J. Baumbach, and S. Glazer, "A Message System Supporting Fault Tolerance," Proceedings of the 9th ACM Symposium on Operating Systems Principles, pp. 90-99, Oct. 1983.
• J. N. Gray, "Why Do Computers Stop and What Can Be Done About It?" Proceedings of the 5th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 87-94, Jan. 1986.
• V. Hadzilacos, Issues of Fault Tolerance in Concurrent Computations, Ph.D. Thesis, Harvard University, June 1984. (Available as Technical Report TR-1184.)
• N. P. Kronberg, H. M. Levy, and W. D. Strecker, "VAXclusters: A Closely-Coupled Distributed System," ACM Transactions on Computer Systems, vol. 4, no. 2, pp. 130-146, May 1986.
• L. Lamport, "Using Time Instead of Timeout for Fault-Tolerant Distributed Systems," ACM Transactions on Programming Languages and Systems, vol. 6, no. 2, pp. 254-280, Apr. 1984.
• L. Lamport, "The Implementation of Reliable Distributed Multiprocess Systems," Computer Networks, vol. 2, pp. 95-114, 1978.
• F. B. Schneider, "Synchronization in Distributed Programs," ACM Transactions on Programming Languages and Systems, vol. 4, no. 2, pp. 125-148, Apr. 1982.
• F. B. Schneider, "Paradigms for Distributed Programs," Distributed Systems: Methods and Tools for Specification, Lecture Notes in Computer Science, vol. 190, Springer-Verlag, New York, 1985.
14 Replication of Data
• D. Barbara, H. Garcia-Molina, and A. Spauster, "Protocols for Dynamic Vote Reassignment," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 195-205, Aug. 1986.
• D. Barbara and H. Garcia-Molina, "The Vulnerability of Vote Assignments," ACM Transactions on Computer Systems, vol. 4, no. 3, pp. 187-214, Aug. 1986.
• K. P. Birman, "Replication and Fault-Tolerance in the ISIS System," Proceedings of the 10th ACM Symposium on Operating Systems Principles, pp. 79-86, Dec. 1985.
• P. A. Bernstein and N. Goodman, "The Failure and Recovery Problem for Replicated Databases," Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 114-130, Aug. 1983.
• P. A. Bernstein and N. Goodman, "Multiversion Concurrency Control: Theory and Algorithms," ACM Transactions on Database Systems, vol. 8, no. 4, pp. 465-483, Dec. 1983.
• J. J. Bloch, D. S. Daniels, and A. Z. Spector, "A Weighted Voting Algorithm for Replicated Directories," Journal of the ACM, vol. 34, no. 4, pp. 859-909, Oct. 1987.
• M. J. Carey and W. A. Muhanna, "The Performance of Multiversion Concurrency Control Algorithms," ACM Transactions on Computer Systems, vol. 4, no. 4, pp. 338-378, Nov. 1986.
• B. A. Coan, B. M. Oki, and E. K. Kolodner, "Limitations on Database Availability when Networks Partition," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 187-194, Aug. 1986.
• S. B. Davidson, H. Garcia-Molina, and D. Skeen, "Consistency in Partitioned Networks," ACM Computing Surveys, vol. 17, no. 3, pp. 341-370, Sept. 1985.
• A. El Abbadi, D. Skeen, and F. Cristian, "An Efficient, Fault-Tolerant Protocol for Replicated Data Management," Proceedings of the 4th Annual ACM Symposium on Principles of Database Systems, pp. 215-229, Mar. 1985.
• A. El Abbadi and S. Toueg, "Maintaining Availability in Partitioned Replicated Databases," Proceedings of the 5th Annual ACM Symposium on Principles of Database Systems, pp. 240-251, Mar. 1986. (Revised version to appear in ACM Transactions on Database Systems.)
• A. El Abbadi and S. Toueg, "The Group Paradigm for Concurrency Control Protocols," Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pp. 126-142, June 1988.
• J. L. Eppinger and A. Z. Spector, "Virtual Memory Management for Recoverable Objects in the TABS Prototype," Technical Report CMU-CS-85-163, Carnegie-Mellon University, 1985.
• M. J. Fischer and A. Michael, "Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network," Proceedings of the 1st Annual ACM Symposium on Principles of Database Systems, pp. 70-75, Mar. 1982.
• H. Garcia-Molina, "The Future of Data Replication," Proceedings of the 5th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 13-19, Jan. 1986.
• H. Garcia-Molina and D. Barbara, "How to Assign Votes in a Distributed System," Journal of the ACM, vol. 32, no. 4, pp. 841-860, Oct. 1985.
• D. Gifford, "Weighted Voting for Replicated Data," Proceedings of the 7th ACM Symposium on Operating Systems Principles, pp. 150-162, Dec. 1979.
• K. J. Goldman and N. A. Lynch, "Quorum Consensus in Nested Transaction Systems," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 27-41, Aug. 1987.
• M. P. Herlihy, "Comparing How Atomicity Mechanisms Support Replication," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 102-110, Aug. 1985.
• M. P. Herlihy, Replication Methods for Abstract Data Types, Ph.D. Thesis, Massachusetts Institute of Technology, May 1984. (Available as Technical Report MIT/LCS/TR-319.)
• M. P. Herlihy, "A Quorum-Consensus Replication Method for Abstract Data Types," ACM Transactions on Computer Systems, vol. 4, no. 1, pp. 32-53, Feb. 1986.
• M. P. Herlihy, "Dynamic Quorum Adjustment for Partitioned Data," ACM Transactions on Database Systems, vol. 12, no. 2, pp. 170-194, June 1987.
• T. A. Joseph, Low Cost Management of Replicated Data, Ph.D. Thesis, Cornell University, Dec. 1985.
• T. A. Joseph and K. P. Birman, "Low Cost Management of Replicated Data in Fault-Tolerant Distributed Systems," ACM Transactions on Computer Systems, vol. 4, no. 1, pp. 54-70, Feb. 1986.
• B. Kogan and H. Garcia-Molina, "Update Propagation in Bakunin Data Networks," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 12-26, Aug. 1987.
• B. M. Oki and B. H. Liskov, "Viewstamped Replication: A General Primary Copy Method to Support Highly-Available Distributed Systems," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 8-17, Aug. 1988.
• F. Pittelli and H. Garcia-Molina, "Database Processing with Triple Modular Redundancy," Proceedings of the 5th IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 95-103, Jan. 1986.
• P. M. Schwartz, Transactions on Typed Objects, Ph.D. Thesis, Carnegie-Mellon University, Dec. 1984. (Available as Technical Report CMU-CS-84-166.)
• P. M. Schwartz and A. Z. Spector, "Synchronizing Shared Abstract Types," ACM Transactions on Computer Systems, vol. 2, no. 3, pp. 223-250, July 1984.
• G. Wuu and A. Bernstein, "Efficient Solutions to the Replicated Log and Dictionary Problems," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 233-242, Aug. 1984.
15 Routing
• A. Broder, D. Dolev, M. J. Fischer, and B. Simons, "Efficient Fault-Tolerant Routing in Networks," Information and Computation, vol. 75, no. 1, pp. 52-64, Oct. 1987.
• D. Dolev, J. Y. Halpern, B. Simons, and H. R. Strong, "A New Look at Fault-Tolerant Network Routing," Information and Computation, vol. 72, no. 3, pp. 180-198, Mar. 1987.
• D. Dolev, J. Meseguer, and M. Pease, "Finding Safe Paths in a Faulty Environment," Proceedings of the 1st Annual ACM Symposium on Principles of Distributed Computing, pp. 95-103, Aug. 1982.
• D. Peleg and B. Simons, "On Fault-Tolerant Routing in General Networks," Information and Computation, vol. 74, no. 1, pp. 33-49, July 1987.
16 Snapshots and Checkpoints
• K. M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Transactions on Computer Systems, vol. 3, no. 1, pp. 63-75, Feb. 1985.
• M. J. Fischer, N. D. Griffeth, and N. A. Lynch, "Global States of a Distributed System," IEEE Transactions on Software Engineering, vol. SE-8, no. 3, pp. 198-202, May 1982.
• J.-M. Helary, C. Jard, N. Plouzeau, and M. Raynal, "Detection of Stable Properties in Distributed Applications," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 125-136, Aug. 1987.
• R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 23-31, Jan. 1987.
• R. E. Strom and S. Yemini, "Optimistic Recovery in Distributed Systems," ACM Transactions on Computer Systems, vol. 3, no. 3, pp. 204-226, Aug. 1985.
• S. Toueg and O. Babaoglu, "On the Optimum Checkpoint Selection Problem," SIAM Journal on Computing, vol. 13, no. 3, pp. 630-649, Aug. 1984.
17 Software Reliability
• T. Anderson and R. Kerr, "Recovery Blocks in Action: A System Supporting High Reliability," Proceedings of the 2nd International Conference on Software Engineering, pp. 447-457, 1976.
• A. Avizienis, "Architecture of Fault-Tolerant Computing Systems," Proceedings of the Fifth IEEE International Symposium on Fault-Tolerant Computing, pp. 3-16, June 1975.
• A. Avizienis, "The N-Version Approach to Fault Tolerant Software," IEEE Transactions on Software Engineering, vol. SE-11, no. 12, pp. 1491-1501, Dec. 1985.
• D. J. Lu, "Watchdog Processors and Structural Integrity Checking," IEEE Transactions on Computers, vol. C-31, no. 7, pp. 681-685, July 1982.
• P. M. Melliar-Smith and R. L. Schwartz, "Formal Specification and Mechanical Verification of SIFT: A Fault-Tolerant Flight Control System," IEEE Transactions on Computers, vol. C-31, no. 7, pp. 616-630, July 1982.
• C. V. Ramamoorthy and F. B. Bastani, "Software Reliability--Status and Perspectives," IEEE Transactions on Software Engineering, vol. SE-8, no. 4, pp. 354-371, July 1982.
• B. Randell, "System Structure for Software Fault Tolerance," IEEE Transactions on Software Engineering, vol. SE-1, no. 2, pp. 220-232, June 1975.
• R. D. Schlichting, Axiomatic Verification to Enhance Software Reliability, Ph.D. Thesis, Cornell University, Jan. 1982.
• D. J. Taylor, D. E. Morgan, and J. P. Black, "Redundancy in Data Structures: Improving Software Fault-Tolerance," IEEE Transactions on Software Engineering, vol. SE-6, no. 6, pp. 585-594, Nov. 1980.
18 Synchronous Consensus Protocols
• A. Bar-Noy, D. Dolev, C. Dwork, and H. R. Strong, "Shifting Gears: Changing Algorithms on the Fly to Expedite Byzantine Agreement," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 42-51, Aug. 1987.
• G. Bracha, "An O(log n) Expected Rounds Randomized Byzantine Generals Protocol," Journal of the ACM, vol. 34, no. 4, pp. 910-920, Oct. 1987.
• A. Broder, "A Provably Secure Polynomial Approximation Scheme for the Distributed Lottery Problem," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 136-148, Aug. 1985.
• A. Broder and D. Dolev, "Coin Flipping in Many Pockets," Proceedings of the 25th Annual IEEE Symposium on Foundations of Computer Science, pp. 157-170, Oct. 1984.
• J. E. Burns and N. A. Lynch, "The Byzantine Firing Squad Problem," Advances in Computing Research: Parallel and Distributed Computing, vol. 4, pp. 147-161, JAI Press, 1987.
• B. Chor and B. A. Coan, "A Simple and Efficient Randomized Byzantine Agreement Algorithm," IEEE Transactions on Software Engineering, vol. SE-11, no. 6, pp. 531-539, June 1985.
• B. Chor, M. Merritt, and D. Shmoys, "Simple Constant-Time Consensus Protocols in Realistic Failure Models," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 152-162, Aug. 1985.
• B. A. Coan, "A Communication-Efficient Canonical Form for Fault-Tolerant Distributed Protocols," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 63-72, Aug. 1986.
• B. A. Coan, Achieving Consensus in Fault-Tolerant Distributed Computer Systems: Protocols, Lower Bounds, and Simulations, Ph.D. Thesis, Massachusetts Institute of Technology, Apr. 1987.
• B. A. Coan, "Efficient Agreement using Fault Detection," Proceedings of the 26th Annual Allerton Conference on Communication, Control, and Computing, Sept. 1988.
• B. A. Coan, D. Dolev, C. Dwork, and L. Stockmeyer, "The Distributed Firing Squad Problem," Proceedings of the 17th Annual ACM Symposium on Theory of Computing, pp. 335-345, May 1985.
• D. Dolev, "The Byzantine Generals Strike Again," Journal of Algorithms, vol. 3, no. 1, pp. 14-30, Mar. 1982.
• D. Dolev, M. J. Fischer, R. J. Fowler, N. A. Lynch, and H. R. Strong, "An Efficient Algorithm for Byzantine Agreement without Authentication," Information and Control, vol. 52, pp. 257-274, Mar. 1982.
• D. Dolev, N. A. Lynch, S. Pinter, E. W. Stark, and W. E. Weihl, "Reaching Approximate Agreement in the Presence of Faults," Journal of the ACM, vol. 33, no. 3, pp. 499-516, July 1986.
• D. Dolev and R. Reischuk, "Bounds on Information Exchange for Byzantine Agreement," Journal of the ACM, vol. 32, no. 1, pp. 191-204, Jan. 1985.
• D. Dolev and H. R. Strong, "Polynomial Algorithms for Multiple Processor Agreement," Proceedings of the 14th Annual ACM Symposium on Theory of Computing, pp. 401-407, May 1982.
• D. Dolev and H. R. Strong, "Authenticated Algorithms for Byzantine Agreement," SIAM Journal on Computing, vol. 12, no. 4, pp. 656-666, Nov. 1983.
• C. Dwork, D. Shmoys, and L. Stockmeyer, "Flipping Persuasively in Constant Expected Time," Proceedings of the 27th Annual IEEE Symposium on Foundations of Computer Science, pp. 222-232, Oct. 1986.
• C. Dwork and D. Skeen, "Patterns of Communication in Consensus Protocols," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 143-153, Aug. 1984.
• P. Feldman and S. Micali, "Byzantine Agreement in Constant Expected Time (and Trusting No One)," Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pp. 267-276, Oct. 1985.
• A. D. Fekete, "Asymptotically Optimal Algorithms for Approximate Agreement," Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing, pp. 73-87, Aug. 1986.
• P. Feldman and S. Micali, "Optimal Algorithms for Byzantine Agreement," Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pp. 147-161, May 1988.
• L. Lamport, "The Weak Byzantine Generals Problem," Journal of the ACM, vol. 30, no. 3, pp. 668-676, July 1983.
• L. Lamport and M. J. Fischer, "Byzantine Generals and Transaction Commitment Protocols," Technical Report Op. 62, SRI Corp., 1982.
• L. Lamport, R. E. Shostak, and M. Pease, "The Byzantine Generals Problem," ACM Transactions on Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, July 1982.
• N. A. Lynch, M. J. Fischer, and R. J. Fowler, "A Simple and Efficient Byzantine Generals Algorithm," Proceedings of the 2nd IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 46-52, July 1982.
• S. R. Mahaney and F. B. Schneider, "Inexact Agreement: Accuracy, Precision, and Graceful Degradation," Proceedings of the 4th Annual ACM Symposium on Principles of Distributed Computing, pp. 237-249, Aug. 1985.
• M. Merritt, Cryptographic Protocols, Ph.D. Thesis, Georgia Institute of Technology, Feb. 1983.
• M. Merritt, "Elections in the Presence of Faults," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 134-142, Aug. 1984.
• Y. Moses and O. Waarts, "Coordinated Traversal: (t + 1)-Round Byzantine Agreement in Polynomial Time," Proceedings of the 29th Annual IEEE Symposium on Foundations of Computer Science, pp. 246-255, Oct. 1988.
• G. Neiger and S. Toueg, "Automatically Increasing the Fault-Tolerance of Distributed Systems," Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pp. 248-262, Aug. 1988.
• M. Pease, R. E. Shostak, and L. Lamport, "Reaching Agreement in the Presence of Faults," Journal of the ACM, vol. 27, no. 2, pp. 228-234, Apr. 1980.
• K. J. Perry, Early Stopping Protocols for Fault-Tolerant Distributed Agreement, Ph.D. Thesis, Cornell University, Jan. 1985. (Available as Technical Report TR-85-662.)
• K. J. Perry and S. Toueg, "An Authenticated Byzantine Generals Algorithm with Early Stopping," Technical Report TR-84-620, Cornell University, June 1984.
• K. J. Perry and S. Toueg, "Distributed Agreement in the Presence of Processor and Communication Faults," IEEE Transactions on Software Engineering, vol. SE-12, no. 3, pp. 477-482, Mar. 1986.
• M. O. Rabin, "Randomized Byzantine Generals," Proceedings of the 24th Annual IEEE Symposium on Foundations of Computer Science, pp. 403-409, Nov. 1983.
• R. Reischuk, "A New Solution for the Byzantine Generals Problem," Information and Control, vol. 64, no. 1, pp. 23-42, Jan. 1985.
• T. K. Srikanth and S. Toueg, "Simulating Authenticated Broadcasts to Derive Simple Fault-Tolerant Algorithms," Distributed Computing, vol. 2, no. 2, pp. 80-94, Aug. 1987.
• S. Toueg, "Randomized Byzantine Agreements," Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 163-178, Aug. 1984.
• S. Toueg, K. J. Perry, and T. K. Srikanth, "Fast Distributed Agreement," SIAM Journal on Computing, vol. 16, no. 3, pp. 445-457, June 1987.
• R. Turpin and B. A. Coan, "Extending Binary Byzantine Agreement to Multivalued Byzantine Agreement," Information Processing Letters, vol. 18, no. 2, pp. 73-76, Feb. 1984.
19 Theory of Nested Transactions
• J. Aspnes, A. D. Fekete, N. A. Lynch, M. Merritt, and W. E. Weihl, "A Theory of Timestamp-Based Concurrency Control for Nested Transactions," Proceedings of the 14th International Conference on Very Large Data Bases, pp. 431-444, Aug. 1988.
• A. D. Fekete, N. A. Lynch, M. Merritt, and W. E. Weihl, "Nested Transactions and Read/Write Locking," Proceedings of the 6th Annual ACM Symposium on Principles of Database Systems, pp. 97-111, Mar. 1987.
• A. D. Fekete, N. A. Lynch, M. Merritt, and W. E. Weihl, "Commutativity-Based Locking for Nested Transactions," manuscript, Aug. 1988.
• K. J. Goldman and N. A. Lynch, "Quorum Consensus in Nested Transaction Systems," Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, pp. 27-41, Aug. 1987.
• J. Goree, Internal Consistency of a Distributed Transaction System with Orphan Detection, M.S. Thesis, Massachusetts Institute of Technology, Jan. 1983. (Available as Technical Report MIT/LCS/TR-286.)
• M. P. Herlihy, N. A. Lynch, M. Merritt, and W. E. Weihl, "On the Correctness of Orphan Elimination Algorithms," Proceedings of the 17th Annual IEEE Symposium on Fault-Tolerant Computing, July 1987.
• N. A. Lynch, "Concurrency Control for Resilient Nested Transactions," Proceedings of the 2nd Annual ACM Symposium on Principles of Database Systems, pp. 166-181, Mar. 1983.
• N. A. Lynch and M. Merritt, "Introduction to the Theory of Nested Transactions," Technical Report MIT/LCS/TR-367, Massachusetts Institute of Technology, Apr. 1986. (Revised version to appear in Theoretical Computer Science.)
• N. A. Lynch, M. Merritt, W. E. Weihl, and A. D. Fekete, "A Theory of Atomic Transactions," Proceedings of the 2nd International Conference on Database Theory, Bruges, Belgium, Aug. 1988. (Available as Technical Report MIT/LCS/TM-362, Massachusetts Institute of Technology, June 1988.)
• N. A. Lynch, M. Merritt, W. E. Weihl, and A. D. Fekete, Atomic Transactions, book in progress, 1988.
• J. E. B. Moss, Nested Transactions: An Approach to Reliable Distributed Computing, MIT Press, Cambridge, Massachusetts, 1985.
• D. P. Reed, Naming and Synchronization in a Decentralized Computer System, Ph.D. Thesis, Massachusetts Institute of Technology, Oct. 1978. (Available as Technical Report MIT/LCS/TR-205.)
• D. P. Reed, "Implementing Atomic Actions on Decentralized Data," ACM Transactions on Computer Systems, vol. 1, no. 1, pp. 3-23, Feb. 1983.
• W. E. Weihl, Specification and Implementation of Atomic Data Types, Ph.D. Thesis, Massachusetts Institute of Technology, Mar. 1984. (Available as Technical Report MIT/LCS/TR-314.)
• W. E. Weihl and B. H. Liskov, "Implementation of Resilient, Atomic Data Types," ACM Transactions on Programming Languages and Systems, vol. 7, no. 2, pp. 244-269, Apr. 1985.