Vasseur / Network Recovery Final Proof 9.6.2004 7:32pm page i
Network Recovery Protection and Restoration of Optical, SONET-SDH, IP, and MPLS
The Morgan Kaufmann Series in Networking
Series Editor, David Clark, M.I.T.

Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS
  Jean-Philippe Vasseur, Mario Pickavet, and Piet Demeester
Routing, Flow, and Capacity Design in Communication and Computer Networks
  Michał Pióro and Deepankar Medhi
Wireless Sensor Networks: An Information Processing Approach
  Feng Zhao and Leonidas Guibas
Communication Networking: An Analytical Approach
  Anurag Kumar, D. Manjunath, and Joy Kuri
Internet QoS: Architectures and Mechanisms
  Zheng Wang
TCP/IP Sockets in Java: Practical Guide for Programmers
  Michael J. Donahoo and Kenneth L. Calvert
TCP/IP Sockets in C: Practical Guide for Programmers
  Kenneth L. Calvert and Michael J. Donahoo
Multicast Communication: Protocols, Programming, and Applications
  Ralph Wittmann and Martina Zitterbart
MPLS: Technology and Applications
  Bruce Davie and Yakov Rekhter
The Internet and Its Protocols: A Comparative Approach
  Adrian Farrel
High-Performance Communication Networks, 2e
  Jean Walrand and Pravin Varaiya
Modern Cable Television Technology: Video, Voice, and Data Communications, 2e
  Walter Ciciora, James Farmer, David Large, and Michael Adams
Internetworking Multimedia
  Jon Crowcroft, Mark Handley, and Ian Wakeman
Bluetooth Application Programming with the Java APIs
  C Bala Kumar, Paul J. Kline, and Timothy J. Thompson
Policy-Based Network Management: Solutions for the Next Generation
  John Strassner
Computer Networks: A Systems Approach, 3e
  Larry L. Peterson and Bruce S. Davie
Network Architecture, Analysis, and Design, 2e
  James D. McCabe
MPLS Network Management: MIBs, Tools, and Techniques
  Thomas D. Nadeau
Developing IP-Based Services: Solutions for Service Providers and Vendors
  Monique Morrow and Kateel Vijayananda
Telecommunications Law in the Internet Age
  Sharon K. Black
Optical Networks: A Practical Perspective, 2e
  Rajiv Ramaswami and Kumar N. Sivarajan
Understanding Networked Applications: A First Course
  David G. Messerschmitt
Integrated Management of Networked Systems: Concepts, Architectures, and their Operational Application
  Heinz-Gerd Hegering, Sebastian Abeck, and Bernhard Neumair
Virtual Private Networks: Making the Right Connection
  Dennis Fowler
Networked Applications: A Guide to the New Computing Infrastructure
  David G. Messerschmitt
Wide Area Network Design: Concepts and Tools for Optimization
  Robert S. Cahn

For further information on these books and for a list of forthcoming titles, please visit our website at http://www.mkp.com
Network Recovery Protection and Restoration of Optical, SONET-SDH, IP, and MPLS
Jean-Philippe Vasseur Mario Pickavet Piet Demeester
AMSTERDAM • BOSTON • HEIDELBERG • LONDON NEW YORK • OXFORD • PARIS • SAN DIEGO SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Senior Editor: Rick Adams
Associate Editor: Karyn Johnson
Acquisitions Editor: Rick Adams
Publishing Services Manager: Andre Cuello
Project Manager: Justin Palmeiro
Editorial Coordinator: Graphic World Publishing Services
Cover Design: Yvo Riezebos Design
Cover Image: Brooklyn Bridge in front of Manhattan skyline at dusk. Courtesy Digital Vision and Getty Images
Composition: Kolam Information Services, Pvt., Ltd.
Technical Illustration: Kolam Information Services, Pvt., Ltd.
Copyeditor: Graphic World Publishing Services
Proofreader: Graphic World Publishing Services
Indexer: Graphic World Publishing Services
Interior printer: Maple-Vail Book Manufacturing Group, Pennsylvania
Cover printer: Maple-Vail Book Manufacturing Group, Pennsylvania

Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2004 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: [email protected]. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com) by selecting “Customer Support” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data
Application submitted.

ISBN: 0-12-715051-x

For information on all Morgan Kaufmann publications, visit our website at www.mkp.com.

Printed in the United States of America.
04 05 06 07 08 5 4 3 2 1
Dedications
About the Authors

Jean-Philippe Vasseur

Jean-Philippe Vasseur has a French engineering degree in Network Computing and a Master of Science degree from the Stevens Institute of Technology, New Jersey. He worked as a network architect for several large national and international service providers in large multiprotocol environments (e.g., IP, ATM, X.25) prior to joining Cisco Systems. After two years within the EMEA technical consulting group focusing on IP/MPLS routing, VPN, and traffic engineering designs for service providers, he joined the Cisco engineering team as a Technical Leader with a particular focus on IP, MPLS traffic engineering, and recovery mechanisms. He is a regular speaker at various international conferences and is involved in several research projects in the area of IP and MPLS. In addition, he is an active member of the Internet Engineering Task Force (IETF) and has co-authored several IETF specifications.

Mario Pickavet

Mario Pickavet received a Master of Science degree and a Doctor of Electrical Engineering degree, specialized in telecommunications, from Ghent University in 1996 and 1999, respectively. Since 2000, he has been a full-time professor at the same university. His research interests are related to broadband communication networks (i.e., IP, MPLS, WDM, SDH, ATM) and include resilience mechanisms, design, and long-term planning of core and access networks. In this context, he has been and currently is involved in European IST projects (i.e., LION, DAVID, STOLAS, ePhoton/One, LASAGNE) on IP over WDM next generation networks. He has published a number of international publications on these subjects, both in leading journals (e.g., IEEE Journal on Selected Areas in Communications and IEEE Communications Magazine) and in conference proceedings.

Piet Demeester

Piet Demeester received his doctoral degree from Ghent University at the Department of Information Technology (INTEC) in 1988. In 1993, he became a professor at Ghent University, where he is responsible for research on communication networks. He was involved in several European COST, ESPRIT, RACE, ACTS, and IST projects. He is a member of the editorial board of several international journals and has been a member of several technical program committees. His current interests are related to broadband communication networks (i.e., IP, G-MPLS, optical packet and burst switching, access and residential, active, mobile, CDN, grid) and include network planning, network and service management, telecom software, internetworking, and network protocols for QoS support. He has published over 250 journal or conference papers in this field. He also has been very active in the field of resilience in communication networks, both as founder of the DRCN conference and as editor of special issues on this subject in IEEE Communications Magazine.
Contents
Foreword
Preface

Chapter 1 General Introduction
  1.1 Communications Networks Today
    1.1.1 Fundamental Networking Concepts
    1.1.2 Layered Network Representation
    1.1.3 Network Planes
  1.2 Network Reliability
    1.2.1 Definitions
    1.2.2 Which Failures Can Occur?
    1.2.3 Reliability Requirements for Various Users and Services
    1.2.4 Measures to Increase Reliability
  1.3 Different Phases in a Recovery Process
    1.3.1 Recovery Cycle
    1.3.2 Reversion Cycle
  1.4 Performance of Recovery Mechanisms: Criteria
    1.4.1 Scope of Failure Coverage
    1.4.2 Recovery Time
    1.4.3 Backup Capacity Requirements
    1.4.4 Guaranteed Bandwidth
    1.4.5 Reordering and Duplication
    1.4.6 Additive Latency and Jitter
    1.4.7 State Overhead
    1.4.8 Scalability
    1.4.9 Signaling Requirements
    1.4.10 Stability
    1.4.11 Notion of Recovery Class
  1.5 Characteristics of Single-Layer Recovery Mechanisms
    1.5.1 Backup Capacity Dedicated versus Shared
    1.5.2 Recovery Paths: Preplanned versus Dynamic
    1.5.3 Protection versus Restoration
    1.5.4 Global versus Local Recovery
    1.5.5 Control of Recovery Mechanisms
    1.5.6 Ring Networks versus Mesh Networks
    1.5.7 Connection-Oriented versus Connectionless
    1.5.8 Revertive versus Nonrevertive Mode
  1.6 Multilayer Recovery
    1.6.1 Sequential Approach
    1.6.2 Integrated Approach
  1.7 Conclusion

Chapter 2 SONET/SDH Networks
  2.1 Introduction
    2.1.1 Transmission Networks
    2.1.2 Management of (Transmission) Networks
    2.1.3 Structuring/Modeling Transmission Networks
    2.1.4 Summary
  2.2 SDH and SONET Networks
    2.2.1 Introduction
    2.2.2 Structure of SDH Networks
    2.2.3 SDH Frame Structure: Overhead Bytes Relevant for Network Recovery
    2.2.4 SDH Network Elements
    2.2.5 Summary
    2.2.6 Differences between SONET and SDH
  2.3 Operational Aspects
    2.3.1 Fault Management Processes
    2.3.2 Fault Detection and Propagation Inside a Network Element
    2.3.3 Fault Propagation and Notification on a Network Level
    2.3.4 Automatic Protection Switching Protocol
    2.3.5 Summary
  2.4 Ring Protection
    2.4.1 Multiplex Section–Shared Protection Ring
    2.4.2 Multiplex Section–Dedicated Protection Ring
    2.4.3 Subnetwork Connection Protection Ring
    2.4.4 Ring Interconnection
    2.4.5 Summary
    2.4.6 Differences between SONET and SDH
  2.5 Linear Protection
    2.5.1 Multiplex Section Protection
    2.5.2 Path Protection
    2.5.3 Summary
  2.6 Restoration
    2.6.1 Protection versus Restoration
    2.6.2 Summary
  2.7 Case Study
  2.8 Conclusion
  2.9 Recommended Reference Work and Research-Related Topics

Chapter 3 Optical Networks
  3.1 Evolution of the Optical Network Layer
    3.1.1 Wavelength Division Multiplexing in the Point-to-Point Optical Network Layer
    3.1.2 An Optical Networking Layer with Optical Nodes
    3.1.3 An Optical Network Layer Organized in Rings
    3.1.4 Meshed Optical Networks
    3.1.5 Adding Flexibility to the Optical Network Layer
  3.2 The Optical Transport Network
    3.2.1 Architectural Aspects and Structure of the Optical Transport Network
    3.2.2 Structure of the Optical Transport Module
    3.2.3 Overview of the Standardization Work on the Optical Transport Network
  3.3 Fault Detection and Propagation
    3.3.1 The Optical Network Overhead
    3.3.2 Defects in the Optical Transport Network
    3.3.3 OTN Maintenance Signals and Alarm Suppression
  3.4 Recovery in Optical Networks
    3.4.1 Recovery at the Optical Layer?
    3.4.2 Standardization Work on Recovery in the Optical Transport Network
    3.4.3 Shared Risk Group
  3.5 Recovery Mechanisms in Ring-Based Optical Networks
    3.5.1 Multiplex Section Protection in Ring-Based Optical Networks
    3.5.2 Optical Channel Protection in Ring-Based Optical Networks
    3.5.3 OMS- versus OCh-Based Approach
    3.5.4 Shared versus Dedicated Approach
    3.5.5 Interconnection of Rings
  3.6 Recovery Mechanisms in Mesh-Based Optical Networks
    3.6.1 Protection
    3.6.2 Protection in a WP Network versus Protection in a VWP Network
    3.6.3 Restoration
    3.6.4 Protection versus Restoration
    3.6.5 Protection Combined with Restoration
  3.7 Ring-Based versus Mesh-Based Recovery Schemes
  3.8 Availability
    3.8.1 Availability Calculations
    3.8.2 Availability: Some Observations
  3.9 Recent Trends in Research
    3.9.1 p-Cycles
    3.9.2 Meta-Mesh Recovery Technique
    3.9.3 Flexible Optical Networks
  3.10 Conclusion

Chapter 4 IP Routing
  4.1 IP Routing Protocols
    4.1.1 Introduction
    4.1.2 Distance Vector Routing Protocols Overview (“Bellman-Ford”)
    4.1.3 Link State Routing Protocols Overview
    4.1.4 IP Routing: A Global versus Local Restoration Mechanism?
  4.2 Analysis of the IP Routing Recovery Cycle
    4.2.1 Fault Detection and Characterization
    4.2.2 Hold-Off Timer
    4.2.3 Fault Notification Time
    4.2.4 Computation of the Routing Table
    4.2.5 An Example of IP Rerouting upon Link Failure
  4.3 Failure Profile and Fault Detection
    4.3.1 Failure Profiles
    4.3.2 Failure Detection
    4.3.3 Failure Characterization
    4.3.4 Analysis of the Various Failure Types and Their Impact on Traffic Forwarding
  4.4 Dampening Algorithms
  4.5 FIS Propagation (LSA Origination and Flooding)
    4.5.1 LSA Origination Process
    4.5.2 LSA Flooding Process
    4.5.3 Time Estimate for the LSA Origination and Flooding Process
  4.6 Route Computation
    4.6.1 Shortest Path Computation
    4.6.2 The Dijkstra Algorithm
    4.6.3 Shortest Path Computation Triggers
    4.6.4 Routing Information Base Update
  4.7 Temporary Loops during Network State Changes
    4.7.1 Temporary Loops in the Case of a Link or Node Failure
    4.7.2 Temporary Loops Caused by a Restored Network Element
  4.8 Load Balancing
  4.9 QoS during Failure
    4.9.1 IP Traffic Engineering at Steady State
    4.9.2 QoS Guarantee during Failure
  4.10 Nonstop Forwarding: An Example with OSPF
    4.10.1 Mode of Operation
    4.10.2 Mode of Operation of the Restarting Router
    4.10.3 Mode of Operation of the Restarting Router’s Neighbors
    4.10.4 Backward Compatibility
  4.11 A Case Study with IS-IS
  4.12 Summary
  4.13 Algorithm Complexity
    4.13.1 Definition of Algorithm Complexity
    4.13.2 NP Complete Problem
  4.14 Incremental Dijkstra
    4.14.1 Motivation
    4.14.2 History
    4.14.3 Algorithm Description
    4.14.4 iSPF Efficiency
  4.15 Interaction between Fast IGP Convergence and NSF
  4.16 Research-Related Topics

Chapter 5 MPLS Traffic Engineering Recovery Mechanisms
  5.1 MPLS Traffic Engineering Refresher
    5.1.1 Traffic Engineering in Data Networks
    5.1.2 Terminology
    5.1.3 MPLS Traffic Engineering Components
    5.1.4 Notion of Preemption in MPLS Traffic Engineering
    5.1.5 Motivations for Deploying MPLS Traffic Engineering
  5.2 Analysis of the Recovery Cycle
    5.2.1 Fault Detection Time
    5.2.2 Hold-Off Timer
    5.2.3 Fault Notification Time
    5.2.4 Recovery Operation Time
    5.2.5 Traffic Recovery Time
  5.3 MPLS Traffic Engineering Global Default Restoration
    5.3.1 Fault Signal Indication
    5.3.2 Mode of Operation
    5.3.3 Recovery Time
  5.4 MPLS Traffic Engineering Global Path Protection
    5.4.1 Mode of Operation
    5.4.2 Recovery Time
  5.5 MPLS Traffic Engineering Local Protection
    5.5.1 Terminology
    5.5.2 Principles of Local Protection Recovery Techniques
    5.5.3 Local Protection: One-to-One Backup
    5.5.4 Local Protection: “Facility Backup”
    5.5.5 Properties of a Traffic Engineering LSP
    5.5.6 Notification of Tunnel Locally Repaired
    5.5.7 Signaling Extensions for MPLS Traffic Engineering Local Protection
    5.5.8 Two Strategies for Deploying MPLS Traffic Engineering for Fast Recovery
  5.6 Another MPLS Traffic Engineering Recovery Alternative
  5.7 Load Balancing
  5.8 Comparison of Global and Local Protection
    5.8.1 Recovery Time
    5.8.2 Scalability
    5.8.3 Bandwidth Sharing Capability
    5.8.4 Summary
  5.9 Revertive versus Nonrevertive Modes
    5.9.1 MPLS Traffic Engineering Global Default Restoration
    5.9.2 MPLS Traffic Engineering Global Path Protection
    5.9.3 MPLS Traffic Engineering Local Protection
  5.10 Failure Profile and Fault Detection
    5.10.1 MPLS-Specific Failure Detection Hello-Based Protocols
    5.10.2 Requirements for an Accurate Failure Type Characterization
    5.10.3 Analysis of the Various Failure Types and Their Impact on Traffic Forwarding
  5.11 Case Studies
    5.11.1 Case Study 1
    5.11.2 Case Study 2
    5.11.3 Case Study 3
  5.12 Standardization
  5.13 Summary
  5.14 RSVP Signaling Extensions for MPLS TE Local Protection
    5.14.1 SESSION-ATTRIBUTE Object
    5.14.2 FAST-REROUTE Object
    5.14.3 DETOUR Object
    5.14.4 Route Record Object
    5.14.5 Signaling a Protected Traffic Engineering LSP with a Set of Constraints
    5.14.6 Identification of a Signaled TE LSP
    5.14.7 Signaling with Facility Backup
    5.14.8 Signaling with One-to-One Backup
    5.14.9 Detour Merging
  5.15 Backup Path Computation
    5.15.1 Introduction
    5.15.2 Requirements for Strict QoS Guarantees during Failure
    5.15.3 Network Design Considerations
    5.15.4 Notion of Bandwidth Sharing between Backup Paths
    5.15.5 Backup Path Computation: MPLS TE Global Path Protection
    5.15.6 Backup Tunnel Path Computation: MPLS TE Fast Reroute Facility Backup
    5.15.7 Backup Tunnel Path Computation with MPLS TE Fast Reroute One-to-One Backup
    5.15.8 Summary
  5.16 Research-Related Topics

Chapter 6 Multilayer Networks
  6.1 ASON/G-MPLS Networks
    6.1.1 The ASON/ASTN Framework
    6.1.2 Protocols for Implementing a Distributed Control Plane
    6.1.3 Overview of Control Plane Architectures (Overlay, Peer, Augmented)
  6.2 Generic Multilayer Recovery Approaches
    6.2.1 Why Multilayer Recovery?
    6.2.2 Single-Layer Recovery Schemes in Multilayer Networks
    6.2.3 Static Multilayer Recovery Schemes
    6.2.4 Dynamic Multilayer Recovery
    6.2.5 Summary
  6.3 Case Studies
    6.3.1 Optical Restoration and MPLS Traffic Engineering Fast Reroute
    6.3.2 SONET/SDH Protection and IP Routing
    6.3.3 MPLS Traffic Engineering Fast Reroute (Link Protection) and IP Rerouting Fast Convergence
  6.4 Conclusion

Bibliography
List of Figure Sources
Index
Foreword
It was not all that long ago that the Internet was considered, at least by the traditional telecommunications world, a researcher’s toy. They scoffed at the notion of a “best effort” network being a useful way to support serious telecommunications such as “toll quality” voice or business-critical data communications. Some of the traditional telecommunications carriers still think this way. But even these carriers are now planning on migrating their services to Internet Protocol (IP)-based networks. Some of the more dogmatic carriers are planning on building a multicarrier carrier-run IP network in parallel with the Internet, but most of the carriers are beginning to realize that the Internet itself is in the process of a transformation, one that will render obsolete much of the traditional telecommunications infrastructure and thinking. A key facilitator of this transformation is the subject of this book.

Recovery from link failures in traditional IP networks can take a long time. This is because IP routing protocols (covered in Chapter 4) were not designed to ensure that network users would not experience significant outages while the routing protocol attempted to find a path around a link, SRLG, or node failure. Internet research has shown that a single link failure can cause users to experience outages of many minutes even when the underlying network architecture itself is highly redundant, with plenty of spare bandwidth available and with multiple ways to route around the failure.

Outages of many minutes are not a real problem if the primary communication method is email. Such outages are more of an issue for web surfers, but considering the episodic nature of web surfing, most users will not notice even a 5-minute outage. On the other hand, even outages as short as a few tens of seconds can be a real problem when talking over the phone, and a 5-minute outage seems like forever. The same is true for critical business data communications such as stock trading systems, e-commerce servers, or process controllers. Future applications such as remote medical diagnostic or surgery systems will be even more demanding.

The network recovery technologies covered in this book are changing the perception and reality of the Internet. The IP, MPLS, SONET, and optical protection and restoration technologies explained in this book can be used to reduce outages resulting from link, SRLG, and node failures from minutes to subseconds (and in some cases to milliseconds). As these technologies continue to be deployed in major Internet service providers, even the most demanding traditional telecommunications engineer will be forced to take another look at replacing their existing network infrastructure with the Internet, or at least with Internet technology, if for no other reason than the realization that their competitors are already working on their transition plans.

This is the right book at the right time for anyone in the telecommunications business, or anyone who is dependent on the services provided by the telecommunications business and would like to understand the new Internet that is rapidly becoming the common reality.

Scott Bradner
Senior Technical Consultant and University Technology Security Officer at Harvard University
Preface
The range of services and applications that rely on communication networks is impressive: business-critical communication, phone calls, emails, home banking, and even watching TV or listening to music, and this is undoubtedly just the very beginning. Because our professional and private lives are more and more dependent on these communication services, the repercussions of a service interruption are severe. Hence, service providers and enterprises have taken an intensified interest in network reliability during the past few years, and this trend is expected to continue.

We have dedicated a very significant amount of our time during those past years to understanding the challenges of network recovery and the existing and new requirements of operators and enterprises, in order to develop new technologies, standards, and network designs. We found that the time was overdue to devote a book to network recovery, and this book is the result of our experience.

Network recovery is undoubtedly a complex, fascinating, and rapidly evolving topic, essentially because of its truly multidimensional nature. Indeed, although the immediate criterion that comes to mind is convergence time (i.e., the time to recover the affected traffic), it is only one among several aspects we should consider. Throughout this book, we explore all the other dimensions that lead to choosing a particular recovery mechanism and selecting a specific network recovery design: Does the backup path offer a similar quality of service in terms of bandwidth and propagation delay? What are the consequences of maintaining extra network states? Is there any potential impact on the network stability as a result of trying to restore the traffic upon failure and potentially reusing restored routes? What are the implications of adding some extra complexity in the network, both in terms of engineering and network operation management? And finally, what are the cost implications in terms of additional required equipment and network backup bandwidth? All the above criteria must be carefully evaluated, because they lead to various trade-offs during the decision-making process of network recovery design.

Moreover, one must admit that the emergence of new services and applications has resulted in some increased complexity in terms of hardware and software equipment (indeed, it is not unusual to see a software program with millions of lines of code!). As a result, the potential for failures has drastically increased during the past several years, both in diversity and in identification complexity.

Furthermore, both network convergence and the rapid growth of new applications such as Voice or Video over IP have led to building networks involving several layers. Each layer offers a large set of recovery mechanisms, which ineluctably interact when deployed at multiple layers. Hence, we devote an entire chapter to the subject of interlayer recovery, with the objective of highlighting the potential interactions between multiple recovery mechanisms operating at different layers.
Objectives

Our first objective in writing this book is to deliver a thorough reference to network recovery mechanisms available at various layers, highlighting the strengths, weaknesses, and applicability of each of them. This includes not only the detailed coverage of the signaling and routing aspects, but also the delicate problem of understanding the underlying dynamics: In other words, what actually happens when a network recovery mechanism is triggered upon a network element failure? This usually involves a succession of rather complex action steps, which are described and illustrated by means of an extensive set of examples that appear throughout this book.

Our second objective is to go beyond the technical description of the possible network recovery mechanisms and include some network recovery design guidelines as well. Indeed, understanding a protection or restoration mechanism is not a sufficient knowledge base from which to develop an efficient network design, especially considering the large set of possible objectives and network constraints. Consequently, for each mechanism we incorporate a number of detailed case studies, starting with the constraints and then describing a set of network recovery objectives to propose some possible designs along with detailed explanations and possible alternatives. Finally, although we have been involved in the design of some of the recovery mechanisms discussed, we have endeavored to detail each one with objectivity.

One of our main challenges in writing this book was to offer a reference in network recovery while not making it a prerequisite for the reader to be an expert in every related layer. Hence, we have strived to make each chapter readable at multiple levels, from a basic understanding of the mechanisms in question to an in-depth coverage allowing a protocol designer and network architect to benefit from this material in their primary work.
Structure

We begin the book with an “advanced” introduction with the goal of providing an exhaustive definition of each concept used throughout the book. In particular, the literature related to network recovery often uses substantially disparate terms, acronyms, and definitions. To avoid confusion, we devoted a chapter to reviewing each individual concept before digging into the detailed analysis of the network recovery mechanisms available at each layer.

After this general Chapter 1, the first investigated approach in Chapter 2 is SONET-SDH, followed by the optical layer in Chapter 3, because a significant number of network recovery mechanisms available at the optical layer were inspired by SONET-SDH. Then IP routing is explored in detail: A large proportion of Chapter 4 emphasizes the routing dynamics, which play a crucial role in distributed routing environments and are usually not covered in detail in the literature. The MPLS-related recovery mechanisms are studied in Chapter 5, and the many significant developments in this area are all covered in depth with numerous illustrations.

Note that in most of the chapters, the reader can skip some parts related to more advanced aspects (usually situated at the end of each chapter) without compromising an overall understanding of the technology. For example, a detailed understanding of the signaling aspects of MPLS Fast Reroute is not required to understand the mechanisms in use and how they can be applied to any network.

Finally, Chapter 6 of this book concludes with a discussion of the interlayer network aspects and investigates the interaction between network recovery processes operating at different layers. Each chapter contains a final section devoted to current research-related work. These sections may be the core of a potential revision to this book.
Acknowledgments

We are greatly indebted to Didier Colle, Sophie De Maesschalck, and Ilse Lievens for their indispensable contributions to the writing and review of the book. We highly appreciate your continuous effort and outstanding expertise throughout the entire process!

We are of course extremely grateful to our reviewers for their thorough review of our book and their precious suggestions that undoubtedly helped us in many ways to improve its content: Dave Cooper for his experience and expertise in data networks after several years as a lead architect for a large Service Provider; Stefano Previdi, probably one of the most recognized routing experts; Achim Autenrieth for his much appreciated expertise in transport network recovery mechanisms; Kevin D’Souza for his valuable suggestions and input from a large operator; and Maurice Gagnaire for his expertise as a widely recognized professor, researcher, and author.

Writing a book is a fascinating experience; this book would not have been possible without the support from several people at Morgan Kaufmann and in particular our editor, Rick Adams, and our development editor, Karyn Johnson, whose help and guidance throughout the writing of this book have been tremendous. We are also extremely grateful to our production editor, Denise DeLancey, for her outstanding professionalism and precious help during the production phase of this book.

Finally, we would like to encourage our readers to send comments, highlight errors or omissions, or support the writing of a second edition. Please contact us at:
[email protected].
Special Acknowledgments

My first acknowledgment undoubtedly goes to my wife, Brigitte, without whose help and support I would not have succeeded in either the writing of this book or in my life. I would, of course, like to thank my two daughters, Manon and Eleonore, for their patience, understanding, and love. I also wish to thank my company, Cisco Systems, Inc., for being part of it, and in particular Bruce Davie, who helped me in many respects. A special thank you to my close friend Stefano Previdi, for not only his review and expertise but also for several years of friendship.
Jean-Philippe Vasseur

Words are not adequate to thank my wife, Evelien, for her tremendous support and understanding in everything I do. Thank you, Evelien, for being my soul mate and the ultimate inspiration in my life. I would also like to thank my colleagues at Ghent University for the fruitful technical debates.
Mario Pickavet

Thank you, colleagues (especially those from the ACTS-PANEL project) and PhD students, for the stimulating discussions on network resilience. Thank you, Mieke, Bram, Anneleen, and Jozefien, for your love, patience, and support.
Piet Demeester
Vasseur / Network Recovery Final Proof 7.6.2004 12:18pm page 1
CHAPTER 1
General Introduction
This chapter presents a general introduction to recovery mechanisms in data and telecommunications networks. Before delving into the core topic of this book, network recovery mechanisms, we present an overview of the main technologies in today’s broadband communications networks. The objective of Section 1.1 is to highlight the high-level characteristics of these networks, to refresh our knowledge. After this background introduction, we touch on the focus of the book in Section 1.2, where the crucial importance of network reliability (and hence of recovery mechanisms) is explained. Abstracting from the network technology, we enumerate and illustrate the different phases of traffic recovery in Section 1.3. To be able to compare various recovery mechanisms, we discuss a number of criteria in Section 1.4. The main characteristics and fundamental choices when constructing a single-layer recovery mechanism are elucidated in Section 1.5. This allows classification of the plethora of recovery mechanisms and a better overview of the pros and cons of different mechanisms. Finally, Section 1.6 briefly touches on the issue of interlayer dependency of failures and recovery mechanisms. In summary, the first section presents a technology overview, whereas the sections that follow lay the (technology-independent) foundations for the study of recovery mechanisms, by explaining the terminology and by highlighting the main characteristics of recovery mechanisms.
1.1 Communications Networks Today

It is a well-known fact that today’s society is relying more and more on communications networks, both for professional and for recreational purposes. The volume of traffic to be conveyed by the communications network infrastructure has grown significantly. This traffic increase, which is expected to continue, is mainly due to the popularity of the Internet and all its related services [Rob01]. According to several sources (e.g., [McK00]) data traffic has already overtaken voice traffic
in volume. In terms of revenue, however, voice traffic is still the most important source of income for telecom operators. A network model more optimized to carry data traffic could thus help the operators to increase their revenues coming from data traffic. Moreover, several service providers have started to carry voice traffic over their data networks. This can reduce the operating cost, because operating a single network is typically less costly than running two networks. The network model currently envisaged to be most suited for the transport of large traffic volumes is an IP/MPLS-over-OTN multilayer model, as depicted in Figure 1.1 (see footnote 1), in contrast to today’s IP [over ATM] over SONET/SDH over WDM networks. Because of the explosive growth of the Internet, future broadband communications networks will be based on the Internet Protocol (IP) to carry both data and voice traffic. To be able to accommodate these huge amounts of traffic, the transport network will be based on Wavelength Division Multiplexing (WDM). WDM systems provide more bandwidth capacity over fiber-optic networks by increasing the number of usable channels per fiber. Unfortunately, because traffic must be converted to an electrical signal at each network node, a bottleneck is created. This bottleneck can be overcome by introducing optics in the network nodes, thereby creating an Optical Transport Network (OTN). The introduction of Multi-Protocol Label Switching (MPLS) [Ros01] in the IP layer enhances the network’s capabilities (e.g., for traffic engineering, virtual private networking
Figure 1.1 Evolution in data-centric networks. [Figure labels: past: voice, IP, ... over ATM over SONET/SDH over WDM (point-to-point); future: IP-MPLS over adaptation over OTN (G-MPLS).]
Footnote 1: See the first sections of Chapters 2, 3, 4, and 5 for more detailed descriptions of the different technologies.
[VPN] or strict quality-of-service [QoS] support). In the OTN, a similar extension toward Generalized MPLS (G-MPLS [Man1]) provides opportunities for dynamic lightpath allocation, leading to an Automatic Switched Optical Network (ASON) [G807]. Of course, some kind of adaptation (e.g., based on a SONET- or SDH-like mechanism) is still needed between the IP/MPLS and the OTN layer, to deal with issues like framing, flow control, and error correction. This book concentrates on the data-centric network model depicted in Figure 1.1, particularly on the reliability and recovery issues that arise in these networks. We mainly focus on the carrier infrastructure of the network, that is, the network realizing the traffic exchange between the various customers. The following chapters of this book contain an in-depth discussion of the various recovery mechanisms in every technology. The remainder of the current section introduces some general preliminary terminology on communications networks and may be skipped if you are familiar with communications networks.
1.1.1 Fundamental Networking Concepts

The technologies used in today’s communications networks differ in a large variety of characteristics.
Symmetrical Versus Asymmetrical Traffic, Unidirectional Versus Bidirectional Traffic

In the literature, several definitions of symmetrical/asymmetrical and unidirectional/bidirectional traffic exist. In this book, we adopt the following definitions. Symmetrical services, like classic telephony, require the same bandwidth in each direction. If person A and person B want to have a telephone conversation, a connection is provided between A and B, where the same bandwidth is provided from A to B as from B to A. Other services, like Web access, are inherently asymmetrical in nature: The bandwidth needed from server to client is typically much higher than that needed from client to server. If the traffic from server to client is following a different route than the traffic from client to server, this is called unidirectional traffic. In the case of bidirectional traffic, the route from point A to point B in the network is the same as the route from point B to point A. The unidirectional or bidirectional nature of various services has a direct impact on the network technologies, designed with one or more specific services in mind. For instance, SONET and SDH were originally designed with a focus on classic telephony, hence all connections are bidirectional. On the other hand, IP/MPLS is inherently unidirectional.
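The two distinctions defined above are independent of each other, which a small sketch can make concrete. The `Flow` record, its field names, the node labels, and the bandwidth figures are illustrative assumptions and do not come from the book.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flow:
    """One direction of a service between two endpoints (hypothetical record)."""
    route: List[str]       # ordered node names, source first
    bandwidth_mbps: float  # bandwidth required in this direction


def is_symmetrical(forward: Flow, backward: Flow) -> bool:
    # Symmetrical: the same bandwidth is required in each direction.
    return forward.bandwidth_mbps == backward.bandwidth_mbps


def is_bidirectional(forward: Flow, backward: Flow) -> bool:
    # Bidirectional: the B-to-A route is the A-to-B route traversed in reverse.
    return forward.route == list(reversed(backward.route))


# Telephony-like service: same bandwidth and same route in both directions.
call_ab = Flow(["A", "X", "B"], 0.064)
call_ba = Flow(["B", "X", "A"], 0.064)

# Web-like service: more bandwidth downstream and a different return route.
down = Flow(["server", "X", "client"], 10.0)
up = Flow(["client", "Y", "server"], 1.0)

print(is_symmetrical(call_ab, call_ba), is_bidirectional(call_ab, call_ba))
print(is_symmetrical(down, up), is_bidirectional(down, up))
```

Under these assumed values the telephony-like pair is both symmetrical and bidirectional, whereas the Web-like pair is neither.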
Ring Networks Versus Mesh Networks

A ring is defined as a set of nodes forming a closed loop where each node is connected to two adjacent nodes. A ring network is completely composed of interconnected rings (Figure 1.2). Whereas the traffic in a mesh network is routed
Figure 1.2 Example of a ring (left) and a ring network (right). [Figure labels: working path; Ring 1, Ring 2, Ring 3.]
unrestricted through the network, all traffic in a ring network is routed from ring to ring. (See Chapters 2 and 3 for typical examples of SONET/SDH networks and OTN networks, respectively.)
Circuit Switching Versus Packet Switching

A distinction can be made with respect to the atomic entity (i.e., the smallest indivisible portion) that is switched in a network node. In circuit switching, all information is transported through the network via circuits (i.e., paths with a fixed available bandwidth). In packet switching, all information is split up into packets and these packets are sent one by one through the network. Every node reads the header or label of every incoming packet, to find where the packet should be forwarded. Because the packets occupy capacity only when they are transmitted, packet switching allows for statistical multiplexing and hence is usually more efficient than circuit switching in terms of bandwidth usage. On the other hand, packet switching requires more operations in the network nodes, because the packets must be processed and switched one by one.
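A rough way to see the statistical multiplexing gain is to assume n independent on/off sources, each needing one channel while active and being active with probability p. This binomial model, the function name, and the numbers below are illustrative assumptions, not part of the book. Circuit switching reserves one channel per source, whereas a packet-switched link can be dimensioned for fewer channels while keeping the probability that demand exceeds capacity below a chosen target.

```python
from math import comb


def channels_needed(n_sources: int, p_active: float, overflow_target: float) -> int:
    """Smallest capacity k with P(active sources > k) <= overflow_target.

    Assumes independent on/off sources (a binomial model).
    """
    cumulative = 0.0
    for k in range(n_sources + 1):
        # Add P(exactly k sources active) to the running distribution function.
        cumulative += comb(n_sources, k) * p_active**k * (1 - p_active) ** (n_sources - k)
        if 1.0 - cumulative <= overflow_target:
            return k
    return n_sources


circuit_capacity = 10                             # one reserved channel per source
packet_capacity = channels_needed(10, 0.1, 0.01)  # shared capacity, 1% overflow target
print(circuit_capacity, packet_capacity)
```

With these assumed numbers (10 sources, each active 10% of the time, 1% overflow target), the sketch yields 4 shared channels instead of 10 reserved ones. The price is the per-packet processing mentioned above and a small, nonzero probability of contention.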
Connection Oriented Versus Connectionless

Switching techniques can also be categorized as connection oriented or connectionless. In connection-oriented networks, an end-to-end connection must be established before the start of each communication session. After the session, the connection is closed. In connectionless networks, communication can occur without having to establish any kind of connection. For instance, circuit-switched networks (e.g., based on SONET/SDH or OTN) only support connection-oriented operation.
Shared multiple access approaches (e.g., Ethernet) do not involve the concept of connection and are hence connectionless. Packet-switched networks can operate in connectionless (e.g., IP networks) or connection-oriented mode (e.g., ATM networks). Hybrid forms are also possible; for example, MPLS has both a connectionless and a connection-oriented form (see Chapter 5 for a detailed description).
1.1.2 Layered Network Representation

A communications network usually consists of a number of heterogeneous network elements, performing a large variety of communication functions. To ensure the compatibility and "interworking" of these elements, several reference models have been developed by various standardization bodies. Some examples are ITU-T recommendation G.805 on the generic functional architecture of transport networks [G805], the OSI model in ITU-T recommendation X.200 [X200], and the TCP/IP protocol stack [Soc91] from the Internet Engineering Task Force (IETF). The latter model is shown in Figure 1.3. It isolates the specific functions or tasks for communication in IP networks in five layers. The lowest layer, or Layer 1, is the physical layer, which deals with the transmission of the pure unstructured bit stream over a physical link (e.g., optical fiber, coaxial cable, or wireless). It deals with characteristics to establish, maintain, and deactivate a physical link, such as bit duration and signal voltage swing. Layer 2, or the data link layer, attempts to make the physical link reliable and provides the means to activate, maintain, and deactivate the link. The main service provided to the higher layers is error detection and control. This implies that with full Layer 2 functionality, the next higher layer may assume a virtually error-free transmission over the link. Layer 3 is the network layer, which provides the upper layers with independence from data transmission and switching technologies used to connect systems. Layer 3 is realized by IP, so it relieves Layer 4 of the need to know about the underlying data transmission and switching. The purpose of Layer 4, the transport layer, is to provide a reliable mechanism for the transparent exchange of data between endpoints. It is a connection-oriented approach, providing end-to-end error recovery and flow
Figure 1.3 TCP/IP protocol stack.

Layer 5   Application Layer
Layer 4   Transport Layer
Layer 3   Network Layer
Layer 2   Data Link Layer
Layer 1   Physical Layer
control and ensuring that data units are delivered error free, in sequence, and without duplications or losses. Layer 4 is realized by the Transmission Control Protocol (TCP). The User Datagram Protocol (UDP) forms a connectionless alternative. The highest layer is the application layer, providing a means for applications to exchange information. Some typical examples of application layer protocols in TCP/IP networks are the Hypertext Transfer Protocol (HTTP, used for access to the Web), the File Transfer Protocol (FTP) to upload and download large files, the Simple Mail Transfer Protocol (SMTP) for email, and so on. In practice, however, broadband communications networks typically consist of several technologies (e.g., OTN, SONET or SDH, ATM, IP/MPLS, etc.), where every technology can cover functions from different layers in the TCP/IP protocol stack. The main drivers leading to a multitechnology network are that each technology has its strengths and weaknesses (depending on the traffic type, the user requirements, etc.) and the historical evolution of a communications network where legacy equipment is used for as long as possible. To visualize multitechnology networks in a comprehensible manner, a layered representation is very helpful [G805], and this type of representation will be used throughout the book. Every network technology corresponds to a network layer, where the successive network layers usually have a client-server relationship. To illustrate this concept, the example of a small IP-over-OTN network is shown in Figure 1.4. The OTN layer, consisting of five optical cross-connects (OXCs A, B, C, D, and E) that are interconnected via optical fibers, represents the physical topology of the network (lower plane of Figure 1.4). It serves as a transport network to the IP layer. For instance, the link between IP routers b and c will be realized in the OTN layer as a lightpath (i.e., a bandwidth pipe corresponding with one wavelength channel) B-D-C or B-A-E-C.
In a similar way, the IP links a-b, a-c, b-d, and c-d are realized, leading to the logical topology of the IP layer (upper plane of Figure 1.4). From an IP point of view, only the logical topology of the IP layer is visible, irrespective of the exact realization of the IP links in the underlying transport network. Of course, a realistic carrier-class network will typically be much more complex than the situation shown in Figure 1.4: The network can contain hundreds of nodes instead of only a few, often separated in multiple domains (e.g., autonomous systems), possibly with different routing protocols in the different domains, and so on.
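The client-server relationship between the two layers can be sketched as a mapping from each logical IP link to the lightpath that realizes it. Only the two candidate routes for link b-c (B-D-C or B-A-E-C) are given in the text; the routes chosen for the other IP links and the data structure itself are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Logical IP link -> OXC sequence of the lightpath that realizes it.
# b-c uses one of the two routes named in the text; the rest are assumed.
LIGHTPATHS: Dict[Tuple[str, str], List[str]] = {
    ("b", "c"): ["B", "D", "C"],
    ("a", "b"): ["A", "B"],        # assumed
    ("a", "c"): ["A", "E", "C"],   # assumed
    ("b", "d"): ["B", "D"],        # assumed
    ("c", "d"): ["C", "D"],        # assumed
}


def fibers_of(lightpath: List[str]) -> Set[frozenset]:
    """Return the fiber spans (unordered OXC pairs) that a lightpath crosses."""
    return {frozenset(pair) for pair in zip(lightpath, lightpath[1:])}


def affected_ip_links(cut: Tuple[str, str]) -> List[Tuple[str, str]]:
    """List the IP links whose lightpath crosses the cut fiber span."""
    span = frozenset(cut)
    return sorted(link for link, path in LIGHTPATHS.items() if span in fibers_of(path))


# One cut in the OTN layer can remove several links from the logical topology,
# although the IP layer sees those links as unrelated.
print(affected_ip_links(("B", "D")))
```

In this assumed mapping, cutting the fiber between OXCs B and D affects both IP link b-c and IP link b-d, a dependency that is invisible from the logical topology alone.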
1.1.3 Network Planes

To identify the major functional entities in a communications network, we make a distinction between the following (see also Figure 1.5 [I321]):
• The data or user plane transfers user information (also called the payload) through the network. Every network layer has its own user plane.
Figure 1.4 IP-over-OTN network. [Figure labels: IP layer with routers a, b, c, and d; OTN layer with OXCs A, B, C, D, and E; working path and recovery path.]
Figure 1.5 Protocol reference model. (ITU-T Recommendation I.321, "B-ISDN Protocol Reference Model and its Application," April 1991. Available at www.itu.int. Accessed May 2004.) [Figure labels: management plane (layer management and plane management), control plane, and user plane, each spanning the layers from the physical layer up to the highest layer.]
• The control plane handles, for example, signaling for connection setup, supervision, and teardown, as well as routing table updates, by transferring control information through the network. Every network layer has its own control plane. A control plane typically functions in a distributed way across the
network. Typical examples are the telephony control plane (e.g., based on Signaling System 7 [SS7]) and the IP control plane.
• The management plane consists of two parts: a layer management for each network layer and a plane management to ensure the correct coordination between the different layers. A management plane usually operates in a centralized way, a typical example being the Telecommunication Management Network (TMN) [M30000].
1.2 Network Reliability

Communications networks are subject to a wide variety of unintentional failures caused by natural disasters (earthquakes, fires, and floods), wear-out, overload, software bugs, human errors, and so on, as well as intentional failures caused by maintenance action or sabotage. Such failures affect network facilities such as transmission or switching infrastructure, whose failure in turn disrupts communication services for business and residential users. Communication services play an indispensable role in many of the social and economic activities of our daily lives [Dem99]. For instance, telephone services serve as a lifeline and their interruption (even if only temporary) causes social turmoil and unrest. Strategic corporate functions also show an increasing dependence on communication services. For business customers, disruption of communication can suspend critical operations. This may cause a significant loss of revenue for the customer, to be reclaimed from the communications provider. In fact, availability guarantees (and compensations if these are not met) now form an important component of service-level agreements (SLAs) between provider and customer. Besides, the provider is often the largest customer of its own communication services (e.g., an incumbent network operator heavily relies on its own transport infrastructure). For all parties’ sake, it is thus imperative to provide a high level of service availability. This relies on the permanence of those network functions required to make the communication services run. Before going into more detail on the kind of failures that can happen in a communications network and their impact on various services, we define some terms commonly used in the context of network reliability. This terminology is used throughout the book.
Section 1.2.2 elaborates on typical examples of network failures, to give you a better overview of what can disturb the proper functioning of a communications network. The impact of these failures on the plethora of services to be supported by the network is discussed in Section 1.2.3. Considering the drastic nature of some failures and the unacceptable impact on crucial communication services, we take a wide variety of measures to overcome or alleviate the burden caused by frequently occurring failures. Section 1.2.4 presents a general overview of these measures, describing their functioning and their advantages and disadvantages.
1.2.1 Definitions

With respect to network reliability, a number of similar but slightly different terms are used. Network element reliability is defined as the probability of a network element (e.g., a node or a link) to be fully operational during a certain time frame [E800]. Availability is the instantaneous counterpart of reliability: Network element availability is the probability of a network element to be operational at one particular point in time. A simple numerical example is given in Figure 1.6: For each network node and link, the availability is mentioned. If we assume that the failures of these network elements are mutually independent (i.e., not caused by a shared origin), we can easily calculate the availability of a complete network path from these numbers, being the product of the availability numbers of all network elements along the path. For example, the availability of the path shown in the figure amounts to

\[ 0.99996 \times 0.9997 \times 0.99999 \times 0.9998 \times 0.99997 \times 0.9999 \times 0.99995 \times 0.9995 \times 0.99990 \approx 0.9987 \]

If we assume that these failures are not mutually independent (e.g., two links can simultaneously fail as a result of a single network element failure), then the overall availability of the path is lower. For a detailed example of availability calculations in a WDM-based network, see Chapter 3. Whereas the aforementioned definitions concentrate on the statistical behavior of the network elements, a number of definitions represent the capabilities and skills of the network as a whole. Network integrity is the ability of a network to provide the desired QoS to the services, not only in normal (i.e., failure-free) network
Figure 1.6 Network with node and link (italics) availability numbers. [Figure: every node and link is annotated with an availability value between 0.9993 and 0.99999.]
conditions, but also when network congestion or network failures occur [Wu97]. Network survivability is a subset of integrity; it is the ability of a network to recover the traffic in the event of a failure, causing few or no consequences for the users [G841]. Because it is impossible for a network to be completely survivable (e.g., in the case of dramatic events like major earthquakes, causing multiple failures in the network), we use the degree of survivability to denote the extent to which a network is able to recover from single and multiple network failures (considering the probability of each individual failure to occur). Also with respect to failure terminology, a lot of terms and slightly different interpretations can be found in the literature. In this book, we use the following convention [M20]:
• A network element defect is a decrease in the ability of a network element to perform a required function. For instance, a link defect may cause a poor link quality (indicated by an increased bit error rate), leading to error detection and resulting in packet/frame retransmissions.
• A network element failure is the termination of the ability of a network element to perform a required function. Hence, a network failure happens at one particular moment. For example, a cable cut by an excavator is a network failure. Note that in practice, some failures do not happen overnight and a network element may exhibit a gradual degradation. The time of the failure is then defined as the moment the degradation reaches an unacceptable level.
• The inability of a network element itself to perform a required function is called a fault or outage. This fault lasts until the network element is repaired, implying that a network fault covers a time interval, in contrast to a network failure.

These definitions are also illustrated in Figure 1.7. A further distinction is made between the original failure and the failures that occur as a consequence of the original failure. A root failure or primary failure is the basic, original failure occurring in the network (e.g., a cable cut). This root failure can cause many other failures to occur, the so-called secondary failures or symptoms. For example, when a cable is cut (root failure), many secondary failures such as interrupted connections in higher network layers occur.
Figure 1.7 Failure-and-repair process. [Figure labels: time axis with Defect 1, Defect 2, Failure, Fault, and Repair; element states: Operational, Not Operational, Operational.]
Figure 1.8 Example of failure terminology. [Figure labels: routers a and b; DXCs A, B, and C; regenerator R; time axis with t1 (cable cut), t2 (no light at R), t3 (no signal at DXC B), and t4 (no signal at router b).]
A typical example of a SONET network is shown in Figure 1.8, to illustrate the introduced terminology. At time point t1, the cable between the SONET digital cross-connect (DXC) A and the regenerator R is cut (root failure). A few milliseconds later (t2), light no longer reaches regenerator R (secondary failure). Some time later (t3 and t4), DXC B and IP router b no longer receive a signal (secondary failures). Because of the built-in recovery mechanisms, an alternative route is found in the SONET network between DXCs A and B via DXC C, and the traffic is rerouted along that path, A-C-B (end of fault status in DXC B and a bit later in router b). Meanwhile, civil workers are repairing the cable that was cut. After some time, the cable is fixed, and a few moments later light again enters regenerator R. Along the life cycle of a network element, several failures could occur. This leads to an alternation of operational and fault states. To grasp the temporal behavior of a network element in a probabilistic way, we use two parameters [G911]:
• The mean time between failures (MTBF) specifies the average length of the time interval that elapses between two subsequent failures of the same network element.
• The mean time to repair (MTTR) refers to the average time needed to repair the network element when it has failed.

From these two values, the availability of the network element can be derived as

\[ A = 1 - \frac{\mathrm{MTTR}}{\mathrm{MTBF}} \]

(under the assumption that the MTBF is much larger than the MTTR, a safe assumption for most network elements).
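As a worked sketch of the two calculations in this section, the code below derives an element availability from MTBF and MTTR and multiplies element availabilities along a path. The function names are illustrative. The nine availability figures are those of the example path of Figure 1.6, independence of failures is assumed, and the MTBF/MTTR pair is an assumed example value.

```python
from math import prod
from typing import Iterable

HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = 1 - MTTR/MTBF, valid when the MTBF is much larger than the MTTR."""
    return 1.0 - mttr_hours / mtbf_hours


def path_availability(element_availabilities: Iterable[float]) -> float:
    """Product of node and link availabilities, assuming independent failures."""
    return prod(element_availabilities)


def downtime_hours_per_year(a: float) -> float:
    """Expected yearly downtime implied by an availability value."""
    return (1.0 - a) * HOURS_PER_YEAR


# Nodes and links along the example path of Figure 1.6.
example_path = [0.99996, 0.9997, 0.99999, 0.9998, 0.99997,
                0.9999, 0.99995, 0.9995, 0.99990]

a_path = path_availability(example_path)
print(round(a_path, 4))                            # 0.9987
print(downtime_hours_per_year(a_path))             # between 11 and 12 hours
print(availability(mtbf_hours=1e5, mttr_hours=2))  # assumed example values
```

Expressing availability as yearly downtime makes the numbers easier to interpret: a path availability of about 0.9987 still corresponds to roughly half a day of outage per year when no recovery mechanism is in place.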
1.2.2 Which Failures Can Occur?

The causes of a network failure are quite diverse. An initial distinction can be made between planned and unplanned outages. A planned outage is caused by operational or maintenance actions intentionally performed by the operator, for instance, to change the software version of a network element (if this change operation is disruptive) or to add or remove a network element. Because the planned outages are known in advance, preventive techniques can be quite effective against them; for example, measures can be taken at the service level, the customers can be notified in advance, the operational and maintenance procedures can be timed to cause minimal impact (e.g., during the night or a weekend), and so on. In contrast, unplanned outages are by definition difficult to predict and the operator must therefore prepare an arsenal of "defensive" measures against them. A second distinction could be made between internal and external causes, depending on whether the failure is caused by a network-internal imperfection or by some surrounding event. Examples of internal causes are design errors, defects of electronic or optical components, a battery breakdown, and so on. The failure could also have an external cause such as electricity breakdown, lightning, storm, earthquake, flood, digging accident, vandalism, or sabotage. Failures are not restricted to hardware components. Today, advanced communication technologies and services show an increasing dependence on information technology (IT) systems, and software in particular. Typical examples of software-related failures are software bugs, configuration errors (e.g., you forget to turn on an essential protocol), routing errors (e.g., some routes are missing in the routing table), and hacker attacks. Hence, the reliability of software systems also affects network integrity. Equipment vendors are devoting substantial efforts to increase the quality of software.
Nevertheless, the measurement and control of software quality (and software performance in general) is a relatively young branch of science. Because it is impossible to simulate all possible events that could occur in a communications network, software failures are very difficult to detect.
Commonly Occurring Failures

Fiber-optic cable cuts are among the most commonly occurring failures in communications networks, and the frequency of cable cuts is related to the length of the link. Of course, the vulnerability of a cable also depends on the terrain (e.g., urban area where a lot of civil works are carried out) and the preventive measures taken to reduce the vulnerability of the cable system (e.g., armored casings). Most operators have a long history of cable-cut recordings and know more or less how many cable cuts they can expect on average per year for their long-distance links. Typical MTBF values range between 50 and 200 days ([DeM03], [Wil01], [Bat02], [Jur98]) per 1000 km of cable. The time to repair the cable consists mainly of the amount of time it takes to determine and reach the cut location, because mending the cable is a relatively quick procedure. Usually, the repair team monitors the performance of the fibers before they are returned into operation. Note that most
cable cuts are due to digging activities that accidentally hit the cable, so it is often possible to quickly start with the repair. However, in some cases reaching the location can be quite time consuming (e.g., submarine cables). Depending on the location and the severity of the damage, MTTR values typically range from hours to several weeks ([DeM03], [Wil01], [Jur98]). Besides cable cuts, equipment failures also frequently occur. Unlike with cable cuts, however, it is difficult to collect realistic and accurate MTBF and MTTR values for equipment failures. Because of the rapid technological evolution, operators have little practical experience with most of the relatively novel equipment. Some typical MTBF ranges for IP, ATM, SONET/SDH, and WDM equipment are shown in Table 1.1 ([DeM03], [Wil01], [Wos01], [Lab99], [Kal96]). The MTTR for equipment failures largely depends on the urgency, which is itself dependent on the amount and priority of traffic passing through the failing equipment. Some typical MTTR values for urgent faults are also shown in Table 1.1. In practice, the time required for the mending of the equipment could be only a few hours. Yet, the repair time must be supplemented with the time spent to get the spare parts at the site of the failed equipment. The major exchanges such as backbone offices usually keep an on-site storage of spare parts (for critical equipment). Other sites rely on a central depot of equipment parts; the transportation of spare parts then delays the repair. Hence, the total repair time can be much longer, in the range of 4 hours to several days for remote sites. In addition, the repair time depends on the ability to detect the defect. Some defects generate merely QoS degradation and might be difficult to detect.
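The cable-cut MTBF range quoted earlier in this subsection (50 to 200 days per 1000 km) can be scaled to a specific link. The sketch below assumes that the cut frequency is proportional to the cable length, which the text suggests but does not state as a formula; the function names and the 500 km link are illustrative.

```python
def link_mtbf_days(length_km: float, mtbf_days_per_1000km: float) -> float:
    """MTBF of a cable of the given length, assuming cuts scale linearly with length."""
    return mtbf_days_per_1000km * 1000.0 / length_km


def expected_cuts_per_year(length_km: float, mtbf_days_per_1000km: float) -> float:
    """Average number of cable cuts per year for the link."""
    return 365.0 / link_mtbf_days(length_km, mtbf_days_per_1000km)


# A hypothetical 500 km link, evaluated at both ends of the quoted range.
for mtbf in (50.0, 200.0):
    print(mtbf, link_mtbf_days(500.0, mtbf), expected_cuts_per_year(500.0, mtbf))
```

Under this linear assumption, a 500 km link would see between roughly one and four cuts per year, which is why operators of long-distance links treat cable cuts as routine events rather than exceptions.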
More Drastic Examples

Besides the more commonly occurring failures described above, a communications network can be damaged by more severe causes, leading to major disruptions in the network availability. For example, in the United States, all network faults affecting at least 30,000 users during at least 30 minutes must be reported to the Federal Communications Commission (FCC) [McC95]. In the following sections, we
Table 1.1 Typical MTBF and MTTR Values for Communication Equipment Failures

Equipment Type       MTBF Range (hr)   Typical MTTR (hr)
Web Server           10^4 to 10^6      1
IP Interface Card    10^4 to 10^5      2
IP Router Itself     10^5 to 10^6      2
ATM Switch           10^5 to 10^6      2
SONET DXC or ADM     10^5 to 10^6      4
SDH DXC or ADM       10^5 to 10^6      4
WDM OXC or OADM      10^5 to 10^6      6
describe some major network disruptions, highlighting the extent of the damage, the measures taken by the network operator to keep critical traffic alive, and the time it took to repair the failures and to return to normal traffic conditions.
Hanshin/Awaji Earthquake

At 5:46 am on January 17, 1995, the Japanese city of Kobe was hit by the Hanshin/Awaji earthquake (recorded as 7.2 on the Richter scale) [Kal96]; 5379 people died, 34,626 were injured, and the material damage in the city was immense. The earthquake caused widespread severe physical damage to the Hanshin area, particularly Kobe City. The damage was very localized; in many cases adjacent buildings were affected differently, and sometimes half of a building was virtually undamaged while the other half was severely affected. All utilities such as electricity, gas, water, and telecommunications were completely disrupted (Figure 1.9). The total damage for the Japanese telephone company NTT was estimated at 30 billion yen (about U.S. $200 million). Eleven local telephone switches were put out of action for more than 24 hours (mainly because of lack of power), which cut off as many as 285,000 subscriber lines (about 20% of the subscriber lines in the Hanshin area). More than 60,000 local transmission lines were affected, as well as signaling systems, billing systems, and other network databases. It took several hours merely to identify which network
Figure 1.9 Ravage caused by the Hanshin earthquake (left: broken pole, right: extracted conduits). (G. Kalbe, et al., ‘‘Operator requirements,’’ European ACTS project Protection Across Network Layers [PANEL], deliverable D1, December 1996.)
infrastructure the earthquake destroyed. Long-distance communications services in Japan were not directly hindered by the disaster, because of the automatic protection mechanisms installed in NTT’s core network. As a consequence of the earthquake, calling patterns within, into, and out of the Hanshin area changed drastically. Peak hour traffic volume on January 17 was 20 times the normal volume. This resulted in severe network congestion, which NTT handled by limiting the number of nonpriority calls per minute in major telephone switches. NTT also provided 5000 emergency transmission lines to meet the critical communications needs in the region. NTT undertook intensive repair actions and managed to repair more than 50% of the damage within 2 weeks.
Erroneous Software Update in AT&T Network
In April 1998, AT&T suffered a catastrophic nationwide failure of most of its frame-relay network [Gry01]. During the outage, more than 5000 corporations were unable to complete network-based business operations. For example, retailers were unable to authorize credit-card payments and financial institutions could not complete transactions. When the outage was detected, AT&T engineers focused first on identifying and isolating the problem. They found out that the problem was caused by a computer command to upgrade software code in one of the network switch’s circuit cards. The upgrade was performed but malfunctioned; this created a faulty communication path, which generated a large volume of administrative messages to the other network switches. As a result, these switches became overloaded and stopped routing data from customers’ applications. The outage lasted from 6 to 26 hours before the network was fully restored. AT&T provided selected customers (in particular, those with critical applications) with updates every 15 or 20 minutes throughout the crisis. Although many large corporate users had backup systems in place, analysts and customers agree that an incident like this can prove devastating for companies without a contingency plan. In fact, the communications to many smaller companies were left completely dead until the outage was rectified. After 24 hours, about 96% of the affected services were reestablished.
Submarine Cable Break
On July 5, 2002, a multiple submarine cable failure affected the Asia Pacific Cable Network (APCN 2) that connects the Philippines to the Internet [Lem02]. APCN 2 is a 19,000-km underwater fiber-optic cable system that stretches from Japan to Singapore. It covers major countries in Asia including China, South Korea, Hong Kong, Japan, Malaysia, Taiwan, and the Philippines. The network has been operational since December 21, 2001. The failure caused a considerable slowdown of the company’s services but did not completely disrupt services. Because of poor weather conditions, the repair of the failure was delayed. On July 16, the network was completely repaired.
Deriving Accounted Failure Scenarios
It is practically impossible to provide measures against all possible failures in a communications network. The impact of some dramatic failures such as those caused by major earthquakes is simply too great, whereas other failures are too rare to justify the extra budget needed to cover them. Therefore, a practical strategy followed by most operators and service providers is to identify the most frequently occurring failures, to classify these failures in a limited set of failure scenarios, and to provide “healing” measures to overcome these failures in a graceful, cost-effective manner. These failures are called accounted failures in the remainder of this book, whereas no measures are provided for unaccounted failures. We elaborate on the possible preventive measures in Section 1.2.4, but for now we concentrate on the definition of the accounted failure scenarios.
When considering the physical network layer, where cable cuts and equipment failures typically represent the most common failures, most operators consider two accounted failure scenarios:
1. A single-link failure is a situation in which the link between two adjacent offices fails. As a consequence, no direct information exchange between these two offices is possible (until the fault gets repaired). This is illustrated in Figure 1.10. Note that measures to heal a link failure will automatically be able to heal smaller failures affecting only part of a link (e.g., only one direction of a bidirectional SDH link or only one failing laser of a WDM line system). Stated otherwise, the single-link failure scenario encompasses all single sublink failure scenarios.
2. A single-node failure is a situation in which a network element in an office fails (e.g., a hardware equipment failure or a software crash). As illustrated in Figure 1.11, the failing of a single node automatically puts all attached
Figure 1.10 Single-link failure scenario.
Figure 1.11 Single-node failure scenario.
links out of service. Note that the node might be out of service with the attached links still seen in operation by its neighbor nodes. Like the single-link failure scenario, a single-node failure scenario encompasses all single subnode failure scenarios (e.g., failing of only a part of the exchange office).²
² In fact, even a single-link failure scenario could be seen as a component of a single-node failure scenario. However, the distinction between a single-link and a single-node failure is important when considering recovery techniques for both failure scenarios (see Section 1.2.4).
The focus on the single-link or single-node failure scenario is based on two main assumptions:
• In most cases, the failure of a link or node in the network is statistically independent of the failure of another link or node in the network (assuming that dramatic outages affecting large parts of the network, such as earthquakes, are very unlikely).
• If the network scale is not too large, the MTTR for a single-link or single-node failure is typically much shorter than the MTBF. Hence, the probability that two (link or node) faults are overlapping in time can be neglected in comparison to the probability of a single-link or single-node fault.
The situation gets more complicated when considering a logical network layer. The importance of single-link or single-node failures remains. For instance, in an IP network layer, this corresponds to an IP link disruption and an IP router breakdown, respectively. However, the malfunctioning of the IP network layer could be caused by an unrecovered failure in a lower network layer as well (e.g., a cable cut in the physical layer), leading to multiple link failures in the IP network layer at the same time. To model such a failure scenario, the IETF [Pap02] has defined the concept of shared risk link group (SRLG), or, more generally, a shared risk group (SRG), that is, a group of resources that are affected by the same failure. In contrast to the single-link or single-node failure scenario, in which the failure of individual links or nodes is considered statistically independent, the SRG concept expresses a statistical dependence between the failures of individual links or nodes.
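The SRG idea can be illustrated with a minimal Python sketch. It assumes a hypothetical mapping from risk groups (here, fiber ducts) to the IP links routed over them; the duct and link names are invented for illustration and do not come from any real network or from the IETF documents.

```python
from typing import Dict, Set

# Hypothetical data: each shared risk group (a fiber duct) lists the
# IP links that are routed through it. All names are illustrative only.
SRG_MEMBERS: Dict[str, Set[str]] = {
    "duct-1": {"A-B", "A-C"},
    "duct-2": {"B-C"},
    "duct-3": {"A-C", "C-D"},
}


def failed_links(failed_srg: str) -> Set[str]:
    """Return all IP links that fail together when one risk group fails."""
    return set(SRG_MEMBERS.get(failed_srg, set()))


def share_risk(link_a: str, link_b: str) -> bool:
    """True if both links belong to at least one common risk group."""
    return any(link_a in members and link_b in members
               for members in SRG_MEMBERS.values())


if __name__ == "__main__":
    # A single cut of duct-1 appears as two simultaneous IP link failures.
    print(sorted(failed_links("duct-1")))
    # A-B and B-C share no duct, so their failures can be treated as independent.
    print(share_risk("A-B", "B-C"))
```

In this sketch a single lower-layer fault (the cut of one duct) shows up as several simultaneous link failures in the IP layer, which is exactly the dependence that the single-link scenario does not capture.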
1.2.3 Reliability Requirements for Various Users and Services
The reliability requirements of a communications network depend strongly on the types of users and the types of services transported through the network.
User Types
The users can typically be classified in the following categories [OSh94]:
• Safety critical users (e.g., hospitals, police, and fire department): For safety reasons, this group should have communications services at all times. Service interruptions are unacceptable, especially during fault conditions (e.g., accessibility of emergency services after an earthquake).
• Business critical users: This group suffers considerable financial losses when service interruptions occur.
• Low cost users: This group consists of residential users demanding more or less reliable communications services at a relatively low cost. Service interruptions cause discomfort but can be tolerated if they are not too frequent.
• Basic level users: The support of this group is the lowest. Service reliability is only a side issue. This implies that the service provider can support these users as long as the network is failure free; in case of a failure, bandwidth is removed from these basic level users to transport services from the first three user categories. For instance, if a failure occurs in an IP network, some critical traffic flows may be rerouted along alternative paths, which may provoke some congestion. This in turn could result in dropping the traffic generated by basic level users, using some congestion avoidance mechanisms.
Service Types
Tomorrow’s communications networks will have to carry a plethora of services, such as the classic plain old telephone service (POTS), voice over IP (VoIP), videotelephony and videoconferencing, teleworking, TV broadcast services, distance learning, movies and news on demand, Internet access, teleshopping, and many others. These services are considerably different, not only with respect to their bit rate requirements, but also with respect to delay tolerances and the need for recovery. This is indicated in Table 1.2 [Las99], where the service sensitivity for delay and the need for recovery are graded on a scale from 1 (not sensitive) through 5 (highly sensitive).
Table 1.2 Overview of Applications Services and Their Typical Requirements

Application                   Bit Rate             Bit Rate Variation   Delay Sensitivity   Need for Recovery
Plain Old Telephone Service   32–64 Kbps           Constant             5                   5
Voice Over IP                 8–32 Kbps            Constant             5                   5
Video-telephony               256–1920 Kbps        High                 5                   5
Videoconferencing             at least 256 Kbps    High                 5                   5
Teleworking                   64 Kbps to 2 Mbps    Very high            5                   4
TV broadcast                  2–8 Mbps             High                 4                   4
Distance Learning             64 Kbps to 2 Mbps    Very high            5                   5
Movies on Demand              750 Kbps to 4 Mbps   High                 4                   3
News on Demand                64 Kbps              Very high            2                   2
Internet Access               64 Kbps to 2 Mbps    Very high            1                   2
Teleshopping                  64 Kbps to 2 Mbps    Very high            2                   2

Kbps, kilobits per second; Mbps, megabits per second.
From A. Lason, et al., “Network Scenarios and Requirements,” European IST project Layers Interworking in Optical Networks (LION), deliverable D6, September 1999.
For the subject of this book, the last two columns of the table are particularly important. The “need for recovery” answers the question of whether recovery mechanisms are necessary for the applications under consideration. As can be seen from the table, many applications depend critically on the recovery capabilities of the network. To estimate the necessary speed of the recovery process, the column “delay sensitivity” is informative, because delay sensitive applications will be severely disturbed or even completely disrupted in the case of failures if the recovery time becomes large. For instance, in the case of POTS, the minimal recovery time leading to service disruption ranges from 150 ms to 2 seconds [Sos94]. To avoid POTS disruptions, a recovery time of less than 150 ms is needed.
Examples of Service-Level Agreements
From the previous discussion, we know that network reliability is in many cases crucial for the customers. These expectations are translated into contracts between an operator or service provider and its customers, via a service-level agreement (SLA). Typically, these agreements include a rebate provision if the service level is not met during a billing period. With respect to reliability, SLAs usually specify the minimal availability of the service (e.g., minimal availability of 99.99% required) and the maximum downtime that is acceptable (e.g., half an hour). The more stringent the reliability requirements are, the more expensive the service provided by the operator will be. If these commitments are not met, a financial compensation is usually agreed on (e.g., X% of the monthly charge is waived). The SLA may also specify the way customers will be notified of outages.
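To give a feeling for what an availability figure such as 99.99% means in practice, the following Python sketch converts an availability target into a yearly downtime budget and estimates the availability of a single element from its MTBF and MTTR. It assumes the standard steady-state relation availability = MTBF / (MTBF + MTTR) and a 365-day year; the function names and the sample MTBF/MTTR values are illustrative choices, not figures taken from a particular SLA.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability(mtbf_hr: float, mttr_hr: float) -> float:
    """Steady-state availability of one element from MTBF and MTTR (hours)."""
    return mtbf_hr / (mtbf_hr + mttr_hr)


def downtime_minutes_per_year(avail: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


def meets_sla(mtbf_hr: float, mttr_hr: float, required: float) -> bool:
    """True if the element alone satisfies the required availability."""
    return availability(mtbf_hr, mttr_hr) >= required


if __name__ == "__main__":
    # A 99.99% target leaves roughly 52.6 minutes of downtime per year.
    print(round(downtime_minutes_per_year(0.9999), 2))
    # Illustrative element: MTBF of 10^5 hours, MTTR of 2 hours.
    print(meets_sla(1e5, 2, 0.9999))
```

Note that this covers a single element only; an end-to-end service crosses many elements, so its availability is lower unless recovery mechanisms are in place.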
Trend of Reliability Requirements
Though dependent on the specific user and service type, the reliability of a communications network is clearly an important issue and will become even more important in the future.
• In a liberalized telecommunications sector, the satisfaction of business and residential customers of traditional and multimedia services is increasingly in the spotlight. Price, quality, and flexibility are key. Users do not appreciate an interruption in these services. Network failures discredit operators and service providers in a commercial market.
• The total amount of data to be transported is ever increasing and socioeconomic life relies increasingly on communications services. Hence, the consequences of network failures may be significant.
• Because of the introduction of optical fiber and digital or optical switching, traffic is more and more concentrated in fewer network elements (e.g., a fiber carrying thirty-two 10-Gbps wavelengths can carry about 4 million simultaneous telephone calls). This augments the vulnerability of the network.
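The figure of about 4 million calls per fiber can be checked with a short calculation. The sketch below assumes that each 10-Gbps wavelength carries a SONET OC-192 signal, that an OC-192 multiplexes 192 STS-1 signals, and that each STS-1 carries 28 DS1 signals of 24 voice channels each; this mapping is an assumption made here for the check and is not stated in the text.

```python
# Assumed SONET multiplexing hierarchy (not given in the text above).
WAVELENGTHS_PER_FIBER = 32
STS1_PER_OC192 = 192
DS1_PER_STS1 = 28
VOICE_CHANNELS_PER_DS1 = 24


def calls_per_fiber(wavelengths: int = WAVELENGTHS_PER_FIBER) -> int:
    """Number of simultaneous 64-Kbps voice channels carried by one fiber."""
    per_wavelength = STS1_PER_OC192 * DS1_PER_STS1 * VOICE_CHANNELS_PER_DS1
    return wavelengths * per_wavelength


if __name__ == "__main__":
    print(calls_per_fiber())  # 4128768, i.e., about 4 million calls
```

A single cut of such a fiber therefore affects millions of calls at once, which is the concentration effect described in the last item above.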
1.2.4 Measures to Increase Reliability
As indicated in Section 1.2.2, quite a large variety of failure types may occur in a telecommunications network. However, many services require high network availability. To reconcile these two opposing factors, several measures can be taken.
A first possibility is to prevent failures as much as possible. For instance, the likelihood of a cable cut can be reduced by putting the cable deeper in the ground or by using special armored cables, and the number of failures in exchange offices can be diminished by a fire security plan or by limited access to the building. Equipment failures can be reduced by a safer design, an extra cover, or more testing of the hardware and software before putting it into use. Quick detection of failures or dangerous situations, by a smoke detection system, an automatic sprinkler system, or a direct connection with the fire department, can increase the availability as well.
Another strategy is to duplicate vulnerable network elements. For instance, in the case of a cross-connect failure, all traffic can be switched to an identical hot standby cross-connect (most helpful in hardware failures). Also the network access link can be duplicated to ensure that users are not cut off from the network by a single failure. This dual homing principle is illustrated in Figure 1.12. When a failure occurs, the network can still be accessed via the unaffected network access link.
Although the aforementioned measures alleviate the problem to some extent, in many cases they turn out to be insufficient to meet the network availability levels required by the customers. Moreover, they can be quite expensive and do not allow
Figure 1.12 Principle of dual homing.
easy differentiation between critical traffic, which requires extremely high availability, and less important traffic from low cost or basic level users. To circumvent these drawbacks, most modern communications networks use so-called network recovery or resilience schemes. As soon as a failure in the network is detected, these mechanisms automatically divert the traffic stream affected by the failure to another (fault-free) path in the network. This way, the traffic eventually reaches its destination. These schemes can greatly enhance the availability of the services transported through the network. In contrast to the aforementioned measures, recovery schemes operate on a network scale level, not on an individual network element level.
The basic principle of a recovery scheme is illustrated in Figure 1.13. Under normal (i.e., fault-free) conditions, the traffic is transported along the working or primary path. If a failure is detected along that path, the recovery scheme is activated. A part of the working path (or the whole path, depending on the recovery technique), the recovered segment, will be bypassed by a recovery or alternative path. Traffic that was flowing along the failed network element will be redirected in the recovery head-end (RHE) toward the backup path (the switch-over operation). After passing the recovery tail-end (RTE), the traffic is again transported along the working path toward the destination. In most cases diverse routing is applied, that is, the recovery path is usually resource disjoint (e.g., link and/or node disjoint) from the working path, to ensure that a single failure will not affect both the working and the recovery path [Sha03].
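The diverse routing principle can be sketched in a few lines of Python. The example below is one simple way, chosen here for illustration, to obtain a link-disjoint recovery path: remove the links of the working path from the topology and run a breadth-first search on what remains. The six-node topology and the function names are hypothetical, and this remove-and-search heuristic is not guaranteed to find a disjoint pair in every topology where one exists.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets

# Hypothetical topology: working route A-B-C and a detour via D-E-F.
GRAPH: Graph = {
    "A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "F"},
    "D": {"A", "E"}, "E": {"D", "F"}, "F": {"E", "C"},
}


def shortest_path(graph: Graph, src: str, dst: str,
                  banned: Set[FrozenSet[str]] = frozenset()) -> Optional[List[str]]:
    """Breadth-first search for a fewest-hop path that avoids banned links."""
    parents: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path: List[str] = []
            cur: Optional[str] = node
            while cur is not None:
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        for nxt in sorted(graph[node]):
            if nxt in parents or frozenset((node, nxt)) in banned:
                continue
            parents[nxt] = node
            queue.append(nxt)
    return None


def link_disjoint_recovery_path(graph: Graph, working: List[str]) -> Optional[List[str]]:
    """Recovery path between the end points that shares no link with `working`."""
    banned = {frozenset(pair) for pair in zip(working, working[1:])}
    return shortest_path(graph, working[0], working[-1], banned)


if __name__ == "__main__":
    working = shortest_path(GRAPH, "A", "C")
    print(working)                                      # working path
    print(link_disjoint_recovery_path(GRAPH, working))  # recovery path
```

In this example the whole working path is the recovered segment, so node A plays the role of the RHE and node C that of the RTE.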
Figure 1.13 Basic principle of recovery scheme. (Labels in the figure: RHE, RTE, working path, recovered segment, recovery path.)
Note that a recovery mechanism imposes some extra requirements on the communications network. For any failure it wants to recover from, there must be an alternative route in the network (topology requirement) to serve as a recovery path. This implies that a so-called single point of failure must be avoided in the network; it should be designed so that one single failure cannot disconnect a part of the network from the rest. If an equivalent QoS must be offered to the rerouted flows after the link or node failure, then other requirements on the recovery path must be satisfied as well:
• There should be enough available bandwidth along the recovery path (capacity requirement), so a recovery scheme will typically require some additional capacity, the backup or spare capacity.
• A considerable rise of the propagation delay from source to destination should be avoided.
Note also that a recovery scheme usually forms a component of a particular network technology; hence, the mechanism is active only in the network layer corresponding to that technology. The most important recovery techniques in SONET/SDH, OTN, IP, and MPLS are highlighted in Chapters 2, 3, 4, and 5, respectively.
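The topology requirement mentioned above (no single point of failure) can be checked mechanically. The following Python sketch removes each link in turn and tests whether the network stays connected; any link whose removal disconnects the network is reported. The two small topologies and the function names are hypothetical, and only link failures are examined here, although node failures could be tested in the same manner.

```python
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets
Link = Tuple[str, str]

# Hypothetical topologies: a four-node ring, and the same ring with a spur to E.
RING: Graph = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
RING_WITH_SPUR: Graph = {
    "A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"A", "C", "E"}, "E": {"D"},
}


def is_connected(graph: Graph, removed: Optional[Link] = None) -> bool:
    """Depth-first search; optionally ignore one (failed) link."""
    nodes = list(graph)
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if removed is not None and {node, nxt} == set(removed):
                continue
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)


def single_points_of_failure(graph: Graph) -> List[Link]:
    """Links whose individual failure splits the network."""
    links = {tuple(sorted((a, b))) for a in graph for b in graph[a]}
    return sorted(link for link in links if not is_connected(graph, link))


if __name__ == "__main__":
    print(single_points_of_failure(RING))            # no critical link
    print(single_points_of_failure(RING_WITH_SPUR))  # the spur link is critical
```

In the second topology no recovery scheme can restore traffic to node E after a failure of link D-E, because no alternative route exists.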
1.3 Different Phases in a Recovery Process
Although a wide variety of recovery schemes exists (see Section 1.5), they all show a rather similar succession of phases, that is, the recovery cycle.
1.3.1 Recovery Cycle
The different phases of this cycle are shown in Figure 1.14 [Sha03]. If a failure in the network occurs, it could take some time before a node adjacent to the failure detects the fault. This time may depend, for instance, on the frequency of signals sent, on the speed of fault detection in a lower network layer and notification toward upper layers, on the time it takes for the node to gather all abnormal information from various signals, correlate this information, and derive the exact fault state from that (diagnosis), and so on.
Once the fault is detected, the node that detected the fault may (or may not) wait some time before it starts sending notification messages toward the other nodes in the network. For instance, this hold-off time could allow a lower layer recovery scheme to repair the fault. For example, in an IP network supported by an optical transport network, a cable cut could be quickly repaired by an optical recovery mechanism so the IP link becomes operational again shortly after the moment of failure. If the fault still exists after the hold-off time, fault notification messages are sent throughout the network to inform the other nodes that will be involved in the recovery action.
Note that the hold-off timer may be a static or a dynamic value. In the latter case, the timer is a function of the number of failures within a certain period; the more failures have been detected recently, the longer the hold-off time. This technique is called dampening and helps stabilize the network in case of a flapping resource (i.e., a resource alternating quickly between the operational state and the fault state). A typical example of dampening is discussed in Chapter 4.
The time between the first and the last recovery action is called the recovery operation time (not to be confused with the overall recovery time; see Figure 1.14).
Figure 1.14 Recovery cycle. The time axis runs from the operational state through the failure and the detection of the fault back to the operational state; the recovery time spans, in order, the fault detection time, hold-off time, fault notification time, recovery operation time, and traffic recovery time. (V. Sharma, F. Hellstrand, “Framework for MPLS-based recovery,” Internet draft, work in progress, RFC 3469, February 2003. Available at www.ietf. Accessed May 2004.)
This time span could include the exchange of messages between the different nodes involved in the recovery action to coordinate the operation. After the last recovery action, the traffic starts using the recovery path. However, it could still take some time before the traffic is completely recovered. This traffic recovery time may depend on the propagation delay along the recovery path, the location of the fault, and the recovery scheme used.
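The phases of the recovery cycle can be added up to obtain the overall recovery time, as the following Python sketch shows. The class and field names are illustrative, the sample durations are invented, and the phases are assumed to follow one another without overlap, as suggested by Figure 1.14.

```python
from dataclasses import dataclass


@dataclass
class RecoveryCycle:
    """Durations of the consecutive recovery phases, in milliseconds."""
    fault_detection_ms: float
    hold_off_ms: float
    fault_notification_ms: float
    recovery_operation_ms: float
    traffic_recovery_ms: float

    def recovery_time_ms(self) -> float:
        """Overall recovery time: sum of all phases (assumed sequential)."""
        return (self.fault_detection_ms + self.hold_off_ms
                + self.fault_notification_ms + self.recovery_operation_ms
                + self.traffic_recovery_ms)


if __name__ == "__main__":
    # Invented sample values; no hold-off time is configured here.
    cycle = RecoveryCycle(10, 0, 20, 15, 5)
    print(cycle.recovery_time_ms())
    # Compare with the 150-ms bound quoted for POTS in Section 1.2.3.
    print(cycle.recovery_time_ms() < 150)
```

The sketch makes clear that a generous hold-off time or a slow notification phase directly lengthens the overall recovery time, even when the recovery operation itself is fast.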
1.3.2 Reversion Cycle
After the recovery cycle, the network is again fully operational. However, the new routes of the traffic along the recovery paths may be less ideal than before the failure (e.g., recovery path longer than the original path and more congestion along recovery path). Therefore, a dynamic rerouting protocol may be initiated to optimize the usage of network resources in the new situation. Another possibility is to wait for the repair of the fault that has occurred and to redirect the traffic from the recovery path back to the working path once the fault is completely repaired (in this case, the recovery technique is called revertive). This switch-back operation also follows a succession of general phases, the so-called reversion cycle.
The different phases of this cycle are shown in Figure 1.15. The reversion cycle bears a strong resemblance to the recovery cycle, described earlier. Once the fault is repaired, it could take some time (e.g., dependent on lower layer protocols) before this repair is detected: the fault clearing time. After that, the protocol may decide to wait for a certain time before starting the notification of the repaired fault. This hold-off time may be needed to ensure that the path is stable. Indeed, in the case of an intermittent fault, a quick reaction of the reversion process may lead to unstable network conditions. As in the previous case, note that this timer may be a static or a dynamic value (dampening).
Figure 1.15 Reversion cycle. The time axis runs from the repair of the fault through the clearing of the fault; the cycle spans, in order, the fault clearing time, hold-off time, fault repaired notification time, reversion operation time, and traffic reversion time. (V. Sharma, F. Hellstrand, “Framework for MPLS-based recovery,” Internet draft, work in progress, RFC 3469, February 2003. Available at www.ietf. Accessed May 2004.)
After that, in a similar way as in the recovery cycle, the repaired fault is notified throughout the network and the actual reversion operation is carried out; the traffic is again switched from the recovery path to the working path. Finally, it may take some time before the traffic begins flowing on the working path again (traffic reversion time). In contrast to the recovery cycle—which is typically reacting to an unforeseen event, the failure—the reversion cycle can be planned well in advance. In a reversion, there is no need for a hasty operation; a well-controlled switch-back mechanism with minimal disruption is typically preferred.
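The dynamic hold-off timer (dampening) used in both the recovery and the reversion cycle can be sketched as follows. The text only states that the timer grows with the number of recent failures; the doubling rule, the observation window, and the upper bound used below are assumptions made for this illustration, and all parameter names are hypothetical.

```python
from typing import Iterable


def hold_off_ms(failure_times_s: Iterable[float], now_s: float,
                base_ms: float = 100.0, window_s: float = 60.0,
                max_ms: float = 10_000.0) -> float:
    """Hold-off time that doubles for every extra failure seen in the window.

    Assumed rule (not from the text): base * 2^(recent failures - 1), capped.
    """
    recent = sum(1 for t in failure_times_s if 0 <= now_s - t <= window_s)
    return min(max_ms, base_ms * 2 ** max(0, recent - 1))


if __name__ == "__main__":
    print(hold_off_ms([0.0], now_s=1.0))               # stable resource
    print(hold_off_ms([0.0, 10.0, 20.0], now_s=25.0))  # flapping resource
    print(hold_off_ms([0.0, 100.0], now_s=101.0))      # old failure has aged out
```

With such a rule, a resource that flaps several times within the window is kept on hold for longer before a switch-over or switch-back is triggered, whereas an isolated failure is handled with the short base timer.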
1.4 Performance of Recovery Mechanisms: Criteria
As is pointed out in Section 1.5, a wide variety of recovery mechanisms exist, depending on the facilities of the network technology, the priorities and desires of the typical users of the network, and so on. Every recovery mechanism has its strengths and weaknesses. This section elaborates on the criteria that represent the performance components of the recovery scheme. This overview of criteria allows us to weigh the performance and the cost of a recovery mechanism, to assess the pros and cons of any recovery mechanism, and to make a judicious comparison between recovery mechanisms [Sha03], [Owe02].
1.4.1 Scope of Failure Coverage
Recovery schemes may offer various types of failure coverage. The scope of failure coverage may be defined by several metrics, which are described in the following paragraphs.
Failure Scenarios
The recovery mechanism may be designed to cover a particular failure scenario, such as a single-link failure (e.g., one fiber in an optical network or one OC-192 (see Chapter 2) line system in a SONET network), a single-node failure (e.g., an optical node or an IP/MPLS node), or a single-link or single-node failure (i.e., single failure, either a link failure or a node failure). To ensure very high availability, you may design the recovery mechanism to cover a number of concurrent faults, for example, a double link failure or an SRLG failure.
Percentage of Coverage
The recovery mechanism may completely or partially cover the failure scenario. For example, the recovery mechanism may be able to recover only a percentage of the traffic volume affected by the failure (e.g., if a certain percentage of the traffic is high priority). Another example is the percentage of coverage of node failures. In contrast to link failures, 100% coverage of a node failure might be possible (if the
Figure 1.16 Incomplete coverage of a node failure. (Legend: working paths, recovery path.)
node is a transit node) or not (in the case of an edge node). Traffic coming from or terminated in the failing node cannot be recovered (at least not by a recovery mechanism operating in this network layer only; see Chapter 6 for a more thorough discussion of this topic). As illustrated in Figure 1.16, only traffic passing through the failing node can be recovered.
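The percentage of coverage for a node failure can be estimated with a small calculation. The Python sketch below takes a list of traffic demands, each with its working path and volume, and computes the fraction of the affected volume for which the failed node is only a transit node. The demand list and the names are hypothetical, and the result is an upper bound: it assumes that a recovery path with sufficient capacity exists for every transit demand.

```python
from typing import List, NamedTuple


class Demand(NamedTuple):
    path: List[str]  # working path, from source to destination
    volume: float    # traffic volume in arbitrary units


def node_failure_coverage(demands: List[Demand], failed_node: str) -> float:
    """Fraction of the affected volume that merely transits the failed node."""
    affected = [d for d in demands if failed_node in d.path]
    total = sum(d.volume for d in affected)
    if total == 0:
        return 1.0  # no traffic is affected at all
    recoverable = sum(d.volume for d in affected
                      if failed_node not in (d.path[0], d.path[-1]))
    return recoverable / total


if __name__ == "__main__":
    demands = [
        Demand(["A", "B", "C"], 10),  # transits B: recoverable
        Demand(["B", "C"], 5),        # originates in B: lost
        Demand(["D", "B", "E"], 5),   # transits B: recoverable
        Demand(["A", "D"], 7),        # not affected by a failure of B
    ]
    print(node_failure_coverage(demands, "B"))  # 0.75
```

Here a quarter of the affected traffic originates in the failing node and cannot be recovered within this network layer.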
1.4.2 Recovery Time
The recovery time is the time between a network failure and the point at which a recovery path is installed and the traffic starts flowing through it. The recovery time usually forms an important criterion for a recovery mechanism: Typically, the smaller the recovery time, the less the services are harmed by the network failure. Note: This does not imply that after the recovery time, the traffic will experience the same network conditions as before the failure occurred. For instance, the recovery path could have more limited resources or a worse signal quality than the working path.
1.4.3 Backup Capacity Requirements
Different recovery schemes may need quite different amounts of backup capacity to recover from the same failure scenarios. The capacity requirements of the recovery scheme may depend on variables such as the algorithm selecting the recovery paths, the traffic characteristics, or the layer in which the recovery mechanism operates.
1.4.4 Guaranteed Bandwidth
Some recovery mechanisms inherently guarantee that the full bandwidth of the affected traffic will be rerouted along the recovery paths. Other recovery mechanisms do not provide any bandwidth guarantee, and depending on the situation, there may or may not be enough backup capacity to reroute all affected traffic.
1.4.5 Reordering and Duplication
Even though a switch-back operation from recovery path to working path may seem beneficial at first glance, it can have some awkward complications with respect to the order in which traffic is delivered. For example, in packet switched networks (e.g., IP or IP/MPLS networks), the switch back may result in a reordering of the packets at the destination. Indeed, if the delay of the packets along both paths is different, the switch-back operation may cause some packets to overtake others. Similar situations could also occur at the switch-over operation. For example, in one-to-one protection (see Section 1.5.3), a switch-over operation may cause a temporary duplication of traffic. Such reordering or duplication may have a significant impact on the complexity and cost of the destination node, because (depending on the application) the information stream should usually be reordered to respect the original order.
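The reordering effect of a switch-back can be reproduced with a toy simulation. In the Python sketch below, packets are sent at regular intervals; packets sent before the switch-back travel over the recovery path and later packets over the working path. The path delays, the sending interval, and the function name are invented for illustration.

```python
from typing import List


def arrival_order(n_packets: int, switch_at: int,
                  delay_before_ms: float, delay_after_ms: float,
                  interval_ms: float = 1.0) -> List[int]:
    """Sequence numbers in order of arrival at the destination.

    Packets with sequence number < switch_at use the path with delay
    `delay_before_ms`; all later packets use the path with `delay_after_ms`.
    """
    arrivals = []
    for seq in range(n_packets):
        sent_at = seq * interval_ms
        delay = delay_before_ms if seq < switch_at else delay_after_ms
        arrivals.append((sent_at + delay, seq))
    arrivals.sort()
    return [seq for _, seq in arrivals]


if __name__ == "__main__":
    # Recovery path: 10 ms delay; working path: 4 ms delay; switch back at packet 3.
    print(arrival_order(6, switch_at=3, delay_before_ms=10, delay_after_ms=4))
    # Equal delays on both paths: the original order is preserved.
    print(arrival_order(6, switch_at=3, delay_before_ms=4, delay_after_ms=4))
```

In the first case the packets sent after the switch-back overtake those still in flight on the longer recovery path, so the destination has to buffer and reorder them.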
1.4.6 Additive Latency and Jitter
Recovery schemes may introduce additional latency to traffic. For example, a recovery path may be significantly longer than the working path. This may be dependent on the recovery path selection algorithms. For some services, it is also important to minimize the jitter, that is, the fluctuations on the delay for data from the same traffic flow.
1.4.7 State Overhead
As the number of recovery paths in a recovery plan grows, the state (i.e., the information stored in the individual network elements) required to maintain them also grows. The exact required state may depend not only on the number of recovery paths (the state overhead is usually proportional to the number of recovery paths), but also on the particular state needs of the recovery mechanism.
1.4.8 Scalability
As the network grows (i.e., more links and nodes) and the amount of traffic to be transported by the network increases, the performance of the recovery mechanism may change considerably. For instance, for some schemes the state overhead may increase very fast with growing network or traffic size, whereas other recovery schemes need only a modest state overhead increase. In addition, other performance factors such as the recovery time or the required backup capacity may be highly influenced by the network and traffic size. A recovery mechanism is said to be scalable if the performance does not depend too much on the size of the network and the traffic to be transported. Scalability is an important characteristic for a recovery scheme to be “future proof.”
1.4.9 Signaling Requirements
The operation of a recovery scheme might require a significant number of signaling messages between the network nodes. For instance, the fault detection may depend on (the absence of) messages; the fault notification is based on messages; and signaling can also play a crucial role in the recovery operation itself. Because some recovery schemes require many more signaling messages than others, the resources (in terms of bandwidth, central processing unit [CPU] usage, etc.) used by signaling form another criterion to judge the performance of a recovery scheme.
1.4.10 Stability
When designing a recovery mechanism, you typically will find a number of timing parameters (e.g., time between two consecutive messages, hold-off times, etc.) that can be tuned more or less freely within a certain range. Although small values for these timers usually speed up the recovery, they may have a detrimental impact on the network stability. For instance, in the case of a flapping link, small hold-off timers for reversion may lead to a never-ending alternation between switch-over and switch-back, causing significant traffic disruption.
1.4.11 Notion of Recovery Class
Some recovery schemes make it possible to distinguish between different classes of traffic and to take appropriate recovery actions for each individual QoS class [Aut02]. This may be a useful feature, because different traffic classes typically impose different recovery requirements. For instance, one traffic class may need a very fast recovery scheme with bandwidth guarantee, whereas for another class a slow recovery mechanism at a low cost may be sufficient.
1.5 Characteristics of Single-Layer Recovery Mechanisms
Depending on the particular application(s) a recovery mechanism is aimed at, some evaluation criteria described earlier can be far more important than others. Moreover, the specific network technology typically imposes some constraints on the implementation feasibility of a recovery scheme. Hence, a wide range of recovery mechanisms exists in today's networks. In what follows, the essential choices when designing a recovery mechanism are enumerated and the pros and cons of each option are elucidated.
1.5.1 Backup Capacity: Dedicated versus Shared
With respect to the allocation of backup capacity, two major options exist: dedicated or shared backup capacity. In the case of dedicated backup capacity, a particular backup resource corresponds to one particular working path. In other words, there is a one-to-one relationship between the backup resources and the working paths. A backup resource can be used only by a particular working path. This concept is illustrated in Figure 1.17: The recovery paths A-D-E-F-C and G-D-E-F-I do not share any bandwidth on the common part D-E-F (each uses its own channel), even though the working paths A-B-C and G-H-I have no resource in common and hence are unlikely to fail simultaneously.

Figure 1.17 Dedicated backup capacity.

The other possibility is to share a backup resource between several working paths. If a failure occurs along one of these working paths, the backup resource is used to recover from this failure. If at another time a failure occurs along another one of these working paths, the same backup resource will be used to recover from this failure. Stated otherwise, there is a one-to-many relationship between the backup resources and the working paths. This concept is illustrated in Figure 1.18: The recovery paths A-D-E-F-C and G-D-E-F-I are now sharing the bandwidth (one common channel) on the common part D-E-F. This is permitted because the probability of having a simultaneous malfunction on the working paths A-B-C and G-H-I is low, so usually at most one recovery path will be exploiting the resources on D-E-F at a time.

Figure 1.18 Shared backup capacity.

The second option clearly is more complex than the first one; after a failure along a working path, you must be sure that the corresponding backup resources are still available for the recovery (i.e., not used for the recovery of another working path). On the other hand, the backup capacity can be used much more efficiently in the case of shared backup capacity, because of its flexible character: the purpose of the backup resources is adapted to the failure that occurs.
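The capacity difference between the two options can be made concrete with a small sketch that counts the channels to reserve per link for the recovery paths of Figures 1.17 and 1.18. The helper names `path_links` and `backup_capacity`, and the assumption that every recovery path needs exactly one channel, are illustrative choices rather than part of the original text.

```python
from collections import defaultdict


def path_links(path):
    """Return the undirected links traversed by a node sequence."""
    return [frozenset(pair) for pair in zip(path, path[1:])]


def backup_capacity(recovery_paths, shared):
    """Channels to reserve per link for the given recovery paths.

    recovery_paths: node sequences, each assumed to need one channel.
    shared: if True, recovery paths reuse a single channel on a common link
            (assumes their working paths do not fail simultaneously).
    """
    demand = defaultdict(int)
    for path in recovery_paths:
        for link in path_links(path):
            demand[link] += 1
    if shared:
        return {link: 1 for link in demand}
    return dict(demand)


# Recovery paths from the figures: A-D-E-F-C and G-D-E-F-I.
recovery = [list("ADEFC"), list("GDEFI")]
dedicated = backup_capacity(recovery, shared=False)
shared = backup_capacity(recovery, shared=True)
print(sum(dedicated.values()), sum(shared.values()))
```

In this toy case, the dedicated option reserves two channels on each of D-E and E-F, whereas the shared option reserves one, so the total reserved backup capacity drops from eight channels to six.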
1.5.2 Recovery Paths: Preplanned versus Dynamic
Another choice depends on the moment the path for the recovery flow is chosen. In the preplanned option, for all accounted failure scenarios the path of the recovery flow is calculated in advance (i.e., before any failure occurs). In the dynamic option, recovery paths are not planned; their path is computed "on the fly" once the failure is detected, for instance, by the RHE or RTE node. If a failure occurs, the recovery mechanism starts searching dynamically for possible recovery paths throughout the network.
The preplanned option advantageously allows a fast recovery if a failure occurs, whereas a dynamic recovery mechanism may take additional time to identify suitable recovery paths. On the other hand, the preplanned option lacks flexibility for unaccounted (i.e., noncovered) failure scenarios. A dynamic recovery mechanism is able to search for recovery paths for unaccounted failures as well, although there is no guarantee that it will find such a recovery path. Because of its flexible nature, a dynamic recovery mechanism will typically lead to a situation of shared backup capacity. In a preplanned recovery mechanism, the nature of the backup capacity can be either dedicated or shared.
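The contrast between the two options can be sketched as a lookup table versus a search at failure time. The topology, the working path A-B-C, and the names `shortest_path`, `preplanned`, and `recover` are hypothetical; a fewest-hop breadth-first search stands in for whatever path computation a real mechanism would use.

```python
from collections import deque


def shortest_path(adj, src, dst, failed=frozenset()):
    """Breadth-first search for a fewest-hop path that avoids failed links."""
    queue, parent = deque([src]), {src: None}
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent and frozenset((node, nxt)) not in failed:
                parent[nxt] = node
                queue.append(nxt)
    return None  # no recovery path exists


# Hypothetical topology and working path.
adj = {
    "A": ["B", "D"], "B": ["A", "C", "E"], "C": ["B", "F"],
    "D": ["A", "E"], "E": ["D", "B", "F"], "F": ["E", "C"],
}
working = ["A", "B", "C"]

# Preplanned: one recovery path per accounted scenario (here, every
# single-link failure on the working path), computed before any failure.
preplanned = {}
for u, v in zip(working, working[1:]):
    scenario = frozenset([frozenset((u, v))])
    preplanned[scenario] = shortest_path(adj, "A", "C", failed=scenario)


def recover(failed_links):
    """Use the preplanned path if the scenario was accounted for, else search."""
    scenario = frozenset(failed_links)
    if scenario in preplanned:
        return "preplanned", preplanned[scenario]
    return "dynamic", shortest_path(adj, "A", "C", failed=scenario)


single = [frozenset(("B", "C"))]
double = [frozenset(("A", "B")), frozenset(("B", "C"))]
print(recover(single))
print(recover(double))
```

The single-link failure is answered from the table without any search, whereas the unaccounted double failure falls through to the dynamic search, which may or may not find a path.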
1.5.3 Protection versus Restoration
A quite important distinction typically made when considering recovery mechanisms is between protection and restoration [Man2]. Both options require signaling, but the subtle difference lies in the timing of the signaling actions. In the case of protection, the recovery paths are preplanned and fully signaled before a failure occurs. Hence, when a failure occurs, no additional signaling is needed to establish the protection path (see footnote 3). In the case of restoration, the recovery paths can be either preplanned or dynamically allocated, but when a failure occurs additional signaling will be needed to establish the restoration path. A major advantage of protection compared to restoration is typically its fast recovery time. Indeed, the additional signaling after a failure occurrence in the case of restoration may consume quite some (precious) time. On the other hand, restoration techniques can be more flexible with regard to the failure scenarios they can recover from and require in many cases less backup capacity because of their shared nature.

Footnote 3: It must be noted that this does not exclude all signaling after a failure. Various other kinds of signaling may take place between RHE and RTE, for fault notification, to synchronize their use of the protecting path, for reversion, and so on [Man2].

Protection Variants
Between the different recovery mechanisms that are classified as protection schemes, a further distinction can be made, depending on the number of recovery entities that are protecting a given number of working entities [Man2].

1+1 Protection (Dedicated Protection)
One dedicated protection path protects exactly one working segment and the normal traffic is permanently duplicated at the RHE on both the recovery path and the working path. At the RTE, the signal with the highest quality is selected and sent to the destination. Another method consists of selecting the working path unless a signal defect is detected. Then, the RTE starts to select the traffic from the recovery path. Note that this protection strategy is very efficient in terms of recovery time but quite expensive in terms of bandwidth usage.

1:1 Protection (Dedicated Protection with Extra Traffic)
One dedicated protection path protects exactly one working segment, but in failure-free conditions the traffic is transmitted over only one path at a time. This leaves the opportunity to transport extra traffic along the protection path in failure-free conditions. As soon as a fault along the working segment is detected, the extra traffic is preempted from the recovery path and the traffic affected by the failure is switched to the protection path.

1:N Protection (Shared Recovery with Extra Traffic)
A specific recovery entity is dedicated to the protection of up to N (explicitly identified) working entities. In failure-free conditions, the recovery entity can be used for extra traffic.

M:N Protection (M ≤ N)
A set of M specific recovery entities protects a set of up to N specific working entities. The two sets are explicitly identified. Extra traffic can be transported over the M recovery entities when available.
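The behavior of 1:N protection with extra traffic can be sketched as a small state machine. The class name `OneForNProtection`, its methods, and the first-come-first-served rule without priorities are assumptions made for illustration; real schemes may arbitrate between competing failures differently.

```python
class OneForNProtection:
    """1:N protection: one recovery entity shared by N working entities."""

    def __init__(self, working_ids):
        self.working_ids = set(working_ids)
        # None means the recovery entity is free and may carry extra traffic.
        self.in_use_by = None

    def carries_extra_traffic(self):
        return self.in_use_by is None

    def fail(self, working_id):
        """Try to protect a failed working entity; return True on success."""
        if working_id not in self.working_ids:
            raise KeyError(working_id)
        if self.in_use_by is None:
            self.in_use_by = working_id   # extra traffic is preempted here
            return True
        # Already protecting an entity: succeeds only if it is the same one.
        return self.in_use_by == working_id

    def repair(self, working_id):
        """Release the recovery entity once the working entity is repaired."""
        if self.in_use_by == working_id:
            self.in_use_by = None


group = OneForNProtection(["w1", "w2", "w3"])
first = group.fail("w1")    # protected, extra traffic preempted
second = group.fail("w2")   # recovery entity is busy
print(first, second, group.carries_extra_traffic())
```

The second failure is not protected while the first one persists, which is the price paid for sharing one recovery entity among N working entities.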
1.5.4 Global versus Local Recovery
To bypass the failed network facilities, recovery schemes change the route of affected traffic. We define the recovery extent as the portion of the working path that may be manipulated by the recovery scheme, that is, the recovered segment. In local recovery, only the affected network elements are bypassed. In other words, the RHE and RTE are chosen as close to the failed network element as possible (Figure 1.19). If a single link fails, a (link-disjoint) recovery path is set up between the nodes adjacent to the failure. If a single node fails, all links incident to the failing node cannot be used anymore. Hence, local recovery establishes a (node-disjoint; see footnote 4) recovery path between every two "neighbor" nodes of the failing node.

Figure 1.19 Local recovery for single-link (left) and single-node (right) failure.

The other extreme is global recovery, in which the complete working path between source and destination is bypassed by a recovery path. In other words, the RHE and RTE will coincide with the source and destination of the working path, respectively (Figure 1.20). In a preplanned recovery mechanism, the global recovery path should be disjoint from the working path, which might impose some additional constraints to compute both paths. For instance, if the mechanism aims only at recovering from single-link failures, a recovery path that is link-disjoint from the working path will be sufficient. If, on the other hand, the mechanism wants to recover from single-node failures (or both single-link failures and single-node failures), the recovery path must be node-disjoint (see footnote 5) from the working path.

Figure 1.20 Global recovery for single-link (left) and single-node (right) failure.

Footnotes 4 and 5: Except for the RHE and RTE, of course.

When comparing local and global recovery, several pros and cons arise:

• In local recovery, the RHE and RTE are closer to the failure. Hence, these nodes will typically detect the fault rather quickly, leading to a smaller recovery time than for global recovery. In other words, local recovery is usually much faster than global recovery, an important advantage in time-sensitive applications.
• An obvious drawback of local recovery is apparent from Figure 1.19: The resulting route followed by the traffic after recovery is often longer than needed. The main reason for this suboptimal result is that local recovery does not consider other parts of the working path than the recovered segment. Hence, the same traffic may cross a particular link twice. This phenomenon is called back hauling. Because of its network-wide optimizing nature, global recovery will in many cases require less backup capacity than local recovery, for identical failure scenarios.
• Local and global recovery are also slightly different with respect to the failure coverage. For instance, if two successive nodes along a working path fail, global recovery could still resolve the problem, whereas local recovery will fail.
• The number of recovery paths needed in the complete network can be largely different when comparing local and global recovery, resulting in different state overhead requirements. In many cases, global recovery may generate more state overhead.
Of course, local and global recovery represent only the two extremes of a whole range of intermediate possibilities, where the recovery extent is longer than in the local option, but shorter than the complete working path (global option). For instance, in G-MPLS networks these intermediate options are denoted as "segment recovery," whereas the term subnetwork connection protection is applied for SONET/SDH and OTN networks.
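The difference between the two extremes, including back hauling, can be reproduced on a toy topology. The six-link network, the working path S-A-B-T, and the helpers `bfs_path`, `local_route`, `global_route`, and `repeated_links` are all hypothetical; fewest-hop search is assumed as the path computation.

```python
from collections import deque


def bfs_path(links, src, dst, banned=()):
    """Fewest-hop path from src to dst that avoids the banned links."""
    banned = {frozenset(link) for link in banned}
    adj = {}
    for u, v in links:
        if frozenset((u, v)) not in banned:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    parent, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


links = [("S", "A"), ("A", "B"), ("B", "T"), ("S", "C"), ("C", "B"), ("C", "T")]
working = ["S", "A", "B", "T"]


def local_route(failed):
    """Local recovery: RHE and RTE are the nodes adjacent to the failed link."""
    rhe, rte = failed
    i, j = working.index(rhe), working.index(rte)
    detour = bfs_path(links, rhe, rte, banned=[failed])
    return working[:i] + detour + working[j + 1:]


def global_route():
    """Global recovery: a path link-disjoint from the whole working path."""
    working_links = list(zip(working, working[1:]))
    return bfs_path(links, working[0], working[-1], banned=working_links)


def repeated_links(route):
    """Links crossed more than once by a route (back hauling)."""
    seen, repeated = set(), set()
    for link in map(frozenset, zip(route, route[1:])):
        (repeated if link in seen else seen).add(link)
    return repeated


local = local_route(("A", "B"))
glob = global_route()
print(local, glob, repeated_links(local))
```

After the failure of link A-B, the locally recovered traffic travels S-A-S-C-B-T and crosses link S-A twice, whereas the global recovery path S-C-T is shorter and free of back hauling.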
1.5.5 Control of Recovery Mechanisms
Another attribute pertains to which entity is in control of the recovery process. Centralized recovery mechanisms depend on a central controller to determine which recovery actions to take. The central controller has a global view of the network status. This controller determines where and when a fault has occurred, gathers network-wide state information, and issues (switching) commands to reconfigure all of the network elements involved in the recovery process. Network management systems based on the Telecommunications Management Network (TMN) [M30000] form a typical example of centralized operation systems.

Decentralized or distributed recovery mechanisms operate without the intervention of a central control system. In this case, the network elements feature intelligent control systems, which autonomously initiate and steer the recovery actions. In other words, the control is distributed over the network elements involved in the recovery process. In contrast to centralized mechanisms, these distributed control systems do not have a global but only a local view of the network status. They may have to exchange messages to provide each other with sufficient information and coordinate their recovery actions. As such, multiple network elements work in parallel to put disrupted traffic on an alternative route. A typical example of a distributed system is the control plane in IP and G-MPLS networks.

Note that the recovery path computation can be decoupled from the recovery action itself. Hence, a recovery mechanism can be a combination of both centralized and distributed aspects. As an example, in IP, once the link or node failure is detected, the recovery path is computed on the fly by every IP node in the network (recomputing their routing table) as soon as they are informed of the failure. Each node then reroutes the traffic whose destination can be reached via a new path. In the case of MPLS traffic engineering, the recovery paths can be computed by a central system (also called path computation server [PCS]) or can be computed by the nodes themselves before the failure. Once the failure is detected, the decision of recovery is taken by the node detecting the failure, not by the central system (see Chapter 5 for more details).

Both control systems have their strengths and weaknesses. Some examples are as follows:

• In principle, centralized mechanisms are simpler. The interaction between the individual network elements in a distributed system, in order to give each network element a good and up-to-date view on the network status, can be quite complex.
• Because of the complexity of this interaction, centralized systems tend to have a better global view of the network, whereas the view of distributed systems is typically more local.
• Because of their global view of the whole network topology and complete resources, centralized systems are generally more efficient in terms of required capacity. Also, more complicated algorithms are usually easier to implement on a central PCS than in individual optical/SDH/IP-MPLS nodes.
• Distributed systems are more scalable, because of the parallel processing effect in the individual network elements.
• In a centralized system, the control architecture itself also forms a vulnerable aspect of the network.
• It is easier for a human expert to supervise a centralized system. This may turn out to be beneficial in case of unaccounted catastrophes, in which the human operator may intervene in the control of the recovery (remotely controlled patching).
1.5.6 Ring Networks versus Mesh Networks
In a ring network (see Section 1.1.1), the restricted routing pattern does not only hold for working paths. The recovery of traffic is also carried out on a ring-by-ring basis. If a failure occurs along a ring, the traffic is rerouted along the other side of the ring. Figure 1.21 shows an example in which two simultaneous faults (in different rings) lead to recovery actions in both affected rings. To avoid single points of failure, special recovery measures must be taken for the interconnection of the rings. Chapters 2 and 3 further elaborate on recovery techniques in SONET/SDH and OTN ring networks, respectively. In a mesh network, no restriction is imposed on the routing pattern of the recovery path(s).
Figure 1.21 Recovery in ring networks. (Legend: working path; recovery paths 1 and 2; rings 1, 2, and 3.)
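Rerouting "along the other side of the ring" can be sketched in a few lines for a single ring. The function name `ring_reroute`, the five-node ring, and the convention of trying the clockwise direction first are assumptions for illustration only; interconnected rings and the specific SONET/SDH ring mechanisms are outside the scope of this sketch.

```python
def ring_reroute(ring, src, dst, failed_link):
    """Path from src to dst on a single ring that avoids the failed link.

    ring: node names in clockwise order (the last node connects to the first).
    failed_link: pair of adjacent ring nodes whose link has failed.
    Returns None if both directions are blocked.
    """
    if src not in ring or dst not in ring:
        raise ValueError("src and dst must be nodes of the ring")
    failed = frozenset(failed_link)
    n = len(ring)
    start = ring.index(src)
    for step in (1, -1):            # clockwise first, then the other side
        path, k = [src], start
        while path[-1] != dst:
            k = (k + step) % n
            if frozenset((path[-1], ring[k])) == failed:
                path = None         # this direction crosses the failure
                break
            path.append(ring[k])
        if path is not None:
            return path
    return None


ring = ["A", "B", "C", "D", "E"]
unaffected = ring_reroute(ring, "A", "C", failed_link=("D", "E"))
rerouted = ring_reroute(ring, "A", "C", failed_link=("B", "C"))
print(unaffected, rerouted)
```

When the failure lies on the clockwise side between A and C, the traffic takes the longer way around through E and D.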
1.5.7 Connection-Oriented versus Connectionless
The connection-oriented or connectionless nature of a network technology is also reflected in its setup of recovery paths. In connectionless networks, such as an IP network, there is no need for a new connection between RHE and RTE before the traffic can start flowing again. In connection-oriented networks, a recovery connection must be set up first (before the occurrence of a failure in preplanned mode or when a fault is detected in dynamic mode).
1.5.8 Revertive versus Nonrevertive Mode
As mentioned in Section 1.3.2, some recovery mechanisms switch back from the recovery path to the working path once the fault is completely repaired. If this revertive mode is provided, this can lead to more efficient network utilization than leaving the traffic along the recovery path. On the other hand, the nonrevertive mode avoids the temporary repercussions of a switch-back operation.
1.6 Multilayer Recovery
In the previous sections, the main characteristics of recovery schemes active in one network layer (technology) were presented. These schemes prove very effective in covering a number of failure scenarios. In a realistic (multilayer) network, one could imagine a situation in which every network layer has its own recovery mechanism. For instance, in an IP-over-OTN network, IP restoration could be used to recover from an IP router failure or an IP interface card failure and one-to-one optical protection could be used to recover from an OXC failure or an optical fiber cable cut.

However, not every failure in a particular network layer can be resolved by a recovery mechanism in that same layer. Consider, for instance, Figure 1.22, where the OXC B is hit by a failure. The fault is detected in the optical network and a recovery action may be initiated in the OTN layer. However, this OTN recovery action cannot recover the traffic along the working path, because from the OTN layer point of view, this traffic is nothing more than two separate connections A-B and B-D, which are both unrecoverable in the OTN layer. From the IP point of view, a number of secondary failures (links a-b, b-c, and b-d) are noticed, isolating router b. Upon detection of these faults, the IP network layer could also initiate recovery actions (eventually leading to the recovery path indicated in Figure 1.22).

Figure 1.22 Multilayer recovery. (IP layer: routers a, b, c, d; OTN layer: OXCs A, B, C, D, E.)

In other situations (e.g., the failure of link A-B in Figure 1.22), both a recovery action in the OTN layer and a recovery action in the IP layer are able to resolve the problem. If these recovery mechanisms are merely triggered by detection of a fault, an uncoordinated and inefficient action may result. From these examples, it becomes clear that interworking and coordination between the network layers will be needed for recovery purposes. As explained in the following sections, this interworking may take on different forms [Col02].
1.6.1 Sequential Approach
Instead of uncoordinated recovery in several network layers, one could ensure that a fault is not resolved in different layers at the same time (which would lead to race conditions) by imposing a chronological order on the recovery mechanisms. This could be implemented with a hold-off time (see Section 1.3.1). For instance, upon detection of a fault, the server layer may start recovery immediately, whereas the recovery mechanism in the client layer has a built-in hold-off time before initiating the client recovery process. This way, if the fault is already fixed by the server layer recovery mechanism before the hold-off time expires, no client recovery action will take place. An alternative implementation is based on a recovery token signal, that is, a token that is sent from the server layer recovery mechanism to the client layer from the moment that it knows it cannot recover the traffic. Upon receipt of this token, the client layer recovery mechanism is initiated. This makes it possible to limit the traffic disruption time in case the server layer cannot recover.
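The two escalation strategies can be compared with a minimal timing sketch. The function names `holdoff_client_start` and `token_client_start`, and all timing values (in milliseconds after fault detection), are hypothetical and chosen only to make the comparison visible.

```python
def holdoff_client_start(server_fix_time, holdoff):
    """Hold-off approach: when does the client layer start its recovery?

    server_fix_time: time the server layer needs to fix the fault,
                     or None if the server layer cannot recover.
    holdoff: client-layer hold-off time.
    Returns None when no client recovery action takes place.
    """
    if server_fix_time is not None and server_fix_time <= holdoff:
        return None          # fault fixed before the hold-off time expired
    return holdoff           # timer expired, client recovery starts


def token_client_start(server_fix_time, token_time):
    """Token approach: the client layer starts only on receipt of the token.

    token_time: moment the server layer knows it cannot recover and
                sends the recovery token (assumed value).
    """
    return token_time if server_fix_time is None else None


# Server layer fixes the fault in 50 ms; hold-off time is 100 ms.
print(holdoff_client_start(server_fix_time=50, holdoff=100))
# Server layer cannot recover: the client waits out the full hold-off time...
print(holdoff_client_start(server_fix_time=None, holdoff=100))
# ...or starts as soon as the token arrives, here assumed after 10 ms.
print(token_client_start(server_fix_time=None, token_time=10))
```

In the hold-off variant, the client layer always pays the full hold-off time when the server layer cannot recover; the token shortens this wait to the time the server layer needs to conclude that it cannot recover.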
1.6.2 Integrated Approach
A more radical means to ensure coordination between the recovery mechanisms in different layers is to combine the two mechanisms in one integrated multilayer recovery scheme. This implies that the recovery scheme has a full overview of both layers and that it can decide when and in which layer (or layers) to take the appropriate recovery actions. Although this approach is clearly the most flexible one from a recovery point of view, combining different technologies in one mechanism is often unrealistic from a practical point of view. In Chapter 6, we elaborate in much more detail on multilayer recovery and the related implementation issues.
1.7 Conclusion
This chapter is intended to present a brief overview of today's and future network technologies and concepts in general and of recovery mechanisms to augment the reliability of information traffic through the network. Although this chapter focused on the generic technology-independent characteristics of recovery techniques, the following chapters will go into more detail for each individual technology: SONET and SDH recovery in Chapter 2, optical network recovery in Chapter 3, IP recovery in Chapter 4, and MPLS-based recovery in Chapter 5. These various recovery techniques are then combined in Chapter 6, providing the big picture for recovery in multitechnology networks.
CHAPTER 2
SONET/SDH Networks
In the last decade, Synchronous Digital Hierarchy (SDH)/Synchronous Optical NETwork (SONET) networks have been deployed in many commercial networks. SDH/SONET is a technology used in transmission networks: These networks can provide huge amounts of capacity between nodes in client networks in a flexible and cost-effective way. Because Internet Protocol (IP) data traffic is becoming the dominant type of traffic, the SDH/SONET technology is being enhanced with some new features (e.g., the Link Capacity Adjustment Scheme [LCAS]) to better fulfil the needs of this traffic type. As the traffic keeps growing, one can observe a slow evolution from SDH/SONET to Optical Transport Networks (OTNs): OTNs switch complete wavelength channels as a single entity, whereas SDH/SONET networks switch on a subwavelength granularity. This chapter is devoted to the SDH/SONET technology, and Chapter 3 is dedicated to OTNs.

With respect to network recovery, the SDH/SONET technology is commonly accepted as a network technology that has already proven to be capable of providing very fast protection switching (on the order of 50 or 60 milliseconds [ms]). First, this is realized by having sophisticated supervisory processes for failure detection, notification, and propagation. Second, the Automatic Protection Switching (APS) protocol is responsible for switching over very fast from the affected resources to dedicated preprovisioned protection/backup resources. Although protection rings have been chosen as the strategy to implement the fast protection switching, there is a trend to shift from ring-based to mesh-based networks.

Section 2.1 starts by introducing the concept of transmission/transport networks, and Section 2.2 gives a brief overview of the SDH/SONET technology. While discussing the SDH frame format, Section 2.2 particularly focuses on that part of the overhead that is needed for failure detection. Section 2.3 highlights the operational aspect of Automatic Protection Switching in SDH networks; more precisely, it discusses the failure notification and propagation process plus the basics of the APS protocol. Sections 2.4, 2.5, and 2.6 describe the various recovery strategies possible in SDH networks, starting in Section 2.4 by discussing the popular protection rings. Section 2.5 continues with linear protection switching and Section 2.6 highlights opportunities for restoration and compares restoration versus protection. Section 2.7 presents a practical case study, showing the cost advantages of having hybrid protection strategies and highlighting some issues with respect to providing protection when considering different node architectures. The main findings are summarized and recapitulated in Section 2.8. Section 2.9 recommends the review of some reference material, on which this chapter relies, and highlights some research topics related to SDH network recovery.

We are greatly indebted to Didier Colle, INTEC, Ghent University, for his exceptional contribution to the writing of Chapter 2.
2.1 Introduction
The goal of this section is to introduce the concept of transmission networks: SDH/SONET networks are a particular example of transmission networks. Section 2.1.1 starts with positioning transmission networks in the overall network. Then Section 2.1.2 briefly discusses network management, which is an important aspect of transmission networks. Section 2.1.3 highlights how to model and structure transmission networks, and finally Section 2.1.4 provides a summary of Section 2.1.
2.1.1 Transmission Networks
Communications networks typically consist of network nodes, interconnected by network links. Based on some control information provided by the end user, the network nodes know how to "route" or "switch" traffic through the network node. For example, in the case of a telephone network, the calling party dials the phone number of the called party in order to dictate how the exchanges in the network should connect incoming circuits to outgoing circuits to establish an end-to-end circuit between calling and called party. In the case of an IP network, the end user adds sufficient control information (the destination address) to each packet so the routers can forward the packets in the direction of the destination (the routers use a routing protocol).

A cost-effective network typically consists of a rather low number of network nodes (compared to the number of end users connected to the network). Often the network is organized hierarchically (e.g., think about the regional, national, and international levels in a telephone network). For cost-efficiency reasons this typically results in lots of traffic flows that are bundled (or aggregated) and routed between the same network nodes. Nevertheless, often it is not reasonable or feasible to provide a physical link for each bundle. For example, in the core of a telephone network, a (dense) mesh of links is typically required. Therefore, transmission or transport networks aim at provisioning any set of high-bandwidth bit pipes (i.e., circuits) between the network nodes independently of the underlying physical network topology.

Figure 2.1 Transmission networks.

The left side of Figure 2.1 illustrates how the transmission network infrastructure can be shared among multiple networks (here, an IP-based and a telephone network). In this example, a ring network functions as a transmission network (see top of the figure), and thus, the network links or transmission bit pipes (i.e., circuits) between the local telephone exchanges (dotted lines) or IP routers (dashed lines) are realized as connections (crossing two physical links) in the transmission network. The fact that these high-bandwidth bit pipes (i.e., circuits) are multiplexed onto a typically sparse topology makes a single network failure in the transmission network affect a lot of traffic, motivating the tremendous importance of network survivability of transmission networks.

The top of Figure 2.1 shows a typical transmission or transport network. The network operator configures the transport network connections statically by some means, for example, through its network management system (NMS). In other words, no control information provided by the end user is incorporated in the configuration of the transport network. A logically separated Telecommunications Management Network (TMN) allows the central NMS to configure each network element (NE) in the network.
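Why a single failure in the transmission network affects several client links at once can be seen from a small sketch that maps bit pipes onto physical links. The circuit names, their routes over a four-node ring A-B-C-D, and the helper `affected_circuits` are hypothetical and only loosely inspired by the figure.

```python
# Hypothetical bit pipes (circuits) of two client networks, each routed
# over the same physical ring A-B-C-D of the transmission network.
circuits = {
    "LEX1-LEX2": ["A", "B", "C"],   # telephone network link
    "R1-R2": ["B", "C", "D"],       # IP network link
    "R1-R3": ["A", "B"],            # IP network link
}


def affected_circuits(circuits, failed_link):
    """Names of the circuits whose route crosses the failed physical link."""
    failed = frozenset(failed_link)
    return sorted(
        name
        for name, route in circuits.items()
        if failed in map(frozenset, zip(route, route[1:]))
    )


hit = affected_circuits(circuits, ("B", "C"))
print(hit)
```

A single cut of the physical link B-C interrupts a link of the telephone network and a link of the IP network at the same time.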
2.1.2 Management of (Transmission) Networks
Network management is not restricted to only the provisioning of network connections, but it also involves the following items, according to the Open Systems Interconnection (OSI) FCAPS (i.e., fault, configuration, accounting, performance, and security) classification [X700], [X701]:

• Fault management denotes the collection of management processes responsible for identifying, locating, and reporting problems or faults in the network. Fault management may trigger resilience techniques, as shown later in this chapter.
• Configuration management denotes the collection of management processes responsible for discovering and configuring network devices and connections. With respect to network resilience, configuration management is important because it allows configuring the network with sufficient redundancy in order to survive network faults and/or allows reconfiguring the network when a failure occurs.
• Accounting management denotes the collection of management processes responsible for keeping track of the network resources being reserved or used by, for example, a particular user or a particular traffic type. Based on these statistics, the users are billed as agreed contractually. Depending on the agreed service-level agreement (SLA), the network operator might have to pay a penalty in case the service has been interrupted for a certain amount of time, motivating the importance of network survivability.
• Performance management denotes the collection of management processes responsible for monitoring the overall performance of the network and the performance perceived by the network user; this includes performance of hardware, software, and/or any other media. A degraded performance might end up in a network fault.
• Security management denotes the collection of management processes responsible for issues such as the control of access to any available network resources, the exchange of keys for encrypted transport of the data, and/or the prevention of denial-of-service (DoS) attacks.
In this chapter, the fault, performance, and configuration management aspects of network recovery in SONET/SDH transmission networks are discussed in more detail.

Managing every item in the network from a central NMS would not be very scalable. Therefore, several abstraction levels exist for managing the network, allowing the delegation of more detailed management tasks to agents managing a smaller part of the network. Five layers (ordered from highest to lowest level of abstraction) can be distinguished [M3010]:

• The business management layer (BML) is responsible for the total enterprise and in particular the agreements between customers and the operator.
• The service management layer (SML) is the layer responsible for negotiating the contractual aspects of the services offered to the customer. These service-level agreements (SLAs) specify the quality-of-service (QoS) measures that must be met on the overall network connection.
• The network management layer (NML) manages the overall network and is responsible for provisioning end-to-end connections.
• The scope of the network element management layer (NEML) is restricted to the subnetwork level.
• The network element layer (NEL) is the level conceptually representing the managed network equipment/elements.
2.1.3 Structuring/Modeling Transmission Networks

To understand how network failures are detected and propagated through the transmission network and how network recovery mechanisms react to these failures, we must have a clear view of the structure of transmission networks. A transmission network can be decomposed into one or more layers. In accordance with International Telecommunication Union-T (ITU-T) Recommendations G.805 [G805] and G.806 [G806] and European Telecommunications Standards Institute (ETSI) EN 300 417-1-1 [ETSI1], each network layer can be modeled as a set of atomic functions interconnecting reference points in the network layer, as depicted in Figure 2.2.
• Connection functions (C) represent the flexibility in the network; more precisely, a connection function connects a set of connection points (CPs) at its border with each other. Because the flexibility of a connection function can be represented by a matrix (e.g., think of the “switch matrix” or “switch fabric” in a cross-connect), each cross-connection realized by the connection function is called a matrix connection (MC).
• Link connections (LCs): The interconnections of CPs at the borders of distinct connection functions are called link connections. As Figure 2.2 illustrates, link connections can be supported by one or more server layers.
• Subnetwork connections (SNCs): Like the decomposition of a network into layers (vertical decomposition), we can also decompose the network into one or more interconnected subnetworks. A subnetwork comprises at least one connection function but could also comprise multiple or even all connection functions in the network and the link connections between these connection functions. Of course, a “nested” subnetwork can be defined inside a subnetwork. A subnetwork connection is the connection created between connection points on the border of the subnetwork.

Figure 2.2 Functional structure/model of transmission networks. Legend: A = adaptation function; TT = trail termination function; C = connection function; AP = access point; CP = connection point; TCP = termination connection point; NC = network connection; SNC = subnetwork connection; LC = link connection; MC = matrix connection. (ITU-T Recommendation G.805, “Generic functional architecture of transport networks,” ITU-T Standardization Organization, March 2000. Available at: www.itu.int. Accessed May 2004.)

At the ingress of the network, the client information has to be transformed, wrapped in the necessary overhead, and fed into the network. At the egress of the network, the reverse operation is needed and the integrity of the received signal (which is important with respect to network resilience) has to be supervised.
• CPs at the edge of the layer network are called termination connection points (TCPs), and end-to-end network connections (NCs) are established between these TCPs. Network connections carry characteristic information (CI), which consists of the adapted information (AI) plus some overhead information. The trail termination (TT) functions are responsible for adding (source direction) and removing (sink direction) this overhead information, which allows the sink TT function to supervise the integrity of the received signal.
• Adapted information (AI), carried over so-called trails, flows through access points (APs) instead of through connection points (CPs). Adaptation functions (A) are responsible for converting the client layer information into the appropriate format (more precisely, the AI). This conversion includes scrambling, encoding/decoding, alignment, multiplexing/demultiplexing, bit rate adaptation, frequency justification, timing/clock recovery, smoothing, and/or payload justification. The client layer information is retrieved from the CP in the client layer network.

The fact that transmission/transport networks typically consist of multiple layers does not necessarily imply that network recovery techniques are foreseen in more than one layer. The coordination of recovery techniques deployed in multiple layers is discussed extensively in Chapter 6. Thus, this chapter is limited to networks featuring a recovery technique at a single layer. Nevertheless, this chapter does cover how failures might propagate from a lower layer up to the layer deploying the network recovery technique.
2.1.4 Summary

• Reference points: Characteristic information (CI) flows through connection points (CPs), whereas adapted information (AI) flows through access points (APs).
• Atomic functions: Connection function (C): (T)CP → (T)CP; trail termination (TT) function: AP → TCP in source direction and TCP → AP in sink direction; adaptation function (A): (T)CP → AP in source direction and AP → (T)CP in sink direction.
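The direction rules in the summary above can be written down as a small data model. The following is only an illustrative sketch: the class name `AtomicFunction`, the constants, and the `can_chain` helper are hypothetical names introduced here, not part of the ITU-T or ETSI models. It encodes which reference point type each atomic function consumes and produces and checks that two functions exchange the same kind of information (CI or AI).

```python
from dataclasses import dataclass

# Information type carried at each reference point type (from the summary):
# CPs and TCPs carry characteristic information (CI), APs carry adapted information (AI).
CARRIES = {"CP": "CI", "TCP": "CI", "AP": "AI"}


@dataclass(frozen=True)
class AtomicFunction:
    name: str          # hypothetical label used only in this sketch
    input_point: str   # reference point type on the input side
    output_point: str  # reference point type on the output side


# Direction rules as listed in the summary.
CONNECTION = AtomicFunction("C", "CP", "CP")
TT_SOURCE = AtomicFunction("TT_source", "AP", "TCP")
TT_SINK = AtomicFunction("TT_sink", "TCP", "AP")
A_SOURCE = AtomicFunction("A_source", "CP", "AP")
A_SINK = AtomicFunction("A_sink", "AP", "CP")


def can_chain(upstream: AtomicFunction, downstream: AtomicFunction) -> bool:
    """True when the upstream output and downstream input carry the same information type."""
    return CARRIES[upstream.output_point] == CARRIES[downstream.input_point]


if __name__ == "__main__":
    # Source side of a trail: adaptation, then trail termination, then connection.
    print(can_chain(A_SOURCE, TT_SOURCE))   # True: AI handed over at an AP
    print(can_chain(TT_SOURCE, CONNECTION)) # True: CI handed over at a (T)CP
    print(can_chain(A_SOURCE, CONNECTION))  # False: AI cannot enter a connection function
```

The check is deliberately coarse: it only compares information types and does not model layers, so it cannot tell a client-layer CP from a server-layer CP.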
2.2 SDH and SONET Networks

The goal of this section is to give a brief overview of some major aspects of SDH/SONET networks. The introduction of Section 2.2.1 situates SDH/SONET networks in the evolution of transmission networks. Section 2.2.2 describes the structure of SDH networks, Section 2.2.3 discusses the SDH frame structure while focusing on relevant aspects for network recovery, and Section 2.2.4 describes different types of SDH network equipment. The major items are summarized in Section 2.2.5. Finally, Section 2.2.6 highlights some differences and similarities between SDH and SONET transmission networks.
2.2.1 Introduction

The remainder of this chapter focuses on one particular transmission/transport network technology: Synchronous Digital Hierarchy (SDH). The Synchronous Optical NETwork (SONET) technology is the U.S. counterpart of the SDH technology; both technologies are compared with each other in Section 2.2.6. One of the major advances realized in SDH, compared to its predecessor (the Plesiochronous Digital Hierarchy [PDH]), is that the clocks used in all nodes for processing received signals and generating signals to be transmitted are synchronized with each other through a synchronization network. This allows byte-interleaved instead of bit-interleaved multiplexing and prevents frequent justifications or stuffing to compensate for the frequency mismatch between different clocks. Accessing each individual multiplexed signal becomes possible by means of a pointer to the appropriate byte in a repetitive frame structure (e.g., a 64-kbps stream corresponds to exactly one byte in a frame structure, which is repeated every 125 μs), thereby avoiding the need to demultiplex (and re-multiplex) the high-bandwidth aggregate signal and significantly reducing the complexity of the network equipment.

To cope with the explosively growing data traffic volume, Optical Transport Networks (OTNs) (discussed in Chapter 3) are expected to transport typical connections at or beyond the bit rate of SDH lines (up to 10 gigabits per second [Gbps]). Taking into account that OTNs have not yet materialized (because of the economic slowdown) and that SDH networks will coexist with the introduced OTNs for a long time, it is clear that there is still a lot of interest in the SDH technology. This is reflected in the recent work to develop a “next-generation” SDH technology: The goal is to adapt the SDH technology to the highly dynamic IP networks, overcoming the limitations caused by the initial targeting of more static telephone networks. The Link Capacity Adjustment Scheme (LCAS) [G7042] allows dynamic adjustment of the capacities of the transported signals as required. The Generic Framing Procedure (GFP) [G7041], [Her02] eases the encapsulation and transport of IP-based traffic over SDH and OTN frame formats.

To allow SDH connections with a capacity beyond that of individual SDH connections, the concatenation of SDH connections between the same endpoints has been standardized. Nevertheless, much of the currently deployed SDH equipment does not implement this feature; therefore, virtual concatenation is seen as the solution to overcome the bandwidth limitations of current SDH connections. Virtual concatenation is based on independent connections routed between the same endpoints, where they are inverse multiplexed into a high-bandwidth connection. Although this avoids the need for upgrading the equipment inside the SDH network, the edge has to resolve at least any synchronization problems that might occur as a result of the independent routing of the involved connections.

Finally, note that even when no intermediate SDH network elements are in place, transporting the information in an SDH frame structure might still be useful (e.g., when interconnecting IP routers through packet-over-SONET [PoS]). Indeed, the SDH frame structure has proven to include valuable overhead information for supervisory purposes and is commonly accepted as a standard.
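The byte-interleaving argument made in this introduction can be illustrated with a few lines of Python. This is a minimal sketch under simplifying assumptions: tributaries are modeled as equal-length byte strings, pointers and overhead are ignored, and the function names are hypothetical. It shows that one byte per 125-μs frame yields 64 kbps and that a single tributary can be read out of the aggregate by position alone, without demultiplexing the others.

```python
FRAME_PERIOD_US = 125  # one frame every 125 microseconds, i.e., 8,000 frames per second


def channel_rate_kbps(bytes_per_frame: int) -> float:
    """Bit rate of a channel that occupies a fixed number of bytes in every frame."""
    bits_per_frame = 8 * bytes_per_frame
    # bits per microsecond equals Mbps; multiply by 1,000 to obtain kbps
    return bits_per_frame * 1000 / FRAME_PERIOD_US


def byte_interleave(tributaries: list[bytes]) -> bytes:
    """Multiplex equal-length byte streams by taking one byte from each in turn."""
    length = len(tributaries[0])
    if any(len(t) != length for t in tributaries):
        raise ValueError("tributaries must have equal length")
    return bytes(b for column in zip(*tributaries) for b in column)


def extract_tributary(aggregate: bytes, index: int, count: int) -> bytes:
    """Read tributary `index` out of `count` directly from its byte positions."""
    return aggregate[index::count]


if __name__ == "__main__":
    print(channel_rate_kbps(1))                     # 64.0
    aggregate = byte_interleave([b"AAA", b"BBB", b"CCC"])
    print(aggregate)                                # b'ABCABCABC'
    print(extract_tributary(aggregate, 1, 3))       # b'BBB'
```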
2.2.2 Structure of SDH Networks

As mentioned in Section 2.1.3, SDH is a transmission/transport network technology that can be decomposed into several layers [G803]. As shown in Figure 2.3, there are two path layers and two section layers.

• The higher and lower order path (HOP and LOP) layers provide the flexibility in the network through connection functions. As shown in Figure 2.3, LOPs are always routed over a chain of HOPs. In other words, a HOP network connection can serve as an LOP link connection (more than one LOP may be routed over the same HOP). Note also that not only the LOP layer but also other non-SDH network layers can act as a client layer of the HOP layer.
• The multiplex section (MS) layer is responsible for multiplexing traffic and, thus, increasing the transmission bit rate between the network elements providing the connection functions in the path layers. The MS layer typically does not feature connection functionality, except for specific purposes like multiplex section protection (MSP).
• The regenerator section (RS) layer is responsible for supervising the regenerator sections. As shown in Figure 2.3, an MS may span multiple RSs when regenerators are introduced in the network. Regenerators aim at cleaning up the distorted transmission signal to extend the reach of the transmission links. SDH regenerators perform 3R regeneration: reamplification, reshaping, and retiming.

Figure 2.3 Layer structure of SDH networks. (ITU-T Recommendation G.803, “Architecture of transport networks based on the synchronous digital hierarchy (SDH),” ITU-T Standardization Organization, March 2000. Available at: www.itu.int. Accessed May 2004.)
An additional physical media layer (typically optical, but electrical or radio media are also possible) is responsible for the actual transmission of the signal. Optionally, an optical transport network (OTN) can be deployed between the SDH network (layers) and the optical media layer.

The format of the digital signal flowing between SDH NEs is called Synchronous Transport Module of order N (STM-N). An STM-N signal has a bit rate of N times the bit rate of an STM-1 signal: N × 155.52 megabits per second (Mbps) (as explained in Section 2.2.6, an SDH STM-N signal corresponds to a SONET Synchronous Transport Signal of level 3N [STS-3N] or an Optical Carrier of level 3N [OC-3N] when the STS-3N is transported optically). Virtual Containers-n (VC-n) support path layer connections: A VC-n includes the Container-n (C-n) (which corresponds to the payload information) and the path overhead (POH) information. Per STM-1 signal, one VC-4 or three VC-3s are transported at the HOP layer. As mentioned earlier, an HOP can carry a number of LOPs; more precisely, a VC-4 can transport up to three VC-3s, 21 VC-2s, 63 VC-12s, or 84 VC-11s.⁶ As the multiplexing hierarchy in Figure 2.4 shows, this capacity can also be allocated to a mix of these LOP signals. For example, a VC-4 is able to transport 1 VC-3 plus 7 VC-2s plus 21 VC-12s. A higher order VC (HOVC)-3 can accommodate only one third of a VC-4: more precisely, 7 VC-2s, 21 VC-12s, or 28 VC-11s. The fact that a VC-3 can act as an HOP or as an LOP explains the “two-way” arrow between the OR function and the VC-3 signal in Figure 2.4.

⁶ Sometimes VC-1 is used to refer to either a VC-11 or a VC-12.

Figure 2.4 SDH multiplexing hierarchy. Bit rates shown in the figure: VC-11: 1,664 kbps (C-11: 1,600 kbps); VC-12: 2,240 kbps (C-12: 2,176 kbps); VC-2: 6,848 kbps (C-2: 6,784 kbps); VC-3: 48,960 kbps (C-3: 48,384 kbps); VC-4: 150,336 kbps (C-4: 149,760 kbps); STM-N: N × 155,520 kbps. (ITU-T Recommendation G.707/Y.1322, “Network node interface for the synchronous digital hierarchy (SDH),” ITU-T Standardization Organization, October 2000. Available at: www.itu.int. Accessed May 2004.)
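The capacity figures quoted above (3 VC-3s, 21 VC-2s, 63 VC-12s, or 84 VC-11s per VC-4) can be turned into a simple feasibility check for a mix of LOP signals. The sketch below rests on an assumption that goes slightly beyond the text: it treats a VC-4 as 21 “slots” of VC-2 size, where a VC-3 occupies 7 slots and each remaining slot holds one VC-2, up to three VC-12s, or up to four VC-11s without mixing types inside a slot. The function name `fits_in_vc4` is hypothetical.

```python
from math import ceil

VC3_PER_VC4 = 3     # a VC-4 carries up to three VC-3s
SLOTS_PER_VC3 = 7   # a VC-3 takes the place of 7 VC-2s (21 / 3)
VC12_PER_SLOT = 3   # 63 VC-12s / 21 slots
VC11_PER_SLOT = 4   # 84 VC-11s / 21 slots


def fits_in_vc4(vc3: int = 0, vc2: int = 0, vc12: int = 0, vc11: int = 0) -> bool:
    """Check whether a mix of lower order VCs fits in one VC-4 (slot model, see assumptions)."""
    if min(vc3, vc2, vc12, vc11) < 0:
        raise ValueError("counts must be non-negative")
    if vc3 > VC3_PER_VC4:
        return False
    # Slots left over once each VC-3 has claimed its third of the VC-4.
    free_slots = SLOTS_PER_VC3 * (VC3_PER_VC4 - vc3)
    # Partially filled slots still count as used, hence the rounding up.
    needed = vc2 + ceil(vc12 / VC12_PER_SLOT) + ceil(vc11 / VC11_PER_SLOT)
    return needed <= free_slots


if __name__ == "__main__":
    print(fits_in_vc4(vc3=1, vc2=7, vc12=21))  # True: the example mix from the text
    print(fits_in_vc4(vc11=84))                # True: the VC-11 maximum
    print(fits_in_vc4(vc11=85))                # False: one VC-11 too many
```

Rounding up per signal type is the conservative choice here: it never reports a mix as feasible when a partially filled slot would have to be shared between different VC types.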
2.2.3 SDH Frame Structure: Overhead Bytes Relevant for Network Recovery

Figure 2.5 shows the STM-1 frame format in more detail. This commonly known representation consists of 9 rows and 270 columns of bytes. As mentioned earlier, SDH networks were initially developed for supporting telephone networks. Because, in accordance with the Nyquist-Shannon sampling theorem, the digitization of telephone signals is based on a sampling frequency of 2 × 4 = 8 kilohertz (kHz), the duration of an STM-1 frame was chosen to equal exactly the time between two sample points: 125 microseconds (μs). This explains the bit rate of an STM-1 signal mentioned in Figure 2.4: 8 bits/byte × 9 × 270 bytes / 125 μs = 155,520 kbps. The bit rate of an STM-N signal is N times the bit rate of an STM-1, implying also that each STM-N frame has a duration of exactly 125 μs.

Figure 2.5 shows that the first 9 columns⁷ are dedicated to the section overhead (SOH), whereas the remaining 261 columns contain the STM-1 payload. The SOH is split up into three rows of regenerator section overhead (RSOH) and five rows of multiplex section overhead (MSOH). The remaining row contains a pointer (H bytes) to where the HOVCs start: VCs can be shifted in time compared to the underlying stream of STM-1 frames. An HOVC-n together with its pointer forms an Administrative Unit-n (AU-n). Similarly, Tributary Unit-ns (TU-ns) exist at the LOP layer. The path overhead (POH) of a VC-4 (see Figure 2.5) or a VC-3 signal occupies one column (minus one H4 byte). VC-11, VC-12, and VC-2 frames span four STM-1 frames, forming a multiframe (the seventh and eighth bit of the H4 byte in the HOVC indicate the position of the current frame in the multiframe): 4 bytes per multiframe (or 1 byte per STM frame) are allocated for VC-11, VC-12, or VC-2 POH.

⁷ More precisely, all rows except the fourth row in the first nine columns.

Figure 2.5 STM-1 frame format (STM-1: 9 × 270 bytes; VC-4: 9 × 261 bytes). Bytes shown in the figure: RSOH: A1, A2, J0, B1, E1, F1, D1 to D3; pointer row: H1, H2, H3; MSOH: B2, K1, K2, D4 to D12, S1, Z1, Z2, M1, E2; path overhead: J1, B3, C2, G1, F2, H4, F3, K3, N1; the figure also marks media-dependent bytes and bytes labeled NU. (ITU-T Recommendation G.707/Y.1322, “Network node interface for the synchronous digital hierarchy (SDH),” ITU-T Standardization Organization, October 2000. Available at: www.itu.int. Accessed May 2004.)

The frame structure in Figure 2.5 is just an example. A VC-4 has been chosen as an HOP. Because an STM-1 can carry only a single VC-4, only one of the three instances of each of the H1, H2, and H3 bytes is used (the other instances are needed when three VC-3s are multiplexed in the same STM-1). Note that some of the H bytes are reserved for supporting pointer justifications (in Figure 2.5, H3 can be used to carry payload information if needed in the case of a pointer justification).

Until now, we have mainly discussed STM-1 signals. Although not completely correct, we can roughly think of an STM-N signal as N byte-interleaved STM-1 signals. In the current standards, N can equal 1 (155 Mbps), 4 (622 Mbps), 16 (2.5 Gbps), 64 (10 Gbps), or 256 (40 Gbps). For N larger than 1, concatenated VCs (see also Section 2.2.1) beyond the bit rate of a VC-4 can be supported. A complete discussion of the multiplexing and frame structure is beyond the scope of this book, but more details can be found elsewhere (e.g., [G707], [Sex92]).

The goal of this chapter is not to explain every single overhead byte in detail, so our discussion here focuses on issues relevant in the context of network recovery. First, overhead bytes are needed for fault detection. Second, some overhead bytes propagate fault information. Third, the signaling for the Automatic Protection Switching (APS) protocol is transported in some overhead bytes.
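The frame arithmetic in this section is easy to reproduce. The following sketch recomputes the STM-N line rates and the VC-4 and C-4 rates of Figure 2.4 from the frame dimensions (9 rows, 270 columns, 9 overhead columns, one frame per 125 μs). The function names are hypothetical, and the assumption that the C-4 occupies 260 columns (the 261 payload columns minus one POH column) is inferred from the rates quoted in Figure 2.4.

```python
ROWS = 9
COLUMNS = 270            # columns per STM-1 frame
SOH_COLUMNS = 9          # section overhead (including the pointer row)
FRAMES_PER_SECOND = 8000 # one frame every 125 microseconds


def stm_rate_kbps(n: int = 1) -> int:
    """Line rate of an STM-N signal in kbps."""
    bits_per_frame = 8 * ROWS * COLUMNS * n
    return bits_per_frame * FRAMES_PER_SECOND // 1000


def column_rate_kbps(columns: int) -> int:
    """Bit rate carried by a given number of 9-row columns in every frame."""
    return 8 * ROWS * columns * FRAMES_PER_SECOND // 1000


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        print(f"STM-{n}: {stm_rate_kbps(n):,} kbps")
    payload_columns = COLUMNS - SOH_COLUMNS          # 261 columns hold the VC-4
    print(f"VC-4: {column_rate_kbps(payload_columns):,} kbps")      # 150,336
    print(f"C-4:  {column_rate_kbps(payload_columns - 1):,} kbps")  # 149,760
```

Using integer arithmetic keeps the results exact, which makes it easy to compare them digit by digit with the values in Figure 2.4.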
Table 2.1 summarizes, for each layer, the defects that can be declared (in accordance with [G806] and [G783]): For each defect, the table states which overhead bytes to monitor and how long it takes to declare the defect from the moment the first anomaly has been perceived. A first category of defects includes the failure to track the beginning of an STM-N frame, the position of a VC relative to such a frame, or the position of a frame relative to a multiframe, resulting in a Loss of Frame defect (dLOF), a Loss of Pointer defect (dLOP), or a Loss of Multiframe defect (dLOM), respectively (these defects are particular instances of the Loss of Alignment defects [dLOA]). A second category is based on signal quality, measured in terms of bit error rate (BER) by means of a bit interleaved parity (BIP) mechanism, which allows a Degraded Signal (dDEG) defect or an Excessive Error (dEXC) defect to be declared. A third category involves payload-type supervision (a signal type identifier allows verifying the compatibility of the adaptation functions at source and sink), connectivity supervision (a trail trace identifier allows verifying that a source TT function is not misconnected to the wrong sink TT function), and continuity supervision (monitoring the presence/absence of the CI allows supervising the signal integrity), resulting in a Payload Mismatch defect (dPLM), a Trace Identifier Mismatch defect (dTIM), and an Unequipped defect (dUNEQ), respectively. Connection functions insert an unequipped VC signal (more precisely, an all-0s pattern in some overhead bytes) at those outputs that are not connected to one of their inputs. There also exist supervisory-unequipped signals (more precisely, an unequipped VC with a valid trail trace identifier and Remote Defect Indication [RDI] and Remote Error Indication [REI] bytes) to test a connection between two supervisory-unequipped TT functions [ETSI1]. Finally, a fourth category involves the supervision of maintenance signals: An alarm indication signal (AIS) (or an all-1s signal) is sent downstream to indicate an upstream failure, whereas the sink sends a Remote Defect Indication (RDI) signal upstream to indicate that the trail in the opposite direction is failing. This mechanism is explained in more detail in Sections 2.3.1 through 2.3.3.

Table 2.1 Defect Detection Times¹

Defect | Regenerator Section | Multiplex Section | VC-4/3 | VC-2/12/11
Loss of Frame (dLOF) | A1, A2: 3 ms | - | - | -
Loss of Multiframe (dLOM) | - | - | H4b7-8: 1-5 ms | -
Loss of Pointer (dLOP) | - | - | H1, H2: [8-10] × 125 μs | V1, V2: [8-10] × 500 μs
Trace Id Mismatch (dTIM)² | J0: < 100 ms | - | J1: < 100 ms | J2: < 100 ms
Payload Mismatch (dPLM) | - | - | C2: < 100 ms | V5b5-7, K4b1: < 100 ms
Degraded Signal (dDEG)³ | B1: 10^(x-2) ms | B2: 10^(x-2) ms | B3: 10^(x-2) ms | V5b1-2: 4 × 10^(x-2) ms
Excessive Error (dEXC)⁴ | B1: 10^(x-2) ms | B2: 10^(x-2) ms | B3: 10^(x-2) ms | V5b1-2: 4 × 10^(x-2) ms
All Ones (dAIS) | - | K2b6-8: 3 × 125 μs | C2: 5 × 125 μs; H1, H2: 3 × 125 μs | V5b5-7: 5 × 500 μs; V1, V2: 3 × 500 μs
Remote Defect Ind. (dRDI) | - | K2b6-8: [3-5] × 125 μs | G1b5: {3, 5, 10} × 125 μs | V5b8: {3, 5, 10} × 500 μs
Unequipped VC (dUNEQ) | - | - | C2: 5 × 125 μs | V5b5-7: 5 × 500 μs

¹ Throughout this chapter, we use the notation XXbY to indicate bit Y in the overhead byte XX.
² Data from “Transmission and Multiplexing (TM); Generic requirements of transport functionality of equipment; Part 1-1: Generic processes and performance,” ETSI EN 300 417-1-1 V1.2.1, October 2001.
³ A bit interleaved parity (BIP) mechanism is adopted to measure the bit error rate (BER). Assuming a Poisson distribution of errors, the values in this table indicate the maximum measuring time needed to declare and clear defects when the BER thresholds are set to 10^-x and 10^-(x+1), respectively. For dDEG, x can be configured in the range of 5 through 9. The numbers in this table are not valid when assuming a bursty distribution of errors. The measuring time is then between 2 and 10 seconds.
⁴ The process of declaring and clearing dEXC defects is similar to that for dDEG defects, except that x is configured in the range of 3 through 5. dEXC defects cannot be declared when assuming a bursty distribution of errors.
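The bit interleaved parity (BIP) mechanism mentioned above for signal quality supervision can be sketched in a few lines. The sketch assumes the usual BIP-8 definition, which is not spelled out in the text: bit i of the code provides even parity over bit i of all bytes in the covered block, so the code equals the XOR of all bytes. The function names are hypothetical, and the block boundaries (which bytes the B1, B2, or B3 code covers) are deliberately left out.

```python
from functools import reduce
from operator import xor


def bip8(block: bytes) -> int:
    """BIP-8 code of a block: even parity per bit position, i.e., XOR of all bytes."""
    return reduce(xor, block, 0)


def count_bip_violations(received_block: bytes, received_bip: int) -> int:
    """Number of bit positions where the recomputed parity disagrees with the received code."""
    return bin(bip8(received_block) ^ received_bip).count("1")


if __name__ == "__main__":
    sent = bytes([0b10101010, 0b11110000, 0b00001111])
    code = bip8(sent)
    print(f"{code:08b}")                                # 01010101
    corrupted = bytes([0b10101011, 0b11110000, 0b00001111])  # one bit flipped in transit
    print(count_bip_violations(sent, code))             # 0
    print(count_bip_violations(corrupted, code))        # 1
```

Counting violations per block is what allows the sink to estimate the BER. Two errors in the same bit position cancel out, which is one reason the estimate is only reliable at low error rates.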
Table 2.1 makes it clear that the time needed to declare a defect depends on which supervision process is involved. Although at least 375 μs are needed to detect any kind of defect, the quality, connectivity, and continuity supervision processes only declare failures after 10 ms or more (up to 10,000 seconds, or 2 hours and 47 minutes, for a dDEG with a threshold set to 10^-9). The detection times for the other defects are compared with each other in Figure 2.6. The error bars indicate the tolerable range in accordance with Table 2.1. Most of these defects are declared after the corresponding anomaly is monitored in a small number of consecutive frames; this explains the multiplication factor of 125 μs (the length of an STM-N frame, as explained earlier) or 4 × 125 μs = 500 μs in the case of a VC-2/12/11 (because each multiframe spans four STM-N frames). For example, at the VC-4/3 level, an AUdAIS (a dAIS defect based on the H1 and H2 pointer bytes) and a VCdAIS (a dAIS defect based on the C2 byte) are detected within 3 × 125 μs = 375 μs and 5 × 125 μs = 625 μs, respectively, whereas a dRDI is detected within 5 × 125 μs = 625 μs, but in the best case in 3 × 125 μs = 375 μs and in the worst case in 10 × 125 μs = 1.25 ms. Figure 2.6 clearly shows that typically (except for the worst-case dRDI detection) defects resulting from the maintenance signal supervision need the least time to be declared; this is important because it suppresses most of the impact of defects that are only a side effect of the root failure. Note, however, that the underlying physical layer might detect and notify a Loss of Signal defect (dLOS), caused by a transmitter failure or an optical path break, even faster: more precisely, within 2.3 to 100 μs.

Figure 2.6 Comparison of short detection times. The bar chart plots the detection time (in microseconds, on a scale from 0 to 5,000) of the dAIS, VCdAIS, dRDI, dUNEQ, dLOF, dLOM, and dLOP defects for the RS, MS, VC-4/3, and VC-2/12/11 layers.

Summarizing the previous discussion, the K2(b6–8) bits in the MS OH, the G1(b5) bit and the C2 byte in the VC-4/3 OH, and the V5(b5–8) bits in the VC-2/12/11 OH, plus the AU and TU pointers (H1 and H2, and V1 and V2, respectively), which are used to transport maintenance signals, are very important in the context of network recovery: For example, these maintenance signals might trigger the Automatic Protection Switching (APS). In addition, the K bytes (K1 and K2[b1–5] in the MS OH, K3[b1–4] in the VC-4/3 OH, and K4[b3–4] in the VC-2/12/11 OH) transport the signaling messages for the APS protocol, which is discussed in Section 2.3.4.
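The detection times quoted from Table 2.1 follow from simple frame arithmetic, which the sketch below reproduces. It assumes the reading of the table used in this section: frame-count-based defects take a number of 125-μs frames (or 500-μs multiframes for VC-2/12/11), and BER-based defects take at most 10^(x-2) ms for a threshold of 10^-x (four times that for VC-2/12/11). The function names are hypothetical.

```python
FRAME_US = 125                 # duration of one STM-N frame in microseconds
MULTIFRAME_US = 4 * FRAME_US   # VC-2/12/11 multiframe spans four frames


def frames_to_us(frames: int, multiframe: bool = False) -> int:
    """Detection time for a defect declared after `frames` consecutive (multi)frames."""
    return frames * (MULTIFRAME_US if multiframe else FRAME_US)


def ber_detection_time_ms(x: int, lower_order: bool = False) -> int:
    """Maximum measuring time in ms for a BER threshold of 10^-x (x >= 2)."""
    base = 10 ** (x - 2)
    return 4 * base if lower_order else base


if __name__ == "__main__":
    print(frames_to_us(3))                   # 375  (AU dAIS at VC-4/3)
    print(frames_to_us(5))                   # 625  (VC dAIS, typical dRDI)
    print(frames_to_us(10))                  # 1250 (worst-case dRDI, 1.25 ms)
    print(frames_to_us(5, multiframe=True))  # 2500 (dUNEQ at VC-2/12/11)
    seconds = ber_detection_time_ms(9) // 1000
    hours, rest = divmod(seconds, 3600)
    print(seconds, hours, rest // 60)        # 10000 2 46 (about 2 hours and 47 minutes)
    print(ber_detection_time_ms(3))          # 10   (fastest dEXC setting)
```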
2.2.4 SDH Network Elements

Although it is crucial to understand how an SDH network can be decomposed into network layers consisting of a set of atomic functions that process the overhead bytes, it is also important to understand how this model corresponds to a set of equipment (or network nodes) interconnected with each other (via network links). The example in Figure 2.7 shows that a piece of equipment typically spans multiple layers and is (logically) built out of the atomic functions described earlier. The example shows an add/drop multiplexer (ADM) that allows adding/dropping up to four VC-4 tributary signals into/from an STM-N aggregate signal. An ADM always terminates two network links: This includes termination of the physical section (we implicitly assume here an optical section [OS]), the Regenerator Section (RS), and the Multiplex Section (MS). The add/drop functionality is provided by the S4_C connection function. By providing the appropriate S4/_A adaptation function, this ADM can connect to any kind of client network equipment (e.g., a PDH device or a router).

Figure 2.7 Example of an STM-N ADM. Atomic functions shown, from the STM-N aggregate signals up to the tributary/client signals: OSN_TT, OSN/RSN_A, RSN_TT, RSN/MSN_A, MSN_TT, MSN/S4_A, S4_C, S4_TT, S4/_A. Legend: X_TT = trail termination function in layer X; X/Y_A = adaptation function between server layer X and client layer Y; X_C = connection function in layer X. (ITU-T Recommendation G.806, “Characteristics of transport equipment—description methodology and generic functionality,” ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.)

Note also that the legend of Figure 2.7 outlines the naming convention used for the atomic functions mentioned throughout the remainder of this section. An optional suffix “_So” or “_Sk” may be added to indicate source or sink direction, respectively. In Figure 2.7, the atomic functions are assumed to be bidirectional. Note also that, in accordance with the standards, this naming convention uses “Sn” to indicate a “VC-n path.” For example, an S4/S12_A_So function corresponds to a source adaptation function from a VC-12 CP to a VC-4 AP.

SDH NEs are classified into three categories (Figure 2.8), depending on the number of aggregate signals terminated in the NE and the flexibility provided by the connection functions. A terminal multiplexer (TM) multiplexes a number of tributary signals into one aggregate signal. A TM typically does not provide any flexibility. Instead of a fixed time-slot assignment to the tributary signals, an optional connection function (notice the dashed line) might allow a flexible time-slot assignment. Add/drop multiplexers (ADMs) terminate two aggregate signals. Therefore, ADMs are typically used in a ring configuration: The “ring” then corresponds to the aggregate signal between the ADMs. The flexibility of the connection functions in an ADM is restricted to continuing the paths on the ring (thus, from the aggregate port at one side into the other aggregate port) or adding/dropping VCs into/from this signal. As is explained in Section 2.4, ring networks are very suitable for providing network recovery. Note also that ADMs are often configured as TMs by leaving out the second aggregate port. The highest flexibility is provided by digital cross-connects (DXCs). Indeed, there is no (functional) restriction on the number of terminated aggregate ports, and the featured connection functions provide full flexibility (e.g., as illustrated in Figure 2.8, any tributary port can be connected to another tributary port).

Figure 2.8 Classification of SDH network elements: terminal multiplexer, add/drop multiplexer (add/drop and continue), and digital cross-connect (tributary ports: top side; aggregate ports: bottom side). (ITU-T Recommendation G.806, “Characteristics of transport equipment—description methodology and generic functionality,” ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.)

Figures 2.7 and 2.8 might give the impression that NEs only feature connection functions at the VC-4 level, but this of course is not always the case. First, as mentioned earlier, not only VC-4s but also VC-3s can serve as higher order paths (in that case, all “S4” instances in Figure 2.7 have to be replaced by “S3”). Second, connection functions can be foreseen at the higher and/or the lower order path layer. For example, some of the S4/_A adaptation functions in the ADM of Figure 2.7 might be S4/S12_A adaptation functions connected to an S12_C connection function, realizing VC-12 layer functionality. A commonly used DXC naming convention is able to reflect all these possibilities: The format looks like DXC-X/Y1/.../YI, where X refers to the order of the higher order paths and Y1/.../YI lists the orders of the VCs being cross-connected. For example, a DXC-4/1 terminates VC-4 higher order paths (X = 4) and cross-connects VC-11s or VC-12s (Y1 = 1); in a DXC-4/4, VC-4 higher order paths also enter the DXC (X = 4) and are cross-connected (Y1 = 4); whereas in a DXC-4/4/1, VC-4 higher order paths also enter the DXC (X = 4) and are cross-connected (Y1 = 4) and possibly terminated for the cross-connection of VC-11s or VC-12s (Y2 = 1).

Finally, in Figures 2.7 and 2.8, it is implicitly assumed that client network equipment connects to an interface compliant with the client network technology and that the adaptation (e.g., the S4/_A function in Figure 2.7) is performed on the SDH NE inside the SDH network. An alternative is illustrated in Figure 2.9, in which the adaptation function is foreseen on the client device, which is connected via STM-1 signals (the additional SDH atomic functions are shown within the dashed rectangle) to the SDH device. Of course, the advantage is that the SDH network equipment need not be capable of adapting any possible client signal that might be transported over the SDH transport network infrastructure. Moreover, independent of the support for network recovery in the client network technology, this configuration allows for a standardized SDH linear multiplex section protection (MSP) of the interface between client and SDH network equipment (note that the recovery provided inside the SDH transport network does not cover the interface to the client network equipment).

For completeness, Figure 2.10 depicts the functional architecture of a regenerator device. As shown in Figure 2.3, a regenerator terminates only regenerator sections (RSs) and thus features RSN_TT plus RSN/MSN_A functions, whereas the MS signal transparently passes through the regenerator (thus, the signal leaving the RSN/MSN_A_Sk function is directly fed into the RSN/MSN_A_So function of the next regenerator section). Note that such an NE does not feature any connection function.
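The DXC naming convention described in this section (DXC-X/Y1/.../YI) is regular enough to parse mechanically. The helper below is a hypothetical illustration, not part of any standard or vendor tool: it splits a name into the order X of the higher order paths entering the cross-connect and the list of VC orders being cross-connected.

```python
def parse_dxc_name(name: str) -> dict:
    """Split a name such as 'DXC-4/4/1' into the HOP order X and the cross-connected orders Y."""
    prefix = "DXC-"
    if not name.startswith(prefix):
        raise ValueError(f"not a DXC name: {name!r}")
    orders = [int(part) for part in name[len(prefix):].split("/")]
    if len(orders) < 2:
        raise ValueError("expected at least X and one Y value")
    return {"higher_order_path": orders[0], "cross_connected": orders[1:]}


if __name__ == "__main__":
    for name in ("DXC-4/1", "DXC-4/4", "DXC-4/4/1"):
        print(name, parse_dxc_name(name))
    # DXC-4/4/1 -> {'higher_order_path': 4, 'cross_connected': [4, 1]}
```

Here a value of 1 in the Y list stands for VC-11 or VC-12 cross-connection, following the examples in the text.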
2.2.5
Summary . SDH network layers: regenerator section (RS), multiplex section (MS), higher order path (HOP) (HOP: VC-4 and/or VC-3), and lower order path (LOP) (LOP: VC-3, VC-2, VC-12, and/or VC-11) layers.
4 Times STM-1 Tributary Interface Client NE
OS
RS
MS
RS RS RS RS
OS OS OS OS
MS MS MS MS
MS MS MS MS
RS RS RS RS
S4/_A
SDH NE OS OS OS OS
To Client Network Functionality
STM-N Aggregate(s) to Other SDH NEs
S4_C
S4_TT
Figure 2.9 STM-1 tributary interface to client network equipment. (ITU-T Recommendation G. 806, ‘‘Characteristics of transport equipment—description methodology and generic functionality,’’ ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.)
Figure 2.10 Functional architecture of a regenerator. (ITU-T Recommendation G.806, "Characteristics of transport equipment – description methodology and generic functionality," ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.)
• SDH network elements: terminal multiplexers (TMs), add/drop multiplexers (ADMs), and digital cross-connects (DXCs) (plus regenerators without any flexibility).
• SDH interfaces: STM-N signals at N × 155,520 kbps. Each STM-N frame (N × 9 × 270 bytes) has a duration of exactly 125 μs.
• Defect detection: defects declared by the supervision of SDH maintenance signals (dAIS and dRDI) typically need the least time to be detected. dUNEQ and dLOA detection times also fall within the range of 5 ms. (The physical layer might detect dLOS before the SDH layers are able to declare any other defect.)
• Overhead bytes: from a network recovery perspective, the overhead bytes for the transport of AIS and RDI maintenance signals and for automatic protection switching (APS) signaling messages are very important. An overview is given in Table 2.2.
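The interface figures in this summary follow directly from the frame geometry. The following is a minimal Python sketch, not taken from any standard or from the original text; the function and constant names are illustrative assumptions. It derives the STM-N frame size and line rate from the 9 × 270-byte frame and the 125 μs frame period.

```python
# Illustrative helpers (hypothetical names); the numbers come from the summary above.
FRAME_PERIOD_US = 125     # every STM-N frame lasts exactly 125 microseconds
ROWS = 9                  # rows per frame
COLUMNS_PER_STM1 = 270    # byte columns per STM-1


def stm_frame_bytes(n: int) -> int:
    """Size of one STM-N frame in bytes (N * 9 * 270)."""
    return n * ROWS * COLUMNS_PER_STM1


def stm_line_rate_kbps(n: int) -> int:
    """Line rate of an STM-N signal in kbps."""
    bits_per_frame = stm_frame_bytes(n) * 8
    # bits per microsecond equals Mbps; multiplying by 1000 gives kbps
    return bits_per_frame * 1000 // FRAME_PERIOD_US


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(f"STM-{n}: {stm_frame_bytes(n)} bytes/frame, "
              f"{stm_line_rate_kbps(n)} kbps")
```

Keeping the rate as an integer number of kbps avoids floating-point rounding when the values are compared against table entries.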
Table 2.2 Overview of Overhead Bytes for Transport of Maintenance Signals and APS Signaling Messages
Signal/Level   Bit Rate Payload   AIS and RDI OH            APS OH
MS             150,336 kbps       K2b6-8                    K1 and K2b1-5
VC-4           149,760 kbps       H1 and H2; G1b5 and C2    K3b1-4
VC-3           48,384 kbps        H1 and H2; G1b5 and C2    K3b1-4
VC-2           6,784 kbps         V1 and V2; V5b5-8         K4b3-4
VC-12          2,176 kbps         V1 and V2; V5b5-8         K4b3-4
VC-11          1,600 kbps         V1 and V2; V5b5-8         K4b3-4

2.2.6 Differences between SONET and SDH
As stated earlier, the Synchronous Optical NETwork (SONET) technology is the U.S. counterpart of the Synchronous Digital Hierarchy (SDH) technology. Although both technologies adopt very similar frame structures, minor differences in frame structure and semantics prevent them from being fully compliant and thus interoperable with each other. These minor differences are not needed to understand this chapter and are therefore beyond its scope. The remainder of this chapter focuses on SDH, but the discussions are also applicable to SONET networks.
The main difference between the two technologies is that they build on different base signals. The SONET synchronous transport signal-1 (STS-1) is the base signal for SONET at a line rate of 51.84 Mbps, whereas the SDH synchronous transport module-1 (STM-1) is the base signal for SDH at a line rate of 155.52 Mbps. Thus, an SDH STM-N corresponds to a SONET STS-3N. The main reason that SONET is based on the STS-1 line rate is to better match the highest line rate of the U.S. Plesiochronous Digital Hierarchy (PDH), the predecessor of the SONET/SDH technology: a DS3 signal at 44.736 Mbps (whereas, e.g., the European PDH defines signals up to an E4 signal at 139.264 Mbps). Note that an optical signal carrying an STS-N signal is called an optical carrier level N (OC-N).
There is also a difference in terminology. An SDH regenerator section (RS), multiplex section (MS), and higher order path (HOP) are called, in SONET terminology, an STS section, STS line, and STS path, respectively. An SDH lower order path (LOP) is called a virtual tributary (VT) path in SONET terminology.
With respect to the remainder of this chapter (describing network recovery in SDH networks), we want to stress that for each SDH recovery technique, an equivalent SONET recovery technique exists. However, because SDH and SONET are not fully compliant and interoperable, neither are the SDH recovery techniques and their SONET equivalents.
Last but not least, SONET is standardized by the American National Standards Institute (ANSI), whereas SDH is mainly standardized by the International Telecommunication Union (ITU).
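The correspondence between the two hierarchies can be written down compactly. The sketch below is only an illustration under the relations stated in this section (STM-N corresponds to STS-3N, STS-1 runs at 51.84 Mbps); the function names and the dictionary are hypothetical and not part of any standard API.

```python
# Illustrative sketch (hypothetical names) of the SDH/SONET correspondence above.
STS1_KBPS = 51_840  # SONET base signal (STS-1) line rate in kbps


def sts_level_for_stm(n: int) -> int:
    """An SDH STM-N corresponds to a SONET STS-3N."""
    return 3 * n


def sts_rate_kbps(level: int) -> int:
    """Line rate of an STS-<level> signal (carried optically as OC-<level>)."""
    return level * STS1_KBPS


# Terminology mapping as given in the text (SDH term -> SONET term).
SDH_TO_SONET_TERMS = {
    "regenerator section (RS)": "STS section",
    "multiplex section (MS)": "STS line",
    "higher order path (HOP)": "STS path",
    "lower order path (LOP)": "VT path",
}

if __name__ == "__main__":
    for n in (1, 4, 16):
        level = sts_level_for_stm(n)
        print(f"STM-{n} <-> STS-{level}/OC-{level}: {sts_rate_kbps(level)} kbps")
```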
2.3 Operational Aspects
This section discusses how failures are detected and propagated through SDH networks and where the network recovery techniques fit in this architecture. Because this section is concerned mainly with the protocol aspects, the discussion of the different flavors of the recovery techniques (on a more abstract level) is left for Sections 2.4 through 2.6. Section 2.3.1 gives an overview of the fault management processes in SDH networks. Sections 2.3.2 and 2.3.3 continue with a more detailed discussion of how failures are detected and propagated inside a network element and on a network level (between the network elements), respectively. Then, the Automatic Protection Switching (APS) protocol (which relies on this notification mechanism) is discussed in Section 2.3.4. Finally, the major conclusions are recapitulated in Section 2.3.5.
2.3.1 Fault Management Processes
As mentioned in Section 2.1.2, fault management processes are crucial from a resilience point of view, because these processes are responsible for locating and reporting network failures and possibly for triggering network recovery actions. The hierarchical structure of the fault management processes in transmission networks is illustrated in Figure 2.11.
Each fault management level consists of one or more filters. At the lowest level, filter f1 is responsible for the supervision of (part of the overhead bytes in) the data signal. When a certain anomaly in the data signal persists for a certain period of time (a number of frames or multiframes), this filter declares the defect corresponding to this anomaly. For each possible defect, Section 2.2.3 discussed which overhead byte to supervise and how long the detection lasts. Filters f2 and f3 correlate the defect declarations within an atomic function: f2 triggers the appropriate consequent action inside the atomic function, and f3 reports the fault cause to the element management function (EMF). Based on the fault cause reports coming from all atomic functions, filter f4 in the EMF can define the failure. The EMF filters f5–f8 then generate the necessary alarms and failure reports to the higher management layers. For example, based on the alarms and failure reports received from different network elements, the central network management system (NMS) may reconfigure the network so that affected connections are provisioned along another route. In accordance with Chapter 1, such actions are classified as restoration actions, which are the topic of Section 2.6.
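The persistence check performed by an f1 filter can be illustrated with a small state machine. The following Python sketch is an assumption-laden illustration, not the G.806 definition: the class name, the per-frame boolean input, and in particular the clearing rule (a defect is cleared after a number of consecutive anomaly-free frames) are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class PersistenceFilter:
    """Toy f1 filter: declares a defect once an anomaly persists long enough.

    declare_after: consecutive anomalous frames needed to declare the defect.
    clear_after:   consecutive clean frames needed to clear it (an assumption;
                   the text above only describes the declaration side).
    """
    declare_after: int
    clear_after: int
    defect: bool = False
    _run: int = 0  # consecutive frames contradicting the current state

    def observe(self, anomaly: bool) -> bool:
        """Feed one frame (or multiframe) observation; return the defect state."""
        if anomaly != self.defect:
            self._run += 1
        else:
            self._run = 0
        threshold = self.clear_after if self.defect else self.declare_after
        if self._run >= threshold:
            self.defect = not self.defect
            self._run = 0
        return self.defect


if __name__ == "__main__":
    f1 = PersistenceFilter(declare_after=3, clear_after=3)
    frames = [True, True, False, True, True, True, True]
    print([f1.observe(a) for a in frames])
```

A single clean frame resets the counter, so an isolated anomaly never leads to a defect declaration; only an uninterrupted run of `declare_after` anomalous frames does.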
The main focus of this chapter is on network protection (restoration is not yet widely adopted in SDH networks), which relies on preprovisioned backup resources and on network elements switching over to these backup resources autonomously (thus, without any intervention of a [central] network management system). Filters f1 and f2 therefore play a key role throughout the remainder of this chapter. A closer look at Figure 2.11 shows that the detection process (filter f1) and the management reporting functions (filters f3, f4, f5–f8) are present only in the sink direction, whereas the consequent actions (requested by filter f2) are present in both directions. The reason is that consequent actions involve the generation of the appropriate maintenance signals needed for failure propagation.
Filter f2 in a TT sink function triggers three consequent actions when one of its f1 filters declares a defect. The first consequent action notifies downstream atomic functions and network elements that a defect affecting the corresponding data signal was detected: the content of the data signal is replaced with an all 1s signal, the alarm indication signal (AIS). As explained in Figure 2.6, downstream atomic functions will detect the presence of this AIS signal within four frames or multiframes. To avoid this additional detection time in downstream atomic functions and thus to speed up the propagation and notification process, the second consequent action in a TT sink function involves the generation of a network element internal auxiliary parallel (thus, separated from
Figure 2.11 Fault management hierarchical structure. (ITU-T Recommendation G.806, "Characteristics of transport equipment – description methodology and generic functionality," ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.)
Blocks shown in the figure: Adaptation Source Function (A_So), Adaptation Sink Function (A_Sk), Trail Termination Sink Function (TT_Sk), Trail Termination Source Function (TT_So), and Element Management Function (EMF), which connects to the NMS via the TMN.
Legend:
1, 2, 3: Defects (dXXX), Fault Cause (cZZZ), Failure (fZZZ)
CI_D: Characteristic Information - Data Signal
CI_SSF: Characteristic Information - Server Signal Fail Signal
AI_D: Adapted Information - Data Signal
AI_TSF: Adapted Information - Trail Signal Fail Signal
RI_RDI: Remote Information - Remote Defect Indication Signal
aAIS: Consequent Action to Insert an Alarm Indication Signal (AIS)
aRDI: Consequent Action to Insert a Remote Defect Indication (RDI) Signal
aSSF: Consequent Action to Enable the CI_SSF
aTSF: Consequent Action to Enable the AI_TSF
fY: Filter Y
the stream of STM-N frames) trail signal fail (TSF) signal. This TSF signal explicitly notifies downstream atomic functions inside the same network element whether a defect that affects the trail has already been declared in an upstream atomic function. The third consequent action does not involve notification in the downstream direction but in the upstream direction. For this purpose, TT functions connect not only to access points (APs) and (termination) connection points ([T]CPs), but also to remote reference points (RPs). Via such an RP, TT functions can exchange remote information (RI). For example, the TT sink function can request the corresponding TT source function (belonging to the opposite trail) to notify the upstream NE where the terminated trail originates that a defect affecting that trail has been detected, by sending a remote information (RI) remote defect indication (RDI) signal (i.e., RI_RDI) through the RP.
Filter f2 in an A sink function triggers two consequent actions when one of its f1 filters declares a defect, similar to the two consequent actions in a TT sink function for downstream notification. The generation of an AIS is the same as in a TT sink function, but instead of a TSF signal, a server signal fail (SSF) signal is generated for downstream notification. Note that generating these network element internal auxiliary parallel (thus separated from the stream of STM-N frames) SSF and TSF signals is useful only when they are also accepted as input in downstream atomic functions. More precisely, A and TT sink functions accept a TSF and an SSF signal, respectively, to avoid additional dAIS detection time or simply to be able to learn about upstream defect declarations (when no f1 filter declaring dAIS defects is present in that particular atomic function).
As shown in Figure 2.11, an A source function can also accept an SSF signal; this allows the A source functions responsible for pointer generation to insert an AU_AIS or TU_AIS in the data signal. A TT source function should be able to receive a remote information remote defect indication (RI_RDI) signal so that it can take the consequent action of inserting a remote defect indication (RDI) signal in the data signal of the trail opposite to the one on which a defect was declared.
In summary, in addition to the immediate generation and handling of the network element internal auxiliary parallel (thus, separated from the stream of STM-N frames) SSF and TSF signals, there are two other consequent actions that impact the data signal: the insertion of an AIS and of an RDI signal. Upon activation of aAIS, the AIS signal will be inserted within two frames or multiframes (more precisely, within 250 μs or 1 ms). Upon activation of aRDI in the TT sink function, the corresponding TT source function should insert the RDI signal within 1 ms, or within 4 ms in the case of a VC-2/12/11. The following sections elaborate on these processes and demonstrate with examples how they interwork.
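The insertion bounds just summarized can be captured in a small lookup. This is a minimal Python sketch under the figures stated in the text (two frames or multiframes for AIS, 1 ms or 4 ms for RDI); the layer labels, constant names, and function names are illustrative assumptions, and the 500 μs multiframe period for VC-2/12/11 is the value used later in Section 2.3.2.

```python
# Illustrative lookup (hypothetical names) for the consequent-action bounds above.
FRAME_US = 125           # frame period
MULTIFRAME_US = 500      # multiframe period, applies to VC-2/12/11
MULTIFRAME_LAYERS = {"VC-2", "VC-12", "VC-11"}


def period_us(layer: str) -> int:
    """Frame or multiframe period, in microseconds, governing the given layer."""
    return MULTIFRAME_US if layer in MULTIFRAME_LAYERS else FRAME_US


def ais_insertion_bound_us(layer: str) -> int:
    """AIS is inserted within two frames or multiframes after aAIS."""
    return 2 * period_us(layer)


def rdi_insertion_bound_us(layer: str) -> int:
    """RDI is inserted within 1 ms, or within 4 ms for VC-2/12/11."""
    return 4000 if layer in MULTIFRAME_LAYERS else 1000


if __name__ == "__main__":
    for layer in ("MS", "VC-4", "VC-3", "VC-12"):
        print(layer, ais_insertion_bound_us(layer), rdi_insertion_bound_us(layer))
```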
2.3.2 Fault Detection and Propagation Inside a Network Element
In the previous section, we highlighted the consequent actions atomic functions can take to enable fault propagation and notification in the downstream and upstream
directions. The goal of this section is to study how a network element (consisting of multiple atomic functions as explained in Section 2.2.4) reacts to failures and the resulting maintenance signals. Such information is necessary to study how the propagation and notification process looks on a network-wide level, which is discussed in the next section. The intention of this section is not to discuss every detail of the behavior of the various network elements; it aims at highlighting the main mechanism by means of a few simplified examples. We refer you to [G783], [G806], [ETSI1], and [Sex92] for a more complete and detailed overview of the behavior of all atomic functions.
Figures 2.12 through 2.17 show, for several examples, the time from left to right (remember that each frame lasts exactly 125 μs) and, from top to bottom, the path the data signal follows from ingress to egress. Light gray in the frame formats refers to digitized noise, whereas dark gray represents an all 1s signal.
The first example (Figure 2.12) shows how a regenerator (see also Figure 2.10) would react when it stops receiving an optical signal at one of its input ports, for example, as a result of a cut of the incoming fiber. The OS_TT_Sk function will almost immediately detect that no optical signal power is coming in any longer and thus declare the dLOS defect. It enables the TSF signal (aTSF) and starts inserting an all 1s signal within two frames (aAIS) (see footnote 8). The downstream atomic sink functions
Figure 2.12 Cable cut immediately upstream of a regenerator (top → down, from ingress to egress; left → right, time; light gray, digitized noise; dark gray, all 1s).
Footnote 8: For simplicity, we consider that atomic functions start inserting the all 1s signal at the beginning of a new frame/container.
forward the TSF (more precisely, the OS/RS_A_Sk and RS/MS_A_Sk functions translate the incoming TSF into an SSF, and the RS_TT_Sk function translates the incoming SSF signal into a TSF signal) and start "refreshing" the AIS signal within two frames by inserting an all 1s signal in the downstream direction. The first adaptation source function (here, RS/MS_A_So) is the last atomic function to refresh the AIS signal, within two frames after receiving the SSF from the last upstream adaptation sink function, and it stops the forwarding of the SSF as a TSF. The output signal from the first adaptation source function is then transparently handled and encapsulated in the downstream atomic source functions. For example, the RS_TT_So function adds RSOH bytes without caring whether the encapsulated payload is legal, corrupted, or all 1s. Note that in the upstream sink part the RS_TT_Sk function already terminated the incoming RS trail by stripping off and processing the corresponding RSOH bytes.
In summary, the regenerator produces an MS_AIS signal (thus, the MSOH, AU pointer, and HOP overwritten with all 1s) within two frames after the declaration of the dLOS defect in the OS_TT_Sk function. Finally, there is no remote defect indication (RDI) signal to notify the immediate upstream SDH network element that the corresponding RS has been affected by a failure. Indeed, it is sufficient to have an RDI at the MS level, because (in accordance with Section 2.2.4) regenerators do not participate in the actual recovery actions (they do not feature any connection function) and all other network elements terminate the MS trails anyway.
The example in Figure 2.13 considers a slightly different scenario. The receiver keeps getting an optical signal, but at some point, the signal gets heavily distorted
Figure 2.13 A heavily distorted/noisy signal entering a regenerator (top → down, from ingress to egress; left → right, time; light gray, digitized noise; dark gray, all 1s).
(the regenerator starts receiving noise). Note the similarities and differences between Figures 2.12 and 2.13. First, in Figure 2.13, the OS_TT_Sk function no longer declares a defect, which implies that it keeps converting the received (corrupted) optical signal into an electrical signal. However, the OS/RS_A_Sk function loses track of the frame alignment bytes and after a relatively long period (3 ms, or 24 frames) gives up and declares the loss of frame defect (dLOF). From that moment, the process is quite similar to that shown in Figure 2.12. The atomic function declaring the defect (here, the OS/RS_A_Sk function) starts sending an SSF signal (or a TSF signal in the case of a TT_Sk function) and sends out an AIS signal by inserting an all 1s signal. The fact that this process starts much later than in the previous example is the second important difference between the two figures. In the end, the result is similar: Once the regenerator has been able to declare a defect, it generates an MS_AIS signal within two frames after the defect declaration.
Figures 2.14 through 2.17 consider the cross-connection of a VC-4 path (e.g., in a DXC-4/4, see also Figure 2.8). Note the additional MS_TT_Sk, MS/S4_A_Sk, S4_C, MS/S4_A_So, and MS_TT_So functions in the signal path from ingress to egress. Note also that the frame format has been extended with the format of the VC-4 (the column in the middle represents the VC-4 path overhead and determines the beginning of the VC-4 container; see footnote 9). The first example (Figure 2.14) considers a scenario similar to that in Figure 2.12: A fiber cut immediately upstream of the NE interrupts the receipt of the optical signal. The process is also similar to that in Figure 2.12. The OS_TT_Sk function almost immediately detects this failure and declares the dLOS defect, enables the TSF signal, and starts sending out an AIS signal within two frames by inserting an all 1s signal.
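Returning briefly to the two regenerator examples of Figures 2.12 and 2.13, the difference in reaction time can be tallied in a few lines. The sketch below is only an illustration: the names are hypothetical, and modeling the "almost immediate" dLOS detection as zero frames is an assumption made for the example.

```python
# Illustrative tally (hypothetical names) of the regenerator examples above.
FRAME_US = 125
AIS_INSERTION_FRAMES = 2      # all 1s inserted within two frames after the declaration

DLOS_DETECTION_FRAMES = 0     # assumption: "almost immediately" modeled as 0 frames
DLOF_DETECTION_FRAMES = 24    # 3 ms at 125 microseconds per frame


def frames_until_ais(detection_frames: int) -> int:
    """Upper bound, in frames, from failure onset to MS_AIS at the output."""
    return detection_frames + AIS_INSERTION_FRAMES


if __name__ == "__main__":
    for name, detect in (("dLOS", DLOS_DETECTION_FRAMES),
                         ("dLOF", DLOF_DETECTION_FRAMES)):
        frames = frames_until_ais(detect)
        print(f"{name}: MS_AIS within {frames} frames ({frames * FRAME_US} us)")
```

With these inputs the dLOF case yields 26 frames, which matches the 26 distorted frames annotated in Figure 2.13.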
Although the principle of signal propagation is similar to that in a regenerator, a different behavior will be perceived because of the additional atomic functions (which are responsible for processing the AU pointer and the VC-4 POH) in the signal path from ingress to egress. The MS_TT_Sk function will strip off the MSOH bytes from the received signal, and the MS/S4_A_Sk function will extract the VC-4 containers while removing the AU pointers. Because of the receipt of the TSF signal and the fact that the VC-4 containers are not aligned with the STM frame structure, the insertion/refreshing of the all 1s signal will start before the receipt of the first bytes of the all 1s signal (note that insertion should start within two frames and that we assumed for simplicity that insertion starts on frame/container boundaries). This AIS signal will travel through the HOP connection function (S4_C) together with the SSF from the MS/S4_A_Sk to the MS/S4_A_So function. The latter is the first source function in the data signal path, and thus, this source adaptation function will stop the forwarding of the SSF signal and insert an all 1s signal within two frames after the receipt of the SSF signal.
Footnote 9: Following the STM-1 frame format, depicted in Figure 2.5, the AU pointer (more precisely, the fourth row in the nine columns at the left) determines the start of the VC-4, or in other words, the position of the column with the VC-4 path overhead in the 261 columns at the right.
In contrast to the RS/MS_A_So function in the regenerator, the MS/S4_A_So function has to add the AU pointer bytes. In other words, it will add some more all 1s bytes to the all 1s that it receives. This is important because these additional all 1s bytes allow downstream NEs to declare an AUdAIS defect. Note that the MS/S4_A_So function will not insert all 1s in these additional AU pointer bytes unless it receives an SSF, or in other words, unless a defect has been detected and declared in the upstream part of the signal path.
Note also the RDI signal in Figure 2.14: The receipt of the SSF signal in the MS_TT_Sk function triggers the remote defect indication (RDI) consequent action in the atomic function, resulting in an MS_RDI signal being inserted within 1 ms in
Figure 2.14 Cable cut immediately upstream of a VC-4 cross-connection (top → down, from ingress to egress; left → right, time; light gray, digitized noise; dark gray, all 1s).
the MSOH of the opposite MS trail. In summary, the result is that the NE starts generating an AU_AIS signal downstream (all 1s in the AU pointer bytes [thus, H1 and H2] plus all VC-4 container bytes) within two frames and returning an MS_RDI signal upstream (K2b6-8 set to 110; see footnote 10) within 1 ms after having detected and declared the dLOS defect.
In Figure 2.15 we assume that the inlet (the incoming fiber) is connected to the regenerator outlet (the outgoing fiber) of Figure 2.12. The MS_TT_Sk and the MS/S4_A_Sk functions will detect within one frame the all 1s signal in the MSOH and the AU pointer bytes, respectively, and will therefore declare within three frames the MSdAIS and AUdAIS defects, respectively, and start refreshing the all 1s signal and sending a TSF and an SSF signal, respectively. The result is again that an AU_AIS signal is generated downstream within two frames and that an MS_RDI signal is returned upstream within 1 ms after the declaration of these defects.
The scenario considered in Figure 2.16 is slightly different in the sense that the inlet is now connected to the outlet of the VC-4 cross-connection in Figure 2.14 instead of to the outlet of the regenerator in Figure 2.12. The NE now receives an AU_AIS instead of an MS_AIS signal (the received MSOH is correct because the MS trail starts in the NE performing the VC-4 cross-connection of Figure 2.14). The main difference is that only the MS/S4_A_Sk function declares a defect (AUdAIS, within three frames), whereas a new AU_AIS signal is still generated downstream within two frames after declaring the defect. Note also that the receipt of a proper MS trail signal avoids the need to return an MS_RDI signal upstream (no consequent actions are triggered in the MS_TT_Sk function). The conclusion is thus that a VC-4 cross-connection forwards an MS_AIS or AU_AIS within 3 + 2 = 5 frames.
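The XXbY notation used for the maintenance signals maps directly onto simple bit operations. The following Python sketch decodes K2b6-8; it assumes that bit 1 denotes the most significant bit of the byte (so bits 6 to 8 are the three least significant bits), and the helper names are hypothetical. The code 110 for MS_RDI is taken from the text; 111 for MS_AIS follows from the MSOH being overwritten with all 1s.

```python
# Illustrative decoding (hypothetical names) of bits 6-8 of the K2 byte.
from typing import Optional


def bits(byte: int, first: int, last: int) -> int:
    """Return bits first..last of a byte as an integer.

    Assumes bit 1 is the most significant and bit 8 the least significant bit.
    """
    width = last - first + 1
    shift = 8 - last
    return (byte >> shift) & ((1 << width) - 1)


def ms_maintenance_signal(k2: int) -> Optional[str]:
    """Classify the maintenance signal carried in K2b6-8, if any."""
    code = bits(k2, 6, 8)
    if code == 0b110:
        return "MS_RDI"
    if code == 0b111:      # MSOH overwritten with all 1s
        return "MS_AIS"
    return None


if __name__ == "__main__":
    for k2 in (0b00000110, 0b11111111, 0b00000000):
        print(f"K2={k2:08b}: {ms_maintenance_signal(k2)}")
```

Bits 1 to 5 of K2 are ignored here because, per Table 2.2, they belong to the APS signaling rather than to the AIS and RDI indications.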
Footnote 10: Remember that we use the notation XXbY throughout this chapter to indicate bit Y in the overhead byte XX.
How would a regenerator react to the input signals considered in Figure 2.15 (MS_AIS signal) and Figure 2.16 (AU_AIS signal)? Because all atomic functions in a regenerator monitor only the RSOH bytes, and these AIS signals always have correct RSOH bytes, they pass transparently through a regenerator. Note also that there is no such thing as an RS_AIS signal.
On the other hand, an LOP cross-connection (e.g., a VC-12 cross-connected in a DXC-4/1) behaves similarly to an HOP cross-connection (e.g., a VC-4 cross-connected in a DXC-4/4). In the signal path, an Sn_TT_Sk, an Sn/Sm_A_Sk, an Sm_C, an Sn/Sm_A_So, and an Sn_TT_So function (where n and m represent the order of the higher and lower order paths, respectively) are added for the LOP cross-connection. The first adaptation source function is in this case the Sn/Sm_A_So function, and thus, such a cross-connection will generate a TU_AIS (the lower order path payload and overhead plus the TU pointer bytes, that is, the V1 and V2 bytes in the case of a VC-2/12/11 or the H1 and H2 bytes in the case of a VC-3, contain an all 1s signal) within two frames or multiframes (more precisely, 2 × 500 μs = 1 ms, except in the case of a VC-3, where it takes only 2 × 125 μs = 250 μs) instead of an AU_AIS signal as in the case of an HOP cross-connection. Of course, the Sn/Sm_A_So function
Figure 2.15 Incoming MS_AIS signal (= output from Figure 2.12) (top → down, from ingress to egress; left → right, time; light gray, digitized noise; dark gray, all 1s).
Figure 2.16 Incoming AU_AIS signal (= output from Figure 2.14) (top → down, from ingress to egress; left → right, time; light gray, digitized noise; dark gray, all 1s).
can generate this TU_AIS signal only after a defect has been detected and declared in the upstream sink part of the signal path. The OS_TT_Sk function might still declare a dLOS defect almost immediately, whereas the MS_TT_Sk or MS/Sn_A_Sk function might still declare an MSdAIS or an AUdAIS defect, respectively, within three frames (thus, 3 × 125 μs = 375 μs). In all these cases, the Sn_TT_Sk function will trigger the consequent action for returning an HOP_RDI signal upstream (which will be inserted within 1 ms by the corresponding Sn_TT_So function). In all cases except the last one, the MS_TT_Sk function will also trigger the consequent action for returning an MS_RDI signal (as already demonstrated earlier) within 1 ms. In addition, the Sn/Sm_A_Sk function might detect a TUdAIS within three frames or multiframes (more precisely, 3 × 500 μs = 1.5 ms in the case of a VC-2/12/11).
In summary, an LOP cross-connection generates a TU_AIS within two frames or multiframes to notify that it has detected a failure or to forward any AIS signal. In the worst case (forwarding a TU_AIS signal in the case of a VC-2/12/11 cross-connection), this might take up to (3 + 2) × 500 μs = 2.5 ms. Depending on the incoming signal, it might also return an MS_RDI and/or an HOP_RDI signal within 1 ms.
Finally, coming back to an HOP cross-connection, Figure 2.17 shows that the process of detecting a failure and subsequently propagating this as an AIS signal through the network is not a simple sequence of actions performed one after the other. Figure 2.17 considers that the inlet of a VC-4 cross-connection is connected to the outlet of the regenerator of Figure 2.13. That regenerator kept forwarding the heavily distorted incoming signal before it decided to declare the dLOF defect and started generating the MS_AIS signal. More precisely, the VC-4 cross-connection receives the first all 1s only after 26 distorted frames.
However, the MS/S4_A_Sk function needs only 8 to 10 frames to decide that it has lost track of the AU pointer and thus declares the AUdLOP defect. This results in the generation of an AU_AIS signal within two frames. After a while, the MS_AIS signal comes in, which results in the declaration of the MSdAIS and AUdAIS defects in the MS_TT_Sk and MS/S4_A_Sk functions, respectively. In other words, the AUdLOP has been overruled by the incoming MS_AIS signal. Although the NE simply keeps generating an AU_AIS signal, it is only at this moment that the return of an MS_RDI signal upstream is triggered.
To conclude this section, we can summarize the examples as follows. After an NE declares a defect, it generates an AIS signal (an all 1s signal) downstream and an RDI signal upstream. Of course, the time to detect a failure depends on the actual received signal (see also Section 2.2.3). From the examples, we clearly see that the time to declare a dAIS defect is important (because the AIS signals are responsible for the failure propagation process in the downstream direction). Declaring a dAIS defect takes three frames or multiframes. The last example illustrates that having different active defect declaration filters (f1 filters) in an NE can significantly complicate the failure propagation process, because these filters run in parallel and independently of each other. In accordance with Section 2.3.1, once a failure has been declared, an AIS signal will be inserted in a downstream direction within two frames or multiframes and an RDI signal in an
Figure 2.17 Late arrival of MS_AIS signal in a VC-4 cross-connection (top → down, from ingress to egress; left → right, time; light gray, digitized noise; dark gray, all 1s).
upstream direction within 1 (or 4) ms. The type of AIS signal that is inserted depends on the type of the NE. The examples show that a regenerator inserts an MS_AIS signal, whereas a higher order path cross-connection results in the insertion of an AU_AIS signal. The inserted RDI signal also depends on the type of the NE. In the examples, only an MS_RDI is inserted (by the MS_TT functions), but Sn_TT functions (not considered in the examples) can also insert HOP_RDI or LOP_RDI signals. When TT functions at multiple levels are involved, an NE can insert multiple RDI signals simultaneously (of course, each of them transported in the appropriate overhead bytes).
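The interplay of parallel f1 filters in the last example (Figure 2.17) can be laid out as a worst-case event timeline. The sketch below is an illustration only: the function name, the event labels, and the choice of the first distorted frame as time zero are assumptions, while the frame counts (8 to 10 frames for AUdLOP, 26 distorted frames, three frames for dAIS, two frames for AIS insertion, 1 ms for RDI) are those quoted in this section.

```python
# Illustrative worst-case timeline (hypothetical names) for the Figure 2.17 scenario.
# Time zero is assumed to be the arrival of the first distorted frame.
FRAME_US = 125


def figure_2_17_timeline(aulop_detect_frames: int = 10,
                         ms_ais_arrival_frames: int = 26) -> dict:
    """Return the latest frame number at which each event occurs."""
    events = {}
    events["AUdLOP declared"] = aulop_detect_frames
    events["AU_AIS inserted downstream"] = aulop_detect_frames + 2
    events["first all 1s (MS_AIS) received"] = ms_ais_arrival_frames
    events["MSdAIS and AUdAIS declared"] = ms_ais_arrival_frames + 3
    # RDI insertion is bounded by 1 ms, that is, 1000 / 125 = 8 frames
    events["MS_RDI inserted upstream"] = (
        events["MSdAIS and AUdAIS declared"] + 1000 // FRAME_US
    )
    return events


if __name__ == "__main__":
    for label, frame in sorted(figure_2_17_timeline().items(), key=lambda kv: kv[1]):
        print(f"frame {frame:>2} ({frame * FRAME_US} us): {label}")
```

The ordering makes the point of the example visible: the AU_AIS leaves the NE well before the MS_AIS arrives, whereas the MS_RDI can only be triggered after the MSdAIS declaration.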
2.3.3 Fault Propagation and Notification on a Network Level
The previous section has shown that fault propagation and notification are done by sending alarm indication signals (AISs) downstream and returning remote defect indication (RDI) signals upstream, but that the actual behavior of an NE strongly depends on the form of the incoming signal and thus on how the NEs are interconnected. This section investigates how fault information propagates through a network. For this purpose, we consider a network example consisting of 10 NEs: five DXCs cross-connecting an LOP (DXC-4/1 A, B, H, I, and J in Figures 2.18 and 2.19 and DXC-4/3 A, B, H, I, and J in Figure 2.20), three DXC-4/4s cross-connecting a VC-4 (DXC C, F, and G) through which the lower order (LO) VC is routed, and two regenerators (D and E). The lower order path is routed from A to J and cross-connected in B, H, and I. The VC-4 under consideration is routed between DXC-4/1s B and H and cross-connected in the DXC-4/4s C, F, and G. The link between DXCs C and F is considered very long, requiring two regenerators (D and E).
Let us consider the example of a fiber cut in front of regenerator D. The behavior of regenerator D, DXC-4/4 F, and DXC-4/4 G has already been explained in detail in Figures 2.12, 2.15, and 2.16, respectively. Figure 2.18 gives a detailed but incomplete (i.e., not all possible defect declaration filters are shown) overview of the most critical/interesting processes (defect declarations and AIS and RDI signal insertion) running inside the NEs and of the signals they exchange with each other. Note that the dashed arrows represent TSF and SSF signals. Figures 2.19 and 2.20 show the AIS and RDI signals between the NEs in a time diagram. Regenerator D detects the loss of signal resulting from the fiber cut, declares the dLOS defect, and generates an MS_AIS signal at its output within two frames.
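Using the per-element bounds established in Section 2.3.2, the downstream AIS propagation for this fiber cut can be tallied hop by hop. The following Python sketch is a rough illustration under stated assumptions: propagation delays are ignored, the dLOS detection in regenerator D is modeled as taking zero frames, and the list structure and names are hypothetical.

```python
# Illustrative hop-by-hop tally (hypothetical names); propagation delays are ignored.
FRAME_US = 125

# (network element, frames it adds before the AIS leaves it), per Section 2.3.2
DOWNSTREAM_PATH = [
    ("Regenerator D", 2),   # dLOS assumed immediate, MS_AIS inserted within 2 frames
    ("Regenerator E", 0),   # MS_AIS passes through transparently
    ("DXC-4/4 F", 3 + 2),   # MSdAIS/AUdAIS within 3 frames, AU_AIS within 2 frames
    ("DXC-4/4 G", 3 + 2),   # AUdAIS within 3 frames, AU_AIS within 2 frames
]


def ais_departure_frames(path):
    """Cumulative worst-case frame count at which the AIS leaves each NE."""
    elapsed = 0
    result = []
    for name, frames in path:
        elapsed += frames
        result.append((name, elapsed))
    return result


if __name__ == "__main__":
    for name, frames in ais_departure_frames(DOWNSTREAM_PATH):
        print(f"{name}: AIS leaves after at most {frames} frames "
              f"({frames * FRAME_US} us)")
```

Under these assumptions the AU_AIS leaves DXC-4/4 G, and thus reaches the next NE on the VC-4 trail, after at most 12 frames, or 1.5 ms.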
It has already been explained in the previous section that this MS_AIS signal simply passes through regenerator E, because it contains only atomic functions processing RSOH bytes (which are legal, because the RS trail starts in regenerator D). The DXC-4/4 F detects the MS_AIS signal and, therefore, declares the MSdAIS defect. Because this DXC terminates the MS trail, it tries to notify the origin of the MS trail (thus, the DXC-4/4 C) by returning an MS_RDI signal upstream. Almost simultaneously, another atomic function in this DXC-4/4 will also declare the
AUdAIS defect (indeed, an MS_AIS signal implicitly carries an AU_AIS signal). Assuming that the upstream direction is not affected by the fiber cut, DXC-4/4 C will receive this MS_RDI signal and declare the dRDI defect within three to five frames (that is, within 375 to 625 µs). The DXC-4/4 F then forwards the AIS signal within two frames after the declaration of the MSdAIS or AUdAIS defect. As illustrated in Figure 2.16, a DXC receiving this AU_AIS signal (here, DXC-4/4 G) will forward this signal within two frames after the declaration of the AUdAIS defect (which takes up to three frames).
Figure 2.18 Overview of atomic functions and their responsibility in the fault propagation and notification process. (C. Brianza, et al. "Deliverable D2a: Overall Network Protection, Version 1," deliverable from the ACTS-project PANEL, April 1997).
Figure 2.19 Time diagram for a VC-12 being cross-connected by DXC-4/1s.
Although DXC-4/1 H cross-connects VC-12s, it also terminates the VC-4 originating in DXC B. Because this VC-4 is affected by the fiber cut, DXC-4/1 H will receive after a while (more precisely, within 2 + 5 + 5 = 12 frames, or 1.5 ms when ignoring the propagation delays) an AU_AIS signal and declare the AUdAIS defect within three frames. The declaration of this defect will result in returning an HOP_RDI signal upstream to notify the origin of the VC-4 (here, DXC-4/1 B) that the HOP has been affected. In parallel to this process, the DXC-4/1 H will declare the TUdAIS defect. It is important to note that the TU_AIS signal, already
implicitly present in the MS_AIS signal generated by regenerator D (see footnote 11), can simply transit the DXC-4/4s without being delayed, while the AU_AIS signals incur some delay when propagating, or "rippling," through the intermediate DXC-4/4s. This results in a race between the TUdAIS and AUdAIS defect declaration processes in DXC-4/1 H. Because the worst-case scenario has been assumed for the AU_AIS propagation process in DXC-4/4 F and G (3 + 2 = 5 frames), the TUdAIS defect declaration process in DXC-4/1 H slightly wins this race. However, DXC-4/4 F and G are allowed to insert an all 1s signal in the AU pointer bytes already after one frame. In this case, the AU_AIS rippling/propagation process would perform best. The reason is that the TUdAIS defect declaration process is rather slow when it is based on multiframes. But as illustrated in Figure 2.20, a VC-3 instead of a VC-12 being cross-connected as LOP would definitely result in the TUdAIS defect declaration process winning the race. Also in the case of a VC-12 being cross-connected, the TUdAIS defect declaration process would definitely be faster if there were at least one or two more intermediate DXC-4/4s through which the AU_AIS signal has to ripple.
Figure 2.20 Time diagram for a VC-3 being cross-connected by DXC-4/3s.
Finally, from the moment that either defect has been declared, DXC-4/1 H will forward the AIS signal as a TU_AIS signal within two frames or multiframes. Just as the AU_AIS signal had to ripple through the intermediate DXC-4/4 G, this TU_AIS signal has to ripple through the intermediate DXC I before reaching the destination DXC J. After declaring the TUdAIS defect, within three frames or multiframes, DXC J will return an RDI signal upstream to notify the origin of the LOP (i.e., DXC A) that the LOP has been affected.
Comparing Figure 2.19 with Figure 2.20, the propagation and notification process becomes relatively slow at the LOP layer when the LOPs are based on a multiframe structure (e.g., in the case of a VC-12, as illustrated in Figure 2.19). For example, in Figure 2.19, it may take up to 2.5 ms for the TU_AIS signal to ripple through the intermediate DXC-4/1 I.
Although having more than one recovery technique is beyond the scope of this chapter (it is the subject of Chapter 6), the race condition mentioned earlier (in DXC H) is a good illustration of the potential risk of triggering multiple recovery actions at almost the same time. The race in Figure 2.19 might trigger recovery actions in DXC H at both the HOP and LOP layers quasi-simultaneously, based on the declaration of the AUdAIS and TUdAIS defects, respectively. Chapter 6 will delve into the details of why such a situation is often not desirable and how it can be avoided.
11 The MS_AIS signal consists of all 1s in the MSOH bytes, the AU pointer bytes (H1 and H2), and the VC-4 path overhead and payload bytes. The VC-12 together with its TU pointer bytes (V1 and V2) are encapsulated in the payload of the VC-4 (the C-4 container) and, thus, contain an all 1s signal: the TU_AIS signal. Because the intermediate DXC-4/4s transparently cross-connect the VC-4 payload, they also leave the TU_AIS signal intact.
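The race in DXC-4/1 H described above can be checked with simple frame arithmetic. The following is a minimal sketch under the worst-case assumptions used in the text (three units to declare an AIS defect, two units to insert the AIS signal, propagation delays ignored). All function and constant names are hypothetical, and the 4-frame length of a VC-12 multiframe (500 µs) is an assumption added here to convert multiframes into frames.

```python
FRAME_US = 125          # one SDH frame lasts 125 microseconds
MULTIFRAME_FRAMES = 4   # assumption: a VC-12 multiframe spans 4 frames (500 us)

DETECT = 3  # worst-case units (frames or multiframes) to declare a dAIS defect
INSERT = 2  # worst-case units to insert the AIS signal after the declaration


def au_dais_frames(intermediate_dxc44: int = 2) -> int:
    """Frames from dLOS in regenerator D until DXC H declares AUdAIS.

    The AU_AIS signal ripples through every intermediate DXC-4/4
    (F and G in the example), each adding DETECT + INSERT frames.
    """
    arrival = INSERT + intermediate_dxc44 * (DETECT + INSERT)
    return arrival + DETECT


def tu_dais_frames(unit_frames: int) -> int:
    """Frames from dLOS until DXC H declares TUdAIS.

    The TU_AIS signal is implicitly present in the MS_AIS signal and
    transits the DXC-4/4s without delay; unit_frames is 1 for a
    frame-based LOP (VC-3) and MULTIFRAME_FRAMES for a VC-12.
    """
    return INSERT + DETECT * unit_frames


def to_ms(frames: int) -> float:
    """Convert a number of frames into milliseconds."""
    return frames * FRAME_US / 1000


if __name__ == "__main__":
    print("AUdAIS in H:", to_ms(au_dais_frames()), "ms")
    print("TUdAIS in H (VC-12):", to_ms(tu_dais_frames(MULTIFRAME_FRAMES)), "ms")
    print("TUdAIS in H (VC-3):", to_ms(tu_dais_frames(1)), "ms")
```

With these assumptions the VC-12 case is decided by a single frame, which is consistent with the statement that the TUdAIS declaration "slightly wins", whereas the VC-3 case is decided by a wide margin.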
2.3.4
Automatic Protection Switching Protocol
In the previous sections, we saw how the alarm indication signal (AIS) and remote defect indication (RDI) maintenance signals allow network elements along a connection to learn that this connection is affected by a failure. These signals can then trigger recovery actions in these network elements. SDH networks typically rely on Automatic Protection Switching (APS) techniques that assume pre-established backup resources protecting a certain set of working resources. The protection switching actions are coordinated through the exchange of APS protocol messages that are transported in part of the K overhead bytes. The goal of this section is to discuss the APS protocol more generally, and the following section highlights the various protection strategies available for SDH networks. For completeness, Section 2.6 briefly describes restoration techniques that do not rely on pre-established backup resources. The discussion in this section has a part devoted to trail protection and another part devoted to subnetwork connection protection (SNCP).
Trail Protection
Within a network layer, one can choose between trail and subnetwork connection protection (SNCP). As illustrated in Figure 2.21, a sublayer is introduced to implement the trail protection. The characteristic information consists of the adapted information plus some overhead bytes. Based on these overhead bytes, the trail termination functions are able to supervise the integrity of a network connection. In addition, maintenance signals can be transported in those overhead bytes.
Figure 2.21 Architecture for trail protection. (ITU-T Recommendation G.841, "Types and characteristics of SDH network protection architectures," ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
The overhead bytes also provide some overhead bits dedicated to transporting the APS protocol messages (more precisely, K1 and K2b1-5 on the MS level, K3b1-4 on the HOP level, and K4b3-4 on the LOP level; see footnote 12). The trail termination (X_TT) functions in Figure 2.21 terminate the trails and check their integrity (possibly resulting in a TSF signal). Instead of being handed directly to the client adaptation function (the X/Client_A function), the trail signal passes through the trail protection sublayer, which consists of a connection (XP_C) function that realizes the Automatic Protection Switching (APS) for a group of trails, and a set of adaptation (X/XP_A) and trail termination (XP_TT) functions. The protection sublayer adaptation (X/XP_A) functions separate the APS protocol channel from the trail signal and pass it (along with the TSF from the X_TT function, forwarded as SSF) to the APS controller in the XP_C function (see dashed arrows in Figure 2.21). The APS controller will accept a change in the APS requests only after it has persisted for three consecutive frames or multiframes. To coordinate the actions in the different network elements, the APS controllers communicate with each other through these APS channels. The SSF signals that the APS controllers receive will trigger the APS protection switching actions. Finally, the TT functions in the protection sublayer (the XP_TT functions) forward the data and status signals (translation of SSF into TSF signal) from the trails selected in the protection connection (XP_C) function.
12 Remember that we use the notation XXbY throughout this chapter to indicate bit Y in the overhead byte XX.
Note that the previous discussion has mainly highlighted the sink direction. The source direction is similar but reversed; a major difference is that the source signals are typically bridged onto backup/protection trails, whereas in the sink a selection is made between the working and backup/protection trails. Finally, note that Figure 2.21 illustrates the possibility of carrying extra traffic over the backup/protection trails while no failures affect the working trails, and of preempting this extra traffic when the corresponding backup/protection trails are needed because a failure occurs.
The example in Figure 2.21 is an illustration of linear APS, in which the APS protocol involves only two nodes. There also exist more complex APS protocols that involve more than two nodes interconnected in a ring configuration: MS protection rings, which have been and are still important in today's production networks, are an example of such APS protocols. An in-depth discussion of these ring protocols is beyond the scope of this chapter, which focuses on the strategies described in the following sections. The basics of the linear APS protocol are illustrated in Figure 2.22. A detailed specification of linear and ring APS protocols and rules is provided in [G841], and [Sex92] presents them in the form of state diagrams and flowcharts.
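The rule that a changed APS request is accepted only after it has been present for three consecutive frames (or multiframes) can be pictured as a small persistence filter. The sketch below is only an illustration of that rule; the class name, the method name, and the modeling of a request as a (type, requested channel, bridge status) tuple are assumptions made for this example, not part of any standard API.

```python
class ApsRequestFilter:
    """Accept a new APS request only after it persists for N frames."""

    def __init__(self, persistence=3, initial=("NoReq", 0, 0)):
        self.persistence = persistence
        self.accepted = initial   # request currently acted upon
        self._candidate = None    # differing request being observed
        self._count = 0           # consecutive frames carrying the candidate

    def receive(self, request):
        """Feed the request read in one frame; return the accepted request."""
        if request == self.accepted:
            # No change pending: forget any half-observed candidate.
            self._candidate, self._count = None, 0
        elif request == self._candidate:
            self._count += 1
            if self._count >= self.persistence:
                self.accepted = request
                self._candidate, self._count = None, 0
        else:
            # A different value restarts the persistence count.
            self._candidate, self._count = request, 1
        return self.accepted


if __name__ == "__main__":
    aps_filter = ApsRequestFilter()
    for frame in range(1, 4):
        print(frame, aps_filter.receive(("SF", 2, 0)))
```

A single corrupted frame in the middle of the sequence restarts the count, so the filter trades three frames of delay for robustness against transient errors in the K bytes.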
Figure 2.22 Illustration of bidirectional linear 1:N APS protocol (bridge request format: type/priority, requested channel to bridge, local bridge status). (ITU-T Recommendation G.841, "Types and characteristics of SDH network protection architectures," ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
Figure 2.22 is an illustration of 1:N linear APS. As explained in Chapter 1, this means that one backup/protection entity protects N working entities. The 1:N protection can be generalized to M:N protection, where M backup/protection entities protect N working entities. If only one entity has to be protected, we can choose between bridging the signal at the time of the failure (1:1 protection) or permanently (1+1 protection). In the latter case, there is no opportunity to carry extra traffic on the backup/protection entity. The figure also assumes bidirectional operation (sometimes called dual-ended operation): when one direction of a bidirectional entity is sent over the backup/protection entity, the opposite direction is also sent over the backup/protection entity. It is worth mentioning that unidirectional operation (sometimes called single-ended operation) is also possible.
The main idea we want to illustrate in Figure 2.22 is that the downstream node (or the recovery tail end [RTE], following the terminology of Chapter 1) requests the upstream node (or the recovery head end [RHE]) to bridge one of the N working channels onto the protection channel. The RHE should then perform this bridge operation and inform the RTE that it has fulfilled its request. From that moment, the RTE can safely select the signal from the backup/protection channel instead of from that particular working channel. Requests can result from external requests (e.g., a manual request from the network operator), from locally generated requests (e.g., because of detected problems on one of the incoming channels), or from a request from the opposite side. A priority is assigned to each request type, allowing discrimination between multiple concurrent requests. More practically, Figure 2.22 assumes that channel 2 is affected only in the direction from node B to node A.
Therefore, at the time of the failure (left part of the figure), node A (the downstream node) detects a signal fail (SF) condition. Because there are no other requests, node A starts asking B (the upstream node) to bridge channel 2. Once B receives this request, it verifies that this request does not conflict with any other requests and performs the bridge. As a result, it will also request A to bridge channel 2, to complete the bidirectional protection switching, while it informs A that it has already bridged channel 2 itself (note the "2" at the end of the request). Note also that this involves a "REVerse" request, which has a lower priority than the SF request from A to B. When A receives this request, it fulfills the request by bridging channel 2 and informs B of this bridge (note the change from "0" to "2" at the end of the request); it also notices that B fulfilled the bridge request and, therefore, selects for channel 2 the signal from the backup/protection channel. B also performs this selection from the moment it receives the notification of the bridge in A.
The right part of the figure illustrates the operation when channel 2 is repaired. Node A will recognize that it receives a proper signal on channel 2 and therefore initiate the process to free the backup/protection channel and select channel 2 again. The APS protocol assumes in most cases a revertive mode of operation. The nonrevertive mode is typically applicable only to 1+1 protection. Node A sends a wait-to-restore (WTR) request (which has a priority higher than the reverse request from B to A) to node B. This tells B that A has the intention to switch back but that it waits some more time to be sure that no other requests are sent (e.g., channel 2 going down again after a very short while or a request from B with a lower priority than that of the SF request). The WTR timer should be configured in the range of 5 to 12 minutes. Once the WTR timer expires, node A selects the signal from channel 2 again and requests B to release the bridge by sending a "no request" request. Note that the bridge in A still exists (and, thus, this is still carried in the "no request" request) because B is still requesting the bridge from A. Once B receives that "no request" request, it releases the bridge, and because there is no need for keeping the REVerse request, it again selects the signal from channel 2 instead of that from the protection/backup channel. Upon receipt of the "no request" from B, A also releases the bridge and changes the local bridge status in the request it is sending accordingly.
From the example at the left in Figure 2.22, it is clear that it can take up to three times a one-way delay (along the protection/backup channel) before both nodes complete the bridge and select operation. An additional one-way delay is needed to inform the opposite side of the last change in the status. If the failure affected both directions of channel 2 (thus, also the direction from A to B) instead of one, this protection switch completion time of 3 + 1 = 4 one-way delays would be reduced (in the best case) to 2 + 1 = 3 one-way delays, because B would start requesting the bridge from A at the same time that A starts requesting the bridge from B. This one-way delay consists at least of the propagation delay (0.5 ms per 100 km) plus the duration of three consecutive frames (3 × 125 = 375 µs; see footnote 13) before a change in the APS request is accepted. Let us consider, for example, a link of 100 km: in this case, the one-way delay equals 0.875 ms and the protection switch completion time would range between (2 + 1) × 0.875 = 2.625 ms and (3 + 1) × 0.875 = 3.5 ms in the ideal case. Note, however, that in practice it will take some time to process APS requests: for example, when an APS controller is responsible for processing requests from multiple APS signaling channels, processing the APS requests sequentially might become an issue. Depending on the implementation of the network elements, performing a bridge or select might also consume some time.
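The switch completion times quoted above follow directly from the number of request hops. The sketch below replays the three requests of the unidirectional failure case, written in the (type, requested channel, local bridge status) format of Figure 2.22, and reproduces the ideal-case arithmetic. It is a simplified illustration: the function names and the trace structure are hypothetical, and processing time in the network elements is ignored, as in the text.

```python
FRAME_MS = 0.125                 # duration of one SDH frame
PERSISTENCE_FRAMES = 3           # frames a changed request must persist
PROPAGATION_MS_PER_100_KM = 0.5  # propagation delay quoted in the text

# Unidirectional failure of channel 2 (direction B to A).
SWITCH_TRACE = [
    ("A", "B", ("SF", 2, 0)),   # A detects signal fail, asks B to bridge
    ("B", "A", ("REV", 2, 2)),  # B has bridged, asks A to bridge as well
    ("A", "B", ("SF", 2, 2)),   # A has bridged and selected; B can select
]


def one_way_delay_ms(link_km: float) -> float:
    """Propagation delay plus the three-frame acceptance time."""
    propagation = link_km / 100 * PROPAGATION_MS_PER_100_KM
    return propagation + PERSISTENCE_FRAMES * FRAME_MS


def completion_time_ms(link_km: float, request_hops: int) -> float:
    """Ideal completion time: request hops plus one final status update."""
    return (request_hops + 1) * one_way_delay_ms(link_km)


if __name__ == "__main__":
    for sender, receiver, request in SWITCH_TRACE:
        print(f"{sender} -> {receiver}: {request}")
    print("unidirectional failure:", completion_time_ms(100, len(SWITCH_TRACE)), "ms")
    print("bidirectional failure (best case):", completion_time_ms(100, 2), "ms")
```

Changing the link length shows how quickly propagation delay starts to dominate the fixed 375 µs acceptance time on long spans.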
13 For simplicity, here, excluding multiframes in the case of VC-2/12/11 trail protection.
Subnetwork Connection Protection
As illustrated earlier, trail protection techniques have the advantage that the APS controllers coexist in the same NEs with the trail termination functions checking the integrity and allowing access to the APS signaling channels (the adaptation function in the trail protection sublayer splits only the APS signaling channel from the data signal). The drawback is that in some circumstances a network operator wants to protect only part of the trail/network connection, that is, a subnetwork connection (as mentioned earlier, subnetwork connection protection [SNCP] is possible). Consider, for example, the network scenario in Figure 2.23; the cloud represents the domain administered by one operator, and the VC-4 under consideration has to be set up through this network between nodes residing outside this network (e.g., in networks belonging to other network operators). The network operator wants to protect this connection against failures that might occur in his or her own network. Because the VC-4 trail is terminated outside his or her network, VC-4 trail protection is not feasible. Because MS trails are set up between DXCs, DXC failures would not be covered by MS trail protection. Therefore, the network operator wants to protect the subnetwork connection corresponding to that part of the network connection that is routed in his or her network.
Figure 2.23 Sublayer tandem connection monitoring. (ITU-T Recommendation G.803, "Architecture of transport networks based on the synchronous digital hierarchy (SDH)," ITU-T Standardization Organization, March 2000. Available at: www.itu.int. Accessed May 2004.)
The problem with subnetwork connection protection is that the subnetwork connections have to be supervised, which is a typical responsibility of the trail termination functions. A supervised subnetwork connection is sometimes also called a tandem connection. There are mainly four methods for the supervision process, as follows:
• Inherent supervision relies on the status information collected from the lower layers (the MS and RS layers) to estimate the status of the tandem connection.
• Nonintrusive supervision: At the downstream end of the tandem connection, a monitoring TT function (Xm_TT function) simply listens to the received signal. An Snm_TT function is a classic Sn_TT function that terminates a VC-n trail and is capable of detecting and declaring VCdAIS defects. A VCdAIS is defined by an all 1s signal in the trail signal label (TSL) POH (more precisely, the C2 byte in the case of a VC-4/3, and V5b5-7 in the case of a VC-2/12/11).
• Intrusive supervision interrupts the actual trail to set up a supervisory unequipped trail through the tandem connection.
• Sublayer supervision adds tandem connection trail termination and adaptation functions, overwriting part of the overhead. This is illustrated in the bottom part of Figure 2.23. The end-to-end VC-4 trail is set up between the two S4_TT functions. It passes through three DXCs in the network operator domain (S4_C functions). In the ingress and egress DXCs, the sublayer tandem connection trail termination (S4TC_TT) and adaptation (S4TC/S4_A) functions are added to supervise the tandem connection sublayer trail. Note that SDH provides a dedicated part of the path overhead for the supervision of tandem connections: the network operator (N) bytes.
An important aspect of tandem connection supervision becomes possible with the last method: the ability to distinguish in the egress node between an AIS signal resulting from a failure upstream of the ingress of the tandem connection and one resulting from a failure that directly affects the tandem connection (thus, downstream of the ingress node). In the first case, the incoming AIS signal is translated in the ingress into an IncAIS, transported as such through the tandem connection, and translated back to the original AIS signal in the egress before being forwarded further downstream.
Not only is the supervision of subnetwork connections a problem, but an additional problem is the coordination of the APS actions in the RHE and RTE, because it requires an APS signaling channel. For the moment, the N bytes of the path overhead dedicated to tandem connections do not provide an APS channel, and overwriting the K bytes may cause conflicts with the trail protection APS protocol. Therefore, as shown in Figure 2.24, only unidirectional 1+1 subnetwork connection protection (SNCP) is supported (other modes are "for further study" according to the standards). Thus, in the RHE, the connection is permanently bridged onto the working and backup/protection subnetwork connections, and the RTE simply selects the best signal (based on the supervision processes described in Section 2.2.3). In other words, the protection switch completion time does not involve any one-way delay (as is the case in M:N linear protection) but depends only on the capabilities of the RTE (i.e., the time it needs to change the selection). Figure 2.24 shows the mode adopting the nonintrusive supervision method (SNCP/N). Leaving out the optional monitoring trail termination (Xm_TT) functions would result in the mode adopting the inherent supervision method (SNCP/I). According to [ETSI1], the mode adopting the sublayer supervision method (SNCP/S) is also possible.
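Because unidirectional 1+1 SNCP needs no signaling between the RHE and the RTE, the selector in the RTE can be pictured as a purely local decision. The sketch below is a simplified illustration, not the standardized selector behavior: the class and function names are hypothetical, the supervision result is reduced to a single signal-fail flag, and a nonrevertive policy (stay on the current connection while it is healthy) is an assumption made here for brevity.

```python
from dataclasses import dataclass


@dataclass
class SncStatus:
    """Supervision result for one subnetwork connection."""
    signal_fail: bool  # True when SSF/TSF is reported for this connection


def select(working: SncStatus, protection: SncStatus, current: str = "working") -> str:
    """Return the subnetwork connection the RTE selector should use.

    The source bridges permanently, so both copies are always present;
    the selector leaves the current connection only when it has failed
    and the alternative has not.
    """
    if current == "working" and working.signal_fail and not protection.signal_fail:
        return "protection"
    if current == "protection" and protection.signal_fail and not working.signal_fail:
        return "working"
    return current


if __name__ == "__main__":
    ok, failed = SncStatus(False), SncStatus(True)
    print(select(ok, ok))                    # no failure
    print(select(failed, ok))                # working connection fails
    print(select(ok, failed, "protection"))  # protection connection fails
```

The decision uses only locally available status, which is why the completion time depends on the RTE alone and not on any one-way delay.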
Figure 2.24 Subnetwork connection protection (SNCP) with nonintrusive monitoring (SNCP/N). ("Transmission and multiplexing (TM); generic requirements of transport functionality of equipment; part 1-1: generic processes and performance," ETSI EN 300 417-1-1 V1.2.1, October 2001.)
2.3.5
Summary
• F1 filters: defect declaration in sink atomic functions.
• F2 filters: consequent actions: aAIS (in TT_Sk, A_Sk and A_So) insertion of all 1s within two frames or multiframes in downstream direction, aRDI (in TT_Sk and TT_So) insertion of notification signal upstream within 1 or 4 ms, aTSF/aSSF (in TT_Sk/A_Sk) enabling internal parallel auxiliary signal fail signal.
• HOP DXC: forwards MS_AIS and AU_AIS signals within three (→ detection) plus two (→ insertion) frames as AU_AIS signal. TU_AIS transits transparently (not delayed).
• LOP DXC: forwards MS_AIS, AU_AIS and TU_AIS signals within three (→ detection) plus two (→ insertion) frames or multiframes. MS_AIS and AU_AIS are detected always within three frames (= 375 µs).
• Race conditions: can occur between AIS propagation process in HOP and LOP layers.
• Automatic Protection Switching (APS): linear versus ring, trail versus subnetwork connection protection, unidirectional versus bidirectional operation.
• 1:N APS: RTE requests bridge from RHE → RHE performs bridge and notifies RTE → RTE selects backup/protection channel.
• SNCP: only unidirectional 1+1 mode (permanent bridge). Supervision of the subnetwork connections is an issue and several methods exist.
2.4 Ring Protection
As mentioned in Section 2.3.4, SDH networks typically rely on protection techniques to increase the overall network survivability. These protection techniques can be categorized into ring and linear Automatic Protection Switching (APS) techniques. In particular, ring-based SDH networks have been very popular and their dominance remains very significant. For this reason, the overview of the various SDH recovery techniques starts in this section with the description of self-healing ring network architectures, followed by a discussion of the linear protection strategies in Section 2.5, and concludes by highlighting the possibilities for restoration-based techniques in Section 2.6.
The popularity of ring networks can be explained as follows. First, they typically feature add/drop multiplexers (ADMs) that have only two aggregate ports: compared with the more advanced DXCs (typically used in mesh-based networks), ADMs became commercially available sooner and have a lower cost. Ring networks are also rather simple network architectures (e.g., routing decisions are limited to choosing the clockwise or counterclockwise direction on the ring) that are able to meet important network operator requirements (e.g., survivability, as discussed in this section). Therefore, the incentives for upgrading from a ring-based network to an eventual meshed network (typically featuring DXCs) are not always strong or clear enough, which resulted in ring-based networks becoming very popular.
One can distinguish between three protection ring techniques: Multiplex Section–Shared Protection Rings (MS-SP Rings), Multiplex Section–Dedicated Protection Rings (MS-DP Rings), and SNCP Rings. MS-SP Rings and MS-DP Rings are similar in the sense that in the nodes adjacent to a failure, they loop back the traffic around the opposite side of the ring. However, they differ from each other in the sense that the forward and backward directions of a bidirectional connection are routed along the same side and opposite side of the ring in an MS-SP Ring and an MS-DP Ring, respectively. This is possible because an MS-SP Ring carries in both the clockwise and the counterclockwise direction 50% working capacity and 50% protection/backup capacity, whereas all working capacity is carried in one direction and all the protection/backup capacity in the counter-rotating direction in an MS-DP Ring. Spatial reuse is an important feature of MS-SP Rings: nonoverlapping connections can be routed in the same time slot (or thus capacity) in different sections of the ring. Thus, protection/backup time slots can be shared among nonoverlapping connections in an MS-SP Ring, whereas each connection is assigned a dedicated protection/backup time slot in an MS-DP Ring. In an SNCP Ring, each connection is also assigned dedicated protection/backup capacity, because the source node bridges (copies) the signal along the opposite sides of the ring while the destination node selects the best received copy (instead of locally looping back the traffic, as with an MS-SP Ring or an MS-DP Ring).
Sections 2.4.1 through 2.4.3 describe the multiplex section–shared protection ring (MS-SP Ring), the multiplex section–dedicated protection ring (MS-DP Ring), and the subnetwork connection protection ring (SNCP Ring) technique, respectively. How these ring networks can be interconnected in a reliable way is outlined in Section 2.4.4. The discussion is summarized in Section 2.4.5, and Section 2.4.6 highlights the analogies between SDH and SONET self-healing ring techniques.
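Spatial reuse in an MS-SP Ring comes down to checking whether two connections occupy any common span. The sketch below illustrates that check on an eight-node ring labeled A to H; the clockwise node ordering, the routing of each connection clockwise from source to destination, and all function names are assumptions made for this example.

```python
RING = ["A", "B", "C", "D", "E", "F", "G", "H"]  # assumed clockwise order


def spans(src: str, dst: str) -> set:
    """Spans occupied when routing clockwise from src to dst."""
    n = len(RING)
    i, j = RING.index(src), RING.index(dst)
    used = set()
    while i != j:
        nxt = (i + 1) % n
        used.add((RING[i], RING[nxt]))
        i = nxt
    return used


def can_share_time_slot(conn_a: tuple, conn_b: tuple) -> bool:
    """Two connections may reuse a time slot if they share no span."""
    return spans(*conn_a).isdisjoint(spans(*conn_b))


if __name__ == "__main__":
    print(can_share_time_slot(("A", "C"), ("C", "E")))  # disjoint sections
    print(can_share_time_slot(("A", "C"), ("B", "D")))  # both use span B-C
```

In an MS-DP Ring or an SNCP Ring this check would not help, because each connection is assigned dedicated protection/backup capacity regardless of which spans its working path uses.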
2.4.1
Multiplex Section–Shared Protection Ring In an MS-SP Ring, the available capacity in the clockwise and counterclockwise direction is split in two equal parts: 50% is devoted to carry working capacity and the other 50% carries the spare capacity to protect the working capacity (Figure 2.25). The operation of the MS-SP Ring of Figure 2.25 in the case of a link failure is illustrated in Figure 2.26. Nodes adjacent to a failure (here, nodes B and C) detect the failure and loop back the working capacity in the spare capacity in the opposite direction around the ring (thus, along the path B-A-H-G-F-E-D-C and vice versa). Of course, this requires that the intermediate nodes connect the spare capacity entering and leaving the ADM in the clockwise direction and the spare capacity entering and leaving in the counterclockwise direction. As the figure illustrates, both the forward and backward directions of a bidirectional connection (or HOP) are looped back, because both directions are routed along the same side of the ring. From Figure 2.26 it is clear that there exists different states for the nodes on the ring. The three states are illustrated in Figure 2.27. In the absence of failures, all nodes will be in the normal state. However, from the moment a failure occurs on the ring, the adjacent nodes will trigger the APS protocol, causing the nodes adjacent to the failure to loop back the traffic and all the other nodes on the ring to go into the passthrough state. To trigger the appropriate state transitions in all ring nodes, it is necessary that the nodes detecting the failure send an APS request along the short A
B
H
G Connection
C
F
D
E
Working Capacity Protection/Backup Capacity
ADM
Figure 2.25 Multiplex section–shared protection ring in a failure-free situation. (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
Figure 2.26 Illustration of the operation of a multiplex section–shared protection ring. (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
and long path to the other node adjacent to the failure. The short path (B-C in Figure 2.26) is the segment on the ring from which the traffic is deviated through the loop-back operation: APS requests along the short path are needed to inform the upstream node of the status in the downstream node (this may trigger the APS protocol in the upstream node if only one direction fails [e.g., one fiber in a fiber pair between adjacent ADMs]). Note, however, that a node will never change from the normal state to the bridged-and-switched state (thus, the looped-back state; see middle of Figure 2.27) based on the receipt of an APS request along the short path. The long path (B-A-H-G-F-E-D-C in Figure 2.26) is the ring segment along which traffic is looped back. The main purpose of APS requests along the long path is to request from the other node adjacent to the failure to bridge and switch the traffic
Figure 2.27 States of the ring nodes (MS/SP_A, SP_TT, MS/Sn_A: group of N/2 atomic function for a two-fiber STM-N MS-SP Ring). (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
(i.e., to activate the loop back), to request/ensure that all intermediate nodes go into the passthrough state, and to inform all nodes on the ring about the ring status (e.g., which span has failed). A detailed specification of the MS-SP Ring APS protocol is given in [G841], whereas [Sex92] summarizes the main characteristics of the protocol in the form of state diagrams and flowcharts. The detailed specification of the protocol in [G841] (K1b1-4, bridge request type; K1b5-8, destination node ID; K2b1-4, source node ID; K2b5, short/long; K2b6-8, status, including the MS-AIS and MS-RDI signals; recall that the notation XXbY used throughout this chapter indicates bit Y in the overhead byte XX) directly affects the ring size: node IDs are restricted to 4 bits, so an MS-SP Ring can cover up to 16 nodes.
As mentioned earlier, changes in the status of a node (see Figure 2.27) are triggered by requests received along the long path; thus, the one-way delay on the long path between both nodes adjacent to a failure is important. Consider, for example, a ring containing 16 nodes interconnected by links of 100 kilometers (km); this one-way delay equals (16 - 1) * 0.875 = 13.125 ms (remember from Section 2.3.4 that the one-way delay per link of 100 km equals 875 µs). Without going into the details of the APS protocol, it takes between one and two one-way delays on the long path before all nodes on the ring have changed their status. An additional one-way delay along the long path is needed to inform all nodes on
the ring about the last change in a node status. This implies in this example that the protection completion time would range between (1 + 1) * 13.125 = 26.25 ms and (2 + 1) * 13.125 = 39.375 ms. Note, however, that these values assume an ideal situation in which the time needed to process APS requests and to act accordingly can be neglected.
Figure 2.27 also illustrates that the MS-SP Ring protocol can be classified as an MS trail protection technique (the dashed rounded rectangle represents the APS sublayer according to Figure 2.21). The protection adaptation functions (connected to the MS_TT functions) split up the administrative units (AUs) into a working and a protection group of AUs. These groups are then cross-connected in the protection connection function, according to the state of the node. The protection sublayer trail termination functions connect the AUs to MS/Sn_A functions responsible for the pointer processing. Note that this is also true for the protection AUs connected through the node in the passthrough state (the STM-N frames do not necessarily have to be aligned with each other). Furthermore, in the normal state extra (or unprotected and not-yet-preempted) traffic can be routed through the spare/protection capacity, but the extra traffic will be preempted in case of a failure. MS-SP Rings support not only extra traffic but also Non-preemptible Unprotected Traffic (NUT) (this feature is not illustrated in the figure). This can be achieved by removing certain AUs and the corresponding spare/protection AUs from the groups to which the MS-SP Ring APS protocol applies. Of course, capacity (AUs or time slots) for supporting NUT has to be allocated on all spans on the ring and cannot be restricted to certain segments on the ring. Section 2.4.4 shows that the support of NUT can be interesting when interconnecting two MS-SP Rings based on the virtual ring interconnection scheme (otherwise, this would result in double protection).
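The protection completion time estimated above for a 16-node ring (26.25 ms to 39.375 ms) can be reproduced with a few lines of Python. This is only a minimal sketch under the same idealized assumptions as the text (APS processing time neglected, 0.875 ms one-way delay per 100-km link); the function and constant names are illustrative and are not taken from [G841].

```python
# Back-of-the-envelope APS timing for an MS-SP Ring (idealized: no processing time).
NODE_ID_BITS = 4                 # node IDs in K1b5-8 / K2b1-4 are 4 bits wide
MAX_NODES = 2 ** NODE_ID_BITS    # hence at most 16 nodes on the ring
DELAY_PER_LINK_MS = 0.875        # one-way delay per 100-km link (Section 2.3.4)


def long_path_delay_ms(nodes, delay_per_link_ms=DELAY_PER_LINK_MS):
    """One-way delay along the long path, which crosses every link but the failed one."""
    return (nodes - 1) * delay_per_link_ms


def completion_time_range_ms(nodes, delay_per_link_ms=DELAY_PER_LINK_MS):
    """Return (best, worst) protection completion time in milliseconds.

    One to two long-path delays pass before all nodes have changed state,
    plus one more long-path delay to inform all nodes of the last change.
    """
    d = long_path_delay_ms(nodes, delay_per_link_ms)
    return (1 + 1) * d, (2 + 1) * d


if __name__ == "__main__":
    print(long_path_delay_ms(MAX_NODES))        # 13.125
    print(completion_time_range_ms(MAX_NODES))  # (26.25, 39.375)
```

Changing `nodes` or `delay_per_link_ms` shows how quickly the idealized bound grows with ring size and span length; real completion times are longer because processing time is not negligible.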
The lower part of Figure 2.27 implicitly assumes a two-fiber MS-SP Ring configuration. Figure 2.28 compares the two-fiber configuration with the four-fiber MS-SP Ring configuration. In a two-fiber MS-SP Ring configuration, two fibers interconnect the adjacent nodes in the ring, each carrying 50% working and 50% spare/protection capacity in opposite directions (i.e., one of the two fibers belongs to the clockwise ring, and the other to the counterclockwise ring). In a four-fiber MS-SP Ring configuration, four fibers (or two fiber pairs instead of one) interconnect two adjacent nodes in the ring: one fiber pair is dedicated to the transport of the working capacity, and the other fiber pair is completely dedicated to the spare/protection capacity. In other words, considering an STM-N MS-SP Ring, a two-fiber configuration can at most transport N/2 (e.g., eight in the case of an STM-16 ring) bidirectional protected VC-4s on each span, whereas a four-fiber configuration is able to transport up to N bidirectional protected VC-4s. Figure 2.29 shows that a four-fiber MS-SP Ring not only can accommodate more traffic than a two-fiber MS-SP Ring (because although 50% of the capacity still remains dedicated as protection/backup capacity, there is twice the amount of capacity available in the network) but also can support span (or link) protection. In a two-fiber configuration, each line failure will affect all working and spare/protection capacity in at least one direction; therefore, the traffic will always be looped
[Figure 2.28 panel labels: 2-fiber MS-SP Ring, with a single fiber pair for 50% working capacity (white) and 50% protection/backup capacity (gray); 4-fiber MS-SP Ring, with a working fiber pair and a protection/backup fiber pair. Functional blocks shown in the MS-SP Ring sublayer: MS_TT, MS/SP_A, SP_TT, SP_C, MS/Sn_A, Sn_C.]
Figure 2.28 Two-fiber versus four-fiber multiplex section–shared protection ring. (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
Figure 2.29 Span protection in four-fiber multiplex section–shared protection ring.
back in a two-fiber configuration (left side of the figure). In a four-fiber MS-SP Ring configuration (right side of the figure), a failure on one fiber can, for example, affect only a fiber carrying working capacity (in one direction). In this case (see right upper part of the figure), the spare/protection capacity on that span remains unaffected, and thus, the working capacity can be switched over to this spare/protection capacity on that span. Of course, it is also possible that only a fiber carrying spare/protection capacity fails (see right middle part of the figure); in this case no APS will be triggered (but failure propagation and notification for the affected extra traffic or NUT are still required). Finally, the equivalent of a single fiber cut in the two-fiber MS-SP Ring is a cut of both the working and the spare/protection fiber (in the same direction) (see right bottom part of the figure); in this case a loop-back operation as in the two-fiber configuration is needed. Note also that span protection has the advantage of offering a propagation delay similar to that of the failure-free situation.
In Figure 2.26 the physical routing of the traffic in the case of a failure is shown. From a logical point of view, Figure 2.26 shows that the MS-SP Ring APS lays out a bypass for the working traffic through the spare/protection capacity (around the ring) between the two nodes adjacent to the failure. This logical view (considering the same network and failure scenario) is presented in Figure 2.30; a logical bypass is laid out between nodes B and C. Such a logical view might help us to understand more complex failure scenarios. For example, consider the scenario depicted in Figure 2.31, in which two connections, between A and D and between H and E, are affected by two failures: the failures on spans B-C and G-F affect connections A-D and H-E, respectively. The failure on span B-C triggers the loop-back operation in nodes B and C.
Figure 2.30 Logical view of the operation of a multiplex section–shared protection ring (same scenario as in Figure 2.26).
Figure 2.31 Illustration of the need for squelching mechanisms (logical view).
Similarly, the span failure G-F triggers the loop-back operation in nodes G and F. However, as Figure 2.31 illustrates, both actions will interfere with each other; the loop back in B and G will create a logical bypass between both nodes (physically routed via A and H), whereas the loop back in C and F creates a logical bypass between C and F (physically routed via D and E). Both logical bypasses will result in the misconnection of A with H and D with E, respectively. Therefore, to deal with failures that affect (directly or indirectly) more than one span, the network must have the ability to ‘‘squelch’’ (i.e., to replace by an AIS signal) some connections to avoid misconnection, as illustrated in Figure 2.31. This is also true for node failures; a node failure will affect its west and east span. More precisely, all connections that originate/terminate in a failing node or in a remote isolated segment of the ring need to be squelched. Figure 2.32 considers the same scenario as in Figure 2.31 except that both failures do not occur simultaneously, but one after the other. The figure shows that the spare/protection capacity can be shared between different connections as long as these do not overlap. More precisely, a spare/protection AU (or time slot) allocated along the whole ring corresponds to exactly one working AU or time slot on each span along the ring and thus nonoverlapping higher order paths can be allocated to that particular AU or time slot on each link they pass. Overlapping connections would compete for the same AU or time slot on the same link, which thus prevents them from sharing the same spare/protection AU or time slot. The ability of nonoverlapping connections to reuse the same unit (here, AU or time slot) of working capacity is sometimes called spatial reuse. 
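Both ideas in this passage, squelching after a double failure and sharing spare capacity between nonoverlapping connections, reduce to simple set operations on ring spans. The sketch below is illustrative only: it assumes the eight-node ring A to H of Figures 2.31 and 2.32, represents a route as a string of node names, and uses hypothetical helper names; it is not the squelching procedure specified in [G841].

```python
RING = "ABCDEFGH"  # node order around the ring (assumed from Figures 2.31 and 2.32)


def span(a, b):
    """Undirected span between two adjacent nodes."""
    return frozenset((a, b))


def route_spans(route):
    """Set of spans traversed by a route written as a node string, e.g. 'ABCD'."""
    return {span(a, b) for a, b in zip(route, route[1:])}


def segments(ring, failed):
    """Groups of nodes that can still reach each other once the failed spans are cut."""
    n = len(ring)
    # Node i opens a new segment when the span from its predecessor has failed.
    starts = [i for i in range(n) if span(ring[i - 1], ring[i]) in failed]
    if len(starts) < 2:
        return [set(ring)]  # zero or one cut: the ring stays connected
    segs = []
    for k, start in enumerate(starts):
        end = starts[(k + 1) % len(starts)]
        seg, i = set(), start
        while i != end:
            seg.add(ring[i])
            i = (i + 1) % n
        segs.append(seg)
    return segs


def must_squelch(endpoints, segs):
    """True if the two endpoints ended up in different isolated segments."""
    a, b = endpoints
    return not any(a in seg and b in seg for seg in segs)


def can_share_protection(route1, route2):
    """Nonoverlapping routes may share the same spare/protection time slot."""
    return route_spans(route1).isdisjoint(route_spans(route2))


if __name__ == "__main__":
    segs = segments(RING, {span("B", "C"), span("G", "F")})
    print(must_squelch("AD", segs))              # True: A and D are separated
    print(must_squelch("HE", segs))              # True: H and E are separated
    print(can_share_protection("ABCD", "HGFE"))  # True: no common span
```

With both spans failed, the ring splits into the segments {G, H, A, B} and {C, D, E, F}, so connections A-D and H-E each have their endpoints in different segments and are the ones to squelch. The same two routes share no span, which is why they can use the same spare/protection time slot as long as only one of them is hit by a failure.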
This spatial reuse feature becomes possible because in an MS-SP Ring, the forward and backward directions of a bidirectional connection are routed along the same side of the ring (note that this is not the case in an MS-DP Ring, as is explained in Section 2.4.2), and thus only occupy capacity on that segment of the ring. Consider, for example, that if a
Figure 2.32 Spare/protection capacity sharing between nonoverlapping connections. (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
two-fiber STM-16 MS-SP Ring contains eight nodes (A, B, C, D, E, F, G, and H) and eight VC-4s need to be set up between each pair of neighbors, then all working capacity will be occupied. Now for each pair of nodes that are adjacent to the same neighbor (thus, each VC-4 will be routed over two links), we would need to install or “stack” a second STM-16 ring (because per STM-16, a two-fiber MS-SP Ring can accommodate only eight VC-4s per link); one will accommodate the traffic between A and C, C and E, E and G, and G and A, and the other will accommodate the traffic between B and D, D and F, F and H, and H and B. This is independent of which segment (the short or the long one) is chosen to route the connection: typically, the short segment will be chosen, but in rare situations the long one may be chosen to balance the traffic over the ring (a typical design objective is to minimize the number of higher order paths routed over the
highest loaded link in the MS-SP Ring). For example, consider again a two-fiber STM-16 MS-SP Ring containing eight nodes (A, B, C, D, E, F, G, and H) and assume that the following traffic needs to be set up: six VC-4s between B and C, four VC-4s between F and G, and four VC-4s between A and D. Then how do you route the four VC-4s between A and D: via B-C or via H-G-F-E? Because capacity for only 8 - 6 = 2 VC-4s remains free on the link B-C, whereas capacity for 8 - 4 = 4 VC-4s remains free on the link G-F, the best choice (from a capacity point of view) is to route the four VC-4s along A-H-G-F-E-D.
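The routing decision in this example can be written down as a small bottleneck check. The following is a hedged sketch rather than a ring design algorithm: the helper names are invented for illustration, routes are node strings, and the per-span capacity of eight VC-4s is the N/2 limit of a two-fiber STM-16 MS-SP Ring stated earlier.

```python
SPAN_CAPACITY = 8  # protected VC-4s per span on a two-fiber STM-16 MS-SP Ring (N/2)


def span(a, b):
    """Undirected span between two adjacent nodes."""
    return frozenset((a, b))


def route_spans(route):
    """Spans traversed by a route written as a node string, e.g. 'ABCD'."""
    return [span(a, b) for a, b in zip(route, route[1:])]


def free_capacity(route, load):
    """Free VC-4 slots on the most heavily loaded span of the route."""
    return min(SPAN_CAPACITY - load.get(s, 0) for s in route_spans(route))


def pick_route(candidates, load, demand):
    """Feasible candidate with the most free capacity on its bottleneck span, or None."""
    feasible = [r for r in candidates if free_capacity(r, load) >= demand]
    return max(feasible, key=lambda r: free_capacity(r, load), default=None)


# Existing traffic from the example: six VC-4s on B-C and four VC-4s on F-G.
load = {span("B", "C"): 6, span("F", "G"): 4}

print(free_capacity("ABCD", load))              # 2
print(free_capacity("AHGFED", load))            # 4
print(pick_route(["ABCD", "AHGFED"], load, 4))  # AHGFED
```

Only the long route leaves room for four more VC-4s, which matches the choice made in the text. A fuller planning tool would also update `load` after each placement and compare many demands at once.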
2.4.2 Multiplex Section–Dedicated Protection Ring
The operation of multiplex section–dedicated protection rings (MS-DP Rings) is illustrated in Figure 2.33. The main difference between MS-DP Rings and MS-SP
[Figure 2.33 panels: situation without failure; situation in case of a link failure. Legend: ADM, connection, connection looped back, working capacity, protection/backup capacity.]
Figure 2.33 Illustration of the operation of a multiplex section–dedicated protection ring.
Rings is that the forward and backward directions of a bidirectional connection are routed along opposite sides of the ring in an MS-DP Ring. More precisely, one direction is dedicated to carry the working capacity, and the counter-rotating fiber is dedicated to the spare capacity. As in an MS-SP Ring, the nodes adjacent to a failure loop back all traffic on the working fiber around the ring onto the protection fiber. That the forward and backward directions of a bidirectional connection are not routed along the same side of the ring implies that a bidirectional connection will occupy capacity along the whole ring. This rules out the spatial reuse that is possible in MS-SP Rings. For example, consider again a ring containing eight nodes (A, B, C, D, E, F, G, and H) in which eight VC-4s need to be set up between each pair of neighbors. Then we would need capacity for 8 * 8 VC-4s = 64 VC-4s (eight neighbor pairs, with each VC-4 occupying capacity on all links in the ring), that is, a stack of 4 STM-16 MS-DP Rings, whereas a single two-fiber STM-16 MS-SP Ring suffices (even though it can accommodate only eight VC-4s on each link compared to the 16 VC-4s in an MS-DP Ring). However, an MS-SP Ring requires a working and a protection/backup time slot in both the clockwise and the counterclockwise direction, whereas a single time slot in each direction suffices in an MS-DP Ring. Thus, an MS-SP Ring will only outperform an MS-DP Ring (in terms of capacity efficiency) when it reuses/shares the same time slots between on average more than two nonoverlapping connections.
As in MS-SP Rings, the loop-back operation in nodes adjacent to a failure may also result in misconnections in MS-DP Rings (Figure 2.34). Note that this figure considers only a single bidirectional connection from node A to node D (whereas Figure 2.31 considers two bidirectional connections: between A and D and between H and E).
Figure 2.34 Misconnections in multiplex section–dedicated protection rings.
A double-failure scenario that affects both the forward and the backward direction of a bidirectional connection will result in incorrectly connecting the endpoints with themselves (more precisely, node A gets connected with node A and node D with node D). This assumes that no time-slot interchange (TSI) takes place in the intermediate nodes, which means that a connection is assigned the same time slot on all links in the ring. Although such a misconnection involves only a single connection, such operation should be avoided and instead an AIS signal should be raised and propagated to the connection endpoints (squelching).
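The capacity comparison between MS-DP and MS-SP Rings made earlier in this section can be checked numerically. The sketch below is an illustration under stated assumptions: STM-16 rings, an eight-node ring A to H, routes given as node strings, and invented function names. The MS-SP figure is only a lower bound derived from the most loaded span; the examples in the text happen to achieve it.

```python
import math
from collections import Counter

RING = "ABCDEFGH"
N = 16  # VC-4 time slots per STM-16


def hop_routes(ring, hops):
    """Routes between nodes `hops` spans apart, taken along the short side of the ring."""
    n = len(ring)
    return ["".join(ring[(i + j) % n] for j in range(hops + 1)) for i in range(n)]


def msdp_rings_needed(demands):
    """MS-DP: every VC-4 occupies a time slot on all links, so total demand counts."""
    total = sum(count for _, count in demands)
    return math.ceil(total / N)


def mssp_rings_lower_bound(demands):
    """Two-fiber MS-SP: only N/2 working slots per span, but spans are reused."""
    load = Counter()
    for route, count in demands:
        for a, b in zip(route, route[1:]):
            load[frozenset((a, b))] += count
    return math.ceil(max(load.values()) / (N // 2))


neighbours = [(r, 8) for r in hop_routes(RING, 1)]         # eight VC-4s per neighbor pair
second_neighbours = [(r, 8) for r in hop_routes(RING, 2)]  # eight VC-4s per two-hop pair

print(msdp_rings_needed(neighbours), mssp_rings_lower_bound(neighbours))
print(msdp_rings_needed(second_neighbours), mssp_rings_lower_bound(second_neighbours))
```

For neighbor traffic this reproduces the 4 MS-DP Rings versus 1 MS-SP Ring of the text. For two-hop traffic every span is shared by two routes, so the MS-SP bound rises to 2 while the MS-DP count stays at 4, which illustrates how the MS-SP advantage shrinks as connections overlap more.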
2.4.3 Subnetwork Connection Protection Ring
SNCP and its operational aspects have already been discussed in Section 2.3.4. In SNCP the recovery head end (RHE) bridges (or copies) the signal along two paths and the recovery tail end (RTE) selects one of the two received signals. Of course, nothing prevents adopting SNCP in a ring network (Figure 2.35). As explained earlier, SNCP currently works only in a unidirectional 1+1 mode and thus does not involve any APS signaling; the advantage is that an SNCP Ring is not necessarily restricted to 16 nodes, as is the case for MS-SP Rings (or MS-DP Rings), and that the protection switch completion time depends only on the capabilities of the downstream RTE. In terms of capacity efficiency, an SNCP Ring performs the same as an MS-DP Ring: each connection is assigned capacity (one time slot in each direction) along the whole ring, which rules out the spatial reuse that is possible in MS-SP Rings. Because the signal is bridged (or copied) onto the backup path all the time, the backup path is not available to support extra traffic. The SNCP Ring concept is applicable to both the higher and lower order path layer, whereas MS-SP Rings (and MS-DP Rings) assume higher order path connections (see also Figure 2.27).
2.4.4 Ring Interconnection
In Sections 2.4.1 through 2.4.3, we have described different self-healing ring mechanisms. Of course, it is not always desirable to build a network consisting of only
Figure 2.35 Illustration of the operation of a subnetwork connection protection ring.
one ring. Typical networks consist of multiple rings. Consequently, a connection in such a network might cross multiple rings. For example, let us consider the connection from node A to node F in Figure 2.36. Each ring can easily guarantee survivability in case of failures inside the own ring. However, when the scope of all protection mechanisms is restricted to a single ring, then the connections will go from one ring to the other through a single ‘‘gateway.’’ This means that in Figure 2.36 the nodes C and H and the link between both nodes become a single point of failure. Thus, in addition to outages of the source and sink ADMs of the end-to-end connection, an outage of this interconnection gateway will have a large impact on the overall unavailability of the end-to-end connection (at least, as long as double failures within a single ring can be ignored). Nevertheless, having two or more rings each protecting only part of a connection can improve the overall availability. A ring that covers all the nodes in the network will typically be longer, so the chance that this ring is affected by a double failure is higher, whereas with two or more distinct rings some simultaneous failure can affect and thus be handled by different rings (e.g., a simultaneous failure of link A-B and link J-F). In a nutshell, dividing a network into multiple rings can be beneficial (and availability is only one consideration), but it is crucial to ensure that the interconnection gateways do not affect the overall availability too drastically. Other reasons could be, for example, that a single ring simply cannot accommodate all traffic routed over a network or that the end-to-end propagation delay becomes unacceptable. The rings covering the whole network will typically be laid out so that as much interring traffic (traffic crossing more than one ring) as possible is avoided. 
In other words, only a small fraction of the overall traffic typically crosses multiple rings and thus needs to be protected against failures of the interconnection gateways. For this purpose, the interring traffic is sent through two gateways instead of one. The following sections present different options to achieve this. The first one is the
Figure 2.36 Intrinsic vulnerability of single-node ring interconnections. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
virtual ring (VR) interconnection scheme (see the section titled Virtual Ring Interconnection). As in regular SNCP, in the VR interconnection scheme the traffic is sent along two paths transiting different gateways. The second option is called drop and continue (D&C): An interring connection is bridged in the gateway node on the source ring and continues along the ring to leave the ring also through another interconnection gateway, whereas a gateway node on the destination ring selects the best of both copies. Thus, the VR interconnection scheme is a global protection technique, whereas the D&C interconnection scheme is a local protection technique. D&C can be adopted to interconnect SNCP Rings (see the section titled Drop and Continue Interconnection of SNCP Rings), to interconnect MS-SP Rings (see the section titled Drop and Continue Interconnection of MS-SP Rings), or to interconnect SNCP Rings with MS-SP Rings (see the section titled Drop and Continue Interconnection of MS-SP and SNCP Rings). D&C is not applicable to MS-DP Rings, but an alternative for the local protection of the interconnection gateways between two MS-DP Rings is presented in the section titled Interconnection of MS-DP Rings. Finally, the section titled Interconnection of Stacked Rings demonstrates that ring interconnection is of particular interest in a stack of rings, whereas the section titled Node Architectures for Gateways between Self-Healing Rings highlights different gateway node architectures.
Virtual Ring Interconnection
Figure 2.37 shows the virtual ring interconnection scheme. As in regular SNCP, the traffic is sent along two paths transiting different gateways (here, C-H and D-I); both paths form a virtual ring (here, A-B-C-H-G-F-J-I-D-E-A). By properly routing the traffic over the rings (thus, both paths along opposite sides of the ring), any single point of failure (except in the connection endpoints of course) can be avoided. Therefore, unless double-failure scenarios have to be considered, there is no need to invest in additional protection/capacity to protect one or both paths once again inside an individual ring. The double failure shown in Figure 2.37 is an example in which having no protection inside the individual rings (but only on an end-to-end basis) does make a difference, because the two simultaneous failures affect both paths and occur in different rings (whereas a simultaneous failure of link A-B and link A-E instead of link A-B and link J-F would always lead to the connection becoming unavailable, independent of whether the left ring protects both paths). Considering a higher order VC (HOVC) as an interring connection, not protecting both paths inside each ring results in occupying at most one time slot in each direction on all links (see Figure 2.37). Protecting both paths inside the individual rings would result in requiring twice the amount of capacity. Because SNCP Rings and MS-DP Rings do not support spatial reuse, both protected paths would require on all links one time slot in each direction. In an MS-SP Ring both paths can reuse/share the same (working and protection/backup) time slot because they do not overlap when properly routed (e.g., via A-B-C and via A-E-D in the left ring). When the MS-SP Ring protection is not required, both paths can be transported as
Figure 2.37 Virtual ring interconnection. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
non-preemptible and unprotected traffic (NUT), requiring only half of the capacity; the protection/backup time slot along the ring then becomes available for the transport of another interring HOVC protected by means of the virtual ring interconnection scheme.
Drop and Continue Interconnection of SNCP Rings
Another option for dual-gateway ring interconnection is called drop and continue (D&C). The D&C technique to interconnect SNCP Rings is illustrated in Figure 2.38. Instead of simply adding/dropping the signal in the gateway ADMs, the signal is also continued to the next gateway ADM. Thus, the signal entering gateway node C via B is continued to gateway node D and the signal entering gateway node D via E is continued to node C. As long as at most a single failure exists in the left ring (link A-B fails in the figure), both gateway nodes C and D are able to select a valid copy of the signal (here, the one transiting node E because of the failure of link A-B) and to hand it over to the other ring. The copy selected in node C reaches the destination F via nodes H and G, whereas the copy selected in node D reaches the destination F via I and J. Finally, the destination F selects one of the two signals (here, the one coming from node C because of the failure of link F-J). In summary, the main difference between Figure 2.37 and Figure 2.38 (both assuming the same double-failure scenario) is that in the virtual ring interconnection no valid signal is sent through the gateway C-H, whereas this is the case in D&C ring interconnection. Of course, a similar reasoning holds in the opposite direction from node F to node A. Note also that the D&C operation introduces some additional capacity to be allocated between both gateways; nevertheless, still only a single time slot on each link is required.
Figure 2.38 Drop and continue method to interconnect two SNCP Rings. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
Figure 2.39 is almost identical to Figure 2.38 except that a slightly different failure scenario is considered; now the gateway link D-I instead of the link F-J fails simultaneously with link A-B. The reasoning for the forward direction from node A to node F still holds: Two valid copies of the signal are sent from the left to right ring (one via gateway C and the other via gateway D), but in this case the copy sent from node D via I and J does also not reach the destination F (this time because of the failure of the gateway link D-I), and thus, F should select the other copy of the signal received via node G. However, in the backward direction from node F to node A the reasoning does not hold anymore. Indeed, gateway nodes H and I can select both signals received from node F and send it further on to the destination node A via C and B and via D and E, respectively. However, both paths are affected (because of the failure of the link A-B and the gateway link D-I, respectively), and thus, the destination node A does not receive a valid copy of the signal at all.
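The failure scenarios of Figures 2.37 through 2.39 can be replayed with a small reachability model. This is a sketch under assumptions, not an implementation of [G842]: the topology (left ring A-B-C-D-E, right ring H-G-F-J-I, gateway links C-H and D-I) is inferred from the virtual ring A-B-C-H-G-F-J-I-D-E-A given in the text, links are treated as failing in both directions at once, and all function names are hypothetical.

```python
def link(a, b):
    """Undirected link between two adjacent nodes."""
    return frozenset((a, b))


def path_ok(path, failed):
    """True if no link along the node string `path` has failed."""
    return all(link(a, b) not in failed for a, b in zip(path, path[1:]))


def virtual_ring_ok(failed):
    """Virtual ring: two fixed end-to-end paths, selection only at the endpoint."""
    return path_ok("ABCHGF", failed) or path_ok("AEDIJF", failed)


def _dnc_ok(arm1, arm2, between, exit1, exit2, failed):
    """D&C check for one direction of an interring connection.

    arm1/arm2 lead from the source to the two gateways of the source ring,
    `between` is the ring segment joining those gateways (the continue path),
    and exit1/exit2 lead from each gateway to the destination.
    """
    direct1 = path_ok(arm1, failed)
    direct2 = path_ok(arm2, failed)
    cont = path_ok(between, failed)
    # Each gateway selects between its direct copy and the continued copy.
    at_gw1 = direct1 or (direct2 and cont)
    at_gw2 = direct2 or (direct1 and cont)
    return (at_gw1 and path_ok(exit1, failed)) or (at_gw2 and path_ok(exit2, failed))


def dnc_forward_ok(failed):
    """Direction A -> F: gateways C and D perform drop and continue."""
    return _dnc_ok("ABC", "AED", "CD", "CHGF", "DIJF", failed)


def dnc_backward_ok(failed):
    """Direction F -> A: gateways H and I perform drop and continue."""
    return _dnc_ok("FGH", "FJI", "HI", "HCBA", "IDEA", failed)


fig_2_38 = {link("A", "B"), link("F", "J")}
fig_2_39 = {link("A", "B"), link("D", "I")}

print(virtual_ring_ok(fig_2_38))                            # False
print(dnc_forward_ok(fig_2_38), dnc_backward_ok(fig_2_38))  # True True
print(dnc_forward_ok(fig_2_39), dnc_backward_ok(fig_2_39))  # True False
```

The output agrees with the discussion: the double failure A-B and F-J defeats the virtual ring but not D&C, whereas the double failure A-B and D-I still interrupts the backward direction of the D&C-protected connection.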
Drop and Continue Interconnection of MS-SP Rings
Figure 2.40 illustrates the D&C interconnection of two MS-SP Rings instead of two SNCP Rings (the working path is assumed to be routed along A-B-C-H-G-F). The gateway node C (the first gateway along the path from node A to F) bridges/copies the signal onto two different paths toward the corresponding gateway node on the other ring. The add/drop signal is sent from node C directly to node H while the continue signal is routed through the second gateway D-I. Node H then selects the best copy out of both received signals (here, the signal it directly receives from node C, because of the failure of the gateway link D-I) before sending the signal
Figure 2.39 Drop and continue method, but considering a slightly different failure scenario. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
further to node G and F. A similar description holds for the opposite direction from node F to A. Although the right MS-SP Ring protects the continue signal against the failure of the link H-I (Figure 2.41), it does not protect the continue signal against the failure of link D-I, so H should still select the signal it directly receives from node C. In the figure, the left MS-SP Ring also simply protects the working signal against the failure of link A-B. Nevertheless, although Figure 2.41 considers that a third link fails (link H-I), in addition to the two simultaneous failures of Figure 2.39, the connection survives the triple failure, whereas in Figure 2.39 the double failure is already enough to interrupt at least the backward direction (from node F to node A).
Figure 2.42 illustrates that the D&C and selection operation does not always have to be performed in the ADMs of the same gateway (see left: same-side routing), but that on both rings this can be performed in the other gateway (see right: opposite-side routing). Consider, for example, that in Figure 2.41 two interring connections have to be routed, that is, between nodes B and G and between nodes E and J. Then the left side of Figure 2.43 represents the most obvious situation between the two gateways C-H and D-I. The left part of Figure 2.43 shows that on the links on the ring between the gateway nodes (thus, links C-D and H-I in Figure 2.41), twice the amount of capacity is required compared to the other links in the same ring. More generally, D&C on MS-SP Rings may result in the links between the gateway nodes becoming a bottleneck. In other words, this may require upgrading the whole ring to a higher capacity (e.g., from STM-16 to STM-64). The middle part of Figure 2.43
Figure 2.40 Drop and continue to interconnect two multiplex section–shared protection rings. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
Figure 2.41 Drop and continue to interconnect two multiplex section–shared protection rings, but considering a triple failure instead of a single failure. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
Figure 2.42 Same-side versus opposite-side drop and continue routing. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
Figure 2.43 Scenarios for handling the additional capacity needed for the continue signal. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
illustrates how transporting some of the continue signals as extra traffic in the MS-SP Ring protection/backup capacity can help to alleviate this bottleneck problem. Note, however, that carrying continue signals as extra traffic in the MS-SP Ring protection/backup capacity reduces the set of failure scenarios that can be handled properly. The right part of Figure 2.43 shows an alternative solution: installing a third gateway between both gateways. In this way (because of the spatial reuse capability), the load of the congested link between both gateway nodes can be spread over the two links between the outer gateways and the added gateway in the middle.
Drop and Continue Interconnection of MS-SP and SNCP Rings
Finally, it is also possible to use D&C to interconnect an MS-SP Ring with an SNCP Ring (Figure 2.44). The figure shows that the routing in the MS-SP Ring part is identical to the left part in Figure 2.41 (D&C and selection operation in node C), whereas the routing in the SNCP Ring part is identical to the right part of Figure 2.38 and Figure 2.39 (in the forward direction from A to F, the two copies of the
Figure 2.44 Drop and continue interconnection of a multiplex section–shared protection ring and an SNCP Ring. (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
signal are routed via the opposite sides of the ring to the destination node F, where the best copy is selected, whereas in the backward direction from F to A, both gateway nodes H and I select the best signal out of the two copies received from F, before sending them to nodes C and D, respectively, on the left ring). Note that in all scenarios described earlier, the D&C ring interconnection does not involve more than plain SNCP, that is, upstream a permanent 1+1 bridge (the D&C operation) and downstream a selection between both bridged signals. More details on D&C ring interconnection schemes can be found in [G842] and [ETSI2].
Interconnection of MS-DP Rings
Figure 2.45 shows that it is also feasible to interconnect two MS-DP Rings through two gateways. This technique is not based on the drop and continue principle, but it features similar capabilities. The forward direction (from node A to node F) is routed along the path A-E-D-I-J-F, whereas the backward direction (from node F to node A) is routed along the path F-G-H-C-B-A. In contrast to the D&C techniques, only a single copy of the signal is sent through one of the gateways (different gateways for the forward and backward directions). The capacity that is not used on the working fibers between both gateways (thus, on link D-C and link H-I) and the unused capacity on the gateway links (thus, link D-I and link H-C) allow a backup loop (D-C-H-I-D) to be preestablished that protects against gateway failures. For example, during a failure of the gateway link D-I, the forward direction of the signal is looped back in node D and sent along nodes C and H to node I, where it is looped back once again to continue on its original route.
Figure 2.45 Dual-gateway interconnection of two multiplex section–dedicated protection rings.
Figure 2.46 Dual-gateway interconnection of two multiplex section–dedicated protection rings, but considering a triple failure instead of a single failure.
Figure 2.46 also shows that the MS-DP Rings protect this loop against failures of the links on the ring between both gateways (in the figure, link H-I fails, so all traffic on the working fiber is looped back onto the protection/backup fiber from node H, via G, F, J to node I). Finally, the left MS-DP Ring protects the backward direction from node F to node A against the failure of link B-A.
Interconnection of Stacked Rings
Figure 2.47 illustrates that a stack of rings (e.g., when the capacity of a single ring is not enough to accommodate all traffic to be routed) is a particular situation in which interconnection of rings is important. The figure shows a stack of three STM-N Rings, each of them physically passing through four nodes but logically terminated by an ADM in only two nodes. The physical location in the back functions as a hub node, where each ring features an ADM. A connection between two nodes that are distinct from the hub (Figure 2.47) must be routed through the hub to go from one ring to another. Routing all STM-N Rings in the stack along the same physical path requires the least amount of cable to be installed in the ground (e.g., a cable accommodates 200 fibers). Instead of Space Division Multiplexing (SDM), one can also solve fiber exhaust problems by multiplexing stacked STM-N Rings onto a single fiber by means of Wavelength Division Multiplexing (WDM), instead of transporting each STM-N on its own fiber pair (or pairs). In the literature many articles are available on techniques to minimize the number of ADMs required in a stack of rings [Ari00], [Col00] and
Figure 2.47 An illustration of a stack of rings.
[Mod01]; this can be achieved by properly grooming connections into the appropriate rings. Unfortunately, most articles have so far ignored any potential need for dual-gateway interconnections in such stacked ring network designs.
Node Architectures for Gateways between Self-Healing Rings
All the examples described thus far implicitly assume that the add/drop ports of an ADM on one ring are hard-wired to the add/drop ports of an ADM on another ring. The ADMs are said to be interconnected back to back (see also top part of Figure 2.48). However, Figure 2.47 illustrates that often more than two rings will be interconnected with each other in one location. Therefore, one might consider increasing the flexibility of the ring interconnection by installing a central digital cross-connect (DXC) in that location, as shown in the middle part of Figure 2.48. Note that the DXC simply cross-connects the signals from one ring to the other and is not involved in any D&C or other technique to protect the gateways between the rings (this is still the responsibility of the ADMs on the rings). The bottom part of Figure 2.48 shows that such a DXC can also directly terminate the STM-N rings
Figure 2.48 Node architectures for ring interconnection.
instead of passing through some intermediate ADMs. Of course, having a DXC that directly terminates the STM-N rings does not remove the need for D&C; for example, D&C is still needed to survive a failure scenario as depicted in Figure 2.38, regardless of whether ADMs C and H and ADMs D and I are integrated into two DXCs directly terminating both rings.
2.4.5 Summary
• Multiplex section–shared protection rings (MS-SP Rings): 50% working and 50% protection/backup capacity in clockwise and counterclockwise directions; two- and four-fiber modes; possibility for spatial reuse (better capacity efficiency than in dedicated protection rings when capacity can be shared among on average more than two nonoverlapping connections); traffic is looped back in nodes adjacent to a failure; squelching needed to avoid misconnections; restricted to at most 16 nodes because of the APS protocol specs.
• Multiplex section–dedicated protection rings (MS-DP Rings): One working fiber in one direction and one protection/backup fiber in opposite direction; no spatial reuse; traffic is looped back in nodes adjacent to failure; squelching needed to avoid misconnections; restricted to at most 16 nodes because of the APS protocol specs.
• Subnetwork connection protection rings (SNCP Rings): Signal permanently bridged in RHE and both signals sent in opposite directions along the ring; RTE selects the best copy out of the two signals received from the RHE; no spatial reuse; no misconnections; in current SNCP, no APS signaling needed, so no restrictions on the number of ring nodes and shorter protection completion times than in MS-SP Rings or MS-DP Rings.
• Ring interconnection: Dual-gateway interconnection schemes prevent the ring interconnection gateways from becoming single points of failure.
• Dual-gateway interconnection schemes: Virtual ring and drop and continue (D&C) to interconnect any combination of SNCP and/or MS-SP Rings; customized interconnection scheme to interconnect MS-DP Rings.
• Advantage of D&C compared to virtual ring: Rings can protect independently from each other against single failures (→ allows simultaneous single failures in distinct rings).
• D&C in MS-SP Rings: Risk of overloading the links on the ring between the gateways → transport a fraction of the continue capacity as extra traffic in the MS-SP Ring protection/backup capacity or install additional gateway nodes to allow spreading the load on the links between the gateways.
• Interconnection of MS-DP Rings and stacked rings: Not only interconnection of physically separated rings but also interconnection of stacked rings is an issue.
• Gateway node architectures: Back-to-back interconnection of ADMs; increased flexibility in the form of a central DXC to which all ADMs in a gateway connect; single DXC that directly terminates all interconnected rings.

2.4.6 Differences between SONET and SDH
The protection rings described in the previous sections are protection rings for SDH networks. For each of them, a counterpart exists in SONET networks (although conceptually identical, they are not fully interoperable with SDH rings, because of some minor differences in the APS protocol details). The main difference is that a different terminology is adopted in SONET networks. More precisely, the format is xySR. The first character (x) represents whether it concerns a unidirectional (x = U) or a bidirectional (x = B) ring (thus, whether all working capacity flows in one direction and all protection/backup capacity in the opposite direction). The second character (y) refers to the recovery extent; it indicates whether it concerns a line (y = L) or path (y = P) switched ring (respectively, called local and global recovery in Chapter 1).
• An MS-SP Ring is called a bidirectional line switched ring (BLSR) in SONET networks. To discriminate between two- and four-fiber configurations, the number of fibers is added (respectively, BLSR/2 and BLSR/4 rings).
• An MS-DP Ring is called a unidirectional line switched ring (ULSR) in SONET networks.
• An SNCP Ring is called a unidirectional or bidirectional path switched ring (UPSR or BPSR) in SONET networks. Note that in a UPSR ring, the RTE will select by default the signal received through one of its ports (e.g., the “west” port). Only if this signal is affected will the RTE select the signal coming in through the other port (here, the “east” port).
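The xySR naming scheme and the SDH-to-SONET correspondence listed above can be captured in a few lines. The following is only a minimal sketch: the names `SDH_TO_SONET` and `decode_sonet_ring` are illustrative and not part of any standard or library, and the parsing of the "/2" and "/4" fiber suffix simply follows the BLSR/2 and BLSR/4 notation used in the text.

```python
# Ring names as given in the text; the identifiers below are hypothetical.
SDH_TO_SONET = {
    "MS-SP Ring": ("BLSR",),
    "MS-DP Ring": ("ULSR",),
    "SNCP Ring": ("UPSR", "BPSR"),
}

DIRECTION = {"U": "unidirectional", "B": "bidirectional"}  # first character (x)
SWITCHING = {"L": "line", "P": "path"}                      # second character (y)


def decode_sonet_ring(name):
    """Decode an xySR name such as 'UPSR' or 'BLSR/4' into its parts."""
    base, _, fibers = name.partition("/")
    if len(base) != 4 or base[2:] != "SR":
        raise ValueError(f"not an xySR ring name: {name!r}")
    if base[0] not in DIRECTION or base[1] not in SWITCHING:
        raise ValueError(f"unknown x or y character in {name!r}")
    return {
        "direction": DIRECTION[base[0]],
        "switching": SWITCHING[base[1]],
        "fibers": int(fibers) if fibers else None,  # only given for BLSR/2, BLSR/4
    }


print(decode_sonet_ring("BLSR/4"))
print(decode_sonet_ring("UPSR"))
```

The dictionary form makes the one-to-many case explicit: an SNCP Ring corresponds to either a UPSR or a BPSR.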
2.5 Linear Protection
In Section 2.3.4, APS was discussed in more detail from an architectural and protocol viewpoint, with linear protection used as an example. The goal of this section is to discuss different strategies based on linear protection switching: Sections 2.5.1 and 2.5.2 deal with multiplex section protection (MSP) and path protection, respectively, and Section 2.5.3 summarizes the main conclusions. Note that a similar discussion of the more advanced but widely deployed self-healing ring protocols is the subject of Section 2.4.
2.5.1 Multiplex Section Protection
Linear protection is often applied on the multiplex section (MS) level (see Figure 2.22). Figure 2.49 illustrates the most general case: M:N (here 2:3) multiplex section protection. A span between two network elements (e.g., two DXCs) consists of five STM-N signals; two of these protect the span against a single or double failure on the three working signals. A double failure is shown in the figure; that is, the failure of STM-N working channel 1 is circumvented by routing the signal through protection/backup channel 1, whereas a failure of working channel 3 is circumvented by routing the signal through protection/backup channel 2. Note that in the figure the bidirectional mode has been considered. Thus, although only one direction of the working channels is affected, both the forward and the backward direction of the signal are switched over to the protection/backup channels. As explained in Section 2.3.4 the protection/backup channels can be used for the transport of extra traffic when they are not used for protection purposes. M:N or
Figure 2.49 Bidirectional linear M:N multiplex section protection; here, M = 2, N = 3. (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
1:N with N larger than 1 assumes that not all channels will fail simultaneously; therefore, linear M:N or 1:N protection will typically not be used to protect against cable cuts (which would typically affect all channels) but to protect, for example, against line card failures.
In Section 2.3.4 it was pointed out that subnetwork connection protection (SNCP) relies on unidirectional linear 1+1 protection. Figure 2.50 shows that (unidirectional) linear 1+1 protection can be applied on the multiplex section (MS) level on a span between two network elements; the recovery head end (RHE) bridges/broadcasts the signal onto two distinct STM-N channels and the recovery tail end (RTE) selects the best one. Because multiplex section protection (MSP) is typically implemented as trail protection, nothing prevents operating linear 1+1 MSP in the bidirectional mode (but unidirectional operation is considered in the figure to differentiate it from Figure 2.49).
As mentioned earlier, linear M:N or 1:N MSP with N larger than 1 is typically applied to protect against equipment failures like line card failures. As shown in Figure 2.51, often the network-wide recovery schemes do not cover the interconnection between the client network equipment and the tributary ports to which this client equipment is connected. Therefore, linear 1:N MSP is very often provided to protect against failures of the tributary ports on a network element. For example (Figure 2.51), client equipment can connect to an ADM through one or more STM-1 ports. The ADMs support linear 1:N protection to protect these STM-1 ports, whereas inside the network they support any of the self-healing ring protocols described in Section 2.4.
As illustrated in Figure 2.9 in Section 2.2.4, the VC-n TT will be left to the client network equipment, because the client network equipment is connected through SDH STM-1 ports and thus there is no need for the SDH NEs to deal with client-specific signal processing.
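The M:N behavior described in this section, where N working channels share M protection/backup channels, can be illustrated with a small sketch. It is a simplification under stated assumptions: protection channels are handed out first come, first served, whereas a real APS protocol arbitrates between requests; the function names `assign_protection` and `protection_overhead` are hypothetical.

```python
def assign_protection(failed_working, m):
    """Assign failed working channels to protection channels 1..m.

    Returns (assignment, unprotected): a dict mapping each recovered working
    channel to its protection channel, and the failed channels left over.
    Assumption: first come, first served (no APS request priorities).
    """
    free = list(range(1, m + 1))
    assignment, unprotected = {}, []
    for channel in failed_working:
        if free:
            assignment[channel] = free.pop(0)  # any extra traffic on it is preempted
        else:
            unprotected.append(channel)
    return assignment, unprotected


def protection_overhead(m, n):
    """Protection/backup capacity as a fraction of the working capacity."""
    return m / n


# Double failure of Figure 2.49 (M = 2, N = 3): working channels 1 and 3 fail.
print(assign_protection([1, 3], m=2))
# A third failure exceeds the two available protection channels.
print(assign_protection([1, 3, 2], m=2))
print(protection_overhead(2, 3))
```

The second call shows why M:N protection targets isolated equipment failures rather than cable cuts: once more than M working channels fail together, some of them stay unprotected.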
2.5.2 Path Protection
In Section 2.3.4 it has already been explained that subnetwork connection protection (SNCP) (instead of trail protection) currently relies on unidirectional linear
Figure 2.50 Unidirectional linear 1+1 multiplex section protection. (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
Figure 2.51 STM-1 linear 1:N MSP (here, N = 4) to protect tributary ports to/from client equipment.
1+1 protection. As Figure 2.52 illustrates, this means that the recovery head end (RHE) bridges/broadcasts the signal onto two distinct paths (here, RHE = node A), whereas the recovery tail end (RTE) selects the best copy it receives via both paths (here, RTE = node D), based on the supervisory processes described in Section 2.2.3 and the failure notification and propagation processes described in Section 2.3 (and when adopting the revertive mode of operation, a WTR is applied to prevent frequent protection switching actions). The advantage of SNCP is that it is typically applied in the path (VC-n signals) instead of section layers. Thus, SNCP can also protect against node failures, for example, against outages of node B or node C in Figure 2.52. This would not be possible when adopting linear MSP: Here, linear MSP would be able to protect only the spans between nodes A and B, between nodes B and C, and between nodes C and D, but any outage of node B or C itself would result in the path from node A to node D becoming unavailable. Remember also from Section 2.3.4 that SNCP
Figure 2.52 End-to-end subnetwork connection protection (only one direction shown). (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
need not necessarily protect a connection completely from ingress to egress. For instance, nodes A and D in Figure 2.52 do not necessarily terminate the protected connection. One drawback of applying SNCP on an end-to-end basis in large-scale networks is that both paths onto which the signal is bridged/broadcast in the RHE may fail simultaneously, even when they are routed completely physically disjoint from each other. For example, the failure scenario considered in Figure 2.53 would make the connection protected as illustrated in Figure 2.52 unavailable. Of course, one can think about dividing the network into subnetworks and adopting SNCP inside each individual subnetwork. However, in that case the interconnection of the subnetworks may become a concern from an availability point of view. As Figure 2.53 illustrates, drop and continue (D&C) can be adopted to overcome this problem: Note that the routing (only one direction shown) is similar to the one illustrated in Figures 2.38 and 2.39 in Section 2.4.4 (indeed, SNCP Rings simply apply SNCP on a connection-per-connection basis inside a ring network, but there is nothing specific to rings). As Figure 2.53 shows, there is no need to explicitly divide the network into subnetworks; one can apply SNCP enhanced with D&C to build a kind of “ladder” network on a connection-per-connection basis to increase the end-to-end availability.
Finally, Figure 2.54 illustrates that linear M:N or 1:N (here, 1:3) protection is also applicable at the path level instead of the section level. In the figure, one VC-n path protects against the failure of one out of the three working VC-n paths. The figure also shows that the protection/backup VC-n path can be used for the transport of extra traffic, but when it is needed for protection that extra traffic will be preempted. The main difference with linear MSP is that it is not applied on a span-per-span basis but on an end-to-end basis.
A drawback of the linear 1:N path protection is that the VC-n trail needs to be terminated in the RHE and RTE (thus, ingress and egress, respectively) and thus that these nodes should be capable of processing client-layer information
Figure 2.53 Subnetwork connection protection with drop and continue mechanism to increase end-to-end availability (only one direction shown). (ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.)
signals. Therefore, the ports to which the client network equipment connects need not necessarily be SDH compliant, because the SDH edge network elements have to process the client-layer signal anyway. Nevertheless, it might be worth considering linear M:N or 1:N path protection, because it also covers failures of intermediate nodes (which is not the case for linear MSP, because intermediate nodes terminate the multiplex sections [MSs]) and it outperforms SNCP in terms of capacity efficiency (the amount of additional protection/backup capacity compared to the working capacity equals only 33% in Figure 2.54 instead of the 100% needed when adopting SNCP). Of course a drawback of adopting 1:N instead of 1+1 path protection is that the protection switching actions in the RHE and RTE must be coordinated by means of an APS signaling protocol, and thus, the protection switching completion time will be significantly longer (this is especially true for geographically large networks because of the long propagation delays). Of course, one should be careful to route the 1+N paths disjoint from each other with linear 1:N path protection.
The (node) connectivity for a particular node pair is defined as the maximum number of (node) disjoint paths that exist between both nodes. To derive the node connectivity, the network topology needs to be translated into a dual-network representation by means of the following steps:
• Each node is represented by an “in” and “out” vertex, interconnected by a directed edge from “in” to “out” vertex.
Figure 2.54 Linear 1:N path protection (here, N = 3) with support of extra traffic (only one direction shown). (ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.)
• Each link is represented by two directed edges, each leaving its source at the “out” vertex and entering its destination at the “in” vertex.
• Each edge is assigned a single unit of capacity.
The maximal flow that can be set up through this dual-network representation from the “out” vertex in the source node to the “in” vertex in the destination node determines the node connectivity of the network between source and destination node. Assuming that 1+N is less than or equal to the connectivity between source and destination nodes, supplying 1+N capacity units in the “out” vertex in the source node, demanding 1+N capacity units in the “in” vertex in the destination node, and solving a standard minimum cost flow problem will determine the cheapest (see footnote 16) and mutually disjoint routing of the 1+N connections needed for the linear 1:N path protection. Note also that supplying/demanding two capacity units will determine the cheapest routing of both paths in SNCP protection; this is illustrated in Figure 2.55.
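The node-splitting steps and the minimum cost flow computation described above can be sketched as follows. This is one possible realization, not the method prescribed by the text: it solves the unit-capacity minimum cost flow by successive shortest paths with Bellman-Ford, assumes zero cost on the internal “in” to “out” edges and positive link costs, and uses a hypothetical function name and example topology.

```python
def node_disjoint_paths(links, source, dest, k):
    """Return k node-disjoint source->dest paths of minimum total cost, or None.

    links maps an undirected link (u, v) to its (positive) cost.
    """
    edges = []  # [tail, head, residual capacity, cost]; edge i pairs with i ^ 1

    def add_edge(tail, head, cost):
        edges.append([tail, head, 1, cost])   # forward edge, one capacity unit
        edges.append([head, tail, 0, -cost])  # residual (reverse) edge

    nodes = {n for link in links for n in link}
    for n in nodes:                           # each node: "in" -> "out" edge
        add_edge((n, "in"), (n, "out"), 0)
    for (u, v), cost in links.items():        # each link: two directed edges
        add_edge((u, "out"), (v, "in"), cost)
        add_edge((v, "out"), (u, "in"), cost)

    start, goal = (source, "out"), (dest, "in")
    for _unit in range(k):                    # route one capacity unit per round
        dist, pred = {start: 0}, {}
        for _sweep in range(2 * len(nodes)):  # Bellman-Ford on the residual graph
            changed = False
            for i, (tail, head, cap, cost) in enumerate(edges):
                if cap and tail in dist and dist[tail] + cost < dist.get(head, float("inf")):
                    dist[head], pred[head] = dist[tail] + cost, i
                    changed = True
            if not changed:
                break
        if goal not in dist:
            return None                       # node connectivity is below k
        vertex = goal
        while vertex != start:                # push the unit along the path found
            i = pred[vertex]
            edges[i][2] -= 1
            edges[i ^ 1][2] += 1
            vertex = edges[i][0]

    succ = {}                                 # saturated link edges give next hops
    for tail, head, cap, _cost in edges[::2]:
        if cap == 0 and tail[1] == "out":
            succ.setdefault(tail[0], []).append(head[0])
    paths = []
    for first_hop in succ.get(source, []):
        path = [source, first_hop]
        while path[-1] != dest:
            path.append(succ[path[-1]][0])
        paths.append(path)
    return sorted(paths)


# Hypothetical topology: the cheapest single path A-B-E-D blocks any disjoint partner.
links = {("A", "B"): 1, ("B", "E"): 1, ("E", "D"): 1, ("A", "F"): 2,
         ("F", "E"): 2, ("B", "C"): 2, ("C", "D"): 2}
print(node_disjoint_paths(links, "A", "D", 2))
```

In this example a greedy approach (take the shortest path A-B-E-D, then look for a second path avoiding B and E) would fail, whereas the flow formulation reroutes the first unit and returns the pair A-B-C-D and A-F-E-D. Calling the function with `k=3` returns `None`, since node A has only two neighbors.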
Figure 2.55 Illustration of the dual-network representation for calculating disjoint paths. (Edge labels read [x, y], with x the cost/length and y the capacity; supply = 2 units at the source and demand = 2 units at the destination; the legend also marks the edges over which a capacity unit of the flow is routed.)
16 Of course, cheapest means in terms of a cost assigned to each directed edge to carry one unit of capacity (or to route one capacity unit over a network link or through a node). This cost thus abstracts from the granularity of the network equipment.
2.5.3 Summary
• Linear multiplex section protection (MSP): Span protection (thus, excludes node failures); protection against line card failures; often used for protecting tributary ports (to client network equipment).
• Linear path protection: Often as subnetwork connection protection (SNCP), thus, unidirectional 1+1 protection; D&C increases end-to-end availability of SNCP; linear M:N or 1:N path protection can improve the capacity efficiency significantly but requires the RHE and RTE to be capable of processing client-layer signals.
2.6 Restoration
The goal of this section is to investigate more flexible recovery mechanisms. Section 2.6.1 compares protection versus restoration techniques, and Section 2.6.2 summarizes the main conclusions.
2.6.1 Protection versus Restoration
In the previous sections, we have discussed different protection strategies based on the Automatic Protection Switching (APS) protocol. All these strategies rely on preestablished protection/backup resources for specific working resources. By having the configuration management setting up all these working and protection/backup resources in advance, it becomes possible for a light distributed APS protocol to switch over in the case of a failure from the working to the protection/backup resources autonomously (without any direct involvement of the network management system [NMS]) within a very short time frame. More precisely, all APS-based protection techniques aim to achieve protection switching times on the order of 50 or 60 ms. Note that even with a light APS protocol, this objective may be difficult to achieve in (geographically) very large networks, simply because of the significant propagation delays. For example, in Section 2.4.1, we mentioned that the protection switch completion time in an MS-SP Ring containing 16 nodes interconnected by links of 100 km can take up to $(2+1) \times [(16-1) \times (0.5 + 3 \times 0.125)] = 3 \times (15 \times 0.875) = 3 \times 13.125 = 39.375$ ms. Links of 200 km instead of 100 km (or, thus, a ring of 3200 km instead of 1600 km) would result in a value of $(2+1) \times [(16-1) \times (1 + 3 \times 0.125)] = 3 \times (15 \times 1.375) = 3 \times 20.625 = 61.875$ ms. However, remember also that these values exclude the time needed for the failure to be detected (and propagated) and assume an ideal case in which the time needed to process the APS requests and to act accordingly can be neglected.
Of course, having a preestablished standby protection/backup resource for each resource or a few working resources is not optimal from a capacity (and thus cost) efficiency perspective. By setting up connections at the time of a failure along the alternative path, the spare capacity can be used more efficiently. The process of
setting up the connections along the alternative path at the time of the failure will typically slow down the recovery cycle significantly. One can distinguish mainly between two approaches for setting up these connections along the alternative route at the time of a failure.
In the first approach the whole recovery process is typically centralized in the central network management system (NMS). The fault management process, as described in Figure 2.11 of Section 2.3.1, will notify the NMS. Clearly, considerable time may be lost in the management reporting functions (e.g., because of the fault correlation filter f3). Once the NMS is aware of the fault, the configuration management process can start setting up the connections along an alternative route. Configuration management is typically designed to be robust instead of extremely fast. The fact that many connections may need to be rerouted at the time of a failure will typically imply an extra burden on the configuration management process. Computing the (alternative) routes along which the connections need to be set up at the time of a failure can be done either in advance (to speed up the recovery process) or at the time of a failure.
In the second approach, a distributed protocol suite (e.g., including a distributed routing protocol) may be responsible for signaling the setup of connections along the alternative paths if a failure occurs. In such a distributed approach, too, paths can be precomputed or computed in real time at the time of the failure. Note also that such a distributed approach does not necessarily imply that the alternative routes are computed in a distributed fashion; it is also possible to keep the path computation process centralized. Until now, distributed recovery strategies have never been standardized for SDH networks.
However, this may change with the introduction of a standardized distributed control plane in Automatically Switched Transport Networks (ASTNs), most likely based on Generalized Multi-Protocol Label Switching (G-MPLS) protocol stacks (as described in Chapter 6).
The most flexible recovery strategy is the one that computes the alternative routes, and possibly also some working routes, at the time of a failure based on the actual status of the network. Because nothing is precomputed, this will be one of the slowest recovery strategies. In accordance with Chapter 1 this strategy is classified as restoration. As with protection strategies, one can distinguish between global and local recovery strategies. In path restoration, alternative paths are computed on an end-to-end basis, so each affected connection can (but need not) be assigned a route that is completely distinct from the working route, whereas in link or node restoration all connections transiting the nodes adjacent to a failing link or node are rerouted only between these nodes. Typically path restoration is more capacity (and thus cost) efficient than link restoration because its global nature allows spreading the alternative routes over the entire network and finding more optimal routes (a shortest path plus a local detour can become longer than a second shortest path between the endpoints). For example (see the case study in Section 3.6.4 in Chapter 3), in some networks choosing path restoration instead of link or node restoration can save up to 10% of the capacity needed in the network, whereas choosing path restoration instead of dedicated path protection (like SNCP) can result in capacity savings of up to 30%.
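The remark that a shortest path plus a local detour can be longer than a second shortest path between the endpoints can be checked on a toy network. The sketch below is illustrative only: the topology, link lengths, and function names are assumptions, it compares route lengths for a single connection rather than network-wide capacity, and it assumes that a detour around the failed link exists.

```python
import heapq


def shortest_path(links, src, dst, failed=None):
    """Dijkstra over undirected links {(u, v): length}, skipping the failed link."""
    adjacency = {}
    for (u, v), length in links.items():
        if failed is not None and {u, v} == set(failed):
            continue
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    heap, done = [(0, [src])], set()
    while heap:
        dist, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return dist, path
        if node in done:
            continue
        done.add(node)
        for nxt, length in adjacency.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (dist + length, path + [nxt]))
    return None


def link_restoration(links, working, failed):
    """Keep the working route and splice in a detour around the failed link."""
    u, v = failed                      # given in the order used by the working route
    i = working.index(u)
    detour_len, detour = shortest_path(links, u, v, failed)
    kept = sum(links.get((a, b), links.get((b, a)))
               for a, b in zip(working, working[1:]) if {a, b} != {u, v})
    return kept + detour_len, working[:i] + detour + working[i + 2:]


# Hypothetical network: working route A-B-C-D, and link B-C fails.
links = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1, ("B", "E"): 2,
         ("E", "C"): 2, ("A", "F"): 2, ("F", "D"): 2}
_, working = shortest_path(links, "A", "D")
print(link_restoration(links, working, ("B", "C")))   # local detour between B and C
print(shortest_path(links, "A", "D", ("B", "C")))     # end-to-end rerouting
```

In this example link restoration yields the route A-B-E-C-D of length 6, whereas path restoration finds A-F-D of length 4, which is the effect described in the parenthetical remark above.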
The distinction between protection and restoration is not very well defined. For example, a strategy that precomputes a backup route disjoint from the working route for each connection and in which affected connections still need to be set up along the backup path at the time of a failure is sometimes called path restoration or shared path protection. The drawback of recovery strategies based on precomputed alternative paths is that such paths can be precomputed only for a limited number of expected failure scenarios, whereas strategies that can compute the (alternative) routes at the time of a failure are much more flexible in the sense that they can take into account the actual status of the network, even in the case of unexpected failure scenarios. Thus, for example, in shared path protection, a double failure affecting both working and backup routes will still result in the connections remaining unavailable. In summary, setting up connections at the time of a failure typically implies better capacity efficiency, and route computation at the time of a failure implies a better compromise between capacity efficiency and failure coverage.
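The difference in failure coverage can be made concrete with a small sketch. It assumes a hypothetical three-route topology and that spare capacity is available wherever a route survives; the function names are illustrative and are not part of any standardized protocol.

```python
from collections import deque

def links_of(route):
    """Return the set of (undirected) links traversed by a node sequence."""
    return {frozenset(pair) for pair in zip(route, route[1:])}

def survives_with_precomputed_backup(working, backup, failed_links):
    """A connection with one precomputed backup survives only if the working
    route or the backup route is untouched by the failures."""
    working_ok = not (links_of(working) & failed_links)
    backup_ok = not (links_of(backup) & failed_links)
    return working_ok or backup_ok

def route_exists(links, src, dst, failed_links):
    """Breadth-first search on the surviving topology, as a stand-in for
    route computation at the time of the failure (capacity is ignored)."""
    alive = [link for link in links if link not in failed_links]
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for link in alive:
            if node in link:
                (other,) = link - {node}
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return False

# Hypothetical topology with three link-disjoint routes between A and C.
working, backup, third = ["A", "B", "C"], ["A", "E", "C"], ["A", "D", "C"]
topology = links_of(working) | links_of(backup) | links_of(third)

single = {frozenset(("B", "C"))}
double = {frozenset(("B", "C")), frozenset(("E", "C"))}

print(survives_with_precomputed_backup(working, backup, single))
print(survives_with_precomputed_backup(working, backup, double))
print(route_exists(topology, "A", "C", double))
```

With the double failure, the precomputed backup is hit as well, so the connection stays down, whereas a route computed at the time of the failure can still use the third route A-D-C.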
2.6.2 Summary

• APS-based protection techniques: Preestablished backup resources → very fast protection switch completion times on the order of 50 or 60 ms.
• Setting up connections along the (alternative) route at the time of a failure: Slower, but better capacity efficiency; distributed or centralized.
• Route computation at the time of a failure: Implies slowest recovery; better compromise between capacity efficiency and failure coverage; distributed or centralized.
• Link versus path restoration: Path restoration is typically more capacity efficient.
2.7 Case Study

In this section an extensive practical case study is presented for a realistic network scenario. The goal is to compare three protection strategies from a cost and capacity perspective: pure end-to-end SNCP protection, pure MS-SP Ring–based protection, and a mix of end-to-end SNCP and MS-SP Ring protection. The case study presented in this section is part of a larger study [Col02], [Ari01], [Ari98], [Str00]. First, we list the assumptions of the case study: the network scenario, the node architectures, and the different protection strategies. Next, the objectives of the case study are highlighted, and the proposed design and evaluation methodologies are described. Finally, we present the results of this case study, before the major conclusions are recapitulated.

Assumptions: Network Scenario, Node Configurations, and Protection Strategies

In this case study a pan-European SDH-based carriers’ carrier network (a network providing transport services to other carrier networks like PSTN/ISDN and ISP
networks) is considered. This network interconnects 15 European cities by means of 19 intercity links. The sum of the lengths of all these links equals 4954 km. Major cities host two points-of-presence (PoPs) over which the client traffic is distributed evenly to increase the network reliability. The entire network contains 25 nodes, of which 5 are not used as PoPs but only as flexibility points (thus, nodes in which traffic only transits the node but is not added/dropped to client network equipment). Eight nodes have a node degree of 3 (the node degree being the number of links incident to a node), whereas all remaining 17 nodes have a node degree of 2, resulting in an average node degree of 2.32. The traffic forecast considered in this case study is specified in terms of a number of E1s (footnote 17) (= 2 Mbps) or VC-12s, E3s (= 34 Mbps) or VC-3s, and E4s (= 140 Mbps) or VC-4s. Very roughly speaking, the total traffic volume (in Mbps) is distributed evenly over the three traffic components in the forecasts. The total traffic volume is equivalent to an average of 30 Mbps per node pair. Based on the results of the broader study [Col02], [Ari01], we have decided to present only two possible node configuration architectures. Figures 2.56 and 2.57 illustrate the node architecture without DXC for the SNCP and MS-SP Ring protection strategy, respectively. In these figures the node is incident to two network
Figure 2.56 Node architecture without digital cross-connect for the subnetwork connection protection strategy. [Diagram labels: E1 → VC-12 and E3 → VC-3 through a LO MUX; E4 → VC-4; a WDM Mux on each side of the node; access part; flexibility part; transit VC-4 carrying E4 or E3 and E1 traffic at the end.]

Footnote 17: Ex signals are PDH signals. For backward compatibility (remember from Section 2.2 that PDH is the predecessor of SDH), the SDH C-n containers are designed so that their capacity matches that of the PDH Ex signals.
Figure 2.57 Node architecture without digital cross-connect for the multiplex section–shared protection ring protection strategy. [Diagram labels: E1 → VC-12 and E3 → VC-3 through a LO MUX; E4 → VC-4; a WDM Mux on each side of the node; transit VC-4 carrying E4 or E3 and E1 traffic at the end.]
links. Thanks to wavelength division multiplexing (WDM), only a single fiber pair (one fiber in each direction) is needed per network link. Each wavelength channel carries an STM-16 signal. In addition to the WDM multiplexers and demultiplexers, such an optical link also features booster amplifiers, inline amplifiers, preamplifiers, and transponders having a cost of 20%, 25%, 70%, and 15%, respectively, of the cost of a WDM (de)multiplexer. A more detailed description of optical networking equipment is given in Chapter 3. As shown in Figures 2.56 and 2.57, each STM-16 wavelength channel enters the node through one fiber pair and passes through one or more ADMs before leaving the node through the other fiber pair; this way, the individual VC-4s in the STM-16 signal can be accessed (added and/or dropped) in a relatively cost-efficient way. Note that each ADM has 16 STM-1 tributary ports. Both figures also show that LO MUXs (de)multiplex the low order traffic (VC-12/E1s and VC-3/E3s) into VC-4s at the edge of the network. These LO MUXs are directly connected to the STM-1 ADM tributary ports. Because this multiplexing and demultiplexing takes place only at the edge of the network, VC-4s carrying lower order traffic are not terminated in intermediate nodes, and the lower order traffic therefore has to be groomed into the VC-4s on a per-destination basis (thus, the capacity will not always be used completely). For example, consider a PoP with a degree of two from which an SNCP-protected VC-12/E1 needs to be routed to each of the other 19 PoPs. Then 19 VC-4s would leave the PoP in each direction (because the VC-4s are not terminated in intermediate nodes), so in each direction two instead of one STM-16 wavelength channels would be needed. However, when considering the other node architecture (based on a central DXC-4/
3/1), VC-4s carrying LO traffic are terminated in intermediate nodes, so the LO traffic can be regroomed in the intermediate nodes. This implies in this example that a single VC-4 (being able to accommodate up to 63 VC-12s) in each direction is enough to transport the 19 VC-12s and that only a fraction (one sixteenth) of one wavelength channel will be used in each direction.

An important difference between Figure 2.56 and Figure 2.57 is that in the SNCP protection strategy, some ADMs are dedicated to access (ADMs through which the local customer traffic enters and exits the network) and some are dedicated to providing the necessary flexibility in the node to allow a transit VC-4 to enter the node on one wavelength and leave it on another wavelength, whereas the same ADMs can serve both functions in the MS-SP Ring protection strategy. As Figure 2.56 illustrates for the SNCP protection strategy, although working and backup connections may enter and leave the node on different wavelength channels, only a single ADM in the node can be responsible for the SNCP operation (the bridge/select). If one ADM were to function as both access and flexibility ADM, it would have to be possible for the customer traffic to enter the network through one of its STM-1 tributary ports while a copy of the traffic is sent through another of its STM-1 tributary interfaces to another wavelength channel. Because ADMs typically do not support such a capability, it is necessary to have a dedicated access ADM performing the SNCP operation and another ADM for flexibility purposes on the same wavelength. Finally, note that at most one access ADM per wavelength channel is needed (all 16 tributary signals can be bridged in/selected from the west and east directions), whereas up to two flexibility ADMs may be needed per wavelength channel (in the worst case, all VC-4s from both the west and the east side need to be added from/dropped to another wavelength channel).
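The grooming example given earlier in this passage (one SNCP-protected VC-12 from a degree-2 PoP to each of the other 19 PoPs) comes down to two ceiling divisions. The following sketch reproduces that arithmetic; the function names are illustrative, and the constants are the figures quoted in the text (63 VC-12s per VC-4, 16 VC-4s per STM-16 wavelength channel).

```python
import math

VC12_PER_VC4 = 63    # a VC-4 accommodates up to 63 VC-12s
VC4_PER_STM16 = 16   # an STM-16 wavelength channel carries 16 VC-4s

def vc4s_per_direction(destinations, vc12_per_destination, regroom):
    """VC-4s leaving the PoP in one direction.

    regroom=False: no DXC, so each destination needs its own VC-4(s)
                   (per-destination grooming at the network edge).
    regroom=True:  a central DXC-4/3/1 terminates the VC-4s in every node,
                   so VC-12s for different destinations can share a VC-4.
    """
    if regroom:
        return math.ceil(destinations * vc12_per_destination / VC12_PER_VC4)
    return destinations * math.ceil(vc12_per_destination / VC12_PER_VC4)

def wavelength_channels(vc4s):
    """STM-16 wavelength channels needed to carry the given number of VC-4s."""
    return math.ceil(vc4s / VC4_PER_STM16)

without_dxc = vc4s_per_direction(19, 1, regroom=False)
with_dxc = vc4s_per_direction(19, 1, regroom=True)

print(without_dxc, wavelength_channels(without_dxc))  # 19 VC-4s, 2 channels
print(with_dxc, wavelength_channels(with_dxc))        # 1 VC-4, 1 channel
print(with_dxc / VC4_PER_STM16)                       # 0.0625, one sixteenth
```

The sketch only counts VC-4s in one direction of one node; it does not model the ADM or DXC equipment that each option requires.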
Figure 2.57 shows the node architecture without DXC for the MS-SP Ring protection strategy. As mentioned earlier, no separation of access and flexibility ADMs is needed. The reason is that the traffic is not duplicated in the MS-SP Ring protection strategy (footnote 18), so there is also no risk that both duplicates enter and leave the node on distinct wavelength channels/rings. Even when supporting D&C ring interconnection, this risk does not exist. For local customer traffic, there is no need for routing the traffic from one ring to another, because this access traffic can be directly added on/dropped from the right ring/wavelength channel. Note that because there is no need to separate access and flexibility functions into distinct ADMs and the MS-SP Ring protocol operates in the two-fiber mode (footnote 19), a single ADM per wavelength is needed (at most 50% of the 16 VC-4s = 8 VC-4s can be dropped from/added in both the west and the east direction). In Figures 2.56 and 2.57, all transit VC-4s (independent of whether they carry E4 or E3 and/or E1 traffic) entering and leaving the node on different

Footnote 18: In each STM-16 MS-SP Ring/wavelength channel, 50% of the capacity is dedicated as protection/backup capacity. Because this protection/backup capacity is intrinsically available on the MS-SP Rings, there is no need to route a duplicate of the VC-4 path through the network.

Footnote 19: Because of the wavelength division multiplexing, two-wavelength mode (one wavelength in the clockwise and one in the counterclockwise direction) would be more appropriate as terminology here.
wavelength channels are routed directly from one (flexibility) ADM to another one on the other wavelength. In other words, the STM-1 tributary ports are interconnected hard-wired back-to-back (thus, by means of a fixed fiber interconnection). Because lower order traffic tends to be more variable (frequently requiring maintenance staff to go on site to modify the back-to-back interconnections) and there is a discrepancy between the average traffic volume per node pair and the capacity of a VC-4, the option to invest in a DXC-4/3/1 to cross-connect the lower order traffic was investigated. Note that such DXCs are quite expensive; it is assumed that a DXC with 56, 112, or 224 STM-1 ports costs 941%, 1647%, or 2824%, respectively, of the cost of an ADM, whereas a single LO MUX costs only 29% of the cost of an ADM. Nevertheless, it was judged that the rare modifications needed for the E4 traffic do not justify the investment in DXC equipment to get rid of the manual interventions. This leads to the node architecture with DXC as presented in Figures 2.58 and 2.59. Nothing changes for the VC-4s carrying E4 traffic; access is done directly through the (dedicated access) ADMs, whereas the back-to-back interconnection of the ADM STM-1 tributary ports remains hard-wired. For the lower order traffic, a central DXC-4/3/1 is installed. This DXC connects to multiple ADMs on different wavelength channels through STM-1 interfaces. Customers of lower order traffic access the network directly through this DXC. To improve the capacity efficiency as
Figure 2.58 Node architecture with digital cross-connect for the subnetwork connection protection strategy. [Diagram labels: DXC-4/3/1 with E1 → VC-12 and E3 → VC-3 access and transit VC-3 or VC-12; VC-4/STM-1 ports; E4 → VC-4; a WDM Mux on each side of the node; access part; flexibility part; transit VC-4 carrying E4 traffic at the end.]
Figure 2.59 Node architecture with digital cross-connect for the multiplex section–shared protection ring strategy. [Diagram labels: DXC-4/3/1 with E1 → VC-12 and E3 → VC-3 access and transit VC-3 or VC-12; VC-4/STM-1 ports; E4 → VC-4; a WDM Mux on each side of the node; transit VC-4 carrying E4 traffic at the end.]
much as possible, the DXC terminates all VC-4s entering the node and carrying LO traffic. As Figure 2.58 illustrates, the SNCP operation to protect the E4 traffic is still the responsibility of the dedicated access ADMs. However, the figure also shows that the DXC-4/3/1 becomes responsible for the SNCP protection of the lower order traffic. Note that in this case the SNCP protection is done at the lower order path level (thus, VC-12 or VC-3 level), whereas in the other node architecture, the SNCP protection is done at the VC-4 level. Figure 2.59 illustrates the node architecture with DXC for the MS-SP Ring protection strategy. Once again, access of E4 traffic is done directly through the ADMs, whereas the access of E1 and E3 traffic passes through the DXC-4/3/1. It is important to mention that each ADM and each DXC-4/3/1 becomes a single point of failure for the LO traffic in the MS-SP Ring protection strategy; more precisely, the MS-SP Ring will recover the lower order traffic only when a link between two network nodes breaks. This is the result of the assumption that all VC-4s carrying lower order traffic entering the node are terminated by the DXC-4/3/1. In other words, a VC-4 carrying lower order traffic leaves the DXC in one node, enters a wavelength channel through the appropriate ADM, continues on a WDM link to an adjacent node where it leaves the wavelength channel through an ADM, and is then terminated in the DXC in that adjacent node. Remember from Figure 2.27 in Section 2.4.1 that the MS-SP Ring sublayer bridges/switches AU groups in the case of a failure. Thus, although a lower order path may reenter the same STM-16 ring/
wavelength channel, it cannot be recovered if the ADM fails because the corresponding VC-4s are added/dropped in the ADM and terminated in the node, and thus, the MS-SP Ring protection of these VC-4s will fail. Because only the ADMs and not the DXC-4/3/1 participate in the MS-SP Ring protocol, a failure of the DXC will not even trigger the MS-SP Ring protection. For the same reasons, this node architecture does not support D&C interconnection of the STM-16 rings.

The general node architecture framework is illustrated in Figure 2.60. Figures 2.56 through 2.59 consider nodes having a node degree of 2. As mentioned earlier and as illustrated in Figure 2.60, nodes can also have a node degree of 3; the node architectures presented earlier for degree-2 nodes are a special case of the general architecture for degree-3 nodes. As explained earlier, the incident fiber pairs are terminated by WDM (de)multiplexers. Each wavelength channel enters the node via one fiber pair and passes through a number of ADMs before leaving the node on another fiber pair; the figure shows that this remains true even in degree-3 nodes (the only exception is that at most one wavelength channel does not continue on another fiber pair; this will be the case when, for the end-to-end SNCP protection strategy, the sum of the number of required wavelength channels on the three incident links is odd). The figure also shows that each wavelength channel is dedicated to the transport of only SNCP or MS-SP Ring–protected traffic, but both types of wavelength channels can coexist in the
Figure 2.60 General node architecture. [Diagram labels: three WDM Mux units, one per incident link; MS-SP Ring ADMs; SNCP access ADMs; SNCP flexibility ADMs; local access ports (E4, E3, and E1); cross-connecting, with E4 traffic always hard-wired and E3 and E1 traffic hard-wired or through a DXC.]
hybrid SNCP/MS-SP Ring protection strategy. As illustrated earlier, the figure also shows that ADMs have to be dedicated to either access or flexibility purposes in the SNCP protection strategy, whereas this is not true in the MS-SP Ring protection strategy. All the STM-1 tributary ports of the ADMs have to be interconnected with each other and with the local access ports to which local customer equipment is connected. As explained earlier, all these interconnections are hard-wired, except for the lower order traffic (thus, VC-12/E1s and/or VC-3/E3s) when a central DXC-4/3/1 that terminates all VC-4s that carry lower order traffic is available. Note that a single DXC-4/3/1 is also foreseen in the hybrid SNCP/MS-SP Ring protection strategy. Customers of lower order traffic connect directly to the central DXC or, when there is no such DXC, their traffic is groomed into VC-4s by means of LO MUXs (not shown in the figure) that connect directly to a tributary port of an ADM (or access ADM). The customers of higher order traffic always connect directly to a tributary port of an ADM (or access ADM).

As mentioned earlier, this case study aims at comparing three protection strategies, as follows:

1. Pure SNCP protection: All traffic is duplicated in the source node and routed along disjoint paths to the destination where the best copy is selected. Because SNCP protection is performed on an end-to-end basis, no rings have to be defined in the network. This implies that there is no constraint requiring that the wavelength channels are organized as rings through the network (on each link the number of wavelength channels needed can be calculated independently). Although the risk of double failures affecting both the working and backup routes cannot be ignored for some node pairs (because of the long distances), applying drop and continue to improve the availability as illustrated in Figure 2.53 of Section 2.5.2 has not been considered in this case study.

2. Pure MS-SP Ring protection: Traffic is routed over a set of interconnected STM-16 rings that cover all nodes in the network. Wavelength channels must be organized so they form the rings in the network; this implies that the capacity of each wavelength channel is not used completely on every link. More than one STM-16 wavelength channel may need to transit the same network nodes, resulting in a stack of rings, as illustrated in Figure 2.47 in Section 2.4.4. Of course, a fiber pair (or network link) will probably carry more than one stack of rings; the different stacks guarantee geographical coverage. As explained earlier, the node architecture with DXC does not allow D&C ring interconnection; thus, lower order traffic cannot recover from ADM or DXC failures, but organizing the network in a set of interconnected rings allows different rings to recover simultaneously from different single link failures. In the node architecture without DXC, D&C and single-gateway ring interconnection options are considered.

3. Hybrid SNCP/MS-SP Ring protection: As explained earlier, it is possible to have part of the wavelength channels dedicated to SNCP-protected traffic
and the other wavelength channels dedicated to MS-SP Ring–protected traffic (footnote 20). The only common equipment for the SNCP- and MS-SP Ring–protected traffic is the central DXC-4/3/1 in a node to cross-connect the lower order traffic. An important additional assumption is that no traffic is allowed to be protected partly by SNCP and partly by the MS-SP Rings. Without making the routing process much more complicated, it would not be possible to avoid double protection (thus, one or both copies of the SNCP-protected signal being protected again by an MS-SP Ring).

Objective of the Case Study

In the previous section a node architecture with and another one without a central DXC-4/3/1 have been presented. Three protection strategies have also been presented: pure end-to-end SNCP protection, pure MS-SP Ring protection, and a hybrid SNCP/MS-SP Ring protection strategy. The major objective of this case study is to compare the three protection strategies with each other from a cost (and capacity) perspective. This case study also aims at investigating the impact of the adopted node architecture in this comparison. Finally, this case study also aims at studying the impact on the overall network cost of different routing approaches (D&C ring interconnection or not and balanced versus shortest path routing) with pure MS-SP Ring protection.

Proposed Network Design and Evaluation Process

The network design and evaluation methodology is split into four independent phases:

1. The traffic is routed over the network with respect to the considered protection strategy.
2. The required capacity is calculated.
3. The amount of equipment to transport this capacity is estimated.
4. The total network cost is derived.

The remainder of this section focuses mainly on the routing phase, because this is the most relevant phase to understand the case study. As mentioned earlier, the traffic is routed over the network before any capacity is dimensioned in the network.
In other words, the routing is not optimized (e.g., to minimize the network capacity) but relies on the link lengths of the given physical topology, so each traffic demand can be routed independently over this topology. With SNCP, a connection is protected by a node-disjoint backup route. Therefore, the shortest cycle containing both endpoints is calculated (based on the methodology described at the end of Section 2.5.2). The shortest path (SP) between

Footnote 20: As mentioned in Section 2.4.1, some VC-4s on an MS-SP Ring can be supported as non-preemptible unprotected traffic (NUT). Therefore, one could think about routing SNCP-protected traffic through these NUT VC-4s. However, such an architecture was assumed to be too complex to be considered in this case study.
both endpoints along this shortest cycle (SC) is chosen as the working route and the other one as the backup route. Note that looking for the shortest cycle ensures that the sum of the length of the working and backup routes is minimal while guaranteeing the node disjointness status of the working and backup routes. In the MS-SP Ring case, the routing phase is split into two steps. First, the traffic is routed over the given topology along a shortest path. Second, the set of rings carrying the traffic as efficiently as possible is computed. Tabu-search and simulated annealing are applied for this optimization process [Ari96], [Ari97]. An important constraint in this selection process is that the rings have to be interconnected with each other via at least two gateway nodes, to allow D&C for the ring interconnection. In the hybrid SNCP/MS-SP Ring case, the routing phase is split into three steps. The first two steps deal with the MS-SP Ring part. In contrast to the pure MS-SP Ring case, first a set of rings is selected (step one) and then as much traffic as possible is routed over this set of rings (step two). The ring selection is done manually. Note also that the set of rings is not allowed to cover the whole network (because otherwise all traffic would be routed over the rings, and thus no traffic would be left after the first two steps to be routed via SNCP). In the second step, the traffic that can be routed over the rings is routed over the rings. First interring traffic (thus, demands which span at least two rings) is routed over the rings along the shortest path. Afterwards intraring traffic is routed (over the shortest ring going through both endpoints) so that the load on the highest loaded link is minimized, resulting in the load on all links in the ring getting more balanced [Car97], [Cos94], [Kar97]. Per ring, the demands are sorted by descending capacity and routed one after the other over the ring. 
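The intraring routing step described above (demands sorted by descending capacity, each routed in the ring direction that keeps the highest link load as low as possible) can be sketched as a simple greedy procedure. This is only an illustration of the idea under assumed data, not the algorithm of [Car97], [Cos94], or [Kar97]; the ring size, the demands, and the function names are hypothetical.

```python
def clockwise_links(n, src, dst):
    """Links used when going clockwise from src to dst on an n-node ring.
    Link i connects node i to node (i + 1) % n."""
    links, node = [], src
    while node != dst:
        links.append(node)
        node = (node + 1) % n
    return links

def route_on_ring(n, demands, balance=True):
    """Route (src, dst, capacity) demands one by one, largest first.

    balance=True:  pick the direction giving the lowest peak link load
                   (ties broken by hop count, then clockwise).
    balance=False: pick the direction with the fewest hops (shortest path).
    Returns the resulting load on each ring link.
    """
    loads = [0] * n
    for src, dst, cap in sorted(demands, key=lambda d: -d[2]):
        cw = clockwise_links(n, src, dst)
        ccw = [link for link in range(n) if link not in cw]

        def peak(links):
            # Highest link load on the ring if this direction were chosen.
            return max(loads[l] + (cap if l in links else 0) for l in range(n))

        if balance:
            chosen = min([cw, ccw], key=lambda links: (peak(links), len(links)))
        else:
            chosen = min([cw, ccw], key=len)
        for link in chosen:
            loads[link] += cap
    return loads

# Hypothetical 4-node ring with demands expressed in VC-4s.
demands = [(0, 2, 4), (0, 2, 3), (1, 3, 2)]
balanced = route_on_ring(4, demands, balance=True)
shortest = route_on_ring(4, demands, balance=False)
print(balanced, max(balanced))
print(shortest, max(shortest))
```

In this toy case the balanced routing lowers the load on the most heavily used link from 9 to 6 VC-4s. Because the ring capacity is driven by the most loaded link, this is the quantity that matters when rings are dimensioned.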
Finally, all the traffic that cannot be routed over a chain of rings belongs to the SNCP part and is routed over the meshed network, as explained earlier.

Once all traffic has been routed over the network, one can derive the amount of capacity needed on the links and in the nodes. Note that the capacity dimensioning relies on the considered node architecture, because with a DXC-4/3/1, all VC-4s carrying lower order traffic and entering a node are terminated by that DXC, whereas no transit VC-4s are terminated in the other case. Note that the capacity for SNCP- and MS-SP Ring–protected traffic is dimensioned independently in the hybrid protection strategy. The capacity dimensioning identifies the number of E1s and E3s, the number of corresponding VC-4s carrying those E1s and E3s (which depends on the considered node architecture), and the number of E4s routed over a link. The same information is determined for the amount of traffic routed in a node between each pair of incident links. However, here a distinction also needs to be made between transit traffic passing through the node and access traffic coming from/sent to the local customers. Access traffic routed between a pair of incident links means that both duplicates are routed over these links in the case of SNCP-protected traffic or that it is routed over a ring that passes over these links. From this, the number of required wavelength channels on each link or between each pair of incident links in a node can be derived. Note that this will depend on the considered protection strategy: For SNCP-protected
traffic this number can be calculated on a per-link or a per-node basis, whereas in the MS-SP Ring protection strategy the number of stacked rings needs to be determined per topological ring because the ring capacity is determined by the link on which the ring carries the highest amount of traffic.

Once the capacity needed in the network has been derived, the amount of required equipment can be estimated. Despite the relatively detailed information coming out of the capacity dimensioning phase, it is impossible to compute the exact required amount of each equipment type. The methodology to obtain an acceptable estimate is quite complex and is thus beyond the scope of this chapter. A more detailed discussion of the methodology adopted in this equipment dimensioning phase (and other phases) can be found in [Col02] and [Ari01]. Once the amount of each equipment type is estimated, the corresponding cost can be calculated based on the relative cost figures mentioned in the previous section.

Cost Comparison for Different Protection Strategies

Figure 2.61 compares the network cost for the pure end-to-end SNCP protection, the pure MS-SP Ring protection, and the hybrid end-to-end SNCP/MS-SP Ring protection strategies for the node architectures with and without a DXC-4/3/1. In the case of the node architecture with a central DXC-4/3/1, the SNCP protection strategy is significantly more expensive than the other strategies. As the figure shows, this is mainly a result of the significantly higher node cost in the SNCP protection strategy. This can be understood as follows: For SNCP-protected lower
Figure 2.61 Relative network cost for the different protection strategies. [Bar chart titled "Comparison of Protection Strategies": total cost and node cost, on axes from 0% to 110%, for the pure SNCP, pure MS-SP Ring, and hybrid SNCP/MS-SP Ring strategies, shown for the node architecture (NA) with DXC-4/3/1 and the NA without DXC-4/3/1.]
order traffic, the traffic is duplicated and thus transits many more (expensive) DXCs (let us say twice as many, for simplicity), whereas this is not the case for MS-SP Ring–protected traffic. An additional consideration is that no drop and continue is allowed in this node architecture with central DXC. Thus, unless only link failures matter, the lower cost for the MS-SP Ring–protected traffic is paid for by a drastically reduced overall availability.

In the case of the node architecture without central DXC-4/3/1, the situation is reversed: The pure end-to-end SNCP protection strategy slightly outperforms the pure MS-SP Ring protection strategy. The higher cost for the MS-SP Ring protection strategy is due to its higher link cost. Indeed, the number of stacked rings is driven by the link that carries the largest amount of traffic, whereas in the SNCP protection strategy the number of wavelength channels can be calculated on a link-per-link basis. In addition, drop and continue ring interconnection has been considered. Thus, for a slightly higher cost, a better availability is achieved in the MS-SP Ring protection strategy. Note that where possible, the ring selection process tried to prevent an overload of the links between the gateway nodes by selecting rings that have three or more nodes in common (see the issue raised in Figure 2.43 in Section 2.4.4).

An important conclusion from Figure 2.61 is that independent of whether the pure SNCP protection strategy is cheaper or more expensive than the pure MS-SP Ring protection strategy, it is always possible to outperform both protection strategies by choosing a hybrid SNCP/MS-SP Ring protection strategy. Although the difference shown is not that large, one should not forget that the ring selection process is done manually in the hybrid protection strategy, whereas it is automated in the pure MS-SP Ring protection strategy.
In other words, there is still some room for improvement by automating this ring selection process in the hybrid SNCP/MS-SP Ring protection strategy.

Figure 2.61 has considered drop and continue ring interconnection for the node architecture without DXC-4/3/1, although this is prevented when considering the node architecture with a central DXC-4/3/1. Figure 2.62 investigates, for the pure MS-SP Ring protection strategy in combination with the node architecture without a central DXC-4/3/1, the impact of drop and continue on the overall network cost. The figure confirms that drop and continue indeed results in a significant cost increase (approximately 10%). However, as explained previously, the traffic is first routed along the shortest path and then an appropriate set of rings is selected to accommodate this traffic. The figure shows that adding a step to balance the traffic on each individual ring yields much higher cost savings (while keeping the same overall availability) than giving up drop and continue ring interconnection (which affects the overall availability).

Summary

• Node architecture: SNCP protection requires dedicated access and flexibility ADMs.
Figure 2.62 Impact of the routing in the multiplex section–shared protection ring protection strategy in the absence of DXC-4/3/1s. [Stacked bar chart titled "Relative Network Cost for NA without DXC-4/3/1 in Case of MS-SP Ring": link cost and node cost, on an axis from 0% to 110%, for shortest path + D&C, shortest path without D&C, and balanced routing + D&C.]
• Node architecture with a central DXC-4/3/1 (terminating all VC-4s carrying lower order traffic and entering the node): Does not allow drop and continue ring interconnections and leads to the ADMs and DXCs becoming single points of failure (for lower order traffic).
• Protection strategies: It is possible to combine the advantages of pure SNCP protection and pure MS-SP Ring protection strategies in a hybrid SNCP/MS-SP Ring protection strategy with a better cost performance.
• Cost evaluation: Drop and continue ring interconnection has a significant impact on the overall network cost, but balancing the traffic on the rings is at least as important as the drop and continue from a cost perspective.
2.8 Conclusion
In Section 2.1 the concept of transmission/transport networks was introduced. Transmission/transport networks are networks that can provide huge amounts of capacity between nodes in client-layer networks in a flexible and cost-efficient way. The modeling/structuring of such networks has been discussed. Transmission/transport networks are built on three types of atomic functions: connection, trail termination, and adaptation functions, which are responsible for providing the flexibility, the supervisory processes for verifying the integrity of the network connections, and the process to adapt client-layer information so that it can be used in a network layer, respectively. A brief overview of the SDH network technology was given in Section 2.2. SDH networks can be decomposed into four network layers: a regenerator section
layer, a multiplex section layer, a higher order path layer, and a lower order path layer. The different types of network elements (terminal multiplexers, add/drop multiplexers, and digital cross-connects) and the frame format used on the interface between these network elements have also been discussed. While discussing the SDH frame format, special attention was paid to the part of the overhead that is needed for failure detection. Section 2.3 highlighted the operational aspects of automatic protection switching (APS) in SDH networks. More precisely, the failure notification and propagation process plus the basics of the APS protocol have been presented. Failure notification and propagation is achieved by inserting an alarm indication signal (AIS) (an all-1s signal) in the downstream direction and a remote defect indication (RDI) signal in the upstream direction. With respect to APS, a distinction has to be made between trail protection and subnetwork connection protection (SNCP). Trail protection is realized by introducing, in the trail endpoints, sublayer functionality that implements the APS protocol. The APS protocol is needed to coordinate protection switching actions in all the involved network elements. SNCP is able to protect only part of a network connection but suffers from the fact that it has no access to the APS channels carrying the APS protocol messages embedded in the path overhead; therefore, SNCP relies on a permanent bridge in the upstream recovery head end (RHE), whereas the downstream recovery tail end (RTE) selects the best copy it receives (in this way, the APS signaling for coordinating the protection switching actions is avoided). Another important part of this chapter described the various recovery strategies possible in SDH networks. Section 2.4 presented three protection ring types: multiplex section–shared protection rings (MS-SP Rings), multiplex section–dedicated protection rings (MS-DP Rings), and SNCP Rings.
MS-SP Rings and MS-DP Rings are similar in the sense that they both rely on the nodes adjacent to a failure to loop back all traffic along the opposite side of the ring, whereas SNCP Rings rely on protecting paths on an end-to-end basis. Because the forward and backward directions of a bidirectional connection are routed along the same side of an MS-SP Ring, MS-SP Rings can profit from the spatial reuse concept. More precisely, connections that do not overlap can reuse/share the same capacity on different segments of the ring. SNCP and MS-DP Rings are not able to profit from this concept, because each connection occupies capacity on all links in the ring. Finally, to improve the overall availability of interring connections, dual-gateway interconnection schemes have been studied: the virtual ring and the drop and continue interconnection schemes. Strategies based on linear protection switching were presented in Section 2.5: Linear protection switching involves at most two network nodes (in contrast to the ring-based APS protocols). Whereas M:N linear protection switching is often applied at the multiplex section level (thus, span protection), 1+1 SNCP is applied at the higher order or lower order path level. Finally, Section 2.6 highlighted opportunities for more flexible network recovery techniques (e.g., restoration instead of protection) than those based on an APS protocol switching over very fast (targeting a protection switch completion time on
the order of 50 or 60 ms) from the affected resources to dedicated preestablished protection/backup resources. By signaling the establishment of the spare resources at the time of the failure, the recovery process will slow down but will achieve a better capacity efficiency. Computing alternative paths in real time at the time of the failure instead of in advance allows a better compromise between capacity efficiency and failure coverage. A practical case study was presented in Section 2.7. This case study illustrates the advantages (from a network cost perspective) of having a hybrid SNCP/MS-SP Ring protection strategy instead of a pure end-to-end SNCP or pure MS-SP Ring protection strategy and highlights some issues with respect to providing protection when considering practical node architectures.
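The spatial reuse argument summarized above can be illustrated with a small calculation. The following is only a sketch under simplifying assumptions: every demand is routed along the shorter side of the ring, only working capacity is counted, and the function names (`shared_ring_capacity`, `dedicated_ring_capacity`) are hypothetical helpers introduced here, not part of any standard or of the case study in Section 2.7.

```python
from typing import List, Tuple

Demand = Tuple[int, int, int]  # (source node, destination node, capacity units)


def ring_links_clockwise(src: int, dst: int, n: int) -> List[int]:
    """Links crossed going clockwise from src to dst.

    Link i is assumed to join node i and node (i + 1) % n.
    """
    links = []
    node = src
    while node != dst:
        links.append(node)
        node = (node + 1) % n
    return links


def shortest_side(src: int, dst: int, n: int) -> List[int]:
    """Links on the shorter of the two sides of the ring."""
    clockwise = ring_links_clockwise(src, dst, n)
    counterclockwise = ring_links_clockwise(dst, src, n)
    return clockwise if len(clockwise) <= len(counterclockwise) else counterclockwise


def shared_ring_capacity(demands: List[Demand], n: int) -> int:
    """Working capacity per link when non-overlapping connections reuse
    capacity (MS-SP Ring style): the most heavily loaded link decides."""
    load = [0] * n
    for src, dst, units in demands:
        for link in shortest_side(src, dst, n):
            load[link] += units
    return max(load)


def dedicated_ring_capacity(demands: List[Demand]) -> int:
    """Capacity per link when every connection occupies capacity on all
    links of the ring (SNCP or MS-DP Ring style): demands simply add up."""
    return sum(units for _, _, units in demands)


# Four adjacent-node demands of one unit each on a four-node ring.
demands = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
print(shared_ring_capacity(demands, 4), dedicated_ring_capacity(demands))
```

With purely adjacent-node traffic the four connections do not overlap, so the shared ring needs one unit per link while the dedicated ring needs four. For traffic patterns in which all connections cross the same link, the two figures coincide and spatial reuse brings no gain.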
2.9 Recommended Reference Work and Research-Related Topics
For more than a decade, SDH/SONET has been a mature technology, being specified in many standardization documents. The ITU recommendations used throughout this chapter, together with other recommendations, can serve as reference material to understand every detail of this chapter. The decomposition of SDH networks in path and section layers and in atomic functions (see Section 2.2.2) is specified in ITU recommendations G.803 [G803] and G.805 [G805]. The frame structure describing the interface between SDH network elements (see Sections 2.2.3 and 2.2.4) is specified in ITU recommendation G.707 [G707]. A more comprehensive overview of these network and frame structures can be found in [Sex92] and [Kar99]. A major part of this chapter is devoted to the detailed discussion in Sections 2.3.1 through 2.3.3 of the interworking of the atomic functions for fault detection, notification, and propagation. The characteristics of these atomic functions are crucial in fully understanding this discussion: These characteristics are specified in ITU recommendations G.783 [G783] and G.806 [G806] and the ETSI document EN 300 417-1-1 [ETSI1]. The other major part of this chapter is devoted to the overview of the different SDH network recovery techniques in Sections 2.3.4, 2.4, 2.5, and 2.6. A specification of all these techniques is given in ITU recommendation G.841 [G841]. ITU recommendation G.842 [G842] and ETSI document TS 101 010 [ETSI2] are dedicated to the interconnection of SDH network recovery techniques (in particular protection rings, as described in Section 2.4.4). A more comprehensible overview of the SDH network recovery techniques can be found in [Sex92].
Taking into account that SDH/SONET is a mature technology and that it is similar to the optical transport network technologies, related research topics for those optical transport network technologies, as described in Chapter 3, will also apply to the SDH/SONET technology. It is necessary to investigate how the SDH network recovery techniques described in this chapter can be adopted in
next-generation SDH networks, featuring virtual concatenation (VC), the link capacity adjustment scheme (LCAS) [G7042], and/or the generic framing procedure (GFP) [G7041], [Her02].
CHAPTER 3
Optical Networks
Chapter 2 discussed the Synchronous Optical NETwork (SONET)/Synchronous Digital Hierarchy (SDH) layer of the network. The focus of this chapter is on the optical network layer. In the current backbone network, this layer can be found underneath the SONET/SDH layer, and in the future it may even replace (a large part of) the functionality of the SONET/SDH layer. The recovery issues described in Chapter 2 for SONET/SDH are extended in this chapter to the optical layer of the backbone network. First, a general introduction to optical networks is given. The evolution of the optical layer of the backbone network from a pure transmission-based layer with static point-to-point connections to a true managed optical networking layer with reconfigurable switching nodes is discussed. In addition, the equipment that enables the optical networking functionality is briefly described. The current standardization efforts on optical networks are highlighted in Sections 3.1 and 3.2. The last sections of this chapter elaborate on the different recovery mechanisms that can be applied in optical networks. To understand how a failure of the optical network equipment is detected and how the subsequent alarms raised by the various pieces of equipment that detect the fault are correlated and suppressed, the overhead (OH) of the Optical Transport Module (OTM) is discussed in detail in Section 3.3. Section 3.4 introduces the different recovery schemes that can be applied in Optical Transport Networks (OTNs). Recovery schemes in both ring-based networks (Section 3.5) and mesh-based networks (Section 3.6) are discussed and compared with one another (Section 3.7). Section 3.8 discusses network availability, a crucial performance factor of recovery schemes. Because a single fiber can transport 160 or more different wavelengths, each with a capacity of 10 gigabits per second (Gbps) or more, a single cable cut can affect a tremendous amount of traffic.
Network availability is thus of utmost importance in the design of modern telecommunication networks. After a theoretical discussion of the calculation of the network availability, some factors that influence the availability are highlighted. In Section 3.9 some recent trends in research on optical network recovery are discussed. Finally, the summary and conclusions are formulated in Section 3.10.

We are greatly indebted to Sophie De Maesschalck, INTEC, Ghent University, for her exceptional contribution to the writing of Chapter 3.
3.1 Evolution of the Optical Network Layer
Although the optical layer of the backbone network is relatively new, it has already gone through an impressive evolution. This first introductory section discusses the evolutionary path from an optical network with point-to-point static connections to an intelligent optical network with reconfigurable switching nodes.
3.1.1 Wavelength Division Multiplexing in the Point-to-Point Optical Network Layer
As the amount of voice and data traffic kept growing and the Internet, which turned out to generate vast amounts of traffic, came into the picture, more and more traffic needed to be conveyed over the network. Several solutions were available to tackle this increasing traffic demand. One could simply use more fibers, each transporting a single SONET/SDH signal. This is, however, not a very elegant or scalable solution, and it is a very expensive one if there is a shortage of fiber in the existing cable or duct infrastructure. Another approach would be to continually increase the bit rate (e.g., increase Time Division Multiplexing [TDM] rates above 40 Gbps), but again this is not a scalable solution. Besides, because of the dispersion effects of fiber optics, including chromatic dispersion (light sent over different wavelengths travels at different speeds) and polarization mode dispersion (PMD) (caused by imperfections in the fiber and also resulting in pulse broadening), increasing the bit rate is quite challenging. Yet another way to proceed was to introduce Wavelength Division Multiplexing (WDM) into the network. With WDM (Figure 3.1), multiple channels, each with a different (e.g., SONET/SDH) signal, are transmitted at distinct wavelengths over a single optical fiber. The distinct wavelength channels are indicated in Figure 3.1 by the use of color labels. In this way, the capacity of a single fiber is upgraded to a multiple of its original capacity. The capacity of an optical fiber is, thus, no longer limited by the bit rate of a single TDM signal, but by the number of wavelengths supported by the WDM system. The principle of WDM is quite similar to that of Frequency Division Multiplexing (FDM): Several signals are transmitted using different carriers, each occupying a nonoverlapping part of the frequency spectrum.
Most WDM systems currently use the wavelength region around 1550 nm, because this is one of the regions where the signal attenuation reaches a local minimum. Besides attenuation, chromatic dispersion and PMD are a problem in optical fibers. The signals should be transported using wavelengths that are sufficiently far from each other to avoid interference. In contrast with chromatic dispersion, which can be compensated using, for example, nonzero dispersion shifted fiber or other
[Figure 3.1: a λ multiplexer (MUX) combines M color-labeled wavelength channels λ1, λ2, λ3, ..., λM-2, λM-1, λM (red, orange, yellow, ..., green, blue, violet) onto a single fiber carrying M wavelength channels (e.g., M x 10 Gbps); a λ demultiplexer (DEMUX) separates them again.]
Figure 3.1 Wavelength division multiplexing.
dispersion compensating techniques, PMD is much harder to overcome. Other impairments also come into play, for example, nonlinear effects such as four-wave mixing (FWM). With FWM, three signals transported at different wavelengths interfere and create a signal on a fourth wavelength, which may already be used to transport a real data signal. With current technology, more than 160 optical wavelength channels can be multiplexed onto a single fiber, and this number is expected to increase even further. Each wavelength can transport a signal with a bit rate of 10 Gbps, and 40 Gbps is just around the corner. WDM systems with a channel spacing of 50 GHz have been developed, and channel spacings of 25 GHz are beginning to appear. When such a large number of channels can be transported by the WDM system, the term Dense Wavelength Division Multiplexing (DWDM) is used, in contrast to Coarse Wavelength Division Multiplexing (CWDM), which is considered for the metropolitan network and multiplexes a limited number of wavelengths (typically four to eight) onto a single fiber. For CWDM, the wavelength range around 1310 nm, another window with low attenuation, is commonly used. The main advantage of DWDM is its ability to increase the capacity of the infrastructure in place many-fold, without the need for expensive digging works to lay new fibers. Moreover, because the capacity is present in the already installed fibers, new wavelengths can be lit quickly. In addition, thanks to the development of Erbium-doped fiber amplifiers (EDFAs), WDM allows the amplifier cost to be shared by more traffic, because all signals on a single fiber can be amplified using a single EDFA (Figure 3.2). EDFAs work in the wavelength range of approximately 1530 to 1565 nm (the so-called C-band). This is one of the reasons the 1550-nm window was chosen for the WDM technique. EDFAs have also been developed for the L-band (approximately 1565 to 1625 nm).
Other wavelength ranges (e.g., the S-band, 1460 to 1530 nm) could be served by other types of optical amplifiers. The use of WDM and EDFAs dramatically decreases the cost of long-haul transmission, because a single EDFA replaces the array of electrical regenerators that was previously needed.
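To relate the band edges quoted in nanometers to the channel spacings quoted in gigahertz, the following sketch converts a wavelength band to its optical bandwidth using f = c/λ and estimates how many channels fit at a given spacing. It is a back-of-the-envelope illustration only: it ignores guard bands and standardized grid positions, and the function names are hypothetical helpers introduced for this example.

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum


def wavelength_nm_to_thz(wavelength_nm: float) -> float:
    """Optical frequency in THz for a vacuum wavelength given in nm."""
    return C_M_PER_S / (wavelength_nm * 1e-9) / 1e12


def channels_in_band(lo_nm: float, hi_nm: float, spacing_ghz: float) -> int:
    """Rough number of channel slots of width spacing_ghz in [lo_nm, hi_nm]."""
    band_ghz = (wavelength_nm_to_thz(lo_nm) - wavelength_nm_to_thz(hi_nm)) * 1000.0
    return int(band_ghz // spacing_ghz)


def fiber_capacity_gbps(channels: int, rate_gbps: float) -> float:
    """Aggregate capacity of one fiber: number of channels times bit rate."""
    return channels * rate_gbps


# C-band edges taken from the text (approximately 1530 to 1565 nm).
for spacing in (50.0, 25.0):
    print(spacing, "GHz spacing:", channels_in_band(1530.0, 1565.0, spacing), "channels")

# 160 channels of 10 Gbps each, the figures quoted in the text.
print(fiber_capacity_gbps(160, 10.0), "Gbps per fiber")
```

Under these assumptions the C-band is roughly 4.4 THz wide, which gives about 87 slots at 50-GHz spacing and about 175 at 25-GHz spacing. Reaching 160 channels therefore suggests either the narrower spacing or the use of an additional band such as the L-band; 160 channels at 10 Gbps amount to 1.6 Tbps on a single fiber.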
[Figure 3.2: in the SONET/SDH network, each transmitter (Tx)/receiver (Rx) pair has its own electrical regenerator; in the optical network, an optical line terminal multiplexes Tx λ1, Tx λ2, and Tx λ3 onto one fiber, a single EDFA amplifies all channels, and the far-end optical line terminal demultiplexes them to Rx λ1, Rx λ2, and Rx λ3.]
Figure 3.2 A single optical amplifier in the optical network replaces an entire array of electrical regenerators from the SONET/SDH network.
At the beginning, WDM was mainly used to increase the point-to-point transmission capacity of the links in the transport network. Introducing WDM into the network provided high-capacity bit pipes between the client-layer equipment, typically SONET/SDH (see Chapter 2) or Internet Protocol (IP) (see Chapter 4) equipment. Traffic is transported optically between optical line terminals (OLTs) in such a point-to-point optical network. An OLT demultiplexes the wavelength-multiplexed signal coming from the optical network and adapts it into a signal suited for the client layer. This involves an optoelectronic conversion, because the client nodes are SONET/SDH add/drop multiplexers (ADMs) (see Chapter 2, Section 2.2.4), digital cross-connects (DXCs) (see Chapter 2, Section 2.2.4), or IP routers, and the traffic has to be processed at this digital level. After the processing (e.g., switching), the signal is converted back to the optical domain at the OLT and multiplexed for further transmission in the optical network. It is obvious that as WDM technology matures (160 wavelengths of 10 Gbps and up), the electronics in the SONET/SDH DXCs or ADMs or the IP routers will not be able to keep up with the traffic coming from the optical domain. Hence, the electrical nodes in this WDM point-to-point architecture become the new bottleneck.
3.1.2 An Optical Networking Layer with Optical Nodes
The introduction of optical nodes into the network, through which individual wavelength channels can pass or can be terminated to the client-layer node, allows true optical networking. These nodes alleviate the need for expensive optoelectronic conversions and electrical processing equipment. It now becomes possible to keep passthrough traffic demand in the optical domain, by establishing lightpaths between the client-layer node equipment. At the network nodes the transit traffic is no longer always converted to the electrical domain, but can stay in the optical domain until it has reached its destination, as illustrated on the right side of Figure 3.3. The optical nodes have a structure similar to the digital nodes introduced in Chapter 2, Section 2.2.4 on SONET/SDH. Optical Add/Drop Multiplexers (OADMs) allow, just as in SONET/SDH, networks to be built in rings, whereas Optical Cross-Connects (OXCs) allow mesh optical networks or interconnected mesh-ring optical networks. The optical infrastructure is, thus, making the transition from a pure transmission layer to a real managed optical networking layer.
3.1.3 An Optical Network Layer Organized in Rings
The introduction of OADMs in the optical layer allows the network to be configured in rings, similar to the SONET/SDH ring-based networks discussed in Chapter 2, Section 2.4. An OADM allows a signal on a specific wavelength to be dropped out of the bundle of WDM-multiplexed signals and another signal to be added on this wavelength to the WDM-multiplexed bundle. OADMs can be classified according to the number of wavelength channels that can be dropped and added and according to their flexibility. In a fixed OADM (see Figure 3.4 for an implementation example), a predetermined set of wavelength channels is reserved to add and/or drop data in and out of the WDM signal. The advent of low-cost, flexible, reconfigurable OADMs (see Figure 3.5 for an implementation example), in which the dropping and adding of wavelength channels can be controlled, was very important.
[Figure 3.3: left, point-to-point connections between IP routers, in which IP routers switch packets and an optoelectrical conversion takes place at each router; right, optical networking with OXCs, in which OXCs switch lightpaths.]
Figure 3.3 From static point-to-point connections between IP routers to true optical networking.
[Figure 3.4: a DEMUX separates the incoming wavelengths λ1, λ2, ..., λN; a fixed subset (λ1, λ2, λ3) is dropped and added again, and a MUX recombines all wavelengths λ1, λ2, ..., λN onto the outgoing fiber.]
Figure 3.4 Implementation example of a fixed optical add/drop multiplexer, using a simple mux/demux pair. (Adapted from R. Ramaswami, K. Sivarajan, Optical networks: a practical perspective, 2nd ed, Morgan Kaufmann, San Francisco, CA, 2002.)
[Figure 3.5: between a DEMUX and a MUX, each of the wavelengths λ1, ..., λN passes through its own 2x2 switch, which either lets the channel pass or drops it and adds a new signal on the same wavelength.]
Figure 3.5 Implementation example of a flexible optical add/drop multiplexer using 2×2 switches to add and/or drop the signals. (Adapted from R. Ramaswami, K. Sivarajan, Optical networks: a practical perspective, 2nd ed, Morgan Kaufmann, San Francisco, CA, 2002.)
In a fully flexible OADM the transit, add, and drop channels can be chosen without any constraint (instead of only channels out of a predetermined set, as with a fixed OADM). Note, however, that in general the number of wavelengths that can be added/dropped will be smaller than the total number of wavelengths in the WDM-multiplexed bundle. In ring-based networks, in which OADMs are used as node elements, typically only a limited part of the traffic that enters a specific OADM is nontransit traffic that has to be dropped at that node. As a consequence, it is economically more advantageous to install OADMs with limited add/drop capabilities, which are cheaper than OADMs that allow all wavelengths to be added/dropped. Conceptually, an optical ring does not differ greatly from a SONET/SDH ring. As in the case of SONET/SDH rings, several ring network configurations exist. The
various configurations and the differences between optical rings and SONET/SDH rings are highlighted and discussed in Section 3.5. Optical rings can be interconnected using back-to-back installed OADMs or OXCs, the optical equivalent of the DXC discussed in Chapter 2, Section 2.2.4. The introduction of OXCs in the optical network layer, however, allows not only the interconnection of rings but also the construction of meshed optical networks.
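The difference between a fixed and a flexible OADM described in this section can be captured in a small model. This is purely an illustrative sketch: the class `Oadm`, its fields, and its methods are hypothetical names introduced here, and the model tracks only which wavelength indices are dropped (and hence available for adding), not any optical behavior.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional, Set


@dataclass
class Oadm:
    """Toy model of an optical add/drop multiplexer.

    fixed_set is None for a flexible (reconfigurable) OADM; otherwise only
    the wavelengths in fixed_set may ever be added/dropped.
    """
    num_wavelengths: int
    max_add_drop: int                      # usually smaller than num_wavelengths
    fixed_set: Optional[FrozenSet[int]] = None
    dropped: Set[int] = field(default_factory=set)

    def drop(self, wavelength: int) -> bool:
        """Try to configure a wavelength as add/drop; return True on success."""
        if not 0 <= wavelength < self.num_wavelengths:
            return False
        if wavelength in self.dropped:
            return True
        if self.fixed_set is not None and wavelength not in self.fixed_set:
            return False                   # fixed OADM: outside the predetermined set
        if len(self.dropped) >= self.max_add_drop:
            return False                   # limited add/drop capability exhausted
        self.dropped.add(wavelength)
        return True

    def transit(self) -> Set[int]:
        """Wavelengths that pass through the node untouched."""
        return set(range(self.num_wavelengths)) - self.dropped


fixed = Oadm(num_wavelengths=8, max_add_drop=2, fixed_set=frozenset({0, 1}))
flexible = Oadm(num_wavelengths=8, max_add_drop=2)
print(fixed.drop(5), flexible.drop(5))     # only the flexible node accepts channel 5
```

The `max_add_drop` limit reflects the observation in the text that, in a ring, usually only a limited part of the traffic entering a node is nontransit traffic, so a node that can add/drop only a few wavelengths is often sufficient and cheaper.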
3.1.4 Meshed Optical Networks
Other possible network architectures are mesh-based optical networks, or optical networks that interconnect a ring-based part with a mesh-based part. The key component in a meshed network architecture is the optical cross-connect (OXC). The basic functionality of an OXC is to switch a signal from an incoming port to the appropriate outgoing port. Several designs of an OXC exist. A first distinction that can be made is based on whether the switching matrix is electrical or optical. With an electrical switching matrix, an optoelectrical conversion is needed first. The traffic is switched in the electrical domain and then is converted back to the optical domain. The electrooptical and optoelectrical conversions are quite expensive, and this solution is not future proof because the expensive transceivers and the electrical switching core have to be replaced when the data rate increases (e.g., from 10 to 40 Gbps). This type of OXC is called an opaque or optical-electrical-optical (OEO) OXC. This is the type of OXC that is available on the market at the time of publication. If the traffic is switched in the optical domain, the term transparent or OOO OXC is used. This solution is much more attractive because it avoids the OEO conversion and because the core switch is independent of the bit rate. Future upgrades that change the TDM rate no longer pose a problem. As the complexity of the system is reduced, the reliability improves. Several optical switching technologies are under investigation. Possible choices are, for example, switching based on micro-electromechanical systems (MEMS), bubble switching, thermooptic switching, holographic switching, switching based on beam steering, liquid crystals, and others [Ram02], [Ben01]. For the moment the 3D variant of MEMS is the most plausible technology to be implemented at large scale, because of its scalability and rather low optical loss, among other things.
Another variant is the OEOEO opaque switch, a compromise between the aforementioned OEO and OOO switches. The switching matrix itself switches in the optical domain, but at the ingress and egress of the switch matrix, an OEO conversion takes place. This type of OXC does not have the bandwidth limitation and power consumption of an OXC with electrical switch matrix and allows wavelength conversion (because fully optical wavelength conversion is not currently available), and 3R (reamplification, reshaping, and retiming) signal regeneration (but in that case the bit rate and data format transparency are lost). Figure 3.6 shows a generic representation of an OXC. As explained earlier, the switch matrix may be electrical or optical. The transponders (OEO converters) that are depicted at the ingress and egress of the OXC may be present or not, depending on the exact implementation (e.g., in an all-optical implementation, they will not be present).
[Figure 3.6: block diagram with an I/O interface, mux/demux and OA, transponders, and a switch matrix (electrical or optical).]
Figure 3.6 Generic representation of an optical cross-connect. (Adapted from J. Derkacz, et al. "IP/OTN Cost Model and Photonic Equipment Cost Forecast- IST LION project," Proc. 4th Workshop on Telecommunications Technoeconomics, Rennes, France, May 2002.)
Another classification of OXCs is based on their ability to convert the wavelength of the signal between incoming and outgoing ports. With a wavelength routing OXC (WR-OXC), an incoming signal on a certain wavelength is switched to the correct outgoing port of the OXC, but the signal remains on the same wavelength. A WR-OXC is not able to perform wavelength conversion. A wavelength translating OXC (WT-OXC), on the other hand, can translate the wavelength of an incoming signal to another wavelength before the signal leaves the correct outgoing port. A WT-OXC is, thus, more flexible than a WR-OXC. Note that as long as wavelength converters cannot be implemented fully optical, transparent all-optical OXCs cannot perform wavelength conversion. The choice of the type of OXC installed in the network also has an impact on the network planning process and thus the resilience planning. In a network with WR-OXCs, the lightpath between source and destination OXC has to be conveyed using the same wavelength channel on all links along the path. This is known as the wavelength continuity constraint and it requires solving the routing and wavelength assignment problem. Networks that deploy such WR-OXCs are often denoted wavelength path (WP) networks. In a network with WT-OXCs, the wavelength continuity constraint is not applicable. Such networks are called virtual wavelength path (VWP) networks. VWP networks increase the routing flexibility and the throughput to some extent, but there is a price to pay: The wavelength converting elements currently are very expensive, and they cannot be achieved all optically (i.e., cannot function completely in the optical domain). Wavelength conversion thus requires the OEO conversion of the signal, making this type of OXC opaque. A lot of work has been dedicated to comparing the pros and cons of both types of OXCs. 
Under a fixed static traffic load, the difference in performance between optical networks with WT-OXCs and WR-OXCs depends on the network characteristics but is usually very small. A comparison between WP and VWP networks using path
protection as a recovery scheme is discussed in Section 3.6.2. Under a dynamic traffic load, however, around 10% more traffic can be sent in a VWP network than in a WP network with the same blocking probability.
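The effect of the wavelength continuity constraint can be made concrete with a first-fit wavelength assignment over a route that has already been chosen. The sketch below is an illustration under stated assumptions, not the planning method used in the studies referred to above: first-fit is just one common heuristic, the route is assumed to be given and nonempty, and `assign_wp` and `assign_vwp` are hypothetical function names.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[str, str]


def assign_wp(path: List[Link], free: Dict[Link, Set[int]]) -> Optional[int]:
    """WP network (WR-OXCs): one wavelength must be free on every link.

    Returns the chosen wavelength index, or None if the lightpath is blocked.
    """
    common = set.intersection(*(free[link] for link in path))
    if not common:
        return None
    wavelength = min(common)               # first fit: lowest common index
    for link in path:
        free[link].remove(wavelength)
    return wavelength


def assign_vwp(path: List[Link], free: Dict[Link, Set[int]]) -> Optional[List[int]]:
    """VWP network (WT-OXCs): any free wavelength may be used on each link.

    Returns the wavelength chosen per link, or None if some link is full.
    """
    if any(not free[link] for link in path):
        return None
    chosen = []
    for link in path:
        wavelength = min(free[link])       # first fit per link
        free[link].remove(wavelength)
        chosen.append(wavelength)
    return chosen


path = [("A", "B"), ("B", "C")]
# Wavelength 0 is the only free one on A-B, wavelength 1 the only one on B-C.
print(assign_wp(path, {("A", "B"): {0}, ("B", "C"): {1}}))    # blocked
print(assign_vwp(path, {("A", "B"): {0}, ("B", "C"): {1}}))   # converts at B
```

In this two-link example both links still have a free wavelength, yet the WP network must block the request because no single wavelength is free end to end, whereas the VWP network can accept it by converting at the intermediate node. How often this situation arises in practice depends on the traffic and the topology.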
3.1.5 Adding Flexibility to the Optical Network Layer
Sections 3.1.1 through 3.1.4 described the evolution of the optical network layer from a layer providing high-capacity bit pipes to its client layers to a real managed networking layer. The next step in the evolutionary path is the transition from a static optical networking layer to a flexible and agile optical layer. Such a flexible optical network layer enables the fast and efficient provisioning of new connections across the network, provides the possibility to deploy flexible restoration options, and allows efficient and high-quality network management. Part of the work on this flexible optical network has arisen from the evolution of the IP client layer (more precisely, from the Multi-Protocol Label Switching [MPLS] protocol). More details on this subject are discussed in Chapter 6, Section 6.1.
3.2 The Optical Transport Network
The OTN, as defined by the International Telecommunication Union Telecommunication Standardization Sector (ITU-T), has a structure very similar to that of a SONET/SDH network. The functionality of the OTN follows the generic principles defined in ITU-T Recommendation (ITU-T Rec.) G.805 [G805]. The specific aspects concerning the optical layer of the transport network are described in ITU-T Rec. G.872 [G872] and G.709/Y.1331 [G709]. Within other organizations, such as the American National Standards Institute (ANSI) workgroup T1.X1, the Optical Internetworking Forum (OIF), and the Internet Engineering Task Force (IETF), a lot of work has also been going on concerning the OTN. The IETF has historically focused on the IP client layer, but more recently IP over optical has also been under study. The OIF, on the other hand, focuses mainly on the optical layer of the network, with contributions on interoperability and on interfaces between vendor domains within optical networks and between the optical network layer and client layers. It has also dedicated considerable effort to making the OTN more flexible, which is further discussed in Chapter 6, Section 6.1.
3.2.1 Architectural Aspects and Structure of the Optical Transport Network
According to ITU-T Rec. G.872 [G872], an OTN is characterized by a path layer on top of two section layers, just like the SONET/SDH network structure. These three layers are illustrated in Figure 3.7. The optical transmission section (OTS) layer is the lowest section layer. It provides the functionality for the transmission of optical signals on various types
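[Figure 3.7: a Mux, an OA, and a DEMUX in series carrying color-labeled channels (red, orange, yellow, green), with the scope of two OTSs, one OMS, and one OCh marked along the chain.]
Figure 3.7 Optical transmission section (OTS), optical multiplex section (OMS), and optical channel (OCh) layer.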
of optical fiber. The layer on top is the optical multiplex section (OMS) layer. It provides the networking functionality for a multiwavelength optical signal. The highest layer is the optical channel (OCh) path layer, which provides end-to-end networking functionality for optical channels to allow client signals of varying format to be conveyed transparently between 3R regeneration points in the network. Instead of the OTS and OMS layers, an optical physical section (OPS) may be present (Figure 3.8). The OPS is a network layer that provides functionality for transmission of a single- or multiwavelength optical signal just as the OMS and OTS layers do, but without their supervisory information. This is discussed in more detail later in this section. Currently, the only possibility to ensure the management requirements of the optical channel (nonintrusive monitoring²¹ and management of each optical channel) is by implementing the optical channel by means of a digital framed signal with digital overhead (similar to the SONET/SDH frame²² [G707], discussed in Chapter 2, Section 2.2.3). This results in the introduction of additional digital

²¹ The types of connection supervision considered by the ITU-T were introduced in Chapter 2, Section 2.3.4. A short reminder:
1. Intrusive monitoring: Test wavelength and fiber performance for continuity, achieved by breaking in the original trail and introducing a test trail that extends the connection for the duration of the test.
2. Inherent monitoring: The client layer (IP, ATM, STM) continuously monitors the state of a given connection by processing the overhead provided by the OCh layer to approximate the operational state of the client connection. Similarly, the OCh layer processes the data received from the OMS layer to approximate the operational state of each OCh channel. The overall status of the connection cannot be determined with this type of monitoring because not all the necessary information for performance monitoring is contained in the overhead information.
3. Nonintrusive monitoring: The connection monitoring capability is provided by listening to the original data and its associated overhead. The overhead information transported by a connection is also used for fault detection.
²² In fact, the concepts described in ITU-T Rec. G.709 find their roots in the corresponding standardization of SONET/SDH, ITU-T Rec. G.707 [G707].
[Figure 3.8 layer stack, top to bottom: client signals (IP, ATM, Ethernet, STM-N, GbE); optical channel payload unit (OPUk); optical channel data unit (ODUk); optical channel transport unit (OTUk); then either optical channel (OCh), optical multiplex section (OMSn), and optical transmission section (OTSn), or optical channel with reduced functionality (OChr) and optical physical section (OPSn).]
Figure 3.8 Structure of the optical transport network. (Adapted from M. Vissers, Optical Transport Network & Optical Transport Module, ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004, and ITU-T Recommendation G.709/Y.1331, Interfaces for the optical transport network, ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)
layer networks: the optical channel payload unit (OPU) layer, the optical channel data unit (ODU) layer, and the optical channel transport unit (OTU) layer (Figure 3.8). The OPU layer adapts the client signal. Multiple types of client traffic formats, including legacy traffic protocols such as SONET/SDH and ATM and newer protocols such as IP and Ethernet, can be transported. The ODU layer provides end-to-end path monitoring (PM) and tandem connection²³ monitoring (TCM). The OTU layer provides supervision between two 3R regeneration points in the OTN. The complete set of layers depicted in Figure 3.8 (from the OPU layer down to the OPS layer or the OTS layer) forms the optical transport module (OTM). Figure 3.9 summarizes these layers of the OTN for a small sample network.
²³ A tandem connection (TC) is an arbitrary series of contiguous link connections and/or subnetwork connections. A TC represents that part of a trail that requires monitoring independently from the monitoring of the complete trail. See also Chapter 2, Section 2.3.4.
Figure 3.9 Layers in the optical transport network.
3.2.2 Structure of the Optical Transport Module
Different OTMs, with different functionality, have been defined. We discuss here only the OTM with full functionality (OTM-n.m) in detail. The frame structure of the OTUk within the OTM (Figure 3.10) consists, just as a SONET/SDH frame, of an overhead area for OA&M functions and a payload area for client data. A forward error correction (FEC) data block is added after the payload area to improve the performance because it enables error checking and correction at the receiving end of the signal. The detailed structure of the OTM-n.m is depicted in Figure 3.11. OPUk, ODUk, and OTUk overhead and FEC data are added to the client signal to form the optical channel transport unit (OTUk). Three bit rates are supported: signals with a bit rate of approximately 2.5 Gbps (k = 1), 10 Gbps (k = 2), and 40 Gbps (k = 3). These correspond with the SONET/SDH data rates of an OC-48/STM-16, an OC-192/STM-64, and an OC-768/STM-256, respectively. The overhead of the
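[Figure 3.10 layout: 4 rows by 4080 byte columns. Row 1, columns 1 to 7: frame alignment overhead; row 1, columns 8 to 14: OTUk overhead; rows 2 to 4, columns 1 to 14: ODUk overhead; columns 15 and 16: OPUk overhead; columns 17 to 3824: OPUk payload; columns 3825 to 4080: OTUk FEC.]
Figure 3.10 Optical channel frame structure. (Adapted from ITU-T Recommendation G.709/Y.1331, Interfaces for the optical transport network, ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)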
[Figure 3.11 nesting, from client to line: client data plus OPUk OH forms the optical channel payload unit (OPUk); OPUk plus ODUk OH forms the optical channel data unit (ODUk); ODUk plus OTUk OH and OTUk FEC forms the optical channel transport unit (OTUk); the OTUk is the payload of the optical channel (OCh), carried on an optical channel carrier (OCC: OCCp for payload, OCCo for overhead); the OCCs form the optical carrier group (OCG-n.m); adding OMSn OH gives the optical multiplex unit (OMU-n.m); adding OTSn OH gives the optical transport module (OTM-n.m). The nonassociated overhead (OCh, OMSn, and OTSn OH) is carried in the OTM overhead signal (OOS).]
Figure 3.11 Structure of the Optical Transport Module with full functionality OTM-n.m. (ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)
OPU, ODU, and OTU layer is called the associated overhead. The specific information contained in these overheads is discussed in Section 3.3. An OTM-n.m supports the transport of several wavelength channels. The maximum number of supported wavelength channels—the order of the OTM—is indicated by the index n. For instance, an OTM-16.m signal can support up to 16 wavelengths. The index m defines the supported bit rate of the multiwavelength signal. It can take the values 1, 2, 3, 12, 23, or 123—meaning that the supported wavelengths can have a bit rate of only 2.5 Gbps, only 10 Gbps, only 40 Gbps, 2.5 and 10 Gbps, 10 and 40 Gbps, or 2.5, 10, and 40 Gbps, respectively. The optical channel payload is carried on a specific wavelength as payload of an optical channel carrier (OCC), its overhead in the corresponding OCC overhead. Finally, the OMS and the OTS are constructed. The OCh, OMS, and OTS overhead, denoted as nonassociated overhead, are transported in an additional optical wavelength channel, the optical supervisory channel (OSC). Besides the OTM-n.m, an OTM-0.m and an OTM-nr.m are defined. The index r in OTM-nr.m stands for reduced functionality. Instead of an OTS and OMS layer, an OPS layer is present. Remember that the OPS is a network layer that provides functionality for transmission of a single wavelength optical signal (OPS0) or a
Framework: G.871/Y.1301
Network Architecture: G.872
Physical Layer Aspects: G.691, G.692, G.694.1, G.694.2, G.664, G.959.1, G.693, Sup. 39
Structures and Mapping: G.709/Y.1331
Functional Characteristics: G.798, G.806
Management Aspects: G.874, G.874.1, G.875, G.7710
Data Communication Network: G.7712/Y.1703
Protection Switching: G.808.1, G.873.1
Jitter and Wander Performance: G.8251
Testing: O.173, O.201
Error Performance: G.8201, M.24otn
Physical Layer: G.959.1, G.693
Figure 3.12 Overview of the available and planned ITU-T recommendations on the Optical Transport Network. (Adapted from “Optical Transport Network & Optical Transport Module,” ITU-T Standardization Organization, April 2002. Available at http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.)
multiwavelength optical signal (OPSn) just as the OMS and OTS layers, but without their supervisory information. No overhead of the OCh and OPS layer is supported (the nonassociated overhead of the OTM-n.m is, thus, not present). The OTM-0.m consists of a single optical channel; hence, there is no support for WDM. An OTM-0.1, for instance, transports a single wavelength channel of 2.5 Gbps. The OTM-0.m is also an OTM with reduced functionality.
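The naming rules for OTM-n.m, OTM-nr.m, and OTM-0.m can be made concrete with a small parser. The following Python fragment is only a sketch of the conventions described above: the class and function names are our own, the bit rates are the approximate values quoted in the text, and the restriction of OTM-0.m to a single bit rate is an assumption drawn from the statement that it carries a single optical channel.

```python
import re
from dataclasses import dataclass

# Approximate bit rates for k = 1, 2, 3, as quoted in the text.
RATE_GBPS = {"1": 2.5, "2": 10.0, "3": 40.0}


@dataclass(frozen=True)
class OtmType:                 # hypothetical helper type, not part of any standard API
    max_wavelengths: int       # index n; 0 means a single channel, no WDM
    reduced: bool              # True for OTM-nr.m and OTM-0.m
    rates_gbps: tuple          # bit rates permitted by index m


def parse_otm(name: str) -> OtmType:
    """Parse a designation such as 'OTM-16.12', 'OTM-16r.3' or 'OTM-0.1'."""
    match = re.fullmatch(r"OTM-(\d+)(r?)\.(1|2|3|12|23|123)", name)
    if match is None:
        raise ValueError(f"not a recognised OTM designation: {name!r}")
    n = int(match.group(1))
    has_r = match.group(2) == "r"
    m = match.group(3)
    if n == 0 and (has_r or len(m) > 1):
        # Assumption: OTM-0.m is written without 'r' and has one bit rate.
        raise ValueError(f"unexpected form for a single-channel OTM: {name!r}")
    rates = tuple(RATE_GBPS[digit] for digit in m)
    return OtmType(n, has_r or n == 0, rates)
```

For instance, `parse_otm("OTM-16.12")` describes a full-functionality module with up to 16 wavelengths that may run at 2.5 or 10 Gbps, whereas `parse_otm("OTM-0.1")` describes a single 2.5-Gbps channel with reduced functionality.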
3.2.3 Overview of the Standardization Work on the Optical Transport Network
A complete overview of the standardization activities on OTNs within the ITU-T is given in [OTNTS] [G871]. A summary can be found in Figure 3.12.
3.3 Fault Detection and Propagation
In Section 3.2 the Optical Transport Module (OTM) was introduced. We have seen that several types of overhead can be used for nonintrusive monitoring and management of the optical signal. The information transported in the overhead is used for fault detection and propagation, as well as alarm suppression, among other things. In Section 3.3.1, the OTN overhead is discussed, with emphasis on the part of the overhead that is relevant to fault detection and propagation. In Section 3.3.2, we briefly enumerate the defects that are currently defined in the OTN. Finally, Section 3.3.3 illustrates defects, fault detection and propagation, and alarm suppression with some examples. This section gives better insight into the recovery mechanisms described in Sections 3.4 through 3.6, although these can be fully understood without it. To fully comprehend this section, we recommend that you first become acquainted with the general transmission network terminology explained in Chapter 2, Section 2.1.
3.3.1 The Optical Network Overhead
We start this section by detailing the different types of OTN overhead. As mentioned earlier, a distinction can be made between associated overhead and nonassociated overhead.
Associated Overhead
The associated overhead includes the OPU, ODU, and OTU overhead.
Optical Channel Payload Unit Overhead
The optical channel payload unit overhead (OPUk OH) enables the support of various kinds of client signals. It includes information to support the client signal adaptation.
Optical Channel Data Unit Overhead
The optical channel data unit overhead (ODUk OH) (Figure 3.13) includes information for maintenance and operational functions to support optical channels. It provides path layer connection monitoring functions. For monitoring and resilience, the path monitoring (PM) and tandem connection monitoring i (TCMi) fields are of particular importance. The PM field is dedicated to the end-to-end ODUk path, and the TCMi fields allow for six levels of TCM. Both are illustrated in detail in Figure 3.14. The PM field contains a byte to transport the trail trace identifier (TTI), a bit interleaved parity (BIP-8) byte, a backward defect indication (BDI) field, a backward error indication (BEI) field, and a STAT field. In the TTI byte, a 64-byte TTI is repeatedly transmitted to verify the connectivity of the OChs through network elements such as ODU cross-connects. The BIP-8 is used for performance monitoring, as are the BDI and BEI. The BDI signal conveys, in the upstream direction, the signal fail status detected in a path termination sink function. The BEI signal conveys, in the upstream direction, the count of interleaved bit blocks that have been detected in error by the corresponding
[Figure 3.13 legend: TCM: Tandem Connection Monitoring; ACT: Activation/deactivation control channel; APS: Automatic Protection Switching coordination channel; EXP: Experimental; FTFL: Fault Type & Fault Location reporting channel; PCC: Protection Communication Control channel; GCC: General Communication Channel; PM: Path Monitoring; RES: Reserved for future international standardization. Fields shown in the ODUk OH (rows 2 to 4, columns 1 to 14): RES, TCM ACT, TCM1 to TCM6, FTFL, PM, EXP, GCC1, GCC2, APS/PCC.]
Figure 3.13 Optical channel data unit overhead (ODUk OH). (ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)
[Figure 3.14 layout: the PM field and each TCMi field (i = 1..6) consist of 3 bytes. Byte 1 carries the TTI, a 64-byte trail trace identifier that is repeatedly transmitted in the TTI field (bytes 0 to 15: SAPI; bytes 16 to 31: DAPI; bytes 32 to 63: operator specific). Byte 2 carries the BIP-8. Byte 3 carries the BEI (PM) or BEI/BIAE (TCMi) in bits 1 to 4, the BDI in bit 5, and STAT in bits 6 to 8.]
[Legend: TTI: Trail Trace Identifier; BIP-8: Bit Interleaved Parity, level 8; SAPI: Source Access Point Identifier; DAPI: Destination Access Point Identifier; BEI: Backward Error Indication; BIAE: Backward Incoming Alignment Error; BDI: Backward Defect Indication; STAT: Status.]
Figure 3.14 The bytes of the path monitoring (PM) or tandem connection monitoring i (TCMi) field of the optical channel data unit overhead. (ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)
ODUk PM sink using the BIP-8 code. The use of the BDI and BEI is illustrated in Section 3.3.3 with an example. The STAT bits indicate the presence of a maintenance signal, as shown in Table 3.1. These maintenance signals are discussed in Sections 3.3.2 and 3.3.3.
The six TCM fields each support monitoring of ODU connections. The tandem connection (TC) overhead is added at the source and terminated at the sink of the corresponding TC. The TCMi overhead bytes have a structure similar to that of the PM field of the ODUk overhead. The BEI/BIAE signal conveys, in the upstream direction, the count of interleaved bit blocks that have been detected in error by the corresponding ODUk TCM sink using the BIP-8 code. It is also used to convey in the upstream direction an incoming alignment error (IAE) condition that is detected in the corresponding ODUk TCM sink in the IAE overhead. The STAT bits indicate not only the presence of a maintenance signal but also whether there is an IAE at the source or no active source (Table 3.2).
Using these multiple instances of the monitoring functions, ODU paths can be monitored end-to-end through the public transport network or through the network of a single network operator; the same functions can also be used for subnetwork connection monitoring or for monitoring on a ring network, among other things. Both nested (Figure 3.15) and overlapping ODU connections can be monitored. This functionality allows for true optical connection monitoring across multiple networks, independent of network operator or equipment vendor, that is, across multiple administrative domains.
The ODUk overhead also contains a field for automatic protection switching (APS) at the path layer (APS/PCC field). Its functionality is similar to that of the corresponding field in the SONET/SDH overhead and allows the implementation of path protection schemes such as optical channel shared protection rings (OCh-SP Rings; see Section 3.5.2 for details).
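To illustrate how a sink can use a BIP-8 code to detect errors, the following Python sketch computes a bit interleaved parity of level 8 over a block of bytes and counts how many of the eight parity bits disagree with a received code. It shows the generic BIP-8 principle only: which bytes of the frame are covered, and in which later frame the code is transmitted, are fixed by G.709 and are not modeled here, and the function names are our own.

```python
from functools import reduce


def bip8(block: bytes) -> int:
    """Even parity over each of the 8 bit positions, i.e. the XOR of all bytes."""
    return reduce(lambda acc, byte: acc ^ byte, block, 0)


def bip8_violations(block: bytes, received_bip8: int) -> int:
    """Number of parity bits (0 to 8) that disagree with the received code."""
    return bin(bip8(block) ^ received_bip8).count("1")
```

A sink would compare the code it computes over the received block with the code inserted by the source; a nonzero result from `bip8_violations` is what would be reported back upstream through the BEI field.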
In the ODUk overhead a byte is also allocated
Table 3.1 ODUk PM Status Interpretation
PM byte 3, bits 678    Status
000                    Reserved for future international standardization
001                    Normal path signal
010                    Reserved for future international standardization
011                    Reserved for future international standardization
100                    Reserved for future international standardization
101                    Maintenance signal: ODUk-LCK
110                    Maintenance signal: ODUk-OCI
111                    Maintenance signal: ODUk-AIS
Source: From ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001.
Table 3.2 ODUk TCMi Status Interpretation
TCMi byte 3, bits 678  Status
000                    No source tandem connection
001                    In use without incoming alignment error (IAE)
010                    In use with IAE
011                    Reserved for future international standardization
100                    Reserved for future international standardization
101                    Maintenance signal: ODUk-LCK
110                    Maintenance signal: ODUk-OCI
111                    Maintenance signal: ODUk-AIS
Source: From ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001.
[Figure 3.15 shows the six TCMi overhead fields (TCM1 to TCM6) at the points A1, B1, C1, C2, B2, B3, B4, and A2 along an ODUk connection, marking for each point which TCMi fields are in use and which are not, for the monitored connections A1-A2, B1-B2, C1-C2, and B3-B4.]
Figure 3.15 Nested and cascaded monitored ODUk connections. (ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)
to transport the fault type and fault location (FTFL) message. This byte provides fault status information including information regarding type and location of the fault. This message is related to the TCM span. The TCM activation/deactivation control channel (TCM ACT) field is related to TCM. The general communication channel (GCC) fields are communication channels that can be used to pass information between any two network elements with access to the ODUk frame.
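Returning to the third byte of the PM and TCMi fields, the status interpretation of Tables 3.1 and 3.2 can be expressed as a small decoder. The Python sketch below assumes that bit 1 is the most significant bit of the byte (the usual ITU-T numbering, but stated here as an assumption), with BEI or BEI/BIAE in bits 1 to 4, BDI in bit 5, and STAT in bits 6 to 8. The function and dictionary names are hypothetical.

```python
RESERVED = "Reserved for future international standardization"

# STAT values from Table 3.1 (PM); codes not listed are reserved.
PM_STATUS = {
    0b001: "Normal path signal",
    0b101: "Maintenance signal: ODUk-LCK",
    0b110: "Maintenance signal: ODUk-OCI",
    0b111: "Maintenance signal: ODUk-AIS",
}

# STAT values from Table 3.2 (TCMi); codes not listed are reserved.
TCM_STATUS = {
    0b000: "No source tandem connection",
    0b001: "In use without IAE",
    0b010: "In use with IAE",
    0b101: "Maintenance signal: ODUk-LCK",
    0b110: "Maintenance signal: ODUk-OCI",
    0b111: "Maintenance signal: ODUk-AIS",
}


def decode_byte3(value: int, tandem: bool = False) -> dict:
    """Split byte 3 of a PM (or, with tandem=True, TCMi) field into its subfields."""
    if not 0 <= value <= 0xFF:
        raise ValueError("byte 3 must be in the range 0..255")
    table = TCM_STATUS if tandem else PM_STATUS
    return {
        "bei": value >> 4,                # bits 1-4: BEI (PM) or BEI/BIAE (TCMi)
        "bdi": bool((value >> 3) & 1),    # bit 5: backward defect indication
        "stat": table.get(value & 0b111, RESERVED),  # bits 6-8: status
    }
```

For example, `decode_byte3(0x0F)` reports a backward defect indication together with the ODUk-AIS maintenance signal.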
Optical Channel Transport Unit Overhead
The optical channel transport unit overhead (OTUk OH) (Figure 3.16) enables the transport of the digital ODU over an optical channel connection. For fault and performance monitoring the section monitoring (SM) field is particularly important. It has a structure similar to the PM and TCMi fields in the ODUk overhead. The SM field, detailed in Figure 3.17, consists of a TTI subfield, a BIP-8 subfield, a BDI subfield, a BEI subfield, and an IAE subfield. These functions serve the same purpose as the parallel ones in the ODUk overhead, only now at
[Figure 3.16 layout: row 1, columns 8 to 14 of the frame, containing the SM, GCC0, and RES fields. Legend: SM: Section Monitoring; GCC: General Communication Channel; RES: Reserved for future international standardization.]
Figure 3.16 OTUk overhead. (ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)
[Figure 3.17 layout: the SM field consists of 3 bytes (columns 8 to 10). Byte 1 carries the TTI, a 64-byte trail trace identifier that is repeatedly transmitted in the TTI field (bytes 0 to 15: SAPI; bytes 16 to 31: DAPI; bytes 32 to 63: operator specific). Byte 2 carries the BIP-8. Byte 3 carries the BEI/BIAE in bits 1 to 4, the BDI in bit 5, the IAE in bit 6, and RES in bits 7 and 8.]
[Legend: TTI: Trail Trace Identifier; BIP-8: Bit Interleaved Parity, level 8; SAPI: Source Access Point Identifier; DAPI: Destination Access Point Identifier; BEI: Backward Error Indication; BIAE: Backward Incoming Alignment Error; BDI: Backward Defect Indication; IAE: Incoming Alignment Error; RES: Reserved for future international standardization.]
Figure 3.17 Section monitoring field of the OTUk overhead. (ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)
the OTUk section level instead of at the ODUk path level. Also the frame alignment signal (FAS) overhead is part of the OTUk overhead. The OTUk FEC allows for error detection and error correction in the optical links. The BIP-8 fields in the OTUk and the ODUk overhead only allow error monitoring on the OTUk and ODUk payloads, respectively. FEC performs error monitoring on the complete optical channel, including the OTUk OH. FEC enables the detection and correction of bit errors caused by physical impairments in the transmission medium (e.g., linear impairments such as attenuation or dispersion and nonlinear effects such as self-phase modulation or four-wave mixing).
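The position of the overhead, payload, and FEC areas within the OTUk frame can be summarized in a few lines of Python. The sketch below assumes the frame dimensions of Figure 3.10 as defined in G.709 (4 rows of 4080 byte columns, with the FEC in columns 3825 to 4080); the function name and region labels are our own.

```python
ROWS, COLUMNS = 4, 4080          # OTUk frame size in bytes (rows x columns)
FIRST_FEC_COLUMN = 3825          # columns 3825..4080 carry the OTUk FEC
FEC_COLUMNS = COLUMNS - FIRST_FEC_COLUMN + 1


def otuk_region(row: int, column: int) -> str:
    """Return the area of the OTUk frame that a (row, column) byte belongs to."""
    if not (1 <= row <= ROWS and 1 <= column <= COLUMNS):
        raise ValueError("row must be 1..4 and column 1..4080")
    if column >= FIRST_FEC_COLUMN:
        return "OTUk FEC"
    if column >= 17:
        return "OPUk payload"
    if column >= 15:
        return "OPUk OH"
    if row == 1:
        # Row 1 of the overhead area holds frame alignment and OTUk overhead.
        return "Frame alignment OH" if column <= 7 else "OTUk OH"
    return "ODUk OH"


# Line-rate expansion caused by the FEC: transmitted columns per non-FEC column.
fec_expansion = COLUMNS / (FIRST_FEC_COLUMN - 1)
```

With these dimensions the FEC occupies 256 of the 4080 columns, so the transmitted signal is 4080/3824 = 255/239 times (about 6.7% more than) the frame without FEC.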
Nonassociated Overhead
The nonassociated overhead of the OTM (Figure 3.18) consists of the OTS, the OMS, and the OCh overhead and is transported by means of an optical supervisory channel (OSC).
Optical Channel Overhead
The OCh OH includes information for maintenance functions to support fault management. The OTM-n.m OCh OH (for each optical channel carried within an OMS) consists of the following:
• OCh forward defect indication payload (FDI-P), used for OCh trail monitoring and defined to convey in the downstream direction the OCh payload signal status, namely normal or failed.
[Figure 3.18 fields: OTSn overhead: TTI, BDI-O, BDI-P, PMI; OMSn overhead: FDI-O, FDI-P, BDI-O, BDI-P, PMI; OCh overhead for each of the channels 1 to n: FDI-O, FDI-P, OCI; plus general management communications.]
Figure 3.18 OTSn, OMSn, and OCh overhead within the OTM overhead signal (OOS). (ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.)
• OCh forward defect indication overhead (FDI-O), used for OCh trail monitoring and defined to convey in the downstream direction the OCh overhead signal status, namely normal or failed.
• OCh open connection indication (OCI), which indicates that, upstream in a connection function, the matrix connection has been opened as a result of a management command. The consequential detection of the OCh loss of signal (LOS) condition can then be related to an open matrix.
OCh overhead extensions may be expected in the future.
Optical Multiplex Section Overhead
The OMS OH includes information for maintenance and operational functions to support OMSs. It consists of the following:
• OMS FDI-P, used for OMS section monitoring and defined to convey in the downstream direction the OMS payload signal status (normal or failed).
• OMS FDI-O, used for OMS section monitoring and defined to convey in the downstream direction the OMS overhead signal status (normal or failed).
• OMS backward defect indication payload (BDI-P), used for OMS section monitoring and defined to convey in the upstream direction the OMS payload signal fail status detected in the OMS termination sink function.
• OMS backward defect indication overhead (BDI-O), used for OMS section monitoring and defined to convey in the upstream direction the OMS overhead signal fail status detected in the OMS termination sink function.
• OMS payload missing indication (PMI), sent downstream as an indication that upstream, at the source point of the OMS signal, none of the OCCps contain an optical channel signal. It is defined to suppress the report of the consequential LOS.
Optical Transmission Section Overhead
The OTS OH includes information for maintenance and operational functions to support optical transmission sections. It consists of the following:
• OTS trail trace identifier (TTI), used for OTS section monitoring.
• OTS backward defect indication payload (BDI-P), used for OTS section monitoring and defined to convey in the upstream direction the OTS payload signal fail status detected in the OTS termination sink function.
• OTS backward defect indication overhead (BDI-O), used for OTS section monitoring and defined to convey in the upstream direction the OTS overhead signal fail status detected in the OTS termination sink function.
• OTS payload missing indication (PMI), sent downstream as an indication that upstream, at the source of the OTS signal, no payload is added. It is defined to suppress the report of the consequential loss of signal condition.
The use of this nonassociated overhead is also clarified with some examples in Section 3.3.3.
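As a compact summary of the lists above, the nonassociated overhead can be written down as a lookup table from layer and signal to the direction in which the signal is conveyed. The Python sketch below is merely one possible representation; the names are our own, and the direction of the TTI is left unset because the text does not state it.

```python
# Direction in which each nonassociated overhead signal is conveyed,
# as described in the text (None where the text gives no direction).
NONASSOCIATED_OH = {
    "OCh": {"FDI-P": "downstream", "FDI-O": "downstream", "OCI": "downstream"},
    "OMS": {
        "FDI-P": "downstream",
        "FDI-O": "downstream",
        "BDI-P": "upstream",
        "BDI-O": "upstream",
        "PMI": "downstream",
    },
    "OTS": {
        "TTI": None,
        "BDI-P": "upstream",
        "BDI-O": "upstream",
        "PMI": "downstream",
    },
}


def signals_in_direction(direction: str) -> list:
    """List 'LAYER-SIGNAL' names that are conveyed in the given direction."""
    return [
        f"{layer}-{signal}"
        for layer, signals in NONASSOCIATED_OH.items()
        for signal, signal_direction in signals.items()
        if signal_direction == direction
    ]
```

Calling `signals_in_direction("upstream")` makes it easy to see that only the OMS and OTS layers carry backward defect indications in the nonassociated overhead.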
3.3.2 Defects in the Optical Transport Network
Section 3.3.1 discussed in detail the available overhead for fault detection and propagation. In this section the defects in the OTN are briefly discussed. Table 3.3 enumerates the most important ones [G798], [G806]. Most defects in the OTN are quite similar to those defined in the SONET/SDH layer (see Chapter 2, Section 2.3). After a defect has been declared, a decision has to be made on the consequent action to be taken. An example of such a consequent action is to send out an FDI. Also the fault cause has to be determined. In general, the defect and the consequent action share the same name.
Table 3.3 Defects in the Optical Transport Network
Loss of Signal Payload (LOS-P): No signal is coming in. The LOS-P defect is monitored at the OTS, OMS, and OCh layers of the OTM-n.m and the OPS and OChr layers of the OTM-nr.m and OTM-0.m.
Loss of Signal Overhead (LOS-O): No optical supervisory channel (OSC) containing the nonassociated overhead is coming in.
Loss of Tandem Connection (LTC): Detects the presence or absence of tandem connection overhead. The LTC is monitored at the tandem connection sublayer of the ODUk layer.
Trace Identifier Mismatch (TIM): Connectivity fault because of improper routing of the connection between trail termination source and sink, or because the connectivity is not maintained while the connection is active. The TIM is monitored at the OTS, OTUk, and ODUk layers.
Signal Degrade (SDEG): The signal has degraded (too many error blocks in a monitoring interval). Monitored at the OTUk and ODUk layers.
Payload Mismatch (PLM): The payload type is not equal to the expected payload type. PLM is monitored at the path layer of the ODUk layer.
Loss of Frame (LOF): The correct pattern in the FAS bytes of the OTUk frame is not found for five consecutive frames.
Loss of Multiframe (LOM): The received MFAS does not match the expected MFAS in five consecutive OTUk frames.
Forward Defect Indication Payload (FDI-P): The FDI-P signal is monitored at the OMS and OCh layers to suppress downstream alarms at the client layer caused by upstream defects detected by the server layer, which interrupt the client payload signal. The FDI signal is sent downstream as an indication that an upstream defect has been detected. When the signal is in the optical domain, the term FDI is used. In the electrical domain, the term AIS is used.
Forward Defect Indication Overhead (FDI-O): The FDI-O signal is monitored at the OMS and OCh layers to suppress downstream alarms at the client layer caused by upstream defects detected by the server layer, which interrupt the OTM overhead signal.
Alarm Indication Signal (AIS): The AIS signal is monitored at the OTUk and ODUk layers to suppress alarms at the client layer caused by upstream defects detected by the server layer and/or the client's tandem connection sublayer, which interrupt the client payload signal. It is the electrical equivalent of the FDI-P signal in the optical domain.
Backward Defect Indication Payload (BDI-P): The BDI-P defect signal is monitored at the OTS and OMS layers. The BDI-P is sent upstream as an indication that a defect has been detected that interrupts the client payload signal.
Backward Defect Indication Overhead (BDI-O): The BDI-O defect signal is monitored at the OTS and OMS layers. The BDI-O is sent upstream as an indication that a defect has been detected that interrupts the OTM overhead signal.
Backward Defect Indication (BDI): The BDI is declared or cleared through the BDI field of the SM, TCMi, and PM overhead fields of the OTUk and ODUk layers. A BDI signal is sent upstream as an indication that a defect has been detected that interrupts the client payload signal.
Backward Error Indication (BEI): The BEI is declared or cleared through the BEI field of the SM, TCMi, and PM overhead fields of the OTUk and ODUk layers. A BEI signal is sent upstream as an indication that an error has been detected that affects the client payload signal.
Open Connection Indication (OCI): The OCI defect is monitored at the OCh and ODUk path and tandem connection layers to qualify a downstream LOS defect by indicating that the LOS is due to the fact that the signal is not connected. An OCI signal is sent downstream as an indication that upstream the signal is not connected.
Payload Missing Indication (PMI): This parameter is monitored at the OTS and OMS layers to suppress downstream LOS alarms caused by upstream defects that caused the missing payload. A PMI signal is sent downstream as an indication that upstream, at the source point of the signal, the payload is missing.
Locked (LCK): The LCK parameter is monitored at the ODUk path and tandem connection layer. An LCK signal is sent downstream to indicate that upstream the connection is locked, so no signal is passed through.
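Several of the defects in Table 3.3, such as LOF and LOM, are declared after a pattern has been missed in a number of consecutive frames. The Python sketch below shows this counting principle with the threshold of five frames given in the table. It is deliberately simplified: the class name is hypothetical, and clearing the defect on the first correct frame is our own assumption, whereas the entry and exit criteria actually used by equipment are specified in [G798].

```python
class ConsecutiveMismatchDetector:
    """Declare a defect after `threshold` consecutive frames with a bad pattern."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.misses = 0        # consecutive frames without the expected pattern
        self.defect = False

    def update(self, pattern_ok: bool) -> bool:
        """Process one frame and return the current defect state."""
        if pattern_ok:
            # Assumption of this sketch: one good frame clears the defect.
            self.misses = 0
            self.defect = False
        else:
            self.misses += 1
            if self.misses >= self.threshold:
                self.defect = True
        return self.defect
```

A framer would call `update` once per received OTUk frame, passing whether the FAS (for LOF) or the expected MFAS value (for LOM) was found.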
3.3.3 OTN Maintenance Signals and Alarm Suppression
In this section we discuss how different maintenance signals can be correlated to reduce the number of alarms raised. Maintenance signals indicate defects in a connection. The defect indications are given in the upstream and downstream direction. Figure 3.19 gives an overview of the maintenance signals that convey the backward (upstream) information in the different layers of the OTM. Note that in the OCh layer, no BDI or BEI is defined for the moment. In addition, for some of the client layers, the corresponding RDI²⁴ and REI²⁵ are not defined yet. If the client layer is a constant bit rate (CBR) signal (i.e., a SONET/SDH signal), the RDI and REI (as discussed in Chapter 2, Section 2.2.3) are used.
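[Figure 3.19 signals, from client layers down to the line: RDI and REI for the client layers where defined (shown for CBR and ATM; other clients are Ethernet, IP, MPLS, and a future server layer); ODUk-BDI and ODUk-BEI at the ODUk layer; OTUk-BDI and OTUk-BEI at the OTUk layer; no backward signals at the OCh layer; OMSn-BDI-P and OMSn-BDI-O at the OMSn layer; OTSn-BDI-P and OTSn-BDI-O at the OTSn layer.]
Figure 3.19 Optical Transport Network maintenance signals: backward information. (M. Vissers, “Optical Transport Network & Optical Transport Module,” ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.)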
²⁴ Remote defect indication is a signal that is sent upstream to indicate that a defect has been detected that interrupts the signal.
²⁵ Remote error indication is a signal that is sent upstream to signal an error condition.
[Figure 3.20 signals, from the line up to the client layers: OTSn-PMI at the OTSn layer; OMSn-FDI and OMSn-PMI at the OMSn layer; OCh-FDI at each OCh; OTUk-AIS at the OTUk layer; ODUk-AIS at the ODUk layer; and, toward the clients, gen-AIS for CBR, VP-AIS for ATM, and MPLS-FDI for MPLS (other clients shown: Ethernet, IP, and a future server layer).]
Figure 3.20 Optical Transport Network maintenance signals: forward information. (M. Vissers, “Optical Transport Network & Optical Transport Module,” ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.)
Figure 3.20 summarizes the forward (downstream) maintenance information and indicates how the FDI of the optical layers, the AIS of the digital layers, and the PMI can be used to perform alarm suppression. The FDI is used to notify in the downstream direction that there is a signal fail. It is used to suppress downstream alarms in the client layer caused by an upstream and already detected defect in the server layer that has interrupted the client payload signal. The PMI signal is used to signal to the termination point of the trail that there was already no payload signal at the origin point of the trail. The use of the FDI, AIS, and PMI maintenance signals is illustrated in Figure 3.21. Let us assume that in the network depicted in Figure 3.21, each fiber is able to transport 200 wavelengths (optical line system of 200 wavelength channels). Each fiber cable contains 96 fibers, and five fiber cables are grouped per duct. This means that in the case of a failure caused by erroneous digging activities, 5 * 96 * 200 = 96,000 wavelength channels could be affected. This would also imply that 96,000 LOS alarms are generated, far too many to handle.
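The arithmetic behind this example is easily reproduced. The short Python fragment below uses only the figures assumed in the text (five cables per duct, 96 fibers per cable, 200 wavelengths per fiber); the variable names are our own.

```python
CABLES_PER_DUCT = 5
FIBERS_PER_CABLE = 96
WAVELENGTHS_PER_FIBER = 200

# Fibers cut when the whole duct is damaged.
fibers_in_duct = CABLES_PER_DUCT * FIBERS_PER_CABLE

# Wavelength channels affected, i.e. LOS alarms without any alarm suppression.
channels_in_duct = fibers_in_duct * WAVELENGTHS_PER_FIBER

print(fibers_in_duct, channels_in_duct)
```

The two numbers show the gap between raising one alarm per affected wavelength channel and raising one alarm per broken fiber.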
[Diagram labels: DXC 1, DXC 2, DXC 3, DXC 4, 3R; maintenance signals OTS-PMI, OMS-FDI, OCh-FDI, ODUk-AIS.]
Figure 3.21 Alarm suppression based on the Optical Transport Network maintenance signals. (M. Vissers, "Optical Transport Network & Optical Transport Module," ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.)
The use of the OTN maintenance signals FDI, AIS, and PMI reduces the number of alarms in the case of Figure 3.21 from 96,000 to a single one per broken fiber (at most 5 × 96 = 480 LOS alarms). Let us look at this example in more detail. We focus on the connections between DXCs 1 and 2. After the duct has been damaged, an OMS-FDI is sent from the OTS layer to the OMS client layer. At the OMS termination point, that is, at the first OXC, the OMS-FDI is converted into an OCh-FDI. This OCh-FDI signal is sent to the downstream OCh termination points. In Figure 3.21, the OCh termination points are the 3R regenerators at the second OXC encountered. At this point the OCh-FDI signal is converted into an ODUk-AIS signal. The OTS-PMI (and the OMS-PMI) signals prevent the LOS alarm from being raised at the OTS (and the OMS) layer when the wavelengths are not present.
Figure 3.22 illustrates the use of the BDI and BEI signals. After the occurrence of the network fault, the downstream OTS termination point (at the downstream OA) sends an OTS-BDI to the upstream OTS termination point (at the upstream OA) to indicate that a defect has been detected that interrupts the client signal. Likewise, the downstream OMS termination point (at the downstream demultiplexer) sends an OMS-BDI to the upstream OMS termination point (at the upstream multiplexer). The downstream OTUk termination points send both an OTUk-BDI and an OTUk-BEI signal to the upstream OTUk termination points. Also, the downstream ODUk termination points send an ODUk-BDI and an ODUk-BEI to the upstream ODUk termination points. The signals that are sent in the downstream direction are similar to those in Figure 3.21.
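The alarm counts quoted for the example of Figure 3.21 can be checked with a few lines of Python. This is only a sketch of the arithmetic; the constant and function names are illustrative and do not come from any standard.

```python
# Figures taken from the example in the text (Figure 3.21).
WAVELENGTHS_PER_FIBER = 200
FIBERS_PER_CABLE = 96
CABLES_PER_DUCT = 5


def alarms_without_suppression() -> int:
    """One LOS alarm per affected wavelength channel in the cut duct."""
    return CABLES_PER_DUCT * FIBERS_PER_CABLE * WAVELENGTHS_PER_FIBER


def alarms_with_suppression() -> int:
    """With FDI/AIS/PMI suppression: at most one LOS alarm per broken fiber."""
    return CABLES_PER_DUCT * FIBERS_PER_CABLE


if __name__ == "__main__":
    print(alarms_without_suppression())  # 96000
    print(alarms_with_suppression())     # 480
```

Under these assumptions, suppression reduces the alarm volume by a factor equal to the number of wavelengths per fiber.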
[Diagram labels: DXC 1, DXC 2, 3R; forward signals OTS-PMI, OMS-FDI, OCh-FDI, ODUk-AIS; backward signals OTS-BDI, OMS-BDI, OTUk-BDI, OTUk-BEI, ODUk-BDI, ODUk-BEI.]
Figure 3.22 Use of backward and forward information.
3.4 Recovery in Optical Networks
In this section, we focus on recovery schemes in the optical network layer. These recovery schemes closely resemble the schemes discussed in the previous chapter on SONET/SDH.
3.4.1 Recovery at the Optical Layer?
Why would we want to deploy recovery schemes at the optical network layer? The reason is quite straightforward. Many failures occur at the optical network layer: fiber cuts resulting from, for instance, digging works, and failures of an individual transmitter or receiver are quite common. As discussed earlier, the recovery schemes at the optical layer work at the level of a multiplex section or an optical channel. In both cases the recovery action is carried out at the large granularity of an optical channel or even a complete multiplexed bundle of optical channels. This means that there are fewer connections to restore. In Chapter 6 on multilayer survivability, we will see that a root failure at the optical network level typically results in a significant number of secondary failure indications at the higher layers (e.g., the SONET/SDH or IP layer). A recovery scheme at the client layer (e.g., SONET/SDH or IP) would need to restore quite a lot of affected connections.26 In the optical layer, however, the number of connections that are affected by the root failure and that have to be restored is limited. Because recovery at the optical layer recovers the affected connections as a group, the recovery action is also fast and easier to manage than recovering each affected connection individually in the client layer. Recovering the affected connections in the client layer implies a lot of individual actions to switch the traffic from its working path to its backup path.
26 The client layer of an Optical Transport Network typically does not consist of a single client network, but of a number of independent client networks. A single client network might not have to undertake more recovery actions than the optical network, but overall, recovering from a failure in the optical layer will typically be cheaper. A disadvantage of recovery at the optical layer is the decreased possibility to differentiate the recovery scheme per client.
The same recovery scheme classification as discussed in Chapter 2 can be used for optical networks. A distinction can be made between protection and restoration schemes. The difference between the two was already explained in Chapter 1. Both options require signaling, but the (subtle) difference lies in the timing of the signaling actions. In the case of protection, the recovery paths are preplanned and fully signaled before a failure occurs. Hence, when a failure occurs, no additional signaling is needed to establish the protection path. In the case of restoration, the recovery paths can be either preplanned or dynamically allocated, but when a failure occurs additional signaling is needed to establish the restoration path. All protection schemes, except the 1+1 unidirectional protection switching scheme, rely on an APS coordination protocol, which is currently being standardized (see Section 3.4.2). It will undoubtedly show a major resemblance to the APS scheme in the SONET/SDH layer, which was explained in detail in Chapter 2, Section 2.3.4.
Other classifications are based on whether the recovery scheme recovers from link and node (OXC or OADM) failures or only from link failures; whether the spare resources are preplanned and allocated off-line or allocated dynamically after the failure has happened; whether the scheme is deployed in a ring-based or mesh-based network; and so on.
Survivability schemes at the optical layer can also be classified depending on the exact sublayer in which the recovery action is performed. More precisely, in an optical network the recovery scheme can operate at the OCh level or at the OMS level. In the former case, each lightpath is switched to its backup lightpath when a failure occurs, one at a time. In the latter case, the whole multiplex of optical channels transmitted over a single fiber is switched over from the working path to the backup path.
3.4.2 Standardization Work on Recovery in the Optical Transport Network
The status of the standardization work on the OTN at the ITU-T and other standardization organizations such as the OIF and ANSI T1X1, at the time of this publication, has been discussed in Section 3.2. The topic of survivability in the OTN layer is, however, still under study within these standardization organizations. Specification of protection switching in both ring-based and mesh-based OTNs will soon be published (2004–2005). The following is a list of recommendations that are currently under development concerning recovery in the OTN within the ITU-T:
• ITU-T Rec. G.808.1: Generic Protection Switching–Linear Trail and Subnetwork Protection [G808.1]. The scope of this recommendation is the definition of the generic functional models, characteristics, and processes associated with various linear protection schemes for connection-oriented layer networks. This recommendation is thus not limited to the OTN, but is also valid for the SONET/SDH and ATM network layers. The protection schemes that are described are trail protection (see Chapter 2, Section 2.3.4) and subnetwork connection protection (see Chapter 2, Section 2.3.4).
• ITU-T Rec. G.808.2: Generic Protection Switching–Ring [G808.2].
• ITU-T Rec. G.873.1: Optical Transport Network (OTN)–Linear Protection [G873.1]. This recommendation describes the APS protocol to support linear protection in the OTN at the ODUk path and ODUk TC sublayers.
• ITU-T Rec. G.873.2: Optical Transport Network (OTN)–Ring Protection [G873.2]. This recommendation describes the APS protocol to support ring protection in the OTN.
3.4.3 Shared Risk Group
An important concept to keep in mind when discussing recovery of optical networks is that of a shared risk group (SRG) [Str01]. This concept is closely related to the concept of diversity. Two lightpaths are said to be link/node diverse if they do not share a common link/node. Diversity thus implies that there is no single point of failure. However, to ensure real physical diversity, the lightpaths have to be diverse not only on the fiber cable topology, but also on the underlying duct topology.
A short discussion of the physical placement of the optical fibers is needed. Optical fibers that are buried underground are grouped into a fiber cable that is generally installed in a duct (a prefabricated pipe into which the cable is drawn using a draught winch), which is in turn placed in a trench. Such a trench often follows a right of way (ROW), which is frequently obtained from railroad or electricity companies. A fairly common situation is depicted in Figure 3.23. Although in the fiber topology of Figure 3.23 both paths between node A and node B (e.g., a working path and a dedicated protection path) seem to be diverse, this is clearly not the case physically, in the duct topology. A duct failure would affect both the working and the backup path. This example is typical for a submarine duct or for ducts in dense metropolitan areas.
One way to solve this problem is to introduce the SRG. This is an identifier that is assigned to the common resource, that is, the common risk. In the previous example, this is the duct: the duct is a shared risk component whose failure (e.g., a cut caused by a digging accident) will cause all fibers in the duct to be cut. All fibers that go through that duct belong to the same SRG. During the calculation of the primary and backup paths, one can then avoid using resources with this identifier in both the primary and the backup path at the same time (SRG diverse paths). Because a fiber will typically run through a sequence of ducts, a fiber will typically belong to several SRGs.
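The idea of SRG diverse path calculation can be sketched in Python as follows. This is a minimal illustration, not a method taken from the text: the topology, the duct names, and the function names are hypothetical, and the two-step approach (compute the primary path first, then a backup path that avoids every SRG of the primary) is a simple heuristic that can fail to find a diverse pair even when one exists.

```python
from collections import deque

# Hypothetical topology: each link maps to the set of SRG identifiers
# (here, duct names) that it runs through. All names are illustrative.
LINKS = {
    ("A", "C"): {"duct1"},
    ("C", "B"): {"duct2"},
    ("A", "D"): {"duct1"},  # shares duct1 with link A-C
    ("D", "B"): {"duct3"},
    ("A", "E"): {"duct4"},
    ("E", "B"): {"duct5"},
}


def neighbors(node, banned_srgs):
    """Yield (neighbor, srgs) for every usable link attached to node."""
    for (u, v), srgs in LINKS.items():
        if srgs & banned_srgs:
            continue  # link touches a forbidden shared risk group
        if u == node:
            yield v, srgs
        elif v == node:
            yield u, srgs


def shortest_path(src, dst, banned_srgs=frozenset()):
    """Breadth-first search; returns (node list, SRGs used) or None."""
    queue = deque([(src, [src], set())])
    visited = {src}
    while queue:
        node, path, used = queue.popleft()
        if node == dst:
            return path, used
        for nxt, srgs in neighbors(node, banned_srgs):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [nxt], used | srgs))
    return None


def srg_diverse_pair(src, dst):
    """Primary path, then a backup path sharing no SRG with the primary."""
    primary = shortest_path(src, dst)
    if primary is None:
        return None
    backup = shortest_path(src, dst, banned_srgs=frozenset(primary[1]))
    if backup is None:
        return None
    return primary[0], backup[0]
```

In this example topology, the route A-D-B is link diverse from the primary route A-C-B but is rejected as a backup because link A-D shares duct1 with link A-C; the backup is routed over A-E-B instead.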
[Diagram: two panels, "Fiber Cable Topology" and "Duct Topology", each showing nodes A and B.]
Figure 3.23 Fiber cable topology versus duct topology.
Of course, the principle of SRG is not limited to the cable topology versus the duct topology but can also be applied to the fiber topology versus the fiber cable topology (e.g., in the case of fiber splicing, see Figure 3.24), to the logical network topology versus the transport layer topology in multilayer networks (see Chapter 6), and so on. In Figure 3.24, fiber is spliced at the manhole to reach, for example, an office with limited bandwidth needs at the upper node. Other examples are provided in Chapter 5 in which two IP links can share the same shared risk link group (SRLG)27 if they are routed in the same fiber.
3.5 Recovery Mechanisms in Ring-Based Optical Networks
In this section we focus on recovery strategies in ring-based optical networks. As already explained (Chapter 2, Section 2.4), several ring-based architectures are possible. In such a ring-based architecture, the most natural way to provide recovery is to use a protection scheme. Each of the possible ring-based network architectures has its own way to provide recovery [Ari1/00], [Bon01], [Ram02].
27 Note that in the IP/MPLS world, the term shared risk link group is usually used.
[Diagram labels: Manhole, Fiber Cable Topology, Fiber Topology.]
Figure 3.24 Fiber topology versus fiber cable topology. (J. Strand, A. Chiu, R. Tkach, "Issues for routing in the optical layer," IEEE Communications Magazine, vol. 39, no. 2, February 2001, pp. 81–87.)
A first distinction is based on the layer in which the protection scheme is implemented: the OMS layer or the OCh layer (Figure 3.25). With a scheme at the OCh layer, the recovery process is performed by the OADMs through which the traffic enters and leaves the ring network. With a scheme at the OMS layer, the recovery process is performed by the OADMs adjacent to the failure. The choice between the OMS and the OCh layer also determines the granularity of the protection action. In the OMS layer the bundle of multiplexed optical channels is protected as a whole, whereas with a protection scheme at the OCh level each optical channel is protected individually and the protection switching occurs at the granularity of a single optical channel. Figure 3.25 shows a ring network with four nodes in which a failure affects a single wavelength channel of the group of multiplexed optical channels of a connection between OADMs A and C. With a scheme at the OCh level (Figure 3.25, left), only the affected wavelength channel is switched over to the backup path (dotted line), which runs between OADM A, where the affected traffic flow enters the ring, and OADM C, where this traffic flow exits the ring. With a scheme at the OMS level (Figure 3.25, right), the whole group of wavelength channels is switched over and the backup path runs between OADMs A and B, the OADMs adjacent to the failure.
A second distinction is based on whether the protection scheme is dedicated (dedicated protection ring [DPRing]) or shared (shared protection ring [SPRing]). In the former scheme, each working wavelength around the ring has a dedicated protection wavelength, whereas in the latter scheme the protection capacity is shared between several working paths. The shared protection scheme is typically more complex to implement and manage but consumes fewer resources than the dedicated approach.
This distinction between shared and dedicated rings is discussed in Chapter 2, Section 2.4. Another distinction can be made based on the direction in which the traffic is transmitted under normal working conditions. In a unidirectional ring, signals are always transmitted in the same direction on the ring, whereas in a bidirectional ring, signals are transmitted in both directions of the ring. Again, this is discussed in Chapter 2, Section 2.4. In addition, a further distinction can be made between ring architectures that use two fibers along the ring and four-fiber architectures.
[Diagram: rings of OADMs A, B, C, and D.]
Figure 3.25 Ring recovery scheme operating at the optical channel level (left) and the optical multiplex section level (right).
The recovery schemes in optical ring networks are not yet (completely) standardized (see Section 3.4.2). It is expected, however, that the APS protocol in the optical ring must meet the 50-ms switching time, just as with SONET/SDH. Because the recovery schemes in optical ring networks closely resemble the parallel schemes in SONET/SDH ring networks (Table 3.4), they are not explained fully in the following sections; the emphasis is on the differences from the corresponding SONET/SDH scheme. Section 3.5.1 discusses OMS protection rings, both dedicated and shared. Section 3.5.2 focuses on OCh protection rings, again discussing both SPRings and DPRings. In Section 3.5.3, the OMS-based approach is compared with the OCh-based approach, and Section 3.5.4 compares the shared and dedicated approaches.
Table 3.4 Parallel Scheme for SONET/SDH Ring Networks and Optical Ring Networks
SONET/SDH | Optical | Characteristics
MS-DPRing | OMS-DPRing (OULSR¹) | Dedicated protection, local recovery scheme performed by the OADMs adjacent to the failure
MS-SPRing (SDH), BLSR (SONET) | OMS-SPRing (OBLSR²) | Shared protection, local recovery scheme performed by the OADMs adjacent to the failure
SNCP (SDH), UPSR (SONET) | OCh-DPRing (OUPSR³) | Dedicated protection, end-to-end recovery scheme performed by the OADMs on which the traffic enters/leaves the ring
/ | OCh-SPRing (OBPSR⁴) | Shared protection, end-to-end recovery scheme performed by the OADMs on which the traffic enters/leaves the ring
¹ OMS-DPRing is sometimes called optical unidirectional line-switched ring.
² OMS-SPRing is sometimes called optical bidirectional line-switched ring.
³ OCh-DPRing is sometimes called optical unidirectional path-switched ring.
⁴ OCh-SPRing is sometimes called optical bidirectional path-switched ring.
3.5.1 Multiplex Section Protection in Ring-Based Optical Networks
The first type of protection scheme in ring-based optical networks that is discussed operates at the OMS layer and performs fiber protection switching. The protection granularity is thus the capacity of a single fiber: the bundle of multiplexed optical channels. Such a fiber-based protection scheme requires only simple control and management mechanisms. The OMS level ring protection scheme can use dedicated or shared backup capacity.
OMS Dedicated Protection Rings
With a two-fiber OMS-DPRing (Figure 3.26), one fiber is dedicated to working traffic (outer fiber in Figure 3.26) and the other, counterrotating fiber is reserved for protection traffic (inner fiber in Figure 3.26). The two directions of a bidirectional wavelength demand are routed on different sides of the ring, using the same wavelength. The same applies to the protection path on the protection fiber. There is thus no possibility of reusing wavelengths on the ring for different demands. When a failure occurs, it is detected by the two OADMs adjacent to the failure, based on the monitoring information in the OMS OH. Both OADMs loop back the affected multiplexed bundle of optical channels on the protection ring in the opposite direction (Figure 3.26). An APS protocol is required to handle the switching.
[Diagram: ring of OADMs A to F, shown twice.]
Figure 3.26 The two-fiber optical multiplex section dedicated protection ring architecture in a failure-free condition (top) and after a link failure (bottom). The outer fiber ring is dedicated to working traffic, and the inner fiber ring to protection traffic.
OMS Shared Protection Rings
The implementation of SPRing schemes is more complicated than that of DPRing schemes but is more efficient in terms of backup bandwidth usage. The shared protection equivalent of the OMS-DPRing is the OMS-SPRing. Two implementations are used: a two-fiber implementation and an architecture with four fibers along the ring.
In the two-fiber implementation (Figure 3.27), half of the wavelengths on each fiber are reserved as working channels (marked in white) and the other half as protection channels (marked in gray). Working connections in one fiber are protected by the protection capacity in the other fiber, in the opposite direction of the ring. Both directions of a bidirectional demand are routed along the same side of the ring, in different fibers. The same wavelength can thus be reused to accommodate a connection between other nodes whose route does not overlap. For instance, in Figure 3.27, besides the connection between OADMs A and D, a connection between OADMs E and F can be accommodated on the two-fiber OMS-SPRing using the same wavelength as connection A-D.
[Diagram: ring of OADMs A to F, shown twice.]
Figure 3.27 The two-fiber optical multiplex section shared protection ring architecture in a failure-free condition (top) and after a link failure (bottom).
When a failure is detected at the OMS level (link or OADM failure), the OADMs adjacent to the failure loop back all the affected lightpaths at once on the protection channels of the ring. This is illustrated in Figure 3.27 for the failure of link A-B. An APS protocol is needed to coordinate the switching actions and ensure correct use of the shared protection capacity.
There is no dedicated protection connection per working connection. The spare capacity in the network can be used by different working connections. In Figure 3.27, for instance, the same spare capacity is used to provide recovery for connections A-D and E-F. The spare capacity is thus shared between several working connections. Note that this implies that it is impossible to recover from multiple simultaneous failures affecting more than one connection. In Figure 3.27, for instance, when links A-B and E-F fail simultaneously, only one of the connections A-D and E-F can be recovered. If the same wavelength is used for both working directions of a bidirectional connection, wavelength conversion is required when a protection switch takes place. The need for wavelength conversion in the OADMs can be avoided by assigning different wavelengths to the two directions of a working connection.
In the four-fiber implementation (Figure 3.28), working and protection channels are carried over different fibers in the ring. In this situation, both directions of a bidirectional demand can thus always be assigned the same wavelength without the need for wavelength converters in the OADMs (in contrast to the two-fiber implementation depicted in Figure 3.27). Of course, the four-fiber OMS-SPRing has twice the capacity of the two-fiber implementation. In addition, the four-fiber implementation can recover from more failure situations than its two-fiber version. If only the multiplex section in the working fiber of the ring is affected, the parallel protection fiber can be used after a simple span switch and no loop back occurs (Figure 3.28, middle). In this way, certain multiple failures can be fully protected. For instance, in Figure 3.28, in the event of the simultaneous failure of a single working fiber on links A-B and E-F, both connections A-D and E-F can be recovered. This was not possible with the two-fiber implementation of the OMS-SPRing. If both the working and the protection fiber are affected, or in the case of a node failure, a ring switch is performed (Figure 3.28, bottom).
The OMS-SPRing architecture needs only a limited number of protection switches, because the OMS bundle of optical channels is switched as a whole. However, because of this collective switching action, it cannot cope effectively with a failure that affects only a single wavelength channel (e.g., a failure of an optical transmitter in an opaque OXC or of a single mirror in a MEMS-based OXC design).
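The choice between a span switch and a ring switch in the four-fiber OMS-SPRing can be summarized as a small decision function. The sketch below only restates the rules given in the text; the names are illustrative, and the case in which only the protection fiber fails (no switch, because working traffic is unaffected) is an assumption that the text does not spell out.

```python
from enum import Enum


class Action(Enum):
    NONE = "no switch"
    SPAN_SWITCH = "span switch"
    RING_SWITCH = "ring switch"


def four_fiber_action(working_failed: bool,
                      protection_failed: bool,
                      node_failed: bool = False) -> Action:
    """Recovery action on a span of a four-fiber OMS-SPRing (illustrative)."""
    # Node failure, or both fibers of the span affected: loop back around the ring.
    if node_failed or (working_failed and protection_failed):
        return Action.RING_SWITCH
    # Only the working fiber is affected: use the parallel protection fiber.
    if working_failed:
        return Action.SPAN_SWITCH
    # Assumption: a failure of the protection fiber alone needs no switch.
    return Action.NONE
```

Because a span switch stays local to the affected span, two spans with a failed working fiber can each be recovered independently, which is why the four-fiber ring survives the double failure discussed above.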
3.5.2 Optical Channel Protection in Ring-Based Optical Networks
The second type of protection scheme in ring-based optical networks is deployed at the OCh layer and performs optical channel protection switching. The protection granularity is thus the capacity of a single wavelength. Again a distinction can be made between DPRings and SPRings.
OCh Dedicated Protection Rings
The OCh-DPRing is a dedicated protection scheme that requires two fibers in the ring. Each wavelength demand is routed on a working path along one side of the ring and on a dedicated backup path along the opposite side of the ring. Bidirectional wavelength demands are supported by two wavelengths, one in each direction.
[Diagram: ring of OADMs A to F, shown three times.]
Figure 3.28 The four-fiber optical multiplex section shared protection ring architecture in a failure-free condition (top), after recovering from a single fiber fault using a span switch (middle), and after recovering from an optical add/drop multiplexer fault using a ring switch (bottom).
Both working wavelengths of the bidirectional wavelength demand can be routed along the same side of the ring, in different fibers and using the same wavelength (Figure 3.29). Alternatively, both working wavelengths can be routed on different sides of the ring so that one fiber of the two-fiber ring transports only working traffic while the other fiber transports only protection traffic. The wavelengths thus cannot be shared by wavelength demands between other node pairs.
[Diagram: ring of OADMs A to F, shown twice.]
Figure 3.29 Optical channel dedicated protection ring in a failure-free condition (top) and after a failure (bottom).
The protection switching occurs at the OCh layer. When a link or node failure occurs in the ring, the affected traffic is switched to the protection path. Two alternative protection schemes can be implemented: 1+1 or 1:1 dedicated protection. In the former case, where an optical splitter is used at the sending side, single-ended switching takes place at the receiving side, based on the monitoring information of the optical channel. No complicated signaling protocol is required, making the single-ended 1+1 protection scheme simple and robust. In the latter case, the traffic at the sending side is not permanently bridged. Dual-ended switching is required, and a switching protocol is needed to coordinate the switching action at both ends. The advantage of deploying 1:1 instead of 1+1 protection is that the spare capacity can be used to accommodate low-priority traffic (extra traffic) in failure-free conditions, which can be preempted in the event of a failure to provide the spare capacity for the high-priority failing connection. This 1:1 scheme is, however, more complex because the recovery actions of the various OADMs on the ring must be coordinated.
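The difference between the 1+1 and 1:1 variants can be written down as a small state model. The sketch below is only an illustration of the behavior described above; the class, field, and function names are hypothetical and are not taken from any APS specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionState:
    head_end_feeds: tuple       # paths on which the sending side transmits
    tail_end_selects: str       # path from which the receiving side reads
    needs_signaling: bool       # must both ends coordinate the switch?
    extra_traffic_allowed: bool  # can low-priority traffic use the spare path?


def one_plus_one(working_ok: bool) -> ProtectionState:
    """1+1: the signal is permanently split; only the receiver switches."""
    selected = "working" if working_ok else "protection"
    return ProtectionState(("working", "protection"), selected, False, False)


def one_to_one(working_ok: bool) -> ProtectionState:
    """1:1: no permanent bridge; both ends switch, preempting extra traffic."""
    active = "working" if working_ok else "protection"
    return ProtectionState((active,), active, not working_ok, working_ok)
```

In this model the 1+1 scheme never needs signaling, whereas the 1:1 scheme needs it exactly when a failure occurs, which is also the moment at which any extra traffic is preempted.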
OCh Shared Protection Rings
The OCh-SPRing scheme (Figure 3.30) is the only ring protection scheme in the optical network layer that has no equivalent in the SONET/SDH layer. It is implemented as a two-fiber ring. On each fiber, half of the wavelengths are reserved for working traffic and the other half for protection traffic. Working channels in one fiber are protected by protection channels in the other fiber. The protection channels travel around the ring in the direction opposite to that of the working channels. The two directions of a bidirectional wavelength demand are routed on the same side of the ring, in different fibers. The same wavelength can thus be reused for another nonoverlapping demand between a different node pair. For instance, in Figure 3.30, the nonoverlapping connections A-D and E-F can use the same wavelength.
[Diagram: ring of OADMs A to F, shown twice.]
Figure 3.30 Optical channel shared protection ring in a failure-free condition (top) and after a failure (bottom).
When a failure occurs, the affected optical channels are switched at the terminating OADMs to the other side of the ring, where they use the protection channels in the fiber (Figure 3.30). Bidirectional traffic demands thus need to be routed using different wavelengths for the two directions; otherwise wavelength conversion is required when traffic on a working channel is switched to a protection channel in the opposite direction.
In this shared approach there is no dedicated protection wavelength for each working path, but a pool of shared recovery resources is available for affected working connections. The backup paths are formed only after the failure has occurred. When the failure affects all links between two adjacent OADMs in the ring, no loop-back switching action is performed, but a direct backup path between source OADM and destination OADM is established. If a failure occurs, coordination is thus needed between the switching actions at the source and destination OADMs of the wavelength demand. Moreover, the switching must be performed for each affected wavelength. Therefore, the OADMs must be managed by a quite sophisticated protocol to coordinate the switching and to ensure that the protection channels are correctly assigned under different fault conditions. Despite its complexity, the OCh-SPRing offers many advantages, the most important being its capacity efficiency, which results from the sharing of the spare capacity by several working connections.
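The wavelength reuse rule for nonoverlapping demands can be checked mechanically. The sketch below assumes a six-node ring with the OADMs in the order A to F and assumes that connection A-D is routed via B and C; the figure itself does not survive in this text, so this routing and all names are illustrative.

```python
RING = ["A", "B", "C", "D", "E", "F"]  # OADMs in clockwise order (assumed)


def clockwise_links(src, dst, ring=RING):
    """Set of links traversed when going clockwise from src to dst."""
    n = len(ring)
    i = ring.index(src)
    links = set()
    while ring[i] != dst:
        j = (i + 1) % n
        # A link is stored as an unordered pair of adjacent OADMs.
        links.add(frozenset((ring[i], ring[j])))
        i = j
    return links


def can_share_wavelength(route1, route2):
    """Two working routes may reuse a wavelength if they share no link."""
    return not (route1 & route2)


if __name__ == "__main__":
    a_d = clockwise_links("A", "D")  # links A-B, B-C, C-D
    e_f = clockwise_links("E", "F")  # link E-F
    print(can_share_wavelength(a_d, e_f))
```

With these assumptions the routes of A-D and E-F share no link, so both connections can be assigned the same working wavelength, whereas a connection B-C would overlap with A-D and would need a different one.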
Mix of OCh-SPRing and OCh-DPRing
The protection schemes at the OCh level also allow the dedicated and shared ring protection approaches to be mixed and matched, so that each wavelength gets the appropriate protection level. Some wavelength channels can be protected using the 1+1 or 1:1 dedicated OCh-DPRing approach, and other wavelengths can employ the shared OCh-SPRing approach, which allows wavelengths to be reused and extra traffic to be accommodated on the ring. Some wavelengths may even be left unprotected. The protection is thus selected per wavelength channel.
3.5.3 OMS- versus OCh-Based Approach
In an OMS-based protection ring, the switching action in the OADMs is based on OMS-level failure indications. The whole multiplexed bundle of optical channels within the OMS is switched as a group. The APS signaling is also supported at the OMS level. In OCh-based protection rings, on the other hand, whether the protection is dedicated (OCh-DPRing) or shared (OCh-SPRing), not all wavelengths belonging to the OMS need to be switched at once. This means that failures that affect only a single wavelength of the multiplexed bundle (typically a failing transmitter or receiver) can be handled more efficiently. OCh-based rings can also support different protection schemes on the various wavelengths of the multiplexed bundle, which makes it possible to better accommodate the needs of the clients.
In the OCh rings, no loop-back switching action is performed, whereas it is performed in the OMS-SPRing with two fibers. An OCh-based ring protection scheme ensures that the optical signal will never be transported over a distance longer than the circumference of the ring. In the case of an OMS ring, the protection path can in the worst case span almost twice the entire ring circumference. This also influences the potential size of the ring network. Optical signals suffer from signal degradation and signal distortion, and to ensure a correct interpretation of the signal at the endpoints, they need to be amplified or even regenerated at regular intervals. If we assume transparent OADMs (without regeneration of the optical signal), the total length of the OCh ring may thus be longer than that of the two-fiber OMS-SPRing because no extra ring length is added for the loop-back switch.
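The difference in protection path length can be made concrete by constructing both routes on a small ring. The sketch below is an illustration under stated assumptions: hop count is used as a proxy for distance, the working path runs clockwise, a single link on the working path fails, and the OMS scheme loops back at both OADMs adjacent to the failure. The function names are hypothetical.

```python
def walk(ring, start, end, step):
    """Nodes visited from start to end; step is +1 (clockwise) or -1."""
    i = ring.index(start)
    path = [start]
    while ring[i] != end:
        i = (i + step) % len(ring)
        path.append(ring[i])
    return path


def och_protection_path(ring, src, dst):
    """OCh level: switch at the end nodes to the other side of the ring."""
    return walk(ring, src, dst, -1)


def oms_loopback_path(ring, src, dst, u):
    """OMS level: loop back at u and at v, the next node clockwise from u.

    The failed link is u-v and is assumed to lie on the clockwise working path.
    """
    v = ring[(ring.index(u) + 1) % len(ring)]
    to_failure = walk(ring, src, u, +1)        # working path up to the failure
    around = walk(ring, u, v, -1)[1:]          # loop back the long way round
    to_destination = walk(ring, v, dst, +1)[1:]  # rest of the working path
    return to_failure + around + to_destination


if __name__ == "__main__":
    ring = ["A", "B", "C", "D", "E", "F"]
    print(och_protection_path(ring, "A", "D"))     # ['A', 'F', 'E', 'D']
    print(oms_loopback_path(ring, "A", "D", "B"))  # 7 hops on a 6-node ring
```

For a ring with n nodes and a working path of h hops, the OCh route has n - h hops while the looped-back route has n + h - 2 hops, which approaches twice the ring circumference when the working path is long.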
3.5.4
Shared versus Dedicated Approach Because of the shared nature of the SPRings, they make more efficient use of the ring capacity than DPRings. With SPRings, there is a pool of spare resources in the ring that can be shared by the working connections. In contrast, in DPRings, each working wavelength around the ring has a dedicated protection wavelength. Table 3.5 gives an overview of the capacity requirements of an OCh-DPRing and an OCh-SPRing with n nodes for three traffic patterns. For the star traffic
Table 3.5 The Required Number of Wavelengths in a Two-Fiber OCh Ring with n Nodes
Traffic Pattern   OCh-DPRing    OCh-SPRing
Cyclic            n             2
Star              n - 1         n - 1 if n odd; n if n even
Full-Mesh         n(n - 1)/2    (n + 1)(n - 1)/4 if n odd; n(n + 2)/4 if n even
Source: From T. Shiragaki, S. Nakamura, M. Shinta, N. Henmi, S. Hasegawa, ‘‘Protection architecture and applications of OCh shared protection rings,’’ Optical Network Magazine, Vol. 2, No. 4, July/August 2001, pp. 48–58.
demand that is typical for a metropolitan area network, both ring protection schemes require a comparable number of wavelengths. For the full-mesh traffic pattern, usually found in the backbone network, the OCh-SPRing performs much better in terms of capacity than the OCh-DPRing: The required number of wavelengths approaches half of that for an OCh-DPRing for larger n. The shared protection approach has the great advantage that it consumes less capacity around the ring, because of the pool of spare resources available for the backup paths. However, SPRings are more complex than DPRings. Dual-ended switching is needed. The assignment of protection channels to the affected working channels is done in real time. A sophisticated protection switching protocol is, thus, required to coordinate and supervise the recovery actions of the OADMs and to ensure that the protection channels are correctly assigned under different fault situations. With SPRings there is also the potential problem of misconnection in the case of an OADM failure. This can be explained using Figure 3.31. In the failure-free condition (Figure 3.31, top), two connections, A-C and C-D, are routed
Figure 3.31 Misconnection in a two-fiber shared protection ring after failure of optical add/drop multiplexer C. (Top: failure-free condition; bottom: after the failure of OADM C. The ring consists of OADMs A through F.)
using the same wavelength on the fiber and they both have OADM C as an endpoint. When this OADM fails (Figure 3.31, bottom), the loop-back procedure will try to connect both connections to each other, thereby establishing an unwanted connection A-D between the two endpoints of both connections that are not the failed OADM. To prevent this misconnection, a squelching mechanism is needed (see Chapter 2, Section 2.4, for more details).
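The capacity comparison of Table 3.5 can be tabulated for concrete ring sizes. The sketch below is illustrative only: the function name and the pattern labels are assumptions, and the formulas are a reading of Table 3.5 (n for the cyclic OCh-DPRing, n(n - 1)/2 for the full-mesh OCh-DPRing, and so on) that should be checked against the cited source.

```python
def och_ring_wavelengths(n, pattern):
    """Return (dpring, spring) wavelength counts for an n-node two-fiber OCh ring.

    The formulas follow a reading of Table 3.5; `pattern` is one of
    "cyclic", "star", or "full-mesh" (labels chosen for this sketch).
    """
    odd = n % 2 == 1
    if pattern == "cyclic":
        return n, 2
    if pattern == "star":
        return n - 1, (n - 1 if odd else n)
    if pattern == "full-mesh":
        dpring = n * (n - 1) // 2
        # Both expressions are always integers for the given parity of n.
        spring = (n + 1) * (n - 1) // 4 if odd else n * (n + 2) // 4
        return dpring, spring
    raise ValueError(f"unknown traffic pattern: {pattern}")


for n in (5, 8, 16, 32):
    dpring, spring = och_ring_wavelengths(n, "full-mesh")
    print(f"n={n:2d}  OCh-DPRing={dpring:3d}  OCh-SPRing={spring:3d}  "
          f"ratio={spring / dpring:.2f}")
```

For the full-mesh pattern the ratio between the two counts decreases toward one half as n grows, which matches the statement that the shared ring needs about half the wavelengths of the dedicated ring in large rings.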
3.5.5
Interconnection of Rings
The size (fiber length and number of nodes) of a ring is limited by a number of physical constraints such as transmission impairments (attenuation, loss, etc.), by the time it takes for the protection switch to be executed, and by the availability constraint, similar to what is described in Chapter 2 for SONET/SDH networks. A large-scale network thus typically is not covered by a single ring but will consist of several interconnected rings. The interconnection options for optical rings are the same as the ones explained for SONET/SDH rings in Chapter 2, Section 2.4.4. The simplest solution is to have two OADMs at the interconnection points, installed back to back: The optical multiplexed signals or the optical channels that have to change between rings are dropped at the first OADM and added at the second OADM. A second solution is to add flexibility in the interconnection point by installing an OXC between the two OADMs. The option with the most flexibility is, however, to place only an OXC in the interconnection point of the ring, allowing traffic to be added/dropped, to pass through the ring, or to change rings.
Again, as in the case of SONET/SDH, a single point of failure for ring interconnection can be avoided by deploying a drop-and-continue (D&C) scheme. With the D&C scheme, two rings are always interconnected by two nodes. Instead of simply dropping the signal that has to be handed over from one ring to another at the first interconnection point between both rings, the signal also continues on the first ring and is handed over again at the second interconnection point between both rings. In this way, the network can always recover from single failures. For more details on D&C and the difference between D&C in the different ring types, see Chapter 2, Section 2.4.4.
A difference with the SONET/SDH-based ring networks is that optical ring networks are an analogue transmission medium.
The signal gets distorted by attenuation, noise, dispersion, and nonlinear effects such as self-phase modulation. The signal should thus be reamplified and even regenerated at regular intervals. Transponders can act as such 3R regenerators. If transparent OADMs are used, the ring interconnection points are a good place for these regenerators, so that each optical ring is an island of transparency.
3.6 Recovery Mechanisms in Mesh-Based Optical Networks
Recovery schemes in mesh-based optical networks are under study by standardization organizations including the ITU-T, the OIF, and T1X1 (see Section 3.4.2).
These schemes will without a doubt show a major resemblance to the corresponding schemes in the SONET/SDH layer.
A first distinction that can be made concerning recovery schemes in a mesh-based optical network is between protection and restoration schemes. For protection, the recovery paths are preplanned and fully signaled before a failure occurs. Hence, when a failure occurs, no additional signaling is needed to establish the protection path. For restoration, the recovery paths can be either preplanned or dynamically allocated, but when a failure occurs additional signaling will be needed to establish the restoration path. Protection is further discussed in Sections 3.6.1 and 3.6.2, and restoration in Section 3.6.3. The comparison of both is discussed in Section 3.6.4.
Another distinction is based on the extent of the recovery schemes. A recovery scheme can be implemented at the OMS layer or at the ODU layer. In the latter case, each working optical channel is protected individually between its source node and its destination node. This is called a path-based recovery scheme, because a recovery path between the source and destination nodes of the working lightpath is applied. In the former case, the complete bundle of multiplexed optical channels is protected between the endpoints of the OMS (OXCs or OADMs). Protection at the ODU level has the advantage that it can survive node failures. This is not the case for protection at the OMS level, because the OMS does not transit the nodes. A recovery scheme at the OMS layer is called a link-based recovery scheme. Only a local recovery path between the endpoints of the failed link is used to work around it. A link recovery mechanism replaces only the affected part of the working path, leaving the remaining part of it unaltered. Both approaches are illustrated in Figure 3.32. Of course, the
Figure 3.32 Recovery extent: link versus path recovery.
Figure 3.33 Back-hauling because of loop back of traffic with link recovery (left) is avoided with path recovery (right).
granularity of the recovery switching action differs for a link-based and a path-based recovery scheme. With link-based recovery all the lightpaths that travel along a failed link are simultaneously rerouted (the multiplexed bundle of wavelength channels is switched as a whole). Path-based recovery, on the other hand, needs to switch each affected lightpath individually on its alternative path between the endpoints of the lightpath. A local link recovery strategy has a number of disadvantages. The resulting backup path is often not the shortest alternative path. This is the case in Figure 3.32, where the complete backup path with link recovery crosses five links, whereas the backup path resulting from path recovery crosses only three links, and is in fact only as long in terms of hops as the working path. Link recovery may even lead to back-hauling, because the working capacity is looped back to recover from a failure (Figure 3.33). Back-hauling increases the length of the backup path. In optical networks this influences the placement of amplifiers and transponders, which are needed to guarantee a good signal quality. With a path recovery scheme, back-hauling is avoided. These remarks already give an indication that path-based recovery schemes perform better (higher capacity efficiency, better signal quality, etc.) than link-based schemes. This will be confirmed by the case study discussed in Section 3.6.4.
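The difference between the two recovery extents, including back-hauling, can be reproduced on a toy topology. The following is a minimal sketch under stated assumptions: the six-node topology, the node names, and the helper functions are hypothetical and are not taken from Figures 3.32 or 3.33. A breadth-first search stands in for whatever route computation a real network would use.

```python
from collections import deque


def build_adjacency(links):
    """Build an undirected adjacency map from a list of (a, b) links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def shortest_path(adj, src, dst, failed_link):
    """Minimum-hop path from src to dst that avoids failed_link (BFS)."""
    banned = frozenset(failed_link)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent and frozenset((node, nxt)) != banned:
                parent[nxt] = node
                queue.append(nxt)
    return None  # no recovery path exists


# Hypothetical topology: working path S-A-B-D, alternative route S-X-Y-D.
links = [("S", "A"), ("A", "B"), ("B", "D"),
         ("S", "X"), ("X", "Y"), ("Y", "D")]
adj = build_adjacency(links)
working = ["S", "A", "B", "D"]
failed = ("A", "B")  # listed in the direction of the working path

# Path recovery: new end-to-end route between source and destination.
path_recovery = shortest_path(adj, working[0], working[-1], failed)

# Link recovery: keep the working path and splice in a local detour
# between the two endpoints of the failed link.
i = working.index(failed[0])
detour = shortest_path(adj, failed[0], failed[1], failed)
link_recovery = working[:i] + detour + working[i + 2:]

print("path recovery:", path_recovery, len(path_recovery) - 1, "hops")
print("link recovery:", link_recovery, len(link_recovery) - 1, "hops")
```

In this topology the link recovery route traverses the links S-A and B-D twice (back-hauling) and is more than twice as long as the route found by path recovery.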
3.6.1
Protection
A simple way to recover from failures in a mesh-based network is to use a protection scheme. Different protection options are available: 1+1, 1:1, 1:N, or M:N (see Chapter 1 and Chapter 2, Section 2.3.4).
With 1+1 protection, the traffic signal is duplicated for protection purposes and transmitted over both a working path and a backup path. With 1+1 link protection, a backup path is reserved around each link of the working path. With 1+1 path protection, a working and a backup path will be reserved between the source node and the destination node of the traffic demand. When both paths are only link disjoint,28 only recovery in the case of single link failures is guaranteed. When both paths are node disjoint,29 both single link and single node failures can be recovered. The receiving end selects a nonfailing signal from both received signals. Switching occurs solely at the receiving end. This is a single-ended switching mechanism because one switching action is sufficient to recover the affected signal. The advantage of 1+1 protection is that it is fast and easy, but a drawback is that the backup resources are permanently occupied. Path protection will typically use less capacity than link protection, because with link protection the backup paths are in general longer than with path protection.
28 Link disjoint means that the working path has no links in common with the backup path. Note that node disjointness implies link disjointness.
With 1:1 protection, on the other hand, the backup resources are used only to ensure recovery when a failure has occurred. The advantage is that in failure-free condition, the spare resources can be used to accommodate so-called extra traffic. This is additional traffic with lower priority than normal working traffic that is preempted and dropped when the spare resources are needed to perform the recovery action. The disadvantage of 1:1 protection is that selection and switching now have to be done at both the sending and the receiving end (dual-ended switching). An APS protocol is, thus, needed to coordinate the recovery action. The 1:1 protection scheme is thus somewhat more complex than 1+1 protection. Besides this, if extra traffic is accommodated, the preemption of this extra traffic may slow down the recovery process, because it also consumes time.
The 1:N and M:N protection schemes are shared variants of the 1:1 protection scheme. With 1:N protection, one spare path is shared between N working paths. With M:N (M < N), M spare paths are shared among N working paths, making this a quite complex scheme to implement.
In optical networks that cover a large geographic area, multiple simultaneous failures are not that uncommon for very long connections.
One way to decrease the probability that, for example, a double failure affects both the working path and the recovery path of a long (e.g., intercoastal) connection is to break up such a long connection into shorter connections with independent protection resources. Attention should then also be paid to avoid a single point of failure in such a design.
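The bookkeeping behind shared (1:N and M:N) protection with extra traffic can be sketched in a few lines. This is an illustrative model only, not an APS protocol implementation: the class name, the method names, and the policy of preferring a spare path without extra traffic are assumptions made for this example.

```python
class ProtectionGroup:
    """M spare paths shared by N working paths (hypothetical model)."""

    def __init__(self, working_ids, spare_ids):
        self.working_ids = set(working_ids)
        self.free_spares = list(spare_ids)
        self.extra_traffic = {}  # spare id -> label of the extra traffic
        self.assignment = {}     # failed working id -> spare id in use

    def carry_extra_traffic(self, spare_id, label):
        """Use an idle spare path for low-priority extra traffic."""
        if spare_id not in self.free_spares:
            raise ValueError(f"{spare_id} is not an idle spare path")
        self.extra_traffic[spare_id] = label

    def fail(self, working_id):
        """Switch a failed working path to a spare path.

        Returns (spare_id, preempted_extra_traffic); spare_id is None
        when all spare paths are already in use.
        """
        if working_id not in self.working_ids:
            raise KeyError(working_id)
        if working_id in self.assignment:
            return self.assignment[working_id], None
        if not self.free_spares:
            return None, None
        # Assumed policy: prefer a spare that carries no extra traffic.
        self.free_spares.sort(key=lambda spare: spare in self.extra_traffic)
        spare = self.free_spares.pop(0)
        preempted = self.extra_traffic.pop(spare, None)
        self.assignment[working_id] = spare
        return spare, preempted

    def repair(self, working_id):
        """Revert to the repaired working path and release its spare."""
        self.free_spares.append(self.assignment.pop(working_id))


# 1:3 protection: one spare path shared by three working paths.
group = ProtectionGroup(["w1", "w2", "w3"], ["s1"])
group.carry_extra_traffic("s1", "best-effort")
first = group.fail("w2")   # takes s1 and preempts the extra traffic
second = group.fail("w3")  # no spare left: this failure is not recovered
group.repair("w2")
third = group.fail("w3")   # s1 is free again
print(first, second, third)
```

The second failure shows the price of sharing: with a single spare path, a second simultaneous failure in the group cannot be recovered until the first one is repaired.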
29 Node disjoint means that the working path has no nodes in common with the backup path, apart from the source and destination nodes. This condition is more restrictive than link disjoint.
3.6.2
Protection in a WP Network versus Protection in a VWP Network
In Section 3.1.4, different types of OXCs were discussed. A first type of OXC, called wavelength routing OXC (WR-OXC), is not able to perform wavelength conversion. A network with this type of OXC installed is a WP network, where the lightpath between source and destination OXC has to be conveyed using the same wavelength channel on all links along the path. A further distinction can be made between a WR-OXC with and without wavelength tunability at the transmitter and receiver. With a WR-OXC without tunability, the working path and the recovery path have to use the same wavelength. A WR-OXC with tunability does not have
Figure 3.34 Comparison of the cost for a 16-node network (left) and a 32-node network (right) between path protection with wavelength routing optical cross-connects (OXCs) with tunability (wavelength path [WP] protection + tunability) and without tunability (WP protection), and with wavelength translating OXCs (virtual WP [VWP] protection). (P. Arijs, B. Van Caenegem, P. Demeester, P. Lagasse, W. Van Parys, P. Achten, "Design of ring and mesh based WDM transport networks," Optical Networks Magazine, Vol. 1, No. 2, July 2000, pp. 25–40.)
this restriction, enabling the use of a different wavelength for the working and the recovery path. A second type of OXC, the wavelength translating OXC (WT-OXC), can perform wavelength conversion and leads to a VWP network, where the wavelength continuity constraint along a lightpath no longer has to be met. Figure 3.34 illustrates the typical influence on the total network cost of using WR-OXCs with or without tunability or using WT-OXCs, using the results obtained for two sample networks with a static traffic demand. From Figure 3.34, it is clear that the cost of WP and VWP path protection differs only slightly. The WP network requires somewhat more wavelengths than the VWP network to resolve wavelength conflicts: 5% for the 32-node network, 15% for the 16-node network, because in the latter case fewer fibers per link are needed, making the wavelength assignment problem more difficult. With wavelength tunability, the difference in required wavelength channels between WP and VWP is less than 5% for both network sizes.
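The wavelength continuity constraint that separates a WP network from a VWP network can be made concrete with a small assignment routine. This is a minimal sketch under stated assumptions: the function name, the link labels, and the first-fit choice (lowest free wavelength index) are illustrative and are not taken from the cited study.

```python
def assign_wavelengths(path_links, free, conversion):
    """Return {link: wavelength} for a lightpath, or None if it is blocked.

    free maps each link to the set of wavelength indices still available.
    conversion=False models a WP network (the same wavelength on every
    link); conversion=True models a VWP network (any free wavelength on
    each link). First-fit selection is an assumption of this sketch.
    """
    if conversion:
        if any(not free[link] for link in path_links):
            return None
        return {link: min(free[link]) for link in path_links}

    # WP network: a wavelength must be free on all links of the path.
    common = set.intersection(*(set(free[link]) for link in path_links))
    if not common:
        return None
    chosen = min(common)
    return {link: chosen for link in path_links}


free = {"A-B": {0, 1}, "B-C": {1, 2}, "C-D": {2}}
path = ["A-B", "B-C", "C-D"]

print(assign_wavelengths(path, free, conversion=False))  # None (blocked)
print(assign_wavelengths(path, free, conversion=True))
# {'A-B': 0, 'B-C': 1, 'C-D': 2}
```

Every link on the path has a free wavelength, yet the WP lightpath is blocked because no single wavelength is free end to end. Such wavelength conflicts are the reason a WP network needs somewhat more wavelengths than a VWP network.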
3.6.3
Restoration
Until recently, 1+1 or 1:1 dedicated protection was the only realistic choice for a network operator to make a meshed OTN resilient against failures. With 1+1 dedicated protection, everything is calculated before the failure occurs: the route of the protection path and the wavelength assignment in the case of transparent optical networking. In addition, the cross-connects on the backup route are
switched beforehand. The routing and wavelength assignment problem can, thus, be solved off-line. With the implementation of an IP-based optical control plane (see Chapter 6, Section 6.1), however, restoration can become a real option for providing resilience in the optical backbone network. Just as in SONET/SDH, restoration schemes in an OTN will undoubtedly be superior in terms of capacity efficiency compared to protection schemes, but the implementation of a mesh restoration scheme is quite complex and requires sophisticated algorithms. Restoration is usually also slower than protection. With restoration, capacity in excess of the working capacity needed to support the normal working traffic is provided in the network. This spare capacity, which is shared among the various working connections, will be used to recover from failures. One must, however, keep in mind that restoration schemes in meshed optical networks are not (yet) standardized. If used today, they are based on proprietary schemes. Several options for restoration in a meshed optical network can be envisaged. The choice can be made between a link-based scheme (link restoration, working at the OMS layer) and a path-based scheme (path restoration, working at the ODU layer). This choice has a rather large influence on the recovery implementation and requirements. In the case study explained in Section 3.6.4, we will see that link restoration requires typically more capacity than path restoration, because of the often suboptimal routes found with link restoration (e.g., because of back-hauling) and because path restoration has a larger view on the network. Path restoration will typically distribute the backup routes of the affected connections over a larger part of the network than link restoration, allowing more opportunities to optimize the spare capacity needed in the network. The recovery extent also has an influence on the complexity of the recovery scheme. 
With path restoration, a restoration path has to be found for each affected working lightpath, whereas with link restoration the affected lightpaths are switched per multiplexed bundle and only a single restoration path per OMS has to be found. Thus, with link restoration the route computation process is easier and the number of required switching actions is limited compared to path restoration (often resulting in a lower recovery time), but the capacity efficiency is lower. An example of a shared restoration scheme [Lab02] in a meshed optical network is illustrated in Figure 3.35. In this figure, two working paths between different OXCs are depicted (solid line). They each have a protection path that is node disjoint with the working path they protect. However, both backup paths are routed on a common link, on which they can share an optical channel reducing the capacity needed in the network to protect against single link or node failures. This is indeed a restoration scheme because the OXCs at the end of the link that is shared by both backup paths need to be configured according to the failure that has happened. For instance, on the right-hand side of Figure 3.35, when working path 2 gets interrupted, both OXCs must be configured for backup path 2. Another classification basis is the route computation moment. The restoration route can be calculated before (preplanned restoration) or after the failure occurs (dynamic restoration). The same applies to the wavelength assignment. However, the actual switching action in the OXCs can be performed only after the failure has
Figure 3.35 A sample path restoration scheme. (Left: back-up paths 1 and 2 share a channel on a common link; right: the shared channel is used by back-up path 2.)
occurred, as only then the shared extra capacity is available to recover the affected connection(s). With preplanned restoration, the network will usually recover faster from a failure (no time needed for route calculation). All nodes will contain cross-connection maps, which indicate the cross-connection that is required, ensuring fast local actions in the OXCs. The preplanned recovery process is simpler and the restoration routes used during the recovery process are the ones that were envisaged to be used. With dynamic restoration, the restoration route resulting from the calculation scheme may be different from the one that was envisaged during the network design. As a consequence, the spare capacity provided in the network is not used in the foreseen way, and some failures may not be recovered, although sufficient spare capacity is provided in the network overall. On the other hand, dynamic restoration will be able to react to unexpected failures, which is not the case with preplanned restoration. The example shown in Figure 3.35 is a path restoration scheme with precalculated backup paths. The wavelengths may or may not be preassigned. Table 3.6 gives an overview of the types of restoration schemes and compares them with protection. A similar classification can be found in [Eli03], where even more restoration scheme variations are distinguished. The route computation in dynamic restoration can be done centrally or distributed. In the former case the central route computation entity has to have a full overview of the network topology and state (e.g., link utilization). In the latter case
Table 3.6 Comparison of Characteristics of Protection and Different Types of Restoration
                                        Restoration                                    Protection
Backup Route Calculation                Preplanned      Preplanned      Dynamic        Preplanned
Wavelength Assignment on Backup Route   Preplanned      Dynamic         Dynamic        Preplanned
Cross-Connection on Backup Route        After failure   After failure   After failure  Before failure
the network nodes have typically only local information at their disposal. The exact performance of the restoration scheme in terms of capacity depends on this choice. In [Ell03], implementing the path restoration scheme of Figure 3.35 in a distributed manner resulted in a 10% to 15% capacity increase compared to the centralized case. Also a mesh-restorable network can accommodate extra traffic in the unused spare capacity it has. This implies that a preemption protocol is needed, which may slow the recovery process. An overview of restoration algorithms can, for instance, be found in [Gro04].
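The capacity gain obtained by sharing backup channels, as in the example of Figure 3.35, can be estimated with a short calculation. The sketch below rests on assumptions that are not in the text: it only considers single link failures (node failures are ignored), and the link labels and connection data are hypothetical. A backup channel can be shared on a link whenever the backup paths that use it are never activated by the same failure.

```python
from collections import Counter


def spare_channels_per_link(connections):
    """Spare channels needed per link to survive any single link failure.

    connections is a list of (working_links, backup_links) pairs. For
    every possible failed link, the backup paths of all affected
    connections are activated; the requirement on a link is the maximum
    number of backup paths it carries over all failure scenarios.
    """
    needed = Counter()
    all_working = {link for working, _ in connections for link in working}
    for failed in all_working:
        load = Counter()
        for working, backup in connections:
            if failed in working:
                load.update(backup)
        for link, count in load.items():
            needed[link] = max(needed[link], count)
    return needed


# Two connections with disjoint working paths whose backup paths both
# cross link E-F (hypothetical data).
connections = [
    (["A-B"], ["A-E", "E-F", "F-B"]),
    (["C-D"], ["C-E", "E-F", "F-D"]),
]

shared = spare_channels_per_link(connections)
dedicated = Counter()
for _, backup in connections:
    dedicated.update(backup)

print("shared:   ", dict(shared), "total", sum(shared.values()))
print("dedicated:", dict(dedicated), "total", sum(dedicated.values()))
```

With sharing, one spare channel suffices on link E-F because the two working paths cannot be hit by the same single link failure. A dedicated scheme would reserve two channels on that link.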
3.6.4
Protection versus Restoration
As with a SONET/SDH network, applying a protection scheme in a meshed network requires much more installed capacity (wavelengths) in the network than applying a restoration scheme. In Figure 3.36 this comparison has been quantified for a European-size network with 28 nodes connected by 41 links in a biconnected30 mesh topology. The total fiber length of the links is 25,640 km, and the average node degree (number of adjacent links that connect a node to adjacent nodes) of the network is 2.93.
Figure 3.36 clearly illustrates that 1+1 dedicated protection requires the most wavelengths in the network. In fact, typically more than 50% of the capacity installed in the network with a dedicated protection scheme is spare wavelength capacity. This is because in almost all cases the backup path is longer than the working path, and thus uses more wavelengths. In this example, 60% of the required wavelength capacity is used for the protection paths and 40% for the working paths. It makes only a very small difference whether the working path and the backup path are required to be link disjoint or node disjoint. In the latter case somewhat more capacity is needed because the network is protected against both single link and single node failures, whereas in the former case only recovery from single link failures can be guaranteed.
Restoration lowers the required amount of spare capacity in the network. With path restoration, in which a backup path is calculated between the endpoints of the affected working path, less spare capacity is needed in the network than with link restoration, in which a local backup path around the failed link from the working path is established. Using path restoration the ratio between spare capacity and total capacity is around 40%. Restoration is thus more capacity efficient than protection, but the time needed to complete the recovery process is typically much longer with restoration than with protection.
The recovery time with restoration lies between hundreds of milliseconds and tens of minutes. With protection, this is limited to tens of milliseconds. As explained in Chapter 1, a recovery scheme is chosen to recover from the so-called expected or accounted failures. Not all network failures are common
30 Biconnected means that between each node pair two disjoint paths can be found in the network.
Figure 3.36 Wavelength usage for different recovery schemes in a mesh optical network. Top: pan-European topology with the nodes Oslo, Stockholm, Glasgow, Copenhagen, Dublin, Warsaw, London, Amsterdam, Hamburg, Berlin, Brussels, Frankfurt, Prague, Paris, Straatsburg, Munich, Budapest, Zurich, Vienna, Lyon, Belgrade, Milan, Bordeaux, Zagreb, Barcelona, Rome, Madrid, and Athens. Bottom: number of required wavelengths (working path and back-up path) for no protection, 1+1 path protection (link disjoint), 1+1 path protection (node disjoint), path restoration, and link restoration. (Top: Adapted from S. De Maesschalck, et al. "Pan-European optical transport networks: an availability based comparison," Photonic Network Communication, vol. 5, no. 3, May 2003, pp. 203–225.)
enough to justify the use of (often capacity hungry) recovery schemes to recover from them (e.g., triple or quadruple network failures). In most cases the applied recovery scheme is dimensioned to recover from single link and/or node failures. A restoration scheme, however, shows more flexibility than a protection scheme in dealing with unexpected failures. With 1+1 dedicated protection, the traffic cannot reach its destination when a double failure affects both the working path and the protection path. With restoration, there is often more than one option for the backup path, making it possible to recover from unexpected double failures.
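The spare-capacity ratios quoted in this section can be turned into a rough comparison of the total installed capacity. The derivation below is a sketch that assumes both schemes carry the same working capacity W, which is an assumption made here for illustration and not a statement from the case study. The symbols S_p and S_r denote the spare capacity with 1+1 protection and with path restoration, respectively.

```latex
% Assumption: both schemes carry the same working capacity W.
\begin{align*}
\text{1+1 protection:}\quad
  & \frac{S_p}{W + S_p} = 0.6 \;\Rightarrow\; S_p = 1.5\,W,
    \qquad W + S_p = 2.5\,W \\
\text{Path restoration:}\quad
  & \frac{S_r}{W + S_r} = 0.4 \;\Rightarrow\; S_r = \tfrac{2}{3}\,W,
    \qquad W + S_r = \tfrac{5}{3}\,W \approx 1.67\,W \\
\text{Ratio:}\quad
  & \frac{W + S_r}{W + S_p} = \frac{5/3}{5/2} = \frac{2}{3}
\end{align*}
```

Under this assumption, path restoration needs roughly one third less total wavelength capacity than 1+1 dedicated protection.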
3.6.5
Protection Combined with Restoration
Instead of making a single choice between applying a restoration scheme or a protection scheme in an OTN, both schemes can be used simultaneously. For example, one can distinguish between different classes of traffic with different transport requirements. One type of traffic could be high-priority traffic, which should be recovered very quickly. Another traffic type could be normal priority traffic, for which the recovery times are not that stringent. The latter traffic type could be recovered using, for example, a restoration scheme, and the former traffic type could be recovered using 1+1 protection. Another more exotic way of combining a restoration and a protection scheme is to resolve failures that cannot be recovered using protection by a restoration scheme. For instance, with a double failure that affects both the working path and the backup path of a 1+1 protected connection, restoration could be used for recovery.
3.7 Ring-Based versus Mesh-Based Recovery Schemes
Sections 3.5 and 3.6 have explained in some detail the protection and restoration schemes that can be applied in ring-based and mesh-based optical networks. In this section, we compare these options. Figure 3.37 illustrates the difference in cost between protection and restoration in a mesh-based topology and the OCh-DPRings strategy in a topology that consists of interconnected rings, for a network with 32 nodes (the same as the one used to obtain the results of Figure 3.34). Both the link cost and node cost are considered.
Figure 3.37 shows that the OCh-DPRings strategy using D&C results in the highest link cost. The difference between the interconnected ring design with and without D&C is around 5%. This is because with D&C, some connections have to take a longer route than without D&C to make sure that all rings are interconnected by two nodes. The link cost with 1+1 dedicated mesh protection is about 19% lower than with interconnected OCh-DPRings.
From Figure 3.37, it is clear that the node cost is significantly lower for the ring-based schemes than for the mesh-based schemes because of the relatively low price of OADMs compared to the expensive OXCs used in meshed networks. When D&C is used, the node cost is 10% to 15% higher, because the traffic between two rings is exchanged at two nodes to improve the availability. The 1+1 dedicated protection option in a meshed network is the most expensive in terms of node cost.
From Figure 3.37, we can draw a number of general conclusions. Table 3.7 summarizes the pros and cons of all recovery schemes and gives a qualitative comparison. The conclusions reached are similar to those obtained in Chapter 2 on SONET/SDH networks. The link cost is higher for dedicated schemes than for shared schemes, because with the former more capacity is needed because the spare capacity cannot be shared by several working connections. We have seen that link restoration needs
Figure 3.37 Comparison of the cost for a 32-node network between ring-based and mesh-based recovery strategies. Bar chart of the cost (node cost and link cost) for mesh protection, link restoration, path restoration, OCh-DPRing, and OCh-DPRing + D&C. (P. Arijs, B. Van Caenegem, P. Demeester, P. Lagasse, W. Van Parys, P. Achten, "Design of ring and mesh based WDM transport networks," Optical Networks Magazine, Vol. 1, No. 2, July 2000, pp. 25–40.)
Table 3.7 Qualitative Comparison between the Various Ring- and Mesh-Based Recovery Schemes
Scheme                       Link Cost   Node Cost   Management Cost   Flexibility   Availability   Recovery Time
Dedicated Protection Rings   Higher      Lowest      Low               High          Mid/low        Fast
Shared Protection Rings      Low         Lowest      Mid               High          Lower          Fast
Mesh Path Protection         High        High        Low               Mid           Mid            Fast
Mesh Link Protection         Highest     High/mid    Low               Mid           Mid            Fast
Mesh Path Restoration        Lowest      Mid/low     Higher            Mid/high      High           Slowest
Mesh Link Restoration        Low         Mid         Higher            Mid/high      High           Slowest
Source: From P. Arijs, B. Van Caenegem, P. Demeester, P. Lagasse, W. Van Parys, P. Achten, ‘‘Design of ring and mesh based WDM transport networks,’’ Optical Networks Magazine, Vol. 1, No. 2, July 2000, pp. 25–40.
typically more capacity than path restoration. With path restoration the spare capacity is more balanced over the network because this is a global approach. On the other hand, link restoration is a local scheme. The capacity needed with DPRings lies typically in between: With DPRings, a protection path is established within each ring that the traffic traverses (between a local and a global approach). Restoration schemes typically need much less capacity than protection schemes. Again restoration in a mesh network is more efficient than shared protection because of the global extent.
In ring networks the nodes are OADMs, whereas in mesh networks OXCs are needed. OADMs are typically less expensive than OXCs. The hardware implementation of the APS protocol is simpler than that of a complex restoration scheme. Therefore, the node cost will be lower with ring-based schemes. However, in interconnected ring networks, complex and expensive OXCs may be used, increasing the node cost of ring networks, even more when D&C schemes are implemented. In mesh networks deploying a restoration scheme, less spare capacity has to be switched in the OXCs, lowering the node cost compared to networks using a capacity-hungry protection scheme.
The management cost of the different recovery schemes can be more or less estimated by the amount of signaling needed. The 1+1 unidirectional protection switching scheme requires no signaling protocol, which leads to a low management cost. All other protection schemes need an APS protocol implementation, increasing the management cost. Restoration schemes require quite complicated and thus expensive (distributed or centralized) signaling protocols. With protection, the backup path is fixed, whereas with restoration several backup paths may be possible. The chosen back-up path then has to be set up by configuring all the OXCs along the restoration path.
In a shared protection scheme, switching has to be performed only in the two OADMs involved in the protection scheme.
Flexibility encompasses the ability of the recovery scheme to cope with unexpected or unaccounted-for failures, or unpredicted traffic patterns. One of the advantages of restoration schemes is that they can more easily accommodate churn by allocating unused network capacity to the restoration process. This is impossible with a protection scheme in a ring network because 50% of the capacity is always allocated for recovery purposes. In a ring network capacity has to be added along the entire ring, or a completely new ring has to be added. In a mesh network the capacity extension can take place gradually. The flexibility of restoration schemes is also discussed in Section 3.8.2.
Network availability was introduced in Chapter 1. It reflects the portion of time the network is operational. The availability of the different recovery schemes is discussed in more detail in Section 3.8.2. However, a few general statements can be made here. With ring networks, multiple failures occurring in different rings of the network can be recovered simultaneously. Ring interconnection points are protected with D&C. Recovery schemes in mesh networks typically offer recovery only for the expected or accounted-for failures (typically one, at most two simultaneous failures). With mesh path protection schemes, if the working and dedicated protection paths are affected simultaneously, there is no recovery possible. Mesh restoration schemes offer a higher flexibility, but this depends heavily on the available spare resources in the network (see also Section 3.8.2). Ring networks, thus, typically offer a higher availability than mesh networks.
The protection switching in ring- and path-based protection schemes should take place in less than 50 ms, just as for SONET/SDH. Restoration schemes will typically require more time (hundreds of milliseconds to tens of minutes).
3.8 Availability
Availability is an important performance assessment factor of recovery schemes. We start this section with a general overview of terms and definitions used when performing availability calculations. Next, the availability of an unprotected and protected connection is discussed, which allows calculating the expected loss of traffic (ELT). Also the availability when a restoration scheme is deployed is discussed. Finally, some factors influencing the availability performance, such as the average node degree of the network topology or the characteristics of the transported traffic type, are studied.
3.8.1 Availability Calculations
The term availability was introduced in Chapter 1. As explained there, the availability, A, of an item can be expressed using its mean time to repair (MTTR) (the time needed for the restoration of the item) and the mean time between failures (MTBF) (the time between consecutive failures of the item):
$$A = 1 - \frac{\mathrm{MTTR}}{\mathrm{MTBF}} \tag{3.1}$$
Of course the unavailability, U, of an item can then be expressed as
$$U = 1 - A \tag{3.2}$$
For a complex system such as a telecommunications network, availability is quite difficult to define and evaluate. In the literature several definitions of network availability have been presented. In this chapter, we use a straightforward method. Line and node failures are assumed to be statistically independent.
Optical Node Failures
Network elements such as OXCs and OADMs are composed of a (large) set of different pieces of equipment, each with its own MTBF and MTTR. For more details on the calculation of the overall MTBF of optical node equipment, we refer to [G911]. The MTBF of node equipment is usually expressed in hours or using the metric failures in time (FITs) (the number of failures in $10^9$ hours, or roughly 114,155 years). The MTTR is expressed as an amount of time units (hours).
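As a rough illustration of Equations 3.1 and 3.2 and of the FIT metric, the following Python sketch converts between FITs and MTBF and computes the availability of a single item. The function names are illustrative and do not come from the text; the OXC figures (MTBF of $10^5$ hours, MTTR of 6 hours) are the ones listed in Table 3.8.

```python
HOURS_PER_FIT_WINDOW = 1e9  # FITs count failures per 10^9 hours


def fit_to_mtbf_hours(fit):
    """Convert a failure rate in FITs to an MTBF in hours."""
    return HOURS_PER_FIT_WINDOW / fit


def mtbf_hours_to_fit(mtbf_hours):
    """Convert an MTBF in hours to a failure rate in FITs."""
    return HOURS_PER_FIT_WINDOW / mtbf_hours


def availability(mtbf_hours, mttr_hours):
    """Equation 3.1: A = 1 - MTTR / MTBF."""
    return 1.0 - mttr_hours / mtbf_hours


def unavailability(mtbf_hours, mttr_hours):
    """Equation 3.2: U = 1 - A."""
    return 1.0 - availability(mtbf_hours, mttr_hours)


# OXC values from Table 3.8: MTBF = 1e5 hours, MTTR = 6 hours.
oxc_fit = mtbf_hours_to_fit(1e5)
oxc_availability = availability(1e5, 6)
print(oxc_fit, round(oxc_availability, 5))
```

An MTBF of $10^5$ hours corresponds to 10,000 FITs, and the resulting OXC availability is the value 0.99994 that is used later in Equation 3.9.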
Line Failures
Line failures can, for instance, be caused by a failure in the fiberoptic cable or the failure of an OA or WDM line system. An assumption often made is that fiber failures within a single fiberoptic cable are completely dependent, because most failures are caused by dig-ups, affecting all fibers within the cable. For physical cables the MTBF can be specified using the cable cut (CC) metric. This is the average cable length that results in a single cable cut per year (e.g., CC = 450 km means that per 450 km of cable, there will be on average one cable cut each year). This expresses the fact that the probability of a cable cut is larger for a longer link. The MTBF of the cable is then calculated as
$$\mathrm{MTBF}\ (\text{hours}) = \frac{\mathrm{CC} \cdot 365 \cdot 24}{\text{length of the cable}} \tag{3.3}$$
In addition, the metric FITs/km (average number of failures in $10^9$ hours per km) can be used to denote the MTBF of a cable. The MTTR for a cable is usually expressed as an amount of time units (hours). It includes the time needed to localize the fault, access the cable, repair the break, and put the cable back into service (transmission quality testing, etc.). The MTTR of an undersea cable is typically much longer than that of a terrestrial cable, because of the extra time needed to dispatch a cable ship and crew to do the repair, and the more complicated cable recovery, repair, and replacement. The MTBF of an OA and a WDM line system can again be expressed in hours or in FITs. The MTTR is expressed as an amount of time units (hours). Table 3.8 [Wil01] gives an idea of the typical MTBF and MTTR of important optical network equipment. A single (bidirectional) line, connecting two optical nodes, is made up of a series of items, namely pieces of physical cable, a number of OAs (how many depends on the line length and the spacing distance between the OAs), and a line system at each side of the line.
Table 3.8 MTTR and MTBF Values for Fiberoptic Cable, OAs and WDM Line Systems
Equipment                        MTBF (hours)   MTTR (hours)
Bidirectional OA                 5 * 10^5       24
Bidirectional WDM Line System    5 * 10^5       6
OXC                              1 * 10^5       6
OADM                             1 * 10^5       6
Equipment                        CC (km)        MTTR (hours)
Terrestrial Fiberoptic Cable     450            24
A series of items is available if all individual items are available. If we assume that the items fail statistically independently, we can express its availability as
$$\begin{aligned}
A(\text{series } \text{item}_1, \text{item}_2, \ldots, \text{item}_N) &= P\big((\text{item}_1 = \text{av}) \text{ and } (\text{item}_2 = \text{av}) \text{ and } \ldots \text{ and } (\text{item}_N = \text{av})\big) \\
&= \prod_i P(\text{item}_i = \text{av}) \\
&= \prod_i A_i \\
&= \prod_i [1 - U_i]
\end{aligned} \tag{3.4}$$
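where $P(\text{item}_1 = \text{av})$ stands for the probability that item 1 is available. The availability of a bidirectional line, $A_{\text{line}}$, connecting node $n_i$ and $n_j$ can, thus, be expressed as
$$A_{\text{line}} = A_{\text{cable}} \cdot A_{\text{OA}}^{N} \cdot A_{\text{linesystem}}^{2} \tag{3.5}$$
where
• $A_{\text{cable}}$ is the availability of the cable between $n_i$ and $n_j$,
• $A_{\text{OA}}$ is the availability of a bidirectional OA (if one direction of the OA fails, the other direction also goes immediately out of service),
• $N$ is the number of bidirectional OAs needed on this line (depends on the length of the line),
• $A_{\text{linesystem}}$ is the availability of a bidirectional line system (if one direction of the line system fails, the other direction goes immediately out of service).
The availability of the bidirectional line depicted in Figure 3.38 can, thus, be calculated as
$$\mathrm{MTBF}_{\text{cable}} = \frac{450\ \text{km} \cdot 365 \cdot 24\ \text{h}}{260\ \text{km}} = 15161.5\ \text{h} \tag{3.6}$$
$$\begin{aligned}
A_{\text{line}} &= A_{\text{cable}} \cdot A_{\text{OA}}^{N} \cdot A_{\text{linesystem}}^{2} \\
&= \left(1 - \frac{\mathrm{MTTR}_{\text{cable}}}{\mathrm{MTBF}_{\text{cable}}}\right) \cdot \left(1 - \frac{\mathrm{MTTR}_{\text{OA}}}{\mathrm{MTBF}_{\text{OA}}}\right)^{2} \cdot \left(1 - \frac{\mathrm{MTTR}_{\text{linesystem}}}{\mathrm{MTBF}_{\text{linesystem}}}\right)^{2} \\
&= \left(1 - \frac{24\ \text{h}}{15161.5\ \text{h}}\right) \cdot \left(1 - \frac{24\ \text{h}}{5 \cdot 10^{5}\ \text{h}}\right)^{2} \cdot \left(1 - \frac{6\ \text{h}}{5 \cdot 10^{5}\ \text{h}}\right)^{2} \\
&= 0.998417 \cdot 0.999904 \cdot 0.999976 = 0.998297
\end{aligned} \tag{3.7}$$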
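The worked example of Equations 3.3 and 3.5 through 3.7 can be reproduced with a short Python sketch. The function names and default arguments are illustrative; the defaults simply encode the Table 3.8 values, and the line of Figure 3.38 is taken to be 260 km long with two OAs, as in the calculation above.

```python
def availability(mtbf_hours, mttr_hours):
    """Equation 3.1: A = 1 - MTTR / MTBF."""
    return 1.0 - mttr_hours / mtbf_hours


def cable_mtbf_hours(cc_km, length_km):
    """Equation 3.3: MTBF of a cable from the cable cut (CC) metric."""
    return cc_km * 365 * 24 / length_km


def line_availability(length_km, n_oa, cc_km=450.0, mttr_cable=24.0,
                      mtbf_oa=5e5, mttr_oa=24.0,
                      mtbf_linesystem=5e5, mttr_linesystem=6.0):
    """Equation 3.5: cable, N amplifiers and two line systems in series."""
    a_cable = availability(cable_mtbf_hours(cc_km, length_km), mttr_cable)
    a_oa = availability(mtbf_oa, mttr_oa)
    a_linesystem = availability(mtbf_linesystem, mttr_linesystem)
    return a_cable * a_oa ** n_oa * a_linesystem ** 2


# Line of Figure 3.38: 80 km + 100 km + 80 km of cable and two OAs.
a_line = line_availability(length_km=260, n_oa=2)
print(f"{cable_mtbf_hours(450, 260):.1f} h, A_line = {a_line:.6f}")
```

Changing `length_km` or `n_oa` shows how quickly the cable term dominates: with these values the cable alone accounts for most of the line unavailability.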
Figure 3.38 Example of a bidirectional line. [Figure: WDM line system, 80 km of cable, OA, 100 km of cable, OA, 80 km of cable, WDM line system.]
Availability of Connections and Load
Once we have the MTBF and the MTTR of the nodes and lines of the network, the availability of the connections can be calculated. Because a connection is assumed to be bidirectional, a connection is available only if both directions of this connection are available. Based on the availability of the individual connections, the availability of the total traffic load can be calculated. This also allows for the calculation of the expected loss of traffic (ELT),³¹ which is the total amount of traffic that the network is expected to lose every year because of failures [Ver95], and of the average ELT (AELT) per channel. The calculation of the availability of a connection depends on the applied recovery technique (1+1 protection, link or path restoration, etc., or no protection).
Unprotected Connection
An unprotected connection is routed over a series of nodes and lines and is thus available if all nodes and lines along the route of this connection are available. Equation 3.4 can thus be applied. Consider the example in Figure 3.39. The connection between node A and node B is available if node A, link A-B, and node B are available. Equation 3.4 thus leads to
$$\begin{aligned}
A(\text{connection A-B}) &= P\big((\text{node A} = \text{av}) \text{ and } (\text{link A-B} = \text{av}) \text{ and } (\text{node B} = \text{av})\big) \\
&= A(\text{node A}) \cdot A(\text{link A-B}) \cdot A(\text{node B})
\end{aligned} \tag{3.8}$$
If link A-B corresponds to the link depicted in Figure 3.38, the availability of the unprotected connection A-B is
$$\begin{aligned}
A(\text{connection A-B}) &= \left(1 - \frac{\mathrm{MTTR}_{\text{OXC A}}}{\mathrm{MTBF}_{\text{OXC A}}}\right) \cdot 0.998297 \cdot \left(1 - \frac{\mathrm{MTTR}_{\text{OXC B}}}{\mathrm{MTBF}_{\text{OXC B}}}\right) \\
&= 0.99994 \cdot 0.998297 \cdot 0.99994 \\
&= 0.998177
\end{aligned} \tag{3.9}$$
³¹ Sometimes the term total expected loss of traffic (TELT) is used instead of expected loss of traffic.
Figure 3.39 Availability of an unprotected connection. [Figure: nodes A, B, and C; the connection runs between A and B.] (Adapted from S. De Maesschalck, et al. ‘‘Pan-European optical transport networks: an availability based comparison,’’ Photonic Network Communication, vol. 5, no. 3, May 2003, pp. 203–225.)
Figure 3.40 Availability of a protected connection. [Figure: nodes A, B, and C; working path from A to B, protection path from A via C to B.] (Adapted from S. De Maesschalck, et al. ‘‘Pan-European optical transport networks: an availability based comparison,’’ Photonic Network Communication, vol. 5, no. 3, May 2003, pp. 203–225.)
Protected Connection
A protected connection is available if the working path or the protection path of the connection is available. Equation 3.4 can no longer be applied, because we now have two series of items placed in parallel: the working path and the protection path. This is illustrated in Figure 3.40. The working path of the connection from node A to node B follows the route from node A, to link A-B, to node B. The protection path, placed in parallel with the working path, is from node A, to link A-C, to node C, to link C-B, to node B.
The availability and unavailability of items placed in parallel can be calculated as
$$\begin{aligned}
A(\text{parallel } \text{item}_1, \text{item}_2, \ldots, \text{item}_N) &= 1 - U(\text{parallel } \text{item}_1, \text{item}_2, \ldots, \text{item}_N) \\
&= 1 - P\big((\text{item}_1 = \text{unav}) \text{ and } (\text{item}_2 = \text{unav}) \text{ and } \ldots \text{ and } (\text{item}_N = \text{unav})\big) \\
&= 1 - \prod_i P(\text{item}_i = \text{unav}) \\
&= 1 - \prod_i U_i \\
U(\text{parallel } \text{item}_1, \text{item}_2, \ldots, \text{item}_N) &= \prod_i U_i
\end{aligned} \tag{3.10}$$
As can be seen in Figure 3.40, the working and protection paths have the source and destination nodes (nodes A and B) of the protected connection (connection between A and B) in common. This means that the availability of such a protected connection must be expressed as
$$\begin{aligned}
A(\text{protected connection}) &= A(\text{source node}) \cdot A(\text{parallel paths}) \cdot A(\text{destination node}) \\
&= A(\text{source node}) \cdot \big(1 - U(\text{working}'\ \text{path}) \cdot U(\text{protection}'\ \text{path})\big) \cdot A(\text{destination node})
\end{aligned} \tag{3.11}$$
where working′ path and protection′ path are the working and the protection path without the source and destination node of the considered connection.
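The series and parallel rules of Equations 3.4, 3.10, and 3.11 can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are not from the text, and every link of Figure 3.40 (A-B, A-C, and C-B) is assumed to have the availability of the line of Figure 3.38, which the text only states for link A-B.

```python
from math import prod


def series(availabilities):
    """Equation 3.4: all items must be available."""
    return prod(availabilities)


def parallel(availabilities):
    """Equation 3.10: at least one item must be available."""
    return 1.0 - prod(1.0 - a for a in availabilities)


A_NODE = 0.99994   # OXC availability (Equation 3.9)
A_LINK = 0.998297  # line of Figure 3.38; assumed here for every link

# Unprotected connection A-B (Equation 3.8).
unprotected = series([A_NODE, A_LINK, A_NODE])

# Protected connection A-B (Equation 3.11): the end nodes are shared,
# so only the inner parts of both paths are placed in parallel.
working_inner = series([A_LINK])                     # link A-B
protection_inner = series([A_LINK, A_NODE, A_LINK])  # A-C, node C, C-B
protected = A_NODE * parallel([working_inner, protection_inner]) * A_NODE

print(f"unprotected = {unprotected:.6f}, protected = {protected:.6f}")
```

Keeping the shared end nodes outside `parallel` matters: placing the complete paths in parallel would count nodes A and B twice and overestimate the availability.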
Restored Connection
Calculating the availability of connections using a restoration mechanism (link or path restoration) to cope with failures is more complex. The recovery path is now no longer uniquely defined per working path but depends on the failure. Calculating the availability of a single restored connection is not as straightforward as for an unprotected or 1+1 protected connection. Therefore, instead of calculating the availability of a single connection, the availability of the total traffic load is often calculated. Let $\mathrm{CapCon}_i$ denote the capacity of connection i. The availability of the traffic load can then be expressed as
$$A(\text{Load}) = \frac{\sum_i \big(A(\text{connection } i) \cdot \mathrm{CapCon}_i\big)}{\sum_i \mathrm{CapCon}_i} \tag{3.12}$$
This definition, of course, cannot be applied to restored connections because we do not know the availability of a single restored connection. Therefore, the method usually applied for availability calculations taking into account a restoration mechanism is as follows: The probability that a certain failure scenario occurs is determined and the percentage of the traffic that was not affected or can be restored is calculated; this is repeated for all possible failure scenarios. The availability of a load under restoration is then obtained by
$$A(\text{load}) = 1 - \sum_x \left( \mathrm{Prob}(\text{failure scenario } x) \cdot \left(1 - \frac{\mathrm{CapRecovered}_x}{\mathrm{CapTotal}}\right) \right) \tag{3.13}$$
where Prob(failure scenario x) is the probability that failure scenario x occurs, $\mathrm{CapRecovered}_x$ is the total capacity of the recovered connections for failure scenario x, and CapTotal is the total capacity of the complete traffic matrix. Of course, the number of possible failure scenarios grows fast with a growing network size, so in general the number of failure scenarios taken into account for the availability calculations needs to be limited to keep the calculation work manageable. For example, in a network with five nodes and eight lines, there are 13 failure scenarios with one failure (5 with one node fault, 8 with one line fault). There are 78 scenarios with two failures (10 with two node faults, 28 with two line failures, and 40 with one node and one line failure), 286 scenarios with triple failures, and already 715 scenarios with four failures. In practical calculations, the number of simultaneous failures taken into account will thus typically be limited to two, at most three.
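The growth in the number of failure scenarios can be checked with a few lines of Python. The sketch assumes that a scenario with k failures is any set of k distinct failed network elements (nodes or lines), which is the interpretation that reproduces the counts for one, two, and four failures given above; the function name is illustrative.

```python
from math import comb


def failure_scenarios(n_nodes, n_lines, k):
    """Count scenarios with exactly k simultaneous failures.

    Returns a dict mapping (node_faults, line_faults) to the number of
    scenarios with that split.
    """
    return {
        (j, k - j): comb(n_nodes, j) * comb(n_lines, k - j)
        for j in range(k + 1)
    }


# Network with five nodes and eight lines, as in the example above.
for k in range(1, 5):
    split = failure_scenarios(5, 8, k)
    print(k, sum(split.values()), split)
```

Each total equals the binomial coefficient of 13 elements taken k at a time, which is why limiting the analysis to two or three simultaneous failures keeps the enumeration manageable.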
Expected Loss of Traffic and Average Expected Loss of Traffic
Once the availability has been calculated, the ELT and the average expected loss of traffic (AELT) can be calculated. They are used to express the availability of the network services. The ELT of a traffic load can be calculated in the following way [Ver95]: The unavailability of a connection c, $U_c$, can be expressed as
$$U_c = \frac{\mathrm{EDT}_c}{\text{observation time}} \tag{3.14}$$
where $\mathrm{EDT}_c$ (expected downtime of connection c) is the average time that the connection c is interrupted during a certain observation period. When the observation period equals 1 year, the ELT of connection c is its expected downtime multiplied by its capacity $\mathrm{Cap}_c$, thus
$$\mathrm{ELT}_c = U_c \cdot \mathrm{Cap}_c \cdot 525600\ \text{minutes/year} = (1 - A_c) \cdot \mathrm{Cap}_c \cdot 525600\ \text{minutes/year} \tag{3.15}$$
The ELT for the whole traffic load is then expressed as
$$\mathrm{ELT} = \sum_{c \in \text{load}} \mathrm{ELT}_c \tag{3.16}$$
The AELT can then be calculated as
$$\mathrm{AELT} = \frac{\mathrm{ELT}}{\sum_{c \in \text{load}} \mathrm{Cap}_c} \tag{3.17}$$
As each wavelength transports one STM-X/OC-Y (with X = 4/Y = 12, X = 16/Y = 48, X = 64/Y = 192, ...), the AELT of the optical layer is usually expressed in STM-X/OC-Y hours per year. If we assume X = 16/Y = 48 (each wavelength is capable of transporting 2.5 Gbps), the equation for the ELT becomes
$$\mathrm{ELT} = \sum_{c \in \text{load}} (1 - A_c) \cdot \mathrm{Cap}_c \cdot 8760 \quad \text{in STM-16/OC-48 h/y} \tag{3.18}$$
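Equations 3.12 and 3.16 through 3.18 can be combined in a small Python sketch. The traffic load below is purely hypothetical (three connections with made-up capacities, expressed in STM-16 channels), and the function names are illustrative rather than taken from the text.

```python
HOURS_PER_YEAR = 8760


def load_availability(connections):
    """Equation 3.12: capacity-weighted mean of connection availabilities."""
    total_capacity = sum(cap for _, cap in connections)
    return sum(a * cap for a, cap in connections) / total_capacity


def expected_loss_of_traffic(connections):
    """Equation 3.18: ELT in capacity-unit hours per year."""
    return sum((1.0 - a) * cap * HOURS_PER_YEAR for a, cap in connections)


def average_expected_loss_of_traffic(connections):
    """Equation 3.17: ELT divided by the total capacity of the load."""
    total_capacity = sum(cap for _, cap in connections)
    return expected_loss_of_traffic(connections) / total_capacity


# Hypothetical load: (availability, capacity in STM-16 channels).
load = [(0.998177, 4), (0.999874, 2), (0.999874, 10)]

elt = expected_loss_of_traffic(load)
aelt = average_expected_loss_of_traffic(load)
print(f"A(load) = {load_availability(load):.6f}")
print(f"ELT = {elt:.2f} STM-16 h/y, AELT = {aelt:.2f} h/y per channel")
```

Under these definitions the AELT equals (1 - A(load)) * 8760, so the AELT and the availability of the load carry the same information on different scales.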
3.8.2 Availability: Some Observations
Until now the discussion on availability was fairly theoretical. Without pretending to give a complete overview of all factors influencing the availability of a connection, there are some trends that can be observed. Unless stated otherwise, the MTTR and MTBF values indicated in Table 3.8 are used. Let us first get a glimpse of the influence of the recovery scheme on the availability.
Availability Comparison between 1+1 Protection in Ring-Based and Mesh-Based Networks
In [Ari7/00], the availability performance of 1+1 dedicated protection in a mesh-based network and an interconnected ring network was compared, both for link failures only and for link and node failures. The figures of Table 3.8 were used, except for the CC, which was assumed to be 300 km. This study clearly showed that when only considering link failures, mesh protection performed much worse than dedicated ring protection. For a 32-node network, the ELT is almost twice as large for mesh protection as for OCh-DPRings (Figure 3.41). This could be expected, because in a mesh network the working path is protected by an end-to-end protection path, whereas in interconnected OCh-DPRings the working path is protected by a succession of several protection paths, one per ring. The latter can thus survive multiple link failures, occurring in different rings of the interconnected ring network. When the D&C ring interconnection scheme is used, the ELT decreases a bit because the interconnected ring network can now survive slightly more link failures (e.g., a double-link failure in a ring, with one failing link between the two ring gateway nodes).
When node failures are also taken into account, the ELT increases substantially, because a node failure causes all traffic terminating in that node to be lost. In addition, the relation between the mesh- and ring-based network’s ELT changes considerably, as can be seen in Figure 3.42. The interconnected ring network without D&C now performs worst. Because the gateway between rings is a single point of failure in this scenario, the 1+1 protection scheme is not able to recover from such failures. Introducing D&C into the ring network significantly lowers the ELT. Applying the 1+1 protection scheme in the mesh network causes the ELT to reach a value comparable to but somewhat larger than that of DPRings with D&C, again because of the end-to-end protection versus the ring-by-ring protection of the working path.
Figure 3.41 Comparison of the expected loss of traffic (ELT) for a 32-node network using path protection and optical channel dedicated protection ring with and without drop and continue (link failures only). [Bar chart: ELT (axis from 0 to 180) for Path Protection, OCh-DPRing, and OCh-DPRing (+D&C).] (P. Arijs, B. Van Caenegem, P. Demeester, P. Lagasse, W. Van Parys, P. Achten, ‘‘Design of ring and mesh based WDM transport networks,’’ Optical Networks Magazine, Vol. 1, No. 2, July 2000, pp. 25–40.)
Figure 3.42 Comparison of the expected loss of traffic (ELT) for a 32-node network using path protection and optical channel dedicated protection ring with and without drop and continue (link and node failures). [Bar chart: ELT (axis from 0 to 1400) for Path Protection, OCh-DPRing, and OCh-DPRing (+D&C).] (P. Arijs, B. Van Caenegem, P. Demeester, P. Lagasse, W. Van Parys, P. Achten, ‘‘Design of ring and mesh based WDM transport networks,’’ Optical Networks Magazine, Vol. 1, No. 2, July 2000, pp. 25–40.)
Availability Comparison between Protection and Restoration Schemes in Mesh-Based Networks
Another comparison that can be made is between restoration and protection in a meshed optical network. The results of such a comparison, using the MTBF and MTTR values of Table 3.8 on the European network depicted in Figure 3.36, are shown in Figure 3.43. As explained earlier, calculating the availability of the connections in a network that recovers from failures using link or path restoration is not that straightforward. In Figure 3.43, the results shown for the restoration schemes are an approximation. Because the availability of the network elements is quite high, the probability of multiple simultaneous failures is rather small. For these calculations, only single and double link and/or node faults were assumed. When a triple failure scenario occurred, only the nonaffected connections were taken into account. For other fault scenarios (e.g., quadruple failure scenario), we assume that none of the connections is available. The calculated ELT for link and path restoration is thus an upper bound of the exact value. Calculations for the case study presented in Figure 3.43 revealed that with the MTTR and MTBF numbers used, the probability that the network suffers from more than two simultaneous failures is indeed rather small: There is only a 0.18% chance of a fault scenario with more than two simultaneous network failures.
As can be seen in Figure 3.43, the lowest ELT value is reached with the 1+1 protection scheme. There is not much difference between the node- and link-disjoint protection schemes, because the availability of the nodes is quite high and the chance of a line failure is thus much larger than that of a node failure (e.g., for this network and traffic situation, the chance of a single node failure is 0.13%, whereas the chance of a single link failure is 18.90%). The performance of path restoration is comparable to that of 1+1 protection. Link restoration performs worse, because this recovery scheme is not able to recover from node failures. Figure 3.43 also compares these different recovery schemes from a capacity point of view. It is clear that when taking into account both the ELT and the capacity requirements of these different schemes, path restoration seems to be a good compromise.
The results of Figure 3.43 may seem a bit surprising. Path protection uses a dedicated backup path for each working path in the network. The flexibility of this recovery scheme is thus rather small. Restoration, on the other hand, searches for the backup path only after the failure has occurred, making it a more flexible recovery scheme, because it is not restricted to a predefined backup path route. The results illustrated above, however, do not seem to reflect this increased flexibility to recover from failures using path restoration. This is due to the lower amount of capacity installed in the network, which reduces the flexibility of the path restoration recovery scheme. In [Wil01], it was shown that overdimensioning the resources in the network by 10% to 20% in the case of path restoration significantly improves the ELT, because the path restoration scheme has more flexibility in choosing the backup path.
Figure 3.43 Comparison of the expected loss of traffic and the capacity requirements for different mesh-based recovery schemes. [Chart: ELT (STM-64 hours/year) and number of wavelength channels, both on axes from 0 to 30,000, for the recovery schemes 1+1 Protected Link Disjoint, 1+1 Protected Node Disjoint, Path Restoration, and Link Restoration. Unprotected: ELT = 259,082 STM-64 hours/year, number of wavelength channels = 8975.]
Availability versus Topology
In [DeM03], the influence of the topology of the optical network on the availability of the connections was investigated. Starting from the network in Figure 3.36, links were removed or added to the network. In this way three topologies were studied: a quite sparse one with an average node degree of 2.43, the network of Figure 3.36 with an average node degree of 2.93, and a quite dense network with an average node degree of 4.36. Traffic was protected using the 1+1 dedicated protection scheme. As can be seen in Figure 3.44, the ELT increases with decreasing average node degree of the topology. This can be explained by the fact that in the sparse topology, the connections have to follow on average a longer route in kilometers of fiber and have to pass through more OXCs. This means that the probability for a line or node failure along the path of the connection is higher, the availability lower, and thus the ELT higher. In the densest topology the routes between origin and destination OXC are the shortest and these routes thus have the highest availability (lowest ELT).
Figure 3.44 Influence of the optical layer topology on the expected loss of traffic. [Chart: ELT (STM-1 h/y, axis from 0 to 1,400,000) per year (2002 to 2008) for the topologies with node degree 2.43, 2.93, and 4.36.]
Availability versus Traffic Type
Often, the total traffic demand between node pairs consists of different traffic types. In [Dwi00], for instance, a distinction is made between voice, transaction data, and IP data traffic. Each of these traffic types is typically exchanged on a different geographical level, and they thus each have a different distance dependency relationship. For example, voice traffic is inversely proportional to the square of the distance, transaction data traffic is inversely proportional to the distance, and IP data traffic is independent of the distance:
$$\begin{aligned}
\text{Voice traffic} &\propto 1/D^{2} \\
\text{Transaction data traffic} &\propto 1/D \\
\text{IP data traffic} &\ \text{independent of } D
\end{aligned} \tag{3.19}$$
In addition, this different distance relationship has an influence on the availability of the connections and thus on the ELT. In [DeM03], this effect has been investigated, and the result is summarized in Figure 3.45. All three traffic types are protected using the 1+1 dedicated protection scheme. The volume of voice traffic between cities A and B is inversely proportional to the square of the distance between these cities. Most voice traffic connections are between locations that are geographically quite close and thus follow on average a shorter path. As explained earlier, a shorter path means a more available path. For transaction data traffic the distance dependency decreases, because this traffic type is inversely proportional to the distance. There will be more transaction data connections between locations that are far away from each other than in the case of voice connections. This explains the longer routes that transaction data connections will have to follow on average and thus the higher AELT. IP data traffic is not dependent on the distance, so longer connections are more probable, which translates into a higher AELT. Typically a service-level agreement will be less strict for IP data traffic than for voice traffic, meaning that the higher AELT incurred by IP data traffic is not necessarily reflected in less revenue for the network operator, because of, say, rebates (see Chapter 1).
Figure 3.45 Influence of the traffic type on the average expected loss of traffic. [Chart: AELT (STM-1 h/y, axis from 2.00 to 4.00) per year (2001 to 2006) for IP data traffic, transaction data traffic, and voice traffic.]
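The effect of the distance relationships in Equation 3.19 on the average connection length can be illustrated with a small Python sketch. It is not the method of [DeM03]: the node-pair distances are hypothetical, and the sketch only computes the traffic-weighted mean distance when the demand between a node pair scales as $1/D^k$ with k = 2 (voice), k = 1 (transaction data), or k = 0 (IP data).

```python
def mean_connection_distance(distances_km, exponent):
    """Traffic-weighted mean distance when demand scales as 1 / D**exponent."""
    weights = [d ** -exponent for d in distances_km]
    weighted_sum = sum(w * d for w, d in zip(weights, distances_km))
    return weighted_sum / sum(weights)


# Hypothetical distances between node pairs, in km.
distances_km = [100, 250, 400, 800, 1500]

voice = mean_connection_distance(distances_km, 2)        # demand ~ 1/D^2
transaction = mean_connection_distance(distances_km, 1)  # demand ~ 1/D
ip_data = mean_connection_distance(distances_km, 0)      # independent of D

print(f"voice: {voice:.0f} km, transaction: {transaction:.0f} km, "
      f"IP data: {ip_data:.0f} km")
```

The weaker the distance dependency, the longer the average connection, which is consistent with the ordering of the AELT values (voice lowest, IP data highest) described above.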
3.9 Recent Trends in Research
In this section, some recent trends in research are discussed. In Section 3.9.1, we focus on the concept of p-cycles. Section 3.9.2 discusses the meta-mesh recovery technique. In Section 3.9.3, we introduce flexible and intelligent optical networks and how they can be used to provide recovery. Of course this list is not exhaustive. Many other topics could be added to this section.
3.9.1 p-Cycles
In Sections 3.5 through 3.7, recovery schemes in optical networks were discussed. A distinction was made between protection and restoration. Recovery schemes were also classified based on the topology of the optical network: ring-based or mesh-based. Restoration in mesh-based networks is typically significantly more efficient in terms of capacity use than the protection schemes in ring-based networks. However, the latter are able to guarantee very fast switching times (50 to 60 ms), because only two nodes need to perform any action (see Section 3.5). Until recently there was a quite strict distinction between recovery schemes in mesh-based and ring-based networks. In [Gro98], however, a recovery scheme, called p-cycles, was proposed that offers the advantages of both the ring-based and the mesh-based recovery schemes: ringlike switching speeds with a capacity efficiency comparable to that of restorable mesh-based networks. It is based on the formation of rings in the spare capacity of a mesh-restorable network. These are formed in advance of any failure. The p-cycle recovery scheme is similar to a ring-based recovery scheme in that both use rings. However, unlike ring-based recovery schemes, p-cycles recover both failures on the ring and straddling failures. This is the key factor for obtaining the efficiency of a mesh-based recovery scheme using a ringlike protection structure. An example of the use of p-cycles is illustrated in Figure 3.46. In Figure 3.46(a), an example of a p-cycle is shown. In Figure 3.46(b), a link that is part of the ring breaks, and the surviving part of the p-cycle is used for recovery purposes. In Figure 3.46(c) and (d), although the link that fails is not part of the p-cycle, the p-cycle is used to support recovery of the broken link. Moreover, not one but two recovery paths are available from the p-cycle, leading to more advantageous recovery circumstances. The difference with conventional ring-based schemes is thus that not only links on the ring but also failing straddling links are recovered by the p-cycle, and that in the latter case two recovery paths are available. With p-cycles, a single ring can provide recovery paths for many more failing links than with traditional ring-based recovery schemes, making this scheme significantly more capacity efficient, even as efficient as mesh-based recovery schemes.
Figure 3.46 Use of p-cycle as recovery scheme. (a) A p-cycle. (b) A link that is part of the p-cycle fails; the p-cycle contributes one recovery path. (c), (d) A link that is not part of the p-cycle fails; the p-cycle contributes two recovery paths. (W.D. Grover, D. Stamatelakis, ‘‘Bridging the ring-mesh dichotomy with p-cycles,’’ Proc. of 2nd International Workshop on Design of Reliable Communication Networks (DRCN’00), (Munich, Germany, April 2000), pp. 92–104.)
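The distinction between on-cycle and straddling failures can be made concrete with a minimal Python sketch. It is a simplification, not the design method of [Gro98]: the p-cycle is given as a list of nodes in ring order, the network is assumed to have at most one link between any two nodes, and the node names and the function `pcycle_recovery_paths` are hypothetical.

```python
def pcycle_recovery_paths(cycle, failed_link):
    """Return the recovery paths a p-cycle offers for a failed link.

    cycle: list of nodes in ring order (the last node connects to the first).
    failed_link: tuple (u, v) with the end nodes of the failed link.
    """
    u, v = failed_link
    if u == v or u not in cycle or v not in cycle:
        return []  # an end node is off the cycle: this p-cycle cannot help
    n = len(cycle)
    i, j = cycle.index(u), cycle.index(v)
    # Walk around the cycle from u to v in both directions.
    clockwise = [cycle[(i + k) % n] for k in range((j - i) % n + 1)]
    counterclockwise = [cycle[(i - k) % n] for k in range((i - j) % n + 1)]
    # An arc with only two nodes is the failed on-cycle link itself.
    return [arc for arc in (clockwise, counterclockwise) if len(arc) > 2]


ring = ["A", "B", "C", "D", "E"]
print(pcycle_recovery_paths(ring, ("A", "B")))  # on-cycle link
print(pcycle_recovery_paths(ring, ("A", "C")))  # straddling link
```

An on-cycle failure leaves one surviving arc, whereas a straddling failure can use both arcs of the cycle, which is the property behind the capacity efficiency described above.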
3.9.2 Meta-Mesh Recovery Technique
The technique called meta-mesh [Gro02] is a refinement of existing recovery schemes in mesh-based networks that increases the capacity efficiency in networks with a rather low average node degree (sparse networks). For this type of network, ring-based recovery schemes are often thought to be the best solution, because a mesh-based recovery scheme may be equally expensive given the sparseness of the topology. The resulting design using the meta-mesh concept lies between pure link restoration and pure path restoration. A sparse network typically contains chains of degree-2 nodes (nodes with two incident links, see Figure 3.47[a]). When using link restoration, if a link between degree-2 nodes fails, the affected working traffic is looped back using the spare capacity (back-hauling, as discussed in Section 3.6) until it encounters a node with degree higher than 2, at the end of the chain (Figure 3.47[b]).
Figure 3.47 Difference between link restoration and meta-mesh technique. (a) Sparse network. (b) Link restoration with loop back (back-hauling), showing the working path and the local restoration path. (c) Meta-mesh topology of the network.
With the meta-mesh recovery technique, each chain (a number of degree-2 nodes with a node of at least degree 3 at each end) is represented by one ‘‘meta-link,’’ leading to a ‘‘meta-mesh’’ as shown in Figure 3.47(c). For a working path that contains one or more complete chains, traffic can be restored on the level of the affected meta-link instead of the level of the individual affected link of the original topology (i.e., meta-link restoration rather than link restoration). In this way, a part of the spare capacity that link restoration needs for the loop back can be avoided, leading to a more cost-efficient capacity assignment. In [Gro02], it is shown that significant capacity savings can be obtained using this meta-mesh technique.
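The construction of the meta-mesh topology can be sketched in Python as follows. This only illustrates the chain-collapsing step and is not the capacity design procedure of [Gro02]. The example network and the function `meta_mesh` are hypothetical, the graph is assumed to be simple and undirected, and, as a simplifying assumption, every node whose degree differs from 2 (including degree-1 nodes) is treated as a chain end.

```python
def meta_mesh(adjacency):
    """Collapse each chain of degree-2 nodes into a single meta-link.

    adjacency: dict mapping each node to the set of its neighbours.
    Returns a list of (end_a, end_b, chain_nodes) tuples; an empty
    chain_nodes list means an ordinary link between two anchor nodes.
    """
    anchors = {node for node, nbrs in adjacency.items() if len(nbrs) != 2}
    meta_links = []
    walked = set()  # directed first hops that were already followed
    for start in sorted(anchors):
        for first_hop in sorted(adjacency[start]):
            if (start, first_hop) in walked:
                continue
            previous, current, chain = start, first_hop, []
            # Follow the chain until the next anchor node is reached.
            while current not in anchors:
                chain.append(current)
                following = next(n for n in adjacency[current] if n != previous)
                previous, current = current, following
            walked.add((start, first_hop))
            walked.add((current, previous))  # same meta-link, other end
            meta_links.append((start, current, chain))
    return meta_links


# Hypothetical sparse network: B and C form a chain between A and D.
network = {
    "A": {"B", "X", "Y"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "X", "Y"}, "X": {"A", "D", "Y"}, "Y": {"A", "D", "X"},
}
for end_a, end_b, chain in meta_mesh(network):
    print(end_a, end_b, chain)
```

In this example the three links A-B, B-C, and C-D collapse into one meta-link between A and D, so a failure anywhere on the chain can be restored between A and D directly instead of being looped back along the chain.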
3.9.3 Flexible Optical Networks
In Section 3.1.5, we briefly discussed the evolution of the OTN from a static networking layer to a flexible and agile one. In Section 3.6, we also discussed how such an intelligent and flexible OTN with its IP-based control plane will enable optical restoration. However, the advantages of these flexible and intelligent optical networks are not limited to single-layer network recovery. The fast connection provisioning, typical for this kind of network, can also be used to provide resilience in a very capacity-efficient way in a multilayer network scenario. In such a multilayer network, resilience schemes need to be deployed in all network layers to recover from all possible network failures. The flexibility of intelligent and agile optical networks enables the reconfiguration or even reoptimization of the logical client topology (e.g., during a client node failure), which could be used to work around such a failure. This is discussed in more detail in Chapter 6, Section 6.2.4.
3.10 Conclusion
In this chapter, recovery in optical networks was studied in detail. First, in Section 3.1, an overview was given of the ongoing evolution of the optical network layer from a static point-to-point layer providing high-capacity bit pipes to the client layer, over an optical network layer with switching and management capabilities, to a fully flexible optical layer. This flexible optical network is further discussed in Chapter 6. The main network elements in the optical layer are the optical crossconnect and the optical add/drop multiplexer, which were discussed in Sections 3.1.3 and 3.1.4. Next, the current architecture and structure of the optical transport network was discussed in Section 3.2. Several layers can be distinguished within the optical transport network: from bottom to top, the optical transmission section, the optical multiplex section (both are sometimes replaced by the optical physical layer), and the optical channel layer, whose substructure consists of the optical channel transport unit layer, the optical channel data unit layer, and the optical channel payload unit layer. These layers form the optical transport module, with full or reduced
functionality, depending on whether associated overhead is supported or not. The different optical transport module types also differ in the number of wavelength channels they can support. An overview of the current standardization effort was also given. In Section 3.3, the overhead of these different network layers was discussed, emphasizing those parts of the overhead that are useful for fault detection and propagation. In addition, the different types of defects that can be encountered in the network were described. The use of the maintenance signals conveyed in the overhead for alarm suppression was illustrated with some examples. Sections 3.4 through 3.7 focused on the different recovery schemes that can be applied in optical networks. Section 3.4 answered the question of why we would like to use a recovery scheme at the optical network layer: because such a scheme works at a large granularity, it is fast, efficient, and easy to manage. Recovering in a higher network layer from a root failure at the optical layer would mean recovering from potentially many resulting secondary failures. A first distinction was made between recovery schemes in ring-based and mesh-based optical networks. Section 3.5 discussed the ring-based recovery schemes. Such a scheme is characterized by the level at which the recovery action occurs (OMS or OCh) and by whether the recovery capacity is dedicated to or shared between the working traffic (DPRing or SPRing). All four resulting schemes (OMS-DPRing, OMS-SPRing, OCh-DPRing, and OCh-SPRing) were discussed and compared in detail. Section 3.6 focused on recovery schemes (both protection and restoration) in mesh-based optical networks. The pros and cons of protection and restoration schemes were discussed. Protection is fast and easier to implement but is quite capacity consuming. Section 3.7 compared the performance of recovery schemes in ring-based and mesh-based optical networks.
Section 3.8 was dedicated to availability, an important performance parameter of the different recovery schemes. After a theoretical introduction to availability calculations, some factors influencing the availability (e.g., the applied recovery scheme and the network topology) were discussed. Finally, Section 3.9 gave a short overview of some recent trends in research. As discussed in Chapter 1, several other layers can reside above the optical transmission layer. The current trend, however, is to evolve to an IP/MPLS-over-OTN multilayer network. The IP layer is discussed in Chapter 4. Chapter 5 focuses on the Multiprotocol Label Switching (MPLS) protocol, which was introduced to enhance the capabilities of the IP client layer.
CHAPTER 4
IP Routing
This chapter is devoted to the recovery aspects of Internet Protocol (IP) routing. Link state interior gateway protocols (IGPs) have undoubtedly been successful during the past few years and have been deployed in the vast majority of operator and large enterprise networks. Consequently, this chapter is mainly focused on link state protocols. Interestingly, the foundation of current link state protocols was laid in the late 1970s in the well-known ARPANET network, but an increasing interest in the fast recovery properties of link state routing protocols has driven numerous optimization techniques during the last few years, leading to fast convergence enhancements, which are extensively covered throughout this chapter. The first part of this chapter, Sections 4.1 through 4.12, focuses on the fundamental aspects of link state protocols, which include the reliable network topology discovery mechanism, the distributed shortest path computation, and the routing table calculation. These are described in detail not only from a protocol perspective but also with the objective of providing IP recovery network design rules. Because the nature of IP routing is to be completely distributed, an important part of this chapter focuses on the dynamic aspects of distributed routing and the various steps occurring during network convergence. Throughout this chapter we demonstrate that IP routing can provide subsecond convergence while preserving network stability even with major and multiple network failures, thanks to the use of dampening mechanisms. Furthermore, we show that the common perception that IP routing is limited to best-effort service is misleading and that optimized IGP metric algorithms allow an operator to traffic engineer an IP network both at steady state and under single network element failure.
Nonstop forwarding (NSF), a recovery technique available on many platforms that ensures continuance of data forwarding in the event of control plane failures, is also discussed (its interaction with fast IP convergence is covered in Section 4.15). A rich set of examples is provided to highlight the
concepts introduced in this chapter, which concludes with a detailed case study. Finally, the second part of this chapter, from Section 4.13 to 4.15, discusses some advanced topics like algorithm complexity, incremental shortest path first (SPF), and the potential interaction between fast IGP convergence and NSF. Note that you could skip this more advanced part and still have a very good understanding of the IP routing recovery mechanisms covered from Sections 4.1 through 4.12. This chapter concludes with a section on research-related topics.
4.1 IP Routing Protocols
We start with an introduction to IP routing protocols, followed by an overview of the principles of the two major families of routing protocols: distance vector and link state protocols. The clear superiority of link state protocols in terms of recovery is highlighted, explaining their wide adoption in most if not all operator and large enterprise networks. Finally, this section concludes with the local versus the global recovery aspect of IP routing.
4.1.1 Introduction
The objective of running a routing protocol is for each node to build a routing table that contains the shortest path (footnote 32) to each reachable IP prefix. As detailed later in this chapter, the entire path does not have to be stored to route the packet; instead the router maintains a dedicated data structure called the forwarding information base (FIB) that contains the next hop for each reachable IP prefix along with other protocol information. Several routing protocols designed during the last three decades fall under one of the two following categories:
1. Distance vector routing protocols
2. Link state routing protocols
4.1.2 Distance Vector Routing Protocols Overview (‘‘Bellman-Ford’’)
Distance vector routing protocols rely on the principle of periodic distribution of the routing table to each neighbor. Upon periodic timer expiration (and when network changes occur, like a network element failure), each node sends its routing table to each of its participating neighbors. The easiest way to illustrate how distance vector routing protocols work is through an example. Consider the simple network depicted in Figure 4.1; let us see step by step how each router builds its routing table.
32. The notion of ‘‘shortest path’’ is explored in Section 4.6.
Figure 4.1 Distance vector routing protocols. Panels (a), (b), and (c) show routers A through F and subnetwork N1 with labeled link costs; panel (b) marks Failure 1 and panel (c) marks Failure 2.
At time t0, router A boots up. As soon as the links A-B and A-C are effective (this includes the time for the layer 2 protocol underneath to be fully operational), router A sends its routing table to both nodes B and C. At this stage, A’s routing table is reduced to its directly attached links because router A has not yet received any routing information from any of its direct neighbors. Then at time t1, B and C send their routing tables to A. So, for instance, A learns that D is reachable from B with a distance of 4 upon receiving B’s routing table and that D is also reachable from C with a distance of 1. After adding, respectively, the costs of the local links A-B and A-C, A determines that the shortest route to reach D is by means of C with a distance of 1 + 1 = 2. Finally, at time t2, router A sends its new routing table (which now contains some reachability information about D, among others) to each of its neighbors B and C. Note that before node A booted, the shortest path computed by B to D was through its directly connected link B-D (actually it was the only existing path). Upon receiving that new routing update from A, B figures out that the path cost to reach D by means of A is 2 + 1 = 3, which is a shorter path than the existing one; consequently, B updates its routing table and selects A as its preferred next hop to reach D. The same process occurs between each node, and after some time the network converges. Unfortunately, distance vector routing protocols become quite inefficient during a network element failure. Now consider the failure of the subnetwork N1 [Figure 4.1(b)] locally attached to router A (e.g., because of the failure of the interface connecting node A to the subnetwork N1). When node A detects the failure, it quickly updates its routing table and marks the corresponding IP prefix as unreachable (cost is infinite).
However, bear in mind that routers running a distance vector protocol exchange their routing table regularly. Hence, in the absence of a network
failure, A periodically receives B’s routing table indicating that N1 is reachable with a distance of 2 (B’s routing table selected A as the next hop to reach N1). What would happen if A received B’s routing table update just after having marked the corresponding routing table entry as unreachable? A would now select B as its preferred next hop to reach N1 with a cost of 3 (because at this point it does not have any other route for N1) and would send a routing update to B. Then B would reflect the cost change (which is now 4) and advertise the new cost to A, and so forth. This clearly creates a loop, known as the count-to-infinity problem; the solution to break that loop is to consider the route unreachable once the cost has reached a value large enough to be considered infinite. For instance, RIP considers the value of 16 as infinite, which allows breaking the loop relatively quickly; however, the downside of such an approach is that the limit of the network diameter (footnote 33) is now 16 because no path can exceed this value of 16 without being considered infinite. Various solutions have been proposed to avoid such loops. A very well known but partial solution is the ‘‘split horizon.’’ The idea is that a node should never advertise a route to a neighbor X if its preferred next hop for that route is X. For instance, in the example depicted in Figure 4.1, B would not advertise N1 to A because its preferred next hop for N1 is via A (the same reasoning applies to C). Unfortunately, this works only to avoid a loop involving two nodes. Now consider the case of a failure of the link C-E [Figure 4.1(c)]. One possible sequence of events is the following. When C detects the failure of the link C-E, once the periodic routing update timer expires (or immediately, depending on the distance vector protocol), it sends its routing table reflecting that the cost to reach E is now infinite (i.e., E is no longer reachable via node C).
Both node A and node D learn the news and update their routing tables, which are then sent to each of their neighbors. After some period the network converges. Now suppose a slightly different event timing: Suppose that C detects the failure of link C-E and sends its routing table update to A and D. However, bear in mind that routers exchange their routing tables periodically; in the absence of any failure, both A and D advertise to B that they can reach E with a cost of 2. Suppose that B selects node D as its preferred next hop to reach E with a cost of 3. By virtue of the split horizon technique, neither router A nor router D advertises E to C in its routing table update because they both selected C as their preferred next hop to reach E. Now, although B does not advertise E to D (its preferred next hop to reach E), it does send an update related to E to node A with a cost of 3. At steady state, A selects C as its next hop, because the path via C is shorter. Back to the previous example, suppose that A receives C’s routing table update related to E (and reporting an infinite cost for E) and then immediately after A receives B’s routing update for E (this could happen if B sends its routing update before having received D’s routing update reporting that E is no longer reachable). Then, in this case, A selects B as its preferred next hop to reach E with a cost of 4 and sends a routing update related to E to its neighbor C, which results in building a loop involving four routers, hence the earlier statement that the ‘‘split horizon’’ technique only partially solves the problem.
33. The diameter of the network is defined as the maximum number of routers an IP path can contain.
Improvements of ‘‘split horizon’’ have been proposed to speed up the convergence time, like the ‘‘split horizon with poison reverse,’’ where a node always re-advertises route N to a neighbor X with an infinite cost if its preferred next hop for that route is X. Despite several such improvements, distance vector protocols inherently suffer from a lack of efficiency in terms of convergence time. One of the popular distance vector routing protocols is Routing Information Protocol (RIP), which was the routing protocol provided on UNIX BSD in 1982 (known as ‘‘routed’’). Various versions of RIP have been defined: RIP version 1 [RIP-1] followed by RIP version 2 [RIP-2]. Some other interesting enhancements have also been made, in particular the ‘‘triggered update’’ [RIP-TRIG], which relies on the principle that RIP no longer periodically sends its complete routing table (except when explicitly requested) to every neighbor but does so only when a change in the network occurs, hence reducing unnecessary background noise. Also, some other distance vector protocols like Enhanced Interior Gateway Routing Protocol (EIGRP) have been designed with more advanced features and are certainly more efficient than the first version of RIP, but they all rely on the basic principle that each router provides to its participating neighbors its own view of the network after having computed its routing table. So although distance vector protocols like RIP may be suitable in small networks or in some particular network topologies (like ‘‘star’’ topologies), as stated earlier, their limitations in terms of convergence speed render them unsuitable for large and meshed networks. This is especially true when convergence speed is required. In such circumstances, link state protocols are undoubtedly preferred.
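The count-to-infinity behavior and the effect of split horizon in the two-router N1 example can be traced with a minimal Python sketch. It assumes, as in the text, a cost of 1 on link A-B, the RIP infinity value of 16, and that B’s periodic update reaches A right after A has marked N1 unreachable. The function name and data layout are illustrative assumptions, not part of any RIP implementation.

```python
INFINITY = 16  # RIP treats a metric of 16 as unreachable


def trace_n1_failure(split_horizon, link_cost=1, max_exchanges=100):
    """Trace the cost to N1 at routers A and B after A loses subnet N1.

    A and B alternately send their route for N1 to each other, B first.
    Returns the list of (cost_at_A, cost_at_B) pairs, starting with the
    state right after the failure and adding one pair per exchange.
    """
    # State right after the failure: A has marked N1 unreachable,
    # B still believes N1 is reachable via A at cost 2.
    cost = {"A": INFINITY, "B": 2}
    next_hop = {"A": None, "B": "A"}
    history = [(cost["A"], cost["B"])]

    sender, receiver = "B", "A"
    for _ in range(max_exchanges):
        # Split horizon: never advertise a route to the neighbor that is
        # the preferred next hop for that route.
        suppressed = split_horizon and next_hop[sender] == receiver
        if not suppressed:
            offered = min(cost[sender] + link_cost, INFINITY)
            # Accept the update if it comes from the current next hop
            # (its metric must be followed) or if it is strictly better.
            if next_hop[receiver] == sender or offered < cost[receiver]:
                cost[receiver] = offered
                next_hop[receiver] = sender if offered < INFINITY else None
        history.append((cost["A"], cost["B"]))
        if cost["A"] == INFINITY and cost["B"] == INFINITY:
            break  # both routers agree that N1 is unreachable
        sender, receiver = receiver, sender
    return history


without_sh = trace_n1_failure(split_horizon=False)
with_sh = trace_n1_failure(split_horizon=True)
print("exchanges without split horizon:", len(without_sh) - 1)
print("exchanges with split horizon:", len(with_sh) - 1)
```

Without split horizon the costs climb step by step (3 at A, 4 at B, 5 at A, and so on) until both routers reach 16. With split horizon, B stays silent toward A and learns of the failure from A’s next update. As the text notes, this protection covers only loops between two nodes.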
4.1.3 Link State Routing Protocols Overview
A Brief History
The ARPANET has undoubtedly played a tremendous role in current routing protocol design, and link state protocols are no exception: the first link state protocol was invented and deployed in the ARPANET. The very first version of a dynamic routing protocol was deployed in the ARPANET and is described in [ARPA-1]. This first routing algorithm was an adaptive dynamic distributed routing protocol. Each term is important and must be clearly defined here:
• Dynamic: Routing tables are computed dynamically, in contrast with static routing, where, for each destination, the next hop is manually configured by the network administrator.
• Distributed: Each router computes its own routing table. In other words, there is no central server that computes the routes and downloads the resulting routing tables on each node.
• Adaptive: A routing protocol is said to be adaptive when the route computation takes into account certain dynamic network state conditions like the
link load or experienced delays to influence its routing decision. Of course, adaptive routing protocols require some measurement process that determines/quantifies network characteristics and some way to disseminate that information to other nodes in the network. Finally, a computation module is required to compute the shortest path according to specific constraints. One of the major challenges of this class of protocols is to ensure a sufficient reaction to network changes while preserving network stability without requiring unreasonable protocol overhead cost. Adaptivity is definitely an interesting property that current link state protocols like Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (IS-IS) (footnote 34) do not have, for reasons discussed later in this chapter.
34. Although some proposals have been made to make OSPF and IS-IS adaptive.
The first ARPANET routing algorithm (Figure 4.2) relied on the following principle: Each node maintained a table of the estimated delays to reach any other node in the network. Upon receiving the table from node X, node Y would first evaluate the delay to reach X and compute a new table of estimated delays to every other node in the network, where the shortest path was considered as the path with the shortest delay. Each node would send its estimated delay table to every adjacent node at a quite high frequency (every 128 ms). Although the first ARPANET routing algorithm was used for several years, several issues came up, as follows:
1. Packets containing the estimated delay tables were getting long and were growing as the ARPANET grew.
2. Route consistency was difficult to maintain across multiple nodes, which was inherent to the nature of the route computation, where each node based its route computation on the estimated delay tables calculated by other nodes.
3. The fast rate of exchange of estimated delay tables led to a lack of efficiency in adapting to congestion and major network changes and at the same time could overreact to minor changes in the network.
4. The delay measurement method was solely based on the queue lengths, which were not accurate because links had different characteristics like speed, propagation delays, and packet sizes. Moreover, at that time the processing delay (independent of the queue size) was not a negligible factor, yet it was ignored by the delay measurement method. Queue lengths were measured based on the instantaneous queue size, which was not an excellent indicator either.
Figure 4.2 ARPANET map in October 1980. (M. Dodge. ‘‘Cybermap of the Month Column,’’ ARPANET, October 1980. [Illustration courtesy of the Computer Museum of History Center.] Available at http://mappa.mundi.net/maps/maps_001. Accessed May 2004.) [Map: ARPANET Geographic Map, October 1980; names shown are IMP names, not necessarily host names; experimental satellite connections are not shown.]
This justified coming up with a newer routing protocol version significantly different from the first version [ARPA-2]. One of the first major changes in the new version of the ARPANET routing protocol was that each node was disseminating the measured delays between itself and its adjacent neighbors (instead of generating a packet [or some packets] reflecting its estimated delays to every other node in the network). Then upon receiving the information generated by each node, each router was able to compute the shortest path from itself to every other node using some distributed shortest path computation. Each packet was flooded throughout the network using a new reliable flooding mechanism (called the updating procedure at that time). It is worth highlighting some important properties of the flooding procedure, which was fast and reliable: This was of the utmost importance to ensure database consistency between nodes and avoid loops, as discussed in detail later in this chapter. A new updating packet was also originated and flooded throughout the network upon link state change (see [ARPA-5] for a detailed description of the updating procedure). Another important and new component of this new adaptive dynamic and distributed routing protocol was the delay measurement method, where the average delay (by contrast with the instantaneous queue size) was measured every 10 seconds and reported if a significant change was noticed. Several detailed analyses were conducted to determine the efficiency of this new adaptive routing protocol and the results were very promising:
• Quick and accurate response to topological changes.
• Dynamic packet rerouting upon network congestion.
• Efficiency in terms of shortest delay path computations.
• Routing loops were very temporary and packets entering a loop did not traverse a router more than twice; on the other hand, several routers could be involved in a single temporary loop (see Section 4.6 for more details on temporary loops).
• The algorithm did not provoke network instabilities and oscillations, which is of course one of the potential drawbacks of adaptive algorithms. Indeed, because paths were computed based on actual traffic load, an inappropriate measure procedure and update frequency might have led to traffic oscillations. Flows were routed around congested areas, which alleviated the congestion in the area in question but could also create some congestion on other links, which could result in a new traffic shift, hence the possible oscillation. This is the reason current link state protocols are not adaptive: the trade-off between traffic load efficiency and potential network oscillations has been resolved in favor of using static IGP link metrics.
As described throughout this chapter, current link state protocols have several strong commonalities with the routing protocols designed for the ARPANET: the foundations of link state protocols were unquestionably laid during the ARPANET years. Several interesting references on ARPANET and Internet history milestones can be found at [HISTORY].
Link State Protocols Overview
Link state protocols rely on a fundamentally different concept from distance vector protocols. Each router is responsible for originating a link state protocol data unit (PDU) that describes its local topology (in a nutshell, its set of direct neighbors, the
local link characteristics (like the metric), the local IP addresses, etc.). Link state PDUs are then disseminated throughout the network via a reliable flooding mechanism. The collection of all the link state PDUs originated by every router in the network (which is called a link state database [LSDB]) allows every router to build a complete map of the network. Then, every router runs an algorithm that computes the shortest path tree (SPT), which provides the shortest path from the computing node to every other node in the network, as well as the routing table that contains all the reachable IP prefixes along with the corresponding preferred next hop(s) and the cost. The reliable link state PDU dissemination (flooding) process and the shortest path computation are covered in detail in this chapter. At steady state, routers exchange short messages (called hello) that allow them to ensure that their neighbors are still reachable; in correct link state protocol terminology, the routers’ adjacencies are still up. When a router first boots, it starts exchanging hello messages to automatically discover its participating neighbors. Once a neighbor is discovered, the process of LSDB synchronization starts, during which routers exchange their LSDBs; this guarantees that all the routers share the same view of the network (they have identical LSDBs). When a link or node fails in the network, as soon as the failure is detected, each router detecting the failure (and so a loss of adjacency) originates a new link state PDU that reflects the network topology change. Note that the network element failure detection can be done by means of the routing protocol’s hello messages (no hello messages are received from a neighbor during a configurable period) or by the layer 1/2 protocol, which sends an alarm to explicitly indicate a link failure. The aspects of failure detection are covered in detail in Section 4.3.
So, for instance, in the case of a link failure the two routers interconnected via the failed link will each originate a new link state PDU, which will be flooded in a reliable mode throughout the network. Upon receiving a new link state PDU reflecting the network topology change, each router triggers a new routing table computation, using a shortest path computation algorithm (usually referred to as the SPF algorithm) described in detail in Section 4.6. Various timers related to the origination (footnote 35) of the link state PDU and SPF computation can be used to tune the routing protocol convergence while guaranteeing network stability; they are covered in detail in Section 4.4, whereas the aim of this paragraph is only to introduce the general concept of link state routing protocols. Although temporary loops may appear during network convergence because of some lack of LSDB synchronization between routers (detailed in Section 4.7), those loops are very short-lived and SPF algorithms guarantee the computation of loop-free paths.
35. The process by which a router builds the link state PDU and floods it is called the origination: We say that the router originates a new link state PDU.
Link state routing protocols also support the notion of hierarchical routing, which allows splitting the network into multiple zones where just the routers belonging to a zone share the same LSDB. Limiting the number of routers in
each zone reduces the LSDB size, which in turn reduces the routers’ routing-related load (less memory usage, faster route computation, and higher stability are examples of potential gains). In some cases, this might also be useful to isolate some part of the network that experiences regular instabilities (e.g., a region of the world where link failures occur very frequently). Then just the routers belonging to the zone where the failure occurs will be affected (receipt of the new link state PDU, routing table recomputation). The reachability information of an IP prefix outside of a zone is provided by routers connected to multiple zones, called area border routers (ABRs) in OSPF and L1L2 (level 1–level 2) routers in IS-IS, which advertise the IP prefixes reachable outside of the zone along with optional metrics and various degrees of summarization (the process of route summarization reduces the number of advertised IP prefixes by making use of the hierarchical nature of IP addresses). We must underscore that routing protocols have benefited from numerous implementation optimizations, and routers’ CPUs are much more powerful than several years ago, so the limits in terms of number of routers per zone have drastically changed. Trying to determine a universal maximum number of routers per zone is meaningless because that number depends heavily on several factors like network stability, router CPU, and degree of connectivity, to mention a few. The two most widely used link state protocols are OSPF [OSPF] and IS-IS [ISIS]. There are significant differences between the two protocols, such as the link state PDU formats (called a Link State Advertisement [LSA] for OSPF and a Link State Packet [LSP] for IS-IS), the protocol message types, and the LSDB synchronization procedures, to mention a few of them.
Although the list of differences is certainly quite long, the similarities between both protocols are also numerous because they both are link state protocols. In particular, their properties in terms of recovery are very similar. The routing dynamics upon a network element failure are nearly identical: Once the failure is detected, a new link state PDU is originated and flooded throughout the network. Then every router receiving a new link state PDU triggers a routing table computation, making use of a shortest path algorithm. Hence, the set of mechanisms described in this chapter is applicable to both IS-IS and OSPF. The generic terminology of LSA is used in the rest of this chapter and refers to an LSA for OSPF and an LSP for IS-IS.
Distance vector versus link state protocols: For the reasons explained earlier and in particular the scalability and convergence time aspects, link state protocols have been widely deployed, particularly in large networks (footnote 36). There might be very few exceptions, but most if not all service providers and large enterprise networks run a link state routing protocol. For that reason, the rest of this chapter is entirely devoted to the recovery aspects of link state protocols.
36. It is worth mentioning that a mix of link state and distance vector routing protocols might be seen in some particular network topologies. For example, consider a network made up of a backbone of core routers interconnected in a mesh topology with a set of remote or edge routers attached to some core routers via a single link. In such a case, running RIP between the remote routers and the core routers and OSPF or IS-IS between the core routers is an interesting option. Indeed, this reduces the size of the LSDB because just the core routers will be part of the link state domain, while providing a dynamic way of learning the IP prefixes reachable by means of the local routers without the need for fast convergence between the remote routers and the core routers because they do not have an alternate path anyway (those routers are attached to the core routers via a single link). Note that such a configuration is more commonly seen in large enterprise networks than in operators’ networks. Such a routing design also has [continued after the Figure 4.3 caption]
4.1.4
IP Routing: A Global versus Local Restoration Mechanism? IP routing is fundamentally a restoration recovery mechanism (Figure 4.3). Indeed, as described in detail throughout this chapter, once the failure has been detected by the router directly attached to the failed network element (e.g., link or node), it propagates the fault indication signal (FIS) throughout the network. More precisely, the link state protocol propagates the network topology change, which can be interpreted as an FIS in the case of a link or a node failure. Then, every router that receives the notification of the network change recomputes ‘‘on the fly’’ its routing table, which is by definition a restoration process. Now, strictly speaking, IP routing cannot be classified as either a global or a local restoration recovery mechanism. Indeed, the point of rerouting of the traffic affected by the failure depends on the network topology and can either be the router directly attached to the failed link or node (in this case, the restoration is local) or several routers upstream to the failure. As shown in Figure 4.3, depending on the network topology and the failure location, the rerouting node is either immediately upstream to the failure or several nodes upstream. So the degree of meshing (also sometimes referred to as the
F
G
H
Rerouting location in a highly meshed (dense) network
A
B
C
D
E
F
G
H
B
C
D
Rerouting location in a sparsely meshed network
A
E
Figure 4.3 IP restoration. several drawbacks in terms of network management because it requires routing information redistribution between routing protocols that may be complex to configure and a source of configuration errors.
Vasseur / Network Recovery Final Proof 8.6.2004 3:58am page 214
214
CHAPTER 4
IP Routing
network density) determines how close the upstream rerouting node is likely to be from the failure. This has an obvious impact on the rerouting time, but increasing the degree of meshing is not always possible and has a cost. Some studies on several large IP networks show that on average the rerouting node is between three and six hops upstream to the failure location.
4.2 Analysis of the IP Routing Recovery Cycle
The aim of this section is to give an overview of each phase of the recovery cycle (introduced in Chapter 1) in the context of IP when a network element fails. An example then illustrates the various rerouting phases. Sections 4.3 through 4.6 explore each of those phases in detail, but it is important to first have a good understanding of the IP routing dynamics (Figure 4.4).
4.2.1 Fault Detection and Characterization
As with any other recovery mechanism, the first task in the recovery cycle is the failure detection itself, which usually has a nonnegligible impact on the overall convergence time. Section 4.3 is devoted to this important aspect.
[Figure 4.4 Recovery cycle. Timeline labels: failure, fault detected, fault detection time, hold-off time, fault notification time, recovery operation time, traffic recovery time, recovery time.]
4.2.2 Hold-Off Timer
Multilayer recovery mechanisms are studied in detail in Chapter 6, but in a nutshell, there are situations in which it is appropriate for the IP layer to wait for the expiration of some hold-off timer before triggering an action once the failure has been detected. For instance, consider the case of an IP-over-Dense Wavelength Division Multiplexing (DWDM) network in which some recovery mechanisms are also available at the optical layer. In other words, the DWDM links are protected and rerouted by means of some recovery mechanism (protection or restoration), as seen in Chapter 3. Then, if the rerouting time for the optical layer is bounded by some time X, the IP layer should wait for some time Y, where Y > X, before triggering any action, to avoid undesirable race conditions. If the link is still down after the expiration of the hold-off timer, the optical recovery probably has not succeeded and the IP layer should trigger some recovery action. Note that the hold-off timer may be dynamically computed when dampening techniques are used, as discussed in Section 4.4.
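The hold-off rule can be illustrated with a minimal sketch. The class name, field names, and time units below are assumptions made for illustration only; they do not correspond to any router implementation. The sketch simply enforces Y > X and triggers IP recovery only if the link is still down once Y has elapsed.

```python
from dataclasses import dataclass


@dataclass
class HoldOffTimer:
    """Illustrative hold-off logic; all names here are hypothetical."""
    lower_layer_bound_s: float  # X: upper bound on the optical-layer recovery time
    hold_off_s: float           # Y: time the IP layer waits before acting

    def __post_init__(self):
        # The text requires Y > X to avoid a race with the lower layer.
        if self.hold_off_s <= self.lower_layer_bound_s:
            raise ValueError("hold-off time Y must be greater than bound X")

    def should_trigger_ip_recovery(self, detected_at_s, now_s, link_is_up):
        # Still inside the hold-off window: give the lower layer a chance.
        if now_s - detected_at_s < self.hold_off_s:
            return False
        # Window expired: act only if the lower layer did not restore the link.
        return not link_is_up
```

With X = 50 ms and Y = 100 ms, for example, a query made 60 ms after detection returns False regardless of the link state, whereas a query made after 100 ms returns True only if the link is still down.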
4.2.3 Fault Notification Time
When a link or a node fails, every node directly attached to the failed network element will detect the failure after some period. Thus, for instance, in the case of a link failure, the two nodes interconnected by the failed link will detect the link failure (note that they may not detect it simultaneously), whereas in the case of a node failure, all the neighbors of the failed node will detect the failure. Each node having detected the failure sends an FIS throughout the network. The FIS in an IP network is a new LSA, and the action of sending LSAs is called flooding. In the rest of this chapter, we will refer to this as LSA flooding. A node that receives a new LSA (new compared to the copy of that LSA stored in its local LSDB) must validate the received LSA, store it in its LSDB, and flood it to each of its neighbors (except the neighbor from which it received the new LSA). LSA flooding is always reliable: both IS-IS and OSPF have a reliable flooding mechanism that relies on the retransmission of nonacknowledged LSAs. The LSA flooding mechanism is detailed in Section 4.5; the aim of this paragraph is only to introduce the general IP routing dynamics.
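The flooding rule described above (compare with the local copy, install if newer, flood to every neighbor except the sender) can be sketched as a small simulation. This is a simplified illustration under stated assumptions: the function name, the use of a single integer sequence number to decide whether an LSA is newer, and the adjacency-list input are all hypothetical, and the acknowledgment and retransmission machinery that makes real flooding reliable is deliberately omitted.

```python
from collections import deque


def flood_lsa(neighbors, origin, lsa_id, seq):
    """Simulate flooding of one LSA; returns (per-router LSDBs, messages sent).

    neighbors: dict mapping each router to a list of adjacent routers.
    A higher sequence number is assumed to mean a newer LSA.
    """
    lsdb = {router: {} for router in neighbors}
    lsdb[origin][lsa_id] = seq
    # Each queue entry is one LSA transmission (sender, receiver).
    queue = deque((origin, nbr) for nbr in neighbors[origin])
    sent = 0
    while queue:
        sender, receiver = queue.popleft()
        sent += 1
        # Compare with the local copy: ignore anything that is not newer.
        if lsdb[receiver].get(lsa_id, -1) >= seq:
            continue
        lsdb[receiver][lsa_id] = seq
        # Flood to every neighbor except the one the LSA came from.
        for nbr in neighbors[receiver]:
            if nbr != sender:
                queue.append((receiver, nbr))
    return lsdb, sent


# Hypothetical four-router ring used only to exercise the function.
ring = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
lsdbs, messages = flood_lsa(ring, origin="A", lsa_id="lsa-A", seq=7)
```

In this ring, every router ends up with the new LSA, and the duplicate copies arriving at C and D are discarded by the comparison step rather than flooded again.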
4.2.4 Computation of the Routing Table
Once a router has received a new LSA reporting a network topology change, it must compute a new routing table.37 Strictly speaking, an LSA does not report a topology change but the current topology state. In other words, a router receiving a new LSA must first compare it to the current version of the LSA stored in its LSDB to determine whether a topology change has occurred. Note that the routing computation can be delayed by some amount of time for various reasons, which are explored in Section 4.6. The routing table contains the shortest path from the computing node to each reachable IP prefix. More accurately, the routing table just contains, for each IP prefix, the next hop on the shortest path, the IP metric (indeed, storing the complete shortest path for each IP prefix is not needed and would unnecessarily consume memory space), the outgoing interface, and some lower layer information (related to the layer 2 protocol in use for the respective outgoing interface).
37 Note that the routing table is often called the Routing Information Base (RIB).
The routing computation process requires two operations:
1. The computation of the SPT, the tree that represents the network topology as seen from the computing node (Figure 4.5)
2. The derivation of the next-hop information for each IP prefix (next hop, metric, outgoing interface)
The SPT computed by a node38 X is the tree whose root is X and that gives the shortest path from X to every other reachable node in the network. Consider the network depicted in Figure 4.5. The SPTs computed by the nodes A and G are depicted on the right side of the figure. Hence, for instance, the shortest path from A to D is A-E-F-D and the shortest path from G to F is G-A-E-F. Note that there are two equal-cost paths from G to B: the paths G-H-B and G-A-B. This allows load balancing between G and B along those two paths. In the other diagrams, equal-cost paths are not always depicted in SPTs.
[Figure 4.5 Shortest path tree (SPT) computation. Network topology with link metrics, together with the SPT computed by A and the SPT computed by G.]
38 The terms router and node are used interchangeably in this chapter. When describing an algorithm, we use the generic term node because the algorithm generally applies to any kind of node, such as routers, optical switches, and SONET-SDH switches. On the other hand, when describing an action specific to IP, like LSA flooding, the term router is more appropriate. The two terms are equivalent in this chapter.
A note on terminology: The term equal-cost multipath (ECMP) is usually used to describe the ability to have multiple paths with identical costs. The exact algorithm to compute the SPT is covered in Section 4.6.
Once the SPT has been computed, the next operation consists of populating the routing information base (RIB). Each IP prefix announced by each node present in the SPT is added to the RIB along with its shortest path (next hop). In other words, the SPT provides the shortest path to any reachable node in the network, whereas for each reachable IP prefix the RIB contains the next hop, metric, and outgoing interface. So, for instance, in the example above, if node F announces a network prefix 160.92.23.0 (mask 255.255.255.0), the corresponding IP prefix is added by G to the RIB with A as its next hop because the shortest path to reach F is G-A-E-F. Note that for a prefix 161.23.54.0 (mask 255.255.255.0) announced by B, for instance, two entries would be added to the RIB with the respective next hops A and H, and the packets to that destination address would be load balanced (see Section 4.8 for more details on load balancing). In fact, the RIB can be computed as the SPT is calculated. In other words, there is no need for a two-step approach in which the SPT is entirely computed before the RIB computation starts.
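The two operations can be sketched together as follows. This is an illustrative Dijkstra-style computation, not the exact algorithm of Section 4.6. The topology and link metrics below are assumptions chosen only to be consistent with the paths quoted in the text (A-E-F-D, G-A-E-F, and the two equal-cost paths from G to B); they are not taken from Figure 4.5. The function names are hypothetical, and prefixes are written in CIDR notation (/24 for mask 255.255.255.0).

```python
import heapq


def spf(graph, root):
    """Return (cost, first-hop set) per node; equal-cost next hops are kept."""
    dist = {root: 0}
    next_hops = {root: set()}
    heap = [(0, root)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, metric in graph[u].items():
            nd = d + metric
            # A neighbor of the root is its own first hop; otherwise inherit.
            hops = {v} if u == root else next_hops[u]
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                next_hops[v] = set(hops)
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                next_hops[v] |= hops  # equal-cost path: merge the next hops
    return dist, next_hops


def build_rib(graph, root, prefixes):
    """Map each announced prefix to (metric, sorted next hops)."""
    dist, next_hops = spf(graph, root)
    return {p: (dist[n], sorted(next_hops[n]))
            for p, n in prefixes.items() if n in dist}


# Assumed topology (hypothetical metrics, positive and symmetric).
links = [("G", "A", 1), ("G", "H", 1), ("A", "B", 1), ("H", "B", 1),
         ("A", "E", 1), ("E", "F", 1), ("F", "D", 1), ("B", "C", 1),
         ("C", "D", 3), ("H", "I", 1), ("I", "D", 3)]
graph = {}
for a, b, m in links:
    graph.setdefault(a, {})[b] = m
    graph.setdefault(b, {})[a] = m

rib_g = build_rib(graph, "G", {"160.92.23.0/24": "F", "161.23.54.0/24": "B"})
```

Under these assumed metrics, G installs the prefix announced by F with next hop A, and the prefix announced by B with the two next hops A and H. Note that the first-hop sets are propagated while the tree is being built, which mirrors the remark that the RIB can be computed as the SPT is calculated.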
4.2.5 An Example of IP Rerouting upon Link Failure
In this section, we saw the different phases occurring during IP rerouting upon a network failure event. Let us now illustrate those steps through an example. Again, we emphasize the lack of predictability of the event timing sequence in a distributed computing environment. Several timing sequences can occur, depending on several factors such as the network topology, the link characteristics (propagation delay, level of congestion), and the router performance and load. Figures 4.6 and 4.7 illustrate one possible event timing sequence.
Time T0: The link C-D fails. After some period, router C detects the failure (note that router D also detects the link failure, but the assumption is made in this example that router C detects it first). That period depends on the layer 2 protocol, the IGP parameter setting, and the failure type. In high-speed backbone networks, it is quite common to interconnect routers with optical lambdas using SONET/SDH framing or with native SONET-SDH links, in which case the failure detection is on the order of tens of milliseconds. That event triggers the origination of a new LSA, which is flooded to every neighbor. As discussed in detail in Section 4.5, node C may decide to delay the LSA origination by some (dynamic) period, but for now consider only the general IP routing dynamics, the details of each phase being discussed later. Once the LSA is originated and flooded throughout the network, each router triggers an SPF and computes a new RIB corresponding to the new network state. The dotted arrow indicates the RIB entry for the destination Z (e.g., the next hop computed by router C to reach subnetwork Z is now B). We say that router C has converged. As with LSA origination, the SPF and RIB computation may also be delayed by node C (this is discussed in detail in Section 4.6).
T1: B now receives the new LSA and first determines whether the LSA is a new one. Because the LSA is new (it reflects a topology change with no link between the routers C and D), B floods it to each of its neighbors and recomputes its routing table. In this example, the new shortest path computed by router B to reach Z is now through node F.
T2: H receives the LSA, floods it to I and G (its neighbors), and recomputes its RIB. The new shortest path to Z is via node I.
T3: A receives the LSA, floods it to E and G, and recomputes its RIB. The new shortest path to Z is now via node E.
T4: Finally, G receives the LSA from A and H and recomputes its RIB. The shortest paths to Z are via nodes H and A (as before). E will also receive the LSA from A and F and will recompute its RIB with F as its next hop to reach Z (as before).
[Figure 4.6 An example of IP rerouting dynamics. Panels T0 (the link C-D fails, C converges), T1, and T2. Legend: link cost (when not specified, the link metric is 1), RIB entry for Z, LSA/LSP propagation.]
[Figure 4.7 An example of IP rerouting dynamics (continued). Panels T3 and T4. Legend: link cost, RIB entry for Z, LSA/LSP propagation.]
There are several important notes to mention here:
1. The event timing sequence might have been different.
2. Not all the nodes are affected by the link failure. For instance, node G is not affected by the failure of link C-D (its computed SPT is unchanged); this is also the case for node E. Some optimizations of the SPT computation detect such a condition and do not trigger a routing table computation when the failure does not affect the node’s current SPT. This is the case of incremental SPF, as described in Section 4.14. So which nodes are affected by a network element failure? A node is affected by a link or node failure if the failed resource belongs to its SPT. Another way to determine the set of affected nodes is to compute a reverse SPT rooted at the node terminating the failed link (Figure 4.8).
3. Another very important aspect to notice is the location where the traffic can actually be rerouted upon a network element failure. In the example above, the first node capable of rerouting the traffic coming from node A to destination Z is router B because C does not have any alternate path to Z. This highlights an important property of IP rerouting: As already pointed out, the location of the rerouting node with respect to the failure has a definite impact on the overall convergence. The closer the rerouting node is to the failure, the smaller the failure’s impact on the traffic. So the traffic sent from S to Z, for instance, will be dropped upon the failure of link C-D until node B has converged. As previously mentioned, in existing service provider networks, the number of hops between the rerouting node and the failure varies between three and six on average.
[Figure 4.8 Set of nodes affected by a failure. Network topology with link costs, together with the reverse spanning tree rooted at D.]
4.3 Failure Profile and Fault Detection
The objective of this section is to answer the following set of important questions:
• What are the different failure profiles seen by the IP/Multi-Protocol Label Switching (MPLS) layer?
• What is the set of mechanisms for failure detection?
• How can each failure be unambiguously identified, and what are the requirements for failure characterization?
• What are the implications of the various types of failures for the traffic?
This section applies to both IP and MPLS (discussed in Chapter 5), so although this chapter is entirely dedicated to IP, the term IP/MPLS is often used in Section 4.3.
4.3.1 Failure Profiles
Various profiles of failures can occur in an IP/MPLS network. The aim of this section is not to provide an exhaustive list of all the possible failure types, but it is worth listing the main categories of failures.
Link Failures
Several types of failures result in an IP/MPLS link failure, as follows:
• Fiber cut
• Optical equipment failure
• SONET/SDH equipment failure
• Router interface failure (the port on a router line card or the line card itself fails)
Although these failures are not identical, they all result in a loss of connectivity between two routers and, therefore, can be considered a link failure from an IP/MPLS layer perspective.
Node Failure
There are multiple possible causes of node failures, whose nature has very different implications for traffic forwarding, as follows:
1. Router power supply outage: Some routers have a backup power supply that is automatically activated in case of failure of the primary power supply. Sometimes, a set of power supplies share the load with the capability to absorb the extra load in case one or more power supplies fail. Hence, in this case, a power supply failure does not have an impact on the router (a Simple Network Management Protocol [SNMP] trap is usually sent to a management agent so the power supply replacement can be performed). On the other hand, when a router does not have embedded power supply redundancy, a power supply failure results in the complete cessation of operation of the router.
2. Facility power supply failure: The router loses power when a facility power supply failure occurs in the building, for instance.
3. Route processor failure: In this instance, there are two families of router architectures that must be considered, as follows:
• Centralized architectures: The route processor (RP) is responsible for the control plane tasks (routing table computation, signaling, and management) and is involved in traffic forwarding.
• Distributed architectures: The RP is responsible only for the control plane tasks. Packets transit through the router via line cards (which usually have their own processor and a set of specialized processors) without involvement of the RP.
As described later in this section, the impact of an RP failure differs significantly between the two cases.
4. Software failure: Impact on some specific features, or a crash of the router operating system (OS), because of a software bug.
5. Planned node failure: The phrase planned node failure may sound quite surprising because a failure is usually inherently unpredictable. During the life of a network, routers (and any other active equipment) require hardware and/or software upgrades for various reasons, as follows:
• New interfaces must be added to increase the router connectivity.
• New interface types are required to support higher speeds.
• New functionalities are required to support new services and/or to optimize the network.
Depending on the platform, some of these operations (hardware or software upgrades) require stopping the router operation and can be considered a node failure, with the particular property of being predictable, which obviously helps in reducing or eliminating their impact on traffic forwarding. Note that router upgrades occur relatively frequently in large networks, and the requirement for very high network availability calls for mechanisms that minimize their impact.
4.3.2 Failure Detection
In the previous section, we saw different failure profiles that can occur in an IP/MPLS network. Now we turn to the set of mechanisms that can be used to detect those failures and their respective performance and scalability. There are two families of failure detection mechanisms, as follows:
• Lower layers failure notification
• Hello-based mechanisms
Lower Layers Failure Notification
The role of the lower layers (layers 1 and 2) in detecting and notifying a link failure is essential, and their capabilities vary widely from one layer to another. For instance, the optical and SONET/SDH layers provide very fast link failure notifications (on the order of tens of milliseconds, usually less than 10 ms). By contrast, if two routers are connected via a Frame Relay Permanent Virtual Circuit (PVC) and a failure occurs in the Frame Relay network, the routers will have to wait a significant amount of time (usually several seconds) to be notified of the failure. Generally, the failure notification is obtained by means of protocols such as Local Management Interface (LMI) that allow a router to get the PVC status from the Frame Relay switch. Note that some versions of LMI (e.g., T1.617 Annex D) provide asynchronous mechanisms so the Frame Relay switch can spontaneously notify a PVC status change after a network failure, provided the link connecting the router to the Frame Relay switch is not the cause of the failure. Similarly, if two routers are interconnected via an ATM Switched Virtual Circuit (SVC), the failure notification is performed by the ATM PNNI routing and signaling protocol and usually takes from a few hundred milliseconds up to several seconds, depending on the tuning of the PNNI parameters.
An extreme case is when routers are interconnected via a layer 2 local area network switch. For instance, say three routers are connected to a Gigabit Ethernet switch and are IP/MPLS neighbors (a routing adjacency is active between each pair of routers). The failure of the link between a router and the switch (e.g., a port of the switch or the router interface connecting the router to the switch) will not be detected by the other two routers at the layer 2 level. Some mechanisms for fast failure detection in such an environment have been proposed. In a nutshell, the Multiaccess Reachability Protocol (MARP) allows a router to be notified of the local failure between an IP neighbor and the switch. Suppose three routers R1, R2, and R3 are connected to a layer 2 switch S. If R1 establishes a routing adjacency with R2 and R3, then using MARP, R1 will inform S of its interest in being explicitly notified of a failure of either the connection R2-S or the connection R3-S. When link R2-S fails, as soon as the failure is detected by switch S, S immediately sends a failure notification to R1. Such a protocol greatly improves the failure detection time in this type of environment.
Mechanism Based on Hello Protocols
Failure detection by means of hello-based protocols has been used for several decades and relies on the principle of sending a periodic hello message between two neighbors. When one of the routers stops receiving hello messages for a configurable period, it concludes that either the link between them or the neighbor itself has failed.
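The principle can be sketched in a few lines. The class below is a hypothetical illustration, not the hello state machine of any particular protocol: the names, the explicit timestamps passed by the caller, and the default multiplier of three hello intervals for the configurable period are all assumptions.

```python
class HelloMonitor:
    """Tracks the last hello heard from each neighbor (illustrative only)."""

    def __init__(self, hello_interval_s, dead_multiplier=3):
        # Configurable period after which a silent neighbor is declared failed.
        self.dead_interval_s = hello_interval_s * dead_multiplier
        self.last_heard = {}

    def hello_received(self, neighbor, now_s):
        self.last_heard[neighbor] = now_s

    def failed_neighbors(self, now_s):
        # A neighbor is declared failed once it has been silent for longer
        # than the dead interval; link and node failures look the same here.
        return sorted(n for n, t in self.last_heard.items()
                      if now_s - t > self.dead_interval_s)
```

The sketch also makes the trade-off visible: the detection time is bounded below by the dead interval, so faster detection requires more frequent hellos and therefore more processing per neighbor.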
IGP Hellos
Thanks to more powerful CPUs, some router architectures allow sending IGP hellos at a relatively high frequency (on the order of hundreds of milliseconds).
Scalability impact: Running IGP hellos is not a cost-free operation for the routing task because multiple checks are performed when an IGP hello packet is received. Thus, the frequency cannot be increased without a nonnegligible impact on the router CPU.
Bidirectional Forwarding Detection
Other hello protocols can also be used. Bidirectional forwarding detection (BFD) (see [BFD]) is a low-overhead hello mechanism that is independent of any routing protocol and is designed to be light and fast. BFD is undoubtedly an interesting alternative to tuning IGP hellos to a higher frequency. Note also that because BFD is not tied to a routing protocol, it can be used in parts of the network where no routing protocol is in use (e.g., between two autonomous system border routers [ASBRs]).
Scalability impact: As opposed to IGP hellos, BFD has been designed to require very limited processing overhead, with the objective of quickly detecting a forwarding plane failure. Moreover, a distributed implementation drastically improves the overall scalability in environments in which the number of neighbors is quite large.
Layer 2 versus Hello-Based Failure Detection Mechanisms
As already pointed out, routing protocol fast hello mechanisms must be used with care to avoid scalability impact. Moreover, the failure detection time is likely to be significantly higher than with layer 2 link failure notification, although it might be acceptable in some cases. That being said, there are also some situations in which a combination of both failure detection mechanisms may be required. For instance, some failures, like a forwarding plane failure (e.g., a line-card processor failure), may require hello-based mechanisms because layer 2 will not fail in this case; so even if a layer 2 failure detection mechanism is available, it might not be sufficient and should be complemented by hello-based mechanisms.
Which hello mechanism should be selected is usually quite a difficult question. As mentioned earlier, there are multiple fast hello mechanisms: IGP, BFD, and RSVP.39 Clearly, IGP hellos (although implementations have been considerably optimized over the last couple of years) have not been designed to run at very high frequency because the processing of an IGP hello is not a cost-free operation. Beyond that, no general rule can be derived. Each method has pros and cons and should be selected based on the set of objectives. For instance, if the convergence time objective is on the order of several seconds, then being able to detect the failure in a few tens of milliseconds is not a requirement, and tuning the IGP hellos is an appealing approach that does not require deploying additional failure detection mechanisms. If more stringent failure detection times are required and layer 2 does not provide any fast failure detection mechanism, then BFD is certainly an interesting candidate (the choice will depend on the implementation of the protocol in the router, the network design, and the required failure detection times).
4.3.3 Failure Characterization
This section deals with failure characterization and explains why distinguishing a link failure from a node failure is not always straightforward. Consider three possible failure scenarios:
1. A node and its attached interfaces fail: Example: a power supply failure. In this case, all its neighbors will detect the failure because the set of attached links will also fail.
2. A node fails but its attached interfaces do not: Example: the node control plane (RP) fails, but the platform is distributed and line cards keep forwarding the traffic.
3. A link between two nodes fails: Example: The line-card interface of one of two interconnected routers fails, or the link fails because of a fiber cut, for instance.
We saw in the previous section that the link failure detection mechanism can be either the underlying layer 1 or layer 2 failure detection (i.e., SONET-SDH, optical layer) or some hello-based protocol (such as BFD or IGP hellos). Clearly, a neighboring router cannot make any distinction between failure scenarios 1 and 3 listed above because in both cases what it detects is a link failure. Existing mechanisms are therefore incapable of differentiating a link failure from a node failure in some circumstances.
39 RSVP hello is a Multi-Protocol Label Switching–specific fast hello mechanism, which is covered in Chapter 5.
4.3.4 Analysis of the Various Failure Types and Their Impact on Traffic Forwarding
As previously mentioned, there is a large set of possible failures that can occur in a network. This section analyzes the impact on traffic forwarding of each failure profile listed earlier in this section and the set of failure detection mechanisms that can be used to detect those failures.
Link Failure
Link failures always affect the data traffic until an alternate path is found and the traffic is rerouted over it.
Node Failure
As previously mentioned, there are multiple possible causes of node failures, and their nature has a different impact on the forwarded traffic.
1. Power supply outage: A router power supply outage (in the absence of power supply redundancy mechanisms) provokes both a control plane and a forwarding plane failure, so the traffic is black-holed until it is rerouted over an alternate path.
2. RP failure: The impact on traffic forwarding depends on the router architecture. On some centralized platform architectures, an RP failure usually implies a failure of both the control plane (routing, signaling, and management) and the data plane (packets sent to the failed router are simply dropped). That being said, there are also some centralized platform architectures in which the control plane functions are separated from the data plane. By contrast, on some distributed platform architectures, the RP is responsible only for the control plane; packets transit through the router via line cards without passing through the RP. Therefore, an RP failure does not affect the data plane, and packets are still forwarded by the router; only the control plane fails. The expected behavior in this case is the following: After some period,40 the IGP adjacency will go down, the IGP neighbors of the failing router will flood an updated LSA (router link LSA for OSPF) or LSP (for IS-IS), and the normal IGP rerouting operations will occur.
3. Software failure: The impact of a software failure on forwarded traffic is highly coupled to the nature of the software failure and the system architecture. It can vary from the simple generation of a warning message followed by an automatic recovery (via a restartable module) handled by the OS to a situation in which the router is completely affected and can no longer recover from the failure, which might require a complete reinitialization. In the latter case, the traffic is black-holed until the control plane of the router’s neighbors detects the node failure. Note that OSs with a modular architecture usually allow limiting the failure to the software component that actually failed; consequently, the failure has a limited scope and the failed software component can be restarted independently of the other modules.
4. Planned node failure: Because by definition the failure is “planned,” various actions can be taken before performing the upgrade so that the traffic may be gracefully rerouted around the node. Various methods can be used to meet that goal: For instance, the link costs to every adjacent node can be manually increased so the router will be smoothly excluded from the shortest paths of the other routers in the network. Another method consists of having the node to be upgraded originate a new OSPF LSA or IS-IS LSP explicitly indicating that the node should be avoided in the SPT computation.41 In both cases, the consequence is that the other routers in the network will smoothly reroute the traffic around the node in question. The node to be upgraded will no longer carry any transit traffic and can be safely upgraded without risking traffic disruption. It is worth mentioning that some software and hardware architectures support “hitless” software and hardware upgrades without requiring any of the actions mentioned earlier.
40 This period depends on the IGP timer’s configuration.
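The first method for a planned node failure (raising the link costs around the node before the upgrade) can be illustrated with a minimal sketch. The topology, the metric value of 10,000, and the function names are assumptions for illustration only; they are not operational recommendations or the behavior of any particular IGP implementation.

```python
import heapq


def shortest_path(graph, src, dst):
    """Return (cost, path) of one shortest path, or None if unreachable."""
    heap = [(0, src, [src])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, metric in graph[node].items():
            if nbr not in visited:
                heapq.heappush(heap, (cost + metric, nbr, path + [nbr]))
    return None


def cost_out(graph, node, high_metric=10_000):
    """Copy of graph in which every link attached to node gets a high metric."""
    new = {u: dict(nbrs) for u, nbrs in graph.items()}
    for nbr in new[node]:
        new[node][nbr] = high_metric
        new[nbr][node] = high_metric
    return new


# Hypothetical topology: S reaches Z through B (cost 2) or through C (cost 4).
topo = {"S": {"B": 1, "C": 2}, "B": {"S": 1, "Z": 1},
        "C": {"S": 2, "Z": 2}, "Z": {"B": 1, "C": 2}}

before = shortest_path(topo, "S", "Z")
after = shortest_path(cost_out(topo, "B"), "S", "Z")
```

Before the change, traffic from S to Z transits B; after the links of B are costed out, it moves to the path through C while B itself remains reachable, so the node can then be taken down without carrying transit traffic.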
4.4 Dampening Algorithms

This section covers the important notion of dampening. Stability is an important property of recovery mechanisms and should always be carefully considered when trying to achieve fast convergence. Fast convergence implies reacting quickly to network changes. In the case of network instabilities (e.g., because a network element experiences state changes at a high frequency), however, the overall recovery mechanism should not overreact. Overreaction could have very undesirable side effects in the network, instigating other failures that could themselves potentially create a snowball effect. Moreover, if a resource “flaps” (i.e., a link or node constantly goes up and down) and the traffic is systematically rerouted through the flapping resource once it is restored, the traffic will experience multiple consecutive failures. This can be highly undesirable because many short failures can be even worse than fewer, longer failures. One solution is to implement mechanisms that dampen the revertive process. One of the virtues of dampening mechanisms is that they preserve network stability under unstable network conditions. The basic principle of dampening is to slow down the effect of network instability. This can be achieved by means of various algorithms at different layers, which can be deployed at various locations in the network.

41 For example, with IS-IS a specific bit called the overload bit is set in the LSP.
Three dampening algorithms are described below:

1. Up-state timer: When an interface (or more generally a network resource) fails, it is considered down immediately (unless a “hold-off” time is implemented). Then, as the interface flaps, it is not considered operational (in an “up” state) again until it has remained operational for a fixed period (the value of the “up” timer) (Figure 4.9). This kind of algorithm has frequently been used in SONET/SDH networks.

Figure 4.9 Illustration of up-state timer dampening algorithm. (Panels: actual interface state and advertised state of the interface, up/down versus time; T1 marks the up-state timer.)

2. Interface dampening using an exponential decay algorithm: With the exponential decay algorithm, when a link goes down the interface change is immediately reflected to the routing protocol, which triggers the appropriate action. If the interface starts flapping (goes “up” and “down”), it accumulates penalties until a threshold is reached. From then on the router considers the interface down even if the interface recovers (goes into the “up” state). For the interface to be considered “up” again, the accumulated penalties must decrease along an exponential curve until a second threshold is reached. This guarantees that an unstable interface is no longer advertised as “up” under unstable conditions (Figure 4.10). Let us now define more precisely the set of parameters used by such a dampening algorithm:

Suppress and reuse thresholds: When the number of accumulated penalties exceeds the “suppress” threshold, the router dampens the facility (e.g., an interface), that is, starts considering it “down,” until the number of penalties decreases and crosses the “reuse” threshold (see Figure 4.10 for an illustration).

Half-time period: This parameter controls the speed at which the number of penalties decays exponentially. When an interface, for instance, is put in “dampened” mode because the number of accumulated penalties has crossed the suppress threshold, if it stops flapping, the number of penalties is reduced by half after each half-time has elapsed. Of course, if the interface continues to flap, the penalties keep increasing.

Maximum suppress time: This represents the maximum amount of time an interface can stay in dampened mode. The value of this parameter, the reuse threshold, and the half-time period together determine the maximum number of penalties that an interface can accumulate.

Figure 4.10 Illustration of interface dampening. (Panels: actual interface state, accumulated penalties with the maximum, suppress threshold, and reuse threshold and the exponential decay, and advertised interface state, all versus time.)

3. Exponential back-off algorithm: The exponential back-off algorithm has been implemented by some router vendors for both the LSA/LSP propagation and the SPF computation trigger (see Sections 4.5 and 4.6). Let us consider the example of LSA origination. Without any particular measure, each time a router connectivity state changes, the router generates a new LSA. This not only generates flooding in the network but also triggers an SPF and routing table computation on each node of the area. Moreover, this can potentially generate extra work on the ABRs for OSPF (generating some new summary routes, ASBR summary) and on the L1L2 routers for IS-IS. To achieve fast convergence, the first time a link fails, it is highly desirable to trigger an LSA origination so that every router in the network can quickly recompute a new routing table. On the other hand, when the link flaps, a desirable behavior would be to slow down the generation of new LSAs, which is achieved using the algorithm detailed below. Later in this chapter we will explain that it may also be desirable to dampen the SPF execution on the routers upon receiving successive LSAs reporting network changes.
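The exponential decay dampening of item 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the penalty charged per flap, and all numeric values are hypothetical, and time is passed in explicitly (in seconds) so that the behavior is easy to follow. The penalty ceiling is derived from the reuse threshold, the half-time, and the maximum suppress time, as stated above.

```python
class InterfaceDampener:
    """Sketch of exponential-decay interface dampening (illustrative values only)."""

    def __init__(self, penalty_per_flap=1000.0, suppress=2000.0, reuse=1000.0,
                 half_time=5.0, max_suppress_time=20.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress = suppress
        self.reuse = reuse
        self.half_time = half_time
        # Ceiling chosen so that decaying from the ceiling down to the reuse
        # threshold takes exactly max_suppress_time.
        self.max_penalty = reuse * 2 ** (max_suppress_time / half_time)
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty is halved after each half_time that elapses.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_time)
        self.last_update = now
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False

    def flap(self, now):
        """Record one interface failure at time `now` (seconds)."""
        self._decay(now)
        self.penalty = min(self.penalty + self.penalty_per_flap, self.max_penalty)
        if self.penalty > self.suppress:
            self.suppressed = True

    def is_suppressed(self, now):
        """True while the interface must be advertised as down."""
        self._decay(now)
        return self.suppressed
```

With these hypothetical values, three flaps one second apart push the penalty above the suppress threshold, and the interface is reused only once the penalty has decayed below the reuse threshold a few seconds after the last flap.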
Figure 4.11 Illustration of exponential back-off timer algorithm. (Panels: actual interface state, up/down, and LSA generation versus time; the successive intervals are labeled X, Y, 2*Y, and 2*Z.)
Description of the exponential back-off algorithm: The following parameters are defined:

• X: initial time before declaring the link down (i.e., originating a new LSA) after the first failure has occurred
• Y: time before declaring the link down after the second failure
• Z: maximum time

When the link goes down the first time, the router waits X milliseconds before advertising a new LSA. Then, if a second state change occurs, the router waits Y milliseconds. If the link keeps on flapping, 2 * Y has to elapse before a new LSA is originated, then 4 * Y, and so on up to a maximum of Z. The LSA origination timer is reset to its original value if no state change occurs during a period of time equal to 2 * Z (Figure 4.11).

Note that there are various locations where the dampening process can be implemented in a network. Indeed, dampening can be deployed at the interface level: As the link starts flapping, the state of the interface is entirely controlled by some process in charge of managing the interface state, underneath the IGP layer, which uses a dampening algorithm. Another approach is for the process in charge of managing the interface state to transparently reflect the interface state to the IGP and let the IGP take care of the dampening algorithm. Those two approaches are not mutually exclusive.
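The X, Y, Z timer sequence described above can be sketched as follows. This is a minimal illustration, not a vendor implementation: the class and method names are hypothetical, all three parameters are expressed in milliseconds for simplicity, and the quiet period is measured from the last state change.

```python
class BackoffTimer:
    """Delay before the next LSA origination: X, Y, 2*Y, 4*Y, ... capped at Z."""

    def __init__(self, x_ms, y_ms, z_ms):
        self.x, self.y, self.z = x_ms, y_ms, z_ms
        self.events = 0          # state changes seen since the last reset
        self.last_event = None   # time of the last state change (ms)

    def on_state_change(self, now_ms):
        """Return how long to wait (ms) before originating the LSA."""
        # No state change for 2*Z: return to the initial behavior.
        if self.last_event is not None and now_ms - self.last_event >= 2 * self.z:
            self.events = 0
        self.last_event = now_ms
        self.events += 1
        if self.events == 1:
            return self.x
        # Second change waits Y, third 2*Y, fourth 4*Y, ... never more than Z.
        return min(self.y * 2 ** (self.events - 2), self.z)
```

For example, with the hypothetical values X = 50, Y = 200, and Z = 5000, a link flapping once per second is answered quickly at first and then progressively more slowly, and the timer returns to X once the link has been stable for 2 * Z.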
4.5 FIS Propagation (LSA Origination and Flooding)

As briefly described in Section 4.2, every router detecting a topology state change (e.g., a network element failure) triggers the sending of a new LSA, a process called LSA origination. Then, upon receiving a new LSA, a router floods it to each of its neighbors; this is known as the flooding procedure. Before detailing the origination and flooding mechanisms, it is worth mentioning several interesting aspects of LSA flooding:

LSA flooding is reliable: Every LSA sent to a neighbor must be acknowledged. If it is not acknowledged after some period, the LSA is retransmitted, which makes the process of LSA flooding reliable.

Two-way connectivity check⁴²: When a link fails, the two routers connected to the failed link will each originate a new LSA to report the loss of adjacency over that link. So two new LSAs will be flooded throughout the network. Of course, the two LSAs will not be simultaneously received by the other routers in the network and the timing sequence will vary based on the LSA originating routers, the parameter settings, and the network topology, to mention a few parameters. So a two-way connectivity check procedure is performed during the SPF operation (described hereafter): a link is considered operational only if it is reported in the LSAs of both adjacent routers. In other words, if a link L between routers X and Y is reported in the LSA originated by X but not in the one originated by Y, the link is not taken into account in the path computation. The receipt of just one LSA is therefore sufficient for the link to be considered “down” in case of link failure.

LSA origination triggers and frequency: A new LSA is always originated when one of the following events occurs: refresh, local IP prefix change, local connectivity change.

Number of flooded LSAs: When a link fails in a full mesh network topology, the number of flooded LSAs is O(n²); indeed, when a link fails, the LSA will be sent to n neighbors that will themselves reflood the LSA to n neighbors. In the case of a node failure in a fully meshed network, the number of flooded LSAs is O(n³) because a node failure corresponds to the failure of n links and for each link failure O(n²) LSAs are flooded.
Note that this really corresponds to a worst case and the number of flooded LSAs does not cause any problem in practice; the number of flooded LSAs received by any node will be much smaller. That being said, LSA flooding relies on flooding a newly received LSA over all the interfaces (except the interface over which the new LSA was received⁴³), which can be significant if the degree of connectivity (number of neighbors) is large. An obvious case where such a flooding procedure may be quite inefficient is when two routers X and Y are interconnected via multiple links L1, L2, ..., Ln. If X receives a new LSA from another neighbor Z, after having stored a copy of that LSA in its local LSDB, it will flood n copies of the same LSA to its neighbor Y (one per link L1, ..., Ln), which has the following consequences: Although Y will just install one copy of that new LSA in its LSDB, some bandwidth will be unnecessarily consumed. Moreover, this will also consume CPU on both X and Y. Finally, upon receiving the first copy of the new LSA on a link Li, Y will acknowledge it and will retransmit the LSA over all the n-1 other links back to X. Some ideas have been proposed in [LSA-FLOOD2] to modify the flooding procedure from a per-link to a per-neighbor basis, while still preserving compatibility with existing procedures.

Because every router relies on the receipt of a new LSA to perform traffic recovery, it is of the utmost importance to prioritize the LSA origination and flooding mechanism to reduce the total rerouting time.

42 Note that the two-way connectivity check may be disabled in some very specific cases.
43 There is just one exception to that rule in the case of OSPF: a designated router on a local area network does reflood a new LSA received from a neighbor to every other neighbor using the same interface (the local area network interface).
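The two-way connectivity check performed during SPF can be illustrated with a small sketch. The LSDB is modeled here, purely as an assumption, as a dictionary mapping each router to the set of neighbors listed in its own LSA; the function name is hypothetical.

```python
def usable_links(lsdb):
    """Return the links reported by BOTH end routers.

    lsdb: dict mapping a router name to the set of neighbors it advertises
    in its own LSA. A link is kept only if each end lists the other.
    """
    links = set()
    for router, neighbors in lsdb.items():
        for nbr in neighbors:
            # Two-way check: the neighbor must also report the link.
            if router in lsdb.get(nbr, set()):
                links.add(frozenset((router, nbr)))
    return links
```

For instance, if the LSA of X still lists Y but the newer LSA of Y no longer lists X, the link X-Y is excluded from the path computation even though only one of the two LSAs has been received.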
4.5.1 LSA Origination Process

In this section, we will see the various events that trigger the origination of a new LSA.

1. Link failure: When a link L between two routers X and Y fails, both X and Y will detect the failure after the link failure detection time. This triggers the origination of new LSAs.

2. Node failure: When a router detects the failure of one of its neighbors, this also triggers the origination of a new LSA. Note that with a link failure, a router does not have the ability to determine whether the failure is a link or a router failure. Multiple mechanisms⁴⁴ can be used to tell a link failure from a node failure, but in general link state protocols just report a loss of adjacency with the neighbor, which can be provoked by the detection of a layer 2 link failure or by an IGP hello timeout.

3. IP prefix reachability change: For instance, let’s suppose that an IP address is added (or deleted) on a router. This also triggers the flooding of a new LSA. Another typical example is the case of routers that participate in more than one area or level. Those routers are called ABRs in OSPF (they interconnect multiple OSPF areas) and L1L2⁴⁵ routers in IS-IS (they interconnect multiple IS-IS levels). A topology change in one area/level can result in a reachability change. Indeed, some prefixes might no longer be reachable, new prefixes might now be reachable, or the metrics might have changed. As already mentioned, IS-IS and OSPF are slightly different in terms of LSA format. In the case of an ABR, for instance, OSPF will flood one LSA (type 3 for interarea) for each interarea route change, while IS-IS will flood an entire new LSP, which contains all the data related to the L1L2 router.

4. LSA refresh: Link state protocols periodically refresh their LSAs. Each router maintains a timer whose expiration triggers the refresh of its locally originated LSAs. The LSA sequence number is incremented and the LSA is originated with the new sequence number. Note that this operation is performed regardless of whether the LSA content has changed and guarantees that any router that could have experienced a locally corrupted LSDB will receive a new copy of that LSA on a regular basis. In OSPF, this timer is called LSRefreshTime and is an architectural constant, which is set to 30 minutes and cannot be changed. Some proposals have been made to overcome this limitation (see [LSA-FLOOD1]). By contrast, the corresponding IS-IS timer (called refresh time) is configurable and can have a maximum value of more than 18 hours (large IS-IS networks usually set the refresh timer to this maximum value to reduce unnecessary flooding).

5. Configuration changes: For example, a link metric change.

44 Some heuristics can be used in some particular network scenarios. For instance, let’s suppose that two routers are interconnected via multiple links. If just one link fails, this is probably due to a simple link failure and not a node failure. Other mechanisms could be used (e.g., sending a keep-alive message to the node via an alternate diverse path via some tunneling mechanism).
45 L1L2 stands for level 1–level 2.

Hence, once an LSA origination event occurs, the router builds the new LSA and floods it to each of its neighbors. Bear in mind that each LSA is flooded throughout the network area (or the whole network for some types of LSA with OSPF) and possibly triggers a routing table computation on every router in the network (if the LSA is a truly “new” LSA; i.e., if its content has changed and this LSA is not just a refresh). One can easily realize that LSA origination can have a significant impact on the network if its frequency is too high, which could occur in the case of network instabilities if no prevention mechanisms were in place. Let’s consider the case of a link whose state keeps changing because of some failed component (usually referred to as a link flap). If a new LSA is originated at each link state change, the number of flooded LSAs and SPF computations triggered on each node could provoke a network collapse or could just needlessly hog the routers’ CPU and consume network bandwidth. To prevent such undesirable behavior, dampening mechanisms can be used by link state protocols to control the LSA origination process (and the SPF triggering process, as mentioned in Section 4.6).
Various existing implementations use an exponential back-off algorithm as described in Section 4.4.

Example with IS-IS on a Cisco router:

    router isis
     ...
     lsp-gen-interval A B C   /* Line of command referring to the LSP propagation tuning */
     ...

The algorithm used strictly corresponds to the exponential back-off algorithm described in Section 4.4. Note that on a Cisco router, the variables A, B, and C correspond to the variables Z, X, and Y, respectively, described in the exponential back-off algorithm in Section 4.4. As already pointed out, such LSA dampening algorithms can be used in conjunction with interface dampening that could itself use other dampening algorithms (e.g., the exponential decay algorithm described in Section 4.4). So if the link flaps, when the state of the link first changes, the router waits for B milliseconds. When the link goes up again or another attached link fails, the router then waits for C milliseconds. Then, the time between successive LSA originations increases exponentially up to the maximum value of A seconds. The router returns to the initial behavior if no LSA origination trigger occurs during 2 * A seconds. This efficient algorithm allows for quick reaction to failures while protecting the network from an LSA flooding storm in the case of instability.
Parameters Tuning

How those parameters should be set depends heavily on the network characteristics and the convergence objectives. A case study is proposed in Section 4.6.
4.5.2 LSA Flooding Process

When an LSA is flooded, multiple operations must be performed at each hop, all of which contribute to the overall flooding delay:

1. LSA processing: The receiving node performs a set of operations to decide whether the LSA must be flooded.
2. LSA queuing: If the LSA must be flooded, the router must transmit the LSA on each appropriate interface.
3. Propagation delay: The time for the LSA to travel from node to node.
LSA Processing

Once a new LSA is originated, it is flooded to every neighbor. When a router receives an LSA, it first checks whether the LSA is present in its LSDB. If the LSA is present, the router verifies whether the received LSA is an older or a newer version (this is done by checking the sequence number; if the sequence number is identical, the checksum field is checked to determine whether the LSA is more recent). Multiple scenarios can occur:

• If the received LSA is older than the local copy, the LSA is acknowledged, the local (newer) copy is sent back to the sending node, and the flooding procedure stops.
• If the LSA is neither newer nor older (same sequence number), the router just acknowledges the LSA receipt to its neighbor (because flooding is a reliable mechanism, LSA receipts are always acknowledged) and does not flood the LSA to any other neighbor.
• If the LSA is newer, an acknowledgment is sent to the sending neighbor, the LSA is stored in the local LSDB, and a copy of the LSA is flooded to every neighbor. Then a new route computation is triggered (see Section 4.6 for more details).

It is worth noting that a router does not flood an LSA that is not newer than the local copy of that LSA in its LSDB; this prevents the creation of infinite forwarding loops of LSAs!
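The three scenarios above can be summarized in a short sketch. It is a simplification under stated assumptions: the LSDB is modeled as a dictionary keyed by a hypothetical LSA identifier, an LSA instance is reduced to a (sequence number, checksum) pair, a larger checksum is assumed to win when sequence numbers are equal, and the returned action names are purely illustrative.

```python
def process_lsa(lsdb, lsa_id, seq, checksum):
    """Decide what to do with a received LSA and update the LSDB if needed.

    lsdb maps lsa_id -> (seq, checksum) of the locally stored instance.
    Returns the list of actions to perform, in order.
    """
    received = (seq, checksum)   # compared by sequence number, then checksum
    local = lsdb.get(lsa_id)
    if local is not None and received < local:
        # Older than the local copy: acknowledge and send the newer copy back.
        return ["ack", "send_local_copy"]
    if local is not None and received == local:
        # Same instance: acknowledge only, do not flood.
        return ["ack"]
    # Newer (or unknown) LSA: store it, flood it, then schedule the SPF.
    lsdb[lsa_id] = received
    return ["ack", "store", "flood", "schedule_spf"]
```

Because an LSA that is not newer than the stored instance is never reflooded, a copy that comes back to a router that has already seen it stops there.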
Another important fact to underscore is related to the prioritization of the flooding operation over the SPF computation. Indeed, upon receiving a newer LSA, the router learns that it must perform a new route computation. So an implementation could decide to first update its routing table and then flood the LSA to other neighbors. This would undoubtedly have a negative impact on the overall network convergence so an efficient implementation should always flood the new LSA before triggering a routing computation to make sure that the LSA (and so the fault indication) is propagated as quickly as possible to every other node in the network. Note there are a few specific cases where triggering SPF before flooding might be more optimal, but in general it is preferable to first flood the new LSA and then trigger an SPF.
Queuing Delays

When an LSA is transmitted over an interface, it potentially competes with other data packets. Hence, depending on the interface congestion level, the LSA packet may experience some nonnegligible delays if no particular measures are taken to provide the required quality of service (QoS). This is especially important during network failures, when the level of congestion is likely to increase because of rerouted flows. The requirement for QoS mechanisms also depends on the network design. For instance, if the network is highly overprovisioned (the links are not congested even during network failures), then the queuing delays are probably negligible even without any particular QoS mechanism (in this case, every packet is queued in a first-in first-out (FIFO)⁴⁶ queue without any discrimination). On the other hand, if a link might be congested (even temporarily), appropriate QoS mechanisms should be provided to handle IGP control plane packets, in particular the LSA packets.⁴⁷ More precisely, this usually implies putting in place the following components:

1. Packet marking: The differentiated services (DS) header field⁴⁸ of each IP packet is marked with a particular Diffserv code point (DSCP) value (see [RFC2474]). In a nutshell, this consists of “coloring” packets so that every router processing the packet can recognize the class of traffic the packet belongs to in order to provide the appropriate treatment in terms of QoS. Usually, the task of packet marking is performed at the edge of the network for the user traffic. In the case of an IGP control plane packet, each router originating an IGP packet is responsible for marking it with the appropriate DSCP. Note that this only applies to OSPF because IS-IS uses connectionless network service (CLNS). That being said, an IS-IS implementation should appropriately treat the IS-IS packets using internal prioritization mechanisms.

2. Packet scheduling: When a router has to send packets out of an interface, it can use the DSCP (“color”) to provide the appropriate QoS to the packet, which in practice means two things:
• Queue the packet based on its DSCP.
• Potentially use congestion avoidance mechanisms like random early detection (RED).

46 A FIFO queue is such that packets are serviced in the order of their arrival.
47 Note that generally speaking the control plane packets should receive the appropriate QoS. An obvious example is IGP hello packets. If link congestion is so high that a hello packet is unacceptably delayed, this can lead to undesirable loss of adjacencies and network instability.
48 In IP version 4, the DS field is the type-of-service (ToS) field.
Queue the Packet Based on Its DSCP

Critical packets will be queued in a high-priority queue, which will guarantee that those packets receive enough bandwidth and experience minimal jitter and loss. Multiple queuing systems are available in modern routers. For instance, the queuing system can be made of N queues where each queue Ni gets some fraction of the link bandwidth (usually referred to as a weight). Optionally, a maximum queue length can also be specified for each queue. Then a local policy is defined that specifies the set of DSCPs (“colors”) that match each queue. Sophisticated queuing systems also provide the ability to provision preemptive queue(s) that are always served before any other queue in the system; the queuing scheduler serves each queue proportionally to its weight, but each time a packet is dequeued from a queue Ni, it examines whether there are packets waiting in the priority queue. If so, the packets in the priority queue are immediately served. Consequently, such a priority (also called preemptive) mechanism offers minimal queuing delays to the packets queued in the priority queue. Two interesting comments can be made at this point:

• To guarantee minimal queuing delay and jitter, the proportion of high-priority traffic should be kept below some threshold with respect to the total amount of traffic. Indeed, queuing is about precedence, so if all the packets have a high priority, then the notion of priority no longer makes sense. Various papers have been written on this topic to determine what this threshold value should be to bound both the queuing delay and the jitter, but the numbers vary based on the traffic profile and the level of conservatism. Usually, values vary between 20% and 30%. Note that such values should take failure situations into account, provided that the objective is to also guarantee the QoS in the case of a network element failure. The type of queuing system also plays an important role here. The weights assigned to the delay-sensitive traffic queue should be determined based on the QoS objectives at steady state and during failure. On the other hand, if the sensitive traffic is served by a preemptive queue, it will always receive the best treatment, both in steady state and during failure.

• One possible side effect of preemptive systems is the well-known phenomenon of famine (starvation). Indeed, a preemptive queuing system may lead to the undesirable effect that the low-priority queues receive very little bandwidth, if any. This is particularly true with hierarchical queuing systems, in which multiple queues are served in a hierarchical preemptive mode of operation. Note that the proportion of high-priority traffic is usually kept under some threshold for the reasons mentioned earlier, so the risk of famine is extremely limited. Furthermore, some queuing systems provide additional mechanisms (sometimes called rate limiters) that allow limiting the amount of bandwidth allocated to the priority queue.

What if a queue Ni does not use its bandwidth allocation? Suppose that a queue Ni has been configured to get a percentage of the link capacity and does not use it, simply because the traffic that matches that queue is not sufficient to use the configured percentage of the link speed. By contrast with time division multiplexing (TDM) systems, the bandwidth is not wasted and is usually redistributed to the other queues proportionally to their respective weights.
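The interplay between weighted queues and a preemptive queue can be illustrated with a toy scheduler. This is a rough sketch under simplifying assumptions: weights are expressed as a number of packets per round rather than as a fraction of link bandwidth, no new packets arrive while the scheduler runs, and the function and packet names are hypothetical.

```python
from collections import deque

def schedule(priority, queues, weights, rounds):
    """Return the order in which packets leave the interface.

    priority: deque served before any other queue (preemptive queue)
    queues:   list of deques, one per weighted queue Ni
    weights:  packets each queue may send per scheduling round
    """
    sent = []

    def drain_priority():
        while priority:
            sent.append(priority.popleft())

    for _ in range(rounds):
        for queue, weight in zip(queues, weights):
            for _ in range(weight):
                drain_priority()      # priority queue is checked before every dequeue
                if not queue:
                    break             # unused share is left to the other queues
                sent.append(queue.popleft())
    drain_priority()
    return sent
```

In this sketch the packets placed in the priority queue (for example, IGP packets) always leave first, and a queue that has nothing to send simply gives up the rest of its share for that round.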
Congestion Avoidance Mechanisms

Congestion avoidance mechanisms like RED can be used to perform selective packet discard upon queue congestion. The idea behind congestion avoidance algorithms is the following: transport protocols like the Transmission Control Protocol (TCP) react to packet loss by reducing the rate at which they send packets. As a queue grows, packets experience higher delays, which TCP senders take into account by dynamically increasing the timer that determines how long a sender waits before retransmitting an unacknowledged packet. When the queue reaches its maximum length, all arriving packets are dropped, which provokes an “overreaction”: every TCP sender sending traffic over the congested link suddenly and drastically reduces its sending rate. This has the undesirable effect of provoking traffic oscillation, in which a link gets congested, then utilization drops (because all the TCP senders drastically reduce their sending rates), then the link utilization increases again, and so forth. So the idea of congestion avoidance algorithms like RED is to use a probabilistic dropping mechanism that is a function of the average queue size (computed with a low-pass filter) and starts dropping packets when the average queue size crosses some threshold. When the average queue size exceeds some maximum value, all the packets are dropped. RED thus has a progressive effect: The number of TCP senders affected increases smoothly as the average queue size grows. A variant of RED (called weighted RED [WRED]) has been proposed to provide higher granularity per traffic profile (it allows different thresholds and discard rates for different DSCPs in a queue). Note that other variants of RED have also been elaborated, like flow RED (FRED), but RED and WRED are the most commonly used congestion avoidance mechanisms.
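The probabilistic dropping described above can be sketched as follows. This is a simplified illustration of the RED principle rather than a complete implementation: the function names, the filter weight, and the linear ramp between the two thresholds are assumptions made for the example.

```python
def update_average(avg, queue_len, weight=0.002):
    """Low-pass filter (exponentially weighted moving average) of the queue size."""
    return (1 - weight) * avg + weight * queue_len

def drop_probability(avg, min_th, max_th, max_p):
    """Drop probability as a function of the average queue size.

    Below min_th nothing is dropped; at or above max_th everything is dropped;
    in between the probability grows linearly from 0 up to max_p.
    """
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)
```

A WRED-like behavior could be obtained, under the same assumptions, by keeping one (min_th, max_th, max_p) triple per DSCP and selecting the triple from the marking of the arriving packet.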
For more details on the subject, see [COMP-NETWORKS], [DIFFSERV-DEPLOY], [RED], [WRED], and [FRED]. Typically, an IP network must always be designed so that IGP control packets do not suffer from network congestion (i.e., they experience low delay and no drops). Note that providing an appropriate QoS to IGP packets implies not only the proper handling of the packets at the output interface level but also the proper handling of the packets inside the router, such that the internal processing of packets between interfaces is tuned to achieve the required level of QoS.
Propagation Delay

Once the IGP packet is serviced by the router queuing system, the LSA experiences some incompressible propagation delay before reaching the next-hop router. At first sight, the propagation delay might appear negligible, but this might not be the case. A rule of thumb is to consider that the propagation delay is 5 ms per 1000 kilometers (km) of fiber. Because the IP and optical layers are generally not congruent, it is not rare to see propagation delays that are significantly higher than the geographical distance between two IP routers would suggest. To illustrate this aspect, consider a sparse optical network providing lambdas to an IP network. Although the geographical distance between a router in New York and a router in Los Angeles could be estimated at 5000 km (hence, a propagation delay of 25 ms), the path followed by the lambda interconnecting those two routers may be significantly longer. Note that the usual one-way propagation delay between two routers in the United States, for instance, rarely exceeds 30 ms. The propagation delay can of course be much smaller for a domestic network in small countries/states, for instance, and sometimes higher for an international network in which the optical paths can be quite long.
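As a worked illustration of the rule of thumb, the following sketch adds up the three per-hop components of the flooding delay listed in Section 4.5.2. The function names and the sample processing and queuing figures are hypothetical; only the 5 ms per 1000 km figure comes from the text.

```python
MS_PER_1000_KM = 5.0   # rule of thumb for propagation delay in fiber

def propagation_delay_ms(fiber_km):
    """One-way propagation delay for a given length of fiber."""
    return fiber_km / 1000.0 * MS_PER_1000_KM

def flooding_delay_ms(hops):
    """Total flooding delay along a path.

    hops: list of (processing_ms, queuing_ms, fiber_km), one tuple per hop.
    """
    return sum(proc + queue + propagation_delay_ms(km)
               for proc, queue, km in hops)
```

With these assumptions, 5000 km of fiber gives the 25 ms mentioned above, and the fiber length to use is that of the actual optical path, not the geographical distance between the routers.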
4.5.3 Time Estimate for the LSA Origination and Flooding Process

To originate a new LSA, the IGP process must first obtain CPU resources, then run for some period to build the new LSA, and then send it to every neighbor. The first component depends directly on the router OS. The router OS must perform various tasks and has a scheduler that is responsible for sharing the CPU time among the various tasks/processes that require the CPU. This is similar to any OS running on a computer system. In the case of a router, the OS must ensure that the IGP process can quickly get access to the CPU to process the LSA origination when required. Generally, on modern router architectures, the LSA origination time rarely exceeds a few tens of milliseconds and can even be less than 10 ms; this includes both the waiting time to get the CPU and the run time of the LSA origination process. The same reasoning applies to LSA flooding. The router OS must also ensure that the LSA flooding process gets appropriate treatment so that LSA flooding is not delayed. Again, on modern router architectures, the LSA flooding time rarely exceeds a few tens of milliseconds.
4.6 Route Computation

This section deals with all the aspects related to the computation of a shortest path between a router and every reachable IP prefix. Route computation in an IP network is a key component of the overall network recovery time, as with any recovery mechanism.
4.6.1
Shortest Path Computation Notion of Shortest Path As previously mentioned, a link state protocol allows each node in the network to dynamically learn the network topology (stored in an LSDB), which is used to compute the shortest path(s) to each reachable node in the network and therefore to each IP prefix that is advertised. Before describing how shortest path can be computed, the notion of shortest path must first be explored. A path cost is defined as the sum of the link metrics traversed along the path where a link metric is an integer defined by the network administrator that can reflect different link attributes depending on the overall objective. For example, a common practice is for the link metric to reflect the link speed (link bandwidth). So the metric is inversely proportional to the link speed. For example, if an OC192 link has a metric of 1, an OC48 link will have a metric of 4, whereas an OC3 link will have a metric of 64. Although this approach has been widely used in many networks, with the apparition of QoS-sensitive application, another scheme has also been adopted, which consists of reflecting the link propagation delay. Note that such a scheme is particularly suited in networks with homogenous link speeds but a wide range of propagation delays. In the former case (metric is a function of link speed), the shortest path between two routers is the path that follows the links with the higher link speed, which allows to better traffic engineer the network flows because large pipes will attract more traffic than small pipes. In the latter case (metric is a function of propagation delay), the shortest path from a node A to a node B is the path that minimizes the total propagation delay across the network. Some networks use a hybrid approach in which the link metric is a polynomial function of multiple attributes like the link speed and the propagation delay. 
From the shortest path computation perspective, the objective is still to compute the shortest path between two nodes taking into account the link metrics.
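The notions of link metric and path cost can be illustrated with a minimal sketch. It assumes the bandwidth-based scheme of the example above (OC192 as the reference speed); the node names, the topology, and the helper names link_metric and path_cost are hypothetical and only serve the illustration.

```python
# Reference speed: an OC192 link gets a metric of 1 (as in the text's example).
REFERENCE_OC = 192

def link_metric(oc_level: int) -> int:
    """Metric inversely proportional to the link speed, in OC units."""
    return max(1, REFERENCE_OC // oc_level)

def path_cost(path, metrics):
    """Path cost = sum of the metrics of the links traversed along the path."""
    return sum(metrics[(a, b)] for a, b in zip(path, path[1:]))

# Hypothetical topology: node names and link speeds are illustrative only.
metrics = {
    ("A", "B"): link_metric(192),  # OC192 -> 1
    ("B", "D"): link_metric(48),   # OC48  -> 4
    ("A", "C"): link_metric(3),    # OC3   -> 64
    ("C", "D"): link_metric(192),  # OC192 -> 1
}

print(path_cost(["A", "B", "D"], metrics))  # 5
print(path_cost(["A", "C", "D"], metrics))  # 65
```

With these metrics the path through the OC3 link is heavily penalized, which is the "large pipes attract more traffic" effect described above.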
The Notion of Multitopology Routing
The idea of being able to use multiple metrics is not really new and was introduced some time ago in both OSPF and IS-IS. This notion has been extended to the concept of multitopology routing, in which multiple virtual topologies can be derived from a single physical network (with the ability to assign multiple metrics to each link). In the case of IS-IS, some extensions have been proposed [M-ISIS] to maintain separate topologies (called multiple topologies [MTs]) per protocol49 (IP versions 4 and 6, multicast); even for a single protocol like IP version 4, separate topologies could be maintained. In a nutshell, when forming adjacencies over a link, routers exchange the set of MTs the link belongs to (IS-IS hello packets [IIH] are used for that purpose). IS-IS LSP origination and flooding are unchanged, but new MT type-length-value (TLV) fields are carried within IS-IS LSPs to advertise the MTs. In terms of path computation, one SPF is performed per MT and the corresponding routing table is populated. So the failure of a link that is part of multiple MTs would trigger the computation of multiple SPFs (one per MT the link belongs to), but the flooding remains unchanged.
49 Protocol is usually referred to as address family in this context.
Several examples are provided below that illustrate the concept of MTs in various situations.
Example 1: M-ISIS used for different protocols (address families): IP version 4 and IP version 6. This first example is illustrated in Figure 4.12, where some links belong to either the IP version 4 or the IP version 6 topology and some other links belong to both.
Example 2: M-ISIS used for multiple topologies of the same address family (IP version 4). In this second example, depicted in Figure 4.13, a network administrator is interested in running multiple IP version 4 topologies (with multiple metrics, each metric reflecting a particular constraint). For instance, in MT1 the link metric refers to the bandwidth, whereas in MT2,
Figure 4.12 M-ISIS used for different protocols. (Panels: Physical Topology, MT1 (IPv4), MT2 (IPv6); each link is labeled with the metric used for MT1 (IPv4) and the metric used for MT2 (IPv6).)
Figure 4.13 Multitopology routing: Example 2. (Panels: Physical Topology, SP1 - MT1 (IPv4, bandwidth metric based), SP2 - MT2 (IPv4, propagation delay metric based); each link is labeled with the metric used for MT1 (link metric = bandwidth) and the metric used for MT2 (link metric = propagation delay); data traffic follows MT1 and voice traffic follows MT2.)
the link metric is a function of the propagation delay. Note that some links may be excluded from an MT (for instance, a satellite link with a long propagation delay could be excluded from MT2).
An interesting aspect to highlight is related to forwarding: How does the router determine which routing table to consult in order to route an IP packet? In the first example, there is one MT per address family: an IP version 4 packet will be routed using the IP version 4 routing table (MT1); hence, the address family determines the MT and thus the appropriate routing table. On the other hand, if several MTs belong to the same address family, there are multiple cases to consider:
Situation 1: The MTs are fully disjoint (an interface cannot belong to more than one MT). The interface on which a packet is received unambiguously determines its MT.
Situation 2: The MTs of the same address family share some interfaces, but the addresses do not overlap. The router can then determine the MT to which the packet belongs from its destination address.
Situation 3: The MTs belong to the same address family and share some interfaces with overlapping addresses. This corresponds to the example depicted in Figure 4.13, in which some links belong to both MT1 and MT2 and the address spaces overlap. In this case, some additional mechanisms are required. A typical solution is to use the DSCP, which identifies the required QoS in the IP header. Each MT is then reserved for a particular DSCP, and the appropriate routing table is consulted based on the DSCP. This way, voice packets marked, for instance, with the DSCP value of 5 will be routed using MT2, which computes the shortest path based on the propagation delay metric. Referring to the example depicted in Figure 4.13, if two packets enter node A and have node D as their IP destination, the data packet (marked with a specific DSCP value) will be routed along the SPT1 of MT1, and the voice packet, marked with a different DSCP value, will follow the SPT2 of MT2.
The case of OSPF is slightly different. The support of IP version 6, for instance, requires OSPF version 3 ([OSPFv3]), but the capability to support multiple topologies (with multiple metrics per link) is part of the current protocol specification. When an adjacency is formed between two neighbors, the link can be advertised with more than one metric in the corresponding LSA type 1. Also, when interarea or external routes are advertised in LSAs of type 3 and 5, respectively, multiple metrics can be associated with each route. The protocol can support up to one metric per IP ToS value. Each router then computes an SPF per ToS, and packets are routed using the appropriate routing table based on the ToS value of their IP header. This is equivalent to Example 2 presented earlier. Note that OSPF routers can be configured to route all the IP packets on the ToS 0 path only. When routers supporting ToS routing are combined with routers that support only the ToS 0 path, the SPF computation for a non-ToS 0 value should avoid the routers that support only ToS 0 routing, so that non-ToS 0 IP packets are not routed through them.
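The table selection described in Situation 3 can be sketched as follows. This is only an illustration under stated assumptions: the DSCP value 5 for voice comes from the text, while the default topology, the next hops, and the names DSCP_TO_MT, ROUTING_TABLES, and next_hop are hypothetical.

```python
# Hypothetical DSCP-to-topology mapping; 5 is the voice DSCP used in the text.
DSCP_TO_MT = {5: "MT2"}
DEFAULT_MT = "MT1"  # assumption: every other DSCP value uses the data topology

# One routing table per MT (destination -> next hop). The next hops are
# illustrative only; they are not read from Figure 4.13.
ROUTING_TABLES = {
    "MT1": {"D": "B"},  # bandwidth-based shortest path
    "MT2": {"D": "E"},  # propagation-delay-based shortest path
}

def next_hop(destination: str, dscp: int):
    """Pick the MT from the packet's DSCP, then consult that MT's table."""
    mt = DSCP_TO_MT.get(dscp, DEFAULT_MT)
    return mt, ROUTING_TABLES[mt][destination]

print(next_hop("D", 0))  # ('MT1', 'B')  data packet
print(next_hop("D", 5))  # ('MT2', 'E')  voice packet
```

Two packets with the same destination can thus leave the router through different next hops, solely because of their DSCP marking.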
4.6.2
The Dijkstra Algorithm
The famous mathematician Edsger Dijkstra, a pioneer in computer science, gave his name to an algorithm that computes loop-free shortest paths and that has been used for several decades in a wide range of contexts. The algorithm is described in detail in this section, but it is quite interesting to first read a quote from Dijkstra related to this invention (from [EWD-1166]):
I designed my first nontrivial algorithms. The algorithm for The Shortest Path was designed for the purpose of demonstrating the power of the ARMAC at its official inauguration in 1956, the one for The Shortest Spanning Tree was designed to minimize the amount of copper in the backpanel wiring of the X1. In retrospect, it is revealing that I did not rush to publish these two algorithms: at that time, discrete algorithms had not yet acquired mathematical respectability, and there were no suitable journals. Eventually they were offered in 1959 to "Numerische Mathematik" in an effort of helping that new journal to establish itself. For many years, and in wide circles, The Shortest Path has been the main pillar for my name and fame, and then it is a strange thought that it was designed without pencil and paper, while I had a cup of coffee with my wife on a sunny cafe terrace in Amsterdam, only designed for a demo. . . .
Algorithm Description
The Dijkstra algorithm (also referred to as SPF) finds the shortest path from one source S to any other router in a network with nonnegative arcs. (Note that solving the shortest path problem in networks with negative arcs is more complicated: the Dijkstra algorithm no longer applies, and when negative-cost cycles are present, finding the shortest loop-free path is NP-hard; see Section 4.13 for a discussion on algorithm complexity.) In the particular case of IP routing, arcs represent links whose cost is always positive. The result of the SPF algorithm is the shortest path tree (SPT), which is a graph representing the set of shortest paths.
Dijkstra Algorithm
Before describing the Dijkstra algorithm, let’s first start with a few definitions. The network can be represented as a directed50 graph noted G = (N, L), where
N: The set of nodes (routers).
L: The set of links.
S: Source (the computing node).
|LIST|: Number of elements of the list LIST.
n: Number of nodes in the network.
Lij: Link between the node i and the node j.
c(Lij): Cost of the link Lij.
d(i): Current distance between the source S and the node i (sum of the costs of the individual links along the shortest path found so far), with d(S) = 0.
50 Generally, two routers are connected via a bidirectional link. In this case, the link is represented as two directed arcs, which may or may not have the same cost.
Three lists are then defined:
REM (REMAINING): List of nodes for which no path has been found yet. This list is also called the UNKNOWN list.
PATH: List of nodes for which a shortest path has been found.
TENT: Tentative list. List of nodes for which at least one path (which may not be the shortest path yet) has been found.
Note that |REM| + |TENT| + |PATH| = n.
Step 1: Initialize the three lists:
  PATH = empty
  TENT = {S}
  REM = N without S (all the other nodes in the network)
While TENT is not empty
  Move to PATH the node i such that d(i) = min {d(k) for k ∈ TENT}
  For each neighbor j of node i
    If the node j is in REM
      Remove j from REM and move it to TENT
      Compute d(j) = d(i) + c(Lij)
      Record i as its predecessor
    If the node j is already in TENT
      Compute d(i) + c(Lij); if it is smaller than the current value of d(j), update d(j) and record i as the new predecessor of j
    (If the node j is already in PATH, nothing is done.)
    Compute the next hops of j
End
A detailed step-by-step example of the Dijkstra algorithm is provided later in this section.
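The algorithm described above can be sketched in a few lines of Python. This is a minimal illustration that mirrors the three lists of the description, not a router implementation; the function names, the adjacency-dictionary representation, and the small four-node topology are assumptions made for the example, and next-hop computation is left out.

```python
def undirected(links):
    """Build {node: {neighbor: cost}}; each link becomes two directed arcs."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph

def dijkstra_lists(graph, source):
    """List-based Dijkstra using the PATH, TENT, and REM lists."""
    dist = {source: 0}              # d(i)
    pred = {source: None}           # predecessor P(i)
    path = []                       # PATH: shortest path found
    tent = [source]                 # TENT: at least one path found
    rem = set(graph) - {source}     # REM: no path found yet
    while tent:
        i = min(tent, key=lambda k: dist[k])   # linear scan of TENT
        tent.remove(i)
        path.append(i)
        for j, cost in graph[i].items():
            if j in rem:                        # first path found to j
                rem.remove(j)
                tent.append(j)
                dist[j] = dist[i] + cost
                pred[j] = i
            elif j in tent and dist[i] + cost < dist[j]:
                dist[j] = dist[i] + cost        # strictly shorter path found
                pred[j] = i
            # nodes already in PATH are left untouched
    return dist, pred, path

# Hypothetical topology, for illustration only.
graph = undirected([("S", "A", 1), ("S", "B", 4), ("A", "B", 2),
                    ("B", "C", 1), ("A", "C", 5)])
dist, pred, path = dijkstra_lists(graph, "S")
print(dist)  # {'S': 0, 'A': 1, 'B': 3, 'C': 4}
print(path)  # ['S', 'A', 'B', 'C']
```

The linear scan of TENT at each iteration is what produces the quadratic behavior of this straightforward version.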
Dijkstra Algorithm Complexity
Algorithm complexity is undoubtedly a key topic because it has a direct impact on the amount of time an IP router requires to compute an alternate path. Section 4.13 is devoted to algorithm complexity, but in a nutshell it refers to the number of operations required by an algorithm to provide an output, as a function of the problem size. As covered in Section 4.13, the complexity is computed by considering the worst case. The complexity of the Dijkstra algorithm can be computed very easily. The algorithm actually performs two different sets of tasks:
1. Selection from the TENT list of the next node to move to the PATH list.
2. For each node moved to the PATH list, each of its neighbors is moved to the TENT list (if not already there), and for each of them the distance d(i) + c(Lij) is computed.
In the worst case, operation (1) is performed n times (where n is the problem size, i.e., the number of nodes in the network), and at each step the number of nodes scanned is at most n (actually at step k, the maximum number of nodes in the TENT list is n - k). So the total number of tasks turns out to be the sum of (n - k) for k = 1, ..., n, which is O(n^2) (see Section 4.13 for a detailed definition of algorithm complexity). Operation (2) is performed L times in total, where L is the total number of links in the network, and L is itself at most of the order of n^2. Hence, the resulting complexity of the Dijkstra algorithm is O(n^2).
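The arithmetic behind the worst-case count can be written out explicitly. The derivation below only restates the sum given in the text; the bound L ≤ n(n - 1) assumes a directed graph with at most one arc per ordered pair of nodes.

```latex
\[
\sum_{k=1}^{n} (n-k) \;=\; \sum_{m=0}^{n-1} m \;=\; \frac{n(n-1)}{2} \;=\; O(n^2),
\qquad
L \le n(n-1) \;\Rightarrow\; O(n^2) + O(L) = O(n^2).
\]
```

For a nine-node network, for instance, the selection step scans at most 9 × 8 / 2 = 36 tentative entries in total.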
An Example Step by Step
This section provides a detailed step-by-step example of the Dijkstra algorithm used to compute a set of shortest paths. Let’s consider the network depicted in Figure 4.14.
Initial Step
PATH = {}
TENT = {A} (the computing node is A)
REM = {B, C, D, E, F, G, H, I} (all the nodes in the network excluding the root A)
Step 1
The closest node to the root A is A itself because d(A) = 0, so i = A (Figure 4.14).
PATH = {A}
TENT = {B, E, H} (all the neighbors of A are moved to the TENT list51, and their shortest distance from the root is computed as well as their predecessor, noted P(X))
d(B) = 3 (P(B) = A), d(E) = 6 (P(E) = A), d(H) = 5 (P(H) = A)
REM = {C, D, F, G, I}
Figure 4.14 An example of the Dijkstra algorithm. (Panels: Initial Step, Step 1; the value shown next to a node i in the PATH list indicates d(i), its distance from A.)
51 As already mentioned during the Dijkstra algorithm description, some neighbors may already be in the TENT list.
Step 2
The closest node to A belonging to the TENT list is selected: i = B, and its neighbors are moved to the TENT list (the new distances are also computed) (Figure 4.15).
PATH = {A, B}
TENT = {E, H, C, F} (all the neighbors of B are moved to the TENT list, and their shortest distance from the root is computed as well as their predecessor, noted P(X))
d(E) = 6 (P(E) = A), d(H) = 5 (P(H) = A), d(C) = 7 (P(C) = B), d(F) = 5 (P(F) = B) (note that the node E keeps the same predecessor since the path from A to E via the direct link A-E, with a cost of 6, is shorter than the path A-B-E, whose cost is equal to 11)
REM = {D, G, I}
Step 3
The closest node to A belonging to the TENT list is selected: i = H, and its neighbors are moved to the TENT list (the new distances are also computed).
PATH = {A, B, H}
TENT = {E, C, F, I} (all the neighbors of H are moved to the TENT list, and their shortest distance from the root is computed as well as their predecessor, noted P(X))
d(E) = 6 (P(E) = A), d(C) = 7 (P(C) = B), d(F) = 5 (P(F) = B), d(I) = 7 (P(I) = H) (as in the previous case, the node E keeps the same predecessor since the path from A to E via the direct link A-E, with a cost of 6, is shorter than the path A-H-E, whose cost is equal to 9)
REM = {D, G}
Figure 4.15 An example of the Dijkstra algorithm (steps 2 and 3).
Step 4
The closest node to A belonging to the TENT list is selected: i = F, and its neighbors are moved to the TENT list (the new distances are also computed) (Figure 4.16).
PATH = {A, B, H, F}
TENT = {E, C, I, D, G}
d(E) = 6 (P(E) = A), d(C) = 6 (P(C) = F), d(I) = 7 (P(I) = H), d(G) = 13 (P(G) = F), d(D) = 6 (P(D) = F) (this time, the distance from A to C has changed and C’s predecessor is updated)
REM = {}
Step 5
The closest node to A belonging to the TENT list is selected: i = E. Note that the node C or D could also have been selected at this stage, but this does not change the resulting SPT.
PATH = {A, B, H, F, E}
TENT = {C, I, D, G} (no node is added to the TENT list at this step; note that E’s neighbors A, B, F, and H are already in PATH)
d(C) = 6 (P(C) = F), d(I) = 7 (P(I) = H), d(G) = 13 (P(G) = F), d(D) = 6 (P(D) = F)
REM = {}
Figure 4.16 An example of the Dijkstra algorithm (steps 4 and 5).
Step 6
The closest node to A belonging to the TENT list is selected: i = C. Note that D could also have been selected at this stage, but this does not change the resulting SPT (Figure 4.17).
PATH = {A, B, H, F, E, C}
TENT = {I, D, G} (all the neighbors of C are already in the TENT or PATH lists, and their shortest distance from the root has been computed as well as their predecessor, noted P(X))
d(I) = 7 (P(I) = H), d(G) = 13 (P(G) = F), d(D) = 6 (P(D) = F)
REM = {}
Step 7
The closest node to A belonging to the TENT list is selected: i = D.
PATH = {A, B, H, F, E, C, D}
TENT = {I, G} (all the neighbors of D are already in the TENT or PATH lists; note that the node C, a neighbor of D, is already in PATH)
d(I) = 7 (P(I) = H), d(G) = 11 (P(G) = D) (a new shortest distance from A to G is computed (11), with D as the predecessor)
REM = {}
Figure 4.17 An example of the Dijkstra algorithm (steps 6 and 7).
Step 8
The closest node to A belonging to the TENT list is selected: i = I.
PATH = {A, B, H, F, E, C, D, I}
TENT = {G} (all the neighbors of I are already in the TENT or PATH lists, and their shortest distance from the root has been computed as well as their predecessor, noted P(X))
d(G) = 11 (P(G) = D)
REM = {}
Step 9
The last node in TENT, the node G, is added to PATH (Figure 4.18).
PATH = {A, B, H, F, E, C, D, I, G}
TENT = {}
REM = {}
Figure 4.18 An example of the Dijkstra algorithm (steps 8 and 9).
Some Performance Numbers
As mentioned earlier, a "naive" implementation of the Dijkstra algorithm has a complexity of O(n^2). It is worth mentioning that various optimizations can be implemented that drastically reduce the running time of the SPF computation. Some existing implementations have a complexity of O(n log n). Figure 4.19 shows the algorithm complexity as a function of the problem size (the number of routers) for algorithms having a complexity of O(n^2) and O(n log n). The two plots show identical functions but with different scales on the axes. Of course, the router’s CPU greatly determines the overall computation time, but to give a rough estimate, existing optimized implementations running on core routers are able to compute an SPT in a few tens of milliseconds for networks having hundreds of routers. As already mentioned, the routing table computation requires not only the computation of the SPT but also that of the RIB, the latter being a nonnegligible component of the total routing table computation time. A very interesting optimization of the original SPF algorithm, called incremental SPF, consists of limiting the SPT computation to some part of the tree and is covered in detail in Section 4.13.
Figure 4.19 Algorithm complexities of n^2 and n log(n). (Two panels, both titled "Algorithm Complexity as a Function of the Number of Routers (n)", plot the complexity against the number of routers n, the first for n up to about 20 and the second for n up to about 1000.)
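One common way to get below the quadratic behavior is to keep the tentative nodes in a priority queue; with a binary heap the running time is of the order of (n + L) log n. The sketch below illustrates this idea and is not a description of any particular router implementation. It is run on the link costs that can be inferred from the step-by-step example (A-B = 3, A-E = 6, and so on); links of Figure 4.14 that are not mentioned in the text are omitted, which is an assumption, and the function name dijkstra_heap is hypothetical.

```python
import heapq

def dijkstra_heap(graph, source):
    """Heap-based Dijkstra; graph is {node: {neighbor: cost}}."""
    dist = {source: 0}
    pred = {source: None}
    done = set()                        # nodes whose shortest path is final
    heap = [(0, source)]                # tentative (distance, node) entries
    while heap:
        d, i = heapq.heappop(heap)
        if i in done:
            continue                    # stale entry, a shorter one was used
        done.add(i)
        for j, cost in graph[i].items():
            if j not in done and d + cost < dist.get(j, float("inf")):
                dist[j] = d + cost
                pred[j] = i
                heapq.heappush(heap, (dist[j], j))
    return dist, pred

# Link costs inferred from the worked example; other links are omitted.
links = [("A", "B", 3), ("A", "E", 6), ("A", "H", 5), ("B", "E", 8),
         ("B", "C", 4), ("B", "F", 2), ("H", "E", 4), ("H", "I", 2),
         ("F", "C", 1), ("F", "D", 1), ("F", "G", 8), ("D", "G", 5)]
graph = {}
for a, b, cost in links:
    graph.setdefault(a, {})[b] = cost
    graph.setdefault(b, {})[a] = cost

dist, pred = dijkstra_heap(graph, "A")
print(dist["G"], pred["G"])  # 11 D
```

The distances obtained match those of the worked example (for instance d(G) = 11 with D as predecessor), even though the order in which equal-distance nodes are selected may differ.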
4.6.3
Shortest Path Computation Triggers
As already mentioned, an SPF computation must be triggered each time a new LSA is received, which happens, for instance, whenever a topology change occurs or an IP prefix is added or deleted locally on a router (in the latter case, the network topology does not change, but the router advertises a change of IP address reachability). Both events provoke the origination of a new LSA, but OSPF and IS-IS handle the two events differently. Regardless of the event that caused the new LSA to be originated, OSPF systematically triggers a new SPF computation upon receiving a new LSA. On the other hand, in some implementations, IS-IS triggers a new SPT and RIB computation only if the LSA reflects a topology change. If the LSA reports new IP prefix reachability information (so there is no topology change), IS-IS performs only a new RIB computation (called a partial route computation [PRC]), which saves the cost of an SPT computation. This is, for example, the case of the Cisco IS-IS implementation.
Once a node receives a new LSA and determines that an SPT and/or RIB computation must be triggered, it might be desirable to delay the computation. An efficient way to handle the delay between the triggering event and the actual task execution is to use a dampening mechanism (like the exponential back-off algorithm described in Section 4.4), as in the case of LSA origination. This preserves the router’s CPU in the case of network instability while allowing fast SPF triggering in the case of limited network changes (which corresponds to the vast majority of failure scenarios).
Example on a Cisco router:
router isis
 ...
 spf-interval A B C
 prc-interval A B C
 ...
router ospf
 ...
 timers throttle spf A B C
 ...
The algorithm used corresponds to the exponential back-off algorithm described in Section 4.4. Just note that on a Cisco router the variables A, B, and C correspond to the variables Z, X, and Y, respectively. As with LSA origination, such a dampening algorithm allows a router to react quickly to a single failure while protecting it from triggering too many SPT and RIB computations in the case of network instability. Indeed, although the SPT and RIB computation can be fast, it is clearly undesirable to run tens of SPT and RIB computations back to back if the router keeps receiving new LSAs from a flapping router in the network that does not have any LSA dampening mechanism.
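The effect of such a dampening mechanism can be illustrated with a small simulation. Since Section 4.4 is not reproduced here, the exact rules are an assumption: the sketch uses an initial wait applied after a quiet period, a hold time that doubles after each SPF run, and a maximum wait that caps the hold time and (doubled) defines the quiet period. The names spf_schedule, initial_wait, hold, and max_wait are hypothetical and are not router configuration keywords.

```python
def spf_schedule(lsa_times, initial_wait, hold, max_wait):
    """Return the times (ms) at which SPF runs for the given LSA arrival times."""
    runs = []
    current_hold = hold
    for t in sorted(lsa_times):
        if runs and t <= runs[-1]:
            continue                     # absorbed by the SPF already pending
        if not runs or t - runs[-1] >= 2 * max_wait:
            current_hold = hold          # quiet network: reset the back-off
            runs.append(t + initial_wait)
        else:
            # unstable network: wait until the hold time has elapsed
            runs.append(max(t, runs[-1] + current_hold))
            current_hold = min(2 * current_hold, max_wait)
    return runs

# A single failure is handled after the short initial wait.
print(spf_schedule([0], 50, 200, 1000))                 # [50]
# A router flapping every 100 ms gets progressively dampened.
print(spf_schedule(range(0, 2001, 100), 50, 200, 1000)) # [50, 250, 650, 1450, 2450]
```

In the second case, 21 LSAs result in only five SPF runs, spaced 200, 400, 800, and then 1000 ms apart.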
Parameters Tuning
As with LSA origination parameter tuning, the setting of the SPF-related parameters depends heavily on the network characteristics and the rerouting time objectives. A case study is proposed in Section 4.6. That being said, a good practice is generally to set variable A to a short value to get fast convergence upon a single failure and then rely on variables B and C to slow down the triggering of SPT and RIB computations in the case of network instability. The case of a network having multiple SRLGs52 is quite interesting though. Consider a network with a large number of SRLGs in which the propagation delays between links belonging to common SRLGs are not negligible, such as the network depicted in Figure 4.20: links B-C and C-H belong to the same SRLG (e.g., these links share a common resource like a fiber). In the case of an SRLG failure (fiber cut), both link B-C and link C-H will fail simultaneously. Upon receipt of node B’s LSA, A may want to quickly trigger an SPF to improve convergence, and in this particular example, A will likely select the path A-G-H-C to reach node C (if we suppose that all the links have a cost equal to 1).
52 The notion of SRLG has already been covered in Chapter 3 and is examined again in Chapter 5.
Figure 4.20 Case of an SRLG failure.
Unfortunately, link H-C has also failed, but A will get an accurate updated view of the topology only after having received node H’s LSA (or node C’s LSA), which may be delayed because of the propagation delay along the H-G-A path. A second SPF must then be triggered by node A (after the timer Y has expired). There are multiple strategies for handling such situations in these networks. The first strategy is to slightly increase the value of A, so that the SPF is triggered after all the LSAs have been received. Another approach is to set B to a small value so that a second SPF can be triggered if another LSA reflecting another topology change arrives quickly (actually a third SPF may be required to obtain the actual topology because each link failure triggers the origination of two LSAs).
4.6.4
Routing Information Base Update
Computing the shortest path between the computing node and every other node in the network is one thing, but the ultimate goal is obviously to compute the routing table, also called the RIB. Although the RIB computation can be performed during the SPT computation, we can consider the RIB computation as a separate task. The SPT provides the shortest path between the computing node and every other reachable node in the network; the RIB computation consists of populating a table that contains the shortest path to reach the various IP prefixes. Then, usually, routers compute another table, called the forwarding information base (FIB), which contains the minimum set of information required to forward an IP packet. For instance, keeping track in the FIB of the whole path and the corresponding metric to reach an IP prefix IP1 is not really necessary; the only useful information from a forwarding perspective is the next hop and the outgoing interface for IP1. Consequently, the FIB contains the list of IP prefixes and, for each of them, the next hop along with some other low-level information related to the layer 2 protocol in use on the outgoing link.
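The difference between the two tables can be illustrated with a short sketch. The data layout is hypothetical: the RibEntry fields, the prefixes, and the interface names are assumptions for the example, and locally attached prefixes as well as layer 2 rewrite information are ignored.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RibEntry:
    prefix: str
    path: List[str]      # full shortest path, starting at the computing node
    metric: int          # cost of that path
    out_interface: str

def build_fib(rib):
    """Keep only what forwarding needs: next hop and outgoing interface."""
    return {e.prefix: (e.path[1], e.out_interface) for e in rib}

# Illustrative entries for a computing node A; all values are made up.
rib = [
    RibEntry("10.0.1.0/24", ["A", "B", "F", "D"], 6, "eth0"),
    RibEntry("10.0.2.0/24", ["A", "H", "I"], 7, "eth1"),
]
fib = build_fib(rib)
print(fib["10.0.1.0/24"])  # ('B', 'eth0')
print(fib["10.0.2.0/24"])  # ('H', 'eth1')
```

The full path and the metric stay in the RIB, while the FIB keeps one compact entry per prefix.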
The computation time of the RIB is a component of the overall IGP convergence time that may not be negligible. Indeed, at first glance, one might think that the SPT computation, which is a function of the network topology as shown earlier, is the predominant factor in the routing table computation time, but this might not be the case, especially with very powerful route processor CPUs (this is even more true when techniques like iSPF53 are used). So both the network topology and the number of routes are important factors in the routing table computation time. This is why a very good common practice consists of trying to reduce the RIB size and the number of IP prefixes flooded by the IGP. Another interesting approach is to use mechanisms that prioritize the update of the important IP prefixes. But what is an important IP prefix? Let us consider the two ends of the spectrum with a link IP address and a BGP peer address. The use of a link IP address is usually limited to the "traceroute" application, so losing connectivity to a link IP address for a short period (until the routing protocol has converged) is certainly not an issue. On the other hand, a BGP peer IP address is used to forward all the traffic announced by that BGP peer. If router A has a peering session with router B, all the IP prefixes (Internet and/or virtual private network [VPN] prefixes in the case of MPLS VPN) announced by B are resolved via a mechanism called route recursion, where A first tries to find the path to reach router B; consequently, being able to find an alternate path to reach router B upon a network failure is of the utmost importance. Hence, a BGP peer IP address is typically an important prefix, and as mentioned earlier, prioritizing the treatment of such important prefixes is a good idea. Now how can such important IP prefixes be identified? IS-IS proposes a mechanism that allows certain routes to be tagged ("colored") (see [IS-IS-TAG]). The network administrator is of course responsible for assigning particular tags to the "important" prefixes. Equivalent mechanisms for OSPF are under development. Of course, another approach is to try to limit the IP prefixes carried in the IGP to the important prefixes (e.g., link IP addresses do not need to be advertised within the IGP).
53 The iSPF algorithm is described in detail in Section 4.14.
4.7 Temporary Loops during Network State Changes
Link state protocols compute loop-free shortest paths between the various source-destination pairs under steady state. But when failures occur in the network, the momentary lack of synchronization between the routers’ LSDBs may lead to the creation of temporary loops. Such loops can cause the traffic traversing the links involved to be dropped and can substantially increase the load on those links, which may affect other traffic traversing them even though that traffic does not follow a path affected by the failure. Moreover, there are other situations in which such loops can be observed, in particular when a link cost is increased (by manual configuration) or when a link is restored. The rest of this section describes the behavior of these temporary loops and their characteristics.
4.7.1
Temporary Loops in the Case of a Link or Node Failure
In a distributed routing environment, the timing of the sequence of events is not deterministic and depends on various factors: the IGP configuration and the set of associated timers, the propagation and queuing delays experienced by an LSA flooded throughout the network, and the router implementations, to mention just a few of them. Hence, it is virtually impossible to predict the exact timing of the event sequence. That being said, one can analyze a possible sequence of events that could lead to temporary loops upon a link or node failure. To better illustrate how temporary loops can appear in a converging network upon a network element failure, consider the example in Figure 4.21, in which the assumption is made that the IGP timers are tuned to provide fast convergence. In this example, S is the source of the traffic and Z the destination. NH(X,Z) is the next hop computed by node X to reach destination Z. So, for instance, at steady state (no failure):
NH(A,Z) = B because the shortest path from A to Z is A-B-C-D.
NH(G,Z) = H and A because the two shortest paths from G to Z are G-H-I-D and G-A-B-C-D.
Figure 4.21 Illustration of a temporary loop.
Now consider the following sequence of events:
Time T0: The link C-D fails. The IP packets traveling along this link start to be dropped. After some period (the link failure detection time), router C originates a new LSA and triggers a new SPF to recompute its routing table. Once C has converged, NH(C,Z) = B, because the next hop along the shortest path from C to reach Z is now B. This leads to the first temporary loop, which lasts until node B has itself converged (which requires receiving the LSA of node C [or the LSA of node D] reporting that the link C-D has failed and recomputing its routing table). Indeed, before B receives the new LSA and recomputes its routing table, NH(B,Z) = C, and thus there is a temporary loop B-C-B.
Time T1: B then receives the LSA originated by C. A very useful optimization is to always flood a received LSA before triggering an SPF (this ensures that the LSA flooding is not delayed by the SPF computation). Once the LSA is flooded, B triggers an SPF and converges; then NH(B,Z) = A. Now a new temporary loop appears between nodes A and B because NH(A,Z) = B (A is not yet aware of the failure of the link C-D and thus has not converged yet). This secondary temporary loop is illustrated in Figure 4.22.
Time T2: A receives the LSA, floods it to G and E, and triggers an SPF. NH(A,Z) = G. The two previous micro-loops, between (B,C) and (A,B), have now disappeared, but a new temporary loop appears: A-G-A, since NH(G,Z) = A (for the traffic selecting this path, because there are two equal-cost paths from G to Z).
Time T3: H receives the LSA, floods it to I and G, and triggers an SPF. NH(H,Z) = I. Note that there is no change in the path to reach Z from H, so no new temporary loop appears (Figure 4.23).
Important note: Once again, one cannot predict whether T2 would occur before or after T3; this sequence just highlights a possible routing dynamic.
Figure 4.22 Illustration of temporary loops (continued): panels T0 and T1.
Figure 4.23 Illustration of temporary loops (continued): panels T2 and T3.
Time T4: Finally, G receives one of the LSAs originated by C and D, either from A or H, triggers an SPF, and converges: NH(G,Z) = H. Note that the temporary loop between G and A now disappears. This final state is depicted in Figure 4.24.
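The sequence T0 to T4 can be replayed with a small sketch that tracks each router's next hop toward Z and reports the pairs of routers that forward to each other. The next hops for A and G come from the text; those for B, C, H, and I are inferred from the paths A-B-C-D and G-H-I-D, and the function names are illustrative only.

```python
# Next hops toward Z at steady state. Entries for B, C, H and I are inferred
# from the shortest paths A-B-C-D and G-H-I-D quoted in the text.
steady_state = {
    "A": {"B"},
    "B": {"C"},
    "C": {"D"},
    "G": {"H", "A"},  # two equal cost paths
    "H": {"I"},
    "I": {"D"},
}

# (time label, converging router, its new next hops toward Z)
events = [
    ("T0", "C", {"B"}),
    ("T1", "B", {"A"}),
    ("T2", "A", {"G"}),
    ("T3", "H", {"I"}),
    ("T4", "G", {"H"}),
]


def two_node_loops(next_hops):
    """Return the sorted pairs of routers that forward to each other."""
    loops = set()
    for node, hops in next_hops.items():
        for hop in hops:
            if node in next_hops.get(hop, set()):
                loops.add(tuple(sorted((node, hop))))
    return sorted(loops)


def replay(initial, changes):
    """Apply the changes in order and record the loops after each one."""
    table = {node: set(hops) for node, hops in initial.items()}
    history = {}
    for label, node, hops in changes:
        table[node] = set(hops)
        history[label] = two_node_loops(table)
    return history


history = replay(steady_state, events)
for label, loops in history.items():
    print(label, loops)
```

Reordering the entries in `events` (for example, letting H converge before A) shows how a different timing produces a different set of transient loops.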
Loop Duration and Number of Routers Involved
As pointed out earlier, the loop duration depends on the event sequence timing, which is highly unpredictable. Transient loops are caused by the lack of synchronization between the LSDBs of various routers. The duration of such a lack of synchronization is essentially driven by several clearly identified factors:
1. The flooding of the newly originated LSA: The longer the LSA flooding takes, the higher the probability of a lack of synchronization between routers’ databases and, consequently, of temporary loops. This highlights the fact that flooding should get an appropriate treatment; in particular, a received LSA should always be flooded as fast as possible and the LSA packets should receive an appropriate QoS. The only incompressible period is the propagation delay along the links.
2. The IGP timers: A homogeneous set of IGP timers is usually recommended because heterogeneous timers may have the undesirable effect of increasing temporary loop durations. For instance, back to the previous example, it can easily be seen that if upon receiving the LSA originated by C, node B
Figure 4.24 Illustration of temporary loops (continued): panels T4 and T5.
delays the triggering of its SPF (because different timers are used), then the temporary loop between B and C would last longer than necessary. Hence, it is desirable and recommended to configure homogeneous timers (although under some specific circumstances/designs heterogeneous timers may be used). So either both B and C will converge slowly or both will be configured to converge quickly, but the situation of one node (e.g., C) converging rapidly and other nodes converging slowly is not desirable.
The number of routers involved also depends heavily on the timing sequence. The example above showed a set of concatenated temporary loops involving several pairs of routers, but a different sequence of events could have led to a larger loop involving more routers. It is worth noting that when the IGP is appropriately tuned, temporary loops have a short duration, but although their effect can be reduced by high IP time-to-live (TTL) values and a router’s large buffering capabilities, at high link speed, the packets entering a temporary loop rarely exit the loop.
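To see why packets rarely outlive a loop at high link speed, a rough back-of-the-envelope sketch helps. It assumes that every hop decrements the TTL by one and takes a fixed per-hop delay; the numeric values below (remaining TTL, per-hop delay, loop duration) are purely illustrative assumptions, not measurements.

```python
def time_to_ttl_expiry_us(ttl, per_hop_delay_us):
    """Time a packet can circulate before its TTL reaches zero,
    assuming one TTL decrement per hop (integer microseconds)."""
    return ttl * per_hop_delay_us


def survives_loop(ttl, per_hop_delay_us, loop_duration_us):
    """True if the packet is still alive when the loop disappears."""
    return time_to_ttl_expiry_us(ttl, per_hop_delay_us) >= loop_duration_us


# Illustrative assumptions: remaining TTL of 250 and a loop lasting 200 ms.
fast_link = survives_loop(ttl=250, per_hop_delay_us=100, loop_duration_us=200_000)
slow_link = survives_loop(ttl=250, per_hop_delay_us=2_000, loop_duration_us=200_000)
print("fast link packet survives:", fast_link)
print("slow link packet survives:", slow_link)
```

Under these assumed numbers, the packet on the fast link exhausts its TTL after 25 ms, well before the loop clears, whereas the slower per-hop delay lets the packet outlast the loop.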
Administrative Link Cost Increase
An administrative link cost increase can also create temporary loops, although there is no network element failure. Indeed, if a link cost is increased, some temporary lack of synchronization may result between routers’ LSDBs,
thereby provoking some temporary loops. Note that a link failure can be seen as a cost increase to infinity.
An Undesirable Effect of Temporary Loops
A potentially undesirable side effect of temporary loops is the link-load increase resulting from the looping packets, which can lead to link congestion and as such could affect some traffic that is a priori unaffected by the link failure. To illustrate, consider the example in Figure 4.22. At time T1, the temporary loop between nodes A and B undoubtedly has an effect on the utilization of link A-B. The traffic routed from S to some destination Z’ directly connected to F would follow the path A-B-F. Although this traffic should not be affected by the failure of link C-D because it does not use that link, it suffers from the potential congestion created by the temporarily looping traffic from S to Z. Now, things should be put in perspective: Such temporary loops last a short period, so the impact is usually minimal.
4.7.2 Temporary Loops Caused by a Restored Network Element
As pointed out earlier, temporary loops are caused by the lack of synchronization between routers’ LSDBs, which also occurs upon a network element restoration. Indeed, consider the case of a restored link. As stated earlier, the sequence of events is not deterministic, but consider the two following situations:
Situation 1: The diagram in Figure 4.25A shows the situation after link C-D has failed and all the routers have converged. At time T0, link C-D is restored. Node C establishes an IGP adjacency with node D and originates a new LSA to reflect the topology change (note that node D will also flood a new version of its LSA). C converges and NH(C,Z) = D. Then at time T1, B receives the new LSA, floods it to each of its neighbors (H and F), triggers an SPF, and finally converges: NH(B,Z) = C. Note that at this point, node B does not receive any traffic to Z from any of its neighbors. At time T2, A receives the new LSA, floods it to each of its neighbors (G and E), triggers an SPF, and converges: NH(A,Z) = B. At this point, the traffic sent by S to Z starts flowing along the A-B-C-D path. Such a sequence of events did not lead to any temporary loop, but now consider another event timing.
Situation 2: Suppose now that for some reason, the following sequence of events occurs. At time T0, link C-D is restored. Node C establishes an IGP adjacency with node D and originates a new LSA to reflect the topology change (note that node D will also flood a new version of its LSA). At time T1, B receives the new LSA and floods it to each of its neighbors H and F. At time T2, A receives the new LSA, floods it to each of its neighbors G and E, triggers an SPF, and converges. Then NH(A,Z) = B. Because B has not yet converged (NH(B,Z) = A), a microloop A-B-A is created. At time T3, B converges and
Figure 4.25 An example of temporary loops on link-up event (panels a, b, and c; Situation 1 and Situation 2).
NH(B,Z) = C. This removes the previous temporary loop, but another temporary loop, B-C-B, appears because C has not yet converged (NH(C,Z) = B). Finally, at time T4, C converges and NH(C,Z) = D. Of course, all the temporary loops eventually disappear. The reason such temporary loops are created with such an event timing sequence in the case of a link restoration is that a node farther from the restored link converges before a node closer to that restored link. Consequently, it starts reusing the restored link in its SPT before the downstream node has had time to converge, hence the temporary loop.
Although this illustrates that such temporary loops can appear when a network element is newly restored, this case greatly differs from the previous case in two respects. First, the probability of such an event sequence timing is not very high. Second, in the situation of a link or node failure, the microloops are not avoidable but do not affect the convergence. By contrast, in the case of a restored network element, the traffic is being needlessly dropped, but there is a solution to this issue. Actually there are many possible solutions. One of them is to come up with a distributed algorithm that guarantees the sequence of converging nodes. Indeed, such temporary loops can be avoided only if a node does not converge before a node closer to the restored network element. So the idea is to use incremental delays in the SPF computation or routing table update to achieve that objective.
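One possible way to picture such incremental delays is sketched below. It is only an illustration under stated assumptions: the delay grows with the hop distance to the restored link (the text does not prescribe how the delays are derived), the chain topology A-B-C-D is a simplification of the example, and the 50 ms step is arbitrary.

```python
from collections import deque


def hops_from_restored_link(adjacency, link):
    """Breadth-first search: hop distance from each node to the nearest
    endpoint of the restored link."""
    distance = {endpoint: 0 for endpoint in link}
    queue = deque(link)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def convergence_delays(adjacency, link, step_ms):
    """Hypothetical rule: wait step_ms per hop of distance before updating
    the routing table, so that closer nodes converge first."""
    hops = hops_from_restored_link(adjacency, link)
    return {node: count * step_ms for node, count in hops.items()}


# Simplified chain taken from the example; the restored link is C-D.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
delays = convergence_delays(adjacency, ("C", "D"), step_ms=50)
for node in sorted(delays, key=delays.get):
    print(node, delays[node], "ms")
```

With these delays C updates before B, and B before A, which is the ordering of Situation 1 rather than Situation 2.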
4.8 Load Balancing
Load balancing undoubtedly plays a key role in IP networks and refers to the ability of a router to balance the traffic load to a destination X among a set of N equal cost paths. This is also called equal cost multiple paths (ECMPs); both OSPF and IS-IS support the computation of equal cost paths. Symmetrical networks offering a large number of equal cost paths between each pair of routers are not uncommon. Figure 4.26 shows a typical example of such a symmetrical network, in which every edge router is dual attached to two core routers. In this simple example, two equal cost paths exist between each pair of edge devices, but of course there might be many more ECMPs between pairs of routers.
Per-packet versus per-session load balancing: Once N equal cost paths are computed by the routing protocol, there are two modes of operation to load balance the traffic among the set of N paths, known as per-packet and per-session load balancing:
1. Per-packet load balancing: In this mode, packets are distributed among the N paths in a round-robin fashion. Although quite efficient in terms of load sharing, this mode has the downside of introducing packet reordering within microflows. Indeed, the packets belonging to a single flow/conversation between a pair of hosts are likely to follow different paths, which may have different characteristics in terms of delay (e.g., different propagation or queuing delays along the paths). The immediate consequence is that they may be delivered in a different order, especially if the delays among the set of N paths significantly differ.
Figure 4.26 Symmetrical networks with ECMP paths (all links have a cost of 1).
The impact of packet reordering is highly application dependent, but even a small reorder rate can have a very significant impact on the traffic throughput. An extensive analysis of the impact of reordering on TCP traffic can be found in [REORDERING] and shows a nonnegligible TCP throughput drop for a reorder rate between 0.1% and 1.0%, especially on long-lived flows and when delays experienced along the path are high. When the reorder rate approaches 10%, the application throughput is close to the minimal utilization. Video applications are also quite sensitive to packet reordering, which basically results in an increase in packet loss. In addition to the packet reordering problem, sessions may experience increased jitter.
2. Per-session load balancing: One way to alleviate the packet reordering problem is to ensure that the packets belonging to a single flow always follow the same path (see footnote 54). This involves a hashing mechanism in which a set of K buckets is used to select one of the N ECMP paths and the hashing function is performed on a set of IP fields. The idea is to ensure that the packets belonging to a session between two hosts are assigned to a single path while still achieving load balancing, because multiple sessions will be assigned to different paths. The function in charge of assigning a session to a particular path among the set of N candidate paths uses a hashing function involving K hash buckets that takes into account the source and destination IP address of each packet. Then each of the K hash buckets points to a particular active path (between 1 and N). There is a well-known issue with such hashing functions called the polarization effect, whereby traffic gets polarized along the same path if the same hashing function is used at each hop along the path. Existing commercial routers implement solutions that correct the polarization effect by making the hashing function differ at each hop.
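The bucket-based selection just described can be sketched as follows. This is a minimal illustration, not any vendor's algorithm: the CRC32 hash, the round-robin assignment of buckets to paths, and the per-hop seed used to make the hash differ from one router to the next are all assumptions chosen for simplicity.

```python
import zlib


def build_bucket_table(num_buckets, num_paths):
    """Assign each of the K buckets to one of the N paths, round-robin."""
    return [bucket % num_paths for bucket in range(num_buckets)]


def select_path(src_ip, dst_ip, bucket_table, hop_seed=0):
    """Hash the address pair into a bucket and return that bucket's path.

    hop_seed is a hypothetical per-router value; giving each hop a different
    seed is one way to avoid the same choice being made at every hop."""
    key = f"{hop_seed}:{src_ip}:{dst_ip}".encode("ascii")
    bucket = zlib.crc32(key) % len(bucket_table)
    return bucket_table[bucket]


table = build_bucket_table(num_buckets=16, num_paths=2)
flows = [("10.0.0.%d" % host, "192.0.2.1") for host in range(1, 9)]
for src, dst in flows:
    print(src, dst, "-> path", select_path(src, dst, table))
```

Every packet of a given source/destination pair maps to the same bucket and therefore to the same path, which preserves ordering within the session; how evenly the load is shared depends on how many sessions there are and how much traffic each carries.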
In some cases (in particular when the number of sessions is small), the load share among the set of N paths may not be even; in particular, when tunneling mechanisms are used in the network, the number of sessions may be relatively small, which may lead to unequal load sharing if the amount of traffic carried over those tunnels significantly differs. Enhanced hashing algorithms can overcome this potential limitation. Further analysis of hashing-based schemes can be found in [HASH].
54 Note that the queuing system should also ensure that packets from a single flow do not get reordered, which could typically occur in the case of packet re-marking. Indeed, suppose that the packets between a host X and a host Y are marked with a DSCP D1. If at some point some packets are re-marked with a different value D2 (e.g., because the flow bandwidth is not compliant with the QoS contract), then packet reordering could also occur if packets with DSCP D1 and D2 match different queues, even if all the packets follow the same path in the network. So a good practice is to ensure that packet re-marking does not imply selecting a different queue in any node of the network.
Symmetrical versus asymmetrical load balancing: It is worth noting that only symmetrical load balancing is supported by both IS-IS and OSPF (some
distance vector protocols like EIGRP support asymmetrical load balancing though). Indeed, asymmetrical load balancing requires some extra precaution; consider Figure 4.27. As depicted in Figure 4.27, there are two paths between routers E1 and E2:
Path 1: E1-C1-C2-E2 (cost = 3)
Path 2: E1-C3-C4-C5-E2 (cost = 6)
Because the cost of path 2 is twice the cost of path 1, the idea of asymmetrical load balancing is to load balance the traffic between E1 and E2 with a share inversely proportional to the respective costs, so in this example, twice as many packets would be sent on path 1 as on path 2. So suppose that 99 packets are sent from E1 to E2. According to the previous rule, 66 packets would be sent along path 1 and 33 packets would follow path 2. But node C3 also has two paths to E2: path 3 (C3-E1-C1-C2-E2 with a cost of 5) and path 4 (C3-C4-C5-E2 with a cost of 4). If C3 applies the same rule, it would send approximately (4/9) * 33 = 15 packets along path 3 and 18 packets along path 4. So 15 of the 99 packets originally sent by E1 would be looping between E1 and C3. Then of course, (1/3) * 15 = 5 packets would be sent again along path 2, and so forth. We can notice that the loop is partial and the packets will eventually be delivered to the destination (provided that their IP TTL does not expire first), but such a routing decision is of course highly undesirable. There could be some partial solutions to address this issue, like avoiding sending a packet over the interface it has been received from, but then there are more complicated cases with partial loops involving N routers, especially with asymmetrical link costs, that would require more protocol modifications.
Now a legitimate question is whether there is any relationship between IP load balancing and recovery upon a network element failure. In fact, symmetrical
Figure 4.27 Asymmetrical load balancing.
networks with equal cost paths have some interesting properties as far as recovery is concerned:
1. Reduce the failure impact to a subset of the flows between a pair of routers: Indeed, if the flows between router X and router Y are load balanced among paths P1 and P2, the impact of a failure of one path is limited to a subset of the flows between routers X and Y, provided that the failure does not simultaneously affect both paths.
2. Improve the convergence time in the case of a failure: Consider the case of the network depicted in Figure 4.26. Because there are two ECMP paths from node E1 to E2 (E1-C1-C2-C3-E2 and E1-C4-C5-C6-E2), the traffic between E1 and E2 is load balanced between the two paths according to one of the load balancing methods mentioned earlier. In the case of a local link failure, the traffic from E1 to E2 could immediately be switched from one path to the other without waiting for the recomputation of the routing table.
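Returning to the asymmetrical load balancing example of Figure 4.27, the packet counts quoted there can be checked with a short sketch. It assumes the rule stated in the text (shares inversely proportional to path cost) and uses exact fractions so that the rounding to whole packets is explicit; the function name is illustrative.

```python
from fractions import Fraction


def inverse_cost_shares(costs):
    """Split traffic among paths with shares inversely proportional to cost."""
    weights = {path: Fraction(1, cost) for path, cost in costs.items()}
    total = sum(weights.values())
    return {path: weight / total for path, weight in weights.items()}


# Shares computed by E1 (paths 1 and 2) and by C3 (paths 3 and 4).
e1_shares = inverse_cost_shares({"path1": 3, "path2": 6})
c3_shares = inverse_cost_shares({"path3": 5, "path4": 4})

sent = 99
on_path2 = sent * e1_shares["path2"]          # packets reaching C3
back_to_e1 = on_path2 * c3_shares["path3"]    # packets C3 sends back via E1
second_pass = back_to_e1 * e1_shares["path2"] # packets E1 sends to C3 again

print("E1 shares:", e1_shares)
print("C3 shares:", c3_shares)
print("looping back to E1:", float(back_to_e1))
print("sent along path 2 again:", float(second_pass))
```

The exact values are 44/3 (about 15) packets returning to E1 and 44/9 (about 5) packets sent along path 2 a second time, matching the rounded figures in the text.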
4.9 QoS during Failure
We saw in detail in the previous sections that upon detection of a network element failure, each node in the network recomputes its routing table to determine the shortest path to every other node in the network according to the new network topology. The objective of IP routing is not to route particular traffic based on traffic constraint characteristics other than the IP destination. In other words, by contrast with other technologies like enhanced optical networks, ATM, or MPLS traffic engineering, no constraint such as bandwidth requirement or resource class affinity is taken into account during the shortest path computation. Consequently, IP routing does not try to achieve any goal of QoS guarantee upon a network element failure. Instead, a new SPT is computed based on the updated topology that selects the shortest IGP path to every other node in the network. Because the path computation is exclusively based on the destination, all the traffic to a particular destination will follow the same shortest path regardless of the traffic requirements, but that does not mean that QoS objectives cannot be met with IP routing both at steady state and under failure.
4.9.1 IP Traffic Engineering at Steady State
We will first discuss how traffic engineering techniques can be used at steady state (in the absence of failure) to guarantee some QoS. It is very important to elaborate some more on the concept of QoS here: Strictly speaking, IP routing does not deal with QoS, which relates to the set of mechanisms to handle traffic prioritization or congestion avoidance as described in Section 4.5. That being said, IP routing determines the traffic paths and so the link loads. Hence, in this section, as far as
IP routing is concerned, the objective of QoS is explicitly related to the link loads both at steady state and under failure, which inevitably affect the traffic QoS. The notion of traffic engineering is discussed in detail in Chapter 5 in the context of MPLS traffic engineering, but in a nutshell, the aim of traffic engineering is to use the network resources efficiently and try to reduce the link utilization. Because the paths followed by the packets between routers are determined by the shortest path computation, one way to perform traffic engineering consists of running some off-line (see footnote 55) algorithm that determines the IGP metrics that will result in more efficient network resource usage. A common practice is to use IGP metrics inversely proportional to the link speed. So, for instance, if an OC192 link has an IGP metric of 1, an OC3 link would have a metric of 64. Although such an approach has the obvious benefit of being extremely simple and straightforward, it also suffers from some obvious limitations and may not allow the IP backbone to be traffic engineered very efficiently.
At this point, it is probably worth elaborating on the notion of “efficient traffic engineering techniques for IP”: This usually refers to the ability to minimize the maximum/average link utilization rate to improve the QoS and reduce queuing delays, jitter, and packet loss. Other constraints may also be added, like the minimization of the propagation delay experienced by the packets traveling across the network, or some bound on the maximum tolerable propagation delay. Trying to minimize the maximum link utilization has the obvious implication of increasing some path lengths, which implies a higher propagation delay. Hence, the ability to specify that the propagation delay increase after optimization should not exceed some threshold is undeniably useful. Another interesting objective of the IGP metric optimization is to avoid drastic IGP metric changes.
As already discussed in detail in Section 4.7, temporary loops may result from link cost changes (increase and decrease). Consequently, the computation of a new set of IGP metrics should try to minimize the number of IGP metric changes and the order of magnitude of those changes. Indeed, the larger the changes are, the more likely temporary loops will appear during the network convergence period. So one should bear in mind that such techniques relying on IGP metric changes are not entirely nondisruptive for the traffic because every IGP metric change may lead to some temporary loops. Sophisticated IGP metric computation algorithms should try to limit the number and the degree of those changes to minimize the likelihood of temporary loops and their diameter. Note also that such optimization techniques usually try to make extensive use of ECMPs. If unequal load balancing could be supported by link state protocols, this would certainly make those optimization techniques even more efficient.
55 Note that the off-line nature of such an algorithm is an important fact to underscore. By contrast with adaptive routing mechanisms, current link state protocols use static metrics. When the traffic matrix is not expected to change too often, those algorithms that compute a set of metrics meeting some expected target are off-line and the IGP metric values are then manually configured on each router.
As a side note, the projection may not be entirely accurate because in most cases, the assumption of per-packet load balancing is made, which does not correspond to
the most common load balancing techniques for the reasons mentioned in Section 4.8 (such as packet reordering).
Several interesting IP traffic engineering approaches have been proposed during the past few years, with algorithms that determine the set of IGP metrics that would allow an IP backbone to be traffic engineered efficiently, provided that the traffic matrix is known. Note that this latter assumption is of the utmost importance to produce efficient results. That being said, when the traffic matrix is not known, some algorithms exist that try to infer the traffic matrix from the observation of some link loads (see [TRAF-EST] for a review of some techniques to perform traffic matrix estimation); those link load statistics can usually be easily gathered from routers’ SNMP management information bases (MIBs). Note that some of those traffic estimation algorithms also provide some level of confidence about the output. Depending on the level of confidence, the network administrator may decide either to perform further investigations, to increase the traffic matrix by some fudge factor to reduce the risk of traffic underestimation, or to just make some projection of future traffic growth. It is worth noting also that the accuracy of such algorithms is usually a function of the network topology; in other words, traffic estimation algorithms are more accurate on some network topologies than on others. Another alternative is to use management functionality provided by some router vendors that allows gathering the required statistics to build traffic matrices. Note also an interesting paper related to IP traffic characterization [IP-TRAF].
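As a concrete instance of the simple practice mentioned earlier in this section (IGP metrics inversely proportional to the link speed), the following sketch derives metrics from the standard SONET line rates, taking OC-192 as the reference with a metric of 1. The table and function names are illustrative, and rounding to the nearest integer is an assumption.

```python
# Standard SONET line rates in Mbps.
LINK_SPEED_MBPS = {
    "OC3": 155.52,
    "OC12": 622.08,
    "OC48": 2488.32,
    "OC192": 9953.28,
}


def igp_metric(link_type, reference="OC192"):
    """Metric inversely proportional to link speed; the reference link
    type gets a metric of 1."""
    ratio = LINK_SPEED_MBPS[reference] / LINK_SPEED_MBPS[link_type]
    return round(ratio)


for link_type in LINK_SPEED_MBPS:
    print(link_type, igp_metric(link_type))
```

This reproduces the OC3 metric of 64 quoted in the text; metrics produced by an off-line optimization would generally depart from this rule.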
4.9.2 QoS Guarantee during Failure
The statement “QoS guarantee during failure” is even truer upon network element failure. As already stated, IP routing does not try to achieve a QoS guarantee along the backup path upon a network element failure but tries to quickly compute the new shortest path. That being said, some IGP optimization techniques can also be used to determine a set of IGP metrics that tries to efficiently traffic engineer the IP traffic both at steady state and under any single failure scenario. One of the challenges of those algorithms is to find a set of IGP metrics that does not have the side effect of routing the traffic at steady state in a nonoptimal way (or at least in a way too far from the optimal) because of the constraint of minimizing the maximum link utilization under link failure. Such an objective function of minimizing the maximum/average link utilization at steady state and under failure is undoubtedly more difficult to compute, and the level of efficiency may vary substantially with the network topology and traffic matrices. Also, it is difficult to optimize the set of IGP metrics for the network load both at steady state and upon network failure, essentially because the set of IGP metrics must be computed to handle any network failure. In some network topologies, several alternate paths may exist between a pair of nodes, but because link state protocols like OSPF and IS-IS do not support asymmetrical load balancing, the available bandwidth may not be easily usable, unlike some other connection-oriented technologies in which multiple asymmetrical backup paths may be used to protect a particular network resource. That being
said, several algorithms have been designed that try to compute a set of metrics so the maximum link utilization under steady state and single network element failure is minimized. The additional constraint of delay mentioned earlier can also be taken into account, as well as the objective of minimizing the number of IGP metric changes and their order of magnitude.
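The objective such algorithms evaluate can be illustrated with a small sketch that, for a given set of metrics, computes the maximum link utilization at steady state and after each single link failure. It is only an evaluation step, not an optimizer, and it makes simplifying assumptions: a toy triangle topology, one shortest path per demand (no ECMP splitting), and loads summed over both directions of a link.

```python
import heapq


def shortest_path(links, src, dst):
    """Dijkstra over bidirectional links {(u, v): metric}; returns the node
    list from src to dst, or None if dst is unreachable."""
    graph = {}
    for (u, v), metric in links.items():
        graph.setdefault(u, []).append((v, metric))
        graph.setdefault(v, []).append((u, metric))
    best, previous, heap = {src: 0}, {}, [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue
        for neighbor, metric in graph.get(node, []):
            new_cost = cost + metric
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(previous[path[-1]])
    return path[::-1]


def max_utilization(links, capacity, demands):
    """Route every demand on its shortest path and return the highest
    load/capacity ratio over the links that are up."""
    load = {link: 0.0 for link in links}
    for (src, dst), volume in demands.items():
        path = shortest_path(links, src, dst)
        if path is None:
            continue  # demand cannot be routed in this failure scenario
        for u, v in zip(path, path[1:]):
            load[(u, v) if (u, v) in links else (v, u)] += volume
    return max(load[link] / capacity[link] for link in links)


def utilization_per_scenario(links, capacity, demands):
    """Maximum utilization at steady state and under each single link failure."""
    results = {"steady": max_utilization(links, capacity, demands)}
    for failed in links:
        remaining = {l: m for l, m in links.items() if l != failed}
        results[failed] = max_utilization(remaining, capacity, demands)
    return results


# Toy topology (assumed): a triangle with 100 Mbps links.
links = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 3}
capacity = {link: 100 for link in links}
demands = {("A", "C"): 40, ("A", "B"): 30}
results = utilization_per_scenario(links, capacity, demands)
for scenario, value in results.items():
    print(scenario, value)
```

A metric optimizer would call an evaluation of this kind repeatedly for candidate metric sets, keeping those that lower the worst value across all scenarios.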
Link Metric Manipulation
Consider a set S of IGP metrics and an optimization function computing another set S' of IGP metrics to meet the objectives mentioned earlier. Let us also suppose that the number of changed IGP metrics is K. Then, from a network operation standpoint, the number of steps to change the K metrics is K. Why? Because a good practice will likely be to change one metric at a time, let the routers converge, and then change another IGP metric. Some optimization could actually be done to try to minimize the number of steps. Furthermore, ideally, the optimization function should also be able to make sure that the network objectives (like the maximum link utilization or the delays) are also met during each transition. Indeed, without that objective in mind, the new set of metrics could be such that the objectives are met in the final network state (once all the IGP metrics have been changed and the network has converged) but not at each step, which is obviously undesirable, except if such changes are performed during a network maintenance window when the traffic load is lower (see footnote 56). Note that IP traffic engineering techniques relying on IGP metric changes are not expected to be performed very often, considering the nonnegligible amount of work and the potential risks to the network.
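The one-metric-at-a-time procedure, with the objectives checked at every intermediate state, can be sketched as a search over the order in which the K changes are applied. The brute-force search below is an assumption made for illustration (it tries up to K! orders and is only practical for small K), and the `objectives_met` predicate is a placeholder for a real evaluation such as maximum link utilization.

```python
from itertools import permutations


def changed_links(current, target):
    """Links whose metric differs between the two metric sets."""
    return [link for link in sorted(target) if current.get(link) != target[link]]


def find_safe_order(current, target, objectives_met):
    """Return an order of single-metric changes such that every intermediate
    state satisfies objectives_met, or None if no such order exists."""
    for order in permutations(changed_links(current, target)):
        state = dict(current)
        for link in order:
            state[link] = target[link]
            if not objectives_met(state):
                break
        else:
            return list(order)
    return None


# Toy example with a made-up objective: the metric of A-C must never drop
# below the metric of A-B during the transition.
current = {"A-B": 1, "A-C": 3, "B-C": 1}
target = {"A-B": 4, "A-C": 5, "B-C": 1}
order = find_safe_order(current, target, lambda m: m["A-C"] >= m["A-B"])
print(order)
```

In this toy case, changing A-B first would violate the objective in the intermediate state, so the search returns the order that raises A-C first.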
Algorithm Complexity
The problem of computing a set of IGP metrics that tries to meet a set of objectives, like minimizing the maximum link utilization at steady state and under single failure along with other constraints, is clearly NP-complete. Such algorithms make extensive use of heuristics that drastically reduce the computation time while trying to approach the optimal solution (which can be defined as the solution obtained by solving a multicommodity flow problem). The art of designing such algorithms is in the ability to come up with a set of appropriate and efficient heuristics.
Definition of efficiency: The degree of efficiency of such IP traffic engineering techniques, compared to other technologies making use of call admission control mechanisms with constraint-based routing, for example, is still difficult to determine for several reasons. First, the degree of efficiency is directly tied to the algorithms in use to compute the IGP metrics or the TE LSP paths in the case of MPLS (on-line or off-line). Furthermore, the results are generally quite topology dependent. Further readings on the topic of IP traffic engineering are [IP-TE-1] to [IP-TE-6].
56 Note that this may be more and more difficult because service providers tend to optimize their network utilization by running various traffic types at different periods, which progressively leads to the absence of periods where the network is quiet.
4.10 Nonstop Forwarding: An Example with OSPF
Several modern router architectures have dedicated hardware for the control plane and data plane operations. Hence, such architectures can handle a control plane failure without any impact on the data plane. The objective of nonstop forwarding (NSF) is to handle control plane failure on such platform architectures while preserving the data plane. NSF requires some IGP extensions, which are detailed in this section. The usual procedure in the case of control plane (routing) failure is for every neighboring router detecting the failure to trigger the normal IP rerouting process described earlier in this chapter (every neighbor of the router whose IGP control plane has failed originates a new LSA reporting the new topology state), and so all the routers in the routing domain trigger a new SPF, resulting in the exclusion of the failing router from the forwarding paths of every other router. As mentioned earlier, there are several router architectures that allow the packet forwarding operation to continue even in the case of a control plane failure. For instance, a router experiencing the failure of its RP (in charge of the control plane functions including routing and signaling) still continues to forward traffic based on the last state of its routing table while a standby RP takes control upon the failure of the primary RP. The period during which the backup RP takes control and resynchronizes its control plane states is called the restarting period. As pointed out earlier, upon normal IGP operation, the restarting router would establish a new routing adjacency; consequently, this would trigger a network convergence as mentioned later. So both OSPF and IS-IS have been enhanced to allow such a procedure, called “graceful restart” or “NSF” (see footnote 57), without impacting the packet forwarding in the network (see [OSPF-GR] and [ISIS-GR]). Note that a control plane failure covers the case of both a planned and an unplanned control plane failure.
In the former case, the network operator may have to upgrade the hardware of the RP, for example, with the objective of not impacting the traffic forwarding in the network. The latter case (unplanned control plane failure) can occur in the case of hardware or software failure of the primary RP. If such an unplanned failure occurs, the NSF procedure will guarantee that traffic forwarding is not impacted; this latter mode can be supported only on platforms that preserve the forwarding states upon a control plane failure. As described later in this section, there is another condition for the NSF procedure to be completed without forwarding change during the control plane failure of the failing router: the absence of other network changes during that period. This is to reduce the probability of creating temporary loops during the restart period, as described hereafter in this section. We study the example of OSPF in this section.
57 In the rest of this section, the term NSF is used.
As already mentioned, the restarting operation may be triggered either because of an unplanned RP failure (hardware or software) or to proceed to a hardware upgrade (e.g., a memory upgrade on the primary RP). In the rest of this section we will see the general mode of operation, followed by the detailed mode of operation of the restarting router and its neighbors.
4.10.1 Mode of Operation
When the backup RP takes control upon detecting the failure of the primary RP, the failing router (also called the restarting router) first sends a notification to each of its neighbors indicating that it is entering restarting mode and is requesting a “grace period.” For OSPF, a particular LSA is originated for this purpose: the grace LSA, an opaque LSA with a link-local flooding scope, which restricts the flooding of the LSA to local neighbors. During the “grace period,” each neighbor continues to advertise the restarting router in its LSA (i.e., no topology state change is reported). There is one exception to that rule: if a network topology change occurs during the “grace period,” every neighbor of the restarting router switches back to the regular OSPF mode of operation. This is a safe procedure that limits the risk of loops. Until the restarting node has converged, it relies on its previous routing information to forward packets. Hence, if a network topology change occurs during the grace period, the restarting node is not capable of recomputing its routing table, and not aborting the NSF procedure could lead to routing loops.
4.10.2 Mode of Operation of the Restarting Router
Entering graceful restart mode: Once the backup RP detects the failure of the primary RP, or if the network administrator forces the restarting mode, the node enters graceful restart mode. Note that the “grace period” can either be dynamically computed by the backup RP in the former case or manually specified by the network operator in the latter case. Planned and unplanned control plane failures are handled in slightly different ways.
1. Planned control plane failure: After the restarting router has determined that the forwarding states are operational, it originates the grace LSA, which contains the “grace period” value and the reason for the graceful restart, among other parameters. The grace LSA is flooded to each directly connected neighbor (more accurately, out of every local interface where there is a neighbor) and its flooding scope is local; in other words, the grace LSA is not flooded beyond the direct neighbors.
2. Unplanned control plane failure: Once the backup RP has detected the primary RP failure, it must originate the grace LSA before starting to send any hello packet to its neighbors. The restarting router must send the grace LSA in an OSPF link state update
packet even though the restarting router has not yet reestablished any adjacency with its neighbors. The restarting router may decide to send multiple copies of the grace LSA to its neighbors to increase the probability of successful delivery.
During the graceful restart period (between the time the backup RP becomes active and the time all the adjacencies are reestablished), the restarting router performs the following set of actions:
1. The restarting router must not generate any new LSA and must not flush or modify any received self-originated LSA.
2. OSPF calculations can be performed, but no new OSPF routes may be installed. During the restarting period, the restarting router must rely on the forwarding states computed before the failure.
3. OSPF has the notion of designated router (DR), used on multiaccess subnetworks. In a nutshell, on such subnetworks, it would be expensive to have a full mesh of adjacencies between routers (n routers on a LAN, for instance, would lead to n(n − 1)/2 adjacencies). To solve this issue, a designated router (DR) is dynamically elected, as well as a backup designated router (BDR). Each router on the multiaccess subnetwork then maintains an adjacency with the DR and the BDR, which have the responsibility of flooding the LSA updates. So, for instance, if a router X receives a new LSA, it sends it to the DR and BDR (using specific multicast addresses), which then reflood it to all the routers they have an adjacency with, using a multicast address. If, during the restarting period, the node determines that it was the DR before the failure (i.e., the restarting router is listed as the DR in an OSPF hello packet received from a neighbor), it must immediately consider itself the DR for the multiaccess network.
Exiting graceful restart mode: A restarting router exits graceful restart mode if one of the three conditions below is met:
1. All the adjacencies have been reestablished: Upon reestablishing new adjacencies, the process of LSDB resynchronization allows the restarting router to retrieve a complete LSDB. The analysis of the LSA that the restarting router had originated before the failure tells it the number of expected adjacencies. Hence, the restarting router can determine when all the adjacencies have been reestablished.
2. An inconsistent LSA is received from a neighboring router: Suppose that the restarting router A is adjacent to a router B that does not support the graceful restart procedure described here, or that B has not received the grace LSA. B will then have considered its adjacency with router A as down, and A may receive both its own self-originated LSA from a neighbor C, reporting an adjacency between routers A and B, and an LSA from node B that does not report this A-B adjacency. This requires the node to exit graceful restart mode. Another
situation is when A does not receive its self-originated LSA from B after the adjacency reestablishment.
3. The grace period ends.
As soon as the restarting router exits from the graceful restart mode, normal OSPF operation is resumed (the router reoriginates its LSA, OSPF calculations are performed, and grace LSAs are flushed).
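The three exit conditions for the restarting router can be sketched as a small decision function. This is only an illustrative sketch: the `RestartState` record, its field names, and the order in which the conditions are tested are assumptions made for the example and are not taken from [OSPF-GR] or from any router implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartState:
    """Hypothetical snapshot of a restarting router (illustrative field names)."""
    expected_adjacencies: int       # derived from the pre-failure self-originated LSA
    reestablished_adjacencies: int  # adjacencies brought back up so far
    inconsistent_lsa_seen: bool     # e.g., a neighbor's LSA no longer reports the adjacency
    elapsed_s: float                # time since the backup RP became active
    grace_period_s: float           # value advertised in the grace LSA


def restart_exit_reason(state: RestartState) -> Optional[str]:
    """Return why graceful restart should end, or None to remain in restart mode."""
    if state.inconsistent_lsa_seen:
        return "inconsistent LSA received"
    if state.reestablished_adjacencies >= state.expected_adjacencies:
        return "all adjacencies reestablished"
    if state.elapsed_s >= state.grace_period_s:
        return "grace period expired"
    return None
```

In every case where a reason is returned, the router would then resume normal OSPF operation as described above; the function only decides when that happens.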
4.10.3 Mode of Operation of the Restarting Router’s Neighbors
The neighbors of the restarting router are also called helper neighbors because they help the restarting router restart its control plane. As long as no network topology change occurs, helper neighbors continue to advertise their LSA reporting an active adjacency with the restarting router.
Entering helper mode: When a router B receives a grace LSA from the restarting router A, it enters helper mode, provided that B supports the graceful restart procedure. In addition, router B must have an adjacency with router A, the grace period must not have expired, there must not have been any network topology change, and there must not be any locally configured policy preventing node B from acting as a helper node. Finally, to enter helper mode, node B itself must not be in restarting mode.
Exiting helper mode: A node in helper mode exits from that mode if one of the conditions below is met:
a. The restarting node exits from the restarting mode by flushing the grace LSA.
b. The grace period terminates.
c. A network topology change occurs in the network. More accurately, node B receives a new LSA (an LSA with new content, which excludes LSAs received as the result of a refresh) that it would have flooded to node A under normal circumstances.
Exiting helper mode implies that the node reoriginates its LSA based on the state of its adjacency with the restarting router A.
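The helper-side rules listed above lend themselves to a compact sketch. The code below is a hedged illustration only: `HelperContext`, its field names, and both function names are hypothetical and simply mirror the conditions stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HelperContext:
    """Hypothetical view of neighbor B when a grace LSA from A arrives."""
    supports_graceful_restart: bool
    adjacent_to_restarter: bool
    grace_period_expired: bool
    topology_changed: bool
    policy_allows_helping: bool
    is_restarting_itself: bool


def may_enter_helper_mode(ctx: HelperContext) -> bool:
    """All entry conditions from the text must hold simultaneously."""
    return (
        ctx.supports_graceful_restart
        and ctx.adjacent_to_restarter
        and not ctx.grace_period_expired
        and not ctx.topology_changed
        and ctx.policy_allows_helping
        and not ctx.is_restarting_itself
    )


def helper_exit_reason(grace_lsa_flushed: bool,
                       grace_period_expired: bool,
                       changed_lsa_for_restarter: bool) -> Optional[str]:
    """Return the first exit condition (a, b, or c) that applies, else None."""
    if grace_lsa_flushed:
        return "a: grace LSA flushed by the restarting router"
    if grace_period_expired:
        return "b: grace period terminated"
    if changed_lsa_for_restarter:
        # A refreshed LSA with unchanged content must not set this flag.
        return "c: topology change"
    return None
```

Whatever the reason returned, the helper would then reoriginate its LSA according to the actual state of its adjacency with the restarting router.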
4.10.4 Backward Compatibility
Note that the graceful restart procedure described above is fully compatible with existing implementations. Indeed, if a node receives a grace LSA and does not understand this LSA because it does not support the graceful restart procedure, it will simply ignore it. Then it will originate a new LSA, reporting the lost adjacency with the restarting router. Every other neighbor of the restarting node A will then interpret the receipt of the new LSA as a topology change and will exit from the helper mode. As for the restarting router, upon reestablishing new adjacencies, it will receive inconsistent LSAs and will exit the restarting mode, reverting to
the normal OSPF mode. This also means that in order for a restarting node to perform the graceful restart operation, all its neighbors must also support the graceful restart mode. However, what if the secondary RP cannot switch over? In the case of an unplanned router failure, the grace LSA is sent by the secondary RP. Consequently, if the secondary RP is not in service, no grace LSA will be sent, and normal OSPF convergence procedures will apply. If the secondary RP can send a grace LSA but cannot successfully perform the graceful restart procedure, the adjacency will be maintained until the grace period has expired, which is not an issue because the assumption is made that the forwarding states are preserved.
4.11 A Case Study with IS-IS
This chapter concludes with a case study that has the following structure:
Assumptions: Network topology, layer 2/3 protocols, and so on.
Objectives: Convergence time, failure coverage, performance, and so on.
Proposed design: There are obviously several possible network designs that satisfy a given set of requirements. One specific design (with some variations) is proposed that meets the set of requirements.
For the sake of illustration, IS-IS configuration examples provided in this case study correspond to the configuration of Cisco routers.
Assumptions
We consider the network depicted in Figure 4.28. The network is made of three layers:
• An optical layer providing unprotected optical lambdas only
• A SONET layer that provides protected VCs up to OC3
• An IP layer
The IP routers are interconnected by various link types:
• Within a point of presence (POP): Routers are interconnected by means of Giga-Ethernet (GE) switches. A typical POP infrastructure is represented in Figure 4.28. All the POPs have an identical infrastructure: a set of edge routers and two core routers interconnected to the other core routers via the wide area network (WAN) links depicted in the diagram. Within a POP, all the routers (edge and core) are interconnected via one or several layer 2 switches. In the case of a service provider network, the edge routers are used to aggregate the customer traffic; in other words, the customer routers are connected to the service provider network via one or several links (sometimes, there might be multiple customer routers in the same customer site, each router being connected to a different service provider’s edge router for redundancy reasons). If the network is an enterprise network,
Figure 4.28 IP case study with IS-IS. [Figure not reproduced. Legend: OC3 link (SONET, protected); OC48 link (unprotected); OC192 link; edge routers; core routers; layer 2 switch. POPs shown: SEA, HEL, MIN, ONT, CHI, BOS, NYC, SLC, SFO, WAS, DEN, KAN, PHX, LAX, ATL, DAL, HOU, MIA.]
the edge routers may be used to aggregate the traffic coming from remote sites. Another possibility is that the remote sites are directly connected to the core routers; in this case, the POP infrastructure is reduced to the core routers.
• Up to OC3, the links are provided by the SONET layer and are always protected.
• OC48 and OC192 links are unprotected lambdas provided by the optical layer.
No change in terms of layer 1/2 protection can be made in the network. In other words, unprotected links cannot be protected by the optical layer (to reduce cost).
Three types of traffic are carried in the network:
• Internet traffic
• VPN traffic (IPSec, L2TP version 3, etc.)
• Voice traffic
The network is Diffserv enabled and two classes of service are configured in the core:
• An EF class is used for the voice traffic. The voice traffic is marked (colored) at the edge of the network with a DSCP of 5. IP packets
carrying voice are queued in a preemptive queue and are always served with the highest priority.
• An AF class is used for the data traffic. A congestion avoidance mechanism such as WRED is configured so that Internet traffic is dropped more aggressively than VPN traffic in the case of network congestion.
The process of flooding always takes precedence over SPF triggering: As discussed earlier in this chapter, a proper IGP implementation, upon receiving a new LSP, should always try to flood the new LSP to other neighbors as quickly as possible. In other words, upon receiving a new LSP, a router should not first trigger an SPF and then flood the LSP to other neighbors, because this would have the undesirable effect of slowing down the overall convergence.
IGP control packets receive an appropriate QoS: The queuing delays experienced by IGP packets can be considered negligible.
All the routers have a distributed architecture and are equipped with a secondary (backup) RP. Only the edge routers are NSF capable. In this network, the choice is made to always trigger an IGP convergence upon a core router failure (control or forwarding plane) because the network has been designed to handle the rerouted flows without QoS degradation if a node failure occurs. In contrast, NSF is used on edge routers, because no alternate path exists to reach customer routers that are attached to a single edge router. In conclusion, all the edge routers are NSF capable and the core routers just act in helper mode.
Objectives
Objective 1: The targeted convergence time upon both link and node failure is 1 second for all traffic.
Objective 2: Both at steady state and under single failure scenarios, the total proportion of voice traffic should never exceed 50% of any link capacity, to preserve a low delay and jitter for the voice traffic.
As already discussed, there is no “magic” number that should not be exceeded for the maximum proportion of voice traffic, but 50% is given here for the sake of the example.
Proposed Design
Achieving objective 1 requires tuning some IS-IS parameters so that the rerouting time does not exceed 1 second upon link or node failure. This implies that the following set of tasks must be completed within 1 second:
1. Network element failure detection.
2. LSP origination from the nodes detecting the failure (because this case study is devoted to IS-IS, the IS-IS terminology is used in the rest of this section).
3. LSP propagation throughout the entire network.
4. Routing table update (SPF and RIB computation).
Let us now analyze each component separately and propose an appropriate IGP parameter tuning.
Network element failure detection: On the WAN links (OC3, OC48, and OC192), the regular SONET alarms are used, which usually allow detection of a link failure in a very short period, on the order of a few milliseconds. On the other hand, within a POP, routers are interconnected by means of layer 2 switches, so the failure of a link between an edge router and the layer 2 switch requires some other mechanism in order for the neighboring routers to detect the failure (which might be a failure of the local link or of the router itself). Several link/node failure detection mechanisms have been described in Section 4.3. The current solution is to reduce the IS-IS hello interval and the hold-time timer, but this has some nonnegligible implications in terms of processing in the router. For instance, if the number of routers in a POP is 50, then tuning the hello interval and the hold-time timer too aggressively (say, a hello interval of 200 ms and a hold-time of 1 second) would have an impact on the core router because the core router would have to send a hello message every 4 ms on average (200 ms divided by 50 routers). Another alternative is to rely on some other hello protocol mechanism that is much more scalable. On the other hand, the network administrator may also make the assumption that the probability of a link failure within a POP or of a core router failure (core routers are usually highly redundant) is sufficiently low to be neglected.
In this case study, the following design choices are made: On the intra-POP interfaces, the hello interval and the hold-timer are set to 1 and 3 seconds, respectively. On the inter-POP interfaces (links between two core routers), the hello interval and hold-timer are set to 3 and 10 seconds, respectively. When BFD becomes available, the hello interval and the hold-timer could easily be increased to 10 and 30 seconds, respectively, which are usually the default values.
By contrast, BFD will use very short hello intervals (on the order of a few tens of milliseconds). This means that any link failure within a POP will be detected in at most 3 seconds. Likewise, the failure of the core router will be detected within 3 seconds by an edge router, but the likelihood of such failure is considered sufficiently low to be acceptable. On the other hand, the failure of the WAN links, much more common, will be detected in a few tens of milliseconds. In the case of a core node failure, the node failure detection time will vary according to the node failure type:
• Node power supply failure: In this case, the attached links will also fail and the neighboring nodes will quickly detect the failure via the SONET-SDH framing alarms.
• Node control plane failure: A control plane failure (which may not affect the data plane) will be detected in at most 3 seconds. If the failing node is NSF capable and its neighboring nodes support the helper mode, such a control
plane failure will not have an impact on the traffic forwarding. In the absence of NSF (the core routers in this case study), the neighboring nodes, upon detecting the control plane failure (within a maximum time frame of 3 seconds), will start rerouting the traffic around the node, but no traffic will be dropped. On the other hand, if a control plane failure also affects the forwarding state (this depends on the router architecture), the traffic will be dropped for at most 3 seconds. Reducing that time requires fast detection mechanisms like BFD. Because the assumption is made in this case study that the core routers preserve the forwarding state upon a route processor failure, no fast hello keep-alive mechanism is required in the core.
Configuration of the SPF, PRC, and LSP Origination Dampening Mechanisms
The following LSP origination, SPF, and PRC dampening mechanisms are configured:
lsp-gen-interval 5 20 50 (A B C)
As explained in Section 4.5, the parameters 5, 20, and 50 have the following effects:
B = 20 ms is the amount of time the router waits after the first link failure has been detected before originating a new LSP. If the router has several local links belonging to the same SRLG, waiting for 20 ms maximizes the probability of originating an LSP that reflects the actual local state of the router: in the case of an SRLG failure, it gives the router a chance to detect all the local failures and trigger a single LSP.
C = 50 ms corresponds to the amount of time the router waits before advertising a second LSP if a second local state change occurs.
A = 5 seconds is the maximum amount of time between two successive LSP originations according to the exponential back-off algorithm described in Section 4.3.
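To make the roles of A, B, and C concrete, the following sketch generates the successive waiting times that an exponential back-off of this kind could produce for lsp-gen-interval 5 20 50. It is a simplified model under stated assumptions: the doubling factor, the absence of any reset after a quiet period, and the function name `backoff_waits_ms` are illustrative choices, not a description of the actual router implementation.

```python
from itertools import islice
from typing import Iterator


def backoff_waits_ms(max_wait_s: int, initial_ms: int, second_ms: int) -> Iterator[int]:
    """Yield successive waits (ms): B first, then C doubling each time, capped at A.

    Assumption: the wait doubles after every consecutive event; the text only
    states that the back-off is exponential and bounded by A.
    """
    cap_ms = max_wait_s * 1000
    yield min(initial_ms, cap_ms)      # B: wait before the first origination
    wait = second_ms                   # C: wait before the second origination
    while True:
        yield min(wait, cap_ms)
        wait = min(wait * 2, cap_ms)   # never exceed A


# lsp-gen-interval 5 20 50 -> A = 5 s, B = 20 ms, C = 50 ms
print(list(islice(backoff_waits_ms(5, 20, 50), 10)))
# → [20, 50, 100, 200, 400, 800, 1600, 3200, 5000, 5000]
```

Under these assumptions, a stable network reacts within 20 ms, whereas a flapping link is progressively dampened until at most one LSP is originated every 5 seconds.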
spf-interval 5 30 20 (A B C)
prc-interval 5 30 20 (A B C)
B = 30 ms is the amount of time a router waits before triggering a new SPF after having received a new LSP reflecting a topology change.
Important note: There are two interesting situations to consider here, as follows:
Situation 1: The network does not contain any SRLG. If a link failure occurs, any router in the network should trigger an SPF quickly after receiving the first LSP reflecting the topology change (as explained in Section 4.5, the receipt of one LSP is sufficient to exclude the link from the network topology). If a node failure occurs, all the neighbors of the failing node will originate a new LSP. Depending on the network topology, one can expect that the respective LSPs will be received within a short time, so waiting for 30 ms is probably long enough to maximize the chance of running a new SPF on an up-to-date LSDB. Now, if link failures are much more frequent than node failures, reducing that
timer to a shorter value is perfectly reasonable and helps improve the convergence time.
Situation 2: The network contains a substantial number of SRLGs. If several links geographically distant from each other share some SRLG, then a slightly higher value for B may be advisable. Indeed, various routers will originate new LSPs, and because they can be distant from each other, waiting for a slightly longer period than 30 ms may be desirable.
C = 50 ms corresponds to the amount of time the router will wait before triggering a second SPF if a second new LSP is received.
A = 5 seconds is the maximum amount of time between two successive SPF computations according to the exponential back-off algorithm described in Section 4.4.
Important note: The values noted should not be considered set in stone. Indeed, every network is different and the tuning of IGP parameters must be driven by both the network constraints and the convergence time objectives. As mentioned in Section 4.7, heterogeneous IGP timers may have the undesirable effect of increasing the duration of temporary loops during the network convergence period. So the recommendation is to use identical timer configurations across all nodes in the network. In addition, incremental SPF is configured on each router, which will significantly decrease the SPF duration in the vast majority of cases.
Are the Objectives Met?
Now we analyze whether the initial objectives are met with the proposed design.
Objective 1: The targeted convergence time upon both link and node failure is 1 second for all traffic (Figure 4.29).
Case of an intra-POP link failure (failure 3 in Figure 4.29): As mentioned earlier, the worst case of failure detection is 3 seconds, so one cannot guarantee a 1-second convergence time if such a failure occurs. That being said, the assumption has been made that such failures are rare enough to tolerate a longer rerouting time.
The same reasoning applies to hardware core node failures with respect to the traffic originated in the POP. Case of an inter-POP link failure (failure of the WAN links—e.g., failure 2 in Figure 4.29). Thanks to the SONET/optical fast failure detection, a link failure will be detected in a few tens of milliseconds. Then a new LSP origination occurs after 20 ms and the LSP is flooded throughout the network. The two following assumptions were made:
• The process of flooding has a high priority.
• IGP control plane messages receive appropriate QoS.
Figure 4.29 A case study with IS-IS link and node failures. [Figure not reproduced. Legend: OC3 link (SONET, protected); OC48 link (unprotected); OC192 link; customer routers; edge routers; layer 2 switch; per-link propagation delay labels; markers for failure 1, failure 2, and failure 3.]
This implies that the newly originated LSP will be quickly flooded throughout the network and that the delay experienced by an LSP originated by node i to reach node j is reduced to the propagation delay along the links plus some processing delay at each hop. Let us carefully analyze two failure examples demonstrating that objective 1 is met:
Failure 1 is a power supply failure of the Chicago node: All of its attached links will fail and the node failure will be detected by means of link failure detection by every neighbor. The failure detection time will roughly be on the order of a few tens of milliseconds. After 20 ms, every neighbor of the Chicago router will originate a new LSP reflecting the topology change. Let us analyze the impact of this router failure on the node of PHX (Phoenix). If one makes the assumption that the first LSP received is the one from DEN (Denver) reflecting a loss of adjacency with the node of CHI (Chicago), that new LSP will likely be received less than 10 ms after having been sent by the DEN router (because of the short propagation delay between DEN and PHX). Then, upon receiving the first new LSP, every node in the network will wait 30 ms before triggering an SPF, according to the IGP tuning proposed in this case study; this will give enough time for the node of PHX to receive the other LSPs originated by the neighbors of the failing node of Chicago, that is, Kansas City (KAN), Seattle (SEA), Boston (BOS), New York City (NYC), Washington, D.C. (WAS),
Atlanta (ATL), and Dallas (DAL). The fact that the node of PHX waits some time before triggering its SPF is quite interesting in this particular scenario because it will have received the LSP of the node KAN. Hence, it will avoid rerouting its traffic through KAN, which does not have any alternative path (in fact, if PHX had not waited, it would have had to run another SPF upon receiving KAN’s LSP to properly reroute its traffic). That being said, the vast majority of failures are link failures (usually more than 80%), so reducing the value of B to a few milliseconds should reduce the convergence time in most failure scenarios and slightly increase it if a node failure occurs, which is less frequent. Hence, the total convergence time is a few tens of milliseconds for the failure detection time (e.g., 20 ms) plus 20 ms (time for the neighbors of the Chicago node to originate a new LSP) plus a few tens of milliseconds for the PHX node to receive the LSPs (say, 20 ms) plus 30 ms (initial wait before triggering the first SPF) plus the SPF duration. But what is the SPF duration? As described in Section 4.6, it is made of two components: the SPT computation and the RIB update. The first component is on the order of tens of milliseconds in a modern router architecture for a few hundred nodes. The second component (RIB update) depends on the number of routes, which can be reduced, and mechanisms can be used to prioritize the most important prefixes. In a nutshell, the total convergence time will very likely be less than 0.5 seconds, which completely meets the first objective.
Now we turn to the second type of failure (failure 2 in Figure 4.29). Failure 2 is the failure of the link between the nodes of Ontario (ONT) and Boston (BOS). Consider the rerouting time from the perspective of the node of Seattle (SEA) to reach the node of Boston (BOS).
The noticeable difference from the previous failure scenario is that the traffic from SEA to BOS, which typically follows the path SEA-HEL-MIN-ONT-BOS (Seattle-Helena-Minneapolis-Ontario-Boston), is rerouted by a node that is three hops upstream of the failure, which adds delay to the rerouting time. If one assumes that the additional delay consists of the propagation delay (speed of light) plus a few milliseconds of per-hop processing delay, one should probably add less than 100 ms to the previous convergence time. The aim of the second example is just to highlight that in some cases (especially when propagation delays are quite high and the first node able to reroute the traffic is several hops upstream of the failure), the convergence time may easily increase by a few hundred milliseconds, which is still perfectly in line with our overall 1-second convergence time target. It also highlights how challenging it is for restoration mechanisms (especially when the backup path is not local) to operate in the range of tens or hundreds of milliseconds, in contrast with local protection mechanisms like optical protection, SONET/SDH, or MPLS TE fast reroute.
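The arithmetic behind the two scenarios can be laid out as a simple time budget. The sketch below sums the components quoted in the text (20 ms detection, 20 ms LSP origination, 20 ms flooding, 30 ms SPF wait, and up to 100 ms of extra upstream delay for failure 2) and reports what remains of the 1-second target for the SPF and RIB update. The `ConvergenceBudget` class and its field names are hypothetical, and the figures are the illustrative values of this case study rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    """Fixed delay components of one failure scenario, in milliseconds."""
    detection_ms: float
    lsp_origination_ms: float
    flooding_ms: float
    spf_wait_ms: float
    extra_upstream_ms: float = 0.0  # added when the rerouting node is far upstream

    def fixed_ms(self) -> float:
        """Delay incurred before the SPF computation even starts."""
        return (self.detection_ms + self.lsp_origination_ms
                + self.flooding_ms + self.spf_wait_ms + self.extra_upstream_ms)

    def remaining_ms(self, target_ms: float = 1000.0) -> float:
        """Time left for the SPT computation and the RIB update."""
        return target_ms - self.fixed_ms()


failure_1 = ConvergenceBudget(20, 20, 20, 30)                         # Chicago node failure
failure_2 = ConvergenceBudget(20, 20, 20, 30, extra_upstream_ms=100)  # ONT-BOS link failure

for name, budget in (("failure 1", failure_1), ("failure 2", failure_2)):
    print(name, budget.fixed_ms(), budget.remaining_ms())
```

With these values, the fixed components amount to 90 ms and 190 ms, respectively, which leaves more than 800 ms for the SPF duration in both cases.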
Important note: In the two previous failure scenarios, one should bear in mind that there might be some variations in terms of network convergence in networks having a very large number of IP prefixes, as discussed in Section 4.6. That being said, the objective of 1 second can certainly be met because in the two previous cases the estimated rerouting time was 0.5 seconds, which leaves a time budget of 0.5 seconds for the RIB computation, a perfectly reasonable objective in most modern router architectures, at least for the most important IP prefixes (see Section 4.6 for a definition of an “important” prefix).
Failure 3 is the control plane failure of an edge router. For instance, consider the control plane failure of the first edge router on the left depicted in Figure 4.29. In most cases, all the customers will be attached to a single service provider edge router, and this is where NSF can help if a control plane failure (hardware or software route processor failure) occurs in a distributed router architecture or in some centralized router platform architectures. Indeed, in such a case, the backup route processor will handle the failure without any impact on the forwarding state. This is particularly useful and avoids some very undesirable traffic disruptions. Of course, this requires the edge router to be able to maintain its forwarding states and the neighbors to help the failing router recover.
Objective 2: Both at steady state and under a single failure scenario, the total proportion of voice traffic should never exceed 50% of any link capacity, to preserve a low delay and jitter for the voice traffic. Typically, such an objective is met by means of an external off-line IGP metric computation tool that provides a set of IGP metrics such that the objective is met upon a single network failure.
Note that other objectives like the propagation delay increase can also be added to the objective function in some off-line traffic engineering tool, as mentioned in Section 4.9.
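The check that such an off-line tool must perform for objective 2 can be sketched as follows: for each scenario (steady state and each single failure), compute the voice load on every link and verify that the worst share of link capacity stays at or below 50%. This is only an illustration; the function name, the link names, and the voice loads are invented for the example, and only the OC48 and OC192 line rates (2488.32 and 9953.28 Mbit/s) are standard values.

```python
from typing import Dict, Tuple


def worst_voice_share(capacity_mbps: Dict[str, float],
                      voice_load_mbps: Dict[str, Dict[str, float]]) -> Tuple[str, str, float]:
    """Return (scenario, link, share) for the highest voice load / capacity ratio."""
    worst = ("", "", 0.0)
    for scenario, loads in voice_load_mbps.items():
        for link, load in loads.items():
            share = load / capacity_mbps[link]
            if share > worst[2]:
                worst = (scenario, link, share)
    return worst


# Hypothetical two-link example: one OC48 and one OC192 between the same POPs.
capacity = {"L1": 2488.32, "L2": 9953.28}
voice = {
    "steady state": {"L1": 600.0, "L2": 2000.0},
    "L1 down": {"L2": 2600.0},  # voice from L1 rerouted onto L2
}

scenario, link, share = worst_voice_share(capacity, voice)
print(scenario, link, round(share, 3), share <= 0.5)
```

A metric optimization tool would repeat this evaluation for every candidate set of IGP metrics and every single-failure scenario, keeping only metric sets for which the worst share never exceeds the 50% bound.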
4.12 Summary
The objective of Sections 4.1 to 4.11 was to cover in detail the properties of IP routing, a restoration protocol that relies on distributed and dynamic routing algorithms running in every node to compute the shortest path between the computing node and every other node in the network. The first part of this chapter focused on link state routing protocols, which correspond to the vast majority (if not all) of the routing protocols used in large service provider and enterprise networks. Each component of IP routing has been studied in detail: the network topology dissemination by means of a reliable flooding protocol, the shortest path computation algorithms, and the various mechanisms allowing IP routing to react quickly and converge fast upon network element failures while preserving network stability in case of network resource oscillation, thanks to efficient dampening algorithms.
Because IP routing relies on the computation of alternate paths upon network failure detection by using distributed algorithms, a significant part of this chapter has been devoted to the detailed study of transient states as the network converges (until all the nodes in the network share an identical network topology view, leading to distributed consistent routing decisions). Although the main aspect of IP routing in terms of recovery is related to the convergence time, it has also been shown how traffic engineering techniques based on IGP metric optimization can be used to achieve a certain QoS objective during failure conditions. Another class of recovery mechanisms called NSF has also been studied in this chapter, which specifically handles routing control plane failures on routers where the forwarding states can be preserved in the event of control plane failures. One must admit that there are some incompressible delay components that prevent IP routing protocols from achieving a rerouting time on the order of tens of milliseconds, especially in sparse network topologies in which the node capable of rerouting the traffic along an alternate path may be several nodes upstream of the failure. Moreover, because of the inherently distributed nature of the routing computation, there is some degree of nondeterminism in terms of convergence times, which are driven by the network topologies and router architectures, to mention a few criteria. That being said, throughout this chapter, it has been shown that efficient routing protocol implementations used in conjunction with appropriately designed networks can achieve subsecond rerouting times, which will hopefully remove the widespread misleading perception that link state routing protocols converge in tens of seconds at best. Finally, this chapter concludes with a complete case study on the link state protocol IS-IS.
The next sections discuss some advanced topics of IP routing, starting with algorithm complexity, a topic of the utmost importance for IP recovery because, upon a network failure, alternate paths are computed “on the fly,” as with any other restoration mechanism. Then an important optimization of the SPF computation known as incremental SPF is detailed; it drastically reduces the routing computation under various network change circumstances. Finally, the concluding section discusses the interaction between IP fast convergence techniques and NSF.
4.13 Algorithm Complexity

4.13.1 Definition of Algorithm Complexity

The complexity of an algorithm is an important notion when evaluating an algorithm and greatly determines its usefulness. An algorithm is a step-by-step procedure that must be performed to solve a particular problem, where a step can be an arithmetic operation, a comparison between two numbers, or any other elementary operation. The efficiency of an algorithm usually refers to its ability to find a solution to a particular “problem instance” in a reasonable time;
more precisely, one usually refers to the amount of time required by the algorithm to solve the problem in the worst case. A “problem instance” refers to a particular set of variables (e.g., a particular network topology). It is generally accepted that a good (efficient) algorithm is one that provides a solution in a reasonable time. A common way of characterizing efficient algorithms is by determining the nature of the algorithm complexity function, in particular whether the function is polynomial or grows faster than any polynomial. We call n the size of the problem, which can be, for instance, the number of routers in a shortest path computation. Table 4.1 and Figure 4.30 illustrate how the complexity of an algorithm evolves with the size of the problem for several algorithm complexities.

Table 4.1 Examples of Algorithm Complexities as a Function of the Problem Size n

n     n^2     n^3        2^n
1     1       1          2
2     4       8          4
3     9       27         8
4     16      64         16
5     25      125        32
6     36      216        64
7     49      343        128
8     64      512        256
9     81      729        512
10    100     1000       1024
11    121     1331       2048
12    144     1728       4096
13    169     2197       8192
14    196     2744       16,384
15    225     3375       32,768
16    256     4096       65,536
17    289     4913       131,072
18    324     5832       262,144
19    361     6859       524,288
20    400     8000       1,048,576
30    900     27,000     1.07E+09
40    1600    64,000     1.1E+12
50    2500    125,000    1.13E+15
60    3600    216,000    1.15E+18
[Figure 4.30: two plots titled “Polynomial Versus Exponential Algorithm Complexity as a Function of the Problem Size n,” showing the curves n^2, n^3, and 2^n. The x-axis is the number of routers n (1 to 12 in the first plot, 1 to 97 in the second) and the y-axis is the complexity.]
Figure 4.30 Algorithm complexity as a function of the problem size for three algorithm complexities: n^2, n^3, and 2^n.
So, for instance, if the execution time of one step is set to 10^-9 seconds (one nanosecond), then for n = 60 the total execution time of an algorithm with a complexity of n^2 is 3.6 µs, compared to about 36.5 years for an algorithm with a complexity of 2^n. This raises an interesting question: How much does CPU power help when running nonpolynomial algorithms? There are two ways of answering such a question:

1. Consider two CPU powers and compare the sizes of the problem that can be solved with a nonpolynomial algorithm during the same amount of time.
2. Consider two CPU powers and compare their running times to solve a problem of size n.

We first consider the first question with CPU2 = k * CPU1 (i.e., CPU2 can perform a basic operation k times faster than CPU1) for an algorithm having a complexity of 2^n. If the running time is fixed, then the size of the problem solved with CPU2 will be larger by log2(k) than the size of the problem solved with CPU1. To take a concrete example, if CPU2 is 10,000 times faster than CPU1, the size of the problem that can be solved with CPU2 is increased by just log2(10,000) ≈ 13.29 compared to CPU1, for a fixed amount of time. So an increase in CPU
power does not allow you to drastically increase the problem size with a nonpolynomial algorithm when the problem size n gets large.

Now we consider the second question: If the size of the problem is fixed, how much faster will CPU2 solve a problem of size n compared to CPU1? Exactly k times faster. Unfortunately, for large values of n, this gain is negligible compared to the number of operations. Back to the same example of an algorithm having a complexity of 2^n: for n = 60, the number of operations is 1.15E+18, so reducing the running time by a factor of 10,000 is not sufficient. The bottom line is that CPU power does not sufficiently help solve a problem having nonpolynomial complexity. This clearly illustrates why algorithms with polynomial complexity are considered efficient compared to nonpolynomial algorithms. It is quite common to consider a problem for which only nonpolynomial algorithms are known as intractable. That said, algorithm complexity is always computed for the worst case, and fortunately a large number of problem instances require far fewer steps. Hence, some well-known exponential time algorithms have been used to solve various problems, but in general the ultimate goal is to find an algorithm having a polynomial complexity.

Two important comments can be made here:

1. Measuring the efficiency of an algorithm by the amount of time required to solve a particular instance of a problem is particularly appropriate for recovery mechanisms. By definition, recovery mechanisms rely on the computation of an alternate path “on the fly” (when the failure is detected) upon a network element failure. So the time required to find such a backup path is particularly important for recovery mechanisms.
2. The number of steps required by an algorithm may vary with each problem instance; in other words, the same algorithm may perform well on some problem instances and badly on others.
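The arithmetic behind the two CPU questions above can be checked with a few lines of Python. This is only a minimal sketch: it assumes the one-nanosecond step time used in the text, and the function names are illustrative, not taken from any library.

```python
import math

STEP_SECONDS = 1e-9  # assumed cost of one basic step (one nanosecond, as in the text)


def running_time_seconds(steps: float, speedup: float = 1.0) -> float:
    """Time needed to execute `steps` basic steps on a CPU `speedup` times faster."""
    return steps * STEP_SECONDS / speedup


def extra_problem_size(speedup: float) -> float:
    """Gain in solvable size n for a 2^n algorithm when the running time is fixed."""
    return math.log2(speedup)


SECONDS_PER_YEAR = 365 * 24 * 3600

# n = 60: a polynomial (n^2) versus an exponential (2^n) step count
print(running_time_seconds(60 ** 2))                     # a few microseconds
print(running_time_seconds(2 ** 60) / SECONDS_PER_YEAR)  # tens of years
# First question: a CPU 10,000 times faster only adds about 13 to n
print(extra_problem_size(10_000))
# Second question: the same CPU still needs more than a day for n = 60
print(running_time_seconds(2 ** 60, speedup=10_000) / 3600)
```

With these assumptions, the 10,000-fold speedup brings the n = 60 case down from decades to roughly 32 hours, which is still far from the microsecond range of the n^2 algorithm.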
When evaluating algorithm efficiency, one usually considers the worst case, but there are several ways of evaluating algorithm efficiency:

• Average complexity: The algorithm complexity is evaluated by considering a probability distribution of the various possible instances of the problem. Though quite simple, this approach has several limitations because the results are highly dependent on the algorithm implementation, the problem instances, and other parameters.
• Analysis driven by experience: The algorithm efficiency is determined by running it on several instances of the problem.
• Worst case: The algorithm complexity is determined by computing the upper bound of the number of required steps for any possible instance of the problem. The worst-case approach has the merit of being objective and provides strict upper bounds for any problem instance. There are of course cases in which some particular (and potentially rare) instances of a problem yield an upper bound far beyond the average case.
We just saw that algorithm complexity is usually computed by considering the worst-case scenario; other conventions used when expressing algorithm complexity include the following:

• Constants are ignored.
• Just the dominant term is listed.
• The complexity is evaluated for large values of the variables.

So, for instance, consider a problem P1 with one variable n. If the worst-case complexity of an algorithm A1 solving P1 is c1 * n^2 + c2 * log(n) + c3 (where c1, c2, and c3 are three constants), then the overall algorithm complexity is n^2, also noted O(n^2). If the complexity of an algorithm A2 solving P1 is c1 * n^2 + c2 * exp(n) + c3, then the overall algorithm complexity is exp(n), also noted O(exp(n)).

Now consider a problem P2 with two independent variables n and m. If the worst-case complexity of an algorithm A1 solving P2 is c1 * n^2 + c2 * m, where c1 and c2 are two constants, then A1's complexity is n^2, also noted O(n^2), when n and m grow at comparable rates. Another example would be a worst-case complexity of c1 * n^2 + c2 * m^3 + c3. This case is quite interesting: for sufficiently large values of m, the dominant factor of the complexity is m^3 even if c1 >> c2, so the resulting worst-case complexity is O(m^3). That being said, there might be problems in which considering just one factor or the other is relevant, in particular if there is some control over the set of values that the variables n and m can take. For instance, if one variable is restricted to a very limited subset of values, then the other factor will be dominant. On the other hand, if both variables can take arbitrary values in a wide range, just the dominant factor wins.

Another very important point is that the algorithm efficiency is also highly driven by the implementation choices. For instance, consider the cost of searching for an element in a list, which is what the Dijkstra algorithm does to select the node to remove from the TENT list and add to the PATH list. As mentioned in Section 4.6, in the worst case this task must be performed n times and the number of elements to be scanned is n - k (where k is the iteration number), so the complexity is on the order of O(n^2).
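To make the O(n^2) cost of the TENT scan concrete, the following is a minimal Python sketch of a Dijkstra computation that selects the next node by scanning an unsorted TENT set. The graph representation (a dictionary of neighbor-to-cost dictionaries) and the function name are assumptions made for illustration; this is not the implementation described in Section 4.6.

```python
def dijkstra_linear_scan(graph, source):
    """Shortest distances and SPT parents, scanning an unsorted TENT set.

    graph: {node: {neighbor: cost}} with positive costs (assumed format).
    """
    dist = {source: 0}
    parent = {source: None}
    tent = {source}   # tentative nodes
    path = set()      # nodes whose shortest path is final

    while tent:
        # Linear scan of TENT: up to n elements, repeated up to n times,
        # which is where the O(n^2) behavior comes from.
        x = min(tent, key=lambda node: dist[node])
        tent.remove(x)
        path.add(x)
        for z, cost in graph[x].items():
            if z in path:
                continue
            candidate = dist[x] + cost
            if z not in dist or candidate < dist[z]:
                dist[z] = candidate
                parent[z] = x
                tent.add(z)
    return dist, parent


# Small hypothetical topology used only as a usage example
example = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"C": 1},
}
distances, parents = dijkstra_linear_scan(example, "A")
print(distances)
```

Replacing the scan with a priority queue changes only the selection step, which is why the data structure chosen for TENT has such a direct effect on the overall complexity.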
Now, if the TENT list is kept sorted, the cost of finding the element in the TENT list to select the node and move it to PATH is reduced to O(1). Of course, this now requires sorting the TENT list and therefore using a sorting algorithm. For instance, one can use the following simple sorting algorithm.

Simple sort: Consider a simple array of size n and the following sorting algorithm:

For i = 1 to n
    For j = i + 1 to n
        If table(i) > table(j) then swap(table(i), table(j))
    End
End

where table(i) is the element stored in the array at index i and swap(a,b) is the basic operation of swapping the elements a and b.
The complexity of this algorithm is (n - 1) + (n - 2) + ... + 1 = n * (n - 1)/2 comparisons, hence O(n^2). Note that there is a multitude of sorting algorithms that differ in terms of complexity. For instance, a very well known sorting algorithm, called quick sort, has a worst-case complexity of O(n^2) but an average complexity of O(n * log(n)). This clearly highlights the importance of the implementation choice and the data structure used.
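The comparison count of the simple sort can be verified experimentally. The sketch below is a Python transcription of the simple sort, assuming the inner loop starts just after the current index so that the number of comparisons matches the n * (n - 1)/2 figure given above; the function name and the returned counter are illustrative additions.

```python
def simple_sort(table):
    """Return a sorted copy of `table` and the number of comparisons performed."""
    items = list(table)
    comparisons = 0
    n = len(items)
    for i in range(n):
        # Only elements after position i still need to be compared with it
        for j in range(i + 1, n):
            comparisons += 1
            if items[i] > items[j]:
                items[i], items[j] = items[j], items[i]
    return items, comparisons


sorted_items, count = simple_sort([5, 3, 1, 4, 2])
print(sorted_items, count)
```

For an array of five elements the counter reaches 5 * 4 / 2 = 10 regardless of the initial order, which is the quadratic behavior described in the text.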
4.13.2 NP Complete Problem

An NP complete problem is, informally, a problem for which no algorithm having a polynomial complexity has been found (yet), although a proposed solution can be checked in polynomial time. For the last several decades, mathematicians have been trying to determine whether P (the class of problems that can be solved by a polynomial algorithm) is different from NP. An interesting property of NP complete problems is that if a polynomial algorithm exists for any one NP complete problem, then there are polynomial algorithms for all NP complete problems; hence the considerable amount of energy spent by mathematicians trying to find polynomial algorithms for several very well-known NP complete problems.
Problem Reduction

“Easy” Problems (Problems that Can Be Solved in Polynomial Time)

An efficient way to evaluate the complexity of a problem is to perform a problem reduction. If a problem P1 can be reduced to another problem P2 by using a polynomial algorithm (we say that there exists a polynomial reduction of P1 to P2), and there is a polynomial algorithm to solve P2, then P1 is “easy” (there exists a polynomial algorithm solving P1).
NP Complete Problems

When one cannot find an efficient (polynomial) algorithm for a specific optimization problem, a very common practice is to prove that the problem is NP complete; an efficient way of proceeding is to find a polynomial function that transforms a problem already known to be NP complete into the problem at hand. In practice, NP complete problems are solved approximately, using heuristics that drastically reduce the amount of computation in order to provide a solution close to the optimum in a reasonable amount of time. The art of finding good heuristics is indeed critical in computer science. For instance, the Dijkstra shortest path computation algorithm has a polynomial complexity; by contrast, the very well known “Traveling Salesman” problem (see [TRAVEL-SALESMAN]) is known to be NP complete. This problem consists of finding the shortest closed path that a traveling salesman should follow to visit each city in a set of cities exactly once, where the distances between cities are known. More can be found on algorithms in the following references: [ALGO-1], [ALGO-2], [ALGO-3], and [ALGO-4].
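As an illustration of the trade-off between an exact exponential search and a heuristic, the sketch below compares a brute-force search over all tours with the classical nearest-neighbor heuristic for the Traveling Salesman problem. The distance matrix is hypothetical and the function names are illustrative; the heuristic is only one of many possible choices and gives no optimality guarantee in general.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Length of the closed tour that returns to its starting city."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def brute_force_tsp(dist):
    """Exact answer by trying every tour starting at city 0: (n - 1)! candidates."""
    cities = range(1, len(dist))
    best = min(
        ([0, *rest] for rest in permutations(cities)),
        key=lambda tour: tour_length(tour, dist),
    )
    return best, tour_length(best, dist)


def nearest_neighbor_tsp(dist, start=0):
    """Heuristic: always move to the closest city not yet visited, O(n^2) steps."""
    tour = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda city: dist[last][city])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour, tour_length(tour, dist)


# Hypothetical symmetric distances between four cities
distances = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(brute_force_tsp(distances))
print(nearest_neighbor_tsp(distances))
```

On this small instance the heuristic happens to find a tour as short as the exact search; on larger instances it can be noticeably worse, but it remains usable when the factorial number of tours makes the exact search impractical.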
4.14 Incremental Dijkstra

In Section 4.6, we saw the “regular” Dijkstra algorithm, which computes loop-free shortest paths in a network. In this section, a very useful optimization called incremental SPF (iSPF) is presented, which reduces the SPT computation to some subtree (instead of the complete tree); in some cases detailed in this section, the SPT computation can even be avoided completely.
4.14.1 Motivation

As described earlier in this chapter, in the normal IGP link state mode of operation, upon the receipt of a new LSA (an LSA with newer content, not a refreshed LSA), a full SPF is triggered, potentially after some time has elapsed, regardless of the network state change reported in the LSA. To highlight the motivations for iSPF, let us analyze various situations (Figure 4.31). Consider the network depicted in Figure 4.31 and suppose that the link H-I fails. Upon the failure of the link H-I, two LSAs will be originated by the nodes H and I to reflect the network topology change, which will trigger an SPF computation on the node A (among others, but in this example we focus on the node A). A simple observation shows that link H-I is not used in the SPT computed by A, and the same observation can be made for several other links in the network. This means that a new SPF will be needlessly computed in this case. So one obvious optimization would be to avoid running an SPF when a new LSA is received that advertises a state change for a link that is not used in the SPT.

[Figure 4.31: network of nodes A to Q with node A as the computing node. Thick lines represent the computed SPT; all links have a cost of 1.]
Figure 4.31 Motivation for incremental SPF.

Now analyze another situation, depicted in Figure 4.32. In this second example, consider the failure of the link I-L. In contrast to the previous failure scenario, that link is used in the SPT; with a regular SPF the entire SPT is recomputed, although just a subtree needs to be recomputed in this case. Now suppose that the link N-R comes up. Similarly, instead of recomputing the whole SPT, just the subtree rooted at node N should be computed. A final example of an interesting optimization is the following: when a new IP address is added locally on a router, this triggers the origination of a new LSA (LSA type 1 in OSPF, new LSP in IS-IS); hence, a new SPT is computed on every router, which is clearly not necessary in such a case.

In addition to the processing time required to run an SPF, there is another downside of systematically recomputing an entire SPT upon receiving a new LSA. It is not uncommon to have multiple equal cost paths between two nodes, especially in highly symmetrical networks. When an entire SPT is recomputed, this may lead to the selection of another path to reach a particular node even if the node is still reachable via an equivalent path, which may lead to unnecessary routing changes.
[Figure 4.32: two panels. Thick lines represent the computed SPT; the second panel shows the new SPT after the failure of the link I-L, with the impacted subtrees marked.]
Figure 4.32 Motivation for incremental SPF.
This highlights the whole idea of iSPF, which is to limit the SPT computation to the portion of the tree affected by the network topology change instead of systematically recomputing the entire SPT.
4.14.2 History

The original iSPF algorithm was designed by Eric Rosen during the ARPANET days in the late seventies (see [ARPA-2]). Since then, the algorithm has been slightly changed and optimized to handle additional network scenarios, in particular multiple equal cost paths. Note that the algorithm has been implemented in commercial products. For instance, Cisco Systems supports iSPF for both IS-IS and OSPF.
4.14.3 Algorithm Description

This section describes in detail the original iSPF algorithm, proposed by E. Rosen in 1978. Because different events lead to different sets of actions when performing iSPF, this section starts by listing the various situations and the actions each one requires, followed by the iSPF algorithm itself. First, the assumption is made that node S (source) has already computed an SPT.
Situation 1: Link Cost Increase

An LSA is received that reports a link cost increase for some link Lij (link between the node i and the node j) in the network. This is also usually referred to as “bad news.” Note that this covers both a link cost increase due to a configuration change performed by the network administrator and a link failure, in which case the cost of the corresponding link is infinite. As already mentioned, if link Lij does not belong to the SPT, no action is required. Indeed, the link was already unused, and increasing its cost can only increase the length of any path through it, so it still cannot be part of a shortest path to any node in the network. On the other hand, if the link is in the SPT, then not only is the distance to the node j increased but also the distance to any other node reachable by means of the node j; thus, all nodes belonging to the subtree rooted at node j are candidates for routing changes (there might be a better path from the source via some other nodes). Note that those nodes are just “candidates” for routing changes: the shortest path to those nodes will not necessarily change, but the metric will. On the other hand, any node that does not belong to this subtree is not affected by the network change. This is illustrated in the following examples:
Example 1: Bad news and the link does not belong to the SPT (all the links have a cost of 1) (Figure 4.33). In Figure 4.33, if the cost of the link F-G is increased from 1 to 2 (or from 1 to infinity in the case of a link failure), this does not result in any SPT change.

[Figure 4.33: network of nodes A to Q. Thick lines represent the computed SPT; there is absolutely no impact on the SPT.]
Figure 4.33 Example 1 (bad news and the link does not belong to the SPT).

Example 2: Bad news and the link belongs to the SPT (Figure 4.34). Let us now consider Figure 4.34 (note that a few link costs have been changed for the sake of this example). If link I-L fails (or if its cost is increased to, say, 20), then not only the path to node L changes but also the path to any other node that belongs to the subtree rooted at node L (i.e., nodes O and P). Conversely, any node that does not belong to the subtree rooted at node L is not affected. This implies the following changes to the original SPF algorithm:
• Identify the nodes that belong to the subtree rooted at node j, and update their distance from source S.
• Try to find a shorter path to each subtree node K by routing K via those of its neighbors that are not in the subtree. If such a shorter path can be found, add K to the TENT list.
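The first of these changes, identifying the impacted subtree, can be sketched as follows. The sketch assumes that the SPT is stored as a map from each node to its parent (None for the source); this representation, the function names, and the example tree are hypothetical and are not taken from the figures.

```python
def link_in_spt(parent, i, j):
    """True if the link i-j is a branch of the SPT described by `parent`."""
    return parent.get(j) == i or parent.get(i) == j


def subtree(parent, root):
    """Return `root` and all of its descendants in the SPT."""
    children = {}
    for node, p in parent.items():
        children.setdefault(p, []).append(node)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        found.add(node)
        stack.extend(children.get(node, []))
    return found


def nodes_impacted_by_cost_increase(parent, i, j):
    """Candidate nodes for routing changes after bad news on link i-j."""
    if not link_in_spt(parent, i, j):
        return set()  # the link is unused: no action is required
    downstream = j if parent.get(j) == i else i
    return subtree(parent, downstream)


# Hypothetical SPT rooted at A (node -> parent)
spt = {"A": None, "C": "A", "D": "A", "F": "C", "G": "D",
       "I": "D", "L": "I", "O": "L", "P": "L"}
print(nodes_impacted_by_cost_increase(spt, "F", "G"))
print(nodes_impacted_by_cost_increase(spt, "I", "L"))
```

In this hypothetical tree, a cost increase on a link that is not a branch of the tree yields no candidates, whereas a cost increase on the link toward L makes L and its descendants candidates for reattachment.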
Situation 2: A Link Cost Decrease (Good News)

An LSA is received that reports a link cost decrease for some link Lij in the network. This is also usually referred to as good news. Note that this covers both a link cost decrease due to a configuration change performed by the network administrator and a link recovery.
[Figure 4.34: two panels. Thick lines represent the computed SPT; the second panel shows the new SPT after the failure of the link I-L, with link costs annotated and the impacted subtrees marked.]
Figure 4.34 Example 2 (bad news and the link belongs to the SPT).
If the link Lij belongs to the SPT, the paths to node j and to any other node belonging to the subtree rooted at j do not change, because the path cost to those nodes via j is decreased and the shortest paths from S to those nodes already went through j. Furthermore, any node at a shorter distance from S than the new distance from S to j keeps an identical path.

Example 3: Good news and the link belongs to the SPT (Figure 4.35). Figure 4.35 shows an interesting example that illustrates the possible impact on the tree of a cost decrease on a link that belongs to the SPT. Again, some link costs have been changed in this example compared to previous examples. Let us consider, for instance, that the cost of link C-G decreases from 3 to 1. Because link C-G already belongs to the SPT, the paths to node G and to all the nodes in the subtree rooted at G (i.e., the nodes J and Q) do not change. The new shortest distances from the source node A to the nodes G, J, and Q are now 2, 3, and 4, respectively. All the nodes at a distance of 2 or less from node A (i.e., the nodes B, D, C, E, and H) are not candidates for routing changes. On the other hand, any node that does not belong to the subtree rooted at j and whose distance d to S is higher than the new shortest distance between S and j is a candidate for a routing change because there might exist a shorter path than the current path. So, back to our example, the nodes F, I, K, L, M, N, O, and P are candidates for routing changes because their distance to S is strictly higher than 2 (the new distance from A to G); indeed, a new shortest path might be found by means of G. Of course, this does not mean that those nodes will get a better path. Figure 4.35 depicts the new resulting SPT.

[Figure 4.35: two panels showing the initial state and the SPT after the cost of link C-G decreases from 3 to 1, with link costs annotated and the impacted subtrees marked.]
Figure 4.35 Example 3 (good news and the link belongs to the SPT).

This implies the following changes to the original SPF algorithm:
• Identify the nodes that belong to the subtree rooted at node j and update their distance from the source S.
• Then, try to find a shorter path for each node that is not in the subtree rooted at j but is an immediate neighbor of a node in the subtree. If such a path is found, add the node to the TENT list. In other words, this step consists of checking whether a node that is not in the subtree could now get a shorter path via the subtree rooted at j.

If link Lij does not belong to the SPT and its cost decreases, the algorithm must check whether there now exists a shorter path to j via the link Lij by calculating

Delta = d(i) + cost(Lij) - d(j)

where d(i) is the distance from the source S to i. If Delta > 0, this good news has basically no effect on the existing SPT. On the other hand, if Delta < 0, the shortest path to j is now via link Lij. So the operation reattaches node j to node i (its new predecessor; we sometimes use the term parent to name the predecessor in the SPT). With that first operation, the situation is now perfectly identical to the previous case where the link Lij was in the SPT and its cost has decreased. An example is given in Figure 4.36.

Situation 3: Node failure. If node j fails, all the nodes in the subtree rooted at j must be reattached to the SPT. The rest of the required operations are similar to those in situation 1, except that node j is now excluded from the SPT.

Situation 4: Node recovery. If node j recovers after a failure, the first operation to perform is to compute the shortest path to node j, which can easily be done by checking each neighbor i of node j and adding the cost of the link Lij to d(i). Then all the candidate nodes for routing changes are those whose distance from the source is greater than the distance from S to node j. At this stage one can perform a complete SPF algorithm starting with node j in the TENT list.
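The first operation of the node recovery case (situation 4) can be sketched in a few lines. The sketch assumes the same hypothetical data layout as a dictionary-based link state database: `graph` maps each node to its neighbors and link costs, while `dist` and `parent` describe the existing SPT; at least one neighbor of the recovered node is assumed to be reachable. The function name and the example topology are illustrative only.

```python
def attach_recovered_node(graph, dist, parent, j):
    """Attach recovered node j to the SPT and return the candidate nodes.

    The distance to j is the minimum of d(i) + cost(Lij) over its reachable
    neighbors i; candidates are the nodes farther from the source than j.
    """
    reachable = [i for i in graph[j] if i in dist]
    best = min(reachable, key=lambda i: dist[i] + graph[j][i])
    dist[j] = dist[best] + graph[j][best]
    parent[j] = best
    return {node for node, d in dist.items() if node != j and d > dist[j]}


# Hypothetical topology in which node J has just recovered
graph = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "J": 2, "K": 5},
    "C": {"A": 1, "J": 1},
    "J": {"B": 2, "C": 1, "K": 1},
    "K": {"B": 5, "J": 1},
}
dist = {"A": 0, "B": 1, "C": 1, "K": 6}          # SPT computed while J was down
parent = {"A": None, "B": "A", "C": "A", "K": "B"}
candidates = attach_recovered_node(graph, dist, parent, "J")
print(dist["J"], parent["J"], candidates)
```

The returned set only identifies which nodes may benefit from the recovered node; as stated above, an SPF run starting with node j in the TENT list is still needed to update their paths.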
[Figure 4.36: two panels showing the initial state and the SPT after the cost of the link D-G decreases from 5 to 1, with link costs annotated and the impacted subtrees marked.]
Figure 4.36 Example 4 (good news, the link does not belong to the SPT and Delta < 0).

Final Incremental SPF Algorithm

The aim of the previous paragraphs was to highlight the required changes in each case; it is now time to provide the complete algorithm, which consolidates all the previously described changes. Several variables are used during iSPF:

Delta: distance change resulting from the reported event (computed in steps 2 and 3)
Lij: link between nodes i and j
c(Lij): cost of the link Lij
c'(Lij): new cost of link Lij (after a change occurs on the link cost)
d(i): shortest distance from the source to node i
S: a subtree

Step 1
If there is no existing tree, go to step 7.

Step 2
If the change is related to a node status change (node recovery or failure; see situations 3 and 4 above), then Delta = infinite.
If the status change = node recovery, go to step 3.
If the status change = node failure, go to step 4.

Step 3
If the change is related to link Lij, then
    If the link Lij belongs to the SPT, Delta = c'(Lij) - c(Lij)
    If the link Lij does not belong to the SPT, Delta = d(i) + c'(Lij) - d(j)
        If Delta > 0 then stop
/* Comment: if Delta > 0, the algorithm stops because in this case link Lij did not belong to the SPT and the change corresponds to bad news; in other words, there is no better path to reach the node j than the existing one, so there is no impact on the SPT */

Step 4
Put node j and all of its descendants in S. (A descendant of a node i is a node that belongs to some subtree rooted at node i.)

Step 5
For each node k in S, d(k) = d(k) + Delta.
/* Comment: In other words, each descendant of the node j has its distance updated by the distance change resulting from the cost change of the link Lij */

Step 6
For each node k in S
    If Delta > 0 (bad news; note that in this case the link Lij is in the SPT), try to find a shorter path to node k via each of its neighbors that does not belong to S (try to reattach the node to a different subtree that would provide a better path). If such a better path is found, put node k in the TENT list.
    If Delta < 0, try to find a shorter path, by means of node k, to each neighbor k' of k that does not belong to S. If such a better path is found, then put k' in TENT.
/* Comment: If Delta > 0 (bad news, link Lij is in the SPT), the algorithm tries to reattach each node of S to another subtree offering a shorter path. If Delta < 0 (good news), the algorithm checks whether some other nodes, not in the impacted subtree, could be reattached to that subtree to follow a more optimal path */

Step 7
Move the node x to PATH such that d(x) = min {d(y) for y in TENT}.

Step 8
For each neighbor z of node x
    If the node z is already in PATH, then
        If d(z) < d(x) + cost(Lxz), then do nothing
        If d(z) > d(x) + cost(Lxz), then remove z from PATH, put it in TENT, and update d(z) = d(x) + cost(Lxz)
    If the node z is in neither PATH nor TENT, then put z in TENT and update d(z) = d(x) + cost(Lxz)
    If the node z is already in TENT
        If d(z) < d(x) + cost(Lxz), then do nothing
        If d(z) > d(x) + cost(Lxz), then update d(z) = d(x) + cost(Lxz)

Step 9
If TENT is empty, stop; otherwise go to step 7.

Since then, various optimizations of the original algorithm have been proposed, in particular to reduce the running time and handle specific cases like equal cost paths. An excellent reference is [ROUTING-THESIS].
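The steps above can be sketched in Python for the link cost change cases (situations 1 and 2). This is a simplified illustration under several assumptions, not the original algorithm or any vendor implementation: links are undirected with positive costs, every node is reachable, a failed link is modeled with an infinite cost, TENT is a heap, and PATH is left implicit. All names are hypothetical.

```python
import heapq
import math


def full_spf(graph, source):
    """Regular Dijkstra over graph = {node: {neighbor: cost}}."""
    dist, parent = {source: 0}, {source: None}
    heap = [(0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue  # stale heap entry
        for z, cost in graph[x].items():
            if d + cost < dist.get(z, math.inf):
                dist[z], parent[z] = d + cost, x
                heapq.heappush(heap, (dist[z], z))
    return dist, parent


def subtree(parent, root):
    """Root and all of its descendants in the SPT (step 4)."""
    children = {}
    for node, p in parent.items():
        children.setdefault(p, []).append(node)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        found.add(node)
        stack.extend(children.get(node, []))
    return found


def ispf_link_change(graph, dist, parent, i, j, new_cost):
    """Update dist and parent in place after the cost of link i-j changes."""
    old_cost = graph[i][j]
    graph[i][j] = graph[j][i] = new_cost
    if parent.get(i) == j:
        i, j = j, i                         # make i the upstream end
    if parent.get(j) == i:                  # step 3: link in the SPT
        delta = new_cost - old_cost
    else:                                   # step 3: link not in the SPT
        if dist[j] + new_cost < dist[i]:
            i, j = j, i                     # the link helps in the other direction
        delta = dist[i] + new_cost - dist[j]
        if delta >= 0:
            return                          # no impact on the SPT
        parent[j] = i                       # reattach j to its new parent
    if delta == 0:
        return
    s = subtree(parent, j)                  # step 4
    for k in s:                             # step 5
        dist[k] += delta
    tent = []
    for k in s:                             # step 6
        for nb, cost in graph[k].items():
            if nb in s:
                continue
            if delta > 0 and dist[nb] + cost < dist[k]:
                dist[k], parent[k] = dist[nb] + cost, nb
                heapq.heappush(tent, (dist[k], k))
            elif delta < 0 and dist[k] + cost < dist[nb]:
                dist[nb], parent[nb] = dist[k] + cost, k
                heapq.heappush(tent, (dist[nb], nb))
    while tent:                             # steps 7 to 9
        d, x = heapq.heappop(tent)
        if d > dist[x]:
            continue
        for z, cost in graph[x].items():
            if d + cost < dist[z]:
                dist[z], parent[z] = d + cost, x
                heapq.heappush(tent, (dist[z], z))
```

A convenient way to gain confidence in such a sketch is to apply a sequence of link changes and compare the resulting distances with those of a full SPF run on the modified topology. Node failure and recovery, equal cost paths, and nodes that become unreachable are deliberately left out.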
4.14.4 iSPF Efficiency

It can be observed that the gain varies with the location of the failed or newly announced link with respect to the computing node. Indeed, when a link far downstream from a node fails, the gain of running iSPF as opposed to a full SPF is substantial because the impacted subtree is minimal compared to the entire SPT. When the link is not used in the SPT (which is not a rare event when one considers the proportion of links in the SPT), the gain is maximal because no new computation is triggered. On the other hand, the gain diminishes as the failure gets closer to the computing node, so in some cases running iSPF is not worth the slight increase in computation complexity. There might even be some very particular cases where the computation time is slightly greater than a full SPF running time, if the failure is very close to the computing node. Still, in most network failures, the gain of iSPF more than offsets the extra computation work. This is particularly true for large networks with hundreds of nodes. Note also that the gain is not limited to the SPT computation but also applies to the RIB computation, which is, as already pointed out, a nonnegligible component of the overall network convergence. Just to give a rough idea, extensive tests run on large network topologies showed that the potential gain can be as large as 90%, with a very significant average gain (tens of percent), whereas the worst-case overhead, when the failure is close to the computing node, never exceeds a few percentage points in very large topologies.
4.15 Interaction Between Fast IGP Convergence and NSF

Both the IGP convergence aspects and NSF have been studied in this chapter devoted to IP. At first sight, IGP timers tuned to achieve fast convergence and NSF may look contradictory, although they both share the same goal of minimizing packet loss upon network element failure. Indeed, the IGP tries to find an alternate
path around the failed network element. Conversely, the NSF procedure keeps on forwarding traffic to the failed node, making the assumption that the node has just experienced a control plane failure that does not affect the proper forwarding of packets. It must first be underscored that the failure scopes are different. Indeed, the IGP handles a broader scope of network failure types, namely link failures and node failures; moreover, it covers multiple node failure types, such as control plane failures and power supply failures. By contrast, NSF just handles the case of a control plane failure, which can be recovered using a second route processor. Furthermore, the preference for one mode over the other may be driven by several factors:
• The router’s location in the network: For instance, in the case of an edge router connecting a customer premises edge (CPE) device to the service provider router, it is not rare to have a single link between the CPE and the SP’s edge router. In the case of an edge router route processor failure, the only usable mechanism to avoid affecting the traffic sent by and to the CPE is to support NSF.
• The capacity of the alternate paths: NSF may also be useful in the core, for instance, when a core router is fully redundant and the alternate paths around the router are not able to carry the extra traffic that is routed through the node at steady state. In such a case, it might also be useful to always try to keep forwarding the traffic across the restarting router.

That being said, it is foreseen that in most cases, the IGP timers will be tweaked appropriately to meet the convergence objectives in the core, whereas NSF will preferably be configured at the edge of the network. The following is a discussion of the tuning of IGP and NSF timers when both fast IGP convergence and NSF are configured on a router.
• What happens if the IGP timers (hello and hold-down timers) are set to very small values in the network to achieve fast convergence and an unplanned route processor failure occurs on an NSF-capable neighbor router?
• Does the restarting node have enough time to complete its restart procedure before its neighbors declare it down and trigger an IGP convergence?
Before trying to answer these questions, it is worth analyzing the consequences of such an event. If a neighbor incorrectly declares a restarting node down, the traffic will be rerouted around the restarting node. Such an event, sometimes referred to as a false-positive condition, leads to unexpected behavior but in most cases has no dramatic consequences other than triggering unnecessary traffic reroutes in the network (except for IP prefixes not reachable via other means, e.g., a local area network directly attached to the restarting router). Note also that once the restarting node reestablishes its adjacencies, the nodes in the network will reconverge and start reusing the restarted node in their path computation. There are several other situations in which false-positive events can occur; these are discussed in Chapters 5 and 6.
A false-positive event will be triggered on a node A if its restarting NSF-capable neighbor B cannot send the grace LSA before A’s RouterDeadInterval (see footnote 61) for B expires. Indeed, suppose that the RouterDeadInterval is set to x seconds on node A for its neighbor B and that B experiences an unplanned control plane failure. Then, to avoid a false-positive event on A, A must receive B’s grace LSA before x has expired. Whether this objective is achievable depends both on the platform’s ability to quickly detect the route processor failure and originate the grace LSA and on the IGP parameter settings (in particular the RouterDeadInterval).
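The timing condition above can be illustrated with a minimal sketch. The function name, its parameters, and the example delays are assumptions chosen for illustration only; in particular, the sketch assumes that A's timer was last reset by the most recent hello received from B, so only the remaining part of the RouterDeadInterval is available to B.

```python
def grace_lsa_arrives_in_time(router_dead_interval_s,
                              time_since_last_hello_s,
                              failure_detection_s,
                              grace_lsa_origination_s,
                              propagation_s=0.0):
    """Return True if A receives B's grace LSA before A's RouterDeadInterval
    for B expires (hypothetical helper, not taken from any implementation).

    All arguments are in seconds. time_since_last_hello_s is the time that
    had already elapsed on A's timer when B's control plane failed.
    """
    remaining = router_dead_interval_s - time_since_last_hello_s
    needed = failure_detection_s + grace_lsa_origination_s + propagation_s
    return needed < remaining


# Default-like timers (40 s dead interval): B has plenty of time.
print(grace_lsa_arrives_in_time(40, 10, 2, 1))
# Aggressive timers (1 s dead interval): a false positive becomes likely.
print(grace_lsa_arrives_in_time(1, 0.3, 0.5, 0.5))
```

With aggressive IGP timers, the budget left for failure detection and grace LSA origination shrinks accordingly, which is the trade-off discussed in this section.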
4.16 Research-Related Topics
IP routing protocols continue to evolve to meet new requirements. The following are a few ongoing research topics on IP routing:
• SRLG-aware SPF computation: This would allow some implementations to take into account the knowledge of SRLGs (explored in detail in Chapter 5) during the SPF computation to minimize the routing convergence time upon SRLG failure.
• Temporary loop reduction: Several enhancements are being designed to reduce the effect of temporary loops and in some cases to eliminate them altogether (e.g., in the case of a link restoration or the failure of a link protected by other recovery mechanisms).
• Local protection for IP: Some mechanisms could be used to provide local protection for IP, so that traffic recovery does not have to wait for the LSA to propagate to a rerouting router able to find an alternate path to the destination, nor for the RIB computation that follows a network failure.
61 As a reminder, the OSPF RouterDeadInterval is the timer that defines the maximum amount of time a router A can wait without receiving any OSPF hello message from a neighbor B before declaring the A-B adjacency down. The corresponding IS-IS timer is called the hold-time.
CHAPTER 5
MPLS Traffic Engineering Recovery Mechanisms
Multi-Protocol Label Switching (MPLS) traffic engineering (TE) has enjoyed considerable success in recent years, which has led to the development of a rich set of MPLS TE recovery techniques. This chapter starts with a refresher on MPLS TE technology, followed by the motivations for deploying such a technology in a data network. The recovery techniques are then examined with the objective of providing a detailed description of their mode of operation, their respective pros and cons, the type of network design they best apply to, and the design aspects that operators find important for deployment in their network. Furthermore, various properties of each recovery technique are analyzed. These properties are of the utmost importance when choosing a particular recovery technique in a network: the recovery time, the impact on scalability, the ability to provide some quality-of-service (QoS) guarantees along the alternate path, and the efficiency of the technique with respect to the amount of bandwidth dedicated to the recovery path. These are just a subset of the aspects covered for each recovery technique.
This chapter covers the default restoration mode of operation of MPLS TE, as well as the global and local protection recovery schemes. A rich set of examples is provided throughout this chapter to illustrate the mode of operation and how those various recovery techniques can be deployed in a network. An entire section is devoted to a complete set of case studies that show how an operator can use those MPLS recovery techniques to satisfy a set of recovery objectives while respecting network constraints. It is worth highlighting that most of these case studies are inspired by existing or foreseen deployment scenarios. After a summary section, the first part of this chapter concludes with the standardization aspects of the MPLS TE recovery techniques.
The second part of this chapter is devoted to some advanced topics of MPLS recovery. Its two sections cover in detail the signaling aspects of MPLS local protection (Section 5.14) and the interesting topic of backup path computation (Section 5.15); they may be skipped by the reader without compromising a good understanding of the MPLS recovery techniques. Finally, this chapter concludes with a section that describes various related research topics.
5.1 MPLS Traffic Engineering Refresher
In this section, we first provide a brief refresher on the notion of traffic engineering. The terminology specific to MPLS TE is then illustrated through an example, and after reviewing the main components of MPLS TE, we detail the motivations for deploying MPLS TE in a network.
5.1.1 Traffic Engineering in Data Networks
One of the major challenges of network design has always been traffic engineering, that is, how to route the traffic so that network resources are used efficiently. The term “efficiently” requires some explanation, though. An obvious objective of network design is to avoid congestion. If the network is fully congested, traffic engineering cannot really help and the network has to be upgraded (i.e., bandwidth and/or switching/routing capacity must be added). On the other hand, if some regions of the network are congested while others have spare capacity, then alleviating the congestion spots by rerouting some flows along an alternate path (where capacity is available) certainly helps. In other words, TE defines how flows should be routed to use network resources efficiently.
Even in the absence of congestion, a better traffic load balance may help increase the QoS. For instance, suppose that some links are used at 60% of their capacity on average, which strictly speaking does not make them congested, whereas other links are loaded at 10%. Delay-sensitive traffic traversing a link loaded at 60% may experience some undesirable delay and jitter, especially without queuing mechanisms. Thus, achieving a better traffic load balance with the objective of minimizing the average link utilization might be another motivation for TE.
Traffic engineering is not per se specific to MPLS. TE methods have been used in various network types, such as public voice networks, ATM, Frame Relay, and Internet Protocol (IP) networks.
The Classic Fish Problem
Let us consider the classic fish problem, which highlights how congestion may appear in some parts of an IP network while other regions of the network have spare capacity (Figure 5.1).
[Figure 5.1 The classic “fish problem.” Routers R1 through R8; all links have a metric of 1; the routing decision is based on the IP destination address. Path north: R3-R4-R5-R8. Path south: R3-R6-R7-R5-R8.]
Figure 5.1 depicts two IP routers, R1 and R2, sending traffic to the router R8 (and beyond). Both R1 and R2 compute the shortest path to reach R8 using a routing protocol such as Open Shortest Path First (OSPF) or Intermediate System to Intermediate System (IS-IS). Because all the links have an equal metric of 1, the flows from R1 and R2 to R8 both follow the same path (“north”). If the sum of their traffic exceeds the bandwidth capacity of the path “north” (R3-R4-R5), congestion results, although some capacity is still available along the path “south.” Changing the link metrics in this case will not help because IP routing protocols base their routing decision on the IP destination address: whether a packet whose destination is R8 is received from R1 or R2, it will be routed by R3 along the same path. Another option in this very simple case is to set the link metrics so that the north and south paths have an equal cost and load balancing can be used, but real networks are more complicated, and if other nodes are connected to routers R4, R5, R6, and R7, load balancing becomes much more challenging to achieve. That said, TE with IP routing is of course possible and has already been discussed in Chapter 4.
One solution to obtain better resource utilization is to use tunneling techniques between source(s) and destination(s) so that intermediate nodes do not participate in the routing decision. ATM was extensively used to reach that goal: ATM permanent virtual circuits (PVCs) or switched virtual circuits (SVCs) are established between switches with characteristics based on the traffic requirements of each circuit (e.g., bandwidth and QoS). ATM PVCs/SVCs are routed based on the network resources and link costs using off-line or on-line path computation methods (e.g., Private Network–Network Interface [PNNI]). Then, once a packet (encapsulated in ATM cells) is routed onto an ATM PVC, it strictly follows the ATM PVC path.
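Returning for a moment to the fish problem of Figure 5.1, the following minimal sketch shows why destination-based shortest path routing sends both flows along the north path. The link list is an assumption reconstructed from the paths named in the text (R1 and R2 attach to R3; north is R3-R4-R5-R8 and south is R3-R6-R7-R5-R8), and the function names are illustrative.

```python
import heapq

# Assumed topology of Figure 5.1; every link has a metric of 1.
LINKS = [("R1", "R3"), ("R2", "R3"), ("R3", "R4"), ("R4", "R5"),
         ("R5", "R8"), ("R3", "R6"), ("R6", "R7"), ("R7", "R5")]


def build_graph(links, metric=1):
    """Build an undirected adjacency map {node: {neighbor: metric}}."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, {})[b] = metric
        graph.setdefault(b, {})[a] = metric
    return graph


def shortest_path(graph, src, dst):
    """Plain Dijkstra SPF; returns (cost, path) or None if unreachable."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return None


GRAPH = build_graph(LINKS)
for source in ("R1", "R2"):
    # Both sources end up on R3-R4-R5-R8, whatever the load on that path.
    print(source, shortest_path(GRAPH, source, "R8"))
```

Nothing in this computation depends on the traffic volume or on the source of the packets, which is precisely the limitation that tunneling techniques are meant to overcome.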
Although relatively efficient at improving network bandwidth usage, the ATM overlay approach has several significant drawbacks:
• An additional layer (ATM) has to be managed and maintained in the network, which implies additional cost in terms of equipment and network operation.
• The number of routing adjacencies maintained by each router is potentially very high because every router has a routing adjacency with every other router in the mesh, which introduces some routing protocol scalability limitations. Indeed, a full mesh of n routers requires each of them to maintain n - 1 adjacencies, and the cost of the route computation (shortest path first [SPF]) also increases significantly.
This is where MPLS TE comes into play. MPLS TE is also a “tunneling” mechanism using TE Label Switched Paths (TE LSPs; the terminology TE LSP is detailed hereafter), which are established between pairs of routers. Each TE LSP has its own set of constraints (bandwidth, affinities, and rerouting constraints, to mention a few), and the network topology and resources are taken into account along with this set of constraints to compute a TE LSP path that satisfies the requirements. Different path computation methods can be used to achieve that objective: distributed (each router is responsible for the computation of its TE LSP paths) or centralized (an off-line tool performs the path computation of all the TE LSPs in the network). Then, once a TE LSP is established, IP packets are routed onto the TE LSP and strictly follow the computed path; intermediate routers do not make any routing decision.
For instance, in Figure 5.2, suppose that the sum of the required bandwidths between R1 and R8 and between R2 and R8 exceeds the available bandwidth on the north path (R3-R4-R5). By using MPLS TE, once the TE LSP between R1 and R8 is established, R2 figures out that the bandwidth available on the north path is not sufficient to accommodate its traffic demand and selects the south path (R2-R3-R6-R7-R5-R8) to establish its TE LSP. This allows better network resource utilization and avoids traffic congestion. Note that compared to the previous case with an ATM overlay network, just one layer is required (IP/MPLS). Moreover, routers are not required to maintain routing adjacencies over TE LSPs.
[Figure 5.2 Optimizing network resources with MPLS traffic engineering. All links have a metric of 1; one TE LSP is routed through the north path and the other through the south path.]
It is important to note that MPLS TE is a control plane reservation protocol, so it is fundamentally a Call Admission Control (CAC) mechanism. In other words, when a TE LSP is set up, no particular resources are reserved in the data plane. The purpose of MPLS TE is to ensure that a TE LSP is not routed along a path where other TE LSPs have already reserved the bandwidth. For instance, on an OC3 link (155 Mbps), if three TE LSPs have already reserved a total bandwidth of 120 Mbps, the remaining available bandwidth (not already reserved in the control plane) is 35 Mbps, and a TE LSP requiring more than 35 Mbps will have to be routed along another path. This is in contrast to IP, in which packets are routed along the shortest path without considering the traffic flow and available resources along this path.
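The behavior described for Figure 5.2 can be sketched as a constrained shortest path computation followed by control plane bookkeeping. This is a simplified illustration under stated assumptions, not the algorithm of any particular implementation: the topology is reconstructed from the paths named in the text, the link capacities (OC3 in the core, OC12 on the links toward R1, R2, and R8) and the 100 Mbps demands are hypothetical, and the function names are illustrative.

```python
import heapq


def cspf(unreserved, src, dst, demand_mbps):
    """Constrained SPF sketch: prune links lacking bandwidth, then run SPF.

    unreserved maps frozenset({a, b}) -> unreserved bandwidth in Mbps.
    Every link is given a metric of 1, as in Figure 5.2.
    """
    adjacency = {}
    for link, available in unreserved.items():
        if available >= demand_mbps:  # constraint check (pruning step)
            a, b = sorted(link)
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + 1, neighbor, path + [neighbor]))
    return None  # no path satisfies the bandwidth constraint


def reserve(unreserved, path, demand_mbps):
    """Control plane bookkeeping only; nothing is reserved in the data plane."""
    for a, b in zip(path, path[1:]):
        unreserved[frozenset((a, b))] -= demand_mbps


OC3, OC12 = 155, 622  # Mbps (rounded); capacities are assumptions
unreserved = {
    frozenset(("R1", "R3")): OC12, frozenset(("R2", "R3")): OC12,
    frozenset(("R3", "R4")): OC3, frozenset(("R4", "R5")): OC3,
    frozenset(("R3", "R6")): OC3, frozenset(("R6", "R7")): OC3,
    frozenset(("R7", "R5")): OC3, frozenset(("R5", "R8")): OC12,
}

t1 = cspf(unreserved, "R1", "R8", 100)   # first TE LSP takes the north path
reserve(unreserved, t1, 100)
t2 = cspf(unreserved, "R2", "R8", 100)   # north path now has only 55 Mbps left
reserve(unreserved, t2, 100)
print(t1)
print(t2)
```

After the first reservation, the links R3-R4 and R4-R5 no longer satisfy a 100 Mbps demand, so they are pruned and the second TE LSP is computed along the south path. The same bookkeeping gives the OC3 arithmetic of the text: 155 - 120 = 35 Mbps remain unreserved.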
5.1.2 Terminology
Because several terms specific to MPLS TE recovery techniques are used throughout this chapter, we illustrate each of them via an example (Figure 5.3). As depicted in Figure 5.3, three TE LSPs, called T1, T2, and T3, are signaled. For instance, the TE LSP T1 starts on R1 and terminates on R8. We say that R1 is the head-end label switched router (LSR) of T1 and R8 is its tail-end LSR. Any other LSR traversed by T1 is a midpoint LSR (e.g., R3, R4, and R5 are all midpoint LSRs). Note that an LSR can play the role of a head-end LSR for one TE LSP while being a midpoint or a tail-end LSR for other TE LSPs.
Notion of Disjoint Paths
Two TE LSPs are said to be link disjoint if they do not have any link in common (e.g., T1 and T2 in Figure 5.3 are link disjoint). The term link diverse is also used. On the other hand, two TE LSPs are said to be node disjoint if they do not share any LSR (e.g., T1 and T3 are node disjoint), except potentially their head-end and tail-end LSRs. The term node diverse is also used. The recovery-specific terminology is covered in the respective sections. For instance, several terms are specific to the local protection techniques, and these are covered in the section devoted to those techniques.
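The two definitions can be expressed as a short sketch. Only the path of T1 (R1-R3-R4-R5-R8) is given in the text; the paths used below for T2 and T3 are hypothetical, chosen merely so that the example reproduces the relationships stated above.

```python
def links_of(path):
    """Return the set of (undirected) links traversed by a path."""
    return {frozenset(hop) for hop in zip(path, path[1:])}


def link_disjoint(path_a, path_b):
    """True if the two paths have no link in common."""
    return links_of(path_a).isdisjoint(links_of(path_b))


def node_disjoint(path_a, path_b):
    """True if the paths share no LSR, except possibly an LSR that is a
    head-end or tail-end of both paths."""
    return (set(path_a[1:-1]).isdisjoint(path_b)
            and set(path_b[1:-1]).isdisjoint(path_a))


T1 = ["R1", "R3", "R4", "R5", "R8"]   # path given in the text
T2 = ["R2", "R3", "R6", "R7", "R5"]   # hypothetical path
T3 = ["R6", "R7"]                     # hypothetical path

print(link_disjoint(T1, T2))   # no common link, although R3 and R5 are shared
print(node_disjoint(T1, T2))
print(node_disjoint(T1, T3))
```

Note that node disjointness implies link disjointness, but not the reverse: the hypothetical T2 shares the LSRs R3 and R5 with T1 without sharing any link.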
[Figure 5.3 Illustration of MPLS traffic engineering recovery. LSRs (label switched routers) R1 through R8 and three TE LSPs (traffic engineering label switched paths) T1, T2, and T3, with the head-end, midpoint, and tail-end LSRs labeled.]
Shared Risk Link Group
The notion of shared risk link group (SRLG) is crucial when studying network resiliency and refers specifically to the simultaneous failure of multiple network elements caused by the failure of a single element. Let us consider the network scenario in Figure 5.4. Figure 5.4A shows a set of six optical cross-connects, OXC1 through OXC6, interconnected by a set of fibers; they constitute an optical layer used to interconnect the LSRs R1 through R5. More precisely, the various links are routed in the optical layer as follows:
• Link R1-R2 follows the optical path OXC1-OXC2.
• Link R1-R4 follows the optical path OXC1-OXC4-OXC5.
• Link R1-R5 follows the optical path OXC1-OXC6.
• Link R2-R3 follows the optical path OXC2-OXC3.
• Link R3-R4 follows the optical path OXC3-OXC5.
• Link R5-R4 follows the optical path OXC6-OXC4-OXC5.
In this scenario, the two optical paths followed by the links R1-R4 and R4-R5 share a common resource: the optical fiber interconnecting OXC4 and OXC5. We say that the two links share an SRLG because the failure of a single resource (the optical fiber OXC4-OXC5) would provoke the simultaneous failure of the two links. By default, the IP/MPLS layer does not have any visibility of the optical layout, which may lead to an incorrect path selection for a TE LSP. To remedy this problem, an Interior Gateway Protocol (IGP) extension has been defined. As described in Section 5.1, the TE-related information is flooded within an OSPF area using an opaque LSA type 10 (for IS-IS, the TE-related information is flooded in a specific type-length-value [TLV]). This opaque LSA carries one top-level TLV, which can be one of the two following types: router address (type 1) or link (type 2). The link TLV is made of several sub-TLVs. One of them is the SRLG sub-TLV (type 16); it has a variable length, with 4 bytes per SRLG value.
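As an illustration of the SRLG sub-TLV layout, the sketch below encodes and decodes a list of SRLG values. The type (16) and the 4 bytes per SRLG value come from the text; the 2-byte type and 2-byte length header in network byte order is an assumption based on the usual OSPF TE TLV format, and the function names and SRLG values are hypothetical.

```python
import struct

SRLG_SUBTLV_TYPE = 16  # sub-TLV type given in the text


def encode_srlg_subtlv(srlg_values):
    """Encode SRLG values as: type (2 bytes), length (2 bytes), 4 bytes each.

    The 2+2 byte header is an assumption (common OSPF TE TLV layout).
    """
    value = b"".join(struct.pack("!I", srlg) for srlg in srlg_values)
    return struct.pack("!HH", SRLG_SUBTLV_TYPE, len(value)) + value


def decode_srlg_subtlv(data):
    """Decode a sub-TLV produced by encode_srlg_subtlv back into a list."""
    tlv_type, length = struct.unpack("!HH", data[:4])
    if tlv_type != SRLG_SUBTLV_TYPE or length % 4 != 0:
        raise ValueError("not a well-formed SRLG sub-TLV")
    return [struct.unpack("!I", data[4 + i:8 + i])[0]
            for i in range(0, length, 4)]


encoded = encode_srlg_subtlv([100, 200])   # a link belonging to two SRLGs
print(encoded.hex())
print(decode_srlg_subtlv(encoded))
```

The length field grows by 4 for each SRLG the link belongs to, which is what makes the sub-TLV variable in length.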
[Figure 5.4 Shared risk link group. Optical layer: OXC1 through OXC6 interconnected by optical fibers. IP/MPLS layer: LSRs R1 through R5, with the links belonging to the same SRLG marked.]
Important notes:
• A link may belong to multiple SRLGs.
• The IGP extensions allow the SRLG values to be carried. On the other hand, knowledge of the underlying optical/SONET-SDH topology is not always available. Indeed, an operator may rely on another carrier to provide optical lambdas, and in that case the SP does not always know the actual physical path and the potential SRLGs. Moreover, an optical path may be dynamic, so its route may change over time. The SRLG values must then be updated each time such a change modifies the SRLGs.
Notion of SRLG disjoint: A TE LSP is said to be SRLG disjoint from a link L or a node R if and only if its path does not include any link or node that is part of an SRLG of that L or R. For instance, back to Figure 5.4, a TE LSP T1 following the path R1-R2-R3-R4 is SRLG disjoint from the link R1-R4. Two TE LSPs are said to be SRLG disjoint if the respective sets of links they traverse do not have any SRLG in common.
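The SRLG-disjoint check can be sketched using the optical routing of Figure 5.4 listed earlier. As a simplifying assumption, each optical fiber is modeled as one SRLG; in a real network the SRLG values are configured and flooded by the IGP rather than derived this way. The function names are illustrative.

```python
# Optical route of each IP/MPLS link, as listed for Figure 5.4.
OPTICAL_PATHS = {
    ("R1", "R2"): ["OXC1", "OXC2"],
    ("R1", "R4"): ["OXC1", "OXC4", "OXC5"],
    ("R1", "R5"): ["OXC1", "OXC6"],
    ("R2", "R3"): ["OXC2", "OXC3"],
    ("R3", "R4"): ["OXC3", "OXC5"],
    ("R5", "R4"): ["OXC6", "OXC4", "OXC5"],
}


def srlgs_of_link(a, b):
    """SRLGs of link a-b, modeling each traversed fiber as one SRLG."""
    route = OPTICAL_PATHS.get((a, b)) or OPTICAL_PATHS[(b, a)]
    return {frozenset(fiber) for fiber in zip(route, route[1:])}


def srlgs_of_path(path):
    """Union of the SRLGs of all links traversed by a TE LSP path."""
    result = set()
    for a, b in zip(path, path[1:]):
        result |= srlgs_of_link(a, b)
    return result


def srlg_disjoint_from_link(path, link):
    """True if the path shares no SRLG with the given link."""
    return srlgs_of_path(path).isdisjoint(srlgs_of_link(*link))


print(srlg_disjoint_from_link(["R1", "R2", "R3", "R4"], ("R1", "R4")))
print(srlg_disjoint_from_link(["R1", "R5", "R4"], ("R1", "R4")))
```

The path R1-R2-R3-R4 shares no fiber with the link R1-R4, whereas the path R1-R5-R4 is link disjoint from R1-R4 at the IP/MPLS layer but traverses the fiber OXC4-OXC5 and is therefore not SRLG disjoint from it.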
5.1.3 MPLS Traffic Engineering Components
The aim of this section is to review the main components of MPLS TE:
1. Configuration of the TE LSP on the head-end LSR: The first step consists of configuring the TE LSP’s attributes on the head-end LSR. Various attributes can be configured, such as the destination (address of the tail-end LSR), the required bandwidth, the required protection/restoration, the affinities, and others.
2. Topology and resource information distribution: To compute a path obeying the set of specified constraints, the head-end LSR needs to gather topology and resource information. Note that this applies only to situations in which the TE LSP path is dynamically computed by each LSR (also referred to as distributed or on-line path computation), by contrast with centralized or off-line path computation, in which the TE LSP paths are computed by an off-line tool. In the distributed case, the topology and resource information is distributed by a link state routing protocol (OSPF or IS-IS) with TE extensions that reflect link characteristics and reservation states. TE TLVs have been defined and are carried within a link state PDU (LSP) for IS-IS and within a TE opaque LSA type 10 for OSPF to flood the reservation states and other parameters.
3. TE LSP computation: As already stated, the computation of a TE LSP path can be performed either by an off-line tool or on-line. In the former case, an external tool simultaneously computes the paths of all the TE LSPs according to the network resources. In the latter case, every router (LSR) uses its resource and topology database (IS-IS or OSPF), takes into account the set of requirements of the TE LSP, and computes the shortest path satisfying the set of constraints, usually using a constrained shortest path first (CSPF) algorithm. Various types of CSPFs can be used.
4. TE LSP setup: Once the path of a TE LSP has been computed, the head-end LSR signals the TE LSP by means of the Resource Reservation Protocol (RSVP) with the corresponding set of extensions defined in [RSVP-TE]. For instance, in Figure 5.3, R1 computes a path for the TE LSP T1 (R1-R3-R4-R5-R8) based on T1’s attributes and the network topology and resource information disseminated by the routing protocol. Once T1’s path is computed, T1 is signaled by RSVP-TE.
TE LSPs are then signaled, maintained (refreshed) and potentially torn down using various RSVP messages: Path, Resv, Path Error, Path Tear, Reservation Error, Resv Confirmation, and Resv Tear. Also, various new objects have been defined in [RSVP-TE] for the purpose of MPLS TE, for example, to allocate labels to TE LSPs that will then be used in the MPLS data plane. Note that labels are assigned in the upstream direction using RSVP messages (Resv message) and intermediate LSRs are programmed accordingly. For instance, when the TE LSP T1 is signaled, labels are assigned by LSRs in the upstream direction: R8 provides a label to R5, R5 provides a label to R4, and so on. Note: It is worth mentioning that RSVP has often been criticized for its scalability, in particular the number of states required in the network. As a matter of fact, currently deployed networks can handle thousands of RSVP TE reservations (TE LSPs) on a single router without any problem. Moreover, various protocol enhancements have been defined (see [REFRESH-REDUCTION]) to further increase the scalability, if needed. Finally, MPLS TE can be deployed with multiple levels of hierarchies, if required, in very large networks.
5. Packet forwarding: Once a TE LSP is set up, the head-end LSR can update its routing table and start using the TE LSP to forward IP packets. A 32-bit label stack entry, which carries the label, is pushed onto the IP packet, which is then label switched across the network (intermediate routers do not make any routing decision).
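The setup and forwarding steps above can be illustrated with a minimal sketch. It first mimics the upstream label assignment performed while the Resv message travels from the tail-end to the head-end of T1, and then packs the label received by the head-end into a 32-bit label stack entry, assuming the standard layout (20-bit label, 3-bit EXP, 1-bit bottom-of-stack, 8-bit TTL). The function names and label values are hypothetical; labels start at 16 because lower values are reserved, and real deployments may differ (for instance, when penultimate hop popping is used).

```python
import itertools
import struct


def assign_labels(path, first_label=16):
    """Walk the path from tail-end to head-end, as the Resv message does.

    Returns {(upstream, downstream): label}, where the label is chosen by
    the downstream LSR and provided to its upstream neighbor.
    """
    labels = itertools.count(first_label)
    advertised = {}
    hops = list(zip(path, path[1:]))
    for upstream, downstream in reversed(hops):
        advertised[(upstream, downstream)] = next(labels)
    return advertised


def pack_label_stack_entry(label, exp=0, bottom_of_stack=True, ttl=255):
    """Pack one 32-bit label stack entry: label(20) | EXP(3) | S(1) | TTL(8)."""
    if not 0 <= label < 2 ** 20:
        raise ValueError("label must fit in 20 bits")
    word = (label << 12) | (exp << 9) | (int(bottom_of_stack) << 8) | ttl
    return struct.pack("!I", word)


T1 = ["R1", "R3", "R4", "R5", "R8"]
labels = assign_labels(T1)
for (upstream, downstream), label in labels.items():
    print(f"{downstream} provides label {label} to {upstream}")

# The head-end R1 pushes the label it received from R3 onto each IP packet.
entry = pack_label_stack_entry(labels[("R1", "R3")], ttl=64)
print(entry.hex())
```

Each midpoint LSR then swaps the incoming label for the one provided by its own downstream neighbor, so the packet follows the signaled path without any IP routing decision.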
5.1.4 Notion of Preemption in MPLS Traffic Engineering
One interesting property defined in MPLS TE, called “preemption,” deserves some elaboration in this chapter because preemption mechanisms may be triggered upon network element failure. [RSVP-TE] defines the notion of preemption, or priority, for a TE LSP. This parameter is signaled in the SESSION-ATTRIBUTE object of the RSVP-TE Path message (more precisely, the RFC defines two priorities, known as the “setup” and “holding” priorities, which define the priority of a TE LSP with respect to taking and holding resources, respectively). When a new TE LSP is signaled, an LSR considers the admission of this newly signaled TE LSP by comparing the requested bandwidth with the bandwidth available at the priority specified in the setup priority. If the requested bandwidth is available but this requires preempting other TE LSPs having a lower priority, then the newly signaled TE LSP is admitted and one or more TE LSPs with a lower priority are preempted. Note that the selection of the set of lower priority TE LSPs to be preempted is a local decision and is generally implementation specific. More details on preemption policies can be found in [PREEMPTION-POL]. The preemption process implies the following set of actions for each preempted TE LSP:
• The corresponding local RSVP states are cleared and the traffic is no longer forwarded.
• Messages are sent both upstream (RSVP Path Error message) and downstream (RSVP Resv Error message) so that all the states corresponding to the preempted TE LSP are cleared along its path.
Then the head-end LSR of a preempted TE LSP initiates a TE reroute procedure, as detailed earlier, to reroute the TE LSP along another path. This means that hard preemption is by nature a disruptive mode. So the concept of soft preemption, which proposes a different mode of preemption, has been introduced in [SOFT-PREEMPTION]. If a TE LSP must be preempted to accommodate a higher priority TE LSP request, the preempting LSR performs the following actions:
• The preempting LSR signals to the respective head-end LSR the need to reroute the TE LSP in a nondisruptive fashion (the so-called “make before break” procedure).
• The local states of the soft preempted TE LSP are not cleared and no RSVP Path Error/Resv Error messages are sent.
Hence, the preempting node keeps forwarding the traffic of a soft preempted TE LSP for a certain period. This gives the head-end LSRs of the soft preempted TE LSPs a chance to reroute their TE LSPs along an alternate path without disrupting the traffic flow. It is worth pointing out that this temporarily provokes reservation overbooking on some links because, until the soft preempted TE LSPs are rerouted by their respective head-end LSRs, the sum of the admitted bandwidth is higher than the maximum allowed. Note that some algorithms can be carefully designed to preempt hard preemptable (see footnote 62) TE LSPs first. Moreover, appropriate MPLS Diffserv mechanisms can be used to make sure that high-priority traffic is served adequately.
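The admission decision with preemption can be sketched as follows. This is an illustration under stated assumptions rather than the behavior of any given implementation: priorities are taken to range from 0 (highest) to 7 (lowest), a new TE LSP may only preempt TE LSPs whose holding priority is numerically greater than its setup priority, and the victim selection policy (lowest priority first) is one arbitrary local policy among many. The names and bandwidth values are hypothetical.

```python
def admit(capacity_mbps, existing, new_bw_mbps, setup_priority):
    """Decide whether a new TE LSP is admitted on a link.

    existing is a list of (name, bandwidth_mbps, holding_priority) tuples.
    Returns (admitted, names_of_preempted_lsps).
    """
    # Bandwidth held at an equal or higher priority cannot be taken.
    protected = sum(bw for _, bw, hold in existing if hold <= setup_priority)
    if new_bw_mbps > capacity_mbps - protected:
        return False, []

    free = capacity_mbps - sum(bw for _, bw, _ in existing)
    preempted = []
    # Assumed local policy: preempt the lowest priority TE LSPs first.
    candidates = sorted((lsp for lsp in existing if lsp[2] > setup_priority),
                        key=lambda lsp: lsp[2], reverse=True)
    for name, bw, _ in candidates:
        if free >= new_bw_mbps:
            break
        preempted.append(name)
        free += bw
    return True, preempted


link_state = [("A", 60, 3), ("B", 50, 5), ("C", 40, 7)]  # 150 of 155 Mbps held
print(admit(155, link_state, 40, setup_priority=4))   # C must be preempted
print(admit(155, link_state, 100, setup_priority=4))  # rejected: A is protected
print(admit(155, link_state, 5, setup_priority=7))    # fits without preemption
```

With hard preemption, each TE LSP returned in the second element would be torn down immediately; with soft preemption, it would keep forwarding traffic until its head-end LSR has rerouted it.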
62 The hard/soft preemptable property of a TE LSP is explicitly signaled in the RSVP Path message.
5.1.5 Motivations for Deploying MPLS Traffic Engineering
Now that the concept of TE and the main components of MPLS TE have been reviewed, it is time to highlight the various motivations for deploying MPLS TE in a network.
1. Bandwidth optimization: As pointed out in Section 5.1, MPLS TE can be deployed to achieve better network resource utilization, usually referred to as bandwidth optimization.
2. Strict QoS guarantees: Another motivation for deploying MPLS TE in a network is to enforce strict QoS guarantees for various service types, including sensitive traffic flows such as voice, video, and circuit emulation. As already mentioned, MPLS TE acts on the control plane and as such takes care of the routing decision. For instance, in a network with a single class of service (CoS), MPLS TE allows an operator to reduce the average and maximum link utilization. A direct implication is that the probability of traffic queuing delay decreases, which correlates with a better QoS. Another example is the case of a network with multiple classes of service. Making sure that sensitive flows are treated appropriately requires various data plane mechanisms such as marking, queuing, and congestion avoidance. In such networks, MPLS TE allows control over the proportion of high-priority traffic versus medium- and low-priority traffic on a per-link basis, which increases the QoS. Although this has already been highlighted, to provide QoS guarantees between two nodes, specific actions must be taken in the IP/MPLS data plane, implementing the Differentiated Services (Diffserv) model. Indeed, MPLS TE is responsible for finding a path obeying a set of constraints, but once the packets are sent onto that TE LSP, each node along the path has to serve the packets appropriately according to the required CoS.
3. Fast recovery: Several mechanisms for MPLS TE are described throughout this chapter, allowing for fast recovery along with other requirements such as QoS protection during failure. Those mechanisms have been generating a growing interest in MPLS TE, and the need for fast convergence alone, even if bandwidth optimization or strict QoS guarantees are not required, may justify the deployment of MPLS TE. Several large networks have deployed MPLS TE to benefit from the set of fast recovery mechanisms.
The aim of the previous paragraphs was to introduce the motivations for deploying MPLS TE in a network: bandwidth optimization, strict QoS guarantees, and fast recovery. They are of course not mutually exclusive. For example, consider an IP/MPLS network where the resource utilization is not optimal and fast recovery is desired. Then MPLS TE can be deployed with, for instance, any fast recovery technique described in this chapter. Another example is an IP/MPLS network where strict QoS guarantees are required for the voice traffic, as well as fast recovery for the virtual private networking (VPN) traffic and the voice traffic. Finally, as already pointed out, MPLS TE can be deployed for the sole purpose of benefiting from fast recovery. Consider an overprovisioned network in which neither bandwidth optimization nor strict QoS guarantees are necessary (the QoS guarantee is achieved by overprovisioning), but fast recovery is a must. Then MPLS TE is a good candidate for its fast recovery property.
5.2 Analysis of the Recovery Cycle
Before studying the various recovery techniques used in IP/MPLS networks, it is worth spending some time on the recovery cycle analysis introduced in Chapter 1 and depicted in Figure 5.5.
[Figure 5.5 Recovery cycle. A time axis starting at the failure, with the labels “Fault Detected,” “Fault Detection Time,” “Hold-Off Time,” “Fault Notification Time,” “Recovery Operation Time,” “Traffic Recovery Time,” and the overall “Recovery Time.”]
5.2.1 Fault Detection Time
As with any other recovery technique at any layer, the fault detection time is a key component of the total recovery time and varies widely depending on the fault detection mechanism in use and the underlying layers 1 and 2. For instance, the fault detection time can vary from a few tens of milliseconds when two LSRs are interconnected via a SONET/SDH VC or an optical lightpath to a few hundreds of milliseconds or seconds when hello mechanisms are required. (Section 4.3 in Chapter 4 is entirely devoted to the important aspects of failure profiles and fault detection.)
5.2.2 Hold-Off Timer
A hold-off timer can be very useful if the underlying layer has a recovery scheme. Those aspects of multilayer protection/restoration strategies are covered in detail in Chapter 6. In a nutshell, consider a multilayer network where fast recovery mechanisms are implemented both at the optical layer and at the MPLS layer. When a failure occurs, one should generally avoid any race condition in which both recovery mechanisms simultaneously try to perform a reroute along an alternate path. A bottom-up, timer-based approach can be adopted, in which the MPLS layer waits for a hold-off timer to expire before trying to perform a reroute, to give the optical layer a chance to restore the failed resources. If the optical layer does not succeed in restoring the failed resources before the hold-off timer expires, the MPLS recovery mechanism is triggered to recover the affected traffic at the MPLS layer.
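The bottom-up, timer-based decision can be summarized in a few lines. The function name and the millisecond values are hypothetical and serve only to illustrate the logic; actual hold-off values depend on the recovery time of the underlying layer.

```python
def mpls_reroute_time(hold_off_ms, optical_restore_ms=None):
    """Return when (in ms after fault detection) the MPLS layer starts its
    reroute, or None if the optical layer restored the resource first.

    optical_restore_ms is None when the optical layer never recovers.
    """
    if optical_restore_ms is not None and optical_restore_ms <= hold_off_ms:
        return None          # optical layer recovered in time: MPLS stays put
    return hold_off_ms       # hold-off expired: MPLS recovery is triggered


print(mpls_reroute_time(100, optical_restore_ms=50))    # no MPLS action
print(mpls_reroute_time(100, optical_restore_ms=None))  # MPLS reroutes at 100 ms
print(mpls_reroute_time(100, optical_restore_ms=300))   # too slow: MPLS reroutes
```

The price of avoiding the race condition is visible in the last two cases: when the optical layer fails to recover in time, the hold-off timer is added in full to the MPLS recovery time.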
5.2.3 Fault Notification Time
To perform traffic recovery, an LSR must first be informed of the failure. As we will see in this chapter, depending on the MPLS TE recovery mechanism used, the traffic recovery may be performed on the node immediately upstream of the failure or on the head-end LSR (the LSR originating the TE LSP). The signal that reports the failure to the node in charge of performing the traffic recovery is called the fault indication signal (FIS). Hence, once the fault has been detected by an LSR R, the FIS is propagated until it reaches an LSR that has the ability to reroute the TE LSPs affected by the failure. The fault notification time (the time for the FIS to be received by the node in charge of the traffic recovery) varies depending on whether the recovery technique is local or global, as shown in Chapter 1, Section 1.5.4. It is usually desirable to guarantee, through appropriate scheduling on the various LSRs, that the FIS receives the proper QoS, to minimize and bound the fault notification time. For instance, as mentioned in Chapter 4, the IGP flooding should be prioritized. In addition, IGP and RSVP messages should be queued appropriately and of course should never be dropped in the case of congestion. Refer to Chapter 4, Section 4.5, for further details on QoS mechanisms.
RSVP Reliable Messaging
As we saw in Chapter 4, IGP updates are always sent in reliable mode; this is inherent to link state routing protocols. By contrast, RSVP messages are sent by default in nonreliable mode. So the loss of a Path Error message (which is used to
report an LSP failure to upstream nodes) may significantly increase the fault notification time, especially if the IGP has not been tuned to provide fast notification (see Chapter 4 for details). [REFRESH-REDUCTION] proposes a mechanism to send RSVP messages in reliable mode. Two additional RSVP objects are defined: the MESSAGE-ID and the MESSAGE-ID-ACK objects. Each RSVP message sent in reliable mode contains a unique MESSAGE-ID object and is acknowledged by a MESSAGE-ID-ACK object (which may be piggybacked onto any other RSVP message or carried in an RSVP acknowledgment message). The retransmission of an unacknowledged message for which an explicit acknowledgment had been requested is based on an exponential back-off procedure. When an LSR has to send a message in reliable mode, it inserts a MESSAGE-ID object in the RSVP message and sets a particular flag in the MESSAGE-ID header called the ACK-Desired flag. Upon receiving the RSVP message, the neighboring LSR sends back an RSVP message containing a MESSAGE-ID-ACK object. When the message is acknowledged, the transmission procedure is terminated. If the sending LSR does not receive any acknowledgment before a dynamic timer has elapsed, the message is retransmitted. The dynamic timer Tk is first set to an initial retransmission value (generally a short value) and is then increased exponentially until the maximum number of retransmissions is reached. For example, suppose that a message is sent for the first time and Tk = T1 is set to the initial timer value (the recommended value is 500 ms).
• If the message is not acknowledged after T1, it is retransmitted. Otherwise the procedure is stopped.
• Then Tk is set to T(k-1) × (1 + delta) (the recommended value for delta is 1).
• The maximum value for k is set to a fixed value (k = 3 is recommended).
In summary, the sending LSR waits 500 ms and retransmits the message, then waits 500 ms × 2, then 500 ms × 4, with exponentially increasing waiting times. If the maximum retransmission value is set to 3, the message is no longer retransmitted after three trials.
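The back-off rule above can be checked with a short computation. The following is a minimal sketch, not an RSVP implementation: the function name `retransmission_waits` and its parameter names are hypothetical, and the default values are simply the recommended values quoted in the text (500 ms, delta = 1, k = 3).

```python
from typing import List


def retransmission_waits(initial_ms: float = 500.0, delta: float = 1.0, max_k: int = 3) -> List[float]:
    """Return the successive waiting times T1..Tk (in ms) before the sender gives up."""
    waits = []
    timer = initial_ms              # T1 is the initial retransmission timer
    for _ in range(max_k):
        waits.append(timer)
        timer *= 1 + delta          # Tk = T(k-1) * (1 + delta)
    return waits


waits = retransmission_waits()      # [500.0, 1000.0, 2000.0]
worst_case_ms = sum(waits)          # 3500.0 ms elapse before the procedure stops
```

With the recommended values, a message that is never acknowledged therefore keeps the sender busy for 3.5 seconds, which is why the loss of a Path Error message can weigh heavily on the fault notification time.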
5.2.4
Recovery Operation Time
Any recovery technique involves a set of actions to be completed. This includes potential synchronization between network elements to coordinate.
5.2.5
Traffic Recovery Time
The traffic recovery time represents the time between the last recovery action and the time the traffic is completely recovered. Each component described earlier is analyzed for the various recovery techniques described in this chapter.
We just saw a brief description of each phase of the recovery cycle. There are multiple types of MPLS TE recovery techniques (Table 5.1):
Table 5.1 Categories of MPLS Recovery Mechanisms
               Local recovery                    Global recovery
Protection     Local protection (Section 5.5)    Global protection (Section 5.4)
Restoration    -                                 Global default restoration (Section 5.3)
• MPLS TE global default restoration (Section 5.3): This is the default mode of recovery of MPLS TE, whereby the failure is notified to the head-end LSR by means of RSVP and the routing protocol; the head-end LSR in turn computes a new path and finally resignals the TE LSP along that new path.
• MPLS TE global protection (Section 5.4): The basic principle is that two TE LSPs are set up by the head-end LSR: a primary LSP and a backup LSP. Once the head-end LSR is notified of a failure along the LSP path, it starts using the backup LSP.
• MPLS TE local protection (Fast Reroute; Section 5.5): This is a local repair recovery scheme in which, upon failure detection, the LSPs affected by the failure are locally rerouted by the node immediately upstream of the failure.
5.3 MPLS Traffic Engineering Global Default Restoration
MPLS TE global default restoration is the default recovery technique. Once a failure is detected by some downstream node, the head-end LSR is notified by means of RSVP and the routing protocol (FIS). Upon receiving the notification, the head-end LSR recomputes the path and signals the LSP along an alternate path.
5.3.1
Fault Signal Indication
It is worth elaborating on the nature of the FIS in the context of MPLS TE because this aspect might be a source of confusion. In the context of an IP/MPLS TE network, the FIS is either an IGP update (footnote 63) or an RSVP Path Error message. Actually, both are generated independently. In the case of the IGP, a node detecting a loss of routing adjacency generates an LSA/LSP update (see Chapter 4 for a detailed description of IP routing from a recovery perspective). When a link fails between two nodes, the nodes attached to the failed link send an IGP update. In the case of a node failure, all the neighbors of the failed node send an IGP update. As discussed in Chapter 4, the timing sequence depends heavily on the failure detection time and IGP parameter tuning. Moreover, every node detecting a failure also generates an RSVP Path Error message sent to each head-end LSR having a TE LSP traversing the failed resource. For instance, in Figure 5.3, if the link R3-R4 fails, as soon as the node R3 detects the link failure, it sends a notification (RSVP Path Error message) to R1, the head-end LSR of T1, because T1 traverses the failed link. In addition, an IGP update is sent by both R3 and R4 to reflect the new topology. Again, the timing sequence depends on the IGP tuning (see Chapter 4). Usually, the RSVP Path Error message is received by the head-end LSRs within a few tens of milliseconds, so generally before the IGP update, but regardless of which FIS is received first, the head-end LSR gets notified. As pointed out in Section 5.2, FIS delivery is of the utmost importance with MPLS global default restoration because it triggers the rerouting of the affected LSPs by the head-end LSR.
Footnote 63: In the rest of this chapter, the term IGP will be used in place of routing protocol.
5.3.2
Mode of Operation
When a TE LSP is configured on a head-end LSR, its set of attributes is specified: destination (IP address of the tail-end LSR), bandwidth, priority, protection/restoration requirements, and other MPLS TE parameters. As far as recovery is concerned, an important parameter is the TE LSP path. As mentioned in Section 5.1, the path of a TE LSP can be computed in either a distributed or a centralized fashion. In the former case, the configuration does not specify any particular path and the head-end LSR dynamically computes the LSP path, taking into account the constraints and available resources in the network. In the latter case, the path for the TE LSP is statically configured on the head-end LSR. Some MPLS TE implementations allow the configuration of both options with an order of preference.
In Table 5.2, a TE LSP is defined with its corresponding parameters/constraints: destination address (10.0.1.100), bandwidth (10000), and priority (1). In addition, the notion of path-option allows specifying, in order of preference, the list of paths that the LSP should follow. In this example, the preferred path is a static path (path 1) for which the set of hops is statically configured on the head-end LSR. If path 1 is not available (the path is broken, or not all the required constraints can be satisfied along it), path 2 is the second preferred path. Note that this corresponds to the off-line path computation method already mentioned for MPLS TE, where the LSP path is computed by some other tool (not by the head-end LSR itself). Then, if none of the static paths is available, the head-end LSR tries to find a path that complies with the requested constraints using the CSPF algorithm (this is path option 3). Note that it might also be possible to have different sets of constraints for different path options. For example, suppose that no path satisfying the bandwidth constraint (10000) can be found. Then one solution could be to try a lower value. Of course, this example configuration shows a combination of static and dynamic paths for the sake of illustration. Just one dynamic path could have been configured, or one or more static paths.
Table 5.2 An example of MPLS Traffic Engineering TE LSP Configuration
interface Tunnel1
 ip unnumbered Loopback0
 no ip directed-broadcast
 tunnel destination 10.0.1.100
 tunnel mode mpls traffic-eng
 tunnel mpls traffic-eng priority 1 1
 tunnel mpls traffic-eng bandwidth 10000
 tunnel mpls traffic-eng record-route
 tunnel mpls traffic-eng path-option 1 explicit name path1
 tunnel mpls traffic-eng path-option 2 explicit name path2
 tunnel mpls traffic-eng path-option 3 dynamic

Path1 = {192.170.14.2, 192.170.10.1, 192.170.4.5}
Path2 = {192.170.13.2, 192.170.17.1, 192.170.20.5}
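The order-of-preference logic expressed by the path-option statements of Table 5.2 can be sketched as follows. This is an illustrative model only, not the behavior of any particular router software: `select_path`, `explicit_path_ok`, and `compute_dynamic` are hypothetical names, and the two callbacks stand in for the head-end LSR's own validity check and CSPF computation.

```python
from typing import Callable, List, Optional, Tuple

# (preference, explicit hop list); a hop list of None stands for a dynamic path option.
PathOption = Tuple[int, Optional[List[str]]]


def select_path(options: List[PathOption],
                explicit_path_ok: Callable[[List[str]], bool],
                compute_dynamic: Callable[[], Optional[List[str]]]) -> Optional[List[str]]:
    """Try each path option in order of preference and return the first usable path."""
    for _, hops in sorted(options, key=lambda option: option[0]):
        if hops is None:
            path = compute_dynamic()                       # dynamic option: ask CSPF
        else:
            path = hops if explicit_path_ok(hops) else None  # static option: validate it
        if path is not None:
            return path
    return None                                            # no option could be satisfied


# The three path options of Table 5.2.
PATH1 = ["192.170.14.2", "192.170.10.1", "192.170.4.5"]
PATH2 = ["192.170.13.2", "192.170.17.1", "192.170.20.5"]
OPTIONS = [(1, PATH1), (2, PATH2), (3, None)]
```

For instance, if the validity check rejects path 1 but accepts path 2, `select_path` returns path 2 without ever invoking the dynamic computation; the dynamic option is reached only when both static paths are unusable.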
Recovery Cycle with Global Default Restoration
The mode of operation of global default restoration is relatively simple: When the head-end LSR is informed of the link/node failure, if an alternate path is specified, the head-end LSR checks whether the configured path satisfies the constraints for the TE LSP. If so, the TE LSP is reestablished along that path. If no preconfigured path is specified on the head-end router, and if the head-end LSR is configured to do so, it triggers a new path computation for the set of affected TE LSPs, calling the CSPF process. This corresponds exactly to the example in Table 5.2: If a notification is received reporting that path 1 is unavailable, the head-end LSR tries to determine whether it can use path 2, and if path 2 is not valid for some reason, it tries to compute a path itself.
Note 1: Various existing MPLS TE implementations allow relaxing constraint(s) upon failure, which might sometimes be necessary. A slightly more complicated example could be given in which a different set of constraints is specified for each path option. For instance, consider a network with relatively high link utilization in terms of bandwidth reservation; a major node failure may leave several TE LSPs unable to find an alternative path. In this case, one option is to relax some constraints, such as the bandwidth constraint, so the TE LSP can be routed. There is one undesirable side effect, though: Allowing a TE LSP to be rerouted as a 0-bandwidth TE LSP implies that traffic will flow over this tunnel without any CAC. Thus, no bandwidth can be guaranteed in this case. Bandwidth is just one of the various constraints a TE LSP can be configured with. Another example is affinities, which make it possible, for instance, to ensure that some TE LSPs avoid particular network resources by means of bit masks. An affinity can be thought of as a color. As an
example, some network links might be colored red (with red meaning “high propagation delay” or “poor quality”). This affinity link property is propagated through IGP TE extensions (see [OSPF-TE] and [IS-IS-TE]). This way, a TE LSP carrying very sensitive traffic such as voice-over-IP (VoIP) can be configured so that red links are excluded from the path selection. In such a case, a major network failure may leave the affected TE LSP non-reroutable without crossing one or several red links. It might then be desirable to relax the affinity constraint.
Note 2: A large proportion of deployed MPLS TE networks rely on distributed computation in which no static path is configured; in this case, just a dynamic path is configured and the head-end recomputes a new path based on the LSP constraints and its knowledge of the network topology and resource information provided by the IGP. A usual question is: What is the CSPF duration time? And the systematic answer is: That depends. Indeed, the CSPF duration time is a function of the network size and the CSPF algorithm in use. Finding the shortest constrained path in a very large network obviously requires more time than in a small network. Furthermore, the CSPF complexity varies with the algorithm in use. Finally, the router CPU should also be taken into account. That said, as an order of magnitude, an average CSPF computation using a classic CSPF algorithm on a network with hundreds of nodes rarely exceeds a few milliseconds. It is worth noting that one CSPF must be triggered per affected TE LSP. Indeed, if N LSPs starting on a head-end LSR R1 traverse a failed link, R1 has to compute a new path for each of them.
Once a new path has been computed, the TE LSP is signaled along the new path. The final operation before any traffic can be routed over the newly signaled TE LSP consists of updating the routing table for the destinations that can be reached via the TE LSP.
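To make the notions of constraint and constraint relaxation more concrete, here is a minimal CSPF sketch. It assumes one common formulation (prune the links that violate the constraints, then run a shortest path computation on what remains); actual implementations differ in their tie-breaking rules and optimizations. The function `cspf`, the link tuple layout, the `RED` bit, and the small four-node topology are all hypothetical and introduced only for illustration.

```python
import heapq
from typing import Iterable, List, Optional, Tuple

# (node_a, node_b, metric, available_bandwidth, affinity_bits); links are bidirectional.
Link = Tuple[str, str, int, int, int]


def cspf(links: Iterable[Link], src: str, dst: str,
         bandwidth: int, exclude_affinity: int = 0) -> Optional[List[str]]:
    """Shortest path from src to dst using only links that satisfy the constraints."""
    adjacency = {}
    for a, b, metric, available, affinity in links:
        # Prune links with insufficient bandwidth or with an excluded color.
        if available < bandwidth or (affinity & exclude_affinity):
            continue
        adjacency.setdefault(a, []).append((b, metric))
        adjacency.setdefault(b, []).append((a, metric))
    # Plain Dijkstra on the pruned topology.
    heap = [(0, src, [src])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in adjacency.get(node, []):
            if neighbor not in visited:
                heapq.heappush(heap, (cost + metric, neighbor, path + [neighbor]))
    return None


RED = 0x1  # hypothetical affinity bit meaning "high propagation delay"
LINKS = [
    ("R1", "R2", 1, 100, 0),
    ("R2", "R3", 1, 100, RED),
    ("R1", "R4", 2, 100, 0),
    ("R4", "R3", 2, 50, 0),
]
```

On this toy topology, a request for 10 units of bandwidth that excludes red links is routed over R1-R4-R3. A request for 80 units that excludes red links finds no path at all, whereas relaxing the affinity constraint (an `exclude_affinity` of 0) lets it be routed over R1-R2-R3, which is the kind of trade-off described in Note 1.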
5.3.3
Recovery Time
Providing hard numbers is not a realistic exercise because a significant number of factors influence the rerouting time, but we describe the different components of the recovery cycle with global default restoration through an example. Figure 5.6 shows the different steps of the recovery cycle with MPLS TE global default restoration.
Step 1: The link R3-R4 fails, and an FIS (RSVP Path Error and IGP update) is generated. As already pointed out, the relative timing of the IGP update and the RSVP Path Error depends on many factors. The receipt of either one is sufficient for the head-end LSR to be notified of the failure.
Step 2: The FIS is sent to the head-end LSR. Note that the notification delay might be nonnegligible and is made up of two components: the propagation delay (on wide area networks, this can be on the order of tens of milliseconds and can become as large as 100 ms between two continents, where the optical path can be very long) and the queuing and processing delays for the FIS to reach the head-end router. As mentioned in Section 5.2, appropriate marking and scheduling in the forwarding path are highly recommended to ensure that the queuing and processing delays are both minimized.
Step 3: Upon receiving the failure notification, the head-end LSR (R1 in this example) tries to find an alternate path satisfying the set of constraints for each TE LSP affected by the failure.
Step 4: The TE LSP is signaled along the new path. The RSVP signaling setup time is also made up of several components: the propagation delay along the path (round trip) and the queuing and processing delays at each hop in both directions (upstream and downstream).
Step 5: The routing table of R1 is updated to use the newly signaled LSP.
Figure 5.6 Event scheduling in the case of link/node failure with MPLS TE reroute. (Labels in the figure: IGP Update/RSVP Path Error; New Path Computation for the Set of Affected TE LSPs; TE LSP Signalled.)
In conclusion, because the different components of the recovery time are highly dependent on the network characteristics, the resulting recovery time may vary from a few milliseconds to hundreds of milliseconds, sometimes a few seconds. Testing MPLS traffic reroute in a lab made of a few routers will probably result in a very short convergence time (a few milliseconds); indeed, the propagation delays are negligible, as is the FIS processing delay. The CSPF computation is also very short because the network size is limited, and finally the setup time is also negligible. In contrast, a network with 1000 nodes, links with high propagation delays, and hundreds of TE LSPs to reroute will require a much more significant amount of time to converge.
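The additive nature of steps 2 to 5 can be illustrated with a simple budget. This is a rough sketch for a single affected TE LSP: it assumes the steps are strictly sequential and leaves out the fault detection time. The class name `RestorationDelays` and all the numeric values are hypothetical and chosen only to exercise the arithmetic; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class RestorationDelays:
    """Per-step delays, in milliseconds, for one affected TE LSP."""
    fis_propagation_ms: float          # step 2: propagation of the FIS to the head-end LSR
    fis_queuing_processing_ms: float   # step 2: queuing and processing of the FIS
    cspf_ms: float                     # step 3: path computation on the head-end LSR
    signaling_ms: float                # step 4: RSVP setup (round trip plus per-hop delays)
    routing_update_ms: float           # step 5: routing table update

    def total_ms(self) -> float:
        return (self.fis_propagation_ms + self.fis_queuing_processing_ms
                + self.cspf_ms + self.signaling_ms + self.routing_update_ms)


# Hypothetical figures: a small lab versus a large wide area network.
lab = RestorationDelays(0.0, 0.5, 1.0, 1.0, 0.5)
wan = RestorationDelays(50.0, 5.0, 5.0, 120.0, 20.0)
```

With these assumed inputs, the lab case adds up to 3 ms and the wide area case to 200 ms, which mirrors the spread between a few milliseconds and hundreds of milliseconds mentioned above.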
5.4 MPLS Traffic Engineering Global Path Protection
MPLS TE global path protection (also usually referred to as path protection) is a global 1:1 protection recovery mechanism. As defined in Chapter 1, Section 1.5.4, this implies that the head-end LSR performs the rerouting (global recovery) and a presignaled backup LSP is used (protection) if the protected LSP fails.
5.4.1
Mode of Operation
Figure 5.7 describes the mode of operation of global path protection. In this figure, there are two primary TE LSPs, T1 (which follows the path R2-R3-R4-R5-R6) and T2 (which follows the path R7-R8-R9-R6). For each primary TE LSP, a dedicated backup LSP is set up before any failure occurs. It is worth noting that a backup TE LSP (also called secondary TE LSP) must be link diverse or node diverse from the primary TE LSP. In this example, the backup LSP of T1 follows the path R2-R1-R10-R11-R5-R12-R6, which is link diverse (footnote 64) from T1. By contrast, the backup LSP of T2 follows the path R7-R2-R3-R4-R5-R6 and is node diverse from T2. The aspects related to the backup path computation are covered in Section 5.15.
A backup (secondary) TE LSP is a regular TE LSP; that is, as far as RSVP signaling is concerned, a backup TE LSP is signaled like any other TE LSP. The backup TE LSP can be configured either with the same attributes as the primary TE LSP (in which case it satisfies the same set of constraints as the primary TE LSP) or with different constraints (e.g., no affinities, less bandwidth [say, 50% of the primary TE LSP]). For instance, if the backup TE LSP is configured with 50% of the primary TE LSP bandwidth, then when it is used, the traffic is forwarded along a path where 50% of the bandwidth has been reserved. This does not necessarily mean that the traffic will suffer from QoS degradation; that depends on the actual use of the other LSPs sharing the same network resources along the backup path.
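The difference between link diversity and node diversity can be verified on the paths of this example. The helper names below (`links_of`, `is_link_diverse`, `is_node_diverse`) are hypothetical, and the sketch assumes that the head-end and tail-end LSRs, which are necessarily common to a primary LSP and its backup, are not counted as shared nodes.

```python
from typing import List, Set, FrozenSet


def links_of(path: List[str]) -> Set[FrozenSet[str]]:
    """Return the set of (undirected) links traversed by a path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def is_link_diverse(primary: List[str], backup: List[str]) -> bool:
    """True if the two paths have no link in common."""
    return links_of(primary).isdisjoint(links_of(backup))


def is_node_diverse(primary: List[str], backup: List[str]) -> bool:
    """True if the two paths share no transit node (end points excluded)."""
    return set(primary[1:-1]).isdisjoint(backup[1:-1])


T1 = ["R2", "R3", "R4", "R5", "R6"]
T1_BACKUP = ["R2", "R1", "R10", "R11", "R5", "R12", "R6"]
T2 = ["R7", "R8", "R9", "R6"]
T2_BACKUP = ["R7", "R2", "R3", "R4", "R5", "R6"]
```

Applied to these paths, the backup of T1 is link diverse but not node diverse from T1, because both paths traverse the node R5, whereas the backup of T2 shares neither a link nor a transit node with T2.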
Figure 5.7 MPLS traffic engineering path protection. (Legend: primary TE LSPs T1 and T2; backup (secondary) TE LSPs of T1 and T2.)
Footnote 64: The terms disjoint and diverse are used interchangeably.
The mode of operation is quite straightforward: Once the failure is detected by some downstream node, an FIS is sent to the head-end LSR of each affected LSP (by affected LSP, we mean each LSP traversing the failed resource). Note that all the aspects related to the FIS delivery described in Section 5.3 identically apply here because both the global default restoration and the global path protection rely on the FIS delivery to trigger an LSP recovery. Then upon receiving the FIS, the head-end LSR immediately switches the traffic onto the backup TE LSP and updates its routing table accordingly.
5.4.2
Recovery Time
Compared to global default restoration, no routing computation has to be done “on the fly” to find an alternate route for the failed TE LSP. Moreover, with global path protection, the backup tunnel is already signaled, so no signaling round is required to set up the backup TE LSP. It is important to note that the saving in convergence time is predominantly provided by the presignaling of the TE LSP.
5.5 MPLS Traffic Engineering Local Protection
After a brief introduction to the specific terminology used for MPLS TE local protection, we describe the principle and mode of operation of the two local protection techniques collectively called MPLS TE Fast Reroute. The last section describes two deployment strategies of local protection recovery techniques. Note that the terms MPLS TE local protection and Fast Reroute are used interchangeably throughout this chapter.
5.5.1
Terminology
We begin this section by defining the terminology specific to MPLS TE Fast Reroute through an example (Figure 5.8).
As shown in Figure 5.8, an LSP T1 is signaled that follows the path R1-R2-R3-R4-R5. T1 is said to be “fast reroutable” if it is signaled with a specific attribute set in the RSVP Path message that indicates its desire to benefit from local recovery in the case of a failure (footnote 65). As shown in further subsections, Fast Reroute is a local protection recovery scheme; hence, the LSPs affected by a failure are locally rerouted by the node immediately upstream of the failure. This node is called the point of local repair (PLR). For instance, the node R2 is a PLR if the link R2-R3 or the node R3 fails. Fast Reroute uses backup tunnels to reroute affected LSPs. When a backup tunnel terminates on the PLR’s next hop (directly adjacent neighbor), it is an NHOP backup tunnel. When the backup tunnel terminates on the neighbor of the PLR’s neighbor, it is an NNHOP backup tunnel. Back to our example, B1 is an NHOP backup tunnel of the PLR R2 and B2 is an NNHOP backup tunnel of R2. The node where the backup tunnel terminates is called the merge point (MP); hence, R4 is the MP of B2. Finally, a fast-reroutable LSP is said to be protected at a node R if there exists a backup tunnel that can be used in the case of a failure. T1 is protected at R2 by B1 and B2. The terminology of detour merge point used in one Fast Reroute technique (one-to-one protection) is discussed in Section 5.14.
Figure 5.8 Terminology (MPLS local protection). (Labels in the figure: fast reroutable LSP T1; protected LSP; PLR; merge point; NHOP backup LSP B1; NNHOP backup LSP B2.)
Footnote 65: See Section 5.14 for the details on RSVP signaling for Fast Reroute.
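The NHOP and NNHOP definitions amount to counting hops between the PLR and the merge point along the protected LSP. The short sketch below illustrates this with the path of T1; the function name `classify_backup_tunnel` is hypothetical.

```python
from typing import List


def classify_backup_tunnel(protected_path: List[str], plr: str, merge_point: str) -> str:
    """Classify a backup tunnel by the position of its merge point relative to the PLR."""
    distance = protected_path.index(merge_point) - protected_path.index(plr)
    if distance == 1:
        return "NHOP"    # terminates on the PLR's next hop
    if distance == 2:
        return "NNHOP"   # terminates on the neighbor of the PLR's neighbor
    raise ValueError("merge point is neither the next hop nor the next-next hop of the PLR")


T1_PATH = ["R1", "R2", "R3", "R4", "R5"]
```

With the PLR R2, a backup tunnel terminating on R3 (such as B1) is classified as NHOP and one terminating on R4 (such as B2) as NNHOP.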
5.5.2
Principles of Local Protection Recovery Techniques
We use the generic term MPLS TE Fast Reroute, or simply Fast Reroute, to describe local protection techniques. Two Fast Reroute techniques (both are local protection techniques) are described in this chapter:
• Facility backup (also referred to as bypass)
• One-to-one backup (also referred to as detour)
Although the terminology might appear difficult to understand, it is in line with the corresponding standardized documents. Both methods are local repair techniques using local protection:
• Local: In the case of a link or node failure, a TE LSP is rerouted by the node that is immediately upstream of the failed link or node. In contrast to global default restoration and global path protection, where the TE LSP is rerouted by the head-end LSR, with local protection the protected LSP is rerouted at the closest location upstream of the failure. This presents the very significant advantage of eliminating the need for the FIS to be received by the head-end LSR before the affected TE LSP can be rerouted along an alternate path.
• Protection: As seen in Chapter 1, with protection recovery mechanisms, a backup resource is preallocated and signaled before the failure. With both local protection recovery methods (facility backup and one-to-one backup), the backup LSPs are established before the failure occurs. When a failure occurs and is detected, every protected TE LSP traversing the failed resource (usually referred to as an affected TE LSP) is rerouted over a backup TE LSP without having to compute a backup path “on the fly.”
Although both methods are local repair techniques, they differ significantly in terms of backup LSPs. With facility backup, a single backup LSP (or a very limited number of backup LSPs) is used to protect all the fast-reroutable TE LSPs from the failure of a link or node, which is a major benefit of the MPLS label stacking property. By contrast, one-to-one backup creates a separate backup LSP for each protected TE LSP at each hop. More details about their respective scalability are provided in Section 5.5.8. To ease the understanding of each local protection technique, the following approach is followed: First, a quick overview of each local protection method is provided via an example. Then each method is described in detail in subsequent subsections.
5.5.3
Local Protection: One-to-One Backup
As depicted in Figure 5.9, with one-to-one backup, at each hop, one backup LSP (called a Detour LSP) is created for each fast-reroutable TE LSP. So, for instance, at the node R3, to protect the set of fast-reroutable TE LSPs T1, T2, and T3, the following set of backup TE LSPs is set up:
• One Detour LSP D1 for the protected TE LSP T1, following the path R3-R10-R11-R5-R6
• One Detour LSP D2 for the protected TE LSP T2, following the path R3-R8-R5-R9
• One Detour LSP D3 for the protected TE LSP T3, following the path R3-R10-R11-R12
Figure 5.9 Illustration of the Detour LSP with one-to-one backup.
Note that this only protects the fast-reroutable TE LSPs T1, T2, and T3 against a failure of the link R3-R4 and the node R4. Each node along the fast-reroutable TE LSP paths performs the same operation. At each PLR along the fast-reroutable TE LSP path, a local backup tunnel called a Detour LSP is set up that avoids the protected resource and terminates on the tail-end LSR of the fast-reroutable TE LSP. In the previous example, for the fast-reroutable TE LSP T1, R3 sets up a Detour LSP D1 originating at R3 and terminating at R6 that avoids both the link R3-R4 and the node R4. Figure 5.10 shows the label allocation for both the primary TE LSP T1 and the Detour LSP D1, originated on R3, protecting T1 against a failure of either the link R3-R4 or the node R4.
For example, when the node R4 fails, as soon as the PLR R3 detects the failure, the fast-reroutable TE LSP T1 is locally rerouted by the PLR onto the Detour LSP, as shown in Figure 5.11. It is worth noting the change in label swapping here: Once R3 detects the R4 node failure, the incoming label 1 is no longer swapped to 2 and forwarded onto the R3-R4 interface but is now swapped to 10 and sent onto the outgoing interface R3-R10.
Detour LSP merging: Various merging rules allow for the reduction of the number of Detour LSPs and are described in Section 5.14.
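The change in label swapping performed by the PLR with one-to-one backup can be modeled as an update of a label forwarding table. This is a simplified sketch, not a description of a router's data structures: the function `reroute_onto_detours`, the dictionary layout, and the interface names are hypothetical; only the label values (1, 2, and 10) come from the example.

```python
from typing import Dict, Tuple

Entry = Tuple[int, str]  # (outgoing label, outgoing interface)


def reroute_onto_detours(lfib: Dict[int, Entry], detours: Dict[int, Entry],
                         failed_interface: str) -> Dict[int, Entry]:
    """Return a new table where entries using the failed interface point to their Detour LSP."""
    updated = dict(lfib)
    for in_label, (_, out_interface) in lfib.items():
        if out_interface == failed_interface and in_label in detours:
            # One-to-one backup: the label is swapped to the detour label (no push).
            updated[in_label] = detours[in_label]
    return updated


# State of the PLR R3 for T1: incoming label 1 is swapped to 2 toward R4.
lfib_r3 = {1: (2, "R3-R4")}
# Presignaled Detour LSP D1 for T1: swap to 10 toward R10.
detours_r3 = {1: (10, "R3-R10")}
after_failure = reroute_onto_detours(lfib_r3, detours_r3, "R3-R4")
```

After the failure, the entry for the incoming label 1 points to the label 10 on the interface R3-R10; because the Detour LSP is presignaled, the update is purely local to R3.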
Figure 5.10 Mode of operation of one-to-one backup. (Labels shown in the figure: 1, 2, 3 along T1; 10, 11, 12 along D1.)
5.5.4
Local Protection: “Facility Backup”
In contrast with one-to-one backup, with facility backup, just one backup tunnel per NHOP is required to protect against a link failure and one NNHOP backup tunnel is required to protect against a node failure. Of course, an NNHOP backup tunnel protects against not only a node failure (of the bypassed node) but also a failure of the link between the immediately upstream node and the bypassed node. As discussed later, there are some benefits in setting up both NHOP and NNHOP backup tunnels. More accurately, a small set of backup tunnels may be required if bandwidth protection must be guaranteed (see Section 5.15 for more details on bandwidth protection), but the key point is that the number of required backup tunnels is not a function of the number of TE LSPs in the MPLS network, which is a crucial property to preserve scalability.
In Figure 5.12, a single NNHOP backup tunnel (bypass) is configured on R3 (PLR) to protect any fast-reroutable TE LSP traversing the node R3 and following the R3-R4-R5 path against a failure of the link R3-R4 or the node R4 (indeed, the same NNHOP backup tunnel can be used in both failure scenarios). R5 is the merge point. Hence, for instance, the two fast-reroutable TE LSPs T1 and T2 are protected by the NNHOP bypass tunnel B1 that follows the path R3-R10-R11-R5.
Let us now consider a fast-reroutable TE LSP T1 that follows the path R2-R3-R4-R5-R6. As shown in Figure 5.12, the corresponding labels are distributed in RSVP Resv messages (R5 distributes the label “3” to R4, R4 distributes the label “2” to R3, R3 distributes the label “1” to R2). In this example, a bypass tunnel B1 starting at the PLR R3 is also set up to protect against a failure of the link R3-R4 and a failure of the node R4. The corresponding labels are depicted in Figure 5.12.
Note: The use of an NHOP backup tunnel is often referred to as MPLS TE Fast Reroute link protection. When the backup tunnel is an NNHOP backup tunnel, this is usually called MPLS TE Fast Reroute node protection.
Figure 5.11 One-to-one backup: Example of the mode of operation when the node R4 fails and the protected TE LSP T1 is locally rerouted by the PLR R3 onto its Detour LSP D1.
Figure 5.12 Facility backup operation.
Figure 5.13 Facility backup: Example of the mode of operation when the node R4 fails and the protected TE LSP T1 is locally rerouted by the PLR R3 onto the NNHOP backup tunnel B1.
A PLR can have NHOP and NNHOP backup tunnels. Furthermore, a PLR can have multiple NHOP backup tunnels and multiple NNHOP backup tunnels between a pair of LSRs to guarantee the bandwidth to the protected LSPs. This is discussed in detail in Section 5.15. Let us now consider a node failure and see the mode of operation of facility backup (Figure 5.13). As shown in Figure 5.13, in the case of a node failure of R4, as soon as the failure is detected by the PLR (R3), each protected TE LSP following
the path R3-R4-R5 will be rerouted onto the bypass tunnel B1. The rerouting operation consists of swapping the incoming label to the appropriate outgoing label, pushing an additional label corresponding to the backup tunnel label, and redirecting the traffic onto the outgoing interface of the backup tunnel. The “appropriate” label is the label expected at the MP for the protected TE LSP. It is worth elaborating on what the expected label is, so let us consider the following two situations.
Situation 1: The backup tunnel is an NHOP backup tunnel, in which case the MP is also the PLR’s NHOP for the protected LSP before the failure occurs. Upon link failure, the PLR must perform the same swap as before the failure (no change in the outgoing label); the MP then receives the same label as before the failure but from a different interface. This is illustrated in Figure 5.14.
In Figure 5.14, an NHOP backup tunnel B1 is set up from R3 to R4, following the path R3-R8-R4 and protecting against a failure of the link R3-R4. The backup label distributed by R8 to R3 is 10, and a penultimate hop popping (PHP) operation is performed between R8 and R4. Once the link failure is detected by the PLR (R3 in this example), for all the protected TE LSPs traversing the link R3-R4, the PLR R3 performs the following operations:
• Swap the label of the protected TE LSP, using the same outgoing label as before the failure
• Push the label corresponding to the NHOP backup tunnel
• Redirect the traffic onto the backup tunnel outgoing interface
Figure 5.15 shows the situation after the link R3-R4 has failed and the PLR R3 has locally rerouted the protected TE LSP T1 onto the NHOP backup tunnel B1.
Figure 5.14 Facility backup: Example of the mode of operation when the node R4 fails and the protected TE LSP T1 is locally rerouted by the PLR R3 onto the NHOP backup tunnel B1.
Figure 5.15 Situation after the failure of the link R3-R4 and the PLR R3 has locally rerouted the protected TE LSP T1 onto the NHOP backup tunnel B1.
The PLR R3 performs the following operations to locally reroute the protected TE LSP T1 onto the NHOP backup tunnel B1: R3 swaps 1 to 2 (as before), pushes the label 10, and redirects the traffic onto B1’s outgoing interface (R3-R8). R4 (the MP) receives a label-switched packet containing the same label as before the failure but from a different interface.

Situation 2: With an NNHOP backup tunnel, the MP is now the PLR’s next-next hop for the protected LSP before the failure. So the PLR must perform a swap such that the MP receives a label-switched packet with the label it expects (but from a different interface). To highlight this mechanism, let us consider the example depicted in Figure 5.16. Remember that at steady state (without any failure) the label swapping operation performed by R3 for the fast-reroutable TE LSP T1 is 1 to 2. In the case depicted in Figure 5.16, the MP R5 expects to receive the label 3 (the label distributed by R5 to R4 for T1). So when the failure of the link R3-R4 or of the node R4 occurs, R3 must swap 1 to 3 (instead of 1 to 2 before the failure), push the label 10, and redirect the traffic onto B1’s outgoing interface (R3-R10). This way, R5 (the MP) receives a packet identical to the one it received before the failure but from a different interface. By default, the PLR does not know the label used between its NHOP LSR and its NNHOP LSR; it only learns from its direct downstream neighbor the label it must use for the TE LSP. An extension to an existing RSVP object (the RRO object) is used to learn the label used between the NHOP and the NNHOP LSR (that signaling extension is described in Section 5.14).
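The two situations can be summarized in a few lines of Python. This is only an illustrative sketch: the names `BackupTunnel` and `reroute_label_stack` are hypothetical and do not correspond to any router implementation, and the label values are those of the T1 example above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BackupTunnel:
    """Hypothetical description of a bypass tunnel as seen by the PLR."""
    name: str
    kind: str           # "NHOP" or "NNHOP"
    push_label: int     # label advertised by the first hop of the backup tunnel
    out_interface: str  # outgoing interface of the backup tunnel on the PLR


def reroute_label_stack(nhop_label: int,
                        nnhop_label: Optional[int],
                        tunnel: BackupTunnel) -> Tuple[List[int], str]:
    """Return (label stack, outgoing interface) after local repair.

    The stack is listed top label first. The inner label is the one the MP
    expects: the usual outgoing label for an NHOP tunnel, or the label used
    between the NHOP and the NNHOP (learned via the RRO) for an NNHOP tunnel.
    """
    if tunnel.kind == "NHOP":
        inner = nhop_label
    elif tunnel.kind == "NNHOP":
        if nnhop_label is None:
            raise ValueError("label expected by the NNHOP has not been learned")
        inner = nnhop_label
    else:
        raise ValueError(f"unknown backup tunnel kind: {tunnel.kind}")
    return [tunnel.push_label, inner], tunnel.out_interface


# T1 at R3: steady-state swap is 1 -> 2; R5 expects label 3.
nhop_case = reroute_label_stack(2, 3, BackupTunnel("B1", "NHOP", 10, "R3-R8"))
nnhop_case = reroute_label_stack(2, 3, BackupTunnel("B1", "NNHOP", 10, "R3-R10"))
```

In the NHOP case the inner label is unchanged (2), whereas in the NNHOP case the PLR substitutes the label expected by R5 (3); in both cases the backup tunnel label 10 sits on top of the stack.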
Figure 5.16 Situation after the failure of the link R3-R4 and the PLR R3 has locally rerouted the protected TE LSP T1 onto the NNHOP backup tunnel B1.
Important notes:

Note 1: An identical operation is performed for every protected LSP rerouted onto the same backup tunnel; indeed, with facility backup, the same backup LSP is used for all the rerouted TE LSPs that intersect the backup tunnel at both the PLR and the MP. This is illustrated in Figure 5.17, which shows two primary tunnels T1 and T2 that followed the paths R1-R3-R4-R5-R6 and R7-R3-R4-R5-R6 before the failure. The labels in use for T1 are 100 (between R1 and R3), 101 (between R3 and R4), 102 (between R4 and R5), and PHP (between R5 and R6); for T2 they are 110 (between R7 and R3), 111 (between R3 and R4), 112 (between R4 and R5), and PHP (between R5 and R6). Because both T1 and T2 traverse R3 and R5, the same NNHOP backup tunnel B1 can be used in the case of a failure of the link R3-R4 or of the node R4. This is of course a very important scaling property of facility backup, which relies on MPLS label stacking. Note also that the same property applies to NHOP backup tunnels.

Note 2: In both cases (NHOP and NNHOP bypass tunnels), no additional RSVP states are created along the backup paths for the rerouted TE LSPs. In other words, the LSRs along the backup path do not "see" the rerouted TE LSPs as far as the control plane is concerned. This is also crucial for the scalability of this solution.
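To make the scaling argument concrete, the following sketch forwards T1 and T2 over the same bypass tunnel. The per-LSP labels (100/102, 110/112) and the backup label 10 come from the example above; the transit LSRs and their label tables (R10 swapping 10 to 11, R11 popping) are assumptions made for illustration only.

```python
# Per-LSP repair state exists only on the PLR (R3):
# incoming label -> label expected by the MP (R5).
PLR_REPAIR = {100: 102, 110: 112}   # T1 and T2
BACKUP_PUSH_LABEL = 10              # label of B1 at the PLR

# Transit label tables along B1 (hypothetical hops and actions).
# Each transit LSR holds a single entry for B1, whatever the number
# of protected TE LSPs rerouted onto it.
TRANSIT = [
    ("R10", {10: ("swap", 11)}),
    ("R11", {11: ("pop", None)}),   # penultimate hop popping
]


def forward_over_backup(in_label):
    """Return the label stack delivered to the MP (top label first)."""
    stack = [BACKUP_PUSH_LABEL, PLR_REPAIR[in_label]]
    for _lsr, table in TRANSIT:
        action, new_label = table[stack[0]]
        if action == "swap":
            stack[0] = new_label
        else:  # "pop"
            stack.pop(0)
    return stack


delivered = {label: forward_over_backup(label) for label in PLR_REPAIR}
transit_entries = sum(len(table) for _lsr, table in TRANSIT)
```

Under these assumptions the MP receives 102 for T1 and 112 for T2, exactly as before the failure, while the transit LSRs only ever look at the top label and therefore keep one entry each.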
Figure 5.17 Illustration of the use of the MPLS stacking property by facility backup: Several protected TE LSPs are rerouted onto a single NNHOP backup tunnel B1 upon R4 node failure.
5.5.5 Properties of a Traffic Engineering LSP

When using MPLS TE local protection, there are three properties a TE LSP can have:
1. Fast Reroute desired
2. Bandwidth protection desired
3. Node protection desired

Fast Reroute desired: Fast Reroute is a technology that can be used for some TE LSPs only (as already stated, such TE LSPs are called fast-reroutable TE LSPs), so if a backup tunnel has been configured on a PLR, only the TE LSPs signaled as "fast reroutable" will be fast rerouted in the case of a failure. Typically, this provides fast recovery using local protection to a subset of TE LSPs having stringent recovery requirements (e.g., the TE LSPs carrying sensitive traffic like VoIP or ATM-over-MPLS), whereas other TE LSPs carrying less sensitive traffic (e.g., Internet traffic) will be rerouted using TE LSP reroute. This obviously requires the ability to explicitly signal this fast-reroutable property of a TE LSP. The details of the signaling aspects are covered in Section 5.15.

Bandwidth protection desired: The notion of bandwidth protection is extensively covered in Section 5.15, but here is a high-level description of this important notion. The previous section described the mode of operation of Fast Reroute for both the facility backup and the one-to-one backup methods. When a TE LSP is signaled, one of its attributes is the bandwidth. A TE LSP is said to be bandwidth protected at a node R only if it can be fast rerouted and the selected backup tunnel offers a bandwidth equivalent to what the TE LSP received along the primary path before the failure. In other words, the TE LSP does not suffer any QoS degradation along the alternate path. Note that the QoS may be a function not just of the bandwidth
but also of the propagation delay or jitter. Section 5.15 details how backup paths can be computed to provide such guarantees. When signaled, a protected TE LSP can explicitly request bandwidth protection.

Node protection desired: In some cases, also further discussed in Section 5.15, it might not be possible for a PLR to find both an NHOP and an NNHOP backup tunnel offering full bandwidth protection. For example, let us consider the simple case of three routers R1, R2, and R3 connected in a row, where the R1-R2 link bandwidth is 20 Mbps and the R2-R3 link bandwidth is 10 Mbps. The PLR may then try to find an NHOP backup tunnel with 20 Mbps of bandwidth and an NNHOP backup tunnel with min(20,10) = 10 Mbps of bandwidth. Suppose that no such NNHOP backup tunnel can be found, but only an NNHOP backup tunnel of 5 Mbps. Then, as new TE LSPs requesting bandwidth protection are signaled, it may happen that no NNHOP backup tunnel offering bandwidth protection can be found for all of them. In this case, having an additional signaled parameter explicitly requesting node protection is desirable and can be used as a tie-breaker. So if the PLR has two requests for bandwidth protection and cannot select an NNHOP backup tunnel for both of them because of insufficient bandwidth on the NNHOP backup tunnel, it can preferably select the NNHOP backup tunnel for the TE LSP that has expressed a desire to get node protection in addition to bandwidth protection. Such a parameter has been standardized in [FAST-REROUTE] and is described in Section 5.15.
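The tie-break can be pictured with a short sketch. It is a simplified, hypothetical selection policy (the names `ProtectionRequest` and `assign_backup` are not taken from any implementation): requests that signal "node protection desired" are served first from the NNHOP backup bandwidth, and requests that do not fit are assumed to fall back to the NHOP backup tunnel.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ProtectionRequest:
    lsp: str
    bandwidth_mbps: float
    node_protection_desired: bool


def assign_backup(requests: List[ProtectionRequest],
                  nnhop_capacity_mbps: float) -> Dict[str, str]:
    """Greedy assignment: node-protection requests get the NNHOP tunnel first."""
    # sorted() is stable, so arrival order is kept within each group.
    ordered = sorted(requests, key=lambda r: not r.node_protection_desired)
    remaining = nnhop_capacity_mbps
    assignment = {}
    for request in ordered:
        if request.bandwidth_mbps <= remaining:
            assignment[request.lsp] = "NNHOP"
            remaining -= request.bandwidth_mbps
        else:
            # Assumption of this sketch: fall back to link protection only.
            assignment[request.lsp] = "NHOP"
    return assignment


# A 5-Mbps NNHOP backup tunnel cannot protect both 4-Mbps TE LSPs.
result = assign_backup(
    [ProtectionRequest("T1", 4, node_protection_desired=False),
     ProtectionRequest("T2", 4, node_protection_desired=True)],
    nnhop_capacity_mbps=5,
)
```

Here T2, which asked for node protection, is placed on the NNHOP backup tunnel even though T1 was listed first.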
Notion of Class of Recovery

The various TE LSP recovery requirements mentioned earlier allow an operator to define multiple classes of recovery (CoRs) and to assign a different CoR to each TE LSP according to its recovery requirements. For instance, very sensitive traffic like voice-over-IP/MPLS or ATM-over-MPLS could be routed over protected TE LSPs with bandwidth and node protection. In the case of a link or node failure, those TE LSPs would be very quickly rerouted while maintaining an equivalent QoS. On the other hand, MPLS VPN traffic could be routed onto protected TE LSPs without bandwidth protection. Finally, the less sensitive traffic could be routed over nonprotected TE LSPs. Defining multiple classes of recovery provides the following two benefits:
• The set of rerouting operations can be prioritized. Indeed, every LSR will preferably start by recovering the TE LSPs that belong to the highest CoR.
• When bandwidth protection is required, some backup capacity must be reserved in the network. With multiple CoRs, the amount of backup capacity is limited to the set of TE LSPs that belong to the CoR for which bandwidth protection is required. This significantly reduces the required backup capacity.
5.5.6 Notification of Tunnel Locally Repaired

As described earlier, upon detection of a link/node failure, the PLR immediately starts rerouting the set of protected TE LSPs over their respective backup tunnels (bypass tunnels or Detour LSPs). This may result in following a suboptimal end-to-end path. Consequently, in addition to performing the local reroute, the PLR sends a specific RSVP Path Error message for each rerouted TE LSP to its head-end LSR to indicate that a local reroute has occurred. This type of RSVP Path Error is sometimes qualified as nondisruptive because no RSVP states are cleared; it serves as a pure indication to the head-end LSR. The receipt of such a message then triggers a reoptimization on the head-end LSR for the affected TE LSP. Indeed, as previously mentioned, MPLS TE Fast Reroute is a temporary network recovery mechanism; the protected TE LSPs are quickly and locally rerouted onto backup tunnels using a local protection technique, but the path followed by the rerouted flows might no longer be optimal. This is illustrated in Figure 5.18.

In Figure 5.18, a protected TE LSP, T1, following the path R0-R1-R2-R8 is set up. At router R1 (the PLR), T1 is protected by an NHOP backup tunnel B1 against a failure of the link R1-R2 (B1 follows the path R1-R3-R4-R5-R2). When the link R1-R2 fails, the PLR (R1), upon detecting the link failure, reroutes the LSP T1 onto B1 and sends a Path Error "tunnel locally repaired" to T1’s head-end LSR (R0). As you can see in Figure 5.18, the path followed by T1 is then not optimal (R0-R1-R3-R4-R5-R2-R8). The receipt of the Path Error triggers a reoptimization on R0, which in turn reroutes the TE LSP T1 along the path R0-R3-R4-R5-R2-R8, which is shorter than the path followed by the rerouted flows during the failure (R0-R1-R3-R4-R5-R2-R8). In this example, we assume that all the links have the
Figure 5.18 Notification of local repair followed by head-end reoptimization.
same metric. Of course, the TE LSP reoptimization should always be performed using the "make before break" procedure, which avoids any traffic disruption. The head-end will also be informed of the link failure via the receipt of an IGP update from one of the routers adjacent to the failed link. Upon receipt of either an RSVP Path Error notify message "tunnel locally repaired" or an IGP update, the head-end triggers a TE LSP reoptimization.
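The reoptimization outcome can be checked with a small shortest-path computation. The link list below is an assumption reconstructed from the paths named in the text (the link R0-R3 is inferred from the reoptimized path, and R6 and R7 are omitted); with equal metrics, a breadth-first search is enough to find the shortest path.

```python
from collections import deque

# Links inferred from the paths quoted in the text (assumed topology).
LINKS = [
    ("R0", "R1"), ("R1", "R2"), ("R2", "R8"),
    ("R1", "R3"), ("R3", "R4"), ("R4", "R5"), ("R5", "R2"),
    ("R0", "R3"),
]


def shortest_path(links, src, dst):
    """Breadth-first search; valid because all links have the same metric."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


before_failure = shortest_path(LINKS, "R0", "R8")
surviving = [link for link in LINKS if link != ("R1", "R2")]
after_reoptimization = shortest_path(surviving, "R0", "R8")
locally_repaired = ["R0", "R1", "R3", "R4", "R5", "R2", "R8"]  # T1 over B1
```

On this assumed topology the reoptimized path has five hops, one fewer than the locally repaired path, which is why the head-end moves T1 once it has been notified.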
Case of a Multiarea (OSPF) or Multilevel (IS-IS) Network

In the case of a multiarea (OSPF), multilevel (IS-IS), or multiautonomous system network, if the failure does not occur in the head-end LSR’s area/level, no IGP notification will be received by the head-end LSR. This means that the head-end LSR relies exclusively on the receipt of the RSVP Path Error message to be informed that a local repair has been performed on a downstream node. Consider the network depicted in Figure 5.19, in which a fast-reroutable interarea TE LSP (T1) is routed from R0 to R4 and spans multiple areas. On R2, an NHOP backup tunnel B1 that follows the path R2-R5-R6-R7-R3 protects any fast-reroutable TE LSP traversing the link R2-R3 from a failure. When the link R2-R3 fails, the TE LSP T1 is rerouted onto the backup tunnel B1, but in this case the head-end LSR R0 does not receive any IGP update. Indeed, the failure occurred in the backbone area, and R0 does not have any visibility of the backbone area topology, so the failure is invisible to R0 (R2 might send a new summary LSA if some addresses are no longer reachable, but generally the address aggregation scheme will be such that no summary LSA will be flooded into the area 0.0.0.1). Because the RSVP Path Error notify message is the only mechanism allowing the head-end LSR to be informed of a local repair that occurred on a downstream node outside the head-end area, a best common practice consists of sending the RSVP Path Error message in reliable mode.
Figure 5.19 Notification of local repair followed by head-end reoptimization in a multiarea routing domain.
5.5.7 Signaling Extensions for MPLS Traffic Engineering Local Protection

In contrast to MPLS global default protection and MPLS TE global protection, which do not require any signaling protocol extensions beyond those of RSVP-TE defined in [RSVP-TE] for the signaling of MPLS TE LSPs, MPLS TE local protection (Fast Reroute) requires several signaling extensions. Although they are undoubtedly important, a detailed understanding of them is not a prerequisite to grasp how local protection works. Consequently, the signaling aspects of Fast Reroute are covered in detail in Section 5.14.
5.5.8 Two Strategies for Deploying MPLS Traffic Engineering for Fast Recovery

As mentioned in Section 5.1, there might be several motivations for deploying MPLS TE:
• Bandwidth optimization, so that the network resources are used in a more efficient way. This also helps in providing better QoS.
• Providing strict QoS guarantees to some specific traffic flows.
• Fast recovery.

In some networks, there might be an interest in MPLS TE for its fast recovery property only. In other words, bandwidth optimization and/or strict QoS guarantees are not required, but the operator would like to benefit from the fast recovery property of Fast Reroute without tuning its IGP parameters as described in Chapter 4. This section proposes two strategies for deploying MPLS TE when the only objective is to get fast recovery by using Fast Reroute. For instance, consider an underutilized (or overprovisioned) network. Such a network does not require any bandwidth optimization because it is not congested. Also, depending on the network load, QoS guarantees could rely on the simple assumption that no link is congested and that the link loads are very low. In such a situation, MPLS TE is not required for traffic engineering purposes, and the paths computed by the routing protocol are perfectly satisfactory. However, such a network may require fast recovery from link or node failures, making Fast Reroute a good candidate. Because Fast Reroute requires TE LSPs, the solution includes deploying TE LSPs but in a quite specific way, which we describe in this section. There are two strategies for deploying MPLS TE when the sole objective of the operator is to use Fast Reroute:
1. With a full mesh of unconstrained TE LSPs
2. With one-hop unconstrained TE LSPs
Network Design with a Full Mesh of Unconstrained TE LSPs

A simple and efficient strategy is to deploy a full mesh of unconstrained TE LSPs. An unconstrained TE LSP is an LSP without any constraint. For instance,
the required bandwidth is 0, and no affinities are defined. The only property of such a TE LSP is to be fast reroutable. Indeed, the objective is not to use the traffic engineering property of MPLS TE (in the sense of ‘‘traffic engineer’’ the flows across the network). So the available bandwidth and other TE link–related information are still flooded by the IGP TE extensions but will never change. When a head-end LSR computes a path for an unconstrained TE LSP, the same CSPF algorithm is used as with any other TE LSP, but the obvious outcome is that the TE LSP will follow the IGP shortest path. In other words, the traffic routed onto unconstrained TE LSPs will follow the same paths as IP routed traffic, but in the case of link and/or node failures, fast-reroutable TE LSPs will be rerouted by MPLS TE Fast Reroute, which was the initial objective.
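The claim that an unconstrained TE LSP follows the IGP shortest path can be illustrated with a minimal CSPF sketch: links that cannot satisfy the bandwidth constraint are pruned, and an ordinary shortest-path computation runs on what remains. The five-node topology, metrics, and bandwidth figures below are invented for illustration, and real CSPF implementations handle further constraints (affinities, priorities, tie-breaking rules) that are ignored here.

```python
import heapq


def cspf(links, src, dst, bandwidth=0):
    """Return (cost, path) of the shortest path meeting the bandwidth constraint.

    links: iterable of (node_a, node_b, metric, available_bandwidth),
    each treated as bidirectional. Returns None if no path qualifies.
    """
    adjacency = {}
    for a, b, metric, available in links:
        if available >= bandwidth:          # pruning step of CSPF
            adjacency.setdefault(a, []).append((b, metric))
            adjacency.setdefault(b, []).append((a, metric))
    heap = [(0, src, [src])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, metric in adjacency.get(node, []):
            if neighbor not in settled:
                heapq.heappush(heap, (cost + metric, neighbor, path + [neighbor]))
    return None


# Hypothetical topology: a short low-bandwidth path and a longer high-bandwidth one.
TOPOLOGY = [
    ("A", "B", 1, 10), ("B", "C", 1, 10),
    ("A", "D", 1, 100), ("D", "E", 1, 100), ("E", "C", 1, 100),
]

unconstrained = cspf(TOPOLOGY, "A", "C", bandwidth=0)   # nothing is pruned
constrained = cspf(TOPOLOGY, "A", "C", bandwidth=50)    # A-B and B-C are pruned
```

With a required bandwidth of 0 no link is pruned, so the result is simply the shortest path that the IGP would compute; only a nonzero constraint moves the TE LSP onto the longer path.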
Network Design with Unconstrained One-Hop TE LSPs

If the requirement is to use Fast Reroute for link protection only, then exactly one primary unconstrained TE LSP plus one NHOP backup tunnel are required for every link to protect. The idea is to set up a one-hop tunnel following the same path as the link to protect. One way of achieving this is to set up an unconstrained TE LSP, so that the CSPF algorithm simply follows the most direct path between the head-end LSR and the tail-end LSR (the next hop of the head-end LSR in this case). Note that in this case the PLR node is also the head-end LSR. The one-hop primary TE LSP must then be configured so that all the traffic follows the TE LSP. It is important to note that because the TE LSP is a one-hop LSP, if PHP is used, no label is added once the traffic is routed over the primary TE LSP. Such a strategy is depicted in Figure 5.20.

In the example shown in Figure 5.20, the objective is to protect the link R2-R3. So a single-hop tunnel (T1) is configured from R2 to R3, and all the traffic crossing this link is routed onto this one-hop primary TE LSP. T1 has no constraint, so this TE LSP follows the path R2-R3. An NHOP backup tunnel B1 is configured between R2 and R3 with the constraint of being diversely routed from the protected link; it follows the path R2-R8-R3. As discussed in Section 5.15, additional constraints may be added to also provide bandwidth protection. In the case of a failure of the link R2-R3, the PLR (R2) triggers Fast Reroute, and all the traffic that used to be routed over the link R2-R3 is rerouted over B1, following the path R2-R8-R3. Then the primary TE LSP T1 is rerouted (reoptimized) and follows the new shortest path between R2 and R3. Finally, the routing protocol is informed of the link failure and recomputes a new path, which may or may not follow B1’s path. The same configuration has to be repeated for each link to protect using Fast Reroute.

Note that if the link R2-R3 is protected using SONET/SDH protected VCs, Fast Reroute may also be used to protect against a router interface failure on the R2 or R3 side. In that case, one must ensure that both mechanisms are not
Figure 5.20 Deploying MPLS TE Fast Reroute with one-hop tunnel to protect against link failure.
simultaneously triggered. This aspect is covered in Chapter 6. To alleviate the configuration burden, existing implementations support mechanisms that automate the creation of both the primary and the backup TE LSP, which is possible because in this case their set of attributes is known in advance. The only constraint of the backup tunnel is to be diversely routed from the link to protect (some implementations support the computation of SRLG-diverse paths).

Protection against link and node failures: To guard against both link and node failures, a similar approach is followed, with the only difference that at each hop, both one unconstrained TE LSP and one NNHOP backup tunnel per next-next hop must be configured. Why are one primary and one NNHOP backup tunnel required per NNHOP? Let us consider the example in Figure 5.21. In the case of a node failure of R3, all the traffic traversing the protected LSR needs to be rerouted onto some appropriate backup tunnels. That requires setting up one primary TE LSP for each possible traffic path traversing the protected node. In Figure 5.21, there are three paths leaving R2 that traverse the node R3 to consider: R2-R3-R4, R2-R3-R7, and R2-R3-R8. So three unconstrained TE LSPs are configured and set up on R2: T1, T2, and T3. Because each of these tunnels needs to be rerouted over a diversely routed backup tunnel, three NNHOP backup tunnels are configured: B1 protecting the traffic following the path R2-R3-R4 and routed onto the tunnel T1, B2 protecting T2, and finally B3 protecting T3. As in the case of link protection, the protected TE LSPs are unconstrained and follow the shortest IGP path. This explains the requirement for one unconstrained TE LSP and one backup tunnel per NNHOP. In the previous example, the number of NNHOPs of R2 via R3 is equal to 3: R4, R7, and R8.
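The number of primary/backup pairs needed at a PLR can be derived mechanically from the adjacency of the protected node. The sketch below assumes that R3's neighbors are exactly those named in the text (R2, R4, R7, R8); the function name and the ordering of the T and B names are illustrative assumptions.

```python
# Neighbors of the protected node, as inferred from the three paths in the text.
ADJACENCY = {
    "R3": ["R2", "R4", "R7", "R8"],
}


def plan_node_protection(adjacency, plr, protected_node):
    """List one (primary, two-hop path, backup) triple per next-next hop."""
    nnhops = [n for n in adjacency[protected_node] if n != plr]
    plan = []
    for index, nnhop in enumerate(nnhops, start=1):
        path = "-".join([plr, protected_node, nnhop])
        plan.append((f"T{index}", path, f"B{index}"))
    return plan


plan = plan_node_protection(ADJACENCY, plr="R2", protected_node="R3")
for primary, path, backup in plan:
    print(f"{primary} follows {path} and is protected by NNHOP backup tunnel {backup}")
```

The plan contains three entries, one per NNHOP of R2 through R3, matching the three primary TE LSPs and three NNHOP backup tunnels of the example.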
Figure 5.21 Deploying MPLS TE Fast Reroute with two-hop tunnel to protect against both link and node failures.
Important note: In contrast to the previous case of link protection, the traffic must start flowing onto the primary two-hop tunnels only when the failure occurs.
Comparison of Both Approaches

Both the "unconstrained full mesh TE LSPs" approach and the "unconstrained one-hop TE LSPs" approach can be used, and each has its pros and cons. The one-hop approach clearly has the advantage of requiring the configuration and setup of a very limited number of TE LSPs. If only link protection is required, just two TE LSPs are needed for every link to protect with Fast Reroute: the primary one-hop TE LSP and an NHOP backup tunnel diversely routed from the link to protect. If node protection is required, one pair of TE LSPs (primary and backup) is needed for every next-next hop, as described earlier, which is still a very reasonable number. Note that at the time of publication, commercial implementations support only the one-hop unconstrained approach. Moreover, some implementations ease the configuration process by requiring very few commands to automate the configuration of such primary and backup TE LSPs. On the other hand, the unconstrained full mesh TE LSPs approach offers a very easy migration path to the use of MPLS TE for other purposes like bandwidth optimization and strict QoS guarantees. Indeed, if one of those requirements appears at some point, the operator just needs to set constraints on the TE LSPs. For instance, bandwidth can be configured, and the TE LSPs will then start using alternate paths if required. In terms of existing implementations, some solutions are available that automate the configuration process when setting up a full mesh of TE LSPs. In a nutshell, those solutions rely on several components:
• A discovery process is in charge of discovering the members of a mesh. In some MPLS TE networks, there might be multiple TE LSP meshes: one mesh of TE LSPs between LSRs acting as VoIP gateways, for instance, and
another full mesh of TE LSPs between routers carrying the Internet traffic. Each TE mesh has its own set of characteristics in terms of bandwidth, priority, and protection/restoration, to mention just a few requirements. Then each router uses an IGP extension (OSPF or IS-IS) to advertise that it is a member of one or multiple TE meshes. This mechanism allows every router to discover all the other routers that belong to the same TE mesh. Then, once a router has discovered all the routers that belong to the same mesh, it can use a ‘‘template’’ (where the constraints specific to that particular mesh are locally specified) to set up the mesh of TE LSPs. Note that in this particular context of using MPLS TE for fast recovery only, the template is very restricted because the primary TE LSPs are unconstrained.
In terms of IGP, both methods are equivalent. The TE-related information is flooded by the IGP but will never change because the TE LSPs are unconstrained and never reserve bandwidth.
5.6 Another MPLS Traffic Engineering Recovery Alternative

Another MPLS TE recovery alternative has been proposed but never got any traction in the industry because of severe limitations: 1+1 packet protection, whose principle is to permanently bridge the IP/MPLS traffic over two diversely routed TE LSPs. The traffic bridging is performed on the head-end LSR, and the decision to switch the traffic is made by the tail-end LSR, which permanently compares the two identical flows received from the primary and secondary TE LSPs. When a failure occurs in the network, the traffic received from one of the TE LSPs is affected. Once the tail-end LSR detects the failure, it switches to the secondary TE LSP. Note that such a mechanism is also called a single-ended protocol because the switching decision is made by a single entity (the tail-end LSR in this case) without requiring any signaling exchange between the nodes. A failure may be a traffic interruption, an unacceptable error rate, or any other kind of defect. Once the tail-end LSR has performed the switch, it can either stay indefinitely on this TE LSP, reverting to the other TE LSP (once restored) only if the currently selected TE LSP fails, or switch back to the original TE LSP as soon as it is restored. Although this mechanism is simple and efficient in terms of recovery time, it has two major drawbacks that drastically limit its applicability:
• The amount of traffic forwarded in the network is doubled for each TE LSP protected with this 1+1 mechanism. This is a serious issue because it basically implies a bandwidth wastage of at least 50% (see footnote 66).

Footnote 66: This technique implies a bandwidth wastage of at least 50% because one of the constraints of the backup TE LSP is to be disjoint from the protected TE LSP, which usually means that it will follow a longer path.
• The failure discovery at the tail-end LSR usually requires some hardware changes and thus equipment replacement, which can also be expensive.

For those reasons, such a mechanism has never been implemented or deployed; it is mentioned here only for the sake of completeness in describing MPLS TE recovery techniques.
5.7 Load Balancing

Load balancing is a technique to forward the traffic from a source to a destination across multiple paths. With equal load balancing, the traffic is balanced across multiple equal-cost paths. IGPs such as OSPF or IS-IS perform equal load balancing. This can be done on a per-packet basis (packets are sent along N equal-cost paths using a round-robin algorithm) or via more sophisticated techniques that avoid packet reordering, as described in Chapter 4. With MPLS TE, both equal and unequal load balancing are supported. For instance, if there are two TE LSPs, T1 and T2, between two LSRs, LSR1 and LSR2, with respective bandwidths Bw1 and Bw2, then LSR1 can decide to balance the traffic whose destination is LSR2 (or beyond) in proportion to Bw1 and Bw2. Usually, load balancing in MPLS TE–enabled networks is used when a single path obeying the set of constraints cannot be found between two LSRs. For instance, a TE LSP of B Mbps is required, and no path with the required amount of bandwidth is available. The solution is then to set up N LSPs such that the sum of their bandwidths is equal to B. Another constraint, such as path diversity (the sets of network elements traversed by the TE LSPs are disjoint), can be added when the paths of the N LSPs are computed.

Strictly speaking, load balancing is not an MPLS TE recovery technique, so why dedicate a section to it? Because a positive side effect of load balancing is that when the flow between two points is balanced across multiple paths, the probability of simultaneous failure of all those paths is reduced compared to a single path, especially if those paths are explicitly diversely routed. Hence, the overall availability is increased. This property has been used by some operators to reduce the impact of network failures on specific flows. Let us illustrate that statement through the example in Figure 5.22.
In this case, strictly speaking, the network availability is not increased but the impact of a network element failure on the traffic flows between two points is reduced. In Figure 5.22, the two POPs of Sevilla and Barcelona are made of two VoIP gateways and one LSR connected to the core of the network. The VoIP traffic is carried onto TE LSPs. In this case, even if all the traffic between LSR1 and LSR2 could be carried onto a single TE LSP, two diversely routed TE LSPs are established between LSR1 and LSR2 (with the same bandwidth or different bandwidths) and the traffic is balanced onto those two TE LSPs. In the case of a network failure,
Figure 5.22 An example of MPLS TE load balancing.
just a proportion of the traffic between LSR1 and LSR2 is affected (the traffic carried onto T1 in the previous example). That said, we must admit that such a design choice has the two following drawbacks:
• The number of states in the network is noticeably increased: at least two TE LSPs are required between two LSRs.
• The constraint of computing diverse paths may result in nonoptimal paths compared to a single TE LSP.

On the other hand, the impact of a single element failure on the voice traffic between the two POPs is reduced.

Note: One can, for example, increase the capacity of each TE LSP so that the remaining TE LSPs can absorb the excess traffic resulting from the failure of one of them. For instance, if N TE LSPs of B Mbps are set up between two routers R1 and R2 (let us call it a bundle of N TE LSPs), allocating B * N/(N-1) Mbps to each of them instead of B Mbps allows the bundle to survive the failure of one of them. In this case, the backup capacity reserved in the network is strictly equal to the capacity of one TE LSP in the bundle.
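The arithmetic of this section can be verified with a few lines of Python. The function names and the numeric values (a bundle of 4 TE LSPs of 30 Mbps, and 90 Mbps of traffic split over TE LSPs of 20 and 10 Mbps) are illustrative assumptions, not figures from the text.

```python
def split_by_bandwidth(traffic_mbps, bandwidths_mbps):
    """Unequal load balancing: share traffic in proportion to each TE LSP bandwidth."""
    total = sum(bandwidths_mbps)
    return [traffic_mbps * bw / total for bw in bandwidths_mbps]


def bundle_sizing(n, b_mbps):
    """Size a bundle of n TE LSPs so that it survives the loss of one of them.

    Returns (bandwidth per TE LSP, backup capacity reserved in the network).
    """
    if n < 2:
        raise ValueError("a bundle needs at least two TE LSPs")
    per_lsp = b_mbps * n / (n - 1)      # B * N / (N - 1)
    demand = n * b_mbps                 # traffic actually carried
    reserved = n * per_lsp - demand     # extra capacity set aside
    return per_lsp, reserved


shares = split_by_bandwidth(90, [20, 10])
per_lsp, reserved = bundle_sizing(4, 30)
```

With these values each TE LSP of the bundle is sized at 40 Mbps; if one fails, the three survivors offer 120 Mbps, which is exactly the original demand, and the reserved backup capacity (40 Mbps) equals the capacity of one TE LSP in the bundle.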
5.8 Comparison of Global and Local Protection

As previously described in Chapter 1, the evaluation of a recovery mechanism requires the consideration of several parameters: scope of recovery (link, node, SRLG), recovery time, guaranteed bandwidth, backup capacity requirements, state overhead, scalability, reordering, additive latency and jitter, signaling requirements, stability, and others. Throughout this chapter, we saw several MPLS TE recovery techniques, so the natural question that comes to mind is which one to use. Although there is no unique answer because each network has its own constraints and requirements, the aim of this section is to compare the global protection and local protection techniques with a particular focus on three key performance aspects:
• The recovery time
• The state overhead, which is directly correlated to the scalability
• The ability to perform bandwidth sharing when bandwidth protection is required
5.8.1 Recovery Time

With global protection, rerouting is performed by the head-end LSR, which must therefore receive the failure notification before it can reroute the affected TE LSPs onto their respective backup paths (which have been precomputed and signaled). So in terms of recovery time, the delta between global and local protection is the propagation time of the failure indication signal to the head-end LSR. How large this delta is depends heavily on the network characteristics. For instance, a network confined to a small country generally implies short propagation delays (less than 10 ms); on the other hand, an international network may easily experience much longer propagation delays, up to a few hundred milliseconds. In that case, convergence within a few tens of milliseconds requires the use of local protection techniques. Furthermore, the queuing delays experienced by the control plane notification (RSVP and/or IGP) messages can be reduced via the use of QoS mechanisms. Note that in terms of recovery time, the two local protection schemes described earlier (i.e., "one-to-one" and "facility backup") are equivalent; they both rely on local protection, where fast-reroutable TE LSPs are locally rerouted onto presignaled backup tunnels and then reoptimized by their respective head-end LSRs. In summary, as far as the recovery time is concerned, the key difference between local and global protection is the failure notification propagation time to the head-end LSR, which, in the case of global protection, is made of incompressible propagation delays and of queuing delays that can be reduced by means of QoS mechanisms.
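A back-of-the-envelope comparison can be sketched as follows. The model is deliberately simplistic and all numbers are assumptions: propagation in optical fiber is taken as roughly 5 microseconds per kilometer, and the detection, queuing, and switchover times are placeholder values rather than measurements.

```python
FIBER_DELAY_MS_PER_KM = 0.005  # about 5 microseconds per km (assumed)


def notification_delay_ms(distance_km, queuing_ms=0.0):
    """Time for the failure indication to reach the head-end LSR."""
    return distance_km * FIBER_DELAY_MS_PER_KM + queuing_ms


def recovery_time_ms(detection_ms, switchover_ms, notification_ms=0.0):
    """Local protection has no notification term; global protection does."""
    return detection_ms + notification_ms + switchover_ms


# Placeholder values: 10 ms detection, 5 ms switchover, 2 ms queuing.
local = recovery_time_ms(detection_ms=10, switchover_ms=5)
global_small_network = recovery_time_ms(
    10, 5, notification_delay_ms(distance_km=500, queuing_ms=2))
global_large_network = recovery_time_ms(
    10, 5, notification_delay_ms(distance_km=10000, queuing_ms=2))
```

Under these assumptions the two schemes differ only by the notification term, which stays small over 500 km (4.5 ms) but grows to 52 ms over 10,000 km of fiber, before counting any additional per-hop processing.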
5.8.2 Scalability
Scalability is undoubtedly one of the major aspects to consider when evaluating a recovery mechanism, and in that respect, global path protection, one-to-one backup, and facility backup local protection differ very significantly.
Scalability is a relatively generic term that requires clarification in this context. Protection mechanisms require setting up backup tunnels before any failure to provide fast convergence (by contrast with global default restoration, the backup path is already computed and signaled). The configuration of backup tunnels can be facilitated via an automatic process, but setting up backup tunnels in a network is not entirely cost free. Although the scalability of RSVP is very high, in large networks the number of backup tunnels can be significant, as shown below, which may require routers to handle a large number of states. Moreover, troubleshooting becomes more complicated for the team in charge of operating the network. So in this context, scalability is considered in terms of the number of required backup tunnels.
Let us evaluate the number of required backup tunnels with global path protection, Fast Reroute facility backup, and one-to-one backup, based on the following notation:
• D: network diameter (average number of hops between a head-end LSR and a tail-end LSR)
• C: degree of connectivity (average number of neighbors)
• L: total number of links to be protected with Fast Reroute (see footnote 67)
• N: total number of nodes (LSRs)
• T: total number of protected TE LSPs in the MPLS network
• $B_u$: number of backup tunnels required
• K: number of classes of recovery (CoR). As mentioned in Section 5.5.5, there might be several classes of TE LSPs, each requiring a different CoR; in this case, each CoR has a dedicated set of backup tunnels.
• S: average number of splits. As discussed in Section 5.15, in some cases where bandwidth protection is required and backup bandwidth is a very scarce resource, more than one backup tunnel per protected link/node may be required if a single backup tunnel with enough bandwidth cannot be found.
• M: number of meshes in the network (e.g., there may be multiple meshes of TE LSPs in a network serving different purposes: one mesh for the voice traffic and one mesh for the data traffic)
Note: Realistic assumptions for S and K are as follows:
• $S < 4$: S will very rarely exceed 3. In a network where bandwidth protection is required but backup capacity is not a very scarce resource, $S = 1$. If bandwidth protection is not required, then $S = 1$.
• $K < 3$.
It follows that
→ $L \le N \times C$ (because some links may not be protected by Fast Reroute)
→ $T = M \times N \times (N - 1)$ (assuming a full mesh TE deployment)
67. Some links may be protected via other means, such as SONET/SDH and optical protection/restoration.
Let us now compute the total number of required backup tunnels $B_u$ with global path protection, Fast Reroute one-to-one, and facility backup.
1. Computation of $B_u$ with global path protection
The number of backup tunnels is equal to the number of primary TE LSPs:
$B_u = T = M \times N \times (N - 1)$
One has to keep in mind that the number of backup TE LSPs grows proportionally with the number of primary TE LSPs and as the square of the number of LSRs in a full mesh scenario. This can have a nonnegligible impact on the overall network scalability. Consider a single full mesh of 200 LSRs: The total number of primary TE LSPs in the network will be $199 \times 200 = 39{,}800$. Using path protection in this context doubles the number of TE LSPs, which gives a total of 79,600 TE LSPs. This has nonnegligible consequences on state overhead: Every head-end LSR will see the number of TE LSPs it has to manage doubled. However, this is not a major concern, because the total number of TE LSPs on every head-end LSR is generally limited (equal to the number of LSRs in every mesh to which the head-end LSR belongs). On the other hand, especially in networks sparsely interconnected from a layer 1/layer 2 perspective, the total number of TE LSPs per midpoint LSR can be substantial. Consider the example of the network depicted in Figure 5.23. This simple network is made of two levels of hierarchy:
• A high-speed core backbone with high-capacity LSRs interconnected by high-speed links (OC48, OC192)
• An edge layer with a large number of smaller LSRs connected (or dual connected) to the high-speed core via medium-speed links
The edge LSRs are fully meshed with each other (for the sake of readability, just the TE LSPs from R1, R2, and R3 to R4 are represented). Observe the number of TE LSPs traversing the high-speed core LSRs. This example shows that the number of TE LSPs per midpoint LSR can be quite high in such a network, and that the proportion of TE LSPs passing through those high-speed nodes can be substantial in comparison with the total number of TE LSPs in the network. In typical existing networks, this proportion can be as high as 20% to 30% at steady state. In the case of failure of a high-speed core link or LSR, it would increase even further. The scalability impact can be characterized through various aspects:
• Memory impact on the midpoint LSR: Each TE LSP requires some memory to handle the RSVP states.
• State refresh: RSVP is a soft-state protocol, so the RSVP states of each TE LSP must be refreshed by exchanging RSVP Path and Resv messages at regular intervals between neighbors. Note that the impact of TE LSP refresh can be drastically reduced using two methods:
  - Refresh reduction: This mechanism, described in [REFRESH-REDUCTION], consists of using specific messages (SREFRESH) so that an LSR sends a single message to its neighbor to refresh a large set of TE LSPs, instead of sending an individual RSVP Path message per TE LSP.
  - Refresh interval: The RSVP refresh frequency can also be decreased; in this case, other liveness mechanisms like RSVP hellos (see Section 5.10) can be used.
• Recovery time on the midpoint LSR: When local recovery mechanisms are used on the midpoint LSRs, the number of TE LSPs to reroute may have an impact on the recovery time.
Figure 5.23 State overhead with MPLS traffic engineering path protection. [Figure labels: Primary TE LSP, Backup TE LSP, Edge LSR (R1 to R5), High Speed Core LSR, High Number of Mid-Point TE LSPs on Core LSRs.]
2. Computation of $B_u$ with local protection: facility backup
Situation 1: If just links are protected with Fast Reroute, then
$B_u = L \times K \times S$
Situation 2: If both links and nodes are protected with Fast Reroute, then
$B_u = L \times K \times S + N \times C \times (C - 1) \times K \times S = (L + N \times C \times (C - 1)) \times K \times S$
3. Computation of $B_u$ with local protection: one-to-one backup
Without merging,
$B_u \approx T \times D = M \times N \times (N - 1) \times D$
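The three formulas can be written down as small functions to make it easier to try different network sizes. This is a minimal Python sketch; the function and parameter names are illustrative choices that mirror the symbols N, M, C, K, S, L, and D defined earlier, and the sample values are only one possible set of assumptions.

```python
def primary_lsps(n: int, m: int) -> int:
    """T = M * N * (N - 1): full mesh of TE LSPs between N LSRs, M meshes."""
    return m * n * (n - 1)


def bu_global(n: int, m: int) -> int:
    """Global path protection: one backup TE LSP per primary TE LSP."""
    return primary_lsps(n, m)


def bu_facility(n: int, c: int, k: int, s: int,
                links=None, protect_nodes: bool = True) -> int:
    """Facility backup: (L + N * C * (C - 1)) * K * S, or L * K * S for links only."""
    l = n * c if links is None else links            # default: every link protected
    nnhop = n * c * (c - 1) if protect_nodes else 0  # next-next-hop backup tunnels
    return (l + nnhop) * k * s


def bu_one_to_one(n: int, m: int, d: int) -> int:
    """One-to-one backup without merging: roughly T * D Detour LSPs."""
    return primary_lsps(n, m) * d


# Sample values: N = 50, M = 2, C = 4, K = 2, S = 2, D = 5.
print(bu_global(50, 2), bu_facility(50, 4, 2, 2), bu_one_to_one(50, 2, 5))
# 4900 3200 24500
```

Only the facility backup count is independent of the number of TE LSPs: it depends on the topology (N, C, L) and on K and S, whereas the other two grow with N * (N - 1).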
Now that we have the theoretical formulas, let us make a few (realistic) assumptions that will help quantify the scalability impact. We consider a fully meshed network with the following characteristics:
• D (diameter) = 5
• C (degree of connectivity) = 4
• M (number of meshes) = 2 (one mesh for voice and one mesh for data traffic)
• K = 2 (two classes of recovery: one for voice with bandwidth protection and one for data without bandwidth protection)
• S = 2 (on average, two backup tunnels are necessary to get the required backup bandwidth between a PLR and an MP)
• All links must be protected by Fast Reroute: $L = N \times C$
Let us now compare $B_u$ for global path protection, Fast Reroute one-to-one, and facility backup, using the previous formulas:
Global path protection: $B_u = M \times N \times (N - 1) = 2 \times N \times (N - 1)$
Local protection, facility backup (node protection): $B_u = (N \times C + N \times C \times (C - 1)) \times K \times S = N \times C^2 \times K \times S = 64 \times N$
Local protection, one-to-one backup: $B_u = M \times N \times (N - 1) \times D = 10 \times N \times (N - 1)$
Figure 5.24 shows the value of $B_u$ for the three MPLS recovery methods as a function of the number of nodes in the network (from 10 to 50 nodes and from 10 to 150 nodes). Figure 5.24 clearly shows that both global path protection and Fast Reroute one-to-one backup scale poorly in large environments. The number of backup tunnels per midpoint LSR can rapidly cause some scalability issues. Indeed, in a full mesh network of very reasonable size (50 nodes), with the assumptions made above, the total number of primary TE LSPs is 4900, and the number of backup tunnels with each MPLS TE recovery technique is as follows:
• With global path protection: 4900
• With local protection facility backup: 3200
• With local protection one-to-one backup: 24,500
Although merging of Detour LSPs can help reduce the number of backup tunnels, their number stays very high in large networks.
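As a short worked check under the same assumptions (M = 2, C = 4, K = 2, S = 2, D = 5, all links protected), the closed forms give the network size beyond which facility backup needs fewer backup tunnels than global path protection, and the constant ratio between one-to-one backup and global path protection:

```latex
\begin{align*}
B_u^{\text{facility}} < B_u^{\text{global}}
  &\iff 64N < 2N(N-1) \\
  &\iff 32 < N - 1 \\
  &\iff N > 33, \\[4pt]
\frac{B_u^{\text{one-to-one}}}{B_u^{\text{global}}}
  &= \frac{10N(N-1)}{2N(N-1)} = 5 = D .
\end{align*}
```

With these particular values, the two counts are equal at N = 33 (2112 backup tunnels each); for larger networks the facility backup count grows linearly in N, whereas the other two grow quadratically. A different choice of C, K, S, or M moves the crossover point.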
Figure 5.24 Comparison of the number of required backup tunnels with global protection and local protection (facility backup and one-to-one backup). [Two plots of the number of backup tunnels Bu versus the number of nodes, one from 10 to 50 nodes and one from 10 to 150 nodes, each with three curves: Bu (Global Protection), Bu (Facility Backup), and Bu (One to One Backup). Assumptions: Diameter = 5, Degree of Connectivity = 4, Number of Meshes = 2, Number of Splits = 2, Number of Class of Recovery = 2.]
5.8.3 Bandwidth Sharing Capability
The last criterion we want to evaluate in this comparison is the ability to perform bandwidth sharing with both global and local protection. To be cost-effective, the backup capacity (bandwidth reserved for backup tunnels) should of course be minimized. Section 5.15 will show that this goal can be efficiently met thanks to the property of bandwidth sharing under the single failure assumption. Drawing a general conclusion on the respective efficiency of global and local protection as far as bandwidth sharing is concerned is almost impossible, because their relative performance is highly driven by the algorithms in place and, even more importantly, by the network topology. So the objective of this section is to provide some general facts about each of them with respect to the bandwidth sharing capability.
Global Path Protection
Performing bandwidth sharing between backup paths protecting independent resources is of course possible. By contrast with local protection, a path completely diverse/disjoint from the primary TE LSP must be found, as opposed to protecting a single local resource; on the other hand, the level of granularity is higher (a TE LSP is protected instead of a link or a node) and the backup capacity can be spread through the entire network. One of the major constraints with global path protection is that it requires an off-line computation for both the primary and the backup TE LSP when the objective is to achieve optimal bandwidth sharing. As already pointed out, with MPLS TE, the TE LSP path computation can be performed either by an off-line tool or in a distributed fashion. If TE LSP primary
path computation is done off-line, the tool can find a TE LSP placement satisfying the set of constraints while trying to compute the respective backup paths whose placement maximizes the degree of bandwidth sharing. Although this problem is clearly NP-complete, sophisticated algorithms have been proposed along with a large set of heuristics to achieve that goal. On the other hand, if primary TE LSP path computation is done in a distributed fashion, trying to achieve bandwidth sharing between backup paths protecting independent resources would require some synchronization between every head-end LSR, which is by default not the case with a distributed computation. This would require very extensive signaling extensions and control plane overhead (signaling and routing), which makes this option virtually impossible and certainly not desirable. The bottom line is that if one decides to use global path protection with bandwidth guarantee and requires minimizing the backup capacity via bandwidth sharing, the only possible option is to compute both the primary and the backup path by using an external off-line tool. Also, as the set of constraints is quite strict (primary and backup paths are computed simultaneously), this sort of solution is generally not very flexible. Indeed, a change in bandwidth requirement for a specific subset of TE LSPs may result in a relatively large set of changes on other primary and backup TE LSPs. Algorithms may try to minimize the set of changes (this is known as the minimal perturbation problem), but this is not always possible. For instance, suppose a set of 5000 TE LSPs with their corresponding backup tunnels, so a total of 10,000 TE LSPs. All the TE LSPs are up and running in the network. After some time, the operator requests more bandwidth for a few TE LSPs (because of traffic growth in a specific region of the network). The bandwidth increase of those few TE LSPs may require displacing a significant number of other TE LSPs, especially if bandwidth is scarce. Moreover, path protection adds the constraint that diverse paths must be found while achieving optimal bandwidth sharing; this additional constraint certainly amplifies the phenomenon and the risk of ending up with significant changes, which represents a nonnegligible constraint in terms of network operations.
Local Protection: “Facility Backup” and “One-to-One Backup”
As shown in Section 5.15, bandwidth sharing is perfectly achievable with local protection facility backup using either a centralized or a distributed backup path computation model. The performance in terms of bandwidth sharing depends on the efficiency of the path computation algorithm. To give some rough estimates, the numbers obtained on several large networks using some off-line backup tunnel path computation tools showed a degree of bandwidth sharing of up to 5; in other words, the sum of the bandwidth of the backup tunnels on the links was on average five times the actual backup capacity, thanks to the single failure assumption, which allows a high degree of bandwidth sharing. This means that the use of the independent CSPF-based model described in Section 5.15 would have required five times more backup bandwidth on each link. This degree of efficiency is of course a function of various aspects: the network topology (degree of connectivity, number
of SRLGs, elements to protect, and their protected bandwidth, to mention a few of them), backup bandwidth capacity, and in particular the efficiency of the backup tunnel path computation algorithm. By contrast, performing bandwidth sharing is very difficult with one-to-one backup. Indeed, an individual backup tunnel is set up for each individual protected TE LSP. Bandwidth can be shared via merging but not between backup tunnels protecting independent resources. Sharing between such tunnels would require very extensive signaling and routing overhead, as well as synchronization between various PLRs, which would increase the scalability impact even more. With facility backup, once the backup tunnel paths have been computed, no new backup tunnel path computation needs to be triggered (if a bandwidth pool is protected) each time a new TE LSP is set up or torn down; a facility (like a link or a node) can be protected by a set of backup tunnels regardless of the set of TE LSPs actually traversing the protected resource.
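The effect of the single failure assumption on one link can be sketched in a few lines of Python. The resource names and bandwidth values below are invented for illustration, and the model assumes that the protected resources are truly independent (for instance, that they do not share an SRLG); it is a sketch of the principle, not of any particular path computation tool.

```python
from collections import defaultdict


def backup_capacity(tunnels):
    """tunnels: iterable of (protected_resource, bandwidth) pairs, one per
    backup tunnel crossing a given link. Returns (shared, unshared)."""
    per_resource = defaultdict(float)
    for resource, bandwidth in tunnels:
        per_resource[resource] += bandwidth
    # Without sharing, every backup tunnel gets its own reservation.
    unshared = sum(per_resource.values())
    # With sharing, only one protected resource is assumed to fail at a time,
    # so the link only needs enough capacity for the worst single failure.
    shared = max(per_resource.values(), default=0.0)
    return shared, unshared


# Hypothetical backup tunnels crossing one link.
tunnels_on_link = [
    ("link A-B", 40.0),
    ("link C-D", 30.0),
    ("node E", 50.0),
    ("link A-B", 10.0),
]
shared, unshared = backup_capacity(tunnels_on_link)
print(shared, unshared, unshared / shared)  # 50.0 130.0 2.6
```

In this made-up case the degree of bandwidth sharing is 2.6: the backup tunnels add up to 130 units of bandwidth, but 50 units of backup capacity suffice on the link.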
5.8.4 Summary
In the previous sections of this chapter, we saw various MPLS TE recovery techniques in detail from various angles: the protocol extensions, the mode of operation, and the capability of each technique, along with several other aspects. The aim of this section is to highlight the main advantages and disadvantages of each of them, with the objective of providing some guidance on where each recovery mechanism preferably applies.
MPLS Traffic Engineering Default Global Restoration
Quick Summary
Global default restoration is the default mode of MPLS TE. When a failure is detected by the head-end LSR of one or several TE LSPs, for each affected TE LSP, a new path is computed “on the fly” (using CSPF to find a path that obeys the constraints or using some preconfigured alternate paths), and if a new path can be found, the TE LSP is signaled along that path. The traffic is then restored using the new TE LSP.
Advantages and Drawbacks
Advantages
• Global default restoration does not require any additional configuration of a backup path (unless the network administrator decides to explicitly configure the backup path). So, for instance, if a TE LSP is configured as dynamic (the path is computed using a CSPF algorithm), no other configuration is required.
Drawbacks
• Global default restoration is the slowest recovery mechanism compared with the protection mechanisms, because it requires the FIS to propagate to the head-end LSR, a dynamic path computation (whose duration grows with the number of TE LSPs to reroute and with the network complexity), and TE LSP signaling. It cannot be used to provide recovery times on the order of tens of milliseconds. Note that a separate CSPF must be computed per TE LSP to reroute.
• Lack of predictability: In some cases, there is no guarantee that the TE LSP can be rerouted upon failure. A last-resort option is to relax all the TE LSP constraints, which guarantees that CSPF will always find a path for the TE LSP (which will be the IGP shortest path), so the TE LSP will stay “up” (provided that there exists a path between the TE LSP’s head-end and tail-end LSRs).
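The restoration logic summarized above (compute a constrained path on the fly, and relax the constraints as a last resort) can be sketched as follows. This is a simplified illustration in Python, assuming a single constraint (available bandwidth) and a hypothetical four-node topology; a real CSPF implementation handles more constraints (affinities, TE metric, preemption priorities, and so on).

```python
import heapq


def cspf(graph, src, dst, bandwidth):
    """Shortest path by metric over the links that have enough available
    bandwidth. graph: {node: [(neighbor, metric, available_bw), ...]}."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, metric, available in graph.get(node, []):
            if available < bandwidth:
                continue  # prune links that violate the constraint
            nd = d + metric
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    return None  # no path satisfies the constraint


def restore(graph, src, dst, bandwidth):
    """Try a constrained path first; as a last resort, relax the constraint,
    which yields the shortest path by metric if any path exists."""
    path = cspf(graph, src, dst, bandwidth)
    if path is None:
        path = cspf(graph, src, dst, 0)
    return path


# Hypothetical topology after the failure of link A-B: only A-C-D remains,
# with 10 units of available bandwidth on each link.
after_failure = {"A": [("C", 1, 10)], "B": [("D", 1, 100)],
                 "C": [("D", 1, 10)], "D": []}
print(cspf(after_failure, "A", "D", 50))     # None
print(restore(after_failure, "A", "D", 50))  # ['A', 'C', 'D']
```

The example shows the lack of predictability mentioned above: the TE LSP stays up after the failure, but only because its bandwidth constraint was dropped.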
MPLS Traffic Engineering Global Path Protection
Quick Summary
With MPLS TE global path protection, a diversely routed (see footnote 68) backup TE LSP is computed and signaled for each primary TE LSP before any failure. The constraints for the backup TE LSP can be identical to or different from those of the primary TE LSP. In the case of a failure along the path, once the failure notification is received by the head-end LSR, it switches the traffic over to the backup TE LSP and the traffic is recovered.
Advantages and Drawbacks
Advantages
• In networks with many links and nodes and a limited number of TE LSPs to protect, this mechanism is easy to deploy and requires a limited amount of provisioning. For instance, suppose a very large network where just a limited number of TE LSPs must be protected. With global path protection, just a few diversely routed TE LSPs must be configured and set up. By contrast, the use of local protection would require the protection of every network element with backup tunnels along any potential primary path.
• Because the backup tunnel is signaled before the failure, the path is deterministic, which provides strict control over the backup tunnel path.
Drawbacks
• Global path protection requires doubling the number of TE LSPs, which has a significant scalability impact in full mesh networks, as shown earlier.
• Global path protection cannot, in most cases (especially in international networks), provide tens of milliseconds of recovery time, which might be an issue when protecting very sensitive traffic like voice or ATM/frame relay over IP/MPLS networks. This is due to the need for the failure notification to be received by the head-end LSR before the traffic is switched over to the backup path.
• If bandwidth guarantee is required, path protection requires the use of an external off-line tool for the computation of both the primary and secondary TE LSPs in order to provide bandwidth sharing.
• The requirement for an end-to-end diversely routed path may in some cases imply selecting a nonoptimal path for the primary TE LSPs.
68. Diversely routed means link, node, or SRLG disjoint.
MPLS Traffic Engineering Local Protection
Quick Summary
MPLS TE Fast Reroute is a local protection recovery mechanism. There are two flavors of Fast Reroute:
• Facility backup: For each protected network element, a backup tunnel is set up before any failure. The number of backup tunnels is equal to 1 for link protection (potentially more if bandwidth protection is required and a single backup tunnel with the required capacity cannot be found) and to N for node protection (where N is the number of next-next-hops for each LSR). This applies to each LSR in the network where protection is required. As in the link protection case, more than one backup tunnel might be required per next-next-hop if a single backup tunnel with the required capacity cannot be found. When the link or node fails, upon failure detection, the node immediately upstream of the failure switches all the fast-reroutable TE LSPs onto their appropriate backup TE LSP (using the MPLS label stacking property).
• One-to-one backup: For each fast-reroutable TE LSP using the one-to-one backup method, a separate diversely routed TE LSP that terminates at the tail-end LSR is set up at each hop. The number of backup TE LSPs (called Detour LSPs) is a function of the number of fast-reroutable TE LSPs and the network diameter. Merging rules can help reduce the number of Detour LSPs in the network. When a link or a node fails, upon failure detection, the node immediately upstream of the failure switches all the fast-reroutable TE LSPs onto their Detour LSP.
With both methods, once the PLR (the node immediately upstream of the failure) has locally rerouted the protected TE LSPs affected by the failure onto their respective backup tunnels, it sends a notification to every head-end LSR of the fast-rerouted TE LSPs, so that the head-end LSR(s) can potentially trigger a reoptimization and reroute the TE LSPs over a more optimal path in a nondisruptive fashion.
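The label stacking used by facility backup can be pictured with a toy model of the label stack. In this Python sketch the stack is a list whose first element is the top label; the function name and all label values are hypothetical, and the sketch assumes the PLR already knows which label the merge point expects for the protected TE LSP.

```python
def facility_backup_reroute(label_stack, label_expected_by_mp, backup_tunnel_label):
    """Forwarding action of the PLR for a packet of a protected TE LSP once the
    failure is detected: swap the top label to the one the merge point (MP)
    expects, then push the label of the backup tunnel on top of it."""
    swapped = [label_expected_by_mp] + label_stack[1:]
    return [backup_tunnel_label] + swapped


# Hypothetical values: the packet arrives with TE LSP label 17 and an inner
# label 5001; the MP expects label 22; the backup tunnel uses label 300.
print(facility_backup_reroute([17, 5001], 22, 300))  # [300, 22, 5001]
```

Because the backup tunnel label is simply pushed on top, the same backup tunnel can carry any number of protected TE LSPs, which is why the number of backup tunnels does not depend on the number of fast-reroutable TE LSPs.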
Advantages and Drawbacks
Advantages
• MPLS TE Fast Reroute is a local protection mechanism and can provide very fast recovery times, equivalent to SONET-SDH/optical protection. This is particularly important to protect TE LSPs carrying very sensitive traffic. Facility backup and one-to-one backup are equivalent in terms of recovery time.
• Fast Reroute can provide bandwidth, propagation delay, and jitter guarantees in the case of link/SRLG/node failure. In the case of facility backup, the required backup capacity can be drastically reduced thanks to bandwidth sharing between backup tunnels protecting independent resources.
• High granularity: The concept of CoR makes it possible to offer a wide range of protection coverage with a high granularity, because the CoR is a per-TE LSP property.
• The facility backup method is highly scalable, because the number of backup tunnels is a function of the number of network elements to protect and does not grow with the number of fast-reroutable TE LSPs.
• Fast Reroute can easily be used even in networks where a full mesh of TE LSPs is not deployed (see Section 5.8).
Drawbacks
• Fast Reroute requires configuring and setting up a number of backup TE LSPs, which can be nonnegligible in large networks.
• It might be more complex to troubleshoot.
• The Fast Reroute one-to-one backup method has limited scalability in large networks.
5.9 Revertive versus Nonrevertive Modes
There is another important aspect that we have not discussed so far in the context of MPLS TE recovery and that was introduced in Chapter 1: the notion of revertive versus non-revertive mode. Once a network element failure occurs, recovery mechanisms are responsible for finding an alternate path. But once the resource is restored, how is the traffic rerouted onto that resource? This depends on whether the recovery mechanism is revertive or non-revertive, and this is the subject of this section.
5.9.1 MPLS Traffic Engineering Global Default Restoration
With MPLS TE global default restoration, when a link, an SRLG, or a node fails, each TE LSP affected by the failure is rerouted over an alternative path determined by its head-end LSR. When the failed resource is restored, any head-end LSR can reuse the restored resource. This relies on the reoptimization process, by which a head-end LSR evaluates for each of its TE LSPs whether a better path exists. There are several possible configurations for a TE LSP.
• Several static paths are configured: The head-end LSR reevaluates whether a preferred path (different from the path in use) is available.
• The TE LSP is configured as purely dynamic (no static path is specified): The head-end LSR reevaluates whether a more optimal path exists (more optimal usually means a “shorter” path using either the IGP or the TE metric; see [SECOND-METRIC]).
When is the reoptimization task performed? Several existing implementations support multiple reoptimization triggers:
• Event driven: A new IGP OSPF LSA or IS-IS LSP has been received, and the head-end LSR determines that triggering a reoptimization may be appropriate because a better path may have appeared.
• Timer driven: Every x seconds, the head-end LSR reevaluates whether a more optimal path can be found.
5.9.2 MPLS Traffic Engineering Global Path Protection
With MPLS TE global path protection, upon link or node failure notification, the head-end LSR switches the traffic onto the backup LSP. When the link/node recovers, the head-end LSR can trigger either a revertive or a non-revertive action. In the former mode, the traffic is immediately switched back to the primary TE LSP once the primary TE LSP is restored (successfully resignaled). This should be done without traffic disruption but may cause some packet reordering. In the latter mode, the traffic keeps flowing over the backup TE LSP. This option is best avoided if the backup TE LSP is less constrained than the primary TE LSP (i.e., has less bandwidth or follows a longer path). In the revertive mode, the switch-back action can be either event driven or timer driven.
Important note: A side effect of trying to reuse a restored resource is the risk of multiple traffic disruptions in case of resource flapping.
5.9.3 MPLS Traffic Engineering Local Protection
There are actually two kinds of revertive modes with MPLS TE Fast Reroute, which are both specified in the Internet Engineering Task Force (IETF) specification [FAST-REROUTE]:
1. The globally revertive mode: In this case, the decision to reuse a restored resource is left to the head-end LSR of the TE LSP upon reoptimization (which can be event or timer driven, as previously mentioned).
2. The locally revertive mode: When the PLR detects that the link/node is restored, it tries to resignal along the restored resources all the TE LSPs that are currently rerouted over a backup tunnel. If the resignaling attempt fails, the fast-rerouted TE LSPs keep using the backup TE LSP; if the attempt succeeds, the TE LSPs are switched back to their original path.
Note that the locally revertive mode tries to switch back all the TE LSPs along the restored path, contrary to the globally revertive mode, where the head-end LSR can decide to reuse the restored resource on a per-TE LSP basis, depending on the TE LSP attributes. It is worth noting that the locally revertive mode may have undesirable effects:
• In case of resource flapping, the revertive mode would potentially cause multiple traffic disruptions; consequently, a locally revertive mode should implement some dampening mechanism, as described in Chapter 4. Otherwise, if the resource flaps, the PLR constantly switches the TE LSPs between the primary link and the backup tunnels, which results in multiple traffic disruptions.
• Limited view of the TE LSP attributes: Contrary to the globally revertive mode, the PLR performs the switch-back operation without complete knowledge of the TE LSP attributes. Suppose the following situation: A TE LSP T1 is signaled along a path P1. A link L1 along P1 fails and Fast Reroute is triggered. A new link along another (shorter) path between T1’s head-end and tail-end LSRs comes up. The failed link L1 is restored. A locally revertive mode would switch the traffic back to the restored link even though a better path exists between T1’s head-end and tail-end LSRs. The globally revertive mode would have been more efficient in this case.
That said, there are some circumstances in which a locally revertive mode might be useful. For the reasons mentioned earlier, the MPLS TE Fast Reroute specification ([FAST-REROUTE]) recommends the globally revertive mode, whereas the locally revertive mode is optional.
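One possible way to dampen local reversion is a hold-down timer that grows when the resource flaps. The Python sketch below is purely illustrative: the class name, the doubling policy, and the timer values are assumptions made for this example and are not taken from [FAST-REROUTE] or from the dampening mechanisms of Chapter 4.

```python
class RevertDampener:
    """Allow the PLR to switch back only after the restored resource has
    stayed up for a hold-down period that doubles on each flap."""

    def __init__(self, base_hold_s: float = 10.0, max_hold_s: float = 160.0):
        self.base_hold_s = base_hold_s
        self.max_hold_s = max_hold_s
        self.hold_s = base_hold_s
        self.up_since = None  # time at which the resource last came up

    def link_up(self, now: float) -> None:
        self.up_since = now

    def link_down(self, now: float) -> None:
        if self.up_since is not None and now - self.up_since < self.hold_s:
            # The resource failed again during its hold-down period: a flap.
            self.hold_s = min(self.hold_s * 2, self.max_hold_s)
        else:
            self.hold_s = self.base_hold_s  # stable resource: reset the penalty
        self.up_since = None

    def may_revert(self, now: float) -> bool:
        return self.up_since is not None and now - self.up_since >= self.hold_s


dampener = RevertDampener(base_hold_s=10.0)
dampener.link_down(0.0)           # initial failure, Fast Reroute is triggered
dampener.link_up(5.0)             # resource restored
print(dampener.may_revert(10.0))  # False: up for 5 s only
dampener.link_down(8.0)           # flap: hold-down doubles to 20 s
dampener.link_up(9.0)
print(dampener.may_revert(25.0))  # False: up for 16 s, 20 s required
print(dampener.may_revert(29.0))  # True
```

The trade-off is that the protected TE LSPs stay longer on their backup tunnels, which is acceptable only if the backup path offers a sufficient quality of service.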
5.10 Failure Profile and Fault Detection
Section 4.3 of Chapter 4 is devoted to the subject of failure profile and fault detection; it was pointed out that the subjects discussed were applicable to both IP and MPLS. So only the MPLS-specific aspects will be covered in this section.
5.10.1 MPLS-Specific Failure Detection
Hello-Based Protocols
In the context of MPLS TE, another hello-based protocol has been defined, called the “RSVP hello protocol extension” (see [RSVP-TE]). The basic mode of operation is similar to that of any other hello mechanism. RSVP hello messages are sent at a certain frequency, and if no RSVP hello message has been received during a configurable amount of time (usually a multiple of the hello interval), the RSVP hello adjacency is considered down. It is worth noting that RSVP hello is a TE LSP property, but a proper implementation needs to ensure that just one RSVP hello adjacency is activated per set of TE LSPs traversing the same interface. To illustrate
that statement, let us consider the case of two routers R1 and R2 interconnected via n links L1, L2, . . . , Ln and where several sets Si of TE LSPs traverse each link L1, L2, . . . , Ln. A very poorly scalable solution is to activate one RSVP hello adjacency per TE LSP. Instead, for each set Si, the routers R1 and R2 should select one TE LSP for which the RSVP hello adjacency will be activated. If the link Li fails, the RSVP hello adjacency of the selected TE LSP will go down and the router will declare all the TE LSPs traversing the link Li as impacted by the failure. So the total number of RSVP hello adjacencies will be n in this case. As with any other hello-based protocol, the important question of the scalability impact arises and there is no exception with RSVP hellos; sending RSVP hello messages requires some processing treatment by an LSR, which might not be an inexpensive operation. This explains why running fast hellos at very high frequency like 5 ms must be avoided. Moreover, a large number of neighbors also has an impact on the scalability of such a solution. Hence, if the number of neighbors is not too high and the RSPV hello frequency is reasonable, RSVP hellos may be a candidate for failure detection when lower layer fast detection mechanisms are not available. Of course, those numbers are highly dependent on the platform, but to give some rough numbers, at the time of writing, some routers can currently support 20 neighbors with RSVP hello messages sent every 100 to 200 ms without any problem. Note that the potential issue of platforms not being able to sustain RSVP hello is the potential triggering of false-positive alarms. A false-positive alarm occurs when Fast Reroute is inappropriately triggered by a loss of RSVP adjacency not because of a failure but just because the neighboring router is too busy to echo the RSVP hello message. 
This would not create any traffic black-holing, but the protected TE LSPs would be rerouted onto their backup tunnel although this was not required. Moreover, if the backup tunnel does not offer an equivalent QoS, the rerouted traffic may experience some performance degradation. The TE LSPs would then very likely be reoptimized by their respective head-end LSRs along the initial path, but this is clearly undesirable and should not happen too frequently.
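To make the scaling argument concrete, the following minimal Python sketch compares the number of RSVP hello adjacencies under the two activation strategies discussed above (one per TE LSP versus one per link) and estimates the resulting hello send rate. The function names and the per-link TE LSP counts are hypothetical and chosen only for illustration; the sketch also assumes that exactly one hello message is sent per adjacency per hello interval.

```python
def hello_adjacencies(lsps_per_link, per_lsp=False):
    """Number of RSVP hello adjacencies between two routers.

    lsps_per_link: size of each set Si of TE LSPs, one entry per link Li
    (illustrative input, not taken from the text).
    """
    if per_lsp:
        # Poorly scalable: one adjacency per TE LSP
        return sum(lsps_per_link)
    # One adjacency per link that carries at least one TE LSP
    return sum(1 for count in lsps_per_link if count > 0)


def hellos_sent_per_second(adjacencies, hello_interval_ms):
    """Hello messages sent per second (assumption: one per adjacency per interval)."""
    return adjacencies * 1000 / hello_interval_ms


lsps_per_link = [120, 80, 200]                            # hypothetical: n = 3 links
print(hello_adjacencies(lsps_per_link, per_lsp=True))    # 400
print(hello_adjacencies(lsps_per_link))                  # 3
print(hellos_sent_per_second(20, 100))                   # 200.0
```

With one adjacency per link, the count depends only on n and not on the number of TE LSPs, which is the property the text relies on.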
5.10.2
Requirements for an Accurate Failure Type Characterization
In the context of MPLS TE local protection, being able to differentiate a link from a node failure may be particularly useful. In Section 4.3 of Chapter 4, we saw why such a differentiation is not always obvious. In this section we will see why such a capability can be very useful and we will describe some potential solutions. Let us now analyze the situations where being able to differentiate a link from a node failure may be desirable:
Situation 1: Optimal Backup Path Selection
Let us consider the network depicted in Figure 5.25, where two backup tunnels are configured on the PLR R0: the NHOP backup tunnel B1 and the NNHOP backup tunnel B2. A conservative approach is to systematically select B2 upon link failure detection, because the PLR cannot tell a link failure from a node failure at that point. This way, if the failure was a node failure, the decision was correct. On the other hand, if the failure was a link failure, a better choice would have been to reroute the set of protected TE LSPs traversing the failed link onto B1: the fast rerouted TE LSPs could then have followed a shorter path than the one offered by B2. This is mainly due to the additional constraints imposed on the NNHOP backup tunnel path computation. It is worth highlighting that this drawback is relevant only in some networks. Typically, in a national network that is not heavily loaded and where propagation delays are not significant, choosing a slightly longer path for a short period (until the TE LSPs are rerouted by their respective head-end LSRs) is not necessarily an issue. On the other hand, in a poorly connected network with international links (having significant propagation delays), rerouting along a longer path is not desirable. This is even more true if the network is congested, because the temporary rerouting along the B2 backup tunnel is likely to increase the level of congestion over a larger number of links.
Figure 5.25 Optimal backup tunnel selection. [Diagram: LSRs R0, R1, and R2; B1 is the NHOP backup tunnel and B2 is the NNHOP backup tunnel.]
Situation 2: Bandwidth Protection Violation
As we will see in Section 5.15, one can benefit from the single failure assumption to achieve bandwidth sharing between backup tunnels protecting independent resources. Unfortunately, the inability to differentiate a link from a node failure can lead to situations where backup tunnels protecting independent resources are used simultaneously, resulting in a bandwidth protection violation. In the network depicted in Figure 5.26, B1 is an NNHOP backup tunnel originating on R1 and terminating on R3 that protects against a failure of the node R2, and B2 is an NNHOP backup tunnel originating on R2 and terminating on R0 that protects against a failure of the node R1. Because those two backup tunnels protect independent resources (R1 and R2), by virtue of the single failure assumption, they can share bandwidth because they are never simultaneously active (see Section 5.15 for further details). This is true in particular on the link R4-R5.
Figure 5.26 Bandwidth protection violation. [Diagram: LSRs R0 to R5; the backup tunnels B1 and B2 share bandwidth on the link R4-R5.]
Adopting the same backup tunnel selection strategy as in situation 1, as soon as the failure of the link R1-R2 is detected by the PLRs R1 and R2, they will both start rerouting protected TE LSPs onto their NNHOP backup tunnels (B1 and B2, respectively), which results in a bandwidth protection violation. This example clearly illustrates the need for a mechanism that can unambiguously differentiate a link from a node failure. An alternative (only available if the set of backup tunnel paths is computed by a central entity) is to make sure, when computing NNHOP backup tunnels, that two NNHOP backup tunnels protecting adjacent nodes never collide (i.e., never share bandwidth on their common section). The price of such an additional constraint is an increase in the complexity of the path computation algorithm and a lower bandwidth sharing efficiency.
So the two examples clearly highlight the benefits of a solution that would allow a PLR to differentiate a link from a node failure. A solution has been proposed in [linknode-failure], which relies on sending hello messages along a link-diverse path upon link failure detection; an obvious candidate for the alternate path is the NHOP backup tunnel itself. In the example in Figure 5.27, several backup tunnels are configured: On R1, there is one NNHOP backup tunnel B1 protecting against a failure of the node R2 and one NHOP backup tunnel B3 protecting against a failure of the link R1-R2. Likewise, on R2, there is one NNHOP backup tunnel B2 protecting against a failure of the node R1 and one NHOP backup tunnel B4 protecting against a failure of the link R2-R1.
Mode of operation: Upon link failure detection (by means of a layer 2 link failure notification or an RSVP/IGP hello timeout), R1 (see footnote 69) starts sending hello messages to R2 via the NHOP backup tunnel B3.
If a response is received from the adjacent node (R2), R1 can conclude that the failure is just a link failure and not a node failure. Conversely, if no response is received from R2, the failure is considered a node failure.
Footnote 69: R2 performs the same set of operations.
Figure 5.27 Mechanism allowing to differentiate a link from a node failure. [Diagram: LSRs R0 to R5 and the backup tunnels B1, B2, B3, and B4.]
If we assume that such a failure characterization scheme is available, two strategies can be put in place in terms of MPLS TE Fast Reroute decision:
Option 1: Start using the NNHOP backup tunnel and switch if required. In this option, as soon as the link failure is detected by the PLR (R1), all the protected TE LSPs traversing the failed link are rerouted onto the NNHOP backup tunnel. Then the failure characterization mechanism mentioned earlier is activated. If it turns out that the failure is a link failure, the rerouted TE LSPs are switched from their NNHOP backup tunnel to their NHOP backup tunnel. If the failure is characterized as a node failure, no particular action is required.
Option 2: Start using the NHOP backup tunnel and switch if required. This option basically does the opposite: Upon detecting the link failure, the PLR starts rerouting the protected TE LSPs traversing the failed link onto the NHOP backup tunnel. Then the failure characterization mechanism is activated; if it turns out that the failure is a link failure, no particular action is required. On the other hand, if the failure is characterized as a node failure, the rerouted TE LSPs are switched from their NHOP backup tunnel to their NNHOP backup tunnel.
Pros and cons of each approach: The failure characterization process takes some time Tc (which depends on the protocol and timers used). With option 1, in the case of a link failure, this might cause a temporary bandwidth protection violation and/or nonoptimal backup path selection for the reasons mentioned earlier, but this option always minimizes the packet loss. With option 2, in the case of a node failure, the duration of traffic loss is increased by Tc, but bandwidth
protection is always preserved and a more optimal backup path is selected in the case of a link failure. Hence, depending on the network objectives, one may prefer one option or the other, provided that a failure characterization mechanism is available.
In summary, being able to differentiate a link from a node failure is desirable to optimally select the backup tunnel to use in the case of MPLS TE Fast Reroute local protection “facility backup.” That said, this is certainly not an absolute requirement and should just be considered an optimization.
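The two options can be summarized by the following minimal Python sketch, which returns the sequence of backup tunnels used by the PLR for a given option and probe outcome. The names `Failure`, `characterize`, and `tunnel_sequence` are hypothetical; the sketch only models the decision logic described above, not any protocol exchange or timer.

```python
from enum import Enum


class Failure(Enum):
    LINK = "link"
    NODE = "node"


def characterize(probe_answered):
    """A hello answered by the neighbor over the NHOP backup tunnel means only the link failed."""
    return Failure.LINK if probe_answered else Failure.NODE


def tunnel_sequence(option, probe_answered):
    """Backup tunnels used over time: the initial choice, then the tunnel
    switched to after Tc when the characterization contradicts that choice."""
    initial = "NNHOP" if option == 1 else "NHOP"
    final = "NHOP" if characterize(probe_answered) is Failure.LINK else "NNHOP"
    return [initial] if initial == final else [initial, final]


print(tunnel_sequence(1, probe_answered=True))   # ['NNHOP', 'NHOP']
print(tunnel_sequence(2, probe_answered=False))  # ['NHOP', 'NNHOP']
```

A two-element result corresponds to the cases discussed in the pros and cons: with option 1 a link failure leaves traffic on the NNHOP backup tunnel for Tc, and with option 2 a node failure leaves traffic on an unusable NHOP backup tunnel for Tc.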
5.10.3
Analysis of the Various Failure Types and Their Impact on Traffic Forwarding
A large set of possible failures can occur in a network. Section 4.3 of Chapter 4 provides an analysis of the impact on the forwarded traffic of various failure profiles and the set of failure detection mechanisms that can be used to detect those failures. Because most of the material covered in Chapter 4 also applies to MPLS TE, we will just focus on the MPLS TE specific aspects here:
1. Link failure: Link failures always affect the data traffic until an alternate path is found and data traffic is rerouted over some backup paths. Various mechanisms have been described in this chapter to handle link failures and find an alternate path.
2. Node failure: As mentioned in Chapter 4, there are multiple possible causes of node failures, and their nature has a different impact on the forwarded traffic.
• Power supply outage: The traffic is black-holed until it is rerouted over a backup path.
• Route processor failure: In centralized platform architectures, a route processor failure usually implies that packets forwarded to the failing router are dropped. On the other hand, on distributed platform architectures this type of failure usually does not affect the data plane: packets are still forwarded by the router, and only the control plane fails. The expected behavior in this case is that after some period (see footnote 70), either the IGP or the RSVP hello adjacency will go down. In the former case (IGP adjacencies go down), the IGP neighbors of the failing router will flood an updated LSA (router link LSA for OSPF) or LSP (for IS-IS). Every head-end LSR will detect that one or more of its TE LSPs traverse a failed LSR and should take the appropriate action (usually a graceful TE LSP reroute will be triggered in a nondisruptive fashion).
In the latter case (the RSVP hello adjacency goes down), the node immediately upstream of the failed node will issue an RSVP notification (an RSVP Path Error message) to every head-end LSR having one or more TE LSPs passing through the failed node. Every head-end LSR should then in turn take an appropriate action. This description does not apply to the case of graceful restart procedures.
• Software failure: The impact of a software failure on forwarded traffic depends heavily on the nature of the failure, which can vary from the simple generation of a warning message followed by an automatic recovery (via a restartable module) handled by the operating system to a situation where the router is completely hung and can no longer recover from the failure, which might require a complete reinitialization. In the latter case, the traffic is black-holed until the control plane detects the node failure.
• Planned node failure: Because the failure is “planned,” various actions can be taken before performing the upgrade. The traffic may be gracefully rerouted around the node. Various methods can be used to meet that goal: For instance, the cost of the links toward the node can be manually increased on every adjacent node, or an updated OSPF LSA or IS-IS LSP can be flooded by the node to be upgraded. As a consequence, the IGP will smoothly reroute the traffic around this node, and every head-end LSR, upon triggering a TE LSP reoptimization, will likely reroute its TE LSPs along some other path in a nondisruptive fashion. The node to be upgraded will then no longer carry any transit traffic and can be safely upgraded without risking any traffic disruption.
Footnote 70: This period depends on the IGP or RSVP timer configuration.
Important note: Some software and hardware architectures support “hitless” software and hardware upgrades without requiring any of the actions mentioned above.
5.11 Case Studies
This section is entirely devoted to case studies in which the various concepts covered in this chapter are illustrated. Each case study has the following structure:
• Assumptions: network topology, layer 2/3 protocols, . . .
• Objectives: convergence time, failure coverage, performance, . . .
• Proposed design: There is obviously no unique design that addresses a specific set of requirements. At least one possible design is proposed for each case study.
Three case studies are proposed in this section, in order of increasing complexity.
5.11.1
Case Study 1
Assumptions
Let us consider the following network made of 12 nodes (LSRs) in the United Kingdom (Figure 5.28).
Figure 5.28 Case Study 1. [Network diagram: LSRs R1 to R12; legend: unprotected lambda, protected lambda or protected SDH, Gigabit Ethernet.]
• The network is made of several layers:
  • An optical layer, having the capability to offer protected or unprotected lambdas
  • An SDH layer, offering both protected and unprotected VCs
  • An IP/MPLS layer
• As shown in Figure 5.28, the LSRs are interconnected by different types of links with various levels of protection:
  • Unprotected lambdas (e.g., link R1-R3)
  • Protected lambdas or protected SDH VCs (e.g., link R9-R10)
  • Gigabit Ethernet links (e.g., a layer 2 switch is used to interconnect the two LSRs R11 and R12, which are co-located within the same POP)
• The IGP is OSPF, configured with the following timer values:
  • OSPF HelloInterval is 10 seconds
  • OSPF RouterDeadInterval is 40 seconds
  • No incremental SPF, no fast LSA propagation, no fast SPF triggering
• The vast majority of network element failures are link failures. Node failure is a rare event, so the current IGP convergence to handle node failure is sufficient.
• No change in terms of layer 1-2 protection can be made on the network. In other words, an unprotected link cannot be protected by the optical or SDH layer, in order to limit cost. Likewise, there is no desire to unprotect a link that is already protected (e.g., because the optical/SDH equipment in place is already paid off).
• Every link of this network is independent of other links. In other words, there is no SRLG (no single point of failure between the various links).
• The network has enough capacity to carry the traffic with the required QoS even during a single failure using OSPF routing. MPLS TE is not deployed in this network for bandwidth optimization and/or strict QoS guarantee.
• The network is not Diffserv enabled.
Objectives
• The objective is to get fast recovery (in less than 50 ms) in the case of a failure of an unprotected link. All the IP/MPLS traffic must be protected without any differentiation.
Proposed Design
The requirement for fast recovery implies the use of Fast Reroute. Facility backup will be selected for its scalability property. Because MPLS TE is not deployed in this network, the most appropriate design to benefit from the fast protection of Fast Reroute is to deploy one-hop tunnels wherever Fast Reroute is required (that is, on links that are unprotected at the optical or SDH layer). So for each link to protect with Fast Reroute, two TE LSPs will be configured:
• An unconstrained primary TE LSP configured as “fast reroutable”: No constraint is applied to this TE LSP because the objective is just to get a tunnel to protect. Because no constraint is applied, the tunnel follows the shortest path, that is, the direct link. Note also that because this tunnel is a one-hop tunnel, no label is pushed when an IP or MPLS packet is routed onto it. This is because the head-end LSR is also the penultimate hop and therefore performs penultimate hop popping (PHP); that is, no tunnel label is used between the penultimate hop LSR and the tunnel destination. All the traffic will be routed onto this tunnel so that all the traffic routed over this link according to OSPF is protected via MPLS TE Fast Reroute in the case of a failure.
• A backup TE LSP, used to reroute the protected TE LSPs in the case of a link failure. The path for this backup tunnel is link diverse from the link to protect and is either statically configured on the LSR or dynamically computed by the LSR. As mentioned earlier, the network has enough capacity to satisfy the required QoS even during a single failure. This implies that the only constraint that should be taken into account for the backup tunnel path computation is route diversity.
Example: To protect the link R1-R5 with Fast Reroute, two TE LSPs are configured:
• A one-hop primary TE LSP T1 without any constraint (0 bandwidth, no affinities, . . . ). All the traffic routed according to OSPF onto this link is routed onto this primary tunnel.
• A backup tunnel B1, configured from R1 to R5, which follows the R1-R2-R5 path.
For each link not protected by a lower layer protection/restoration mechanism, this must be done in both directions because TE LSPs are unidirectional. The same configuration is then applied to every unprotected link in the network: the links R1-R2, R2-R5, R1-R5, R1-R3, R3-R4, R4-R7, R4-R5, R5-R7, R2-R6, R6-R7, R7-R8, R2-R9, R9-R8, and R11-R12. Note that because the network has enough capacity to satisfy QoS requirements even under a single failure scenario, the backup tunnel path does not need to be constrained to make sure enough capacity is available along the backup path. Also, because this network is a domestic U.K. network, the increase of the propagation delay along the backup path is considered negligible and acceptable.
For instance, in the case of a failure of the link R1-R5, once the failure is detected by the LSR R1, the fast-reroutable one-hop TE LSP T1 is rerouted onto the backup tunnel B1 within 50 ms. All the traffic that used to be routed onto the link R1-R5 is now rerouted along the path R1-R2-R5; note that the rerouted traffic is label switched at R2, which does not make any routing decision (Figure 5.29). Then T1 is reoptimized by R1 to follow the new OSPF shortest path from R1 to R5. This shortest path could be R1-R2-R5 or R1-R3-R4-R5 depending on the OSPF metrics; the assumption made here is that the shortest OSPF path is R1-R3-R4-R5 (Figure 5.30).
As mentioned above, every link is protected using the same configuration. There is just one exception in this network: the link R11-R12. Indeed, in contrast with the other links of this network, which provide fast failure detection, this link is a Gigabit Ethernet link with a layer 2 switch between the LSRs R11 and R12.
So an interface failure of R12 cannot be detected by R11 other than by means of a hello protocol. In this case, OSPF will eventually detect the failure only after the 40-second RouterDeadInterval expires, which does not satisfy the fast convergence requirement. The solution is to run RSVP hellos between R11 and R12. The hello interval will depend on the router implementation and the number of RSVP hello adjacencies (one in this case). For instance, a hello interval of 100 ms and a threshold of four missed acknowledgments could be configured. In the case of an interface failure on R12, the link failure will then be detected by R11 in about 400 ms.
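The detection times quoted above follow from a simple product of the hello interval and the number of missed hellos tolerated before the adjacency is declared down. The following minimal Python sketch makes the arithmetic explicit; the function name is hypothetical, and the model deliberately ignores processing and propagation delays.

```python
def detection_time_ms(hello_interval_ms, missed_hellos):
    """Approximate time before a hello adjacency is declared down."""
    return hello_interval_ms * missed_hellos


# RSVP hellos between R11 and R12: 100 ms interval, four missed acknowledgments
print(detection_time_ms(100, 4))      # 400

# OSPF with a 10-second hello interval and a dead interval of four hello intervals
print(detection_time_ms(10_000, 4))   # 40000, i.e. 40 seconds
```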
Figure 5.29 Case Study 1. [Network diagrams of LSRs R1 to R12. Labels: RSVP hello session; 1-hop primary TE LSP T1; backup TE LSP B1 used to protect the R1-R5 link; T1 is rerouted onto B1.]
Figure 5.30 Case Study 1. [Network diagram of LSRs R1 to R12. Labels: RSVP hello session; “T2 is then reoptimized to follow the shortest OSPF path.”]
5.11.2
Case Study 2
In this case study, the following assumptions and objectives are added to those of Case Study 1:
Additional Assumptions
• As depicted in Figure 5.31, some links share SRLGs. For instance, the links R1-R2 and R1-R5 are in the same SRLG. Likewise, the links R1-R5 and R1-R4 belong to the same SRLG. This means that the failure of a single component in the network (e.g., a piece of optical equipment or a fiber) would result in the simultaneous failure of both links.
Note: Having SRLG-diverse links is not always possible. Indeed, an operator may not have a sufficiently meshed optical network. When leasing optical lambdas from another operator, SRLG diversity is also generally a more expensive option. In the example above, it was not possible to have SRLG-diverse links between R1 and R5 and between R1 and R2, but the links R1-R5 and R1-R3 are SRLG diverse. At the very least, one should make sure that a node cannot be isolated as the result of a single failure. For example, having the four links R1-R3, R1-R4, R1-R5, and R1-R2 in the same SRLG would completely isolate the node R1 in the case of a failure of that SRLG.
• MPLS TE is deployed in the network for network resource optimization. Hence, the 12 LSRs are fully meshed with TE LSPs routed using distributed CSPF (12 × 11 = 132 TE LSPs are up and running). There is no specific bandwidth protection constraint; upon a network element failure, protected TE LSPs are rerouted onto their respective backup tunnels for a short period (until they are rerouted/reoptimized by their head-end LSRs). QoS degradation during that short period (generally on the order of a few hundred milliseconds) is considered acceptable.
Figure 5.31 Case Study 2. [Network diagram of LSRs R1 to R12; legend: unprotected lambda, Gigabit Ethernet, protected lambda or protected SDH, SRLG (shared risk link group). A full mesh of TE LSPs is deployed; just the TE LSPs from R1 to the other LSRs are represented.]
Additional Objectives
An additional objective is added: Both links and nodes should be protected by MPLS TE Fast Reroute. Even if node failures are not very frequent, the impact of a node failure is significant in terms of the volume of affected traffic. Fast recovery upon node failure is therefore a requirement.
Proposed Design
The requirement for fast recovery implies the use of Fast Reroute. Facility backup will be selected for its scalability property. MPLS TE is deployed in this network, so contrary to the previous case study, configuring one-hop primary tunnels is not required. Note that primary TE LSPs may be configured to carry all the traffic routed in this network, in which case all the traffic carried onto protected TE LSPs will be protected by Fast Reroute. Another option is to carry only some traffic onto protected TE LSPs (the voice traffic, for instance): In this case, data traffic is routed using OSPF and so relies on OSPF as the recovery mechanism, whereas voice traffic is protected by Fast Reroute. Both links and nodes must be protected according to the set of requirements listed above.
Example: For the sake of simplicity, just the backup tunnels originated by the node R1 to protect the fast reroutable TE LSPs against a failure of the link R1-R5 and of the node R5 are shown in Figure 5.32; a similar approach is followed for the other links and nodes of the network.
Link protection: For each unprotected link, an NHOP backup tunnel is configured. No QoS constraint applies to the backup tunnel path, as QoS degradation during failure is considered acceptable. For instance, the link R1-R5 is protected by the backup tunnel B1. It is worth mentioning that B1’s path must be SRLG diverse from the link R1-R5. Hence, B1’s path cannot follow the paths R1-R2-R5 and R1-R4-R5, because:
• The links R1-R2 and R1-R5 share the same SRLG.
• The links R1-R4 and R1-R5 also share the same SRLG.
Figure 5.32 Case Study 2. [Network diagram of LSRs R1 to R12; legend: unprotected lambda, Gigabit Ethernet, protected lambda or protected SDH, SRLG (shared risk link group). Backup TE LSP B1 is used to protect against a failure of the link R1-R5. Backup TE LSPs B2, B3, and B4 are used to protect the TE LSPs following the paths R1-R5-R2, R1-R5-R4, and R1-R5-R7, respectively, against a failure of the node R5.]
So in the case of a fiber cut, both the link R1-R5 and B1 would fail, which would make Fast Reroute protection ineffective. Hence, the only SRLG-diverse path for B1 is R1-R3-R4-R5.
Node protection: The number of NNHOP backup tunnels that must be configured on each neighbor of the protected node is equal to n − 1, where n is the number of neighbors of the protected node. In this example, the protected node R5 has four neighbors (R1, R2, R4, and R7), so R1 has three NNHOP neighbors: R2, R4, and R7. Hence, the following NNHOP backup tunnels are configured:
• B2-path: R1-R2. Protects the TE LSPs that follow the path R1-R5-R2 against a failure of R5.
• B3-path: R1-R3-R4. Protects the TE LSPs that follow the path R1-R5-R4 against a failure of R5.
• B4-path: R1-R2-R6-R7. Protects the TE LSPs that follow the path R1-R5-R7 against a failure of R5.
Notes:
• The same approach is followed for the backup tunnels originated on the other neighbors of R5 (e.g., R2, R4, and R7). For instance, on the node R2, one NHOP backup tunnel is configured that follows the path R2-R6-R7-R5 to protect the TE LSPs that follow the path R2-R5, and three NNHOP backup tunnels are configured that follow the paths R2-R6-R7, R2-R1, and R2-R1-R3-R4.
• When a backup tunnel is configured, some implementations allow multiple static paths to be configured. For instance, the NNHOP backup tunnel B4 could be configured with two paths:
  Path 1 (preferred): R1-R2-R6-R7
  Path 2: R1-R2-R9-R8-R7
  If the link R6-R7 fails, B4 is rerouted along path 2 and is still usable to protect the primary TE LSPs that follow the path R1-R5-R7 against a failure of the node R5.
• Another mode consists of configuring the NNHOP backup tunnel as purely dynamic, in which case the PLR computes its path using CSPF. Both modes can also be combined: In this case, a set of static paths is given in order of preference, and if none of the static paths is available, the PLR computes the path itself.
• In the absence of a mechanism that can differentiate a link from a node failure, a perfectly valid approach when both NHOP and NNHOP backup tunnels are configured consists of systematically using the NNHOP backup tunnel in the case of a link failure (because the PLR does not know a priori whether the failure is a link or a node failure). The only exception is for the protected TE LSPs that terminate on the NHOP LSR, which are rerouted using the NHOP backup tunnel.
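The SRLG-diverse path selection used for B1 can be sketched in a few lines of Python. The topology below is an assumption: it contains only the links named in the text of Case Studies 1 and 2 (the figure itself is not reproduced here), and the helper names `link`, `neighbors_of`, and `srlg_diverse_path` are hypothetical. The search is a plain breadth-first search for a fewest-hop path, not CSPF with real metrics.

```python
from collections import deque


def link(a, b):
    """Undirected link identifier."""
    return frozenset((a, b))


# Assumed partial topology: only the links named in the text
LINKS = {link(a, b) for a, b in [
    ("R1", "R2"), ("R2", "R5"), ("R1", "R5"), ("R1", "R3"), ("R3", "R4"),
    ("R4", "R7"), ("R4", "R5"), ("R5", "R7"), ("R2", "R6"), ("R6", "R7"),
    ("R7", "R8"), ("R2", "R9"), ("R9", "R8"), ("R1", "R4"),
]}
# SRLGs stated in the additional assumptions
SRLGS = [
    {link("R1", "R2"), link("R1", "R5")},
    {link("R1", "R5"), link("R1", "R4")},
]


def neighbors_of(node, links):
    """Nodes directly connected to `node` over the given links."""
    return {n for l in links if node in l for n in l if n != node}


def srlg_diverse_path(src, dst, protected):
    """Fewest-hop path avoiding the protected link and every link sharing an SRLG with it."""
    excluded = {protected}
    for group in SRLGS:
        if protected in group:
            excluded |= group
    usable = LINKS - excluded
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(neighbors_of(path[-1], usable)):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(srlg_diverse_path("R1", "R5", link("R1", "R5")))  # ['R1', 'R3', 'R4', 'R5']
print(len(neighbors_of("R5", LINKS)) - 1)               # n - 1 = 3 NNHOP backup tunnels
```

Under these assumptions the search returns R1-R3-R4-R5 for B1, and the n − 1 rule gives three NNHOP backup tunnels on each neighbor of R5, both in agreement with the design above.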
5.11.3
Case Study 3
This case study is the most complex one; it combines the most complete set of requirements and highlights the full set of mechanisms provided by MPLS TE Fast Reroute.
Assumptions
• The network is made of two layers:
  • An optical layer providing unprotected optical lambdas only
  • An IP/MPLS layer
• The LSRs are interconnected by unprotected lambdas of different speeds: OC3, OC48, and OC192.
• The IGP is IS-IS, configured with default values:
  • IS-IS hello interval is 10 seconds.
  • IS-IS hold time is 30 seconds.
  • No incremental SPF, no fast LSP propagation, no fast SPF triggering.
• No change in terms of layer 1-2 protection can be made on the network. In other words, unprotected links cannot be protected by the optical layer (to reduce cost).
• Some links share a common SRLG, as shown in Figure 5.33.
• Three types of traffic are carried in the network:
  • Internet traffic is IP routed (not MPLS switched) according to the routes computed by IS-IS and BGP.
  • VPN traffic: The network supports VPN traffic (might be MPLS VPN, IPSec, . . . ).
  • Voice traffic.
• The network is Diffserv enabled and two classes of service are configured in the core:
  • An EF class: used for the voice traffic.
  • An AF class for the data traffic: A congestion avoidance mechanism like WRED is configured so that Internet traffic is more aggressively dropped than the VPN traffic in the case of network congestion.
• MPLS TE is deployed in this network for bandwidth optimization and service differentiation (strict QoS guarantees). Two meshes of TE LSPs are set up in the network:
  • One mesh of TE LSPs for the data traffic (called data primary TE LSPs).
  • One mesh of TE LSPs for the voice traffic (called voice primary TE LSPs).
Figure 5.33 Case Study 3. [Network diagram of the LSRs BOS, SEA, NYC, CHI, SFO, SLC, WAS, DEN, PHX, LAX, ATL, DAL, HOU, and MIA; legend: OC3 link, OC48 link, OC192 link.]
Important note: Diffserv-aware MPLS TE (also called DS-TE) could be used in this network if different CAC mechanisms were required for the data and voice primary TE LSPs. DS-TE allows different underbooking/overbooking ratios to be applied to different types of traffic. This way, one can ensure that no more than x% of voice TE LSPs will be routed on every link, whereas up to y% of overbooking will be allowed for data traffic. For instance, an operator could decide that to provide a strict QoS to the voice traffic, an appropriate queuing discipline must be configured for voice (e.g., priority queuing) and the proportion of voice traffic routed on every link must be limited to 30%. The limitation of voice traffic ensures that when a failure occurs, the offered rate of voice will not exceed the service rate of any link along the alternate path. On the other hand, an overbooking of 150% is perfectly tolerable for data traffic because of the statistical multiplexing property and the lack of strict QoS guarantees. This case study could easily be extended to the Diffserv-aware MPLS TE case.
• The traffic matrix is such that a maximum of 20% of voice traffic is routed onto every link.
• Requirements for bandwidth and propagation delay protection only apply to the case of a single failure. In other words, multiple simultaneous failures are considered rare enough to be ignored.
Abbreviations
The following abbreviations are used in Figure 5.33:
• Phoenix: PHX
• Dallas: DAL
• Washington: WAS
• Miami: MIA
• Houston: HOU
• New York: NYC
• Atlanta: ATL
• Chicago: CHI
• Los Angeles: LAX
• Seattle: SEA
• San Francisco: SFO
• Salt Lake City: SLC
Objectives
• Objective 1: Both link and node failures must be protected with very fast convergence time.
• Objective 2: Voice and data traffic have different classes of recovery (CoR):
  • Data traffic must be rerouted within 50 ms in the case of a link and/or node failure, and QoS degradation is acceptable both in terms of propagation delay increase and bandwidth protection.
  • Voice traffic must be rerouted within 50 ms in the case of a link and/or node failure, and the offered QoS for voice must be guaranteed during failure (the time during which protected TE LSPs are rerouted over backup tunnels). Protected voice TE LSPs must be rerouted onto backup tunnels offering an equivalent bandwidth and a propagation delay increase bounded to 50%. Also, the total amount of voice traffic during failure must not exceed 30% of the link capacity (this is to ensure that the proportion of voice traffic is bounded to guarantee a strict QoS to voice traffic). Moreover, a voice TE LSP must always be soft preempted (if required).
• Objective 3: The backup bandwidth should be minimized (the network backup capacity is the bandwidth reserved to place the backup tunnels dedicated to reroute voice LSPs).
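The voice requirements of Objective 2 can be read as two simple acceptance checks on a candidate backup tunnel. The following Python sketch is one possible reading, not the authors' algorithm: the function names and parameters are hypothetical, and it assumes that the propagation delay increase is measured relative to the delay of the protected path segment.

```python
def voice_backup_ok(primary_delay_ms, backup_delay_ms,
                    protected_bw, backup_bw, max_delay_increase=0.5):
    """Check equivalent bandwidth and a propagation delay increase bounded to 50%.

    Assumption: the increase is measured against the delay of the protected segment.
    """
    delay_ok = backup_delay_ms <= primary_delay_ms * (1 + max_delay_increase)
    bandwidth_ok = backup_bw >= protected_bw
    return delay_ok and bandwidth_ok


def voice_share_ok(voice_bw_during_failure, link_capacity, max_share=0.30):
    """Check that voice traffic stays within 30% of the link capacity during failure."""
    return voice_bw_during_failure <= max_share * link_capacity


print(voice_backup_ok(10, 14, protected_bw=100, backup_bw=100))  # True
print(voice_backup_ok(10, 16, protected_bw=100, backup_bw=100))  # False
print(voice_share_ok(700, 2488))                                 # True
```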
Proposed Design
The requirement for fast recovery implies the use of Fast Reroute. Facility backup is selected for its scalability. Various backup tunnel path computation tools can be used to compute backup tunnel paths that fulfill the set of requirements. What follows is an example of potential output that highlights how the requirements can be met by a set of appropriate backup tunnels. Examples are provided for the backup tunnels required to protect against a failure of the node of Dallas and its adjacent links. Similar approaches are taken for the other links and nodes. The Internet traffic is IP routed (not carried onto TE LSPs); hence, no particular configuration is required for the Internet traffic. Note that tuned IS-IS timers allow a better recovery time in the case of failure, as mentioned in Chapter 4. The primary TE LSPs are configured in the following manner:
. Data TE LSP: fast-reroutable LSP
The following bits of the SESSION-ATTRIBUTE object of the RSVP Path message are set/cleared:
“Local protection desired” = 1
“Bandwidth protection desired” = 0
“Node protection desired” = 0
“Soft preemption desired” = 0
. Voice TE LSP: fast-reroutable LSP with bandwidth protection and soft preemption
The following bits of the SESSION-ATTRIBUTE object of the RSVP Path message are set/cleared:
“Local protection desired” = 1
“Bandwidth protection desired” = 1
“Node protection desired” = 1
“Soft preemption desired” = 1
Alternatively, the FAST-REROUTE object can be included in the RSVP Path message.
Link Protection
Every link is protected by two types of NHOP backup tunnels:
. A data NHOP backup tunnel: The only constraint for the data backup tunnel is to be SRLG diverse from the link to protect. Indeed, QoS degradation is acceptable for data traffic based on the set of requirements. No bandwidth needs to be reserved for data NHOP backup tunnels.
. A set of voice NHOP backup tunnels: Two constraints are taken into account for voice backup tunnels: the bandwidth and the propagation delay increase, which must be bounded to 50% (Figure 5.34).
Example: The PHX-DAL link is protected by the following set of NHOP backup tunnels:
Data backup tunnels: Because bandwidth/propagation delay protection is not required for data TE LSPs, a single backup tunnel BD1 (backup data) is configured and follows the path PHX-MIA-DAL. Note that BD1 cannot follow the path PHX-DEN-DAL (although this is a shorter path) because the links DEN-DAL and PHX-DAL belong to the same SRLG. Note that in the case of failure of the link PHX-DAL, traffic congestion may occur for data TE LSPs; indeed, no particular bandwidth constraint has been applied on the BD1 path computation. For example, the node DEN could also decide to route its NHOP backup tunnel protecting the TE LSPs traversing the link DEN-DAL along the path DEN-PHX-MIA-DAL. In the case of failure of the SRLG containing the links DEN-DAL and PHX-DAL, both backup tunnels would be simultaneously active. In addition, all the primary TE LSPs routed along the link PHX-MIA may send traffic. This shows how congestion of the data traffic could occur when no particular precaution is taken regarding the data backup tunnel path, but as mentioned in the set of requirements, this is considered perfectly acceptable for the data traffic in this case study (the backup tunnels are used for a short period [until the primary TE LSPs using them are reoptimized by their head-end LSR], so the potential congestion will also be temporary). Another alternative would have been to select the path PHX-LAX-SFO-SEA-BOS-NYC-WAS-DAL, where the minimum link bandwidth is OC48 (compared to OC3 along the path PHX-MIA-DAL), but then the propagation delay is much longer; this is a trade-off.
Figure 5.34 Link protection for data and voice backup tunnels. (Legend: OC3, OC48, and OC192 links; G = Gbps, M = Mbps. Tunnels shown: BD1 PHX-MIA-DAL; BV1 (1G) PHX-DEN-CHI-DAL; BV2 (250M) PHX-DEN-WAS-DAL; BV3 (500M) PHX-HOU-MIA-DAL; BV4 (250M) PHX-MIA-DAL.)
Voice backup tunnel: First, the required amount of bandwidth (protected bandwidth) must be computed. Because not more than 20% of voice is routed on every link (either because of the traffic matrix or because DS-TE is deployed in the network), the required capacity for the set of voice backup tunnels protecting the link PHX-DAL is 0.2 * OC192 = 2 Gbps. Because the other adjacent links are OC48, establishing a single 2 Gbps NHOP voice backup tunnel from PHX to DAL would result in carrying more than 30% of voice traffic during failure on every link. Let us consider the path PHX-DEN-CHI-DAL: Every link is OC192 and can carry up to 20% of voice at steady state (2 Gbps) and 30% during failure (3 Gbps). Routing a single voice backup tunnel of 2 Gbps would result in 2 Gbps + 2 Gbps = 4 Gbps worth of voice traffic: 40% of voice traffic!
Hence, the maximum backup capacity that can be used for voice backup on those links is 1 Gbps to respect the constraint of 30% of voice traffic on every link during failure. Therefore, multiple NHOP voice backup tunnels (noted as BV) are configured:
. BV1: 1 Gbps follows the path PHX-DEN-CHI-DAL.
. BV2: 250 Mbps follows the path PHX-DEN-WAS-DAL (the link WAS-DAL is an OC48 link, so the maximum backup capacity it can offer is (30% - 20%) = 10% of OC48, which is 250 Mbps).
. BV3: 500 Mbps follows the path PHX-HOU-MIA-DAL (along this path, the maximum backup capacity is 10% of 2 * OC48, because there are two OC48 links between HOU and MIA, which gives a total of 500 Mbps).
. BV4: 250 Mbps follows the path PHX-MIA-DAL.
So the sum of the bandwidths of BV1, BV2, BV3, and BV4 is 2 Gbps (as required), and during failure, no link carries more than 30% of voice traffic. The propagation delay increase constraint is also satisfied because the backup paths never increase the propagation delay by more than 50%, which was another requirement. Note that the path
PHX-SFO-SEA-BOS-CHI-DAL offers 1 Gbps of backup bandwidth but does not satisfy the propagation delay increase constraint.
Once the set of data and voice backup tunnels has been computed and signaled, when a data fast-reroutable TE LSP is signaled across the PHX-DAL link, the BD1 backup tunnel is selected. If the link PHX-DAL fails, the data TE LSPs are rerouted within 50 ms, without any QoS guarantees during the failure. The situation for voice TE LSPs is slightly different because the node of PHX (acting as a PLR) must select a voice backup tunnel among the set of available backup tunnels BV1, BV2, BV3, and BV4. Various algorithms can be implemented to make the appropriate selection. It is worth underscoring that a fast-reroutable TE LSP is always rerouted on a single backup tunnel; its traffic is not load balanced over multiple backup tunnels.
Node Protection
A set of data and voice backup tunnels must also be configured to protect the TE LSPs traversing the node of Dallas from a node failure. As in the case of link protection, for every NNHOP neighbor, a set of NNHOP backup tunnels must be computed. There are two types:
. Data NNHOP backup tunnel: The only constraint for the data backup tunnels is to be SRLG diverse from the node to protect. Indeed, QoS degradation is acceptable for data traffic based on the set of requirements.
. Set of voice NNHOP backup tunnels: Two constraints are taken into account for voice backup tunnels: the bandwidth and the propagation delay increase, which must be bounded to 50%. An approach similar to the one used for link protection is taken for the computation of the set of required NNHOP backup tunnels.
Illustration of the Concept of Bandwidth Sharing
The concept of bandwidth sharing between backup tunnels protecting independent resources is extensively covered in Section 5.15. Objective 3 of this case study clearly states that the network backup capacity must be minimized. Figure 5.35 shows how the backup bandwidth is shared between voice backup tunnels protecting independent network resources. In Figure 5.35 some link capacities have been modified compared to Figure 5.33. Let us consider the set of voice backup tunnels from the node of Phoenix required to protect the voice TE LSPs following the paths PHX-DAL-DEN and PHX-DAL-WAS in the case of a failure of the node Dallas. From the node of Phoenix, several sets of backup tunnels must be computed:
. BV1: from PHX to DEN to protect the TE LSPs following the path PHX-DAL-DEN against a node failure of DAL
. BV2: from PHX to WAS to protect the TE LSPs following the path PHX-DAL-WAS against a node failure of DAL
. BV3: from PHX to CHI to protect the TE LSPs following the path PHX-DAL-CHI against a node failure of DAL
. BV4: from PHX to MIA to protect the TE LSPs following the path PHX-DAL-MIA against a node failure of DAL
Figure 5.35 Bandwidth sharing between voice backup tunnels protecting different nodes. (Legend: OC3, OC48, and OC192 links. Tunnels shown: BV1, BV2, and BV5.)
Let us just consider the two sets of backup tunnels (BV1 and BV2) to simplify the example. One can make several interesting observations:
. Under the assumption that no more than 20% of voice traffic can be routed on every link, the required capacity for BV1 is 0.2 * min(PHX-DAL link capacity, DAL-DEN link capacity) = 0.2 * min(OC192, OC48) = 500 Mbps.
. Under the assumption that no more than 20% of voice traffic can be routed on every link, the required capacity for BV2 is 0.2 * min(PHX-DAL link capacity, DAL-WAS link capacity) = 0.2 * min(OC192, OC48) = 500 Mbps.
. In the case of failure of the Dallas node, both BV1 and BV2 are simultaneously active, so they cannot share any bandwidth along any path that they have in common. So the computed paths for the voice backup tunnels BV1 and BV2 are PHX-DEN and PHX-DEN-WAS, and the amount of voice bandwidth reserved from the backup bandwidth pool on the link PHX-DEN is 500 Mbps + 500 Mbps = 1 Gbps.
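The arithmetic behind these observations can be checked with a minimal sketch. It assumes the rounded link rates the text uses (OC48 as 2.5 Gbps, OC192 as 10 Gbps); the function names `required_nnhop_capacity` and `reserved_per_link` are hypothetical helpers introduced only for this illustration.

```python
# Nominal link rates in Mbps, rounded the way the text rounds them (assumption).
LINK_RATE_MBPS = {"OC48": 2_500, "OC192": 10_000}

VOICE_SHARE_PERCENT = 20  # at most 20% of voice on every link (case-study assumption)


def required_nnhop_capacity(upstream_link, downstream_link):
    """Voice bandwidth (Mbps) an NNHOP backup tunnel must be able to carry.

    The protected voice traffic crosses both links around the protected node,
    so it is bounded by the voice share of the smaller of the two links.
    """
    bottleneck = min(LINK_RATE_MBPS[upstream_link], LINK_RATE_MBPS[downstream_link])
    return bottleneck * VOICE_SHARE_PERCENT // 100


def reserved_per_link(active_tunnels):
    """Sum the bandwidth of simultaneously active tunnels on each link.

    active_tunnels is a list of (path, bandwidth_mbps); tunnels protecting the
    same resource are active together, so their bandwidths add up.
    """
    reserved = {}
    for path, bandwidth in active_tunnels:
        for link in zip(path, path[1:]):
            reserved[link] = reserved.get(link, 0) + bandwidth
    return reserved


bv1 = required_nnhop_capacity("OC192", "OC48")  # protects PHX-DAL-DEN
bv2 = required_nnhop_capacity("OC192", "OC48")  # protects PHX-DAL-WAS
dallas_failure = [(("PHX", "DEN"), bv1), (("PHX", "DEN", "WAS"), bv2)]
reserved = reserved_per_link(dallas_failure)
print(bv1, bv2)                   # 500 500
print(reserved[("PHX", "DEN")])   # 1000
print(reserved[("DEN", "WAS")])   # 500
```

Integer percentages are used instead of floating-point fractions so that the results compare exactly with the figures quoted in the text.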
Let us now consider the set of voice backup tunnels required to protect the TE LSPs following the SEA-CHI-NYC path in the case of a failure of the node of Chicago. The required capacity is 20% of min(SEA-CHI link capacity, CHI-NYC link capacity) = 500 Mbps. Hence, a single NNHOP voice backup tunnel BV5 that follows the path SEA-SLC-DEN-WAS-NYC satisfies the constraints in terms of required bandwidth, maximum amount of voice traffic routed on every link during failure (30%), and propagation delay increase bound (50%). Because BV1 + BV2 and BV5 protect voice TE LSPs against the failure of independent resources (respectively, the nodes of Dallas and Chicago), and thanks to the single failure assumption (valid in this case study as mentioned in the set of assumptions), they can share the backup bandwidth! So the amount of capacity required by both BV2 and BV5 on the DEN-WAS link is max(500 Mbps, 500 Mbps) = 500 Mbps and not the sum of their bandwidths. This very significantly reduces the required backup capacity in the network. Note that the maximum amount of backup bandwidth available on the link DEN-WAS is (30% - 20%) = 10% of an OC48 link = 500 Mbps. Without this property of bandwidth sharing, the link DEN-WAS could not have accommodated the bandwidth requirements of BV2 and BV5.
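The sharing rule can be sketched as follows: on each link, bandwidths of tunnels protecting the same resource add up, while across independent resources only the largest demand is reserved. This is an illustrative sketch under the single failure assumption, not a path computation tool; the function names are hypothetical, and the 500 Mbps headroom for DEN-WAS is simply the figure quoted in the text.

```python
from collections import defaultdict


def links_of(path):
    """Return the directed links traversed by a path given as a node sequence."""
    return list(zip(path, path[1:]))


def reservation_with_sharing(tunnels):
    """Per-link backup reservation when independent failures share bandwidth.

    tunnels: iterable of (protected_resource, path, bandwidth_mbps).
    """
    per_link = defaultdict(lambda: defaultdict(int))
    for resource, path, mbps in tunnels:
        for link in links_of(path):
            per_link[link][resource] += mbps  # same failure: demands add up
    # Only one resource fails at a time, so reserve the worst case per link.
    return {link: max(by_resource.values()) for link, by_resource in per_link.items()}


def reservation_without_sharing(tunnels):
    """Per-link backup reservation if every tunnel reserved its own bandwidth."""
    totals = defaultdict(int)
    for _resource, path, mbps in tunnels:
        for link in links_of(path):
            totals[link] += mbps
    return dict(totals)


TUNNELS = [
    ("DAL", ("PHX", "DEN"), 500),                        # BV1
    ("DAL", ("PHX", "DEN", "WAS"), 500),                 # BV2
    ("CHI", ("SEA", "SLC", "DEN", "WAS", "NYC"), 500),   # BV5
]
DEN_WAS_HEADROOM_MBPS = 500  # backup bandwidth available on DEN-WAS, as quoted in the text

shared = reservation_with_sharing(TUNNELS)
unshared = reservation_without_sharing(TUNNELS)
print(shared[("DEN", "WAS")], unshared[("DEN", "WAS")])    # 500 1000
print(shared[("PHX", "DEN")])                              # 1000 (BV1 and BV2 both protect DAL)
print(shared[("DEN", "WAS")] <= DEN_WAS_HEADROOM_MBPS)     # True
print(unshared[("DEN", "WAS")] <= DEN_WAS_HEADROOM_MBPS)   # False
```

The PHX-DEN link still needs 1 Gbps because BV1 and BV2 protect the same node, whereas DEN-WAS needs only 500 Mbps because BV2 and BV5 protect different nodes.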
5.12 Standardization
Before we elaborate on the standardization aspects of the MPLS traffic recovery mechanisms described in this chapter, one important comment is worth highlighting: You might have noticed that several of the references provided are IETF drafts that are not yet RFCs (Requests for Comments). Strictly speaking, even if an IETF draft is not yet a standard, this does not preclude the technology from already being available in commercial products and deployed in existing networks. A good illustration of this statement is MPLS TE local protection (Fast Reroute). At the time of writing, this is still an IETF draft, draft-ietf-mpls-rsvp-lsp-fastreroute, which will likely become an RFC soon. Some other drafts are still individual submissions that will potentially never become RFCs. Note that the IP and MPLS standards are specified by the IETF. The aim of this section is not to provide an exhaustive list of all the standards related to MPLS TE recovery but to highlight the most important ones. Several IETF drafts have been listed throughout the chapter in the related sections. All the related IETF drafts and RFCs can be consulted on the IETF web site, at www.ietf.org.
We first saw MPLS TE global default restoration: By definition, this does not imply any standards other than the RFC that defines the signaling extensions for MPLS TE: [RSVP-TE]. In addition, an interesting standard to consult is [FMRECOV], which defines a framework for MPLS-based recovery. The next MPLS TE recovery mechanism covered in this chapter was MPLS TE global path protection. Because it simply relies on the setup of diversely
routed TE LSPs, there is no specific standard in addition to MPLS TE. Indeed, the path computation of diversely routed TE LSPs does not need to be standardized. Finally, the last MPLS TE recovery mechanism that has been studied in detail is MPLS TE local protection (Fast Reroute): The main IETF draft specifying both the facility backup and the one-to-one backup local protection recovery techniques is [FAST-REROUTE]. In addition, various other drafts related to the backup path computation models are [FACILITY-BACKUP], [KINI], and [BP-PLACEMENT]. Several other IETF drafts have been listed throughout this chapter. Usually, the question that immediately arises is: Does a protocol specification have to be a standard to be implemented in commercial products? The answer is a definite no, and we saw several examples in this chapter. Indeed, a vendor may decide to implement a protocol that is not yet an RFC based on customers’ demands or on its confidence that the IETF draft will become an RFC.
5.13 Summary
The first part of this chapter is devoted to the study of the MPLS TE protection and restoration mechanisms. Global default restoration, the default rerouting mode of MPLS TE, was introduced first. Then various protection mechanisms were covered, starting with global path protection, which provides a substantially faster convergence time than global default restoration but adds a significant amount of backup state, which may be a limitation in large networks. Moreover, in networks where the propagation delay can be significant, a convergence time of a few tens of milliseconds is not achievable. Then a large part of this section was dedicated to Fast Reroute, which provides not only a fast convergence time (tens of milliseconds) upon a link or node failure but also strict QoS during failure in terms of bandwidth and propagation delay guarantees. Furthermore, MPLS TE allows the use of different classes of recovery assigned on a per-TE LSP basis, with a high degree of granularity. As mentioned in this chapter, strict QoS during failure is required neither in every network (but just where bandwidth is very scarce) nor for every traffic type. Finally, for the sake of reference, a last recovery mechanism (not really pursued in the industry) was briefly presented: 1+1 protection. Then a comparison of the different sets of mechanisms was provided with their respective advantages and drawbacks. Strictly speaking, load balancing cannot be considered a recovery technique. That said, it has been shown that load balancing can contribute to reducing the impact of a network failure on the traffic flow between two points, though with some drawbacks. An entire section dealt in detail with the delicate problem of failure detection and characterization and the impact of each failure profile on the forwarded traffic.
Finally, to conclude the first part of this chapter, three case studies were proposed, each having a different set of assumptions and objectives: from a simple objective of fast convergence upon link failure in an IP network to a more complex
network with a wide set of objectives including fast convergence upon link or node failures with different classes of recovery where network backup bandwidth has to be minimized. In the second part of this chapter (Sections 5.14 and 5.15), we explore some specific advanced topics, which are not required to understand the MPLS recovery mechanisms but might be interesting if you want to read advanced material on the subject. First, we investigate in detail the signaling extensions that have been specified for MPLS TE Fast Reroute. The reader interested in the mechanisms of Fast Reroute but not in the detailed signaling aspects may want to skip that section. In the first part of this chapter, we saw that both MPLS TE global path protection and local protection use preestablished backup tunnels; we will see that there are several techniques to compute the path of those backup tunnels depending on the set of recovery objectives and network constraints.
5.14 RSVP Signaling Extensions for MPLS TE Local Protection
This section defines the set of RSVP signaling extensions for the two local repair techniques: facility backup and one-to-one backup. Note that some RSVP extensions are common to both techniques, whereas others are specific to either facility backup or one-to-one backup.
5.14.1 SESSION-ATTRIBUTE Object
The format of the RSVP SESSION-ATTRIBUTE object carried in an RSVP Path message is depicted in Figure 5.36.
Figure 5.36 RSVP SESSION-ATTRIBUTE object. (There are two SESSION-ATTRIBUTE formats, with and without resource affinities; the figure shows the format with resource affinities, Class = 207, C-Type = 1. Fields shown, in 4-byte rows: Include-any, Exclude-any, Include-all, Setup Prio, Hold Prio, Flags, Name length, Session Name. Flags shown: Local Protection Desired, Label Recording Desired, SE Style Desired, Bandwidth Protection Desired, Node Protection Desired, Soft Preemption Desired, ERO Object Expansion Request.)
Let us detail the various flags defined in the RSVP SESSION-ATTRIBUTE object:
1. Setup and holding priorities: These fields are not specific to Fast Reroute. The setup priority characterizes the ability of a TE LSP to get resources, potentially preempting existing TE LSPs with lower priority. The holding priority defines the priority of a TE LSP once set up (used to decide whether the TE LSP can be preempted by another TE LSP). The reader may refer to [RSVP-TE] for details.
2. Flags field:
Local protection desired: 0x01
When set, this flag signals a fast-reroutable TE LSP.
Label recording desired: 0x02
This flag indicates that the labels used for this TE LSP must be recorded in the RSVP RRO object carried in the RSVP Resv message. The RRO object is described in Section 5.14.4. The label recording flag must be set for a fast-reroutable TE LSP. As mentioned earlier, label recording is necessary to discover the label used between the NHOP and NNHOP LSR in the case of facility backup with NNHOP backup tunnels. So when this flag is set, every node along the TE LSP path will insert a Label subobject (alongside its IPv4 subobject) in the RRO object carried in the RSVP Resv message, which travels in the upstream direction (from the tail-end LSR to the head-end LSR). This provides the required information about the label used by downstream nodes.
SE (Shared Explicit) Style desired: 0x04
The “Shared Explicit” flag allows two TE LSPs to share some reservation and is used when a TE LSP is rerouted (e.g., when a TE LSP is reoptimized along a shorter path, the new TE LSP shares its reservation with the “old” one before this “old” reservation is torn down: This is known as the make-before-break procedure). When requesting Fast Reroute, the head-end LSR should set this flag.
Bandwidth protection desired: 0x08
When set, this flag indicates that the TE LSP requests bandwidth guarantees during failure (the period during which the fast-reroutable TE LSP is rerouted onto its backup tunnel) and so should not suffer from QoS degradation during failure. If a different value for the bandwidth is requested during failure (less than the original bandwidth), then the bandwidth (in case of failure) is specified in the FAST-REROUTE object defined below.
Node protection desired: 0x10
When set, this signals to the LSRs along the path that an NNHOP backup tunnel should preferably be selected over an NHOP backup tunnel.
Soft preemption desired: 0x40
The soft preemption flag is used to indicate that soft preemption is desired (as opposed to “hard” preemption).
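As a minimal sketch of how these flag values combine into the single Flags byte, the snippet below builds the flag settings for the data and voice TE LSPs of the case study. The bit values are those listed above; the class and constant names are hypothetical, and adding the label recording and SE style bits to every fast-reroutable LSP is an assumption that follows the "must be set" and "should set" statements in the flag descriptions.

```python
from enum import IntFlag


class SessionAttrFlags(IntFlag):
    """Flag bits of the SESSION-ATTRIBUTE object, with the values given in the text."""
    LOCAL_PROTECTION_DESIRED = 0x01
    LABEL_RECORDING_DESIRED = 0x02
    SE_STYLE_DESIRED = 0x04
    BANDWIDTH_PROTECTION_DESIRED = 0x08
    NODE_PROTECTION_DESIRED = 0x10
    SOFT_PREEMPTION_DESIRED = 0x40


F = SessionAttrFlags

# Bits assumed common to every fast-reroutable TE LSP in this sketch.
FAST_REROUTABLE = (F.LOCAL_PROTECTION_DESIRED
                   | F.LABEL_RECORDING_DESIRED
                   | F.SE_STYLE_DESIRED)

# Data TE LSP: fast reroutable, no bandwidth/node protection, no soft preemption.
DATA_LSP_FLAGS = FAST_REROUTABLE

# Voice TE LSP: additionally requests bandwidth protection, node protection,
# and soft preemption.
VOICE_LSP_FLAGS = (FAST_REROUTABLE
                   | F.BANDWIDTH_PROTECTION_DESIRED
                   | F.NODE_PROTECTION_DESIRED
                   | F.SOFT_PREEMPTION_DESIRED)


def describe(flags_byte):
    """List the names of the flags set in a received Flags byte."""
    return [flag.name for flag in SessionAttrFlags if flags_byte & flag]


print(f"{int(DATA_LSP_FLAGS):#04x}")   # 0x07
print(f"{int(VOICE_LSP_FLAGS):#04x}")  # 0x5f
print(describe(0x11))  # ['LOCAL_PROTECTION_DESIRED', 'NODE_PROTECTION_DESIRED']
```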
Important note: It is important to underscore the term desired. If the request cannot be satisfied, the PLR can decide either not to set up the TE LSP or to select a backup tunnel not satisfying the request. This is a local decision. For instance, suppose that a TE LSP carrying voice traffic requests both Fast Reroute and bandwidth protection. If a PLR along the path cannot find a backup tunnel with the requested amount of bandwidth, the PLR may select a backup tunnel to fast reroute the TE LSP in the case of failure, even if the bandwidth request in the case of failure is not satisfied. Note that additional mechanisms should be used to ensure that this decision will preserve the bandwidth guarantees that might have been provided to other TE LSPs. Another object, the RRO object, described later in this section, is used by each PLR to indicate whether the request is satisfied, in other words, whether a backup tunnel has been selected, the nature of the selected backup tunnel (NHOP or NNHOP backup tunnel), and finally whether the bandwidth protection request could be satisfied.
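The "desired" semantics can be illustrated with a small sketch of one possible PLR selection policy. Everything here is hypothetical: the `BackupTunnel` and `Selection` types, the tunnel names, and the ranking rule (bandwidth first, then node protection) are assumptions chosen for the illustration; the text leaves the selection algorithm to each implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupTunnel:
    name: str
    kind: str        # "NHOP" or "NNHOP"
    free_mbps: int   # backup bandwidth not yet allocated on this tunnel


@dataclass(frozen=True)
class Selection:
    tunnel: Optional[BackupTunnel]
    node_protected: bool
    bandwidth_protected: bool


def select_backup(tunnels, lsp_mbps, want_node, want_bandwidth):
    """Pick one backup tunnel, falling back when a desire cannot be met."""
    if not tunnels:
        return Selection(None, False, False)

    def rank(t):
        bw_ok = t.free_mbps >= lsp_mbps
        node_ok = t.kind == "NNHOP"
        # A desire that was not requested counts as satisfied.
        return (bw_ok or not want_bandwidth, node_ok or not want_node)

    best = max(tunnels, key=rank)  # first tunnel with the highest rank
    return Selection(best, best.kind == "NNHOP", best.free_mbps >= lsp_mbps)


candidates = [
    BackupTunnel("T1", "NHOP", 1000),
    BackupTunnel("T2", "NHOP", 250),
    BackupTunnel("T3", "NNHOP", 100),
]

ok = select_backup(candidates, 300, want_node=True, want_bandwidth=True)
print(ok.tunnel.name, ok.node_protected, ok.bandwidth_protected)
# T1 False True

degraded = select_backup(candidates, 2000, want_node=True, want_bandwidth=True)
print(degraded.tunnel.name, degraded.node_protected, degraded.bandwidth_protected)
# T3 True False
```

In the second call no tunnel has enough bandwidth, yet a tunnel is still selected; the two booleans are what the PLR would then report upstream so that the head-end knows the request was only partly satisfied.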
5.14.2 FAST-REROUTE Object
The purpose of the FAST-REROUTE object is to signal the requirements of the backup tunnel to use for a fast-reroutable TE LSP. The FAST-REROUTE object is carried in RSVP Path messages, and its format is described in Figure 5.37. Each RSVP object has a class and a C-type. The class of the FAST-REROUTE object was not determined at the time of writing but will use the form 11bbbbbb for compatibility (this allows an RSVP implementation that does not recognize this object to just ignore it and forward it unchanged to the downstream nodes). The C-type value is 1.
Figure 5.37 FAST-REROUTE and DETOUR objects. ((a) FAST-REROUTE object, in 4-byte rows: Length (bytes), Class-num, C-Type; Setup Prio, Hold Prio, Hop Limit, Flags; Bandwidth; Include-any; Exclude-any; Include-all. (b) DETOUR object, in 4-byte rows: Length (bytes), Class-num, C-Type; PLR ID 1; Avoid Node ID 1; and so on up to PLR ID n; Avoid Node ID n.)
Let us now detail the different fields of the FAST-REROUTE object depicted in Figure 5.37(a).
Setup and holding priorities: Both the setup and the holding priorities are used to specify the priorities of the backup tunnel. They are used in the same way as for any other TE LSP, as defined in [RSVP-TE].
Hop-limit: The hop-limit field specifies the maximum number of hops between a PLR and an MP (a value of 0 means that just direct links can be used).
Flags: Two methods for Fast Reroute local repair have been presented in the first part of this chapter: facility backup and one-to-one backup. This field specifies the method requested at each PLR along the path:
One-to-one Backup Desired: 0x01
Facility Backup Desired: 0x02
Bandwidth: This field indicates the bandwidth required to protect the TE LSP. It is a 32-bit IEEE floating-point number, in bytes per second.
Exclude-any: A 32-bit vector representing a set of attribute filters associated with a backup path, any of which renders a link unacceptable.
Include-any: A 32-bit vector representing a set of attribute filters associated with a backup path, any of which renders a link acceptable (with respect to this test). A null set (all bits set to zero) automatically passes.
Include-all: A 32-bit vector representing a set of attribute filters associated with a backup path, all of which must be present for a link to be acceptable (with respect to this test). A null set (all bits set to zero) automatically passes.
Note: Using attribute filters can be very useful. Indeed, MPLS TE allows the use of colors (also called affinities; see [TE-REQ] for detailed requirements). Affinities can be used during path computation to include or exclude some particular links. For instance, let us suppose that links with long propagation delays are marked with the color red (this would correspond to a particular bit of the resource class affinity vector). This property is carried within the IGP TE extensions. Then one of the attributes of a primary TE LSP carrying voice traffic will be to exclude from its path all the red links (links with long propagation delays). The same set of rules applies to the backup tunnel. If the PLR needs to set up a backup tunnel to protect fast-reroutable TE LSPs requesting bandwidth protection, for instance, it can use affinities to avoid red links. Note that the affinity constraints of the fast-reroutable TE LSP may be different from those of the corresponding backup tunnel.
When used, the FAST-REROUTE object can only be inserted by the head-end of a TE LSP and cannot be changed by any other LSR along the TE LSP path.
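The field descriptions above can be turned into a short encoding sketch. Several points are assumptions: the byte order of the fields follows the rows of Figure 5.37(a) as far as they can be read, the 4-byte object header (2-byte length, class-num, C-type) is the generic RSVP object header, the class-num passed in is only a placeholder of the required 11bbbbbb form because the text gives no value, and the `RED` affinity bit is the hypothetical "long propagation delay" color from the note.

```python
import struct

ONE_TO_ONE_BACKUP_DESIRED = 0x01
FACILITY_BACKUP_DESIRED = 0x02
FAST_REROUTE_C_TYPE = 1


def pack_fast_reroute(class_num, setup_prio, hold_prio, hop_limit, flags,
                      bandwidth_bytes_per_s, include_any=0, exclude_any=0,
                      include_all=0):
    """Encode a FAST-REROUTE object (header plus 20-byte body), network byte order."""
    if class_num & 0xC0 != 0xC0:
        raise ValueError("class-num must have the form 11bbbbbb")
    body = struct.pack(
        "!BBBBfIII",
        setup_prio, hold_prio, hop_limit, flags,
        bandwidth_bytes_per_s,             # 32-bit IEEE float, bytes per second
        include_any, exclude_any, include_all,
    )
    header = struct.pack("!HBB", 4 + len(body), class_num, FAST_REROUTE_C_TYPE)
    return header + body


def link_acceptable(link_attrs, include_any=0, exclude_any=0, include_all=0):
    """Apply the three attribute filters to a link's affinity bits."""
    if link_attrs & exclude_any:
        return False                       # any excluded color rejects the link
    if include_any and not (link_attrs & include_any):
        return False                       # a null include-any set passes
    if include_all and (link_attrs & include_all) != include_all:
        return False                       # a null include-all set passes
    return True


RED = 0x1                    # hypothetical color for long-delay links
PLACEHOLDER_CLASS_NUM = 0xC0  # placeholder only; not a value from the text

# 500 Mbps of backup bandwidth is 62,500,000 bytes per second.
obj = pack_fast_reroute(PLACEHOLDER_CLASS_NUM, setup_prio=7, hold_prio=7,
                        hop_limit=3, flags=FACILITY_BACKUP_DESIRED,
                        bandwidth_bytes_per_s=62_500_000.0, exclude_any=RED)
print(len(obj))                                   # 24
print(link_acceptable(RED, exclude_any=RED))      # False
print(link_acceptable(0x2, exclude_any=RED))      # True
```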
5.14.3 DETOUR Object
The RSVP DETOUR object (whose format is depicted in Figure 5.37(b)) is specific to the one-to-one backup method and is used to identify Detour LSPs. The DETOUR object does not have a standardized class-num at the time of writing, but it will have the form 0bbbbbbb. It is worth noting that the high-order bit of the class-num is 0, which implies that a node receiving a Path message with a DETOUR object must reject the Path message if it does not support that object, and it must send an RSVP Path Error message to the PLR. The C-type is 7. Let us now describe the different fields of the DETOUR object.
PLR ID and Avoid Node ID: The PLR ID is an IPv4 address of the PLR, and the “Avoid Node ID” contains an IPv4 address of the immediate downstream neighbor (preferably its router-ID) that the PLR wants to avoid. The reason for multiple possible (PLR ID, Avoid Node ID) pairs is that Detour LSPs might be merged to reduce the total number of Detour LSPs in a network. In that case, when multiple Detour LSPs are merged by the Detour Merge Point (DMP), the DETOUR object of the merged Detour LSP contains all the pairs of (PLR ID, Avoid Node ID) of the merged Detour LSPs. An example will be provided later in this section that illustrates the use of the PLR ID and Avoid Node ID.
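A brief sketch can make the merging rule concrete: a DMP keeps every distinct (PLR ID, Avoid Node ID) pair of the Detour LSPs it merges. The helper names are hypothetical, the addresses are documentation examples, the 4-byte header is the generic RSVP object header, and the class-num is only a placeholder of the 0bbbbbbb form because the text gives no value.

```python
import ipaddress
import struct

DETOUR_C_TYPE = 7


def merge_detours(*pair_lists):
    """Union of (PLR ID, Avoid Node ID) pairs, keeping first-seen order."""
    merged = []
    for pairs in pair_lists:
        for pair in pairs:
            if pair not in merged:
                merged.append(pair)
    return merged


def pack_detour(class_num, pairs):
    """Encode a DETOUR object: header followed by 8 bytes per pair."""
    if class_num & 0x80:
        raise ValueError("class-num must have the form 0bbbbbbb")
    body = b"".join(
        ipaddress.IPv4Address(plr).packed + ipaddress.IPv4Address(avoid).packed
        for plr, avoid in pairs
    )
    return struct.pack("!HBB", 4 + len(body), class_num, DETOUR_C_TYPE) + body


PLACEHOLDER_CLASS_NUM = 0x3F  # placeholder only; not a value from the text

detour_a = [("192.0.2.1", "192.0.2.2")]   # PLR 192.0.2.1 avoids 192.0.2.2
detour_b = [("192.0.2.2", "192.0.2.3")]   # PLR 192.0.2.2 avoids 192.0.2.3
merged = merge_detours(detour_a, detour_b, detour_a)
obj = pack_detour(PLACEHOLDER_CLASS_NUM, merged)
print(merged)     # [('192.0.2.1', '192.0.2.2'), ('192.0.2.2', '192.0.2.3')]
print(len(obj))   # 20
```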
5.14.4 Route Record Object
Another important object to describe is the RRO object (Figure 5.38). This object has not been explicitly defined for Fast Reroute, but several new flags that are required for Fast Reroute have been added. The RSVP RRO object has a Class-num = 21 and a C-Type = 1.
Figure 5.38 RRO object and subobject. (The panels show the RRO object made of subobjects; the IPv4 address subobject with Type, Length, IPv4 Address, Prefix Length, and Flags fields; and the Label subobject with a 4-byte Label field.)
The RRO object is used to record the route, labels, and other useful information (detailed hereafter) along a TE LSP path and is made of variable-length subobjects.
IPv4 address subobject: This subobject is quite simple: The type 0x01 defines an IPv4 address, and the IPv4 address field specifies a regular IPv4 address of the recording node. Several important flags for Fast Reroute are then defined:
0x01 Local protection available: This flag indicates that a backup tunnel is available at the PLR adding the subobject.
0x02 Local protection in use: When Fast Reroute is triggered on a PLR because of a link or node failure, the PLR sets this flag in the corresponding IPv4 subobject. This indicates that Fast Reroute is in use and that the protected TE LSP is rerouted over a backup tunnel at this node. Before any failure occurs, this flag must be cleared.
0x04 Bandwidth protection: As mentioned above, a TE LSP has the option to signal its desire to be protected with a backup tunnel offering an equivalent bandwidth (the TE LSP is said to be “bandwidth protected”), either by setting the “bandwidth protection desired” bit in the SESSION-ATTRIBUTE object or by including a FAST-REROUTE object in the RSVP Path message. When the bandwidth protection request can be satisfied (a backup tunnel offering an equivalent bandwidth can be selected by the PLR), the “bandwidth protection” flag of the IPv4 subobject is set. If bandwidth protection is requested, then each PLR must set this flag appropriately. If bandwidth protection is not explicitly requested, the PLR may or may not set the bit.
0x08 Node protection: Protection from node failure can be explicitly requested for a particular TE LSP by setting the “node protection desired” bit in the SESSION-ATTRIBUTE object. If the PLR can find an NNHOP backup tunnel, then the “node protection” bit is set; otherwise (an NHOP backup tunnel has been selected), this bit is cleared, and just the “Local protection available” bit is set. Similar to the previous case, if node protection is requested, each PLR must set this flag appropriately. If node protection is not explicitly requested, the PLR may or may not set the bit.
As already mentioned, there may be some situations where a request cannot be fully satisfied. Suppose, for instance, that a TE LSP requests local protection (setting the “Local protection desired” bit of its SESSION-ATTRIBUTE object or using the FAST-REROUTE object) along a path R1-R2-R3-R4-R5. If all the nodes can select a backup tunnel in case of link/node failure except the node R3 (because its backup tunnel is down or just not configured), the RRO object carried in the RSVP Resv message sent from R5 to R1 (in the upstream direction) will contain a set of IPv4 subobjects listing all the nodes from R5 to R1, with the IPv4 subobject of R3 having its “Local protection available” bit cleared. Another example is a TE LSP that has requested bandwidth protection when the node R2 can find a backup tunnel, but not one offering an equivalent bandwidth. In that case, the “Bandwidth protection” bit of the IPv4 subobject of R2 will be cleared. The RRO object is a very efficient way of signaling the protection status at each hop. This can be used for troubleshooting on the head-end LSR or to take some appropriate actions at the head-end LSR.
Label subobject: This subobject contains a 32-bit label, is used to learn downstream labels, and must be included by each node if the “label recording desired” bit of the SESSION-ATTRIBUTE object carried in the RSVP Path message has been set. Note that the presence of this subobject is of the utmost importance for Fast Reroute facility backup so that the PLR learns the label to use when rerouting some protected TE LSPs onto an NNHOP backup tunnel, as previously explained in detail.
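The R1-R5 example can be sketched from the head-end's point of view: given the flags recorded by each hop, list the hops where the requested protection is missing. The flag values are those listed above; the class name, the function `protection_gaps`, and the per-hop flag bytes are hypothetical values chosen to reproduce the two situations described (R3 without a backup tunnel, R2 without bandwidth protection). The tail-end R5 is omitted because it does not act as a PLR for this TE LSP.

```python
from enum import IntFlag


class RroFlags(IntFlag):
    """Flags of the RRO IPv4 subobject, with the values given in the text."""
    LOCAL_PROTECTION_AVAILABLE = 0x01
    LOCAL_PROTECTION_IN_USE = 0x02
    BANDWIDTH_PROTECTION = 0x04
    NODE_PROTECTION = 0x08


def protection_gaps(recorded_hops, want_bandwidth=False, want_node=False):
    """Map each hop to the list of requested protections it could not provide."""
    gaps = {}
    for node, flags in recorded_hops:
        missing = []
        if not flags & RroFlags.LOCAL_PROTECTION_AVAILABLE:
            missing.append("local protection")
        else:
            if want_bandwidth and not flags & RroFlags.BANDWIDTH_PROTECTION:
                missing.append("bandwidth protection")
            if want_node and not flags & RroFlags.NODE_PROTECTION:
                missing.append("node protection")
        if missing:
            gaps[node] = missing
    return gaps


# Flags as they might be recorded in the Resv message (illustrative values).
recorded = [
    ("R1", 0x0D),  # available + bandwidth protection + node protection
    ("R2", 0x09),  # available + node protection, no equivalent bandwidth
    ("R3", 0x00),  # no backup tunnel
    ("R4", 0x05),  # available + bandwidth protection (NHOP backup tunnel)
]

print(protection_gaps(recorded, want_bandwidth=True))
# {'R2': ['bandwidth protection'], 'R3': ['local protection']}
```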
Vasseur / Network Recovery Final 9.6.2004 9:48pm
378 5.14.5
CHAPTER 5
page 378
MPLS Traffic Engineering Recovery Mechanisms
Signaling a Protected Traffic Engineering LSP with a Set of Constraints
As already mentioned, a head-end LSR can use either the ‘‘Local protection desired’’ bit of the SESSION-ATTRIBUTE object or the FAST-REROUTE object to signal the fast reroutable property of a TE LSP. Note that even if the FAST-REROUTE object is used, [FAST-REROUTE] recommends also setting the ‘‘Local protection desired’’ bit of the SESSION-ATTRIBUTE object. Some other parameters/constraints pertaining to the protected TE LSP can also be signaled: the request for bandwidth protection and/or node protection. This just requires setting the ‘‘bandwidth protection desired’’ and ‘‘node protection desired’’ bits, respectively, in the SESSION-ATTRIBUTE object of the RSVP Path message. If additional control over the backup tunnel is required, the head-end can also include a FAST-REROUTE object in the Path message, specifying the bandwidth, attribute filters, hop limit, and priorities that apply to the backup tunnel. If the head-end requires the PLRs along the TE LSP path to use a particular local repair technique (facility backup or one-to-one backup), the corresponding flag should be set in the FAST-REROUTE object.
An example of the mode of operation for facility backup with node protection has been provided in Section 5.5.4. It was mentioned that in this case, a label discovery process is required so the PLR can discover the label used between the NHOP and NNHOP to perform the appropriate label operation when Fast Reroute is activated. The complete backup label discovery process is described below. At this point, one just needs to mention that the ‘‘Label recording desired’’ bit must be set in the SESSION-ATTRIBUTE object of the RSVP Path message. This will trigger the label recording process at each hop from the TE LSP tail-end LSR to the head-end LSR.
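As a minimal sketch of the head-end side, the following Python fragment assembles the SESSION-ATTRIBUTE flags for a fast-reroutable TE LSP. The names SessionAttrFlags and fast_reroute_flags are hypothetical, and the bit values are assumptions taken from the RSVP-TE and Fast Reroute specifications (they are not listed in this section), so they should be checked against those documents.

```python
from enum import IntFlag


class SessionAttrFlags(IntFlag):
    # Bit values are assumptions (RSVP-TE / Fast Reroute specifications).
    LOCAL_PROTECTION_DESIRED = 0x01
    LABEL_RECORDING_DESIRED = 0x02
    BANDWIDTH_PROTECTION_DESIRED = 0x08
    NODE_PROTECTION_DESIRED = 0x10


def fast_reroute_flags(bandwidth_protection=False, node_protection=False,
                       label_recording=True):
    """Flags a head-end would set to signal a fast-reroutable TE LSP."""
    flags = SessionAttrFlags.LOCAL_PROTECTION_DESIRED
    if label_recording:  # needed for the label discovery process
        flags |= SessionAttrFlags.LABEL_RECORDING_DESIRED
    if bandwidth_protection:
        flags |= SessionAttrFlags.BANDWIDTH_PROTECTION_DESIRED
    if node_protection:
        flags |= SessionAttrFlags.NODE_PROTECTION_DESIRED
    return flags


flags = fast_reroute_flags(bandwidth_protection=True, node_protection=True)
print(f"{int(flags):#04x}")
```

The optional FAST-REROUTE object (bandwidth, attribute filters, hop limit, priorities) is deliberately left out of this sketch.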
5.14.6
Identification of a Signaled TE LSP
A TE LSP is uniquely identified by two objects carried in the RSVP Path message: the SESSION and the SENDER-TEMPLATE objects. More precisely, the following fields present in those two objects uniquely identify the TE LSP:
. The IPv4 (or IPv6) tunnel endpoint address (IPv4 [or IPv6] address of the egress node for the tunnel).
. The Tunnel ID (a 16-bit identifier used in the SESSION object that remains constant over the life of the tunnel).
. The Extended Tunnel ID (a 32-bit [IPv4] or 128-bit [IPv6] identifier used in the SESSION object that remains constant over the life of the tunnel). Normally set to all zeros. Ingress nodes that wish to narrow the scope of a SESSION to the ingress-egress pair may place their IP address here as a globally unique identifier.
. The IPv4 (or IPv6) tunnel sender address (IPv4 [or IPv6] address for a sender node).
. The LSP ID (a 16-bit identifier used in the SENDER_TEMPLATE and the FILTER_SPEC that can be changed to allow a sender to share resources with itself).
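The five fields listed above can be pictured as a simple record. The Python sketch below is illustrative only (TeLspId and its field names are hypothetical, and the addresses are made up); it shows that two LSPs of the same tunnel share the three SESSION fields and differ only by their LSP ID, which is what allows a sender to share resources with itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeLspId:
    tunnel_endpoint: str      # SESSION: address of the egress node
    tunnel_id: int            # SESSION: 16 bits, constant over the tunnel's life
    extended_tunnel_id: str   # SESSION: normally all zeros
    sender_address: str       # sender node address
    lsp_id: int               # 16 bits, may change over the tunnel's life

    def session(self):
        """The three fields carried in the SESSION object."""
        return (self.tunnel_endpoint, self.tunnel_id, self.extended_tunnel_id)


# Hypothetical addresses: the same tunnel signaled with two different LSP IDs.
old = TeLspId("10.0.0.6", 1, "0.0.0.0", "10.0.0.2", 1)
new = TeLspId("10.0.0.6", 1, "0.0.0.0", "10.0.0.2", 2)

print(old == new)                      # distinct TE LSPs
print(old.session() == new.session())  # same session, so resources can be shared
```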
With one-to-one backup, the backup LSP (also called Detour LSP) must be differentiated from the protected LSP. Likewise, when a protected TE LSP is fast rerouted using the facility backup method, the signaling must be updated so one can differentiate the fast rerouted TE LSP from the original one. This differentiation is necessary for merging and for appropriate state handling. Two methods have been defined to achieve this objective:
Method 1: The Sender-Template-Specific method (referred to as STS): With this method, when the RSVP Path message of the rerouted TE LSP is sent along the backup path, the five attributes mentioned above are unmodified, except the ‘‘IPv4 tunnel sender address,’’ which is set by the PLR to one of its local addresses (if the PLR is also the head-end LSR, this address must be different from the original one).
Method 2: The Path-Specific method (referred to as PS): With this second method, both the SESSION and the SENDER-TEMPLATE objects are unchanged, but an additional object (the DETOUR object) is added. This way, the protected TE LSP (also called the fast-reroutable TE LSP), which contains a FAST-REROUTE object or has the ‘‘Local protection desired’’ bit of its SESSION-ATTRIBUTE object set, can be differentiated from the backup LSP, which contains a DETOUR object.
Facility backup always uses the STS method, whereas one-to-one backup may use either the STS or the PS method.
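The difference between the two methods can be sketched as follows. This is a simplified illustration, not the [FAST-REROUTE] encoding: PathMessage and backup_path_message are hypothetical names, the DETOUR object is reduced to a boolean, and the addresses are made up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PathMessage:
    tunnel_endpoint: str
    tunnel_id: int
    extended_tunnel_id: str
    sender_address: str
    lsp_id: int
    detour: bool = False  # True when a DETOUR object is present


def backup_path_message(protected, plr_address, method):
    """Derive the Path message of the backup LSP from the protected one."""
    if method == "STS":
        # Only the tunnel sender address changes; it must differ from the original.
        if plr_address == protected.sender_address:
            raise ValueError("PLR must use an address different from the original sender")
        return replace(protected, sender_address=plr_address)
    if method == "PS":
        # Identification fields are unchanged; a DETOUR object is added.
        return replace(protected, detour=True)
    raise ValueError(f"unknown method: {method}")


protected = PathMessage("10.0.0.6", 1, "0.0.0.0", "10.0.0.2", 1)
sts = backup_path_message(protected, "10.0.0.3", "STS")
ps = backup_path_message(protected, "10.0.0.3", "PS")
print(sts.sender_address, ps.sender_address, ps.detour)
```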
5.14.7
Signaling with Facility Backup
Earlier in this chapter, we described the mode of operation of Fast Reroute facility backup. In a nutshell, to protect a facility like a link or a node, one or more backup tunnels are preestablished and maintained by a PLR. When a TE LSP is first signaled, a PLR analyzes the signaled parameters and selects the appropriate backup tunnel to use in case of a failure. All those operations are performed before any failure; upon a link or node failure, the protected TE LSPs are rerouted onto their backup tunnels. This section details the signaling operations performed by the PLR and the MP at each step of the rerouting procedure.
Point of Local Repair Behavior before the Failure
To select a backup tunnel for a TE LSP when the TE LSP is first set up, any PLR along the path first determines the TE LSP properties and requested attributes explicitly signaled through RSVP:
1. Label recording desired (mandatory with facility backup): If set, the PLR must insert a label subobject in the RRO object carried in the RSVP Resv message sent upstream.
2. Local protection desired: If the ‘‘Local protection desired’’ bit of the SESSION-ATTRIBUTE object of the corresponding RSVP Path message is set and/or a FAST-REROUTE object is present in the RSVP Path message (in this latter case, an optional preference for the facility backup or one-to-one local protection technique may be signaled), then the TE LSP is said to be ‘‘fast reroutable’’ and a backup tunnel must be selected. If the PLR can successfully select a backup tunnel for the TE LSP, then it must reflect this in the RRO object carried in the corresponding RSVP Resv message forwarded upstream: the ‘‘Local protection available’’ bit of the RRO IPv4 subobject is set; otherwise, the bit must be cleared. For example, if the backup tunnel selected for a protected TE LSP goes down, the bit must be cleared in the subsequent Resv messages sent upstream.
3. Bandwidth protection desired: If the ‘‘Bandwidth protection desired’’ bit of the SESSION-ATTRIBUTE object is set and/or a FAST-REROUTE object is present in the RSVP Path message with a ‘‘bandwidth’’ field set to the required bandwidth during failure, then a backup tunnel guaranteeing an equivalent QoS during failure should be selected. If the request can be satisfied, then the ‘‘Bandwidth protection’’ bit of the IPv4 RRO subobject carried in the corresponding RSVP Resv messages forwarded upstream must be set.
4. Node protection desired: If the ‘‘Node protection desired’’ bit of the SESSION-ATTRIBUTE object is set and/or a FAST-REROUTE object is present (with hop limit > 0), the PLR should try to find a backup tunnel that does not terminate at the NHOP (i.e., a backup tunnel that does not just protect against a link failure). If this is not possible, the ‘‘Node protection’’ bit of the IPv4 RRO subobject carried in the RSVP Resv message forwarded upstream must be cleared.
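The four steps above can be condensed into a small function that derives the RRO IPv4 subobject flags from the backup tunnel selected by the PLR. This is a minimal sketch under several assumptions: BackupTunnel and rro_flags are hypothetical names, the value 0x01 for ‘‘Local protection available’’ is taken from the RSVP-TE specification rather than from this chapter, and ‘‘equivalent bandwidth’’ is reduced to a simple comparison, which ignores the sharing of a backup tunnel between several protected TE LSPs. The sketch also sets the bandwidth and node protection bits whenever they apply, which is one of the choices left open to an implementation when the protection was not explicitly requested.

```python
from dataclasses import dataclass
from typing import Optional

LOCAL_PROTECTION_AVAILABLE = 0x01  # assumed value, not given in this chapter
BANDWIDTH_PROTECTION = 0x04
NODE_PROTECTION = 0x08


@dataclass
class BackupTunnel:
    is_nnhop: bool     # True if the tunnel terminates at the NNHOP
    bandwidth: float   # bandwidth the tunnel can offer to this TE LSP
    up: bool = True


def rro_flags(backup: Optional[BackupTunnel], lsp_bandwidth: float) -> int:
    """Flags the PLR reports in its RRO IPv4 subobject for one protected TE LSP."""
    if backup is None or not backup.up:
        return 0  # no usable backup tunnel: every protection bit is cleared
    flags = LOCAL_PROTECTION_AVAILABLE
    if backup.bandwidth >= lsp_bandwidth:
        flags |= BANDWIDTH_PROTECTION
    if backup.is_nnhop:
        flags |= NODE_PROTECTION
    return flags


print(hex(rro_flags(BackupTunnel(is_nnhop=True, bandwidth=20.0), 10.0)))
print(hex(rro_flags(BackupTunnel(is_nnhop=False, bandwidth=5.0), 10.0)))
print(hex(rro_flags(None, 10.0)))
```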
It is worth reemphasizing that the backup tunnel selection is a local decision and different implementations may make different choices; the bits defined above express a ‘‘desire.’’ So for instance, an implementation may decide to provide bandwidth guarantees to a fast-reroutable TE LSP if such a service can be offered even if bandwidth protection has not been explicitly requested, provided that the TE LSPs that have explicitly requested bandwidth protection can also be satisfied.
As already mentioned, if facility backup is in use, another task that the PLR must perform is to identify the label used between the NHOP and the NNHOP LSR for the protected TE LSPs for which an NNHOP backup tunnel has been selected. As illustrated in Figure 5.39, when the TE LSP T1 is first signaled, because the ‘‘label recording desired’’ bit of the SESSION-ATTRIBUTE object carried in the RSVP Path message is set, each node includes in the RSVP Resv message traveling in the upstream direction (from R6 to R2 in the example) both an IPv4 subobject and a label subobject. This way, the PLR R3, for instance, will learn the label used between R4 (NHOP) and R5 (NNHOP). Note that this is required with facility backup only if the selected backup tunnel is an NNHOP backup tunnel. Indeed, in the case of an NHOP backup tunnel, the label used is the same as the one used by the fast-reroutable TE LSP before the failure.
Figure 5.39 Illustration of the label discovery process with Fast Reroute facility backup. (R3 needs to discover the backup label used between R4 and R5; each recorded hop carries an IPv4 subobject and a label subobject.)
There is just one exception to this discovery procedure, which is related to per-interface label space platforms. By contrast with global label space platforms, where the label space is shared between all interfaces, some platforms (e.g., the ATM LSR platforms) have different label spaces per interface. Consequently, an MP may use different labels for a TE LSP on different interfaces. With a global label space platform, for a given incoming label, an MPLS packet will be identically switched regardless of the incoming interface. For instance, in Figure 5.39, when T1 is fast rerouted by the PLR R3, the traffic from T1 will be received either from the link R4-R5 (prior to failure) or from the link R11-R5 (during failure) but always with the same incoming label. Hence, it will be forwarded to the link R5-R6 in both cases. With a per-interface label space platform, the PLR will have to perform a specific procedure consisting of sending, before any failure, a Path message onto the backup tunnel (as if the protected LSP was fast rerouted) to discover the label that the MP (R5) expects to receive for T1 when Fast Reroute is triggered. Note that the vast majority of packet LSRs use a global label space.
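To make the label discovery process more concrete, here is a small Python sketch of what the PLR R3 does with the labels recorded in the RRO. It assumes a global label space on the MP, represents the RRO as a list of (node, label) pairs where each label is the one that node advertised to its upstream neighbor, and uses made-up label values; the function names are hypothetical. The ‘‘swap then push’’ operation shown is the usual facility backup label operation and is given here as an illustration only.

```python
def label_expected_by(rro, node):
    """Label that `node` advertised upstream, as recorded in the RRO."""
    for recorded_node, label in rro:
        if recorded_node == node:
            return label
    raise LookupError(f"no label recorded for {node}")


def rerouted_label_stack(rro, nnhop, backup_tunnel_label):
    """Label stack (top first) imposed by the PLR when rerouting onto an NNHOP tunnel."""
    inner = label_expected_by(rro, nnhop)   # swap: label the MP expects for the TE LSP
    return [backup_tunnel_label, inner]     # push: label of the backup tunnel


# Hypothetical labels recorded in the Resv message received by R3 for T1.
rro_at_r3 = [("R4", 17), ("R5", 22), ("R6", 30)]

# R5 is the NNHOP (and MP); 10 is the hypothetical label of the backup tunnel B1.
print(rerouted_label_stack(rro_at_r3, "R5", backup_tunnel_label=10))
```

With a per-interface label space on the MP, the inner label could not be read from the RRO in this way, which is why the additional Path message procedure described above is needed.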
Point of Local Repair Behavior during Failure
Upon failure detection, the PLR triggers Fast Reroute and the protected TE LSPs are rerouted onto their respective backup tunnels. Besides the traffic rerouting, the PLR must also perform a set of control plane operations. Because RSVP is a soft state protocol, the RSVP Path messages for the rerouted TE LSP(s) must be sent onto the backup tunnel to refresh the TE LSP states on downstream nodes. Indeed, without any specific action, the RSVP states for the rerouted TE LSP would not be refreshed and would time out; after a certain period, downstream nodes would tear down the TE LSPs. In the previous example, after Fast Reroute has been triggered, the PLR (R3) sends the RSVP messages of T1 onto the backup tunnel. Note that intermediate nodes (R10 and R11) do not see those control messages because they are label switched. The MP (R5 in this example) thus continues to receive RSVP Path messages and can refresh the corresponding RSVP states. Compared to the original RSVP Path message that was forwarded on the R3-R4 link before the failure, the RSVP Path message sent onto the backup tunnel contains the following changes:
. The ‘‘Local protection desired,’’ ‘‘Bandwidth protection desired,’’ and ‘‘Node protection desired’’ bits are cleared.
. The IPv4 tunnel sender address of the SENDER-TEMPLATE object is changed and set to a local address of the PLR (the STS method is always used by facility backup).
. The RSVP-HOP object is set to a local IPv4 address of the PLR.
. The ERO (Explicit Route Object) is updated: The ERO object carried in an RSVP Path message of a TE LSP always contains the list of hops that the TE LSP must follow. When a node receives an ERO object, it first checks that the first node listed in the ERO object corresponds to one of its local interfaces. So in Figure 5.39, without any specific ERO object update, R5 would receive an ERO object listing an address of R4 and not one of its own addresses. So the PLR (R3) needs to update the ERO object such that the next listed node is the MP (R5) before sending the RSVP Path message onto the backup tunnel.
. The RRO object is updated: The RRO object sent in Resv messages in the upstream direction (to R2) is updated as already mentioned, and the ‘‘local protection in use’’ bit of the IPv4 subobject is set.
Note: The RSVP messages sent onto the backup tunnel are Path, PathTear, and ResvConf messages.
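The changes listed above can be summarized by a small transformation of the Path message. The sketch below is a simplification: PathMessage and path_for_backup_tunnel are hypothetical names, node names stand in for addresses, only the fields discussed above are modeled, and the SESSION-ATTRIBUTE bit values are assumptions taken from the RSVP-TE and Fast Reroute specifications.

```python
from dataclasses import dataclass, replace
from typing import Tuple

# Assumed SESSION-ATTRIBUTE bit values (not listed in this section).
LOCAL_PROTECTION_DESIRED = 0x01
LABEL_RECORDING_DESIRED = 0x02
BANDWIDTH_PROTECTION_DESIRED = 0x08
NODE_PROTECTION_DESIRED = 0x10
PROTECTION_BITS = (LOCAL_PROTECTION_DESIRED | BANDWIDTH_PROTECTION_DESIRED
                   | NODE_PROTECTION_DESIRED)


@dataclass(frozen=True)
class PathMessage:
    sender_address: str        # tunnel sender address (SENDER-TEMPLATE)
    rsvp_hop: str              # RSVP-HOP
    session_attr_flags: int    # SESSION-ATTRIBUTE flags
    ero: Tuple[str, ...]       # remaining explicit route


def path_for_backup_tunnel(msg, plr_address, mp):
    """Path message the PLR sends onto the backup tunnel after the failure."""
    if mp not in msg.ero:
        raise ValueError("the MP must be on the path of the protected TE LSP")
    return replace(
        msg,
        sender_address=plr_address,                                   # STS method
        rsvp_hop=plr_address,
        session_attr_flags=msg.session_attr_flags & ~PROTECTION_BITS,
        ero=msg.ero[msg.ero.index(mp):],                              # MP listed first
    )


# Path message for T1 as R3 used to send it on the R3-R4 link.
before = PathMessage("R2", "R3-to-R4", 0x1B, ("R4", "R5", "R6"))
after = path_for_backup_tunnel(before, plr_address="R3", mp="R5")
print(after)
```

The RRO update carried in the Resv direction is not part of this sketch.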
Merge Point Behavior during Failure
Once Fast Reroute becomes active, the PLR starts sending Path messages onto the backup tunnel for every rerouted TE LSP. These messages are received by the MP, which in turn refreshes the corresponding states.
5.14.8
Signaling with One-to-One Backup
With one-to-one backup, the procedure is significantly different. Indeed, with facility backup, no specific signaling occurs on the backup path for the protected TE LSPs before the failure; the backup tunnel is maintained like any other TE LSP, independently of the set of protected TE LSPs that may use it. By contrast, with one-to-one backup, each protected TE LSP has a Detour LSP originating at the PLR and terminating at the tail-end LSR, which must be set up and maintained. In this section we describe the signaling operations used to set up and maintain those Detour LSPs.
1. Remember, the one-to-one backup technique may use either the sender-template-specific or the path-specific method to identify a protected TE LSP and its Detour LSP:
. If the sender-template-specific method is used, then when signaling the Detour LSP, the PLR replaces the IPv4 (IPv6) address present in the SENDER-TEMPLATE object by one of its local addresses (which must be different from the one used in the protected TE LSP). A DETOUR object may also be added, but this is not mandatory because the new address in the SENDER-TEMPLATE object is sufficient to differentiate it from the protected TE LSP.
. If the path-specific method is used, the PLR adds a DETOUR object in the Path message of the Detour LSP.
2. The ‘‘Local protection desired,’’ ‘‘Bandwidth protection desired,’’ and ‘‘Node protection desired’’ bits are cleared.
3. The PLR also removes the FAST-REROUTE object that may have been present in the original protected TE LSP.
4. The RSVP-HOP object is set to a local IPv4 address of the PLR.
5. The ERO object is updated; indeed, the ERO object of the protected TE LSP used to contain the list of hops to follow from the PLR to the tail-end LSR for the protected TE LSP. Let us illustrate the ERO object update operation through the example shown in Figure 5.40. In this example, the primary TE LSP T1 follows the path R2-R3-R4-R5-R6. When the PLR R3 signals its Detour LSP (called D1 in Figure 5.40), the ERO object is updated from R4-R5-R6 to R10-R11-R5-R6, which is the path computed by the PLR for the Detour LSP.
Figure 5.40 ERO object calculated for the Detour LSP with Fast Reroute ‘‘one-to-one backup.’’ (ERO object for the Detour LSP D1: R10-R11-R5-R6.)
6. The bandwidth advertised in the Sender-TSPEC object reflects the bandwidth of the Detour LSP. This is either the bandwidth explicitly specified in the FAST-REROUTE object signaled in the RSVP Path message of the protected TE LSP or, if there was no FAST-REROUTE object and the ‘‘bandwidth protection desired’’ bit was set in the SESSION-ATTRIBUTE object of the RSVP Path message, the bandwidth of the protected TE LSP.
7. The RRO object is updated: The RRO object sent in Resv messages in the upstream direction (to R2) is updated, as already mentioned; the ‘‘local protection in use’’ bit of the IPv4 subobject is set when Fast Reroute is triggered.
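Steps 5 and 6 of the list above lend themselves to a short illustration. In the Python sketch below, detour_ero and detour_bandwidth are hypothetical helper names; the ERO is derived from the path computed by the PLR (PLR included as first element), and the fallback value of zero, used when neither a FAST-REROUTE object nor the ‘‘bandwidth protection desired’’ bit is present, is an assumption of this sketch and is not stated in the text.

```python
def detour_ero(computed_path):
    """ERO of the Detour LSP: the path computed by the PLR, minus the PLR itself."""
    return computed_path[1:]


def detour_bandwidth(protected_bw, bw_protection_desired, fast_reroute_bw=None):
    """Bandwidth advertised in the Sender-TSPEC of the Detour LSP."""
    if fast_reroute_bw is not None:
        return fast_reroute_bw   # explicitly specified in the FAST-REROUTE object
    if bw_protection_desired:
        return protected_bw      # same bandwidth as the protected TE LSP
    return 0.0                   # assumption: nothing reserved otherwise


# Figure 5.40: the PLR R3 computes the detour R3-R10-R11-R5-R6 for T1.
print(detour_ero(["R3", "R10", "R11", "R5", "R6"]))
print(detour_bandwidth(10.0, bw_protection_desired=True))
print(detour_bandwidth(10.0, bw_protection_desired=True, fast_reroute_bw=4.0))
```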
5.14.9
Detour Merging
As pointed out previously, MPLS TE Fast Reroute one-to-one backup has the drawback of generating a potentially considerable number of TE LSPs because the number of required backup tunnels (Detour LSPs) is a function of the number of protected TE LSPs and the network diameter (number of hops traversed by each protected TE LSP). One way to alleviate this concern is LSP merging. Several rules are defined in [FAST-REROUTE] to handle Detour LSP merging, but the concept is quite simple and is illustrated in Figure 5.41.
Let us consider the network depicted in Figure 5.41. A protected TE LSP T1 is set up and follows the path R0-R1-R2-R3-R4-R5, and the local repair technique used in this network is one-to-one backup. Taking the PLR R0 as an example, R0 computes a Detour LSP D0 following the path R0-R6-R7-R8-R2-R10-R11-R4-R5 according to the requirements for the protection of T1 and the topology and resource information flooded by the IGP. Likewise, R1 computes a Detour LSP D1 following the path R1-R7-R8-R9-R4-R5 and R2 computes a Detour LSP D2 following the path R2-R8-R9-R4-R5. R3 and R4 perform the same operation, but for the sake of simplicity their respective Detour LSPs are not represented in the diagram.
Figure 5.41 Backup tunnel (Detour LSP) path computation with MPLS TE Fast Reroute ‘‘one to one.’’ (Situation prior to merging.)
As shown in Figure 5.42, R7 detects the presence of two Detour LSPs, D0 and D1, that both protect the same TE LSP, T1, so they can be merged. Because D1 has a shorter path than D0, the resulting merged Detour LSP will be D1 (note that when those Detour LSPs are merged, there are some additional rules to compute the resulting DETOUR object, which are defined in [FAST-REROUTE]). Likewise, R8 can also perform a merging of D1 and D2. Finally, a third detour merging operation can be performed by R4, but in this latter case, the situation is slightly different. Indeed, when the LSR R7, for example, performs a detour merging operation, it merges two Detour LSPs, whereas in the case of the LSR R4, a Detour LSP is merged with the primary TE LSP. When an LSR merges the protected LSP with a Detour LSP, the result is always the protected TE LSP. Figure 5.42 shows the result after merging.
Figure 5.42 Illustration of the merging rules with Fast Reroute ‘‘one-to-one backup.’’ (Situation after merging.)
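The two merging outcomes described above can be sketched as a simple selection rule. This is a deliberately reduced illustration and not the complete set of rules of [FAST-REROUTE]: LspAtNode and merge are hypothetical names, ‘‘shorter path’’ is interpreted as the smaller number of remaining hops from the merging node, and the computation of the resulting DETOUR object is ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LspAtNode:
    name: str
    remaining_ero: Tuple[str, ...]   # hops still to traverse from the merging node
    is_protected: bool = False       # True for the protected TE LSP itself


def merge(candidates):
    """Pick the LSP that survives when LSPs protecting the same TE LSP meet at a node."""
    for lsp in candidates:
        if lsp.is_protected:
            return lsp               # protected TE LSP + Detour LSP: protected wins
    return min(candidates, key=lambda lsp: len(lsp.remaining_ero))


# At R7 (Figure 5.42): D0 and D1 both protect T1.
at_r7 = [
    LspAtNode("D0", ("R8", "R2", "R10", "R11", "R4", "R5")),
    LspAtNode("D1", ("R8", "R9", "R4", "R5")),
]
# At R4: the Detour LSP D1 meets the protected TE LSP T1.
at_r4 = [
    LspAtNode("D1", ("R5",)),
    LspAtNode("T1", ("R5",), is_protected=True),
]

print(merge(at_r7).name, merge(at_r4).name)
```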
5.15 Backup Path Computation
In this section, we cover the aspects related to the backup path computation for each of the MPLS TE recovery techniques studied in this chapter. This section is quite dense because of the complexity of the problem to solve, which greatly varies with the set of objectives. Indeed, simple algorithms can be used to compute a diverse path for global path protection or local protection. By contrast, the algorithms that find a set of backup tunnels for Fast Reroute providing bandwidth guarantees and a bounded increase of the propagation delay, while trying to minimize the required amount of backup network capacity, can be very complex. This section deals with all the issues of backup tunnel path computation with respect to the set of recovery objectives.
5.15.1
Introduction
As previously mentioned, several aspects must be considered when evaluating a protection/restoration scheme. In the previous sections, we saw various MPLS traffic recovery techniques: global default restoration, global path protection, and local protection. Each recovery technique requires the computation of a backup path, which can be calculated ‘‘on the fly’’ with restoration techniques or precomputed when using protection techniques like global path protection and local protection. In this section, we focus on the backup path computation aspects of MPLS TE protection techniques.
The first aspect that comes to mind about recovery techniques is the recovery time, which is crucial but not the only one. Indeed, the QoS during failure (in other words, the QoS provided to the rerouted flows along the backup path) is also a very important aspect that is directly correlated to the backup path computation. The backup tunnel path computation complexity is essentially driven by the set of objectives and increases nonlinearly with the set of associated constraints. For instance, in the case of Fast Reroute, if the only objective is to compute a backup tunnel diversely routed from the protected section (link, node, SRLG) to provide fast convergence in case of resource failure, then the path computation complexity is not very high (rerunning a regular CSPF on a subgraph is usually sufficient). On the other hand, if the objective is to provide a recovery mechanism offering fast convergence and strict QoS guarantees during failure (e.g., bandwidth guarantee and bounded increase of the propagation delay) while trying to minimize the required backup capacity, then this increases the backup tunnel path computation complexity by an order of magnitude. This section explores those different requirements and details, for each of them, some possible backup tunnel path computation techniques.
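The simple case mentioned above (a regular CSPF rerun on a subgraph) can be sketched as a shortest path computation in which the protected element and the links with insufficient bandwidth are pruned. The Python code below is a minimal illustration: make_graph and cspf are hypothetical names, the topology is a reduced version of the R3-R4-R5 / R3-R10-R11-R5 example with made-up costs and bandwidths, and SRLG diversity is not modeled.

```python
import heapq


def make_graph(links):
    """Build an undirected graph from (a, b, cost, available_bandwidth) tuples."""
    graph = {}
    for a, b, cost, bw in links:
        graph.setdefault(a, []).append((b, cost, bw))
        graph.setdefault(b, []).append((a, cost, bw))
    return graph


def cspf(graph, src, dst, min_bw=0.0, excluded_nodes=frozenset(),
         excluded_links=frozenset()):
    """Shortest path from src to dst on the pruned graph, or None if none exists."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost, bw in graph.get(u, ()):
            if v in excluded_nodes or bw < min_bw:
                continue  # pruned: protected node or not enough bandwidth
            if (u, v) in excluded_links or (v, u) in excluded_links:
                continue  # pruned: protected link
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


graph = make_graph([
    ("R3", "R4", 1, 100.0), ("R4", "R5", 1, 100.0),
    ("R3", "R10", 1, 100.0), ("R10", "R11", 1, 100.0), ("R11", "R5", 1, 100.0),
])

# NNHOP backup tunnel at R3 protecting against a failure of node R4.
print(cspf(graph, "R3", "R5", excluded_nodes={"R4"}))
# NHOP backup tunnel at R3 protecting against a failure of link R3-R4.
print(cspf(graph, "R3", "R4", excluded_links={("R3", "R4")}))
```

Adding a bandwidth requirement is then a matter of passing min_bw; the harder problems discussed in the rest of this section (shared backup capacity, bounded delay) do not reduce to a single run of this kind.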
5.15.2
Requirements for Strict QoS Guarantees during Failure
Typically, voice traffic cannot undergo a long period of QoS degradation without the users perceiving it. So an operator may decide to provide QoS guarantees to the TE LSPs carrying voice traffic, even during failure. The same reasoning is likely to apply to ATM CBR traffic carried over MPLS. On the other hand, some other TE LSPs carrying less sensitive traffic could tolerate a QoS degradation during failure (until they are reoptimized by their respective head-end LSR). This can be part of the service-level agreement (SLA) between an operator and its customers. This highlights the notion of Class of Recovery (CoR) introduced earlier. As previously mentioned, in the case of Fast Reroute, the requirement for bandwidth guarantee during failure is explicitly signaled and so can be applied on a per-TE LSP basis, providing a high granularity.
5.15.3
Network Design Considerations
Every network is different, and the constraints on backup path computation are driven not just by the set of objectives but also by network design considerations. The aim of this section is to describe several typical network designs to illustrate the implications of network design on the backup path computation for a defined set of recovery objectives in terms of QoS during failure.
QoS Considerations in Typical Backbone Network Profiles
We can list three typical network designs:
1. Overprovisioned networks: A simple strategy to provide QoS is to put in place strict planning rules and make sure that the network always has enough bandwidth to accommodate the traffic demand while respecting the QoS objectives. For instance, if at any time the maximum utilization of any link is less than 20% (this is of course just an example for the sake of illustration), there is no need for any particular QoS and/or TE mechanism in the network. Note that failure simulation should help determine whether the utilization rule mentioned above is still valid under various failure scenarios. In such a network, the backup tunnel path just needs to be diverse from the protected section (link, node, or SRLG). This approach is simple and efficient but expensive (and requires some network planning tools and a reasonably accurate knowledge of the traffic matrix).
2. MPLS Diffserv-aware networks: In networks where multiple classes of service must be provided with different QoS objectives, one can use the Diffserv architecture, where the traffic is marked (colored) based on its CoS and then queued appropriately in the data plane to reach the QoS objectives on a per-CoS basis. In addition to the queuing, congestion avoidance disciplines like WRED can be used. This ensures that delay-sensitive traffic is serviced appropriately while best-effort traffic gets a lower priority and can potentially suffer from some congestion (of course the number of CoS is not limited to two). It is well known that strict bounds on delay and jitter can be provided to high-priority traffic (like voice), provided that appropriate queuing mechanisms are deployed in the network and the proportion of high-priority traffic served by the high-priority queue (usually a preemptive queue) is limited to a fixed percentage of the total amount of traffic forwarded on a specific link. So if the network is designed so that the proportion of voice traffic on every link is bounded to, for instance, 30% both at steady state and under failure scenarios, then no particular constraint needs to be applied to the backup tunnel path computation; the backup tunnel just needs to be diversely routed from the protected section (link, node, or SRLG).
3. Traffic-engineered networks: In some other networks, there is clearly a need for traffic engineering (voice, ATM, IP, and MPLS) to optimize network resource utilization. Various studies have been conducted during the last 20 years to propose IGP metric computation algorithms so that the traffic is routed in an ‘‘optimal’’ way, preventing situations where some links are heavily congested while other links in the network have spare capacity; these IP traffic engineering techniques have been covered in Chapter 4. Another way of achieving traffic engineering in MPLS-enabled networks is to rely on MPLS TE, where the traffic is routed on TE LSPs whose path computation is based on the network topology and available resources, with call admission control (CAC) schemes; in that case, one makes sure that a TE LSP is routed in the network so that every traversed network link can accommodate the traffic demand. Of course, traffic-engineered networks can also be Diffserv aware. Diffserv mechanisms ensure that each type of traffic receives the required level of QoS, whereas traffic engineering is in charge of computing a path that can meet the bandwidth and other requirements. Furthermore, MPLS Diffserv-aware TE allows different CAC schemes (and so underbooking/overbooking) to be enforced on a per-class type basis, which provides a very high degree of granularity.
Guaranteeing QoS during Failure
Things get a bit more complicated when QoS objectives must also be met during non-steady-state periods. What if a link or a node fails? An operator may simply decide that the QoS objectives need not be respected in the case of failure in its network. Let us first consider the simple case of an overprovisioned network (Figure 5.43).
Figure 5.43 Bandwidth guarantee during failure in an overprovisioned network. (All links are OC3 links except R1-R2, which is an OC48 link.)
Although this chapter is dedicated to MPLS TE, let us consider the case of the pure IP (non-MPLS) overprovisioned network depicted in Figure 5.43. Clearly, in such a network, even if IP traffic engineering techniques are used (tuning of the IGP metrics) to avoid congestion on any link at steady state, a link failure is likely to provoke congestion on alternate paths. As depicted in Figure 5.43, all the IP flows destined to R2 and beyond and traversing the nodes R3, R0, and R10 will be rerouted along their next shortest path in case of failure of the link R1-R2 (through the south path); even a maximum utilization of 30% on the link R1-R2 at steady state could potentially result in congestion of the links R10-R11, R11-R12, R12-R13, and R13-R2 in the case of failure of the link R1-R2 (30% of an OC48 is roughly 750 Mb/s, well above the capacity of about 155 Mb/s of an OC3 link). As pointed out in Chapter 4, some IGP metric optimization techniques try to solve that issue for both steady state and single network failure scenarios. The result varies with the effectiveness of the algorithm in use and the network topology. Also, the degree of granularity is relatively poor because all the IP traffic must be rerouted along the same alternate path, by contrast with MPLS TE, where several backup paths (backup tunnels) can be computed, each rerouting a subset of the traffic. Hence, MPLS TE provides a higher flexibility and granularity, which makes it easier to find appropriate backup paths providing QoS guarantees during failure. For instance, back to our previous example in Figure 5.43, in the case of failure of the link R1-R2, the TE LSPs originally routed through the link R1-R2 will be rerouted along alternate paths obeying the set of required constraints; so TE will play its role, trying to avoid congestion, and if necessary, multiple backup tunnels will be used to reroute the TE LSPs requiring an equivalent QoS during failure. During failure, protected TE LSPs are rerouted over their respective backup tunnels.
As illustrated in Figure 5.44, in this particular example, if a single NHOP backup tunnel is provisioned to reroute all the protected TE LSPs traversing the link R1-R2, then in case of failure of this link, fast recovery is certainly achieved, but without QoS guarantee. Hence, the solution consists of provisioning multiple backup tunnels (Figure 5.44). As shown in Figure 5.44, in the case of failure of the link R1-R2, three backup tunnels are used to reroute the set of primary TE LSPs requiring fast recovery and bandwidth protection that traverse the link R1-R2. This example illustrates how the use of multiple backup tunnels can help achieve the goal of QoS guarantees during failure.
Figure 5.44 Bandwidth guarantee during failure with Fast Reroute, using multiple backup tunnels. (All links are OC3 links except R1-R2, which is an OC48 link; Backup Tunnels 1, 2, and 3 protect the link R1-R2.)
Let us now consider another example (Figure 5.45). Figure 5.45 shows the situation of an MPLS TE network using Fast Reroute. Let us suppose that the NNHOP backup tunnels originated at R3, R0, R10, R11, and R12 are computed without trying to ensure bandwidth guarantees. What could happen (as depicted in Figure 5.45) is that those NNHOP backup tunnels are routed over the same path (the IGP shortest path). In this case, the sum of the traffic carried by the set of protected TE LSPs rerouted onto those tunnels will very likely provoke some congestion along the south path. This example highlights the fact that node failures usually have a greater impact than link failures, so the statements made in the case of link failure are even more valid in this case.
The examples above brought out several important considerations, which are worth summarizing before considering the backup path computation aspects in more detail. Let us again briefly consider the following steps during a failure process when using local protection:
• t0: The network element (link, node, or SRLG) failure occurs.
• t1: Protected TE LSPs are rerouted onto their respective backup tunnel.
• t2: TE LSPs are reoptimized by their respective head-end LSR along a new path satisfying their respective constraints (if such a path exists).
• t3: The failed resource is restored.

Figure 5.45 Bandwidth guarantee during failure. (All links are OC3 links except R1-R2, which is OC48.)
• During t1-t0: The traffic is dropped (t1-t0 is the recovery time).
• During t2-t1 (also called during failure): Protected TE LSPs are rerouted onto their respective backup tunnel.
• During t3-t2 (also called after failure): TE LSPs are rerouted over an alternate path (if such a path exists).
• After t3: The initial network capacity is restored.

Situation 1: The network is overprovisioned and QoS objectives can be met at steady state, during, and after the occurrence of a failure. For those networks, a perfectly reasonable approach consists of provisioning the backup tunnels without applying any constraint, except of course that of being diversely routed from the link/SRLG/node that they protect. In the case of failure, the traffic is quickly rerouted and does not suffer from any QoS degradation.

Situation 2: The network is overprovisioned at steady state, but upon a link or a node failure, congestion may appear. If QoS degradation is acceptable during failure (t2-t1) but not after failure (beyond t2), then a reasonable approach is to limit the backup path computation to a single constraint: being diversely routed. In this case, during failure, the rerouted TE LSPs may suffer from QoS degradation, but this is considered acceptable. After a short period, they will be rerouted along an alternate path (if such a path exists) that offers the required QoS. If QoS must also be guaranteed during failure, then the additional constraint is to compute backup tunnel paths such that the QoS is preserved along the backup path, at least for some Class of Recovery.

Situation 3: The network uses MPLS TE for network resource optimization and/or strict QoS guarantees at steady state. The same conclusions as in situation 2 apply.

In summary, the previous discussion demonstrates that the constraints on the backup paths are driven by both the QoS objectives and the network design.
In overprovisioned networks (at steady state and under failure) or in networks where QoS degradation during failure is acceptable, the constraint on the backup path is minimal: the backup tunnel path just needs to be diversely routed from the protected section, a problem whose complexity is no greater than that of computing a regular TE LSP path on a subgraph. On the other hand, in non-overprovisioned networks where QoS guarantees must be ensured during failure for some traffic, backup paths must satisfy additional constraints. Undoubtedly, MPLS TE makes those objectives more achievable by allowing those requirements to be restricted to a subset of the traffic and by using multiple backup paths for the different TE LSPs to reroute.

Before covering the backup path computation aspects in detail, there is another important fact to note. As already pointed out, in some networks, MPLS TE is deployed for the sole purpose of fast recovery, and several deployment scenarios have been described in Section 5.5. Let us consider the very realistic scenario where, at steady state, no particular traffic engineering measures need to be taken. The traffic load on every link is perfectly acceptable and the QoS objectives are met.
This does not mean that congestion cannot appear in some regions of the network under failure. For instance, a fiber cut or a core router failure can sometimes result in severe congestion spots even though the network load at steady state was perfectly acceptable without any need for traffic engineering. Hence, an interesting strategy can consist of deploying MPLS TE where the TE LSPs are configured with their respective bandwidth but follow the IGP shortest path at steady state (because every IGP shortest path has enough capacity to accommodate the traffic demand). That said, as pointed out, this may no longer be true during failure. During failure, the TE LSPs will then be rerouted over non-IGP shortest paths and congestion will be avoided or at least reduced. This is another application of MPLS TE: bandwidth optimization after failure (until the resource is restored). Note that some failures may last several hours or even days before being fixed.
5.15.4 Notion of Bandwidth Sharing between Backup Paths

The previous section provided several examples where backup tunnels must follow a path offering QoS guarantees (in terms of bandwidth and sometimes propagation delay). Backup tunnels are regular TE LSPs, so a simple approach consists of setting up backup tunnels with bandwidth, like any other primary TE LSP. But this may lead to very inefficient backup bandwidth usage, as shown in Figure 5.45. So at this point, the very important and simple notion of bandwidth sharing is introduced: Two backup tunnels can share some bandwidth only if they cannot be simultaneously active. For instance, as depicted in Figure 5.46, if two backup tunnels T1 and T2 protect two independent resources R2 and R7 and one makes the assumption that R2 and R7 cannot fail simultaneously, then the total amount of bandwidth that must be reserved on the links they both traverse is the maximum of their bandwidths, not the sum. This highlights why simply setting up backup tunnels with the required bandwidth would be quite inefficient in terms of network bandwidth usage.

In Figure 5.46, both T1 and T2 are backup tunnels used in the context of local protection to protect against a failure of the nodes R2 and R7, respectively. Suppose also that QoS guarantee during failure is required. If T1 and T2 require X and Y Mbps, respectively, one might think at first sight that the amount of bandwidth required for both T1 and T2 on the link R4-R5 is X + Y. But if we assume that either R2 or R7 can fail (but they cannot fail simultaneously), then T1 and T2 are never simultaneously active; hence, the bandwidth required for T1 and T2 on the link R4-R5 is max(X, Y) instead of X + Y, which results in a considerable gain in terms of required network backup capacity.

It is probably worth defining more accurately what "simultaneously fail" means.
When a link or a node fails (at time t0), the protected TE LSPs traversing the failed resource are rerouted onto their backup tunnel. Then those TE LSPs are rerouted by their respective head-end LSR along an alternate path (at time t2, according to the terminology previously introduced), if such a path exists. After a period of ta = t2 - t0, those TE LSPs are no longer rerouted over their backup tunnel. The single failure assumption assumes that a second failure will not happen during ta, so
Figure 5.46 Notion of bandwidth sharing between two backup tunnels protecting independent resources. (Single failure assumption: a simultaneous failure of the nodes R2 and R7 is assumed not likely to happen. The backup tunnel T1 protects R1 from a failure of the LSR R2; the backup tunnel T2 protects R6 from a failure of the LSR R7.)
two backup tunnels protecting independent resources cannot be simultaneously active. If a second failure occurs after ta, bandwidth protection can still be ensured (provided that the backup tunnels that used to be routed over the previously failed resources have been reestablished). Hence, the benefit of the single failure assumption is straightforward: bandwidth sharing between backup tunnels protecting independent resources is possible and results in very significant bandwidth savings, as the required amount of backup capacity in the network is drastically reduced. Moreover, the single failure assumption is considered perfectly realistic in many networks, especially because the probability of a router node failure is generally very low. Note also that the diagram depicted in Figure 5.46 is generic and applies equally to global or local protection (facility backup). In the former case, the backup TE LSPs are end to end (between head-end LSR and tail-end LSR). In the latter case, those two backup tunnels are between a PLR and an MP.

Now that the general concepts of QoS guarantees and bandwidth sharing have been illustrated, it is time to describe how those concepts apply to the backup path computation in the context of global path protection and local protection.
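The max-versus-sum rule for shared backup bandwidth can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `backup_reservation`, the link labels, the tunnel paths, and the values X = 40 Mbps and Y = 25 Mbps are hypothetical and are not taken from Figure 5.46 beyond the fact that T1 and T2 both traverse the link R4-R5.

```python
from collections import defaultdict


def backup_reservation(tunnels, share=True):
    """Backup bandwidth (Mbps) to reserve on each link.

    tunnels: iterable of (protected_resource, bandwidth_mbps, links).
    share=True applies the single failure assumption: tunnels protecting
    different resources are never active together, so only the worst
    single failure counts. share=False reserves every tunnel separately.
    """
    # demand[link][resource] = bandwidth rerouted onto `link` if `resource` fails
    demand = defaultdict(lambda: defaultdict(float))
    for resource, bandwidth, links in tunnels:
        for link in links:
            demand[link][resource] += bandwidth
    combine = max if share else sum
    return {link: combine(per_failure.values()) for link, per_failure in demand.items()}


# Hypothetical values: X = 40 Mbps for T1, Y = 25 Mbps for T2.
tunnels = [
    ("R2", 40.0, ["R1-R4", "R4-R5", "R5-R3"]),  # T1 protects against a failure of R2
    ("R7", 25.0, ["R6-R4", "R4-R5", "R5-R8"]),  # T2 protects against a failure of R7
]

shared = backup_reservation(tunnels)                 # max(X, Y) on R4-R5
unshared = backup_reservation(tunnels, share=False)  # X + Y on R4-R5
print(shared["R4-R5"], unshared["R4-R5"])
```

With these assumed values, the link R4-R5 needs 40 Mbps of backup bandwidth with sharing and 65 Mbps without it. Tunnels protecting the same resource are still summed, because they would be active at the same time.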
5.15.5 Backup Path Computation: MPLS TE Global Path Protection

As described in Section 5.4, global path protection requires the ability to compute diversely routed paths. Indeed, the backup path must be diversely routed from the primary TE LSP path. As already mentioned, two paths can be either link or node disjoint; it is obvious that two node-diverse paths are necessarily link diverse, but the opposite is not true. Hence, the constraint of finding node-diverse paths is
stricter than finding link diverse paths. Multiple algorithms have been proposed to compute link- or node-diverse paths (see [SURVIVABLE]) and a simple algorithm (referred to as the two-step approach) is described here.
A Simple Algorithm for Diverse Path Computation: The Two-Step Approach

A simple approach for computing two diverse paths is to use the two-step approach algorithm (referred to as the 2SA algorithm). This algorithm consists of first running CSPF to find the first path, then pruning every link (for link-diverse paths) or node (for node-diverse paths) traversed by this shortest path, and finally running a second iteration of CSPF to find the second path. Although very simple and fast, this algorithm has the following limitations:
• It may fail to find two link- or node-diverse paths for some pairs of nodes even if such a solution actually exists.
• The resulting solution may be suboptimal when the goal is to find two diverse paths such that the sum of their costs is minimal (Figure 5.47).

To illustrate this statement, let us consider the double-square network diagram depicted in Figure 5.47A and the two pairs of LSRs (R1-R3) and (R4-R3). This network has links with costs 1 or 2, as shown in the figure. Using the 2SA algorithm, two link- and node-diverse paths can easily be found between R1 and R3 (Figure 5.47A). On the other hand, the situation is different
Figure 5.47 Computation of diversely routed paths using the 2SA algorithm. (Link costs are 1 or 2. Panel (a): two link- and node-diverse paths can be found between R1 and R3 using the 2SA algorithm. Panel (b): two link- and node-diverse paths cannot be found between R4 and R3 using the 2SA algorithm.)
between the LSRs R4 and R3; the 2SA algorithm fails to find two link- or node-diverse paths (Figure 5.47B).

This raises another interesting question related to the objective that the computation of two diverse paths has to meet: How should one define the optimality criteria of the path computation? To illustrate this, let us go back to Figure 5.47, where it can easily be seen that two node-diverse paths can be found between R4 and R3. They would follow the paths R4-R1-R2-R3 and R4-R5-R6-R3. Remember, there may be two situations in which two diverse paths are required:
• Situation 1: MPLS TE global path protection is used (Section 5.4).
• Situation 2: The traffic between R4 and R3 is load balanced between two diverse paths (Section 5.7).

In situation 1, this means that the traffic follows a nonoptimal path at steady state (with a cost of 5 instead of 3 for the shortest path between R4 and R3). This is clearly a trade-off that should be considered when evaluating MPLS TE global path protection: at steady state, traffic may not follow an optimal path, just to satisfy the requirement of having a diverse path for the backup tunnel. Furthermore, the backup tunnel will be used only in the case of a failure along the primary TE LSP path, and for a short period, so is it worth following a nonoptimal path at steady state (the vast majority of the time) just to be able to get a diversely routed path under failure? A possible compromise is to add a constraint to the diverse path computation: a bound on the cost increase of the primary TE LSP path compared to the shortest path obeying the set of constraints. If no such paths can be computed, fall back to the global repair mechanism. For instance, if satisfying the constraint of finding two diversely routed paths results in a path cost increase of 50% for the primary TE LSP (compared to the shortest possible path satisfying the set of constraints), then global default restoration should be used instead of global path protection.

In situation 2 (load balancing), it would be interesting to find a solution where the sum of the costs of the two diverse paths is minimized, something that the 2SA algorithm cannot guarantee either. Let us consider Figure 5.48 and study the performance of the 2SA algorithm in trying to find two diverse paths such that the sum of their costs is minimized. Let us now run the 2SA algorithm and determine the sum of the costs of the two diverse paths between the node pair (R4, R3).
Figure 5.49 shows the two diverse paths that would be obtained by running the 2SA algorithm described above. As shown in Figure 5.49, the two diverse paths obtained with the 2SA algorithm have a sum of costs of 4 + 13 = 17, which is clearly not the best set of diverse paths that could have been obtained (here, the best set of diverse paths has a sum of costs of 11 instead of 17).

What does this highlight? The two examples depicted above show that the 2SA algorithm, though very simple, is not always very efficient because it sometimes fails to find two diverse paths even though such paths exist, and when an additional objective of minimizing the sum of costs of the two diverse paths is added, which
Figure 5.48 Optimization of the sum of the cost of two diversely routed paths.

Figure 5.49 Sum of the cost of two diversely routed paths using the 2SA algorithm. (With the 2SA algorithm, the sum of the costs is 4 + 13 = 17; with an optimized algorithm, the sum of the costs is 7 + 4 = 11.)
can be useful in the case of load balancing, for example, this algorithm cannot meet that objective either. Hence, more effective diverse path computation algorithms have been proposed that can always find diverse paths if such paths exist and that can compute two diverse paths such that the sum of their costs is minimized (see [SURVIVABLE]) where, for
instance, two runs of a (modified) Dijkstra algorithm (see footnote 71) allow for the computation of optimal diverse paths. Additional constraints, such as a trade-off between path diversity and path cost increase, can be added, but at the cost of increasing the algorithm complexity.
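Both limitations of the 2SA algorithm can be reproduced on a very small graph. The following Python sketch is illustrative only: the four-node topology (S, A, B, T) and its link costs are hypothetical and are not the networks of Figures 5.47 to 5.49, a plain Dijkstra search stands in for CSPF (no bandwidth or other constraints), and the reference solution is found by brute-force enumeration rather than by the optimized algorithms cited above.

```python
import heapq
from itertools import combinations


def build(edges):
    """Undirected graph as {node: {neighbor: cost}}."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def shortest_path(graph, src, dst):
    """Plain Dijkstra standing in for CSPF; returns (cost, path) or None."""
    heap, visited = [(0, [src])], set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph.get(node, {}).items():
            if nxt not in visited:
                heapq.heappush(heap, (cost + w, path + [nxt]))
    return None


def path_links(path):
    return {frozenset(pair) for pair in zip(path, path[1:])}


def two_step(graph, src, dst):
    """2SA for link-diverse paths: shortest path, prune its links, run again."""
    first = shortest_path(graph, src, dst)
    if first is None:
        return None
    used = path_links(first[1])
    pruned = {u: {v: w for v, w in nbrs.items() if frozenset((u, v)) not in used}
              for u, nbrs in graph.items()}
    second = shortest_path(pruned, src, dst)
    return None if second is None else (first, second)


def all_simple_paths(graph, src, dst, path=None):
    path = path or [src]
    if path[-1] == dst:
        yield path
        return
    for nxt in graph[path[-1]]:
        if nxt not in path:
            yield from all_simple_paths(graph, src, dst, path + [nxt])


def best_diverse_pair(graph, src, dst):
    """Brute-force reference (tiny graphs only): cheapest link-diverse pair."""
    best = None
    for p, q in combinations(all_simple_paths(graph, src, dst), 2):
        if path_links(p) & path_links(q):
            continue
        total = sum(graph[u][v] for path in (p, q) for u, v in zip(path, path[1:]))
        if best is None or total < best[0]:
            best = (total, p, q)
    return best


# Hypothetical "trap" topology: the shortest path S-A-B-T uses the links
# that both diverse paths (S-A-T and S-B-T) would need.
EDGES = [("S", "A", 1), ("A", "B", 1), ("B", "T", 1), ("S", "B", 3), ("A", "T", 3)]
trap = build(EDGES)
trap_plus = build(EDGES + [("S", "T", 10)])  # same graph plus an expensive direct link

print(two_step(trap, "S", "T"))           # 2SA finds no second path
print(best_diverse_pair(trap, "S", "T"))  # yet a diverse pair exists
print(two_step(trap_plus, "S", "T"))      # 2SA succeeds, with a high total cost
print(best_diverse_pair(trap_plus, "S", "T"))
```

On the first graph, 2SA returns nothing although the pair S-A-T and S-B-T exists. On the second graph, 2SA returns a pair with a total cost of 13, whereas the cheapest diverse pair has a total cost of 8.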
5.15.6 Backup Tunnel Path Computation: MPLS TE Fast Reroute Facility Backup

Let us now describe the backup path computation in the context of MPLS TE local protection.
Backup Tunnel Path Computation without QoS Guarantee during Failure

The simpler case of backup tunnel computation without QoS guarantee during failure is considered first. As already discussed, in several deployment scenarios, the only constraint that must be considered for the backup tunnel computation is to find a path diversely routed from the protected facility (link/node/SRLG). This can be for one of the following reasons:
• The network is overprovisioned: In this case, regardless of the backup tunnel path, the rerouted TE LSPs will follow a noncongested path. This ensures QoS guarantees during failure.
• QoS guarantee during failure is just not a requirement: Fast recovery is the only constraint, and the flows rerouted over a backup path can suffer from QoS degradation during failure for a limited period (until they are rerouted along another path by their respective head-end LSRs, provided such a path can be found).

Because the backup path computation complexity is drastically reduced in those cases, there are just two aspects to discuss:

1. Manual configuration versus dynamic backup tunnel path computation
2. Backup tunnel path computation triggers

Manual configuration versus dynamic backup tunnel path computation: As previously discussed, with MPLS TE Fast Reroute facility backup, the number of backup tunnels is a function of the number of protected resources, not the number of protected TE LSPs. Because the number of backup tunnels that must be configured is limited, the network administrator may just decide to manually configure the backup tunnel paths; in this case, no dynamic computation is performed by the LSRs. On the other hand, as stated earlier, the backup tunnel path computation is, in this case, quite straightforward and not CPU intensive, so another option is to rely on some distributed path computation where each PLR computes its own set of backup tunnels.

Footnote 71: Other algorithms can also be used.
Figure 5.50 Backup path computation of NHOP and NNHOP backup tunnels without strict QoS guarantee. (Panels: computation of an NHOP backup tunnel; computation of NNHOP backup tunnels.)
Let us consider the example in Figure 5.50, which depicts a simple network where the PLR R0 needs to set up the following set of backup tunnels:

• An NHOP backup tunnel to protect against a failure of the link R0-R1 (no constraint other than computing a diversely routed path).
• A set of NNHOP backup tunnels to protect against a failure of the node R1 (no constraint other than computing a diversely routed path). Note that one NNHOP backup tunnel is required per NNHOP.

The LSR R0 needs to perform the following steps:

Step 1: Compute an NHOP backup tunnel path to protect the link R0-R1. Because no constraint other than the diverse route computation is required for the NHOP backup tunnel, a simple algorithm consists of pruning the protected section (link R0-R1 in this case) and running CSPF over the remaining topology. The selected path will be the shortest path, taking into account either the IGP or the MPLS TE metric, because no bandwidth is required for the backup tunnel. The resulting NHOP backup tunnel is depicted in Figure 5.50.

Step 2: Compute a set of NNHOP backup tunnel paths, one for each NNHOP. In this particular example, R0 has four NNHOPs: R6, R7, R2, and R4. For each of them, the PLR R0 performs a CSPF computation over the remaining topology (after having pruned the protected resource R1).
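The two steps above can be sketched as follows. This is a minimal illustration, not an implementation of any router's behavior: the link list is a hypothetical topology in which R1 has the neighbors R6, R7, R2, and R4 (the exact links of Figure 5.50 are not recoverable here), all metrics are assumed equal to 1, and an unconstrained Dijkstra search stands in for CSPF.

```python
import heapq


def shortest_path(graph, src, dst, pruned_links=(), pruned_nodes=()):
    """Dijkstra over the topology minus the pruned links and nodes.

    Stands in for CSPF with no bandwidth constraint; returns (cost, path) or None.
    """
    pruned_links = {frozenset(link) for link in pruned_links}
    heap, visited = [(0, [src])], set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, metric in graph[node].items():
            if nxt in visited or nxt in pruned_nodes:
                continue
            if frozenset((node, nxt)) in pruned_links:
                continue
            heapq.heappush(heap, (cost + metric, path + [nxt]))
    return None


# Hypothetical topology, all metrics equal to 1.
LINKS = [
    ("R0", "R1"), ("R1", "R6"), ("R1", "R7"), ("R1", "R2"), ("R1", "R4"),
    ("R0", "R5"), ("R5", "R6"), ("R6", "R7"), ("R7", "R2"),
    ("R0", "R3"), ("R3", "R4"), ("R4", "R2"),
]
topology = {}
for a, b in LINKS:
    topology.setdefault(a, {})[b] = 1
    topology.setdefault(b, {})[a] = 1

# Step 1: NHOP backup tunnel, protecting the link R0-R1 (prune the link).
nhop = shortest_path(topology, "R0", "R1", pruned_links=[("R0", "R1")])

# Step 2: one NNHOP backup tunnel per neighbor of R1 (prune the node R1).
nnhops = sorted(n for n in topology["R1"] if n != "R0")
nnhop_tunnels = {
    n: shortest_path(topology, "R0", n, pruned_nodes={"R1"}) for n in nnhops
}

print("NHOP:", nhop)
for target, result in nnhop_tunnels.items():
    print("NNHOP to", target, ":", result)
```

Pruning the node R1 also removes every link attached to it, which is why the NNHOP computation needs no separate link pruning.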
Such a backup tunnel path computation is straightforward, and several existing implementations support dynamic backup tunnel computation.

Important note: So why not always adopt a dynamic backup tunnel path computation scheme? Although such a backup path computation can easily be handled in a distributed fashion, there might be a reason why manual configuration is required: the IGP TE extensions specifying the SRLG are not supported or not configured. In this case, the PLR does not have the knowledge required to compute an SRLG-diverse path. To illustrate this issue, let us consider Figure 5.51. In the example depicted in Figure 5.51, the two lightpaths interconnecting the pairs of LSRs (R1, R4) and (R4, R5) belong to the same SRLG. In other words, they have some equipment in common (at the optical layer in this case) whose failure would provoke the failure of both lightpaths. By default, the IP/MPLS layer does not have such visibility, and the topology seen by the IP/MPLS layer is reduced to the topology described in Figure 5.51B. Computing the backup tunnel path to protect the link R1-R4 from a link failure would result (in this simple example) in selecting the shortest path diversely routed from the protected link, hence the path R1-R5-R4 (supposing that all the links have an equal metric and the path satisfies the other constraints), as shown in Figure 5.52A.
Figure 5.51 IP/MPLS logical view. (Panel (a): optical layer, with the optical cross-connects OXC1 to OXC6 and the two lightpaths belonging to the same SRLG. Panel (b): IP/MPLS topology view.)
Figure 5.52 Computation of an SRLG diverse path. (Panel (a): "IP diverse" path. Panel (b): SRLG diverse backup path.)
Unfortunately, such a backup tunnel path would not be the right choice. Indeed, a failure of the SRLG shared by the links R1-R4 and R5-R4 would imply the failure of both the protected link R1-R4 and its associated backup tunnel because the link R5-R4 would also fail. This highlights the importance of being able to compute an SRLG-diverse path for the backup tunnel by means of, for example, distributed CSPF SRLG-diverse backup path computation algorithms.

Backup path computation triggers: Now another interesting question arises. When should a backup tunnel (with facility backup) path computation be triggered? The backup tunnel path computation and establishment can be triggered either when the link goes up (for an NHOP backup tunnel) or when the neighbor adjacency is first established (for an NNHOP backup tunnel). Another alternative is to set up an NHOP or NNHOP backup tunnel when the first protected TE LSP traversing the protected resource is signaled. Furthermore, a PLR can trigger backup tunnel path reoptimization at regular intervals to determine whether a better (shorter) path exists.
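The SRLG example of Figures 5.51 and 5.52 can be reproduced with a short sketch. Everything beyond the text is an assumption: the IP/MPLS link list is a guess at the topology of Figure 5.51B (only the links R1-R4, R1-R5, and R5-R4 are named in the text), all metrics are equal to 1, the SRLG identifier 1 is arbitrary, and plain Dijkstra again stands in for CSPF.

```python
import heapq


def shortest_path(graph, src, dst, excluded_links=frozenset()):
    """Unconstrained Dijkstra standing in for CSPF; returns (cost, path) or None."""
    heap, visited = [(0, [src])], set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, metric in graph[node].items():
            if nxt not in visited and frozenset((node, nxt)) not in excluded_links:
                heapq.heappush(heap, (cost + metric, path + [nxt]))
    return None


def backup_path(graph, srlgs, protected, srlg_aware):
    """Backup path for the protected link; optionally prune links sharing an SRLG."""
    src, dst = protected
    excluded = {frozenset(protected)}
    if srlg_aware:
        shared = srlgs.get(frozenset(protected), set())
        excluded |= {link for link, groups in srlgs.items() if groups & shared}
    return shortest_path(graph, src, dst, excluded)


# Assumed IP/MPLS view, all metrics equal to 1.
LINKS = [("R1", "R2"), ("R2", "R3"), ("R3", "R4"),
         ("R1", "R4"), ("R1", "R5"), ("R5", "R4")]
graph = {}
for a, b in LINKS:
    graph.setdefault(a, {})[b] = 1
    graph.setdefault(b, {})[a] = 1

# The lightpaths R1-R4 and R4-R5 share SRLG 1 (arbitrary identifier).
srlgs = {frozenset(("R1", "R4")): {1}, frozenset(("R4", "R5")): {1}}

ip_diverse = backup_path(graph, srlgs, ("R1", "R4"), srlg_aware=False)
srlg_diverse = backup_path(graph, srlgs, ("R1", "R4"), srlg_aware=True)
print(ip_diverse)    # shortest diverse path, unaware of the shared SRLG
print(srlg_diverse)  # longer path that avoids every link in the shared SRLG
```

Without SRLG knowledge the computation selects R1-R5-R4, which fails together with the protected link; with it, the longer path R1-R2-R3-R4 is selected.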
Backup Tunnels Path Computation with Strict QoS Guarantees during Failure

Undoubtedly, when QoS guarantees during failure are required, backup tunnel path computation becomes significantly more complicated because of the added requirement that the backup tunnel paths offer QoS guarantees (at least for some CoRs). This section explores the various aspects of the backup tunnel path computation to satisfy such a set of constraints. Strict QoS guarantees can be
reduced to the ability to reroute TE LSPs over a backup tunnel providing an equivalent bandwidth and sometimes a bounded increase of the propagation delay. This is why the terms QoS guarantee and bandwidth protection are used interchangeably throughout this section. It is worth reinforcing the fact that Fast Reroute is a temporary mechanism (i.e., a protected TE LSP is rerouted onto a backup tunnel until it gets reoptimized by its respective head-end LSR). Therefore, while the protected TE LSPs are rerouted over their backup tunnel, the QoS provided to those TE LSPs is dictated by the amount of bandwidth of the backup tunnel and the propagation delay experienced along the backup tunnel path. To compute a set of backup tunnels that satisfy such a set of requirements, one must follow several steps:

Step 1: First answer the following set of questions:
1. What is the amount of bandwidth to protect?
2. What is the network backup capacity?
3. What are the backup tunnel path computation triggers?

Step 2: Choose a backup tunnel path computation model.
Step 1: Answer the Following Set of Questions

1. What is the amount of bandwidth to protect? When trying to achieve bandwidth protection with Fast Reroute, one must first determine the amount of bandwidth to protect (also called the protected bandwidth). The protected bandwidth is the amount of bandwidth required for the backup tunnel(s) (i.e., the amount of bandwidth that needs to be protected). At first glance, this seems quite an obvious question. That said, there are two approaches that can be taken here, each having its respective pros and cons:

Approach 1: Protect the actual reserved bandwidth. To illustrate this first approach, let us consider the example of an OC3 link traversed by just 10 signaled TE LSPs, the sum of whose bandwidths is 50 Mbps. Suppose also that just a subset of them requires bandwidth protection and that the sum of their bandwidths is 30 Mbps. In this model, the idea consists of computing an NHOP or NNHOP backup tunnel having a capacity of 30 Mbps: The protected bandwidth is 30 Mbps. Indeed, why try to protect the entire OC3 capacity if only 30 Mbps worth of traffic must be protected? The advantage of this approach is that just the required amount of bandwidth is reserved, not more, which allows optimal backup bandwidth usage in the network. The immediate drawback is the requirement for more frequent backup tunnel path computations. Indeed, the protected bandwidth changes as new TE LSPs are signaled and torn down. If a new backup tunnel path computation is triggered each time the protected bandwidth changes in the network, this will generate the computation
and signalling of new backup tunnels more frequently. One might try to limit this frequency by introducing a threshold mechanism: for instance, for an OC3 link, set a threshold every 20 Mbps (a more efficient mechanism would adopt a nonlinear spacing of the thresholds, though). When the protected bandwidth crosses a threshold, a new backup tunnel path computation is triggered. Of course, another set of thresholds is defined for when the protected bandwidth decreases.

Approach 2: Protect a bandwidth pool regardless of the actual amount of reserved bandwidth. The protected bandwidth does not depend on the amount of bandwidth actually reserved by the set of protected TE LSPs requesting bandwidth protection that traverse a protected resource. So typically, if an OC3 link has a capacity of 155 Mbps, one tries to find a set of backup tunnels for 155 Mbps. Similarly, a protected SDH-SONET VC of 155 Mbps reserves 155 Mbps of backup capacity in the network, whether the protected VC carries some traffic or not.

Important notes:

• Because bandwidth protection can be requested on a per-TE LSP basis, if the operator knows a priori that the proportion of TE LSPs requesting bandwidth protection will never exceed x% of each link capacity, then the protected bandwidth can be limited to x% of each link capacity. For example, if bandwidth protection is required just for the voice traffic and the operator knows a priori that each link will never carry more than x% of voice traffic, then the required protected bandwidth for each facility to protect is limited to x% of the link capacity.
• When MPLS Diffserv-aware TE is configured on the network, more than one pool of bandwidth can be configured. The aim of such a model is to allow different CACs for different classes of traffic.
For instance, an OC3 link can be configured so that the maximum amount of voice traffic does not exceed a fixed portion of the link capacity, for instance, 50 Mbps, and the maximum amount of data traffic does not exceed 200 Mbps. This interesting model can guarantee different overbooking/underbooking ratios per class of traffic. In the example above, the amount of bandwidth admitted for the TE LSPs carrying voice will never exceed 50 Mbps, whereas up to 200 Mbps of TE LSPs carrying data traffic can be admitted on this OC3 link. A proper scheduling mechanism then needs to be configured to guarantee that each class of traffic will be served appropriately. Hence, the network administrator may decide to protect the bandwidth of a certain pool, for instance, the bandwidth pool dedicated to the voice traffic. This guarantees fast recovery with Fast Reroute for the data traffic and fast recovery with bandwidth guarantee for the voice traffic. So the protected bandwidth is in this case limited to a specific bandwidth pool, which reduces the amount of required backup capacity.
Though a bit less optimal because it potentially requires more protected bandwidth than necessary, this approach is more scalable than the previous one, as backup tunnel path computation is triggered much less frequently.

2. What is the network backup capacity? The backup capacity is defined as the network capacity that is dedicated to backup tunnels requiring bandwidth and that cannot be used by primary TE LSPs. The ratio of the required backup capacity to the available bandwidth is an important efficiency factor of a recovery mechanism. Typically, if the capacity required to provide bandwidth protection is 20% of the total network capacity, the recovery mechanism can be considered extremely efficient as far as bandwidth usage is concerned. Indeed, this means that just 20% of the network capacity is dedicated to backup while bandwidth protection can still be provided when required, by contrast with SONET-SDH, for instance, where a protected VC requires allocating twice the VC bandwidth: once for the primary VC and once for the backup. Hence, for each link, the network administrator defines the following:
• The primary bandwidth pool(s) (see footnote 72): This determines the maximum amount of bandwidth that can be admitted on the given resource for primary TE LSPs.
• The backup pool: The total amount of bandwidth that can be used by backup tunnels.

Footnote 72: Potentially multiple bandwidth pools will be defined on a per-class-type basis if MPLS Diffserv-aware TE is used. For the sake of simplicity, we consider that a single pool is defined.

This is illustrated in Figure 5.53. In Figure 5.53, the network administrator configures the proportion of the link that can be used by regular TE LSPs and the amount of bandwidth reserved for the backup tunnels used by TE LSPs requiring bandwidth protection. The overlay backup network is the network with link capacity equal to the backup bandwidth pool on each link.

Important note: An important aspect of the reserved backup capacity in an IP/MPLS network is that the bandwidth is unavailable in the control plane but still fully available in the data plane. So, for instance, if an OC3 link is configured with a backup pool of 30 Mbps and a reservable bandwidth pool for the primary TE LSPs of 125 Mbps, no more than 125 Mbps of bandwidth can be reserved by all the primary TE LSPs traversing the link (CAC function). That said, the bandwidth is still available in the data plane. In other words, the packets forwarded onto the link R0-R1 will be served at link speed rate. This offers a higher QoS at steady state (when the backup tunnels making use of the backup pool are not active), a major difference with other recovery mechanisms at lower layers, where the backup capacity cannot be easily reused by primary traffic. For instance, in the optical plane, the optical backup capacity cannot be used by the active primary optical paths. So to avoid some bandwidth waste, one technique consists of allocating the
Vasseur / Network Recovery Final 9.6.2004 9:48pm
404
CHAPTER 5
page 404
MPLS Traffic Engineering Recovery Mechanisms
Backup Capacity Primary Reservable Pool
Overlay Backup Network
Figure 5.53 Illustration of the network backup capacity.
backup capacity to low-priority optical paths that are preempted in the case of failure by the high-priority rerouted optical paths.
Overlay backup capacity network discovery: We describe below a backup tunnel path computation model whereby the entity responsible for the backup tunnel path computation first has to acquire the knowledge of the backup capacity on each link (i.e., the backup network capacity) before it can perform the computation. There are two ways by which the entity responsible for computing the backup tunnel paths can acquire the knowledge of the overlay backup network:
1. Via a local static configuration: The network administrator just manually configures the amount of backup capacity for each link in the network. An alternative would consist of using, for the capacity, the difference between the actual link speed and the maximum reservable bandwidth. Indeed, when a link is configured with MPLS TE, the network administrator configures the maximum reservable bandwidth, as already mentioned. Furthermore, the link speed is advertised by the IGP. So the entity could implicitly conclude that the backup capacity is equal to the link speed minus the maximum reservable bandwidth. For example, if an OC3 link is configured with a maximum reservable bandwidth of 120 Mbps, the entity in charge of the backup tunnel path computation could implicitly deduce that the backup capacity on this link is OC3 link speed - maximum reservable bandwidth = 155 Mbps - 120 Mbps = 35 Mbps. Unfortunately, this approach does not work with overbooking. Suppose that the
network administrator decides to apply overbooking on some links. If the maximum reservable bandwidth on the same OC3 link is 200 Mbps to allow for an overbooking ratio, the difference becomes negative and this strategy no longer works.
2. Via an automatic IGP discovery: This just requires a simple and straightforward IGP (OSPF or IS-IS) extension so that every node can explicitly signal through its IGP the amount of backup capacity on each of its attached links. Such an extension has been proposed in [FACILITY-BACKUP], and a new sub-TLV (called the backup bandwidth pool sub-TLV) has been defined. This IGP extension does not have any IGP scalability impact, which is an important aspect that must be highlighted. Indeed, every router advertises the backup bandwidth pool for each of its attached links, and in the FACILITY-BACKUP model studied in this section, this value does not change as new backup tunnels are dynamically signaled.
3. What are the backup tunnel path computation triggers? The same backup path computation triggers as in the previous case (backup tunnel computation without QoS) are valid here. In addition, a backup tunnel path computation is also triggered when there is a change in the protected bandwidth and/or the network backup capacity.
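The implicit derivation described under the local static configuration option can be illustrated with a minimal sketch. The function name and the choice of returning None for an overbooked link are assumptions made for illustration; only the 155 Mbps, 120 Mbps, and 200 Mbps values come from the text.

```python
from typing import Optional

OC3_MBPS = 155.0  # nominal OC3 link speed used in the text


def implicit_backup_capacity(link_speed_mbps: float,
                             max_reservable_mbps: float) -> Optional[float]:
    """Infer backup capacity as link speed minus maximum reservable bandwidth.

    Returns None when the link is overbooked (reservable > link speed),
    because the difference then says nothing about the backup capacity.
    This None convention is an assumption of this sketch.
    """
    if max_reservable_mbps > link_speed_mbps:
        return None
    return link_speed_mbps - max_reservable_mbps


print(implicit_backup_capacity(OC3_MBPS, 120.0))  # 35.0
print(implicit_backup_capacity(OC3_MBPS, 200.0))  # None (overbooked link)
```

The second call shows why an explicit value, configured or advertised by the IGP, is needed as soon as overbooking is applied.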
Step 2: Choose a Backup Tunnel Path Computation Model
Once the protected bandwidth on each link is determined and the network backup capacity is known, the next step is to choose a backup tunnel path computation model. Several models have been proposed, and listing all of them is virtually impossible. Some of these models rely on distributed backup tunnel path computation (each PLR is responsible for computing its set of backup tunnels) and others explicitly rely on centralized backup tunnel path computation. They all differ in their degree of efficiency, required set of protocol signaling extensions, complexity, and scalability, along with other criteria. Two models, known as the independent CSPF-based model and the facility backup model, are described in detail in the rest of this section, but bear in mind that they are not the only backup tunnel path computation models available.
Model 1: the independent CSPF-based model: A simple approach to provide fast recovery and bandwidth guarantees during a failure is to simply set up a backup tunnel with a bandwidth equal to the protected bandwidth. In this model, each PLR simply executes the following set of tasks:
• The PLR first determines the amount of bandwidth required (footnote 73) for the NHOP or NNHOP backup tunnel (protected bandwidth).
• It computes a path for the backup tunnel, applying the bandwidth constraint as for any regular TE LSP (it could use either the regular reservable bandwidth or the backup bandwidth).
• It sets up the backup tunnels with their associated bandwidth.
Footnote 73: The determination of the required amount of bandwidth to be protected is discussed later in this chapter.
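The path computation task in the list above can be sketched as a constrained shortest path first (CSPF) search: links without enough bandwidth are pruned, the protected node is excluded, and a shortest path is computed on what remains. This is only an illustrative sketch; the topology, link costs, and function name are hypothetical and do not reproduce any figure of this chapter.

```python
import heapq
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical topology: (node_a, node_b) -> (IGP cost, available bandwidth in Mbps).
Links = Dict[Tuple[str, str], Tuple[int, float]]


def cspf(links: Links, src: str, dst: str, bandwidth: float,
         excluded: FrozenSet[str] = frozenset()) -> Optional[List[str]]:
    """Lowest-cost path using only links with enough bandwidth, avoiding excluded nodes."""
    adj: Dict[str, List[Tuple[str, int]]] = {}
    for (a, b), (cost, avail) in links.items():
        if avail < bandwidth or a in excluded or b in excluded:
            continue  # prune links that violate the constraints
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    heap: List[Tuple[int, str, List[str]]] = [(0, src, [src])]
    visited = set()
    while heap:  # plain Dijkstra on the pruned graph
        dist, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in adj.get(node, []):
            if nxt not in visited:
                heapq.heappush(heap, (dist + cost, nxt, path + [nxt]))
    return None


links: Links = {
    ("R0", "R1"): (1, 10.0), ("R1", "R2"): (1, 155.0),
    ("R0", "R3"): (1, 155.0), ("R3", "R4"): (1, 20.0), ("R4", "R2"): (1, 155.0),
}
# NNHOP backup tunnel from R0 to R2 that must avoid the protected node R1.
print(cspf(links, "R0", "R2", 10.0, frozenset({"R1"})))  # ['R0', 'R3', 'R4', 'R2']
print(cspf(links, "R0", "R2", 30.0, frozenset({"R1"})))  # None
```

Each PLR would run such a computation on its own, without any knowledge of the backup tunnels computed by other PLRs.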
Although this approach is certainly simple and meets the requirements, it suffers from several limitations:
1. Bandwidth sharing between backup tunnels protecting independent resources cannot be performed, which requires much more backup bandwidth than necessary in the network.
2. In some cases, a placement of the backup tunnels cannot be found even if a solution exists.
Let us illustrate those limitations through two examples.
[Figure 5.54 Illustration of the inability to perform bandwidth sharing with the independent CSPF model. Nodes R0 to R11 with backup tunnels B1, B2, and B3; link labels 10M, 15M, and 30M; all other links are OC3.]
1. Inability to perform bandwidth sharing: Although the concept of bandwidth sharing has already been introduced, it is worth providing another example to highlight its benefits. As explained previously, under the single failure assumption, two backup tunnels protecting independent resources can share bandwidth. By contrast, two backup tunnels originated by two LSRs that protect against the failure of the same resource (link or node) cannot share their bandwidth because, upon failure of the resource they protect, they will be simultaneously active. In Figure 5.54, B1 and B2 are NNHOP backup tunnels originated on the PLRs R8 and R0, respectively, to protect against a failure of the node R1, whereas B3 is an NNHOP backup tunnel originated at the PLR R5 to protect against a failure of
the node R6. As mentioned in Figure 5.54, all links are OC3 except R0-R1 (10 Mbps), R8-R1 (15 Mbps), and R5-R6 (30 Mbps). In this example, we assume that the protected bandwidth is equal to the link bandwidth (e.g., a backup tunnel of 10 Mbps is required to protect from a failure of the link R0-R1 or the node R1). Moreover, the single failure assumption is made, which is that two LSRs cannot simultaneously fail. In this example, B1 and B2 both protect against the failure of the same resource (node R1). This means that upon R1's failure, both B1 and B2 will be active, so the required bandwidth on the link R3-R4, for instance, for both of them is 10 Mbps + 15 Mbps. On the other hand, because the backup tunnel B3 protects R5 from a failure of the node R6, the backup tunnel B3 can share the bandwidth with B1 and B2, so the required bandwidth on the link R3-R4 is max((B1 + B2), B3) = 30 Mbps and not 10 Mbps + 15 Mbps + 30 Mbps = 55 Mbps. So the single failure assumption allows the use of bandwidth sharing, and 25 Mbps of backup bandwidth is saved on the link R3-R4.
Another interesting fact is that in some scenarios, the actual amount of reserved bandwidth for B1 and B2 may not be the sum of their bandwidth. Suppose now that the bandwidth of the link R1-R2 (or bandwidth pool; see below) is equal to 5 Mbps. The maximum amount of traffic originated by R0 and R8 that can traverse the link R1-R2 is bounded by the R1-R2 bandwidth pool: 5 Mbps. So in this case, the total amount of protected bandwidth for B1 and B2 on the link R3-R4 is indeed 5 Mbps and not 10 Mbps + 15 Mbps = 25 Mbps. Therefore, this example clearly highlights the benefit of bandwidth sharing under the assumption of a single failure.
Unfortunately, with the independent CSPF model, each PLR determines the amount of bandwidth to be protected and sets up its own backup tunnel; there is no synchronization between PLRs. This is why no bandwidth sharing can be achieved.
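The sharing rule used in this example (sum the backup tunnels that protect the same resource, then take the maximum across independent resources) can be written down as a short sketch. The function names and the tuple layout are assumptions chosen for illustration; the bandwidth values are those of the example on the link R3-R4.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (tunnel name, protected resource, bandwidth in Mbps) for tunnels crossing one link.
Tunnel = Tuple[str, str, float]


def shared_backup_bandwidth(tunnels: List[Tunnel]) -> float:
    """Bandwidth needed on a link when sharing is allowed (single failure assumption)."""
    per_resource: Dict[str, float] = defaultdict(float)
    for _name, resource, bandwidth in tunnels:
        # Tunnels protecting the same resource are active together: sum them.
        per_resource[resource] += bandwidth
    # Independent resources never fail together: keep only the worst case.
    return max(per_resource.values(), default=0.0)


def unshared_backup_bandwidth(tunnels: List[Tunnel]) -> float:
    """Bandwidth reserved when every tunnel is signaled with its own bandwidth."""
    return sum(bandwidth for _name, _resource, bandwidth in tunnels)


on_r3_r4 = [("B1", "R1", 10.0), ("B2", "R1", 15.0), ("B3", "R6", 30.0)]
print(shared_backup_bandwidth(on_r3_r4))    # 30.0
print(unshared_backup_bandwidth(on_r3_r4))  # 55.0
```

The difference between the two results is the 25 Mbps saved on the link R3-R4.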
With the independent CSPF model, B1, B2, and B3 are signaled with 10 Mbps, 15 Mbps, and 30 Mbps, respectively, and the amount of reserved bandwidth on the link R3-R4 is 55 Mbps.
2. Inability to find a placement of the backup tunnels even if a solution exists (in some cases): This is the second limitation of the independent CSPF-based model. By definition, the independent CSPF model relies on the uncoordinated backup tunnel path computation of various LSRs; consequently, the order in which the backup tunnels are set up is arbitrary, which can result in the inability to find a backup tunnel placement even though a solution exists. Let us illustrate this statement through the example depicted in Figure 5.55.
[Figure 5.55 Illustration of the potential inability to find a placement of backup tunnel with independent CSPF model. Nodes R0 to R11 with backup tunnel B1 (10M); backup bandwidth and protected bandwidth labels of 10M and 20M; all links are OC3 links unless mentioned otherwise.]
In Figure 5.55, R0 requires an NNHOP backup tunnel of 10 Mbps (capacity of the link R0-R1) to protect against a failure of R1, and R8 requires an NNHOP backup tunnel of 20 Mbps (capacity of the link R8-R1) to protect against a failure of R1. Suppose also that the backup bandwidths on the links R3-R4 and R8-R9 are 20 Mbps and 10 Mbps, respectively. If the first node starting its backup tunnel computation is R0, it will likely select the shortest path satisfying the constraints for its backup tunnel: R0-R3-R4-R2. Then no path obeying the required constraint of 20 Mbps of bandwidth can be found by R8, although a backup tunnel placement could be found as depicted in Figure 5.56.
[Figure 5.56 Illustration of the potential inability to find a placement of backup tunnels with the independent CSPF model, although a solution exists. Same topology as Figure 5.55, with tunnel T1 (10M); all links are OC3 links unless mentioned otherwise.]
Of course, a solution could have been found even with the first placement by allowing load balancing across two backup tunnels of 10 Mbps each, following the paths R8-R0-R3-R4 and R8-R9-R10-R11. But with the
independent CSPF model, for a fixed number of backup tunnels (also called splits), a similar example could be constructed in which no solution can be found to satisfy the requirement of protecting X Mbps.
Model 2: the facility-based computation model: This model provides strict QoS guarantees to a set of specific protected TE LSPs requesting bandwidth protection, with efficient backup bandwidth usage. This aspect is indeed extremely important for cost effectiveness. The "facility-based computation" model is described in [FACILITY-BACKUP].
1. Centralized versus distributed path computation models: The facility-based computation model specifies two possible methods for the computation of the set of required backup tunnel paths:
a. The centralized model, in which a central server (also called a path computation element [PCE]; see footnote 74) computes the paths for the set of backup tunnels that protect all the network resources
b. The distributed model, in which each router (LSR) is responsible for the computation of a subset of backup tunnel paths
In any case, there is a set of variables that the PCE must take as input to perform backup tunnel path computation:
1. The amount of protected bandwidth
2. The backup capacity
3. The network topology and resources
Footnote 74: There are several possible terms to refer to the capability of computing a TE LSP path for a client LSR: path computation server, path computation element, and path computation router.
Centralized backup tunnel path computation: There are actually two subcases that must be considered independently, depending on whether the central PCE is responsible for both the primary and the backup tunnel path computations or just the backup tunnel path computation.
Situation 1: The PCE is responsible for both the primary and the backup tunnel path computations. This assumes that MPLS TE is used in the network for bandwidth optimization and/or strict QoS guarantees. In addition, Fast Reroute is deployed for fast recovery.
In this case, the PCE knows both the amount of protected bandwidth, which is equal to the actual reserved bandwidth (because it is also responsible for the primary tunnel placement), and the backup capacity, which is simply the remaining capacity once all the primary TE LSPs have been placed. So the PCE can protect one element at a time (an element being a link, a node, or an SRLG), using all the network backup capacity. This will ensure that bandwidth is shared between backup tunnels protecting independent resources; indeed, suppose that the PCE tries to compute the set of backup tunnels to protect all the TE LSPs
requesting bandwidth protection that traverse a node R1 in the case of failure of the node R1. The protected bandwidth is equal to the sum of their bandwidth and the PCE can use all the available bandwidth on every link not consumed by primary TE LSPs. Once this set of backup tunnels has been computed, the PCE can start considering the protection of the TE LSPs traversing another node R2. The amount of backup capacity available for that new set of backup tunnels is strictly equal to the amount of bandwidth considered in the previous case. Why? Simply because under the single failure assumption, the resources R1 and R2 cannot simultaneously fail so their respective set of backup tunnels cannot be simultaneously active and so they can share the backup bandwidth. A few comments can be made at this point: Comment 1: The backup tunnel path computation is an NP-complete problem whose complexity renders its computation intractable without the use of some heuristics to speed up the path computation. Comment 2: Some complex algorithms can be used to find an optimal placement for the primary TE LSPs while trying to fully protect bandwidth and achieving an optimized bandwidth sharing, but this might not always be possible. For the sake of illustration, if the bandwidth is a scarce resource and the bandwidth cannot be fully protected if the primary TE LSPs are placed in an optimal fashion, then the PCE may decide (based on some preconfigured local policy) to displace some primary TE LSPs from their optimal path to free up some bandwidth on some path to get a complete bandwidth protection. As far as the network topology is concerned, the PCE can acquire it either via routing or any connection to a seed router. In the first case, the PCE can be adjacent to any LSR in the network and run an IGP like IS-IS or OSPF. 
The only requirement is to make sure that the PCE sets the "Overload bit" for IS-IS or "Max metric" for OSPF so that the PCE is not considered as a router by other routers and is never included in their SPT. Another possibility is for the PCE to acquire the network topology (IGP database) via a Telnet session or the SNMP management information base. The PCE can collect the IGP database from any router in a routing area because all routers of the same routing area share an identical IGP database (which is a fundamental property of link state routing protocols). If the autonomous system is made of several areas, then the PCE needs to have at least one connection to a router in each area. It is worth pointing out that the acquisition of the network topology via routing offers a significant advantage: a real-time view of the network topology. As a reminder, bandwidth sharing relies on the single failure assumption (i.e., backup capacity cannot be shared by backup tunnels protecting nonindependent resources [resources that can fail simultaneously]). Thus, when a failure occurs, rapid backup tunnel recomputation makes the single failure assumption more reliable.
Situation 2: The PCE is responsible only for the backup tunnel path computation.
This case typically applies to two scenarios:
• Scenario 1: The primary TE LSP paths are computed in a distributed fashion (by each head-end LSR using a CSPF algorithm), whereas the backup tunnel paths are computed by the (centralized) PCE.
• Scenario 2: Separate centralized PCEs are used to compute primary and backup tunnel paths.
Scenario 1: Because the PCE is responsible only for the backup tunnel path computation, it cannot use the unreserved bandwidth (not used by the primary TE LSPs) for the backup capacity. Why? Let us suppose that the PCE, in order to compute the backup tunnel paths, uses the bandwidth left unreserved by the primary TE LSPs. It will be shown hereafter that the backup tunnels are signaled with 0 bandwidth; this is to avoid some extensions of the CAC process, but let us just make the assumption that backup tunnels are signaled with 0 bandwidth for the moment. So under the previous assumption, the PCE computes the set of backup tunnels (using the currently available bandwidth). The LSRs compute the paths for their primary TE LSPs and do not have any knowledge of the backup tunnels in place or of their respective computed bandwidth (backup tunnels are signaled with 0 bandwidth). They could therefore draw some bandwidth from the reservable bandwidth pool at any time, outdating the backup tunnel path computation. This explains why, when the PCE is responsible only for the backup tunnel path computation, it cannot consider the unreserved bandwidth as the backup capacity. The solution is to define non-overlapping pools for primary and backup tunnels (two pools are defined, one for the primary and one for the backup, and they do not overlap). This way, an LSR uses the bandwidth pool reserved for primary tunnels and the PCE uses the backup pool reserved for backup tunnels; there is no overlap, so the setup of a new primary TE LSP does not invalidate previously computed backup TE LSPs.
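A minimal sketch of non-overlapping pools on a single link follows. The class and method names are hypothetical; the pool sizes reuse the OC3 example given earlier in this section (125 Mbps primary pool, 30 Mbps backup pool). The point is only that the CAC for primary TE LSPs never draws on the backup pool, so the backup capacity seen by the PCE stays constant.

```python
from dataclasses import dataclass


@dataclass
class LinkPools:
    """Two non-overlapping bandwidth pools on one link (values in Mbps)."""
    primary_pool: float            # reservable by primary TE LSPs (CAC)
    backup_pool: float             # input to the backup tunnel path computation
    primary_reserved: float = 0.0  # bandwidth currently held by primary TE LSPs

    def admit_primary(self, bandwidth: float) -> bool:
        """CAC for a primary TE LSP: draws from the primary pool only."""
        if self.primary_reserved + bandwidth > self.primary_pool:
            return False
        self.primary_reserved += bandwidth
        return True


link = LinkPools(primary_pool=125.0, backup_pool=30.0)
print(link.admit_primary(100.0))  # True
print(link.admit_primary(30.0))   # False: would exceed the 125 Mbps primary pool
print(link.backup_pool)           # 30.0, unchanged by primary TE LSP setup
```

Because backup tunnels are signaled with 0 bandwidth, nothing in this admission check needs to know about them.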
Scenario 2: Scenario 2 is somewhat similar to scenario 1 because the PCE in charge of computing the backup tunnels paths cannot use the unreserved bandwidth (known by the other PCE responsible for the primary TE LSPs path computation); hence, non-overlapping bandwidth pools are also required. Distributed model: The aim of the distributed model is to distribute the backup tunnel path computation among several LSRs instead of relying on a central PCE to perform backup tunnel path computation. To avoid confusion, it is worth clarifying the notion of ‘‘distributed computation.’’ In the computer science world, the notion of distributed computation usually refers to the ability to involve several processors in a computation task. In the distributed facility computation model, the computation of a set of TE LSPs to protect a particular resource is always performed by a unique entity (in this case an LSR). The notion of distributed computation refers to the fact that the set of backup tunnels to protect a set of N resources is shared among several entities but the
set of backup LSPs required to protect a particular resource R is always computed by a unique PCE (an LSR in this case). Let us now consider the situations in which a set of backup tunnels must be computed to protect a node, a link, and an SRLG:
Situation 1: protection against a node failure
As previously pointed out, the set of backup tunnels that needs to be computed by a unique entity (PCE) is the set of backup tunnels protecting against the failure of a resource R. In other words, the set S of backup tunnels protecting against the failure of a resource R cannot be computed by different entities. Why? Because they cannot share bandwidth. Let us go back to the diagram depicted in Figure 5.54 for a moment. In the case of failure of the node R1, the backup tunnels B1 and B2 are simultaneously active. Moreover, backup tunnels are signaled with 0 bandwidth for a reason detailed later in this section. So the implication is that a unique entity must be responsible for the computation of all the backup tunnels that protect against the failure of R1 (this is required to make sure that the backup tunnel paths offer the required bandwidth). A very natural choice for this entity is the node R1 itself! In the distributed model, to protect against a node failure (the failure of R1), R1 will compute all the backup tunnels from every neighbor to their set of next-next hops: from R0 to R2, R0 to R8, R0 to R9, R0 to R10, R8 to R0, R8 to R2, R8 to R9, R8 to R10, R9 to R8, R9 to R0, R9 to R2, R9 to R10, R10 to R9, R10 to R8, R10 to R0, R10 to R2, R2 to R10, R2 to R9, R2 to R8, and R2 to R0. Likewise, R6 performs the computation of backup tunnels from each of its neighbors to their NNHOP in the case of its own failure (from R5 to R7 and R7 to R5). Neither synchronization nor communication is required between the two PCEs R1 and R6 because they compute backup tunnels to protect independent resources (R1 and R6).
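The list of NNHOP backup tunnels that R1 computes for its own failure is simply every ordered pair of distinct neighbors of R1. As a small sketch (the function name is hypothetical, and the neighbor sets are the ones implied by the enumeration above), this can be generated as follows:

```python
from itertools import permutations
from typing import Iterable, List, Tuple


def nnhop_backup_pairs(neighbors: Iterable[str]) -> List[Tuple[str, str]]:
    """Ordered (PLR, merge point) pairs for the failure of a node with these neighbors."""
    return list(permutations(neighbors, 2))


r1_pairs = nnhop_backup_pairs(["R0", "R8", "R9", "R10", "R2"])
r6_pairs = nnhop_backup_pairs(["R5", "R7"])
print(len(r1_pairs))  # 20
print(r6_pairs)       # [('R5', 'R7'), ('R7', 'R5')]
```

A node with n neighbors therefore computes at most n(n - 1) backup tunnels for its own failure.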
Each of them may use the whole backup network capacity, which allows them to naturally perform bandwidth sharing. As in the case of the centralized model with the PCE responsible for the backup tunnel path computation only, a separate backup bandwidth pool is required. Communication between a node acting as a PCE and its neighbors requires a signaling protocol detailed later in this section.
Situation 2: protection against a link failure
To protect a link L, if unidirectional TE LSPs are used, two NHOP backup tunnels are required (one in each direction). If the link fails in one direction (e.g., a laser on the sender side or a photodiode on the receiver side fails), then one NHOP backup tunnel is used. On the other hand, in the case of a bidirectional link failure (e.g., fiber cut), both NHOP backup tunnels will be used. The two NHOP backup tunnels offering bandwidth protection must therefore be computed by a single PCE to avoid a bandwidth protection violation. This is illustrated in Figure 5.57.
[Figure 5.57 Computation of NHOP backup tunnels with bandwidth protection with the facility backup model. Nodes R0 to R5 with backup tunnels B1 and B2; label: NO Bandwidth Sharing.]
So let us consider the network depicted in Figure 5.57: An NHOP backup tunnel B1 protects the fast-reroutable TE LSPs traversing the link R1-R2 against a
failure of the link R1-R2. Another NHOP backup tunnel B2 protects the fast-reroutable TE LSPs traversing the link R2-R1 against a failure of the link R2-R1. Let us now suppose that the two NHOP backup tunnels B1 and B2 are computed independently. Because each computation is unaware of the other, the two tunnels may end up sharing bandwidth. But in the case of a bidirectional link failure, both NHOP backup tunnels will be active, which will result in a bandwidth protection violation, hence the requirement for the two NHOP backup tunnels to be computed by a single PCE. This can clearly be seen in Figure 5.57; in the case of a bidirectional failure of the link R1-R2, both B1 and B2 are simultaneously active on the link R4-R5, which results in a bandwidth protection violation. A simple solution consists of electing one of the two ends of the link as the PCE for the computation of the set of NHOP backup tunnel paths protecting against a bidirectional link failure (e.g., the LSR with the smaller router ID could be selected).
Situation 3: protection of an SRLG
Likewise, the protection of an SRLG requires electing a PCE to compute the set of required backup tunnels in the case of failure of this SRLG (Figure 5.58). As shown in Figure 5.58, the computation of the set of backup tunnels required to protect against a failure of SRLG S1 must be performed by a unique PCE elected among the set of LSRs R1, R4, and R5.
Signaling of backup tunnels with 0 bandwidth
Several times in this section, we made the statement that backup tunnels providing QoS guarantees are signaled with 0 bandwidth in the "facility-based computation" model. To illustrate why backup tunnels are signaled with 0 bandwidth (although their path is computed to provide bandwidth guarantees), let us consider Figure 5.59.
[Figure 5.58 Protection of an SRLG with the facility backup model. Routers R1 to R5 and optical cross-connects OXC1 to OXC6; SRLG S1.]
[Figure 5.59 Signaling backup tunnels with 0 bandwidth with the facility backup model. Nodes R0 to R11 with backup tunnels B1: 30M, B2: 50M, B3: 50M, and B4: 20M; labels: Call Admission Control, Bandwidth Sharing.]
In Figure 5.59, a set of backup tunnels have been computed:
• B1 (30 Mbps) and B2 (50 Mbps) protect fast-reroutable TE LSPs traversing the path R8-R1-R2 against a node failure of R1.
• B3 (50 Mbps) protects fast-reroutable TE LSPs traversing the path R0-R1-R2 against a node failure of R1.
• B4 (20 Mbps) protects fast-reroutable TE LSPs traversing the path R5-R3-R4-R7 against a node failure of R6.
For the sake of simplicity, just a few backup tunnels are shown in Figure 5.59 (e.g., there are other backup tunnels, such as B5, from R9, which protects the fast-reroutable TE LSPs traversing the path R9-R1-R2 against a failure of the node R1). B1, B2, and B3 cannot share the bandwidth because they protect different TE LSPs from the failure of the same resource (node R1 in this case). On the other hand, under the single failure assumption, they can share bandwidth with B4 because B4 protects from the failure of a different resource (node R6). The paths of B1, B2, B3, and B4 have been computed to ensure bandwidth protection. Now, as far as the signaling is concerned, there are actually two options:
• Option 1: Signal backup tunnels with their bandwidth
• Option 2: Signal backup tunnels with 0 bandwidth
Let us now see each option and the respective pros and cons. First, it is worth reiterating here that a backup tunnel LSP is just a regular TE LSP (footnote 75). In other words, when a backup tunnel is signaled, any LSR along its path performs the same operation as with any other TE LSP, in particular the CAC checking against the available bandwidth on the link for the priority signaled in the RSVP Path message.
Footnote 75: This applies to the facility backup method of MPLS TE Fast Reroute.
Option 1: Backup tunnels are signaled with their respective bandwidth. Although this seems to be the most natural approach, it would require some CAC modifications on midpoint LSRs to allow for bandwidth sharing. Indeed, back to our previous example depicted in Figure 5.59, R3 would need to figure out that B4 protects a different resource than B1 and B3 and so they can share the bandwidth. Its CAC function would need to account for 80 Mbps (max((B1 + B3), B4)) instead of B1 + B3 + B4 = 100 Mbps. This would require not only some modification of the CAC module on the midpoint LSRs but also RSVP signaling extensions, so that a backup TE LSP could be identified as a backup tunnel along with the resource it protects.
Option 2: This approach consists of signaling backup tunnels with 0 bandwidth, which avoids having to implement any RSVP signaling extensions and CAC modifications. Of course, the fact that a backup tunnel is signaled with 0 bandwidth is completely unrelated to the bandwidth this TE LSP actually gets. Remember that the backup tunnel path has been computed to ensure that the backup tunnel will get the required bandwidth in case of a failure. So by virtue of the backup tunnel computation, backup tunnels will have the required
bandwidth along their respective path, but they are just signaled with 0 bandwidth. The distributed backup tunnel path computation model presented here is just one model among others. At the time of publication, other distributed models under investigation could allow for some degree of bandwidth sharing with limited requirements in terms of extra routing and signaling extensions.
Backup tunnel selection: At this point, a set of backup tunnels has been computed to provide fast recovery and bandwidth protection. The next interesting question is: How are those backup tunnels selected as primary TE LSPs requesting Fast Reroute and bandwidth protection are signaled? When a new TE LSP explicitly requiring local protection and bandwidth protection is signaled, a backup tunnel satisfying the request must be selected by each PLR along the path. In the example of Figure 5.59, the paths of the backup tunnels B1 and B2 have been computed to provide 30 Mbps and 50 Mbps of bandwidth, respectively. Each time a fast-reroutable TE LSP traversing the node R8 and requesting bandwidth protection is signaled, the PLR R8 has to select a backup tunnel (B1 or B2) satisfying the bandwidth constraint. So the PLR has to keep track of the total amount of available bandwidth per backup tunnel, which is equal to the backup tunnel bandwidth minus the sum of the bandwidths of all the protected LSPs that have selected the backup tunnel. A detailed example follows. The algorithm in charge of the backup tunnel selection, called the packing algorithm, is usually implementation specific. At this point, it is worth mentioning a few issues that the packing algorithm must resolve:
1. Constraints prioritization: When a protected TE LSP that explicitly requires bandwidth protection is signaled, its set of requirements must be taken into account to proceed to the backup tunnel selection: amount of requested bandwidth, link versus node protection, and others.
Furthermore, if several backup tunnels exist at the PLR, they may have different properties like NHOP versus NNHOP, different bandwidth available, or different path lengths. The set of required constraints (with potentially some hierarchy between constraints) and the backup tunnel properties must be considered by the backup selection algorithm to perform an appropriate selection. For instance, suppose two backup tunnels: B1 is an NNHOP backup without enough bandwidth to satisfy the bandwidth requirement and B2 is an NHOP backup with enough bandwidth. If the protected TE LSP requires both bandwidth and node protection, a choice must be made about which constraint will be satisfied first. 2. Bandwidth fragmentation: Various algorithms can be designed for the backup tunnel selection. Here is just a subset of some possible implementations for the sake of illustration:
• A1: always select the backup tunnel with the smallest available bandwidth that meets the bandwidth protection requirement
• A2: "load balance"
• A3: always select the backup tunnel with the highest bandwidth that meets the bandwidth protection requirement
Let us illustrate the challenge of the packing algorithm (which is also known as the "knapsack" problem) through an example. Consider a link R0-R1 with a protected bandwidth of 25 Mbps (i.e., R0 requires a set of backup tunnels such that the sum of their bandwidth is equal to 25 Mbps). Because no single backup tunnel having a 25 Mbps capacity can be found, the backup tunnel path computation algorithm has calculated two backup tunnels B1 and B2 having a capacity of 10 Mbps and 15 Mbps, respectively (B1 and B2 follow different paths). We denote by [X,Y] the remaining backup capacity (RBC) on the respective backup tunnels B1 and B2. In the example, we illustrate the resulting outputs of each packing algorithm described above upon a specific sequence of events. Note that the assumption is made in this example that all the signaled TE LSPs request Fast Reroute and bandwidth protection.
Time t0: A first TE LSP (LSP 1) is signaled with a bandwidth requirement of 4 Mbps.
With A1: RBC = [6,15]
With A2: RBC = [6,15]
With A3: RBC = [10,11]
Time t1: A TE LSP (LSP 2) is signaled with a bandwidth requirement of 4 Mbps.
With A1: RBC = [2,15]
With A2: RBC = [6,11]
With A3: RBC = [10,7]
Time t2: A TE LSP (LSP 3) is signaled with a bandwidth requirement of 12 Mbps.
With A1: RBC = [2,3]
With A2: FAILS, no backup tunnel can be selected
With A3: FAILS, no backup tunnel can be selected
This simple example shows that A1 is the best strategy in this example to avoid bandwidth fragmentation. Unfortunately, as shown below, this does not remove the need for a bandwidth defragmentation strategy as new TE LSPs are signaled and torn down.
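The sequence of events t0 to t2 can be replayed with a short simulation. This is only a sketch under stated assumptions: the function names are hypothetical, A2 ("load balance") is interpreted here as choosing the fitting backup tunnel that currently carries the fewest protected LSPs (ties going to B1), and A3 is interpreted as choosing the fitting tunnel with the largest remaining bandwidth. The text does not define A2 and A3 more precisely, so other interpretations are possible.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, float]  # remaining backup capacity (RBC) per tunnel, in Mbps
Selector = Callable[[State, Dict[str, int], float], Optional[str]]


def _fits(rbc: State, bw: float) -> List[str]:
    return [t for t in rbc if rbc[t] >= bw]


def a1_smallest_fit(rbc: State, load: Dict[str, int], bw: float) -> Optional[str]:
    fits = _fits(rbc, bw)
    return min(fits, key=lambda t: rbc[t]) if fits else None


def a2_load_balance(rbc: State, load: Dict[str, int], bw: float) -> Optional[str]:
    fits = _fits(rbc, bw)  # assumption: balance the number of protected LSPs
    return min(fits, key=lambda t: load[t]) if fits else None


def a3_largest_fit(rbc: State, load: Dict[str, int], bw: float) -> Optional[str]:
    fits = _fits(rbc, bw)
    return max(fits, key=lambda t: rbc[t]) if fits else None


def run(select: Selector, requests: List[float]) -> List[Optional[List[float]]]:
    """Return the RBC [B1, B2] after each request, or None when selection fails."""
    rbc: State = {"B1": 10.0, "B2": 15.0}
    load = {"B1": 0, "B2": 0}
    history: List[Optional[List[float]]] = []
    for bw in requests:
        tunnel = select(rbc, load, bw)
        if tunnel is None:
            history.append(None)
            continue
        rbc[tunnel] -= bw
        load[tunnel] += 1
        history.append([rbc["B1"], rbc["B2"]])
    return history


requests = [4.0, 4.0, 12.0]  # LSP 1, LSP 2, LSP 3
print(run(a1_smallest_fit, requests))  # [[6.0, 15.0], [2.0, 15.0], [2.0, 3.0]]
print(run(a2_load_balance, requests))  # [[6.0, 15.0], [6.0, 11.0], None]
print(run(a3_largest_fit, requests))   # [[10.0, 11.0], [10.0, 7.0], None]
```

Under these interpretations the simulation reproduces the RBC values listed for A1, A2, and A3 at t0, t1, and t2.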
An interesting analogy is the defragmentation of a hard disk, with one noticeable difference: In the case of a file that must be stored on a hard disk, if no single contiguous set of blocks with the requested file size can be found, the operating system will store the file in multiple noncontiguous blocks whose addresses are stored in a file allocation table (FAT). By contrast, when the PLR has to select a backup tunnel, it must find a single backup tunnel with enough bandwidth to satisfy the requirement of the newly set up protected TE LSP. Using multiple backup tunnels to reroute a single protected TE LSP is not desirable (indeed, if multiple backup tunnels are used to reroute the same TE LSP and these backup tunnels have significantly
different propagation delays, this may lead to undesirable out-of-order packet delivery).

Let us now continue the example with the assumption that algorithm A1 is chosen.

Time t3: A new TE LSP (LSP 4) is signaled with a protection bandwidth requirement of 3 Mbps. RBC = [2,0].
Time t4: LSP 1 is torn down. RBC = [6,0].
Time t5: LSP 3 is torn down. RBC = [6,12].
Time t6: A new TE LSP (LSP 5) is signaled with a protection bandwidth requirement of 14 Mbps. The backup tunnel selection algorithm fails because no backup tunnel with 14 Mbps of remaining capacity can be found.

This shows that a bandwidth defragmentation procedure must be triggered at this point to satisfy the new request. Because LSP 1 and LSP 3 have been torn down, only LSP 2 (4 Mbps) and LSP 4 (3 Mbps) remain, and the backup assignment must become:

B1: LSP 2, LSP 4
B2: no TE LSP

So RBC = [3,15], which accommodates the request of LSP 5. Note that the backup bandwidth defragmentation procedure may be triggered by a timer or by a backup tunnel selection failure event. In the former case, the PLR performs a defragmentation when a timer expires, whereas in the latter case, the defragmentation is triggered when a new bandwidth protection request cannot be satisfied. Both approaches can also be combined.

Path computation client: PCE signaling protocol

As already mentioned, the PCE (in this case the entity in charge of the backup tunnel path computation) can be either a central PCE (in the centralized model) that performs the backup tunnel computation for all the protected resources in the network or an LSR (in the distributed model). In both cases, a signaling protocol is required such that an LSR (a Path Computation Client, or PCC) can request the computation of a set of backup tunnels to protect its TE LSPs traversing a particular resource in case that resource fails, and such that the PCE can return the set of computed backup tunnels. Such a signaling protocol has been proposed in [PATH-COMP], which defines some RSVP TE extensions to address this requirement.
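Returning to the defragmentation step of the A1 example earlier in this section, the reassignment can be sketched as a simple repacking routine. This is a hedged illustration only: `defragment` and its greedy strategy (empty one tunnel at a time, moving its largest LSPs first into the tightest tunnel that still fits them) are assumptions made for this example, not an algorithm described in the text, and a greedy pass is not guaranteed to find a solution whenever one exists.

```python
def remaining(capacity, assignment, lsp_bw):
    """Remaining backup capacity (RBC) per tunnel, in Mbps."""
    return {t: capacity[t] - sum(lsp_bw[l] for l in assignment[t])
            for t in capacity}


def defragment(capacity, assignment, lsp_bw, needed):
    """Try to free `needed` Mbps on one tunnel by moving its LSPs elsewhere.

    Returns a new assignment, or None if this greedy pass finds no solution.
    """
    for target in capacity:
        free = remaining(capacity, assignment, lsp_bw)
        trial = {t: list(lsps) for t, lsps in assignment.items()}
        # Move the largest LSPs off the target tunnel first.
        for lsp in sorted(assignment[target], key=lambda l: -lsp_bw[l]):
            if free[target] >= needed:
                break
            hosts = [t for t in capacity
                     if t != target and free[t] >= lsp_bw[lsp]]
            if not hosts:
                continue
            host = min(hosts, key=lambda t: free[t])  # tightest fit
            trial[target].remove(lsp)
            trial[host].append(lsp)
            free[target] += lsp_bw[lsp]
            free[host] -= lsp_bw[lsp]
        if free[target] >= needed:
            return trial
    return None


if __name__ == "__main__":
    capacity = {"B1": 10, "B2": 15}
    lsp_bw = {"LSP2": 4, "LSP4": 3}
    # State after t5: LSP 2 is on B1 and LSP 4 is on B2, so RBC = [6,12].
    assignment = {"B1": ["LSP2"], "B2": ["LSP4"]}
    new_assignment = defragment(capacity, assignment, lsp_bw, needed=14)
    print(new_assignment, remaining(capacity, new_assignment, lsp_bw))
```

In practice a PLR would also have to weigh the cost of moving protected TE LSPs between backup tunnels, which this sketch ignores.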
[PATH-COMP] defines a new RSVP message type called the "Path computation" message (a specific flag defines whether the Path computation message is a request or a reply). Multiple optional objects have then been specified for various purposes. Indeed, the scope of this protocol is quite wide, and its applicability is not restricted to the offload of backup tunnel path computation. The interesting fact is that the RSVP Path computation messages reuse all the RSVP objects carried in the RSVP Path message to signal a TE LSP. Those objects specify the TE LSP attributes. For instance, the RSVP Path computation message carries the SESSION, SESSION-ATTRIBUTE, and SENDER-TEMPLATE objects, to mention a few of the objects that define the set of attributes for the TE LSP. Just a few additional objects are added that characterize the request. In the particular context of Fast Reroute, a specific
object has been defined in [FACILITY-BACKUP] that allows specifying the resource to protect, the destination of the set of backup tunnels, optional resource classes, as well as the maximum number of backup tunnels in the set and, for each of them, the minimum required bandwidth. Note that at the time of publication, such a signaling protocol has not yet been standardized; hence the reference [PATH-COMP] is given for the sake of illustration only. For instance, let us consider the computation of the set of NNHOP backup tunnels between two LSRs R0 and R2. Suppose that the protected bandwidth is 50 Mbps and that no single NNHOP backup tunnel of 50 Mbps can be found. Although 5000 TE LSPs of 10 Kbps each would give a total of 50 Mbps, this is certainly not a desirable scenario. Indeed, a protected TE LSP can use only a single backup tunnel at a time; furthermore, the number of backup tunnels would be very high. Another scenario would be to get three backup tunnels with the following bandwidths: 49 Mbps, 500 Kbps, and 500 Kbps. In some networks, the last two backup tunnels would be too small to protect any protected LSP requesting bandwidth protection. This explains why being able to specify some constraints on both the number of backup tunnels and their minimum bandwidth in the RSVP Path computation request is desirable.
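The effect of the two constraints discussed above (maximum number of backup tunnels and minimum bandwidth per tunnel) can be shown with a small check. This is only a sketch: the function and parameter names are hypothetical and do not reflect the object encoding of [PATH-COMP] or [FACILITY-BACKUP], and the 1 Mbps minimum and the limit of 3 tunnels are assumed values chosen for illustration.

```python
def acceptable_backup_set(tunnel_bw_kbps, protected_kbps,
                          max_tunnels, min_bw_kbps):
    """Check a candidate set of backup tunnels against the request constraints.

    tunnel_bw_kbps: bandwidth of each computed backup tunnel (Kbps)
    protected_kbps: total bandwidth that must be protected (Kbps)
    max_tunnels:    maximum number of backup tunnels allowed in the set
    min_bw_kbps:    minimum bandwidth required for every backup tunnel
    """
    return (len(tunnel_bw_kbps) <= max_tunnels
            and all(bw >= min_bw_kbps for bw in tunnel_bw_kbps)
            and sum(tunnel_bw_kbps) >= protected_kbps)


if __name__ == "__main__":
    protected = 50_000  # 50 Mbps between R0 and R2
    candidates = {
        "5000 x 10 Kbps": [10] * 5000,
        "49 Mbps + 2 x 500 Kbps": [49_000, 500, 500],
        "30 Mbps + 20 Mbps": [30_000, 20_000],
    }
    for name, tunnels in candidates.items():
        ok = acceptable_backup_set(tunnels, protected,
                                   max_tunnels=3, min_bw_kbps=1_000)
        print(f"{name}: {'accepted' if ok else 'rejected'}")
```

All three candidate sets add up to 50 Mbps, but only the last one satisfies both the tunnel count and the minimum bandwidth constraint.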
5.15.7 Backup Tunnel Path Computation with MPLS TE Fast Reroute One-to-One Backup

As described previously, Fast Reroute one-to-one backup requires one backup tunnel (Detour LSP) per protected TE LSP at each hop acting as a PLR. In other words, each LSR along the protected TE LSP path that uses one-to-one backup has to compute a Detour LSP path originating at this node and ending at the egress LSR of the protected TE LSP (the destination). This requires collecting a set of information:

1. The list of LSRs traversed by the protected TE LSP: This information is available in the RSVP RRO object carried in the Resv message traveling in the upstream direction (from the tail-end LSR to the head-end LSR). One may think that the ERO object always contains the complete list of downstream nodes, but this is not always the case, for instance, if the ERO object contains some loose hop(s). A loose hop address is a nondirectly connected address. This is illustrated in Figure 5.60. The example depicted in Figure 5.60 shows the situation where a TE LSP path is specified as a mix of strict and loose hops. In that example, a TE LSP T1 is set up from the LSR R2 to R6 with the following ERO object: R3 (strict)-R5 (loose)-R6 (strict). A typical use of loose hops is when the head-end LSR cannot compute the whole path of a TE LSP because it lacks topology and resource information, for example, when the TE LSP spans multiple routing areas. In this case, one solution is to specify a list of strict hops in the head-end area followed by a list of loose hops (the area border routers [ABRs]). Each ABR is then responsible for a partial route computation, up to the next loose hop (in general another ABR connected to the
[Figure 5.60 shows TE LSP T1 along the path R2-R3-R4-R5-R6. Labels in the figure: ERO object at R2: R3 (strict)-R5 (loose)-R6 (strict); at R3, received ERO object: R5 (loose)-R6 (strict), computed ERO object: R4 (strict)-R5 (strict)-R6 (strict); further downstream, ERO object: R5 (strict)-R6 (strict), then ERO object: R6 (strict). Inset: IPv4 prefix ERO subobject format with the fields L, Type, Length, IPv4 Address (4 bytes), Prefix Length, and Reserved; the L bit is set if the subobject represents a loose hop in the ERO object.]
Figure 5.60 Illustration of an ERO object specified as a mix of strict and loose hops, which shows why the RRO object must be used in some cases to learn the TE LSP path on downstream nodes.
next hop routing area). Back to our example of Figure 5.60, the PLR R2, for example, does not compute the path between the LSR R3 and R5 (R5 is specified as a loose hop); the path between the nodes R3 and R5 is computed by R3. This highlights an example where the PLR (R2 in this example) may not have the full list of hops traversed by the protected TE LSP by observing the ERO object; in this case, the information would be obtained via the RRO object carried in the RSVP Resv message. More details on the signaling can be found in Section 5.14.

2. The list of downstream links and nodes that the PLR wants to protect. This information is also available in the RRO object carried in the RSVP Resv message forwarded in the upstream direction.

3. In addition, the PLR must learn the list of upstream links that the protected TE LSP traverses. Likewise, this information is available in the RRO object. The Detour LSP and the protected TE LSP should not share a common next hop upstream of the failure:
- With the path-specific method, the Detour LSP must not pass through the same links as the protected TE LSP to avoid an early LSP merging.
- With the sender template-specific method, the reason is that the Detour LSP and the protected TE LSP would share the bandwidth, although in case of failure they would be simultaneously active, resulting in bandwidth violation.

4. The required protected bandwidth (i.e., the amount of bandwidth required for the protected TE LSP, which will be the bandwidth of the Detour LSP). The head-end LSR of a protected TE LSP has the ability to request a backup
tunnel with an equivalent bandwidth or a percentage of the primary TE LSP bandwidth (see footnote 76).

5. The maximum number of hops the backup tunnel path can have between the PLR and the MP, if a FAST-REROUTE object is present. A value of 0 indicates that the backup tunnel protects against link failure only.

6. Finally, some link attribute filters that may be applied to the backup tunnel path if, for instance, some links should be avoided (e.g., long propagation delay links).

Once all that information is gathered, the PLR tries to find the shortest path for the backup tunnel (running a CSPF on the remaining topology, once the protected section has been pruned), taking into account the constraints mentioned earlier. Note that the destination address of the backup tunnel in the context of Fast Reroute one-to-one backup is the egress LSR (destination of the protected TE LSP). The PLR may or may not succeed in finding a path for the backup tunnel that satisfies the set of requirements. If it fails, the PLR can start a timer and retry when the timer expires. Furthermore, a PLR can trigger a backup tunnel path reoptimization at regular intervals to determine whether a better (i.e., shorter) path exists.
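The computation described above (prune the protected section and the links without enough bandwidth, then run a shortest path computation toward the egress LSR) can be sketched as follows. This is a minimal sketch under stated assumptions: the topology, link costs, and bandwidths are hypothetical (only the protected path R2-R3-R4-R5-R6 mirrors Figure 5.60; nodes R7 and R8 are invented), and the hop limit and link attribute filters are left out for brevity.

```python
import heapq

# Hypothetical topology: (node, node, cost, available bandwidth in Mbps).
LINKS = [
    ("R2", "R3", 1, 100), ("R3", "R4", 1, 100), ("R4", "R5", 1, 100),
    ("R5", "R6", 1, 100), ("R2", "R7", 1, 100), ("R7", "R4", 1, 100),
    ("R7", "R8", 1, 20), ("R8", "R6", 1, 100),
]


def detour_path(links, plr, egress, bandwidth,
                excluded_nodes=frozenset(), excluded_links=frozenset()):
    """CSPF sketch: Dijkstra on the topology left after pruning.

    Links with too little bandwidth, links touching an excluded node, and
    explicitly excluded links are removed before the shortest path search.
    """
    adjacency = {}
    for a, b, cost, available in links:
        if available < bandwidth:
            continue
        if a in excluded_nodes or b in excluded_nodes:
            continue
        if frozenset((a, b)) in excluded_links:
            continue
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    heap, done = [(0, [plr])], set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == egress:
            return path
        if node in done:
            continue
        done.add(node)
        for neighbor, link_cost in adjacency.get(node, []):
            if neighbor not in done:
                heapq.heappush(heap, (cost + link_cost, path + [neighbor]))
    return None  # no path: the PLR would start a timer and retry later


if __name__ == "__main__":
    # PLR R2 protects against the failure of its next-hop node R3.
    print(detour_path(LINKS, "R2", "R6", 10, excluded_nodes={"R3"}))
    print(detour_path(LINKS, "R2", "R6", 50, excluded_nodes={"R3"}))
```

With a 10 Mbps requirement the Detour LSP avoids the protected path entirely; with 50 Mbps the R7-R8 link is pruned and the detour rejoins the protected path at R4.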
5.15.8 Summary

In Sections 5.14 and 5.15, we saw the signaling aspects of MPLS TE local protection, which include several RSVP TE extensions. This also included the Fast Reroute mode of operation as far as the signaling aspects are concerned. Another key aspect covered in detail is the backup path computation for the MPLS TE protection techniques (global and local). Although many algorithms and modes have been proposed for the computation of the backup path, a few of them have been presented that can be implemented either on a central server (usually referred to as the off-line approach) or on LSRs (distributed computation). The choice between centralized and distributed backup tunnel path computations is, as usual, driven by the tradeoff between optimality (centralized computation) and flexibility and reactiveness (distributed computation). That said, the relative degree of optimality of both approaches is really a function of the algorithms in use and the network topology. This area is constantly evolving, and undoubtedly, new algorithms will be designed that will take new constraints into account with increasingly higher efficiency; however, it has been shown that the level of complexity grows nonlinearly with the number of required constraints.

76 This information can be derived either from the SESSION-ATTRIBUTE object ("Bandwidth protection desired" bit) or from the FAST-REROUTE object, if present. In the former case, if the bandwidth protection desired bit is set, the requirement is for full protection; in other words, the bandwidth requirement for the backup tunnel is identical to the protected TE LSP bandwidth. In the latter case, the bandwidth field of the FAST-REROUTE object specifies the bandwidth requirement for the backup tunnel, which may be a fraction of the protected TE LSP bandwidth. Again, see Section 5.14 for a detailed description of the signaling aspects.
5.16 Research-Related Topics

MPLS TE recovery techniques have evolved considerably during the last few years and are undoubtedly a mature technology, which does not mean that they have stopped evolving! Consequently, there are several active topics of research that we can mention:

1. Multiple failures: Throughout this chapter, we usually made the assumption of a single failure, which allows, for instance, bandwidth sharing between backup tunnels protecting independent resources. Bear in mind that an SRLG failure, which effectively results in the failure of multiple links, is considered a single failure. There are several ongoing investigations on the topic of multiple failures, both to study whether multiple failures occur in existing networks and to propose various backup tunnel path computation models in this context. For instance, one could extend the notion of SRLG to include some probability of multiple failures of a set of elements. Such information could then be considered to compute the path of a backup tunnel, or the paths of multiple backup tunnels, each one protecting against a set of multiple failures.

2. "Fast Reroute" extensions for "point-to-multipoint LSPs": There are several proposals to extend the concept of MPLS TE LSP to point-to-multipoint LSPs, where packet replication would be performed in the core. Extensions to Fast Reroute might be required to protect point-to-multipoint LSPs against network element failure.

3. Centralized and distributed backup path computation algorithms: This has obviously been a constant topic of research, and new centralized and distributed algorithms are regularly proposed to improve efficiency with respect to various objective criteria.
CHAPTER 6
Multilayer Networks
In the previous chapters, we discussed recovery from the viewpoint of a single network technology (such as Internet Protocol [IP], Synchronous Digital Hierarchy [SDH], or Optical Transport Network [OTN] networks). This chapter presents recovery mechanisms and strategies for multilayer networks. It has three main parts. The first part (Section 6.1) highlights the current evolution from static to intelligent optical networks (IONs) based on a distributed control plane (CP). This includes the Automatic Switched Optical Network (ASON) framework, the protocols (mainly Generalized Multi-Protocol Label Switching [G-MPLS] based) currently pushed forward to implement such a distributed CP, and different CP architectures. This information is needed later in the chapter, when dynamic survivability mechanisms in multilayer networks are discussed, for which operating a distributed CP is required. In the second part of this chapter (Section 6.2), the generic issues of multilayer recovery strategies are discussed. Three general categories for providing recovery in multilayer networks are described: single-layer recovery schemes in multilayer networks with the important issue of in which layer of the network to provide the recovery scheme; static multilayer recovery schemes where recovery schemes at several network layers can be provided with an important issue of how to make them interwork; and then the dynamic multilayer recovery strategies that use dynamic logical topologies for survivability purposes. In the last part of this chapter (Section 6.3), some concrete examples and case studies of recovery in multilayer networks are given. These are optical restoration and MPLS Traffic Engineering (TE) Fast Reroute (FRR); SONET-SDH protection and IP routing; and MPLS TE Fast Reroute and IP routing.
6.1 ASON/G-MPLS Networks

In this section, the current evolution from static to flexible and intelligent optical networks based on a distributed CP is highlighted. Section 6.1.1 describes the ASON framework (a kind of meta-model) under standardization by the International Telecommunication Union (ITU), and Section 6.1.2 discusses protocols that are important candidates (mainly a G-MPLS-based solution steered by the Internet Engineering Task Force [IETF]) to implement such a distributed CP. Finally, Section 6.1.3 illustrates different CP architectures for multilayer networks, assuming different integration levels of the CPs of the different network layers.
6.1.1 The ASON/ASTN Framework

In Chapters 2 and 3, SDH and OTN technologies were considered inflexible; they provide fixed transmission links between the client network equipment. However, the traffic pattern offered to the client networks becomes more and more dynamic: not only does the traffic itself change continuously over time, but the locations between which traffic is routed also change continuously (traffic churn). Therefore, being able to reshuffle the transmission capacity in a client network becomes more critical. This requires that the transport network or optical transport network allow customers to set up and tear down connections on demand and in an automated way. Such flexibility of course requires that more intelligence be pushed into the optical transport network, leading to the concept of intelligent optical networks (IONs). [DeM04] studies the advantages of IONs, the drivers for them, and the opportunities they bring. Storage area networks (SANs), for example, are one application that requires bandwidth on demand and thus drives the development of IONs. IONs can also be beneficial from a network perspective. For example, as mentioned in Chapters 2 and 3, in terms of network recovery, it becomes possible to set up and tear down connections on demand at the time of a failure, thereby enabling restoration instead of protection in transport networks or optical transport networks. Although restoration is able to outperform protection in terms of capacity efficiency, it cannot meet the same recovery completion times as protection. As Section 6.2.4 shows, IONs can not only provide restoration in an optical transport network but also provide on-demand spare capacity for recovery in a client network.
The ability to set up and tear down connections on demand lets one allocate capacity in the network only for as long as it is needed; consequently, with a highly dynamic traffic pattern, this can result in a significant reduction in required network capacity and thus in CAPital EXpenditure (CAPEX). In addition to this CAPEX reduction, automating the setup and teardown of connections implies an important reduction in OPerational EXpenditure (OPEX).
Major contribution to Section 6.1 is credited to Didier Colle, INTEC, Ghent University
Within the ITU-T, a framework for Automatic Switched Optical Networks (ASONs) has been defined. More precisely, ITU-T G.807 specifies the requirements for Automatic Switched Transport Networks (ASTNs) [G807], whereas ITU-T G.8080 specifies the architecture of an ASON [G8080]. Note that the ASON/ASTN framework is applicable to any transport network technology and thus is not restricted to optical transport networks. Internet drafts [Ala03] and [Pap03] specify the requirements for the Generalized Multi-Protocol Label Switching (G-MPLS) routing and signaling protocols, respectively, to support this ASON framework (as specified in the previously mentioned ITU-T recommendations). Figure 6.1 illustrates the ASON framework. The important point is that in this framework a distributed control plane has been added to the management and transport planes (TPs) already present in classic transport networks. This control plane (CP) consists of a set of optical connection controllers (OCCs) that are connected to, and control, the switches in the transport plane (TP) through the connection control interface (CCI). The General Switch Management Protocol (GSMP) is just one example of a protocol that can implement this CCI. The OCCs are connected with each other via the network-to-network interfaces (NNIs), whereas the switches in the TP are connected via physical interfaces (PIs). To route and set up or tear down the optical channels (OChs), the OCCs run a
[Figure 6.1 shows three planes: the management plane (telecommunication management network, network management system, MIBs), the control plane (request agent, optical connection controllers, network element management agent), and the transport plane (customer premise equipment, switch fabric, optical channel), together with the user-network interface, network-network interface, connection control interface, physical interface, NMI-A, and NMI-T.]
Figure 6.1 The Automatic Switched Optical Network framework.
routing and a signaling protocol, respectively, over these NNIs. More precisely, an interior-NNI (I-NNI) is an NNI between two OCCs inside the same administrative domain (AD), exchanging at least topology or routing information, connection service messages, and, optionally, network resource control information. An exterior-NNI (E-NNI) connects OCCs residing in distinct ADs and supports the exchange of reachability or summarized network address information, authentication and connection admission control (CAC) information, and connection service messages. Although a G-MPLS-based protocol suite [Ala03], [Pap03] is the option being pushed forward, it should also be possible to adopt, for example, a Private Network-to-Network Interface (PNNI)-based protocol suite. The client requests the setup of an OCh when its Request Agent (RA) (e.g., the user-to-network interface client [UNI-C] in the Optical Internetworking Forum [OIF] UNI 1.0 specification) sends the appropriate message over the user-to-network interface (UNI) to an OCC. The UNI should at least support the exchange of naming and addressing information, authentication and CAC information, and connection service messages. It is important to note that no internal routing or topology information is disclosed to clients or other administrative domains. Section 6.1.2 gives more details on these protocols. Although the control plane may become more and more important, the management plane (MP) and the network management system (NMS) will not disappear completely. For example, most operators will remain interested in billing and accounting (typically an MP functionality). The MP is connected to the CP via the network management interface for ASTN (NMI-A) control plane components. Similarly, it is connected to the TP via the network management interface for transport (NMI-T) network elements.
One of the main goals of the CP of an ASTN is to provide a switched connection (SC) service. But an ASTN should also be able to provide a (hard) permanent connection ([H]PC) service. The NMS can choose to provision the connection by itself via the NMI-T or to request the setup from the control plane via the NMI-A. In the latter case, a soft permanent connection (SPC) service is provided. The main function of the control plane is connection management and control. The CP should be able to control and manage (switched and soft permanent) connections that are either unidirectional or bidirectional point-to-point or unidirectional point-to-multipoint connections. The CP should also be able to support multihoming, diversity, and other services (like establishment of closed user groups).
6.1.2 Protocols for Implementing a Distributed Control Plane

The previous section mentioned that G-MPLS is the protocol suite currently pushed forward to implement a distributed CP. As explained in Chapter 5, labels in regular MPLS are represented as integers, in most cases, attached to an IP packet as an additional shim header. However, for example, a wavelength channel (color) can be interpreted as a label, too. Therefore, the concept of Generalized MPLS (G-MPLS) allows a label to be represented as an integer, a time slot in a Time
Division Multiplexing (TDM) frame, a wavelength or waveband on a fiber, a fiber in a cable, and so on. Thus, the idea is to reuse the protocol suite that standard MPLS uses to set up and tear down label switched paths (LSPs) in order to control switched, instead of permanent, connections through transport networks or optical transport networks. Figure 6.2 illustrates this principle of generalizing the MPLS paradigm. As explained in Chapter 5, the Resource reSerVation Protocol [RFC2205] with Traffic Engineering extensions (RSVP-TE) has been adopted as the signaling protocol to set up and tear down MPLS traffic engineering LSPs; the required extensions for TE LSPs are specified in [RFC3209]. [RFC3471] specifies the signaling protocol extensions required specifically for supporting G-MPLS. [RFC3473] maps these requirements to the necessary RSVP-TE protocol extensions. Of course, the generalized label request and generalized label objects are critical extensions to the RSVP-TE protocol to enable G-MPLS support. The generalized label request consists of an LSP encoding type (e.g., packet versus digital wrapper versus lambda LSPs), a switching type (e.g., switch the LSP as a TDM circuit or as a complete wavelength channel), and a generalized payload identifier (G-PID) (e.g., packet-over-SONET [PoS]). The encoding of the generalized label object depends on the link on which the label is used. In addition to the generalized label request and generalized label objects, the signaling protocol extensions provide support for suggesting a label to
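[Figure 6.2 shows an IP packet (IP payload, IP header, MPLS label) switched according to a label table with the columns IN IF, IN LABEL, OUT IF, OUT LABEL and the entries (A, 2, D, 3), (B, 5, C, 7), and (B, 9, D, 7), next to the analogous switching of an incoming wavelength λIN to an outgoing wavelength λOUT between interfaces A, B, C, and D.]
Figure 6.2 The Generalized Multi-Protocol Label Switching concept.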
be chosen by a downstream node, for restricting the label set from which a downstream node can choose, and for setting up bidirectional LSPs. In Chapter 4, the principle of link state routing protocols has been covered in detail. Figure 6.3 summarizes this for a G-MPLS–capable network. Thanks to the neighbor discovery process, each node in the network learns about all neighbors. Moreover, a flooding procedure allows each node to flood this information throughout the network periodically. Each node processes all incoming link state packets and stores the received status information of each link in its link state database. By having such a link state database in each network node, each network node shares a common network topology and resources view and thus can compute the constraint shortest path along which an LSP has to be set up. To be able to distinguish between different link types, the link state contains a field indicating the multiplexing/switching capability of the advertised link. The third column in the link state database illustrates this; this column states that all links are optical links (thus, lambda switch capable [LSC]) except link CD that is a link at the IP level (thus, packet switch capable [PSC]). Consequently, the lightpath is set up between C and E via B and D. An important aspect of G-MPLS is that a generalized label can represent different types of LSPs. For example, thinking about an IP-over-WDM network, an LSP can be either a lightpath (labels are wavelengths) or a regular LSP in the IP layer (labels are integers possibly carried in a shim header). The question is now how to route the latter LSPs over the set of former LSPs (lightpaths) that form a logical network topology. For this purpose, from the moment an LSP (which has to function as a logical link) has been set up, it will be advertised by means of a link
[Figure 6.3 shows IP/optical network elements A, B, C, D, and E, the incoming link state packets [AB,AE], [AB,BD,BC], [BD,CD,DE], and [AE,DE,CE], the resulting knowledge of the network topology, and the CSPF routing of the lightpath between C and E. The link-state database in the figure reads:]

Link   Cost   Switch Cap.
AB     1      LSC
AE     1      LSC
BD     1      LSC
BC     1      LSC
CD     1      PSC
DE     1      LSC
CE     1      PSC

Figure 6.3 Principle of link state routing in Generalized Multi-Protocol Label Switching networks.
state packet to all other nodes in the network. In Figure 6.3, the lightpath being set up between nodes C and E creates a logical link at the IP level between nodes C and E. Thus an integrated overview of a multilayer network can be obtained [Kom02]. There is no need to bring up a routing adjacency (e.g., launching the discovery protocol) between both endpoints of such a logical TE link (actually an LSP in the underlying network layer or sublayer) after it has been advertised; therefore, such an adjacency is also called a forwarding adjacency (FA). The creation of FAs is not the only reason for having TE links over which no routing adjacency is brought up: in G-MPLS, some technologies simply do not allow transporting the control information in-band (thus requiring out-of-band control channels), and sometimes a bundle of links is advertised as a single TE link to improve scalability. These and other issues are still under standardization [Kom03]. In Chapter 3, Section 3.6.3, we discussed that the introduction of an IP-based optical control plane will make restoration a realistic recovery option for the optical transport network. Figure 6.4 illustrates how restoration can be provided in G-MPLS networks. This example assumes that after the lightpath between C and E has been established in Figure 6.3, link BD fails, affecting this lightpath. We assume that the ingress node C is notified that the lightpath has been affected by a failure and that node C acts as recovery head-end (RHE). Depending on the information the RHE receives, it may already recalculate a new CSPF route before the receipt of updated link state packets, or it may wait a while to receive those updated link state packets. Once it has an updated overview of the network topology and resources, the RHE can recalculate a new CSPF route for the lightpath (in this example via node A instead of via node D) and reestablish the lightpath along this new route.
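The role of the switching capability field, the forwarding adjacency, and the recalculation by the RHE can be illustrated on the link-state database of Figure 6.3. The sketch below is an assumption-laden illustration: the dictionary layout and the `cspf` function are hypothetical, and the only constraint applied is the switching capability (no bandwidth or wavelength constraints). Note that with all costs equal to 1, the routes C-B-D-E and C-B-A-E tie before the failure, so the initial route via D is taken from the text rather than computed.

```python
import heapq

# Link-state database of Figure 6.3 before the lightpath is advertised:
# link -> (cost, switching capability).
LSDB = {
    ("A", "B"): (1, "LSC"), ("A", "E"): (1, "LSC"), ("B", "D"): (1, "LSC"),
    ("B", "C"): (1, "LSC"), ("C", "D"): (1, "PSC"), ("D", "E"): (1, "LSC"),
}


def cspf(lsdb, src, dst, capability, failed=frozenset()):
    """Shortest path using only links of one switching capability."""
    adjacency = {}
    for (a, b), (cost, cap) in lsdb.items():
        if cap != capability or (a, b) in failed:
            continue
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    heap, done = [(0, [src])], set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for neighbor, link_cost in adjacency.get(node, []):
            if neighbor not in done:
                heapq.heappush(heap, (cost + link_cost, path + [neighbor]))
    return None


if __name__ == "__main__":
    # Without a lightpath there is no IP-level (PSC) route from C to E.
    print("PSC route before FA:", cspf(LSDB, "C", "E", "PSC"))

    # The lightpath C-B-D-E is advertised as a PSC link CE (forwarding adjacency).
    with_fa = dict(LSDB)
    with_fa[("C", "E")] = (1, "PSC")
    print("PSC route with FA:", cspf(with_fa, "C", "E", "PSC"))

    # Link BD fails; the RHE (node C) recomputes the lightpath over LSC links.
    print("Restored lightpath:",
          cspf(with_fa, "C", "E", "LSC", failed={("B", "D")}))
```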
[Figure 6.4 shows the same network elements, incoming link state packets, and link-state database as Figure 6.3 (links AB, AE, BD, BC, and DE are LSC; links CD and CE are PSC; all costs are 1), together with the recalculation of the CSPF routing of the lightpath between C and E.]
Figure 6.4 Illustration of restoration in Generalized Multi-Protocol Label Switching networks.
As elaborated in Chapter 3, Section 3.6.3, there are many ways to implement restoration in a network; the above example illustrates only one possibility. For example, instead of node C acting as an RHE (and thus needing to be notified of the failure), node B could also act as an RHE. Assuming only single link or node failures, node B should be capable of recalculating a new alternative path from itself to the destination node E for the affected lightpath (in this case the alternative path goes from node B via node A to node E) from the moment that node B detects a failure. Because node B detects the failure itself and already has an overview of the overall network status in its link state database, it does not need to wait for any updated link state packet or any other failure indication signal before it can start calculating the alternative path. This should result in a significant improvement of the restoration time. We call this technique fast topology-driven constraint-based rerouting (FTCR) [Vhe00], [ColPNC011] because it relies on the topology information in the link state database of the RHE (topology-driven part) and requires explicit routing (constraint-based part): the RHE may already start the signaling process to set up the affected lightpath along the alternative route before the link state databases in the intermediate nodes along this alternative path are updated. Table 6.1 summarizes all possibilities mentioned in Chapter 3, Section 3.6.3, to implement network recovery. The main difference between protection and restoration is that with protection the cross-connection on the backup route takes place before the occurrence of the failure, whereas with restoration the OXCs on the backup route are only cross-connected after the failure. The calculation of the restoration backup route can be preplanned or dynamic, just like the wavelength assignment on the backup route.
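The local rerouting by node B described above can be sketched as follows. This is a hypothetical illustration of the idea, not the FTCR procedure from [Vhe00]: the function name is invented, all LSC links are assumed to have equal cost (so a breadth-first search suffices), and the upstream part of the lightpath is simply kept and excluded from the search to avoid loops. The returned node list stands for the explicit route that the RHE would signal.

```python
from collections import deque

# LSC links of the example network (Figures 6.3 and 6.4), all with cost 1.
LSC_LINKS = {("A", "B"), ("A", "E"), ("B", "C"), ("B", "D"), ("D", "E")}


def local_reroute(links, lightpath, failed_link):
    """Reroute a lightpath from the node just upstream of the failed link.

    The RHE removes the failed link from its own view of the topology,
    searches a route to the destination that avoids the upstream nodes,
    and returns the complete new explicit route (or None).
    """
    a, b = failed_link
    index = min(lightpath.index(a), lightpath.index(b))
    rhe, destination = lightpath[index], lightpath[-1]
    upstream = set(lightpath[:index])
    usable = sorted(l for l in links if set(l) != {a, b})

    parent, queue = {rhe: None}, deque([rhe])
    while queue:
        node = queue.popleft()
        if node == destination:
            break
        for x, y in usable:
            if node in (x, y):
                neighbor = y if node == x else x
                if neighbor not in parent and neighbor not in upstream:
                    parent[neighbor] = node
                    queue.append(neighbor)
    if destination not in parent:
        return None

    tail, node = [], destination
    while node is not None:
        tail.append(node)
        node = parent[node]
    return lightpath[:index] + tail[::-1]


if __name__ == "__main__":
    # Lightpath C-B-D-E; node B detects the failure of link BD.
    print(local_reroute(LSC_LINKS, ["C", "B", "D", "E"], ("B", "D")))
```

Because the RHE works from its own link state database, no notification of the ingress node and no flooding of updated link state packets is needed before the computation starts.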
As the control plane becomes more and more important in transport networks, it is obvious that it will get logically separated from the transport or data plane instead of being a simple add-on or feature of the transport plane. In other words, the development and upgrade tracks of the transport plane and the control plane functionality get decoupled from each other. Even more, it should be possible that the control plane functionality be supplied by a vendor other than the one supplying the hardware for the data or transport plane: for example, it is not unimaginable that developing a full-fledged control plane grows beyond the expertise and capabilities of a hardware supplier.
Table 6.1 Comparison of Protection and Restoration

                                        Restoration                                      Protection
Backup Route Calculation                Preplanned      Preplanned      Dynamic          Preplanned
Wavelength Assignment on Backup Route   Preplanned      Dynamic         Dynamic          Preplanned
Cross-Connection on Backup Route        After failure   After failure   After failure    Before failure
Of course, the mission of the control plane remains controlling the equipment in the transport or data plane; thus, logically separating the control plane from the transport or data plane requires a standardized interface between both planes. This interface is called the Connection Control Interface (CCI) in the ASON framework, as depicted in Figure 6.1. The General Switch Management Protocol (GSMP) [RFC3292] is ‘‘a general purpose protocol to control a label switch’’ that allows separating the switch control from the switch forwarding elements; at the time of writing, the necessary extensions for supporting generalized labels are under standardization [Cho03]. In short, GSMP allows a controller to establish and release connections across the switch, add and delete leaves on a multicast connection, manage switch ports, request configuration information, request and delete reservation of switch resources, request statistics, and get informed of asynchronous events such as a link going down. Connection management/control is one of the most important functions supported by GSMP. GSMP is in nature a master-slave protocol, meaning that the switch controller acts as master, launching requests that have to be performed, and the switch itself acts as slave, responding/acknowledging these requests after taking the necessary actions. The only exception to this is the slave (thus, the switch) informing the master (thus, the switch controller) of an asynchronous event. A Transaction ID (TID) carried in the protocol messages allows correlating responses with the corresponding requests. Response messages will acknowledge the success of the requested operation or the failure of the requested action; therefore, the slave can only generate a response message after it has performed the requested operation. As depicted in Figure 6.1, the Request Agent of a client requests via the User Network Interface (UNI) the setup and tear down of switched connections through the ASON network. 
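The master-slave exchange and the role of the Transaction ID can be illustrated with a small Python sketch. It is not the GSMP message format of [RFC3292]: the dictionary fields, the class names, and the two operations "add" and "delete" are hypothetical simplifications, and asynchronous event messages are left out. The sketch only shows that the slave answers after performing the action and that the master correlates each response with its request through the TID.

```python
import itertools

class Switch:
    """Slave: performs the requested action, then answers with the same TID."""
    def __init__(self):
        self.connections = set()

    def handle(self, request):
        op, conn = request["op"], request["conn"]
        if op == "add" and conn not in self.connections:
            self.connections.add(conn)
            ok = True
        elif op == "delete" and conn in self.connections:
            self.connections.remove(conn)
            ok = True
        else:
            ok = False
        # The response is built only after the operation has been attempted.
        return {"tid": request["tid"], "result": "success" if ok else "failure"}

class Controller:
    """Master: issues requests and matches responses by transaction ID."""
    def __init__(self, switch):
        self.switch = switch
        self._tids = itertools.count(1)
        self.pending = {}

    def request(self, op, conn):
        tid = next(self._tids)
        self.pending[tid] = (op, conn)
        response = self.switch.handle({"tid": tid, "op": op, "conn": conn})
        original = self.pending.pop(response["tid"])  # correlate via the TID
        return original, response["result"]

controller = Controller(Switch())
first = controller.request("add", ("port1", "port2"))
second = controller.request("add", ("port1", "port2"))   # already present
third = controller.request("delete", ("port1", "port2"))
```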
The Optical Internetworking Forum (OIF) has already standardized a first version of the UNI protocol and is continuing to develop this protocol. First, a distinction has to be made between the UNI-client (UNI-C) and the UNI-network (UNI-N) side of the UNI. The UNI-C has its own address. However, connection establishment requests should use UNI-N addresses. To allow one UNI-C to translate the UNI-C address of the destination into its UNI-N address, the UNI supports an address resolution service. This service has to support address registration messages in addition to address resolution queries. In addition to the address resolution service, the UNI also runs a service discovery procedure (SDP), which negotiates whether LDP or RSVP will be supported as a signaling protocol, whether address resolution is supported, which framing (e.g., SONET or SDH) and port level granularity is used, and what transparency is supported. Even before any service discovery or address registration can take place, the Neighbor Discovery Procedure (NDP) and control channel establishment should be finished. Indeed, a control channel is required through the UNI, to serve the exchange of signaling messages between the UNI-C and UNI-N entities. The control channel is an IP-based control channel (IPCC). There are three possibilities to support the IPCCs (Figure 6.5). In-fiber/in-band establishes the IPCC
Figure 6.5 Support of the IP-based control channel. Panels: In-Fiber/In-Band (IPCC in DCC Bytes); In-Fiber/Out-of-Band (IPCC in Separate (Sub)Channel); Out-of-Fiber/Out-of-Band. (“User network interface [UNI] 1.0 Signaling Specification,” Optical Internetworking Forum/User Network Interface Specifications [OIF2000.125.5], June 2001. Available at www.oiforum.com. Accessed May 2004.)
over the data communication channel (DCC) (see footnote 77) of the data wavelengths. In the in-fiber/out-of-band mode, the IPCC occupies a separate wavelength or TDM channel on the UNI link. Finally, in the out-of-fiber/out-of-band configuration, the IPCC may be routed over an IP transport network. Once the IPCC has been established and the NDP, SDP, and address registration have completed, the client can start using the UNI for requesting the setup and tear down of lightpaths through the network. Note also that the UNI 1.0 specification considers three reference configurations, as illustrated in Figure 6.6. The client to optical network element (ONE) direct service invocation configuration (bottom of Figure 6.6) is the most intuitive one; both the client device and the ONE terminate the UNI data and control channels. The client to network agent direct service invocation configuration (left side of Figure 6.6) is very similar to the previous one, except that the UNI-N signaling functionality is shifted from the ONE to a central network agent. This network agent configures the ONE via signaling over an internal signal interface (ISI). The last reference configuration is called the client agent to network agent third-party service invocation (right side of Figure 6.6). In this configuration not only is the UNI-N signaling functionality shifted to a separate network agent, but the UNI-C signaling functionality is also shifted to a separate agent, called the UNI-C proxy. The network agents drive the configuration of both the ONE and the client device through an ISI. The use of such signaling network agents may ease the migration from a traditional OTN toward an ASON (e.g., the Network Management System [NMS] may serve as such a network agent).
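The ordering constraints between the UNI procedures can be summarized as a small prerequisite table. The following Python sketch is an illustration only: the step names, the PREREQS table, and the UniSession class are hypothetical, and the sketch treats address registration as mandatory although the text notes that support for address resolution is itself negotiated during service discovery.

```python
# Hypothetical step names; each step lists what must have completed before it.
PREREQS = {
    "ipcc": set(),                       # control channel establishment
    "ndp": set(),                        # neighbor discovery
    "sdp": {"ipcc", "ndp"},              # service discovery
    "address_registration": {"ipcc", "ndp"},
    "connection_request": {"ipcc", "ndp", "sdp", "address_registration"},
}

class UniSession:
    """Tracks which UNI procedures have completed on one UNI-C/UNI-N pair."""
    def __init__(self):
        self.done = set()

    def perform(self, step):
        missing = PREREQS[step] - self.done
        if missing:
            raise RuntimeError(f"{step} needs {sorted(missing)} first")
        self.done.add(step)

session = UniSession()
for step in ["ipcc", "ndp", "sdp", "address_registration", "connection_request"]:
    session.perform(step)
```

Requesting a lightpath on a fresh session raises an error, which mirrors the rule that neighbor discovery and control channel establishment come before everything else.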
6.1.3 Overview of Control Plane Architectures (Overlay, Peer, Augmented)
Section 6.1.1 explains that the ASON framework assumes that clients request via the UNI to set up/tear down connections through the ASON network; considering an IP-over-OTN network, this would mean that the IP network acts as a client from
Footnote 77: The DCC is an auxiliary communication channel that is provided by some dedicated overhead bytes in the framing overhead of the wavelength channel.
Figure 6.6 User-to-network interface (UNI) reference configurations (bottom, client to ONE direct service invocation; left, client to network agent direct service invocation; right, client agent to network agent third-party service invocation). [Diagram labels: UNI-C; UNI-C Proxy; UNI-N; ISI; UNI Control; UNI Data.]
the OTN server network layer. However, Section 6.1.2 shows that the G-MPLS protocol suite is able to obtain an integrated view of a multilayer network and thus can control this network as a single network. Section 6.1.2 also shows that the control plane will get logically separated from the transport or data plane (instead of being a simple add-on or feature of the transport plane), as the control plane becomes more and more important in transport networks. For example, this would allow a vendor other than the one supplying the hardware for the data or transport plane to supply the control plane functionality. This section aims to describe different control plane architectures assuming different integration levels of control planes of the client and server network layers. Although the discussion is generally applicable, it focuses on an IP-over-OTN network scenario. The first control plane model is the overlay model: As shown in Figure 6.7 both layers run their own control plane in this model. The IP layer acts as client layer or user layer, and the OTN acts as server layer; therefore, a User Network Interface (UNI) between both layers allows the client IP layer to request capacity (i.e., lightpaths) from the server OTN network. Both control planes are completely independent from each other. In other words, the client layer’s routing (IP routing like OSPF) and possibly MPLS signaling is independent from the optical layer control plane signaling (and routing). Of course, both control planes can be instantiated from the same control plane type (e.g., G-MPLS). However, the independence also allows the OTN to run other ASON-compliant protocols.
Figure 6.7 The overlay model. (D. Colle, et al. ‘‘Developing control plane models for optical networks,’’ Technical Digest, 2002 Optical Fiber Communication Conference [OFC2002], Anaheim, CA, March 17–22, 2002, pp. 757–759.)
For coupling both layers, an appropriate protocol (or common set of assumptions) between both layers is provided, to allow communication between the control planes of both layers. This protocol provides, for example, address resolution between the layers and/or initiates the connection request/release. One of the drawbacks of the overlay model is the duplication of control functionality (e.g., two separate routing protocols are running in the two layers). Another disadvantage is the scalability problem; for each established lightpath a corresponding IP routing adjacency has to be established. This was also a problem in classic IP-over-ATM because of the increased amount of state and information in the routing databases. MPLS was able to overcome these issues by using TE shortcuts, as opposed to establishing adjacencies across each distinct LSP. A final drawback of the overlay model is that there is a clear client-server relationship; for example, address resolution is required because of separate address spaces (as explained in the following discussion, this issue is solved in the augmented model). On the other hand, the advantage of the separation of the two control plane instances in the overlay model is that any confidential information from the transport network is not disclosed to any client network (operator). It seems also that the overlay model is most suited for the interconnection with legacy SDH networks. A second control plane model is the peer model, shown in Figure 6.8. In this model a single control plane controls both the IP and the OTN layer. The result is that IP router forwarding engines and OXC switch fabrics are logically integrated from a control plane viewpoint. Because the current MPLS control protocols would only require minor modifications to become G-MPLS compliant, it would be an interesting scenario to have the LSRs take over the control of the optical crossconnects. By having a standardized control interface (e.g., GSMP) such a scenario
Figure 6.8 The peer model. (D. Colle, et al. “Developing control plane models for optical networks,” Technical Digest, 2002 Optical Fiber Communication Conference [OFC2002], Anaheim, CA, March 17–22, 2002, pp. 757–759.) [Diagram labels: IP/OXC Controller; Control Channel; Control-Plane; Data-Plane; Physical (= Fiber) Topology; Customer Premise Equipment; Integrated IP/OTN Box; OXC Switch Fabric; IP-Router Forwarding Engine.]
should not be that unrealistic; such a standardized interface is interesting, particularly in case a single vendor does not supply both equipment types. So-called IP/OTN control channels are realized over the physical links between these logical IP/OTN entities. In case of G-MPLS, lightpaths are treated as optical LSPs and thus do not result in a new peering session between their endpoints (i.e., no control channel is established over the lightpaths). Note, however, that this does not prevent the lightpaths from being advertised into the routing protocol as direct TE links (or Forwarding Adjacencies [FAs]) between their endpoints. The peer model has the following advantages. First, duplication of functionality is avoided. Second, the disadvantages of the client-server relationship between IP and OTN (e.g., problems with address resolution) no longer exist. Although no additional peering session is required per established lightpath (which may solve some scalability problems; e.g., no processing of hello messages for each lightpath), the lightpath still has to be advertised to the network as a logical link. The peer model also has clear drawbacks. It is not applicable to all imaginable business models. For example, in an IP-over-OTN network scenario, the optical transport network operator may not accept that the ISP (e.g., a competitor) takes over the control of the OTN (or vice versa). The peer model is also limited to a single domain or autonomous system. The third control plane model is the augmented model, illustrated in Figure 6.9. This model is a compromise between the overlay and the peer model. It is quite
Figure 6.9 The augmented model. [Diagram labels: IP-Router Controller; IP Control Channel; IP-Router Forwarding Engine; Enhanced UNI; NNI; Logical (= Lightpath) Topology; OTN Control Channel; OXC Controller; OXC Switch-Fabric; Physical (= Fiber) Topology; IP; OTN.]
similar to the overlay model, in the sense that both layers may have their own control plane instance. However, some control information like reachability information may leak through the interface between both layers. Rephrased more practically, [Raj00] states in what they call the “interdomain interconnection model” that the client-layer reachability information is carried through the OTN, but OTN addresses are not propagated to the client network. The principle of leaking client-layer reachability information from one side of the network to the other is similar to the principle of MPLS/BGP VPNs [RFC2547] and is illustrated in Figure 6.10. Consider that IP router rA is attached via port opA to the OXC oxA and that IP router rB is attached via port opB to the OXC oxB. When router rB and OXC oxB run an E-BGP session over the UNI, OXC oxB learns the address of rB. More precisely, OXC oxB then knows that rB can be reached via its port opB. It advertises this relation via an I-BGP session to OXC oxA. OXC oxA forwards this BGP route over an E-BGP session to router rA, after removing any optical address from the route. In other words, router rA can easily learn the address of router rB, while the address resolution is kept inside the transport network. From this moment, router rA can simply ask the OXC oxA to establish a lightpath to router rB. It is the responsibility of OXC oxA to translate the address rB in the connect request to the appropriate optical port address. Finally, note that although everyone agrees that the augmented model is situated somewhere in between the two extreme overlay and peer models, there is not yet a clear understanding or definition of this augmented model. Nevertheless, it is clear that the augmented model tries to find a compromise between the advantages and disadvantages of both extremes.
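The leaking of reachability information in the rA/rB example can be mimicked with plain dictionaries. The Python sketch below is not a BGP implementation; the function names and the table layout are hypothetical and only reproduce the information flow described above: the optical address stays inside the OTN, while the client router learns which destinations it may ask for.

```python
def ebgp_learn(oxc_table, router, oxc, optical_port):
    """An OXC learns over the UNI that `router` sits behind one of its ports."""
    oxc_table[router] = (oxc, optical_port)

def ibgp_advertise(src_table, dst_table):
    """Inside the OTN, the full router-to-optical-address relation is shared."""
    dst_table.update(src_table)

def ebgp_export(oxc_table):
    """Toward a client router, optical addresses are stripped from the routes."""
    return set(oxc_table)

def resolve_connect(oxc_table, dest_router):
    """The ingress OXC translates a router address into an optical port address."""
    return oxc_table[dest_router]

oxB_table, oxA_table = {}, {}
ebgp_learn(oxB_table, "rB", "oxB", "opB")   # rB announces itself to oxB
ibgp_advertise(oxB_table, oxA_table)         # oxB tells oxA where rB sits
rA_routes = ebgp_export(oxA_table)           # rA only learns that rB is reachable
target = resolve_connect(oxA_table, "rB")    # oxA resolves the connect request
```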
Figure 6.10 Illustration of how the ASON carries the client-layer reachability information from one side of the network to the other. [Diagram elements: Router rA, OXC oxA (optical port address opA), OXC oxB (optical port address opB), Router rB, UNI. Diagram messages: “HELLO (I’m rB)”; “rB sits on port opB on oxB”; “I can reach rB”; “OK, I can ask the transport network to connect me to router rB!”; “UNI Connect rB”; “Connect opB on oxB”.]
6.2 Generic Multilayer Recovery Approaches
In the previous chapters, survivability and recovery mechanisms have been discussed from the viewpoint of a single network technology, and thus within one network layer (e.g., IP routing in the IP layer, or 1+1 optical protection in the OTN layer). As shown, these schemes can effectively handle a large number of failure scenarios. The integration of different network technologies, for example, IP and OTN, into (realistic) multilayer transport networks (see also Chapter 1) leads to new opportunities and new challenges concerning the survivability of such multilayer networks. Those opportunities lie in the fact that in such networks recovery techniques from the different network layers can cooperate to recover more efficiently or faster from a network failure. This also brings new challenges and difficulties to the coordination of those mechanisms in the different layers. It is the intention of this section to give a generic description of the survivability in multilayer networks; some more concrete and specific case studies of multilayer recovery mechanisms are discussed in Section 6.3. This section starts with illustrating why attention should be paid to multilayer recovery (see Section 6.2.1). Then three generic categories for providing recovery in multilayer networks are discussed: single-layer recovery schemes in multilayer networks (see Section 6.2.2) with the important issue in which layer of the network to provide the recovery scheme; static multilayer recovery schemes (see Section 6.2.3) where recovery schemes at several network layers can be provided with an important issue of how to make them interwork; and then the dynamic multilayer recovery strategies (see Section 6.2.4) that use dynamic logical topologies for survivability purposes. A summary of the multilayer recovery strategies is given in Section 6.2.5.
Major contribution to Section 6.2 is credited to Ilse Lievens, INTEC, Ghent University.
6.2.1 Why Multilayer Recovery?
A multilayer (or multitechnology) transport network can be viewed as consisting of a stack of single-layer (or single-technology) networks. Between the adjacent layers of this stack typically a client-server relationship exists. If we consider an IP-over-OTN multilayer network, the IP network layer is the client layer of the underlying OTN network layer, whereas the OTN layer acts as a server layer to the IP layer, providing, for instance, transport functionality to the higher client layer. Because each of these network layers has its own single-layer recovery schemes, one may be wondering why it is not sufficient to simply deploy a recovery scheme in only one layer of the multilayer network. One could think of the situation in which IP routing is deployed in the IP client layer and could be used against the failure of an IP router or of an IP interface card, or 1+1 optical protection is deployed in the OTN server layer to be used against the failure of an OXC or an optical fiber cut. Unfortunately not every failure in a particular network layer can be handled by the recovery mechanism in that same network layer. Consider, for example, Figure 6.11, in which an IP-over-OTN multilayer network is depicted showing the failure of OXC B. This failure is detected in the optical network and a recovery action may be initiated in the OTN layer. However, the OTN recovery action cannot recover the traffic along the working path (which goes from IP router a to IP router d),
Figure 6.11 Why multilayer recovery? [Diagram labels: IP layer with routers a, b, c, d; OTN layer with OXCs A, B, C, D, E; legend: Working Path, Recovery Path.]
because from the OTN layer point of view, this traffic is nothing more than two separate connections A-B and B-D, which are both unrecoverable in the OTN layer (as those connections terminate in the failing OXC). From the IP point of view, a number of secondary failures (links a-b, b-c and b-d) are detected, isolating the router b. Upon detection of these faults, the IP network layer could also initiate recovery actions, eventually leading to the recovery path indicated in Figure 6.11, where the traffic from router a to router d traversing router b will be rerouted by the IP layer via other paths (e.g., via the path a-c-d). Another example leading to the same situation is the failure of router b that can only be recovered in the IP layer (of course in both cases, the traffic destined for router b cannot be recovered). This example illustrates that it is not that simple to decide straightforwardly in which layer of the multilayer network to provide and deploy a recovery scheme; it might even be beneficial to deploy recovery schemes in multiple layers of the network. As will be shown in the following sections, it is important to be able to combine recovery schemes in more than one layer to benefit from the advantages of each layer. It should be noted that multilayer recovery strategies should only be applied if indeed they are more beneficial than single-layer recovery. Implementing a multilayer recovery strategy does not mean that all the recovery mechanisms will be used at every layer, as is shown in Section 6.3 where some case studies are discussed. Let us illustrate the complexity of the trade-off that must be made when deciding in which layers to provide recovery schemes, with the following reflections:
• Recovery at higher network layers is desired, because lower layers will not notice failures of higher layer equipment.
• Recovery at higher layers is desired because higher layer equipment can become isolated because of a failure in the lower network layer (e.g., the failure of an OXC in the optical network layer). Only a recovery scheme in the higher layer is then able to recover the traffic that transits this isolated equipment.
• Recovery at lower layers is desired, because native traffic that is injected in lower layers cannot be recovered by higher layer recovery strategies.
• Implementing multilayer recovery is typically more complex to monitor, operate, and design than single-layer recovery.
When discussing multilayer networks and their survivability, two crucial questions should be answered:
• In which layer or layers should recovery schemes be provided?
• If multiple layers are chosen for this purpose, then how is the survivability in these layers coordinated?
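The OXC B example of Figure 6.11 can be replayed with a short Python sketch. The mapping of logical IP links onto OXC routes is an assumption made for illustration (the figure itself is not reproduced here), and the names LIGHTPATHS, secondary_failures, and ip_reroute are hypothetical. The sketch derives the secondary failures seen by the IP layer from one root failure in the OTN layer and then searches for an IP-layer recovery path.

```python
from collections import deque

# Assumed mapping: logical IP link -> OXCs traversed by its lightpath.
LIGHTPATHS = {
    ("a", "b"): ["A", "B"],
    ("b", "d"): ["B", "D"],
    ("b", "c"): ["B", "C"],
    ("a", "c"): ["A", "C"],
    ("c", "d"): ["C", "E", "D"],
}

def secondary_failures(failed_oxc):
    """Logical links that fail because their lightpath uses the failed OXC."""
    return {link for link, route in LIGHTPATHS.items() if failed_oxc in route}

def ip_reroute(src, dst, failed_links):
    """Breadth-first search over the surviving logical topology."""
    adj = {}
    for u, v in LIGHTPATHS:
        if (u, v) in failed_links:
            continue
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    prev, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None  # destination is isolated

down = secondary_failures("B")
print(sorted(down))                # [('a', 'b'), ('b', 'c'), ('b', 'd')]
print(ip_reroute("a", "d", down))  # ['a', 'c', 'd']
print(ip_reroute("a", "b", down))  # None: traffic destined for b is lost
```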
6.2.2 Single-Layer Recovery Schemes in Multilayer Networks
This section discusses the provisioning of recovery functionality in multilayer networks by starting from the single-layer recovery schemes. The concepts and discussions focus on a two-layer network but are generic and can thus be applied to any multilayer network. We look at how a recovery scheme in one network layer can be deployed to provide survivability in the multilayer network. This basically comes down to answering the question in which network layer a recovery scheme can be deployed, and what are the consequences of such a decision.
Survivability at the Bottom Layer
In this recovery approach, denoted survivability at the bottom layer, recovery of a failure is done at the bottom layer of the multilayer network. In an IP-over-OTN network, for example, this implies that the 1+1 optical protection scheme or any other recovery scheme that is deployed in the OTN layer attempts to restore the affected traffic in case a failure occurs. By recovering a failure at the bottom layer, this strategy has the benefit that only a simple root failure has to be treated and that the number of required recovery actions is minimal (the recovery actions are performed on the coarsest granularity). In addition, failures do not need to propagate through multiple layers before they trigger any recovery action. However, one of the major drawbacks of this recovery strategy is its inability to cope with problems or failures that occur in a higher network layer, above the bottom layer in which the recovery scheme is deployed. In addition, there are also situations in which the recovery process in the bottom layer is not able to restore all traffic, whereas a higher layer recovery mechanism would be able to. For example, if we consider an IP-over-OTN network, in which a node failure occurs in the optical layer (being an OXC failure), the optical network layer recovery mechanism will only be able to restore the affected traffic that transits the failed bottom-layer node (being the OXC). The co-located higher-layer node (an IP router in this case) will become isolated because of the failure of the OXC underneath, and thus, all traffic that transits this IP router cannot be restored in the lower optical layer. An example is given in Figure 6.12, with a two-layer network. The considered network carries two traffic flows between client-layer nodes a and c. One traffic flow (a-d-c, indicated with a full line in the left part of the figure) transits the client-layer node d (using two logical links a-d and d-c), and the other traffic flow (a-c, indicated with a thin solid line in the left part of the figure) uses a direct logical link from a to c and only transits the server-layer node D. A failure occurs in the bottom layer, for example, the failure of node D. The left part of the figure illustrates that the server layer cannot recover the first traffic flow a-d-c. This is due to the fact that the client-layer node d is isolated because of the failure of D, which is terminating both logical links a-d and d-c. This implies that the client layer has to recover this flow, which is shown at the right part of Figure 6.12 (client-layer recovery path a-b-c, using two logical links a-b and b-c). However, the second traffic flow a-c is routed over a direct logical link between node a and c. This logical link transits the failing node D in the bottom layer, which means that this traffic flow can be restored by the bottom-layer recovery scheme, as shown in the left part of the figure with the dotted line.
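The distinction between a logical link that terminates in the failed node and one that merely transits it can be expressed as a small classification function. The Python sketch below is illustrative: the lightpath routes are assumed from the description of Figure 6.12, the function name recovering_layer is hypothetical, and the sketch assumes that the server layer always has enough spare capacity to route around a transit node.

```python
# Assumed server-layer routes of the logical links used by the two flows.
LIGHTPATHS = {
    ("a", "d"): ["A", "D"],
    ("d", "c"): ["D", "C"],
    ("a", "c"): ["A", "D", "C"],
}

def recovering_layer(flow_links, lightpaths, failed_node):
    """Return 'unaffected', 'server', or 'client' for one client-layer flow."""
    affected = [link for link in flow_links if failed_node in lightpaths[link]]
    if not affected:
        return "unaffected"
    for link in affected:
        route = lightpaths[link]
        # A lightpath that terminates in the failed node cannot be restored
        # by the server layer: its endpoint no longer exists.
        if failed_node in (route[0], route[-1]):
            return "client"
    # Every affected lightpath only transits the failed node.
    return "server"

flow_1 = [("a", "d"), ("d", "c")]   # a-d-c, transits client node d
flow_2 = [("a", "c")]               # direct logical link a-c

print(recovering_layer(flow_1, LIGHTPATHS, "D"))  # client
print(recovering_layer(flow_2, LIGHTPATHS, "D"))  # server
```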
Survivability at the Top Layer
Another strategy for providing survivability in a multilayered network is to provide the survivability at the top layer of the network. The advantage of this strategy is
Figure 6.12 Survivability at the bottom layer: Illustration of the impact of a node failure on two traffic flows between the client-layer nodes a and c. [Diagram labels: Client Layer with nodes a, b, c, d; Server Layer with nodes A, B, C, D, E; Isolated Client Node; Logical Links Terminated in a Failing Node Cannot Be Recovered; Transit Traffic in Isolated Client Node Needs Recovery in the Client Layer; Not Recovered by Server Layer Recovery; Recovered by Server Layer Recovery; Recovered by Client Layer Recovery. Legend: Client Layer Primary Path 1; Client Layer Recovery Path 1; Client Layer Primary Path 2; Client Layer Recovery Path 2.]
that it can cope more easily with higher layer failures or node failures, also illustrated in Figure 6.12. A major drawback of this strategy is that it typically requires a lot of recovery actions, because of the finer granularity of the flow entities in the top layer. However, by treating each individual flow at the top layer, this allows differentiating between these flows based on their (service) importance. In other words, the top layer may choose to recover the critical, high-priority traffic before any recovery action is taken for the low-priority flows. Such a service differentiation among traffic flows based on the order of the recovery action is not possible in lower layers, because the lower layers switch every flow in an aggregated signal with one single recovery action. Indeed the level of granularity of recovery is a lambda at the optical layer, a VC at the SONET-SDH layer, a class of service at the IP layer, and a traffic engineering LSP at the MPLS TE layer. Under certain conditions, this finer granularity may also lead to more efficient capacity usage. First, aggregated signals that are poorly filled with working traffic have enough capacity left to transport spare resources. Second, the finer granularity allows distributing flows over more alternative paths. However, when comparing this survivability at the top-layer strategy with the survivability at the bottom-layer strategy, a trade-off exists between a better filling of the capacity of the logical links on one hand and the potential larger amount of higher layer equipment required on the other hand. Not only the potential difference in granularity between the failing equipment in a lower network layer and the corresponding affected entities in the top layer (thus, requiring more recovery actions) is an issue. Also the typically complex secondary failure scenarios (in the top network layer), as a consequence of a single root failure in a lower layer, can become a problem. 
This is illustrated in Figure 6.13, where the failure of an optical link in the bottom layer corresponds with the simultaneous failure of three logical IP links in the top layer (see also Chapter 5, Section 5.1.2, in which Shared Risk Link Groups (SRLGs) are discussed in this context). This implies that the recovery scheme in the top layer will have to recover from three simultaneous link failures, which is quite complex. If there had been a recovery scheme at the bottom layer, however (see the section Survivability at the Bottom Layer), this recovery scheme would only have to cope with the more simple failure scenario of one link failure.
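The propagation of one root failure into several secondary failures amounts to looking up which logical links share the failed physical link, that is, its Shared Risk Link Group. The following Python sketch uses an invented topology (the routes in LIGHTPATHS and the name srlg_of are hypothetical, not read from Figure 6.13) to show how one fiber cut translates into three simultaneous logical link failures.

```python
# Assumed routes of four logical IP links over the optical topology.
LIGHTPATHS = {
    ("p", "q"): ["X", "Y"],
    ("p", "r"): ["X", "Y", "Z"],
    ("s", "q"): ["W", "X", "Y"],
    ("s", "r"): ["W", "Z"],
}

def srlg_of(physical_link, lightpaths):
    """Logical links that share `physical_link` and therefore fail together."""
    key = frozenset(physical_link)
    return {
        logical
        for logical, route in lightpaths.items()
        if any(frozenset(hop) == key for hop in zip(route, route[1:]))
    }

group = srlg_of(("X", "Y"), LIGHTPATHS)
print(sorted(group))   # [('p', 'q'), ('p', 'r'), ('s', 'q')]

# One recovery action suffices at the bottom layer (the fiber itself),
# whereas the top layer has to handle every logical link in the group.
bottom_layer_actions = 1
top_layer_actions = len(group)
```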
Slightly Different Variants: Survivability at the Lowest Detecting Layer and Survivability at the Highest Possible Layer
A slightly different variant on the strategy that applies survivability at the bottom layer is the survivability at the lowest detecting layer strategy. The lowest detecting layer is the lowest layer in the layered network hierarchy that is able to detect the failure. This implies that multiple layers in the network will deploy a recovery scheme, but that the (single) layer that detects the root failure is still the only layer that takes any recovery actions. With this kind of strategy, the problem that the bottom-layer recovery scheme does not detect a higher layer failure is avoided because the higher layer that detects the failure will recover the affected
Figure 6.13 Survivability at the top layer: A single root failure may propagate to many so-called secondary failures. [Diagram labels: Client Layer; Server Layer.]
traffic. However, although this survivability at the lowest detecting layer strategy can ensure that traffic transiting the failing equipment in the detecting layer is restored, it still suffers from the fact that it cannot restore any traffic transiting higher layer equipment isolated by a node failure in the detecting layer. With this survivability at the lowest detecting layer strategy, the client layer in the example of Figure 6.12 will deploy a recovery scheme, but the considered traffic flow a-d-c is still lost, because this client-layer recovery scheme is not triggered by the occurrence of the node failure in the server layer. So, although this strategy considers the deployment of recovery schemes in multiple layers, it is still considered a single-layer survivability strategy in a multilayer network, because for each failure scenario the responsibility to recover all traffic is situated in only one layer (i.e., the one detecting the failure). A slightly different variant of the strategy that provides survivability at the top layer is the survivability at the highest possible layer strategy. Because not all traffic has to be injected (by the customer) at the top layer, with this strategy a traffic flow is recovered in the layer in which it is injected, or in other words the highest possible layer for this traffic flow. This means that this highest possible layer is to be determined on a per traffic flow basis. For example, a data-centric optical network (IP-over-OTN network) may also support a leased optical channel service. This survivability at the highest possible layer strategy is also considered a single-layer survivability strategy for providing survivability in a multilayer network, even though it deploys a recovery scheme in multiple layers. Indeed, survivability at the highest possible layer may lead to recovery schemes in multiple layers, but these will never recover the same traffic flow. Actually, this strategy deploys the
survivability at the top layer strategy for each traffic flow individually (which implies that in essence, both strategies do not differ from each other).
6.2.3 Static Multilayer Recovery Schemes
In the previous section, some strategies were discussed that apply a single-layer recovery mechanism (meaning that recovery is strictly limited to one layer of the network when coping with network failures) to provide survivability in the multilayer network. As shown there, both the strategies with survivability at the bottom or lowest detecting layer and at the top or highest possible layer have their advantages and drawbacks. The advantages of these approaches can be combined, which implies that recovery mechanisms will run in different layers of the network as a reaction to the occurrence of one network failure. More generally speaking, the choice of the layer in which to recover the traffic affected by a failure will depend on the circumstances, such as which failure scenario occurred. However, this requires some rules or coordination actions to ensure efficient interworking between the network layers that are involved in the recovery process (see, e.g., the discussion in Chapter 2, Section 2.3.3, on SDH networks). This interworking is a so-called escalation strategy and strictly defines how layers and the recovery mechanisms within those layers react to different failure scenarios. This section discusses several existing escalation strategies or approaches: uncoordinated, sequential, and integrated. Then the issue of how and where to provide the spare capacity in multilayer networks is discussed, followed by the issues of network stability, network operation complexity, and revertive operation for multilayer recovery. The section ends with a qualitative comparison of the discussed survivability strategies.
Uncoordinated Approach
The easiest way of providing an escalation strategy is to simply deploy recovery schemes in the multiple layers without any coordination. This will result in parallel recovery actions at distinct layers. Consider again the two-layer network of Figure 6.14, with, for instance, the failure of the physical link A-D in the server layer. This failure of the physical link will also affect the corresponding logical link a-d in the client layer and the considered traffic flow a-c. Because the recovery actions in both layers are not coordinated, both recovery strategies in the client and the server layer will attempt recovery of the affected traffic. This implies that in the client layer the traffic flow a-c is rerouted by the recovery mechanism of the client layer (e.g., IP routing in an IP-over-OTN network), resulting in a replacement of the failed path a-d-c by, for instance, a new path a-b-c. At the same time, the server layer recovers the logical link a-d of the client-layer topology by rerouting all traffic on the failing link A-D through node E. In this example, recovery actions in a single layer would have been sufficient to restore the affected traffic. The main advantage of the uncoordinated approach is that this solution is simple and straightforward from an implementation and operational point of view
Figure 6.14 The uncoordinated multilayer survivability strategy. (Diagram: client layer with nodes a, b, c, d above a server layer with nodes A, B, C, D, E. Legend: client-layer primary path, server-layer recovery path, client-layer recovery path.)
(e.g., no standardization of coordination signals between both layers is required). However, Figure 6.14 also shows the drawbacks of this strategy. Both recovery mechanisms occupy spare resources during the failure (the server layer along A-E-D and the client layer along a-b-c, implying the occupation of spare resources on A-B and B-C in the server layer), although one recovery scheme occupying spare resources would be sufficient. This implies that potentially more extra traffic (being unprotected preemptable traffic) is squelched or disrupted than necessary. The situation can be even worse. Suppose, for example, that the server layer reroutes the logical link a-d over the path A-B-C-D instead of A-E-D; then both recovery mechanisms need spare capacity on the links A-B and B-C. If these higher-layer spare resources are supported as extra traffic in the lower layer, there is a risk that these client-layer spare resources are preempted by the recovery action in the server layer, resulting in "destructive interference." In other words, neither of the two recovery actions is able to restore the traffic, because the client layer reroutes the considered flow over the path a-b-c, which is disrupted by the server-layer recovery. Reference [Wau99] illustrates that these risks may exist in real networks; the authors show that a switch-over in the optical domain (e.g., for protection purposes in the optical network) may trigger traditional SDH protection. Furthermore, as discussed in the section Trade-Off between Rerouting Time and Network Stability for Recovery in Multilayer Networks, such a multilayer recovery strategy can have a significant impact on the overall network stability and can introduce
undesirable potential race conditions in the network. In summary, although simple and straightforward, just letting the recovery mechanisms in each layer run without a coordinating escalation strategy has its consequences on efficiency, capacity requirements, and even ability to restore the traffic.
Sequential Approach
A more efficient escalation strategy, in comparison with the uncoordinated approach, is the sequential approach. Here the responsibility for the recovery is handed over to the next layer when it is clear that the current network layer is not able to do the recovery task. Instead of uncoordinated recovery in several network layers, one ensures that a fault is not resolved in different layers at the same time (possibly leading to race conditions), by imposing a chronological order on the recovery mechanisms. For this escalation strategy two questions must be answered: in which layer to start the recovery process and when to escalate to the next layer. Two approaches exist: the bottom-up escalation strategy and the top-down escalation approach, each having different variants.
Bottom-Up Escalation
With the bottom-up escalation strategy, the recovery starts in the bottom or lowest detecting layer (being the network layer where the failure is detected, ensuring a fast activation of the recovery mechanism) and escalates upwards. All affected traffic that cannot be restored in this layer (e.g., because of capacity shortage) will be restored in a higher layer. The advantage of this approach is that recovery actions are taken at the appropriate granularity: First the coarse granularities are handled (restoring the big connections), recovering as much traffic as soon as possible, and recovery actions on a finer granularity (that is, in a higher layer) only have to recover a small fraction of the affected traffic. This also implies that complex secondary failures (as a consequence of the propagation of a root failure in the higher network layers; see the section Survivability at the Top Layer) are handled only when and if needed (e.g., if recovery in the lower layer because of the root failure is not possible). In the client-server example of Figure 6.12, for instance, there is the failure of OXC D as the root failure. This corresponds with the simultaneous failure of three IP links (a-d, a-c, and d-c) in the client layer. If the server-layer recovery mechanism copes with the failure of OXC D, then the client-layer recovery mechanism will only have to handle the recovery of the traffic over the links a-d and d-c, which is less complex than the simultaneous failure of three links. An example of the bottom-up interworking approach is shown in Figure 6.15, where node D in the server layer fails. The server layer starts with the recovery process, attempting to restore the logical link a-d. The server layer fails in this recovery because this logical link terminates on the failing node D. As such, the client-layer recovery scheme is triggered (the implementation of this trigger mechanism is discussed at the end of this subsection) to restore the corresponding affected
Figure 6.15 The bottom-up escalation approach. (Two panels, each showing the client layer with nodes a, b, c, d above the server layer with nodes A, B, C, D, E. Phase 1: recovery action in server layer, where the server-layer recovery fails. Phase 2: recovery action in client layer. Legend: client-layer primary path, server-layer recovery path, client-layer recovery path.)
traffic flow a-c (originally following the route a-d-c), by rerouting it over node b instead of node d. An issue that must be handled in the bottom-up escalation strategy is how a network layer knows whether it is the lowest layer that detects the failure, so that it can start the recovery, or whether it has to wait for a lower layer. Typically the fault signals that are exchanged to indicate a failure will carry sufficient information to derive in which layer the failure occurred, and the recovery process can start. Suppose, however, that this is not the case. Assume that we have a four-layer network, where a failure occurs in the bottom layer. Assume that the failure is detected in all four layers simultaneously (this assumes no delay in the propagation of the failure signals) and that it cannot be derived from those signals in which layer the failure has occurred. This means that each of the higher layers may assume that it is the lowest detecting layer and start the recovery. This can be overcome by appropriately using the mechanism of hold-off timers (see discussion later in this section), which are set progressively higher as we move upwards in the stack of layers: 0 milliseconds (ms) in the bottom layer, 20 ms in the layer directly above it, 40 ms in the next-to-top layer, and 60 ms in the top layer. In this way, the recovery mechanisms in the higher layers will wait, which gives the bottom layer (where the failure occurs) the chance to do the recovery.
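The staggered hold-off timers described above can be sketched as a small simulation. This is only an illustration under simplifying assumptions that are not in the text: detection is simultaneous in all layers, each layer has a fixed recovery time, and a layer starts its recovery when its timer expires unless traffic has already been restored. The function names and the recovery times used are hypothetical.

```python
def hold_off_timers(num_layers, step_ms=20):
    """Hold-off timer per layer in ms; index 0 is the bottom layer."""
    return [layer * step_ms for layer in range(num_layers)]


def simulate_escalation(timers_ms, recovery_ms, can_recover):
    """Return (layers that started recovery, time at which traffic is restored).

    A layer starts when its hold-off timer expires, unless traffic has
    already been restored by then. Times are measured from failure
    detection, assumed simultaneous in all layers (as in the text).
    timers_ms must be in ascending order, bottom layer first.
    """
    started = []
    restored_at = None
    for layer, start in enumerate(timers_ms):
        if restored_at is not None and restored_at <= start:
            break  # a lower layer finished before this timer expired
        started.append(layer)
        if can_recover[layer]:
            done = start + recovery_ms[layer]
            restored_at = done if restored_at is None else min(restored_at, done)
    return started, restored_at


timers = hold_off_timers(4)  # [0, 20, 40, 60], the values used in the text
# Hypothetical case 1: the bottom layer recovers in 15 ms, so no higher layer acts.
print(simulate_escalation(timers, [15, 50, 50, 50], [True] * 4))
# Hypothetical case 2: the bottom layer needs 30 ms, so the 20 ms timer expires too early.
print(simulate_escalation(timers, [30, 50, 50, 50], [True] * 4))
```

In the second case the layer above the bottom one also starts a recovery action, which suggests that the step between consecutive timers should exceed the recovery time of the layer below.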
Top-Down Escalation
With top-down escalation, it is the other way around. Recovery actions are now initiated in the top or highest possible layer, and the escalation goes downwards in the layered network. Only if the higher layer cannot restore all traffic are recovery actions in the lower network layer triggered. An advantage of this approach is
that a higher layer can more easily differentiate traffic with respect to service types so it can try to restore high-priority traffic first. A drawback of this approach, however, is that a lower layer has no easy way to detect on its own whether a higher layer was able to restore traffic (an explicit signal is needed for this purpose). So here the implementation is somewhat more complex, and this approach is not currently implemented. There is also a problem of efficiency: it is quite possible that, for example, 50% of the traffic carried by a wavelength channel in an optical network has already been restored by a higher-layer recovery mechanism, in which case protecting this wavelength in the optical layer as well is only useful for the other 50% of the carried traffic.
Implementation of an Escalation Strategy
In the previous subsections on escalation strategies it was mentioned that at one point in the recovery process the recovery is handed over from one network layer to another. The actual implementation of these escalation strategies (that is, of handing over the responsibility for recovery from one layer to the other) is another issue. Two possible solutions are described here (for ease of explanation, the bottom-up escalation strategy is assumed in what follows).
A first possible implementation solution is based on a hold-off timer Tw. Upon detection of a failure, the server layer can start the recovery immediately, whereas the recovery mechanism in the client layer has a built-in hold-off timer that must expire before initiating the client-layer recovery process. In this way, if the fault is already fixed by the server-layer recovery mechanism before the hold-off timer expires, no client recovery action will take place. If this hold-off timer expires and all or part of the traffic is not restored, then the client layer will take over the recovery actions. This is very straightforward; however, the main drawback of a hold-off timer is that recovery actions in a higher layer are always delayed, independent of the failure scenario, because the hold-off timer must expire first. Moreover, as discussed in a later section, one of the challenges is to determine the optimal value for Tw, which is driven by a trade-off between recovery time and network stability and performance.
The second possible escalation implementation overcomes this delay by using a recovery token signal between layers. This means that the server layer sends the recovery token (by means of an explicit signal) to the client layer from the moment it knows it cannot recover (all or part of) the traffic. Upon receipt of this token, the client-layer recovery mechanism is initiated.
This allows limiting the traffic disruption time in case the server layer is unable to do the recovery. A disadvantage, compared with the hold-off timer interworking, is that a recovery token signal needs to be included in the standardization of the interface between network layers. For the top-down approach, a hold-off timer is probably less appropriate, because the lower layer must be notified with an explicit signal whether the higher layer managed to restore the traffic or not. Note that at the time of publication the timer-based approach is the only approach currently available in commercial products and therefore used in deployed networks.
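The difference between the two implementations can be illustrated by comparing the moment at which the client layer starts its recovery. The sketch below is a simplified model, not a description of any product: the function names, the millisecond values, and the convention of using None for "no client action needed" are all assumptions made for illustration.

```python
def client_start_with_hold_off(tw_ms, server_restored_ms):
    """Client-layer start time (ms after detection) with a hold-off timer Tw.

    server_restored_ms is when the server layer restores the traffic,
    or None if it cannot. Returns None if no client action is needed.
    """
    if server_restored_ms is not None and server_restored_ms <= tw_ms:
        return None  # fault fixed by the server layer before Tw expired
    return tw_ms     # the client always waits the full Tw before acting


def client_start_with_token(server_gives_up_ms):
    """Client-layer start time with a recovery token.

    server_gives_up_ms is when the server layer concludes it cannot
    recover and sends the token, or None if it recovers all traffic.
    """
    return server_gives_up_ms


# Hypothetical scenario: the server layer knows after 10 ms that it cannot recover.
print(client_start_with_hold_off(tw_ms=100, server_restored_ms=None))  # 100
print(client_start_with_token(server_gives_up_ms=10))                  # 10
```

In this model the token shortens the disruption only when the server layer cannot recover; when the server layer succeeds, neither variant triggers the client layer.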
Integrated Approach
A more radical means to ensure coordination between the recovery mechanisms in different layers is to combine the different recovery mechanisms in one integrated multilayer recovery scheme. This implies that this recovery scheme has a full overview of all the network layers and that it can decide when and in which layer (or layers) to take the appropriate recovery actions. Although this approach is clearly the most flexible from the recovery point of view, combining different technologies in one mechanism is often unrealistic from a practical point of view. Indeed, to profit from this high flexibility, one has to provide the necessary algorithmic intelligence and/or complexity. Another issue is the implementation and realization of such an integrated approach. It is unlikely that a single recovery scheme, controlling and having an overview of all network layers, is developed in current overlaid networks. However, this can become more feasible when looking at peer-to-peer networks.
Supporting Spare Resources for Multilayer Recovery
Multilayer survivability involves more than just coordinating the recovery actions in multiple layers. There is also the issue of the spare resources, and how they have to be provided and used in an efficient way in the different layers of the network. Several examples are given in Section 6.3, which elaborates on some case studies. One way or another the logical (spare) capacity assigned to the recovery mechanisms that are deployed at higher network layers must be transported at the lower layer. Several ways exist to do this.
The most straightforward option is called double protection and is depicted in Figure 6.16 for an IP-over-OTN network. Note that this figure (as well as Figures 6.17 and 6.18) is highly simplified to focus the discussion on the relevant aspects of the spare capacity provisioning in multilayer networks. In reality of course there will be more OXCs and fibers connecting them. Here the spare capacity that is provisioned in the logical IP network to be used by the IP routing mechanism is simply protected again in the underlying optical layer. Despite the reduced complexity, this double protection is a rather expensive solution. In Figure 6.16, a working logical IP link between the outer two router line cards (full line in the figure) is protected by the logical IP spare link between the inner two router line cards (dashed line in the figure). These logical IP links are implemented by two lightpaths in the optical layer. Both these lightpaths are also protected in the optical layer (top and bottom dashed lines in the OTN layer in the figure). A failure of the fiber (carrying the lightpath of the working IP link) interconnecting both OXCs would result in using the backup lightpath of the lightpath implementing the working logical IP link. Only in the case that, for example, one of the outer router line cards (on the working IP link) fails would the spare logical IP link be needed.
The added value of protecting, and thus investing in, an additional amount of spare capacity in the optical layer is expected to be very low. Only if, for example, one of the outer router line cards and the top fiber (interconnecting both OXCs) fail simultaneously would this double protection provide added value. In
Figure 6.16 Option 1: Double protection. (Diagram: IP layer above OTN layer. Legend: working IP link, spare IP link, backup lightpath for lightpath of working IP link, backup lightpath for lightpath of spare IP link.)
some network scenarios (e.g., pan-European networks consisting of 20,000 kilometers [km] of fiber), simultaneous cuts of two fibers might become a concern. To benefit in such a situation from the expensive double protection, less overlap would be required between the lightpaths implementing both logical IP links. However, (optical) transport networks are typically sparse (e.g., having an average connectivity of less than three) and would not allow such non-overlapping routing. In summary, investing in double protection is highly debatable and probably only meaningful in a few exceptional network scenarios. Figure 6.16 shows one point-to-point example of this double protection. The same reasoning is of course valid for an entire network; if each IP link is protected in the optical layer and the IP network is traffic engineered to survive any single failure (implying that there is also spare capacity in the IP layer, which is in turn protected in the optical layer), the required capacity is more than twice what is actually needed, because backup capacity is provisioned in both network layers.
A first possibility to save investment in physical capacity is carrying the spare capacity in the logical higher-layer network allocated to the higher-layer network recovery techniques, as unprotected traffic in the underlying network layer (or layers) (see Figure 6.17 for the IP-over-OTN example). This strategy, called logical spare unprotected, still allows protecting against any single failure: A cut of the bottom fiber (carrying the lightpath of the working IP link) would trigger the optical network recovery, whereas a failure of one of the outer router line cards would trigger the IP layer network recovery. A prerequisite for such a scenario is that the optical network supports both protected and unprotected lightpaths. It is
Figure 6.17 Option 2: Logical spare unprotected. (Diagram: IP layer above OTN layer. Legend: working IP link, spare IP link, backup lightpath for lightpath of working IP link.)
crucial to guarantee that the unprotected spare lightpath carrying the spare capacity of the logical higher network layer (in Figure 6.17, these are the logical spare IP links) is not affected by the failure that triggers the IP layer network recovery (that actually uses this unprotected spare lightpath). Otherwise, the spare IP capacity would also become unavailable for recovery of this failure, and the recovery process would fail. One step beyond simply carrying the spare capacity of the logical higher network layers as unprotected traffic in the underlying layer is to preempt this unprotected traffic by the network recovery technique of the underlying network layer. This is the common pool strategy, and an example is given in Figure 6.18 for an IP-over-OTN network. The lightpath implementing the working logical IP link is optically protected. The lightpath implementing the spare logical IP link is then routed in the (optical) spare capacity, which is needed to protect the aforementioned lightpath (the one that implements the working logical IP link). Thus the backup lightpath of the working IP link overlaps with the lightpath of the spare logical IP link. In case of a failure of the fiber carrying the working logical IP link, the optical protection will be triggered, preempting the lightpath implementing the spare logical IP link. In that case, there is no problem preempting this lightpath because it is not needed in the failure scenario. However, the preemption of lightpaths carrying logical spare capacity requires additional complexity. In summary, the common pool strategy provides a pool of physical spare capacity that can be used by the recovery technique in either the IP or the optical layer (but not simultaneously). The options logical spare unprotected and common pool, which are discussed above, are developed to reduce the amount of capacity to invest when deploying a
Figure 6.18 Option 3: Common pool. (Diagram: IP layer above OTN layer. Legend: working IP link, spare IP link, backup lightpath for lightpath of working IP link.)
static multilayer survivability strategy. Let us give a flavor of these cost savings with the following example on a realistic network scenario, being the Italian backbone network (see left side of Figure 6.19). A simplified/linear cost model has been adopted to quantify the network cost. Each fiber has been assigned a cost per wavelength channel per kilometer to quantify the cost for the optical fibers and the inline amplifiers that are installed every 70 km. This means that the total cost to equip the fiber is evenly distributed over all wavelength channels possibly multiplexed. In the nodes, a cost per wavelength channel port has been assigned. This includes the cost for an OXC (averaged over the three OXC sizes for which data was available) plus a cost for the WDM line system. Finally, an IP layer cost has been assigned; this is a cost per terminated wavelength channel and includes a router line card cost and the cost for the OXC port to which the router line card is connected. In summary, this cost model does not explicitly represent or take into account the granularity of the system sizes (except the wavelength channels that are assumed to have a capacity of 2.5 Gbps). The adopted design methodology is as follows. First, for each node pair, the cost to establish a 1+1 protected lightpath is computed. Because a linear cost model is applied, the shortest, and thus the cheapest, cycle containing both nodes has to be found. Knowing the cost for each possible logical IP link, the logical IP network topology is optimized to transport the offered traffic demand, the result of which is depicted on the right side of Figure 6.19. This result depends of course on the considered traffic pattern and volume; the traffic pattern is based on the assumption that most of the traffic is generated in the four major cities that have a connection to the commodity Internet and where the content servers are installed. Second, each IP
Figure 6.19 Network scenario: Optical transport network layer topology (left), and nominal Internet Protocol layer topology (right). (Left panel: physical topology. Right panel: nominal topology, that is, the optimal logical topology in failure-free condition. Nodes in both panels: Torino, Milano, Trento, Venezia, Genova, Bologna, Firenze, Pescara, Roma, Cagliari, Napoli, Bari, ReggioC, Palermo.)
router failure scenario is simulated; traffic is always rerouted along the remaining shortest path according to the logical topology. The links in the logical network are then dimensioned so that sufficient capacity is available to cope with each router failure. Based on the design of the logical IP network, two optical traffic demands are generated. The first one is called the logical working capacity (corresponding to the link capacities, represented as thickness of the lines at the right side of Figure 6.19). The second one contains the remaining capacity including the logical spare capacity that is only needed in case of at least one router failure. In a final step, the underlying network is dimensioned; for this purpose the shortest cycle routing is adopted to compute the cost for each possible logical link. Both the working and the backup path are equipped for the logical working capacity. In addition, the capacity to support the logical spare capacity is also equipped. With double protection, capacity is added on both the working and the backup path, whereas with unprotected logical spare capacity, the capacity is only added on the working paths. For the common pool strategy, the capacity along the working paths is computed and compared to the optical spare capacity that has already been installed to support the logical working capacity. Only the part of the former capacity that cannot be transported in the latter capacity is added to the network (thus, on each link the maximum instead of the sum of both capacities is installed). The cost comparison between the three static multilayer survivability strategies is depicted in Figure 6.20; the nominal cost refers to the case where no recovery against router failures/isolations would be considered and serves as base for the comparison (100%). The figure clearly shows that an overall network cost reduction of 10% can be achieved by transporting the logical spare capacity as unprotected or
Figure 6.20 Static multilayer survivability strategies: A comparison based on capacity requirements. (Bar chart titled "IP Spare Protected, Unprotected and Common Pool." Horizontal axis: strategy (IP spare protected, IP spare unprotected, common pool). Vertical axis: relative optical network cost as a percentage of the nominal case, from 0% to 140%. Each bar is split into line cost, node cost, and trib cost.)
as extra (this means unprotected and preemptable as in the common pool strategy) traffic in the underlying optical network. Note, however, that this does not help reduce the dominant cost to connect the IP routers to the optical network.
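The per-link dimensioning rule for the three options can be sketched as follows. This is a minimal illustration of the rule stated above (sum of all capacities for double protection, no backup for the logical spare in the unprotected option, and the maximum instead of the sum for the common pool). The class, its field names, and the channel counts are hypothetical and are not taken from the Italian backbone study.

```python
from dataclasses import dataclass


@dataclass
class LinkLoad:
    """Wavelength channels routed over one physical link (hypothetical model)."""
    working: int         # working lightpaths carrying logical working capacity
    working_backup: int  # backup lightpaths protecting those working lightpaths
    spare: int           # working-path lightpaths carrying logical (IP) spare capacity
    spare_backup: int    # backup lightpaths protecting the spare lightpaths


def double_protection(link):
    # Logical spare capacity is protected again in the optical layer.
    return link.working + link.working_backup + link.spare + link.spare_backup


def logical_spare_unprotected(link):
    # Logical spare capacity is carried without optical backup.
    return link.working + link.working_backup + link.spare


def common_pool(link):
    # Logical spare capacity shares the optical spare capacity:
    # install the maximum of both instead of their sum.
    return link.working + max(link.working_backup, link.spare)


# Two links with made-up channel counts, for illustration only.
links = [LinkLoad(4, 3, 2, 2), LinkLoad(2, 1, 3, 3)]
for strategy in (double_protection, logical_spare_unprotected, common_pool):
    print(strategy.__name__, sum(strategy(link) for link in links))
```

With these made-up numbers the totals are 20, 15, and 12 channels; the actual savings in a real network depend on topology, routing, and the cost model.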
Trade-Off between Rerouting Time and Network Stability for Recovery in Multilayer Networks
In multilayer networks where recovery mechanisms are used at multiple layers, the issue of achieving the recovery objectives while ensuring network stability is of utmost importance. As described earlier, the uncoordinated approach implies that the client layer tries to recover from the failure as soon as the failure is detected, independently of the recovery actions triggered at the server layer. As already pointed out, this approach has many drawbacks. A more viable approach is the sequential bottom-up escalation approach. Only the timer-based variant is currently available in commercial products; the client layer waits for some hold-off timer (Tw) to elapse before triggering any recovery action, to give the server layer the opportunity to recover the failure. This illustrates the trade-off between recovery time on one hand and network stability on the other hand. Indeed, if the timer Tw is set to too small a value, this may lead to a so-called false-positive recovery action, where the client layer will trigger its recovery mechanism before the server layer has completed its set of recovery actions. For instance, consider the example of an IP layer (as client layer) and a SONET network layer (as server layer). Another, more elaborate example is given in Section 6.3 with optical restoration (in the optical server layer) and MPLS TE Fast Reroute (in the IP/MPLS client layer). Suppose that the hold-off timer Tw is set to 60 ms and it turns out that the SONET protection cannot be completed under 60 ms (e.g., because the ring has long propagation delays and contains a
large number of stations). Upon a link failure and after the timer Tw has elapsed, the IP layer will trigger some network convergence before the SONET layer has recovered the failed link. Consequently, every IP router will recompute its routing table, avoiding the failed link. This failed link, however, will effectively be restored a few tens of milliseconds later by the SONET protection. Once the link is restored, the network will have to reconverge so this link can be reused. If the timer had been set to a larger value, the SONET protection would have been able to do the recovery job, and the IP layer would not have been obliged to initiate any recovery and reconvergence actions. So, giving the timer Tw too small a value has the undesirable effect of two unnecessary network convergences and may provoke network congestion for some time (up to several seconds depending on the revertive strategy) if the IP layer is not dimensioned to handle a single link failure without traffic congestion (at least for some flows). Moreover, it is usually highly desirable to use some dampening mechanism (which can be implemented at various layers), which aims to slow down the rate of state changes of a network element that is constantly flapping. Several examples have been given throughout this book. A typical example is the case of a flapping link (see Chapter 4, Section 4.4). In the IP layer, it is undesirable to generate a new IP LSP (link-state packet) at each link state change, so the LSP origination process is dampened until the network resource is considered as sufficiently stable (various dampening algorithms have been described in Chapter 4, Section 4.4). In other words, because dampening is necessary to preserve the network stability in case of unstable network elements, inappropriately declaring a resource as down may lead to a situation where the resource, once restored, is not reused for some time.
As illustrated, setting Tw to too small a value may have a negative impact on the network. On the other hand, setting Tw to too large a value results in longer recovery times if the fault cannot be recovered in the server layer. Finding the right compromise is not always straightforward, and the recovery time objectives must be balanced against network stability and performance in multilayer recovery networks. Of course this is particularly true when the different recovery mechanisms operate within similar time frames (like SONET-SDH, optical, and MPLS TE Fast Reroute). The case of SONET-SDH protection in combination with IP routing not tuned for fast convergence is less problematic.
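The trade-off can be made concrete with a small classification of what happens for a given Tw and a given server-layer recovery time. This is a sketch under stated assumptions: the Outcome type and its field names are invented for illustration, and the 80 ms SONET recovery time is a hypothetical value chosen to match the example of a protection switch that does not complete within 60 ms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    client_triggered: bool          # did the client layer start recovery?
    false_positive: bool            # was that client action unnecessary?
    client_start_ms: Optional[int]  # when the client layer started, if it did


def hold_off_outcome(tw_ms, server_restored_ms):
    """Classify the result of a failure for hold-off timer Tw.

    server_restored_ms is the time at which the server layer restores
    the link, or None if the server layer cannot recover the fault.
    """
    if server_restored_ms is None:
        # Client recovery is needed; Tw is pure added delay.
        return Outcome(True, False, tw_ms)
    if server_restored_ms <= tw_ms:
        # Server layer finished in time; the client layer stays idle.
        return Outcome(False, False, None)
    # Tw expired first: the client layer converges, then reconverges
    # once the server layer restores the link.
    return Outcome(True, True, tw_ms)


# Hypothetical SONET protection completing after 80 ms.
for tw in (60, 100):
    print(tw, hold_off_outcome(tw, server_restored_ms=80))
# A fault the server layer cannot recover: a larger Tw only adds delay.
print(hold_off_outcome(100, server_restored_ms=None))
```

Raising Tw from 60 ms to 100 ms removes the false positive in this example, at the price of a 100 ms wait whenever the server layer cannot recover.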
Network Operation Complexity
Although usually difficult to quantify, multilayer recovery approaches inevitably increase the network operation complexity. Indeed, the various layers (optical, SONET-SDH, IP/MPLS) are usually managed by different teams or organizations within an operator. Hence, a network element failure requires close collaboration between these teams. Furthermore, the root failure may be quite difficult to determine during troubleshooting because several recovery mechanisms are triggered at different layers. This might not be a show stopper but should be taken into account when opting for a multilayer recovery strategy.
Revertive Operation in Multilayer Networks with Multilayer Survivability
The previous subsections discussed various challenges of implementing multilayer recovery in multilayer networks. Another interesting aspect is the so-called revertive operation, which concerns the reuse of a restored resource (that is, after a failed resource has been repaired). Usually a resource is reused at a client layer once the server layer has declared it operational again. For instance, if a SONET-SDH VC is restored, it is quite common for the SONET-SDH layer to wait for 10 seconds before announcing the link as operational to the IP layer. The IP layer then starts reusing the link once a new IGP adjacency is reestablished with the adjacent router over the restored link. If an additional layer like MPLS TE exists, the link will eventually be reused to route TE LSPs, if it offers a more optimal path for some existing TE LSPs or if new TE LSPs are established in the network. Network stability is of the utmost importance in data networks. Hence, it is usually desirable to wait some time before reusing a restored resource, to ensure that the resource is not in an unstable state and is not likely to fail again shortly. Various mechanisms can be used: wait a fixed period of time without any failure (e.g., 10 seconds) before reusing a restored link, or wait for a dynamic timer that takes the link failure history into account.
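The dynamic timer mentioned above can be sketched as follows. This is one possible scheme, assumed here for illustration only: the hold-off starts at the fixed 10-second value and doubles for each additional failure seen in a recent observation window, up to a cap. The doubling rule, the window length, and the cap are hypothetical choices, not values prescribed by any standard.

```python
def wait_to_restore_s(failure_times_s, now_s,
                      base_s=10.0, window_s=600.0, max_s=320.0):
    """Hold-off (in seconds) before reusing a restored link.

    failure_times_s: times at which the link failed in the past.
    The hold-off doubles for every failure beyond the first one that
    occurred within the last window_s seconds, and is capped at max_s.
    """
    recent = [t for t in failure_times_s if 0 <= now_s - t <= window_s]
    extra_failures = max(len(recent) - 1, 0)
    return min(base_s * 2 ** extra_failures, max_s)
```

A link that failed once is reused after the fixed 10 seconds, whereas a link that failed three times within the window is held back for 40 seconds. A link that keeps flapping is never held back for more than the cap.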
Qualitative Performance Comparison of Some Recovery Strategies for Multilayer Networks
In Section 6.2.2, the use of single-layer survivability strategies in multilayer networks is discussed, highlighting a few shortcomings. Section 6.2.3 illustrates how these disadvantages can be overcome by providing recovery mechanisms in different layers and implementing an escalation strategy to ensure the correct interworking of these mechanisms. The qualitative performance of the discussed survivability strategies can be compared using a number of intuitive criteria, as shown in Table 6.2. Four strategies are compared: bottom layer, bottom-up, top layer, and integrated approach. Each performance criterion (left-most column) measures a certain aspect of the recovery process and is given a qualitative value. For example, the failure coverage of a survivability strategy can be low, indicating that a small number of failure types can be handled by the strategy, or high if a broad range of failure types can be handled. Another performance criterion is the required bandwidth resources, which indicates the extra capacity that is needed in the network to restore the traffic, compared with the situation in which there is no extra capacity and thus no traffic can be restored. The performance parameter coordination and management refers to the escalation approach: it is high if an escalation approach is in place. The strategy complexity, on the other hand, refers more to the complexity of the survivability strategy when dealing with an actual failure. For example, with the bottom-up strategy, an escalation strategy is installed, but then the survivability process falls back to two single-layer restoration strategies (because the escalation strategy defines and performs the transfer of the restoration process from one layer to another). This makes the strategy complexity medium, but the coordination and management high (as a result of the installment of an escalation strategy). The preferred value for each of the performance criteria, from the viewpoint of a network operator, is shown in the right-most column of the table. This can be used as a benchmark for comparing the different strategies. Because most strategies have advantages and disadvantages, we can select strategies according to their behavior on the criteria that matter most to the operator or decision maker. References [Gry98] and [Dem99] illustrate that the spare resource requirements can be reduced for the case of multilayer survivability by supporting higher layer spare resources as extra traffic in the lower layer spare resources (this is the common pool of spare resources). However, as mentioned in Section 6.2.2, a proper coordination of the recovery schemes becomes absolutely necessary in such a case.

Table 6.2 Comparison and Summary of Several Qualitative Performance Parameters for Some Significant Recovery Strategies

Performance Criteria          Bottom Layer  Bottom-Up  Top Layer  Integrated Approach  Preferred Value
Switching Granularity         Coarse        Coarse     Fine       Coarse               Coarse
Failure Scenario              Simple        Simple     Complex    Simple               Simple
Recovery Close to Root        Yes           Yes        No         Yes                  Yes
Failure Coverage              Low           High       High       High                 High
Coordination, Management      Low           High       Low        Low                  Low
Required Bandwidth Resources  Low           High       Low        Low/high             Low
Service Differentiation       Difficult     Difficult  Easy       Easy                 Easy
Strategy Complexity           Low           Medium     Low        High                 Low

(Adapted from D. Colle, et al., “Data-centric optical networks and their survivability,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 1, January 2002.)
6.2.4 Dynamic Multilayer Recovery
In the previous section, static multilayer recovery strategies have been discussed. They are called static because, at the time of a failure, the logical network topology (in an IP-over-OTN network, this is the IP layer topology) is left unchanged (i.e., static), and no specific actions are taken to modify it. As such, the logical network must be provided with a recovery technique and the required spare resources to be able to survive router failures, for example. Dynamic multilayer survivability strategies, which are the subject of this section, differ from such static strategies in the sense that they actually use logical topology
modification for recovery purposes. This requires the possibility to flexibly and at real-time set up and tear down lower layer network connections that implement logical links in the higher network layer. As was discussed in Section 6.1, for instance, optical networks will be enhanced with a control plane, which gives the client networks the possibility to initiate the setup and tear down of lightpaths through the optical network. This could be used to reconfigure the logical IP network when it is affected by a network failure. This approach has the advantage that the logical network spare resources should not be established in advance in the logical IP network (at least no spare line capacities) and thus the underlying optical network should not care about how to treat (as protected, unprotected, etc.) these client layer spare resources. This implies that there is no longer a requirement for established spare capacity in the logical IP layer, in contrast with the static multilayer resilience schemes discussed in the previous section. In the optical layer, however, spare capacity still has to be provided to deal with lower layer failures such as cable cuts or OXC failures. Enough capacity is also needed in the optical layer to support the reconfiguration of the logical IP network topology and the traffic routed on that topology. An illustration of such a dynamic reconfiguration of the logical higher-layer topology in case of failures is given in Figure 6.21 for an IP-over-OTN network. Initially in a failure-free situation the traffic flow from router a to router c is forwarded via the intermediate router b. To this end the logical IP network contains the IP links a-b and b-c, implemented by the lightpaths A-B and B-C in the OTN network. When router b fails, routers a and c will detect this failure and come to the conclusion that these two logical links are useless and can be torn down. This is requested to the optical layer, by sending a request through the UNI. 
This releases some capacity in the optical layer that can be used to set up a direct logical IP link from router a to router c. This setup is requested to the underlying optical network by sending a signal through the UNI, requesting the setup of the lightpath between OXCs A and C. So, at the time of the failure, the logical IP network topology is reconfigured. As mentioned before, a special feature of the underlying optical network is needed for this; it must be able to provide an SC service to the client network. ASONs, or more generally IONs, have this particular feature. A key issue with this dynamic multilayer recovery strategy (let us take the IP-over-OTN scenario as an example) involves the actual logical IP network topologies that will be used at the occurrence of failures. This implies that for dynamic multilayer recovery strategies the logical IP layer topology has to be dimensioned several times: there is one dimensioning exercise for the failure-free case (also called the nominal case), and one dimensioning exercise for each possible IP router failure, which yields the reconfigured topology that will be used when that particular IP router failure occurs. This is illustrated at the right side of Figure 6.22. If there are, for example, four IP routers in the network, five IP layer dimensioning exercises must be performed. For each of these IP layer dimensioning exercises, the capacity needed in the underlying optical layer is calculated. Network survivability against OTN layer failures is guaranteed by using an appropriate resilience scheme in the optical layer. The resources needed in the OTN layer to be able to recover from all possible single (IP or OTN) failures can then be calculated as the worst-case resource requirements (e.g., IP router cards) of the OTN network taken over the failure-free scenario and all IP failure scenarios (so the maximum capacity requirement over all these scenarios gives the actual dimensioning result). In comparison, the left side of Figure 6.22 shows the way of calculating the required OTN resources for a static multilayer recovery scheme (in the IP layer some working and spare LSPs are shown; the topology has to be biconnected to allow MPLS recovery of router failures). The bottom part of the figure then shows the actual resource requirements on the OTN links for both strategies, showing an improvement in the case of dynamic multilayer recovery. This theoretical result will be confirmed by a case study at the end of this section.

Figure 6.21 Illustration of dynamic multilayer survivability strategy. (Legend: lightpath implementing IP link; OTN link (optical fiber); IP link; traffic flow; lightpath implemented at time of the failure; IP link established after the failure.)

We assume that the IP topology used during failure-free conditions is as optimal as possible with respect to the traffic pattern and delay constraints. When an IP router failure occurs, the network has to carry less traffic than in the failure-free nominal case, because all traffic originating or terminating in the failing router cannot be restored. There are two possible approaches for the reconfiguration of the IP topology during such a failure condition: the so-called global reconfiguration option and the local reconfiguration option. In global reconfiguration, the goal is to have at each moment the most optimal topology with respect to the new traffic pattern, that is, without the traffic entering or leaving the network via the failing router. For every scenario (failure-free and every IP router failure), the IP topology is completely recomputed from scratch to obtain a new optimal topology that copes with the particular failure. The remaining IP traffic is then rerouted over this new logical IP topology.
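The worst-case dimensioning rule described above (one dimensioning exercise per scenario, then the per-link maximum over all scenarios) can be sketched in a few lines. The scenario names, link names, and capacity values used in the usage note are hypothetical; only the "maximum over scenarios" rule comes from the text.

```python
def dimension_otn_links(scenarios):
    """Return the capacity to install on each OTN link.

    scenarios maps a scenario name (the failure-free case or one IP router
    failure) to the capacity that scenario needs on each OTN link, e.g.
    {"failure-free": {"A-B": 2}, "router b fails": {"A-C": 1}}.
    The installed capacity of a link is the maximum need over all scenarios.
    """
    installed = {}
    for demand in scenarios.values():
        for link, capacity in demand.items():
            installed[link] = max(installed.get(link, 0), capacity)
    return installed
```

For example, if the failure-free topology needs 2 wavelengths on link A-B and 1 on B-C, and the topology used after a failure of router b needs 1 on A-B and 3 on A-C, the function returns 2, 1, and 3 wavelengths for A-B, B-C, and A-C, respectively. With four IP routers, the input would hold five scenarios.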
The local reconfiguration option potentially involves less reconfiguration of the IP topology under failure conditions. In this case, when an IP router fails, this router and its incident links are removed from the logical IP topology and the remaining IP traffic (all IP traffic that did not terminate in the failing router) is rerouted on this reduced topology. Link capacities can be upgraded or downgraded as needed by the new routing of the (affected) traffic. For example, if we consider a logical ring for the nominal or failure-free situation, this logical ring can become a star topology when applying global reconfiguration. In the case of local reconfiguration, however, this will remain a logical ring topology. With both reconfiguration approaches, the capacity requirements for the OTN layer are determined for each failure scenario and for the failure-free scenario, and the resources needed on each link in the optical layer are then calculated as the maximum of the resources needed on that link over all of those scenarios.

Figure 6.22 Static multilayer resilience scheme (left) versus dynamic multilayer resilience scheme using ION flexibility (right). (Note that for reasons of simplicity, the [physical] OTN topology is assumed to be uniconnected, so a recovery scheme in the OTN layer is not possible. In reality, however, the OTN topology will be biconnected, which enables the use of an appropriate resilience scheme also in that layer.) (S. De Maesschalck, et al., “Intelligent optical networking for multilayer survivability,” IEEE Communications Magazine, vol. 40, no. 1, pp. 42–49, January 2002.)

Let us now compare the static multilayer recovery schemes described in Section 6.2.3 with the dynamic (ION) recovery schemes discussed above, using the same network scenario as in Figure 6.19. To obtain the results for the dynamic multilayer recovery schemes (ION local reconfiguration and ION global reconfiguration), the capacity demands of the IP layer topology and related traffic pattern on the optical layer network are calculated for the optimal nominal (failure-free) IP topology and for each of the 14 possible IP router failure conditions. The underlying optical layer should be able to support each of these failure conditions and the failure-free condition. Thus, the capacity that needs to be installed on the links in the optical network is the maximum capacity needed on those links over all these failure cases and the failure-free case. Figure 6.23 shows a cost comparison (relative to the nominal failure-free situation) for the static recovery options using MPLS rerouting to protect against IP router failures (see [Dem99] and [ColONDM01] for more information on this recovery scheme in the MPLS layer) and for the dynamic options using ION flexibility. In all options, recovery against single optical node or link failures is provided using path protection in the optical layer. The total network cost is split into three parts: a line cost proportional to the length of the links, a node cost proportional to the number of wavelengths entering or leaving an OXC via an aggregate port, and a tributary cost for each IP router line card connected to an OXC. Figure 6.23 confirms that for all strategies, the optical network needs to install more capacity than for the support of the nominal logical IP network. In addition, ION local reconfiguration is clearly the most cost-efficient multilayer recovery scheme. The decreasing cost trend from “double protection” to “IP spare not protected” to “common pool” was expected, as the IP spare resources are supported more and more efficiently by the OTN resources. The higher flexibility needed to optimize the logical IP topology in each particular fault scenario in “ION global reconfiguration” requires a higher amount of installed capacity and equipment in the optical layer than “ION local reconfiguration,” making this global strategy more expensive (even as expensive as the quite inefficient static “double protection” strategy). The “ION local reconfiguration” solution is less expensive than the “common pool” one.
The main cost difference lies in the tributary cost. “ION local rerouting” needs fewer IP router line cards, and because this equipment is relatively expensive, this equipment saving results in quite a large cost saving. When looking at these results, however, one remark that needs to be taken into account is that the simple, straightforward methodology used to compute the needed resources in the “common pool” approach cannot always guarantee a correct functioning of this multilayer recovery scheme. As long as a single line or router failure occurs, only one of the two recovery mechanisms is activated and there is no risk of interference. However, in the case of an OXC failure, the protection mechanisms in both layers are triggered: the optical path protection scheme for the flows transiting the failing OXC, and the IP/MPLS recovery for the flows transiting the isolated router. Simply taking the maximum of the protection capacity over both recovery schemes for calculating the needed spare resources in “common pool” (as we do) is thus not always appropriate, because both recovery schemes may compete for the same resources at the same time. Besides this, in the options “IP spare not protected” and “common pool” there is the possibility, in the case of an OXC failure, that the optical routes of the spare and working IP capacity overlap. However, a proper but surely more sophisticated design approach could solve these problems. The corresponding additional operational complexity may not be that critical for static networks (e.g., manual provisioning) but becomes an increasingly important issue when evolving to more dynamic networks. The dynamic multilayer recovery schemes, however, do not suffer from these design disadvantages and can guarantee a better fault coverage. The reason is that in this case the spare capacity in the logical network does not have to be designed in advance; capacity is provisioned as needed and is always optically protected.

Figure 6.23 Cost comparison between static and dynamic multilayer resilience schemes. The figure plots the relative optical layer cost (percentage of nominal case), split into line cost, node cost, and tributary cost, for the schemes ION global reconfiguration, double protection, IP spare not protected, common pool, and ION local reconfiguration. (S. De Maesschalck, et al., “Intelligent optical networking for multilayer survivability,” IEEE Communications Magazine, vol. 40, no. 1, pp. 42–49, January 2002.)
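The limitation of the “maximum over both schemes” rule for the common pool can be illustrated with a small sketch. It assumes that, for one optical link, we know how many spare wavelengths each recovery scheme would claim under each failure; the failure names and numbers are hypothetical. The second value returned is a conservative bound that assumes the two schemes cannot share a wavelength when they are triggered by the same failure, which is only one way a more sophisticated design could account for the competition.

```python
def common_pool_spare(per_failure):
    """Spare capacity for one link under a common-pool design.

    per_failure maps a failure to (optical_need, ip_need): the spare
    wavelengths claimed on this link by optical path protection and by
    IP/MPLS recovery when that failure occurs.

    Returns (simple, conservative):
      simple       - maximum of the protection capacity over both schemes
      conservative - maximum over failures of the sum of both claims
    """
    simple = max(max(o, i) for o, i in per_failure.values())
    conservative = max(o + i for o, i in per_failure.values())
    return simple, conservative
```

With the hypothetical input {"fiber cut": (2, 0), "router failure": (0, 3), "OXC failure": (2, 2)}, the simple rule provisions 3 wavelengths, which suffices for the two single-mechanism failures but falls short of the 4 wavelengths claimed simultaneously after the OXC failure.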
Such a dynamic approach has the advantage of being highly efficient in terms of required backup capacity. There are of course also some issues and challenges. Compared to the static multilayer approach, the recovery time when a failure requires an IP network reconfiguration is likely to be significantly higher. Indeed, if we go back to the example of a router failure, with the static multilayer approach, the routers adjacent to the failed or isolated router can quickly detect the failure and the network can converge (find alternate paths for the impacted traffic flows) in a short period of time by means of fast IP routing or MPLS TE Fast Reroute. The dynamic multilayer recovery approach requires the IP router (or routers) to signal via the UNI the setup of new IP links, followed by the routing and signaling in the optical layer, and finally the setup of IGP router adjacencies over the newly established IP link (or links). In particular, such an approach requires several rules to prevent “false-positive” alarms that could lead to network instabilities. Indeed, upon a router failure, the network should be quickly reconfigured to limit the impact of traffic disruption and/or quality-of-service (QoS) degradation caused by congestion, but at the same time it would be undesirable to trigger a complex set of recovery mechanisms involving several layers for a temporary router failure. So the trade-off between fast recovery time and network stability is difficult to determine. Note also that such a dynamic multilayer recovery mechanism would still require some extra equipment capacity in the IP layer.
Figure 6.24 Generic framework for multilayer survivability. The figure groups the building blocks into single-layer recovery options, recovery interworking strategies, and multilayer spare capacity design. (Adapted from P. Demeester, et al., “Resilience in multi-layer networks,” IEEE Communications Magazine, vol. 37, no. 8, August 1999, pp. 70–76.)
6.2.5 Summary
In the previous sections we discussed generic strategies for survivability in multilayer networks. These range from single-layer recovery schemes for multilayer survivability to static multilayer recovery strategies to dynamic multilayer recovery approaches. Figure 6.24 illustrates the different options and building blocks that are possible for recovery in multilayer networks (based on reference [Dem99]).
6.3 Case Studies
In the previous sections of this chapter, we saw the various possible models of multilayer recovery networks, in which multiple recovery techniques can be combined to recover from network element failures. In this section, we propose three case studies of interlayer recovery mechanisms corresponding to existing or possible multilayer recovery networks. As already mentioned, at the time of publication, the only viable and deployed interlayer recovery strategy in use is the timer-based sequential bottom-up escalation approach, whereby each layer having a recovery mechanism, starting at the bottom layer, tries to recover from the detected failure. Once a failure is detected, each layer waits for a configurable timer to elapse before triggering any recovery action, to give a lower layer a chance to recover the fault. Taking the example of two layers (called the top and bottom layers), if the fault occurs in the bottom layer, the recovery is as fast as possible. On the other hand, if the fault can only be recovered at the top layer, this induces some additional delay because the timer has to elapse at the top layer before triggering a recovery action. As already pointed out, the level of predictability of the maximum time required by the bottom layer to recover from a fault helps in adjusting the timer adequately. Undoubtedly, this approach offers the best guarantees in terms of network stability, avoiding any race condition between recovery mechanisms simultaneously triggered at different layers. Three case studies are covered in this section:
1. Optical restoration and MPLS TE Fast Reroute
2. SONET-SDH protection and IP routing
3. MPLS TE Fast Reroute and IP routing
6.3.1 Case Study 1: Optical Restoration and MPLS Traffic Engineering Fast Reroute
In this case study, the optical network will use a restoration recovery mechanism to handle both fiber failures and other optical equipment failures. In addition, MPLS TE Fast Reroute is used as a protection mechanism to handle link failures occurring at the IP/MPLS layer (router interface failure) and IP/MPLS router failures. We first describe each single-layer recovery mechanism, followed by the multilayer aspects (Figure 6.25).
Single-Layer Recovery Mechanisms
1. Optical restoration: In this example, the optical network provides a restoration mechanism whereby, upon detection of a network element failure, a dynamic routing and signaling mechanism is responsible for restoring the affected set of lightpaths. To minimize the required backup capacity in the optical layer, the network is dimensioned to survive a single fiber or optical node failure. The use of an optical restoration mechanism certainly has the advantage of supporting the concept of shared optical backup capacity, which optimizes the required backup capacity compared to a protection mechanism (e.g., 1+1 optical protection) for the same failure coverage. On the other hand, as with any other restoration mechanism, the recovery time is greater than with a protection mechanism and less deterministic. Indeed, once the fiber or optical network equipment failure has been detected and localized, the FIS must be flooded throughout the network until it reaches the node capable of recovering the traffic, which in turn recomputes an alternative path and finally resignals the optical path. Several commercial implementations exist that provide optical restoration based on proprietary protocols or G-MPLS/ASON. As a matter of fact, the backup path computation time increases (sometimes nonlinearly) with the number of optical nodes, the network topology complexity, and the number of constraints taken into account when computing the path.
2. MPLS TE Fast Reroute: In this case study, a full mesh of MPLS TE LSPs is established in the network (note that these TE LSPs can be with or without constraints depending on the network requirements; see Chapter 5 for a detailed discussion on the use of MPLS TE). These TE LSPs are signaled as fast reroutable (i.e., local protection using MPLS TE Fast Reroute is required in the case of network element failure). Hence, in this case study, at each hop, a set of backup tunnels protecting against link and node failure is presignaled. In the case of link or IP/MPLS node failure, the TE LSPs are locally rerouted onto their respective backup tunnels (selected when the TE LSPs are first signaled) within a very short time (on the order of 50 ms). In a second step, these TE LSPs are potentially reoptimized along a more optimal path by their respective head-end LSR.

Figure 6.25 Case Study 1: optical restoration with IP/MPLS FRR. (The figure shows the IP/MPLS layer, with routers a to d using MPLS TE Fast Reroute, above the optical layer, with nodes A to E using restoration. It depicts an NHOP (next-hop) FRR backup tunnel protecting against a failure of the link a-b, an NNHOP (next-next-hop) FRR backup tunnel protecting against a failure of the router b, the working path, and the recovery path of the a-b link after a failure of the fiber A-B.)
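The local rerouting step of MPLS TE Fast Reroute described in item 2 can be sketched as follows. This is a simplified model, not a protocol implementation: an LSP is a list of router names, and each presignaled backup tunnel is keyed by its point of local repair and its merge point (the next hop for link protection, the next-next hop for node protection). The function name, the data layout, and the backup tunnels in the usage note are assumptions made for illustration.

```python
def frr_reroute(lsp_path, failed, backups):
    """Return the LSP path after local repair.

    lsp_path: list of routers, e.g. ["a", "b", "c"]
    failed:   ("link", (u, v)) or ("node", n)
    backups:  {(plr, merge_point): [intermediate routers of the tunnel]}
    Returns the unchanged path if the LSP is not affected, or None if the
    traffic cannot be restored by a backup tunnel.
    """
    kind, element = failed
    if kind == "link":
        u, v = element
        # Locate the hop where the LSP crosses the failed link (NHOP case).
        for i in range(len(lsp_path) - 1):
            if (lsp_path[i], lsp_path[i + 1]) == (u, v):
                plr_idx, mp_idx = i, i + 1
                break
        else:
            return lsp_path  # the LSP does not use the failed link
    elif kind == "node":
        if element not in lsp_path:
            return lsp_path  # the LSP does not traverse the failed node
        n = lsp_path.index(element)
        if n == 0 or n == len(lsp_path) - 1:
            return None  # head-end or tail-end failed: nothing to restore
        plr_idx, mp_idx = n - 1, n + 1  # NNHOP case: bypass the node
    else:
        raise ValueError(f"unknown failure kind: {kind}")
    key = (lsp_path[plr_idx], lsp_path[mp_idx])
    if key not in backups:
        return None  # no presignaled backup tunnel for this element
    return lsp_path[:plr_idx + 1] + backups[key] + lsp_path[mp_idx:]
```

Taking an LSP a-b-c and hypothetical backup tunnels a-d-b (NHOP) and a-d-c (NNHOP), a failure of the link a-b yields the path a-d-b-c, and a failure of router b yields a-d-c. A failure of router c, the tail-end of the LSP, returns None because that traffic cannot be restored.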
Interlayer Recovery Mechanisms
Let us now focus on the interlayer recovery aspects and in particular two aspects, as follows:
• Set of recovery actions
• Required backup capacity
Set of Recovery Actions
The interlayer strategy adopted in this case study is the timer-based sequential bottom-up escalation approach, in which upon a link failure detection, the client layer (IP/MPLS in this case) waits for some timer Tw to elapse before triggering any recovery action, which gives the server layer (the optical layer) a chance to restore
the failed resource. As already discussed, determining the optimal value of Tw may not be entirely straightforward. Ideally, Tw must be set to the upper bound of the restoration time, which is itself a function of the network topology and set of constraints (e.g., the set of affinities and minimization of the propagation delay), plus some fudge factor to account for unpredictable additional delays that can occur with any restoration protocol. If Tw is set too small, there is clearly a risk of triggering an IP/MPLS protection although the link is about to be restored by the optical layer. On the other hand, if Tw is set too large, then failures that cannot be recovered in the optical layer (like an IP/MPLS node failure) will suffer from unnecessary additional recovery delays. Let us analyze the different possible failures that can occur in the network and the set of recovery actions triggered in each case:
1. Link failure: As already mentioned, the optical restoration mechanism has been given enough backup capacity to restore any affected lightpath upon a single fiber or optical network element failure. Consequently, when a link failure occurs, both the optical and the IP/MPLS layers will detect the failure (e.g., in the case of a fiber cut, the optical layer will first detect the failure and will immediately inform the IP/MPLS layer of the failure), but only the optical layer will trigger a restoration process, while the IP/MPLS layer will start the timer Tw. In the case of a single link failure, the optical layer will succeed in restoring the set of affected lightpaths before Tw expires, and the IP/MPLS layer will just clear the alarm.
2. Optical node failure: If the failed optical node does not have any IP/MPLS router attached to it, then the optical layer will be able to restore all the lightpaths traversing the optical node.
On the other hand, if some routers are connected to that optical node, the router may be isolated or may suffer multiple link failures depending on the network configuration. In the former case (IP/MPLS router isolated), the failure cannot be recovered in the optical layer. After Tw has elapsed, the IP/MPLS Fast Reroute protection will be triggered and the LSPs traversing the failing node will be rerouted onto their respective next-next hop (NNHOP) backup tunnel. Of course the traffic directed to the isolated node will be dropped because it cannot be restored. The recovery time Tr in this case will be equal to the failure detection time plus Tw plus potentially the time for the IP/MPLS layer to effectively reroute the set of affected TE LSPs onto their respective backup tunnel.
3. Double link failures, IP/MPLS link failure, or IP/MPLS node failure: The assumption has been made that the optical layer can recover from single optical layer link failures. So if a second failure occurs, the recovery process will have to be performed at the IP/MPLS layer and the recovery time will still be equivalent to the previous case (Tr). The case of an IP/MPLS link failure (e.g., caused by a router interface failure) is quite interesting; indeed, the optical layer cannot recover from such a failure, although it can usually detect it. In this case, the IP/MPLS layer will trigger FRR but still after Tw
because the IP/MPLS layer cannot unambiguously differentiate such a link failure from a link failure that can be recovered in the optical layer. Finally, let us now consider the case of an IP/MPLS node failure. We saw in Chapter 4 that there are several possible node failure scenarios that require different sets of failure detection mechanisms (see Chapter 4 for an exhaustive list). For instance, in the case of a power supply failure, all the attached links will also fail; hence, the adjacent routers will detect the failure and will trigger FRR after the timer Tw has elapsed (consequently the rerouting time will be equal to Tr). Now, if the control plane of a centralized architecture IP/MPLS router fails (which affects both the control plane and the data plane), the links will not fail and other hello-based protocol mechanisms are required, which will determine the total rerouting time, as discussed in Chapter 4 (in this case, the timer Tw does not come into play). The set of recovery actions is illustrated in Figure 6.25: upon a fiber cut between the optical nodes A and B, the optical layer can restore the lightpath between the routers a and b (along the path A-E-C-B in the optical layer). If a second link failure occurs, or the optical node B fails, or the LSR b fails, then after the timer Tw has elapsed, the LSR a triggers FRR.
Required Amount of Backup Capacity
The second aspect related to such an interlayer recovery approach is the amount of required backup capacity in the network. Indeed, some network backup capacity is required in the optical layer to restore the affected lightpaths from a single fiber failure or an optical network element failure, but some backup capacity is also required at the IP layer. Let us take the example of a link failure.
Because some link failures may only be recovered in the IP/MPLS layer (e.g., a router interface failure), this requires provisioning some backup capacity not only in the optical layer but also in the IP/MPLS layer. The case of double failures is another example because the optical layer cannot recover from double failures in this case study. The immediate consequence is that a protected network element like a link requires some backup capacity in both layers. Strictly speaking, this is not what is referred to as double protection, and the amount of required backup capacity to protect a link L can be reduced thanks to the notion of shared capacity in the optical layer and the IP/MPLS layer. Moreover, as discussed in Chapter 5, an interesting option is to protect just a proportion of the link capacity at the IP/MPLS layer (e.g., if x% of the link capacity is used for the QoS-sensitive traffic, an alternative is to compute a set of backup tunnels offering a capacity of x% instead of the complete link capacity). However, unavoidably, some backup capacity will be required in both layers to protect the same network element.
Important note: In terms of implementation, you may decide to implement the timer Tw either at the optical/SONET/SDH layer, in which case the server layer waits for Tw before informing the client layer of the failure, or at the client layer (IP/MPLS), whereby the IP/MPLS layer is immediately informed of the failure but waits for Tw
before triggering any recovery action. Both approaches are functionally identical. Such an interlayer recovery approach has been demonstrated in the context of the European LION project (see [Cav IEEE]).
Summary
It has been shown that optical restoration and MPLS TE Fast Reroute can be used in combination by using a timer-based sequential bottom-up escalation approach. Such an approach has several challenges, particularly the evaluation of the timer value Tw, which may be quite difficult to determine optimally, but it has the benefit of avoiding highly undesirable race conditions between recovery mechanisms acting at different layers, hence providing a solution that does not compromise network stability, if adequately designed. It has also been shown that careful design must be performed to minimize the required amount of backup capacity at each layer to protect the same set of network elements.
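The timer-based bottom-up escalation of this case study can be sketched in a few lines of Python. This is only an illustrative model: the failure-type names, the Timers fields, and the numeric values are hypothetical, and the expression used for the optical-layer recovery time (detection plus restoration) is an assumption. Only the relation Tr = failure detection time + Tw + IP/MPLS rerouting time is taken from the text. The control plane failure of a router, for which Tw does not come into play, is deliberately left out.

```python
from dataclasses import dataclass

# Failure types that the optical layer is assumed to restore on its own.
OPTICAL_RECOVERABLE = {"single_fiber_cut"}

# Failure types that escalate to IP/MPLS Fast Reroute once Tw has elapsed.
ESCALATED_TO_FRR = {
    "router_isolated_by_optical_node_failure",
    "double_link_failure",
    "router_interface_failure",
    "router_power_supply_failure",
}


@dataclass
class Timers:
    detection_ms: float        # failure detection time
    optical_restore_ms: float  # optical restoration time (expected to stay below tw_ms)
    tw_ms: float               # hold-off timer Tw
    frr_reroute_ms: float      # time for FRR to switch the affected TE LSPs


def recovery(failure: str, t: Timers):
    """Return (recovering layer, estimated recovery time in ms)."""
    if failure in OPTICAL_RECOVERABLE:
        # Assumption: optical recovery time = detection + restoration.
        return "optical", t.detection_ms + t.optical_restore_ms
    if failure in ESCALATED_TO_FRR:
        # Tr = detection time + Tw + IP/MPLS rerouting time (from the text).
        return "ip_mpls_frr", t.detection_ms + t.tw_ms + t.frr_reroute_ms
    raise ValueError(f"failure type not modelled: {failure}")


# Hypothetical timer values, for illustration only.
timers = Timers(detection_ms=10, optical_restore_ms=100, tw_ms=150, frr_reroute_ms=40)
print(recovery("single_fiber_cut", timers))
print(recovery("router_interface_failure", timers))
```

The sketch makes the cost of the escalation visible: every failure that the optical layer cannot handle pays the full Tw before FRR acts, which is why the choice of Tw matters so much in this approach.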
6.3.2
Case Study 2: SONET/SDH Protection and IP Routing
Single-Layer Recovery Mechanisms
Such a recovery strategy has been widely deployed in many IP/MPLS networks during the past several years: link failures are handled by the SONET/SDH layer, whereas other failures such as router interface failures or router failures rely on IP routing to find an alternate path. The trend is to move toward different network architectures not involving any protection at the SONET/SDH layer for several reasons:
• It is not rare to have a high proportion of high-speed links in operators’ backbone networks (OC48 and OC192). Relying on SONET/SDH usually requires some relatively expensive equipment and the optical layer is more suitable to deliver such high-speed links.
• SONET/SDH protection (as described in Chapter 2) wastes a significant proportion of the total bandwidth, which is dedicated to protection.
• The emergence of fast recovery techniques like fast IP routing or MPLS TE Fast Reroute provides fast recovery times (similar to SONET/SDH protection for MPLS TE Fast Reroute).
That said, though relatively expensive, such a recovery strategy (relying on SONET/SDH protection) has proven its efficiency. As in the previous case study, a timer-based sequential bottom-up escalation approach is adopted in which a timer Tw is started at the IP layer once the alarm is received from the SONET/SDH layer. Compared to the previous case (where a restoration mechanism was used in the server layer), the value of the timer Tw is easier to determine and more deterministic. As mentioned in the SONET/SDH specification, the maximum recovery time in an MS-SP Ring is 60 ms (10 ms of detection + 50 ms of recovery time) provided that the ring distance does not exceed
1200 km, the number of stations is less than 16, and the ring is idle before the protection. If those conditions are respected, then Tw can be safely set to 60 ms. If not, Tw must be increased accordingly, but in any case, the recovery time with a protection scheme will be more deterministic than with a restoration mechanism.
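The rule for choosing Tw in this case study can be written down as a small check. The function name and its parameters are hypothetical; the thresholds (1200 km, fewer than 16 stations, idle ring) and the 10 ms + 50 ms budget are the ones quoted above. Because the text does not say by how much Tw must be increased when the conditions are not met, the sketch returns None in that case rather than guessing a value.

```python
from typing import Optional

MAX_RING_KM = 1200   # maximum ring distance quoted in the text
STATION_LIMIT = 16   # the number of stations must be below this value
DETECTION_MS = 10    # failure detection budget
SWITCHING_MS = 50    # protection switching budget


def ms_spring_tw_ms(ring_km: float, stations: int, ring_idle: bool) -> Optional[int]:
    """Return Tw in ms when the ring meets the stated conditions, else None.

    None means that Tw has to be engineered for the specific ring.
    """
    if ring_km <= MAX_RING_KM and stations < STATION_LIMIT and ring_idle:
        return DETECTION_MS + SWITCHING_MS
    return None


print(ms_spring_tw_ms(800, 12, True))    # ring within the stated limits
print(ms_spring_tw_ms(1500, 12, True))   # ring too long: Tw must be engineered
```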
Interlayer Recovery Mechanisms
Set of recovery actions: The set of recovery actions upon link failure is quite straightforward; the SONET/SDH layer will recover the set of affected VCs within a short time frame (usually 60 ms, as mentioned earlier). If the failure cannot be recovered by the SONET/SDH layer (router interface failure, router node failure, multiple failures in the SONET/SDH layer), then IP routing will trigger a network convergence, as described in Chapter 4.
Backup capacity: In the previous case study, we mentioned that provisioning backup capacity in both layers may be necessary because not all failures can be recovered in a single layer. This is a decision that the operator must make. Indeed, one could also decide that in the vast majority of cases, failures are link failures in the server layer (SONET/SDH); hence, no backup capacity is required in the IP layer. The assumption is made that multiple failures in the SONET/SDH layer, router interface failures (see footnote 79), and IP router failures are sufficiently rare not to justify dedicating backup bandwidth in the IP layer. If such an event occurs, the IP layer may suffer from congestion whose effects can be reduced for the sensitive traffic (like voice) by using QoS mechanisms (see Chapter 4 for more details).
Revertive mode: In most cases, SONET/SDH alarms that result from defects are held on for 10 seconds after the defect clears. In other words, the SONET/SDH VC will wait for 10 seconds after the VC has recovered before declaring it in an “up state.” This guarantees that some link instability (sometimes referred to as flapping) does not provoke network instabilities.
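The effect of the 10-second hold on a flapping VC can be illustrated with a short simulation. This is a simplified model, not a description of any particular equipment: the function name and the event format are hypothetical, the VC is assumed to be up before the first event, and a defect that reappears during the hold period is assumed to cancel the pending “up” declaration.

```python
HOLD_S = 10.0  # alarm hold time after a defect clears, as quoted in the text


def up_declarations(events, hold_s=HOLD_S):
    """Return the times (s) at which the VC is declared up.

    events: chronologically sorted (time_s, defect_present) pairs.
    """
    ups = []
    pending_clear = None  # time of the last defect clearance while the hold runs
    is_up = True          # assumption: the VC is up before the first event
    for time_s, defect_present in events:
        # A hold that expired before this event means the VC was declared up.
        if pending_clear is not None and time_s >= pending_clear + hold_s:
            ups.append(pending_clear + hold_s)
            pending_clear = None
            is_up = True
        if defect_present:
            pending_clear = None  # a defect during the hold cancels it
            is_up = False
        elif not is_up and pending_clear is None:
            pending_clear = time_s  # defect cleared: start the hold
    if pending_clear is not None:
        ups.append(pending_clear + hold_s)  # no further events: the hold completes
    return ups


# A link that flaps (clears at 1 s, fails again at 3 s, clears at 4 s)
# is declared up only once, 10 s after the last clearance.
flapping = [(0, True), (1, False), (3, True), (4, False)]
print(up_declarations(flapping))
```

The IP layer therefore sees a single state change for the whole flapping episode instead of one per transition.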
Summary
Though usually expensive, such an interlayer mechanism has been widely deployed in several networks. Because the recovery time of SONET/SDH protection is relatively predictable and deterministic, Tw can be quite easily computed. In the case of a link failure in the SONET/SDH layer, the link is restored without any implication on IP routing. If a failure cannot be recovered in the SONET/SDH layer, IP routing is triggered and the network converges. In terms of required backup capacity, the assumption was made in this case study that failures not recoverable in the SONET/SDH layer were sufficiently rare not to justify dedicating any backup capacity in the IP layer.
79
Note that router interface failure may also be handled using Automatic Protection Switching.
6.3.3
Case Study 3: MPLS Traffic Engineering Fast Reroute (Link Protection) and IP Rerouting Fast Convergence
Single-Layer Recovery Mechanisms
In this case study, the routers are interconnected by unprotected lightpaths (a fiber cut or an optical network element failure is not protected at the optical layer). Furthermore, there are several SRLGs in the network (several lightpaths are routed through common equipment). MPLS TE Fast Reroute is used to handle link failures that can occur in the optical layer (e.g., fiber cut, optical equipment failure) or the IP/MPLS layer (e.g., router interface failure). Note that such a recovery strategy is very likely to become quite successful. Several existing networks have adopted this model (with several variants related to using the Fast Reroute to protect link or node and the IP parameter settings). We saw in Chapter 5 that there are several possible deployment scenarios for MPLS TE Fast Reroute.
Scenario 1: The first option is to have a full mesh of MPLS TE LSPs between the core routers. Note that these TE LSPs may have multiple constraints (e.g., bandwidth and affinities) or could just be unconstrained, in which case they just follow the IGP shortest path.
Scenario 2: The second option is to deploy unconstrained one-hop primary tunnels (for link protection) that will be quickly fast rerouted onto presignaled backup tunnels in the case of a failure.
In this case study, MPLS TE is not required for bandwidth optimization or strict QoS guarantee. Moreover, fast recovery is required only in the case of a link failure. Hence, the decision is made to use scenario 2; for each link to be protected, the following set of MPLS TE tunnels is deployed:
• An unconstrained one-hop primary tunnel routed onto the protected link.
• A next-hop (NHOP) backup tunnel whose path is automatically computed by the point of local repair (PLR) so the backup tunnel is SRLG diverse from the protected link.
In other words, the NHOP backup tunnel path is the shortest path between the PLR and the next hop based on the IGP metric that avoids any link having at least one SRLG in common with the protected link (this should be treated as an additional constraint in CSPF). This is mandatory because fast recovery is required in the case of an SRLG failure.
Note: Networks are usually designed to survive a single SRLG failure. In other words, an SRLG failure should not result in a disconnected graph where some destinations may become unreachable. Now, in some situations of double network failures, such an SRLG-diverse path may not be found. Then there are several possible alternatives:
• The PLR tries to relax the SRLG-diversity constraint to be able to find a path for the NHOP backup tunnel. This could still be useful in the case of a router interface failure or a single link failure.
• Try to find a path that minimizes the number of links having at least one SRLG in common with the protected link or try to avoid the paths having a high number of SRLGs in common with the protected section.
The assumption is made that node failures are rare enough to tolerate longer recovery times; hence, IP routing is used to handle IP/MPLS node failure. This can be achieved by tuning the IS-IS parameters. As explained in Chapter 4, a few parameters must be tuned to meet the rerouting time objective, as follows:
lsp-gen-interval 5 200 500
The parameters 5, 200, and 500 have the following effects:
B = 200 ms is the amount of time the router waits after the first link failure has been detected before originating a new link state packet (see footnote 80). A value of 200 ms is appropriate because there are multiple SRLGs in this network. Thus, waiting for 200 ms before originating a new LSP maximizes the chance to capture an accurate network topology change in a single link state packet.
C = 500 ms corresponds to the amount of time the router will wait before advertising a second LSP if a second local state change occurs.
A = 5 seconds is the maximum amount of time between two successive LSP originations according to the exponential back-off algorithm described in Chapter 4.
spf-interval 5 100 200
prc-interval 5 100 200
Because there are multiple SRLGs in this network, it is advisable to set the timer for triggering the SPF computation to 100 ms, which increases the probability that the computing router will have received all the LSPs resulting from an SRLG failure before recomputing its routing table. It is also recommended to activate iSPF, which in most cases significantly reduces the SPF computation upon a network topology change and consequently helps reduce the recovery time of IP.
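To make the role of the three timer values more concrete, the following Python sketch lists the successive waits produced by such a setting. It assumes a simple doubling rule after the second wait, capped at the maximum; the exact back-off algorithm is the one described in Chapter 4 and is not reproduced here, so the doubling rule, the function name, and its parameters should be read as assumptions. The mapping of the values (A = maximum in seconds, B = initial wait in ms, C = second wait in ms) is the one given above.

```python
def backoff_waits_ms(max_s, initial_ms, second_ms, count):
    """Return the first `count` waits (ms) of an assumed doubling back-off.

    max_s      -- A, the maximum wait, in seconds
    initial_ms -- B, the wait applied to the first event, in milliseconds
    second_ms  -- C, the wait applied to the second event, in milliseconds
    """
    max_ms = max_s * 1000
    waits = []
    for i in range(count):
        if i == 0:
            wait = initial_ms
        elif i == 1:
            wait = second_ms
        else:
            wait = waits[-1] * 2  # assumption: each further wait doubles
        waits.append(min(wait, max_ms))
    return waits


# lsp-gen-interval 5 200 500
print(backoff_waits_ms(5, 200, 500, 6))
# spf-interval 5 100 200
print(backoff_waits_ms(5, 100, 200, 7))
```

Under this assumed rule a stable network reacts quickly to an isolated event (200 ms for LSP origination, 100 ms for SPF), whereas a burst of events is progressively slowed down to at most one LSP origination or SPF run every 5 seconds.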
Interlayer Recovery Mechanisms
Set of Recovery Actions
So what happens when a link fails? As soon as the failure is detected by the IP/MPLS layer by means of the optical layer, which can itself use SONET/SDH framing, MPLS TE Fast Reroute is immediately triggered. By contrast with the two previous case studies, no recovery mechanism is deployed in the optical or SONET/SDH layer, so no timer-based delay approach is required and the FRR protection must be triggered immediately. Because all the traffic is carried onto a single LSP, this LSP is rerouted onto its SRLG-diverse NHOP backup tunnel within a few tens of milliseconds. In a second step, the protected LSPs are
80
Note that we use the IS-IS terminology here.
reoptimized to follow the shortest path between the PLR and its neighbor; this last operation, which occurs immediately after the LSP has been locally rerouted onto its backup tunnel, is not traffic disruptive thanks to the “make before break” procedure detailed in Chapter 5.
In the case of a node failure, which also implies the failure of its local links, as soon as the failure is detected, MPLS TE Fast Reroute is also triggered (remember, the PLR cannot differentiate a link from a node failure). In this case, because the decision has been made to use FRR for link protection only, only NHOP backup tunnels have been configured to handle link failures in the network. So in the case of a node failure, MPLS TE Fast Reroute just locally reroutes the 1-hop TE LSP onto its NHOP backup tunnel. Obviously, this does not recover the traffic because the node has failed. In this case study, node failures are handled by IP routing. Consequently, as soon as the failure is detected by the IGP, each router adjacent to the failed node will originate a new LSP after some time determined by the IGP timer settings described earlier. The new LSP will be flooded throughout the network. Each router receiving the new LSP will then wait 100 ms before triggering a new SPF and recomputing its routing table. At this point, the network has converged (Figure 6.26).
A very interesting aspect in the case of a link failure is that without any particular measure, it turns out that a second traffic disruption will occur after the link failure recovery performed by FRR, which is due to the loop effect resulting
Figure 6.26 Case Study 3: Fast Reroute link protection + IP routing fast convergence. (Top panel: 1-hop fast reroutable primary tunnel carrying the traffic from D to E; B1, the NHOP backup tunnel protecting against a failure of the link D-E, is SRLG diverse from the link D-E. Bottom panel: 1-hop primary tunnel rerouted onto B1 and then reoptimized along the path D-C-B-G-H-E. Nodes: S, A, B, C, D, E, F, G, H, I, J, K. Legend: SRLG (Shared Risk Link Group), routing decision, data flow.)
from the temporary lack of synchronization between the routers’ link state databases (such a temporary loop effect has been studied in detail in Chapter 4). As described in Figure 6.26, when the link fails, the protected one-hop primary tunnel between the nodes D and E is fast rerouted onto the NHOP backup tunnel within 50 ms. Hence, the traffic traversing the link D-E is recovered within 50 ms. After some time determined by the IGP settings mentioned earlier, IS-IS converges, but during a short period, some temporary loops may potentially occur. For example, back to Figure 6.26, one possible sequence of events is that the router D converges before the router C, which results in a temporary loop. As explained in detail in Chapter 4, the traffic may be dropped during the life of such a temporary loop. This is an interesting fact because it highlights some interrecovery dependencies between IP and MPLS TE Fast Reroute; to guarantee that the traffic disruption upon a link failure is limited to 50 ms and that no additional traffic disruption is experienced because of IP routing, the IGP must be enhanced to avoid the creation of temporary loops upon the failure of links protected with a local protection mechanism like MPLS TE Fast Reroute. Such an enhancement implies the ability for IP to signal that a link is protected by FRR and some SPF algorithm enhancements to avoid the creation of temporary loops in such cases. The ability to signal a link as protected with some local protection has been proposed in [FRR-PROT].
This raises an interesting question: Why declare the link as down in the IP layer if the link is protected by Fast Reroute? Why not just follow the backup path without triggering any IP rerouting? This is definitely another possible solution that can be deployed in practice with existing commercial implementations thanks to the concept of forwarding adjacency (FA), which allows an MPLS TE LSP to be signaled as a link in the IP layer.
Let us consider what happens when a protected one-hop tunnel is rerouted upon a link failure with FA. Once the link D-E fails, the one-hop tunnel is locally rerouted onto its backup tunnel; shortly after, the protected one-hop TE LSP is reoptimized along another path. In any case, the protected TE LSP stays alive, and hence because the TE LSPs are reported as a “physical link,” every other router in the network will still see a link in “up” state and IP will never trigger any recovery process. This is illustrated in Figure 6.27. The drawback of this approach is that the path followed by the rerouted LSPs may not be optimal compared to the network state if IP rerouting had occurred. Indeed, supposing that all links have an identical cost of 1, the flows between the nodes B and E will follow the path B-C-D-C-B-G-H-E (note that there is no loop here because the IP packets will be carried onto MPLS TE LSPs between the nodes D and E). On the other hand, without FA, IP routing will converge and B will then route the traffic to E via G, along the path B-G-H-E. Note that if SONET/SDH protection had been used in place of MPLS TE FRR, the physical path followed by the recovered link may have been similar to the MPLS TE backup tunnel path.
Backup capacity: The question of required backup capacity in both the IP and the MPLS layer has been extensively discussed in Chapters 4 and 5, but as a
Figure 6.27 Case Study 3: Fast Reroute link protection + IP routing fast convergence with FA. (Panels: IP routing topology; fast rerouted primary tunnel path. Nodes: S, A, B, C, D, E, F, G, H, I, J, K.)
reminder, there are several possible strategies in such an interlayer case study, as follows:
1. The IGP metrics are tuned by means of some off-line optimization tool to provide bandwidth guarantees upon link and node failures. The bandwidth guarantee during failure can be provided for some flows only (the sensitive traffic like “voice”), with QoS mechanisms in the network like Diffserv, to limit the amount of required backup capacity. In this case, MPLS TE Fast Reroute will just be used to minimize the packet loss (hence, guaranteeing fast recovery) upon link failure (in this case study, FRR is used for link protection only). In other words, the only constraint when computing the backup tunnel path is to find an SRLG-diverse path. Backup capacity is just required in the client layer (IP).
2. Another possibility is to reserve some backup capacity to place the NHOP backup tunnels so that bandwidth guarantee is provided along the backup tunnel paths for the whole link bandwidth or some pool of bandwidth. Then the operator can either decide that node failures are sufficiently rare not to justify dedicating backup capacity in the IP layer or reserve some backup capacity in the IP layer for the case of node failure. In the latter case, this will unavoidably lead to reserving backup capacity in both layers to cover similar failures.
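The SRLG-diversity constraint applied when computing the NHOP backup tunnel path, together with the first fallback mentioned earlier in this case study (relaxing the constraint when no diverse path exists), can be sketched as follows. This is a minimal illustration and not the CSPF implementation of any router: the function names, the link table format, and the two small topologies are hypothetical (the node names loosely follow Figures 6.26 and 6.27, but the links, metrics, and SRLG memberships are invented for the example).

```python
import heapq


def shortest_path(links, src, dst, banned=frozenset()):
    """Dijkstra on undirected links {(u, v): (metric, srlg_set)}, skipping banned links."""
    adjacency = {}
    for (u, v), (metric, _srlgs) in links.items():
        if (u, v) in banned:
            continue
        adjacency.setdefault(u, []).append((v, metric))
        adjacency.setdefault(v, []).append((u, metric))
    heap = [(0, src, [src])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in adjacency.get(node, []):
            if neighbor not in visited:
                heapq.heappush(heap, (cost + metric, neighbor, path + [neighbor]))
    return None


def nhop_backup_path(links, protected):
    """Return (path, srlg_diverse) for the NHOP backup tunnel of a protected link."""
    plr, next_hop = protected
    risk = links[protected][1]
    # Strict attempt: avoid the protected link and every link sharing an SRLG with it.
    strict = {link for link, (_m, srlgs) in links.items()
              if link == protected or srlgs & risk}
    path = shortest_path(links, plr, next_hop, banned=strict)
    if path is not None:
        return path, True
    # Fallback: relax the SRLG constraint and only avoid the protected link itself.
    return shortest_path(links, plr, next_hop, banned={protected}), False


# Hypothetical topology: D-F shares SRLG 1 with the protected link D-E.
links = {
    ("D", "E"): (1, {1}), ("D", "F"): (1, {1}), ("F", "E"): (1, set()),
    ("D", "C"): (1, set()), ("C", "B"): (1, set()), ("B", "G"): (1, set()),
    ("G", "H"): (1, set()), ("H", "E"): (1, set()),
}
print(nhop_backup_path(links, ("D", "E")))

# Same protected link, but the only alternative path shares SRLG 1.
sparse_links = {("D", "E"): (1, {1}), ("D", "F"): (1, {1}), ("F", "E"): (1, set())}
print(nhop_backup_path(sparse_links, ("D", "E")))
```

In the first topology the shorter detour through F is rejected because the link D-F would fail together with D-E, so the longer but diverse path is selected. In the second, no diverse path exists and the relaxed path is returned with a flag indicating that it only protects against failures that do not affect the whole SRLG.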
Reuse of a restored resource: In our previous example, all the traffic from the router D to the router E travels onto the primary protected TE LSP between those two nodes. Once the link D-E is restored, a new IGP adjacency is established, but the traffic will only restart traversing the link D-E once the primary TE LSP is reoptimized along this shorter path. The head-end LSR D can decide to reoptimize the primary TE LSP as soon as the link is restored and the IGP adjacency is operational; thanks to the IGP dampening mechanism, a flapping link will not be immediately reused (see Chapter 4 for more details on the algorithm). Alternatively, the operator can adopt a timer-based reoptimization approach whereby the tunnel is reoptimized on a regular basis, reducing the risk of immediately reusing a restored flapping link.
Summary
In this case study, we saw that MPLS TE Fast Reroute can be used to provide fast recovery (50 ms) upon link failure (in the optical layer or the IP layer) in conjunction with IP routing to recover from IP/MPLS node failure. In such a multilayer recovery strategy, FRR must be triggered without any delay (as soon as the link failure is detected); then if the failure cannot be recovered by FRR (in case of node failure), IP takes the appropriate set of recovery actions after some timers (determined by the IGP settings) have elapsed. Note that the network stability is preserved by means of various dampening mechanisms at both layers (FRR reoptimization and IP routing). As far as the backup capacity is concerned, various approaches are possible to minimize the required capacity. Note also that because the backup capacity is provisioned in upper layers, this offers high granularity (bandwidth guarantees can be determined on a per-flow level), helping to minimize the required backup capacity.
6.4 Conclusion
In previous chapters survivability and recovery mechanisms were discussed from the viewpoint of one network technology, and thus within a single network layer (e.g., IP routing in the IP layer or 1+1 optical protection in the OTN layer). In the first part of this chapter, we highlighted the current evolution from static networks to intelligent optical networks (IONs) featuring a distributed control plane (this was used in the multilayer recovery strategies later in this chapter). Within the ITU-T, a framework for such Automatic Switched Optical Networks (ASONs) is under standardization, whereas the Generalized Multi-Protocol Label Switching (G-MPLS) protocol suite under standardization in the IETF is the most likely solution for implementing an ION. An example of optical restoration was given in such G-MPLS networks. The integration of different network technologies, such as IP and OTN, into (realistic) multilayer transport networks offers new opportunities and
challenges as far as the survivability of such multilayer networks is concerned, which was the subject of the second part of this chapter. A generic description of the survivability in multilayer networks was given, which included three main categories: single-layer recovery in multilayer networks, static multilayer recovery in multilayer networks, and dynamic multilayer recovery in multilayer networks. The first category discussed strategies that apply a single-layer recovery mechanism (i.e., recovery is strictly limited to one layer of the network when coping with network failures) to provide survivability in the multilayer network. A step further gave us the second category, in which recovery mechanisms will run in different layers of the network as a reaction to the occurrence of one network failure. The choice of the layer in which to recover the traffic affected by a failure will depend on the circumstances, such as the failure type or the timing constraints. This requires some coordination rules (a so-called escalation strategy) to ensure the efficient interworking and coordination between the network layers that are involved in the recovery process. Several such escalation strategies were discussed. Part of this chapter focused on the challenges an operator would face when implementing a multilayer recovery strategy: avoidance of race conditions that could occur when multiple network recovery mechanisms act at different layers, and optimization of the required network backup capacity. In addition, issues of network stability, network operation complexity, and revertive operation were discussed. These multilayer recovery strategies are called static because the logical network topology is left unchanged (i.e., static) and no specific actions are taken to modify it. In the third category, however (dynamic multilayer survivability strategies), such logical topology modification is used for recovery purposes.
This requires the ability to set up and tear down, flexibly and in real time, the lower layer network connections that implement logical links in the higher network layer. For example, optical networks will be enhanced with a control plane, which gives the client networks the possibility to initiate the setup and teardown of lightpaths through the optical network, and which could be used to reconfigure the logical IP network when it is affected by a network failure. It is worth mentioning that dynamic multilayer survivability strategies have their own challenges, in particular in terms of complexity, and are not available in the short term. This chapter concludes with three case studies illustrating some realistic multilayer recovery deployment strategies. The first case study combines the use of optical restoration with MPLS TE Fast Reroute local protection. The second case study illustrates a pretty common deployment case in several networks combining SONET/SDH protection with IP routing. Finally, it is shown how MPLS TE Fast Reroute can be used in conjunction with IP routing in the third case study. Each case study starts with an analysis of the mode of operation of each recovery mechanism, followed by a detailed description of the set of recovery actions upon a network element failure in such a multilayer recovery strategy. The network design considerations are particularly emphasized throughout those case studies to avoid race conditions, maximize the network stability, and optimize the required amount of network capacity.
Bibliography
[Ala03] W. Alanqar, et al, “Requirements for generalized MPLS (GMPLS) routing for automatically switched optical network (ASON),” Internet draft: draft-ietf-ccamp-gmpls-ason-routing-reqts-01.txt, December 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[ALGO-1] M. Garey, D. Johnson, “Computers and intractability: a guide to the theory of NP-completeness,” New York, NY, Freeman, 1979.
[ALGO-2] R. Ahuja, T. Magnanti, J. Orlin, “Network flows,” Prentice Hall, Englewood Cliffs, NJ, 1993.
[ALGO-3] V. Vazirani, “Approximation algorithms,” Springer Verlag, New York, NY, 2001.
[ALGO-4] C. Papadimitriou, K. Steiglitz, “Combinatorial optimization: Algorithms and complexity,” Dover, Mineola, NY, 1998.
[Ari00] P. Arijs, et al, “Planning of WDM rings networks,” Photonic Network Communications Magazine, vol. 2, no. 1, January 2000.
[Ari01] P. Arijs, “Planning of ring-based telecommunication networks,” PhD thesis, Ghent University, Ghent, Belgium, 2000–2001.
[Ari1/00] P. Arijs, M. Gryseels, P. Demeester, “Planning of WDM ring networks,” Photonic Network Communications Magazine, vol. 2, no. 1, January 2000, pp. 33–51.
[Ari7/00] P. Arijs, et al, “Design of ring and mesh based WDM transport networks,” Optical Networks Magazine, vol. 1, no. 2, July 2000, pp. 25–40.
[Ari96] P. Arijs, “Development of algorithms for optimal ring selection within an SDH network topology,” M. Sc. Thesis, Ghent University, Ghent, Belgium, 1995–1996.
[Ari97] P. Arijs, et al, “The design of SDH ring networks using tabu-search and simulated annealing,” paper presented at the 5th International Conference on Telecommunication Systems: Modelling and Analysis, Nashville, TN, March 1997.
[Ari98] P. Arijs, et al, “SDH protection in long distance networks: a practical case study,” DRCN’98, Brugge, Belgium, May 17–20, 1998.
[ARPA-1] J.M. McQuillan, D.C. Walden, “The ARPANET design decisions,” Computer Networks, vol. 1, no. 5, August 1977.
[ARPA-2] J.M. McQuillan, I. Richer, E. Rosen, “An overview of the new routing algorithm for the ARPANET,” ACM SIGCOMM Computer Communication Review, ACM Press, vol. 25, no. 1, January 1995, pp. 54–60.
[ARPA-3] J.M. McQuillan, I. Richer, E.C. Rosen, “ARPANET routing algorithm improvements: first semiannual technical report,” BBN report no. 3803, April 1978.
[ARPA-4] J.M. McQuillan, I. Richer, E.C. Rosen, D.P. Bertsekas, “ARPANET routing algorithm improvements: second semiannual technical report,” BBN report no. 3940, October 1978.
[ARPA-5] E.C. Rosen, J. Herman, I. Richer, J.M. McQuillan, “ARPANET routing algorithm improvements: third semiannual technical report,” BBN report no. 4088, April 1979.
[ARPA-6] J.M. McQuillan, I. Richer, E. Rosen, “ARPANET routing study: final report,” BBN report no. 3641, September 1977.
[ARPA-7] W.E. Naylor, L. Kleinrock, “On the effects of periodic routing updates in packet switched networks,” Conference Record, National Telecommunications.
[ARPA-8] E.C. Rosen, “The updating protocol of the new ARPANET routing algorithm,” submitted to Fourth Berkeley Conference on Distributed Data Management and Computer Networks.
[Aut02] A. Autenrieth, A. Kirstädter, “Engineering end-to-end IP resilience using resilience-differentiated QoS,” IEEE Communications Magazine, vol. 40, no. 1, January 2002.
[Bat02] P. Batchelor, et al, “Study on the implementation of optical transparent transport networks in the European environment: results of the research project COST 239,” Journal Photonic Network Communications, vol. 2, no. 1, January-March 2002.
[Ben01] G. Bennet, “The layperson’s guide to optical networking,” tutorial, third workshop on Design of Reliable Communication Networks (DRCN) 2001, Budapest, Hungary, October 2001.
[BFD] W. Katz, “Bidirectional forwarding detection,” Internet draft: draft-katz-ward-bfd, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Bon01] P. Bonenfant, “Short course on optical networking, architectures, standards, protection & restoration,” European Conference on Optical Networking (ECOC) 2001, Amsterdam, The Netherlands, September 2001.
[BP-PLACEMENT] J.L. Le Roux, “A method for an optimized online placement of MPLS bypass tunnels,” Internet draft: draft-leroux-mpls-bypass-placement, October 2002, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Car97] T.J. Carpenter, et al, “Demand routing and slotting on ring networks,” DIMACS Technical Report 97-02, January 1997.
[Cav IEEE] C. Cavazzoni, et al, “The IP/MPLS over ASON/GMPLS testbed of the IST Project LION,” Journal of Lightwave Technology, vol. 11, November 2003.
[Cho03] J.K. Choi, et al, “General Switch Management Protocol (GSMP) v3 for optical support,” Internet draft: draft-ietf-gsmp-optical-spec-02.txt, June 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Col00] D. Colle, et al, “Comparison of architectures for stacked ring network featuring compact add/drop multiplexers,” DRCN’00, Munich, Germany, April 9–12, 2000.
[Col02] D. Colle, “Design and evolution of data-centric optical networks,” PhD thesis, Ghent University, Ghent, Belgium, 2001–2002.
[Col02] D. Colle, et al, “Data-centric optical networks and their survivability,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 1, January 2002.
[Col02] D. Colle, et al, “Data-centric optical networks and their survivability,” (invited) IEEE Journal on Selected Areas in Communications, vol. 20, no. 1, January 2002, pp. 6–20.
[ColONDM01] D. Colle, et al, “Porting MPLS recovery techniques to the MPLambdaS paradigm,” Optical Networks Magazine, vol. 2, no. 4, July/August 2001, pp. 29–47.
[ColPNC01] D. Colle, et al, “MPLS recovery mechanisms for IP-over-WDM networks,” Photonic Network Communications, Kluwer Academic Publishers, vol. 3, no. 1/2, January 2001, pp. 23–40.
[COMP-NETWORKS] L. Peterson, B. Davie, “Computer networks: a systems approach,” Morgan Kaufmann, San Francisco, CA, 2003.
[Cos94] S. Cosares, I. Saniee, “An optimisation problem related to balancing loads on SONET rings,” Telecommunication Systems, vol. 3, no. 2, November 1994.
[Dem99] P. Demeester, et al, “Resilience in multi-layer networks,” IEEE Communications Magazine, vol. 37, no. 8, August 1999, pp. 70–76.
[Dem99] P. Demeester, IEEE Communications Magazine, special issue on survivable communication networks, vol. 37, no. 8, August 1999.
[DeM02] S. De Maesschalck, et al, “Intelligent optical networking for multilayer survivability,” IEEE Communications Magazine, vol. 40, no. 1, pp. 42–49, January 2002.
[DeM03] S. De Maesschalck, et al, “Pan-European optical transport networks: an availability based comparison,” Photonic Network Communication, vol. 5, no. 3, May 2003, pp. 203–225.
[DeM04] S. De Maesschalck, et al, “Advantages of intelligent optical networks,” IEEE Communication Magazine, submitted.
[DIFFSERV-DEPLOY] J. Evans, C. Filsfils, “Deploying Diffserv in multiservice IP backbone networks for tight SLA.”
[DS-TE] F. Le Faucheur, et al, “Requirements for support of Differentiated Services-aware MPLS Traffic Engineering,” RFC 3564, Internet draft: draft-ietf-tewg-diff-te-reqts-06.txt, July 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Dwi00] A. Dwivedi, R. Wagner, “Traffic model for USA long-distance optical network,” Proceedings of the Optical Fiber Conference (OFC) 2000, Baltimore, MD, March 2000, vol. 1, TuK1-1, pp. 156–158.
[E800] ITU-T Recommendation E.800, “Terms and definitions related to quality of service and network performance including dependability,” ITU-T Standardization Organization, August 1994. Available at: www.itu.int. Accessed May 2004.
[Ell03] G. Ellinas, et al, “Routing and restoration architectures in mesh optical networks,” Optical Network Magazine, vol. 4, no. 1, January/February 2003, pp. 91–106.
[ETSI1] “Transmission and multiplexing (TM); generic requirements of transport functionality of equipment; part 1–1: generic processes and performance,” ETSI EN 300 417-1-1 V1.2.1, October 2001.
[ETSI2] ‘‘Transmission and multiplexing (TM); Synchronous Digital Hierarchy (SDH); Network protection schemes; interworking: rings and other schemes,’’ ETSI TS 101 010 v1.1.1, November 1997.
[EWD-1166] E.W. Dijkstra, ‘‘EWD-1166,’’ November 1993. Available at: www.cs.utexas.edu/users/EWD/ewd11xx/EWD1166.PDF. Accessed May 2004.
[FACILITY-BACKUP] J.P. Vasseur, et al, ‘‘MPLS traffic engineering fast reroute: bypass tunnel path computation for bandwidth protection,’’ Internet draft: draft-vasseur-mpls-backup-computation, November 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[FAST-REROUTE] P. Pan, et al, ‘‘Fast reroute techniques in RSVP-TE,’’ Internet draft: draft-ietf-mpls-rsvp-lsp-fastreroute, May 2004, work in progress. Available at: www.ietf.org. Accessed May 2004.
[FM-RECOV] V. Sharma, F. Hellstrand, RFC3469, ‘‘Framework for Multi-Protocol Label Switching (MPLS)-based recovery.’’ Internet draft, February 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[FRED] D. Lin, R. Morris, ‘‘Dynamics of random early detection.’’
[FRR-IN-USE] ‘‘IS-IS Link attribute TLV,’’ Internet draft: draft-vasseur-isis-link-attibute, May 2004, work in progress. Available at: www.ietf.org. Accessed May 2004.
[G7041] ITU-T Recommendation G.7041/Y.1303, ‘‘Generic framing procedure,’’ ITU-T Standardization Organization. Available at: www.itu.int. Accessed May 2004.
[G7042] ITU-T Recommendation G.7042/Y.1305, ‘‘Link capacity adjustment scheme for virtual concatenated signals,’’ ITU-T Standardization Organization, May 2002. Available at: www.itu.int. Accessed May 2004.
[G707] ITU-T Recommendation G.707, ‘‘Network node interface for the synchronous digital hierarchy (SDH),’’ ITU-T Standardization Organization. Available at: www.itu.int. Accessed May 2004.
[G707] ITU-T Recommendation G.707/Y.1322, ‘‘Network node interface for the synchronous digital hierarchy (SDH),’’ ITU-T Standardization Organization. October 2000. Available at: www.itu.int. Accessed May 2004.
[G709] ITU-T Recommendation G.709/Y.1331, ‘‘Interfaces for the optical transport network,’’ ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
[G783] ITU-T Recommendation G.783, ‘‘Characteristics of Synchronous Digital Hierarchy (SDH) equipment functional blocks,’’ ITU-T Standardization Organization, October 2000. Available at: www.itu.int. Accessed May 2004. [G798] ITU-T Recommendation G.798, ‘‘Characteristics of optical transport network hierarchy equipment functional blocks,’’ ITU-T Standardization Organization, January 2002. Available at: www.itu.int. Accessed May 2004.
[G803] ITU-T Recommendation G.803, ‘‘Architecture of transport networks based on the synchronous digital hierarchy (SDH),’’ ITU-T Standardization Organization, March 2000. Available at: www.itu.int. Accessed May 2004. [G805] ITU-T Recommendation G.805, ‘‘Generic functional architecture of transport networks,’’ ITU-T Standardization Organization, March 2000. Available at: www.itu.int. Accessed May 2004.
[G806] ITU-T Recommendation G.806, ‘‘Characteristics of transport equipment—description methodology and generic functionality,’’ ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.
[G807] ITU-T Recommendation G.807/Y.1302, ‘‘Requirements for automatic switched transport networks (ASTN),’’ ITU-T Standardization Organization, July 2001. Available at: www.itu.int. Accessed May 2004.
[G808.1] ITU-T Recommendation G.808.1, ‘‘Generic protection switching—linear trail and subnetwork protection,’’ ITU-T Standardization Organization, under development. Available at: www.itu.int. Accessed May 2004.
[G8080] ITU-T Recommendation G.8080/Y.1304, ‘‘Architecture for the Automatic Switched Optical Network (ASON),’’ ITU-T Standardization Organization, November 2001. Available at: www.itu.int. Accessed May 2004.
[G841] ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
[G842] ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
[G871] ITU-T Recommendation G.871/Y.1301, ‘‘Framework for optical transport network recommendations,’’ ITU-T Standardization Organization, October 2000. Available at: www.itu.int. Accessed May 2004.
[G872] ITU-T Recommendation G.872, ‘‘Architecture of optical transport networks,’’ ITU-T Standardization Organization, November 2001. Available at: www.itu.int. Accessed May 2004.
[G873.1] ITU-T Recommendation G.873.1, ‘‘Optical Transport Network (OTN)—linear protection,’’ ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.
[G873.2] ITU-T Recommendation G.873.2, ‘‘Optical Transport Network (OTN)—ring protection,’’ ITU-T Standardization Organization, under development. Available at: www.itu.int. Accessed May 2004.
[G911] ITU-T Recommendation G.911, ‘‘Parameters and calculation methodologies for reliability and availability of fibre optic systems,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
[Gro00] W.D. Grover, D. Stamatelakis, ‘‘Bridging the ring-mesh dichotomy with p-cycles,’’ proceedings of the second International Workshop on Design of Reliable Communication Networks (DRCN’00), Munich, Germany, April 2000, pp. 92–104.
[Gro02] W.D. Grover, J. Doucette, ‘‘Design of a meta-mesh of chain subnetworks: enhancing the attractiveness of mesh-restorable WDM networking on low connectivity graphs,’’ IEEE Journal on Selected Areas in Communications, vol. 20, no. 1, January 2002, pp. 47–61.
[Gro04] W.D. Grover, ‘‘Mesh-based survivable networks: Options and strategies for optical, MPLS, SONET and ATM networking,’’ Prentice Hall PTR, Upper Saddle River, NJ, 2003.
[Gro98] W.D. Grover, D. Stamatelakis, ‘‘Cycle-oriented distributed preconfiguration: ring-like speed with mesh-like capacity for self-planning network restoration,’’ proceedings of the IEEE International Conference on Communications, Atlanta, GA, June 1998, pp. 537–543.
[Gry01] M. Gryseels, ‘‘Planning of multi-technology telecommunication networks,’’ PhD thesis, Ghent University, Ghent, Belgium, January 2001.
[Gry98] M. Gryseels, K. Struyve, M. Pickavet, P. Demeester, ‘‘Common pool survivability for meshed SDH-based ATM networks,’’ proceedings of the International Symposium on Broadband European Networks (SYBEN’98), Zurich, Switzerland, May 1998, pp. 267–278.
[HASH] Z. Cao, Z. Wang, E. Zegura, ‘‘Performance of hashing-based schemes for Internet load balancing,’’ ITU-T Standardization Organization.
[Her02] E. Hernandez-Valencia, M. Scholten, Z. Zhu, ‘‘The Generic Framing Procedure (GFP): an overview,’’ IEEE Communications Magazine, vol. 40, no. 5, May 2002.
[HISTORY] Available at: www.cs.utexas.edu/users/chris/think/digital_archive.html. Accessed May 2004.
[I321] ITU-T Recommendation I.321, ‘‘B-ISDN Protocol reference model and its application,’’ ITU-T Standardization Organization, April 1991. Available at: www.itu.int. Accessed May 2004.
[IP-TE-1] B. Fortz, J. Rexford, M. Thorup, ‘‘Traffic engineering with traditional IP routing protocols.’’
[IP-TE-2] D. Applegate, E. Cohen, ‘‘Making intra-domain routing robust to changing and uncertain traffic demands: understanding fundamental tradeoffs.’’
[IP-TE-3] B. Fortz, M. Thorup, ‘‘Internet Traffic Engineering by optimizing OSPF weights.’’
[IP-TE-4] M. Thorup, ‘‘Fortifying OSPF/IS-IS against failure.’’
[IP-TE-5] B. Fortz, ‘‘Optimizing OSPF/IS-IS weights in a changing world.’’
[IP-TE-6] A. Nucci, et al, ‘‘IGP link weight assignment for transient link failures.’’
[IP-TRAF] N. Brownlee, K. Claffy, ‘‘Understanding Internet traffic streams: dragonflies and tortoises,’’ IEEE Communications Magazine, October 2002, pp. 110–117.
[ISIS] ISO, ‘‘Intermediate system to Intermediate system routing information exchange protocol for use in conjunction with the protocol for providing the connectionless-mode network service (ISO 8473),’’ ISO/IEC 10589, 1992.
[ISIS-GR] M. Shand, L. Ginsberg, ‘‘Restart signaling for IS-IS,’’ Internet draft: draft-ietf-isis-restart-05.txt, January 2004, work in progress. Available at: www.ietf.org. Accessed May 2004.
[ISIS-MT] T. Przygienda, N. Shen, N. Sheth, ‘‘M-ISIS: multi topology,’’ Internet draft, January 2004, work in progress. Available at: www.ietf.org. Accessed May 2004.
[IS-IS-TAG] C. Martin, B. Neal, S. Previdi, ‘‘A policy control mechanism in IS-IS using administrative tags,’’ Internet draft, April 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[ISIS-TE] L. Smit, ‘‘IS-IS extensions for traffic engineering,’’ Internet draft: draft-ietf-isis-traffic-05.txt, August 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Jur98] I. Jurdana, B. Mikac, ‘‘An availability analysis of optical cables,’’ Workshop on All-Optical Networks (WAON’98), Zagreb, Croatia, May 1998.
[Kal96] G. Kalbe, et al, ‘‘Operator requirements,’’ European ACTS project Protection Across Network Layers (PANEL), deliverable D1, December 1996.
[Kar97] N. Karunanithi, T. Carpenter, ‘‘SONET ring sizing with genetic algorithms,’’ Computers and Operations Research, vol. 24, no. 6, 1997.
[Kar99] S.V. Kartalopoulos, ‘‘Understanding SONET/SDH and ATM: communications networks for the next millennium,’’ IEEE Press, Piscataway, NJ, 1999.
[KINI] Kini, et al, ‘‘Shared backup label switched path restoration,’’ Internet draft: draft-kini-restoration-shared-backup, May 2001, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Kom02] K. Kompella, Y. Rekhter, ‘‘LSP hierarchy with generalized MPLS TE,’’ Internet draft: draft-ietf-mpls-lsp-hierarchy-08.txt, September 2002, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Kom03] K. Kompella, Y. Rekhter, ‘‘Routing extensions in support of generalized Multi-Protocol Label Switching,’’ Internet draft: draft-ietf-ccamp-gmpls-routing-09.txt, October 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Lab02] J.F. Labourdette, et al, ‘‘Routing strategies for capacity-efficient fast-restorable mesh optical networks,’’ Photonic Network Communications, vol. 4, no. 3–4, Jan-Dec 2002, pp. 219–235.
[Lab99] C. Labovitz, A. Ahuja, F. Jahanian, ‘‘Experimental study of Internet stability and wide-area backbone failures,’’ paper presented at the 29th Annual International Symposium on Fault-Tolerant Computing, Madison, WI, June 1999.
[Las99] A. Lason, et al, ‘‘Network scenarios and requirements,’’ European IST project Layers Interworking in Optical Networks (LION), deliverable D6, September 1999.
[Lem02] E. Lemuel, ‘‘Asia Pacific submarine cable network service restored,’’ Inq7.net, July 2002. Available at: www.inq7.net/inf/2002/jul/18/inf_1–1.htm. Accessed May 2004.
[LINKNODE-FAILURE] J.P. Vasseur, A. Charny, ‘‘Distinguish a link from a node failure using RSVP hellos extensions,’’ Internet draft: draft-vasseur-mpls-linknode-failure, October 2002, work in progress. Available at: www.ietf.org. Accessed May 2004.
[LSA-FLOOD1] ‘‘OSPF refresh and flooding reduction in stable topologies,’’ Internet draft: draft-pillay-esnault-ospf-flooding-07.txt, June 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[LSA-FLOOD2] ‘‘Flooding optimizations in link-state routing protocols,’’ Internet draft: draft-ietf-ospf-isis-flood-opt.txt, 2002, work in progress. Available at: www.ietf.org. Accessed May 2004.
[M20] ITU-T Recommendation M.20, ‘‘Maintenance philosophy for telecommunication networks,’’ ITU-T Standardization Organization, October 1992. Available at: www.itu.int. Accessed May 2004.
[M30000] ITU-T Recommendation M.3000, ‘‘Overview of TMN recommendations,’’ ITU-T Standardization Organization, 1995. Available at: www.itu.int. Accessed May 2004.
[M3010] ITU-T Recommendation M.3010, ‘‘Principles for a telecommunications management network,’’ ITU-T Standardization Organization, February 2000. Available at: www.itu.int. Accessed May 2004.
[Man1] E. Mannie, et al, ‘‘Generalized multi-protocol label switching (GMPLS) architecture,’’ Internet draft: draft-ietf-ccamp-gmpls-architecture, March 2002, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Man2] E. Mannie, et al, ‘‘Recovery (protection and restoration) terminology for GMPLS,’’ Internet draft: draft-ietf-ccamp-gmpls-recovery-terminology, June 2002, work in progress. Available at: www.ietf.org. Accessed May 2004.
[McC95] S. McCarthy, ‘‘Reliability keeps up with network growth,’’ Telephony, June 1995.
[McK00] ‘‘Backbone! How changes in technology and the rise of IP threaten to disrupt the long-haul telecom services industry,’’ September 2000. Available at: www.mckinsey.de/_downloads/knowmatters/telecommunications/backbone.pdf. Accessed May 2004.
[Mod01] E. Modiano, P.J. Lin, ‘‘Traffic grooming in WDM networks,’’ IEEE Communications Magazine, vol. 39, no. 7, July 2001.
[MPLS-DESIGN] J.P. Vasseur, J. Guichard, F. Le Faucheur, ‘‘Real world designs of converged MPLS networks—review of deployed network designs to offer L2/L3 VPNs, QoS, traffic engineering, IPv6 and multicast,’’ Cisco Press, 2004 (in press).
[MPLS-TE] E. Osborne, A. Simha, ‘‘Traffic engineering with MPLS,’’ Cisco Press, Indianapolis, IN, 2002.
[MT] ‘‘Routing in IS-IS,’’ Internet draft: draft-ietf-isis-wg-multi-topology-06.txt, work in progress. Available at: www.ietf.org. Accessed May 2004.
[OSh94] C. O’Shea, ‘‘Requirements and reference configurations for survivability,’’ European RACE project end-to-end Survivable Broadband Networks (IMMUNE), deliverable D2, June 1994.
[OSPF-TE] Y. Katz, ‘‘Traffic engineering extensions to OSPF,’’ Internet draft: draft-katz-yeung-ospf-traffic-09.txt, October 2002, work in progress. Available at: www.ietf.org. Accessed May 2004.
[OSPFv2] J. Moy, ‘‘OSPF version 2,’’ RFC 2328. [OSPG-GR] J. Moy, P. Pillay-Esnault, A. Lindem, ‘‘Graceful OSPF restart,’’ Internet draft: draft-ietf-ospf-hitless-restart-08.txt, work in progress. Available at: www.ietf.org. Accessed May 2004. [OTNTS] ‘‘Optical Transport Networks & technologies standardization work plan,’’ ITU-T Standardization Organization, May 2002. Available at: http://www.itu.int/itudoc/itu-t/ com15/otn/76091.html. Accessed May 2004. [Owe02] K. Owens, V. Sharma, M. Oommen, ‘‘Network survivability considerations for traffic engineered IP networks,’’ Internet draft: draft-owens-te-network-survivability, May 2002, work in progress. Available at: www.ietf.org. Accessed May 2004. [Pap02] D. Papadimitriou, et al, ‘‘Shared risk link groups encoding and processing,’’ Internet draft: draft-papadimitriou-ccamp-srlg-processing, June 2002, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Pap03] D. Papadimitriou, et al, ‘‘Requirements for generalized MPLS (GMPLS) signaling usage and extensions for Automatically Switched Optical Network (ASON),’’ Internet draft: draft-ietf-ccamp-gmpls-ason-reqts-05.txt, November 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[PATH-COMP] J.P. Vasseur, et al, ‘‘RSVP path computation request and reply messages,’’ Internet draft: draft-vasseur-mpls-computation-rsvp, 2004, work in progress. Available at: www.ietf.org. Accessed May 2004.
[PREEMPT-POL] J. De Oliveira, J.P. Vasseur, L.C. Chen, C. Scoglio, ‘‘LSP preemption policies for MPLS traffic engineering,’’ Internet draft: draft-deoliviera-diff-te-preemption, 2003, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Raj00] B. Rajagopalan, et al, ‘‘IP over optical networks: architectural aspects,’’ IEEE Communications Magazine, vol. 38, no. 9, September 2000, pp. 44–102.
[Ram02] R. Ramaswami, K. Sivarajan, ‘‘Optical networks: a practical perspective,’’ 2nd ed, Morgan Kaufmann, San Francisco, CA, 2002.
[RED] S. Floyd, V. Jacobson, ‘‘Random early detection gateways for congestion avoidance,’’ IEEE/ACM Transactions on Networking, vol. 1, no. 4, August 1993, pp. 397–413.
[REFRESH-REDUCTION] L. Berger, et al, ‘‘RSVP refresh overhead reduction extensions,’’ RFC2961, IETF Web site, April 2001. Available at: www.ietf.org. Accessed May 2004.
[REORDERING] M. Laor, L. Gendel, ‘‘Effect of packet reordering in a backbone link on applications throughput,’’ IEEE Network, vol. 16, no. 5, September 2002.
[RFC2205] R. Braden, et al, ‘‘Resource Reservation Protocol (RSVP)—version 1 functional specification,’’ RFC2205, IETF Web site, September 1997. Available at: www.ietf.org. Accessed May 2004.
[RFC2474] K. Nichols, S. Blake, F. Baker, D. Black, ‘‘Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers,’’ RFC 2474, IETF Web site. Available at: www.ietf.org. Accessed May 2004. [RFC2547] E. Rosen, Y. Rekhter, ‘‘BGP/MPLS VPNs,’’ RFC2547, IETF Web site, March 1999. Available at: www.ietf.org. [RFC3209] D. Awduche, et al, ‘‘RSVP-TE: extensions to RSVP for LSP tunnels,’’ RFC3209, IETF Web site, December 2001. Available at: www.ietf.org. Accessed May 2004. [RFC3292] A. Doria, et al, ‘‘General Switch Management Protocol (GSMP) V3,’’ RFC3292, IETF Web site, June 2002. Available at: www.ietf.org. Accessed May 2004. [RFC3471] L. Berger, ‘‘Generalized Multi-Protocol Label Switching (GMPLS) signaling functional description,’’ RFC3471, IETF Web site, January 2003. Available at: www.ietf.org. Accessed May 2004. [RFC3473] L. Berger, ‘‘Generalized Multi-Protocol Label Switching (GMPLS) signaling resource Reservation protocol-traffic engineering (RSVP-TE) extensions,’’ RFC3473, IETF Web site, January 2003. Available at: www.ietf.org. Accessed May 2004. [RIP 1] C. Hedrick, ‘‘Routing Information Protocol,’’ RFC1058, IETF Web site. Available at: www.ietf.org. Accessed May 2004. [RIP 2] G. Malkin, ‘‘RIP version 2,’’ RFC1723, IETF Web site. Available at: www.ietf.org. Accessed May 2004.
[RIP-TRIG] G. Meyer, S. Sherry, ‘‘Triggered extensions to RIP to support demand circuits,’’ RFC2091, IETF Web site. Available at: www.ietf.org. Accessed May 2004.
[Rob01] L. Roberts, C. Crump, ‘‘US Internet IP traffic growth,’’ Caspian Networks, August 2001. Available at: www.caspiannetworks.com/library/presentations/traffic/Internet_Traffic_081301.ppt. Accessed May 2004.
[Ros01] E. Rosen, et al, ‘‘Multi-Protocol Label Switching,’’ RFC3031, IETF Web site, January 2001. Available at: www.ietf.org. Accessed May 2004.
[ROUTING-THESIS] P. Narvaez, ‘‘Routing reconfiguration in IP network,’’ MIT, June 2000.
[RSVP-TE] D. Awduche, et al, ‘‘RSVP-TE: extensions to RSVP for LSP tunnels,’’ RFC3209, IETF Web site, December 2001. Available at: www.ietf.org. Accessed May 2004.
[SECOND-METRIC] F. Le Faucheur, et al, ‘‘Use of Interior Gateway Protocol (IGP) metric as a second MPLS traffic engineering metric,’’ Internet draft: draft-ietf-tewg-te-metric-igp, work in progress. Available at: www.ietf.org. Accessed May 2004.
[Sex92] M. Sexton, A. Reid, ‘‘Transmission networking: SONET and the Synchronous Digital Hierarchy,’’ Artech House, Norwood, MA, 1992.
[Sha03] V. Sharma, F. Hellstrand, ‘‘Framework for MPLS-based recovery,’’ Internet draft, work in progress, RFC 3469, IETF Web site, February 2003. Available at: www.ietf.org. Accessed May 2004.
[Shi01] T. Shiragaki, et al, ‘‘Protection architecture and applications of OCh shared protection rings,’’ Optical Networks Magazine, vol. 2, no. 4, July/August 2001, pp. 48–58.
[Soc91] T. Socolofsky, C. Kale, ‘‘A TCP/IP tutorial,’’ RFC1180, IETF Web site, January 1991. Available at: www.ietf.org. Accessed May 2004.
[SOFT-PREEMPTION] M. Meyer, et al, ‘‘MPLS traffic engineering soft preemption,’’ Internet draft: draft-ietf-mpls-soft-preemption, work in progress. Available at: www.ietf.org. Accessed May 2004. [Sos94] J. Sosnosky, ‘‘Service applications for SONET DCS distributed restoration,’’ IEEE Journal on Selected Areas in Communications, vol. 12, no. 1, January 1994, pp. 59–68. [Str00] K. Struyve, et al, ‘‘Application, design and evolution of WDM in GTS’s Pan-European transport network,’’ IEEE Communications Magazine, vol. 38, no. 3, 2000, pp. 114–121. [Str01] J. Strand, A. Chiu, R. Tkach, ‘‘Issues for routing in the optical layer,’’ IEEE Communications Magazine, vol. 39, no. 2, February 2001, pp. 81–87. [SURVIV] R. Bhandari, ‘‘Survivable networks: Algorithms for diverse routing,’’ Kluwer Academic Publishers, Amsterdam, The Netherlands, 1999. [TE-REQ] D. Awduche, et al, ‘‘Requirements for traffic engineering over MPLS,’’ RFC2702, IETF Web site, September 1999. Available at: www.ietf.org. Accessed May 2004. [TRAF-EST] ‘‘Traffic matrices estimation: existing techniques and new directions,’’ Available at: http://www.acm.org/sitcom/sigcomm2002/papers/trafficmatrix.html. Accessed May 2004.
[TRAVEL-SALESMAN]. Available at: http://members.cox.net/mathmistakes/travel.htm. Accessed May 2004.
[Ver95] D. Vercauteren, P. Demeester, J. Luystermans, E. Houtrelle, ‘‘Availability analysis of multi-layer networks,’’ proceedings of the third International Conference on Telecommunications System Modeling and Analysis, Nashville, TN, March 1995, pp. 483–493.
[Vhe00] P. Van Heuven, et al, ‘‘Recovery in IP based networks using MPLS,’’ paper presented at the IEEE Workshop on IP-oriented Operations & Management IPOM 2000, 2–4 September 2000, Cracow, Poland, pp. 70–78.
[Vis02] M. Vissers, ‘‘Optical Transport Network & Optical Transport Module,’’ ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.
[Wau99] N. Wauters, G. Ocakoglu, K. Struyve, P.F. Fonseca, ‘‘Survivability in a new pan-European carriers’ carrier network based on WDM and SDH technology: current implementation and future requirements,’’ IEEE Communications Magazine, vol. 37, no. 8, August 1999, pp. 63–69.
[Wil01] G. Willems, et al, ‘‘Capacity versus availability trade-offs in mesh-restorable WDM networks,’’ proceedings of the third international workshop on Design of Reliable Communication Networks (DRCN’01), Budapest, Hungary, October 2001.
[Wos01] L. Wosinska, L. Thylen, R. Holmstrom, ‘‘Large-capacity strictly nonblocking optical cross-connects based on microelectrooptomechanical systems (MOEMS) switch matrices: reliability performance analysis,’’ Journal of Lightwave Technology, vol. 19, no. 8, August 2001.
[WRED] Cisco Systems. Available at: www.cisco.com/univercd/cc/td/doc/product/software/ios112/ios112p/gsr/wred_gs.htm. Accessed May 2004.
[Wu97] T.-H. Wu, N. Yoshikai, ‘‘ATM transport and network integrity,’’ Academic Press, Amsterdam, The Netherlands, 1997.
[X200] ITU-T Recommendation X.200, ‘‘Data networks and open systems communications: open systems interconnection—model and notation,’’ ITU-T Standardization Organization, July 1994. Available at: www.itu.int. Accessed May 2004.
[X700] ITU-T Recommendation X.700, ‘‘Management framework for Open Systems Interconnection (OSI) for CCITT applications,’’ ITU-T Standardization Organization, September 1992. Available at: www.itu.int. Accessed May 2004. [X701] ITU-T Recommendation X.701, ‘‘Information management—Open Systems Interconnection (OSI)—system management overview,’’ ITU-T Standardization Organization, August 1997. Available at: www.itu.int. Accessed May 2004.
List of Figure Sources
Figure 1.5 ITU-T Recommendation I.321, ‘‘B-ISDN Protocol Reference Model and its Application,’’ April 1991. Available at: www.itu.int. Accessed May 2004.
Figure 1.9 G. Kalbe, et al, ‘‘Operator requirements,’’ European ACTS project Protection Across Network Layers (PANEL), deliverable D1, December 1996.
Figure 1.14 V. Sharma, F. Hellstrand, ‘‘Framework for MPLS-based recovery,’’ Internet draft, work in progress, RFC 3469, IETF Web site, February 2003. Available at: www.ietf.org. Accessed May 2004.
Figure 1.15 V. Sharma, F. Hellstrand, ‘‘Framework for MPLS-based recovery,’’ Internet draft, work in progress, RFC 3469, IETF Web site, February 2003. Available at: www.ietf.org. Accessed May 2004.
Figure 2.2 ITU-T Recommendation G.805, ‘‘Generic functional architecture of transport networks,’’ ITU-T Standardization Organization, March 2000. Available at: www.itu.int. Accessed May 2004.
Figure 2.3 ITU-T Recommendation G.803, ‘‘Architecture of transport networks based on the synchronous digital hierarchy (SDH),’’ ITU-T Standardization Organization, March 2000. Available at: www.itu.int. Accessed May 2004.
Figure 2.4 ITU-T Recommendation G.707/Y.1322, ‘‘Network node interface for the synchronous digital hierarchy (SDH),’’ ITU-T Standardization Organization, October 2000. Available at: www.itu.int. Accessed May 2004.
Figure 2.5 ITU-T Recommendation G.707/Y.1322, ‘‘Network node interface for the synchronous digital hierarchy (SDH),’’ ITU-T Standardization Organization, October 2000. Available at: www.itu.int. Accessed May 2004.
Figure 2.7 ITU-T Recommendation G.806, ‘‘Characteristics of transport equipment—description methodology and generic functionality,’’ ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.
Figure 2.8 ITU-T Recommendation G.806, ‘‘Characteristics of transport equipment—description methodology and generic functionality,’’ ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.
Figure 2.9 ITU-T Recommendation G.806, ‘‘Characteristics of transport equipment—description methodology and generic functionality,’’ ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.
Figure 2.10 ITU-T Recommendation G.806, ‘‘Characteristics of transport equipment—description methodology and generic functionality,’’ ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.
Figure 2.11 ITU-T Recommendation G.806, ‘‘Characteristics of transport equipment—description methodology and generic functionality,’’ ITU-T Standardization Organization, October 2000, and ITU-T Recommendation G.806, amendment 1, ITU-T Standardization Organization, prepublished March 2003. Available at: www.itu.int. Accessed May 2004.
Figure 2.18 C. Brianza, et al, ‘‘Deliverable D2a: Overall Network Protection—Version 1,’’ deliverable from the ACTS-project PANEL, April 1997.
Figure 2.21 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.22 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.23 ITU-T Recommendation G.803, ‘‘Architecture of transport networks based on the synchronous digital hierarchy (SDH),’’ ITU-T Standardization Organization, March 2000. Available at: www.itu.int. Accessed May 2004.
Figure 2.24 ‘‘Transmission and multiplexing (TM); generic requirements of transport functionality of equipment; part 1-1: generic processes and performance,’’ ETSI EN 300 417-1-1 V1.2.1, October 2001.
Figure 2.25 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.26 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.27 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.28 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.32 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.36 (Top) ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.37 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.38 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.39 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.40 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.41 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.42 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.43 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.44 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.49 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.50 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.52 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 2.53 ITU-T Recommendation G.842, ‘‘Interworking of SDH network protection architectures,’’ ITU-T Standardization Organization, April 1997. Available at: www.itu.int. Accessed May 2004.
Figure 2.54 ITU-T Recommendation G.841, ‘‘Types and characteristics of SDH network protection architectures,’’ ITU-T Standardization Organization, October 1998. Available at: www.itu.int. Accessed May 2004.
Figure 3.4 Adapted from R. Ramaswami, K. Sivarajan, ‘‘Optical networks: a practical perspective,’’ 2nd ed, Morgan Kaufmann, San Francisco, CA, 2002.
Figure 3.5 Adapted from R. Ramaswami, K. Sivarajan, ‘‘Optical networks: a practical perspective,’’ 2nd ed, Morgan Kaufmann, San Francisco, CA, 2002.
Figure 3.6 Adapted from J. Derkacz, et al, ‘‘IP/OTN Cost Model and Photonic Equipment Cost Forecast-IST LION project,’’ Proc. 4th Workshop on Telecommunications Technoeconomics, Rennes, France, May 2002.
Figure 3.8 Adapted from M. Vissers, “Optical Transport Network & Optical Transport Module,” ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004; and ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.10 Adapted from ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.11 ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.12 Adapted from M. Vissers, “Optical Transport Network & Optical Transport Module,” ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.
Figure 3.13 ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.14 ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.15 ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.16 ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.17 ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.18 ITU-T Recommendation G.709/Y.1331, “Interfaces for the optical transport network,” ITU-T Standardization Organization, February 2001, and amendment 1, November 2001. Available at: www.itu.int. Accessed May 2004.
Figure 3.19 M. Vissers, “Optical Transport Network & Optical Transport Module,” ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.
Figure 3.20 M. Vissers, “Optical Transport Network & Optical Transport Module,” ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.
Figure 3.21 M. Vissers, “Optical Transport Network & Optical Transport Module,” ITU-T Standardization Organization, April 2002. Available at: http://ties.itu.int/ftp/itu-t/com15/tsg15opticaltransport/tsg15opticaltransport/OTN/g709-intro-v2.ppt. Accessed May 2004.
Figure 3.24 J. Strand, A. Chiu, R. Tkach, “Issues for routing in the optical layer,” IEEE Communications Magazine, vol. 39, no. 2, February 2001, pp. 81–87.
Figure 3.34 P. Arijs, et al., “Design of ring and mesh based WDM transport networks,” Optical Networks Magazine, vol. 1, no. 2, July 2000, pp. 25–40.
Figure 3.36 (Left) Adapted from S. De Maesschalck, et al., “Pan-European optical transport networks: an availability based comparison,” Photonic Network Communication, vol. 5, no. 3, May 2003, pp. 203–225.
Figure 3.37 P. Arijs, et al., “Design of ring and mesh based WDM transport networks,” Optical Networks Magazine, vol. 1, no. 2, July 2000, pp. 25–40.
Figure 3.39 Adapted from S. De Maesschalck, et al., “Pan-European optical transport networks: an availability based comparison,” Photonic Network Communication, vol. 5, no. 3, May 2003, pp. 203–225.
Figure 3.40 Adapted from S. De Maesschalck, et al., “Pan-European optical transport networks: an availability based comparison,” Photonic Network Communication, vol. 5, no. 3, May 2003, pp. 203–225.
Figure 3.41 P. Arijs, et al., “Design of ring and mesh based WDM transport networks,” Optical Networks Magazine, vol. 1, no. 2, July 2000, pp. 25–40.
Figure 3.42 P. Arijs, et al., “Design of ring and mesh based WDM transport networks,” Optical Networks Magazine, vol. 1, no. 2, July 2000, pp. 25–40.
Figure 3.46 W.D. Grover, D. Stamatelakis, “Bridging the ring-mesh dichotomy with p-cycles,” Proceedings of the Second International Workshop on Design of Reliable Communication Networks (DRCN’00), Munich, Germany, April 2000, pp. 92–104.
Figure 4.2 M. Dodge, “Cybermap of the Month Column,” ARPANET, October 1980. (Illustration courtesy of the Computer Museum of History.Center.) Available at http://mappa.mundi.net/maps/maps_001. Accessed May 2004.
Figure 6.5 “User network interface (UNI) 1.0 Signaling Specification,” Optical Internetworking Forum/User Network Interface Specifications (OIF2000.125.5), June 2001. Available at www.oiforum.com. Accessed May 2004.
Figure 6.7 D. Colle, et al., “Developing control plane models for optical networks,” Technical Digest, 2002 Optical Fiber Communication Conference (OFC2002), Anaheim, CA, March 17–22, 2002, pp. 757–759.
Figure 6.8 D. Colle, et al., “Developing control plane models for optical networks,” Technical Digest, 2002 Optical Fiber Communication Conference (OFC2002), Anaheim, CA, March 17–22, 2002, pp. 757–759.
Figure 6.22 S. De Maesschalck, et al., “Intelligent optical networking for multilayer survivability,” IEEE Communications Magazine, vol. 40, no. 1, pp. 42–49, January 2002.
Figure 6.23 S. De Maesschalck, et al., “Intelligent optical networking for multilayer survivability,” IEEE Communications Magazine, vol. 40, no. 1, pp. 42–49, January 2002.
Figure 6.24 Adapted from P. Demeester, et al., “Resilience in multi-layer networks,” IEEE Communications Magazine, vol. 37, no. 8, August 1999, pp. 70–76.
Index
Numbers 1+1 packet protection in MPLS TE, 333–334 1+1 protection (dedicated), 31 1:1 protection (dedicated with extra traffic), 32 1:N linear APS, 76–78 1:N protection, 32 one-to-one backup in MPLS TE backup tunnel path computation, 419–421 bandwidth sharing capability, 343 overview, 318–319, 345 RSVP signaling, 382–384
A A functions. See adaptation functions (A) access points (APs), 44, 45 accounted failure scenarios, deriving, 16–18 accounting management, 42 adaptation functions (A) defined, 44, 45 f2 filters in sink function, 60 adapted information (AI), 44, 45 adaptive dynamic distributed routing algorithm in ARPANET ARPA-2 version, 208, 210 ARPANET map (October 1980), 209 efficiency analyses, 210 first version, 208, 209 issues arising with, 208 terms defined, 207–208
adaptive routing protocols, 207–208 add/drop multiplexers (ADMs). See also Optical Add/Drop Multiplexers (OADMs) DXCs versus, 82 interconnection of stacked STM-N Rings and, 103–104 in ring networks, 82 in SDH networks, 52–54 additive latency, as criterion for recovery mechanisms, 27 administrative link cost increase, temporary loops and, 256–257 ADMs. See add/drop multiplexers (ADMs) AELT (Average Expected Loss of Traffic), 191–192, 196–197 AI (adapted information), 44, 45 Alarm Indication Signal (AIS) defect in OTNs, 153, 155–156 Alarm Indication Signal (AIS) in SDH AU_AIS signal, 65, 67, 68, 70, 72, 73–74 fault detection and propagation inside NEs, 60–70 fault propagation and notification on network level, 70, 72–74
late arrival of MS_AIS signal in VC-4 cross connection, 68, 69 MS_AIS signal, 65, 66, 70, 72 race conditions in HOP and LOP layers, 72, 73, 74 TT sink function and, 58 TU_AIS signal, 68, 72–73 algorithm complexity CPU power and, 281–282 defined, 279–284 Dijkstra algorithm, 243 efficiency and, 279–280, 282, 283–284 as function of problem size, 280–281 implementation choices and efficiency, 283–284 NP complete problem, 284 QoS during failure and, 265 worst-case scenario, 282–283 algorithms, routing. See routing algorithms All Ones defect (dAIS) in SDH, 50, 51 alternate paths. See recovery paths American National Standards Institute (ANSI), 57 APCN 2 (Asia Pacific Cable Network) submarine cable break, 15 application layer (Layer 5), 6 APs (access points), 44, 45 APS protocol. See automatic protection switching (APS) protocol
ARPANET routing protocol, 207–210, 287 Asia Pacific Cable Network (APCN 2) submarine cable break, 15 ASON. See Automatic Switched Optical Network (ASON) ASTNs (Automatic Switched Transport Networks), 425–426 asymmetrical load balancing, 260–262 asymmetrical services, 3 AT&T, erroneous software update in, 15 atomic functions overview, 43, 45 responsibility in fault propagation and notification (SDH), 70, 71 augmented model for control plane, 435–437 automatic protection switching (APS) protocol 1:N linear APS, 76–78 bidirectional (dual-ended) operation, 76, 77 linear APS, 74–76 MS-SP Rings and, 83–86, 88 ring APS, 76 sublayer tandem connection monitoring, 78–80 subnetwork connection protection, 78–80 trail protection, 74–78 unidirectional (single-ended) operation, 77 Automatic Switched Optical Network (ASON) control plane (CP), 425, 426–437 framework, 425–426 G-MPLS and, 3 optical connection controllers (OCCs), 425 standardization, 425 transport planes (TPs), 425
Automatic Switched Transport Networks (ASTNs), 425–426 availability. See also reliability calculations for optical networks, 185–192 comparison between 1+1 protection in ring-based and mesh-based optical networks, 192–193 comparison between protection and restoration in mesh-based networks, 194–197 defined, 9 example for computing, 9 formula for, 11, 185 recovery schemes and, 21 topology versus, in mesh-based optical networks, 195 traffic type versus, in mesh-based optical networks, 195–197 availability calculations for optical networks availability of connections and load, 188–191 ELT and AELT, 191–192 line failures, 186–188 optical node failures, 185 protected connection, 189–190 restored connection, 190–191 unprotected connection, 188–189 Average Expected Loss of Traffic (AELT), 191–192, 196–197
B backup capacity as criterion for recovery mechanisms, 26 dedicated versus shared, 29–30 overlay backup capacity network discovery in MPLS TE facility backup, 404–405
required amount of backup capacity (multilayer recovery case study), 468–469 in single-layer recovery mechanisms, 29–30 backup path computation in MPLS TE bandwidth sharing between backup paths, 392–393 diverse path computation algorithm, 393–394 facility backup, 397–419 global path protection, 393–397 guaranteeing QoS during failure, 386–387, 388–392 introduction, 386 network design considerations, 387–392 one-to-one backup, 419–421 overview, 385–386, 421 QoS considerations in backbone network profiles, 387–388 backup path selection in MPLS TE, 349–350 backup tunnel path computation in MPLS TE facility backup amount of bandwidth to protect, 401–405 backup tunnels selection, 416–418 centralized computation, 409–411 distributed model, 411–416 facility-based model, 409 independent CSPF-based model, 405–409 manual configuration versus dynamic computation, 397–400 overlay backup capacity network discovery, 404–405 path computation client (PCE), 418–419
with strict QoS guarantees during failure, 400–419 triggers, 400 without QoS guarantee during failure, 397–400 Backward Defect Indication (BDI) defect in OTNs, 153, 156 Backward Defect Indication Overhead (BDI-O) defect in OTNs, 153 Backward Defect Indication Payload (BDI-P) defect in OTNs, 153 Backward Error Indication (BEI) defect in OTNs, 153, 156 bandwidth optimization using MPLS TE, 306 bandwidth protection violation in MPLS TE, 350–353 bandwidth sharing capability between backup paths, 392–393 case study, 368–380 local versus global protection in MPLS TE, 340–343 basic level user reliability requirements, 18 BDI (Backward Defect Indication) defect in OTNs, 153, 156 BDI-O (Backward Defect Indication Overhead) defect in OTNs, 153 BDI-P (Backward Defect Indication Payload) defect in OTNs, 153 BEI (Backward Error Indication) defect in OTNs, 153, 156 Bellman-Ford routing protocols. See distance vector routing protocols bidirectional (dual-ended) operation in APS, 76, 77 bidirectional connections in SDH/SONET, 3
bidirectional forwarding detection in hello-based mechanisms, 223 bidirectional line switched Rings (BLSR). See Multiplex Section-Shared Protection Rings (MS-SP Rings) in SDH bidirectional linear protection in MSP, 107–108 bidirectional path switched Rings (BPSR). See Subnetwork Connection Protection Rings (SNCP Rings) bidirectional traffic, 3 BML (business management layer), 43 bottom-up escalation, 446–447 business critical user reliability requirements, 18 business management layer (BML), 43 bypass. See facility backup or bypass in MPLS TE
C C (connection function), 43, 45 cable cuts link failure caused by, 220 overview, 12–13 preventing, 20 submarine cable break (APCN 2), 15 CAPital EXpenditure (CAPEX), IONs and reduction in, 424 case studies for MPLS TE recovery mechanisms (Case Study 1: UK Network) assumptions, 354–356 link protection, 356–357 objectives, 356 proposed design, 356–357 case studies for MPLS TE recovery mechanisms (Case Study 2: UK
Network with Shared SRLGs) additional assumptions, 359 additional objectives, 359 link protection, 360–361 node protection, 361 proposed design, 359–362 case studies for MPLS TE recovery mechanisms (Case Study 3: Complex US Network) abbreviations, 364 assumptions, 362–364 bandwidth sharing, 368–380 link protection, 365–368 node protection, 368 objectives, 364–365 proposed design, 365 case studies for multilayer recovery (Case Study 1: Optical Restoration and MPLS TE Fast Reroute) interlayer recovery mechanisms, 466 overview, 465, 469 required amount of backup capacity, 468–469 set of recovery actions, 466–468 single-layer recovery mechanisms, 465–466 case studies for multilayer recovery (Case Study 2: SONET/SDH Protection and IP Routing) interlayer recovery mechanisms, 470 overview, 470 single-layer recovery mechanisms, 469–470 case studies for multilayer recovery (Case Study 3: MPLS TE Fast Reroute and IP Rerouting Fast Convergence) overview, 476 set of recovery actions, 472–476
single-layer recovery mechanisms, 471–472 case study for IP routing with IS-IS analysis, 275–277 assumptions, 270–272 dampening mechanisms configuration, 272–274 objectives, 272 proposed design, 272–274 SPF duration time, 277–278 case study for SDH protection strategies, 115–127 assumptions, 115–123 cost comparisons, 125–127 hybrid SNCP/MS-SP Ring protection, 122–123, 124–127 network design and evaluation process, 123–125 network scenario, 115–116 node configurations, 116–122 objective, 123 protection strategies, 122–123 pure end-to-end SNCP protection, 122, 123–127 pure MS-SP Ring protection, 122, 124–127 CCI (Connection Control Interface), 425, 431 centralized recovery mechanisms, 34–35 centralized routing architectures, RP failure and, 221, 225 characteristic information (CI), 44, 45 circuit switching, 4 classes of recovery as criterion for recovery mechanisms, 28 TE LSPs and, 326 Coarse Wavelength Division Multiplexing (CWDM), 133 common pool strategy, 451–454 configuration management, 42
Connection Control Interface (CCI), 425, 431 connection function (C), 43, 45 connection points (CPs), 43, 45 connectionless networks, 4–5, 36 connection-oriented networks, 4–5, 36 control plane (CP) in ASONs architectures, 432–437 augmented model, 435–437 Connection Control Interface (CCI), 425, 431 defined, 425 G-MPLS and, 426–429 main function, 426 overlay model, 433–434 peer model, 434–435 protocols for implementing, 426–432 control plane overview, 7–8 count-to-infinity problem with distance vector routing protocols, 206 CP. See control plane (CP) in ASONs CPs (connection points), 43, 45 CWDM (Coarse Wavelength Division Multiplexing), 133
D dAIS (All Ones defect) in SDH, 50, 51 dampening algorithms in IP routing exponential back-off timer algorithm, 228–229 exponential decay algorithm for interface dampening, 227–228 fast converge and, 226 flapping resources and, 226 stability preserved by, 226 up-state timer algorithm, 227 D&C interconnection. See drop and continue interconnection of rings in SDH data link layer (Layer 2), 5
data plane, 6 data-centric networks, evolution in, 2–3 dDEG (Degraded Signal defect) in SDH, 50, 51 decentralized recovery mechanisms, 34–35 dedicated backup capacity, 29, 30 dedicated protection paths with extra traffic, 32 overview, 31 dedicated protection rings. See Multiplex Section-Dedicated Protection Rings (MS-DP Rings) in SDH; Optical Multiplex Section-Dedicated Protection Rings (OMS DPRings) dedicated recovery mechanisms in ring-based optical networks, 161, 171–173 defect detection times in SDH networks, 50–52, 56 defects. See also failures defined, 10 OTN maintenance signals and alarm suppression, 154–157 in OTNs, 152–153 in SDH networks, 50–51 Degraded Signal defect (dDEG) in SDH, 50, 51 degree of survivability, 10 Dense Wavelength Division Multiplexing (DWDM), 133, 215 Detour LSP in MPLS TE, 318–319, 343, 345 Detour LSP merging, 319, 384–385 DETOUR Object (RSVP), 375–376 dEXC (Excessive Error defect) in SDH, 50, 51 Diffserv code point (DSCP) packet marking and, 234–235 packet scheduling and, 235
queuing packets based on, 235–236 digital cross-connects (DXCs) in SDH ADMs versus, 82 fault propagation and notification on network level, 70, 72–74 overview, 53–54 Dijkstra algorithm complexity, 243 described, 242–243 Dijkstra quoted on, 241 incremental Dijkstra algorithm, 285–293 performance, 248–249 step by step example, 243–248 Dijkstra, Edsger, 241 distance vector routing protocols count-to-infinity problem, 206 Enhanced Interior Gateway Routing Protocol (EIGRP), 207 example, 204–206 inefficiency during network element failure, 205–206 link state protocols versus, 212–213 objective, 204 overview, 204–207 Routing Information Protocol (RIP), 207 Routing Information Protocol with triggered update (RIP-TRIG), 207 split horizons techniques, 206–207 distributed (decentralized) recovery mechanisms, 34–35 distributed routing architectures RP failure and, 221, 225 temporary loops during link or node failure, 253–257 distributed routing tables, 207 diverse routing, 21 dLOF (Loss of Frame defect) in SDH, 50, 51
dLOM (Loss of Multiframe defect) in SDH, 50, 51 dLOP (Loss of Pointer defect) in SDH, 50, 51 dLOS (Loss of Signal defect) in SDH, 51–52 double protection, 449–454 dPLM (Payload Mismatch defect) in SDH, 50, 51 dRDI (Remote Defect Indication defect) in SDH, 50, 51 drop and continue interconnection of rings in SDH MS-SP and SNCP Rings, 101–102 MS-SP Rings, 97–101 overview, 95, 106 SNCP Rings, 96–97 DSCP. See Diffserv code point (DSCP) dTIM (Trace Id Mismatch defect) in SDH, 50, 51 dual homing principle, 20, 21 duct topology, SRG and fiber cable topology versus, 159–160 dUNEQ (Unequipped VC defect) in SDH, 50, 51 duplication of packets, as criterion for recovery mechanisms, 27 DWDM (Dense Wavelength Division Multiplexing), 133, 215 DXCs. See digital cross-connects (DXCs) in SDH dynamic multilayer recovery global reconfiguration option, 460 for IP-over-OTN network, 458–460 local reconfiguration option, 460, 462–463 logical IP topologies and, 458, 460, 461 static schemes versus, 457–458, 460, 462–463
dynamic recovery paths, 30, 31 dynamic routing tables, 207
E earthquake, Hanshin/Awaji, 14–15 EDFA (Erbium-doped fiber amplifiers), 133 EIGRP (Enhanced Interior Gateway Routing Protocol), 207 ELT (Expected Loss of Traffic), 191–192, 194–195 Enhanced Interior Gateway Routing Protocol (EIGRP), 207 equipment failures link failure caused by, 220–221 overview, 13 preventing, 20 Erbium-doped fiber amplifiers (EDFA), 133 escalation strategy bottom-up escalation, 446–447 defined, 444 hold-off timer implementation, 448 recovery token signal implementation, 448 top-down escalation, 447–448 evolution in data-centric networks, 2–3 evolution of SDH/SONET to OTNs, 39 evolution of the optical network layer adding flexibility, 139 mesh organization, 137–139 optical nodes, 135 ring organization, 135–137 WDM in point-to-point optical network layer, 132–134 Excessive Error defect (dEXC) in SDH, 50, 51 Expected Loss of Traffic (ELT), 191–192, 194–195
exponential back-off timer dampening algorithm, 228–229 exponential decay algorithm for interface dampening, 227–228 external causes of failure, 12
F f1 filters (SDH), 58, 80 f2 filters (SDH) maintenance signals and, 58 overview, 81 in A sink function, 60 in TT sink function, 58, 60 FA (forwarding adjacency), 429 facility backup or bypass in MPLS TE backup tunnel path computation, 397–419 bandwidth sharing capability, 342–343 link failure and mode of operation, 322–324 link protection versus node protection, 320 node failure and mode of operation, 321–322 overview, 345 PLR behavior before failure, 379–381 PLR behavior during failure, 381–382 RSVP signaling, 379–382 single NHOP or NNHOP backup tunnel in, 319–321 failure coverage. See scope of failure coverage failure detection in IP routing hello-based mechanisms, 223–224 lower layers failure notification, 222–223 failure profiles link failures in IP routing, 220–221 link failures in MPLS TE, 353
node failures in IP routing, 221–222 node failures in MPLS TE, 353 failure scenarios deriving accounted failure scenarios, 16–18 protection versus restoration and, 31 scope of, 25 failures. See also defects; outages or faults; reliability; specific kinds accounted versus unaccounted, 16 availability and optical line failures, 186–188 availability and optical node failures, 185 commonly occurring, 12–13 defects defined, 10 drastic or severe, 13–15 failure-and-repair process, 10 FCC reporting requirements, 13 internal versus external causes, 12 link failures in IP routing, 220–221, 225, 253–257 MTBFs (mean time between failures), 11 MTTR (mean time to repair), 11 in multilayer networks, 438–439 multiple-link, 17–18 network element failure defined, 10 node failures in IP routing, 221–222, 225–226, 253–257 preventing, 20–21 QoS during, 262–266 root or primary, 10 secondary, or symptoms, 10 single-link, 16, 17 single-node, 16–17 terminology, 10–11 time of failure, 10
unintentional versus intentional, 8 fast converge case study, 471–476 dampening algorithms in IP routing and, 226 interaction between fast IGP convergence and NSF, 293–295 fast recovery, MPLS TE for, 306–307 Fast Reroute (FRR) in MPLS TE. See local protection in MPLS TE FAST-REROUTE Object (RSVP), 374–375 fault clearing time, 24 fault detection and characterization in IP routing, 214 fault detection and propagation in NEs (SDH) cable cut upstream of regenerator, 61–62 cable cut upstream of VC-4 cross-connection, 63–65 distorted noise/signal entering regenerator, 62–63 incoming AU_AIS signal, 65, 67, 68 incoming MS_AIS signal, 65, 66 late arrival of MS_AIS signal in VC-4 cross connection, 68, 69 summary, 68, 70 fault detection and propagation in optical networks associated overhead, 143, 145–150 defects, 152–153 maintenance signals and alarm suppression, 154–157 nonassociated overhead, 150–152 optical channel data unit overhead (ODUk OH), 145–148
optical channel overhead (OCh OH), 150–151 optical channel payload unit overhead (OPUk OH), 145 optical channel transport unit overhead (OTUk OH), 149–150 optical multiplex section overhead (OMS OH), 151 optical transmission section overhead (OTS OH), 151–152 overview, 144–145 fault detection in MPLS TE bandwidth protection violation and, 350–353 differentiating link failures from node failures, 349–353 optimal backup path selection and, 349–350 RSVP hello protocol extension, 348–349 fault detection time overview, 23 recovery cycle and, 307–308 fault indication signal (FIS) as IGP update message, 310–311 in MPLS TE, 310–311 propagation in IP routing, 229–237 as RSVP Path Error message, 311 fault management defined, 42 hierarchy in SDH, 58, 59 SDH processes, 58–60 fault notification time illustrated, 23 in IP routing, 215 overview, 23 in recovery cycle, 308–309 RSVP reliable messaging, 308–309 fault repaired notification time, 24, 25
faults. See failures; outages or faults FCC (Federal Communications Commission) reporting requirements, 13 FDI (Forward Defect Indication) in OTN, 155–156 FDI-O (Forward Defect Indication Overhead) defect in OTNs, 153 FDI-P (Forward Defect Indication Payload) defect in OTNs, 152–153 FDM (Frequency Division Multiplexing), 132 Federal Communications Commission (FCC) reporting requirements, 13 FIB (forwarding information base), 204, 251 fiber cable topology SRG and duct topology versus, 159–160 SRG and fiber topology versus, 160, 161 fiber topology, SRG and fiber cable topology versus, 160, 161 FIS. See fault indication signal (FIS) fish problem in traffic engineering, 298–301 flapping resources dampening algorithms in IP routing and, 226 LSA origination and, 232–233 flexible optical networks, 200 flow random early detection (FRED), 236 Forward Defect Indication (FDI) in OTN, 155–156 Forward Defect Indication Overhead (FDI-O) defect in OTNs, 153 Forward Defect Indication Payload (FDI-P) defect in OTNs, 152–153
forwarding adjacency (FA), 429 forwarding information base (FIB), 204, 251 FRED (flow random early detection), 236 Frequency Division Multiplexing (FDM), 132 FRR (Fast Reroute) in MPLS TE. See local protection in MPLS TE
G gateways dual-gateway ring interconnection schemes, 94–95, 106 between self-healing rings, node architectures for, 104–105 General Switch Management Protocol (GSMP), 425, 431 Generalized Multi-Protocol Label Switching (G-MPLS) for ASON CP implementation, 426–429 ASON standardization and, 425 for dynamic lightpath allocation, 3 label presentation options, 426–427 link state routing in G-MPLScapable networks, 428 LSP representation in, 428–429 restoration, 429–430 generic multilayer recovery approaches. See also specific approaches case studies, 464–476 deciding which layers get recovery schemes, 439 dynamic multilayer recovery, 457–463
generic framework for multilayer survivability, 464 need for multilayer recovery, 438–439 overview, 437–438, 464, 476–477 single-layer recovery schemes in multilayer networks, 439–444 static multilayer recovery schemes, 444–457 supporting spare resources for multilayer recovery, 449–454 global default restoration in MPLS TE advantages and drawbacks, 343–344 defined, 310 fault indication signal (FIS), 310–311 mode of operation, 311–313 overview, 343–344 recovery cycle with, 312–313 recovery time, 313–314 revertive versus nonrevertive modes, 346–347 standardization, 370 global path protection in MPLS TE advantages and drawbacks, 344–345 backup path computation, 393–397 bandwidth sharing capability, 341–342 defined, 310 local protection compared to, 336–346 mode of operation, 315–316 overview, 314, 344–345 recovery time, 316, 336 revertive versus nonrevertive modes, 347 standardization, 370–371
state overhead and scalability, 336–340 global recovery, defined, 33 G-MPLS. See Generalized Multi-Protocol Label Switching (G-MPLS) good news (link cost decrease), iSPF and, 288–291 GSMP (General Switch Management Protocol), 425, 431 guaranteed bandwidth, as criterion for recovery mechanisms, 27
H Hanshin/Awaji earthquake, 14–15 hello-based mechanisms bidirectional forwarding detection, 223 false-positive alarms, 349 IGP hellos, 223 in IP routing, 223–224 layer 2 link failure notification versus, 223–224 in MPLS TE, 348–349 RSVP hello protocol extension, 348–349 helper neighbors of restarting routers, 269 higher order path (HOP) layer in SDH. See also Virtual Containers-n (VC-n) in SDH DXC (digital cross-connect), 70–74, 81 LOPs carried by, 47–48 MS-SP Rings and, 83 overview, 46–47, 55 race conditions and AIS propagation, 72, 73, 74 hold-off time illustrated, 23, 24 in multilayer recovery, 38, 448 in recovery cycle, 23, 308 in reversion cycle, 24 hold-off timer
escalation strategy implementation, 448 in IP routing, 214–215 HOP layer. See higher order path (HOP) layer in SDH
I IETF (Internet Engineering Task Force), OTN work by, 139 IGP. See interior gateway protocols (IGP) incremental Dijkstra algorithm (iSPF) efficiency, 293 final algorithm, 291–293 history, 287 link cost decrease (good news) and, 288–291 link cost increase and, 287–288 motivation, 285–287 inherent supervision of subnetwork connections (SDH), 79 integrated approach to multilayer recovery, 37, 449 integrity, defined, 9–10 intelligent optical networks (IONs), 424, 476 interface dampening using exponential decay algorithm, 227–228 interior gateway protocols (IGP) FIS as IGP update message, 310–311 hello-based mechanisms in IP routing, 223 interaction between fast IGP convergence and NSF, 293–295 link metric manipulation, 265 metric optimization, 263–264 packet marking and, 234–235 planned node failure and, 226 RP failure and, 225 temporary loop duration and timers, 255–256
Intermediate System to Intermediate System (IS-IS) routing protocol case study, 270–278 multitopology routing, 238–241 overview, 212 packet marking and, 234–235 shortest path computation triggers, 249 TE LSPs and, 328 internal causes of failure, 12 International Telecommunication Union (ITU) OTN standards, 139, 144 SDH standardized by, 57 work on OTN recovery, 158–159 Internet Engineering Task Force (IETF), OTN work by, 139 Internet Protocol (IP) routing algorithm complexity, 279–284 analysis of recovery cycle, 214–220 case study with IS-IS configuration, 270–278 dampening algorithms, 226–229 distance vector routing protocols, 204–207 failure characterization, 224 failure detection, 222–224 failure profiles, 220–222 fault notification time, 215 FIS propagation, 229–237 forwarding information base (FIB), 204 global versus local recovery and, 213–214 hold-off timer, 214–215 impact of failure types on traffic forwarding, 225–226 incremental Dijkstra algorithm, 285–293
interaction between fast IGP convergence and NSF, 293–295 link failures, 220–221, 225 link state routing protocols, 207–213 load balancing, 259–262 lower layers failure notification, 222–223 LSA origination and flooding, 215, 229–237 multilayer recovery case studies, 469–476 node failures, 221–222, 225–226 nonstop forwarding (NSF) OSPF example, 266–270 overview, 278–279 protocols, 204–214 QoS during failure, 262–265 rerouting upon link failure (example), 217–220 research-related topics, 295 route computation, 237–252 routing table computation, 215–217 temporary loops during network state changes, 252–258 Internet Protocol/MultiProtocol Label Switching (IP/MPLS) IP/MPLS-over-OTN multilayer model, 2–3, 203 unidirectional connections in, 3 interworking. See escalation strategy intrusive supervision of subnetwork connections (SDH), 79–80 IONs (intelligent optical networks), 424, 476 IP layer failure scenarios, 17–18 in IP-over-OTN network, 6, 7 single-layer versus multilayer recovery and, 36–37
IP routing. See Internet Protocol (IP) routing IP/MPLS-over-OTN multilayer model, 2–3, 203 IP-over-OTN network IP layer in, 6, 7 multilayer recovery requirement in, 438–439 OTN layer in, 6, 7 IS-IS. See Intermediate System to Intermediate System (IS-IS) routing protocol iSPF (incremental SPF). See incremental Dijkstra algorithm (iSPF) ITU. See International Telecommunication Union (ITU)
J jitter, as criterion for recovery mechanisms, 27
L label switched router (LSR) configuration of TE LSP on head-end, 303, 311–312 head-end versus midpoint versus tail-end, 301 preemption, 305–306 latency, as criterion for recovery mechanisms, 27 Layer 1 (physical layer), 5, 16–17 Layer 2 (data link layer), 5 Layer 3. See network layer (Layer 3); optical network layer Layer 4 (transport layer), 5–6 Layer 5 (application layer), 6 layered network representation for IP-over-OTN network, 6, 7 for multitechnology networks, 6, 7 for OTNs, 139–142 reference models, 5 for SDH networks, 46–48, 55 for SONET networks, 57 TCP/IP protocol stack, 5–6
LCK (Locked) defect in OTNs, 153 LCs (link connections), 43 line failures in optical networks, availability and, 186–188 linear protection in SDH, 107–113 multiplex section protection (MSP), 107–108, 109, 113 overview, 113 path protection, 108–112, 113 link connections (LCs), 43 link cost decrease (good news), iSPF and, 288–291 link cost increase, iSPF and, 287–288 link disjoint or link diverse TE LSPs, 301 link failures in IP routing causes of, 220–221 detection, 222–224 failure characterization, 224 impact on traffic forwarding, 225 LSA origination and, 231 temporary loops from, 253–257 temporary loops from restored links, 257–258 link failures in MPLS TE Case Study 1 (UK network), 356–357 Case Study 2 (UK network with shared SRLGs), 360–361 Case Study 3 (complex US network), 365–368 differentiating from node failures, 349–353 impact on traffic forwarding, 353 Link State Advertisement (LSA) aspects of LSA flooding, 230 example of rerouting upon link failure, 217 flooding procedure defined, 215, 229 flooding procedure overview, 233–237
impact of origination on network, 232–233 inefficiencies in flooding, 230–231 LSA refresh, 231–232 opaque LSA, 267 origination process, 231–233 OSPF versus IS-IS routing protocol and, 212 parameters tuning, 233, 250–251 propagation delay, 233, 237 queuing delays, 233, 234–237 temporary loop duration and flooding, 255 time estimate for origination and flooding process, 237 link state databases (LSDBs), 211–212 Link State Packet (LSP), 212 link state routing protocols distance vector protocols versus, 212–213 hierarchical routing, 211–212 history, 207–210 IS-IS, 212 link state databases (LSDBs), 211–212 objective, 204 OSPF, 212 overview, 210–213 protocol data units (PDUs), 210–211 load balancing defined, 259 MPLS TE and, 334–335 per-packet versus predestination, 259–260 recovery upon network failure and, 261–262 symmetrical versus asymmetrical, 260–262 local protection for IP, researches on, 295 local protection in MPLS TE advantages and drawbacks, 345–346 backup path computation, 397–421
bandwidth sharing capability, 342–343 comparison of approaches, 332–333 defined, 310 Detour LSP merging, 319, 384–385 facility backup or bypass, 320–324, 342–343, 345, 379–382, 397–419 global protection compared to, 336–346 local defined, 317 merge point (MP), 317, 382 motivations for deploying, 329 multilayer recovery case studies, 465–469, 471–476 network design with full mesh of unconstrained TE LSPs, 329–330 network design with unconstrained one-hop TE LSPs, 330–332 NHOP backup tunnel, 316 NNHOP backup tunnel, 316–317 notification of tunnel locally repaired, 327–328 one-to-one backup or Detour LSP, 318–319, 343, 345, 382–384, 419–421 overview, 345–346 point of local repair (PLR), 316, 379–382 principles of recovery techniques, 317–318 protection defined, 318 recovery time, 336 revertive versus nonrevertive modes, 347–348 RSVP signaling extensions, 372–385 signaling extensions, 329 standardization, 371 state overhead and scalability, 336–340 TE LSP properties, 325–326 terminology, 316–317 local recovery, defined, 32–33
Locked (LCK) defect in OTNs, 153 LOF (Loss of Frame) defect in OTNs, 152 logical spare unprotected strategy, 450–454 LOM (Loss of Multiframe) defect in OTNs, 152 LOP layer. See lower order path (LOP) layer in SDH LOS-O (Loss of Signal Overhead) defect in OTNs, 152 LOS-P (Loss of Signal Payload) defect in OTNs, 152 Loss of Frame defect (dLOF) in SDH, 50, 51 Loss of Frame (LOF) defect in OTNs, 152 Loss of Multiframe defect (dLOM) in SDH, 50, 51 Loss of Multiframe (LOM) defect in OTNs, 152 Loss of Pointer defect (dLOP) in SDH, 50, 51 Loss of Signal defect (dLOS) in SDH, 51–52 Loss of Signal Overhead (LOS-O) defect in OTNs, 152 Loss of Signal Payload (LOS-P) defect in OTNs, 152 Loss of Tandem Connection (LTC) defect in OTNs, 152 low cost user reliability requirements, 18 lower layers failure notification in IP routing layer 2 notification versus hello-based detection, 223–224 overview, 222–223 lower order path (LOP) layer in SDH. See also Virtual Containers-n (VC-n) in SDH DXC (digital cross-connect), 70–74, 81
multiple LOPs carried in HOP layer, 47–48 overview, 46–47, 55 race conditions and AIS propagation, 72, 73, 74 LSA. See Link State Advertisement (LSA) LSDBs (link state databases), 211–212 LSP (Link State Packet), 212 LSPs, traffic engineering. See TE Label Switch Paths (TE LSPs) LSR. See label switched router (LSR) LTC (Loss of Tandem Connection) defect in OTNs, 152
M M:N protection, 32 management plane, 8 matrix connection (MC), 43 mean time between failures (MTBFs) in availability formula, 11, 185 for cable cuts, 12 defined, 11 for equipment failures, 13 optical line failures and, 186, 187 optical node failures and, 185 unprotected connections and, 188 mean time to repair (MTTR) in availability formula, 11, 185 for cable cuts, 13 defined, 11 for equipment failures, 13 optical line failures and, 186 optical node failures and, 185 unprotected connections and, 188 merge point (MP), 317, 382 mesh networks. See also mesh-based optical networks ring networks versus, 3–4, 35 single-layer recovery in, 35
mesh-based optical networks. See also mesh networks availability comparison between 1+1 protection in ring-based optical networks and, 192–193 availability comparison between protection and restoration schemes, 194–197 availability versus topology, 195 availability versus traffic type, 195–197 optical cross-connects (OXCs) in, 137–138 overview, 137–139 recovery mechanisms, 173–182 ring-based versus mesh-based recovery schemes, 182–185 meta-mesh recovery technique in optical networks, 199–200 MP (merge point), 317, 382 MPLS (Multi-Protocol Label Switching) in IP layer, 2–3. See also Generalized Multi-Protocol Label Switching (G-MPLS); Internet Protocol/MultiProtocol Label Switching (IP/MPLS) MPLS TE. See Multi-Protocol Label Switching traffic engineering (MPLS TE); Multi-Protocol Label Switching traffic engineering (MPLS TE) recovery techniques MS layer. See multiplex section (MS) layer in SDH MS-DP Rings. See Multiplex Section-Dedicated Protection Rings (MS-DP Rings) in SDH MSP. See multiplex section protection (MSP)
MS-SP Rings. See Multiplex Section-Shared Protection Rings (MS-SP Rings) in SDH MTBFs. See mean time between failures (MTBFs) MTTR. See mean time to repair (MTTR) multiautonomous systems networks, TE LSPs and, 328 multilayer networks. See Automatic Switched Optical Network (ASON); generic multilayer recovery approaches multilayer recovery case studies, 464–476 common pool strategy, 451–454 cost comparisons, 451–454 deciding which layers get recovery schemes, 439 double protection strategy, 449–450 dynamic, 457–463 generic framework for multilayer survivability, 464 integrated approach, 37, 449 logical spare unprotected strategy, 450–454 need for, 438–439 network operation complexity and, 455 overview, 36–37, 476–477 qualitative performance comparison, 456–457 revertive operation, 456 sequential approach, 37, 446–448 single-layer recovery schemes in multilayer networks, 440–444 static recovery schemes, 444–457 supporting spare resources for, 449–454
trade-off between rerouting time and network stability, 454–455 uncoordinated approach, 444–446 multiple rings. See ring interconnection in SDH multiple-link failures, 17–18 multiplex section (MS) layer in SDH defect detection times, 50–52 linear protection, 107–108, 113 overhead bytes, 56 overview, 47, 55 multiplex section protection (MSP) bidirectional linear protection, 107–108 MS-DP Rings, 82, 91–93 MS-SP Rings, 82, 83–91 OMS DPRings, 163–164 OMS SPRings, 164–166 OMS-versus OCh-based approach, 170–171 overview, 113 path protection versus, 110 shared versus dedicated approach in optical networks, 171–173 STM-1 linear protection, 108, 109 unidirectional linear protection, 108 Multiplex Section-Dedicated Protection Rings (MS-DP Rings) in SDH interconnection, 102–103 misconnections, 92–93 MS-SP Rings versus, 82 operation, 91–92 optical ring networks compared to, 163 overview, 105 spatial reuse prevented in, 92 as unidirectional line switched Rings (ULSR), 106
Multiplex Section-Shared Protection Rings (MS-SP Rings) in SDH. See also case study for SDH protection strategies APS protocol and, 83–86, 88 as bidirectional line switched Rings (BLSR), 106 drop and continue interconnection, 97–101 drop and continue interconnection with SNCP Rings, 101–102 in failure-free situation, 83 link failure and, 83–86 logical view, 88–89 misconnections, 89 as MS trail protection technique, 86 MS-DP Rings versus, 82 Non-preemptible Unprotected Traffic (NUT) support, 86 one-way delay on long path, 85–86 operation, 83–86 optical ring networks compared to, 163 overview, 105 span protection in four-fiber ring, 86–88 spare/protection capacity sharing between nonoverlapping connections, 89–91 spatial reuse feature, 82, 89–91 squelching mechanisms, 88–89 states of ring nodes, 83–86 two-fiber versus four-fiber configuration, 86–88 multiplexing. See also specific kinds of multiplexers byte-interleaved versus bit-interleaved, 45–46 STM-N ADM example, 52–54 Multi-Protocol Label Switching (MPLS) in IP layer, 2–3. See also Generalized Multi-Protocol Label
Switching (G-MPLS); Internet Protocol/Multi-Protocol Label Switching (IP/MPLS) Multi-Protocol Label Switching traffic engineering (MPLS TE). See also Multi-Protocol Label Switching traffic engineering (MPLS TE) recovery techniques; TE Label Switch Paths (TE LSPs) bandwidth optimization using, 306 classical fish problem, 298–301 components, 303–305 fast recovery using, 306–307 motivations for deploying, 306–307, 329 preemption in, 305–306 QoS guarantees, 306, 386–387, 388–392, 400–419 shared risk link group (SRLG) and, 301–303 terminology, 301–303 traffic engineering in data networks, 298–301 tunneling using TE Label Switch Paths (TE LSPs), 300–301 Multi-Protocol Label Switching traffic engineering (MPLS TE) recovery techniques 1+1 packet protection, 333–334 backup path computation, 385–421 case studies, 354–370 comparison of global and local protection, 336–346 extensions for point-to-multipoint LSPs, 422 failure profile and fault detection, 348–354 global default restoration, 310–314, 343–344, 346–347
global path protection, 310, 314–316, 344–345, 347 load balancing and, 334–335 local protection, 310, 316–333, 345–346, 347–348 MPLS TE refresher, 298–307 overview, 371–372 recovery cycle analysis, 307–310 research-related topics, 422 revertive versus nonrevertive modes, 346–348 RSVP signaling extensions for local protection, 372–385 standardization, 370–371 multitechnology networks, 6 multitopology routing, 238–241
N NCs (network connections), 44 neighbors of restarting routers, 269 network connections (NCs), 44 network element layer (NEL), 43 network elements (NEs) in SDH fault detection and propagation inside, 60–70 overview, 52–55, 56 network layer (Layer 3). See also optical network layer failure scenarios, 17–18 overview, 5 in SDH networks, 46–48, 55 in SONET networks, 57 network management interface for ASTN (NMI-A), 426 network management layer (NML), 43 Network Management System (NMS) abstraction levels or layers, 43 management aspects of, 42 restoration in SDH networks and, 113–115 in transmission networks, 42 network planes. See also control plane (CP) in ASONs control plane, 7–8
data or user plane, 6 illustrated, 7 management plane, 8 network reliability. See reliability NMI-A (network management interface for ASTN), 426 NML (network management layer), 43 NMS. See Network Management System (NMS) node disjoint or node diverse TE LSPs, 301 node failures in IP routing. See also route processor (RP) failure causes of, 221–222 impact on traffic forwarding, 225–226 LSA origination and, 231 planned, 221–222, 226 temporary loops from, 253–257 node failures in MPLS TE Case Study 2 (UK network with shared SRLGs), 361 Case Study 3 (complex US network), 368 differentiating from link failures, 349–353 impact on traffic forwarding, 353–354 planned, 354 nonintrusive supervision of subnetwork connections (SDH), 79 Non-preemptible Unprotected Traffic (NUT), 86 nonrevertive mode in MPLS TE recovery, 346–348 single-layer recovery mechanisms and, 36 nonstop forwarding (NSF) OSPF example backward compatibility, 269–270 entering graceful restart mode, 267–268
nonstop forwarding (continued) entering in helper mode, 269 exiting graceful restart mode, 268–269 grace period, 267 during graceful restart period, 268 interaction between fast IGP convergence and, 293–295 mode of operation of restarting router, 267–269 mode of operation of restarting router’s neighbors, 269 mode of operation overview, 267 overview, 266–267 restarting period defined, 266 NP complete problems, 284 NSF. See nonstop forwarding (NSF) OSPF example NTT (Japanese telephone company), 14–15 NUT (Non-preemptible Unprotected Traffic), 86
O OADMs. See Optical Add/Drop Multiplexers (OADMs) OCCs (optical connection controllers) in ASONs, 425–426 OCh layer. See optical channel (OCh) path layer of OTNs OCh OH (optical channel overhead), 150–151 OCI (Open Connection Indication) defect in OTNs, 153 ODU (optical channel data unit) layer of OTNs, 141 ODUk OH (optical channel data unit overhead), 145–148 OEO (optical-electrical-optical) OXC switches, 137 OIF (Optical Internetworking Forum), OTN work by, 139
OMS DPRings. See Optical Multiplex Section-Dedicated Protection Rings (OMS DPRings) OMS layer. See optical multiplex section (OMS) layer of OTNs OMS OH (optical multiplex section overhead) in optical networks, 151 OMS SPRings. See Optical Multiplex Section-Shared Protection Rings (OMS SPRings) 1+1 packet protection in MPLS TE, 333–334 1+1 protection dedicated with extra traffic, 32 1:1 protection dedicated, 31 1:N linear APS, 76–78 1:N protection, 32 one-to-one backup in MPLS TE backup tunnel path computation, 419–421 bandwidth sharing capability, 343 overview, 318–319, 345 RSVP signaling, 382–384 Open Connection Indication (OCI) defect in OTNs, 153 Open Shortest Path First (OSPF) routing protocol nonstop forwarding (NSF) example, 266–270 overview, 212 packet marking and, 234 shortest path computation triggers, 249 TE LSPs and, 328 OPeration EXpenditure (OPEX), IONs and reduction in, 424 Optical Add/Drop Multiplexers (OADMs) fixed, 135, 136 flexible, 135–136 OMS DPRings and, 164
OMS SPRings and, 165–166 optical nodes and, 135 recovery in ring-based optical networks and, 161, 164, 165–166, 170, 171 ring organization and, 135–137 optical channel data unit (ODU) layer of OTNs, 141 optical channel data unit overhead (ODUk OH), 145–148 optical channel (OCh) path layer of OTNs defined, 140 recovery mechanisms in ring-based optical networks, 161 optical channel overhead (OCh OH), 150–151 optical channel payload unit (OPU) layer of OTNs, 141 optical channel payload unit overhead (OPUk OH), 145 optical channel protection in ring-based optical networks mixed OCh DPRings and OCh SPRings, 170 multiplex section protection versus, 170–171 OCh DPRings, 166–169 OCh SPRings, 169–170 optical channel transport unit overhead (OTUk OH), 149–150 optical connection controllers (OCCs) in ASONs, 425–426 optical cross-connects (OXCs) in mesh-based optical networks, 137–138, 176–177 OEOEO opaque switches, 137 opaque or OEO OXC switches, 137 restoration schemes in mesh-based OTNs and, 178–179
SRLGs and, 302 transparent or OOO OXC switches, 137 wavelength routing (WR-OXC), 138, 176–177 wavelength translating (WT-OXC), 138, 177 Optical Internetworking Forum (OIF), OTN work by, 139 optical multiplex section (OMS) layer of OTNs defined, 140 link-based restoration schemes, 178 recovery mechanisms in ring-based optical networks, 161 optical multiplex section overhead (OMS OH), 151 Optical Multiplex Section-Dedicated Protection Rings (OMS DPRings) overview, 163–164 shared approach versus, 161, 171–173 Optical Multiplex Section-Shared Protection Rings (OMS SPRings) dedicated approach versus, 161, 171–173 overview, 164–166 optical network layer adding flexibility, 139 evolution, 132–139 mesh organization, 137–139 with optical nodes, 135 recovery schemes, 157–158 ring organization, 135–137 WDM in point-to-point, 132–134 optical networks. See also mesh-based optical networks; Optical Transport Networks (OTNs); ring-based optical networks adding flexibility, 139 availability and 1+1 protection in ring-based
versus mesh-based optical networks, 192–193 availability and protection versus restoration in mesh-based networks, 194–197 availability calculations, 185–192 defects, 152–153 evolution of the optical network layer, 132–139 fault detection and propagation, 144–157 maintenance signals and alarm suppression, 154–157 mesh organization, 137–139 optical nodes, 135 overhead, 145–152 overview, 200–201 recovery mechanisms in mesh-based networks, 173–182 recovery mechanisms in ring-based networks, 160–173 recovery schemes in the optical network layer, 157–160 research trends, 197–200 ring organization, 135–137 ring-based versus mesh-based recovery schemes, 182–185 WDM in point-to-point optical network layer, 132–134 optical nodes failures and availability, 185 overview, 135 optical physical section (OPS) layer of OTNs defined, 140 optical transport module and, 143–144 optical transmission section overhead (OTS OH), 151–152 optical transport module (OTM) frame structure of OTUk, 142 OPS layer and, 143–144
order of (maximum supported wavelength channels), 143 structure, 142–144 Optical Transport Networks (OTNs) architectural aspects and structure, 139–142 associated overhead, 143, 145–150 bottleneck at nodes overcome by, 2 extension toward G-MPLS, 3 maintenance signals and alarm suppression, 154–157 nonassociated overhead, 150–152 optical channel data unit (ODU) layer, 141 optical channel (OCh) path layer, 140 optical channel payload unit (OPU) layer, 141 optical multiplex section (OMS) layer, 140 optical physical section (OPS) layer, 140, 143–144 optical transmission section (OTS) layer, 139–140 optical transport module (OTM) structure, 142–144 overview, 139 SDH/SONET network evolution to, 39 standardization, 139, 144 standardization work on recovery, 158–159 traffic volumes for, 46 optical-electrical-optical (OEO) OXC switches, 137 OPU (optical channel payload unit) layer of OTNs, 141 OPUk OH (optical channel payload unit overhead), 145 OSPF. See Open Shortest Path First (OSPF) routing protocol
OTM. See optical transport module (OTM) OTN layer in IP-over-OTN network, 6, 7 single-layer versus multilayer recovery and, 36–37 OTNs. See Optical Transport Networks (OTNs) OTS OH (optical transmission section overhead), 151–152 OTUk OH (optical channel transport unit overhead), 149–150 outages or faults. See also failures; reliability defined, 10 detection and propagation inside NEs (SDH), 60–70 drastic or severe, 13–15 FCC reporting requirements, 13 information propagation through SDH network, 70–74 planned versus unplanned, 12 router power supply outage, 221, 225 overlay model for control plane, 433–434 OXCs. See optical cross-connects (OXCs)
P packet switching, 4 path protection in MPLS TE. See global path protection in MPLS TE path protection in SDH drop and continue mechanism in, 110 dual-network representation for disjoint paths, 112 end-to-end SNCP, 108–110 linear 1:N, 110–112 multiplex section protection (MSP) versus, 110
Path-Specific (PS) method of identifying signaled TE LSP, 379 payload, 6 Payload Mismatch defect (dPLM) in SDH, 50, 51 Payload Mismatch (PLM) defect in OTNs, 152 Payload Missing Indication (PMI) defect in OTNs, 153, 155–156 p-cycles, 197–199 PDH (Plesiochronous Digital Hierarchy), 45 PDUs (protocol data units), 210–211 peer model for control plane, 434–435 performance criteria for recovery mechanisms, 25–28 of Dijkstra algorithm, 248–249 of multilayer recovery strategies, 456–457 performance management, 42 per-packet load balancing, 259–260 physical layer (Layer 1), 5, 16–17 planes. See network planes planned node failure hitless upgrades and, 354 in IP routing, 221–222, 226 in MPLS TE, 354 Plesiochronous Digital Hierarchy (PDH), 45 PLM (Payload Mismatch) defect in OTNs, 152 PLR. See point of local repair (PLR) PMD (polarization mode dispersion), 132 PMI (Payload Missing Indication) defect in OTNs, 153, 155–156 point of local repair (PLR) behavior before failure, 379–381
behavior during failure, 381–382 defined, 316 polarization mode dispersion (PMD), 132 power supply failure facility failure, 221 impact on traffic forwarding, 225, 353 in IP routing, 221, 225 in MPLS TE, 353 node failure caused by, 221 router power supply outage, 221, 225 preplanned recovery paths dynamic recovery paths versus, 31 overview, 30 protection versus restoration and, 31 pre-session load balancing, 260 preventing failures, 20–21 primary failure, 10 primary path, recovery schemes and, 21, 22 propagation delay in LSA flooding, 233, 237 protection. See also specific kinds 1+1 (dedicated), 31 1:1 (dedicated with extra traffic), 32 1:N (shared recovery with extra traffic), 32 case study for SDH protection strategies, 115–127 M:N, 32 in mesh-based optical networks, 175–177, 180–182 in optical networks, 158 restoration compared to, 430 restoration versus, 31, 113–115, 158 protection rings. See ring protection in SDH protocol data units (PDUs), 210–211
PS (Path-Specific) method of identifying signaled TE LSP, 379
Q quality of service (QoS) algorithm complexity and, 265 backbone network profile considerations, 387–388 guarantee during failure in IP routing, 264–265 guarantee during failure in MPLS TE, 306, 386–387, 388–392, 400–419 link metric manipulation and, 265 in MPLS Diffserv-aware networks, 387–388 during non-steady state periods in MPLS TE, 388–392 overprovisioned networks and, 387, 397 queuing delays in LSA flooding and, 234–235 recovery schemes and, 22 in traffic engineered networks, 388 traffic engineering at steady state and, 262–264 queuing delays in LSA flooding congestion avoidance mechanisms, 236–237 packet marking, 234–235 packet scheduling, 235 QoS and, 234–235 queuing packets based on DSCP, 235–236 queuing process described, 233 random early detection (RED) and, 236
R random early detection (RED), 236 RDI. See remote defect indication (RDI) signal in SDH
recovery cycle criteria for performance, 25–28 fault detection time, 307–308 fault notification time, 308–309 with global default restoration in MPLS TE, 312–313 hold-off timer, 308 illustrated, 23 in IP routing, 214–220 overview, 23–24, 307–310 recovery operation time, 309 traffic recovery time, 309–310 recovery cycle in IP routing example of rerouting upon link failure, 217–220 fault detection and characterization, 214 fault notification time, 215 hold-off timer, 214–215 routing table computation, 215–217 recovery extent defined, 32 global versus local recovery, 32–34 recovery head-end (RHE) in APS subnetwork connection protection, 80 in APS trail protection, 77 in global recovery, 33 in local recovery, 32 local versus global recovery and, 33 recovery schemes and, 21 in SNCP Rings, 93 recovery mechanisms in mesh-based optical networks availability comparison between 1+1 protection in ring-based optical networks and, 192–193 availability comparison between protection and restoration schemes, 194–197 link-based recovery schemes, 174–175
link-based restoration schemes, 178 overview, 173–175 path-based recovery schemes, 174 p-cycles, 197–199 preplanned versus dynamic, 178–180 protection combined with restoration, 182 protection in WP versus VWP networks, 176–177 protection options, 175–176 protection versus restoration, 174, 180–181 restoration, 177–180 ring-based schemes versus, 182–185 shared restoration schemes, 178 recovery mechanisms in MPLS TE global default restoration, 310–314 global path protection, 310, 314–316 local protection, 310, 316–333 recovery mechanisms in ring-based optical networks availability comparison between 1+1 protection in mesh-based optical networks and, 192–193 dedicated versus shared schemes, 161, 171–173 interconnection of rings, 173 layer of implementation and, 161 mesh-based schemes versus, 182–185 meta-mesh recovery technique, 199–200 mixed OCh DPRings and OCh SPRings, 170 multiplex section protection, 163–166 OCh DPRings, 166–169 OCh SPRings, 169–170 OMS DPRings, 163–164
recovery mechanisms in ring-based optical networks (continued) OMS SPRings, 164–166 OMS-versus OCh-based approach, 170–171 optical channel protection, 166–170 overview, 160–162 SONET/SDH networks compared to, 163 two-fiber versus four-fiber configuration, 166, 167 unidirectional versus bidirectional rings and, 161 recovery operation time, 23–24, 309 recovery paths local versus global recovery and, 33–34 preplanned versus dynamic, 30–31 recovery schemes and, 21–22 in single-layer recovery mechanisms, 30–31 recovery schemes (basic principle), 21–22 recovery tail-end (RTE) in APS subnetwork connection protection, 80 in APS trail protection, 77 in global recovery, 33 in local recovery, 32 local versus global recovery and, 33 recovery schemes and, 21 in SNCP Rings, 93 recovery time as criterion for recovery mechanisms, 26 defined, 26 with global default restoration in MPLS TE, 313–314 with global path protection in MPLS TE, 316 local versus global protection in MPLS TE, 336
protection versus restoration and, 31 recovery token signal escalation strategy implementation, 448 in multilayer recovery, 38 RED (random early detection), 236 regenerator (SDH) cable cut upstream of, 61–62, 70, 72 distorted noise/signal entering, 62–63 incoming AU_AIS signal, 65, 67, 68 incoming MS_AIS signal, 65, 66 regenerator section (RS) layer in SDH defect detection times, 50–52 overview, 47, 55 reliability. See also failures; outages or faults definitions, 9–11 importance of, 8, 20 measures to increase, 20–22 overview, 8–22 requirements for services, 18–19 requirements for users, 18 SLA examples, 19–20 trend of requirements, 20 Remote Defect Indication defect (dRDI) in SDH, 50, 51 Remote Defect Indication (RDI) signal in SDH fault management processes and, 60 fault propagation and notification on network level, 70, 72 HOP_RDI signal, 72 MS_RDI signal, 64–65, 70, 72 RI_RDI (remote information –remote defect indication), 60 SSF signal triggering, 64–65
reordering of packets, as criterion for recovery mechanisms, 27 research-related topics flexible optical networks, 200 on IP routing, 295 meta-mesh recovery technique, 199–200 MPLS TE, 422 p-cycles, 197–199 trends in optical networking, 197–200 Resource Reservation Protocol (RSVP) FIS as RSVP Path Error message, 311 hello protocol extension, 348–349 reliable messaging mode, 308–309 scalability issues, 304 signaling extensions for MPLS TE local protection, 372–385 TE LSP setup, 304 Traffic Engineering extensions (RSVP-TE), 427 restarting router defined, 267 entering graceful restart mode, 267–268 exiting graceful restart mode, 268–269 during graceful restart period, 268 restoration in G-MPLS networks, 429–430 in mesh-based optical networks, 177–182 optical, multilayer recovery case study, 465–469 in optical networks, 158 protection compared to, 430 protection versus, 31, 113–115, 158 in SDH networks, 113–115 reversion cycle criteria for performance, 25–28
illustrated, 24 overview, 24–25 single-layer recovery mechanisms and, 36 reversion operation time, 24, 25 RHE. See recovery head-end (RHE) RIB. See routing information base (RIB) or routing table ring interconnection in mesh-based optical networks, 173 ring interconnection in SDH drop and continue interconnection of MS-SP and SNCP Rings, 101–102 drop and continue interconnection of MS-SP Rings, 97–101 drop and continue interconnection of SNCP Rings, 96–97 dual-gateway schemes, 94–95, 106 global versus local protection techniques, 95 MS-DP Rings, 102–103 node architectures for gateways, 104–105 overview, 93–95, 105–106 of stacked STM-N Rings, 103–104 virtual ring (VR) interconnection, 95–96 vulnerability of single-node interconnections, 94 ring networks. See also ring-based optical networks defined, 3 mesh networks versus, 3–4, 35 popularity of, 82 single-layer recovery in, 35, 36 SONET/SDH compared to optical, 136–137 as transmission networks, 41 ring protection in SDH. See also specific kinds
MS-DP Rings (Multiplex Section-Dedicated Protection Rings), 82, 91–93 MS-SP Rings (Multiplex Section-Shared Protection Rings), 82, 83–91 overview, 81–82, 105–106 ring interconnection, 93–105 SNCP Rings (Subnetwork Connection Protection Rings), 82, 93 in SONET versus SDH, 106–107 ring-based optical networks. See also ring networks availability comparison between 1+1 protection in mesh-based optical networks and, 192–193 interconnection of rings, 173 mesh-based versus ring-based recovery schemes, 182–185 mixed OCh DPRings and OCh SPRings, 170 multiplex section protection, 163–166 OADMs and, 135–137 OCh DPRings, 166–169 OCh SPRings, 169–170 OMS DPRings, 163–164 OMS SPRings, 164–166 OMS-versus OCh-based approach, 170–171 optical channel protection, 166–170 overview, 135–137 recovery mechanisms, 160–173 shared versus dedicated approach, 161, 171–173 SONET/SDH networks compared to, 136–137 RIP (Routing Information Protocol), 207 RIP-TRIG (Routing Information Protocol
with triggered update), 207 root failure, 10 Rosen, Eric, 287 route computation Dijkstra algorithm, 241–249 routing information base (RIB) update, 251–252 shortest path computation, 238–241 shortest path computation triggers, 249–251 route processor (RP) failure centralized versus distributed architectures and, 221, 225 impact on traffic forwarding, 225, 353–354 in IP routing, 221, 225 in MPLS TE, 353–354 Route Record Object (RRO) of RSVP, 376–377 route recursion, 252 router interface failure, 221 router power supply outage impact on traffic forwarding, 225 node failure caused by, 221 routing algorithms. See also Internet Protocol (IP) routing adaptive dynamic distributed algorithm in ARPANET, 207–210 complexity, 243, 265, 279–284 congestion avoidance mechanisms, 236–237 dampening algorithms, 226–229 Dijkstra algorithm for shortest path, 241–249 for IGP metric optimization, 264 incremental Dijkstra algorithm, 285–293 QoS during failure and algorithm complexity, 265 routing information base (RIB) or routing table
computation, 215–217 example of rerouting upon link failure, 217–220 populating, 217 route recursion, 252 shortest path tree (SPT) computation, 216–217 updating, 251–252 Routing Information Protocol (RIP), 207 Routing Information Protocol with triggered update (RIP-TRIG), 207 routing table. See routing information base (RIB) or routing table RP failure. See route processor (RP) failure RRO (Route Record Object) of RSVP, 376–377 RS layer. See regenerator section (RS) layer in SDH RSVP. See Resource Reservation Protocol (RSVP) RSVP signaling extensions for MPLS TE local protection detour merging, 384–385 DETOUR Object, 375–376 FAST-REROUTE Object, 374–375 identification of a signaled TE LSP, 378–379 Route Record Object (RRO), 376–377 SESSION-ATTRIBUTE Object, 372–374 signaling a protected TE LSP with a set of constraints, 378 signaling with facility backup, 379–382 signaling with one-to-one backup, 382–384 RTE. See recovery tail-end (RTE)
S safety critical user reliability requirements, 18 scalability as criterion for recovery mechanisms, 27–28 hello-based mechanisms in IP routing and, 223 local versus global protection in MPLS TE, 336–340 of RSVP, 304 scope of failure coverage as criterion for recovery mechanisms, 25–26 failure scenarios, 25 local versus global recovery and, 34 percentage of coverage, 25–26 SDEG (Signal Degrade) defect in OTNs, 152 SDH. See Synchronous Digital Hierarchy (SDH) secondary failures or symptoms, 10 security management, 42 self-healing ring mechanisms. See ring protection in SDH Sender-Template-Specific (STS) method of identifying signaled TE LSP, 379 sequential approach to multilayer recovery bottom-up escalation, 446–447 escalation strategy implementation, 448 overview, 37, 446 top-down escalation, 447–448 server signal fail (SSF) signal in SDH MS_TT_Sk function and, 64–65 A sink function and, 60 A source function and, 60 service management layer (SML), 43 service-level agreements (SLAs)
overview, 19–20 SML layer and, 43 services reliability requirements and types of, 18–19 SLA examples, 19–20 SESSION-ATTRIBUTE object (RSVP), 372–374 shared backup capacity, 29–30 shared recovery in one-to-N protection, 32 in ring-based optical networks, 161, 171–173 shared risk group (SRG) defined, 18 optical network recovery and, 159–160 shared risk link group (SRLG) defined, 18 LSA origination parameter tuning and, 250–251 MPLS TE and, 301–303 researches, 295 SRLG disjoint TE LSPs, 303 shortest path computation Dijkstra algorithm, 241–249 incremental Dijkstra algorithm, 285–293 multitopology routing and, 238–241 shortest path defined, 238 triggers, 249–251 shortest path tree (SPT) computation for routing tables, 216–217 Signal Degrade (SDEG) defect in OTNs, 152 signaling. See also specific signals extensions for local protection in MPLS TE, 329 fault detection and propagation inside NEs and (SDH), 60–70 fault management processes and (SDH), 58–60 OTN maintenance signals and alarm suppression, 154–157
protection versus restoration and, 31 recovery token signal, 38 requirements as criterion for recovery mechanisms, 28 SDH versus SONET, 57 STM-N signal (SDH), 47–48 STS-3N signal (SONET), 47 single point of failure, recovery schemes and, 22 single-layer recovery mechanisms backup capacity, dedicated versus shared, 29–30 centralized versus decentralized, 34–35 characteristics, 28–36 connection-oriented versus connectionless networks, 36 control of recovery process, 34–35 global versus local recovery, 32–34 in multilayer networks, 439–444 protection versus restoration in, 31–32 recovery paths, preplanned versus dynamic, 30–31 revertive versus nonrevertive mode, 36 ring versus mesh networks, 35–36 single-layer recovery schemes in multilayer networks overview, 439–440 survivability at the bottom layer, 440, 441 survivability at the highest possible layer, 443–444 survivability at the lowest detecting layer, 442–443 survivability at the top layer, 440, 442 single-link failures defined, 16 focus on, 17 global recovery, 33
local recovery, 32 single-node failures defined, 16–17 focus on, 17 global recovery, 33 local recovery, 32–33 sink functions, f2 filters in, 58, 60 SLAs. See service-level agreements (SLAs) SML (service management layer), 43 SNC (subnetwork connection), 44 SNCP. See subnetwork connection protection (SNCP) SNCP Rings. See Subnetwork Connection Protection Rings (SNCP Rings) software failures AT&T erroneous software update, 15 hitless upgrades and, 354 impact on traffic forwarding, 225–226, 354 in IP routing, 221, 225–226 in MPLS TE, 354 node failure caused by, 221 overview, 12 SONET. See Synchronous Optical NETwork (SONET) spatial reuse MS-SP Ring feature, 82, 89–91 prevented in MS-DP Rings, 92 prevented in SNCP Rings, 92 SPT (shortest path tree) computation for routing tables, 216–217 SRG. See shared risk group (SRG) SRLG. See shared risk link group (SRLG) SSF signal. See server signal fail (SSF) signal in SDH stability as criterion for recovery mechanisms, 28
dampening algorithms in IP routing and, 226 trade-off between rerouting time and network stability in multilayer recovery, 454–455 standardization ASON, 425 ASTNs, 425 MPLS TE recovery, 370–371 OTNs, 139, 144 SDH, 57 SONET, 57 work on OTN recovery, 158–159 state overhead, as criterion for recovery mechanisms, 27 static multilayer recovery schemes common pool strategy, 451–454 cost comparisons, 451–454 double protection strategy, 449–454 dynamic multilayer recovery versus, 457–458, 460, 462–463 escalation strategy defined, 444 integrated approach, 449 logical spare unprotected strategy, 450–454 network operation complexity and, 455 overview, 444 qualitative performance comparison, 456–457 revertive operation, 456 sequential approach, 446–448 supporting spare resources for multilayer recovery, 449–454 trade-off between rerouting time and network stability, 454–455 uncoordinated approach, 444–446 STM-N Rings, interconnection of stacked rings, 103–104
STM-N signal. See Synchronous Transport Module of order N (STM-N) signal in SDH STS (Sender-Template-Specific) method of identifying signaled TE LSP, 379 STS-3N (Synchronous Transport Signal of level 3N) signal, 47 sublayer supervision of subnetwork connections (SDH), 80 sublayer tandem connection monitoring with APS, 78–80 submarine cable break (APCN 2), 15 Subnetwork Connection Protection Rings (SNCP Rings). See also case study for SDH protection strategies drop and continue interconnection, 96–97 drop and continue interconnection with MS-SP Rings, 101–102 optical ring networks compared to, 163 overview, 82, 93, 105–106 as unidirectional or bidirectional path switched Rings (UPSR or BPSR), 106–107 subnetwork connection protection (SNCP) APS protocol and, 78–80 multiplex section protection, 108 path protection, 108–110 subnetwork connection (SNC), 44 survivability defined, 10 degree of, 10 single-layer recovery schemes in multilayer networks and, 440–444
switch-back operation defined, 24 reordering of packets from, 27 switch-over operation, 21 symmetrical load balancing, 260–262 symmetrical services, 3 symptoms or secondary failures, 10 Synchronous Digital Hierarchy (SDH). See also ring protection in SDH ADM (add/drop multiplexer), 52–54 APS protocol, 49, 74–80 base signal, 57 bidirectional connections in, 3 case study, 115–127 defect detection times, 50–52, 56 DXCs, 53–54, 70, 72–74, 82 evolution to OTNs, 39 fault detection and propagation inside NEs, 60–70 fault management hierarchy, 58, 59 fault management processes, 58–60 fault propagation and notification on network level, 70–74 frame structure, 48–52 interfaces, 56 linear protection, 107–113 MS-DP Rings, 82, 91–93 MS-SP Rings, 82, 83–91 multilayer recovery case study, 469–470 multiplex section protection (MSP), 107–108, 109, 113 network elements (NEs), 52–55, 56 network layers, 46–48, 55 operational aspects, 57–81 optical rings compared to SDH rings, 136–137 overhead bytes relevant for recovery, 48–52, 56
overview, 127–129 path protection, 108–112, 113 references and research-related topics, 129–130 restoration, 113–115 ring interconnection, 93–105 SNCP Rings, 82, 93 SONET compared to, 56–57, 106–107 standardization, 57 TM (terminal multiplexer), 53 VC-n, 47–48, 55 Synchronous Optical NETwork (SONET). See also Synchronous Digital Hierarchy (SDH) base signal, 57 bidirectional connections in, 3 evolution to OTNs, 39 multilayer recovery case study, 469–470 network layers, 57 optical rings compared to SONET rings, 136–137 references and research-related topics, 129–130 SDH compared to, 56–57, 106–107 standardization, 57 STS-3N signal, 47 Synchronous Transport Module of order N (STM-N) signal in SDH ADM example, 52–54 in multiplex section protection, 107–108 overview, 47 STM-1 frame format, 48–49 STM-1 tributary interface to client, 55 Synchronous Transport Signal of level 3N (STS-3N) signal (SONET), 47
T TCP/IP protocol stack, 5–6 TCPs (termination connection points), 44
TE Label Switch Paths (TE LSPs) affected, defined, 316 bandwidth protection desired, 325–326, 373, 377, 378, 380, 383 classes of recovery, 326 classical fish problem and, 300–301 configuration on head-end LSR, 303, 311–312 Detour LSP in MPLS TE, 318–319 extensions for point-to-multipoint LSPs, 422 in facility backup, 320–324, 345 fast-reroutable, 325 global default restoration and, 311, 312–313, 314, 343–344 global path protection and, 315–316, 344–345 identification of a signaled TE LSP, 378–379 label recording desired, 373, 379 link disjoint or link diverse, 301 local protection desired, 373, 378, 379–380, 383 multiarea (OSPF), or multilevel (IS-IS), or multiautonomous systems networks and, 328 network design with full mesh of unconstrained TE LSPs, 329–330, 332–333 network design with unconstrained one-hop TE LSPs, 330–333 node disjoint or node diverse, 301 node protection desired, 326, 373, 377, 380, 383 notification of tunnel locally repaired, 327–328 in one-to-one backup, 318–319, 345
packet forwarding, 305 path computation, 304 preemption, 305–306 properties in MPLS TE, 325–326 revertive versus nonrevertive modes and, 346–348 RSVP hello protocol extension and, 348–349 secondary, in global path protection, 315 setup, 304 signaling a protected TE LSP with a set of constraints, 378 soft preemption desired, 373 SRLG disjoint, 303 TE LSPs. See TE Label Switch Paths (TE LSPs) Telecommunications Management Network (TMN), 42 temporary loops administrative link cost increase and, 256–257 causes of, 252–253 duration and number of routers involved, 255–256 illustrated, 253, 254, 255, 256, 258 link or node failures and, 253–257 link-load increase from, 257 researches, 295 restored network elements and, 257–258 terminal multiplexer (TM) in SDH networks, 53 termination connection points (TCPs), 44 TID (Trace Identifier Mismatch) defect in OTNs, 152 time of failure, 10 TMN (Telecommunications Management Network), 42 top-down escalation, 447–448 topology of optical networks
availability versus, in mesh-based networks, 195 SRG and, 159–161 Trace Id Mismatch defect (dTIM) in SDH, 50, 51 Trace Identifier Mismatch (TID) defect in OTNs, 152 traffic availability versus traffic type in mesh-based optical networks, 195–197 data versus voice, 1–2 importance of reliability and, 20 increase in, 1 IP/MPLS-over-OTN multilayer model for large volumes, 2–3 optical technology and concentration of, 20 symmetrical versus asymmetrical, 3 unidirectional versus bidirectional, 3 WDM as solution for, 132–134 traffic engineering. See also Multi-Protocol Label Switching traffic engineering (MPLS TE) applicability of, 298 classical fish problem, 298–301 in data networks, 298–301 in non-MPLS networks, 298 at steady state, QoS and, 262–264 traffic forwarding in IP routing failure types and, 225–226 forwarding information base (FIB), 204 multitopology routing and, 240–241 nonstop forwarding (NSF) OSPF example, 266–270 traffic recovery time illustrated, 23 MPLS TE recovery mechanisms, 310 overview, 24, 309 in recovery cycle, 309–310
traffic reversions time, 24, 25 trail protection in SDH 1:N linear APS, 76–78 architecture for, 74–76 linear APS, 74–76 MS-SP Rings and, 86 overhead bytes and bits, 74–75 sink direction, 74–76 source direction, 76 trail signal fail (TSF) signal in SDH cable cut upstream of regenerator and, 61–62 TT sink function and, 60 trail termination (TT) functions defined, 44, 45 f2 filters in sink function (SDH), 58, 60 trail protection and (SDH), 73–75 transmission networks. See also Synchronous Digital Hierarchy (SDH); Synchronous Optical NETwork (SONET) atomic functions, 43, 45 illustrated, 41 management of, 42–43 overview, 40–42, 127 reference points, 45 structuring/modeling, 43–45 transport layer (Layer 4), 5–6 TSF signal. See trail signal fail (TSF) signal in SDH TT functions. See trail termination (TT) functions
U ULSR (unidirectional line switched Rings). See Multiplex Section-Dedicated Protection Rings (MS-DP Rings) in SDH unaccounted failures, 16 uncoordinated approach to multilayer recovery, 444–446 Unequipped VC defect (dUNEQ) in SDH, 50, 51 unidirectional (single-ended) operation in APS, 77 unidirectional connections in IP/MPLS, 3 unidirectional line switched Rings (ULSR). See Multiplex Section-Dedicated Protection Rings (MS-DP Rings) in SDH unidirectional linear protection in MSP, 108 unidirectional path switched Rings (UPSR). See Subnetwork Connection Protection Rings (SNCP Rings) unidirectional traffic, 3 UPSR (unidirectional path switched Rings). See Subnetwork Connection Protection Rings (SNCP Rings) up-state timer dampening algorithm, 227 user plane, 6 users reliability requirements and types of, 18 trend of reliability requirements, 20
V
Virtual Containers-n (VC-n) in SDH cable cut upstream of VC-4 cross-connection, 63–65 connection functions, 54–55 defect detection times, 50–52
late arrival of MS_AIS signal in VC-4 cross-connection, 68, 69 overhead bytes, 56 overview, 47–48, 55 time diagram for VC-12 cross-connected by DXC-4/1s, 72 time diagram for VC-3 cross-connected by DXC-4/3s, 73 virtual ring (VR) interconnection in SDH, 95–96 virtual wavelength path (VWP) optical networks defined, 137 protection in WP networks versus, 176–177 WT-OXCs and, 137, 177 voice traffic, data traffic volume versus, 1–2 VR (virtual ring) interconnection in SDH, 95–96 VWP optical network. See virtual wavelength path (VWP) optical networks
W Wavelength Division Multiplexing (WDM) bandwidth capacity increased by, 2 Coarse (CWDM), 133 Dense (DWDM), 133 Erbium-doped fiber amplifiers and, 133 interconnection of stacked STM-N Rings and, 103 overview, 132–134 in point-to-point optical network, 132–134 wavelength path (WP) optical networks defined, 137 protection in VWP networks versus, 176–177 WR-OXCs and, 137, 176–177
wavelength routing optical cross-connects (WR-OXCs), 138, 176–177 wavelength translating optical cross-connects (WT-OXCs), 138, 177 WDM. See Wavelength Division Multiplexing (WDM)
weighted random early detection (WRED), 236 working path, recovery schemes and, 21, 22 WP optical networks. See wavelength path (WP) optical networks
WRED (weighted random early detection), 236 WR-OXCs (wavelength routing optical cross-connects), 138, 176–177 WT-OXCs (wavelength translating optical cross-connects), 138, 177