Table of Contents

Clustering Windows Servers: A Road Map for Enterprise Solutions
Preface
  About this book
  Why a cluster?
  Why this book?
    What's in this book?
    Book organization
    Who's this book for?
    Research methodology
  Copyrights, trademarks, and service marks
Chapter 1: Understanding Clusters and Your Needs
  1.1 Writing a Request for Proposal (RFP) for a cluster that will succeed
  1.2 When is a cluster not a cluster?
    1.2.1 Availability
    1.2.2 Scalability
    1.2.3 Reliability
    1.2.4 Manageability
    1.2.5 Single-system image
  1.3 Subsystems
  1.4 Cluster attributes
    1.4.1 User recovery
    1.4.2 Administrative recovery
  1.5 Design goals
Chapter 2: Crystallizing Your Needs for a Cluster
  2.1 Introduction
  2.2 Acceptable availability
  2.3 Acceptable scalability
    2.3.1 Scalable
    2.3.2 Downtime
  2.4 Acceptable reliability
    2.4.1 Server failover shared disk
    2.4.2 Server failover non-shared disk
    2.4.3 Storage failover
    2.4.4 Interconnect failover
  2.5 Cluster attributes
  2.6 Summary
Chapter 3: Mechanisms of Clustering
  3.1 Introduction
  3.2 Cluster membership
  3.3 States and transition
  3.4 Cluster tasks or resources
    3.4.1 Cluster alias
    3.4.2 Cluster address
    3.4.3 Disk resource
    3.4.4 Cluster service or application
    3.4.5 Other resources
  3.5 Lockstep mirroring
  3.6 Replication
  3.7 Shared disk and shared nothing disk
  3.8 SAN versus NAS
  3.9 Summary
Chapter 4: Cluster System Classification Matrix
  4.1 Introduction
  4.2 Cluster review
  4.3 Classes
    4.3.1 Cluster plus
    4.3.2 Cluster
    4.3.3 Cluster lite
    4.3.4 Attributes
  4.4 Cluster or component or attribute?
  4.5 Cluster products
    4.5.1 Marathon Technologies
    4.5.2 Microsoft Cluster Service (MSCS)
    4.5.3 Compaq cluster software
    4.5.4 Veritas software
    4.5.5 Legato software
    4.5.6 Other considerations
  4.6 Summary
Chapter 5: Cluster Systems Architecture
  5.1 Introduction
  5.2 Cluster terminology
    5.2.1 Cluster nodes or cluster members
    5.2.2 Active cluster member
    5.2.3 Cluster resources
    5.2.4 Resource groups
    5.2.5 Dependency tree
    5.2.6 Cluster interconnect
  5.3 Cluster models
    5.3.1 Active/standby cluster with mirrored data
    5.3.2 Active/passive cluster with mirrored data
    5.3.3 Active/active cluster with shared disk
    5.3.4 Active/active cluster with shared files
  5.4 Microsoft's Cluster Server architecture
    5.4.1 Cluster Service
    5.4.2 Resource Monitor
    5.4.3 Resource DLL
    5.4.4 Failover Manager
    5.4.5 Resource Groups
    5.4.6 Node Manager
    5.4.7 Configuration Database Manager
    5.4.8 Global Update Manager
    5.4.9 Event Processor
    5.4.10 Communications Manager
    5.4.11 Log Manager
    5.4.12 Cluster time service
  5.5 Quorum Resource
  5.6 Cluster failover architecture
    5.6.1 Administrative failover
    5.6.2 Recovery failover
    5.6.3 Cluster failback
    5.6.4 Planning for a cluster failover
    5.6.5 Failover policies
Chapter 6: I/O Subsystem Design
  6.1 I/O subsystems and capacity planning for clusters
  6.2 I/O load model
  6.3 Data processing capacity model for a cluster
    6.3.1 Processor
    6.3.2 Memory bandwidth
    6.3.3 Memory operation rate
    6.3.4 I/O bandwidth
    6.3.5 Main I/O bus
    6.3.6 AGP video bus
    6.3.7 I/O operation per second rate (IOPS)
  6.4 Well-engineered storage systems
  6.5 The future of system bus technology
  6.6 Rules of thumb for cluster capacity
Chapter 7: Cluster Interconnect Technologies
  Overview
  7.1 What is a cluster communication interconnect?
  7.2 Comparison of the technologies used to interconnect systems
    7.2.1 Bus functionality
    7.2.2 LAN functionality
  7.3 VIA cluster interconnect software standard
    7.3.1 Why VIA?
  7.4 Winsock Direct technology
  7.5 SCSI technology for NT clusters
    7.5.1 SCSI standards
    7.5.2 SCSI device ID numbers
    7.5.3 Single-ended vs. differential SCSI bus
    7.5.4 SCSI differential bus
    7.5.5 LVD vs. HVD SCSI technology
    7.5.6 The SCSI "T" connector
    7.5.7 SCSI component quality
    7.5.8 Supporting larger SCSI disk farms
Chapter 8: Cluster Networking
  8.1 LAN technology in a cluster: the critical link
  8.2 The enterprise connection
  8.3 Connection and cost
  8.4 Cluster intercommunications
  8.5 LAN vs. SAN
  8.6 Network transports
    8.6.1 IP single point of failure
    8.6.2 Single protocols vs. multiple network protocols
    8.6.3 Transport redundancy
    8.6.4 Compaq's Advanced Server transport redundancy
  8.7 Change control on routers
  8.8 Fault isolation
  8.9 Cluster computer name
    8.9.1 How the cluster alias is used
  8.10 Cluster Service's use of IP mobility
  8.11 IP addresses required for virtual servers
  8.12 Load balancing
    8.12.1 IP load-balancing solutions
    8.12.2 Windows Load Balancing Service
    8.12.3 HyperFlow
  8.13 Redundant network hardware
    8.13.1 Multiple NICs
    8.13.2 Multiple NICs and load balancing
  8.14 Environmental considerations for network equipment
    8.14.1 Power
    8.14.2 Air conditioning
  8.15 Change control
Chapter 9: Cluster System Administration
  Overview
  9.1 The importance of cluster administration
  9.2 Building a high-availability foundation
    9.2.1 Cluster hardware certification
  9.3 Cluster implementation options
    9.3.1 Preconfigured systems
    9.3.2 Cluster upgrade kits
    9.3.3 The build-your-own approach
  9.4 Installation, test, and burn-in
    9.4.1 Documenting your cluster system
    9.4.2 Why document your system?
    9.4.3 Hardware diagnostic procedures for a cluster
    9.4.4 Remote system management
    9.4.5 Verifying cluster hardware capacity
  9.5 Planning system capacity in a cluster
    9.5.1 Symmetric multiprocessing (SMP) for scalability
  9.6 Administering applications in a clustered environment
    9.6.1 Identifying cluster-aware applications
    9.6.2 Licensing applications in a cluster
  9.7 Administering cluster failover groups
    9.7.1 Determining a preferred node for a group
    9.7.2 Determining resource dependencies in a group: Cluster resources
  9.8 Administering virtual servers
    9.8.1 Cluster alias name
    9.8.2 IP addresses
  9.9 Managing cluster failover events
    9.9.1 The impact of failover on server applications
    9.9.2 The impact of failover on end users
Chapter 10: Achieving Data Center Reliability with Windows NT/2000 Clustering
  10.1 Total system design approach to high availability
  10.2 Identifying the cause of downtime
  10.3 Quality hardware
    10.3.1 Selecting high-quality hardware
    10.3.2 Selecting a vendor
    10.3.3 Dealing with commodity hardware
    10.3.4 Why is MSCS certification important to you?
  10.4 Datacenter facilities
    10.4.1 Reliable power
    10.4.2 Backup power supplies
    10.4.3 Temperature and humidity controls
    10.4.4 Cleanliness
    10.4.5 Backup procedures and issues
    10.4.6 Hardware and software service contracts
    10.4.7 Hardware and software service support contracts
    10.4.8 Spare parts
  10.5 Disaster recovery plans
    10.5.1 System maintenance plan
    10.5.2 Maintenance checklist
    10.5.3 Test plan
    10.5.4 Simulated failures
  10.6 System design and deployment plan
    10.6.1 Vendor "value-added" approach
Glossary
  A-C
  D-H
  I-P
  Q-S
  T-Y
References
  Vendors
  Books
  Articles, Papers, and Presentations
  Trade associations
List of Figures
  Preface
  Chapter 1: Understanding Clusters and Your Needs
  Chapter 2: Crystallizing Your Needs for a Cluster
  Chapter 3: Mechanisms of Clustering
  Chapter 4: Cluster System Classification Matrix
  Chapter 5: Cluster Systems Architecture
Table of Contents List of Figures Chapter 6: I/O Subsystem Design.......................................................................................................197 Chapter 7: Cluster Interconnect Technologies....................................................................................197 Chapter 8: Cluster Networking...........................................................................................................198 Chapter 9: Cluster System Administration.........................................................................................198 Chapter 10: Achieving Data Center Reliability with Windows NT/2000 Clustering........................198 List of Tables...................................................................................................................................................199 Preface.................................................................................................................................................199 Chapter 2: Crystallizing Your Needs for a Cluster.............................................................................199 Chapter 4: Cluster System Classification Matrix...............................................................................199 Chapter 5: Cluster Systems Architecture............................................................................................199 Chapter 6: I/O Subsystem Design.......................................................................................................199 Chapter 7: Cluster Interconnect Technologies....................................................................................199 Chapter 8: Cluster Networking...........................................................................................................199 Chapter 10: Achieving Data Center Reliability with Windows NT/2000 Clustering........................200
Clustering Windows Servers: A Road Map for Enterprise Solutions

Gary Mauler
Milton Beebe

Digital Press
An imprint of Butterworth-Heinemann
Boston Oxford Auckland Johannesburg Melbourne New Delhi

Copyright © 2002 Butterworth-Heinemann. A member of the Reed Elsevier group. All rights reserved.

Digital Press is an imprint of Butterworth-Heinemann.

All trademarks found herein are property of their respective owners.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

Recognizing the importance of preserving what has been written, Butterworth-Heinemann prints its books on acid-free paper whenever possible.

Library of Congress Cataloging-in-Publication Data

British Library Cataloging-in-Publication Data
A catalogue record for this book is available from the British Library.

The publisher offers special discounts on bulk orders of this book. For information, please contact: Manager of Special Sales, Butterworth-Heinemann, 225 Wildwood Avenue, Woburn, MA 01801-2041. Tel: 781-904-2500. Fax: 781-904-2620.

For information on all Butterworth-Heinemann publications available, contact our World Wide Web home page at: http://www.bh.com.

Printed in the United States of America

I would like to dedicate this book to my family:
My wife Valerie, who has been patient and supportive of me during the long hours I spent researching and writing this book. My patient sons, Robert and Steven, who will no longer have to hear, "...as soon as Daddy is finished with the book..." My parents, Robert and Mary Mauler, and also Georgie Mauler, Alice Blum, MaryLou Bartrum, and Gertie. R.G.M.

For my best blessings, Andrea and Matthias. M.D.B.

Acknowledgments

We would like to thank everyone who has helped us in one way or another with our book. Many people have been kind enough to offer their wisdom and insights as we conducted the research needed to write this book. To all of you who have helped us along the way, we want to say thank you. A few people deserve special thanks for going the extra mile in helping us bring this book to print: Scott Barielle, IBM Global Services; Jim Emanuel, Northrop Grumman Corporation; Jim Wolfe, Northrop Grumman Corporation; Marty Adkins, Mentor Technologies Group, Inc.; Brad Cooper, Bancu Technology, Inc.; Greg Forster, Independent Consultant and friend; Dr. Jim Gray, Microsoft Corporation; and Mark Woods, Microsoft Corporation.
Preface

About this book

During the final decade of the twentieth century, Microsoft achieved historic levels of marketing success in the computing business. Microsoft offered an alternative so economically appealing to the computing industry that resistance seemed futile in all but a few cases. Two questions remain. First, "Is there a solution to the downtime and business interruption often associated with Microsoft Server platforms?" This question alone is compelling for those computing environments in which the Microsoft solution is not considered acceptable. The second question is "How can this solution cost-effectively increase the capacity of our data processing resources?" In other words, how can the Microsoft solution grow with future computing needs? Clustering provides an answer to both of these questions.

For a moment let us turn to a story told by the late, great Rear Admiral Grace Hopper. Her words, in what we simply call the oxen story, provide an illuminating parallel to the computing system dilemma that we face. The story expressed her vision about the future of computing; Rear Admiral Hopper was, in effect, predicting the future in which we are now living. She has our deepest respect as an inventor and visionary in the field of computing. This story of hers really says it all.

"When we got our first computers, we got a great big computer. We were encouraged to get a mainframe. And we took our data and we set it through a process. And the process consisted of hardware, software, communications, and people. Hopefully the output product was information. Since this was a system, hopefully it was under some form of control and there was a feedback loop from the information to the control to improve the quality of the information. We got a great big computer, and we poured all of our data into it. Well, pretty soon it got overloaded and what did we do?
We said that we needed a bigger computer, and that was where we made the first step in the wrong direction."

"I like to use a story from the past. Back in the early days of this country they did not have any Caterpillar tractors, they did not have any big cranes. When they moved heavy objects around they used oxen. And when they got a great big log on the ground and one ox could not budge the darn thing, they did not try to grow a bigger ox! They used two oxen! And I think they are trying to tell us something. And that is, when we need greater computer power, the answer is not to get a bigger computer; it's get another computer. And we should have recognized it long ago."

"The answer is, to do the problems of the future we will need systems of computers, clusters. Not one great big computer with a single path through it all hampered by an operating system. We need to look to systems of computers, and that's what we will build in the future."[1]

The correct approach, then, is to use systems of computers instead of one great big computer with a single path hampered by an operating system.

Well, the future is now, and in this book we are going to do our best to help our readers get the technical knowledge they need to make the best decisions about how to effectively deploy clusters at their companies.

It is quite evident, when one looks at the market share reports of what companies are deploying for their mission-critical server platforms, that there is a rapidly growing demand for deploying and supporting Microsoft Servers. We are left then with a thorny problem. These NT/2000 Server platforms are generally not as available as minicomputers and mainframes. This is not to say that Microsoft is necessarily to blame. The blame can and should be equally spread over every software developer that writes code that is run on the Windows NT/2000 Server platform, ranging from device drivers to database servers. You see, any code that is running on your server could be the culprit. When you think of the code that gets installed on a server these days and where it comes from, it gets a little scary. The question that our book attempts to answer is how you can best protect your company from the many potholes along the road of support that threaten your information systems' availability and reliability.

Windows NT/2000 Server is a maturing operating system. To put the reasons for lower inherent availability into perspective, consider the version numbers of operating systems such as commercial UNIX, OpenVMS, OS/400, or MVS. The release numbers for these are assigned by engineering and are a true gauge of the maturity of the operating system. Consider the marketing strategy Microsoft used in the first release of NT. The first version was named NT 3.1. Would you buy a new operating system, version 1.0? In fairness, Microsoft historians would argue that "3.1" was used to differentiate it from the existing Windows platform, Windows 3.0. But suppose that, to level the field and establish a reference base, we construct a table (see Table P.1) to compare the version number assignments of Windows NT/2000 and its service packs (SPs).
Table P.1: Versions of NT

Windows product   Marketing (actual product) MS version   Engineering development (major.minor change)
NTAS              3.1                                     1.0
NTAS + (SPs)      3.1                                     1.1
NTAS              3.5                                     1.2
NTAS + (SPs)      3.5                                     1.3
NT                4.0                                     2.0
NT + SP1-3        4.0 SP1-3                               2.1
NT + SP4-6        4.0 SP4-6                               2.2
2000              2000                                    3.0
2000 + SP         2000                                    3.1

The column labeled "major.minor change" should help you put into perspective exactly where Microsoft's NT/2000 O.S. stands in terms of maturity. Those readers who have been in this business as long as we have can surely relate to this when they recall where they were when working with their favorite operating system, be it VMS, UNIX, or MVS, when it was only at revision 3.

[1] R.A. Grace Hopper, Lecturer at Westinghouse Corporation, Maryland (1986)
Why a cluster?

In the authors' experience, many clients with large production networks request advice and direction regarding clusters. The initial question we ask of the client is, "Why do you need a cluster?" The most succinct answer received has been "I need an NT Server that doesn't go down so much." Others say, "I need a server that won't lock up and require a reboot." Restated, what most sites require is a Microsoft server platform that will not interrupt business in normal day-to-day operation. In two words, today's business needs require availability and reliability as provided by cluster system operation.
When our single-system processor is not capable of handling the server load, what decision should be made? Our 66 MHz system processors have given way to 266 MHz, 600 MHz, and now GHz processing speeds. Grace Hopper's oxen story advises us to reflect on the experience of our past. When one system isn't capable of the load, try adding another system.

Another reason cluster vs. non-cluster could enter your decision-making process is benchmark speed. The independent test results of the Transaction Processing Performance Council (TPC) show again and again that cluster systems reign supreme in the cost/performance arena. TPC, which can be reached at http://www.tpc.org, provides results of a standard battery of tests that satisfy the ACID (atomic, consistent, isolated, and durable) requirements. The published test results are from corporations that submitted the best of their best systems. System cost, while important in consideration of the cost per transaction, is no criterion here. The systems submitted to the TPC tests typically range from $5 to $20 million. As one peruses the systems of TPC's famous "top 10," one might look on it as a report on drag racing for computers.

The graphs in Figure P.1 show an interesting phenomenon that occurred from the last part of 1994 through 1995. There was a huge decline in cost per transaction from 1995 to 1996, while the actual transactions per minute for both clustered and non-clustered systems remained somewhat steady. This shows that the technology was relatively stagnant, while the cost to produce computer hardware experienced a marked decline.
Figure P.1: TPC-C benchmarks (Source: Zona Research).
If you do visit the TPC Web site, you may find, as in drag racing, an impressive performance. And, just as in the traditional drag race, you may hear the "whine" of the processors and the "scream" of the disk drives as they leap in unison for the coveted first place. Sanity slowly returns, the words "practical" and "affordable for your business needs" come to mind, and you seek further counsel, like that provided by this book!

Researching what products or technologies are available within the spectrum of your needs may present another problem. You may find the offerings too numerous and sometimes even ill defined. This book was written to provide decision-makers with a roadmap to success in their search.

This book sets out to establish how and why cluster system operation provides high availability and reliability. It attempts to cover the available and possible clustering technologies and offerings. We hope that this volume will help you understand how your high-availability needs can be met by cluster systems. This book defines and categorizes the types of technologies and product offerings. A matrix of offerings allows readers to choose the cluster system that meets their needs. The objective of this book is to help in narrowing your search to only the technologies and products that will allow you to succeed.
Why this book?

The technical professional has to deal with an interesting dichotomy. On the one hand, there is the lure of a very popular enterprise server operating system whose presence is expanding. On the other hand, there is the challenge of being responsible for a platform that is not as available as the "glass house" or legacy operating system it is replacing. We set out to write this book to help you with this dilemma. This book is a guide to the different technologies and procedures that provide higher availability, scalability, and reliability. During the course of the book, we provide definitions and fundamentals and present materials from more than 30 years of experience with various operating systems.

What's in this book?

This book is meant to provide a reference. The background and fundamentals of cluster systems are presented in a manner that clearly defines what a cluster is and what a cluster is not. The book provides a decision-maker's roadmap. Its entire focus is to allow readers to succeed in evaluating their systems' availability needs and then match these needs to appropriate technologies and products. Products and technologies are categorized by their capabilities and limits. The book is concerned with getting the correct technology into your shop.

Clustering technology is going to be very expensive no matter how you look at it. This book will help you identify the requirements, prioritize them, identify and select the appropriate technology, and thereby develop an implementation plan that allows you to deploy a high-availability solution. We feel confident that the design and implementation that you will come up with after reading this book will be both cost-effective and fully capable of meeting the needs of your organization.

Some books in this field contain a degree of fluff. Others are focused on a single or a few vendors' product offerings. Still others are closer to manuals or documentation on the procedures for setting up and configuring specific vendor products. This book is different in that it is not biased toward any vendor's product(s). It does not try to teach you the configuration screens or utilities of a specific cluster offering. We try only to clarify your needs and the available technologies so that you can match them and succeed.
Book organization

The organization of this volume follows the thinking process of a decision-maker researching high-availability solutions. It starts with an introduction and needs analysis, and moves on from there to a classification of high-availability and cluster technologies. Detailed sections follow that explain the relevant low-level technologies such as SCSI, Fibre Channel, and cluster interconnects. Vendor offerings are then cataloged and classified according to the Cluster Classification Matrix.

Who's this book for?

The information in this book will be of value to technical professionals. It is of the most importance to CIOs and MIS and IT managers. It should also be useful to those who are responsible for tracking technology directions and planning implementation strategies. Senior IT staffs and operations managers will find it invaluable.

Research methodology

Both Gary Mauler and Milt Beebe have been working with clusters since the 1980s. Their understanding of the internals and components formed the fundamental structure of how clustering technologies and products were researched for this book. In other words, knowing how it is built and done was used to look at how others are building and doing. Extensive research was done in preparation for writing this book; vendors' technical and engineering staffs were contacted directly for information. Technical presentations were attended. White papers were read and digested. The Internet was searched repeatedly. Internal and client cluster systems were dissected and documented. Periodicals were combed. Contacts with vendors were reestablished and refreshed prior to final submission.
Copyrights, trademarks, and service marks

ActiveX, Visual Basic, BackOffice, BackOffice logo, Microsoft, Windows, Windows 2000, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

Tandem, Himalaya, Integrity, NonStop, Object Relational Data Mining, ServerNet, and the Tandem logo are trademarks or registered trademarks of Tandem Computers Incorporated in the United States and/or other countries.

• Alpha AXP is a trademark of Compaq Computer Corporation.
• Intel is a registered trademark of Intel Corporation.
• IBM is a registered trademark of International Business Machines Corporation.
• PowerPC is a trademark of International Business Machines Corporation.
• MIPS is a registered trademark of MIPS Computer Systems, Inc.
• VERITAS and VERITAS FirstWatch are registered trademarks of VERITAS Software Corporation.

All other marks are trademarks or registered trademarks of their respective owners.
Chapter 1: Understanding Clusters and Your Needs

1.1 Writing a Request for Proposal (RFP) for a cluster that will succeed

Picture yourself in the early 1980s. You are assigned the task of designing a new computing system. Your guidelines are slim at best. The situation can only be described as: "I don't know what I want, but I will know it when I see it." The only defense against this type of statement is to sit down and write what is called a Request for Proposal. The Request for Proposal would possibly look like:

I don't know what I want in a computer system, but it should provide at least the following characteristics:

• Availability
• Reliability
• Scalability

The design of the cluster computer system evolved as an answer to such a Request for Proposal. The term "cluster" as it applies to the computer industry was popularized by Digital Equipment Corporation in early 1983 with VMS version 3.7. Two VAX 11/750s maintained cluster communication at the rate of 70 million bits/sec or 8.75 Mbytes/sec.

During the past 17 years, many ideas of what the term cluster should mean have been set forth. When the cluster system was first introduced, the selling point was not the term "cluster." Nobody knew what the term meant. But people did "know what they would want, if they saw it." Therefore, the selling points were availability, reliability, and scalability, all of which the cluster system would provide. The term "cluster," over the years of development, evolved to become synonymous with these characteristics.

Unfortunately, these same characteristics of a cluster have become commonplace and are often used interchangeably with the term cluster! Vendors have used the term "high availability" and declared this as their cluster implementation, when, in fact, the system does not provide all the characteristics that a cluster, as originally defined, was meant to provide.
The point is, just because a particular configuration of hardware and software provides a characteristic of a cluster, it is not necessarily a cluster. This brings up an interesting question: When is a cluster not a cluster? To answer this we need to state firmly what a cluster is, as it was originally defined.
1.2 When is a cluster not a cluster?

A computer cluster is a system of two or more independent computer systems and storage subsystems intercommunicating for the purpose of sharing and accessing resources. This is a paraphrase of the VAXcluster definition found in Roy Davis's book VAXcluster Principles, which states, "A VAX cluster system is a highly integrated but loosely coupled multiprocessor configuration of VAX systems and storage subsystems communicating with one another for the purpose of sharing data and other resources."

This definition clearly states what a "cluster" should have. When a cluster is constructed based on this simple definition, the resulting entity will have some very desirable characteristics:

• Availability
• Scalability
• Reliability
• Manageability
• Single-system image

Some manufacturers have actually used one or more but not all of these characteristics as testimony of their product's ability to be considered a cluster. For example, "Our system meets the cluster standard of high availability and reliability." This makes about as much sense as saying, "Zebras have stripes; therefore, an animal with stripes is a zebra." Of course, hyenas have stripes as well, and the analogy is just as ridiculous.

A "cluster" is not a cluster when the system described does not adhere to the minimum definition of what a cluster should be. Let's say it one more time, before we move on. "A cluster consists of two or more independent computer systems and storage subsystems intercommunicating for the purpose of sharing and accessing resources." A system that does meet the definition of a cluster offers the characteristics listed above. Let's define these characteristics and provide some examples.
1.2.1 Availability

Availability is the quality of the system's response to a user or process request. Consider this scenario. You walk into a good restaurant on a Saturday night (without a reservation) and ask for a table, and you get "right this way" for a response. This is an example of being highly available. The term "highly available" alludes to an instantaneous response (availability) to a request. The reality of the restaurant scenario, however, especially on a Saturday night, is a wait of at least 15 to 20 minutes.

The reality of a single-server system's availability can be exasperating. Suppose your office has a single server with network applications. You've got a deadline, and you need one more file. It's time for Murphy's law to strike. You've got the connection, the file is selected, and just before you get the transfer, the network hangs. Why? It could be pollen in the air, disk I/O bottlenecks, server capacity limits, or many other things. Nothing short of a complete power outage has an impact on day-to-day operations like this scenario, and, regrettably, it happens far too often in single-server situations.

So how do you approach this problem? There are a number of ways to address availability. The cluster provides a system configuration that maintains user-perceived availability during subsystem downtime. Some computer system designers use redundancy in their attempts to provide availability. The amount of redundancy used is usually in direct proportion to their level of paranoia. Common examples of redundancy include redundant servers, redundant networks, and redundant storage subsystems. Redundant servers are, in fact, what some people think of when you utter the word "cluster." Let's defer discussion of this point for now to Chapter 5, "Cluster Systems Architecture." Redundant networks can be expensive, but they are necessary when network downtime is not tolerable.
To be sure, big business and money transactions require stable, available computer systems, but there are computer-controlled industrial process control implementations as well. Consider a steel-producing plant involved in what is termed the "caster" portion of production. These plants turn out batches of 5,000 tons (that's 10,000,000 pounds) of liquid steel per "ladle." The network and computer operation used here is not something that can tolerate a lot of downtime. The statement "Oh, it does that sometimes, just Control-Alt-Delete it" is definitely not used here!
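To put a rough number on what redundancy buys, here is a minimal sketch in Python. It rests on two simplifying assumptions of ours that the text does not make: redundant components fail independently of one another, and the service stays up as long as at least one of them is up. The 99 percent figure for a single server is purely illustrative.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent
    failures and that one surviving component is enough to serve requests."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} server(s): availability {a:.6f}, "
              f"about {downtime_hours_per_year(a):.2f} hours down per year")
```

Under these assumptions a second server turns roughly 88 hours of yearly downtime into less than one. Real components share power, networks, and software faults, so the independence assumption is optimistic and the result should be read as an upper bound.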
1.2.2 Scalability

The system should be capable of addressing changes in capacity. A cluster is not confined to a single computer system and can address capacity requirements with additional cluster membership. The cluster definition included the phrase "two or more independent computer systems." The cluster system should allow for additional cluster membership to meet the scalability needs of growth, and, ideally, additional cluster membership would not require a reboot of the cluster system.
1.2.3 Reliability

Briefly stated, "reliable" means "sustaining a requested service." Once a proper operation has been initialized by a user or an application, the system should be able to provide a reliable result. Remember the preceding scenario of the poor guy trying to get a simple file from a server. Well, imagine that the system serving the user crashes and really becomes unavailable. Picture this: the transfer is underway and almost complete when... crash! Some applications have recovery capability, that is, the use of temporary files to regain a part of or even all of the transaction. The question is, how reliable is that recovery method?

A cluster could provide a reliable result by providing a "failover" strategy. A system that provides failover provides an alternative path to an initialized request. Interruptions, such as cluster failover for whatever reason (discussed at length later), should be "transparent" to the user or the application. Ideally, should a cluster member or storage system fail during a user-requested action, the interruption would be imperceptible to the user. At worst, the interruption to the user's work would be minimal.

Additionally, the cluster system should be resilient to the actions of a user or application. A "renegade" user or application should never cause the downfall of a cluster. The worst that a recalcitrant application or user should be able to bring about would be the "dismissal" of that application or user.
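The idea of failover as "an alternative path to an initialized request" can be sketched in a few lines of Python. This is an illustration only, not how any particular cluster product works: the `MemberDown` exception, the two node functions, and the file name are hypothetical names introduced for the example.

```python
from typing import Callable, Sequence


class MemberDown(Exception):
    """Raised by a (hypothetical) cluster member that cannot serve a request."""


def serve_with_failover(request: str,
                        members: Sequence[Callable[[str], str]]) -> str:
    """Try each member in order and return the first successful result.

    The caller receives one answer and never learns which member produced
    it, which is the "transparent" interruption described in the text.
    """
    last_error = None
    for member in members:
        try:
            return member(request)
        except MemberDown as err:
            # This path failed; fall through to the next member.
            last_error = err
    raise MemberDown(f"no member could serve {request!r}") from last_error


def failed_node(request: str) -> str:
    """Hypothetical member that crashes on every request."""
    raise MemberDown("node A crashed mid-transfer")


def healthy_node(request: str) -> str:
    """Hypothetical member that serves the request normally."""
    return f"contents of {request}"


if __name__ == "__main__":
    print(serve_with_failover("report.doc", [failed_node, healthy_node]))
```

Retrying the whole request on another member is the simplest possible strategy. It is only safe when the request can be repeated without side effects; a real cluster also has to decide what to do with work that was half finished when the first member failed.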
1.2.4 Manageability

A cluster system should be capable of being centrally or singly managed. Ideally, the cluster manager should be able to access and control the "sharing and accessing of resources" from any point in the cluster. This implies two specific but equally important tasks. The cluster manager should be able to modify the system configuration without disturbing the supported users and processes. A manager should be able to shut down and reboot any cluster member or supporting storage subsystem. Further, the cluster manager should be able to control users' access (including the addition or removal of users) to the cluster resources. These two tasks should be capable of being performed on any member system of the cluster or system that has been granted access to the cluster management. Resources, from a user standpoint, should be transparent. Cluster system resources should appear as though they were "local" to the user.
1.2.5 Single-system image

Each computer member of the cluster is capable of independent operation. The cluster software provides a middleware layer that connects the individual systems as one to offer a unified access to system resources. This is what is called the single-system image.
Since each computer member is capable of independent operation, if one member of the cluster should fail, the surviving members will sustain the cluster. Picture, if you would, the cluster as a royal figure of ancient times, when a litter was sometimes used to transport royalty. The litter was supported by six able-bodied men. As the men moved as one, the litter moved as one. If the road got rough or one man slipped, the surviving men would sustain the litter with its cargo. The cluster does not need six independent computers to sustain cluster operation, but two independent systems are required as a minimum to constitute a cluster.

This inherent ability of a cluster provides a key ingredient to a cluster's availability and reliability. This independence allows an entire computer to fail without affecting the other cluster members.

The following summarizes the advantages of a cluster with a single-system image:

• The end-user or process does not know (or care) where the application is run.
• The end-user or process does not know (or care) where, specifically, a resource is located.
• Operator errors are reduced.
• A single-system image typically incorporates both a hardware layer and an operating system service or feature common to each cluster member.

The user's or application's point of entry to the cluster is a cluster member's application layer. The single-system image presents the user or application with the appearance of a single system: a cluster. The individual supporting members are transparent.

With the advent of its Windows 2000 Advanced Server, Microsoft has introduced its first Windows operating system that addresses the cluster characteristics of availability, scalability, reliability, manageability, and single-system image. Unlike the original VMS cluster or other types of cluster systems of recent years, this operating system does not require proprietary hardware for its implementation.
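The single-system image described in this section can be pictured as a thin facade in front of the members. The Python sketch below is a toy model under our own assumptions: the `Member` and `SingleSystemImage` classes, the node names, and the `payroll.db` resource are all hypothetical, and a real implementation lives in the operating system and cluster software rather than in application code.

```python
from typing import Dict, List


class Member:
    """One independent computer in the cluster (hypothetical model)."""

    def __init__(self, name: str, resources: Dict[str, str]):
        self.name = name
        self.resources = resources
        self.up = True


class SingleSystemImage:
    """Middleware facade: users ask for a resource, not for a machine."""

    def __init__(self, members: List[Member]):
        self.members = members

    def open(self, resource: str) -> str:
        # Search every member that is still running; the caller never
        # needs to know which one actually holds the resource.
        for member in self.members:
            if member.up and resource in member.resources:
                return member.resources[resource]
        raise LookupError(f"{resource} is not available in the cluster")


if __name__ == "__main__":
    # Both members can reach the same resource, as with shared storage.
    node1 = Member("node1", {"payroll.db": "payroll records"})
    node2 = Member("node2", {"payroll.db": "payroll records"})
    cluster = SingleSystemImage([node1, node2])

    node1.up = False                   # an entire computer fails
    print(cluster.open("payroll.db"))  # the user still gets the resource
```

The caller names only the resource, which mirrors the first two advantages in the list above: the user neither knows nor cares where the application runs or where the resource is located.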
1.3 Subsystems

A cluster system consists of three subsystems (see Figure 1.1):

1. Server subsystem, using two or more independent computers
2. Interconnect subsystem, using two or more computer-storage interconnects
3. Storage subsystem, using one or more storage systems
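The minimum counts in the list above lend themselves to a simple check of whether a proposed configuration meets the definition of a cluster used in this chapter. The following Python sketch is illustrative only; the `ClusterConfig` type and `missing_requirements` function are hypothetical names, and counting components says nothing about whether they actually intercommunicate and share resources.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterConfig:
    servers: int          # independent computers
    interconnects: int    # computer-storage interconnects
    storage_systems: int  # storage systems


def missing_requirements(cfg: ClusterConfig) -> List[str]:
    """Return the reasons a configuration falls short of the definition."""
    problems = []
    if cfg.servers < 2:
        problems.append("needs two or more independent computers")
    if cfg.interconnects < 2:
        problems.append("needs two or more computer-storage interconnects")
    if cfg.storage_systems < 1:
        problems.append("needs one or more storage systems")
    return problems


if __name__ == "__main__":
    print(missing_requirements(ClusterConfig(2, 2, 1)))  # meets the minimum
    print(missing_requirements(ClusterConfig(1, 2, 1)))  # a single server
```

A single fault-tolerant server with mirrored disks fails the first test no matter how resilient it is, which is the distinction between fault tolerance and clustering that this section goes on to draw.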
Figure 1.1: Cluster system.

It is the combination of these three that provides cluster capability according to our definition, "a computer cluster is a system of two or more independent computer systems and storage subsystems intercommunicating for the purpose of sharing and accessing resources." At first glance, the foregoing definition along with the constituent subsystems might incline a person to declare that a cluster system is merely a means to accomplish fault tolerance. Fault tolerance means "resistance to failure." Resistance to failure is an obviously desirable trait but is not in itself what clustering is about. Fault tolerance should be considered a component of a cluster's subsystem. In fact, fault tolerance could be a component of any one of a cluster's subsystems or inherent in all three.

For example, a component of the server subsystem would be an individual server; a component of the interconnection subsystem could be an individual controller; and a component of the storage subsystem could be an individual disk. Therefore, the server subsystem contains at least two components: the computers that are members of the cluster. The interconnect subsystem consists of at least two components: (1) the controllers that provide interconnection between the two computers and the storage subsystem and (2) the "intelligence" or software/hardware combination, which could address a "failover" situation at the interconnect level. A single RAID "box" could be considered as an entire storage subsystem. (RAID is the acronym for "redundant array of independent disks." An excellent reference on RAID can be obtained through the RAB Council at http://www.raid-advisory.com.) RAID storage is an implementation of a model or construct of how independent disks work as one. One such RAID model is the "mirror" or exact replication of one disk to another for the purpose of availability.

A cluster consists of three subsystems, and each subsystem consists of components. When fault tolerance was mentioned previously, it was described as an example of a subsystem component. Fault tolerance can indeed be a significant component of all three cluster subsystems. Fault tolerance (Figure 1.2 shows a simple RAID 1 configuration) is and has been an inherent quality of Microsoft's NT Server product line. But Microsoft's NT Server did not have built−in cluster capability, as does the Windows 2000 Advanced Server product.
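The minimum counts in the three-subsystem definition can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the `Cluster` class, its field names, and the member names such as `node-a` are hypothetical illustrations and are not part of any Microsoft product or API. It only encodes the minimums given in the text (two or more computers, two or more computer-storage interconnects, one or more storage systems).

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of the three subsystems named in Section 1.3."""
    servers: list = field(default_factory=list)        # server subsystem
    interconnects: list = field(default_factory=list)  # interconnect subsystem
    storage: list = field(default_factory=list)        # storage subsystem

    def problems(self):
        """Return the reasons this configuration misses the stated minimums."""
        found = []
        if len(self.servers) < 2:
            found.append("need two or more independent computers")
        if len(self.interconnects) < 2:
            found.append("need two or more computer-storage interconnects")
        if len(self.storage) < 1:
            found.append("need one or more storage systems")
        return found

    def is_cluster(self):
        """True only when every subsystem meets its minimum."""
        return not self.problems()


# Two servers, two interconnects, and one RAID box meet the definition.
pair = Cluster(["node-a", "node-b"], ["scsi-0", "scsi-1"], ["raid-box"])
# A single server with a single interconnect does not.
single = Cluster(["node-a"], ["scsi-0"], ["raid-box"])

print(pair.is_cluster())
print(single.problems())
```

Note that a fault-tolerant machine, however redundant internally, still counts as one entry in `servers` in this sketch, which matches the chapter's point that fault tolerance is a component of a subsystem rather than a cluster in itself.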
Figure 1.2: Cluster components. Another example of implementing fault tolerance as a component of a cluster subsystem is the Marathon computer system. This is a hardware fault−tolerant NT server. But, from our stated definition, a Marathon computer system would be, in itself, a single component of a server subsystem. Several years ago, manufacturers such as Force, Tandem, and Digital Equipment Corporation produced fault−tolerant computers. Digital Equipment's FT3000 VAX had dual "everything" right down to the AC power sources. The system had dual processors, memory, controllers, storage, network cards, and power sources. But even with all that, if you wanted true high availability (with the reliability of an FT3000), Digital had a cluster configuration involving two FT 3000s in a cross−coupled cluster configuration as an available, reliable cluster solution. RAID boxes, by definition, have fault tolerance. An example of a fault−tolerant interconnect is the dual−ported SCSI adapter as found in the CL 380 Compaq cluster box. Fibre Channel such as the Compaq HA (High Availability series), FDDI, and cluster network interconnects support redundant implementations, thereby adding a fault−tolerant component to the cluster interconnect subsystem. Another example of a cluster component is the clustering software added to the server's operating system. The clustering software must be integrated with the operating system. If a system should fail because of a catastrophic event, the cluster software would be "first−in−line" to take whatever recovery action is necessary.
Components can also be useful in adding to the features of a cluster without being part of a specific subsystem. Such is the example provided with the replication software produced by Octopus and Oracle Replication Server. Replication software can add a fail−safe fault−tolerant component to the cluster server subsystem. Either of these products provides additional features to the cluster system as a whole. But what about something that would allow enhancement or future revision? This leads to a third level of hierarchy in our cluster system definition: the cluster attribute.
1.4 Cluster attributes

Cluster attributes add features or functions that, while desirable, do not directly contribute to clustering by our definition. An example of this is cluster management software. Since a cluster comprises more than one server, it would be convenient to have the cluster management tools available wherever the cluster manager is located. Today, there are many "cluster−aware" management products and backup software products available that can add desirable features to the cluster system. Dynamic linked libraries (DLLs) that are written specifically for the cluster middleware software are another example of a cluster attribute. DLLs would provide an extensible foundation for the cluster software to access and enhance the overall cluster operation.

Returning briefly to our overall account of what a cluster system should provide, we see that there are two groups of people with very different needs that must be addressed. For users, the overall goals of the cluster are to provide availability and reliability. For managers, the overall goals are to provide scalability, central management, and stability. This last item, stability, is the strength to stand or provide resistance to change. With regard to the cluster, this comes from the operating system chosen for the cluster members and the ability of the cluster to fail over, if need be, to the next available cluster member. The Windows 2000 operating system provides an even more resilient kernel than its NT predecessor in providing immunity to the "blue screen of death" or to an errant process action. And in the event that a subsystem has a failure, the Windows 2000 Advanced Server cluster provides a built−in capability of failover. This same "stability" characteristic applies to the user as well. The terms we've used from the user's perspective, "available" and "reliable," imply that when a portion of a subsystem fails, another portion of that subsystem compensates or fails over.
When a failover occurs, there are two categories to consider. These categories are the perspectives represented by the two groups of affected persons: users and administrators. Let's examine recovery from these two perspectives.
1.4.1 User recovery

From a system crash or from an administrative operation session failover, there are two possibilities. The first (and ideal) is that the user has no perceived disconnect. As far as the user is concerned, nothing happened! The second case (and most common) is a lost session, or session disconnect. The cluster is still available, but the user is forced to reaccess the resource desired. If the application involved is critical and session disconnects are a possibility, then the application needs to involve transactional processing or the capability to roll back to its initial state.
1.4.2 Administrative recovery

From a system crash or from an administrative operation session, failover should include a central control system. The cluster system should have built−in messaging to both connected users and administrators for the purpose of advising impending cluster member removal. Administrative shutdown of a portion of a subsystem is sometimes necessary for the carrying out of administrative tasks. When a cluster manager performs an administrative shutdown to a cluster member, any resources served by that cluster member are automatically transferred to remaining cluster members.

In summary, we've stated that a computer cluster consists of three principal subsystems: a server, a storage subsystem, and an interconnection subsystem. Together these provide the basis of our original definition of a cluster as a system made up of two or more independent computer systems and storage subsystems intercommunicating for the purpose of sharing and accessing resources. The subsystems are constructed with discrete components, such as the server members and fault tolerance. Cluster attributes are basic system enhancements that enable additional functionality.

Currently, the word "cluster" has become a buzzword. Seventeen years ago the word "cluster" had no marketability, but "availability" and "reliability" did. Today, to gain interest, vendors bend and twist many definitions of what a present−day cluster should be to match their product's characteristics. The term "cluster" has been and is used quite loosely by the press and some vendors to describe high−availability solutions for Windows NT 4.0 Servers and Windows 2000. Some of these "clusters" do not live up to the authors' definition and hence may not provide a constant availability of service.
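The administrative shutdown described in Section 1.4.2, where resources served by one member are transferred to the remaining members, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function name `administrative_shutdown`, the member and resource names, and especially the placement policy (hand each resource to the least-loaded survivor). The text does not say how a real cluster chooses the receiving member.

```python
def administrative_shutdown(assignments, member):
    """Move every resource served by `member` to the remaining members.

    `assignments` maps a member name to the list of resources it serves.
    Returns a new mapping without `member`; the input is left unchanged.
    """
    if member not in assignments:
        raise KeyError(f"unknown cluster member: {member}")
    # Copy the surviving members so the caller's mapping is not modified.
    survivors = {name: list(res) for name, res in assignments.items()
                 if name != member}
    if not survivors:
        raise RuntimeError("no remaining member to take over the resources")
    for resource in assignments[member]:
        # Assumed policy: least-loaded survivor, ties broken by name.
        target = min(survivors, key=lambda name: (len(survivors[name]), name))
        survivors[target].append(resource)
    return survivors


before = {"node-a": ["file-share", "print-spooler"], "node-b": ["database"]}
after = administrative_shutdown(before, "node-a")
print(after)
```

With only two members, every resource necessarily lands on the single survivor, which is why a two-member cluster must be sized so that one member can carry the whole load, if at reduced capacity.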
An "apples to apples" comparison between Windows 2000 Clustering and legacy solutions such as OpenVMS Clusters or NonStop Clustering will probably end up in debates about features and capabilities missing in Windows 2000 Clusters. For all you OpenVMS fans, please see Figure 1.3. Digital Background Perspective If you happen to have a background with OpenVMS Clusters, I am sure that halfway through this chapter you might be saying to yourself that "if it ain't got a distributed lock manager (DLM) then it can't be called a cluster." There are two points that we would like to make to you. First, Windows 2000 and Clusters are relatively new. Remember our argument that Windows 2000 is really only at version 3.0. Second, Microsoft appears to have developed architectures today that set a good foundation for future enhancements. Finally, remember the saying that "there is more than one way to skin a cat." With that last comment, don't be surprised to learn that new ways are not always bad.
Figure 1.3: Digital perspective. As you read this book, you should be focusing on the new challenges that Windows 2000 Cluster products were designed to address. At the same time, remember that Windows 2000 is still a relatively new operating system built to address many new challenges. Therefore, its design goals differ from those of other, traditional operating systems.
1.5 Design goals

Businesses of all sizes are acquiring computer systems at an astonishing rate. It seems that as quickly as they are built, they are being snatched up for use either at home or by business. Just a short time ago, only large corporations or governments could afford a data processing system. Now businesses of all sizes, from mom−and−pop shops to international corporations, are integrating computers into their business processes as fast as they can get their hands on them. The one thing that is more and more common between mom−and−pop shops and large corporations is that the daily successes of their companies are becoming totally dependent on the reliability and capabilities of their data processing systems. This is especially true for companies that are doing business electronically over the Web. In the global market, the sun never sets!

Because of all this, most people who became accustomed to PCs in the 1980s now realize that an occasional "Ctrl−Alt−Del" just won't cut it anymore. PCs are now expected to deliver at the same level of service as minicomputers and mainframes but at a tenth of the cost. Where there is consumer demand, you will find entrepreneurs ready to provide a solution. So let's take the time to discuss why there is so much money being invested in the cluster industry and why there are so many companies scrambling to position themselves in the emerging Windows 2000 Cluster marketplace. The merger of Compaq, Tandem, and Digital, for example, brought together three companies that each had independent technology critical for implementing clusters and decided it was time to join forces to leverage off each other's unique capabilities. Tandem's ServerNet and NonStop technology, Digital's OpenVMS Cluster technology, and Compaq's leadership role in industry standard servers have made for a unique corporate marriage.
Microsoft's marketing studies have shown that there is a very large demand for higher availability from application servers than is possible with today's high−volume and low−margin PC class of server. Their customers have taken the plunge from traditional industry−standard solutions such as IBM, DIGITAL, NCR, and HP to the world of PCs with the hope that there will be huge savings, more productivity, and more user friendliness. What many of these people may not have realized when they made their "leap of faith" decision was that there were some good reasons why these traditional systems were expensive and labeled proprietary. These traditional vendors built their systems from the ground up to provide the reliability that they knew their customers needed (but maybe did not appreciate). Their marketing strategy, if not directly stated, was implicitly one of screaming "you get what you pay for." The customer's pleas of "enough is enough" fell on the ears of Microsoft, a listener. Then the customer settled for less at a greatly reduced price.

Herein lies the dichotomy of the past decade. It is between those that settled for the new, the cheap, and the "not so available or reliable" as opposed to those that remember the days of "money is no object" and the legacy systems of yore. Many of the hardware features specifically designed into legacy systems addressed the issues of reliability and high availability in software applications. These features were included without a second thought, probably for the reason that engineers in those days had a philosophy of "build it the way it ought to be built" because component development demanded extra attention to quality. Disk drive and circuit board development was and is still evolving. So at that point in time, quality (and hence reliability) was viewed by most to be as important as cost. After all, in those days it was easy to spend a quarter of a million dollars or more on a data processing system.
At the other end of the spectrum, too many of us remember the background of the first Windows operating system and the second and the third. We remember hearing, "Oh yeah, it does that once in a while; just do a shutdown and reboot it, and it will go away!" This was addressed to persons experienced in data processing, for whom "shut down and reboot" was not a common procedure. The idea of actually considering a "Windows system" as a replacement for a legacy operating system became interesting only during the past decade.

There is a classic reliability story (circa late 1980s) from the questions put forward by a prospective new information technology (IT) employee interviewing at a large Digital Equipment Corporation customer site. It seems that the prospect asked an employee how long the data processing system would run between crashes. The individual replied that he did not know; after all, he had been working at this company for only a year! It has been only recently that the market has demanded this same level of reliability in the high−volume, low−cost market.

This chapter has been citing the former Digital Equipment Corporation as an example, but it is a fact that the firm had a distinct advantage over most of its competitors. Digital had control of the complete systems that they delivered to their customers. The computer was their design, the peripherals were for the most part their design, and OpenVMS was completely under their control. Even the people that serviced the hardware were typically Digital employees. You can do a lot when you can control all the variables, which includes pricing the product for what the market will bear. Today, that is definitely not the case. Various companies write operating systems. Other independent companies build CPUs, and thousands of companies manufacture and maintain computer peripheral hardware.
Because it presents choices and massive competition for any product providing the attributes and features of "cluster systems," Windows 2000 Advanced Server is a "true component" of a cluster system's computer subsystem. And, because there are "choices" to be made, Microsoft realizes that Windows 2000 Advanced Server is just one of many choices available to the market. Hardware manufacturers also realize that there are choices. Proprietary pricing is becoming a thing of the past. By now, most of us have become used to the Bill Gates way of doing things. We like the user−friendly Windows environment and the plentiful selection of low−cost software development tools and applications, at least until the company's application server hangs and no one can get any work done for a day.

But complete trust in the single−server solution represents a pathway to disaster. If a company's single−server computer system goes down, everything stops! Recently we were in a large discount store when the point−of−sale system crashed. Everyone was totally helpless! The employees were standing around without a clue about what to do. Even worse, the store's customers were leaving in disgust (including ourselves). They were victims of an "all your eggs in one basket" computing system. Finally, there is an answer, the cluster, that addresses the single−point−of−failure problem along with many other problems as well.

Businesses today have gone through a dramatic change in the way they conduct business. With the advent of the Web, businesses have the potential to sell to people all over the world. That means the store is open for business 24 hours a day and 365 days a year. With the potential for thousands or even millions of transactions a day, it is easy to see why companies are looking for better than 99 percent availability from their computer systems. So why, out of the blue, did Microsoft decide to include Advanced Server and DataCenter as a cluster portion of their 2000 product offering?
Maybe while everyone was waiting for the PC LAN server to be rebooted, one of the old−timers in information services (IS) said "I can remember the days when we did not have this problem; we had a VAX cluster." The commentator forgets or neglects to state the price the company paid for that cluster and doesn't remember the task force of personnel required to maintain and care for the cluster. Maybe if designers and developers who remember how it used to be began thinking of how it could be, the race would begin for a practical cluster consisting of personal computer member servers.
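To see why "better than 99 percent availability" matters for a store that is open 24 hours a day and 365 days a year, it helps to convert the percentage into hours of downtime. The Python sketch below is only arithmetic on figures from the text; the function name and the additional availability levels (99.9 and 99.99 percent) are illustrative assumptions and are not targets quoted by the authors. At exactly 99 percent, the allowance is 87.6 hours per year, or a little over three and a half days.

```python
HOURS_PER_YEAR = 24 * 365  # a store open 24 hours a day, 365 days a year


def downtime_hours_per_year(availability_percent):
    """Hours per year a system may be down at a given availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


# Illustrative levels; only the 99 percent figure comes from the text.
for level in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(level)
    print(f"{level}% available -> {hours:.2f} hours down per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is one way to frame how much availability a given business actually needs to pay for.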
In reality, there have been cluster attempts and symmetrical multiprocessing attempts since Microsoft's inception of NT 3.1. To be successful in the PC market, it will take a lot more than what was delivered in the past. Today, businesses cannot afford the luxury of large IS organizations with many experienced personnel. Instead, they are more likely to have a few people who have to wear many IS hats. Acting as system manager is only one of the many responsibilities they must assume. There is a definite need for clustering solutions that virtually install themselves and have very simple and easy−to−use graphical user interfaces. These customer−driven requirements are right in line with the direction cluster vendors have been taking for the Windows 2000 Server architecture. In fact, Microsoft is delivering its own cluster software solution for the Windows 2000 operating system, as well as actively promoting open standards for hardware technology that can benefit clustering.

Given the business model in the computer marketplace today, where the hardware might come from one or more vendors and the software also comes from dozens of other vendors, Microsoft and the many hardware and software suppliers must work very closely on developing standards at every level. There are a couple of benefits that we all will see from these efforts on the part of Microsoft. One benefit that accountants will appreciate is reduced cost. Through the standardization of software application program interfaces (APIs) and hardware architectures, the market is being opened up to many players, big and small. As more players enter the market, competition forces prices down while at the same time pushing technology further ahead. Those of us who are technologists will appreciate the many technical approaches that will be offered to us to make our system hardware faster and more reliable.
Since the basic interface specifications can be standardized, the hardware vendors can concentrate on advanced hardware features while being assured that what they develop will work with Microsoft's 2000 operating system. One such effort, which is discussed in Chapter 7, "Cluster Interconnect Technologies," is virtual interface architecture (VIA). Microsoft is working with more than 40 leading hardware vendors to develop a standard cluster interconnect hardware architecture as well as the software APIs that will drive that hardware. A complete clustering solution, by our definition, is a very complicated mix of software and hardware. Even with all the work that has been done already by Tandem, Digital (now Compaq), and others, you do not get there overnight. It was very smart on the part of Microsoft to cross−license the patent portfolios of Digital and Tandem. It is still going to take some time for Microsoft to give its 2000 line capabilities similar to those that are already available for VMS and UNIX. Fortunately for all of us, Microsoft has stepped up to the challenge and has laid out a road map for clusters that will get us there over time. The past and current releases of 2000 Cluster Server address just the basic need for availability. According to Microsoft's market studies, that is the most pressing need today for the majority of Microsoft's customer base. We also need to remember that this is not just Microsoft's responsibility. For us to benefit from Microsoft's clustering "foundation," third−party application developers must rework their applications to take advantage of the high availability and scalability features. These are available to them by taking advantage of new software cluster APIs included in Microsoft's 2000 Cluster Server product. It is only when Microsoft and other applications developers put all their pieces together that we will really see the benefits of 2000 clustering. 
Microsoft's stated policy is that the functionality and features that they incorporate into new releases of MS Cluster Service will be a direct result of the feedback they get from their customers. The bottom line is that we cannot get there overnight. Microsoft has certainly taken on a large chunk of work in building MS Cluster Service, but there is an equal amount of work that must be completed by software application vendors and cluster hardware vendors as well. It will happen over time. We recently attended the twentieth anniversary celebration of OpenVMS, and we can attest to the fact that there is still heated debate going on over what new features should be included in OpenVMS Clusters, even after 20 years!

To architect the MS Cluster Service product, Microsoft stepped up to a challenge not attempted before by the IBMs and Digitals of the world. When IBM and Digital sat down to design their cluster architectures, they viewed their market potential in the order of thousands of customers, all of which would be using hardware that the two companies had carefully designed and tested for the specific purpose of running "their" cluster solution. It was a very controlled environment, mainly because there were very few options for customers to choose from. Microsoft's goal, on the other hand, is to develop its Cluster Service so that it will address the data processing needs of a broad market with the potential for millions of customers. Microsoft has the rather large challenge that these older vendors did not have to address: attempting to deal with all the support issues surrounding hardware manufactured by dozens of system vendors running potentially thousands of different applications. When you think about those numbers, you can then begin to imagine how different their architectural decisions can be.

The potential users of Cluster Service range from small professional offices to large international corporations. A diverse customer base like this needs a solution that is very scalable and easy to support. A small professional office will be just as dependent on its company's databases as a super−large international corporation. The difference is that small companies need a low−cost entry point with the ability to grow the system as their business grows. In addition, they are looking for a system that is very simple to set up and manage. The large corporations, on the other hand, have the advantage of keeping IS professionals on staff who design, install, and manage their computer systems. It seems, though, that today this advantage is shrinking.
Big or small, we are all expected to do more, with less time and help to do it. Microsoft's user−friendly graphical approach to its cluster administration tools will be appreciated by anyone responsible for supporting a cluster of servers. As we have already said, the first stop along Microsoft's road is simply "availability." This is a straightforward capability for Microsoft to implement. At the same time, it may be just what the doctor ordered for some companies needing to deploy mission−critical applications on Microsoft's 2000 operating system. Even though this initial release falls short by some people's standards for clusters, the bottom line is that you have to start somewhere. By releasing this product (Microsoft project name "Wolfpack") to the market a couple of years ago, Microsoft started the process of encouraging third−party software developers to get up to speed with the new cluster APIs now supported by Windows 2000 Server. As third−party application vendors come up to speed with the Cluster Server SDK, a whole new breed of cluster−aware applications for the mass market will appear. Further down the road, Microsoft will likely add support for distributed applications and for high−performance cluster interconnects such as storage area networks (SANs). This will not only put them in the same league with the "UNIX boys," but Microsoft will be in a position to set the standards by which all cluster technology will be measured in the future. The hoped−for, ideal solution would allow a user a Lego style of assembly. Visualize the servers and subsystems as "building blocks" that are nothing more than the standard off−the−shelf computer systems in use today. These "cluster building blocks" can have single−CPU or symmetrical multiprocessing (SMP) CPU configurations. And, they don't have to be configured as identical twins. One machine can have 256 MB of memory, and the other machine can have 64 GB. It does not matter when you cluster. 
(It will work; but we will talk about some important issues you should be aware of later.) Ideally, you should be able to add and remove these computer building blocks in a manner transparent to the users who are using the cluster.
An alternative to the cluster is the "standby server" system. Those of you who have not had the fortune to work with a standby server type of architecture may be fortunate! Let's just say for now that standby servers can be extremely complex and unforgiving in their configuration. Typically, they are configured with two exactly identical computers (down to the BIOS chip version level), but you can use only one computer at a time. One computer is always in an idle standby state waiting to take over if the primary server fails. Middleware or software, which addresses hardware differences, is available to provide similar results. A cluster system, which incorporates the single−system image as a cluster component, can share the workload. You can build larger and larger systems by simply rolling in another computer and cabling it up to the cluster.

The following list of points summarizes our discussion on business goals. Keep these in mind as we start to talk about the technical directions and goals Microsoft has taken in the development. The goals of the early cluster initiatives for NT 4.0 have become focused with the advent of the Windows 2000 Advanced Server and Windows 2000 DataCenter as Cluster Server products. Now Microsoft has a product that can:
1. Deliver a high availability solution.
2. Make it very easy to install and administer.
3. Use low−cost industry−standard hardware.
4. Develop based on open standards.
5. Start out small and provide for growth.
6. Develop tools for third parties to extend their functionality.
Chapter 2: Crystallizing Your Needs for a Cluster

2.1 Introduction

The "oxen story" related in the Preface really provides sufficient justification for why a cluster solution should be considered. Why get a bigger server that will eventually need replacing by an even bigger server when what you really need is an additional server? When it comes to lifting a heavy or bulky load, most of us have no problem getting help and sharing the load. Why is that? Two or more people can safely lift a load too big for one. Think about it. You have a big box, cumbersome and heavy, and you need to lift it. Do you go through your list of friends who are Arnold Schwarzenegger look−alikes to find a single person big enough to lift it for you, or do you find two or more friends who can help? However, when it comes to computer solutions, increased capacity needs are often met with the "bigger server" approach.

Before we leave this example, consider this distinction. When two people (or two oxen) move a load, the load is easier to bear. But, if during the process of moving the load, one of the people falls, the entire job stops! Here is where the computer cluster distinction steps up to the plate. A cluster of computers provides the availability to address the load, even at reduced capacity. Like the oxen story, cluster solutions, in addition to providing many other benefits, have always been able to meet the load−sharing issues. The initial investment for the cluster usually proved prohibitive for all but the absolutely necessary situations. Only recently has the idea of building a cluster become an economical solution for the purpose of load sharing. But clusters provide much more than just load sharing. When computer cluster systems were first becoming popular (15 years ago), the real reasons clusters were the "only" answer at almost any price were their availability, scalability, and reliability.
For the jobs involved in banking, health care, and industry, the costs of providing availability, scalability, and reliability were insignificant compared with the consequences of not doing it. So, at any cost, cluster systems became a solution. And, the term "cluster" became a buzzword. The Preface of this book provided a simple graph of the transaction per minute benchmark tests conducted by the TPC Organization. The cost of clustering has dropped dramatically in recent years. Still, with cost always being an important factor, a determination of what is needed on a practical, affordable scale and what is acceptable calls for discussion. Consider the cluster system that manages an international bank's transactions or provides guidance for a manned space flight. These cluster systems require somewhat more stringent specifications than those of numerous other businesses. Clearly there must be a qualitative delineation of systems which are called a "cluster." So, let's divide the classification of cluster into two parts: "cluster" and "cluster plus." Consider this. Your personal needs require a vehicle to provide dependable (and affordable) transportation to and from work. To satisfy this, the vehicle must meet the minimum requirements and be within a reasonable costs of say $15,000$20,000. A super Sports Utility Vehicle featuring constant 4−wheel drive capable of "all terrain" and equipped with a mechanical system engineered for a temperature range of −75 to 125° F. with a cost of $75,000 may be a bit more than what you actually need. Clearly, there is a difference between a "vehicle" and a "vehicle plus," just as there is with clusters. Maybe your business needs do call for the system configuration definedas a cluster. But, perhaps your business needs require things a bit more than just a cluster. Such a system could only be described as a "cluster plus." 20
Then again, maybe your needs could be satisfied with something less than a cluster, but with cluster attributes. This last suggestion alludes to a system that, although not by definition a cluster, possesses cluster characteristics or attributes. The point is that there is a range of choices available, and you need to crystallize what your needs really are. So, let's talk about your needs.
2.2 Acceptable availability
Availability is the time the computer system takes to provide a response to a user or process. Briefly stated, a service is required and is available, given some time. All computer systems are subject to a limitation of availability. Picture yourself in a corporate network environment. You browse the network, find your network application, and double-click. The dreaded hourglass of frustration immediately appears and hovers for what seems an eternity. Is the time you have to wait acceptable? Another example: you come into work and you can't log on to the network, because the network is down. You learn that the network will be available by 9 A.M. Is this acceptable?
Acceptable availability is, therefore, a time that all can live with. The test for acceptable availability is simple: a service is required and is available within a time frame that is acceptable. The restaurant example given in Chapter 1 provides a good illustration of this. You walk into a good restaurant on a Saturday night without a reservation; the maitre d' informs you there will be a wait of 15 to 20 minutes. On a Saturday night, that's acceptable availability. There is a second term you may hear: high availability. Same restaurant, same maitre d', same time period, only this time the maitre d' recognizes you, greets you by name, and offers you immediate seating! That's what is meant by high availability.
Availability is a time-to-action characteristic. But there are also direct and indirect events that can affect system availability. Indirect actions are like system interrupts to a processor. This means the event is outside or external to the control of the system. Direct actions are like program exceptions. This means the event is part of a program or user activity. Indirect actions, or interrupts to normal system operations, are almost always unavoidable and are also the worst cases.
Power failures, network outages, and computer crashes (or, in the case of Microsoft Windows NT and 2000, computer "stops") are examples of indirect actions that affect availability. You can imagine such situations. Everything seems to be running fine when all of a sudden: "What happened?" It's gone! No matter how well an application, system, or operation is tested, there will come a time when the rule of inevitability strikes. It is indeed fortunate when hardware failure is the reason for your unavailability. This is fortunate, because hardware can be made redundant and fault tolerant. If you have ever "Web surfed," more than likely you've received a "server unavailable" message at least once. This may be from a server that is really unavailable (crashed) or "maxed out" (no more connections possible). Then there are the application and system failures. And whether it's for a moment or for seconds of real heart-pounding anxiety, the system was unavailable when needed.
Direct actions, or exceptions to normal system operations, occur on an as-needed administration basis; these are unavoidable, necessary actions of day-to-day system operation. Hardware upgrades and modifications always cause system unavailability. Operating system patches and upgrades, and sometimes even application software installations, require system shutdown and unavailability. System backups sometimes require exclusive access to the system. This may reduce server capacity partially or completely and contribute to unavailability.
2.3 Acceptable scalability
Scalability refers to the fact that the system can be modified to meet capacity needs. Acceptable scalability involves four items of consideration, of which the last two could be considered subsets of the second. The items are scalable, downtime, seamless scalability, and non-seamless scalability.
2.3.1 Scalable
First, and most important, to be scalable the system must have the capability to have its capacity modified. If the system can be modified to an acceptable level without having to be replaced, then the system could be said to possess the first qualification of acceptable scalability. For example, the system motherboard has spare slots for I/O, memory, CPU, and storage.
2.3.2 Downtime
Second, the time required to accomplish the modification must be acceptable. Scalability usually requires system operation interruption. Basically, the difference between scalability and acceptable scalability is the amount of downtime that is acceptable while the system capacity is being modified. Consider this scenario. You just invested $10,000 in server hardware that is to include redundancy, an uninterruptible power supply (UPS), and dual-processor capability. The system comes on line and everything is working great. Then the new database server application turns out to need an additional 100 GB of storage capacity! You realize you need to add storage, and the storage system is not hot swappable. Clearly, acceptable scalability involves careful planning for a system that is scalable without interruption. But this type of scalability, storage space, can be seamless with the right hardware chosen as a system base. This is a classic example of why implementations of hot-swappable disk arrays are sometimes necessary. Microsoft's Windows 2000 operating system takes advantage of this hardware feature by allowing dynamic recognition and importing of additional data disks.
System interruption seems almost inevitable at times. A system that requires an additional central processor, or the replacing or upgrading of an existing processor, will necessarily be shut down during such a hardware modification. But this is just one classic example of how a cluster system could save the day. If you realize that your system needs some modification requiring a shutdown or power-off, a cluster system would allow you to transfer your system operations to the remaining cluster member computer systems.
And let's not forget those things that move. Murphy's law dictates that "moving things are more likely to fail than nonmoving ones." Just as car radios frequently outlast the car itself, most modern processors can outlast the cooling fans required to cool them.
The problem then is that sometimes the cooling fans are physically mounted to the processor. So how do you replace a burned-out cooling fan on an active processor? Think of it. There are motherboards being sold with $200 CPUs whose survivability depends on the quality of a standard $3.00 "brass bushing" fan. Symmetrical multiprocessor servers are really interesting in connection with this particular problem. A trouble call is received claiming the server is noticeably slow; it must be the network. A closer investigation of the twin-processor server reveals a telltale odor. When the case is opened, the processor fan is observed to be stopped and there's a really interesting blue color on the now expired CPU! When a processor upgrade driven by a capacity need is imminent, replacement with good ball-bearing cooling fans should be part of that procedure. Systems with good scalable characteristics should have redundant cooling, alarms, and temperature monitoring in addition to the ability to replace cooling fans during operation. Systems maintenance should require minimal tools and offer ease of access to areas of likely failure, such as fans and disk drives.
The point is that system interruption will always happen to some extent in addressing scaling and capacity needs. How often this happens and the time spent addressing these needs determine the cluster's acceptable scalability. In a nonclustered single-system environment, shutdown for an upgrade to address scalability is almost a given. Consider now the cluster system and what cluster systems have to offer. In a cluster, a new member could be added or an existing member removed, modified, and returned to the cluster to address scalability needs. Even with clustered systems, however, sometimes a shutdown is necessary. This brings up the last two considerations of acceptable scalability.
To ease the time factor of scalability, systems could be considered to possess seamless and non-seamless scalability. "Seamless" refers to a type of operation in which the system at large remains constant and users and processes detect no system interruption for any change to the system. Seamless scalability requires no shutdown, and absolute seamless scalability is at the high end of the scale. For example, an additional cluster-capable computer system is dynamically added to an existing cluster by configuring the proposed cluster member's parameters. The proposed cluster member shuts down and reboots as it takes on the revised system parameters for cluster membership, but the existing cluster system remains operable during the process. As a second example, an additional RAID set is added to the cluster's storage subsystem. Because the cluster hardware components were carefully chosen during construction, adding storage elements is a matter of refreshing the system storage utility's display. Seamless scalability addresses the replacement processor/cooling fan problem.
Since the server in need of a capacity change (processor upgrade) can be a removable cluster member, seamless scalability allows continuous cluster operation (albeit at reduced capacity in the absence of the cluster member). Consider this final note about seamless scalability. Seamless scalability includes the ability to repair and restore the system to the desired capacity without total shutdown of the cluster at large. This, of course, does require service personnel to address the hardware changes and software configuration issues. But the cluster remains operable at all times. Therefore, a cluster that can boast of seamless scalability would not be without an operability premium. Non−seamless scalability is a system capacity modification that would require a complete cluster shutdown. Some modifications always seem to fall into this category (e.g., modifications to the building's power supply). In general, however, single−system environments are non−seamless in meeting their capacity modification needs. Have you ever seen a single−system server have its memory or CPU upgraded while continuing to be powered and serving users?
2.4 Acceptable reliability
Reliability is the ability of the system to obtain a stable state such that the system remains operable for an acceptable period of time. Once a user or a process initiates an operation, the result will be a known state. Even if the power cord is jerked from its socket at what seems to be the most vulnerable moment, the system will produce a known state, not an indeterminate one. Acceptable reliability is a scale or percentage of how well this qualification is met. The perfect score, 100 percent reliability, is unattainable. The famous five 9s, or 99.999 percent, is what most IS managers would like to see. To obtain this high reliability, the system design must plan for failure and have alternatives for survivability. Reliability can be thought of as planned as opposed to unplanned.
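To make the "five 9s" figure concrete, the following Python sketch converts a percentage into the downtime it would permit in one year. It is illustrative arithmetic only and rests on an assumption not stated in the text: that the percentage is read as the fraction of a 365-day year during which the system is operable. The function name `allowed_downtime_minutes` is hypothetical.

```python
# Illustrative arithmetic only. Assumption: the percentage is the fraction
# of a 365-day year during which the system is operable.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (leap years ignored)


def allowed_downtime_minutes(percent: float) -> float:
    """Return the minutes per year a system may be down at this percentage."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - percent / 100.0)


if __name__ == "__main__":
    # Print the yearly downtime budget for two through five 9s.
    for percent in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(percent)
        print(f"{percent}% -> {minutes:.2f} minutes of downtime per year")
```

Under that reading, five 9s leaves a little over five minutes per year for every failure and every maintenance action combined, which is why the design must plan for failure.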
Planned (or, as some like to call it, engineered) reliability is a system design that provides an alternative to system operation failure. One example is the employment of hot-swappable disks. Hot-swappable storage arrays are a key to continuous reliable service. When a drive element of a storage subsystem array fails, an alarm is raised and sometimes an indicator light is lit. The RAID model sustains the integrity of the data stored on the surviving drive. The failed drive is removed and replaced, and the RAID model rebuilds the data to the replacement drive. The simplest RAID form for fault tolerance is RAID 1, or the mirror set (see Figure 2.1).
Figure 2.1: Mirror or RAID 1 example.
Unplanned reliability: crash!
Planned reliability is a system design that provides alternatives to system failure. Unplanned reliability is what happens when you run out of alternatives! System failure is the nice term to describe what has traditionally been known as a crash. The advent of Microsoft's Windows NT introduced an even gentler term to describe this action. As mentioned before, Microsoft refers to a system failure as a stop. However, this term is seldom used; the more popular term is blue screen of death (BSOD), so named for what appears on the screen when it occurs. With a cluster, however, when this unplanned event occurs, reliability could rest with the surviving cluster members or nodes. Acceptable reliability should address two important issues: process and data. Let's start with data.
Data integrity is key to acceptable reliability. Care must be taken to ensure that the data is not corrupt or in a questionable state during and following a system crash. The cure for this is threefold.
1. Replicate your storage. Each time a write of data is issued to the primary storage unit, a copy of that data is written to a second storage unit. This operation is described by RAID model 1 (redundant array of independent disks). For faster storage, but with data integrity, use RAID model 5 or striping with parity (see Figure 2.2). Here a file that would normally take four I/O transfers is stored, with integrity, in just two transfers.
Figure 2.2: Stripe with parity or RAID 5 example.
2. Back up your data. Do this on a basis directly proportional to your level of paranoia. Can your operation survive, should it need to, on two-day-old data? If not, you need to back up your data every day. Most operations require a daily backup of operational data. Fortunately, Windows 2000 comes with a built-in robust backup utility that even allows scheduled disk-to-disk network backups.
3. Transactional processing. Critical data operations should always employ a transactional processing file system as part of data management. The Microsoft NTFS file system has employed transactional processing from its introduction in NT 3.1. But what is transactional processing? Transactional processing is a client/server operation involving two simple phases of operation: prepare and commit. To illustrate this process, consider a transaction with an ATM machine (money machine). This is a transactional process involving a client (the ATM) and the server (the bank). A user proposes a transaction at card insertion. During the session, one of two things will happen. The user will walk away, happy, money in hand, and his account will be debited for the amount of the transaction. Or the user will walk away, sad, no money, but his account will not be debited for the transaction amount. The point is that at any phase of the transaction, even if lightning should strike the ATM machine (granted the user may be surprised), the transaction can be "rolled back" to initial conditions, thus preserving data integrity.
Now comes the consideration of acceptable reliability relative to the process. The process, the user or application, resides on a computer system. If there is only one computer system and that computer system fails, the process or application fails. The computer system itself becomes a single point of failure. But what if two or more computer systems could interact with the same storage array, as with a dual-port SCSI controller or Fibre Channel? This would be a cluster. This follows the definition "two or more independent computer systems and storage subsystems intercommunicating for the purpose of sharing and accessing resources." If this is the case, then one cluster member computer system could fail and the remaining systems could take charge of the user or application process.
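The prepare and commit phases described in this section for the ATM example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how NTFS or any bank system is implemented: the `Account` class, the `withdraw` function, and the `dispense` callback are all hypothetical names introduced here.

```python
class InsufficientFunds(Exception):
    """Raised when the prepare phase cannot reserve the requested amount."""


class Account:
    """Hypothetical bank-side record with a two-phase (prepare/commit) debit."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._pending = None  # amount reserved by prepare(), if any

    def prepare(self, amount: int) -> None:
        # Phase 1: check and reserve, but do not change the balance yet.
        if self._pending is not None:
            raise RuntimeError("another transaction is already in progress")
        if amount > self.balance:
            raise InsufficientFunds(amount)
        self._pending = amount

    def commit(self) -> None:
        # Phase 2: make the reserved debit permanent.
        self.balance -= self._pending
        self._pending = None

    def rollback(self) -> None:
        # Abandon the reservation; the balance returns to initial conditions.
        self._pending = None


def withdraw(account: Account, amount: int, dispense) -> bool:
    """Debit the account only if the cash is actually dispensed."""
    try:
        account.prepare(amount)
    except InsufficientFunds:
        return False
    try:
        dispense(amount)  # the ATM hands out cash; any failure raises
    except Exception:
        account.rollback()  # "lightning strikes": nothing is debited
        return False
    account.commit()
    return True
```

Either outcome leaves the account in a known state: the full debit is recorded, or the balance is exactly what it was before the card was inserted.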
But even though the remaining computer systems are members of the cluster, each member is still a different computer system. Then what starts the user or application on the surviving member of the cluster?
Process restart can be manual. This means that an operator must be present to complete the change of operation from the "failed" portion of the cluster to a surviving cluster member. This, then, would be a new instance of the process. Consider a user involved in a terminal access process such as OpenVMS or a UNIX session or a Microsoft Terminal Server client. If a user were connected as a terminal access process at the time of the crash, the user would lose connectivity. The worst thing that user is faced with in this situation is a "hung" terminal window. The user will have to know how to "kill" or terminate the lost session with the server. The user would then have to start a new terminal access to the surviving cluster member. The danger inherent in this type of system is reflected by the inevitable question, "What happens to the data at the time of the crash?" This is where the operating system and the associated file systems used must offer the previously discussed "ATM-like" transaction processing used by Windows 2000 NTFS.
Process restart can be automatic. For critical applications, a restart or resume of the application on the surviving member should be automatic. If the failover involves a "start" of an application on a surviving node, there will be an obvious start delay. What if a "start" delay is unacceptable and failover must be so fast as to be transparent to cluster operation? (Transparent to the operation means that an operator is unaware that a failover has taken place, except for an alarm event.) Clearly this subdivides "restart" into two categories, which can be referred to as active/passive and active/active. Active/passive (see Figure 2.3) refers to a configuration requiring the application to be "started" and bound to the required data.
When system failure occurs, for example, Server_1 crashes, and the surviving member, Server_2, must take control of the database and start the application. This takes time. How much time will be the determining factor in whether the active/passive method offers a solution of acceptable reliability.
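The active/passive sequence just described can be modeled with a short Python sketch. It assumes a two-node cluster and a known application start time; the `Node` class, the `fail_over` function, and the `startup_seconds` parameter are hypothetical names used only for illustration and do not correspond to any vendor's cluster API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster member that tracks which applications it runs."""
    name: str
    alive: bool = True
    running: set = field(default_factory=set)


def fail_over(app: str, owner: Node, standby: Node, startup_seconds: float):
    """Active/passive restart: return (node now running app, delay incurred)."""
    if owner.alive:
        return owner, 0.0  # nothing to do, the owner is still serving
    if not standby.alive:
        raise RuntimeError("no surviving cluster member")
    owner.running.discard(app)
    standby.running.add(app)  # the application is started on the survivor
    return standby, startup_seconds  # the start delay is the cost of failover


def is_acceptable(delay_seconds: float, limit_seconds: float) -> bool:
    """Acceptable reliability here means the restart delay is within the limit."""
    return delay_seconds <= limit_seconds
```

For example, if Server_1 crashes and the database takes 45 seconds to start on Server_2, `fail_over` reports a 45-second delay, and `is_acceptable` compares that figure with whatever limit the business has set.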
Figure 2.3: Active/passive. Active/active (see Figure 2.4) refers to a configuration allowing two instances of the cluster application. But only one instance of the application has data access. As shown, the application need only enable the data selection to "resume" activity.
Figure 2.4: Active/active. Microsoft refers to applications that can utilize the cluster application program interfaces and cluster libraries as cluster aware applications. However, Microsoft is quick to point out that applications that are not cluster aware may be restarted via automated script files. Applications that are cluster aware make up a growing list. Microsoft's BackOffice products, such as SQL and Exchange, and Microsoft's Windows 2000 Server services, such as DHCP, WINS, and DNS, are some of the more notable cluster aware products. Please note that the active/active, like the active/passive has a path from either computer system to the storage array. Also note that only one system at a time has actual access to the storage array. This arrangement is sometimes called a "shared nothing" storage array because the drive is never "shared" between the computer systems. At any time instance, the drive is "owned" by a single computer member. Shared disk (see Figure 2.5) clustering was developed by Digital Equipment Corporation's OpenVMS clusters. Any cluster member's process or application could access any disk block for which it had permission. The trick is not to allow one system to modify (write) disk data while another system is accessing (reading) data. Synchronized access of the common storage area was the function of an application distributed and running on all cluster members known as the distributed lock manager. This application used a database common to all cluster members known as the lock management database. Each file or data store for the common disk had an associated data structure called the "lock." The lock stored ownership and access privileges to the file or data store. In order for a process or application to read or write to the database, a lock or data structure of assigned access had to be obtained from the distributed lock manager and stored in the lock manager database.
Figure 2.5: Shared disk.
The distributed lock manager service identified three specific things: the resource, the cluster node with current responsibility, and the resource access permissions. This last, third element of permissions is that to which the term "lock" referred. The permissions for each cluster resource were granted or locked in accordance with the lock manager.
The advantage of the shared-disk cluster system is that it provides a common base for programs. In a read-intensive environment, the shared-disk system offers faster service than shared-nothing. An environment that is subject to frequent writes requires the distributed lock manager to constantly synchronize the access. Despite this required cluster traffic between cluster members, there is a huge advantage to this distributed lock manager middleware. Little has to be done to ensure that the applications are cluster aware. Database applications are easier to develop in a shared-disk environment, because the distributed lock manager takes care of the disk synchronization. However, most databases use row-level locking, which leads to a high level of lock management traffic. Shared-disk systems can generate 100 times more small lock-level messages than non-shared-disk systems. In short, shared-nothing designs are more efficient, but require more program development.
The ability of a cluster to produce acceptable reliability by restarting a process manually or automatically on a remaining cluster member can be termed "failover." The degree of failover capability is dependent on the cluster's reliability needs. Each of a cluster's subsystems can possess failover capability. Let's look at examples of each.
2.4.1 Server failover shared disk
When an individual cluster member fails, for whatever reason, the last messages communicated from the server amount to a "node (or server) exit" announcement that is translated by the remaining nodes as a "remove node from cluster" message. Even if the message is not received from the failing cluster member, the cluster should be able to detect the member loss and "failover" that member's responsibilities. How would this be possible? Let's look at an example of this in the shared-disk design used by the VMS cluster system. The lock manager database and the lock manager service play a key role. Because each member of the cluster had and ran a copy of the lock manager service (hence the term distributed lock manager) and had access to the lock manager database, access to the unified database of all cluster resources would be held by all cluster members. An exact accounting of which node, which resource, and what access would be listed. In the event of a node failure, the resources and accesses of the failed node would be redistributed to the remaining cluster members. A distributed lock manager and associated database as briefly described here represent a cornerstone for proper cluster operation. For more information on this, please see (Roy Davis, 1993). [1]
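The accounting of "which node, which resource, and what access" and its redistribution on node failure can be sketched as a small in-memory table. This Python example is a toy model under loud assumptions: it is not the OpenVMS distributed lock manager, it allows only one holder per resource, and the round-robin redistribution policy, the `LockDatabase` class, and its method names are all invented for illustration.

```python
class LockDatabase:
    """Toy lock database mapping resource -> (owning node, access mode)."""

    def __init__(self, nodes) -> None:
        self.nodes = list(nodes)
        self.locks = {}

    def grant(self, resource: str, node: str, access: str) -> bool:
        """Record a lock unless another node already holds the resource."""
        if node not in self.nodes:
            raise ValueError(f"{node} is not a cluster member")
        holder = self.locks.get(resource)
        if holder is not None and holder[0] != node:
            return False  # simplified: a single holder per resource
        self.locks[resource] = (node, access)
        return True

    def remove_node(self, failed: str) -> list:
        """Handle a node exit: hand the failed node's locks to the survivors."""
        self.nodes.remove(failed)
        if not self.nodes:
            raise RuntimeError("no surviving cluster members")
        orphaned = sorted(r for r, (n, _) in self.locks.items() if n == failed)
        for index, resource in enumerate(orphaned):
            access = self.locks[resource][1]
            # Assumed policy: spread orphaned resources round-robin.
            new_owner = self.nodes[index % len(self.nodes)]
            self.locks[resource] = (new_owner, access)
        return orphaned
```

Because every surviving member can consult the same table, no resource is left without an owner after a member exits, which is the property the text attributes to the lock manager database.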
2.4.2 Server failover non-shared disk
Microsoft uses a database called the "resource group" in Windows 2000 Advanced Server and Datacenter Server. This structure marks a great forward step in tracking cluster resources by node. This database also provides the following entries for each resource:
• Dependencies. What conditions must exist for this resource to exist? These are set like a "rule" or test of initial environmental conditions for a given resource.
• Registry entries. These are specific additional registry entries (or system parameters) that an application or the system can reference.
• Node. This is an access table, which permits the cluster manager to specify which node(s) a resource is allowed to use.
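The three kinds of entries listed above can be pictured as fields of a record. The following Python sketch is an assumption-laden illustration, not the actual schema used by the Windows 2000 cluster service: the `Resource` class, its field names, the `can_start` function, and the sample registry key are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical resource-group entry with the three kinds of data listed."""
    name: str
    dependencies: list = field(default_factory=list)      # must be online first
    registry_entries: dict = field(default_factory=dict)  # extra parameters
    allowed_nodes: list = field(default_factory=list)     # node access table


def can_start(resource: Resource, node: str, online: set) -> bool:
    """A resource may start on a node it is allowed to use,
    and only when every dependency is already online there."""
    if node not in resource.allowed_nodes:
        return False
    return all(dep in online for dep in resource.dependencies)


# Usage sketch with invented names: a database that needs its disk and address.
database = Resource(
    name="database",
    dependencies=["shared_disk", "ip_address"],
    registry_entries={"ExampleParameter": "value"},  # hypothetical entry
    allowed_nodes=["Server_1", "Server_2"],
)
```

Treating dependencies as a precondition check is one simple way to read the "rule or test of initial environmental conditions" wording; a real cluster manager may evaluate them differently.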
2.4.3 Storage failover
From its first release of Windows NT 3.1, Microsoft offered a software solution to failed data. Microsoft's solution was a software implementation of the RAID Advisory Board's level 1 model. A "mirror" or RAID 1 drive set provides an exact copy, block by block, of a given partition between two physically different disks, regardless of controller, manufacturer, or geometry. For the truly paranoid, Microsoft's disk duplexing added different controllers for the mirrored pair in the event of controller failure. Hardware solutions abound for all levels of RAID, especially RAID 1 (mirroring) and RAID 5 (striping with parity). When the term "RAID" is used by itself, most likely the level of RAID referred to is RAID level 5 (striping with parity). Table 2.1 provides a brief description of the common RAID levels.
Table 2.1: RAID Technologies
RAID Level | Common Name | Description | Disks Required | Data Availability
0 | Striping | Data distributed across the disks in the array. No data check | N | Lower than single disk
1 | Mirroring | All data replicated on N separate disks (N usually 2) | N, 1.5N, etc. | Higher than RAID 2, 3, 4 or 5; lower than 6
2 | - | Data protected by Hamming code; check data distributed across m disks, where m is determined by the number of data disks in array | N + m | Much higher than single disk; higher than RAID 3, 4, or 5
3 | Parallel transfer with parity | Each virtual disk block sub-divided and distributed across all data disks; parity check data stored on a separate parity disk | N + 1 | Much higher than single disk; comparable to RAID 2, 4, or 5
4 | - | Data blocks distributed as with disk striping; parity check data stored on one disk | N + 1 | Much higher than single disk; comparable to RAID 2, 3, or 5
5 | "RAID" | Data blocks distributed as with disk striping; check data is distributed on multiple disks | N + 1 | Much higher than single disk; comparable to RAID 2, 3, or 4
6 | RAID 6 | As RAID level 5, but with additional independently computed check data | N + 2 | Highest of all listed alternatives
Note: N = 2. Source: Paul Massiglia, The RAIDbook, St. Peter, MN: RAID Advisory Board (1997). [2]
Note that the models described in Table 2.1 may be combined to obtain the desired characteristics. For example, RAID 10 uses striping without parity together with mirroring. The technique can be referred to as "mirroring the stripe sets." However, storage failover is not limited to a failover within "a" RAID model. Storage failover can encompass failover between storage arrays. The storage arrays themselves don't even have to be local! Storage area networks provide an additional safeguard and data availability. This deserves more detailed discussion (see Chapter 7, "Cluster Interconnect Technologies"); the focus of this chapter is to "crystallize" or determine what your needs may require.
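The "striping with parity" idea behind RAID 5 can be shown with a few lines of Python. The sketch assumes the common exclusive-OR (XOR) form of parity and works on in-memory byte blocks rather than real disks; the function names `xor_blocks` and `rebuild` are hypothetical, and the rotation of parity across disks that RAID 5 performs is deliberately left out.

```python
from functools import reduce


def xor_blocks(blocks) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    blocks = list(blocks)
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per position in the stripe.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity: bytes) -> bytes:
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    stripe = [b"ABCD", b"EFGH", b"IJKL"]  # data blocks on three disks
    parity = xor_blocks(stripe)           # check data stored on a fourth disk
    # Simulate losing the second disk and rebuilding its block.
    recovered = rebuild([stripe[0], stripe[2]], parity)
    print(recovered == stripe[1])
```

The same XOR that produces the parity block also regenerates any one lost block, which is why a RAID 5 set in Table 2.1 needs only one extra disk (N + 1) to survive a single drive failure.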
2.4.4 Interconnect failover
Finally, interconnect failover refers to the connection from server to storage as well as the connection from server to server. The server failover described earlier was dependent on the surviving member's ability to detect a server exit. A simple single-server system with a single or dual controller local to the server can detect a disk failure. At the low-end implementation of Microsoft Windows 2000 Advanced Server Cluster, the cluster system typically uses a "private" 100 Mbps internal network for cluster system communication and a SCSI controller to address the storage subsystem. But to detect a server failure you really need an interconnection mechanism that can address interprocessor communications as well as interface with the storage subsystem. The two most popular communications mechanisms use Fibre Channel and/or Small Computer Systems Interface (SCSI) for interprocessor or system-to-system communication. Legacy interconnect systems originating from Digital Equipment Corporation included CI, Computer Interconnect (70 Mbps); DSSI, Digital Standard Storage Interconnect (30 Mbps); and FDDI, Fiber Distributed Data Interface (100 Mbps).
Network interconnection between cluster servers is proving to be a most interesting technology. For a long time, network interconnects wallowed in the slowness of a 10 Mbps limitation, competing with the typical local area network congestion. Now, technology is on the threshold of commonplace server-to-server communication beyond 1 Gbps. Interconnect failover capability is a simple matter of configuring redundant networks. Couple this with network storage arrays, or network attached storage (NAS), and you get a setup that is illustrated in Figure 2.6.
Figure 2.6: Cluster storage array.
2.5 Cluster attributes
Cluster attributes add features or functions that, while desirable, do not directly contribute to clustering according to our definition. But when you are deciding on what your cluster needs are (and writing the cost justification), listing those hardware and software needs up front may be to your advantage. An example list follows:
• Cluster management software. Perhaps the management software accompanying your choice is not adequate to your needs, and a layered product is deemed appropriate.
• Backup software. If it's not part of your cluster system's choice or adequate to your needs, backup software may be necessary.
• Load balancing. Most storage array providers have the capability of statically load balancing the I/O requests among the redundant storage arrays. There is an exception. One old adage states "it costs that much, 'cause it's worth it." Some storage controllers can be paired to provide the capability of dynamically adjusting the I/O load between controller pairs.
• Libraries. Does the cluster system of your choice provide attributes that your software can interface with?
2.6 Summary
The purpose of this chapter is to provide points to think about in choosing the cluster that is right for you. Ultimately, and hopefully, your needs for acceptable availability, scalability, and reliability will dictate some dollar value that will bring you the type of cluster system you need.
Bibliography
[1] Roy Davis, VAX Cluster Principles, Umi Research Press (1993).
[2] Paul Massiglia, The RAIDbook, St. Peter, MN: RAID Advisory Board (1997).
Chapter 3: Mechanisms of Clustering
3.1 Introduction
The preceding chapter on "cluster needs" introduced the capabilities and possibilities of cluster systems. And certainly the terms availability, scalability, and reliability seem a little easier to understand. But what about the "how," the way these capabilities come about from within the cluster subsystems?
Consider how the components of a clustered system work together as you would the elements of a football game. In football, the principal goal is for a team of persons to carry a football from one end of a field to another, repeatedly, while fending off opponents who try to prevent this from happening. At any one time, only one person has the ball. That person's fellow team players will position themselves and work to present the best possible avenue for the ball carrier to succeed. Some players will work to provide a reliable block to the opposition, while others will make themselves available to the ball carrier in case the ball carrier is attacked or becomes unavailable. Each team member has a specialized job to help ensure that the goal is met. Woe to the team with the "individual" player. Anyone who has seen or played in team sports has seen the player who is absolutely convinced that only one person can meet the goal: himself. The goal on the football field is always attained by the team whose members act as one and yoke their common strengths and abilities. Certain members of the team have "backup." For example, there are two guards, two tackles, two ends, and two halfbacks. In any given "play" of the game, the team and member job division provides availability, in case one member fails, and hopefully, a reliable play conclusion.
The analogy is simple; a cluster system consists of a number of "members," sometimes "pairs" of members. The members have unique, specialized jobs designed to achieve the goal: continuous and robust computer system operation.
The control and operation of the cluster mechanisms are governed by parameters inherent to those mechanisms. A Microsoft Windows 2000 system that supports clustering stores these parameters in the Windows 2000 configuration database, called the registry. The registry is divided into two principal areas: an area for users (HKEY_USERS) and an area for the operating system. The operating system part of the Windows 2000 registry is referred to as HKEY_LOCAL_MACHINE. Successful cluster operation depends on specialized cluster software working independently and with the operating system. Therefore, the parameters of cluster operation are part of two registry subkeys (see Figure 3.1):
1. HKEY_LOCAL_MACHINE\Software, for the general (e.g., software name) and initial parameters.
2. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services, for the cluster-specific operational applications employed as services.
The applications (programs) relative to the cluster operation must operate independently of user control. Therefore, those applications specific to cluster operation are deployed as Microsoft Windows 2000 services. Services can operate independently of user control and at a level closer to the system hardware than a user-controlled application.
Figure 3.1: Microsoft Windows 2000 registry.
Clustering software is a "layered" product. This means that the cluster software runs in addition to the base operating system and may be produced and sold by a vendor other than the manufacturer of the operating system. Since the software is an addition, the actual names of the terms may vary from vendor to vendor, but their use and employment will be the same. For example, one of the terms examined will be the cluster member. Some vendors refer to the cluster member as a node. This chapter intends to introduce general terms associated with some of the methods or mechanisms of clustering.
3.2 Cluster membership
Members of a cluster are referred to by some cluster software systems as nodes. The number of computers that can be nodes or members of a single cluster is determined by the capabilities of the cluster software. The cluster software inherent to Microsoft's Windows 2000 Advanced Server limits the number to two. Other vendors provide for many computers working as one cluster system. A cluster system includes a specific set of rules for membership. Some clustered systems use voting as a basic rule for membership.
Voting. A computer typically becomes a member of a cluster by presenting the cluster manager with its computer name and a vote. The vote is a system parameter (typically named vote) with a numerical value of one or greater. Because the vote is a numerical variable rather than a fixed value, it can be used to bias the election for cluster membership. The cluster manager authenticates the computer name against the cluster membership database. The vote value is used to determine cluster operation.
Baseball (backyard or street baseball) was always a difficult game to get going in the summer. Most of the regular players were at camp, on vacation, or otherwise unavailable. So, the pickings got slim for what was an acceptable number for a game. Somehow the "pickings" for cluster membership have to be set into a rule. Cluster systems are conducted like an orderly business meeting. A cluster system is not operational as a cluster until a numerical parameter called quorum is met. As in the baseball example, quorum is the minimum number of "players" needed to play. Quorum is a cluster management parameter whose value determines the cluster system's operability. Typically this operability is derived from the following algorithm:
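The formula itself is missing from this copy of the text. A form consistent with the worked values given in the next paragraph (two votes yield a quorum of 2; three votes yield 2.5, rounded down to 2) is quorum = (total votes + 2) / 2, rounded down. The sketch below assumes that form. The function names are illustrative only and do not come from any cluster product.

```python
def quorum(expected_votes: int) -> int:
    """Votes that must be present before the cluster may operate.

    Assumed rule, reconstructed from the worked values in the text:
    (expected_votes + 2) / 2, rounded down to a whole number.
    """
    return (expected_votes + 2) // 2


def has_quorum(present_votes: int, expected_votes: int) -> bool:
    """True when the votes currently present reach the quorum value."""
    return present_votes >= quorum(expected_votes)


if __name__ == "__main__":
    print(quorum(2))  # two members, one vote each   -> 2
    print(quorum(3))  # three members, one vote each -> 2
```

Under this assumption a two-member cluster needs both votes, while a three-member cluster can keep operating with any two.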
This algorithm derives the already stated "two or more systems" as a minimum for cluster operation. If each of two cluster members casts one vote, then quorum would be 2. Note that quorum is an integer, and the result of the algorithm is rounded down to a whole number. If a cluster consisted of three members and each member had one vote, the algorithm would produce 2.5, which would round down to 2. For a three-member cluster the quorum value should be 2.
There are two cluster configurations that deserve more discussion. The cluster is based on the combination of independent computer systems. This means that some cluster members may be more "desirable" as cluster members than others. Then, there is the situation when a minimal cluster, of only two members, temporarily loses one of the members. Does the cluster system, along with the cluster software, go down? What about that period of time during which one node is "down"? You may need to shut down a cluster member in order to change the hardware configuration. Or, how do you have a cluster while the other system is booting? This is where the quorum disk comes in.
A disk configuration can have a proxy vote. When a cluster system contains a storage system common to two or more computer systems, the cluster manager can vote on behalf of that storage subsystem. Each computer system of this configuration must be capable of accessing this common storage system. Consider a cluster system with a quorum disk. As the first computer system member boots, that system attempts to access the common storage system. During the access, the cluster manager acknowledges the successful access of that storage system by casting a vote on behalf of that storage system (a proxy vote) to the quorum algorithm. With respect to the Microsoft Windows 2000 system, the successful access is usually directed to a specific partition.
The partition needs no restrictions other than a drive letter common to the accessing computer systems. That is, the partition used for quorum could be a single, mirrored, or striped partition. Take a look at Figure 3.2.
Figure 3.2: Quorum disk.
Figure 3.2 shows that even while a single server is booting, one server and a quorum disk can meet a quorum of 2. This is fine. Even with a two-node system, the cluster software and operation remain intact with only one cluster member. However, the availability and reliability factors will be "out the window" during this time when only one cluster member is operable. This is similar to the cluster system achieved by Microsoft's Windows 2000 Advanced Server in a two-node cluster configuration. When the Microsoft Windows 2000 Advanced Server cluster boots, the cluster manager service does not acknowledge the cluster until notifications from at least one cluster member and from the proxy for the quorum disk are received. The proxy vote is generated by the cluster manager upon successful access to the "S:" partition.
When a cluster has three or more members, there is no need for a quorum disk. Quorum can be established by any two of the three computer system members. But what about computer systems of different capacities? Is there some way to favor one computer system member over another?
The beauty of a cluster using the algorithm shown previously lies in its ability to address cluster members of different capacity. Consider a cluster whose members vary in capacity such as symmetric multiple processor servers or large memory servers as shown in Figure 3.3.
Figure 3.3: Quorum example.
In the example shown in Figure 3.3, Server_1 together with either Server_2 or Server_3, being online, could achieve cluster operation. In the absence of the "big" node, both of the smaller systems would have to be online for cluster operation. Note that the quorum disk has a vote of "0." In general, there is no real need for a quorum drive in a three-node cluster.
Check out the details of the symmetric multiprocessor (SMP) computer systems used for www.tpc.org tests. The systems used in the tests for the Transaction Processing Performance Council are the biggest and best systems their respective members can build. Those systems may exceed your computing needs. But what if a large server were available for your cluster operation? The larger capacity servers could be "favored" by increasing the numerical value of the vote parameter. However, in the above example (and because of the large disparity between the computer system capabilities), a different problem could arise. If the "big" server (SMP system with the 16 GB of memory) is offline and inaccessible for a long period of time, the system falls back to a basic two-node cluster configuration. In order to have cluster operation, both of the smaller servers would have to be online to satisfy the quorum algorithm. Cluster operation could "hang" until both members were available. Therefore, there is a problem if the "big" server were to be offline for an extended period of time. The cluster software should be configurable to allow dynamic changes to the parameters for quorum, vote, and quorum disk. The cluster should not require any member to reboot in response to changes to the quorum algorithm parameters.
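The effect of favoring one server with a larger vote can be sketched as follows. The vote values are hypothetical and are not taken from Figure 3.3, and the quorum rule (total votes plus 2, divided by 2, rounded down) is an assumption inferred from the worked values earlier in this section. With these assumed values the two smaller servers cannot reach quorum on their own until the vote parameters are changed, which is one way the "hang" described above could arise and why dynamic reconfiguration matters.

```python
from typing import Dict, Set


def quorum(expected_votes: int) -> int:
    """Assumed quorum rule: (expected votes + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def can_operate(votes: Dict[str, int], online: Set[str]) -> bool:
    """True when the votes of the online members reach quorum."""
    expected = sum(votes.values())
    present = sum(v for name, v in votes.items() if name in online)
    return present >= quorum(expected)


# Hypothetical votes: the large server is favored with a vote of 2.
VOTES = {"Server_1": 2, "Server_2": 1, "Server_3": 1, "quorum_disk": 0}

# Hypothetical reconfiguration made while Server_1 is out for a long period.
VOTES_WITHOUT_BIG = {"Server_2": 1, "Server_3": 1, "quorum_disk": 0}

if __name__ == "__main__":
    print(can_operate(VOTES, {"Server_1", "Server_2"}))              # True
    print(can_operate(VOTES, {"Server_2", "Server_3"}))              # False
    print(can_operate(VOTES_WITHOUT_BIG, {"Server_2", "Server_3"}))  # True
```

Changing the vote table without restarting any member is the kind of dynamic parameter change the text asks of the cluster software.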
3.3 States and transition
A cluster member (of any vendor's cluster product) has at least four discernible states or conditions during membership. These states represent the condition or status of the cluster members to the cluster manager.
• Joining: entering the cluster. This state is sometimes called booting, but the meaning is the same. The cluster membership is increasing and the cluster manager could redistribute the cluster load.
• Exiting: leaving the cluster. This state must be attainable automatically in the event of a cluster member failure. The cluster membership is decreasing, and the cluster management software must recalculate the quorum algorithm to ensure quorum (the number of computers required to operate as a cluster). The cluster manager must redistribute the exiting cluster member's load to the remaining cluster members.
• Running: operable for cluster tasks. A cluster member must be running or ready to receive a cluster task assignment. A cluster member must be in the running state in order to carry out the cluster task.
• Transition: cluster member assignment or reassignment of task. Cluster tasks are assigned and reassigned sequentially to a cluster member. During this time, the cluster member is usually unable to perform other cluster duties, and is therefore unavailable.
Think of the last two states, running and transition, as the "running backs" of a football team. The player (cluster member) has to be "open" (running or ready) to receive the ball (the cluster task). During the time the ball is in the air and throughout the events up to the "catch" and the "hold on to the ball," the play is in transition.
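The four states can be sketched as a small state machine. The state names come from the list above; the table of permitted moves between states is an assumption made for illustration, since the text does not spell out every transition, and real cluster products may differ.

```python
from enum import Enum


class MemberState(Enum):
    JOINING = "joining"
    RUNNING = "running"
    TRANSITION = "transition"
    EXITING = "exiting"


# Assumed transition table (not specified in the text).
ALLOWED = {
    MemberState.JOINING: {MemberState.RUNNING, MemberState.EXITING},
    MemberState.RUNNING: {MemberState.TRANSITION, MemberState.EXITING},
    MemberState.TRANSITION: {MemberState.RUNNING, MemberState.EXITING},
    MemberState.EXITING: {MemberState.JOINING},
}


def next_state(current: MemberState, target: MemberState) -> MemberState:
    """Return the target state, or raise if the move is not permitted."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


def can_accept_task(state: MemberState) -> bool:
    """Only a running member is 'open' to receive a cluster task."""
    return state is MemberState.RUNNING
```

Every state may lead to exiting, reflecting the requirement that exiting be reachable automatically when a member fails.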
3.4 Cluster tasks or resources
A cluster task (sometimes referred to as a resource) is an activity traditionally bound and inherent to single-computer operation. The cluster operation allows these tasks to be distributed and therefore shared by the cluster members. The terms used in the following text may vary from vendor to vendor in their name, but their functionality is the same.
3.4.1 Cluster alias
This is the name shared by the cluster members. As in football, each team member has an individual name (computer name). But once the player becomes a member of the team (cluster), the player plays under the team name (cluster alias). When a computer operates as an individual server, network shares are accessed via:
• \\computer-name\share-name
When a computer is a cluster member, network shares may be accessed via:
• \\cluster-alias\share-name
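The two access forms above differ only in the host part of the path. A minimal sketch, with a hypothetical helper name and placeholder host and share names:

```python
def unc_path(host: str, share: str) -> str:
    """Build a UNC path from a host name and a share name."""
    return "\\\\{}\\{}".format(host, share)


if __name__ == "__main__":
    # Path tied to one physical server.
    print(unc_path("computer-name", "share-name"))
    # Path tied to the cluster alias, independent of which member serves it.
    print(unc_path("cluster-alias", "share-name"))
```

A client that connects through the alias does not need to change its path when the share moves from one cluster member to another.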
3.4.2 Cluster address
This is the 32-bit Internet Protocol address shared by the cluster members. What a convenience! Like a shopping center offering one-stop shopping, the cluster address allows users to worry only about a single address for things like SQL, Exchange, and Office applications. One may argue that an enterprise can already do this by just using one big server. But then what happens if that one big server goes down?
3.4.3 Disk resource This is the name or label given to a disk or disk partition (part of a disk) that can be assigned by the cluster manager to a cluster member (see Figure 3.4).
Figure 3.4: Disk resource.
3.4.4 Cluster service or application
This is the name or label given to an application or service. The cluster manager uses this name to assign this task to a cluster member. It may seem redundant to give a name to an application that already has one, but there is a reason for this. By providing a name, associated parameters can be assigned. These parameters could include:
• Names of cluster members capable of running or permitted to run this application
• Groups or individual users allowed access to the application
• Times or schedule of application availability
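The parameters listed above can be pictured as a record attached to the application's cluster name. The following is a minimal sketch; the class, field names, and sample values are hypothetical and do not reflect any vendor's schema.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Set


@dataclass
class ClusterApplication:
    """Named cluster task with the kinds of parameters listed in the text."""
    name: str
    permitted_members: Set[str] = field(default_factory=set)
    allowed_groups: Set[str] = field(default_factory=set)
    available_from: time = time(0, 0)
    available_until: time = time(23, 59)

    def may_run_on(self, member: str) -> bool:
        """Is this member permitted to run the application?"""
        return member in self.permitted_members

    def is_available(self, now: time) -> bool:
        """Is the application scheduled to be available at this time?"""
        return self.available_from <= now <= self.available_until


if __name__ == "__main__":
    app = ClusterApplication("payroll-db", {"NODE_A", "NODE_B"},
                             {"Finance"}, time(8, 0), time(18, 0))
    print(app.may_run_on("NODE_A"), app.is_available(time(20, 0)))
```

The cluster manager would consult such a record before assigning the task to a member.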
3.4.5 Other resources
Additional miscellaneous resources may include:
• Cluster event messages, which are for error logging and traceability
• Cluster scripts, which are for structuring cluster joining, exiting, and transitions. Typically these take the form of a command procedure and may be written in a scripting language such as Perl.
• Configurable cluster delays, which may be useful to allow "settling" before a "next" action
3.5 Lockstep mirroring Lockstep servers are composed of redundant components. Each area that could fail is replicated via dual processors, memory, I/O controllers, networks, disks, power supplies, and even cooling fans (see Figure 3.5).
Figure 3.5: Lockstep setup.
Each area includes a cross-coupled means of communicating with its redundant area. Each instruction performed by Processor A is mimicked at the exact same time on Processor B. The content of Memory A is an exact mirror of the content of Memory B. Each I/O received by Controller A is duplicated by Controller B. Mirror set B receives the exact same data as Mirror set A. The duplication is complete except in the operation of the network. The network interface cards, while redundant, have one and only one network link active at one time in compliance with network rules. Should the active network interface card fail, the redundant network card would become enabled.
There was a system bearing a resemblance to the diagram in Figure 3.5 called the FT3000, manufactured by Digital Equipment Corporation. This system even had a way to signal an AC power loss. The system had an electromagnet, which was energized with the system power. When energized, the electromagnet suspended a metal plate such that only half of the plate was visible: the half with green paint. When system power was switched off or lost, the electromagnet lost power and the other half of the metal plate slid into view: the half with red paint.
An unstoppable setup? Not exactly. The Achilles' heel of such a server is the operating system itself along with its operating applications. If an application or the operating system executes a fatal system instruction (for example, an improper memory reference), then both processors and both memory areas are affected. This powerful fault-redundant system will crash. This is why fault tolerance itself does not a cluster make. If two such systems, as described, were working in a cluster configuration, that would indeed be a highly available cluster solution. For that reason, Digital Equipment did offer an FT3000 cluster with two FT3000s as the cluster members. To be sure, one had to have a big wallet to consider that solution.
3.6 Replication
Replication, in short, is the reproduction or duplicate writing of data. Microsoft's Windows NT 3.1 had a replication service that allowed an automatic distribution of login scripts and related small-sized (preferably less than 10 Kbytes) accessory files from an Export Server to an Import Server.[1] The Replication service was never meant to be more than a convenience. Still, the idea of automatically distributing data sounds like an excellent method of keeping updated copies of valued files (see Figures 3.6 and 3.7).
Figure 3.6: Replication.
Figure 3.7: Replication, one to many.
Wow! Figure 3.7 shows how to get three perfect copies of the original, or one to many, including a backup and distributions to the branch offices. But wait. What if file.cmd is somehow corrupt, or becomes 0 blocks? With a replication service there is the ever-persistent problem of garbage in, garbage out (GIGO). Whatever is on the export side becomes the import side. The point is that some forms of replication do not provide high availability. Replication has also been used as a general term for sophisticated service applications that can substantially increase the user's availability and reliability. Services of this nature can replicate a volume, a file, or even a mirrored image.
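The garbage in, garbage out problem can be demonstrated with a toy one-to-many replicator. This is a sketch using temporary directories and plain file copies, not a model of the Windows NT Replication service; the directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def replicate(source: Path, import_dirs: List[Path]) -> None:
    """Copy the export-side file to every import directory, as is."""
    for directory in import_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, directory / source.name)


def demo() -> List[int]:
    """Replicate a good file, damage it, replicate again; return copy sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        export = root / "export"
        export.mkdir()
        source = export / "file.cmd"
        source.write_text("echo logon script\n")
        targets = [root / "backup", root / "branch1", root / "branch2"]
        replicate(source, targets)   # three good copies
        source.write_text("")        # the original becomes zero length
        replicate(source, targets)   # the damage is faithfully distributed
        return [(t / "file.cmd").stat().st_size for t in targets]


if __name__ == "__main__":
    print(demo())  # [0, 0, 0]
```

All three copies end up empty, because the replicator has no notion of whether the export-side content is valid.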
Volume replication (see Figure 3.8) provides the ability to replicate or copy changes to a volume to a remote location. The term "volume" is the logical reference of one or more partitions. Note that a letter identifies the volume. Note also that the destination partition does not necessarily have the same letter as the source. The connectivity is by traditional network or Fibre Channel.
Figure 3.8: Volume replication. Still, users must keep in mind that drive replication is not mirroring. Mirroring (see Figure 3.9) requires two equally sized partitions on two physical drives. The partition drive letter on both partitions will be the same. This is not the case with replication.
Figure 3.9: Partition mirroring. [1] The Replication service discussed here is not part of the Microsoft Windows 2000 product line. Replication can be one to one or one to many and is offered as a layered product service by many vendors.
3.7 Shared disk and shared nothing disk
In Chapter 2, "Crystallizing Your Needs for a Cluster," a choice was presented for the storage subsystem: a shared disk subsystem (see Figure 3.10) or a shared nothing disk subsystem (see Figure 3.11). Briefly reviewing that information, a shared disk is a disk that can accommodate simultaneous access from two or more computer systems. To prevent one system from "stepping" on the data of the other system, each system has a copy of a lock management database to synchronize the access.
Figure 3.10: Shared disk.
Figure 3.11: Shared nothing disk.
The shared disk (see Figure 3.10) has the following characteristics:
• Minimal application adaptation
• Fast access by either cluster member in a read-intensive environment
• More expensive to implement than shared nothing
A shared nothing disk is a disk that accommodates access from two or more computer systems, but never simultaneously. This means that at any given time, one and only one computer system has exclusive access to the disk. Control is granted to an accessing cluster member by switching control. For SCSI-3 connections, the switching is done at the SCSI-3 controller. Switching must be kept to a minimum for optimum performance. With the advent of SCSI-3, especially Fibre Channel, the speed and ability to switch access control from one computer system to another became noteworthy. Even moderately priced storage subsystems with SCSI-3 controllers are able to change control between cluster members in less than five seconds.
The shared nothing disk (see Figure 3.11) has the following characteristics:
• Typically requires application adaptation
• Switching between servers impedes performance and should be minimal
• Less expensive to implement than shared disk
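The exclusive-ownership rule of a shared nothing disk can be sketched as a small class. The class and member names are hypothetical; in a real system the switch is performed by the storage controller, not by application code.

```python
class SharedNothingDisk:
    """Disk that exactly one cluster member may access at any given time."""

    def __init__(self, name: str, owner: str) -> None:
        self.name = name
        self.owner = owner
        self.switch_count = 0  # switches cost time, so keep this low

    def access(self, member: str) -> str:
        """Allow access only to the current owner."""
        if member != self.owner:
            raise PermissionError(f"{member} does not own {self.name}")
        return f"{member} accessing {self.name}"

    def switch_owner(self, new_owner: str) -> None:
        """Hand exclusive control to another cluster member."""
        if new_owner != self.owner:
            self.owner = new_owner
            self.switch_count += 1
```

Counting the switches makes the performance point visible: each change of ownership is an interval during which neither member is doing useful I/O.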
3.8 SAN versus NAS
The storage subsystem of a cluster solution provider may offer two basic alternatives to local system storage. These two alternatives have entered the marvelous world of buzzwords. They are called the storage area network (SAN) and network attached storage (NAS). Both of these storage systems have been around for quite a while. A storage area network is a dedicated network for moving data between heterogeneous servers and storage resources. Figure 3.12 shows an example of a storage area network according to the definition, but an "early" edition. Figure 3.13 shows a more modern version of a storage area network.
Figure 3.12: Early storage area network.
Figure 3.13: Storage area network.
Some readers may recognize that in Figure 3.12, HSC stands for hierarchical storage controller. These controllers were computers in themselves, specialized to serve disk and tape devices. These separate "intelligent" blocks between computer system and storage took the "serving" load off the computers. The interconnection was much like the interconnection for cable TV. Cable TV companies harp about thieves "stealing" cable because it hurts legal customers. The distribution of cable TV is accomplished by a strong multiplexed signal transferred to a transformer, much like a spring flowing into a pool. The clients withdraw the signal from the transformer like wells tapping into the pool. Actually the process is called inductive coupling. The point is that energy is pushed into the transformer and taken out. If too many clients (especially "unknown clients" or cable thieves) tap into the transformer, then the paying customers will notice their picture quality decrease.
The transformer in this case is the ring called a star coupler. This is actually a toroidal transformer that the computers and hierarchical storage controllers connect to like clients to cable TV. As with cable TV, the number of clients is limited. Storage area networks commonly use SCSI or Fibre Channel at speeds greater than 160 MB/second for the interstorage communication.
Network attached storage (NAS) (see Figure 3.14) consists of an integrated storage system (e.g., a disk array or tape device) that functions as a server in a client/server relationship via a messaging network. Like storage area networks, network attached storage is not "new." This type of storage includes two variations: indirect and direct network served devices.
Figure 3.14: Network attached storage.
Indirect storage should look familiar to some readers as it follows the form of a dedicated computer server offering a directory, partition, disk, or disks as a network share. This model of indirect network attached storage is not meant as a bias toward Microsoft. Sun Microsystems, Inc. was the original developer of the Network File System (NFS). Network shares and the "universal naming convention" as used by Microsoft operating systems can be traced to IBM and a product called LAN Manager.
Direct storage does not require an intermediary computer server. One of the first implementations of this type of storage was the local area disk introduced by Digital Equipment Corporation. A major limitation of that implementation was its restriction to the proprietary local area terminal protocol. Network attached storage is a variation on network attached printing. Hewlett Packard introduced an impressive method of circumventing the need for a computer as a print server using data link control protocol.
3.9 Summary This chapter set out to define and illustrate some of the mechanisms used by cluster systems. By understanding these cluster "mechanisms," readers can determine which of them would best address their own cluster needs.
Chapter 4: Cluster System Classification Matrix
4.1 Introduction
We begin by reviewing the content of the preceding chapters.
• Chapter 1, "Understanding Clusters and Your Needs," defined what a cluster is and what it is not.
• Chapter 2, "Crystallizing Your Needs for a Cluster," provided guidelines for your computer system's needs.
• Chapter 3, "Mechanisms of Clustering," discussed terms and tools used by clusters.
It is now time to see what type of cluster is best suited for your needs. Or, to rephrase, now that you've listed what you need and looked at some of the cluster mechanisms, it's time to look at what is available. With all the different ideas, methods, and implementations of "clusters," what kind of cluster is right? Let's take the information we have defined and described and attempt to classify "clusters" in accordance with their capabilities.
Clusters have been available since 1983. So, why haven't clusters been more commonplace? Naturally, money had a lot to do with it. More specifically, what was available from technology and at what cost? Sure, clusters have been available for quite a while, but the downward spiral of the cost did not begin until 1993. Now that is an interesting year: the year that Microsoft introduced Windows NT 3.1. Now, more than ever, various vendors are offering various "cluster" systems, all claiming to be "the" cluster system. The problem is that there is such diversity and choice that there is a need to sit down with paper and pencil, list what is needed, and determine whether the cost of satisfying that need would be justified. Anyone involved in the computer purchase would invariably have to justify that purchase. An expected return on investment must be clear. A suggested method would be to express the needs and make a list in a spreadsheet fashion with column headings such as:
Minimal Acceptable Desired Nice to Have
Each row under the columns would contain those attributes that meet with the respective column headings.
This chapter's intent is to present a "Cluster Capabilities Spreadsheet" or matrix, which illustrates what is available by the various classes of clusters and their associated characteristics. This is achieved by classifying different "levels" of cluster capabilities. The purpose is to present a suggested "class" of cluster for a given need. For example, the stated definition of a cluster provides for an overall availability and reliability in accordance with user needs. An initial generalization would be a simple graph of a cluster's capability against availability and reliability. Subsequently, a cluster purchaser could identify what type of cluster would satisfy the needs chart as suggested above. The matrix is shown in Figure 4.1.
Figure 4.1: Cluster classification matrix, cluster classes.
Each vertical level of increase is an increase in capability and is therefore considered a different "class" of cluster. So what demarcates one class from another? What would be the characteristics of each class? Remember that the purpose is to present a recommendation in accordance with needs. Therefore, what example would drive the need for a given cluster class? If your need for a cluster is not critical, then the "class" of cluster (and the cost) will be less than the class of cluster required when human life is at stake and money is no object. Limiting the cluster classification matrix to four levels is meant to delineate and provide guidance, not discrimination. One way to provide the delineation intended is to give examples of the cluster matrix levels. Figure 4.2 provides characteristics and examples of each of the cluster matrix levels.
Figure 4.2: Cluster classification matrix, examples.
4.2 Cluster review For a quick review, consider the cluster system definition. A cluster system is composed of three subsystems: server, storage, and interconnect. These subsystems are built from cluster components. A component is a discrete member of a cluster subsystem. For example, a server could be a component of the server subsystem, because the server subsystem is composed of two or more servers. A RAID array could be a component, because a RAID array is an example of a type of storage component. Clustering software is an example of a cluster component. Clustering software is a very necessary component of the cluster server subsystem. Without clustering software, the cluster server subsystem would not be able to communicate with the constituent cluster servers. Finally, attributes are enhancements of a given operating system; they could aid in the operation of a cluster but are normally inherent to any general, non−clustered environment. Since these attributes may be present in any of the operating systems that have cluster capability, they should be considered as a base on which cluster systems can be constructed.
4.3 Classes
4.3.1 Cluster plus
Cluster plus is a cluster that offers transparent failover. A cluster system offering this class of clustering would be capable of transferring any or all applications from one cluster computer member to another with no apparent (at least to the user) latency. An entire cluster member could fail, and the user would never know. When should a cluster be a cluster plus and offer transparent automatic application failover? Whenever human life, certain businesses, or manufacturing process control are at risk, this class of cluster would address those needs. Clearly, however, an operation that can withstand as much as 10 minutes of downtime during normal business hours doesn't need 100 percent availability. But if the computer is controlling a hospital's central cardiac monitoring system and you happen to be a patient, the last thing you want to hear is a nurse saying, "Oh, it does that once in a while; just shut it down and reboot the system." Human life considerations should always be a driving factor for justifying a cluster plus. It goes without saying that this kind of capability and complexity may require a significant investment. But, then, this class of cluster is mandatory when human life is concerned or when the process controlled absolutely requires this level. This expenditure becomes a nonissue in the previously mentioned process control operation involving 5,000 tons of molten steel per ladle in a continuous pour process.
4.3.2 Cluster Cluster is the class that includes the three cluster subsystems and components as previously discussed. The example cited for a cluster, "business critical," may introduce a gray area (no pun intended by the graph color). Business critical translates to time lost, and time lost translates to revenue loss when the cluster system is unavailable. The degree of criticality of this is in direct proportion to how much revenue as percentage of the business income is lost per unit of downtime. The finance end of your business or manufacturing process can provide the figures to support the simple formula of:
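The formula itself is missing from this copy of the text. A form consistent with the figures in the next paragraph (a possible gross of $100,000 at 99 percent system operation time loses $1000 per month) is: revenue lost = possible gross x (100 percent minus system operation time). The sketch below assumes that form; the function name is illustrative.

```python
def monthly_revenue_lost(possible_gross: float, uptime_percent: float) -> float:
    """Revenue lost per month to downtime.

    Assumed formula, reconstructed from the worked figures in the text:
    possible gross multiplied by the fraction of time the system is down.
    """
    return possible_gross * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    print(monthly_revenue_lost(100_000, 99.0))  # 1000.0
    print(monthly_revenue_lost(100_000, 95.0))  # 5000.0
```

The same function can be used to compare the monthly loss at a given uptime against the added cost of a more capable cluster class.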
If your monthly "System Operation Time" is 95 percent, then your possible Gross Proceeds are reduced by 5 percent! Let's say your monthly Possible Gross is $100,000 and your system is 99 percent operational. Your system is still costing you $1000 per month due to that 1 percent of monthly downtime. Naturally, as the percentage of "uptime" increases above 99 percent toward the "five 9's," or 99.999 percent, the system complexity (and cost) increases. In addition to these calculations, there is an expected time of unavailability (after all, no cluster is perfect). And herein lies another "gray area" specific to this class. To perform as the class cluster, the system must meet the definition of a cluster. But inherent to providing the characteristics of a cluster is time latency. As cluster members enter or leave the cluster or take on cluster tasks, there is always a latency period. Depending on the cluster software, this could be anything from a few seconds to a few minutes. What is an acceptable delay of operation? The answer to this question helps phrase the right question to ask of your would-be vendor. Further, this may be the factor guiding your choice of one cluster software package over another.
There are three general reasons for the latency period of cluster unavailability. These correspond to the cluster states of joining, exiting, and transition, as discussed in Chapter 3. Usually, the exiting state, due to a cluster member failing, is considered the worst case for latency. Therefore, consider the following example. Vendor A specifies a worst-case parameter of 45 seconds for failover. Vendor B promises a worst case of 30 seconds, but at a price that exceeds what a 15-second improvement in failover time is worth. Again, what is the allowed time loss for this cluster system? If the business need can withstand 55 seconds of failover time, then the choice becomes a no-brainer. Remember that failover (the time it takes the cluster to properly handle an exiting cluster member and regain cluster functionality) is only one of the reasons for latency.
With all this to consider, a second matrix of just the cluster class needs to be expressed. This next matrix is meant to provide a graph of "decreased latency" versus increased cost. Decreased latency may not be a term commonly found in a cluster product description, but it does correspond directly with the terms availability and reliability. In order to provide this functionality, the cluster mechanisms and architecture have increased complexity. Therefore, the next matrix (see Figure 4.3) is titled "Cluster complexity and capability."
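The vendor comparison earlier in this section amounts to choosing the least expensive offer whose worst-case failover time fits the business need. A minimal sketch follows; the failover times come from the text, while the prices are hypothetical placeholders, since the text gives none.

```python
from typing import NamedTuple, Optional, Sequence


class Offer(NamedTuple):
    vendor: str
    worst_case_failover_s: int
    price: float  # hypothetical figure, for illustration only


def cheapest_acceptable(offers: Sequence[Offer],
                        allowed_failover_s: int) -> Optional[Offer]:
    """Least expensive offer that fails over within the allowed time."""
    acceptable = [o for o in offers
                  if o.worst_case_failover_s <= allowed_failover_s]
    if not acceptable:
        return None
    return min(acceptable, key=lambda o: o.price)


# Failover times from the text; prices are assumptions.
OFFERS = [Offer("A", 45, 40_000.0), Offer("B", 30, 65_000.0)]

if __name__ == "__main__":
    print(cheapest_acceptable(OFFERS, 55).vendor)  # A
```

If the allowed failover time were tighter than 45 seconds, the same function would select vendor B despite the higher price, and it returns nothing when no offer qualifies.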
Figure 4.3: Cluster complexity and capability.
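The downtime arithmetic and the vendor comparison discussed in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the model assumes that lost proceeds scale linearly with downtime, as the $100,000 example does.

```python
def monthly_downtime_cost(possible_gross, uptime_percent):
    """Revenue lost per month when the system is up only uptime_percent of the time."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return possible_gross * (100.0 - uptime_percent) / 100.0


def vendors_meeting_need(worst_case_failover_s, tolerated_s):
    """Names of vendors whose worst-case failover fits within the tolerated delay."""
    return sorted(name for name, secs in worst_case_failover_s.items()
                  if secs <= tolerated_s)


# Figures from the text: $100,000 possible gross at 99 percent uptime.
print(monthly_downtime_cost(100_000, 99.0))            # 1000.0

# Vendor A: 45 s worst case, vendor B: 30 s; the business tolerates 55 s.
print(vendors_meeting_need({"A": 45, "B": 30}, 55))    # ['A', 'B']
```

When both vendors fit within the tolerated delay, as they do here, the decision reduces to price, which is the "no-brainer" case described above.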
4.3.3 Cluster lite Cluster lite is a name that could be given to a system that has cluster components and even one or more (but less than three) cluster subsystems. The term cluster lite is in no way meant to be disparaging. Examples that fit this level of cluster may exhibit high availability and reliability characteristics but simply do not meet the requirements of our defined term, cluster. The following two examples are meant to clarify the purpose of this class. The first example has a cross−coupled pair of storage arrays in order to satisfy a high availability need. This system is designed to provide availability in the event of storage or controller failure. The system has an identifiable storage subsystem and an interconnect subsystem but only one server. This configuration (shown in Figure 4.4), while certainly providing high availability, would still be considered a cluster lite system. This system could address a critical job, which may involve a single or multiple individuals. The operation of this job, while critical in nature, does not require a complete cluster. The illustration shows a computer that is performing a job so critical that the computer is configured with multiple controllers and storage arrays. Perhaps this computer is a SQL server with a database of information that cannot be lost. This configuration adds reliability to the SQL job.
Figure 4.4: Cluster lite example 1.
The system of Figure 4.4 has an obvious single point of failure (SPOF), and that is the computer system. But perhaps the need does not justify the cost of a second system and the cluster software. The configuration shown in Figure 4.5, while certainly providing high availability, would also be considered a cluster lite system. This system represents a type of configuration known as replication. Each replication member stores an exact copy of the data to support availability and reliability. This system is good as long as the replication members remain intact. One single point of failure of this system could be the network. This is not a cluster, because there is no means of cross communication between the members' respective interconnect and storage systems. However, this configuration supports multiple replication members and software, so support of this configuration is economical.
Figure 4.5: Cluster lite example 2.
The last example (shown in Figure 4.5) deserves another look. Network technology is evolving at an astonishing rate. What used to be a 10 Mbit/second (old Ethernet) standard is now giving way to a Gigabit/second (IEEE 802.3ab) standard. Storage controllers are integrated with the motherboards and processors they serve. What if this "network" model were slightly changed and terms such as "virtual interconnect" were added? Theorizing: a virtual interconnect would be a hardware/software interconnect implemented as a functional part of the network interconnect. Perhaps it would look like Figure 4.6.
Figure 4.6: Cluster future?
Does the figure look as though it would satisfy the requirements of a cluster? A computer cluster is a system of two or more independent computer systems and storage subsystems intercommunicating for the purpose of sharing and accessing resources.
4.3.4 Attributes
Attributes represent that class of enhancements inherent to an operating system that cause that operating system to be cluster capable. To be cluster capable, an operating system should have certain attributes as a base for cluster software. Attributes add features or functions to the cluster system that, while desirable, do not directly contribute to clustering according to our definition. Therefore, operating systems that offer such features as "built-ins" naturally lend themselves to the capability of the higher cluster classes. Such attributes include the following:
• Central management of users and computers
• Backup and storage management tools
• Dynamic linked libraries for customized user and process control
• System driver development kits for customized system control
The Microsoft Windows 2000 operating system has an excellent central user and computer management facility called Active Directory, which readily lends itself to cluster management. Windows 2000 has an integrated backup tool (much better than that in NT 4.0) that allows disk-to-networked-disk backups with an integrated scheduler. Microsoft's Windows 2000 also supplies extensive dynamic linked libraries and software driver development kits. Because of these attributes, third-party vendors such as Legato Systems are able to offer cluster software as a layered middleware software product add-on to Microsoft Windows 2000 Server. Microsoft implements its version of cluster software only on the Microsoft Windows 2000 Advanced Server and Microsoft Windows 2000 DataCenter products.
4.4 Cluster or component or attribute? Four general cluster classes hardly seem sufficient to pin down or restrict a manufacturer's cluster method or implementation. There are a myriad of characteristics that could be matrixed and considered in accordance with high availability and reliability needs. Perhaps your needs do involve process control, but your operation does not endanger human life; maybe it affords a time window, but the time window is small. Your needs require more than a cluster (as defined), but not a cluster plus. The next matrix (see Figure 4.7) offers a refinement to the cluster matrix and introduces some cluster methods and method qualifiers.
Class: Cluster Plus; Cluster; Attributes
Cluster Method: DLM enhanced application; DLM NOS; DLM NOS; Handoff (Active/Active); Handoff (Active/Passive); Handoff (Active/Passive); Replication; Dual port access; Lock step mirror
Example: Oracle 9i parallel server; Compaq OpenVMS; Compaq TruCluster Tru64 UNIX; MSCS as implemented in DataCenter; MSCS Advanced Server Cluster; Legato Cluster Enterprise; Veritas, Legato (Vinca, Octopus); SCSI RAID Array; Interconnect; Network (redundant NIC); Cluster Software; Power; Fibre Channel Hub/Switch; Marathon Endurance; Management Software; Network Load Balance (formerly WLBS); Cisco Redirector; Resource Monitor
Figure 4.7: Cluster classification matrix (examples).
Please note that the class cluster lite was not inadvertently dropped from the matrix. Cluster lite is a system that has some but not all of the subsystems of a cluster.
4.5 Cluster products 4.5.1 Marathon Technologies Marathon Technologies offers an excellent high−availability solution. This solution is an excellent example of "lock step mirror" technology. Each part of a Marathon server is hardware redundant. This means that each instruction and each I/O is performed by two separate but interconnected machines such that if any server component fails, the overall operation will succeed. Marathon features dual processor, redundant computing engines, redundant I/O processing, mirrored RAID storage, and built−in redundant network connections.
4.5.2 Microsoft Cluster Service (MSCS)
Microsoft's Cluster Service is by default the industry-standard solution for clustering Microsoft's Windows operating system, for both the NT and 2000 versions. From a practical point of view, one must consider the fact that there will be many people in the work force trained in Windows 2000 MSCS. This will be an important factor to companies simply because clusters are by nature much more complicated to set up and administer than simple servers. Clusters involve more equipment and have an inherent complexity in their operation. There will always be products on the market that address niche requirements by providing functionality beyond that which Microsoft designed into the Microsoft Cluster Service. It is hoped that the cluster comparison matrix will help you to identify and categorize the many different products. Remember that these products either provide clustering functionality directly or combine with other products to enhance the availability and reliability of Microsoft's Windows server products. The best consumer is an informed one, armed with loads of product information to help in making a more knowledgeable decision.
4.5.3 Compaq cluster software DLM is the acronym for distributed lock manager, which represents the heart and soul of Compaq's Network Operating System (NOS) cluster−capable operating system called VMS. The lock manager maintains a database of each cluster resource and its usage. The purpose of the lock manager is to provide a mechanism to synchronize resource access to avoid corruption. This same cluster resource access method is applied in Compaq's Tru−Cluster (UNIX). Because this database is distributed to all cluster members, the loss (or gain) of any one cluster member (or node) in a multinode clustered environment would result in a cluster transition. During the cluster transition, the lock manager database management adjusts to the cluster membership change and preserves the synchronized resource access.
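To make the idea of a lock database concrete, here is a minimal, single-process sketch in Python. It is not Compaq's implementation: the class name, the two lock modes, and the method names are all assumptions chosen for illustration, and a real distributed lock manager would replicate this table across every cluster member.

```python
class LockManager:
    """Toy lock table: one entry per resource, recording the mode and the holders."""

    def __init__(self):
        self._locks = {}  # resource -> (mode, set of holder nodes)

    def request(self, node, resource, mode):
        """Grant a 'shared' or 'exclusive' lock if compatible; return True if granted."""
        if mode not in ("shared", "exclusive"):
            raise ValueError("mode must be 'shared' or 'exclusive'")
        held = self._locks.get(resource)
        if held is None:
            self._locks[resource] = (mode, {node})
            return True
        held_mode, holders = held
        if mode == "shared" and held_mode == "shared":
            holders.add(node)      # shared readers can coexist
            return True
        return False               # any exclusive involvement means a conflict

    def release(self, node, resource):
        held = self._locks.get(resource)
        if held is None:
            return
        holders = held[1]
        holders.discard(node)
        if not holders:
            del self._locks[resource]

    def member_exited(self, node):
        """Cluster transition: drop every lock the departed member held."""
        for resource in list(self._locks):
            self.release(node, resource)
```

The `member_exited` method stands in for the cluster transition described above: when a node leaves, the lock database is adjusted so that the surviving members can still obtain synchronized access.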
4.5.4 Veritas software Veritas (a subsidiary of Legato Systems) has a number of clustering (shared access) and replication products, which provide high availability and uninterrupted access. Replication means that a copy of those resources determined for high availability is stored on each cluster member.
4.5.5 Legato software Legato has unique storage area network (SAN) products as well as clustering and replication products (Vinca's Octopus for one). SAN products allow WAN high−availability solutions.
4.5.6 Other considerations Whatever your decision for implementation, a considerable amount of expertise and effort should be involved in designing, implementing, monitoring, reconfiguring, and expanding your final choice. Most of the third−party clustering products currently available claim compatibility with Microsoft Systems Cluster Service (MSCS) by using the APIs provided by MSCS to allow them to work hand−in−hand with Cluster Service. Because of Microsoft's dominance in the market, MSCS will become the standard for clustering. Microsoft will be able to apply its successful business model of high volume and low cost per box. In the end, the consumer will surely benefit from this approach. Microsoft Cluster Service had a rather slow start. There are a couple of reasons for this. First, when Wolfpack (a cluster product produced by Microsoft) was initially released, there were already quite a few products that provided "cluster−like" capabilities. These products were familiar to some people and were priced very competitively with MSCS. There was also the "cart−before−the−horse" syndrome taking place. Windows NT could be clustered all right, but there were no real "cluster−aware" applications available that would give customers the incentive to buy MSCS. In response to this, Microsoft promised a three−phase release of Cluster services for Windows NT. The first release of MSCS under Windows NT 4.0 was designed from the start to increase the availability of Windows NT services to the end user. The release of Windows 2000 Advanced Server and Windows 2000 DataCenter marks the second phase of cluster services for Microsoft. The third phase is yet to come. A quick summary of the offerings of the various vendors is presented in Tables 4.1 through 4.5.
Table 4.1: Microsoft
Product Type: Microsoft Windows 2000 Advanced Server (built-in)
Class: Cluster (software-based solution)
Positive Characteristics: Built-in configurable with proper hardware; Will be the market baseline; Will become a commodity product; Wide support
Negative Characteristics: Does not have a DLM; Software applications must be written specifically to the cluster's API to achieve full benefits from being clustered; Cluster failover time is equal to the time it takes to start the application's services plus the cluster failover time
Requirements for Implementation: Hardware from the HCL
Qualitative amount of expertise and effort to design, implement, monitor, reconfigure, expand, etc.: Need senior NT system administrator; System administrator experienced with technologies such as SCSI, Fibre Channel, ServerNet
Table 4.2: Legato Cluster Enterprise
Product Type: Legato Cluster Enterprise
Class: Cluster (software-based solution)
Positive Characteristics: Built-in configurable with proper hardware; Works with Microsoft Windows 2000 Server; Wide support
Negative Characteristics: Does not have a DLM; Software applications must be written specifically to the cluster's API to achieve full benefits from being clustered; Cluster failover time is equal to the time it takes to start the application's services plus the cluster failover time
Requirements for Implementation: Hardware from the HCL
Qualitative amount of expertise and effort to design, implement, monitor, reconfigure, expand, etc.: Need senior NT system administrator; System administrator experienced with technologies such as SCSI, Fibre Channel, ServerNet
Table 4.3: Legato Octopus (Vinca)
Product Type: Add-on or Standalone
Class: Cluster lite
Method: Data replication
Positive Characteristics: Relatively low-cost solution; No special hardware required; Software-only solution; Very easy to install and configure
Negative Characteristics: Effectively an active/passive solution; Potential for loss of data due to time lag for replicated data
Requirements for Implementation: Runs on any standard Windows installed operating system; High-speed link between replication servers
Table 4.4: Compaq Cluster
Product Type: TruCluster UNIX and VMS
Class: Cluster plus (distributed lock manager)
Positive Characteristics: Built-in configurable with proper hardware; Time-tested (since 1983) solution
Negative Characteristics: Support only on Compaq hardware; Cost of product; Cost of administration
Requirements for Implementation: Proprietary software and hardware
Qualitative amount of expertise and effort to design, implement, monitor, reconfigure, expand, etc.: Need senior system administrators
Table 4.5: Compaq Intelligent Cluster Administrator[a]
Product Type: Add-On
Class: Cluster lite (management, monitoring)
Positive Characteristics: Can run standalone or with Insight Manager
Negative Characteristics: Supports only MSCS on packaged Compaq clusters
Requirements for Implementation: Install software
Qualitative amount of expertise and effort to design, implement, monitor, reconfigure, expand, etc.: Unknown, not tested
[a] Compaq Intelligent Cluster Administrator is really an attribute that comes with a Compaq preconfigured Microsoft Windows 2000 Advanced Server or DataCenter product. This offering's pertinent feature is the ability to manage multiple MSCS clusters from a single Web browser. This characteristic is a monitoring and management tool. Realtime event monitoring, however, requires the use of Compaq's proprietary Insight Manager XE.
4.6 Summary
Hopefully, the matrices and associated commentary provided in this chapter have helped to bring some insight and clarification to your computing needs. Matching your computing needs to a cost-effective configuration is usually a planning venture involving more than one person. Effective configuration involves the cooperative effort of a CIO spokesperson, a CFO spokesperson, and a CEO spokesperson. One last word of caution remains to be stated. The configuration of the present may not fit the needs of the future. Whatever is decided, the system should be scalable. Scalability is an important factor, one that is not necessarily present in the current vendor offerings. Be careful of cluster offerings that limit the configuration to two cluster members. Consider cluster system interconnects that address more than two members, such as Fibre Channel.
Bibliography
For more material please see the following sites:
http://www.compaq.com
http://www.legato.com
http://www.microsoft.com
Chapter 5: Cluster Systems Architecture 5.1 Introduction In this chapter, we will talk about clustering from a systems architecture perspective using Microsoft's Cluster Service as our example. We will discuss the design that Microsoft used to implement clustering solutions for Windows NT/2000 that is known as Cluster Server. We intend to give the reader a basic understanding of how the system works and why the software developers chose particular architectures for their implementation of Windows NT/2000 Clustering. We will concentrate on the nature or characteristic of a Cluster implementation rather than discussing how to simply install or manage a cluster. Although our primary focus is Windows NT/2000 Clusters, we occasionally reference the clustering technology used on other operating systems during the past 10 to 20 years. We hope that by understanding some of these other successful cluster designs you will be better prepared to understand the different design tradeoffs that have been made in designing and implementing Windows NT/2000 Clustering.
5.2 Cluster terminology There is a very good chance that some of the terminology that we use in this book will be new or at least a little different from what you are accustomed to. At this point, then, it will be helpful to define some of the terminology related to clustering. One problem is that some vendors typically use definitions for clustering that best promote their products, while other vendors may use a different definition or name for the same thing. We can only imagine the potential for conflicting terminology at the new and bigger Compaq. Now that Compaq has absorbed the former Digital Equipment and Tandem Computer Corporations, you will likely get different definitions depending on which building you visit at Compaq. People who come from a Tandem or Digital background and who have been managing clusters for the past 10 to 20 years will have in their minds definitions that are based on their respective clustering technology and products. Today, the architectures used to build e−commerce solutions have different requirements; hence, their designs and implementations will naturally be geared to meeting the needs of Internet ISPs. The WEB−centric applications have their own unique requirements for implementing high−availability solutions. The way this is done today is slightly different from the way it was done in the past. It is essential for those of us who have been in the business for some time now to approach today's e−commerce solutions that are built on Windows NT for what they are and not for what they are not.
5.2.1 Cluster nodes or cluster members
The term "cluster node" is frequently used as the name given to a computer when it is linked with other computers to form a cluster. When standalone independent computers are configured into a cluster they are referred to as "cluster nodes," or "cluster members." A cluster node is, in effect, the basic building block upon which clusters are built. Two or more computers that share a common storage system, have a communications interconnect, and are managed with an operating system that is enhanced with clustering software are referred to as a cluster. It takes a little more than just connecting communications links and a storage area network between independent computers to consider them a cluster. You must register a computer system with the Cluster Database as part of the cluster server software installation before it can be referred to as a "cluster node" or a "cluster member." Microsoft refers to a machine that is in the Cluster Database as a "defined cluster member." It is not until the node is up and running and actually participating in the operations of the cluster that Microsoft refers to it as an "active cluster member." In our terminology, we refer to it as the "cluster member being online." If the "active cluster member" goes offline, then Microsoft again refers to it as a "defined cluster member." We simply say that the cluster node is "offline." Some of Microsoft's terms may be a little bit confusing if you come from a background of UNIX or VMS clustering. Therefore, you might have to do a little translating here and there.
5.2.2 Active cluster member
A cluster node is referred to as an "active cluster member" after it joins a cluster or forms a cluster of its own. It is possible for a cluster to consist of only a single node, although that usually does not occur unless there is a failure in a two-node cluster. During the process of setting up a cluster, each system that intends to participate (become a member) in a cluster must register itself with the cluster it intends to join. The registration information is stored in the Cluster Database, which is actually a set of entries in the Windows NT/2000 registry. However, unlike the normal registry entries on a standalone system, in a cluster these cluster state entries are distributed to every cluster node and entered into the cluster log file on the cluster quorum device. The software that is responsible for two or more Windows servers forming a cluster is known as the "Cluster Service." The Cluster Service consists of various software modules, each of which implements specific functions that are necessary to coordinate the interoperation between the nodes in a cluster. You can think of the Cluster Service as the brain of the cluster. When a computer system boots up, it will try to join the cluster that it is registered with. If that fails, then it may assume that it is the only node available to form a cluster and will attempt to form a cluster on its own. If it is successful at either joining or forming a cluster, it is referred to as an "active" cluster node.
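The join-or-form decision described above can be sketched as a small state function. This is a simplified illustration, not the Cluster Service's actual logic: the names are hypothetical, and the `can_form_alone` flag is an assumption standing in for whatever check (for example, gaining control of the quorum device) permits a lone node to form a cluster.

```python
from enum import Enum


class MemberState(Enum):
    DEFINED = "defined"  # registered in the Cluster Database, not participating
    ACTIVE = "active"    # has joined or formed a cluster


def start_cluster_service(node, cluster_database, active_members, can_form_alone):
    """Try to join the registered cluster; otherwise try to form one alone.

    Returns (state, action), where action is 'joined', 'formed', or 'offline'.
    """
    if node not in cluster_database:
        raise ValueError(f"{node} is not registered in the Cluster Database")
    if active_members:            # a cluster is already running: join it
        active_members.add(node)
        return MemberState.ACTIVE, "joined"
    if can_form_alone:            # no one else is up: form a one-node cluster
        active_members.add(node)
        return MemberState.ACTIVE, "formed"
    return MemberState.DEFINED, "offline"


registered = {"NodeA", "NodeB"}
active = set()
print(start_cluster_service("NodeA", registered, active, True)[1])   # formed
print(start_cluster_service("NodeB", registered, active, True)[1])   # joined
```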
5.2.3 Cluster resources A resource can be either a physical or a logical object that is used by the cluster to provide services to client systems. For example, an application such as SQL Server, an IP address, or network names are examples of logical resources. Shared disk, RAID arrays, and I/O interfaces are examples of physical resources. The major limitation that exists with Cluster Service Phase I is that a cluster resource can be owned and controlled by only one node at a time. This means that if you need to provide a high−availability solution for SQL Server, for example, then SQL could run on only one node in the cluster at a time. Phase I of Microsoft's clustering initiative was targeted at availability not scalability. This type of solution does provide for high availability but does not address scalability issues. If a node fails for some reason, then the Cluster Service on another cluster node could start the SQL Server resource on that node, thereby maintaining database services to network clients. That limitation also means that the time required to failover was partly dependent on how long it would take to start up the applications on the new server. In the case of our example, it meant starting SQL Server from a cold start. Ideally, you would prefer to have SQL Server already running on every node in the cluster. In that way, it would be necessary to failover only the actual databases, thereby saving a lot of time waiting for a cold start of SQL Server. The other obvious benefit of having multiple instances of a resource such as SQL Server is that each instance could be actively servicing users and thereby sharing some of the processing load.
5.2.4 Resource groups
Typically, each node of a cluster will have many different types of resources. Each software application needs a defined set of resources to function. For example, if we wanted to set up a Web server, we would need at least an IP address, the Web server application, a disk, and a database system. All of these things are considered cluster resources. We call this collection of resources, which are all necessary for an application to function, a "resource group." The reason for organizing the individual cluster resources in a cluster into groups is to make it easier for system administrators to manage failover events. A group consists of resources that have some kind of dependency relationship between them. In the case of a Web server application, it is obvious that the Web server depends, at a minimum, on an IP address to function.
5.2.5 Dependency tree In addition to being dependent on each other for data exchange and control, resources can also have a relationship with reference to time. A resource will have a specific startup and shutdown order associated with it. Referring again to the Web server example, to start the Web server application it would be necessary to bring the disk resource online first, followed by the IP address, the database application, and finally the Web server application itself. If you want to failover the Web server to another node, either for maintenance or as a result of a failure, then the resources in the "Web group" would need to be shut down in the reverse order to allow a failover to occur. In planning the design of your cluster, it is necessary to document the order in which the resources in a group are started and stopped. The mechanism that is used is called a "dependency tree." The dependency tree will show you at a glance the dependency relationship between all the resources in a cluster. It will also give you an indication of the required startup and shutdown order for a cluster group.
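A dependency tree of this kind maps naturally onto a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The resource names and the strictly linear chain of dependencies are assumptions chosen to mirror the startup order given in the Web server example; a real group may have a wider tree.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources that must be online before it.
web_group = {
    "disk": set(),
    "ip_address": {"disk"},
    "database": {"ip_address"},
    "web_server": {"database"},
}


def startup_order(group):
    """Resources listed so that every dependency comes before its dependents."""
    return list(TopologicalSorter(group).static_order())


def shutdown_order(group):
    """Shutdown is simply the startup order reversed."""
    return list(reversed(startup_order(group)))


print(startup_order(web_group))
# ['disk', 'ip_address', 'database', 'web_server']
print(shutdown_order(web_group))
# ['web_server', 'database', 'ip_address', 'disk']
```

`TopologicalSorter` raises `graphlib.CycleError` if two resources depend on each other, which is a useful check when documenting a group by hand.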
5.2.6 Cluster interconnect You will see in Figure 5.1 a connection between the two nodes in our cluster diagram. We call this dedicated communications link between cluster members a cluster interconnect. The cluster interconnect is sometimes referred to as the cluster's "private communication link." There are many cluster messages going back and forth between nodes in a cluster, the most common of which is the cluster heartbeat. These messages need to have a very low latency between the time an event occurs on one node and the time the other nodes in the cluster become aware that the event occurred. This requirement dictates the need for a dedicated high−speed, low−latency link connecting all of the nodes in a cluster. Although, it is obvious from the diagram that the cluster nodes could communicate across the enterprise LAN connection, that connection would provide a high−speed connection but would probably not satisfy the requirement for very low−latency communications between cluster nodes. This is because LANs use shared−bus architecture and therefore typically provide service on a first come, first served basis. A server's network connection to the enterprise carries all the communications between the client machines and the server. Although you may have multiple gigabit Ethernet adapters providing more than adequate bandwidth in a server, the time delay to send a message to a server node is still limited by network contention caused by all of the client systems vying for access.
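The role of the heartbeat, and the reason latency on the interconnect matters, can be illustrated with a small failure detector. This is a sketch under stated assumptions: the class name and the timeout value are hypothetical, and timestamps are passed in explicitly so that the example does not depend on a real clock.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node and flags silent ones."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_seen = {}  # node -> time (seconds) of the last heartbeat

    def record(self, node, now):
        self._last_seen[node] = now

    def suspected_failed(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self._last_seen.items()
                      if now - seen > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=5.0)   # hypothetical timeout
monitor.record("NodeA", now=0.0)
monitor.record("NodeB", now=0.0)
monitor.record("NodeA", now=4.0)            # NodeB has gone quiet
print(monitor.suspected_failed(now=6.0))    # ['NodeB']
```

If heartbeats are delayed by contention on a shared LAN, a healthy node can exceed the timeout and be suspected wrongly, which is one way to see why a dedicated low-latency link is preferred.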
Figure 5.1: Typical cluster architectures.
The link used for the cluster interconnect can be any type of network communications hardware that is supported by Windows NT/2000. Typically, you will see Ethernet used for the cluster interconnect. For a simple two-node cluster, all that would be needed to establish an Ethernet connection would be a simple "crossover cable." The cluster interconnect is typically used to pass cluster administrative messages between the member nodes in a cluster. It can also be used in some implementations (e.g., replication clusters) to pass data that needs to be replicated from one node to the other, thereby eliminating any potential network traffic load on the enterprise LAN connection. The cluster interconnect could also use storage/system area network (SAN) technology such as fiber distributed data interface (FDDI), Fibre Channel, or ServerNet/Myrinet to greatly increase the capacity of the cluster interconnect over that of Ethernet. By using virtual interface architecture (VIA) as the software protocol for the cluster interconnect, it will be possible to significantly reduce the amount of CPU horsepower required to handle cluster messaging traffic. A more detailed discussion of VIA occurs in Chapter 7, "Cluster Interconnect Technologies."
5.3 Cluster models Many cluster system architectures have been used to design clustering products over the years. It became obvious to us late one night, as we were preparing to teach a seminar on clustering the next morning, that a cluster diagram is worth a thousand words when it comes to explaining the differences between the cluster designs in use today. A lot of thought was put into the drawings you will see on the next few pages to help us explain some of the configurations that are possible when designing clusters. As you study these drawings you might not see at first the differences between some of the configurations, because the differences between them are subtle. If you take the time to consider both the hardware and software elements in each configuration, these illustrations should be a valuable learning tool in helping you to understand the different solutions that have been implemented. The most important thing to remember as you look over these different architectures is that there is not necessarily only one correct answer. Once you understand these diagrams, the tradeoffs that are being made by the different vendors in their products will hopefully become clearer to you. First, let's lay the groundwork by explaining the components that are typically used in clusters. In Figure 5.1 we show two servers referred to as Node A and Node B. Using our terminology, these two servers are referred to as cluster members. Like any normal enterprise server, they are connected to the enterprise network. In addition, because they are part of a cluster, they are also configured with a private cluster communications interconnect. The clustering software that is resident on each node in the cluster will use the cluster interconnect to pass status, system messages, file updates, and cluster heartbeats; in some designs, application data will also be transferred over this communication link. 
This link is used by the cluster to maintain a distributed database of both dynamic and static cluster parameters. You will also notice that we have shown two disk arrays in our example. There is a solid line going between Node A and the first disk array and a dotted line going from this disk array to Node B. The solid line indicates that Node A currently owns or is in control of the disk array. The dotted line indicates that the disk array is physically connected to Node B but that Node B does not have control of or own the disk array. As you can see from the figure, it is possible for both Node A and Node B to own their own storage device. The most common cluster configurations that you will run into will assign a portion of the total cluster's storage capacity to each node in the cluster. If the server that owns a storage device fails, one of the remaining cluster nodes can gain control of the failed system's storage devices. In some of the configurations that we will be talking about, the storage subsystem is physically connected to the local server. In those situations, any data that needs to go between machines is sent either over the LAN connection or over the private cluster link. The "yield signs" in our diagrams indicate that the cluster software is in control of all access to the disk array. When we use the "yield sign," it means that only one server will be allowed to access a disk array at a time.
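The ownership rules just described (solid line, dotted line, and "yield sign") can be modeled with two small functions. This is an illustrative sketch only: the function names and the rule for choosing a surviving owner (first name in sorted order) are assumptions, not a description of any vendor's arbitration scheme.

```python
def fail_over_storage(ownership, connected, failed_node):
    """Return a new ownership map after failed_node leaves the cluster.

    ownership: disk array -> node that currently owns it (the solid line)
    connected: disk array -> set of nodes physically attached (solid or dotted)
    """
    updated = dict(ownership)
    for array, owner in ownership.items():
        if owner == failed_node:
            survivors = sorted(n for n in connected[array] if n != failed_node)
            updated[array] = survivors[0] if survivors else None
    return updated


def may_access(ownership, node, array):
    """The 'yield sign': only the current owner may access a disk array."""
    return ownership.get(array) == node


ownership = {"array1": "NodeA", "array2": "NodeB"}
connected = {"array1": {"NodeA", "NodeB"}, "array2": {"NodeA", "NodeB"}}

print(may_access(ownership, "NodeB", "array1"))    # False
after = fail_over_storage(ownership, connected, "NodeA")
print(after)    # {'array1': 'NodeB', 'array2': 'NodeB'}
```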
5.3.1 Active/standby cluster with mirrored data
We begin our discussion of the different cluster designs with Figure 5.2. There you see an example of an active/standby server configured so that it mirrors data from an active node to a standby node. This design consists of two servers connected together, using the enterprise LAN connection to communicate with client workstations and the cluster communications interconnect as their primary path for intracluster communications. Sometimes it is either not possible or too costly for the particular application to have a cluster interconnect link between the systems. An example of this would be if the purpose were to physically separate the two servers over a long distance to minimize risk due to a local catastrophe. In that case, you might need to rely on the LAN/WAN connection to handle communications for both client and cluster management traffic. Depending on the available bandwidth between servers, this configuration might have performance issues because of limitations in the available bandwidth over a WAN. This is one area that needs to be analyzed and taken into consideration when setting up this type of clustered system.
Figure 5.2: Active/standby cluster with mirrored data.
You'll notice that in this example each machine is configured with its own RAID storage system. With this design, the two computers do not share a common SCSI bus, so each machine has its own disk and I/O devices over which it has complete control. One of the computer systems is designated as the primary server. Under normal conditions, the primary server handles the entire processing load. The other cluster node is referred to as the backup server. The backup server does not do any real work as long as the primary server is alive and healthy; the only thing it does is maintain a mirrored copy of the primary server's disk drives. We have seen one vendor's implementation of this architecture that requires both systems to be built using totally identical hardware. When we say "identical," we really do mean identical, right down to the version level of the BIOS chips on the motherboards, SCSI controller, and other I/O adapters. A requirement like that can become a real headache for the system administrator when it comes to stocking spare replacement boards.
In Figure 5.2 we use a yield sign to indicate that clustering software running on the backup server prevents clients from accessing the backup system. In the event of a failure on the primary server, the clustering software will automatically restart and initialize all of the software applications on the backup server. Before doing that, however, it must first reconfigure the backup server with the same network name and IP address that the primary server was using when it failed. As soon as the primary server fails, the mirroring of data between the nodes also stops. Once the backup server takes over from the primary server, it allows the cluster's client workstations to begin accessing its locally attached mirrored copy of the data.
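The order of operations in an active/standby failover can be sketched in a few lines of Python. This is only an illustration of the sequence described above, not any vendor's implementation; the Node class, its fields, and the fail_over function are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One server in a two-node active/standby pair (hypothetical model)."""
    name: str
    network_name: str = ""
    ip_address: str = ""
    accepts_clients: bool = False
    running_apps: list = field(default_factory=list)


def fail_over(primary: Node, backup: Node, apps: list) -> list:
    """Return the ordered steps taken once the primary is known to have failed."""
    steps = ["stop mirroring"]  # mirroring ends as soon as the primary fails
    # The backup first assumes the identity the primary was using...
    backup.network_name = primary.network_name
    backup.ip_address = primary.ip_address
    steps.append("take over network name and IP address")
    # ...and only then restarts the applications against its mirrored copy.
    backup.running_apps = list(apps)
    steps.append("restart applications")
    # Finally the "yield sign" is lifted and clients are admitted.
    backup.accepts_clients = True
    steps.append("admit clients")
    return steps
```

Calling fail_over with a failed primary named FILESRV at some address leaves the backup answering to that same name and address, which is why clients do not need to be reconfigured after the failover.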
This architecture is good at providing quick failover and fault tolerance while at the same time providing a consistent level of performance. The downside is the cost. You must purchase two identical servers, but you will only get one system's worth of processing power. Actually, you will get a little less than a single system's performance because of the overhead of the mirroring software and cluster management.
5.3.2 Active/passive cluster with mirrored data
You will notice from looking at Figure 5.3 that active/passive servers that use a mirrored data configuration are very similar to the active/standby configuration that we just talked about. The only difference between Figure 5.2 and Figure 5.3 is that there is no yield sign on the passive server. In an active/passive configuration, the passive server can respond to requests from clients for applications and data that are owned by and local to the passive server. The big advantage of this configuration over the previous example is that the passive server can now do real work. It is still responsible for mirroring data from the primary server, but it can now take a semi-active role and service requests from client workstations. In practice, you would set up the passive server to handle low-priority services that could be halted without a major impact on the business. This would allow the high-priority services that had been running on the primary server to take over the whole capacity of the passive server in case of an unexpected failure or an intentional administrative failover. You will need to size the passive server so that when the primary server fails, the passive server has adequate processing power to assume the load from the primary server.
The active/passive architecture can have the undesired side effect of first stopping all processing on the passive server before applications from the primary server are failed over and restarted. This means that the users who were working on the passive server are out of luck. Once the passive server is up and running again, its original users have the option to reconnect. The only problem in this scenario is that the machine that was the passive server has now taken over the identity of the primary server.
The client workstations that were connected to the passive server would now have to connect to what used to be the primary server. As you can imagine, this can get a little confusing for the users. This solution meets the requirement for keeping the company's primary applications available. However, the longer failover times and the unwanted effect of dumping the innocent users on the passive server when the primary server fails may not appeal to everyone. On the plus side, this solution does allow you to deliver mission−critical applications while at the same time allowing for the maximum processing capacity.
Figure 5.3: Active/passive cluster with mirrored data.
The one big advantage of the active/passive configuration is that it yields greater value from the investment that was made in server hardware. The trade-off is that when the primary server fails and the passive server takes over for it, the applications that were running on the passive server will not be available. In fact, all of the users on both the primary and passive servers will see a temporary outage. In addition, the users that were running just fine on the passive server are displaced and are now dead in the water. In an active/passive configuration, failover will take longer since the passive server is already actively processing requests from clients and must reconfigure itself to assume the responsibilities of the primary server. The reconfiguration process involves starting up applications that were running on the failed primary server and shutting down applications that had been running on the passive server but are not essential in an emergency situation. All things considered, this configuration is pretty attractive to managers who don't want to explain to their boss why a $50,000 server is doing nothing 99 percent of the time, as in the case of an active/standby configuration.
5.3.3 Active/active cluster with shared disk
"Active/active cluster with shared disk" is our terminology for what Microsoft calls a shared nothing cluster. Microsoft decided to use the shared nothing architecture for its Cluster Service product mainly because it does not require a distributed lock manager (DLM). Microsoft's designers felt that a DLM would not allow them to scale Windows NT/2000 clusters to the sizes they envisioned would be necessary to meet the future needs of e-commerce. Figure 5.4 illustrates how a Microsoft Cluster Service cluster is interconnected. You will notice the typical network connection to the backbone network and the cluster interconnect that provides for intracluster communications between the two cluster nodes. Given today's technology, these two networks are likely to be Ethernet, mainly because of the cost factor and the fact that Ethernet commonly runs at 100 Mbps today and will run at 1 Gbps as costs come down. You are not limited to Ethernet, however; any network interface that is supported by Windows NT/2000 could be used as well. In the future, you should expect to see "switched fabric" technologies used for the cluster interconnect.
Figure 5.4: Active/active cluster with shared disk.
At the bottom of the diagram, you can see a disk array that is connected by a SCSI bus to both cluster nodes. SCSI was the first storage device interconnect standard supported by Cluster Server and is still the lowest-cost one. Here again, as new technology becomes more cost-effective, you will likely see a shift to higher-performance cluster interconnects such as Fibre Channel and ServerNet, among others. Thanks to the relatively new software architecture called VIA, using different cluster interconnect technologies will be as easy as it is today using different vendors' NDIS-compatible network adapters with Windows. NDIS was developed to define and standardize the software interface used by Windows to communicate with network devices. NDIS accomplishes this by using a layered software protocol stack to provide what is basically a plug-and-play environment for network devices. VIA is poised to do the same in the system area network (SAN) arena. The VIA standard provides a standard set of software APIs for both the software applications programmer and the SAN device manufacturers to build to. VIA is discussed in greater detail in Chapter 7.
Electrically, it is possible for both nodes in the cluster to access the array at the same time. But thanks to the Cluster Service software running on both nodes of the cluster, only one system is allowed to access a disk or volume at a time. Even though both computers have physical access to the disk arrays, only one server is allowed to logically mount a disk volume at a time. The Cluster Service prevents concurrent access to data files stored on the disk array. To allow concurrent access to files would have required a file-locking mechanism, and Microsoft chose not to implement one. That is where the term "shared nothing" comes from. A distributed lock manager scales acceptably when the number of nodes in the cluster is in the range of 2 to 96 nodes, as supported by VAX VMS. Beyond that, Microsoft believes that the overhead in keeping track of file and record locks would become quite large. Judging from some of the white papers from Microsoft's Bay Area Research Center (BARC), its vision for the future is for clusters to be able to scale to hundreds of nodes. The system architects at Microsoft don't think that is practical with a lock manager-based architecture. Their fear is that with a 1,000-node cluster potentially more processor time might be spent dealing with the locking mechanism than doing useful work. Only time will tell.
The active/active cluster configuration is an improvement over the first two we discussed in that both servers will be doing real work. In addition, it is possible to set up mirroring between the storage arrays owned by Server A and Server B. Doing so would give you extra protection from the failure of a SCSI bus, a Fibre Channel bus, or the whole disk array. Just remember that mirroring and clustering software will consume some amount of CPU resources on each server. Therefore, although you have two servers in your cluster, your available CPU horsepower will be less than if the two servers were standalone systems. That is the small price you must pay for higher availability. One option to consider is using a storage area network (SAN). The mirroring of the arrays can be done by the SAN, thus freeing the CPUs from that task.
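The "only one node may mount a volume at a time" rule can be modeled with a small ownership table. The Python below is a minimal sketch of that idea, assuming a single in-memory table; the VolumeOwnership class and its method names are hypothetical and do not correspond to any Cluster Service interface.

```python
class VolumeOwnership:
    """Tracks which single node has each shared volume logically mounted."""

    def __init__(self):
        self._owner = {}  # volume name -> name of the owning node

    def mount(self, volume: str, node: str) -> bool:
        """Grant the mount only if no other node currently owns the volume."""
        current = self._owner.get(volume)
        if current is not None and current != node:
            return False  # physically reachable, but logically refused
        self._owner[volume] = node
        return True

    def fail_node(self, failed: str, survivor: str) -> list:
        """Hand every volume owned by the failed node to the survivor."""
        moved = sorted(v for v, n in self._owner.items() if n == failed)
        for volume in moved:
            self._owner[volume] = survivor
        return moved

    def owner(self, volume: str):
        """Return the current owner of a volume, or None if unmounted."""
        return self._owner.get(volume)
```

Because ownership is exclusive, no file or record locks are needed: a second mount request for an owned volume is simply refused until ownership moves.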
5.3.4 Active/active cluster with shared files
The last configuration that we will discuss is called "active/active shared file" architecture. Microsoft refers to it as a "shared everything" model. In Figure 5.5 you will notice that the dashed lines on the SCSI bus have been replaced with solid lines to indicate that each server has both physical and logical access to all files on the disk array, and that both nodes can access those files concurrently. In order for this configuration to work successfully, there must be some kind of control over who can access what and when. That is why we drew a "traffic cop" icon on each cluster node to represent the role played by the distributed lock manager (DLM). Under the control of a DLM, it is possible for applications running on each node in the cluster to access a common database file at the same time. An application that wants to access a shared file must make a request for permission to the DLM using system API calls. The DLM will determine who is allowed to access a file at a given time and will block all others until the lock is released. This allows multiple nodes to be working in parallel using the exact same database files. This architecture works nicely when the number of nodes in the cluster is small. But the intracluster communication traffic generated by the DLM can grow quite large as the number of cluster nodes is increased or the applications themselves become read/write intensive.
Figure 5.5: Active/active cluster with shared files.
This architecture has been used in OpenVMS for more than 15 years and is considered by many to be the standard by which other cluster products are judged. The active/active shared file configuration might be one of the best solutions to the problem of an application that requires more CPU horsepower than can be put into one computer cabinet. The shared file architecture allows you to add more CPU horsepower to solve a processing load problem by simply connecting another computer cabinet to the cluster. The advantage of a shared file architecture is that each cluster node could be configured to execute the same application against the same data files, allowing you to scale up the processing power to meet the demand from your user community. Here again, you will not see a two-times improvement in system performance, because of cluster software overhead and especially the DLM. This solution provides both scalability and availability. The cost you must pay for this scalability is the time necessary to execute a cluster failover. A failover could take a long time, depending on the number of locks requested by applications running on the cluster. Before a failover can proceed, the DLM must resolve all the locks across the cluster. Specifically, all of the locks that were requested by the failed cluster node must be dealt with and released.
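To make the role of the lock manager concrete, here is a deliberately simplified Python sketch of a lock table with exclusive locks only. It is not how OpenVMS or any real DLM is implemented (real lock managers are distributed and support several lock modes); the LockManager class and its methods are hypothetical. The node_failed method illustrates why failover time grows with the number of outstanding locks: each lock held by the failed node has to be released or handed on before work can continue.

```python
from collections import defaultdict, deque


class LockManager:
    """Toy lock table: one exclusive holder per resource, FIFO waiters."""

    def __init__(self):
        self._holder = {}                    # resource -> node holding the lock
        self._waiters = defaultdict(deque)   # resource -> nodes queued for it

    def request(self, node, resource) -> bool:
        """Return True if the lock is granted, False if the node must wait."""
        if resource not in self._holder:
            self._holder[resource] = node
            return True
        if self._holder[resource] == node:
            return True
        if node not in self._waiters[resource]:
            self._waiters[resource].append(node)
        return False

    def release(self, node, resource) -> None:
        """Release a lock and pass it to the next waiter, if any."""
        if self._holder.get(resource) != node:
            return
        queue = self._waiters[resource]
        if queue:
            self._holder[resource] = queue.popleft()
        else:
            del self._holder[resource]

    def node_failed(self, node) -> int:
        """Resolve every lock of a failed node; return how many it held."""
        for queue in self._waiters.values():
            if node in queue:
                queue.remove(node)
        held = [r for r, n in self._holder.items() if n == node]
        for resource in held:
            self.release(node, resource)
        return len(held)

    def holder(self, resource):
        """Return the node currently holding the lock, or None."""
        return self._holder.get(resource)
```

In this model the cleanup work in node_failed is proportional to the number of locks the failed node held, which mirrors the failover cost discussed above.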
5.4 Microsoft's Cluster Server architecture
In our simplified illustration of a cluster, shown in Figure 5.6, you will see the four major components of Microsoft's Cluster Server software. The first and the most important is the Cluster Service. The Cluster Service is a collection of 10 software modules that Microsoft refers to as "managers," which together implement the clustering functionality. The Cluster Service uses another independent software module called a Resource Monitor to keep track of the status of the many resources in the cluster. The Resource Monitors in turn rely on custom software modules, known as Resource DLLs, that are written by software application developers to monitor the health of their specific application. Finally, at the bottom is the actual Resource, which can be a physical device, a software application, or a logical resource.
Figure 5.6: Cluster Service architecture.
5.4.1 Cluster Service
The term Cluster Service refers to the collection of software modules that actually implement clustering functionality on Windows NT/2000 Server. In Figure 5.7, you can see the software modules that Microsoft uses to provide and manage services on a cluster. This collection of modules is a rather complicated bit of software that Microsoft has layered on top of the standard Windows NT/2000 operating system. The Cluster Service not only has to keep track of all the events on the server on which it is running, but it must also stay in constant communication with the other nodes in the cluster. Its most important responsibility is to maintain a real-time distributed database across all active nodes in the cluster that reflects the state of the total cluster. In the event that a member node either fails or is taken offline administratively, this database will be queried by the remaining cluster member nodes to determine what actions must be taken by the cluster to restore services to its clients. The Event Processor, with the help of the Failover Manager and the Resource Manager, monitors the cluster status on the local node and, through the Communication Manager, on the other cluster nodes. We will subsequently explain in greater detail the different modules that make up the Cluster Service and how they interact with one another. Figure 5.8 illustrates the relationships that exist between the functional software modules that make up Cluster Service.
Figure 5.7: Cluster software architecture.
Figure 5.8: Microsoft Cluster Service software components.
5.4.2 Resource Monitor
The next component to be discussed is the Resource Monitor. The job of the Resource Monitor is to manage and monitor the different resources available on a cluster node. The Resource Monitor receives commands from the Cluster Service and reports back the status of the resources that it is monitoring. It runs constantly in the background, maintaining contact with the Resource DLLs that were assigned to it. An important feature of Microsoft's architecture is that the Resource Monitors run in their own process space, which isolates them from the Cluster Service. This is to protect the Cluster Service from failing as a result of the failure of an individual cluster resource. To further isolate Cluster Service and the other cluster resources from the failure of a single resource, Microsoft's design allows for more than one Resource Monitor to be running at a time. These features of Cluster Service are quite significant when it comes to the robustness of the cluster. Each instance of the Resource Monitor can be assigned the task of monitoring one or more Resource DLLs. For example, in practice you might want to start a separate Resource Monitor for a Resource DLL that you suspect is not a fully debugged piece of software. If the resource did fail, the worst that could happen is that the Resource DLL and the Resource Monitor assigned to monitor it would crash, thus protecting the Cluster Service and the other resources on that cluster node.
In Figure 5.9 you can see that the Resource Manager has started two Resource Monitors. One of the Resource Monitors is responsible for keeping track of two Resource DLLs, and the other Resource Monitor has only one Resource DLL to monitor. A Resource Monitor does not have any decision-making logic built into it; its job is simply to monitor the Resource DLLs assigned to it and report back their status to the Resource Manager, which in turn provides status information to the rest of the software modules that make up the Cluster Service. The Resource Monitor accomplishes its task by calling the cluster APIs that are used for managing Resource DLLs. In Table 5.1, we have listed a few of the cluster API calls that can be used by software application developers to interface their software to the Resource Monitor.
Figure 5.9: Resource Monitor and Resource DLLs.
Table 5.1: Resource Monitor API Functions
API Function: Description
Startup/Shutdown functions
Open: The Open function is used to initialize a cluster resource. This function will allocate whatever system resources are required by the Resource DLL and the actual resource.
Online: Once the resource has been initialized by the Open call, an API call to Online will cause the resource to become available for system use.
Offline: The Offline command is used for a normal graceful shutdown by allowing the resource to clean up after itself before it goes into the offline state.
Terminate: If for some reason it becomes necessary to immediately take a resource offline, the Resource Monitor will call the Terminate function, which will quickly put the resource in an Offline state without allowing for a graceful shutdown.
Close: The Close API call is just the opposite of the Open function. It is used to remove a cluster resource that was previously created by an Open function. The result of this call is to deallocate any system resources and then put the resource in a stopped state.
Running State functions
LooksAlive: The LooksAlive function is called by the Resource Monitor to determine whether the resource is still functioning. This API gives a quick, cursory check into the health of the resource and is normally called at frequent predetermined polling intervals as set by the system administrator.
IsAlive: If the LooksAlive function reports a failure, then the Resource Monitor can use the IsAlive API to get a more thorough report on the status of a resource. During normal operation of the cluster, this API is called much less frequently than the LooksAlive function.
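The two-level health check in Table 5.1 (a cheap check called often, a thorough check called only when needed) can be illustrated with a short Python sketch. This is a model of the polling pattern only, not the Windows Resource API, which is a set of C entry points exported by a DLL; GenericResource, quick_check, thorough_check, and poll_once are hypothetical names standing in for a Resource DLL, its two running-state functions, and one polling pass of a Resource Monitor.

```python
class GenericResource:
    """Stand-in for a Resource DLL; only the two health checks are modeled."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.thorough_calls = 0  # counts how often the expensive check runs

    def quick_check(self) -> bool:
        """Cheap, cursory check (here it simply reads a flag)."""
        return self.healthy

    def thorough_check(self) -> bool:
        """Expensive, detailed check; counted so its frequency is visible."""
        self.thorough_calls += 1
        return self.healthy


def poll_once(resources) -> dict:
    """One polling pass: report 'online' or 'failed' for each resource."""
    status = {}
    for res in resources:
        if res.quick_check():
            status[res.name] = "online"
        else:
            # Escalate to the thorough check only when the quick one fails.
            status[res.name] = "online" if res.thorough_check() else "failed"
    return status
```

Note that poll_once only reports status. As in the architecture described in this section, deciding what to do about a failed resource is left to the caller.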
5.4.3 Resource DLL
Storage devices, software applications, and logical entities such as an IP address are managed and monitored by the Cluster Service through their Resource DLL. The Resource DLL plays a very important part in a cluster because it acts as the "eyes and ears" of the Cluster Service. A Resource DLL sits in between the Resource Monitor and the application that it was written to monitor. It will use the Cluster Service APIs to communicate with the Cluster Service above it and a software vendor's application-specific APIs below it.
There are two sources of Resource DLLs. Microsoft, as the operating system vendor, provides standard out-of-the-box Resource DLLs that manage operating system-type resources. The standard Resource DLLs that are included with the Windows operating system support such Resources as IP addresses, File and Print services, physical disks, the network name, and the IIS Web server. In addition, Microsoft provides default Resource DLLs that provide basic cluster support for generic applications and services that don't yet have custom Resource DLLs written. Ideally, third-party software application developers over time would supply their own custom Resource DLLs written specifically for their own applications. If a software developer intends its application to be totally integrated into the cluster environment, the developer will have to supply its own custom DLLs that understand the peculiarities of the particular application. The custom Resource DLL must be able to communicate the status of the application it is monitoring by responding to standard cluster API calls from a Resource Monitor. A Resource DLL also provides a mechanism that will allow an application to determine what is happening in the environment in which it is running.
A custom Resource DLL can use the Cluster APIs to control how the application responds to cluster events by constantly monitoring the operating states of the cluster and the cluster member on which it is currently hosted. This allows an application to find out that a failover is about to take place and take whatever actions are appropriate for that particular application under those conditions. If an application's custom Resource DLL learns that a failover has been requested by Cluster Service, it can inform its application that it needs to do a graceful shutdown by purging its caches and closing open files. The real benefits of Windows NT/2000 Cluster Service won't be realized until third-party software developers fully support the API set included as part of Cluster Service. Microsoft has done a good job of laying the foundation with Cluster Service APIs. It is now up to third-party application developers to take advantage of all that is available to them if they want their software applications to be truly "cluster enabled."
From an architectural point of view, the advantage of using custom Resource DLLs is that they provide for a finer granularity of control of applications. For example, without a custom Resource DLL the Cluster Service would have to take an application that has failed and simply restart it on another cluster node. As you can imagine, this process would involve moving all the resources required by the application to another node and then restarting the application. Finally, the application would have to attempt to recover from where it left off. This could be a very time-consuming process, to say the least. The good news is that when application developers fully exploit the capabilities of Resource DLLs, it will be possible to have multiple instances of an application running on the cluster. With a sophisticated Resource DLL that implements full clustering capability, it will then be necessary to fail over only the data itself. This makes the real failover entity the data and not the application, thereby giving finer control and seamless integration between applications and the cluster itself. An application with this tight integration is referred to as a "cluster-aware application." We are not all the way there today, but when the industry achieves this level of cluster integration we will be able to move to the next level of clustering: scalability. By having multiple instances of an application running on every node of a cluster and all of them processing requests from clients, you will be able to realize one of the major benefits of clustering.
Cluster Service can enhance the availability of just about any legacy application in a limited way; however, if the application is written to use the Cluster APIs, then it is able to work hand-in-hand with the cluster and achieve high levels of fault tolerance and scalability. The cluster APIs allow applications to communicate with the Cluster Service so that an application can be aware of cluster events that might have an effect on it and can inform the Cluster Service of its own status. Those who want more information about using the Cluster API tools should refer to the Software Developer Kit (SDK). The SDK assumes that you have C++ loaded on your system. Included in the kit are example resource DLLs that you can use as a starting point. The SDK is not meant for use by day-to-day cluster administrators and operators. It is really meant for software developers and advanced users who want to enhance their applications.
Most of us receive cluster resource DLLs with the software applications we purchase or use some of the default ones that come with Cluster Service.
5.4.4 Failover Manager
The Resource Manager and the Failover Manager work together to manage and control resources in the cluster. Once a failover event message is received from the Node Manager or a Resource Monitor, the Resource/Failover Manager will consult the Configuration Database Manager to find out what actions need to be performed to carry out the requested failover. Those actions could be to stop or start a resource, fail over a Group, or arbitrate which cluster member should own a particular Group.
5.4.5 Resource Groups
Resource Groups are logical groupings of resources in a cluster that have some type of dependency relationship with each other. Organizing such resources into Groups makes it easier to manage them. For example, a file and print service application would obviously be dependent on a disk drive resource at a minimum (Figure 5.10). However, there are also other resources in the cluster that would be required for file and print services to function correctly. To start with, for clients to access a file and print server, a network name and an IP address would have to be assigned. Together, the disk drive, IP address, network name, and the file and print application itself would make up what is known as a Group. There are likely to be many Groups in a cluster, but a Group can only be under the control of a single cluster node at a time. We refer to this as being "owned." In order for a Group of resources to fail over successfully, the resources in the Group must be physically present on the cluster node that is going to host the Group.
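Dependencies within a Group imply an order in which its resources have to be brought online: a resource can start only after everything it depends on is available. A minimal Python sketch of that ordering, using the standard library's graphlib module, follows. The dependency table is an assumption made for illustration (in particular, that the network name depends on the IP address); it is not read from any real cluster configuration.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on.
# These dependencies are illustrative assumptions for the file and print example.
FILE_PRINT_GROUP = {
    "physical disk": set(),
    "ip address": set(),
    "network name": {"ip address"},
    "file and print service": {"physical disk", "network name"},
}


def online_order(group: dict) -> list:
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(group).static_order())


def offline_order(group: dict) -> list:
    """Taking a Group offline walks the same order in reverse."""
    return list(reversed(online_order(group)))
```

With this table, the file and print service always comes last when the Group is brought online and first when it is taken offline.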
Figure 5.10: Relationship between cluster resources.
For example, if one of the resources you plan to fail over is running SQL Server, then that application must have been installed on all nodes that could potentially act as hosts for that resource. You begin the process of installing the application on the cluster by first deciding which clustered disks the application is going to be installed on and the cluster nodes that will participate in the failover operation. Next, you would fail over the clustered disk to the first node that will be hosting the application and then run the application's setup procedure. Then you would fail over the clustered disk to the other nodes in the cluster and run the setup procedure again for each one that is going to be enabled as a backup host for the application. You should be aware that in a shared nothing cluster only one cluster node can be executing an application and accessing its data at a time even though it has been installed on multiple cluster nodes. Those other installations will be used only when a failover occurs.
Something that you need to be aware of is that not many applications today have provisions in their software license agreements for a software "cluster license." Applications such as SQL Server are typically licensed for installation on only one physical server at a time. This means that any applications that you may want to fail over will require a license for each server that could potentially serve as a host. That is true even though only one machine would be running the application at a given time. We hope that in the future license agreements will be modified to account for clustered use, thereby making the situation more equitable for users. If applications are written to fully support the Cluster APIs and/or if a distributed lock manager becomes available, it will be possible for multiple copies of an application to run concurrently on every node in the cluster.
In this situation, it makes sense to have multiple licenses because all nodes will be doing useful work. Unfortunately, there does not seem to be much motivation for Microsoft, or any other software developer for that matter, to figure out how to treat an application's license as a cluster resource. If licenses were cluster resources, the "license resources" from the failed nodes could be automatically moved to a surviving node in the cluster. As you can see, however, a solution to that challenge would result in a 50 percent loss in sales, so software vendors have little motivation to rush out and find one.
Groups have cluster-wide policies associated with them that define the actions that are to take place in case of a cluster node failure. First the Cluster Service needs to know which cluster node is the default host for each Group of resources. It also needs to know which host the Group should be moved to when a cluster failover event occurs. This information is also distributed across the cluster so that each node knows what services to start when it boots up. In the case of a cluster node failure, the Cluster Service will poll the database to determine how it should handle the Groups on the failed cluster member. When the failed cluster member rejoins the cluster, the Cluster Service will again check the cluster database to determine whether any Groups need to be failed back to their preferred host node and when that failback should occur. These features are very powerful capabilities for managing the operation of a cluster. Cluster system managers will need to pay very careful attention when they are setting up Groups. We will save a more detailed discussion of managing Resource Groups for later in the book, when we discuss Cluster Administration.
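The Group policies described above (a default host, a failover target, and an optional failback) amount to a small decision rule. The following Python sketch shows one way to express it; GroupPolicy, its fields, and next_owner are hypothetical names, and the rule ignores the timing of failback, which the text notes is also part of the policy.

```python
from dataclasses import dataclass


@dataclass
class GroupPolicy:
    """Cluster-wide placement policy for one Group (hypothetical model)."""
    name: str
    preferred_owner: str
    failover_target: str
    failback: bool = True


def next_owner(policy: GroupPolicy, current_owner: str, alive: set):
    """Decide which node should host the Group given the nodes still alive."""
    if current_owner in alive:
        # The current host is healthy; move only to fail back to the
        # preferred node once it has rejoined the cluster.
        if policy.failback and policy.preferred_owner in alive:
            return policy.preferred_owner
        return current_owner
    # The current host is gone: try the preferred node, then the target.
    for candidate in (policy.preferred_owner, policy.failover_target):
        if candidate in alive:
            return candidate
    return None  # no eligible host is available
```

For a Group that prefers node A with B as its target, the rule returns B while A is down and A again after A rejoins, unless failback is disabled.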
5.4.6 Node Manager
The purpose of the Node Manager is both to monitor the status of other cluster members and to coordinate and maintain the status of cluster members. The Node Manager accomplishes this by sending a "heartbeat" message at a regular interval to the other cluster members. As long as each node in the cluster receives a heartbeat message from every other node in the cluster, the cluster is assumed to be OK. If a node fails to send out its heartbeat on time or if a node does not hear the heartbeat because of some type of network or Cluster Interconnect failure, the Node Manager will attempt to use another means to verify that the unheard node is still alive. If a cluster node believes that one of the other nodes in the cluster has failed, it has two methods that it can use to contact the node in question. First, it can try to contact the node via the public network connection. If it is unsuccessful over that data path, then it can attempt to use the cluster storage interconnect to determine whether the other node is active by using the SCSI challenge-and-defend command protocols. If all attempts fail, the Node Manager will notify the Event Processor that it has detected an unreachable cluster member. At this point, the cluster node detecting the missing heartbeat will broadcast a message to the remaining cluster members informing them that a failure has been detected. Once a failure has been detected, all disk I/Os to shared SCSI storage devices will be stopped immediately. This is done to prevent the corruption of open files on the cluster-shared storage devices. The next step is for all cluster members to start a process called the regroup event. This process is the most important thing that occurs in a cluster because the ownership of a resource at any point in time must be tightly controlled to prevent corruption of data. The Quorum Device that was discussed earlier plays a big role in the regrouping of a cluster.
The regroup event will cause all nodes in the cluster to query each other to determine who is reachable and which node has failed. Once this has occurred, the node that has failed is declared "offline." At this point, it is up to the Resource Manager along with the Failover Manager to start the process of starting the appropriate Groups on the remaining running cluster node. This is known as a "failover."
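The detection sequence described above (missed heartbeat, fallback checks over other paths, then regroup) can be pictured with a minimal sketch. Everything here is illustrative: the class and function names, the timeout value, and the probe callbacks are assumptions made for the example and are not part of Cluster Service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class NodeMonitor:
    """Tracks the last heartbeat heard from each peer (hypothetical names)."""
    timeout_s: float
    # Fallback probes tried in order once heartbeats stop, for example the
    # public network first and then the shared storage interconnect.
    probes: List[Callable[[str], bool]] = field(default_factory=list)
    last_heard: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_heard[node] = now

    def unreachable(self, now: float) -> List[str]:
        """Return peers whose heartbeat is overdue and that fail every probe."""
        failed = []
        for node, heard in sorted(self.last_heard.items()):
            if now - heard <= self.timeout_s:
                continue  # heartbeat arrived on time
            if any(probe(node) for probe in self.probes):
                continue  # still alive on an alternate path
            failed.append(node)
        return failed


def regroup(members: List[str], failed: List[str]) -> Tuple[List[str], List[str]]:
    """Split the membership into online and offline nodes after a failure."""
    offline = set(failed)
    return [m for m in members if m not in offline], sorted(offline)
```

In this sketch a node is declared offline only after its heartbeat is late and every alternate probe has also failed, which mirrors the order of checks in the text.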
5.4.7 Configuration Database Manager One of the most critical functions performed by clustering software is the clusterwide real−time distributed database that controls its operation. This database is created and maintained by one of the software modules of Cluster Service called the Configuration Database Manager. The configuration database stores every bit of data that is needed by the Cluster Service to function. Therefore, it is important that the integrity of the data be carefully maintained. The Configuration Database Manager accomplishes this by using a two−phase commit mechanism. The database is not like a general−purpose SQL database. The state of the cluster must be distributed across the cluster in real time and is independently maintained by each member of the cluster. Each cluster member stores what it knows about the state of the cluster as entries in its Windows registry. At the same time, the cluster configuration data is distributed to every cluster member node in real time. The interval of time between when an event occurs and when all of the other cluster nodes know about it must be very short. This requirement dictates that there must be a very short latency between the time when an event occurs and when the other nodes in the cluster are informed of its occurrence. The requirement for low−latency communications between cluster nodes is driving Microsoft and other leading hardware vendors to adopt a new hardware and software architecture called Virtual Interface Architecture (VIA). In short, VIA is a communications solution that is based on both hardware and software working together to streamline the communications path between applications running on two or more computer systems. 
The Virtual Interface Architecture is able to deliver high-speed, low-latency communications by moving the communications protocols outside of the operating system itself and by allowing the application to talk directly to the VIA hardware interface, thereby substantially minimizing the amount of overhead contributed by the operating system. We will go into much more detail about VIA and other SAN protocol technologies in Chapter 7.
Any changes to the Cluster State are simultaneously logged to both the Cluster Configuration Database and the cluster Quorum Disk. A single file on the Quorum Disk contains an ongoing record of any changes made to the cluster's configuration. This log file is designed so that cluster nodes that happen to be offline while changes are being made can update their local NT/2000 Registry entries the next time they come back online. The file used to hold this information is stored in a master file table (MFT) and is therefore not visible to end users with standard desktop file viewers. Since the one node in the cluster that controls the Quorum Disk at any given time is in control of the cluster, it can be assumed that by default the cluster configuration data that is stored on the Quorum Disk contains the most up-to-date picture of the state of the cluster.
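The catch-up behavior of a node that was offline while changes were logged can be sketched as follows. This is only an illustration: the sequence numbers, the key names, and the `replay_log` function are assumptions for the example, and the real log format on the Quorum Disk is not described here.

```python
from typing import Dict, List, Tuple

# Each log entry: (sequence number, registry-style key, new value).
LogEntry = Tuple[int, str, str]


def replay_log(local: Dict[str, str], last_applied: int,
               log: List[LogEntry]) -> int:
    """Apply entries newer than last_applied, in order.

    Returns the sequence number of the last entry now reflected locally.
    """
    for seq, key, value in sorted(log):
        if seq <= last_applied:
            continue  # already reflected in the local copy
        local[key] = value
        last_applied = seq
    return last_applied
```

A returning node would call `replay_log` with the sequence number it had reached before going offline. Replaying the same log twice changes nothing, which is the property that makes it safe for a node to rejoin at any time.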
5.4.8 Global Update Manager The Global Update Manager is responsible for ensuring that every node in the cluster sees a consistent view of the cluster configuration database. Every cluster member needs to have access to the current status of the cluster in the event that it must respond to a cluster failover event or cluster reorganization. Some of the information used by the cluster is static, such as resource properties, failover policies, preferred hosts, and systems registered with the cluster. There is also dynamic information that must be shared in real time between all cluster members that are online. This dynamic information gives all cluster members a consistent view of the current state of the cluster. The Global Update Manager provides the network middleware functionality that all of the other software modules in Cluster Service use to communicate with each other. If it is necessary to deliver a message from one node to another, the Global Update Manager provides a standardized and reliable mechanism for message passing. It accomplishes this by providing a common software interface that is used by all of the other modules of the Cluster Service when it is necessary to provide synchronized updates to the cluster state database across all member nodes. It can provide both guaranteed and nonguaranteed broadcast message delivery and synchronization. The Global Update Manager does not make any policy decisions on behalf of Cluster Service. Its sole purpose is to provide a reliable broadcast mechanism and to manage the complexities of internode communications.
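One way to picture a synchronized, all-or-nothing update of the kind described here, using the two-phase commit idea mentioned in section 5.4.7, is the generic sketch below. It is a textbook two-phase commit and not the actual protocol used by Cluster Service; the `MemberNode` class and `global_update` function are hypothetical names.

```python
from typing import Dict, List, Optional


class MemberNode:
    """One cluster member holding its own copy of the configuration data."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.committed: Dict[str, str] = {}
        self.pending: Optional[Dict[str, str]] = None

    def prepare(self, change: Dict[str, str]) -> bool:
        """Phase 1: stage the change and vote on whether it can be applied."""
        if not self.healthy:
            return False
        self.pending = dict(change)
        return True

    def commit(self) -> None:
        """Phase 2 (success): make the staged change visible."""
        if self.pending is not None:
            self.committed.update(self.pending)
            self.pending = None

    def abort(self) -> None:
        """Phase 2 (failure): discard the staged change."""
        self.pending = None


def global_update(nodes: List[MemberNode], change: Dict[str, str]) -> bool:
    """Apply the change on every node or on none of them."""
    votes = [node.prepare(change) for node in nodes]
    if all(votes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.abort()
    return False
```

The point of the two phases is that no node exposes a change until every node has agreed it can apply it, so the members never disagree about the committed configuration.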
5.4.9 Event Processor The Event Processor module is quite important to the operation of Cluster Service because it acts as the coordinator for all of the different components of Cluster Service. The Event Processor is responsible for initializing and starting up the Cluster Service module and serves also as the initial entry point to the Cluster Service module. After Cluster Service has been started, the Event Processor sends a message to the node manager to either join an existing cluster or to form a new cluster. Once the node has successfully joined a cluster, the Event Processor then takes up the role of a message switch between the different modules that make up the Cluster Service. For cluster−aware applications that are registered with the Cluster Service, the Event Processor will deliver notification of cluster events to them. Similarly, if a cluster−aware application needs to query the Cluster Service, the Event Processor will receive the query request from the application and then post a message to the appropriate Cluster Service module. The Event Processor will handle cluster−aware application API calls to open, close, or enumerate cluster objects. The term "cluster objects" refers to systems, resource types, resources, and groups.
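The "message switch" role can be illustrated with a small publish-and-dispatch sketch. The `EventSwitch` class, the event names, and the handler signature are all assumptions made for this example; they are not the Cluster API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventSwitch:
    """Routes each posted event to every handler registered for its type."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def register(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def post(self, event_type: str, payload: dict) -> int:
        """Deliver the event and return how many handlers received it."""
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

In this picture, internal modules and registered cluster-aware applications would both appear as handlers, and the switch itself makes no decisions about what the events mean.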
5.4.10 Communications Manager Whereas the Event Processor handles communications internally to a cluster member node, the Communications Manager is responsible for managing communications between cluster members. The Communications Manager modules that are present and running on each node in the cluster communicate with each other using the UDP protocol. The Cluster Service's Communications Manager communicates with the other cluster nodes via the cluster's Communications Interconnect. Typically this interconnect will consist of two 100BaseT Ethernet cards connected together with a short piece of twisted-pair cable that is typically referred to as a "crossover cable." The crossover cable is the twisted-pair Ethernet equivalent of an RS-232 "null modem" cable. As was mentioned earlier, the Communications Manager can also use the VIA APIs to support a system area network (SAN) in place of Ethernet. Thanks to the vastly improved message-handling capabilities available with VIA, the Communications Manager then needs much less CPU time to do its job. Lower processing requirements on the part of the Cluster Service mean that more "real" work gets done in your cluster.
If you were to put a network analyzer on the cluster interconnect, you would see a constant stream of messages going back and forth between cluster members. These are known as "keep-alive" messages or, as they are sometimes called, the "heartbeat." What's happening here is that the Communications Manager on one machine sends a message to the other node saying, "Hello, I am here. Are you still there?" These keep-alive messages will continue to be sent back and forth as long as the Cluster Service is running on both nodes. If a node fails to receive a response to its keep-alive message from the other nodes in the cluster, the Communications Manager will notify the Cluster Service that it has detected a failure. It is up to the Cluster Service to determine the appropriate actions that must be taken and then carry them out. Other message types handled by the Communications Manager deal with cluster membership-state transitions.
These messages are used by the Cluster Service to negotiate the status of the current configuration of the cluster with the other member nodes. Certain events can trigger state transitions. For example, a cluster node could decide that it wants to join a cluster again after being taken offline either voluntarily or involuntarily. A cluster node could experience a hardware crash. A cluster node could become unreliable, as observed by other nodes in the cluster, as a result of experiencing repeated reboots. Any of these conditions would initiate a membership-state transition.
5.4.11 Log Manager The Log Manager is responsible for ensuring that the recovery log file on the Quorum resource is kept in synchronization with the Configuration Database. It is important that every node in a cluster is kept up to date and its information is available clusterwide. The Log Manager accomplishes this by constantly checking the copy of the cluster database that each node maintains. If it discovers any changes on a local node, those changes are updated throughout the cluster and logged to the Quorum resource.
5.4.12 Cluster time service The Cluster Service on each member of a cluster must maintain synchronization in real time of the distributed database across the cluster. Since every cluster member must know the state of every node in a cluster at any instant in time, having the clocks in each node synchronized is a necessity. This requires that the time a particular message was sent be very accurately known. Messages sent between cluster nodes must have a time stamp so that the cluster can tell the sequence in which events occur in the cluster. You can easily imagine the problems that could occur if the system clocks were not in synchronization throughout a cluster. A cluster member might receive one status message about an event that occurred in the past and another about an event that, according to its own clock, has not happened yet. That might be OK in a science fiction movie, but it wouldn't be such a good thing for your cluster. The only real requirement is that all the nodes in a cluster must have their clocks synchronized among themselves.
The system administrator has the option of specifying which node in the cluster will assume the role of timekeeper. If the administrator does not specify, then the Cluster Service will automatically elect a node to be the source of time for the whole cluster. The node that is providing the time standard is known as a Time Source. The goal of the Time Source is to make sure that each node is working with a consistent value for time. It is not required for the time to be traceable to the U.S. National Institute of Standards and Technology (NIST, formerly the National Bureau of Standards), although that can be accomplished with special software and hardware that links the Time Source to the time data that NIST transmits. As long as all clocks are in synchronization, everything will work just fine. It's the fact that the time is synchronized and not that it is the actual time of day that is important here.
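To see why agreement among the nodes matters more than the true time of day, consider the following sketch. The election rule (lowest node name when the administrator has made no choice) is purely a placeholder assumption, since the text does not say how the Cluster Service elects a Time Source, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def pick_time_source(nodes: List[str], preferred: Optional[str] = None) -> str:
    """Use the administrator's choice if given, else fall back to a simple rule."""
    if preferred in nodes:
        return preferred
    return min(nodes)  # assumption: lowest name wins; the real election is not described


def clock_corrections(clocks: Dict[str, float], source: str) -> Dict[str, float]:
    """Seconds each node must add to its clock to agree with the Time Source."""
    reference = clocks[source]
    return {node: reference - reading for node, reading in clocks.items()}


def order_events(events: List[Tuple[str, float, str]],
                 corrections: Dict[str, float]) -> List[str]:
    """Sort (node, local_timestamp, description) tuples by corrected time."""
    corrected = [(stamp + corrections[node], text) for node, stamp, text in events]
    return [text for _, text in sorted(corrected)]
```

If NodeB's clock runs half a second fast, an event it stamps at 10.5 really happened before an event NodeA stamps at 10.25. Sorting on the raw stamps gets the sequence backward; sorting on the corrected stamps restores it, regardless of whether either clock matches the actual time of day.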
5.5 Quorum Resource The Quorum Resource is an example of a technology used in Cluster Server that Microsoft received from two of its key partners: Digital Equipment Corporation and Tandem Computer Corporation, both of which are now part of Compaq Computer Corporation. An easy way to understand what a Quorum Resource does is to think of it as a sort of "tie-breaker" for determining who gets control of the cluster. Microsoft took advantage of standard protocols that are part of the SCSI standard that allow for access control of disks on a SCSI bus that has more than one computer on it. The Quorum device is currently implemented as a SCSI disk that can be "owned" by only one system at a time. This condition is guaranteed by the SCSI control protocol, which supports commands that ensure that a disk is under control of only one SCSI controller at a time. Because of this, the SCSI protocol must be used at this time to control the disk designated as the Quorum Resource, no matter how that disk is physically connected. Even though it might be connected using Fibre Channel, it must still talk the SCSI protocol to work as a Quorum device.
The Quorum Device is important because of a situation that Microsoft calls a "split-brain" cluster. This condition can occur when two or more nodes in a cluster are up and running, but each one thinks that the others are down simply because it can't communicate with them, owing to a problem with both the Cluster Communications Interconnect and the Enterprise Network Connection. If there were no Quorum device, both nodes would try to take control of any shared disks, and you would end up with a corrupted disk.
Hopefully it will be very rare for both network connections to fail at the same time, but this situation can occur if the cluster interconnect fails and at the same time the Enterprise network connection is having problems or is very slow to respond. The whole purpose of the Quorum device and the Quorum voting algorithm is to ensure that if a "split-brain" situation does occur, data corruption will not happen.
Although such a failure scenario is hopefully rare, the Quorum Resource also serves an important role during normal cluster operation. Under normal conditions, the Quorum Resource stores log data that tracks the state and configuration of the cluster. As changes are made to the Cluster Database (NT registry entries on each cluster node), they are also appended to the log file maintained on the Quorum disk. This information is needed if a cluster member was taken offline for maintenance or failed while changes to the cluster configuration were taking place. When this node rejoins the cluster, it will look at the log and determine the changes that have been made while it was offline. Once all cluster member nodes have accessed and processed the cluster log file, the entries are purged from it.
Because of the importance of the data contained in the log file on the Quorum disk, it would be wise to consider providing redundancy for that disk. You could consider using either a mirrored disk pair or a RAID array as a means of protecting this information that is so crucial to the operations of the cluster. Microsoft has made provisions for third-party hardware developers to design other devices that can function equally well as Quorum devices. Nevertheless, at this time the SCSI protocol used to access a physical disk is your only option. It could be possible to implement a highly redundant hardware-based solution in the future. The advantages of such a solution would be speed and potentially higher reliability.
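The tie-breaker role of the Quorum device in a split-brain situation can be modeled in a few lines. This is an in-memory stand-in only: no SCSI commands are issued, and the `QuorumDisk` class and `arbitrate` function are hypothetical names chosen for the illustration.

```python
from typing import Dict, List, Optional


class QuorumDisk:
    """Stand-in for a disk that only one node can reserve at a time."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def try_reserve(self, node: str) -> bool:
        """Succeed only if the disk is free or already held by this node."""
        if self.owner is None or self.owner == node:
            self.owner = node
            return True
        return False

    def release(self, node: str) -> None:
        if self.owner == node:
            self.owner = None


def arbitrate(disk: QuorumDisk, contenders: List[str]) -> Dict[str, str]:
    """Each node that lost contact with its peers tries to reserve the disk."""
    outcome = {}
    for node in contenders:
        won = disk.try_reserve(node)
        outcome[node] = "owns cluster" if won else "stands down"
    return outcome
```

Because the reservation can succeed for only one contender, exactly one node ends up in control of the shared disks, however many nodes believe they are the sole survivor.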
5.6 Cluster failover architecture Many people think that all the excitement about clustering is to prevent system down time due to some type of catastrophic hardware failure. In the old days, that would not have been a bad assumption. When we got into the computer business over 20 years ago, a VAX computer was built up from 30 or more large printed circuit boards, each much larger than a PC's motherboard today. After performing some extensive mathematical calculations, we can tell you with confidence that at some point that system would crash and that the only mathematical law that mattered was Murphy's law. With such a large parts count and all the wires and connections necessary to interconnect all those logic cards, it was no surprise when one of those modules failed. The only question back then was when would it fail? Today's servers are built with a single motherboard and maybe 4 to 10 adapter cards. The CPU is now a single piece of silicon as opposed to hundreds of discrete components. Because of these advances in technology, the reliability of today's computers is much greater than when we got started. Now we are not going to try to convince you that computers don't crash today, but surveys conducted recently point to other causes of system downtime these days. Cluster failover is probably going to be the most difficult issue to deal with for a system administrator because you will have so many options available to you. You have quite a few decisions to make about how the cluster will deal with a failure of hardware or software applications running on it in the event of a cluster failover. Then, after the cause of the failover has been determined and rectified, you will probably want to do a "failback" operation to return the cluster to its normal state. The failback process requires that the system administrator make even more decisions. 
One time you will want the failback to occur right away, and another time you may want to wait for an off−peak time to failback when there won't be as much of an impact on the users. Failover is the process of moving services from one node in a cluster to another node. There are two failover scenarios shown in Table 5.2. The first failover scenario, administrative failover, addresses the need to perform administrative functions without interrupting services to users of the cluster. In that case, a cluster failover is used to temporarily stop running services on one node and then restart the services on a new node while some type of system maintenance is performed. A recovery failover, on the other hand, occurs in response to a failure in the system and is initiated automatically by the cluster based on the specific failover properties preconfigured by a system administrator. These failover properties are assigned to a specific resource group and apply to all the resources contained in that failover group. The same physical and logical resources must be present on every node in the cluster that acts as a backup of the primary host node in the cluster. Yes, as mentioned previously, that could mean that you would need to purchase multiple software licenses even though only one license might be used at a time! We keep thinking it would be nice if Microsoft would figure out how to treat a software license as a cluster resource that could be transferred along with a resource group to another node in the cluster (hint, hint).
Table 5.2: Cluster Failover Scenarios
Scenario        Initiated By          Reason
Administrative  Administrator         Administrative tasks
Recovery        Hardware or software  Malfunction
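The kinds of decisions an administrator preconfigures for a resource group, namely where the group should go on failover and when it is allowed to fail back, can be sketched as below. The `GroupPolicy` fields and both functions are hypothetical and do not correspond to actual Cluster Service property names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GroupPolicy:
    """Failover properties for one resource group (illustrative only)."""
    name: str
    preferred_owners: List[str]
    # (start_hour, end_hour) during which failback may occur; None = right away.
    failback_window: Optional[Tuple[int, int]] = None


def failover_target(policy: GroupPolicy, online: List[str],
                    failed: str) -> Optional[str]:
    """Pick the first preferred owner that is online and is not the failed node."""
    for node in policy.preferred_owners:
        if node != failed and node in online:
            return node
    return None


def failback_allowed(policy: GroupPolicy, hour: int) -> bool:
    """Decide whether the group may return to its original node at this hour."""
    if policy.failback_window is None:
        return True
    start, end = policy.failback_window
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end  # window wraps past midnight
```

A policy with a window of (22, 4), for instance, would hold the group on the backup node through the business day and let it fail back only between 10 p.m. and 4 a.m.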
5.6.1 Administrative failover An administrative failover is a little less dramatic in that the decision to failover is being made by a calm, level−headed system administrator. This type of failover is called administrative because the failover event is the result of a conscious action taken by the administrator and not because of the failure of a resource on a cluster node. Because of the high reliability of today's computer hardware, administrative will more than likely be the most common type of failover. The statistics being reported seem to confirm this view. We have talked to a lot of system administrators who did not believe this until they thought about it for a minute. Usually, once you get the initial hardware problems resolved on a server, the only time it is taken offline is when either the hardware or the software needs updating. Consider for a moment how many applications you have running on your servers, and recall how often you get those little reminders in the mail from your favorite software vendors telling you about the special deals they have and why you should update your software. Examples of administrative failover include using a failover to perform a manual load balancing of services across a cluster, upgrading system software, or performing hardware maintenance. During an administrative failover event, there is an orderly shutdown of the services running that are to be failed over to another node. This is significant because it means that files are closed and system caches are flushed in an orderly fashion, providing for a controlled shutdown. Obviously, this is the best−case scenario for a cluster failover. With an administrative failover you should have minimal to no loss of data and little business interruption; in fact, most users should see no interruption at all. To achieve this level of service requires close coordination between the cluster administrator and the business operations manager. 
Microsoft's Cluster Server is not a "non-stop" solution. Therefore, anytime a failover takes place, the applications are stopped and then restarted. A well-written cluster-aware application could potentially improve on that by requiring only that the data be failed over to another node that is already running another instance of the application. Even in that scenario, however, some time could still be needed for application startup, for example, to verify and open the various files associated with the application that is failed over. Some administrators feel that, as a matter of course, applications should be restarted as a sort of preventive medicine. We leave you to decide whether that is based on a good technical understanding of the application or is just an "old wives' tale" tradition. Some applications are known to eat away at memory until problems occur. There are two approaches to solving this problem. One is to hope that the software vendor can figure out why its software has a memory leak and can fix it. The other, probably more practical, solution is to restart the application on a regular schedule. This will help to ensure that the application runs at peak performance even though it has a reputation for not doing the best job of managing system resources.
5.6.2 Recovery failover "In the unlikely event we experience a drop in air pressure," as they say in the airline business, "the oxygen mask will automatically fall." Well, almost the same thing is going to happen to your cluster, except that instead of dropping oxygen masks the Cluster Service is going to execute a recovery failover event. The second failover scenario in Table 5.2 occurs when a failover is executed in response to an unexpected component failure by restarting services on another node in the cluster. We will call this a recovery failover. As you would expect, a failover that occurs as a result of a software or hardware failure is likely to be the most disruptive. When an unexpected failure occurs, it is not possible to do an orderly shutdown of applications or to close open files and flush their caches. At best, everything just abruptly stops, leaving files open and data still in caches. The worst case occurs when the data in open files is corrupted when the system crashes. This means that when Cluster Service attempts to restart the application on another node in the cluster, there will likely be a considerable amount of time required to clean up the file system before applications can restart. Once the applications are restarted, they may then need to roll back any transactions that were not completed as a result of the system crash. As you can imagine, a large database-dependent application could take a considerable amount of time to come back online after a failure. The good news is that the application will be able to resume providing services to users a lot sooner than if there were no cluster at all. The alternative to a cluster failover would be either a phone call to your vendor's field service 800 number or a pager call to the system administrator. Neither of these scenarios is acceptable when service has been disrupted for hundreds of users.
When there is a failure of either hardware or software within a cluster, a failover is used to restore services. Table 5.3 provides a general guideline for the amount of business interruption you are likely to experience after a component in the cluster fails. A hard failure means that a chip, connector, or wire just stopped working; that is, if you are lucky! The worst type of hardware failure is an intermittent one, because it is so hard to find. The system may run for an hour, a day, or a week between failures, making it next to impossible to find the problem. It can also have the nasty side effect of corrupting much more data than a hard failure of a component.
Table 5.3: Cluster Recovery Failover Modes
Result        Cause                 Business Interruption
Hard failure  Device malfunction    Probable
Soft failure  Software malfunction  Possible, depending on how the application is written and how the cluster is designed
A software failure, on the other hand, might be easier to deal with if the application has a well-written Resource DLL that does a good job of monitoring its health. If the application's Resource DLL can detect a problem, the cluster server can act to ensure availability of the service to the users. Cluster Service could simply restart the application on the same node that it was running on, with the hope of releasing resources that may have gotten into an indeterminate state within the application's environment. The other option that Cluster Service has is to just fail over the application service to another node in the cluster. Although trying to predict what will happen when something fails in a cluster is like a roll of the dice, we believe that your best chance of minimizing downtime is with a soft failure of a well-written cluster software application. That is why we give a software failure a somewhat better score in Table 5.3. We are also assuming that most of the applications that you are likely to be currently running on a cluster are written to deliver the best availability possible by taking a proactive approach to error detection, reporting, and recovery.
A recovery failover, as its name implies, is the result of either a hard failure or a soft failure. A hard failure means that something just plain stopped working all of a sudden without any warning. The failure could be caused by any cluster component in the system. An example would be if an integrated circuit failed or a cable was cut, but it could also be a software failure. The important thing to remember is that it was not planned and there was no way that the cluster software could have known that a failure was about to occur.
This means that unless the application is journal logging as it goes, users will more than likely experience business interruption due to some loss of data. At a minimum, you should expect to have some loss of productive employee time, depending on the amount of time required to reconstruct the state of the application when the failure occurred.
A "soft failure" occurs when the clustering software detects a failure of a software application service. An example of this would be if an application just stops responding to requests or becomes slow to respond for some reason. This could be due to a memory leak or the depletion of a system resource that then causes the application to slow down. The clustering software can detect problems by polling the application's Resource DLL. Once a problem is detected, the cluster can instruct the application to shut down in preparation for being failed over to another node or even being restarted on the existing node. The key point is that when a soft failure occurs, the cluster is able to command the application service to perform an orderly shutdown. This is the best situation you can hope for with a cluster. This type of failure should allow the clustering software to restart the application services in the shortest amount of time.
The most practical way to look at failover is from a user's point of view, by considering the impact the failure of a cluster service might have on its end users. From a user's perspective, there are two types of recovery failovers, which are shown in Table 5.4. We will refer to the first of these as a "session failover." This type of failover would not be detectable by the end user, so none of the users would experience any business interruption. The cluster would fail over the application and its associated databases in such a manner that no interruption of service is detected by the end user. This is the ideal scenario and the ultimate goal of clustering. We call it a session failover because the context of the session between the user and the server is preserved during the failover. Users will not have to log in again, reestablish connections, or restart their applications. The whole process of failover is completely transparent from the user's point of view.
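The soft-failure handling described earlier in this section (poll the application's health, try a restart on the same node, and fail over if that does not help) can be sketched as a simple decision loop. The restart limit and the `recovery_actions` function are assumptions made for the example; the text does not specify how many local restarts Cluster Service attempts.

```python
from typing import List


def recovery_actions(health_checks: List[bool], restart_limit: int) -> List[str]:
    """Map a sequence of poll results to the action taken after each poll.

    health_checks holds one boolean per polling interval (True = healthy).
    After restart_limit local restarts, the next failed poll triggers a
    failover and polling on this node stops.
    """
    actions = []
    restarts = 0
    for healthy in health_checks:
        if healthy:
            actions.append("none")
        elif restarts < restart_limit:
            restarts += 1
            actions.append("restart on same node")
        else:
            actions.append("fail over to another node")
            break
    return actions
```

With a limit of two, a resource that fails three polls is restarted locally twice and then moved to another node, which reflects the two options the text gives the Cluster Service.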
Table 5.4: Recovery Failover Types
Failover Type  Session Remains Intact  Business Interruption
Session        Yes                     No
Lost Session   No                      Yes
We call the second type of recovery failover a lost session failover. From a user's perspective, this type of failover can be downright annoying when it occurs, because the logical session that had been established between the user's desktop computer and the server is lost. The application context that existed is gone, along with any knowledge of the user's session. In this scenario, the end user will be required, at a minimum, to reestablish his session with the cluster and the clustered application. In the worst case, the user could potentially lose the information that he was working with at the time of the failure. We call this type of recovery failover a "lost session failover" because when the failover occurs, the session's context is lost. Business interruption will be experienced, and the user will certainly be inconvenienced.
A potential reason for a lost session failover to occur is that the application was not specifically written to run in a cluster environment or is an incomplete implementation. Typically, an application can tolerate only a short loss of connectivity to its host server. Microsoft's networking has built-in mechanisms to automatically reconnect to network services; these will work fine for very short network "glitches." However, the time it takes for an application to fail over is likely to be longer than a typical glitch, which is more than many non-cluster-aware applications can tolerate. In the case of non-cluster-aware applications, their error-detection routines would simply determine that they can no longer access data on the server and report back an error to the end user. A cluster-friendly application would take into consideration that a cluster failover may be taking place and alert users on the cluster that they need to wait until a failover completes.
The best way to deal with this type of problem is to insist that the software vendors that you deal with write their applications to fully support clustering. Even so, there may be other components in the cluster such as network gear that may also take too long to failover, thereby causing the end user's sessions to terminate. A lost session failover will certainly contribute to the total time that users must wait for their application to come back online. Every effort should be made to reduce the impact of a lost session failover. Don't overlook the obvious, either; simply providing good training and setting expectations will help users deal with system failures in a timely manner. It is amazing how much time users can waste 75
chitchatting and trying to guess what is happening to the network while a normal failover is taking place. If the users are told in advance what to expect and how to deal with the outages, the whole process will go a lot smoother.

To clarify our definition for failover scenarios and put everything into perspective, take a look at Table 5.5. We have defined two specific failover scenarios that are supported by typical clustering solutions: administrative failover and recovery failover. An administrative failover event is planned for by the system administrator and should not be a surprise for anyone when it occurs. The administrative failover can be scheduled to occur at a time when there will be either minimal or, ideally, no impact on business operations. If a failover occurs because the cluster needs to recover from a component failure of some type, we refer to this event as a recovery failover. The recovery failover can be attributed to the failure of either a hardware or software component in the cluster. You might also want to consider how the cluster would behave when the hardware failure is of an intermittent nature. An intermittent hardware failure has got to be the most frustrating scenario to deal with, because it is like trying to shoot a moving target.

Finally, a recovery failover has two more characteristics associated with it. We refer to them as session and lost session. The ideal scenario is for the client/server session to be kept intact during the failover period. In that scenario, we call it a session failover. If the client system loses its logical connection to the cluster during failover and must reconnect after the services have failed over, then we call that a lost session failover. The reason is that all knowledge about the "state" of the connection between the client and the server is lost.
The right−hand column in Table 5.5 summarizes the effect the failover will have on the availability of the application services to the business community.
Table 5.5: Failover Scenario Characteristics

Failover          Reason     Type            Business Interruption
Administrative    Planned    Planned         No
Recovery          Hard       Session         No
Recovery          Hard       Lost Session    Yes
Recovery          Soft       Session         No
Recovery          Soft       Lost Session    Yes
5.6.3 Cluster failback

The opposite of a cluster failover is called a failback. The purpose of a failback is not to recover from any software or hardware failure. Instead, failback is used to restore the cluster and its services to their default initial configuration. A failback is a deliberate action taken either automatically by the clustering software or manually by the system administrator. A failback event is not triggered as a result of a system failure. A cluster administrator can set the failback properties to meet the company's operating procedures and needs. Typically, once a failover does occur, you would want the services to remain running where they are until a time when the system load is at its lowest point. Then you could decide to failover the applications to run on their default nodes. When an application is restarted on its default node by initiating a cluster failover, it is called a failback. In any case, the decision of when and how to failover services is totally under the control of the cluster system administrator.
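The idea of deferring a failback until a quiet period can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: the function names are hypothetical, the window is expressed in whole hours (0 to 23) and treated as half-open, and the real clustering software applies its own rules.

```python
def in_failback_window(hour, window_start, window_end):
    """True if hour falls in [window_start, window_end), allowing the
    window to wrap past midnight (for example 22 to 4).
    Equal start and end values are treated as an empty window."""
    if window_start <= window_end:
        return window_start <= hour < window_end
    return hour >= window_start or hour < window_end


def should_fail_back(current_node, preferred_node, preferred_node_online,
                     hour, window_start, window_end):
    """Decide whether a resource group should move back to its default node."""
    if current_node == preferred_node:
        return False          # already running where it belongs
    if not preferred_node_online:
        return False          # default node has not been repaired yet
    return in_failback_window(hour, window_start, window_end)
```

For example, with a window of 22 to 4, a group that failed over at noon would stay on the surviving node through the business day and move back only after 10 p.m.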
5.6.4 Planning for a cluster failover

As any Boy Scout knows, you must always be prepared. As a cluster administrator, you have the job of determining what actions the Cluster Service performs in case of a failure. System administrators need to carefully plan and, more importantly, document the failover policies for their clusters. The cluster failover
plan is quite important to the operation of the company. It should be reviewed by management to ensure that it adequately protects the business from losses due to the unavailability of critical software applications. Application developers also have the ability to determine how their application will behave if the system resources it needs are no longer available. Third-party developers are being encouraged to update their applications to be cluster aware. Ideally, software developers should go the extra mile and incorporate more automated and sophisticated error-detection code into applications that are going to be critical to a business's bottom line. This new breed of "smart applications" should constantly monitor the vital signs of the system, trying to predict when a failure might occur. The goal should be to detect warning signs (e.g., excessive use of system resources or decreasing memory working space) and take a proactive approach to fixing the problem before users are affected. These "smart applications" can't do it all alone; the hardware platforms that they are running on must also be "smart."

Microsoft has planned multiple releases or phases for Cluster Service. The first phase was designed to meet a limited number of requirements in order to deliver the technology as quickly as possible to users. System availability was one requirement that was given very high priority because that was what Microsoft's customers said they needed for Windows NT. Scalability ended up a little lower on the priority list during the first go-around. Take, for example, a cluster running SQL Server with Cluster Server Phase 1. In phase one there is the limitation that you are allowed to have an application running on only one member node at a time.
The reason for this is that if it becomes necessary to failover SQL Server from the primary host, it needs a host that does not already have a copy of SQL Server running, because two copies of the same application could get confused if their operations were not coordinated. The problem arises because there is no mechanism to coordinate multiple instances of an application on a cluster. This means that we need to failover the whole resource group as a single entity. The good news is that as clustering technology matures it will be possible to treat the actual database as the resource that is part of the failover group and not the application itself. That means that every node in the cluster could have an instance of the application up and running and servicing users. Not requiring applications to failover means that there will be less interruption for users. The advantage is that multiple running copies of an application can share the processing load of the enterprise and are ready to take over from one another if a failure occurs.
5.6.5 Failover policies

One of the primary responsibilities of a cluster administrator is setting up and maintaining the policies that tell the cluster what should happen when cluster resources failover from one node to another. These policies are associated with each resource group on the cluster and can be easily administered using the cluster management interface. Although the actual failover is automatic, what happens when a failover occurs must be determined in advance. In the initial release of Microsoft's Cluster Server, the options to pick from are somewhat limited. They include a failover to either node A or node B, restarting the application on the existing node, or no failover at all. You might ask, if you have a cluster why would you not always want to perform a failover? The answer is that if the remaining node is already running critical applications and has only enough capacity to host one or two more applications, you may want to be selective about which applications are allowed to failover.

As Microsoft's Cluster Service supports more and more nodes, the number of possible failover options is going to increase exponentially. Windows 2000 Datacenter currently supports up to four clustered nodes. That number is likely to increase over time. With a four-node Datacenter cluster you have three possible nodes to which you can failover an application. The failover scenario for a four-way Datacenter cluster can quickly get very complicated. All of the other nodes could be up and running, or one or more nodes could be offline because of a failure or for system maintenance purposes. Another possibility with a multinode cluster is that there could be multiple node failures in the cluster, in which case the node that you had planned to use to failover your application may no longer be available. In that case, your application now has potentially two
nodes to choose from. There are many decision points that need to be taken into consideration when determining how an application service will failover when a failure occurs. As you can imagine, a failover decision tree can become quite complicated and convoluted for a cluster with more than two nodes.

Even in a two-node cluster there are a number of decisions that must be made to set up the failover policies for a particular cluster. For example, with a two-node cluster configuration, because of processor capacities and business priorities you may not want to failover every application when there is a failure. If both servers in a two-node cluster are already heavily loaded, then a decision must be made as to which application services must run. The ones that have a lower priority can possibly wait until the node that went offline comes back online. Many of the decisions involved in setting up failover policies simply boil down to what amount of downtime your business can afford. The business decisions made by your company's management will more than likely be what determines your cluster's failover priorities. It will be up to the cluster administrator to configure Cluster Services to implement those policies.
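The decision process described above can be sketched as a small planning routine. The Python below is a simplified model under our own assumptions, not the algorithm used by Cluster Service: Node, Group, and plan_failover are hypothetical names, load and capacity are in arbitrary units, and a lower priority number means a more important application.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: int        # total load the node can carry (arbitrary units)
    load: int = 0        # load it is already carrying
    online: bool = True


@dataclass
class Group:
    name: str
    load: int
    priority: int            # 1 = most critical
    preferred_owners: list   # node names, in order of preference


def plan_failover(groups, nodes):
    """Map each displaced group to a target node name, or None if it must
    stay offline. Groups are placed in priority order; a node is used only
    if it is online and has spare capacity. Updates node.load as it goes."""
    by_name = {n.name: n for n in nodes}
    plan = {}
    for group in sorted(groups, key=lambda g: g.priority):
        target = None
        for owner in group.preferred_owners:
            node = by_name.get(owner)
            if node and node.online and node.load + group.load <= node.capacity:
                target = node
                break
        if target is not None:
            target.load += group.load
        plan[group.name] = target.name if target else None
    return plan
```

In a two-node cluster where node A has failed and node B is already 70 percent loaded, this sketch would place a critical database group on B and leave a lower-priority print group offline until A returns, which is the kind of selective policy discussed above.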
Chapter 6: I/O Subsystem Design

6.1 I/O subsystems and capacity planning for clusters

Performance and capacity issues that are relevant to standalone, non-clustered Windows-based servers are even more relevant in a clustered environment. Specifically, a poorly designed I/O subsystem is likely to be the single biggest cause of a poorly performing cluster today. In a cluster, the I/O subsystems on other member nodes can also contribute to unnecessarily lengthy interruptions of service during a failover event. A good I/O subsystem design should take into consideration the clusterwide I/O capacity requirements for every business-critical service that is running on each node in the cluster. That will allow you to determine the I/O subsystem capacity required to support your critical applications when one or more nodes go offline and a failover is initiated.

This chapter discusses issues surrounding the I/O subsystems used in today's computer systems and how they relate to clustered computers. While we have seen a dramatic increase in the clock rate of CPUs during the past few years, we have not seen the same level of performance gains in I/O subsystem bandwidth. Multiprocessor-configured cluster nodes in which the CPUs are clocked at over a gigahertz are the norm today. Because of this, it has become painfully obvious that the I/O subsystems in today's computers are falling well below the performance curves of the processors available. This chapter presents our outlook for the next generation of I/O subsystems, which are targeted to deliver high performance as well as high availability for the next generation of enterprise cluster systems. We specifically address the system hardware architecture issues that are most important for computer designs that will be used with clustered servers.
Clustered systems today that are running very large enterprise applications are likely to have cluster nodes configured with symmetrical multiprocessor (SMP)-based computer systems. There are essentially two methods to interconnect the processor, memory, system peripherals, and high-speed I/O PCI interfaces on system motherboards. In Figure 6.1 we illustrate a bus-connected system. This architecture has been the mainstay of the computer industry for years. The performance of the bus architecture has been acceptable up to now, and it has proved to be a cost-effective solution. The problem is that as the number of CPUs connected together in an SMP system grows, the system bus becomes a point of contention for access to common shared resources such as the memory system. Also, if one of the devices on a multidrop bus fails, everything connected to the bus will probably stop working as well.
Figure 6.1: SMP system using BUS architecture.

The device contention problem will never go away completely, but current designs can help reduce the probability of bus contention and increase the I/O subsystem's capacity in the process. One such architecture is based on a switching technology that appears to be positioned to replace the bus architecture. Figure 6.2 is an example of using switch technology to interconnect processors, memory, and I/O. A switch-based solution can reduce device contention problems and at the same time isolate devices that have failed in the system. Today, switch technology is typically used only on high-end computers. Newer designs are driving costs down through high levels of silicon integration, which will allow wider deployment. The pros and cons of the different architectures that are used to interconnect CPUs, memory, and the I/O subsystem are still going to be the subject of debate for the next few years. A few are shipping already, and a few more are soon to be released. You will want to carefully watch the developments in this area as you do your cluster system design and rollout.
Figure 6.2: SMP system using switch architecture.
Intel Corporation is betting on Profusion technology, which it obtained through its acquisition of Corollary Corporation, to solve the I/O subsystem bottleneck problems associated with the bus technology it has been using all along. Systems built using the Intel Profusion chipsets are shipping today from Compaq and other vendors. The intended market for Profusion-based computer systems is the very high-end enterprise or e-business system. Customers purchasing Profusion-based systems are doing so because they need a lot of CPU horsepower in one box. Compaq, Microsoft, and others are referring to SMP configurations as "scaling up." Profusion is intended for customers who want SMP-based solutions and need to be able to scale up to eight processors in one box. Of course, a cluster can be formed by connecting multiple Profusion systems together. When multiple systems are interconnected to form a cluster, Microsoft refers to this as "scaling out." A cluster can be formed of scaled-up servers that are scaled out by interconnecting multiple computers.

Intel indisputably owns the low- to mid-range desktop and server markets but is very hungry to increase its share of the very lucrative high-end enterprise and e-business application server markets. Intel is confident that its Profusion system chipset will give its already high-performance CPUs the I/O subsystem capacity to scale up to eight or more processors in a cost-effective linear fashion. Intel's IA32 has been built using a shared bus model since the inception of the IBM PC. It served the firm well because it allowed Intel and its OEMs to build and deliver affordable, low-cost systems. First-generation computer vendors such as IBM, Digital, and others that have been supplying high-end servers for some time had already seen the bottleneck problems associated with a shared bus architecture.
They had already started to migrate years ago to a switch-based architecture designed to replace or supplement the system bus. Now that Compaq has sold the Alpha technology it received from the acquisition of Digital Equipment Corporation to Intel, we expect that Intel will quickly incorporate Digital's designs into the IA64 processor as new releases come out. The former Digital Equipment Corporation saw the need years ago for a much higher-bandwidth mechanism for transferring data between CPU, memory, and the I/O subsystem than the shared bus architecture Digital had been using with its earlier VAX systems. When Digital decided to build a high-end multiprocessor VAX system, it developed a crossbar switch architecture to deliver the I/O capacity needed to compete with the high-end IBM mainframes. Similarly, when Digital engineered its next-generation Alpha processor to replace the VAX processor, it was built from the ground up to utilize a crossbar switch-based architecture. This architecture is referred to as Alpha EV6 technology. This same technology allowed Compaq to build an Alpha SMP system known as "Wildfire" that was capable of supporting up to 254 processors. The question is, now that Compaq is no longer in the Alpha business, what will Intel do with the technology it received from the deal with Compaq?

It is interesting that Intel is not the only company that is benefiting these days from the Alpha EV6 technology. Advanced Micro Devices (AMD) knew that if its high-performance Athlon processor were going to be successful, it would require a high-speed system interconnect that would allow it not only to perform at its full potential today but also to scale as higher clock rate processors were released. AMD very wisely decided to license the EV6 Alpha bus architecture for use with its Athlon processors. AMD refers to the EV6 technology used with its Athlon CPUs as the AMD Athlon System Bus.
If Athlon's popularity is any indication, it appears that coupling the AMD Athlon System Bus with the Athlon CPU was a good design decision. The fact that the Athlon System Bus running at 200 MHz today can scale up to a 3.2 Gbytes/second bandwidth as the clock frequency is increased to 400 MHz is an indication that the EV6 technology has a distance to go before it reaches its limits. The announcement of the Intel deal with Compaq was made as this book was going to press, so we don't yet know how Intel will play its cards. Don't be surprised if, when you look at the block diagrams of Intel's Profusion next to those of AMD's Athlon System Bus, you see many similarities. That is because the crossbar switch architecture is the basic ingredient in both companies' architectures.

We must be careful not to let ourselves become bogged down in comparing bandwidths in a book about high availability. If there is a failure in the system, then all the bandwidth in the world won't do you any good. That is the reason why systems based on a switch architecture
are very exciting to anyone designing high-availability clusters. One of the benefits of a switch versus a bus architecture is that a switch can, by design, provide a good degree of isolation between the components connected to it. Each component, such as the CPU, memory, or I/O bus interface, has a point-to-point connection as it goes through the switch fabric and on to the device that it wants to communicate with. As we attempt to eliminate all single points of failure, that design feature means that if one of the CPUs or other components fails, in theory it should not pull down the other CPUs in the system with it, as a bus design would. This allows the system to survive the failure of a single component on the system interconnect. The ability to isolate a failure of a single component is very important when designing a system for high availability.

Over the next few years, it appears that the I/O subsystem will be the primary subsystem targeted by hardware designers for major improvements in performance and capacity. During the past 10 years, the primary emphasis has been on increasing the speed of the CPU. Now that processors are being clocked in excess of 1 GHz, it is time for the processor and I/O buses to do some catching up. The I/O subsystem has now become the obvious bottleneck in today's servers. It has almost reached the point where any further increase in processor speed might be a wasted effort until the performance and capacity of the system buses catch up. If your CPU is waiting for data to arrive from a disk or network device, you are not receiving the full benefit of that high-speed processor chip you purchased. The marketing approach that has been used by CPU chip vendors up to now has been to sell raw CPU megahertz! If you don't believe us, take a look at any advertisement for computers and notice the largest type font on the page. This approach seems to work with the general public.
It's an easy sell to nontechnical buyers mainly because it is a concept that is easy to convey in marketing campaigns. The message is that the more megahertz there is in the box, the faster the computer will be. Naturally, the more megahertz they put in the box, the higher the price they figure people will be willing to pay, resulting in bigger profits for computer manufacturers. Well, now it finally appears that the cat is out of the bag! The trade media have found a new story line to write about: the I/O subsystem bottleneck. Consequently, numerous articles are starting to appear explaining and warning about the potential problems awaiting the unsuspecting public if they purchase systems with inadequately designed I/O subsystems.

The really important metric used to evaluate a cluster's I/O subsystem is the I/O per second (IOPS) rate. The CPUs that are used in computers today are easily capable of putting the I/O subsystems to the test. Think about the bandwidth that would be required when you load a server with Gigabit Ethernet adapters, a multiport Fibre Channel adapter, and a Server Area Network adapter like Compaq's ServerNet. That is an awful lot of data coming into a computer! Today, modern enterprise servers are bombarded with very large amounts of data arriving to be processed at an astounding rate. In the current multimedia-crazed society, end users not only expect their requests for information to be returned to their screens blazingly fast, but they also expect to be entertained in the process with pictures, music, and video. These requirements are driving the industry toward making evolutionary advances in the I/O subsystems in the near term and revolutionary changes for the long term. The goal of this chapter is to give you a solid technical foundation by explaining the relevant technical concepts that will enable you to make wise purchasing decisions for your company's clustered systems.
A good technical grounding on the I/O subsystem will enable you to select a system that provides the maximum computing capacity at the lowest cost.
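To put the bus figures quoted earlier in this section into perspective, peak bus bandwidth is simply the width of the data path multiplied by the transfer rate. The short Python sketch below works through the Athlon System Bus numbers. It assumes a 64-bit data path, which the text above does not state, and the function name is our own.

```python
def bus_bandwidth_bytes_per_s(width_bits, transfers_per_s):
    """Peak bandwidth = bytes moved per transfer x transfers per second."""
    return (width_bits // 8) * transfers_per_s


BUS_WIDTH_BITS = 64  # assumption: 64-bit data path

at_200 = bus_bandwidth_bytes_per_s(BUS_WIDTH_BITS, 200_000_000)
at_400 = bus_bandwidth_bytes_per_s(BUS_WIDTH_BITS, 400_000_000)
print(at_200 / 1e9, "Gbytes/second at 200 MHz")  # 1.6
print(at_400 / 1e9, "Gbytes/second at 400 MHz")  # 3.2
```

These are theoretical peaks. As the rest of this chapter argues, the rate at which a system can actually complete I/O operations is usually the more useful number.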
6.2 I/O load model

Windows NT/2000 file and print server I/O can be characterized as having a small data transfer size but a very high number of I/O transactions per second. We will call this a "small/high" model. This characterization of a server's I/O utilization is what we refer to as an "I/O load model." Examples of other server types using this
small/high I/O load model include messaging servers (e.g., Notes/Domino or Exchange Server). Another example is a client/server application that is servicing a very large number of requesters. Each request for data sent by the client might be only a few bytes long, and the response returned to the client would also be only a few bytes long. In such a scenario, a high number of transactions per second will stress the I/O subsystem's ability to handle I/O requests to the disk storage farm. A well-designed I/O controller that is able to optimize and prioritize requests going to an array of disks would be very beneficial here. But even the best designed I/O adapter has its limits when it comes to the number of I/Os per second that it can process.

The large number of I/O requests going to the disk array will likely encounter the laws of disk physics long before they tax the capacity of the controller's electronics. A well-designed controller can go a long way in masking the physical limitations of a mechanical device such as a disk drive. Remember that the laws of physics allow you to accelerate and reposition the physical read/write head only so fast. Large buffers on the disk and the I/O controller will be a big help, and the firmware on the controller can also help by ordering read/write requests to minimize or eliminate unnecessary head seeks or waits for disk rotation. The speed and capacity of the control electronics on the adapter ultimately determine how many operations can be performed per second. If you can find it on a vendor's glossy product brochure, it will be referred to as the I/O per second rate. Unfortunately, it seems that most vendors don't include it in their data sheets, probably because it is hard to derive due to the many variables involved. Another reason, which we think is more plausible, is that the concept is hard to explain to people who are not very technical.
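The benefit of having controller firmware reorder requests can be shown with a small Python sketch. It compares the total head travel for a queue serviced in arrival order against the same queue serviced in a single sweep up and back (a LOOK-style "elevator" ordering). This is a textbook scheduling technique used here for illustration; we are not claiming that any particular controller implements it this way, and the track numbers are made up.

```python
def seek_distance(start, tracks):
    """Total head movement, in tracks, to visit the requests in the given order."""
    total, position = 0, start
    for track in tracks:
        total += abs(track - position)
        position = track
    return total


def elevator_order(start, tracks):
    """Service everything at or above the head position on the way up,
    then the remaining requests on the way back down."""
    upward = sorted(t for t in tracks if t >= start)
    downward = sorted((t for t in tracks if t < start), reverse=True)
    return upward + downward


head = 50
queue = [95, 10, 60, 5, 80]                              # arrival order
print(seek_distance(head, queue))                        # 310
print(seek_distance(head, elevator_order(head, queue)))  # 135
```

In this made-up queue the reordered schedule moves the head less than half as far, which is why request ordering matters so much under a small/high load.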
Therefore, what you will see most often is a blurb simply stating the speeds of the industry-standard buses that the adapter connects to, because that does not require any explaining. For example, you will see 160 Mbytes per second quoted for a SCSI Ultra 160 bus or 100 Mbytes per second for Fibre Channel. The problem this presents for consumers is that it does not provide them with any real information with which to analyze their I/O subsystem capacity. Numbers like these reflect only the theoretical maximum bit rate of the physical media. They don't take into account the overhead imposed by the protocol used to coordinate the transfer of data from one device to another on the cable.

Adding more physical disk drives to an array helps the situation by allowing the array controller to distribute the I/O load across multiple drives (sometimes referred to as spindles). The simple explanation as to why this helps is that while one disk drive is doing a seek, the controller can go on to the next drive and initiate another operation on it while waiting for the first drive to complete its seek. In fact, every drive in the array could be independently working on an I/O operation. As you can see, the more drives in the array, the more work can be accomplished by the array as a whole. This takes advantage of the fact that an array controller can queue up I/O requests from a host, prioritize them, and then issue commands to each drive so that they can all work in parallel.

The second I/O load model that we are going to talk about is called the "large/low" I/O load model. In this model, the I/O subsystem is required to handle larger chunks of data but with a low number of I/O transactions per second. Some examples of server types using the large/low model include imaging servers, multimedia servers, graphics-intensive intranet Web servers, and application services with larger data transfer sizes.
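A rough calculation shows why bus speed alone says so little about a small/high workload. The Python sketch below estimates the random I/O rate of a single drive from its average seek time and rotational speed, scales it across an array, and converts the result to throughput. The drive figures are hypothetical, and the model deliberately ignores caching, command queuing, and transfer time, so treat the results as order-of-magnitude estimates only.

```python
def drive_iops(avg_seek_ms, rpm):
    """Approximate random I/Os per second for one drive:
    each I/O costs an average seek plus half a revolution."""
    rotational_latency_ms = 0.5 * 60_000 / rpm
    return 1000 / (avg_seek_ms + rotational_latency_ms)


def array_iops(drives, avg_seek_ms, rpm):
    """Idealized array rate, assuming requests spread evenly over all spindles."""
    return drives * drive_iops(avg_seek_ms, rpm)


def throughput_mbytes_per_s(iops, transfer_bytes):
    """Data actually moved per second at a given I/O rate and transfer size."""
    return iops * transfer_bytes / 1_000_000


# Hypothetical drive: 5 ms average seek, 10,000 rpm.
one_drive = drive_iops(5, 10_000)                   # 125.0
eight_drives = array_iops(8, 5, 10_000)             # 1000.0
small = throughput_mbytes_per_s(eight_drives, 4096) # 4.096
print(one_drive, eight_drives, small)
```

Under these assumptions an eight-drive array servicing 4-Kbyte random requests moves only about 4 Mbytes per second, a small fraction of the 160 Mbytes per second quoted for the bus. The limit is the number of operations the spindles can complete, not the speed of the cable.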
There is a certain amount of controller overhead associated with just setting up the I/O operation to be performed. Transferring larger amounts of data per I/O operation cycle is typically less stressful on an I/O controller than a lot of small I/O transfers. The large/low I/O model can also place less stress on the physical limits of a disk array. We are assuming that the disks in your array are logically defragmented and the data is organized in a linear fashion. Once a read operation begins on a large data transfer such as motion video or audio, assuming that the data can be read sequentially, the buffers on the disks and I/O controllers will be working at their peak efficiency. That is not the case with the small/high load model discussed previously, because the small files being accessed randomly
across the disk platter will render the local buffers useless to some degree. But with a large/low model, once the physical location of the file containing the desired data is found, the controller electronics on the disk can start pulling data off the platter in a linear fashion and placing it in the drive's onboard cache. If the file on the disk is stored in sequential blocks, the read/write heads will not have to be moved, thereby greatly improving the transfer of data from the disk platter. These characterizations or I/O load models are just one way in which you can evaluate the requirements of your particular application's needs for I/O capacity. Table 6.1 summarizes the relationship between these models.
Table 6.1: I/O Load Models

Server Type    Data Transfer Size    I/O per Second Rate    I/O Load Model
File/Print     Small                 High                   Small/High
Messaging      Small                 High                   Small/High
Imaging        Large                 Low                    Large/Low
Multimedia     Large                 Low                    Large/Low

Our primary area of concern for I/O subsystem design is the small/high I/O load model, which is typical of file and print services on Windows-based servers. Experience has shown that the clusters that tend to bottleneck during a system failure event are predominantly those with small/high I/O loads. This is due to the way server hardware is typically configured. From what we have seen, people tend to design for bus speeds and not actual capacity. It's hard to analyze a system to find the true bottleneck. What usually happens is that a completely new system is rolled in to replace the one that is perceived to be "out of capacity" when the problem might be remedied by simply moving the I/O controller to a different bus or adding an additional I/O controller to balance the load. Unfortunately, when a new system is rolled in and things seem to start working better, no one takes the time to find out what was really causing the problem in the first place. Everybody's happy, and the tech staff goes to work putting out the next big fire.
6.3 Data processing capacity model for a cluster

Contrary to what some computer marketers are pitching these days, the processing capacity of a computing system depends on more than the mere speed or megahertz of its CPU chips. In order to take into account the total system design capacity for a computer system, or a cluster, the design must take into account all of the following essential elements:

• Processor(s) or CPU(s)
• Memory bandwidth (data bus)
• Memory operation rate
• I/O bandwidth (I/O bus(es))
• I/O operation rate (IOPS)

Each of these elements of the capacity model performs its own discrete functions. As you start to analyze a computing system, one of these elements will stand out as a primary bottleneck to performance. Once the primary bottleneck has been addressed, other bottlenecks will be exposed. In turn, they can then be analyzed
and resolved to prevent them from imposing a performance or capacity constraint on the system. The goal of designing a high-performance system is to reduce or eliminate each bottleneck in turn. This can be accomplished either by increasing the capacity of an element or by decreasing the load placed on it. A discussion of each of these elements now follows. The characteristics that can affect performance will be highlighted. We will give some recommendations for maximizing the total capacity of the system as we discuss each element. The theoretical maximum capacity of the system is possible only when each of these elements is individually tuned for its maximum capacity.

Together these elements make up what we refer to as a "whole computer" or cluster node. In turn, each of these "whole computers" is linked by some type of interface bus and controlled by clustering software associated with the operating system resident on each node. Although they remain independent computers even while in a cluster, a node that has a poorly designed I/O subsystem can have a negative effect on the cluster when a failover occurs. Every node in a cluster must be ready to take up the slack of any other node in the cluster that goes offline. If it is not, when a failover occurs the additional load on the server could swamp a marginally performing I/O subsystem.
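The capacity model lends itself to a simple worked example. The Python sketch below expresses the load on each of the five elements as a fraction of its capacity, identifies the primary bottleneck, and then checks what happens when a surviving node absorbs the load of a failed one. The element names, the numbers, and the assumption that loads simply add are ours, chosen only to illustrate the reasoning.

```python
ELEMENTS = ("processor", "memory_bandwidth", "memory_ops", "io_bandwidth", "iops")


def utilization(load, capacity):
    """Fraction of each element's capacity that is in use."""
    return {e: load[e] / capacity[e] for e in ELEMENTS}


def primary_bottleneck(load, capacity):
    """The element closest to (or furthest beyond) its capacity."""
    u = utilization(load, capacity)
    return max(u, key=u.get)


def load_after_failover(survivor_load, failed_node_load):
    """Assume the survivor simply inherits the failed node's load."""
    return {e: survivor_load[e] + failed_node_load[e] for e in ELEMENTS}


def overloaded_elements(load, capacity):
    """Elements whose load exceeds 100 percent of capacity."""
    return [e for e, u in utilization(load, capacity).items() if u > 1.0]


# Hypothetical figures, in percent of what one node can deliver.
capacity = dict.fromkeys(ELEMENTS, 100)
node_a = {"processor": 40, "memory_bandwidth": 30, "memory_ops": 25,
          "io_bandwidth": 35, "iops": 60}
node_b = {"processor": 30, "memory_bandwidth": 20, "memory_ops": 20,
          "io_bandwidth": 25, "iops": 50}

print(primary_bottleneck(node_a, capacity))            # iops
combined = load_after_failover(node_a, node_b)
print(overloaded_elements(combined, capacity))         # ['iops']
```

In this example both nodes look comfortable on their own, and the processor would still have headroom after a failover, yet the I/O operation rate would be pushed past its limit. This is the situation the capacity model is meant to expose before a failure does.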
6.3.1 Processor

The processor element may consist of one or more physical CPU chips in a computer system or cluster node. There are three performance metrics that apply to the processor:

• Processor bus bandwidth in megabytes or gigabytes per second
• Clock speed of the processor bus in megahertz
• Clock speed of the processor itself

Be careful not to confuse the system bus clock speed with the clock rate of the CPU. Take, for example, a 733 MHz Pentium III processor that uses a 100 MHz processor bus. The processor's clock speed is only one factor affecting the processing capacity of a computer system. The processor's internal chip architecture along with its clock frequency determines the number of instructions that the CPU can perform in one second. A CPU with a very fast clock speed is all well and good, but the processor's real ability to do useful things depends on how long it has to wait for data arriving over the system bus. Ideally, a processor should be able to fetch the data or instruction it needs the instant it needs it from its local on-chip cache. When the cache does not contain the next instruction or data, the processor must wait until the data it needs is fetched from memory.

One might wonder why we don't just add very large caches onto the CPUs. The problem is that different types of software applications differ in how effectively they can use cache memory. A software application that accesses data and instructions, for the most part, in a linear fashion would be considered "cache friendly." Data can be prefetched from memory by anticipating that the data the processor is going to need next will probably come from the next linear location in memory.
Applications that tend to access data randomly or that have programs that do a lot of branching to different nonlinear code segments would require the cache to be flushed and reloaded every time a new code segment is executed or data is accessed that does not fit into the cache. This would definitely be considered a "non-cache friendly" environment. Applications that are non-cache friendly would perform better with CPU architectures that have small caches, whereas applications that are cache friendly would benefit from CPUs with large caches.

There are other factors beyond the processor's clock rate that affect the speed of the system bus. The system controller chips used on the motherboard play a large part in determining the computing capacity of the system. These chips control and direct all movement of both data and instructions between processors, memory, and I/O buses. The architecture of the system controller determines how fast data can be moved between the various components of a computer. Here again, it's not just speed but how much data can be exchanged in a given amount of time. The width of the system bus along with the bus clock speed determines how much data is transferred in a given amount of time. The industry is moving to 64-bit bus widths for high-end systems. Obviously, a bus that is being clocked at 100 MHz will transfer more data if each clock transition transfers 64 bits compared with one that can transfer only 32 bits at a time.

Today, what would be considered a low-end server is likely to be equipped with a processor bus that is 32 bits wide and with a clock speed of 66 MHz. That works out to a 266 MB/s bandwidth. Midrange servers would likely be designed with a 64-bit wide bus running at a 100 MHz clock speed that produces an 800 MB/s bandwidth. High-end servers, on the other hand, are configured with 64-bit wide processor buses running at 133 MHz or more. That results in a bandwidth of 1.066 GB/s. On the very high end, there is the Alpha processor bus architecture developed by Digital and now owned by Compaq that runs at 200 MHz. Advanced Micro Devices (AMD) has licensed this bus architecture for its Athlon processors, giving them a whopping 1.6 GB/s of bus bandwidth. Obviously, the race is on between CPU vendors to develop processors and supporting system controller chipsets that are capable of faster processor bus speeds.

With processor speeds above 1 GHz, there is a real need to increase the processor bus speed so that it can keep up with these very fast CPUs. Ideally, the processor's speed and the system bus speed should be matched so that the processor does not have to wait for data to arrive. Remember that a CPU that is waiting for data is wasting your money. We say that because, if you went out and purchased the fastest processor on the market thinking that your investment would result in more processing capacity for your computer system, you will be getting cheated if the processor must constantly execute "test and branch" instructions while waiting for data to arrive from memory, disk, or the network over the system bus.
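The bus figures quoted above all follow from multiplying the bus width in bytes by the bus clock rate. The short Python sketch below reproduces that arithmetic. The function name is ours, the nominal "66 MHz" and "133 MHz" clocks are assumed to be 66.67 and 133.33 MHz, and the 64-bit width used for the 200 MHz Alpha/Athlon bus is inferred from the quoted 1.6 GB/s rather than stated in the text.

```python
def peak_bandwidth_mb_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak bandwidth in MB/s: bytes per transfer x million transfers per second."""
    return (bus_width_bits / 8) * clock_mhz


# (width in bits, clock in MHz); the fractional clocks are an assumption
# about what the nominal "66 MHz" and "133 MHz" ratings mean.
PROCESSOR_BUSES = {
    "low-end, 32-bit at 66 MHz": (32, 200 / 3),
    "midrange, 64-bit at 100 MHz": (64, 100.0),
    "high-end, 64-bit at 133 MHz": (64, 400 / 3),
    "Alpha/Athlon, 64-bit (inferred) at 200 MHz": (64, 200.0),
}

for name, (width, clock) in PROCESSOR_BUSES.items():
    print(f"{name}: {peak_bandwidth_mb_s(width, clock):.1f} MB/s")
```

The computed values (about 266.7, 800, 1066.7, and 1600 MB/s) agree with the figures in the text once truncation is taken into account. These are peak numbers only; they say nothing about how much of the bus is actually usable under load.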
The total bandwidth of the system bus imposes capacity limits on how fast CPU-to-CPU transfers can occur in SMP systems, and on the processor-to-memory, processor-to-video bus, and processor-to-I/O bus bandwidths, as illustrated in Figure 6.1. In addition, the width of the processor bus determines how many bits can be transferred on each clock transition. It is easy to see from Figure 6.3 how a system bus could be saturated, given the speeds that I/O devices and network interfaces are capable of today. Because the I/O-per-second rates of I/O devices are only going to increase, new technologies must provide the additional capacity that will be needed.
Figure 6.3: Processor bus bandwidths.

Be aware that in all traditional Intel architecture servers (i.e., all but those using the Profusion chipset), CPU utilization will increase as I/O utilization increases. This is because the bus architecture requires the direct intervention of the CPU to handle all I/O transfers and control functions. The Profusion chipset that Intel uses in its high-end servers is designed as an intelligent switch that is able to offload from the processors many housekeeping chores that rob them of their theoretical processing capacity. In an ideal configuration, the processor should spend all of its time processing information and not be involved in I/O operations and other system chores. This implies that a switch-based SMP architecture should scale up in a much more linear fashion than a bus-based architecture. Ideally, you would like to see the capacity of a computer scale up by a factor directly proportional to the number of CPUs installed.
6.3.2 Memory bandwidth

The term memory bandwidth refers to the amount of data transferred to or from memory per second. The memory bandwidth rating is determined by the width and the speed of the memory bus as well as the access speed of the actual memory chips. This can be different from or the same as the rating of the processor bus (system bus). Ideally, the speed of the processor and the speed of the memory system should match. The CPU's cache memory tries to mask any mismatch in speeds between the CPU and main memory. Optimizations such as pipelining and caching by the CPU chip can certainly improve the effective throughput of the memory system.

You may encounter low-end servers configured with SDRAM operating at memory bus speeds between 100 MHz (PC-100) and 133 MHz (PC-133). PC-100 SDRAMs running at a 100 MHz memory bus speed deliver a theoretical bandwidth of 800 MB/s. The next step up from there is the PC-133 SDRAM, which is implemented on a 133 MHz memory bus for a bandwidth of 1.066 GB/s. The newer PC-133 DDR technology runs on a memory bus with an effective speed of 266 MHz. RamBus technology is implemented with an effective speed of 800 MHz.

It is important to note that these bus speeds, which are quoted by every vendor in the marketplace, are only the theoretical maximum speed at which data can be transferred across the physical wire. These performance numbers don't take into account the processing overhead associated with the hardware and software (microcode) interface protocols that are used to control the flow of data between CPU, memory, and the I/O subsystem. An example of this problem is the scenario in which the I/O subsystem is in the middle of a DMA transfer between a PCI adapter and the memory subsystem. If the CPU happens to need another chunk of data, it will experience a delay while the DMA is in progress.
The process of requesting and then being granted access to a bus does reduce the effective capacity from what might be theoretically possible. The throughput capacity for processor−to−memory transfers is limited by the slower of the two bus speeds. For example, a server with a 100 MHz processor bus and PC 133 SDRAM memory will yield a peak processor−memory bandwidth of only 800 MB/s. This is the bandwidth of the 100 MHz processor bus, the lower of the two. In such mismatched configurations, you will be operating at less than 100 percent efficiency for the given hardware. One note before we leave this topic is to be aware that the term "data bus" is often used to refer to the memory (RAM) bus. This should not be confused with the I/O (disk, network, etc.) bus.
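The "slower of the two buses" rule above can be written down in a couple of lines. The following Python sketch uses our own function names and only the peak figures quoted in the text; it ignores arbitration and protocol overhead, which reduce the real numbers further.

```python
def processor_memory_bandwidth(processor_bus_mb_s: float, memory_bus_mb_s: float) -> float:
    """Peak processor-to-memory bandwidth is capped by the slower bus."""
    return min(processor_bus_mb_s, memory_bus_mb_s)


def utilization_of_faster_bus(processor_bus_mb_s: float, memory_bus_mb_s: float) -> float:
    """Fraction of the faster bus's peak bandwidth that can actually be used."""
    peak = processor_memory_bandwidth(processor_bus_mb_s, memory_bus_mb_s)
    return peak / max(processor_bus_mb_s, memory_bus_mb_s)


# Example from the text: 100 MHz processor bus (800 MB/s) with PC-133 SDRAM (1066 MB/s).
peak = processor_memory_bandwidth(800.0, 1066.0)
share = utilization_of_faster_bus(800.0, 1066.0)
print(f"peak: {peak:.0f} MB/s, memory bus used at {share:.0%} of its rating")
```

In this mismatched configuration the memory can never deliver more than about three quarters of its rated bandwidth, which is the inefficiency the text warns about.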
6.3.3 Memory operation rate

The memory operation rate is defined as the number of memory transactions per second. This is constrained by the latency in accessing memory. Latency can be caused by interference of bus master devices (video, I/O adapters, processors) competing for simultaneous access to the memory bus. Latency is also caused by the synchronization required on servers with mismatched processor bus and memory bus clock rates. The design of the interfaces from bus to bus as well as the size of I/O buffers will affect latency as well. Wait states and state handling designs of memory controllers have a significant impact on the memory operation rate of a server.
6.3.4 I/O bandwidth

The term I/O bandwidth is used to describe the amount of data that can be moved to and from the I/O bus or buses per second. I/O bandwidth is measured in megabytes per second (MB/s). The motherboards used on high-end servers use multiple I/O buses to improve the overall bandwidth capacity. These buses are categorized as follows:

• Main I/O bus(es)
• Secondary I/O bus(es)
• Video bus

The Main I/O bus connects the I/O adapters that are built directly on the motherboard, such as SCSI or network adapters, as well as the Secondary I/O buses, to the motherboard System Controller Chip, which in turn interfaces the I/O subsystem to the processor(s) and memory. The Main I/O bus is the connection point for all of the other Secondary I/O buses on the motherboard, as illustrated in Figure 6.4. The Secondary I/O buses can include one or more PCI buses for further general-purpose expansion capability and specialized buses to support system devices such as keyboards, mice, floppy disks, or CD-ROM drives.

Onboard adapters are included on a motherboard as a means of achieving higher levels of integration and thereby reducing the total system costs. Unfortunately for us, the motivation is usually to reduce costs as opposed to increasing the total system availability. We tend to prefer to have our I/O interfaces on plug-in modules as opposed to being built into the motherboard so that they can be replaced easily if a line driver IC that interfaces the adapter to the outside world should fail. There is a higher probability of failure for these interface chips because they can be exposed to electrostatic discharge from connecting external devices that may have a built-up electrical charge on them, or to induced voltage spikes due to long cable runs in close proximity to high-power devices. The video bus refers to a dedicated AGP slot that is connected directly to the System Controller.
A good motherboard design will try to balance the I/O load across multiple Secondary I/O buses and take advantage of the performance gains that can be achieved with bus segmentation implemented in PCI Bridge chips on the motherboard. The PCI Bridge interface chip allows concurrent operations to occur on the two separate buses that it connects together.
Figure 6.4: I/O subsystem bus configuration.
6.3.5 Main I/O bus

The predominant peripheral component interconnect (PCI) I/O bus has a bandwidth of 133 MB/s. This applies to the 32-bit, 33 MHz implementation of the PCI 2.0 or 2.1 specification. The PCI 2.1 specification defines a 64-bit wide bus and allows for a 66 MHz clock rate. Some of today's server implementations, however, implement only 32-bit PCI 33 MHz buses. You can expect this to change, especially once Intel's new 64-bit CPU architecture starts to show up on the market. The Adaptec Corporation points out in its "Ultra160 SCSI Technology Overview" white paper that this is one area that should be of major concern to customers who are moving to Gigabit Ethernet. Their position is that a server with Gigabit Ethernet adapters can easily generate a 200 MB/s aggregate data rate, which can overload a 32-bit, 33 MHz PCI bus. This is certainly worth looking out for if you need to design a high-end server.

In order to take advantage of a 64-bit wide PCI bus, the new 64-bit adapters must be used. A 32-bit PCI card (the most common) will function on a 64-bit implementation in 32-bit mode. Likewise, a 64-bit PCI adapter could run in 32-bit mode on a 32-bit PCI bus implementation. Variations in clock speed on a PCI 2.1 bus are possible. A 64-bit wide bus can run at a clock speed of 33 MHz, and a 32-bit wide bus can run at 66 MHz. Table 6.2 clarifies the possible combinations.
Table 6.2: PCI Bus Bandwidth

Bus Width/Spec       Clock Speed   Bandwidth   I/Os per Second
32 bit/2.0 or 2.1    33 MHz        132 MB/s    10,000
32 bit/2.1           66 MHz        266 MB/s    20,000
64 bit/2.1           33 MHz        266 MB/s    20,000
64 bit/2.1           66 MHz        532 MB/s    40,000

A 66 MHz PCI bus supports only two PCI adapters. To configure servers with this implementation, a "peer bus" or multiple PCI bus architecture must be used. A multiple PCI bus architecture is preferable on clustered servers for other important reasons:

• The aggregate bandwidth of the main I/O buses is increased.
• The aggregate IOPS capacity of the main I/O buses is increased.

There are two different methods of expanding PCI buses: bridged PCI and peer PCI. Figure 6.5 shows how a PCI bridged bus is interconnected. A bridged PCI configuration allows a manufacturer to add additional PCI adapter slots off the primary PCI bus. The bandwidth and IOPS capacity of the system remain the same. A bridged PCI configuration does not permit any isolation of bus traffic to occur. The main advantage of a bridged PCI bus is that more PCI adapters can be made available on the motherboard at a low cost. The PCI bridged architecture is acceptable only if I/O bandwidth and throughput are not issues for your application. If you just need a lot of PCI slots and there is excess I/O capacity on your server, then the bridged architecture should work fine.

One PCI slot is consumed by a PCI-PCI bridge interface chip. This segments the bus into bridged and unbridged slots. Typically slots 0, 1, and 2 are full-speed primary (unbridged) slots. Slots after number 3 are secondary (bridged) slots that access the system subject to the overhead of the PCI bridge chip.
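To make the difference between the two expansion methods concrete, the sketch below encodes Table 6.2 as a Python dictionary and compares a bridged layout with a peer layout. The function names are ours, and treating the capacity of peer buses as a simple sum is an idealized upper bound that we assume for illustration; a real system controller may not sustain the full total.

```python
# (bus width in bits, clock in MHz) -> (bandwidth in MB/s, I/Os per second), from Table 6.2
PCI_TABLE = {
    (32, 33): (132, 10_000),
    (32, 66): (266, 20_000),
    (64, 33): (266, 20_000),
    (64, 66): (532, 40_000),
}


def bridged_capacity(primary_bus):
    """Bridged slots share the primary bus, so capacity stays that of the primary bus."""
    return PCI_TABLE[primary_bus]


def peer_capacity(buses):
    """Peer buses carry traffic independently; assume (idealized) that capacities add."""
    bandwidth = sum(PCI_TABLE[bus][0] for bus in buses)
    iops = sum(PCI_TABLE[bus][1] for bus in buses)
    return bandwidth, iops


print("bridged 32-bit/33 MHz:", bridged_capacity((32, 33)))
print("two peer 64-bit/66 MHz buses:", peer_capacity([(64, 66), (64, 66)]))
```

Adding bridged slots leaves the first result unchanged no matter how many adapters are installed, whereas each additional peer bus raises both totals.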
Figure 6.5: PCI bridged bus configurations.

A peer or multiple PCI bus architecture as shown in Figure 6.6 increases the aggregate bandwidth and IOPS capacities of the I/O subsystem. Typically, you find the PCI peer bus architecture used on high-performance motherboards. The PCI bridge interface chip will allow traffic on each PCI bus segment to flow independently. This means that the overall throughput of the system is going to be much greater than on a motherboard that was implemented using a PCI bridged architecture. The more sophisticated chip implementations can perform PCI adapter peer-to-peer I/Os without interrupting a processor.

Servers based on the Intel Profusion chipset free the processors from having to handle I/O transfer requests. The Profusion chipset uses a "crossbar switch" type of technology, in which the I/O interrupts are offloaded from the CPUs on the motherboard. With respect to main I/O, this architecture inherently provides multiple peer PCI buses. Peer or multiple PCI buses are preferable for server systems, especially high-end and clustered servers. Profusion chipsets on the current eight-way servers are the state of the art for I/O capacity and for minimal CPU interrupt latencies.
Figure 6.6: PCI peer bus architecture.
6.3.6 AGP video bus

The advantage of the AGP bus standard is that it is a dedicated video bus that removes video I/O bus traffic from the main system I/O buses. A PCI-based video adapter will consume bandwidth on the I/O bus and consume part of the IOPS capacity of the PCI bus. The AGP bus is a 32-bit wide bus implemented at a clock rate of 66 MHz for a bandwidth of 266 MB/s. The 2X and 4X modes actually perform two and four data transfers per cycle. A 2X AGP video controller has a bandwidth of 533 MB/s, and a 4X AGP bus transfers 1.066 GB/s. The AGP bus should be used for any cluster design, no matter whether your system will be used for "graphics" or not. You can assume that a Windows system by nature is a graphic-intensive system.

If you want to visualize the effect that the I/O bandwidth capacity of your motherboard has on your server, you can utilize the NT Performance Monitor "logical disk" and "physical disk" objects to do an analysis of I/O capacity.
6.3.7 I/O operation per second rate (IOPS)

IOPS is the I/O transaction per second rate of a bus or aggregate of buses on a server. Windows NT/2000 servers with a small/high I/O load model (many small I/Os at a high rate, as with file, print, or messaging servers) tend to max out at the I/O per second rate (IOPS). The problem is that this hardware-based limitation, unlike CPU or memory exhaustion, is not obvious when it is reached. In order to increase capacity or specify the correct hardware for a cluster server, IOPS issues must be considered. Figure 6.7 shows the typical IOPS rates supported in server hardware in the field today.
Figure 6.7: Capacity model.

Hardware advertisements from storage vendors and other sources of information highlight the bandwidth of storage systems and components. This is rarely the bottleneck on small/high servers. We are going to discuss the capacity model for I/O components and the attributes and parameters that affect IOPS. This will enable you to analyze cluster server systems and products before you scale up or purchase new hardware.

Consider the following as an example of an IOPS capacity problem. Assume a 2,800-workstation site with one two-node cluster for file and print. The NT Performance Monitor shows that CPUs are not exhausted, nor is memory on either member. A look at the Physical Disk and Logical Disk objects in Performance Monitor shows average queue lengths greater than two on both. In fact, the current queue lengths tend to stay in double-digit numbers. This is indicative of a cluster that is I/O bound. Each system in this cluster uses a three-channel 80 MB/s SCSI RAID controller, and all disks are 7,200 rpm with an 8 ms access time. Also assume a 33 MHz 32-bit PCI bus with a bandwidth of 132 MB/s. Located on the bus is a 100 Mbps Ethernet adapter. How would you upgrade this system?

Common responses, especially from salespeople, are to upgrade to faster RAID arrays or controllers, or to upgrade to Fibre Channel. These solutions may or may not solve the performance problem and allow the cluster to handle an increased load. The bandwidth may not be the bottleneck. The IOPS capacity of the PCI bus, controller/adapter, or disk(s) is the bottleneck on NT/2000 file servers more often than not. A typical 32-bit 33 MHz PCI bus has an effective IOPS capacity of 10,000 I/Os per second. A single modern RAID or Fibre Channel adapter can swamp this implementation of a PCI bus. Before rushing out and purchasing more equipment, check first to see whether the bus IOPS rate might be your bottleneck.
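A quick calculation shows why a bus can run out of IOPS long before it runs out of bandwidth. The Python sketch below uses the 10,000 IOPS and 132 MB/s figures given for a 32-bit 33 MHz PCI bus; the 4 KB and 64 KB request sizes and the function names are our own assumptions for illustration, not measurements from the example site.

```python
def bandwidth_at_iops_limit(iops_limit: int, io_size_bytes: int) -> float:
    """Bandwidth (MB/s, decimal megabytes) consumed when the bus runs at its IOPS limit."""
    return iops_limit * io_size_bytes / 1_000_000


def limiting_factor(iops_limit: int, bandwidth_limit_mb_s: float, io_size_bytes: int) -> str:
    """Report which limit is reached first for a given average I/O size."""
    crossover_bytes = bandwidth_limit_mb_s * 1_000_000 / iops_limit
    return "IOPS" if io_size_bytes < crossover_bytes else "bandwidth"


PCI_IOPS, PCI_MB_S = 10_000, 132  # 32-bit, 33 MHz PCI figures from the text

for size in (4096, 65536):  # assumed request sizes: small file-server I/O vs. large transfer
    used = bandwidth_at_iops_limit(PCI_IOPS, size)
    print(f"{size} byte I/Os: {limiting_factor(PCI_IOPS, PCI_MB_S, size)}-bound, "
          f"{used:.1f} MB/s at the IOPS limit")
```

With small requests the bus hits its I/O rate ceiling while moving only about 41 MB/s, less than a third of its rated bandwidth, so a faster adapter on the same bus would not help.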
It is not a good career move to spend your company's money for new equipment that will not give substantially better performance or keep services available in case of a system failure event. In this example, purchasing a Fibre Channel adapter would not increase your capacity if the bottleneck is in the IOPS capacity of the PCI bus. The solution might be as simple as balancing the load on the PCI bus by adjusting the position where each PCI card is placed on the bus.

Focusing on bandwidth is "minicomputer think." It is also psychologically comfortable for those of us who do not understand this lower level of hardware. Big bandwidth numbers look good on a glossy ad and are easy to find on a specifications sheet. One usually has to speak with a vendor's hardware engineer to find IOPS capacities of adapters, controllers, hubs, and switches. A common (and valid) approach to this problem is to buy more state-of-the-art hardware. The bandwidth of the I/O bus is usually not the constraining factor with a small/high NT/2000 server. Focusing on bandwidth is a valid approach when working with UNIX, OS/400, or OpenVMS. Minicomputer systems tend to perform fewer, but larger, I/O transfers compared with Microsoft file and print servers.

Note that the IOPS rate applies to multiple points in the storage configuration. The bus, adapter, controller, and each disk have their own fixed IOPS capacity. When capacity planning for a cluster, the smallest IOPS rate of any single component in a tracethrough from the CPU to the end of the bus is the bottleneck. A tracethrough of the components is typically disk-controller-adapter-bus. The IOPS capacity at the bottleneck must be larger than the I/O demand that a single cluster member will face during a system failure event. In other words, you will need to consider the sum of the IOPS rates for all applications that might be running on a particular node during a failover event.

Of the five key elements of capacity, the two that we are concerned with are the I/O bandwidth and I/O operation per second rate. What really counts in configuring a Microsoft server cluster is the I/O rate (IOPS). The bandwidth of the PCI bus, again, is 133 MB/s. Even a larger file server (4,000+ users) rarely moves that much data consistently. This is because Windows and DOS programs tend to generate a large number of small-sized random-access I/Os. Ballpark numbers on the bandwidth utilization of large file servers are 48 to 96 MB/s. This does not mean that you need only one bus on a big system.
The ability to balance the I/O load on a server is key, even if the capacity of the first bus has not been reached. Even a mid−sized server can reach the IOPS capacity of a SCSI adapter or the PCI bus long before the I/O bandwidth capacity has been consumed. IOPS on midsized and larger servers start at the ballpark rate of 8,000 I/O/sec. This number can vary widely based on the load at your site. A vendor's I/O subsystem does not always match the CPU and memory in terms of capacity. It is common to observe NT/2000 systems that are bottlenecking at the I/O subsystem and are traded in for new models with more or faster CPUs and newer features. Proper understanding of I/O−per−second demands in your environment and I/O−per−second capacity of your storage systems will enable more informed decisions in connection with the acquisition and expansion of cluster hardware.
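The tracethrough rule described in this section (the component with the smallest IOPS rate is the bottleneck, and it must exceed the combined demand a node faces during failover) can be sketched as follows. Apart from the 10,000 IOPS PCI bus figure, every capacity and demand number here is hypothetical, and the function names are ours.

```python
def find_bottleneck(path: dict) -> tuple:
    """Return (component, IOPS capacity) for the lowest-capacity component in the path."""
    return min(path.items(), key=lambda item: item[1])


def survives_failover(path: dict, app_iops_demands: list) -> bool:
    """True if the bottleneck capacity is larger than the summed demand during failover."""
    _, capacity = find_bottleneck(path)
    return sum(app_iops_demands) < capacity


# Tracethrough disk-controller-adapter-bus; values other than the PCI bus are hypothetical.
storage_path = {
    "disk set": 12_000,
    "controller": 14_000,
    "adapter": 15_000,
    "PCI bus": 10_000,
}

normal_load = [4_500, 1_500]             # hypothetical file and print demand on this node
failover_load = normal_load + [5_000]    # plus hypothetical demand taken over from a failed node

print("bottleneck:", find_bottleneck(storage_path))
print("handles normal load:", survives_failover(storage_path, normal_load))
print("handles failover load:", survives_failover(storage_path, failover_load))
```

In this made-up configuration the node copes with its everyday load but not with the load it inherits during a failover, and the limiting component is the bus rather than the storage array.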
6.4 Well-engineered storage systems

The following are some general suggestions that have been useful for people responsible for designing large clustered systems. When evaluating storage hardware, look for:

• A well-written adapter device driver. For example, does the device driver interrupt the CPU multiple times per I/O?
• Use of coprocessors in the design to offload I/O handling from the main CPU (intelligent I/O controllers that utilize on-board high-performance processors).
• Implementation of the entire SCSI command set in firmware. SCSI has the capability of queuing and optimizing commands. Not all vendors implement this.
• Number of outstanding SCSI commands per channel handled versus the IOPS capacity of the SCSI adapter.
• Control of the chunk size and stripe size in a RAID set.
• Software that allows collection of IOPS statistics at the disk, channel, and controller levels.

RAID is not a catch-all solution. If the IOPS demand on a RAID 5 set exceeds the controller's capacity, it does not matter what whiz-bang features the storage array has. Though RAID 5 is usually the correct implementation for an NT/2000 file server, do not assume that the default parameters are optimum.

One example of how default parameters can cost you capacity is the "chunk size" configuration parameter associated with a RAID array. RAID 5 is disk striping with parity. Stripe sets are composed of same-size units on a set of disks. This same-size unit is the "chunk size." The average size of a host I/O request should fit within the chunk size of a RAID set. For example, if our chunk size is 256 bytes and the mode of our host I/O request is 512 bytes, we could perform only 2½ I/Os per controller operation to that stripe set. If our chunk size were set to 520 bytes, we could perform 5 I/Os per controller operation.
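The chunk-size figures in the example above are not derived in the text. One reading that reproduces them, and it is only our assumption, is that a controller operation covers one full stripe made up of five data chunks. Under that assumption the arithmetic can be sketched in Python as follows (the function name is ours):

```python
def host_ios_per_controller_op(chunk_size_bytes: int,
                               data_chunks_per_stripe: int,
                               host_io_bytes: int) -> float:
    """Host I/Os that fit in one full-stripe controller operation (parity chunk excluded)."""
    return chunk_size_bytes * data_chunks_per_stripe / host_io_bytes


DATA_CHUNKS = 5  # assumed stripe width; not stated in the text

for chunk in (256, 520):
    ios = host_ios_per_controller_op(chunk, DATA_CHUNKS, 512)
    print(f"chunk size {chunk} bytes: {ios:.2f} host I/Os per controller operation")
```

With these inputs the 256-byte chunk gives 2.5 host I/Os per operation and the 520-byte chunk gives just over 5, matching the numbers quoted. A different stripe width would change the absolute values but not the conclusion that a chunk smaller than the typical host request wastes controller operations.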
6.5 The future of system bus technology

The InfiniBand I/O architecture technology is an example of a next-generation solution targeted at satisfying the ever growing demands for I/O bandwidth in very high-end enterprise servers. I/O subsystem capacity is acknowledged by the industry as the bottleneck in providing the I/O bandwidth that will be required to deliver computing services for e-commerce and engineering customers, now and in the future. Over time it is very likely that as the cost of implementing InfiniBand decreases, it will replace the PCI bus that is integral and common on all computer systems today. The PCI bus architecture has effectively already replaced the ISA and EISA bus architectures.

The approach that is being taken by the developers of InfiniBand can best be described as a revolutionary new design of the I/O subsystem. However, the benefits of InfiniBand technology go beyond just increasing system I/O bandwidth. InfiniBand is not really a system bus like those we have been used to in the past. The closest thing to compare it to is a network switch. Data exchanges that take place between the components that make up an InfiniBand-based computer, such as CPU, memory, and I/O subsystems, are actually network packets that are sent between one component and another. This is very different from traditional computers that rely on multidrop bus architectures (e.g., ISA, EISA, or PCI), which transfer data one I/O transaction at a time. I/O transactions with InfiniBand are accomplished by sending a packet through the InfiniBand switch fabric addressed to the I/O subsystem or the memory system, similar in nature to sending a request over an Ethernet network. The difference is that InfiniBand is a great deal faster and much more scalable than Ethernet.

There is one characteristic of a switched Ethernet backbone that most of us who worked with Thin-Wire Ethernet can really appreciate: the physical isolation that exists between the nodes due to the switch.
You will notice (see Figure 6.8) that in a Thin-Wire Ethernet network, all nodes are connected to one wire called the Ethernet cable. This cable works the same way as a PCI bus in a computer, except that the PCI bus has many more wires. It is commonly referred to as a multidrop bus because multiple devices are just drops along the cable run. The problem with a multidrop network is that when one device along the cable fails, it can cause all the other devices connected to the wire to fail also. In addition, it can be fairly difficult to track down the device causing the failure without going through a process of elimination. As you can imagine, that can be a very time-consuming process, which is not tolerated these days.
Figure 6.8: Ethernet bus.

Another problem with a multidrop bus such as Ethernet is that only one device at a time can transmit data. That drastically reduces the effective bandwidth of the cable to a value a lot less than its theoretical capacity. For example, Ethernet is clocked at 10 million bits a second. However, because every device connected to the cable must wait for its turn to transmit, the actual bandwidth that is achievable on a multidrop Ethernet cable is a lot less than 10 million bits per second. It turns out to be more like 2 to 3 Mb/s.

We talked about the Ethernet bus because it is familiar to most of us and a little easier to visualize, but the PCI bus is faced with some of the same problems that Ethernet has. A paper published by Adaptec Corporation reported on the usable bandwidth of the PCI bus based on tests conducted by Adaptec. Its authors found that a PCI bus rated at 133 MB/s may actually produce a usable bandwidth of only around 90 MB/s because of the protocol and device contention overhead of the PCI bus.

By now most companies have either already adopted an Ethernet Switched backbone or are wishing that they had. The reason is that an Ethernet Switched backbone provides more bandwidth by physically isolating one node from another. Instead of nodes just hanging off a cable, an Ethernet Switch allows for a direct connection between each node and a port on the switch. If a network node fails, there is physical and logical isolation that should protect the other network nodes and allow them to continue processing. That in itself is a big improvement over the multidrop cable from the early days of Ethernet networking. Another benefit of a Switched Ethernet backbone is that each network node has its own 10 or 100 Mb/s network pipe between it and the port on the switch. That eliminates the problem of device contention. When a switch is used, it gives each device connected to the switch its own network segment.
The switch is what isolates the nodes from each other. The switch knows the identity of every device that it is connected to. You can see in Figure 6.9 that when node A tries to talk to node D, the switch has the intelligence to send the packet it receives from node A directly to node D unless it happens to be a broadcast packet. At the same time that an I/O transaction is taking place between node A and node D, it is also possible for node B and node C to be communicating with each other. As you can imagine, that is a very big improvement over the shared bus design of the original Ethernet backbone cable.
Figure 6.9: Ethernet Switch−based architecture.
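The advantage of a switch over a shared bus can be illustrated with a toy scheduling model. The Python sketch below is only a sketch under simplifying assumptions of our own: every transfer takes one time slot, the switch is non-blocking, there are no broadcast packets, and the function names are hypothetical.

```python
def shared_bus_slots(transfers: list) -> int:
    """On a multidrop bus only one device transmits at a time: one slot per transfer."""
    return len(transfers)


def switched_slots(transfers: list) -> int:
    """On a switch, transfers with no node in common can share a time slot (greedy first fit)."""
    slots = []  # each entry is the set of nodes already busy in that slot
    for src, dst in transfers:
        for busy in slots:
            if src not in busy and dst not in busy:
                busy.update((src, dst))
                break
        else:
            slots.append({src, dst})
    return len(slots)


# Node A talks to D while B talks to C, then A talks to B while C talks to D.
transfers = [("A", "D"), ("B", "C"), ("A", "B"), ("C", "D")]
print("shared bus time slots:", shared_bus_slots(transfers))
print("switched time slots:", switched_slots(transfers))
```

In this model the four transfers need four slots on the shared cable but only two through the switch, because each pair of independent conversations proceeds at the same time.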
You might have thought that we were getting off on a tangent talking about Ethernet networking, and maybe we did a little. But because most readers are already familiar with Ethernet, it makes a good example to use to describe the benefits of InfiniBand. Everything that we have just said about the benefits of the Ethernet Switch backbone also applies to the new InfiniBand technology architecture.

The nodes on an InfiniBand network consist of CPUs, memory subsystems, and I/O devices. These InfiniBand nodes are connected to ports on the InfiniBand switch. Each device has its own isolated and direct physical link to the switch. That can drastically reduce the probability of contention for resources within the system. This is especially true for SMP systems. Each CPU would have its own port connection on the InfiniBand switch. This would allow multiple data transfers to occur between multiple CPUs and I/O devices or memory subsystems because a switched fabric can handle multiple point-to-point connections simultaneously within the fabric. This allows for the highest possible throughput between subsystems in a computer system.

Another side benefit of the InfiniBand switched-fabric architecture that is important to high-availability systems is that it provides failure isolation between the components in a system. If, for example, a CPU or memory component fails, then the switch will isolate it from the remaining functioning devices in the system. Once a component fails, it is possible to perform maintenance on the failed device by removing it from the running system without affecting the other devices in the system. Here again, that is possible because the switch isolates the failed device. Shared-bus architectures such as PCI make that much more difficult to implement.
6.6 Rules of thumb for cluster capacity
The following is a list of tips and tricks that you should consider as you design your cluster system. These recommendations are based on the technical points that were raised in this chapter. They are practical rules of thumb that should be taken into consideration as you make your configuration design trade−offs. One point that is very important to remember is that once you decide on the design and configuration of your cluster hardware, it must be documented. You should also seriously consider labeling PCI bus positions and cabling to guarantee that an unsuspecting technician repairing one of your cluster nodes late one night and working under a lot of pressure does not mistakenly place a card in the wrong slot. For example, if a 33 MHz PCI card is placed in a 66 MHz bus slot, the bandwidth of the I/O subsystem could be halved without anyone knowing what was going on. That would be a hard one to find. Here are some further tips:
• Beware of vendors' packaged or default configurations; they are rarely optimized.
• Use the fastest matching clock rate processor and memory buses available.
• Use multiple or peer 64−bit 66 MHz PCI bus architectures.
• Always place critical adapters in unbridged PCI slots.
• Balance I/O controllers across multiple PCI buses.
• Use multiple RAID or Fibre Channel adapters on separate buses as opposed to "loading up" multiple channels on a single adapter.
• Use redundant RAID and Fibre Channel controllers that multipath I/O across redundant adapters on separate PCI buses.
• Adapters that are not critical for capacity (e.g., sound, SCSI ZIP, IDE, and CD−ROM) should not be placed on the main I/O buses with cluster interconnect or storage and network adapters.
• Resist the tendency to fill a RAID cabinet; if the controllers, adapters, or buses are already at capacity, it can bottleneck a cluster during a failure event.
• Place the system disk on a separate adapter and bus from the main application and data disks.
• The system disk should not be on the same adapter and bus as tape drives and SCSI CD−ROMs.
• Place large secondary paging files, log files, and scratch areas on a different disk, preferably on a different adapter and bus.
In summary, clustering is not for the faint of heart. A good cluster system design requires you to have a working understanding of the low−level hardware system components. We believe that you will need this knowledge to enable you to have the upper hand when dealing with the media hype and vendor presentations that you are likely to encounter. There will be claim after claim from competing vendors stating that their solution is best and quoting industry standards to back up their claims of greatness. The problem is that you will need to be able to analyze the claims and determine whether a given vendor's solution is relevant for your particular company's clustering needs. The fact is that there are quite a few good solutions available from many vendors. Vendors try to pick a market niche that they think they understand better than anyone else does, and they believe that their products address the requirements better than anyone else does. Our goal is to give you the technical knowledge so that you can match your business requirements for high availability, scalability, and reliability with the vendor whose products provide the best technical solution. That will allow you to make a valid trade−off between technology and the business needs of your company.
Each member of a Microsoft Cluster must have the capacity, in particular the IOPS capacity, to handle the processing demands of the services that your company defines as critical. For example, in the case of Microsoft Advanced Server configured as a two−node cluster, if one node goes offline the remaining cluster node must have the capacity to execute all of the applications that are deemed critical for business operations in the event of a single system failure. Anything less will cause interruption of service to the user. A cluster is intended to avoid just that.
A well−designed I/O subsystem will guarantee that the cluster will also provide a consistent level of service to its users even in the event of the failure of one node in the cluster. We talked about reliability earlier in the book; this is what it is all about. Users expect the cluster not only to be available but also to provide a reliable or consistent level of service. A good I/O subsystem design is the one area in which you should not take any shortcuts.
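The single−failure capacity rule described above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function name, the node names, and the IOPS figures are hypothetical, and it assumes that IOPS capacity simply adds across the surviving nodes, which is a simplification of how a real I/O subsystem behaves.

```python
def survives_single_failure(node_capacity_iops, critical_load_iops):
    """Return True if the critical load still fits after any one node fails.

    node_capacity_iops: dict mapping a node name to its IOPS capacity
                        (hypothetical figures supplied by the designer).
    critical_load_iops: total IOPS demanded by the services deemed critical.
    """
    total = sum(node_capacity_iops.values())
    for failed_node, capacity in node_capacity_iops.items():
        # Capacity that is left once this particular node is offline.
        remaining = total - capacity
        if remaining < critical_load_iops:
            return False
    return True


# Two-node example: each node alone must carry the whole critical load.
two_node = {"node_a": 5000, "node_b": 5000}
print(survives_single_failure(two_node, 4000))  # critical load fits on one node
print(survives_single_failure(two_node, 6000))  # critical load does not fit
```

In a two−node cluster the check reduces to asking whether each node, on its own, can carry the full critical load, which is the point made in the text.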
Chapter 7: Cluster Interconnect Technologies Overview This chapter discusses the cluster interconnect technologies that are supported under Cluster Service. Since Microsoft built its Cluster Service product to support industry standard hardware, what we discuss here applies equally well to clustering solutions from other vendors. We will take the time to go over some of the nuts and bolts of SCSI technology because contrary to what the marketers and press would like you to believe, SCSI is not going away any time soon. SCSI meets the needs of many of those installing clusters today and will continue to be a good fit for certain types of applications. It is certainly the technology of choice for cost−conscious customers looking to purchase smaller cluster configurations. By small, we mean cluster−in−a−box solutions configured with no more than two nodes. The cable length limitations associated with SCSI are not really a factor when everything is contained inside one enclosure. Further, the problems associated with the SCSI bus device priority order are not as much of an issue when the size of the cluster is limited to two nodes. On the other hand, customers who are looking to build very large cluster installations will probably look at SAN technology (e.g., Fibre Channel). SCSI has a long track record in the field, and, more important, it's a very cost−effective solution. There is nothing unique about SCSI that makes it less expensive to buy other than the fact that it has been on the market for some time now and the market prices have been driven down because of mass production and competition. Obviously, over time the cost differential between SCSI and newer technologies such as Fibre Channel will become insignificant. We also discuss emerging technologies that are being proposed for use as cluster interconnects so that you can get a feel as to the direction that industry is heading. 
It has been our experience that if you are going to run into problems setting up a cluster, it will probably be in the area of the cluster interconnecting hardware. During the beta of Wolfpack we were quite surprised when some people reported that some vendors they had called argued with them saying that there was no such thing as a SCSI "Y" cable or a SCSI "Tri−link" connector. We decided that it would be a good idea to review some of the design guidelines here for SCSI, because we have found so much misinformation in the media about how to use this technology when building cluster interconnects. We start with a quick review of some of the important design criteria for implementing SCSI buses. For the faint of heart, SCSI can be a little intimidating, and the fact that there seems to be so much misinformation out there does not help matters either. We then explore emerging new technologies that can be used for cluster interconnects. Finally, we introduce you to the "VIA" software interface standard that Microsoft and Intel are working to develop with the cooperation of more than 40 other vendors and compare that with Microsoft's Winsock Direct architecture for System Area Networks.
7.1 What is a cluster communication interconnect?
Basically, the cluster communication interconnect should be a high−speed low−latency communication path between the nodes in a cluster. Cluster members need to communicate two different types of information between themselves using the cluster interconnect as the medium. The first information type that needs to be exchanged in a cluster is used to manage and control the operation of the cluster itself. The second type of information transfer that occurs between cluster members is the actual data that a user stores and retrieves from the cluster. In the Phase I release of Cluster Service, the user's data and control messages are sent over their own separate buses as shown in Figure 7.1. The cluster's control and management message traffic is typically sent over a private Ethernet link that we refer to as the "cluster communications interconnect." The user data travels over its own bus, which would typically be a SCSI bus.
Figure 7.1: "Classic" cluster interconnect. The reason we show two separate buses in Figure 7.1 is that when Cluster Service was first released, there were no standards in place to allow for multiple types of communications to occur on a common cluster interconnect bus. That all will change with new technologies such as Microsoft's VIA architecture; Winsock Direct, also from Microsoft; and hardware System Area Network (SAN) technologies such as ServerNet and Fibre Channel. Fibre Channel, for example, was designed from the beginning to support multiple protocols such as SCSI and IP just as Ethernet supports IPX, AppleTalk, and IP protocols. In the case of Cluster Service, IP can be used to transfer management and control information between cluster members, whereas the SCSI protocol is used to transfer data to and from a disk array. While all of these communications are going on between the members of a cluster, the end user's connection to the cluster is occurring over yet another networking link that we refer to as the "enterprise LAN" connection. Another possible way to interconnect a cluster is to use a SAN (Server Area Network). Depending on whom you talk to, you will get a different definition of what SAN stands for. From a practical point of view, the definition is not that important. As mentioned previously, we have heard people call it any of the following: Server Area Network, System Area Network, and Storage Area Network. The bottom line is that a SAN is going to be the way, in the future, that large enterprise clusters are interconnected. Take a look at Figure 7.2 showing a SAN cluster interconnect and contrast that with the configuration shown in Figure 7.1. Technically, a SAN has a lot of good features going for it, but one thing that should stand out when you contrast the two figures is that there is less hardware involved in a SAN, which inherently means that it is more likely to achieve higher reliability. 
Notice the two connections between each node and the SAN. It is possible to configure the SAN with redundant cabling and/or hubs. This is very easy (but not cheap) to implement and is a very desirable feature when designing a cluster configuration.
Figure 7.2: Cluster interconnect using Fibre Channel or ServerNet technology. For a cluster to operate correctly requires a great deal of coordination between all of the devices in the cluster. A small two−node Cluster Service system is just fine today with a 10 or 100 Mbps Ethernet link used as the cluster communications interconnect for cluster management traffic. The reason a 10 Mbps Ethernet link will work just fine is that the only network traffic that goes across this link is for cluster management functions and inter−process communications between applications that are using the messaging services available by calling cluster APIs. A point−to−point Ethernet is very efficient because the likelihood of Ethernet packet collisions is very low, owing to the nature of the protocols used by Cluster Service. For example, one of the nodes in the cluster will send a "heartbeat" message asking the other node to respond. Once the node receives the "heartbeat" message it then responds. With this type of traffic it is not likely that both nodes will be trying to broadcast at the same time. As the size of the cluster grows beyond two nodes, all bets are off. There is one point that needs to be made about how the two Ethernet cards used for the cluster communications interconnect are configured. If you thought that you could get extra redundancy by enabling routing between the two network adapters, unfortunately that is not supported. If one node of the cluster happens to be isolated from the other node because of a failure in the Enterprise LAN link, client traffic will not be routed over the cluster communications interconnect to the isolated node. In fact, the failure of the hub that connects cluster member nodes to the enterprise appears to be one of the few failures that Cluster Service and other high−availability solutions are not able to detect in the initial software releases. 
We will talk more about some of these issues when we discuss cluster networking in Chapter 8. At the bottom of Figure 7.1 you will notice the cluster data resource bus. This bus is used to connect the computer nodes in a cluster to a mass storage subsystem such as a RAID array. It is possible to have multiple storage subsystems on the data resource bus. Many cluster implementations use separate interfaces, as well as different communications technologies for both the cluster resource bus and the cluster communications bus. This makes sense for two reasons. First, it is a very cost−effective solution. Second, the technologies used for the interfaces are readily available and field proven. Microsoft could have started off using a proprietary solution for its cluster communications interconnect such as ServerNet or Memory Channel from Compaq (actually Digital and Tandem at the time). Microsoft decided instead to support commercial off−the−shelf hardware (COTS) as a baseline design for Cluster Server Phase I. SCSI was the logical choice for connecting CPUs and the disk storage subsystems because it delivers an acceptable level of performance today at a reasonable cost. Many of the vendors Microsoft referred to as "early adopters" (e.g., IBM, DEC, HP, NCR, and Tandem) had all been shipping UNIX minicomputers using SCSI technology for years. These vendors have been supporting SCSI and have learned the hard way what works and what does not. Vendors such as these should be able to provide valuable advice to their customers when problems occur. SCSI vendors solved their interoperability problems more than 10 years ago, whereas those in the Fibre Channel camp are just starting down that path. The question that you might want to ask yourself is, "Do I really enjoy the excitement of living on the 'bleeding edge' of technology?"
Microsoft and the "early adopters" are working together on developing an industry standard solution for Phase II of Cluster Server using higher−speed interconnects based on open standards. The goal is to drive the cost down as a result of volume production and standardization in the marketplace. This goes along with Microsoft's marketing strategy of developing "high−volume" commodity products. Naturally, all the vendors involved are hoping that their technology will become the "standard" that is chosen. There is already a fair amount of ongoing development activity by many hardware vendors to develop high−speed interconnects suitable for use as a server area network.
One project that has received quite a bit of press coverage lately is the effort to offload the TCP/IP stack into hardware located on a network interface adapter to allow for network−connected storage devices that use standard IP protocols. Many of the large system vendors have transitioned into primarily hardware/services vendors now that Microsoft is the dominant player when it comes to operating systems. Just as Microsoft turns to software to solve the world's problems, these hardware vendors naturally turn to new hardware technologies as their approach to solve problems that can't otherwise be solved by software alone. In the case of TCP/IP, implementing that protocol stack in hardware is expected to reduce the CPU processing load on a large cluster by around 20 to 30 percent. This is a good example of where software alone can't solve the problem. One of the traditional problems with large clustered systems is the amount of communication that can take place between the nodes in a cluster.
All of this can put a substantial load on the CPUs, which means that there are fewer CPU cycles to devote to the applications themselves. Today, with Cluster Service's "shared nothing" approach and given the fact that there are few applications that are truly "cluster ready," it has not yet become a critical issue. As software developers exploit the full capabilities that Microsoft designed into Cluster Service, you will see the demand grow exponentially for higher−speed system interconnect buses. The Cluster API will provide a messaging service for applications running on more than one server that need to communicate across the cluster in order to manage the distributed resources of a clusterwide application deployment. This is an area that we see Microsoft stepping up to in future releases of Cluster Service. The goal is to have the tools and extensions to Cluster Service that will make it easier for software developers to enable their applications to run clusterwide and not on just a single node at a time. As more and more applications are deployed in a distributed fashion across the cluster, the demand for bandwidth on the cluster communication interconnect will increase and will likely become the major bottleneck in the system. When there are multiple instances of an application running on a cluster, they will need to stay in constant communication with each other in order to coordinate their operation. The decision on what technology to use for the cluster communication interconnect will likely be one of the most important decisions that you will have to make when purchasing or setting up a cluster. The technology used for SANs is the focus of a lot of engineering development work these days, which means that you will need to monitor this area very closely. Over time, as usual, there will be a shakeout in the market, and you will find one or two solutions that stand out above the rest. 
There are a few on the market today that appear good, but the market has not yet cast its vote to determine who will come out on top. The decision of what technology to select will depend mainly on three factors: cost, growth requirements for the cluster, and the time line for that growth. Actually, there is one more: which vendor hires the best marketing consultants and has the largest advertising budget. We have seen that having the fastest and most advanced chip does not always mean that you win, but then being purchased by Compaq is not that bad either!
The Compaq Corporation is trying to address the need for high−speed low−latency interconnects in the Windows NT cluster marketplace. They are well aware of the issues based on years of experience from their minicomputer UNIX products (Digital and Tandem). It is interesting that in the mid−1980s, Digital developed a SAN but referred to it as a CI−Cluster. It is even more amazing that it has taken more than 20 years for the idea of using a system area network architecture to really catch on. Compaq has acquired a couple of solutions that it can offer based on technology it acquired from mergers with Digital and Tandem. Digital's contribution to Compaq's suite of products is the Memory Channel technology. Memory Channel products have been on the market for a few years now but have been supported only on the OpenVMS and Tru64 UNIX product lines. Memory Channel is very fast and could be an ideal cluster system interconnect solution for Windows NT servers. Unfortunately, Microsoft does not support Memory Channel at this time, and we don't know of any plans to support Memory Channel in the future. Tandem was actually a little bit ahead of Digital in that Microsoft had already included the software drivers with the Cluster Server product for use with Tandem's cluster system interconnect technology called ServerNet. On a side note, Tandem is credited with coining the phrase "server area network" (SAN).
You can expect to see high−performance interconnect solutions introduced in the next few years by different vendors. Some of the technology they will be using will come from existing high−end products, which will be reengineered to meet the demands of a high−volume marketplace. Other technologies will emerge from academic research labs that have been actively researching super high−speed communications technology to meet the needs for "super computing." Naturally, over time you will see some fallout from these competing technologies. The challenge for us all is to select for our companies the technologies today that hopefully will become the industry standards of tomorrow.
7.2 Comparison of the technologies used to interconnect systems There are basically three requirements for interconnect technologies that are used in cluster systems. They can be classified as LAN, bus, and SAN. We will focus on which technologies are best suited to address these three areas (see Table 7.1).
Table 7.1: Cluster Interconnect Technology Options
Functionality   Technology
Bus             SCSI
LAN             Ethernet
SAN             Fibre Channel, ServerNet, Myrinet, InfiniBand
7.2.1 Bus functionality
First we are going to discuss the requirements for a data bus. At the present time, Windows NT clustering does not support booting the cluster from a common boot image as OpenVMS clusters can do. Therefore, each system in the cluster must have a local disk drive connected to a dedicated bus, which is designated as the boot device. The local boot device should be as fast and as reliable as reasonably possible. The new Ultra160 SCSI standard is a very good fit for this application since the bus runs at 160 MBps and uses a low−voltage differential bus interface designed to eliminate bus errors due to electrical noise. The Ultra160 SCSI allows for 15 peripherals in addition to the SCSI controller itself on the bus. One point to remember is that the local system will not have to store gigabytes of data. The only thing that should be stored locally on a node in a cluster is the operating system and local applications. Therefore the 160 MBps Ultra160 SCSI bus should work well given that disk drives today are averaging only around a 20 MBps transfer rate. Even when the disk transfer rate increases, as we know it will, there is still a margin for growth built into the 160 MBps bus.
Fibre Channel could also be used to interconnect local disk drives within the system cabinet. But given the current costs of FC adapters and FC disk drives, that might not be cost−effective today, as well as representing a bit of overkill. Of course, as usual, this is all likely to change as prices fall and the bandwidth of the Fibre Channel bus increases. However, there is one technical issue that is worth considering. It is becoming more attractive from a system builder's point of view, and from the customer's perspective also, to use the single connector assembly (SCA2) along with a backplane to interconnect the new style of disk drives within a single cabinet. The SCA connector is available on both SCSI and FC drives. This configuration allows for easy removal of a failed drive without taking down the system, especially when combined with a RAID controller. Since SCSI is a bus architecture, the disk array's backplane is nothing more than a printed circuit board with copper traces and connectors soldered on. You can't get anything more reliable than just connectors and copper traces on a fiberglass board.
Now comes the problem: because Fibre Channel's I/O architecture is a loop, it is necessary to add an active circuit called a port bypass circuit (PBC) for each disk drive connector slot on the Fibre Channel backplane. The PBC performs a very important function necessary for the proper operation of Fibre Channel loop architecture. This circuit allows disk drives to be removed and inserted into the Fibre Channel loop without disrupting the operation of the devices already on the loop. In other words, the PBC completes the loop if a device is removed from the loop and opens the loop (momentarily) to insert a new device. If for some reason the PBC should fail, the loop would be broken and communications would come to a halt. In terms of reliability, if you compare a copper wire with a PBC circuit, the PBC circuit will lose every time. Any time you are designing a system for very high reliability, your goal should be to eliminate as many "active" components as possible. If you can't eliminate them, the next best thing is to have redundant paths, but that is another topic for discussion.
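The reliability argument about port bypass circuits can be made concrete with a little arithmetic. The sketch below is a rough illustration only: it assumes that every PBC fails independently, that the loop works only if every PBC works, and that the per−circuit reliability figure (0.999) is a hypothetical number chosen for the example rather than a measured value. The function names are ours.

```python
def loop_reliability(pbc_reliability, slots):
    """Probability that a loop with one PBC per slot is intact.

    Assumes independent failures and that a single failed PBC
    breaks the whole loop (components in series).
    """
    return pbc_reliability ** slots


def dual_loop_reliability(pbc_reliability, slots):
    """Probability that at least one of two independent, identical loops is intact."""
    single = loop_reliability(pbc_reliability, slots)
    return 1 - (1 - single) ** 2


# Hypothetical figures: ten drive slots, each PBC 99.9 percent reliable
# over the period of interest.
single = loop_reliability(0.999, 10)
dual = dual_loop_reliability(0.999, 10)
print(f"single loop: {single:.6f}")
print(f"dual loop:   {dual:.6f}")
```

Under these assumptions the ten−slot loop is less reliable than any one of its circuits, because the circuits are effectively in series, and a second independent loop recovers most of the loss. That is the reasoning behind preferring passive backplanes where possible and redundant paths where not.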
Using a bus architecture such as SCSI for interconnecting disk arrays and servers in a Windows NT cluster is also a viable option for smaller organizations that won't be needing terabytes of online disk storage. The fact is that there are many operations (e.g., doctors' offices, auto service shops, retail shopping point−of−sale (POS) applications) that need high reliability but have no need for super high−speed access to massive amounts of data. We think those are the customers that Microsoft dreams of when talking about high−volume retail sales of its software products. Customers who fall into that profile tend to be very cost conscious when it comes time to buy a cluster. We see such users being attracted to "cluster−in−a−box" solutions consisting of two nodes, with each node being a 2 to 4 SMP configuration. Using SCSI as the cluster data resource bus makes a lot of sense for this type of application. With everything mounted in one to three cabinets, the issues related to the SCSI bus lengths disappear completely. The performance levels will be acceptable at a price that this class of user is willing to pay. It will probably take a little time before the cost to implement this type of configuration using Fibre Channel becomes competitively priced. Currently, there is about a 1:4 ratio between the cost of an Ultra160 SCSI adapter and a Fibre Channel adapter. The SCSI Trade Association (STA) is already working on releasing and promoting even faster versions of SCSI called Ultra320 SCSI and Ultra640 SCSI, with even faster bandwidths expected by the year 2003. The raw bus clock rate is not the only issue that SCSI needs to address. There needs to be work done to address the bus priority issues on a SCSI bus. In a cluster, all nodes should have the same priority level, which is not possible with normal SCSI priority levels, because each SCSI device and adapter has its priority level determined by its address on the bus. 
Compaq has tried to address this issue with a SCSI bus switch that has a "priority fairness" switch that helps address this problem by ensuring that every node has equal access to the bus. If the SCSI consortium can hold the cost down and continue to make improvements to the technology, then SCSI should remain a major contender for inclusion in future cluster designs.
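To see why fixed SCSI priorities are a problem for a cluster, it helps to look at how arbitration is decided. On a 16−bit (wide) parallel SCSI bus the priority order of the IDs runs from 7 down to 0 and then from 15 down to 8. The Python sketch below models only that ordering; the function name is ours, and the choice of IDs 7 and 6 for the two host adapters is an assumption made for the example (it is a common convention, not something stated above).

```python
# Arbitration priority of SCSI IDs on a wide (16-bit) bus, highest first.
WIDE_SCSI_PRIORITY = [7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8]


def arbitration_winner(contending_ids):
    """Return the SCSI ID that wins when several devices arbitrate at once."""
    return min(contending_ids, key=WIDE_SCSI_PRIORITY.index)


# Assumed layout: node A's host adapter at ID 7, node B's at ID 6.
node_a, node_b = 7, 6
print(arbitration_winner([node_a, node_b]))  # node A wins every contested arbitration
```

Whenever both adapters contend for the bus at the same moment, the adapter with the higher−priority ID wins, so the two nodes are never true peers. That asymmetry is what a fairness mechanism in a bus switch is meant to remove.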
7.2.2 LAN functionality
The next requirement for a cluster is to have some kind of LAN connection to support IP traffic between cluster members. This would be required for clusters configured with a physical SCSI bus as the cluster data resource bus. The reason for this is that the physical SCSI bus supports communication only via the SCSI protocol and therefore does not support the IP protocol suite. This is not a problem for a SAN solution such as Fibre Channel that can support multiple protocols simultaneously. We see Ethernet as the preferred solution for a low−end cluster configured with a SCSI data resource bus. Ethernet has had a long and successful track record, and the majority of system administrators feel very comfortable setting it up and maintaining it. It is cheap, it does the job, and it keeps getting faster. Given the fact that there will probably be fewer than 16 nodes in a low−end cluster, an Ethernet private cluster communications link running at 100 Mbps will do just fine. If you find yourself outgrowing 100 Mbps, then there is always Gigabit Ethernet as an option.
In a two−node cluster server configuration, the simplest method of connecting the nodes together is to use the 10BaseT equivalent of a null modem cable (see Figure 7.4). Normally, when you connect a hub and a computer together the electronics in the hub take care of connecting the transmitter and receiver pairs on one end of the cable to the receiver and transmitter pair at the opposite end. In Figure 7.3, you can see how the "twist" is made in the cable. You could get out your crimp tool and make your own cable, but another option is to go to your local computer store, where this cable is sold as a "crossover cable." Still another option is to look around the game aisle in your favorite computer store, where you will likely find a special "game cable" for connecting two computers together, which allows two people to play against each other over a local Ethernet connection. This "game cable" is just what you need to connect your two cluster nodes together.
Figure 7.3: RJ45 to RJ45 "null modem" wiring diagram.
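For readers who want the "twist" spelled out, the following Python sketch records the pin mapping of a 10BaseT crossover cable and checks that every transmit pin lands on a receive pin at the far end. The pin assignments (transmit on pins 1 and 2, receive on pins 3 and 6) are the standard 10BaseT ones for a network adapter; the dictionary and function names are ours and exist only for this illustration.

```python
# Signals on the RJ45 pins that 10BaseT uses at a network adapter.
SIGNALS = {1: "TX+", 2: "TX-", 3: "RX+", 6: "RX-"}

# Crossover wiring: near-end pin -> far-end pin.
CROSSOVER = {1: 3, 2: 6, 3: 1, 6: 2}


def far_end_pin(pin):
    """Return the far-end pin that a near-end pin is wired to."""
    return CROSSOVER[pin]


for pin in sorted(CROSSOVER):
    far = far_end_pin(pin)
    print(f"pin {pin} ({SIGNALS[pin]}) -> pin {far} ({SIGNALS[far]})")
```

The mapping is symmetric, so the cable can be plugged in either way round, and the polarity of each pair is preserved (TX+ meets RX+, TX− meets RX−).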
Figure 7.4: Twisted−pair "crossover cable" connected cluster.
On the other hand, you might be tempted to purchase one of those inexpensive Ethernet minihubs to connect the two nodes of your cluster together (Figure 7.5). The ones we are talking about cost around $50 and are made by a vendor no one has ever heard of. But remember the "single−point−of−failure" issue. In general, a copper wire will be a lot more reliable than any electrical device such as an Ethernet hub with a few dozen active components and an AC wall adapter. There is really nothing to fail in a short length of coaxial or twisted−pair wire except maybe the connectors if they are subject to physical abuse or are not crimped correctly.
Figure 7.5: Twisted−pair Ethernet hub connected cluster.
7.3 VIA cluster interconnect software standard
The virtual interface architecture (VIA) is a specification being developed under the leadership of Compaq Computer Corporation, Intel Corporation, and Microsoft. The easiest way to explain what VIA is all about is to compare it with the network device interface specification (NDIS). NDIS, for those of you who are not familiar with LanManager (Windows networking), is a software architecture specification that allows the LanManager protocol stack to be isolated from the physical network adapter by a set of standard interface API calls. This allows hardware vendors to ensure that their network adapters will interoperate correctly with the industry standard LanManager by simply supplying an NDIS−compatible driver that they develop with their network adapters. It is up to the vendors to develop and test their own NDIS device drivers to the standard.
Now let's get back to VIA. VIA is to clusters what NDIS is to LANs. The VIA specification allows Microsoft to offer a plug−and−play solution for hardware vendors wanting to sell cluster interconnect hardware. Already included with Cluster Service are drivers for Compaq's ServerNet SAN technology. Another company with a rather interesting background is Myricom. This firm has commercialized a technology called Myrinet, which was originally developed in an academic environment at the University of Southern California (USC), where the company's principals were researchers. Myrinet already supports the VIA architecture under the Linux operating system. Given the short time that Cluster Service has been on the market and the number of SAN solutions that are already available, we can expect to have many technologies and capabilities from which to choose. What VIA means for us, as customers, is that we will be able to shop around for whatever cluster interconnect solution we want to use and be assured that it will work correctly with Cluster Service.
The VIA specification as developed by the three founding members (Compaq, Intel, and Microsoft) was presented to other key computer manufacturers at a conference held in January 1997. The goal of this meeting was to propose a standards-based cluster interconnect technology that would provide a cost-effective plug-and-play approach to server area networks (SANs). The founders hope that by gaining broad industry support for open architecture SAN APIs, the cost can be driven down to the point where clustering will be affordable for a high-volume market. With high-volume sales, Compaq will be able to sell its ServerNet hardware in the same price range as much slower LAN-based technologies. The VIA architecture will allow participating vendors to cash in on the rapidly expanding cluster market, much as happened with Ethernet NICs. For customers it will mean lower cost and a wider selection of SAN technology solutions to choose from.
7.3.1 Why VIA?

There are a lot of reasons why VIA is important, the most important being the need for a solid technology foundation for future cluster applications. Cluster Service, with the limited number of applications currently available, isn't ready to put SAN technology to the test. But Microsoft is already shipping its Datacenter solution that can support up to four nodes. Certainly, it won't be long before application developers take advantage of cluster API services to build distributed applications that run across large multinode clusters. This new generation of parallel applications will require a server area network capable of handling huge amounts of message traffic with very little system overhead.

VIA combines hardware and software to provide a means for cluster nodes to pass messages to each other. The hardware solutions that we just mentioned both provide very high speed physical links. The other piece of the puzzle, needed to deliver low-latency messages between cluster nodes without eating into the available processing power of the system, is the software protocol stack called VIA. The VIA protocol stack requires that compatible hardware be present and that software vendors write their applications to support the VIA software APIs.

The reason that VIA is able to deliver a high-speed, low-latency communication service and at the same time significantly reduce the processing overhead required of the operating system is that it moves the majority of the VIA protocol stack outside of the operating system's kernel. In Figure 7.6 you can see that there are two distinct software interfaces that an application uses when dealing with VIA. The first thing that an application must do when it wants to use the VIA services is to set up the "virtual interface" that it will use to communicate with another node.
The process of setting up the virtual interface pipe does involve the operating system kernel. Once the virtual interface communications channel is set up, however, the OS simply gets out of the way, unlike communications protocols such as TCP/IP that require the operating system to execute the protocol. The application is then free to use the protected virtual interface channel between an application running on one node and an application on another node in the SAN. The user process can directly access memory buffers, allowing it to communicate directly with the VIA hardware device. The significance of this is that a cluster application gets a low-latency, gigabit-speed communication channel while at the same time reducing the processing load normally required for such a channel.
Figure 7.6: VIA software protocol stack.
7.4 Winsock Direct technology

Microsoft just got warmed up with the work it did on VIA. Since the release of the VIA standard, Microsoft has come up with its own communications solution targeted at the system area network arena. The solution is called Winsock Direct, and, as its name implies, it is closely related to the TCP/IP Socket programming interface standard. The good thing about Winsock Direct is that developers have to develop their applications only once, using the standard Windows Socket API. The application will automatically take advantage of the higher-performance Winsock Direct capabilities if the systems that it is running on have SAN hardware installed.

You can see in Figure 7.7 that an application that wants to communicate can just issue normal Winsock calls. Winsock Direct adds an additional layer of software into the Winsock communications stack that acts as a "switch," automatically redirecting the data communications stream to either the normal Winsock-over-TCP/IP path or the new high-speed Winsock Direct protocol stack using SAN high-speed communications links. This is attractive for existing applications that are already written to use the Windows Socket programming interface. If these applications are deployed on either a Microsoft Windows 2000 Datacenter Server solution or on a Windows 2000 Advanced Server equipped with SAN hardware, they can immediately take advantage of the SAN environment without any programming changes to the application. This is a very big plus for developers, because they will have to write their applications to support only one API. Without Winsock Direct capabilities, developers would need to write and support separate versions for each SAN product on the market. Unfortunately, this is the problem with the VIA standard.
An application that uses the Windows Socket programming interface will require the developer to rewrite the section of code that deals with interprocessor communication if it is going to run on a VIA-configured cluster. The opposite is also true: an application written for VIA will run only on a system configured with VIA hardware.
Figure 7.7: Winsock Direct stack.

The real advantage to Microsoft's SAN solution is that when Winsock Direct and industry-standard SAN hardware technology are used, there is considerably less load on the CPU, thereby dramatically increasing the capacity of each node in a cluster. There are two technologies that are working together to minimize the processing load on the CPU. First, Winsock Direct achieves part of its performance by moving the greater part of its software processing outside of the Windows kernel and into the user processing space, thereby reducing the number of context switches that would have been required to go between user mode and kernel mode. This also means that existing applications running on the cluster will perform better as a result of the improved efficiencies achieved by faster node-to-node communications, thanks to Winsock Direct and the underlying SAN hardware.

The bottom line is that Winsock Direct's higher-performance communication link will allow you to "scale out" existing applications on a cluster. But the really interesting thing is that this technology will open the "technological door" for a new generation of cluster applications that have been waiting for a very high-speed and low processor-overhead communication link between nodes in a cluster in order to be technically feasible. We think that Winsock Direct will go a long way toward removing the barrier that now exists to the arrival of the next generation of truly distributed cluster applications.
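To make the "write once to the sockets API" point concrete, here is a minimal sketch of node-to-node messaging written only against ordinary socket calls. It uses Python's standard socket module as a stand-in for Winsock code; the function names (echo_once, send_message, demo) and the loopback address are our own illustrative choices. The sketch neither uses nor detects Winsock Direct. It only shows the kind of application code that, according to the description above, would not need to change if the transport underneath were switched to SAN hardware.

```python
import socket
import threading


def echo_once(server_sock):
    """Accept a single connection and echo back whatever arrives."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def send_message(host, port, payload):
    """Ordinary sockets client: connect, send, and read the reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
        return sock.recv(1024)


def demo():
    """Run a one-shot echo exchange over the loopback interface."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    worker = threading.Thread(target=echo_once, args=(server,))
    worker.start()
    reply = send_message("127.0.0.1", port, b"heartbeat")
    worker.join()
    server.close()
    return reply


if __name__ == "__main__":
    print(demo())
```

Nothing in send_message names a transport provider, which is the property the "switch" layer in Figure 7.7 relies on: the choice of path is made below the API, not in the application.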
7.5 SCSI technology for NT clusters

When Wolfpack was in Beta Test, we were surprised at the number of major PC hardware distributors that had people on their sales force who did not understand the hardware technology needed for cluster interconnects. We decided that from an academic point of view it would be interesting to see how hard it would be to collect all the hardware pieces needed to build a cluster. So we started calling a few of the leading distributors. It was a shock to find that so many of the people that we talked to had not heard of things like SCSI 3 or Differential SCSI and Differential hard drives. Even though Microsoft's goal for Windows clusters was to use COTS hardware, it might still be difficult to find a PC vendor who knows what you are asking for. The reason is that even though clusters are becoming more common in small businesses, the numbers of systems shipped are negligible compared with sales of traditional single-box servers. The best approach is to talk to vendors who have traditionally sold minicomputers or high-end workstations. They have been using and selling SCSI cables and connectors and other high-end hardware for years. SCSI has been a way of life for vendors such as HP, IBM, and the old Digital side of Compaq.

The term SCSI stands for Small Computer System Interface. The idea for a low-cost interconnection for small computer systems first came from IBM. SCSI got its start at IBM on the IBM-360 series of computers in the 1970s. Shugart Associates modified the design in the early 1980s, and the result became a standard for interconnecting intelligent disk drives. Contrary to the "small" in SCSI, it has been used in all sizes of computers. It has been used on the low end extensively by Apple for interconnecting peripherals on its popular desktop systems and on the high end by companies such as IBM, Digital, SUN, and others on minicomputers and mainframes.
For some reason, probably having to do with its high cost at the time, IBM chose not to use SCSI in the design of the PC. Hence, to this day, SCSI is not a common interface on desktop PCs. Apple, however, chose to make SCSI the standard peripheral interconnect mode on its systems.
7.5.1 SCSI standards

The SCSI bus can be confusing even to experienced computer system professionals, especially since there are several different SCSI standards to deal with. We are going to take a little time to explain some of the common SCSI terms you will need to know about and then introduce you to the design guidelines you need to be familiar with as you plan your SCSI bus interconnects. The following are the typical SCSI standards that you are likely to see listed in catalogs: SCSI-1, SCSI-2, SCSI-3, Ultra2 SCSI, and Ultra160 SCSI. To complicate matters even more, there are variations on these general specifications. You won't hear much talk about SCSI-1 these days, although there is still a lot of SCSI-1 equipment around. The equipment shipping today that you will come in contact with will be either Ultra2 SCSI or Ultra160 SCSI. For the purpose of clustering, we will be concentrating on Ultra160 SCSI, since this will best meet the needs for clustering today and at the same time provide a small margin for growth.
One way to easily identify which type of SCSI device you are looking at is to examine its connector. If you count 50 pins on the connector, then that device or cable is either a SCSI-1 or SCSI-2 device. The connector that is used for SCSI-1 is a Centronics 50-pin connector as shown in Figure 7.8. When the SCSI-2 standard was adopted, the primary connector was specified as the Micro-D high-density 50-pin style, but the Centronics connector was still listed as an alternative. The "D" refers to the shape of the connector shell (see Figure 7.9).
Figure 7.8: Centronics 50−pin SCSI connector.
Figure 7.9: Micro-D 50 connector.

If you count 68 pins in the connector, then you have a SCSI-3 device. You might have a little trouble seeing the difference between the 68-pin and the 50-pin connectors, because without having the two side by side they look almost the same. The standard SCSI-3, Ultra2, and Ultra160 SCSI specifications call for either a Micro-D 68-pin connector (see Figure 7.10) or the single connector assembly (SCA2) for hot-swappable drive bays (see Figure 7.11).
Figure 7.10: Micro−D 68−pin SCSI connector.
Figure 7.11: SCA2 connector for removable drives.

One other thing you should be aware of is the different physical retention hardware used. SCSI-2 uses a latch-type device, and SCSI-3 replaces the latch with threaded thumbscrews. The reason for this is that a 68-pin SCSI-3 cable contains more conductors and therefore is heavier and stiffer, which in turn requires more physical strength in its construction for strain relief. You will notice in Figure 7.12 the difference between the latch and screw styles of connector and realize that the two are not interchangeable.
Figure 7.12: SCSI connector locking mechanisms.

The SCSI bus comes in two flavors: narrow or wide. In general, if someone does not explicitly say "wide," it is usually safe to assume that they mean "narrow." The term "narrow" refers to a bus that has eight data lines plus control lines and a ground, whereas the term "wide" means 2 bytes wide, or 16 data bits. The SCSI specification allows for even wider buses, but products based on buses wider than 16 bits have not appeared in the market so far and probably won't, given the rapid advances Fibre Channel technology is making. The physical cabling and connector issues involved with SCSI constitute an area in which Fibre Channel excels, since it is a serial interface and therefore requires only a transmit line and a receive line.

The preferred SCSI bus width for high-performance servers is 16-bit, or "wide SCSI." You can achieve 40 MBytes per second using a "wide" bus coupled with devices that support "Ultra SCSI." The term "Ultra SCSI" refers to a maximum bus frequency of 20 MHz and is also sometimes referred to as "Fast/20." Table 7.2 provides a breakdown of SCSI versions and bus speeds.
Table 7.2: SCSI Versions and Bus Speeds

SCSI Version              Data Transfer Rate    Number of Devices (max)
SCSI I (narrow)           5 MByte/sec           8
SCSI II
  Fast SCSI (narrow)      10 MByte/sec          8
  Fast Wide SCSI          20 MByte/sec          16
SCSI III
  Ultra SCSI (narrow)     20 MByte/sec          8
  Ultra Wide SCSI         40 MByte/sec          16
  Ultra2 SCSI (wide)      80 MByte/sec          16
  Ultra160 SCSI (wide)    160 MByte/sec         16

Ultra160 SCSI is the latest effort by the SCSI vendors to keep up with the Fibre Channel crowd. Ultra160 SCSI is based on a 16-bit bus architecture that transfers up to 160 MBytes per second. To achieve that bus speed, the Ultra160 bus uses a low voltage differential (LVD) electrical signaling interface. The low voltage differential interface is an important technology for Windows clustering. For one thing, the LVD interface reduces the manufacturing costs of building SCSI devices and at the same time allows for much higher bandwidths than previous SCSI implementations. The LVD interface also provides higher noise immunity and allows for longer cable lengths, which is advantageous when connecting multiple system cabinets together in a cluster.

A good question to ask your storage array vendors is what bus width they are using internally on their disk drives. They could be using "narrow" devices talking to their RAID controller and a "wide" bus going back to the CPU. This configuration is cost-effective, but system performance could suffer. The issue here is that an internal 8-bit bus keeps the costs down, while the external 16-bit bus going to the computer could potentially deliver data to the disk array faster than it could be processed. Always check that you are purchasing well-balanced designs for any type of I/O system by making sure that no bottlenecks exist in any of the data paths.
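The rates in Table 7.2 can be cross-checked with a little arithmetic: peak rate equals transfers per second multiplied by bus width in bytes. The sketch below is ours, not part of any SCSI tool. The text gives only the 20 MHz figure for Ultra SCSI, so the other transfer rates in SCSI_MODES are assumptions back-calculated from the table's data rates and bus widths. The bottleneck helper illustrates the "well-balanced design" advice: a data path is only as fast as its slowest segment.

```python
# (millions of transfers per second, bus width in bits) for each mode.
# Only the 20 MT/s Ultra figure comes from the text; the rest are
# assumptions derived from Table 7.2 (rate divided by width in bytes).
SCSI_MODES = {
    "SCSI I (narrow)": (5, 8),
    "Fast SCSI (narrow)": (10, 8),
    "Fast Wide SCSI": (10, 16),
    "Ultra SCSI (narrow)": (20, 8),
    "Ultra Wide SCSI": (20, 16),
    "Ultra2 SCSI (wide)": (40, 16),
    "Ultra160 SCSI (wide)": (80, 16),
}


def peak_rate_mbytes(mode):
    """Peak transfer rate in MBytes/sec for a named bus mode."""
    transfers, width_bits = SCSI_MODES[mode]
    return transfers * width_bits // 8


def bottleneck(*rates_mbytes):
    """The usable rate of a chain of buses is that of its slowest link."""
    return min(rates_mbytes)


if __name__ == "__main__":
    for mode in SCSI_MODES:
        print(f"{mode:24s}{peak_rate_mbytes(mode):4d} MBytes/sec")
    # A wide external bus feeding narrow internal drives of the same generation:
    external = peak_rate_mbytes("Ultra Wide SCSI")
    internal = peak_rate_mbytes("Ultra SCSI (narrow)")
    print("effective:", bottleneck(external, internal), "MBytes/sec")
```

In the last example the wide external bus can move 40 MBytes per second, but the narrow internal bus caps the path at 20, which is the imbalance the vendor question above is meant to uncover.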
7.5.2 SCSI device ID numbers

We have run into quite a few people who have had questions for us about how SCSI device ID numbers work. We decided there must be a need for a good explanation, so we can just respond, "go buy our book"! So here goes. As we have already discussed, the "narrow" SCSI bus has only eight data lines. These eight data lines are also shared, or multiplexed, for use as both the data bus and the device address select lines. The SCSI specification does not allow any encoding of the device address ID numbers on these eight data lines. Therefore, instead of allowing the 256 addresses that would have been possible by encoding the eight bits, we are limited to only eight devices. Each device on a SCSI bus must be assigned a unique device ID. Remember, when SCSI was first developed, eight disks would have cost hundreds of thousands of dollars each! Who would have ever imagined needing, or for that matter even being able to afford, more than one or two disk drives?

The device priority level on the bus can be confusing. The rule that most of us are aware of is the higher the ID, the higher the priority. Unfortunately, that rule applies only to the "narrow" SCSI bus. On a "narrow" bus, an ID of 7 is the highest priority and is usually assigned to the SCSI controller, and the lowest priority is ID 0, which is usually assigned to the boot disk drive. The idea is to assign the fastest devices the lowest priority and slower devices higher priority so those fast devices (e.g., disk drives) can't "hog" the bus. Things change a little when you are dealing with a "wide" SCSI bus. In that case, there are now 16 data lines available for device IDs. Here again, there is no encoding or decoding of the 16 data lines, so we are limited to 16 devices on a "wide" bus.
It would seem that a device with ID 15 is the highest priority, but in order to maintain backwards compatibility with the addressing scheme on the "narrow" bus, a device with a SCSI ID of 15 actually has a lower priority than a device with ID 0. So the lowest priority device on a "wide" SCSI bus is actually ID 8, not 0 as one would logically conclude. The bottom line of this discussion is that it does not matter whether you have "wide" or "narrow" SCSI buses; device ID 7 is still the highest priority, and that is where you want the SCSI controller. You can better visualize the relationship between IDs and priorities by referring to Figure 7.13.

When it comes to using SCSI in a cluster, the fact that a node's priority is based on the ID assigned to its SCSI adapter presents a problem. In a cluster, one would like each cluster node to have equal priority in accessing storage devices on a common SCSI bus. A standard SCSI bus configuration will always allow the system that is assigned an ID of 7 to have priority in accessing the disk subsystem. Compaq came up with a solution to this problem for use with clustering: a SCSI hub that guarantees that each node in the cluster gets its fair share of bus time when accessing the SCSI bus.
Figure 7.13: SCSI ID vs. priorities.
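The ID and priority rules just described are easy to get backwards, so here is a small sketch that encodes them. It is our own illustration, not code from any SCSI driver: priority_order lists IDs from highest to lowest priority (7 down to 0, then 15 down to 8 on a wide bus), id_to_data_line_mask shows the "one data line per ID, no encoding" scheme, and arbitration_winner picks which of several requesting devices would get the bus under those rules. All three function names are hypothetical.

```python
def priority_order(wide=False):
    """Return SCSI IDs from highest to lowest priority."""
    order = list(range(7, -1, -1))        # 7, 6, ..., 0
    if wide:
        order += list(range(15, 7, -1))   # then 15, 14, ..., 8
    return order


def id_to_data_line_mask(scsi_id, wide=False):
    """Each ID owns exactly one data line, so the ID maps to a single set bit."""
    limit = 16 if wide else 8
    if not 0 <= scsi_id < limit:
        raise ValueError(f"SCSI ID must be in 0..{limit - 1}")
    return 1 << scsi_id


def arbitration_winner(requesting_ids, wide=False):
    """Of the devices requesting the bus, return the highest-priority ID."""
    order = priority_order(wide)
    return min(requesting_ids, key=order.index)


if __name__ == "__main__":
    print(priority_order(wide=True))
    print(arbitration_winner([15, 0], wide=True))  # ID 0 outranks ID 15
    print(bin(id_to_data_line_mask(7)))
```

On a shared cluster bus this is exactly the fairness problem described above: whichever node's adapter holds ID 7 always wins arbitration_winner against the other node.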
7.5.3 Single-ended vs. differential SCSI bus

There are basically three different types of SCSI bus options that are important to know about when dealing with NT clusters: the single-ended bus, the high voltage differential (HVD) bus, and the low voltage differential (LVD) bus. The differences are in how the signals are transmitted electrically between the devices on the bus. Besides the technical stuff, one important bit of information is that HVD bus devices are more expensive because of the extra hardware circuitry required. On the other hand, the new LVD standard does not cost much more than single-ended SCSI hardware, owing to lower component costs from higher levels of chip integration. Another advantage of LVD is that it can be mixed on the same cable as single-ended devices without shorting out anything. You should never, however, mix HVD devices and single-ended devices on the same cable. They are definitely not compatible, and if you are not very careful you will destroy some expensive hardware. We always make sure we clearly mark HVD devices in our computer rooms to lessen the chance that someone will mix a single-ended device with an HVD device. It can easily happen, because the connectors used on both interfaces are the same. Below, we describe how the two buses operate, their benefits, and the tradeoffs to be made in selecting one over the other.

The single-ended SCSI bus is the most common and the least expensive. For many desktop and small server applications it does the job adequately and at a reasonable cost. The drawback with the single-ended bus is its short cable length and its susceptibility to noise. Unfortunately, it is almost impossible to stay under the maximum cable length of 3 meters for a single-ended "fast SCSI" bus when interconnecting two servers and a storage device, as would be needed in a typical cluster. That is, unless you're building a system like Data General's "cluster-in-a-box."
In that situation, since everything is housed inside one system cabinet, it might not be a problem because of the close proximity of all the SCSI devices inside the cabinet. These days you will probably only see single-ended SCSI used to connect scanners, ZIP disks, or CD burners. The single-ended electrical design of the original SCSI bus will not support the high-bandwidth bus speeds needed for today's disk-intensive applications.

As you can see in Figure 7.14, the electrical circuit of one SCSI data line consists of a transmitter on one end of the cable connected to a receiver on the other end via a single data wire. If the voltage on the wire is between 3.0 and 5.0 volts, then the receiver outputs a logical FALSE. If the voltage is close to 0, then the receiver outputs a logical TRUE. Both the receiver and the transmitter circuit share a common electrical ground. One of the problems with this type of connection is that copper wire has an electrical property called resistance. That means the longer the wire is, the lower the voltage that will show up on the other end of the wire. If the prescribed cable length were to be exceeded, there might not be enough signal at the far end of the cable for the line receiver circuit to tell if it should output a 0 or a 1. Another problem is that if there is electrical noise adjacent to the cables, a false signal (voltage) can be induced on the data line, thereby causing a false output.
Figure 7.14: A single−ended SCSI driver and receiver circuit.
7.5.4 SCSI differential bus

The SCSI differential bus is a good solution to some of the problems we pointed out for the single-ended bus design. As you can see from Figure 7.15, there are a few more components needed for the differential bus circuit, and they have more functionality. The main downside of differential SCSI is its higher cost. But in practical terms, the cost difference between single-ended and differential devices is not really a big issue considering all the other hardware involved in an enterprise server cluster and the benefits that differential SCSI components can provide.
Figure 7.15: Differential SCSI bus.

The differential SCSI bus determines what its logical output is by comparing the voltage difference on the two data lines. The comparison is only between the two lines, unlike the single-ended system in which the voltage on the data line is compared to ground. The advantage of comparing the two lines is that if a noise pulse (current) is introduced along the cable, both data lines will be equally affected, but their voltage difference will still remain the same. In the single-ended cable, the single data line will experience an increase in voltage in reference to ground, and the logic chip at the end will most likely output the wrong logic level (voltage).

Note: You can demonstrate the principles of induced electrical currents (the electrical generator) for yourself at home with a magnet, copper wire, and a compass; by passing the magnet across the copper wire you can cause current to flow in the wire. Not only will you understand more about the sources of SCSI bus noise, but you will also have a great science experiment for the kids. In the office you can conduct similar experiments with either a commercial vacuum cleaner or a large motor in close proximity to your SCSI cable!
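The noise argument in this section can be reduced to a few lines of arithmetic. The sketch below is a toy model under stated assumptions, not a circuit simulation: the 3.0-volt FALSE threshold comes from the single-ended description in section 7.5.3, while the 0.5-volt ceiling for "close to 0," the specific line voltages, the 3.5-volt noise pulse, and the polarity convention of the differential receiver are all values we picked for illustration.

```python
def single_ended_receiver(volts):
    """Compare one data line against ground.

    Returns True for a line near 0 V, False for 3.0 V and above,
    and None when the level falls in between and cannot be decided.
    The 0.5 V ceiling for "near 0" is an assumed value.
    """
    if volts >= 3.0:
        return False
    if volts <= 0.5:
        return True
    return None


def differential_receiver(v_plus, v_minus):
    """Compare the two data lines against each other, ignoring ground.

    Treating "plus line above minus line" as TRUE is an assumed
    polarity chosen only for this illustration.
    """
    return v_plus > v_minus


def with_noise(levels, noise_volts):
    """Add the same induced voltage to every line in the cable."""
    return tuple(v + noise_volts for v in levels)


if __name__ == "__main__":
    noise = 3.5  # assumed size of the induced pulse, in volts

    # Single-ended: a TRUE (0.2 V) is read as FALSE once noise is added.
    print(single_ended_receiver(0.2), single_ended_receiver(0.2 + noise))

    # Differential: both lines shift together, so the comparison is unchanged.
    quiet = (3.0, 2.0)
    print(differential_receiver(*quiet),
          differential_receiver(*with_noise(quiet, noise)))
```

The point of the model is that the common-mode pulse cancels in the subtraction performed by the differential receiver, whereas the single-ended receiver has nothing to subtract it from.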
7.5.5 LVD vs. HVD SCSI technology

The question as to which is better, LVD or HVD, is really a moot point these days. Once the Ultra160 specification was released and vendors started shipping Ultra160 disk drives, it signaled the end of HVD products. It is not very likely that you will see any HVD disks being used in new installations. But there are some good buys to be had at local swap meets! It is not that there is any real problem with HVD technology, but the cost of manufacturing HVD equipment makes it too expensive compared with LVD. The SCSI LVD technology and the speed enhancements with the Ultra160 specification have helped to bolster sales to users who demand high-performance disk subsystems for applications such as real-time video and audio editing. This demand for Ultra160 technology has helped to drive prices down, thereby furthering the market acceptance of Ultra160 SCSI devices.

If you are faced with owning both single-ended and differential devices and don't know what you are going to do, there is hope: thanks to a device called a SCSI bus converter, available from ANCOT Corporation and other vendors, you can mix and match single-ended devices with high voltage differential bus devices. The SCSI bus converter is a two-port bi-directional "black box." One port has a SCSI single-ended electrical interface, and the other port uses a high-voltage differential interface. This device can save you a lot of money if you already have an investment in single-ended devices that you want to use in a cluster.

When you originally purchased the device, the SCSI bus simply connected one CPU to one storage cabinet. If the two cabinets were positioned side by side, a short cable was all that was needed to make the connection. The worst-case situation would be if you were using fast SCSI devices. In that case, the maximum bus length is 3 meters. Given that limit, it is really only practical to connect one computer to one storage array with a little bit of cable budget left over. Remember when calculating cable length to include the cable inside the cabinet (if any) in addition to the external interconnecting cables. You can see in Figure 7.16 some typical cable length calculations to go by. As you can see, a single-ended bus just runs out of distance when you want to connect to a third device. With the current cost of disk drives so low, it's very likely that most customers will purchase all new storage arrays when installing clusters.
SCSI Specification                 Two Devices         More than 2 Devices (16 max)   Peak Data Rate   Distance Between Devices
Single-Ended Bus (SE)
  Ultra, 16-bit bus                3 meters (4 max)    1.5 meters (8 max)             40 MBps          30 cm
Low-Voltage Differential (LVD)
  Ultra2, 16-bit bus               25 meters           12 meters                      80 MBps          30 cm
  Ultra160, 16-bit bus             25 meters           12 meters                      160 MBps         30 cm
Figure 7.16: SCSI cable lengths.

With a SCSI bus single-ended (SE) to HVD converter you can overcome the problems just discussed. One thing to remember when working with these devices is that each port is a separate bus for the purpose of calculating bus length. Therefore, when adding up cable lengths you add all the single-ended cables together and verify that they meet the 3-meter rule for fast SCSI. Then do the same for the differential side, making sure the cables meet the 25-meter rule. When considering whether you want to use these HVD SCSI bus converters, remember that they can become a single point of failure in some configurations.

The real reason these devices are so attractive is that the cost difference between a differential disk drive and a single-ended disk drive is on the order of $400 per disk. If you have a number of disks in a cabinet, you could easily be paying a few thousand dollars for differential interfaces on drives inside one cabinet. On the other hand, a typical SCSI bus converter costs about $500. These converters let you use single-ended drives inside a cabinet, where cable length is not a problem, and at the same time connect to the CPU across the room using differential SCSI.
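The two rules of thumb in this section, per-port cable budgets and the converter's cost trade-off, can be checked with a short calculation. The sketch below is only an illustration: the 3-meter and 25-meter limits and the $400 and $500 figures come from the text, while the function names and the example cable lengths are invented for the demonstration.

```python
SE_FAST_LIMIT_M = 3.0   # single-ended "fast SCSI" bus limit from the text
HVD_LIMIT_M = 25.0      # differential bus limit from the text


def bus_length_ok(segments_m, limit_m):
    """True if the cable segments on one bus (internal plus external) fit the limit."""
    return sum(segments_m) <= limit_m


def converter_saves_money(num_disks, premium_per_disk=400, converter_cost=500):
    """Compare buying differential drives against one converter plus single-ended drives."""
    return num_disks * premium_per_disk > converter_cost


if __name__ == "__main__":
    # Hypothetical layout: each converter port is budgeted as its own bus.
    single_ended_side = [1.0, 0.5]    # cable inside the cabinet + short jumper
    differential_side = [10.0, 0.5]   # run across the room + cable inside the server

    print(bus_length_ok(single_ended_side, SE_FAST_LIMIT_M))
    print(bus_length_ok(differential_side, HVD_LIMIT_M))

    # The same room-length run without a converter breaks the single-ended limit.
    print(bus_length_ok([1.0, 10.0], SE_FAST_LIMIT_M))

    # With six drives, the differential premium is $2,400 against a $500 converter.
    print(converter_saves_money(6))
```

Budgeting each side separately is the key design point: the converter does not extend the single-ended limit, it simply ends the single-ended bus at its port and starts a new differential one.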
7.5.6 The SCSI "T" connector

You will remember from the days of ThinWire Ethernet that there was something called a "T" connector. Ethernet is a multidrop bus somewhat similar to SCSI. The reason we used "T" connectors with Ethernet was to allow workstations to be connected serially. One advantage of the "T" connector was that the workstation could be disconnected from the middle post of the "T," leaving the other two connections intact and thereby not disrupting the network. It worked great unless you had users who wanted to take the "T" with them when they moved offices.

This same concept applies equally to interconnecting server nodes and disk storage boxes in a cluster configuration. Instead of calling it a "T" connector, in SCSI terms it is known as either a "tri-link connector" or a SCSI "Y" cable. As you can see in Figure 7.17, the tri-link connector looks a little like an Ethernet "T" connector except that the connectors look like rectangles instead of round pipes. The tri-link connector used for clustering has one 68-pin male connector and two 68-pin female connectors. The male connector plugs into the disk storage unit or a server node. If the device is at the end of a SCSI bus, then a SCSI terminator is plugged into one of the female connectors on the back. The other female connector receives the cable from another device further down the bus. Figure 7.18 illustrates a typical SCSI configuration as used in a cluster.
Figure 7.17: Tri−link SCSI adapter.
Figure 7.18: Tri-link adapter used in a cluster.
Although the tri-link connector is a neat device, it is not the only solution. It is about 1¼ inches wide, which means that it will cover up the connectors on the neighboring adapters on the server's I/O connector panel. If you have a lot of I/O devices in your server, you could conceivably lose the use of two backplane slots. Therefore, the tri-link adapter might be the perfect solution on a storage device that has only one 68-pin SCSI connector on the back, and where you do not have to worry about physical interference from other connectors.

To solve the problem we just discussed, there is another connector design called a SCSI "Y" cable. It is not as elegant a solution as the tri-link connector, but it does solve the physical interference problem. Like the tri-link connector, the "Y" cable has one 68-pin male connector and two 68-pin female connectors. As you can see from Figure 7.19, the male connector is attached to the two female connectors with about eight inches of SCSI cable.
Figure 7.19: SCSI "Y" adapter.
7.5.7 SCSI component quality

One thing we have learned from years in the field is that quality and adherence to standards vary greatly from vendor to vendor. One company we discovered in our search for a supply of "Y" and tri-link connectors is Amphenol, Inc. For one thing, Amphenol has been around for a long time, and that means the firm has gained a lot of experience building cables. The reason we had not heard much about Amphenol is that it manufactures cable assemblies as an OEM for larger computer manufacturers such as Compaq (Tandem and Digital) and IBM.

When we talked to people at Amphenol, we asked what the difference was between their products and, say, a bargain-priced line. Sometimes it is difficult to tell the differences, or they might not be apparent to the novice SCSI shopper, so we want to point out some things to watch for. For one thing, the cable used for SCSI must adhere to certain standards, including the spacing of the conductors inside the cable jacket, the dielectric, the shielding, and certain electrical characteristics. Another thing to watch for is the physical strain reliefs used to secure the cable to the connector bodies. The cable is precision made and can tolerate only moderate compression before the electrical signal characteristics are affected. If you see a cable with the connector strain reliefs tightly clamped down on it, you are looking at trouble. Another telltale sign to look for is a cable that has a kink in it. That could be a good indication of internal damage to the cable. The damage we are talking about here could affect the electrical characteristics of the twisted-pair cables inside the SCSI cable. The only way to know for sure if the cable is damaged is to test it with a time-domain reflectometer (TDR), but since not many of us have one of those devices sitting in the corner, the best thing to do is replace the suspect cable.
In talking with the people at Amphenol, it became obvious that they understood what engineering factors contributed to a good SCSI cable assembly. At fast SCSI bus speeds of 20 MHz and beyond, the design of the cable, the connectors, and the method by which the connectors are attached to the cable are all important. For example, Amphenol's "Y" cables are constructed with one continuous copper wire. The insulation is removed at the point where the center connector is placed, and then the conductor is soldered to the connector. This kind of construction ensures good electrical characteristics at very high bus frequencies. Like any cable used for high-speed communications, the cable used for SCSI must meet stringent physical and electrical requirements. Even the best quality cable can fail as a result of improper handling. It is important when you install SCSI cables that you treat them with respect. Do not put undue strain on the cable, and be careful that you don't make 90-degree bends. Most cables used for high-speed communications (e.g., Ethernet, twisted-pair, and fiber optic) have a specified minimum bend radius. A good rule to remember is not to deform the cable any more than absolutely necessary. For example, if you need to use cable ties, be careful not to over-tighten the tie to the point at which it compresses the cable. This can change the cable's electrical transmission characteristics. The same advice applies to twisted-pair wire used for Ethernet. Since all of the data in your cluster will travel over the SCSI bus, it is important to make sure that you purchase the best quality SCSI components. Good quality SCSI cable and connectors aren't cheap. If you do find a vendor offering parts at prices considerably below what everyone else is charging, you should probably wonder what corners this vendor has cut in its manufacturing process. If you have to cut costs, you should not look at the SCSI bus as a place to save money when designing your cluster.
7.5.8 Supporting larger SCSI disk farms Something else to keep in mind is that SCSI adapter cards can have more than one controller on each card. New higher-density SCSI connectors are allowing companies like LSI Logic Corporation to market PCI cards that have both Ethernet and SCSI controllers on the same card. LSI Logic even has a 4-port SCSI controller card, made possible by very high-density SCSI connectors. One word of caution if you do decide to use one of these 4-port SCSI adapters: don't put all your eggs in one basket! Rather than using just one of these cards for all of your storage cabinets, you should distribute your storage devices across multiple adapters so that you can lose one SCSI bus adapter and still keep going.
Chapter 8: Cluster Networking 8.1 LAN technology in a cluster: the critical link Like any enterprise server, a Windows cluster relies on its connection to the enterprise LAN to provide a reliable communication link with its end users. Because the main goal of this book is to show the reader how to build highly available and scalable systems, the connection to the local area network (LAN) is a very important topic anytime there is a discussion involving clustering. Even though LANs have been deployed since the early 1980s, one cannot assume that they have been designed or implemented correctly everywhere they have been installed. In our experience in the field, the LAN connection might well be considered the most fragile link in a cluster implementation. Further, the failure of the enterprise LAN can be blamed for well over half of the service interruptions to end users. Even if you want to assume, for the sake of argument, that most LANs have been deployed correctly, it might surprise you to know how easily a LAN can be brought to its knees. All it takes is for someone to make a very simple and common mistake, not knowing the ramifications of his or her actions, or even a networking administrator who just plain makes a stupid mistake. It's amazing how easy it is for one innocent user to take down a whole building and maybe even a campus. We witnessed this one day after receiving a frantic call from a network support group employee. The individual was in a panic; one of his facilities was at a standstill because the two T1s going into the building were completely saturated. After locating the computer that was generating all the traffic, we discovered an end user who was simply trying to find his files stored on a shared drive on a server in another building on the campus.
The problem was that the user did not realize that opening multiple instances of the Windows "Find File" utility to search all the servers on his company's LAN in parallel would have the effect it did. It certainly seemed like a reasonable thing to him at the time! After all, he was in a hurry to find the file he needed for a deadline. Windows dutifully carried out his commands, saturating the LAN with server message block (SMB) requests until all other traffic on the LAN came to a standstill.
Figure 8.1: Typical Windows NT cluster network configuration.
That was not the only time that we saw that type of thing happen at large companies. You can't protect yourself from that scenario no matter how much you spend on cluster hardware. In this chapter, we discuss the role that LAN technology plays in clustering for high availability. We will identify possible risks that could have a negative effect on high availability and actions that can be taken to minimize those risks.
8.2 The enterprise connection For the purpose of this discussion, when we refer to the cluster's LAN connection we are talking about the communications link that connects the end user to the clustered system. For just about every clustering solution on the market today, the communications protocol used on the enterprise LAN connection is based on TCP/IP. This makes a lot of sense from an Internet e-commerce point of view, but there are still quite a few companies using protocols other than IP. Two that come to mind are Novell's IPX and the former Digital Equipment Corporation's DECnet. If you happen to be one of those companies with legacy LAN protocols, we are sorry to say that you are basically out of luck when it comes to the mainstream clustering solutions. Cluster load-balancing solutions also rely heavily on the functionality of the TCP/IP protocol. We can't emphasize enough the importance of putting a lot of effort into the design of your enterprise LAN's connection to the cluster. You will need to pay close attention to the physical connections, their bandwidth capacity, and any potential single point of failure that can result from the network topology you select for your design. As you have seen from our previous example, a LAN can fail as a result of problems with a networking protocol as easily as from the failure of a physical device such as a router, a switch, a connector, or cabling. With this in mind, it is important that your LAN implementation include the ability to detect and isolate failures due to software or network protocol problems. A network software problem such as the one we described at the beginning of the chapter is likely to cause considerably more disruption of services to a larger number of people than the failure of a single line card in a switch. In addition, identifying a network software failure is likely to be much more challenging and time-consuming than detecting a hardware failure in a network device. In many cases when a piece of hardware fails these days, you will find an LED indicator that either turns red or turns off to show that a failure has occurred. Just about anybody can see that an LED has turned red, has gone from green to yellow, or is completely off. Usually, when this kind of hardware failure occurs, either the piece of equipment or one of its modules is replaced and the device is immediately put back into service. If the LED turns green, you know that the problem is fixed. Things are not that easy when the failure is due to the network protocol or to problems in the application software that is trying to communicate across the network. In such a situation, the individual attempting to fix the problem is going to need a lot more in-depth knowledge about how an application is attempting to communicate across the network or how the TCP/IP protocol itself functions. The individual hired to diagnose a problem like this is typically a senior-level person, who will need to be equipped with some rather sophisticated diagnostic tools in order to detect and repair these types of failures.
We have found that Ethernet is a double-edged sword. When it works it is great, but when it is broken, it can become a real problem. Ethernet was designed from the ground up to be very reliable, and it is. The protocols that are used on an Ethernet LAN provide a considerable amount of error checking and error recovery as an integral part of the protocol. The problem comes when the error-checking and recovery protocols mask real or pending problems on the network. We have seen Ethernet LANs with serious technical configuration problems limp along for some time without either the LAN administrators or the end users even being aware that they had problems. This speaks very well for the design work that was done by Xerox on the Ethernet standard. Without a good network management system in place, it might not be possible to detect transient faults or network performance problems that actually exist but are not apparent to the end user. If you are lucky, a network device will just plain fail, and then you can send someone out to fix it. That is the simplest and most desirable scenario to deal with. But, from our experience, you should be prepared to deal with the elusive network problem that is more likely to occur and is definitely harder to diagnose and resolve. It is worth considering products such as HP OpenView, Tivoli, network sniffers, and even TDRs. Whichever LAN technology you decide to implement for your enterprise, it's important to follow the design rules and guidelines closely. A LAN implementation that has a marginal design can and will fail more quickly than a LAN designed with some built-in margin of error. As we have already discussed, detecting LAN problems can be very difficult and time-consuming. It's much better to invest the time up front in creating a good design to begin with. If it is within your budget, it makes sense to design in reserve capacity to allow for the inevitable growth that will occur.
A network designed correctly with extra reserve capacity might allow you to use that capacity to provide some level of redundancy in the event of hardware or link failures. When everything is functioning normally, the extra capacity is there to level out peak loading, which enhances the reliability of the system from the perspective of the end user. We will say many times throughout this book that achieving high availability means eliminating as many of the potential single points of failure in the system as possible. As you can see from Figure 8.2, one way to increase the availability of the cluster's LAN connection is to use dual NICs in each cluster server and then connect each NIC to a different LAN hub or switch. The advantage of this configuration is that if one of the NICs were to fail, network traffic could still flow to and from the server. This configuration makes sense only if you also install dual redundant LAN hubs or switches. Here again, if one of the hubs should fail, network traffic can continue to flow through the other hub.
Figure 8.2: Redundant enterprise LAN connections and hubs.
On the negative side, this configuration might require you to purchase a second hub, unless you already have multiple hubs in your wiring closet. The failure of the cluster's LAN connection could also be caused by a pinched or cut network drop cable going from the server to the hub or switch. This potential single point of failure could be eliminated by physically routing the cables that go between the two hubs and the two NICs in separate cable troughs in the data center and making sure that the two hubs are located in different wiring closets.
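The effect of the redundant paths described above can be estimated with simple availability arithmetic. The following Python sketch is only an illustration: the per-component availability figures are hypothetical, and the calculation assumes that the two paths fail independently, which is exactly what separate cable troughs and separate wiring closets are meant to achieve.

```python
from math import prod


def series(*availabilities):
    """Every component must work, so the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """At least one path must work: 1 minus the product of unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical availabilities, chosen for illustration only.
NIC = 0.999
CABLE = 0.9995
HUB = 0.999

# One NIC, one drop cable, and one hub form a single path in series.
single_path = series(NIC, CABLE, HUB)

# Two such paths in parallel, assuming their failures are independent.
dual_path = parallel(single_path, single_path)

print(f"single path availability: {single_path:.6f}")
print(f"dual path availability:   {dual_path:.6f}")
```

If the two paths share a cable trough or a wiring closet, the independence assumption no longer holds and the real gain will be smaller than this estimate suggests.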
8.3 Connection and cost If you don't already have two separate hubs available to you and you don't want to make the investment in a second hub, Figure 8.3 illustrates another technique that can at least help reduce your risk of network hardware failures. In this figure, we have only one hub, but it has two or more port cards installed. You can take advantage of the fact that each port card provides some level of isolation from the other one and plug one NIC into the first port card and the second NIC into the second port card. This won't protect you from a failure of the power supply or backplane in the hub; even so, it does protect you from a failure of the port interface electronics, which is the area that is most likely to fail anyway. The advantage of this configuration is that you can save the cost of a second hub and still increase your total availability somewhat. Some network hubs and switches have dual redundant power supplies, which can also help to increase the availability of your networking infrastructure.
Figure 8.3: Reducing the single point of failure for the enterprise LAN.
8.4 Cluster intercommunications LAN technology is currently the standard for the cluster intercommunications link on most of the mainstream clustering products. The reason for this choice is rather simple. It is a familiar technology to most system administrators, it is relatively cheap, and support is already built into the Windows NT operating system. A point-to-point Ethernet LAN running at 100 Mbps can easily handle a small two-node cluster today. In the future, expect to see Ethernet replaced with higher performance hardware such as FDDI, Fibre Channel, ServerNet, Myrinet, or InfiniBand. However, maybe more important than the hardware itself will be the use of protocols other than TCP/IP for the cluster interconnect LAN. The reason for this is that although TCP/IP is good at ensuring reliable communication from someone's PC across the Internet wide area network (WAN) to a cluster, the TCP/IP protocol has a lot of functionality that is not needed on a high-performance, low error-rate LAN or SAN interconnect. A protocol designed for WAN or Internet communications comes bundled with a lot of functionality that is certainly desirable for reliable WAN connectivity, but becomes a hindrance for a cluster of 2 to 32 nodes located in close proximity to each other in a data center. A WAN protocol such as TCP/IP has a few unnecessary protocol layers when all that is needed is to send messages across a very high-speed, low error-rate link between cluster nodes located only a few feet apart. As is the case in anything that you do, using the right tool for the job is the challenge. A protocol that works well in an error-prone environment such as the World Wide Web is not what you need when connecting a few nodes together in a cluster.
8.5 LAN vs. SAN
Let's take a few minutes and review the technical differences between a LAN, a WAN, and a storage/system area network (SAN). The LAN came about because of the need for a fast, low error-rate connection between workstations and servers in a local area. A "local area" was considered to be a floor or building. Typically, a LAN does not extend past a building's walls. Ethernet has served us well in this capacity and, with Gigabit Ethernet on the horizon, it should continue to serve us well for some time to come. It is fast, it has low error rates, and its cable plant specifications support typical building wiring specifications. Back in the days of DOS, Microsoft and IBM developed a LAN protocol called NetBEUI. Similarly, the former Digital Equipment Corporation developed LAN-based protocols named Local Area System Transport (LAST) and Local Area Disk (LAD). These protocols could possibly be considered the first SAN protocols since they were used to connect CPUs to remote network-connected disks. Digital also used it as a cluster communications transport in its Local Area VAXcluster (LAVC) product. Both NetBEUI and LAST have one thing in common: they were designed to support only LAN technologies, not WAN links. Now that the Internet has become so popular, TCP/IP is the standard protocol at most corporate sites. Unlike NetBEUI and LAST, TCP/IP is a routable protocol, which makes it suitable for wide area network (WAN) applications. Although TCP/IP is well suited for WAN communications, it is somewhat less efficient when used in a LAN environment. The reason we say this is that TCP/IP is a "jack of all trades" when it comes to networking protocols. It supports very high-speed and low error-rate technologies such as Ethernet at the high end, while on the low end it can be used on the slowest and noisiest communication lines known to man: dial-up telephone circuits.
Because of this and the fact that TCP/IP allows us to connect anything to anything, TCP/IP has received the "most favored" rating for connecting end users to servers around the world over the World Wide Web. What it does not do well is connect two or more very high-speed CPUs together in a system area network (SAN). Fortunately, Intel, Microsoft, and a number of other companies are actively working to flesh out their designs of a new protocol called Virtual Interface Architecture (VIA), which is designed specifically for a SAN environment. Unlike the Internet and TCP/IP, a SAN is a local cluster interconnect that runs on very high-speed communications hardware with a very low error rate and is designed to interconnect a cluster of computers in a data center. Its design goals are drastically different from those of either a LAN or a WAN. The VIA standard is being designed from the ground up to support communications hardware capable of very high speeds with very low error rates, which is a crucial component needed to support the next generation of clustered applications.
8.6 Network transports From the point of view of e-business and the Web, TCP/IP is certainly the right choice for a network transport protocol. It is worth pointing out, however, that the TCP/IP protocol itself can end up being your single point of failure. The first thing that comes to mind for many people when we talk about a single point of failure is a piece of computer or network hardware. That is because 15 or 20 years ago, when computers were built from hundreds of discrete components, the probability was very high that one of those components or solder joints would fail. Today, with very large-scale integration of logic circuits, the probability that a system failure will be the result of a hardware component failure is substantially reduced. Nevertheless, many people still have the perception that hardware causes the most downtime. Protocol failure usually does not come up during discussions of clustering. Cyber warfare is a good example of why the TCP/IP protocol is one of the weakest links in a cluster system today. Only lately have many people begun to realize how vulnerable they really are to either intentional or unintentional failure of the network protocol itself. The TCP/IP protocol provides a lot of very useful functionality for users, which is a good thing. Unfortunately, evil people have figured out how to use that functionality for not-so-good purposes. In the past, most people that we talked to were quick to mention their experiences with a network hub failing or a cable being cut. That has all changed now that we are hearing about instances of cyber warfare perpetrated by 13-year-olds who simply exploited functionality that was part of the TCP/IP protocol. Steve Gibson of Gibson Research Corporation has an excellent article on his Web site that describes how he became a victim of such an attack (www.grc.com).
8.6.1 IP single point of failure During the past few years we have seen quite a few instances of network failures due to TCP/IP protocol corruption, human−induced causes, router hardware failures, or just plain poorly written network software. For example, one very well−known manufacturer of laser printers shipped printer driver software that literally took down a campus network because it walked through all the IP addresses on the extended LAN trying to find printers to connect to using ARP broadcast protocol. Needless to say, that process caused havoc with all the devices on the extended campus LAN. Luckily, the network support staff at the site finally figured out what was going on and took steps themselves to prevent the problem and then followed up with a harsh call to the vendor that caused them so much grief. The problem was that every once in a while unsuspecting end users would go off on their own and attempt to load the printer drivers and cause the network to go belly up again and again. From the end users' point of view it made sense. If they were having problems printing they would check the Web and see that there was a new and improved driver on the vendor's Web site. They would promptly download the driver and proceed to install it on their workstation, not realizing the problems that they were going to cause. Of course, the printer vendor quickly fixed the program once the problem was discovered and reported, but not before it contributed to quite a few hours of downtime over the course of a few days. The message here is that something as simple as an unsuspecting innocent user just trying to load an upgraded driver could cause the servers on the LAN to become unavailable for a considerable period of time. When you consider how typical that scenario is, you will realize how vulnerable you can be to something as simple as the failure of a communications protocol. 
By implementing standard desktop computer "load sets" and policies on what can be loaded on the desktop computers, it should be possible to reduce this type of threat to your network. You will then only have to worry that someone who is new to the network support group staff will make a similar mistake while testing the software drivers for a new printer that is being evaluated.
8.6.2 Single protocols vs. multiple network protocols When it comes to building high−availability systems, the ultimate goal is to eliminate any potential single point of failure. We just finished discussing how vulnerable TCP/IP is to failure. It's true that quite a large number of applications today are dependent on TCP/IP as their underlying network transport. But there are applications that don't require TCP/IP to run. A case in point is Novell's IPX protocol. Quite a few applications have been written to specifically support the IPX network protocol. It is totally feasible to develop and deploy client/server applications that can transparently use either protocol to communicate with the back−end server. The only challenge left is to find a clustering solution that can support multiple network protocols. A hardware−based clustering solution is probably the best choice for supporting network protocols other than TCP/IP.
8.6.3 Transport redundancy Luckily, there are solutions on the market that can support multiple network protocols in a clustered environment. One such solution comes from Marathon Technologies Corporation. The products in Marathon's Endurance family are basically fault-tolerant solutions, although the company likes to describe them as providing "assured availability." Since Endurance is a hardware-based solution using standard PC hardware and a standard off-the-shelf copy of Windows NT/2000, it does not have the same limitations that Cluster Service has when it comes to network protocols. Microsoft's solution is partially based on the functionality of the IP protocol itself. Microsoft's clustering solution relies heavily on a technology referred to as IP Addressed Mobility. Cluster Server is based on the concept of many virtual servers existing in a cluster, each of which has its own virtual IP address and can run on any available server in the cluster. Because of the use of IP Addressed Mobility, other network protocols such as IPX, DECnet, and AppleTalk cannot be used. Marathon's Endurance 6200, on the other hand, does not embody the concept of a virtual server or virtual IP. If Endurance 6200 detects any failure whatsoever on the active server, everything running on that server is immediately switched to a hot standby server. Of course, Marathon will still need to issue an ARP broadcast to inform other computers and routers on the LAN that the MAC address associated with the Endurance server has changed. Other than that, Marathon does not rely on the IP protocol for failover. Although we have not proved this theory in the lab, it would appear that Marathon's architecture should not care what network protocols are running on it. If you were to pick protocols such as IP and IPX and install them on both the Endurance 6200 server and your client workstations, you would be protected against a failure of one of the network protocols. Microsoft's operating systems will automatically try a second protocol stack if a connection cannot be made on the first network protocol stack. This feature has been built into Microsoft's Windows networking drivers for some time now.
That very feature saved one large corporation a lot of grief when its primary protocol, TCP/IP, failed one day. As it turns out, the company was a big Digital Equipment Corporation customer and just happened to be running a dual protocol stack using DECnet as the second protocol. Even so, some machines that were running only TCP/IP at their site were dead in the water. The computers that were running dual protocols were able to continue on working because the problem that occurred affected only the TCP/IP protocol.
Figure 8.4: Windows support for multiple network transport protocols.
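The fallback behavior described above, in which a client tries a second protocol stack when the first cannot connect, can be sketched in a few lines of Python. This is not how the Windows networking drivers are implemented; it is a minimal illustration of the idea. The function name `connect_with_fallback` and the two simulated transports are hypothetical.

```python
from __future__ import annotations

from typing import Callable, Sequence


def connect_with_fallback(
    transports: Sequence[tuple[str, Callable[[], str]]],
) -> tuple[str, str]:
    """Try each transport in order and return (name, session) for the first success."""
    errors = []
    for name, connect in transports:
        try:
            return name, connect()
        except OSError as exc:
            # Remember why this transport failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all transports failed: " + "; ".join(errors))


def tcpip_connect() -> str:
    # Simulated failure of the primary protocol stack (hypothetical).
    raise OSError("TCP/IP stack unavailable")


def decnet_connect() -> str:
    # Simulated success on the secondary protocol stack (hypothetical).
    return "session established"


used, session = connect_with_fallback(
    [("TCP/IP", tcpip_connect), ("DECnet", decnet_connect)]
)
print(f"connected over {used}: {session}")
```

A machine with only one entry in the transport list has nothing to fall back on, which matches the experience of the TCP/IP-only machines in the story above.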
8.6.4 Compaq's Advanced Server transport redundancy Compaq's implementations of Microsoft's Advanced Server on both Tru64 UNIX and OpenVMS are other examples of clustering solutions that are not dependent on a particular network transport protocol. Naturally, Compaq supports TCP/IP and DECnet on both OpenVMS and Tru64 UNIX clusters running Advanced Server. In addition, when the old Digital was developing the Pathworks PC networking product on both the OpenVMS and Tru64 UNIX servers, it also ported the NetBEUI protocol to these platforms. Whether you need support for these legacy protocols or just want redundant network transport, these protocols (TCP/IP, DECnet, and NetBEUI) from Compaq could fill your needs for redundancy. Because Compaq also implemented a distributed lock manager (DLM), Advanced Server running on an OpenVMS cluster can do some tricks that Cluster Service with its shared-nothing model just can't do. For example, Advanced Server running on an OpenVMS cluster can function as a virtual primary domain controller, providing 100 percent availability for the primary domain controller without the need for a backup domain controller. That functionality will work with multiple protocols running in the cluster. As we said earlier, Microsoft's Windows clients are already able to take advantage of dual network transports. If one of the transports fails, the Windows networking software will automatically attempt to re-establish the connection on the remaining transport. It is interesting to note that none of the other high-availability solutions we have looked into addresses the need to protect the cluster from a failure of the network protocol itself.
8.7 Change control on routers We have run out of fingers and toes to count the number of times we have seen corporate networks come to a standstill after a network administrator tried to add a simple static route to the router configuration. You would not think that something this simple and commonplace today is still causing network outages lasting anywhere from one hour to a day. What is really scary is that we have seen this occur on the same router link over and over again. After we looked into this, it became obvious that the problem was that different network administrators had been modifying the router's configuration tables without documenting all of the changes they were applying to the routing table. The lesson learned from this experience is that it is essential to implement a change−control procedure on all equipment associated with a system that must be highly available. The change control procedure should document any changes that are made to the configuration files on any router or hub on the network. At a minimum, your configuration−control document for a router should contain, in addition to the normal IP and subnet mask information, detailed documentation on why a route was established, information about the other end of the link, the name of the individual establishing the link, and the date. With this information in hand, anyone needing to change a router's configuration table will be able to clearly see its current configuration and then be able to compare that with the final configuration after changes have been applied. This way, if a route is accidentally deleted while changes are being made to the router table, it will be easier for the person making the changes to compare the "before" and "after" configurations to make sure that any changes that occurred were intentional.
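As a rough illustration of the change-control record and the "before" and "after" comparison described above, the following Python sketch stores the documentation fields the text lists and reports any route lines that differ between two configurations. The `RouteChange` record, the `config_diff` helper, and the `route <destination> <mask> <next hop>` line format are all hypothetical; real router configuration syntax varies by vendor.

```python
from __future__ import annotations

import difflib
from dataclasses import dataclass
from datetime import date


@dataclass
class RouteChange:
    """One change-control entry for a static route (hypothetical record layout)."""
    destination: str
    subnet_mask: str
    next_hop: str
    reason: str       # why the route was established
    remote_end: str   # information about the other end of the link
    changed_by: str   # the individual establishing the link
    changed_on: date


def config_diff(before: str, after: str) -> list[str]:
    """Return only the configuration lines that were removed (-) or added (+)."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm="", n=0,
    )
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical configurations: a new route is added, and an existing
# route is accidentally deleted during the same edit.
before = (
    "route 10.1.0.0 255.255.0.0 10.0.0.1\n"
    "route 10.2.0.0 255.255.0.0 10.0.0.2"
)
after = (
    "route 10.1.0.0 255.255.0.0 10.0.0.1\n"
    "route 10.3.0.0 255.255.0.0 10.0.0.3"
)

record = RouteChange(
    destination="10.3.0.0", subnet_mask="255.255.0.0", next_hop="10.0.0.3",
    reason="link to new branch office", remote_end="branch router",
    changed_by="network administrator", changed_on=date(2001, 1, 15),
)

for line in config_diff(before, after):
    print(line)
```

The record documents one intended change, while the diff shows two changed lines; the unexplained removal is the kind of accidental deletion the procedure is meant to catch.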
8.8 Fault isolation For years, the computer trade magazines managed to keep alive the debate over which LAN architecture was best, Ethernet or token ring. As it turns out, just as VHS crushed Beta, Ethernet succeeded in capturing the major market share in the LAN arena. One of the characteristics of Ethernet that made it a success is its tolerance for poorly implemented network installations, physical cables, and connectors. The ability of Ethernet to continue to function despite all odds is certainly impressive, but it also has a negative side effect. A poorly designed Ethernet LAN might appear to less experienced network administrators to be functioning properly when in fact it is a disaster waiting to happen. It never ceases to amaze us when we show an IT manager the screen of a network analyzer and see his look of surprise as he realizes for the first time that, despite his perception that his LAN was running at peak performance, in reality he had some serious problems that needed to be dealt with. We have seen Ethernet continue to run with broken connectors, junk coaxial cable, the wrong type of cable altogether, improperly configured cables and connectors, and even the wrong terminators for Ethernet. The surprising thing about Ethernet is that, from a user's point of view, it all seemed to work fine. Ideally, a network should have its performance baselined to give the network administrators a reference point to use in determining the current state of the network. That baseline can be arrived at using a couple of different approaches. One is to measure the network's performance after verifying that the network is functioning correctly as designed and installed. Another approach is to model the network as installed, using a sophisticated network modeling and simulation tool such as OPNET. Theoretically, the network model should give you a good idea of what to expect for the network's performance. With this information in hand, you should be prepared to monitor your network's performance using one of the many network management tool sets. You should also be able to set performance level triggers that will alert you if your network performance drops below the level that you established from the data you collected. There is a saying that goes, "the only difference between men and boys is the price of their toys." Well, the same thing can be said for networks. The difference between a good network and a bad network can usually be attributed to the amount of training and experience its administrators have and the sophistication of the network diagnostic and management tools that are available for them to use. In order to isolate network faults, you must first have good documentation of the network topology as it was installed.
With your network documentation in hand, you can begin using a network monitoring and analysis tool to take a baseline performance measurement of your LAN. It would be wise to observe your network at different times of the day under different network loading conditions. The tools that you select can range in sophistication from a software solution such as Network Monitor from Microsoft to some rather expensive hardware devices designed specifically for performing network diagnostics. The choice comes down to how much insurance you want to buy to ensure that your network continues to run at optimum performance. Prices range from a few hundred dollars to a few thousand dollars. Generally speaking, the more you pay, the more sophisticated and capable the device is. The lower-cost solutions are typically software-based and run on standard workstations or servers. The more expensive devices are typically hardware-based solutions designed to capture and monitor network traffic in real time without any packet loss. The bottom line is that you need some capability for monitoring your LAN if you want to be able to quickly find a network fault or, better yet, discover a component that is just starting to go bad before it actually fails.
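The baseline-and-trigger idea described above can be sketched in a few lines of Python. This is only an illustration, not the behavior of any particular management tool: the sample throughput values, the 20 percent tolerance, and the function names are all assumptions made for the example.

```python
from statistics import mean


def build_baseline(samples_by_period):
    """Average the throughput samples (Mbit/s) taken in each time-of-day period."""
    return {period: mean(values) for period, values in samples_by_period.items()}


def below_baseline(baseline, period, measured, tolerance=0.20):
    """Return True when a measurement falls more than `tolerance` under the baseline."""
    return measured < baseline[period] * (1.0 - tolerance)


# Hypothetical measurements taken while the LAN was known to be healthy.
samples = {
    "morning": [78.0, 82.0, 80.0],
    "midday": [60.0, 64.0, 62.0],
}
baseline = build_baseline(samples)

# A small dip stays inside the tolerance; a large drop would trip the trigger.
print(below_baseline(baseline, "morning", 75.0))
print(below_baseline(baseline, "morning", 55.0))
```

Keeping a separate baseline per time of day follows the advice to observe the network under different loading conditions, so that a normally busy period is not mistaken for a fault.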
8.9 Cluster computer name

All NetBIOS services that a user accesses on a Microsoft server must specify the "computer name" assigned to the server or cluster hosting the service. The computer name is also sometimes referred to as the NetBIOS name. Any service that uses the NetBIOS protocol requires that the computer name be specified as \\computer name\service\directory\file. Good examples of services that use the NetBIOS protocols are the file and printing services available on any of Microsoft's Windows server platforms. Typically, the computer name and the TCP/IP node name are set to the same thing. Other types of network services that run on a Microsoft server (e.g., HTTP, FTP, and Telnet) require only TCP/IP and therefore do not require the NetBIOS computer name to function.

Clustering adds another level of complexity to dealing with computer names. When you set up a Microsoft cluster, you will need to configure several computer names, depending on how many resource groups you decide to configure. The first computer name that you will need to set up is the one that you are prompted for when you install the Windows operating system on your networked computer. You are automatically prompted to enter a computer name during the installation and setup customization procedure. Each physical server in the cluster will have its own unique computer name and IP address. This computer name is assigned to a physical cluster node and should not be used by end users to access services on the cluster. If a user were to access cluster services by referencing a physical server's computer name, that user would lose access to those services as soon as the server failed or the services were taken offline and moved to another cluster node. If you were to browse for resources on your cluster, you would see an entry for every node in the cluster, such as \\cluster_node1\, \\cluster_node2\, etc. These nodes can be referenced directly by their physical server computer names if you need to administer a particular node.

When you configure your cluster, you will be required to assign an additional computer name/IP address pair to uniquely identify the cluster itself. This computer name is referred to as the "cluster name" or a "cluster alias." The cluster name is used for the administration of the cluster itself and is not linked to a particular physical server. It is also called a cluster alias because the cluster name can be used to access any service or virtual server no matter where it is running in the cluster. This feature frees end users from having to figure out where a service is running on the cluster. All users need to do is reference the cluster alias, and they will be able to access their favorite service. The cluster alias is a key usability feature of clustering. The cluster administrator will use the cluster computer name when administering the cluster's resources and services. A user who browses for services on the cluster using the cluster name or alias (\\cluster\) will be able to see all the services that are available on the cluster. Users who want to actually connect to a virtual server should use the computer name given to that particular virtual server or resource group.
It is the virtual server's NetBIOS name that is failed over from one node in the cluster to another, not the cluster name. After the initial administrative setup of your cluster server, you will start the normal administrative task of assigning resources and services to groups, which we have been referring to as virtual servers. This is an administrative requirement for Cluster Server and is not something that end users will have to deal with. Each virtual server that you set up on your cluster will need to have a NetBIOS computer name and a unique IP address assigned to it. Users will be able to access the virtual server that is offering the services that they want, wherever it is running in the cluster, by referencing the virtual server name (\\myclusterapps\). The computer name that is assigned to a virtual server stays associated with the virtual server no matter where it is running on the cluster. Because of this, the end user only needs to know the name of the virtual server and not the computer name of the physical piece of hardware that is actually running the applications. One physical server in the cluster is likely to host multiple virtual servers, each with its own IP address and computer name. Each virtual server that you create will consume one IP address; therefore, you will need to do the appropriate planning when setting up the IP addressing scheme at your site. The number of virtual servers depends only on how you decide to organize services on your cluster. In general, Microsoft suggests that you might want to configure a virtual server for each application and its associated services. It really depends on the amount of granularity you want in managing your cluster's services.
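The separation between physical node names and virtual server names can be pictured with a small Python model. It is a sketch under stated assumptions, not Cluster Service code: the `Cluster` class, its methods, and the share name `reports` are hypothetical, and only the names \\cluster_node1, \\cluster_node2, and \\myclusterapps come from the text.

```python
class Cluster:
    """Toy model: tracks which physical node currently hosts each virtual server."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.hosted_on = {}  # virtual server name -> physical node name

    def bring_online(self, virtual_server, node):
        if node not in self.nodes:
            raise ValueError(f"unknown node: {node}")
        self.hosted_on[virtual_server] = node

    def fail_over(self, virtual_server, new_node):
        # The virtual server keeps its name; only the hosting node changes.
        self.bring_online(virtual_server, new_node)


def unc_path(computer_name, share):
    """Build the \\\\name\\share form that a user would type."""
    return "\\\\" + computer_name + "\\" + share


cluster = Cluster(["cluster_node1", "cluster_node2"])
cluster.bring_online("myclusterapps", "cluster_node1")
path_before = unc_path("myclusterapps", "reports")  # "reports" is a made-up share

cluster.fail_over("myclusterapps", "cluster_node2")
path_after = unc_path("myclusterapps", "reports")

print(path_before == path_after)           # the user-visible name did not change
print(cluster.hosted_on["myclusterapps"])  # the hosting node did
```

A user who had instead connected through \\cluster_node1 would have been left pointing at a node that no longer hosts the service, which is why end users should be given only virtual server names.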
8.9.1 How the cluster alias is used

In Microsoft's scheme of things, the computer name is key to Windows usability. Users quickly learn the descriptive computer names that are assigned to servers throughout the LAN. Once users discover the connection between the names of the services they desire and the computer names of the servers that host them, they will quickly learn how to navigate to wherever they would "like to go today." The cluster alias assigned to a cluster is used if you want to browse a cluster to discover what services are running. By using the cluster alias, the user can browse for services on a cluster without worrying about where each service is actually running. When users browse using the cluster alias, they will be shown all of the available services running on the cluster as well as any services that happen to be running locally on nodes that are members of the cluster. Technically, there is no problem with having services running on local nodes, but services that run locally on a cluster node do not benefit from the cluster's failover capability.

The cluster alias is particularly useful in connection with the load-balancing solutions used in large Web server farm configurations. The load-balancing software or hardware advertises the cluster alias to the network but hides the individual cluster nodes from the Internet. The load-balancing product then decides where each inbound request would best be serviced, based on the availability of and processing load on each node in the load-balancing cluster. From the user's point of view, he or she is just accessing a single server, when in fact there could be up to 32 servers clustered together, as in the case of Microsoft's load-balancing product.
8.10 Cluster Service's use of IP mobility

Microsoft's clustering architecture for the Windows NT/2000 environment is centered on the ability to dynamically move an IP address from one machine to another at will. This architecture allows users to establish multiple "virtual servers" that can be moved from one physical server in the cluster to another. These virtual servers can be hosted on any available "physical server" in the cluster. The big advantage of IP mobility is that it is based on industry-standard protocols, namely IP and the associated Address Resolution Protocol (ARP). This means that IP mobility will work for any client connected to the cluster without requiring any special code to be installed on the client other than the industry-standard TCP/IP networking protocol. The early "Digital Clusters for Windows NT" product was not based on IP mobility and therefore required that a small piece of code be loaded on each machine. Obviously, that was not looked on very favorably by system administrators. But when Digital Clusters for Windows NT was released, Windows NT version 3.51 did not support IP mobility. Microsoft eventually added support for IP mobility in version 4.0 of Windows NT.

Thanks to the industry-standard TCP/IP protocols, the failover of an IP address in a Windows NT/2000 cluster is a rather simple process and, more importantly, it occurs in about four seconds. Figure 8.5 shows the basic steps that are taken by Cluster Service to fail over the IP address of a group or virtual server. Once Cluster Service determines by way of an application's resource DLL that the application has failed, all of the dependent resources that make up its group must be moved together to another active node in the cluster. At that point, Cluster Service begins the process of starting up all of the resources in that group on the new server node.
If the application will require network services, the first thing that Cluster Service will do is to initialize the group's IP address and computer name on the new node. Once the IP address is initialized, the TCP/IP protocol drivers will issue an ARP broadcast. The ARP broadcast is received by all hosts on the local LAN as well as the local network gateway. Upon receiving the ARP broadcast message, all nodes on the local LAN will update their ARP tables to reflect the new IP address to MAC address mapping.
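The effect of that ARP broadcast on the other hosts can be simulated with plain dictionaries. The sketch below makes no network calls and is deliberately simplified: real hosts apply their own rules about when to accept an ARP update, and the IP address, MAC addresses, and host names are invented documentation values.

```python
VIRTUAL_SERVER_IP = "192.0.2.50"     # documentation-range address, assumed
NODE1_MAC = "00:00:5e:00:53:01"      # made-up MAC of the failed node's NIC
NODE2_MAC = "00:00:5e:00:53:02"      # made-up MAC of the surviving node's NIC


def announce(arp_tables, ip_address, mac_address):
    """Simulate an ARP broadcast: every host on the LAN records the new mapping."""
    for table in arp_tables.values():
        table[ip_address] = mac_address


# ARP tables of a client and the local gateway before the failover.
arp_tables = {
    "client_a": {VIRTUAL_SERVER_IP: NODE1_MAC},
    "gateway": {VIRTUAL_SERVER_IP: NODE1_MAC},
}

# Node 2 brings the group's IP address online and broadcasts the new mapping.
announce(arp_tables, VIRTUAL_SERVER_IP, NODE2_MAC)

for host, table in arp_tables.items():
    print(host, table[VIRTUAL_SERVER_IP])
```

Because the clients only ever change an entry in a table that standard TCP/IP stacks already maintain, nothing cluster-specific has to be installed on them.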
Figure 8.5: IP address failover.
8.11 IP addresses required for virtual servers

One thing that you need to be prepared for when working with Cluster Service is that you will need a good handful of IP addresses (where is IPv6 when you need it?). Microsoft's recommendation is to put each application that you want to protect into its own resource group. The application and all its dependent resources, including the computer name and an IP address, form what Microsoft calls a virtual server. In addition, each physical server in the cluster will also require a permanently assigned fixed IP address. If your site uses DHCP to assign IP addresses to nodes on your network, you must be aware of the need to assign static IP addresses to clustered servers. Client workstations can still use DHCP to receive their IP addresses dynamically. Depending on the number of applications or virtual servers being supported on your cluster, you will likely end up needing quite a few IP addresses. This will need to be planned for when setting up your site's IP network configuration.
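A quick way to see how the addresses add up is to list every name that needs a static address: one per physical node, one for the cluster name, and one per virtual server. The Python sketch below does that with the standard `ipaddress` module. The subnet, the virtual server name `myclusterdb`, and the simple first-come allocation are assumptions for illustration; a real site would follow its own addressing plan and might also reserve addresses for a private interconnect, which is not counted here.

```python
import ipaddress


def plan_static_addresses(node_names, cluster_name, virtual_servers, subnet):
    """Assign one static address to each node, the cluster name, and each virtual server."""
    names = list(node_names) + [cluster_name] + list(virtual_servers)
    hosts = list(ipaddress.ip_network(subnet).hosts())
    if len(names) > len(hosts):
        raise ValueError("subnet too small for the planned cluster")
    return dict(zip(names, hosts))


plan = plan_static_addresses(
    node_names=["cluster_node1", "cluster_node2"],
    cluster_name="cluster",
    virtual_servers=["myclusterapps", "myclusterdb"],  # second name is hypothetical
    subnet="192.0.2.0/28",                             # documentation range, assumed
)

for name, address in plan.items():
    print(f"{name:15} {address}")
print("static addresses needed:", len(plan))
```

Even this two-node example with two virtual servers needs five static addresses, which is the planning burden the section warns about.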
Figure 8.6: Cluster server "virtual servers."
8.12 Load balancing

Load balancing has become very important in the industry today due to the rapid growth in e-commerce to support a worldwide market. Network load-balancing solutions are being offered today that attempt to balance incoming Web traffic loads across multiple Web servers. You will hear the media and some of these vendors refer to their products as clusters. By our definition, load-balancing solutions by themselves don't meet our requirements to be called true clusters. Granted, they are an important part of the total high-availability solution; by themselves, however, they do not provide a complete solution for data availability. That requires a highly available distributed data warehousing solution on the back end. When you put it all together (load balancing on the front end, distributed processing in the middle, and a highly available database farm on the back end), you end up with what we call a real cluster.
8.12.1 IP load-balancing solutions

Load-balancing solutions are good today only if they support the TCP/IP protocol, because that is what Web-based applications rely on. There are two approaches to IP load balancing. One is strictly a software-based solution. The software is loaded on all of the computers that will participate in load balancing. The software determines whether a computer is online or offline and also calculates the relative processing load on that particular computer. Based on that information, the inbound Web traffic is directed to the computer that is best able to service the request from a system-performance point of view.

The other overall approach to IP load balancing makes use of networking front-end equipment to handle the distribution of inbound network traffic, and it comes in two forms. The first is typically offered by traditional networking equipment vendors such as Cisco in the form of firmware (software) solutions that are sold as add-on options to network router-type boxes. The second, taken by nontraditional networking vendors, is a standalone hardware solution that is totally dedicated to the task of load balancing. The only potential problem with network hardware-based solutions is that they could become either a single point of failure or a choke point for network traffic. Microsoft makes that point in the marketing material that compares its Windows Load Balancing software product with other companies' hardware-based solutions.
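The selection rule described for the software-based approach (skip offline computers, then prefer the one with the lowest relative load) can be sketched as follows. This is a simplified illustration of the idea as the text states it, not the algorithm of any specific product; the host names and load figures are invented.

```python
def pick_host(hosts):
    """Return the online host with the lowest load.

    `hosts` maps a host name to a tuple (is_online, load), where load is a
    relative figure between 0.0 (idle) and 1.0 (saturated).
    """
    online = {name: load for name, (is_online, load) in hosts.items() if is_online}
    if not online:
        raise RuntimeError("no host is online")
    return min(online, key=online.get)


# Hypothetical farm: web3 is offline, so it is never chosen even though it is idle.
farm = {
    "web1": (True, 0.70),
    "web2": (True, 0.30),
    "web3": (False, 0.00),
}

print(pick_host(farm))
```

Filtering on availability before comparing load matters: an offline machine reports no load at all and would otherwise look like the best candidate.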
8.12.2 Windows Load Balancing Service

Windows Load Balancing Service (WLBS) is one of these software-based solutions. Figure 8.7 shows how a typical WLBS cluster might be configured. Microsoft acquired the WLBS technology from Valence Research, Inc. and has added it to its suite of add-ons for its Windows server product line. The WLBS product provides scalability and high availability of Web "services." In itself, it does not address high availability of the actual data on the back-end database servers. For that, you will need to consider using a traditional clustering solution to protect your databases on an as-needed basis. If the primary purpose of your Web application is only to deliver static content, WLBS will fit the bill all by itself, since your data is static and can be local to each host in the WLBS cluster.
Figure 8.7: Three-tier clustering using WLBS.

The WLBS software monitors inbound IP traffic that is destined for your Web server's virtual IP address. At the same time, it calculates the load that is present on each node in your WLBS cluster. It then dynamically distributes inbound sessions to the least-loaded node. It is also able to detect the failure of a node by sending "heartbeat" messages at a predetermined rate to all nodes in the cluster via the network communication link. WLBS makes use of a "virtual" IP address for the cluster. It manages the virtual IP address as well as the operation of the IP protocol stack on each Windows server in the WLBS cluster. It does this by inserting a filter or wedge between the NIC's device driver and the TCP/IP protocol stack. By doing this, it is able to direct which node processes a request for Web service.

WLBS is a LAN-based product that uses standard network protocols and hardware. That means that if you want to upgrade your existing Ethernet LAN to Gigabit Ethernet at some point in the future, WLBS can take advantage of the increased network performance without needing any upgrade to your system's hardware or software other than the NIC itself. When configuring a WLBS cluster, it would be wise to install a second NIC for the sole purpose of handling all of the WLBS administrative traffic that is generated.
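Heartbeat-based failure detection of the kind described above generally comes down to comparing the time since each node was last heard from against a deadline. The Python sketch below shows that comparison only. The one-second interval and the allowance of five missed heartbeats are illustrative assumptions, not documented WLBS settings, and the node names and timestamps are invented.

```python
def live_nodes(last_heartbeat, now, interval=1.0, missed_allowed=5):
    """Return the nodes whose most recent heartbeat is recent enough.

    `last_heartbeat` maps a node name to the time (in seconds) at which its
    last heartbeat was received. A node is presumed failed once more than
    `missed_allowed` heartbeat intervals have passed without hearing from it.
    """
    deadline = interval * missed_allowed
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen <= deadline
    )


# web2 was last heard from 7.5 seconds ago, which exceeds the 5-second deadline.
heartbeats = {"web1": 100.0, "web2": 93.0}

print(live_nodes(heartbeats, now=100.5))
```

Tolerating a few missed heartbeats before declaring a failure is a common design choice, since a single lost packet on a busy LAN should not take a healthy node out of the rotation.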
8.12.3 HyperFlow

The HyperFlow product from Holon Technologies is an example of a hardware-based solution that provides load balancing across a cluster of servers and ensures high availability of the clustered services. HyperFlow functions like WLBS in that it will distribute incoming network traffic to the least-loaded server. HyperFlow is capable of detecting and isolating a server that fails. Holon Tech chose to take a hardware approach to solving this problem. By performing 100 percent of the load-balancing functionality in hardware/firmware, the application server's CPU capacity is not affected at all. That means that more processing power is available to your applications. HyperFlow also utilizes the concept of a virtual IP address for the cluster of servers. This allows clients to be totally unaware of the fact that they are connecting to a cluster of servers. It also means that additional nodes can be added to the cluster without any action being required from client systems. The HyperFlow system works at Layer 3 (the network layer), which means that it can easily support all standard IP-based protocols.

Unlike WLBS, HyperFlow is not a distributed implementation of load balancing. Instead, a hardware box is installed between your cluster and the Internet/intranet. This architecture presents a potential for either a network bottleneck or a single point of failure. Fortunately, Holon's design allows you to connect two of its HyperFlow switches in an active/passive configuration. In a normal state, all of the network traffic flows through the active HyperFlow switch. The passive HyperFlow switch constantly monitors the health of the active switch. If a failure is detected, the passive HyperFlow switch takes over the role of the active switch. When the failed HyperFlow switch is repaired and comes back online, it will automatically configure itself as the passive switch. From a network performance point of view, it would certainly have been nice if the HyperFlow switches could balance network traffic load between themselves, but such is not the case.

Both Windows Load Balancing Service and HyperFlow are front-end solutions that deal strictly with inbound network traffic. By our definition, load-balancing solutions are considered to be components of a cluster and not clusters in themselves. In Figure 8.8, you see a typical configuration of a cluster using the HyperFlow technology.
Note that at each level in the diagram there is some form of redundancy. At the top, there are redundant paths to the Internet; under that, there are two HyperFlow switches; then the application servers that may be running IIS; and, finally, a cluster acting as the data warehouse. At the very bottom of the figure, you see that redundancy for the storage subsystem is provided by the SAN.
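The active/passive behavior of the two switches described earlier in this section can be captured as a tiny state model. The Python sketch below is a hypothetical illustration of the role changes only: the `SwitchPair` class and the switch names are invented, and nothing here reflects how the vendor actually implements health monitoring.

```python
class SwitchPair:
    """Toy model of two load-balancing switches in an active/passive pair."""

    def __init__(self, active, passive):
        self.active = active
        self.passive = passive

    def report_failure(self, name):
        """Promote the passive switch if the active one has failed."""
        if name == self.active:
            self.active, self.passive = self.passive, None
        elif name == self.passive:
            self.passive = None

    def report_repaired(self, name):
        """A repaired switch rejoins as the passive unit; it does not take the active role back."""
        if self.passive is None and name != self.active:
            self.passive = name


pair = SwitchPair(active="switch_a", passive="switch_b")

pair.report_failure("switch_a")
print(pair.active, pair.passive)   # switch_b carries traffic alone

pair.report_repaired("switch_a")
print(pair.active, pair.passive)   # switch_a returns as the passive unit
```

Having the repaired unit come back as passive avoids a second interruption of traffic, at the cost of never sharing load between the two switches.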
Figure 8.8: Typical multiple-tier HyperFlow configuration.
8.13 Redundant network hardware

We find that many people who are engaged in a discussion about high-availability systems focus on the server hardware. Typically, the discussions are centered on protecting data stored on disk, the CPU and motherboard, and RAM. It seems as though many people overlook the significance of the networking hardware. From statistics that have been published and from our personal experience, the network interface controller (NIC) has the highest probability of failure of any of the components that make up a computer. Our professional opinion (our guess) is that because the network cable that connects the NIC in the computer to the hub in the wiring closet can be as long as 100 meters, it might be picking up induced voltage spikes as it passes near sources of high magnetic fields within a building. Believe it or not, the steel beams within a building can be a source of significant electromagnetic fields as a result of electrical currents that might find their way into the building's frame structure. Other sources of high-voltage fields within an office building include the ballasts in fluorescent lighting and the motors used in blowers for heating and air conditioning systems. High-voltage spikes induced into LAN cabling from these sources can, over time, cause NICs and other network equipment to fail. Articles written on the effect of static discharge and high-voltage spikes indicate that the damage to integrated circuits occurs over a period of time. The failure is not instantaneous; rather, each time a high-voltage discharge occurs, a little more semiconductor material is eroded. Eventually the semiconductor will fail.
8.13.1 Multiple NICs

The easiest approach that you can take to protect your servers is to install multiple NICs. Don't be tempted to purchase a NIC adapter that has multiple ports on one card. At first, the idea that a multiple-port adapter will conserve valuable slots on the motherboard might seem logical. However, it might not make a lot of sense when you consider the other options available to you by using multiple adapters. To start with, you shouldn't put all your eggs in one basket, which is what you would be doing if you used a multiple-port Ethernet adapter. This is going to make even more sense if computer manufacturers ever decide to start shipping hot-swappable PCI interface cards. Then you will be able to replace and repair an individual NIC adapter without taking the whole computer down. The network-like architecture of InfiniBand should make it relatively easy to support plug-and-play capabilities in the future.
Table 8.1: Benefits of Dual Network Interface Controllers

Availability    Dual redundant network adapters
Scalability     Distributing network adapters across PCI controllers

Remember that if you are going to use multiple NICs, NBT (NetBIOS transport) can bind to only one network adapter at a time. If your cluster is to be used as a Web server or FTP server, then the NBT problem will not affect you. On the other hand, if you are going to offer file and print services on your cluster, then you are going to have problems, because Microsoft uses the NetBIOS transport protocol for file and print services. The folks at Adaptec have come up with a product that they call Duralink64 Failover, which is intended to solve this type of problem with NetBIOS. The Duralink software driver can manage multiple NICs in an active/standby configuration and can automatically switch network traffic to the standby NIC if the primary one fails. They also offer Duralink64 Port Aggregation, which allows up to 12 ports to be aggregated into one virtual network port on a server. This allows the network traffic to be distributed and balanced between all of the physical Ethernet network ports on the server. From the server's point of view, it is like getting Gigabit Ethernet bandwidth while using current Fast Ethernet NICs.
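The comparison with Gigabit Ethernet is simple arithmetic: Fast Ethernet runs at a nominal 100 Mbit/s per port, so twelve aggregated ports add up to a nominal 1,200 Mbit/s. The Python sketch below spells that out and also shows how the total degrades when ports fail. The function name is invented, and the figures are nominal line rates, an upper bound rather than measured throughput.

```python
FAST_ETHERNET_MBITS = 100  # nominal line rate of one Fast Ethernet port


def aggregate_bandwidth_mbits(ports, per_port_mbits=FAST_ETHERNET_MBITS, failed_ports=0):
    """Nominal combined bandwidth of the ports that are still working."""
    working = max(ports - failed_ports, 0)
    return working * per_port_mbits


print(aggregate_bandwidth_mbits(12))                  # all twelve ports healthy
print(aggregate_bandwidth_mbits(12, failed_ports=1))  # one port lost
```

The second call also illustrates the availability side of aggregation: losing one port reduces capacity but does not disconnect the server.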
8.13.2 Multiple NICs and load balancing

Another thing to consider is that by using multiple NICs you will be able to balance your I/O load across multiple PCI system buses. From a hardware bandwidth point of view, this will allow you to achieve the highest bandwidth your system's motherboard is able to deliver. In addition, if you go up a layer or two in the OSI model and look at network bandwidth from a software point of view, you can purchase third-party network software drivers that will load-balance server network traffic across NICs. The decision to purchase multiple NICs can potentially deliver a big payback for both availability and scalability. Be sure to review Chapter 6, where we discuss the benefits of designing and implementing a well-balanced I/O system.
8.14 Environmental considerations for network equipment

We have seen it happen repeatedly when a data processing facility is constructed. The computer room will have the best of everything (of course, we are exaggerating a little): air conditioning, a fire-prevention system, and an alarm system. The room will also be well lit and clean. But when you go down the hall and look into the communications closet, you will more than likely be staring into a small, crowded room with poor lighting, no air vents, and construction debris left over from when the building was built 10 years ago. You may have designed the best cluster that money can buy, but if there is a network failure and your users cannot connect to the cluster, you have just wasted a lot of time and money.
8.14.1 Power

When planning the facility for your high-availability environment, you will need to take the same precautions for your networking hardware that you took when designing the power system for your cluster servers. Routers and network hubs need to be protected from power failures just as servers do. That means installing UPS systems on dedicated circuits, preferably with two different circuits originating from two different power distribution panels (fuse panels). The UPS systems serve two purposes. First, in case of an AC power failure, your network equipment will continue to function until power is restored. Second, a UPS system can isolate your delicate and expensive equipment from the AC power line. This is not true for all UPS units, so you need to check with the manufacturer to be sure that the UPS system that you purchase provides this protection.
8.14.2 Air conditioning

Time and time again, we have seen sites do an excellent job of setting up environmental controls in their computer rooms while ignoring their wiring closets completely. Many new buildings have separate cooling systems for their data processing facilities that are controlled independently from other areas of the building, such as offices and hallways. Many new buildings today are being designed to be "smart buildings"; translated, this means that in the summer, the air conditioning system is either turned down or turned off completely as soon as employees leave for the day. This could mean that while the servers are sitting in a nicely air-conditioned room, the wiring closets throughout the building, filled with network hubs and routers, are baking in the heat overnight. We have always wondered why a building that is truly "smart" would not know to keep the air conditioning running in the wiring closet.
8.15 Change control

Usually when a network is first installed, someone has taken the time to do a design and produce a drawing. As time goes on, inevitably more and more network drops are added to the network. The network hubs will more than likely be upgraded every three to five years. Given today's job market, the individuals who designed and implemented your network may no longer be around. If a problem does occur and you have not kept up with the additions and changes to your network configuration, you will have a big problem on your hands. If you find yourself in this situation, you might end up having to pay someone to document your network configuration before you will be able to start diagnosing the problem. Documenting an existing network is not an easy task. It involves tracing all the wires and verifying every connection in your network, which is definitely not a fun thing to do.
Chapter 9: Cluster System Administration Overview

Ever since the general public accepted the usefulness of personal computers in the early 1980s, more and more businesses of all sizes have attempted to automate their business practices as fast as possible. The good news is that as long as the PCs stayed up and running, those businesses reaped the benefits of low-cost commodity data processing systems. The bad news is that PCs were originally designed as low-cost devices for small businesses or computer geeks who wanted to have their own computers at home. When the PC first hit the market, a typical user could best have been described as a frustrated mainframe user who did not like paying high fees for each microsecond the CPU was processing a job, especially since it seemed to always crash about halfway through the job anyway. The problem was that although IBM delivered on the promise of independence and affordability, the PC tended to put the user on his own computing island. Across the ocean, however, the corporate data was safely protected behind the famous "glass wall" in the corporate data center, which was never particularly noted for its accessibility.

The second age of personal computing, which we are now living in, can best be recognized by the extensive real-time networking of PCs to data all over the world. Employees worldwide are networked ("chained," as some people like to put it) to their corporate data centers from their desks, their home offices, and even the wireless Palm Pilots in their shirt pockets. The days are gone when a business owner would open for business in the morning and turn off the lights and lock the front door in the evening. Today businesses operate 7 days a week, 24 hours a day, and employees have little patience if they can't get to corporate data anytime they need to. A company could stand to lose a lot of money and the goodwill of its customers if it is not able to deliver service accurately and on time.
We have heard estimates of losses ranging from $10,000 to $5,000,000 per outage, depending on the type of application and business that goes down. The keys to success in deploying a high-availability system lie in the hands of its system administrators.
9.1 The importance of cluster administration

When IBM first started marketing the PC, TV ads showed silent film legend Charlie Chaplin unboxing and setting up his own PC, all by himself! In those days, that was quite different from what most people expected from a data processing giant like IBM, which traditionally sold computers that took a truck to deliver and cost millions of dollars. We have come a long way since DOS version 1.0. In the early 1980s, PC users would typically access corporate data using software implementations of dumb terminal emulators or IBM 3270 emulators. Those days are gone. Now PCs and PDAs are hooked to enterprise wired and wireless networks to allow instantaneous access to corporate data, through a very complex distributed client/server computing model, from anywhere and at any time of the day.

That image of Charlie Chaplin setting up a PC led many people to believe that there was nothing to it. Many more people believed that PCs would somehow save their companies huge amounts of money. After all, if Charlie could do it, anybody could. Right? Wrong, Charlie! Studies conducted by Microsoft and others showed that by 1993, 50 percent of the system outages studied were attributed to system management failures, not failures of hardware or software. We now realize that PCs require a lot more professional care and feeding than IBM's Charlie Chaplin TV commercials might have led you to believe. There are five main reasons why good system administration procedures are important:

1. PC systems tend to be built from low-cost commodity hardware.
2. Servers designed for high availability are generally very complex.
3. More functionality is added every 18 months to already feature-rich operating systems.
4. Client/server-based application software is challenging to administer.
5. Storage technology implementations such as RAID, SAN, and NAS are becoming increasingly complex.

A successful high-availability solution must address more than just hardware and software; it needs to include the development of well-thought-out procedures and policies for dealing with any type of system failure or problem. Preventive maintenance procedures are too often forgotten, even though they represent a very cheap insurance policy for protecting your business's data. In addition, it is very easy to get into a situation where you are always behind the eight ball and never find time to attend to your staff's continuing education and training needs. Many new technologies have been introduced recently to attempt to solve the high-availability problem, and many PC system administrators have not been exposed to them in the past. A technically up-to-date system administrator is key to achieving high availability. But (and that is a big "but") the most important thing that needs to be done first is to establish good processes and procedures for managing the data center. The PC environment needs the same discipline and management practices applied to it that have been used to manage large mainframe data centers.
9.2 Building a high-availability foundation
Clint Eastwood said in one of his movies, "A man's got to know his limitations." We believe that the same can be said of the system administrator of a cluster. In deciding how to build the best highly available computing environment, you must have a clear understanding of all the variables associated with the application services that are to be protected from failure. Working with your company's management, you must decide what priority will be placed on each application that will run in your data center. A baseline should be established for the minimum percentage of availability that your company is willing to live with.
The best approach is to do whatever you can to avoid a failure in the first place. If a server never fails, then you will never have to worry about restoring failed services. Technologies such as InfiniBand, RAID storage subsystems, and new RAID-based memory systems could very well result in "non-stop" hardware platforms in the future. The one thing that these three technologies have in common is that they can totally isolate hardware faults in a computer system. We all know that nothing is ever 100 percent perfect and that system failures and taxes are not likely to ever go away, so the next best thing is to develop a well-thought-out approach to minimize the impact on business operations when a failure does occur. To win at the high-availability game, you must seek out and find all single points of failure. You can also get ahead of the game by developing a plan for what to do when failures occur. Traditionally, the process of recovering from a system failure involved three steps, as shown in Figure 9.1.
Figure 9.1: The three steps to recover from system failure.
Before clustering and other high-availability solutions came on the scene, this process was manual and very time-consuming. With the advent of clustering software, this once manual process can now be done quickly and automatically. More importantly, the policies on how best to recover from a failure can be planned in advance by the system administrator in cooperation with company management. Once agreed on, your recovery plan is used to develop a failover procedure that is executed once a failure has been detected in the cluster. This emphasizes the importance of upfront planning and the need to ensure that your design is a complete solution for implementing a high-availability environment. By automating what was once the duty of a system manager, it is now almost possible to remove the human from the loop. This, more than anything, lessens the chance of human error and reduces the time needed to restore services to the users.
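When you negotiate the availability baseline mentioned earlier in this section, it can help to translate a percentage into hours of downtime per year. The following is a minimal sketch of that arithmetic; the function name and the sample targets are our own illustrative assumptions, not figures from any vendor.

```python
# Illustrative only: translate an availability baseline into allowed downtime.
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored for simplicity


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical targets chosen only to show how quickly the budget shrinks.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

Running the numbers this way makes the conversation with management concrete: each additional "nine" cuts the yearly downtime budget by a factor of ten.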
9.2.1 Cluster hardware certification
Anyone who has attempted to install new hardware and software drivers on a PC can attest that it can sometimes be downright frustrating. In fact, some of us have even gone so far as to praise Apple in our moments of frustration with the PC. There are just too many variables for the average person to deal with: motherboard BIOS versions, software driver version levels, interface adapter hardware revision levels, adapter BIOS versions, cables and connectors, etc. For example, during the initial beta of NT Clustering it was necessary to ensure that the BIOS version on the SCSI cards was at a specific level. This involved removing and installing an IC on the SCSI host bus adapter. Those are some of the things that you might have to deal with when setting up a cluster. But as we have already pointed out, when it comes to clustering, your "server" is now not just one computer but a "cluster of computers," and they all need to work together perfectly. These clustered computers share information and resources with each other in order to work together. In doing so, they require a fast and reliable communications bus between them. Getting two computers to talk to one another has traditionally been a challenge for many people, to say the least.
Microsoft wants to make clustering available to the mass market, and to meet that challenge, the company's designers realized that they had to make sure that customers would be able to purchase hardware that is already known to work
with the clustering software. Microsoft has developed two processes to help guarantee that its cluster software and the hardware that you purchase will work together. These two processes address customers' need for highly available Windows servers while being scaled so as to remain affordable. The first process is to certify any vendor's hardware platform that is intended to be sold for use with cluster systems before the vendor can advertise it as being compatible with Windows Cluster Service. This is an entry-level clustering solution for customers looking for something that is both affordable and capable of delivering high availability. Microsoft offers either Windows NT Enterprise Edition or Windows 2000 Advanced Server as low-end clustering products. Figure 9.2 shows how Microsoft's three clustering product offerings compare in cost.
Figure 9.2: Microsoft's cluster product positioning.
Customers needing the utmost in high availability would choose Microsoft's Datacenter "solution." Notice that we specifically called Datacenter a "solution." That is because Datacenter is much more than a box with a software CD inside. Datacenter starts with Microsoft's entry-level clustering products and builds on them by adding system management tools and a stringent hardware/software certification program. In Chapter 10 we talk some more about administrative processes, hardware and software quality testing and certifications, vendor support, and the quality controls that are needed to turn a clustering software product into a full data center solution. The Microsoft Datacenter solution is a combination of software, hardware, system integration, solution certification, and "one-phone-call" vendor support services. Datacenter isn't going to be cheap, but it will definitely deliver high availability for your company.
9.3 Cluster implementation options
Basically, you have three options available if you want to install Windows NT/2000 clusters. They range from the conservative to the more adventurous. We will point out the pros and cons and give you our opinion on the different approaches; after that, you can decide for yourself how much of a challenge you are up for. The three options are as follows:
• Preconfigured systems
• Server upgrades
• Build-your-own approach
9.3.1 Preconfigured systems
Preconfigured systems are a good choice for people who are not comfortable with configuring hardware or have too many other things they are responsible for. The vendors can save you a lot of time and frustration by preconfiguring all the hardware necessary for a Windows NT/2000 cluster and testing it for you before it is shipped. The advantage of this option is that the customer only has to unpack the system, plug it in, and turn it on. This is definitely the worry-free option.
One example of a "plug-and-play" cluster solution came from Data General Corporation. The company did all the hard work for you by building a complete two-node cluster into a single cabinet and preloading the operating system and clustering software. Data General appropriately named the product "Cluster in a Box." The hardest thing you will have to do is carry the empty shipping boxes to the dumpsters out the back door. What's really nice about the "Cluster in a Box" is that you can have your cluster up and running within an hour or so after you unpack the box.
Other vendors of Windows NT/2000 clusters simply ship you all the components you need to assemble the cluster. When the system arrives at your site, it is up to you to assemble it using the supplied cables, connectors, and system cabinets. This solution might not be quite as nice as the "cluster-in-a-box," but it is much better than having to go shopping for all the components needed to construct a cluster. One thing to consider if you decide to buy someone's "cluster-in-a-box" solution is that if you upgrade your cluster in the future, you will be faced with redeploying a rather special-purpose box as opposed to a general-purpose computer. In any case, the advantage of preconfigured systems is that someone else has taken the time to do the hard work of integrating the hardware components for you.
When you purchase a preconfigured system, you will know that it will work and that Microsoft has certified it.
9.3.2 Cluster upgrade kits
The server "upgrade kit" option is meant for people who already have a server system that is on the cluster hardware compatibility list and want to purchase an additional machine to form a cluster. Another scenario is that they already have two servers that they now want to cluster and just need the "cluster glue" to tie them together. This option is definitely for those who are comfortable working with hardware. Based on our experience, we would highly recommend that you take the "upgrade kit" approach as opposed to buying the components a la carte. Even though Windows NT Cluster Service uses COTS hardware, it is "specialized" COTS hardware, and it may be difficult to find PC vendors who know what you are asking for. The vendors that sell upgrade kits have done all the hard work for you. All the parts that you will need to connect your existing servers together are included in these kits. From our experience, the hardest parts to acquire are the SCSI cables, terminators, and "Y" adapters. Having to find these on your own could turn into a real project.
Not every vendor offers upgrade kits, and the ones that do will support only a limited number of configurations. It comes down to a cost tradeoff as to whether it makes sense to support some of the older server hardware still in use. If an upgrade kit is not available for your server model and you have decided that you need to cluster these systems, your only choice will be to acquire all the necessary components on your own. The best advice that we can give you is to talk to vendors who sell minicomputers or high-end workstations, because they have been using and selling clustering solutions a lot longer than the new breed of PC vendors. The type of SCSI cables and connectors used in Windows clustering has been commonly used in minicomputer clustering for some time.
As more companies start purchasing clusters, more and more people will become familiar with these technologies. As new technologies such as Fibre Channel, ServerNET, SANs, NAS, and InfiniBand become widely used and competitively priced, configuring Windows clusters will become a lot simpler and more affordable.
9.3.3 The build-your-own approach
The build-your-own approach for an MSCS cluster is really only for those who feel they don't have enough excitement in their lives or who have existing servers that they need to cluster but for which no cluster upgrade kits are available. While it is entirely possible to build your own cluster from components that you shop around for, we are not convinced that it is the most prudent approach. First, you have to locate all the components. We know that Microsoft, Compaq, and others claim that they are using COTS hardware. What they did not tell you was whose shelf this "commercial off-the-shelf" hardware comes from. From our experience, if you are very determined, are pretty good at searching the Web, and have a lot of patience, then you might be successful at building your own cluster. Then the big trick will be getting Microsoft to support your configuration. In general, we do not recommend the home-built approach to cluster system hardware. If you are the type of person who likes to build things, we would suggest that it might be a more rewarding experience to purchase some LEGOs and build LEGO robots with your kids.
The build-your-own approach will work just fine for software-only high-availability solutions from vendors such as Legato or Marathon, to name a few. These vendors either don't rely on special hardware at all or, in the case of Marathon, supply the custom hardware that you need. When installing a Marathon cluster, you simply install the firm's adapters into any server that you have on hand. The hardware used for Marathon servers can be any industry-standard PC hardware. Even easier is a solution like Octopus from Legato. This is a software-only solution that installs like any other software application onto standard server hardware.
9.4 Installation, test, and burn-in
If an electronic component is going to fail, it usually does so within the first week of operation. It is true that most, if not all, of the components used to build a server have been tested at some point during the manufacturing process. Component-level tests are used to detect hard failures of a chip or to grade a batch of chips for specific speeds. After individual components are assembled on printed circuit boards, they are tested again as part of that particular motherboard or interface adapter. Given the quantities in which high-volume commodity computers are produced, there is not much time in the manufacturing process for extended burn-in testing. When a vendor is trying to ship thousands of systems a day, about the most that you can hope for is the old "smoke test." In other words, if you don't see smoke coming out of the case when you turn it on, you put it in a box and ship it. Responsibility for good-quality burn-in testing is therefore shifting to the end user.
It is a good idea to schedule at least a couple of days, or better yet a week, of burn-in time in your system deployment schedule. If the system can get through its first week, it will usually run for a very long time before it fails. Since the probability of failure is fairly high during the first few days the system is powered on, you definitely do not want the system to be used in a production environment before the burn-in cycle has been completed. Server burn-in time does not have to be nonproductive time. One approach that you can take is to assemble your hardware and then install the Windows NT/2000 operating system. Let that run overnight, and then begin installing the remaining pieces of the operating system and patches as required during the next day or two. More than likely a large server installation will take more than one day to complete.
The time it takes to install and configure all the software on a server all counts toward the burn−in time. The installation process will also exercise the system components and hopefully reveal any problems that might exist in your system.
9.4.1 Documenting your cluster system
As with any network server, your cluster system's configuration should be well documented. Having good documentation of your system configuration is even more important for computer systems that are used to provide high-availability services. The chore of documenting a computer installation tends to be very low in priority at many companies these days. Usually the excuse is that documentation will be put off until after things settle down with the installation, which never seems to happen in the real world. Many system managers will agree that they typically go from one fire drill to another and rarely have the luxury to go back and do the paperwork.
9.4.2 Why document your system?
A simple answer to the question "why should I document my system?" would be that it makes good business sense. But it will probably take a little more convincing than that to get all of the system administrators to commit to producing good documentation. In fact, government regulatory bodies (e.g., the Food and Drug Administration) and others now require complete documentation for manufacturing systems that could affect human health and well-being. In addition, industrial standards organizations (e.g., the International Organization for Standardization (ISO) and the Software Engineering Institute (SEI)) require companies wishing to receive certification to meet stringent standards for designing and implementing standard processes. Further, these organizations require that all processes used by an organization be well documented and maintained.
One of the best reasons for producing good documentation is to ensure that it is easy to transition system management responsibility between employees. This is especially true if you want to
• Maintain a consistent level of service to the customer.
• Train others.
• Go on vacation.
Many companies and organizations will have the option of determining what level of documentation is appropriate for their site. For some industries, such as pharmaceuticals, government regulatory bodies dictate the level of documentation they require. The following is a list of typical documentation maintained for data processing systems providing critical services to end users or driving process-control systems.
• System drawings/schematics
• Hardware and software inventories
• Service level definition (SLD)
• Service level agreements (SLA)
• Standard operating procedures (SOP)
• Change control: history of what has been done in the past
• On-call guides
• Help desk/trouble-tracking system
9.4.3 Hardware diagnostic procedures for a cluster
One day while we were writing this book, a light went on in our heads as we tried to diagnose a server that had gone from crashing every day to crashing every hour. The situation was made even more frustrating by the fact that the server was supporting a large proposal effort that employed over 75 people and was on a very short fuse. We did not have the luxury of just shutting down the server while we waited "on hold" for the person from tech support to help us diagnose the problem. He thought that was reasonable, but our management did not. The users demanded that the server be rebooted immediately so they could continue to print out their proposal, knowing full well that there was a 95 percent probability that the system would crash again within the hour. This situation was as frustrating for our users as it was for us, especially when the "tech support" person would ask us to read the error codes off the "blue screen" and we would have to tell him that we had already rebooted so our users could get their documents printed. After dealing with this problem for a few days, we realized how nice it would have been if this server had been clustered, even though we were only using it for file and print services.
From an administrator's point of view, clustering makes a handy tool for diagnosing server hardware that fails in a somewhat random fashion, as in our example. As an administrator you will quickly appreciate the failover and failback features available with clustering. When a server crashes, the services that were running on that server are restarted on another server in the cluster. As administrator, you can determine which services fail over. More importantly, from a diagnostic point of view, you can also determine the cluster policies for failback once the failed server rejoins the cluster. There are a few options to pick from, depending on your priorities. First, you can decide what fails over and what dies with the server. This option is important to consider if there is not enough horsepower available on the remaining nodes in the cluster. Another option that is important from a diagnostics point of view is what you want the cluster to do once the failed cluster node is returned to full operation.
You have the option of having the cluster move the resource groups back to their default nodes as soon as the failed cluster node is back online. That option would not have been a good choice in our example, because even though we brought our failed computer back online after trying a few fixes, it continued to fail. In our case, a better choice would have been to leave the services running on the server that they failed over to while the system in question was checked out. If we had had a cluster installed at the time, we could have left the "blue screen" up on the monitor to make the telephone support guy happy. After repairs are made to a failed cluster node and it is determined that the system is again stable, the cluster administrator can manually fail services back to their default hosts. Finally, it is also possible to schedule failback to occur at off-peak times to minimize the impact on users during peak processing times.
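The three failback choices just described (immediate, manual, and scheduled for off-peak hours) can be summarized as a small decision function. The sketch below is only a model of the reasoning; the class names, the default off-peak window, and the `admin_approved` flag are hypothetical and are not part of the MSCS API.

```python
from dataclasses import dataclass
from datetime import time
from enum import Enum


class FailbackMode(Enum):
    IMMEDIATE = "immediate"  # move groups back as soon as the node rejoins
    MANUAL = "manual"        # wait for the administrator to move them
    SCHEDULED = "scheduled"  # move groups back only inside an off-peak window


@dataclass
class FailbackPolicy:
    mode: FailbackMode
    window_start: time = time(1, 0)  # hypothetical off-peak window start (01:00)
    window_end: time = time(5, 0)    # hypothetical off-peak window end (05:00)

    def in_window(self, now: time) -> bool:
        """True if `now` falls inside the off-peak window (may cross midnight)."""
        if self.window_start <= self.window_end:
            return self.window_start <= now < self.window_end
        return now >= self.window_start or now < self.window_end

    def should_fail_back(self, node_online: bool, now: time,
                         admin_approved: bool = False) -> bool:
        """Decide whether resource groups should return to their default node."""
        if not node_online:
            return False  # nothing to fail back to
        if self.mode is FailbackMode.IMMEDIATE:
            return True
        if self.mode is FailbackMode.MANUAL:
            return admin_approved
        return self.in_window(now)  # SCHEDULED


if __name__ == "__main__":
    policy = FailbackPolicy(FailbackMode.MANUAL)
    # A flaky node has rejoined, but the administrator has not yet signed off.
    print(policy.should_fail_back(node_online=True, now=time(14, 0)))
```

For the crashing server in our example, the manual mode is the one that would have kept services on the healthy node until the repairs were verified.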
9.4.4 Remote system management
A very nice capability to have on servers located at a remote site, or even distributed around a campus, is an out-of-band remote management system. An out-of-band management capability gives you complete remote access to your servers even if the enterprise network is down and the server is hung. This is accomplished by using an intelligent controller card and a network that are used only to monitor the operational status of the server, as shown in Figure 9.3. Status monitoring and control of the server are possible over either a LAN connection or, in the case of remote sites, a dial-up modem connection. The level of access we are talking about here is at the hardware level. For example, a remote operator could force the system power supply to turn off and then back on to force a full reset and reboot. Remote management at the operating system level is possible with Windows Terminal Server, which allows the administrator to log on to the server remotely. A combination of an out-of-band management system and Terminal Server's remote console capability gives you centralized control of all of the servers in your company.
Figure 9.3: Out−of−band remote management.
9.4.5 Verifying cluster hardware capacity
One thing that is important to remember when selecting a server is that as your system grows over time, you are going to need option slots on the motherboard to add interfaces. Desktop or low-end server boxes do not have enough PCI slots to handle the number of cards that you will need to install. A minimum configuration for a cluster node typically requires at least five PCI slots. The cluster requires two separate network adapters, one for the enterprise connection and one for the private cluster interconnect. Then you need two more slots for the local SCSI host adapter and the shared SCSI bus adapter. Finally, one more slot may be needed for a video adapter. Low-end motherboards sometimes have only three or four PCI slots. The requirement for a generous supply of PCI slots puts you into a high-end server class of machines.
If you feel comfortable talking to vendors about chips and buses, you should ask your vendors to go into more detail about their motherboard architecture. We mentioned earlier that there is not much room for hardware vendors to be creative when it comes to motherboard design. But the one area in which they can make improvements is the system bus. The performance and capacity of the system buses on your motherboard will have considerable impact on your server's performance. For one thing, you will need plenty of PCI slots in a cluster server. This is especially true for systems that will need to have a large number of disks attached to them. You can purchase high-density I/O adapters for SCSI and Ethernet, thereby achieving more I/O channels per PCI slot, but by using high-density cards, you might be putting too many of your eggs in one basket.
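The slot count above is simple arithmetic, but writing it down as a budget makes it easy to substitute your own adapter list when comparing motherboards. This is a minimal sketch; the adapter names and the one-slot-per-adapter rule are illustrative assumptions based on the minimum configuration described in the text.

```python
# Hypothetical adapter list for one cluster node; the counts follow the text,
# but the names and structure are illustrative assumptions.
REQUIRED_ADAPTERS = {
    "enterprise network adapter": 1,
    "private cluster interconnect adapter": 1,
    "local SCSI host adapter": 1,
    "shared SCSI bus adapter": 1,
    "video adapter": 1,
}


def slots_required(adapters: dict) -> int:
    """Total PCI slots consumed, assuming one slot per adapter."""
    return sum(adapters.values())


def spare_slots(available_slots: int, adapters: dict = REQUIRED_ADAPTERS) -> int:
    """Slots left for growth; a negative result means the board is too small."""
    return available_slots - slots_required(adapters)


if __name__ == "__main__":
    for board_slots in (3, 4, 5, 7):
        print(f"{board_slots} slots -> {spare_slots(board_slots)} spare")
```

A board that comes out at zero spare slots meets the minimum but leaves no room for the growth discussed at the start of this section.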
9.5 Planning system capacity in a cluster
There are basically two configurations that Windows NT/2000 Server inherently knows how to tune itself for:
• Application server
• File and print server
These two basic roles that NT/2000 can assume typically experience different hardware bottleneck characteristics. For example, a Windows NT/2000 server that is used to provide database services such as SQL Server will typically gobble up all the CPU horsepower and memory that you give it. On the other hand, a server that is used just for file and print services will put a strain on the I/O subsystem and the network adapter. Memory might also be an issue, depending on the amount needed to cache I/O requests.
The process of determining the correct size for your cluster nodes is similar to that of sizing network servers, but there are a few additional things to take into account when planning for cluster failover. In planning a single server, most system managers plan for their average CPU load and factor in future unplanned application growth. On the government contracts we have worked on, customers usually specified 50 percent as the required reserve CPU capacity. The idea is to plan your hardware purchase so that you don't outgrow it eight months after it is installed. Besides being very embarrassing, that won't do much for your career advancement.
So let's work through some calculations; the numbers just might surprise you. To start with, let's agree that when we purchase a new server we should allow for a 50 percent reserve. The logic is to allow for new applications that will likely be added over the life cycle of the system. In addition, the 50 percent reserve can come in handy for peak loads. You might be thinking that this is going to require a pretty large server, and wondering whether we really need to allow for 50 percent growth. To answer this question, consider what happens to CPU loading when we decide to build a cluster. Your first reaction might be that with two machines we would get twice the amount of processing power. But the whole reason we are talking about clusters in the first place is that we have applications that need to stay running because they make lots of money for our companies. So if one node of our cluster dies, the other node must be able to pick up the processing load of the failed node in addition to handling its own load. This means that the surviving node is now at 100 percent capacity (50 percent from the failed node plus its own 50 percent load).
If you are thinking this wastes 50 percent of your investment in that $30,000 server, consider that other clustering solutions use one cluster node as a "hot stand-by" (active/passive). In a "hot stand-by" configuration, the stand-by machine just sits idle waiting for the first machine to fail. At least with an active/active cluster solution, both cluster nodes are actively providing services to users. This may work fine for the first six months, but what happens when the new server you purchased with a 50 percent reserve is loaded down with new applications and now has only 25 percent reserve CPU horsepower? Let's run those calculations again: a 75 percent load from node 1 plus a 75 percent load from node 2 equals 150 percent. That means the surviving node will need to work at 150 percent of its designed capacity. What this means in technical terms is that the system will be slow as molasses until the node that is offline is returned to service. We don't think this is the answer your management wants to hear.
There are options available that you can use to manage loading, thanks to the way MSCS handles failover. MSCS allows you to decide which services fail over and when they fail back. As the administrator you can set the rules for cluster failover that best meet the business requirements of your company. You have the option to fail over only some applications and leave noncritical applications offline until the failed node comes back online. This allows you to ensure that mission-critical applications will run at a desired performance level; at the same time, you won't break the bank by buying extra reserve capacity on your servers. Also, remember that this discussion will get even more complicated when Microsoft releases its "n" node cluster. Currently, Datacenter can support four clustered computers. As the number of supported nodes in a cluster increases, the failover policies also become a lot harder to configure and manage.
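The load arithmetic above, and the idea of failing over only the critical applications, can be sketched in a few lines. This is a simplified model for a two-node cluster; the `App` record, the application names, and their CPU figures are hypothetical, and real capacity planning would rely on measured loads rather than fixed percentages.

```python
from __future__ import annotations

from typing import NamedTuple


class App(NamedTuple):
    name: str
    cpu_percent: float  # share of one node's CPU this application uses
    critical: bool      # True if the business needs it to fail over


def load_after_failover(survivor_load: float, failed_node_load: float) -> float:
    """Load on the surviving node of a two-node cluster if everything fails over."""
    return survivor_load + failed_node_load


def apps_to_fail_over(apps: list[App], survivor_load: float,
                      limit: float = 100.0) -> list[App]:
    """Pick applications from the failed node, critical ones first, that fit
    under `limit` percent CPU on the survivor; the rest stay offline."""
    chosen = []
    load = survivor_load
    # sorted() is stable, so critical apps come first in their original order.
    for app in sorted(apps, key=lambda a: not a.critical):
        if load + app.cpu_percent <= limit:
            chosen.append(app)
            load += app.cpu_percent
    return chosen


if __name__ == "__main__":
    print(load_after_failover(50, 50))  # the 50 percent reserve case
    print(load_after_failover(75, 75))  # the overloaded case from the text
    failed_node_apps = [
        App("file-print", 30, False),
        App("order-db", 25, True),
        App("reporting", 20, False),
    ]
    for app in apps_to_fail_over(failed_node_apps, survivor_load=75):
        print("fail over:", app.name)
```

With the survivor already at 75 percent, only the critical application fits, which is exactly the kind of policy decision that should be agreed on with management before a failure occurs.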
9.5.1 Symmetric multiprocessing (SMP) for scalability
Symmetric multiprocessor (SMP)-based servers should be high on your shopping list for use as cluster server nodes. Windows NT has inherent support for SMP hardware, and it is the only real option you have at this time to address the issue of scalability. First, we should define what we mean by a multiprocessor system. Generally the term multiprocessing refers to a computer system that uses more than one processor. These processors, under control of an operating system, work together to share the processing load of common computing tasks. It is generally assumed that when we refer to an SMP computer we are talking about multiple processors contained within a single cabinet, usually on the same motherboard or at least plugged into the same backplane. Another requirement for a multiprocessor system is that the processors should be working in a cooperative manner.
The easiest part of building a multiprocessor system is designing the hardware. The really hard part is designing the software that divides up the processing tasks and then allocates each processor its share of the processing load. The difference between the theoretical scalability and what is actually achievable is directly related to how good the operating system is at managing its tasks across multiple processors. Another point is that the applications you want to run on a multiprocessing system also need to understand how to take advantage of the resources available to them in this environment.
Since symmetric multiprocessing (SMP) is the best option you have today for achieving scalability through scaling up, you need a strategy to rely on until MSCS and the applications you are using are capable of scaling out using cluster-aware software. As we have already discussed, Windows and MSCS initially address only the availability of resources and do not address how to scale those resources as your processing demands grow. So the real question is, "What hardware options do I have for scaling up my servers to meet increasing user demands while I wait for Microsoft to release a new version of Cluster Service that addresses scalability and for software vendors to take advantage of the APIs in MSCS so that their applications are fully cluster aware?" Figure 9.4 shows the two approaches that are used to increase the capacity of a computing system. Today, the easiest path available for increasing the processing capacity available to an application is to scale up. The term "scale up" means to add processors to an existing computer chassis. Unfortunately, this approach does not address availability, since by adding processors to an existing chassis, you are in effect putting "all of your eggs into one basket."
The other option that is possible with clustering is to "scale out" by loosely coupling "whole computers" in a clustered configuration. If additional processing power is needed in a "scale out" configuration, additional computers are simply rolled up and added into the existing cluster.
Figure 9.4: Scaling up vs. scaling out.
While we are waiting for those software updates to be delivered, users are demanding more and more from our servers. They are getting tired of hearing "just wait for the next release, it will fix that." Even though SMP might not be a 100 percent cure for performance issues today, it still makes sense to invest in servers that are capable of supporting multiple processors as options. The line has been drawn in the sand for software vendors that hope to be able to sell into the "enterprise" market. Vendors that want to be successful in the high-end cluster environment will need to write their applications to take advantage of SMP hardware. At the same time, these applications will need to support clustering functionality that allows them to scale out, distributing their processing over multiple SMP-based cluster nodes. OEMs have always had the option of supporting more processor configurations, but they are required to supply their customers with a customized version of Windows NT/2000 that has been modified and tested to support their specific hardware configurations.
Before you get too excited over the possibilities of SMP-configured servers, look at Figure 9.5 and become familiar with some of the technical issues that may limit the ability of SMP to deliver scalability. As you can see in Figure 9.5, when you need more processing power in an SMP configuration, you just keep plugging in additional processors. In theory, this looks like an easy and straightforward solution to our problems. But if it sounds too good to be true, that's probably the case. One of the first things that you will notice as you look at the diagram is that the data paths from processors to memory and from processors to I/O devices will quickly become bottlenecks as additional processors are added. Eventually, it will be a problem getting data to and from the LAN fast enough to feed those super-fast processors that are sitting there executing NOPs waiting for data to process. Practical experience with SMP configurations tells us that the processing power of a server does not grow linearly based on the number of processors.
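One common way to reason about this less-than-linear growth is Amdahl's law. It is not part of the discussion above, but it is a standard model: if only a fraction p of the work can be spread across processors, then n processors give a speedup of 1 / ((1 - p) + p / n). The sketch below is illustrative only. The function name is our own, and the model ignores bus contention and operating system overhead, which make matters worse in practice.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Ideal speedup under Amdahl's law.

    parallel_fraction is the share of the work (0.0 to 1.0) that can be
    divided among processors; the rest runs serially no matter how many
    processors are installed.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

Calling `amdahl_speedup(0.9, 8)` with an assumed parallel fraction of 0.9 (a value chosen only for illustration, not a measurement of any real server) returns well under 8, which is the point of the paragraph above.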
Figure 9.5: Using SMP to scale out.
There are a number of factors at play here. First, the more processors that are added to the processor bus, the more contention there is for access to data on the bus. Second, the more processors that are in the system, the more overhead the operating system has trying to coordinate the tasks that each processor is working on. Finally, the software applications themselves must be written to behave correctly on an SMP system. Reports in the media indicate that quite a few of the applications on the market today don't behave very well when it comes to sharing resources with other applications running on the same server. If that is the situation with the applications that you have decided to use, then you will not see any great benefit in using SMP systems. In fact, it has been reported in the press that some of the software out there is written so poorly that Windows NT just can't allocate tasks across CPUs in an SMP system. In that situation, you would be wasting your money if you bought a second CPU. We don't think this situation is a static one. It would be reasonable to assume that these problem applications will be updated soon to behave correctly in an SMP environment.
The problems that we just discussed in connection with SMP designs are being addressed by replacing the system bus with a "switch" architecture. The advantage of using a switch rather than a bus to interconnect the CPU, memory, and I/O subsystems is that a switch supports simultaneous non-blocking communications between each subsystem on the motherboard. This feature will dramatically increase the throughput capacity of the motherboard. In Figure 9.6 we illustrate this by showing a switch instead of a bus in the center of the system design.
Figure 9.6: Switch-based system architecture.
The real solution for scalability is the combination of clustering and SMP. When you also factor in cluster-aware applications, you will end up with quite an impressive system. Since the first release of Microsoft's Cluster Service really addresses only availability issues and not scalability issues, SMP-configured cluster nodes will give you some scalability options. On the other hand, clustering allows you to combine a number of SMP servers together, each server having its own independent I/O bus. Now you can minimize the bottleneck effect of the I/O bus. Cluster-aware applications that run across the cluster can process client requests coming into the system through any member node, thereby allowing you to distribute the I/O load across the multiple I/O buses contained in each node of the cluster. In the short term, we will have to rely on SMP to give us scalability, but in the long run, the applications we use must be able to scale across the cluster to deliver real scalability. These applications must be written to be aware of and take advantage of services that Cluster Service offers by means of cluster APIs.
9.6 Administering applications in a clustered environment Administering applications in a cluster takes a little more work than it would with a single standalone server. There are quite a few options that the cluster administrator has when setting up applications on a cluster that don't come into play on a single standalone server. Applications and services can and will be moved and restarted as needed by Cluster Service to maintain services to the end user. For this to function correctly, it requires particular attention to the details involved in setting up applications in a clustered environment.
9.6.1 Identifying cluster−aware applications Since clustering in the PC world is a relatively new concept, it will take some time for software developers to learn the intricacies of programming to the cluster APIs before they will be able to release a clustered version of their software. Realizing the full benefit of clustering will probably require them to rethink how their applications should work in a clustered environment. In order for an application to be fully cluster aware, it must register itself with Cluster Service and use the cluster APIs to report its status to the Resource Monitor. Don't hold your breath waiting for clustered versions of your favorite applications. First, there must be a major market demand before vendors will start working on a cluster version. In the meantime, you will have to make the best of things. Just because an application is not cluster aware does not mean that it can't benefit from Cluster Service. Microsoft addresses this issue by including generic resource DLLs to provide cluster support for applications that do not have custom resource DLLs. There are a few basic issues that you need to consider to determine whether your applications will work in a clustering environment. These are as follows: • The network transport used by the application • How the application keeps track of its operational state • The application's utilization of data storage For some customers, it might be an issue that Cluster Service supports only the TCP/IP protocol stack. This could be a concern for sites still using protocols such as DECnet, NetBEUI, or IPX. We have seen many customers who use these protocols as a holdover from legacy applications and installations. The comment we hear time and time again from these customers is "if it ain't broke, we don't try to fix it." Possibly, they have a good point. 
Even though Cluster Service works only with the TCP/IP low−level network protocol, that does not mean it won't support higher−level protocols that ride on IP such as RPCs, NetBIOS, DCOM, or Named Pipes. These application−level protocols simply use TCP/IP as their transport. So as long as there is a TCP/IP path between the client and the server these application−layer protocols will just go along for the ride.
9.6.2 Licensing applications in a cluster
Because of the newness of clustering in the PC marketplace, the corporate attorneys and product marketing managers that work for suppliers of enterprise software have not adapted to the changing software license model that the minicomputer world has already addressed. The major problem with Cluster Service today is that it does not treat software licenses as cluster resources. We are not sure that this is as much of a technical issue as it is a legal one. The licensing of software applications for use in a cluster is exactly the same as with a standalone server. That in a nutshell is where the problem is. A cluster of servers is the system, not the individual nodes that make up the cluster. But today, you are still required to license the software applications on each node in the cluster that might host the application in the event of a failure or administrative failover. Why isn't the software license itself treated as a cluster resource and failed over with the rest of the dependent resources contained in the group or virtual server? That outcome certainly makes sense to us, but it would mean fewer licenses sold, which is why we think this feature has not yet been implemented. If an instance of the application can be running on only one node in the cluster today, as is the case with Cluster Service, then why should a user be required to pay for two licenses when it is only possible to use one license at a time?
9.7 Administering cluster failover groups Cluster Service requires that resources within a cluster that have interdependencies be organized into resource groups. Groups as a whole will failover, as opposed to individual applications or services. As the system administrator, you will need to start with an application and then list all of the resources that are required for that application to run. For example, a Web server requires at a minimum an IP address and a disk to store information on. In addition, it might require a database server. These items are required for the Web server to function and are referred to as resource dependencies. The Web server application and the resources that it depends on are documented by developing a resource dependency tree. The dependency tree will prove to be an invaluable aid to the system administrator in documenting and designing failover groups.
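A dependency tree of this kind can be written down as a simple mapping and then sorted so that every resource comes after the resources it depends on. The following is a minimal sketch in Python, not anything Cluster Service itself provides. The resource names are hypothetical, and the assumption that the database server depends on the disk is ours, added only to make the example more than one level deep.

```python
from graphlib import TopologicalSorter

# Hypothetical resource names for the Web server example in the text.
# Each key depends on the resources listed in its set.
WEB_GROUP = {
    "web_server": {"ip_address", "disk", "database_server"},
    "database_server": {"disk"},  # assumption made for illustration
    "ip_address": set(),
    "disk": set(),
}


def online_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which resources can be brought online.

    Every resource appears after all of the resources it depends on.
    graphlib raises CycleError if the dependencies form a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Reversing the returned list gives a safe order for taking the same resources offline.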
9.7.1 Determining a preferred node for a group
Once the number of required groups is determined, the next step is to assign each group to a default or preferred node. The preferred node in a cluster can also be thought of as the default node that an application should be started on if everything is normal in the cluster. If a failover event occurs, the group can be configured to fail back to its preferred node upon recovery of the failing resource. Cluster Service will reassign the group back to its preferred node either immediately after recovery or at a later time, as determined by the cluster administrator. Determining the preferred nodes for the different groups in a cluster depends for the most part on the computing capacity of the individual nodes in the cluster. To determine the preferred node for a group, you should analyze the types of applications in the groups that will be hosted on a particular server. The applications should be verified to be compatible with each other on the same server. In addition, the server must be sized such that it can adequately support its default applications as well as any other applications that may fail over to it in the case of a failover event. There are applications that are known to have problems coexisting with other applications on the same server. It is important to take this into consideration when determining the preferred node for a group.
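The sizing rule above can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: the `Group` class, the node names, and the "load units" are hypothetical, and the worst case is taken to be every group landing on one node, which is what happens in a two-node cluster when one node fails.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    load: int            # hypothetical load units (for example, percent of one server)
    preferred_node: str


def sizing_report(capacity: dict[str, int], groups: list[Group]) -> dict[str, dict]:
    """Compare each node's capacity with its normal and worst-case load.

    Assumption: the worst case is every group in the cluster running on
    a single node, as in a two-node cluster after one node fails.
    """
    total_load = sum(g.load for g in groups)
    report = {}
    for node, node_capacity in capacity.items():
        normal_load = sum(g.load for g in groups if g.preferred_node == node)
        report[node] = {
            "normal_load": normal_load,
            "worst_case_load": total_load,
            "fits": total_load <= node_capacity,
        }
    return report
```

A node whose entry has `fits` set to `False` can run its own groups but cannot absorb the rest of the cluster's work after a failover.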
9.7.2 Determining resource dependencies in a group
Cluster resources
From a cluster point of view, a cluster resource can be a software application, a physical device, or a logical entity. The following list names typical resources that are usually found in a cluster:
• Disk drives/RAID arrays
• IP address
• Server computer name
• Printer queues
• File shares
• Generic applications
• Generic services
• IIS virtual root
• Message queue server
• Distributed transaction coordinator
• DHCP server
• Time service
9.8 Administering virtual servers With Cluster Service, the whole concept of a "server" is much different from what we are used to. Traditionally, a server was thought of as a computer that has large disk drives, a lot of memory, and a fast processor. We also configure it so that it has an IP address and computer name that make it unique. All of that changes when you set up Microsoft Cluster Service. From now on, when we say "server," we could be talking about the computer hardware box, or we might be referring to the new concept of a "virtual server." A virtual server has all of the logical attributes of a physical server except that they are not tied to any particular piece of hardware. Just as we assigned IP addresses in the past to our physical servers, we will be assigning IP addresses and computer names to virtual servers. Similarly, virtual servers also have physical resources that they need in order to be functional (e.g., disk drives). Unlike the disk drives attached to a physical server, the disk services assigned to a virtual server can follow the virtual server as it moves from one physical box to another in the cluster. As an administrator of a Microsoft cluster, it will be your job to set up and configure the virtual servers that are running in your cluster. In setting up the virtual servers, you will need to consider the number of applications that will be running on your cluster. There are many possible configurations, as you will see once you take into account the applications and their associated resources. One possible configuration that you may decide to settle on is to put each application in its own virtual server. This configuration gives you the ability to manage each application individually in the cluster. If you are going to set things up that way, remember that if two of your applications use a common resource, both of these applications and their common resource will reside in the same virtual server. 
The reason for this is that all of the resources in a virtual server must failover together, because a resource such as a disk can be accessed by only one physical server at a time. Since it is possible for virtual servers to failover independently of one another, it would be impossible to have an application running on one physical server while its disk was still assigned to another physical server.
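The rule that applications sharing a resource must live in the same virtual server can be applied mechanically when planning a cluster. The following is a minimal planning sketch, not a Cluster Service feature. The function name and the application and resource names used with it are hypothetical.

```python
def plan_virtual_servers(app_resources: dict[str, set[str]]) -> list[set[str]]:
    """Group applications so that any two applications that share a
    resource (directly or through a chain of shared resources) end up
    in the same virtual server."""
    groups: list[tuple[set[str], set[str]]] = []  # (applications, resources)
    for app, resources in app_resources.items():
        apps, used = {app}, set(resources)
        remaining = []
        for other_apps, other_used in groups:
            if used & other_used:
                # Shared resource found: merge the two groups.
                apps |= other_apps
                used |= other_used
            else:
                remaining.append((other_apps, other_used))
        remaining.append((apps, used))
        groups = remaining
    return [apps for apps, _ in groups]
```

For example, if a hypothetical "web" and "db" application both list "disk1" while "mail" uses only "disk2", the function returns two virtual servers: one holding web and db, and one holding mail.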
9.8.1 Cluster alias name You can think of the cluster alias name and a server's computer name as being the same. Microsoft depends on network computer names for allowing users to locate services on a network. The cluster alias name is a virtual computer name that identifies a cluster of servers. The cluster alias is not tied to one particular server; instead, all the servers that participate in the cluster can respond to requests to the cluster alias name.
9.8.2 IP addresses
In today's Web-centric world, the IP address is required for most applications to function. The ability to move the IP address resource between nodes in a cluster is a basic requirement for a cluster. Cluster Service is capable of dynamically moving an IP address from an inactive node and restarting it on an active node. Microsoft Cluster Service technology utilizes the concept of a virtual server. A virtual server, as its name implies, is not hardware based but is instead a group of resources and services that collectively form a virtual server. An IP address is part of the group of resources that defines a virtual server. From a network administrator's point of view, cluster virtual servers and their associated IP addresses can become an issue. Unlike a typical standalone server that usually has only one IP address assigned to it, there can be multiple virtual servers, each with its own IP address, running on a single physical server. As you can imagine, a large cluster of servers will require quite a few IP addresses. This could be an issue for organizations that are already running low on available IP addresses.
9.9 Managing cluster failover events Probably the most important thing that a cluster systems administrator does is to manage failover events. After all, that's what clustering is all about. There are a lot of decisions that need to be made by the system administrator regarding how software applications and services react when a failover event occurs. Since a failover can be either planned or unplanned, the cluster administrator must plan for either case. The first thing that must be decided is which applications or services should failover. Because of systems capacity and cost factors, the administrator may decide that it is either not necessary or practical to failover every application or service. The decision may be to failover only applications that are considered critical for the company. If the decision is made that a particular application or service will failover, the next decision to be made is which server or servers should be configured to run the application when a failover becomes necessary. When the cluster size is larger than two, priorities for the different nodes in the cluster must be set. As you can imagine, as the size of the cluster grows, the task of developing a cluster failover matrix will require a significant amount of effort. Once the reason for the failover has been determined, the next decision point is if and when a failback should occur. Since clustering in its simplest form is basically a restarting of an application on another machine, you may not want to stop and restart an application that is alive and well in the middle of the day. Instead, you may elect to wait for an off−peak time when a failback would have little or no impact on users.
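A failover matrix of the kind described above can be kept as a table that lists, for each group, the nodes allowed to host it in priority order. The sketch below is a planning aid only. The group names, node names, and the `choose_node` function are hypothetical and are not part of any Cluster Service interface.

```python
from typing import Optional

# Hypothetical failover matrix: for each group, nodes in priority order.
FAILOVER_MATRIX = {
    "web_group": ["node1", "node2", "node3"],
    "db_group": ["node2", "node3"],
    "print_group": ["node3"],  # judged non-critical: no failover target
}


def choose_node(group: str, nodes_up: set[str],
                matrix: dict[str, list[str]] = FAILOVER_MATRIX) -> Optional[str]:
    """Return the highest-priority node that is currently up, or None
    if no node configured for the group is available."""
    for node in matrix.get(group, []):
        if node in nodes_up:
            return node
    return None
```

With node1 down, `choose_node("web_group", {"node2", "node3"})` picks node2. A group with no surviving node in its list simply stays offline, which matches the decision not to fail over every application.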
9.9.1 The impact of failover on server applications Software applications that execute on a cluster need to be designed to deal with system failures and other cluster events from the very start. A good application will constantly be checkpointing itself so that in the event of a failure it can restart and pick up processing where it left off. In order to accomplish this, it will have to keep a journal log on a clustered disk so that after the application fails over to a new server, it can determine what it was doing when the other node failed. Applications that keep their operational state only in RAM will not work very well in a cluster, because the application's state will be lost when a failover occurs.
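The journal idea can be sketched in a few lines. This is a minimal illustration, not how any particular product does it: the file path, the record layout, and both function names are assumptions. The essential points are that each checkpoint is forced to the shared disk before the application moves on, and that a half-written final record left by a failed node is ignored on restart.

```python
import json
import os
from typing import Optional


def append_checkpoint(journal_path: str, record: dict) -> None:
    """Append one checkpoint record and force it to disk before returning."""
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(record) + "\n")
        journal.flush()
        os.fsync(journal.fileno())


def last_checkpoint(journal_path: str) -> Optional[dict]:
    """Return the most recent complete record, or None if there is none."""
    last = None
    try:
        with open(journal_path, encoding="utf-8") as journal:
            for line in journal:
                try:
                    last = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final write from the failed node; ignore it
    except FileNotFoundError:
        return None
    return last
```

After a failover, the restarted application would call `last_checkpoint` on the journal stored on the clustered disk and resume from the state it finds there.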
9.9.2 The impact of failover on end users
For a user who is using a typical application today on a cluster, the impact of a failover is basically a hit-or-miss situation. If the user was in the middle of accessing data on a cluster when it failed, that user would be told to Abort, Retry, Fail. If the user was not trying to communicate with the server when it failed, they may never be aware of the fact that the application was failed over to a new server in the cluster. In the future, applications that are written to run in a cluster will need to be able to determine whether the cluster is going through a failover. If it does detect a failover, the application should warn the user that the user's data is temporarily unavailable and to wait until the cluster failover is complete. As we said before, a cluster failover can be thought of as a super-fast field service call: one server fails, and its applications and resources are restarted on a new piece of hardware. Whether or not the end user is aware of the downtime is totally dependent on the application being used and how well it supports the cluster APIs.
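On the client side, the behavior described above (tell the user the data is temporarily unavailable, then wait for the failover to finish) amounts to a retry loop. The sketch below assumes that a failover shows up to the client as a `ConnectionError`. That is our simplification for illustration, and the function name and parameters are hypothetical. A truly cluster-aware client would use the cluster APIs to learn about the failover instead.

```python
import time


def call_with_failover_retry(request, attempts=5, delay_seconds=2.0,
                             notify=print, sleep=time.sleep):
    """Call request(); if the connection fails, warn the user and retry.

    Raises the last ConnectionError if the server is still unreachable
    after the final attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == attempts:
                raise
            notify("The server is temporarily unavailable; retrying...")
            sleep(delay_seconds)
```

The `notify` and `sleep` parameters exist only so that the message display and the waiting can be replaced, for example by a dialog box in a graphical client.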
Chapter 10: Achieving Data Center Reliability with Windows NT/2000 Clustering
10.1 Total system design approach to high availability
Windows NT Clustering software is only one component of a data processing system designed for achieving high availability. In order to design the best possible high-availability environment for your data center, you must take a "total system" design approach to achieve your goal. By that we mean that to achieve the highest level of availability you must analyze the effect that every possible component or subsystem has on total system availability. In fact, even if you never install a cluster, you should still consider the following list of requirements for your data processing system. The following list is really made up of "no-brainer" recommendations for system administrators. These recommendations set the foundation for your cluster design. They are suggestions that, as an administrator, you should consider adopting and including in your standard operating procedures. You would be surprised by the number of computer systems we see that are not protected by these common-sense procedures and recommendations. Figure 10.1 represents the challenge that you face when trying to achieve high availability for your data processing system. You will notice that only one of the items in this list is a software solution. The rest of them are tried-and-true procedures that have been used for years in the minicomputer and mainframe world, just the things that you would expect to be implemented in a large corporate data center. The game plan here is to eliminate all single points of failure if possible. It should not be a surprise that most of our recommendations are just plain common sense. When your manager sees this list, his or her first reaction might be, "We can't afford all that." The concept of affordability is a very important one to put into perspective.
Suppose we could design a system with no single points of failure; the question is would it also be unaffordable? In reality, you will need to perform a risk analysis of your business operations and determine a cost−effective approach to high availability. In effect, you are deciding how much insurance your company can afford. It's no use kidding yourself or your management, because the reality is that purchasing and operating a high−availability data processing environment is not going to be cheap. Don't be surprised when you learn that clustering will cost more than the cost of just owning two standalone servers. The most important thing is to understand and identify your risk and then make the necessary business and financial tradeoffs. What you and your management come up with will probably not be the ultimate state−of−the−art system, but it won't bankrupt the company either. So let's review some of our recommendations for achieving a total systems solution for high availability.
Figure 10.1: The elements of a high availability solution.
10.2 Identifying the cause of downtime The reason that most companies acquire clusters is because they have experienced losses due to system downtime. When you talk about downtime, most people will immediately assume that you are referring to a server crash. When the application or file server crashes unexpectedly, it usually gets the attention of everyone in the organization, including top management. It's the fact that the downtime was unplanned and came as a surprise that upsets people the most. The most common phrase that you hear is, "This is costing us money." Usually, the next comment you hear after a server has been restored is, "You need to do something so that this does not happen again." It is the unexpected server crash due to failed hardware that gets management's attention and typically results in a project being defined to look for a clustering or a high−availability solution. A system "crash" gets people's attention, but there are many more issues around the typical data center that can actually have a much bigger impact on system availability than a hardware failure. Table 10.1 lists both planned and unplanned causes of system outages that are equally disastrous.
Table 10.1: Causes of System Outages

Unplanned Outage                 Planned Outage
Server hardware                  Backups
Networking hardware              Software upgrades
Network service providers        New hardware
AC power                         Preventive maintenance
Physical resources (cooling)     Hardware repair
                                 Network reconfigurations

It is certainly true that unexpected hardware failures can cost a company a lot of money and potentially affect customer relations. However, it could be that the "other type" of downtime is really what's costing your company the most money. It's the "planned outage" that everyone is expecting and knows is just part of life that could in fact be costing more over the course of a year than those infrequent but painful unplanned outages.
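The comparison between planned and unplanned downtime is easy to make once you have counted the events at your own site. The two helper functions below are a minimal sketch of that arithmetic. The function names are ours, and any figures you feed them should come from your own outage records.

```python
def annual_downtime_hours(events_per_year: float, hours_per_event: float) -> float:
    """Total hours of downtime per year for one category of outage."""
    return events_per_year * hours_per_event


def availability_percent(downtime_hours: float, hours_per_year: float = 8760.0) -> float:
    """Availability over a year (8,760 hours) given total downtime."""
    return 100.0 * (1.0 - downtime_hours / hours_per_year)
```

As a purely made-up illustration, a two-hour maintenance window every week adds up to 104 hours of planned downtime a year, while two four-hour crashes add up to only 8 hours of unplanned downtime. Those inputs are hypothetical and are not taken from any site we surveyed.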
By implementing clustering, your planned outages can be managed so that they have little if any effect on the users. For example, if a software application needs to be upgraded, it might be possible to do what is known as a "rolling upgrade" by taking one server down and then upgrading its software while the remaining servers continue servicing users. Once the software is installed and verified, the server can rejoin the cluster. This way, only one server is offline at a time. This might not always be possible for some major software upgrades where the actual database structure changes, but for simple software fixes it should work most of the time. Each server in the cluster could in turn be upgraded in this manner without disrupting services to users. Without the benefits of clustering, all users on the system would have to wait until the upgrade has been completed. This could take time, possibly resulting in a system outage anywhere from hours to a day or more. By employing clustering technology, users would see only a short pause in services while their applications are moved from the server to be worked on to one of the other nodes in the cluster. This is achieved by an orderly administrative failover of applications and data. By definition, an administrative failover is considered orderly because the system administrator plans for when the system will be shut down. There is a potential here for a considerable savings in time and money. Unfortunately, since most good system managers send out e-mails warning about system outages, users and management are prepared for downtime and are not alarmed as is the case when the outage is unexpected. You might go as far as to say that they become complacent even though planned downtime can be one of the major contributors to lost productivity and lost revenue for your company. The same thing applies for hardware upgrades and repairs.
Preventive maintenance service calls, usually scheduled for every 60 to 90 days, are another example of planned downtime. If you have a cluster, each server in the cluster can be taken offline one at a time while the other nodes continue servicing users. Planned downtime is a major factor in justifying a cluster installation even though it is sometimes not as visible as an unplanned system crash. At sites that don't have high-availability solutions already in place, it is not surprising to hear comments like, "We can't afford to take the system down now for maintenance; we will do that later." Well, "later" in that context usually means never. Without clustering, system managers are hesitant to take a running system down. Consequently, there is a big temptation to put off preventive maintenance, postpone software updates and bug fixes, and, even worse, fail to do backups. By employing clustering and other high-availability technologies, system administrators will no longer be tempted to compromise good operating procedures.
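The rolling upgrade described earlier in this section follows a fixed sequence for each node: move its work elsewhere, upgrade it, verify it, and let it rejoin. The sketch below captures only that control flow. The four callables (`drain`, `upgrade`, `verify`, `rejoin`) are hypothetical placeholders for whatever your site uses to perform each step; they do not correspond to any real Cluster Service command.

```python
def rolling_upgrade(nodes, drain, upgrade, verify, rejoin):
    """Upgrade nodes one at a time so that only one is offline at once.

    Stops at the first node that fails verification and leaves that
    node out of the cluster. Returns (upgraded_nodes, failed_node),
    where failed_node is None if every node was upgraded.
    """
    upgraded = []
    for node in nodes:
        drain(node)        # administrative failover of the node's groups
        upgrade(node)
        if not verify(node):
            return upgraded, node  # remaining nodes are left untouched
        rejoin(node)
        upgraded.append(node)
    return upgraded, None
```

Stopping at the first failed verification is a deliberate choice in this sketch: the nodes that have not yet been touched keep running the old software and continue servicing users.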
10.3 Quality hardware
The whole purpose of clustering (and of this book, for that matter) is to help you achieve the highest availability possible for your company's data processing servers. Microsoft's business is built on the premise that customers will purchase its high-volume commodity operating systems, such as Windows NT/2000, and run them on high-volume commodity hardware. The whole strategy behind Microsoft's clustering solution is to allow customers to purchase low-cost commodity computers from whomever they want to and, at the same time, achieve system availability similar to what was in the past available only on proprietary high-cost hardware. Microsoft's clustering software cannot achieve maximum availability all by itself. It needs reasonably well designed and constructed hardware on which to run in order to deliver the best possible availability. Compaq has taken a big step forward in this direction with its new line of high-end servers that feature hot-swappable RAID memory subsystems. Its RAID memory system can detect a failure and notify the system administrator, and it can then be repaired without ever having to take the system offline. It is getting to the point with RAID memory and RAID disks where you will turn your computer system on once and never have to turn it off again. Likewise, IBM is actively working on self-healing servers that can work around hardware failures and allow repairs to be made on the fly. These are just two examples of high-end solutions for applications that require the ultimate solution for high availability.
First, it goes without saying that choosing the best quality hardware you can find makes sense, but it does not necessarily have to be the most expensive. The question you will probably ask is, "How can I tell what is the best?" If you don't have engineers on staff at your company to help you with your hardware evaluations, then you might consider using a consultant with a good track record to help you determine the best configuration for your needs. Another option is to become an expert yourself. There is a lot of information available, and most of it is free. You can start by sitting down with your favorite Web browser and searching sites from Compaq, IBM, Dell, and HP, to name a few.
10.3.1 Selecting high-quality hardware
The first release of Cluster Service was intended to be available only on a limited number of certified systems from vendors who had signed up with Microsoft to be one of the early adopters of Wolfpack (the original beta test name). This meant that you could not run down to your local CompUSA store to purchase Cluster Service off the shelf, and you certainly could not grab one of those PCs on sale at one of those computer swap meets on the weekend (that is, unless you don't care about being able to call Microsoft for support). Even though you probably won't be assembling a cluster from scratch, we think that it would be enlightening for you to understand the issues involved in selecting and building a cluster from a hardware point of view. So we are going to dig in a little bit on the technology used in building clusters. Keep in mind that even with a preconfigured COTS solution, once a system is delivered to you, it is your problem! The more you know about what makes it tick, the better off you will be in the long run. We hope that you will have an on-site maintenance contract with your vendor, but if there is a problem with the cluster, your management will look to you to make it work. From our experience, there are too many technicians who show up prepared only to unscrew a board and plug in a new one. It is becoming rare to find a technician who understands how the system works and is able to diagnose those rare problems that are well defined in Murphy's law. The bottom line is just what our sons are taught in Boy Scouts: a good Scout should be prepared. So you can think of this chapter as your Swiss army knife for clusters. Gregory Pfister's definition of a cluster is "a collection of interconnected whole computers." The term "whole computer" refers to COTS computers that can function as standalone computers or, with the addition of "clustering" software and hardware, can join and form a cluster.
Microsoft's vision is for clusters made up of hundreds of these "whole computers." Not that we are trying to scare you, but if Microsoft is on target with its projections, then system administrators in the future could be faced with many computers, each running its own copy of Windows 2010, that will need to be managed as if they were one computer. This should be a sign of the complexity that will be involved in setting up clusters in the future. You can expect to see an increase of at least an order of magnitude over that of a single server. The reason for this increase in complexity is that the "server," instead of being just one computer, is now made up of two or more computers interconnected using a lot of very sophisticated clustering technology. As Microsoft increases the maximum size of a cluster, the life of the system administrator will become more and more challenging. We can only hope that their salaries will increase proportionately. The real lifesaver in all of this is that with Cluster Service's administration tools you can manage the whole cluster as if it were just one server. As you begin to work with clusters, you will find yourself getting more acquainted with things like differential SCSI buses, Fibre Channel, ServerNet, and other high-speed server interconnect hardware. Although Windows clustering is delivered on a CD-ROM, the software is only one small part of the total solution. Clustering software is very dependent on hardware in order to deliver high availability and scalability. Reliability, on the other hand, means that the hardware that you use for your cluster must be carefully built and tested with quality as the primary goal. You will want to familiarize yourself with as much cluster hardware technology as possible, short of becoming an electrical engineer.
10.3.2 Selecting a vendor
Some of you might not feel too sure about your ability to select hardware, but remember the old saying, "Nobody ever got fired for buying IBM." Now we are not necessarily saying you should run out and buy IBM. What we are saying is that if you are not sure, play it safe and stick with big-name companies. In this context we are using the name IBM as shorthand for well-established companies with good reputations. Amdahl, Compaq, Dell, Gateway, Fujitsu, HP, IBM, NCR, Siemens Nixdorf, and Stratus have all announced their support for NT Clusters. All of these companies have been in the business for some time and generally do a good job. You should be dealing with companies that are financially sound and likely to be in business when you need to go back for support. We don't recommend that you shop for high-availability solutions at your local weekend swap meet. Don't laugh; we have heard of a few people who do their shopping for computer hardware at the weekend computer flea markets and hamfests.
The advantage to dealing with large companies like Compaq, Dell, HP, and others who deal in high-volume sales of commodity computer hardware is that they have a vested interest in doing extensive qualification testing on any system they plan to sell. After all, if there is a problem they know that they will get a flood of support calls. Any calls they get while products are still under warranty cost them money, and that eats into the already slim profit margins for commodity computers. These large computer vendors can afford to make the investment required to do quality testing of their products and have developed that expertise over many years. Also, look for companies that have achieved ISO-9000 certification. In order to be awarded the ISO certification, a company needs to have in place documented procedures and policies that ensure that its products will be designed and manufactured to high standards for quality.
So, how can you distinguish one vendor from another? The first thing you need to do is get to know your vendors. Call and find out who their local sales person is and invite the sales rep in for a chat. Pay attention to how easy it is to reach them and how fast they return your telephone calls. Remember that you are a "potential customer" right now, so their responsiveness can only get worse after they cash your check! Make them do the research work for you by asking them for comparisons of their products with those of their competitors. They should be able to "sell" you on why their hardware is better than the next company's. A good salesman can help you get through the hardware configuration maze that you will need to understand in order to select the appropriate hardware to meet your company's needs. It can really become a nightmare sometimes trying to figure out all the different cables and accessories that are needed to make the system actually work. An experienced salesperson can get through that chore in no time. If answers are hard to get or if your gut feeling tells you that you aren't sure you are getting the right answers, don't be afraid to ask for better explanations and more data. The biggest advantage to working with sales people is that they have access to what is referred to as "inside sales support personnel." This means that if you have a configuration problem that the sales person has not seen before, he should be able to make a quick phone call and talk to someone who is technically grounded enough to answer your question. Believe us, it is much better for that sales person to make that phone call to his dedicated inside sales support person than for you to call an 800 number and wait and wait for someone to answer the phone. Get to know your sales reps, because you will undoubtedly be dealing with them for a long time. Take a good look at their hardware. Take the covers off the case, and look inside. 
You should be able to see and feel a strong, well-braced, and reinforced chassis. The case should be modular to provide easy and quick access to components. A good modular design allows rapid repair without a major disassembly effort. This is especially important in a two-node cluster. If one of the nodes fails, you no longer have a redundant system available to you until the failed node is repaired and put back online. If Murphy just happened to strike twice during the window of time when you were waiting for parts to arrive to make the repairs, the whole system would be down. It is important to reduce that window of time when you are most vulnerable.
10.3.3 Dealing with commodity hardware
We hear the term "commodity hardware" a lot these days. Microsoft talks about it, the PC trade magazines write about it, and a lot of us sit around wondering what they are really talking about. Ten to twenty years ago, computer system vendors sold products based on their own proprietary designs. Today, that has all changed. What a lot of people do not realize is that Intel not only manufactures the CPUs that are in our PCs and servers but also plays a major role in the design of the system motherboard and many of the other components used to build PCs today. In Figure 10.2 we show the process that takes place from the time when Intel designs a new processor to when an OEM develops and optimizes its motherboard using Intel's design specifications.
Figure 10.2: Intel's processor/system development cycle.
The bottom line is that by the time motherboard manufacturers get their designer guideline packages from Intel, all of the hard engineering work has been done for them. With CPU speeds in the hundreds of MHz, the physical layout of printed circuit wires on the motherboard is highly critical. Once Intel has debugged the basic design of a motherboard for a particular CPU and its support chips, it would be risky business for a motherboard manufacturer to deviate from the standard Intel design unless there was an awfully good reason to do so. This means that even though the computer you buy (or the motherboard you purchase from a component supply house) comes from a particular vendor, its basic design originated at Intel.
Something a lot of us have forgotten about and some are too young to know about is the long-lasting effect of IBM's original design of the PC. For backward compatibility reasons, IBM's original designs are still being copied in the manufacturing of PCs today. IBM's patents cover many of the engineering designs used to build PCs. It is sometimes hard to imagine the depth to which IBM's influence extends. With IBM holding around two million patents on computer technology, covering everything from software to computing and networking technologies, it makes one wonder how other vendors can find ways to add proprietary or custom features to their motherboard designs. In fact, if you remove the covers of just about any PC on the market today, they all look surprisingly similar on the inside. Today, many computer vendors are simply purchasing "white boxes," painting them in their colors, attaching their corporate logos, and boxing them up to be delivered to retail stores or other resellers. The term "white box" refers to a standard computer chassis that is pre-built with all the standard PC interfaces and peripherals
such as CD and floppy disk drives already installed and configured. The only thing left for the computer reseller to do before shipping the system off to a customer is to install the CPU, memory, and hard drives as requested by the customer. So the question is, if PCs are based on many of IBM's patents and Intel's CPUs and motherboard designs, how does someone distinguish one vendor's products from the others? After flipping through many pages of sales ads and sorting through a lot of advertisers' junk mail, we concluded that the competition for computer sales is focused on four criteria:
• Price
• Delivery time
• Quality
• Service and support
On average, most of the vendors we have seen are in the same ballpark when it comes to price and delivery time. Vendors on the high end of the price scale are typically there because of their desire to deliver higher-quality hardware and provide higher levels of customer support. When shopping for high availability, the price of computer systems should not be your primary selection criterion. Your primary concern should be the quality of the hardware components used to build the computer system. For example, disk drives have become a low-cost commodity item. A vendor has two options when purchasing the disk drives to be installed in the computer you are buying. First, the vendor can purchase an off-the-shelf drive from a distributor or a manufacturer and simply install it in your system. The other option is to develop quality specifications and standards of its own and perform incoming inspections on the drives it receives from its suppliers. Taking this one step further, a very large computer systems manufacturer could require a supplier to make changes to its off-the-shelf hardware to meet more stringent performance criteria.
A computer systems vendor that maintains an engineering staff qualified to evaluate and set component parts standards and does its own incoming inspections will need to charge more for its systems than one who simply uses off-the-shelf components. But that is OK with us, because our main concern when configuring a high-availability system is being able to acquire the most reliable hardware we can find. Obviously, your first line of defense against system downtime is high-quality hardware that is assembled into a reliable data processing system. But no matter how good the hardware you purchase is, it will eventually fail as a result of normal wear. At that point, your most important concern will be how fast you can get the system repaired and back online. That is why we believe the second most important selection criterion for vendors is their track record for delivering high-quality support services for the equipment they sell. Here again, establishing and maintaining a first-class field service organization does not come cheap. The media reported that one of the main reasons Compaq acquired Digital Equipment was Digital's worldwide first-class field support organization. We don't really imagine that Compaq had been dreaming about owning VAX/VMS. The crown jewel of Digital was obviously the field service organization. Compaq at that time did not have a service organization that even came close to Digital's. The most important thing to a system administrator is to be able to get a failed system back online as soon as possible. To achieve that goal you need to be able to rely on your vendors to deliver repair parts quickly and to dispatch qualified service personnel in a timely manner.
10.3.4 Why is MSCS certification important to you?
Microsoft initially established a process by which system vendors who were early adopters (Compaq, IBM, and others) submitted their configurations to Microsoft for testing. Microsoft would then verify that the complete system configuration that was to be sold would be able to reliably run the MSCS clustering
software. Once MSCS started shipping, Microsoft turned to a self-certification process that anyone can undertake by running a Microsoft-supplied software test suite and submitting the results to Microsoft for validation. Microsoft publishes the list of hardware configurations that successfully pass its software test suite, just as it does for the Windows NT operating system. Anyone planning to purchase Microsoft's Cluster Service product should consult this list before buying a cluster system from any vendor. Hopefully, the certification process will provide some level of assurance and peace of mind for customers that the hardware they want to purchase from this list will work flawlessly. Before you get upset that your choices for cluster hardware will be limited initially, bear in mind that anyone who needs a cluster should be looking for a well-designed and tightly integrated hardware solution for delivering high availability. The cost of the cluster hardware will be very, very small compared to the savings that can be realized by avoiding tens of thousands of dollars in system downtime. Your main goal should be to end up with a system that is reliable and compatible.
The certification process is a kind of guarantee that a system consisting of specific hardware and software will perform at an acceptable level. One vendor taking that concept a step further than Microsoft is Marathon, the developer of the Endurance product family. Marathon goes as far as to offer an insurance policy to guarantee uptime for its customers. Even though Microsoft is not selling an insurance policy for Cluster Server, it does want to ensure that your Windows cluster is as reliable as possible. That policy is obviously good not only for Microsoft's software business but for yours as well. It is far more important to get a system with the highest reliability than to save a few dollars on computer hardware.
The approach that Marathon and many of the other "high−availability" vendors are taking is strictly a software approach to clustering combined with, in the case of Marathon, their own proprietary hardware. The strictly software approach eliminates the need for certifying COTS server hardware configurations. There are two obvious advantages to this approach. First, the users are able to pick the server hardware from just about any hardware vendor they choose. In fact, they might even choose to use servers that they already own as a cost−cutting measure. The second advantage to this approach is cost. If you are not tied to a limited number of machines on the cluster hardware compatibility list, you can shop around to get the best price for a server. Marathon is unique in providing proprietary hardware and software that you install in whatever server hardware you chose. This custom hardware, along with their software, allows Marathon a certain amount of isolation from the hardware that the customer might choose to use. This approach has worked so well that they were able to find an insurance company to underwrite them. Because of the industry's quest for "open systems," the use of commodity hardware components is a major driving force for cluster certification. Since clustering is a new concept for the PC marketplace and at the same time is much more sophisticated than traditional PC network servers, Microsoft wants to ensure that their customers will be satisfied with their new clustering product. When you think about it, hooking up two or more large servers with all the necessary cluster interconnect hardware and software is no easy task. There are many hardware and software configuration items that all need to be carefully tracked as a cluster is being built and deployed. Remember what it was like when you set up your last large server. Can you imagine the amount of effort that will be required to integrate and administer a 16−node cluster? 
You would certainly want to know upfront that someone has already worked out most of the bugs before you tackle a job like that. As vendors and customers become more knowledgeable about clustering, it is conceivable that these restrictions will be eased. One would assume that if Microsoft has its way, you will be able to confidently go down to your local computer reseller and purchase all the components necessary to build a cluster. In reality, it is more likely that large companies will be more than willing to pay the cost of having a reputable vendor deliver a preconfigured system that has been certified and tested to work together as a system. Still, smaller companies and organizations will probably opt for more of a lower−cost do−it−yourself approach.
10.4 Datacenter facilities
The physical facilities that will hold your cluster are just as important as the software, computers, and networks that make up your high-availability system. Computers need electrical power to provide the current to run the CPU chips, disk drives, fans, monitors, etc. Electrical "noise" on power lines can cause bits of data to be read incorrectly and possibly crash the CPU if the induced noise is bad enough. The by-product of using all of that energy is heat. It is important that all electronic equipment be kept at the correct temperature. The hotter the components get, the quicker they will fail. Although they need electricity to run, solid-state components don't survive very long in a room with static electricity. Adequate humidity in the building helps prevent the build-up of static charges on people as they walk around the room. People (operators) often wear clothes made of wool and similar fabrics. These materials pick up dust from all over the place, and then the dust falls off as they sit at a console. Dust and lint from the clothes clog filters and vents as the cooling fans draw air from the room into the cabinets to cool the integrated circuits that make up the computer. All of this will have its effect on the mean time between failures (MTBF) of your system. To achieve a long-running computer system, all of these environmental issues should be addressed.
10.4.1 Reliable power Most of the suggestions we are making are standard "good practices" and should be implemented even if you never install a cluster. For example, it should be standard procedure to install an uninterruptable power supply (UPS) on all servers and multiuser systems; that decision is almost a no−brainer, given the low cost of UPSs. The power requirements for today's computers are low enough that low−cost gel−cell battery power systems are more than adequate. Most importantly, they come with built−in intelligence so that they can inform the processor that there is a loss of AC power and state how much power is left in their batteries. Windows NT/2000 can interface with most of the UPS systems that are on the market today via a simple RS−232 interface. The more sophisticated UPS systems can also be managed by integrating them into an enterprise management solution such as Hewlett Packard's OpenView using the SNMP management agent hardware built into the power supply. The only problem with UPSs is that you can't just set them in the corner and forget them. They need to be part of your scheduled maintenance procedure, which should include a test to see that the batteries are still holding a full charge. All batteries have a limited life span and need to be replaced as recommended by their manufacturer. This schedule should be incorporated into your site's maintenance plan. There are two approaches for UPS design. One approach that has been used was to purchase a UPS large enough to power all of the computers in a data center. The advantage to this approach is a possible savings in the purchase cost over a lot of small UPSs on each machine along with the ongoing savings in maintenance cost because there is only one set of batteries to maintain. The obvious problem with this solution is the old "no single point of failure" rule for high−availability systems. 
In addition, individual UPSs attached to each server can then be connected to different power circuits with their own fuses so that if one fuse blows, only the UPSs on that fuse are affected. There are many different types of UPS systems on the market. The protection that they provide varies in proportion to their costs.
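The way a UPS's built-in intelligence gets used can be sketched in a few lines. This is only an illustration, not the logic of any particular UPS product or of the Windows UPS service: the `UpsStatus` fields, the function name, and the safety margin are all assumptions made up for the example. The idea is simply that once AC power is lost, an orderly shutdown should begin while there is still enough battery left to complete it.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of what a UPS reports to its host."""
    on_battery: bool        # True when AC power has been lost
    runtime_minutes: float  # estimated battery runtime remaining


def should_shut_down(status: UpsStatus,
                     shutdown_minutes: float,
                     margin_minutes: float = 2.0) -> bool:
    """Return True when an orderly shutdown should begin.

    shutdown_minutes is how long the server needs to stop cleanly;
    margin_minutes is an assumed safety cushion on top of that.
    """
    if not status.on_battery:
        return False  # AC power is present, nothing to do
    return status.runtime_minutes <= shutdown_minutes + margin_minutes
```

A server that needs five minutes to stop cleanly would, under these assumptions, start shutting down when the UPS reports seven minutes or less of runtime while on battery.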
10.4.2 Backup power supplies
There should be provisions for dual, redundant, and hot-swappable power supplies. Power supplies and the fans that cool them have a tendency to fail more often than most other components. Each power supply should have its own separate power cord. This will allow you to connect each power supply to a separate power source. Power supplies generate heat as a byproduct of providing regulated noise-free power. The cooling fans should also be redundant and hot swappable.
The "PC class" of servers that we're used to seeing are usually equipped with only a single power supply. There really isn't anything wrong with this for a workstation-class machine. By using a UPS and/or a power-line conditioner you can protect your power supply from potential problems such as voltage spikes and RF signals riding on the AC power line. What we've just described are common-sense preventive measures that you can take to protect one of the most important components of your system. But if a single component should fail in the power supply, then your whole system will come to a halt. This situation can easily be prevented if you use dual power supplies. Ideally, the power supply subsystem will allow redundant power supplies. In the event that one of the power supplies fails, the other power supply can immediately take over. Then the failed power supply would be removed from the cabinet for repair. Installing hot-swappable power supplies will allow you to remove a defective power supply without shutting down the server.
10.4.3 Temperature and humidity controls You might be surprised at how much dust gets pulled into a chassis by power supply cooling fans. On older minicomputers and mainframes all the enclosure vents would have filters to keep the dirt out. On the low−cost commodity computers we see today it is rare to find filters. If you do, that is a very good sign the manufacturer understood what is important and went that extra step to build a quality product. You should make cleaning or changing the filters part of the normal scheduled maintenance procedures. Dirty filters will restrict airflow, resulting in elevated temperatures inside the system cabinets. Computer chips really don't like to be hot. We have found that the cooler the computer room is kept, the longer it will be between service calls. We have always used a temperature setting for our data centers that was on the low−end of the range recommended by the vendor. Comparing our service call numbers with other data centers' numbers convinced us that good environmental controls make a big difference in the frequency of failures. Look for environmental monitoring inside the case that is tied to a management system that monitors these conditions and reports them to a central management console. One of the best investments that you can make is a simple device that is connected to a telephone line and sits in the computer room monitoring the temperature. Devices like this cost less than $300 but are worth more than their weight in gold. The way they work is that they can be programmed with high and low temperature limits. They are also programmed with the phone number of your pager and the pagers of other members of the staff. If the room temperature goes outside of the preprogrammed limits, it will automatically start calling pagers. These systems have saved many companies from disaster. We highly recommend that you look into acquiring one of these devices for your data center.
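The behavior of such a temperature monitor is easy to sketch. The code below is not the firmware of any real device; the function names, the temperature limits, and the pager numbers are all hypothetical and chosen only to illustrate the "programmed limits plus a call list" idea described above.

```python
from typing import List, Optional


def out_of_range(reading_f: float, low_f: float, high_f: float) -> Optional[str]:
    """Return "low" or "high" when the reading is outside the limits, else None."""
    if reading_f < low_f:
        return "low"
    if reading_f > high_f:
        return "high"
    return None


def pagers_to_call(reading_f: float, low_f: float, high_f: float,
                   pagers: List[str]) -> List[str]:
    """Return the pager numbers to dial, in order; empty when in range."""
    if out_of_range(reading_f, low_f, high_f) is None:
        return []
    return list(pagers)


# Hypothetical limits (degrees Fahrenheit) and pager numbers.
STAFF_PAGERS = ["555-0100", "555-0101"]
print(pagers_to_call(85.0, 60.0, 75.0, STAFF_PAGERS))  # room too hot
print(pagers_to_call(68.0, 60.0, 75.0, STAFF_PAGERS))  # within limits
```

Returning the whole call list whenever a limit is crossed mirrors the description above, where the device pages every listed staff member rather than stopping at the first one.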
10.4.4 Cleanliness
The faster a CPU runs, the more heat is dissipated by the processor and other support chips. The designers of these computer systems use a lot of fans to draw cool air into the cabinet to cool the electronic components. In the process, the air movement in the room will vacuum up any dirt from the surrounding area. This dust and dirt collects in filters and on the electrical components themselves. It does not take too long for these filters to become full of dirt and reduce the flow of air into the cabinet. The lint that falls off your clothes and collects on circuit boards inside computers in effect puts a warm winter jacket on the CPU. This is definitely not good. Every effort should be made to keep your computer room as clean as possible. The computer room facilities should be maintained by the building's janitorial staff on a regular basis. On a side note, make sure, if you intend to allow the janitorial staff in your computer room, that all of the cables and power cords are secured and out of the way. It doesn't take much to pull the power cord out of a system while pulling the vacuum cleaner around the room! Some preventive measures that are taken at many sites include not allowing food in the data center and providing a coat rack outside of the room, which helps to reduce lint and dust buildup. Also, it is a good idea to locate high-volume printers outside of the room where your computer racks and disk
farm are located. Paper is coated in clay, which gets airborne when it is handled, as occurs in the paper feed mechanism in printers. This dust can clog filters, which results in reduced cooling, and can also be deposited on CPU heat sinks, which will reduce the cooling efficiency of the heat sink. The bottom line is that the cooler you keep your electronic components, the longer they will run.
10.4.5 Backup procedures and issues
Owning a RAID array tends to promote a sense of security, or maybe we should say a false sense of security. A RAID array does a very good job of preventing the loss of data due to the failure of a disk drive. Unfortunately, a RAID array will be of no help if there is a local disaster such as a building fire or a terrorist attack. That is why having a backup procedure in place is absolutely necessary for a successful high-availability implementation. There are a couple of options available to you when planning your total data backup and recovery plan. If your company has only one data center, then a scheduled tape backup plan is your only option. If your company has more than one data center, there are other options available. In addition to regularly scheduled backups, you could also implement data mirroring between a pair of clusters in remote data centers. The advantage to data mirroring is that you can achieve, in effect, a real-time backup of your data.
If high availability is important to you, then protecting your system backup media must be taken seriously. It would not make sense to rigorously back up your disk farm and then leave the backup tapes on top of the computer cabinet as you walk out the door. If there were a fire or flood, what good would those wet and muddy tapes be to anybody? A wise procedure would be to keep two weeks of tapes locally and then send monthly backups to an off-site storage facility. The off-site storage facility could be a commercial data protection service provider that provides a vault for a fee, or you could opt to send the tapes to another location of your own company. In fact, you could trade favors with your fellow employees at the other site and act as their off-site storage in return for their helping you. One of the most horrifying experiences a system administrator can have is to load a set of backup tapes only to find that one of them is not readable. That can make for a really bad day!
There are two ways to solve this problem. The first is to do a read−after−write verify operation on the complete data set. The problem with this approach is that it takes twice as long to perform a backup operation. The benefit is that unless something physically happens to your backup media, you can be reasonably confident that you can successfully recover your data using those backup tapes. Another approach is to utilize two tape backup systems along with software that supports making mirrored copies of the backup. The way this works is that the backup software will write the same data to both backup systems simultaneously. Although not 100 percent foolproof, having two sets of backup tapes produced on two separate tape drives will certainly give you very good odds that at least one set of backup tapes will be readable. The advantage to this approach is that you end up with two sets of backup tapes. One set of tapes could be sent off−site for storage, while the other copy could be stored locally. If you need access to your backup data, you can use the locally stored backup tapes rather than waiting to retrieve the tapes stored at the remote facility.
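Both ideas can be illustrated at the file level. The sketch below is only an analogy: real tape systems perform read-after-write verification in the drive or in the backup software, and mirrored writes happen simultaneously rather than one after another. Here a source file is copied to one or more destinations, and each copy is then read back in full and compared with the source by SHA-256 digest. The function names and paths are made up for the example.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, Iterable


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Read the whole file back and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_verify(source: Path, destinations: Iterable[Path]) -> Dict[str, bool]:
    """Copy source to every destination, then verify each copy.

    Returns a mapping of destination path to True when the copy
    read back with the same digest as the source.
    """
    expected = sha256_of(source)
    results: Dict[str, bool] = {}
    for dest in destinations:
        shutil.copyfile(source, dest)                      # write pass
        results[str(dest)] = sha256_of(dest) == expected   # read-back pass
    return results
```

Passing two destinations stands in for the two tape systems: even if one copy fails verification, the other may still be good. The read-back pass is also where the doubling of backup time mentioned above comes from, since every byte written is read once more.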
10.4.6 Hardware and software service contracts
Be aware that setting up a PC business today does not take much more than a P.O. Box and a laser printer to print business cards and invoices. We can't say it enough: Pick your vendors and consultants carefully. Sometimes hardware failures can be very hard to diagnose for someone without a lot of experience. Inexperienced service technicians will basically end up replacing one component at a time until it appears that
the system is running correctly. This sounds a lot like the joke about how a programmer repairs a flat tire on his car. He uses a recursive approach by changing one tire at a time and checking to see whether the car sits level after each tire is changed. An experienced field service technician can be a big help during a time of high pressure when management wants to see a system up and running from the day it is delivered. Someone with sufficient experience is more likely to pinpoint the problem the first time. That means that the system is back online sooner and not likely to fail as soon as the repair technician leaves the building. The recursive approach to repairing a downed server usually means multiple system crashes and multiple calls to the vendor requesting service. Not only is this frustrating for system administrators, but it can also become downright annoying to users who had been told, "The system has been fixed, you can now go back to work."
A good field technician has a wealth of information about things that are hard to find in documentation, such as revision levels for drivers; BIOS, firmware, and hardware circuit revisions; and how they all interact. Technicians can keep track of what maintenance has been done and maintain a schedule for preventive maintenance. They can advise you when to call vendors for upgrades based on the operating system and the version you are running. There are also very simple but important tasks like cleaning air filters and getting rid of the dust that accumulates in and around the vents in cabinets. We already discussed the problems created by restricted airflow, which causes overheating, possibly shortening the life of electronic components.
10.4.7 Hardware and software service support contracts One of the problems that many of us have experienced in today's PC market is finger−pointing between vendors. We can painfully recall the day when we were told by one of the leading PC computer manufacturers that we would have to remove all the "other vendor's" adapters that we had installed in their box before they would help us diagnose the problems we were having. With support policies like these, you are in effect locked into that particular vendor. Who in their right mind would want to keep an extra set of adapter cards lying around just so the telephone support people will talk to you? We have found that telling a little "white lie" will sometimes allow you to work around this particular problem. One thing you should consider when purchasing your first cluster system is whether your vendor has tested their components for compatibility with any of the other components that you plan to integrate into the system to ensure that they all work together. As you can imagine, a vendor does not have any particular reason to go through the expensive effort of testing for capability with its competitor's products. It is totally unreasonable to expect, say, Compaq to spend a lot of time testing its servers for compatibility with EMC's RAID arrays now that Compaq has purchased Digital's Storage Works group. On the other hand, EMC probably does have some good reasons to make sure that its storage arrays work well with Compaq's servers because of the large numbers that are in the marketplace. We do not recommend that the average system administrator attempt to do his own integration testing. Consequently, the best thing to do is to stick with hardware that is guaranteed to work together. Don't assume that you can make it work; even if you could, it's probably not worth the time and effort that it would take. 
We wanted to point out this little scenario just to make you think about all the potential problems you might have assembling large clusters from a variety of computer vendors as opposed to a single vendor. It is quite possible that once you select a vendor for your first two-node cluster, you will be locked in to that vendor as you expand and grow the cluster. So pick your bed partners wisely.
10.4.8 Spare parts One of the best investments you can make is to maintain a set of spare parts. You might argue that a cluster is really an expensive "spare," but in the case of a two-node cluster, if one server is down you no longer have a backup. The users who are now running on the second server have no protection, and if you are having one of those really bad days, the second server could crash as well. Although this is an extreme example of Murphy's law, it is still best for you to plan for it. Even with express overnight delivery service, your replacement parts may not arrive for 12 to 24 hours. Your goal is to get your failed server back up as soon as possible, because in a two-node cluster, as long as one server is down, you have a single point of failure in your system. If your system goes down after the local pickup time in your supplier's home town, you are out of luck for another day! Worse yet is the dreaded response: "Sorry, we just shipped out the last one we had in stock yesterday, and they are on back order." Once you receive those parts, how long will it take you to install them? What happens if the replacement part you receive is DOA? This is not a hypothetical question; believe us, we have seen it happen all too often. Without on-site spares you are vulnerable for anywhere from 12 to 48 hours. How long you are willing to run your application with no backup server is a matter of economics. Your business case should determine whether you need on-site spares or not. Building and maintaining a spare parts inventory is not cheap. By standardizing on one type of hardware you can reduce the number of spare parts needed to support your installation. If your company has many varieties of computers, trying to keep spares on hand for all of them will be cost prohibitive. You don't need a complete standby system sitting on the shelf. Most hardware vendors can provide a recommended spares list for your hardware based on their experience in maintaining it.
Their recommendations are usually tied to statistical studies of the probability of failure for the different components in the system and on the calculated mean time between failure (MTBF) for each lowest replaceable unit (LRU). Once you have established spare parts kits at the site, you must then establish a procedure for replenishment. When parts are taken out of kits to make repairs, someone needs to be responsible for either returning the part for exchange, if it is still under warranty, or ordering new parts. Another thing to watch out for is that as your computer system gets older, some of the parts may become harder to find. If you intend to keep the same configuration for 10 years, you might want to consider stocking up on components that have already been replaced by newer technology in the marketplace. Once the new hardware is on the market, distributors will start to dump the older stuff, possibly making it hard for you to find replacements. The good news is that as components are being discontinued, their prices usually drop substantially. Of course, the other option that you have is to upgrade to the new technology. Ideally, this would be the approach most of us would like to take, but compatibility issues with other components in the system may make it impossible.
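To make the MTBF reasoning concrete, here is a minimal sketch of how a spares count could be estimated for one type of replaceable unit. It assumes failures are independent and occur at a constant rate (a Poisson model), which is a common simplification and not necessarily what any vendor's statistical study uses. The function names, the 500,000-hour MTBF, and the one-year replenishment interval are hypothetical values chosen for illustration.

```python
import math


def expected_failures(units: int, mtbf_hours: float, window_hours: float) -> float:
    """Expected number of failures among `units` identical parts during the window."""
    return units * window_hours / mtbf_hours


def spares_needed(units: int, mtbf_hours: float, window_hours: float,
                  confidence: float = 0.99) -> int:
    """Smallest spare count k such that P(failures <= k) >= confidence.

    Assumption: failures follow a Poisson distribution whose mean is the
    expected failure count for the replenishment window.
    """
    lam = expected_failures(units, mtbf_hours, window_hours)
    k = 0
    term = math.exp(-lam)   # probability of exactly 0 failures
    cumulative = term
    while cumulative < confidence:
        k += 1
        term *= lam / k     # probability of exactly k failures
        cumulative += term
    return k


if __name__ == "__main__":
    # Hypothetical example: 20 disk drives, 500,000-hour MTBF, spares restocked yearly.
    print(expected_failures(20, 500_000, 8760))
    print(spares_needed(20, 500_000, 8760, confidence=0.99))
```

The replenishment interval matters as much as the MTBF: the longer it takes to restock a kit, the more spares it has to hold to give the same level of protection.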
10.5 Disaster recovery plans System administrators can do a great many things to prevent their systems from failing, using typical failure scenarios as a basis for their system design. A disaster can't be classified as a typical failure. It's fair to say that when a disaster occurs, it is totally random and out of anyone's control. Similarly, the damage that occurs as a result of a disaster is totally unpredictable. That being said, certain areas in the world are noted as being more susceptible to natural disasters. For example, San Francisco is an area in which earthquakes occur on a regular basis; on the East Coast of the United States, Florida is known for hurricanes. For the United States, those two areas are definitely extreme examples. But a disaster can occur just about anywhere and without notice, as we all painfully learned on September 11, 2001. The severity of a disaster can range from an event within your building to a regional disaster. It is important for the system administrator of a high-availability system to develop a disaster recovery plan for the data facility. It is much better to work on developing a plan before an emergency occurs, when people can think calmly and do a what-if analysis or even have experts review the plan. Once disaster strikes, emotions and tempers will prevail over cool-headed logical thinking. The disaster plan you develop should address all the issues we have presented here, using the risk-analysis approach we mentioned. Tradeoffs need to be made, but it is important that everyone involved agree to the tradeoffs so there won't be surprises later on. The plan that you develop should be available as both a printed and an online document. This document doesn't need to be fancy. A simple text file will suffice and has the advantage of being universally accessible using the most primitive tools available on any computer.
This book is basically about buying insurance for your company's data processing needs. Most of what we have said is a matter of identifying the risks, evaluating the cost associated with those risks, and finally deciding how much insurance to buy. This leads to the question of what preparations should be made in case of a natural or man-made disaster. The size of your company and the number of offices and their locations are factors in deciding on the approach to take for disaster recovery. As you can see in Figure 10.3, if your company has two or more data centers located in different geographical areas, then using data mirroring between clusters located at each data center is a good solution for local disasters. This solution will work only if the two sites are located far enough apart that a disaster at one location would not likely affect the data center at the other location. High-speed data circuits connecting the two sites would also be required to allow the mirroring of data in real time between clusters as shown in Figure 10.4. In addition, the clusters located at each data center would need to have the same software applications installed and enough reserve processing power to assume the workload from the site that was down.
Figure 10.3: Data mirroring between remote cluster sites.
Figure 10.4: Decision tree for protection against a disaster.
If your organization has only one location, there are still good options available to you. One approach would be to write a contract with a disaster recovery service company that would give you access to a computer system similar to the one you have installed at your site. The way this service works is that when a disaster strikes, you take your last good backup tapes to the disaster recovery company's site and load your company's data and applications on their computers. There are two issues that you should be aware of with this scenario. First, the company you select for disaster recovery services should be located away from your geographical area, for obvious reasons. Second, these disaster recovery companies sell their services over and over again. They make their money by betting the odds that only a very small percentage of their customers will have an actual disaster. Based on that assumption, they do not maintain a backup server for every contract they write. The problem with this is that if a disaster affected a large geographical area in which they had written many disaster recovery contracts, they would not be prepared for a large number of companies all showing up at the same time with their backup tapes looking for help. The question of who will get service under circumstances like these should be agreed on in advance and documented in your service agreement with them. You would definitely not want to be the one to be told that they could not help your company because they had already exceeded their capacity. No matter which of the approaches your company decides to go with, a disaster recovery plan must be written. Every bit of information that would be needed by someone trying to duplicate your data processing facilities needs to be documented in the disaster plan. The hardware that you are using should be documented in sufficient detail that someone could build an exact duplicate. This includes all system BIOS settings and the interrupt and I/O address settings on adapter cards. Developing a disaster recovery plan and storing it on the shelf in your computer room is just not enough. The people who may be required to implement the plan need to be trained and hopefully have an opportunity to practice the procedures documented in the plan. In addition, the training cannot be a once-in-a-lifetime event. As new people are hired into the group or people's memories get fuzzy, the training sessions will need to be repeated as necessary.
The training process also helps to verify that the plan that you develop is complete and understandable by people other than those who wrote it.
10.5.1 System maintenance plan Because of the low cost of PC equipment and its commonplace proliferation throughout corporate enterprises, we have observed a somewhat complacent attitude toward the care and feeding of "PC class" servers. Part of this comes from the fact that most people who own PCs simply open up the box, connect the keyboard and mouse, and then slide everything under the desk, forgetting about it until it fails a year or two later. Fortunately, the PC equipment being manufactured today is pretty reliable for the most part. Because of that, many people seem to be comfortable dealing with failures if and when they occur. That's fine for some types of users and applications. When it comes to administering a high-availability system, the prime objective is to prevent the failure from occurring in the first place! The best way that we know of doing that is to develop and implement a well-designed maintenance plan customized to your particular system hardware and software configuration. The maintenance plan that you come up with must balance the requirement for 100 percent uptime with the need to have access to individual servers for short periods of time on a periodic schedule throughout the year. Unfortunately, the tendency is to put off doing maintenance as something that can be caught up with later when things are not so hectic. A lot of us have surely heard that phrase before! The overall plan that you prepare and send to management needs to include a commitment to good system administration practices.
10.5.2 Maintenance checklist It is important for you to develop a maintenance checklist that is specific to your site. Each cluster installation within your organization should have a set of documents that are specific to that system's hardware and software configuration. Even if your organization standardizes on hardware and software, it is still important that each system have its own set of records. This is important because some clustering solutions require that each computer system in the cluster be "exactly" the same. When we say "exactly," we mean down to the revision date of the firmware in the BIOS chips. If this is a requirement of your clustering product, then just having the same models of computer hardware will not be sufficient. It's quite possible to have the same computer model manufactured on different dates containing different versions of the BIOS firmware. Anytime maintenance is performed that necessitates the replacement of hardware that contains BIOS firmware, the version levels written on the BIOS chips should be recorded in your maintenance checklists. This information can also be invaluable when debugging other interoperability problems, which is something you will spend more of your time doing as the number of nodes in your clusters grows. A well-maintained maintenance checklist will also be an invaluable tool for any new personnel attempting to come up to speed. The checklist will help newcomers understand the specific hardware and software configuration that they will be responsible for. At the same time, they will be able to see a history of cluster failures and, more importantly, the solutions. The old saying "learn from your mistakes" definitely applies here. We have found that there are a few things about computers that you can't learn from books. By recording your experiences for new hires to learn from, you will make their job a little easier, which in turn allows you to move on to a bigger, more challenging job and hopefully a lot more money.
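Once firmware revisions are recorded per node, checking the "exactly the same" requirement can be automated. The following is a minimal sketch, assuming the checklist data has been transcribed by hand into a dictionary; the node names, component names, and revision strings are hypothetical, and nothing here reads revisions from real hardware.

```python
from collections import defaultdict


def find_revision_mismatches(inventory: dict) -> dict:
    """Report components whose recorded revision differs between nodes.

    `inventory` maps node name -> {component name: revision string}.
    The result maps each mismatched component -> {revision: [node names]}.
    Components with a single revision across all recorded nodes are omitted.
    """
    by_component = defaultdict(lambda: defaultdict(list))
    for node, components in inventory.items():
        for component, revision in components.items():
            by_component[component][revision].append(node)
    return {
        component: {rev: sorted(nodes) for rev, nodes in revisions.items()}
        for component, revisions in by_component.items()
        if len(revisions) > 1
    }


if __name__ == "__main__":
    # Hypothetical checklist entries for a two-node cluster.
    checklist = {
        "node1": {"system_bios": "4.06", "scsi_adapter": "1.23"},
        "node2": {"system_bios": "4.08", "scsi_adapter": "1.23"},
    }
    for component, revisions in find_revision_mismatches(checklist).items():
        print(component, revisions)
```

Running a check like this after every hardware replacement keeps the comparison from depending on someone remembering to read two checklists side by side.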
10.5.3 Test plan The test plan that you come up with does not have to be elaborate, but it is a good idea to describe the steps that can be used to verify proper operation of the system. Because of the nature of clustering, you will inevitably end up with a lot of computer systems interconnected together and also connected to outside sources. It can become really intimidating for a new system administrator to walk up to a cluster for the first time. A test plan should be developed to help someone systematically verify that all of the connections between the systems in the cluster are functioning. Given the complexities involved with interconnecting cluster hardware, a well−designed test plan will, at a minimum, help to jog your mind. Further, it will be a big help to the next person to come along who is not likely to have a clue as to how your cluster is configured and what its peculiarities are. The test plan should include liberal use of graphics to illustrate how the cluster is cabled up. It should also include other information such as system ID and network addresses that will help in the debugging of communications problems. If you discover any other useful information to help in isolating problems, it should be included in the test plan. You should consider the test plan as a living document, to be updated as new tricks are discovered that will simplify the maintenance of the cluster.
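Part of a test plan like the one described above can be written down as a table of connections plus a script that walks through them. This is a minimal sketch, not a complete cluster test: it only checks that a TCP connection can be opened, and the link labels, addresses, and port number in the example are hypothetical placeholders for whatever your own plan documents.

```python
import socket


def tcp_probe(address: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to address:port can be opened."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_test_plan(links, probe=tcp_probe):
    """Check every (label, address, port) entry and return the labels that failed.

    `probe` can be swapped out, for example to test the script itself
    without touching the network.
    """
    failures = []
    for label, address, port in links:
        ok = probe(address, port)
        print(f"{'PASS' if ok else 'FAIL'}  {label}  {address}:{port}")
        if not ok:
            failures.append(label)
    return failures


if __name__ == "__main__":
    # Hypothetical entries; replace with the system IDs and addresses in your plan.
    plan = [
        ("node1 public network", "192.0.2.11", 445),
        ("node2 public network", "192.0.2.12", 445),
    ]
    run_test_plan(plan)
```

Keeping the list of links in the same document as the cabling diagrams means the script and the drawings are updated together when the cluster changes.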
10.5.4 Simulated failures Just as your fire department conducts a fire drill, you should conduct your own cluster fire drill. First, it is certainly a good way to test whether the disaster recovery plan that you developed will actually work. If there are shortcomings in your disaster recovery plan, it is much better to discover them during a drill than during a real disaster. Second, testing proves to yourself and, more important, to your management that the clustering hardware and software will actually work as designed and justify the company's investment. When your cluster grows larger than two nodes, the amount of hardware involved can become overwhelming. The cabling behind a cluster or under the floor (in case the data center still has raised floors) can easily turn into a rat's nest of wires. Simulating failures, as called out in your test plan, will allow you some "play time" with the cluster to become more familiar with all the networking and cluster hardware. This hands-on time will allow you and your staff to become more familiar with the way the cluster is constructed. It also allows for one more sanity check to ensure that what was designed was actually implemented. The end result is that all involved will walk away from the test feeling much more confident that they and the cluster are ready whenever a failure occurs. In planning for a simulated failure test you should make a list of all possible things you can think of that could go wrong. You might even have some fun challenging the people in your group to post all the ideas that they come up with. The following are some of the things that could happen in your data center that would cause a server to fail:
• Someone trips over a power cord going to one or more of the servers or RAID arrays in your computer room.
• The air conditioning system fails overnight or on a weekend.
• A Fibre Channel or SCSI data cable is pinched under a raised floor panel, causing an intermittent failure.
• A water line breaks or the roof leaks during a particularly harsh storm and floods your data center.
• A filter capacitor catches fire in a power supply inside a server and damages multiple circuit boards and wires.
• Construction workers short out high-voltage power lines, causing a power outage in your building.
• A system operator accidentally causes a server to reboot.
This list gives just a few examples of the failures that you might see. The cost of protecting yourself against some of these failure scenarios could be more than what your company was willing to invest. At least by going through this exercise, you will be able to identify the threats and then determine whether you have a means to protect your system from that particular threat. In some cases, it is an easy decision, but there are times when the cost of protection against a failure simply can't be justified. For example, having the power company install redundant power distribution lines from different power grids to protect you from a major power outage is very expensive. An alternative is to install your own diesel generator. If you are lucky enough to have another office located in another power grid, the simplest solution would be to just failover to the remote cluster. As you can see, there are no right or wrong answers. The solutions that you come up with will be specific to your company's needs. You won't find a one−size−fits−all solution when it comes to clustering.
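The trade-off between the cost of a failure and the cost of protecting against it can be put into rough numbers. The sketch below uses a simple expected-loss estimate (how often the event happens per year, times how long the outage lasts, times what an hour of downtime costs). That formula is an assumption made for illustration, as are the function names and every figure in the example; real estimates have to come from your own business case.

```python
def annual_expected_loss(events_per_year: float, outage_hours: float,
                         cost_per_hour: float) -> float:
    """Rough yearly loss from one threat: frequency x duration x hourly cost."""
    return events_per_year * outage_hours * cost_per_hour


def worth_protecting(events_per_year: float, outage_hours: float,
                     cost_per_hour: float, annual_protection_cost: float) -> bool:
    """True when the expected yearly loss exceeds the yearly cost of protection."""
    loss = annual_expected_loss(events_per_year, outage_hours, cost_per_hour)
    return loss > annual_protection_cost


if __name__ == "__main__":
    # Hypothetical threat: a building power outage once every two years,
    # lasting 8 hours, at $10,000 per hour of downtime.
    print(annual_expected_loss(0.5, 8, 10_000))
    # Hypothetical generator costing $25,000 per year to own and maintain.
    print(worth_protecting(0.5, 8, 10_000, 25_000))
```

A comparison like this will not make the decision for you, but it gives management a common basis for agreeing on which threats to insure against and which to accept.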
10.6 System design and deployment plan An operating system is only as good as the hardware it runs on. Windows NT/2000 does a good job of running on just about any hardware currently available, but it can't work miracles. System administrators need to do their part to make sure that Windows has reliable and available physical resources. A system administrator must be aware of the hardware and system configuration choices that can have a major impact on system performance. There are particular areas where you can derive a major payback by making the correct choice for your system. The following list names the areas that we feel you should pay particular attention to:
• Selecting well-engineered components (motherboards, I/O adapters, etc.)
• Optimizing the server-to-network interface hardware
• Proper configuration of Microsoft networking protocols
• Sizing and optimizing the disk subsystem
• Tuning Windows NT for specific roles
A cluster of Windows NT/2000 servers should be carefully assembled from high-quality hardware that has been tested together to perform with a high degree of reliability and capability. Trying to save a few dollars on hardware for a cluster just does not make sense, assuming your goal is to deliver the highest availability and performance possible to your users. If you have determined that your business could benefit from a cluster solution, then you have probably based that decision on the dollar amount your company would lose if there were a system outage. It makes more sense to base your savings on preventing system outages than on the small amount of money you might save by shopping around for a "good buy" on hardware. Many of us can remember the difficulty we had early on assembling Windows NT systems from commodity hardware. Sometimes it worked, but other times we wished we had checked the hardware compatibility list before we got started. Microsoft, along with the early adopters of MSCS, tested specific system configurations to ensure that the hardware and the Cluster Service software would work together correctly. The hardware configurations that passed Microsoft's testing procedures are listed in a cluster hardware compatibility list. Clusters came to us from the minicomputer world, where vendors such as the former Digital and Tandem designed special (also sometimes referred to as proprietary) cluster hardware to work with their own TruCluster and NonStop cluster software. Even now, Compaq is one of the few companies that have both their own operating system software and hardware engineers working together to design the systems on which the OpenVMS and Tru64 UNIX operating systems run. The advantage companies like Compaq have is that they have complete control of their systems and can deliver a tightly integrated and fine-tuned package. For example, the developers of the OpenVMS operating system worked hand in hand with the engineers who developed the VAX hardware on which it ran.
Close cooperation like this leads to a tight coupling between hardware and software, allowing the design team to implement the functionality where it makes the most sense. Sometimes it might be better to implement something in software, and other times it might make more sense to build the hardware. These trade-offs allow the team to achieve the optimum mix to meet system cost and performance goals. Contrast this with what is happening in the PC market today. Microsoft developed the Windows operating systems that run on generic Intel processors and motherboards. Then hundreds of vendors get system board reference designs from Intel and manufacture motherboards by the millions. In such a market, it is next to impossible for vendors to do any serious customizing (value-added designs) of their hardware designs without being incompatible with the standard implementation of the operating system. That might explain why so many of the hardware comparisons done in PC trade magazines show that many of the vendors' products compare very closely with one another in performance. In fairness, it must be said that Microsoft does host industry meetings between hardware and software developers to put issues like these on the table for open discussion. But bear in mind that there are as many agendas as there are attendees. Many people labor over the decision as to whom they should buy from based on the wrong criterion: hardware. Today, much of what you buy is of similar design and quality, so what really distinguishes one product from the next tends to be the vendor's level of service and support. Even more importantly, you should consider the vendor's methods of selecting and qualifying the components used to build its systems. The reason is that the components that are used by many of the major vendors come from Asian manufacturers that build to order for large-volume OEMs.
As mentioned previously, the larger OEMs with an experienced engineering staff can specify the quality standards they want for the components that they use in their products. Other system integrators get whatever comes off the boat. The disk storage subsystem is one area where a vendor can add real value by qualifying and testing COTS disk drives. Believe us, you will really begin to appreciate the complexities involved with disk storage subsystems if you ever have the opportunity to attend one of Richie Larry's presentations on Storage Works. Richie was the system architect for Compaq's Storage Works products. Talk about being picky! Before going to one of Richie's presentations, we would have said, "What's the big deal, grab a disk and bolt it into the cabinet, FDISK it, FORMAT it, and load the OS without thinking twice." But Richie gives you a better appreciation for the engineering that goes into a quality disk array. There are many design issues your vendors should be taking great pains to address. Examples include heat generated by the drive and ways to dissipate it; power consumption and ways to deliver "clean power" to the drives; and vibration generated by a drive during seeks and how it affects adjacent drives in the cabinet. These are examples of the "value-added engineering" you should look for from system integration vendors. System integration vendors such as Compaq and others take standard COTS hardware from many different manufacturers and integrate those components into a functional and reliable system for you. The value that they bring to the table is making sure that all the different components that make up a system work together. In the days before the PC generation, customers who wanted a cluster simply called their friendly salesperson and said, "I want to buy a cluster." A truckload of boxes would show up 60 to 90 days later, and a few days afterward a team of field service personnel would arrive to connect everything together, burn it in, and then test it. By today's standards that may seem a little extreme, and you might hear some people say, "that's just a PC, anybody can do it." The bottom line is that the customer got a system that would run and run and run, at least most of the time. Today, given the PC mindset, it is very common for vendors to assume that end users will unpack, install, and perform their own initial system test. That might be fine for a single system, but what would you do if you were faced with a 4-, a 6-, or even a 16-node cluster to install?
10.6.1 Vendor "value−added" approach There are many vendors who are either already or soon to become players in the Windows NT Cluster Server marketplace. The big question is how will they distinguish themselves from their competitors? If Windows NT Cluster Server becomes a totally COTS software and hardware solution, then what difference does it make whom you buy from? Obviously, the vendors don't want to leave your purchasing decision to chance. With the release of Cluster Server Phase 2, you have many more configurations to pick from as more vendors jump on the bandwagon and qualify their cluster hardware. For example, you can pick from preconfigured turnkey systems, or you may decide to upgrade existing standalone hardware to a cluster. Naturally, the standalone systems must already be on the cluster hardware compatibility list. Another option some vendors may choose is to enhance the capacity and functionality of Windows NT Cluster Server by building their own proprietary hardware and writing the necessary software drivers. The area where you will probably find these products appearing is that of the cluster communication interconnects. You might say, "but that means proprietary stuff." Yes it does, but if the technology is successful and there are plans to license it to other system integrators then, who knows, it might become the next industry standard. The real question is do you want to buy the "one size fits all" product or a solution that is optimized for your particular application? Some of the solutions being worked on by different vendors have the potential of providing significant enhancements to basic cluster technology being deployed today. As competition flushes out the best solutions, one of these solutions will dominate the market.
Glossary

A-C

Active/active
The term active/active refers to a configuration allowing two instances of the cluster application. But only one instance of the application has data access at one time. Two disk controllers can simultaneously process I/O commands sent from one or more host computers to an array of disks. The computer systems are tasked with synchronization of the access. If one RAID controller in an active/active with failover configuration ceases to operate properly, the surviving RAID controller automatically assumes its workload. The application need only enable the data selection to "resume" activity. The advantage of this type of configuration is that both computers are doing real work. This translates into reduced cost for the total system.

Active/passive
See Active/standby.

Active/standby
In an active/standby configuration only one computer or node is actively processing all of the users for the cluster. The other computer, in this two-node configuration, is not doing any processing work at all. It is strictly in a standby mode waiting for a failure in the primary computer. If the primary computer fails, the standby node will immediately take over processing responsibilities. Active/passive goes the active/standby one better. With active/passive, the passive server is used for additional duties such as noncritical file or application serving.

Address failover
The virtual server's network name and IP address are automatically transferred from the failing server to a surviving cluster node. The process of address failover is completely transparent to client nodes and does not require any software changes or additions to network clients.

Alias
The computer name that is used to represent a clustered system of computers is known as the cluster alias. The cluster alias name, unlike LanManager's computer names, is really a virtual computer name since it does not refer to a physical computer.

Arbitration
The process of determining which computer or cluster member should take control of a Quorum Resource after the cluster's state has become indeterminate.

Automatic failover and failback
Clustering software on Windows NT/2000 server facilitates and manages the cluster environment and the applications that are running. If a failure is detected, the clustering software will automatically fail over running applications and their dependent resources to other healthy nodes in the cluster.
If and when a failed server comes back online, the clustering software will automatically attempt to move the applications that were originally running on that server before it failed back to it. This behavior is under the control of the system administrator and can be easily customized.

Availability
Availability is the quality of the system's response to a user or process request. Consider this scenario. You walk into a good restaurant on a Saturday night (without reservation) and ask for a table, and you get "right this way" for a response. Actually, this is an example of High Availability.

Bus
The term bus refers to the interconnect medium between one or more devices in a computer system. Physically, a bus could be constructed of either copper wire or a light-conducting fiber. Information to be transferred across a bus could be sent in a serial fashion using one conductor, or the information could be transferred in parallel using eight or more conductors.

Channel
The term "channel" has traditionally referred to a very high-speed I/O interface between an IBM central processing unit and high-speed devices such as disk farms or communications devices. Likewise today, the term channel is used as part of the name of a new technology called Fibre Channel and InfiniBand. These are two technologies used to connect components that make up a computer system.

Cluster
A computer cluster is a system of two or more independent computer systems and storage subsystems intercommunicating for the purpose of sharing and accessing resources.

Cluster address
This is the 32-bit Internet Protocol address shared by the cluster members. What a convenience! Like a shopping center offering one-stop shopping, the cluster address allows users to worry only about a single address for things like SQL, Exchange, and Office Applications.

Cluster alias
See Alias.

Cluster APIs
Probably the biggest contribution that Microsoft made to the Windows NT/2000 market was a standard set of software APIs. This is very significant since it allows third-party software developers to use an "open" standard along with Microsoft to enhance applications running on Microsoft's Cluster Service.

Cluster attributes
Cluster attributes are features or functions that, while desirable, do not directly contribute to clustering by our definition. An example of this is cluster management software.

Cluster Group
A Cluster Group is a collection of physical and logical resources that exist on a node in a cluster. These resources typically have dependencies on one another. It is easier to understand this by thinking of it as a dependency tree. If one resource in the tree fails, then the other resources that depend on it would also have to fail. Because of this, all the resources contained in a Cluster Group must fail over together. As such, the Cluster Group is the lowest level that is managed by the Cluster Failover Manager.

Cluster lite
Cluster lite is a name that could be given to a system that has cluster components and even one or more (but fewer than three) cluster subsystems. The term cluster lite (or cluster wannabe) is in no way meant to be disparaging.

Cluster member
Only Windows NT/2000 servers with cluster software installed can be cluster members. Windows NT/2000 Workstation and Windows 95 cannot be cluster members. In order for a Windows NT/2000 server to become a member of a cluster you must first install Cluster Service and then tell it which cluster it should join. This action identifies to the Cluster Manager that the information for a new node should be added to the Cluster Database. Being added to the Cluster Database only means that the "cluster" knows about that node.
A cluster member can be in one of three states in a cluster. If the node is fully participating in the cluster, it is said to be an "online" member. On the other hand, if the node has failed or was manually shut down for maintenance, then the node is said to be "offline." The third state is somewhat of a hybrid state. It is referred to as "paused." A "paused" cluster node performs all the same functions as an "online" member, except that it does not provide services to network clients.

Cluster name
This is the name assigned to represent the collection of computer systems that make up a cluster. It is also referred to as a cluster alias name.

Cluster plus
A cluster plus is a cluster that offers transparent failover. A cluster system offering this class of clustering would be capable of transferring any or all applications from one cluster computer member to another with no apparent (at least to the user) latency. An entire cluster member could fail, and the user would never know.

Cluster regroup
A cluster regroup occurs after one of the nodes in a cluster is either manually removed from the cluster or just fails. The remaining nodes in the cluster start the process of discovering what nodes are alive and well and then begin arbitrating for control of cluster resources, such as the quorum disk.

Cluster resource
A resource in a cluster can be either a physical device or a logical service. For example, applications accessed from network workstations, file shares, IP addresses, and the computer network names used by virtual servers are all logical resources. Examples of physical resources would be a cluster shared disk storage device or a communications device that is shared in a cluster. Cluster
resources are assigned by the system administrator to a specific node in the cluster. The resource can be accessed or serviced only from one node in the cluster at a time. If the cluster node that was assigned to offer the resource as a service fails, then another node will assume the responsibility of providing the resource to the client workstations.

Cluster Service
The collection of software modules that together implement clustering on Windows NT/2000 is referred to as the Cluster Service. Every node in a cluster runs Cluster Service locally. Each instance of Cluster Service communicates with the Cluster Service running on every other node in the cluster. Together, the Cluster Services running on each node are what actually implement a cluster.

Cluster-aware applications
Applications that have been specifically written to take advantage of the services available while running in a cluster environment are said to be cluster aware. Applications can call standard cluster APIs to determine the state of the cluster and to control their own executing environment. In addition, applications can take advantage of standard "cluster middleware" that can provide standard ways for processes running on the nodes in a cluster to communicate between themselves.

Clustering
You won't get off this easy; read the whole book!

Communication Manager
This software module is responsible for all communications between any of the nodes in a cluster. The Communication Manager provides a standardized set of communications services that other software modules that make up the Cluster Service use to communicate with their peers on other nodes in the cluster. The Communication Manager acts as a communications middleware for the entire cluster. Examples of the types of communications services provided by the Communication Manager include keep-alive protocol, resource and membership state transitions, and cluster database updates.
Configuration database
The configuration database is used to track the state of the cluster. This information is distributed in real time to all cluster member nodes. The standard NT Registry architecture is used for storing the state of the cluster. Information contained in the configuration database is both static and dynamic in nature.

Crossover cable
The twisted-pair Ethernet cable that is used to connect the two computers in a two-node cluster is referred to as a crossover cable. This cable is the twisted-pair Ethernet equivalent of an RS-232 null modem cable. It makes the appropriate connections so that the transmit signals from one computer go to the receiving inputs on the other computer.
D-H

Differential SCSI bus
There are two types of electrical signal interfaces that are used in the SCSI bus. The most common and least costly to use is the single-ended SCSI bus. The differential SCSI bus is more costly for the manufacturer due to the need for additional circuitry to implement differential signaling. The cost factor has been the main reason for the limited deployment of differential SCSI. Unlike single-ended SCSI, which uses a single wire to transmit data and a return ground, differential SCSI uses a pair of wires to carry the signal. As the name implies, the output is determined by the relative difference in voltage between the two wires that make up the pair. If electrical noise is introduced into the pair of wires, both wires will carry the same level of noise. On the receiving end, the differential receiver circuitry sees the same noise voltage on both wires; because it responds only to the difference between them, the noise cancels out and produces no output. This inherent immunity to electrical noise makes differential SCSI
ideal for electrically noisy environments. Its higher immunity to noise allows differential SCSI to work over longer distances and at higher bus speeds. The Low Voltage Differential SCSI bus technology has gained acceptance due to the industry's need for higher bus clock frequencies and its lower cost to manufacture than its predecessor, the High Voltage Differential SCSI bus.

Disk array
A disk array consists of two or more physical disk drives that are managed by either software or hardware so that the operating system sees the physical disks as one logical disk. In the context of clustering, a disk array is a stand-alone cabinet containing two or more physical disks, redundant power supplies, and an intelligent disk controller that implements the RAID standards.

Disk failover
The process of transferring control of a physical disk resource from one node of a cluster to another is called disk failover. The process of failing over a disk drive in a cluster requires that the Cluster Service perform numerous checks of the cluster environment before allowing the transfer of control from one node to another. Failure to do so would seriously jeopardize the integrity of data on the disk.

Disk mirroring
This is the process whereby an exact copy of the data on one disk is copied to another disk. This process takes place transparently to user applications and is implemented either as a software service to the operating system or by a hardware controller.

Disk Spanning
Even though Disk Spanning does not provide any performance or reliability benefits, it does improve the manageability of independent disk drives. Disk Spanning allows multiple independent disk drives to be logically combined and presented to the users as a single drive. For example, if three physical disks were installed in a system, they might appear as drives C:, D:, and E:. If Disk Spanning were implemented on the system, then the user would see only one logical drive C:.
This one logical drive has the combined capacity of the three drives in our example.

Disk striping
This is also known as RAID 0. Disk striping is usually implemented when the primary goal is to maximize the disk subsystem I/O performance. Disk striping can improve performance by breaking a disk I/O operation into multiple segments and writing those segments to multiple disks concurrently. It should be pointed out that this does not provide any redundancy for the data. If one drive in a striped set fails, then all of the data in the striped set would be lost.

Distributed lock manager (DLM)
The job of the distributed lock manager is to allow multiple users on multiple nodes in a cluster to access a single file or record. The initial release of Windows NT/2000 clustering does not provide a distributed lock manager. Software vendors such as Oracle Corp. have developed their own distributed lock managers to support concurrent access within their application environment. Compaq (Digital Equipment Corp.) had developed a DLM for its OpenVMS Clusters.

ECC memory
ECC stands for error-correcting code. Error-correcting memory allows for recovery from single-bit errors in memory. The recovery from a memory error is automatic and transparent to the user when error-correcting memory chips are installed in the computer. Without it, the system will halt if it detects a memory error, or the software will crash. Using error-correcting memory will greatly enhance the availability and reliability of a server.

Failback
After a cluster node that has failed rejoins the cluster, the services that were originally assigned to that node will be restarted on the original node according to policies determined by the system administrator. The process of transferring and restarting the applications when a node rejoins the cluster is called failback.
The system administrator can decide to have the services restarted immediately or wait for a later time when the impact to users would be less.

Failover
This term is used to describe the process that MSCS uses to transfer control of cluster services from one cluster node to another cluster node. This is the basic mechanism used by Windows NT/2000 clusters to provide high availability of resources (disk, print, applications, etc.) to client nodes. In the best-case scenario, a user of the cluster will not even be aware that a failover occurred. The worst-case scenario would require the user to wait a few minutes and then reconnect to the service.

Failover time
Failover time is a term used to describe the time it takes the cluster to detect that a failure has occurred and to restart the necessary Cluster Groups on another node in the cluster. This time varies quite a bit depending on the type and number of resources in the Cluster Groups being failed over.

Fast SCSI
The SCSI-2 specification allows bus transfer rates up to 10 MHz. SCSI bus speeds between the original SCSI specification of 5 MHz and the 10 MHz specification for SCSI-2 are referred to as either Fast SCSI or Fast/10.

Fault resistant
For many users a fault-tolerant solution is cost prohibitive. A fault-resistant architecture is basically a cost tradeoff on the level of availability that the system has. In a system designed to be fault resistant, the failover time is typically measured in terms of seconds or minutes as opposed to a fault-tolerant system, where the failover time is instantaneous.

Fault tolerant
Fault tolerance means "resistance to failure." The term "fault tolerant" is used to describe a system that is capable of surviving a single hardware or software failure and still be able to continue providing normal services. Fault tolerance is typically achieved by providing duplicate hardware that can take over in the event of a failure in the primary hardware.

Fibre Channel
The American National Standards Institute (ANSI) is responsible for an integrated set of standards and protocols that define Fibre Channel.
The specification allows for multiple types of physical media, support for multiple transport protocols, and various transmission speeds. It also allows for significant scalability in both cable distance and the number of nodes supported.

File level locking
When more than one user needs to access a common set of data, the operating system must implement some method of policing read and write access. File level locking is one of the methods used to manage data access by multiple users on a system. This method controls access to the data by allowing a user to lock out all other users on the system from accessing the data contained in a particular file. The problem with this approach is that if the file is a very large database shared by many users on the system, then everyone will have to wait until the single user finishes the update operation and releases the file. Another approach is record level locking. Its advantage is that a user only has to request a lock on the individual record being worked on at a given time and not the entire file. This allows more users to access the data set simultaneously.

Groups
From the point of view of MSCS, a Group is the basic unit of failover that is managed by an MSCS cluster. A system administrator of a cluster would set up Groups to help manage multiple resources that have multiple interdependencies. A Group consists of multiple cluster resources that are logically organized into dependency trees. If MSCS determines that a failover is required, all the resources defined for a Group would be moved from one server to the other.

Heartbeat
The heartbeat is really nothing more than a simple communication protocol used by the nodes in a cluster to communicate between themselves. The purpose of the heartbeat protocol is to determine whether a cluster node has failed.
This is accomplished when one node in the cluster sends a message to another node in the cluster that says, "hello are you there?" The other node must then respond in a prescribed amount of time by acknowledging the message. If a node does not respond to a heartbeat
message in the prescribed amount of time, it is assumed to have failed.

High availability
The term "highly available" alludes to an instantaneous response (availability) to a request. You walk into a good restaurant on a Saturday night (without a reservation) and ask for a table, and you get "right this way" for a response. This is an example of high availability. The term high availability is also used to describe a standard computer system that has been enhanced with software and/or hardware for the purpose of reducing the time to access data when a failure occurs. The time it takes to re-establish services after a failure occurs is measured in seconds or minutes.

Host adapter
This refers to any I/O interface that is installed in a server. For example, in an MSCS cluster there will typically be at least three host adapters for external devices: the shared SCSI bus, the Ethernet adapter used for the cluster interconnect, and the connection to the enterprise LAN.

Hot spares
If cost is not important to you but availability and performance are, then a hot spare type of solution would be of interest to you. The term "hot spares" refers to a system configuration that is composed of two completely redundant computers. The computer that is referred to as the hot spare is powered up and runs an exact copy of the operating system and the applications that are running in the primary computer. Typically, both computers will be synchronized so that the operating system and applications run in lockstep with each other. In the event the primary system fails, the hot spare or backup system is capable of immediately taking over the workload from the primary system.
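The timeout rule described under Heartbeat can be sketched in a few lines of Python. This is only an illustration of the idea, not the MSCS implementation: the class name `HeartbeatMonitor`, its methods, and the injectable clock are all hypothetical, and the actual message exchange between nodes is left out.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: remembers when each peer last acknowledged a heartbeat."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # the "prescribed amount of time"
        self.clock = clock                      # injectable so the sketch can be tested
        self.last_ack = {}                      # node name -> time of last acknowledgement

    def record_ack(self, node):
        """Call when a node answers the "hello are you there?" message."""
        self.last_ack[node] = self.clock()

    def failed_nodes(self):
        """Nodes whose last acknowledgement is older than the timeout are assumed failed."""
        now = self.clock()
        return sorted(
            node
            for node, seen in self.last_ack.items()
            if now - seen > self.timeout_seconds
        )
```

A node that keeps acknowledging stays off the failed list; one that goes silent for longer than the timeout appears on it, at which point a real cluster would begin a regroup.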
I-P

I2O
The I2O specification defines a new software and hardware architecture design for connecting high-speed peripherals. It achieves this by dividing the workload between the host and the peripherals. In addition, it utilizes an I/O processor chip to offload the host processor. Although this concept is new to PC class servers in use today, it has been in use for years in the mainframe and high-performance minicomputer world.

InfiniBand
InfiniBand is an emerging technology that will be used to interconnect the components of a computer system. It is a very high-speed interconnect that can be used to interconnect CPUs, memory, and I/O adapters as well as peripheral devices. Due to its very high speed and scalability, it will allow computer architects to build systems with only one bus.

Interconnect
The interconnect subsystem consists of at least two components: (1) the controllers that provide interconnection between the two computers and the storage subsystem and (2) the "intelligence" or software/hardware combination that could address a "failover" situation at the interconnect level. Cluster nodes need to be in constant communication with each other. In a shared nothing cluster such as MSCS, the volume of message traffic is low, but the message latency must be very short. A private communications network is used for communications between cluster members (no end-user data is transmitted on this link). This communications link between cluster members is referred to as the Cluster Interconnect or simply the Interconnect.

I/O operation per second rate (IOPS)
This is the number of I/O transactions per second that can occur on a bus or aggregate of buses on a server.

IP failover
The ability to automatically move an IP address from one cluster node to another is referred to as IP failover. This is a fundamental capability that is required so that a user's workstation will not require any software additions to its IP stack and can still maintain connectivity to cluster services after they have failed over.

Latency
Latency is the system overhead that occurs when data is moved between nodes in a cluster. A well-designed SAN reduces this overhead by minimizing CPU cycles and by providing fast data transfer between all cluster resources attached to the SAN.

Load balancing
As client workstations come online and connect to a cluster, they will be automatically connected to the cluster node that has the lightest workload. Not all cluster implementations currently support true load balancing. The ones that do use a sophisticated algorithm that determines the user load based on many factors.

Lockstep
When two independent computers are synchronized so that they both execute the same instructions at exactly the same time, they are said to be executing in lockstep. Fault-tolerant systems use the lockstep mechanism to ensure that the hot standby system can immediately take over in the event the primary system fails.

Low voltage differential signaling (LVDS)
The new Ultra 2 SCSI standard is based on the low voltage differential signaling standard. Because of LVDS, Ultra 2 SCSI can communicate at speeds of up to 80 Mbytes per second. The cost of LVDS-based controllers and peripherals should become competitive with existing SCSI technology and even IDE devices over time. This is because low-voltage signaling makes it possible to produce an IC that contains the SCSI control logic as well as the bus drivers on a single chip.

Manageability
A cluster system should be capable of being centrally or singly managed.
Ideally, the cluster manager should be able to access and control the "sharing and accessing of resources" from any point in the cluster.

Memory channel
Compaq Computer Corp. markets this very high-speed communication bus for use with its Tru64 UNIX clusters. Compaq licenses the core technology used in Memory Channel from a third party. Memory Channel achieves its high speed and low latency by using memory reads and writes instead of normal I/O instructions to communicate with other nodes on the memory channel bus.

Middleware
The term middleware refers to a level of software services sitting above the operating system and below user-level applications. The middleware software provides a standard set of services through software APIs to user applications. Typical services that middleware software provides include distributed database services, time services, message passing, etc.

Mirrored Data
One method used to achieve high availability is to make an exact copy of the data on one disk to another disk that will act as a backup in case the first disk fails. This process is known as either Mirrored Data or Data Mirroring.

NAS
NAS or network attached storage consists of an integrated storage system (e.g., a disk array or tape device) that functions as a server in a client/server relationship via a messaging network. Like storage area networks, network attached storage is not "new." This type of storage includes two variations: indirect and direct network served devices.

Offline/online
A cluster member node is said to be online when it is actively participating in a cluster. If a cluster member has failed or was intentionally removed from the cluster for service, then it is referred to as being offline.
Parallel bus
The LPT or printer port on a PC and the SCSI bus are examples of parallel buses. A parallel bus is typically eight or 16 bits wide, and each bit is assigned to a separate wire in the cable.

Partitioned
A cluster is said to be Partitioned when, due to a severe hardware failure, cluster nodes attempt to form two clusters.

Primary Domain Controller and Backup Domain Controller
Inherent in the standard Windows NT/2000 products are features that address high availability. The Windows Domain architecture is one such example. The architecture consists of one machine designated as the Primary Domain Controller (PDC) and one or more other machines acting as Backup Domain Controllers (BDC). In the event that the PDC fails for any reason, a BDC can immediately assume the role of the PDC. With NT 4.x, this process is manual. With Windows 2000, it is possible for all controllers to be peers.

Primary server/secondary server
A server that is designated by the system administrator as the default server in a cluster or other high availability configuration.

Private network
This is a term that Microsoft uses to refer to the network connection between the two nodes of the MSCS cluster. In this book we commonly refer to this network as the Cluster Interconnect. In the near future, the Cluster Interconnect will become known as the server area network as vendors such as Compaq and others release their new high-performance interconnect products. ServerNet and InfiniBand are two technologies that could meet that challenge.

Public network
In MSCS terminology, Public Network refers to the network connection to the building or campus LAN. Client workstations accessing MSCS clusters must do so using the Public Network LAN connection.

Pulling a Group
Pulling a Group is just the opposite of pushing.
The scenario is that the server that was hosting the Group has failed, and the remaining nodes in the cluster detect the failure and determine that the Group needs to be failed over and restarted on one of the remaining nodes in the cluster. The remaining nodes will force the Group to be rehosted on one of the remaining healthy nodes in the cluster.

Pushing a Group
If one of the resources in a Group fails but the cluster node that it is executing on does not, or if the administrator decides that it is necessary to take that server down for service, then that server will decide on its own what should happen to the Groups that are under its control. If a Group has failed more than a prescribed number of times, then the server may decide that there is a problem on the local server and try to transfer the Group to another node in the cluster. In any case, pushing will occur as long as the node is capable of making decisions for itself.
Q-S

Quorum
Quorum is a cluster management parameter whose value determines the cluster system's operability. Typically this operability is derived from the Quorum algorithm: Quorum = (Votes + 2)/2.

Quorum Disk
When a cluster system contains a storage system common to two or more computer systems, the cluster manager can vote on behalf of that storage subsystem. Each computer system of this configuration must be capable of accessing this common storage system. Such a system has a disk known as the Quorum Disk. The disk's presence represents a proxy vote to the Quorum Algorithm.
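As a worked illustration of the quorum formula and the Quorum Disk's proxy vote, the short Python sketch below computes the quorum value and checks whether a set of surviving voters may keep operating. It rests on assumptions the glossary does not state: the division is taken to discard any fractional part, every node and the quorum disk are taken to contribute one vote each, and the function names `quorum` and `has_quorum` are hypothetical.

```python
def quorum(votes):
    """Quorum = (Votes + 2) / 2, with the fraction discarded (an assumption)."""
    return (votes + 2) // 2


def has_quorum(present_votes, expected_votes):
    """True when the votes still present meet or exceed the quorum value."""
    return present_votes >= quorum(expected_votes)


# Two nodes plus a quorum disk: three expected votes, one vote each (assumed).
expected = 3
survivor_with_disk = 2   # one node still up, and it can reach the quorum disk
survivor_alone = 1       # one node still up, but the quorum disk is unreachable

print(quorum(expected))                          # quorum value for three votes
print(has_quorum(survivor_with_disk, expected))  # may this survivor continue?
print(has_quorum(survivor_alone, expected))      # may this one?
```

Under these assumptions the disk's vote is what lets a single surviving node of a two-node cluster continue, while a node that has lost both its partner and the disk must stop.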
See also Quorum Resource.

Quorum Resource
Every cluster must have a unique resource known as the Quorum Resource. There can only be one Quorum Resource at a time within a cluster. The Quorum Resource is used as a "tie-breaker" in determining which node within a cluster should assume control when it is not possible to communicate with other nodes in the cluster. The cluster member that has the ownership of the Quorum Resource is assumed to be in charge of the cluster. A Quorum Resource has two special attributes. First, it must be capable of storing the Cluster Log File. Second, it must support a challenge/defense protocol for arbitrating who gets ownership of the Quorum Resource. In the first release of MSCS a SCSI disk is used as the Quorum Resource because it meets the above requirements.

RAID
This is the acronym for redundant array of inexpensive disks. A RAID storage subsystem is an important part of the puzzle when designing a high-availability system. The RAID architecture uses multiple low-cost disk drives under the control of a very intelligent disk controller that implements the RAID protocol to achieve both an extremely "fault-tolerant" as well as a high-performance disk subsystem.

Reliability
Briefly stated, "reliable" means "sustaining a requested service." Once a user or an application has initialized a proper operation, the system should be able to provide a reliable result. A cluster could provide a reliable result by providing a "failover" strategy. A system that provides failover provides an alternative path to an initialized request.

Resource Monitor
The role of the Resource Monitor is to manage and monitor the status of the resources assigned to it by the Cluster Service. The Cluster Service, by default, will initialize only one Resource Monitor, but in order to eliminate the possibility of a single point of failure it is possible to start up multiple Resource Monitors.
This feature is especially useful when trying to deal with resources that are less than 100 percent stable. The sole purpose of the Resource Monitor is to carry out the commands given to it by the Cluster Service and to report back to the Cluster Service.

Resource dependencies
Consider a shared disk and data. The shared disk must be accessible to retrieve the data. Therefore, the data resource is dependent on the accessibility of the disk.

Resources
Any physical device or logical service that the Cluster Server uses in the process of providing services to client workstations is known as a cluster resource. Examples of typical cluster resources include client/server applications, shared cluster disk, print services, time services, virtual IP addresses, etc. Cluster resources are managed by one of the software modules within the Cluster Service called the Resource Manager. In a shared nothing cluster a resource can run on only one cluster node at a time. The system administrator of a cluster assigns resources to each node in a cluster in such a way as to balance the processing load evenly across the cluster.

SAN
A storage area network is a dedicated network for moving data between heterogeneous servers and storage resources. The acronym SAN is expanded differently depending on whom you are talking to. Storage vendors refer to it as storage area network. Other vendors will use the term system area network or server area network. The bottom line is that what everyone is talking about is a very high-speed network used to connect processors, storage devices, and communications devices together in a cluster. A cluster can use a SAN for its cluster interconnect link. A SAN is similar to traditional LAN and WAN schemes, but SANs are optimized specifically to allow cluster resources to be shared. Any cluster resource that is attached to a SAN, such as servers, disks, or I/O devices, can directly communicate with any other device on the SAN.
SANs are designed for low-latency, non-blocking systems, and provide an any-to-any switching fabric.
Scalability
The term scalability means the system is capable of addressing changes in capacity. A cluster is not confined to a single computer system and can address capacity requirements with additional cluster membership. One reason people turn to clusters is that they want to be able to grow the processing capability of their cluster as user demands increase. Ideally, if your processing needs increase, you could simply add another node to the cluster and your normal processing capacity would increase by an amount equal to the node you added. In reality, a cluster does not scale linearly because for each node added to the cluster there is additional overhead required to manage the new node. The goal of the first release of MSCS (limited to two computer systems) was to address availability more than scalability.

Scale out
When multiple systems are interconnected to form a cluster, it is referred to as "scaling out."

Scale up
Compaq, Microsoft, and others refer to SMP configurations as "scaling up." SMP-based solutions are said to scale up when additional processors are added into the box.

SCSI
This acronym stands for small computer system interface. It was originally based on technology from the IBM Corp. that was later modified by the Shugart Corporation for implementing a low-cost standard interface for its disk drives. It was adopted as a standard by ANSI in 1986. ANSI expanded the proposed SCSI architecture to support many devices beyond just disk drives. Today SCSI supports many devices such as disk and tape drives, scanners, printers, and jukebox storage systems.

SCSI 1
This is the first version of the SCSI architecture that was defined by the ANSI organization in 1986. It is known as X3.131-1986. The first version of SCSI supports only bus transfer speeds up to 5 MHz and a parallel bus that is 8 bits wide. SCSI-1 was adopted primarily by minicomputer and workstation vendors and by Apple Computer Corp. at the consumer level.
SCSI 2
The ANSI committee doubled the bus speed from the 5-MHz SCSI-1 bus to 10 MHz for SCSI-2. This is also sometimes referred to as FAST/10 or just SCSI-2 FAST. SCSI-2 also supports a high-density 50-pin "D" type connector in addition to the 50-pin low-density SCSI-1 connector. The SCSI bus architecture is somewhat unique in that compatibility between versions has been maintained.

SCSI 3
This is the latest SCSI standard, which raises the bus transfer rate to 160 Mbytes per second. It is also referred to as Ultra160. In addition to improving bus speed, SCSI-3 will also support different types of media and electrical interfaces such as low-voltage differential signaling (LVDS).

SCSI address/SCSI ID
Each peripheral or device on a SCSI bus must have its own unique ID. A narrow or 8-bit bus supports only eight device addresses. This is because SCSI does not encode address IDs. Instead, it uses each data line of its 8-bit bus to represent one device. By the same rule, a "wide" or 16-bit SCSI bus is capable of supporting 16 devices.

Serial bus
A serial bus transfers data one bit at a time through a single wire or fiber. Some examples of serial buses used with Windows clusters include Ethernet and Fibre Channel.

ServerNET
Compaq (Tandem Corp.) has developed high-speed communications hardware and software that allows multiple nodes in a cluster to be tied together in a high-speed bus configuration called ServerNET. ServerNET will also allow peripherals such as disk arrays to be connected to the same bus.

Shared Disk Model
See Shared Everything Model.
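The point made under SCSI address/SCSI ID, that each device ID is represented by one data line rather than by a binary-encoded number, can be illustrated with a short Python sketch. The function name `id_to_data_line_mask` is hypothetical, and the sketch models only the one-line-per-ID idea, not SCSI arbitration or selection timing.

```python
def id_to_data_line_mask(scsi_id, bus_width_bits=8):
    """Return a bit mask with only the data line belonging to scsi_id set.

    Because one line stands for one device, the number of usable IDs
    equals the number of data lines on the bus.
    """
    if not 0 <= scsi_id < bus_width_bits:
        raise ValueError(
            f"ID {scsi_id} does not fit on a {bus_width_bits}-bit bus"
        )
    return 1 << scsi_id


# Narrow (8-bit) bus: IDs 0 through 7, one data line each.
narrow_masks = [id_to_data_line_mask(i) for i in range(8)]

# Wide (16-bit) bus: IDs 0 through 15.
wide_masks = [id_to_data_line_mask(i, bus_width_bits=16) for i in range(16)]
```

If the IDs were binary encoded instead, eight data lines could name 256 devices; the one-line-per-device scheme is what limits a narrow bus to eight IDs and a wide bus to sixteen.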
Shared Everything Model
An example of a Shared Everything Model cluster is Compaq's OpenVMS Clusters product. Microsoft also refers to this as a Shared Disk Model. This clustering architecture allows every node in the cluster to access any disk resource at any time. For this to work the cluster must synchronize and serialize the access to shared devices by individual cluster members. Compaq solved this problem on OpenVMS by implementing a distributed lock manager (DLM). As you can imagine, as the size of the cluster grows, the amount of work that the DLM must do can get very large. In addition, the communications bandwidth between cluster members required to support a DLM can also become an issue.

Shared Nothing Model
This is the architectural model that Microsoft has adopted for its Windows NT/2000 cluster products. In a Shared Nothing Cluster, each cluster member has its own set of peripherals that it controls and manages. A resource such as a disk drive is only accessible logically by a single cluster member at a time even though the disk drives are physically connected to each cluster member by a shared bus such as SCSI or Fibre Channel.

Shared-memory architecture
A computing model in which multiple processors share a single main memory through which all traffic between devices must flow. SMP servers employ this model. The drawback of this approach is that as the number of processors grows, memory access becomes a bottleneck, thus limiting scalability.

Single-ended SCSI bus
This term refers to the type of electrical interface used. The single-ended interface uses a single wire for each data or control line going between peripherals. From the electrical point of view it is referred to as a "negative bus." This means that 0 volts represents a logical 1 and a signal level of three volts or greater is a logical 0. A common ground wire is used as the electrical return path for all data lines.
The single-ended SCSI bus is technically a very simple interface design. The good news is that a simple design translates into lower cost. The bad news is that there are limitations on bus length and speed due to the single-ended bus design.

Single point of failure
This is a term that is used when discussing high-availability systems. A good design eliminates the possibility that the failure of a single component in the system could cause the failure of the total system. A system designed for high availability would have completely redundant components. For example, there would be multiple processors, multiple power supplies, dual SCSI buses, redundant network adapters and connections, etc.

Single-system image
Each computer member of the cluster is capable of independent operation. The cluster software provides a middleware layer that connects the individual systems as one to offer unified access to system resources. This independence allows an entire computer to fail without affecting the other cluster members. The single-system image presents the user or application with the appearance of a single system, the cluster. The individual supporting computer members are transparent to the user or application.

Symmetrical multiprocessing
Symmetrical multiprocessing is the "balanced" operation of more than one processor unit within a single computer system. The term balance is used to qualify the processors' equal standing of operation.
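The idea in the single point of failure entry above can be turned into a simple design check. The sketch below is only an illustration: the inventory format and the function name are assumptions made for this example, and the rule it applies is the one from the entry, namely that any component present only once is a potential single point of failure.

```python
def single_points_of_failure(inventory):
    """Return the names of components that have no redundant partner.

    `inventory` is a hypothetical mapping of component name to the
    number of units installed in the system being reviewed.
    """
    return sorted(name for name, count in inventory.items() if count < 2)


# Illustrative inventory: everything is duplicated except the network adapter.
example = {
    "processor": 2,
    "power supply": 2,
    "SCSI bus": 2,
    "network adapter": 1,
}
print(single_points_of_failure(example))
```

A real review would also have to consider components that are shared by otherwise redundant paths, such as a single cable or power circuit, which a simple count like this cannot see.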
T-Y

Termination
Termination refers to a device that is placed at each end of a SCSI bus. Terminators come in two varieties. A passive terminator is nothing more than two resistors that match the electrical impedance of the bus. An active terminator uses active electrical components to more accurately match the impedance of the bus.

Time Source
It is important that all cluster nodes maintain a consistent view of time. One node in the cluster is elected to be the Time Source for the whole cluster. The system administrator can designate a particular node to be the Time Source. This would be particularly advantageous if that node were equipped with a precision external clock. As far as time is concerned, it is more important that it be consistent across cluster nodes than be absolutely accurate.

Transaction-aware applications
An application that will be used in a cluster environment must be aware of how clusters behave during failover operations. An application that is written for high-availability clusters must use some type of transaction processing in order to be able to function in a cluster.

Transactional processing
Transactional processing is a client/server operation involving two simple phases of operation: prepare and commit. An example of a transaction process is the operation of an Automatic Teller Machine. A person uses an ATM (client) to propose a transaction with the bank (server). The transaction is processed based on the appropriate access code, sufficient funds at the client (ATM), and sufficient account support at the server (bank).

Tri-link connector
The tri-link connector is wired electrically the same as a Y-cable. The only difference is that with a tri-link connector all the wiring is neatly contained within the connector shell. The only problem with the tri-link connector is that, unlike the Y-cable, it covers up two PC host adapter slots on the back of the PC. See also Y-cable.
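The prepare and commit phases from the Transactional processing entry above can be sketched with the ATM example. This is a minimal illustration, not how any bank or transaction monitor is actually implemented: the `Bank` and `Atm` classes, their methods, and the `withdraw` function are all hypothetical names, and the sketch leaves out the logging, timeouts, and recovery that a real system needs to survive a failure between the two phases.

```python
class Bank:
    """Hypothetical server side: holds account balances and access codes."""

    def __init__(self, balances, codes):
        self.balances = dict(balances)
        self.codes = dict(codes)

    def prepare(self, account, code, amount):
        # Phase 1: check the access code and the account funds, change nothing.
        return (self.codes.get(account) == code
                and self.balances.get(account, 0) >= amount)

    def commit(self, account, amount):
        # Phase 2: apply the withdrawal to the account.
        self.balances[account] -= amount


class Atm:
    """Hypothetical client side: holds the cash available in the machine."""

    def __init__(self, cash):
        self.cash = cash

    def prepare(self, amount):
        # Phase 1: check that the machine holds enough cash.
        return self.cash >= amount

    def commit(self, amount):
        # Phase 2: dispense the cash.
        self.cash -= amount


def withdraw(atm, bank, account, code, amount):
    """Run one withdrawal; return True only if both sides committed."""
    # Prepare: both participants must agree before anything changes.
    if not (atm.prepare(amount) and bank.prepare(account, code, amount)):
        return False
    # Commit: apply the change on both sides.
    bank.commit(account, amount)
    atm.commit(amount)
    return True
```

The point of the split is that a refusal from either side during the prepare phase leaves both the account balance and the cash in the machine untouched.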
Ultra SCSI
The term "Ultra" is used to refer to a SCSI bus running at twice the speed of the SCSI-2 standard. An 8-bit or narrow Ultra SCSI bus runs at 20 MBps, and it follows that the Ultra Wide SCSI bus is capable of transferring 40 MBps.

VAX
The term VAX is used to refer to both a processor chip and a computer system developed by the former Digital Equipment Corp. It stands for virtual address extension. The VAX was an extension to an earlier Digital processor called the PDP-11.

VIA
This is a new acronym that is a result of an effort on the part of Microsoft, Compaq, and Intel to develop a virtual interface architecture for use in Windows NT/2000 clusters. One way to understand what VIA is all about is to compare it with NDIS in LAN Manager. Just like NDIS, VIA provides a standard software interface at the top of its protocol stack, and at the bottom it is independent of the physical medium. The purpose of VIA is to provide very low communication latency, low processor overhead, and a standard software protocol interface.

Virtual disk
In this context, the term "virtual" means that users can access their information on what appears to be a local physical hard drive. In reality, a virtual disk is usually located on a server in the network. The server uses a normal operating system file that has been logically formatted as though it were a physical disk. When a user reads from or writes to the virtual disk, they are actually reading or writing a file on a server.

Virtual server
MSCS introduced a new concept called the virtual server. As the name implies, a virtual server appears to a client workstation as though it were a physical server. In reality, a virtual server consists of the logical services that make up a physical server such as IP address, node name, computer name,
applications, etc. These entities can be grouped together and assigned to any physical node in the cluster.

Wide SCSI
Originally SCSI was developed as an 8-bit bus. The second generation of the SCSI bus was expanded to support a 16-bit wide bus. The 16-bit SCSI bus is referred to as wide, and the 8-bit SCSI bus is referred to as narrow.

Windows Service
User workstations access what is known as a Service on a Windows NT/2000 system. MSCS is an example of a Windows Service. Other examples of common Windows Services include file and print, as well as naming services such as WINS.

Y-cable
The Y-cable is used to allow a cluster node to be removed from a SCSI bus without having to bring down the SCSI bus in the process. This is accomplished by plugging the SCSI bus into a female connector on one leg of the "Y." The center connector of the "Y," which is a male connector, gets plugged into the node or disk array, and the remaining leg of the "Y" is connected to a terminator. This arrangement allows you to disconnect the node or disk array while still keeping the bus terminated and functional.
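The transfer rates quoted in the Ultra SCSI and Wide SCSI entries above follow from simple arithmetic on bus clock and bus width. The sketch below works this out under one stated assumption that is not spelled out in the glossary: the bus moves one full-width transfer per clock cycle. The function name is made up for this example.

```python
def scsi_peak_rate_mbps(clock_mhz, bus_width_bits):
    """Peak transfer rate in MB per second.

    Assumes one transfer of `bus_width_bits` per clock cycle, so the
    rate is simply clock rate times bus width in bytes.
    """
    return clock_mhz * bus_width_bits / 8


# Ultra SCSI runs at twice the 10-MHz SCSI-2 FAST clock, that is 20 MHz.
narrow_ultra = scsi_peak_rate_mbps(20, 8)    # 8-bit (narrow) bus
wide_ultra = scsi_peak_rate_mbps(20, 16)     # 16-bit (wide) bus
print(narrow_ultra, wide_ultra)
```

With these inputs the narrow bus works out to 20 MBps and the wide bus to 40 MBps, matching the figures in the Ultra SCSI entry; doubling the width doubles the rate without changing the clock.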
References

Over the period of time it took us to write this book, it has been absolutely amazing to watch how quickly vendors supplying clustering technology have come and gone! We have had to go back and edit this section of the book more than once to reflect the many corporate mergers that have taken place. The most notable of these mergers occurred when Compaq Computer Corporation purchased Tandem Computers. Tandem was acknowledged by Microsoft as one of the major technology contributors to their Windows clustering initiative. As if that weren't enough, Compaq then quickly turned around and purchased Digital Equipment Corporation, which Microsoft also credited as their other major technology partner on the Wolfpack project. So, if you find any dead links, don't blame us!

Another major player in the High Availability market that has been busy purchasing companies is Legato Systems, Inc. Legato has purchased two of its competitors: Vinca (Standby Server) and Qualix (Octopus). Both of these companies had been in the market for a relatively long time. They both offered solutions early on that could enhance the availability of Windows NT Server.
Vendors

Amphenol Interconnect Products Corporation
20 Valley Street
Endicott, New York 13760
Phone: (607) 786-4202
Web: www.amphenol-aipc.com

Amphenol Interconnect Products is a leading supplier of engineered cable assemblies and custom connectors to the OEM computer and military markets. They have numerous catalogs and specification sheets that you can request that contain a lot of valuable information about SCSI cables, connectors, and terminators. They are a supplier to many large computer manufacturers such as IBM, DEC, Tandem, and others. Call them if you are looking for SCSI "Y" cable assemblies. They have been supplying these parts for years.

• Products
  • SCSI cable assemblies
  • OEM custom cables
  • Connectors
  • Interface connector adapters
  • SCSI terminators

Compaq Computer Corporation
20555 SH 249
Houston, Texas 77070-2698
Phone: 800-888-0220
Web: www.compaq.com

Compaq, along with the former Tandem and Digital organizations, has been working with Microsoft to develop an industry-standard solution for clustering Windows NT servers. The Compaq family of companies has contributed many key technologies to Microsoft and is continuing to work closely with Microsoft to develop the next generation of MSCS. Their ProLiant Clusters systems have undergone extensive testing at
selected customer sites, and have been designed specifically to support MSCS. Compaq is working closely with Microsoft and other leading vendors to develop a high-speed systems cluster interconnect architecture. Their own ServerNet technology, which was developed by Tandem, is well positioned to become the physical transport layer of choice on which the Virtual Interface Architecture (VIA) set of APIs will first be implemented. Compaq is leading the consortium of industry leaders in developing a standard set of software APIs to support SAN interconnect technologies.

Tandem Computers Incorporated was a very strategic acquisition for Compaq. With over 24 years of experience delivering highly available systems, Tandem's technology is now helping Compaq to offer clustered Windows NT/2000 Server-based solutions. Their solutions are based on field-proven technology called "NonStop Software" and their System Area Network technology, which have been used in their UNIX-based solutions. "NonStop" is an intelligent software middleware layer that sits on top of the cluster services layer. Because of Tandem's experience with clustering middleware, they were a major contributor to Microsoft's efforts to develop a standard clustering API for Windows NT Server.

Tandem was able to provide a highly scalable cluster solution mainly because of their shared-nothing architecture. This is the same architecture that Microsoft has adopted for Microsoft Cluster Server (MSCS). Microsoft's MSCS phase II multinode clustering initiative will utilize Tandem's ServerNet system area network hardware along with VIA. It appears that ServerNet is likely to become the de facto cluster interconnect standard. As a testimonial to Tandem's technical leadership, it is interesting to note that Tandem's systems handled 90 percent of the world's securities transactions, 80 percent of bank automated teller machine transactions, and 66 percent of all credit card transactions.
• Products
  • ServerNet
  • NonStop TUXEDO transaction monitor for Windows NT Server
  • ProLiant Cluster Series
  • SCSI Storage Arrays
  • Fibre Channel Storage Arrays

Cubix Corporation
2800 Lockheed Way
Carson City, Nevada 89706-0719
Phone: (800) 829-0550
Web: www.cubix.com

Cubix manufactures high-density "back-room" server systems. These systems are built from multiple PC subsystems (plug-in computers) that are managed as a single platform. Cubix refers to this as a "Consolidated Server." Each "plug-in PC" subsystem is an independent server that is configured with either a uniprocessor or SMP processors, memory, drives, controllers, etc. Cubix calls them consolidated servers because multiple computers, along with their associated drives, power, electronics, and adapter boards, are housed together in a single chassis. Because of the high density that Cubix achieves with their server architecture, their servers are ideal for use as "server farms" that need to support a large number of users. Instead of filling your computer room with rack upon rack of regular PCs, you can consolidate your servers into one fault-tolerant, managed enterprise in as little as one tenth of that space. And in the process you'll reduce your cooling, power, and administrative requirements.
Dell Computer Corporation
One Dell Way
Round Rock, Texas 78682
Phone: 800-847-4085
Web: www.dell.com

Dell is delivering clustering solutions based on their PowerEdge line of servers and Microsoft MSCS software. Dell is using its direct business model to deliver complete clustering solutions that are tested and ready to run upon delivery. Dell gives its customers the convenience of buying industry-standard clusters in the same way they buy industry-standard servers. They also provide services to help customers decide which applications make sense to cluster, plan the cluster deployment, and design the best cluster configuration for a customer's requirements.

• Products
  • PowerEdge Clusters
  • PowerVault Storage arrays

EMC Corporation
Hopkinton, MA U.S.A 01748-9103
Phone: 508-435-1000
Web: http://www.emc.com

Richard J. Egan and Roger Marino founded EMC Corporation in Newton, Massachusetts, in August 1979. Originally, EMC was involved only with memory add-ins for existing lines. EMC is now a world-renowned leader in cluster and storage solutions. One of EMC's acquisitions was a computer company called Data General Corporation. Data General Corporation was a leading supplier of servers, storage systems, and technical support services. The company designed, manufactured, and supported two families of systems, AViiON® servers and CLARiiON® mass storage products. The CLARiiON® product is a vital part of the network storage solution provided by EMC Corporation. The EMC CLARiiON® IP4700 network-attached storage (NAS) and EMC CLARiiON FC4700 storage area network (SAN) systems won the PC Magazine Award (May 2001) for Innovation in Infrastructure. According to PC Magazine, "After a detailed evaluation, our editors chose products that were the most innovative and technologically intriguing: products that exemplify the best and most advanced of today's networked infrastructure technologies."
• Products
  • EMC Celerra is a dedicated network file server running software optimized for sharing information over networks. It combines Symmetrix enterprise storage technology with a unique software and hardware approach to bring unprecedented levels of availability, management, scalability, and performance to network file storage. This network-attached storage solution allows you to share information between heterogeneous networked servers and end-user clients as if that information were physically stored on the local workstation.
  • CLARiiON RAID storage
  • EMC GeoSpan for MSCS provides disaster recovery functionality to a high-availability Microsoft Cluster Service (MSCS) environment.

Granite Digital
3101 Whipple Road
Union City, CA 94587
Phone: 510-471-6442
Web: www.scsipro.com

Granite Digital specializes in producing very high quality SCSI cables, terminators, and SCSI diagnostic cables and connectors. Their diagnostic cables and connectors feature LEDs for reporting the status of critical SCSI signals. By simply glancing at one of their SCSI connectors you can tell what's going on with your SCSI system. Windows NT Clusters are going to be enough of a challenge for the average person without having to worry about SCSI cables failing. The one place you do not want to cut corners is on the SCSI cabling. Granite Digital is not known for being low cost, but they are recognized as one of the leading sources of extra high quality cables with the added benefit of built-in diagnostics. If your future includes working with SCSI, go and check out Granite Digital's SCSIVue Diagnostic terminator and their diagnostic cables. They'll save you time and trouble.

• Products
  • SCSI Diagnostic Cables (super high quality, with built-in diagnostics)
  • SCSI Teflon Flat Ribbon Cables for high speeds
  • SCSI Diagnostic Terminator with Active Regulation
  • Ultra SCSI Adapters, Connectors, and Gender Changers
  • Digital SCSI Cable Tester
  • Digital SCSI ECHO Repeater
  • Solid State SCSI Switch
  • SCSI disk array cases

Legato Systems, Inc.
2350 West El Camino Real
Mountain View, CA 94040
Phone: 650-210-7000
Fax: 650-210-7032
Web: www.legato.com

Legato Systems, Inc. has been very busy with corporate mergers. The following two companies are now part of Legato:

Qualix Group, Inc. (Octopus)
Vinca Corp. (Co-StandbyServer for Windows NT)

Octopus Technologies had developed a software-based family of products that provide solutions for achieving high data availability.
They have been shipping products for both the UNIX and Windows NT platforms since 1990 and have received many good reviews from the media. Their products use data mirroring technology in conjunction with "watchdog" software to manage the failover of applications and data from a
failing server to a new target server. Their software delivered many of the same capabilities that were in MSCS at a very low cost. In addition to server-to-server data and application availability, they are one of the few vendors who also provide a solution to address real-time data mirroring from the corporate desktop. Their OctopusDP product allows you to provide real-time data mirroring for both Windows NT and Windows 95 desktops.

Vinca Corporation is one of the leading data protection companies in the network computing industry today. Vinca is known for innovative and patented technology that has made data protection solutions more affordable, easier to use, and compatible with all types of computer hardware. Vinca is in the business of keeping all sizes and types of businesses up and running by providing products that make network data, applications, and services more available and accessible. Co-StandbyServer for Windows NT provides application failover, as well as data preservation. Co-StandbyServer allows either of two servers to fail over for the other. With failover, the data and server identity of the failed server are transferred to the surviving server, and the surviving server retains its original identity as well. The applications running on the failed server can also be restarted on the surviving server. This allows NT server clients to retain data as well as application availability.

• Products
  • Legato Octopus
  • Legato HA+
  • Legato StandbyServer
  • Legato Co-StandbyServer

LSI Logic Corporation
1551 McCarthy Blvd.
Milpitas, California 95035
Phone: 866.574.5741 (in the United States)
Phone: 719.533.7679 (outside the United States)
Fax: 866.574.5742
Web: www.lsilogic.com

LSI Logic Corporation was founded in 1981. LSI Logic pioneered the ASIC (Application Specific Integrated Circuit) industry. Today, LSI Logic is focused on providing highly complex ASIC and ASSP (Application Specific Standard Product) silicon solutions.
In May 2001 LSI Logic Corp. purchased IntraServer, a manufacturer of high-performance and high-density host adapters targeted at the high-end server market. IntraServer was formed by former Digital employees with a lot of experience in highly available, high-performance clustered systems.

• Products
  • SCSI host bus adapters
  • Multifunction host bus adapters: combination SCSI and Ethernet support
  • High-density adapters: multiple I/O functions on a single adapter card

Marathon Technologies Corp.
1300 Massachusetts Ave.
Boxborough, MA, 01719, USA
Phone: (508) 266-9999 or (800) 884-6425
Web: www.marathontechnologies.com

Marathon was founded in 1993 with the goal of delivering mission-critical application availability using low-cost computers as building blocks to create more capable systems. The concept was that by using an array of low-cost computers they could deliver high levels of scalability and performance at a very low cost. Their architecture allows them to deliver a "no downtime" solution providing continuous processing regardless of the type of hardware failure. They are unique in the industry because they were the first to make available an insurance policy that guarantees constant delivery of service and performance regardless of the failure. For companies needing a truly fault-tolerant solution for Windows NT, Marathon is the place to go. They have designed a true fault-tolerant computer system that uses industry-standard Intel-based servers and the Windows NT operating system. It will run all Windows NT applications without modification. The result is the industry's first and only hardware and software fault-tolerant technology for Windows NT.

• Products
  • Endurance 6000

Myricom, Inc.
325 N. Santa Anita Ave.
Arcadia, CA 91006
Phone: 626-821-5555
Web: www.myricom.com

Myricom produces high-speed communications products for clusters, LANs, and SANs. They specialize in high-speed computing and communication components supplied to OEMs, research institutions, and defense systems. In addition to its Myrinet business, Myricom performs research, most of which is under the sponsorship of the Defense Advanced Research Projects Agency (DARPA). If you want to learn more about SANs, then you should definitely check out their web site for more information on Myrinet. A recent testimonial to the capabilities of Myrinet was the announcement by the University of Illinois Department of Computer Science that they successfully implemented a 128-node, 256-processor Windows NT Supercluster.
Andrew Chien, a professor in the University of Illinois Department of Computer Science and a member of the Alliance Parallel Computing Team, constructed the 128-node cluster out of standard Compaq and Hewlett Packard PCs running Microsoft's Windows NT operating system. Professor Chien uses a Myrinet SAN as the physical transport layer for his "Fast Messages Middleware" software. Layered on top of the middleware is the actual clustering software called High Performance Virtual Machine (HPVM). HPVM enables each node of his NT cluster to communicate at a bandwidth of just under 80 megabytes per second and a latency under 11 microseconds. The combination of commercial off-the-shelf (COTS) PCs and a Myrinet SAN gives the scientific and engineering community a low-cost alternative to conventional high-performance machines used to carry out high-end computational research.

Microsoft Corp.
One Microsoft Way
Redmond, WA 98052-6399
Phone: 425-882-8080

In August 1998, Microsoft Corp. acquired Valence Research, Incorporated, a Beaverton, Oregon based developer. Valence Research was an industry-leading TCP/IP load-balancing and fault tolerance software company. Valence Research's Convoy Cluster Software won the Windows NT Magazine Editor's award and
the Windows NT Intranet Solutions Expo Best of Show award, and it represented an important new addition to Microsoft's clustering technology and Internet capabilities. The award-winning technology was renamed Microsoft Windows NT Load Balancing Service. This technology brought enhanced scalability and fault tolerance to a range of Windows NT-based products, including outbound SMTP mail service in Microsoft Exchange Server and Microsoft Proxy Server as well as integrated system services, such as Microsoft Internet Information Service, Point-to-Point Tunneling Protocol Service, and Microsoft Internet Authentication Service. Microsoft Windows NT Load Balancing Service complements the features already provided by the clustering subsystem in Windows NT Server Enterprise Edition. Together, these technologies create a highly flexible and scalable solution for front-to-back high availability in mission-critical environments, including Internet server farms.

"Microsoft and Valence Research share the same vision of providing our customers with unparalleled levels of scalability and high availability through clustering technology," said Dr. William L. Bain, co-founder and CEO of Valence Research. "We're very pleased that, as part of Microsoft product offerings, our technology will now benefit a much wider group of customers."

Microsoft itself uses the Microsoft Windows NT Load Balancing Service on such sites as microsoft.com and MSN.com, a group of sites representing some of the highest-volume traffic on the Internet. It is a stable and proven technology, and has enabled microsoft.com to achieve service availability levels above 99 percent. Microsoft Windows NT Load Balancing Service is also compatible with Internet Protocol Security (IPSec), the Internet Engineering Task Force (IETF) standard for end-to-end security at the network layer, reinforcing Microsoft's commitment to standards-based security initiatives.
A revised version of the original Valence Research, Inc. technology, Convoy Cluster Software, is available today in Windows 2000 Advanced Server and Datacenter Server. Included in this technology is the ability to build Internet Web farms with up to 32 cluster nodes.

NCR Corp.
1700 S. Patterson Blvd.
Dayton, OH 45479
Web: www.ncr.com

NCR was one of the pioneers in the development of clustering solutions for UNIX systems. They introduced LifeKeeper for UNIX in 1992, followed by LifeKeeper for Windows NT in 1996. LifeKeeper for Windows NT builds upon NCR's experience with UNIX clustering and is a key to NCR's high-availability offering for the Windows NT/2000 environment. LifeKeeper is used to cluster multiple servers, allowing them to monitor and back up each other's applications. It is an active/active cluster architecture where all systems are active and productive until a failure occurs. NCR released a new version of its LifeKeeper software that was the first clustering software for Windows NT capable of joining up to 16 Windows NT servers in a high-availability cluster. NCR was also able to achieve faster recoveries, more customization of recovery environments, decreased hardware dependencies, and improved user availability with their LifeKeeper 2.0 release. It is interesting to note that LifeKeeper 2.0 offers the same feature set for both the NT and UNIX operating systems. LifeKeeper 2.0 offers some impressive functionality that is worthy of a closer look. It will give you an idea of what NCR has been up to and where they are heading with their Teradata solutions.
Books

In Search of Clusters, by Gregory F. Pfister
Published by: Prentice Hall (515) 284-6751
Web: www.prenhall.com

Pfister's book deals with clustering in general terms. Although it is not an NT Clustering book, it will give the reader a good background on clustering and at the same time provide some definitions of what a cluster is and is not. Mr. Pfister is with the IBM Corporation.

The RAIDbook, 6th Edition, by Paul Massiglia
Published by: RAID Advisory Board, St. Peter, MN
Order: http://www.raid-advisory.com

The RAID Advisory Board has published many books and articles and sponsored numerous conferences and seminars. Among the board's publications are the "RAIDbook" and the "Storage System Enclosure Handbook." In the area of standards, the RAB, working closely with the American National Standards Institute (ANSI), has developed the "SCSI Controller Commands" and "SCSI-3 Enclosure Services Command" Set standards.

VAX Cluster Principles, by Roy G. Davis
Digital Press, Butterworth-Heinemann
225 Wildwood Avenue
Woburn, MA 01801
Phone: 800-366-2665
Web: www.bh.com/digitalpress

VAXcluster Systems, Digital Technical Journal Issue Number 5, September 1987, Digital Equipment Corporation

This publication is a collection of technical papers by the developers of VMS clustering. There are several papers that will be of particular interest to anyone wanting to expand their knowledge about the origins of clustering and what might be on the horizon. The topics covered in this collection of technical papers include: The VAXcluster Concept: Overview of a Distributed System, The System Communication Architecture (SCA), The VAX/VMS Distributed Lock Manager (DLM), VAX Cluster Availability Modeling, etc. Don't let the words VAX and VMS scare you away from this excellent source of background information. Who knows? Some of these authors might just be wearing another employee badge today and working in the Northwest!
Basics of SCSI, 4th Edition, by Jan Dedek at ANCOT Corporation

Ancot has shipped more than 30,000 copies of this booklet. The "Basics of SCSI" booklet is a tutorial on SCSI technology written by Jan Dedek, Ancot's President. Highly recommended!

What is FIBRE CHANNEL?, 4th Edition, by Jan Dedek and Gary Stephens, ANCOT Corporation
Ancot Corporation
115 Constitution Drive
Menlo Park, California 94025
Phone: 650-322-5322
Web: www.ancot.com

This booklet is a very good introduction to Fibre Channel technology, all packed into 72 pages. If you are looking to get up to speed fast, then this is a very good place to start. You can order their booklets from Ancot's web site or by phone. The booklets are free. They are very well done and deal with technology, not marketing hype.

The Book of SCSI: A Guide for Adventurers, by Peter M. Ridge (ISBN: 1-886411-02-6)

A practical book about SCSI aimed at helping the everyday users of SCSI. It is full of helpful hints on making SCSI work for you. This book gives plain English explanations of how to work with SCSI IDs, LUNs, termination, parity checking, asynchronous and synchronous transfer, bus mastering, caching, RAID, and more. You will also find many tips, tricks, and troubleshooting help.
Articles, Papers, and Presentations

Berg Software Design
P.O. Box 3488
14500 Big Basin Way, Suite F
Saratoga, CA 95070 USA
Web: http://www.bswd.com/cornucop.htm

This excellent web site has a plethora of links, white papers, and session notes on storage (including Fibre Channel) and networks.

Clarion University of Pennsylvania
Clarion, PA 16214
Admissions: (800) 672-7171
Switchboard: (814) 393-2000
TTY/TDD: (814) 393-1601

The following link brings you to "Users Guide to the VMS-Cluster Computing System," an excellent guide to the VMS cluster computing system: http://www.clarion.edu/admin/compserv/vtabcont.htm.
Trade associations

Fibre Channel Loop Community
PO Box 2161
Saratoga, CA 95070
Phone: (408) 867-1385
Web: http://www.fcloop.org
Fibre Channel Association
2570 West El Camino Real, Suite 304
Mountain View, CA 94040-1313
Phone: 1-800-272-4618
Web: http://www.fibrechannel.com

The Fibre Channel Association (FCA) is a "corporation," much the same as each of the FCA member companies is a corporation. In particular, the FCA is incorporated under the laws of the state of California, and is legally classified as a nonprofit corporation. This means that the business of FCA is not conducted for the financial profit of the members, but for the mutual benefit of its members.

The RAID Advisory Board
10805 Woodland Drive
Chisago City, MN 55013-7493
Phone/Fax: 651-257-3002
Web: www.raid-advisory.com

Formed in July of 1992, and open to all, the RAID Advisory Board (RAB) has as its goal helping users make more informed storage procurement decisions. The RAB's goal is achieved by means of three key programs: Education, Standardization, and Classification. In the area of education, the RAB has published many books and articles and sponsored numerous conferences and seminars. The RAB has developed functional and performance specifications and established the RAB RAID Level Conformance Program and the RAB Disk System and Array Controller Classification Program. Over 20 RAB members are currently licensed to display the RAB logo and legends indicating that the products identified by the logos have met certain criteria established by the RAB.

The SCSI Trade Association
404 Balboa Street
San Francisco, CA 94118
Phone: 415-750-8351
Web:

The SCSI Trade Association was formed to promote SCSI Interface Technology. They benefit the SCSI user base by serving as a central repository of information on SCSI technology and by promoting increased public understanding and use of SCSI by means of their Web site and other publications. The members of the SCSI Trade Association guide the growth and evolution of SCSI Parallel Interface Technology now and into the future.
Storage Network Industry Association
2570 West El Camino Real, Suite 304
Mountain View, CA 94040
Web: www.snia.org
The Storage Network Industry Association (SNIA), established in 1997, is a consortium of storage and storage networking companies, system integrators, application vendors, service providers, and IT professional developers. It provides information, testing, and certification programs, along with conferences, workgroups, and advisory committees. Although the organization is rather new, its membership and board of directors read like a "who's who" of the industry. The membership is too large to list, but you will see SNIA associated with the IP Storage Forum, the Support Solutions Forum, and Storage Networking World. Visit the Web site at http://www.snia.org for excellent white papers and general information.

InfiniBand Trade Association
5440 SW Westgate Drive, Suite 217
Portland, OR 97221
Phone: 530-291-2565
Web: www.infinibandta.org
List of Figures

Preface
Figure P.1: TPC-C benchmarks (Source: Zona Research).
Chapter 1: Understanding Clusters and Your Needs
Figure 1.1: Cluster system.
Figure 1.2: Cluster components.
Figure 1.3: Digital perspective.
Chapter 2: Crystallizing Your Needs for a Cluster
Figure 2.1: Mirror or RAID 1 example.
Figure 2.2: Stripe with parity or RAID 5 example.
Figure 2.3: Active/passive.
Figure 2.4: Active/active.
Figure 2.5: Shared disk.
Figure 2.6: Cluster storage array.
Chapter 3: Mechanisms of Clustering
Figure 3.1: Microsoft Windows 2000 registry.
Figure 3.2: Quorum disk.
Figure 3.3: Quorum example.
Figure 3.4: Disk resource.
Figure 3.5: Lockstep setup.
Figure 3.6: Replication.
Figure 3.7: Replication: one to many.
Figure 3.8: Volume replication.
Figure 3.9: Partition mirroring.
Figure 3.10: Shared disk.
Figure 3.11: Shared nothing disk.
Figure 3.12: Early storage area network.
Figure 3.13: Storage area network.
Figure 3.14: Network attached storage.
Chapter 4: Cluster System Classification Matrix
Figure 4.1: Cluster classification matrix: cluster classes.
Figure 4.2: Cluster classification matrix: examples.
Figure 4.3: Cluster complexity and capability.
Figure 4.4: Cluster lite example 1.
Figure 4.5: Cluster lite example 2.
Figure 4.6: Cluster future?
Figure 4.7: Cluster classification matrix: examples.
Chapter 5: Cluster Systems Architecture
Figure 5.1: Typical cluster architectures.
Figure 5.2: Active/standby cluster with mirrored data.
Figure 5.3: Active/passive cluster with mirrored data.
Figure 5.4: Active/active cluster with shared disk.
Figure 5.5: Active/active cluster with shared files.
Figure 5.6: Cluster Service architecture.
Figure 5.7: Cluster software architecture.
Figure 5.8: Microsoft Cluster Service software components.
Figure 5.9: Resource Monitor and Resource DLLs.
Figure 5.10: Relationship between cluster resources.
Chapter 6: I/O Subsystem Design
Figure 6.1: SMP system using bus architecture.
Figure 6.2: SMP system using switch architecture.
Figure 6.3: Processor bus bandwidths.
Figure 6.4: I/O subsystem bus configuration.
Figure 6.5: PCI bridged bus configurations.
Figure 6.6: PCI peer bus architecture.
Figure 6.7: Capacity model.
Figure 6.8: Ethernet bus.
Figure 6.9: Ethernet switch-based architecture.
Chapter 7: Cluster Interconnect Technologies
Figure 7.1: "Classic" cluster interconnect.
Figure 7.2: Cluster interconnect using Fibre Channel or ServerNet technology.
Figure 7.3: RJ45 to RJ45 "null modem" wiring diagram.
Figure 7.4: Twisted-pair "crossover cable" connected cluster.
Figure 7.5: Twisted-pair Ethernet hub connected cluster.
Figure 7.6: VIA software protocol stack.
Figure 7.7: Winsock Direct stack.
Figure 7.8: Centronics 50-pin SCSI connector.
Figure 7.9: Micro-D 50 connector.
Figure 7.10: Micro-D 68-pin SCSI connector.
Figure 7.11: SCA2 connector for removable drives.
Figure 7.12: SCSI connector locking mechanisms.
Figure 7.13: SCSI ID vs. priorities.
Figure 7.14: A single-ended SCSI driver and receiver circuit.
Figure 7.15: Differential SCSI bus.
Figure 7.16: SCSI cable lengths.
Figure 7.17: Tri-link SCSI adapter.
Figure 7.18: Tri-link adapter used in a cluster.
Figure 7.19: SCSI "Y" adapter.
Chapter 8: Cluster Networking
Figure 8.1: Typical Windows NT cluster network configuration.
Figure 8.2: Redundant enterprise LAN connections and hubs.
Figure 8.3: Reducing the single point of failure for the enterprise LAN.
Figure 8.4: Windows support for multiple network transport protocols.
Figure 8.5: IP address failover.
Figure 8.6: Cluster server: "virtual servers."
Figure 8.7: Three-tier clustering using WLBS.
Figure 8.8: Typical multiple-tier HyperFlow configuration.
Chapter 9: Cluster System Administration
Figure 9.1: The three steps to recover from system failure.
Figure 9.2: Microsoft's cluster product positioning.
Figure 9.3: Out-of-band remote management.
Figure 9.4: Scaling up vs. scaling out.
Figure 9.5: Using SMP to scale out.
Figure 9.6: Switch-based system architecture.
Chapter 10: Achieving Data Center Reliability with Windows NT/2000 Clustering
Figure 10.1: The elements of a high availability solution.
Figure 10.2: Intel's processor/system development cycle.
Figure 10.3: Data mirroring between remote cluster sites.
Figure 10.4: Decision tree for protection against a disaster.
List of Tables

Preface
Table P.1: Versions of NT
Chapter 2: Crystallizing Your Needs for a Cluster
Table 2.1: RAID Technologies
Chapter 4: Cluster System Classification Matrix
Table 4.1: Microsoft
Table 4.2: Legato Cluster Enterprise
Table 4.3: Legato Octopus (Vinca)
Table 4.4: Compaq Cluster
Table 4.5: Compaq Intelligent Cluster Administrator
Chapter 5: Cluster Systems Architecture
Table 5.1: Resource Monitor API Functions
Table 5.2: Cluster Failover Scenarios
Table 5.3: Cluster Recovery Failover Modes
Table 5.4: Recovery Failover Types
Table 5.5: Failover Scenario Characteristics
Chapter 6: I/O Subsystem Design
Table 6.1: I/O Load Models
Table 6.2: PCI Bus Bandwidth
Chapter 7: Cluster Interconnect Technologies
Table 7.1: Cluster Interconnect Technology Options
Table 7.2: SCSI Versions and Bus Speeds
Chapter 8: Cluster Networking
Table 8.1: Benefits of Dual Network Interface Controllers
Chapter 10: Achieving Data Center Reliability with Windows NT/2000 Clustering
Table 10.1: Causes of System Outages