Performance Analysis of Transaction Processing Systems
Wilbur H. Highleyman
Prentice Hall, Englewood Cliffs, New Jersey 07632
To my parents-
Peach and Bud
Contents

PREFACE xvii

1 INTRODUCTION 1
PERFORMANCE MODELING, 2
USES OF PERFORMANCE MODELS, 5
THE SOURCE OF PERFORMANCE PROBLEMS, 7
THE PERFORMANCE ANALYST, 7
THE STRUCTURE OF THIS BOOK, 8
SYMBOLOGY, 10

2 TRANSACTION-PROCESSING SYSTEMS 12
COMPONENT SUBSYSTEMS, 14
Communication Network, 15; Processors, 15; Memory, 15; Application Processes, 16; Data Base, 16; Other Peripherals, 17
SYSTEM ARCHITECTURES, 17
Expandability, 18; Distributed Systems, 18; Fault Tolerance, 20; Summary, 22
TRANSPARENCY, 22
The Process, 22; I/O Processes, 24; Interprocess Communications, 26; Process Names, 28; Process Mobility, 29; Summary, 30
PROCESS STRUCTURE, 30
Process Functions, 30; Interprocess Messages, 31; Addressing Range, 33; Process Code Area, 33; Process Data Area, 33; The Stack, 35; Summary, 36
PROCESS MANAGEMENT, 36
Shared Resources, 36; Process Scheduling, 37; Mechanisms for Scheduling, 38; Memory Management, 40; Paging, 40; Managing Multiple Users, 43; Additional Considerations, 44; Summary, 45
SURVIVABILITY, 45
Hardware Quality, 46; Data-Base Integrity, 47; Software Redundancy, 49; Transaction Protection, 50; Synchronization, 53; Backup Processes, 55; Message Queuing, 56; Checkpointing, 59
SOFTWARE ARCHITECTURE, 63
Bottlenecks, 63; Requestor-Server, 64; Dynamic Servers, 64

3 PERFORMANCE MODELING 67
BOTTLENECKS, 67
QUEUES, 68
THE RELATIONSHIP OF QUEUES TO BOTTLENECKS, 69
PERFORMANCE MEASURES, 70
THE ANALYSIS, 73
Scenario Model, 73; Traffic Model, 74; Performance Model, 77; Model Results, 83; Analysis Summary, 85

4 BASIC PERFORMANCE CONCEPTS 87
QUEUES: AN INTRODUCTION, 88
Exponential Service Times, 90; Constant Service Times, 91; General Distributions, 91; Uniform Service Times, 92; Discrete Service Times, 92; Summary, 92
CONCEPTS IN PROBABILITY AND OTHER TOOLS, 94
Random Variables, 94; Discrete Random Variables, 94; Continuous Random Variables; Case 1, 99; Case 2, 100; Permutations and Combinations, 104; Series, 105; The Poisson Distribution, 106; The Exponential Distribution, 110; Random Processes Summarized, 111
CHARACTERIZING QUEUING SYSTEMS, 113
INFINITE POPULATIONS, 113
Some Properties of Infinite Populations, 114; Dispersion of Response Time, 114; General distribution, 115; Central Limit Theorem, 116; Variance of response times, 117; Properties of M/M/1 Queues, 120; Properties of M/G/1 Queues, 122; Single-Channel Server with Priorities, 123; Nonpreemptive server, 124; Preemptive server, 125; Multiple-Channel Server (M/M/c), 126; Multiple-Channel Server with Priorities, 127; Nonpreemptive server, 127; Preemptive server, 128
FINITE POPULATIONS, 128
Single-Server Queues (M/M/1/m/m), 130; Multiple-Server Queues (M/M/c/m/m), 131; Computational Considerations for Finite Populations, 132
COMPARISON OF QUEUE TYPES, 132
SUMMARY, 136

5 COMMUNICATIONS
PERFORMANCE IMPACT OF COMMUNICATIONS, 140
COMMUNICATION CHANNELS, 141
Dedicated Lines, 141; Dialed Lines, 141; Virtual Circuits, 142; Satellite Channels, 143; Local Area Networks, 144; Multiplexers and Concentrators, 144; Modems, 147; Propagation Delay, 148
DATA TRANSMISSION, 149
Character Codes, 149; Asynchronous Communication, 150; Synchronous Communication, 152; Error Performance, 152; Error Protection, 154; Half-Duplex Channels, 155; Full-Duplex Channels, 156
PROTOCOLS, 157
Message Identification and Protection, 157; Message Transfer, 158; Half-duplex message transfer, 158; Full-duplex message transfer, 158; Channel Allocation, 160; Bit Synchronous Protocols, 160
BITS, BYTES, AND BAUD, 163
LAYERED PROTOCOLS, 164
ISO/OSI, 165; Application layer, 165; Presentation layer, 166; Session layer, 167; Transport layer, 167; Network layer, 167; Data link layer, 168; Physical layer, 168; SNA, 169; X.25, 170
MESSAGE TRANSFER PERFORMANCE, 172
Half-Duplex Message Transfer Efficiency, 172; Full-Duplex Message Transfer Efficiency, 176; Message Transit Time, 178; Message Transfer Example, 180
ESTABLISHMENT/TERMINATION PERFORMANCE, 181
Point-to-Point Contention, 182; Multipoint Poll/Select, 185
LOCAL AREA NETWORK PERFORMANCE, 189
Multipoint Contention: CSMA/CD, 189; Token Rings, 192
SUMMARY, 196

6 PROCESSING ENVIRONMENT 197
PHYSICAL RESOURCES, 198
Processors, 199; Cache Memory, 199; I/O System, 200; Bus, 201; Main Memory, 202; Processor Performance Factor, 203; Traffic Model, 204; Performance Tools, 205; Performance Model, 206; W-Bus, 206; Memory, 207; R-Bus, 207; Memory queue full, 208; Model Summary, 208; Model Evaluation, 208
OPERATING SYSTEM, 213
Task Dispatching, 214; Interprocess Messaging, 218; Global message network, 218; Directed message paths, 219; File system, 219; Mailboxes, 219; Memory Management, 220; I/O Transfers, 220; O/S Initiated Actions, 221; Locks, 222; Thrashing, 222
SUMMARY, 224

7 DATA-BASE ENVIRONMENT 226
THE FILE SYSTEM, 227
Disk Drives, 228; Disk Controller, 228; Disk Device Driver, 229; Cache Memory, 230; File Manager, 231; File System Performance, 232
FILE ORGANIZATION, 234
Unstructured Files, 234; Sequential Files, 237; Random Files, 238; Keyed Files, 238; Indexed Sequential Files, 243; Hashed Files, 245
DISK CACHING, 245
OTHER CONSIDERATIONS, 249
Overlapped Seeks, 250; Alternate Servicing Order, 250; Data Locking, 251; Mirrored Files, 252; Multiple File Managers, 253; File manager per disk volume, 253; Multiple file managers per disk volume, 253; Multiple file managers per multiple volumes, 255
AN EXAMPLE, 255

8 APPLICATION ENVIRONMENT 259
PROCESS PERFORMANCE, 260
Overview, 260; Process Time, 265; Dispatch Time, 266; Priority, 266; Operating System Load, 267; Messaging, 267; Queuing, 268
PROCESS STRUCTURES, 269
Monoliths, 269; Requestor-Server, 270; Requestors, 271; Servers, 272; File managers, 274; Multitasking, 274; Dynamic Servers, 278; Asynchronous I/O, 280
AN EXAMPLE, 281
SUMMARY, 296

9 FAULT TOLERANCE 298
TRANSACTION PROTECTION, 300
SYNCHRONIZATION, 303
MESSAGE QUEUING
CHECKPOINTING, 307
DATA-BASE INTEGRITY, 308
AN EXAMPLE, 308

10 THE PERFORMANCE MODEL PRODUCT 312
REPORT ORGANIZATION, 313
Executive Summary, 313; Table of Contents, 313; System Description, 314; Transaction Model, 314; Traffic Model, 314; Performance Model, 315; Model Summary, 315; Scenario, 316; Model Computation, 316; Results, 317; Conclusions and Recommendations, 317
PROGRAMMING THE PERFORMANCE MODEL, 317
Input Parameter Entry and Edit, 317; Input Variable Specification, 318; Report Specification, 319; Parameter Storage, 319; Dictionary, 319; Help, 320; Model Calculation, 320; Report, 320
TUNING, 321
QUICK AND DIRTY, 322

11 A CASE STUDY 323
PERFORMANCE EVALUATION OF THE SYNTREX GEMINI SYSTEM, 325
Executive Summary, 326; Table of Contents, 327; 1. Introduction, 328; 2. Applicable Documents, 329; 3. System Description, 329; 3.1 General, 329; 3.2 Aquarius Communication Lines, 330; 3.3 Aquarius Interface (AI), 331; 3.4 Shared Memory, 334; 3.5 File Manager, 335; 3.6 File System, 336; 4. Transaction Model, 338; 5. Traffic Model, 340; 6. Performance Model, 342; 6.1 Notation, 342; 6.2 Average Transaction Time, 343; 6.3 Aquarius Terminal, 344; 6.4 Communication Line, 344; 6.5 Aquarius Interface, 346; 6.6 File Manager, 350; 6.7 Disk Management, 353; 6.8 Buffer Overflow, 355; 7. Scenario, 357; 8. Scenario Time, 359; 9. Model Summary, 360; 10. Results, 365; 10.1 Benchmark Comparison, 365; 10.2 Component Analysis, 368; 11. Recommendations, 371; References, 373

APPENDIX 1 GENERAL QUEUING PARAMETERS 375

APPENDIX 2 QUEUING MODELS 377

APPENDIX 3 KHINTCHINE-POLLACZEK EQUATION FOR M/G/1 QUEUING SYSTEMS 383

APPENDIX 4 THE POISSON DISTRIBUTION

APPENDIX 5 MIRRORED WRITES 389
A. DUAL LATENCY TIMES, 389; B. SINGLE DISK SEEK TIME, 391; C. DUAL DISK SEEK TIME, 392

APPENDIX 6 PROCESS DISPATCH TIME 397
A. INFINITE POPULATION APPROXIMATION ERROR, 397; B. DISPATCHING MODEL, 399; C. SINGLE PROCESSOR SYSTEM, 402; D. MULTIPROCESSOR SYSTEM, 405; E. AN APPROXIMATE SOLUTION, 409

APPENDIX 7 PRIORITY QUEUES 411

APPENDIX 8 BIBLIOGRAPHY 414
Preface
This book provides the tools necessary for predicting and improving the performance of real-time computing systems, with special attention given to the rapidly growing field of on-line transaction-processing (OLTP) systems. It is aimed at two audiences:

1. The system analyst who thoroughly understands the concepts of modern operating systems and application structures but who feels lacking in the mathematical tools necessary for performance evaluation.
2. The mathematician who has a good grasp of probability and queuing theory but who would like to gain a better understanding of the technology behind today's computing systems so that these tools might be effectively applied.

OLTP systems are rapidly becoming a part of our everyday life. Merchants pass our credit cards through slot readers so that remote systems can check our credit. Every commercial airplane ride and hotel stay is planned and tracked by these systems. When we are critically ill, OLTP systems monitor our critical signs. They control our factories and power plants. We obtain cash from ATMs, play the horses and lotteries, and invest in stocks and bonds thanks to OLTP systems.

No wonder their performance is becoming a growing concern. A poorly performing system may simply frustrate us while we wait for its response. Even worse, it can be life-threatening to a business, or even to our loved ones.

We define the performance of an OLTP system as the time required to receive a response from it once we have sent it a transaction. Our transaction must wait its turn over
and over again as it passes from one service point to another in the OLTP system, since it is just one of many transactions that the system is trying to process simultaneously. These service points may be processors, disks, critical programs, communication lines: any common resource shared among the transactions. As the system gets busier, the delays at each service point get longer, and the system may bog down. The study of the behavior of these delays as a function of transaction volume is the subject of the rapidly expanding field of queuing theory.

A performance analyst is one who has an intimate knowledge of the workings of these systems and who can apply the practical tools available from queuing theory and other mathematical disciplines to make reasonable statements about the expected performance of a system. The system may be one still being conceived, an existing system in trouble, or a system undergoing enhancement.

This book is intended to train performance analysts. For the system analyst who may be a bit intimidated by higher mathematics, it presents mathematical tools that have practical use. Moreover, the derivations of these tools are for the most part explored, perhaps not always rigorously, but in a manner designed to give a full understanding of the meaning of the equations that represent the tool. For the most part, only a knowledge of simple algebra is required.

For the practicing mathematician, there is an in-depth description of each OLTP system component that may have a performance impact. These components include communication lines, processors, memories, buses, operating systems, file systems, and software application architectures. System extensions for fault tolerance, an important attribute of OLTP systems, are also covered.

This book is organized so that the reader may skip easily over material that is already familiar and focus instead on material of direct interest. The book does not present a "cookbook" approach to performance analysis. Rather, it stresses the understanding of the use of basic tools to solve a variety of problems. To this end, many examples are given during the discussion of each topic to hone the reader's ability to use the appropriate tools.

As of the date of this writing, the title "performance analyst" has not been accepted as a usual job description. This is certainly not due to a perception that performance analysis is unnecessary but perhaps instead to the perception that meaningful performance analysis is somewhat that of a mythical art. If this book can advance the acceptance of the practicing performance analyst, it will have achieved its goal.
A work of this magnitude is the result of the efforts of many. I would like to take this opportunity to thank:

• All my anonymous reviewers, whose in-depth criticisms played a major role in the organization of the book.
• My partner, Burt Liebowitz, who often challenged my fuzzy thinking to enforce a clarity and accuracy of presentation.
• My daughter Leslie for the typing and seemingly endless retyping of the manuscript as it progressed through its various stages.
• My wife, Janice, a professional writer in her own right, for turning my writing into real English.
• Charles Reeder, who prepared the illustrations for this book, often with "real-time" responses.
• My many customers, who have provided the real-life testing ground for the methodology presented in the book, and especially to Concurrent Computers Corp. and Syntrex Inc., for their kind permission to use the studies presented in chapters 6 and 11, respectively.
• Last, but not least, my editors, Paul Becker and Karen Winget, for their encouragement and guidance in the mechanics and business issues of bringing a new book to press.

ABOUT THE AUTHOR

Dr. Highleyman has over 30 years' experience in the development of real-time on-line data processing systems, with particular emphasis on high performance multiprocessor fault-tolerant systems and large communications-oriented systems. Other application areas include intelligent terminals, editorial systems, process control, and business applications. Major accomplishments include the first computerized totalizator system for racetrack wagering installed for the New York Racing Association, the first automation of floor trading for the Chicago Board of Trade, the international telex switching systems utilized by ITT World Communications, the fault-tolerant data-base management system used by the New York Daily News, a 6000-terminal lottery system for state lotteries, an electronic billing data collection system for the telephone operating companies, and message switching systems for international cablegram and private network services.

Dr. Highleyman is founder and Chairman of The Sombers Group, a company which has provided turnkey software packages for such systems since 1968. He is also founder and chairman of MiniData Services, a company using minicomputer technology to bring data processing services to small businesses. Prior to these activities, he was founder and vice-president of Data Trends, Inc., a turnkey developer of real-time systems since 1962. Before that, he was employed by Bell Telephone Laboratories, where he was responsible for the development of the 103 and 202 data sets, and by Lincoln Laboratories, where he worked on the first transistorized computer.

In addition to his management activities at The Sombers Group, Dr. Highleyman is currently active in:
• performance modeling of multiprocessor systems.
• fault-tolerant considerations of multiprocessor systems.
• architectural design of hardware and software for real-time and multiprocessor systems.
• functional specifications for data-processing systems.

He has performed analytical performance modeling on many systems for several clients including:

A. C. Nielson
Autotote
Bunker Ramo
Concurrent Computer
Digital Equipment Corp.
First National Bank of Chicago
FTC Communications
G.E. Credit Corp.
Harris
Hewlett Packard
ITT World Communications
MACOM DCC
PYA/Monarch
Smith Kline
Stratus
Syntrex
Systeme
Tandem
Telesciences
Time
Dr. Highleyman received the D.E.E. degree from Brooklyn Polytechnic Institute in 1961, the S.M.E.E. degree from Massachusetts Institute of Technology in 1957, and the B.E.E. degree from Rensselaer Polytechnic Institute in 1955. He holds four U.S. patents and has published extensively in the fields of data communications, pattern recognition, computer applications, and fault-tolerant systems. He also sits on or has sat on several boards, including:

• The Sombers Group (Chairman)
• Science Dynamics, Inc.
• MiniData Services, Inc. (Chairman)
• International Tandem User's Group (Past President)
• Vertex Industries
1 Introduction
Ubiquitous, mysterious, wonderful, and sometimes aggravating: computers. They truly are becoming more and more involved in what we do today. What they do often affects our quality of life, from the success of our businesses to the enjoyment of our free time to our comforts and conveniences. Businesses enter transactions as they occur and obtain up-to-the-minute status information for decision-making. Banks are extending on-line financial services to their corporate customers for interactive money transfers and account status, giving corporate money managers the ultimate in cash-flow management. Call your telephone company or credit card company about a bill, and your charge and payment history appears on the screen for immediate action. See your travel agent for airline tickets, and the computer presents all options, makes your reservations, and issues your tickets.

Time for fun? Buy tickets through your local ticket outlet from the inventory kept on computer. Play the horses or the state lottery; if you're a lucky winner, the computer will calculate your payoff. Not feeling well? Computers will manage your hospital stay, will order your tests, and, of course, will prepare your bills. Other computers will monitor your specimens as they flow through the clinical laboratory, thus ensuring the quality and accuracy of the test results.

Need to communicate? Computers will carry your voice, your data, and your written words rapidly to distant locations.
And quietly in the background, computers monitor the material fabric of our daily lives, from power and energy distribution to traffic control and sewage disposal.

All of the preceding examples are types of transaction-processing systems. These systems accept a transaction, process it using a base of data, and return a response. A transaction may be an inquiry, an order, a message, a status. The data base may contain customer information, inventory, orders, or system status. A response may be a status display, a ticket, a message, or a command.

For instance, in a wagering system, the wager information is a transaction that is stored in the data base, and the reply is a ticket. Subsequently, the ticket becomes the transaction as the system compares it to the data base to see if it is a winner. The reply is the payoff amount.

In the control system for an electrical power network, the transactions are status changes such as power loading, transformer temperatures, and circuit failures. The data base is the current network configuration and status. The new status change updates the data base, and a command to alter the configuration may be issued as a response (e.g., reduce the voltage to prevent a power overload, thereby avoiding a brownout).

In a message-switching system, the transaction is an incoming message (text, data, facsimile, or even voice). The data base is the set of messages awaiting delivery and the routing procedures. The response is the receipt and delivery acknowledgments returned to the sender and the message routed to the destination.

In an airline or ticket reservation system, the transaction is first an inquiry. Schedule status is returned from the data base, which also holds an inventory of available seats. A subsequent transaction is the order, which will update the inventory. A ticket will be issued in response.

No wonder we become affected, even aggravated, by the performance of these systems. Have you ever waited several minutes on the telephone for the clerk on the other end to get your billing status? Have you ever watched the anxiety and the anger of a racehorse player trying to get the bet in before the bell goes off, only to be frustrated by an ever-slowing line at the window? Have you ever watched a merchant get impatient waiting for a credit card validation, give up, and make the sale without authorization, thus risking a possible loss? Have you ever ...? The list goes on.

And that is what this book is all about: the prediction and control of the performance¹ of these transaction-processing systems, which are weaving their way into our lives.

¹Of course, system availability is an equally important concern. Have you ever been told that you can't get a ticket at this time because "the computer is down"? The reliability analysis of these systems is not a topic for this book. However, performance degradation due to actions taken by the systems to ensure reliable operation is a concern and is covered. Techniques for the reliability analysis of these systems may be found in Liebowitz [17].
PERFORMANCE MODELING

We all know that as a computer system becomes loaded, it "bogs down." Response times to user requests get longer and longer, leading to increased frustration and aggravation of
the user population. A measure of the capacity of the system is the greatest load (in transactions per hour, for instance) at which the response time remains marginally acceptable.

Deterioration of response time is caused by bottlenecks within a system. These bottlenecks are common system resources that are required to process many simultaneous transactions; therefore, transactions must wait in line in order to get access to these resources. As the system load increases, these lines, or queues, become longer, processing delays increase, and responsiveness suffers. Examples of such common system resources are the processor itself, disks, communication lines, and even certain programs within the system.

One can represent the flow of each major transaction through a system by a model that identifies each processing step and highlights the queuing points at which the processing of a transaction may be delayed. This model can then be used to create a mathematical expression for the time that it takes to process each type of transaction, as well as an average time for all transactions, as a function of the load imposed on the system. This processing time is, of course, the response time that a user will see. The load at which response times become unacceptable is the capacity of the system. Performance modeling concerns itself with the prediction of the response time for a system as a function of load and, consequently, of its capacity.

A simple example serves to illustrate these points. Figure 1-1a shows the standard symbol used throughout this book for a resource and the attendant queue of transactions awaiting servicing by that resource. The "service time" for the resource is often shown near it (Ts in this case). The service time is the average time it takes the resource to process a transaction queued to it.

Figure 1-1b is a simple view of a transaction-processing computer system. Transactions arrive from a variety of incoming communication lines and are queued to a program (1) that processes these inbound requests. This program requires an average of 80 milliseconds (msec.) to process a transaction, which is then sent to the disk server (2) to read or write data to a disk. The disk server services these and other requests and requires an average of 50 msec. per request. Once it has completed all disk work, it forwards a response to an output program (3), which returns these and other responses to the communication lines. The output program requires 20 msec. on the average to process each response.

Since the programs and the disk system are serving multiple sources, queues of transactions awaiting service may build in front of each of these servers. As the servers get busier, the queues will get longer, the time a transaction spends in the system will get longer, and the system's response time will get slower.

One implied queue not shown in Figure 1-1b is the processor queue. We assume in this system that many programs are running, many more than are shown. But there is only one processor. Therefore, when a program has work to do, it must wait in line with other programs before it can be given access to the processor and actually run.

Let us now do a little performance analysis using Figure 1-1b (which, by the way, will later be called a "traffic model"). If the system is idle, no queues will build, and an average transaction will work its way through the system in 80 + 50 + 20 = 150 msec.
[Figure 1-1: (a) Resource model: a queue of transactions feeds a resource with service time Ts, which produces a response. (b) Simple computer system: transactions flow through an input program (1, 80 msec), a disk server (2, 50 msec), and an output program (3, 20 msec).]
(the sum of the individual service times for each server in its path). Not a bad response time.

Now let us look at the response time in a more normally loaded system in which the queue lengths for all servers, including the processor, average 2. That is, on the average, any request for service will find 2 requests in front of it: one being serviced and one waiting for service. The newly entered request will be the second request in line, not counting the request being serviced at the time. (As will be shown later, resource loads of 2/3 will often result in queue lengths of 2. That is, if a server is busy 67 percent of the time on the average, its average queue length will be 2.) With these queue lengths, each transaction must wait 2 service times before being serviced, and then each transaction must be serviced. Sounds like the response time should triple from 150 msec. to 450 msec., right? Wrong. The response time degradation is even worse, since each program must wait in a queue for the processor. Let us assume that the average time the processor spends running a program is 50 msec. Let us also
assume that the disk server is an intelligent device that does not use the processor and so is not slowed by having to wait in the processor queue.

Remember that the processor is 67 percent loaded, so its queue length is 2. When the input program wants to run, it must wait in the processor queue for 2 x 50 = 100 msec. and then must spend 80 msec. serving the incoming transaction. So far as the incoming transaction is concerned, the service time of this program is 180 msec., not 80 msec. Likewise, the effective service time of the output program in the loaded processor is 2 x 50 + 20 = 120 msec. rather than 20 msec. The disk server time remains at 50 msec. since it doesn't rely on the processor.

An average queue length of 2 in front of each server now causes the response time to triple relative to the effective service times. Thus, at a load of 2/3, system response time degrades from 150 msec. to 3(180 + 50 + 120) = 1050 msec! Note that we could plot a curve of response time versus load if we just knew how to relate queue size to load. Then, given the maximum acceptable response time, we could determine that load representing the capacity of the system, as shown in Figure 1-2. This is what performance modeling is all about.
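The arithmetic above is easy to mechanize. The following short sketch (mine, not the book's; the constants come straight from the example) reproduces the idle and loaded response times:

```python
# Sketch of the worked example above. Assumes, as the text does, that a load
# of 2/3 gives an average queue length of 2, and that the disk server is an
# intelligent device that never waits for the processor.

PROCESSOR_TIME = 50   # msec. the processor spends running a program
QUEUE_LENGTH = 2      # average queue length at a load of 2/3

def effective_service_time(service_msec, uses_processor=True):
    """Service time seen by a transaction, including the processor-queue wait."""
    processor_wait = QUEUE_LENGTH * PROCESSOR_TIME if uses_processor else 0
    return processor_wait + service_msec

servers = [
    effective_service_time(80),                        # input program
    effective_service_time(50, uses_processor=False),  # disk server
    effective_service_time(20),                        # output program
]

idle_response = 80 + 50 + 20                        # no queues anywhere: 150 msec.
loaded_response = (QUEUE_LENGTH + 1) * sum(servers)  # wait twice, then be served
print(idle_response, loaded_response)                # prints: 150 1050
```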
USES OF PERFORMANCE MODELS

Ideally, a performance model should be validated and tuned. Its results should be compared to measured data. If they are significantly different, the reasons should be understood and the model corrected. Usually, this results in the inclusion of certain processing steps initially deemed trivial or in the determination of more accurate parameter values.

A performance model, no matter how detailed it may be, is nevertheless a simplification of a very complex process. As such, it is subject to the inaccuracies of simplification. However, experience has shown these models to be surprisingly accurate. Moreover, the trends predicted are even more accurate and can be used as very effective decision tools in a variety of cases, such as

1. Performance Prediction. The performance of a planned system can be predicted before it is built. This is an extremely valuable tool during the design phase, as bottlenecks can be identified and corrected before implementation (often requiring significant architectural changes), and performance goals can be verified.
2. Performance Tuning. Once a system is installed, it may not perform as expected. The performance model can be used along with actual performance measurements to "look inside" the system and to help locate the problems by comparing actual processing and delay times to those that are expected.
3. Cost/Benefit of Enhancements. When it is planned to modify or enhance the system, the performance model can be used to estimate the performance impact of the proposed change. This information is invaluable to the evaluation of the proposed change. If the change is being made strictly for performance purposes, then the cost/benefit of the change can be accurately determined.
[Figure 1-2: Response time (seconds) versus load. Response time grows with load; the load at which the curve crosses the maximum acceptable response time defines the capacity of the system.]
4. System Configuration. As a product is introduced to the marketplace, it often has several options that can be used to tailor the system's performance to the user's needs; for example, the number of disks, the power of the processor, or communication line speeds. The performance model can be packaged as a sales tool to help configure new systems and to give the customer some confidence in the system's capacity and performance.
THE SOURCE OF PERFORMANCE PROBLEMS
It is interesting to speculate about the main cause of the poor performance that we see in practice. Is it the hardware: the processors, memories, disks, and communication systems of which these systems are built? Is it the software: the operating systems and applications that give these systems their intelligence? Is it the users of the systems, through inexperienced or hostile actions? The data base organization? The system managers?
Actually, it is none of these. The second most common cause of poor performance is that the designers of the system did not understand performance issues sufficiently to ensure adequate performance. The first most common cause is that they did not understand that they did not understand.

If this book does nothing more than focus attention on those issues that are important to performance so that designers may seek help when they need it, the book will have achieved a major part of its goal. If it allows the designers to make design decisions that lead to an adequately performing system, it will have achieved its entire goal.

THE PERFORMANCE ANALYST

The effective prediction of performance has been a little-used art, and for a very good reason. It requires the joining of two disciplines that are inherently contradictory. One is that of the mathematical disciplines, from probability and statistical theory to the Laplace-Stieltjes transforms, generating functions, and birth-death processes of queuing theory. The other discipline is that of the system analyst: data-base design, communication protocols, process structure, and software architecture.

Qualified system analysts are certainly adept at algebra. They most likely are a little rusty at elementary calculus, probability theory, and statistics. Do they know how to solve differential-difference equations, apply Bessel functions, or understand the attributes of ergodicity? Probably not. They don't need to know, and they probably don't want to know.

Practicing applied mathematicians, for the most part, have not been exposed to the inner workings of contemporary data-processing systems. Third-normal form, pipelining, and SNA may be not much more than words to them. And when they do understand a system's performance problem, it is difficult for them to make progress on it because the assumptions required for reasonable calculations often diverge so far from the real world that the mathematicians' old college professor (and certainly the great body of contemporary colleagues) would never approve.

Performance analysts are in an awkward position. They must be reasonably accomplished in system analysis to understand the nature of the system they are analyzing, yet they must be practical enough to make those assumptions and approximations necessary to solve the problem. Likewise, they must understand the application of some fairly straightforward mathematical principles to the solution of the problem without being so embarrassed by the accompanying assumptions as to render themselves ineffective. In short,
performance analysts must be devout imperfectionists. The only caveat is that they must set forth clearly and succinctly the imperfections as part of the analysis.

A performance model is a very simple mathematical characterization of a very complex physical process. At best, it is only an approximate statement of the real world (though actual results have been surprisingly useful and accurate). But isn't it better to be able to make some statement about anticipated system performance than none? Isn't it better to be able to say that response time should be about 2 seconds (and have some confidence that you are probably within 20% of the actual time) than to design a system for a 1-second response time and achieve 5? Based on actual experience, the most honest statement that can be made without a performance analysis is of the form "The system will probably have a capacity of 30 transactions per second, but I doubt it." The purpose of this book is to eliminate the phrase "but I doubt it."
THE STRUCTURE OF THIS BOOK

To that end, this book takes on both system analysis and mathematics. It is designed to give applied mathematicians the background they need to understand the structure of contemporary transaction-processing systems so they can bring their expertise to bear on the analysis of their performance. It is also designed to give system designers the mathematical tools required to predict performance. In either case, we have created performance analysts.

It is the author's strong contention that a tool is most useful if the users are so familiar with it that they can make it themselves. This applies to the various relationships and equations which are used in this book. Most will be derived, at least heuristically if not rigorously; in this way, the assumptions that go into the use of the particular tool are made clear. Most derivations are simple, requiring only basic algebra, a little calculus perhaps, and an elementary knowledge of probability and statistics. These derivations are included in the main body of the text. More complex derivations are left for the appendixes. Just occasionally a derivation is so complex that only a reference is given.

The book is structured to support the system analyst seeking better mathematical tools, the mathematician seeking a better system understanding, and either one seeking anything in between. This structure is shown in Figure 1-3.

Chapter 2 is a major review of the contemporary technology involved in transaction-processing systems. It will be most useful to those seeking a better understanding of the technology of these systems, but this chapter could be skimmed or bypassed by system analysts knowledgeable in transaction processing.

Chapter 3 gives a simple but in-depth example of performance modeling based on chapter 2 material extended by some elementary mathematics introduced as the modeling progresses. This chapter provides a preview of the rest of the book. A thorough understanding of chapters 2 and 3 will equip the reader with the tools necessary to perform elementary performance analyses. The rest of the book then hones the tools developed in these two chapters.
[Figure 1-3: Structure of this book. Chapter 2 (system background) and chapter 4 (mathematical background) lead into chapter 3 (a look at performance modeling), which leads into chapters 5-8 (communications; processor and operating system; disk and data base; application programs), chapter 9 (fault tolerance), chapter 10 (documentation), and chapter 11 (case study).]
Chapter 4 presents a variety of mathematical tools that are useful in certain situations. While the bulk of the chapter is a review of queuing theory, which allows us to relate queue lengths to resource loads, many other useful tools are presented, including basic concepts in probability and statistics and expansions of useful series.

Chapters 5 through 8 expand concepts relative to the major building blocks of a transaction-processing system. These building blocks include the communication network (chapter 5), the processor and operating system (chapter 6), the data base (chapter 7), and the application programs (chapter 8). Chapter 9 extends these concepts to fault-tolerant systems. These chapters contain insight for both the analyst and the mathematician.
System concepts are explored in more depth, and the application of the mathematical tools to these areas is illustrated.

Chapters 10 and 11 are summary chapters. Chapter 10 discusses the organization and components of the formal performance model so that it will be useful and maintainable as the system grows and is enhanced. Chapter 11 includes a complete example of a performance model.

References are given following chapter 11 and are organized alphabetically. Appendix 1 summarizes the notation used, and Appendix 2 summarizes the major queuing relationships. Further appendixes give certain derivations that will be of interest to the more serious student.

The book is not intended to be exhaustive; the rapidly progressing technology in TP systems prevents this anyway. Nor is it intended to provide a cookbook approach to performance modeling. The subject is far too complex for that. Rather, it is intended to provide the tools required to attack the performance problems posed by new system architectures and concepts.

The book is also not intended to be a programming guide for performance model implementation. Though the complexity and interactive nature of many models will require them to be programmed in order to obtain results, it is assumed that the performance analyst is either a qualified programmer or has access to qualified staff. No programming examples are given; however, useful or critical algorithms are occasionally presented.

The author highly recommends that the serious student read references [19] and [24]. James Martin and Thomas Saaty both present highly readable and in-depth presentations of many of the concepts necessary in performance analysis. Also, Lazowska [16] provides an interesting approach by which many performance problems can be attacked using fairly simple and straightforward techniques.
SYMBOLOGY

One final note on symbology before embarking on this subject. The choice of parameter symbols, I find, is one of the most frustrating and mundane parts of performance analysis. Often, symbols for several hundred parameters must be invented; there are just not enough characters to go around without enormous subscripting. Therefore, the choice of symbols is often not reflective of what they represent. This is a problem we live with. The symbols used in the body of this book are summarized in Appendix 1 for easy reference.

Notwithstanding the problems of naming conventions mentioned above, the author does impose certain restrictions:
1. All symbols are at most one character plus subscripts. Thus, TS is never used for service time, but Ts may be. This prevents products of symbols from being ambiguous. The only exception is var(x), used to represent the variance of x.
2. Only characters from the Arabic alphabet are used (uppercase and lowercase A through Z, numerals 0 through 9). There are two reasons for doing this:
a. Most of the typewriters, word processors, and programming languages I use provide little if any support for the Greek alphabet.
b. More important, performance models are most useful if understood and used by middle and upper technical management. I usually find this level of user to be immediately and terminally intimidated by strings of Greek letters.

This convention can be particularly disturbing to the applied mathematician who is used to ρ, λ, and μ as representing occupancy, arrival rates, and service rates, respectively. Instead, in this text he or she will find L, R, and T as representing load, arrival rate, and service time. Rather than ρ = λ/μ, he will find L = RT. However, the first time he or she tries to program a model in Greek, he or she will quickly learn to translate.

The only exception to this is delta. Δ is used to indicate a small increment, and δ is used to represent the delta (impulse) function in certain derivations. They never appear in resulting expressions.

One other convention used is the floor (⌊x⌋) and ceiling (⌈x⌉) notation.
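As a small illustration of this translation (my example, with invented numbers, not the book's):

```python
# The book's L = RT is the conventional rho = lambda/mu, since the service
# time T is 1/mu. The numbers here are invented for illustration.

R = 30.0      # arrival rate, transactions per second (lambda)
T = 0.02      # average service time, seconds per transaction (1/mu)
L = R * T     # load, or occupancy (rho)
print(L)      # 0.6: the server is busy 60 percent of the time
```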
2 Transaction-Processing Systems
This chapter explores the basic architecture of transaction-processing systems. Special attention is given to process structure and management, a thorough knowledge of which is mandatory for an appreciation of performance issues. Strategies for fault tolerance are also presented.

The term transaction-processing system is a rather general term whose definition in this book has importance only in terms of the applicability of the material presented in the following chapters. It can be defined as readily by what it is as it can be by what it is not.
A TP system is an on-line real-time multiuser system that accepts requests and returns responses to those requests. The act of generating a response usually implies accessing a base of data maintained by the TP system. This data base is often a complex set of interrelated information maintained on disk files by a sophisticated data-base manager. In some TP applications, simple memory-resident tables may be sufficient.

On-line means that the user has a direct connection to the system. A user who calls a billing clerk is not on-line, though the billing clerk with a terminal is. A system in which requests are entered via punched cards is not on-line. Systems in which users interact with the system via terminals connected by private or dialed communication lines are on-line, as are systems in which remote sensors communicate status over communication lines and receive commands in response to status changes.

Martin [19] defines real time as follows:
A real-time computer system may be defined as one which controls an environment by receiving data, processing them, and taking action or returning results sufficiently quickly to affect the functioning of the environment at that time. (Emphasis added.)

"Sufficiently quickly" is a matter of the application. In a TP system, this is satisfied if responses are received in a time that appears short to the user, i.e., the system appears responsive to the user. For a human user, this usually means one or two seconds. For a real-time control application, acceptable response times might be much shorter or perhaps even much longer.

A single word that means "on-line real time" is interactive. If the user can communicate directly with a system, and the system responds quickly, then the user is in a position to interact with the system. He or she can ask questions and get responses without having to preplan his session with the system. His next request can be a function of the system's previous response.

All TP systems are interactive. If the system's response slows to the point that interaction is not feasible or is overly frustrating, the TP system loses its value. Thus, the importance of performance analysis.

Conversely, a TP system is not a batch system in which data is accumulated over a long period of time and then processed as one long contiguous file (say once per day). However, there is an important connection between batch and TP systems. Most TP systems have a batch component. There are certain functions that are simply more efficient in batch mode than if done interactively. A prime example is data-base updating. An interactive update involves finding one or more individual records in a large file, updating them, returning them, and modifying any key files used to locate them. The system may also have to update audit files required to recover the data base in the event of a system failure during the update. If, instead, updates are batched, sorted periodically, and passed against the file in an ordered manner, the total system time required for these updates can be significantly less. Chapter 7 explores this in more detail. Therefore, many TP systems will do interactively those updates which are required to support interactive requests and will batch updates and reports that can be delayed. Often, batch processing will run in background mode while the system is supporting interactive traffic. Thus, the contention for resources (primarily processor, memory, and disk) between simultaneously operating batch and TP systems must be understood and accounted for.

Also, a TP system is not a scientific number-crunching application, such as a weather-prediction system (or a performance model, for that matter). These systems are characterized by very long processing times (minutes to hours) before results are available. A TP system is not an intelligent terminal, a circuit switch, or a packet switch which massages data and passes it on.

Finally, for purposes of this book, a TP system is not a single-user system, such as a personal computer, since performance degradation due to loading does not occur. A network of personal computers accessing a common data base, however, is very much a TP system.
Though the performance concepts discussed in this book do have some relation to non-TP systems, they are primarily pertinent to interactive data-base systems.
COMPONENT SUBSYSTEMS

TP systems have the following subsystem components, as shown in Figure 2-1:

• Communication network
• Processors
• Memory
• Application programs
• Data base
• Other peripherals
The functions of each of these subsystems are described below.

[Figure 2-1: Components of a TP system. User terminals connect via a communication network and communication manager to application processes, which issue inquiries and updates to a data-base manager that maintains the data files.]
Communication Network

A communication network interconnects the users with the system. This network could include point-to-point and multidrop lines, dialed lines, packet networks, local area networks, and satellite links. These links could utilize a variety of protocols, both half duplex and full duplex. The communication subsystem also includes the communication hardware that interfaces the communications network with the processor and the software that manages the communication network (noted as the communication manager in Figure 2-1). The communication subsystem is responsible for doing everything necessary to get a request to the application programs and to return a reply to the user. However, it has no knowledge, nor need it have any, of the content of a request or reply message.
Processors

One or more processors provide the processing power required for the communication manager, application programs, and data-base manager. If multiple processors are provided, there must be some means to coordinate their activities and to share work between them.

For purposes of performance analysis, the concept of a processor is extended beyond the hardware to encompass the operating system. That is to say, the processor provides the environment in which the application programs, communication manager, and data-base manager operate. This includes memory management (swapping pages or overlays), task dispatching, priority management, and interprogram messaging. For systems comprising multiple processors, this also includes the mechanisms for load sharing, interprocessor communication, and fault recovery (if any).
Memory

A memory system supports the functions of the processor. For a single processor system, the memory is intimately associated with the processor and is considered to be one and the same with the processor for performance analysis purposes. The union of a processor with its input/output ports and dedicated memory is called a computer. (Note: in general, a computer has a processor, memory, bulk storage, and peripheral devices, including printers and communication lines. However, a processor with I/O ports and memory is sufficient to perform useful functions and represents a computer in many applications.)

A system with many processors, each with its own memory, is called a multicomputer system. Memory as a separate subsystem loses its identity in this case. However, in a system with several processors sharing the same memory, the memory is a common resource for which the various processors compete and must be viewed as a separate subsystem. Such systems are called multiprocessor systems and are distinct from multicomputer systems. Such systems, by the way, often have small high-speed cache memories associated with each processor. These cache memories are used to retain
the most recently used data and instructions (i.e., those likely to be reaccessed in the near future) in an effort to unload common memory.
Of course, a multicomputer system may also have associated with it a common memory to which all computers in the system have access. This case, though, is no different from any other external data device, such as disk units, and is treated in the same way. Multiprocessor and multicomputer systems are described in some greater detail later in this chapter.
Application Processes
Application programs process requests and issue replies. To be more accurate, let us define more precisely a program. A program is something created by a programmer, who has written it by specifying some data structures and some processing procedures in some programming language. (This is called the source code.) The programmer has then compiled that source code by using a language compiler (another program), which translates the source code into object code that is executable by the processor. That object code is bundled with common system library routines into an object module and sits as a file on a disk unit somewhere in the system. It is this object module that is the physical embodiment of the program. It contains all data structures and computation procedures but is not actually executing. In order to execute, the object module must be loaded into the memory accessible by a processor and then executed by that processor.

Most contemporary systems provide a multiuser environment. This means that the same program may be loaded many times into the computer to service different users simultaneously. (In practice, usually only one copy of the procedures is loaded, as it can be shared by all users; a copy of the data structures is individually loaded for each user.) To keep these different instantiations of the same program identified, they are given independent names. A program running in a computer with a system-unique name is called a process.

Thus, to be more accurate, the application subsystem comprises one or more processes that accept requests from the communication manager (as shown in Figure 2-1), make requests for data or updates to the data-base manager, formulate replies, and return these replies to the user via the communication manager. Note also that the communication manager and data-base manager usually run as processes in the TP system. Though the operating system may be implemented as a set of concurrently executing processes, it is often convenient from a performance viewpoint to not treat it as such. Rather, the operating system, in conjunction with the hardware, provides the environment within which the processes execute.
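A toy sketch may make the program/process distinction concrete (this is my illustration, not the book's; the class and process names are invented). One body of procedure code is shared, while each named instantiation carries its own data area:

```python
class Worker:
    """One application program; each instance models a running process."""

    def __init__(self, process_name):
        self.name = process_name      # system-unique process name
        self.requests_served = 0      # this instantiation's private data area

    def serve(self, request):
        self.requests_served += 1     # shared procedure code, private data
        return f"{self.name}: reply to {request!r}"

worker_a = Worker("$WORK1")           # the same program loaded twice...
worker_b = Worker("$WORK2")           # ...to serve two users simultaneously
print(worker_a.serve("inquiry"))
print(worker_b.serve("update"))
```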
DataBase The data base contains the data upon which· mmsaction replies are based. .1bis data is ~Y stored on disk in ~ systems but may be stored in other bulk star8ge.
devices, such as large RAMs (random access semiconductor memories), bubble memories, or drums. The data-base subsystem includes the following:

• One or more physical storage media and associated controllers, such as multiple disks and their controllers.
• The computer I/O channels used to transfer data between the storage system and the computer's memory. This data transfer is usually made directly to or from computer memory (DMA, or direct memory access), with the processor being notified (via an interrupt) upon completion of the transfer.
• The data-base manager, which runs as a process or a set of cooperating processes in the computer. Its job is to execute (read/write/update) requests and other data management commands (open/close file, lock/unlock record, etc.) received from the application subsystem. It knows everything about the details for storing and retrieving information but knows nothing about the content of that information. A minimal sketch of this last point follows the list.
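The sketch below is a hypothetical illustration (the command set and all names are invented, not the book's): a data-base manager process loops on a request queue, executing storage commands while remaining ignorant of what the records mean.

```python
# Hypothetical sketch of a data-base manager running as a process. Command
# names, the queue protocol, and the dictionary "storage" are all invented.

import queue
import threading

requests = queue.Queue()   # commands queued by application processes
storage = {}               # stands in for the physical storage media

def data_base_manager():
    while True:
        command, key, value, reply_queue = requests.get()
        if command == "write":
            storage[key] = value          # update a record
            reply_queue.put("ok")
        elif command == "read":
            reply_queue.put(storage.get(key))
        elif command == "stop":           # shutdown signal for this sketch
            break

threading.Thread(target=data_base_manager).start()
reply = queue.Queue()
requests.put(("write", "ticket 7", "wager: $2 to win", reply))
print(reply.get())                        # prints: ok
requests.put(("stop", None, None, None))
```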
Other Peripherals

Other peripherals may be needed in the TP application. Most other peripherals that one might find on a computer system, such as tapes, printers, and card readers, are not used for interactive processing. One exception might be a printer used to log status changes and to request corrective action by the operator. However, in contemporary systems, status logs are usually written to disk; the operator interfaces to the system via a CRT terminal, with the operator's actions also logged to disk.
SYSTEM ARCHITECTURES

A common characteristic of TP applications is that they grow. Growth occurs as a result of two factors. One is simply the growth in transaction volume. As a system is successfully used and finds acceptance among the user community, more and more uses are found for the system. Transaction volume grows and grows, sometimes with no apparent limit in sight. Home banking is an excellent example of this. What is the size of the potential user population for a home banking system? And what load will it impose on the system? No one knows in advance (at least, not at the time of this writing). So how big a system should a bank purchase to provide home banking? A little one in case the service doesn't catch hold? A big one in case it does? How big is big?

The second factor that causes growth is increased functionality. A racetrack, for instance, buys a totalizator system to handle wagering. This system is required to accept bets, maintain and display pools, calculate payoffs, and cash winning tickets. But once the system is successfully installed and running, management sees new and expanded uses
for it. Exotic new wagers such as the Reinvested Quadrifecta (whatever that is). Telephone account betting. Off-track betting. The racing secretary's selection procedures and calendar. State audit reports. Maybe even payroll and general ledger. Many of these enhancements must operate during active wagering and therefore impose additional load on the primary TP functions. Clearly, the system needs more power than that called for by the initial specifications. Just as clearly, the "tote" system, having been procured under competitive bidding procedures, is configured very tightly. It has no excess capacity. What to do?
Expandability The answer is to design the initial TP system with expandability in mind. Not with a hieruchy of "boxes" such that the user "trades in" (that is, tries to sell on the used marlcet) a CIl1'IeDt box for a bigger box, with all the attendant conversion effort required to get full advantage of the enbana:d features of the bigger box. Rather. expandabili.ty is
obtained today by choosing a multiprocessor or multicomputer architecture that allows expansions to system capacity to be made by simply adding processors or computers, diSks. and cnmnnlnication lines as volume warrants. And in today's art, such expansion can be achieved ~ DO softwaIe changes whatsoever-a paramount consideration.
This is not to say that there are not a lot of 1P systemS in use based on sing:1e-c:omput technology. There are. But understanding the performance of multiprocessor an4 multicomputer systems (distributed systems, as we will call them) allows one to easily model a sing1e-computer System since it;is simply a degenelate example of either. The basic structure of expandable systems is shown in Figmes 2-2 and 2-3. A multicomputer system (FJgUre 2-2) camprlseS two or IDOIe c:ompur.er systemS interc0nnected by a high-speed bus (typically with a capacity. of 2 to 30 megabyteslsecond). The bus is used primarily to pass messages between processes. This arcbitecture is called loosely cwpled because the coupliDg between componen1S·is oaly at the message·level (i.e., proc:esses being I'UD by c:tiffaeDt processors com""Jniane with each other by . ~manging messages). BUS
MEMORY 110
PROCESSOR MEMORY 110
PROCESSOR
MEMORY 110
Chap. 2
System Architectures
19
MEMORY
r-------i--CACHE
----PROCESSOR
BUS
CACHE
CACt£
PROCESSOR
PROCESSOR
I/O
A multiprocessor system (Figure 2-3) comprises two or more processors connected via a high-speed bus to a common memory. Each processor may have its own small local memory (called a cache memory) to try to minimize main memory accessing. In this case, the bus speed is typically 10 to 80 megabytes/second because of its higher utilization in a higher speed environment. Such systems are termed tightly coupled because coupling is at the data level (i.e., processes being run by different processors communicate with each other by sharing common data in memory). In either case, input/output devices are distributed across the processors or computers to divide up the I/O load appropriately.

The multicomputer and the multiprocessor architectures have advantages and disadvantages relative to each other. A multicomputer system requires more memory than the multiprocessor system, since the operating system and common process code must be replicated in each processor instead of being shared in a common memory. Interprocess communication is also slower in a multicomputer system, since such communication is based on messages being passed through a messaging system (typically, a few milliseconds is required to send a message to another process). With the availability of common accessible memory, a multiprocessor system can pass messages between processes in times measured in microseconds. Finally, once again from a performance viewpoint, a multiprocessor system has inherent load-balancing properties, since processors can operate off a single task queue maintained in common memory. For example, a process may be ready to run and will be processed by processor 3 until a disk call is made. It is then paused and, upon completion of the data transfer, is requeued to the task queue. When it works its way to the head of the queue, processor 6 may be the next processor that becomes available, and it processes the next step in that process. Thus, a process will be passed at each step to another processor, and processors will be kept 100% busy as long as there is work to do. No such provision exists for the processors in a multicomputer system. Each is preassigned a fixed set of tasks and typically keeps busy 50 percent to 90 percent of the time. If a processor gets very busy, there is no way for a less busy processor to share its load.
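As an illustration of the single task queue in common memory, consider the following C sketch. It is a minimal sketch under simplifying assumptions, not any vendor's dispatcher: tasks are plain integers rather than process control blocks, POSIX threads stand in for physical processors, and a mutex stands in for the hardware interlock on the shared queue.

    #include <pthread.h>
    #include <stdio.h>

    #define QUEUE_SIZE 16

    /* A single task queue in common memory, shared by all processors. */
    static int queue[QUEUE_SIZE];
    static int head = 0, tail = 0, count = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Add a ready task to the tail of the common queue. */
    static void enqueue_task(int task)
    {
        pthread_mutex_lock(&lock);
        if (count < QUEUE_SIZE) {
            queue[tail] = task;
            tail = (tail + 1) % QUEUE_SIZE;
            count++;
        }
        pthread_mutex_unlock(&lock);
    }

    /* Each "processor" takes the next task from the head of the queue;
       whichever processor is free next gets the next step of work. */
    static void *processor(void *arg)
    {
        long id = (long)arg;
        for (;;) {
            int task = -1;
            pthread_mutex_lock(&lock);
            if (count > 0) {
                task = queue[head];
                head = (head + 1) % QUEUE_SIZE;
                count--;
            }
            pthread_mutex_unlock(&lock);
            if (task < 0)
                break;              /* no work left to do */
            printf("processor %ld runs task %d\n", id, task);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t cpu[3];
        long i;
        for (i = 0; i < 10; i++)
            enqueue_task((int)i);
        for (i = 0; i < 3; i++)
            pthread_create(&cpu[i], NULL, processor, (void *)i);
        for (i = 0; i < 3; i++)
            pthread_join(cpu[i], NULL);
        return 0;
    }

Because every processor draws from the same queue, a step executed on processor 3 may be followed by the next step of the same process on any other processor, which is precisely the load-balancing property described above.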
On the other hand, the multiprocessor architecture can suffer from a severe potential fault mechanism. A sick processor (one with one of a set of specific hardware malfunctions, especially in its memory interface circuits) or a sick process may run amok and contaminate critical parts of memory. The system has then crashed. A process in a multicomputer system, on the other hand, may protect itself from the faults of others by ensuring that all messages received from other processes at least appear to be sane and will not crash that process. Thus, relative to a multiprocessor system, a multicomputer system trades away some performance in return for reliability. The common memory that gives the multiprocessor system its superior performance advantages is its Achilles heel when system reliability is considered.

An interesting modification to the above architecture is the hybrid approach shown in Figure 2-4. In this architecture, a series of processing modules are interconnected via a high-speed bus. This is a loosely coupled architecture, since each processing module is a full computer, and processing modules communicate via messages. A processing module, however, comprises a tightly coupled architecture of several processors executing tasks off a common queue in common memory. Hybrid architectures have the potential of achieving the benefits of both loosely and tightly coupled architectures while giving up little in terms of either performance or reliability.
Figure 2-4  A hybrid system: processing modules, each a tightly coupled multiprocessor, interconnected by a high-speed bus.
Fault Tolerance

These expandable architectures have one additional very important capability. With a little added hardware (at least conceptually), they may be made to be survivable. That is, they will continue to maintain full functionality in the presence of any single (and often multiple) hardware fault. A common architecture applicable to loosely coupled, tightly coupled, and hybrid architectures is shown in Figure 2-5. Basically, only two hardware enhancements need be made:

Figure 2-5  A survivable architecture: a dual bus interconnecting the processors or computers, with dual-ported I/O controllers and communication lines and, for a multiprocessor, a common memory.

1. The bus connecting the processors or computers is replicated so that communication between processes can continue in the event of a bus failure. If both buses are operational, they share the communication load. If one fails, the other assumes the full load.

2. All I/O controllers are dual-ported and connect to two processors. Thus, there is a path to every peripheral device even in the case of a processor or computer failure. In addition, physical disks may be dually ported to dual controllers so that full access to files is guaranteed even in the event of a disk controller failure.
Note that memory in a tightly coupled system is generally not replicated. It would be to no avail, since a sick processor or process would contaminate both memories anyway. However, memory in these systems is partitioned so that if a memory partition fails, it can be configured out of the system and the remaining memory used to run the system. (Sequoia Systems is an exception in that it provides a system with replicated memory protected by fault-tolerant processors.)

In addition to the hardware configuration shown in Figure 2-5, significant operating system enhancements must be made to support fault tolerance and recovery. These are described in later sections of this chapter.
Summary
In terms of contemporary offerings, fault-tolerant loosely coupled systems have been offered by such vendors as Tandem Computers, Tolerant Systems, Digital Equipment Corp., and Concurrent Computer. Tightly coupled fault-tolerant systems are offered by Sequoia Systems and Concurrent Computer. Fault-tolerant hybrid systems are offered by Stratus and Parallel Computers. Other distributed-system vendors include Arete, Enmasse, IBM, Nixdorf, and NCR, among others.

Another architecture not described is that of a dually redundant data-base system supporting multiple intelligent workstations. Such systems are offered by No Halt Computers of Farmingdale, Long Island, and Syntrex of Eatontown, N.J. (Syntrex's offering is basically a word-processing system.)

Note that a fundamental property of distributed systems, as these architectures are collectively called, is that of transparency. Since a process can be run in any processor in the system, it must be able to communicate with any other process or peripheral device, no matter where that component currently resides. Thus, the configuration of the system must be logically transparent to the process. It cares not where the other processes and I/O devices are. It simply wants to be able to communicate with them. In the following sections, the concept of a process, its management, the management of memory in which it resides, and its role in transparency and survivability will be discussed further. Excellent in-depth coverage of distributed systems is given by Liebowitz and Carson [17].
TRANSPARENCY
In order for a distributed system to be completely general-purpose, the first and foremost requirement is that the user must not be required to be aware that it is a distributed system. When preparing programs, the user should not be concerned with the processor to be used or to which processors the various peripherals to be used are connected. The operating system must provide all the features required for any program running on one processor to communicate with any peripheral device or any other program running on any other processor. This characteristic is called transparency. It hides the intricacies and complexities of the distributed environment from the user, who thus can treat the system as if it were a single computer.
The Process

The concept of transparency is illustrated in Figure 2-6. Let us consider a simple data-base inquiry program. To the user, this is a program that accepts an inquiry from the user terminal, accesses a data base stored on disk, and returns a response, as shown in Figure 2-6a. In a single-processor system, Figure 2-6a might be a good representation of the physical paths involved in the application.
Figure 2-6  Distributed system transparency. (a) The logical view of an inquiry application; (b) a physical four-computer configuration.
In a distributed system, Figure 2-6a represents how we would want to view the application logically. However, the physical representation might be quite different, as shown in Figure 2-6b. Here the application is running in a four-computer system. The operator's terminal is physically connected to computer 1, the inquiry process is physically running in computer 3, and the disk is physically connected to computer 4. Via the interprocessor bus, however, the terminal is logically connected to the inquiry process, which is logically connected to the disk. It is the logical connections that are apparent to the user (Figure 2-6a); the physical connections (Figure 2-6b) are transparent in that the user doesn't necessarily know which computers are involved in a specific application. Thus, the operating system provides transparency if the application program requires no advance knowledge of the system configuration. The identical inquiry program shown in Figure 2-6b would support a terminal connected to computer 2 communicating with a disk on computer 1 while it itself ran as a process in computer 4, and so on.

Since the physical computers are immaterial in the design of an application for a truly transparent system, let us restructure our thinking a little by redefining the following terms:

• A computer (or CPU) is a physical piece of hardware comprising logic, memory, and I/O.

• A program is a physical set of object code, probably residing on a disk connected to some computer.
• A process is a program running in a processor. (There is nothing to restrict multiple copies of a program from running in one or more computers, thus creating several like processes, each perhaps handling a different terminal but otherwise providing the same application as its companion processes.)
• Physical means the way things really are.

• Logical means the way things appear to the user.

Thus, a process is the logical result of a physical program running in a physical computer. The user sees the application provided by the process but is not aware of which physical computer is being used nor of where the physical program resides. A process, then, is the basic logical unit within a distributed system. Each logical task is handled by a process. To the user, there are two types of tasks to be performed and therefore two types of processes: application processes written by the user and device-handling, or I/O, processes provided by the operating system.
I/O Processes

We have already talked about application processes. Device-handling processes, or I/O processes, are typically considered part of the operating system but are in fact identical in structure to application processes. Their job is that of the classical device handler: they handle transfers to and from their respective devices (writes and reads), as well as other control functions, in response to requests from application processes. They differ from application processes only with regard to certain restrictions:

• An I/O process must reside in the same computer to which its corresponding device controller is attached.

• An I/O process cannot originate communication to another process (except to its backup, as described later); it can only respond to requests from other processes.

• An I/O process can execute I/O instructions, whereas an application process cannot.

Thus, the inquiry application of Figure 2-6 does not involve just the application process. It also involves a terminal I/O process and a disk I/O process. The logical structure of the application is therefore better represented by Figure 2-7. Here, the physical interprocessor bus has been replaced with a logical interprocess bus. Application processes are shown above the bus and I/O processes below the bus. Figure 2-7b shows a more extended application in which orders are entered from several terminals. The order entry processes (one per terminal) access various files on disk to verify and build the order on disk. Once an order is complete, the common invoice process is informed so that it can print an invoice on the printer. It reads the disk-resident invoice file and prints the requested invoices. (Note: An application process can be designed to handle a single terminal, as above, or multiple terminals.)
Figure 2-7  Logical process structures. (a) Inquiry application; (b) order entry application.
The example of Figure 2-7b shows that application processes not only communicate with I/O processes but also with other application processes. (An I/O process will never communicate directly with another I/O process, since an I/O process cannot initiate a communication; it can only respond to one.) In fact, except for its internal processing, the only thing an application process in a multicomputer environment can do is to communicate with another process. As we shall see, it is the simplicity of this statement of the role of a process that leads to the elegance of the multicomputer structure and forms the basis of the transparency and survivability functions. A process in a multiprocessor environment is not similarly restrained but still forms the basis for transparency.

This description of processes is exemplary of contemporary systems today. No hard-and-fast rules are followed, and the properties of processes in different systems may be somewhat different from those described here. However, the principles described are sound and will form a satisfactory basis for understanding the nature of a process and its management for performance analysis purposes.

Interprocess Communications
Let us now look in more detail at interprocess communications. One process communicates with another by sending it an interprocess message. To do so, it merely provides to the operating system the name of the process, the content and length of the message, and whether a response is required. Note that this leads to three types of interprocess messages, which we will designate as follows:

• Write. A message is sent to another process with no response required (except for completion status, i.e., success or error condition).

• Read. A null message (i.e., no data, length of zero) is sent to another process with a response expected.

• Writeread. A message is sent to another process, and a response is expected.

In the example of Figure 2-7b, an order entry process might return information to its terminal and wait for the next operator entry. This is a WRITEREAD interprocess message. When it receives data from the operator, it may verify or expand certain information, such as part number, by sending READ messages to the disk process, which will read requested data from the disk and return it to the order entry process. The order entry process may then write invoice data to disk by sending WRITE messages to the disk process and will then inform the invoice process that the invoice is ready by sending it a WRITE message. The invoice process will READ the invoice from the disk process and will print it by sending WRITE messages to the printer process. Note again that I/O processes never send interprocess messages; they can only respond to them.
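These three message types might be captured in C roughly as follows. The msg_type enumeration and the send_message interface are hypothetical names invented for this illustration, not the system calls of any particular operating system; the transmission itself is simulated with printf.

    #include <stdio.h>
    #include <string.h>

    /* The three interprocess message types described above. */
    typedef enum { MSG_WRITE, MSG_READ, MSG_WRITEREAD } msg_type;

    /* Send a message to the named process.  For MSG_READ the outbound
       message is null (length zero); for MSG_WRITE no response buffer
       is needed.  The return value stands in for completion status. */
    static int send_message(const char *process, msg_type type,
                            const void *data, int len,
                            char *response, int resp_max)
    {
        (void)data;             /* content not used in this simulation */
        switch (type) {
        case MSG_WRITE:         /* data out, completion status only    */
            printf("WRITE %d bytes to %s\n", len, process);
            break;
        case MSG_READ:          /* null message out, response expected */
            printf("READ from %s\n", process);
            if (response) strncpy(response, "requested data", resp_max);
            break;
        case MSG_WRITEREAD:     /* data out and response expected      */
            printf("WRITEREAD %d bytes to %s\n", len, process);
            if (response) strncpy(response, "reply", resp_max);
            break;
        }
        return 0;               /* success */
    }

    int main(void)
    {
        char reply[32];
        /* An order entry process might use all three types: */
        send_message("TERM1", MSG_WRITEREAD, "menu", 4, reply, (int)sizeof(reply));
        send_message("VOL1",  MSG_READ, NULL, 0, reply, (int)sizeof(reply));
        send_message("VOL1",  MSG_WRITE, "invoice record", 14, NULL, 0);
        return 0;
    }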
In order to support interprocess messages, the system must provide three facilities:

• a hardware path interconnecting all computers, with supporting software.

• a provision to assign a unique name to each process.

• an operating system capability to know in which computer each process is currently operating.

The hardware path can be implemented in any one of a number of ways, as shown in Figure 2-8. Each has its own advantages and disadvantages:

• The bus is simple but must resolve contention and carry the entire multicomputer load.

• The ring can use different paths simultaneously but places a load on computers not involved in the message.

• A star can use different paths simultaneously but requires additional hardware: the switch. The switch must carry the entire multicomputer load.

• A fully connected system provides maximum capacity but requires N-1 bus connections at each computer for an N-computer system.

• A partially connected system is application-dependent.
Figure 2-8  Interconnection networks: bus, ring, star, fully connected, and partially connected.
No matter which network is chosen, the important fact is that it is transparent to the user. We do not care how an interprocess message physically gets from one process to another as long as it logically gets there. In most contemporary cases, the bus network is used. However, DEC's VAXcluster is a star network, and Stratus uses a modified star/ring network.
Process Names

The next facility required is that of process names. A process is named when it is created. A process is created at the request of an operator command or at the request of another process. Creating a process causes the operating system, in the CPU in which the process is to run, to schedule that program (i.e., object code on disk) to be rolled into memory and run as the named process. A typical command to run a process might be

    RUN ORDER /NAME ORDER1, CPU 3, PRI 150/
This command would cause the program with the physical object file name ORDER (on disk) to be run on physical computer 3 as a logical process named ORDER1 at priority 150. (If the CPU were not specified, the process would run on the same CPU as the creating terminal or process. If the name were not specified, an arbitrary unique name would be assigned. But no process other than the creator would know this name; therefore, no process other than the creator could communicate with it. If the priority were not specified, a default priority would be used.)

Finally, the operating system must know in which computer each process is operating. Whenever a process is created, all computers are informed of this event (via the bus). Each computer maintains a process directory that contains the name of each process and the CPU in which it is currently running. I/O processes are created at system generation time and are permanent entries in the process directory. An exception to this is disk files. A disk I/O process controls one physical disk controller. This controller may be connected to several physical disk units. Many named disk files are on these disks and are therefore accessible through a common disk process. However, the application program will want to reference a file by its name; it doesn't need to know the name of the disk process with which it must communicate in order to reach the physical disk containing that file. Disk files are typically organized into volumes, each volume being handled by a specific disk process. The name assigned to a disk process is, in fact, the name of the volume it is handling. When a request is made to read or write a disk file in a given volume, the process directory is searched to determine the process and CPU handling that disk volume.

Thus, when a process wants to send an interprocess message to another process, its local operating system looks up the process name in its process directory, determines the CPU in which that process is running, and, if it is in another computer, sends the message over the interprocessor bus to the receiving process. If a response is expected, it will return that response to the transmitting process when it is received. Note that the process itself need have no knowledge whatsoever of where other processes reside nor how to get a message to other processes. These functions have been totally handled by the operating system, thus providing the transparency we desire.
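The process directory itself can be modeled as a simple table replicated in every computer. The structure and the lookup function below are an illustrative sketch only; a real directory would also record the geographical node and would be kept synchronized over the bus as processes are created and stopped.

    #include <stdio.h>
    #include <string.h>

    /* One entry per created process: its unique name and the CPU in
       which it is currently running.  Every computer holds a copy. */
    struct dir_entry {
        char name[16];
        int  cpu;
    };

    static struct dir_entry directory[] = {
        { "ORDER1",  3 },   /* application process                */
        { "VOL1",    4 },   /* disk process, named for its volume */
        { "INVOICE", 2 },
    };

    /* Look up the CPU running the named process; -1 if unknown. */
    static int find_cpu(const char *name)
    {
        size_t i;
        for (i = 0; i < sizeof(directory) / sizeof(directory[0]); i++)
            if (strcmp(directory[i].name, name) == 0)
                return directory[i].cpu;
        return -1;
    }

    int main(void)
    {
        int cpu = find_cpu("VOL1");
        if (cpu >= 0)
            printf("route the message over the bus to CPU %d\n", cpu);
        return 0;
    }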
Process Mobility

The fact that a process may be mobile (i.e., move from one computer to another) leads to two important characteristics of multicomputer systems:

• Load sharing. The creator of a process can request the status of all computers and can then create the process in the least-loaded computer. It is common practice among programmers to display the status of all computers before deciding in which computer to run a compile, edit, or other utility.

• Survivability. Process mobility allows us to re-create a process (or switch to a backup process) if the computer that was running that process has failed. This subject will be dealt with later in depth.
Of course, in multiprocessor systems, no such mobility is required, as the system is automatically load balancing; a failed processor simply does not execute any tasks off the common queue.

A final point should be made about system transparency. As we have said, in a truly transparent system, the user is unaware of where the various peripherals and processes are in a distributed system because he or she is unaware of the mechanism involved in routing a message from one process to another. This mechanism involves the high-speed interprocessor bus. Though the high speed of the bus is necessary to achieve good system throughput, that is the primary reason for its speed. There is no logical reason why part of the bus could not be a slower communication line, allowing geographical separation of groups of computers. The system of Figure 2-6b might be distributed as shown in Figure 2-9. This provides a geographical distribution for a large system, which can offer many advantages:
Figure 2-9  A geographically distributed system: St. Louis and Chicago nodes connected by an extended interprocessor bus.
• The use of a common corporate data base without excessive line costs.

• Load balancing in a large, geographically distributed system by running large jobs at the least-loaded node.

• Printing of reports at the user's site even if the data base is remote.

• Message delivery among all users of the distributed network.

• Local control over local processing capability while having access to the network for load sharing and data-base access.

This distributed system requires no substantial effort on the part of the user, for the same transparency concepts of interprocess communication hold as described previously. It is only necessary for the operating system to be able to know the geographical node, as well as the computer in which a process is running, via the process directory.
Summary

We have seen that distributed system transparency has been achieved by considering an application as a communications network. The nodes of the network are the processes, which communicate with each other via interprocess messages over an interprocess bus. As in any communications system, node or path failures may cause loss or duplication of messages. This is one of the major considerations when we discuss survivability. Also, communication path delays and queuing at the nodes impose a limit on the capacity of the system. This will become the basis of our throughput analysis.
PROCESS STRUCTURE

We have described the process as the fundamental logical unit of a distributed system from the applications viewpoint. A process is simply a program running in a computer. There are two types of processes: application processes, which perform a portion or all of the application-oriented logic, and I/O processes, each of which is a device handler for a specific device or device group. An application is implemented by a set of appropriate processes that communicate among themselves via interprocess messages. In order to understand the mechanisms used to achieve survivability and to be able to analyze and enhance the throughput of a distributed system, we must understand a process in more detail. In this section, we will describe a typical structure for a process and then explore how it is managed by the operating system.
Process Functions

Since a process is a program (in the classical sense) running in a computer, it has some of the characteristics we naturally associate with programs. That is, it comprises a set of instructions, or code, that operates on data. However, several restrictions are placed on the
code and data at this point that make a process a subset of what we generally think of as a program. In fact, a process is allowed to do only two things:

• Perform its assigned logical duty by operating on its own local data (this includes communicating with and controlling its device in the case of an I/O process).

• Exchange interprocess messages with other processes. These messages are used to pass data and control functions from one process to another.

A process is specifically not allowed, for instance, to modify its own code, to access or manipulate the data of another process, or to directly interface with a device not under its control as an I/O process. The structure of a process is shown in Figure 2-10. We will characterize it as comprising three parts:

• A process control block (PCB), which is used by the operating system to schedule and control the process.

• The code that implements the functions of the process.

• The data local to the process, upon which the code operates.
Figure 2-10  Structure of a process: a queue of received interprocess messages, code (including calls to system routines), and local data; the process transmits interprocess messages, optionally receiving responses.
As shown in Figure 2-10, the process is allowed to do only two things in addition to processing its own local data. It may exchange messages with other processes and may
make certain calls to routines (not processes) contained within the operating system (these calls are considered part of its assigned logical duty). Only three types of interprocess messages may be sent by a process:

• A message to another application process, perhaps with a response. These messages are used to pass control functions and data between application processes.

• Read, write, and control requests to I/O processes for peripheral device activity. These messages are similar in all respects to messages sent to other application processes.

• Data to its backup process, if any, to inform the backup process of the current status of the data in case a failure should cause the backup to have to take over processing. These "checkpointing" messages will be described later.

The first two message types may be sent by application processes only. As we have said before, they cannot be originated by I/O processes. However, any process, application or I/O, may send a checkpoint message to its backup process. This forms the basis of one of the survivability strategies, which will be described later.

In addition to these interprocess messages, a process (application or I/O) may receive messages that appear to be interprocess messages but are in fact generated by the operating system. These messages are generally status messages concerning situations such as processor down or up events or the stopping or aborting of a process that was created by this process.

As shown in Figure 2-10, messages sent to a process need not necessarily be processed immediately. Rather, they are queued to the process (via links in the PCB) until the process is ready to read and process them. This queuing facility will become important in our discussions of survivability, since there may be no way for the backup process to know about messages queued but not read by its primary process, nor may there be any way for the operating system to know that a message has been read and processed but not answered. Lost and duplicate messages in the event of a takeover are as important in multicomputer survivability considerations as they are in any communications network.

The system calls that a process can make to the operating system generally fall into two classes:

• Calls to routines that are simply common routines supplied by the operating system but which in reality are logically part of the process's code. For the most part, these routines implement the transmission and reception of the various types of interprocess messages. They are formally part of the operating system because they use system-level tables and common buffer pools that must be "hidden" from the application programmer.

• Scheduling requests for itself and other processes. A process can request that it be suspended until any one of a set of events occurs (discussed later), or it can request that another process be created or stopped.
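A PCB and its queue of received messages might be declared along the following lines. The field names are assumptions made for this sketch; actual PCB layouts are system-specific and carry far more state (memory maps, file control blocks, and so on).

    #include <stddef.h>

    /* An interprocess message, queued to a process until it is read. */
    struct ip_message {
        struct ip_message *next;     /* link to next queued message */
        int  wants_response;         /* READ or WRITEREAD?          */
        int  length;
        char data[256];
    };

    /* A process control block.  The operating system schedules and
       controls the process entirely through this structure. */
    struct pcb {
        char name[16];               /* unique process name             */
        int  priority;
        int  state;                  /* waiting, ready, or active       */
        struct ip_message *msg_head; /* queue of received messages      */
        struct ip_message *msg_tail;
        struct pcb *list_link;       /* link for ready or timer list    */
        long timeout;                /* incremental time-out, if timing */
    };

    /* Queue a received message to the process (via links in the PCB).
       The process will read it when it is ready to do so. */
    static void queue_message(struct pcb *p, struct ip_message *m)
    {
        m->next = NULL;
        if (p->msg_tail)
            p->msg_tail->next = m;
        else
            p->msg_head = m;
        p->msg_tail = m;
    }

    int main(void)
    {
        struct pcb p = { "ORDER1", 150, 0, NULL, NULL, NULL, 0 };
        struct ip_message m = { NULL, 1, 5, "hello" };
        queue_message(&p, &m);
        return 0;
    }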
Addressing Range
The rather formal structure of a process, as described above, has important implications with respect to addressing within the process. In the process structure defined above, there is not a requirement that any memory location in the system be available from any memory location within a process. Conversely:

• The code cannot modify itself. Therefore, a data instruction can access only the data area, and a jump, branch, or routine call instruction can refer only to the code area. Thus, code and data spaces are mutually exclusive. In physical memory, the code and data areas are separate and identifiable.

• A process can access only its own local data area. No addressing range is necessary or allowed to access the data within the local range of another process.
Note that one process does not have direct access to the data of another process. This is not an arbitrary restriction but is necessary to the fundamental requirement of survivability. It is of paramount importance that the data area of a process not be corrupted by another process that is "sick" because of faulty hardware or software. If a faulty process had access to the data areas of other processes, it could indeed corrupt those areas and cause a system failure. However, processes have access to the data of other processes only through the mechanism of interprocess messages. This gives a process the ability to edit and reject improper accesses to its data area, thus ensuring to any arbitrary extent the integrity of its own data. This may not be true of multiprocessor systems that share a common memory.
Process Code Area

For purposes of some of the following discussions, particularly with regard to survivability, it is important to understand in more detail the structure of the code and data in a process. The code area comprises a collection of procedures. A procedure is a routine that has an entrance and an exit (or perhaps multiple entrances and exits). One procedure is designated the main procedure. It is given control when the process is first run and has no exit except to stop the process. It "calls" other procedures, which may in turn call other procedures, and so on. When a procedure exits, it returns control to the procedure that called it at the point in the code following the call. There is no other way for a procedure to exit (short of terminating the process). Figure 2-11 shows a typical code area and its associated data areas as processing progresses through several stages.
Process Data Area
The data area is separated into two distinctly different areas: global data and the process's local data, the latter often implemented via a stack, as will be described. The global data
Figure 2-11  Processing cycle. (a) The code and data areas at process creation; (b) main calls a procedure, pushing a procedure data set (parameters, stack marker, local data) onto the stack; (c) stack configurations as main calls A, then B, which calls C, and each exits in turn.
is available to all procedures within the process. The stack is used to hold the data that is local to the procedures currently invoked and to control the nesting of those procedures. There is some similarity between processes and procedures. Just as a process has exclusive access to its data area, a procedure has local data to which only it has access. Moreover, just as a process can pass data to another process via an interprocess message, so can a procedure pass data to another procedure when it calls it. Therefore, a procedure has access to only three types of data:
• The global data of the process, which is accessible by all procedures within that process.

• Its own local data.

• Data passed to it by the procedure that called it.
The Stack

A stack allows data items to be pushed onto it and then popped from it in inverse order (last-in, first-out). The last two types of data listed above are created in the stack area and are then deleted as procedures are entered and exited. Therefore, this data requires addressable memory space only when it is meaningful, i.e., when the procedure is invoked. The data simply does not exist when the procedure is not being used. The use of a stack for procedure data thus brings a degree of efficiency to memory utilization.

Figure 2-11 shows an example of stack usage as a process executes. When the process is first created, it is given a data area containing its global data and the local data for the main procedure (since the main procedure always exists, so does its local data). The code and data areas at process creation time are shown in Figure 2-11a. As the main procedure executes in-line code, the data area remains in this configuration. However, when the main procedure decides to call another procedure, it does so by pushing a procedure data set onto the stack. The procedure data set comprises all of the data needed by the called procedure and, in fact, is the only data available to that procedure (except, of course, that the global data is available to all procedures). The procedure data set comprises three types of data, as shown in Figure 2-11b:

• The parameters being passed to the called procedure by the calling procedure. These parameters can include pointers to data in the global area or pointers to local data of the calling procedure.

• A stack marker defining how the procedure is to exit. The stack marker contains the address in the calling procedure to which the called procedure is to return (the address following the procedure call) and the machine environment that is to be restored following the called procedure's return.

• The local data for the called procedure.

The called procedure then executes, operating on the parameters passed to it and the global data pertinent to it. When it has completed, it deletes its data set from the stack and returns to the calling procedure according to the stack marker. Of course, a procedure can call another procedure, which can call another, and so on. In this case, the stack grows, with the nesting of procedures being defined by the sequence of stack markers.

Figure 2-11c shows the stack configuration as the main procedure executes. The main procedure first calls procedure A, which executes and returns. The main procedure then calls procedure B, which in turn calls procedure C. At this point, the stack contains two procedure data sets, one for B and one for C. Procedure C then completes, returning to B, which itself completes, returning to the main procedure with a clean stack.

A vital point that will form one of the foundations of our survivability discussion is that the data area with its stack completely defines the state of the process at any point in
its processing cycle. If we were to interrupt a process and then could somehow move its data area as it then stood and its code area to another processor and restart execution at the interrupted instruction, the process would continue to run as if nothing had happened (having, of course, copied the machine environment in terms of memory maps, condition codes, and so forth). More about this later.
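The stack discipline just described can be imitated with a short C sketch. The frame layout (parameters, then a stack marker, then local data) follows Figure 2-11, but the word counts and the two-field marker are simplifications invented here; a real stack marker would also carry register and map state.

    #include <stdio.h>

    /* A stack marker: how the called procedure is to exit. */
    struct marker {
        int return_point;   /* where to resume in the caller        */
        int saved_env;      /* stand-in for the machine environment */
    };

    static int stack[1024];    /* the process's stack area */
    static int top = 0;        /* next free word           */

    /* Push a procedure data set: parameters, marker, local data. */
    static void call_proc(int nparams, int ret, int env, int nlocals)
    {
        int i;
        for (i = 0; i < nparams; i++)
            stack[top++] = 0;           /* parameter words */
        stack[top++] = ret;             /* stack marker    */
        stack[top++] = env;
        for (i = 0; i < nlocals; i++)
            stack[top++] = 0;           /* local data      */
    }

    /* Exit a procedure: delete its data set, return per the marker. */
    static void exit_proc(int nparams, int nlocals)
    {
        struct marker m;
        top -= nlocals;                 /* discard local data */
        m.saved_env    = stack[--top];  /* pop the marker     */
        m.return_point = stack[--top];
        top -= nparams;                 /* discard parameters */
        printf("resume caller at %d, env %d (stack depth now %d)\n",
               m.return_point, m.saved_env, top);
    }

    int main(void)
    {
        call_proc(2, 100, 7, 3);    /* main calls B    */
        call_proc(1, 200, 8, 2);    /* B calls C       */
        exit_proc(1, 2);            /* C exits to B    */
        exit_proc(2, 3);            /* B exits to main */
        return 0;                   /* clean stack     */
    }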
Summary

In summary, a process is the basic logical entity at the applications level in a distributed system. It can do only two things: perform its own assigned processing functions (including process scheduling) and exchange interprocess messages with other processes. An application process can send interprocess messages to other application processes, to I/O processes, and to its own backup. An I/O process can respond to messages received from application processes and can send messages to its own backup.

A process comprises a data area and unmodifiable code that implements a series of procedures. The data area contains global data, which is available to all procedures, and a stack. The stack holds all local data pertinent to the procedures currently invoked and controls their nesting. The data area and the current machine environment totally define the current state of the process.
PROCESS MANAGEMENT

Having described the process as the basic logical element in a distributed system, and having explored its structure and the way in which it interacts with other processes to implement an application, we must now turn our attention to how the process shares its environment cooperatively with other processes competing for the same resources. Obviously, in a large distributed system, many things go on simultaneously. Many users perform their various tasks while sitting at many terminals, while operatorless background tasks may be at work updating data bases, maintaining statistics, and doing sundry other tasks. Many processes are active simultaneously.
Shared Resources

Processes in one computer generally run independently of processes in other computers. However, that group of processes assigned to a given computer (all processes in a multiprocessor system) must share its resources in such a way that all processes may complete their tasks in a timely manner. There are two dominant computer resources to be considered: time and space.

The time resource is processor time. Though it appears to the user that multiple processes are running concurrently in a processor, we know, in fact, that at any instant only one process is actually running. It will run until it decides to relinquish control of the processor, at which time another process will be given the processor. Therefore, there must be a way to know which processes are awaiting the processor and to transfer control
of the processor from one process to another in an orderly and efficient manner. This is called process scheduling. A necessary efficiency in process scheduling is that processes carry priorities and that, in fact, the highest-priority process desiring the processor has it.

The space resource is memory. The sum total of the memory requirements of all the processes that are to run in a given computer could well exceed that computer's memory size. This is particularly true with mobile processes in a multicomputer system, for we do not necessarily know in advance which processes are going to run where. Therefore, we must have a way to allocate the available memory to those processes currently requiring memory. This is called memory management.

The functions of process scheduling and memory management in a distributed system do not differ significantly from the way in which they are handled in any one of many modern-day single-processor operating systems; the reader who is familiar with these operating systems will already understand the concepts to be discussed. However, they are important for our following discussions of survivability and performance and will therefore be described so that we will all be talking the same language.

Process Scheduling

Let us first discuss process scheduling. We assume that several processes have been created in a computer and are currently taking turns running in that computer. A process can be in one of three states:

• Waiting. The process is currently idle but is waiting for one or more events to occur. Until one of these events occurs, there is nothing the process can do; therefore, it is not a candidate for scheduling. These events may typically be one of three types: (1) the receipt of an interprocess message; (2) a device interrupt signifying an action has been completed and requires process action (for I/O processes only); (3) a time interval specified by the process has elapsed.
• Ready. An event has occurred, one that must be handled by the process. The process is ready to make use of the computer when it is given the chance.

• Active. The process is currently the process running in the computer.

(Note that there is no such thing as a dormant process. Once a process is created, it is waiting, ready, or active. If it decides to stop itself, it no longer exists. The physical program that was used to create it still exists, but the logical process has disappeared from the system.)

The active process will continue to run and to consume all processor time (except for interrupt and cycle-stealing activities, which are transparent to it) until one of the following situations occurs:

• It decides it must wait for an event (interprocess message or response, interrupt, and/or time interval).
• It decides to stop itself. (The process is of no further interest to us, since it is no longer a candidate for scheduling.)

• A higher-priority process becomes ready (it will preempt the active process).
In the first case (waiting for an event), the process goes into the waiting state. In the second case (termination), the process disappears. In the third case (higher-priority process), the process returns to the ready state. In any event, the highest-priority ready process, if any, is given control of the processor. This switching of processes proceeds indefinitely and is the basis of process scheduling within a computer.
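The three states and the transitions just listed can be summarized in a few lines of C; the enumerations and the transition function are purely illustrative.

    #include <stdio.h>

    typedef enum { WAITING, READY, ACTIVE } pstate;

    /* Events that move a process between states. */
    typedef enum { EVENT_OCCURS, GETS_PROCESSOR,
                   MUST_WAIT, PREEMPTED } pevent;

    static pstate next_state(pstate s, pevent e)
    {
        switch (e) {
        case EVENT_OCCURS:   return (s == WAITING) ? READY   : s;
        case GETS_PROCESSOR: return (s == READY)   ? ACTIVE  : s;
        case MUST_WAIT:      return (s == ACTIVE)  ? WAITING : s;
        case PREEMPTED:      return (s == ACTIVE)  ? READY   : s;
        }
        return s;
    }

    int main(void)
    {
        pstate s = WAITING;
        s = next_state(s, EVENT_OCCURS);    /* message arrives: ready */
        s = next_state(s, GETS_PROCESSOR);  /* scheduled: active      */
        s = next_state(s, PREEMPTED);       /* higher priority: ready */
        printf("final state: %d\n", s);
        return 0;
    }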
Mechanisms for Scheduling

Processor scheduling is handled via two fairly straightforward mechanisms that we will call the ready list and the timer list. The ready list is a list of all processes in the ready state, in order of priority (within the same priority, the processes are ordered on a first-come basis). The timer list is a list of all processes awaiting time-outs, in order of their time-out interval. A process may be on the timer list, on the ready list, or on neither. Processes are maintained in a list via links in their PCBs.

Figure 2-12 shows a typical process scheduling sequence. In Figure 2-12a, 10 created processes, named A through J, are running in a computer. Process A is currently the active process; it is the one that is actually running in the computer. Processes B, C, D, and E are all ready to run but are lower priority than process A (priority 1 is the highest, to simplify this example). Therefore, they are linked in the ready list via their PCBs, with process B being at the head of the list, since it is the highest-priority ready process. Processes F, G, H, I, and J are waiting for some event to occur, and processes F, G, and H are furthermore being controlled by time-outs. For example, if an event does not occur within six seconds, process G will become ready anyway. (A process can also wait simply for the purpose of timing and not for any other event but its own time-out.) Processes F, G, and H are linked via their PCBs into the timer list, with process F being at the head of the list, since it has the shortest time-out. Time-out values in the PCBs for this example are incremental values to be added to those ahead of them on the list. Thus, processes F, G, and H are waiting 1, 6, and 16 seconds, respectively.

In Figure 2-12b, process Z has been created and has been added by the operating system to the ready list. Since it has higher priority than any other process in the ready list, it is placed at the head of that list.

In Figure 2-12c, process A has completed and has suspended itself until it receives another event. It is not timing on that event. Process Z, being the highest-priority ready process, is removed from the ready list and made the active process.

In Figure 2-12d, process Z has suspended itself to await an event and has requested an eight-second time-out. It is therefore placed in the waiting state and added to the timer list between processes G and H, which have smaller and greater time-outs, respectively (process H's time-out is adjusted by Z's increment). Note that F timed out and has been added to the ready list. Also note that process J has received an event and has
Figure 2-12  Process scheduling. (a) Initial condition; (b) process Z is created; (c) process Z runs; (d) process Z suspends; (e) process Z usurps.
therefore been scheduled by being added to the ready list. Since process Z is now suspended, process B is removed from the head of the ready list and is made active.

In Figure 2-12e, process Z has received an event prior to its time-out. It is removed from the timer list, and process H's time-out is readjusted to include what had been Z's time-out interval. Process Z has a higher priority than the currently active process. Therefore, it is not placed on the ready list. Instead, the currently active process B is returned to the head of the ready list, and process Z is made active, thus usurping control of the processor from process B. This is called preemptive scheduling.

The foregoing example shows that processes share the processor by simply moving from active to waiting to ready and back to the active state again. Via the timer and ready lists and via appropriate actions in response to events by the operating system, the current state of each process is known and controlled. Process scheduling gives a computer a multitasking capability because, in effect, several different tasks may be running concurrently.
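Both lists can be sketched in C using links in the PCBs. The delta encoding of the time-outs, in which each PCB carries only its increment over the PCB ahead of it, matches the Figure 2-12 example (F, G, and H waiting 1, 6, and 16 seconds); the structure and function names are hypothetical.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct pcb {
        char name[8];
        int  priority;          /* 1 is the highest     */
        long delta;             /* incremental time-out */
        struct pcb *next;
    };

    /* Insert into the ready list in priority order, first-come
       within the same priority. */
    static void ready_insert(struct pcb **head, struct pcb *p)
    {
        while (*head && (*head)->priority <= p->priority)
            head = &(*head)->next;
        p->next = *head;
        *head = p;
    }

    /* Insert into the timer list, keeping deltas incremental. */
    static void timer_insert(struct pcb **head, struct pcb *p, long timeout)
    {
        while (*head && (*head)->delta <= timeout) {
            timeout -= (*head)->delta;  /* consume increments ahead  */
            head = &(*head)->next;
        }
        p->delta = timeout;
        if (*head)
            (*head)->delta -= timeout;  /* successor keeps its total */
        p->next = *head;
        *head = p;
    }

    static struct pcb *make(const char *n, int pri)
    {
        struct pcb *p = calloc(1, sizeof(*p));
        strcpy(p->name, n);
        p->priority = pri;
        return p;
    }

    int main(void)
    {
        struct pcb *ready = NULL, *timer = NULL, *p;
        ready_insert(&ready, make("B", 2));
        ready_insert(&ready, make("Z", 1));      /* goes to the head      */
        timer_insert(&timer, make("F", 5), 1);   /* F, G, H: 1, 6, 16 s   */
        timer_insert(&timer, make("G", 5), 6);
        timer_insert(&timer, make("H", 5), 16);
        timer_insert(&timer, make("Z", 1), 8);   /* lands between G and H */
        for (p = ready; p; p = p->next) printf("ready: %s\n", p->name);
        for (p = timer; p; p = p->next) printf("timer: %s +%ld\n", p->name, p->delta);
        return 0;
    }

Removing a process from the timer list before its time-out expires (as process Z is removed in Figure 2-12e) is the mirror operation: its remaining delta is added back to its successor's.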
Memory Management

Turning now to memory management, the problem is how to stuff five pounds of potatoes into a two-pound bag. Since in a distributed system we generally do not know in advance what processes will be running, much less in which computer, we must be prepared to support a set of processes that have memory requirements exceeding the memory availability of the computer in which they are running, perhaps by a wide margin. The requirements of a single process might even be bigger than the memory on its computer.
Paging

The first step in memory management is to realize that all processes often cannot fit into memory simultaneously, but somehow the contents of their data areas must be maintained. Therefore, an image of the process's data area is created on disk when the process itself is created. (Code areas never change, and their images are already available on disk.) This disk image represents the current contents of all memory locations in the data area that are not currently in the physical memory of the computer. It is the responsibility of memory management to ensure that if a section of a process's data area is to lose its place in memory to another process, its disk image must first be updated. Since the disk image of a process (code and data) is merely a representation of the true process, which exists only when it is actually in memory, it is called virtual memory. By proper management of virtual memory, the computer can be made to appear to have a memory as big as is needed to hold all processes.

Virtual memory may be managed via a mechanism known as page faulting. Physical memory is organized into "pages," typically 0.5K to 4K bytes in size. When a process addresses a word in its memory (either in its data area or in its code area), the processor (by hardware) ensures that the page containing that word is, in fact, in memory; if not, obviously, it cannot be accessed.
If the processor finds that the page currently being referenced is not in memory, it generates a page-fault indication. This means that the processor must bring this page into memory from its disk image before the current process can continue. In order to do that, the processor must select a memory page that it is willing to overwrite with the new process page (note that every memory page contains valid code or data from some process or another). The algorithm used to select a memory page for destruction varies from system to system but is generally based on two considerations:

• Age. When was this page last referenced? If the contents of a page have not been used for awhile, the page is considered dormant and a candidate for overwriting.

• Modification. "Dirty" pages are data pages (never code pages) that have been modified since they were last read from their disk images. If a dirty page is to be overwritten, it must first be written out to update its disk image, whereas a clean page does not need to be written out. Clean pages, therefore, have priority over dirty pages as candidates for overwriting, since fewer system resources will be used.
paging. One of the obvious cbaractaistic:s of virtual memory as descr:ibed above is that one never knows what physical page in memory CUIreDtly contains a particular logical page of a process. In fact, a logical process page migrates randomly through physical memory.as page fanlting occurs. SomIAimes it is 1bere, and sometimes it isn't. And wheoever it ~, it pops up somewhere else.
-
Transaction-Processing Systems
Chap. 2
Keeping track of which logical page is currently in which physical page is called memory mapping. Like page faulting, it is largely a hardware-supported function, since it must be performed on-the-fly with instruction execution. To accomplish this, the processor typically has a set of hardware memory maps. One may be used, for instance, to map the code area and one to map the data area. These maps can be loaded by software and then used by the hardware as a process executes. The contents of a memory map are illustrated in Figure 2-13.
Figure 2-13  Memory maps: a code map and a data map, each translating logical process page addresses to physical memory page addresses. Pages not currently in physical memory are marked as absent.
Likewise, each process has a code map and a data map associated with it. When it is running, its maps are loaded into the processor's hardware maps. Let us say that the example of Figure 2-13 shows the maps for the currently running process and that the process is currently executing code in its first logical page (page 0). This code is found in physical page 113 by the processor, and the processor is thus executing code located in physical page 113. If the process's code execution falls through to page 1 or branches to page 4, execution proceeds normally, since these pages are in physical memory. However, if it should attempt to execute an instruction in its logical page 2, the processor will not find that page to be currently in physical memory. It will page fault, find an old page to overwrite (say physical page 97), and will read the current process's logical page 2 into physical page 97. Any maps referencing the old contents of physical page 97 must be modified to show that logical page to be no longer in physical memory. The process may now proceed to execute code in its logical page 2 (physical page 97). A similar scenario holds for data-page accesses. In the example of Figure 2-13, data accesses to logical pages 0, 1, 2, or 4 will be executed normally. A data access to page 3 will cause a page fault.

Note that under memory mapping, the code or data area of a process is not necessarily (and probably is not) mapped into contiguous physical memory. Though memory space appears logically to be contiguous to the user, it is in fact spread quite randomly throughout physical memory. We have described separate memory maps for code and data areas, for these areas are totally exclusive and may each range over the entire address space.
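Map lookup can be modeled as an array indexed by logical page number, with a reserved value marking pages not in memory. The values below loosely follow the Figure 2-13 example (code page 0 in physical page 113; code page 2 absent, faulting into physical page 97; data page 3 absent); the map update that would follow the fault is omitted for brevity.

    #include <stdio.h>

    #define ABSENT -1   /* logical page not currently in memory */

    /* The running process's hardware maps, indexed by logical page. */
    static int code_map[5] = { 113, 27, ABSENT, 17, 72 };
    static int data_map[5] = { 16, 79, 3, ABSENT, 13 };

    /* Translate a logical page to a physical page; a miss is a
       page fault, serviced here by pretending the page was read
       from its disk image into physical page 97. */
    static int translate(const int *map, int logical_page)
    {
        int phys = map[logical_page];
        if (phys == ABSENT) {
            printf("page fault on logical page %d\n", logical_page);
            return 97;  /* stand-in for the newly loaded page */
        }
        return phys;
    }

    int main(void)
    {
        printf("code page 0 -> physical %d\n", translate(code_map, 0));
        printf("code page 2 -> physical %d\n", translate(code_map, 2));
        printf("data page 3 -> physical %d\n", translate(data_map, 3));
        return 0;
    }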
Managing Multiple Users

One additional fallout from these types of operating systems should be noted. As we have said, a process is a program running in a computer and comprises a code area and a data area. The code area is never modified. There is no reason that in that same computer, another process could not be created using the same program but simply carrying another name. In fact, since its code area would be identical to that of the first process, the code area would not have to be duplicated; both processes would map into the same code area in physical memory, as shown in Figure 2-14.
Figure 2-14  Multiuser operation: many processes created from the same program INQ, each serving its own terminal, share a single memory-resident copy of the code.
Let us introduce one more concept, that of the home terminal. When a process is created by an operator, it is created by that operator typing in a command at a terminal. The process then knows that terminal as its home terminal and may refer to it as such.

Consider a program written to perform a task (maybe an inquiry task) to support one and only one terminal, which is its home terminal. The program file is stored as an object file on disk and is named INQ. One operator could walk up to terminal 1 (in Figure 2-14) and type

    RUN INQ /NAME INQA/

and the process INQA would be created. A copy of the code corresponding to program INQ would be made available to process INQA, and a data area image for process INQA would be created on disk. The operator can now use process INQA for inquiry purposes. As the process executes, pages required by it will be paged in and mapped as previously described.
A second operator can walk up to terminal 2 and enter a similar command:

    RUN INQ /NAME INQB/

and process INQB will be created as was INQA, except that a second copy of INQ's code does not need to be paged into physical memory if INQB is running in the same computer as INQA (that is, the terminals are on the same computer, or the CPU is specified in the RUN command). Process INQB will use the same code area as INQA. Of course, a separate data area is created for process INQB.

This can continue virtually without limit, allowing more and more simultaneous users of the same single-user program by creating more and more uniquely named processes using the same program. All such processes running in a common computer will use a common memory-resident copy of the code, but each will have its own data area. In fact, the sum total of memory used by many single-user processes is about the same as the memory used by one large multiuser process, since the separate data areas are required anyway. The duplication of common data areas and the additional multiprocess overhead (e.g., PCBs) are perhaps offset by the simpler code required.
Additional Considerations

The creation of a large number of processes does not come without problems, and there are some instances in which it is desirable to design a process to handle a number of users. Among these considerations are the following:
• Each process requires a PCB and other system structures, such as file control blocks, that are allocated from very valuable common buffer space in the system data area. This imposes a practical limit on the number of processes that can exist in one computer at any one time. Typical contemporary systems can support from 100 to a few hundred concurrent processes.
• If large buffers are required for each user, n users will require n buffers if each process is single-user. This could result in excessive page faulting. A multiuser process could get by with a smaller buffer pool, which it dynamically allocates to users as needed, so page faulting would be reduced or eliminated.
• Interprocess messages are time-consuming in multicomputer systems. Typical interprocess message times range from 1 msec to 50 msec. In some applications, it is possible to batch transactions in one process before sending them to another process, thus saving many interprocess messages. In this case, it may be desirable to have a process control many users so it can accumulate a batch in a relatively short period of time. Otherwise, data-base updating and other functions dependent on these transactions might drag.
• The act of process switching adds operating system overhead. Typical overheads run from 0.5 msec to 10 msec. When processes are switched, the old active process must be switched to waiting and perhaps added to the timer list, the new active process must be removed from the ready list, and maps must be switched.
A single-user process must be switched after every transaction. A multiuser process must be switched only when it has processed all transactions pending from all its users.
Su...",.ry This description of processes and process management highlights another important feature of contemporary distributed systems: simple software development. Let us look at what this type of operating system allows us to do. A programmer can write a program designed to support a single user and, with virtually DO additional effort, find that the program has the ,following attributes: • It is a multiuser program, one that can be used by many users simultaneously. • It is a program that will run in a multitasking enviromnent (i.e., many other tasks may be numing CODCU1IeDtly). • It is a program that will run on a computer regardless of the amount of memory it requires or the amount of memory available to that computer. • It is a program that will run in a distributed enviromnent in which the programmer can be unaware of the particular computer that will be used or to which computers the programmer's various periphe:ral devices are connected. Thus, the problem of writing multiuser, multitasJdng. distributed programs has been reduced to the writing of a single-thread, single-user progEaDl-the simplest possible solution. UDfortuDately, this has not come without peaalty. The penalty is in system capacity because of the high overhead of this do-all operating system. But tbis is the trade-off we will be willing to make as hardware costs go down and as software costs go up-more ,hardware for less software. The critical subject of system capacity and responsiveness thus becomes even more important and creates a stronger focus 'on the need for performance analysis.
SURVIVABILITY

The simple expedient of formally defining the process as the basic logical unit in a system, bounding its capabilities, and then building an operating system that effectively manages such processes leads us to a powerful programming environment. We can now write multiuser applications that run in a multitasking, distributed environment while concerning ourselves only with the problems of a single-user, single-thread application.

The distributed aspect allows us a further extension of these capabilities. Since the system now has at least two processors, and since we have the option of adding two of anything else that might be critical, we have the opportunity to make the system highly fault-tolerant. We can create a system that will survive any single failure (and many cases
of multiple failures) in that it will continue to perform functionally in the same way in the face of these failures. The user may notice a loss of capacity or responsiveness but will not lose any of the system's capabilities. We will see that the structure and management of processes play a big role in achieving this goal.
Hardware Duality

The first step in achieving survivability is hardware duality. If a critical hardware component fails, there must be at least one other identical component that can be used immediately. Equally important are the paths to all components. If one path fails, there must be an alternate path to that component. For instance, if the processor to which a user's terminal is connected fails, and there is no means to connect the terminal to another processor, then so far as the user is concerned, the system has failed. This leads to the concept of dual-ported devices, in which each device controller has two ports, each of which can be connected to a separate computer. At any one time, one of the ports is being actively used, and the other is dormant, playing a backup role. The computer connected to the active port is said to "own" the device.

As an example, Figure 2-15 shows a dual-ported printer. The printer (a normal, everyday, single-ported printer) is connected to a controller that can be driven either by processor A or processor B via two independent ports. Each of these processors runs an I/O process, which controls the printer via its connected port (remember that an I/O process must reside in the same computer to which the device is connected). Processor A currently owns the printer, and the operating system knows this. Therefore, interprocess messages containing data to be printed on the printer are routed to the printer I/O process in
processor A.

[Figure 2-15 A dual-ported printer.]

There are several failures that could cause this path to the printer to fail. Specifically, processor A could fail, or logic in the printer controller port connected to processor A could fail. In the former case, the operating system would realize the failure of processor A and would transfer ownership of the printer to processor B. In the latter case, the printer I/O process would detect the fault in the controller and would transfer ownership to processor B. In either event, subsequent interprocess printer messages are sent to the printer I/O process in processor B. Therefore, the failure of a processor or a device port is indeed transparent to the user insofar as the user's access to that device is concerned (providing the application process is written to reissue an I/O message in the event of an error).

Unfortunately, in all devices there are simplex points of failure that will totally remove that device from service. For instance, the failure of a printer motor or even common logic, such as a line driver in the printer controller, will disable the printer. This can be overcome only by totally duplicating the device and its controller and by making provision in the system for rerouting work away from the failed device to an alternate device. In the case of a printer, for instance, a sophisticated spooler queues work for all printers on the system. If a printer fails, it simply becomes unavailable to the spooler, which continues to despool all work to the remaining printers. In many cases, the duplication of a device is not economically justifiable, and work for a failed device is simply held until the device once again becomes available.

There are two cases in which the continued operation of the device is every bit as important as the processors themselves. One case involves the interprocessor bus; without it, all paths but local paths within a processor are lost. Therefore, this bus must be duplexed. The other case is that of disk files containing critical data bases and system files (program and process images).
Data-Base Integrity

If a disk containing a critical file goes down, and there is no alternate, the system goes down. Totally. Furthermore, just having a backup disk is not satisfactory. It must contain completely updated files, i.e., it must be a mirror image of its partner. As data comes in that updates one disk of a "mirrored pair," it must also update the other disk. Figure 2-16 shows the configuration for a mirrored disk pair. Three levels of mirroring may be used:
• One controller and two disks (Figure 2-16a). However, if controller logic common to both disks fails, then access to both disks is lost.
• One controller per disk (Figure 2-16b). No single failure will prevent access to the data.
• Dual-ported disk devices connected to dual-ported controllers (Figure 2-16c). This adds an additional level of redundancy to the mirrored pair.

[Figure 2-16 Mirrored disk pairs: (a) single controller; (b) dual controllers; (c) dual-ported disks.]

There is an important utility that must be available to support mirrored files if they are to be truly effective: an on-line disk copy utility to be used when a disk that is part of a mirrored pair is to be returned to service. When a disk unit fails, the files are handled by the remaining simplex disk. When the disk is repaired or replaced with a spare and is to be
put back in the system, it must be brought back to its mirrored condition (i.e., containing an exact copy of the other disk) even while further modifications are received for that data. That is the job of the on-line disk copy utility: to copy one disk to another while at the same time ensuring that file updates are kept current.

To summarize hardware duality, Figure 2-5, which was previously discussed, shows a simple system with communication lines, printer, and mirrored disk pair. That figure shows how these peripherals might be configured physically. Figure 2-17 shows the logical I/O process configuration as it would interact with an application process. Primary paths are indicated as solid lines, backup paths as dashed.

One final point should be made about hardware duality in a survivable system. Duality is fruitless if a failed device cannot be repaired and returned to service while the system is running (this led to the need of an on-line disk copy utility for mirrored files). Therefore, any piece of hardware, including processors, buses, device controllers, power supplies, even fans, must be capable of being removed, repaired, or replaced and plugged back in while power is still on the system and without inducing "glitches" in the system operation.
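Before leaving hardware duality, the mirrored-write rule is worth a few lines of illustration. This is a minimal sketch under the assumptions above (the interfaces are hypothetical; a real implementation would also track which blocks change while one half is down, so that the on-line copy utility can revive it):

    # Sketch: every update to a mirrored pair is applied to both disks.
    class MirroredPair:
        def __init__(self, disk_a, disk_b):
            self.disks = [disk_a, disk_b]
            self.up = [True, True]            # which halves are in service

        def write(self, block, data):
            for i, disk in enumerate(self.disks):
                if not self.up[i]:
                    continue                  # running simplex on the survivor
                try:
                    disk.write(block, data)
                except IOError:
                    self.up[i] = False        # half fails; flag it for revival
            if not any(self.up):
                raise RuntimeError("both halves failed: file is unavailable")

    class Disk:
        def write(self, block, data):
            print("wrote block", block)

    pair = MirroredPair(Disk(), Disk())
    pair.write(7, b"payload")                 # both halves receive the update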
Software Redundancy

There are several approaches to software redundancy in contemporary distributed systems. These approaches can be classified according to the following general methods, which are discussed in the following paragraphs:
• Transaction protection
• Synchronization
• Message queuing
• Checkpointing
Software redundancy presents different problems, depending upon whether a multiprocessor or multicomputer (including hybrid) architecture is used.
Transaction Protection

Using transaction protection, no attempt is made to keep a backup process updated. Rather, each transaction is logged to disk in a set of audit files in such a way that transactions that were in progress at the time of failure can be backed out of the system. The user is then requested to reenter that transaction. In this case, the failure is not transparent to the user, though its impact is minimal and usually acceptable. However, recovery may take several minutes as the audit files are played back and the data files corrected.

Transaction protection is used extensively by multiprocessor systems, as it must be assumed that a fault has contaminated memory and that no other recovery mechanism is available. It is also used in certain multicomputer system offerings and in many single-computer systems for which it is the only means of recovery. Note that backup processes are unnecessary if transaction protection is used as the fault-recovery strategy, since they need not be kept up-to-date anyway. Should a fault occur in a processor, the system will reconstruct its data base via the audit files and then recreate the failed processes in a surviving processor before restarting the system. In fact, in a true load-sharing multiprocessor system, the concept of a backup process is meaningless, since the process is represented by control structures (the Process Control Block, for instance), a code area, and a data area in common memory. It survives even if a processor fails and needs to be recreated only if a memory partition failure destroys some control information pertinent to that process.

The transaction-protection procedure developed by the former Synapse Corporation of Milpitas, California, for their multiprocessor system is a very good example of this technique. Basically, all changes to files are logged in separate audit files as transactions are processed. In the event of failure, the system is paused, and all files are returned to a previously established "consistency point," which ensures the integrity of the data base. Transactions that have been completed following the consistency point are "rolled forward" from the log, or recompleted. Transactions that have not been completed are "rolled backward," or deleted. The system is then restarted, and uncompleted transactions must be reentered.

Figure 2-18 shows the auditing activities that Synapse used in order to reconstruct its data base following a system fault. All changed data are written to two logs: the history log and the temporary log.
[Figure 2-18 Synapse logging: the application process updates the data base while writing before/after record images to the history log (a physical disk write is guaranteed at commit time, before the process can proceed) and before images since the last consistency point to the temporary log (no data can be physically written until the temporary log is written); a consistency point (CP) is established every few (typically 2, user-specified) minutes, at which the history log is written to disk, the temporary log data are deleted, and all changed data are written to disk.]

The history log contains all before and after record images of changed data. These records must be physically written to the history log before a transaction is considered to be complete (i.e., is committed). Therefore, at transaction commit time, the user is briefly paused until all of the changed data is written to the history log. The temporary log contains only before images of changed data. No data can be written to a data-base file until the before image has been physically written to the temporary log. Periodically (typically every two minutes), the entire system pauses to establish a consistency point (CP). At this point, all pending data are written to the history log, the contents of the temporary log are deleted, and all changed data are written to the appropriate data-base files. At the consistency point, the entire set of application processes are
correctly represented on disk. All that is required in the event of a failure is to restore the system to this consistency point, redo transactions since completed, and restart the users.

This recovery process is done via the history and temporary logs. Since the temporary log was cleared at CP time and contains only the before images of any data-base changes since that time, the data base is rolled back to its CP state via the temporary log. The history log is then used to replay transaction activity, as shown in Figure 2-19. There are seven possible transaction scenarios, as shown. Transactions that started before the consistency point but never completed before the time of failure or that were aborted (rolled back) per an operator request following the consistency point are rolled back (types 1 and 2). A transaction that started before the consistency point and successfully completed before the failure is reconstructed by rolling forward from the history log (type 3). Transactions that started after the consistency point and either didn't complete (type 4) or were rolled back at the request of the user (type 5) are ignored. Transactions that started after the consistency point and that completed prior to the failure are reconstructed by rolling forward from the history log (type 6). No action is required for any transaction that completed before the consistency point (type 7).
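The seven scenarios reduce to a mechanical rule. Here is a minimal sketch of the classification; the boolean flags are hypothetical names for what the logs would record about each transaction:

    # Sketch: choosing a recovery action for each transaction (Figure 2-19).
    def recovery_action(started_before_cp, completed_before_cp,
                        completed_before_failure, operator_abort):
        if completed_before_cp:
            return "none"           # type 7: already reflected at the CP
        if operator_abort or not completed_before_failure:
            # Types 1 and 2 are rolled back via the history log;
            # types 4 and 5 started after the CP and are simply ignored.
            return "roll back" if started_before_cp else "ignore"
        return "roll forward"       # types 3 and 6: replay from the history log

    print(recovery_action(True, False, True, False))    # type 3: roll forward
    print(recovery_action(True, False, False, False))   # type 1: roll back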
[Figure 2-19 Recovery: reestablish the system at the CP via the temporary log; reconstruct subsequent activity via the history log, rolling back incompleted transactions. Types 1 and 2 are rolled back, types 3 and 6 are rolled forward, and types 4, 5, and 7 require no action. S = start, R = operator abort (rollback), C = completed.]
This procedure recovers, via software, activity on a transaction basis. If a failure should occur, the user may see a delay measured as a few minutes and may be asked to reenter information that already had been entered. Beyond this, the procedure ensures full integrity of the data base and continued operation of the system.
Synchronization

In a synchronizing system, every process is replicated in a different processor. All processes execute, i.e., they all process the transaction and periodically compare results. This synchronizing point is usually an operating system function and may typically occur at I/O request points or completion points. Figure 2-20a shows a dually redundant synchronizing system. When one process is ready to synchronize, it will wait for the other process to catch up and itself be ready for synchronization. Certain interprocess checks are made, and if everything compares, the processes proceed. If there is an error, diagnostic procedures are invoked to determine the faulty process, after which the surviving process continues with the transaction (after, perhaps, creating another backup process to maintain survivability).

One way to provide diagnostic checking to determine the identity of the failed processor is to have more than two processors (say, three or four) involved in the transaction. If there is then a failure, a simple vote of all processors will determine the culprit (assuming only single-point failures). Such voting systems (Figure 2-20b) are used for ultrahigh-reliability requirements such as the space program. August Systems (Salem, Oregon) offers a triplexed voting system for process control applications (such as nuclear power plants), in which the concept of voting is even applied to the digital and analog inputs and outputs of the system (for analog signals, the median signal is taken).

An important example of a synchronizing system is the product introduced by Stratus (Marlboro, Massachusetts). This system is a quadraplexed voting system in which synchronizing is done by hardware at the system clock frequency (typically several times per microsecond). The four processors are grouped as pairs of dual processors (Figure 2-20c), with all four executing in "lock step," i.e., all doing exactly the same operation at each system clock time. If one processor fails, it will not agree with its companion, and both are taken out of service. The other pair continues unaffected. This is the only method to date that requires no software support whatsoever (except, of course, to diagnose the problem).

A modified approach that is somewhat similar to that developed by Stratus is taken by AT&T in its 3B20D product (Figure 2-20d). This is a dual computer system in which one operates while the other provides a hot standby. As data areas in memory are modified in the primary system, a hardware link updates those areas in the memory of the standby system. Should the primary fail, the standby uses the process control blocks (which, of course, are part of the memory-resident data that was kept updated) to load the active processes and to continue execution. One might also consider this system as a class of checkpointing systems, described later.

Note that all synchronizing systems require at least two processors to do the job of one. However, recovery is virtually instantaneous.
[Figure 2-20 Synchronizing systems: (a) comparison at request/reply points; (b) voting over a system bus; (c) lock step; (d) executing processes with memory update to a hot standby.]
Backup Processes
Transaction recovery, as described above, requires no backup process. If there is a failure, failed processes are recreated in a surviving computer, incomplete transactions are backed out, and the user reenters the last request if it had not been completed. This procedure could take several minutes. Synchronizing systems require two or more fully active processors executing in parallel but give nearly instantaneous recovery. Thus, these two approaches represent the extremes of the cost-performance tradeoff: minimal hardware and long recovery times (transaction recovery) versus maximum hardware and almost instantaneous recovery times (synchronization).

The software redundancy techniques to be discussed next, message queuing and checkpointing, are aimed at getting near-instantaneous (i.e., seconds) recovery while using little additional processing power to keep the backup process (which is dormant) updated. Before describing these methods, we must first discuss the mechanism for creating and managing the backup process.

Just as each hardware unit must have a backup, so must each software unit, or process. Should the computer in which a process is running fail, then that process will cease to exist (and the capabilities it provides will be lost to the user) unless a spare process can be "switched in." This requires two capabilities of the process:

• It must be able to create a backup copy of itself in another computer whenever it is created or has taken over from a failed process.
• It must be able to keep its backup informed of what it is currently doing (for instance, what transaction it is currently working on) so that the backup can continue its work uninterrupted should the primary process fail.
Let us first consider the creation and management of the backup process. Assume that process A has been created. One of the first things process A does is to request the operating system to create a backup copy of itself in another computer. We will call this backup process A'. It has the same name as process A but is created by and runs in a different computer. A' detects that it is the backup (because it can sense that its companion already exists) and immediately calls a monitor procedure that is responsible for monitoring the primary process and taking over in the event of primary failure. The monitor procedure is provided by the operating system.

Just as the operating system must know of process A so that it can route interprocess messages to it, so must it know about process A'. As we discussed earlier, it knows about process A and all other processes in the system via the process directory, a copy of which is maintained in each physical processor. Let us now extend the concept of the process directory to what we will call the process pair directory (PPD). The PPD contains the name of each process, the computer in which the primary is running, and the computer in which the backup, if any, is running. Figure 2-21 shows a part of a typical PPD.
[Figure 2-21 Process pair directory: each entry gives a process name (e.g., INQUIRE, MAINT, REPORTA) together with its primary processor and its backup processor.]
As process A is performing its duties, the operating system routes all interprocess messages destined for process A to it. However, should process A fail (most likely because of a processor failure but possibly because of a software fault that causes the operating system to abort process A), the operating system will look in the PPD and find that process A, in fact, has a backup. It will send process A' what appears to be an interprocess message indicating that process A has failed and the reason for its failure (processor failure, abort, or whatever). This causes the operating system in the process A' computer to schedule process A' (i.e., put it on the ready list). Further interprocess messages for process A are now routed to process A' for processing, and the system survives. At this point, process A' may create its own backup to protect itself from further failure.

Figure 2-22 shows a typical life of a process in the presence of a computer failure. A three-computer system is shown in which process A is created in computer 1. It creates its backup, A', in computer 2. Later, computer 1 fails, and process A' takes over, creating its backup, A'', in computer 3. Subsequently, computer 1 is repaired. The system could be left as is. However, in this case, it is desired to reestablish load balancing. Therefore, process A' stops its backup, A'', and recreates a backup A in computer 1. It then switches control to process A, resetting the system to its initial configuration. As can be seen, a variety of strategies can be employed to ensure system survival in a degrading system. Load sharing should be an important consideration in choosing among them.

[Figure 2-22 Process backups: (a) A is created; (b) A creates backup A'; (c) computer 1 fails, A' becomes primary; (d) A' creates backup A''; (f) A' stops its backup; (g) A' creates a backup A; (h) A' switches control to A. The PPD is updated at each step.]
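A minimal sketch of PPD-based routing and takeover follows; the structures are hypothetical, and in a real system this bookkeeping is an operating system function:

    # Sketch: message routing through a process pair directory (PPD).
    ppd = {"INQUIRE": {"primary": 1, "backup": 2}}   # process -> processors

    def send(name, message):
        deliver(ppd[name]["primary"], name, message)

    def processor_failed(cpu):
        # Promote the backup of every process whose primary ran on that CPU.
        for name, entry in ppd.items():
            if entry["primary"] == cpu and entry["backup"] is not None:
                entry["primary"], entry["backup"] = entry["backup"], None
                deliver(entry["primary"], name, "takeover: create a new backup")

    def deliver(cpu, name, message):
        print("processor", cpu, "/", name, ":", message)

    send("INQUIRE", "balance?")    # routed to the primary in processor 1
    processor_failed(1)            # the backup in processor 2 is scheduled
    send("INQUIRE", "balance?")    # now routed to processor 2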
Message Queuing
Having a backup process to switch to in case the primary process fails is essential to surviving. However, if the system is to perform functionally in the same way it would have had no failure occurred, the backup process must take over where the primary left off. This means it must know what the primary is doing.

One way to accomplish this is to queue all interprocess messages received by the primary process to its dormant backup, so that the backup process may "catch up" if it has to take over (see Figure 2-23). Message queuing was first introduced in a distributed
system by Auragen (Fort Lee, New Jersey). The basic concept as implemented by Auragen is for a process to send each interprocess message to three destinations:

1. The primary process that is the intended destination.
2. The backup process for the intended destination process.
3. The backup process for the sending process.

[Figure 2-23 Message queuing: each interprocess message (IPM) travels over the bus to all three destinations.]

Separate actions are taken by each of these receiving processes:
• The primary destination process processes the message as is appropriate, with no consideration given to the other receiving processes.
• The backup process for the intended destination process does nothing. It simply keeps the message in its input message queue.
• The backup of the sending process notes that a message has been sent by incrementing a counter and then discards that message.

Thus, each backup process has a queue of all messages that have been received by its active counterpart and knows how many messages its active half has sent. Should a failure terminate an active process, its backup will take over and begin doing what is natural: process its message queue. Thus, it will redo all processing that the former active process had done, resulting in a data space identical to that of the previous active process at the time of its failure. Furthermore, by using the count of messages transmitted by its other half, it will prevent the sending of duplicate messages, such as disk updates, terminal displays, etc. (This, of course, implies that the operating system guarantees that the order of messages in the backup's receive queue is precisely identical to the order of message execution by the primary process.)

One flaw in this strategy as presented so far is that the backup process's input queue will become arbitrarily long with time, thus consuming an arbitrarily large amount of storage space and requiring an arbitrarily long recovery time. This problem is solved by periodically "synchronizing" the backup process with its primary. Synchronizing in this sense is accomplished simply by forcing all dirty data pages of the primary process to disk; i.e., invoking the page fault mechanism of the processor on behalf of this process to write all modified data pages in memory to disk.

At this point, if the backup took over, it would page-in up-to-date memory. Thus, its receive queue can be purged, and its count of messages sent can be cleared to zero. Synchronization points are generally based on queue length or time. They typically occur every minute or so with active processes, and recovery should take a few (5 to 10) seconds. The time between synchronizing points is clearly a compromise. By making the time shorter, the page faulting load on the system increases. By making the time longer, the recovery time increases.
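A minimal sketch of the three-way send and the backup's bookkeeping (the structures are hypothetical; in Auragen's system this was an operating-system function):

    # Sketch: message queuing for a dormant backup (Auragen-style).
    class Pair:
        def __init__(self, name):
            self.name = name
            self.backup_queue = []   # every message the primary has received
            self.sent_count = 0      # how many messages the primary has sent

    def send(sender, dest, message):
        dest.backup_queue.append(message)        # queued, unprocessed, at dest's backup
        sender.sent_count += 1                   # counted, then discarded, at sender's backup
        print(dest.name, "processes", message)   # the primary destination acts on it

    def takeover(pair):
        # Replay the queue in received order; the first `sent_count` outgoing
        # messages would be suppressed so duplicates are not re-sent.
        for message in pair.backup_queue:
            print(pair.name, "(backup) reprocesses", message)

    a, b = Pair("A"), Pair("B")
    send(a, b, "update record 12")
    takeover(b)                                  # backup of B catches up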
Though message queuing systems require software support to provide recovery from failure, this support is applications-independent and can be made a function of the operating system. Thus, as with a lock-step, synchronized system, the application programmer can be totally unconcerned with the problems of survivability.
Checkpointing

Checkpointing in the sense used here was first introduced by Tandem Computers (Cupertino, California) for use in its NonStop series of fault-tolerant systems. This type of checkpointing takes advantage of the fact that it only is necessary for the data area of the backup process to be identical with that of the primary process at certain critical points in the process's execution.

Going back to our discussion of process structure, checkpointing is done quite simply as the result of one of the features of the structure of a process. A process contains a code area and a data area. The data area comprises global data and a stack that is used to nest procedures, to pass parameters between procedures, and to hold temporary data needed locally by a procedure. The state of a process is determined by the state of its data area, i.e., give two like processes (two processes with the same code area) the same data area and environment, and they will perform identical functions. Therefore, if we could somehow maintain the data area of the backup process so that it was identical to the data area of the primary process, it would behave exactly the same as if it were the primary process at the time of a primary process failure.

Unfortunately, the system load that would be imposed by constantly updating the backup's data area precludes such an approach. However, it only is necessary that the data areas be identical at certain critical points in the process's execution. For instance, if the data areas were made identical immediately following the receipt of a transaction to process, then if the primary failed after partially processing the transaction, the backup would start at the point at which the transaction was received and would reprocess it. In many applications, this would be acceptable.

The backup's data area is updated via a mechanism called checkpointing. This is simply an interprocess message sent by the primary process to its backup, the contents of which are the current contents of the data area. Like other interprocess messages, the receipt of this checkpoint message by the backup process causes it to be scheduled to run. As described earlier, the backup process executes the monitor procedure. This procedure will receive the checkpoint message and will store it in its data area, thus updating the data area as desired. Updating the data area includes updating the stack.

Should the primary process subsequently fail, the receipt of a takeover message describing that failure (as described previously) will cause the monitor procedure to return to the main flow of the process at the last received checkpoint. That is to say, the backup process takes over and starts executing according to the last checkpoint. But how does the monitor know specifically which instruction it is to start executing? The program counter is not part of the data area and is therefore not sent over as part
of the checkpoint message. This is handled via the last stack marker in the data area's stack. Each stack marker indicates the place at which the current procedure was called and causes the current procedure to return to the instruction following its call when it has completed. Upon the procedure's return, the stack marker also contains the processor environment to restore.

When a primary process wants to send a checkpoint message, it does so by calling an operating system procedure that takes care of actually issuing the interprocess message. Therefore, the last stack marker on the checkpointed stack was placed there as a result of the call to the checkpoint procedure by the primary process; it points to the instruction in the application program following the checkpoint call. When the monitor procedure wants to turn on the backup process, it simply executes a procedure return according to the last stack marker, i.e., the backup process will turn on at the instruction following the checkpoint call as if it were the primary process exiting the checkpoint procedure. This is illustrated in Figure 2-24.

[Figure 2-24 Checkpointing: the primary process receives a transaction, checkpoints it, edits it, posts it to the transaction file, and returns a response; upon receipt of a failure notice, the backup resumes at the point following the last checkpoint, reposting to the transaction file and returning the response.]

The checkpoint procedure also returns with a status condition. This status condition normally indicates success when the primary process is running. However, if the backup process is turned on, the monitor procedure forces an error status that indicates that the primary failed and why. This allows the backup process to perform special takeover logic, should any be required.

In actual practice, it is usually unnecessary to checkpoint the entire data area. The global data often contains a large data base, whereas the stack is typically small.
Since large (multi-Kword) checkpoint messages would represent a large bus and processor load, it is advantageous to send over just that part of the data area that has changed since the last checkpoint. This often is simply certain elements of the global data plus the stack.

In some cases, it may not even be necessary to checkpoint the stack. If certain messages simply update internal parameters, it only is necessary to process the message, update those parameters, and then checkpoint the changed data. In this case, the stack is not checkpointed, so backup processing will resume at the last point at which the stack was checkpointed rather than at noncritical checkpoints.

The basic concepts of checkpointing are quite straightforward: create backup processes and keep them informed at critical processing points via checkpoint messages. Checkpointing does represent a significant system load and should not be used casually. Each checkpoint is, in effect, an interprocess message requiring several milliseconds, as discussed earlier. Checkpointing is an important consideration in performance analysis. Checkpointing strategies should be carefully thought out in terms of minimizing the number of checkpoint messages and checkpoint lengths while achieving the degree of fault transparency desired in the system. More important, these strategies must be established as part of the design of the process. It is insufficient to implement simplex processes initially without giving thought to survivability and then worry later about where to put the checkpoints. This can lead to a process organization in which the checkpointing task burden is so large it cannot realistically be carried.

Let us explore various levels of checkpointing. As we have said, the level of checkpointing should be commensurate with the level of fault transparency desired. Consider an inquiry application in which the operator enters an inquiry, a file is searched, and data is returned to the operator. In many situations, it may be quite reasonable to ask the operator to reenter the inquiry in the rare occurrence of a system fault that has interrupted the inquiry process. No checkpointing need be done at all. Should the backup process find it has taken over, it might simply send a repeat request to all operators, not knowing which ones had active inquiries.

An even better situation than the above is one in which the terminals buffer the inquiry and pass it to the system in response to a poll. In this case, the new primary process need only poll all terminals; those with unanswered inquiries will retransmit those inquiries for reprocessing. Full fault transparency has been achieved without checkpointing.
However, if it is 1IDCIesinIble to teqUeSt again 1be traDsaCtioa. from the operatOr once it has been JeCeived, the process can checlcpoint it as soon as it teceives it and then process the inquiry. In tis case, if the process fails, 1be backup has the traDsactioD and will leproc:ess it without having to zequest it again. The ope!3tOI will IeCeive a mspcmse without ever knowiDg tb.eIe was a fault. In tis case, if a failme 0CC1IlS after the system bad IeSpODded to 1be last inquiry and before it bas obIaiDed the next one,1be IeSpODSe to the last inquiry will be lettansmitted to the operator since the ttaDSaction is being totally leprcx:essed. If this is UDdesirable. a second checlcpoint is RqUirecl following the retum of the iDquiJ:y IeSpODSe. U~y, applicatioDs vsuaUy lIeD't tbis simple. Typically, a traDsaCtion is
used to update a file. The simplest case of this is when the transaction is simply logged to a transaction file for later processing. In this case, all of the aforementioned strategies hold. If the operator or terminal can be requested to resubmit the transaction, no checkpointing is required. Otherwise, the transaction should be checkpointed when it is received.

In this example, a failure could cause the transaction to be logged twice. Often, the processing programs can handle the case of duplicate transactions (transactions may carry serial numbers, for instance). If this is intolerable, then (as above) the process should checkpoint following the logging of the transaction to cause the backup to pick up at this point.

Often, however, the transaction is used to perform an on-line update of a file. In this case, a record must be read, modified according to the transaction, and then rewritten. Whether or not the transaction was checkpointed when it was received, it is imperative that it and the record be checkpointed when the read of the record has been completed. Otherwise, a double update could occur (unless this is allowed). Consider a transaction that contains a count that is to be added to a field in a record. If the transaction is simply reprocessed following a failure, it would be added again to the field if the first transaction had completed. However, by checkpointing the read record and by assuming a failure after the transaction had been completed, the backup process would continue from the point at which the read had completed. It would add the transaction count to the original field and would return the record, overwriting identical data left behind by the primary process.

So far as transaction processing is concerned, this is the case of most general interest and is shown in Figure 2-25. The first checkpoint is needed only if the transaction cannot be requested again from the user. The middle checkpoint is needed only if a double update is not allowed. This same checkpoint can be used to protect multiple updates, provided all data is read, the checkpoint is sent, and then files are updated. The last checkpoint is needed only if a repeat response cannot be tolerated.

[Figure 2-25 General transaction checkpointing: receive transaction; checkpoint transaction; read disk record; checkpoint record; update record; rewrite record; return response; checkpoint stack; wait for the next transaction. On a failure, the backup resumes at the last checkpoint, re-updating the disk record or re-returning the response as required.]

Most processes, whether they deal with transaction processing, external event control, communication switching, or whatever, can be framed as subsets of Figure 2-25. Therefore, we can see that the usual worst case for a transaction is three checkpoints.

It is important to minimize the number of checkpoints because they create system overhead. Sometimes no checkpoints are required. It is frequently possible to design the system so that no more than one checkpoint per transaction is required. However, considerations to allow this often range throughout the system, from operating procedures (reentering a transaction) to terminals (block transmit, ignoring unexpected responses) to processing functions (detecting and ignoring duplicate transactions). Therefore, the determination of checkpoint strategies belongs in the very early stages of design and is not a candidate for afterthought.

The above discussion has concerned itself with a software mechanism for checkpointing. The AT&T 3B20D, discussed earlier and shown in Figure 2-20d, can be considered a system in which checkpointing is accomplished via hardware. As memory in the primary system is modified, a hardware checkpointing mechanism updates corresponding memory in the standby system. As opposed to the checkpointing technique described above, however, this hardware approach requires that the memory contents of both the primary and standby systems be identical. Therefore, both systems must be dedicated to the same tasks; in this sense, the AT&T system is more akin to a synchronizing system.
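Returning to the software mechanism, the general pattern of Figure 2-25 is compact enough to sketch. Below is a minimal Python sketch of the worst-case three-checkpoint loop; the helper names (receive, reply, checkpoint, disk) are hypothetical stand-ins, with checkpoint() representing the interprocess checkpoint message to the backup:

    # Sketch: the general (worst-case) three-checkpoint transaction loop.
    def transaction_loop(receive, reply, checkpoint, disk):
        while True:
            txn = receive()
            checkpoint("transaction", txn)     # only if the user cannot resubmit
            record = disk.read(txn["key"])
            checkpoint("record", txn, record)  # only if a double update is not allowed
            record["total"] += txn["count"]    # modify per the transaction
            disk.rewrite(txn["key"], record)
            reply(txn, record)
            checkpoint("stack")                # only if a repeat response is intolerable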
SOFTWARE ARCHITECTURE

The performance of a TP system is only partly determined by its hardware configuration. The impact of the software is usually even more crucial. This impact ranges from the efficiency with which programs are written to language considerations, operating system characteristics, and application software architecture.
In almost every TP system, there is a software bottleneck somewhere that may limit ultimate system capacity no matter how much hardware is added. The most common bottleneck is the data-base manager. Since it must coordinate the activity of many application processes that wish to access the data base simultaneously, all data-base requests must be funneled through it. This is largely because of the problem of simultaneous updates to a record. Let us say that process A and process B both want to
read a record and perhaps update it. For both to read the record is no problem. But now let us assume that both modify the record and want to rewrite it. If process A is the first to rewrite its record, then the updated record submitted by process B will overwrite process A's changes, which are now lost.

By funnelling all requests through a common data-base manager, mechanisms can be established to prevent such conflicts. Typically, a process that wishes to update a record will request that the record (or file, or field within a record, depending upon the system) be locked and thus be made unavailable to any other process desiring to modify that record until it has been updated by the requesting process (it is, however, usually available for reading by other processes). If another process makes a request to lock this record while it is locked by another process, the new request is either rejected or the requesting process is placed in a queue for that record.

In some systems, multiple data-base manager paths are provided to minimize the data-base bottleneck effect. This can be done sometimes by segregating unrelated files or, alternatively, by segregating unrelated operations (open/closes versus read/writes, for instance).
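A minimal sketch of this lock discipline (the structures are hypothetical; real data-base managers add lock granularities, time-outs, and deadlock detection):

    # Sketch: record locks with a wait queue in a data-base manager.
    from collections import deque

    holder = {}     # record key -> process holding the lock (absent if free)
    waiters = {}    # record key -> processes queued for the lock

    def lock(key, process):
        if key not in holder:
            holder[key] = process                 # granted immediately
            return True
        waiters.setdefault(key, deque()).append(process)
        return False                              # rejected or queued

    def unlock(key):
        queue = waiters.get(key)
        if queue:
            holder[key] = queue.popleft()         # next waiter owns the record
        else:
            del holder[key]

    lock("cust-1001", "A")      # granted to process A
    lock("cust-1001", "B")      # B queues; reads may still proceed
    unlock("cust-1001")         # the lock passes to B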
Another bottleneck that can exist is a common log. If, for instance, all system actions must be logged to a common log file or printer, that log becomes a bottleneck. The audit files required for transaction recovery, described earlier, are a good example of such a potential bottleneck.

Requestor-Server
One common software architecture that is found in many TP systems is the requestor-server model, shown in Figure 2-26. In this model architecture, requestor processes each service one or more user terminals. When a requestor receives a request from a user, it evaluates the request and passes it to an appropriate server process that is designed to handle that request type. Figure 2-26, being very simplistic, shows two types of servers: one for handling inquiries and one for handling updates. A server will do whatever it has to do to satisfy the request. This usually involves interacting with the data-base manager to gain access to, or update, data in the data base. It then formulates a reply and returns it to the user via its requestor process.
Requestor processes are permanently created and assigned to support a fixed set of user terminals. However, server processes are often dynamically allocated according to the load on the system. For instance, let us assume that the volume of updates becomes very high in the afternoon. Seeing that queues of waiting update requests are beginning to form, thus slowing down response time (i.e., the system is beginning to appear sluggish to the user), the system will spawn additional update server processes and will spread the waiting transactions over them to achieve a degree of parallel processing. If this is a distributed system, new servers can be created in the least heavily loaded computers. If the system has multiple disk spindles, system performance might be nearly proportional to the number of servers up to the point, at least, that all servers are being kept busy. Of course, the disk system will ultimately be a bottleneck, limiting system performance.

[Figure 2-26 Requestor-server model: user terminals served by requestor processes, which pass requests to inquiry and update servers; the servers interact with the data-base manager, and the requestors keep context areas on disk.]

As the load on a particular server class diminishes, servers are killed off until only one remains. A common queue may be maintained for all servers in a class, with the next server that becomes free getting the next request; or each server may have its own queue, with requests being distributed to the servers according to some algorithm (round robin, or to the shortest queue, for instance).
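A minimal sketch of queue-driven server allocation (the threshold and the spawn/kill hooks are hypothetical choices):

    # Sketch: growing and shrinking a server class with its queue length.
    GROW_AT = 10                    # waiting requests per server before growing

    class ServerClass:
        def __init__(self, spawn, kill):
            self.queue = []
            self.spawn, self.kill = spawn, kill
            self.servers = [spawn()]

        def submit(self, request):
            self.queue.append(request)
            if len(self.queue) > GROW_AT * len(self.servers):
                self.servers.append(self.spawn())   # add parallel capacity

        def next_request(self):
            # Common-queue discipline: the next free server takes the next request.
            if self.queue:
                return self.queue.pop(0)
            if len(self.servers) > 1:               # shed idle servers
                self.kill(self.servers.pop())
            return None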
Note that an important characteristic of the requestor-server model has a significant design impact. Since there can be multiple servers in a class, there is no guarantee that a request will be routed to a particular server. Many transactions comprise a sequence of request/reply interactions with the operator, with the next request depending upon information from the previous reply. This information is termed the context of the request, i.e., it is the context in which the request is to be interpreted. This context is not logically required to be carried in the request; in principle, at least, it is known to the system.

For instance, let us assume that the user has asked to see billing information for a particular customer and, as part of that request, has supplied a customer number. However, there is more reply information than can fit on the screen, so a partial reply is returned. To see the next "page" of this information, the user should be able simply to send a "next" message, as the system knows which function (billing inquiry, in this case) is being exercised on which customer and which of many pages is currently being viewed. This continuing information (the function, the customer, and the page) is the context of the user's next request.

But where can this context be kept? Not in the server, because it is not known in advance which server within a class will get the next request. It can be maintained in the requestor; but perhaps the requestor services many terminals. Moreover, if the context data size is large for the worst-case request, large amounts of memory may be required for a requestor to store context. So much, in fact, that it may have to keep context areas on disk (as shown in Figure 2-26). And these extra disk accesses are going to have an impact on system performance.

Therefore, an element of good TP design is to minimize request context. One excellent way in today's art is through the use of intelligent terminals, which can store the context at the user's site. Then the context can simply be added to each request as that request is sent to the TP system. Of course, this compounds another in the ongoing saga of performance issues: communication-line loading. So we see that achieving optimum performance is an unending search for best compromises. How could we ever do that effectively without performance modeling? The answer is simple: we don't!

In the next chapter, we introduce the basic concepts of performance modeling. This will set the stage for the more detailed discussions to follow.
3 Performance Modeling
The degradation of performance as a TP system gets busier is caused by two related factors: queues and bottlenecks. Queues of transactions awaiting service will form in a busy TP system for each shared resource. As the system becomes busier, these queues will get longer, causing processing delays and performance degradation. The system resource with the lowest capacity will limit the ultimate throughput of the system and is the
system's bottleneck.

The primary role of a performance model is simply to identify the queues and bottlenecks in a system and to evaluate their impact on system performance. The system designers will then know what to expect from the system and which areas within the system provide the most fertile opportunities for performance enhancement.

In this chapter, we explore the characteristics of queues and bottlenecks, with a simple yet realistic and complete performance analysis. As noted in the Introduction, a thorough understanding of the contents of this chapter provides the system analyst not otherwise interested in becoming a performance "specialist" with the tools necessary for elementary performance analysis.
BOTTLENECKS

A bottleneck is nothing more than a common system resource that can run out of capacity before other common system resources do. Any common resource within the TP system is a candidate for a bottleneck. The disk is the most common bottleneck in TP systems; but
given enough disk spindles to provide multiple access paths in order to spread data uniformly (in terms of access requirements) over the disks, this bottleneck can be broken. The processor (or processors) might run out of capacity before the disks and thus become the bottleneck. Or memory could be the bottleneck if there is not enough to provide common data structures (such as I/O buffers) or if there is not enough to accommodate the bulk of process memory requirements. This leads to excessive overlay swapping or page faulting.
The bus that connects multiple processors or computers in a distributed system can be a bottleneck, as can communication lines when many user terminals share a common line (multidropped lines or local area networks). Finally, processes themselves can become the bottleneck. This is often true of the data-base manager, since data-base consistency requirements often preclude multiple instantiations of the data-base manager. Log servers and other nondynamic server processes are other examples of a potential process bottleneck, as discussed in the previous chapter.
Sometimes, a bottleneck is built into the system unintentionally. More than one designer has brought a system to its knees by deciding to log unsuccessful transactions on the console printer for operator action. If the printer is a 30-character-per-second printer, and if the failed transaction requires 300 characters to describe it, it will take 10 seconds to print. Let us say that the TP system was designed to handle 50 transactions per second, or 500 in 10 seconds. If only 0.2% of all transactions are unsuccessful, the console printer becomes the bottleneck!
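The arithmetic behind that surprise is worth making explicit; a quick occupancy check using the numbers above:

    # Occupancy of the console printer in the example above.
    print_time = 300 / 30             # 10 seconds to print one failed transaction
    failures_per_sec = 50 * 0.002     # 0.1 unsuccessful transactions per second
    occupancy = failures_per_sec * print_time
    print(occupancy)                  # 1.0: the printer is 100% busy, the bottleneck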
QUEUES
Queues are much more exciting than bottlenecks. However, we wouldn't have queues if we didn't have bottlenecks, for a queue is simply a line of requests for service by a common system resource that, of course, is a candidate for a bottleneck. In fact, the closer it is to being a system bottleneck, the longer in general will be the queue waiting for it.

Some system resources don't provide a queuing capability. Our infamous console printer could be an example of that if each process wanting to print a message has to seize it, print its message, and then release it. If the console printer has already been seized, then the process must back off and try again later. In this case, transactions back up in the queues of the requesting processes, providing they support queues of awaiting transactions, or back up to previous queues, or eventually even to the user. Of course, the console printer syndrome can be cured by providing a console process that drives the printer. Other processes will send their messages to the queue of the console process, which will then leisurely print them without holding up the other processes.

A more critical situation is characteristic of common memory. If the common memory subsystem of a multiprocessor system does not provide queuing, there is no way for software to provide it because of the very high-speed nature of this subsystem. In fact, by
definition, memory speed must be comparable to the instruction execution speed of the processors that it supports, since it is one of the limiting factors in execution speeds. Therefore, some processors contending for common memory will have to back off and wait while the current processor is being serviced.

Some common memory systems do provide a queue implemented via hardware buffers. Sophisticated processors with look-ahead capability can anticipate their data and instruction needs and can therefore queue memory requests while they continue processing. Of course, being a very high-speed and expensive buffer, the common memory queue is usually very short. Once filled, processors will have to wait as if there were no queue.

We have now seen three simple classes of queues: none, limited size, and infinite (at least practically so). Queues are explored in great depth in the next chapter, where it is seen that maximum queue length is only one of several attributes characterizing the behavior of a queue.

The time required for processing by a common resource is the sum of two component times: the time spent waiting in the queue and the time spent actually being serviced. These are called the wait time and the service time, respectively. The sum of these two times is called the delay time, i.e., the time that the transaction is delayed due to a service requirement by that resource:

    Delay Time = Wait Time + Service Time.

For a large class of queuing situations (randomly distributed arrivals and service times in a steady-state system), the delay time is given by (see chapter 4):

    Delay = T / (1 - L).    (3-1)
Here T is the average service time of the resource (or the server, as it is called in queuing theory, not to be confused with the use of server in the requestor-server model, which is also a server in the queuing sense. Oh, well!). L is the load imposed on the resource (the proportion of time that it is busy); L is also called the occupancy of the server. The term 1/(1 - L) can be thought of as a "stretching factor." It stretches out the service time of the resource as the load on that resource increases, to account for the queue waiting time. In fact, the service time is stretched without bound, i.e., the queue grows arbitrarily large as the load approaches 100% (L = 1).
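Equation 3-1 is a one-liner to apply. A minimal sketch with illustrative numbers:

    # Delay time for a single resource (equation 3-1).
    def delay(service_time, load):
        assert 0 <= load < 1    # the queue grows without bound as L approaches 1
        return service_time / (1.0 - load)

    # A 50-msec disk at 50% and at 90% occupancy:
    print(delay(0.050, 0.5))    # 0.1 sec: the service time has doubled
    print(delay(0.050, 0.9))    # 0.5 sec: stretched tenfold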
a volume of like transactions, each vying with the others for access to the data-base manager and each competing with the others and with the data-base manager for the use of the processor.

Using equation 3-1, we can predict the performance of this system. Let tp be the average processing time required for a process dispatch, td be the average disk access time, and R be the total system transaction rate. Then the load on the processor is 6Rtp, and the load on the disk is Rtd. Since the total response time that the operator will see is the sum of the processor and disk delays (since each step must be performed in series, one after the other), we can use equation 3-1 to write

Response time = 6tp/(1 - 6Rtp) + td/(1 - Rtd)    (3-2)
This expression shows the process waiting in line and being serviced six times by the processor and being serviced once by the disk.¹ Let's put some numbers into this equation and plot the results. We will consider two cases. For case 1, let the average processing time required for each process dispatch, tp, be five milliseconds (msec), and let the disk access time, td, be 50 msec. Response time as a function of system load, R, is plotted in Figure 3-1a for this case, along with the component delay times for the processor and disk. Note that the disk capacity is 20 transactions per second (1/.05 seconds) and that the processor capacity is 33.3 transactions per second (1/(6 x .005) seconds). Thus, the disk is the bottleneck, and the response-time curve reflects this.

Now let us double the processing time to 10 msec. The resulting response-time curves are shown in Figure 3-1b. The processor now becomes the bottleneck, with a capacity of 16.7 transactions per second, as compared to 20 for the disk.

Figure 3-1 Response time as a function of system load: (a) tp = 5 msec, td = 50 msec; (b) tp = 10 msec, td = 50 msec.

As the system load approaches the bottleneck capacity, the response time grows quite large. In fact, note that at around 60% of system capacity (12 and 10 transactions per second, respectively), the response time increases dramatically with small increases in volume. This is the point at which users may really become frustrated, and it emphasizes a basic rule in the design of real-time systems: never load a component to more than 60% to 70% of its capacity. Unfortunately, the use of this rule is often the sole attempt at performance analysis.

¹In this example, queues building at the data-base manager and the processes are ignored to simplify the calculations. The resulting concepts at this point are more important than is the proper modeling technique, which will be refined shortly.
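To make the two cases concrete, equation 3-2 is easily evaluated mechanically. The following is a minimal sketch (Python is our arbitrary choice here, and the names are ours, not part of the original analysis):

    # Response time of equation 3-2: six process dispatches, each stretched
    # by processor load, plus one disk access stretched by disk load.
    def response_time(R, tp, td):
        proc = 6 * tp / (1 - 6 * R * tp)   # processor component of delay
        disk = td / (1 - R * td)           # disk component of delay
        return proc + disk

    # Case 1: tp = 5 msec; the disk (capacity 20 tps) is the bottleneck.
    print(response_time(10, 0.005, 0.050))   # ~0.143 sec at 10 tps

    # Case 2: tp = 10 msec; the processor (capacity 16.7 tps) now limits.
    print(response_time(10, 0.010, 0.050))   # ~0.25 sec at 10 tps

The first case reproduces the .143-second figure read from Figure 3-1a in the next section.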
PERFORMANCE MEASURES

Before we can get seriously into performance modeling, we must first come to agreement on what we are measuring. Since TP systems are interactive, and since the most common performance complaint is system sluggishness, it seems reasonable that response time be our
primary candidate. We have already explored the concept of response time in some detail in the previous section. However, response time comes in many varieties. What we have described so far is the average, or mean, response time. In Figure 3-1a, for a load of 10 transactions per second, the average response time is .143 seconds.
In many cases, the system operator's concern is not with the average customer but with the irate customer. Therefore, the operator wants to specify a maximum response time, let's say of two seconds, that ensures a level of performance that may keep everybody happy.

Unfortunately, the specification of an absolute cap on response time is quite unreasonable, because the worst-case response time will occur only for the very unlikely (in fact, nearly impossible) case of simultaneous receipt of transactions from all users. For instance, assume for the case of Figure 3-1a that there are 600 customers using the system. If, on the average, each generated a transaction once a minute during peak activity, the average transaction load on the system would be 10 transactions per second, giving an average response time of 143 msec. A pretty swift system.

Nevertheless, the worst case would indeed occur if all transactions arrived simultaneously. Then, since the disk would be the bottleneck and would process only 20 transactions per second, the last transaction would be completed in 30 seconds. Pretty horrible. And also very unrealistic. Taken to its conclusion, only 40 users could be supported and still guarantee a maximum response time of two seconds.

A more realistic statement would be to say "99.99% of all transactions will finish in less than two seconds," or some equivalent statement. This still would leave an occasional irate customer, but the level of hassling could certainly be controlled (1 out of 10,000 may not be too bad).

To achieve the level of performance described in that statement, not only do we need to know the average value of response time, but we also need to know its distribution. We must be able to state that with probability p, all responses will be received in less than t seconds, whatever p and t are specified to be. This, then, is the second performance measure (average response time being the first).

Analyzing the distribution of response time is a much more difficult problem than estimating its mean and is solved for only a few cases. Fortunately, for the general case, there is an elegant and simple approximate solution that we will often use. For many of the queues we will analyze, one can make the statement that 95% of all response times will be less than three times the average response time, 99% will be less than five times the average, 99.9% will be less than seven times the average, and 99.99% of all responses will be less than nine times the average. (These statements are based on the gamma function, which is explored further in chapter 4.) Thus, to achieve a level of performance such that 99.99% of all response times will be less than two seconds requires an average response time of no more than 2/9 = .22 seconds. From Figure 3-1a, this allows a transaction rate of 15 transactions per second. The difference between 99.99% and 100% satisfaction is the difference between 40 users and 900 users! Lesson: Turn a deaf ear to absolute maxima.

We have, incidentally, just introduced the third performance measure: capacity. As we have seen, the capacity of a TP system is the maximum transaction rate that produces the maximum acceptable response time, whether that response time is stated as a mean response time or as the probability that a response time will not exceed a certain limit.
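The arithmetic of this rule of thumb is trivial to automate. A small sketch (the multipliers below are simply the gamma-based approximations quoted above; the names are ours):

    # Gamma-based rule of thumb: percentile of responses vs. multiple of the mean.
    multiplier = {0.95: 3, 0.99: 5, 0.999: 7, 0.9999: 9}

    t_limit = 2.0                           # "99.99% under two seconds"
    t_mean = t_limit / multiplier[0.9999]   # required mean: 2/9 = 0.22 sec
    print(round(t_mean, 2))                 # 0.22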
In summary, the performance measures with which we will be concerned are the following:

• Mean response time. The average response time to a transaction seen by a user, taken as a function of system load.
• Maximum response time. That time such that a specified proportion of all responses will occur in less time, also taken as a function of system load.
• Capacity. Measured in terms of system load, that which results in the minimum acceptable performance as specified by a response-time requirement.
THE ANALYSIS

A proper performance analysis comprises many parts:

1. A system description of the TP system to be modeled.
2. A scenario model that describes the characteristics of the transaction load being placed on the TP system.
3. A traffic model that describes the flow of data through the system.
4. A performance model document recording for posterity the above three items.
5. A performance model program that implements the equations of the model if they are too complex for manual calculation.
6. Result memoranda that give results of the model's predictions as "what if" games are played with it.

Items 1, 4, and 6 (system description, performance model document, and result memoranda) are discussed in the summary chapters 10 and 11. Item 5, the program, is not a topic for this book; the need for it is mentioned only in passing, though it is briefly considered in chapter 10. However, items 2 and 3, the scenario model and the traffic model, represent the bulk of the performance analyst's work and will be introduced here. First, the scenario model.
Scenario Model

The scenario model deals with characterizing the transaction load being placed on the system by the users. There is nothing very complicated about it (it is really just a case of bean counting), but it can be a very large (and admittedly dull) task at times. The scenario model identifies every transaction that will be offered to the system and characterizes the transaction's use of resources: communication message lengths, disk accesses, processing requirements, and other special requirements. It also establishes the probability of each transaction occurring.
Using these probabilities, the load on each resource imposed by an "average transaction" can then be calculated by summing the products of each probability and the use of that resource.
Sometimes we cheat and don't take all transactions into consideration. It may be that 3 transaction types account for 95% of the load and 30 transaction types account for 5% of the load. In this case, it is perfectly reasonable to consider only the 3 types and add 5% to their imposed load to make up the difference. At other times, we're not so lucky and must account for many, many transaction types.

As an example, let's play lucky. Let us assume that there are only two types of transactions: an inquiry and an update. Each has its own unique request- and reply-message formats and undergoes similar processing in a requestor-server environment, except that inquiries are processed by the inquiry server and updates by the update server. Inquiries require a single disk read, and updates require a read to fetch the record to be updated and a write to return the updated record. Thirty-five percent of all transactions are inquiries, and 65% are updates. Our scenario model can then be summarized in the following table, with values added for processing times and for communication line message lengths:

TABLE 3-1. SCENARIO MODEL

                              Message size (bytes)
Transaction   Probability    Request    Reply    Disk accesses    Server time (msec)
Inquiry           .35           20        400          1                 10
Update            .65          200         15          2                 15
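The "average transaction" loads described above fall straight out of Table 3-1. The following sketch (our own illustration, not part of the original text) forms the probability-weighted sums:

    # Scenario model of Table 3-1: resource use of the "average transaction."
    scenarios = [
        # (probability, request bytes, reply bytes, disk accesses, server msec)
        (0.35,  20, 400, 1, 10),   # inquiry
        (0.65, 200,  15, 2, 15),   # update
    ]

    avg = [sum(p * use[i] for p, *use in scenarios) for i in range(4)]
    print(avg)   # [137.0, 149.75, 1.65, 13.25]

Thus the average transaction sends a 137-byte request, receives a 149.75-byte reply, makes 1.65 disk accesses, and uses 13.25 msec of server time.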
In many cases, the values in this table will be expressions. For instance, the number of disk accesses required for a cashing transaction in a wagering system may be a function of the number of bets on a ticket. The size of a communication request message for an insert transaction may be a function of the size of the insert, and so forth. In the example performance model report given in chapter 11, the scenario table is quite complex. The above example is, of course, quite simple.

Traffic Model

The traffic model is the fun of performance modeling. It is the characterization of the flow of requests and replies (the system traffic) through the system. Every step that may have an impact on performance should be included, as we do not want to judge a priori what is significant and what is not. Let the results of the model tell us that.
It is often helpful to draw a traffic diagram. This diagram is only a tool; it is not intended to be rigorous. Only three symbols are needed, but feel free to invent additional ones if they will help you organize your thoughts. The three symbols are

1. A processing step, typically representing a process or a disk access. The step is described in the box, and a service-time expression is provided.
2. A message path, such as a communication line or an interprocess message. As with a processing step, the path is identified, and a service-time expression is given.
3. A queue. If the queue has a maximum size, it is shown. A queue-length expression is also shown.
The traffic diagram is structured to show, in the processing of a transaction, each step that is significant relative to performance. It is often helpful to number the steps so that the diagram may be easily described (part of good documentation, which will be stressed in chapter 10). For some systems, multiple traffic diagrams may be required to characterize different classes of transactions if their processing sequences are significantly different. The average response time can then be expressed as the sum of the response times of the component processing and message paths plus the time spent in the various queues.

As an example of a traffic model, let us consider a variation on the requestor-server system described in chapter 2 and as illustrated in Figure 3-2.
Figure 3-2 A TP system.
Requests are received by a request handler process and routed to one of two servers: an inquiry server dedicated to handling inquiries and an update server dedicated to handling updates. Each issues its own data-base directives to the data-base manager, as described in the scenario of the previous section; i.e., the inquiry server will request the read of a record, and the update server will request a read followed by a write of the updated record. Servers are single-threaded. Each processes only one transaction at a time. Each server formulates a reply to the user when it has finished processing the transaction and returns that reply via the reply handler. Replies to inquiries contain the requested data; replies to updates contain a completion status.

Communication-line message sizes, server processing times, and the distribution of transaction types are given in Table 3-1 (the scenario model developed in the previous section). User terminals are connected to the system via point-to-point 9600-baud (960 bytes per second) asynchronous communication lines (one terminal per line). All I/O is done by the request/reply handlers for communications and by the data-base manager for disk; there are no separate I/O processes. All processes run at the same priority, and interrupt processing can be ignored.

A traffic model for this system is shown in Figure 3-3. Referring to the numbered steps in parentheses on that diagram, we find a request being transmitted from the user terminal over the communication line (1), requiring a time tci. This request is received by the request handler process (2), which requires a processing time of tri before passing the request via an interprocess message (3) to the queue of the appropriate server (4). Interprocess messages require a time tipm, and the servers (5) require tsi time to totally process an inquiry request and send back its reply and tsu time to provide similar service for an update request. Average queue lengths for the inquiry server and update server are qi and qu, respectively. (Note the symbology difficulty in the use of the subscript i to mean input, as in tci and tri, and inquiry, as in qi and tsi. This problem is compounded in larger models but is a hazard of the profession. Clear and unambiguous definition is the only solution.)
The server must make an average of nd disk calls to process a transaction. For the inquiry server, nd = 1 (one read), and nd = 2 for the update server (a read followed by a write in order to update the record). Each requires an interprocess message (6) to send a request to the data-base manager's queue (7), which averages qb items in length.
Figure 3-3 Traffic model. (The numbered steps are referenced in the text; nd = 1 for an inquiry, 2 for an update.)
The data-base manager (8) requires a processing time of tb to service each request plus a disk access time (9) of tdk. Note that there is no queue for the disk, as its only user is the data-base manager, which itself provides a queue for its users. The return of the disk data (for a read) or status (for a write) (10) to the inquiry server or update server is time-free, as it is the READ portion of the original WRITEREAD interprocess message which sent the request to the data-base manager and is therefore included in the original message time. Disk responses do not queue for the server, since each server is single-threaded and is waiting for the response. When the server has finished processing the request and has formulated a reply, it sends that reply via an interprocess message (11) to the reply handler queue (12), which averages qo replies in length. The reply handler then returns the reply to the user (13), requiring a processing time of tro and a communication time of tco.

What is your intuitive feel for the capacity of this system? What is the predominant bottleneck? Let's see how good your guess is.
We can now express response time as a function of load, using no more mathematics than that of equation 3-1 and no more system knowledge than that given in chapter 2. The only subtlety is the effective service time of the processes, since they all compete for a common processor. This is dealt with below.

To lend a bit of organization to the development, let

Tc = communication line component of response time (seconds),
Ti = request handler component of response time (seconds),
Ts = server component of response time, including the data-base manager time (seconds),
To = reply handler component of response time (seconds).

Then

T = Tc + Ti + Ts + To    (3-3)

is the system response time, where

T = system response time (seconds).
Communications. Since there is no sharing of communication lines (they are point-to-point lines), there are no line queues. Communication time is fixed. Let

s = communication line speed (bytes per second),
mii = inquiry request message size (bytes),
mio = inquiry reply message size (bytes),
mui = update request message size (bytes),
muo = update reply message size (bytes),
pi = probability of an inquiry transaction,
pu = probability of an update transaction.

Of course, pi + pu = 1. Then the total communication traffic for an inquiry is (mii + mio) bytes, and for an update it is (mui + muo) bytes. The average message length is the sum of these weighted by their probability of occurrence. Thus, the average transaction time contributed by the communication line, Tc, is

Tc = [pi(mii + mio) + pu(mui + muo)]/s    (3-4)
Request handler. The component of response time introduced by the request handler comprises the following items:

1. The service time for servicing the request.
2. An interprocess message time for sending the request to the server queue. (Where we put this is arbitrary. For this model, we will load interprocess message times onto the sender.)

The request handler service time must include the process dispatch time, since when the requestor is ready to service an item, it must first get in line with the other processes to wait for the processor. Once it has the processor, it completes its service before relinquishing it. Let

td = average process dispatch time (seconds).
Then

Ti = tri + td + tipm    (3-5)

td will be evaluated later.
Server. The server response-time component comprises the following steps:

1. A wait in the server queue.
2. The service time for processing the request and formulating the reply.
3. Interprocess message times to send data-base requests to the data-base manager queue (nd requests per transaction).
4. The data-base manager service time for each disk request.
5. An interprocess message time to send the reply to the reply handler.

Let us define the following terms:

tqi = delay time for inquiry service (queue plus service time, in seconds),
tqu = delay time for update service (seconds).

Since the probability of these two types of transactions is pi and pu, respectively, the server response-time component, Ts, is

Ts = pi tqi + pu tqu    (3-6)
By further defining

Tb = data-base manager delay time (seconds),
Lsi = load on the inquiry server,
Lsu = load on the update server,

we can express the response-time component for an inquiry transaction, tqi, as

tqi = (tsi + 2td + Tb + 2tipm)/(1 - Lsi)    (3-7)

and for an update transaction, tqu, as

tqu = (tsu + 3td + 2Tb + 3tipm)/(1 - Lsu)    (3-8)

Remember that an inquiry requires one disk access and that an update requires two disk accesses. The server process must be dispatched once upon the receipt of the request and once upon the receipt of each disk response. The server loads are, respectively,

Lsi = pi R(tsi + 2td + Tb + 2tipm)    (3-9)

Lsu = pu R(tsu + 3td + 2Tb + 3tipm)    (3-10)
where R = transaction rate (transactions per second).

Data-base manager. Note that the data-base manager time, Tb, becomes a part of the server service time because it is in a closed loop with the server; that is, the server must wait for each disk request to be completed before it can go on. Thus, this time becomes part of its service time. Since the data-base manager has its own queue, the server delay time is a solution to a compound queue problem, but in a straightforward and obvious way. Letting

Lb = data-base manager load,

then the data-base manager delay time, Tb, is

Tb = (tb + tdk + td)/(1 - Lb)    (3-11)

where

Lb = R(pi + 2pu)(tb + tdk + td)    (3-12)
Reply handler. Once a reply has been issued by a server, it suffers the following further delays:

1. A wait in the reply handler queue.
2. The service time for processing the reply.

Letting

Lo = reply handler load,

the reply handler component of response time, To, is

To = (tro + td)/(1 - Lo)    (3-13)

where

Lo = R(tro + td)    (3-14)
Dispatch time. We now have expressions for all parameters based on known inputs, except for the dispatch time, td. This is the average time that a process must wait in the processor queue (the ready list, as described in chapter 2) before gaining access to the processor.
For each transaction, the following dispatches take place:

Process               Number of dispatches    Service time
Requestor
  Requests                     1              tri + tipm
  Replies                      1              tro
Server
  Inquiries                   2pi             tsi + 2tipm
  Updates                     3pu             tsu + 3tipm
Data-base manager          pi + 2pu           tb
Total dispatches        2 + 3pi + 5pu

From this list, one can state that the dispatch rate, Td, is

Td = R(2 + 3pi + 5pu)    (3-15)
The average process time per dispatch, tp, is calculated by summing the various process service times weighted by their probabilities of occurrence for a transaction. This probability of occurrence is the ratio of their frequencies of occurrence in a transaction to the total number of dispatches per transaction. Thus, the average process time, tp, may be expressed as

tp = [tri + tro + pi tsi + pu tsu + (1 + pu)tb + (3 + pu)tipm]/(2 + 3pi + 5pu)    (3-16)
The processor load, Lp, is

Lp = Td tp    (3-17)

Thus, the average dispatch time, td (the time spent waiting in the processor queue but excluding the service time), is, from equation 3-1,²

td = [Lp/(1 - Lp)]tp    (3-18)

²A more accurate approach would calculate different dispatch times for each process by excluding the processor load of the process being evaluated. See Appendix 6 and a comparable example in chapter 8.
Model summary. This model is summarized in Tables 3-2 and 3-3. Table 3-2 lists all model parameters, separated into four categories:

• Result parameters. Those calculated parameters that are of most probable interest as a result (only T in this case).
• Input variables. Those parameters which are most likely to be varied to play "what if" games (only R in this case).
• Input parameters. Those parameters which characterize the system and for which values are assumed for purposes of the model.
• Intermediate parameters. All calculated parameters except for result parameters.
TABLE 3-2. PERFORMANCE MODEL PARAMETERS

1. Result Parameters
   T      Average transaction response time (seconds)
2. Input Variables
   R      Average system transaction rate (transactions per second)
3. Input Parameters
   mii    Average inquiry request message length (bytes)
   mio    Average inquiry reply message length (bytes)
   mui    Average update request message length (bytes)
   muo    Average update reply message length (bytes)
   pi     Probability of an inquiry transaction
   pu     Probability of an update transaction
   s      Communication line speed (bytes per second)
   tb     Average data-base manager processing time per disk request (seconds)
   tdk    Average disk access time (seconds)
   tipm   Interprocess message time (seconds)
   tri    Average request handler processing time for a request (seconds)
   tro    Average reply handler processing time for a reply (seconds)
   tsi    Average inquiry server processing time for an inquiry (seconds)
   tsu    Average update server processing time for an update (seconds)
4. Intermediate Parameters
   Lb     Data-base manager load
   Lo     Reply handler load
   Lp     Processor load
   Lsi    Inquiry server load
   Lsu    Update server load
   Td     Process dispatch rate (dispatches per second)
   td     Average dispatch time (waiting time for the processor, in seconds)
   tp     Average process time per dispatch (seconds)
   tqi    Average inquiry server delay time (seconds)
   tqu    Average update server delay time (seconds)
   Tb     Average data-base manager delay time required to process a disk request (seconds)
   Tc     Communication-line response time component (seconds)
   Ti     Request handler response time component (seconds)
   To     Reply handler response time component (seconds)
   Ts     Server response time component (seconds)
Table 3-3 summarizes the equations, with equation numbers to reference back to the text. Note the top-down presentation of all parameter expressions in Table 3-3: the expression for an intermediate parameter is not presented until it has appeared in a previous expression. This is an invaluable aid in organizing a model and in organizing your own thought process.
Model Results

The average response time, T, as a function of the transaction load, R, is shown in Figure 3-4, using the values of input parameters largely suggested in previous sections and summarized in Table 3-4. Note the rather precipitous break in the curve, much sharper than we would expect from a T/(1 - L) relationship. This is usually caused by a "sleeper" in the system: a bottleneck that contributes little delay at low loads but becomes the system bottleneck at high loads. In fact, with this response-time characteristic, the system will "break" very suddenly as load is increased, and it is important to stay away from the knee of the curve. If the system capacity is taken as four transactions per second (67% of full capacity), we are fairly safe, with an average response time of about 0.6 seconds.

TABLE 3-3. PERFORMANCE MODEL SUMMARY

1. Response time
   T = Tc + Ti + Ts + To                                         (3-3)
2. Communications
   Tc = [pi(mii + mio) + pu(mui + muo)]/s                        (3-4)
3. Request handler
   Ti = tri + td + tipm                                          (3-5)
4. Servers
   Ts = pi tqi + pu tqu                                          (3-6)
   tqi = (tsi + 2td + Tb + 2tipm)/(1 - Lsi)                      (3-7)
   tqu = (tsu + 3td + 2Tb + 3tipm)/(1 - Lsu)                     (3-8)
   Lsi = pi R(tsi + 2td + Tb + 2tipm)                            (3-9)
   Lsu = pu R(tsu + 3td + 2Tb + 3tipm)                           (3-10)
5. Data-base manager
   Tb = (tb + tdk + td)/(1 - Lb)                                 (3-11)
   Lb = R(pi + 2pu)(tb + tdk + td)                               (3-12)
6. Reply handler
   To = (tro + td)/(1 - Lo)                                      (3-13)
   Lo = R(tro + td)                                              (3-14)
7. Dispatch time
   td = [Lp/(1 - Lp)]tp                                          (3-18)
   Lp = Td tp                                                    (3-17)
   tp = [tri + tro + pi tsi + pu tsu + (1 + pu)tb + (3 + pu)tipm]/(2 + 3pi + 5pu)   (3-16)
   Td = R(2 + 3pi + 5pu)                                         (3-15)
Figure 3-4 System response time: the average response time, T, and its update-server component, tqu, as functions of the transaction rate, R (transactions/second).
TABLE 3-4. INPUT PARAMETER VALUES

Parameter   Meaning                                         Value
mii         Inquiry request length (bytes)                    20
mio         Inquiry reply length (bytes)                     400
mui         Update request length (bytes)                    200
muo         Update reply length (bytes)                       15
pi          Probability of an inquiry                        0.35
pu          Probability of an update                         0.65
s           Communication line speed (bytes per second)      960
tb          Data-base manager processing time (seconds)      .010
tdk         Disk access time (seconds)                       .035
tipm        Interprocess message time (seconds)              .005
tri         Request handler processing time (seconds)        .005
tro         Reply handler processing time (seconds)          .005
tsi         Inquiry server processing time (seconds)         .010
tsu         Update server processing time (seconds)          .015
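Item 5 of the analysis outline called for a performance model program once the equations outgrow hand calculation. Below is a minimal sketch of such a program (Python is our arbitrary choice, and the code is our own illustration; the names follow Table 3-2). It simply evaluates the equations of Table 3-3, bottom-up, with the values of Table 3-4:

    # Performance model of Table 3-3, using the input values of Table 3-4.
    mii, mio, mui, muo = 20, 400, 200, 15        # message lengths (bytes)
    pi, pu, s = 0.35, 0.65, 960                  # transaction mix, line speed
    tb, tdk, tipm = 0.010, 0.035, 0.005          # DB mgr, disk, IPM times (sec)
    tri, tro, tsi, tsu = 0.005, 0.005, 0.010, 0.015   # process times (sec)

    def model(R):
        # Dispatch time (equations 3-15 through 3-18).
        Td = R * (2 + 3*pi + 5*pu)
        tp = (tri + tro + pi*tsi + pu*tsu
              + (1 + pu)*tb + (3 + pu)*tipm) / (2 + 3*pi + 5*pu)
        Lp = Td * tp
        td = Lp * tp / (1 - Lp)
        # Data-base manager (3-11, 3-12).
        Lb = R * (pi + 2*pu) * (tb + tdk + td)
        Tb = (tb + tdk + td) / (1 - Lb)
        # Servers (3-6 through 3-10).
        Lsi = pi * R * (tsi + 2*td + Tb + 2*tipm)
        Lsu = pu * R * (tsu + 3*td + 2*Tb + 3*tipm)
        tqi = (tsi + 2*td + Tb + 2*tipm) / (1 - Lsi)
        tqu = (tsu + 3*td + 2*Tb + 3*tipm) / (1 - Lsu)
        Ts = pi*tqi + pu*tqu
        # Communications, request handler, reply handler (3-4, 3-5, 3-13, 3-14).
        Tc = (pi*(mii + mio) + pu*(mui + muo)) / s
        Ti = tri + td + tipm
        Lo = R * (tro + td)
        To = (tro + td) / (1 - Lo)
        return Tc + Ti + Ts + To                 # equation 3-3

    for R in (1, 2, 3, 4, 5, 6):
        print(R, round(model(R), 2))             # T is ~0.57 sec at R = 4

At R = 4 this yields roughly 0.57 seconds, consistent with the "about 0.6 seconds" read from Figure 3-4, and having the function return the intermediate loads (Lsi, Lsu, Lb, Lo, Lp) as well reproduces the component-load table discussed below.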
Note that if we should push the capacity up by 25% to 5 transactions per second, we only get a 15% increase in response time (to 0.7 seconds). However, at this point, a mere 10% increase in load gives almost a 50% increase in response time!

This fairly low capacity, four transactions per second, is not an uncommon characteristic of TP systems as a per-processor measure. One to ten transactions per second per processor is the common range for contemporary systems. This emphasizes the great importance of performance analysis and its companion, performance planning.

Note that we can use the model to "peek" into the system to analyze its performance more deeply. Not only can we plot the component response times (Tc, Ti, Ts, To, tqi, and tqu), but we also can evaluate component loading (Lsi, Lsu, Lb, Lo, and Lp). This capability is our tool to determine where the system can best be modified to enhance its performance.

Let us apply this tool to our example system. Peeking inside will identify our "sleeper," the update server. Its response-time component, tqu, is also shown in Figure 3-4. It is initially hidden by the communications component, which is independent of load because of the point-to-point lines. However, it becomes the bottleneck at six transactions per second.

What can we do to enhance the system? Simply add another update server, and we have doubled the capacity of the system. No extra hardware needed. In fact, the model will show the following component loads for the single-server case at a load of six transactions per second:
Inquiry server (Lsi)        0.27
Update server (Lsu)         0.94
Data-base manager (Lb)      0.49
Reply handler (Lo)          0.06
Processor (Lp)              0.35
By adding a second update server, its load is reduced from a very high load (0.94) to a reasonable load (0.47) at six transactions per second. Since the hardware components of the system (the disk and processor) are carrying only a 35% to 50% load, significant capacity enhancement could be achieved by moving into a multiserver environment without purchasing any new hardware. The model could be changed to reflect this and could be used to predict the optimum number of servers required to balance the software with the hardware. Of course, dynamic servers, as described in chapter 2, would be self-adjusting. The performance analysis would then be directed to predict the performance of that system.

Analysis Summary

The above discussion has presented, to some extent, the content of a proper performance analysis. The system description has spanned chapter 2 and this chapter, though the
description would normally be contained in a more formal and localized section of the performance analysis. A scenario model is developed along with a traffic model; significant explanatory text accompanies these developments. The model is summarized, and results are presented (no program was necessary to allow calculation of this simple model), as well as recommendations for system performance improvement at no additional hardware cost. In short, not only has an analysis been undertaken, but it also has been thoroughly documented.

Not only have we completed a real performance model, but we also have completed our first performance analysis! This analysis was completed using only a simple queuing equation, a knowledge of TP systems as presented in chapter 2, and a little native ingenuity. This is all it takes to be a performance analyst. Why, then, are there so few of us?

The remaining chapters add some tricks to our tool kit, give us a better understanding of our tools, and stress the importance of documenting our model and analysis.
4 Basic Performance Concepts
The mathematical foundation required for performance modeling is very much a function of the depth to which the analyst is interested in diving, from skimming the surface to going in over one's head. As we saw in chapter 3, a great deal of performance modeling can be achieved with one queuing equation (equation 3-1), a lot of simple algebra, and some system sense.

However, there are many unusual and complex TP system architectures that can be more accurately analyzed with some better tools. This chapter can be considered the "tool kit" for the book. Three of the six sections of this chapter provide the basic foundation of queuing theory for the remaining chapters. Below is a brief overview of these three sections.

1. Queues-An Introduction gives a simple, intuitive derivation of perhaps the most important relationship for the performance analyst: the Khintchine-Pollaczek equation, which characterizes a broad range of queue types. Equation 3-1 is a subset of this equation. Nonmathematicians should find the intuitive derivation understandable and illuminating in terms of understanding the behavior of queues. Those knowledgeable in queuing theory will find definitions of terms and symbols used throughout the rest of the book.

2. Characterizing Queuing Systems summarizes the Kendall classification system for queues. This classification system is used throughout the book.
3. Comparison of Queue Types summarizes the queuing concepts presented in this chapter by discussing the comparative behavior of queuing systems as a function of service-time distribution, number of servers, and population size.

The other three sections of the chapter deal with a broad range of mathematical tools, including probability concepts, useful series, and queuing theory. Those less inclined to mathematics will be pleased to know that the rest of the book is completely understandable without a detailed knowledge of these tools.
QUEUES-AN INTRODUCTION

As we have seen in the simple performance examples in chapter 3, queues are all-important to the study of system performance. In the simplest of terms, a queue is a line of items awaiting service by a commonly shared resource (which we hereinafter will call a server). Let us first, therefore, take an intuitive look at the phenomenon of queuing. Through just common sense and a little algebra, we can come close to deriving the Khintchine-Pollaczek equation, which is one of the most important tools available for performance analysis. Its formal derivation is presented in Appendix 3, but it is not much more difficult than that presented here.

Let us consider a queue with very simple characteristics. Interestingly enough, these characteristics are applicable as a good approximation to most queues that we will find in real-life systems. These characteristics are

1. There is only a single server.
2. No limit is imposed on the length to which the queue might grow.
3. Items to be serviced arrive on a random basis.
4. Service order is first-in, first-out.
5. The queue is in a steady-state condition, i.e., its average length measured over a long enough period is constant.
In Figure 4-1, a server is serving a line of waiting transactions. On the average, it takes the server T seconds to service a transaction. Some transactions take more, some take less, but the average is T. T is called the service time of the server. The average number of transactions in the system, which we will call the queue length, is Q transactions. This comprises W transactions waiting in line plus the transaction (if any) currently being serviced. Finally, transactions are arriving at a rate of R transactions per second.

The server is busy, or occupied, RT of the time. (If transactions are arriving at a rate of R = 1 per second, and if the server requires T = 0.5 seconds to service a transaction, then it is busy 50% of the time.) This is called the occupancy of the server, or the load (L) on the server:

L = RT    (4-1)
Figure 4-1 A single-server queue: items arrive at rate R and are serviced with average service time T. Tq = queue time (time waiting in line); Td = delay time (time in line plus service time).
L also represents the probability that an arriving transaction will find the server busy. (Obviously, if the server is not busy, there is no waiting line.) When a transaction arrives to be serviced, it will find in front of it, on the average, W transactions waiting for service. With probability L, it will also find a transaction in the process of being serviced. The servicing of the current transaction, if any, will have been partially completed. Let us say that only kT time is left to finish its servicing. The newly arrived transaction will have to wait in line long enough for the current transaction to be completed (kT seconds, L of the time) and then for each transaction in front of it to be serviced (WT seconds). Therefore, it must wait in line a time Tq (the queue time):

Tq = WT + kLT    (4-2)
From the time that it arrived in line to the time that its servicing begins, an average of W other transactions must have arrived to maintain the average line length. Since transactions are arriving at R transactions per second, then

W = Tq R

or

Tq = W/R    (4-3)

Equating 4-2 and 4-3 and solving for the waiting-line length gives (using 4-1)

W = kL²/(1 - L)    (4-4)
The total length of the queue as seen by an arriving transaction is the waiting line, W, plus a transaction being serviced L of the time:

Q = W + L    (4-5)

or

Q = [L/(1 - L)][1 - (1 - k)L]    (4-6)
The delay time, Td, is determined in a similar manner. It is the total amount of time the transaction must wait in order to be serviced: its waiting time in line plus its service time. During the time that the transaction is in the system (Td), Q transactions must arrive to maintain the steady state:

Q = Td R = Td L/T    (4-7)

Setting this equal to equation 4-6 and solving for the response time, Td, gives

Td = [1/(1 - L)][1 - (1 - k)L]T    (4-8)

From 4-2 and 4-4,

Tq = [kL/(1 - L)]T    (4-9)

Equations 4-4, 4-6, 4-8, and 4-9 are the basic queuing equations for a single server. As stated previously, the primary assumptions inherent in these expressions are that transactions arrive completely independently of each other and are serviced in order of arrival. The equations are, therefore, very general expressions of the queuing phenomenon. They are quite accurate when the transactions are being generated by a number of independent sources, as long as the number of sources is significantly greater than the average queue lengths.

A common limiting case of accuracy occurs when there is a small number of sources, each of which can have only one outstanding transaction at a time (a common case in computer systems). In this case, the queue length cannot exceed the number of sources, and the arrival rate becomes a function of server load. (Arrivals will slow down as queues build, since sources must await servicing before generating a new transaction.) However, in this case, the above expressions will be conservative, since they will in general predict queue lengths somewhat greater than those that actually will be experienced. (Obviously, the limiting case of only one source will never experience a queue, although equation 4-6 will predict a queue.)

The parameter k in these equations is a function of the service-time distribution. It is the average proportion of service time that is left to be expended on the current transaction being serviced when a new transaction arrives. Let us look at these expressions for certain important cases of service-time distributions.
Exponential Service Times

For exponential service times, the probability that the remaining service time will be greater than a given amount t is exponential (of the form e^(-t/T)). We discuss this and other probability concepts later in this chapter. An exponential distribution has the characteristic that the remaining service time after any arbitrary delay, assuming that the servicing is still in progress, is still exponential, i.e., it has no memory. Thus, one has the following interesting situation: if the average service time of the server is T, and if the server is currently busy (for no matter how long), one is going to have to wait an average time of T for the service to complete. Since k is the proportion of servicing remaining when a new trans-
action enters the queue, then k = 1 for exponential service times. Thus, equations 4-6, 4-8, and 4-9 become

Q = L/(1 - L)    (4-10)

Td = T/(1 - L)    (4-11)

and

Tq = [L/(1 - L)]T    (4-12)

These are the forms of the queuing expressions often seen in the literature and represent a typical worst case for service-time distributions (though it is possible to construct unusual distributions that are worse than exponential). Equation 4-11 is the form that was used in chapter 3 as equation 3-1.
Constant Service Times

If the service time is a constant, then on the average, half of the service time will remain when a new transaction enters the queue. Therefore k = 1/2, and

Q = [L/(1 - L)](1 - L/2)    (4-13)

Td = [1/(1 - L)](1 - L/2)T    (4-14)

and

Tq = [(L/2)/(1 - L)]T    (4-15)

These expressions show queues and delays that are somewhat less than those predicted for exponential service times. Constant service times generally represent the best case.
General Distributions

More generally, the distribution coefficient k can be computed for any service-time distribution from its mean and variance:

k = (1/2)[1 + var(T)/E(T)²]    (4-16)

Equivalently, since var(T) = E(T²) - E(T)², k = E(T²)/2E(T)². Here

E(T) = the mean of the service-time distribution (simply represented as T herein),
var(T) = the variance of the service-time distribution,
and

E(T²) = the second moment of the service-time distribution.

This result is derived in Appendix 3. We will hereafter refer to k as the distribution coefficient of the service time for the server. For exponentially distributed service times, as we will see later, the standard deviation equals the mean. Since the variance is the square of the standard deviation, then k = 1, consistent with our argument above. For constant service times, the variance is zero, and k = 1/2, as also just shown.
Uniform Service Times

A service time that can range from zero to s seconds with equal probability is uniformly distributed. It has a mean of s/2 and a second moment of

E(T²) = (1/s) ∫(0 to s) t²dt = s²/3

Therefore,

k = (s²/3)/(2(s/2)²) = 2/3    (4-17)

This is between the cases of constant and random service times, as would be expected.
Discrete Service Times

Oftentimes, the service time will be one of a set of constant times, each with a duration of Ti and a probability of pi. In this case,

k = (Σi pi Ti²)/(2(Σi pi Ti)²)    (4-18)
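Equation 4-18 (which is just equation 4-16 with the discrete moments substituted) is easy to evaluate mechanically. A small sketch, with names of our own choosing:

    # Distribution coefficient k for a discrete service-time mix (equation 4-18).
    def k_discrete(times, probs):
        ET  = sum(p * t for p, t in zip(probs, times))       # mean service time
        ET2 = sum(p * t * t for p, t in zip(probs, times))   # second moment
        return ET2 / (2 * ET * ET)

    print(k_discrete([0.1], [1.0]))             # constant service time: k = 0.5
    print(k_discrete([0.1, 0.3], [0.5, 0.5]))   # a two-valued mix: k = 0.625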
The above equations are the very important queuing equations derived by Khintchine and Pollaczek. (Actually, equation 4-9 for the queue time, Tq, is formally known as the Khintchine-Pollaczek equation, or the Pollaczek-Khintchine equation, depending upon whom you wish to give primary credit.) These are so important to performance analysis that we summarize them here:

W = kL²/(1 - L)    (4-4)

Q = [L/(1 - L)][1 - (1 - k)L]    (4-6)

Td = [1/(1 - L)][1 - (1 - k)L]T    (4-8)

Tq = [kL/(1 - L)]T    (4-9)

and

k = (1/2)[1 + var(T)/E(T)²]    (4-16)

where

W = average number of items waiting to be serviced,
Q = average length of the queue, including the item currently being serviced, if any,
Td = average delay time for a transaction, including waiting plus service time,
Tq = average time that a transaction waits in line before being serviced,
k = distribution coefficient of the service time,
T = average service time of the server,
L = load on (occupancy of) the server.
Equation 4-8 for the delay time is the one which will be most often used in this book. The queue length, W or Q (equation 4-4 or 4-6), is often necessary when we are concerned with overflowing finite queues. The queue time, Tq (equation 4-9), is useful when the queue contains items from diverse sources, each with a different mean service time. In this case, the queue time is calculated using the weighted mix of all transaction service times. The average delay time is calculated for an item by adding the average queue time, calculated for the mix of all items in the queue, to the average service time for that item. Thus,

Td = Tq + T    (4-19)

as is supported by equations 4-8 and 4-9. Other useful relations between these parameters, which can be deduced from the above equations and which have already been presented, are

Q = W + L    (4-20)

Tq = (W/L)T    (4-21)

and

Td = (Q/L)T = Tq + T    (4-22)
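Since these expressions recur throughout the book, it is worth seeing them computed once. The following sketch (an illustration under our own naming) evaluates W, Q, Td, and Tq for a given load and distribution coefficient, and contrasts the exponential and constant-service cases:

    # Khintchine-Pollaczek single-server equations (4-4, 4-6, 4-8, 4-9).
    def kp(L, T, k):
        W  = k * L * L / (1 - L)                      # items waiting in line
        Q  = (L / (1 - L)) * (1 - (1 - k) * L)        # total items in queue
        Td = (1 / (1 - L)) * (1 - (1 - k) * L) * T    # delay (wait + service)
        Tq = (k * L / (1 - L)) * T                    # time waiting in line
        return W, Q, Td, Tq

    # Exponential service (k = 1) at 50% load: Td = T/(1 - L) = 2T.
    print(kp(0.5, 0.1, 1.0))    # (0.5, 1.0, 0.2, 0.1)
    # Constant service (k = 1/2) at the same load waits half as long in line.
    print(kp(0.5, 0.1, 0.5))    # (0.25, 0.75, 0.15, 0.05)

Note that the first result also confirms Little's Law, introduced next: with R = L/T = 5 and Td = 0.2, Q = RTd = 1.0.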
Note that L is the average load on the server. As we shall see later, in a multiserver system with c servers, the total load on the system is cL. Equations 4-20 through 4-22 still hold after substituting cL for L.

Equation 4-7, derived earlier, provides another important insight into queuing systems:

Q = RTd    (4-7)

This is known as Little's Law (Little [18], Lazowska [16]) and states that the number of items in a system is equal to the product of the throughput, R, and the residence time of an item in the system, Td. These relationships are surprisingly valid across many more queue disciplines than the simple first-in, first-out discipline discussed here. This point is explored further in Appendix 3, where it is shown that the order of servicing is not important so long as an item is not selected for service based on its characteristics. For instance, these relationships would apply to a round-robin (or polling) service algorithm but not to a service procedure that gave priority to short messages over long messages.

CONCEPTS IN PROBABILITY, AND OTHER TOOLS
Performance analysis can at times require some innovative ingenuity. This typically does not require any knowledge of higher mathematics. However, a basic knowledge of probability concepts and other helpful hints can often be useful. The material presented in this section is intended to simply touch on those concepts that have proven useful to the author over the course of several performance analyses.

We will launch into some detail, however, concerning randomness as it relates to the Poisson and exponential probability distributions. Not that we need these so much in our analysis efforts, but because an in-depth knowledge of them is imperative in order to clearly understand many of the simplifying assumptions that we will often have to make in order to obtain even approximate solutions to some problems. Excellent coverage of these topics for the practicing analyst may be found in Martin [19] and in [7].
Random Variables

Probability theory concerns itself with the description and manipulation of random variables. These variables describe real-life situations and may be classified into discrete and continuous random variables.
Discrete Random Variables

A discrete variable is one which can take only certain discrete values (even though there may be an infinite set of these values, known as a countably infinite set). The length of a queue is a discrete random variable; it may have no items, one item, two items, and so on
without limit. The length of a queue with a fixed maximum length is an example of a discrete variable with a finite number of values.

If we periodically sampled a queue in a real-life process, we would find that at each sample point it would contain some specific number of items, ranging from zero items to the maximum number of items, if any. If we sampled enough times and kept counts relative to the total number of samples (and assumed that the queue is in steady state), we would find that the proportion of time that there were exactly n items in the queue would converge to some number. This would be true for all allowable values of n.

For instance, if we made 100,000 samples and found that for 10,000 times there was nothing in the queue, for 20,000 times there was 1 item in the queue, and for 1,000 times there were 10 items in the queue, we would be fairly certain that the queue would normally be idle 10% of the time, have a length of 1 for 20% of the time, and have a length of 10 for 1% of the time.

These values, expressed as proportions rather than as percentages, are, of course, the probabilities of their corresponding events. That is, the probability in this case of a queue length of zero is .1, etc. We will denote the probability of a discrete event as Pn, where n in some way describes the event. For instance, in the case of a queue, Pn is the probability of there being n items in the queue. If we were drawing balls of different colors from an urn, we might choose P1 to be the probability of drawing a red ball and P2 that probability for a green ball. Thus,
Pn = probability of the occurrence of event n.

The set of Pn that describes a random variable, n, is called the probability density function of n. There are several important properties of discrete variables, but the most obvious are

1. Each probability must be no greater than 1, since we can never be more certain than certain:

   0 ≤ Pn ≤ 1    (4-23)

2. The probability density function must sum to 1, since 1 and only 1 event on each observation is certain:

   Σn Pn = 1    (4-24)

   where the summation is over all allowed values of n.

3. Assuming that events are independent, the probability of a specific combination of events is the product of their probabilities. Thus, if we were to draw a ball from an urn containing balls of several different colors, put it back, and draw another ball, and if P1 were the probability of drawing a red ball and P2 the probability of drawing a green ball, then the probability of drawing a red ball and a green ball is P1P2. (Note that we have to put the first ball back in order to avoid changing the probabilities; otherwise, the two events would not be independent.)
   Probability of occurrence of a set of independent events = Πn Pn    (4-25)

   where the product, Π, is over the specified events. Thus, "and" implies product (the probability of event 1 and event 2 is P1P2).

4. Assuming that events are independent, the probability that an event will be one of several is the sum of those probabilities. In the above example, the probability of drawing either a red ball or a green ball on a single draw is P1 + P2.

   Probability of occurrence of one of a set of independent events = Σn Pn    (4-26)

   where the summation is over the desired events. Thus, "or" implies sum (the probability of event 1 or event 2 is P1 + P2).

5. The probability of a sequence of dependent events depends upon the conditional probabilities of those events. If we did not return the ball to the urn, then the probabilities would change for the second draw. The probability of a red and then green draw would be the probability of a red draw times the probability of a green ball given that a red ball has been drawn. Thus, the probability of red, then green, is P1P2(1), where Pn(m) is the probability of event n occurring (a green ball in this case) given that event m has occurred. In general, letting p(n, m) be the probability of the sequence of events n and m,

   p(n, m) = Pn Pm(n)    (4-27)
6. The average value (or mean) of a random variable that has numeric meaning is the sum of each value of the variable weighted by its probability. Let n̄ be the mean of the variable with values n and probabilities Pn. Then

   n̄ = Σn n Pn    (4-28)

   where the sum is taken over all allowed values of n.

7. It is often important to have a feel for the "dispersion" of the random variable. If its mean is n̄, will all observations yield a value close to n̄ (low dispersion), or will they vary widely (high dispersion)? A common measure of dispersion is to calculate the average mean square of all observations relative to the mean. This is called the variance of the random variable, denoted var(n):

   var(n) = Σn (n - n̄)² Pn    (4-29)

   where the sum is taken over all allowed values of n. The square root of the variance is called its standard deviation and, of course, has the same dimension as the variable (i.e., items, seconds, transactions, etc.).
8. The moments of the variable are also sometimes used. The mth moment of a variable, n, is represented as n̄ᵐ (the bar is taken over nᵐ; it is the mean of nᵐ) and is

   n̄ᵐ = Σn nᵐ Pn    (4-30)

   where the summation is over all allowable n. Note that the mean is the first moment (m = 1). There is also an important relation between the variance and the second moment:

   var(n) = Σn (n - n̄)² Pn = Σn n² Pn - Σn 2n̄n Pn + Σn n̄² Pn

   From equations 4-24 and 4-28,

   var(n) = n̄² - (n̄)²    (4-31)

   That is, the variance of a random variable is the difference between its second moment and the square of its mean. (See equation 4-16 for a use of this relationship.)
9. If x is a random variable that is the sum of other independent random variables, then the mean of x is the sum of the means of its component variables, and the variance of x is the sum of the variances of its component variables. Thus, if

   x = a + b + c + ...

   then

   x̄ = ā + b̄ + c̄ + ...    (4-32)

   and

   var(x) = var(a) + var(b) + var(c) + ...    (4-33)
10. If x is a choice of one of a possible set of variables, then its mean is the weighted average of those variables, and its second moment is the weighted average of the second moments of those variables. Thus, if a, b, c, ... are each random variables, and if x may be a with probability Pa, b with probability Pb, etc., then

    x̄ = Pa ā + Pb b̄ + Pc c̄ + ...    (4-34)

    and

    x̄² = Pa ā² + Pb b̄² + Pc c̄² + ...    (4-35)

    Note that weighted second moments are added when x is a choice, whereas variances are added when x is a combination.

11. The set of probabilities Pn that describe a random variable may be summed up to, or beyond, some limit. This sum is called the cumulative distribution func-
tion of n. If the sum is up to but does not include the limit, then the cumulative distribution function gives the probability that n will be less than the limit. This is denoted by P(n < m), where m is the limit:

    P(n < m) = Σ(n<m) Pn    (4-36)

    where the summation is over all n less than m. Note that as m grows large, P(n < m) tends toward unity. If the sum is beyond the limit (n > m), then P(n > m) is the probability that n will exceed the limit:

    P(n > m) = Σ(n>m) Pn    (4-37)

    where the sum is over all n greater than m. As m grows large, P(n > m) tends to zero.

A simple example will illustrate many of these points. Figure 4-2a gives a probability density function for the size of a message that may be transmitted from a terminal. Its size in bytes is distributed as follows:

Message size (n)    Probability (Pn)
      20                  .1
      21                  .2
      22                  .3
      23                  .2
      24                  .2
Note that the probabilities of all messages sum to 1:

Σ(n=20..24) Pn = 1

The mean message size is

n̄ = Σ(n=20..24) n Pn = 22.2

The variance of the message size is

var(n) = Σ(n=20..24) (n - n̄)² Pn = 1.56

Its standard deviation is 1.25. The second moment is

n̄² = Σ(n=20..24) n² Pn = 494.4
Note the relationship between variance and second moment:

var(n) = n̄² - (n̄)² = 494.4 - 22.2² = 1.56

This illustrates one potential computational pitfall. The variance calculated in this manner can be a small difference between two relatively large numbers. For that reason, the calculation should be made with sufficient accuracy.

The cumulative distribution functions for this variable are shown in Figure 4-2b. As with the density function, these functions have meaning only at the discrete values of the variable. Thus, the probability that the message length will be greater than 22 is .4 (i.e., P23 + P24 = .4), and the probability that it will be less than 22 is .3 (i.e., P20 + P21 = .3).

Figure 4-2 Message size: (a) probability density function; (b) cumulative distribution functions.

Now let us assume that we have a second message type with a mean of 35 bytes and a variance of 3. Denote as m1 the first message described by the distribution of Figure 4-2, and denote as m2 the new message just defined. Consider the following two cases:

Case 1. m1 is a request message, and m2 is the response. What is the average communication line usage (in characters) and its variance for a complete transaction? In this case, the communication line usage is the sum of the message usages. The mean and variance of this total usage are the sum of the means and variances for the individual messages. Let the total line usage per transaction be m. Then
m = m1 + m2

m̄ = m̄1 + m̄2 = 22.2 + 35 = 57.2

var(m) = var(m1) + var(m2) = 1.56 + 3 = 4.56

Thus, the average communication traffic per transaction will be 57.2 bytes, with a variance of 4.56 or a standard deviation of 2.14 bytes.

Case 2. Both m1 and m2 are request messages. m will be m1 30 percent of the time and m2 70 percent of the time. What are the mean and variance of m? m is now a choice between messages. Its mean is
m̄ = .3 x 22.2 + .7 x 35 = 31.16

The second moment of m is found by adding the weighted second moments of m1 and m2. The second moment of m2 is the sum of its variance and the square of its mean:

var(m2) + (m̄2)² = 3 + 35² = 1228

Then

m̄² = .3 x 494.4 + .7 x 1228 = 1007.92

The variance of m, then, is

var(m) = m̄² - (m̄)² = 1007.92 - 31.16² = 36.97

Thus, the request messages will average 31.16 bytes in length, with a variance of 36.97 or a standard deviation of 6.08 bytes.
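Both cases can be checked mechanically. The sketch below (our own illustration) computes the moments of the Figure 4-2 distribution and then applies the "sum" and "choice" rules:

    # Moments of the discrete message-size distribution of Figure 4-2.
    sizes = [20, 21, 22, 23, 24]
    probs = [0.1, 0.2, 0.3, 0.2, 0.2]

    mean = sum(p * n for p, n in zip(probs, sizes))        # 22.2
    m2   = sum(p * n * n for p, n in zip(probs, sizes))    # 494.4
    var  = m2 - mean**2                                    # ~1.56

    # Case 1 (sum of m1 and m2): means and variances add.
    print(mean + 35, var + 3)                  # 57.2, ~4.56

    # Case 2 (choice of m1 or m2): weighted second moments add.
    mean_c = 0.3 * mean + 0.7 * 35             # 31.16
    m2_c   = 0.3 * m2 + 0.7 * (3 + 35**2)      # 1007.92
    print(mean_c, m2_c - mean_c**2)            # 31.16, ~36.97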
Continuous Random Variables

The previous section described discrete random variables: those that take on only certain discrete (often integer) values, such as the number of items in a queue or the number of bytes in a message. But what about the number of seconds in a service? If a process requires somewhere between 2 and 17 msec to process a transaction, its service time can vary continuously between these limits. We can say that the probability is .15 that the service time for this process is between 10 and 11 msec, but this probability includes service times of 10.1, 10.2, and 10.25679 msec. The service time variable is not discrete in this case. It can assume an infinite number of values and is therefore called a continuous random variable.

All of the rules we have established for discrete variables have a corollary for continuous variables, often with the summation sign simply replaced with an integral sign. The characteristics and rules with which we will be concerned in performance analyses are as follows:
1. The probability density function is continuous. If x is a random variable with probability density function f(x), then f(x)dx is the probability that x will fall within the infinitesimal range dx. More specifically, the probability that x will be between a and b is

P(a \le x \le b) = \int_a^b f(x)dx \le 1    (4-38)

Notice that there is no requirement that f(x) < 1 for all values of x, only that equation 4-38 result in a value no greater than 1 over any range. For instance, if x has equal probability of ranging from 0 to .1, then f(x) = 10 for 0 \le x \le .1.

2. Since x must have some value, the integral of the probability density function must be one:

\int f(x)dx = 1    (4-39)
where integration is over the allowed range of x.

3. The average, or mean, value of x is its integral weighted by its probability over its range:

\bar{x} = \int x f(x)dx    (4-40)

where the integration is over the allowed range of x.

4. The variance of x is the square of the deviation of x from its mean, weighted by its probability and integrated over its range:

var(x) = \int (x - \bar{x})^2 f(x)dx    (4-41)

where the integration is over the allowable range of x.

5. The mth moment of x is x^m, weighted by its probability and integrated over the range of x:

\overline{x^m} = \int x^m f(x)dx    (4-42)
where the integration is over the allowed range of x.

Note that the variance given in equation 4-41 can be expanded to

var(x) = \int x^2 f(x)dx - \int 2x\bar{x} f(x)dx + \int \bar{x}^2 f(x)dx

Using equations 4-39, 4-40, and 4-42,

var(x) = \overline{x^2} - \bar{x}^2    (4-43)

just as with discrete variables.

6. The properties of calculating means and variances for sums of variables or choices of variables are the same as for discrete probabilities. That is:

a. If x is a sum of continuous random variables x_1, x_2, x_3, ..., then its mean is the sum of the means, and its variance is the sum of the variances of x_1, x_2, x_3, ....
b. If x is a choice between continuous random variables x_1, x_2, x_3, ... with probabilities p_1, p_2, p_3, ..., then its mean is the weighted sum of the means, and its second moment is the weighted sum of the second moments of x_1, x_2, x_3, ..., where the weighting factors are the probabilities p_1, p_2, p_3, ....

7. The cumulative distribution functions for x are

P(x < a) = \int_{x<a} f(x)dx    (4-44a)

and

P(x > a) = \int_{x>a} f(x)dx    (4-44b)

Equation 4-44a is the probability that x is less than the limit a, where the integration is over all allowed values of x less than a. Equation 4-44b is the probability that x is greater than a, where the integration is over all allowed values of x greater than a.
Figure 4-3 shows an example of a continuous probability density function and its cumulative distributions.

It is not often in performance analysis that we are forced into the calculus of continuous distribution functions. However, there are two prominent examples. One is the calculation of the distribution coefficient in the Khintchine-Pollaczek equation for a continuous variable. Such an example was given earlier in this chapter (equation 4-17). Another is the following.

Consider a medium whose length is b. From one random position, x_1, what is the mean distance to any other random position, x_2? There are several physical processes corresponding to this case:

• The average seek distance of a disk moving from its current track to a new track,
[Figure 4-3 A continuous probability density function and its cumulative distributions P(x < a) and P(x > a).]
• The distance between two terminals on the bus of a local area network,

• The amount of tape that must be passed in a tape search starting at one random point and moving to another.

The solution to this problem is presented here as an example tying many of the above concepts together. The probability density function that x will be at any given point on b, f(x), is

f(x) = 1/b, \quad 0 \le x \le b

Otherwise, f(x) = 0. Thus,

\int_0^b f(x)dx = \int_0^b \frac{1}{b}dx = \left[\frac{x}{b}\right]_0^b = 1
The average distance between one random point, x_1, and another, x_2, is the average distance x_2 - x_1 given a point x_1, then averaged over all possible values of x_1.

If x_2 is greater than x_1, then its value can range between x_1 and b. Therefore, its probability density function is

f(x_2) = \frac{1}{b - x_1}, \quad x_2 > x_1

and its average distance from x_1 is

\overline{x_2 - x_1} = \int_{x_1}^b (x_2 - x_1)f(x_2)dx_2 = \frac{1}{b - x_1}\int_{x_1}^b (x_2 - x_1)dx_2 = \frac{b - x_1}{2}, \quad x_2 > x_1
If x_2 is less than x_1, then its value can range from 0 to x_1. Therefore, its probability density function for this case is

f(x_2) = \frac{1}{x_1}, \quad x_2 < x_1

and its average distance from x_1 is

\overline{x_1 - x_2} = \frac{1}{x_1}\int_0^{x_1} (x_1 - x_2)dx_2 = \frac{x_1}{2}, \quad x_2 < x_1

The probability that x_2 will be greater than x_1 is (b - x_1)/b, and the probability that it will be less is x_1/b. Thus, the average distance between x_1 and x_2 for a given value of x_1 is

\frac{b - x_1}{b}\cdot\frac{b - x_1}{2} + \frac{x_1}{b}\cdot\frac{x_1}{2} = \frac{(b - x_1)^2 + x_1^2}{2b}
Since x_1 can range from 0 to b, its probability density function is

f(x_1) = \frac{1}{b}

The average distance of x_2 from x_1 when x_1 is varied over its range will be called \bar{x} and is

\bar{x} = \int_0^b \frac{(b - x_1)^2 + x_1^2}{2b}\cdot\frac{1}{b}dx_1 = \frac{1}{2b^2}\int_0^b (b^2 - 2bx_1 + 2x_1^2)dx_1

\bar{x} = \frac{1}{2b^2}\left[b^2 x_1 - bx_1^2 + \frac{2x_1^3}{3}\right]_0^b = \frac{b}{3}

Thus, the average distance between x_1 and x_2 is b/3. This means that the average seek distance on a disk is 1/3 of all tracks. The average distance between two terminals on a local area network is 1/3 of the bus length. The average random seek distance for a tape is 1/3 of its length.
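As a plausibility check on this result, a short Monte Carlo experiment (mine, not the book's) can estimate the same mean distance by sampling pairs of random positions:

```python
import random

b = 100.0
n = 1_000_000
# Average distance between two uniformly random points on a medium of length b.
total = sum(abs(random.uniform(0, b) - random.uniform(0, b)) for _ in range(n))
print(total / n)   # converges to b/3 = 33.33...
```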
Pelflllltations and Combinations It is sometimes useful to be able to calculate the DUIIIber of ways in wbich we can select n objects from a group of m objects. Sometimes the order in which we select them is important, and sometimes it is not. If order is important, we are talking about permuIIItions. Let there be m distinct objects in a group, and we desUe to choose n of them. How many c:ti1fereDt ways can we choose these n objects? On the first choice, we will choose one of m objects. Given n choices, the total number of di:ffeEent ways we can choose n objects is m(m - I)(m - 2) ... (m - n
+ I).
Tbis can be written as
plll= II
m! (m-n)!
(4-45a)
wbme P': is the number of pamutations of m objects taken n at a time. However, if order is DOt important, we haveCOUDted too maay possibilities in the above analysis. We·have C01IDted all of the permutations for each set of choices but are only intaested in COUD1iDg that particular combinatiODof choices once. For iDstance, if one set of choices was ABC, we have counted it as
ABC ACB
BAC BCA
CAB CBA
or six times, whereas we are ODly iDtelestecl in COUIltiDg it once. We are iDterested in the of combiMtions of objects, DOt in all of their pemmtations.
~
The first item chosen could have occurred during any of the n choices. Given that, the second item could have occurred during any one of the remaining (n - 1) choices, and so on. The same combination has been counted n(n - 1)(n - 2) \cdots (1) times, for a total of n! times. Thus, the total number of combinations is the number of permutations divided by n!:

C_n^m = \frac{m!}{n!(m - n)!}    (4-45b)

where C_n^m is the number of combinations of m objects taken n at a time.
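For reference, a small sketch (not from the book) evaluates equations 4-45a and 4-45b with the Python standard library, whose math.perm and math.comb implement exactly these formulas:

```python
from math import comb, factorial, perm

m, n = 10, 3
print(perm(m, n))                                 # P = m!/(m-n)! = 720
print(comb(m, n))                                 # C = m!/(n!(m-n)!) = 120
print(perm(m, n) == comb(m, n) * factorial(n))    # True: P = C * n!
```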
Series

In many of the cases with which we will work, we will find ourselves with a summation over an infinite (or at least a very large) number of items. Often, these infinite series can be reduced to a very manageable expression. Some of the more useful ones are summarized here.
a. 1 + x + x^2 + x^3 + \cdots, \quad 0 \le x < 1

This can be written in the form

\sum_{i=0}^{\infty} x^i = \frac{1}{1 - x}    (4-46)

The similar series, which is truncated on the left, follows directly and is

x^n + x^{n+1} + x^{n+2} + \cdots, \quad 0 \le x < 1

This may be written as

\sum_{i=n}^{\infty} x^i = \frac{x^n}{1 - x}    (4-47)

Likewise, this series truncated on the right is

1 + x + x^2 + \cdots + x^{n-1}, \quad 0 \le x < 1

which may be written as

\sum_{i=0}^{n-1} x^i = \sum_{i=0}^{\infty} x^i - \sum_{i=n}^{\infty} x^i = (1 - x^n)\sum_{i=0}^{\infty} x^i = \frac{1 - x^n}{1 - x}    (4-48)

The doubly truncated series is

x^n + x^{n+1} + \cdots + x^{m-1}, \quad 0 \le x < 1

This may be written as

\sum_{i=n}^{m-1} x^i = x^n\sum_{i=0}^{m-n-1} x^i = \frac{x^n - x^m}{1 - x}    (4-49)
b. x + 2x^2 + 3x^3 + \cdots, \quad 0 \le x < 1

This can be expressed as

\sum_{i=1}^{\infty} i x^i = \frac{x}{(1 - x)^2}    (4-50)
c. The exponential series 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots

This can be expressed as

\sum_{i=0}^{\infty} \frac{x^i}{i!} = e^x    (4-51)

Conversely, e^{-x} may be expressed as

e^{-x} = \sum_{i=0}^{\infty} \frac{(-x)^i}{i!}    (4-52)
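Each of these closed forms can be spot-checked numerically. The following sketch (not from the book) compares long partial sums against the expressions above; the 2000-term bound is an arbitrary stand-in for infinity:

```python
from math import exp, factorial, isclose

x, n, m = 0.6, 4, 9
big = 2000   # proxy for infinity; x**2000 is negligible for x < 1

assert isclose(sum(x**i for i in range(big)), 1/(1 - x))                # 4-46
assert isclose(sum(x**i for i in range(n, big)), x**n/(1 - x))          # 4-47
assert isclose(sum(x**i for i in range(n)), (1 - x**n)/(1 - x))         # 4-48
assert isclose(sum(x**i for i in range(n, m)), (x**n - x**m)/(1 - x))   # 4-49
assert isclose(sum(i * x**i for i in range(big)), x/(1 - x)**2)         # 4-50
assert isclose(sum(x**i/factorial(i) for i in range(50)), exp(x))       # 4-51
```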
The Poisson Distribution

We now discuss the Poisson and exponential distributions in some detail, not because we will use them in our calculations so often (though simulation studies certainly do) but because they represent much of the statistics of queuing theory and form an important underpinning to our understanding of the tools we will bring to bear on the analysis of performance problems.

The Poisson distribution provides the probabilities that exactly n events may happen in a time interval, t, provided that the occurrences of these events are independent. That the independence of events is the only assumption made is the reason this distribution is so important. Event independence simply says that events occur completely randomly. They do not occur in batches. The occurrence of one event is not at all dependent on what has occurred in the past, nor has it any influence on what will occur in the future. The process has no memory; it is memoryless. We will call a process that creates such random events a random process.

In queuing theory, there are two important cases of a random process:

1. The arrival of an item at a queue is a random event and is independent of the arrival of any other item. Therefore, arrivals to a queue are random.

2. The instant at which the servicing of an item by a server completes is a random event. It is independent of the item being serviced and of any of its past service cycles. Therefore, service completions by a server are random.
Note that randomness has to do with events: the event of an arrival to a queue, the event of a service-time completion.

Let us determine the probability that exactly n random events will occur in time t. We will represent this probability by p_n(t):

p_n(t) = the probability that n random events will occur in time t

(Remember that n is a discrete random variable. Its values are the result of a random process. These two uses of random are unrelated. Random variables can also be the result of nonrandom processes.)

Note that p_n(t) is a probability function that depends on an additional parameter, t. As t becomes larger, the probability that n events will occur changes. This is unlike our simple probability functions described earlier. Such a process is called a stochastic process.

The average rate of the occurrence of events is a known parameter and is the only one we need to know. We will denote it by r:

r = average event occurrence rate (events per second)

Thus, on the average, rt events will occur in time t. Since events are completely random, we know that we can pick a time interval sufficiently small that the probability of two or more events occurring in that interval can be ignored. We will denote this arbitrarily small time interval as \Delta t and will assume that the only things that can happen in \Delta t are that no events will occur or that one event will occur.

Let us now observe a process for a time t. At the end of this observation time, we find that n events have occurred. We then observe it for \Delta t more time. The probability that one further event will occur in \Delta t is r\Delta t. The probability that no further events will occur in \Delta t is (1 - r\Delta t). Thus, the probability of observing n events in the time (t + \Delta t) is

p_n(t + \Delta t) = p_n(t)(1 - r\Delta t) + p_{n-1}(t)r\Delta t    (4-53)

This equation notes that n events may occur in the interval (t + \Delta t) in one of two ways. Either n events have occurred in the interval t and no events have occurred in the subsequent interval \Delta t, or n - 1 events have occurred in the interval t and one more event has occurred in the subsequent interval \Delta t. (Note that since the arrival of an event is independent of previous arrivals, all of these probabilities are independent and may be combined as shown, according to rules 3 and 4 in the earlier section entitled "Discrete Random Variables.")

If no events occurred in the interval t + \Delta t, this relationship is written

p_0(t + \Delta t) = p_0(t)(1 - r\Delta t)    (4-54)

since p_{n-1} does not exist. That is, the probability of no events occurring is the probability that no events occurred in the interval t and that no events occurred in the interval \Delta t.
Equations 4-53 and 4-54 can be rearranged as

\frac{p_n(t + \Delta t) - p_n(t)}{\Delta t} = -r p_n(t) + r p_{n-1}(t)    (4-55)

and

\frac{p_0(t + \Delta t) - p_0(t)}{\Delta t} = -r p_0(t)    (4-56)

As we let \Delta t become smaller and smaller, the left-hand side becomes the classical definition of the derivative of p_n(t) with respect to t, dp_n(t)/dt. Denote the time derivative of p_n(t) by p_n'(t):

p_n'(t) = \frac{dp_n(t)}{dt}

We can express equations 4-55 and 4-56 as

p_0'(t) = -r p_0(t)    (4-57)

and

p_n'(t) = -r p_n(t) + r p_{n-1}(t)    (4-58)

This is a set of differential-difference equations; their solution is shown in Appendix 4 to be

p_n(t) = \frac{(rt)^n e^{-rt}}{n!}    (4-59)
This is the Poisson distribution. It gives the probability that exactly n events will occur in a time interval t, given only that their arrivals are random with an average rate r. Though the serious student is encouraged to review the solution to these equations in Appendix 4, the main lesson to be learned is the simple underlying fact that the Poisson distribution depends only on the randomness of event occurrence. All this is summarized by saying that the distribution of the number of random events that will occur in a time interval t is given by the Poisson distribution. In queuing theory, the random events of concern are arrivals to queues and completions of service.

Let us look at some properties of the Poisson distribution. First, the sum of the probabilities over all values of n is

\sum_{n=0}^{\infty} \frac{(rt)^n e^{-rt}}{n!} = e^{-rt}\sum_{n=0}^{\infty} \frac{(rt)^n}{n!} = e^{-rt}e^{rt} = 1
as would be expected (the infinite series given by equation 4-51 is used).

We now derive the mean value of n for the distribution:

\bar{n} = \sum_{n=0}^{\infty} n\frac{(rt)^n e^{-rt}}{n!} = rt\sum_{n=1}^{\infty} \frac{(rt)^{n-1}e^{-rt}}{(n - 1)!}

and

\bar{n} = rt\,e^{-rt}\sum_{i=0}^{\infty} \frac{(rt)^i}{i!} = rt\,e^{-rt}e^{rt}

where i has been substituted for n - 1 in the summation. Thus, the mean number of events that will occur in a time interval t is rt, as we would expect:

\bar{n} = rt    (4-60)
The second moment of n for the Poisson distribution is derived in a similar manner:

\overline{n^2} = \sum_{n=0}^{\infty} n^2\frac{(rt)^n e^{-rt}}{n!} = rt\,e^{-rt}\sum_{n=1}^{\infty} n\frac{(rt)^{n-1}}{(n - 1)!}

Letting i = n - 1,

\overline{n^2} = rt\,e^{-rt}\sum_{i=0}^{\infty} (i + 1)\frac{(rt)^i}{i!}

\overline{n^2} = rt\,e^{-rt}\left[rt\sum_{i=1}^{\infty} \frac{(rt)^{i-1}}{(i - 1)!} + \sum_{i=0}^{\infty} \frac{(rt)^i}{i!}\right] = rt\,e^{-rt}\left[rt\,e^{rt} + e^{rt}\right]

and

\overline{n^2} = (rt)^2 + rt    (4-61)
From equation 4-31, the variance of n is

var(n) = \overline{n^2} - \bar{n}^2

Since the mean \bar{n} is rt, then

var(n) = rt    (4-62)

Thus, both the mean and the variance of n are rt for a Poisson distribution.

Note the memoryless feature of the Poisson distribution. The probability that any number of events will happen in the time interval t is a function only of the arrival rate, r, the number of events, n, and the time interval, t. It is completely independent of what happened in previous time intervals. Even if no event has occurred over the past several time intervals, there is no increased assurance that one will occur during the next time interval.
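A direct computation (not from the book) confirms these properties: the probabilities of equation 4-59 sum to one, and both the mean and variance come out to rt. The rate and interval below are arbitrary choices:

```python
from math import exp, factorial

r, t = 5.0, 2.0                      # arbitrary rate and interval; rt = 10
rt = r * t
p = [rt**n * exp(-rt) / factorial(n) for n in range(200)]

total = sum(p)
mean = sum(n * pn for n, pn in enumerate(p))
second = sum(n * n * pn for n, pn in enumerate(p))
print(total, mean, second - mean**2)   # ~1.0, ~10.0, ~10.0 (= rt)
```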
The Exponential Distribution

The exponential distribution is very much related to the Poisson distribution and can be derived from it, as will soon be shown. It deals with the probability distribution of the time between events. Note that the Poisson distribution deals with a discrete variable: the number of events occurring in a time interval t. The exponential distribution deals with a continuous variable: the time between event occurrences.

To derive the distribution of interevent times, we assume that events are arriving randomly at a rate of r events per second. Let us consider the probability that, given that an event has just occurred, one or more events will occur in the following time interval, t. This is the probability that the time between events is less than t. If T is the time to the next event, we can denote this probability as P(T < t) and can express it as

P(T < t) = \sum_{n=1}^{\infty} \frac{(rt)^n e^{-rt}}{n!}    (4-63)
That is, the probability that the next event will occur in a time interval less than t is the probability that one event will occur in time t, plus the probability that two events will occur in time t, and so on. Manipulating equation 4-63, we have

P(T < t) = e^{-rt}\left[\sum_{n=0}^{\infty} \frac{(rt)^n}{n!} - 1\right]

and

P(T < t) = 1 - e^{-rt}    (4-64)
This is a cumulative distribution for the interevent time t. Its density function, p(t), is the derivative of its cumulative distribution function. That is, from equation 4-44a,

P(T < t) = \int_0^t p(t)dt

Differentiating both sides with respect to t gives

p(t) = C\frac{d}{dt}P(T < t) = C\frac{d}{dt}\left(1 - e^{-rt}\right) = Cr\,e^{-rt}

where C must be chosen such that the integral of the density function is unity (see equation 4-39). Since

\int_0^{\infty} Cr\,e^{-rt}dt = Cr\cdot\frac{1}{r} = 1

then C = 1. Thus, the probability density function for the interevent time, t, is

p(t) = r e^{-rt}    (4-65)
We can also express the alternate cumulative distribution, giving the probability that T is greater than t. From equation 4-44b, we have

P(T > t) = \int_t^{\infty} r e^{-rT}dT = \left[-e^{-rT}\right]_t^{\infty}

and

P(T > t) = e^{-rt}    (4-66)

as would be expected from equation 4-64, since P(T < t) + P(T > t) = 1. (Since t is a continuous variable, P(T = t) is zero and can be ignored.) The mean, variance, and second moment of the exponential distribution can be shown to be

\bar{t} = 1/r    (4-67)

var(t) = 1/r^2    (4-68)

and

\overline{t^2} = 2/r^2    (4-69)
Note once again the memoryless feature of the exponential distribution. No matter when we start waiting for an event (even if one has not occurred for awhile), the expected time to the next event is still 1/r. Also note that t has been redefined here relative to the way it is used in the Poisson distribution. In the Poisson distribution, t is a fixed interval over which the probability of occurrence of n events is expressed. In the exponential distribution, t is the random variable expressing the time between events.
To summarize the above, we make three statements about a random process with an average event rate of r events per second:

1. A random process is one in which events are generated randomly and independently. The probability that an event will occur in an arbitrarily small time interval \Delta t is r\Delta t, independent of the event history of the process.
2. The probability p_n(t) that n events will occur in a time interval t is given by the Poisson distribution:

p_n(t) = \frac{(rt)^n e^{-rt}}{n!}

with

\bar{n} = rt

var(n) = rt

and

\overline{n^2} = (rt)^2 + rt

3. The probability density function p(t) for the interevent time t is the exponential function

p(t) = r e^{-rt}

with

\bar{t} = 1/r

var(t) = 1/r^2

and

\overline{t^2} = 2/r^2

Thus, random, Poisson, and exponential distributions all imply the same thing: a random process. This is a process in which events occur randomly, the distribution of their occurrences in a given time interval is Poisson-distributed, and the distribution of times between events is exponentially distributed.

In queuing theory, there are two random processes with which we frequently deal. One is arrivals to a queue. An arrival of an item to a queue is a random event. Arrivals are said to be Poisson-distributed, and the interarrival time is exponentially distributed. The statements random arrivals and Poisson arrivals are equivalent.

The other process is the service time of a server. Assuming the server is busy servicing items in its queue, the completion of a service is a random event. The distribution of service completions is Poisson-distributed (though we don't normally express this), and service times (which are the times between completion events in this case) are exponentially distributed. The statements random service times and exponential service times are equivalent.

Due to the memoryless nature of random processes, if we begin the observation of a random server with average service time t_s as it is in the middle of processing an item, the average time required to complete this service is still t_s, no matter how long the item had been in service prior to our observation. This property was used as an argument concerning the evaluation of the service-time distribution coefficient for exponential service times in the derivation of equations 4-10 through 4-12.
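The equivalence of the two views can be demonstrated by simulation. The following sketch (mine, not the book's) draws exponentially distributed interevent times and counts events per fixed interval; the counts come out Poisson-distributed, with mean and variance both approaching rt:

```python
import random

r, t, trials = 4.0, 1.0, 100_000
counts = []
for _ in range(trials):
    elapsed, n = random.expovariate(r), 0
    while elapsed < t:                     # count events landing in [0, t)
        n += 1
        elapsed += random.expovariate(r)   # memoryless interevent times
    counts.append(n)

mean = sum(counts) / trials
var = sum((c - mean)**2 for c in counts) / trials
print(mean, var)   # both approach rt = 4.0
```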
CHARACTERIZING QUEUING SYSTEMS

Kendall [13] has defined a classification scheme for queuing systems that lends order to the various characteristics these can have. A queuing system is categorized as

A/B/c/K/m/Z

where

A = the arrival distribution of items into the queue,

B = the service time distribution of the servers,

c = the number of servers,

K = the maximum queue length,

m = the size of the population which may enter the queue, and

Z = the type of queue discipline (order of service of the items in the queue).

Several representations for the arrival and service time distributions (A and B) have been suggested, but for our purpose we will deal with four. A or B may be

M for a random (memoryless) distribution,

D for a constant distribution (such as a fixed service time),

G for a general distribution, and

U for a uniform distribution (this, admittedly, is added to the list by this author).

Thus, M/D/3/10/40/FIFO represents a queuing system in which arrivals are random, service time is constant, and there are 3 servers serving a queue which can be no longer than 10 items, serving a population of 40 on a first-come, first-served basis.

If the maximum queue length is unlimited (K = \infty), if the population is infinite (m = \infty), and if the queue discipline is FIFO, then the last three terms are dropped. Then, for instance, an M/M/1 system is a system in which random arrivals are served by a single server with random service times. This is the simplest of all queuing systems. An M/G/1 system is one in which random arrivals are serviced by a single server with general service times. This is the case solved by Khintchine and Pollaczek.

INFINITE POPULATIONS

One of the parameters in Kendall's classification scheme is the size of the population m using the queue. This is a particularly important parameter for the following reason. If the size of the population is infinite, then the rate of arrival of users to the queue is independent of queue length and therefore of the load on the system. That is to say, no matter how long the queue, there is still an infinite population of users from which the next arrival to the queue will come.
However, if the user population is finite, then those waiting in the queue are no longer candidates for entering the queue. As the queue grows, the available population dwindles, and the arrival rate falls off. As the load on the system grows, the imposed load decreases. Thus, the load on the system is an inverse function of itself (this is sometimes referred to as the graceful degradation of a system). The analysis of queues formed from infinite populations is quite different from that of queues formed from finite populations. We will first consider infinite populations, about which a great deal can be said.

Some Properties of Infinite Populations

Regarding infinite populations, there are some general properties that can be useful. These include the following:
1. Queue input from several sources. If several random sources each feed a common queue, each with different average arrival rates r_i, then the total arrival distribution to the queue is a Poisson distribution with an arrival rate r equal to the sum of the component arrival rates r_i (Martin [20], 393). (See Figure 4-4a.)

2. Output distribution of M/M/c queues. If one or more identical servers with exponential service times service a common queue with Poisson-type arrivals, then the outputs from that queue are Poisson-distributed, with the departure rate equal to the arrival rate, i.e., the departures have the same distribution as the arrivals (Saaty [24], 12-3). (See Figure 4-4b.)

3. Transaction stream is split. If a randomly distributed transaction stream is split into multiple paths, the transactions in each path are random streams with proportionate arrival rates (IBM [11], 49). (See Figure 4-4c.)

4. Tandem queues. From 2, a randomly distributed transaction stream passing through tandem compound queues will emerge as a randomly distributed stream with the same average rate as when it entered the system (IBM [11], 50). (See Figure 4-4d.)

5. Order of service impact on response time. The mean queue time and mean queue length as predicted by the Khintchine-Pollaczek equation are independent of the order in which the queue is serviced, so long as that order is not dependent upon the service time. This would not be true, for instance, if items requiring less service were serviced in advance of other items (Martin [19], 423). (See Figure 4-4e.)
Dispersion of Response Time

We have already discussed the need to be able to make a statement relative to the dispersion of the response time, something in the form "the probability that response time will be less than two seconds is 99.9%." We will discuss three approaches to this problem.
"~~:D" MULn-C!:IANNEL
SERVER
(b)
TANDEM QUEUING SYSTEMS (d)
--..JIi[]--.C!:J-.=--"~ QUEUE ~Pl.JNE
Gamma distribution. Without going into great detail, the Gamma function is the key to this statement. It is a more general form of a probability function of which the exponential distribution is a special case (see Martin [19], 437-439). It has the property that the sum of a set of variables follows a Gamma function if each of the variables follows a Gamma function.

In TP systems, a transaction usually passes through a series of servers, as we have seen. The response time of the system is the sum of the component delay times of each server. These are often servers with exponentially distributed service times (at least approximately). Though the sums of these delay times may not be exponential, they will be Gamma-distributed, and this distribution can be used to determine the probability that the system response time will be greater than a multiple of its mean.
To use this technique, we need to know the mean system response time and the variance of the system response time. The mean system response time is, of course, the primary focus of the performance model. The variance of this response time is more difficult and often impossible to calculate. However, a reasonable limiting assumption to make is that the response time is random, i.e., exponentially distributed. In this case, the variance is the square of the mean. Real systems will usually have a smaller variance than this, i.e., the response time will not be completely random.

The Gamma distribution is used for these purposes as follows. First calculate the Gamma distribution parameter, R, where

R = \frac{\bar{T}^2}{var(T)}    (4-70)
T is the response time, and \bar{T} is its mean. Then use the Gamma cumulative distribution function with parameter R to determine the probability that the response time will not exceed a multiple of the mean time. Note that R = 1 for a randomly distributed response time. Real-life response time variations will probably be less random and thus have a greater value of R. Certain values of these probabilities are listed in the following table for values of R from 1 to 10 (the range in which we would normally be interested).

TABLE 4-1. TABLE OF k = T/\bar{T}

Probability of response time        R = \bar{T}^2/var(T)
T not exceeding k\bar{T}        1      2      3      5      10
.95                            3.0    2.4    2.1    1.9    1.6
.99                            4.7    3.4    2.8    2.4    1.9
.999                           6.9    4.6    3.8    3.0    2.3
.9999                          8.9    5.7    4.5    3.5    2.6
To take the conservative case of R = 1, we can say that 95 percent of all services will finish in less than three mean service times, 99 percent in less than five mean service times, 99.9 percent in less than seven mean service times, and 99.99 percent in less than nine mean service times. This is often sufficient to conservatively validate the performance of a system. (These are the values that were used in the example in chapter 3.)
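For readers with access to SciPy, Table 4-1 can be reproduced from the Gamma cumulative distribution function. A Gamma variable with shape R and scale \bar{T}/R has mean \bar{T} and variance \bar{T}^2/R, so the tabulated k is simply a Gamma quantile. This sketch and its use of SciPy are mine, not the book's:

```python
from scipy.stats import gamma

# For each probability P and each R, find k such that P(T <= k*T_mean) = P,
# taking the mean as 1 so the quantile is already in multiples of the mean.
for prob in (.95, .99, .999, .9999):
    row = [gamma.ppf(prob, R, scale=1.0 / R) for R in (1, 2, 3, 5, 10)]
    print(prob, [round(k, 1) for k in row])
```

The computed values agree closely with Table 4-1 (for R = 1, the exponential case, they are 3.0, 4.6, 6.9, and 9.2; the book's entries appear to be rounded).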
Central Limit Theorem. According to the Central Limit Theorem, "the distribution of the sum of several random variables approaches the normal distribution for a wide class of variables as the number of random variables in the sum becomes large." The precise test of how closely a system will approach a normal distribution is quite complex. However, the theorem has been shown to hold well in typical queuing analysis problems.

To use this theorem, the first step is to calculate the mean and variance of the resulting response time by adding the delay times for each of the components:
\bar{T} = \bar{T}_1 + \bar{T}_2 + \cdots

and

var(T) = var(T_1) + var(T_2) + \cdots

Then, for a given probability, as given in Table 4-2, simply add the standard deviation (i.e., the square root of the variance), weighted by the factor p, to the mean to obtain the maximum value of response time below which actual response times will fall with the given probability. For instance, if mean response time is 4 seconds, and if its standard deviation is found to be 3 seconds, then with 99.9 percent probability the response time will be less than 4 + 3.09 \times 3 = 13.27 seconds.

TABLE 4-2. NORMAL DISTRIBUTION

Probability      p
.90            1.28
.95            1.65
.99            2.33
.999           3.09
.9999          3.71
For random distributions, in which the standard deviation is equal to the mean, Table 4-2 indicates that the maximum response time for a given probability is (p + 1) times the mean response time. Comparing Tables 4-1 and 4-2, the normal distribution technique equates approximately to R = 2 to 3 when random distributions are assumed.
Variance of response times. For a queuing system in which inputs are random with arrival rate R, and in which the distribution of the service time, T, is arbitrary (i.e., the Khintchine-Pollaczek M/G/1 case), the variance of the delay time, T_d, is given by*

var(T_d) = \frac{R\overline{T^3}}{3(1 - L)} + \frac{R^2\left(\overline{T^2}\right)^2}{4(1 - L)^2} + \overline{T^2} - \bar{T}^2    (4-71)

where

\bar{T} is the mean of the service time, T. (Note: elsewhere, this is noted simply as T.)

*This relation may be found in Martin [19], Martin [20], and IBM [11], each of which differs from the others and contains minor errors.
\overline{T^2} is its second moment.

\overline{T^3} is its third moment.

R is the arrival rate to the queue.

L is the load on (occupancy of) the server.

This is solved for the three following cases of interest:

1. Exponential service time. In this case,

\overline{T^2} = 2\bar{T}^2

and

\overline{T^3} = 6\bar{T}^3

Substituting these expressions into equation 4-71 yields the delay time variance for a server with exponentially distributed service time:

var(T_d) = \frac{\bar{T}^2}{(1 - L)^2}    (4-72)
Note that this is the square of the mean delay time, as is to be expected for an exponentially distributed delay (see equation 4-11).

2. Uniform service time. If the service time may fall with equal probability between two limits (disk seeking is close to this), then

\overline{T^2} = \frac{4}{3}\bar{T}^2

and

\overline{T^3} = 2\bar{T}^3

and

var(T_d) = \frac{(3 + L^2)\bar{T}^2}{9(1 - L)^2}    (4-73)

3. Constant service time. If the service time is constant (such as a polled communication line with a fixed-length message), then

\overline{T^2} = \bar{T}^2

\overline{T^3} = \bar{T}^3

and

var(T_d) = \frac{L(4 - L)\bar{T}^2}{12(1 - L)^2}    (4-74)
A reasonability check can be made on these variances by letting the load, L, approach zero. The delay time variance should then approach the variance of the service time.
The results of this exercise are

var(T_d) \to \bar{T}^2 for exponential service,

var(T_d) \to \bar{T}^2/3 for uniform service,

var(T_d) \to 0 for constant service.
All of these are as to be expected, using var(T) = \overline{T^2} - \bar{T}^2 in each case.

Using the fact that the variance of a sum of random variables is the sum of the variances (equation 4-33), one can calculate the variance of the delay times, i.e., the variance of the response times, of a tandem queue in which a transaction flows through a series of servers. For instance, assume a transaction is processed by a communication line with constant service time, then by an application process with exponential service time, then by a disk with uniform service time, and finally by a communication line with constant service time. This situation is reflected in the following table, with some sample values for service times and server loads. Service time variances are calculated according to the previous expressions.

TABLE 4-3. EXAMPLE TANDEM QUEUE

Step              Service time (T) distribution    Mean of T    Server load    Variance of T_d
Communications    Constant                         .2           .2             .004
Process           Exponential                      .3           .4             .250
Disk              Uniform                          .4           .6             .373
Communications    Constant                         .1           .2             .001
Total                                              1.0                         .628
We see that the tandem queue provides us with a mean response time of one second and a variance of .628, i.e., a standard deviation of .792 seconds. If we wished to use the Gamma distribution to determine the probability of a long response, we would calculate R as

R = \bar{T}^2/var(T) = 1^2/.628 = 1.6

Interpolating Table 4-1 for the 99.9 percentile, we find a value for k of 5.5. Multiplying the one-second mean response time by that number allows us to state that 99.9 percent of all transactions will be completed in less than 5.5 seconds.

As an alternative, we could use the Central Limit Theorem. At the 99.9% percentile, we see that we should move 3.09 standard deviations out from the mean. Thus, we can make the statement that 99.9 percent of all transactions will complete in less than (1 + 3.09 \times .792) = 3.4 seconds.

The Gamma distribution gave us a more conservative result (5.5 seconds) than the Central Limit Theorem. In general, the more conservative result should be used. This will be given by the Gamma function for large normalized standard deviations (i.e., the ratio of the standard deviation to the mean), and by the Central Limit Theorem for small normalized standard deviations (typically, less than .6).
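The arithmetic behind Table 4-3 can be packaged in a few lines. The following sketch (mine, not the book's; the function name is an arbitrary choice) evaluates the delay-time variance formulas 4-72 through 4-74 for each step of the tandem queue and sums them:

```python
def var_td(dist, T, L):
    """Delay-time variance for one M/G/1 server: mean service time T, load L."""
    if dist == "exponential":                          # equation 4-72
        return T**2 / (1 - L) ** 2
    if dist == "uniform":                              # equation 4-73
        return (3 + L**2) * T**2 / (9 * (1 - L) ** 2)
    if dist == "constant":                             # equation 4-74
        return L * (4 - L) * T**2 / (12 * (1 - L) ** 2)
    raise ValueError(dist)

steps = [("constant", .2, .2), ("exponential", .3, .4),
         ("uniform", .4, .6), ("constant", .1, .2)]
variance = sum(var_td(*s) for s in steps)    # variances of a sum add (eq. 4-33)
print(round(variance, 3))                    # 0.628, as in Table 4-3
```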
Properties of M/M/1 Queues

Queue lengths. Queues formed by random arrivals at a server with random service times (an M/M/1 system) are the easiest to analyze. For an M/M/1 system, the probability that a queue will be a particular length can be derived through what is called a birth-death process. We used the birth part of this to derive the Poisson distribution.

We consider an M/M/1 queuing system, i.e., a single server with random arrivals and random service times, in which the average arrival rate is r and the average service rate is s. The probability that the queue will have length n is p_n, where the queue includes all items waiting in line plus the item being serviced. If we consider a very short time interval, \Delta t, then the probability that an item will arrive at the queue is r\Delta t; this is a birth. Likewise, the probability that an item will leave the queue (assuming there is a queue) is s\Delta t; this is a death.

We observe the queue at some point in time and note with probability p_{n-1}, p_n, or p_{n+1} that there are n-1, n, or n+1 items, respectively, in the queue. If we come back at a time that is \Delta t later, we will find n items in the queue under the following conditions:
1. If there had been n items on the first observation, and if there had been no arrivals or departures in the subsequent interval \Delta t. Since the probability of no arrival is (1 - r\Delta t), and since the probability of no departure is (1 - s\Delta t), this will occur with probability p_n(1 - r\Delta t)(1 - s\Delta t).

2. If there had been n items on the first observation, and if there had been one arrival and one departure in the time interval \Delta t. This will occur with probability p_n(r\Delta t)(s\Delta t).

3. If there had been n-1 items on the first observation, and if there had been one arrival during the interval \Delta t, with no departures. This will occur with probability p_{n-1}r\Delta t(1 - s\Delta t).

4. If there had been n+1 items on the first observation, and if there had been one departure during the interval \Delta t, with no arrivals. This occurs with probability p_{n+1}s\Delta t(1 - r\Delta t).
Ignoring terms with \Delta t^2, since these will disappear as \Delta t goes to zero, we have

p_n = p_n(1 - s\Delta t - r\Delta t) + p_{n-1}r\Delta t + p_{n+1}s\Delta t    (4-75)

Accumulating p_n terms, this becomes

(s + r)p_n = r p_{n-1} + s p_{n+1}    (4-76)

The load on the system, L, is

L = r/s

Thus, equation 4-76 can be rewritten as

p_{n+1} = (1 + L)p_n - L p_{n-1}    (4-77)
For n = 0, there is no p_{n-1}, and there can be no departure if the initial value of n is zero. Thus, equation 4-75 can be manipulated for the case of n = 0 to give

p_1 = L p_0    (4-78)

Using equations 4-77 and 4-78 iteratively, we find

p_2 = L^2 p_0

p_3 = L^3 p_0

and

p_n = L^n p_0

Since L is the load on the server, it represents the probability that the server is occupied. Thus, the probability that the server is unoccupied is 1 - L. This is the probability that there are no items in the system (no queue):

p_0 = 1 - L    (4-79)

The probability of the queue length being n is

p_n = L^n(1 - L)    (4-80)
We can perform some checks on this result as follows. First, the sum of these probabilities should be unity:

\sum_{n=0}^{\infty} p_n = \sum_{n=0}^{\infty} L^n(1 - L) = (1 - L)\sum_{n=0}^{\infty} L^n = \frac{1 - L}{1 - L} = 1

Next, we can calculate the average queue length, Q:

Q = \sum_{n=1}^{\infty} nL^n(1 - L) = (1 - L)\sum_{n=0}^{\infty} nL^n

Using equation 4-50, this becomes

Q = (1 - L)\frac{L}{(1 - L)^2} = \frac{L}{1 - L}    (4-81)
This is just what Khintchine-Pollaczek predicted (see equation 4-10). The other results can be similarly verified. Finally, the variance of the queue length is given by

var(n) = Q + Q^2 = \frac{L}{(1 - L)^2}    (4-82)

(The derivation of this is complex; see Saaty [24], 40.)

The probability that a queue will exceed n items, P(Q > n), is

P(Q > n) = \sum_{i=n+1}^{\infty} L^i(1 - L) = (1 - L)\sum_{i=n+1}^{\infty} L^i = (1 - L)\frac{L^{n+1}}{1 - L}
from equation 4-47. Thus,

P(Q > n) = L^{n+1}    (4-83)

Summarizing what we have just deduced about the properties of M/M/1 queues, we have:

Probability of queue length being n:

p_n = L^n(1 - L)    (4-84)

Average queue length:

Q = \frac{L}{1 - L}    (4-85)

Variance of queue length:

var(Q) = \frac{L}{(1 - L)^2}    (4-86)

Probability of queue length exceeding n:

P(Q > n) = L^{n+1}    (4-87)

Also, from equations 4-20 through 4-22:

W = \frac{L^2}{1 - L}    (4-88)

T_q = \frac{L}{1 - L}T    (4-12)

and

T_d = \frac{Q}{L}T = \frac{1}{1 - L}T    (4-11)

As noted by the equation numbers, these expressions are the same as those
derived earlier from the Khintchine-Pollaczek equations for k = 1.

Properties of M/G/1 Queues

We have already derived the properties of queues formed by random arrivals to a server with a known, though not necessarily random, service time distribution. These are the Khintchine-Pollaczek equations, which are repeated here for convenience:

W = \frac{kL^2}{1 - L}    (4-4)

Q = \left[1 - (1 - k)L\right]\frac{L}{1 - L}    (4-6)

T_d = \frac{1}{1 - L}\left[1 - (1 - k)L\right]T    (4-8)

T_q = \frac{kL}{1 - L}T    (4-9)

k = \frac{1}{2}\frac{E(T^2)}{\bar{T}^2}    (4-16)
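These equations are simple enough to wrap in a helper for repeated use. The sketch below is mine, not the book's; it returns the four Khintchine-Pollaczek measures for a given load, mean service time, and distribution coefficient:

```python
def mg1(L, T, k=1.0):
    """Return (W, Q, Tq, Td) for an M/G/1 queue; k = 1 is the M/M/1 case."""
    W = k * L**2 / (1 - L)                    # equation 4-4
    Q = (1 - (1 - k) * L) * L / (1 - L)       # equation 4-6
    Tq = k * L * T / (1 - L)                  # equation 4-9
    Td = (1 - (1 - k) * L) * T / (1 - L)      # equation 4-8
    return W, Q, Tq, Td

print(mg1(0.9, 1.0, k=1.0))   # exponential service: Td = 10
print(mg1(0.9, 1.0, k=0.5))   # constant service:    Td = 5.5
```

With k = 1 it reproduces the M/M/1 results above; with k = .5 (constant service) it gives, for example, a delay of 5.5 mean service times at 90 percent load.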
Single-Channel Server with Priorities

In many of our systems, the server is serving a queue organized by priorities. When the server becomes free, it next services the item that has the highest priority and has waited the longest in its priority class.

There are two types of priority service disciplines of interest to us. One is nonpreemptive servicing, in which the service of an item is completed before the service of another item is started, even though a higher priority item may have arrived after servicing started. On the other hand, a preemptive service discipline requires that the servicing of an item be suspended if a higher priority item arrives. When the higher priority item has been serviced, servicing of the original item resumes where it left off.

The impact of priority service can be deduced intuitively by noting that, so far as an item being serviced is concerned, the capacity of a server is reduced by the time which it must spend servicing higher priority items. Let L_h be the load imposed on the server by items of higher priority than the one we are considering. The amount of time left to service an item at the considered priority is (1 - L_h) of the total time. The average amount of time required to service an item at priority p, T_p', is

T_p' = T_p + L_h T_p'

Here T_p is the time to service an item at priority p if there were no higher priority interference. This equation states that T_p of the server's time is actually spent servicing the item. However, during the total time T_p' that the item is being serviced, the server spends L_h of that time tending to higher priority duties. Thus, the effective time to service the item is

T_p' = \frac{T_p}{1 - L_h}

In effect, the service time at priority p has been lengthened by the factor 1/(1 - L_h). The world has slowed down at priority p.

We found earlier (equation 4-9) that the time an item must wait in a queue for a single-priority server carrying a load L is T_q = kLT/(1 - L). However, if the server must also process a higher priority load of L_h, we now know that the time that an item will wait in the queue at priority p, T_{qp}, is

T_{qp} = \frac{kLT}{(1 - L)(1 - L_h)}    (4-89)
Note that the term kLT is, in fact, the average amount of service time left for the item currently being serviced when a new item arrives at the queue. Let us call this term T_0. Then

T_{qp} = \frac{T_0}{(1 - L)(1 - L_h)}    (4-90)

where

T_{qp} = average queue wait time at priority p.

T_0 = average service time remaining for the item being serviced when a new item arrives at the queue.

L = load imposed on the server by items at priority p and higher.

L_h = load imposed on the server by items at priorities greater than p.

If we number our priorities from 1 to p_{max}, with the convention that items with higher priority numbers have precedence over items with lower priority numbers, the above definitions for L and L_h can be expressed as follows:

L = \sum_{i=p}^{p_{max}} L_i

L_h = \sum_{i=p+1}^{p_{max}} L_i
Nonpreemptive...".. For a ~ve server, the average service time mnaiDing for the item beiDg serviced when a new item enters the queue is 1he average of such times over an priorities:
That is, the mnaiDing service time at priority i is kiI'" and a prlorlty i item will be in 1he server L, of the time. LetL, be the total load on .the server and T, be the service time aver8ged over an priorities. Then the probability that an item beblg serviced is at priOrity i is LlL,. .Ifk, is iDdepe.Ddent of priority, i.e., the nalUIe ofthe service is the same xegm1less of priority, then we can assign k, = k and mwrite To as ,
II-.L, To = kL,L r T/ '-=141t
Chap. 4
Infinite Populations
125
or T()
= kL;I,
Thus, kL,T, qp - (l-L)(l-Ln)
1'. -
(4-91a)
where

L_t = total load imposed on the server by items at all priorities.

T_t = service time averaged over all priorities.

Once an item is given to the server, its service is not preempted. Therefore, from equation 4-22,

T_{dp} = T_{qp} + T_p    (4-91b)

where

T_{dp} = delay time through the server (queue wait time plus service time) at priority p.

T_p = service time at priority p.
Preemptive server. For preemptive service, the activity at lower priorities is transparent to an item, since the service of lower priority items is immediately suspended and not resumed so long as a higher priority item is in the system. Therefore, using an argument similar to that used for nonpreemptive servers,

T_0 = kLT

where T is the service time averaged over all priorities at priority p and higher. Thus,

T_{qp} = \frac{kLT}{(1 - L)(1 - L_h)}    (4-92a)

Once an item is given the server, its service at priority p can be interrupted by higher priority activity, so its service time is stretched to T_p/(1 - L_h). Therefore,

T_{dp} = T_{qp} + \frac{T_p}{1 - L_h}    (4-92b)
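Equations 4-91 and 4-92 translate directly into code. The following sketch (mine, not the book's) computes the queue wait at each priority for a single channel, assuming for simplicity a common mean service time T and coefficient k across all priority classes, so that T_t = T:

```python
def priority_waits(loads, T, k=1.0, preemptive=False):
    """Queue wait per priority class; loads[i] is the load at priority i,
    with the highest priority class listed last."""
    total = sum(loads)
    waits = []
    for p in range(len(loads)):
        Lh = sum(loads[p + 1:])            # load at higher priorities
        L = loads[p] + Lh                  # load at priority p and higher
        if preemptive:
            T0 = k * L * T                 # equation 4-92a numerator
        else:
            T0 = k * total * T             # equation 4-91a numerator
        waits.append(T0 / ((1 - L) * (1 - Lh)))   # equation 4-90
    return waits

print(priority_waits([0.3, 0.3], T=1.0))   # [~2.14, ~0.86]: low priority waits longer
```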
Multiple-Channel Server (M/M/c)

A multiple-channel queuing system comprises c channels serving a single queue into which items are arriving at a rate R. As soon as a server channel finishes processing an item, it starts servicing the next item at the head of the queue. Each server has an average service time, T. The distribution of the queue lengths, p_n, is as follows, where n is the total number of items in the system, including those being serviced (Saaty [24], 116):

p_n = p_0\frac{(cL)^n}{n!}, \quad 1 \le n \le c    (4-93)

and

p_n = p_0\frac{c^c L^n}{c!}, \quad n > c    (4-94)

p_0 is calculated from \sum_{n=0}^{\infty} p_n = 1:

p_0^{-1} = \sum_{n=0}^{c-1} \frac{(cL)^n}{n!} + \frac{(cL)^c}{c!(1 - L)}    (4-95)

In the above expressions, L is the average load on each server. The total system load is

cL = RT    (4-96)
equation 4-94:
W=
..
.
L (n-c)PII = 11-,,+1 L (n-c)p~)"e"lc! II-c+l
Oumging the summation index to % = n-c. we have W = Poe" ~%(L)Z+"
c!
z-1 .
=p.ceL'f ~%(L)Z C! z-l
From equation 4-50, W may be eJqRSSed as W- L(eL'f - c!(l-L'f Po
(4-97)
The average number of items in the system, Q, including those being serviced, is, from equation 4-20,

Q = W + cL    (4-98)

The average waiting time in the queue, T_q, is obtained from equation 4-21:

T_q = \frac{WT}{cL} = \frac{(cL)^c}{c(c!)(1 - L)^2}p_0 T    (4-99)
The average delay time through the system, i.e., queue time plus service time, is, from equation 4-22,

T_d = T_q + T    (4-100)

The above equations apply for exponential service times. However, Martin points out (Martin [20], 461) that simulation studies have shown that the waiting line size, W, and waiting time, T_q, vary in about the same way with the Khintchine-Pollaczek distribution coefficient k as do single-server queues. Thus, for general service time distributions,

W \approx \frac{kL(cL)^c}{c!(1 - L)^2}p_0    (4-101)

Q \approx W + cL    (4-102)

T_q \approx \frac{k(cL)^c}{c(c!)(1 - L)^2}p_0 T    (4-103)

and

T_d \approx T_q + T    (4-104)
Note that for c = 1, equations 4-95, 4-97, and 4-99 reduce to

p_0 = \frac{1}{1 + L/(1 - L)} = 1 - L

W = \frac{L^2}{1 - L}

and

T_q = \frac{LT}{1 - L}

which are the single-server Khintchine-Pollaczek equations 4-4 and 4-9 for k = 1. Q and T_d also reduce to equations 4-6 and 4-8 for k = 1.
Multiple-Channel Server with Priorities

Equation 4-90 is quite general and applies to multiple-channel servers as well as to single-channel servers (see Saaty [24], 234). Equation 4-90 states that T_q = T_0/(1 - L) for a single-priority server. Thus, from equation 4-99 for a multiple-channel server,

T_0 = \frac{(cL)^c}{c(c!)(1 - L)}p_0 T    (4-105)
Nonpreemptive server. For nonpreemptive service, T_0 is averaged over all priorities:

T_{qp} = \frac{(cL_t)^c}{c(c!)(1 - L_t)(1 - L)(1 - L_h)}p_0 T_t    (4-106a)

T_{dp} = T_{qp} + T_p    (4-106b)

p_0^{-1} = \sum_{n=0}^{c-1} \frac{(cL_t)^n}{n!} + \frac{(cL_t)^c}{c!(1 - L_t)}    (4-106c)
Preemptive server. For preemptive service, lower priority service is transparent, and T_0 is therefore averaged over all priorities from the considered priority and higher:

T_{qp} = \frac{(cL)^c}{c(c!)(1 - L)^2(1 - L_h)}p_0 T    (4-107a)

T_{dp} = T_{qp} + \frac{T_p}{1 - L_h}    (4-107b)

p_0^{-1} = \sum_{n=0}^{c-1} \frac{(cL)^n}{n!} + \frac{(cL)^c}{c!(1 - L)}    (4-107c)
FINITE POPULATIONS

As discussed previously, queues formed from finite populations have the characteristic of graceful degradation: as the load on the system increases, the arrival rate decreases because of a reduced active population. In general, the population should be considered finite unless it is much larger than the expected queue lengths. Common applications in TP systems include the following:

• Terminals on a multidropped communication line which contend for that line,

• Multiple servers (in the requestor-server sense) accessing a data base,

• Processors in a multiprocessor system contending for main memory.

The following is based on IBM [11], with some corrections and much enhancement. In general, we think of a user as doing some work before entering the queue. This time is called the availability time, T_a, and is assumed to be exponentially distributed. It is the time that a terminal is used prior to bidding for the line (often called "think time," since it represents the time that the user is thinking before entering the next transaction), or the time that a data-base manager spends processing a request before getting in line for the disk, or the time spent actively processing by a processor before requesting its next common memory access.

Once in line, the user must wait a time, T_q, prior to being serviced and then an average exponentially distributed service time of T. Thus, on the average, each user will cycle through the system every (T_a + T_q + T) seconds. The user's availability time, T_a, is independent of the system, and T is unaffected by system load. However, as the system
becomes loaded, the waiting time, T_q, increases, thus slowing down the arrival rates of the users. This is the graceful degradation effect.

Let us define a service ratio, z, as the ratio of availability time, T_a, to the average service time, T:

z = T_a/T    (4-108)

If the user is almost always available, i.e., not in the queue, the user's service ratio may be arbitrarily large. If the user is almost always in the queue, the service ratio may approach zero.

Since each user arrives at the queue every (T_a + T_q + T) seconds on the average, and since there are m users, the arrival rate, R, of users to the queue is

R = \frac{m}{T_a + T_q + T} = \frac{m/T}{z + T_q/T + 1}    (4-109)

and the load, L, on the system is

L = RT = \frac{m}{z + T_q/T + 1}    (4-110)
As an aside, equation 4-109 can be solved for T_d = T_q + T as

T_d = \frac{m}{R} - T_a

This is known as the Response Time Law (Lazowska [16]) and relates the system response time, T_d, to the individual interarrival time, m/R, and the think time, T_a. That is, the response time is the individual interarrival time minus the availability time, an intuitively obvious relationship.

From equation 4-21,

T_q = \frac{WT}{L} = \frac{WT}{m}(z + T_q/T + 1)    (4-111)

Solving equation 4-111 for T_q gives

T_q = \frac{W(z + 1)}{m - W}T    (4-112)
From equations 4-21 and 4-112, the system load, L, can be expressed as

L = \frac{m - W}{z + 1}    (4-113)

Solving for W,

W = m - (z + 1)L    (4-114)

From equations 4-20 and 4-114,

Q = W + L = m - zL    (4-115)

and from equation 4-22,

T_d = T_q + T    (4-116)
Equations 4-112 through 4-116 express the queuing relationships for a finite population and for any number of servers as a function of the service ratio, z. Note that these relationships apply to both single-server and multiple-server queuing systems by substituting cL for L, where c is the number of servers and L is the load on each server. The only assumption is that the availability time, T_a, and the service time, T, are exponentially distributed. However, L and W are functions of each other, and their solution depends on the number of servers. The evaluation of these terms is discussed in the next sections.

One other general relationship of interest is the probability that a user is busy, i.e., is in the queue or is being serviced. Since users arrive in the queue at a combined rate of R, each user will arrive once every m/R seconds, on the average, and will spend an average time of T_d seconds in the system. Thus, the probability that a single user is busy waiting for the server or being serviced is

P(busy) = \frac{RT_d}{m} = \frac{L}{mT}(T_q + T) = 1 - \frac{zL}{m}    (4-117)
using equations 4-112 and 4-114 for the simplification.

These relationships will be used to study the single-server and multiple-server cases.

Single-Server Queues (M/M/1/m/m)

For a single-server queue with random service times serving a finite population of size m with random availability times, the probability that n items will be in the system (including the one being serviced) can be shown to be the following (IBM [11]):

p_n = \frac{z^{m-n}/(m - n)!}{\sum_{j=0}^{m} z^j/j!}    (4-118)
The server utilization (or load), L, is the probability that the server is busy, i.e., that the queue length is nonzero:

L = \sum_{n=1}^{m} p_n = \sum_{n=1}^{m} \frac{z^{m-n}/(m - n)!}{\sum_{j=0}^{m} z^j/j!}    (4-119)

As a sanity check, the case for z = 0 represents full loading. The significant terms in equation 4-119 as z approaches zero occur for n = m and j = 0. In this case, L becomes 1, as would be expected. For very large values of z, the significant terms are those for n = 1 and j = m. In this case, L \to m/z. For an unloaded system (z infinite), L becomes zero, as expected.
Equations 4-114, 4-115, 4-112, and 4-116, respectively, may be used with this expression for L to calculate W, Q, T_q, and T_d.
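The following sketch (mine, not the book's) implements equation 4-119 and the finite-population relationships just listed for the single-server case:

```python
from math import factorial

def mm1_finite(m, z, T):
    """m users, service ratio z = Ta/T, mean service time T.
    Returns (L, W, Tq, Td) for an M/M/1/m/m system."""
    denom = sum(z**j / factorial(j) for j in range(m + 1))
    L = sum(z**(m - n) / factorial(m - n) for n in range(1, m + 1)) / denom
    W = m - (z + 1) * L                       # equation 4-114
    Tq = W * (z + 1) * T / (m - W)            # equation 4-112
    return L, W, Tq, Tq + T                   # Td from equation 4-116

print(mm1_finite(m=10, z=9.0, T=1.0))  # 10 users thinking 9x their service time
```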
Multiple-Server Queues (M/M/c/m/m)

The finite-population system has been solved for the finite-population, multiple-channel case. This is probably the most general solution of practical usefulness that exists as of the time of this writing. As with the single-channel case, it is assumed that both the service time, T, and the availability time, T_a, are exponentially distributed.

A typical (and very important) example of this type of queuing system in TP systems is the case in which multiple application programs access a data base via multiple copies of a data-base manager. In this case, the application programs are the users, and the data-base managers are the servers.

Again, the service ratio, z, is defined as

z = T_a/T

Let c be the number of servers and m the number of users. Then the probability, p_n, of there being n users in the system can be shown to be the following (IBM [11], 45-46):

p_n = \binom{m}{n}z^{-n}p_0, \quad 1 \le n \le c    (4-120)

p_n = \binom{m}{n}\frac{n!}{c!\,c^{n-c}}z^{-n}p_0, \quad c \le n \le m    (4-121)

where \binom{m}{n} is the binomial coefficient:

\binom{m}{n} = \frac{m!}{n!(m - n)!}    (4-122)

p_0 is such that it satisfies

p_0 = 1 - \sum_{n=1}^{m} p_n    (4-123)

W is the average number of users in the system exceeding the number of servers, c. Thus,

W = \sum_{n=c+1}^{m} (n - c)p_n    (4-124)

Knowing W as a function of z (the only variable in p_n) allows us to evaluate Q, T_q, and T_d from equation 4-113 (which gives the load L) and from equations 4-115, 4-112, and 4-116, respectively. A different and somewhat more complex solution to this problem is given by Saaty [24].
Computational Considerations for Finite Populations

The expressions for finite populations do not generally lend themselves to manual calculation, because in many cases they must be solved iteratively. This can be seen by the following reasoning. The length of the waiting line, W, is a function of the service ratio, z. z is a function of the availability time, T_a. In many of our analyses, the average arrival rate at the queue is fixed at R, where R/m is the transaction rate per user. The availability time is then T_a = m/R - T_q - T (see equation 4-109). Thus, T_a is a function of T_q, which is a function of W. Thus, W is a complex function of itself.

Consequently, these expressions are best evaluated iteratively by computer. Typically, a choice for z will be made, and W will be calculated. Using W, T_q can be calculated from equation 4-112 and then T_a from equation 4-109. z can now be calculated, and if this value does not equal the starting value for z, a new value for z is chosen. This process continues until it converges on a common value for z. Of course, if the model being evaluated has a predetermined average availability time, T_a, then these expressions can be calculated manually.
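The iteration described above is only a few lines of code. This sketch is mine, not the book's (the function name, starting guess, and damping factor are arbitrary choices); it searches for the self-consistent service ratio z given a fixed total transaction rate R:

```python
def solve_z(m, R, T, queue_wait, tol=1e-6, max_iter=1000):
    """Find the self-consistent service ratio z for m users arriving at a
    fixed total rate R with mean service time T. queue_wait(m, z, T) must
    return Tq for the chosen finite-population queue model."""
    z = m / (R * T) - 1.0                  # starting guess: assume Tq = 0
    for _ in range(max_iter):
        Tq = queue_wait(m, z, T)           # e.g., from equation 4-112
        Ta = m / R - Tq - T                # equation 4-109 rearranged
        z_new = Ta / T
        if abs(z_new - z) < tol:
            return z_new, Tq
        z = 0.5 * (z + z_new)              # damped update to aid convergence
    raise RuntimeError("did not converge; is the requested rate feasible?")
```

The queue_wait argument can be the mm1_finite helper sketched earlier (passing along its T_q component) or any other model that returns a queue wait for a given m, z, and T.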
COMPARISON OF QUEUE TYPES

As we have seen, the delay time of a queuing system is sensitive to several parameters, the most notable of which are the following:

• Load on the server(s).

• Distribution of service times.

• Number of servers.

• Size of the population serviced by the system.
We generally assume that arrivals are Poisson-distributed. It is useful to obtain a graphic feel for the effect of these parameters.

We first consider the distribution of service times. Figure 4-5 shows the normalized response time, T_d/T, of single-server queues with a variety of service-time distributions:

• Exponential (M/M/1).

• Uniform (M/U/1).

• Constant (M/D/1).

[Figure 4-5 Normalized response time versus server load for M/M/1, M/U/1, and M/D/1 queues.]

As was previously discussed, a server with constant service time performs better than one with a uniformly distributed service time, and even more so relative to a server with an exponentially distributed service time. Though the server performance curves of Figure 4-5 appear to be close, this can be misleading. It is true that differences at low loads may not be significant. However, consider heavily loaded servers, as shown in the following table.
TABLE 4-4. NORMALIZED DELAY TIME (T_d/T)

Service time        Server load
distribution      .7      .8      .9
Exponential       3.3     5.0     10.0
Uniform           2.6     3.7     7.0
Constant          2.2     3.0     5.5
At 90 percent server load, the response time, T_d, for a server with exponential service time is nearly twice that of a server with constant service time.

Another practical consideration for a queuing system can also be noted with reference to Figure 4-5. Consider an M/M/1 system that is 50 percent loaded. Its normalized delay time, T_d/T, is 1/(1 - L) = 2. If the load imposed on this server increases by 10 percent to .55, then its normalized delay time becomes 2.22, an 11 percent increase. At an 80 percent load, the normalized delay time is 5. A 10 percent load increase to 88 percent causes the delay time to increase to 8.33, a 67 percent increase. At 90 percent load, a 10 percent load increase to 99 percent causes the normalized delay time to increase tenfold, from 10 to 100!

This effect is called amplification. As the load on the system is increased, small changes in load cause ever greater amplification of delay-time changes, and the response times seen by a user fluctuate over an ever wider range.
Note the following characteristics:

• Having n servers serving a common queue of users (M/M/3 in Figure 4-6) is more efficient than having n servers each serving 1/n of the users (each being an M/M/1 server in Figure 4-6). We would rather wait in a common line for several bank tellers than have to pick a teller and then wait in a line dedicated to that teller.
• Response time improves for a queuing system as the population served by that system becomes smaller. This is because queues cannot grow as large or as quickly.
• The response-time characteristic of finite populations approximates that for infinite populations when the population is much greater than the average queue length would be for infinite populations, i.e., for small loads. In Figure 4-6, a queue length of 1 occurs in the M/M/1 system for a load, L, of 0.5. At this point, the delay time for the finite population case M/M/1/10/10 is within 10 percent of the infinite population case M/M/1. Though not shown, a queue length of 1
[Figure 4-6: Effect of multiple servers and finite populations. Normalized response time, T_d/T, versus server load, L, for M/M/1, M/M/3, M/M/1/10/10, and M/M/3/10/10, with the queue length for M/M/1 indicated.]
occurs for the M/M/3 system at a load, L, of .67. Again, the response times for the M/M/3 and M/M/3/10/10 systems are within 10 percent at this point. In both cases, so long as the finite population is an order of magnitude greater than the queue length (10 >> 1), the infinite population solution is a reasonably accurate approximation.
An interesting rule of thumb follows from this observation and from the 60 to 70 percent load rule suggested earlier. At 2/3 load (66.7 percent), the average queue length is L/(1 - L) = 2. Since in most cases we do not want to exceed this load, we will not normally expect average queue lengths to be greater than 2. Therefore, a population of 20 will generally suffice to qualify as an infinite population. For population sizes less than 20, or for those cases where loads will exceed 67 percent, it may be advisable to consider the system as one with a finite population.
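Both the amplification effect and this rule of thumb can be checked with a few lines of Python, using only the M/M/1 relations quoted above, T_d/T = 1/(1 - L) for the normalized delay and L/(1 - L) for the average queue length:

    def normalized_delay(L):
        # M/M/1 normalized delay time, Td/T = 1/(1 - L)
        return 1.0 / (1.0 - L)

    def queue_length(L):
        # M/M/1 average queue length, L/(1 - L)
        return L / (1.0 - L)

    for L in (0.50, 0.55, 0.80, 0.88, 0.90, 0.99):
        print(f"L = {L:.2f}  Td/T = {normalized_delay(L):6.2f}  "
              f"queue = {queue_length(L):5.2f}")

    # From .50 to .55 (a 10 percent load increase), Td/T rises from 2.00 to
    # 2.22 (11 percent); from .80 to .88 it rises from 5.00 to 8.33 (67
    # percent). At L = 2/3 the queue length is 2, so a population of
    # 10 x 2 = 20 or more behaves essentially as an infinite population.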
Figure 4-6 fails to answer one other important question. Given the need for a specific capacity, is it better to use a single high-capacity server or several lower-capacity servers operating from the same queue? Let us say that we decide that a resource must have an ultimate capacity to service 10 items per second. Figure 4-7 shows two solutions to this need:

• A more powerful single server with a service time of 0.1 second (M/M/1).
• Three less powerful servers operating in parallel serving the queue of items, each of these having a service time of 0.3 seconds (M/M/3).
In either case, a maximum of 10 items per second can be serviced. As shown by Figure 4-7, the single-server system is decidedly better at all loads, since its service time is much smaller, resulting in shorter delay times. If more servers were used in the multiserver case, then each would only get slower and aggravate the problem.
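This comparison can be reproduced from the M/M/c results of this chapter. The sketch below uses the Erlang-C form of the multichannel waiting probability to compute the delay time for the two configurations of Figure 4-7; the arrival rate of 8 per second (an 80 percent load) is simply an illustrative choice:

    from math import factorial

    def erlang_c(c, a):
        # Probability that an arrival must wait in an M/M/c queue,
        # where a = offered load (arrival rate times service time).
        rho = a / c
        last = a**c / (factorial(c) * (1.0 - rho))
        return last / (sum(a**k / factorial(k) for k in range(c)) + last)

    def mmc_delay(c, T, lam):
        # Mean time in system (queue plus service) for an M/M/c server.
        a = lam * T
        wait = erlang_c(c, a) * T / (c * (1.0 - a / c))
        return wait + T

    lam = 8.0                          # transactions/sec (80 percent load)
    print(mmc_delay(1, 0.1, lam))      # one fast server: 0.50 sec
    print(mmc_delay(3, 0.3, lam))      # three slower servers: about 0.62 sec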
Thus, from Figures 4-6 and 4-7, we can make the following observations relative to the applicability of single-server and multiserver systems:

• If the choice to be made is the organization of n like servers, then it is better to feed them from a common queue rather than from individual queues (Figure 4-6).
• If the choice to be made is between one high-speed server and many lower-speed servers, choose the high-speed server (Figure 4-7).

Typical examples of these situations follow.
• A replicated set of server processes should be driven from a common queue.
• A multiprocessor system with n processors sharing a common memory and serving a common task queue will outperform a multicomputer system with n computers if all processors have the same power.
• A single high-speed computer will outperform a multiprocessor system or a multicomputer system with the same cumulative capacity.
• A single high-speed disk unit will outperform multiple lower-speed disk units with the same combined capacity.
SUMMARY

The queuing models described in this chapter are summarized in Appendix 2, with notational symbology being summarized in Appendix 1. The queuing expressions are grouped according to their Kendall classifications, where the author has defined certain classification types for the terms A/B/c/k/m/z to meet the needs of the material presented.
[Figure 4-7: Delay time, T_d (sec.), versus server load for a single fast server (M/M/1, T = 0.1 sec.) and three slower servers of the same total capacity (M/M/3, T = 0.3 sec.).]
For arrival and service time distributions (A/B), the following classes have been presented:

M --exponential (Markovian)
U --uniform
D --constant
G --general

For the number of servers (c), we have:

1 --single server
c --a finite number of servers
For the maximum queue length, k, we have always assumed an infinite queue (k = ∞). For finite queues, most models provide the probability distribution of queue lengths so that the probability of queue overflow can be considered. For m, the number of potential users in the system, we have:

m --a finite population of users
∞ --an infinite population of users

Finally, for the type of queue discipline, we have:

A --any
FIFO --first-in, first-out
PP --preemptive priority
NP --nonpreemptive priority
If any of the last three classification terms /k/m/z are left off a queue system classification, it implies an infinite queue, an infinite population, and first-in, first-out service (/k/m/z = /∞/∞/FIFO). Also note that if a finite population of size m is considered, the maximum queue length is also m. The queue systems we have studied include:

M/G/1/∞/∞/A (the Khintchine-Pollaczek case)
M/M/1/∞/∞/A
M/U/1/∞/∞/A
M/D/1/∞/∞/A (these three are derivatives of M/G/1)
M/G/1 (for delay time variance)
M/M/1 (for queue distributions)
M/G/1/∞/∞/PP (preemptive priorities)
M/G/1/∞/∞/NP (nonpreemptive priorities)
M/M/c (multichannel server)
M/G/c (an approximation)
M/M/c/∞/∞/PP (multiserver with preemptive priorities)
M/M/c/∞/∞/NP (multiserver with nonpreemptive priorities)
M/M/1/m/m (single-channel limited population)
M/M/c/m/m (multichannel limited population)

Though many other cases have been studied in the literature, these are the ones that have useful solutions and which are of most interest to us. The remainder of the book deals with the application of these concepts to the performance analysis of transaction-processing systems. The intent is to show how to use these tools to create solutions to real-life problems. Though some application results are general, their intent is not to form a cookbook. Rather, the goal is to be able to look at a new and unique problem and determine an adequate, if approximate, solution.
5 Communications
The processing of a transaction begins with some pertinent event outside of the transaction-processing system. This event could be a customer making a request to a teller or ticket salesperson, a status change in a power network, or an alarm generated by a patient-monitoring unit in a hospital. The first thing that must be done is to send the data describing this event to the TP system. This is the role of communications.

The study of communication facilities fills many volumes. The communication industry is a multibillion-dollar industry with nearly a century of history. It is highly regulated and well-understood technically. It is a subject of intense standardization by organizations such as the American National Standards Institute (ANSI) and the International Telegraph and Telephone Consultative Committee (CCITT). (Compare this state of affairs to that of the large but still fledgling computer industry, where we continue to dabble in seeking to understand what we are doing.) Therefore, we can only scratch the surface of this massive body of knowledge. And we will do so only to the extent that we can understand and account for the performance issues involved in communicating a transaction to the TP system and in returning a reply.

The first half of this chapter consists of sections that provide the communication background for the performance sections that follow. For the communication novice, these initial sections range from the simple to the sublime, covering characteristics of communication channels, methods of data transmission, protocol concepts, and modern open systems via layered protocols. Those well-versed in communications may want to simply skim these sections for the terminology used later.
The later sections use examples to develop performance-analysis techniques for message transfer and establishment/termination procedures. These include half-duplex and full-duplex message transfer, point-to-point and multipoint (LAN) contention networks, and multipoint polled networks.
PERFORMANCE IMPACT OF COMMUNICATIONS
In a TP system, a communication line is a server. It is a resource of finite capacity that must pass data between the user and the TP host. The average time that it takes to pass this data is its service time.
Communication lines are often shared by many user terminals. Therefore, what may form are queues of user transactions awaiting access to the line. The role of the communication facility in TP system performance is shown generally in Figure 5-1. The data describing the transaction arrives at the facility but must wait for access to it (1). The transaction data is then transmitted over the communication line (2) and enters a queue (3) of work waiting to be processed by the host (4). When the host has generated a reply (which is a performance study in itself), it enters that reply into a queue (5) to await access to the outgoing line. Finally, the reply is returned to the user (6) via the communication facility.

So far, simple. However, the analysis of waiting times and service times has many complexities not evident in this simple description. Communication queues, for example, are often not first-in, first-out queues. Rather, access is granted to the line in an orderly fashion by polling terminals in a round-robin fashion or in a disorderly fashion by letting a terminal grab the line and see if it is successful in transmitting without colliding with another terminal's transmission.
[Figure 5-1 Communication line performance: the transaction waits in a line queue (1), crosses the line (2), and queues for the host (3), which processes it (4); the reply queues for the outgoing line (5) and crosses the line back to the user (6).]
Line service times are complicated by the fact that considerable overhead may be required to pass one block of data. This overhead is created by the protocol procedures necessary for ensuring proper identification of communication traffic and its protection against errors. Furthermore, line service time can be a function of the line error rate, as blocks in error may have to be retransmitted.

In the following sections we will discuss the various communication techniques, message protocols, and network concepts which make up a TP communication facility and which can have an impact on TP performance.
COMMUNICATION CHANNELS

The first level in the communication hierarchy is the physical communication channel itself, which can take many forms. Some forms are obtainable from public networks, and others may be privately furnished.
Dedicated Lines

The simplest of all communication channels is the dedicated line (Figure 5-2a). This channel is permanently established and may support a single terminal or multiple terminals. If the line supports a single terminal (a point-to-point connection), then communication may be under control of the host, i.e., the host is the "master," and the terminal the "slave." As an alternative, the terminal and host may both act as master and contend for the line.

If many terminals are multidropped on the line (a terminal connection is referred to as a drop), then they are usually controlled by a polling protocol in which the host master queries each terminal for incoming data on a round-robin basis or according to some other poll schedule. The host also directs outgoing traffic to a specific terminal or group of terminals.

Dedicated lines are highly efficient, as no time is spent in establishing the connection; the connection is permanent. However, dedicated lines are also quite costly and are generally justified only if they can be highly utilized.
Dialed Lines

For occasional use, dialed connections via a public network are often used (Figure 5-2b). When a user wishes to communicate with the TP host, he dials the host manually or via an automatic terminal dialing function and establishes a point-to-point connection for the duration of his session with the host.

Dialed connections can be quite economical for occasional use. However, connection-establishment time can be significant (many seconds); even worse, all host ports could be found busy. Only one terminal per dialed connection can be supported, and data rates are significantly less than those achievable on dedicated lines. Compounding the data-rate limitations are the higher error rates found on dialed lines (typically an order
[Figure 5-2 Communication channels: (a) dedicated line, point-to-point and multidropped; (b) dialed line; (c) virtual circuit; (d) satellite channel; (e) local area network, bus and ring.]
of magnitude greater than those found on dedicated lines). These error rates slow the effective data rate even further because of retransmission requirements.
Virtual Circuits

The dedicated and dialed connections described above have one characteristic in common: once the connection is established, equipment is dedicated to that conversation from source to destination until the circuit is broken. Therefore, valuable communication equipment lies idle during pauses in the data conversation, equipment that could in principle be used to support other conversations, thereby increasing the capacity of the network and reducing the cost to the user.

The emergence of public and private packet-switching networks has addressed this dilemma (Figure 5-2c). Using this technology, a user's data message is broken up into fixed-length packets. Each packet is routed independently through the network to its
destination. Thus, each physical circuit connecting switching nodes in the network actually carries traffic from several users on an as-needed basis. In fact, if a node gets very busy, many networks will route packets around that node using alternate routes. Thus, the packets comprising a message may, in fact, take different paths through the network to their common destination. The different transmission times imposed by these various paths, compounded by random delays caused by queuing and line errors, mean that there is no guarantee that the packets will arrive at their destination in the same order in which they were sent.

The proper disassembly of the message into packets at its source and the subsequent reassembly of the packets to reconstitute the original message at its destination is the function of a specialized piece of terminal equipment called a PAD (Packet Assembly and Disassembly). PADs may either be furnished by the customer and be located on the customer's premises or may be furnished by the network operator at the switch sites. In the latter case, customers communicate with the PAD over a standard dedicated or dialed communication line, as described above. The host PAD is often implemented within the host via software, thus eliminating the need for special PAD hardware.

The connection between customers through a packet-switched network is called a virtual circuit, since it is logically there (what I send, you receive) but not physically there. That is, one cannot point to specific equipment and say that equipment is devoted to a particular connection. Just as in standard telephone technology, virtual circuits can be dedicated (permanent virtual circuits) or "dialed" (switched virtual circuits). The dedication or sharing of these circuits relates to the use of logical resources in the switches that act to define the circuits, rather than relating to specific communication lines.

The use of packet-switched virtual circuits brings with it significant economies. The one disadvantage is a somewhat longer propagation time through the network (the "line time" of Figure 5-1).
Satellite Channels

Another medium for transmission of data is the satellite channel (Figure 5-2d). Functionally, a satellite channel is much like a dialed or dedicated line. Channels can be dynamically allocated to users as they need them (like a dialed connection) or can be dedicated to a pair of users.

Satellite channels are inherently one-way. In a TP application, two channels are needed, one to send the transaction and one to receive the reply. Satellite channels can have very high bandwidths and consequently can support large data rates. An interesting possibility is that the signals relayed by a satellite can be received by any receiver in the "footprint" of the satellite, thus opening the way to a variety of broadcast opportunities, such as distributing summary data of interest to all users of a TP system.

There is, however, an important performance issue that relates to satellite channels. That issue is its "line time," or propagation delay. A typical geostationary satellite is about 36,000 kilometers from its earth stations. At the speed of light (300,000 km/sec.),
the propagation time between an earth station and its satellite is 120 msec., or 240 msec. from the transmitting earth station to the receiving earth station.
Satellite propagation delays coupled with comparable packet-switch delays can cause serious performance problems in a TP application using a packet-switched service, if satellite channels are used by the packet switch.
Local Area Networks

A local area network (Figure 5-2e) interconnects multiple users via a high-speed channel to which all users connect. The network medium usually comprises twisted-pair cable or coaxial cable configured as a bus or a ring. Contemporary local area networks typically support data rates in the one- to ten-megabit/second range. More complex networks can support multiple channels of these capacities or greater.

Usually, all users on a local area network are equals; that is, there is no master on the network that controls access to the network. Network access is either by contention (start transmitting if no one else is and hope no one else does) or by a masterless form of polling known as token passing. These protocols are discussed in a later section.

Local area networks can provide very high speed communication channels between large numbers of users over limited geographical range. Typical local area networks will span a building and perhaps even a college campus or industrial park.
Multiplexers and Concentrators

The efficiency of dedicated circuits often can be improved by combining the traffic from multiple users onto a single line in a manner more efficient than the use of simple polling techniques. One way to accomplish this is through the use of multiplexers.

Multiplexing is the sharing of a channel by several users in a way that is substantially transparent to the users. In effect, the channel is broken up into subchannels. The subchannels are then available as independent channels to individual users, as shown in Figure 5-3a. Of course, the combined data requirements of the users must be somewhat less than the capacity of the channel. There are several established techniques for multiplexing:
• Frequency-Division Multiplexing (FDM) divides the channel into separate channels in the frequency domain, as shown in Figure 5-4a. A wide bandwidth is carved into several subchannels of smaller bandwidths. Since the data rate that is supportable over a channel is proportional to its bandwidth, the capacity of each subchannel is only a fraction of the main channel capacity. This very important characteristic of the data capacity of a bandwidth-limited channel derives from the well-known Nyquist theorem (Nyquist [22]).
FDM is very economical in terms of the equipment required to support it. However, it is an inefficient use of the main channel because of the guard bands required to prevent interference between subchannels. This is wasted bandwidth, not available for data transmission.
[Figure 5-3 Shared channel use: (a) a multiplexer (MPX) dividing the main channel into subchannels for individual users or host ports; (b) a concentrator (CONC) combining user traffic onto the main channel to the host.]
• Time-Division Multiplexing (TDM) divides the channel in the time domain (Figure 5-4b) rather than in the frequency domain. The high-speed data stream of the main channel is divided into time slots which are preallocated to subchannels. The time slots may be one bit wide, one byte (character) wide, or some other size.
Data received from a user owning one of the subchannels is inserted into that subchannel at the transmitting end and is extracted and reconstructed into the user's message at the receiving end. Except for some synchronizing overhead needed to guarantee that the receiver can determine the beginning of a subchannel sequence (known as a frame), the TDM technique is highly efficient in its use of main channel capacity.
• Statistical Multiplexing further increases the utilization of the main channel when the subchannel users are casual. Casual users use a subchannel when they need to, but these users are idle a substantial amount of the time. Most TP system users are casual. A problem with both the FDM and TDM approaches just described when users are casual is that the subchannels are often idle. Available capacity is not being used.
[Figure 5-4: (a) frequency-division multiplexing, with subchannels carved from the main channel in the frequency domain; (b) time-division multiplexing, with subchannel time slots within a main channel frame.]
Statistical multiplexing is a TDM variant that solves this problem. With statistical multiplexing, subchannel time slots are not preallocated to users. In fact, there are many more users than time slots. When data is received at the transmitter, it is placed in the next available time slot and is sent to the receiver with some data identifying the sending user. In this way, many casual users can be serviced by fewer subchannels.
Statistical multiplexing does have the problem of overload when data bursts arrive that cannot be handled. In this case, the multiplexer must buffer the excess data until it can catch up. If its buffers fill, then the multiplexer must execute flow-control procedures to stall the users sending data or else it will lose data and will have to request retransmission. In any event, statistical multiplexing will impose delays on transaction and reply traffic during peak periods when it cannot handle the peak traffic. However,
properly engineered, the statistical multiplexer can significantly reduce channel costs while remaining substantially transparent to the user.

The above discussion has reviewed the primary techniques for multiplexing. In most cases, these techniques will have minimal if any impact on performance. There is another device, a concentrator, which is often used to perform a similar function (Figure 5-3b). Essentially just half of a TDM multiplexer, the concentrator combines traffic from many users into a single data stream for the host to untangle. In the case of the concentrator, the time slot size often is large enough for an entire message or at least for a significant packet of data from a message. In this case, there is a performance penalty equal to the message (or packet) transmission time over the main channel, since the message must be completely received at the transmitting end before it is sent to the receiver.
Modems

So far, we have discussed communication channels without concerning ourselves very much with how data gets passed over them. Though this is the topic for the next section,
[Figure 5-5 Modems: (a) a digital channel connecting user and host directly; (b) an analog channel with a modem at each end.]
there is a very important device that is often needed to convey a data stream over a physical communication channel. Though there are purely digital channels available today on which the user can directly impress his binary data stream (Figure 5-5a), most data channels are analog channels that are derivatives of voice channels (see Figure 5-5b). They can carry tones but cannot carry binary levels as we know them inside the terminal or computer. Therefore, a device is needed to convert the data signals into tones suitable for the channel.

This can be done by sending a loud tone for a one and a soft tone for a zero (changing the amplitude of the tone, or amplitude modulation). As an alternative, the pitch of the tone can be altered to designate a one or a zero (frequency modulation). There is also a technique similar to frequency modulation known as phase modulation. Because of various considerations, such as error performance and cost, frequency modulation is most often used, though phase modulation is also common.

A device is needed at the user and host ends to modulate the data stream in order to create the analog tones and to demodulate received tones in order to recreate the original data stream sent from the other end. The device that performs this modulation and demodulation function is known as a modem.

Usually, the modem is transparent so far as performance is concerned. However, there is a very important case in which it may have a significant performance impact. This case is that of half-duplex communication, especially for dialed connections, and is discussed in more detail later.
Propagation Delay

One final note is in order about communication channel performance; it concerns propagation delay over the circuit. We have already discussed propagation delay over satellite channels and through packet-switched networks. But what about simple dedicated and dialed telephone channels? Can these delays be significant?

We can begin by looking at the simple propagation delay of the signal over a wire connecting the user to the host. Depending upon the type of wire (from loaded rural circuits to coaxial cable), the speed of a signal over wire can vary between about 0.5 to 0.9 of the speed of light. Even at the speed of light, a signal will take 16 msec. to traverse the 3,000 miles across the USA or 65 msec. to go halfway around the world. More typical times might be twice as large. Add to this the fact that every time the signal goes through an electronic repeater or fiber, another few milliseconds are added. Long-haul circuit delays of several tens of milliseconds are definitely to be expected.

Dialed lines typically will be worse than leased lines, since they may go through more central offices (and their associated equipment) than conditioned leased lines. Furthermore, their lower bandwidth usually goes hand-in-hand with longer propagation delays.

Long propagation delays can be of special concern in large polled networks in which each transaction must be charged with some poll overhead. If each poll requires two propagation delays (one to send the poll request and one to receive the response), then polling
can be quite slow, whether it is successful or not. This is discussed in more detail later.
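As an illustration of how quickly these delays add up, the sketch below works out the cost of one poll under assumed conditions: a 30-msec. one-way propagation delay, with the 3-character poll and 1-character response times used later in this chapter (10 msec. and 3.3 msec. at 2400 bits/sec.):

    one_way = 0.030                # assumed one-way propagation delay, sec.
    poll = 3 * 8 / 2400            # 3-character poll request: 10 msec.
    resp = 1 * 8 / 2400            # 1-character poll response: 3.3 msec.

    per_poll = 2 * one_way + poll + resp
    print(f"{per_poll * 1000:.1f} msec. per poll")                 # about 73 msec.
    print(f"{50 * per_poll:.1f} sec. to poll 50 idle terminals")   # about 3.7 sec.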
DATA TRANSMISSION

Once we have a data channel over which we can feed a data stream, we must then agree on how the data will be represented on the channel. It is not enough to just send a message comprising a string of binary bits. We must know where the message begins and ends and how to interpret the bit patterns contained in the message.
Character Codes

Typically, messages in TP systems are made up of strings of characters: alphabetic characters, numeric characters, and special characters, such as punctuation marks. This is not true in all applications. Scientific data might be sent as large binary numbers in scientific notation (mantissa plus exponent). Satellite telemetry data might be long streams of binary data, as would a computer object program file being downloaded over a circuit. We are interested in the alphanumeric data of TP system messages, in which a character can be represented by a specified set of bits.

Early teletype systems used 5 bits per character (32 combinations). This was the Baudot code. Two special characters (FIGS and LTRS) shifted between alphabetic meaning and numeric/special symbol meanings of the remaining 28 combinations (all 0s and all 1s had special meaning). In the 1950s, 6 bits was a popular definition of a character. This character size provided 64 combinations covering the alphanumeric character set (36 characters) plus ample special characters.

However, it soon became apparent that this was not enough. Uppercase and lowercase characters were desired. Furthermore, modern protocols required a rich set of control characters (this topic is discussed in more detail later). Seven-bit characters (128 combinations) were more appropriate to meet these needs. Thus was born the ANSI standard ASCII code (American Standard Code for Information Interchange).

The ASCII code set is a seven-bit code plus an error-detecting parity bit, as shown in Figure 5-6a. The parity bit may be unused (always set to 0 or 1), or it may be set such that the total number of 1 bits in the character is odd (odd parity) or even (even parity). If either an even or odd parity bit is used, then the resulting eight-bit code is error-detecting. This is because the changing of any one bit in the character because of noise will cause the parity check to fail.

A competing code set is IBM's EBCDIC (Extended Binary Coded Decimal Interchange Code). This is also an 8-bit code in which all 256 combinations are used (see Figure 5-6b).

Thus, modern technology has settled on an 8-bit character code. This grouping of 8 bits is called a byte. (Four bits is enough to represent a number and is used in some applications. Four bits is called a nibble.) Note that today's computers typically use word sizes that are multiples of bytes: word sizes of 16 bits or 32 bits.
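The parity mechanism just described is easy to state precisely in code. A minimal sketch, assuming even parity carried in the eighth bit of a seven-bit ASCII character:

    def with_even_parity(ch):
        # Set bit 7 so that the total number of 1 bits in the byte is even.
        code = ord(ch) & 0x7F               # the 7-bit ASCII code
        parity = bin(code).count("1") & 1   # 1 if the count of 1 bits is odd
        return code | (parity << 7)

    def parity_ok(byte):
        return bin(byte & 0xFF).count("1") % 2 == 0

    b = with_even_parity("A")      # 'A' = 0x41 has two 1 bits; parity bit is 0
    assert parity_ok(b)
    assert not parity_ok(b ^ 0x10) # any single-bit error is detected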
[Figure 5-6: (a) ASCII code, seven bits plus an even, odd, or no parity bit; (b) EBCDIC code. Character codes are in hexadecimal.]
Asynchronous Communication

Having defined an eight-bit byte as a basic unit of information in a TP system, we must now be prepared to send a string of bytes over a communication channel in an intelligible fashion. Simply sending a long string of bits is not satisfactory, since we would never know where the byte boundaries were (see Figure 5-7a). Clearly, additional information must be embedded in the stream of bits so that the receiver can determine where a byte starts.

One technique for doing this is called asynchronous communication. As shown in Figure 5-7b, a steady "1" signal is transmitted between characters (a marking signal). This interval is called the stop interval. When it is desired to send a byte, a start bit comprising a single "0" bit (a spacing signal) is sent, followed by the eight data bits. Then the line returns to marking for the next stop interval. The stop interval is guaranteed to be of a minimum length (typically 1, 1.5, or 2 bits in length). Thus, each byte is framed by a 1-bit start signal at its beginning and by a stop signal at its end (which is at least one bit in length). To recognize byte boundaries, the receiver
[Figure 5-7 Byte transmission: (a) a simple bit stream (which 8 bits is a byte?); (b) the asynchronous envelope, a start bit and eight data bits framed by mark/space stop intervals; (c) the synchronous envelope, SYN characters framing runs of data bytes.]
simply looks for a stop-start transition (a mark-space transition), discards the first bit as the start bit, and stores eight bits. The next bit should be a stop bit, and the next start bit is awaited.
To achieve byte recognition, the asynchronous communication technique has created a 2-bit envelope around each byte (assuming a 1-bit stop interval). A byte being transmitted over an asynchronous communication channel therefore requires 10 bits to pass 8 bits of information, a 25 percent overhead.

One interesting characteristic of this technique is that the transmitter may transmit a character at any time, since the stop interval between characters can be arbitrarily long. This characteristic is particularly useful for data that is randomly generated (e.g., from a keyboard) and gives rise to the term asynchronous, applied to this technique for communications.
Synchronous Communication

Synchronous communication takes advantage of blocks of data that can be transmitted as uninterrupted byte streams to achieve a reduction in enveloping overhead (and also to achieve an improvement in error performance, as described later). Basically, one or more special synchronization characters (SYN) are inserted periodically in the byte stream. The receiver can search for the synchronization sequence and then can count out 8-bit bytes thereafter.

A typical synchronous sequence is shown in Figure 5-7c. The transmission is initiated with 3 SYN characters (the ASCII SYN character is hexadecimal 16). If the receiver is not in synchronization, it will look for a SYN character by continuously evaluating the last 8 bits received. When it finds a SYN character, the receiver then starts accumulating 8 bits at a time as data bytes (the next two should also be SYN characters, a condition that can be used as a sanity check). Periodically, the transmitter will insert additional SYN characters to allow the receiver to ensure itself that it is still in synchronization. When the data has been sent, the transmitter can go idle, or it can send a steady stream of SYN characters to maintain synchronization with the receiver.

A typical interval between SYN characters is 128 data bytes. If 3 SYN characters are sent after every 128 data bytes, then the envelope overhead required for byte recognition is 3/128 = 2.3%. This is an order of magnitude better than asynchronous communication.
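The two enveloping overheads can be compared directly. A small worked check, assuming a 1-bit stop interval for the asynchronous case and 3 SYN characters per 128 data bytes for the synchronous case:

    async_overhead = 2 / 8                # 2 envelope bits per 8 data bits
    sync_overhead = (3 * 8) / (128 * 8)   # 3 SYN bytes per 128 data bytes

    print(f"asynchronous: {async_overhead:.1%}")   # 25.0%
    print(f"synchronous:  {sync_overhead:.1%}")    # 2.3%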
Error Performance

Noise on the communication line as well as phase, frequency, and amplitude distortion caused by line characteristics will distort the data signal as it travels over the channel. This is evidenced in the demodulated signal by a phenomenon known as jitter. If successive received bits are viewed overlapped on an oscilloscope, the resulting pattern will appear as shown in Figure 5-8a. Each bit transition will generally not occur exactly at the time that it should. Rather, it will be a little early or a little late, appearing to "jitter" back and forth as successive bits are viewed. Though the relation between jitter, line distortion, and line noise is quite complex, the amount of jitter can be used as a measure of the intensity of the noise and distortion on the line.

Figure 5-8b shows the effect of jitter on an asynchronous signal. Let Tb be the duration of a bit interval. The appropriate strategy for asynchronous reception is as follows:

1. Look for a stop-start transition.
2. Wait for one-half of a bit interval (Tb/2).
3. Sample the received signal. This should be the start bit (otherwise, declare a false character and return to 1).
4. Sample eight more times at the bit interval, Tb, to obtain the eight data bits.
5. Sample one more time after an interval of Tb to ensure that a proper stop interval has been received (otherwise, declare a synchronization error).
6. Repeat 1 through 5 for each successive character.

[Figure 5-8: (a) jitter, seen as overlapped received bits straddling the expected transition; (b) asynchronous tolerance; (c) synchronous tolerance and sample times.]
From Figure 5-8b, it is seen that a jitter of magnitude Tb/4 can move the stop-start transition 1/4 of a bit to the right and the transition of any data bit 1/4 of a bit to the left. At this point, the sample of the bit may be in error. Thus, the maximum jitter that can be tolerated by an asynchronous channel is Tb/4.

Figure 5-8c shows the equivalent case with a synchronous stream. For synchronous communication, the sample time is not determined by a single transition as it is in the
asynchronous case. Rather, the sampling time is based on the long-term averaging of many data transitions (or in some cases is derived from the modulated signal itself) and is therefore quite accurate relative to the expected transition times. Given accurate sampling times, it is evident from Figure 5-8c that it would take jitter of a magnitude equal to Tb/2 to create an error. Therefore, a synchronous channel can tolerate twice the jitter that can be tolerated by an asynchronous channel.

We have now seen that asynchronous techniques incur a byte-identification overhead which is an order of magnitude more than synchronous channels and that they can only tolerate half the noise. The reduced noise tolerance means a higher incidence of retransmissions and a further reduction in efficiency. So why is asynchronous communication even used? The reason is cost. The requirements for more accurate clocking and for a more complex byte-boundary recognition algorithm (recognizing a SYN character rather than a simple transition) make synchronous transmission more expensive than asynchronous transmission. Therefore, asynchronous techniques tend to be used for lower-speed applications (up to 2400 bits per second), and synchronous techniques tend to be used for higher-speed applications, where getting as much out of a channel as possible is desirable.
Error Protection

We have discussed the generation of errors because of line noise and distortion that cause jitter in the received signal. We have also seen one example of an error-detecting code to protect against such errors: the parity bit used in the ASCII character set. This is commonly known as a vertical redundancy check (VRC).

Unfortunately, errors on communication lines are not isolated. They tend to occur in short bursts. Therefore, it is quite possible that an even number of errors will occur within a character. In this event, the character parity check, or VRC, will still be satisfied, and the error will go undetected.

For this reason, a stronger error-detection scheme is often required. This is typically done by protecting the message (or transmission block) with additional error-detection codes placed at the end of the message. There are two in common use:

1. The longitudinal redundancy check (LRC) adds one byte to the message that is itself a parity byte. Each bit is set so that the sum of all corresponding bits in all bytes of the message is even (or odd, as the case may be).
2. The cyclical redundancy check (CRC) is a much stronger error-detection code. It is typically a 16-bit code added to the end of the message, though longer codes give better protection. Though the theory behind CRC codes is quite extensive and complex (see Hamming [8]), the CRC is essentially that sequence of bits that, if appended to the message, creates a binary number that is exactly divisible by some predetermined number.
CRC codes can be extended to provide forward error-correcting systems. In these systems, there is so much redundancy provided by the error-correction code (often up to 50
percent; see Stallings [25]) that not only can an error be detected but also the specific bit in error can be identified. Therefore, that bit can be corrected. In this case, the code is a single-bit error-correcting code and may, in fact, correct many multiple-bit errors. Codes can be defined that will correct up to e errors and detect up to d errors, where d > e (see Gallagher [6]). However, the price that is paid is efficiency. As error codes get more powerful, they impose a higher overhead on the system.

In the current art, error-correcting codes are used only in situations where retransmission is very expensive or impossible. Satellite channels are a good example of the use of these techniques, since retransmission uses expensive channel capacity. Broadcast systems are an example in which retransmission may be impossible, since there may be no return path.
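Of these checks, the LRC is simple enough to show in full. A minimal sketch, assuming even column parity, in which case the LRC byte is just the exclusive-OR of all data bytes and a clean block checks to zero:

    def lrc(block):
        # Longitudinal redundancy check: even parity over each bit column,
        # which reduces to the XOR of all bytes in the block.
        check = 0
        for byte in block:
            check ^= byte
        return check

    message = b"TRANSACTION"
    block = message + bytes([lrc(message)])   # append the LRC byte
    assert lrc(block) == 0                    # a clean block checks to zero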
Half-Duplex Channels

So far, we have concerned ourselves with the encoding of data so as to be able to identify it (byte identification) and to protect it against errors. There is one other major performance consideration at the data transmission level, and that is whether the communication channel is simplex, half-duplex, or full-duplex. A simplex channel can transmit information in only one direction. A half-duplex channel can transmit in either direction but only in one direction at a time. A full-duplex channel can transmit in both directions simultaneously. We will ignore simplex channels, since they are not usually of use in TP systems. If they are used, they behave, for performance purposes, as half of a full-duplex channel.

A half-duplex channel creates several performance considerations. Not only is traffic in one direction affected by traffic in the other direction, but there also can be significant delays in turning a channel around. Channel turnaround time comprises two components:

1. Channel settling time. When a channel is relinquished by one transmitter and acquired by another transmitter, there is a period during which the energy imparted to the channel by the first transmitter is decaying, and the energy imparted by the second transmitter is building. Only when the new transmission energy is greater than the old transmission energy by a significant amount can the line be used for reliable communication. This time is typically a few byte intervals.
2. Echo suppressors. Long telephone lines tend to develop echoes because of impedance mismatches along their length. To prevent this from becoming a nuisance to telephone users, echo suppressors have been installed throughout the telephone network. These devices determine the direction of predominant transmitted energy and suppress transmissions in the reverse direction. When the direction of transmission reverses, the echo suppressors reverse direction. This can take a few tens or even a few hundreds of milliseconds. (Have you ever noticed the first syllable of the conversation from the other end being cut off?)

On a half-duplex channel, reliable communication cannot be achieved until the
echo suppressors on the channel are all reversed. Though the telephone companies have undertaken a program to upgrade their equipment in order to eliminate echo suppressors, many remain and probably will remain for the foreseeable future. This is primarily a problem for dialed lines, as dedicated lines are conditioned in many ways which preclude the need for echo suppressors.

The turnaround delays required by channel settling time and by echo suppressors may be established by timers in the terminal or host equipment or may be compensated for in the modem itself. The modem provides two signals for this purpose:

1. A Request To Send (RTS) signal to the modem, requesting permission to send data.
2. A Clear To Send (CTS) signal from the modem, indicating that data may now be sent.
In actual fact, the only logic that links the CTS signal to the RTS signal is a timer in the modem. The time-out is set by the user to the minimum safe time determined for channel turnaround. Typical timer values range from tens of milliseconds to hundreds of milliseconds.

Clearly, channel turnaround delays can have a significant impact on communication channel performance. A 100-msec. turnaround penalty for every 200-byte block over a 2400-bit/sec. channel requiring (8)(200)/2400 = 667 msec. per block is not to be taken lightly. This becomes even worse for a polled channel, wherein a typical poll sequence might be 3 characters and a poll response 1 character (10 msec. and 3.3 msec., respectively, at 2400 bits/sec.). Adding 100 msec. to each of these completely distorts the performance picture.
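A small worked check of this arithmetic, assuming the turnaround penalty is paid once per transmitted block:

    block_time = 8 * 200 / 2400    # 667 msec. of useful transmission per block
    turnaround = 0.100             # 100-msec. channel turnaround

    overhead = turnaround / (block_time + turnaround)
    print(f"block time {block_time * 1000:.0f} msec., "
          f"line capacity lost to turnaround {overhead:.1%}")   # about 13%

    # For a 3-character poll (10 msec.) or a 1-character response (3.3 msec.),
    # the same 100 msec. dwarfs the transmission itself.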
Full-Duplex Channels

A full-duplex channel is conceptually simple. Both sides of the conversation can transmit simultaneously. Full-duplex channels can be derived in several ways from physical channels. For instance, two separate channels may be provided, one to be used for each direction. This is usually the technique used for higher-speed communications. For lower data rates, a single physical channel may be divided into two logical channels using Frequency-Division Multiplexing (FDM), as described earlier. The lower half of the frequency spectrum supported by the channel may be used to send data in one direction, and the upper half of the frequency spectrum may be used to send data in the opposite direction. This technique is used by low-speed (300 bit per second) modems. One significant advantage of this technique is that a full-duplex connection can be established over a dialed line.

Though full-duplex channels are conceptually simple, the message transfer protocols required to take maximum advantage of the full-duplex capability can be quite a bit more complex than those used with half-duplex channels. These protocols are discussed in detail in the next section.
PROTOCOLS

The previous discussions can be perceived as describing the process of data transmission. The facilities and techniques described allow sequences of bytes to be delivered between users. Let us now discuss data communications.

In order to have a meaningful communication of data between a sender and a receiver, the data must be identifiable and transferred reliably. The procedure for accomplishing this is called a protocol. A protocol is an agreement between the sender and receiver of data as to exactly how that data will be transferred. Protocols provide three primary functions:
1. Message identification. They identify the bounds of messages carrying the transaction and response data.
2. Data protection. They protect data against error, ensuring its reliable delivery.
3. Channel allocation. They provide the mechanism for allocating the channel in an orderly manner to the various competing users of the channel.

Protocols typically have three distinct parts:

1. The establishment procedure, which serves to establish a virtual connection between two users (or many users in the case of a broadcast).
2. The message transfer procedure, which describes the form of message transfer.
3. The termination procedure, which breaks the virtual connection.

The establishment and termination procedures satisfy the channel-allocation function. The message-transfer procedure is the message-identification function. Data protection spans and significantly complicates all procedures.

There are many standardized protocols, and their study alone would fill volumes. Our concern is to understand the basics of protocols from a performance viewpoint so that, given a protocol specification, we can evaluate its performance within a given communications environment. Therefore, we will study some classes of protocols and relate them generally to some of the more popular protocols in use today.
" ••8.,. ,.,.,,1icafiWJ .." I'Iotedioft Just as start-stop bias and sy&ebroDizatiOll bytes are USed to identify the boundaries of data bytes in a bit stream, so must there be a u:w:c:hawsm to iden1ify messages in a byte stream. This is typically accomplished with CODttol characters that are chosen to be UDiqae in the bytesttam. Quite simply, a 1IDique start-of-text byte may indicate the start of a message. and an end-of-text byte may iDcficate the end of a message. These are COIDDlODly desiguatwl STX and m;x, and are, for blstanc:e. fOUDd in the ASCII control set.
Error-protection (and in some instances, error-correction) bytes follow the ETX byte. As discussed earlier, these include LRC (longitudinal redundancy check) or CRC (cyclical redundancy check) codes. A typical message protected by a sixteen-bit CRC code would be formatted as follows:

STX (data) ETX CRC CRC
Message Transfer

Let us next look at the major function of any protocol: the reliable transfer of a message from its source to its destination. Just as with communication channels, we may dichotomize protocols into half-duplex and full-duplex protocols. When using a half-duplex protocol, transmission occurs in only one direction at a time. A full-duplex protocol supports simultaneous communication in both directions. Half-duplex protocols may be implemented using either half-duplex or full-duplex communication channels. However, full-duplex protocols require a full-duplex channel.
Half-Duplex Message Transfer. In a typical half-duplex protocol, user A sends a message to user B and then awaits a response from user B, as shown in Figure 5-9a. This response may be a positive acknowledgement that the message was received correctly (ACK) or a negative acknowledgement indicating that the message was received in error (NAK). If user A receives an ACK, the next message is sent. However, if user A receives a NAK, the previous message must be retransmitted and the above process repeated. This protocol causes the transmitter to pause between messages in order to receive an acknowledgement from the receiver. This pause is required because the half-duplex channel supports only one transmission at a time.

Full-Duplex Message Transfer. A full-duplex channel allows the transmitter to send continuously, since acknowledgements can be returned over the reverse channel while message transmission continues. In fact, the other end can be transmitting its own series of messages at the same time. The problem is the coordination of acknowledgements with messages, since the transmitter may be able to send many messages before it gets an acknowledgement to a previous message (especially over channels with long propagation times, such as satellite or packet-switched channels).

The solution to this problem is to number messages and then to acknowledge by message number. In fact, to allow both ends to transmit simultaneously, the ACK or NAK can be piggybacked into each message. Figure 5-9b shows this procedure. User A is sending messages A1, A2, etc., while user B is sending messages B1, B2, and so on. Each piggybacks an acknowledgement of the last message received correctly or the first message received incorrectly.

For instance, by the time user B is ready to send its fourth message, B4, it still has only been able to process user A's message A1 and so sends an ACK1 (just as it had with
[Figure 5-9 Basic protocols: (a) half-duplex protocol, in which each message from user A is answered by an ACK or NAK from user B; (b) full-duplex protocol, in which users A and B send numbered messages (A1, A2, ...; B1, B2, ...) with piggybacked acknowledgements (ACKn, NAKn).]
its previous message, B3). By the time user B is ready to transmit its next message, B5, it has approved two more messages from user A and so sends an ACK3. However, user B now finds user A's message A4 in error. Therefore, user B sends a NAK4 with its next message, B6. When user A receives the NAK4, it resets itself to start sending at its message A4, and the process continues.

This is obviously a more complex protocol than that required for half-duplex channels. It also requires storage at the transmitter for several messages to support the potential retransmission of these messages. This can be compared to a storage requirement of only one message for the simpler half-duplex channel. However, in return for this added cost, the full-duplex channel is utilized to a much greater extent.

Note that in this example the transmitter was required to back up to the message in error and to retransmit all data from that point on. We will refer to this as the Go-Back-N protocol. An alternative strategy that is also used is selective retransmission, in which only the message in error is retransmitted. In this case, referring to Figure 5-9b, user A's message sequence would have been (... A4, A5, A6, A4, A7, A8 ...). However, this technique requires storage not only at the transmitter but also at the receiver, since later
messages must be held until the message in error is received properly. Again, higher cost yields higher performance.
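The transmitter side of Go-Back-N can be sketched compactly. The class below is an illustration only (its window of 7 unacknowledged messages anticipates the HDLC discussion later in this chapter); it holds sent-but-unacknowledged messages and, on a NAK, returns everything from the rejected message onward for retransmission:

    from collections import deque

    class GoBackNSender:
        def __init__(self, window=7):
            self.window = window
            self.next_seq = 0
            self.unacked = deque()     # (sequence number, message) pairs

        def can_send(self):
            return len(self.unacked) < self.window

        def send(self, msg):
            self.unacked.append((self.next_seq, msg))
            self.next_seq += 1
            return self.next_seq - 1   # sequence number just transmitted

        def on_ack(self, seq):
            # The receiver has everything up to and including seq.
            while self.unacked and self.unacked[0][0] <= seq:
                self.unacked.popleft()

        def on_nak(self, seq):
            # Back up: retransmit the rejected message and all that followed.
            return [m for s, m in self.unacked if s >= seq]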
Channel Allocation

In order to send a message from a source to a destination, the sender must acquire the sole use of the channel and must then be able to address the receiver. This is the establishment procedure.

When the message (or perhaps a block of data representing a partial long message) has been sent, the sender must release the channel. This is the termination procedure.

In order for a channel to be allocated to a sender, one of two procedures may be used:

1. Assignment of the channel by an orderly procedure.
2. Uncontrolled contention for the channel by all users.

Orderly procedures for channel assignment include polling of users by a master station or the passing of a token giving the current token holder the right to use the channel. If a contention protocol is used, then provision must be made to detect collisions and to recover the lost data.

There is one further dichotomy to be recognized when considering establishment/termination procedures, and that is the number of users of the channel. If there are only two users, then the channel is a point-to-point channel. If there are more than two users, then the channel is a multipoint channel.
Bit Synchronous Protocols

So far, the protocols we have discussed are byte-oriented. Control characters are taken from a byte set, and error-control characters are based on a byte structure. In fact, data is expected to be bytes.

There are many applications in which a byte structure is not native to the data. A good example is public networks, which must be transparent to any data structure. Data channels derived from telephone lines meet this criterion, as the particular form of data representation and the protocol are up to the user. The situation becomes more complex for packet-switching networks, since by their very nature they must break the users' data into packets. They use their own data structures and protocols, and these must be transparent to any user data structure.

The protocols used for such applications are bit-oriented, as they must deal with data at the bit rather than the byte level. Since synchronous lines are typically used with these protocols for performance considerations, these protocols provide synchronization procedures. Therefore, we refer to such protocols as bit synchronous protocols. Bit synchronous protocols use full-duplex channels and are full-duplex protocols. Message transfer procedures and establishment/termination procedures are entwined in these protocols.

The two most commonly used bit synchronous protocols are HDLC (a CCITT
standard) and SDLC, IBM's offering. HDLC stands for High-Level Data Link Control, and SDLC stands for Synchronous Data Link Control. The ANSI ADCCP and CCITT LAP-B protocols are other examples.

HDLC and SDLC are quite similar, especially concerning our needs for performance consideration. They are described in some detail in Stallings [25], Meijer [21], and Hammond [9]. We will take a high-level look at HDLC below as an example.

Under HDLC, a message is broken up into packets, or frames. Each frame is enveloped with synchronization, control, and error-protection information, as shown in Figure 5-10a. Frame elements include:

• Leading and trailing flag fields that provide synchronization. Each flag is an eight-bit field containing the bit sequence 01111110.
[Figure 5-10 HDLC: (a) frame format, FLAG (8 bits), ADDRESS (8 bits or more), CONTROL (8 bits or more), DATA (variable), FCS (16 or 32 bits), FLAG (8 bits); (b) control field, with N(S) the send sequence number, N(R) the receive sequence number, P/F the poll/final bit, S the supervisory function (RR receive ready, RNR receive not ready, REJ reject, SREJ selective reject), and M the unnumbered function.]
• An address field of eight or more bits to identify the recipient of the frame.
• A control field of eight or more bits that defines the type of frame (information, supervisory, or unnumbered).
• The data field, which may be of any length (in some implementations, it is constrained to be a multiple of eight bits).
• A frame check sequence (FCS) field that contains a 16-bit or 32-bit CRC character for error detection.

The flag fields provide synchronization by including a unique bit sequence in each frame. The uniqueness of this flag sequence must be preserved in that it must not appear in the rest of the frame. Should a sequence of six 1s that can be misinterpreted as a flag field be found in the frame, the sequence is broken up by a technique called bit stuffing (a sketch of which appears below). The transmitter simply inserts a 0 after every sequence of five 1s (except, of course, in the flag fields). The receiver, upon receiving five 1s, checks the next bit. If it is a 0, the receiver deletes it. If it is a 1, and the seventh bit is a 0, the receiver interprets the sequence as a flag field (seven 1s signal a special abort condition).

The control field provides for three frame types:
1. Information frames (I-frames), which carry the data.
2. Supervisory frames (S-frames), which provide flow control and error control.
3. Unnumbered frames (U-frames), which are used for a variety of channel control functions.

The information and supervisory frames are used for the establishment and error recovery functions in which we are interested. Many of these functions are implemented in the frame's control field, as shown in Figure 5-10b. The first one or two bits define the type of frame to follow. If this is an information frame, a pair of three-bit sequence numbers is provided. One sequence number, N(S), specifies the sequence number of the current frame being sent. The other, N(R), specifies the sequence number of the next frame anticipated. In other words, N(R) tells the other end that all previous frames have been received properly and that the other end may flush these messages from its buffers. Thus, message acknowledgement is piggybacked onto information packets as they flow through the system.

If an acknowledgement is due to be sent to the other end, but no information frame is available, then a supervisory frame may be sent instead. An RR frame (Receive Ready) is sent with the next expected frame number in N(R) if this end is in a position to receive a frame. Otherwise, an RNR (Receive Not Ready) supervisory frame is sent. This also indicates the next frame to be expected but forces the other end to delay its transmission until a subsequent RR frame is sent. Thus, RNR/RR couples provide flow control over the link.

If a frame is received in error, then a supervisory REJ (Reject) frame is sent, indicating in N(R) the frame from which retransmission is to begin. This is the Go-Back-N
protocol. If selective retransmission is desired, the supervisory SREJ (Selective Reject) frame is sent instead. In this case, only the frame indicated by N(R) is retransmitted.

Note that the sequence number fields are three bits in length and provide sequence numbers 0-7. This means that the window size, W, on the channel is seven messages. That is, the receiver can get up to seven messages behind the sender before the sender must wait for an acknowledgement. A window size of 7 is used instead of 8 to prevent confusion over message numbers. To understand the reason for this, let us consider the following example. Assume that the sender has sent messages 0 through 7 and is waiting for an acknowledgement while it is holding these eight messages. It then receives an acknowledgement with N(R) = 0, indicating the next message the receiver is expecting. Is it the message 0 currently being held by the transmitter, or is it the message 0 which the transmitter is due to send next? This confusion is avoided by always limiting the window size to one less than the sequence number range (the window size may, of course, be further limited by other factors, such as available buffering). Though the HDLC protocol described here limits the window size to 7, an extension to a window size of 127 is available through HDLC. This can be important for high-speed, long-delay channels such as satellite channels.

Let us now look at the establishment functions built into HDLC. These are controlled via two fields in the frame: the P/F bit in the control field and the address field. The P/F bit is the poll/final bit. The host uses this bit to poll a terminal; the terminal uses this bit to indicate to the host that it has nothing more to send. The specific terminal is addressed by the host via the address field.

In order to select a terminal for transmission, the host simply addresses a message to it via the address field. In order to poll a terminal, the host sends a frame containing that terminal's address with the P/F bit set. If an information frame is due to be sent to the terminal, the poll bit in that frame is set. Otherwise, a supervisory RR frame is sent. If the terminal has no data to send, it will return an RR frame with the P/F bit set. If it has data, it will return information frames to the host. All but the last I-frame will have a zero P/F bit. A one P/F bit in the last frame indicates to the host that the terminal has finished its transmission.
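As promised above, the bit-stuffing procedure is simple enough to express directly in code. The following Python sketch is ours, not the book's; the function names and the list-of-bits representation are illustrative assumptions.

    def stuff_bits(data_bits):
        """Transmitter side: insert a 0 after every run of five 1s."""
        out, run = [], 0
        for b in data_bits:
            out.append(b)
            run = run + 1 if b == 1 else 0
            if run == 5:
                out.append(0)   # stuffed zero; cannot be mistaken for a flag
                run = 0
        return out

    def unstuff_bits(line_bits):
        """Receiver side: delete the 0 that follows any run of five 1s."""
        out, run, i = [], 0, 0
        while i < len(line_bits):
            b = line_bits[i]
            out.append(b)
            run = run + 1 if b == 1 else 0
            if run == 5:
                i += 1          # examine the bit after five 1s
                if i < len(line_bits) and line_bits[i] == 1:
                    raise ValueError("six 1s: flag or abort, not data")
                run = 0         # it was a stuffed 0; drop it
            i += 1
        return out

    # Round trip: a stuffed stream destuffs back to the original data.
    assert unstuff_bits(stuff_bits([1, 1, 1, 1, 1, 1, 0, 1])) == [1, 1, 1, 1, 1, 1, 0, 1]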
BITS, BYTES, AND BAUD

A passing comment on some communication terminology is appropriate at this point. Let us take a 2400 bit-per-second synchronous communication line. Some refer to this as a 2400-bit-per-second line, others as a 2400-baud line, and still others as a 300-byte-per-second line. Are these all equivalent? I think so, but I doubt it. And if that sounds confusing, it's because we have let ourselves get sloppy with nomenclature.

Let us first take the term baud. Baud is technically a measure of the number of state transitions per second that a
communication line is capable of achieving while still having the receiver accurately detect the state sequences. In many cases, the line shifts between only two states: one and zero. In this case, baud is the measure of bits per second that the line can handle.
However, many transmission schemes use more than 2 values per transition. A common modulation technique is 4-phase modulation. With this technique, each transition is to one of 4 states and thus represents 2 bits. In this case, a 2400-baud line supports 4800 bits per second. In general, if each state can be one of M values, then baud and bits per second are related as follows:

Bits per second = baud x log2 M

The term bits per second is not all that clear either. From a purely information theory viewpoint, a bit is a unit of information. There is no information carried in asynchronous start/stop bits. Since a 1200-baud asynchronous line (in the loose sense) transmits 120 10-bit characters per second, with each character containing only 8 information bits, is it a 1200-bit-per-second line or only a 960-bit-per-second line?

Well then, we say, let us use bytes as our measure. But again, there is no information carried in synchronous SYN bytes nor in control bytes such as STX, ENQ, or error control characters. Do we eliminate these from our measure of line capacity?

Enough said. Modern usage has become somewhat sloppy, and all forms are accepted. Let's make sure that our use is clear, either by context or by explicit definition.
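For what it is worth, the baud-to-bits-per-second relation above is a one-liner. A trivial Python sketch (the function name is our own):

    from math import log2

    def bits_per_second(baud, M):
        """Line rate in bits/sec for 'baud' transitions/sec with M states per transition."""
        return baud * log2(M)

    print(bits_per_second(2400, 2))   # 2400.0 -- two-state line
    print(bits_per_second(2400, 4))   # 4800.0 -- 4-phase modulation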
LAYERED PROTOCOLS

In the previous sections, as we discussed communications from the physical channel level up to some fairly complex protocols, we described how each lower layer adds its own information to the data in order to communicate. An application might start out with a message to send. This gets passed to a protocol handler, which frames the message with start-of-text and end-of-text identifiers and which adds some error control and channel control information. Finally, this expanded message is handed to a transmitter, which may add start/stop bits or SYN characters to get the data across the line. At the other end, the receiver strips out the synchronization bits and hands the message in its protocol envelope to the protocol handler. The protocol handler extracts the message and hands it to the application program. This procedure is diagrammed in Figure 5-11.

In effect, the application program at the source end is acting as if it is communicating directly with the application program at the destination end. Each does so by passing data between itself and a protocol handler. The details of the protocol handler are of no concern to the application programs.

The protocol handler also acts as if it is talking directly to its companion protocol handler. On the one hand, each passes data between itself and a mysterious application program. On the other hand, each exchanges data with a line handler. The protocol handler requires no information about the inner workings of either the application programs
[Figure 5-11 Layered protocol: at the source, the application creates a message and passes it to its protocol handler, which passes the enveloped message to the line transmitter; at the destination, the line receiver hands the envelope to the peer protocol handler, which delivers the message to the application for processing.]
or the line handlers. Only the interfaces need be defined. However, the protocol handler must have intimate knowledge of and cooperate with its companion protocol handler. The same description applies to the line handlers (the transmitter and receiver). They talk to each other but have no knowledge of higher layers.

Thus is born the concept of layered protocols. Each lower layer deals in a greater abstraction relative to the data being handled. It provides services to the next higher layer by using the primitives of the next lower layer. Each layer acts as if it is communicating with a copy of itself via its own layer protocol. It has no knowledge of higher or lower layers, except for the immediate interfaces.

Major networks today are based on layered protocols. The two most common are the International Standards Organization's Open System Interconnect (ISO/OSI) and IBM's Systems Network Architecture (SNA). The popular X.25 protocol used widely in packet-switching networks implements the lower three levels of ISO/OSI.
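The envelope-within-envelope behavior described above can be mimicked in a few lines of code. This Python sketch is purely illustrative and is our own construction, with invented layer tags standing in for real headers and trailers:

    def send(message, layers):
        """Each layer wraps the data it receives from the layer above."""
        for layer in layers:                      # application -> physical
            message = layer + "[" + message + "]" + layer
        return message

    def receive(frame, layers):
        """Each layer strips its own envelope and passes the rest up."""
        for layer in reversed(layers):            # physical -> application
            frame = frame.removeprefix(layer + "[").removesuffix("]" + layer)
        return frame

    layers = ["PROTO", "LINE"]                    # protocol handler, then line handler
    wire = send("PAYROLL UPDATE", layers)
    print(wire)                                   # LINE[PROTO[PAYROLL UPDATE]PROTO]LINE
    assert receive(wire, layers) == "PAYROLL UPDATE"

Each handler touches only its own envelope, which is the essential point of layering.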
ISO/OSI

OSI is more of a model for layered architectures than it is a specification. In fact, it is called the OSI Reference Model (ISO [12]). The OSI Reference Model establishes seven layers, as shown in Figure 5-12. Each layer has its own defined responsibilities and provides services relative to these responsibilities, as shown in Figure 5-13, to the layer above it (except for the application layer, which is the highest). It can communicate with its peer layer, which is typically in a separate system, via a communication channel provided by the next lower layer. Each peer layer has a protocol it uses to intercommunicate; this protocol is of no concern to other layers (though protocols often are designed with other protocols in mind to simplify the system). The lowest layer, the physical layer, has direct access to the physical communication channel. We discuss now the seven OSI layers.

Application layer. The application layer is the layer at which the application processes execute and at which users interact with the system. System and network management functions also reside at this level.
[Figure 5-12 The OSI Reference Model layers: 7 Application, 6 Presentation, 5 Session, 4 Transport, 3 Network, 2 Data Link, 1 Physical.]
Presentation layer. The lower layers under the application layer serve to allow diverse applications running on different equipment to converse with each other. This requires that the differences in the way these applications communicate must be made transparent to the applications. At the highest level, these differences may include the code used, number representation, compression of repetitive data, message structure, and terminal formats. The resolution of these differences is the responsibility of the presentation layer and may be considered the syntax of the conversation between application layers.

There are three syntax versions to be considered for the data: that syntax used by each of the two communicating application processes and that syntax provided to the presentation layer by
TASK
LAYER N+I
RESPONSE
TASK
LAYER N
RESPONSE
LAYER N
COMMUNICATION
CHANNEL
~---------------------~
Chap. 5
Layered Protocols
167
the channel comprising the lower layers. It is the responsibility of the presentation layer at each end to convert between the application layer syntax and the lower level channel syntax.
Session layer. The session layer is responsible for establishing a connection between application entities and for managing the session. Session establishment is done at the request of an application and may involve requesting an appropriate channel from the transport layer, establishing the availability of the destination application, identifying the source application for the destination, and obtaining permission from the destination application to communicate with the source application. At the end of the session, the session layer must terminate the session.

Session management includes the enforcing of the interaction between the applications, whether it be full-duplex, half-duplex, or simplex. It may be responsible for ensuring the proper sequence of messages if the lower levels might interchange, add, or delete messages. This may occur especially if the physical communication channels fail and are recovered by lower layers. If system usage is to be accounted for, this is a valid function for the session layer.

Transport layer. The session layer is aware of which application it is to talk to, but it has no knowledge of where that application is. The establishment of a channel to that application begins with the transport layer. The primary function of the transport layer is to provide a transparent means of data transfer for a session. If messages from the session layer need to be broken up into packets, then the transport layer is responsible for disassembling the message into outgoing packets and assembling incoming packets into messages (the PAD function). It interacts with the network layer to provide flow control if necessary, ensuring that no more data is passed to the network than the network can handle.

The transport layer also may be responsible for providing a certain class of service demanded by the session layer. Classes of service include priority, delay, and security specifications, as well as services such as multiple addressing.
Network layer. The network layer relieves the transport layer of any concern over how the systems are interconnected. It is the network layer that knows the topology of the system, and it is the network layer's responsibility to see that packets submitted by the transport layer are routed properly to their destinations.

There are a variety of routing mechanisms that may be used. Fixed routing defines a specific path between each pair of endpoints. Alternate routing defines a primary path for each endpoint and one or more alternate paths in case of failure or congestion on the primary path. Dynamic routing allows packets to be routed through the network along a path which makes best use of the network facilities at that instant in time.

The network layer is responsible for flow control in the network and interacts with the transport layer to restrict incoming data if it cannot be handled by the network. The network level may or may not be responsible for the proper ordering of received packets, depending upon the characteristics of the channel. It must be responsive to class-of-service requests from the transport level.
Note that the routing of a packet may carry it through several nodes in a network. How does the reference model apply to a node in which an application for this session is not active? The answer is that the network layer is responsible for routing, as shown in Figure 5-14. A message originates at an application layer and is broken into packets by the transport layer in that system. The packets then flow through the network layers of all intervening nodes until they arrive at the destination system. There the network layer passes the packets to the transport layer, where the message is reassembled and delivered to the application layer via the session and presentation layers.
Data link layer. The purpose of the data link layer is to provide virtually error-free communication across a link connecting two nodes in the network. It must do so in a manner which imposes no restriction on the data (data transparency). As such, the data link layer provides the following primary functions:

• Synchronization, so that packets (frames) may be identified.
• Error detection.
• Error correction, either through forward error correction or via retransmission.

Note that it must be assumed that a connection spans several nodes. The data link layer guarantees error-free communication only across the separate links in the connection; it cannot guarantee error-free operation across the channel. That is why higher layers are also involved in error control. The HDLC protocol which we have described earlier is a contemporary example of a data link layer.
Physical layer. The physical layer is responsible for the management of the physical communication link. This includes the electrical characteristics of the channel
[Figure 5-14 OSI multinode communication: a message leaves the source as packets; intervening nodes carry the packets only up through their data link and network layers, while the full seven-layer stack (application, presentation, session, transport, network, data link, physical) is traversed at the source and destination.]
and the mechanical specification of the connectors to the channel. It also includes the modem interfaces and the functions and usage of modem signals.
SNA

The IBM Systems Network Architecture (SNA) is a specific implementation of a layered protocol. Though its layers are somewhat different from the ISO Reference Model, there is a strong similarity. Under SNA, there are five layers. They are shown in Figure 5-15 with their general correspondences to the OSI model.

Under SNA, there is no defined physical layer; the existence of a suitable physical connection is implied.

The data link control layer is very much like the OSI data link level. SDLC, which is more or less a subset of the HDLC protocol described earlier, is specified for this level.

The path control layer provides services similar to the OSI network layer. It provides logical channels and flow control between entities known as Network Addressable Units (NAUs). Since dynamic routing is supported at this level, path control is responsible for the delivery of data units in proper order even if they are received from the data link control layer in improper order.

The transmission control layer is similar to the OSI transport layer except that it has some responsibilities related to session management. A session under SNA is a logical
[Figure 5-15 SNA layers and their OSI correspondences: application and presentation - function management data services; session - data flow control; transport - transmission control; network - path control; data link - data link control; physical - none defined.]
connection between NAUs. Transmission control is responsible for establishing and terminating a session and for maintaining the connections required during the session.

The data flow control layer is the other part of session control and corresponds to the OSI session layer. It is responsible for managing the session, including half-duplex or full-duplex data flow, for ensuring all-or-none data delivery, and for bracketing transactions so that the range of a transaction comprising several messages can be identified.

The function management data services layer provides services to the end user. It is similar to OSI's presentation layer and spills into the application layer. It provides OSI presentation services such as data reformatting and data compression. It also provides network management functions such as network reconfiguration, collection of network statistics, and fault identification and isolation.

Reference may be made to Meijer [21] for a more detailed discussion of SNA.
X.25

X.25 is a protocol that is widely used internationally. It is packet-oriented and is therefore used widely in packet-switching networks. It corresponds roughly to the OSI Reference Model layers 1, 2, and 3 (physical, data link, and network). The data link layer is a subset of the HDLC protocol described earlier. The network layer is designed specifically to interface easily with the HDLC protocol. The following description of X.25 is a condensation of material found in Stallings [25] and Meijer [21].

Under X.25, a user always communicates with a communication facility. The user's equipment is called Data Terminal Equipment (DTE), and the communication facility is called Data Circuit-Terminating Equipment (DCE). Figure 5-16 shows the use of X.25 in a packet-switched environment. Users connect their DTEs to the packet-switched network via a DCE provided by the network and communicate with that DCE over an X.25 line. The DCE is responsible for establishing a virtual connection through the switch to the destination user's DCE, which talks to the destination user's
[Figure 5-16 X.25 in a packet-switched environment. DTE = Data Terminal Equipment; DCE = Data Circuit-Terminating Equipment; PSN = Packet-Switched Network.]
DTE via an X.25 link. There is no requirement that X.25 be used within the packet-switched network itself.

Data must be submitted to the X.25 link in packets. The maximum packet length is specified; 128 bytes is typical. (It would be more accurate to use "octets" instead of bytes, as the X.25 protocol implies no structure of the data.) Packets of data are enveloped with a three-byte X.25 header shown in Figure 5-17. The fields are as follows:

• The qualifier bit, Q, which allows the distinction between data flows (for instance, information or control).
• The delivery bit, D, which requests a delivery confirmation from the destination.
• The modulo, a field that specifies whether a window size of 7 or 127 is to be used, i.e., a sequence number range of 8 or 128. It also specifies certain format extensions.
• The group field and channel number field, which, when concatenated, allow the designation of up to 4095 logical channels.
• The receive (N(R)) and the send (N(S)) sequence numbers, analogous in their use to the same HDLC fields described earlier.
• The more-data bit (M), which signifies that the current packet is an intermediate packet in the message. The last packet sets this bit to zero.
[Figure 5-17 X.25 header (three bytes): byte 1 carries Q, D, and the modulo field plus the group number; byte 2 carries the channel number; byte 3 carries N(R), M, N(S), and the D/C bit (a command field if a control packet). Q = qualifier; D = delivery; modulo = sequence number range of 8 or 128; group no. and channel no. = logical channel designation; N(R) = receive sequence number; N(S) = send sequence number; M = more data; D/C = data (0) or control (1).]
• The data/control bit (D/C), which specifies whether this is a data (information) packet or a control packet.

Control packets have the same format in the first two bytes as a data packet. However, bit 1 of the third byte is set to 1, and bits 2-8 are used as a command. The commands are shown in Figure 5-18. Their interpretation is always in pairs, depending upon whether the control packet was sent by a DTE or a DCE.

X.25 supports two types of connections. One is a permanent virtual circuit (PVC). This is a dedicated circuit that requires no call establishment or disconnection. The other is a virtual call (VC), which establishes a virtual circuit and then disconnects it at the end of the call. The establishment and disconnection procedures are shown in Figure 5-19, using some of the control packets listed in Figure 5-18.

The originating DTE first sends a Call Request packet, which is passed through the network and received by the destination DTE as an Incoming Call packet. If it can handle the call, the destination DTE responds with a Call Accepted packet, which is delivered to the originating DTE as a Call Connected packet. At this point, the two DTEs can communicate. The message transfer procedures of the HDLC protocol described earlier are used for data communication.

When one DTE is ready to clear the call, it sends a Clear Request packet, which is received by the other DTE as a Clear Indication packet. It responds by returning a Clear Confirmation packet.

It is interesting to note the multiple encapsulations of data in layered protocols, using X.25 as an example. In Figure 5-20 a data packet is shown as it is received for transmission over the physical channel. The data message (which itself may have higher-level headers and trailers) is appended with an X.25 header at the network layer. This packet is then transferred to the data-link layer, where the HDLC envelope is added (as shown in Figure 5-20). At the receiver, the data-link layer strips the HDLC envelope from the packet. The X.25 envelope is stripped by the network layer before sending the packet up the layer structure.
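To make the header layout concrete, the following Python sketch packs the third header byte of an X.25 data packet in the modulo-8 format of Figure 5-17. The sketch and its function name are ours; the field positions follow the figure:

    def x25_data_byte3(n_r, m_bit, n_s):
        """Third header byte of an X.25 data packet (modulo-8 format):
        N(R) in bits 8-6, M in bit 5, N(S) in bits 4-2, D/C = 0 in bit 1."""
        assert 0 <= n_r <= 7 and 0 <= n_s <= 7
        return (n_r << 5) | (m_bit << 4) | (n_s << 1) | 0   # low bit 0 = data packet

    # Packet 3 of a multi-packet message, acknowledging receipt through packet 4:
    b = x25_data_byte3(n_r=5, m_bit=1, n_s=3)
    print(format(b, "08b"))   # 10110110

A control packet would instead set the low bit to 1 and carry a command in the remaining bits.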
MESSAGE TRANSFER PERFORMANCE

Let us now look at some performance considerations for message transfer. In the following sections, we consider the line efficiency and transit time for both half-duplex and full-duplex message transfer procedures.
Half-Duplex Message Transfer Efficiency

The efficiency of message transfer is affected by two considerations:

1. The overhead created by the protocol for message identification and error detection.
[Figure 5-18 X.25 control packet commands, paired as DCE-to-DTE / DTE-to-DCE.
Call setup: Incoming Call / Call Request; Call Connected / Call Accepted.
Call clearing: Clear Indication / Clear Request; DCE Clear Confirmation / DTE Clear Confirmation.
Data and interrupt: DCE Data / DTE Data; DCE Interrupt / DTE Interrupt; DCE Interrupt Confirmation / DTE Interrupt Confirmation.
Flow control: DCE RR / DTE RR; DCE RNR / DTE RNR; DTE REJ (DTE only).
Reset: Reset Indication / Reset Request; DCE Reset Confirmation / DTE Reset Confirmation.
Restart: Restart Indication / Restart Request; DCE Restart Confirmation / DTE Restart Confirmation.]
2. Retransmission of messages in error.

Stallings [25] presents a very understandable description of message-transfer efficiency, which we will follow in principle. Let us make the following definitions:

U = message transfer efficiency.
tm = average time required to transmit a message (sec.).
tpro = overhead time per message imposed by the protocol (sec.).
kr = the average number of times that a message must be sent (retransmission factor) due to errors and the successful transmission.

Then the efficiency of message transfer, U, can be expressed as

U = tm / [kr(tm + tpro)]     (5-1)
[Figure 5-19 X.25 virtual call establishment and clearing between two DTEs through their DCEs and the PSN: Call Request / Incoming Call, Call Accepted / Call Connected, data transfer, then Clear Request / Clear Indication and the Clear Confirmations.]
That is, it is the ratio of the basic message time, tm, to the actual time spent delivering the message, kr(tm + tpro). Note that errors in protocol control blocks (ACK, NAK) are ignored, as these are typically small blocks less subject to error. Also note from equation 5-1 that the actual communication channel time used to send a message is tm/U.

To understand protocol overhead, let us trace the path of a half-duplex message transfer using Figure 5-21. The first step is to transmit the message; this requires a time, tm, as defined above. The message must propagate over the communication channel. To do so requires a time, tprop. It must then be processed by the receiver, tprc1, which will transmit an ACK if the message is correct. The ACK will require a transmission time of tack. Finally, the ACK will propagate to the transmitter (a time of tprop) and will be processed by the transmitter, tprc2. At this point, message transfer is complete.
[Figure 5-20 X.25 data encapsulation: the data message is wrapped with an X.25 header at the network layer, and the resulting packet is wrapped with the HDLC header and trailer at the data link layer.]
From this description, the protocol overhead time, tpro, can be expressed as

tpro = 2tprop + tprc + tack     (5-2)

where

tprop = one-way channel propagation time (sec.),
tprc = processing time for the message and for the ACK (tprc1 + tprc2),
tack = time to send the ACK.

Let us define the following term, a, such that

a = tpro/(2tm) = [tprop + (tack + tprc)/2] / tm     (5-3)

That is, a is the average protocol delay per transmission and is normalized by the message time. There are two transmissions per message transfer: one for the message itself and one for the acknowledgement. Using equation 5-3, equation 5-1 can be rewritten as

U = 1 / [kr(1 + 2a)]     (5-4)
The retransmission factor, kr, can be evaluated as follows. Let

Pb = probability of a bit error in the communication channel,
Pm = probability that a message contains one or more errors,
M = message size, in bytes, and
H = message overhead, in bytes.

[Figure 5-21 Half-duplex message transfer timing: the message (tm) incurs a propagation delay (tprop) and is processed by the receiver (tprc1), which transmits an ACK (tack); the ACK incurs a propagation delay (tprop) and is processed by the transmitter (tprc2).]
Typical values for Pb are 10^-5 for dialed lines and 10^-6 for dedicated lines. We assume that bit errors are random and independent. This is hardly the case in typical communication channels, where errors tend to occur in bursts. However, the assumption of independence is conservative and will lead to somewhat higher message error rates than will actually be observed. We also assume that our error detection algorithm is perfect.

The message overhead of H bytes includes header and trailer data which frame a message. Examples of this overhead will be given later. For synchronous transmission, the total number of bits in a message is 8(M+H). (The multiplier will be 10 or more for asynchronous transmission.) The probability that a particular bit is error-free is (1 - Pb). The probability that all bits are error-free is the probability that the message is error-free and is (1 - Pb)^[8(M+H)]. Thus,

Pm = 1 - (1 - Pb)^[8(M+H)]     (5-5)
A message must always be transmitted at least once. It will be transmitted a second time with probability Pm, a third time with probability Pm^2, etc. It will be transmitted on the average kr times, where

kr = 1 + Pm + Pm^2 + ...

or

kr = 1 / (1 - Pm)     (5-6)

Substituting this result into equation 5-4 yields

U = (1 - Pm) / (1 + 2a) = (1 - Pb)^[8(M+H)] / (1 + 2a)     (5-7a)

If the bit error probability is sufficiently small, then

(1 - Pb)^[8(M+H)] ≈ 1 - 8(M+H)Pb

and

U ≈ [1 - 8(M+H)Pb] / (1 + 2a)     (5-7b)

This approximation is valid if 8(M+H)Pb << 1. For values of Pb on the order of 10^-6 and for reasonable message sizes, this approximation holds.
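Equations 5-5 and 5-7a are easily evaluated numerically. A minimal Python sketch (the function names are ours):

    def p_message_error(p_bit, total_bytes):
        """Equation 5-5: Pm = 1 - (1 - Pb)^(8(M+H))."""
        return 1.0 - (1.0 - p_bit) ** (8 * total_bytes)

    def half_duplex_efficiency(p_bit, total_bytes, a):
        """Equation 5-7a: U = (1 - Pm) / (1 + 2a)."""
        return (1.0 - p_message_error(p_bit, total_bytes)) / (1.0 + 2.0 * a)

    # A 300-byte frame (M+H) with Pb = 1e-5 and a = 0.49, the parameters
    # of the worked example later in this section:
    print(round(half_duplex_efficiency(1e-5, 300, 0.49), 3))   # 0.493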
Full-Duplex Message Transfer Efficiency

The analysis of the efficiency of a full-duplex protocol is complicated by the fact that there are two cases to consider, both based on the extent of the protocol delay, a. To understand these cases, we must add one additional concept to our knowledge of protocols: the window, in particular, the window of size W.

As we mentioned earlier, the transmitter must be able to buffer messages so that it
can retransmit from the point at which the receiver received a message in error. Furthermore, the receiver must provide equivalent buffering if selective retransmission is used. But how much buffering is required?

The number of messages that the transmitter can get ahead of its acknowledgements from the receiver is called the window. It, in fact, is specified by the protocol, as the window size relates to the size of the message number that can be specified in an ACK or NAK message. (See Stallings [25] for a more thorough discussion of this.) A typical window size, W, is 7, though larger windows are sometimes used for channels with very long propagation times. Since the difference between the transmitter and receiver can be no greater than W messages, this is the amount of buffering required. Messages are held in the buffer until they are acknowledged, at which time they are flushed. (For selective retransmission, messages must remain in the receive buffer until all previous messages have been properly received.)

Now let us assume that the time required for the transmitter to send W messages is greater than the total propagation time to receive an acknowledgement. That is,

Wtm ≥ (1 + 2a)tm
Then the transmitter will always receive an acknowledgement to a message before it fills up its buffers with unacknowledged messages. Therefore, the transmitter can continually use the channel for transmission; efficiency, U, is affected only by retransmissions:

U = 1/kr ,     W ≥ 1 + 2a

However, if the transmission time for W messages is less than the total propagation time, then after W messages are sent, the transmitter must wait a time (1+2a)tm - Wtm before it can start sending again. This cycle will repeat every W messages. Thus, only W messages will be sent every (1+2a)tm seconds, and

U = W / [kr(1+2a)] ,     W < 1 + 2a

Restating the above results, the efficiency for a full-duplex protocol can be expressed as

U = 1/kr ,               W ≥ 1 + 2a     (5-8)
  = W / [kr(1+2a)] ,     W < 1 + 2a

It remains now to evaluate kr. For the selective retransmission case, the procedure is simple, since only the message in error needs to be retransmitted. This is equivalent to the half-duplex protocol case, and equation 5-6 holds. Thus, for selective retransmission,

U = 1 - Pm ,                W ≥ 1 + 2a     (5-9)
  = W(1-Pm) / (1+2a) ,      W < 1 + 2a
The technique in which all blocks after the block in error are retransmitted is called the Go-Back-N technique; let us evaluate its efficiency. With this technique, N messages must be retransmitted. In a manner similar to the derivation of equation 5-6, we note that a message must always be transmitted once. With probability Pm, N messages will have to be retransmitted; with probability Pm^2, N messages will have to be transmitted a second time, and so on. (We ignore here the probability of error in other than the first message being retransmitted, as this will simply initiate a new retransmission sequence.) Thus, the retransmission factor, kr, is

kr = 1 + NPm + NPm^2 + ...

or

kr = 1 + NPm / (1 - Pm)     (5-10)

For the case in which the transmission time of W messages is greater than the total propagation time to receive a message (W ≥ 1+2a), the number of messages that the transmitter leads the receiver by is N = (1+2a). This is because it will take (1+2a)tm seconds for an acknowledgement to be returned to the transmitter for a message that is just beginning its transmission. During this time, N = (1+2a)tm/tm messages will be sent, including the message for which the acknowledgement is being returned. If that acknowledgement is a NAK, then (1+2a) messages must be retransmitted.

Likewise, if W messages take less time than the round-trip time to receive the acknowledgement (W < 1+2a), then the transmitter must pause after W messages and must wait for the acknowledgement. If it is a NAK, then N = W messages must be retransmitted.

Substituting these values for N into equation 5-10 to obtain kr and then combining that result with equation 5-8 yields, for Go-Back-N retransmission,

U = (1 - Pm) / (1 + 2aPm) ,                       W ≥ 1 + 2a     (5-11)
  = W(1-Pm) / [(1 - Pm + WPm)(1 + 2a)] ,          W < 1 + 2a
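Equations 5-9 and 5-11 translate directly into code. A Python sketch in the same spirit as before (the function names are ours):

    def selective_reject_efficiency(p_m, W, a):
        """Equation 5-9: full-duplex efficiency with selective retransmission."""
        if W >= 1 + 2 * a:
            return 1.0 - p_m
        return W * (1.0 - p_m) / (1.0 + 2.0 * a)

    def go_back_n_efficiency(p_m, W, a):
        """Equation 5-11: full-duplex efficiency with Go-Back-N retransmission."""
        if W >= 1 + 2 * a:
            return (1.0 - p_m) / (1.0 + 2.0 * a * p_m)
        return W * (1.0 - p_m) / ((1.0 - p_m + W * p_m) * (1.0 + 2.0 * a))

    # With Pm = 0.0237, W = 7, a = 0.49 (the worked example below):
    print(round(selective_reject_efficiency(0.0237, 7, 0.49), 3))   # 0.976
    print(round(go_back_n_efficiency(0.0237, 7, 0.49), 3))          # 0.954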
Message Transit Time

From a system performance viewpoint, we are as interested in the transit time of a message as we are in line efficiency. Let us evaluate message transit time for the three cases discussed above. Let

t'm = message transit time (sec.), or the average time required to send a message and to have it successfully received.

For half-duplex protocols, we note that the first copy of a message passes through the receiver in a time tm(1+a). Subsequent retransmissions require a time tm(1+2a).
Since the average number of retransmissions per message is Pm/(1-Pm), then

t'm = tm(1+a) + tm(1+2a)Pm/(1-Pm)

This can be rearranged as follows:

t'm = {[(1+2a) - a(1-Pm)] / (1-Pm)} tm = [(1+2a)/(1-Pm) - a] tm     (5-12)

Using equation 5-7a, this becomes the half-duplex message transit time,

t'm = (1/U - a) tm     (5-13)

This intuitively obvious result is that the message transit time is the total round-trip message time, tm/U, minus one propagation time (since the successfully received message does not have to await an acknowledgement).

For the case of selective retransmission with a full-duplex protocol, the same argument holds. An error-free message will require tm(1+a) time to pass through the receiver. A retransmitted message occurs with probability Pm/(1-Pm) and requires a time of tm(1+2a). Thus, equation 5-12 holds. However, this cannot be written in the form of equation 5-13, since the efficiency, U, of a selective retransmission line is a different expression, one given by equation 5-9. We therefore use equation 5-12 for the full-duplex selective retransmission message transit time:

t'm = [(1+2a)/(1-Pm) - a] tm     (5-14)

For the half-duplex and selective retransmission cases, we have not been concerned with the effect of errors on other messages. When a communication line accepts a message for transmission, it is sent forthwith and is received by the receiver in a time independent of other errors. Error retransmission time affects the average service time tm/U, and thus is a factor in the communication line load. This will affect the queue delay but not the transit time for a message.

However, for the Go-Back-N protocol, the situation is not so simple. If a message must be retransmitted, all N-1 messages following it must also be retransmitted. Thus, N messages are delayed by a time tm(1+2a)Pm/(1-Pm), and the average message transit time is
t'm = tm(1+a) + N·tm(1+2a)Pm/(1-Pm)

Using the manipulation that led to equation 5-12, this can be written as

t'm = [((1+2a)/(1-Pm))(1 + (N-1)Pm) - a] tm

As was done intuitively for the half-duplex case of equation 5-13, equation 5-14 for selective retransmission can be interpreted as the round-trip delay of an average message minus one propagation time, a, since the message receipt does not have to wait for its acknowledgement. For the Go-Back-N protocol, the average round-trip message time,
including error retransmissions, is increased by N-1 retransmissions for Pm of the time. As noted earlier for this case:

N = 1 + 2a ,     W ≥ 1 + 2a
N = W ,          W < 1 + 2a

Thus, for the full-duplex Go-Back-N message transit time:

t'm = [((1+2a)/(1-Pm))(1 + 2aPm) - a] tm ,          W ≥ 1 + 2a     (5-15)
t'm = [((1+2a)/(1-Pm))(1 + (W-1)Pm) - a] tm ,       W < 1 + 2a

It is important to note that if a communication line is to be considered a server serving a queue of waiting messages, then its service time is its effective channel utilization, tm/U, not the message transfer time, t'm.
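The transit-time results (5-13), (5-14), and (5-15) can be sketched the same way. For brevity, the Go-Back-N function below covers only the W ≥ 1+2a case; the names are ours:

    def transit_half_duplex(t_m, U, a):
        """Equation 5-13: t'm = (1/U - a) * tm."""
        return (1.0 / U - a) * t_m

    def transit_selective_reject(t_m, p_m, a):
        """Equation 5-14: t'm = ((1+2a)/(1-Pm) - a) * tm."""
        return ((1.0 + 2.0 * a) / (1.0 - p_m) - a) * t_m

    def transit_go_back_n(t_m, p_m, a):
        """Equation 5-15, W >= 1+2a case."""
        return ((1.0 + 2.0 * a) / (1.0 - p_m) * (1.0 + 2.0 * a * p_m) - a) * t_m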
Let us look at a typical line and evaluate line utilization and message transit time for the above three protocols. Our example line will be a long-haul dedicated full-duplex synchronous line with the following parameters:

Pb = bit error rate = 10^-5
M+H = message length = 300 bytes
tprop = propagation time = 50 msec.
tprc = processing time = 20 msec.
S = line speed = 19,200 bits/sec. = 2,400 bytes/sec.
tack = acknowledgement packet time = 6 bytes @ 2,400 bytes/sec. = 2.5 msec.
tm = message time of 300 bytes @ 2,400 bytes/sec. = 125 msec.
W = window size for full-duplex protocols = 7

From equation 5-5, the probability of a retransmission, Pm, is

Pm = 1 - (1 - 10^-5)^[8(300)] = .0237

From equation 5-3, the average normalized protocol delay, a, is

a = [50 + (20+2.5)/2] / 125 = .49

Thus, W > 1+2a; that is, a full-duplex acknowledgement will be returned before the window is exhausted.
Using these values, Table 5-1 is constructed. It shows the line efficiencies, transit times, and communication line service times for all three protocol classes as well as for an ideal line, i.e., no delay and no errors.

TABLE 5-1. MESSAGE PROTOCOL EXAMPLE

                                          t'm Message          tm/U Line
Protocol                          U       transit time (msec.) service time (msec.)
Ideal                            1.0      125                  125
Half-duplex                      0.493    192                  254
Full-duplex selective reject     0.976    192                  128
Full-duplex Go-Back-N            0.954    198                  131
Note the relationship between the message transit time and the line service time. For a half-duplex line, the service time is much greater than the transit time, since the line must be held until an acknowledgement is received, though the message has already been delivered. For full-duplex lines, messages are not delayed for prior acknowledgements. The next message is sent before the prior message may have propagated to the receiver. Thus, service time is less than transit time.

The superiority of the full-duplex protocols is apparent. Though the message transit time is substantially the same for all cases, full-duplex protocols make use of the line twice as efficiently. Therefore, they can handle twice as much traffic as the half-duplex protocols under the conditions assumed in this example. The efficiency of the half-duplex line is further compromised by the fact that transmission can occur in only one direction at a time. Therefore, if the traffic in both directions is equivalent, the capacity of a half-duplex line is reduced by a further factor of two relative to a full-duplex line.
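Table 5-1 can be checked by combining the formulas above. A rough Python script (ours; it assumes W ≥ 1+2a, which holds for this example since W = 7 and 1+2a = 1.98):

    p_b, n_bytes, t_m = 1e-5, 300, 125.0               # Pb, M+H, tm (msec.)
    a = (50.0 + (20.0 + 2.5) / 2.0) / t_m              # equation 5-3: a = 0.49
    p_m = 1.0 - (1.0 - p_b) ** (8 * n_bytes)           # equation 5-5: Pm = 0.0237

    U = {"half-duplex":      (1.0 - p_m) / (1.0 + 2.0 * a),         # eq. 5-7a
         "selective reject": 1.0 - p_m,                             # eq. 5-9
         "go-back-N":        (1.0 - p_m) / (1.0 + 2.0 * a * p_m)}   # eq. 5-11

    t = {"half-duplex":      (1.0 / U["half-duplex"] - a) * t_m,            # eq. 5-13
         "selective reject": ((1.0 + 2.0 * a) / (1.0 - p_m) - a) * t_m,     # eq. 5-14
         "go-back-N":        ((1.0 + 2.0 * a) / (1.0 - p_m)
                              * (1.0 + 2.0 * a * p_m) - a) * t_m}           # eq. 5-15

    for name in U:   # prints U, transit time, and line service time tm/U
        print(name, round(U[name], 3), round(t[name]), round(t_m / U[name]))

The printed values match the table: 0.493/192/254, 0.976/192/128, and 0.954/198/131.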
ESTABLISHMENT/TERMINATION PERFORMANCE

The assignment of a channel to a data communication session adds additional overhead to message transmission. If the session is long, this overhead may be minor. However, for short sessions, channel establishment and termination overhead may be quite significant. Two common cases are considered by example in the following sections.

In the first example, a dedicated line connects two users who are peers. Either may try to use the line, and collisions must be resolved. This is an example of a contention protocol. The other example is for a multipoint line in which a master station polls the other users (slaves) in an orderly manner. Communication is allowed only between the master and a selected (polled) slave.

There are many other types of protocols in use. The role of master station may be transferred between users in some networks. The line may be held for a reply in some
cases but released after the transmission of a transaction in other cases. The techniques used in the following examples are extendable to these and other protocols. A more complex class of protocols are those that provide multipoint contention. These are discussed in the subsequent section about local area networks.
Point-To-Point Contention
The first procedure to be discussed is a half-duplex contention protocol on a point-to-point channel. This is a typical case in TP systems in which a terminal is connected to the TP host via a dedicated line. Usually, the terminal will initiate the transmission of a request to the host and will await a reply. However, on occasion the host may want to send an unsolicited message to the terminal. An important characteristic of this example is that the line is not heavily utilized. Therefore, if either the terminal or the host simply seizes the line whenever it wants to send, the probability of colliding is quite low.

An example of this sort of protocol is provided by the ANSI X3.28 protocol described in ANSI [2]. Subcategory 2.3 (Two-Way Alternate Nonswitched Point-To-Point) describes the contention protocol, which typically is used with the X3.28 subcategory B1 message transfer protocol (Message Associated Blocking with Longitudinal Checking and Single Acknowledgement). A simplified representation of this protocol is shown in Figure 5-22 via the "railroad" diagram used so successfully in the ANSI specification. This diagram shows the flow of control information and data in an unambiguous and compact form. Actions taken by one user are shown in blocks; actions taken by the other user are unblocked. The main flow is indicated by heavy lines.

Let us denote our two users as user A and user B. When user A wants to acquire the channel, it sends an ENQ character (1) (one of the ASCII control characters). If user B is
[Figure 5-22 Point-to-point contention protocol (railroad diagram): establishment (ENQ, answered by ACK or NAK), message transfer (message, answered by ACK or NAK, repeating for the next message), and termination (EOT).]
ready to receive the message, it responds with an ACK (2). If it is not ready, it responds with a NAK (3).

When user A receives the ACK, the channel belongs to it (this is the establishment procedure). User A then can send one or more messages (4), pausing at the end of each to await an acknowledgement from the receiver (5). If a NAK is received (6), the message must be retransmitted. When user A is finished, it releases the channel by sending an EOT character (7). This termination procedure completes the sequence. At this point, either user may seize the channel.

Of course, if both users attempt to seize the channel at the same time by sending simultaneous ENQ characters, a collision occurs. The contention is resolved by having both users wait different times for a response before giving up and trying again. The user with the shorter time-out wins.

The probability of a collision can be determined as follows. As defined earlier, let
tprop = channel propagation time between the two users.

Also let

Ra = message rate of user A (messages/sec.).
Rb = message rate of user B (messages/sec.).
ta = delay time for user A after a collision.
tb = delay time for user B after a collision.
tca = channel acquisition time for user A (sec.).
tcb = channel acquisition time for user B (sec.).
One user (say user A) attempts to seize the channel by sending an ENQ. On either side of its seizure attempt time, there is a critical time slot, 2tprop, during which user B may seize the channel and cause a collision. If user B sends an ENQ before user A but within one propagation time, user A will not see it before it sends its ENQ. Likewise, user B will not see user A's ENQ for one propagation time and may send its ENQ during that time.

The probability that user B will collide with a user A seizure attempt is 2tprop·Rb. Likewise, the probability that user A will collide with a user B seizure attempt is 2tprop·Ra.

In the event of a collision, one user will win the time-out race. Let us say that it is always user A. That is, ta < tb.

We now can calculate the channel acquisition times for user A and user B. In either case, a wait time of 2tprop always occurs as the ENQ propagates to the other end and as the
ACK is returned. With probability 2tprop·Rx (x = a, b), a collision occurs. User A must wait a time ta before trying again and is guaranteed success. Thus, the user A channel acquisition time, tca, is

tca = 2tprop + (2tprop·Rb)(ta + 2tprop)

or

tca = 2tprop[1 + Rb(ta + 2tprop)] ,     ta < tb     (5-16)

tca can be added to the message time, which is calculated according to the procedures in the previous section. This gives total message time from a performance viewpoint. Total line utilization is obtained by also adding in the termination time required to send an EOT, which is simply another tprop time.

Following a collision, user B must wait user A's delay time plus its message time. It then must reacquire the channel. Let
t'ma = message transit time for user A.

Also let

ka = number of messages sent by user A.

Then user B's channel acquisition time, tcb, is

tcb = 2tprop + (2tprop·Ra)(ta + ka·t'ma) + (2tprop·Ra)(2tprop)

The first term is user B's first attempt at channel seizure. The second term represents the delay of user B's request while user A seizes the channel (ta + ka·t'ma), which occurs with probability 2tprop·Ra. The third term represents the second acquisition attempt after user A has released the channel. This occurs with probability 2tprop·Ra and takes a time 2tprop.

tcb can be written as

tcb = 2tprop[1 + Ra(ta + 2tprop + ka·t'ma)] ,     ta < tb     (5-17)

As an example, consider the following case:
tprop = channel propagation time = 50 msec.
Ra = 1 message per 30 seconds (transaction).
Rb = 1 message per 30 seconds (reply).
ta = user A delay time = 200 msec.
ka = number of user A messages = 1.
t'ma = user A message time = 1 second (say, a 300-byte message over a 2400-bit/sec. line).

Then,

tca = user A channel acquisition time
    = .101 seconds.

tcb = user B channel acquisition time
    = .104 seconds.

Since the ideal time is a round-trip propagation time of 0.1 seconds, the contention protocol behaves well in this case.
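Equations 5-16 and 5-17 and this example can be verified with a few lines of Python (our own sketch; the argument names are ours):

    def t_ca(t_prop, R_b, t_a):
        """Equation 5-16: user A's channel acquisition time (A wins time-outs)."""
        return 2 * t_prop * (1 + R_b * (t_a + 2 * t_prop))

    def t_cb(t_prop, R_a, t_a, k_a, t_ma):
        """Equation 5-17: user B's channel acquisition time."""
        return 2 * t_prop * (1 + R_a * (t_a + 2 * t_prop + k_a * t_ma))

    print(round(t_ca(0.050, 1/30, 0.200), 3))            # 0.101 seconds
    print(round(t_cb(0.050, 1/30, 0.200, 1, 1.0), 3))    # 0.104 seconds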
Multipoint Poll/Select

Another common configuration in TP systems is a multipoint channel in which terminals are polled by the TP host. Typical of these protocols are the ANSI X3.28 Subcategories 2.4 and 2.5 and IBM's bisync (binary synchronous, or BSC) protocol. Both are quite similar, and a simplified representation of the bisync protocol is shown in Figure 5-23.

This protocol supports two functions. One is polling, in which the host invites one of the terminals to send a message. To do so, the host sends a poll sequence (1) comprising the terminal's identification in upper-case characters followed by an ENQ character. If the terminal has no message to send, it returns a NAK (2). If the terminal has a message to send, it sends that message (3) (according to some message-transfer protocol not shown). Following either event, the host terminates the process by sending an EOT character (4). At this point, the host is free to initiate its next function.

The other protocol function is that of selection, which is used to send a message to a terminal. To do so, the host sends a selection sequence (5) comprising the terminal iden-
ESTABLISHMENT POLL A
A D D R E S S
E N
MESSAGE . TRANSFER
Q
(3)
(I)
SELECT
a
d d r
e s
E
N
Q
(7)
MESSAGE TRANSFER (8)
5(5)
!'ipre5-23 MultipoiDt poDIselect.
TERMINATION
g-0
T
(4)
tification in lower-case letters followed by an ENQ. If the terminal is not ready to receive the message, it returns a NAK (6). If it is ready, it returns an ACK (7), and the host then sends the terminal its message (8). Again, the message transfer protocol is not shown.

An alternate selection sequence is called fast select. With fast select, the terminal is assumed to be ready at any time. The selection sequence (5) and message (8) are sent together, with no intervening ACK (7) or NAK (6) being returned by the terminal. After either event, the host sends an EOT (4) to terminate the process.

From a performance viewpoint, the interesting question is the frequency of negative polls (a poll with no response) and the consequent poll cycle time. Martin [19] presents a solution to this as follows. Let

Q = number of terminals which have a message to send and which are awaiting a poll.
V = total number of terminals on the line.
P(N) = probability that the Nth terminal being polled is the first to respond with a message.

The probability that the first terminal to be polled has a message is

P(1) = Q/V     (5-18)
The probability that the second terminal is the one to respond is the probability that the first terminal has no message and that the second terminal does. Thus,

P(2) = (1 - Q/V) · Q/(V-1) = (Q/V) · (V-Q)/(V-1) = P(1) · (V-Q)/(V-1)

Similarly,

P(3) = (1 - Q/V)(1 - Q/(V-1)) · Q/(V-2) = (Q/V) · (V-Q)/(V-1) · (V-Q-1)/(V-2) = P(2) · (V-Q-1)/(V-2)

In general,

P(N) = P(N-1) · (V-Q-N+2)/(V-N+1)     (5-19)
Let

M = the number of terminals that must be polled in order to get a message.
We know that in the worst case, the first V-Q terminals will be idle and that the first terminal with a message will be terminal V-Q+1. Therefore, the average number of terminals that must be polled to get the first message, M, is

      V-Q+1
M  =    Σ    N·P(N)     (5-20)
       N=1

where P(1) is given by equation 5-18 and P(N) is given by equation 5-19. It can also be shown that

 V-Q+1
   Σ    P(N) = 1
  N=1

as would be expected.

Equation 5-20 gives the establishment procedure overhead for polling, since (M-1) negative polls are associated with each message. It is based on the assumption that at least one terminal is always active. What about a lightly loaded line which only occasionally sees a message from a terminal, and when there is a message, there is only one terminal on the line with a message? In this case, we can expect the number of negative polls to be about half the total number of terminals. More precisely, the probability that any one terminal will have the message is 1/V. The average number of terminals that will be polled to get the message once the message is ready is, from equation 5-20,

       V
M  =   Σ   N(1/V) = (1/V) · V(V+1)/2 = (V+1)/2     (5-21)
      N=1
In order to determine whether to use equation 5-20 or 5-21, and in order to evaluate equation 5-20, we need to know the number of waiting terminals, Q. Q will depend upon the line load, which depends upon the negative poll load, which depends on Q. This circular relationship requires that we guess initially at Q, make a calculation, and perhaps iterate to obtain more accurate results. One way to approach this is to use the M/G/1 queuing model as an initial guess. If this results in a value of Q less than 1, use the simpler equation 5-21. Otherwise, use equation 5-20.

Let us illustrate this with an example. Consider a line for which

V = total number of terminals = 8.
Transaction rate = 1 transaction/15 seconds/terminal.
Transaction message time = 1 second, a constant, sent in response to a poll.
Response message time = 0.5 seconds, a constant, sent via fast select.
tpoll = time to send a poll or select sequence = 30 msec.
tprop = channel propagation time = 15 msec.
Terminals send a transaction message of fixed length (say, 300 bytes at 2400 bits/sec., requiring one second to transmit as assumed). The reply is also a fixed-length message (say, 150 bytes at 2400 bits/sec., requiring 0.5 seconds). A terminal sends its transaction to the host in response to a poll. The host responds later with a reply via a select sequence. Every transaction is followed by one and only one reply.

As an initial cut, the load on the line is calculated using just the transaction and reply load:

L = 8(1 + .5)/15 = .8

The distribution coefficient for the M/G/1 model is given by equation 4-18:

K = (1/2) · E(T^2)/T̄^2 = (1/2) · [(.5)(1)^2 + (.5)(.5)^2] / [(.5)(1) + (.5)(.5)]^2 = .56

The average number of terminals, Q, waiting for the line (including the one being serviced, if any) is, from equation 4-6:

Q = [L/(1-L)] · [1 - (1-K)L] = 2.6

Since Q must be an integer, let us use Q = 3. Equation 5-20 applies, which yields

M = (.375) + 2(.268) + 3(.179) + 4(.107) + 5(.054) + 6(.018) = 2.25
=
TAllLEN. POWNG EXAMPLE
LoadU.> IunI:ioD
Q
lIIiIial
FiDal
1 2 3 4
3 S
.80
Z1
,g']
4
,as
.as
4
.86
.86 .86
II 2.2S 1.50 . ·1.80 1.80
The iteration has converged Dicely, n:sultiDg in I.S· polls on the avenge being requiIed to receive a message (.8 negative .polls plus.ODe successful poll).
Establishment and termination times now directly follow for this example. To establish the channel for an incoming (polled) transaction requires 1.8 poll times (which include the message propagation times by our definition). This is 1.8 x 60 = 108 msec. Thus, to receive our 1-second message actually requires 1.1 seconds due to polling. Termination is effectively transparent, as the EOT termination character can be simply piggybacked onto the next poll or select sequence.
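The polling arithmetic of equations 5-18 through 5-20 is easily scripted. A minimal Python sketch with our own function name:

    def polls_per_message(V, Q):
        """Equations 5-18 to 5-20: average number of polls to get the first
        message when Q of the V terminals are waiting with a message."""
        p = Q / V                      # P(1), equation 5-18
        m = 1 * p
        for n in range(2, V - Q + 2):  # N runs from 2 to V-Q+1
            p *= (V - Q - n + 2) / (V - n + 1)   # equation 5-19
            m += n * p
        return m

    print(round(polls_per_message(8, 3), 2))   # 2.25, the first iteration above
    print(round(polls_per_message(8, 4), 2))   # 1.8, the converged value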
LOCAL AREA NETWORK PERFORMANCE

The connection of user terminals to a TP host via a local area network (LAN) is rapidly gaining popularity, especially if the terminal population is fairly close geographically to the host. Also, remote terminal clusters may be interconnected by a LAN and then connected to the host by a LAN gateway. A gateway is another device on the LAN which passes LAN data to the host and which distributes host data to the LAN. We now consider techniques for managing data on a LAN.

A local area network is a high-speed data link generally comprising twisted-pair wires, coaxial cable, or a fiber optic link (Figure 5-24a). It may span a building, a college campus, or an industrial complex. Users attach directly to the LAN and can communicate with any other user of the LAN according to a specific protocol. In fact, when one user transmits, all other receivers receive that transmission. The transmitted data contains addressing information identifying the intended receiver(s).

In this section we discuss certain protocols that are popularly used for LANs. These protocols tend to combine the establishment/termination procedures and message transfer procedures of classical protocols and for this reason are treated separately. It is noted that the family of IEEE 802.x standards describe contemporary LAN
protocols.
Multipoint Contention: CSMA/CD
In a previous section we considered point-to-point contention protocols, in which either one of the two terminals could attempt to seize the line and transmit. In the event of a collision, procedures were established so that each terminal would back off and try again at different times to resolve the contention. There is an important class of multiuser local area networks in which all users contend for the network. Various procedures are used to resolve conflicts.

One of the popular LAN contention protocols is a protocol known as CSMA/CD, which stands for Carrier-Sense Multiple-Access with Collision Detection. Under this protocol, any user of the LAN can transmit at any time so long as three rules
are obeyed. The first two rules are:

1. The user first listens to the network to see if it is idle. If the network is busy, the user waits a random time and tries again.
2. If the channel is idle, then the user can transmit his data.
[Figure 5-24 CSMA protocol. (a) Users on a LAN. (b) Collision window: with user A a distance d1 from user B and user C a distance d2 from user B, packet A is transmitted at t0, and a packet B transmitted within t0 ± (d1 + d2) collides with it.]
This protocol in its basic form is called CSMA. CS stands for carrier sense; that is, the user listens before talking. MA means multiple access: many users can use the network. The protocol is enriched if a transmitter can monitor the line while it is transmitting. In this case, if another user starts transmitting, both transmitters will read the garbled signal on the line and will stop transmission without wasting further valuable channel time. This third rule adds collision detection to the protocol, giving the protocol the name CSMA/CD. The third rule can be restated thus:
3. While traDSmiaiog. DlODitor the chaJmel. Jfthe signal is garbled, 1raDsmit a short jamming burst to eIISIIEe 1hat all users detect the collision. and then stop traDs'pissioD. Wait a random time, aDd try again.
Collision detection is useful because even listening to the channel before transmit-
ting does not guarantee collision-free operation. This is illustrated in Figure 5-24b, where three users are shown on a LAN. User A is a distance d1 from user B, and user C is a distance d2 from user B. For our purposes, we will measure d1 and d2 in terms of channel propagation time; i.e., d1 is the time it takes for a signal to propagate from user A to user B. Now let us assume that user A transmits a packet at time t0. It will not reach user C until time t0 + d1 + d2. Therefore, user C can legitimately decide to send a packet during this time, as it has not yet heard user A's transmission. In fact, user C can send a packet at any time during the interval t0 ± (d1 + d2), using the reverse argument for the minus sign. Should user C do so, a collision will result. All users will eventually see a jamming signal.

The evaluation of collision probabilities and the resulting LAN performance have been heavily analyzed and are thoroughly covered in Hammond [9]. Only the results are given here. Let
S = the rate of packets successfully transmitted, per packet transmission time.
G = the rate of packets offered to the network (successful transmissions plus retries), per packet transmission time.
a = normalized end-to-end propagation delay of the channel, defined as the ratio of the end-to-end propagation delay to the packet transmission time; in short:

    a = (end-to-end LAN propagation time) / (packet transmission time)
Note that S, G, and a are all normalized to the packet transmission time, which is assumed fixed. Is the term a significant? Are propagation delays significant relative to packet time? Hammond [9] notes that signals propagate over a coaxial cable at about .65 of the speed of light. This is about 5 nanoseconds (5 x 10^-9 seconds) per meter. To travel a kilometer (a typical LAN length) requires about 5 microseconds. A typical LAN data rate is 10 megabits/second. Thus, signal propagation time is about 50 bit times. A user at one end of the LAN can send 50 bits before a user at the other end will hear it. Electronic delays in the transmitters and receivers can be even more significant. Thus the ratio a is the primary factor in LAN performance.
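As a quick check of these magnitudes, a few lines of Python reproduce the propagation numbers above. The 1,000-bit packet length is an assumed illustrative value, not a figure from the text:

```python
# The chapter's coaxial-cable example, as arithmetic.
PROP_SPEED = 0.65 * 3.0e8          # ~.65 of the speed of light (meters/sec)
LAN_LENGTH = 1000.0                # one kilometer, a typical LAN length (m)
DATA_RATE = 10.0e6                 # typical LAN data rate (bits/sec)
PACKET_BITS = 1000.0               # assumed packet length (bits)

prop_delay = LAN_LENGTH / PROP_SPEED        # ~5 microseconds end to end
bits_in_flight = prop_delay * DATA_RATE     # ~50 bit times
a = prop_delay / (PACKET_BITS / DATA_RATE)  # normalized propagation delay

print(f"propagation delay = {prop_delay * 1e6:.1f} usec")
print(f"bits in flight    = {bits_in_flight:.0f}")
print(f"a                 = {a:.3f}")       # ~0.05 for these assumptions
```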
Note that G/S is the average number of transmissions required per packet. This is the number we would like for performance analysis. Unfortunately, as we shall see below, LAN analysis is not so kind to us, since the expressions do not lend themselves to obtaining G/S. Computer computation is required. Hammond [9] gives the following results for the CSMA and CSMA/CD protocols described above, assuming random arrivals:

CSMA:

    S = G e^{-aG} / [G(1 + 2a) + e^{-aG}]    (5-22)
CSMA/CD:

    S = G e^{-aG} / [G e^{-aG} + jaG(1 - e^{-aG}) + 2aG(1 - e^{-aG}) + (2 - e^{-aG})]    (5-23)

where

    j = J / t    (5-24)

and

J = jamming interval (seconds).
t = propagation time between stations (sec.) (assumes stations are equidistant).
Another case to be noted is a simpler protocol in which the transmitter does not listen before sending. In this case, collisions, of course, will be significantly more frequent. This simpler protocol is used in some radio networks and is called the Aloha protocol after the first major network in which it was used (in Hawaii, of course). For this case:
Aloha:

    S = G e^{-2G}    (5-25)
In none of these expressions can we easily find G/S, the probability of retransmission that we would like to have. In fact, we usually know S (the traffic to be carried) and must find G, just the opposite of what these expressions allow. The obvious technique is to calculate tables or graphs and to use them to find G for a given value of S. To some extent, Hammond [9] presents graphs that could be quite helpful.

However, the graphical representations of S vs. G expose another interesting problem. A typical S-G relationship is shown in Figure 5-25. Note that initially, as system throughput S increases, the network load G increases linearly. Then network load starts to rise at a faster rate as retransmissions occur. Ultimately, the network cannot support the level of retransmissions, and throughput S falls off even as network load increases further. The network is thrashing. In fact, at any given throughput value, S, there are two network load values, G, one for proper operation and one for thrashing. Two points are to be made:
1. Use the lower value of G to obtain the probability of retransmission G/S, since it must be assumed that the network is not thrashing. (A sketch of this calculation follows below.)
2. Don't operate the network near its peak throughput S, as it may go over the hump and start thrashing.
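Since equations 5-22, 5-23, and 5-25 cannot be inverted in closed form, G must be found numerically for a given S. The following is a minimal sketch of that root finding for the simplest case, the Aloha expression (5-25); bisection on the non-thrashing branch (below the throughput peak at G = 0.5) implements point 1 above, and the same approach applies to the CSMA expressions:

```python
import math

def aloha_load(S):
    """Solve S = G * exp(-2G) (equation 5-25) for the lower root G.

    The lower root corresponds to the non-thrashing branch (point 1
    above); Aloha throughput peaks at G = 0.5, so we search 0..0.5,
    where the throughput function is strictly increasing.
    """
    if S > 0.5 * math.exp(-1.0):          # peak throughput, 1/(2e) ~ 0.184
        raise ValueError("S exceeds Aloha capacity")
    lo, hi = 0.0, 0.5
    for _ in range(60):                   # bisection to high precision
        mid = (lo + hi) / 2.0
        if mid * math.exp(-2.0 * mid) < S:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

S = 0.10                                  # throughput to be carried
G = aloha_load(S)
print(f"G = {G:.4f}; transmissions per packet G/S = {G / S:.2f}")
```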
Token Ring

A token ring is another organization of a local area network. To use this protocol, the LAN must be structured as a closed ring, as shown in Figure 5-26a.
[Figure 5-25 Performance characteristics of multipoint contention protocols: throughput S vs. network load G.]
Stations connect to the ring in such a way that they can listen to traffic on the ring as it passes by and modify that traffic on the fly. To do this, each data packet is necessarily delayed in the station-connection equipment, typically by a few bits. For instance, data may be passed through a shift register of sufficient length (Figure 5-26b) to allow the station to do the following:

• Interpret data entering from the ring.
• Modify the data before retransmitting.
• Insert data before retransmitting.
• Delete data to prevent retransmission.
Data packets circulating around the ring incur two types of delays:

1. The propagation time over the link between stations
2. The station latency, or delay incurred as the data circulate through the station connection
[Figure 5-26 Token ring: (a) users on a ring LAN; (b) station connection: a shift register through which ring data passes, allowing the station to interpret, modify, insert, or delete data.]
In the simple (and most popular) token ring, a single token circulates around the ring. It is, in fact, a small packet of control data that includes, among other things, a free/busy indicator. As it circulates, each station in turn can receive it, evaluate it, and pass it on to the next station. On an idle ring, the token is marked as free. If a station has a message to send, it waits for the idle token. When it receives the idle token, the station marks it busy and adds its data to the end of the token. It also fills in an address field in the token. The busy token with data now circulates around the ring and is read by each station. When a station reads a busy token, it checks to see if the data is addressed to itself. If so, it captures the data. In either event, the busy token is passed on to the next station.
Finally, the token returns to the station that sent the data. That station sets the token to idle and deletes the data from the token as it passes by. An approximate pass at token ring performance can be taken as follows: Let

t_prop = propagation time between stations
t_lat = station latency time
The time to pass the token from one station to the next is t_prop + t_lat. This time is, in every respect, a poll cycle as previously described. The only difference is that the system is self-polling rather than requiring a master station to do the polling. As described earlier, of the V stations on the network, Q will be waiting with data to send. On the average, M stations will be polled in order to find a terminal with a message. M, Q, and V are related by equations 5-20 or 5-21, as appropriate.

Once a sending station has been found, the data packet must travel, on the average, halfway around the ring to its destination. That is, if there are V users, V - 1 of which are possible receivers, then the probability that any particular user will be the receiver for a packet is 1/(V - 1). The average number of users that the packet must pass before the receiver is found is

    sum_{i=1}^{V-1} i/(V - 1) = [1/(V - 1)][V(V - 1)/2] = V/2
The packet must pass all stations and return to its transmitter before the token is freed.
Thus, the total network time to send the next message, given that one is available, is the sum of the polling time, M(t_prop + t_lat), and the transmission time, V(t_prop + t_lat). Let the rate of arrival of messages to be transmitted be R messages per second. In a busy system, the rate of message transmission over the network will equal the arrival rate of messages to the network:

    (M + V)(t_prop + t_lat) = 1/R    (5-26)

where

R = message rate (messages per second)

Equation 5-26 can be solved for the average number of terminals, M, that must be polled:

    M = 1/[R(t_prop + t_lat)] - V    (5-27)
If equation 5-27 indicates a value of M greater than (V + 1)/2 (a lightly loaded network), then equation 5-21 governs, and M is taken as (V + 1)/2. In this case, there is only one transaction on the ring awaiting service. On the average, the token will pass by half the terminals to find the message and half the terminals to deliver the message. The total time to deliver the message from the time it was submitted, ts, is
    ts = V(t_prop + t_lat),  M >= (V + 1)/2    (5-28)
where
ts = message delay time.
If equation 5-27 indicates a value of M less than (V + 1)/2, then the number of messages, Q, waiting for service on the LAN is obtained from equation 5-20. (An iterative calculation is required.) Though service is round-robin, it is conservative to assume first-in, first-out servicing. In this case, message delay time is

    ts = Q(M + V)(t_prop + t_lat) - (V/2)(t_prop + t_lat),  M < (V + 1)/2    (5-29)
The first term represents the time to pass the idle token to the next active device for each of the Q messages in line and then to pass the busy token completely around the ring. The second term represents the fact that the message in question needs to pass only V/2 terminals to reach its receiver. Using equation 5-26, equation 5-29 can be rewritten as

    ts = Q/R - (V/2)(t_prop + t_lat),  M < (V + 1)/2    (5-30)
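The token ring results can be collected into one short routine. This is a sketch of equations 5-27 through 5-30 only; the queue length Q for the heavily loaded case must come from equation 5-20's iterative calculation and is taken here as an input:

```python
def token_ring_delay(R, V, t_prop, t_lat, Q=1.0):
    """Average message delay ts on a token ring (equations 5-26 to 5-30).

    R      -- message arrival rate (messages/sec)
    V      -- number of stations on the ring
    t_prop -- propagation time between adjacent stations (sec)
    t_lat  -- station latency time (sec)
    Q      -- messages waiting for service (from equation 5-20);
              used only in the heavily loaded case
    """
    t_hop = t_prop + t_lat
    M = 1.0 / (R * t_hop) - V            # equation 5-27
    if M >= (V + 1) / 2.0:               # lightly loaded: equation 5-28
        return V * t_hop
    return Q / R - (V / 2.0) * t_hop     # heavily loaded, FIFO: equation 5-30

# Illustrative values only: 50 stations, 5 usec propagation,
# 10 usec station latency, 5 messages/sec offered.
print(token_ring_delay(R=5.0, V=50, t_prop=5e-6, t_lat=1e-5))
```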
The material contained in this chapter, though voluminous, hardly scratches the surface of communication technology. However, it provides us with the basics required to understand and evaluate performance of communication channels, protocols, and networks. In-depth discussions of many more networks and standards may be found in the various references cited, including Kleinrock [15] on ARPANET, Meijer [21] on standards and protocols, Hammond [9] on local networks, and Stallings [25] on networks.
6 Processing Environment
Now that we can receive a request from a user and have a mechanism for returning a reply, how do we get from here to there? The computer, viewed as the central processing unit (the CPU), its memory, its I/O ports, and its operating system, is the transaction-processing engine that provides the environment in which the application programs can process a request and can return a reply. In this chapter we review issues related to hardware and operating system design. Of particular importance to performance analysis is contention created in multiprocessor systems and thrashing caused by operating system activities.

For purposes of analysis, we will consider the processing environment in terms of two levels: the physical level and the operating system level. At the physical level, the CPU, memory, I/O controllers, and, in the case of a distributed system, the interprocessor or intercomputer bus act together to execute the instructions submitted by the operating system and by the application programs. In many performance analyses, the detailed characteristics of the physical level are hidden by, and in fact are taken into account by, the operating system. However, it is often desirable to deal at the physical level, and we will view some of the performance issues of that level.

The role of the operating system level is to hide the physical characteristics of the computer from the user and to provide a much more comfortable (if less efficient) environment for the development of new applications. As modern operating systems become more powerful and do more for the application programmer, they often exact a greater toll in
performance. The performance of many TP systems today is limited more by operating system characteristics than it is by the hardware or by the application programs.

PHYSICAL RESOURCES

The physical resources with which we should be concerned are the CPUs, memories, I/O controllers, and interconnecting bus structures that make up a computer. Actually, a single computer architecture does not create a very interesting exercise for the performance analyst. We will, therefore, take as an example a multiprocessor distributed system and will look at its performance characteristics as the load imposed upon it increases. This example is structured to illustrate performance modeling techniques for a variety of hardware considerations. Once understood, these techniques can then be used by the analyst as a foundation for modeling other multiprocessor systems.

Much of the material for the following multiprocessor example was taken from studies performed for Concurrent Computer Corporation. The author would like to express his appreciation to Concurrent Computer for its permission to use this material.

A typical multiprocessor system is shown in Figure 6-1. Though this is a fairly simple representation of such a system, it highlights the performance issues with which we will want to deal. In this system, several processors are connected via a single full-duplex (i.e., two-way) bus to a plurality of memory units. Each processor also handles certain I/O devices on behalf of the entire system. Several manufacturers currently offer systems with multiprocessor architectures, including Concurrent Computer (Tinton Falls, New Jersey), Stratus (Marlboro, Massachusetts), and Sequoia (Marlboro, Massachusetts).
[Figure 6-1 A typical multiprocessor system: several CPUs, each with a cache and I/O controllers handling communication lines, connected by a full-duplex bus (W and R paths) to multiple memory units.]
Some of these systems can become quite large and may include several hundred processors. Let us add some characteristics to our hypothetical system to give us something to analyze.
Processors

The industry today likes to characterize processor power in MIPS, or millions of instructions per second. Unfortunately, this is not a very meaningful term. First of all, there is a big difference between what is done with a 16-bit instruction and what is done with a 32-bit instruction. Second, in today's sophisticated architectures with variable-length instructions, pipelining, and so on, the actual rate of instruction execution is very much a function of the program being run.

Rather than predicting how fast a given program will run, we will look at performance degradation under various conditions compared to the system's ideal performance. This will be done by defining a processor performance factor, Ppf, which is unity under ideal conditions and which decreases as system load causes performance degradation. More about this later. We will accept from the designers that the average execution rate for a processor is M instructions per second:

M = processor speed (instructions per second)
As shown in Figure 6-1, each processor contains, in addition to its CPU, a cache memory (to make more efficient use of the main memory resource) and one or more I/O controllers.
Cache Memory

The cache memory is a small high-speed memory (generally faster than main memory) that is used to store the more recent data accessed by the CPU. The hope is that most data or instructions that the program will need to access will be found in cache. This has proved to be quite effective in practice, as programs tend to operate within small loops of instructions and within small local areas of data for significant periods before moving on.

When data or instructions must be moved into cache, they must overwrite information currently in cache. The decision of what information to overwrite is made by the cache controller. Usually, cache information is aged; the information that has not been used for the longest time is chosen to be overwritten.

Since cache memories tend to be implemented with higher speed components than main memories, they are more costly and therefore smaller. Cache memories tend to be in the 4 to 64 kilobyte range, whereas main memories tend to be in the range of 1 to 64 megabytes, a thousand times bigger. For purposes of our analysis, we will assume that our cache memory has the following characteristics:

• It has an access time of 100 nanoseconds (one nanosecond is one billionth of a second).
• Its word length is 32 bits, or 4 bytes.
• Data is moved to it from memory in 4-word blocks (16 bytes).
• Data that the CPU writes is written to cache memory as well as to main memory, in whatever increments of whole words the CPU desires, up to a maximum of four words. This is called cache write-through and ensures that main memory is always up-to-date.
• While data is being read into cache or written through cache to main memory, the processor is paused.

Note that to the extent that the CPU has to access main memory, system performance will degrade. Not only is main memory slower, but there also is contention for its access from other processors. A processor will have to access main memory to read data not in its cache and to write data. Therefore, to the extent that the program can be written to be highly modular and to make extensive use of local registers and stacks to write temporary results, the system will perform better. This is often the responsibility of an optimizing compiler.

When the CPU finds its data in cache, this is called a cache hit. If it has to access main memory to read data, this is called a cache miss. It is virtually impossible to predict what the average cache miss ratios will be, since cache activity is highly dependent upon the structure of the particular program being run. However, the performance analyst can predict the effects on performance caused by cache misses. This prediction can then serve as valuable information for the designers who are sizing cache, writing compilers, and so on. We will make these predictions in the following analysis. For the time being, we can characterize the performance of the cache memory by a single parameter, the cache miss ratio, Pm:
Pm = cache miss ratio (the probability that a read will require a main memory access).

It might be noted that the cache memory described above is quite simple in today's technology. Some systems do not write through cache, thus saving significant main memory time. Rather, the cache memories of all processors are considered an extension of main memory, and the "ownership" of any piece of data is known to the memory system. If one processor wants a data item that is currently in another processor's cache, it receives it from that cache memory.
I/O System

Each processor may be responsible for handling I/O transfers between certain devices, such as terminals, disks, printers, etc., via I/O controllers connected to that processor. In some architectures, certain processors are designated as I/O processors (perhaps even as data-base processors, communication processors, etc.), and others are designated as application processors, which perform no I/O but which act upon data received by the I/O processors and which return data to them. In other architectures, all processors are amorphous; they may do anything. At a first level of approximation, this sort of structure can
be ignored, as each processor is simply a member of the user population for the common memory resource.

I/O operations are typically either programmed I/O or direct access I/O. Under programmed I/O, data is written to and read from a device, a character at a time, under direct programmed control. This is usually used for very low-speed devices or to send commands to or receive status from an I/O device. For purposes of analysis, this load will be ignored. It is small and can be considered part of the application program.

Direct access I/O occurs when a device reads data directly from memory or writes it directly to memory without program intervention. This is often referred to as a DMA (direct memory access) transfer. The program, of course, must initiate the transfer and must be notified of its completion (usually via a hardware-generated interrupt). Direct access is usually used for all high-speed devices (disks, tapes, high-speed communication lines, line printers) and can represent a significant load on the system.

For purposes of our model, let us assume the following I/O characteristics:

• Data is transferred in direct access mode between main memory and a device in 4-word (16-byte) blocks (the same as for cache reads).
• The cache memory logic is used to accomplish the memory transfers.
• There are no freewheeling input devices; i.e., the rate of all input data flow can be controlled to prevent data loss in high-load situations. (This usually implies that the I/O controllers have sufficient buffering to accommodate any outstanding read requests.)
The I/O system can then be characterized simply by the composite data rates from all devices. Let

D_i = data input rate from all devices (bytes/sec.)
D_o = data output rate to all devices (bytes/sec.)
These data rates are taken as those that would occur in an ideally operating system (Ppf = 1). However, since I/O transfers are initiated by the processors, it is assumed that actual I/O rates drop off with processor performance and are, in fact, Ppf D_i and Ppf D_o.
Bus
A high-speed bus interconnects the processors with the main memory modules. This bus is a full-duplex bus comprising an R-bus (Read) and a W-bus (Write). The W-bus is used to send the address of the data to the memory. For write operations, it also then sends the data to memory. The R-bus is used to receive data read from memory during a read operation, the address of this data having been supplied previously on the W-bus. Each bus is 32 bits wide (4 bytes) plus control signals. This is sufficient to send a full address or a full data word on each bus cycle. Thus, to read a block from memory, one W-bus cycle is required to send the address; four R-bus cycles are required to read the four-word block. To write a block requires two to five W-bus cycles: one for the address and one for each data word to be written (up to four words).
The bus speed is 100 nsec. (nanoseconds) per word. This means that each bus has a capacity of 40 megabytes per second, for a composite speed of 80 megabytes per second. Some of this capacity, though, is used for addressing.

The bus is a common resource for the system. Processors must contend for it to send addresses and data to memory, and memory units must contend for it to return data to the processors. There are several contention schemes that could be used, including round-robin and priority schemes. This system uses a time-varying priority contention algorithm that increases the priority of a processor or memory as it waits. Thus, a user who has waited awhile is more likely to obtain the bus than one who has just arrived. This algorithm approximates first-in, first-out (FIFO) servicing.

Several questions can be addressed relative to the performance of the bus:

• Is it a significant bottleneck in the system?
• Is it balanced, or is one side (R or W) heavily used and the other lightly used? Should costs, therefore, be reallocated to make one side faster or wider than the other?

To summarize the bus characteristics:

• The bus is full-duplex (W-bus and R-bus).
• Each path is 32 bits wide (4 bytes) plus control data.
• Bus cycle time is 100 nsec.
• A write requires one W-bus cycle to send the address and one to four W-bus cycles to send the data.
• A read requires one W-bus cycle to send the address and four R-bus cycles to receive the data.
Main Memory
Main memory comprises one or more memory modules that operate independently. Each module provides, let's say, two megabytes of storage. Up to 32 memory modules can be provided, for a total of 64 megabytes of memory. Each memory module requires 400 nsec. to set up an address and 100 nsec. per word to read or write data. Thus, a read operation requires 800 nsec., and a write requires 500 to 800 nsec., depending upon its length. The 4-word block is read from memory before any data is returned to the requesting processor via the R-bus.

A queue with a capacity of eight items is provided in each memory module. Each item is five words in length and can hold anything from a read command and address to a write command with its address and data. As long as there is room in the queue, the memory can accept an access command. If the queue is full, the command is rejected, and the requesting processor must back off and try again later. Back-off time is typically a few microseconds. If queue overflows are frequent, back-off time might be a performance
factor. If back-off time is long, processors will be delayed; if it is short, bus load and consequently queue delays for the bus will increase.

Summarizing main memory characteristics:

• Each memory module provides two megabytes of storage.
• Up to 32 memory modules may be configured in the system.
• Data is accessed one word at a time.
• Memory timing is 400 nsec. to set up and 100 nsec. per word to read or write data.
• A queue of eight items is provided to buffer incoming memory commands.
Note that since memory speeds and bus speeds are identical, the memory could request the R-bus as soon as it had accessed the first word, thus saving 300 nsec. Many common memory systems do this. We assume a full read here to simplify the analysis so as to more clearly present the underlying analytic principles.
Processor Performance Factor

We alluded previously to a processor performance factor, Ppf. Let us now define it. If a processor were running totally without access to main memory, i.e., all reads were from cache or from its own internal registers and stack, and all writes were to its internal registers and stack, then it would be operating at maximum speed. This speed has been defined as M instructions per second; instructions are being executed every 1/M seconds.

However, to the extent that main memory must be accessed, additional overhead is incurred, which slows the system down. On the average, each instruction will be slowed by a time, T_i, so that instructions are actually being executed every T_i + 1/M seconds. The processor performance factor is defined as the ratio of the ideal instruction execution time to the actual execution time:
    Ppf = (1/M) / (T_i + 1/M) = 1 / (M T_i + 1)    (6-1)
T_i is a direct function of the number of reads and writes that must be made to main memory. Since each of these main memory accesses may be delayed due to other system activity, T_i is also a function of that other activity. In our example, that activity is generated by other processors and I/O devices. Let

r = proportion of instructions which are reads.
w = proportion of instructions which are writes.
t_r = time to complete a read to main memory (μsec., i.e., one millionth of a second).
t_w = time to complete a write to main memory (μsec.).
p_m has been defined as the cache miss ratio, i.e., the probability that a read will have to be passed to main memory. Since T_i is the main memory read and write delay averaged over all instructions, it may be expressed as

    T_i = p_m r t_r + w t_w    (6-2)
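Equations 6-1 and 6-2 are simple to evaluate once the read and write completion times are known. A minimal sketch follows; the completion times used in the example call are placeholders, since in the full model they come from the queuing analysis developed below:

```python
def processor_performance_factor(M, p_m, r, w, t_r, t_w):
    """Evaluate equations 6-1 and 6-2.

    M   -- processor speed (instructions/sec)
    p_m -- cache miss ratio
    r,w -- fractions of instructions that are reads / writes
    t_r -- time to complete a read to main memory (sec)
    t_w -- time to complete a write to main memory (sec)
    """
    T_i = p_m * r * t_r + w * t_w        # added delay per instruction (6-2)
    return 1.0 / (M * T_i + 1.0)         # Ppf                          (6-1)

# 3-MIPS processor, 10% cache misses on 30% reads, 10% writes,
# hypothetical 1 usec read and 0.5 usec write completion times:
print(processor_performance_factor(3e6, 0.1, 0.3, 0.1, 1e-6, 0.5e-6))
```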
Traffic Model

A traffic model showing the progression of a read or write access through the system is shown in Figure 6-2. The processor (1) first waits for access to the W-bus (2) so that it can send its command and address word, and the data if this is a write command. When the processor is granted access to the W-bus (3), it sends its data to the appropriate memory queue (4), where it awaits action by the memory. However, if the memory queue is full, the request is rejected (5), and the processor will have to try again later. If the request is accepted, and if it is a write access, the write operation is now complete; the processor can continue.
[Figure 6-2 Multiprocessor traffic model: a processor's request waits for the W-bus, queues at a memory module (or is rejected if the queue is full), and, for reads, returns via the R-bus.]
If this is a read access, the processor must wait for the memory module (6) to work through its queue of work, access the data, and send it back to the processor via the R-bus. To do this, the memory module must wait for access to the R-bus (7), then send the data via the R-bus (8) to the processor (9). At this point, the processor can proceed.

Queuing and service times that are identified in Figure 6-2 are

t_qw = waiting time for the W-bus (μsec.).
t_sw = average service time for the W-bus (μsec.).
t_swr = time to send a read command on the W-bus (μsec.).
t_sww = time to send a write command on the W-bus (μsec.).
t_qm = queuing time for the memory (μsec.).
t_sm = service time for the memory (μsec.).
t_qr = queuing time for the R-bus (μsec.).
t_sr = service time for the R-bus (μsec.).
p_a = probability that memory queue is full (abort).
t_a = retry time (μsec.).
One complication in this model is the W-bus retry possibility, which is reflected by the probability p_a. If the request must be retried, the processor must wait a time, t_a, before trying again. Obviously, when it tries a second time, it will again fail with probability p_a. In fact, it will have to try a second time with probability p_a, a third time with probability p_a^2, and so on. The total number of tries is 1 + p_a + p_a^2 + ..., or 1/(1 - p_a). The total number of retries is one less, or p_a/(1 - p_a). Thus, the read and write time delays can be expressed as

    t_r = (t_qw + t_swr + p_a t_a)/(1 - p_a) + t_qm + t_sm + t_qr + t_sr    (6-3)

    t_w = (t_qw + t_sww + p_a t_a)/(1 - p_a)    (6-4)
Note that W-bus queue times and memory queue times are independent of reading or writing, since both reads and writes are mixed in the queue. However, W-bus service time is different for reads and writes (one cycle for a read, two to five for a write); these are therefore expressed separately. W-bus queue time will depend on the average service time, which is

    t_sw = (R t_swr + W t_sww)/(R + W)    (6-5)
where R and W are the read and write access rates, including both processor activity and I/O activity. Since the rates of read and write processor accesses are p_m r M and w M, respectively, and since the rates of read and write I/O accesses are D_o/16 and D_i/16, respectively (16 bytes per access for I/O), then

    R = p_m r M + D_o/16    (6-6)

    W = w M + D_i/16    (6-7)
Note that the subscripts o and i for data transfers reflect output and input relative to the device. Device outputs (D_o) are reads from memory and writes to the device, and device inputs (D_i) are reads from the device and writes to memory.
Performance Tools

It now remains for us to evaluate the terms in Figure 6-2. The analytical process is complex, as the traffic model does not fit any of our queuing models. It is true that the servicing order of all queues is substantially FIFO (first-in, first-out). Since there are a limited number of users in each population (the processors on the one hand contending for the W-bus and memory, and the memories on the other hand contending for the R-bus), it
seems that our finite population models ought to be useful. And this is, in fact, true for the multiple memories contending for the R-bus. However, the processor-to-memory path is a tandem queue. A request must first wait for the W-bus. Then, if it is a read, it must wait for the memory. How does one define the availability time, T_a, and the service time, T, needed to calculate the service ratio, z, for a read operation? Furthermore, R-bus service times are constant. Memory and W-bus service times are probably more constant than random. How do we handle these?

The answer to this is not apparent to the author. We could give up and look for another job. Or we could use the best tools we have for the job, invoking our cloak of devout imperfectionism. In this case, the tools we can always fall back on are the Khintchine-Pollaczek relations. We will use these to characterize the delays in the system. If there are not many processors or memories in the system, at least the results will be conservative, since they should predict queues that are somewhat larger than the actual ones. On the other hand, if there are many processors or memories (say twenty or more), these relations will yield predictions that will be quite good.

Note that though the processors are sending data to multiple memories and the memories are sending data to multiple processors, these are both single-server situations. This is because each block of data is being routed to a specific memory or processor. Memories and processors are not load-sharing servers in this environment.
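Each queuing point in the traffic model will therefore be characterized by the Khintchine-Pollaczek waiting-time relation of chapter 4. As a small helper for what follows (notation as in equations 6-12, 6-16, and 6-20):

```python
def kp_wait(k, L, t_s):
    """Khintchine-Pollaczek waiting time, t_q = k * L * t_s / (1 - L).

    k   -- distribution coefficient (1.0 for random service times,
           0.5 for constant service times)
    L   -- load on the server, 0 <= L < 1
    t_s -- average service time
    """
    if not 0.0 <= L < 1.0:
        raise ValueError("load must be in [0, 1)")
    return k * L * t_s / (1.0 - L)
```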
Performance Model

W-Bus. The W-bus service time for read operations is a single cycle to send an address word:

    t_swr = 0.1 μsec.    (6-8)
For a write operation, the average service time is a function of the distribution of write lengths. Let w_i be the probability that a processor write is i words long (1 <= i <= 4). Then

    t_sww = [wM sum_{i=1}^{4} w_i(0.1 + 0.1i) + (0.5)D_i/16] / (wM + D_i/16)    (6-9)
where care has been taken to average in I/O writes. It is assumed that the distribution of W-bus retry service times is the same as the originally imposed service times. Average W-bus service time is t_sw, as given by equation 6-5. The distribution coefficient, k_w, as defined by Khintchine and Pollaczek, is

    k_w = [R(0.1)^2 + wM sum_{i=1}^{4} w_i(0.1 + 0.1i)^2 + (0.5)^2 D_i/16] / [2(R + W)t_sw^2]    (6-10)
(Note the use of equation 6-7 in the numerator to account for I/O load.) The load on the W-bus, L_w, is the rate of accessing of that bus multiplied by the time of each access. Since accessing will slow down as the processors slow down, the load on the W-bus decreases proportionately with the processor performance factor, Ppf. However, W-bus load will increase with retries due to memory queue overflow. W-bus load is expressed as

    L_w = Ppf P(R + W)t_sw / (1 - p_a)    (6-11)
where

P = number of processors

Thus, the waiting time for the W-bus is

    t_qw = k_w L_w t_sw / (1 - L_w)    (6-12)
Memory. The memory average service time is similar to that of the W-bus service time. A read requires a fixed 0.8 μsec., and a write requires 0.4 + 0.1i μsec., where i is the number of words to be written. The average memory service time, t_sm, is

    t_sm = [0.8R + wM sum_{i=1}^{4} w_i(0.4 + 0.1i) + (0.8)D_i/16] / (R + W)    (6-13)

The distribution coefficient, k_m, is

    k_m = [R(0.8)^2 + wM sum_{i=1}^{4} w_i(0.4 + 0.1i)^2 + (0.8)^2 D_i/16] / [2(R + W)t_sm^2]    (6-14)
The load on the memory is

    L_m = Ppf P(R + W)t_sm / S    (6-15)

where

S = the number of memories in the system (storage devices)
The queuing delay for memory is

    t_qm = k_m L_m t_sm / (1 - L_m)    (6-16)
Note that we have assumed equal load across all memories. Usually, in systems such as these, there is not enough data to suggest any more detailed an allocation.
R-Bus. The R-bus service time is constant at 0.4 μsec.:

    t_sr = 0.4    (6-17)

The distribution coefficient is

    k_r = 0.5    (6-18)

since the service time is constant.
R-bus load is

    L_r = 0.4 R Ppf P    (6-19)
and the queue delay is

    t_qr = k_r L_r t_sr / (1 - L_r)    (6-20)
Memory Queue Full. Finally, we can approximate the memory-queue-full probability. For random arrivals and random service times, the probability that a queue will exceed n items is L^{n+1}, where L is the load on the server (this includes the item being serviced, which still takes queue space in this memory system). Since the service time of the memory is far from random, this result will be conservative. Therefore, it is conservative to state that the memory retry probability, p_a, is

    p_a = L_m^9    (6-21)

since a memory queue of eight items is provided.
Model Summary

The model we have generated for this distributed processing system is summarized in Table 6-1 (parameter definitions) and Table 6-2 (equations). Note that iterative calculation is required to obtain the results, since the load imposed on the memory system is a function of that same load. As the load increases, response time increases. The processors slow down, thus reducing the load. This shows up via the term Ppf, the processor performance factor we are attempting to calculate. Ppf is a function of T_i, which is a function of queue delays, which are functions of loads, which are functions of Ppf. Consequently, this set of results does not lend itself to manual calculation; it must be calculated by computer.

Note the similarity to the approach taken in chapter 4 for finite populations. There, system load was calculated as a function of delay time as given by equations 4-109 and 4-110, in a manner analogous to what we have done above. Since the finite population solution was not available to us because of the difficulty of establishing an availability time, T_a, we have accomplished an adequate result by using a similar technique.
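A sketch of such a computer calculation follows. It iterates the equations of Table 6-2 to a fixed point in Ppf using the input values of Table 6-1; the damping factor and the saturation guards are implementation choices, not part of the model:

```python
def solve_multiprocessor(P, S=4, M=3.0, p_m=0.1, r=0.3, w=0.1,
                         D_i=0.0, D_o=0.0, t_a=3.0,
                         w_dist=(0.4, 0.4, 0.0, 0.2)):
    """Fixed-point iteration of the model of Table 6-2.

    Times in usec; rates per usec (so M = 3.0 means 3 MIPS).
    w_dist gives the probabilities that a processor write is 1..4 words.
    """
    R = p_m * r * M + D_o / 16.0                       # read rate    (6-6)
    W = w * M + D_i / 16.0                             # write rate   (6-7)
    wM, io_w = w * M, D_i / 16.0
    lens = [(i + 1, p) for i, p in enumerate(w_dist)]  # (words, prob)

    t_swr = 0.1                                                       # (6-8)
    t_sww = (wM * sum(p * (0.1 + 0.1 * i) for i, p in lens)
             + 0.5 * io_w) / (wM + io_w)                              # (6-9)
    t_sw = (R * t_swr + W * t_sww) / (R + W)                          # (6-5)
    k_w = (R * 0.1**2 + wM * sum(p * (0.1 + 0.1 * i)**2 for i, p in lens)
           + 0.5**2 * io_w) / (2 * (R + W) * t_sw**2)                 # (6-10)
    t_sm = (0.8 * R + wM * sum(p * (0.4 + 0.1 * i) for i, p in lens)
            + 0.8 * io_w) / (R + W)                                   # (6-13)
    k_m = (R * 0.8**2 + wM * sum(p * (0.4 + 0.1 * i)**2 for i, p in lens)
           + 0.8**2 * io_w) / (2 * (R + W) * t_sm**2)                 # (6-14)
    t_sr, k_r = 0.4, 0.5                                   # (6-17), (6-18)

    ppf, p_a = 1.0, 0.0
    for _ in range(200):                       # iterate to a fixed point
        L_w = min(ppf * P * (R + W) * t_sw / (1 - p_a), 0.999)        # (6-11)
        t_qw = k_w * L_w * t_sw / (1 - L_w)                           # (6-12)
        L_m = min(ppf * P * (R + W) * t_sm / S, 0.999)                # (6-15)
        t_qm = k_m * L_m * t_sm / (1 - L_m)                           # (6-16)
        L_r = min(0.4 * R * ppf * P, 0.999)                           # (6-19)
        t_qr = k_r * L_r * t_sr / (1 - L_r)                           # (6-20)
        p_a = L_m ** 9                                                # (6-21)
        t_r = ((t_qw + t_swr + p_a * t_a) / (1 - p_a)
               + t_qm + t_sm + t_qr + t_sr)                           # (6-3)
        t_w = (t_qw + t_sww + p_a * t_a) / (1 - p_a)                  # (6-4)
        T_i = p_m * r * t_r + w * t_w                                 # (6-2)
        new_ppf = 1.0 / (M * T_i + 1.0)                               # (6-1)
        if abs(new_ppf - ppf) < 1e-9:
            break
        ppf = 0.5 * (ppf + new_ppf)            # damping aids convergence
    return ppf

for n in (2, 4, 8, 12):
    print(n, round(solve_multiprocessor(P=n), 3))
```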
Using this model, one could ask several questions about the projected performance of the system:
1. How many processors can be supported by the bus?
2. What is the optimum processor-to-memory ratio? (i.e., at what point does memory speed start to have an impact on performance?)
3. What is the effect of cache misses?
4. What is the effect of writes?
5. How does I/O affect performance?
6. Is memory queue overflow a factor?
7. Are the buses evenly loaded?
The model has been evaluated for a typical set of conditions to show how these questions could be answered in a real analysis. Certain input parameters are assumed in order to allow calculation. Unless modified as a calculation parameter, the input values shown in Table 6-1 have been used to create the results. This table assumes that each processor executes at 3 MIPS. Thirty percent of
TABLE 6-1. MULTIPROCESSOR PARAMETERS

Input parameters:

Parameter   Meaning                                                     Value
D_i         Data input rate from all devices (bytes/sec.)               0
D_o         Data output rate to all devices (bytes/sec.)                0
M           Processor speed (instructions/sec.)                         3 x 10^6
P           Number of processors                                        varies
p_m         Cache-miss ratio                                            0.1
r           Proportion of instructions that are reads                   0.3
S           Number of memory units                                      4
t_a         Write bus back-off time (μsec.)                             3
w           Proportion of instructions that are writes                  0.1
w_i         Proportion of all processor write accesses that are         w_1 = 0.4, w_2 = 0.4,
            i words in length (1 <= i <= 4)                             w_3 = 0.0, w_4 = 0.2

Calculated parameters:

j       Resource index (used below): j = m (memory), r (R-bus), w (W-bus)
k_j     Distribution coefficient for resource j
L_j     Load on resource j
p_a     Probability of memory queue full
Ppf     Processor performance factor
R       Total processor and I/O read rate
T_i     Average delay time per instruction due to main memory access (μsec.)
t_qj    Queuing delay for resource j (μsec.)
t_r     Time to complete a read to main memory (μsec.)
t_sj    Service time for resource j (μsec.)
t_swr   Time to send a read command on the W-bus (μsec.)
t_sww   Time to send a write command on the W-bus (μsec.)
t_w     Time to complete a write to main memory (μsec.)
W       Total processor and I/O write rate
TABLE 6-2. MULTIPROCESSOR MODEL

    Ppf = 1/(M T_i + 1)    (6-1)
    T_i = p_m r t_r + w t_w    (6-2)
    t_r = (t_qw + t_swr + p_a t_a)/(1 - p_a) + t_qm + t_sm + t_qr + t_sr    (6-3)
    t_w = (t_qw + t_sww + p_a t_a)/(1 - p_a)    (6-4)
    t_sw = (R t_swr + W t_sww)/(R + W)    (6-5)
    R = p_m r M + D_o/16    (6-6)
    W = w M + D_i/16    (6-7)
    t_swr = 0.1 μsec.    (6-8)
    t_sww = [wM sum_{i=1}^{4} w_i(0.1 + 0.1i) + (0.5)D_i/16]/(wM + D_i/16)    (6-9)
    k_w = [R(0.1)^2 + wM sum_{i=1}^{4} w_i(0.1 + 0.1i)^2 + (0.5)^2 D_i/16]/[2(R + W)t_sw^2]    (6-10)
    L_w = Ppf P(R + W)t_sw/(1 - p_a)    (6-11)
    t_qw = k_w L_w t_sw/(1 - L_w)    (6-12)
    t_sm = [0.8R + wM sum_{i=1}^{4} w_i(0.4 + 0.1i) + (0.8)D_i/16]/(R + W)    (6-13)
    k_m = [R(0.8)^2 + wM sum_{i=1}^{4} w_i(0.4 + 0.1i)^2 + (0.8)^2 D_i/16]/[2(R + W)t_sm^2]    (6-14)
    L_m = Ppf P(R + W)t_sm/S    (6-15)
    t_qm = k_m L_m t_sm/(1 - L_m)    (6-16)
    t_sr = 0.4    (6-17)
    k_r = 0.5    (6-18)
    L_r = 0.4 R Ppf P    (6-19)
    t_qr = k_r L_r t_sr/(1 - L_r)    (6-20)
    p_a = L_m^9    (6-21)
all instructions are reads, and 10 percent are writes. The cache-miss ratio for reads is 10 percent. Average write length is two words (the weighted average of the w_i's). No I/O is occurring except for the special I/O calculation shown below. Four memory units are provided unless this parameter is used as an input variable.

Figure 6-3 shows system performance as processors and memory are added. Note that with only 1 memory, the system is effectively limited to 6 processors (giving the power of 3.2 processors). With 8 memories, performance starts to flatten around 12 processors. As a rough statement, one could say that each memory can support about four processors.
[Figure 6-3 System performance (effective processor power) vs. number of processors (N) for 1 to 8 memories and for an infinite number of memories; the ideal linear curve is shown for reference.]
At this level, Ppf is running about 0.6. The addition of an extra processor gives less than 0.5 processor improvement in service and is hardly worth it. Beyond 8 memories and 12 processors, the system is bus-limited (note that an infinite number of memories gives about the same performance as 8). We have just answered the first two questions.

Figure 6-4 shows processor performance as a function of the cache-miss ratio, p_m. Above 5 percent, performance drops dramatically. Thus, compilers, cache size, and hardware architecture should be aimed at cache-miss ratios smaller than 5 percent. Our third question is now answered.

Figure 6-5 shows processor performance as a function of the write ratio for a varying number of processors. This is a very sensitive factor; for even 6 processors, write ratios in excess of 10 percent impose significant load. Thus, the answer to our fourth question.

Figure 6-6 shows the effect on system performance as I/O rates increase. I/O is assumed to be split evenly between input and output. Performance drops fairly uniformly with I/O rates. The system loses approximately a half processor of capacity with every 10 megabytes/sec. of I/O. So goes our fifth question.
[Figure 6-4 Effect of cache misses: processor power vs. number of processors for cache-miss ratios from .00 to .40, with 1 memory per processor.]
For any reasonable operation, memory modules should not be loaded beyond 80 percent. In this case, the probability of queue overflow from equation 6-21 is (.8)^9 = .13 and is not a significant factor. Our sixth question is answered.

Finally, Figure 6-7 shows the loading on the buses as a function of the number of processors. Our seventh question is answered. The loads on the R-bus and W-bus are reasonably balanced, at least for this mix of instructions.

It should be apparent that some pretty tough questions can be answered by a model that is fairly approximate. Notice that no attempt was made to give highly precise answers. The physical system being modeled is just too complex to ever allow this. However, the general statements that can be made are quite powerful, as can be seen.

How accurate are these results? We never know until we have a physical system to measure, and this particular one will probably never be built (after all, it was just a hypothetical system). However, my experience and that of many others in the field has
[Figure 6-5 Effect of writes: relative performance vs. write ratio (writes/total instructions) for 1 to 6 processors, with no read activity and no I/O.]
"been quite encouraging in terms of modeling accmacy (see Martin [20] and the a-Hte case study pzesented in chapter 11). Once agaiIi, it is better to be able to make a Je8SODably useful SIatemeDt about pezformance tban to make DO SIatemeIlt at an.
OPERATING SYSTEM

Unlike the physical environment that we have just discussed, the operating system environment does not lend itself to an example that can be solved in its entirety. Rather, the operating system provides certain environmental tools that the application processes use to perform their functions. Therefore, we will deal with an understanding of those tools in
[Figure 6-6 Heavy I/O: processor power vs. number of processors (N) for DMA I/O rates of 20 to 80 megabytes/sec., with 8 memories.]
"this section and will apply them to different application architectures in the fonowmg chapters. 'I'here are six operating system functions tbat typically bave an impact and often need to be CODSideled in a pe.rformanc::e analysis: • task dispatching
• • • •
interprocess messaging memory management I/O tnmsfen OIS-initiated actions
• IeSOUrCe
locking
Task Dispatching
When a process is ready to run, it must be placed in a queue (or Ready List) to await its turn to use the processor. The processor is the server, and the processes are the users in a classical queuing system. The number of processes running in a TP system is usually large enough to qualify as an infinite population; at least, that will be the case assumed herein.
"
":
W-BUS LOAD _ _-R-BUS
-----
o
2
---
4
LOAD
--- --- --
6
8
- - - PPF
10
PROCESSORS· , I
I
Of course, if a specialized application has only a few processes, then the finite queuing system model can be used.*
"
o
Furthermore, the processing time required by the various processes varies so much that the assumption of randomly distributed processing times should be quite valid.

The time spent in the Ready List, or processor queue, is called the dispatch time and is referred to as td in these discussions. There is a second component of task dispatching time, and that is the operating system processing time required to switch processes, called context switching. This involves placing the current but expiring process on some list, removing the new current process from the Ready List, switching memory maps, modifying the processor environment registers, and so on, i.e., doing all the things required to put the old process to rest and to start up the new process. Depending upon the amount of hardware support available, context switching can take anywhere from a few microseconds to a few milliseconds; one to two milliseconds is typical. For purposes of performance analysis, this is a fixed time that can be bundled into the application process time and is handled throughout the text in this manner.

The time spent waiting for the processor, dispatch time, is affected by two
*A more precise solution to this dispatching problem is given in Appendix 6, where it is shown that a reasonable approximation is obtained by simply ignoring the load imposed on the processor system by a given process when calculating its dispatch time.
other considerations. One is whether the application is running in a single computer or multicomputer environment (or in certain multiprocessor environments) in which it can only run in one processor. This is the classic single-server case, and queue delays are determined from the M/M/1 model. Alternatively, the application may run in a load-sharing multiprocessor environment in which, when it reaches the head of the Ready List, it is serviced by the next available processor out of c processors. This environment is a multiserver environment described by the M/M/c model.

The other consideration is that of priorities. Most TP systems provide many priority levels for processes (256 priorities is not atypical). If there is a processor load at higher priorities, it must be taken into account by the proper model, depending upon whether the operating system is preemptive or not (i.e., if it is preemptive, an executing process can be preempted by a higher priority process).

Within this framework, the calculation of dispatch time is straightforward. It is a "bean counting" exercise in which the dispatch rate and average running time for each process are set forth. A process will be dispatched typically on every I/O completion, whether it be the receipt of data to process, the completion of data that was sent, or the receipt of an interprocess message. Of course, I/O operations are often no-waited, or asynchronous (depending upon the manufacturer's terminology), which means that the process does not pause just because it has initiated an I/O operation but continues to do other work. In this case, actual dispatching may occur less frequently. On the other hand, a process may be dispatched simply because a time-out that it has specified has occurred. In any event, determining the dispatch rate and the average execution time for a process requires a thorough understanding of the particular application and is a subject of the system description which should precede each performance analysis.

Given that the performance analyst has done the necessary homework and has established the appropriate parameters for each process (its dispatch rate as a function of load, its average service time, and its priority), one can compute the following parameters. Let process i running at priority p have a dispatch rate of n_ip and an average service time of t_sip. Then the process dispatch rate at priority p, n_p, is

    n_p = sum_i n_ip    (6-22)

The processor load imposed by all processes running at priority p is

    L_p = sum_i n_ip t_sip    (6-23)

The average service time of processes running at priority p is the total processing time per second divided by the number of dispatches per second at priority p:

    t_sp = (sum_i n_ip t_sip) / (sum_i n_ip) = L_p / n_p    (6-24)

The processor load imposed by all processes running at a higher priority than p is

    L_h = sum_{q>p} sum_i n_iq t_siq    (6-25)

and the average dispatch rate for these processes, n_h, is

    n_h = sum_{q>p} sum_i n_iq    (6-26)

Consequently, the average service time for those processes executing at a priority higher than p, t_h, is

    t_h = L_h / n_h    (6-27)

The above has calculated average service time for processes at priority p and for processes with a priority greater than p. The average service time, t_s, for processes at priority p and higher is

    t_s = (L_p t_sp + L_h t_h) / (L_p + L_h)    (6-28)

This is true because, of all process executions at priorities p or higher, L_p/(L_p + L_h) will be processes executing at priority p with an average service time of t_sp, and L_h/(L_p + L_h) of these processes will be executing at a higher priority with an average service time of t_h.

For all processes running at all priorities, the dispatch rate, n_t, load, L_t, and service time, t_t, are

    n_t = sum_q sum_i n_iq    (6-29)

    L_t = sum_q sum_i n_iq t_siq    (6-30)

    t_t = L_t / n_t    (6-31)

For a system in which only one processor can execute the above processes (single computer, multicomputer, or certain multiprocessor systems), the dispatch time in general is the queue time, T_q, taken from the M/M/1 (PP) model for a preemptive operating system and from the M/M/1 (NP) model for a nonpreemptive system. From chapter 4 and Appendix 2:

1. Preemptive single server

    t_d = (L_p + L_h)t_s / [(1 - L_p - L_h)(1 - L_h)]    (6-32)

2. Nonpreemptive single server

    t_d = L_t t_s / [(1 - L_p - L_h)(1 - L_h)]    (6-33)

For a multiprocessor load-sharing system with c processors:

3. Preemptive multiserver

    t_d = (L_p + L_h)^c p_0 t_s / {c(c!)[1 - (L_p + L_h)/c]^2 (1 - L_h/c)}    (6-34)

4. Nonpreemptive multiserver

    t_d = L_t^c p_0 t_s / {c(c!)(1 - L_t/c)[1 - (L_p + L_h)/c](1 - L_h/c)}    (6-35)

In the above equations,

    p_0^{-1} = sum_{n=0}^{c-1} (L')^n/n! + (L')^c / {[1 - L'/c]c!}    (6-36)

and L' = L_p + L_h for the preemptive case and L_t for the nonpreemptive case. Note that L_p, L_h, and L_t in these equations comprise the total system load rather than the average server load, as used in Appendix 2 and chapter 4. Of course, in all cases of a preemptive system, the actual process service time, which is added to the dispatch time to obtain full delay time, must be divided by (1 - L_h) to account for preemptive processing by the higher priority processes (see Appendix 2 and chapter 4). If the system is a single priority system, L_h in the above equations becomes zero, with the corresponding simplifications.
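As a minimal sketch of this bean counting for the single-server cases (equations 6-22 through 6-33), with processes given as (priority, dispatch rate, service time) triples:

```python
def dispatch_time(procs, p, preemptive=True):
    """Dispatch time at priority p on a single processor (6-32 or 6-33).

    procs -- list of (priority, dispatch_rate, service_time) tuples,
             where a higher number means a higher priority.
    Loads L_p + L_h must be below 1 (an unsaturated processor).
    """
    n_p = sum(n for q, n, t in procs if q == p)                # (6-22)
    L_p = sum(n * t for q, n, t in procs if q == p)            # (6-23)
    t_sp = L_p / n_p                                           # (6-24)
    L_h = sum(n * t for q, n, t in procs if q > p)             # (6-25)
    n_h = sum(n for q, n, t in procs if q > p)                 # (6-26)
    t_h = L_h / n_h if n_h else 0.0                            # (6-27)
    t_s = (L_p * t_sp + L_h * t_h) / (L_p + L_h)               # (6-28)
    if preemptive:                                             # (6-32)
        return (L_p + L_h) * t_s / ((1 - L_p - L_h) * (1 - L_h))
    L_t = sum(n * t for q, n, t in procs)                      # (6-30)
    return L_t * t_s / ((1 - L_p - L_h) * (1 - L_h))           # (6-33)

# Illustrative values: 50 dispatches/sec of 5-msec work at priority 1,
# plus 20 dispatches/sec of 2-msec work at higher priority 2.
procs = [(1, 50.0, 0.005), (2, 20.0, 0.002)]
print(dispatch_time(procs, p=1))
```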
Interprocess Messaging

Contemporary TP applications are organized as autonomous processes, each with its own scope of responsibility and all passing data to each other via messages. In some cases, these interprocess messages can represent a significant portion of the load on a TP system.

There are several ways in which the messaging facility may be implemented. All are suitable for distributed systems, but one (the mailbox) is suitable only for single computer or multiprocessor systems. These techniques are described briefly below. However, the only result of practical interest to the performance analyst is the bottom-line time required to pass a message from one process to another.
Global message network. With this implementation, any process can send a message to any other process in the system without any specific effort on the part of one process to establish a path to the other process. This technique is generally applied to multicomputer systems. All the sending process needs to know is the name of the receiving process. The operating system knows the name of all processes in the system and their
whereabouts. It assumes the responsibility for the message, usually moving it into a system-allocated buffer. It then routes it over the bus (or network, if necessary) to the computer in which the receiving process is running and queues it to the message queue for that process. Even if the receiving process is running in the same computer as the sending process, this full procedure is often followed, except that the bus transfer is null; i.e., shortcuts are not taken. This type of messaging facility is used by Tandem.

Directed message paths. In other implementations, there are no general messaging facilities provided by the operating system. Rather, it is the responsibility of one process to establish a message path to another process via operating system facilities. Once established, the operating system knows of this path, and message transfer is similar to that used for global messaging. This philosophy is found in the UNIX pipe structure and is used by Syntrex (Eatontown, New Jersey) in its distributed word-processing product.

File system. The TP file system can also be used to pass messages between processes. A message file can be opened by two processes and can be used by one process to write messages to the other. The receiving process is alerted to the receipt of a message via an event flag and can read that message from its file. On the surface, this can sound very time-consuming: writing to and reading from disk. However, disk transfers are cached in memory (in a cache similar to the memory cache described in the previous section). If messages are read shortly after they are written, they will still be in memory, and the message time is equivalent to the above techniques. If they are not read for awhile, they are flushed to disk to free up valuable memory space. Since the file system allows transparent access to all files across the system, this messaging concept supports distributed systems. This technique is used by Stratus in its multicomputer system.

Mailboxes. Mailboxes are like message files except that they reside in common memory. They are adaptable only to single-computer or multiprocessor systems, since all processes must have direct access to the mailbox memory. Since there need be no physical movement of the message as with the other techniques, message transfer with mailboxes can be much faster.

Message transfer in multicomputer systems tends to be quite time-consuming because of multiple physical transfers of the message from application space to system space to the bus to a different system space and back to application space. Typical transfer times are measured as a few milliseconds to tens of milliseconds. Direct memory transfer of messages in multiprocessor systems can be significantly faster, especially when mailboxes are used. Typical transfer times are measured in tenths of milliseconds.
In any event, the time required to pass messages between processes can usually be bundled in with the process service time for the sending and receiving processes.

Memory Management
Most TP systems today provide a virtual memory facility in which there is little relation between logical memory and physical memory. In principle, many very large programs can execute in a physical memory space much smaller than their total size. This is accomplished by page swapping, as discussed in chapter 2. When a process requires a code or data page that is not physically in memory, the operating system declares a page fault, suspends that program, and schedules the required page to be read into physical memory, overwriting a current page according to some aging algorithm. When the page has been swapped in, the suspended program is allowed to continue.

Page fault loads are very difficult to predict and analyze; but for the performance analyst, there is an easy out. Page faulting is so disastrous to system performance that we typically assume it does not exist. If it becomes significant, the cure is to add more memory (if possible). Though this sounds like a cop-out, it is not without merit. If a system does not have enough memory, it will begin to thrash because of page faulting. This sort of thrashing will rapidly bring a TP system to its knees. Contemporary wisdom and experience indicate that page faulting should not exceed one to two faults per second.

Overlay management is another technique for memory management and is controlled by the application program. It is less flexible than page management but avoids the thrashing problem (assuming that overlaid programs are not also running in a paged virtual memory environment). An application process is considered to have a root segment that is always in memory and one or more overlay areas. It is free to load parts of its program into its overlay area when it deems fit. When the application process makes such a request, it is suspended until the overlay arrives and is then rescheduled. The impact of overlay calls is simply the added overhead of the disk activity and the additional process dispatching, both of which can be accounted for using the normal techniques presented herein.
I/O Transfers

Once an I/O block transfer (as distinguished from a programmed I/O transfer) has been initiated, it continues independently of the application process. Processor cycles are used to transfer data directly to or from memory, following which the operating system responds to a transfer completion interrupt. At this time, it will typically schedule the initiating process so that this process can do whatever it needs with the data transfer
completion. Let

D_io = average I/O transfer rate in both directions (bytes/sec.).
B_io = average block transfer rate in both directions (blocks/sec.).
t_dio = processor time to transfer a byte (often just a portion of a processor cycle, as data may be transferred in multibyte words) (sec.).
t_bio = operating system time required to process a data transfer completion (one per block) (sec.).

Then the processor load imposed by I/O at the data transfer and interrupt level, L_io, is

L_io = D_io t_dio + B_io t_bio    (6-37)

The application of this overhead value to system performance will now be discussed, along with other operating system functions that have a similar effect.
O/S-Initiated Actions

Besides the functions just described, there are other operating system functions that impose an overhead on the system. These are primarily tasks that the operating system itself initiates, such as

• timer list management,
• periodic self-diagnostics,
• monitoring of the health of other processors or computers in the system.

Let

L_os = operating system load imposed on the system by O/S-initiated functions.
L_io = I/O load on the system (as defined above).
L_o = total operating system overhead, including I/O transfers and self-initiated functions.

Then

L_o = L_os + L_io    (6-38)

Since L_o of the processor capacity is being consumed by nonapplication-process-oriented activity, (1 - L_o) of the processor is available for application use. This has the effect of increasing all application service times by 1/(1 - L_o):

Apparent Service Time = Actual Service Time / (1 - L_o)    (6-39)
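To make the bookkeeping concrete, here is a minimal sketch of equations 6-37 through 6-39 in Python. All of the numeric values are illustrative assumptions, not measurements from any particular system.

```python
# Sketch of equations 6-37 through 6-39 (illustrative values only).

D_io = 50_000   # average I/O transfer rate, bytes/sec
B_io = 25       # average block transfer rate, blocks/sec
t_dio = 0.5e-6  # processor time per byte transferred, sec
t_bio = 1.0e-3  # O/S time per transfer completion, sec

# Equation 6-37: processor load from I/O transfers and interrupts.
L_io = D_io * t_dio + B_io * t_bio

# Equation 6-38: total O/S overhead, given an assumed self-initiated load.
L_os = 0.05                 # O/S-initiated load (timer lists, diagnostics, ...)
L_o = L_os + L_io

# Equation 6-39: application service times are stretched by 1/(1 - L_o).
actual_service_time = 0.020                       # sec
apparent_service_time = actual_service_time / (1 - L_o)

print(f"L_io = {L_io:.3f}, L_o = {L_o:.3f}")
print(f"apparent service time = {apparent_service_time*1000:.1f} msec")
```

With these assumed values, a 10 percent operating system overhead stretches a 20-msec. service time to about 22 msec.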
That is, it appears that the application process is running on a machine that has only (1 - L_o) of its rated speed or capacity. If there are other higher-priority processes running which also rob the application process of processing capacity, then L_o is simply another component of that higher-priority processing load. (Such a load would include L_o along with the loads of equations 6-32 through 6-36, for example.) Note that L_o is not meant to include data-base management overhead. Though the data-base manager is not an application process per se, from a performance viewpoint it is treated as such a process. This topic is discussed in the next chapter.

Locks

In a multiprocessor system, there will be contention for various operating system resources by the multiple processors in the system. For instance, more than one processor may try to schedule a new process, which means that each such processor will attempt to modify the ready list. Multiple processors may try to modify a block in disk cache memory as described in the next chapter. To prevent such a resource (ready list, timer list, disk cache, etc.) from becoming contaminated, only one processor at a time must be allowed to use it. Therefore, each common resource is protected by a so-called lock. If a processor wants to use one of these resources, it must first test to see if this resource is being used. If not, the processor must lock the resource until it has finished with it so that another processor cannot use that resource during this time. Actually, this action of testing and locking must be an integrated action so that no other processor can get access to the lock for testing between the test and lock actions. If a processor finds a lock set, it must pause and wait for that lock (i.e., enter a queue of processors waiting for that lock) before it can proceed.

This queuing time for locks must be added to the process service time if it is deemed to be significant. And indeed, significant it can be. There are examples of contemporary systems in which resource locking is the predominant operating system bottleneck. In some systems, if the lock delay is too long, the process will be scheduled for a later time. Though this frees up the processor for other work, it has a serious impact on the delayed process because the process must now await another dispatch time. Lock delay can also seriously affect processor load because of the extra process-context switching time that is incurred.
There are several possibilities for thrashing in systems of this sort. One common cause is page faulting. Another cause in multiprocessor systems is long queues for locked resources, which can cause additional context switches. These effects can cause the processing requirements for a process to suddenly increase, with a significant increase in response time.

There are other, more subtle increases in processing requirements for TP systems. Memory and bus contention can cause process service times in multiprocessor systems to increase as load increases. Interprocess message queue lengths will increase as load increases, causing dumping to disk in some systems or rejection of messages in other systems. Either case causes an increase in process service time.
All of these factors cause a process's effective service time to increase as load increases. As service time increases, the capacity of the system decreases. In extreme cases (unfortunately not uncommon in multiprocessor systems), the system capacity can decrease below its current throughput, causing a "U-turn" in system performance. That is, the system can suddenly start thrashing and have a capacity less than the capacity at which it started thrashing. Response times can dramatically increase by an order of magnitude or more at this point. Figure 6-8 illustrates this phenomenon. This figure is a little different from the response time curves with which we have previously dealt, as it shows response time as a function of throughput (i.e., transactions per second processed by the system) rather than load. Normally, the throughput of the system is the offered transaction rate, R, and is related to the load, L, by L = KR. However, in a thrashing system the system is 100% loaded (it is continually busy) and may not be able to keep up with the arriving transaction stream. For that reason, we observe response time as a function of the throughput of the system rather than its load. With reference to Figure 6-8, as long as the system can keep up with the arriving
Figure 6-8 Thrashing. (Response time versus throughput: operating at point A, the system delivers throughput X1 with response time T1; as an increase in offered load pushes it past the thrashing threshold B, the system U-turns into thrash mode at point C, with lower throughput and a much higher response time T2.)
", • .transactions, it behaves properly.. For example, while operating at point :A:, it can provide a throughput of Xl traDsactions per second with an average response time of TI. However, as the incoming traDsaCtion rate approaches the .'tbrasbing threshold," B, various system resoun::es become seriously overloaded. Memory use is stressed to the point of creating excessive page faults; lock contentions cause processes to time out and be rescheduled; queues grow too long and are dumped from memmy to disk. In short, service time per t:ransaction dramatically increases. As a consequence, the capacity of the system is decJeased (it is, in effect, the inverse of the service time), the IeSpODSe time is increased (it is proportional to the service time), and the system is operating at point C. A further increase in the offered load (transaction arrival rate) to the system will only aggravate the situation, causing more tbrasbing, decreased capacity, and increased response time. This leads to the ·'U-turn" effect of Figure 6-8. What is the practical impact of such a system characteristic? Consider a user interacting with the system while the system is operating at point A. A sudden, brief burst of activity will drive the system into tbrashing mode; the user will suddenly find that the system is DOW opeI8ting at point C. Response time has suddenly increased from n to 7'2. In one typical system displaying this characteristic, the author measured response times which suddenly increased from. I second to 30 seconds! So far as the user is concemed, the system has just died. 'Ibis condition will persist until the offered load decreases long enough for thrashing to cease and for the system to get its house back in order. 'Ibis is the second tbrasbing example that we have discussed. The first example related to local area networks using contenDOD protocols (see FJgme 5-25). As in that case, the main lesson is that systems with the poteDtial for such severe tbmshing should be operated well below the thrashing thIeshold. Normal operating loads should allow adequate margin for anticipated brief peak loads to ensure that these loads will not cause thrash mode operation.
SUMMARY

In this chapter we have looked at the physical hardware and its effect upon performance. The hardware was viewed as a complete analyzable system, and our performance analysis tools were used to make some wide-ranging statements about a typical system. With respect to today's operating systems, we also reviewed many characteristics that may have a serious impact on performance. It is often true in contemporary systems that interprocess messages in multicomputer systems are the most predominant of all operating system functions. Task dispatching is also often important, especially for those cases in which processors are running heavily loaded. I/O and other O/S activity are usually less important (with the exception of data-base management activities, which are discussed in a later chapter). Memory management (page faulting) is either not a problem or is an insurmountable problem. The rapidity at which a system breaks when page faulting becomes significant is so awesome as to justify remaining well away from page faulting.
In chapter 8, we will look at system performance from the viewpoint of the application processes. This is where we will use some of the operating system concepts developed in this chapter.
7 Data-Base Environment
Most transaction-processing systems obtain the information required to formulate a response from a base of data that is so large that it must be maintained on large, bulk-storage devices, typically disk units in today's technology. The data is so massive that it is very important to have efficient access paths to locate a particular data item with the minimal amount of effort. This is especially true when data is stored on disk, for as we shall see, each disk access requires a significant amount of time.

It appears that the future is rapidly bringing high-speed gigabyte RAM (Random Access Memory) technology into the realm of reality (a giga is a billion!). When this happens, many of today's concerns over rapid access will be replaced with equal concern over the logical ease of access and maintainability of the data base, a subject already addressed by today's relational data bases. Though data-base organization is not a topic for this book, we will address it briefly later in this chapter. Of course, coming in parallel with gigabyte RAM is the development of kilogigabyte disks, so the performance of these systems will probably always be an issue, just on a larger scale than today.

We consider in this chapter the performance of data bases stored on one or more disk units. Data is typically managed by a data-base manager that provides a "logical view" of the data base. "Logical view" means that the data is seen in the way the user wants to see it, no matter how the data is actually physically organized. For instance, one might want a list of all employee names by department. The data-base manager will provide a view of the data base as if it contains employees organized by department, even though the actual data in the data base might be organized in multiple lists, including a master employee list
containing employee name, number, address, salary, etc., and a second list giving employee numbers for each department.

Data-base managers are large and complex and are usually implemented as a set of cooperating processes. As such, their analyses will follow the techniques described in the next chapter, which covers application processes. However, the performance of the data-base manager is very much a function of the system's ability to efficiently access and manipulate the various files (or tables) that constitute the data base. This is the role of the file system and is the subject of this chapter.

THE FILE SYSTEM
The file system in a contemporary TP system, viewed from a performance standpoint, comprises a hierarchy of components, as shown in Figure 7-1.
Figure 7-1 File system hierarchy. (The data-base manager or application programs call the file manager, which controls cache memory and system or application buffers; below it sit the disk device driver, the disk controller, and the disk drives themselves.)
Disk Drives

At the bottom of the hierarchy are the disk drives themselves. In most systems today, the disk drives use moving read/write heads that must first be positioned to the appropriate track or cylinder if multiple disk platters with typically one head per platter are used. A cylinder comprises all of the tracks on all platters at a particular radial of the disk unit. Once positioned, the drive must wait for the desired information to rotate under the head before reading or writing can be done. Thus, to access data on a disk drive, two separate but sequential mechanical motions are required:
• Seeking, or the movement of the read/write heads to the appropriate cylinder.
• Rotation, or latency, which is the rotation of the cylinder to the desired position under the now-positioned heads.

This sequence of actions is necessary to position the disk heads and read data. Writing data is a bit more complex. It must first be understood that data is organized into sectors on the cylinder (a sector is typically 256 to 4096 bytes). Data can be read or written only in multiples of sectors. A sector typically contains many records. Thus, to write a record, the appropriate sector must be read according to the above sequence, the record must be inserted, and the sector must be rewritten. Since the heads are already positioned, this simply requires an additional disk rotation time relative to a read operation.

Typical seek times are 20 to 40 msec.; latency time is, on the average, a half revolution time, or 8.3 msec., for a disk rotating at 3600 rpm (today's norm). Seek plus latency time will be called access time. A good average access time to be used in the following discussions and examples is 35 msec. to read a record and 52 msec. to write a record (a rotational time of 17 msec. is added for a write).
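These numbers follow directly from the drive geometry: at 3600 rpm one revolution takes 16.7 msec., so average latency is half of that. A quick sketch of the arithmetic in Python (the 30-msec. seek is an assumed mid-range value):

```python
# Disk timing arithmetic for a 3600-rpm drive (seek time is an assumed average).
rpm = 3600
rotation_time = 60.0 / rpm * 1000       # one revolution: 16.7 msec
latency = rotation_time / 2             # average rotational delay: 8.3 msec
seek = 30.0                             # assumed average seek, msec (20-40 typical)

read_access = seek + latency                 # ~38 msec; the text rounds to 35
write_access = read_access + rotation_time   # extra spin to rewrite: ~55 (text: 52)

print(f"rotation {rotation_time:.1f}, read {read_access:.1f}, "
      f"write {write_access:.1f} msec")
```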
Disk Controller

The disk controller is a hardware device controller that directly controls one or more disk drives. Typical controllers can control up to eight drives. A controller executes three basic classes of commands:

1. Seek, meaning to seek a specified cylinder on a specified drive.
2. Read, meaning to read a given sector or sectors on the currently positioned cylinder on the specified drive.
3. Write, meaning to write a given sector or sectors on the currently positioned cylinder on the specified drive.
Of course, there are other commands for status and diagnostic purposes, but these are the important ones for performance issues.

Most controllers can overlap seeks; that is, they can have multiple seeks outstanding so that several disk drives can be positioning themselves simultaneously to their next desired cylinders. The good news is that since seek time is the predominant factor in access time, this technique can significantly reduce access time and increase disk performance. The bad news is that it takes so much intelligence on the part of the software to be able to look ahead and predict cylinder usage (except in certain unique applications) that this feature is seldom supported by software. More about this later.

Some disk controllers provide buffering for the data to be read from, or written to, disk. That is, for a write operation the software will first transfer the data to be written to the disk controller buffer. The controller will then write the data to disk at its leisure. Likewise, for a read the data will be read from the disk into the controller's buffer, where it will be available to be read by the processor at its leisure.

Controller buffering is a mixed blessing. Without buffering, it becomes a real hardware performance problem to ensure that sufficient I/O capacity and processor memory time exist to guarantee the synchronous transfer of data between the disk and processor without data loss (once started, this data stream cannot be interrupted). On the other hand, with controller buffering, a disk transfer cannot exceed the buffer length (typically, anywhere from 256 to 4,096 bytes). Without controller buffering, data transfers can be as long as desired (at least, they can be as long as the processor's I/O channel will allow).

For our normal performance efforts at the application level, the problem of controller buffering usually is not considered. Disk transfer sizes are given, and we assume that they occur without data loss.
Disk Device Driver

The disk device driver is the lowest level of software associated with the file system. It accepts higher level commands from the file manager and executes these as a series of primitive commands submitted to the disk controller. It monitors status signals returned by the controller to determine success or failure of operations, takes such steps as it can to retry failed operations, and takes whatever other steps are necessary for guaranteeing the integrity of disk operations (such as replacing bad sectors with spare sectors from an allocated pool).

The most common commands handled by the device driver are read/write commands. The driver will select the appropriate disk drive, will issue an appropriate seek command, will ensure that the heads have positioned properly, and then will issue a data transfer command. The memory location of data to be written to disk or of the destination of data to be read from disk is passed to the device driver along with the command from the file manager. Data may be transferred between the disk and buffers provided by an application program (or provided by the operating system on an application program's behalf), or data may be transferred into and out of disk cache memory.

The device driver, once initiated by a command from the file manager, operates substantially at the interrupt level. When it has successfully completed the transfer or has given up, it will schedule the file manager. The device driver execution time is
typically included in the load L_os discussed under O/S-initiated functions in the previous chapter.
Cache Memory
Most systems today provide a disk cache memory capability that functions much like the memory cache described in chapter 6. Basically, the intent is to keep the most recently used disk data in memory in the hope that it will be reaccessed while in cache. Because of the slow speed of disk relative to the system's main memory, disk cache is usually allocated from main memory space (this part of memory is usually not a candidate for page swapping). Because memory sizes in today's TP systems can be quite large (many megabytes), disk cache is often not limited in size but rather is established by the application designers at system generation time. The management of disk cache is similar in many respects to memory cache. Several factors are taken into consideration, such as
• Transfers likely to make ineffective use of cache are often allowed to bypass cache. Sequential file transfers by sector are a good example of this. If a file is read sequentially or written sequentially by sector, previous sectors will never be reaccessed and so do not need to be cached. However, if records within a sector are being accessed, sequential sectors are cached so that records blocked in that sector may take advantage of cache. In some systems, sequential sectors in cache are overwritten by the next sector, as the old sector will not be needed again.
• The various records in cache are aged and are also flagged if they have been modified (a "dirty" flag). When a new record is to be read in, an appropriate area in cache must be overwritten. The caching algorithm will generally elect to overwrite the oldest area, i.e., an area that has not been used for the longest time. If there is a choice, a clean area will be overwritten as opposed to a dirty area, since a dirty area must be written to disk first before it can be overwritten (unless cache write-through is used, as discussed next).
• In many TP systems, modified data will reside in disk cache memory until it is forced to disk by being overwritten with new data. However, in the event of a processor failure, that data may be lost, and the data base will have been corrupted. In fault-tolerant systems, cache write-through is used. In this case, all writes to disk cause an update to cache and a physical write to disk (just like our earlier memory cache example) before the write is declared complete. In this way, all completed writes reside on disk in the event of a processor failure.
• The size of cache memory required is a direct function of the transaction rate to be supported by the system. Consider a transaction which reads a record and which may have to update that record at the operator's discretion 30 seconds later. If the system is handling 1 transaction per minute, then a cache size of 1 record is likely to give good performance. However, if the system transaction rate is 10 per second, then a minimum cache size of 300 records will be needed to guarantee any
reasonable cache hit ratio; i.e., 10 records per second (300 records total) will have been read into cache during the 30 seconds it will have taken the operator to update the original record.

Disk cache memory is just another flavor of the concept behind virtual memory (page faulting) and main memory caching. As with these other mechanisms, disk cache hit ratios are very difficult to predict. As mentioned above, they are most effective when files are being accessed randomly (a common characteristic of TP systems) and are least effective when files are accessed sequentially (as with batch systems). In TP systems, disk cache hit ratios of 20 percent to 60 percent are common. This parameter is typically specified as an input to the model or is treated as a computational variant.
File Manager

The file manager usually runs as a high-priority process in the TP system. In the simplest case, there is typically one file manager associated with each disk controller, although this is not a firm rule. A file manager may control several disk controllers, or as an alternative, there may be several file managers associated with a single disk controller. Multiple file managers are considered later.

Application processes (including a data-base manager, if any) submit requests to the file manager for disk operations, which are stacked up in the file manager input queue. These can be quite complex operations, such as
• Open. Open access to a file on behalf of the requesting process. Typically, a file control block (FCB) is allocated and assigned to the process-file pair. This instantiation of a file open is often given a file number so that later file requests by this process need only give a file number rather than a file name. The FCB keeps such data as current record position, end-of-file marker, and file permissions. A file open may be requested for various permissions, such as read only, modify privileges, or shared or exclusive access.
• Close. Close the access to the file by this process.
• Position. Position the record pointer to a particular record or to the end of the file.
• Read. Read a record.
• Lock. Lock the record being read (or file or record field, depending upon the file management system) so that no other process can lock this record or update it. Locking is used prior to an update to make sure that processes do not step on each other when trying to simultaneously update the same record.
• Write. Write a new record or a modified record.
• Unlock. Unlock the record being written.
This is only a partial list of file management duties. We have yet to talk about file structures, which would expand this list to include operations such as searching for blank slots in random files, keyed reads and writes, etc. The point is that the file manager is a highly intelligent process, and this intelligence costs time. Typical file manager execution times can run 10 to 50 msec. for 32-bit 1-MIP machines. Only for special applications today is processing time less than 10 msec. per file access. When compared with the 30-50 msec. disk access time, it can be seen that the file manager time makes a bad situation even worse.

Note that file manager time is substantially additive to disk time. When a request is selected for processing, the file manager must first do all the validation and processing required to submit one or more commands to the disk device driver. It then checks to see if the data item is in cache. If not, the file manager submits the first of a potential series of commands to the disk driver and then goes dormant until the disk driver has completed the operation and has returned the result (data or status). The file manager then must verify the result and, if necessary, submit the next command to the driver. This process continues until the request from the application process has been completely satisfied. Some operations, such as file opens and keyed accesses, can require several disk operations to complete. In all of these operations, the disk is active while the file manager is dormant and vice versa. Thus, the actual time required to complete a disk operation is the sum of the physical disk times and the file manager processing times.

File System Performance

Let
t_a = disk access time (seek plus latency) (sec.).
t_dr = disk rotational time (twice latency) (sec.).
n_dir = number of disk read operations required for file operation i (open, close, read, write, etc.).
n_diw = number of disk write operations required for file operation i.
t_fmi = file manager time for operation i (sec.).
f_i = proportion of t_fmi required if data is in cache.
p_d = average disk cache miss ratio, i.e., the probability of requiring a physical disk access.
t_fsi = file system service time for operation i (sec.).
Note that the disk time required to read a record is t_a and to write a record is t_a + t_dr. There are two cases to consider in terms of file system service time: caching of writes and cache write-through. If writes are cached, the sector to be updated is searched for in cache. If found, the sector is updated and left in cache. It eventually will be flushed to disk when it hasn't been used for awhile but may have had several updates made to it by then. The cache miss ratio, p_d, must take this into account.
If writes are not cached (cache write-through), each write modifies the sector to be updated if it is in cache, but that sector is unconditionally written to disk. The write will take advantage of cache to read the sector but will always physically write it back to disk (on the next disk spin if it had been physically read from disk). Thus, if the sector is found in cache, a disk time of t_a is required to write it out. If it is not found in cache, a disk time of t_a + t_dr is required to read it and then to write it out. A time, t_a, is required every time; a time, t_dr, is required p_d of the time.

For cached writes, file system service time for operation i is

t_fsi = a_i t_fmi + p_d [n_dir t_a + n_diw (t_a + t_dr)]    (7-1)

For cache write-through,

t_fsi = a_i t_fmi + p_d (n_dir t_a + n_diw t_dr) + n_diw t_a    (7-2)

The parameter a_i takes into account the effect of finding the desired data in cache. If (1 - p_d) of the time data is in cache, and if the file manager time then required is f_i t_fmi, then the average file manager time for operation i is (1 - p_d) f_i t_fmi + p_d t_fmi, or (p_d + f_i - p_d f_i) t_fmi. Thus,

a_i = p_d + f_i - p_d f_i    (7-3)
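A small Python sketch of equations 7-1 through 7-3 may help; the timings and ratios below are illustrative assumptions in the spirit of this chapter's examples.

```python
# File system service time, equations 7-1 through 7-3 (illustrative values).

t_a = 0.035    # disk access time (seek plus latency), sec
t_dr = 0.017   # disk rotational time, sec
t_fm = 0.020   # file manager time for the operation, sec
f = 0.5        # assumed fraction of t_fm needed when data is in cache
p_d = 0.4      # disk cache miss ratio

a = p_d + f - p_d * f                       # equation 7-3

def t_fs_cached(n_dir, n_diw):
    """Equation 7-1: writes are cached."""
    return a * t_fm + p_d * (n_dir * t_a + n_diw * (t_a + t_dr))

def t_fs_write_through(n_dir, n_diw):
    """Equation 7-2: cache write-through."""
    return a * t_fm + p_d * (n_dir * t_a + n_diw * t_dr) + n_diw * t_a

# A single-record update: one read and one write.
print(f"cached:        {t_fs_cached(1, 1)*1000:.1f} msec")
print(f"write-through: {t_fs_write_through(1, 1)*1000:.1f} msec")
```

As expected, write-through costs more per update (about 70 msec. versus 49 msec. here), buying data-base integrity in return.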
Let

p_i = probability of file operation i.
t_fs = average file service time (sec.).

Then the average file system service time is

t_fs = Σ_i p_i t_fsi    (7-4)

If

R_f = rate of file requests (per second).
L_f = load on file system.
n_f = number of file requests per transaction.
R_t = transaction rate (per second).

then the file system load is

L_f = R_f t_fs = n_f R_t t_fs    (7-5)
Assuming that file service time, t_fs, is random and that the number of processes requesting file service functions is large, the file service delay time, t_fd, including queue delays and servicing, is

t_fd = t_fs / (1 - L_f)    (7-6)
Equations 7-1 and 7-2 above ignore the actual transfer time of the data between disk and memory. This is a small portion of a single rotation and is usually small enough so that it can be ignored. For instance, if there are 32 sectors per track, then the transfer of one sector will require 1/32 of a rotation time. If rotation time is 16 msec., this amounts to 0.5 msec., which is small compared to average access times in the order of 30 to 50 msec.
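Continuing the sketch above, equations 7-4 through 7-6 combine the per-operation times into an overall file system load and delay. The operation mix below is an assumed one.

```python
# File system load and delay, equations 7-4 through 7-6 (assumed operation mix).

# (probability of operation, service time in sec) pairs; an illustrative mix.
operation_mix = [(0.6, 0.049), (0.4, 0.070)]        # e.g., reads and updates

t_fs = sum(p * t for p, t in operation_mix)         # equation 7-4

n_f = 4      # file requests per transaction (assumed)
R_t = 3.0    # transactions per second (assumed)

L_f = n_f * R_t * t_fs                              # equation 7-5
t_fd = t_fs / (1 - L_f)                             # equation 7-6 (M/M/1 form)

print(f"t_fs = {t_fs*1000:.1f} msec, L_f = {L_f:.2f}, "
      f"t_fd = {t_fd*1000:.1f} msec")
```

Note how a file system load approaching 0.7 already triples the effective file service time through queuing delay.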
FILE ORGANIZATION
Formal data-base structures are generally characterized as hierarchical, network, or relational. Though data-base structures are not the topic of this book, suffice it to say that each of these organizations is a further attempt at achieving ultimate flexibility and maintainability of the data base. And as this goal is achieved, it exacts its toll: performance. Relational data bases are recognized today as being the most flexible and maintainable data bases, and often the worst performance hogs (though impressive strides in this area are being made). Many systems are first built as pure "third normal form" relational data bases and are then modified to compromise this structure in order to achieve adequate performance. (For an excellent discourse on data-base structures, see Date [5], a classic in this field.)

One characteristic that all of these data-base structures have in common is the need for keyed files. Thus, almost all of today's file systems support keyed files. They also support sequential files for batch processing, random files for efficiency, and unstructured files as the ultimate programmer's out. These file structures form the basis of TP system performance to a large extent and are discussed next.

Unstructured Files
An unstructured file is viewed simply as an array of bytes (see Figure 7-2a). The application process can read or write any number of bytes (up to a limit) starting at a particular byte position. Note that in general, a transfer operation will begin within one sector on disk and end within another sector. Let
b_s = number of bytes in a disk sector.
b_u = number of bytes being transferred to or from an unstructured file.
Figure 7-2 File structures: (a) an unstructured file, showing a record of b_u bytes within sectors of b_s bytes; (b) a sequential file, with variable-length records (an average of r_s records per sector) and separate read and write positions; (c) a random file, with fixed-length records within sectors.
If the transfer size is no greater than a sector, the probability of it falling directly within a sector is (b_s - b_u + 1)/b_s; otherwise, two sectors must be accessed, with probability 1 - (b_s - b_u + 1)/b_s = (b_u - 1)/b_s. Thus, the average number of sectors to be accessed is

(b_s - b_u + 1)/b_s + 2(b_u - 1)/b_s = (b_s + b_u - 1)/b_s,  b_u ≤ b_s
If the transfer size, b_u, is greater than a sector length, b_s, then it will be less than an integral number of sectors by

⌈b_u/b_s⌉ b_s - b_u bytes

where ⌈x⌉ is the ceiling operator, implying that x is rounded up to the next integer unless it already is an integer. This length plus 1, divided by the sector size, is the probability that the transfer will be found in ⌈b_u/b_s⌉ sectors; otherwise, it will require one more sector. Thus, for b_u > b_s, the average number of sectors required is

⌈b_u/b_s⌉ + (1 - (⌈b_u/b_s⌉ b_s - b_u + 1)/b_s),  b_u > b_s

The first term is the minimum number of sectors required; the second term is the probability of requiring one more sector. This reduces to

(b_s + b_u - 1)/b_s

which is the same result as that obtained for a transfer size less than a sector size. Thus, to transfer a record to or from an unstructured file, the average number of sectors that must be accessed, n_du, is

n_du = (b_s + b_u - 1)/b_s    (7-7)

A typical disk driver will treat an unstructured read or write of multiple sectors as a primitive operation and will transfer these sectors without interruption. This often means that only one seek is made to the proper cylinder, and then one sector is transferred on each spin of the disk (there not being enough time between sectors to issue a new set of disk access commands). In this case, equations 7-1 and 7-2 can be rewritten for the special case of unstructured files as follows:

For unstructured reads:
t_fsi = a_i t_fmi + p_d t_u    (7-8)

where t_u is a single access time, t_a, to transfer the first sector plus (n_du - 1) rotation times, each requiring a time of t_dr, to transfer the remaining sectors:

t_u = t_a + (n_du - 1) t_dr    (7-9)

For unstructured cached writes:

t_fsi = a_i t_fmi + p_d (t_u + n_du t_dr)    (7-10)

For unstructured uncached writes:

t_fsi = a_i t_fmi + p_d n_du t_dr + t_u    (7-11)
(These equations ignore the sector read time, which will add the equivalent of a partial disk rotational time equal to that portion of a track occupied by the n_du sectors.)

More sophisticated disk systems may reduce these times. For instance, all sectors may be transferred on a single spin. This may be accomplished by one of many techniques:
• A multisector transfer capability in the disk controller.
• A sufficiently fast disk driver that can issue a new disk transfer command between sectors.
• Interleaved sectors, in which logically contiguous sectors are, in fact, separated by one or more physical sectors on disk.

An intelligent disk driver may also recognize that a sector that is to be totally rewritten does not have to be read. This is especially valuable if record blocking is to be done at the application level, with disk transfer only of entire sectors. Equations 7-7 through 7-11 can be easily modified to fit these and other examples of unstructured file manipulation, as the sketch below illustrates.
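For instance, a minimal sketch of equations 7-7 through 7-9 for a multisector unstructured read, reusing the assumed timings from the earlier example:

```python
# Unstructured file read, equations 7-7 through 7-9 (assumed timings).
b_s = 2048     # bytes per sector
b_u = 5000     # bytes per transfer
t_a = 0.035    # access time, sec
t_dr = 0.017   # rotational time, sec
t_fm = 0.020   # file manager time, sec
f, p_d = 0.5, 0.4
a = p_d + f - p_d * f                    # equation 7-3

n_du = (b_s + b_u - 1) / b_s             # equation 7-7: average sectors touched
t_u = t_a + (n_du - 1) * t_dr            # equation 7-9: one access + extra spins
t_fs_read = a * t_fm + p_d * t_u         # equation 7-8

print(f"n_du = {n_du:.2f} sectors, read time = {t_fs_read*1000:.1f} msec")
```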
Sequential Files

As shown in Figure 7-2b, a sequential file comprises a series of records that are written sequentially to the end of the file and are read sequentially. Records may vary in length but generally do not span sector boundaries (we will assume that here).

A key performance advantage of sequential files is that they make maximum use of cache. As records are written, they can be buffered in cache (or in an application-provided buffer) until a sector's worth has been accumulated, at which point the sector is written to disk. As records are being read, a physical disk read is only required once per sector into cache or into an application-provided buffer. From there, records are read at memory speeds until the sector is exhausted. If sequential reads or writes are fairly rapid, disk cache can be counted on to do the buffering, as the sector will have a high enough activity to prevent its being flushed from cache. If file activity is going to be slow, then the use of an application program buffer guarantees memory-based operation within a sector.

One potential problem with caching sequential reads and writes is that the potentially high transfer rate can rapidly fill cache memory and can cause other data to be flushed needlessly; needlessly, because caching any more than one sector is meaningless for a sequential file, since that sector will not be reused. Some systems will allow a cache block that had been used for a sequential transfer to be immediately reused; others will allow cache to be bypassed for sequential transfers. Either technique will prevent the disk cache from being needlessly flushed by a high-speed sequential file transfer.

From a performance standpoint, if there are r_s records per sector, then physical disk activity is required for only 1/r_s of the read or write requests. With respect to equations 7-1 and 7-2,
For sequential file reads:

n_dir = 1/r_s, n_diw = 0    (7-12)

For sequential file writes:

n_diw = 1/r_s, n_dir = 0    (7-13)

where r_s is the average number of records per sector.
Random Files
Random (or direct access) files can be written to or read from simply by specifying the record number (Figure 7-2c). Records in random files are typically fixed-length; if they vary in length, they are stored in fixed-length record slots. Records do not typically cross sector boundaries. The file system can easily calculate the sector of the file that a record is in by knowing the number of records per sector and the record number. For instance, if the file contains 10 records per sector, and if record 73 is desired, then it is to be found in sector 7 (assuming that the first sector is numbered 0).
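The sector calculation is just integer division, as this one-line Python check of the example shows:

```python
# Locating record 73 in a random file with 10 records per sector (sector 0 first).
records_per_sector = 10
record_number = 73
sector = record_number // records_per_sector   # integer division -> sector 7
print(sector)
```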
Random files are very efficient for accessing data randomly. Any record in the file may be read or written with just one access. However, caching will be somewhat ineffective, since it is unlikely (in a large file, at least) that the same sector will be required soon after a previous access, except for the case of a read with intent to possibly update the record via a later operation. For random files,

n_dir = 1, n_diw = 0 for reads    (7-14)
n_diw = 1, n_dir = 0 for writes    (7-15)
Keyed Files
We have described above the three basic file structures in predominant use today. Unstructured files are accessed by specifying a byte position within the file and a string size (b_u) to transfer. Sequential files are transferred record by record in sequence; the record size is often carried with the record to facilitate variable-length records. Random (or direct) files are accessed by record number. There are other file types in use, but for performance purposes they can usually be classified as one of these three.

Though these files each have their own rudimentary file access mechanism, more complex access methods are needed for general on-line transaction processing applications. For example, we may want to find a customer master record in the customer master file for customer H0W379A or to update an inventory record for item SCR137. There is no way provided by these file structures to find such a record without a brute force search (reading the file sequentially or using random access to do a binary search on an ordered file).
This problem is solved via key files. A key file is typically a separate file that contains the values on which we would like to access another file, with pointers to the data records in that file. Each record in a key file contains the value of a key (such as customer I.D. or product code) and an address of the corresponding record in the data file. The key file is maintained in key order, and provision is made for a rapid search of the key file. For files of typical sizes, any record may be found based on its key in two to four accesses, as we will see in the following discussion. Any of the above file types can be supported with key files.

If a key is unique, there will be only one record in the key file for each key value. Customer I.D. and product code are examples of unique keys. If a key is not unique, there will be one record in the key file for every data record containing that key. A nonunique key might be customer zip code or product availability status. If a request is made to access a record based on a nonunique key, the data record addressed by the first key record will be returned. Subsequent data records with this same key value will then be returned on request by simply reading the key file sequentially. Of course, there can be any number of key files supporting a particular data file. There will be one key file for every key whose value we would like to use to quickly access a data file.

From a performance viewpoint, key files are a mixed blessing. On the one hand, they provide the most rapid generalized access to the data we need to support TP applications; their efficiency is, to a great extent, the basis for the rapid response times of today's systems. On the other hand, whenever a record is created, key records must also be created. Whenever a record is updated, there is a strong likelihood that key records will have to be updated (which implies deleting old key records and inserting new key records). If we are carrying a lot of keys to facilitate rapid access, then the on-line creation and updating of records could overcome all the benefits of rapid keyed access. Therefore, a fundamental knowledge of keyed files is paramount for TP performance analysis.
Figure 7-3 shows a data file with one of its key files. Usually, for each data record, there is a corresponding key record which contains two fields of information:
• The key value pertaining to the data record.
• A pointer to that data record.

The forms that the pointer can take will be discussed later. Key records are typically small (say 10 bytes for the key value plus 4 bytes for a record pointer, for a total of 14 bytes). One sector on disk can contain many key records; a 1K-byte sector could contain 73 14-byte key records, though there is often overhead involved. The key records are maintained in the file in key order. Thus, if the key file were read from beginning to end, one would find the key values to be in alphanumeric order.

The fast access to a particular key value is achieved through a tree structure, as shown in Figure 7-3. This structure is commonly known as a balanced B-tree and
Figure 7-3 Keyed file. (A B-tree root points to Level 1 tree sectors, whose records note the first key value, such as AH73 or AS04, in each key record sector; the key records in turn point to the data records.)
"prises levels of pseudo-key records pointing to sectors of key reconis in the next lower level.
For instance, referring to the first tree level above the key file in Figure 7-3, noted as Level 1, one record points to a key file sector and notes that the first record in that sector contains the key value AH73. The next record in the Level 1 tree points to the next sector in the key file, noting that it begins with the key value AS04. Thus, we know that the first key file sector described here contains key values from AH73 to AS04 (whether a key value of AS04 can appear in this sector or not depends upon whether or not the key is unique). A tree level above Level 1 similarly points to the sectors in Level 1 and so on, with each tree level getting smaller and smaller (if 50 key records can fit into a sector, then each level contains 1/50 of the records in the next lower level). Ultimately, a tree level will be reached that is only one sector in length. This is the highest level and is known as the "root" of the B-tree. Thus, a key file contains two major parts:

1. Key records, each containing a key value and a pointer to a corresponding data record.
2. A B-tree, providing quick access to any key record containing a specified key value.

To find a data record containing a certain key value, the following sequence is followed:

1. Read the root segment.
2. Determine the sector of the next level to be searched, and read that sector.
3. Continue following the path through the levels of the tree until the last level has been read and searched. This level gives the sector containing the desired key record.
4. Read this key record sector, and find the desired key record.
5. Using the pointer in the key record, read the data record.
Thus, to find and read a data record via a key file requires k + 2 file reads, where k is the number of levels in the B-tree. To update a record requires k + 1 file reads to find the location of the data record, plus the data write. If r_k is the number of key records that will fit in a sector, then k levels will support N_k records, where

N_k = r_k^(k+1)    (7-16)
To get a feel for the number of levels required in a tree, let us look at the number of key records various levels will support. This is sensitive to the number of keys in a sector, and the table below shows key file size as a function of keys per sector and the number of levels. It is assumed that key records and tree records are the same size.

TABLE 7-1. KEY FILE SIZE (KEY RECORDS)

                        Keys/sector (r_k)
Levels (k)       50             100         200
1 (Root)         2,500          10,000      40,000
2                125,000        10^6        8 x 10^6
3                6.25 x 10^6    10^8        16 x 10^8
4                3.125 x 10^8   10^10       32 x 10^10
Thus, with only three levels, file sizes in the millions of records can be supported. Typical trees are generally two or three levels deep. Thus, a record in a keyed file can generally be accessed with a maximum of four to five accesses. Moreover, assuming the file is accessed reasonably frequently, the root segment will almost always be found in cache; and there is a good chance that the first level below the root will also be cached. Consequently, for modest file sizes (a million records or so) with modest key sizes, the read of the key file may be a single access. This is a common assumption in performance analysis.
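Equation 7-16 can be inverted to ask how many levels a file of a given size needs. A small Python sketch, using assumed file sizes and the 146 keys-per-sector figure from the example later in this chapter:

```python
import math

def tree_levels(records, r_k):
    """Smallest k with r_k**(k+1) >= records (equation 7-16 inverted)."""
    return max(1, math.ceil(math.log(records, r_k)) - 1)

for n in (50_000, 1_000_000, 100_000_000):
    print(n, "records ->", tree_levels(n, 146), "levels at 146 keys/sector")
```

Even 100 million records need only a three-level tree at this fan-out, which is why the "root and first level are cached" assumption is so often reasonable.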
Thus, as a first approximation, the read of a keyed file is equivalent to two disk accesses:

n_dir = 2, n_diw = 0 for reads    (7-17)
This value can be adjusted to reflect larger or smaller file sizes.

Adding a record to a keyed file is a different story. Not only must every add to a data file be accompanied by adds to each affected key file, but the B-tree in each key file also must be updated if the new key record has disturbed the key structure. In most cases, a free record slot will be found in the key file, and no tree update will be required. However, if the sector into which the key record is to be inserted is full, it must be split into two sectors so that the new key record can be inserted into its proper position. Each of these sectors will now be about half full. This sector split means that a record must be added to the next higher level in the tree; doing this will occasionally also cause a sector split in that level. The splitting of tree sectors could travel all the way up to the root and, in the worst case, cause the root to split. This will cause a new root segment to be created one level higher than the old root segment, thus adding one level to the tree.

Fortunately, sector splits do not occur all that frequently. When they do, they create room for many writes in the future. Sector splits are usually ignored for performance purposes, and a write to a keyed file is treated just as a write to any other file plus a write to
all key files to be updated. Thus, if n_k key files are to be updated on an average write to a keyed file, the number of disk accesses to write to a keyed file is (using the read assumptions for B-tree caching)

n_dir = n_k, n_diw = n_k + 1 for writes    (7-18)
Though block splitting is generally ignored, the proportion of writes that will cause a block split, k_s, can be estimated from the slack in the key file blocks. If F is the average proportion of each block that is full, a block has (1 - F) r_k free key record slots, so roughly one write in every (1 - F) r_k causes a split:

k_s = [(1 - F) r_k]^(-1)    (7-19)
In the absence of any other knowledge, a slack of 25% is a good assumption. This is because a block is 50% full following a block split and 100% full just before a block split, for an average of 75%.

Key files impose one significant restriction on the data file: it cannot be reorganized without totally rebuilding the key files corresponding to that data file. This is because of the nature of the pointer in the key file record. It generally will contain a byte position (for
an unstructured file), a record number (for a relative file), or a sector/record identifier (for a sequential file). Whatever it is, it is "hardwired" to the current organization of the data file. Records cannot be inserted into the data file if that will displace other records, nor can the data file be compacted to recover deleted record space. In general, records once written must remain in place (except to be deleted). Any time a record is moved, the key files must be rebuilt. For this reason, data is usually added sequentially to keyed files, i.e., written to the end of file. Any file reorganization (for instance, to recover deleted space) is done as a batch job out of hours. This restriction is removed by the use of indexed sequential files, described next.
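Pulling equations 7-17 through 7-19 together, here is a hedged sketch of the disk accesses generated by keyed file traffic; the keys-per-write count and block fullness are assumptions.

```python
# Disk accesses for keyed file operations, equations 7-17 through 7-19.

n_k = 3      # key files updated per write (assumed)
r_k = 146    # key records per block
F = 0.75     # average block fullness (the text's 25% slack assumption)

reads_per_keyed_read = 2                  # equation 7-17: B-tree mostly cached
reads_per_keyed_write = n_k               # equation 7-18: reread each key sector
writes_per_keyed_write = n_k + 1          # equation 7-18: key sectors + data

k_s = 1.0 / ((1 - F) * r_k)               # equation 7-19: splits per write
print(f"write costs {reads_per_keyed_write} reads + "
      f"{writes_per_keyed_write} writes; about 1 split every {1/k_s:.0f} writes")
```

With these numbers a split occurs only about once every 37 writes, supporting the text's advice to ignore splits in most analyses.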
Indexed Sequential Files

An indexed sequential file is one in which the data records are maintained in key order according to a primary key (which is usually a unique key). It also contains its own B-tree structure to give quick access to a data record according to the primary key, as shown in Figure 7-4. In effect, the data records replace the key records in a key file. Except for the fact that the data records are usually quite large compared to the key records, the description of a key file access given above applies directly to an indexed sequential file. However, the large data records mean that the last level of the B-tree is
Figure 7-4 Indexed sequential file. (A B-tree root and tree Levels 1 and 2 point directly into the data file, which is itself maintained in primary key order.)
much larger than it would be for a key file. Consequently, an extra B-tree level is usually required. Since it is so large, it is unreasonable to assume that this level will be found in cache. However, this extra B-tree level is often embedded with the data and typically requires an additional latency time to access. This extra time may be conveniently accounted for by simply increasing the disk access time judiciously. It can be more accurately accounted for by modifying equations 7-1 and 7-2, as follows.

A read will require reading k tree records (typically 2, as discussed above). The first will require a normal disk access of t_a, and the remaining (k - 1) tree records will each require a latency time of t_dr/2. Finally, the data record will require a latency time of t_dr/2. Thus,

Indexed sequential read:

t_fsi = a_i t_fmi + p_d (t_a + k t_dr/2)    (7-20)

where

k = number of index levels not resident in cache.

A cached write will require this time to read the sector to be modified plus an additional rotation time, t_dr, to rewrite the record.

Indexed sequential cached write:

t_fsi = a_i t_fmi + p_d [t_a + (k + 2) t_dr/2]    (7-21A)

An uncached write must always perform the disk access in order to write the data, even if it does not need to access disk to read the key:

Indexed sequential uncached write:

t_fsi = a_i t_fmi + p_d (k + 2) t_dr/2 + t_a    (7-21B)

Since records can now be accessed directly by their primary key, the restrictions on file reorganization no longer apply. In fact, every time a data record is inserted into its place in an indexed sequential file, potential block splits may move data records around. But just as with a key file, the primary key B-tree is immediately updated to reflect the new record structure.

Other key files may support an indexed sequential file and give access to that file via secondary keys. In this case, the record pointer found in a key record is the value of the primary key for the corresponding data record. Thus, to read an indexed sequential file via a secondary key, the key file is read. Then the data file is read via its primary key. Thus, accessing an indexed sequential file by its primary key can be quite a bit faster than using a keyed file. Secondary key access will be a little slower than a keyed file access because of the extra primary key B-tree that must be searched. Often, from a performance viewpoint, so much of the access activity can be accomplished by primary keys that indexed sequential files can be a big boost to performance. Note that secondary keys may be used to access an indexed sequential file but not to write to it. All writes must be via the primary key, which is required to be unique if record updates are to be performed.
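A short sketch of equations 7-20 through 7-21B, reusing the assumed timings from earlier examples with k = 2 uncached index levels:

```python
# Indexed sequential access times, equations 7-20 through 7-21B.

t_a, t_dr, t_fm = 0.035, 0.017, 0.020   # assumed timings, sec
f, p_d, k = 0.5, 0.4, 2                 # cache behavior and uncached tree levels
a = p_d + f - p_d * f                   # equation 7-3

read = a * t_fm + p_d * (t_a + k * t_dr / 2)                # 7-20
cached_write = a * t_fm + p_d * (t_a + (k + 2) * t_dr / 2)  # 7-21A
uncached_write = a * t_fm + p_d * (k + 2) * t_dr / 2 + t_a  # 7-21B

for name, t in [("read", read), ("cached write", cached_write),
                ("uncached write", uncached_write)]:
    print(f"{name}: {t*1000:.1f} msec")
```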
Hashed Files

Hashing is another technique of gaining keyed access to a file. It is not as predominant in TP systems today as is keyed access but is used enough to bear mentioning. The intent of hashing is to use the value of the primary key to calculate a sector address into which the record will be placed. Some sort of hashing algorithm is invented, one that will convert the key into a numbered value that will fall within the range of preallocated sectors for the file. That sector is then read, and the record is inserted if there is room. If there is no room for the record in that sector, then it must be written to an overflow area under some secondary algorithm. To access a record, the key is hashed, and that sector is read. If the desired record is not found in the sector, then the overflow area must be searched.

A simple example of a hashing algorithm is to use the first three characters of a key and to treat them as a base-36 number, with 0 to 9 having values 0 to 9 and with A to Z having values 10 to 35. Then the key H7D would hash to 17 x 36^2 + 7 x 36 + 13 = 22,297 and would point to sector number 22,297. The range of this hashing algorithm is 36^3 = 46,656 sectors.
If the file is sparsely populated, hashing can be a very effective access method. In effect, a keyed access can be achieved with one disk access. However, if the file begins to fill up, performance can degrade rapidly. In effect, hashing trades space for time. It makes inefficient use of disk space to achieve a rapid access time. Hashing algorithms are generally not supplied by the system vendors. Rather, they are implemented by the application programs.
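The text's base-36 example is easy to verify; this Python sketch reproduces the H7D computation:

```python
def base36_hash(key):
    """Hash the first three characters of a key as a base-36 number (0-9, A-Z)."""
    digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    sector = 0
    for ch in key[:3].upper():
        sector = sector * 36 + digits.index(ch)
    return sector

print(base36_hash("H7D"))   # 17*36**2 + 7*36 + 13 = 22297
print(36**3)                # range of the algorithm: 46656 sectors
```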
DISK CACHING

Except in some very simple situations, the effectiveness of disk caching is virtually impossible to calculate because of the complexity of its use. Usually, a reasonable cache hit ratio (or cache miss ratio, p_d, as used herein) is assumed, based on experience or measurements on the actual or equivalent system. As an alternative, the cache miss ratio can be varied as a parameter, and performance calculations can be made over a range of interest. In this way, at least, one can determine the sensitivity of the system to caching.

Many systems allow the size of the disk cache to be specified as a system generation parameter. Cache size can then be adjusted during system operation to obtain the best compromise between disk caching and other memory requirements. In effect, the analyst is caught between a rock and a hard place when it comes to memory allocation. As disk cache is made larger to speed up data-base access, less memory remains for program space, resulting in a higher incidence of page faults. Reducing disk delays in one area
simply increases them in another. Ultimately, if a good balance between data-base disk caching and program performance from page faulting cannot be found, the only solution may be to buy more memory, if the system will support it.

Some insight can be gained into cache effectiveness on a simple level by considering the "longevity" of a disk block in cache. The longevity of a disk block is the amount of time one would expect it to last unaccessed in cache before, based on the least recently used algorithm, it got flushed back to disk. If, for instance, an application reads some data and presents it to an operator, who will later want to modify it, we are interested in knowing whether that data will still be in cache when the modifications are made. This would save the rereading of this data from disk. Let us define the following terms:
= the average transfer rate (reads and writes) of blocks from disk to cache (blocks per second).
Ra
= the logical access rate of disk.
Pd
= the disk cache miss ratio.
Cd
= the disk cache size (blocks).
Tt:IM:he = the longevity of a block in cache.
If a block is tead in (for either tead or write purposes) and then sits in cache unused, it will be Bushed from cache when it is the oldest block (assuming no distinction is made by the cache manager between clean and c:tirty blocks). Since blocks are being lead in at a rate of R" blocks per second, this will take a time of (7-22)
where Tt:IM:he is the time that it will take to flush the bloclc. R" is the actual t:raDSfer rate between disk and~. In this simplified example, R" = P4Ra. That is, R. actual block accesses per second aremquiled of disk, and Pd of these blocks are DOt found in cache. Tbeir product is the physical disk transfer rate. '!bus, (7-23)
Let us say that our application involves hiPlY randOm bits on moderately sized files. We would expect by file B-ttees to be inc:ache, ,but key rec:cmIs anddat,arecords would not likely be cached. Let us farther cIefiD.e " , 1Idr
= the number of logical disk Iead requests per transaction.
IIdw
= the number of logical disk, write requestS per transaction.
R, = the traDsacti.on rate.
Chap. 7
247
Disk Caching Cd:
nk
= the amount of cache used to store the B-trees. = the number of key files to be updated per ttansaction.
The anticipated physical disk access rate, Rd, is then 2ndrRt for reads (read a key file and a data file for each read request) and (2ndw + nk)Rt for writes (read a key file and update a data record for each request plus update nk key files per transaction). This can be expressed as

Rd = (2ndr + 2ndw + nk)Rt

The amount of cache available for data and key record transfers is (Cd - Cdk). Using these expressions to modify equation 7-22 gives a cache longevity time of

Tcache = (Cd - Cdk)/[(2ndr + 2ndw + nk)Rt]     (7-24)
Let us plug some typical numbers into this expression. Consider an inventory system dedicated to the inquiry of inventory and the placing of an order against that inventory. As a result of an operator query, a customer master file and a product file are each read according to a key. The operator returns an order, which updates an amount in the product file, updates a status key to the product file, and writes an order detail record to an order file keyed by product code and customer I.D. Thus,

ndr = 2 keyed reads (customer master and product files).
ndw = 2 keyed writes (product and order detail files).
nk = 3 key file updates (product status key on the product file and customer and product keys on the order file).
Three files are involved: the customer master, product, and order files. Let us assume the following conditions:

• The customer master file contains 50,000 records of 300 bytes each.
• The customer I.D. is 10 bytes.
• The disk sector size is 2K bytes.

With this information, we can estimate the size of the customer master file B-tree, which we are assuming is cached. We do this as follows. Each disk sector can hold 2048/14 = 146 keys (assuming a 4-byte pointer and no overhead). Thus, there will be 50,000/146 = 343 sectors occupied by the key records for the customer master file. Assuming a 30 percent slack, then 343/.7 = 490 blocks is a more reasonable number. These blocks are pointed to by the first level of the B-tree (which we assume is cached). The 490 sectors of keys will require 490/(146 x .7) = 5 sectors in the first level of the tree (with 30% slack) and a root sector. Thus, the customer I.D. B-tree for the customer master file will require 6 sectors of cache.
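This arithmetic is easy to mechanize. The following sketch (assuming the 10-byte key, 4-byte pointer, 2K sector, and 30 percent slack used in the text) reproduces the 6-sector estimate:

    import math

    # B-tree size estimate for the customer master file (values from the text).
    records   = 50_000
    key_bytes = 10
    ptr_bytes = 4
    sector    = 2048
    slack     = 0.30                  # 30 percent slack in each sector

    keys_per_sector = sector // (key_bytes + ptr_bytes)        # 146
    leaf = math.ceil(records / keys_per_sector)                # 343 packed sectors
    leaf = round(leaf / (1 - slack))                           # 490 with slack
    level1 = math.ceil(leaf / (keys_per_sector * (1 - slack))) # 5 first-level sectors
    cached = level1 + 1                                        # plus the root: 6
    print(cached)                     # -> 6 sectors of B-tree to hold in cache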
Let us assume that an equivalent analysis on the other key files leads to the following B-tree sizes:

TABLE 7-2. EXAMPLE B-TREE SIZES

File               Key              B-tree sectors
Customer master    Customer ID      6
Product master     Product code     4
Product master     Product status   4
Order detail       Customer ID      4
Order detail       Product code     5
                                    --
                                    23
Thus,

Cdk = 23 cache blocks required for B-trees (46K bytes).

Finally, let us assume that we have a 1-megabyte cache memory (500 blocks). Thus,

Cd = 500 blocks of cache memory.

Then, from equation 7-24:

Tcache = (500 - 23)/11Rt = 43.4/Rt
where we have assumed in our value for ndw that the product file record to be updated is not found in cache, i.e., it has been read and then flushed before it could be updated. Obviously, we would like to have enough cache to ensure that, with high probability, this record is not flushed before it is updated.

Note that the longevity of a block in cache is a function of the transaction rate, Rt. As the transaction rate increases, disk activity increases, and longevity decreases. The following table gives longevity times for a range of transaction rates for this case:

TABLE 7-3. EXAMPLE LONGEVITIES

Rt (trans/sec)    Tcache (sec)
1                 43.4
2                 21.7
5                 8.7
10                4.3
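A minimal sketch of the longevity computation of equation 7-24, using the example's values, reproduces Table 7-3:

    # Cache longevity per equation 7-24 (values from the example in the text).
    C_d, C_dk = 500, 23          # cache size and B-tree blocks
    n_dr, n_dw, n_k = 2, 2, 3    # per-transaction reads, writes, key updates

    def longevity(R_t):
        """Seconds a block survives unaccessed before an LRU flush."""
        return (C_d - C_dk) / ((2 * n_dr + 2 * n_dw + n_k) * R_t)

    for R_t in (1, 2, 5, 10):
        print(R_t, round(longevity(R_t), 1))   # 43.4, 21.7, 8.7, 4.3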
If it is anticipated that a user will require about 30 seconds to fill an order, then at one transaction per second, the product file record to be updated will have a good chance of remaining in cache. This would give better performance than anticipated, since it was assumed that this record would not be found in cache when it was to be updated. At
transaction rates greater than 1.5 transactions per second, the probability that this record will remain in cache long enough is substantially reduced.

A feel for cache hit (or miss) ratios can be obtained from this example. The following table lists total disk accesses for each operation, the number that are B-tree accesses (root and level 1) assumed to be in cache, the number that are assumed to come from disk (first-time reads or writes), and the number that are candidates for cache hits if the record has not been flushed.
TABLE 7-4. EXAMPLE DISK ACTIVITY

Operation               Total accesses    B-tree (cache)    Reads/writes    Update candidates
Read cust. mast.        4                 2                 2               -
Read product            4                 2                 2               -
Write product           4                 2                 1               1
Write status key        3                 2                 1               -
Write order detail      4                 2                 2               -
Write cust. ID key      3                 2                 1               -
Write prod. code key    3                 2                 -               1
                        --                --                -               -
                        25                14                9               2
This table reflects the assumption that all key files are supported by a two-level B-tree. Thus, there are 25 accesses to sectors required to process this transaction. Fourteen are assumed to be in cache, and 9 are assumed to necessarily require a physical disk access. The two update accesses (one to read the product code key file and one to write to the product file) may be in cache if the update is fast enough. Thus, the disk cache hit ratio ranges from .56 to .64 (the cache miss ratio ranges from .36 to .44), depending upon the success of finding a record that is to be updated and is still in cache. These results are typical of TP systems.

This "analysis" of disk cache has used a very simple example to illustrate the basic concepts of disk cache and to give a feel for the magnitude of parameters involved in typical TP systems. In the real world, any realistic analysis of disk caching is usually too complex to be useful (if indeed at all possible), and reasonable assumptions for cache miss ratios are usually used instead. However, the longevity analysis that led to equation 7-24 and to the example of Table 7-3 can be useful in estimating the amount of cache memory that represents the threshold required to achieve high cache hit ratios for update activity. This can be of paramount importance in TP systems.
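The hit-ratio range quoted above follows directly from the counts in Table 7-4; a one-line check:

    total, cached, update_candidates = 25, 14, 2   # from Table 7-4
    low  = cached / total                          # .56 if neither update hits
    high = (cached + update_candidates) / total    # .64 if both updates hit
    print(low, high)          # miss ratio is 1 - hit ratio: .44 down to .36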
OTHER CONSIDERATIONS

There are several other considerations that can affect the performance of the data base in a TP environment. An understanding of these will allow the performance analyst to appropriately modify his attack on the problem of analysis.
Overlapped Seeks

A disk controller can typically drive several (say eight) disk drives. In general, however, it can be transferring data to or from only one disk drive at a time because of hardware and I/O channel limitations. However, if there is a queue of disk requests waiting for the disk controller, many systems will allow the disk device driver (the software driver that controls the disk controller-see Figure 7-1) to "look ahead" through the queue. By doing so, the device driver can initiate seeks on other drives to get their heads properly positioned in anticipation of transferring data. This capability is called overlapped seeking. The seeking of the read/write heads on some disks is overlapped with the seeking of other heads and also with the transfer of data from one of the disk units. Overlapped seeks can drastically reduce disk access time in busy systems (it has little effect on idle systems). Since seek time is a major component of total file management time, this can greatly improve the responsiveness of the data-base system.
There is no straightforward way to analyze the effect of overlapped seeks except to estimate the effective seek time at the anticipated load and to use that time in the analysis rather than the actual seek time. An example will illustrate the sort of estimate that can be made.
Let us assume that a preliminary analysis has shown that the average queue length of requests waiting for a controller (W in terms of chapter 4 notation) is two requests. The controller has four disks connected to it, one of which, of course, is busy. This leaves three disks. The probability that the first request is for one of the free disks is 3/4. Thus, with probability .75, one overlapped seek can be started. With probability .75, the second request can lead to an overlapped seek if the first did not. If the first request did lead to an overlapped seek, then with probability .5, the second request can lead to an overlapped seek. Thus, with probability (.25)(.75) + (.75)(.5) = .56, the second request will result in an overlapped seek. On the average, a request will be given an overlapped seek (.75 + .56)/2 = .66 of the time. If we assume, given an overlapped seek, that the seek is complete when the request is chosen for data transfer, then overlapped seeks are transparent to the requesting process, and overall seek time has been reduced by 66 percent. If seek time is 20 msec., the effective seek time is (1 - .66)(20) = 6.8 msec. (The assumption of 100 percent seek overlap with the processing of the current request is not unreasonable, since disk transfer times tend to run in the same order of magnitude as seek times.)
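The estimate above is simple enough to script; this sketch reproduces the arithmetic for the four-drive controller with an average queue of two requests:

    # Overlapped-seek estimate from the text: 4 drives, 1 busy, queue of 2.
    free = 3 / 4                            # P(request targets a free drive)
    p1 = free                               # first queued request: .75
    p2 = (1 - p1) * free + p1 * (1 / 2)     # second request: .5625 (~.56)
    p_overlap = (p1 + p2) / 2               # average chance of overlap, ~.66

    seek = 20.0                             # msec., nominal seek time
    effective_seek = (1 - p_overlap) * seek # ~6.8 msec., as in the text
    print(round(p_overlap, 2), round(effective_seek, 1))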
Queues are normally serviced on a FIFO (first-in, first-out) basis. However, there are algorithms that will look ahead through the queue and will decide which request to service next based on maximizing efficiency. An example of this sort of servicing algorithm is the elevator algorithm. This algorithm searches the queue for that request which is closest to the current position of the disk head and chooses that request for service. In this way, seek time is minimized.
Algorithms such as this have a problem in that although the overall efficiency of the disk system is enhanced, some requests may get delayed an inordinate amount of time. In fact, with the elevator algorithm, it is possible to create a scenario in which a request may never be honored in a busy time. This would be a request at one edge of the disk when all activity is at the other edge. Remember our system manager who hears only from irate customers? Let's not do this to him.

Of course, the algorithm can be modified to prevent this. One form of a modified elevator algorithm always sweeps in one direction from one edge of the disk to the other, servicing all pending requests in cylinder order. It then reverses and repeats this sequence. In this case, one can make an approximate statement concerning the enhancement provided by this modified elevator algorithm. As pointed out in chapter 4, the average distance between two cylinders chosen randomly on disk is 1/3 of the total number of cylinders. This is the average seek distance of the disk arm as it services a FIFO queue. However, if there are n items on the average in the disk queue, and if these are serviced according to the modified elevator algorithm such that all are serviced in order as the disk arm sweeps across the disk, then the average seek distance is 1/n of the total number of cylinders. If n is 10, then the average access time is based on moving a total of 10 percent of the cylinders versus 33 percent, or a reduction of 3.3 in the access distance. This can be significant. Of course, a ten-item queue is even more significant in terms of delay time. By the time the queue length is long enough to make algorithms like this meaningful, the system has long ceased performing satisfactorily.

Algorithms such as these are not typically found in today's TP systems. They are not very effective unless queue lengths are long, and good performance design dictates that queue lengths be short (70 percent resource loads yield queue lengths in the order of one or two, according to Khintchine and Pollaczek). In addition, such algorithms tend to be unpredictable and have not been found to be necessary. However, as with overlapped seeks, one technique to handle service order algorithms is to estimate in some way an effective seek time (or access time if rotational latency enhancements are involved) and to use this modified value in the analysis.
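For concreteness, here is a minimal sketch (a request is represented simply as the cylinder number it addresses, a hypothetical simplification) of the two service-order policies just discussed: the pure elevator choice of the nearest cylinder, and the modified elevator's one-direction sweep in cylinder order:

    # Hypothetical sketch of disk service-order algorithms.

    def elevator_next(queue, head):
        # Pure "nearest request" policy: minimizes the next seek, but a
        # request at the far edge can be starved on a busy disk.
        return min(queue, key=lambda cyl: abs(cyl - head))

    def modified_elevator_order(queue, head):
        # One-direction sweep: service everything at or beyond the head in
        # cylinder order, then reverse.  With n queued requests the average
        # seek distance drops toward 1/n of the cylinders (vs. 1/3 for FIFO).
        ahead  = sorted(c for c in queue if c >= head)
        behind = sorted((c for c in queue if c < head), reverse=True)
        return ahead + behind

    print(elevator_next([10, 400, 55], head=60))           # -> 55
    print(modified_elevator_order([10, 400, 55], head=60)) # -> [400, 55, 10]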
Data Locking

A typical TP system has many users accessing the same files. As long as a file is simply being read, there is no problem. However, as soon as two users try to update the same data, a problem can arise. Suppose user A reads an inventory record, finds 25 widgets in stock, and decides to sell 10, thereby needing to update the stock quantity to reflect the fact that 15 are left. However, before user A can return the update, user B has read the same record and decides to sell 5 widgets. User A by now has updated the file, unbeknownst to user B, to reflect a new quantity, 15. User B then returns an updated record showing a quantity of 20 widgets.
""". The data base DOW reflects 20 widgets in stock, whereas there are really only 10. The two users of the system have stepped on each other's toes. The solution to this dilemma is dIZt4 locking. When a piece of data is locked, no other entity (person or program) may read that data for purposes of updating (normal reading is often allowed). Depending upon the system, data locks can be applied to an entire file, a record in the file, or a data field in the record. The finer the granularity of tbe lock, the more efficient the system (except, of course, that the finer the lock, the more overhead is imposed on the system). Thus, user A will read the widget record with lock Oet us assume that record locking is available). User B will then try to read this record with lock. However, the attempt will fail, and user B will have to wait (either try again later or be queued by the system to the locked record). When user A updates the record to show 15 widgets left, tbe lock will be removed. User B's request for a read with lock will now be honored and will show 15 widgets left. Selling 5, user B will return an updated record showing 10 widgets left in stock. Qearly, file locks can bog down a system terribly, especially if the file is kept locked while an operator takes some action on the data (what if the operator takes a coffee break while he has the file locked?). Recont locks offer a much smaller chanc:e for CODf1ict, and data item locks are even better. As a general design philosophy, the locking mechanism which freezes the least data should be used (record locking is quite common in contemporaty systems). The lock also should be majntaiued for as shOlt a time as possible (for iDstance, ODly during the updating process, after the opezator has petformecl all other fanctioDs). Most 'n' systems are designed so that the chance of data lock conflicts are negligible; lock confticts can usually be ignored from a peIformance viewpoint. If lock conflicts are not negligible,delays from lock conflicts must be esril!Uded and added to die effective file IIl8DI8er service time if locks are queued or to the total tnmsaction time if the operator must wait and then resubmit the tIansaCtioD. Such delays are so appJication-dependent that notbing more in a general seuse can be said about them.
Mirrored Files

As discussed in chapter 2, critical files are often mirrored for reliability. When data must be written to two disk units, even though this may be done simultaneously, the average write time is longer than writing to one disk. Conceptually, one may explain this by thinking of one disk as being the "average" disk completing in an "average" time. The other disk will either complete faster, in which case the mirrored write took an "average" time, or will complete slower, in which case the mirrored write took a longer than "average" time. The net result is necessarily a mirrored write that averages longer than a
single write. An analysis of mirrored writes is given in Appendix 5. The results are quite simple. Mirroring a file adds about 40 percent to the seek time. Thus, if a disk has a 20 msec. seek time and an 8.3 msec. latency time, then single disk write time is (20 + 8.3 + 16.7) = 45 msec. A mirrored write will add .4(20) = 8 msec., yielding a write time of 53 msec.
To achieve even greater reliability, fault-tolerant systems will often write to only one disk at a time. In this case, if a failure occurs during writing, one disk is guaranteed to be readable. However, a mirrored write now requires twice the disk time of a single disk write. One compensating advantage of mirrored drives is that both can be shared for reading. Some systems take advantage of this to some degree or another. If an application is heavily oriented to reading over writing, mirrored disk drives could prove to be a performance advantage.
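The write-time arithmetic above is summarized in this small sketch (times from the text's example; the remaining 16.7 msec. component is taken as given there):

    seek, latency, rest = 20.0, 8.3, 16.7    # msec., from the text's example
    single   = seek + latency + rest         # 45 msec., single-disk write
    mirrored = single + 0.4 * seek           # +40% of seek: 53 msec.
    serial   = 2 * single                    # write one disk at a time: 90 msec.
    print(single, mirrored, serial)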
Multiple File Managers

In many TP systems, the file manager can become the bottleneck for the system. This can be alleviated by partitioning the system so that it can have several file managers, each sharing a portion of the load.

Multiple file managers can alleviate another problem in some cases. As we discussed earlier, disk processing (i.e., CPU) time and physical disk access time are substantially serial in nature. While the file manager is processing a request, the disk is idle. Then the file manager waits while the disk transfers the desired data. In typical TP systems, a disk driven by a single file manager can only be kept busy 25 to 60 percent of the time, thus sharply reducing its effective capacity. If multiple file managers could be used to drive a disk, disk utilization could be sharply improved. There are several ways in which multiple file managers can be utilized.

File manager per disk volume. A large TP system will typically have several disk volumes (physical disk drives) that it uses. These systems often provide the capability for (or require) separate file managers for each disk volume, as shown in Figure 7-5a. While this does not solve the problem of low disk utilization described above, it does give a mechanism for alleviating the file manager bottleneck. To analyze the performance of this type of file manager configuration, one simply computes the total file load (file manager processing and disk access) as set forth in equations 7-1 through 7-6 and then allocates this load across the independent file managers. Often, there is not sufficient information to allocate load on any but an evenly distributed basis. In this case, if a file load of Lf is to be distributed across D file managers, each controlling their own disk system, then the load on each file manager is Lf/D.
Multiple file managers per disk volume. Whether a TP system is large or small, it can benefit from having multiple file managers share one disk. This not only relieves the file management bottleneck to some extent but also allows disk utilization to approach 100 percent. This structure is shown in Figure 7-5b.

In order for more than one file manager to use a disk volume, requests to that disk must be appropriately partitioned between the file managers. It is not sufficient to allow them to simply work from a single queue, since the order in which certain requests are executed is critical. For instance, if a process submits two requests, one to open a file and one to read it, it is imperative that the requests be executed in that order. We would not
[Figure 7-5. Multiple file managers: (a) file manager per disk volume (D disk volumes); (b) multiple file managers per volume (m file managers sharing one disk, disk queue M/M/1/m/m).]
want one file manager to start the opening of a file (a lengthy procedure) and another to immediately try to read that file before the opening procedure was complete.

One straightforward way to partition work between different file managers is to have each responsible for a defined subset of files on the disk volume. In this way, all operations on a file will be consistent. Since requests must be partitioned, the file managers do not act as a multiserver servicing a common queue. Rather, just as in the file manager per disk volume case described above, the load is distributed to them as individual and independent servers. If there are m file managers servicing a single disk volume, and if file system load is Lf, then the load on each file manager is Lf/m. However, the disk now sees multiple users (though nowhere near an infinite number). It will have a queue of work to do and will respond as a single server to the m file managers. The characteristics of the disk queue will be governed by the single server, finite users model M/M/1/m/m discussed in chapter 4.
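The file-subset partitioning rule just described is sketched below (the file-ID scheme is hypothetical): routing by file keeps all operations on a given file ordered within one manager's queue.

    # Hypothetical sketch: partition requests among m file managers by file,
    # so each file's operations stay ordered within one manager's queue.
    m = 3                                # file managers sharing one disk volume
    queues = [[] for _ in range(m)]

    def manager_for(file_id):
        return hash(file_id) % m         # fixed subset of files per manager

    def submit(file_id, op):
        queues[manager_for(file_id)].append((file_id, op))

    submit("ORDERS", "open")
    submit("ORDERS", "read")             # guaranteed to follow the open
    print(queues)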
[Figure 7-5 (cont.). Multiple file managers, multiple volumes: m file managers driving each of D disk volumes.]
Multiple file managers per multiple volumes. The above two configurations can, of course, be combined as shown in Figure 7-5c. Multiple disk volumes are each controlled by multiple file managers. Using the above notation, the load on each file manager is Lf/Dm.
AN EXAMPLE

As an example of file manager performance, let us consider the transaction discussed under cache management and apply it to various numbers of file managers in a TP system which has one file manager per disk volume. To summarize that example, we have a standard transaction comprising:

• 2 keyed reads
• 2 keyed writes
• 3 key file updates
Let us designate these as file operation types i = 1, 2, and 3, respectively. As discussed earlier, the following table gives the values for ndir, ndiw, and pi for each file operation and also gives typical values for file management processing time, tfpi:

TABLE 7-5. FILE OPERATION PARAMETERS

File operation     i    ndir    ndiw    tfpi (msec)    pi
Keyed read         1    4       0       20             .29
Keyed write        2    2       2       30             .29
Key file update    3    2       1       25             .42
                                                       ----
                                                       1.00
We further assume the following parameters:

tda = 28 msec.
tfr = 17 msec.
Pd = .4
fi = .5 for i = 1, 2, 3

We further assume that writes are cached. Based on these values, the file management times, tfmi, are given by equation 7-1 and are shown in Table 7-6.

TABLE 7-6. EXAMPLE FILE MANAGEMENT TIMES

i    tfmi (msec)
1    58.8
2    79.4
3    57.9
The average file system service time, from equation 7-4, is

tfm = Σi pi tfmi = 64.4 msec.
The load on the file system, given by equation 7-5, is

Lf = (7)(64.4/1000)Rt = .45Rt

and the load on each file manager is

Lf/D = .45Rt/D

where D is the number of disk volumes and consequently the number of file managers. From equation 7-6, the response time of a file manager is

tf = tfm/(1 - Lf/D) = 64.4/(1 - .45Rt/D)
This response time (or file manager delay time) is plotted in Figure 7-6 for one to three file managers. It is clear from Figure 7-6 that additional file manager paths to different disks proportionately increase the capacity of the system and can have dramatic effects on response time.

[Figure 7-6. Multiple file manager response time: response time (msec.) versus transactions per second (Rt) for D = 1, 2, and 3 disk units.]

At two transactions per second, the response time for the three cases is shown in Table 7-7.

TABLE 7-7. EXAMPLE RESPONSE TIMES

D    tf (msec)
1    644
2    117
3    92
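For a quick check, this sketch evaluates the response-time expression above and reproduces Table 7-7:

    # File manager response time, tf = tfm / (1 - Lf/D), from the example.
    t_fm = 64.4                          # msec., average file system service time

    def t_f(R_t, D):
        return t_fm / (1 - 0.45 * R_t / D)

    for D in (1, 2, 3):
        print(D, round(t_f(2.0, D)))     # at 2 trans/sec: 644, 117, 92 msec.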
An additional point to note is the average disk utilization. The average disk operation requires 64.4 msec. Of this, 17.5 msec. is file manager processing time, during which the disk is idle. Thus, during the time that the file management system is busy, the disk is only being used 73 percent of the time. In many systems this disk utilization can be less than 50 percent. It is this situation that is enhanced by utilizing multiple file managers per disk volume.
8
Application Environment
Now that we have a vehicle, let's take a look at the passenger. Our TP vehicle includes a communication network for exchanging messages (requests and replies) with the users, a processing environment (the CPU, memory, and operating system) to act as our TP engine, and a data base that is used to maintain the status of our environment and to answer inquiries relative to that environment from our users. It is this vehicle in which the TP application resides and by which it is hidden from the mundane problems of the real world. The communication manager worries about line protocols, data errors, network outages; it presents a clean message interface to the application. The operating system and its hardware take away the concerns of memory management, multiple processors, multiple users, interprocess communication, process scheduling, and in some cases even fault detection and recovery. The file manager or data-base manager provides a smooth path for accessing and maintaining our data.

In this environment, the application is simple, at least conceptually. It comprises one or more processes that accept messages and apply them to the data base for inquiry and update purposes. The analysis of its performance, however, still is tied to the performance issues of its vehicle and particularly its engine.

A TP application typically comprises a set of cooperating processes. The requestor-server model described in chapter 3 is a good example of this. Therefore, no matter how efficiently a process is designed, it will still be delayed by other system activities as it competes with them for resources. This chapter deals with the performance of application structures in the TP environment and describes a variety of application process structures currently in use.
PROCESS PERFORMANCE

A portion of the response time calculation is, of course, the time that is consumed by the application process in processing a transaction. However, this is not usually a major factor in response time, and it is often only a negligible factor. A transaction inquiry providing a 2-second response time might only require 50 msec. of processing time, for instance. True, a good bit of the remaining time is used up in communications and data-base management; however, a good bit of time is also committed to process management. These are the times that we deal with here. Virtually all of the process management considerations have been touched on in previous chapters, so most of what we will discuss here will be in the nature of a review and consolidation of this knowledge.

The delay time imposed by an application process on a transaction is only partly due to its actual processing time. This delay time must also include

1. The queue delay incurred by the transaction as it waits in line for the process.
2. The dispatch time incurred while the process is waiting for the processor.
3. The processing time of the process itself.
4. The contention for the processor with application activities of higher priority.
5. The contention for the processor with the operating system as it handles interrupts and other system activities.
6. The messaging time required to communicate with other processes.
Overview

From a fundamental conceptual viewpoint, the process environment as described above is shown in Figure 8-1. We view a process here in the simplest of terms to obtain an overview tying in the above concepts as a unified whole. A message enters the process's message queue and waits in that queue for a time tq until it reaches the head of the queue. The process receives this message, processes it, and passes it on to another process.

Figure 8-1 shows a process running in a processor. The process has an input queue that receives messages at a rate of R and processes them with an average service time of tp. As the process completes a transaction, it passes it on to another process via an interprocess message requiring a system time of tipm. Once the process has processed a transaction, it relinquishes control of the processor and waits for the next transaction. It then gets back in line with other processes at its priority and waits for the processor so that it can service this next transaction. This is shown by the process being an item waiting in a greater queue-the processor queue. The amount of time that the process must wait in this queue is its dispatch time, td. Note that tq represents the time spent by a message in a process queue; td represents the time spent by a process in the processor queue.

As shown in Figure 8-1, the process, once running, does not have the processor all to itself. For one thing, the operating system consumes a portion of the processor capacity
[Figure 8-1. The process environment: messages wait in a message queue for the process, which in turn waits in the processor queue (dispatch time td) for the processor.]
as it handles I/O, communication with other processes, timer-list management, and so on. The load imposed on the processor by the operating system is Lo. Similarly, higher priority processes may be usurping the processor while the process is trying to run (this is the case of preemptive scheduling). These higher priority processes impose a CPU load of Lh.

The process requires tp time to complete its task. But only (1 - Lo - Lh) of the processor is available to the process, so that in a time tp', only tp'(1 - Lo - Lh) time is used on behalf of the process. Therefore, our actual processing time, tp', once the process is given the processor, is given by

tp = tp'(1 - Lo - Lh)

or

tp' = tp/(1 - Lo - Lh)     (8-1)

A message arriving at the head of the process's message queue must wait first for a time, td, for the process to be dispatched and then for the processing time, tp'. Thus, the service time, ts, so far as a message is concerned is

ts = td + tp'     (8-2)
Equation 8-2 represents the effective processing time, or service time, that a message
waiting in the process's message queue will experience. Before being processed, the message must wait in this queue for a time tq. Since transactions are being received by this process at a rate of R transactions per second, the load on the process is L = Rts. Using the M/M/1 model, the waiting time tq is

tq = [L/(1 - L)]ts = [Rts/(1 - Rts)]ts     (8-3)

The total delay time through the server, tds, is

tds = tq + ts = ts/(1 - Rts)     (8-4)
The dispatch time, td, is the time the process must wait for the processor while processes of equal priority ahead of it in the processor queue are being serviced. In Appendix 6, we point out that the M/M/1 model is inappropriate for the calculation of process dispatch time if any one process accounts for a substantial portion of processor time. The M/M/1 model will lead to arbitrarily large processor queues at high loads, but we know that the processor queue length cannot exceed m - 1 if there are m processes in the system. An approximation which is suggested in Appendix 6 is to simply exclude the effect of the arriving process when calculating the length of the processor queue which it will see, since it will never have to wait for itself.

Let the total arrival rate of messages to be serviced by processes at this priority be Rp, and let the average processing time of all messages at this priority be t̄p. Then the load Lp imposed upon the processor by all processes at priority p except for the process being considered is

Lp = Rp t̄p - Rtp     (8-5)

We exclude the load of the process whose dispatch time we are considering, as discussed above. From equation 6-32, the dispatch time, td, for our process is

td = (Lp + Lo + Lh)t'/[(1 - Lp - Lo - Lh)(1 - Lo - Lh)]     (8-6)

where

Rp = arrival rate of transactions to processes at priority p.
t̄p = average service time for all processes at priority p.
Lp = load imposed on the processor by processes at the considered priority, except for the considered process.
t' = service time averaged over all priorities, including the considered priority and higher, but exclusive of the considered process.

To complete its function, the process must send an interprocess message forwarding this transaction to another process. This requires a time tipm, which is operating system
time"!!Ad which typically is not affected by other loads. We assume that it does not add to process service time but rather is treated explicitly. Thus, total Service time the m.essage is
for
fds
+ t;pm
This simple model bas inCOIpOrated all of our above points that affect process performance. These six points are the fonowing: 1. Traosaction (message) queue delay is tq • 2. Dispatch time is td. 3. Process time is tp. 4. Higher priority contention is Lit. s. Operating system conteD1ion is 4,. 6. Messaging time is tipm.
As an example of the compounding effects of the process environment on process performance, assume the following parameter values (all are reasonable):

Process time (tp, t̄p, t')                      10 msec.
Operating system load (Lo)                     .1
Higher priority load (Lh)                      .4
Interprocess message time (tipm)               2 msec.
Process transaction rate (R)                   15 trans./sec.
Total transaction rate at this priority (Rp)   30 trans./sec.

From equation 8-5, the load at this priority, exclusive of the process being considered, is Lp = .15.

From equation 8-6, the process dispatch time td is 37.1 msec. That is to say, once a process has work to do, it must wait an average of 37.1 msec. before it can run.

From equation 8-2, a message will require a time ts of 57.1 msec. to be processed once it arrives at the head of the message queue. This time comprises 37.1 msec. of dispatch time waiting for the processor plus 20 msec. of apparent processing time.

From equation 8-3, the time, tq, that a message waits in the message queue is 342.9 msec.

From equation 8-4, the total delay time tds for a message from the time it arrives at the process to the time that it is processed is 400 msec. Adding interprocess message time gives a total processing delay of 402 msec. All this for a processing time of only 10 msec.!
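The chain of calculations above is compact enough to script; this sketch reproduces the example's numbers from equations 8-1 through 8-6 (all times in msec., so the per-second rates become per-msec. rates):

    # Process delay example, equations 8-1 through 8-6 (times in msec.).
    t_p = t_bar = t_prime = 10.0   # process time at this priority
    L_o, L_h = 0.1, 0.4            # operating system and higher priority loads
    t_ipm    = 2.0                 # interprocess message time
    R, R_p   = 0.015, 0.030        # arrivals per msec. (15 and 30 per second)

    L_p = R_p * t_bar - R * t_p                          # (8-5) -> 0.15
    t_d = ((L_p + L_o + L_h) * t_prime /
           ((1 - L_p - L_o - L_h) * (1 - L_o - L_h)))    # (8-6) -> 37.1
    t_s = t_d + t_p / (1 - L_o - L_h)                    # (8-2) -> 57.1
    t_q = R * t_s / (1 - R * t_s) * t_s                  # (8-3) -> 342.9
    t_ds = t_q + t_s                                     # (8-4) -> 400.0
    print(round(t_d, 1), round(t_s, 1), round(t_q, 1),
          round(t_ds + t_ipm))                           # ... 402 msec. total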
To obtain a feel for the cause of this apparent disaster, let us expand equation 8-4 for tds by substituting equation 8-2 for ts. Using equations 8-1 and 8-6, we first write ts as

ts = Lt/[(1 - L)(1 - Lh')] + t/(1 - Lh') = t/[(1 - L)(1 - Lh')]

where we have substituted

t = t' = tp = 10 msec.
L = Lp + Lo + Lh = .65
Lh' = Lo + Lh = .5

Then

tds = {t/[(1 - L)(1 - Lh')]}/{1 - Rt/[(1 - L)(1 - Lh')]} = [t/(1 - Lh')]/{1 - [L + Rt/(1 - Lh')]}
We have in effect a server with a service time of 20 msec. (OK) which is loaded 95 percent (awful!). Note the effect of priority service. If all activities were at the same priority, then Lh = 0, and tds becomes that for a server with a service time of 10 msec. which is loaded 80 percent (as we would intuitively expect). Consequently, tds would be 50 msec. instead of 400 msec. Thus, prioritized service can wreak havoc in a heavily loaded system (the results are more reasonable for lightly loaded systems). In effect, we have seen that the service time of a low priority process can be increased significantly by higher priority activity, which results in a commensurate increase in process load and in a possible dramatic increase in the delay time through the process. One should approach the design of a prioritized system with great caution.

Relative to the effect of process environment, consider the following change to our process structure. If the process were allowed to service all messages in its queue rather than just one message before relinquishing the processor, several dispatch times would be saved. A dispatch would be required only if the queue became empty, that is, for only (1 - Rts) of the messages (a message will find the queue idle 1 - Rts of the time). One might expect this to significantly improve performance. Accounting for this change, equation 8-2 is modified to give a ts of

ts = (1 - Rts)td + tp/(1 - Lo - Lh)

or

ts = [td + tp/(1 - Lo - Lh)]/(1 + Rtd)
This results in a ts of 36.7 msec. for the above case and in a processing delay tds of 81.6 msec. rather than 400 msec. Quite an improvement, and a further demonstration of the importance of process environment on performance.

Of course, nothing comes for free. The time that this process "owns" the processor is now significantly increased each time it is granted the processor. t' in equation 8-6 for other processes at this priority and lower is increased, and delays through these processes will increase as a result of their extended dispatch times. This effect is elegantly stated by the Work Conservation Law (see Kleinrock [15]). The weighted sum of the queue waiting times is a constant, given by

Σ (p = 1 to P) Lp Tqp = Lt T0/(1 - Lt)

where

Lp = server load at priority p.
Tqp = queue waiting time at priority p.
Lt = total load on the server.
T0 = time to complete the service of the current item when a new item arrives at the queue.

This is formally proved for nonpreemptive systems. Thus, if we improve the level of service for one class of items, others will surely suffer.
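Continuing the numeric example, this sketch evaluates the batch-service modification above, using the dispatch time already computed:

    # Batch service of the whole message queue (dispatch only when it empties).
    t_d, t_p = 37.14, 10.0         # msec., from the example above
    L_o, L_h = 0.1, 0.4
    R = 0.015                      # arrivals per msec. (15 per second)

    t_s = (t_d + t_p / (1 - L_o - L_h)) / (1 + R * t_d)   # -> 36.7 msec.
    t_ds = t_s / (1 - R * t_s)                            # -> 81.6 msec.
    print(round(t_s, 1), round(t_ds, 1))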
Process Time

There is not a great deal that can be said analytically about process time. Before a program is written, it is difficult to estimate process time except from general experience with similar systems. After a program is running, average processing time and all manner of dispersion measures can be obtained by the various performance measuring tools often provided with the system or by instrumenting the process itself. Seldom can these values be deduced analytically. However, their accuracy has a strong impact on performance analysis.

So how do we make performance predictions on a machine that hasn't been built yet, much less programmed? Or on an application that is currently being programmed? Do we pack our bags and give up? Or once again, do we invoke our cloak of devout imperfectionism and give it our best shot?

It has been the author's practice, based on experience with several systems, to use the following process times if no better information is available. They are based on a 32-bit 1 MIPS processor and should be adjusted up or down accordingly. They should also be appropriately adjusted or replaced based on the user's own knowledge and experience.

Function                        Process time (msec.)
Communications (per block)      5
Application (per file call)     5
File manager (per operation)    35
To these values should be added the process's context-switching time if significant (the time it takes for the operating system to switch processes). For instance, consider a process which reads a message (1 block), makes three file calls, and returns a response (1 block). Assume context-switching time is 2 msec. The process must be given control of the processor 4 times: once to read the incoming message and once at the completion of each file call. Total processing time is therefore 2 x 5 + 3 x 5 + 4 x 2 = 33 msec., or 33/4 = 8.25 msec. per dispatch.
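The rule-of-thumb arithmetic above generalizes into a small estimator (the per-operation times are the 1 MIPS figures from the table and should be rescaled for other processors):

    # Rough process-time estimate from the rule-of-thumb table (msec., 1 MIPS).
    COMM_PER_BLOCK = 5
    APPL_PER_FILE_CALL = 5
    CONTEXT_SWITCH = 2

    def process_time(blocks, file_calls):
        dispatches = 1 + file_calls      # once per message, once per file call
        total = (blocks * COMM_PER_BLOCK +
                 file_calls * APPL_PER_FILE_CALL +
                 dispatches * CONTEXT_SWITCH)
        return total, dispatches

    total, n = process_time(blocks=2, file_calls=3)
    print(total, total / n)              # 33 msec., 8.25 msec. per dispatch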
At least these values provide a reasonable starting point. As real values are obtained, they should replace the suggested values, and the model should then be reevaluated.

Dispatch Time

The dispatch time for a process is the time it must wait in line to obtain access to the processor, i.e., its waiting time on the ready list. Dispatch time has been thoroughly discussed in chapter 6 in the section entitled "Task Dispatching." Dispatch time expressions are given for single processor and multiprocessor environments and for preemptive as well as nonpreemptive schedulers.

As a summary of that section, dispatch time is viewed as the time a process must wait in a queue (the ready list) before it has access to the processor. The service time for items in the queue is the average time per dispatch that processes in that queue will consume, i.e., the average amount of time that these processes will be active once given the processor. This is a real time, calculated from the actual average CPU time consumed by these processes and adjusted for operating system and higher priority process activity.

The queuing models used are M/M/1 for a single processor environment and M/M/c for a multiprocessor environment. This assumes that process times are exponentially distributed and that arrivals to the ready list are random. It also assumes that the number of processes in the system is much greater than the ready list's average length. If the number of processes is not large, then the determination of dispatch time requires an iterative calculation, as described in Appendix 6. To avoid this complex calculation, a useful approximation is to simply ignore the impact of a process on the processor's dispatch queue when calculating its dispatch time. This approximation is evaluated in Appendix 6.

In order to calculate dispatch time, the performance analyst must be able to estimate the dispatch rate and average process time per dispatch for each priority level. He must also have a feel for the overhead imposed by the operating system.
Priorities

A process is affected by higher priority processes, since these steal processing capacity from all lower priorities. It is affected by processes at its own priority, since it must compete with these processes for CPU time. It may also be affected by lower priority processes if it is running with a nonpreemptive scheduler, as it may have to wait for a lower priority process to complete before it can be given the CPU. The effects due to processes at the same priority and at lower priorities are dispatching problems and are covered in the previous section and in chapter 6 under "Task Dispatching."

Higher priority tasks not only slow down dispatching, as described previously, but also slow down the process itself if the scheduler is preemptive. This effect is taken into account in the queuing model for preemptive priority systems given in chapter 4 (the preemptive server model) and in the above example. In effect, the task processing time is increased by 1/(1 - Lh), where Lh is the load imposed by higher priority tasks. Note that this is true only for preemptive schedulers.

A nonpreemptive scheduler will cause a delay in the dispatching of a process if a lower priority process is currently running, since higher priority requests will not interrupt a process once it is scheduled. A process waiting in a processor queue must wait for the processing of all processes of equal priority that arrived earlier. It must also wait for all higher priority processes to be processed, regardless of when they come in. This latter delay takes the same amount of time regardless of whether the execution of lower priority processes is or is not interrupted by higher priority processes. However, once given the processor, a process will not be interrupted by higher priority processes if dispatching is nonpreemptive. It will be affected only by operating system overhead.
Thus, for nonpreemptive dispatching, equations 8-6 and 8-2 become, respectively,

td = Lt t'/[(1 - Lo - Lh - Lp)(1 - Lo - Lh)]     (8-7a)

ts = td + tp/(1 - Lo)     (8-7b)

where Lt is the total processor load.
Operating System Load

Like process times, the operating system load is often not a subject of the performance analysis unless it is the operating system itself that is being analyzed. Rather, this load is an input parameter to the model (or in some cases is ignored as being small). Typical operating system loads are 5 to 20 percent in contemporary systems. Note that this load does not include interprocess message time, which is handled explicitly, or process switching times, which should be bundled in with the process time (this is typically one to ten milliseconds unless powerful hardware support is provided). Operating system load does include interrupt handling, timer-list management, and fault-recovery provisions.

As TP systems get more complex, and as operating systems do more and more for us, one point to make is that the operating system overhead continues to grow with each new product. Let's keep an eye on this factor.
"ssegi..., In order for a TP application to be stnlCtDIeCl as a set of c:ooperat:iDg processes, the system must provide a mechanism for passing.m.essages betweeIl processes~ Various types of messaging 111fdJanjsms m discussed in cbaprer 6 in the section entitled ulnte.ipoce&$ Messaging." !
.
These messaging mechanisms are generally implemented in one of three ways:

1. Common memory, used often for mailboxes that allow one process to place a message in the mailbox and then to set an event that will notify the receiving process that it has a message. Common memory messaging techniques are applicable only to single computer or multiprocessor architectures in which all processes have access to a common memory.

2. Message system, in which a special operating system facility is provided to accept a message from a sending process and to route it to the input queue of a receiving process. This messaging mechanism is applicable not only to multicomputer systems but also to networks of multicomputer systems, since the message can be sent over the network.

3. File system, in which message queues are implemented as files and in which advantage is taken of disk cache to keep messages memory-resident. In this scheme, a receiving process will open a file that it designates as an input queue. Other processes open this file and insert messages for the receiving process. A special event facility is provided to alert the receiving process that a message is available. This technique is also available to networked multicomputer systems if the appropriate degree of transparency has been provided. This technique has the advantage of writing a queue to disk (via the normal disk cache flushing function) if a queue gets too long or is inactive.

All messaging systems usually provide a response facility so that a response may be made to a message. A notable exception is the UNIX pipe, which is a one-way-only message facility.

In general, common memory message systems are the fastest, and file system message systems are the slowest. Typical request/reply times for today's systems are as follows:

TABLE 8-1. TYPICAL MESSAGE TIMES

Mechanism        Request/reply CPU time (msec.)
Common memory    0.1 - 1
Message system   1 - 10
File system      10 - 100
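As a toy illustration of the second mechanism (a system facility routing a request to a receiving process's input queue, with a reply path back), here is a hedged sketch using threads and in-memory queues; the names and the request/reply convention are invented for the example:

    # Toy sketch of a message-system mechanism: a sender posts a request to
    # the receiver's input queue and waits on a reply queue.
    import queue, threading

    server_in = queue.Queue()

    def server():
        while True:
            msg, reply_q = server_in.get()
            if msg is None:
                break
            reply_q.put(msg.upper())    # "process" the request

    threading.Thread(target=server, daemon=True).start()

    reply_q = queue.Queue()
    server_in.put(("hello", reply_q))   # request
    print(reply_q.get())                # reply: HELLO
    server_in.put((None, reply_q))      # shut the server down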
Queuing

Given the effects of process time, dispatch time, priorities, operating system loads, and interprocess messaging, a process's service time is calculated as described in the section "Overview" and given by equation 8-2 for preemptive scheduling or by equation 8-7b for
nonpreemptive scheduling. Depending on the system, interprocess message time may be includable in the process service time. This was assumed not to be the case in the example given in the Overview above.

Knowing the process service time, we can now calculate the queuing delay of messages waiting for this process and therefore the total message delay time as it passes through this process. Usually, messages can be assumed to arrive randomly from a large number of users (or at least via tandem queues that are fed by a large number of users-see chapter 4 under "Some Properties of Infinite Populations-Tandem Queues"). It is also usually appropriate to characterize the process time, tp, as being random. Thus, the M/M/1 queuing model is generally appropriate. Other queuing models might be appropriate in special situations, as shown in Table 8-2. May all your models be M/M/1!
TABLE 8-2. APPLICABLE QUEUE MODELS

M/M/1             Random arrivals, random service times
M/D/1             Random arrivals, constant service times
M/U/1             Random arrivals, uniformly distributed service times
M/M/1/m/m         Limited population, random service times
M/M/c or M/G/c    Random arrivals, process is one of many serving a common queue
M/M/c/m/m         Limited population, process is one of many serving a common queue
PROCESS STRUCTURES

A TP application can be implemented in a myriad of ways, from a single, large monolithic process to a complex set of independent cooperating processes. While it is not the purpose of this book to discuss issues in the design of TP systems, the structure of these systems is important for performance analysis; it helps to understand why one structure has advantages over another.

We will, in fact, be dealing primarily with the requestor-server model and its variants. This model was discussed in chapter 2 (under "Software Architecture-Requestor-Server") and was used as an example in chapter 3.
We will first dispatch monolithic structures. A TP application can be written as one large monolithic program. This program would handle all communications and all data-base activity required by each type of transaction submitted to the system. Such a structure has a lot of problems-it certainly flies in the face of all we hold dear in the modern theories of software architecture-in that it is hardly modular. Some of the problems one faces with such a structure include the following:
• The application will be very difficult to maintain, whether maintenance involves bug fixing or functional enhancements.
• There is little capability to tune the system to achieve better performance.
• The monolithic structure does not lend itself to a distributed architecture. Consequently, the application cannot grow in volume by adding additional computing elements.
• It is very difficult to add new transaction types.

Requestor-Server

In contrast, a TP application built as a set of requestors passing transactions to a set of servers is a model of modularity. Since each program is small, it is easy to maintain. The system has immense tuning opportunities, especially in a distributed environment, by moving processes to less loaded computing elements, by adjusting the number of servers, and by sizing the requestors to each handle an optimum number of users.

A generalized requestor-server model is shown in Figure 8-2. Here we see that each user is served by a requestor process. A requestor process can usually service multiple users, though its set of users is fixed. Requestors pass transactions received from their users to an appropriate server for processing. There is typically one server type for each type of transaction. If a transaction represents a significant volume, there could be multiple servers of that type created that would share the transaction load. We will call this set of like servers a server class. The servers send file requests to the file managers, which control disk activity.

[Figure 8-2. A generalized requestor-server model: users feed requestors, which route transactions to server classes, which in turn send file requests to the file managers.]
There could be one file manager per disk volume, one per multiple disk volumes, or multiple file managers per disk volume, as discussed in chapter 7. Each of these process types has different characteristics, as will now be described.
Requestors. Since a requestor must service a set of users, each of whom may be active simultaneously with the others, it must be a multithreaded process, i.e., it must be able to handle several concurrent transactions. In effect, a requestor must have embedded in it a sort of minioperating system that will perform multitasking within the process. Every time it is given control of the processor, it should satisfy all outstanding processing for all of its users before it releases the processor. Multitasking within a process is further discussed below.

The requestors are often designed to service a particular type of user, with different requestors being provided for different classes of users. For instance, there might be a requestor type that services users directly connected by asynchronous lines, another requestor type to service users on a multidropped channel, and still another requestor type that will service users communicating over X.25 links. As many requestors of each type as necessary are provided to service the user community. The responsibilities of the requestor include the following:

• Providing the protocol support required by the user in order to reliably transfer messages back and forth (though this is sometimes the function of the communication driver or even an intelligent communications controller or modem).
• Editing and validating messages received from the user and obtaining corrected data when necessary.
• Recognizing the type of transaction received from the user and sending it to the message queue of an appropriate server class, which will process this transaction.
• Returning replies received from the servers to the appropriate terminal.
• Saving the context of a transaction that may involve several messages and inserting contextual information when required.
The problem of context saving has been described earlier. To repeat it, consider the case of several alike servers within a class that cooperate to process one type of transaction. This transaction may be quite complex and may require several interactions with the user to complete. Each such interaction involves a message from, and a reply returned to, the user. Now suppose that the actions to be taken on one message depend upon the contents or results of previous messages. The server that is to process this message needs to know the pertinent information that was contained in earlier messages or replies. However, it may not be the same server that processed these earlier messages, as each message is routed independently to a server based usually on some load-balancing algorithm. Therefore, the server must be given this information when it receives the new message. This
information is called the context of the transaction; it is the context within which the message is to be evaluated.

If the server cannot hold the context of the message, something must. In some systems, the terminal is intelligent enough to save its own context. However, additional communication loading would be incurred, as this information is repetitively transmitted with each transaction. Furthermore, we must design the system to handle a broad range of terminals, many of which may not have this sort of intelligent capability. The obvious choice, then, is the requestor. As messages carrying contextual data which must be saved pass through the requestor, the requestor is responsible for saving that data in a data area dedicated to that user. As any message subsequently passes through that requires some of this contextual data, the requestor must insert this data into that message.

One potential problem with this approach is that the context storage area is multiplied by the number of users supported by the requestor. If we are not careful, the size of the combined context area could exceed the data space limitations of the requestor, and this context area would then have to be written to a disk file. This additional disk activity could cause severe performance degradation.

Servers. Each server is typically designed to handle one type of transaction or perhaps a small number of alike transactions. A server is usually single-threaded; it accepts a message from its message queue, processes it-including all file accesses that are required-formulates a reply, and returns the reply to the appropriate requestor. Only then does it read the next message from its queue in order to process it.

As discussed above, a server is context-free. It must be given all the information needed to process a request in the request message. Once it returns a reply to that request, it has no further memory of that request.

Since a server is single-threaded, it has a limited capacity and could become a bottleneck. To avoid this, several alike servers can be spawned to share the load of a particular transaction type. Servers within a class could be started up at system generation time, in which case the number of servers in a class is fixed. As an alternative, servers could be spawned and killed dynamically to account for load variations. Such dynamic servers are discussed later.

When there are multiple servers in a class, different systems feed them in different ways, as shown in Figure 8-3. In some systems, each server within a class has its own queue, as shown in Figure 8-3a. A requestor sends its request to a server based on some load-balancing algorithm, typically to the server with the shortest queue, or instead on a round-robin basis. If the load on a server class of size S is L, then each server is working from an M/M/1 queue and is carrying a load of L/S. In other systems, servers work off of a common queue, as shown in Figure 8-3b. This is, of course, the more efficient technique. The queue in this case can be characterized by the M/M/c model, where c = S, the number of servers.

[Figure 8-3. Server queuing alternatives: (a) S servers with independent queues, each carrying load L/S; (b) servers working from a common queue.]

One problem not mentioned yet with regard to servers is data locking. In many systems, a data lock on a data base is owned by a process; only that process may remove the lock. Thus, if a server is to lock a file while an operator views some data with the
potential to update it, that same server must do the update so that it can unlock the file. But we cannot guarantee that the same server will be the one to process that update. If another server receives the update message, it will not be able to remove the lock.

One solution to this dilemma is to design transactions so that all locks are placed and removed during the processing of a single message. This may not always be possible. Another solution is to create a separate locking process that responds to server requests to place and remove locks. Since such a process would own all locks, it could remove them. This approach is complex (and a potential bottleneck).
Application Environment
274
Chap. 8
In other systems this is solved by vesting ownership of the lock·wiflt.the traDsadion. Each transaction is given a unique ID, and this ID is carried with each message and with each lock. In this way, any server can unlock a data element locked by another server by simply providing the transaction ID. File managers. The file mIl1UZger has been discussed in detail in chapter 7. To review its characteristics, it is responsible for rec:eiving and executing file system te,queSts. These typically include opening and closing of files, reading and writing of records, and maintaining data locks. The file manager can be organized in a variety of ways (see Figure 7-5):
• TheJ:e can be ODe file manager for the system; it handles all requests for all disk volumes. In this case, the file manager can become a significant bottleneck. • There can be ODe file manager per disk volume. This relieves the file system bottleneck to the extent that there are multiple disk volumes. As with a single file mauager, the disk volumes will be underuiilized, since they lie idle while the file manager is processing a request. This underutiJization may be in the Older of SO percent.
• There can be multiple file managers per disk volume. This can improve disk utilization signDicant1y and therefore can further reduce the file system b0ttleneck. Only in the last case will there be a queue provided for each disk volume (Dot shown in Figure 8-2). Since the first two architectnres result in each disk volume being driven by a single file manager, no queue will build at the disk volume itself.
A process can be single-dD:eadecl, as are the servers in a requestor-server environment. In this case, c:onamency can be achieved by CRating sevenl alike processes. If n simi1ar processes are created, 1ben they can process n tasks simu1taneous1y. This is multitasking at the sysrem.level. Multiple tasks this case, processes) are c=ated and managed by the opemting system. It is one thing to aeate 3 or 4 server tasks to provide ccmcmrent processing for a ttansaction type. It is quire anodIer thing to create 1000 processes, each to act as a xequestDr for a user. This is because each process imposes its own ovedJead, both in tams ofJDelDOlY space ~ coattol blocks, file c:ontml bloc:ks, etc.) aDclin terms of time for switching. The management of a large number of can impose a sigDif-" icant ovedlead on a TP sysrem.. Note that code and daIa space zequirements are basically the same whe1:ber we have many processes at one per user or one process per many usas, since the code is sbaIed and tile data have to be separate ayway. It is the ancillary memory and time teqUiIemeIdS that become a problem. but typically only for a large
em
Pmcess
processes
areas
~·of processes.
Chap. 8 .
Process Structures
275
. ..- Therefore, it is advantageous to be able to czeate a process that is multitaSldilg within itself. Such a process is multithreaded and can handle several COIlC1ment tasks, such as the bandling of multiple teIminals or communication lines by the requestors in the requestorserver model. Such a process bas embedded in it the elements of an operating system so that it can switch between tasks and petbaps even manage its own memory space. However, being a small version Of the big guy, it imposes a comparatively smaller overllead OD the system. In a process, there are two fundamental techniques for multitasking, which are similar to the two major techniques for achieving multitasking in any opeI3ting system. One is scanning, or time sbaring in the classic sense. With this technique, the tasks are serviced in round-robin fashion until the process makes a complete cycle through all the tasks and finds them idle. At this time, the process can telinquish the CPU. The other technique is to build an event-driven process. This is, of course, equivalent to modem operating systems. With this tedmique, events affecting a task controlled by a process are IeCeived and queued by that process. An event might be the completion of a termiDalllO or disk I/O or the receipt of a message or a message reply. The process will work through its event queue, performing task processing as requiIed, until it bas exbaustecl the queue, at which point it will relinquish the CPU. It will be IeSCheduled by the operating system when another event is directed to that process. Some lP systems provide one or the other of these multitasking options as part of their enviromnent. In other systems, multitasking is a grow-your-own project. From a perl'oJ:maIwe viewpoint; an event-driven multitasking process is very efficient, and its overhead often is DOt considered to be sigaificant in the perfomumce aDalysis. This overhead is simply taken into account in the process time (and is usually buried in the estimation error for task time anyway). If it is to be analyzed, its analysis follows the techniques already CODSideIed. However, scamriDg overhead for a time-sbaring muJritasldng process can be sigDificant. This can be aDalyZed as follows. Let tt = time to process an event for a task.
t_ = time to switch tasks. ti
= time to detcmrine that an idle task needs DO processing.
fey
= scauner cycle time, i.e., time to cycle once tbrougb all tasks.
Re = rate of eveDts miving at the process. IIr = D.1IIDber of tasks serviced by this process.
Let us also assume that the event rate for a particular task is such that the arrival of more than ODe event per task during a cycle time fey is unlikely. Then, during a complete cycle through all tasks, which xequires a time tey, there wiD be: IIr task switches Jeq1Iiring a time of IIrt,rw.
Application Environment
276
Chap. 8
Ret" events arriving in time to be processed, Iequiring a time of -R.~tr. (nt
-
Ret,,) tasks that are found idle, requiring a processing time of (nt - Ret,,}t;.
Thus,
t"
= n,tsw + Ret"tt + (n, -
Ret,,}t;
This can be solved for the cycle time 'CY: _
n,(ti + tsw)
Icy - 1 - Re(tt - t~' Icy S n,(tt + tsw}
(8-8)
That is, the SC8DDer cycle time is the kU.e cycle time divided by 1 minus the iDcremeDtal processing load per cycle. InaemeDtal processing load is that time Iequired to process an event over and above the idle processing time. Note that the cycle time cannot exceed a task process plus a task switch time (tt + tsw) for each of the n, tasks, since it was assumed that no more than one event per task would be processed. Thus, equation 8-5 is valid only for cycle time t" S n,(tt + tsw}. Note also that the cycle time inaeases as the number of tasks increases. It also increases as the transaction rate increases unless the task proc:essiDg time, tt, equals the idle processing time, ti. In d:ais case, it makes no diffelence whether an event has occurred or DOt for a task-the processing time is tbe same. Asmmring that the mival of events to the process are random, then the probability of zero events ocauriDg in a cycle time PJ..t,,} is given by the Poisson distribution and is from equation 4-59: Po (lcy) = e-~
(8-9)
Thus, with probability I, the process will scan through aU tasks the first time. It will make a second scan if ODe or more events have miwd during the first scan, whic::h wiD. occur with probability (1 - e-R.~). Likewise, the probability that it wiD. make a tbiId scan is tile probability t:hal it had to make a second scan. and that an event mived dming that scan. Thus, the probability of a 1bird scan is (1 - e-~"f. And so on. The average DlJIDbe:r of scaDS that the scanaer must make, 11;" is
n" = I + (1 -
e-~)
+ (1 -
e-~'f
+ ...
or from equation 4-46
"" = I -
1 (1 - e-lU;)
=
eRe . ~
(8-10)
and the process will have control of the CPU for a period, tp , of Ip
= ""Icy
(8-11)
Once the process releases the processor, it ~ be dormant until another evem mives•. Since event mivals are assumed to be random, tbe process wiD. be dormant for a time lIRe. Now let us define
Chap. 8
Tcy
~.the
m
Process Structures
process cycle time, i.e., the time between the invocations of the
scaDDer pr0-
cess. Lp = CPU load imposed by the scanner process. ep = processing efficiency of the scamaer process. The process will be invoked evety
Tcy = nstcy + liRe
(8-12)
seconds. During this time, it will process ReTcy events using nstcy of CPU time. Thus, the load imposed upon the processor is
Lp = nstg Tcy
or, from equation 8-12,
Lp = Renstcy
Renstcy + I
(8-13)
In on:Ier to process ReTcy events, it will expend a useful processing time of ReTeyt, while expending an actual processing time of nstcy. Thus, the efficiency, ep , of the process. taken as the ratio of useful to actual processing time. is
ep _- ReTCltr _- -Ret, (IZstey + It'D ..~.) nstcy nstcy or (8-14)
tcy and n. are giveD by equations 8-8 aDd 8-10. An example will help to put scaDDel" times and efficiency into pelspedive. Let us assume tbat a requestor process .using scamring has thefollowiDg typical pammeters: Task·switcbiDg time (t,.,) = I JDSeC.
= I msec.. Eveat poCessiDg time (t,) = 10 JDSeC.
Idle processing time (I;)
Number of !aSks (n,)
= 32
This requestor will CODtrol32 terminaJs. Let us assume that a user at a termina) is entering Iequests at an avenge of once every 10 sec:oads. As he enters a request, c:enain fields are sent on-the-fty to the requestor for verification. When the request is releasecl. the requestor sends it. if valid. to a server and then retums a Ieply. Assume 4 fields are sent for validation so that the requestor .must bandle 6 events per ttaDsaction (4 fields. the fiDaJ request, and tile reply from the server). Thus. the event rate. Re. is
Re = 6 x 32110 = 19.2 eveD.1SIsec.
Application Environment
278
Chap. 8
From equation 8-8, the cycle time, tey, for the scanner is tey
= 77 JDSeC.
That is, the requestor takes an average of 77 during one
JDSeC.
to pass through all tasks that arrive
scan cycle.
From equation 8-10, the average Dumber of cycles that the requestor will make before finding all tasks idle is PIs
=e(l9.2)(.077) = 4.4
Thelefore, OIl the average, the requestor will remain active for nstcy = (4.4)(77) = 339 msec. and then will go dormant for liRe = 1/19.2 = 52 JDSeC. During this time it will process Re(n,tcy + 1/14) = 19.2(.339 + .052) = 7.5 messages and will do useful pr0cessing of (7.5)(10) = 75 JDSeC. Thus, its efficiency, ep , is 751339 = .22. This is the same JeSUit we would obtain from equation 8-14. -
Tbeload,Lp,imPO$edontb.ecPbiS339/(339+ 52)'~
.87, wbichistberesuittbai
would be obtained from equation 8-13. In Older to be efficient, a scanner must be kept busy, which means that it must impose a heavy load on the CPU. Scanners (and event-driven requestoIs, too) can seize the CPU for long periods of time. For 1bis 1e8SOD, requestors me often designed to Ielinquish the CPU if more than a certain amount of time (or scanner cycles) bas elapsed.
We bave previously discussed the concept of dynamic servers, in wbich the Dumber of in a class is adjusted to compensate for the load being imposed on those servers. As the load incRases, more servers are spawned. As the load decreases, 1JDDeCeSsary servers are killed. Of course, one server must always remain, even dnring idle times. It is useful to be able to predict the D1.DDber of servers that will exist UDder a given load condition. This depeads upon the algorltbm used to spavin and kill servers. Suppose that we have a TP system. in wbich each server has its own queue (as in
servers
Figure 8-3&). Requests lie passed to each server on a rouad-robin basis. If the queue of
any server exceeds a length of n, we will spawn a Dew server. If the length. of any queue goes to zero, i.e., the server is idle, we will·kill a server. Of course, 1hc:I:e should be some time delays to easme that we do DOt spawn and kill servers at a mpid tate to 8CC(W!!!!DC)dat.e short term fiudnar:icms. Msnming that queue arrivals and server service times me random, tbe probability that a server will find its queue exceecting n items is (from equation 4-83) /
wbae
L = total average load on the server class. S = DIIIDber of servers. and LIS is the average load on each server.
Chap. 8
Process Structures
279
.. ...The probability that a server will find its queue empty is . P(Q
= 0) = 1 -
US
The steady state is achieved when these two probabilities are equal; that is, in a steadystate condition, during any reasonable observation interval we would want to spawn a process as often as we would want to kill ODe. Thus, under a given transaction load, L, the number of servers that will exist is S. where S satisfies the expression .
(Usr+ 1 = 1 - US
(8-15)
By manipulating this equation, we obtain S"(S - L)
= L"+l
(8-16)
This exp:ession is evaluated in the following table for various values ofLand n, giving the mjmmmm number of servers requD:ed to satisfy equation 8-16. For good perfonnance, we nonnally don't want to see queue lengths longer than 2 or 3. The table shows allowable queue lengths of 2 to 4. TABLE ..... NUMBER OF DYNAMIC SERVERS(S) Load (L)
AJlowabJe queue Jeagdl
2
1 1
1 1
1 1
2 3 S 6
S
2 3 S 6 8
10
IS
2 3 4 6 7 14
0
.5 I
2 3 4
7 14
Another way to conside:r1be dyDamic server case is to use the server load that will be
mainJainecJ by the system on each senrer in Older to control its queue size. From equation 8-15, 1he foBowiDg 1able can be CODStrUCted: TABLE .....
~SERVERLOAD
AJlanbIe queue Jea&th (II)
Sener Load (US)
1
2
3
4
.62
.68
.73
.76
Note from. Table 8-3a tbat overtbis wide range, theleis never muchdiffemlce in the number of servers requD:ed to maintain a queue offouror a.queue of two. 1'heIefoze, there is Jittle ieasOD to choose higher queue.lengths-at least in this example.
Application Environment
280
Chap. 8
If the servers were fed from a common queue, the ~ ~ w.o.l1ld hold, except that the probability of queue lengths would be given by equations 4-93 to 4-95 for the MIMIc model. These equatiODS, of course, would require computer-aided evaluation. AqnchlOlIOUS
I/O
One other technique used in process design to speed up the system is the use of asynchro1llJ1/S 110. Usually, when a single-thIeaded process makes an 110 request, it will then pause and wait for the result. Servers are generally designed tbat way, for instance. Since the process is synchroDized with its 110 requests, we will call this syncbIOnous 110. With asynchrODous 110, a process can issue an I/O request and then continue on and perbaps issue even more 110 xequests. As a result, it can have several CQIDJDIlIrication or disk JeQUests pending wbile it continues processing. It can check periodically for the Completion status of each requeSt and cany on its or it can pause if it must wail for one or more requests to complete. It will then be awakened by the opemtiDg system. when a request has completed. ' Asynchronous I/O can speed up the processing of a complex transaction significantly. The exact amount ofperformanc:e improvement is usually difficult to evaluate; but the following heuristic argument provides a IaSODable approach. It will be argued that the total processing time of a process using asyachroDous 110 for disk servicing is determined by the longest of the disk or processing times (here, disk servicing is used as just one eumple of 110 service). In addition, one must consider an iDitial disk lead to uprlme the pump" and a final disk write to complete processing. Tbis is shown for two cases in F1gIR 8-4-when processing is the domiDant time and when
proc:essmg,
disk is the domiDant time. In both cases, an iDitial disk read time, t,., is shown, as well as a fiDal disk write time, two In aD. actual case, either or both may DOt be mquiIed. Figure 8-4 should be CODSideIed to IepreSeDt the general case. W'J!h respect to Figure 8-4a, it is clear 1bat the time requized is the processing time, tp , plus the initial ancl fiDal disk accesses when processing time is p.Eedonrinant. When disk time is predominaDt (Figme 8-4b), 1hen 1be time requjred is the disk time. Thus, if t, is the time r:equUed for the process to finish its 1aSk, and if tp is the pm:essiDg time, ancl if t4 is the disk time exclusive of the first read and last write, one can expms this'm1ation as tt tl" + tw + max (z", tp ) (8-1'7)-
=
which states that the 1ask time is 1be sum of the initial read and fiDa1 write plus the processing or intamediate disk time, whichever is greater. A queuing point oc:curs because of asyncbronous 110 wbich bas DOt been c:cmsideRd; , that is, disk completioDs can queue for a process that bas issued multiple disk requests and that is busy when these requests complete. That this does DOt increase the 1ask time as a first approximation can be seeD from FIgUre 8-4 and the following argumentS: • If processing time is predominaDt (FIgUre 8-4&), tben the task time is dependent upon the proceSsing time. The fact that disk c:omp1etioDs are queued' and waitiDi
Chap. 8
An Example PROCESSING
l~
281
= tp
_ _ _ _ _ _~y~______~1
DISK = td
PROCESSING PREDOMINANT (a)
PROCESSING = tp (~,
________~A~________~)
t,
DISK .. td
DISK
tw
PREDOMINANT (b)
for the process will DOt add to the overall p.roc:essiDg time and thus will DOt add to' the task time. • If disk time is predomiDaDt, it meaDS that to' a first approximation the process is available to process each disk c:omp1etion as it occ:urs. Thus, queues of compIetioDs wiD. DOt build up.
The above aualysis is equaD.y applicable ,toasynclmmous 110 on C:OlllllmnjeSon liDes. Simply zep1ace disk time with romm1Jllic:ation time. ' Async:hroaous 110 is most effective 8£ high.loads wbal borh disk and poc:essor are quite busy. It caD be ~ allow loads siDc:e the sysI'eIIl is idle and since 1bere is iDsufficieDt acti:vitY to support overJapped disk aDd pIocessOI' fu:DctioDs. III Ibis ~ die 1aSk time, It. would be d1e sam of an processiDg Compoaems. In actual pndice.d1e IeSIIlt wOuld lie somewhere between. sDic:e we am DlO1'e :iDI'aesred in loaded 1P systemS. we can assume that overJapped 110 is generally effective.
AN EJCAMI'I.E As an example of modeling' an applicalioD enviromneDt. _ "-Mn look at an adaptation of what is known as d1e BTl benchmark TbisiS beoondnganindustty standard to measure d1e performance of 1P sysrems and is described in Anon.II]. '
Application Environment
282
Chap. 8
This benchm.aIk. models a banking application. h considers a teller transaction tbat a teller file, and a br8nch file and t:bf:D write an audit record to a history file. Each oftbe three updated files is accessed by key, but no key files need be updated. The teller request message is a lOO-byte message with a lOO-byte reply. Ten thousand teller termiDals are CODDeCted to the system via an X.25 netWOrk. Each teller gen-
. .- ttlUSt update a customer account file,
elateS one transaction every 100 seconds. To support 10,000 tellers generating 6,000 transactions per minute, the system size is significant indeed. We define the ~ of this benchmark system. as tbat load which will result in an average EeSpoDSe time of two sec:oods. The file c:haracte:rist:cs and their access activity are given in Table 8-4. Note tbat only the history file is mirrored. All other files are protected via a transaction protection mechanism, which will not be modeled here. This will be discussed in chapter 9 as part of the discussion of fault tolerance.
TABLE 8-4. ET1 BENCHMARK PARAMETERS File
BllIIlCh TeDer CUstomer History
Reconf sia (bytes)
100 100 100 SO
File sia
File
File
(reccmIs)
OIpIIizalicm
access
Keyed Keyed Keyed
Update Update Update
MimRd
Wdte
1,000 10.000 10.000.000 90 Days
sequ=ial
Our system'is a multicomputer system built ac:con:Iing to the requestor-server model, as shown in F"IgUIe 8-5. RequestC:n are evem-driven, ivmcf!jng up to 32 terminals each. Servers are dynamicaUy allocated, with independent queues and a maximum queue length of 3. The system is configmed as an expandable multicomputer system of P ideDtica1 moclules. There are P comptIIte.ts aad D disk UDits, orgarri7M so that each c:ompater has DIP disk 1JIIits. Each disk volume is driven by one &Ie msmager aadbas 1.5 megabytes of cache memory available to it. .ODe ctiik volume is mhror:ed for the history file. The otber . files are parti1ioned among all disks to· achieve 1IDifcmn disk 1oadiDg. . The pmpose of this eximp1e is to iIbistrate the modeling of an application environTberefore, we will defiDe the DSpODSe time as the time from the'iirlvai ofibe lisi' . byte of the zecpst to the traDsmisSioD of the first byte of the. reply. In this way, we caD
man.
ipole the C(ID1lngniCaMn netwoIk. However, the file cbaractaistics are impol1ant (aDd, in fact, dominant) in the per-' formance of this system. We will DOW evaluate them in the context of chapter 7. Let us first look at the size of the files and their key struc:tuEeS to obtain a feeling for disk cache effectiveness. First of an, we will assume that files are DOt index sequendal; therefore, each is supi»orted with a sepalatekey tile. SiDcekey files are DOt to be npda1ed in this application, We can assume that the slack factor for these files is 1. FIom Table 8-4
Chap. 8
An Example
283
• TELLER TERMINALS
••
• •
•
and equation 7-19, we can determine the file sizes and B-tme levels for each file; these are" given in Table 8-5. To do this, we must make an assumptiOD about key size and sector size. For pmposes of this model, we assume tbat a key record requires 15 bytes and !bat the sector size is 1,024 bytes. Thus, a sector will hold 68 key records, 10 data records for an update file, and 20 records for the history file.
TABLEN.
m
BENCHMARK KEYED FILES B-uee Lew:1 sizes (byIes)
File size
B-TNe
File
(bytes)
levels
lOOt
IIIIIlIl
Braach
lOOK
1 2 3
II{ IK II{
3K
Teller Castcmer
1M IG
MIt
Key file size (bytes) lSI{
32lC.
1481{ 2,163K
147,()S9K
In Table 8-5, K. is kilobytes (a sector size), M is megabytes, and G is gigabytes.
Since we bave I.S megabytes of cache, we can deduce the followiDg about cacIie effec': tiveness: • The bJanch file is hit frequently. and boch its key file and data file will probably always be in cache. 'Ibis will consume·ll6K. of cache.
Application Environment
284
Chap. 8
• The key file for the teller file will probably always be in cache (it ~ 152K), · but the teller data record bas a lower probability of being found iD'Cache. We will assume that teller data IeCOIds must be read from disk. • The first two levels of the customer B-ttee (33K) should always be in cache. We will assume that the lowest level of the B-tree, the key record, and the data record will be read from disk.
We will also assume that a IeCOId that bas been read at the beginning of a tnmsaction will still be in cache when we come to update it. That is, it will have a longevity rate of at least 100 seconds. This assumption will be checked later when we have determined what transaction rate each disk will support. For DOW, the cache usage that we have assumed is 116X (152 + lOOR.,,)K
( 33 + 3OOR,,)K (301 + 4OORa,,)K
where Rid is the tnmsaction rate per disk. The Rid factors account for one teller sector and three customer account sectors being read from disk and surviving in cache for at least 100 seconds. Note that this CODdition is satisfied if ~l+~ldSl~lK~N~
or if Rid S 3 transactionslsecond
(8-18)
The satisfaction of this Jelation guarantees that a single 1.5 megabyte cache will suffice for each file manager. Next, using the typic:al file managanent times suggested in chapter 7 (and adding one for sequeotial wrlt.es), we can CODSttuCt the following table for file activity:
TABLE .... En BENCHMARK ALE ACTIVITY B-Tree lewis
ReldbnDcb ReId 1IeIIer ReId CIISIr:IIDer
1 2 3 1 2
3 3 2 2 3
3
4
o.
Wlilies
("..)
("..,)
CIICbe
File operMicm
W_biIIICh W_teDer W_ CIISIr:IIDer W_ biSIICIIy
Reads
17
Disk
Cadle
F.M. 'l'ime (~
Disk
1 3
1
4
~
US
I I ~ 2.OS
(msec.) - .-
20 20 20 30 30 30 30
Pi
VI VI VI In VI VI In
Chap. 8
An Example
285
._ ...The history file is shown as requiring 1 disk-write to cache 95 percent ·of the time. 'Ibis is because it is a sequential 50-byte write so that 20 records will fit ma sector. Thus, a sector need. be written ODly once per 20 records. Though the write to the teller and customer record will find tbat sector mcache according to our assumptions, the updated record must eventually be physically written to disk, as we assume that the record will Dot survive mcache until it is once again updated. Therefore, a disk write is shown for these two cases. From the preceding table, we bave 25 disk accesses per transaction, 18.95 of which are cached. Thus, our cache-bit ratio is .76, and the cache-miss ratio, Pd, is .24. (In this simple example we can calculate cache activity. Usually, this is not possible, and a reasonable value for Pd must be assumed.) We can now calculate the values for file management time for each operation. These values are also based on suggested parameter values used mchapter 7, which are ttJ.e following:
= average disk access time = 28 1DSeC. tdr = disk rotational time = 17 JDSeC.
t.
Ii = file manager service time ratio for cache bits = 0.5 for all i. We must also increase t. by 40 pm:ent of the average seek time (.4 x 20 = 8 msec.) to account for file miaoring when calcuJating history file management time (see AppeDdix 5). Ratber than using equation 7-1, wbic:h assumes that we cannot enumerate cache hits and therefore have to assume a cache miss ratio, we can, in this simpler example, calculate file management times di!ectly, using our knowledge from chapter 7. These results are given mTable 8-7, with the foJIowing computational notes: . • Reading a customer record involves !e8ding a B-tree plus a key record and then a dam mxxd. 'The first two lads are similar to the lead of a primary key and dam Iecord from an indeDd sequeD1'ial file; they typically requUe one full access plus a 1ateocy time, or 28 + 8 = 36 JDSeC. The data !eCOId RqUiIes anotber faD. access of 28 msec. . • Writing the customer record is a standard write access of 28 msec., since the key record and dam record are asswiled to stiB be in cache. • By the same agumelllS, the lead and write of the teller !eCOId are also each a single access. • The bistoJ:y file requires (28 + 8)120 = 2 JDSeC. of ctisk time to write a block, averaged over the 20 records reqahed to fill a sector. • The following file manager CPU times are assumed if the dam IeC01'd is not in cache: 2Omsec.
3Omsec. 3Omsec.
Application Environment
28&
Chap. 8
. .._. Since J; = 0.5, these times will be cut in ba1f if the data record is fo.uDd in cache. TABLE 8-7. ET1 BENCHMARK RLE MANAGEMENT TIME Total file
File open1ioD
File maDager CPU time (fi~-msec.)
Disk time
mauager time
(msec.)
(msec.)
28 64
10 48 84
10
Readbaach Read te1ler Read customer Wlite bIaDch Wlite te1ler Wlite c:ustome:r Wlite histmy
20 20
IS
IS 30 30
.J2 140
28 28 ~
ISO
58 58
..l1 290
From this analysis, each of the 7 file accesses requites an average file mam.gement time of 41.4 IDSeC. Of this, 20 IDSeC. is CPU time, and 21.4 JDSeC. is disk time. Note tbat the physical utilization of the disk cannot e;ceed 52 percent (21.4141.4). This is a clear case of the value of having a second file mauager using each disk. We can DOW Ietum to our assumption conce:nUng cache longevity, as quantified by equation 8-18. From Table 8-7, the total file management time per traDSaCtion is 290 IDSeC. 'Ibis gives a maximum traasaction rate supportable by 1 disk of 1/.290 3.4 traosac:tions per second. Assuming tbat we will keep our disk loads less tban 70 percent (which any good designer wiD do), the capacity per disk is (.7)(3.4) = 2.4 traDsactions per second. This is witbiD the J:aDge speciDecl by equation 8-18. Coupled with the fads that the traDsaClion would probably be completed in less 1ban 100 seecmds (in fact, almost iDstaDt1y in this beDdunark case, since there is DO iDterveDiDg operator time between the read aad update) and tbat we sbould have some 0Iberc:ache activity (especla11yon the teller file), our assllmption of cache effectiveness appem quite secute. Now dial we have chalactetized the file system. let us tam oar attadion to the applicaDon. Figme 8-6 pMSeIdS a traffic model for.the BTl beDcJvnark A request is IeCeived by the COl "" mniCation baDdler (1) and is passed to a requestor (2). The mquestor validates the request aad dIeD seads it via an imeLpwcess message to a server queue (3). BventuaD.y, the server (4) wD1 process the request; in cIoiDg so, it Will issue seven file requesrs to the file IIIIII8pf (6) via its queue (5), each tbIOugh aniuteLpiocess message. The file mamager wD1 access its disk (7) as n.qaited, IIing each lespoDSe to the server (8). Upon completion of the mmsadioD, the server will!eblm a reply to the zequestor (9), which will pass d1e reply via the C(WDmnaication baadler (10) to the termiaal. The tequest to the server and its xeply to the requestor ate assumed to be caai.ed by the same iDterprocess message. a WRlTERBAD. 'Ibetefore, DO outbound J:eqUeStor queue is shown. (Tbe:re will be ODe, but it is and is ignored.) CO"'IIUmjcariOll between the commuDication bandler and its associated J:eqUeStor is Lc:snmed to be via common memory mailbox messages. This time can be ignoIed.
=
m. ...
sman
o
i
l' co
~
fr3
'tJ
Ci"
Ipr
Ilpm
Iqs
IpS
tl pm
tqf
Ipf
tdd
...
REPLY
fpC
tpr
Fillure 8·6 BTl traffic model.
~
Application Environment
288
Chap. 8
The following times are defined in the traffic model:
qs tipm
tpc
tpr
tip tps tqf
tpf tM
= maximum server queue length. = inteIprocess message time. = communication handler time per message. = requestor time per message. = server queue time per transaction. = server processing time per traDsaCtion. = file manager queue time per file Iequest. = file manager processing time per file request. = disk time per file IequeSt.
Except for the queue times that must be calculated, Table 8-8 gives assumed values for these~. MOst"Oftbese parameter valuCis ire based on typical suggested previously in tbis chapter, except that communication time has been doubled to account for X.2S complexity. File maDagef time and disk time per file request, tpfand tM, have been shown above to be 20 msec. and 21.4 msec., respectively.
values
TABLE .... ET1 BENCHMARK PROCESSING TIMES Parameter
q.
t....
Value (msec.)
3
t,.. r".
10 10 10 35
'"
20
fpc
1M
21.4
Referring again to Figure 8-6, we can also express the teSpODSe time, Tr , to a traDsaction. his composed of the processiDg times of all the processes plus queue waiting times. Queue waiting times include the file manager queue time, tqt, the server queue time, tqs, and the process dispatch time, ttf, which is the processor queue time that must be added to the process time for each process j. For puposes of dJis analysis, j bas the foJlowiug values: "
s--se:rver•
i-file manager. Let us break tbis problem into simpler pieces by defhring
Tr = transaction response time.
Chap. 8
An Example
289
fJ = file management time per file request. til = server transaction service time.
D = number of disks in the system.
S = number of servers in the system.
Ld = disk load. Lil = server load. tdj
= process dispatch time for process type j.
Rt = transaction I8te.
We assume that inteIproCeSS message time is incurred by the sending process. Then, working from the top down via Figure 8-6, the response time, Tr , is Tr
= 2(tpc + tdc) + 2(tpr + t..) + tipm + tip + til
(8-19)
Server time, til' is til = (tp.r
+ St.) + 7(tipm + tqf + fJ)
(8-20)
The load on the server, LiI , is
(8-21)
However, this system uses dyDamic servers; the number of such servers is adjusted to maintain a maximum queue length of tbree. This implies that the server load will remain constant at a level given by equation 8-15. Using this relationship, we have Lf·+l
= 1-
Lil
(8-22)
wbich, for qil = 3, zesu1ts in (see Table 8-3b) Lil
= .73
(8-23)
To estimate the average server queue time, we must assume a model. Since this will be a very large system, each server will be fed by a large number of DSaS. 1'beIefore, we can assume an iDfiDite population. However, the service time is made up of the sum. of a large number of random service times. These include. saver dispatch times (one for the teqUeStOimessage and seven for the fileman8g8rcompletioDs), • serverprocessiDg times, seven file IIIIDIgel" queue wairs, seven file maoager dispatch times, seven file mauager proressiDg times, and approximarely six physical disk times, or 43 service times in all. As we shall see, the resultiDg service time is luudly mncIom, and the MIG/I model is thelefore applicable. The mean and variance of the server service time are the sum. of tile meaDS and variances of its c:omponems (see equations 4-32 and 4-33). Consider the simple case in wbich all service times are equal and random, with a mean of t. Iftbere are n components, then the mean of the sum is nt, the va:riance of the sum is nz2. and the server distribution coefficient, ks, is. from equation 4-16.
Application Environment
290
Chap. 8
As n becomes large, ks approaches 0.5.
In our benchmark case, the service times are not all equal (in fact, the dispatch and queue times vary with load), but the number is large. A careful calculation will show that ks is indeed close to 0.5, and we will take it as such. Thus, the average server queue time is _ k~s _ 0.5Ls tqs - 1 _ Ls ts - 1 - Ls ts
(8-24)
The file management time per request is
fJ = (tPl + t4P + tdd + 6tdfn The final term reflects the fact that every seven file management requests result in six disk accesses (see Table 8-6), requiring an additional dispatch of the file manager. This expression can be rewritten as
fJ =
tpf
+ 13tdfn + t.td
(8-25)
To determine the file manager queue wait time, the same argument used for the server queue model will justify the use of the MlG!1 model here. The file manager in a large system will be fed by many servers so that arrival times can be consideled random from an infinite population. In this C8$e, however, there are only 4~ service times c0mprising fJ, including 13n dispatch times, 13n processing times, and 617 of a physical disk time. We assume the disk time is UDiformly distributed, with an average time of 15016 = 25 msec. (see Table 8-7). Therefore, its variance is 113(25f (see the derivation leading to equation 4-17, and remember that the variance is the second moment less the square of the mean). As a conservative simplification, we will ignore dispatch time in the computation of the file manager distribution coefficient, kj, as it will normally be small compared to tpt and tdd anyway. Since the average file manager process time per dispatch is 2OJ(13m = 10.77 msec. and is assumed to be random, then
J...=! (1 + (13n)(10.i7)2 + (617)(113)(25)2
~
2
.[(13m(lO.77) + (6I7)(25)]i
61)
•
and t.qf
= .!:i!:L. t~ = ~ t~ 1-~7 1-~J
(8-26)
Since there are seven file requests per transaction, ~ is ~=
7Rt tlD
Fmally, we evaluate process dispatch times,
(8-27) tdj.
Let
Chap. 8
An Example
291 tt = processor time per transaction.
Then, from Figure 8-6 and Table 8-8: tt = 2tpc
+ 2tpr + tps + 7tpJ + 8tipm
(8-28)
or tt
= 29S msec.
(8-29)
The Dumber of process dispatches per transaction can be determined by assuming that a process will be awakened whenever it receives a message or a device completion. In the case of the file manager, disk completions number approximately six according to
Table 8-6. Process dispatches are enumeJ3ted in Table 8-9. TABLE 8-8. ET1 PROCESS DISPATCHES
2 2
2 2
1
7
7
6
8
Y 2S
Thus, the number of process dispatches per transaction, 1Jp, is lip
= 2S
(8-30)
The dispatch time for process j is approximated by calcuJating the processor queue wait time via the MIMII model but excluding the effect of the process being consideled (see Appendix 6). The avenge processiDg time per dispatch for all pmcesses except), tJ,
is
(8-31)
tJ = average processing time per dispatch for an pmcesses except process j. tj
= avenge processing time per traDSaction for all pmcesses of type j.
1IJ = dispatches per traDsactioD for all processes of typej. Xj
= number of proc::esses of type j
(note tbat X$
= $).
Application Environment
292
Chap. 8
We define
P
Lp
= number of processors in the system. = total processor load.
Lpj = processor load imposed by all processes of type j. L;j
= processor load imposed by all processes except process j.
The load imposed on the processor by all processes except for process j. L;,,;. is L;j = Lp - LpjlXj Since
=RtttlP
Lp
(8-32)
and
then
(8-33)
Fmally, t,.,. ."
l~·t! = -=eJ:L. 1 - LpJ
(8-34)
Because the interproc:esss message time is charged to the sending process, then the fonowing values:
fj's have the
't: = 2tpe =20 IDS8C.
t,. = 2tpr + tipm = 30 1DSeC. ts tps + 7tipm lOS 1DSeC. ~ 7tlll 140 1DSeC.
= =
=
=
Also, from Table 8-9, the ".i'S ate:
=2 ,.,. =2
lie
"s = 13 n,,= 8
These equalioDs represent the respoase time model for the EI'1 benchmarK, as implemented in our example distributed system. They are summarized in Table 8-11, with a definition of terms given in Table 8-10. Table 8-10 also summarizes the parameter values used (note that Ls is treated as an input pammeter since its value is fixed by the dynamic server algoritbm).
Chap. 8
An Example
293
This model assumes a large system and is a function of the number of proeessors, P, ancl the number of disks, D. Before we launch into a major calculation ranging over all D and P, let us use a little intelligence relative to the final system. From Table 8-7, we see tbat the file system is utilized for 290 JDSeC. during each transaction; equation 8-29 shows tbat 295 JDSeC. of processor time is used per transaction. These are so close tbat it is quite reasonable to consider a processing module comprising one processor and one disk. . . TABLE 8-'0. ET1 BENCHMARK PARAMETERS Value RGIIll ptlTfIIII4t4n
T,
Average zespoose time (sec.)
Input WITilIbJes
R,
Sysrem 1raIISaCtioIl1ate (ft'8!!$l!CIionssecoGCl)
Input ptlTfIIII4t4n
D
1.. lip
P 1M
t....
tpc
t,j t, tpr
q.
Number of disk UDits Server load per server Process disparI:h late per 1IaIIsICticm NlDIIber of pIOCeSSOrS Average physical disIc time per file request (msec.) lidapwc:essot messap time 0 '''".I!¥:aIioD baDcIJer time per message (msec.) File IIIIDIIF' processiDg time per file mquest (msec.) Reqaesror time per message (msec.) Server time per messap (msec.) MaxDmIm server queue 1eD&th
1,2 .73•.68
as 1 21.4 10 10
20 10 35 3,2
l".".rditIre ptJ1'IIIII4t6J
l.J Lp
Lf.J IIJ
r., ~
tJ
tj
File DIIIIIIF load per file DIIIIIIIF' Processor load per pIQCIISIOC Processor load per pIQCIISIOC. adasive of pocess j l)jspatches per IDDSdioa fer poc:ess type j Ploq:Iss dispIrdl time (msec.) for pzocess type j Pie ...... service time per file 1eqaest (msec.) Awage PocessiD& time per II'DsniaD far pocess type j (msec.) Awrage pocessing time per cIispatdl fer all pnI cesses except poc:ess j (msec.) File DIIIIIIF qaeae time (msec.) Saver qaeae time (msec.) Saver _ _ time (msec.) CPU lime per t.IDsac&ioD (msec.) Number of pocesses of type j
This~ can then be used as a module to build a system as big as we would lilce. Therefore, the BTl benchmark will be evaluated for one pmc:essor and ODe disk or
=1 D =1 P
This means there will be one file ~ (Xf = 1). A trial
calculati<m: will show tbat there
Application Environment
294
Chap. 8
TABLE 8-11. ET1 BENCHMARK MODEL
= 2(tpc
T,
+
t.tc)
r., = .SL,t./(l
+
2(t"..
+
fd,)
+
z.rl = 1 -
I.
~)
= .6~¥(1 = 7R ¥D
~
=
(8-19)
(8-20) (8-22)
L.
1.,
Iqt
+ r., +
(8-24)
= (t,. + 8t.> + 7 (,.,. + Iqt +
t.
t..,
- LJ
- 1.,)
(8-26)
(8-27)
t
t" + 131"" + z.tI
(8-25)
T'.,'
fdj
=1 ~~
(8-34)
L;;
= Lp ('t
(8-33)
I' OJ -
-'t ~/%j)
It - ~/%; '7 - "i%;
(8-31)
= R,ltlP 't = 2tpc + 2z",. + t,. + 71" + 81....
Lp
(8-32) (8-28)
will be 6 communication processes (Xc = 6) and 6 requestors (x, = 6). The namber of servers, x"' is that required to keep the server load below .73. Note that at a 70 pen:ent load, this module will carry .71.29 = 2.4 ttaDsactiODSl second. This is a rough estimate of system capacity as the actual value will depend upon the response time cbaracteristics. The response time fortbis module is shown in F1g1lte 8-7 as a function oftraDsaction rate. Smprise! To achieve an:spouse time of2 seconds, a module can carry 0Dly a load of 1.65 ttaDsaCtions per second, not 2.4. In fact, it can barely approach that capacity, saturating at about 2.S tIaDsaCtions per secoDd; This example shows one of the great benefits inpezformance modeling. First, it shows how educated guesses can sometimes lead us astEay (though they ate still useful). Secondly, the model will let us ""look iDro" the system to see wbat went wroDg. This can be done by perasing the calcnJatjan msults summarized in Table 8-12. : .
.
'
TA8LE8-12. En BENCHMARK FOR P Lp
~av)
~
1.,
,.,
0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
.06
,(101 .001
.042 .043 .044
.06
.002
.12 .18 .2S
.004
.006 .009
.32 .39
.013 .019
.47
.026 .037 .OSS
.S3
.59 .65
.002 .003 .004 .005 .006
.008 .010 .012 .016
.045
.046 .047
.048 .049 .OSl .052
.054
.ss .64 .73 .83
I
= 1, D =1, q" =3
Rt
.12 .18 .24 .30 .35 .41 .47
.
.086
.159
L,
x.
I.
l.-
T,
.73 .73 .73 .73 .73 .73 .73 .73 .73 .73 .73
1
.42 .44 .47
.56
1.03
1 1 1 1 1 2 2 3 4
.s9
1.09
.so
.63 .67
1.16 1.24 1.34 1.46 1.66 1.90
1.18
6
1.74
.54
.73
.S9 .67
.80
.77
.92
.91 1.04 1.2S 1.60 2.35
2.28 2.90 4.23
Chap. 8
An Example
295
3.5
3.0
III
2.0
- - - - - - - - - --
:E
j:
ftl
1.5
f3a:: 1.0
.5
0"
3.0
We can deduce many 1hiDgs from these cilC'datinns, but ODe is most obvious: The pmdmriR"ttfactoris server queue time, tq.. h acc:ouIdS foroverODe-ba1f of the :teSpODSe 1ime. We can reduce 1bis a DDIDber of ways~ One is to COIISider !eCIuciDg"1:be Dmimmn queue size, "q.",io 2~" "This" will bean right as Iciag as we do DOt create tOo many servE tor a module to handle. We can c:alCIrlate the IlUIIIbet of servers per modDle as follows. The server load, L., is, from equation 8-21
1.. = Rrt.IS (8-35) The server load is fixed at ~68 ~ a ~ size of 2 by the d.}'DIIDic server algorltbm
Application Environment
296
Chap. 8
...- (see Table 8-3b). Thus, at two transactions per second, .the number. of .servers is (using ts = 1.18 from Table 8-12): S = [(2)(1.18)/.6811
=4
There should therefore be DO problem in reducing the maximum queue size to 2. The server queue time ti/s is also a ~ function of the server time ts , which in turn is very sensitive to the file management time tqf + ~; in fact, it is proportional to seven times this factor. We.can control tqfby adding an extra disk to reduce file load, Lt. From Table 8-12 (preceding), tbis is not terribly important at one traDSaCtion per second but becomes sigoifi.cant if we approacb two transactions per second (which we would like to do). Thus, let us modify our system to contain two disks per processor, and let us reduce our maximum server queue length to 2: QS=
2
D =2 Figure 8-7 shows the result. We bave significantly increased the capacity of the system-up from 1.65 to 2.1S transactions per second. Not what we bad hoped, but a lot closer to our original educated guess of 2.4 transactiODSlsecond. With this CODfiguration, each module can handle (2.15) 100 = 215 users, since each user genemtes one transaction every 100 seconds. Therefore, in each module there would be 7 requestors, averaging 30.7 users each. To service 10,000 tellers siDm1tanenusly, the distributed system would comprise 47 modules (47 processors and 94 disks). A nice sale!
Let us assume a processor costs $60,000, and each disk costs $20,000. Then we can a unique figure of merit for this system. Each module costs 5100,000 and can support 2. IS transactions per second. Thus, this system costs $46,500 per transaction per second. Cost per traDsaCtion per second, or $KITPS, is becoming a common measure of tnmsaction processing efficiency. Typical systems today mnge from 40 to 400 SKII'PS (Anon. [1]) so that our system is certaiDly compeQtive. It :might also be DOted that the ~ processiDg power of today's distributed systems tends to IUD between 1 and 10 traDsactioasIsecond per processor. n tabs a lot to bandle a traDsaCtion. No WODder tbe world needs perforThance
obtain
aaa}ysts.
.
SU.IfARY We bave seen that tbe actnal processing time CODSI1DJed by an application process often . IepJeseDts only a minor contribution to response time. Application times may be magniby queue delays and operating .fied manyfold .
system
characteristics.
Chap. 8
Summary
.....•One of the greatest magnifiers is the compound queue, comprising messages waiting in line for a process, which in tum must wait in line for the processor. The process time is magnified by queue delays for the processor, making its queue delays even worse. In a multiprocessor system, compounding is even worse. Not only do messages wait in line for processes and processes wait in line for proce5SOIS, but processors wait in line for common resources such as memory and locked data structures. The techniques of this chapter and those of chapter 6 can be combined to solve this problem.
9 Fault Tolerance
TIaDSaCtioo-processiDg systems bave become such a part of our lives that many :facets of our day-to-day existeIlc:e depend upon their good health. As notecl in chapter 1. we could not buy airline tickets, inquire about our bills, have our credit checbcl, or enjoy our CUDeDt level of hospiIal c:me \\'idlout these sysIaDS-at least DOt with the efficieDcy and ease that we eqjoy 1Dday. Wheo a system goes down, we can be iDconveoienc:ed or may even suffer physicaDy. In addi1ion; the ope.raror of the system will ofIml suffer financial loss. An airline will lose customers, credit cams will be shifted wi1ti the mere ffick of a new card, a hospital may lose a patient. CoDsequemly, in many systems a svbstamial price tag can be placed on bigh avaiJabnity. To . . dUs xequitewem, a variety of.approacbes have been taken to allow a TP system to CODtiDue operatiDg even in the face of the failme of a sjgnificant componeat-aprocessor, a disk, a memory uait, an 110 CODIIOJler, whatever. In dUs cbapter we analyze the impact of fault to1:eranc;e OIlperfolmaDce. The EI'l example of chapter 8 is extended to show examples of performance degradation due to fauh tolerant provisioDs. The various teqUiremeDts for fault-tolerant systemS and CODtcmpomy approacbes to fauh tolerance are discussed in some detail in chapter 2 in the section entitled "Survivability." There, four generic approacbes to fault tole:rance are ctisc:Ilssed:
298
Chap. 9
, .' - .
Fault Tolerance
299
1. Transaction protection, in which transactions in progress at the time of failure are rolled back and must be reentered by the operator or automatically by the system. . . . . . . 2. Synchronization, in which multiple systems perform the same function and periodically check each other's results. Should there be a failure, they vote on the according to some algorithm. 3. Message queuing, in which aD. active process sends every message it receives to a dormant backup process. Should the active process fail, the backup process can reconstruct the state of the active process by processing these messages before cmying on with normal activities. 4. Checkpointi1Jg, in which an active process updates the state of a dormant backup process at critical points so that the backDp can take over ill1lT!fdiately upon the failure of the active process. COIl'ect output
All of these fault-tolerant systems have ODe thing in common-redundant baldware. This is the basic requirement for survivability, since if a component fails, the system can IeCOver only if it bas a spare component to immediately put in place. oDe very impoItant component to replicate _ tbe cfisk units carrying the files required to support tbe TP application. This generally involves mirroring these files. Minv.red files _ discussed in detail in chapter 7 under tbe section entitled "MiD:ored Files" and will not be CODSide!ed further here. Beyond this, the various approaches to fault tolerance vary in terms oftbe amount of ban:lware needed, the ongoing operational load Unposed upon the system by fault tolerance processing requirements, and the recovery time following a fault. These _ detaiJed in chapter 2 but _ geDeJally sqmmarizM in the table below. ,
'
TABLE 8-1. GeNERALIZED COMPARISON OF FAULTTOLERANT TECHNIQUES TecIIIIiqae TlllllsacdaD (l!OIeCtioD
Bmiware 1IIi1;',,;'" <'Ii)
100
0praIi0aal IoIId
RecoYay
Lip"o
MiIaIIes lnmwtille
time,
SyiduoaizaIiau
2S-SO
beavy Nc.to modest
Messaae
80-90
Modest
SecaDdsto DIimIIIes
8S-9S
Ligbtto modest
seCcmcts
qaeaiDB Qec:IqI oiIliag
Note that no tecImique is a pauara. If we want fast IeCOVery, we have to pay for it in bardwaI:e. . . If we want nrinilDlJin ban:lware, we must settle for slow 1eCOVeZ'Y.
Fault Tolerance
300
Chap. 9
While the prereding table gives a feel for the relative cbaracteristics of tauk-tolexani , -8pproaches, it must be pointed out that there are as manyapproaclieS £0-survivability as there are systems. Systems developed in the future or even under development today may well improve sigoificantly on the evaluations stated in this table. However, it is the purpose of this book to provide tools, not solutions. A familiarity with ~ techniques will allow the perl'ormance analyst to approach the analysis of new system architectures with confidence. Let us now look at the operational performance issues raised with' these- variOuS 8pproaches. Note that we will not concem ourselves here with estimates of recovety time. FU'St, this is presumed to be an infrequent activity in today's'haIdware art (many contemporary systems are now agoying haIdware mean time between single failures of over a year). Secondly, recovery time in many systems is a user-controJled parameter in that operating efficiency can be traded for recovery time (this is especially true for traDsaCtionrecoVety and message-queuiDg systems, in wbich the volume of data that must be recovered can be reduced by more frequent cleanup procedures).
TRANSACDON PROTECDON
Even systems that are otherwise fault-tolerant often provide a transaction-protection facility. That is because one must recognize that no matter how fault-tolexant we make a system, we are proreciing only agaiDsthardware faults (at least in today's art). But there is a major class of faults cansed by that insidious body of bJacic magic known as software. Though software bugs may eventually disappear in opeIIting systems as they mature tbrougb years of use, they will always be with us in our applicaDon software. Even these will become less frequent as fourth generation application Janguages take hold. But software bugs, like the common roach, will live long after us. The basic c:onc:ept behind ttaDsa.Ction potection is that we deal with a TP system in a unit c::allecI a trtlllSflCtion. A traDsaction can be quite complex and can involve multiple updates to our data base. If an updates c:aDDOt be made, then none should be made. For iDs1anc:e, if an iDveDtory traDSaCIiDD. is to update the quantity in stock in the inventoly file and is to Cleate a IeCOl'd in the Older file;" then both should be ckme; otherwise, the data base would be left in an iD.c:oasisrt:at state. If the iDveDtmy file is updatwl but the Older file is not, iDveDtoIy would be frozen with nowhere to sbip it. If the converse update occuned, the procluet, would be shipped widlout !educing invearory. In either case, our system would be in trouble. Note that a traDsaCliOll can fail for sevaal zeasoDS: • A hatdwaJe failme can occarthatpecludes the traDsactiOll from being completed. such as'the failme of a common memory moclule in a multiprocessor system. • A 'software faD.me can 8bort the ttaDsaction. • The operator can decide to abort the ttaDsaCtion.
Chap. 9
301
Transaction Protection
transaction can update files across a geographically distributed netwQtk. If a link to a remote node fails, the transaction might not be able to be completed.
" JI.A
All transaction protection methods provide a mechanism for marking the begjDniDg and end of a transaction. As a transaction is started, data-base updates are made in such a way as to be only temporuy. Ooly when the transaction has completed are t:bedata-b8se' updates committed. Otherwise, they are rolled back (or never applied) and never affect the data base. A generalized transaction-protection mecbanism is shown in FIgUre 9-1. When a transaction is received. a "start transaction" command (1) is issued. This notifies the operating system that all subsequent file activity related to this transaction is to be pr0tected. As disk updates or writes are generated to files on bebalf of this transaction. they are intercepted by the opmating system. Records to be modified or written are locked (2), and the before and after images of the record to be written are captured. These images are .written to an audit file (3).
BEFORE/AFTER
~~: (9)
IMAGES (3)
COMMIT
\D~jf!;ON LOG
FAILURE STiME
\(7) (8)1/
t LOCK
(5)
i
UNLOCK (8)
(2)
,
.
When the ttansaclion bas been c:ompJered, any IWllajDing befoIeIafIer record images pertaiDing to tbis transaction are ftushed from cadle to the audit file. At this time. the transaction can be CODSideIed to be safely stored, and the 'user is notified that the transacIion is CQTDmjtred (4). The .aual updates to the data base also have been proceetting durlDg this time (5). A key rule, however, is tbat DO data-baserecord is modified UDtil its befcxelafterimage bas been physically written to the audit file. These data-base updates may CODtiDue while the user begins a new transaction. Howeve:r,locks on all records involved in tbis transaction are maintained until the enme transaction has been applied to the data base. Only then wiD these locks be released (6).
!
.
Fault Tolerance
302
Chap. 9
In the event of a failure, the audit file provides two key features.mr recovering the data base: rollback and rollforward.
Rollbtzck allows incomplete transactions to be backed out of the data base. Should a system failure of some sort occur before the transaction bas been committed (7), then tbat transaction C8DI1Ot be applied to the data base. However, some data-base records may already have been modified. Using the audit file before images of all recon'ls modifi.ed by the transadion, the transaction-protection facility can restore those recotds to their origiDal value. Thus, an incomplete transaction will have no impact on the data base. RolIfonwzrd allows the application of a committed transaction to be completed even if the system fails between transaction commit time and the time that the data-base update is completed (8). This is simply done by leading the after image of the moclliied records and by applying them. to the data base. One problem that may be apparent involves knowing which transactions are complete and which are incomplete following a failure. Many systems will provide this information by writing a log of all completed transactions (9). This log can then be read to faciliwe transaction recovery following a failme. Another solution is to establish a CODSistency point periodically in the system by flushing all modified data to disk. At this point, it is known that all data-base updates have been made. Recovery functions need proceed only from this consistency point. Another type of rollforwatd capability must also be provided to account for media failure. If, for some IeaSOIl, our disk UDits have crashed, or if faulty softwaIe bas corrupted our data, we must be able to rec:cmstnJCt it. 'Ibis is done by restoring a known good copy of the data base (typically from magnetic tape) and by "playing back" an transadions that have occum:d since that copy was made. These transactions will be °saWd in audit files, some of wbich may also be on tape. with the most Iecent ODeS stiil 00· ° disk. (It is important that they be on a physically sepmte disk from the ODe with the CODupted data so that the audit files are not also corrupted.) 'Ibis fODll of roDforward can take hours or even days for a very large data base. It is not a very Dice thing to bave to do. Data leQUiIecl for both rollback and roDforward must be propedy leCCXded befoze ~ .. . II n....-ft_l oc:edmes . • • transadioD IS CQI!""'ttec. ~ supportiDg transaction protec:IiOn must pmvic1e for such pmper reconting. The Syaapse systan descrlbed in chapter 2 will be used as an example for analytic puipo&eS. In SlJDIIDaty. the Symapse UIDsaCtion-pmtecIion tecbDique involws the wrlIing of two audit files. This is descrlbed in deIail m chapter 2 m the section entitled "Survivability-Trausaction Pmtection," as follows:
....l""'-
°
an
• A- history log. which COIII8iDs the befCR and after images of data items updated. It must be physically written to disk before the ttaDsadion is CODSideIed complete• • A tewpoLary log. wbich comaiDs the befoxe images of an data items updated. It must be written befoxe any data base update affecting ~ items is made.
Periodically (every couple of minutes or so), the
~
cache is flushed, thus czeaDng a
Chap. 9
Synchronization
303
m
c01lSistency point, or CP. Following a· CP, the data the temporary log .js- deleted. Chapter 2 explains the recovery teclmiques available with this technique. Typically, the set of data item changes will fit in one record to be written to each audit file. The audit files can be sequential files with variable record size. Each record is as long as is necessuy to hold the requU:ed information for a transaction. Under the above assumption, each transaction is burdened with an additional two sequential write operations. Note tbat these must be physically written to disk (cache write-tbrough) and that the history log must be written to disk before the transaction is declared complete. (Typically, the final response to the user will be delayed until this write is ensured.) . The load on the system imposed by the consistency point is genemlly ignored as it seldom occurs (every few minutes or so). Thus, transaction protection in tbis case is modeled from a perl'ormance viewpoint by simply adding n sequential cache write-tbrough operations to each transaction, where n . is typically equal to two.
SYNCHRONIZATION
To achieve fault tolerance via synchronization, at least two systems must be processing the same data and must be periodically comparing their results. If they must pause to do this, then a performance penalty is paid. However, this time is typically small and often zero. The beauty of a synchronization system is that recovery time is instant (at least to the observer) should a failure occur. The failed module is simply ignored by the other modules in the system.
Basically, these are voting systems. The majority rules, and the minority is deemed to have failed. However, we have said that synchronization systems have two or more processing components. How can two take a democratic vote and have one win if they differ? There are at least two ways this can be achieved. One way, applicable to most modern computers, is to build in enough diagnostic hardware so that any nonrecoverable error is detected and causes that system to "crash." A failed system dies and simply does not respond to the other half; it does not vote. Thus, if one system fails, the other simply carries on. Another technique, useful in certain applications, is to implement an algorithm which can deduce the failed system. Such a mechanism was used in an early racetrack totalizator system installed by Autotote of Newark, Delaware, for the New York Racing Association. Designed by the author, each half of this system received wagering messages error-protected by encoding, processed these wagers, and compared accept/reject signals for each wager. If one system accepted the wager but the other rejected it, then the wager was rejected, keeping both systems consistent. However, the accepting system recorded a black mark for the rejecting system. If a system accumulated too many black marks, it was declared sick and taken out of service.
August Systems, as mentioned in chapter 2, offers a triplexed synchronized system. Stratus, also described in chapter 2, uses a quadruplexed synchronized system to achieve fault tolerance totally via hardware, with virtually instantaneous recovery. Note that if p processors are used in a synchronized system, the total processor utilization is only 1/p of the total processing capacity. This is not necessarily an economic factor, as is shown by Stratus' strong presence in the marketplace. A reasonably conservative approach to synchronization delay is obtained by considering the response time dispersion as discussed in chapter 4 in the section entitled "Infinite Populations-Dispersion of Response Times." The response times between systems will vary for a number of reasons, including the following:
• Scanning requestors in different systems may pick up the same transaction at different times.
• Transactions will be serviced by servers in a different order.
• Disks will receive requests in a different order, thus having different access times.

Using the tools described in the just-referenced section, we can make statements concerning the distribution of this response time. Specifically, with the Gamma function, we can estimate the probability that the processing of a transaction will complete in less than a given time, given the mean time and variance of the processing time. Here a transaction is taken to mean that processing done between synchronizing points. If we have n_e systems that will each complete a transaction in t seconds with probability P_e = (.5)^(1/n_e), then the probability is [(.5)^(1/n_e)]^n_e = .5 that all will complete the transaction in t seconds. Thus, this value of t seconds is the average response time for the group of n_e systems. For instance, if three systems are involved, we will want to know that time within which the transaction will complete with probability (.5)^(1/3) = .79. Then we will know that all three will have completed their transaction in that time with probability .79^3 = .5. (This argument assumes that the response times are independent. To the extent that there is dependence among the response times, this approach is conservative.) The Gamma function gives us these times as a function of the variance of the response time. Some typical values are given below:
TABLE 9-2. SYNCHRONIZATION OVERHEAD

No. of processing    (.5)^(1/n_e)                   T^2/var(T)
elements (n_e)       (P_e)           1.0    1.2    1.4    1.6    1.8    2.0
2                    .71             1.25   1.25   1.25   1.25   1.25   1.24
3                    .79             1.56   1.54   1.52   1.50   1.48   1.47
4                    .84             1.83   1.78   1.71   1.70   1.67   1.65
In chapter 4 it is noted that the distribution of the response time created by random arrivals at tandem queues with random service times is itself random. Therefore, its variance is equal to the square of the mean response time. Random distribution of response times is usually a conservative assumption; if all else fails, it can be used. This is the case of T^2/var(T) = 1 in the above table. However, chapter 4 also describes a technique for estimating the variance of the response time more accurately. The example for a TP system given there yields a ratio of variance to response time squared equal to .628. From the above table, the average response for four processing elements is 1.83 multiplied by the average response of one element, if random response times are assumed (T^2/var(T) = 1). However, if the normalized variance is .628, then T^2/var(T) = 1/.628 = 1.6; the average four-element response time is 1.70 times the average response of one element, or a 7 percent performance improvement per our estimate. Thus, it can pay to go through the exercise of more accurately estimating response time variance rather than to casually assume randomness of response times. On the other hand, the values in the table above are closely enough clustered to give us a reasonable rule of thumb for synchronized systems, as follows:

TABLE 9-3. SYNCHRONIZATION OVERHEAD
No. of processing    Addition to the average
elements (n_e)       response time (k_e - 1)(%)
2                    25
3                    50
4                    75
Note that these results apply to a transaction whether there is one synchronizing point or n such points per transaction. To demonstrate this, let

n_e = number of processing elements,
k_e = response time factor for n_e processing elements,
n_s = number of synchronizing points per transaction,
t_r1 = average transaction-response time for a single element,
t_r = transaction-response time for the system.

The response time, t_r, for the system is the sum of the response times through each synchronizing point:

t_r = n_s k_e (t_r1 / n_s)                                    (9-1)

or

t_r = k_e t_r1                                                (9-2)
Thus, the response time, t_r, is independent of the number of synchronizing points, n_s. Here k_e = Gamma[P_e, t_r1^2/var(t_r1)] is the value of the Gamma function for parameters P_e and t_r1^2/var(t_r1). The parameter P_e is defined as follows. If each element completes its transaction within a time t with probability P_e, then the group of elements will complete the transaction in that time with probability 0.5:

P_e = (0.5)^(1/n_e)                                           (9-3)
Note that hardware synchronization as practiced by Stratus results in a variance of zero, since synchronization points are determined by a common clock. Thus, k_e = 1, and there is no performance penalty.
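The table values above can be reproduced, to within rounding, by treating the response time as a gamma random variable with the given mean and variance. The following sketch is ours, not the book's, and assumes the SciPy library:

    # k_e: response-time factor for n_e synchronized elements whose
    # response times are gamma distributed with mean T = 1 and
    # T^2/var = ratio. P_e follows equation 9-3.
    from scipy.stats import gamma

    def sync_factor(n_e, ratio):
        p_e = 0.5 ** (1.0 / n_e)   # per-element completion probability
        shape = ratio              # gamma shape parameter = T^2/var
        scale = 1.0 / shape        # chosen so that the mean is 1
        return gamma.ppf(p_e, a=shape, scale=scale)

    print(round(sync_factor(4, 1.0), 2))   # about 1.84; Table 9-2 shows 1.83

For the exponential case (T^2/var(T) = 1), this reduces to -ln(1 - P_e).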
MESSAGE QUEUING

With message queuing (described in chapter 2 in the section entitled "Survivability-Message Queuing"), a backup process is kept informed of the status of its primary process by queuing to it those messages that are also directed to its primary. In this way, if the primary process fails, the backup can process the queue of old messages to bring itself up-to-date before taking over the processing functions from the primary. To prevent the backup from sending duplicate messages while it is processing old messages during recovery, the primary process will also send copies of its output messages to its own backup process. The backup will maintain a count of these so that it will know when to start releasing messages. To prevent the backup queue from becoming too long, the system is "cleaned up" periodically. This is done by flushing all dirty memory pages to disk. (This, of course, assumes that memory paging for both the primary process and the backup process uses a common mirrored disk pair.) At this point, if the backup takes over, it will be in the same state as the primary process. Therefore, its receive queue can be deleted and its message count reset. This is the equivalent of the consistency point used with transaction protection. It is ignored for performance purposes, since it generally occurs infrequently (every several seconds to minutes, depending upon the recovery time desired and the maximum queue lengths). The performance analysis of these systems is straightforward. For each interprocess message sent, three are actually sent:
• One to the destination primary process for normal processing.
• One to the destination backup process for replay following a failure.
• One to the sender's backup process so that it can know which messages have been sent.
In most messaging systems, this activity will incur substantially the overhead of three interprocess messages. Thus, one needs only to triple interprocess message activity to account for fault tolerance using message queuing, as the sketch below illustrates.
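Pictured as code (a hypothetical sketch; the process handles and the ipc_send callable are illustrative only), the send path looks like this:

    # Under message queuing, one logical send costs three physical
    # interprocess messages; this is the basis for tripling t_ipm.
    def send_with_queuing(msg, dest, dest_backup, own_backup, ipc_send):
        ipc_send(dest, msg)          # to the primary, for normal processing
        ipc_send(dest_backup, msg)   # queued for replay following a failure
        ipc_send(own_backup, msg)    # lets the sender's backup count sends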
DATA-BASE INTEGRITY
With these fault-tolerant concepts in mind, let us take one more look at data-base integrity. In chapter 7, we discussed disk mirroring as a means for protecting the data base from a mechanical disk failure. However, mirroring plays a role in only one aspect of ensuring the integrity of the data base. There are, in fact, three levels of integrity with which to be concerned:

1. Data integrity. At this level, we are concerned that each write of a data block was safely executed by the disk system and that the block can be read reliably. This is often ensured by read-after-write protection on some disk units and by error-correcting codes for reading data. It is further ensured by using mirrored disks.
2. File integrity. Even though we can ensure that all writes complete successfully and that all data can be reliably read block by block, this does not mean that our files are intact. An interruption of a complex file operation by a system failure can leave a B-tree garbled during a block split or can lose an end-of-file on a sequential file. Should this happen, our files are of questionable use to us. File integrity is ensured by guaranteeing process integrity via checkpointing, message queuing, or synchronization.
3. Transaction integrity. Even if our files are protected, the data base can still be corrupted if a transaction is interrupted. Transaction integrity is protected by transaction-protection techniques, which can roll back incompleted transactions
following a failure. Transaction protection is often coupled with one of the other types of fault protection. This allows the system to continue unaffected in the presence of any single failure (and, in fact, in the presence of many multiple failures) but allows the data base to be recovered following severe failures that may have contaminated it.
Let us compare the performance of these various techniques by applying them to the ET1 benchmark evaluated in chapter 8. We will consider the following four techniques:

1. Transaction protection. If the system fails in any way, it is recovered by the transaction-protection facilities.
2. Synchronization. In this technique, two processors are provided to process transactions in parallel. If one fails, it is assumed that diagnostic hardware will cause it to stop so that it can cause no damage.
3. Message queuing. Interprocess messages are queued to a backup process for reprocessing in the event of a failure.
4. Checkpointing. This technique assumes three checkpoints per transaction with duplication protection.

To account for fault-tolerance overhead, the following modifications are made to the example of chapter 8:

1. Transaction protection is modeled by adding two write times to the transaction. Both must be cache write-through type operations. Each operation is a sequential-file write to an audit file. Using the values of Table 8-7, we have the results in Table 9-4.

TABLE 9-4. TRANSACTION PROTECTION FILE MANAGEMENT TIME
File operation       File manager time (msec.)    Disk time (msec.)    Total time (msec.)
ET1 processing       140                          150                  290
Sequential write      60                           90                  150
Total                200                          240                  440
Note that transaction protection significantly increases disk access time and might justify an extra disk unit per module. However, we will maintain the module configuration for consistency of results. The server still makes only seven file requests, as the audit writes are performed transparently by the operating system. Therefore, it is sufficient to simply add the audit time as overhead to the transaction's disk time. Doing this yields a file manager processing time, t_fr, and a physical disk time, t_pd, per file access of
t_fr = 200/7 = 28.6 msec.                                     (9-4)

t_pd = 240/7 = 34.3 msec.                                     (9-5)
or a total of 62.9 msec. per average file call. All other equations for the model remain the same.

2. Synchronization is modeled by adding 25 percent to the single-system response time. Pure and simple. However, note that this requires twice the hardware.
3. Message queuing is modeled by tripling all interprocess messages. This is simply done by tripling the interprocess message time, t_ipm:
t_ipm = 30 msec.                                              (9-6)
4. Checkpointing is modeled by adding three interprocess messages to each transaction. Two are added in the requestor following the receipt of a request and the send of a reply. The other is added to the server following the read of the data but
before any writes. Thus, equations 8-19, 8-21, and 8-28 become the following equations:
(8-19): t_r = 2(t_pc + t_dc) + 2(t_pr + t_dr) + 3t_ipm + t_qs + t_s      (9-7)

(8-21): t_s = (t_ps + t_ipm + 8t_ds) + 7(t_ipm + t_qf + t_f)             (9-8)

(8-28): l_t = 2t_pc + 2t_pr + t_ps + 7t_pf + 11t_ipm                     (9-9)
The results of this exercise are shown in Figure 9-2.

[Figure 9-2: Impact of fault tolerance. Response time versus load for O - no fault tolerance, C - checkpointing, S - synchronization, T - transaction protection, Q - message queuing.]
Based on the particular parameters that we have chosen, and considering the range of interest (around a two-second average response time), checkpointing is reasonably efficient. Synchronization and transaction protection are next, but remember that the synchronization system requires twice the hardware. Message queuing imposes the heaviest load. No general conclusions can be drawn from this exercise, however, as the results of Figure 9-2 can be significantly different for different values of the system parameters. In fact, parameters can be chosen to give the advantage to any of these techniques. Table 9-5 gives the capacity and K$/TPS for each system based on a $100,000 module cost. This table should be considered an illustration only. It is very sensitive to system and performance parameters. For instance, if the allowable response time were 3 seconds, system costs per TPS would be much closer. If it were 1.0 second, only the checkpointing fault-tolerant system would apply (in addition to the plain vanilla system). A message-queuing system, in particular, might well incorporate a more efficient message system (perhaps 2 msec., as is seen in some contemporary systems, rather than 10 msec.). This would make it much more competitive.
TABLE 9-5. COST OF FAULT TOLERANCE FOR THE ET1 BENCHMARK EXAMPLE

Fault-tolerant mechanism    Module cost (K$)    Capacity (TPS)    Cost factor (K$/TPS)
None                        100                 2.15              46.5
Transaction protection      100                 1.40              71.4
Synchronization             200                 1.81              110.5
Message queuing             100                 1.36              73.5
Checkpointing               100                 2.09              47.8
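The cost factor column is just module cost divided by capacity. A quick check of Table 9-5 (our sketch, using the table's own numbers):

    # K$/TPS per mechanism; synchronization's 200 K$ reflects its
    # duplicated hardware.
    systems = {
        "none":                   (100, 2.15),
        "transaction protection": (100, 1.40),
        "synchronization":        (200, 1.81),
        "message queuing":        (100, 1.36),
        "checkpointing":          (100, 2.09),
    }
    for name, (cost_k, capacity_tps) in systems.items():
        print(f"{name}: {cost_k / capacity_tps:.1f} K$/TPS")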
One final comment. Any fault-tolerant system employing a backup (all but pure transaction-protected systems) will run faster if the backup is down. Synchronizing systems don't have to synchronize. Message queuing and checkpointing systems don't have to send backup messages. The author has seen at least one case (and I'm sure there are others) in which the operators actually inhibited one or more of these fault-tolerance features when the system became loaded. Bad practice! Let's design these systems correctly to start with.
10 The Performance Model Product
An in-depth performance analysis can be a very significant effort, ranging from a few days to several analyst-months. Once we have gone to all the bother of an analysis, it would certainly be nice if others could make use of it. It would be even nicer if years from now we would be remembered for our fine work. Both these goals can be achieved through a common vehicle: documents. Many of us don't like to write, but write we must if we are to complete our job as professional performance analysts. The resulting document is the tangible product of our long hours of analysis. To take a little pain out of this chore, this chapter discusses the contents and organization of a proper performance analysis document. The next chapter then presents a sample performance analysis taken from real life. The key to a successful performance analysis is the same as with most tasks. Work from the top down. Start by obtaining a thorough knowledge of the system and the transactions that will drive it. Then characterize the system's performance with a traffic diagram. Using this, generate the performance model as a set of equations or tables or both, as appropriate. Summarize the model so that the calculation phase can be organized, and then compute pertinent results. Finally, describe the results and the conclusions that can be derived from them. It is always useful to begin the performance analysis document with a short executive summary of 2 or 3 pages at the most. Those readers who really count seldom read a 1,000-page detailed analysis. This summary should identify the system, should comment
about the magnificent art of performance analysis, and should present the primary results and conclusions. Finally, the document should have the finishing touches of a table of contents (often overlooked, but terribly important, especially for later reference) and an attractive cover sheet clearly identifying the system and the author.
REPORT ORGANIZATION
The very first words that strike the reader who has turned past the cover sheet are those of the executive summary. This summary may be the only section the reader actually reads. The executive summary contains brief statements about the following:
• The system that is analyzed and its basic functions.
• The reasons for the performance analysis.
• A word about performance analysis techniques. (Remember-this may be the first and only performance analysis the reader has come across.)
• The primary results of this performance analysis, complete with curves or tables if appropriate. However, these should be devoid of detail; instead, they should be summaries of important results.
• The conclusions to be drawn from these results and their impact on current or future activities.

The "word about performance analysis" is an advertisement for the art. In one sense it is boilerplate. In another sense it is quite useful, especially when one realizes how few people have been exposed to good performance analyses. Such people may be quite confused about how the document came to be if we don't give them a hint. A typical short paragraph satisfying this requirement is:
The results presented in this document were obtained using a technique known as analytic performance modeling. With this technique, delays in a system which increase with the load imposed on that system can be characterized mathematically. Consequently, the response time of the system to a transaction, and thus its capacity, can be mathematically modeled and evaluated under a variety of conditions.
Table of Contents

Not much need be said about the table of contents, except that it should not be forgotten. It gives the serious reader a clue as to what is contained in the document and acts as a valuable quick reference tool (even for you, the author, years down the road).
System Description
In my opinion, the system description is the most important part of the document. If the system is imperfectly understood, the model will at best be equally imperfect. The system description should start with an overview of the system's hardware and software architecture and the functions performed by the system. Each module that has an impact on performance is then described in detail, with all significant aspects affecting performance fully covered. The system description performs two major functions:

1. It acts as the interface document between the performance analyst (who is presumably initially naive concerning the system) and those knowledgeable about the system. In this sense, it is an interactive document and may take several cycles of review by the system analysts and rewrites by the performance analyst before it is complete.
2. It acts as the basis for developing the transaction model and the traffic model, from which the rest of the performance analysis is derived.
For both these reasons, the system description should be completed and approved before the actual performance analysis proceeds. Otherwise, there may be a good deal of rework required.
Transaction Model

The transaction model characterizes the transaction load imposed upon the system. In the examples throughout this book, the transaction models were quite simple and may not have even been noticed. However, real-life systems experience a wide variety of often complex transactions. The characterization of these transactions involves identifying them, listing their resource requirements (communications, disk accesses by type, and special processing), and their relative frequency, or probability, of occurrence. The transaction model is often just so much bean counting and can be a tedious chore for both the system analyst and the performance analyst. But it must be done. It requires a significant contribution from those who know the system, and it can proceed in parallel with the rest of the performance-modeling activities, since its results are usually not needed until model computation time.
Traffic Model

Now the fun and excitement can finally begin. The system description, having been finally signed off by the system analysts, gives us all the information we need to model the system. The first step is to create the traffic model. This is a diagram such as that given for the ET1 benchmark in Figure 8-6. It shows the flow of a transaction through all pertinent processes and highlights the queuing points. There may often have to be different traffic models for different transactions. Each should be carefully described in the text; to this end, numbering the elements in the traffic model and referring to them in the text can be a great help. To aid in transcribing the traffic model into the performance model, it is also a help to note major parameters next to their corresponding elements on the traffic model diagram, as was done in Figure 8-6.
Performance Model

As suggested above, the performance model, at least at its highest level, is simply a transcription of the traffic model into a mathematical expression, such as was done with the ET1 benchmark, resulting in equations 8-19 through 8-34. In contrast, some models lend themselves to tabular descriptions instead of equations. Such an example is the file system parameter analysis tabularized in Tables 8-4 through 8-7. Generally, the model should be generated in a top-down manner. The first step is to create the response-time relationship. Some of the parameters in this relationship will be inputs and need not be evaluated further. Others, especially queuing points, will need further evaluation. As relationships are generated for these parameters, they will use a mix of known inputs and new parameters to be analyzed. As we wade down into more and more detail, we will finally hit bottom. This occurs when all expressions are combinations of known inputs or the results of other computable expressions. Sometimes we will come across a problem that results in a parameter that is a complex function of itself and is not directly computable. This happens, for instance, in limited-population queues in which the load imposed on the server is a function of the users who are not in the queue, which is a function of the load on the server. An example of this was found in the analysis of processor performance in chapter 6 (see "Physical Resources-Model Summary"). In these cases, the model must be calculated iteratively, which usually means it must be programmed and evaluated using a computer.
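One common way to resolve such circular definitions (not necessarily the author's) is iteration. The sketch below uses exact mean-value analysis for the single-server, finite-population case just described; all names are illustrative:

    # m users each think for z seconds, then queue for one server with
    # mean service time s. Response time and server load depend on each
    # other, so the solution is built up one customer at a time.
    def closed_queue(m, z, s):
        q = 0.0                  # mean number at the server with 0 users
        r, x = s, 0.0
        for n in range(1, m + 1):
            r = s * (1.0 + q)    # an arrival finds q others at the server
            x = n / (z + r)      # throughput of the closed loop
            q = x * r            # Little's law at the server
        return r, x              # response time and throughput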
Model Summary

Though often included as a set of tables at the end of the performance model, the model summary is the final organization of the model prior to its calculation. It comprises two parts:
1. A definition of all terms with their dimensions, i.e., msec., sec., etc., to avoid scaling errors, and
2. A listing of all expressions to be used in the calculations.

It is convenient to organize the definition of the terms into four groupings:
1. Result parameters. Those parameters to be calculated that are the most likely candidates to be viewed as useful results.
2. Input variables. Those parameters that are most likely to be varied to run different tests.
3. Input parameters. Those parameters with known values and that need not be varied.
4. Intermediate parameters. Those parameters that will be calculated in order to derive the desired results.
It is important that all expressions (whether they are equations or tables) are listed in top-down order even if they were not so ordered in the text. In this way, parameters can be evaluated starting at the bottom of the table and working up. The same organization is imperative if the model is to be programmed. Table 8-11 exemplifies this organization for the ET1 benchmark example. The expression listing should provide references to the derivation of each expression in the text by showing the equation number or table number for each. It is useful not to consolidate equations into huge, single expressions. Rather, keep each equation simple; use a building block approach (sketched below). This has several advantages:

• It makes computation and/or programming simpler.
• It makes the model more modular and therefore more maintainable in terms of making changes, fixing errors, or adding enhancements (just like programs).
• It gives us the ability to look deeper into the system to see what is going on by having a finer granularity of the intermediate parameters that we will calculate (remember our experience with the ET1 benchmark).

The last point is of paramount importance. The intermediate parameters should, at the very least, include the load on all system elements and the delay time (queue plus service times) through each element. In this way we can identify the bottlenecks by looking at the element or elements contributing most to the response time and can attempt to come up with improvements to the system to reduce bottlenecks. At least this approach will give us the tools to do that.
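As a sketch of the building-block style (illustrative names and formulas, not the book's equations), each expression stays its own small function, so every intermediate parameter remains visible:

    # Loads and delays can be examined one by one when hunting bottlenecks.
    def disk_load(request_rate, t_disk):
        return request_rate * t_disk                 # disk utilization

    def disk_delay(request_rate, t_disk):
        # queue plus service time; valid while the disk is not saturated
        return t_disk / (1.0 - disk_load(request_rate, t_disk))

    def response_time(tx_rate, t_proc, t_disk, accesses_per_tx):
        rate = tx_rate * accesses_per_tx             # disk requests/sec.
        return t_proc + accesses_per_tx * disk_delay(rate, t_disk)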
Scenario

The scenario is the last step prior to the calculation of the model. It comprises not only the particular mixes of transactions that will be imposed on the system but also other system parameters that may or may not be varied, such as the number of processors and disks, communication line speeds, and other parameters.
Model Computation

Though the details of the model computations need not be part of the performance analysis document (though they are often attached for posterity), the computation of results using
the equations is, of course, very much a part of modeling. Computation may be done either manually or by computer. Considerations for performance model programs are discussed later.
Results
Finally, results are available from our tedious hours of modeling and calculating. They should be presented in a clear, simple summary in tabular or chart form (see Figure 8-7 and Table 9-5, for example). Then supporting data could follow in more detailed chart or tabular form. Again, a top-down approach to results presentation is recommended.
Conclusions and Recommendations

The results will generally lead to some conclusions. One conclusion may be to praise the system and its ability to meet its requirements. Others may be recommendations for system improvement, a cost-benefit analysis of proposed performance enhancements, or the impact of a new proposed function or system change on performance. Another useful result is configuration tables to be used to size the system for different environments. Whatever the conclusions, they should be clearly stated with adequate reference to the supporting results.
PROGRAMMING THE PERFORMANCE MODEL

Many performance models can be calculated manually with just a few hours of effort and a hand calculator. Others need to be programmed for one or more of a variety of reasons, including the following:

• They are very complex or, even worse, require iterative solutions.
• They will be used often, and it would be a convenience to have a program available.
• They will be used as a sales or management tool.

Though it is not the intent to discuss programming techniques here, there are some basic guidelines that are useful. The key term is the much used PC saw: user friendly. When user friendliness is built into a program up-front, it doesn't usually cost very much. A program pieced together without thought will probably never be user friendly. There are several areas with which to be concerned.
Input Parameter Entry and Edit

The heaviest user interface (and the one that will cause the most frustration) is entering the data. Some models can have hundreds of data entry parameters. Each should be identified on the data entry screen with its performance model symbol as well as with its full
definition. Even if there are only a few data entry parameters, a screen can get heavily cluttered with parameter definitions, which must exist because even the analyst will eventually not remember what h_sm was. On the other hand, we would like to arrange the parameter notation so that those knowledgeable with the model will know instantly just what the parameter is. In any event, the notation should be created as if the performance model document were nowhere to be found or, if available, were so intimidating that no one would enter it to figure out which parameter is which. The program must envelop the user with friendliness and knowledge. A convenient way to achieve a user-friendly data entry system is to present a screen for a set of input parameters, each identified only by their parameter notation (such as h_sm). In this way, the screen can be uncluttered and the cursor moved easily between fields for data entry.
To give the user the definition of these terms, a help line is reserved at the top or bottom of the screen. When the cursor is positioned at a parameter, its definition is displayed automatically on the help line. In this way, an experienced user can move around the screen and insert data, ignoring the help line, while an inexperienced user is guided by this line. This help line must, of course, not only define the variable but also indicate its units. To facilitate editing of data, it is important that the data-entry screens work in page mode. That is, the user must be able to freely move back and forth between fields until he or she releases the page, at which time the parameters are saved and the next screen full of parameters, if any, is displayed. One subtle point remains. How do we display subscripts? Most everyday video terminals cannot display a term such as h with a true subscript sm. If we have adhered to the convention of single-letter parameters plus subscripts (as has been done in this book), then the display could simply show the parameters as a sequence of characters, since we know that all but the first character are subscripts. Otherwise, we can use a special character, such as the underline, to denote the following characters as subscripts, such as h_sm. The same technique can be used in the program as symbols for the parameters. A similar method can be used for superscripts. Often, a caret is used for this purpose. Thus, h_sm with superscript c can be symbolized on the screen and in the program as h_sm^c. But how do we handle Greek letters? Don't know. Not my problem. As I indicated in my introduction of chapter 1, I don't use them.
Once input parameters are entered, the range over which one or more input variables are to be varied during the calculation and the increments for these ranges must be entered. For instance, we may want to vary the transaction rate from 1 to 10 transactions per second in steps of 0.5 for a system with 1 to 5 processors. Since we can only guess which parameters we may want to vary at the start, it is a powerful tool to allow the user to specify any set of input parameters as input variables.
To do this, the user simply enters the parameter symbol, at which point our friendly help line identifies the meaning of the symbol and its units. (If the symbol is invalid, the terminal beeps and requires a reentry.) Once satisfied that this is the correct symbol, the user then enters the range and increment size. This process is repeated for as many input variables as the user desires and the program allows.
Report Specification

A performance model can often calculate hundreds of intermediate parameters, any one of which might be the potential cause of a system problem. But usually we are interested only in a handful. But which ones? We never know at the start. Therefore, it is very useful to allow the user to specify which results are desired. Of course, one specification is "all," which will give the user pages of detailed results. A more specific request allows us to provide a prettier report. The user should be able to specify not only the input variables but also the result parameters by simply typing the parameter symbol; the help line confirms the accuracy of the chosen symbol. If a small enough number of results has been specified so that a columnar report can be generated, the report should be produced in columnar form. Otherwise, it should be a listing of results. It is often desirable to print parameter definitions at the bottom of a report for those parameters listed on the report. This should be a user option. It is also important to provide a facility for printing a user-specified test name on each report. Of course, all reports should carry a date and a time.
There is nothing more frustrating than typing in several dozen parameter values, a few input variable specifications, and a report format, and then having to do it all over again because you want to change the value of one variable for a subsequent calculation. There should be two facilities to help the user in this regard:

1. Whenever the user calls up screens to enter data or specifications, display what was previously entered. In this way, the user needs to change only those items that need modification.
2. Give the user a facility for saving a data set (input parameters, input variables, and report specifications) in a named file on disk and the ability to recall it for later use.
We have hobbled the inexperienced user to some extent with the user interface described above. When entering input variables or result variables (for the report specification), one
must know the symbols for the desired parameters. But we have assumed that the user does not remember these symbols.
Therefore, we should provide a dictionary of parameter symbols and their definitions. The dictionary should be organized into input symbols (variables and parameters) and output symbols (result and intermediate parameters). The user should be able either to view these symbols on the screen or to get a hard-copy dictionary for later reference.
Help

For the really inexperienced user, a help screen should be provided. It will explain how to call up screens, manipulate the cursor controls, make corrections, and generally run the model. It is a user's manual in a disk file. The first display of the program should inform a user how to invoke the help function (and perhaps to even print it out). In this way, the only thing the user needs to know is how to call the program (that can be written on the label of the diskette containing the program). We tacitly assume that the user knows how to boot the PC or log on to the terminal.
Model Calculation

Now that we have seen to it that the user can use the model, it is time to make a calculation once data has been entered. Usually, this has nothing to do with the user. The program will run, and the results will print. However, there are legitimate occasions for error. A typical error is a specification of an input parameter and variable that will overload a component. In this case, the model should continue calculating so that all requested calculations will be made rather than just aborting. However, when the results are printed for a calculation that could not be completed, inform the user what happened, i.e., disk overload.
The report specification has already been described. The results can be presented in one of three ways:

• on the screen,
• in hard copy, or
• graphically, if a plotter has been included in the package.

For screen or printed reports, either a columnar format or list can be used, depending upon the number of result variables specified.
The report should be titled with a user-supplied title, then should be date- and time-stamped. As a user option, parameter definitions can be added to the report for those input variables and result variables shown on the report.
TUNING

No performance model is perfect the day it is first written. If we are lucky enough to get real values from actual measurements on the system, then we have an opportunity to "tune" our model; for we would indeed be fortunate the first time around if the model reflected the real world as accurately as it could. Differences between model results and actual measurements can be caused by several factors:

1. The assumptions required in order to develop a model may lead to inaccuracies. These inherent errors in performance modeling are ones with which we live. Typically, they are not so severe as to invalidate the model. Unless we can get smarter (or work harder) and use more sophisticated techniques, there is nothing we can do about inherent modeling errors.
2. The values of the input parameters that we have been led to believe are correct may not be. These are parametric errors. An example of a parametric error is to estimate that a process will require 5 msec. of processing time per transaction before it is written and then find out that it actually takes 50 msec. when it is up and running. Parametric errors are corrected by simply rerunning the model with the correct parameter values.
3. Within the original system description there may have been errors that caused the modeled system to differ from the real system. These are structural errors and are corrected by changing the relationships in the model. For a programmed model, this requires program changes.
4. There may be problems in the real system that prevent the system from working as intended. These are system errors and are corrected by modifying the actual system hardware or software. This is the best kind of error for a performance analyst to find, since it has proven the value of our analytic techniques.
In any event, through an iterative process of tuning, in which actual results and model results are compared, differences examined, and corrections made, a performance model can often be made increasingly accurate. When the desired accuracy is achieved, the performance model is ready for use as a management tool, a sales aid, or for whatever it is intended.
QUICK AND DIRTY

As a final point, it should be emphasized that this chapter has dealt with the formalization of the performance modeling process. A model fully developed along these guidelines could require many analyst-months and tens of thousands of dollars. This is not to say that there is not a great need for fast answers to performance questions that can be satisfied by a quick model, minimum documentation, and a fast calculation. Or just a few minutes at a blackboard. And that is what this book has been all about: to give the performance analyst the tools and concepts needed to give educated answers to performance questions in whatever form is appropriate.
11 A Case Study
Contained in this chapter is an actual case study. The author expresses his appreciation to Syntrex Incorporated for its permission to use this material. Syntrex is a manufacturer of word-processing systems. Its basic word processor is a stand-alone terminal called Aquarius. Gemini is a redundant data-base server that provides highly reliable access to common documents for up to 14 Aquarius terminals. Multiple Gemini systems can be networked together via SynNet. Syntrex is extremely interested in continually improving the performance of its systems. It therefore commissioned a study to determine what could be done to improve the capacity of its Gemini data-base server. The attached study is an ideal complement to the material in this book, as it stresses the use of concepts rather than a cookbook approach. For instance:

• The communication links connecting the Aquarius terminals to the Gemini data base use a special contention protocol designed especially for this system.
• A scanning process is used to process Aquarius messages but is continuously running in its own processor.
• Fault tolerance is achieved by synchronization at the scanner level and uses a master/slave relationship.
• Multiple file managers are used, but traffic is split based on function rather than on files.
• The Aquarius scanners interface with the file managers via a shared memory, which is a limited resource.
None of these architectures were specifically discussed in the preceding chapters. However, all are analyzed with the tools given, with a little ingenuity, and with a little devout imperfectionism. The result is a complex model with over 50 input and 50 intermediate parameters requiring nested iterative computation. The model was programmed and run against a benchmark that had also been run on Gemini. The results are comfortably close, allowing the model to be used to peek inside Gemini to find what is bothering it. Several recommendations are consequently made for performance improvement. The following case history document generally follows the organization suggested in chapter 10. One apology is appropriate. The transactions for this system are quite complex, and their characterization as a transaction model adds little to the performance modeling example. Therefore, they are treated as being explained in a reference document, with the results simply being presented as Tables 4-1, 4-2, and 4-3. Though the transaction model is necessary for model calculation, its development is unimportant for purposes of this example.

• A finite number of file manager processes compete for a common processor. Process dispatch time is calculated by ignoring the load of one of the file managers. This is an example of the technique suggested in Appendix 6.
• A finite number of file managers compete for a common disk. The disk queue is calculated via the M/M/1/m/m model, as an example of this technique.
Performance Evaluation of the Syntrex Gemini System

Prepared for: Syntrex Incorporated
By: W. H. Highleyman
December, 1983
The Gemini word processing system has met with significant market success in large operations requiring fault-tolerant common access to documents by multiple terminals. However, in high-volume applications, serious response-time degradation has been noted. It is important to determine the potential for significantly expanding the capacity of Gemini in an economical way. The stated goal is to double its capacity with a cost increase not exceeding 20 percent.
To answer this question, the performance of the Gemini system has been analyzed in some detail with a technique known as analytic performance modeling. With this technique, delays in a system that increase with the load on that system can be characterized mathematically. Consequently, the response time of the system to a transaction, and thus its capacity, can be mathematically modeled and evaluated under a variety of conditions. The result of the following performance analysis is simply stated: Gemini is limited by the speed of its disk units. Any effort made to increase capacity must be aimed at one of two areas: speeding up the disk system or decreasing the load on the disk system. This can be done in several ways:

1. Eliminating physical copies of documents would reduce disk activity by 50%. (This approach could be applied only if we are willing to suffer the consequences of a system failure.)
2. Using only a portion of larger disks would speed up access time, with a resulting reduction in disk load of up to 50%.
3. The disk process could be rewritten to reduce processing time.
4. Dual disks on each controller would reduce disk loading by 50%.
5. Aquarius terminals could be split between two Gemini systems, interconnected by SynNet.
6. A larger cache memory could be used.

Solution 2, the use of only a portion of larger disks, is recommended. It involves simply the purchase of larger disk units. Little development effort is required. This will increase the cost of a Gemini unit by about 10% but should give it twice the capacity.
TABLE OF CONTENTS

Section   Title                                     Page
1.        Introduction                              328
2.        Applicable Documents                      329
3.        System Description                        329
3.1       General                                   329
3.2       Aquarius Communication Lines              330
3.3       Aquarius Interface (AI)                   331
3.3.1     Communication Control                     331
3.3.2     Synchronization                           332
3.4       Shared Memory                             334
3.5       File Manager                              335
3.6       File System                               336
4.        Transaction Model                         338
5.        Traffic Model                             340
6.        Performance Model                         342
6.1       Notation                                  342
6.2       Average Transaction Time                  343
6.3       Aquarius Terminal Communication Line      344
6.4       Aquarius Interface                        344
6.5       Scan Cycles                               346
6.5.1     Scan Time                                 347
6.5.2     File Manager                              348
6.6       Disk Management                           350
6.7       Buffer Overflow                           353
6.8       Scenario                                  355
7.        Scenario Time                             357
8.        Model Summary                             359
9.        Results                                   360
10.       Benchmark Comparison                      365
10.1      Computer Analysis                         366
10.2                                                368
11.       Recommendations                           371
1. INTRODUCTION

This document derives analytically a performance model for the Syntrex Gemini System. Its results will serve to predict the response time of the system as a function of transaction volume, transaction mix, number of terminals, and other environmental factors. It is well known that as a computer system becomes loaded, it "bogs down." Response times to user requests get longer and longer, leading to increased frustration and aggravation of the user population. A measure of the capacity of the system is the load (in transactions per hour, for instance) at which the response time becomes marginally acceptable.
Deterioration of response time is caused by bottlenecks within a system. These are common system resources that are required to process many simultaneous transactions; therefore, transactions must wait in line in order to get access to these resources. As the system load increases, these lines, or queues, become longer, processing delays increase, and responsiveness suffers. Examples of common resources are the processor itself, disks, communication lines, and even certain programs within the system. One can represent the flow of each major transaction through a system by a model that identifies each processing step and that highlights the queuing points at which the processing of a transaction may be delayed. This model can then be used to create a mathematical expression for the time that it takes to process each type of transaction, as well as an average time for transactions as a function of the load imposed on the system. This processing time is, of course, the response time that a user will see. The load at which response times become unacceptable is the capacity of the system. Ideally, a performance model should be tuned. Its results should be compared to measured results and, if significantly different, the reasons should be understood and the model corrected. Usually, this results in the inclusion of certain processing steps initially deemed trivial or in the determination of more accurate parameter values. A performance model, no matter how detailed it may be, is, nevertheless, a simplification of a very complex process and, as such, is subject to the inaccuracies of simplification. However, experience has shown that these models can be surprisingly accurate. Moreover, the trends that are predicted are even more accurate and can be used as an extremely effective decision tool in a variety of cases. Uses of a performance model include
Performance prediction. The performance of a planned system can be predicted before it is built. This is a crucial tool during the design phase, as bottlenecks can be identified and corrected before implementation (often requiring significant architectural changes), and performance goals can be verified.
Performance tuning. Once a system is built, it may not perform as expected. The performance model can be used along with actual performance measurements to "look inside" the system and to help locate the problems by comparing actual processing and delay times to those that are expected.
Cost/benefit of enhancements. When plans call for modification or enhancement of a system, the performance model can be used to estimate the performance impact of the proposed change. This is an invaluable input to the evaluation of the proposed change. If the change is being made strictly for performance purposes, then the cost/benefit of the change can be accurately determined.
System configuration. As a product is introduced to the marketplace, there often are several options that can be used to tailor the system's performance to the user's needs: number of disks, power of the processor, communication line speeds, and similar options. The performance model can be packaged as a sales tool to help configure new systems and to give the customer some confidence in the system's capacity and performance.
2. APPLICABLE DOCUMENTS

2.1 Edit function requirements, internal memorandum; Syntrex, Eatontown, NJ, July 17, 1983.
2.2 Martin, J. 1972. Systems Analysis for Data Transmission. Englewood Cliffs, NJ: Prentice-Hall.
2.3 International Business Machines (IBM). Analysis of some queuing models in real-time systems, Document no. F20-0007-1. White Plains, NY: IBM.

3. SYSTEM DESCRIPTION
3.1 General

A general view of the Gemini system is shown in Figure 3-1. It comprises a redundant file management system which can support up to 14 Aquarius terminals. Within Gemini, two identical file management systems, each with its own fixed disk, run in a master/slave relationship. Each terminal is connected to both sides of Gemini so that each Gemini side may receive all Aquarius traffic. However, only the master side will transmit to the Aquarius. Figure 3-2 shows in more detail one side of the Gemini system. The terminal communications traffic is controlled by the Aquarius interface (AI). It is responsible for validating received messages and for storing them in a memory which is shared by the file manager. It also retrieves messages from shared memory for transmission to the terminals. The AI and file manager run in separate processors. The shared memory is a portion of the AI memory. Another function performed by the AI is synchronization between the two halves of Gemini. The AI ensures that messages are handed to the file manager in either half in the same order; it does not return a response to an Aquarius until both sides have completed processing that request.
[Figure 3-1: A general view of the Gemini system, with Aquarius terminals connected to both the master and slave halves.]
The file manager is responsible for executing all requests from the terminals. It runs in a multithreaded environment in which it has available several copies of itself (the subordinate file managers). As it receives terminal requests through the shared memory, it will process some itself and will pass others to its subordinates according to a routing algorithm.
Each Gemini half has a single fixed disk to store all files. Most disk transfers use a cache memory to reduce the need for physical disk accesses. Each of these components is discussed further in the following sections.

3.2 Aquarius Communication Lines

Each Aquarius terminal is connected to the Gemini system via a high-speed, synchronous, half-duplex communication line. Line speeds are
Aquarius to Gemini:    37.5 kilobytes/sec.
Gemini to Aquarius:    41.25 kilobytes/sec.
These correspond to bit rates of 300 and 330 kilobits/sec., respectively. The protocol that is used is a contention protocol. Either side may begin transmitting without permission whenever it wishes, once it has ascertained that the other side is not already transmitting. In the event of a collision with a data packet, the sender of that packet will receive an "improper" acknowledgement and will retransmit, as described in more detail below. A data packet may contain up to 520 bytes of data plus 27 bytes of overhead. Each data packet contains a sequence number and is acknowledged by the receiver with an acknowledge packet, which is 32 bytes in length.
[Figure 3-2: Gemini system architecture.]
When the line is otherwise idle, the most recent acknowledgement packets are periodically sent as an "I'm alive" message. Currently, this occurs about once a second, with the Gemini and Aquarius sending at slightly different rates to reduce collisions. An acknowledgement packet may be piggybacked onto a data packet if the two are to be sent simultaneously. If an acknowledgement packet is lost due to a collision, there is no great problem, as another will soon be transmitted anyway. However, if a data packet is lost, the sender will realize this after it receives a few (currently four) acknowledgement messages, all of which will contain the previous message sequence number. At this point, it will retransmit the message packet.
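For later reference, the raw transmission times implied by these figures are easily computed (our arithmetic, not the report's):

    # Time on the wire for a maximum data packet and an acknowledgement.
    DATA_BYTES = 520 + 27                 # data plus overhead
    ACK_BYTES = 32
    TO_GEMINI = 37_500                    # bytes/sec., Aquarius to Gemini
    TO_AQUARIUS = 41_250                  # bytes/sec., Gemini to Aquarius

    print(f"data packet: {1000 * DATA_BYTES / TO_GEMINI:.1f} msec.")   # 14.6
    print(f"ack packet:  {1000 * ACK_BYTES / TO_AQUARIUS:.2f} msec.")  # 0.78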
3.3.1 Communication Control. All communication flow between the Aquarius terminals and the Gemini system is via shared memory. Before receiving a message from a terminal, the AI allocates a block in shared memory to receive that message. The file manager places responses in shared memory for return to the terminal. The AI controls all communication flow by scanning the terminal communication lines. As it processes each line, it provides the following functions:
Data Block Reception. If a data block has been received from a terminal, the AI checks its completion status. If the message was received in error, it is discarded (the message will eventually be retransmitted). If the message has been received correctly, the AI checks to see if the slave side has already received it (the synchronization method between the two halves of Gemini is described in the next section). If not, no action is taken until the next scan cycle. If the slave half has received the message properly (on the first scan cycle or on a later cycle), the master AI will queue this message to the file manager for processing and will prepare to so inform the slave, as described below. It will not return an acknowledgement at this time but will instead wait a while (currently 60 msec.) to see if a response will be available in this time. If so, the response data block will also contain the acknowledgement, thus reducing the communication load somewhat. If the slave half receives the message in error, it is discarded by both sides. Properly received messages are synchronized with the master, as described above. Acknowledgement messages are also synchronized as they are received, before the message sequence number is processed.
Response Block Transmission. The AI will check to see if there is a message in shared memory for this terminal. This message may be a response message or an acknowledgement message. If it is a response message, the AI checks to see if the slave half has responded. If not, the AI will check on the next scan cycle. When a slave response has been received, and if it is the same response, then the AI will initiate the transmission of the block to the Aquarius. If the response is different (determined by the message size), a catastrophic error is declared, and the Aquarius is taken out of service. If both sides have a message to send, and if one is an acknowledgement while the other is a response, the acknowledgement will be sent. The response will not be sent until the other side is ready to send it.

Acknowledgement Block Reception. Acknowledgement block receptions require no action if they indicate the proper reception of the previous message or an appropriate idle condition. If several acknowledgements (currently four) indicate that the last block was not received, then that response block is retransmitted. If no traffic is received for a period (typically 10 to 15 acknowledgement times) from an Aquarius terminal, then Gemini declares that Aquarius terminal to be down.
3.3.2 Synchronization. The AI is also responsible for maintaining synchroDi: zation between the master and slave sides of the Qemini system. 'I'here are two primary goals of synchronization
• Toensme tbat both sides execute file manager IeCpStS in the same order, at least iDsofal',as critical reJatioDs are concemed. For iDstanc:e, if ODe termiDal was cl0sing a document while another tenniDal was ttying to opeD that document, success ~
Sec. 3
System Description
333
....- .. or failure of the open depends upon whether it was executed before 01'- after the close . • To return a response to an Aquarius only after both sides are ready with a response and only if both responses are identical. Synchronization blocks are sent between the two Gemini halves as DMA transfers. A syncbroDization block, shown in Figure 3-3a, contains up to 10 message slots. Each message slot can describe an Aquarius. message that bas been received by that side or an Aquarius message that is ready to be transmitted by that side. The master synchronization block also carries a bit map showing which IeqUeSts have been sent to the file manager. Each of 14 bits represent a particular Aquarius. The file mauager requests are sent to the file mauager in terminal number order, as will now be
described. As the AI scaoner is making its rounds, it places message descriptors in the next free slot of the synchronization block each time it finds a received IeqUeSt or a response ready to transmit. It sends its synchronization block to the other side at the begDmiDg of each scan cycle. It may also send one during the scan cycle if the synchronization block fills
up. The use of the synchronization blocks and the actual synchronization algorithm is shown in Figure 3-3b. In this figure, the actions of the master side are shown. The numbers in parentheses refer to processing steps to which reference will be made in the fonowing description. When an Aquarius IeqUeSt is received by the master side of the Gemini, it is stored in sbmd memory (1), and an enuy describiDg this message is made in the synchronization block (2). The previously receiWd synchronization block from the slave side is checked to see if the slave bad already received the request (3). If not, the master AI proceeds with dle processing of ~ lines and will check the slave status when this line is again serviced.
When the AI finds that the slave bas also received the request (3), it queues this mauager (4) and also sets the bit for this tamiDal in the bit map in die S)'IlCbroDizaDo block (5). Note that since the synchronization block is always sent to the slave at the end of an AI scan sequence (and sometimes in between), it is guaranteed that requestS are sent to the file manager in 1mIDiDal sequence. Meanwhile, the slave is poc:esIiDg messages in asimilarJDaDDer, except. that it is holmng temrinal zequestsUDtil it teeeives a synchronization block from the master. At that dine,. it will sead the .n=quests that it is holdittg for terminals indic8ted in the mister's bit map to its file maaagedn telmjuat DUIIlberOJder. thus, it is assuied that each file man.... receives iequests in the same Older, though the slave"fne manager will alwaYS Ft its requests Jater. When the master file manager is ready to return a response, that response is stored in shmd memory (6); and an enuy is made in the next Dee slot of the synchronization bloCk (7). A check is made to see if the slave bas obtained its response by looking at the last synchronization block received from the slave (8). If not, the AI conthmes processing other lines and checkS when this line is serviced. request to the file
asam
A Case Study
334
r-
Chap. 11
BIT MAP, MSGS' _ SENT TO FILE MGR
MESSAGE SLOT I
MESSAGE SLOT I
• • •
•
MESSAGE SLOT 10
MESSAGE SLOT 10
• •
>
MASTER
SYNCHRONIZATION
SLAVE
BLOCK
(a)
.------~
__--..:.(9.;.;'---1 SHARED (J)
MEM.
MASTER
(5) (2)
(7) (I)
~
(8)
FROM
SLAVE
SYNCHRONIZATION SEQUENCE (b)
When-the sJave (Which is pocasing lespoases in a simiJar.m8mier) indi'*s tbadi bas the response (8), and if both responses ateidemical (at least iDsofar as baviDg eqUal message leogtbs, wbich would disIiDguish betweeIl suc:c:ess and failure), then tbe master will mtam the respcmse to the Aquarius (9).
3.4 Shared IfemoIy All CQ!liInUDication ttaffic between the Aquarius tennjnaJs and tile file 1M!!ager is via the sbared JDeDlOQ', under 00Dtr0l of the AI. SbaIed memozy is ozganized iDto four areas:
.
.
Sec. 3
System Description
335
, ,.,.Header section. Disk controller buffers. Networlc buffers. AI buffers. The header section contains shared memory control information and one 4O-byte short buffer per line, which may hold an acknowledgement message to be IetUmed to the terminal. The disk controller buffers hold responses to be returned to the termiDals. As mentioned earlier, an acknowledgement is piggybacked onto a response if a response is
available. The networlc buffers support SyuNet, the local area networlc which can intercounect Syntrex products. SynNet is DOt a subject of this SblCly. The AI buffers are used to receive requests from the te.mUnals. All buffers are S60 bytes in length (of which S20 bytes are data). The number of . buffers varies with the number oftermiDals, but a typical configm:ati.on for a 14-termiDal system would provide 20 disk contmller buffers, 49 AI buffers and 30 network (NI) buffers for a SyuNet system. 11uee of the disk controller buffers are IeSerVed for emeIgency use to break deadlocks that can occur,on lImg read operations (a read of up to five contiguous disk blocks, used prlmarily for program download pmposes). Such a deadlock can occur if multiple terminals request program downloads simultaneously. and the two Sides of Gemini fill these buffers in diffenmt order SO that DO xequest is completed by both sides when the buffers become tun (the order of long leads is not preserved by the file 1DIIIIgeI'). The number of AI buffeIs is approximately three per line. This allows some degree of look-ahead for a terminal in that one request can be acknowledged and can be queued to the file IDIDIge.t' while the DeXt is being received. Should a message be IeCeived when all AI buffeIs are tun, it is discarded and must be tetl:ansmitred. . 'Also, should the disk contrOller buffers become full (not in a deadlock situatiOn), file managers queue up and wait for a be block.
me
.
3.5
.
File....."...
The file manager processes all J:equests from tile tennjna'Js, le"'Hling tile 8pp:optiate tespOQSeS. As the AI deteclszequestS in sbalecl JDeIDOIY that have been completed by both Sides of Gemini, ihose iequests queued to the file'mmiager' (as described above, the slave will queue its IeqUesIS ODly after being notified that the masaa: has clone so). As shown in Figme 3-2, the file manager is ran in a multi1breadecl COD1igaration in that _ are several icIemical file IIJIIIIaFIS running in the system (CUDe.Dtly, five copies run sjmultaneously). One is designated as the main file uumager; it is tbis process that manages the queue of requests in the sbaIed memory. As it tetrieves RqUeStS ftom"the queue, it decides whether to process that teqUeSt cmectly or'route it to one of its sabol~file~.
me
336
A Case Study
Chap. 11
The routing algorithm is based on classifying all requests into tlu=. classes:
a. Gets. All Gets (requests for data) except for long reads are executed by the main file manager. . b. Synchronized requests. These are requests that must be executed in order, such as opens, closes, deletes. Each is handed to a Dee subordinate file manager, and . these subordinate threads are queued (iftbeIe are more than one) so that only one . executes at a time. Executions are in order. c. All oth4r requests. All other requests are handed to a Dee subcmtinate file manager. If all subontiDate file mauagers are busy, the main file manager is stallec:l. It cannot access the next request as it may not be able to deal with it. It cannot even "peek" at the next IeqUeSt to see if it is a Get wbich it could execute. . In most cases, a subordinate will retIID1 its RSponse (usually, just a completion status) to the main file manager, who will then return it to the terminal via sband memory. . However, in the case of long Ieads in which sigDificant data is returned, the subordinate file manager processing that long read will retUm the data directly and will notify the main file manager when it has completed. Each file manager executes its request by issuing a series of read block and write block oommands to disk as aec:essary. These are independent commands so that no one file manager can seize the disk for IIlOle than a block read or write time. All file managers have equal priority for disk accesses. The disk system includes an 8O-block cache memory managed by an LRU (least recently used) algoritbm. Cerrain operations effecIively bypass cache, such as long reads, as it is unlikely that these operations would benefit from cache. In these cases, a cache block is maxked as a candidate for immediate mISe.
The Syntrex file system c:omprlses a hiemcbical stnJcture of cfirectorles which provic1e unique paths to files. A document is made up of a set of files. All files and ctiIectDries (which me ac:maD.y files, as will be seen laIer) comprise a series of 512-byre sectolS (or blocks) OIl disk. Files are OJpDiud into a document via a ctiIectory, as shown in ~ 3-4a.. Tbe dhec:bxyis a named set of sectoIS, each ofwbich c:onIain up to 15 file _ _• 'Ihus, if the dimcrory cOnims 15 or less files, ii is m8cIe up of one sector, for up to 30 files, the ctirectoIy requites two sectors. Sectors continue to be added in this way as necessm:y. Aile name in a directory comprises the name of the file to which it points and a physical disk sector address (a block pointer), wbich points to the ac1Ual :file (FigIB 3-4b). If the :file contaiDs less· than 512 cbar.acters, then it is contained in this single block. Otherwise, this block contains up to 64 poiDtms to 64 other blocks. In this case, the block is known as an indirect block.
can
Sec. 3
System Description
337
DOCUMENT
NAMED
DIRECTORY
DIRECTORY
•••
FILES
DOCUMENT DIRECTORY (0)
FILE FILE
FILE
DIRECTORY
BLOCK
~____~_UM~E~UMr-E~UM~_E~____~ (l5F1L~)
..--,........+--.,..-...,
INDIRECT
BLDCK ~~~~~(64BLO~)
r--r-f-'"'"'I'""-t
INDIRECT
BLOCKS
•••
TEXT BLOCKS
•••
'--_---' (512 BYTES )
FILE STRUCTURE (~)
I1pre 3-4 Docnmem strucIDre.
An iDdDect block may point up to 64 teXt bloCksCODtainjng the file data or to another' 64 iDdkect blocks. Figme 34b shows a direc:tmy entry. pobltiDg to an iDdRct block, which points to anotber set of inctiIect blocks, whiCh point to the set of text blocks. Thus, a file widl DO indirect levels can contain 512 bytes. One indirect lev.el supports 64 x 512 321{ bytes; two indirect levels support 64 x 51:22, or 16 megabytes (about SOOO to 8000 pages). The above description of the S1l'UCtUre of a text file as a tree sttuctme also applies to c:tiIectories. A ctirectOIy is just another file in which the text is a series of file names, up to IS per block:. Thus, if the diIectory shown in FIgUre ~ contained 100 file name entries, it would actuaJly comprise an indirect block:, which would then point to 7 text blocks, each of wbich could hold IS file D81DeS. Its indirect block would be pointed to by a file IUIDle in the next higher direc:tmy and so on.
=
A Case Study
338
Chap. 11
The ttansactions to be considered for this analysis of Gemini are the common set of editing functions, which include
Index Scan Open/Close Document Attach Copy Physical Copy
Go To Page Scroll
Delete1Insert Cut Paste Insert Footnote Add Text Attribute
Manual Hyphenation Paginate Print The disk and processor activity for these transactions has been analyzed in reference 2.1, listed earlier in the chapter. They are summarized in Tables 4-1 through 4-3 for four classes of activity:
"Ii
= number of traosactions for edit function j.
ngj
= number of Get CQ1DII18Dds for edit function j.
TI4f
= number of disk accesses required for edit function j.
n;q =number of disk: cache accesses for edit function j. Tbese terms represeat the sigDificant IeSUlts of the traasactioa. model and are those that are IeqUired by the usage scenario developed in Section 9. The terms in Tables 4-1 through 4-3 are defi.neci in Table 8-1.
...,1
TABLE "'. EDIT FUNC110N TRANSACTION ACTMTY j
FimcIion
1
IDdex Sc:aIl
4- "
+ 3Ok, +
"rJ 3
I
Ji/[lS(l - st>l
1-
2
0peaICl0se Document
3
AaachCopy
4
Pbysical Copy
2(4 + 'Jf~
.4+7b 8 + Iv, + 2(P + .041".
0 0 (P + f)dJ,.,.
+ 2f,
Sec. 4
Transaction Model
339
TABLE 4-1. EDIT FUNCTION TRANSACTION ACTIVITY Func:tioD
j
II~
+ bs + (23c,ln..>1'
5 6
Go To Page Sc:roll
IIJ$
5
IIr6
o if 11,:$ 23
7 8 9 10
DeleteIIDsen
1 + ".1".
[(II, -
11
12 13 14
CUt
8+~n..
Paste Insert Foocnote Add Text Atlribute Manual Hyphenaricm
7 + 2n,1". 8 2 2 13 + 12k, + + d/6 + (p + f)dl". 12 + 2k, + (f. + d/6 + (p + f)dl".
PagiDate
Print
23)c.J".n' if
II,
> 23
o o
cv.
1 T 1Ip1".
o o o 2 + 'JIs + (p + f)dllltb 2 + 'JIs + dJ6 + (P + f)dJ".
TABLE 4-Z. EDIT FUNCTION DISK ACTIVITY F1IDCIicm
j
IDdexScan
I
/00.
2 3
0peaICl0se Docnmem AaIcb Copy
4
Physical Copy
n.f
"-I 3
4 + 12Ok, + ,..1
/lad... 110(3 +4.) (4+ 71')d...+ [<2bs + IO)p + l'lldl". +
2fr.
0 0 2.1110 4.4 + lO;'Jfs + [(lb. + 4)p + 6(Jdl". + [(P + f)dl". + 'JIs)r{(p + f)dl". + 1fJ
+!Is
5
Go To Page
6 7
ScmJl ~
..,44
1It1~
8 9
Cut Paste
2d...+ 18 + (2 + ~n..)4 24-+11+ (1 + n,ln..)(1 + t4> 2d...+33+4 11+4 11+4 (5 + 71')t4. + 3 + 48k, + 'JIs + 2.Sd + (p + f)dln.. (S + 71')t4. + 2 + 16k, + 'JIs + dJ6 + (P + f)dJ".
10.2 + (2 + ~/n..)tJ;, 6 + (1 + n,ln..)tJ;,
10 11
12 13 14
IDsat FOOIIIOIIe Add Text Aaribute Marmal Hypbenaricm PagiDare
Print
n,s
I
"-'
0
2O.2+~
6+tJ;, 6+tJ;, 4dJ3
0
A Case Study
340
Chap. 11
TABLE 4-3. EDIT FUNCTION TERM EXPRESSIONS
4 =(1 - p,)Ji/[3O(1 - $1)] + PI(bl + 1) d"s = 2.2 + (1 - PI)JiI[IS(1 - $1)] + p;(b/ + 1) 12
P. = pdln" ~ J.7
=1 -
(1-2)
(I-3a) nj
(1-4) (1-8a)
p_=1
r{x}
(1-1)
c(.x)lx if c(x) s c
c+(X-e)(I-£)
=1 c(n) =R -, (R -
r{x}
x
R if c(.x) > C
1)p(n)
(I-8b) (1-7)
1 ..-I p(n) = 1-- ~ p(m)
(1~)
P(1) = 1
(l-S)
R_I
=miD(R) such dial c(e) ~ C n.. =512(1 - $5) n" =32.768<1 - $5) d" =2[65 + (~ - 1)(1 + pdln"J) e
lip = 2b, + 9 + (12 + d,,)P. t; = 2b, + 4 + (8 + d.r{6S})p.
(1-10) (I-3d) (l-3e)
(1-9b) (1-3b) (l-le)
8. TRAFFIC IIODEL A model representing the major processing steps for an Aquarius function is. shown in Figure 5-1. For this discussion, the following terms are defined: • Trtl1IStICtion. A ttaDSaCtion is a file manager command sent to the GemiDi by an Aquarius termiDal, for example, get a block, put a block, open a file, and others. . Each 1nDSaCtion is made up of a request sent to GemiDi and a TesporlSe n::ceived from Gemini. An AquariuS cannot issue its next l8qUest until it bas m:eived the mponse to its pevious request. • Function. A fuDccion is an Aquarius action specUied by the operator, for example, open a document, cut and paste, close, print, and others. A function c0mprises ODe or more transactioDs. • Scentlrio. A sceuario is a defined sequence of actions to be taken to process a document. A sceaario comprises one or mOle ftmctiODS. RefeaiDg to FiguIe 5-1, wilen a user requests a function at the terminal (1), the
Aquarius processes It (2), creates a request, 'and sends that IeqUest to GemiDi (3). '·the At: proc:esses the request, once it bas been m:eived, into sbmed meDlOIY (4) and theD waits for the slave to indicate proper mception (5) before queuing the request to the file manager.
(6).
Sec. 5
341
Traffic Model (51
(2)
COMM
PROCESS
LINE
REQUEST
Nt -1
Tc
(12)
(II)
(4)
(5)
AI
AI
RCV. IN SH. MEM.
RESPONSE
QUEUE
Ts + Tb
(10)
COMM
PROCESS
WAIT FOR . . . - - - , SLAVE
LINE
(9)
(7) Tf
FILE
WAIT FOR SLAVE
MANAGER
QUEUE
Nd
Td
Figure 5-1 Gemini ttaffic model.
The mu1titbreaded file Mager will ultimately process this request (7) and may' issue ODe or more logical disk cnrmnands '(8) on its behaJf. On the average, Nd disk commauds (teads or writes) are issued for a Iequest. When the processing of the request has been completed, the file manager will store a IeSpODSe in sbared memory. The AI will wait for the slave (9) to indicate tbat it bas also processed the respcmse and will then send that respcmse to the Aquarius (10, 11). The Aquarius wID process the response (12) aad may issue another request. When Aquarius has completed the func:tioD, a IeSpODSe is!etUmed to the caller (a process witbin Aquarius). On the avenge, Nt GemiDi traDsaCtions are created for every user-requested"
function.
The followmg pemw 11""<:e aualysis will conc:em itself prlmarily with the evaluation of ttaDsaction response time as a ftmctioD of system load. 1'JaDsacticm respcmse time is separated into two primitive response times:
Tr = processiDg requestlresponse time, or that time teqUhed to respond to a request if there weze DO disk accesses teqUiIed. Td = disk response time. The ave:age 1raDSaCtion response time, Tt , is then
Tt
= average traDsaction time = Tr + Nild
'-(1)
and the average function time, T" is
T,
= ave.r,age functiOn time =N.,(Tr + Nild)
(2)
A Case Study
Chap. 11
These parameters are not particularly useful for getting a feel for-terminal respon- . siveness but fmm a valuable basis for determining the degradation of system peIformance under load and under different user scenarios and for evaluating the effect of proposed system modifications. A more tealistic feel for terminal responsiveness may be obtained by considering the total time required to execute some complex scenario of operations equivalent to actual use of the system. Let function i require NIij transactions of type j and let transaction j require Ndj disk accesses. Then function i will require a time T,i, which is expressed as:
.
.
rn
~=I~~+~~ j
With this expression, the avemge time to perfmm an Open, Get, Cut and Paste, or any of the other commands can be evaluated. The completion time for a complete scenario, T$' requiring NIi function requests for function i, is
T$
= average scenario time = I
i
(4)
N,iT,i
Thus, the time to complete an entire scenario can be determined as a function of load. The rate of degradation of total scenario time as a function of load, given by equation 4, can be shown to be the same as the rate of degIadation of the average transaction time given by equation 1. The simpler equation 1 can be used to eviluate average transaction time if relative peIfo.nnance measures are desired. The m.ore complex sceaario time can be evaluated via equation 4 to give measures more meaningful to the user. &. PERFORIIANCE MODEL
In tbis section. the various traffic elements described in the previous secticms are characterized 1IUd'hematic:a1ly so tbat various IeSpODSe times can be predicted as a fanction ofload for a variety of system. CODditions and scenarios. &.1 Notation
Before proceecting with the derivation of the model, certain notational conventions will be established. Tbey are p1eseIded as a guidetine, though the limifatiou of the JI1III1ber of symbols on the author's typewriter tequiIes occasional depa11Dres from these conven-
tions.
.'
A parameter in the model may be an input varitzble (one that is likely to be a candidale for variation in Older to study different enviromnen1a1 conditions), an input parameter tbat is substantially known (such as commlJnication line speed). a·calcultJted intermetlilJte parameter. or a result para1II8ter (a caJ.culared pmametr.r' tbat is Jikdy to be of interest as an end result). Each parameter is repeseDted by a subscripted alphabetic symbol, where typical symbOls used include:
Sec. 6
Performance Model
343
-load c:mied by a server m -message size nfl-number of items
L
p,P -probability
q,Q-queue length T,R-rate S -communication line speed t,T -time
The first subscript usually identifies a subsystem. Typical of these are A-Aquarius temIiDal c -communication line d-disk f -file manager g-Gemini system s -AI scanner Thus, Ts would be the traDSaCtion delay time imposed by the AI scanner; and Ld would be the disk load. Subsequent subscripts further qualify the pmmeter and are defined as DeCessmy.
6.2 Average Transaction Tillie As discussed in the previous section, the average traDsadion time for a transaction to ftow through the Gemini system is Tt = Tr
+ Nild
(2-1)
where T,. is the RqUest/respODse time if there wete no disk accesses associated with the average transaction, Td is the time tequiIed to process a disk request, andNd is the average number of·disk accesses per 1r8DSaCtion. referring to FIgUre 5-1, Tr can be expressed as
T,. = T. + Tc.+ Ts + T,+ Tb Tcr = delay time imposed by the Aquarius terminal.
Tc
= delay time imposed by the COft.mgni<:ation line.
Ts
= delay time imposed by die AI scamtei.
T, = delay time imposed by the file manager. Tb
= delay time caused by shared memory blocldng.
(2-2)
A Case Study
344
Chap. 11
6.3 Aquarius Terminal A detailed study of the Aquarius termiDal is beyond the scope of this effort. However, its processing times cannot be ignored. The Aquarius is tbeJ:efore characterized as a "black box" which imposes a load-independent delay time on each transaction which it generates. This delay time is the term Ta in equation 2-2 and represents the time required by the Aquarius to generate a request and then to process the response to that request.
6.4 Collllllunication Une Aquarius and Gemini communicate over a synchronous, balf-duplex communication line via a contention protocol. When either wants to transmit, it first waits for the line to be idle. It then will transmit its data and will wait for an acknowledgement. . It is possible that the data message will collide with a data message or acknowledgement being transmitted from the otber end. This can only be determined after a few acknowledgements have been received for the previous data message sent. Since acknowledgements on a presumably idle line occur only about once a second, message retransmission could induce a delay of sevenl seconds. Therefore, cmmmmieation line time to send a message is c:c:mnmmication time
= wait time + transmission time + retransmission time.
Let Sa
= speed of line from Aquarius to Gemini (bytesIsecond).
Sg
= speed of line from Gemini to Aquarius (bytesIsecond).
m. = average message size for a data message from Aquarius to Gemini (bytes). m,
= average message size for a data message from Gemini to Aquarius (byteS).
lilt
= aclcnowledge message size (bytes).
Ta = tmDsaction rile imposed by an Aquarius t.enDiDal (transactioDs per second).
Let us fiIst CODsiderwait time for1he Aquarius. Since each ttansacD.on comprises an acIalowledgechequest and:respoDSey tbe pobabiJitytbat the GemiDi will be traDsmitting is Ta(m, + m,J/s, (message rile muldpliecl by message time). Thus, the Aquarius will fiDeI the tiDe busy when it wants to transmit with this probability. If it is assumed that message sizes are exponentially distributecl (a c:oosemtive assumption), them the average time that the Aquarius will wait when the _ is busy is an avenge GemiDi message time (a characteristic of the expoaeaQa) disCnDution is that the average time for an event to c0mplete is independent of the time at which observation of the event first began). Thus, average Aquarius wait time, r-, is (probability of wait) x (length of wait):'
...
(4-1)
Sec. 6
Performance Model
345
_...Wait time for Gemmi messages is somewhat different. The probability of-liaYing to wait is similar to that for Aquarius and is TQ(ma + m,JlsQ• However, if the line is busy, the AI will continue processing other lines and will check this line on its next scan cycle. It is assumed that it will find the line idle on the next try because of the message! acknowledge protocol (a second wait could happen but with very low probability). Let tss = AI sc::auner scan time
Then
Gemini wait time: tcgw
= TQtss(ma + mc)lsQ
(4-2)
Average transmimon time is that time required to send the message. Thus, Aquarius transmission time is: tctlt = malsQ
(4-3)
tcgr = m,ls,
(4-4)
Gemini transmission time is:
Tbere is a collision window during which one side may check. the line, believe it is idle, and start ttaDSmittiDg when the other side is doiDg the same. This window starts when one side decides the line is idle and continues with the processing and communic:ation time mquired to traDsmit the first byte and have it detected at the other side. Let tw = collision window for the first byte on the line, wbicb. mcIudes: • time to initiate transmissiOll of the first byte once it has been decided that the line is idle. • tf'anm,;ssion time of the first byte. • communication line propagatioD delay. • time to indicate to the zeceiver that the first byte has been received. The probability that an Aquarius message will start in tbis window is TQtw. The pmbability that a Gemini message will start in tbis window is Talw, and the collision probability for a message is tbc=foJe Given a collision, the message will DOt be IetraDsmittecl for seveml acknowledge-
T",t:,.
ment times. Let lit = DlUI1ber of acknowledgements for the pnMous message before a message is rettaiJs-
mitted. tk = interacknowledgement time for an idle line. Then the average xettaDsmission time caused by collisions is (probability of collision) x (retransmission delay), or r,,~nkt~. The collision window time can be further expanded by assnnriDg a propagation ~
A Case Study
346
Chap. 11
..._ . on the communication line of typically half the speed of light and by assigning parameters to the processing times. Let t_, tqwr = time to iDi1iate transmission of a byte once the line has been determUled to be idle at the AquariuslGemini (seconds). ta-,
tqwr
= time to detect the reception of a byte by the software once it has been received by the hardware at the Aquarius/Gemini (seconds).
b
c
= average communication line (bus) length (meters). = speed of light (meterslsecond).
Combining all of the above with these new parameters, one obtains the following retraDsmission times: Aquarius retransmission time: tar
= TaT,nktk(t_ + liSa + 2blc + t_)2
(4-5)
Gemini retraDsmjssion time: tcgr = TaT,lZktk(tpr
+ lis, + 2blc + tgwr)2
(4-6)
The transaction time imposed by the communication line is the sum of all these delays for the n=quest being sent from the Aquarius and for the receipt of the response from the Gemini: ~=~+~+~+~+~+~
~
where we bave ignored the probability of multiple retransmissions and where
Tc = avemge ttaDSaCtion CQJI1D1I1Djcation time. Zcij =
component time, where i = a is Aquarius. = gisGemmi. j = w is wait time.
= tis tnmsmission time. = T is IeI1aIJsmissiOll time.
The AI operates by scanning the Aquarius termiDals. From a performance viewpoint, the AI can be analyzed by looking at two separate issues:
• the number of scan cycles IeqUiled to process a transaction. • the average duration of a scan cycle.
Sec. 6
Performance Model
347
_ .,.6.S.1 Scan cycles. So far as scan cycles are concerned, the ~on is delayed during the time that it is waiting for the AI to process it. When a request is first received by the master AI, the transmission must wait, on the average, one-half scan cycle for the master AI to find the received request. This request bas been received simultaneously by the slave AI, wbich is scanning asynchronously relative to the master AI. Thus, given any state of the two scanners, any other state is equally probable. Using this, one can deteImine the average synchronization delay between the two scanners for an input message. There are four cases: a. The slave gets to the message firsL It will do so with a probability of 0.5. Given this condition, al. The slave gets to its home position befoxe the master gets to the teJ:minal, with probability 0.5. In this case, the master will find the message and may process it immediately. a2. The master gets to the terminal befoxe the slave gets to its home position, with probability 0.5. In this case, the master will not process the message until the next scan cycle, resulting in a one-cycle synchronization delay. b. The master gets to the message first. It will do so with probability 0.5. Given this condition and after the slave bas gotten the message, bl. The slave gets to its home position before the master gets to the temUnal the second time, with probability 0.5. In this case, the synchroDization delay is one scan cycle. ·bl. The master reaches the terminal a second time before the slave reaches its home position, with probability 0.5. In this case, the master must make one IDOle scan cycle before processing the message, taWting in a syndD:onization delay of two scan cycles. The above bas ignored early transmjSsms of a synchroDization block because it becomes full. This Would improve perfOllDlDCe and is, tbelefore, a conservative assumption.
Since each CODditioD isequaJly probable, and since these conditions result in 0, 1, 1, and l scan cycles of S)ncbiciuization delay, respectively, the average synchronization delay for input messages ~O x..25 + 1 x .25 + 1 x .25 + l x .25, or one scan cycle.' The master DOW queues the meSsage to the file manager and sets a bit in the synchronization block bit map to iDdicate tbisactioD. An avemge balf cycle later, the master will seod this syncbronizationblock to the slave. It is assumed tbat the slave must pass tbrough home before seading any··of the messages to the file manager balf scan cycle). The slave will then take another half scan cycle to locate this message and to seud it tothe slave file JD8D8gel". Thus, the slave ·file manager is, on the 1.5 scan cycles behind the master. It is further assumed that variaIioDs in processing and disk times average out and ~
(a
avenge,
A Case Study
348
Chap. 11
ali CQ1TUDands ate processed in the same order by the master and slave. Thus, the slave file -"'.ioanager will return. the response to this request I.S scan cycles later thaD the master file manager. A half cycle later, the slave will infOIDl the master; and again, a half cycle later, the master will find that the slave has the response. This results in an output syncbroDizatiOD delay of one scan cycle. At this point, the response may be transmitted. Summarizing AI scanner delays imposed on a transaction, one obtains the following: Scan lines
Master fiDds zequest IDput syachloaizatioD to maSIet IDput syncbloaizadoD to slave 0uIplt syDCiDoDizaDon
0.5 1.0
1.5 !:Q 4.0
Thus, a tnmsaction will be delayed by four AI scan cycle times. 6.5.2 Scan time. one of four functious: 1. 2. 3. 4.
As the AI
scaDS
the terminal lines, it basically can perform
Do DOtbing. Process a received message. Pass a IeqUeSt to the file manager. Transmit a message.
For each transaction, the AI must IeCeive a RqUeSt, send an acknowledgement, send the IeqUeSt to the file 1DIII8geI', send a response, and receive an acknowledgemeDt (two message receptioDs, two message tr:msnrissioos, and one file IDIII8geI' baDd-off). Let
t.
= average AI scan time to service all temrina1s.
Nil
= number of Aquarius tezmiDals On the system.
Til
= transaction me per Aquarius ttmrinaJ.
Then the DIIIIlber of transactioDs IeCeived per scan cycle is NIITllt•• However, while the AI is sc:amring tiDes, it is also servicing intenupts, which adds to the amount of time it takes for scan processing IeqUiIeme1dS. At the iutenupt level, it is processing two message teeeptioas and two message tnmsmissioas as described above for each transaction plus the reception of a' synchroDizadon block. . Using the convention of priming to deaote intermpt times, let:
ta
= AI time requiJ:ed to process an idle line.
Sec. 6
Performance Model
349
1.sr _=... ~ time required to process a received message. tsf
= AI time required to send a request to the tile maoager.
tSt = AI time to process
a transmitted message.
t;r = AI interrupt time to process a received message. t;' = AI interrupt time to process a traDsmitted message. t;b = AI intenupt time to process a synchronization block. tic
= interval between "I'm alive" messages (acknowledge messages on an idle line).
tui
= idle processing time. One can then express the total average scan time as:
tu
= N"T"tu [2(t,+ t~ + tSt + t;') + tsf] + + N"t$$(t., + t~ + tSt + t;')/tk + tui
t~
(5-1)
The first term represents the number of message IeCeptiODS and ttansmissions and file maoager band-offs handled in an average cycle (number of transactions multiplied by proa"SSing and intermpt times per transaction). The second term is the synchronization block interrupt time, which occurs once per cycle. The third term represents the load imposed by "I'm alive" messages, which occur at a rate of N"t$$ltk messages per scan cycle. The final term is the idle processing time requjred. Idle processing is requjred for any line· which does DOt have an activity. Since the rate of activity is five times the transaction rate (there are five activities per transaction), then the number of active lines on a scan (assuming only one activity per line on auy scan cycle) is 5N"T"tu • Thus, the idle processing time during a scan cycle, t#i, is
(5-2)
t.
t.=O
(5-3)
Finally, t;" is a fundion of the number of slots filled in the synchxoDjzation block. Let
= wellupt time to process au empty syDChmDization block.
t;" = intetIupt time to process a filled slot in the synchronization block.
Since there is a slot used for each ~ and·transmitred message,.there ate four slots used per ttaDsaCtion. The synchxoDjzation block intaIllpt processing time can then be expressed as . (5-4)
where 4N"T"t$$ is the avenge number of slots used in the syncbroDization block. Equation 5-1, wbich shows t,. as a function of itself, could become unstable at large loads, depending upon ~ ac:curacy of the provided parameters. Therefore, a maximum
A Case Study
350
Chap. 11
value on the AI scan cycle will be imposed for the condition of one message to be trans···iDitted, one to be received, and one to be passed to the file manager fofeach line:
(5-5) The above aualysis bas evaluated the AI scan cycle as a function ofload, Te. The AI delay, T$' imposed on a transaction is four scan cycles, as shown in Section
6.5.1: (5-6)
Equations 5-1 through 5-6 xep.resent a nonlinear set of equations for t# (due to equations 5-2 and 5-3), in which t# is expressed as a function of itself; tn can be expressed mom conveniently for programming purposes as follows. rU'St, using equation 5-4, all terms on the right of equation 5-1 containing t#, except for t., are gathered and called by the parameter a: a = Ne[(2re + I/tk)(t""
+ t;" + t$l + t;') + Tet$/+ 4T.t~1
t.
(5-7)
This parameter is, in effect, the ttansaction load on the AI. Then
t = lIS
+ kiNet$l 1 - (a - SkiN.Teta)
(5-8)*
ki bas been introduced as a switch: Ie, = idle terminal indicator: 1 if there
are idle teImiDals
oif all terminals are busy
(5-9)
(STet#
<
1)
. (STct#
~
1)
Of course, the value Of Ie, is DOt known a priori. TheEefore, equation S-8 is first calculated with Ie, = 1 to obtain a trial value of t.v. Jftbis results in STet.v ~ 1, a second calculation is made for Ie, = O. That value for t# is used instead. This value is then compared to that value of t.v obtained from equation 5-5, and the lesser value is used.
S.SFlIe ......... The file manager comprises multiple tIueads that service a common queue (the liDked IeqUeSts in sbaIed memory). The a11oc:ation of these reQuests to file manager 1hIeads is·a complex process from. a m.odeliDg viewpoint and cannot be zepmented mathematically. However, an approximation can be made which allows a reasonable characterlzatio. When multiple servers process the same population, there are generally ~ limiting. !
*Nate die simiJarity betwea eqaIIion S-8 aad eqaatiaD W of chIpIa' 8. The scaD cycle time is die idle time c1Mded by ODe miDas die iDcremeatal pmcessiDg load.
Sec. 6
Performance Model
351
~,._ ,The best case is when all servers service a common queue. The next i~in the queue is serviced by the next free server. The "DOIJDal" worst case is when each miving user chooses a line in which to wait and then must be serviced by that server. In this case, the multiple servers are simply a set of single servers operating in parallel. The true multiserver case described first gives significantly improved perfonnance over parallel single servers. There is another case in which perf~ is poozer than that for pllI8llel single servers, and that is when each server services only certain classes of transactions. For instance, if a bank had a common waiting line, and if the customer at the head of the line went to the next flee teller, this would be the best case. If each customer freely chose a teller waiting line when he entered a bank, this would be the next best case. If one teller serviced customers whose last name began with A-C, another D-F, and so on, this would be the worst of the above cases. In the case of the file manager, there is a mix of these disciplines. All the Gets are bandIed by the main thread, an example of the worst case above. Most others are handed . to the next free thread, an example of the best case above. However, synchronized requests must be executed in order and are therefore queued behind each other. This is again an example of the worst case above. Even worse, each queued, synchronized request deletes a server, since queujDg is done by queuing tile managers. As long as the number of servers is small, it is asSlJmed that this complex service algorithm is best represented by independent parallel file managers, each serving its own queue. Transactions are distributed randomly among the queues. To the extent that requests are non-Get, IlODS}'IlChroDj requests, this appmximaricm is conservative; otherwise, it is optimistic. The file manager service time is the sum of ,transaction processing time and CPU waiting time IeqUUed for a tIaDSaClioa. In addition, main thread blocking wiD be conside!ed.
Let ~
=tile manager service time, exclusive of disk ~.
~
= time RqUhed to poc:ess a traasaction if DO disk i'eqUests were mquUed.
~
= delay imposed on an average t:ransadion due to waiting for the CPU~
,~
=delay imposed on an average transaetion du:e to maintlRad blocldng.
Then the file manager service time,~, is ~=~+~+~
.U
Since ~ is an input parameter, there is DO need to evaluate it further. The amount of time that a transaction must wait in line for the CPU wbile it is being used by other file manager tbreads is ~. This will occur for each file ID8IUIF1: dispatch required to process a transaction. There are 1bree dispatches requiIed to route, process, and mum the tnmsaction (between the main thread and the subordinate thread) and one required following the completion of each disk request. Thus, there are (Nd + 3) dis-
I
A Case Study
352
Chap. 11
. __ J*Ches per transaction, where Nd has been previously defined as the..number of disk accesses required per transaction. Let tdp = processing time to
process a disk request.
Then the three required dispatches will require a processing time of ~, and subsequent dispatches to process disk IeqUests will require a processing time of ttlp each. Average processing time per dispatch is (~ + Ndtdp)l(Nd + 3). Assuming the processing times are exponentially distributed, and using the MIMII queue waiting time, the CPU delay, Zfd, for a traDSaCtion is (Nd + 3) wait times:
~
=(N d
+ 3)
Lp (tIP + Ndtdp) (1 - Lp) (Nd + 3)
or (6-2)
where Lp is the processor load. Since ttaDsactiODS are miving at a tate of N.r. transactiODS per second, and since each requjres a processing time of (~ + Ndtdp), then CPU load is F-I
Lp = --rNGr.(~ + Nd t.)
(6-3)
for F file maoagers. The term (F - I)lF causes the load of a specific file IDaDager to be ignored when calculating its dispatch time, as it will DOt be delayed by itself. * Blocking occm:s if .u subordiDate thxeids are bUsy and- will continue tor ODe me . manager servi<;e time (~ + NdT,u, assuming an exponential distribution of this time. If there are F threads in the system, then blocking occurs if at least F - 1 requests are in the sysrem. DOlle of which are Gets. Let L:t be the load on the file management system, with each _dIread cmyiJ:Ig a load (its occupancy) of ~F. 'Ibus. the probability .that F - 1 threads will be busy is fH1pf-l. If the probabiIity.that a zequest is a Get is pg (derived &om the sccmario; see Section 9). theo the probability of F - 1 threads servicing Ipl-Get requests is (I - pg'f-l. The probability of blockiDg is the product of these piobabiIities. aDd the avemge t:IaDsaCIion time-lost to blocking is the probability of ~ multiplied by the average blocked time. ~ + NdTd: ~
= [(1- pg)LjlFf-l(~+ NdT,u
= number of file manager threads. pg = probability that a R:qUeSt is a Get.
F
-See Appeadix 6.
(6-4)
Sec. 6
Performance Model
353
Lj = total load on the file manager.
As defined above, the transaction rate is NaTa. where NQ is the number of Aquarius teImiDa1s, and T is the transaction rate per terminal. Since the file manager must wait for Nd disk accesses per transaction, each requiring a time of Td (evaluated in the next section), then the total file manager load is Q
Lj= NQTQ(fJ+ NIIT~
(6-5)
The average delay time imposed by a file maDageI', T" including queue wait and servic:e times, again assuming that its servic:e time is exponential, is
(6-6)
Equations 6-1 through 6-S relate fJ as a high order function of itself and therefore must be solved iteratively. .!
6.7 Disk llalllJflfllll8llf All file manager tbreads use the disk as a common resource by placiDg IeqUests in the disk queue and then awaiting their completion. Let Ids = disk system service time to process a disk request.
oz. = processing time requjred to process a disk request.
,.1
= processing time requjred to process a disk request from cache.
'.2 = processing time requiIed to process a physical disk request. t. = disk access time (seek plus latency). h
= cache bit ratio. ,. = hZtIpl + (1 -
'tb = '. + (1 -
k~
k)r.
(7-1a) (7-1b)
The time Mquirecl to service a disk request is DOt only this service time but also the time spent waiting in the disk queue. SiDc:e 1he queue length C8DDOt be greara: than the number of file :mapagers (a relatively sma1l DIIIDber), improved accuracy in the IIlOdel can be obtained by not using the infinite population MIMII relation but rather using the relations for afiDit.e user population (1he F file managers in this case) as derived in reference 2.3. A file mauager is busy waiting for a disk zespcmse for an average time of Td, where Td has been defiDecI as the total disk delay time, including waiting in the queue and then being serviced by the disk. There is then an average time of ~ during which the tile manager is idle relative to the disk and is a candidate for making another disk EeqUeSt.
A Case Study
354
Chap. 11
The "service ratio" z is a measure of the proportion of time that the file manager is ._-available as a candidate for issuing a disk request. It is defined as .(7-2) The greater z is, the more available the file manager is. If there are F file managers, each with a service ratio z, it has been shown (see reference 2.3) that the average number of file managen that will be busy when another file manager makes its request, and thus the average number of requests in the system (including those queued to the disk plus those being serviced), is given by the following truncated Poisson distribution: ZF-k-I
qd
=
F~I ~k
(F-k-l)!
k-I
F-I
L
j=O
.
~
(7-3)
j!
'Ibis assumes an exponentially distributed disk service time and represents the number of file requests ahead of a newly arriving request. Thus, the total disk delay time for a transaction, including queue wait and service, is
1d = (qd + l)t.
(7-4)
The set of F file managers Jec:eives transactions at a Iate of N(lT(I transactions per second and genezares NdNtlTtl disk requests per second. Each file manager generates disk requests at a rate of NdN(lTtlIF requests per second. Thus, the interval between disk requests by a file mauager is the recipIocal of this expression. This interval must also be the sum of the file m8aager idle time, ~, between disk services and of the disk service time, 1d • 'Ibus,
(7-5) or ~=
Substituting this into equation 7-2,
~
z=
F NdNtlTtl
-1d
(7-6)
is eJiJnjnatM from this set of equations: F·_ 1 NtI1/tlTtl1d
(7-7)
Since Td is a function of qd, which is a nonlinear function of z, which is a function of 1 d• 1d is a function of i1self; tlms, these expressions must be solved iteratively. The disk: load, Ltt, is (7-8)
Sec. 6
Performance Model
355
6.8 Buffer Overflow Input requests and output responses are buffeml in shared memory. If the input buffers become full, incoming messages are discarded and must be retraDsmitt.ed. If output buffers become full, the :file managers are stalled. Eidler:represents a delay in the processing of a transaction. The traDSaCtion delay due to full input buffers can be detennined as follows. A request is allocated to a buffer when It is received and remains there until the :file manager begins to process it. Its occupancy of the buffer occurs in two phases:
• Waiting for AI synchronization before it can be queued to the:file manager• • Waiting in the queue for the file manager. As described in section 6.5.1, a message waits in the slave buffer for 1.5 AI scaDDet cycles longer than in the master. Thus, the slave input buffer occupancy repmsents the . critical case and requires tbtee scan cycles of a time t. for AI syncbroDization (see section 6.5.1) before placiDg the request in the :file manager queue. Assuming an exponential distribution of file manager service time, the average time a request will wait mthe queue for dle file manager, ~, is
t~_ = (fIF)
7'1
t.
1 - L.tfF If
(8-1)
where 1f, Lt, and F bave been p:eviously'defined as the file manager service time, file manager load, and number of file managers, IeSpeCtively. .- Input blocks are also used by acknowledge messages. As described in section 6.5.1, an acknowledge message (on either side) will requi1'e an avemge of 1.5 AI scan cycles to be detected by both sides. Acknowledge messages are received once for each transaction plus approxDDate1y once per second (actDally every tk seconds as pmriously defined) for "I'm alive" messages. ('Ibe nlduction in these later messages as system activity increases is ignored as a simplifyiDg incl CODSelVItive assumption;) Thus, the total oc:cupmcy time of an input·bafJecis (4.St. + ~) for each traDsaction since a traDsaCtion requires both a tequest and a EeSpODSe. In addition, an input buffer is required for 1.St. for each '-rm alive" message. Since ttansactions mive at a Iate of N.ra ttaDsactioas per second, since "I'm alive" messages are geuetated at aIate of N.1tk, and since dlis ttatlic is spIead over Bi buffers, the proportion of time an input buffer will be occupied (dlat is, its occupancy. Li) is L, = Ntl,rt/..4.Sta
+ ~) +
1.5t..ltI1/B;
(8-2)
where B1 = number of input blocks. Input blocking will occur if B; + 1 or more input blocks are Rq11ired. Tbjs OCCUIS widl a probability of Lft+l. Should there be an input buffer blockage. an incoming request~
A Case Study
356
Chap. 11
be discarded and will have to be retransmitted. This requhes a time ~ to nk times dle - ... - interacknowledge time of Zk, terms previously defined in section 6:4. -Thus, the average transaction delay due to input buffer blocking, Zbi, is the probability of this 0CCU11'eDCe multiplied by the time delay: (8-3)
So far as output buffers are concerned, there are two possibilities for blocking. One is that long reads may cause a deadlock, which is broken by use of the emergency buffers. It is assumed that this happens seldom enough not to be a perfOIDWlce issue. The other concern is buffer exhaustion. If all output buffers are full, the file managers will stall. The average life of a response in an output buffer can be deduced from the AI sc:aDDer activity described in sectioD 6.5.1. As noted in that section, the slave is about 1.5 AI scan cycles behind the master in tenDS of completing its file management services, so the slave buffers will have a lower occupancy than those of the master. Therefore, the following analysis will ~ with the master only. From Section 6.5.1, a response will sit in a buffer an average time of 2.5 AI scan times plus an acknowledge time. These AI scan times include 1.5 scan cycles for input synchronization for the slave (the reason the slave is behind) and One for output synchronization. It is assumed that the Aquarius terDrlnal IeSpOuds with an acknowledgement within one AI scan cycle. Thus, the master will see it OD the next scan cycle. As argued in Section 6.5.1, the average additional time for the master to be notified that the slave bas seen this message is one scan cycle. Thus. a response message will occupy an output block for 4.5 scan cycles as follows:
IDpat syJIducIaizada1lag
1.5
0aIpIt syIIdJrcI •" . Receive ICbowJedae
1
~.1!!IIt syDCbraaizIIIio
1
L 4.5
Responses are miving into 0U!pUt memory at the transaction rate, N.T., and are dis1ributed amoDg Bo buffers. Thus. the average occapancy of an output block. 4. is
Lo = 4.5Ner.z#IBo
(8-4)
wheIe
Bo = number of our:put (Disk CoatroJler) blocks, excluding emergency blocks. Zss
= the previously derived AI scan cycle.
Sec. 7
Scenario
357
If there are Bo output blocks, blocking of the file managers occurs if Bo + 1 or more blOckS' are needed. This condition will occur with a probability of 4 80+1 and-will last . 4.5/2 = 2.25 scan cycles on the average (since the time of block occupancy is fairly constant). Thus, the average time that transactions are delayed by output blocking is (8-5) The total blocking delay time, Tb , is
Tb =
tbi
+ tbtl
(8-6) ,
7. SCENARIO In Older to make a statement about Gemini performance, a specific scenario of the system's use must be specified. This scenario may be tailored to approximate various user - enviromnenrs but will be constrained by the following assumptions:
a. One to 14 Aquarius terminals may be specified, each performing similar tasks.
b. The Aquarius terminals may be either token or nontoken. For token systems, no ' programs are downloaded from Gemini; they are all IeSident on the terminal diskette. i.e., programs that are downloaded are assumed to be infrequently used.
c. 'All tasks are editing tasks. Document creation is via OCR, and its load is ignOIed. d. The profile of a typical cJocnmenr will be specified, including the following: • its average length, in pages, • aveage number of c:barad:as per page, • number and length of supporting files (headeIs, footers, footnotes,. saves), • aveage size of dixectories. e. The profile of a typical editing session will be specified, including the following: • procedure used to search for and specify doc:ument to be edited, • the selection ofbow the copy will be made, eidIer via an aItach (edit copy) or a physical copy (index copy), • aveage number of opens of otbc:r textJformat file pails, • a tepetitive sequence of editiDg functions, including a page Go To, saoDing for a cfesignsted number Of lines, and execution of edit commands from a specified DUx of commands, . • average number of tepetitions before the document is closed, • aveage number of prints during edit session, • proportion of edits that are pagiJU1tc14 before close. f. The proportion of documents that are printed after editing. ' g. The specification of background tasks such as pagination and spelling' check.
A Case Study
358
Chap. 11
This scenario can be characterized as follows via input variables.-.l..et: d = average document length, in pages.
p
= average page length, in characters.
/; = avemge directory length at level i, in number of entries, where: i
= 0 is file room. = 1 is cabinet. = 2 is drawer. = 3 is folder. = 4 is doc:ument.
nj
= average number of edit functions of type j to be performed on each document, where: j
= 1 is Index Scan. = 2 is OpenICose Document. = 3 is Attach Copy. = 4 is Physical Copy. = 5 is Go To Page.
= 6 is Scroll Line.
= 7 is DeleteIInsert Text. = 8 is Cut. = 9 is Paste. = 10 is Insert Footnote. = 11 is Add Text Attribute. = 12 is Manual HypbeDation. = 13 is PagjDate. = 14 is Print.
Section 4 puts in tabular form tile number oftraDsactions and disk accesses required to peDonn each of the above functions in terms of the parameters d. p, /;, and other pammetas. Let ."rj
= DUIIlber of traDSaCtioas DqUi1:ed toexecate the edit function j.
"4i = IIUIDber of disk accesses required to execute the edit function j.
= totalllUlDbef of traDsactions IeqUired by the sceaario. Nb = total number of disk accesses teqUhed by the scenario. N1$
Nft
= total number of edit functions xequiled by the scenario.
(9-1)
Sec. 8
Scenario Time
359
Ntis = Lnj~
(9-2)
j
Nfs =
Lnj j
(9-3)
The average number of transactions, Nr , per edit function is Nr = NrslNfs
(9-4)
The average number of disk accesses per transaction, Nd , is Nd=NdsIN#
(9-5)
If the scenario is to be accomplished in T. seconds, then the transaction rate per Aquarius terminal, Ta, in transactions per second, is (9-6)
Also, let
N8$ = number of Get transactions in the scenario.
ngj
= number of Get transactions for the appropriate jth edit function.
Then
Np = L"Jngj
(9-7)
j
and the probability that a transaction will be a Get is
P8 = NpIN#
(9-8)
This is a parameter needed by the model, along withNd , N" and Ta. The model will calculate an avemge transaction response, Tr, to an average traDsaction, as well as the transaction IeSponse time c:omponeut without disk accesses. T,., and the disk access component, Ttl. These are Ielared by
Tr = T,. + NtlTd
(9-9)
The IeSpODSe time to the eclitflmction of type j. 1j, is:
7j
=1I,jT,. + "4Td
(9-10)
RefelalCe 2.1 gives a means by which to estimate the cache bit mo, h, via the pamm.eter which is the estimated· cadle hits for each edit func:lion. This esrimate is
n:u.
h=
Jnin:u
(9-11)
Nds
8. SCEIVARIO n.E The scenJJrio time. i.e., the time it takes for an operator to complete the speci1ied sequence of operations, is a function of the ~ time of the system, which in tum is a function
A Case Study
3&0
Chap. 11
_.... _.of the load imposed upon the system, which loOps back to being a functioR of the scenario time. The preceding sections have predicted system response time for a preset scenario time, T". A more useful number might be the actual scenario time that would result when a number of Aquarius terminals are given a scenario and left to do it in all possible haste. This would involve guessing at a scenario time T", using that to calculate a transaction rate To (equation 9-6), and then calculating a scenario time T" that will be different from the guess. The guess for T" is then adjusted until the calculated value for T" is sufficiently close to the actual value for T". In order to be reasonably accurate, the operator time must be included in the scenario time. Let
Toj
= operator time requited to initiate the edit function of type j.
Then the scenario time, T", is
T; = ~1I,;(TDj + 7j)
(10-1)
j
where nj and 7j have been previously defined as: II,;
= average number of the edit functions of
type j
to be performed on each
document.
7j = average time for the edit function of type j. A teasonable initial guess would be to set transaction rate To to zero and to c:alculate T". This would give the mbrimum possible value of T", which could then be incremented until the solution was found.
8. IIODEL SUIIIIARY The Gemini performance model is SlII1!DI8!'jad in Tables 9-1 and 9-2. Table 9-1 defines the pammeters, organizing them into fOUt sections: a. Result ptI1'Q1fleIeTS. which are those calculaIed pIIameters that are likely to be of interest for %eSUl1s. b. 17fJU1 wzriIzbles•.wbichare likely to be varied to deta:mine perfOimance UDder varying cmvjronmeats. c. 17fJU1 ptI1'Q1fleIeTS. which are inputs to the model that are fairly stable. d. Intermedit1te parameters. which are all calculated patameteIs, except for result ~.
Table 9-2 summarizes the model equations into several groups:
a. results,
Sec. 9
Model Summary
361
...b.. communication response-time component, c. AI response-time component, d. file manager response-time component, e. disk system response-time component, f. shared buffer overflow response-time component, g. scenario, h. edit functions. TABLE 9-1. GEMINI PERFORMANCE MODEL PARAMETERS a. RGUlIs T. Awnge 1IIIISIICIicm lime (sec:oads) 7j Awnge lime for edit f1mction of type j (secoacIs) b. Input VtZriablu bi Number of jDdiJect levels at db:ec:lory level i c. Awnge liDeleaglh (c:bmcteIs) C Cache size (blocks) d Awnge doc:ameDt leDgth (pa&es) f Awnge number of foImal cbaractms per page fi Awnge number of files at db:ec:lory level i f. Awnge number of support objecls (text aad fomIat file) used in edit sessicm per doc:nment (footaoIes, beadas, footas) DireclCIIy level: i 0 is lOOID =liscabiDet =2isdrawer = 3 is folder 4 is doonnent =SislUt j edit faDctiaD type: j 1 is IDdcx ScaD = 2 is 0peaIC10se J)oc:mMgt 3 is AUIch Copy 4 is PIIysicIl Copy S is Go To Paae 6 is ScIoD LiDe 7 is De1eIeIfDsert Text 8 is Cal 9 is Pas1ie 10 is JDseat FoccaoIe 11 is Add Text AIIIibare ., 12 is MamIal Hypbrn....-QD 13 is PIgiDaIe 14 is Print At totea switcb (0 if 1Oka, 1 if DOt tobIl) ~ average number of cbal:1ICIaS in ". average number of c::hal'Ic:tas affeccecl in a cIeIeIWIse:rt seqaeace (die grarer of die IIIIIDber of c:baracfm iDscrIed 01' deleIed) IIj average number of edit faDcdoIIs of type j to be pafozmed OIl each doC"D..,,,
= = =
= = = = = = = = = = =
=
=
"cat
A Case Study
362
Chap. 11 "
•..~-. TABLE 9-1. GEMINI PERFORMANCE MODEL PARAMETERS-Continued-_
.
I'Zp
average Jl1IIDber of c:hmcters in a paste
no
average Jl1IIDber of !iDes scm1led between edit acliODS numbe:r of Aquarius termiDaIs average page Ieugtb (c:hmcters) pobabiIity tbat directory level i is bashed numbe:r of lefeaeace blocks on disk tile pmponioD of 1IIIIISed space in a block at directory level i (slack). aveage sc:eaario (user) time (seconds) user Jevel in diiecImy from which index scaas are made (u 0, 1, 2. 3)
Nil P Pi
R Si
T.
=
" c. Input PtII'IIIMI6rS b
Bi Btl c F h m" ~
m.t At
s"
s,
''-
,.1
"'"
r...z ~
z".,. t,.. lot t~
t" t..
',.. ,. t;" ,;,
. Ci
Til
aveage CODIIDIIIIic:at !iDe (bus) leDgtb to Aquarius (merezs) Dumber of AI blo.c:ks in sbared meDICI)' numbe:r of disk CODtIOller blocks in sbared memory. exclusive of emergeacy blocks speed of ligbt (3 x lOS meIaS per secoacl) numbe:r of file DI8II8pr tbreads average c:ac:he bit rado (may also be catcalaled) request-message leDgCb (bytes) ~message la8Ih (bytes) acknowledge-message leDgth (bytes) numbe:r of acbowledgemeDts for tile pevioas message befoce tile camat message is JettJnsm;tled AqIIIIdus mmgpmjc:arion !iDe speed (bytes/sec.) Gemini COIIIIIIUIIicatioII!iDe speed (bytes/sec.) time to detect tile Iec:epIicm of a by1e by tile software cmce it bas beeD JeCeiwd by tile ~ 81: tile AqaIiias (sec.) time to DIiIiaIe tile tnn!S!!rissim of tile fitst by1e of a messap oace tile line has been dete..wDiicl to be idle by tile AquaIius (sec.) disk average access time (seek plus 1aIeDcy) (sec.) CPU processing lime req1Iiied to pmcess a disk access from CICbe (sec.) CPU pI'OCeSSiDg time req1Iiied to pmcess a physical disk access (sec.) file III8II8&a" CPU pocessiDg lime mquiJed to process a ttaIISICdan exclusive of disk accesses (sec.) time to detect tile Iec:epcioD of a by1e by tile software oac:e it bas been mc:eiwd by tile ~ 81: the Gemiui (sec.) time to _ _ tile tanS",;'sWD of tile fizst ~ of a message oace tile !iDe has been ....... "';ned to be idle by tile Gemiui (sec.) ia..,." rnvJed&e lime OIl 8D idle !iDe (sec.) AI JlIocessiD& time ieqaiJed to queue a request to tile ile DIIIJIIIII' (sec.) AI processiD& time ieqaiJed to process an idle !iDe (sec.) AI processiDa time ieqaiJed to process a zec:eived IDIISSIF (sec.) AI processiDg time ieqaiJed to process a IpDsmjned (sec.) AI imea1Ipt time mqailed to process a message iDdicaIor ill a S)DCIaouizal:iw block slot (sec.) AI iDreaapt lime RqUired to process aD empty S)'IlCbmaiz:ado b1oc:t (sec.) AI ianapt lime to process a Jec:eiwd message (sec.) AI iDreaapt lime to process a '",'puined message (sec.) AqIIIIdus time mquiJed to process a .....saeikm (sec.)
messaae
messaae
I~P~-'
4%
. . ...
--_.-. __ .. - .....
AI load IIIrJl!qtpbJe to activities wbose fnIqDiiacy is a fimcIion of AI scaD lime, exclusive of idle pmc:essiDg
.
Sec. 9
Model Summary
363
TABLE 9-1. GEMINI PERFORMANCE MODEL PARAMETERS-Continued c(n)
d,.
ll;
tInt d,,;
d,p e
f k;
~
L, L; 1.. Lp "'" ~
n.t n",
"tJ II,p
Ntl N. N" N" N, N. p, p(n) p,
ftl
rW
r. r.; ,. '-
',.,
r....
,. tqr
't." ~
~
mezeuc:e
IUIIIlber of physical disk accesses rCqaired to' update n counts IUIIIlber of vinual disk accesses to Pat a text block IUIIIlber of cache acc:eises to Pat text blOck number of disk accesses zequiIed to opeD a new file (c:seare a file) at dizectozy level i JIUIDbe:r of disk accesses zequiIed to opeD an existing (old) file at clirec:tory level i IUIIIlber of disk accesses zequiIed to copy a poiDter block JIUIDbe:r of refeIeDc:e COUDt updates n:quired to completely fill cache avemge Il1IIDber of fomJat characters per page AI idle tiDe switch: = 1 if5r.,t# < 1 wben~ 1 0 otberwise disk load file JIIIII8F total load oc:cupaDCY of aD AI buffer oc:cupaDCY of a disk CODUOller buffer file IIIIIII8ga" processor load IIUIDber of disk accesses zequiIed for !he edit f1mctiOD of type j JIUIDbe:r of disk cache accesses required for tile edit fuuctioD of type j IIUIDber of Get MI!!!D!!!!Ck JeqUized for !he edit faaciicm of type j JIUIDbe:r of text chazaclas in a disk block (512 less slack positicms) JIUIDbe:r of transactioDs n:quired for the edit f1mcticm of type j IIUIDber of text c:bmctezs poiDmd to by a poiDter block (32,768 c:bmcters minDs slack posidoas) avemge Il1IIDber of disk accesses per traDSaCticm totalll1llDber of disk accesses in a sceaario totalll1llDber of edit fIIDctioas in a sceaado totalll1llDber of Get tAD. tiorIs in a sceaario avemge IIUIIIber of tnmwtioas per edit fuDcticm totalll1llDber of tlaDSICIioas in a sceaario ~ tIIat a tl'aIISaCdoa is a Get probebiIity of a physical disk access wbile apdaIiDg the JiI' zefeIeace COUIIt p:obability of a poiDIr:r block split dmiDg a block Pat avemge JeagIh of disk queue cache billllio for .x refeaeace COIIDt blocks tnIIISl""imIl8IIe per AquaIiIIs (...... w:tiaas per secoad) awap tn....nm 1ime becaDse of AI buffer bloc:kiDg (sec.) awap au pocessiDg mqaiIecl to pocess a disk access (sec.) awap tnIISaI:IioD 1ime becaDse of disk CIIIdlODer bIockiDg (sec.) aveaae retnnmni.... 1ime for lID Aquarius Ieqaest (sec.) awap u.nsn issiooa time for lID Aquadus leqaest (sec.) awap WIit 1ime before In Aqaarias CIZl begin ......smiltj". a JeqDeSt becanse of Gemini tnfIic (sec.) aveaae JelmSn!ksion 1ime for a Gemini IespGDSe (sec.) avemge tpnsmjssion 1ime for a Gemini respaase (sec.) averap wait: 1ime before a Gemini CIZl begiD tnnsmilrina a 1espcase because of Aqwaius tnfIic (sec.) avemge 1ime zeqaiIecl to pocess a disk access, iDcIadiDg queue delay aDd pmcessiDg (sec.) avemge 1ime for the file maaager to p!oc:es5 a transac:tioD, exclndiDa disk pocessiDg bat iDcIadiDg file maaager bloc:lr:iDg, CPU queuing, aDd file IIIIIIIIFI' queuing (sec.) avemge ~ 1ime becanse of file manager blockiDa (sec.)
a
=
=
364
A Case Study
Chap. 11
___ . TABLE 9-1. GEMINI PERFORMANCE MODEL PARAMETERS-Continued-_
avenge UIDSa<:tion time because of queue delays for the file IIIIII&F CPU (sec.) avenge uaasac:tioD time because of waidDg ill the file IIIIII&F queue (sec.) avenge AI scan cycle time (sec.) avenge delay time because of sbmc:l memozy blocldng (sec.) avenge delay time because of the commgnjc:arim )iDe (sec.) avenge delay time because of proc:essing a disk access (sec.) avenge delay time because of file manager activity (sec.) opentor time required to iDitiaIe the edit functjon of type j (sec.) avenge delay time excluctiag disk activity (sec.) avenge traDSaCIicm time clue to AI activity (sec.) file IIIIII&F service ratio
TABLE 9-2. GEMINI PERFORMANCE MODEL SUMMARY a. Raponse dIM T,=Tr+NtlTtI Tr =T. + Tc + T$ + 7i+ T. 7j = n,;Tr + n,qTtI T.
= Ij
np'oj + 7j)
(10-1)
b. Communicmion time T. + r.. + '- + tep + fc,r
='-
t...,.
(2-1) (2-2) (9-10)
= r.[(~ + 1ttic)ls612
+ t.,
'-=m.ls. r.r6I1t1t(t.. + lis. + 2blc +.)2 r.t..(m. + 1II,;)/s. fc,r ~/s, tq,. r.r611t1.t(tpr + I/s6 + 2blc + t..,1
'- = • = = =
c. AI time T$=4t.. t.. 1;
=1 -
t.
(4-7) (4-1) (4-3) (4-5) (4-2)
(4004) (4-6) (~
+ l:;N.t.
(a - SI:;N.r.t,j)
= 1 if S r.t.. < 1 .. 0 0Ibenrise t _ =N.(t" + t;,.+ t.+ r;.+ t~ a =N.[(2r. + 1I1.I:)(t" + ,,+ fa. + t;.) + r.r" + 4r.,t'...l
(5-8)
(5-9) (5-S) (5-7)
d. F~"."..". time
TI= (~+ N"Ttt'J/(l -
LlF)
~=~+~+tt. ~ L,.(~ +.N"r..,)I(l ~ L.)
=
L,. =
(F ;
1)N.r.(~ + N"r...)
tt."" [(1 - P6)L,lFr-1(ti+ N" + N"T~
Lt- N.r.(IJ+ N"T~ e. Di#:time T" - (ttl + 1),.
"" =r.., + (1 - 11),. r..,="J + (1- ")~ L" =N.reN"""
(6-6) (6-1) (6-2) (6-3) (6-4) (6-S)
(704) (7-1b) (7-1a) (7-8)
Sec. 10
Results
TABLE.9-2.
3&5
GEMINI PERFORMANCE MODEL SUMMARY--COntinued ZF-Jc-I
F-I
qd= ~ k pi
.
(F-k-l)!
'-1 .
~ ~ jaO J. ~
F
z=
.,
-1
. NdN.r.T . d ..... f. Buffer 0vsjIqw time T" = tbi + t""
= n"t"kBi+1 t"" = 2.2St..z..,.... ~ =N.[r.(4.5t.. + ~) + 1.5t..IIJc]lB; lbi
1
'-- _ 79 -
(/.IF) ~ l-LiF 4.SN.r.t..IBo
L. =
(7-3)
(7-7)
(8-6) (8-3) (8-5) (8-2) (8-1) (8-4)
g. Scenario (9-4) (9-5) (9-6) (9-8)
N, =N,.INfI Nd=NeiNa r. =N..fT. p,=NpIN..
N..
=~"ill,j
! "i'" =!,lIj Np =! "i"ri
(9-1)
Ne -=
(9-2)
Nfl
(9-3) (9-7)
j
~"in:,
h = i Ne
(optioaal. if DOt gMD. as iDpat)
(9-11)
Note: 1I,j. "ri. "..1DCl n:, _ gMa in Tables 4-111uvusb 4-3.
10. RESULTS 10.1 8McIJmarlc CoIIIpari8OII The performance model was ~ for !be scenario shown in Table 10-1. This same sceaarlo was used to IUD a bfmcbmark test on the system. Table 10-1 lists the opemtor activities IeQ\IiRcl to process a cfocnment UDder dIis bench"!8ric and also shows those activities used by the model to approximate tIDs benchmartc. The model Was evaluamd for the IlOl1tOke1i case, with merence counts iii The IllUDber of Aquarius termiDaJs was varied from Ito 14. The time that was measured experlmentally for this case is shown as the dashed line in Figme 10-1. This is a curve of the time ~ for 81} operator to complete the benchmarlc as a function oftbe number of teEmiDaJs on the system and ranges from 840 seconds for one terminal to 3960 seconds for 12, tetmiDaJs (FIgUre 10-1 has extrapolated the curve . to 14 terminalS).
memorY.
A Case Study
366 TABLE
,0-,.
GEMINI
Chap. 11
BENCH~K
F1mcti.on
Beach"",""
Model
1. IDdex ScaD 2. 0peaJCl0se Document!
4 4
4
Delete Document 3. Attach Copy 4. Pbysical Copy S. GoToPage 6. Saoll
1 0 1
7. DeleteIJDsert2
8. Cut 9. Paste 10. iDsert FooIDote 11. Add Text Attribule 12. MamIal HypheaatiOll 13. PagiDate 14. PriDt SpeD 0Iec:k'
S
0
4
4
24 20 3
24 20 3
4
4
1 3 0
1 3 0
4
4
1
3
Noles: lDelere Doc:ameDt is 1aba as eqaivaleat to an 0peDIC10se DocmneDt. 2GemiJIi BeDcJI marl iDcluded 20 SearcbIReplace, wbic:h are 1akIm as eqaivaleDt to 20 Sc:.roll and DeJeIeIIDsert. SSpeU 0Ieck is taba as equivaIem of 2 PriDt.
is
The model calculatioDS are shown as the family of solid lines in Fijure 10-1. Each for a cfif.feIeDt cache hit mio. As opposed to the ~tive1y expected curve of the experimeDta1 IeSUlts, the predicted curves are smprisingly linear; this effect is discussed later and is shown to be caused by disk saturation. Fur1henDoJe, the worse-tban-1iDear perfomumce that was actually measured is because of Ieduced cache effectiveuess as load increases. If the model is takr:n as giving IepIeSeDlative results, then one would conclude 1bat the GemiDi cache effectiveness was gieater at low loads (around .7) and decleased to about .45 as load inaeased. This is to be expecred, since the greater disk activity at higbe.r loads will Bush out data that would otherwise be available at lower loads. The model mabs. its own coaservative estimate of cache effectiveness, assnming that oaly frequently accessed blocks are in cache, These include:
• Free list. • Directory block for cumm.t. directory.. • IndiIect blocks for c:um.nt text block.
It would be expected 1bat the model's prediction. would be close to 1bat measured for highI .
Sec. 10
Results
367
'500·
- - - - CALCULATED - - - - MEASURED 4000
III
2
~
o C c
Z III U CD
100
o
4
6
8
10
12
14
TERMINALS I'ipn 10.1 SCIIIIIio dille.
ioads. In fact, the model predicts a cache bit mtio of .444. very close to that measured for bigb loads (.45). Thus. it can be CODCludecl that the model gives reasonable xesults compared to those actually measured. UsiDg the cache bit mtio estimated by the model gives conservative results at mocJerate loads and fairly accurate teSU1.ts at high loads.
A
368
case Study
Chap. 11
10.2 Component Analysis Using terms from the model, we define the following terms:
Transaction is a request submitted by an Aquarius termiDal to the Gemini. Function is an edit action, such as an open, scroll, paste, etc. Scenario is the set of edit functions required to process a document. Thus, a scenario comprises a set of functions, each of which comprises a set of transactions. Let
Tu = time required to complete the scenario. T, = time to complete a transaction.
Nrs = number of transactions in the scenario. Then
Tu
= NrsT, =
The benchmark test required 75 functions totaling 3046 transactions (Nrs 3046). Transaction time, T" comprises a processing component and a disk system c0mponent. Let
= processing time per transaction (exclusive of disk processing time). Nd = number of disk accesses per transaction. T4 = disk time (processing, seek and rotational time) per disk access. T,.
Then
T,
= T,. + NdTd
In the beDcbmark scenario,1heIe are 2.92 disk accesses per ttansaction (8$87 disk accesses per scenario). Figure 10-2 shows T, and its componentS for a cache hit ratio of .444, as calculated by the model. Atbigh loads, T,. is cleady the preclomiDant factor. Though it curves sJiabt1y at low loads (as would be intDitively expec:ted), tbis curvature is offset by a flattening of the disk time, Td, msulting in a linear T,• Disk time "satmates" because the disk system is fed by a finite number of somces (four file managers), thus limiting the size of the disk queue. At higher loads, the disk queue approaches a CODSIaDt value (3.6 accesses waiting or being serviced), resulting in constant disk service times. I,nading for the disk system, file maaagers, and processOr are shown in FJ.gUre 10-3, as calculatr4 by the model. Note that the processor load is quite small (15-20 percent at high loads). However, the disk system and file maaager quickly satmate, the disk system at about 4 temiinals andtbe file manager at about 8 terminals.
Sec. 10
Results
369
2.0 1.8
CACHE HIT RATIO =.444 ;; 1.4
c
z
o
()
~ 1.2
...:IE
i= 1.0 .8
.6 .4 Tg .. Nts Tt
.2
Tt .. T,+Nd Td
o
6
8
10
12
14
TERMINALS I'ipre 114 TiaDsadiClll1ime CCIIIJlC""II's.
As menticmed above, the pndtmrinanr COJDpODeDt in the 1l'aDsacti0D time, T,. is the
processing time, T" DOt1be disk time, Td. But we just said tbat the disk system was ·heavily loaded and 1be processor ligbtly loaded. The auswer to tbis apparent anomaly is given in Figw:e l0-4•. where 1be culprit is shown to be the file mauager. Here, the compooeats of T,. are shown as: .
T. = Aquarius time. Tc
= comiidmi<:ation time.
Ts = AI scanner time.
T, = file manager time.
A Case Study
370
Chap. 11
1.0
DISK
.8
.6 CACHE HIT RATIO ••444 .4
.2 PROCESSOR
o
2
4
6
8
10
12
14
TERMINALS i
I
!
"
Note 1bat comnmDicatiou time. Te. decleases as load increases. 'Ibis is due to reduced coJlisioDs because. of the ligbter trmrina1 trafIic at higher system loads, i.e., Aquarius is J'UIIDiDg slower. Because the file maaager must wait for.multiple (2.9) accesses to a saturated disk system, its growth is sgbstanriaJly linear. That is. siDce 1he satwated disk systesD is giving fixed respoase times above four temrinals (see FIgUre 10-2). then doubling the load will approximately double the file maDager time. As seeD from. Figure 10-4. the file maaagertime, T" is the predomjDaDt tac,ror in Tr , which is 1he predomjDaDt factor in T" to which sceuario time, T•• is pIoportional. Since Tf is DOW explained to be linear (at least at higher loads), the scenariotime, T., will be linear also.
Sec. 11
371
Recommendations
.20
.16 CACHE HIT RATIO =.444 .
;; c z 0 0 III
(I)
.12
I II
2:
-
j:
_
.08
o
2
4
6
8
10
12
-
---
Tf/IO
14
TERMINALS
11. RECOIfIfENDAnoNS A c:aladation bas been made of Gemmi performa1ice using the perfcmDance model. This model appears to agree well with beDc:bmart IeSUlts. The message is clear: The 0DIy way to signifiraiilly iaIproge GemiDi performaace is to reduce disk sysIem load.
Disk system includes the physical disk as .
wen as the disk process. - ·
372
A Case Study
Chap. 11
Running at loads equivalent to the Gemini benchmarks, one finds_!!te following: • Average IeSpODSe time is 4 times longer than in an unloaded system. • A 20% peak. increase (which will easily happen) causes the response time to further degrade to 10 times that of an unloaded system. • Ninety percent of the response time is reJated to disk activity. Reducing disk system load by 50% accomplishes the following:
• Average loaded response time will be about 30% longer than unloaded response time (rather than 400%). • A 20% peak. increase in load will cause a 25% increase in response time (rather than 250%). Thus. nearly an order ojmllgnitude decretlSe can be made in the load charOl:teristics oj Gemini by cutting the disk system load in half. TheJ:e is DO easy way to achieve this reduction. The token configuration and the movement of reference counts to memory will buy a 10-15% reduction in disk system load (which translates to a 40% reduction in response time under load conditions.) Other options available (with DO comment on their ~ agony) include the following:
a. Eliminate physical copies of documeuts. This acx:ounts for about 50% of disk activity. b. Use faster and larger disks (use only a subset of tracks to reduce seek time) to achieve nearly a 50% !eduction in disk access time. c. Decrease the disk process processiDg time. d. Use dual disks on each c:ontroller. with overlapped seeks. to achieve nearly a 2:1 increase in allowable access rates. e. Split the Aquarius tem1inaJs between two Gemini systems interc:onnectec with SyuNet. f. Use a bigger cache. ODe significant observation is the fact that the disk system saturates at about 4 tennjna1s in this benchmark At this point, the cache bit ratio is about 0.7. If cache sire weze tripled. tbis same cache hit ratio (or better) ought to be achieved for 12 terminals (this amount of cache would DOW be available for each group of41nmj nals). The additioaal effectiveness of an tem.jna1s sbariDg a larger cache would add even IIlOle to the atttactiveness of this solution. Scenario time would be reduced.Dom abont.SOOO seconds to 3200 seconds. about a SO% improveI:Dmlt (taken as 500(13200). The flairening of the experimental time for less than 4 te.rmiDa1s suggests that a larger cache than this might DOt add much.
samano
Solution a. may be unacceptable from a reliability viewpoint. Solution c. requires a large software effort with questionable results. Solution e. will cause perhaps an equal
Chap. 11
References
~QI;m8Dce
degradation because of the speed and imposed load of SynNet. ·Solution f.
373
requires major hardware modifications to increase cache addressability. Solutions b. and d. require simple purchased hardware changes, with b. being the far more economical. Since both solutions result in a substantial and almost equal perl'or:mince improvement, solution b., using a portion of larger disks, is recommended. A capacity improvement of 2: 1 can be expected.
APPENDIX 1 General Queuing Parameters (as used in Appendix 2) c E(T2) k
L
D1.IIIIber of senas seccacl mameDt of T KbiDrdIiDe-PoUa disIribatioD coefiicieIIt load (occupm:y) of a server (at 1he CODSidcrecl priority aDd higber priorities. if priomies
are iDwIved)
L" L, m
n P. P. Q
R T
T. Td T. T,
Tq T.. T,
T2
tcCalload imposed OIl a server by all useES at a higber priority dian tbat being CODSideI:ed tcCalload imposed OIl a server by all useES at all priotiIies D1.IIIIber of asas ill a fiDiIe popaIaIiOD D1.IIIIber of i1mDs (asas) in a queue pIObIbility tbat 1he Iea&* of a queue is zero (ptobabitity tbat die queaiDg sysIem is idle) pmIIIbiJity dial 1he Ieqdl of a queue is n items (
375
376
fi
General Queuing Parameters
Appendix 1
W
third DIOIDeJlt of T average IIUIIIher of items (usas) waiting in line for service, excludiDg the item being
z
servic:ed service xatio for aD item (user) in a fiDite population
APPENDIX 2 Queuing Models
The following tables summarize the results of chapter 4 for each of the queuing models. Note that the full set of parameters is not available for all models. The set of parameters that are available is given. Definitions of parameters are given in Appendix 1. The Kendall classification for queuing models is summarized below for reference. Each table is titled by its Kendall classification. A queuing model is classified as AlBlclKJmIZ
where A is the anival disUibution of items into the queue: M-random. D -constant. U-uniform. G --geneIal. B is the service time distribution of the servers and can have the same values as
A. c is the number of servers. K is the maximum queue length.
m is the size of the population. Z is the queue discipline. A -Any.
Queuing Models
378
Appendix 2
FIFO-F'U'St-ln, FU'St-Out pp -Preemptive Priority. NP -Nonpreemptive Priority. If any of the last three characteristics are left out, the defaults are infinite queue length, infinite population, and FIFO queue discipline (/=/=IFIPO).
General
Q=W+L WT T.q =L-
(4-20) (4-21)
f= Tq+ T
Td=
(4-22)
ld?-
W=-
Q
(4-4)
l-L L = ....=rll - (l-k)L] l-L
(4-6)
kL
T.q=-T
(4-9)
l-L
Td
1 = -l-L [1 -
RT3
var(Td)
R'-P
-
= 3(l-L) + 4(1-L)2+ T2 k _lE(T~
-iTT" L2
w=r=r L Q=r=r
-
T2
(4-71) (4-16)
(4-88) (4-10)
LT T.q = -
(4-12)
T Td = (I-L)
(4-11)
l-L
L
var(Q) = (l-L)Z
T2
.'
(4-8)
(l-k)L]T
var(T,d
= (I-.L)Z
(4-82) (4-72)
Appendix 2
Queuing Models
379
Po= (I-L) Pn = L n(1-L) P(Q > n) = L n + 1
(4-79) (4-80) (4-83)
Note: MlGII model with k = 1
(4-4) (4-6)
(4-9) (4-8) (4-73)
1 L2
W=-21-L Q
=..!::-. l-L
(4-4)
(1 -~)2
liZ T.q=-T l-L
(4-15)
(l_f)T 2 T2 (1 L2) 3 - 12
Note: MIG/I D10CId with k
(4-13)
Td=_Il-L
(4-14)
var(T~ = (l-L'f
(4-74)
=1
kLT Tqp = (I-L)(1-LJJ
_
kLT
Tdp - (l-L)(I-LJJ
+ Tp (l-LJJ
(4-92a) (4-92b)
Queuing Models
380
Appendix 2
./GI11=1=INP T. qp -
/cL;I,
(4-91a)
(l-L)(I-LJJ
kL3'r
(4-91b)
Tdp = (I-L)(I-LJJ + Tp .,
./M/c/=/=/FIFO _
L(cLY
(4-97)
W - c!(1-L)2Po
Q=W+eL _ (cL)C Tq - c(c!)(l-L')'l p;I Td=Tq + T c-l = I (eL)"/n! + (eLYlc!(l-L)
Po-1
(4-98) (4-99) (4-100) (4-95)
11-0
PII
= poC,eL)"/n!, 1 :S n S = poL"~/C!, n C! c
c
P,.
(4-93) (4-94)
./G/C/=/=IRFO kL(eLY - c!(I-L'f Po Q-W+eL k(eLY Tq == c(c!)(I-L'f p;I Td-Tq+ T c-l Po-l = I (cL)"ln! + (cL"fIc!(I-L) W
(4-101) (4-102) (4-103) (4-104) (4-95)
11-0
p,. == poC..eL)"ln!, 1 :S n s c P. = poL1I~/c!, n iit: c
(4-93) (4-94)
II/II/c/=/=IPP T.
-
tIP -
(eL"f p;I . c(c!)(l-L'fO -LJJ Tp Tdp=Tqp +-I ....LII
(4-107a) (4-107b)
c-l
Po-1 =
I
(eL)"ln!
11-0
+ (eL"flc!(l-L)
.(4-107c)
Appendix 2
Queuing Models
381
(cL,)C Tqp = c(cl)(I-L,)(I-L)(l-LJJ PJ
+ Tp
Tdp = TlIP
c-l
(4-106a) (4-106b)
PD -1 = ~ (cL,)"/nl + (cL,)c/c!(I-L,)
(4-106c)
W= m - (z + 1)L
(4-114) (4-115)
n-O
M/M/1/m/m/FIFO
Q=m-zL T = W(Z + 1) T q m-W T4= Tq+ T
L=
(4-112)
z"'-n
(4-116) (4-113, 119)
!!!::!! = ~ (=-~)! z+1
n-l~ ZI ~
-,
j-o}-
z"'-n
(4-118)
(m-n)!
Pn= m . ~ ZI ~
j-O
-,
J-
(4-108) (4-110)
z=TJT
R=LIT
zL
P(user busy) = 1 - m
(4-117)
M/II/c/m/m/FIFO W = m-(z+ 1)l, =
m
2: (n-c)pn n-c+l
(4-114, 124)
Q=m-zL
(4-115)
T. = W(z+l)r
(4-112)
q m-W T4=Tq+ T L
=!!::!: z+1
Pn = (~)~P(I' 1 :s n:s c
(4-116) (4-113) (4-120)
,. Queuing Models
382
Pn =
n! (m)l
IJI-C I i P~, c." n z
c:S n :S m
Appendix 2
(4-121)
m
PD
= 1 - ,,-1 LPn
(4-123)
z=TJT
(4-108) (4-110)
R=LIT
P(user busy) = 1 _ zL m
(4-117)
APPENDIX 3 .Khintchine-Pollaczek . Equation for MlGn Queuing Systel11s The following derivatiOD of the Kbjnrcbine-Pollaczek equatioDS for the MIG/I queuing system is a summary of one given by Slaty [24]. It follows the simplified analysis intr0duced in chapter 4, and reference to Figure 4-1 is Suggested. We assume that a queue is observed in its steady stare. This implies that over any two given time periods with leuglhs 1bat allow statistically sigDificant averages to be observed, the mean and variance of the queue 1eDgdl will be the same. Should we observe the queue 1e.ngtb. at the iDstam after an item leaves tile server, we observe q items waiting in line, incJncting the next one to be seiviced. The service time for Ibis next item is t. We tbeIi observe the queue t secoads later and find the queue 1eDgdl to be q'. During Ibis time iD1e.rvaJ. t, r items mive. Thus, q' and q can be related as
follows:
q'=q-l+rifq>O
(A3-1)
q'=r
(A3-2)
ifq=O
That is, if the first item left q items belUnd, tbeD q' is q reduced by the leaving of the next item and iDc!eased by the arrival of r items.. .If the first item left DO items behind (q = 0), then q' is equal to the DUIDber of newlymived items, r. Note intuitively that r is indicative of the load on the server. If r = I item mives during each ~ce timet, the load on the server will be 1. . 383
Khintchine-Pollaczek Equation for M/G/' Queuing Systems
384
Appendix 3
Equations A3-1 and A3-2 can be combined as follows:
=q -
q'
(A3-3) ,
1+ r +j
where j j
= 0 if q > 0 = 1 if q = 0
(A3-4) (A3-S)
Taking the expected values of the variables in equation A3-3, we have E(q') = E(q) - 1 + E(r)
+ Ev1
Since the system is in equilibrium, E(q') = E(q); and
(A3-6) ~ore,
from equation
(A3-6) EV) = 1- E(r)
(A3-7)
Let us DOW assume that arrivals to the queue are random, that is, they are genemted by a Poisson process and arrive at an average J:ate of R items per second. From chapter 4, equations 4-60 and 4-61, we know that the mean r and second moment r- of r items miving randomly in a period of t seconds are
r =Rt
(A3-8)
;2= (Rt'f + Rt
(A3-9)
Averaging r over time t, we have
E(T)
=r =E(Rt) = R1 =R:I' = L
(A3-10)
where we use T to denote the expected value of t and L to tepreSeDt the server load, RT. T is the average service time of the ,server. Using equation 4-31, we also can avemge over time:
r
r
= E(Rt'f + E(Rt) , = E[R2var(t) + £2(Rt)] + E(Rt)
E(P)=
or r=~t)+L2+L
(A3-11)
Let us DOW square equation A3-3. This gives
q'2=q2_2q+2qr+2qj+ 1 - 21'- 2j+,:z + 2rj+ f
(A3-12)
We note the following c:oncmring j:
f
=j
q(1-}) = q E{J) = 1 - L
from equaD.oas A3-4 and A3-S fiom equaD.oas A3-4 and A3-S from.equations A3-7 and A3-10
We also note that r is independent of q. Also, r is independent of j, ~ j is a function
Appendix 3
Khintchine-Pollaczek Equation for M/G/1 Queuing Systems
385
only of q. Therefore, whenever we take the expected value of rq or rj, the expe&Jed value of the" product is the product of the expected values. Taking the expected values of the terms in equation A3-12 and applying the above observations, one obtains E(q'2) = E(rj-) - 2E(q) + 2E(q)E(r) + 1 - 2E(r) - 2(1-L) + E(r) + 2(l-L)E(r) + (I-L)
Since the queue is in equilibrium, E(q'2) = E(rj-). EJiminatillg these terms and substituting the values for E(r) and E(r) from equations A3-10 and A3-11 gives
o = -2E(q)(1 -
L)
+ 2L -
L2 + R2var(t)
Solving for E(q), E( ) = 2L - L2 + R2var(t) q 2(1- L)
Denoting the expected value of E(q) by Q, and noting that R = LIT, this can be
rewritten as
:L[ L+ ~( + "t»)]
Q= 1
1-
1
(A3-13)
We now define the distribution coefficient, k, as
k=~( 1 + ~t»)
(A3-14)
Equation .A3-13 then can ie expressed as Q
L = l-L[1(l-k)L]
(A3-1S)
Equation .A3-1S is the same as the expression for Q given by equation 4-6, which was derived by a less rigorous but intuitive approach. Equation A3-14 is that reponed as" equation 4-16. The relations for W, Til' and Tdnow can be detemrined from the geneml expressions given by equations 4-S, 4-3, and 4-7, respectively. Note that these equations weze derived only under the· following assumptions:
more
• Arrivals to the queue are Poisson-distrib. • The queue is in equilibrium. • Service time is independent of mival time or any other cbaracteristic: of the times being serviced. Therefore, these equations apply for any distribution of service times and for any servicing order of the queue (so long as an item is not selected for service based on one of its characteristics, such as its service time). Thus, the solution is geneml for the MlGlll=l=/A case of queuiDg systems.
APPENDIX 4 The Poisson Distribution
In chapter 4 we began die derivation ofdle Poisson distributioD. It was determined that the probability of n items aniviDg in a time t. p,.(t), was given by the following system of differential-difference equations: pOet)= -rpo(t)
(4-57)
(4-58) The fonowing solution to this set of equations is a summary of that solution found in Slaty [24]. Let us define a gcmearing function p(z,t) such tbat GO
P(z,t)
=,,-0 Lz"p,.(t)
(A4-I)
If we should diffaentiate tbis equation n times with Iespect to z. we have a"P(z,t)
,
az" . nlp,,(t)
(n+ I)!
. (n+2)!..2.. 21 2.J',,+'it)
+ ---rr-ZP,,+l(t) +
+ ...
Setting z to zero. we obtain .
386
a"~~,t) _
n!p,.(t). z = 0
(A4-2)
Appendix 4
The Poisson Distribution
387
Thus, by differentiating the generating function P(Z,t) n times with respect to :..dividing the iCsUIt by n!, and setting z=O, we obtain PII(t). Let us now consider a time t as discussed in chapter 4 and assume that i items have arrived in the queue up to time t. That is, by the definition of PII(t), p,{O) = 1 PII(O)= 0 for n*i
Thus, from equation A4-1, for t=O, P(z,O)
=tpl..O) = t
Also, if Z is set to 1, from equation A4-1, P(l,t) =
(A4-3)
.
L p,.(t) = 1 11...0
(A4-4)
Now let us multiply the differential-difference equations 4-57 and 4-58 by zII, obtaining zOpo(t)= -rzO[Jo(t) z"p~(t)= -rr'pll(t)
If we 'SUID these over all n, we obtain
+ rz"pll-l(t)
.
.
.
11-0
.=0
.-1
~z"p~(t) = -r ~z"pll(t) + r ~z"pll-l(t)
(A4-S)
The left-band term of this expression is simply ap~. t). The first term on the right is -rP(z,t). The second term on the right is
rzpo(t) + 7Tpt(t)
+ rz"IJ2{t) + .. .
= rz[po(t) + ZPl(t) + z7".(t) + .. . = rzP(z,t). Thus, equation ·A4-S can be written as the linear differential equation
a~~,t)_ r(z-l)p(z,t)
(A4-6)
The solution to this is p(z,t)
= Cerlrl)l
(A4-7)
which can be verified by substitnting p(z,t) fromequati.on A4-7 into both sides of equaIion A4-6. The value of C is dependent upon how many items, i, are xec:eived by time t=0. Let us assume that at t=0, zero items have been received in the queue (i=O). In this way, p,.(t) will tJ:uly be the probability of receiving n items in the subsequent interval t. From equaD.0il A4-3, setting i=O, P(z,O) = t
=1
The Poisson Distribution
388
Appendix 4
.Thus, C= 1 in equation A4-7 and P(z,t)
= er<=-l)t
(A4-8)
As we pointed out earlier with reference to equation A4-2, p,,(t) is derived from 'P(z,t) by differentiating P(z.t) 12 miles with respect to z, dividing by 12!, and .i-tO" zero. Performing these operatiODS on equation A4-8 yields
settiDg
p,,(t) = (rtfe- rt
n!
(A4-9)
This is the solution for the Poisson distnDution referenced as equation 4-59_
APPENDIX 5 Mirrored Writes
A. DUAL LATENCY TIllES CoDsider a disk transfer xequest arriving at two independent disks simultaneously. Assume that both disks mitiare the processing of this transaction simultaneously, that both heads are mitially at the same track, and that both heads mive simultaneously at the new track. At Ibis point. the rotatioDal position of the two disks is nudom. What is the avemge time that it will take'for the first disk to make the transfer? What is the avemge time for both disks to wait their rotatiODallatency time in order to complete the disk 1raDsfer? (A single disk, of course. zequjms a half rocaDOD on the average.) Ut the disk wbich bas the smallest disIanc:e to go be designated disk B. and it must await a fmctional mtation ofB. where 0<8<1. Let the other disk be disk A; it must await
a fractiODal rotation of A. whe!:e B
388
Mirrored Writes
390
Appendix 5
For the case of the leading disk: p(A) =
dA
p(BIA)
=}m
pCB) =}mdA E(B) = JBP(B) =
il fA
BA=:.dBdA
A.oJB...O
I:.J ~ ]:"'0dA
=
-i
~ [A2]1 = 4' l
- A-02
A-O
E(B)
=!4
For the case of the lagging disk: p(JJ) = dB 1
p(A1B)
= (I-B)dA 1
p(A)=~
E~)=f~A)=£_J~~~ --
LI
=-1 2
=
[A2]1
1 -B-o(I-B) 2
LI
(1
B-O
dB
A-B
+B)dB
![B + B'-]I 2
2
B-O
E~)=~4 Thus, on the average, dual disks that seek in syncbronism will leqUire 3/4 of a rotation for each to find the sector.
Sec. B
Single Disk Seek Time
391
B. SINGLE DISK SEEK TIllE Consider a disk transfer request miviDg at a disk for a sector whose track is randomly positioned relative to the track at which the head is currently positioned. What is the average distance the head must move? Let the total seek distance be normalized to 1 and the total seek path be measured from 0 to 1. The current head position is at C, where O
=
E(x) is the expected (average) value of x.
Case 1: Rnal PoaitiOll Prior to C p(case 1) = C p(C) = dC
P(SIC)=~ p(S) = i;tsde
Case 2: Final PosItiOll Beyond C p(case 2} = 1 - C P(C)
=de 1
p(SIC)=~
P(S)=~de The avenge value of S is 2
E(5)= ~p(Case n}Sp(S) .11-1
f ~ + (1 f1-CS(1-C)dSdC Jc-oJs-o C lc-01s-o (1-C)
E(5)= (1
Mirrored Writes
392
Appendix 5
f:cJ ~ ];aodC + f:.o[ ~ J:::dC
=
=Llc=o [Cl_C+!]dC=[C _Cl +£]1 2 3 2 2 3
C=O
Thus,
OD
=!3
the average, the disk head must seek a distance of 1/3 of the total head
span.
c. DUAL DISK SEEK nME Consider a disk request arriving at two independent disks simultaneously. Each disk hrnneAiately executes a seek to the appropriate sector. However, we assume that ~ . cummt position of the head on one disk is random compared to the head's position OD the other disk. This is caused, for instanc:e, by one disk peri'orming a mad that the other disk did Dot pexform. In Appendix SB, it was shown that the average seek for a single disk acc:essing data randomly was 1/3 of the tracks. In this appendix we consider the average seek time of the two disks together. Let the disk which bas the farthest seek distance be designated disk A and the other disk disk B. The maximum seek distance is normaJized to 1. Disk B must seek a distmc:e of B, where O
= p(xly)p(y).
= expecred (average) value of x.
Cae 1: " , . . A and B Positioned Prior to X p(case 1) =](l p(){)
= dX
1 p(B1X) =jdB p(B)
1 = jdBdX
p(A1B) =
x:jfA 1
p(A) = X(X.-B)dAJJBdX
Sec. C
Dual Disk Seek Time
~.~:
Disks A and B Positioned After X p(case 2)
393
= (1-X)2
P(X) = dX I p(BIX)=-dB I-X
I P(B) =-dBdX I-X I p(AlB) = I-X-BdA I P(A) = (1-X)(I-X-B)dAdBdX
Case 3: Disk A Prior to X, Disk B After X p(case 3) = X(I-X) p(X) = dX
I p(BIX) = I-X dB
P(B) =
I r=x dBdX
I . p(A1B) = X-BdA
p
(A) -
"7
I dMBdX (I-X)(X-B)
for B
< X, 0 otherwise
for B < X, 0 otherwise for B <X, o otherwise for B
< X, 0 otherwise .
Case 4: Disk A After X, DiaIc B Prior To X p(case 4) = X(l-X) p(X)
= dX·
I p(BIX) =jdB
for B < I-X, 0 otherwise
=jI dBdX
for B < I-X, 0 otherwise
p(IJ)
1
P(AIB) = (I-X-B) dA
for B
<
I-X, 0 otherwise
Mirrored Writes
394
P(A) = X(l-~-B) dAdBdX
Appendix 5
for B < I-X, 0 othenV"Cse
The conditions on B for cases 3 and 4 result from the facts that B must be less thanA and that A is limited to X for case 3 and to I-X for case 4. It is useful to note that 4
~p(case n) =](2 + (I-X)2 + 2X(l-X) = 1
,,"'1
The average value of A is the average dual disk access time and is 4
E(A)
= ,,-1 ~Ap(A for case n)p(case n)
where an integration is taken over all possible values ofA, B, andX. Note that cases 3 and 4 react diffenmtly c:lependjng upon whether X < 112 or X > 112. In these cases:
ifX< 1I2,thenO 112, then 0 < B < I-X The resulting expression for E(A) is then
E(A)
='!JJ:.J:.
(case 1)
BXi:B) dAdB
r1- X [1-X
A(l-Xl)
+ JB-oJA ..B(l-X)(l-X-B) dAdB
+. 1,x-a
112 [[X
]' dX
rx
AX(I-X) JB-oJA-s(l-X)(.X-B) dAdB
JI-X
(X AX(l-X) ] + JB- A_BX(l-X-B) dAdB dX (I
[
fl-XiX
AX(I-X)
+ lX-II2 JB-O A-B (l-X)(X-B) dAdB
'+ LB-O lA_BX(I-X-B) r AX(l-X) dAdBJdX 1 X -
1 X -
Cases 1, 3A, and 38 RCIuce 10 the form
11 fX LLx:B[~]:_BdBdX ~ dAdBdX
x BJA-BX-3
=
=~fxLX(X+B)dBdX
(case 2) (case 3A) (case 4A) (case 3B) (case4B)
Sec. C
Dual Disk Seek Time
=
395
11 [XlB + XB2] dX 2x
2
B
For cases 1 and 3A, B ranges from 0 to X:
=
~L(Xl + ;)dX = ~LX3dX = [1~]X
For case 1, X ranges from 0 to 1 3
(case 1)
=16 For case 3A, X ranges from 0 to 112 3 = 256
(case 3A)
For case 3B, B ranges from 0 to I-X and X ranges from 112 to 1:
ILl [XlB + XB2]1-X _. dX = -ILl 2 4
-
2
= ...
X'-'ll2
B-O
(X-Xl)dX
X-JI2
l[; -::]:.112
(case 3B)
= 9/256
.....
Cases 2, 4A, and 4B reduce to the form
L
f r fl-x A(l-X)
Jx, ,JA-B(1-X-B) dAdBdX
fL .
(l-X)
[A2]1-X
= J", B(l-X-B) '2
A-B dBtlX.
=!f f (l-X)(I-X+B)dBdX iJXJB = !f1 [(I-X)2 + (l-X)B] dBdX iJ"'B ,
=!L [(I-X)2B + (1-~] 4X 2x
2
B
For cases 2 and 4B, B IaDgeS from 0 to I-X:
.
=.H (l-X)3dX =~[X - 3,X2 + Xl- r] 4J 4 2 4 x x
For case 2, X ranges froQ1 0 to 1: (case 2)
Mirrored Writes
396
Appendix 5
For case 4B, X ranges from 112 to 1:
= 3/256
(case 4B)
For case 4A, B ranges from 0 to X; and X ranges from 0 to 112:
!{42 [(1-X)2B + (1_xf!:.]X dX 2Jx=o
2
!ill2 =if::o[X- ~+~]dX
B=O
=2 x-o [XO-X)2 + (1-x!:.]tlX 2 (case 4A)
1[X2 1 1 ]112 =2 2_¥3+~ x-o = 9/256 Thus, average disk access time for dual disks is 3
E(A)
3
3
9
9
3
IS
= 16 + 16 + 2S6 + 2S6 + 256 + 2?6 = 32
E(A) = .469 of a UDit seek (versus .333 for a single disk-see Appendix 5B).
Thus, on the average, the seek time of dual disks is .469/.333 = 1.4 that of a single disk. For disks with an average access time of 35 msec. and with an average latency time of 8 m5eC., the average seek time is 27 IDSeC. From Appendix SA, it is seen that the effective latency time ofsuch a dual disk is inaasecl by 4 1DSeC•• whereas from the above it is seen that the effective seek time of a dual disk is ina:eased by about II msec. over that of a single disk. Thus. by the time the last disk bas finished its seek, !be fust most probably has finished its latency (to a fust degtee of approximation). Thelefore. the last disk need. only wait a single disk latency time. on the avaage. As a ~ 40% of the average seek time for a single disk should be added to the average access time for a single disk to obtain the avemge aa:esstime for a mirrored disk. This discussion bas CODSideIed the case in wbich writes to boIh disks of a minOled pair are execatecl simllltaDeOusly. In many fault-tolerant systems, these writes are done ODe at a time to ensure that at least ODe disk always bas a good. copy of the file (i.e.• a power spike could cause write e.nors on both disks ifboth were active siDm1taneously). In tbis case, the time to write to both sides of the miIrored pail-is simply twice a single write time.
APPENDIX 6 Process Dispatch Time
Process dispo.tch time is defined as the time a process must wait in the schednJiDg queue, or "ready list," for the processor. The effective service time for a process is the sum of its dispatch time plus its processing time. Thus, a tr:aDsactiOD selected for service by a pr0cess will experieoce aD effective service time that increases with processor load. Dispatch time throughout the text of this book has been approximated by using the , MIMIl model for a siDgle processor system and the MIMIc model for a multiprocessor system. This is a reasooable approach if the DUIDber of processes makiDg demaDds OD the processor system is much greater than the expecred 1eDgth of the schedllJing queue and if DO process repJeSeDtS a sigDfficaDt portion of the processor load. If the number of pr0cesses is small, but all processes lIe nearly ideDtic:al iD terms of processing activity, then the MIMIC/mim model can be used. However, this model can be shown to produce optimistic results (as will be argued Jater). ' In this appeudix, it is shown that the use of the iDfiDite population models is c0nservative in that the processor load of the process being CODSide!ed is COUDted twice. More iCcmate models for dispatch time lIe derived, but these require iterative calcula1iOD. h is concluded that a reasooable appIQximarion giviDg iDc::reased dispatch time acc:uracy is simply not' to include the processor load of the processforwbicbdispatch time is bemg calc:ulated in the calcuJaIiOD of the schednliDg queue leagtb.
A. INFINITE POPULATION APPROXItmlTlON ERROR Let us consider a siDgle processor system serviDg m identical processes, and let us further use the simple MIMIl model to represent the processor schednliog queue. We assume that
397
Process Dispatch Time
398
Appendix 6
.........there are many more processes than the anticipated length of the scheduliDg queUe, so the assumption of an infinite population feeding the scheduling queue is reasonable. We consider the case of a very simple process that services a transaction by using T . seconds of processor time. When one or more transactions are in a process's queue, that process enters the scheduling queue, waits its dispatch time (its waiting time in the scheduling queue), and th~ processes the transaction at the head of its queue. It then exits the processor and reenters the schedJ.iling queue if there is another transaction in its queue. In this case, the processor is busy if at least one tranSaction exists in anyone or more of the process queues. The only effect of the process structure from a performance viewpoint is to reorder the transaetions. They will not be served strictly in the order in which they enter the processIprocessor system but instead in some order determined by what is effectively the round-robin servicing of busy processes. Since servicing order is not important in queues of this sort, a transaction should see a delay (response) time of T,. = T/(l-L), where L is the load on the processor. Let us derive the process's response time by using the simple view that the scheduling queue and the process queues are all MIMIl queues. Let R = average arrival rate of transactions to the system.
m = number of processes. T = average processor service time•
. L = processor load = RT. ttl
= process dispatch time.
Ttl= processor delay time
= process service time.
T,.=transadion respoase time. Then the time which a process must wait in the scheduling queue (ItS dispatch time) isLTI(l-L). The service time for a process is its dispatch time pIns its processing time, or LTI(l-L)+T TI(l-L). Since each process is bandJing ~transacdon rate of Rim, it is busy (Rim)TI(l-L) of . the time (its load). Therefore. its EeSpOnse time to a transaction entering its queue is
=
.
.
.
T,
.
,.
==
T/(l-L)
1 - (Rim) TI(l-L)
-
(A6-1)
T
1 - (L + U"})
.
Since T,. should be T/(l-L) as argued above. it is seen that the approximate IeSpODSe time is in em»" due to. the effectiveload in the denoiniDatOr of equation A6-1 being inaeased by Um. the load of one process. In effect, the load hnposed on the processor by the process under consideIalion has been counted twice. once in the term L and once by the term lim. Let us redo this calculation by not including the load of the process under c0nsideration when calculating dispatch time. Then the effective load on the processor is
.
.
,!
Sec. B
Dispatching Model
399
(1 :-Um), and the process dispatcb time; rd, is
(L-lim)T
td --==~~~ - l-(L-lim)
(A6-2)
The process service time, Td , is T
Td = rd + T = l-(L-Lim)
(A6-3)
The transaction response time is T. ,.
=1-
Td
(Rim)
Td
(A6-4)
Substituting equation A6-3 into A6-4 and simplifying gives T,.
T
= l-L
(A6-S)
as expected. In effect, we have considered the scheduling queue from the viewpoint of a particular process. That process sees the queue occupied by other processes, wbich are imposing a load (L-Um) on the processor. The average delay (td+7) experienced by the pr0cess in passing tbrougb the processor is less than that predicted by our simple MIMIl approximation, as the load on the processor is taken as less than the full load so far as scheduling queue length'is concer:ned (see equation A6-3). The extra delay is made up by the transaction's delay in the process's queue. The insight from this example is c:anied through in the next sections to calculate . more ac:curately the process dispatch times when the number of processes may be small and their prOcessing asymmetric:. Both single processor and mu1tiproc:essor systemS are c:onside!ed. .
B. DISPATCHING .ODEL In many cases. a system is comprisecl of a fiDite number of single-tbreaded processes c:ompeting for common processing resources (one or more processors). Each process receives transactions from what is effectively an infinite population. This situadon is depicted in FIgUre A6-l. . As shown in this figure. there are m processes. labeled PI through Pili' being served by c processors. Each receives transactions from an infinite popula1ion of users. The itb process, Pi. receives. transactions at a rate of Ri transactions per sec:ond. Whenever a proc:ess bas" one'or transactions in its queue, it enters the scheduling queue to await the availability of a processor for pmposes of worlcing OIl the trails:;' action at the ~ of its queue. Upon reaching the head of the scheduling queue, the : .' .. . . . . . .. :
more
Process Dispatch Time
400
PROCESSES
Appendix 6
PROCESSORS
'proCess will be assigned to the next processor that becomes available. When the process' bas finished with the processor. it waits for its next transaction and then reenters the scheduling queue for further processing. The amount of time that a process must wait in the scheduling queue is called its dispatch time. In the text of 1bis book. dispatch time has been calculated by assuming that theIe axe a large number of processes and that the schedu1ing queue, theIefore, has the properties of an MIMIc queue. A IIlOIe accurate solution to this problem can be attained by'CODSidering the plight of
a single process Pi competing with the other processes for processor time. Let us define the average dispatch time for process Pi as tt/i and the avenge delay time tbrough the processor system for process Pi as Tt/i. Tt/i is the time that process Pi must wait in the scheduling queue plus its processor service time Ti: ~=~+~
~~
wheIe:
Tt/i = avenge time spent by process Pi in the processor system on each entry. tt/i
= average process disparch time for process Pi.
Ti = average service time for process Pi. The load. Lp, on process Pi (that portion of time tbat"it is busy awaiting service or is being seJ:Viced by the processor) is (A6-7)
Sec. B
Dispatching Model
401
also can be interpreted as the probability that process i will be in the system. The average number of processes in the processor system (scheduling queue plus ptOcesSOIS) when process Pi enters the scheduling queue is the sum of the process loads of all the other processes. This excludes its own load since it caonot already be in the processor system. The average Dumber of items in a system generally has been designated by Q:
Lpi
(A6-8)
The Dumber of processes Qi seen by process Pi when it enters the system is then Qi = Q - Lpi
(A6-9)
where: Q = average Dumber of processes in the system.
Qi
= average number of processes in the system when process Pi arrives.
Lpj
= portion of time process P j is in the system.
Ri
= arrival rate of transactions to process Pi.
Let us also define:
11 = average processor service time for all processes except process Pi. , };RjIj Ti = ~R. i.i
6-10)
(.Ai
'J
We also define L; as the processor load imposed by process Pi andL; as the processor load Unposed by all processes except for process Pi: (A6-11) The waiting tine 1eagtb. is the number of processes in the system, excluding those being served. From equation 4-20, the average waiting tine leDg1h, Wi, seen by process Pi
is
Wi= Qi-I.;
(A6-12)
From equation 4-21, the average waiting time for a process entering the system before it is assigned to a processor is (A6-13) where: tdi
= dispatch time for process Pi (i.e., the amount of time it must wait before it gets a processor).
Pro.cess Dispatch Time
402
Appendix 6
Wi = waiting line length seen by process Pi. Li = processor load imposed by process Pi.
Li = processor load imposed by all processes except for process Pi' Note that the dispatch time for each process is different. It is affected by the load (the percent of the time busy) of all other processes but not by its own load. A seldomly used process will be more affected by busy processes than a busy process will be affected by seldomly used processes. Equation A6-13 represents the general solution for process dispatch time. Note tbat it must be solved interatively, since the dispatch time for proCess Pi (equation A6-13) depends on the dispatch times of all other processes Pj (see equations A6-9, A6-8, A6-7, A6-6), which in tum depend upon the dispatch time for process Pj' It will be shown later, however, tbat equation A6-13 reduces to a closed foxm if all processes are identical. The preceding model is a general model for the case of an open system with a finite population of heterogeneous users. As such, it has application to many other cases which must often be modeled. Examples include: • terminals waiting for a common communications line and • disk units waiting for a common disk controller.
Before proc«ding. we clarify a point made earlier. In the introduction to this appendix, it was stated that the use of the finite population closed system model (the MlMJclmlm model) provided optimistic IeSUlts if the system were, in fact, an open system. The reason for this can be UIlderstood as follows. In the MlM/clmlm model, a randomly distributed tbinlc time is assumed. Therefore, the probability of a think time of
zero seconds is exactly zero. However, in an open finite system, queues build at the ~ (see F1gUIe A6-1). If the length of a user queue is nonzero, then the user will immediately reenter the system, giving an effective tbinlc time of zero. The probability of a tbink time of zero seconds is DODZeI'O in an opeD system. The resUlt of this is tbatthe mival of users from a finite population to an open system will tend to occur in batches. During mstanc:es of system activity, users will tend to reenter1be system ilDD1C'!Ctiately. This distorts the probability distribution of queue sizes towards larger queues, thus resulting in larger average queue sizes and larger average delays for an open system than 1hose for a closec1 system. As pointed out in chapter 4, a ciosecl1inite population system experiences graceful degmdation as busy users axe mnoved from the population of users eligible to enter the queue. There is no such effect in an open system.
c.
SINGLE PROCESSOR SYSTEII We DOW constrain the model to a set of processes with identical service time distributions represented by a common distribution coefficient, k. For the single processor case, the
Sec. C
Single Processor System
403
~~ility that a process is being serviced when process Pi arrives in the ~ is Li. Tht!f' the dispatch time for process Pi is the time remaining for the process cunently being
serviced, kLm, plus the wait time for the· processes awaiting service, WiT;:
= WiT; + kL;T;
tdi
Using equation A6-12, (A6-14) where k
= Khintchine-Pollaczek distribution coefficient.
Process delay time is (A6-15)
A very important case is the simplest case of homogeneous processes with random. service tUnes using a single CPU. Each process entering the scheduling queue requires an average of T seconds of processing time. If there are m processes, and if the total system transaction rate is R transactions per second, then k = 1
Ri = Rim Ti=T;=T L;=Um We also define for convenience
W; =W' L; =L' tdi
= td
The prime (').is used to deDote the system as seen by an amviDg process. From equations (A6-8) and (A6-9), Q'
.'
= (m-l) -mR Td
(A6-16)
wheIe we have used (A6-7): (A6-17) Noting that
L=RT
(A6-18)
Process Dispatch Time
404
Appendix 6
and
L m-l L'=L--=-L
m
m
(A6-19)
where L is the total processor load on the system, the dispatch time is, from equations (A6-14) and (A6-16), (A6-20)
,
From equations (A6-1S),
or (A6-21)
Note that equation (A6-21) is identical to equation (A6-2). The aIgUJDeD.t following equation (A6-2) holds here as well, i.e., the transaction response time will be TI(l-L). From equations (A6-15), (A6-20), (A6-8). and (A6-13). otberrelationsbips include (A6-22)
L'
Q' = t,/T= l-L'
L
Q =RTd= l-L'
L't.
L'2
W' =--s.=_ T l-L'
(A6-23) (A6-24) (A6-2S)
Also, the probability that a specific process will be found in the system, £p, is [from equations (A6-17) and (A6-22)]
L" = The
L=1.
lim l-L'
.. (A6-26)
maximum queue length that can be seen by an amving process occurs when
~
equations (A6-23) and (A6-19), tbequeue1engtb. forL=1 is m-l, as would
be expected.
. . Note that this system of equations for an open system with a finite, homogeneous popula!ion of users is equivalent to the MIMII model for an infinite population, except that the processor load is taken as that load exclusive of the process being consideled. The distribution of queue lengths seen by an amving process is discussed in the next section for the case of multiple servers. These tee:luce to the single server caSe by letting the number of servers, c, be one.
Sec. 0
Multiprocessor
405
A.ULn~OCESSORSY~.
The parameter Qi given by equation (A6-9) represents the total number of transactions in the processor system (the scheduling queue plus the one or more processors) when process Pi arrives. If there is more than one processor, it is the length Wi of the scheduling queue that is important. Given c processors, a process will experience no process dispatch delay if there are less than c processes in the processor system. If there are c or more processes, the newly arrived process will have to wait for a processor. The Dumber of processes already in the processor system exceeding c is called Wi. Wi is the average length of the waiting line for processors seen by process Pi, exclusive of those processes cumntly being serviced. When process Pi arrives at the processor system, it will find n processes already in the system with probability p!..n). Wi is the sum of these probabilities for n>c, weighted by the length of the waiting line (n-c): . _1
Wi =
L
(n-c)p,(n)
(A6-27)
Note that the maximum length of the waiting line seen by process Pi is (m-I)-c. The dispatch time for process Pi is, from equation (A6-13), (A6-28)
Process delay time Tdi is given by equation (A6-6) as tdi + Ti • We proceed by determining the probabilities pen) that there are n processes in the system. From this distribution, we calculate Q and then Q" which leads to the dispatch time tdi.
We first introduce the following notation: y(n) = c:'{xj(l-.xj:}}, j¢k
(A6-29)
This notation is used in the following expressions to imply the sum of the products of all combinations of m diverse itemsxJ taken n at a time, with the remaining Xi items fomled (l-xt>. If the Xi are homogeneous with all Xi=X, then this expression becomes .
as .
y(n) = ( _m;,( ),X'(l-xyn-71 m n.n.
(A6-30)
In the following analysis. we assume that service times are random. For n
processes
L = ~jlj
(A6-3 1)
be the total load on the processors imposed by all processes. Then the average ~ imposed on each processor is lie. For n processes in the system and n
Process Dispatch Time
406
Appendix 6
then
L)A( 1 - cL)C-A' n
p(n)
(A6-32)
where p(n)
= probability that n processes are in the system, and c= A
c! (n-c)!n!
.
(A6-33)
The probability that all c processors will be busy is .
}:p(n) = Pc = (L
'f
C)
A-C
(A6-34)
where
Pc = probability that all processors are busy. There are C': combinations of c processes in the processors when all processors are busy. The probability of any particular set of processes being processed is, on the aver-
age,PdC:. When n~c, an arriving processj will have to enter the processor queue and wait for a time td}. The probability that ~ j is in the system is LpJ, wbere .
LpJ = Rpj + td})
(A6-3S)
Thus, for n~c, the number of processes in the system will include c processes being serviced, n-c processes waiting for service, and m-n processes i
From equation (A6-34), then, p(n) = ~C':{c:r:~{L,,;
(A6-36)
This distribution of queue lengtbs can be avaagecl to determine the average number Q, in the system:
of~,
m
Q = ~np(n) ,,"'1
Noting that . Qi
=Q -
L,n = Q-R;(r.+T;),
equation (A6-28) can be used to solve for the dispatch time for each process. L,n ~s
Sec. 0
Multiprocessor
407
on the probabilities Lpj that each of the other processes is in the system. Likewise, the Lpj Therefore, these equations must be solved iteratively.' However, p(n) can be evaluated specifically for the homogeneous case. We first note that
aepends on Lp;.
m
2: c,:'x"(l-x)m-n = I
(A6-37)
n·O
by exhaustive enumeration as follows. Since a
(I-x)"
,
= po L(a ~k)'k,(-l)k.xk ..
(A6-38)
then C!'x"(I-x)m-n n
-
m-n
, ( )' 2: m. m-n . (-l)kx"+k k=O (m-n)!n! (m-n-k)!k!
m-n
,
-- k=O 2: (m-n-k)!n!k! m. (-l)kx"+k enumerating for a few n, we have: m!
m!
1
m!
1
m!
_~. 1 r+-
m!
1
m!
!
m!
n = 0 m! - (m-l)!x + 2 (m-2)!x2 -
n=l
m! (m-l)!
.,....;.;..~.x-
m!
(m-2)!
1 m!
n=2
6 (m-3)!xl + 2 (m-3)!
~
r-
2 (m-2)x2 - 2 (m-2)!xl + ...
n=3
6 (m-3)!
xl-
All terms but the first cancel, ~ equation (A6-27). Also, m
2: ~l-xyn-n = mx
(A6-39)
11-0
as can be s=l by a similar exercise (simply mul1:iply each row in the above enumeration by n). Retunrlng to the homogeneous solution, equation (A6-36) becomes p(n) = pcC'::~-t: (l-Lpyn-n, ni!!:c (A6-:40) . . where we have used Lp = ~ for the probability that a process will be in the system (as
defined earlier).
Process Dispatch Time
408
Appendix 6
From equations (A6-32), (A6-34), and (A6-37), we see that m
~p(n)
n-O
c-l
c-l
m
= LP(n) + Lp(n) = LP(n) + Pc O ..-c n-O n ..
c (L)n = n-O L~ -C ( 1--L)c-n =1 C as must be the case. Also, from equations (A6-32) and (A6-39), the average number of processors that are busy is -
c-l
c
..-0
n.O
(L)n (
~ np(n) + cpc = ~ n~ -
C
L)c-n
1- -
C
=L
as would be expected. According to equation (A6-34), the sum of the p(n) given by equation (A6-40) over the range c to m should yield Pc. That this is true is demonstrated as follows: m
m
n=c
n-c
LP(n) = PcLc:":~;-C(l-Lpr-n Letting q=n-c, this is rewritten as m
m-c
.. -c
q-O
Lp(n) = PeL cq-CL~ (l-Lpr-c- q From equation (A~37), this ~ m
Lp(n) = Pc The average waiting line length W is m
The average waiting-line length W is

$$W = \sum_{n=c}^{m} (n-c)\,p(n) = P_c \sum_{n=c}^{m} (n-c)\,C_{n-c}^{m-c}\, Lp^{\,n-c}\,(1-Lp)^{m-n} \qquad \text{(A6-41)}$$

Substituting q = n−c in a manner similar to the argument given above, and using equations (A6-37) and (A6-39), we have

$$W = (m-c)\,Lp\,P_c \qquad \text{(A6-42)}$$
The average number of processes in the system is the number of processes waiting for service, W, plus the number of processes currently being serviced. On the average, the processors will be busy L of the time, which is also the average number of processes being serviced. Thus [see also equation (4-20)],

$$Q = W + L \qquad \text{(A6-43)}$$

The average number of processes in the system, exclusive of the process of interest, is then

$$Q' = Q - Lp \qquad \text{(A6-44)}$$

and its dispatch time $t_d$ then follows from equation (A6-13).
For the homogeneous case, Q = mLp, so that

$$Q' = (m-1)\,Lp = \frac{m-1}{m}\,Q$$

Thus,

$$Q' = \frac{m-1}{m}\left[\left(\frac{L}{c}\right)^{c}(m-c)\,Lp + L\right] \qquad \text{(A6-45)}$$

For the single-processor case (c = 1),

$$Q' = \frac{m-1}{m}\,L\,\bigl[(m-1)\,Lp + 1\bigr] \qquad \text{(A6-46)}$$
The probability that a process will be in the system in the single-server homogeneous case is, from equation (A6-26),

$$Lp = \frac{L'/(m-1)}{1-L'}$$

Remembering that $L' = \frac{m-1}{m}\,L$, equation (A6-46) then reduces to

$$Q' = \frac{L'}{1-L'}$$

which agrees with equation (A6-23) for the single-processor case.
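This reduction can likewise be verified numerically. The sketch below is again not from the text; m and L are illustrative. It computes W, Q, and Q' from equations (A6-42) through (A6-45) and confirms that, for c = 1, Q' indeed equals L'/(1 − L').

    def queue_statistics(m, c, L, Lp):
        """W, Q, and Q' for the homogeneous case, per equations
        (A6-42), (A6-43), and (A6-45)."""
        Pc = (L / c) ** c
        W = Pc * (m - c) * Lp        # average waiting-line length (A6-42)
        Q = W + L                    # average number in system (A6-43)
        Qp = (m - 1) / m * Q         # excluding the process of interest (A6-45)
        return W, Q, Qp

    m, L = 10, 0.5                   # illustrative single-processor example
    Lprime = (m - 1) / m * L         # load excluding the process of interest
    Lp = (Lprime / (m - 1)) / (1 - Lprime)        # equation (A6-26)
    W, Q, Qp = queue_statistics(m, 1, L, Lp)
    print(round(Qp, 6), round(Lprime / (1 - Lprime), 6))   # both print 0.818182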
E. AN APPROXIMATE SOLUTION

The exact solutions to the dispatch time for process $P_i$ have been derived by calculating the number of transactions $Q_i$ and/or the length of the scheduling queue $W_i$, exclusive of the load imposed on the processor by the ith process $P_i$. This gives rise to a simple approximate technique, which is to use the M/M/1 or M/M/c queuing model as appropriate but to
base the calculation of processor load (and therefore the Q or W length) on all processes except the one for which dispatch time is being considered. Different dispatch times will be calculated for different processes. This approximation approaches the simpler technique of calculating a common dispatch time based on total processor load if each process represents only a small portion of the total processor load. This dispatch-time approximation has been shown to be exact for the important case of a single server with random service time and with a finite, homogeneous population.

The analysis presented above represents an intuitive approach to the general problem of a finite-population open system. A more rigorous approach would view this problem in terms of a Markovian system, as described by Kobayashi (see Appendix 10, Bibliography).
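A minimal sketch of this approximate technique for a single processor follows. It is not from the original text: the process loads and service time are hypothetical, and the queue-wait expression ρT/(1 − ρ) is the M/M/1 result developed in chapter 4.

    def approximate_dispatch_times(loads, service_time):
        """Per-process dispatch times using an M/M/1 model of the processor,
        where the load seen by each process excludes its own contribution."""
        total = sum(loads)
        dispatch = []
        for own in loads:
            rho = total - own                   # processor load seen by this process
            dispatch.append(rho * service_time / (1.0 - rho))  # M/M/1 queue wait
        return dispatch

    # Hypothetical example: four processes, each loading the processor as shown.
    loads = [0.30, 0.20, 0.15, 0.10]
    for load, td in zip(loads, approximate_dispatch_times(loads, 0.05)):
        print(f"load {load:.2f}: dispatch time {td * 1000:.1f} ms")

Note that the busiest process computes the shortest dispatch time, since it excludes the largest share of the load; as each process's share of the total shrinks, all of the values converge toward the common dispatch time based on total processor load.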
APPENDIX 7 Priority Queues

The following analysis of priority queues is based on Kleinrock [15], chapter 3, and Saaty [24], chapter 11.

Let items queued for a server be ranked by priority, with priorities numbered from 1 to $p_{max}$. Items with higher priority numbers are chosen for service before items with lower priority numbers. Without any regard for the priority queuing discipline whatsoever, we observe that an item entering the queue at priority p suffers three types of delay before being chosen for service:

1. The item must wait for the item currently being serviced, if any, to complete its service. We denote this time as $T_0$.
2. The item must wait for the servicing of all items that are in the queue prior to its arrival and that will be served prior to it. There are $N_{ip}$ such items from priority i.
3. The item must wait for the servicing of all items that arrive subsequent to its arrival and that will be served prior to it. There are $M_{ip}$ such items from priority i.

Let $T_i$ be the average service time for an item at priority i. The average time that an item at priority p will have to wait in the queue before being chosen for service, $T_{qp}$, is

$$T_{qp} = T_0 + \sum_{i=1}^{p_{max}} N_{ip}\,T_i + \sum_{i=1}^{p_{max}} M_{ip}\,T_i \qquad \text{(A7-1)}$$
We are interested in the case in which items with higher priorities are chosen for service before items with lower priorities and in which items within a priority class are served on a first-come, first-served basis. Then

$$N_{ip} = 0, \quad i < p \qquad \text{(A7-2)}$$

$$N_{ip} = R_i\,T_{qi}, \quad i \ge p \qquad \text{(A7-3)}$$

$$M_{ip} = 0, \quad i \le p \qquad \text{(A7-4)}$$

$$M_{ip} = R_i\,T_{qp}, \quad i > p \qquad \text{(A7-5)}$$

That is, items already in the queue at priorities less than p will not be serviced so long as an item of priority p is in the queue. Items of priority p or greater already in the queue will be serviced before the newly arrived item. If $T_{qi}$ is the average time an item of priority i remains in the queue, and if $R_i$ is the arrival rate of priority i items, then there will be an average of $R_i T_{qi}$ items of priority i in the queue at any given time.

Similarly, items that arrive after our item has entered the queue and that are of the same priority or less will not be serviced until after our item has been serviced. However, during the time $T_{qp}$ that our item waits in the queue, $R_i T_{qp}$ items will arrive for each priority i. Those of higher priority will be serviced first.

Using equations (A7-2) through (A7-5), equation (A7-1) can be rewritten as

$$T_{qp} = T_0 + \sum_{i=p}^{p_{max}} R_i\,T_{qi}\,T_i + \sum_{i=p+1}^{p_{max}} R_i\,T_{qp}\,T_i \qquad \text{(A7-6)}$$
Noting that the load imposed by priority i items on the server is $L_i = R_i T_i$, we have

$$T_{qp} = T_0 + \sum_{i=p}^{p_{max}} L_i\,T_{qi} + \sum_{i=p+1}^{p_{max}} L_i\,T_{qp} \qquad \text{(A7-7)}$$

We wish to prove that the solution to this equation is

$$T_{qp} = \frac{T_0}{(1-L)(1-L_H)} \qquad \text{(A7-8)}$$

where

$$L = \sum_{i=p}^{p_{max}} R_i\,T_i \qquad \text{(A7-9)}$$

$$L_H = \sum_{i=p+1}^{p_{max}} R_i\,T_i \qquad \text{(A7-10)}$$
We do so by induction. Given equation (A7-8) for the queue delay at priority p, let us evaluate the queue delay for the next lower priority, $T_{q,p-1}$. From equation (A7-7),

$$T_{q,p-1} = T_0 + \sum_{i=p-1}^{p_{max}} L_i\,T_{qi} + \sum_{i=p}^{p_{max}} L_i\,T_{q,p-1} \qquad \text{(A7-11)}$$

This can be rewritten as

$$T_{q,p-1} = T_0 + \sum_{i=p}^{p_{max}} L_i\,T_{qi} + L_{p-1}\,T_{q,p-1} + \sum_{i=p+1}^{p_{max}} L_i\,T_{qp} - \sum_{i=p+1}^{p_{max}} L_i\,T_{qp} + \sum_{i=p}^{p_{max}} L_i\,T_{q,p-1} \qquad \text{(A7-12)}$$
But the first, second, and fourth terms are $T_{qp}$, from equation (A7-7). Thus, using equations (A7-9) and (A7-10),

$$T_{q,p-1} = T_{qp} + L_{p-1}\,T_{q,p-1} - L_H\,T_{qp} + L\,T_{q,p-1} \qquad \text{(A7-13)}$$

or

$$T_{q,p-1} = T_{qp}\,\frac{1-L_H}{1-(L+L_{p-1})}$$

Using equation (A7-8),

$$T_{q,p-1} = \frac{T_0}{\bigl[1-(L+L_{p-1})\bigr](1-L)} \qquad \text{(A7-14)}$$
This is what we would expect from equation (A7-8), since for priority p−1 the load from priorities p−1 and higher is $L + L_{p-1}$, while the load from strictly higher priorities is L. Finally, in the limiting case for the highest priority, equation (A7-7) yields

$$T_{q,p_{max}} = T_0 + L_{p_{max}}\,T_{q,p_{max}} \qquad \text{(A7-15)}$$

or

$$T_{q,p_{max}} = \frac{T_0}{1-L_{p_{max}}} \qquad \text{(A7-16)}$$

which is exactly what equation (A7-8) yields. Thus, equation (A7-8) is proved for all p.
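Equation (A7-8) is straightforward to evaluate for a whole set of priority classes. The Python sketch below is not from the original text; the value of $T_0$ and the class loads are illustrative. It returns the average queue wait for each priority, with the highest-numbered class served first, as above.

    def priority_queue_waits(T0, loads):
        """Average queue wait per priority class, per equation (A7-8).
        loads[i] is L_i = R_i * T_i; index 0 is priority 1 (lowest) and the
        last index is p_max (highest).  T0 is the average wait for the item
        currently in service to complete."""
        waits = []
        for p in range(len(loads)):
            L = sum(loads[p:])          # load of priority p and higher (A7-9)
            LH = sum(loads[p + 1:])     # load of priorities above p (A7-10)
            waits.append(T0 / ((1.0 - L) * (1.0 - LH)))
        return waits

    # Illustrative: three classes with loads 0.3, 0.2, and 0.1, T0 = 20 ms.
    for p, Tqp in enumerate(priority_queue_waits(0.020, [0.3, 0.2, 0.1]), 1):
        print(f"priority {p}: Tq = {Tqp * 1000:.1f} ms")

As equation (A7-16) requires, the highest class waits only $T_0/(1-L_{p_{max}})$, while each lower class is penalized both by its own cumulative load and by everything above it.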
APPENDIX 10 References and Bibliography

REFERENCES

1. Anon., et al. 1985. A measure of transaction processing power. Datamation 31:112-118.
2. American National Standards Institute (ANSI). 1971. Procedures for the use of the communication control characters of American national standard code for information interchange in specified data communication links. New York: ANSI.
...
9. Hammond, J. L., and P. J. P. O'Reilly. 1986. Performance analysis of local computer networks. Reading, Mass.: Addison-Wesley.
10. Highleyman, W. H. 1982. Survivable systems. Computerworld. Four-part series. 14(5):19-22; 14(6):1-12; 14(7):9-18; 14(8):1-10.
11. International Business Machines (IBM). 1971. Analysis of some queuing models in real-time systems. Document number F20-0007-1. White Plains, N.Y.: IBM.
12. International Standards Organization (ISO). 1982. Information processing systems - Open systems interconnection - Basic reference model. Geneva, Switzerland: ISO.
13. Kendall, D. G. 1951. Some problems in the theory of queues. Journal of the Royal Statistical Society, Series B13, pp. 151-185.
14. Kleinrock, L. 1975. Queuing systems. Vol. 1, Theory. New York: John Wiley and Sons.
15. Kleinrock, L. 1976. Queuing systems. Vol. 2, Computer applications. New York: John Wiley and Sons.
16. Lazowska, Edward D., et al. 1984. Quantitative system performance. Englewood Cliffs, N.J.: Prentice-Hall.
17. Liebowitz, B. H., and J. H. Carson. 1985. Multiple processor systems for real-time applications. Englewood Cliffs, N.J.: Prentice-Hall.
18. Little, J. D. C. 1961. A proof of the queuing formula L = λW. Operations Research 9:383-387.
19. Martin, J. 1967. Design of real time systems. Englewood Cliffs, N.J.: Prentice-Hall.
20. Martin, J. 1972. Systems analysis for data transmission. Englewood Cliffs, N.J.: Prentice-Hall.
21. Meijer, A., and P. Peeters. 1982. Computer network architectures. Rockville, Md.: Computer Science Press.
22. Nyquist, H. 1928. Certain topics in telegraph transmission theory. AIEE Transactions 47 (April).
23. Peck, L. G., and R. N. Hazelwood. 1958. Finite queuing tables. New York: John Wiley & Sons.
24. Saaty, T. L. 1961. Elements of queuing theory. New York: McGraw-Hill.
25. Stallings, W. 1985. Data and computer communications. New York: Macmillan.
BIBLIOGRAPHY

Mathematical Foundations

ALLEN, A. O. 1978. Probability, statistics and queuing theory with computer science applications. Academic Press, New York.
FELLER, W. 1950. An introduction to probability theory and its applications. John Wiley, New York.
KNUTH, D. E. 1968. The art of computer programming. Vol. 1: Fundamental algorithms. Addison-Wesley, Reading, MA.
KNUTH, D. E. 1969. The art of computer programming. Vol. 2: Seminumerical algorithms. Addison-Wesley, Reading, MA.
KNUTH, D. E. 1973. The art of computer programming. Vol. 3: Sorting and searching. Addison-Wesley, Reading, MA.
TRIVEDI, K. S. 1982. Probability and statistics with reliability, queuing, and computer science applications. Prentice-Hall, Englewood Cliffs, NJ.
BEIZER, B. 1978. Micro-analysis of computer system performance. Van Nostrand Reinhold, New York.
DENNING, P. J., and J. P. BUZEN. 1978. The operational analysis of queuing network models. Computing Surveys 10(3):225-261.
FERRARI, D. 1978. Computer systems performance evaluation. Prentice-Hall, Englewood Cliffs, NJ.
FREIBERGER, W. 1972. Statistical computer performance evaluation. Academic Press, New York.
GELENBE, E., and I. MITRANI. 1980. Analysis and synthesis of computer systems. Academic Press, New York.
HELLERMAN, H., and T. F. CONROY. 1975. Computer system performance. McGraw-Hill, New York.
KOBAYASHI, H. 1978. Modeling and analysis. Addison-Wesley, Reading, MA.
LAVENBERG, S. S., ed. 1983. Computer performance modeling handbook. Academic Press, New York.
MACNAIR, E. A., and C. H. SAUER. 1985. Elements of practical performance modeling. Prentice-Hall, Englewood Cliffs, NJ.
SAUER, C. H., and K. M. CHANDY. 1981. Computer systems performance modeling. Prentice-Hall, Englewood Cliffs, NJ.
Modeling Tools

BEILNER, H., and J. MATER. 1984. COPE: Past, present and future. Proceedings of the International Conference on Modelling Techniques and Tools for Performance Analysis, Paris.
BHARATH-KUMAR, K., and P. KERMANI. 1984. Performance evaluation tool (PET): An analysis tool for computer communication networks. IEEE Journal on Selected Areas in Communications SAC-2, 1 (January):220-225.
BOOYENS, M., et al. 1984. SNAP: An analytic multiclass queuing network analyzer. Proceedings of the International Conference on Modelling Techniques and Tools for Performance Analysis, Paris.
INFORMATION RESEARCH ASSOCIATES. 1983. PAWS/A user guide. Austin, TX.
MERLE, D., D. POTIER, and M. VERAN. 1978. A tool for computer system performance analysis. In Performance of Computer Installations, ed. D. Ferrari. North Holland, Amsterdam.
QUANTITATIVE SYSTEM PERFORMANCE, INC. 1982. MAP reference guide. Seattle, WA.
QUANTITATIVE SYSTEM PERFORMANCE, INC. 1982. MAP user guide. Seattle, WA.
REISER, M., and C. H. SAUER. 1978. Queuing network models: Methods of solution and their program implementation. Pp. 115-167 in Current trends in programming methodology, Vol. III: Software modeling and its impact on performance, K. M. Chandy and R. T. Yeh (eds.). Prentice-Hall, Englewood Cliffs, NJ.
SAUER, C. H., M. REISER, and E. A. MACNAIR. 1977. RESQ - A package for solution of generalized queuing networks. Proceedings 1977 National Computer Conference, Dallas, TX, pp. 977-986.
SAUER, C. H., E. A. MACNAIR, and J. F. KUROSE. 1982. The research queuing package version 2: Introduction and examples. IBM Research Report RA-138, Yorktown Heights, NY.
VERAN, M., and D. POTIER. 1984. A portable environment for queuing systems modelling. Proceedings of the International Conference on Modelling Techniques and Tools for Performance Analysis, Paris.
WHITT, W. 1983. The queuing network analyzer. Bell System Technical Journal 62(9):2779-2815.
Index
A
Application environment, 259-97
  application process performance, 260
    dispatch time, 266
    messaging, 267-68
    operating system load, 267
    overview, 260-65
    priority, 266-67
    process time, 265-66
    queuing, 268-69
  application process structures, 269-81
    asynchronous I/O, 280-81
    dynamic servers, 278-80
    monoliths, 269-70
    multitasking, 274-78
    requestor-server, 270-74
  ET1 benchmark model as an example of modeling an, 281-96
  summary on, 296-97
Asynchronous communication, 150-51
  appropriate strategy for reception of, 152-53
  compared to synchronous, 154
Asynchronous I/O, 280-81
B

Backup processes, 49-63
Baud, definition of, 163-64
Bibliography, 414-15
Bottlenecks, 3, 63-64, 67-68
  relationship of queues to, 69-70, 71

C

Cache memories, 15-16, 19
Carson, 22
Case study (see Syntrex Gemini System performance evaluation study)
Central Limit Theorem, 116-17, 119
Checkpointing, 55, 59-63, 299, 307
  in example of fault tolerance, 309-11
Communications, 139-96
  bits, bytes, and baud, 163-64
  communication channels, 141-49
    concentrators, 145, 147
    dedicated lines, 141, 142
    dialed lines, 141-42
    local area networks, 144
    modems, 147-48
    multiplexers, 144-47
    propagation delay, 148-49
    satellite channels, 143-44
    virtual circuits, 142-43
  data transmission, 149-56
    asynchronous communication (see Asynchronous communication)
    character codes, 149, 150
    error performance, 152-54
    error protection, 154-55
    full-duplex channels (see Full-duplex channels)
    half-duplex channels (see Half-duplex channels)
    jitter, 152-54
    synchronous communication (see Synchronous communication)
  establishment/termination performance, 181-89
    multipoint poll/select, 185-89
    point-to-point contention, 182-85
  layered protocols (see Layered protocols)
  local area network performance, 189-96
    multipoint contention protocol CSMA/CD, 189-92
    token ring protocol, 192-96
  message transfer example, 180-81
  message transfer performance, 172-81
    full-duplex message transfer efficiency, 176-78
    half-duplex message transfer efficiency, 172-76
    message transit time, 178-80
  performance impact of, 140-41
  protocols (see Protocols)
Computer, definition of, 15, 23
Concentrators, 145, 147

D

Data-base environment, 226-58
  data-base managers, 226-27
  disk caching, 245-49
  example of file manager performance, 255-58
  file organization, 234-45
    hashed files, 245
    indexed sequential files, 243-44
    keyed files, 238-43
    random files, 235, 238
    sequential files, 235, 237-38
    unstructured files, 234-37
  file system, 227-34
    cache memory, 230-31
    disk controller, 228-29
    disk device driver, 229-30
    disk drives, 228
    file manager, 231-32
    hierarchy of components, 227
    performance of, 232-34
  other considerations, 249-55
    alternate servicing order, 250-51
    data locking, 251-52
    mirrored files, 252-53
    multiple file managers, 253-55
    overlapped seeks, 250
Dedicated lines, 141, 142
Dialed lines, 141-42
Disk server, 3, 5
Distributed systems
  application processes, 24-26 (see also Transaction-processing systems, component subsystems, application processes)
  definition of process within, 24
  role of a process within, 26
  I/O processes (device-handling processes), 24-26
    differences from application processes, 24
  interprocess communications, 26-28
    advantages and disadvantages of hardware path implementation, 27
    facilities supporting, 27
    types of interprocess messages, 26
  process management (see Process management)
  process mobility, 29-30
  process names, 28
  process structure (see Process structure)
  survivability (see Survivability)
  system architectures (see System architectures)
  transparency of, 22-30
    illustration of, 22-23
    summary on, 30
  (see also Transaction-processing systems)
Drop, definition of, 141

E

ET1 benchmark model
  as example in comparing approaches to fault tolerance, 308-11
  as example of modeling an application environment, 281-96
Exponential distribution, 110-11, 112

F

Fault tolerance, 20-21
  approaches to (see Survivability, approaches to software redundancy)
  data-base integrity (see Survivability, data-base integrity)
  method used in case study, 323
Finite populations, 128-32
  computational considerations for finite populations, 132
  multiple-server queues (M/M/c/m/m), 131
  single-server queues (M/M/1/m/m), 130-31
Full-duplex channels, 156
  message transfer efficiency, 176-78
  message transit time, 179-81
  protocols for, 158-60
    bit synchronous, 160-63

H

Half-duplex channels, 155-56
  message transfer efficiency, 172-76
  message transit time, 178-79
  protocols for, 158, 159
Hammond, 191, 196
Hashed files, 245
Highleyman, W. H., 325
I

Indexed sequential files, 235, 237-38
Infinite populations, 113-28
  dispersion of response time, 114-19
    Central Limit Theorem, 116-17, 119
    gamma distribution, 115-16, 119
    variance of response times, 117-19
  multiple-channel server (M/M/c), 126-27
    with priorities, 127-28
  properties of M/G/1 queues, 122-23
  properties of M/M/1 queues, 120-22
  single-channel server with priorities, 123-25
    nonpreemptive server, 124-25
    preemptive server, 125
  some properties of, 114
Interactive, definition of, 13
I/O processes [see Distributed systems, I/O processes (device-handling processes)]

J

Jitter, 152-54

K

Kendall's classification scheme, 113, 377
Keyed files, 238-43
Khintchine, 91, 92
Khintchine-Pollaczek equation, 92-94, 383-85
Kleinrock, 196

L

Layered protocols, 164-72
  ISO/OSI, 165-69
  SNA, 165, 169-70
  X.25, 165, 170-72
Lazowska, 10, 94, 129
Liebowitz, 22
Little, 94
Little's Law, 94
Local area networks, 144
  protocols (see Communications, local area network performance)
Logical, definition of, 24
M

Martin, James, 10, 12, 94
Message queuing, 55, 56-59, 299, 306
  in example of fault tolerance, 308-11
Meijer, 196
Mirrored files, 47, 252-53, 299
Mirrored writes, 389-96
  dual disk seek time, 392-96
  dual latency times, 389-90
  single disk seek time, 391-92
Modems, 147-48
Multicomputer systems
  advantages and disadvantages of, 19-20
  architecture of, 18
  definition of, 15
  two important characteristics of, 29
Multiplexers, 144-47
Multiplexing
  definition of, 144
  established techniques for, 144-47
Multiprocessor systems, 15-16
  advantages and disadvantages of, 19-20
  architecture of, 19
  definition of, 15
  transaction protection used by, 50-53

O

Object module, 16
On-line, definition of, 12
OSI Reference Model, 165-69

P

Performance analysis document (see Performance model product)
Performance analyst, 7-8
Performance modeling, 67-86
  analysis summary, 85-86
  background on, 2-5
  basic concepts of, 87-138
  bottlenecks (see Bottlenecks)
  components of analysis, 73
    performance model document, 73
    performance model program (see Performance model program)
    result memoranda, 73
    scenario model, 73-74
    system description, 73
    traffic model, 73, 74-77
  definition of, 8
  example of, 4
  measure of the system capacity, 3
  performance measures, 70-73
    capacity, 73
    maximum response time, 73
    mean response time, 73
  queues (see Queues)
  response time and, 3
  steps to successful performance analysis, 312
  uses of, 5-6
Performance model product, 312-22
  programming the performance model, 317-22
    dictionary, 319-20
    help screen, 320
    input parameter entry and edit, 317-18
    input variable specification, 318-19
    model calculation, 320
    parameter storage, 319
    report specification, 319, 320-21
  report organization, 313
    conclusions and recommendations, 317
    executive summary, 313
    model computation, 316-17
    model summary, 315-16
    performance model, 315
    results, 317
    scenario, 316
    system description, 314
    table of contents, 313
    traffic model, 314-15
    transaction model, 314
  tuning, 321
    factors causing differences between model results and actual measurements, 321
Performance model program, 73, 77-85
  components, 77-78
    communications, 77, 78
    data-base manager, 79, 80
    dispatch time, 80-81
    reply handler, 78, 80
    request handler, 77, 78-79
    server, 77, 79-80
  model parameters, 81-82
  model results, 83-85
  model structure, 82, 83
Performance problems, sources of, 7
Physical, definition of, 24
Poisson distribution, 106-9, 111, 386-88
  definition of, 108
  memoryless feature of, 109
Pollaczek, 91, 92
Preemptive scheduling, 40
Priority queues, 411-13
Probability theory concepts
  permutations and combinations, 104-5
  random variables (see Random variables)
Procedure
  definition of, 33
  similarity to process, 34
  types of data accessible to, 34
Process, definition of, 16, 24, 30
Process dispatch time, 397-410
  an approximate solution, 409-10
  dispatching model, 399-402
  infinite population approximation error, 397-99
  multiprocessor system, 405-9
  single processor system, 402-4
Processing environment, 197-225
  operating system, 197-98, 213-25
    interprocess messaging, 218-20
    I/O transfers, 220-21
    OS-initiated actions, 221-22
    task dispatching, 214-18
    thrashing, 222-24
  physical resources, 198-213
    bus, 201-2
    cache memory, 199-200
    I/O system, 200-201
    main memory, 202-3
    performance model presentation and evaluation, 206-13
    performance tools, 205-6
    processor performance factor, 203-4
    processors, 199
    traffic model, 204-5
  utilization, 224
Process management, 36-45
  managing multiple users, 43-44
    additional considerations, 44-45
    home terminal concept, 43-44
  mechanisms for scheduling, 38-40
  memory management, 37, 40-42
    memory mapping, 42
    memory page destruction, 41
    page faulting, 40-41
    virtual memory, 40
  process scheduling, 37-38
  shared resources, 36-37
  summary on, 45
Process pair directory, 55-56
Process structure, 30-36
  addressing range, 33
  function of the stack, 34, 35-36
  interprocess messages, 31-32
    system calls, 32
    types of, 32
  parts of, 31
  process code area, 33
  process data area, 33-34
  process functions, 30-31
  processing cycle, 31
  summary on, 36
Processor performance factor, 203-4
Processor queue, 3
Program, definition of, 16, 23
Protocols, 157-63
  channel allocation, 160
  definition of, 157
  for full-duplex channels, 158-60
    bit-synchronous, 160-63
  functions and parts of, 157
  go-back-n, 159, 178, 179-80
  for half-duplex channels, 158, 159
  layered (see Layered protocols)
  local area network (see Communications, local area network performance)
  message identification and protection, 157-58
  message transfer, 158-60
Q

Queues, 67, 68-69, 87-88, 268-69
  characteristics of, 88
  compound queues, 297
  constant service times, 91
  definition of, 88
  discrete service times, 92
  due to asynchronous I/O, 280-81
  exponential service times, 90-91
  general distributions, 91-92
  important queuing equations, 92-94
  introduction to, 88-94
  Khintchine-Pollaczek equation, 92-94, 383-85
  message queuing, 55, 56-59, 299, 306
  priority queues, 411-13
  queue length, 88, 120-22
  relationship to bottlenecks, 69-70, 71
  tandem queues, 114
  uniform service times, 92
  (see also Finite populations; Infinite populations)
Queuing models, 136-38, 377-82
  for given application process characteristics, 269
  general queuing parameters, 375-76
  (see also Queues)
Queuing systems
  comparison of queue types, 132-36
  Kendall's classification scheme for, 113, 377
  finite populations and (see Finite populations)
  infinite populations and (see Infinite populations)
  summary on, 136-38
Queuing theory
  important cases of random processes in, 106-7, 112

R

Random files, 235, 238
Random processes
  definition of, 111
  exponential distribution and, 110-11, 112
  important cases in queuing theory of, 106-7, 112
  Poisson distribution and (see Poisson distribution)
  summary on, 111-12
Random variables, 94
  continuous, 100-104
    characteristics and rules for, 100-102
    definition of, 100
    examples of, 102-4
  discrete, 94-100
    definition of, 94
    examples of, 98-100
    properties of, 95-98
Ready list, definition of, 38
Real time, definition of, 12-13
Requestor-server model, 64, 65
  as application process structure, 269, 270-74
    file managers, 274
    requestors, 271-72
    servers, 272-74
  as ET1 benchmark mo...
S

Saaty, Thomas, 10
Satellite channels, 143-44
Selective retransmission, 159-60, 177, 179
Sequential files, 235, 237-38
Series, 105-6
Service time, definition of, 3
Software architecture, 63-66
  bottlenecks (see Bottlenecks)
  dynamic servers, 64-66
  requestor-server model (see Requestor-server model)
Software redundancy (see Survivability, approaches to software redundancy)
Source code, definition of, 16
Stallings, 173, 196
Survivability, 20, 29, 30, 32, 45-63, 298-300
  approaches to software redundancy, 49-63
    backup processes, 55-56
    checkpointing, 55, 59-63, 299, 307
    example using ET1 benchmark to compare various approaches, 308-11
    message queuing (see Message queuing)
    synchronization, 53-54, 299, 303-6
    transaction protection, 50-53, 298, 299, 300-303, 308
  data-base integrity, 47-49, 308
    configuration of a mirrored disk pair, 48
    illustration of logical dual porting, 49
    levels of, 308
    levels of mirroring, 47
  hardware duality, 46-47
Synchronization, 53-54, 299, 303-6
Synchronous communication, 151, 152
  bit-synchronous protocols, 160-63
  compared to asynchronous, 154
  tolerance of jitter, 153-54
Syntrex Gemini System performance evaluation study, 325-74
  applicable documents, 329
  background on, 323-24
  executive summary, 326
  introduction, 328-29
  model summary, 360-65
  performance model, 342-57
    Aquarius interface, 346-50
    Aquarius terminal, 344
    average transaction time, 343
    buffer overflow, 355-57
    communication line, 344-46
    disk management, 353-54
    file manager, 350-53
    notation, 342-43
  recommendations, 371-73
  references, 373-74
  results, 365-71
    benchmark comparison, 365-67
    component analysis, 368-70
  scenario, 357-59
  scenario time, 359-60
  system description, 329-37
    Aquarius communication lines, 330-31
    Aquarius interface, 331-34
    file manager, 335-36
    file system, 336-37
    general, 329-30, 331
    shared memory, 334-35
    synchronization as fault-tolerance technique, 332-34
  table of contents, 327
  traffic model, 340-42
  transaction model, 338-40
Syntrex Incorporated, 323
System architectures, 17-22
  distributed systems (see Distributed systems)
  expandability of, 18
  factors causing applications growth, 17-18
  fault tolerance (see Fault tolerance)
  hybrid, 20
  loosely coupled, 18
  of multicomputer system, 18
  of multiprocessor system, 19
  summary on, 22
  tightly coupled, 19
System resources, 3

T

Timer list, definition of, 38
Traffic model, 3, 4
Transaction-processing systems, 12-66
  application environment (see Application environment)
  batch component of, 13
  characteristics of, 13
  component subsystems, 14-17
    application processes, 16 (see also Distributed systems, application processes)
    communication network, 15
    data base, 16-17
    memory, 15-16
    other peripherals, 17
    processors, 15
  data-base environment (see Data-base environment)
  definition of, 2, 12
  examples of, 1-2
  processing environment (see Processing environment)
  software architecture (see Software architecture)
  system architectures, 17-22 (see also Distributed systems)
Transaction protection, 50-53, 298, 299, 300-303, 308
U-V

Unstructured files, 234-37
Virtual circuits, 142-43