MODELING UNCERTAINTY An Examination of Stochastic Theory, Methods, and Applications
INTERNATIONAL SERIES IN OPERATIONS RESEARCH & MANAGEMENT SCIENCE Frederick S. Hillier, Series Editor
Stanford University
Vanderbei, R. / LINEAR PROGRAMMING: Foundations and Extensions
Jaiswal, N.K. / MILITARY OPERATIONS RESEARCH: Quantitative Decision Making
Gal, T. & Greenberg, H. / ADVANCES IN SENSITIVITY ANALYSIS AND PARAMETRIC PROGRAMMING
Prabhu, N.U. / FOUNDATIONS OF QUEUEING THEORY
Fang, S.-C., Rajasekera, J.R. & Tsao, H.-S.J. / ENTROPY OPTIMIZATION AND MATHEMATICAL PROGRAMMING
Yu, G. / OPERATIONS RESEARCH IN THE AIRLINE INDUSTRY
Ho, T.-H. & Tang, C. S. / PRODUCT VARIETY MANAGEMENT
El-Taha, M. & Stidham, S. / SAMPLE-PATH ANALYSIS OF QUEUEING SYSTEMS
Miettinen, K. M. / NONLINEAR MULTIOBJECTIVE OPTIMIZATION
Chao, H. & Huntington, H. G. / DESIGNING COMPETITIVE ELECTRICITY MARKETS
Weglarz, J. / PROJECT SCHEDULING: Recent Models, Algorithms & Applications
Sahin, I. & Polatoglu, H. / QUALITY, WARRANTY AND PREVENTIVE MAINTENANCE
Tavares, L. V. / ADVANCED MODELS FOR PROJECT MANAGEMENT
Tayur, S., Ganeshan, R. & Magazine, M. / QUANTITATIVE MODELING FOR SUPPLY CHAIN MANAGEMENT
Weyant, J. / ENERGY AND ENVIRONMENTAL POLICY MODELING
Shanthikumar, J.G. & Sumita, U. / APPLIED PROBABILITY AND STOCHASTIC PROCESSES
Liu, B. & Esogbue, A.O. / DECISION CRITERIA AND OPTIMAL INVENTORY PROCESSES
Gal, T., Stewart, T.J., Hanne, T. / MULTICRITERIA DECISION MAKING: Advances in MCDM Models, Algorithms, Theory, and Applications
Fox, B. L. / STRATEGIES FOR QUASI-MONTE CARLO
Hall, R.W. / HANDBOOK OF TRANSPORTATION SCIENCE
Grassman, W.K. / COMPUTATIONAL PROBABILITY
Pomerol, J-C. & Barba-Romero, S. / MULTICRITERION DECISION IN MANAGEMENT
Axsäter, S. / INVENTORY CONTROL
Wolkowicz, H., Saigal, R., Vandenberghe, L. / HANDBOOK OF SEMI-DEFINITE PROGRAMMING: Theory, Algorithms, and Applications
Hobbs, B. F. & Meier, P. / ENERGY DECISIONS AND THE ENVIRONMENT: A Guide to the Use of Multicriteria Methods
Dar-El, E. / HUMAN LEARNING: From Learning Curves to Learning Organizations
Armstrong, J. S. / PRINCIPLES OF FORECASTING: A Handbook for Researchers and Practitioners
Balsamo, S., Personé, V., Onvural, R. / ANALYSIS OF QUEUEING NETWORKS WITH BLOCKING
Bouyssou, D. et al / EVALUATION AND DECISION MODELS: A Critical Perspective
Hanne, T. / INTELLIGENT STRATEGIES FOR META MULTIPLE CRITERIA DECISION MAKING
Saaty, T. & Vargas, L. / MODELS, METHODS, CONCEPTS & APPLICATIONS OF THE ANALYTIC HIERARCHY PROCESS
Chatterjee, K. & Samuelson, W. / GAME THEORY AND BUSINESS APPLICATIONS
Hobbs, B. et al / THE NEXT GENERATION OF ELECTRIC POWER UNIT COMMITMENT MODELS
Vanderbei, R.J. / LINEAR PROGRAMMING: Foundations and Extensions, 2nd Ed.
Kimms, A. / MATHEMATICAL PROGRAMMING AND FINANCIAL OBJECTIVES FOR SCHEDULING PROJECTS
Baptiste, P., Le Pape, C. & Nuijten, W. / CONSTRAINT-BASED SCHEDULING
Feinberg, E. & Shwartz, A. / HANDBOOK OF MARKOV DECISION PROCESSES: Methods and Applications
Ramík, J. & Vlach, M. / GENERALIZED CONCAVITY IN FUZZY OPTIMIZATION AND DECISION ANALYSIS
Song, J. & Yao, D. / SUPPLY CHAIN STRUCTURES: Coordination, Information and Optimization
Kozan, E. & Ohuchi, A. / OPERATIONS RESEARCH/MANAGEMENT SCIENCE AT WORK
Bouyssou et al / AIDING DECISIONS WITH MULTIPLE CRITERIA: Essays in Honor of Bernard Roy
Cox, Louis Anthony, Jr. / RISK ANALYSIS: Foundations, Models and Methods
MODELING UNCERTAINTY An Examination of Stochastic Theory, Methods, and Applications
Edited by
MOSHE DROR University of Arizona
PIERRE L’ECUYER Université de Montréal
FERENC SZIDAROVSZKY University of Arizona
KLUWER ACADEMIC PUBLISHERS NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW
eBook ISBN: 0-306-48102-2
Print ISBN: 0-7923-7463-0
©2005 Springer Science + Business Media, Inc. Print ©2002 Kluwer Academic Publishers, Dordrecht. All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher. Created in the United States of America.
Visit Springer's eBookstore at http://ebooks.springerlink.com and the Springer Global Website Online at http://www.springeronline.com
Contents
Preface
Contributing Authors
1 Professor Sidney J. Yakowitz
D. S. Yakowitz

Part I

2 Stability of Single Class Queueing Networks
Harold J. Kushner
1 Introduction
2 The Model
3 Stability: Introduction
4 Perturbed Liapunov Functions
5 Stability

3 Sequential Optimization Under Uncertainty
Tze Leung Lai
1 Introduction
2 Bandit Theory
2.1 Nearly optimal rules based on upper confidence bounds and Gittins indices
2.2 A hypothesis testing approach and block experimentation
2.3 Applications to machine learning, control and scheduling of queues
3 Adaptive Control of Markov Chains
3.1 Parametric adaptive control
3.2 Nonparametric adaptive control
4 Stochastic Approximation

4 Exact Asymptotics for Large Deviation Probabilities, with Applications
Iosif Pinelis
1 Limit theorems on the last negative sum and applications to nonparametric bandit theory
1.1 Condition (4)&(8): exponential and superexponential cases
1.2 Condition (4)&(8): exponential (beyond (14)) and subexponential cases
1.3 The conditional distribution of the initial segment of the sequence of the partial sums given …
1.4 Application to Bandit Allocation Analysis
1.4.1 Test-times-only based strategy
1.4.2 Multiple bandits and all-decision-times based strategy
2 Large deviations in a space of trajectories
3 Asymptotic equivalence of the tail of the sum of independent random vectors and the tail of their maximum
3.1 Introduction
3.2 Exponential inequalities for probabilities of large deviation of sums of independent Banach space valued r.v.’s
3.3 The case of a fixed number of independent Banach space valued r.v.’s. Application to asymptotics of infinitely divisible probability distributions in Banach spaces
3.4 Tails decreasing no faster than power ones
3.5 Tails decreasing faster than any power ones
3.6 Tails decreasing no faster than …
Part II

5 Stochastic Modelling of Early HIV Immune Responses Under Treatment by Protease Inhibitors
Wai-Yuan Tan and Zhihua Xiang
1 Introduction
2 A Stochastic Model of Early HIV Pathogenesis Under Treatment by a Protease Inhibitor
2.1 Modeling the Effects of Protease Inhibitors
2.2 Modeling the Net Flow of HIV From Lymphoid Tissues to Plasma
2.3 Derivation of Stochastic Differential Equations for the State Variables
3 Mean Values of …
4 A State Space Model for the Early HIV Pathogenesis Under Treatment by Protease Inhibitors
4.1 Estimation of … given …
4.2 Estimation of … given …
5 An Example Using Real Data
6 Some Monte Carlo Studies
6 The Impact of Re-Using Hypodermic Needles
B. Barnes and J. Gani
1 Introduction
2 Geometric distribution with variable success probability
3 Validity of the distribution
4 Mean and variance of I
5 Intensity of epidemic
6 Reducing infection
7 The spread of the Ebola virus in 1976
8 Conclusions

7 Nonparametric Frequency Detection and Optimal Coding in Molecular Biology
David S. Stoffer
1 Introduction
2 The Spectral Envelope
3 Sequence Analyses
4 Discussion
Part III

8 An Efficient Stochastic Approximation Algorithm for Stochastic Saddle Point Problems
Arkadi Nemirovski and Reuven Y. Rubinstein
1 Introduction
1.1 Classical stochastic approximation
2 Stochastic saddle point problem
2.1 The problem
2.1.1 Stochastic setting
2.1.2 The accuracy measure
2.2 Examples
2.3 The SASP algorithm
2.4 Rate of convergence and optimal setup: off-line choice of the stepsizes
2.5 Rate of convergence and optimal setup: on-line choice of the stepsizes
3 Discussion
3.1 Comparison with Polyak’s algorithm
3.2 Optimality issues
4 Numerical Results
4.1 A Stochastic Minimax Steiner problem
4.2 A simple queuing model
5 Conclusions
Appendix A: Proof of Theorems 1 and 2
Appendix B: Proof of the Proposition

9 Regression Models for Binary Time Series
Benjamin Kedem, Konstantinos Fokianos
1 Introduction
2 Partial Likelihood Inference
2.1 Definition of Partial Likelihood
2.2 An Assumption Regarding the Covariates
2.3 Partial Likelihood Estimation
2.4 Prediction
3 Goodness of Fit
4 Logistic Regression
4.1 A Demonstration
5 Categorical Data

10 Almost Sure Convergence Properties of Nadaraya-Watson Regression Estimates
Harro Walk
1 Introduction
2 Results
3 Lemmas and Proofs

11 Strategies for Sequential Prediction of Stationary Time Series
László Györfi, Gábor Lugosi
1 Introduction
2 Universal prediction by partitioning estimates
3 Universal prediction by generalized linear estimates
4 Prediction of Gaussian processes
Part IV

12 The Birth of Limit Cycles in Nonlinear Oligopolies with Continuously Distributed Information Lag
Carl Chiarella and Ferenc Szidarovszky
1 Introduction
2 Nonlinear Oligopoly Models
3 The Dynamic Model with Lag Structure
4 Bifurcation Analysis in the General Case
5 The Symmetric Case
6 Special Oligopoly Models
7 Conclusions
13 A Differential Game of Debt Contract Valuation
A. Haurie and F. Moresino
1 Introduction
2 The firm and the debt contract
3 A stochastic game
4 Equivalent risk neutral valuation
4.1 Debt and Equity valuations when bankruptcy is not considered
4.2 Debt and Equity valuations when liquidation may occur
5 Debt and Equity valuations for Nash equilibrium strategies
6 Liquidation at fixed time periods
7 Conclusion

14 Huge Capacity Planning and Resource Pricing for Pioneering Projects
David Porter
1 Introduction
2 The Model
3 Results
3.1 Cost and Performance Uncertainty
3.2 Cost Uncertainty and Flexibility
3.3 Performance Uncertainty and Flexibility
4 Conclusion

15 Affordable Upgrades of Complex Systems: A Multilevel, Performance-Based Approach
James A. Reneke, Matthew J. Saltzman and Margaret M. Wiecek
1 Introduction
2 Multilevel complex systems
2.1 An illustrative example
2.2 Computational models for the example
3 Multiple criteria decision making
3.1 Generating candidate methods
3.2 Choosing a preferred selection of upgrades
3.3 Application to the example
4 Stochastic analysis
4.1 Random systems and risk
4.2 Application to the example
5 Conclusions
Appendix: Stochastic linearization
1 Origin of stochastic linearization
2 Stochastic linearization for random surfaces

16 On Successive Approximation of Optimal Control of Stochastic Dynamic Systems
Fei-Yue Wang, George N. Saridis
1 Introduction
2 Problem Statement
3 Sub-Optimal Control of Nonlinear Stochastic Dynamic Systems
4 The Infinite-time Stochastic Regulator Problem
5 Procedure for Iterative Design of Sub-optimal Controllers
5.1 Exact Design Procedure
5.2 Approximate Design Procedures for the Regulator Problem
6 Closing Remarks by Fei-Yue Wang

17 Stability of Random Iterative Mappings
László Gerencsér
1 Introduction
2 Preliminary results
3 The proof of Theorem 1.1
Appendix

Part V
18 ’Unobserved’ Monte Carlo Methods for Adaptive Algorithms
Victor Solo
1 El Sid
2 Introduction
3 On-line Binary Classification
4 Binary Classification with Noisy Measurements of Classifying Variables
5 Offline Binary Classification with Errors in Classifying Variables - Online
6 Conclusions

19 Random Search Under Additive Noise
Luc Devroye and Adam Krzyzak
1 Sid’s contributions to noisy optimization
2 Formulation of search problem
3 Random search: a brief overview
4 Noisy optimization by random search: a brief survey
5 Optimization and nonparametric estimation
6 Noisy optimization: formulation of the problem
7 Pure random search
8 Strong convergence and strong stability
9 Mixed random search
10 Strategies for general additive noise
11 Universal convergence

20 Recent Advances in Randomized Quasi-Monte Carlo Methods
Pierre L’Ecuyer and Christiane Lemieux
1 Introduction
2 A Closer Look at Low-Dimensional Projections
3 Main Constructions
3.1 Lattice Rules
3.2 Digital Nets
3.2.1 Sobol’ Sequences
3.2.2 Generalized Faure Sequences
3.2.3 Niederreiter Sequences
3.2.4 Polynomial Lattice Rules
3.3 Constructions Based on Small PRNGs
3.4 Halton sequence
3.5 Sequences of Korobov rules
3.6 Implementations
4 Measures of Quality
4.1 Criteria for standard lattice rules
4.2 Criteria for digital nets
5 Randomizations
5.1 Random shift modulo 1
5.2 Digital shift
5.3 Scrambling
5.4 Random Linear Scrambling
5.5 Others
6 Error and Variance Analysis
6.1 Standard Lattices and Fourier Expansion
6.2 Digital Nets and Haar or Walsh Expansions
6.2.1 Scrambled-type estimators
6.2.2 Digitally shifted estimators
7 Transformations of the Integrand
8 Related Methods
9 Conclusions and Discussion
Appendix: Proofs
Part VI

21 Singularly Perturbed Markov Chains and Applications to Large-Scale Systems under Uncertainty
G. Yin, Q. Zhang, K. Yin and H. Yang
1 Introduction
2 Singularly Perturbed Markov Chains
2.1 Continuous-time Case
2.2 Time-scale Separation
3 Properties of the Singularly Perturbed Systems
3.1 Asymptotic Expansion
3.2 Occupation Measures
3.3 Large Deviations and Exponential Bounds
3.3.1 Large Deviations
3.3.2 Exponential Bounds
4 Controlled Singularly Perturbed Markovian Systems
4.1 Continuous-time Hybrid LQG
4.2 Discrete-time LQ
5 Further Remarks
6 Appendix: Mathematical Preliminaries
6.1 Stochastic Processes
6.2 Markov chains
6.3 Connections of Singularly Perturbed Models: Continuous Time vs. Discrete Time

22 Risk-Sensitive Optimal Control in Communicating Average Markov Decision Chains
Rolando Cavazos-Cadena, Emmanuel Fernández-Gaucherand
1 Introduction
2 The Decision Model
3 Main Results
4 Basic Technical Preliminaries
5 Auxiliary Expected-Total Cost Problems: I
6 Auxiliary Expected-Total Cost Problems: II
7 Proof of Theorem 3.1
8 Conclusions
Appendix A: Proof of Theorem 4.1
Appendix B: Proof of Theorem 4.2

23 Some Aspects of Statistical Inference in a Markovian and Mixing Framework
George G. Roussas
1 Introduction
2 Markovian Dependence
2.1 Parametric Case - The Classical Approach
2.2 Parametric Case - The Local Asymptotic Normality Approach
2.3 The Nonparametric Case
3 Mixing
3.1 Introduction and Definitions
3.2 Covariance Inequalities
3.3 Moment and Exponential Probability Bounds
3.4 Some Estimation Problems
3.4.1 Estimation of the Distribution Function or Survival Function
3.4.2 Estimation of a Probability Density Function and its Derivatives
3.4.3 Estimating the Hazard Rate
3.4.4 A Smooth Estimate of F and …
3.4.5 Recursive Estimation
3.4.6 Fixed Design Regression
3.4.7 Stochastic Design Regression
Part VII

24 Stochastic Ordering of Order Statistics II
Philip J. Boland, Taizhong Hu, Moshe Shaked and J. George Shanthikumar
1 Introduction
2 Likelihood Ratio Orders Comparisons
3 Hazard and Reversed Hazard Rate Orders Comparisons
4 Usual Stochastic Order Comparisons
5 Stochastic Comparisons of Spacings
6 Dispersive Ordering of Order Statistics and Spacings
7 A Short Survey on Further Results

25 Vehicle Routing with Stochastic Demands: Models & Computational Methods
Moshe Dror
1 Introduction
2 An SVRP Example and Simple Heuristic Results
2.1 Chance Constrained Models
3 Modeling SVRP as a stochastic programming with recourse problem
3.1 The model
3.2 The branch-and-cut procedure
3.3 Computation of a lower bound on … and on Q(x)
4 Multi-stage model for the SVRP
4.1 The multi-stage model
5 Modeling SVRP as a Markov decision process
6 SVRP routes with at most one failure – a more ‘practical’ approach
7 The Dror conjecture
8 Summary

26 Life in the Fast Lane: Yates’s Algorithm, Fast Fourier and Walsh Transforms
Paul J. Sanchez, John S. Ramberg and Larry Head
1 Introduction
2 Linear Models
2.1 Factorial Analysis
2.1.1 Definitions and Background
2.1.2 The Model
2.1.3 The Coefficient Estimator
2.2 Walsh Analysis
2.2.1 Definitions and Background
2.2.2 The Model
2.2.3 Discrete Walsh Transforms
2.3 Fourier Analysis
2.3.1 Definitions and Background
2.3.2 The Model
3 An Example
4 Fast Algorithms
4.1 Yates’s Fast Factorial Algorithm
4.2 Fast Walsh Transforms
4.3 Fast Fourier Transforms
5 Conclusions
Appendix A: Table of Notation

27 Uncertainty Bounds in Parameter Estimation with Limited Data
James C. Spall
1 Introduction
2 Problem Formulation
3 Three Examples of Appropriate Problem Settings
3.1 Example 1: Parameter Estimation in Signal-Plus-Noise Model with Non-i.i.d. Data
3.2 Example 2: Nonlinear Input-Output (Regression) Model
3.3 Example 3: Estimates of Serial Correlation for Time Series
4 Main Results
4.1 Background and Notation
4.2 Order Result on Small-Sample Probabilities
4.3 The Implied Constant of Bound
5 Application of Theorem for the MLE of Parameters in Signal-Plus-Noise Problem
5.1 Background
5.2 Theorem Regularity Conditions and Calculation of Implied Constant
5.3 Numerical Results
6 Summary and Conclusions
Appendix: Theorem Regularity Conditions and Proof (Section 4)

28 A Tutorial on Hierarchical Lossless Data Compression
John C. Kieffer
1 Introduction
1.1 Pointer Tree Representations
1.2 Data Flow Graph Representations
1.3 Context-Free Grammar Representations
2 Equivalences Between Structures
2.1 Equivalence of Pointer Trees and Admissible Grammars
2.2 Equivalence of Admissible Grammars and Data Flow Graphs
3 Design of Compact Structures
4 Encoding Methodology
5 Performance Under Uncertainty
Part VIII

29 Eureka! Bellman’s Principle of Optimality is valid!
Moshe Sniedovich
1 Introduction
2 Remedies
3 The Big Fix
4 The Rest is Mathematics
5 Refinements
6 Non-Markovian Objective functions
7 Discussion

30 Reflections on Statistical Methods for Complex Stochastic Systems
Marcel F. Neuts
1 The Changed Statistical Scene
2 Measuring Teletraffic Data Streams
3 Monitoring Queueing Behavior

Author Index
Preface
This volume, titled MODELING UNCERTAINTY: An Examination of Stochastic Theory, Methods, and Applications, has been compiled by the friends and colleagues of Sid Yakowitz in his honor as a token of love, appreciation, and sorrow for his untimely death. The first paper in the book is authored by Sid’s wife – Diana Yakowitz – and in it Diana describes Sid the person, his drive for knowledge, and his fascination with mathematics, particularly with respect to uncertainty modelling and applications. This book is a collection of papers with uncertainty as its central theme. Fifty authors from all over the world collectively contributed 30 papers to this volume. Each of these papers was reviewed, and in the majority of cases the original submission was revised before being accepted for publication in the book. The papers cover a great variety of topics in probability, statistics, economics, stochastic optimization, control theory, regression analysis, simulation, stochastic programming, Markov decision processes, applications in the HIV context, and others. Some of the papers have a theoretical emphasis and others focus on applications. A number of papers have the flavor of survey work in a particular area, and in a few papers the authors present their personal view of a topic. This book has a considerable number of expository articles which should be accessible to a nonexpert, say a graduate student in a mathematics, statistics, engineering, or economics department, or just anyone with some mathematical background who is interested in a preliminary exposition of a particular topic. A number of papers present the state of the art of a specific area or represent original contributions which advance the present state of knowledge. Thus, the book has something for almost anybody with an interest in stochastic systems. The editors have loosely grouped the chapters into 8 segments, according to some common mathematical thread. 
Since none of us (the co-editors) is an expert in all the topics covered in this book, it is quite conceivable that the papers could have been grouped differently. Part 1 starts with a paper on stability in queuing networks by H.J. Kushner. Part 1 also includes a queuing related
paper by T.L. Lai, and a paper by I. Pinelis on asymptotics for large deviation probabilities. Part 2 groups together 3 papers related to HIV modelling. The first paper in this group is by W.-Y. Tan and Z. Xiang about modelling early immune responses, followed by a paper of B. Barnes and J. Gani on the impact of re-using hypodermic needles, and closes with a paper by D.S. Stoffer. Part 3 groups together optimization and regression papers. It contains 4 papers, starting with a paper by A. Nemirovski and R.Y. Rubinstein about classical stochastic approximation. The next paper is by B. Kedem and K. Fokianos on regression models for binary time series, followed by a paper by H. Walk on properties of Nadaraya-Watson regression estimates, and closing with a paper on sequential prediction of stationary time series by L. Györfi and G. Lugosi. Part 4’s 6 papers are in the area of economic analysis, starting with a nonlinear oligopolies paper by C. Chiarella and F. Szidarovszky. The paper by A. Haurie and F. Moresino examines a differential game of debt contract valuation. Next comes a paper by D. Porter, followed by a paper about complex systems in relation to affordable upgrades by J.A. Reneke, M.J. Saltzman, and M.M. Wiecek. The 5th paper in this group, by F.-Y. Wang and G.N. Saridis, concerns optimal control in stochastic dynamic systems, and the last paper, by L. Gerencsér, is about stability of random iterative mappings. Part 5 loosely groups 3 papers, starting with a paper by V. Solo on Monte Carlo methods for adaptive algorithms, followed by a paper on random search with noise by L. Devroye and A. Krzyzak, and closes with a survey paper on randomized quasi-Monte Carlo methods by P. L’Ecuyer and C. Lemieux. Part 6 is a collection of 3 papers sharing a focus on Markov decision analysis. It starts with a paper by G. Yin, Q. Zhang, K. Yin, and H. Yang on singularly perturbed Markov chains. The second paper, on risk sensitivity in average Markov decision chains, is by R. Cavazos-Cadena and E. Fernández-Gaucherand. The 3rd paper, by G.G. Roussas, is on statistical inference in a Markovian framework. Part 7 includes a paper on order statistics by P.J. Boland, T. Hu, M. Shaked, and J.G. Shanthikumar, followed by a survey paper on routing with stochastic demands by M. Dror, a paper on fast Fourier and Walsh transforms by P.J. Sanchez, J.S. Ramberg, and L. Head, a paper by J.C. Spall on parameter estimation with limited data, and a tutorial paper on data compression by J.C. Kieffer. Part 8 contains 2 ‘reflections’ papers. The first paper is by M. Sniedovich – an ex-student of Sid Yakowitz. It reexamines Bellman’s principle of optimality. The last paper in this volume, on statistical methods for complex stochastic systems, is by M.F. Neuts. The efforts of many people have gone into this volume, which would not have been possible without the collective work of all the authors and reviewers who read the papers and commented constructively. We would like to take this opportunity to thank the authors and the reviewers for their contributions. This book would have required a more difficult ’endgame’ without Ray Brice’s dedication and painstaking attention to production details. We are very grateful for Ray’s help in this project. Paul Jablonka is the artist who contributed the art work for the book’s jacket. He was a good friend to Sid and we appreciate his contribution. We would also like to thank Gary Folven, the editor of Kluwer Academic Publishers, for his initial and never-fading support throughout this project. Thank you Gary! Moshe Dror
Pierre L’Ecuyer
Ferenc Szidarovszky
Contributing Authors
B. Barnes School of Mathematical Sciences Australian National University Canberra, ACT 0200 Australia
Philip J. Boland Department of Statistics University College Dublin Belfield, Dublin 4 Ireland
Rolando Cavazos–Cadena Departamento de Estadística y Cálculo Universidad Autónoma Agraria Antonio Narro Buenavista, Saltillo COAH 25315 MÉXICO
Carl Chiarella School of Finance and Economics University of Technology Sydney P.O. Box 123, Broadway, NSW 2007 Australia
[email protected]
Luc Devroye School of Computer Science McGill University Montreal, Canada H3A 2K6
Moshe Dror Department of Management Information Systems The University of Arizona Tucson, AZ 85721, USA
[email protected]
Emmanuel Fernández–Gaucherand Department of Electrical & Computer Engineering & Computer Science University of Cincinnati Cincinnati, OH 45221-0030 USA
Konstantinos Fokianos Department of Mathematics & Statistics University of Cyprus P.O. Box 20537 Nikosia, 1678, Cyprus
J. Gani School of Mathematical Sciences Australian National University Canberra, ACT 0200 Australia
László Gerencsér Computer and Automation Institute Hungarian Academy of Sciences H-1111, Budapest Kende u 13-17 Hungary
László Györfi Department of Computer Science and Information Theory Technical University of Budapest 1521 Stoczek u. 2, Budapest, Hungary
[email protected]
A. Haurie University of Geneva Geneva Switzerland
Larry Head Siemens Energy & Automation, Inc. Tucson, AZ 85715
Taizhong Hu Department of Statistics and Finance University of Science and Technology Hefei, Anhui 230026 People’s Republic of China
Benjamin Kedem Department of Mathematics University of Maryland College Park, Maryland 20742, USA
John C. Kieffer ECE Department University of Minnesota Minneapolis, MN 55455
Adam Krzyzak Department of Computer Science Concordia University Montreal, Canada H3G 1M8
Harold J. Kushner Applied Mathematics Dept. Lefschetz Center for Dynamical Systems Brown University Providence RI 02912
Tze Leung Lai Stanford University Stanford, California
Pierre L’Ecuyer Département d’Informatique et de Recherche Opérationnelle Université de Montréal, C.P. 6128, Succ. Centre-Ville Montréal, H3C 3J7, Canada
[email protected]
Christiane Lemieux Department of Mathematics and Statistics University of Calgary, 2500 University Drive N.W. Calgary, T2N 1N4, Canada
[email protected]
Gábor Lugosi Department of Economics, Pompeu Fabra University Ramon Trias Fargas 25-27, 08005 Barcelona, Spain
[email protected]
F. Moresino Cambridge University United Kingdom
Arkadi Nemirovski Faculty of Industrial Engineering and Management Technion—Israel Institute of Technology Haifa 32000, Israel
Marcel F. Neuts Department of Systems and Industrial Engineering The University of Arizona Tucson, AZ 85721, U.S.A.
[email protected]
Iosif Pinelis Department of Mathematical Sciences Michigan Technological University Houghton, Michigan 49931
[email protected]
David Porter College of Arts and Sciences George Mason University
John S. Ramberg Systems and Industrial Engineering University of Arizona Tucson, AZ 85721
James A. Reneke Dept. of Mathematical Sciences Clemson University Clemson SC 29634-0975
George G. Roussas University of California, Davis
Reuven Y. Rubinstein Faculty of Industrial Engineering and Management Technion—Israel Institute of Technology Haifa 32000, Israel
Matthew J. Saltzman Dept. of Mathematical Sciences Clemson University Clemson SC 29634-0975
Paul J. Sanchez Operations Research Department Naval Postgraduate School Monterey, CA 93943
George N. Saridis Department of Electrical, Computer and Systems Engineering Rensselaer Polytechnic Institute Troy, New York 12180
Moshe Shaked Department of Mathematics University of Arizona Tucson, Arizona 85721 USA
J. George Shanthikumar Industrial Engineering & Operations Research University of California Berkeley, California 94720 USA
Moshe Sniedovich Department of Mathematics and Statistics The University of Melbourne Parkville VIC 3052, Australia
[email protected]
Victor Solo School of Electrical Engineering and Telecommunications University of New South Wales Sydney NSW 2052, Australia
[email protected]
James C. Spall The Johns Hopkins University Applied Physics Laboratory Laurel, MD 20723-6099
[email protected]
David S. Stoffer Department of Statistics University of Pittsburgh Pittsburgh, PA 15260
Ferenc Szidarovszky Department of Systems and Industrial Engineering University of Arizona Tucson, Arizona, 85721-0020, USA
[email protected]
Wai-Yuan Tan Department of Mathematical Sciences The University of Memphis Memphis, TN 38152-6429
[email protected]
Harro Walk Mathematisches Institut A Universität Stuttgart Pfaffenwaldring 57, D-70569 Stuttgart, Germany
Fei-Yue Wang Department of Systems and Industrial Engineering University of Arizona Tucson, Arizona 85721
Margaret M. Wiecek Dept. of Mathematical Sciences Clemson University Clemson SC 29634-0975
Zhihua Xiang Organon Inc. 375 Mt. Pleasant Avenue West Orange, NJ 07052
[email protected]
D. S. Yakowitz Tucson, Arizona
H.Yang Department of Wood and Paper Science University of Minnesota St. Paul, MN 55108
[email protected]
G. Yin Department of Mathematics Wayne State University Detroit, MI 48202
[email protected]
K. Yin Department of Wood and Paper Science University of Minnesota St. Paul, MN 55108
[email protected],
[email protected]
Q. Zhang Department of Mathematics University of Georgia Athens, GA 30602
[email protected]
This book is dedicated to the memory of Sid Yakowitz.
Chapter 1 PROFESSOR SIDNEY J. YAKOWITZ D. S. Yakowitz Tucson, Arizona
Sidney Jesse Yakowitz was born in San Francisco, California on March 8, 1937 and died in Eugene, Oregon on September 1, 1999. Sid’s parents, Morris and MaryVee, were chemists with the Food and Drug Administration and encouraged Sid to be a life-long learner. He attended Stanford University and, after briefly toying with the idea of medicine, settled into engineering (“I saved hundreds of lives with that decision!”). Sid graduated from Stanford with a B.S. in Electrical Engineering in 1960. His first job out of Stanford was as a design engineer with the University of California’s Lawrence Radiation Laboratory (LRL) at Berkeley. Sid was unhappy after college but claimed that he learned the secret to happiness from his office mate at LRL, Jim Sherwood, who told him he was being paid to be creative. Sid decided that “Good engineering design is a synonym for ‘inventing’.” For graduate school, Sid chose Arizona State University. By this time, his battle since childhood with acute asthma made a dry desert climate a mandatory consideration. In graduate school he flourished. He received his M.S. in Electrical Engineering in 1965, an M.A. in Mathematics in 1966, and a Ph.D. in Electrical Engineering in 1967. His new formula for happiness in his work led him to consider each topic or problem that he approached as an opportunity to “invent”. In 1966 Sid was hired as an Assistant Professor in the newly founded Department of Systems and Industrial Engineering at the University of Arizona in Tucson. This department remained his “home” for 33 years, with the exception of brief sabbaticals and leaves such as a National Academy of Science Postdoctoral Fellowship at the Naval Postgraduate School in Monterey, California in 1970-1971. In 1969 Sid’s book Mathematics of Adaptive Control Processes (Yakowitz, 1969) was published as a part of Richard Bellman’s Elsevier book series. This book was essentially his Ph.D. dissertation and was the first of four published
books. Later, Sid was instrumental in the popularization of differential dynamic programming (DDP). Overcoming the "curse of dimensionality", DDP made possible the solution of problems that at that time could otherwise only be solved approximately, for example high-dimensional multireservoir control problems. His paper with then Ph.D. student Dan Murray in Water Resources Research (Murray and Yakowitz, 1979) demonstrated quite dramatically what could be done with DDP.

In addition to his own prolific accomplishments, Sid had another important talent: the ability to recognize talent in others. He enthusiastically collaborated with colleagues on numerous subjects including hydrology, economics, information theory, statistics, numerical methods, and machine learning. Sid's international work started in 1973 with his participation in a joint NSF-sponsored US-Hungarian research project. According to Ferenc Szidarovszky (Szidar), also involved in the project, Sid's extraordinary talent for combining probabilistic and statistical ideas with numerical computations made him one of the most important contributors. Several papers on uncertainty in water resources management, conference presentations, and invited lectures were the result of this grant, which was renewed until 1981. This was the period of his most intensive collaboration with his many Hungarian colleagues. This cooperation also resulted in two textbooks on numerical analysis with Szidar, Principles and Procedures of Numerical Analysis (Szidarovszky and Yakowitz, 1978) and An Introduction to Numerical Computations (Yakowitz and Szidarovszky, 1986). Long after the project terminated, Sid continued to collaborate with Hungarian scientists, who often visited him in Tucson and enjoyed his hospitality.

Sid's ability to combine probabilistic ideas and numerical computations made him an expert in simulation. His book Computational Probability and Simulation (Yakowitz, 1977) is considered one of the best of its kind.
His paper on weighted Monte Carlo integration (Yakowitz et al., 1978) offered a new integration method that was much faster than the known classical procedures. Sid had a very successful six-year cooperation with Professor Kolumban Hutter of ETH Zurich and Szidar, working on an NSF-sponsored project on the mathematical modeling and computer solution of ice sheets, glaciers, and avalanches. In this work, Sid's expertise in numerical analysis was the essential factor in solving large-scale differential equations with unusual boundary and normalizing conditions (Yakowitz et al., 1985; Hutter et al., 1986a; Yakowitz et al., 1986; Hutter et al., 1986b; Hutter et al., 1987; Szidarovszky et al., 1987; Hutter et al., 1987; Szidarovszky et al., 1989).

Sid's algorithmic way of thinking resulted in two major contributions to game theory. With Szidar he developed a new proof of the existence of a unique equilibrium of Cournot oligopolies; the proof is constructive, offering an algorithm to find the equilibrium. This paper (Szidarovszky and Yakowitz, 1977) is one of
the most cited papers in this field and has been republished in the Handbook of Mathematical Economics. They were also able to extend the constructive proof to the case in which the price and cost functions are not differentiable. They proved that even in the case of multiple equilibria, the total output of the industry is unique and the set of all equilibria is a simplex. They also considered the effect of coalition formation on the profit functions (Szidarovszky and Yakowitz, 1982).

Sid was an expert in time series, both parametric and nonparametric. On the nonparametric side he made contributions regarding nearest neighbor methods applied to time series prediction, density and transition function estimation for Markov sequences, and pattern recognition (Denny and Yakowitz, 1978; Schuster and Yakowitz, 1979; Yakowitz, 1979; Szilagyi et al., 1984; Yakowitz, 1987; Yakowitz, 1988; Yakowitz, 1989d; Rutherford and Yakowitz, 1991; Yakowitz and Lowe, 1991; Yakowitz and Tran, 1993; Yakowitz, 1993a; Morvai et al., 1998; Yakowitz et al., 1999). In particular, Sid worked for many years in the area of stochastic hydrology, analyzing hydrologic time series such as flood and rainfall data to investigate their major statistical properties and to use them for forecasting (Yakowitz, 1973; Denny et al., 1974; Yakowitz, 1976; Yakowitz, 1976; Szidarovszky and Yakowitz, 1976; Yakowitz, 1979; Yakowitz and Szidarovszky, 1985; Karlsson and Yakowitz, 1987a; Karlsson and Yakowitz, 1987b; Naokes et al., 1988; Yakowitz and Lowe, 1991). On the parametric side, Sid applied his deep understanding of linear filtering of stationary time series to the problem of frequency estimation in the presence of noise. Here he authored several papers on frequency estimation using contraction mappings, constructed from the first-order autocorrelation, that involved sophisticated sequences of linear filters with a shrinking bandwidth.
In particular, he showed that the contraction mapping of He and Kedem, which requires a certain filtering property, can be extended quite broadly. This extension, together with the shrinking bandwidth, was very insightful (Yakowitz, 1991; Yakowitz, 1993c; Li et al., 1994; Kedem and Yakowitz, 1994; Yakowitz, 1994a). He found numerous applications of nonparametric statistical methods in machine learning (Yakowitz, 1989c; Yakowitz and Lugosi, 1990; Yakowitz et al., 1992a; Yakowitz and Kollier, 1992; Yakowitz and Mai, 1995; Lai and Yakowitz, 1995). As a counterpart to his earlier work on numerical computation, Sid introduced a course at the University of Arizona on non-numerical computation. This course, which resulted in an unpublished textbook on the topic, developed methods applicable to machine learning, games, and epidemics. Sid loved this topic dearly and enjoyed teaching it. He continued to explore this area up to the time of his death.

In 1986 Sid met Joe Gani, and they worked together intermittently from that time until Sid's death. Over a period of 13 years, Sid and Joe (together with students and colleagues) wrote 10 joint papers. Their earliest interest was in the silting of dams, which they studied with Peter Todorovic of UCSB (Gani et al.,
1987), followed by a paper on the prediction of reservoir lifetimes under silting (Gani and Yakowitz, 1989). In both, they made use of the Moran Markovian reservoir model. By this time, Sid and Joe had been awarded an NIH grant, which lasted until Joe's retirement from the University of California, Santa Barbara in 1994. They used the grant to analyze various epidemic models, starting with cellular automaton models (with Sid's student R. Hayes) in 1990 (Yakowitz et al., 1990) and automatic learning for dynamic Markov fields in epidemiology in 1992 (Yakowitz et al., 1992b). Sid soloed on a decision-model paper for the AIDS epidemic that year (Yakowitz, 1992), as well as collaborating with Joe on a basic paper on the spread of HIV among intravenous drug users (Gani and Yakowitz, 1993). Two further papers followed in 1995, one on interacting groups in AIDS epidemics (Gani and Yakowitz, 1995) and another on deterministic approximations to epidemic Markov processes (Gani and Yakowitz, 1995). More work followed on expectations for large compartmentalized models for the spread of AIDS in prisons (with Sid's student M. Blount) (Yakowitz et al., 1996; Gani et al., 1997). In these problems, Sid brought to bear his vast knowledge of Markov processes and, as Joe describes it, his formidable computational skills.

Sid spent part of his last sabbatical in Australia working with Joe Gani. This work culminated in a paper, published posthumously in 2000, on individual susceptibilities and infectivities (with D.J. Daley and J. Gani) (Daley et al., 2000). Sid was extremely proud of the work on this topic, as well as of the paper on the spread of AIDS in prisons (Yakowitz et al., 1996). Both of these papers required delicate analysis and careful computation, at both of which he excelled.

The above is surely a truncated and inadequate representation of the impact that Sid has had on the many topics in which he was interested and the people with whom he worked over the years.
Sid authored over 100 journal papers and books, and probably an equal number of papers in conference proceedings. The single common thread woven into all of his research is uncertainty. All of his publications have a probabilistic component, and this book is perhaps the most appropriate tribute that could be given to him. "Write it down!" is an admonishment that he made on many occasions. His own writing was greatly influenced by his love of literature, and he felt quite strongly that technical writers could learn much from the art of creative writing. Perhaps Sid said it best:

...readers come to us to learn about certain specific facts. But they will value us more if we give them more insight and entertainment than they bargained for.
from a letter to Gene Koppel on the value of creative writing courses
He hoped not only to pass on information but to inspire further investigation into the topics. This desire to inspire often led him to experiment in his teaching. He wanted to pass on more than just the information in the textbook and he was hard on
himself when he did not succeed. Sid was not known for a polished teaching style; he was stressed and anxious over lectures and presentations.

When I have to go somewhere to give an invited lecture, I always take crayon drawings that [my children] have made for me. I put them on the podium and they give me strength.
from a letter to his friend Steve Berry
This lifelong battle with "stage fright" often led to comic episodes of forgetfulness, fumbling, and embarrassment. He had little patience with laziness or disinterest from his students, but those who took an interest in the subject, and the time to get to know Sid, grew to love him for his honesty, insight, enthusiasm, humor, and cheerfulness.

Sid was never too busy to answer a question. His work ethic defined, for me and many other graduate students, a standard against which we measure ourselves.
Manbir Sodhi

I am privileged to have been his student. I will never forget his sense of humor and the fact that he didn't take anything too seriously, including himself.
T. Jayawardena

He distinguished between principle and process. He warned... against becoming lost in process, and forgetting the initial principles with which they were to concern themselves. ...He inveighed passionately against engineers simply putting their heads down, and becoming lost in the mechanistic ritual of their jobs.
John Stevens Berry, Stanford roommate and lifelong friend
I leave you with an image that most will find familiar, and many endearing: the Yakowitz grin. Not a relentlessly upbeat smiley-face grin, but not a sneer either. A smile that asked for very little as a reason for its appearance: anything at all in any way amusing that he could share with others.
Robert Hymer, Stanford roommate and lifelong friend
I wish to thank Joe Gani, Ben Kedem, Dan Murray and Szidar for their assistance in preparing this introduction. I and all of Sid’s family members are grateful to Moshe Dror, Szidar, and Pierre L’Ecuyer for proposing and working so hard as editors of this tribute to my husband and mentor.
PUBLICATIONS OF SID YAKOWITZ

Books

Yakowitz, S. (1969). Mathematics of Adaptive Control Processes. Elsevier, New York.
Yakowitz, S. (1977). Computational Probability and Simulation. Addison-Wesley, Reading, MA.
Szidarovszky, F. and S. Yakowitz. (1978). Principles and Procedures of Numerical Analysis. Plenum Press, New York.
Yakowitz, S. and F. Szidarovszky. (1986). An Introduction to Numerical Computations. Macmillan, New York [2nd edn 1989].

Papers

Yakowitz, S. and J. Spragins. (1968). On the identifiability of finite mixtures. Ann. Math. Statist. 39, 209-214.
Yakowitz, S. (1969). A consistent estimator for the identification of finite mixtures. Ann. Math. Statist. 40, 1728-1735.
Yakowitz, S. (1970). Unsupervised learning and the identification of finite mixtures. IEEE Trans. Inform. Theory 16, 330-338.
Fisher, L. and S. Yakowitz. (1970). Estimating mixing contributions in metric spaces. Sankhya A 32, 411-418.
Yakowitz, S. and L. Fisher. (1973). On sequential search for the maximum of an unknown function. J. Math. Anal. Appl. 41, 234-359.
Yakowitz, S. and S. Parker. (1973). Computation of bounds for digital filter quantization errors. IEEE Trans. Circuit Theory 20, 391-396.
Yakowitz, S. (1973). A stochastic model for daily river flows in an arid region. Water Resources Research 9, 1271-1285.
Yakowitz, S. (1974). Multiple hypothesis testing by finite-memory algorithms. Ann. Statist. 2, 323-336.
Yakowitz, S., L. Duckstein, and C. Kisiel. (1974). Decision analysis of a gamma hydrologic variate. Water Resources Research 10, 695-704.
Denny, J., C. Kisiel, and S. Yakowitz. (1974). Procedures for determining the order of Markov dependence in streamflow records. Water Resources Research 10, 947-954.
Parker, S. and S. Yakowitz. (1975). A general method for calculating quantization error bounds due to round off in multivariate digital filters. IEEE Trans. Circuits Systems 22, 570-572.
Sagar, B., S. Yakowitz, and L. Duckstein. (1975). A direct method for the identification of the parameters of dynamic nonhomogeneous aquifers. Water Resources Research 11, 563-570.
Szidarovszky, F., S. Yakowitz, and R. Krzysztofowicz. (1975). A Bayes approach for simulating sediment yield. J. Hydrol. Sci. 3, 33-45.
Fisher, L. and S. Yakowitz. (1976). Uniform convergence of the potential function algorithm. SIAM J. Control Optim. 14, 95-103.
Yakowitz, S. (1976). Small sample hypothesis tests of Markov order with application to simulated and hydrologic chains. J. Amer. Statist. Assoc. 71, 132-136.
Yakowitz, S. and P. Noren. (1976). On the identification of inhomogeneous parameters in dynamic linear partial differential equations. J. Math. Anal. Appl. 53, 521-538.
Yakowitz, S. (1976). Model-free statistical methods for water table prediction. Water Resources Research 12, 836-844.
Yakowitz, S., T.L. William, and G.D. Williams. (1976). Surveillance of several Markov targets. IEEE Trans. Inform. Theory 22, 716-724.
Szidarovszky, F. and S. Yakowitz. (1976). Analysis of flooding for an open channel subject to random inflow and blockage. J. Hydrol. Sci. 3, 93-103.
Duckstein, L., F. Szidarovszky, and S. Yakowitz. (1977). Bayes design of a reservoir under random sediment yield. Water Resources Research 13, 713-719.
Szidarovszky, F. and S. Yakowitz. (1977). A new proof of the existence and uniqueness of the Cournot equilibrium. Int. Econom. Rev. 18, 181-183.
Denny, J. and S. Yakowitz. (1978). Admissible run-contingency type tests for independence and Markov dependence. J. Amer. Statist. Assoc. 73, 117-181.
Yakowitz, S., J. Krimmel, and F. Szidarovszky. (1978). Weighted Monte Carlo integration. SIAM J. Numer. Anal. 15, 1289-1300.
Schuster, R. and S. Yakowitz. (1979). Contributions to the theory of nonparametric regression with application to system identification. Ann. Statist. 7, 139-149.
Yakowitz, S. (1979). Nonparametric estimation of Markov transition functions. Ann. Statist. 7, 671-679.
Neuman, S. and S. Yakowitz. (1979). A statistical approach to the inverse problem of aquifer hydrology: Part 1. Theory. Water Resources Research 15, 845-860.
Murray, D. and S. Yakowitz. (1979). Constrained differential dynamic programming and its application to multireservoir control. Water Resources Research 15, 1017-1027.
Yakowitz, S. (1979). A nonparametric Markov model for daily river flow. Water Resources Research 15, 1035-1043.
Krzysztofowicz, R. and S. Yakowitz. (1980). Large-sample methods analysis of gamma variates. Water Resources Research 16, 491-500.
Yakowitz, S. and L. Duckstein. (1980). Instability in aquifer identification theory and case studies. Water Resources Research 16, 1045-1064.
Pebbles, R., R. Smith, and S. Yakowitz. (1981). A leaky reservoir model for ephemeral flow recession. Water Resources Research 17, 628-636.
Murray, D. and S. Yakowitz. (1981). The application of optimal control methodology to non-linear programming problems. Math. Programming 21, 331-347.
Szidarovszky, F. and S. Yakowitz. (1982). Contributions to Cournot oligopoly theory. J. Econom. Theory 28, 51-70.
Yakowitz, S. (1982). Dynamic programming applications in water resources. Water Resources Research 18, 673-696.
Yakowitz, S. (1983). Convergence rate of the state increment dynamic programming method. Automatica 19, 53-60.
Yakowitz, S. and B. Rutherford. (1984). Computational aspects of discrete-time optimal control. Appl. Math. Comput. 15, 29-45.
Szilagyi, M., S. Yakowitz, and M. Duff. (1984). A procedure for electron and ion lens optimization. Appl. Phys. Lett. 44, 7-9.
Murray, D. and S. Yakowitz. (1984). Differential dynamic programming and Newton's method for discrete optimal control problems. J. Optim. Theory Appl. 42, 395-415.
Yakowitz, S. (1985). Nonparametric density estimation, prediction and regression for Markov sequences. J. Amer. Statist. Assoc. 80, 215-221.
Yakowitz, S. (1985). Markov flow models and the flood warning problem. Water Resources Research 21, 81-88.
Yakowitz, S. and F. Szidarovszky. (1985). A comparison of Kriging with nonparametric regression methods. J. Multivariate Anal. 6, 21-53.
Yakowitz, S., K. Hutter, and F. Szidarovszky. (1985). Toward computation of steady-state profiles of ice sheets. Z. für Gletscherkunde 21, 283-289.
Schuster, E. and S. Yakowitz. (1985). Parametric/nonparametric mixture density estimation with application to flood frequency analysis. Water Resources Bulletin 21, 797-804.
Yakowitz, S. (1986). A stagewise Kuhn-Tucker condition and differential dynamic programming. IEEE Trans. Automat. Control 31, 25-30.
Hutter, K., S. Yakowitz, and F. Szidarovszky. (1986a). A numerical study of plane ice sheet flow. J. Glaciology 32, 139-160.
Yakowitz, S., K. Hutter, and F. Szidarovszky. (1986). Elements of a computational theory for glaciers. J. Comput. Phys. 66, 132-150.
Hutter, K., F. Szidarovszky, and S. Yakowitz. (1986b). Plane steady shear flow of a cohesionless granular material down an inclined plane – a model for flow avalanches: Part I. Theory. Acta Mechanica 63, 87-112.
Hutter, K., F. Szidarovszky, and S. Yakowitz. (1987). Plane steady shear flow of a cohesionless granular material down an inclined plane – a model for flow avalanches: Part II. Numerical results. Acta Mechanica 65, 239-261.
Yakowitz, S. (1987). Nearest neighbour methods in time-series analysis. J. Time Series Anal. 2, 235-247.
Szidarovszky, F., K. Hutter, and S. Yakowitz. (1987). A numerical study of steady plane granular chute flows using the Jenkins-Savage model and its extensions. Int. J. Numer. Methods Eng. 24, 1993-2015.
Hutter, K., S. Yakowitz, and F. Szidarovszky. (1987). Coupled thermomechanical response of an axisymmetrical cold ice-sheet. Water Resources Research 23, 1327-1339.
Sen, S. and S. Yakowitz. (1987). A quasi-Newton differential dynamic programming algorithm for discrete-time optimal control. Automatica 23, 749-752.
Karlsson, M. and S. Yakowitz. (1987a). Nearest-neighbor methods for nonparametric rainfall-runoff forecasting. Water Resources Research 23, 1300-1308.
Karlsson, M. and S. Yakowitz. (1987b). Rainfall-runoff forecasting methods, old and new. Stoch. Hydrol. Hydraul. 1, 303-318.
Gani, J., P. Todorovic, and S. Yakowitz. (1987). Silting of dams by sedimentary particles. Math. Scientist 12, 81-90.
Naokes, D., K. Hipel, A.I. McLeod, and S. Yakowitz. (1988). Forecasting annual geophysical time series. Int. J. Forecasting 4, 103-115.
Yakowitz, S. (1988). Parametric and nonparametric density estimation to account for extreme events. Adv. Appl. Prob. 20, 13.
Szidarovszky, F., K. Hutter, and S. Yakowitz. (1989). Computational ice-divide analysis of a cold plane ice sheet under steady conditions. Ann. Glaciology 12, 170-178.
Yakowitz, S. (1989a). Algorithms and computational techniques in differential dynamic programming. Control Dynamic Systems 31, 75-91.
Yakowitz, S. (1989b). Theoretical and computational advances in differential dynamic programming. Control Cybernet. 17, 172-189.
Yakowitz, S. (1989c). A statistical foundation for machine learning, with application to Go-Moku. Comput. Math. Appl. 17, 1095-1102.
Yakowitz, S. (1989d). Nonparametric density and regression estimation for Markov sequences without mixing assumptions. J. Multivariate Anal. 30, 124-136.
Gani, J. and S. Yakowitz. (1989). A probabilistic sedimentation analysis for predicting reservoir lifetime. Water Resources Management 3, 191-203.
Yakowitz, S. and E. Lugosi. (1990). Random search in the presence of noise, with application to machine learning. SIAM J. Sci. Statist. Comput. 11, 702-712.
Yakowitz, S., J. Gani, and R. Hayes. (1990). Cellular automaton modeling of epidemics. Appl. Math. Comput. 40, 41-54.
Rutherford, B. and S. Yakowitz. (1991). Error inference for nonparametric regression. Ann. Inst. Statist. Math. 43, 115-129.
Yakowitz, S. and W. Lowe. (1991). Nonparametric bandit methods. Ann. Operat. Res. 28, 297-312.
Dietrich, R.D. and S. Yakowitz. (1991). A rule based approach to the trim-loss problem. Int. J. Prod. Res. 29, 401-415.
Yakowitz, S. (1991). Some contributions to a frequency location problem due to He and Kedem. IEEE Trans. Inform. Theory 37, 1177-1182.
Yakowitz, S., T. Jayawardena, and S. Li. (1992a). Theory for automatic learning under partially observed Markov-dependent noise. IEEE Trans. Automat. Control 37, 1316-1324.
Yakowitz, S., R. Hayes, and J. Gani. (1992b). Automatic learning for dynamic Markov fields with application to epidemiology. Operat. Res. 40, 867-876.
Yakowitz, S. and M. Kollier. (1992). Machine learning for optimal blackjack counting strategies. J. Statist. Plann. Inference 33, 295-309.
Yakowitz, S. (1992). A decision model and methodology for the AIDS epidemic. Appl. Math. Comput. 52, 149-172.
Yakowitz, S. and L.T. Tran. (1993). Nearest neighbor estimators for random fields. J. Multivariate Anal. 44, 23-46.
Yakowitz, S. (1993a). Nearest neighbor regression estimation for null-recurrent Markov time series. Stoch. Proc. Appl. 48, 311-318.
Gani, J. and S. Yakowitz. (1993). Modeling the spread of HIV among intravenous drug users. IMA J. Math. Appl. Medicine Biol. 10, 51-65.
Yakowitz, S. (1993b). A globally convergent stochastic approximation. SIAM J. Control Optim. 31, 30-40.
Yakowitz, S. (1993c). Asymptotic theory for a fast frequency detector. IEEE Trans. Inform. Theory 39, 1031-1036.
Li, T.H., B. Kedem, and S. Yakowitz. (1994). Asymptotic normality of sample autocovariances with an application in frequency estimation. Stoch. Proc. Appl. 52, 329-349.
Pinelis, I. and S. Yakowitz. (1994). The time until the final zero-crossing of random sums with application to nonparametric bandit theory. Appl. Math. Comput. 63, 235-263.
Kedem, B. and S. Yakowitz. (1994). Practical aspects of a fast algorithm for frequency detection. IEEE Trans. Commun. 42, 2760-2767.
Yakowitz, S. (1994a). Review of Time Series Analysis of Higher Order Crossings, by B. Kedem. SIAM Rev. 36, 680-682.
Yakowitz, S. (1994b). From a microcosmic IVDU model to a macroscopic HIV epidemic. In Modeling the AIDS Epidemic: Planning, Policy, and Prediction, eds E.H. Kaplan and M.L. Brandeau. Raven Press, New York, pp. 365-383.
Yakowitz, S. and J. Mai. (1995). Methods and theory for off-line machine learning. IEEE Trans. Automat. Control 40, 161-165.
Gani, J. and S. Yakowitz. (1995). Computational and stochastic methods for interacting groups in the AIDS epidemic. J. Comput. Appl. Math. 59, 207-220.
Yakowitz, S. (1995). Computational methods for Markov series with large state spaces, with application to AIDS modeling. Math. Biosci. 127, 99-121.
Lai, T.L. and S. Yakowitz. (1995). Machine learning and nonparametric bandit theory. IEEE Trans. Automat. Control 40, 1199-1209.
Gani, J. and S. Yakowitz. (1995). Error bounds for deterministic approximation to Markov processes, with applications to epidemic models. J. Appl. Prob. 32, 1063-1076.
Yakowitz, S. and R.D. Dietrich. (1996). Sequential design with application to the trim-loss problem. Int. J. Prod. Res. 34, 785-795.
Tran, L., G. Roussas, S. Yakowitz, and B. Van Troung. (1996). Fixed-design regression for linear time series. Ann. Statist. 24, 975-991.
Jayawardena, T. and S. Yakowitz. (1996). Methodology for the stochastic graph completion time problem. INFORMS J. Comput. 8, 331-342.
Morvai, G., S. Yakowitz, and L. Györfi. (1996). Nonparametric inferences for ergodic, stationary time series. Ann. Statist. 24, 370-379.
Yakowitz, S., M. Blount, and J. Gani. (1996). Computing marginal expectations for large compartmentalized models with application to AIDS evolution in a prison system. IMA J. Math. Appl. Medicine Biol. 13, 223-244.
Blount, S., A. Galambosi, and S. Yakowitz. (1997). Nonlinear and dynamic programming for epidemic intervention. Appl. Math. Comput. 86, 123-136.
Gani, J., S. Yakowitz, and M. Blount. (1997). The spread and quarantine of HIV infection in a prison system. SIAM J. Appl. Math. 57, 1510-1530.
Morvai, G., S. Yakowitz, and P. Algoet. (1998). Weakly convergent nonparametric forecasting of stationary time series. IEEE Trans. Inform. Theory 44, 886-892.
Yakowitz, S., L. Györfi, J. Kieffer, and G. Morvai. (1999). Strongly consistent nonparametric forecasting and regression for stationary ergodic sequences. J. Multivariate Anal. 71, 24-41.
Daley, D.J., J. Gani, and S. Yakowitz. (2000). An epidemic with individual infectivities and susceptibilities. Math. Comput. Modelling 32, 155-167.
Part I

Chapter 2

STABILITY OF SINGLE CLASS QUEUEING NETWORKS

Harold J. Kushner
Applied Mathematics Dept.
Lefschetz Center for Dynamical Systems
Brown University
Providence RI 02912 *
Abstract
The stability of queueing networks is a fundamental problem in modern communications and computer networks. Stability (or recurrence) is known under various independence or ergodic conditions on the service and interarrival time processes if the “fluid or mean approximation” is asymptotically stable. The basic property of stability should be robust to variations in the data. Perturbed Liapunov function methods are exploited to give effective criteria for the recurrence under very broad conditions on the “driving processes” if the fluid approximation is asymptotically stable. In particular, stationarity is not required, and the data can be correlated. Various single class models are considered. For the problem of stability in heavy traffic, where one is concerned with a sequence of queues, both the standard network model and a more general form of the Skorohod problem type are dealt with and recurrence, uniformly in the heavy traffic parameter, is shown. The results can be extended to account for many of the features of queueing networks, such as batch arrivals and processing or server breakdown. While we concentrate on the single class network, analogous results can be obtained for multiclass systems.
1. INTRODUCTION
Queueing networks are ubiquitous in modern telecommunications and computer systems, and much effort has been devoted to the study of their stability properties. Consider a system where there are K processing stations, each with an infinite buffer. The stations might have exogenous input streams (inputs

* Supported in part by NSF grants ECS 9703895 and ECS 9979250 and ARO contract DAAD19-99-1-0-223.
from the outside of the network) as well as inputs from the other stations. Each customer eventually leaves the system. The service is first come first served (FCFS) at each station, and the service time distributions depend only on the station. This paper is concerned with the stability of such systems. The basic analysis supposes that the systems are working under conditions of heavy traffic. Then the result is specialized to the case of a fixed system in arbitrary (not necessarily heavy) traffic. Loosely speaking, by heavy traffic we mean that the fraction of time that the processors are idle is small; equivalently, the traffic intensity at each processor is close to unity. Heavy traffic is quite common in modern computer and communications systems, and it also models the effects of "bottleneck" nodes in general. Many of the queueing systems of current interest are much too complicated to be directly solvable. Under conditions of heavy traffic, laws of large numbers and central limit theorems can be used to greatly simplify the problem.

Most work on stability has dealt with the more difficult multiclass case (although under simpler conditions on the probabilistic structure of the driving random variables), where each station can work on several job classes and there are strict priorities (Banks and Dai, 1997; Bertsimas et al., 1996; Bramson, 1994; Bramson and Dai, 1999; Chen and Zhang, 2000; Dai, 1995; Dai, 1996; Dai and Vande Vate, 2001; Dai et al., 1999; Dai and Weiss, 1995; Down and Meyn, 1997; Dai and Meyn, 1995; Kumar and Meyn, 1995; Kumar and Seidman, 1990; Lin and Kumar, 1984; Lu and Kumar, 1991; Meyn and Down, 1994; Perkins and Kumar, 1989; Rybko and Stolyar, 1992). A typical result is that the system is stable if a certain "fluid" or "averaged" model is stable.
The interarrival and service intervals are usually assumed to be mutually independent, with the members of each set being independent and identically distributed, and the routing is "Markov." Stability was shown in Bramson (1996) for a class of FIFO networks where the service times do not depend on the class. The counterexamples in Bramson (1994), Kumar and Seidman (1990), Lu and Kumar (1991), Rybko and Stolyar (1992), and Seidman (1994) have shown that the multiclass problem is quite subtle and that even apparently reasonable strategies for scheduling the competing classes can be unstable. The stability situation is simpler when the service at each processor is FCFS and the service time distribution depends only on the processor; here too a typical result is that the system is stable if a certain "fluid" or "averaged" model is stable. The i.i.d. assumption on the interarrival (resp., service) times and the Markov assumption on the routing are common, although results are available under certain "stationary-ergodic" assumptions (Baccelli and Foss, 1994; Borovkov, 1986). In the purely Markov chain context, where one deals with a reflected random walk, works using the classical stochastic stability techniques for Markov chains include Fayolle (1989), Fayolle et al. (1995), and Malyshev (1995).
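To make the single class FCFS setting concrete, the following toy discrete-time simulation (an illustration only, not the construction used in this chapter; the function name and all parameter values are hypothetical) builds a K-station network with Bernoulli arrivals and services and Markov routing, and estimates the time-averaged queue lengths. When the traffic equations give an intensity below one at every station, the simulated queues remain bounded, in line with the stability results just described.

```python
import random

def simulate_network(T, lam, mu, P, seed=0):
    """Toy discrete-time simulation of a single-class FCFS network.

    lam[i]: per-slot Bernoulli probability of an exogenous arrival at station i
    mu[i]:  per-slot Bernoulli probability of a service completion at station i
    P[i][j]: probability that a customer leaving station i is routed to j
             (it leaves the network with probability 1 - sum_j P[i][j])
    Returns the time-averaged queue length at each station.
    """
    rng = random.Random(seed)
    K = len(lam)
    q = [0] * K
    totals = [0] * K
    for _ in range(T):
        for i in range(K):          # exogenous arrivals
            if rng.random() < lam[i]:
                q[i] += 1
        for i in range(K):          # service completions and Markov routing
            if q[i] > 0 and rng.random() < mu[i]:
                q[i] -= 1
                u, acc = rng.random(), 0.0
                for j in range(K):
                    acc += P[i][j]
                    if u < acc:     # routed to station j
                        q[j] += 1
                        break       # otherwise the customer departs the network
        for i in range(K):
            totals[i] += q[i]
    return [t / T for t in totals]
```

For a hypothetical two-station tandem line with lam = [0.3, 0.0], mu = [0.5, 0.5], and P = [[0, 1], [0, 0]], the traffic equations give intensity 0.6 at both stations, and the averages stay small; raising any intensity past one makes the corresponding average grow with T.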
For the single class case, and under the assumption that the "fluid" approximation is stable, it will be seen that stability holds under quite general assumptions on the probabilistic structure of the interarrival and service intervals and on the routing processes. This will be demonstrated by use of the perturbed Liapunov function methods of Kushner (1984).

The basic class of models under heavy traffic will be defined in Section 2, where various assumptions are stated to put the problem into a context of interest in current applications. These are for intuitive guidance only, and weaker conditions will be used in the actual stability results. The basic class of models, which is motivated by the heavy traffic analysis of queueing networks, is then generalized to include a form of the so-called "Skorohod problem" model, which covers a broader class of systems. Stability under heavy traffic is actually stability of a sequence of queueing systems, and it is "uniform" in the traffic intensity parameter as that parameter tends to unity. The stability results depend on a basic theorem of Dupuis and Williams (1994), which established the existence of a Liapunov function for the fluid approximation; this is stated in Section 3. The idea of perturbed Liapunov functions is developed in Section 4. Section 5 gives the main stability theorem for the sequence of systems in heavy traffic, as well as the result for a single queue which is not necessarily in heavy traffic. The results can be extended to account for many of the features of queueing networks, such as batch arrivals and processing or server breakdown.
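The principle that a stable fluid (mean) model implies stability of the network can be visualized with a crude Euler integration of single-class fluid equations; this is a minimal sketch under assumed dynamics, with the clipping at zero serving as a stand-in for the reflection term, and all rates hypothetical.

```python
def fluid_trajectory(x0, lam, mu, P, dt=0.01, T=50.0):
    """Euler integration of a single-class fluid (mean) model:

        xdot_i = lam_i - mu_i * 1{x_i > 0} + sum_j mu_j * P[j][i] * 1{x_j > 0},

    with each coordinate clipped at zero as a crude stand-in for the
    reflection term.  Returns the state at time T.
    """
    x = list(x0)
    K = len(x)
    for _ in range(int(T / dt)):
        busy = [1.0 if xi > 0.0 else 0.0 for xi in x]
        for i in range(K):
            rate = lam[i] - mu[i] * busy[i]
            for j in range(K):
                rate += mu[j] * P[j][i] * busy[j]
            x[i] = max(0.0, x[i] + dt * rate)
    return x
```

With the same hypothetical tandem data as above (lam = [0.3, 0.0], mu = [0.5, 0.5], P = [[0, 1], [0, 0]]) and any starting fluid level, the trajectory drains toward the origin, which is the fluid-stability property that the perturbed Liapunov function arguments exploit.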
2.
THE MODEL
A classical queueing network: The heavy traffic scalings. Heavy traffic analysis works with a sequence of queues, indexed by and with arrival and service "rates" depending on such that, as the fraction of time at which the processors are idle goes to zero. With appropriate scaling and under broad conditions, the sequence of scaled queues converges weakly to a process which is the solution to a reflected stochastic differential equation (Kushner, 2000; Reiman, 1984).1 Let there be K processors or servers. Let denote the size of the vector of queues in the network at real time for the system in the sequence. There are two scalings in common use. One scaling works with the state process defined by where both time and amplitude are "squeezed." This is used for classical queueing problems where the interarrival and service intervals are O(1). In many modern communications and computer systems, the arrivals and services (say, arrivals and transmissions of the packets) are fast, and then one commonly works with the second scaling. We will concentrate on the first scaling, although all the results are transferable to the second scaling with an appropriate identification of variables.
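For intuition (this example is not from the chapter), the effect of the first scaling can be seen on a single discrete-time Bernoulli queue with traffic intensity 1 − c/√n: the raw queue excursions grow with n, while the diffusion-scaled path Q(nt)/√n stays O(1). A hedged sketch, with all rates hypothetical:

```python
import random

def scaled_sup(n, horizon=1.0, c=1.0, seed=0):
    """Simulate a discrete-time single-server queue with traffic intensity
    rho_n = 1 - c/sqrt(n) for n*horizon slots and return the supremum of
    the diffusion-scaled path Q(nt)/sqrt(n).  All rates are hypothetical."""
    rng = random.Random(seed)
    lam = 0.5 * (1.0 - c / n ** 0.5)   # arrival probability per slot
    mu = 0.5                           # service probability per slot
    q, sup_scaled = 0, 0.0
    for _ in range(int(n * horizon)):
        if rng.random() < lam:
            q += 1
        if q > 0 and rng.random() < mu:
            q -= 1
        sup_scaled = max(sup_scaled, q / n ** 0.5)
    return sup_scaled

# The scaled supremum stays moderate as n grows; the raw queue does not.
sups = [scaled_sup(n) for n in (400, 1600, 6400)]
```

The point of the sketch is only the uniformity in n: the scaled supremum does not blow up along the sequence, which is the "uniform in the traffic intensity parameter" stability discussed below.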
Consider the first scaling where the model is of the type used in Reiman (1984). See also Kushner (2000; Chapter 6). Let denote the interarrival interval for exogenous arrivals to and let denote the service interval at Define times the number of exogenous arrivals to by real time and define analogously for the service completions there. Let denote times the number of departures from by real time that have gone to For some centering constants which will be further specified below, define the processes:
If the upper index of a sum is not an integer, we always take the integer part. Let denote the indicator function of the event that the departure from goes to Define the "routing" vector In applications to communications systems, where the second (the "fast") scaling would be used, the scale parameter is the actual physical speed or size of the physical system. Then, the actual interarrival and service intervals would be defined by and resp. Thus, in this latter case we are working with a sequence of systems of increasing speed. Under the given conditions, the results in Section 5 will state that the processes are uniformly (in ) stable for large

Assumptions for the classical network in heavy traffic. Three related types of models will be dealt with. The first form is the classical queueing network type described above. Conditions (A2.1)–(A2.3) are typical of those used at present for this problem class. These are given for motivational purposes and to help illustrate the basic "averaging" method. The actual conditions to be used are considerably weaker. The second type of model, called the Skorohod problem, includes the first model, but it covers other queueing systems which are not of the above network form (see, for example, Kushner (2000; Chapter 7)). Finally, it will be seen that the stability results are valid even without the heavy traffic assumptions, provided that the mean flows are stable.

A2.0. For each the initial condition (0) is independent of all future arrival and service times, of all routing decisions, and of the other future driving random variables.

A2.1. The set of service and interarrival intervals is independent of the set of routing decisions. There are constants such that The set of processes defined in (2.1) is tight in the Skorohod topology (Billingsley, 1968; Ethier and Kurtz, 1986) and has continuous weak sense limits.
A2.2. The are mutually independent for each There are constants and such that and The spectral radius of the matrix is less than unity. Write

The spectral radius condition in (A2.2) implies that each customer will eventually leave the system and that the number of services of each customer is stochastically bounded by a geometrically distributed random variable. We will often write It is convenient to introduce the definition

Condition (A2.3) is known as the heavy traffic condition. It quantifies the rate at which the difference between the mean input rate and the mean output rate goes to zero at each processor as

A2.3. There are real such that
Note that (2.2) implies that
or, equivalently, that
Let (resp., ) denote times the number of exogenous arrivals to (resp., jobs completely served at) by real time

An example. See Figure 2.1 for a two-dimensional example. The state space is the positive quadrant.
The limit system and fluid approximation for the classical model. The basic state equation is
Define the vector Under (A2.0)–(A2.3), the sequence converges weakly to the reflected diffusion process (Kushner, 2000; Reiman, 1984) defined by
where is a continuous process (usually a Wiener process, in applications). The reflection term has the form
where is continuous, nondecreasing, with and can increase only at where The vector is the column of and is the reflection direction on the line or face The reflection directions for the example of Figure 2.1 are given in Figure 2.2.

Define Then, converges weakly to the "fluid approximation" which solves the reflected ODE or Skorohod problem:
The stability of the system (2.6) is the usual departure point for the study of the stability of the actual physical stochastic queues.
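As a numerical illustration of how the fluid approximation is used, one can integrate a fluid model directly. The sketch below is not the chapter's network; it is a hypothetical two-station tandem line, with the projection max(·, 0) standing in for the reflection term. Since the input rate is below both service rates, the fluid state drains to zero, which is the stability property of the fluid model referred to here:

```python
def tandem_fluid(q0, lam=1.0, mu=(1.5, 1.3), h=1e-3, T=30.0):
    """Forward-Euler integration of the fluid ('averaged') model of a
    hypothetical two-station tandem line.  Station 1 gets exogenous fluid
    at rate lam; station 2 gets station 1's output; max(., 0) plays the
    role of the reflection term at the boundary."""
    q1, q2 = q0
    for _ in range(int(T / h)):
        out1 = mu[0] if q1 > 0 else min(lam, mu[0])   # station 1 output rate
        out2 = mu[1] if q2 > 0 else min(out1, mu[1])  # station 2 output rate
        q1 = max(q1 + h * (lam - out1), 0.0)
        q2 = max(q2 + h * (out1 - out2), 0.0)
    return q1, q2

# Both service rates exceed the input rate, so the fluid drains to zero.
final = tandem_fluid((5.0, 3.0))
```

With these (invented) rates, station 1 empties first and station 2 follows; draining from every initial condition is exactly the fluid stability assumed in (A2.6) below.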
A simplifying assumption on the arrival and service times. To simplify the notation, it will be assumed in the rest of the paper that there are which can go to zero as fast as we wish as such that arrivals and service completions can occur only at integral multiples of (real time) We further suppose that at most one exogenous arrival and one service completion can occur at a time at each processor. These conventions are not restrictive, since can go to zero as fast as we wish as and all results hold without them. Furthermore, for notational simplicity and without loss of generality, let us suppose that service completions occur "just before" any arrivals at any
processor, and that they both occur "just before" the times These are still referred to as the departures and arrivals at the times Consequently, there cannot be a departure from a processor at real time if its queue is empty at real time

Some details for (2.4). A formal summary of a few of the details which take (2.3) to (2.4) will help give a feeling for the derivation of (2.4) and motivate the role of (2.6) in the stability analysis. The development involves representations for the terms in (2.3) that separate the "drift" from the "noise." Let denote the indicator function of the event that there is a departure from at real time and let denote the indicator function of the event that this departure goes to Let be the indicator function of the event that there is an exogenous arrival to at real time By the definitions, we have
Write
Alternatively, we can write
where
Define a residual time error term to be a [constant] × [a residual interarrival or service interval] plus a deterministic process which converges to zero. Residual time error terms always converge weakly to the "zero" process. The right hand terms of (2.8) and (2.9) differ by a residual time error term, which is times the number of steps (of real time length each) between real time and the time of the last departure at or before real time Let denote times the real idle time at by real time The right hand term of (2.8) can be written as
Now, write the coupling terms as
Use the representation (2.8) for the coefficient of in the right hand term of (2.10). Then, the (negative of the) idle time terms in the equation (2.3) for sum to
Now, consider the exogenous arrival processes. Modulo a residual time error term,
Alternatively,
where
The difference between the right hand terms of (2.11) and (2.12) is also a residual time error term and is asymptotically negligible. Putting the expansions together and using the heavy traffic condition (A2.3) yields (modulo asymptotically negligible errors, from the point of view of weak convergence)
where
where
is from (2.2),
has the form
and
is defined by
Let denote the boundary face on which and let denote the reflection direction on that face. Then, for some nonnegative and uniformly bounded random variables can be written as
Under (A2.0)–(A2.3), converges weakly to a continuous process and converges weakly to the solution of (2.4).

The Skorohod problem: A more general model. The above discussion has motivated the model (2.13)–(2.15) in terms of the new "primitives" and with the state space A more general model starts from the form (2.13)–(2.15) and generalizes it to the following. The state space G is a "wedge" or convex cone (replacing the set used in (2.4)) and is formally defined in (A2.4). The system model is represented as
where
has the form (2.14a) for uniformly bounded random variables and for some nonnegative and uniformly bounded random variables the reflection term can be written as
Assumptions for the model (2.16), (2.17). Condition (A2.4) below holds for the original queueing network model, as does (A2.5) because of the condition on the spectral radius of Q.

A2.4. There are vectors such that the state space G is the intersection of a finite number of closed halfspaces in each containing the origin and defined by and it is the closure of its interior (i.e., it is a "wedge"). Let denote the faces of G, and the interior normal to Interior to the reflection direction is denoted by the unit vector and for each The possible reflection directions at points on the intersections of any subset of the are in the convex hull of the directions on the adjoining faces. Let denote the set of reflection directions at the point whether it is a singleton or not.

A2.5. For define the index set Suppose that lies in the intersection of more than one boundary; i.e., has the
form for some Let denote the convex hull of the interior normals to resp., at Then, there is some vector such that for all

A2.6. The solution to the deterministic Skorohod problem (the fluid approximation to (2.4) or (2.16))
converges to zero for each initial condition.
3.
STABILITY: INTRODUCTION
Stability of a queue can be defined in many ways, but it almost always means something close to a uniform recurrence property, which we can define as follows. Let be a large real number. Suppose that the current queue size is Then there is a real-valued function K(·) such that the mean time, conditioned on the system's data to the present, for the queue size to get within the centered at the origin, is bounded by with probability one. Then we say that the queue process is uniformly recurrent. The queue process need not be Markovian, or a component of a Markov process, or even ergodic or stationary in any sense.

Now, consider a sequence of queues in heavy traffic, scaled as in Section 2. Then the uniform recurrence property is rephrased as follows. Suppose that the current scaled queue size is Then the mean time, conditioned on the system's data to the present, for the scaled queue size to get within the whose center is the origin, is bounded by with probability one and uniformly in (large)

The study of the stability of Markovian processes, via stochastic Liapunov functions, goes back to Bucy (1965), Khasminskii (1982), and Kushner (1990a), and that of non-Markovian processes, via perturbed Liapunov functions, to Blankenship and Papanicolaou (1978), Kushner (1984), and Kushner (1990a). It was shown in Harrison and Williams (1987) that a necessary and sufficient condition for the recurrence of the heavy traffic limit (2.4) (when is a Wiener process) is that
in that each component of the vector is negative. If is the routing probability from to then is the matrix of the total mean number of current and future visits to for customers currently at The paper (Dupuis and Williams, 1994) extended this result to more general reflection processes and state spaces by replacing (2.4) by a stochastic differential equation of a more general type (the solution to a Skorohod problem),
which might not arise as a limit of the sort of queueing processes (2.3) discussed in Section 2. They constructed a Liapunov function for the associated fluid approximation, and used this to prove the recurrence of the heavy traffic limit. The needed properties of their Liapunov function are stated in Theorem 3.1. The next two sections contain an analysis of a class of systems whose heavy traffic limits are either of the queueing type of (A2.0)–(A2.3) or of the more general Skorohod model type (2.16), (2.17). The same methods will also be applied to a single queue (and not a sequence of queues) which might not be in heavy traffic. The method is interesting in that it can handle quite complicated correlations in the arrival, service and routing processes, and even allow nonstationarities. We aim to avoid stronger assumptions which would require the processes to be either stationary or representable in terms of a component of the state of a Markov process. Thus, the notion of Harris recurrence is not directly relevant. Our opinion is that stability results should be as robust as possible to small variations in the assumptions. The perturbed Liapunov function method is ideal for such robustness.

The following is a form of the main theorem of Dupuis and Williams (1994). The reference used the orthant but the proof still holds if G is a wedge as in (A2.4). For a point the interior of G, define to be the set of indices such that In the theorem, denotes the gradient of V(·).

Theorem 3.1. Assume (A2.4)–(A2.6). Then there exists a real-valued function V(·) on with the following properties. It is continuous, together with its partial derivatives up to second order. There is a (twice continuously differentiable) surface such that any ray from the origin crosses once and only once, and for a scalar and For
Thus, the second partial derivatives are of the order of

There are real such that
For
as
Also,
There is such that for Define V(0) = 0. Then V(·) is globally Lipschitz continuous.
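For the network special case, the recurrence condition of Harrison and Williams quoted earlier in this section can be checked numerically: writing R = I − Q′ for the reflection matrix and θ for the heavy-traffic drift vector, one asks whether every component of R⁻¹θ is negative. A hedged sketch with hypothetical two-station data (the routing fractions and drift below are invented for illustration):

```python
def solve2(A, b):
    """Solve a 2x2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    x1 = (A[0][0] * b[1] - b[0] * A[1][0]) / det
    return (x0, x1)

# Hypothetical routing: station 1 sends 30% of its output to station 2,
# station 2 sends 20% back; the rest leaves the network.
Q = [[0.0, 0.3],
     [0.2, 0.0]]
theta = (-0.5, -0.2)   # hypothetical heavy-traffic drift vector

# The spectral radius of Q is at most its max row sum, hence < 1 here.
assert max(sum(row) for row in Q) < 1.0

# Reflection matrix R = I - Q' (Q' = transpose of Q).
R = [[1.0 - Q[0][0], -Q[1][0]],
     [-Q[0][1], 1.0 - Q[1][1]]]

x = solve2(R, theta)
recurrent = all(v < 0.0 for v in x)   # Harrison-Williams condition
```

Here both components of R⁻¹θ come out negative, so the heavy traffic limit for this invented network would be recurrent.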
4.
PERTURBED LIAPUNOV FUNCTIONS
In this motivational section, we will work with the first model of Section 2, the sequence of networks in heavy traffic as modeled by (2.7) with limit process
(2.4) and (2.5), but the assumptions (A2.0)–(A2.3) will be weakened. By the scaling of time, can change values only at times with the departures occurring "just before" the arrivals.

Perturbed Liapunov functions. Introduction and motivation. This section will introduce the idea of perturbed Liapunov functions (Blankenship and Papanicolaou, 1978; Kushner, 1984; Kushner, 1990b; Kushner and Yin, 1997) and their structure. The computations are intended to be illustrative and will be formal, but the actual proofs of the theorems in the next section go through the same steps. The classical Liapunov function method is quite limited for problems such as ours, since (owing to the correlations or possibly non-Markov character) there is not usually a "contraction" at each step to yield the local supermartingale property which is needed to prove stability. The perturbed Liapunov function method is a powerful extension of the classical method: one adds a small perturbation to the original Liapunov function. As will be seen, this perturbation provides an "averaging" which is needed to get the local supermartingale property. The primary Liapunov function will be simply the function V(·) of Theorem 3.1. The final Liapunov function will be of the form where is "small" in a sense to be defined. Let denote the expectation given all system data up to and including real time We can write
where the are O(1), uniformly in all variables. Before proceeding, for motivation let us temporarily modify the model and formally examine the first term on the right side of (4.1), under the assumptions that and that the (real time) interarrival and service intervals are exponentially distributed with rates resp., supposing that is “infinitesimal,” and letting the set of all intervals be mutually independent. The conditional mean value of the bracketed term is then
By the heavy traffic condition (A2.3), times this "formally converges" to Putting this "limit" into (4.1) yields that the first term of (4.1) is (asymptotically) By Theorem 3.1, this is less than If we ignore the second order term in (4.1) and the behavior on the boundary of G − {0} where at least one component of is zero, we then have that
That is, has the supermartingale property for The order of the conditional mean change per step and standard stochastic stability theorems (Kushner, 1984) imply the uniform recurrence. For "non-exponential" distributions, one needs some mechanism that allows the indicator functions to be replaced by their "centering" or "mean" values. This is done by adding a perturbation to the Liapunov function, which will also allow us to account for the behavior on the boundary and to deal with the second order term.

Motivation for the form of the Liapunov function perturbations. Now, drop the exponential and independence assumptions. Before defining the actual perturbation, for additional motivation we will discuss the general principle (Kushner, 1984) behind the construction of the perturbation by use of a slightly simpler form of it. Let be small and let us work with large enough such that for small For our centering constants define Proceeding formally until further notice, the first suggested perturbed Liapunov function will have the form:
where we define
The individual perturbations in (4.2b) are defined by:
where we define
We suppose (until further notice) that the and are O(1), uniformly in all variables, w.p.1. Clearly, the centering constants must be such that the sums are well defined; under broad mixing conditions, such centering constants exist. While continuing to proceed formally, we will return to this point in the next section and modify the C-functions to extend the conditions under which the sums are well defined. Define
By the definitions of the perturbations, we can expand as
where the are O(1), uniformly in all variables. Also,

Similarly, we can write
Note that the term in (4.1) plus the corresponding term in (4.5a) equals
Repeating this for all of the other first order terms yields
By the heavy traffic condition (A2.3), times the terms in brackets in the second line of (4.7) converges to as

Now, turn to the boundary terms. Define and Thus, asymptotically, we can write (4.7) as
Next, let us dominate the second order terms in (4.1).
For large enough
Average the indicator functions in the second order part of (4.1) via use of the perturbations as done for the first order terms. This yields the bound for the second order terms, for large
Finally, combining the above expansions and bounds, for large enough Theorem 3.1 allows us to write
The boundary term can be written as
which is nonpositive by Theorem 3.1. Thus, has the supermartingale property for large state values, say for for some positive number Suppose that Then, asymptotically, the mean number of steps of length which are required for it to return to the set where is bounded above by Thus, in the interpolated time scale, it requires an average of units of time. Since the perturbation goes to zero as this bound also holds asymptotically for as well. Hence, the desired stability.
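The averaging role of the perturbation can be computed exactly in a toy example (not the chapter's model): a single discrete-time queue with one potential service per slot and arrivals modulated by a two-state Markov chain. With the primary Liapunov function V(x) = x, the one-step drift is positive whenever the environment sits in its high-arrival state, so V alone gives no supermartingale property; adding the perturbation Σ_{j≥k} E_k[a(s_j) − ā] restores a drift equal to ā − 1 < 0 in both states. All numbers below are hypothetical:

```python
# Toy example: queue with one potential service per slot, arrivals
# modulated by a two-state Markov chain (all numbers hypothetical).
p01, p10 = 0.1, 0.1                          # environment switching probs
P = ((1.0 - p01, p01), (p10, 1.0 - p10))     # transition matrix
a = (0.5, 1.4)                               # mean arrivals/slot in states 0, 1
pi = (p10 / (p01 + p10), p01 / (p01 + p10))  # stationary distribution
abar = pi[0] * a[0] + pi[1] * a[1]           # long-run arrival rate (< 1)
rho2 = 1.0 - p01 - p10                       # second eigenvalue of P

def perturbation(s):
    """Closed form of sum_{j>=k} E_k[a(s_j) - abar]: the conditional
    deviation decays geometrically at rate rho2."""
    return (a[s] - abar) / (1.0 - rho2)

def drifts(s):
    """One-step conditional drifts at a large queue level, environment s:
    for the primary Liapunov function V(x) = x, and for V + perturbation."""
    raw = a[s] - 1.0                          # drift of V alone
    e_next = sum(P[s][t] * perturbation(t) for t in (0, 1))
    return raw, raw + e_next - perturbation(s)

raw0, avg0 = drifts(0)
raw1, avg1 = drifts(1)
```

The unperturbed drift is +0.4 in the high state, but the perturbed drift equals ā − 1 = −0.05 in both states: exactly the "averaging" that yields the supermartingale property.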
5.
STABILITY
Discussion of the perturbations. Let us examine the in (4.4a) more closely to understand why our O(1) requirement on its value is reasonable. Since is merely a centering constant for the entire sequence, the actual mean values or rates can vary with time (say, being periodic, etc.). Fix and let and be the real times of the first two exogenous arrivals to queue after real time Consider the part of given by
This equals
Next, for the moment, suppose that the interarrival times are mutually independent and identically distributed, with finite second moments, and mean Then (5.1) equals zero w.p.1, since Obviously and can be any two successive exogenous arrival times with the same result. Thus, under the independence assumption,
is just
where is the conditional expectation of the real time to the next exogenous arrival to queue after real time given the data to real time For use below, keep in mind that (w.p.1) this quantity is bounded uniformly in under the above assumptions on the independence and the moments.

Now, suppose that the interarrival times are correlated, still with centering constant Let denote the sequence of exogenous arrival (real time) times to after Then, for
Then, grouping terms and formally speaking, we see that the series
is just (5.2) plus
This sum is well defined and bounded uniformly in under broad conditions. Similar computations can be done for the and

The perturbations which are to be used. The perturbations defined in (4.4) are well defined and O(1) uniformly in under broad mixing conditions, but there are interesting cases where they are not well defined. A typical such case is the purely deterministic problem where and where H is an integer. Then the sum, taken from to is periodic in with each segment moving linearly between zero and The most convenient way of circumventing this problem of non-convergence, while including such examples in our results, is to suitably discount the defining sums (Kushner and Yin, 1997; Solo and Kong, 1995). Thus, if the sums in (4.4) are not well defined, then we will use the alternative perturbations in which the sums are changed to the following discounted forms, for some small
The sums in (5.4) are always well defined for each and the conditional expectation can be taken either inside or outside of the summation. Finally, the above discussion is summarized in the following assumption.

A5.1. There is a constant B such that and are bounded by B w.p.1, for each

The actual perturbed Liapunov function which is to be used is
where
and, analogously to (4.4), the individual perturbations are defined by:
Theorem 5.1. Let be tight. Assume the network model of Section 2, (A2.6), and that the spectral radius of Q is less than unity. Then is recurrent, uniformly in If (5.4) is well defined and uniformly bounded without the discounting, then the undiscounted form can be used.

Note on the proof. Note that, for the discounted function,
and similarly for the conditional expectations of the increments in and The right hand "error term" is hence it adds to the right side of (4.9), and it is dominated by the first order term there. The rigorous development with the perturbed Liapunov function defined by (5.4)–(5.6) is exactly the same as the formal development using (4.2)–(4.4), with the addition of the error terms arising from the second term on the right of (5.7) (and analogously from the and ), and we will not repeat the details.
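The role of the discounting can be checked numerically on the deterministic periodic example mentioned above: with period-H deviations summing to zero over each period, the undiscounted partial sums cycle and the infinite sum has no limit, while the discounted sum is well defined and bounded uniformly in small δ. A hedged sketch (the period and pattern are invented):

```python
H = 4                          # hypothetical period
xi = [1.0, 0.0, 0.0, 0.0]      # one arrival every H slots (deterministic)
mean = sum(xi) / H             # centering constant

def partial_sum(n):
    """Undiscounted partial sums of the centered deviations: these cycle
    with period H, so the infinite (undiscounted) sum has no limit."""
    return sum(xi[j % H] - mean for j in range(n))

def discounted_sum(delta, tol=1e-12):
    """Discounted perturbation sum_{j>=0} (1 - delta)^j (xi_j - mean)."""
    total, w, j = 0.0, 1.0, 0
    while w > tol:
        total += w * (xi[j % H] - mean)
        w *= 1.0 - delta
        j += 1
    return total

partials = [partial_sum(n) for n in range(1, 2 * H + 1)]
bounds = [abs(discounted_sum(d)) for d in (0.1, 0.01, 0.001)]
```

The partial sums oscillate between 0 and 0.75 forever, while the discounted sums stay bounded by the same constant as δ shrinks, which is exactly the uniform bound required in (A5.1).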
Fast arrivals and services: The second scaling. In the queueing network model of Section 2, we defined This is the traditional scaling for queues. But in many applications to computer and communications systems, the channel capacity is large and the arrivals and services occur "fast," proportionally to the capacity. Then the parameter is taken to be the basic speed of the system and one uses (Altman and Kushner, 1999; Kushner et al., 1995). As noted at the beginning of Section 2, the service and interarrival intervals are then defined to be with centering constants To be consistent with the previous development, suppose that arrivals and departures can occur only "just before" the real times The development that led to Theorem 5.1 used only the scaled system, and the results are exactly the same if we have fast arrivals and services.

The Skorohod problem model (2.16), (2.17). We will use the perturbation
where
The proof of the next theorem is the same as that of Theorem 5.1.

Theorem 5.2. Assume the model (2.16), (2.17) and the conditions (A2.4)–(A2.6). Suppose that is tight and that the of (5.9) are bounded, w.p.1, uniformly in Then is recurrent, uniformly in If the functions (5.9) are well defined and uniformly bounded without the discounting, then the undiscounted form can be used.

Fixed queues: Non-heavy-traffic problems. Consider a single queueing network (not a sequence) of the type discussed in Section 2. Let denote the size of the queue at server at time The primary assumptions in Theorem 5.1 were, first, (A2.4)–(A2.6), which enabled the construction of the fundamental Liapunov function of Theorem 3.1 and, second, (A5.1), which provided the averaging. The fact that is sufficient for the (average of the) first order term in (4.1) to dominate the (average of the) second order term for large even without their relative and scalings. Drop the in the definitions, and suppose that arrivals and departures only occur as in the queueing network model of Section 2, i.e., "just before" times for some small In particular, for define
Define
All of the equations in this and in the previous section hold if the is dropped. Thus, we have the following theorem.

Theorem 5.3. Assume that the spectral radius of Q is less than unity, condition (A2.6), and that the sums in (5.10) are bounded w.p.1, uniformly in and Then Q(·) is recurrent. If the functions (5.10) are well defined and uniformly bounded without the discounting, then the undiscounted forms can be used.
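The kind of recurrence asserted in Theorem 5.3 can be made concrete for a single stable queue: the mean time to return to the origin grows linearly in the initial state, matching the K(·)-type bound in the uniform recurrence definition of Section 3. The sketch below (an illustration, not the chapter's model) solves the first-step equations for a reflected lazy random walk, for which the expected hitting time of 0 from x is x/(q − p):

```python
def hitting_times(p, q, N):
    """Expected number of slots h(x) for a lazy reflected walk on {0,...,N}
    (up w.p. p, down w.p. q, stay otherwise; reflecting cap at N) to reach 0.
    Uses the backward recursion for d(x) = h(x) - h(x-1):
        q * d(x) = 1 + p * d(x+1),   with d(N) = 1/q,
    obtained from the first-step equations."""
    d = [0.0] * (N + 1)
    d[N] = 1.0 / q
    for x in range(N - 1, 0, -1):
        d[x] = (1.0 + p * d[x + 1]) / q
    h = [0.0] * (N + 1)
    for x in range(1, N + 1):
        h[x] = h[x - 1] + d[x]
    return h

p, q = 0.3, 0.4                     # stable: downward drift q - p = 0.1
h = hitting_times(p, q, 2000)
# Far below the cap, h(x) is essentially x / (q - p): linear in the state.
ratio_50, ratio_500 = h[50] / 50, h[500] / 500
```

Both ratios sit at 1/(q − p) = 10: the mean return time is linear in the initial state, the discrete analogue of the uniform recurrence bound.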
NOTES 1. All weak convergence is in the Skorohod topology (Ethier and Kurtz, 1986). 2. Thus the gradient is the same at all points on any ray from the origin.
REFERENCES
Altman, E. and H.J. Kushner. (1999). Admission control for combined guaranteed performance and best effort communications systems under heavy traffic. SIAM J. Control and Optimization, 37:1780–1807.
Baccelli, F. and S. Foss. (1994). Stability of Jackson-type queueing networks. Queueing Systems, 17:5–72.
Banks, J. and J.G. Dai. (1997). Simulation studies of multiclass queueing networks. IIE Trans., 29:213–219.
Bertsimas, D., D. Gamarnik, and J. Tsitsiklis. (1996). Stability conditions for multiclass fluid queueing networks. IEEE Trans. Aut. Control, 41:1618–1631.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
Blankenship, G. and G.C. Papanicolaou. (1978). Stability and control of systems with wide band noise disturbances. SIAM J. Appl. Math., 34:437–476.
Borovkov, A. A. (1986). Limit theorems for queueing networks. Theory of Probability and its Applications, 31:413–427.
Bramson, M. (1994). Instability of FIFO queueing networks. Ann. Appl. Probab., 4:414–431.
Bramson, M. (1996). Convergence to equilibria for FIFO queueing networks. Queueing Systems, 22:5–45.
Bramson, M. and J.G. Dai. (1999). Heavy traffic limits for some queueing networks. Preprint.
Bucy, R. S. (1965). Stability and positive supermartingales. J. Differential Equations, 1:151–155.
Chen, H. and H. Zhang. (2000). Stability of multiclass queueing networks under priority service disciplines. Operations Research, 48:26–37.
Dai, J., J. Hasenbein, and J. Vande Vate. (1999). Stability of a three station fluid network. Queueing Systems, 33:293–325.
Dai, J.G. and S. Meyn. (1995). Stability and convergence of moments for multiclass queueing networks via fluid limit models. IEEE Trans. on Aut. Control, 40:1889–1904.
Dai, J. and J. Vande Vate. (2001). The stability of two station multi-type fluid networks. To appear in Operations Research.
Dai, J. G. (1995). On positive Harris recurrence of multiclass queueing networks: a unified approach via fluid models. Ann. Appl. Probab., 5:49–77.
Dai, J. G. (1996). A fluid-limit model criterion for instability of multiclass queueing networks. Ann. of Appl. Prob., 6:751–757.
Dai, J. G. and G. Weiss. (1995). Stability and instability of fluid models for reentrant lines. Math. of Oper. Res., 21:115–135.
Down, D. and S.P. Meyn. (1997). Piecewise linear test functions for stability and instability of queueing networks. Queueing Systems, 27:205–226.
Dupuis, P. and R.J. Williams. (1994). Lyapunov functions for semimartingale reflecting Brownian motions. Ann. Prob., 22:680–702.
Ethier, S. N. and T.G. Kurtz. (1986). Markov Processes: Characterization and Convergence. Wiley, New York.
Fayolle, G. (1989). On random walks arising in queueing systems: ergodicity and transience via quadratic forms as Liapunov functions, Part 1. Queueing Systems, 5:167–184.
Fayolle, G., V.A. Malyshev, and M.V. Menshikov. (1995). Topics in the Constructive Theory of Markov Chains. Cambridge University Press, Cambridge, UK.
Harrison, J. M. and R.J. Williams. (1987). Brownian models of open queueing networks with homogeneous customer populations. Stochastics and Stochastics Rep., 22:77–115.
Khasminskii, R. Z. (1982). Stochastic Stability of Differential Equations. Sijthoff, Noordhoff, Alphen aan den Rijn, Amsterdam.
Kumar, P. R. and S.P. Meyn. (1995). Stability of queueing networks and scheduling policies. IEEE Trans. on Automatic Control, 40:251–260.
Kumar, P. R. and T.I. Seidman. (1990). Dynamic instabilities and stabilization methods in distributed real time scheduling policies. IEEE Trans. on Automatic Control, 35:289–298.
Kushner, H. J. (1967). Stochastic Stability and Control. Academic Press, New York.
Kushner, H. J. (1972). Stochastic stability. In R. Curtain, editor, Stability of Stochastic Dynamical Systems; Lecture Notes in Math. 294, pages 97–124, Berlin and New York, Springer-Verlag.
Kushner, H. J. (1984). Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic Systems Theory. MIT Press, Cambridge, Mass.
Kushner, H. J. (1990). Numerical methods for stochastic control problems in continuous time. SIAM J. Control Optim., 28:999–1048.
Kushner, H. J. (1990). Weak Convergence Methods and Singularly Perturbed Stochastic Control and Filtering Problems, volume 3 of Systems and Control. Birkhäuser, Boston.
Kushner, H. J. (2000). Heavy Traffic Analysis of Controlled and Uncontrolled Queueing and Communication Networks. Springer-Verlag, Berlin and New York.
Kushner, H. J., D. Jarvis, and J. Yang. (1995). Controlled and optimally controlled multiplexing systems: A numerical exploration. Queueing Systems, 20:255–291.
Kushner, H. J. and G. Yin. (1997). Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York.
Lin, W. and P.R. Kumar. (1984). Optimal control of a queueing system with two heterogeneous servers. IEEE Trans. on Automatic Control, AC-29:696–703.
Lu, S. H. and P.R. Kumar. (1991). Distributed scheduling based on due dates and buffer priorities. IEEE Trans. on Automatic Control, 36:1406–1416.
Malyshev, V. A. (1995). Networks and dynamical systems. Adv. in Appl. Probab., 25:140–175.
Meyn, S. P. and D. Down. (1994). Stability of generalized Jackson networks. Ann. Appl. Prob., 4:124–148.
Perkins, J. R. and P.R. Kumar. (1989). Stable distributed real-time scheduling of flexible manufacturing/assembly/disassembly systems. IEEE Trans. on Automatic Control, 34:139–148.
Reiman, M. I. (1984). Open queueing networks in heavy traffic. Math. Oper. Res., 9:441–458.
Rybko, A. N. and A.L. Stolyar. (1992). On the ergodicity of stochastic processes describing open queueing networks. Problems Inform. Transmission, 28:199–220.
Seidman, T. I. (1994). First come, first served, can be unstable. IEEE Trans. on Automatic Control, 39:2166–2171.
Solo, V. and X. Kong. (1995). Adaptive Signal Processing Algorithms. Prentice-Hall, Englewood Cliffs, NJ.
Chapter 3

SEQUENTIAL OPTIMIZATION UNDER UNCERTAINTY

Tze Leung Lai
Stanford University
Abstract

Herein we review certain problems in sequential optimization when the underlying dynamical system is not fully specified but has to be learned during the operation of the system. A prototypical example is the multi-armed bandit problem, which was one of Yakowitz’s many research areas. Other problems under review include stochastic approximation and adaptive control of Markov chains.

1. INTRODUCTION
Sequential optimization, when the underlying function or dynamical system is not fully specified but has to be learned during the operation of the system, was one of Yakowitz’s major research areas, to which he made many important contributions in a variety of topics. In this paper we give an overview of some of these topics and related developments, and review in this connection Yakowitz’s contributions to these areas.

The optimization problem of finding the value x ∈ X that maximizes a given function f is difficult when X is large and f does not have nice smoothness and concavity properties. Probabilistic algorithms, such as simulated annealing introduced by Kirkpatrick et al. (1983), have proved useful in reducing the computational complexity. The problem becomes even more challenging if f is some unknown regression function, so that an observation y at a given x carries substantial “uncertainties” concerning its mean value f(x). In such stochastic settings, statistical techniques and probabilistic methods are indispensable tools for tackling the problem.

When X is finite, the above optimization problem in stochastic settings can be viewed as a stochastic adaptive control problem with a finite control set, which is often called a “multi-armed bandit problem”. In its simplest form, the problem
can be described as follows. There are k statistical populations Π_1, …, Π_k with univariate density functions f_1, …, f_k with respect to some measure M. At each time t we can sample from one of these populations, and the reward is the sampled value y_t. Thus the control set is A = {1, …, k}, where control action j refers to sampling from Π_j. An adaptive sampling rule φ consists of a sequence of random variables φ_1, φ_2, … taking values in A such that the event {φ_t = j} (“y_t is sampled from Π_j”) belongs to the σ-field generated by y_1, …, y_{t−1}. Let μ_j = ∫ y f_j(y) dM(y). If μ_1, …, μ_k were known, then we would sample from the population with the largest mean μ* = max_{1≤j≤k} μ_j, which is assumed to be finite. In ignorance of the μ_j, the problem is to sample y_1, y_2, … sequentially from the k populations to maximize E(Σ_{t=1}^N y_t), or equivalently to minimize the regret

R_N = Nμ* − E(Σ_{t=1}^N y_t) = Σ_{j: μ_j < μ*} (μ* − μ_j) E(T_N(j))   (1)

as N → ∞, where T_N(j) = Σ_{t=1}^N 1_{{φ_t = j}} and 1_A = 1 if A occurs, 0 otherwise. Section 2 reviews some important developments and basic results of the multi-armed bandit problem, ranging from the above parametric setting with independent observations to nonparametric models with dependent observations pioneered by Yakowitz and his collaborators in a series of papers.

Returning to deterministic optimization as described in the first paragraph, if X is a convex subset of a Euclidean space and f is smooth and unimodal, then efficient gradient methods are available for deterministic problems, and their counterparts in stochastic settings have been developed under the rubric of “stochastic approximation”. It is widely recognized that for satisfactory performance stochastic approximation procedures have to be initialized at good starting values. One possible approach for modifying stochastic approximation algorithms accordingly is to incorporate them into a multistart procedure. This idea was used by Yakowitz (1993) to find the global maximum of a function that may have multiple local maxima. Section 4 reviews this work and some other developments in stochastic approximation.

In the engineering literature, stochastic approximation schemes are usually applied to optimization and control problems in dynamical systems, instead of to the static regression functions considered above. In principle, given a prior distribution of the unknown system parameters and the joint probability distribution of the sequence of random variables that determine the stochastic system, one can formulate a stochastic adaptive control problem as a dynamic programming problem in which the “state” is the conditional distribution of the original system state and of the parameter vector given the past observations. However, because of the complexity of the systems usually encountered in practice, the dynamic programming equations are prohibitively difficult to handle, both computationally and analytically. Moreover, it is often not possible to specify a realistic probability law for all the random variables involved and a reasonable prior distribution for the unknown parameter vector. Instead of the Bayesian approach, a much more practical alternative that is commonly used in the engineering literature is the “certainty equivalence” approach, which replaces unknown parameters in the optimal control rule by their sample estimates at every stage. Section 3 gives a brief review of stochastic adaptive control in controlled Markov chains. It shows how asymptotically optimal control rules can be constructed by a modification of the certainty equivalence approach, called “certainty equivalence with uncertainty adjustments” by Graves and Lai (1997), that incorporates uncertainties in the parameter estimates.
2. BANDIT THEORY

The “multi-armed bandit problem”, introduced by Robbins (1952), derives its name from an imagined slot machine with k arms. When an arm is pulled, the player wins a random reward. For each arm there is an unknown probability distribution of the reward, and the player’s problem is to choose N successive pulls on the arms so as to maximize the total expected reward. The problem is prototypical of a general class of adaptive control problems in which there is a fundamental dilemma between “information” (such as the need to learn from all populations about their parameter values) and “control” (such as the objective of sampling only from the best population), cf. Kumar (1985). Another often cited example of such problems is in the context of clinical trials, where there are k treatments of unknown efficacy to be chosen sequentially to treat a large class of N patients, cf. Chernoff (1967).
2.1. NEARLY OPTIMAL RULES BASED ON UPPER CONFIDENCE BOUNDS AND GITTINS INDICES

For the regret R_N defined in (1), Robbins (1952) showed that it is possible to achieve R_N = o(N) by a “certainty equivalence rule with forcing” that chooses from the population (“arm”) with the largest sample mean (“certainty equivalence”), except at a sparse set of times when Π_j is chosen (“forcing”), for each j = 1, …, k. Lai and Robbins (1985) showed how to construct sampling rules for which R_N = O(log N) at every parameter configuration. These rules are called “uniformly good.” They also developed asymptotic lower bounds for the regret of uniformly good rules and showed that the rules constructed actually attain these asymptotic lower bounds and are therefore asymptotically efficient.
Specifically, suppose Π_j has density f(y; θ_j) belonging to a parametric family. They showed that under certain regularity conditions

lim inf_{N→∞} R_N / log N ≥ Σ_{j: μ_j < μ*} (μ* − μ_j) / I(θ_j, θ*)   (2)

for uniformly good rules, where I(θ, λ) = ∫ f(y; θ) log[f(y; θ)/f(y; λ)] dM(y) is the Kullback-Leibler information number. Instead of sampling from the population with the largest sample mean, they proposed to sample from the population with the largest upper confidence bound for μ_j, and showed how these confidence bounds can be constructed for the sampling rule to attain the asymptotic lower bound (2). Their result was subsequently generalized by Anantharam, Varaiya and Walrand (1987) to the multi-armed bandit problem in which each Π_j represents an aperiodic, irreducible Markov chain on a finite state space S, so that the successive observations from Π_j are no longer independent but are governed by a Markov transition density. This extension was motivated by the more general problem of adaptive control of finite-state Markov chains with a finite control set, details of which are given in Section 3.

Besides control engineering, the theory of multi-armed bandits also has an extensive literature in economics. In particular, it has been applied to pricing under demand uncertainty, decision making in labor markets, general search problems and resource allocation (cf. Rothschild (1974), Mortensen (1985), Banks and Sundaram (1992)). Unlike the formulation above, the formulation of adaptive allocation problems in the economics literature involves a discount factor that relates future rewards to their present values. Moreover, an economic agent typically incorporates his prior beliefs about the unknown parameters into his choice of actions. Suppose an agent chooses actions a_t sequentially from a finite set A = {1, …, k} such that the reward of action j has a probability distribution depending on an unknown parameter θ_j, which has a prior distribution π_j. The agent’s objective is to maximize the total discounted reward

E(Σ_{t=1}^∞ β^t y_t),   (3)

where 0 < β < 1 is a discount factor and a_t denotes the action chosen by the agent at time t. The optimal solution to this problem, commonly called the “discounted multi-armed bandit problem”, was shown by Gittins and Jones (1974) and Gittins (1979) to be the “index rule” that chooses at each stage the action with the largest “dynamic allocation index” (also called “Gittins index”), which is a complicated functional of the posterior distribution of θ_j given the rewards of action j up to stage t, where T_j(t) denotes the total number of times that action j has been used up to stage t. Let the rewards of action j have distribution function F_{θ_j} (depending on the unknown parameter θ_j) so that y_{j,1}, y_{j,2}, … are independent random variables with common distribution function F_{θ_j}. Let π be a prior distribution on θ_j. The Gittins index associated with π is defined as

m(π) = sup_{τ≥1} E[Σ_{t=1}^τ β^t y_{j,t}] / E[Σ_{t=1}^τ β^t],   (4)

where the supremum is over all stopping times τ defined on y_{j,1}, y_{j,2}, … (cf. Gittins (1979)). As is well known, the conditional distribution of y_{j,n+1}, y_{j,n+2}, … given y_{j,1}, …, y_{j,n} can be described by saying that they are independent with common distribution function F_θ and that θ has distribution π_n, which is the posterior distribution of θ_j given y_{j,1}, …, y_{j,n}. Chapter 7 of Gittins (1989) describes computational methods to calculate Gittins indices for normal, Bernoulli and exponential F_θ with the prior distribution of θ belonging to a conjugate family. These methods involve approximating the infinite horizon in the optimal stopping problem (4) by a finite horizon N and using backward induction. When β is near 1, a good approximation requires a very large N and becomes computationally prohibitive.

Varaiya, Walrand and Buyukkoc (1985) have suggested a simple way to view the complicated index (4) and to see why the index rule is optimal. Suppose there are two machines with deterministic reward sequences r_1(1), r_1(2), … and r_2(1), r_2(2), …. In analogy with (4), the index at time 1 of machine j is defined as

m_j = sup_{τ≥1} Σ_{t=1}^τ β^t r_j(t) / Σ_{t=1}^τ β^t.

Suppose that m_1 ≥ m_2 and that the supremum defining m_1 is attained at τ_1. From the definition of m_1 and m_2 it follows that

Σ_{t=1}^{τ_1} β^t r_1(t) ≥ m_1 Σ_{t=1}^{τ_1} β^t  and  Σ_{t=1}^τ β^t r_2(t) ≤ m_1 Σ_{t=1}^τ β^t for every τ ≥ 1.

These inequalities can be used in conjunction with an interchange argument to prove the following: For any rule ψ, let T be the stage at which ψ operates machine 1 for the τ_1-st time, so machine 2 is operated T − τ_1 times by ψ during the first T stages. Consider the rule ψ* that operates machine 1 for the first τ_1 stages, machine 2 for the next T − τ_1 stages, and such that ψ* is the same as ψ after stage T. Then the total discounted reward of ψ* is larger than or equal to that of ψ, showing that it is better to use the index rule until stage τ_1. The argument can then be repeated starting at stage τ_1 + 1, proving the optimality of the index rule. See Section II.A of Varaiya, Walrand and Buyukkoc (1985) for the algebraic details.
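The finite-horizon truncation plus backward induction described above (Gittins, 1989, Ch. 7) can be sketched for a Bernoulli arm with a Beta(a, b) posterior. This is an illustrative implementation, not the book's code: it uses the standard "retirement" calibration, bisecting on a known per-period reward m until the agent is indifferent between the arm and retiring at value m/(1 − β); the horizon and tolerance values are arbitrary choices.

```python
def gittins_index_beta(a, b, beta=0.9, horizon=50, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm with Beta(a, b)
    posterior, by bisection on a retirement reward m.  For each m, the
    optimal stopping value is computed by backward induction over a
    finite horizon, as in the truncation method of Gittins (1989)."""
    def prefers_arm(m):
        safe = m / (1.0 - beta)              # value of retiring forever
        # V[i] = value at the truncation depth with i successes: retire
        V = [safe] * (horizon + 1)
        for d in range(horizon - 1, -1, -1):  # depth = total pulls so far
            newV = []
            for i in range(d + 1):            # i successes, d - i failures
                p = (a + i) / (a + b + d)     # posterior success probability
                cont = p * (1.0 + beta * V[i + 1]) + (1 - p) * beta * V[i]
                newV.append(max(safe, cont))  # continue or retire
            V = newV
        return V[0] > safe + 1e-12            # pulling beats retiring?
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prefers_arm(mid):
            lo = mid                          # index exceeds mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Since β^50 is already small for β = 0.9, the truncation error is minor here; as the text notes, β near 1 would require a much larger horizon. The index for Beta(1, 1) lands well above the posterior mean 0.5, reflecting the value of experimentation.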
Bounds and approximations to the Gittins index have been developed by Brezzi and Lai (2000a,b) and Chang and Lai (1987). In particular, Brezzi and Lai (2000a) derived upper and lower bounds for m(π) − μ(π) in terms of σ(π), where μ(π) and σ(π) denote the mean and standard deviation of π, respectively. Making use of these bounds, Brezzi and Lai (2000a) gave a simple proof of the incompleteness of learning from endogenous data by an optimizing economic agent. Specifically, they showed that with positive probability the index rule uses the optimal action only finitely often and that it can estimate consistently only one of the θ_j, generalizing Rothschild’s (1974) “incomplete learning theorem” for Bernoulli two-armed bandits. Moreover, the Gittins index can be written as an upper confidence bound of the form μ(π) + σ(π)ψ(·), where ψ is a nonnegative function. When the y_{j,i} are normal with mean θ and variance 1 and the prior distribution is normal, Chang and Lai (1987) showed that the index can be expressed as

μ_n + σ_n ψ(σ_n² / log β^{−1}),   (5)

where μ_n and σ_n² denote the posterior mean and variance after n observations, and ψ(t) ~ (2 log t)^{1/2} as t → ∞.

There is also a similar asymptotic theory of the finite-horizon bandit problem, in which the agent’s objective is to maximize the total reward

E(Σ_{t=1}^N y_t),   (6)

the expectation being taken under a prior distribution π of the vector (θ_1, …, θ_k). Even when the θ_j are independent under π (so that π is a product of marginal distributions as in (3)), the optimal rule that maximizes (6) does not reduce to an index rule. In principle, one can use dynamic programming to maximize (6). In the case of Bernoulli populations with independent Beta priors for their parameters, Fabius and van Zwet (1970) and Berry (1972) studied the dynamic programming equations analytically and obtained several qualitative results concerning the optimal rule. Lai (1987) showed that although index-type rules do not provide exact solutions to the optimization problem (6), they are asymptotically optimal as N → ∞ and have nearly optimal performance from both the Bayesian and frequentist viewpoints for moderate and small values of N.

The starting point in Lai’s approximation to the optimal rule is to consider the normal case. Suppose that an experimenter can choose at each stage between sampling from two normal populations with known variance 1 such that one has unknown mean θ and the other has known mean 0. Assuming a normal prior distribution on θ, the optimal rule that maximizes the expected
sum of N observations samples from the first population (with unknown mean) until its posterior mean μ_n falls below −b_n, and then takes the remaining observations from the second population (with known mean 0), where μ_n is the posterior mean based on n observations from the first population and b_1, b_2, … are positive constants that can be determined by backward induction. Writing v_n for the posterior variance of θ and treating v_n as a continuous variable, Lai (1987) approximates b_n by v_n^{1/2} h(v_n). The function h is obtained by first evaluating numerically the boundary of the corresponding optimal stopping problem for Brownian motion and then developing some simple closed-form approximation to the boundary. Although it differs from the function ψ in (5) because of the difference between the finite-horizon criterion and the discounted criterion, the two functions are asymptotically equivalent as their arguments become large. Brezzi and Lai (2000b) recently developed a similar closed-form approximation to ψ by computing the optimal stopping boundary for Brownian motion in the discounted case.

More generally, without assuming a prior distribution on the unknown parameters, suppose y_{j,1}, y_{j,2}, … are independent random variables from a one-parameter exponential family with density function f_θ(y) = exp{θy − κ(θ)} with respect to some dominating measure. Then μ(θ) = E_θ(y) is increasing in θ, since μ(θ) = κ′(θ), and the Kullback-Leibler information number is I(θ, λ) = (θ − λ)κ′(θ) − κ(θ) + κ(λ). Let θ̂_n be the maximum likelihood estimate of θ based on y_{j,1}, …, y_{j,n}. Lai (1987) considered an upper confidence bound for θ_j of the form

U_n = inf{θ ≥ θ̂_n : n I(θ̂_n, θ) ≥ g(n/N)},   (7)

where g is a suitable nonnegative function. Note that n I(θ̂_n, θ) is the generalized likelihood ratio statistic for testing θ_j = θ, so the above upper confidence bound is tantamount to the usual construction of confidence limits by inverting an equivalent test. Lai (1987) showed that this upper confidence bound rule is uniformly good and attains the lower bound (2) not only at fixed (θ_1, …, θ_k) as N → ∞ (so that the rule is asymptotically optimal from the frequentist viewpoint), but also uniformly over a wide range of parameter configurations, which can be integrated to show that the rule is asymptotically Bayes with respect to a large class of prior distributions. There is also an analogous asymptotic theory for the discounted multi-armed bandit problem as β → 1, as shown by Chang and Lai (1987).
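The inversion of the generalized likelihood ratio test in (7) is easy to carry out numerically. The sketch below does this for the Bernoulli family, where I(p, q) is the familiar binary Kullback-Leibler divergence; a fixed confidence level stands in for g(n/N), which is an assumption made for illustration (the actual choice of g is part of Lai's construction).

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler information number I(p, q) for Bernoulli laws."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_upper_bound(p_hat, n, level, tol=1e-6):
    """Smallest q >= p_hat with n * I(p_hat, q) >= level, found by
    bisection; this inverts the GLR test exactly as in (7).  Works
    because I(p_hat, .) is continuous and increasing on [p_hat, 1)."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if n * kl_bernoulli(p_hat, mid) < level:
            lo = mid          # still inside the confidence region
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As expected from (7), the bound tightens toward the sample mean as n grows with the level held fixed.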
The construction of asymptotically efficient adaptive allocation rules that attain (2) at fixed parameter configurations in Lai and Robbins (1985) uses similar upper confidence bounds which, unlike (7), do not involve the horizon N and for which the asymptotics in (2) is not uniform over the parameter space.
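The behavior of upper-confidence-bound rules of this kind can be illustrated with a small simulation. The sketch below is not the Lai-Robbins construction itself: it uses the simpler surrogate index "sample mean + sqrt(2 log t / n)" for Gaussian arms with unit variance, which suffices to show the logarithmic growth of the regret (1); the arm means and horizon are made up.

```python
import math
import random

def ucb_bandit(means, horizon, seed=0):
    """Sample from k Gaussian arms (unit variance) with an
    upper-confidence-bound index; return (regret, pull counts)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:                       # sample each arm once
            j = t - 1
        else:                            # arm with largest upper bound
            j = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        x = rng.gauss(means[j], 1.0)     # reward from population j
        counts[j] += 1
        sums[j] += x
        regret += best - means[j]        # expected regret increment
    return regret, counts

regret, counts = ucb_bandit([0.0, 0.5, 1.0], 5000)
```

With 5000 pulls the regret stays in the low hundreds, far below the roughly 2500 that uniform sampling of the three arms would incur, and almost all pulls go to the best arm.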
2.2. A HYPOTHESIS TESTING APPROACH AND BLOCK EXPERIMENTATION

When switching costs are present, even the discounted multi-armed bandit problem does not have an optimal solution in the form of an index-type rule, as shown by Banks and Sundaram (1994). At any stage one has a greater propensity to stick to the current arm instead of switching to the arm with the largest index and incurring a switching cost. Although approximating an index by an upper confidence bound that incorporates parameter uncertainty is no longer applicable, we can re-interpret confidence bounds as hypothesis tests (as explained in the sentence following (7)) and modify the preceding rules in the presence of switching costs by using hypothesis testing to decide which population to sample from. Brezzi and Lai (2000b) recently used this approach to construct a class of “block experimentation” rules in which active experimentation with an apparently inferior population is carried out in blocks.

Specifically, consider k statistical populations Π_1, …, Π_k such that Π_j has density function f_{θ_j} with respect to some common dominating measure, for j = 1, …, k. Let N_1 be a positive integer divisible by k, and partition time into frames such that the length of the ith frame grows geometrically with i. The ith frame is further subdivided into blocks of equal length, so that block (i, m) refers to the mth block in frame i. Let (j_1, …, j_k) be a random permutation of (1, …, k) (i.e., all k! permutations are equally likely). The mth block in the first frame is devoted to sampling from Π_{j_m}. For the ith frame (i ≥ 2), denote the population with the largest sample mean among all populations not yet eliminated at the end of the (i−1)st frame by Π_{(i)}, and let k_i denote the number of such populations. Let Π_{(i,m)} denote the population with the largest sample mean among all populations not yet eliminated at the end of block (i, m−1), where the end of block (i, 0) means the end of frame i−1. Let y_{j,1}, y_{j,2}, … denote successive observations from Π_j and ȳ_{j,n} be the sample mean based on y_{j,1}, …, y_{j,n}. For the mth block of frame i, pairing a candidate Π_j with the current leader Π_λ = Π_{(i,m)}, we sample from Π_j until the stage

n_{i,m} = inf{t in block (i, m) : Λ_t(j, λ) ≥ c_i},   (8)

where n_{i,m} is defined as the largest number in the block if the set in (8) is empty, c_i is a threshold for the ith frame, and Λ_t(j, λ) is the generalized likelihood ratio (GLR) statistic for testing θ_j = θ_λ
based on the observations up to stage t, and is given by (9) below. If the set in (8) is non-empty, eliminate Π_j (or Π_λ), whichever has the smaller sample mean, from further sampling, and the remaining observations in the block are sampled from the one of Π_j and Π_λ that is not eliminated. For m > k_i, the mth block is devoted to sampling from the population with the largest sample mean among all populations not yet eliminated at the end of block m − 1. If N equals the total length of the first J frames for some integer J, the preceding definition of the “block experimentation” rule applies to all J frames; if not, we modify the definition of the Jth frame by proceeding as before until the Nth observation. The GLR statistic for testing θ_j = θ_λ in the exponential family of Section 2.1 is

Λ_t(j, λ) = n_j I(μ^{−1}(ȳ_{j,n_j}), μ^{−1}(ȳ)) + n_λ I(μ^{−1}(ȳ_{λ,n_λ}), μ^{−1}(ȳ)),   (9)

with ȳ denoting the pooled sample mean, noting that the function μ(θ) = κ′(θ) is continuous and increasing and therefore has an inverse. By choosing c_i to be of the order of the logarithm of the frame length, Brezzi and Lai (2000b) showed that such block experimentation rules attain the asymptotic lower bound (2) for the regret, while the expected number of switches converges (as N → ∞) to a finite limit at any fixed parameter configuration with a unique component that attains μ*.

Similar hypothesis testing and block experimentation ideas were used by Lai and Yakowitz (1995) in their development of nonparametric bandit theory, which was initiated by Yakowitz and Lowe (1991). In this theory the successive observations from different arms may be dependent and no parametric model for the underlying stochastic sequences generating these observations is prescribed. To begin with, suppose that there are k stochastic sequences {y_j(t), t ≥ 1}, j = 1, …, k, such that μ_j = lim_{n→∞} n^{−1} E[Σ_{t=1}^n y_j(t)] exists and is finite for each j. Let μ* = max_{1≤j≤k} μ_j and define the optimal index set J* = {j : μ_j = μ*}. Assuming polynomial bounds on the rates of convergence of the sample means and exponential bounds on their left tails, Lai and Yakowitz (1995) showed how a sequential allocation rule can be constructed for choosing among the k sequences so that the regret

R_N = Σ_{j∉J*} (μ* − μ_j) E[T_N(j)]   (10)

is of the order O(log N), where T_N(j) is the number of observations from the jth sequence that have been taken up to stage N. Their basic idea is to replace the parametric likelihood-based upper confidence bounds described in Section 2.1 by nonparametric sample-mean-based tests. They also extended the construction to the case where there are countably infinitely many stochastic sequences whose means μ_1, μ_2, … have a finite maximum μ* with a nonempty optimal index set. Given
any nondecreasing sequence {γ_N} of positive numbers such that γ_N → ∞ and γ_N/N → 0 as N → ∞, they showed how γ_N can be incorporated into the allocation rule so that the regret grows no faster than γ_N log N. Not only do these results improve and generalize earlier work on nonparametric bandits by Yakowitz and Lowe (1991), but they can also be extended to controlled Markov chains, as will be shown in Section 3.2.
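The flavor of sample-mean-based tests with elimination can be conveyed by a simple simulation. This sketch is not the Lai-Yakowitz rule: it is a plain successive-elimination scheme for independent Gaussian arms, with a confidence radius sqrt(2 log N / n) chosen for illustration; the arm means and horizon are made up.

```python
import math
import random

def successive_elimination(means, horizon, seed=1):
    """Sample every surviving arm once per round; after each round,
    eliminate any arm whose sample mean is below the best surviving
    sample mean by more than twice the confidence radius
    sqrt(2 log(horizon) / n).  Rewards are Gaussian, unit variance.
    Returns (regret, surviving arm indices)."""
    rng = random.Random(seed)
    k = len(means)
    alive = list(range(k))
    counts = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0
    t = 0
    while t < horizon:
        for j in list(alive):            # one sweep over survivors
            if t >= horizon:
                break
            x = rng.gauss(means[j], 1.0)
            counts[j] += 1
            sums[j] += x
            regret += best - means[j]
            t += 1
        if len(alive) > 1:               # sample-mean elimination test
            rad = math.sqrt(2.0 * math.log(horizon) / counts[alive[0]])
            top = max(sums[j] / counts[j] for j in alive)
            alive = [j for j in alive
                     if sums[j] / counts[j] >= top - 2.0 * rad]
    return regret, alive

regret, alive = successive_elimination([0.0, 0.4, 0.8], 4000)
```

Inferior arms are dropped once their gap exceeds the shrinking radius, so the regret grows logarithmically in the horizon rather than linearly.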
2.3. APPLICATIONS TO MACHINE LEARNING, CONTROL AND SCHEDULING OF QUEUES

Yakowitz (1989), Yakowitz and Lugosi (1990), Yakowitz and Kollier (1992), Yakowitz and Mai (1995) and Kaelbling, Littman and Moore (1996) have given various applications of bandit theory to machine learning. These applications illustrate the usefulness of Yakowitz’s “black-box” (nonparametric) approach to stochastic optimization, with which one can deal with processes having unknown structure and still achieve long-run average cost optimality, with regret rates only slightly larger than the optimal rates for parametric problems.

Although it is not applicable to general queuing control, for which optimal policies are not of index type, Gittins’ discounted bandit theory has been applied to determine optimal scheduling policies for queuing networks, which turn out to be of index type. A major step in this direction was undertaken by Whittle (1981), who introduced the notion of “open bandit processes”, in which new projects (arms) are continually appearing. Whittle used a dynamic programming equation to define the Gittins index of a project of a given type in a given state, assuming Markovian dynamics for the evolution of a project. He showed that the optimal policy that maximizes the infinite-horizon discounted reward is to work at each time on an available project that has the largest of those Gittins indices which exceed the index of a project constantly left idle (at the “no-further-action” state), and to remain idle when no such project is available. Lai and Ying (1988) showed that under certain stability assumptions the open bandit problem is asymptotically equivalent to a closed bandit problem in which there is no arrival of new projects, as the discount factor approaches 1. Using this result, they showed that Klimov’s (1974, 1978) priority indices for scheduling queuing networks are limits of Gittins indices for the associated closed bandit problem, and extended Klimov’s priority indices to preemptive policies and to general queuing systems.
3. ADAPTIVE CONTROL OF MARKOV CHAINS

To design control rules for a stochastic system whose dynamics depend on certain unknown parameters, the “certainty equivalence” approach first finds the optimal (or asymptotically optimal) control rule when the system parameters are known and then replaces the parameter values in this control rule by their
sample estimates at every stage. It tries to mimic the optimal rule (assuming known system parameters) by updating the parameter estimates based on all the available data. It is particularly attractive when the optimal (or asymptotically optimal) control scheme assuming known system parameters has a simple recursive form that can be implemented in real time and when there are real-time recursive algorithms for updating the parameter estimates. Such is the case with stationary control of Markov chains to maximize the long-run average reward.
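A minimal sketch of certainty equivalence with forcing, on a toy two-state, two-action chain with known rewards and unknown transition probabilities, is given below. Everything concrete here is an illustrative assumption: the forcing schedule (perfect-square times), the Laplace-style smoothing of the counts, and the closed-form stationary distribution used to evaluate each of the four stationary control laws.

```python
import random
from itertools import product

def run_ce_forcing(p1, r, horizon, seed=3):
    """Certainty-equivalence control with forcing on a 2-state,
    2-action chain.  p1[s][a] = true P(next state = 1 | s, a), unknown
    to the controller; r[s][a] = known one-step reward.  At times that
    are perfect squares the action is forced (cycling over actions);
    otherwise the stationary law that is optimal for the current
    transition estimates is used.  Returns the average reward."""
    rng = random.Random(seed)

    def avg_reward(q, g):
        # stationary distribution of the 2-state chain under law g
        p01, p10 = q[0][g[0]], 1.0 - q[1][g[1]]
        if p01 + p10 == 0.0:
            return min(r[0][g[0]], r[1][g[1]])
        pi1 = p01 / (p01 + p10)
        return (1 - pi1) * r[0][g[0]] + pi1 * r[1][g[1]]

    laws = list(product([0, 1], repeat=2))   # g: state -> action
    n = [[1, 1], [1, 1]]                     # smoothed visit counts
    m = [[0.5, 0.5], [0.5, 0.5]]             # smoothed transitions to 1
    s, total, forced = 0, 0.0, 0
    for t in range(1, horizon + 1):
        if int(t ** 0.5) ** 2 == t:          # sparse forcing times
            a = forced % 2
            forced += 1
        else:                                # certainty equivalence
            q = [[m[i][j] / n[i][j] for j in range(2)] for i in range(2)]
            g = max(laws, key=lambda gg: avg_reward(q, gg))
            a = g[s]
        total += r[s][a]
        nxt = 1 if rng.random() < p1[s][a] else 0
        n[s][a] += 1
        m[s][a] += nxt
        s = nxt
    return total / horizon

# action 0 drives the chain toward state 1, where the reward is earned
avg = run_ce_forcing([[0.9, 0.1], [0.9, 0.1]], [[0.0, 0.0], [1.0, 1.0]], 20000)
```

Forcing only about sqrt(horizon) times keeps every state-action pair identified while sacrificing a vanishing fraction of the reward, so the average reward approaches the optimal stationary value 0.9.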
3.1. PARAMETRIC ADAPTIVE CONTROL

Mandl (1974) studied such certainty equivalence control rules in finite-state Markov chains whose transition functions p(x, y; a, θ) depend on the action a chosen from a finite control set A and an unknown parameter θ belonging to a compact metric space Θ. Let S denote the state space. The objective is to choose a stationary control law g : S → A that maximizes the long-run average reward

μ(θ, g) = Σ_{x∈S} r(x, g(x)) π_{θ,g}(x),   (11)

where r(x, a) represents the one-step reward at state x when action a is used and π_{θ,g} is the stationary distribution (which is assumed to exist) of the chain under g. Since S and A are finite, the set of stationary control laws is finite, which will be denoted by G. If θ were known, then one would use the stationary control law g*(θ) such that

μ(θ, g*(θ)) = max_{g∈G} μ(θ, g).   (12)

In ignorance of θ, a certainty equivalence control rule uses the control law g*(θ̂_t) at time t, where θ̂_t is an estimate of θ based on the observations up to time t. Mandl (1974) chose θ̂_t to be the minimum contrast estimate and showed that θ̂_t converges a.s. to the true θ under a restrictive identifiability condition and some other regularity conditions. Borkar and Varaiya (1979) removed this identifiability condition and showed that when Θ is finite, the maximum likelihood estimate θ̂_t converges a.s. to a random variable θ̂_∞ such that p(x, y; g*(θ̂_∞)(x), θ̂_∞) = p(x, y; g*(θ̂_∞)(x), θ) for all x, y. They also gave an example for which θ̂_∞ ≠ θ with positive probability, showing that the certainty equivalence rule eventually uses, with positive probability, only the suboptimal stationary control law g*(θ̂_∞) to the exclusion of other control laws, because of premature convergence of the parameter estimates to a wrong parameter value.

In view of this difficulty with the certainty equivalence rule, various modifications of the rule have been proposed, including (i) forced choice schemes that reserve some sparse set of times for experimentation with all stationary control laws in G, (ii) randomization schemes that select the control law according to a probability distribution, depending on past data, that assigns positive probability to every
g ∈ G, and (iii) using penalized (cost-biased) maximum likelihood estimators; see Kumar’s (1985) survey of these methods.

Motivated by the lower bound (2) for the multi-armed bandit problem, which is a special case of controlled i.i.d. processes, Agrawal, Teneketzis and Anantharam (1989a) developed a similar lower bound for a controlled independent sequence by making use of the finiteness of Θ and introducing a finite set of “bad” parameter values associated with θ. Whereas in the multi-armed bandit problem the parameter has components θ_1, …, θ_k, with θ_j parameterizing an individual arm, θ parameterizes all the arms in a controlled independent sequence, so that rewriting (2) in terms of the “bad” set provides the main clue in extending (2). When the state space S, the control set A and the parameter space Θ are all finite, Agrawal, Teneketzis and Anantharam (1989b) developed a “translation scheme” which, together with the construction of an “extended probability space”, enabled them to extend the lower bound (2) further to controlled Markov chains by converting it to a form similar to that for controlled i.i.d. processes. Graves and Lai (1997) removed the finiteness assumptions on S, A and Θ and used another approach involving change of measures to extend the lower bound (2) to controlled Markov chains when the set G of stationary control laws is finite. Define μ(θ, g) and g*(θ) by (11) and (12) and the regret of an adaptive control rule by

R_N(θ) = Σ_{g: μ(θ,g) < μ(θ,g*(θ))} [μ(θ, g*(θ)) − μ(θ, g)] E_θ[T_N(g)]   (13)

as in (10). Assume no switching cost for switching among the optimal stationary control laws that attain the maximum in (12) and a positive switching cost for each switch from one law g to another g′ where g and g′ are not both optimal. Let C_N be the cumulative switching cost of an adaptive control rule up to stage N. An adaptive control rule is said to be “uniformly good” if R_N(θ) = o(N^α) and E_θ(C_N) = o(N^α) for every α > 0 and every θ ∈ Θ. Graves and Lai (1997) showed that for uniformly good rules,

lim inf_{N→∞} R_N(θ) / log N ≥ z(θ),   (14)

where z(θ) is defined below after a few other definitions. First, the analogue of the Kullback-Leibler information number in (2) now takes the form

I_g(θ, λ) = Σ_{x∈S} π_{θ,g}(x) ∫ p(x, y; g(x), θ) log [ p(x, y; g(x), θ) / p(x, y; g(x), λ) ] dν(y),

which will be assumed to be finite for all θ, λ ∈ Θ and g ∈ G, and which assumes the transition probabilities to be absolutely continuous (having density functions p) with respect to a measure ν on S. Next, the finiteness of G enables us to
decompose Θ as the union of the subsets

Θ(g) = {θ : g is an optimal stationary control law when θ is the true parameter},   g ∈ G,

i.e., θ ∈ Θ(g) if μ(θ, g) = max_{g′∈G} μ(θ, g′). For θ ∈ Θ, let

G(θ) = {g ∈ G : θ ∈ Θ(g)},   B(θ) = {λ ∉ ∪_{g∈G(θ)} Θ(g) : I_g(θ, λ) = 0 for all g ∈ G(θ)}.

Thus, G(θ) is the set of all optimal stationary control laws when θ is the true parameter value, and B(θ) consists of all “bad” parameter values λ which are statistically indistinguishable from θ if one only uses the optimal control laws, because I_g(θ, λ) = 0 for every g ∈ G(θ). Define z(θ) as

z(θ) = inf { Σ_{g∉G(θ)} w_g [μ(θ, g*(θ)) − μ(θ, g)] : w_g ≥ 0 and Σ_{g∉G(θ)} w_g I_g(θ, λ) ≥ 1 for all λ ∈ B(θ) }.
Using sequential likelihood ratio tests and block experimentation ideas (similar to those described in Section 2.2) to introduce “uncertainty adjustments” into the certainty equivalence rule, Graves and Lai (1997) constructed uniformly good adaptive control rules that attain the asymptotic lower bound (14).
3.2. NONPARAMETRIC ADAPTIVE CONTROL

Without assuming a parametric family of transition densities, Lai and Yakowitz (1995) consider a controlled Markov chain {x_t} on a state space S with control set A and transition probability function p(x, B; a), B ∈ 𝒮, where 𝒮 is a σ-field of subsets of S. Let r(x_t, a_t) represent the one-step reward at time t. For a stationary control law g, we shall use p_g to denote the transition probability function of the controlled Markov chain under the control law g and P_x^g to denote the conditional probability measure of this chain starting at state x. Let G be a countable (possibly infinite) set of stationary control laws such that

μ(g) = lim_{N→∞} N^{−1} E_x^g [Σ_{t=1}^N r(x_t, g(x_t))]

exists for every x ∈ S and g ∈ G, and such that there is a maximum value μ* of μ(g) over G. For a control rule φ that chooses adaptively some stationary control law in G to use at every stage, its regret is defined by

R_N(x) = Σ_{g: μ(g) < μ*} [μ* − μ(g)] E_x^φ [T_N(g)],   (15)
where T_N(g) is the number of times that the control rule uses the stationary control law g up to stage N and E_x^φ denotes expectation under the probability measure of the controlled Markov chain starting at x and using control rule φ. Since the state x_t in a controlled Markov chain is governed by the preceding state x_{t−1} irrespective of which control law is used at time t, it is important to adhere to the same control law (“arm”) over a block of times (“block experimentation”), instead of switching freely among different arms as in conventional multi-armed bandits.

Let G = {g_1, g_2, …}. Take block lengths b_1 ≤ b_2 ≤ … and partition {1, 2, …} into blocks of consecutive integers, the ith block having length b_i, except possibly the last one, whose length may range from b_i to 2b_i − 1. Label these blocks as B_1, B_2, …, so that the ith block begins at stage b_1 + ⋯ + b_{i−1} + 1. The basic idea here is to try out the first k_i stationary control laws g_1, …, g_{k_i} during the stages from b_1 + ⋯ + b_{i−1} + 1 to b_1 + ⋯ + b_i. Specifically, for the ith block, use the stationary control law g_j with the largest current sample mean reward for the entire block of stages if

ȳ_j(i) ≥ max_{1≤l≤k_i} ȳ_l(i) − δ_i,   (16)

and otherwise use an apparently inferior g_l for all the stages in the block, to accumulate the observations needed for the sample-mean tests. In (16), ȳ_j(i) denotes the sample mean of the rewards earned during the T_i(g_j) stages at which g_j has been used before block i, and δ_i is so chosen that

δ_i → 0 as i → ∞.   (17)

Let {γ_N} be any nondecreasing sequence of positive numbers such that γ_N → ∞ and γ_N/N → 0 as N → ∞. Lai and Yakowitz (1995) showed how the block lengths and the constants can be chosen in (16) and (17) so that the regret defined by (15) satisfies R_N(x) = O(γ_N log N) for every x.

Yakowitz, Jayawardena and Li (1992) extended the nonparametric bandit theory of Yakowitz and Lowe (1991) to the case of a nondenumerable set of arms. Instead of the regret (10), they introduced the “learning loss”

Nμ* − E(Σ_{t=1}^N y_t)   (18)
under the assumption that the set of arms is a metric space with its Borel σ-field. For controlled Markov chains with a nondenumerable set G of stationary control laws, Lai and Yakowitz (1995) showed how to construct an adaptive control rule whose regret can be made of order o(N^α) for every α > 0 and every starting state,
by sampling control laws from G for block experimentation. The basic underlying idea is to sample independently from G according to some probability distribution that assigns positive probability to every open ball. This yields a countable set of stationary control laws for which the same strategy as in the preceding paragraph can be applied. Lai and Yakowitz (1995) applied this control rule to the problem of adaptively adjusting a service rate parameter in an M/M/1 queue with finite buffer, on the basis of observed times spent in the queuing system and the service costs, to minimize the long-run average cost. Specifically, suppose the cost for the nth item serviced is

c_n = w_n + A v_n s_n,   (19)

where w_n is the time spent in the queuing system by the nth job, s_n is the service time for that job, and v_n is the service rate in effect during that service time, with A being a parameter of the problem. The decision maker need not know in advance the arrival distribution, or how costs depend on service time, or even how the service rate being adjusted is related to service time. One desires a strategy to minimize the average job cost. Making use of the control rule described earlier in this paragraph, Lai and Yakowitz (1995) showed how decision functions can be chosen adaptively from a space of functions mapping the number of jobs in the system into a prescribed interval of service rates, so that the average control costs converge, as N → ∞, to the optimal performance level, the expectation being with respect to the invariant measure induced by the decision function.
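A stripped-down version of this queuing experiment can be simulated directly. The sketch below is not the Lai-Yakowitz rule: it restricts attention to a finite set of fixed service rates, uses an infinite buffer so that the time in system follows the Lindley recursion, charges each job its time in system plus a fee proportional to the rate in force, and allocates blocks greedily with sparse forcing. The rates, the cost parameter A, and the block schedule are all made-up values.

```python
import random

def mm1_cost_blocks(rates, A=0.2, lam=1.0, n_blocks=60, block=200, seed=7):
    """Block experimentation over a finite set of service rates for an
    M/M/1 queue.  Each block serves `block` jobs at one rate; the first
    len(rates) blocks try each rate once, every 10th block is forced to
    cycle through the rates, and the remaining blocks use the rate with
    the smallest average cost so far.  Per-job cost is w + A * v."""
    rng = random.Random(seed)
    k = len(rates)
    n = [0] * k                      # jobs served under each rate
    csum = [0.0] * k                 # accumulated cost under each rate
    w = 0.0                          # time in system of previous job
    total_cost, total_jobs = 0.0, 0
    for b in range(n_blocks):
        if b < k:
            j = b                    # initial sweep over all rates
        elif b % 10 == 0:
            j = (b // 10) % k        # sparse forcing
        else:
            j = min(range(k), key=lambda i: csum[i] / n[i])
        v = rates[j]
        for _ in range(block):
            t = rng.expovariate(lam)     # interarrival time
            s = rng.expovariate(v)       # service time at rate v
            w = max(w - t, 0.0) + s      # Lindley recursion (time in system)
            c = w + A * v                # delay cost plus service charge
            n[j] += 1
            csum[j] += c
            total_cost += c
            total_jobs += 1
    return total_cost / total_jobs, n

avg_cost, n = mm1_cost_blocks([1.2, 2.0, 4.0])
```

The slow rate 1.2 barely keeps up with the unit arrival rate and accumulates long delays, so the rule quickly concentrates its blocks on the faster rates and the overall average cost stays close to the best fixed-rate cost.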
4.
STOCHASTIC APPROXIMATION Consider the regression model
y_n = M(x_n) + ε_n,  n = 1, 2, …,
where y_n denotes the response at the design level x_n, M is an unknown regression function, and the ε_n represent unobservable noise. In the deterministic case (where ε_n = 0 for all n), Newton's method for finding the root θ of a smooth function M is a sequential scheme defined by the recursion
x_{n+1} = x_n − y_n / M′(x_n).  (20)
When the random disturbances ε_n are present, using Newton's method (20) entails that
x_{n+1} − x_n = −y_n / M′(x_n) = −(M(x_n) + ε_n) / M′(x_n).  (21)
Hence, if x_n should converge to θ, so that M(x_n) → M(θ) = 0 and x_{n+1} − x_n → 0 (assuming M to be smooth and to have a unique root θ such that M′(θ) > 0), then (21) implies that ε_n → 0, which is not possible for many kinds of random
disturbances (e.g., when the ε_n are independent and identically distributed (i.i.d.) with mean 0 and variance σ² > 0). To dampen the effect of the errors ε_n, Robbins and Monro (1951) replaced 1/M′(x_n) in (20) by constants a_n that converge to 0. Specifically, assuming that
M(x) < 0 for x < θ and M(x) > 0 for x > θ,  (22)
the Robbins-Monro scheme is defined by the recursion
x_{n+1} = x_n − a_n y_n,  (23)
where the a_n are positive constants such that
Σ_{n=1}^∞ a_n = ∞ and Σ_{n=1}^∞ a_n² < ∞.  (24)
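A minimal numerical sketch of the Robbins-Monro recursion follows; the particular regression function M(x) = tanh(x − θ) with θ = 2, the noise level, and the gains a_n = 1/n are illustrative choices of this example, not of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0

def M(x):
    # unknown increasing regression function with root theta
    return np.tanh(x - theta)

x = 5.0                                  # arbitrary starting design level
for n in range(1, 20001):
    y = M(x) + rng.normal(0.0, 0.5)      # noisy response y_n = M(x_n) + eps_n
    x = x - (1.0 / n) * y                # a_n = 1/n: sum a_n = inf, sum a_n^2 < inf
```

Since M′(θ) = 1 here, a_n = 1/n coincides with Sacks' asymptotically optimal choice a_n = 1/(nM′(θ)) discussed below.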
Noting that maximization of a smooth unimodal regression function M is equivalent to solving the equation M′(x) = 0, Kiefer and Wolfowitz (1952) proposed the following recursive maximization scheme:
x_{n+1} = x_n + a_n (y_n′ − y_n″) / (2 c_n),  (25)
where at the nth stage observations y_n′ and y_n″ are taken at the design levels x_n + c_n and x_n − c_n, respectively, a_n and c_n are positive constants such that
c_n → 0, Σ a_n = ∞, Σ a_n c_n < ∞, Σ (a_n / c_n)² < ∞,  (26)
and (y_n′ − y_n″) / (2 c_n) is an estimate of M′(x_n). Beginning with the seminal papers of Robbins and Monro (RM) and Kiefer and Wolfowitz (KW), there is a vast literature on stochastic approximation schemes of the type (23) and (25). In particular, Blum (1954) proved almost sure (a.s.) convergence of the RM and KW schemes under certain conditions on M and the ε_n. For the case of i.i.d. ε_n with mean 0 and variance σ², Sacks (1958) showed that an asymptotically optimal choice of a_n in the RM scheme (23) is a_n = 1/(nM′(θ)), for which √n (x_n − θ) has a limiting normal distribution with mean 0 and variance σ²/(M′(θ))², assuming that M′(θ) > 0. This led Lai and Robbins (1979) to develop adaptive stochastic approximation schemes of the form
x_{n+1} = x_n − y_n / (n b_n),  (27)
in which b_n is an estimate of M′(θ) based on the current and past observations. Noting that the inputs should be set at θ if it were known, Lai and Robbins (1979) defined the "regret" of an adaptive design to be R_n = Σ_{i=1}^n (x_i − θ)². They showed
that it is possible to have both asymptotically minimal regret and an efficient final estimate, i.e., R_n ~ (σ²/(M′(θ))²) log n a.s. and √n (x_n − θ) has a limiting normal distribution with mean 0 and variance σ²/(M′(θ))² as n → ∞, by using a modified least squares estimate b_n in (27). Asymptotic normality of the KW scheme (25) has also been established by Sacks (1958). However, instead of the usual n^{1/2} rate, one has the rate n^{1/3} for the choices a_n = a/n and c_n = c n^{−1/6}, assuming M to be three times continuously differentiable in some neighborhood of θ. The reason for the slower rate is that the estimate (y_n′ − y_n″)/(2c_n) of M′(x_n) has a bias of the order c_n². This slower rate is common to nonparametric regression and density estimation problems, where it is known that the rate of convergence can be improved by making use of additional smoothness of M. Fabian (1967, 1971) showed how to redefine the scheme (25) when M is continuously differentiable of higher even order in some neighborhood of θ, so that the rate of convergence improves toward n^{1/2} and the suitably normalized x_n − θ has a limiting normal distribution. In control engineering, stochastic approximation (SA) procedures are usually applied to dynamical systems. Besides the dynamics in the SA recursion, the dynamics of the underlying stochastic system also plays a basic role in the convergence analysis. Ljung (1977) developed the so-called ODE method that has been widely used in such convergence analysis in the engineering literature; it studies the convergence of SA or other recursive algorithms in stochastic dynamical systems via the stability analysis of an associated ordinary differential equation (ODE) that defines the "asymptotic paths" of the recursive scheme; see Kushner and Clark (1978) and Benveniste, Metivier and Priouret (1987). Moreover, a wide variety of KW-type algorithms have been developed for constrained or unconstrained optimization of objective functions on-line in the presence of noise.
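A corresponding sketch of the Kiefer-Wolfowitz scheme (25) is given below; the quadratic test function, the gains a_n = 2/n, and the spacings c_n = n^{−1/6} are illustrative choices of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 1.5

def M(x):
    # smooth unimodal regression function with maximum at theta
    return -(x - theta) ** 2

x = 0.0
for n in range(1, 50001):
    a_n, c_n = 2.0 / n, n ** (-1.0 / 6.0)
    y_plus = M(x + c_n) + rng.normal(0.0, 0.3)      # response at x_n + c_n
    y_minus = M(x - c_n) + rng.normal(0.0, 0.3)     # response at x_n - c_n
    x = x + a_n * (y_plus - y_minus) / (2.0 * c_n)  # finite-difference gradient step
```

Only the noisy responses at x_n ± c_n are used; no derivative of M is ever evaluated.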
For p-dimensional design levels, Spall (1992) introduced "simultaneous perturbation" SA schemes that take only 2 (instead of 2p) measurements to estimate a smoothed gradient approximation at every stage; see also Spall and Cristion (1994). For other recent developments of SA, see Kushner and Yin (1997). The ODE approach to analyzing SA procedures usually assumes that the associated ODE is initialized in the domain of attraction of an equilibrium point. Moreover, the theory on the convergence rate of x_n under conditions such as unimodality and smoothness refers only to n so large that x_n lies in a sufficiently small neighborhood of the limit θ, where the regression function can be approximated by a local polynomial. The need for good starting values to initialize SA procedures is also widely recognized in practice. By synthesizing constrained KW search with nonparametric bandit theory, Yakowitz (1993) developed the following multistart procedure that converges to the global maximum of M over a convex subset of Euclidean space, even though M may have multiple local maxima and minima.
Let and let be a probability measure on that has a positive continuous density function with respect to Lebesgue measure. Let and The set represents times at which new starting values are generated. Specifically, for choose a new starting value at random according to the probability distribution Let be a hypercube centered at with sides of length and observe a response at design level which will be used to initialize a KW scheme constrained (by projection) inside For carry out a constrained KW step for each hypercube (“bandit arm”) whose sample mean based on all the observed inside the hypercube up to stage differs from the largest sample mean by no more than or whose sample size is smaller than The basic idea is to occasionally generate new regions for local exploration using a constrained KW procedure. Letting denote the KW design level at stage inside the hypercube with the largest sample mean among all hypercubes whose sample sizes exceed Yakowitz (1993) showed that has a limiting normal distribution under certain regularity conditions, where is the global maximum of M (which may have many local maxima) over
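A stripped-down sketch of the restart-and-eliminate layer of this idea follows. The two-bump test function, the uniform restart distribution, and the sample-mean elimination rule are illustrative assumptions, and the constrained KW refinement step is omitted: each "arm" is simply resampled at its starting point, so only the bandit component of the multistart procedure is shown.

```python
import numpy as np

rng = np.random.default_rng(3)

def M(x):
    # two-bump objective: global maximum ~1 near x = 3, local maximum 0.6 near x = -1
    return np.exp(-0.5 * (x - 3.0) ** 2) + 0.6 * np.exp(-0.5 * (x + 1.0) ** 2)

arms = list(rng.uniform(-4.0, 6.0, size=40))     # random starting values on the search set
sums = np.zeros(len(arms))
counts = np.zeros(len(arms))
active = list(range(len(arms)))
for stage in range(200):
    for k in active:
        sums[k] += M(arms[k]) + rng.normal(0.0, 0.3)   # noisy response at the start point
        counts[k] += 1
    if stage >= 50:                                    # eliminate apparently poor arms
        means = sums / counts
        best = max(means[k] for k in active)
        active = [k for k in active if means[k] > best - 0.1]
best_arm = max(range(len(arms)), key=lambda k: sums[k] / counts[k])
x_best = arms[best_arm]
```

Despite the local maximum, the surviving arm lies near the global maximizer with high probability, because arms in the inferior bump are eventually excluded by their sample means.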
REFERENCES
Agrawal, R., D. Teneketzis and V. Anantharam. (1989a). Asymptotically efficient adaptive allocation schemes for controlled I.I.D. processes: Finite parameter space. IEEE Trans. Automat. Contr. 34, 258-267. Agrawal, R., D. Teneketzis and V. Anantharam. (1989b). Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. IEEE Trans. Automat. Contr. 34, 1249-1259. Anantharam, V., P. Varaiya and J. Walrand. (1987). Asymptotically efficient allocation rules for multiarmed bandit problems with multiple plays. Part II: Markovian rewards. IEEE Trans. Automat. Contr. 32, 975-982. Banks, J. S. and R.K. Sundaram. (1992). Denumerable-armed bandits. Econometrica 60, 1071-1096. Banks, J. S. and R.K. Sundaram. (1994). Switching costs and the Gittins index. Econometrica 62, 687-694. Benveniste, A., M. Metivier, and P. Priouret. (1987). Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, New York. Berry, D. A. (1972). A Bernoulli two-armed bandit. Ann. Math. Statist. 43, 871-897. Blum, J. (1954). Approximation methods which converge with probability one. Ann. Math. Statist. 25, 382-386. Borkar, V. and P. Varaiya. (1979). Adaptive control of Markov chains. I: Finite parameter set. IEEE Trans. Automat. Contr. 24, 953-958.
Brezzi, M. and T.L. Lai. (2000a). Incomplete learning from endogenous data in dynamic allocation. Econometrica 68, 1511-1516. Brezzi, M. and T.L. Lai. (2000b). Optimal learning and experimentation in bandit problems. To appear in J. Economic Dynamics & Control. Chang, F. and T.L. Lai. (1987). Optimal stopping and dynamic allocation. Adv. Appl. Probab. 19, 829-853. Chernoff, H. (1967). Sequential models for clinical trials. Proc. Fifth Berkeley Symp. Math. Statist. & Probab. 4, 805-812. Univ. California Press. Fabian, V. (1967). Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Statist. 38, 191-200. Fabian, V. (1971). Stochastic approximation. In Optimizing Methods in Statistics (J. Rustagi, ed.), 439-470. Academic Press, New York. Fabius, J. and W.R. van Zwet. (1970). Some remarks on the two-armed bandit. Ann. Math. Statist. 41, 1906-1916. Gittins, J.C. (1979). Bandit processes and dynamic allocation indices (with discussion). J. Roy. Statist. Soc. Ser. B 41, 148-177. Gittins, J.C. (1989). Multi-Armed Bandit Allocation Indices. Wiley, New York. Gittins, J.C. and D.M. Jones. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (J. Gani et al., ed.), 241-266. North Holland, Amsterdam. Graves, T.L. and T.L. Lai. (1997). Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM J. Contr. Optimiz. 35, 715-743. Kaelbling, L.P., M.L. Littman and A.W. Moore. (1996). Reinforcement learning: A survey. J. Artificial Intelligence Res. 4, 237-285. Kiefer, J. and J. Wolfowitz. (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23, 462-466. Kirkpatrick, S., C.D. Gelatt and M.P. Vecchi. (1983). Optimization by simulated annealing. Science 220, 671-680. Klimov, G.P. (1974/1978). Time-sharing service systems I, II. Theory Probab. Appl. 19/23, 532-551/314-321. Kumar, P.R. (1985).
A survey of some results in stochastic adaptive control. SIAM J. Contr. Optimiz. 23, 329-380. Kushner, H.J. and D.S. Clark. (1978). Stochastic Approximation for Constrained and Unconstrained Systems. Springer-Verlag, New York. Kushner, H.J. and G. Yin. (1997). Stochastic Approximation Algorithms and Applications. Springer-Verlag, New York. Lai, T.L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15, 1091-1114. Lai, T.L. and H. Robbins. (1979). Adaptive design and stochastic approximation. Ann. Statist. 7, 1196-1221.
Lai, T.L. and H. Robbins. (1985). Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4-22. Lai, T.L. and S. Yakowitz. (1995). Machine learning and nonparametric bandit theory. IEEE Trans. Automat. Contr. 40, 1199-1209. Lai, T.L. and Z. Ying. (1988). Open bandit processes and optimal scheduling of queuing networks. Adv. Appl. Probab. 20, 447-472. Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Contr. 22, 551-575. Mandl, P. (1974). Estimation and control of Markov chains. Adv. Appl. Probab. 6, 40-60. Mortensen, D. (1985). Job search and labor market analysis. Handbook of Labor Economics 2, 849-919. Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58, 527-535. Robbins, H. and S. Monro. (1951). A stochastic approximation method. Ann. Math. Statist. 22, 400-407. Rothschild, M. (1974). A two-armed bandit theory of market pricing. J. Economic Theory 9, 185-202. Sacks, J. (1958). Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist. 29, 375-405. Spall, J.C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Automat. Contr. 37, 332-341. Spall, J.C. and J.A. Cristion. (1994). Nonlinear adaptive control using neural networks: Estimation with a smoothed form of simultaneous perturbation gradient approximation. Statistica Sinica 4, 1-27. Varaiya, P.P., J.C. Walrand and C. Buyukkoc. (1985). Extensions of the multiarmed bandit problem: the discounted case. IEEE Trans. Automat. Contr. 30, 426-439. Whittle, P. (1981). Arm-acquiring bandits. Ann. Probab. 9, 284-292. Yakowitz, S. (1989). A statistical foundation for machine learning, with application to Go-moku. Computers & Math. 17, 1085-1102. Yakowitz, S. (1993). A global stochastic approximation. SIAM J. Contr. Optimiz. 31, 30-40. Yakowitz, S., J. Jayawardena and S. Li. (1992).
Theory for automatic learning under partially observed Markov-dependent noise. IEEE Trans. Automat. Contr. 37, 1316-1324. Yakowitz, S. and M. Kollier. (1992). Machine learning for blackjack counting strategies. J. Statist. Planning & Inference 13, 295-309. Yakowitz, S. and W. Lowe. (1991). Nonparametric bandit methods. Ann. Operat. Res. 28, 297-312.
Yakowitz, S. and E. Lugosi. (1990). Random search in the presence of noise, with application to machine learning. SIAM J. Scient. & Statist. Comput. 11, 702-712. Yakowitz, S. and J. Mai. (1995). Methods and theory for off-line machine learning. IEEE Trans. Automat. Contr. 40, 161-165.
Chapter 4
EXACT ASYMPTOTICS FOR LARGE DEVIATION PROBABILITIES, WITH APPLICATIONS
Iosif Pinelis
Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan 49931
[email protected]
Keywords:
large deviation probabilities, final zero crossing, last negative sum, last positive sum, nonparametric bandit theory, subexponential distributions, superexponential distributions, exponential probabilistic inequalities
Abstract
Three related groups of problems are surveyed, all of which concern asymptotics of large deviation probabilities themselves – rather than the much more commonly considered asymptotics of the logarithm of such probabilities. The former kind of asymptotics is sometimes referred to as “exact”, while the latter as “rough”. Obviously, “exact” asymptotics statements provide more information; the tradeoff is that additional restrictions on regularity of underlying probability distributions and/or on the corresponding zone of deviations are then required.
Most broadly, large deviation probabilities may be defined as any small probabilities. Thus, large deviation probabilities will always arise in hypothesis testing (say) when small error probabilities are desired. For example, consider the “small deviation” probability where X is a continuous random variable (r.v.) and is a small positive number; obviously, this probability is small and can be rewritten as the large deviation probability where and Of course, when a large deviation probability, say of the form is considered, usually the r.v. Y is assumed to represent a certain structure of interest; e.g., Y may be a linear or quadratic form in independent r.v.’s.
Literature on large deviations is vast. Most of the results concern the so-called large deviation principle (LDP), which deals with asymptotics of the logarithm of the large deviation probability (which is most often Gaussian-like), rather than with asymptotics of the probability itself. We write if and if. Some authors refer to asymptotics of the logarithm of the large deviation probability as "rough" asymptotics, in contrast with "exact" asymptotics of the probability itself. To appreciate the difference, note that such two different cases of "exact" asymptotics as, e.g.,
for any
correspond to the same “rough" asymptotics
A great deal of literature on "rough" asymptotics is reviewed and referenced in the monograph by Dembo and Zeitouni (1998). In this paper, we shall survey a few results concerning "exact" asymptotics. Naturally, the volume of literature in this direction is much smaller. Yet, it still exceeds by far our capabilities to do any kind of justice to them in this paper. See another recent monograph, Vinogradov (1994). A typical result in this direction is the following beautiful theorem due to A. V. Nagaev (1971). Let S_n = X_1 + … + X_n be the partial sum of n independent identically distributed (i.i.d.) r.v.'s with a common cumulative distribution function (c.d.f.) F such that E X_1 = 0, E X_1² = σ² < ∞, and 1 − F(x) = x^{−α} L(x), where α > 2 and L is slowly varying at ∞. Then

P(S_n > x) ~ (1 − Φ(x/(σ√n))) + n (1 − F(x))  (1)

whenever x/√n → ∞; here Φ is the c.d.f. of the standard normal law. The first summand, 1 − Φ(x/(σ√n)), is roughly the probability of the set of the trajectories (S_1, …, S_n) such that S_n > x and all the jumps X_i are relatively small. The second summand, n(1 − F(x)), is roughly the probability of the set of the trajectories such that S_n > x and exactly one of the jumps, say X_i, is close to x, while the input of all the rest, the X_j with j ≠ i, is negligible. The first summand on the right-hand side of (1) dominates the second one in the zone of "moderately" large deviations
while the second summand dominates the first one in the zone of "very" large deviations for every If the exponent then, assuming for simplicity the symmetry of the distribution of the one has
whenever see Heyde (1968). In the borderline case the behavior of the large deviation probabilities depends on that of the slowly varying function see Tkachuk (1974; 1975). Finally, for subexponential tails lighter than powers, the nature of large deviations is more complicated; see again Nagaev (1971). Just as for (1), certain conditions of regularity of tails are required for "exact" asymptotics in the entire zone of the large deviation probabilities in which relation (2) obtains. However, all standard subexponential statistical distributions possess much more than the required regularity. Let us now briefly describe the contents of this paper. In Section 1, we present our joint results with Sid Yakowitz. There, "exact" asymptotics of large deviation probabilities concerning the last negative sum of i.i.d. r.v.'s with a positive mean are given, as well as related invariance principles for conditional distributions in the space of trajectories of the cumulative sums. Applications to nonparametric bandit theory are given here as well. In Section 2, "exact" asymptotics of large deviation probabilities in the space of trajectories of the cumulative sums are described, which generalize (1). In Section 3, "exact" asymptotics of "very" large deviation probabilities in general Banach spaces are described, which generalize (2).
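The "one big jump" picture behind (1) is easy to check numerically: for a centered Pareto-type tail and deviations x well beyond the Gaussian range, P(S_n > x) is close to the single-jump term n(1 − F(x)). The tail index, n, x, and Monte Carlo sample size below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, n, x = 2.5, 40, 40.0
mu = alpha / (alpha - 1.0)                 # mean of a standard Pareto(alpha) variable

# N paths of n centered Pareto(alpha) jumps: P(X + mu > t) = t**(-alpha) for t >= 1
N = 100_000
X = rng.random((N, n)) ** (-1.0 / alpha) - mu
S = X.sum(axis=1)

p_mc = float(np.mean(S > x))               # Monte Carlo estimate of P(S_n > x)
p_jump = n * (x + mu) ** (-alpha)          # "one big jump" term n * (1 - F(x))
```

In this "very large deviation" zone the Gaussian summand of (1) is negligible, and the Monte Carlo probability agrees with n(1 − F(x)) up to a moderate constant factor.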
1.
LIMIT THEOREMS ON THE LAST NEGATIVE SUM AND APPLICATIONS TO NONPARAMETRIC BANDIT THEORY
The results of this section stem from the line of inquiry begun by Robbins (1952) and continued by Lai and Robbins (1985) and Yakowitz and Lowe (1991). They are taken from Pinelis and Yakowitz (1994), unless specified otherwise. Let be a sequence of i.i.d. r.v.’s with and let denote the c.d.f. of X. Let for and let By the strong law of large numbers, T is a proper random variable, as well as
Put
We assume that the distribution of X is either absolutely continuous or lattice. The latter means that for some and the only values assumed by X are the ones of the form without loss of generality, we may and do assume that the maximum span of the lattice distribution is 1; that is, the greatest such that for some is 1; in addition, it is assumed that Then we can say that the distribution of X, in both cases, is absolutely continuous with respect to the measure that is the Lebesgue one in the "absolutely continuous" case and the counting measure in the maximum span 1 lattice case. In both cases,
where is the density function; if is the counting measure, then for every integer and (3) means in this case that
The notation as in (3) is convenient since it unifies both cases. Note that analogues of the subsequent results hold when either the underlying distribution is any lattice one or has a non-zero absolutely continuous component. Now let stand for the density of which is defined, in both cases, analogously to (3). Let The following assumption is crucial: there exist a constant positive function such that
Theorem 1 If (4) holds, then
where
Moreover,
and a
More explicit expressions for can be given for the case of exponentially or superexponentially decreasing left tails; the expressions are especially simple when X has a maximum span 1 lattice lower-semicontinuous distribution, i.e., when note that this case corresponds, in the application to the learning problems described in the next section, to the situation when, say, and In particular, a yet simpler formula,
takes place when X can assume only the three values (−1), 0, 1 with probabilities respectively, so that this will correspond to the situation when and can assume values 0 or 1 only. As to the case when X has a maximum span 1 lattice upper-semicontinuous distribution, i.e., one can obtain the following improvement of Theorem 1 itself:
where
The asymptotic result (5), provided by Theorem 1, depends only on condition (4), which is related to the local limit theorem for large deviations of sums of i.i.d. r.v.'s. Naturally, the question remains, "When is the restriction (4) satisfied?" Our conjecture is that (4) is always valid provided only that (i) EX > 0, (ii) P(X < 0) > 0, and (iii) the density is surely bounded. Moreover, we conjecture that for all
where
if these integrals both exist, otherwise. Note that and always, so that is a well-defined non-positive real number. Considerations below show that our argument is applicable to all superexponentially decreasing and also to some complete enough spectra of exponentially and subexponentially decreasing left distribution-tails of X. In particular, conjecture (4)&(8) is satisfied for all usual statistical distributions, e.g., if and the distribution of Y belongs to any one of the following classes: all bounded distributions, Poisson, normal, F, Pareto, Gamma (including the Erlang and geometric ones), the extreme value distributions (both kinds of tails), lognormal and Weibull distributions, Pascal and negative binomial distributions (including the discrete geometric one), etc. Let us now discuss the applicability of conjecture (4)&(8) in more detail. First, consider
1.1.
CONDITION (4)&(8): EXPONENTIAL AND SUPEREXPONENTIAL CASES
Suppose that the following Cramér type condition (cf., e.g., Hu (1991)) takes place: for some and for each compact
In particular, all bounded distributions, Poisson, normal, Gamma (including the Erlang and geometric ones), Pascal and negative binomial distributions (including the discrete geometric one) and their Cramér transforms, etc., are taken care of via this case. Take, for a moment, the distribution of X to be absolutely continuous. We of course assume throughout that EX > 0 and P(X < 0) > 0. Then (see, e.g., Borovkov and Rogozin (1965) or Hu (1991), formula (A.1), page 415),
as
uniformly in
belonging to any compact subset of the set
where
indicates the derivative and is the unique solution to the corresponding equation. Note that this definition is consistent
with (10). Suppose that
(Recall that and so, for (14) to be true, it is sufficient that the left tail be superexponential, i.e., or, more generally, that Using (12), one can see that (4) is satisfied in this case with moreover, Note that in the Cramér case, under the additional condition that there exists a number such that
Siegmund (1975) gave integral analogues of (4). Condition (15), under which the treatment does get simplified, was also used in various large deviation problems; e.g., see Bahadur and Rao (1960), Feller (1971), Borovkov (1976), and Fill (1983). The lattice distribution case can be treated in essentially the same way, using, e.g., the result of Petrov (1965); also see Petrov (1975). In the literature, it has become a kind of natural modus vivendi to confine analysis of large deviations of order in boundary problems for random walks only to the case when the Cramér condition (11) holds; see, e.g., Hu (1991) and the references therein. We have seen that in such a case, our approach — via condition (4)&(8) — is valid, as are the habitually used Cramér transform methods. What is more, condition (4)&(8) (if it is indeed a condition in the sense that it is not always satisfied!) leads to essentially new horizons. For example, consider here the following vast class of distributions for which the Cramér condition (11) and/or (14) may fail while (4) is true.
1.2.
CONDITION (4)&(8): EXPONENTIAL (BEYOND (14)) AND SUBEXPONENTIAL CASES
Let X be either absolutely continuous or maximum span 1 lattice, with the density of the form either
or
where
and as
are functions slowly varying at (i.e., and vanishing sufficiently rapidly at
say, where We emphasize that the "subexponential" case is included here as well. The classes (16) and (17) contain such distributions as F, Pareto, Weibull, stable distributions, and their Cramér transforms. If (recall (10)) < 0, then (14) is satisfied, and so, condition (4)&(8) is true, in view of the above discussion. Suppose now that If (16) holds, then (cf. Nagaev (1971))
uniformly in this implies
for any
here,
uniformly in for any An analogous statement holds for the case (17); cf. Nagaev (1969). Note, finally, that here as defined by (9), so that (4)&(8) hold again, while (14) fails. Theorem 1 and (7) show that the adequate term for description of the asymptotics of is Moreover, Theorem 1 implies
Usually, the sum in (19) can be easily treated. Suppose, e.g., that conditions (11) and (14) are valid. Then (12) gives
where From (19) and (20) it easily follows that
But it is known (or may be obtained using (12)) that under the conditions (11) and (14), the main term asymptotics of differs from that of given by (12), only by a constant factor. Our conclusion therefore is that under the conditions (11) and (14),
where is a positive constant depending only on F. The last assertion, also under a Cramér type condition similar to (11)&(14), but with the additional condition (15), was found by Siegmund (1975).
In case (15) does take place, a comparison of (21) and the result of Siegmund (1975) shows that
where One can see that (23) does not contain defined by (15); we therefore conjecture that (23) holds without (15), just whenever the Cramér condition (11)&(14) is satisfied. In order to compute using (23), one needs to deal with the distributions of all the But as it was said above, the situation is much simpler in the case when X has a maximum span 1 lower-semicontinuous distribution. In this case, it is easy to obtain from a "renewal" integral equation similar to, say, (3.16) in Feller (1971) that for integral hence, (6) and (8) yield
where
is given by (15),
As we stated above, the expression for becomes especially simple when X has a maximum span 1 distribution which is both upper- and lower-semicontinuous, i.e., when X can assume only the three values (−1), 0, 1 with probabilities, say, respectively. Here, (7) takes place with
Another case, in which a simpler formula for may be found, is when enjoys a more explicit expression — see, e.g., Borovkov (1976), Chap. 4, Section 19. In particular, this is true in the case of the Erlang distribution. If, e.g., is "purely continuous exponential" on that is,
where — cf. Borovkov (1976), Chap. 4, Section 19, (14) and Feller (1971), Chap. XII, (5.9), then (6) and (8) yield
Let us now see what happens when the Cramér condition fails altogether. Suppose, e.g., that (16) takes place with Then (18) and (19) yield
where is a positive constant. But it is known — cf. Nagaev (1971) — that in this case, Hence,
so that the asymptotics is similar to that of (22).
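For the three-valued walk discussed above, the rapid (geometric) decay of the distribution of the last negative sum T is easy to observe by simulation. The step probabilities, the horizon, and the sample size below are illustrative; the horizon is long enough that negative sums beyond it are vanishingly rare.

```python
import numpy as np

rng = np.random.default_rng(5)
p_minus, p_zero, p_plus = 0.2, 0.2, 0.6     # P(X = -1), P(X = 0), P(X = 1); EX = 0.4 > 0
paths, horizon = 20_000, 200

steps = rng.choice([-1, 0, 1], size=(paths, horizon), p=[p_minus, p_zero, p_plus])
S = steps.cumsum(axis=1)                    # S[:, n-1] is the partial sum S_n

neg = S < 0
# index n of the last negative partial sum on each path (0 when there is none)
T = np.where(neg.any(axis=1),
             horizon - np.argmax(neg[:, ::-1], axis=1),
             0)

p10 = float(np.mean(T >= 10))               # empirical P(T >= 10)
p20 = float(np.mean(T >= 20))               # empirical P(T >= 20)
```

The tail probabilities shrink by a roughly constant factor per fixed increment of n, as the exponential asymptotics of Theorem 1 predict.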
1.3.
THE CONDITIONAL DISTRIBUTION OF THE INITIAL SEGMENT OF THE SEQUENCE OF THE PARTIAL SUMS GIVEN T = N
Here, we shall present, in particular, limit theorems for the conditional distribution of the number of negative values among given First, we present the limit theorem for the conditional distribution of given denote by the density of this distribution (in the lattice case, this notation may be regarded literally as well). Theorem 2 If condition (4) holds, then for
all
where is defined by (6); by the Scheffé theorem (see, e.g., Billingsley (1968)), this convergence then takes place also in variation, and hence in distribution; moreover, for large enough
where C is the same as in (4). Consider now the conditional distribution of given where Observe that this distribution remains the same if the condition is replaced by — recall the independence
of the Unlike the universal character of the limiting behavior of the distributions described by Theorems 1 and 2, that of essentially depends on whether the Cramér condition (11)&(14) is satisfied or not. Consider the random process on [0,1] defined by the formula
where is the integral part of The trajectories of live in the space, denoted D — see, e.g., Billingsley (1968) — of the functions without discontinuities of the second kind. As usual, D is assumed to be endowed with the Chebyshev (sup) norm; let be the corresponding Borel σ-field in D.
given (or, equivalently, given weakly converges to that of the standard Brownian bridge; for the definition of the latter, see, e.g., Billingsley (1968); is defined in (13). The same is true for the conditional distribution of given As mentioned earlier, if the Cramér condition fails, the limiting behavior of the process defined by (25) is completely different; instead of the Brownian bridge as the limiting process, one has to deal with the following Z-shaped trajectories:
If U is a random variable uniformly distributed on [0, 1] and, as before, let us call the random function the Z process. Theorem 4 If the subexponentiality condition (16) with is satisfied, then for all and the conditional distribution in D of the process
given (or, equivalently, given weakly converges to that of the Z process. The same is true for the conditional distribution of given
Remark. Theorem 4 holds if, instead of (16), condition (17) with is assumed; if, however, (17) takes place with but then the limiting behavior of the process given (or becomes somewhat whimsical, taking on shapes intermediate between those of the Brownian bridge and the Z process; here, we do not give further details on this case. Nonetheless, while in the two cases — (11)&(14) on the one hand and (16) on the other — we have such different patterns as the Brownian bridge and the Z process, the asymptotics of the conditional distribution of the number of, say, negative sums are exactly the same in both cases — which is no wonder: cf., e.g., Feller (1971), Theorem 3, Section 9, Chapter XII. More rigorously, it is easy to extract from Theorems 3 and 4 the following. Corollary 1 Let
denote any of the following random variables:
If and either the Cramér condition (11)&(14) or the subexponentiality condition (16) with (or (17) with and (0,1/2)) is satisfied, then for all and the conditional distribution of given (or, equivalently, given weakly converges to the uniform distribution on [0,1]. The same is true for the conditional distribution of given An obvious observation based on this corollary is that the "striking distributional similarity" between the asymptotics of (in our terms) and which is the result of Kao (1978), is actually by no means surprising.
1.4.
APPLICATION TO BANDIT ALLOCATION ANALYSIS
First, we consider a simplified strategy, which nevertheless turns out to be in a certain sense optimal, at least in the exponential and superexponential cases, as shown in Yakowitz and Lowe (1991). 1.4.1 Test-times-only based strategy. Suppose that we have two bandits and whose behaviour is described by two independent sequences of i.i.d. r.v.’s and respectively; and are regarded as the results of potential observations made on and at the time moment as if the observations were made indeed,
We assume that the better bandit is the former, that is, EY < EZ. (This is of course unknown to the observer; we suppose that in general no information about the distributions of Y and Z is available except that the means EY and EZ are finite.) Let N(1) = 1 < N(2) < N(3) < . . . be a sequence of nonrandom integers, called here test times; the observations are supposed to be made only at these time moments. Consider the number of the test times up through the time moment
and
the cumulative sums of the observations made on and up to the time moment Let us say that bandit is apparently better at the time moment (which leads to the erroneous decision that is better indeed) if Denote by the last time moment at which is apparently better:
Since the following statement is straightforward from (19). Proposition 1 If the distribution of satisfies (4), then for the constant given in (6), where is the density of
Remark. Recall that condition (4) holds under very general circumstances. Equations (20), (23), and those subsequent to them give more explicit forms of (26) for different rates of tail decrease. Let M be the total number of test times at which the wrong bandit was selected.
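The quantities θ and M lend themselves to direct simulation. The following Python sketch (not part of the original text; the Gaussian rewards, the mean gap, and all names are illustrative assumptions) runs the test-times-only comparison and records how often, and until when, the worse bandit looks better:

```python
import random

def simulate_test_times_strategy(mu_y=0.0, mu_z=0.5, num_tests=200, seed=1):
    """Test-times-only strategy for two bandits: at each test time one
    observation is made on each bandit, and the bandit with the smaller
    cumulative sum is 'apparently better' (smaller mean is better, EY < EZ).
    Returns (M, theta): the number of test times at which the worse bandit Z
    looked better, and the last such test time (0 if none)."""
    rng = random.Random(seed)
    s_y = s_z = 0.0
    wrong = last_wrong = 0
    for k in range(1, num_tests + 1):
        s_y += rng.gauss(mu_y, 1.0)   # observation on the better bandit Y
        s_z += rng.gauss(mu_z, 1.0)   # observation on the worse bandit Z
        if s_z <= s_y:                # Z apparently better: a wrong decision
            wrong += 1
            last_wrong = k
    return wrong, last_wrong
```

Since E(Z - Y) > 0 in this sketch, the strong law of large numbers makes the wrong-selection count M finite almost surely.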
70
MODELING UNCERTAINTY
Proposition 2 Put with Suppose that If (16) then where the constant c is given in (6); if (17) (with (0,1/2)) are satisfied, then
1.4.2 Multiple bandits and all-decision-times based strategy. Here we consider a more common strategy: Continue to observe the apparently worse bandit at the test times and use every observation (including all those made between test times) in calculating the sample mean. At test times, switch the observation allocation whenever the order relation of the bandit sample means changes. Let us now give a rigorous description of the model. Let stand for the bandits under consideration. For each let be a sequence of i.i.d. r.v.’s with the common mean is regarded as the result of a potential observation made on at the time moment as if the observation were made indeed, We assume that the best bandit is that is, (this fact is of course assumed to be unknown to the observer). Let N(1) = 1 < N(2) < N(3) < · · · be a sequence of nonrandom integers, called here test times. We define recursively the sequences and where The first of them is the sequence of the decision indicators:
is the number of times has been observed up to the test time is the cumulative sum of the corresponding results of observations; and is the corresponding sample mean. More exactly, we put
(at the test times all the bandits are tested), and if then if
and and
otherwise, where
Thus the control-decision process is completely defined. Let us now introduce the last time moment at which the best bandit is not observed. By the above construction, Fix any real number such that and put Consider and the number of the test times up to the time moment Theorem 5 For all
Note that for all and Thus‚ one can combine Theorem 5 and previous results in order to obtain estimates for the distribution tail of Suppose that‚ for each is the p.d.f. of (understood as in (3))‚ is the c.d.f. of and (cf. (4))
some constant and function Then, by (19),
for some constants The last expression can be further treated (cf. (22) and (24)). The estimate given by Theorem 5 might seem too rough, but actually it is not, because, by the strong law of large numbers, the quality of the test at the time moment is basically determined by the number of observations made up to the time moment on the apparently worst bandit, which is close to if is large.
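As an illustration of the all-decision-times strategy just described (a sketch only: the "largest sample mean wins" orientation, the Gaussian rewards, and the regular test-time spacing are assumptions of this example), the allocation rule can be simulated directly:

```python
import random

def run_allocation(means, num_steps=3000, test_gap=50, seed=7):
    """All-decision-times strategy: at test times every bandit is observed
    once; between test times only the bandit with the largest current sample
    mean is observed.  Every observation enters the sample means.
    Returns the per-bandit observation counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k

    def observe(i):
        counts[i] += 1
        sums[i] += rng.gauss(means[i], 1.0)

    for t in range(1, num_steps + 1):
        if t % test_gap == 1:          # a test time: sample every bandit
            for i in range(k):
                observe(i)
        else:                          # otherwise follow the apparent leader
            leader = max(range(k), key=lambda i: sums[i] / counts[i])
            observe(leader)
    return counts
```

With means (0, 0, 1), almost all non-test observations eventually go to the best bandit, in keeping with the remark above that the sample-mean test sharpens as the observation counts grow.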
2.
LARGE DEVIATIONS IN A SPACE OF TRAJECTORIES
The results of this section are taken from Pinelis (1981), unless specified otherwise. As in Section 1, let be i.i.d. r.v.’s with a common distribution function F and let stand for the cumulative sums of the Let the random process be defined by the same formula (25). However, in this section we shall assume that EX = 0, rather than EX > 0. Also, in this section we shall impose more restrictions on F. First, we shall assume that Var X is finite; actually, without loss of generality, we shall assume that Var X = 1. Further restrictions on F will be imposed in the statements of results below. In this section, we shall describe the asymptotic behavior of large deviation probabilities of the form where A is separated from subject only to the conditions
and and
and vary arbitrarily. In addition, we assume
that the tails of the distribution of X decrease in a certain sense as as where A similar problem corresponding to was considered by Godovan’chuk (1978). The asymptotic behavior of probabilities of large deviations in the space of trajectories when Cramér’s condition holds was studied by Borovkov (1964). Let us write if
as
where A function H is said to be regularly varying with exponent if for all
Let us say that H is completely regularly varying with exponent
if
for some function which is regularly varying with exponent It is not difficult to see that (i) every function which is completely regularly varying with exponent is regularly varying with exponent and (ii) every function H which is regularly varying with exponent is asymptotically equivalent to a function which is completely regularly varying with exponent that is‚ as Let us say that a set is separated from 0 if
For any as usual, let denote the boundary of A, the distance from to A, and the of A. For and define the step function For consider the “cross section” For any and consider the measures F and
for any Borel set Let us refer to a set separated from 0, as (in the Riemann sense) if the function is continuous almost everywhere (a.e.) on [0,1]. For an set consider the measure defined by the Riemann integral
Let us refer to an set separated from 0, as if
Note that for the of a set separated from 0, it suffices that at least one of the following two conditions hold: (i) the set is for all small enough (so that the Lebesgue integral exists) and (ii) uniformly in Let be a standard (continuous) Wiener process. Then the probability is defined for any Let us refer to a set as if, for any
as
where It is easy to see that this condition is very mild. Indeed, e.g., (27) takes place if, for some and for such a choice of A one may even take See also Remark 2 below. Let us refer to a set as if
as
In view of the great many results on the large deviation principle (LDP), the class of sets is very large. Finally, for consider the two sets and defined by
Now we are able to state the main results of this section. Theorem 6 Let a set be and and such that the sets and are Let the tail functions and be completely regularly varying with exponent Then
whenever
and
Theorem 7 Assume that a function is a.e. continuous, and the set is Assume also that and the tail function is regularly varying with exponent Let Then
whenever and Remark. By the Donsker-Prohorov invariance principle, if then is equivalent to
Remark 1. The question arises: for what functions with will the set be ? Let us call a contact point for if
Let have only a finite number of contact points and let be infinitely differentiable in a neighborhood of each contact point; moreover, for each
assume that not all the derivatives are equal to 0; if we mean here a left neighborhood of and the left derivatives at Then, using Theorem 13 (Borovkov, 1964), it is not hard to see that is An analogous assertion can also be given for the case when the set of contact points has nonzero Lebesgue measure. Thus, we see that the condition of the of is quite mild. Let, as usual, stand for the standard normal c.d.f.:
Corollary 2 Assume that and the tail function is regularly varying with exponent Then whenever and
The following assertion, given above as (1), is due to A.V. Nagaev (1969; 1971). Corollary 3 Assume that and the tail function is regularly varying with exponent Then whenever and
Remark. The conditions of Theorem 6 contain the requirement of complete regular variation of the tails of the distribution of X; that is‚ the requirement of regular variation of the density of the distribution of X. This is necessitated by the breadth of the class of sets for which Theorem 6 holds; such sets A in Theorem 6 do not have to be “integral"; thus‚ Theorem 6 has a “local" aspect to it. On the other hand‚ the condition of the completeness of the regular variation is not necessary in the “integral" Theorem 7 and Corollaries 2 and 3‚ and indeed is not imposed there. A few words about the method of the proof. Let us write
where
is a bounded sequence such that as for any is a positive constant; R is the “remainder". Then and behave mainly as and respectively‚ while the remainder R is negligible. To prove this‚ one uses so-called “methods of one and the same probability space” to show that‚ except for a negligible probability‚ there are only two kinds of trajectories providing for large deviations: (i) the trajectories corresponding
to in which all the jumps are small enough, smaller than and which then behave as the trajectories of the Gaussian process and (ii) the trajectories corresponding to in which there is a large jump larger than and which then behave as the random jump function Note that provided that and vary so that one of the sides of this relation tends to 0. Similarly, provided that and vary so that one of the sides of this relation tends to 0. It is easy to see that the “jump” term is negligible in comparison to the “Gaussian” probability in the zone with and a sufficiently small provided that A is Thus, in this case
On the other hand, the “Gaussian" probability in Theorem 6 is negligible in comparison to the “jump" term in the zone with and a sufficiently large provided that Thus, in this case
As we shall see in the next section, the latter kind of behavior is exhibited in very general settings, when the summands may take on values in a Banach space and do not have to be identically distributed.
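The dichotomy just described (Gaussian-type collective behavior versus one large jump) is easy to observe numerically. In the hedged sketch below, the steps are centered Pareto variables, so the Cramér condition fails, and among trajectories with a large terminal sum the single largest step carries most of the deviation. All parameter choices are illustrative assumptions:

```python
import random

def max_jump_share(alpha=1.5, n=200, threshold=100.0, trials=4000, seed=3):
    """Among random-walk trajectories with S_n > threshold, the average share
    of S_n carried by the single largest step.  Steps are Pareto(alpha) minus
    their mean, so EX = 0 while the right tail decays like x**(-alpha) and
    the Cramér condition fails."""
    rng = random.Random(seed)
    mean = alpha / (alpha - 1.0)      # mean of Pareto(alpha) on [1, infinity)
    shares = []
    for _ in range(trials):
        steps = [rng.paretovariate(alpha) - mean for _ in range(n)]
        s = sum(steps)
        if s > threshold:
            shares.append(max(steps) / s)
    return sum(shares) / len(shares) if shares else 0.0
```

For light-tailed (say, bounded or Gaussian) steps the analogous share would stay small, since there large deviations are produced collectively rather than by one jump.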
3.
ASYMPTOTIC EQUIVALENCE OF THE TAIL OF THE SUM OF INDEPENDENT RANDOM VECTORS AND THE TAIL OF THEIR MAXIMUM
3.1.
INTRODUCTION
Here, we follow Pinelis (1985). To begin, let us consider, as in the previous sections, i.i.d. r.v.’s with a common c.d.f. Set Consider a “typical” trajectory of the sums which significantly deviates from 0, say in the sense that there occurs an event of the form where is such that is small. What does such a typical trajectory look like? If the Cramér condition
is satisfied for some then a “typical large deviation” trajectory of the sums “straightens out”, and the contributions of the summands are overall comparable with one another. If the Cramér condition does not hold, then the shape of a “typical large deviation” trajectory changes significantly. If the tail is not too irregular and (28) does not hold, then in a zone of the form one has the asymptotic relation
which is equivalent to
This means that the large deviations of the sum occur mainly due to just one large summand. Note that in (29) or (30) it is not at all necessary that in fact‚ the number of summands may be as well fixed. It is not difficult to understand the underlying causes of the two different large deviations mechanisms. For instance‚ let there exist a bounded density and let Take a large enough positive number We are interested in the of numbers that maximize the joint density or‚ equivalently‚ minimize under the condition Proposition 3 Let
Let there take place one of the following two cases: (i) function is convex on (ii) function is concave on for some real Then in the first case and as while in the second case and for large enough for a fixed
Thus‚ one can see that if is convex‚ then (the Cramér condition is satisfied and) the maximum of the joint density is attained when the summands are equal: On the other hand‚ if the alternative takes place‚ then (the Cramér condition fails and) the maximum of the joint density is attained when only one of the summands is large (equal to and the rest of them are small (equal to
In other words, the critical influence of the Cramér condition on the type of the large deviation mechanism is related to the facts that (i) “regular” densities are log-concave if the Cramér condition holds and log-convex (in a neighborhood of otherwise, and (ii) the direction of convexity of the logarithm of the density determines the kind of large deviation mechanism.1 Of course, the density does not have to be either log-concave or log-convex for one of the two large deviation mechanisms to work. The role of log-convexity (say) may be assumed, e.g., by a condition that the tail of the distribution is significantly heavier, in a certain sense, than exponential ones. Thus, in Theorems 14 and 16 below, where the tails decrease no faster than with there are no convexity conditions. On the other hand, such a condition is part of Theorem 15 below, where the tails can be arbitrarily close to exponential ones; that a convexity condition is essential is also shown by Example 1 below. For a fixed relation (30) was studied by Chistyakov (1964), with an application to a renewal equation. It was shown in Chistyakov (1964) that for a.s. positive X, relation (30) holds for any fixed iff it is true for i.e., when
C.d.f.’s F satisfying (31) were later on called subexponential because of their property which follows from the relation
for any and any fixed These and other properties of the class of all subexponential c.d.f.’s were established in Chistyakov (1964). The class attracted the attention of a number of authors; see e.g. Embrechts, Goldie, and Veraverbeke (1979), Goldie and Klüppelberg (1978), Vinogradov (1994), and the bibliography therein. If (30) holds for every fixed then it remains true for uniformly in some zone of the form where In general nothing can be said about the rate of growth of There have been many publications concerning (30), including estimates of see, e.g., Andersen (1970); Anorina and Nagaev (1971); Godovan’chuk (1978); Heyde (1968); Linnik (1961); Nagaev (1969; 1971; 1964; 1973; 1979); Pinelis (1981); Rohatgi (1973). How do the results listed below differ from those found in the preceding literature? First, the restriction that the summands are identically distributed is removed. This removal seems quite natural: as an extreme case, if all summands except one are zero, then (29) is trivial.
Second, it is assumed here that the summand r.v.’s take on values in an arbitrary separable Banach space rather than just real values. Note that, in a variety of settings, when the Banach space is “like” infinite-dimensionality works rather toward the only-one-large-summand large deviation mechanism. Indeed, let here and let be independent real-valued r.v.’s such that the distributions of the are continuous; let and otherwise. Then
so that (29) is trivial. Third, conditions of regularity of tails are relaxed here, especially when the second moment does not exist. What is even more important is that regularity conditions are now imposed only on the sum of the tails of the distributions of the summands, rather than on every one of the tails. Fourth, the optimal zone is studied for the most difficult case when the tails may be arbitrarily close to exponential ones; see Theorem 15 and Proposition 4 below. The only precedent to this result is the work by Nagaev (1962), where it is assumed that the summands are real valued and identically distributed and stricter regularity conditions are imposed on the tails. The method of Nagaev (1962) does not work in the more general setting. It may seem somewhat surprising that the proofs of the limit theorems stated below do not use a Cramér transform of any kind but are based on the inequalities given in the next section. This shows, in particular, how powerful those inequalities are. To explain the idea on how to obtain limit theorems based on the upper bounds, let us turn back to (1). It is comparatively easy to show that
On the other hand, as was said, the second summand – – on the right-hand side of (1) dominates the first one – – in the zone of “very” large deviations, when Suppose now that one has an upper bound of the form where
in the zone for any given and large enough cf. (36) below. Clearly that would be sufficient for the asymptotics
in the large deviations zone of the same form – Let be a double-index array of r.v.’s with values in a separable Banach space such that for each the r.v.’s are independent. Let
We shall be interested in the conditions under which the large deviation asymptotics
obtain.
3.2.
EXPONENTIAL INEQUALITIES FOR PROBABILITIES OF LARGE DEVIATION OF SUMS OF INDEPENDENT BANACH SPACE VALUED R.V.’S
Let
where is any nonnegative Borel function. For any function and any positive real numbers consider the generalized inverse function
where we set sup Assume that
for any
Theorem 8 Let a function be log-concave on an interval for some for all where is a positive number. Then for any positive numbers such that one has
where
Remark. If
for some
then
for some constant C > 0, depending only on and It follows, e.g., that if the are i.i.d. and for some then for any there is a constant such that
in the zone of “moderate” deviations. The constant 2 in is obviously the best possible. Similar important properties of the bound (35) result in the optimal zones of large deviations in which (33) and (34) take place; cf. (44) and (56) below. Theorem 8 is based in part on the following result. Theorem 9 (Pinelis and Sakhanenko)
For all
Moment inequalities for sums of infinite-dimensional random vectors are also known; see e.g. Pinelis (1994) and references therein.
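Bounds of this kind can be probed numerically. Since the exact form of the bound in Theorem 8 is not reproduced here, the sketch below uses the classical Hoeffding bound for bounded summands as a stand-in, and checks it against the empirical tail; all names and parameters are assumptions of the illustration:

```python
import math
import random

def empirical_vs_hoeffding(n=100, t=25.0, trials=30000, seed=13):
    """Empirical tail P(S_n >= t) for a sum of n independent Rademacher
    (+1/-1) variables, compared with the classical Hoeffding bound
    exp(-t**2/(2n)).  (A stand-in illustration only; the bound of Theorem 8
    itself is sharper and is not reproduced here.)"""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(n))
        if s >= t:
            hits += 1
    bound = math.exp(-t * t / (2.0 * n))
    return hits / trials, bound
```

The empirical tail falls well below the bound, as it must; exponential inequalities of this type are typically tight only up to the constant in the exponent.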
3.3.
THE CASE OF A FIXED NUMBER OF INDEPENDENT BANACH SPACE VALUED R.V.’S. APPLICATION TO ASYMPTOTICS OF INFINITELY DIVISIBLE PROBABILITY DISTRIBUTIONS IN BANACH SPACES
Let here
and fix
Theorem 10 If the c.d.f. is subexponential (i.e., satisfies (31)), then the asymptotic relation (33) takes place as Theorem 10 is complemented by the following powerful and easy-to-check sufficient condition for subexponentiality. Theorem 11 Let a c.d.f. F with F(0) = 0 satisfy (32). Assume also that there exist numbers and a function such that
as
and the function
Then F is a subexponential c.d.f. Theorem 11 shows that the tail of a subexponential c.d.f. may vary rather arbitrarily between and cf. Proposition 4 below. In particular‚ if for some
as and F(0) = 0, then F is a subexponential c.d.f. This contradicts a statement in Teugels (1975), page 1001 that such an F is not subexponential for That this statement is in error was noticed also by Pitman (1980), who used a different method. The following example shows that the sufficient condition for subexponentiality provided by Theorem 11 is rather close to a necessary one. In particular, one cannot replace there by 1. Example 1 Let where the
and, for all
let
are chosen so that is continuous, and and let a c.d.f. F be such that
Let then for all
Then F is not subexponential, and the function is concave only for Yet, the necessary condition (32) is satisfied. On the other hand, the choice of the function in the definition is rather arbitrary. Instead of one may use here any concave function
such that as and the function is integrable in a neighborhood of In particular, any of the following functions will do:
Theorem 11 works especially well for subexponential tails close to exponential ones. For tails decreasing no faster than power ones, the following criterion is useful. Theorem 12 (Borovkov (1976), page 173) Let a c.d.f. F with F(0) = 0 satisfy (32). Assume also that F is such that
as
Then F is a subexponential c.d.f.
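The subexponential property behind Theorems 10-12 can be checked by simulation. The sketch below estimates P(X_1 + ... + X_n > x)/P(X_1 > x) for Pareto summands, which by the property following (31) should approach n as x grows; the Pareto choice and all parameters are assumptions of the illustration:

```python
import random

def tail_ratio(n, x, alpha=1.0, trials=200000, seed=5):
    """Monte Carlo estimate of P(X1+...+Xn > x) / P(X1 > x) for i.i.d.
    Pareto(alpha) summands on [1, infinity); for a subexponential c.d.f.
    this ratio tends to n as x grows."""
    rng = random.Random(seed)
    hits = sum(
        sum(rng.paretovariate(alpha) for _ in range(n)) > x
        for _ in range(trials)
    )
    p_single = x ** -alpha        # exact Pareto tail: P(X > x) = x**(-alpha)
    return (hits / trials) / p_single
```

For n = 2 and x = 100 the estimate comes out close to 2, and for n = 3 close to 3, reflecting the one-large-summand mechanism discussed above.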
Consider now applications to asymptotics of infinite-dimensional infinitely divisible probability distributions. By measures on a Banach space we mean Borel measures on A probability distribution on is called infinitely divisible if it can be represented, for every natural as the convolution of a probability distribution on For any measure on with a finite variation one can define the compound Poisson distribution
with Lévy measure Let denote the distribution concentrated in a point For any measure and positive numbers let
for all Borel A in
where
Measure is called a Lévy measure if and there exists the weak limit Let for all for some Any of such limits (all of which are equal to one another up to a shift) is denoted by s Pois
It is well known (see e.g. Araujo and Giné (1979)‚ page 108) that any infinitely divisible probability distribution on admits the unique representation of the form
where is a centered Gaussian distribution. In what follows‚ we assume that and are as in (37). Let
Theorem 13 If is a subexponential c.d.f. for some (or, equivalently, for all) then is so too and as Vice versa, if M is a subexponential c.d.f., then (38) holds.
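Theorem 13's tail equivalence can be illustrated for a compound Poisson distribution with heavy-tailed jumps: with subexponential jumps, P(S > x) is asymptotically the expected number of jumps times the jump tail. The concrete parameters and names below are assumptions of the illustration:

```python
import math
import random

def compound_poisson_tail(lam=1.0, alpha=1.5, x=50.0, trials=200000, seed=9):
    """Monte Carlo check of the tail equivalence of Theorem 13 for a compound
    Poisson distribution with Pareto(alpha) jump sizes (a subexponential
    jump c.d.f.): P(S > x) / (lam * P(jump > x)) should be near 1 for
    large x."""
    rng = random.Random(seed)
    floor = math.exp(-lam)
    hits = 0
    for _ in range(trials):
        # N ~ Poisson(lam) via Knuth's product-of-uniforms method
        n, p = 0, rng.random()
        while p >= floor:
            n += 1
            p *= rng.random()
        s = sum(rng.paretovariate(alpha) for _ in range(n))
        if s > x:
            hits += 1
    return (hits / trials) / (lam * x ** -alpha)
```

The ratio exceeds 1 slightly at finite x, since the jumps other than the largest one shift the sum upward; the excess disappears as x grows.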
When and is concentrated on the result of Theorem 13 was obtained by Embrechts, Goldie, and Veraverbeke (1979); when and is regularly varying, by Zolotarev (1961); when is a stable distribution of an index by Araujo and Giné (1979). In the case one may obtain one-sided analogues of the results of this subsection, using the following extension of the notion of a subexponential c.d.f. Let us refer to a c.d.f. F as positive-subexponential if the c.d.f. is subexponential. It is easy to see that if F is a positive-subexponential c.d.f., then for any fixed one has
as It can be shown that if the expectation then (39) is incompatible with the Cramér condition. However, one can construct counterexamples showing that then both (39) and the Cramér condition may hold; for instance, consider
where
3.4.
TAILS DECREASING NO FASTER THAN POWER ONES
In this subsection‚ we shall deal with nonnegative nonincreasing functions H‚ satisfying a condition of the form for some constant C > 1 and some positive numbers It is easy to see that such functions decrease no faster than a power function: if and then
Theorem 14 Let
and
vary so that
In addition, suppose that
for all
for some
and
Then one has the relations (33) and (34).
Corollary 4 Suppose that
for
where H is a positive function such that
for some positive real numbers Suppose also that at least one of the following three conditions holds: (i) is a Banach space of type 2‚ for all and and (44) takes place;
(ii)
is a Banach space of type
for some
and
for all and
(iii) Then one has the relations (33) and (34) provided and (43). Recall that a Banach space is of type if there is a constant such that for all independent zero-mean r.v.’s with values in
Remark. Conditions (45)–(47) are satisfied if‚ e.g.‚
for some and some slowly varying function Thus, Corollary 4 is a generalization of results of Andersen (1970); Godovan’chuk (1978); Heyde (1968), for all the values of in (48) except for If, with it is additionally assumed that, for instance, all the are symmetrically distributed and is of type then one still has the relations (33) and (34) provided and (43). Corollary 5 Let
be fixed‚
Suppose that
as
Then
The latter corollary is for illustrative purposes only. It shows that the individual distributions of the summands may be very irregular; they may even be two-point distributions. For (33) and (34) to hold, it suffices that the sum of the tails be regular enough. Note also that Theorem 12 follows from Theorem 14.
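The point that only the sum of the tails needs to be regular, while the individual distributions may differ, can be illustrated with summands having different Pareto exponents (a sketch; the exponents and all parameters are assumptions):

```python
import random

def sum_vs_max_tail(x=200.0, trials=200000, seed=11):
    """Non-identically distributed heavy-tailed summands: X_i is Pareto with
    its own exponent alpha_i.  Returns Monte Carlo estimates of P(S > x) and
    P(max > x) together with the sum of the individual tails, which should
    all be close to one another for large x."""
    rng = random.Random(seed)
    alphas = [1.2, 1.5, 2.0, 3.0]     # assumed, deliberately non-identical
    tail_sum = sum(x ** -a for a in alphas)
    hit_sum = hit_max = 0
    for _ in range(trials):
        xs = [rng.paretovariate(a) for a in alphas]
        hit_sum += sum(xs) > x
        hit_max += max(xs) > x
    return hit_sum / trials, hit_max / trials, tail_sum
```

Both estimated probabilities come out close to the sum of the individual tails, in the spirit of the asymptotic equivalence (33)-(34).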
3.5.
TAILS‚ DECREASING FASTER THAN ANY POWER ONES
Let stand for the set of all differentiable functions satisfying the conditions as
Introduce
Theorem 15 Suppose that where the positive sequence is non-decreasing and Suppose that for some and for some Then one has (33) and (34).
Note that condition (52) implies that the function is concave, while (53) implies Roughly, conditions (52) and (53) mean, in addition to certain regularity, that the tail decreases more slowly than any exponential one but faster than any power one. The following proposition shows that Theorem 15 concerns tails that may be arbitrarily close to exponential ones. Proposition 4 For any function such that as there exists a function such that as
Condition (56) is essentially (44) rewritten so that the other conditions of Theorem 15 are taken into account. Together with (55), it determines the boundary of the zone of large deviations in which one has (33) and (34); this boundary is best possible, as a comparison with the results of Nagaev (1973) shows. Note that, for the tails close enough to exponential ones, the boundary of the
zone is essentially determined by condition (55); the role of the “separator between the spheres of influence" of conditions (55) and (56) is played by the function The following corollary is parallel to Corollary 5. Corollary 6 Let be fixed. Suppose that conditions (49) and (50) hold. Finally‚ assume that
for some and all large enough integral Then one has (51).
3.6.
TAILS‚ DECREASING NO FASTER THAN
In the previous two subsections‚ there were given theorems covering all the spectrum of the tails‚ from to for which the relations (33) and (34) may hold in general. However‚ tails such as the ones mentioned in the title of this subsection may still remain of interest. As one can see from the statement of Theorem 16‚ a restriction is now imposed only on the first derivative of the tail function‚ while in Theorem 15 one has to take into account essentially the sign of a second derivative. Note also that Theorem 16 generalizes Theorems 1 and 2 of Nagaev (1977) simultaneously. Consider the following classes of nonnegative non-decreasing function defined on some interval where (i) class non-increasing in (ii) class
defined by the condition: for whenever
defined by the condition:
(iii) class for defined by the condition: absolutely continuous, and its derivative satisfies the inequality Proposition 5 If
is
is
then
In particular‚
This proposition shows that all these three kinds of classes are essentially the same.
Theorem 16 Suppose that and, for all
for
where and
Suppose also that and
Then relations (33) and (34) hold. Corollary 7 Suppose that one has (54) for some function
as
such that
Suppose that
where
and
Then relations (33) and (34) hold. Corollary 8 Suppose that
where Suppose that
is non-decreasing and
and
is slowly varying at
Then relations (33) and (34) hold. Of course, the latter corollary could also have been obtained from Theorem 14, and in that case one would have obtained more precise (as compared with (59)) boundaries of the corresponding zone of large deviations. On the other hand, Corollaries 7 and 8 generalize the first statement of each of Theorems 1 and 2 of Nagaev (1977). The remaining statements of the latter two theorems can be deduced quite similarly, if one observes that Theorem 16 can be modified as follows: (i) replace (57) by the condition
(ii) add condition (41); (iii) replace the right-hand side in (58) by
REFERENCES
NOTES 1. Similar observations on large deviation mechanisms were offered by Nagaev (1964).
REFERENCES Andersen‚ G. R. (1970) Large deviation probabilities for positive random variables. Proc. Amer. Math. Soc. 24 382–384. Anorina‚ L. A.; Nagaev‚ A. V. (1971) An integral limit theorem for sums of independent two-dimensional random vectors with allowance for large deviations in the case when Cramér’s condition is not satisfied. (Russian) Stochastic processes and related problems‚ Part 2 (Russian)‚ “Fan" Uzbek. SSR‚ Tashkent‚ 3–11. Araujo‚ A. and Giné‚ E. (1979) On tails and domains of attraction of stable measures in Banach spaces. Trans. Amer. Math. Soc. 248‚ no. 1‚ 105–119. Bahadur‚ R.R. and Rao‚ R. Ranga. (1960) On deviations of the sample mean‚ Ann. Math. Stat.‚ 11‚ 123-142. Billingsley‚ P. (1968) Convergence of Probability Measures‚ Wiley‚ New York. Borovkov‚ A.A. (1964). Analysis of large deviations for boundary problems with arbitrary boundaries I‚ II. In: Selected Translations in Math. Statistics and Probability‚ Vol.6‚ 218-256‚ 257-274. Borovkov‚ A.A. (1976). Stochastic Processes in Queuing Theory. SpringerVerlag‚ New York-Berlin. Borovkov‚ A.A. and Rogozin‚ B.A. (1965) On the multidimensional central limit theorem‚ Theory Probab. Appl.‚ 10‚ 52-62. Chistyakov‚ V.P. (1964) A theorem on sums of independent positive random variables and its application to branching processes‚ Theory Probab. Appl.‚ 9‚ 710–718. Dembo‚ A. and Zeitouni‚ O. (1998) Large deviations techniques and applications. Second edition. Applications of Mathematics‚ 38. Springer-Verlag‚ New York. Embrechts‚ P.; Goldie‚ C. M.; Veraverbeke‚ N. (1979) Subexponentiality and infinite divisibility. Z. Wahrsch. Verw. Gebiete 49‚ no. 3‚ 335–347. Feller‚ W.‚ (1971) An Introduction to Probability Theory and its Applications‚ II‚ 2nd ed.‚ John Wiley & Sons‚ New York. Fill‚ J.A. (1983) Convergence rates related to the strong law of large numbers‚ Annals Probab.‚ 11‚ 123-142. Godovan’chuk‚ V.V. 
(1978) Probabilities of large deviations of sums of independent random variables attracted to a stable law, Theory Probab. Appl. 23, 602–608. Goldie, C.M. (1978) Subexponential distributions and dominated-variation tails. J. Appl. Probability 15, no. 2, 440–442.
Goldie, C.M. and Klüppelberg, C. (1978) Subexponential distributions. University of Sussex: Research Report 96/06 CSSM/SMS. Heyde, C. C. (1968) On large deviation probabilities in the case of attraction to a non-normal stable law. Ser. A 30, 253–258. Hu, I. (1991) Nonlinear renewal theory for conditional random walks, Annals Probab., 19, 401–422. Kao, C-s. (1978) On the time and the excess of linear boundary crossing of sample sums, Annals Probab., 6, 191–199. Lai, T.L. and H. Robbins (1985) Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, 6, 4–22. Linnik, Ju. V. (1961) On the probability of large deviations for the sums of independent variables. Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. II, pp. 289–306. Univ. California Press, Berkeley, Calif. Nagaev, A.V. (1969) Limit theorems taking into account large deviations when Cramér’s condition fails, Izv. AN UzSSR, Ser. Fiz-Matem., 6, 17–22. Nagaev, A.V. (1969) Integral limit theorems taking large deviations into account when Cramér’s condition does not hold, I, II, Theory Probab. Appl., 24, 51–64, 193–208. Nagaev, A.V. (1971), Probabilities of Large Deviations of Sums of Independent Random Variables, D. Sc. Dissertation, Tashkent (in Russian). Nagaev, A.V. (1977) A property of sums of independent random variables. (Russian.) Teor. Verojatnost. i Primenen. 22, no. 2, 335–346. Nagaev, S.V. (1962), Integral limit theorem for large deviations, Izv. AN UzSSR, Ser. fiz.-mat. nauk, 37–43. (in Russian). Nagaev, S.V. (1964), Limit theorems for large deviations, Winter School on Probability Theory and Mathematical Statistics, Kiev, 147–163. (in Russian). Nagaev, S. V. (1973) Large deviations for sums of independent random variables. Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes (Tech. Univ., Prague, 1971; dedicated to the memory of Antonín Špaček). Academia, Prague, 657–674. Nagaev, S. V.
(1979) Large deviations of sums of independent random variables. Ann. Probab. 7‚ no. 5‚ 745–789. Petrov‚ V.V. (1965) On probabilities of large deviations of sums of independent random variables‚ Teor. Veroyatnost. Primen.‚ 10‚ 310-322. (in Russian). Petrov‚ V.V.‚ (1975) Sums of Independent Random Variables‚ Springer-Verlag‚ New York. Pinelis‚ I.F. (1981) A problem on large deviations in a space of trajectories‚ Theory Probab. Appl. 26‚ no. 1‚ 69–84. Pinelis‚ I.F. (1985) Asymptotic equivalence of the probabilities of large deviations for sums and maximum of independent random variables. (Russian)
REFERENCES
93
Limit theorems of probability theory‚ 144–173‚ 176‚ Trudy Inst. Mat.‚ 5‚ “Nauka" Sibirsk. Otdel.‚ Novosibirsk. Pinelis‚ Iosif (1994) Optimum bounds for the distributions of martingales in Banach spaces. Ann. Probab. 22‚ no. 4‚ 1679–1706. Pinelis‚ I. F.; Sakhanenko‚ A. I. Remarks on inequalities for probabilities of large deviations. (Russian) Theory Probab. Appl. 30‚ no. 1‚ 143–148. Pinelis‚ I.; Yakowitz‚ S. (1994) The time until the final zero crossing of random sums with application to nonparametric bandit theory‚ Appl. Math. Comput. 63‚ no. 2-3‚ 235–263. Pitman‚ E. J. G. (1980) Subexponential distribution functions. J. Austral. Math. Soc. Ser. A 29)‚ no. 3‚ 337–347. Robbins‚ H.‚ (1952) Some aspects of the sequential design of experiments‚ Bull. Amer. Math. Soc. 58‚ 527-535. Rohatgi‚ V. K. (1973) On large deviation probabilities for sums of random variables which are attracted to the normal law. Comm. Statist. 2‚ 525–533. Siegmund‚ D. (1975) Large deviations probabilities in the strong law of large numbers‚ Z. Wahrsch. verw. Gebiete 31‚ 107-113. Teugels‚ J. L. (1975) The class of subexponential distributions. Ann. Probability 3‚ no. 6‚ 1000–1011. Tkachuk‚ S. G. (1974) Theorems o large deviations in the case of a stable limit law. In Random Process and Statistical Inference‚ no. 4‚ Fan‚ Tashkent 178–184. (In Russian) Tkachuk‚ S. G. (1975) Theorems o large deviations in the case of distributions with regularly varying tails. In Random Process and Statistical Inference‚ no. 5‚ Fan‚ Tashkent 164–174. (In Russian) Vinogradov‚ V. (1994) Refined large deviation limit theorems. Pitman Research Notes in Mathematics Series‚ 315. Longman Scientific & Technical‚ Harlow; copublished in the United States with John Wiley & Sons‚ Inc.‚ New York. Yakowitz‚ S. and W. Lowe‚ (1991) Nonparametric bandits‚ Annals of Operations Research‚ 28‚ 297-312. Zolotarev‚ V. M. (1961) On the asymptotic behavior of a class of infinitely divisible laws. (Russian) Teor. Verojatnost. i Primenen. 
6‚ 330–334.
Part II
STOCHASTIC MODELLING OF EARLY HIV IMMUNE RESPONSES UNDER TREATMENT BY PROTEASE INHIBITORS

Wai-Yuan Tan
Department of Mathematical Sciences
The University of Memphis
Memphis, TN 38152-6429
[email protected]
Zhihua Xiang
Organon Inc.
375 Mt. Pleasant Avenue
West Orange, NJ 07052
[email protected]
Abstract
It is well documented that, in many cases, most of the free HIV are generated in the lymphoid tissues rather than in the plasma. This is especially true in the late stage of HIV pathogenesis because in this stage, the total number of T cells in the plasma is very small, whereas the number of free HIV in the plasma is very large. In this paper we develop a state space model in plasma involving net flow of HIV from lymph nodes, extending the original model of Tan and Xiang (1999). We apply this model and the theory to the data of a patient (patient No. 104) considered in Perelson et al. (1996), in which RNA virus copies were observed on 18 occasions over a three-week period. This patient was treated by a protease inhibitor, ritonavir, so that a large number of non-infectious HIV was generated by the treatment. For this patient, by using the state space model over the three-week span, we have estimated the number of productively HIV-infected T cells, the total number of infectious HIV, as well as the number of non-infectious HIV. Our results show that within this period, most of the HIV in the plasma was non-infectious, indicating that the drug is quite effective.
Keywords: Infectious HIV, lymph nodes, Monte Carlo studies, non-infectious HIV, protease inhibitors, state space models, stochastic differential equations.

1. INTRODUCTION
In a recent paper, Tan and Xiang (1999) developed some stochastic and state space models for HIV pathogenesis under treatment by anti-viral drugs. In this paper we extend these models into models involving net flow of HIV from lymphoid tissues. This extension is important and necessary because of the following biological observations: (1) HIV normally exists either in the plasma as free HIV, or trapped by follicular dendritic cells in the germinal center of the lymph nodes, during all stages of HIV infection (Fauci and Pantaleo, 1997; Levy, 1998; Cohen, Weissman and Fauci, 1998; Fauci, 1996; Tan and Ye, 2000). Further, Haase et al. (1996) and Cohen et al. (1997, 1998) have shown that the majority of the free HIV exist in lymphoid tissues. This is true especially in the late stage of HIV pathogenesis. For example, Perelson et al. (1996) have considered a patient (Patient No. 104 in Perelson et al. (1996)) treated by a protease inhibitor, ritonavir. For this patient, at the start of the treatment, the total number of T cells in the blood was very small, yet the total number of RNA virus copies in the blood was very large; many other examples can be found in Piatak et al. (1993). Thus, in the late stage, it is unlikely that most of the free HIV in the plasma were generated by productively HIV-infected CD4 T cells in the plasma. (Note that the total number of T cells includes the productively HIV-infected T cells.) (2) Lafeuillade et al. (1996) and many others have shown that both the free HIV in the plasma and the HIV in lymph nodes can infect T cells, generating similar dynamics in the plasma as well as the lymph nodes. Furthermore, the infection process in the lymph nodes is much faster than in the plasma (Fauci, 1996; Cohen, Weissman and Fauci, 1998; Cohen et al., 1997; Haase et al., 1996; Kirschner and Webb, 1997; Lafeuillade et al., 1996; Tan and Ye, 2000).
From these observations, it follows that most of the free HIV in the blood must have come from HIV in the lymph nodes, rather than from productively HIV-infected CD4 cells in the blood; this is true especially in the late stages. To model the HIV pathogenesis in the blood, it is therefore necessary to include net flow of HIV from the lymph nodes or other tissues to the plasma. On the other hand, since the T lymphocytes are less mobile (Weiss, 1996) and are much larger in size than the HIV, one would expect that the number of T cells flowing from the lymph nodes to the plasma is about the same as that flowing
from the plasma to the lymph nodes. In this paper we thus consider only net flow of HIV from lymphoid tissues to plasma, ignoring net flow of T cells. In Section 2, we illustrate how to develop a stochastic model for HIV pathogenesis under treatment by protease inhibitors, involving net flow of HIV from lymphoid tissues to plasma. To compare results with the deterministic model, in Section 3 we give equations for the expected numbers of the state variables. By using the stochastic model in Section 2 as the stochastic system model, in Section 4 we develop a state space model for the early HIV pathogenesis under treatment by protease inhibitors, and with net flow of HIV from lymphoid tissues to the plasma. Observation of the state space model is based on the RNA virus copies in blood taken over time. In Section 5 we illustrate the model and the theory by applying it to the data of a patient reported by Perelson et al. (1996). Finally, in Section 6, we present some Monte Carlo studies to confirm the usefulness of the extended Kalman filter method.
2. A STOCHASTIC MODEL OF EARLY HIV PATHOGENESIS UNDER TREATMENT BY A PROTEASE INHIBITOR
Consider the situation in which an HIV-infected individual is treated by a protease inhibitor. Then, for this individual, both infectious HIV and non-infectious HIV will be generated in both the plasma and the lymphoid tissues. Since HIV data, such as the RNA virus copies and/or the T cell counts, are usually sampled from the blood, we illustrate in this section how to develop stochastic models in the plasma for early HIV pathogenesis, under treatment by protease inhibitors and with a net flow of HIV from lymphoid tissues to plasma. We look for models which capture most of the important characteristics of the HIV pathogenesis, yet are simple enough to permit efficient estimation of parameters and/or state variables. Thus, since the contribution to HIV by latently HIV-infected T cells is very small (less than 1%; see Perelson et al. (1996)), we will, like these authors, ignore latently HIV-infected T cells. Also, in the early stage when the period since treatment is short (e.g., one month), one may assume that drug resistance of HIV to the anti-viral drugs has not yet developed; further, because of the following biological observations, we will follow Perelson et al. (1996) in assuming that the number of un-infected cells is constant:
(i) Before the start of treatment, the HIV pathogenesis is in a steady-state situation.
(ii) The uninfected CD4(+) T cells have a relatively long life span (the average life span of uninfected T cells is more than 50 days; see Cohen et al. (1998), Tan and Ye (1999a)). (iii) Mittler et al. (1998) have provided additional justifications for this assumption; they have shown that even for a much longer period since treatment, the assumption has little effect on the total number of free HIV. Let
denote the numbers of productively HIV-infected T cells, non-infectious free HIV and infectious free HIV in the blood at time t, respectively. Then we are considering a three-dimensional stochastic process. With the above assumptions, we now proceed to derive stochastic equations for the state variables in plasma of this stochastic process under treatment by protease inhibitors and with net flow of HIV from lymphoid tissues. We first illustrate how to model the effects of protease inhibitors and the net flow of HIV from the lymphoid tissues to the plasma.
2.1. MODELING THE EFFECTS OF PROTEASE INHIBITORS
Under treatment by protease inhibitors, the HIV can infect the T cells effectively; however, due to inhibition of the enzyme protease, most of the free HIV released by the death of productively HIV-infected T cells is non-infectious. It follows that under treatment by protease inhibitors, both non-infectious free HIV and infectious free HIV will be generated. When the drug is effective, most of the released free HIV is non-infectious. To model this stochastically, denote by: the probability that a free HIV released in the plasma at time t by the death of a productively infected T cell is non-infectious (this probability is 1 if the protease inhibitors are 100 percent effective and 0 if the protease inhibitors are not effective; for sensitive HIV it is close to one); the number of free HIV released by the death at time t of a productively HIV-infected T cell in the plasma; and the number of non-infectious HIV among the released free HIV. Given the number of released free HIV, the number of non-infectious HIV among them is a binomial random variable with parameters given by the number released and the above probability.
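The binomial mechanism just described can be sketched in a few lines; the names n_released and p_noninfectious below are hypothetical stand-ins for the release count and non-infectiousness probability whose symbols were not reproduced above.

```python
import random

def split_released_virions(n_released, p_noninfectious, rng=None):
    """Split the virions released by the death of one productively
    infected T cell into non-infectious and infectious counts via a
    binomial draw with success probability p_noninfectious."""
    rng = rng or random.Random(0)
    non_infectious = sum(rng.random() < p_noninfectious
                         for _ in range(n_released))
    return non_infectious, n_released - non_infectious

# A 100 percent effective protease inhibitor renders every released
# virion non-infectious; an ineffective one renders none of them so.
print(split_released_virions(100, 1.0))  # (100, 0)
print(split_released_virions(100, 0.0))  # (0, 100)
```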
2.2. MODELING THE NET FLOW OF HIV FROM LYMPHOID TISSUES TO PLASMA
As shown by Cavert et al. (1996), as in the plasma, the anti-retroviral drugs in Highly Active Anti-Retroviral Therapy (HAART) treatment can reduce HIV to a very low or undetectable level in the lymphoid tissues. Further, the dynamics of HIV pathogenesis in the lymphoid tissues and in the plasma have been observed to be similar (Lafeuillade et al., 1996). Thus, by dynamics similar to those in the plasma, both non-infectious and infectious free HIV will be generated in the lymphoid tissues under treatment by protease inhibitors. Furthermore, as shown by Fauci (1996), Cohen, Weissman and Fauci (1998), Cohen et al. (1997), Haase et al. (1996) and Lafeuillade et al. (1996), the infection process of HIV pathogenesis in the lymph nodes is much faster than in the plasma. Hence, there are net flows of HIV from the lymph nodes or other tissues to the plasma; on the other hand, since the T lymphocytes are less mobile and much larger in size than the HIV, one may ignore the net flow of T cells (Weiss, 1996). To model this stochastically, denote by: the total net flow of HIV generated by all productively HIV-infected T cells in the lymphoid tissues at time t, and the corresponding total net flow of HIV generated in the lymphoid tissues at time t. One of these flows consists of only one type of HIV, whereas the other consists of both infectious and non-infectious free HIV. Let the number of infectious HIV among these free HIV be a random variable, and consider the probability that a released free HIV in the lymphoid tissues at time t becomes non-infectious under treatment by protease inhibitors. The conditional probability distribution of the number of infectious HIV given the total is then given by:
This probability equals 1 if the protease inhibitors are 100 percent effective in the lymphoid tissues and 0 if the protease inhibitors are not effective; for sensitive HIV it is close to one. In Tan and Ye (1999b), the net flow is modelled by a negative binomial distribution whose parameters involve the flow inhibiting function, the flow potential function and the saturation constant (Kirschner and Webb, 1997).
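As a rough illustration of the negative binomial net-flow model, the sketch below draws negative binomial counts by summing geometric draws; the parameters r and p are hypothetical stand-ins for the flow potential and flow inhibiting quantities, not the fitted values of this paper.

```python
import random

def failures_before_success(p, rng):
    """Geometric draw: number of failures before the first success."""
    k = 0
    while rng.random() >= p:
        k += 1
    return k

def negative_binomial(r, p, rng):
    """Number of failures before the r-th success; mean r*(1-p)/p."""
    return sum(failures_before_success(p, rng) for _ in range(r))

# Illustrative net-flow draws with hypothetical r = 5 and p = 0.5,
# whose theoretical mean is r*(1-p)/p = 5.
rng = random.Random(42)
draws = [negative_binomial(5, 0.5, rng) for _ in range(20000)]
print(sum(draws) / len(draws))  # close to 5.0
```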
Let the two quantities denote, respectively, the net flows of infectious and of non-infectious HIV from the lymphoid tissues to the plasma during the given interval. The expected values of these net flows follow from the distributions specified above.
2.3. DERIVATION OF STOCHASTIC DIFFERENTIAL EQUATIONS FOR THE STATE VARIABLES
To derive stochastic equations for the state variables of the above stochastic process, consider a small time interval and denote by: the number of productively infected cells generated during this interval by infection of T cells by free HIV; the number of deaths of productively infected cells during this interval; the number of non-infectious HIV generated by the death of one productively infected cell during this interval; the total number of HIV (infectious or non-infectious) generated by the death of one productively infected cell during this interval; the net flow of HIV (non-infectious and infectious, respectively) from the lymph nodes to the plasma during this interval; and the numbers of deaths of non-infectious and of infectious free HIV during this interval. Then we have the following stochastic equations for the state variables:
In the above equations, the variables on the right hand side are random variables. To specify the probability distributions of these variables, consider three rates: the HIV infection rate of T cells by free HIV in the plasma at time t, the death rate of productively HIV-infected T cells in the plasma at time t, and the rate at which free HIV in the plasma are being removed or die at time t. Then, the conditional probability distributions of the above random variables are given by:
Further, the expected values of these random variables are given by the products of the relevant population sizes and transition rates.
Let the number of uninfected T cells be given. (Note that the number of uninfected T cells has been assumed to be constant during the study period.) Define the random noises by:
Then, the above equations (2.1)-(2.3) can be rewritten as the following stochastic differential equations, respectively:
In equations (2.4)-(2.6), the random noises have expectation zero and are uncorrelated with the state variables. Since the random noises are random variables associated with the random transitions during the given interval, the random noises associated with different intervals are uncorrelated with one another. Further, since these random noises are basically linear combinations of binomial and multinomial random variables, their variances and covariances can readily be derived. These results are given in Table 1.

Table 1. Variances and Covariances of the Random Noises
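A minimal Euler-type Monte Carlo step for the three state variables can be sketched as follows, assuming generic rate names (k for infection, mu for infected-cell death, c for virus removal, burst for virions released per cell death, xi for the non-infectiousness probability, and g_ni, g_i for mean net inflows from lymphoid tissue); this is an illustrative sketch, not the exact transition scheme of equations (2.1)-(2.3).

```python
import random

def mc_step(state, dt, T, k, mu, c, burst, xi, g_ni, g_i, rng):
    """One Euler-type Monte Carlo step for (Tstar, V_ni, V_i): the
    numbers of productively infected T cells, non-infectious free HIV
    and infectious free HIV.  T (uninfected T cells) is held constant,
    as assumed in the text."""
    tstar, v_ni, v_i = state
    # new infections of T cells by infectious free virus
    infections = sum(rng.random() < k * T * dt for _ in range(v_i))
    # deaths of productively infected cells, each releasing a burst
    deaths = sum(rng.random() < mu * dt for _ in range(tstar))
    released = burst * deaths
    # binomial split of released virions into non-infectious/infectious
    new_ni = sum(rng.random() < xi for _ in range(released))
    # removal of free virus, plus mean net inflow from lymph nodes
    dead_ni = sum(rng.random() < c * dt for _ in range(v_ni))
    dead_i = sum(rng.random() < c * dt for _ in range(v_i))
    return (tstar + infections - deaths,
            v_ni + new_ni + int(g_ni * dt) - dead_ni,
            v_i + (released - new_ni) + int(g_i * dt) - dead_i)

rng = random.Random(0)
state = (30, 0, 200)
for _ in range(10):
    state = mc_step(state, 0.01, 2.0, 0.001, 0.5, 0.3, 100, 0.9, 10.0, 1.0, rng)
```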
3. MEAN VALUES OF THE STATE VARIABLES
Let the mean values of the three state variables be denoted correspondingly. Then, by using equations (2.4)-(2.6), we obtain:
From equation (3.3), we obtain
If we follow Perelson et al. (1996) in assuming that the drug is 100% effective, then the solution is
Then, since the relevant quantity is very large while the infection rate is usually very small, the above equations lead to an approximation of the mean virus load. In the example, we will use this approximation in the estimation.

4. A STATE SPACE MODEL FOR THE EARLY HIV PATHOGENESIS UNDER TREATMENT BY PROTEASE INHIBITORS
The state space model consists of a stochastic model of the system and an observation model, which is a statistical model relating the observed data to the system. Thus, the state space model has significant advantages over both the stochastic model and the statistical model alone, since it combines information from both models. For the state space model of HIV pathogenesis under treatment by protease inhibitors with the flow of HIV from lymphoid tissues to the plasma, the stochastic system model is that given by the stochastic differential equations (2.4)-(2.6) of the previous section. Let the observed total number of RNA virus copies at each observation time be given. Then the observation model based on the RNA virus load is given by equation (4.1), in which the error term is the random error associated with measuring the virus load. (If T cell counts at different times are available, then the observation model will contain some additional equations involving observed T cell counts.) In equation (4.1), one may assume that the measurement errors have expected value 0 and constant variance, and are uncorrelated with the random noises of equations (2.4)-(2.6) of the previous section. In the above state space model, the system model is nonlinear. Also, unlike the classical discrete-time Kalman filter model, the above model has observations only at the observation times. (This is the so-called missing observations model.) For this type of non-linear state space model, because the HIV infection rate is very small, it is shown in Tan and Xiang (1999) that the model can be closely approximated by an extended Kalman filter model. This has in fact been confirmed by some Monte Carlo studies given in Section 6; see also Tan and Xiang (1998, 1999). In this paper we will thus use the extended Kalman filter model to derive the estimates and the predicted numbers of the state variables.
To illustrate, write equations (2.4)-(2.6) and (4.1) in state space form. Let an estimator of the initial state be given, and suppose that it is an unbiased estimator with estimated residual covariance matrix P(0). Then, starting with this initial estimator, the procedures given in the following two subsections provide close approximations to some optimal methods for estimating and predicting the state variables. The proofs of these procedures can be found in Tan and Xiang (1999).
4.1. ESTIMATION OF THE STATE VARIABLES GIVEN THE PAST DATA

Let an estimator (or predictor) of the state given the data up to a given time be available, and denote its estimated residual covariance matrix by P. Write P(0) = P(0|0). Then, starting with the initial estimator and P(0) = P(0|0), the linear, unbiased and minimum variance estimators of the state given the data are closely approximated by the following recursive equations: (i) the predicted state estimates satisfy a set of forward recursive equations with boundary conditions given by the updated estimates in (iii) below; (ii) the associated covariance matrices satisfy corresponding forward recursive equations with boundary conditions given by the updated covariance matrices in (iii) below; (iii) at each observation time, the updated estimate and its covariance matrix are obtained from the quantities in (i) and (ii).

To implement the above procedure, one starts with the initial estimator and P(0) = P(0|0). Then by (i) and (ii) one derives the predictions up to the first observation time, and also derives the updated estimate and P(1|1) by (iii). Repeating these procedures, one may derive the estimates and covariance matrices at all subsequent times. These procedures are referred to as forward filtering procedures.
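As a generic illustration of the forward filtering pass with missing observations (predict at every step, update only where a measurement exists), consider the following scalar extended-Kalman-filter sketch; the dynamics f, its derivative fprime and the noise variances Q and R are hypothetical placeholders, not the model of this paper.

```python
def ekf_forward(x0, P0, f, fprime, Q, R, obs):
    """Scalar EKF forward filtering with missing observations.
    obs[t] is the measurement at step t, or None when no
    observation was taken at that time."""
    x, P = x0, P0
    filtered = []
    for y in obs:
        # prediction step through the (possibly nonlinear) dynamics
        F = fprime(x)
        x, P = f(x), F * P * F + Q
        if y is not None:
            K = P / (P + R)        # Kalman gain
            x = x + K * (y - x)    # measurement update
            P = (1.0 - K) * P
        filtered.append((x, P))
    return filtered

# random-walk dynamics, observations at steps 1 and 3 only;
# the estimate is pulled toward the repeated measurement 5.0
out = ekf_forward(0.0, 1.0, lambda x: x, lambda x: 1.0,
                  0.01, 0.01, [None, 5.0, None, 5.0])
```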
4.2. ESTIMATION OF THE STATE VARIABLES GIVEN ALL THE DATA

Let an estimator of the state given all the data be available, and let its covariance matrix be denoted by P. Then, starting with the initial estimator and P(0) = P(0|0), the linear, unbiased and minimum variance estimators of the state given all the data are closely approximated by the following recursive equations: (i) the smoothed state estimates satisfy a set of backward recursive equations with boundary conditions, where the required quantities are given in (iii) below; (ii) the associated covariance matrices satisfy corresponding backward recursive equations with boundary conditions, where the required quantities are given in (iii) below; (iii) the remaining quantities are given by further recursive equations.

To implement the above procedure, for a given initial distribution of the state one first derives results by using the formulas in Section 4.1 (forward filtering). Then one goes backward from the last time point to 1 by using the formulas in Section 4.2 (backward filtering).
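The backward pass can likewise be illustrated by a scalar Rauch-Tung-Striebel-type smoother, in which the forward (filtered) estimates are corrected using all the data by recursing from the last time point back to the first; the inputs below are hypothetical placeholders, not this paper's exact recursions.

```python
def rts_backward(filtered_x, filtered_P, pred_x, pred_P, F):
    """Scalar RTS-type backward pass.  filtered_* are the forward
    filter outputs; pred_*[t] are the one-step predictions used to
    produce them (pred_x[t] = F * filtered_x[t-1], etc.)."""
    n = len(filtered_x)
    sx, sP = filtered_x[:], filtered_P[:]
    for t in range(n - 2, -1, -1):
        C = filtered_P[t] * F / pred_P[t + 1]          # smoother gain
        sx[t] = filtered_x[t] + C * (sx[t + 1] - pred_x[t + 1])
        sP[t] = filtered_P[t] + C * C * (sP[t + 1] - pred_P[t + 1])
    return sx, sP
```

With F = 1 and the two-step toy inputs below, the smoothed first estimate moves toward the later filtered value and its variance shrinks, as expected of smoothing.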
5. AN EXAMPLE USING REAL DATA
As an illustration, in this section we apply the model and the theories of the previous sections to analyze the data from an HIV-infected patient (No. 104) who was treated by ritonavir (a protease inhibitor), as reported in the paper by Perelson et al. (1996). For this patient, the RNA virus copies in the plasma were measured on 18 occasions within 3 weeks. The observed RNA virus copies in the plasma are given in Table 2. (The data were provided by Dr. David Ho of the Aaron Diamond AIDS Research Center in New York City.) For this individual, at the time of initiating the treatment, there were only 2 T cells per unit volume of blood, but a very large number of RNA copies. Thus, most of the HIV must have come from lymph nodes or other sources. To develop a stochastic model for the data in Table 2, we follow Perelson et al. (1996) in assuming that during the 3-week period since treatment, drug resistance has not yet developed, and the number of uninfected T cells remains constant. As discussed in Section 2, these assumptions appear to be justified by the observations. Thus, for the data
Table 2. Observed RNA Virus Copies per Unit Volume of Blood for Patient No. 104
in Table 2, a reasonable model may be a homogeneous three-dimensional stochastic process with a flow of HIV from lymph nodes to the plasma. This is the model described in Section 2 with time-homogeneous parameters; that is, we assume all rates are constant, with N(t) = N. An approximate solution for the mean virus load is then available. The assumption adopted here is equivalent to assuming that the drug is 100% effective; this is the assumption made in Perelson et al. (1996, 1997). Our Monte Carlo studies seem to suggest that this assumption would not significantly distort the results and the pattern. To fit the above model to the data in Table 2, we use the parameter estimates of Perelson et al. (1996). We use N = 2500 from Tan and Ye (1999, 2000), with
the remaining parameter values from Tan and Xiang (1999). Because the decline of HIV in the lymph nodes appeared to be piecewise linear in log scale (Perelson et al., 1997; Tan and Ye, 1999, 2000), we assume the flow inhibiting function h(t) to be a step function, constant over nonoverlapping time intervals and expressed through the corresponding indicator functions. Assuming the above approximation and the given parameter values, the best fitted values of the remaining parameters were derived by minimizing the residual sum of squares between the model mean and the observed RNA virus copies at the observation times. Under the above specification, the stochastic equations for the state variables are as given in Section 2 with these fitted values.
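The fitting criterion used above, minimizing the residual sum of squares between the model mean and the observed RNA copies, can be sketched generically as follows; the exponential-decay model and the parameter grid are hypothetical illustrations, not the model actually fitted.

```python
import math

def fit_by_rss(model, times, observed, grid):
    """Return the parameter on `grid` minimizing the residual sum of
    squares between model(theta, t) and the observations."""
    def rss(theta):
        return sum((y - model(theta, t)) ** 2
                   for t, y in zip(times, observed))
    return min(grid, key=rss)

model = lambda c, t: 100.0 * math.exp(-c * t)   # toy decay model
times = [0.0, 1.0, 2.0, 3.0]
data = [model(0.5, t) for t in times]           # synthetic "observations"
best = fit_by_rss(model, times, data, [i / 100.0 for i in range(1, 101)])
print(best)  # 0.5 (the generating value lies on the grid, RSS = 0 there)
```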
Using these estimates and the Kalman filter methods in Tan and Xiang (1998, 1999), we have estimated the numbers of productively infected T cells, infectious HIV and non-infectious HIV per unit volume of blood over time. Plotted in Figure 1 are the observed total numbers of RNA copies together with the estimates by the Kalman filter method and the estimates by the deterministic model in two cases (Case a and Case b). Plotted in Figures 2-3 are the estimated numbers of infected T cells and free HIV (infectious and non-infectious HIV), respectively.
From Figure 1, we observed that the Kalman filter estimates followed the observed numbers, whereas the deterministic model estimates appeared to draw a smooth line to match the observed numbers and could not follow the fluctuations of the observed numbers. Thus, there are some differences between the two estimates within the first 8 hours, although the differences are not noticeable in the figure; after 8 hours, there is little difference between the two estimates. Furthermore, the curves appeared to have reached a steady state low level in 200 hours (a little more than a week). From Figure 3, we observed that at the time of starting the treatment, the estimated number of infectious HIV begins to decrease very sharply and reaches
the lowest steady state level within 10 hours, and there is little difference between the Kalman filter estimates and the estimates of the deterministic model; on the other hand, the estimated number of non-infectious HIV first increases, reaching the maximum in about 6-8 hours before decreasing steadily to reach a very low steady state level in about 200 hours. Within 50 hours of treatment, there appeared to be significant differences between the two estimates of the number of non-infectious HIV; after 50 hours, such differences appeared to be very small.
Comparing the two cases in Figures 1-3, we observed that the estimates under Case b are almost identical to the corresponding ones under Case a. These results suggest that the delay effects are very small in this example. Comparing results in Figures 1-3 with the corresponding results in Tan and Xiang (1999), one may note that the two models give very close results. However, it is important to note that the model in Perelson et al. (1996) and in Tan and Xiang (1999) assumed that all HIV were generated by the productively HIV-infected T cells, while the model in Section 4 assumed that most of the HIV came from lymph nodes. The results suggest, however, that if one is
interested only in the estimation of T cells and free HIV, the two models make little difference.
6. SOME MONTE CARLO STUDIES
To justify and confirm the usefulness of the extended Kalman filter method for the state space model in Section 4, we have conducted some Monte Carlo studies on a computer on the basis of the model in Sections 2 and 4. The parameters of this model were taken from the estimates for patient No. 104 given above. To generate the observation model, we add some Gaussian noises to the total number of the generated free HIV to produce the observations. That is, we generate the observations by the equation
where the parameters are as given above, and the measurement error is assumed to be a Gaussian variable with mean 0 and a specified variance. Using the generated data, we then use the same method as in Section 4 to derive the Kalman filter estimates from the state space model, and the estimates from the deterministic model. From these we have observed the following results: (1) In the estimation of the numbers of T* cells and of free HIV (infectious and non-infectious), the estimates by the extended Kalman filter method appear to follow the generated numbers very closely. These results suggest that in estimating these numbers, one may in fact use the extended Kalman filter method as described in Section 4. Similar results have also been obtained by Tan and Xiang (1998, 1999) in other models of a similar nature. (2) The estimates using the deterministic model seem to draw a smooth line across the generated numbers. Thus, although results of the deterministic model cannot follow the fluctuations of the generated numbers, due presumably to the randomness of the state variables, they are still quite useful in assessing the behavior and trend of the process. (3) For the numbers of productively infected T cells and infectious free HIV, there are small differences between the Kalman filter estimates and the estimates of the deterministic model, due presumably to the small numbers of these cells. For the non-infectious free HIV, however, there are significant differences between the Kalman filter estimates and the estimates using the deterministic model. It appears that the Kalman filter estimates have revealed much stronger effects of the treatment at early times (before 10 hours), which could not be detected by the deterministic model.
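The observation-generation step of the Monte Carlo study, adding Gaussian noise to the simulated total free-HIV counts, can be sketched as follows; sigma is a hypothetical noise level, not the variance actually used.

```python
import random

def synthetic_observations(true_totals, sigma, rng=None):
    """Add zero-mean Gaussian measurement noise to the simulated
    total free-HIV counts to produce pseudo-observations."""
    rng = rng or random.Random(1)
    return [v + rng.gauss(0.0, sigma) for v in true_totals]

totals = [1000.0, 800.0, 640.0]   # hypothetical simulated totals
obs = synthetic_observations(totals, 25.0)
```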
ACKNOWLEDGMENTS The authors wish to thank the referee for his help in smoothing the English language.
REFERENCES
Cavert, W., D.W. Notermans, K. Staskus, et al. (1996). Kinetics of response in lymphoid tissues to antiretroviral therapy of HIV-1 infection. Science, 276, p. 960-964.
Cohen, O.J., G. Pantaleo, G.K. Lam, and A.S. Fauci. (1997). Studies on lymphoid tissue from HIV-infected individuals: Implications for the design of therapeutic strategies. In: "Immunopathogenesis of HIV Infection", A.S. Fauci and G. Pantaleo (eds.). Springer-Verlag, Berlin, p. 53-70.
Cohen, O.J., D. Weissman, and A.S. Fauci. (1998). The immunopathogenesis of HIV infection. In: "Fundamental Immunology, Fourth edition", ed. W.E. Paul. Lippincott-Raven Publishers, Philadelphia, Chapter 44, p. 1511-1534.
Fauci, A.S. (1996). Immunopathogenic mechanisms of HIV infection. Annals of Internal Medicine, 124, p. 654-663.
Fauci, A.S. and G. Pantaleo (eds.). (1997). Immunopathogenesis of HIV Infection. Springer-Verlag, Berlin.
Haase, A.T., K. Henry, M. Zupancic, et al. (1996). Quantitative image analysis of HIV-1 infection in lymphoid tissues. Science, 274, p. 985-989.
Kirschner, D.E. and G.F. Webb. (1997). Resistance, remission, and qualitative differences in HIV chemotherapy. Emerging Infectious Diseases, 3, p. 273-283.
Lafeuillade, A., C. Poggi, N. Profizi, et al. (1996). Human immunodeficiency virus type 1 kinetics in lymph nodes compared with plasma. The Jour. Infectious Diseases, 174, p. 404-407.
Mittler, J.E., B. Sulzer, A.U. Neumann, and A.S. Perelson. (1998). Influence of delayed viral production on viral dynamics in HIV-1 infected patients. Math. Biosciences, 152, p. 143-163.
Perelson, A.S., A.U. Neumann, M. Markowitz, et al. (1996). HIV-1 dynamics in vivo: Virion clearance rate, infected cell life-span, and viral generation time. Science, 271, p. 1582-1586.
Perelson, A.S., O. Essunger, Y.Z. Cao, et al. (1997). Decay characteristics of HIV infected compartments during combination therapy. Nature, 387, p. 188-192.
Piatak, M. Jr., M.S. Saag, L.C. Yang, et al. (1993). High levels of HIV-1 in plasma during all stages of infection determined by competitive PCR. Science, 259, p. 1749-1754.
Tan, W.Y. and H. Wu. (1998). Stochastic modeling of the dynamics of T cells by HIV infection and some Monte Carlo studies. Math. Biosciences, 147, p. 173-205.
Tan, W.Y. and Z.H. Xiang. (1998). State space models for the HIV pathogenesis. In: "Mathematical Models in Medicine and Health Sciences", eds. M.A. Horn, G. Simonett and G. Webb. Vanderbilt University Press, Nashville, TN, p. 351-368.
Tan, W.Y. and Z.H. Xiang. (1999). A state space model of HIV pathogenesis under treatment by anti-viral drugs in HIV-infected individuals. Math. Biosciences, 156, p. 69-94.
Tan, W.Y. and Z.Z. Ye. (1999). Stochastic modeling of HIV pathogenesis under HAART and development of drug resistance. Proceedings of the International 99' ISAS meeting, Orlando, FL.
Tan, W.Y. and Z.Z. Ye. (2000). Assessing effects of different types of HIV and macrophage on the HIV pathogenesis by stochastic models of HIV pathogenesis in HIV-infected individuals. Jour. Theoretical Medicine, 2, p. 245-265.
Weiss, R.A. (1996). HIV receptors and the pathogenesis of AIDS. Science, 272, p. 1885-1886.
Chapter 6
THE IMPACT OF RE-USING HYPODERMIC NEEDLES

B. Barnes
School of Mathematical Sciences
Australian National University
Canberra, ACT 0200 Australia*
J. Gani
School of Mathematical Sciences
Australian National University
Canberra, ACT 0200 Australia
Abstract
This paper considers an epidemic which is spread by the re-use of unsterilized hypodermic needles. The model leads to a geometric-type distribution with varying success probabilities, which is applied to model the propagation of the Ebola virus.
Keywords: epidemic and its intensity, Ebola virus, geometric type distribution.

1. INTRODUCTION
In her book "The Coming Plague", Laurie Garrett (1994) describes how the shortage of hypodermic needles in Third World countries, as well as the sharing of needles in most countries, is responsible for the spread of infectious diseases. She cites one case from the Soviet Union in 1988, where 3 billion injections were administered although only 30 million needles were manufactured. In one instance of shared needles that year, a single hospitalised HIV-infected baby led to the infection of all other infants in the same ward.

* This project was funded by the Centre for Mathematics and its Applications, Australian National University, Canberra, and University College, University of New South Wales, Canberra.
118
MODELING UNCERTAINTY
We will consider the impact of a single needle, used to inject a group of size N containing a infectives and s = N − a susceptibles, without sterilisation of the needle between individuals. A susceptible is infected if the needle has previously been used on an infective; thus if the k-th person injected is the first infective on which the needle is used, then the s − k + 1 susceptibles injected subsequently will be infected. Clearly if k = 1 all susceptibles are infected, while if the first infective is injected after all the susceptibles (k = s + 1), none of the susceptibles is infected. In the following sections we derive a distribution for the number of infected susceptibles. We then consider the impact of using several needles, and apply the results to the 1976 spread of the Ebola virus in Central Africa.
2.
GEOMETRIC DISTRIBUTION WITH VARIABLE SUCCESS PROBABILITY
The standard geometric distribution is a waiting time distribution for independent Bernoulli random variables X_1, X_2, ... such that P(X_i = 1) = p and P(X_i = 0) = q, with p + q = 1. The waiting time T, until the first success, is given by

P(T = k) = q^{k-1} p,   k = 1, 2, ...,   (2.1)

with the probability generating function (pgf)

E(s^T) = p s / (1 − q s),   |s| ≤ 1.   (2.2)
Suppose now that the probability p_i of success varies with each trial, so that

P(X_i = 1) = p_i,   P(X_i = 0) = q_i,   (2.3)

with p_i + q_i = 1 and 0 < p_i < 1 (see Feller (1968)). The distribution of the time T until a first success is given by

P(T = k) = q_1 q_2 ··· q_{k−1} p_k,   k = 1, 2, ....   (2.4)
In the particular case of infection by re-used needles, when we start with s susceptibles, a = N − s infectives, and I denotes the number of new infectives, we have

p_k = a / (N − k + 1),   q_k = (s − k + 1) / (N − k + 1),

so that, from (2.4),

P(T = k) = [s (s − 1) ··· (s − k + 2) / N (N − 1) ··· (N − k + 2)] · a / (N − k + 1),   k = 1, 2, ..., s + 1.   (2.5)

We thus have a distribution for the random variable I = s − T + 1, the number of susceptibles infected when the T-th person injected is the first infective. We can write it in general as

P(I = i) = a s! (a + i − 1)! / (i! N!),   i = 0, 1, ..., s.   (2.6)
3.
VALIDITY OF THE DISTRIBUTION
We provide a simple example which suggests this to be a valid distribution, and then prove that it is indeed so in general. Let N = 3, with a = 1 infective and s = 2 susceptibles, so that p_k = 1/(4 − k). We need to show that the sum over all i of P(I = i) is 1. From equation (2.5),

P(T = 1) = 1/3,   P(T = 2) = (2/3)(1/2) = 1/3,   P(T = 3) = (2/3)(1/2)(1) = 1/3,

and thus

P(I = 2) = P(I = 1) = P(I = 0) = 1/3,   (3.7)

so that the probabilities indeed sum to 1.
We can show P(I = i) to be an honest distribution in general, by applying a simple binomial identity. Consider the sum of probabilities in the form of equation (2.6),

Σ_{i=0}^{s} P(I = i) = (a s! / N!) Σ_{i=0}^{s} (a + i − 1)! / i!.   (3.8)

We need to prove this sum equal to 1, that is, dividing each term of the sum by (a − 1)!,

Σ_{i=0}^{s} C(a + i − 1, i) = C(a + s, s) = N! / (a! s!).   (3.9)

Now applying the well known binomial identity for integers m, n both positive (see Abramowitz and Stegun, 1964)

Σ_{i=0}^{n} C(m + i, i) = C(m + n + 1, n)   (3.10)

to the left hand side of equation (3.9), with m = a − 1 and n = s, the equality is proved as required.
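The honesty of the distribution can also be checked numerically. The following sketch (Python; function names and the parameter values a = 2, s = 5 are illustrative, not from the paper) evaluates the closed-form probabilities of equation (2.6) and compares them with a direct Monte Carlo simulation of the injection order:

```python
import math
import random

def needle_pmf(a, s, i):
    # P(I = i) = a s! (a + i - 1)! / (i! N!),  N = a + s  (equation (2.6)).
    N = a + s
    return (a * math.factorial(s) * math.factorial(a + i - 1)
            / (math.factorial(i) * math.factorial(N)))

def infected_in_one_round(a, s, rng):
    # One needle, random injection order; every susceptible injected
    # after the first infective becomes infected.
    order = ["I"] * a + ["S"] * s
    rng.shuffle(order)
    first = order.index("I")
    return order[first + 1:].count("S")

a, s = 2, 5
total = sum(needle_pmf(a, s, i) for i in range(s + 1))  # should equal 1

rng = random.Random(42)
trials = 200_000
counts = [0] * (s + 1)
for _ in range(trials):
    counts[infected_in_one_round(a, s, rng)] += 1
empirical = [c / trials for c in counts]
```

With a = 1 and s = 2 the same code reproduces the uniform probabilities 1/3 of the example above.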
4.
MEAN AND VARIANCE OF I
The same identity (equation (3.10)) and technique can be used to establish expressions for the mean and variance of the distribution. Suppose we wish to determine the mean number of susceptibles infected; then

E(I) = Σ_{i=0}^{s} i P(I = i) = (a s! / N!) Σ_{i=1}^{s} (a + i − 1)! / (i − 1)!,

since i P(I = i) = 0 when i = 0. Hence, using identity (3.10), we obtain

E(I) = a s / (a + 1),   (4.11)

which provides an expression for the mean value. The variance of I is

Var(I) = E[I(I − 1)] + E(I) − [E(I)]².   (4.12)

In order to evaluate it we first find an expression for E[I(I − 1)]. From equation (3.8), this is

E[I(I − 1)] = (a s! / N!) Σ_{i=2}^{s} (a + i − 1)! / (i − 2)! = a s (s − 1) / (a + 2).   (4.13)

Hence we obtain (from (4.12) and (4.13)),

Var(I) = a s (s − 1)/(a + 2) + a s/(a + 1) − a² s²/(a + 1)² = a s (N + 1) / ((a + 1)² (a + 2)).   (4.14)
Applying these results to the simple example of §3, with N = 3, a = 1 and s = 2, we have, from equation (4.11),

E(I) = (1 × 2) / 2 = 1,

and from equation (4.14),

Var(I) = (1 × 2 × 4) / (2² × 3) = 2/3.
5.
INTENSITY OF EPIDEMIC
It is of interest to estimate the intensity of the epidemic, defined here as the average proportion of susceptibles in a group of N individuals who are infected when the same needle is used to inject them. This proportion is

E(I) / s = a / (a + 1).   (5.15)

It has a least value of 1/2 when a = 1, that is, when there is only a single infective in the population. The proportion rises rapidly as the number of susceptibles decreases, and if a = N − 1 then E(I)/s = (N − 1)/N, which is close to 1 for large N. The variance of I/s is

Var(I/s) = a (N + 1) / (s (a + 1)² (a + 2)).

Applying these proportions to our simple example (3.7), we have E(I)/s = 1/2 and Var(I/s) = 1/6.
6.
REDUCING INFECTION
An obvious and simple strategy for reducing the risk of infection is to divide the population into K groups (K = 2, 3, 4, ...) of N/K individuals, where N = KM for an integer M, and sterilize the needle after vaccinating each group. Using a subscript j to number each of the groups, with a_j infectives and s_j susceptibles in group j, where Σ_j a_j = a and Σ_j s_j = s, we have

E(I_j) = a_j s_j / (a_j + 1),   j = 1, ..., K,   (6.16)

where originally we had

E(I) = a s / (a + 1).   (6.17)

We can show that

Σ_{j=1}^{K} E(I_j) ≤ E(I).   (6.18)

The sum of the quantities E(I_j) from equation (6.16) can be expressed in the following form

Σ_{j=1}^{K} E(I_j) = Σ_{j=1}^{K} s_j [1 − 1/(a_j + 1)].   (6.19)

For the original equation (6.17) for E(I) we have

E(I) = s [1 − 1/(a + 1)] = Σ_{j=1}^{K} s_j [1 − 1/(a + 1)].   (6.20)

Now, since a_j < a for at least some j, and each s_j is non-negative by definition, each term of (6.19) is no larger than the corresponding term of (6.20); comparing equation (6.20) with equation (6.19), we have the inequality (6.18).
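The inequality (6.18) can be illustrated numerically. The sketch below (Python; the population size, infective count, and even split are illustrative assumptions, not data from the paper) compares the expected number of infections from one needle used on everyone against sterilizing between groups:

```python
def mean_infected(a, s):
    # E(I) = a s / (a + 1): expected susceptibles infected when one
    # needle is used on a group with a infectives and s susceptibles.
    return a * s / (a + 1)

# One needle used on all N = 40 people, a = 4 infectives in total.
N, a = 40, 4
one_needle = mean_infected(a, N - a)

# Sterilize between K = 4 equal groups; take the even split with one
# infective per group as an illustrative configuration.
K = 4
split_total = sum(mean_infected(1, N // K - 1) for _ in range(K))
```

Here one needle yields E(I) = 4 × 36/5 = 28.8 expected infections, while sterilizing between the four groups yields 4 × (1 × 9/2) = 18, in line with (6.18).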
7.
THE SPREAD OF THE EBOLA VIRUS IN 1976
Although it is well understood today that disease is easily spread through the use of shared needles, this was not so well known in 1976 in Yambuku, a central African town in the Republic of Zaire, where the re-use of unsterilized hypodermic needles turned out to be a most efficient and deadly amplifier for the Ebola virus. Whilst the exact manner in which the disease emerged in August, 1976, is uncertain, the consensus is that a single traveller, foreign to those parts, who came to the Catholic Mission Hospital at Yambuku on 28 August with a mysterious illness, carried with him the Ebola virus. He was hospitalised for 2 days and then disappeared. At the time, Yambuku hospital housed 150 beds in 8 wards, and treated between 300 and 600 patients and outpatients each day. Many were given injections, but with only 5 needles which were re-used many times over. The sterilization procedure was a 30 litre autoclave, and boiling water atop a Primus stove. Most outpatients were pregnant women, visiting the hospital for checkups, and they were injected with a vitamin B complex. We can use the above model to get some idea of the rapid spread of the virus. We will consider the use of K = 5 needles on a population N, in which there are a infectives. In general, we have a infectives distributed between K groups of equal size N/K, such that a = a_1 + ··· + a_K. We assume each infective is equally likely to fall in any of the K groups, so that each configuration (a_1, ..., a_K) of the a infectives arises with the probability this allocation implies. We are interested in the number of new infectives after each set of injections, which can be expressed as
I = Σ_{j=1}^{K} I_j.   (7.21)

With a_j = α infectives in a particular group and P(a_j = α) the probability of α infectives occurring in that group,

E(I_j) = Σ_{α=0}^{a} P(a_j = α) E(I_j | a_j = α),   with   E(I_j | a_j = α) = α (N/K − α) / (α + 1)

from equation (6.16), where I_j is the number of new infectives generated by the α initial infectives in that group. The probability P(a_j = α) can be expressed as a binomial distribution

P(a_j = α) = C(a, α) (1/K)^α (1 − 1/K)^{a−α},

where it is assumed any infective has the probability 1/K of belonging to a particular group. Thus, with K groups, the expected total number of new infectives is

E(I) = K Σ_{α=0}^{a} C(a, α) (1/K)^α (1 − 1/K)^{a−α} · α (N/K − α) / (α + 1).

This expression for E(I) can be simplified. Set p = 1/K, q = 1 − p and M = N/K, and then use

Σ_{α=0}^{a} C(a, α) p^α q^{a−α} / (α + 1) = (1 − q^{a+1}) / ((a + 1) p),

which gives

E(I) = K (M + 1) [1 − (1 − q^{a+1}) / ((a + 1) p)] − a.
Returning to the available data concerning the Ebola virus in Garrett (1994), the epidemic began on September 5, 1976. In that first month 38 of the 300 Yambuku residents died from Ebola. The hospital was closed to new patients after 25 September, a quarantine was imposed on the region on 30 September, and by October 9 the virus was observed to be in recession. By November, 46 villages had been affected, with 350 deaths recorded. The probability of contracting the virus from an infected needle was estimated to be 90%, and once
contracted by this means the probability of death was between 88% and 92.5%, or possibly higher. After the symptoms appear, the Ebola virus takes approximately 10 days to kill a victim, through a particularly painful and gruesome means; the patient, losing hair, skin and nails, literally bleeds to death as the membranes containing the body fluids disintegrate and the body organs liquefy. From the data we can establish estimates for parameters in the model. Of the primary cases, 72 of the 103 were from injection at the hospital. We then assume that 0.7 of the deaths resulting from the virus were a direct result of infected needles. Thus four weeks after the onset of the epidemic, from the 38 deaths in the Yambuku region, we estimate 0.7 × 38 = 26.6 were caused by infected needles. Towards the end of the epidemic, by November, of the 350 recorded deaths we estimate that 0.7 × 350 = 245 were a result of infected needles. (The virus was also spread through body fluid contact, and in one cited case, a single individual spread the disease to 21 family members and friends, 18 of whom died as a result. However, in general, individuals who contracted the disease in this manner had a 43% chance of survival. This method of infection is not considered in our model.) We take the period between infection by the Ebola virus and death to be 21 days. This follows as the incubation period was between 2 and 21 days, typically 4 to 16 days, and the time until death after the appearance of the major symptoms approximately a week to 10 days. The probability of infection through the re-use of a needle is difficult to estimate from the available data. Taking into account the existing, but insufficient, sterilization process described above, it has been estimated as 1/50, and a number of different values have been compared. (See Figure 6.1 with values 0.03, 0.05, 0.07 and 0.1.) The probability of death following infection from a needle is between 88% and 92.5%, and is taken here as 88%.
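The expected number of new infectives per round of injections can be evaluated numerically. The sketch below (Python; the function names, the closed form obtained by summing the binomial series, and the values N = 250, a = 3, K = 5 are illustrative assumptions of this sketch, not the authors' code) computes E(I) both by summing over the binomial allocation of infectives to groups and from the simplified expression:

```python
import math

def expected_new_infectives(N, a, K):
    # E(I): sum over the binomial allocation of a infectives to a group
    # of size M = N/K (each infective lands in a given group with
    # probability 1/K), using E(I_j | a_j) = a_j (M - a_j) / (a_j + 1).
    M = N // K
    p = 1 / K
    total = 0.0
    for aj in range(a + 1):
        prob = math.comb(a, aj) * p**aj * (1 - p) ** (a - aj)
        total += prob * aj * (M - aj) / (aj + 1)
    return K * total

def expected_new_infectives_closed(N, a, K):
    # Closed form obtained by summing the binomial series directly.
    p, q, M = 1 / K, 1 - 1 / K, N // K
    return K * (M + 1) * (1 - (1 - q ** (a + 1)) / ((a + 1) * p)) - a

e_sum = expected_new_infectives(250, 3, 5)
e_closed = expected_new_infectives_closed(250, 3, 5)
```

For 250 injections, 3 infectives and 5 needles, both routes give the same value, and it is far below the single-needle expectation 3 × 247/4.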
Recall that the probability of infection from a single infected injection at the hospital was given as 90%. The results of simulations for the number of victims of the Ebola virus from 5 September to November, 1976, resulting from the introduction of a single infective into the Yambuku hospital, are illustrated in Figures 6.1, 6.2 and 6.3. Infection through means other than infected needles has not been considered. The number of hospital beds at Yambuku was 150, and between 300 and 600 patients were treated each day, including those in the hospital. We assume that 250 of these received an injection on a single day, using 5 different needles. The expected number of new infectives is calculated from equation (7.21). As is illustrated in Figure 6.1, for the number of deaths caused by infection through re-use of needles, the model predicts 26.4 deaths after 4 weeks, and over the following months, after November, the number of deaths approaches 245 in the limit. By the end of November the model predicts a total of 236 deaths. These figures compare well with the data given above: 38 deaths in the first month (28 days), 0.7 × 38 = 26.6 of which are due to infection through
The impact of re-using hypodermic needles
127
128
MODELING UNCERTAINTY
needles, and 350 deaths by November, 0.7 × 350 = 245 of which are expected to have been from infected needles. Furthermore, the model is in agreement with the data, which state that by 10 October the disease was in recession. This is illustrated in Figure 6.2. The epidemic is seen to peak and begin to decline in the first week of October, between day 30 and day 40. While the closure of the hospital to new patients, with a drastic reduction in the susceptible population, would certainly have had an impact on this decline, it is clear (Figure 6.2) that the number of new infectives was declining by day 15 to 20, that is, from 20 to 25 September. Figure 6.3 graphs the number of susceptibles and infectives in the hospital over time, illustrating the proportion of infectives to susceptibles and demonstrating a gradual decline of the epidemic in late September, followed by a sharp decline when the hospital was closed and the region quarantined. The damage caused by the Ebola virus in 10 days is comparable with that caused by AIDS over a period of 10 years. It is no wonder it is considered the second most lethal disease of the century.
8.
CONCLUSIONS
A geometric type distribution model has been developed for the spread of an infection through the re-use of infected needles. It provides a reasonable description of the dynamics of the Ebola virus epidemic in Yambuku, in 1976, although the available data is scarce. Further work, comparing this model with the spread of a disease through contact with some external source, using the Greenwood and Reed-Frost chain binomial models, as well as the dynamics of a deterministic model, is planned to provide an insight into the differences in the propagation of epidemics.
REFERENCES
Abramowitz, M. and I.A. Stegun. (1964). Handbook of Mathematical Functions. National Bureau of Standards, Washington, D.C.
Feller, W. (1968). An Introduction to Probability Theory and its Applications. John Wiley, New York.
Garrett, L. (1994). The Coming Plague: Newly Emerging Diseases in a World out of Balance. Penguin, New York.
Chapter 7 NONPARAMETRIC FREQUENCY DETECTION AND OPTIMAL CODING IN MOLECULAR BIOLOGY David S. Stoffer Department of Statistics University of Pittsburgh Pittsburgh, PA 15260
Abstract
The concept of the spectral envelope for analyzing periodicities in categorical-valued time series was introduced in the statistics literature as a computationally simple and general statistical methodology for the harmonic analysis and scaling of non-numeric sequences. One benefit of this technique is that it combines nonparametric statistical analysis with modern computer power to quickly search for diagnostic patterns within long sequences. An interesting area of application is the search for nucleosome positioning signals and optimal alphabets in long DNA sequences. The examples focus on period lengths in nucleosome signals and on optimal alphabets in herpesviruses, and we point out some inconsistencies in established gene segments.
Keywords:
Spectral Analysis, Optimal Scaling, Nucleosome Positioning Signals, Herpesviruses, DNA Sequences.
1.
INTRODUCTION
Rapid accumulation of genomic sequences has increased demand for methods to decipher the genetic information gathered in data banks such as GenBank. While many methods have been developed for a thorough micro-analysis of short sequences, there is a shortage of powerful procedures for the macroanalyses of long DNA sequences. Combining statistical analysis with modern computer power makes it feasible to search, at high speeds, for diagnostic patterns within long sequences. This combination provides an automated approach to evaluating similarities and differences among patterns in long sequences and aids in the discovery of the biochemical information hidden in these organic molecules.
Briefly, a DNA strand can be viewed as a long string of linked nucleotides. Each nucleotide is composed of a nitrogenous base, a five carbon sugar, and a phosphate group. There are four different bases that can be grouped by size, the pyrimidines, thymine (T) and cytosine (C), and the purines, adenine (A) and guanine (G). The nucleotides are linked together by a backbone of alternating sugar and phosphate groups, with the 5′ carbon of one sugar linked to the 3′ carbon of the next, giving the string direction. DNA molecules occur naturally as a double helix composed of polynucleotide strands with the bases facing inwards. The two strands are complementary, so it is sufficient to represent a DNA molecule by a sequence of bases on a single strand. Thus, a strand of DNA can be represented as a sequence of letters, termed base pairs (bp), from the finite alphabet {A, C, G, T}. The order of the nucleotides contains the genetic information specific to the organism. Expression of information stored in these molecules is a complex multistage process. One important task is to translate the information stored in the protein-coding sequences (CDS) of the DNA. A common problem in analyzing long DNA sequence data is in identifying CDS that are dispersed throughout the sequence and separated by noncoding regions (which make up most of the DNA). For example, the entire DNA sequence of a small organism such as the Epstein-Barr virus (EBV) consists of approximately 172,000 bp. Table 1 shows part of the EBV DNA sequence. The idea of rotational signals for nucleosome positioning is based on the fact that the nucleosomal DNA is tightly wrapped around its protein core. The bending of the wound DNA requires compression of the grooves that face toward the core and a corresponding widening of the grooves facing the outside.
Because, depending on the nucleotide sequence, DNA bends more easily in one plane than another, Trifonov and Sussman (1980) proposed that the association between the DNA sequence and its preferred bending direction might facilitate the necessary folding around the core particle. This sequence dependent bendability motivated the theoretical and experimental search for rotational signals. These signals were expected to exhibit some kind of periodicity in the sequence, reflecting the structural periodicity of the wound nucleosomal DNA.
While model calculations as well as experimental data strongly agree that some kind of periodic signal exists, they largely disagree about the exact type of periodicity. A number of questions remain unresolved: Do the periodicities in rotational signals occur predominantly in di- or trinucleotides, or even in higher order oligonucleotides? Ioshikhes et al (1992) reported evidence for dinucleotide signals, while the analysis of Satchwell et al (1986) resulted in a trinucleotide pattern that was supported by data from Muyldermans and Travers (1994). Some questions are: Which nucleotide alphabets are involved in rotational signals? Satchwell et al (1986) used a strong, S = (G, C), versus weak, W = (A, T), hydrogen bonding alphabet to propose one signal, while Zhurkin (1985) suggested the purine-pyrimidine alphabet with another pattern, and Trifonov and coworkers propose a different motif. What is the exact period length? The helical repeat of free DNA is about 10.5 bp, but the periodicities of rotational signals tend to be slightly shorter than 10.5 in general, for example: 10.1 bp in Shrader and Crothers (1990), 10.2 bp in Satchwell et al (1986), 10.3 bp in Bina (1994), and 10.4 bp in Ioshikhes et al (1992). Consistent with all these data is the proposition by Shrader and Crothers (1992) that nucleosomal DNA is over wound by about 0.3 bp per turn. Are there other periodicities besides the approximate 10 bp period? Uberbacher et al (1988) observed several additional periodic patterns of lengths 6 to 7, 10, and 21 bp. Bina (1994) reports a TT-period of 6.4 bp. Of course one could extend this list of controversial questions about the properties and characteristics of positioning signals. Depending on the choice among these divergent observations and claims, different sequence-directed algorithms for nucleosomic mapping have been developed, for example, by Mengeritsky and Trifonov (1983), Zhurkin (1983), Drew and Calladine (1987), Uberbacher et al (1988), and Pina et al (1990).
An attempt to analyze existing data by the spectral envelope (Stoffer et al, 1993a) could result in a more unified picture of the major periodic signals that contribute to nucleosome positioning. This, in turn, might lead to a new reliable and efficient way to predict nucleosome locations in long DNA sequences by computer. In addition to positioning, the spectral envelope could prove to be a useful tool in examining codon usage. Regional fluctuations in G+C content not only influence silent sites but seem to create a general tendency in high G+C regions toward G+C rich codons (G+C pressure); see Bernardi and Bernardi (1985) and Sueoka (1988). Schachtel et al (1991) compared two closely related herpesviruses and showed that for pairs of homologous genes, G+C frequencies differed in all three codon positions, reflecting the large difference in their global G+C content. In perfect agreement with their overall compositional bias, the usage for each individual amino acid type was shifted significantly toward codons of preferred G+C content. Several authors reported codon context related biases (see Buckingham 1990, for a review). Blaisdell (1983)
observed that bases at codon site three were chosen to be unlike neighboring bases to the left and to the right with respect to the strong-weak (S-W) alphabet. While the various studies on codon usage exhibit many substantial differences, most of them agree on one point, namely the existence of some kind of periodicity in coding sequences. This widely accepted observation is supported by the spectral envelope approach, which shows a very strong period-three signal in genes that disappears in noncoding regions. This method may even be helpful in detecting wrongly assigned gene segments, as will be seen. In addition, the spectral envelope provides not only the optimal period lengths but also the most favorable alphabets, for example {S, W} or {G, H}, where H = (A, C, T). This analysis might help decide which among the different suggested patterns [such as RNY, GHN, etc., where R = (A, G), Y = (C, T), and N is anything] are the most valid. The spectral envelope methodology is computationally fast and simple because it is based on the fast Fourier transform and is nonparametric (that is, it is model independent). This makes the methodology ideal for the analysis of long DNA sequences. Fourier analysis has been used in the analysis of correlated data (time series) since the turn of the twentieth century. Of fundamental interest in the use of Fourier techniques is the discovery of hidden periodicities or regularities in the data. Although Fourier analysis and related signal processing are well established in the physical sciences and engineering, they have only recently been applied in molecular biology. Because a DNA sequence can be regarded as a categorical-valued time series, it is of interest to discover ways in which time series methodologies based on Fourier (or spectral) analysis can be applied to discover patterns in a long DNA sequence or similar patterns in two long sequences.
One naive approach for exploring the nature of a DNA sequence is to assign numerical values (or scales) to the nucleotides and then proceed with standard time series methods. It is clear, however, that the analysis will depend on the particular assignment of numerical values. Consider the artificial sequence ACGT ACGT ACGT... . Then, setting A = G = 0 and C = T = 1 yields the numerical sequence 0101 0101 0101..., or one cycle every two base pairs (that is, a frequency of oscillation of 1/2, or a period of oscillation of length 2). Another interesting scaling is A = 1, C = 2, G = 3, and T = 4, which results in the sequence 1234 1234 1234..., or one cycle every four bp (a frequency of 1/4, or a period of 4). In this example, both scalings (that is, {A, C, G, T} = {0, 1, 0, 1} and {A, C, G, T} = {1, 2, 3, 4}) of the nucleotides are interesting and bring out different properties of the sequence. It is clear, then, that one does not want to focus on only one scaling. Instead, the focus should be on finding all possible scalings that bring out interesting features of the data. Rather than choose values arbitrarily, the spectral envelope approach selects scales that help emphasize any periodic feature that exists in a DNA sequence of virtually any length in a quick
and automated fashion. In addition, the technique can determine whether a sequence is merely a random assignment of letters. Fourier analysis has been applied successfully in molecular genetics; McLachlan et al (1976) and Eisenberg et al (1984) studied the periodicity in proteins with Fourier analysis. They used predefined scales (for example, the hydrophobicity alphabet) and observed the frequency of amphipathic helices. Because predetermination of the scaling is somewhat arbitrary and may not be optimal, Cornette et al (1987) reversed the problem: they started from a fixed frequency and proposed a method to establish an ‘optimal’ scaling at that frequency. In this setting, optimality roughly refers to the fact that the scaled (numerical) sequence is maximally correlated with the sinusoid that oscillates at that frequency. Viari et al (1990) generalized this approach to a systematic calculation of a type of spectral envelope and of the corresponding optimal scalings over all fundamental frequencies. While the aforementioned authors dealt exclusively with amino acid sequences, various forms of harmonic analysis have been applied to DNA by, for example, Tavaré and Giddings (1989), and in connection to nucleosome positioning by Satchwell et al (1986) and Bina (1994). Recently, Stoffer et al (1993a) proposed the spectral envelope as a general technique for analyzing categorical-valued time series in the frequency domain. The basic technique is similar to the methods established by Tavaré and Giddings (1989) and Viari et al (1990); however, there are some differences. The main difference is that the spectral envelope methodology is developed in a statistical setting to allow the investigator to distinguish between significant results and those results that can be attributed to chance. In particular, tests of significance and confidence intervals can be calculated using large sample techniques.
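The naive-scaling example above (the artificial sequence ACGT ACGT ...) is easy to reproduce. The sketch below (Python with NumPy; names are illustrative) computes the periodogram of the sequence under the two scalings discussed and locates the dominant frequency in each case:

```python
import numpy as np

def periodogram(x):
    # I_n(j/n) = (1/n) |sum_t x_t exp(-2*pi*i*j*t/n)|^2 on centered data,
    # retained for j = 0, ..., n/2.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    return (np.abs(np.fft.fft(x)) ** 2 / n)[: n // 2 + 1]

seq = "ACGT" * 32  # artificial sequence, n = 128
scale01 = {"A": 0, "G": 0, "C": 1, "T": 1}
scale1234 = {"A": 1, "C": 2, "G": 3, "T": 4}

p1 = periodogram([scale01[b] for b in seq])
p2 = periodogram([scale1234[b] for b in seq])

n = len(seq)
peak1 = p1.argmax() / n  # scaling {0,1,0,1}: one cycle every two bp
peak2 = p2.argmax() / n  # scaling {1,2,3,4}: one cycle every four bp
```

The first scaling concentrates all power at frequency 1/2, while the second places its largest ordinate at frequency 1/4, exactly as described in the text.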
2.
THE SPECTRAL ENVELOPE
Briefly, spectral analysis has to do with partitioning the variance of a stationary time series, x_t, into components of oscillation indexed by frequency ω, measured in cycles per unit of time, for −1/2 < ω ≤ 1/2. Given a numerical-valued time series sample, x_1, ..., x_n, that has been centered by its sample mean, the sample spectral density (or periodogram) is defined in terms of frequency ω as

I_n(ω) = (1/n) |Σ_{t=1}^{n} x_t exp(−2πiωt)|².   (2.1)
The periodogram is essentially the squared correlation of the data with the sines and cosines that oscillate at frequency ω. For example, if x_t represents hourly measurements of a person’s body temperature that happens to oscillate at a rate of one cycle every 24 hours, then I_n(1/24) will be large, because the data will be highly correlated with the cosine and/or sine term that oscillates at a cycle of 1/24, but the periodogram at other values of ω will be small. Although not the optimal choice of a definition, the spectral density f(ω) of the time series can be defined as the limit, as the sample size n tends to infinity, of E[I_n(ω)], provided that the limit exists. It is worthwhile to note that f(ω) ≥ 0, f(ω) = f(−ω), and

σ² = 2 ∫_0^{1/2} f(ω) dω,   (2.2)

where σ² = var(x_t). Thus, the spectral density can be thought of as the variance density of a time series relative to frequency of oscillation. That is, for positive frequencies between 0 and 1/2, the proportion of the variance that can be attributed to oscillations in the data at frequencies in a neighborhood dω of ω is roughly 2 f(ω) dω / σ². If the time series is white noise, that is, E(x_t) is independent of time and cov(x_s, x_t) = 0 for all s ≠ t, then f(ω) ≡ σ², that is, a uniform distribution. This is interpreted as all frequencies being present at the same power (variance), and hence the name white noise, from an analogy to white light, indicating that all possible periodic oscillations are present with equal strength. If n is a highly composite integer, the fast Fourier transform (FFT) provides for extremely fast calculation of I_n(j/n) for j = 0, 1, ..., [n/2], where [n/2] is the greatest integer less than or equal to n/2. If n is not highly composite, one may remove some observations or pad the series with zeros (see Shumway and Stoffer, 2000, §3.5). The frequencies ω_j = j/n are called the fundamental (or Fourier) frequencies. The sample equivalent of the integral equation (2.2) is

s² = (2/n) Σ_{j=1}^{[(n−1)/2]} I_n(j/n) + (1/n) I_n(1/2),   (2.3)

where s² is the sample variance of the data; the last term is dropped if n is odd. One usually plots the periodogram, I_n(j/n), versus the fundamental frequencies j/n, for j = 1, ..., [n/2], and inspects the graph for large values. As previously mentioned, large values of the periodogram at ω_j = j/n indicate that the data are highly correlated with the sinusoid that is oscillating at a frequency of j cycles in n observations.
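The sample-variance identity for the periodogram stated above can be verified directly. The following sketch (Python with NumPy; the even sample size n = 128 and the seed are illustrative) recovers the sample variance from the periodogram ordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 128  # even sample size, so the term at frequency 1/2 is kept
x = rng.standard_normal(n)
xc = x - x.mean()
per = np.abs(np.fft.fft(xc)) ** 2 / n  # I_n(j/n), j = 0, ..., n-1

# Sample variance recovered from the periodogram:
# s^2 = (2/n) sum_{j=1}^{n/2-1} I_n(j/n) + (1/n) I_n(1/2).
s2_from_per = 2 * per[1 : n // 2].sum() / n + per[n // 2] / n
s2 = xc.var()
```

The two quantities agree to machine precision, which is just Parseval’s theorem applied to the centered data.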
As a simple example, Figure 2.1 shows a time plot of n = 128 observations generated by

x_t = cos(2πωt + φ) + z_t,   t = 1, ..., n,   (2.4)

where ω is the frequency of oscillation, φ is a phase shift, and z_t ~ iid N(0, 1); the cosine signal, cos(2πωt + φ), is superimposed on the data in Figure 2.1. Figure 2.2 shows the standardized periodogram, I_n(j/n)/s², of the data shown in Figure 2.1. Note that there is a large value of the periodogram at the frequency ω and small values elsewhere (if there were no noise in (2.4) then the periodogram would only be non-zero at ω). Because the variance of the periodogram is unduly large no matter how large the sample size, the graph of the periodogram can be very choppy. To overcome this problem, a smoothed estimate of the spectral density is typically used. One form of an estimate is

f̂(ω_j) = Σ_{m=−M}^{M} h_m I_n(ω_{j+m}),   (2.5)

where the weights h_m = h_{−m} > 0 are chosen so that Σ_{m=−M}^{M} h_m = 1. A simple average corresponds to the case where h_m = 1/(2M + 1) for −M ≤ m ≤ M. The number M is chosen to obtain a desired degree of smoothness. Larger values of M lead to smoother estimates, but one has to be careful not to smooth away significant peaks.
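The cosine-plus-noise example and the smoothed estimate just described can be reproduced in a few lines. The sketch below (Python with NumPy; the frequency of 16 cycles in 128 observations and the random seed are illustrative assumptions, not the values behind Figures 2.1 and 2.2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
t = np.arange(n)
omega = 16 / n  # a fundamental frequency: 16 cycles in 128 observations
x = np.cos(2 * np.pi * omega * t + 0.6 * np.pi) + rng.standard_normal(n)

# Periodogram at the fundamental frequencies j/n, j = 0, ..., n/2.
xc = x - x.mean()
per = (np.abs(np.fft.fft(xc)) ** 2 / n)[: n // 2 + 1]

# Smoothed estimate: simple average over 2M+1 frequencies with M = 1.
weights = np.ones(3) / 3
fhat = np.convolve(per, weights, mode="same")
```

The periodogram shows a pronounced ordinate at j = 16 (the signal frequency), while the smoothed estimate trades some of that peak height for a less choppy curve, as the text warns.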
An analogous theory applies if one collects p numerical-valued time series, say x_{kt}, for k = 1, ..., p. In this case, write x_t = (x_{1t}, ..., x_{pt})' as the column vector of data at time t. The periodogram is now a p × p complex matrix

I_n(ω_j) = d(ω_j) d(ω_j)*,   where   d(ω_j) = n^{−1/2} Σ_{t=1}^{n} x_t exp(−2πiω_j t),

and * means to transpose and conjugate. Smoothing the periodogram can be accomplished as in the univariate case, that is, f̂(ω_j) = Σ_{m=−M}^{M} h_m I_n(ω_{j+m}). The population spectral density matrix, f(ω), is again defined as the limit, as n tends to infinity, of E[I_n(ω)]. The spectral matrix is Hermitian and non-negative definite. The diagonal elements of f(ω), say f_{kk}(ω) for k = 1, ..., p, are the individual spectra, and the off-diagonal elements, say f_{kl}(ω) for k ≠ l, are related to the pairwise dependence structure among the sequences (these are called cross-spectra). Details for the spectral analysis of univariate or multivariate time series can be found in Shumway and Stoffer (2000, Chapters 3 and 5). The spectral envelope is an extension of spectral analysis to the case in which the data are categorical-valued, such as DNA sequences. To briefly describe the technique using the nucleotide alphabet, let X_t be a DNA sequence taking values in {A, C, G, T}. For real numbers β_1, β_2, β_3, β_4, not all equal, denote the scaled (numerical) data by X_t(β), where X_t(β) = β_1 if X_t = A, β_2 if X_t = C, β_3 if X_t = G, and β_4 if X_t = T.
Then, for each frequency ω, we call β(ω) the optimal scaling at frequency ω if it satisfies

λ(ω) = sup { f(ω; β) / σ²(β) :  β ≠ c1 },   (2.6)

where f(ω; β) is the spectral density of the scaled data X_t(β), σ²(β) = var[X_t(β)], c is a real number, and 1 is a vector of ones (scalings proportional to 1 are excluded because they produce a constant sequence). Note that λ(ω) can be thought of as the largest proportion of the power (variance) that can be obtained at frequency ω for any scaling of the DNA sequence, and β(ω) is the particular scaling that maximizes the power at frequency ω. Thus, λ(ω) is called the spectral envelope. The name spectral envelope is appropriate because λ(ω) envelopes the spectrum of any scaled process. That is, for any assignment of numbers to letters, the standardized spectral density of a scaled sequence is no bigger than the spectral envelope, with equality only when the numerical assignment is proportional to the optimal scaling, β(ω). We say ‘proportional to’ because the optimal scaling vector, β(ω), is not unique. It is, however, unique up to location and scale changes; that is, any scaling of the form a β(ω) + b 1, where a ≠ 0 and b are real numbers, yields the same value of the spectral envelope. For example, the numerical assignments {A, C, G, T} = {0, 1, 0, 1} and {A, C, G, T} = {–1, 1, –1, 1} will yield the same normalized spectral density. The value of λ(ω), however, does not depend on the particular choice of scales; details can be found in Stoffer et al (1993a). For ease of computation, we set one element of β equal to zero (that is, for example, the scale for T is held fixed at T = 0) and then proceed with the computations. For example, to find the spectral envelope, λ(ω), and the corresponding optimal scaling, β(ω), holding the scale for T fixed at zero, form the 3 × 1 vectors

Y_t = (1, 0, 0)' if X_t = A;   Y_t = (0, 1, 0)' if X_t = C;   Y_t = (0, 0, 1)' if X_t = G;   Y_t = (0, 0, 0)' if X_t = T.

Now, with β = (β_1, β_2, β_3)', the scaled sequence X_t(β) can be obtained from the vector sequence Y_t by the relationship X_t(β) = β'Y_t. This relationship implies

f(ω; β) = β' f_Y(ω) β   and   σ²(β) = β' V β,

where f_Y(ω) is the 3 × 3 spectral density matrix of the indicator process Y_t, and V is the population variance-covariance matrix of Y_t. Because the imaginary part of f_Y(ω) is skew-symmetric, the following relationship holds: β' f_Y(ω) β = β' f_Y^{re}(ω) β, where f_Y^{re}(ω) denotes the real part of f_Y(ω). It follows that λ(ω) and β(ω) can easily be obtained by solving an eigenvalue problem with real-valued matrices.
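The eigenvalue problem with real-valued matrices leads directly to a short implementation. The sketch below (Python with NumPy; function and variable names are illustrative, the smoothing is a simple moving average, and the test sequence is an artificial one with a built-in period-three pattern, not data from the chapter) estimates the sample spectral envelope of a categorical sequence:

```python
import numpy as np

def spectral_envelope(seq, letters=("A", "C", "G"), M=4):
    # Sample spectral envelope: the last category (here T) is held fixed
    # at scale zero, and Y_t is the indicator process of the remaining
    # letters. Assumes every category occurs, so S is nonsingular.
    n, p = len(seq), len(letters)
    Y = np.zeros((n, p))
    for t, ch in enumerate(seq):
        if ch in letters:
            Y[t, letters.index(ch)] = 1.0
    Yc = Y - Y.mean(axis=0)

    D = np.fft.fft(Yc, axis=0) / np.sqrt(n)  # d(w_j), j = 0, ..., n-1
    nfreq = n // 2
    lam = np.zeros(nfreq)
    beta = np.zeros((nfreq, p))

    S = Yc.T @ Yc / n                        # sample covariance of Y_t
    evals, evecs = np.linalg.eigh(S)
    S_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

    for j in range(1, nfreq + 1):
        # Average the real part of the periodogram matrix over 2M+1
        # neighbouring fundamental frequencies.
        F = np.zeros((p, p))
        for m in range(-M, M + 1):
            d = D[(j + m) % n][:, None]
            F += (d @ d.conj().T).real
        F /= 2 * M + 1
        # Largest eigenvalue/eigenvector of S^{-1/2} F S^{-1/2}.
        w, V = np.linalg.eigh(S_isqrt @ F @ S_isqrt)
        lam[j - 1] = w[-1]
        beta[j - 1] = S_isqrt @ V[:, -1]
    return np.arange(1, nfreq + 1) / n, lam, beta

# Artificial sequence: G occurs every third base, n = 600.
seq = "ATGCTG" * 100
freqs, lam, beta = spectral_envelope(seq)
```

For this sequence the envelope peaks at a frequency near 1/3 (one cycle every three bp), the kind of period-three signature the chapter associates with coding regions.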
An algorithm for estimating the spectral envelope and the optimal scalings given a particular DNA sequence (using the nucleotide alphabet, {A, C, G, T}, for the purpose of example) is as follows: 1 Given a DNA sequence of length as previously described.
form the 3 × 1 vectors
2 Calculate the fast Fourier transform of the data:
Note that is a 3 × 1 complex-valued vector. Calculate the periodogram, for and retain only the real part, say 3 Smooth the periodogram, that is, calculate
where {h_k = h_{–k} > 0} are symmetric positive weights, and m controls the degree of smoothness. See Shumway and Stoffer (2000, Ch. 3), for example, for further discussion of periodogram smoothing.

4 Calculate the 3 × 3 variance–covariance matrix of the data,

S = n^{-1} Σ_{t=1}^{n} (Y_t – Ȳ)(Y_t – Ȳ)′,

where Ȳ = n^{-1} Σ_{t=1}^{n} Y_t is the sample mean of the data.
5 For each j, determine the largest eigenvalue and the corresponding eigenvector of the matrix 2n^{-1} S^{-1/2} f̂(j/n) S^{-1/2}. Note that S^{1/2} is the unique square root matrix of S, and S^{-1/2} is the inverse of that matrix.
6 The sample spectral envelope λ̂(j/n) is the eigenvalue obtained in the previous step. If b̂(j/n) denotes the corresponding eigenvector, the optimal sample scaling is β̂(j/n) = S^{-1/2} b̂(j/n); this will result in three values, the fourth being held fixed at zero. Any standard programming language can be used to do the calculations; basically, one only has to be able to compute fast Fourier transforms and eigenvalues and eigenvectors of real symmetric matrices. Note that this procedure
Nonparametric Frequency Detection and Optimal Coding in Molecular Biology 139
can be done with any finite number of possible categories, and is not restricted to looking only at nucleotides. Inference for the sample spectral envelope and the sample optimal scalings is described in detail in Stoffer et al. (1993a). A few of the main results of that paper are as follows. If Y_t is an uncorrelated sequence, and if no smoothing is used (that is, m = 0), then a large sample approximation for the sample spectral envelope, based on the chi-square distribution with 2(k – 1) degrees of freedom, is valid,
where k is the number of letters in the alphabet being used (for example, k = 4 in the nucleotide alphabet). If f̂(ω) is a consistent spectral estimator and if, for each ω, the largest root of f^re(ω) is distinct, then
converges jointly in distribution to independent zero-mean normal distributions, the first of which is standard normal; the covariance structure of the asymptotic (normal) distribution of the optimal scalings is given in Shumway and Stoffer (2000, Section 5.8). The term η_n in (2.7) depends on the type of estimator being used. For example, in the case of weighted averaging as in (2.5) (with m → ∞ but m/n → 0 as n → ∞), η_n = (Σ_{k=–m}^{m} h_k²)^{-1/2}; if a simple average is used, that is, h_k ≡ (2m + 1)^{-1}, then η_n = (2m + 1)^{1/2}. Based on these results, asymptotic normal confidence intervals and tests for λ(ω) can be readily constructed. Similarly, for β(ω), asymptotic confidence ellipsoids and chi-square tests can be constructed; details can be found in Stoffer et al. (1993a, Theorems 3.1 – 3.3). Peak searching for the smoothed spectral envelope estimate can be aided using the following approximations. Using a first order Taylor expansion we have
so that η_n [log λ̂(ω) – log λ(ω)] is approximately standard normal. If there is no signal present in a sequence of length n, we expect λ(j/n) ≈ 2/n for j ≠ 0, and hence approximately (1 – α) × 100% of the time, log λ̂(ω) will be less than log(2/n) + z_α/η_n, where z_α is the (1 – α) upper tail cutoff of the standard normal distribution. Exponentiating, the critical value for λ̂(ω) becomes (2/n) exp(z_α/η_n). From our experience, thresholding at very small values of α relative to the sample size works well.
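The six-step algorithm and the thresholding rule above can be sketched in NumPy. This is a hedged illustration, not the authors' code: the simple-average weights, the guard against a degenerate covariance matrix, and the exact normalization constants are our own assumptions.

```python
import numpy as np

def spectral_envelope(seq, m=2, alphabet="ACGT"):
    """Sample spectral envelope of a categorical sequence, following the
    six steps in the text; the last letter's scale is held fixed at zero."""
    n = len(seq)
    k = len(alphabet) - 1
    # Step 1: indicator vectors for all letters but the last, as a (k, n) array.
    Y = np.array([[1.0 if s == a else 0.0 for s in seq]
                  for a in alphabet[:k]])
    # Step 2: FFT and periodogram; keep only the real part.
    d = np.fft.fft(Y, axis=1) / np.sqrt(n)
    nfreq = n // 2
    I = np.einsum('if,jf->fij', d, d.conj())[1:nfreq + 1].real
    # Step 3: smooth with symmetric positive weights (simple average here).
    h = np.ones(2 * m + 1) / (2 * m + 1)
    fhat = np.empty_like(I)
    for j in range(I.shape[0]):
        idx = np.clip(np.arange(j - m, j + m + 1), 0, I.shape[0] - 1)
        fhat[j] = np.tensordot(h, I[idx], axes=1)
    # Step 4: sample variance-covariance matrix of the indicator process.
    S = np.cov(Y, bias=True)
    # Steps 5-6: eigenproblem with the inverse square root of S
    # (eigenvalues clipped to guard against an absent letter).
    w, V = np.linalg.eigh(S)
    S_isqrt = V @ np.diag(np.maximum(w, 1e-12) ** -0.5) @ V.T
    lam = np.empty(fhat.shape[0])
    beta = np.empty((fhat.shape[0], k))
    for j in range(fhat.shape[0]):
        vals, vecs = np.linalg.eigh(S_isqrt @ fhat[j] @ S_isqrt)
        lam[j] = 2.0 / n * vals[-1]        # sample spectral envelope
        beta[j] = S_isqrt @ vecs[:, -1]    # optimal sample scaling
    return np.arange(1, nfreq + 1) / n, lam, beta

def envelope_threshold(n, z_alpha, eta):
    """Approximate null critical value (2/n) exp(z_alpha / eta_n)."""
    return (2.0 / n) * np.exp(z_alpha / eta)
```

For instance, envelope_threshold(4156, 3.71, 5) reproduces the (2/4156) exp(3.71/5) ≈ 0.10% threshold quoted below for the Y-chromosomal fragment.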
As a simple example, consider the sequence data presented in Whisenant et al. (1991), which were used in an analysis of a human Y-chromosomal DNA fragment; the fragment is a string of length 4156 bp. The sample spectral envelope (based on the periodogram) of the sequence is plotted in Figure 2.3a, where frequency is measured in cycles per bp. The spectral envelope can be interpreted as the largest proportion of the total variance at a given frequency that can be obtained for any scaling of the DNA sequence. The graph can be inspected for peaks by employing the approximate null probabilities previously given. In Figure 2.3a, we show the approximate null significance thresholds of 0.0001 (0.60%) and 0.00001 (0.71%) for a single a priori specified frequency. The null significances were chosen small in view of the problem of making simultaneous inferences about the value of the spectral envelope over more than one frequency. Figure 2.3a shows a major peak at a low frequency corresponding to about three cycles in the DNA fragment of 4156 bp, with corresponding sample scaling A = 1, C = 0.1, G = 1.1, T = 0. This particular scaling suggests that the purine-pyrimidine dichotomization best explains the slow cycling in the fragment. There is also a secondary peak, with a corresponding sample scaling of A = 1, C = 1.5, G = 0.9, T = 0. Again we see the pairing of the purines, but the pyrimidines C and T are set apart; the significance of this scaling and frequency warrants further investigation, and at this time we can offer no insight into this result. Using the tests on the specific scalings, we found that we could not reject the hypotheses that, at the peak frequency, (i) A = G = 0, C = T = 1, and (ii) A = G = 0, C = –1, T = 1. To show how smoothing helps, Figure 2.3b shows a smoothed spectral envelope based on a simple average; for this example η_n = 5, so an approximate 0.0001 significance threshold is (2/4156) exp(3.71/5) = 0.10%.
Note that there is considerably less variability in the smoothed estimate and only the significant peaks are visible in the figure.
3.
SEQUENCE ANALYSES
Our initial investigations have focused on herpesviruses because we regard them as scientifically and medically important. Eight genomes are completely sequenced and a large amount of additional knowledge about their biology is known. This makes them a perfect source of data for statistical analyses. Here we report on an analysis of nearly all of the CDS of the Epstein-Barr virus via methods involving the spectral envelope. The data are taken from the EMBL data base. The study of nucleosome positioning is important because nucleosomes engage in a large spectrum of regulatory functions and because nucleosome research has come to a point where experimental data and analytical methods from
different directions begin to merge and to open ways to develop a more unified and accurate picture of the formation, structure, and function of nucleosomes. While ten years ago many investigators regarded histones as mere packing tools, irrelevant for regulation, there are now vast amounts of evidence suggesting the participation of nucleosomes in many important cellular events such as replication, segregation, development, and transcription (for reviews, see Grunstein
(1992) or Thoma (1992)). Although nucleosomes are now praised as a long overlooked "parsimonious way of regulating biological activity" (Drew and Calladine, 1987) or as the "structural code" (Travers and Klug, 1987), no obvious signals that would distinguish them from the rest of DNA are yet known (Trifonov, 1991). It therefore remains an important task to understand and unravel this complex structural code. The genetic code is degenerate: more than one triplet codes for the same amino acid; nevertheless, strong biases exist in the use of supposedly equivalent codons. This occurrence raises many questions concerning codon preferences, and understanding these preferences will provide valuable information. It is our goal to contribute to a better understanding of coding sequences by presenting the statistical methodology to perform a systematic statistical analysis of periodicities and of codon-context patterns. To this end, we suggest the following types of analyses based on the spectral envelope. The uses of the spectral envelope and the analyses presented here are by no means exhaustive; we will most likely raise more questions than we answer. We first explore the gene BcLF1 of Epstein-Barr. Figure 3.4 shows the spectral envelope of the CDS, which is 4143 bp long with the following nucleotide distribution: 900 A, 1215 C, 1137 G, 891 T. There is a clear signal at one cycle every three bp. Smoothing was performed in the calculation of the sample spectral envelope using triangular smoothing; in this case η_n ≈ 3, so a 0.0001 [0.00001] significance threshold is approximately 0.17% [0.20%]. In this analysis, the scalings at the peak
frequency of one cycle every three bp were A = 1.16, C = 0.87, G = –0.14, T = 0. This suggests that BcLF1’s signal is in the {M, K} alphabet, where M = {A or C}, K = {G or T}. The next question is which positions in the codon are critical to the BcLF1 signal. To address this we did the following: First, every codon-position 1 was replaced with a letter chosen at random from the nucleotide alphabet, {A, C, G, T}, with probability equal to the proportion of each letter in the entire gene. For example, the first nine values of BcLF1 are ATG GCC TCA; they are changed to _TG _CC _CA, where the blanks are filled with independently chosen letters such that the probability a chosen letter is an A is 900/4143, a C is 1215/4143, a G is 1137/4143, and a T is 891/4143. This is done over the entire gene, and then the spectral envelope is computed to see if destroying the sequence in that position destroys the signal. The graph shown in Figure 3.5a is the resulting spectral envelope. There it is noted that not very much has changed from Figure 3.4, so that destroying position 1 has no effect on the signal. The graph
shown in Figure 3.5b has the second position destroyed, and Figure 3.5c is the result of destroying the third position. While destroying the first position has virtually no effect on the signal, destroying the second or third position does have an effect on the signal. In the next three panels of Figure 3.5, the results of destroying two positions simultaneously are shown. Figure 3.5d shows what happens when the first and second positions are destroyed, Figure 3.5e has the first and third positions destroyed, and Figure 3.5f has the second and third positions destroyed. It is clear that the major destruction to the signal occurs when either the first and third positions, or the second and third positions, are destroyed, although in either case there is still some evidence that the signal has survived. From Figures 3.5e and 3.5f we see that the first and second positions cannot thoroughly carry the signal alone; however, the signal remains when the third position is destroyed (Figure 3.5c), so that the job does not belong solely to position three. To show how this technology can help detect heterogeneities and wrongly assigned gene segments, we focus on a dynamic (or sliding-window) analysis of BNRF1 (bp 1736-5689) of Epstein-Barr. Figure 3.6 shows the spectral envelope (using triangular smoothing) of the entire CDS (approximately 4000 bp). The figure shows a strong signal at frequency 1/3; the corresponding optimal scaling was A = 0.04, C = 0.71, G = 0.70, T = 0, which indicates that the signal is in the strong-weak bonding alphabet, S = {C, G} and W = {A, T}.
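The codon-position randomization used above (replacing a given codon position by letters drawn with the gene-wide nucleotide proportions) is easy to reproduce. The sketch below is a hedged illustration: the nine-letter string and the random seed are only for demonstration.

```python
import random
from collections import Counter

def destroy_positions(seq, positions, rng=None):
    """Replace the given codon positions (a subset of {0, 1, 2}) by
    letters drawn at random with the gene-wide nucleotide proportions."""
    rng = rng or random.Random()
    counts = Counter(seq)
    letters = sorted(counts)
    weights = [counts[a] for a in letters]
    out = list(seq)
    for t in range(len(seq)):
        if t % 3 in positions:
            out[t] = rng.choices(letters, weights=weights)[0]
    return "".join(out)

# Destroying codon-position 1 (index 0) of ATG GCC TCA leaves
# positions 2 and 3 of every codon intact:
scrambled = destroy_positions("ATGGCCTCA", {0}, random.Random(1))
assert scrambled[1::3] == "TCC" and scrambled[2::3] == "GCA"
```

The spectral envelope of the scrambled gene is then recomputed to see whether the signal survives, exactly as in Figures 3.5a-f.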
Next, we computed the spectral envelope over two windows: the first half and the second half of BNRF1 (each section being approximately 2000 bp long). We do not show the result of that analysis here, but the spectral envelopes and the corresponding optimal scalings were different enough to warrant further investigation. Figure 3.7 shows the result of computing the spectral envelope over four 1000 bp windows across the CDS, namely, the first, second, third, and fourth quarters of BNRF1. An approximate 0.001 significance threshold is 0.69%. The first three quarters contain the signal at the frequency 1/3 (Figure 3.7a-c); the corresponding sample optimal scalings for the first three windows were: (a) A = 0.06, C = 0.69, G = 0.72, T = 0; (b) A = 0.09, C = 0.70, G = 0.71, T = 0; (c) A = 0.18, C = 0.59, G = 0.77, T = 0. The first two windows are strongly consistent with the overall analysis; the third section, however, shows some minor departure from the strong-weak bonding alphabet. The most interesting outcome is that the fourth window shows no signal at all. This result suggests the fourth quarter of BNRF1 of Epstein-Barr is just a random assignment of nucleotides (noise).
To investigate these matters further, we took a window of size 1000 and moved it across the CDS 200 bp at a time. For example, the first analysis was on bp 1700-2700 of Epstein-Barr, the second on bp 1900-2900, the third on bp 2100-3100, and so on, until the final analysis on bp 4700-5700. Each analysis showed the frequency 1/3 signal except the last analysis on bp 4700-5700 (an amazing result considering that the analysis prior to the last one is on bp 4500-5500). Figure 3.8 shows the optimal sample scalings at the 1/3 frequency from each window analysis; the horizontal axis shows the starting location of the 1000 bp window (1700, 1900, 2100, ..., 4700), and the vertical axis shows the scalings. In Figure 3.8 the scalings are obtained as follows: each analysis fixes G = 0, and then the scales for A, C, and T are calculated. Next, we divided each scale by the value obtained for C, so that C = 1 in each analysis. Hence, the vertical axis of Figure 3.8 shows the scales, at frequency 1/3, in each window with G = 0 (solid line), C = 1 (dashed line), A free (solid line), and T free (dashed line). This was done primarily to assess the homogeneity of the strong-weak bonding alphabet across the CDS. We see that, for the first quarter or so of the CDS, the {S, W} alphabet is strong. That strength fades a bit in the middle and then comes back (though not as strong) near the end of the
CDS. We see, however, that this alphabet is nonexistent in the final 1000 bp of BNRF1. This lack of periodicity prompted us to reexamine this region with a number of other tools, and we now strongly believe that this segment is indeed noncoding. Herpesvirus saimiri (taken from GenBank) has a CDS from bp 6821 to 10561 (3741 bp) where the similarity to EBV BNRF1 is noted. To see if a similar problem existed in HVS BNRF1, and to generally compare the periodic behavior of the genes, we analyzed HVS BNRF1 in a fashion similar to the EBV BNRF1 analysis displayed in Figure 3.6. Figure 3.9 shows the smoothed (triangular) spectral envelope of HVS BNRF1 for (a) the first 1000 bp, (b) the second 1000 bp, (c) the third 1000 bp, and (d) the remaining 741 bp. There are some obvious similarities; that is, for the first three sections the cycle 1/3 is common to both the EBV and the HVS gene. The obvious differences are the appearance of the 1/10 cycle in the third section and the fact that in HVS, the fourth section shows the strong possibility of containing the 1/3 periodicity (the data were padded to length 756 for use with the FFT; the 0.001 significance threshold in this case is (2/756) exp(3.71/3) = 0.91%, and the peak value of the spectral envelope at 1/3 in this section was 0.89%), whereas in EBV the fourth section is
noise. Next, we compared the scales for each section of the HVS analysis. In the first section, the scales corresponding to the 1/3 cycle are A = 0.2, C = 0.96, G = 0.18, T = 0, which suggests that the signal is driven by the C, not-C dichotomy. In the second section the scales corresponding to the 1/3 signal are A = 0.26, C = 0.63, G = 0.73, T = 0, which suggests the strong-weak bonding alphabet. In the third section there are two signals: at the approximate 1/10 cycle the scales are A = 0.83, C = 0.47, G = 0.30, T = 0 (suggesting a strong bonding-A-T alphabet), and at the 1/3 cycle the scales are A = 0.20, C = 0.32, G = 0.93, T = 0 (suggesting a G-H alphabet). In the final section, the scales corresponding to the (perhaps not significant) 1/3 signal are A = 0.28, C = 0.51, G = 0.81, T = 0, which does not suggest any collapsing of the nucleotides. Finally, we tabulated the results of the analysis of nearly every CDS in Epstein-Barr. Only genes that exceed 500 bp in length are reported (BNRF1 and BcLF1 are not reported again here). In every analysis we used triangular smoothing as before, in which case η_n ≈ 3. These analyses were performed on the entire gene, and it is possible that a dynamic analysis would find significant periodicities in sections of a CDS other than those listed here. Table 2 lists the CDS analyzed, the 0.001 critical value (CV) for that sequence, the significant values of the smoothed sample spectral envelope (SpecEnv), the frequency at which the spectral envelope is significant (Freq), and the scalings for A, C, and G at the significant frequency (T = 0 in all cases). Note that for some genes, there is no evidence to support that the sequence is anything other than noise; these genes should be investigated further. The occurrence of the zero frequency has many explanations, but we are not certain which applies, and this warrants further investigation. One explanation is that the CDS has long memory, in that sections of the CDS that are far apart are highly correlated with one another.
Another possibility is that the CDS is not entirely coding. For example, we analyzed the entire lambda virus (approximately 49,000 bp) and found a strong peak at the zero frequency and at the one-third frequency; however, when we focused on any particular CDS, only the one-third frequency peak remained. We have noticed this in other analyses of sections that contain both coding and noncoding regions (see Stoffer et al., 1993b), but this is not consistent across all of our analyses.
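The sliding-window bookkeeping used throughout this section, and the scale normalization of Figure 3.8 (G fixed at 0, C rescaled to 1), can be sketched as follows. The function and variable names are our own; the shift-then-divide normalization relies on the location and scale invariance of optimal scalings noted earlier.

```python
def sliding_windows(seq, width=1000, step=200):
    """Start offsets and segments for a dynamic analysis, e.g. a
    1000 bp window moved 200 bp at a time across a CDS."""
    return [(start, seq[start:start + width])
            for start in range(0, len(seq) - width + 1, step)]

def normalize_scaling(scales, fix="G", unit="C"):
    """Re-express an {A, C, G, T} scaling so that the scale for `fix`
    is 0 and the scale for `unit` is 1, as in Figure 3.8."""
    shifted = {a: v - scales[fix] for a, v in scales.items()}
    return {a: v / shifted[unit] for a, v in shifted.items()}
```

Each window segment would then be passed to a spectral-envelope routine, and the normalized scaling at frequency 1/3 recorded against the window's start offset.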
4.
DISCUSSION
The spectral envelope, as a basic tool, appears to be suited for fast automated macro-analyses of long DNA sequences. Interactive computer programs are currently being developed. The analyses described in this paper were performed either using a cluster of C programs that compile on Unix operating systems, or using the Gauss programming system for analyses on Windows operating systems. We have presented some ways to adapt the technology to the analysis of DNA sequences. These adaptations were not presented in the original spectral envelope article (Stoffer et al., 1993a), and it is clear that there are many possible ways to extend the original methodology for use on various problems encountered in molecular biology. For example, we have recently developed similar methods to help discover whether two sequences share common signals, in a type of local alignment and a type of global alignment of sequences (Stoffer and Tyler, 1998). Finally, the analyses presented here point to some inconsistencies in established gene segments, and, evidently, some additional investigation and explanation are warranted.
ACKNOWLEDGMENTS This article is dedicated to the memory of Sid Yakowitz and his research in the field of time series analysis; in particular, his contributions and perspectives on fast methods for frequency detection. Part of this work was supported by a grant from the National Science Foundation. This work benefited from discussions with Gabriel Schachtel, University of Giessen, Germany.
REFERENCES
Bernardi, G. and G. Bernardi. (1985). Codon usage and genome composition. Journal of Molecular Evolution, 22, 363-365.
Bina, M. (1994). Periodicity of dinucleotides in nucleosomes derived from simian virus 40 chromatin. Journal of Molecular Biology, 235, 198-208.
Blaisdell, B.E. (1983). Choice of base at silent codon site 3 is not selectively neutral in eucaryotic structural genes: It maintains excess short runs of weak and strong hydrogen bonding bases. Journal of Molecular Evolution, 19, 226-236.
Buckingham, R.H. (1990). Codon context. Experientia, 46, 1126-1133.
Cornette, J.L., K.B. Cease, H. Margalit, J.L. Spouge, J.A. Berzofsky, and C. DeLisi. (1987). Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. Journal of Molecular Biology, 195, 659-685.
Drew, H.R. and C.R. Calladine. (1987). Sequence-specific positioning of core histones on an 860 base-pair DNA: Experiment and theory. Journal of Molecular Biology, 195, 143-173.
Eisenberg, D., R.M. Weiss, and T.C. Terwilliger. (1984). The hydrophobic moment detects periodicity in protein hydrophobicity. Proc. Natl. Acad. Sci., 81, 140-144.
Grunstein, M. (1992). Histones as regulators of genes. Scientific American, 267, 68-74.
Ioshikhes, I., A. Bolshoy, and E.N. Trifonov. (1992). Preferred positions of AA and TT dinucleotides in aligned nucleosomal DNA sequences. Journal of Biomolecular Structure and Dynamics, 9, 1111-1117.
McLachlan, A.D. and M. Stewart. (1976). The 14-fold periodicity in alpha-tropomyosin and the interaction with actin. Journal of Molecular Biology, 103, 271-298.
Mengeritsky, G. and E.N. Trifonov. (1983). Nucleotide sequence-directed mapping of the nucleosomes. Nucleic Acids Research, 11, 3833-3851.
Muyldermans, S. and A.A. Travers. (1994). DNA sequence organization in chromatosomes. Journal of Molecular Biology, 235, 855-870.
Pina, B., D. Barettino, M. Truss, and M. Beato. (1990). Structural features of a regulatory nucleosome. Journal of Molecular Biology, 216, 975-990.
Satchwell, S.C., H.R. Drew, and A.A. Travers. (1986). Sequence periodicities in chicken nucleosome core DNA. Journal of Molecular Biology, 191, 659-675.
Schachtel, G.A., P. Bucher, E.S. Mocarski, B.E. Blaisdell, and S. Karlin. (1991). Evidence for selective evolution in codon usage in conserved amino acid segments of human alphaherpesvirus proteins. Journal of Molecular Evolution, 33, 483-494.
Shrader, T.E. and D.M. Crothers. (1990). Effects of DNA sequence and histone-histone interactions on nucleosome placement. Journal of Molecular Biology, 216, 69-84.
Shumway, R.H. and D.S. Stoffer. (2000). Time Series Analysis and Its Applications. New York: Springer.
Stoffer, D.S., D.E. Tyler, and A.J. McDougall. (1993a).
Spectral analysis for categorical time series: Scaling and the spectral envelope. Biometrika, 80, 611-622.
Stoffer, D.S., D.E. Tyler, A.J. McDougall, and G.A. Schachtel. (1993b). Spectral analysis of DNA sequences (with discussion). Bulletin of the International Statistical Institute, Bk 1, 345-361; Bk 4, 63-69.
Stoffer, D.S. and D.E. Tyler. (1998). Matching sequences: Cross-spectral analysis of categorical time series. Biometrika, 85, 201-213.
Sueoka, N. (1988). Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci., 85, 2653-2657.
Tavaré, S. and B.W. Giddings. (1989). Some statistical aspects of the primary structure of nucleotide sequences. In Mathematical Methods for DNA Sequences, M.S. Waterman ed., pp. 117-131, Boca Raton, Florida: CRC Press.
Travers, A.A. and A. Klug. (1987). The bending of DNA in nucleosomes and its wider implications. Philosophical Transactions of the Royal Society of London, B, 317, 537-561.
Trifonov, E.N. (1991). DNA in profile. Trends in Biochemical Sciences, 16, 467-470.
Trifonov, E.N. and J.L. Sussman. (1980). The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc. Natl. Acad. Sci., 77, 3816-3820.
Uberbacher, E.C., J.M. Harp, and G.J. Bunick. (1988). DNA sequence patterns in precisely positioned nucleosomes. Journal of Biomolecular Structure and Dynamics, 6, 105-120.
Viari, A., H. Soldano, and E. Ollivier. (1990). A scale-independent signal processing method for sequence analysis. Computer Applications in the Biosciences, 6, 71-80.
Zhurkin, V.B. (1983). Specific alignment of nucleosomes on DNA correlates with periodic distribution of purine-pyrimidine and pyrimidine-purine dimers. FEBS Letters, 158, 293-297.
Zhurkin, V.B. (1985). Sequence-dependent bending of DNA and phasing of nucleosomes. Journal of Biomolecular Structure and Dynamics, 2, 785-804.
Part III

Chapter 8

AN EFFICIENT STOCHASTIC APPROXIMATION ALGORITHM FOR STOCHASTIC SADDLE POINT PROBLEMS

Arkadi Nemirovski
Reuven Y. Rubinstein
Faculty of Industrial Engineering and Management
Technion—Israel Institute of Technology
Haifa 32000, Israel
Abstract
We show that Polyak’s (1990) stochastic approximation algorithm with averaging, originally developed for unconstrained minimization of a smooth strongly convex objective function observed with noise, can be naturally modified to solve convex-concave stochastic saddle point problems. We also show that the extended algorithm, considered on general families of stochastic convex-concave saddle point problems, possesses a rate of convergence unimprovable in order in the minimax sense. We finally present supporting numerical results for the proposed algorithm.

1.
INTRODUCTION
We start with the classical stochastic approximation algorithm and its modification given in Polyak (1990).
1.1.
CLASSICAL STOCHASTIC APPROXIMATION
Classical stochastic approximation (CSA) originates from the papers of Robbins-Monro and Kiefer-Wolfowitz. It is basically a steepest descent method for solving the minimization problem

min_{x ∈ X} f(x), (1.1)
where the exact gradients of f are replaced with their unbiased estimates. In the notation from Example 2.1, the CSA algorithm is

x_{t+1} = Π_X(x_t – γ_t G(x_t, ω_t)), (1.2)
where x_1 is an arbitrary point from X and Π_X(x) is the point of X closest to x (the projection of x onto X); the stepsizes γ_t are normally chosen as

γ_t = C/t, (1.3)
C being a positive constant. Under appropriate regularity assumptions (see, e.g., Kushner and Clark (1978)) the sequence {x_t} converges almost surely and in the mean square to the unique minimizer of the objective. Unfortunately, the CSA algorithm possesses poor robustness. In the case of a smooth (i.e., with a Lipschitz continuous gradient) and nondegenerate (i.e., with a nonsingular Hessian) convex objective, the rate of convergence is O(1/t) and is unimprovable, in a certain precise sense. However, to achieve this rate, one should adjust the constant C in (1.3) to the “curvature” of the objective; a “bad” choice of C – by an absolute constant factor less than the optimal one – can convert the convergence rate to O(t^{-κ}) with κ < 1. Finally, if the objective, although smooth, is not nondegenerate, (1.3) may result in extremely slow convergence. The CSA algorithm was significantly improved by Polyak (1990). In his algorithm, the stepsizes are larger in order than those given by (1.3) (they are of order t^{-γ} with 1/2 < γ < 1), so that the rate of convergence of the trajectory (1.2) to the solution is worse in order than for the usual CSA. The crucial difference between Polyak’s algorithm and the CSA is that the sequence (1.2) is used only to collect information about the objective rather than to estimate the solution itself. Approximate solutions to (1.1) are obtained by averaging the “search points” in (1.2) according to

x̄_t = t^{-1} Σ_{s=1}^{t} x_s.
It turns out that under the same assumptions as for the CSA (smooth nondegenerate convex objective attaining its minimum at an interior point of X), Polyak’s algorithm possesses the same asymptotically unimprovable convergence rate as the CSA. At the same time, in Polyak’s algorithm there is no need for “fine adjustment” of the stepsizes to the “curvature” of the objective. Moreover, Polyak’s algorithm with properly chosen stepsizes preserves a “reasonable” (close to O(t^{-1/2})) rate of convergence even when the (convex) objective is nonsmooth and/or degenerate. A somewhat different aggregation in SA algorithms was proposed earlier by Nemirovski and Yudin (1978, 1983). For additional references on the CSA
157
An Efficient Algorithm for Saddle Point Problems
algorithm and its outlined modification, see Ermoliev (1969), Ermoliev and Gaivoronski (1992), L’Ecuyer, Giroux, and Glynn (1994), Ljung, Pflug and Walk (1992), Pflug (1992), Polyak (1990) and Tsypkin (1970) and references therein. Our goal is to extend Polyak’s algorithm from unconstrained convex minimization to the saddle point case. We shall show that, although for the general saddle point problems below the rate of convergence slows down from O(1/t) to O(1/√t), the resulting stochastic approximation saddle point (SASP) algorithm, as applied to stochastic saddle point (SSP) problems, preserves the optimality properties of Polyak’s method. The rest of this paper is organized as follows. In Section 2 we define the SSP problem, present the associated SASP algorithm, and discuss its convergence properties. We show that the SASP algorithm is a straightforward extension of its stochastic counterpart with averaging, originally proposed by Polyak (1990) for stochastic minimization problems as in Example 2.1 below. It turns out that in the general case the rate of convergence of the SASP algorithm becomes O(1/√t) instead of O(1/t), the convergence rate of Polyak’s algorithm. We demonstrate in Section 3 that this slowing down is an unavoidable price for extending the class of problems handled by the method. In Section 4 we present numerical results for the SASP algorithm as applied to the stochastic Minimax Steiner problem and to an on-line queuing optimization problem. The Appendix contains the proofs of the rate of convergence results for the SASP algorithm. It is not our intention in this paper to compare the SASP algorithm with other optimization algorithms suitable for off-line and on-line stochastic optimization, such as the stochastic counterpart method (Rubinstein and Shapiro, 1993); it is merely to show the high potential of the SASP method and to promote it for further applications.
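To make the projected recursion (1.2)-(1.3) and the averaging step concrete, here is a small illustrative sketch. The quadratic test objective, the noise model, and the stepsize exponent γ = 0.7 ∈ (1/2, 1) are our own assumptions, not taken from the paper.

```python
import numpy as np

def csa_with_averaging(grad_est, project, x0, n_iter, c=1.0, gamma=0.7,
                       rng=None):
    """Projected stochastic gradient steps x_{t+1} = Pi_X(x_t - a_t g_t)
    with larger-than-classical stepsizes a_t = c * t^{-gamma}, returning
    both the last search point and the running average of the search
    points; the averaged point is the estimate of the solution."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    avg = np.zeros_like(x)
    for t in range(1, n_iter + 1):
        g = grad_est(x, rng)                  # noisy gradient estimate
        x = project(x - c * t ** (-gamma) * g)
        avg += (x - avg) / t                  # running mean of x_1..x_t
    return x, avg

# Illustrative problem: minimize E (x - 0.3)^2 over X = [0, 1],
# observing the gradient 2(x - 0.3) plus zero-mean noise.
grad = lambda x, rng: 2.0 * (x - 0.3) + rng.normal(0.0, 1.0, x.shape)
proj = lambda x: np.clip(x, 0.0, 1.0)
last, averaged = csa_with_averaging(grad, proj, [0.9], n_iter=20000)
```

The averaged iterate is far less sensitive to the choice of the constant c than the raw trajectory, which is exactly the robustness property of Polyak's scheme discussed above.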
2.
STOCHASTIC SADDLE POINT PROBLEM
2.1.
THE PROBLEM
Consider the following saddle point problem:

(SP) Given a function f(x, y) on X × Y, find a saddle point of f, i.e., a point (x*, y*) in X × Y at which f attains its minimum in x ∈ X and its maximum in y ∈ Y.

In what follows, we write (SP) down as

min_{x ∈ X} max_{y ∈ Y} f(x, y).

We make the following
Assumption A. X and Y are convex compact sets, and f(x, y) is convex in x ∈ X, concave in y ∈ Y, and Lipschitz continuous on X × Y.
Let us associate with (SP) the following pair of functions

f̄(x) = max_{y ∈ Y} f(x, y),   f_(y) = min_{x ∈ X} f(x, y)

(the primal and the dual objectives, respectively), and the following pair of optimization problems

(P): min_{x ∈ X} f̄(x),   (D): max_{y ∈ Y} f_(y).

It is well known (see, e.g., Rockafellar (1970)) that under Assumption A both problems (P) and (D) are solvable, with the optimal values equal to each other, and the set of saddle points of f on X × Y is exactly the set X* × Y*, where X* and Y* are the optimal sets of (P) and (D), respectively.
2.1.1 Stochastic setting. We are interested in the situation where neither the function f in (SP) nor its partial subgradients in x and y are available explicitly; we assume, however, that at each time instant t one can obtain, for every desired point (x, y) ∈ X × Y, “noisy estimates” of the aforementioned partial derivatives. These estimates form a realization of the pair of random vectors

G_x(x, y; ω_t),   G_y(x, y; ω_t), (2.2)
the ω_t being the “observation noises”. We assume that these noises are independent identically distributed, according to a Borel probability measure P, random variables taking values in a Polish (i.e., metric separable complete) space Ω. We also make the following

Assumption B. The functions G_x(x, y; ω) and G_y(x, y; ω) are Borel functions on X × Y × Ω taking values in the corresponding Euclidean spaces, such that their expectations belong to the sub- and super-differentials of f and their second moments are bounded.
Here ∂_x f(x, y) and ∂_y f(x, y) are the sub- and super-differentials of f in x and in y, respectively; |·| is the standard Euclidean norm on the corresponding space; and D_X and D_Y are the Euclidean diameters of X and Y, respectively.
We refer to G_x and G_y in (2.2) satisfying Assumption B as a stochastic source of information (SSI) for problem (SP), and refer to problem (SP) satisfying Assumption A and equipped with a particular stochastic source of information G_x, G_y as a stochastic saddle point (SSP) problem. The associated quantity bounding the second moments of the observations will be called the variation of observations of the stochastic source of information. Our goal is to develop a stochastic approximation algorithm for solving the SSP problem.
Note that is expressed in terms of the objective function rather than of the distance from to the saddle set of It is nonnegative everywhere and equals to 0 exactly at the saddle set of This is so, since the saddle points of are exactly the pairs Note finally that
since
2.2.
EXAMPLES
We present now several stochastic optimization problems which can be naturally posed in the SSP form.

Example 2.1 Simple single-stage stochastic program. Consider the simplest stochastic programming problem

min_{x ∈ X} f(x) = E{F(x, ω)} (2.7)

with convex compact feasible domain X and convex objective f. Here Ω is a Polish space, P is a Borel probability measure on Ω, and the expectation is taken with respect to ω ~ P. Assume that the integrand F is sufficiently regular, namely,
F(x, ·) is summable for every x ∈ X; F is differentiable in x for all ω, with ∇_x F(x, ω) Borel in (x, ω); and F(·, ω) is Lipschitz continuous in x with a certain constant L(ω) having finite second moment.
Assume further that when solving (2.7), we cannot compute directly, but are given an iid random sample distributed according to P and know how to compute at every given point Under these assumptions program (2.7) can be treated as an SSP problem with and a trivial – singleton – set Y (which enables us to set ). It is readily seen that the resulting SSP problem satisfies assumptions A and B with
and that the accuracy measure (2.5) in this problem is just the residual in terms of the objective:
Example 2.2 Minimax stochastic program. Consider the following system of stochastic inequalities:
with convex compact domain X and convex constraints P are the same as in Example 2.1, and each of the integrands possesses the same regularity properties as in Example 2.1. Clearly, to solve (2.9) it suffices to solve the optimization problem
Here
which is the same as solving the following saddle point problem
Note that the latter problem clearly satisfies Assumption A. Similarly to Example 2.1, assume that when solving (2.10), we cannot compute (and thus explicitly, but are given an iid sample distributed
according to P and, given
and we can compute Note that under this assumption the SSI for the saddle point problem (2.11) is given by
The variation of observations for this source can be bounded from above as
Note that in this case the accuracy measure
satisfies the inequality
so that it presents an upper bound for the residual in (2.10). Although problem (2.9) can be posed as a convex minimization problem (2.10) rather than the saddle point problem (2.11), it cannot be solved directly. Indeed, to solve (2.10) by a Stochastic Approximation algorithm, we need unbiased estimates of subgradients of and we cannot build estimates of this type from the unbiased estimates of which are the only ones available to us. Thus, in the case under consideration the saddle point reformulation seems to be the only one suitable for handling “noisy observations”. Example 2.3 Single-stage stochastic program with stochastic constraints. Consider the following stochastic program:
subject to
with convex compact domain X and convex functions and let P and the integrands satisfy the same assumptions as in Examples 2.1, 2.2. As above, assume that when solving (2.12) – (2.13), we cannot compute
explicitly, but are given an iid sample distributed according to P and, given and can compute To solve problem (2.12) – (2.13), it suffices to find a saddle point of the Lagrange function
on the set Note that if (2.12) – (2.13) satisfies the Slater condition, then possesses a saddle point on and the solutions to (2.12), (2.13) coincide with the of the saddle points of Assume that we have prior information on the problem, which enables us to identify a compact convex set containing the of some saddle point of Then we can replace in the Lagrange saddle point problem the nonnegative orthant with Y, thus obtaining an equivalent saddle point problem
with convex and compact sets X and Y. Noting that the vectors
form a stochastic source of information for (2.14), we see that (2.12) – (2.13) can be reduced to an SSP. The variation of observations for the associated stochastic source of information clearly can be bounded as
These examples demonstrate that the SSP setting is a natural form of many stochastic optimization problems.
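To make the simplest of these settings (Example 2.1) concrete, here is a minimal sketch of projected stochastic gradient descent driven by an iid sample. The quadratic integrand, the distribution of the sample and all numerical constants below are illustrative choices of ours, not taken from the text; only the structure (sampled gradients plus projection plus Cesàro averaging) reflects the scheme discussed here.

```python
import random

def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval X = [lo, hi]."""
    return max(lo, min(hi, x))

def sgd_example(n_steps=20000, seed=0):
    """Projected SA for min_x E[(x - zeta)^2], zeta ~ N(0.3, 1).
    The minimizer is x* = E[zeta] = 0.3; only sampled gradients
    2*(x - zeta) are used, never the expectation itself."""
    rng = random.Random(seed)
    x, x_sum = 0.9, 0.0
    for t in range(1, n_steps + 1):
        zeta = rng.gauss(0.3, 1.0)            # one draw from the iid sample
        grad = 2.0 * (x - zeta)               # unbiased gradient estimate
        x = project(x - 0.5 / t ** 0.5 * grad)  # stepsize ~ C / sqrt(t)
        x_sum += x
    return x_sum / n_steps                    # Cesaro average of the iterates
```

The returned average lands close to the true minimizer 0.3 despite the heavy per-step noise, which is the basic phenomenon the SSP machinery generalizes to saddle points.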
2.3.
THE SASP ALGORITHM
The SASP algorithm for the stochastic saddle point problem (2.1), (2.3) is as follows:
Algorithm 2.1
where is the projector on X × Y :
the vector
and
is
being positive parameters of the method;
are positive stepsizes which, in principle, can be either deterministic or stochastic (see Subsections 2.4 and 2.5, respectively); the initial point
is an arbitrary (deterministic) point in X × Y.
As an approximate solution of the SSP problem we take the moving average
where is a deterministic function taking, for every integer values between 1 and
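A minimal sketch of an iteration of this type on a toy instance may help fix ideas. The bilinear objective, the box domains, the noise model and the plain (rather than moving) average below are our own illustrative choices; only the pattern of a projected stochastic gradient step followed by averaging mirrors the algorithm above.

```python
import random

def clip(v, lo=-1.0, hi=1.0):
    """Euclidean projection of a scalar onto [lo, hi]."""
    return max(lo, min(hi, v))

def sasp_bilinear(n_steps=40000, noise=0.2, seed=1):
    """SASP-style iteration for the toy saddle point problem
    min_{x in [-1,1]} max_{y in [-1,1]} x*y, whose unique saddle
    point is (0, 0).  Gradient observations carry additive noise;
    the approximate solution is an average over the trajectory
    (a simplification of the moving average used in the text)."""
    rng = random.Random(seed)
    x, y = 0.8, -0.6
    x_sum = y_sum = 0.0
    for t in range(1, n_steps + 1):
        gamma = 0.5 / t ** 0.5             # stepsize ~ C / sqrt(t)
        gx = y + rng.gauss(0.0, noise)     # noisy d(xy)/dx
        gy = x + rng.gauss(0.0, noise)     # noisy d(xy)/dy
        x = clip(x - gamma * gx)           # descent in x, then project
        y = clip(y + gamma * gy)           # ascent in y, then project
        x_sum += x
        y_sum += y
    return x_sum / n_steps, y_sum / n_steps
```

The non-averaged iterates of such a scheme circle around the saddle point of a bilinear function; it is the Cesàro averaging that makes them converge, which is exactly the phenomenon studied in Nemirovski and Yudin (1978).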
In the two subsections which follow we discuss the rate of convergence of the SASP algorithm and the choice of its parameters. Subsections 2.4 and 2.5 deal with off-line and on-line choice of the stepsizes, respectively.
2.4.
RATE OF CONVERGENCE AND OPTIMAL SETUP: OFF-LINE CHOICE OF THE STEPSIZES
Here we consider the case of deterministic sublinear stepsizes. Namely, assume that where C > 0 and As we shall see in Section 3, properly chosen C and yield a rate of convergence which is unimprovable in order (in a certain precise sense).
Theorem 1 Under assumptions A, B and (2.19), the expected inaccuracy of the approximate solutions generated by the method can be bounded from above, for every N > 1, as follows:
The proof of Theorem 1 is given in Appendix A. It is easily seen that the parameters minimizing, up to an absolute constant factor, the right hand side of (2.20) are given by the setup
Here denotes the smallest integer which is With this setup, (2.20) results in
2.5.
RATE OF CONVERGENCE AND OPTIMAL SETUP: ON-LINE CHOICE OF THE STEPSIZES
Setup (2.21) requires a priori knowledge of the parameters When the domains X, Y are “simple” (e.g., boxes, Euclidean balls or perfect simplices), there is no problem in computing the diameters And in actual applications we can handle simple X and Y only, since we should know how to project onto these sets. Computation of the variation of observations is, however, trickier. Typically the exact value of is not available, and a bad initial guess for can significantly slow down the convergence rate. For practical purposes it might be better to use an on-line policy for updating guesses for and our current goal is to demonstrate that there exists a reasonably wide family of these policies preserving the convergence rate of the SASP algorithm. We shall focus on stochastic stepsizes of the type (cf. (2.19))
where is fixed and depends on the observations For the theoretical analysis, we make the following Assumption C. For every depends only on the observations collected at the first steps, i.e., is a deterministic function of
and
Moreover, there exist “safety bounds” – two positive constants – such that
for all Let us associate with the SASP algorithm (2.15) – (2.18), (2.23) the following (deterministic) sequence:
where the supremum is taken over all trajectories associated with the SSP problem in question. Theorem 2 Let the stepsizes in the SASP algorithm (2.15) – (2.18) be chosen according to (2.23), and the remaining parameters according to (2.21). Then under assumptions A, B, C the expected inaccuracy of the approximate solution generated by the SASP algorithm can be estimated, for every N > 1, as follows:
Note that (2.20) is a particular case of (2.26) with The proof of Theorem 2 is given in Appendix A. We now present a simple example of adaptive choice of Recalling (see (2.19), (2.21)) that the optimal stepsize is the one with and where is the variation of observations, it is natural to choose as
where (2.4)
is our current guess for the unknown quantity
Since by definition
a natural candidate for the role of
is given by
– the sample mean of “magnitude of observations”. Such a choice of may violate assumption (2.24), since fluctuations in the observations may result in being either too small or too large to satisfy (2.24). It thus makes sense to replace (2.29) with its truncated version. More precisely, let
where and represent some a priori guesses for lower and upper bounds on respectively. Then the truncated version of (2.29) is
and
Clearly, the stepsize policy (2.27), (2.30) satisfies (2.24) – it suffices to take and In addition, for the truncated version we have
where O(1) depends solely on our safety bounds Inequalities (2.26), (2.31) combined with the stepsize policy (2.27), (2.30) result in the same rate of convergence as in (2.22). Note that the motivation behind the stepsize policy (2.27), (2.30) is, roughly speaking, to choose stepsizes according to the actual magnitudes of observations along the trajectory rather than according to the worst-case “expected magnitudes”
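Such a truncated on-line policy can be sketched as follows. The test problem (a one-dimensional quadratic expectation), the safety bounds and all constants are illustrative assumptions of ours; the structure, a running mean of observed magnitudes clipped to a priori bounds and plugged into a 1/sqrt(t) stepsize, follows the policy just described.

```python
import random

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def sasp_adaptive(n_steps=20000, seed=2):
    """On-line stepsize policy in the spirit of (2.27)-(2.30):
    the unknown variation of observations is replaced by a running
    sample mean of observed gradient magnitudes, truncated between
    a priori safety bounds [m_lo, m_hi].  Problem, bounds and
    constants are illustrative choices, not from the text."""
    rng = random.Random(seed)
    m_lo, m_hi = 0.5, 50.0                 # a priori safety bounds
    x, x_sum, mag_sum = 0.9, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = 2.0 * (x - rng.gauss(0.3, 1.0))   # noisy gradient of E[(x-z)^2]
        mag_sum += abs(g)
        m_hat = clamp(mag_sum / t, m_lo, m_hi)  # truncated running guess
        gamma = 1.0 / (m_hat * t ** 0.5)        # stepsize ~ 1/(M_hat sqrt(t))
        x = clamp(x - gamma * g, -1.0, 1.0)
        x_sum += x
    return x_sum / n_steps
```

Because the guess is clipped to fixed safety bounds, a wild early observation cannot destroy the stepsizes, which is precisely the role of the truncation in the policy above.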
3.
DISCUSSION
3.1.
COMPARISON WITH POLYAK’S ALGORITHM
As applied to convex optimization problems (see Example 2.1), the SASP algorithm with the setup (2.21) looks very similar to Polyak’s algorithm with averaging. There is, however, an important difference: the stepsizes given by (2.21) are not quite suitable for Polyak’s method. For the latter, the standard setup is with and this is the setup for which Polyak’s method possesses its most attractive property – an as opposed to rate of convergence on strongly convex (i.e., smooth and nondegenerate) objective functions. Specifically, let be the class of all stochastic optimization problems on a compact convex set with twice differentiable objective satisfying the condition and equipped with a stochastic source of information with variation of the observations not exceeding L. Note that problems from the class possess uniformly smooth objectives. In addition, if which corresponds to the “well-posed case”, the objectives are uniformly nondegenerate as well. Polyak’s method with stepsizes and properly chosen ensures that the expected error of the N-th approximate solution does not exceed where depend only on the data of Under the same circumstances the stepsizes given by (2.21) will result in a slower convergence, namely, ln N. Thus, in the “well-posed case” the SASP method with setup (2.21) is slower by a factor logarithmic in N than the original Polyak’s method. The situation changes dramatically when that is, when we pass from the “well-posed” case to the “ill-posed” one. Here the SASP algorithm still ensures (uniformly in problems from ) the rate of convergence which is not the case for Polyak’s method. Indeed, consider the simplest case when X = [0,1] and assume that observation noises are absent, so that Consider next the subfamily of comprised of the objectives where
and let us apply Polyak’s method to with stepsizes starting at The search points are
where is continuous on [1/2,1). For the points as well as their averages, belong to the domain where
Therefore, in order to get an Polyak’s algorithm requires at least steps. We conclude that in the ill-posed case the worst-case, with respect to rate of convergence of Polyak’s algorithm cannot be better than Thus, in the ill-posed case Polyak’s setup with results in a rate of convergence which is worse, by a factor of order of than the one given by the setup We believe that the outlined considerations provide enough arguments in favor of the rule unless we are certain that the problem is “well-posed”. As we shall see below, in the case of “genuine” saddle point problems (not reducing to minimization of a convex function via unbiased observations of its subgradients) the rate of convergence of the SASP algorithm with setup (2.21) is unimprovable even for “well-posed” problems.
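The ill-posed-case effect can be reproduced numerically on the minimally degenerate objective f(x) = x^4 on X = [0, 1], which is our own stand-in for the family discussed above (no observation noise; the stepsize constant c = 0.1 is chosen purely for illustration):

```python
def run(stepsize_exponent, n_steps=5000, c=0.1):
    """Projected gradient descent on the degenerate objective
    f(x) = x**4 (so f''(0) = 0) with stepsizes c / t**stepsize_exponent.
    Returns the final iterate; the minimizer is x* = 0."""
    x = 1.0
    for t in range(1, n_steps + 1):
        x -= (c / t ** stepsize_exponent) * 4.0 * x ** 3   # f'(x) = 4 x^3
        x = max(0.0, min(1.0, x))                          # stay in X = [0, 1]
    return x

x_polyak = run(1.0)   # stepsizes ~ 1/t       (Polyak-style setup)
x_sasp   = run(0.5)   # stepsizes ~ 1/sqrt(t) (setup of the type (2.21))
```

With 1/t stepsizes the total travel grows only logarithmically, so the iterate stalls far from the degenerate minimizer, while the 1/sqrt(t) stepsizes bring it much closer; this is the discrete counterpart of the lower-bound argument sketched in the text.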
3.2.
OPTIMALITY ISSUES
We are about to demonstrate that as far as general families of SSP problems are concerned, the SASP algorithm with setup (2.21) is optimal in order in the minimax sense. To this end, let us define the family of stochastic saddle point problems as that of all SSP instances on X × Y (recall that an SSP problem always satisfies assumptions A, B) with the variation of observations not exceeding L > 0. Given a positive and a subfamily let us denote by the information-based complexity of defined as follows. a) Let us define a solution method for the family as a procedure which, as applied to an instance from the family, generates sequences of “search points” and “approximate solutions” with the pair defined solely on the basis of observations along the previous search points:
Formally, the method
is exactly the collection of “rules” and the only restriction on these rules is that all function pairs must be Borel. b) For a solution method its complexity on an instance is defined as the smallest N such that
the expectation being taken over the distribution of observation noises; here is the inaccuracy measure (2.5) associated with the instance. The complexity of on the entire family is
i.e., it is the worst case, over all instances from of the complexity of on an instance. For example, (2.22) says that the complexity of the SASP method on the family of stochastic saddle point problems can be bounded from above as
provided that one uses the setup (2.21) with Finally, the complexity of the family is the minimum, over all solution methods of the complexities of the methods on the family:
A method
is called optimal in order on
if there exists
such that
Optimality in order of a method on a family means that does not admit a solution method “much faster” than for every required accuracy solves every problem from within (expected) accuracy in not more than steps, while every competing method in less than steps fails to solve, within the same accuracy, at least one (depending on ) problem from We are about to establish the optimality in order of the SASP algorithm on families Proposition 3.1 The complexity of every nontrivial (with a non-singleton X × Y) family of stochastic saddle point problems admits a lower bound
C being a positive absolute constant. The proof of Proposition 3.1 is given in Appendix B. Taking into account (3.1), we arrive at Corollary 3.1 For all convex compact sets and every L > 0, the SASP algorithm with setup (2.21) ( is set to L) is optimal in order on the family of SSP problems. Remark 3.1 The outlined optimality property of the SASP method means that, as far as the performance on the entire family is concerned, no alternative solution method outperforms the SASP algorithm by more than an
absolute constant factor. This fact, of course, does not mean that it is impossible to essentially outperform the SASP method on a given subfamily of For example, the family of convex stochastic minimization problems introduced in Subsection 3.1 can be treated as a subfamily of (think of a convex optimization problem as a saddle point problem with objective independent of ). As explained in Subsection 3.1, in the “well-posed” case the SASP algorithm is not optimal in order on (the complexity of the method on is while the complexity of is In view of Remark 3.1, it makes sense to present a couple of examples of what we call “difficult” subfamilies – those for which is of the same order as the complexity of the entire family For the sake of simplicity, let us restrict ourselves to the case Y = [0,1], L = 10. It is readily seen that if both X, Y are non-singletons, then can be naturally embedded into so that “difficult” subfamilies of generate “difficult” subfamilies in every family of stochastic saddle point problems with nontrivial X, Y. 1) The first example of a “difficult” subfamily in is the family of ill-posed smooth stochastic convex optimization problems associated with X = [–1,1], A = 1, L = 10, see Subsection 3.1. Indeed, consider a 2-parametric family of optimization programs
the parameters being and We assume that the family (3.3) is equipped with the stochastic source of information
Thus, we seek to minimize a simple quadratic function of a single variable in the situation when the derivative of the objective function is corrupted with an additive Gaussian white noise of unit intensity. The same reasoning as in the proof of Proposition 3.1 demonstrates that the family is indeed "difficult", the reason being that the programs (3.3) become more and more ill-posed as approaches 0. 2) The second example is a single-parametric family of “genuine” saddle point problems
equipped with the stochastic sources of information
here the parameter varies in [–1,1], The origin of the stochastic saddle point problem (3.4) – (3.5) is very simple. What we seek in fact is to solve a Minimax problem (see Example 2.2) of the form
specified by an unknown parameter What we can observe are the values and derivatives of at any desired point They are corrupted with noise and equal to
respectively. The observations of
are, respectively,
Applying the scheme presented in Example 2.2 to the above stochastic Minimax problem, we convert it into an equivalent stochastic saddle point problem, which is readily seen to be identical with (3.4) – (3.5). Note that the family of stochastic saddle point problems is contained in We claim that the family is "difficult". Indeed, denoting by the accuracy measure associated with the problem (3.4) and taking into account (2.5), we have
It follows that In other words, if there exists a method which solves in N steps every instance from within an expected inaccuracy then this method is capable of recovering, within the same expected inaccuracy, the value of underlying the instance. On the other hand, from the viewpoint of collected information, the N observations (3.5) used by the method are equivalent to observing a sample of N iid random variables. Thus, if
then one can recover, within the expected inaccuracy the unknown mean of from an N-point sample drawn from this distribution, regardless of the actual value of the mean. It is well-known that the latter is possible only when Thus, as claimed. Note finally that the stochastic Minimax problems (3.6) which give rise to the stochastic saddle point problems from are “as nice as a Minimax problem can be”. Indeed, the components present just shifts by of simple quadratic functions on the axis. Moreover, problems (3.6) are perfectly well posed – the solution is “sharp”, i.e., the residual is of the first order in the distance from a candidate solution to the exact solution, provided that this distance is small. We see that in situations less trivial than the one considered in case 1), “difficult” stochastic saddle point problems can arise already from quite simple and perfectly well-posed stochastic optimization models.
4.
NUMERICAL RESULTS
In this section we apply the SASP algorithm to a pair of test problems: the stochastic Minimax Steiner problem and on-line optimization of a simple queuing model.
4.1.
A STOCHASTIC MINIMAX STEINER PROBLEM
Assume that in a two-dimensional domain X there are towns of the same population, say equal to 1. The distribution of the population over the area occupied by town is All towns are served by a single facility, say, an ambulance. The “inconvenience of service” for town is measured by the mean distance from the facility to the customers, i.e., by the function
being the location of the facility. The problem is to find a location for the facility which minimizes the worst-case, with respect to all towns, inconvenience of service. Mathematically we have the following minimax stochastic program We assume that the only source of information for the problem is a sample
with mutually independent entries distributed according to i.e., a random sample of N tuples of requests for service, one request per town in a tuple.
The above Minimax problem can be naturally posed as an SSP problem (cf. Example 2.2) with the objective
the observations being
In our experiments we chose X to be the square and dealt with towns placed at the vertices of a regular pentagon, being the normal two-dimensional distribution with mean and the unit covariance matrix. We used setup (2.19), (2.21) with the parameters
and ran 2,000 steps of the SASP algorithm, starting at the point The results are presented in Fig. 1. We found that the relative inaccuracy
in 20 runs varied from 0.0006 to 0.006.
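This experiment can be reproduced in outline. The sketch below uses our own pentagon geometry, box, stepsizes and iteration count rather than the exact setup of the reported runs; by the symmetry of the configuration, the minimax facility location is the center of the pentagon.

```python
import math
import random

def proj_simplex(y):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = sorted(y, reverse=True)
    cum, theta = 0.0, 0.0
    for i, ui in enumerate(u, 1):
        cum += ui
        t = (cum - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(yi - theta, 0.0) for yi in y]

def steiner_sasp(n_steps=30000, seed=3):
    """SASP-style sketch of the stochastic minimax Steiner problem:
    five 'towns' at the vertices of a regular pentagon of radius 2,
    customer locations ~ N(town center, I).  By symmetry the minimax
    facility location is the center (0, 0).  All constants here are
    illustrative, not those of the experiment reported in the text."""
    rng = random.Random(seed)
    towns = [(2 * math.cos(2 * math.pi * k / 5),
              2 * math.sin(2 * math.pi * k / 5)) for k in range(5)]
    x = [1.5, 1.5]          # facility location in X = [-4, 4]^2
    y = [0.2] * 5           # town weights on the simplex Y
    xbar = [0.0, 0.0]
    for t in range(1, n_steps + 1):
        gamma = 0.5 / t ** 0.5
        gx, gy = [0.0, 0.0], []
        for k, (cx0, cy0) in enumerate(towns):
            px, py = rng.gauss(cx0, 1.0), rng.gauss(cy0, 1.0)  # one customer
            d = math.hypot(x[0] - px, x[1] - py) + 1e-12
            gy.append(d)                           # noisy value of f_k(x)
            gx[0] += y[k] * (x[0] - px) / d        # noisy grad_x contribution
            gx[1] += y[k] * (x[1] - py) / d
        x = [max(-4.0, min(4.0, x[j] - gamma * gx[j])) for j in range(2)]
        y = proj_simplex([y[k] + gamma * gy[k] for k in range(5)])
        xbar[0] += x[0]
        xbar[1] += x[1]
    return xbar[0] / n_steps, xbar[1] / n_steps
```

The weight vector climbs toward the currently worst-served town while the facility descends on the weighted mean distances, and the averaged trajectory settles near the pentagon's center.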
4.2.
A SIMPLE QUEUING MODEL
Here we apply the SASP algorithm to optimization of simple queuing models in steady state, such as the GI/G/1 queue. We consider the following minimization program:
where the domain X is an
box
In particular, we assume that the expected performance
is given as
where is the expected steady state waiting time of a customer, is the cost of a waiting customer, are parameters of the distributions of interarrival and service times, is the cost per unit increase of and is the transpose of Note that for most exponential families of distributions (see e.g., Rubinstein and Shapiro (1993), Chapter 3) the expected performance is a convex function of To proceed with the program (4.1), consider Lindley’s recursive (sample path) equation for the waiting time of a customer in a GI/G/1 queue (e.g., Kleinrock (1975), p. 277):
where, for fixed is an iid sequence of random variables (differences between the interarrival and the service times) with distribution depending on the parameter vector It is readily seen that the corresponding algorithm for calculating an estimate of can be written as follows: Algorithm 4.1 : 1. Generate the output process
using Lindley’s recursive equation
here are the distributions of the interarrival and service times, respectively, and are independent random variables uniformly distributed in [0,1]. 2. Differentiate (4.5) with respect to thus getting a recurrent formula for and use this recurrence to construct the estimates of Note that under mild regularity assumptions (see, e.g., Rubinstein and Shapiro (1993), Chapter 4) the expectation of converges to 0 as Application of the SASP algorithm (2.15)–(2.18) to program (4.1) – (4.3) yields
where are the estimates of yielded by Algorithm 4.1, with in (4.5) replaced by (see (4.6)), and It is important to note that now we are in a situation different from the one postulated in Theorem 1, in the sense that the stochastic estimates of used in (4.6) are biased and depend on each other. The numerical experiments below demonstrate that the SASP algorithm handles these kinds of problems reasonably well. In these experiments, we considered an M/M/1 queue, and being the interarrival and service rates; is the decision variable in the program (4.1)–(4.3). Taking into account that it is readily seen that the value which minimizes the performance measure in (4.1)–(4.3) is We set (which corresponds to and ) and chose as X the segment
which in terms of the traffic intensity
is
To demonstrate the effect of the Cesàro averaging (4.6), we present below statistical results for both the sequences and (see (2.15)–(2.18)), i.e., with and without Cesàro averaging. We shall call the sequence the crude SASP (CSASP) sequence. Tables 4.1, 4.2 present 10 realizations of the estimates for and yielded by the CSASP and SASP algorithms (denoted respectively), along with the corresponding values of the objective; two estimates in the same column correspond to a common simulation. Each of the 10 experiments related to Table 4.1 was started at the point which corresponds to ; the starting point for the experiments from Table 4.2 was The observations used in the method were given by the Lindley equation approach; the “memory” and the stepsizes were chosen according to (2.21), with and In each experiment, we performed 2000 steps of the algorithm (i.e., simulated 2000 arrivals of customers). Tables 4.3 and 4.4 summarize the statistics of Tables 4.1 and 4.2, namely, they present the sample means and
and the associated confidence
intervals. Let and be the widths of the confidence intervals associated with the CSASP and SASP sequences, respectively. The quantity
can be regarded as the efficiency of the SASP sequence relative to the CSASP one. From the results of Tables 4.3 and 4.4 it follows that the efficiency is quite significant. E.g., for the experiments presented in Table 4.3 we have and We applied to problem (4.1)–(4.3) the SASP algorithm with the adaptive stepsize policy (2.27), (2.30) and used it for various single-node queuing models with different interarrival and service time distributions. In all our experiments we found that the SASP algorithm converges reasonably fast to the optimal solution
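As a point of reference for these experiments, the waiting-time generator of step 1 of Algorithm 4.1 is easy to reproduce. In the sketch below the rates, the sample size and the closed-form M/M/1 check rho/(mu - lam) for the mean steady-state wait are our own illustrative choices; the recursion itself is Lindley's equation, with exponential inputs generated by inversion from uniforms.

```python
import math
import random

def mm1_waiting_times(lam=0.5, mu=1.0, n=200000, seed=4):
    """Lindley's recursion  L_{t+1} = max(0, L_t + S_t - A_{t+1})
    for the waiting times in a single-server queue, here with
    exponential interarrival (rate lam) and service (rate mu) times
    generated by inversion from uniforms.  For M/M/1 the mean
    steady-state wait is rho/(mu - lam) = 1.0 for these rates."""
    rng = random.Random(seed)
    w, total = 0.0, 0.0
    for _ in range(n):
        service = -math.log(1.0 - rng.random()) / mu   # S_t by inversion
        arrival = -math.log(1.0 - rng.random()) / lam  # A_{t+1} by inversion
        w = max(0.0, w + service - arrival)            # Lindley step
        total += w
    return total / n
```

Running this with the default rates gives a sample mean close to the theoretical value 1.0, which is a convenient sanity check before attaching the derivative estimates and the SASP recursion (4.6) on top of the simulation.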
5.
CONCLUSIONS
We have shown that the SASP algorithm (2.15)–(2.18) is applicable to a wide variety of stochastic saddle point problems, in particular to those associated with single-stage constrained convex stochastic programming problems (Examples 2.1–2.3). The method works under rather mild conditions: it requires only convexity-concavity of the associated saddle point problem (Assumption A) and conditional independence and unbiasedness of the observations (Assumption B). In contrast to classical Stochastic Approximation, no smoothness or nondegeneracy assumptions are needed. The rate of convergence of the method is data- and dimension-independent and is optimal in order, in the minimax sense, on wide families of convex-concave stochastic saddle point problems. As applied to general saddle point problems, the method seems to be the only stochastic approximation type routine converging without additional smoothness and nondegeneracy assumptions. The only alternative method for treating these problems is the so-called stochastic counterpart method (see Shapiro (1996)), which, however, requires more powerful nonlocal information on the problem. (For more details on stochastic approximation versus the stochastic counterpart method, see Shapiro (1996).)
APPENDIX A: PROOF OF THEOREMS 1 AND 2
As was already mentioned, the statement of Theorem 1 is a particular case of that of Theorem 2 corresponding to so that it suffices to prove Theorem 2. Let
be the scaled Euclidean distance between
and
Note that, due to the standard properties of the projection operator, we have
We assume, without loss of generality, that Note that where for each Let us fix
are deterministic Borel functions, and consider the random variable
We have from (0.1)
Setting
we can rewrite the resulting relation as
where and
Since
is convex in Similarly,
and
we have whence
Substituting this inequality in (0.2), we obtain
Summing the inequalities (0.5) over and applying the Cauchy inequality, we obtain
where
Applying next Jensen’s inequality to the convex functions
and
and taking into account (2.18), we obtain that
Since
we also have that
Clearly
because both
and
belong to X × Y. In view of these inequalities we obtain from (0.6)
The right hand side of (0.8) is independent of consequently, it majorizes the upper bound of the left-hand side over This upper bound is equal to
(see (2.6)). Thus, we have derived the following inequality
In view of (2.23) and assumption C we have
Consequently, (0.9) yields
(we have taken into account that ). To obtain the desired estimate for it suffices to take the expectation of both sides of (0.11). When doing so, one should take into account that, in view of assumption B, the conditional expectations of the vectors (for fixed ) are zero, and those of their squared norms do not exceed
and
and by construction
is a deterministic function of
This implies the inequalities
and similarly Finally,
With these inequalities we obtain from (0.11)
Since, by definition, and, consequently, we arrive at (2.26).
APPENDIX B: PROOF OF PROPOSITION 3.1
Without loss of generality we may assume that X is not a singleton. For evident homogeneity reasons, we may also assume that the diameter of X is 2 and that X contains the segment being a unit vector. For a given consider the two problems and with the following functions:
Let further
be the associated estimates of the partial (with respect to and ) derivatives of and respectively. Assume, finally, that is a standard Gaussian random variable. It is readily seen that the problems indeed belong to Let By the definition of complexity, there exists a method which in N steps solves all problems from (in particular, both and ) with expected inaccuracy at most The method clearly implies a routine for distinguishing between two hypotheses, and on the distribution of an iid sample
where
states that the distribution of every
is
The routine is as follows:
In order to decide which of the hypotheses takes place, we treat the observed sample as the sequence of coefficients at in the N subsequent observations of the gradient with respect to in a saddle point problem on X × Y (and add zero observations of the gradient with respect to ). Applying the first N steps of method to these observations, we form the N-th approximate solution and check whether If it is the case, we accept otherwise we accept It is clear that the probability for to reject the hypothesis when it is valid is exactly the probability for to get, as a result of its work on a point with In this case the inaccuracy of regarded as an approximate solution to is at least and since the expected inaccuracy of on the indicated probability is at most 1/4. By similar considerations, the probability for to reject when this hypothesis is valid is also Thus, the integer is such that there exists a routine for distinguishing between the aforementioned pair of statistical hypotheses with probability of rejecting the true hypothesis (whether it is or ) at most 1/4. By standard statistical arguments, this is possible only if
with an appropriately chosen positive absolute constant O(1), which yields the sought lower bound on N.
REFERENCES
Asmussen, S. and R.Y. Rubinstein. (1992). “The efficiency and heavy traffic properties of the score function method in sensitivity analysis of queuing models”, Advances in Applied Probability, 24(1), 172–201.
Ermoliev, Yu.M. (1969). “On the method of generalized stochastic gradients and quasi-Fejér sequences”, Cybernetics, 5(2), 208–220.
Ermoliev, Yu.M. and A.A. Gaivoronski. (1992). “Stochastic programming techniques for optimization of discrete event systems”, Annals of Operations Research, 39, 1–41.
Kleinrock, L. (1975). Queueing Systems, Vols. I and II, Wiley, New York.
Kushner, H.J. and D.S. Clark. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, Applied Math. Sciences, Vol. 26.
L’Ecuyer, P., N. Giroux, and P.W. Glynn. (1994). “Stochastic optimization by simulation: Numerical experiments for the M/M/1 queue in steady-state”, Management Science, 40, 1245–1261.
Ljung, L., G. Pflug, and H. Walk. (1992). Stochastic Approximation and Optimization of Stochastic Systems, Birkhäuser Verlag, Basel.
Nemirovski, A. and D. Yudin. (1978). “On Cesàro’s convergence of the gradient descent method for finding saddle points of convex-concave functions”, Doklady Akademii Nauk SSSR, Vol. 239, No. 4 (in Russian; translated into English as Soviet Math. Doklady).
Nemirovski, A. and D. Yudin. (1983). Problem Complexity and Method Efficiency in Optimization, J. Wiley & Sons.
Pflug, G.Ch. (1992). “Optimization of simulated discrete event processes”, Annals of Operations Research, 39, 173–195.
Polyak, B.T. (1990). “New method of stochastic approximation type”, Automat. Remote Control, 51, 937–946.
Rockafellar, R.T. (1970). Convex Analysis, Princeton University Press.
Rubinstein, R.Y. and A. Shapiro. (1993). Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization Via the Score Function Method, John Wiley & Sons.
Shapiro, A. (1996). “Simulation based optimization—Convergence analysis and statistical inference”, Stochastic Models, to appear.
Tsypkin, Ya.Z. (1970). Adaptation and Learning in Automatic Systems, Academic Press, New York.
Chapter 9
REGRESSION MODELS FOR BINARY TIME SERIES
Benjamin Kedem
Department of Mathematics
University of Maryland
College Park, Maryland 20742, USA
Konstantinos Fokianos
Department of Mathematics & Statistics
University of Cyprus
P.O. Box 20537
Nicosia, 1678, Cyprus
Abstract
We consider the general regression problem for binary time series where the covariates are stochastic and time dependent and the inverse link is any differentiable cumulative distribution function. This means that the popular logistic and probit regression models are special cases. The statistical analysis is carried out via partial likelihood estimation. Under a certain large sample assumption on the covariates, and owing to the fact that the score process is a martingale, the maximum partial likelihood estimator is consistent and asymptotically normal. From this we obtain the asymptotic distribution of a certain useful goodness of fit statistic.
1.
INTRODUCTION
Consider a binary time series taking the values 0 or 1, and related covariate or auxiliary stochastic data represented by a column vector. The binary series may be stationary or nonstationary, and the time dependent random covariate vector process may represent one or more time series and functions thereof that influence the evolution of the primary series of interest. The covariate vector process need not be stationary per se; however, it is required to possess the “nice” long term behavior described by Assumption A below. Conveniently, the covariate vector may contain past values of the binary series and/or past values of an underlying process that produces it.
We wish to study the regression problem of estimating the conditional success probability
through a parameter vector where represents all that is known to the observer at time about the time series and the covariate information; clearly, More precisely, the problem is to model the conditional probability (1.1) by a regression model depending on and then estimate the latter given a binary time series and its time dependent random covariates. The present paper follows the construction in Fokianos and Kedem (1998) and Slud and Kedem (1994), where categorical time series and logistic regression were considered, respectively. It is primarily an extension of Slud and Kedem (1994), although still a special case of Fokianos and Kedem (1998). Accordingly, we model (1.1) by the general regression model,
where F is a differentiable distribution function. Following the terminology of generalized linear models, the term link is reserved here for the inverse function (McCullagh and Nelder, 1989),
Thus, is the inverse link. Any suitable differentiable inverse link F that maps the real line onto the interval [0, 1] will do, but we shall assume without loss of too much generality that F is a differentiable cumulative distribution function (cdf) with probability density function (pdf) In particular, when F is the logistic cdf (the case of canonical link),
(1.2), or equivalently (1.3), is called logistic regression, and when F is the cdf of the standard normal distribution, the model is called probit regression. The most popular link functions for binary regression are listed in Table 9.1. The regression model (1.2) has received much attention in the literature, mostly under independence. See, among many others, Cox (1970), Diggle, Liang, and Zeger (1994), Fahrmeir and Kaufmann (1987), Fahrmeir and Tutz (1994), Fokianos and Kedem (1998), Kaufmann (1987), Keenan (1982), Slud and Kedem (1994), and Zeger and Qaqish (1988).
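As an illustration, the two most common inverse links can be coded directly; a minimal sketch in Python (function names are ours, not the chapter's notation):

```python
import math

def logistic_cdf(x):
    # logistic cdf: the inverse link of the canonical (logit) link
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    # canonical link for binary regression: inverse of the logistic cdf
    return math.log(p / (1.0 - p))

def probit_inverse(x):
    # standard normal cdf: the inverse link of probit regression
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Applying `logit` after `logistic_cdf` recovers the linear predictor, which is what makes the logit the canonical link.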
2. PARTIAL LIKELIHOOD INFERENCE
The useful idea of forming certain likelihood functions by taking products of conditional densities where the formed products do not necessarily give complete joint or full likelihood information is due to Cox (1975), and later studied rigorously in Jacod (1987), Slud (1982), Slud (1992), and Wong (1986). In this section we define precisely what we mean by partial likelihood with respect to an increasing sequence of sigma–fields, and then apply it in regression models for binary time series. This is done quite generally for a large class of link functions where is a differentiable distribution function.
2.1. DEFINITION OF PARTIAL LIKELIHOOD
Partial likelihood with respect to a nested sequence of conditioning histories is defined as follows.
Definition. Let be an increasing sequence of sigma-fields and let be a sequence of random variables on some common probability space such that is measurable. Denote the density of given by , where is a fixed parameter. The partial likelihood (PL) function relative to and the data is given by the product
Thus, if we define, then for a binary time series with covariate information the partial likelihood of takes on the simple product form,
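In code, this product of conditional Bernoulli densities can be sketched as follows (a hedged illustration; `pi` stands for the sequence of conditional success probabilities evaluated at the parameter):

```python
def partial_likelihood(y, pi):
    """Product of conditional Bernoulli densities for a binary series:
    each term is pi_t^y_t * (1 - pi_t)^(1 - y_t), conditioning on the past."""
    pl = 1.0
    for yt, pt in zip(y, pi):
        pl *= pt ** yt * (1.0 - pt) ** (1 - yt)
    return pl
```

In practice one maximizes the logarithm of this product, which turns it into a sum.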
In the next section we study the maximizer of , referred to as the maximum partial likelihood estimator (MPLE) of
2.2. AN ASSUMPTION REGARDING THE COVARIATES
The following assumption guarantees the asymptotic stability of the covariate process.
Assumption A
A1. The true parameter belongs to an open set
A2. The covariate vector almost surely lies in a nonrandom compact subset of such that
A3. There is a probability measure v on such that is positive definite, and such that for Borel sets
in probability as at the true value of
A4. The inverse link function F is twice continuously differentiable and
2.3. PARTIAL LIKELIHOOD ESTIMATION
It is simpler to derive the maximum partial likelihood estimator by maximizing with respect to the log-partial likelihood,
Assuming differentiability, when exists it can be obtained from an estimating equation referred to as the partial likelihood score equation, where,
Just as is the case with the regular (full) likelihood, assuming differentiability, the score vector
where,
plays an important role in large sample theory based on partial likelihood. The score vector process, is defined by the partial sums,
Observe that the score process, being the sum of martingale differences, is a martingale with respect to the filtration That is, Clearly, Define,
and
where Then the sample information matrix satisfies,
By Assumption A,
has a limit in probability,
while, by Fokianos and Kedem (1998),
It follows that,
and we refer to as the information matrix per single observation for estimating By Assumption A it is positive definite and hence also nonsingular for every By expanding in a one-term Taylor series about and by Assumption A, we obtain the following useful approximation, up to terms asymptotically negligible in probability,
Thus, an application of the central limit theorem for martingales gives (Fokianos and Kedem, 1998; Slud and Kedem, 1994),
We now have
Theorem 2.1 (Fokianos and Kedem, 1998; Slud and Kedem, 1994). The MPLE is almost surely unique for all sufficiently large N, and as
(i)
(ii)
(iii)
2.4. PREDICTION
An immediate application of Theorem 2.1 is in constructing prediction intervals for from By the delta method (see Rao, 1973, p. 388), (ii) in Theorem 2.1 implies that where Therefore, an asymptotic prediction interval is given by,
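A hedged sketch of this delta-method interval for the logistic special case with a single covariate (the function name, the scalar parameterization, and the 1.96 quantile are illustrative assumptions, not the chapter's notation):

```python
import math

def logistic_prediction_interval(beta_hat, var_beta, z, zq=1.96):
    """Asymptotic interval for pi = F(beta * z) with F the logistic cdf,
    via the delta method: se(pi_hat) ~ f(eta) * |z| * se(beta_hat),
    where f = pi * (1 - pi) is the logistic density at eta = beta * z."""
    eta = beta_hat * z
    pi = 1.0 / (1.0 + math.exp(-eta))
    se = math.sqrt(var_beta) * pi * (1.0 - pi) * abs(z)
    # clip to [0, 1] since pi is a probability
    return max(0.0, pi - zq * se), min(1.0, pi + zq * se)
```

With `var_beta = 0` the interval collapses to the point estimate, as it should.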
3. GOODNESS OF FIT
The (scaled) deviance
is used routinely in testing for goodness of fit of generalized linear models. It turns out, however, that the deviance is quite problematic when confronting binary data (Firth, 1991; McCullagh and Nelder, 1989). An alternative is a goodness of fit statistic constructed by classifying the binary data according to the covariates (Fokianos and Kedem, 1998; Schoenfeld, 1980; Slud and Kedem, 1994). Let constitute a partition of For define,
and
Put The next result follows readily from Theorem 2.1, and several applications of the multivariate Martingale Central Limit Theorem as given in Andersen and Gill (1982), Appendix II. Assume that the true parameter is
Theorem 3.1 (Slud and Kedem, 1994). Consider the general regression model (1.2) where F is a cdf with density f. Let be a partition of Then we have as
(i) where
The matrix is a square matrix of dimension Here A is a diagonal matrix with the diagonal element given by , is the limiting inverse of the information matrix, and the column of B is given by
(ii) As
(iii) As the asymptotic distribution of the statistic
is
In verifying (3.17) in Theorem 3.1 it is helpful to note that is a zero-mean martingale with asymptotic variance and that for , and are orthogonal. Replacing by its estimator in (3.17), the goodness of fit statistic can be used to check model adequacy. In this case is stochastically smaller than when and are obtained from the same (training) data, and stochastically larger when is obtained from one data set (training data) but comes from a different independent data set (testing data). Indeed, in the first case,
On the other hand, when and are obtained from independent data sets,
Appropriate modification of the degrees of freedom must be made to accommodate the two cases.
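A minimal sketch of such a partition-based statistic (our own hypothetical implementation, not the chapter's exact formula): classify the observations into covariate cells, then compare observed and expected success counts cell by cell, scaling by the model's conditional variances:

```python
def goodness_of_fit(y, pi, cells):
    """Partition-based chi-square-type statistic for binary data:
    y[t] in {0, 1}, pi[t] the fitted success probability, cells[t] the
    covariate cell of observation t. Each cell contributes
    (observed - expected)^2 / (sum of conditional variances)."""
    stat = 0.0
    for cell in set(cells):
        obs = sum(yt for yt, c in zip(y, cells) if c == cell)
        exp = sum(pt for pt, c in zip(pi, cells) if c == cell)
        var = sum(pt * (1.0 - pt) for pt, c in zip(pi, cells) if c == cell)
        if var > 0:
            stat += (obs - exp) ** 2 / var
    return stat
```

Under a well-fitting model the statistic should be roughly chi-square with degrees of freedom tied to the number of cells, adjusted as discussed above for training versus testing data.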
4. LOGISTIC REGRESSION
The logistic regression model where (see (1.4)) is the most widely used regression model for binary data. The model can be written as,
The equivalent inverse transformation of (4.18), and another way to write the model, is the canonical link for binary data referred to as logit,
For this important special case the previous results simplify greatly as all and Thus, the score vector has the simplified form,
and the sample information matrix reduces to,
It is easily seen that
is the sum of conditional covariance matrices,
and that the sample information matrix per single observation converges to a special case of the limit (2.13),
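The simplified score and information suggest a direct Newton-Raphson scheme for the MPLE in the logistic case. The following sketch assumes a single scalar covariate (for instance, a lagged value of the series); the data and names are illustrative:

```python
import math

def mple_logistic(y, z, iters=25):
    """Newton-Raphson for the logistic partial-likelihood score
    sum_t (y_t - pi_t) z_t = 0 with pi_t = logistic(beta * z_t);
    the information is sum_t pi_t (1 - pi_t) z_t^2."""
    beta = 0.0
    for _ in range(iters):
        score = 0.0
        info = 0.0
        for yt, zt in zip(y, z):
            pi = 1.0 / (1.0 + math.exp(-beta * zt))
            score += (yt - pi) * zt
            info += pi * (1.0 - pi) * zt * zt
        if info == 0.0:
            break
        beta += score / info  # Newton step: d(score)/d(beta) = -info
    return beta
```

Because the logistic log partial likelihood is concave in the parameter, the iteration converges rapidly from the zero start in well-posed cases.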
There is also a simplification in Theorem 3.1. Corollary 4.1 Under the assumption of logistic regression, Theorem 3.1 holds with
and
4.1. A DEMONSTRATION
Consider a binary time series obeying a logistic autoregression model containing a deterministic periodic component, so that the corresponding success probability is determined by the model. A time series from the model and its success probability are plotted in Figure 9.1.
To illustrate the asymptotic normality result (ii) in Theorem 2.1, the model was simulated 1000 times for N = 200, 300, 1000. In each run, the partial likelihood estimates of the were obtained by maximizing (2.6). This gives 1000 estimates from which sample means and variances were computed. The theoretical variances of the estimators were approximated by inverting in (4.21). The results are summarized in Table 9.2. There is close agreement between the theory and the experimental results. A graphical illustration of the prediction limits (2.16) is given in Figure 9.2, where we can see that is nestled quite comfortably within the prediction limits. Again we inverted (4.21) for the approximation
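Such a simulation can be sketched as follows; the coefficients and the period are placeholders, since the chapter's exact parameter values are not reproduced here:

```python
import math
import random

def simulate_logistic_autoregression(n, b0, b1, b2, period=12, seed=0):
    """Generate a binary series from a logistic autoregression with a
    deterministic periodic component (hypothetical parameterization):
    logit pi_t = b0 + b1 * y_{t-1} + b2 * cos(2 pi t / period)."""
    rng = random.Random(seed)
    y = [0]
    for t in range(1, n):
        eta = b0 + b1 * y[t - 1] + b2 * math.cos(2.0 * math.pi * t / period)
        pi = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < pi else 0)
    return y
```

Repeating the generation and re-estimating the parameters in each run gives the sampling distribution summarized in Table 9.2.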
To demonstrate the tendency of the goodness of fit statistic (3.17) towards a chi-square distribution with the indicated number of degrees of freedom, consider again the logistic regression model (4.23). It is sufficient to partition the set of values of into disjoint sets. Let
Then, k=4, and the estimator,
is the sum of those ’s for which is in ; the others are obtained similarly. In forming (3.17), we replace by its , where is given in (4.18). The Q-Q plots in Figure 9.3 were obtained from 1000 independent time series (4.23) of length N = 200 and N = 400, and 1000 independent random variables. Except for a few outliers, the approximation is quite good.
5. CATEGORICAL DATA
The previous analysis can readily be extended to categorical time series where admits values representing categories. We only mention two types of models to show the proximity to regression models for binary time series. For a thorough treatment see Fokianos (1996), and Fokianos and Kedem (1998). Generally speaking, we have to distinguish between two types of categorical variables: nominal, where the categories are not ordered (e.g. daily choice of dinner categorized as vegetarian, dairy, and everything else), and ordinal, where the categories are ordered (e.g. hourly blood pressure categorized as low, normal, and high). Interval data can be treated as ordinal. A possible model for nominal categorical time series is the multinomial logits model (Agresti, 1990),
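A hedged sketch of a multinomial logits map in this spirit (one common parameterization with a reference category, assumed here for illustration):

```python
import math

def multinomial_logits(zs, betas):
    """Category probabilities under a multinomial logits model:
    category j gets exp(beta_j . z) / (1 + sum_k exp(beta_k . z)),
    with a reference category carrying the remaining mass."""
    etas = [sum(b * z for b, z in zip(beta, zs)) for beta in betas]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]
    probs.append(1.0 / denom)  # reference category
    return probs
```

The probabilities always sum to one, and with all linear predictors equal to zero the categories are equally likely.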
where is a p-dimensional regression parameter and is a vector of stochastic time dependent covariates of the same dimension. A well known model for the analysis of ordinal data is the cumulative odds model (McCullagh, 1980). We can illustrate the model using a latent variable. Thus, let , where is a sequence of i.i.d. random variables with cumulative distribution F, is a vector of parameters, and is a covariate vector of the same dimension. Suppose that we observe for , where are threshold parameters. It follows that
The model can be formulated somewhat more compactly by the equation:
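The latent-variable construction can be sketched as a small simulator; logistic noise gives the proportional odds case, and all names and thresholds below are illustrative:

```python
import math
import random

def cumulative_odds_category(x, thresholds, beta, rng):
    """Latent-variable view of the cumulative odds model: draw
    Y* = beta * x + e with logistic noise e, and report the interval
    of the threshold partition containing Y* as the observed category."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    e = math.log(u / (1.0 - u))  # logistic random variable
    latent = beta * x + e
    for k, theta in enumerate(thresholds):
        if latent <= theta:
            return k
    return len(thresholds)
```

With k thresholds the simulator returns one of k + 1 ordered categories, matching the ordinal structure the model is designed for.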
Since the set of cumulative probabilities corresponds one to one to the set of response probabilities, estimating the former enables estimation of the latter. Various choices for F can arise. For example, the logistic distribution gives the so-called proportional odds model. In principle any link used for binary time series can be used here as well.
A Final Note. As said before, this paper is an extension of Slud and Kedem (1994). In that paper there is a data analysis example that uses National Weather Service rainfall/runoff measurements. The data were graciously provided to us in 1987 by Sid Yakowitz, blessed be his memory. When Slud and Kedem (1994) finally appeared in 1994, Sid, upon receiving a reprint, made some encouraging remarks. The present extension is written in his memory.
REFERENCES Agresti, A. (1990). Categorical Data Analysis. Wiley, New York. Andersen, P. K. and R. D. Gill. (1982). Cox’s regression model for counting processes: A large sample study. Annals of Statistics, 10, 1100-1120. Cox, D. R. (1970). The Analysis of Binary Data. Methuen, London. Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 69-76. Diggle, P. J., K-Y. Liang, and S. L. Zeger. (1994). Analysis of Longitudinal Data. Oxford University Press, Oxford. Fahrmeir, L. and H. Kaufmann. (1987). Regression models for nonstationary categorical time series. Journal of Time Series Analysis, 8, 147-160. Fahrmeir, L. and G. Tutz. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer, New York. Fokianos, K. (1996). Categorical Time Series: Prediction and Control. Ph.D. Thesis, Department of Mathematics, University of Maryland, College Park. Fokianos, K. and B. Kedem. (1998). Prediction and classification of nonstationary categorical time series. Journal of Multivariate Analysis, 67, 277-296.
Firth, D. (1991). Generalized linear models. Chapter 3 of D. Hinkley et al., eds. (1991). Statistical Theory and Modelling. In Honour of Sir David Cox, FRS, Chapman and Hall, London. Jacod, J. (1987). Partial likelihood processes and asymptotic normality. Stochastic Processes and their Applications, 26, 47-71. Kaufmann, H. (1987). Regression models for nonstationary categorical time series: Asymptotic estimation theory. Annals of Statistics, 15, 79-98. Kedem, B. (1980). Binary Time Series. Dekker, New York. Keenan, D. M. (1982). A time series analysis of binary data. Journal of the American Statistical Association, 77, 816-821. McCullagh, P. and J. A. Nelder. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London. McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, B, 42, 109-142. Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd ed. John Wiley, New York. Schoenfeld, D. (1980). Chi-squared goodness-of-fit for the proportional hazard regression model. Biometrika, 67, 145-153. Slud, E. (1982). Consistency and efficiency of inferences with the partial likelihood. Biometrika, 69, 547-552. Slud, E. (1992). Partial likelihood for continuous-time stochastic processes. Scandinavian Journal of Statistics, 19, 97-109. Slud, E. and B. Kedem. (1994). Partial likelihood analysis of logistic regression and autoregression. Statistica Sinica, 4, 89-106. Wong, W. H. (1986). Theory of partial likelihood. Annals of Statistics, 14, 88-123. Zeger, S. L. and B. Qaqish. (1988). Markov regression models for time series: A quasi likelihood approach. Biometrics, 44, 1019-1031.
Chapter 10 ALMOST SURE CONVERGENCE PROPERTIES OF NADARAYA-WATSON REGRESSION ESTIMATES Harro Walk Mathematisches Institut A Universität Stuttgart Pfaffenwaldring 57, D-70569 Stuttgart, Germany
Abstract
For Nadaraya-Watson regression estimates with window kernel, self-contained proofs of strong universal consistency for special bandwidths and of the corresponding Cesàro summability for general bandwidths are given.
1. INTRODUCTION
In this paper a self-contained treatment of some convergence problems in nonparametric regression estimation is given. For an observable random vector X and an unobservable square integrable real random variable Y, the best estimate of a realization of Y on the basis of an observed realization of X, in a mean square sense, is given by the regression function defined by , minimizing with respect to measurable , because of
where is the distribution of X. Knowledge of also allows an interpretation of the relation between X and Y . The regression function which is usually unknown can be estimated on the basis of an observable training sequence of independent (identically distributed) copies of the random vector ( X , Y). Let S be a fixed bounded sphere in around 0 and be its indicator function (window kernel function). Choose a sequence in (0, of so-called bandwidths. Now by use of the observations of the regression
function is estimated by
(with This estimate, also for more general kernel functions K, has been proposed by Nadaraya (1964) and Watson (1964). Set
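A hedged, scalar-covariate sketch of the window-kernel (naive) estimate just defined; the 0/0 case is read as 0, and all names are ours:

```python
def nw_window_estimate(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0 with the window (indicator) kernel:
    the average of the y_i whose x_i lie within bandwidth h of x0."""
    num = 0.0
    den = 0.0
    for xi, yi in zip(xs, ys):
        if abs(xi - x0) <= h:
            num += yi
            den += 1.0
    return num / den if den > 0 else 0.0  # 0/0 convention
```

With a general kernel K the indicator weight would be replaced by K((xi - x0) / h).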
Weak universal consistency of
i.e.
for each distribution of ( X , Y) with again for more general kernels, and choice has been established by Devroye and Wagner (1980) and Spiegelman and Sacks (1980). The concept of (weak) universal consistency in regression estimation was introduced by Stone (1977) who showed this property for nearest neighbor regression estimates. It is an open question whether is strongly universally consistent, i.e.
for the usual choice For nearest neighbor regression estimates strong universal consistency was shown by Devroye, Györfi, Krzyzak and Lugosi (1994). Strong consistency of the Nadaraya-Watson kernel estimates in the case of bounded Y was shown by Devroye and Krzyzak (1989). On the other hand, from Yakowitz and Heyde (1997), Györfi, Morvai and Yakowitz (1998) and Nobel (1999) it is known that the concept of strong universal consistency cannot be transferred to the general case of stationarity and ergodicity, not even for {0, 1}-valued Y. The present paper deals with almost sure convergence in and the corresponding Cesàro summability (a.s. convergence of arithmetic means) of the of window kernel regression estimates. The convergence result (Theorem 2; strong universal consistency) is established for a piecewise constant sequence of bandwidths, especially
([ ] denoting the integer part) with arbitrary fixed The bandwidth sequence above is well approximated by (3) via the choice
of large In the proof, the essential step (Theorem 1) consists in the verification of a condition in Györfi’s (1991) criterion for strong universal consistency of a class of regression estimates. The Cesàro summability result (Theorem 3) is proved for a general bandwidth sequence satisfying especially as above. The results can be transferred to partitioning estimates and, refining arguments for binomial random variables and using a more general covering lemma due to Devroye and Krzyzak (1989), also to Nadaraya-Watson estimates with rather general kernel; as to the convergence results (with another argument) see Walk (1997). The only tools used in the present paper without proof are Doob’s submartingale convergence theorem (Loève (1977), section 32.3) and the fact that the set of continuous functions with compact support is dense in (Dunford and Schwartz (1958), p. 298). It is well known in summability theory (see Zeller and Beekmann (1970), section 44, or Hardy (1949), Theorem 114) that Cesàro summability (even Abel summability) of a sequence together with a gap condition on the increments implies its convergence. This gap condition is fulfilled for the increments of in Theorem 2, but not for the increments of the sequence there, thus Theorem 2 is not implied by Theorem 3, but needs a separate proof. Section 2 contains the results (Theorems 1,2,3). Section 3 contains lemmas and proofs.
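One way to realize such a piecewise constant bandwidth sequence, held fixed on blocks of indices of geometrically growing length, is sketched below (the constants `c`, `delta`, and `D` are placeholders, not the paper's choices):

```python
def piecewise_bandwidth(n, c=1.0, delta=0.25, D=2.0):
    """Bandwidth held constant on blocks (D^(k-1), D^k]:
    h_n = c * D^(-k * delta) for the block containing n."""
    k = 0
    upper = 1.0
    while n > upper:
        k += 1
        upper = D ** k
    return c * D ** (-k * delta)
```

The bandwidth changes only at the block boundaries, yet it decreases at the same polynomial order as the usual choice, which is what the Remark after Theorem 2 exploits.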
2. RESULTS
The results concern Nadaraya-Watson regression estimates with window kernel according to (1) and (2). Theorem 1 is an essential tool in the proof of Theorem 2 and is stated in this section because of its independent interest. In contrast to the other results, Theorem 1 deals with integrable (instead of square integrable) nonnegative real (instead of real) random variables where, as in the other parts of the paper, are independent copies of ( X , Y). Theorem 1. If at most for the indices where for fixed D > 1, then with some
for each distribution of ( X , Y) with integrable
Avoiding a formulation which uses ( X , Y), the assertion of Theorem 1 can also be stated in the form
with independent identically distributed random vectors in the definition of where is independent of the underlying distribution of which has to satisfy This means that in spite of possible unboundedness of the s the sequence is a.s. bounded with an asymptotic bound depending on By this result, as in the proof of Theorem 2, using
for bounded real random variables one can show the latter assertion even for integrable real random variables. But we shall content ourselves to show the corresponding convergence result for square integrable real , formulated as Theorem 2. This theorem states strong universal consistency of window kernel regression estimates for special sequences of bandwidths.
Theorem 2. Let satisfy at most for the indices for fixed D > 1 and where , e.g. with . Then
Remark. In Theorem 2 the statistician does not change the bandwidth at each change of as is done in the usual choice But if the special choice (3) is written in the form one has such that and are of the same order and even the factor in can be arbitrarily well approximated by use of a sufficiently large in the definition of This is important in view of the rate of convergence under regularity assumptions (see e.g. Härdle (1990), ch. 4, with further references). The next theorem states that for very general choice of the bandwidth sequence including choice the sequence of is Cesàro summable to 0 a.s. Theorem 3. Let satisfy
Then
3. LEMMAS AND PROOFS
The first lemma, which concerns binomially distributed random variables, is elementary and well-known.
Lemma 1. If B is a binomial random variable, then for
PROOF.
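One classical bound in this spirit, for B ~ Binomial(n, p), is E[1/(1+B)] ≤ 1/((n+1)p); we assume this form for illustration and check it against an exact computation:

```python
from math import comb

def inv_moment_binomial(n, p):
    """E[1/(1+B)] for B ~ Binomial(n, p), computed exactly by summing
    comb(n, k) p^k (1-p)^(n-k) / (k+1) over k = 0..n."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k) / (k + 1)
               for k in range(n + 1))
```

The sum has the closed form (1 - (1-p)^(n+1)) / ((n+1)p), which makes the bound immediate.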
Now a variant of inequalities of Efron and Stein (1981), Steele (1986) and Devroye (1991) on the variance of a function of independent random variables will be established. Assumptions concerning symmetry of the function or identical distribution of the random variables or bounded function value differences are avoided. Lemma 2. Let be independent m-dimensional random vectors where is a copy of For measurable assume square integrable. Then
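A Monte Carlo sanity check of an Efron-Stein-type variance bound (we assume the common form Var g(X) <= sum_i E[(g(X) - g(X with X_i resampled))^2]; the lemma's exact constant may differ):

```python
import random

def efron_stein_check(n=5, trials=20000, seed=0):
    """Estimate Var g(X) and the Efron-Stein-type upper bound for
    g = max of n independent uniforms, resampling one coordinate
    at a time from an independent copy."""
    rng = random.Random(seed)
    gs = []
    bound_sum = 0.0
    for _ in range(trials):
        x = [rng.random() for _ in range(n)]
        g = max(x)
        gs.append(g)
        for i in range(n):
            xi = x[:]
            xi[i] = rng.random()  # independent copy of coordinate i
            bound_sum += (g - max(xi)) ** 2
    mean = sum(gs) / trials
    var = sum((v - mean) ** 2 for v in gs) / trials
    return var, bound_sum / trials
```

For the maximum of uniforms the bound holds with ample slack, consistent with the lemma requiring no symmetry or boundedness assumptions.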
PROOF. We use arguments in the proof of Theorem 9.2 (McDiarmid (1989)) and Theorem 9.3 (Devroye (1991)) in Devroye, Györfi and Lugosi (1996), and Jensen’s inequality. Set
form a martingale difference sequence with respect to We have
Let denote the distribution of Then
for which yields the assertion. The following lemma is well-known especially in stochastic approximation. Lemma 3. (Aizerman, Braverman and Rozonoer (1964), Gladyshev (1965), MacQueen (1967), Van Ryzin (1969), Robbins and Siegmund (1971)) Let and be sequences of integrable nonnegative real random variables on a probability space with and let be a nondecreasing sequence of of such that and are measurable with respect to (i.e. events for each and
Then converges a.s. PROOF. Setting
one notices that sup
is a submartingale with respect to satisfying Then Doob’s submartingale convergence theorem (Loève
(1977), section 32.3) yields a.s. convergence of from which the assertion follows by a.s. convergence of The next lemma is a specialized version of a covering lemma of Devroye and Krzyzak (1989), compare also Devroye and Wagner (1980) and Spiegelman and Sacks (1980). Lemma 4. There is a finite constant only depending on such that for each and probability measure
PROOF. Let S (with radius R) be covered by finitely many open spheres of radius R/2 (M depending only on Thus For each and relation implies thus Then for each and one has
In the following the notation
will be used. First a criterion for strong universal consistency will be given. Lemma 5. (Györfi (1991)) is strongly universally consistent if the following conditions are fulfilled:
a) a.s. for each distribution of (X, Y) with bounded Y,
b) there is a constant , depending only on , such that for each distribution of (X, Y) with
PROOF. Let (X, Y) be arbitrary with ; without loss of generality assume Fix For all define and let and be the functions and when Y and are replaced by and , respectively. Then
By a) and uniform boundedness of
and
By Cauchy-Schwarz inequality
if L is chosen sufficiently large, further for this L and by b),
The following lemma is well-known from the classical proof and from Etemadi’s (1981) proof of Kolmogorov’s strong law of large numbers (see e.g. Bauer (1991), §12). Lemma 6. For identically distributed random variables with let be the truncation of at i.e.
Then
Moreover, for
with D > 1 we have
PROOF. Noticing that
we obtain the first assertion from
by the Kronecker lemma, and the second assertion via
In view of the third assertion, for let denote the minimal index with Then
and we obtain
PROOF OF THEOREM 1. Without loss of generality may be assumed (as long as for some , insert the integer part of into the sequence ). As in the classical proof and Etemadi’s (1981) proof of Kolmogorov’s strong law of large numbers (see e.g. Bauer (1991), §12) a truncation argument is used. Set
for
(as in the proof of Lemma 5), and
(as in Lemma 6). Further set
for In the first step, it will be shown
for
and for some suitable constant Let be independent (identically distributed) copies of (X, Y) and let be obtained from via replacing By Lemma 2
For
we have
(by exchangeability)
(by exchangeability and )
=: 8B + 8C + 8D. In the following, Lemma 4 will be used several times, as will the independence assumption for taking expectations. First we obtain
(by Cauchy-Schwarz inequality and Lemma 1 for
Noticing
we similarly obtain
(by Cauchy-Schwarz inequality and Lemma 1 for
and
Further, by
and exchangeability, we obtain
These bounds yield (4). In the second step
will be shown. We use a monotonicity argument (compare Etemadi’s (1981) proof of the strong law of large numbers). For
we have
thus
By the independence assumption, Lemmas 1 and 4, we obtain
and thus
Using (4) and Lemma 6, by
we obtain
and thus
Now (6), (7), (8) yield (5). In the third step the assertion of the theorem will be shown. Because of
one has a.s. from some random index on. Thus, because of (5), it suffices to show that for each fixed
But this follows from
(see above upper bound for B). PROOF OF THEOREM 2. According to Lemma 5 and Theorem 1 it suffices to show
for bounded Y. This was proved for general bandwidth sequences by Devroye and Krzyzak (1989) via exponential inequalities. For the special bandwidth sequence here, we shall use Lemma 3. By uniform boundedness of and , it is evidently enough to show
for each sphere S* around 0. First we shall prove
With we obtain
Taking the integral on S* with respect to , and , by Lemma 3 we obtain a.s. convergence of the sequence
because of piecewise constancy of , the corresponding relation with and replaced by and , with a suitable constant
To show (11) and (12) we notice
with suitable , with Lemma 4; further we notice for some with some D* > 1. A.s. convergence of
together with for some and , and then use , yields
It holds
To show this notice that the set of continuous functions with compact support is dense in (see e.g. Dunford and Schwartz (1958), p. 298). Now for an arbitrary and choose such an with Obviously (14) holds for Further, by Lemma 4
and (14) is obtained in the general case. Now from (13) and (14) relation (10) follows. By (10) with one has for
(15) and (10) yield (9). It should be mentioned that (9) can also be proved using once more Etemadi’s (1981) argument for the strong law of large numbers, thus avoiding the above martingale argument. Noticing that in context of the piecewise constant
sequence one can assume arbitrarily close to 1, one obtains
with
and thus, considering also the case relation (9). One is led to (16) by a majorization and a minorization and then using
(generalized Lebesgue’s density theorem; see e.g. Wheeden and Zygmund (1977), Theorem 10.49) in view of an expectation term and
(by (11)) in view of a variance term. PROOF OF THEOREM 3. The argument is similar to that for Theorem 2 the notations of which we use. Analogously to Lemma 5 one notices that it suffices to show that for some
for each integrable
and that
in the case of bounded Y. In view of (17) as in the proof of Theorem 1 we set
for
We notice
The latter assertion is obtained from
(see (4)) and thus
(because of Lemma 6), by use of the Kronecker lemma. From (19) we obtain
as in the third step of the proof of Theorem 1. Further
because of
obtained as (7). (20) and (21) yield (17). Now assume Y is bounded. In view of (18) it suffices to show
for each sphere S* around 0. The proof of this is reduced to the proof of
We notice
by
with some and by the Kronecker lemma. (23) together with (14) yields (22). The author thanks the referee and M. Kohler, whose suggestions improved the readability of the paper.
REFERENCES Aizerman, M.A., E. M. Braverman, and L. I. Rozonoer. (1964). The probability problem of pattern recognition learning and the method of potential functions. Automation and Remote Control 25, 1175-1190. Bauer, H. (1991). Wahrscheinlichkeitstheorie, 4th ed. W. de Gruyter, Berlin. Devroye, L. (1991). Exponential inequalities in nonparametric estimation. In: G. Roussas, Ed., Nonparametric Functional Estimation and Related Topics. NATO ASI Ser. C, Kluwer Acad. Publ., Dordrecht, 31-44. Devroye, L., L. Györfi, A. Krzyzak, and G. Lugosi. (1994). On the strong universal consistency of nearest neighbor regression function estimates. Ann. Statist. 22, 1371-1385. Devroye, L., L. Györfi, and G. Lugosi. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York, Berlin, Heidelberg. Devroye, L. and A. Krzyzak. (1989). An equivalence theorem for L1 convergence of the kernel regression estimate. J. Statist. Plann. Inference 23, 71-82. Devroye, L. and T. J. Wagner. (1980). Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann. Statist. 8, 231-239. Devroye, L. and T. J. Wagner. (1980). On the convergence of kernel estimators of regression functions with applications in discrimination. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 51, 15-25. Dunford, N. and J. Schwartz. (1958). Linear Operators, Part I. Interscience Publ., New York. Efron, B. and C. Stein. (1981). The jackknife estimate of variance. Ann. Statist. 9, 586-596.
Etemadi, N. (1981). An elementary proof of the strong law of large numbers. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 55, 119-122. Gladyshev, E.G. (1965). On stochastic approximation. Theor. Probab. Appl. 10, 275-278. Györfi, L. (1991). Universal consistencies of a regression estimate for unbounded regression functions. In: G. Roussas, Ed., Nonparametric Functional Estimation and Related Topics. NATO ASI Ser. C, Kluwer Acad. Publ., Dordrecht, 329-338. Györfi, L., G. Morvai, and S. Yakowitz. (1998). Limits to consistent on-line forecasting for ergodic time series. IEEE Trans. Information Theory, 44, 886-892. Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge. Hardy, G. H. (1949). Divergent Series. Clarendon Press, Oxford. Loève, M. (1977). Probability Theory II, 4th ed. Springer, New York, Heidelberg, Berlin. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In: J. Neyman, Ed., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley and Los Angeles, 281-297. McDiarmid, C. (1989). On the method of bounded differences. In: Surveys in Combinatorics 1989. Cambridge University Press, Cambridge, 148-188. Nadaraya, E.A. (1964). On estimating regression. Theory Probab. Appl. 9, 141-142. Nobel, A.B. (1999). Limits to classification and regression estimation from ergodic processes. Ann. Statist. 27, 262-273. Robbins, H. and D. Siegmund. (1971). A convergence theorem for nonnegative almost supermartingales and some applications. In: J.S. Rustagi, Ed., Optimizing Methods in Statistics. Academic Press, New York, 233-257. Spiegelman, C. and J. Sacks. (1980). Consistent window estimation in nonparametric regression. Ann. Statist. 8, 240-246. Steele, J.M. (1986). An Efron-Stein inequality for nonsymmetric statistics. Ann. Statist. 14, 753-758. Stone, C.J. (1977). Consistent nonparametric regression. Ann. Statist. 5, 595-645. Van Ryzin, J. (1969). On strong consistency of density estimates. Ann. Math. Statist. 40, 1765-1772. Walk, H. (1997). Strong universal consistency of kernel and partitioning regression estimates. Preprint 97-1, Math. Inst. A, Univ. Stuttgart. Watson, G.S. (1964). Smooth regression analysis. Sankhyā Ser. A 26, 359-372. Wheeden, R.L. and A. Zygmund. (1977). Measure and Integral. Marcel Dekker, New York.
REFERENCES
223
Yakowitz, S. and C. C. Heyde. (1997). Long-range dependent effects with implications for forecasting and queueing inference. Preprint. Zeller, K. and W. Beekmann. (1970). Theorie der Limitierungsverfahren, 2. Aufl. Springer, Berlin, Heidelberg, New York.
Chapter 11

STRATEGIES FOR SEQUENTIAL PREDICTION OF STATIONARY TIME SERIES

László Györfi
Department of Computer Science and Information Theory, Technical University of Budapest, Stoczek u. 2, 1521 Budapest, Hungary
[email protected]
Gábor Lugosi
Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, 08005 Barcelona, Spain
[email protected] *
Abstract
We present simple procedures for the prediction of a real-valued sequence. The algorithms are based on a combination of several simple predictors. We show that if the sequence is a realization of a bounded stationary and ergodic random process, then the average of squared errors converges, almost surely, to that of the optimum, given by the Bayes predictor. We offer an analogous result for the prediction of stationary gaussian processes.

1. INTRODUCTION
One of the many themes of Sid’s research was the search for prediction and estimation methods for time series that do not necessarily satisfy the classical assumptions for autoregressive, Markovian, and gaussian processes (see, e.g., Morvai et al., 1996; Morvai et al., 1997; Yakowitz, 1976; Yakowitz, 1979;
*The work of the second author was supported by DGES grant PB96-0300.
Yakowitz, 1985; Yakowitz, 1987; Yakowitz, 1989; Yakowitz et al., 1999). He firmly believed that most real-world applications require such robust methods. This note is a contribution to the line of research pursued and promoted by Sid, who directed us to this beautiful area of research. We study the problem of sequential prediction of a real-valued sequence. At each time instant $i = 1, 2, \ldots$, the predictor is asked to guess the value of the next outcome $y_i$ of a sequence of real numbers $y_1, y_2, \ldots$ with knowledge of the past $y_1^{i-1} = (y_1, \ldots, y_{i-1})$ (where $y_1^0$ denotes the empty string). Thus, the predictor’s estimate, at time $i$, is based on the value of $y_1^{i-1}$. Formally, the strategy of the predictor is a sequence $g = \{g_i\}_{i=1}^{\infty}$ of decision functions
$$g_i : \mathbb{R}^{i-1} \to \mathbb{R},$$
and the prediction formed at time $i$ is $g_i(y_1^{i-1})$. After $n$ time instants, the normalized cumulative prediction error on the string $y_1^n$ is
$$L_n(g) = \frac{1}{n} \sum_{i=1}^{n} \left( g_i(y_1^{i-1}) - y_i \right)^2 .$$
In this paper we assume that $y_1, y_2, \ldots$ are realizations of the random variables $Y_1, Y_2, \ldots$ drawn from the real-valued stationary and ergodic process $\{Y_n\}_{-\infty}^{\infty}$. The fundamental limit for the predictability of the sequence can be determined based on a result of Algoet (1994), who showed that for any prediction strategy $g$ and stationary ergodic process $\{Y_n\}$,
$$\liminf_{n \to \infty} L_n(g) \ge L^* \quad \text{almost surely,}$$
where
$$L^* = E\left[ \left( Y_0 - E\left[ Y_0 \mid Y_{-\infty}^{-1} \right] \right)^2 \right]$$
is the minimal mean squared error of any prediction for the value of $Y_0$ based on the infinite past $Y_{-\infty}^{-1} = (\ldots, Y_{-2}, Y_{-1})$. Note that it follows by stationarity and the martingale convergence theorem (see, e.g., Stout, 1974) that
$$L^* = \lim_{k \to \infty} E\left[ \left( Y_0 - E\left[ Y_0 \mid Y_{-k}^{-1} \right] \right)^2 \right].$$
This lower bound gives sense to the following definition:

Definition 1 A prediction strategy $g$ is called universal with respect to a class of stationary and ergodic processes if for each process in the class,
$$\lim_{n \to \infty} L_n(g) = L^* \quad \text{almost surely.}$$
Universal strategies asymptotically achieve the best possible loss for all ergodic processes in the class. Algoet (1992) and Morvai et al. (1996) proved
that there exists a prediction strategy universal with respect to the class of all bounded ergodic processes. However, the prediction strategies exhibited in these papers are either very complex or have an unreasonably slow rate of convergence even for well-behaved processes. The purpose of this paper is to introduce several simple prediction strategies which, apart from having the above-mentioned universal property of Algoet (1992) and Morvai et al. (1996), promise much improved performance for “nice” processes.

The algorithms build on a methodology worked out in recent years for prediction of individual sequences; see Vovk (1990), Feder et al. (1992), Littlestone and Warmuth (1994), Cesa-Bianchi et al. (1997), Kivinen and Warmuth (1999), Singer and Feder (1999), and Merhav and Feder (1998) for a survey. An approach similar to that of this paper was adopted by Györfi et al. (1999), where prediction of stationary binary sequences was addressed. There we introduced a simple randomized predictor which predicts asymptotically as well as the optimal predictor for all binary ergodic processes. The present setup and results differ in several important points from those of Györfi et al. (1999). On the one hand, special properties of the squared loss function considered here allow us to avoid randomization of the predictor and to define a significantly simpler prediction scheme. On the other hand, the possible unboundedness of a real-valued process requires special care, which we demonstrate through the example of gaussian processes. We refer to Nobel (2000), Singer and Feder (1999), Singer and Feder (2000), Yang (1999), and Yang (2000) for recent closely related work.

In Section 2 we introduce a universal strategy for bounded ergodic processes which is based on a combination of partitioning estimates. In Section 3, still for bounded processes, we consider, as an alternative, a prediction strategy based on combining generalized linear estimates.
In Section 4 we replace the boundedness assumption by assuming that the sequence to predict is an ergodic gaussian process, and show how the techniques of Section 3 may be modified to take care of the difficulties originating in the unboundedness of the process. The results of the paper are given in an autoregressive framework, that is, the value $Y_n$ is to be predicted based on past observations $Y_1^{n-1}$ of the same process. We may also consider the more general situation when $Y_n$ is predicted based on $Y_1^{n-1}$ and $X_1^n$, where $\{X_n\}$ is a process such that $\{(X_n, Y_n)\}$ is a jointly stationary and ergodic process. The prediction problem is similar to the one defined above with the exception that the sequence of the $X_i$’s is also available to the predictor. One may think of the $X_i$’s as side information. Formally, a prediction strategy is now a sequence of functions
$$g_i : \mathbb{R}^{i} \times \mathbb{R}^{i-1} \to \mathbb{R},$$
so that the prediction formed at time $i$ is $g_i(x_1^i, y_1^{i-1})$. The normalized cumulative prediction error for any fixed pair of sequences $(x_1^n, y_1^n)$ is now
$$L_n(g) = \frac{1}{n} \sum_{i=1}^{n} \left( g_i(x_1^i, y_1^{i-1}) - y_i \right)^2 .$$
All results of the paper may be extended in a straightforward manner to this more general prediction problem. As the extension does not require new ideas, we omit the details. Another direction for generalizing the results is to consider predicting vector-valued processes. Once again, the extension to such processes is obvious, and the details are omitted.
2. UNIVERSAL PREDICTION BY PARTITIONING ESTIMATES
In this section we introduce our first prediction strategy for bounded ergodic processes. We assume throughout the section that the process is bounded by a constant $B > 0$ with probability one, that is, $|Y_n| \le B$ almost surely. First we assume that the bound $B$ is known. The case of unknown $B$ will be treated later in a remark. The prediction strategy is defined, at each time instant, as a convex combination of elementary predictors, where the weighting coefficients depend on the past performance of each elementary predictor. We define an infinite array of elementary predictors $h^{(k,\ell)}$, $k, \ell = 1, 2, \ldots$, as follows. Let $\mathcal{P}_\ell = \{A_{\ell,j},\ j = 1, 2, \ldots, m_\ell\}$ be a sequence of finite partitions of the feature space $\mathbb{R}$, and let $F_\ell$ be the corresponding quantizer:
$$F_\ell(y) = j \quad \text{if } y \in A_{\ell,j}.$$
With some abuse of notation, for any $n$ and $y_1^n \in \mathbb{R}^n$, we write $F_\ell(y_1^n)$ for the sequence $F_\ell(y_1), \ldots, F_\ell(y_n)$. Fix positive integers $k$ and $\ell$, and for each $k$-long string $z$ of positive integers, define the partitioning regression function estimate
$$\hat{E}_n^{(k,\ell)}(z) = \frac{\sum_{\{k < t < n \,:\, F_\ell(y_{t-k}^{t-1}) = z\}} y_t}{\left| \left\{ k < t < n \,:\, F_\ell(y_{t-k}^{t-1}) = z \right\} \right|},$$
where $0/0$ is defined to be $0$. Now we define the elementary predictor $h^{(k,\ell)}$ by
$$h_n^{(k,\ell)}(y_1^{n-1}) = \hat{E}_n^{(k,\ell)}\left( F_\ell(y_{n-k}^{n-1}) \right), \quad n = 1, 2, \ldots$$
That is, $h_n^{(k,\ell)}$ quantizes the sequence $y_1^{n-1}$ according to the partition $\mathcal{P}_\ell$, and looks for all appearances of the last seen quantized string of length $k$
in the past. Then it predicts according to the average of the $y_t$’s following those appearances of the string. The proposed prediction algorithm proceeds as follows: let $\{q_{k,\ell}\}$ be a probability distribution on the set of all pairs $(k, \ell)$ of positive integers such that $q_{k,\ell} > 0$ for all $k$ and $\ell$. Put $c = 8B^2$, and define the weights
$$w_{t,k,\ell} = q_{k,\ell} \, e^{-(t-1) L_{t-1}\left(h^{(k,\ell)}\right) / c}$$
and their normalized values
$$p_{t,k,\ell} = \frac{w_{t,k,\ell}}{\sum_{i,j} w_{t,i,j}}.$$
The prediction strategy $g$ is defined by
$$g_t(y_1^{t-1}) = \sum_{k,\ell} p_{t,k,\ell} \, h_t^{(k,\ell)}(y_1^{t-1}), \quad t = 1, 2, \ldots \tag{2.1}$$
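A direct implementation of the scheme just described requires truncating the infinite array of experts. The sketch below is our own illustration, not part of the paper: it uses dyadic partitions of $[-B, B]$ as the partitions, a uniform distribution over a finite grid of pairs $(k, \ell)$ in place of $\{q_{k,\ell}\}$, and the constant $c = 8B^2$; the names `quantize`, `KMAX`, and `LMAX` are our own choices.

```python
import math

B = 1.0            # known bound on |y_t|
C = 8 * B * B      # exp-concavity constant for squared loss on [-B, B]
KMAX, LMAX = 3, 3  # truncation of the infinite expert array (our choice)

def quantize(y, l):
    """Quantizer F_l: cell index of a 2^l-cell uniform partition of [-B, B]."""
    cells = 2 ** l
    j = int((y + B) / (2 * B) * cells)
    return min(max(j, 0), cells - 1)

def expert_predict(past, k, l):
    """Elementary predictor h^{(k,l)}: average of the outcomes that followed
    earlier appearances of the last k-long quantized string (0/0 := 0)."""
    n = len(past)
    if n < k:
        return 0.0
    q = [quantize(y, l) for y in past]
    z = tuple(q[n - k:])                      # last seen quantized string
    matches = [past[t] for t in range(k, n)
               if tuple(q[t - k:t]) == z]
    return sum(matches) / len(matches) if matches else 0.0

def predict_sequence(ys):
    """Run the exponentially weighted mixture (2.1) over the sequence ys."""
    experts = [(k, l) for k in range(1, KMAX + 1) for l in range(1, LMAX + 1)]
    q = 1.0 / len(experts)                    # uniform prior over the grid
    cum_loss = {e: 0.0 for e in experts}      # (t-1) * L_{t-1}(h^{(k,l)})
    preds, past = [], []
    for y in ys:
        w = {e: q * math.exp(-cum_loss[e] / C) for e in experts}
        total = sum(w.values())
        h = {e: expert_predict(past, *e) for e in experts}
        g = sum(w[e] / total * h[e] for e in experts)   # convex combination
        preds.append(g)
        for e in experts:
            cum_loss[e] += (h[e] - y) ** 2
        past.append(y)
    return preds
```

On a constant sequence every elementary predictor eventually predicts the constant, so the combined prediction converges to it as well.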
Theorem 1 Assume that

(a) the sequence of partitions is nested, that is, any cell of $\mathcal{P}_{\ell+1}$ is a subset of a cell of $\mathcal{P}_\ell$, $\ell = 1, 2, \ldots$;

(b) if $\mathrm{diam}(A) = \sup_{x, y \in A} |x - y|$ denotes the diameter of a set, then for each sphere $S$ centered at the origin
$$\lim_{\ell \to \infty} \max_{j \,:\, A_{\ell,j} \cap S \ne \emptyset} \mathrm{diam}\left( A_{\ell,j} \right) = 0.$$

Then the prediction scheme $g$ defined above is universal with respect to the class of all ergodic processes such that $|Y_0| \le B$ almost surely.

One of the main ingredients of the proof is the following lemma, whose proof is a straightforward extension of standard arguments in the prediction theory of individual sequences; see, for example, Kivinen and Warmuth (1999) and Singer and Feder (2000).

Lemma 1 Let $h_1, h_2, \ldots$ be a sequence of prediction strategies (experts), and let $\{q_k\}$ be a probability distribution on the set of positive integers. Assume that the predictions of the experts and the outcomes all lie in $[-B, B]$, and put $c = 8B^2$. Define
$$w_{t,k} = q_k \, e^{-(t-1) L_{t-1}(h_k) / c}$$
with
$$v_{t,k} = \frac{w_{t,k}}{\sum_{i=1}^{\infty} w_{t,i}},$$
and assume the prediction strategy $g$ is defined by
$$g_t(y_1^{t-1}) = \sum_{k=1}^{\infty} v_{t,k} \, h_k(y_1^{t-1}), \quad t = 1, 2, \ldots$$
Then for every $n \ge 1$,
$$L_n(g) \le \inf_k \left( L_n(h_k) - \frac{c \ln q_k}{n} \right).$$
Here $-\ln 0$ is treated as $\infty$.
Proof. Introduce for each $t \ge 1$ the sum
$$W_t = \sum_{k=1}^{\infty} w_{t,k}.$$
Note that
$$w_{t+1,k} = w_{t,k} \, e^{-\left( h_k(y_1^{t-1}) - y_t \right)^2 / c},$$
so that
$$\frac{W_{t+1}}{W_t} = \sum_{k=1}^{\infty} v_{t,k} \, e^{-\left( h_k(y_1^{t-1}) - y_t \right)^2 / c}.$$
Therefore, since $W_1 = 1$,
$$-\ln W_{n+1} = -\sum_{t=1}^{n} \ln \sum_{k=1}^{\infty} v_{t,k} \, e^{-\left( h_k(y_1^{t-1}) - y_t \right)^2 / c},$$
and
$$-\ln W_{n+1} \le -\ln \sup_k q_k \, e^{-n L_n(h_k) / c} = \inf_k \left( \frac{n L_n(h_k)}{c} - \ln q_k \right)$$
for every $n \ge 1$. First we show that
$$e^{-\left( g_t(y_1^{t-1}) - y_t \right)^2 / c} \ge \sum_{k=1}^{\infty} v_{t,k} \, e^{-\left( h_k(y_1^{t-1}) - y_t \right)^2 / c},$$
which is implied by Jensen’s inequality and the concavity of the function $z \mapsto e^{-(z - y_t)^2 / c}$ for $z \in [-B, B]$, which holds whenever $c \ge 8B^2$. Thus,
$$\frac{n L_n(g)}{c} = \sum_{t=1}^{n} \frac{\left( g_t(y_1^{t-1}) - y_t \right)^2}{c} \le -\ln W_{n+1} \le \inf_k \left( \frac{n L_n(h_k)}{c} - \ln q_k \right),$$
which concludes the proof.

Another main ingredient of the proof of Theorem 1 is known as Breiman’s generalized ergodic theorem (Breiman, 1960); see also Algoet (1994).
Lemma 2 (Breiman, 1960). Let $Z = \{Z_i\}_{-\infty}^{\infty}$ be a stationary and ergodic process. Let $T$ denote the left shift operator. Let $f_1, f_2, \ldots$ be a sequence of real-valued functions such that $f_n(Z) \to f(Z)$ almost surely for some function $f$. Assume that $E\left[ \sup_n |f_n(Z)| \right] < \infty$. Then
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f_i\left( T^i Z \right) = E[f(Z)] \quad \text{almost surely.}$$
Proof of Theorem 1. By a double application of the ergodic theorem, as $n \to \infty$, almost surely,
$$\hat{E}_n^{(k,\ell)}(z) \to E\left[ Y_0 \mid F_\ell\left( Y_{-k}^{-1} \right) = z \right]$$
for every quantized string $z$ of positive probability, and therefore, by Lemma 2, as $n \to \infty$, almost surely,
$$L_n\left( h^{(k,\ell)} \right) \to E\left[ \left( Y_0 - E\left[ Y_0 \mid F_\ell\left( Y_{-k}^{-1} \right) \right] \right)^2 \right] \stackrel{\mathrm{def}}{=} \epsilon_{k,\ell}.$$
Since the partitions are nested, $E\left[ Y_0 \mid F_\ell\left( Y_{-k}^{-1} \right) \right]$ is a martingale indexed by the pair $(k, \ell)$. Thus, the martingale convergence theorem (see, e.g., Stout, 1974) and assumption (b) for the sequence of partitions implies that
$$\lim_{k, \ell \to \infty} \epsilon_{k,\ell} = L^* .$$
Now by Lemma 1,
$$L_n(g) \le \inf_{k,\ell} \left( L_n\left( h^{(k,\ell)} \right) - \frac{c \ln q_{k,\ell}}{n} \right),$$
and therefore, almost surely,
$$\limsup_{n \to \infty} L_n(g) \le \inf_{k,\ell} \epsilon_{k,\ell} = L^*,$$
and the proof of the theorem is finished. Theorem 1 shows that asymptotically, the predictor $g$ defined by (2.1) predicts as well as the optimal predictor given by the regression function $E\left[ Y_n \mid Y_1^{n-1} \right]$. In fact, $g$ gives a good estimate of the regression function in the following sense:

Corollary 1 Under the conditions of Theorem 1,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left( g_i(Y_1^{i-1}) - E\left[ Y_i \mid Y_1^{i-1} \right] \right)^2 = 0 \quad \text{almost surely.}$$
Proof. By Theorem 1,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left( g_i(Y_1^{i-1}) - Y_i \right)^2 = L^* \quad \text{almost surely.}$$
Consider the following decomposition:
$$\frac{1}{n} \sum_{i=1}^{n} \left( g_i(Y_1^{i-1}) - Y_i \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( g_i(Y_1^{i-1}) - E\left[ Y_i \mid Y_1^{i-1} \right] \right)^2 + \frac{2}{n} \sum_{i=1}^{n} \left( g_i(Y_1^{i-1}) - E\left[ Y_i \mid Y_1^{i-1} \right] \right) \left( E\left[ Y_i \mid Y_1^{i-1} \right] - Y_i \right) + \frac{1}{n} \sum_{i=1}^{n} \left( E\left[ Y_i \mid Y_1^{i-1} \right] - Y_i \right)^2 .$$
Then the ergodic theorem implies that
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left( E\left[ Y_i \mid Y_1^{i-1} \right] - Y_i \right)^2 = L^* \quad \text{almost surely.}$$
It remains to show that
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left( g_i(Y_1^{i-1}) - E\left[ Y_i \mid Y_1^{i-1} \right] \right) \left( E\left[ Y_i \mid Y_1^{i-1} \right] - Y_i \right) = 0 \quad \text{almost surely.} \tag{2.4}$$
But this is a straightforward consequence of Kolmogorov’s classical strong law of large numbers for martingale differences due to Chow (1965) (see also Stout, 1974, Theorem 3.3.1). It states that if $\{Z_i\}$ is a martingale difference sequence with
$$\sum_{i=1}^{\infty} \frac{E\left[ Z_i^2 \right]}{i^2} < \infty,$$
then
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} Z_i = 0 \quad \text{almost surely.}$$
Thus, (2.4) is implied by Chow’s theorem, since the martingale differences
$$Z_i = \left( g_i(Y_1^{i-1}) - E\left[ Y_i \mid Y_1^{i-1} \right] \right) \left( E\left[ Y_i \mid Y_1^{i-1} \right] - Y_i \right)$$
are bounded by $4B^2$. (To see that the $Z_i$ indeed form a martingale difference sequence, just note that $E\left[ Z_i \mid Y_1^{i-1} \right] = 0$ for all $i$.)

Remark. UNKNOWN B. The prediction strategy studied in this section may be easily extended to the case when the process is bounded, but $B$ is unknown, that is, when no upper bound on the range of the process is known. In such a case we may simply start with the hypothesis $B = 1$ and predict according to (2.1) until we find a value $y_t$ with $|y_t| > B$. Then we reset the algorithm and start the predictor again, doubling the previous value of $B$, and keep doing this. Then the universal property of Theorem 1 obviously remains valid for this modified strategy.

Remark. CHOICE OF $q_{k,\ell}$. Theorem 1 is true independently of the choice of the $q_{k,\ell}$, as long as these values are strictly positive for all $k$ and $\ell$. In practice, however, the choice of $q_{k,\ell}$ may have an impact on the performance of the predictor. For example, if the distribution $\{q_{k,\ell}\}$ has a very rapidly decreasing tail, then the term $-\ln q_{k,\ell}$ will be large for moderately large values of $k$ and $\ell$, and the performance of $g$ will be determined by the best of just a few of the elementary predictors. Thus, it may be advantageous to choose $\{q_{k,\ell}\}$ to be a large-tailed distribution. For example, $q_{k,\ell} = c_0 k^{-2} \ell^{-2}$ is a safe choice, where $c_0$ is an appropriate normalizing constant.

Remark. SEQUENTIAL GUESSING. If the process takes values from a finite set, one is often interested in the sequential guessing of $Y_n$ upon observing the
past $Y_1^{n-1}$. Such a problem was investigated (among others) by Györfi et al. (1999), where it was assumed that the process takes one of two values, say $0$ and $1$. Sequential guessing is then formally defined by a sequence of decision functions
$$f_i : \mathbb{R}^{i-1} \to \{0, 1\},$$
and the guess formed at time $i$ is $f_i(y_1^{i-1})$. The normalized cumulative loss of guessing by $f$ on the string $y_1^n$ is
$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\left\{ f_i(y_1^{i-1}) \ne y_i \right\}},$$
where $\mathbb{I}$ denotes the indicator function. Algoet (1994) showed that for any guessing strategy $f$ and stationary ergodic binary process,
$$\liminf_{n \to \infty} R_n(f) \ge R^* \quad \text{almost surely,}$$
where
$$R^* = E\left[ \min\left( P\left\{ Y_0 = 1 \mid Y_{-\infty}^{-1} \right\},\ P\left\{ Y_0 = 0 \mid Y_{-\infty}^{-1} \right\} \right) \right]$$
is the minimal expected probability of error of guessing $Y_0$ based on the infinite past $Y_{-\infty}^{-1}$. The existence of a guessing scheme $f$ for which $R_n(f) \to R^*$ almost surely follows from results of Ornstein (1978) and Bailey (1976). In Györfi et al. (1999) a simple guessing procedure was proposed with the same asymptotic guarantees and with good finite-sample behavior for Markov processes. The disadvantage of the predictor given in Györfi et al. (1999) is that it requires randomization. Here we observe that, with the help of predictors having the universal convergence property of Theorem 1, we may easily define a nonrandomized guessing scheme with the desired convergence properties. Given a prediction strategy $g$
for a binary process, we simply define the guessing scheme $f$ by the decision functions
$$f_i(y_1^{i-1}) = \begin{cases} 1 & \text{if } g_i(y_1^{i-1}) > 1/2, \\ 0 & \text{otherwise.} \end{cases}$$
Then we may use the properties of $g$ established in Corollary 1 to conclude that the guessing scheme defined above has an average number of mistakes converging to the optimum $R^*$ almost surely. Indeed, if we define the decision based on observing the infinite past as the one minimizing the probability of error of guessing $Y_i$,
then we may write
(by the ergodic theorem)
(by the martingale convergence theorem)
(by Theorem 2.2 in Devroye et al. (1996))
(by Corollary 1). Thus, any predictor with the universal property established in Theorem 1 may be converted, in a natural way, into a universal guessing scheme. An alternative proof of the same fact is given by Nobel (2000).
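The conversion of a predictor into a guessing scheme is simple to implement. The sketch below is ours, assuming the two values are $0$ and $1$: since for a $0/1$-valued process the regression function equals the conditional probability of a $1$, the real-valued prediction is thresholded at $1/2$. The helper names and the frequency-based predictor are hypothetical illustrations.

```python
def make_guesser(predictor):
    """Wrap a real-valued prediction strategy into a nonrandomized binary
    guessing scheme: guess 1 iff the predicted conditional mean exceeds 1/2."""
    def guess(past):
        return 1 if predictor(past) > 0.5 else 0
    return guess

# Hypothetical predictor: the empirical frequency of ones seen so far.
def freq_predictor(past):
    return sum(past) / len(past) if past else 0.5

guess = make_guesser(freq_predictor)
```

Any universal predictor (for example the partitioning scheme of this section) could be plugged in for `freq_predictor`.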
3. UNIVERSAL PREDICTION BY GENERALIZED LINEAR ESTIMATES
This section is devoted to an alternative way of defining a universal predictor for the class of all bounded ergodic processes. Once again, we apply the method described by Lemma 1 to combine elementary predictors, but now, instead of partitioning-based predictors, we use elementary predictors which are generalized linear predictors. Once again, we consider bounded processes, and assume that a positive constant $B$ is known such that $|Y_n| \le B$ almost surely. (The case of unknown $B$ may be treated similarly as in Section 2.) We define an infinite array of elementary predictors $h^{(k)}$, $k = 1, 2, \ldots$, as follows. Let $\phi_j^{(k)}$, $j = 1, \ldots, k$, be real-valued functions defined on $\mathbb{R}^k$. The elementary predictor $h_n^{(k)}$ generates a prediction of the form
$$h_n^{(k)}(y_1^{n-1}) = \sum_{j=1}^{k} c_{n,j} \, \phi_j^{(k)}\left( y_{n-k}^{n-1} \right),$$
such that the coefficients $c_{n,j}$ are calculated based on the past observations $y_1^{n-1}$. Before defining the coefficients, note that one is tempted to define the
as the coefficients which minimize
$$\sum_{t=k+1}^{n-1} \left( y_t - \sum_{j=1}^{k} c_j \, \phi_j^{(k)}\left( y_{t-k}^{t-1} \right) \right)^2$$
if $n > k + 1$, and as the all-zero vector otherwise. However, even though the minimum always exists, it is not unique in general, and therefore the minimizer is not well-defined. Instead, we define the coefficients by a standard recursive procedure as follows (see, e.g., Tsypkin, 1971; Györfi, 1984; Singer and Feder, 2000). Introduce the feature vector
$$x_t = \left( \phi_1^{(k)}\left( y_{t-k}^{t-1} \right), \ldots, \phi_k^{(k)}\left( y_{t-k}^{t-1} \right) \right)^T$$
(where the superscript $T$ denotes transpose). Let $\delta$ be an arbitrary positive constant and put
$$R_n = \delta I + \sum_{t=k+1}^{n-1} x_t x_t^T,$$
where $I$ is the $k \times k$ identity matrix. Define the coefficient vector by
$$c_n = R_n^{-1} \sum_{t=k+1}^{n-1} y_t x_t .$$
It is easy to see that the inverse can be calculated recursively by
$$R_{n+1}^{-1} = R_n^{-1} - \frac{R_n^{-1} x_n x_n^T R_n^{-1}}{1 + x_n^T R_n^{-1} x_n},$$
which makes the calculation of $c_n$ easy.
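The recursive procedure can be sketched as follows. This is our own minimal implementation of recursive least squares with the rank-one (Sherman-Morrison) inverse update, not the authors' code; the class and method names are ours.

```python
class RecursiveLeastSquares:
    """Recursive least-squares fit of coefficients c minimizing
    sum_t (y_t - c . x_t)^2 + delta * |c|^2, updated one observation at a
    time via the Sherman-Morrison formula, so no k x k linear system is
    solved at any step."""

    def __init__(self, k, delta=1.0):
        # P approximates R^{-1}; start from (delta * I)^{-1}
        self.P = [[(1.0 / delta if i == j else 0.0) for j in range(k)]
                  for i in range(k)]
        self.b = [0.0] * k       # running sum of y_t * x_t
        self.k = k

    def update(self, x, y):
        P, k = self.P, self.k
        Px = [sum(P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        # Sherman-Morrison: P <- P - (P x)(P x)^T / (1 + x^T P x)
        for i in range(k):
            for j in range(k):
                P[i][j] -= Px[i] * Px[j] / denom
        for i in range(k):
            self.b[i] += y * x[i]

    def coefficients(self):
        # c_n = R_n^{-1} * sum_t y_t x_t
        return [sum(self.P[i][j] * self.b[j] for j in range(self.k))
                for i in range(self.k)]
```

With a small $\delta$ the computed coefficients are close to the ordinary least-squares solution.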
Theorem 2 Define the elementary predictors $h^{(k)}$ as above. Suppose that the basis functions $\phi_j^{(k)}$ are bounded, and that for any fixed $k$ the set of linear combinations of the functions $\phi_j^{(k)}$, $j = 1, \ldots, k$, $k = 1, 2, \ldots$, is dense in the corresponding space of square-integrable functions of the past. Define a prediction strategy by combining the elementary predictors as in (2.1). The obtained predictor is universal with respect to the class of all ergodic processes with $|Y_0| \le B$ almost surely.
Proof. Let $\lambda_1, \ldots, \lambda_k$ and $u_1, \ldots, u_k$ be an eigensystem of the matrix $R = E\left[ x_k x_k^T \right]$; that is, the $u_i$ are orthogonal solutions of the equation
$$R u_i = \lambda_i u_i, \qquad i = 1, \ldots, k.$$
(Note that since $R$ is symmetric and positive semidefinite, its eigenvalues are all real and nonnegative.) Let $m$ be the integer for which $\lambda_i > 0$ if $i \le m$ and $\lambda_i = 0$ if $i > m$. Express the vector $E[Y_0 x_k]$ in the basis of the eigenvectors, and define the corresponding limiting coefficient vector. (It is easy to see by stationarity that the value of this vector is independent of $n$.) It is shown by Györfi (1984) that the recursively computed coefficients converge, and moreover the corresponding predictions converge as well. Also, observe that by the ergodic theorem, for any fixed $k$,
Therefore by (3.6), (3.5) and Lemma 2
Next define the coefficient vector to be any vector which achieves the minimum in
Then
It is immediate by the martingale convergence theorem that
On the other hand, by the denseness assumption of the theorem, for any fixed
Thus, we conclude that
Finally, by Lemma 1,
which concludes the proof. Again, as in Corollary 1, we may compare the predictor directly to the regression function. By the same argument, we obtain the following result; the details are left to the reader.

Corollary 2 Under the conditions of Theorem 2,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left( g_i(Y_1^{i-1}) - E\left[ Y_i \mid Y_1^{i-1} \right] \right)^2 = 0 \quad \text{almost surely.}$$
4. PREDICTION OF GAUSSIAN PROCESSES
Up to this point we have always assumed that the process to be predicted is bounded. This excludes some traditionally well-studied unbounded processes, such as gaussian processes. In this section we define a predictor which is universal with respect to the class of all stationary and ergodic gaussian processes. For gaussian processes the best predictor (i.e., the regression function) is linear, and therefore we may use the techniques of the previous section in the special case when the $\phi_j^{(k)}$ are the coordinate functions, that is, when the elementary predictors are linear. However, the unboundedness of the process introduces some additional difficulty. To handle it, we use bounded elementary predictors
as before, but the bound is increased with $n$. Also, we need to modify the way of combining these elementary predictors. The proposed predictor is based on a convex combination of linear predictors of different orders. For each $k$, introduce the elementary predictor that truncates a linear prediction of order $k$ at a level which grows slowly with $n$, where the vector of coefficients is calculated by the recursive formula introduced in Section 3 (with the $\phi_j^{(k)}$ taken to be the coordinate functions). The predictor is then defined by combining the truncated elementary predictors by exponential weighting, the combination being reinitialized at the beginning of each of a sequence of time segments. Thus, we divide the time instants into intervals of exponentially increasing length and, after initializing the predictor at the beginning of such an interval, we use a different way of combining the elementary predictors in each such segment. The reason for this is that, to be able to combine elementary predictors as in Lemma 1, we need to make sure that the predictor, as well as the outcome to predict, is appropriately bounded. In our case this can be achieved based on Lemma 3 below, which implies that, with very large probability, the maximum of $n$ identically distributed normal random variables is at most of the order of $\sqrt{\ln n}$.
Theorem 3 The prediction strategy defined above is universal with respect to the class of all stationary and ergodic zero-mean gaussian processes. Also,
At a key point the proof uses the following well-known properties of gaussian random variables:
Lemma 3 (Pisier, 1986). Let $X_1, \ldots, X_n$ be zero-mean gaussian random variables with $E\left[ X_i^2 \right] \le \sigma^2$, $i = 1, \ldots, n$. Then
$$E\left[ \max_{i \le n} X_i \right] \le \sigma \sqrt{2 \ln n},$$
and for each $u > 0$,
$$P\left\{ \max_{i \le n} X_i > E\left[ \max_{i \le n} X_i \right] + u \right\} \le e^{-u^2 / (2\sigma^2)}.$$
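A quick numerical sanity check of the expectation bound of Lemma 3 can be run as follows (our illustration; the sample sizes, seed, and constants are arbitrary choices):

```python
import math
import random

def avg_max_of_gaussians(n, sigma=1.0, trials=2000, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. N(0, sigma^2) variables]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) for _ in range(n))
    return total / trials

n, sigma = 100, 1.0
estimate = avg_max_of_gaussians(n, sigma)
bound = sigma * math.sqrt(2 * math.log(n))   # Lemma 3: E[max] <= sigma*sqrt(2 ln n)
```

For $n = 100$ the simulated expectation (about $2.5$) indeed stays below the bound $\sqrt{2 \ln 100} \approx 3.03$, and the slack shrinks as $n$ grows.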
Proof of Theorem 3. Lemma 3 implies, by taking $u$ of the order of $\sqrt{\ln n}$, that the probability that the maximum of the first $n$ outcomes exceeds a constant multiple of $\sqrt{\ln n}$ is summable in $n$. This implies, by the Borel-Cantelli lemma, that with probability one there exists a finite index $n_0$ such that the bound holds for all $n \ge n_0$. Also, there exists a finite index $T$ such that for all $n \ge T$,
Therefore, denoting
we may write
In other words,
This proves the second statement of the theorem. To prove the claimed universality property, it suffices to show that for all ergodic gaussian processes,
This can be done similarly to the proof of Theorem 2:
Define the coefficient vector
such that it minimizes
(If the minimum is not unique, choose one arbitrarily.) Then
since
Now the proof can be finished by mimicking the proof of Theorem 2. Once again, we may derive a property analogous to Corollary 1:
Corollary 3 Under the conditions of Theorem 3,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left( g_i(Y_1^{i-1}) - E\left[ Y_i \mid Y_1^{i-1} \right] \right)^2 = 0 \quad \text{almost surely.}$$
Proof. We proceed exactly as in the proof of Corollary 1. The only thing that needs a bit more care is checking the conditions of Kolmogorov’s strong law for sums of martingale differences, since in the gaussian case the corresponding martingale differences are not bounded. By the Cauchy-Schwarz inequality,
where $C$ and $\epsilon$ are positive constants, so the condition of Kolmogorov’s theorem is satisfied.
Remark. RATES OF CONVERGENCE. The inequality of Theorem 3 shows that the rate of convergence of $L_n(g)$ to $L^*$ is determined by the performance of the best elementary predictor. The price of adaptation to the best elementary predictor is merely the additional term appearing in the inequality of the theorem. This additional term is not much larger than an inevitable estimation error. This is supported by a result of Gerencsér and Rissanen (1986), who showed that for any gaussian process and for any predictor,
On the other hand, Gerencsér (1994) showed, under some mixing conditions for ARMA processes, that there exists a predictor such that
Further rate-of-convergence results under more general conditions for the process were established by Gerencsér (1992). Another general branch of bounds can be found in Goldenshluger and Zeevi (1999). Consider the representation of
with transfer function
Goldenshluger and Zeevi show that if for
and
then for large
and for
Thus, for the processes investigated by Goldenshluger and Zeevi, the predictor of Theorem 3 achieves the rate of convergence
REFERENCES

Algoet, P. (1992). Universal schemes for prediction, gambling, and portfolio selection. Annals of Probability, 20:901–941. Algoet, P. (1994). The strong law of large numbers for sequential decisions under uncertainty. IEEE Transactions on Information Theory, 40:609–634. Bailey, D. H. (1976). Sequential schemes for classifying and predicting ergodic processes. PhD thesis, Stanford University. Breiman, L. (1960). The individual ergodic theorem of information theory. Annals of Mathematical Statistics, 28:809–811 (1957); correction, 31:809–810 (1960). Cesa-Bianchi, N., Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth. (1997). How to use expert advice. Journal of the ACM, 44(3):427–485. Chow, Y.S. (1965). Local convergence of martingales and the law of large numbers. Annals of Mathematical Statistics, 36:552–558. Devroye, L., L. Györfi, and G. Lugosi. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Feder, M., N. Merhav, and M. Gutman. (1992). Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38:1258–1270. Gerencsér, L. (1992). AR(∞) estimation and nonparametric stochastic complexity. IEEE Transactions on Information Theory, 38:1768–1779. Gerencsér, L. (1994). On Rissanen’s predictive stochastic complexity for stationary ARMA processes. Journal of Statistical Planning and Inference, 41:303–325. Gerencsér, L. and J. Rissanen. (1986). A prediction bound for Gaussian ARMA processes. Proceedings of the 25th Conference on Decision and Control, 1487–1490. Goldenshluger, A. and A. Zeevi. (1999). Non-asymptotic bounds for autoregressive time series modeling. Submitted for publication. Györfi, L. (1984). Adaptive linear procedures under general conditions. IEEE Transactions on Information Theory, 30:262–267. Györfi, L., G. Lugosi, and G. Morvai. (1999). A simple randomized algorithm for consistent sequential prediction of ergodic time series. IEEE Transactions on Information Theory, 45:2642–2650. Kivinen, J. and M. K. Warmuth. (1999). Averaging expert predictions. In P. Fischer and H. U. Simon, editors, Computational Learning Theory: Proceedings of the Fourth European Conference, EuroCOLT’99, pages 153–167. Springer, Berlin. Lecture Notes in Artificial Intelligence 1572. Littlestone, N. and M. K. Warmuth. (1994). The weighted majority algorithm. Information and Computation, 108:212–261. Merhav, N. and M. Feder. (1998). Universal prediction. IEEE Transactions on Information Theory, 44:2124–2147. Morvai, G., S. Yakowitz, and L. Györfi. (1996). Nonparametric inference for ergodic, stationary time series. Annals of Statistics, 24:370–379. Morvai, G., S. Yakowitz, and P. Algoet. (1997). Weakly convergent nonparametric forecasting of stationary time series. IEEE Transactions on Information Theory, 43:483–498. Nobel, A. (2000). Aggregate schemes for sequential prediction of ergodic processes. Manuscript. Ornstein, D. S. (1978). Guessing the next output of a stationary process.
Israel Journal of Mathematics, 30:292–296. Pisier, G. (1986). Probabilistic methods in the geometry of Banach spaces. In Probability and Analysis, Lecture Notes in Mathematics, 1206, pages 167–241. Springer, New York. Singer, A. and M. Feder. (1999). Universal linear prediction by model order weighting. IEEE Transactions on Signal Processing, 47:2685–2699. Singer, A. C. and M. Feder. (2000). Universal linear least-squares prediction. In Proceedings of the International Symposium on Information Theory. Stout, W.F. (1974). Almost Sure Convergence. Academic Press, New York. Tsypkin, Ya. Z. (1971). Adaptation and Learning in Automatic Systems. Academic Press, New York.
Vovk, V.G. (1990). Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 372–383. Association for Computing Machinery, New York. Yakowitz, S. (1976). Small-sample hypothesis tests of Markov order, with application to simulated and hydrologic chains. Journal of the American Statistical Association, 71:132–136. Yakowitz, S. (1979). Nonparametric estimation of Markov transition functions. Annals of Statistics, 7:671–679. Yakowitz, S. (1985). Nonparametric density estimation, prediction, and regression for Markov sequences. Journal of the American Statistical Association, 80:215–221. Yakowitz, S. (1987). Nearest-neighbour methods for time series analysis. Journal of Time Series Analysis, 8:235–247. Yakowitz, S. (1989). Nonparametric density and regression estimation for Markov sequences without mixing assumptions. Journal of Multivariate Analysis, 30:124–136. Yakowitz, S., L. Györfi, J. Kieffer, and G. Morvai. (1999). Strongly consistent nonparametric estimation of smooth regression functions for stationary ergodic sequences. Journal of Multivariate Analysis, 71:24–41. Yang, Y. (1999). Aggregating regression procedures for a better performance. Manuscript. Yang, Y. (2000). Combining different procedures for adaptive regression. Journal of Multivariate Analysis, 74:135–161.
Part IV

Chapter 12

THE BIRTH OF LIMIT CYCLES IN NONLINEAR OLIGOPOLIES WITH CONTINUOUSLY DISTRIBUTED INFORMATION LAGS

Carl Chiarella
School of Finance and Economics, University of Technology Sydney, P.O. Box 123, Broadway, NSW 2007, Australia
[email protected]
Ferenc Szidarovszky Department of Systems and Industrial Engineering University of Arizona Tucson, Arizona, 85721-0020, USA
[email protected]
Abstract
The dynamic behavior of the output in nonlinear oligopolies is examined when the equilibrium is locally unstable. Continuously distributed time lags are assumed in obtaining information about rivals’ output as well as in obtaining or implementing information about the firms’ own output. The Hopf bifurcation theorem is used to find conditions under which limit cycle motion is born. In addition to the classical Cournot model, labor-managed and rent-seeking oligopolies are also investigated.

1. INTRODUCTION
During the last three decades many researchers have investigated the stability of dynamic oligopolies in both continuous and discrete time scales. A comprehensive summary of the key results is given in Okuguchi (1976) and their multiproduct extensions are presented in Okuguchi and Szidarovszky (1999).
However relatively little research has been devoted to the investigation of unstable oligopolies. There are two main reasons why unstable oligopolies were neglected in the main stream of research. First, it was usually assumed that no economy would operate under unstable conditions. Second, there was a lack of appropriate technical tools to investigate unstable dynamical systems. During the last two decades, the qualitative theory of nonlinear differential equations has resulted in appropriate techniques that enable economists to investigate the behavior of locally unstable economies. There is an expanding literature on this subject. The works of Arnold (1978), Guckenheimer and Holmes (1983), Jackson (1989) and Cushing (1977) can be mentioned as main sources for the mathematical methodology. Theoretical research on unstable dynamical systems has evolved in three main streams. Bifurcation theory (see for example, Guckenheimer and Holmes (1983)) is used to identify regions in which certain types of oscillating attractors are born. Centre manifold theory (see for example, Carr (1981)) can be applied to reduce the dimensionality of the governing differential equation to a manifold on which the attractor lies. Many other studies have used computer methods (see for example, Kubicek and Marek (1986)) to determine the typical types of attractors and bifurcation diagrams. Empirical research can be done either by using computer simulations (see for example, Kopel (1996)) or by performing laboratory experiments (see for example, Cox and Walker (1998)). In these studies observations are recorded about the dynamic behavior of the system under consideration. In this paper a general oligopolistic model will be examined which contains the classical Cournot model, labor managed oligopolies, as well as rent-seeking games as special cases. 
Time lags are assumed for each firm in obtaining information about rivals’ output as well as in obtaining or implementing information about its own output. The time lag is unknown; it is considered a random variable with a given distribution. The dynamic model including the expected values is a nonlinear integro-differential equation, the asymptotic behavior of which will be examined via linearization. The Hopf bifurcation theorem will then be applied to find conditions that guarantee the existence of limit cycles. The paper develops as follows. Nonlinear oligopoly models will be introduced in Section 2, and the corresponding dynamic models with time lags will be formulated in Section 3. Bifurcation analysis will be performed in Section 4 for the general case, and the special case of identical firms will be analyzed in Section 5. Special oligopoly models such as the classical Cournot model, labor-managed oligopolies, and rent-seeking games will be considered in Section 6, where special conditions will be given for the existence of limit cycles. Section 7 concludes the paper.
2. NONLINEAR OLIGOPOLY MODELS
In this paper an $n$-person game will be examined, in which the strategy set of each player is a closed interval of feasible outputs, and the payoff function of player $k$ is given as
$$\varphi_k(x_1, \ldots, x_n) = f_k(x_k, s_k) \quad \text{with} \quad s_k = \sum_{l \ne k} x_l .$$
Here $f_k$ is a given function. Notice that this model includes the classical Cournot model, labor-managed oligopolies, as well as rent-seeking oligopolies as special cases. In the case of the Cournot model,
$$f_k(x_k, s_k) = x_k \, p(x_k + s_k) - C_k(x_k),$$
where $p$ is the inverse demand function, and $C_k$ is the cost function of firm $k$. In the case of labor-managed oligopolies,
$$f_k(x_k, s_k) = \frac{x_k \, p(x_k + s_k) - c_k(x_k)}{h_k(x_k)},$$
where $h_k$ is the inverse production function and $c_k$ is the cost unrelated to labor. In the case of rent-seeking games,
$$f_k(x_k, s_k) = \frac{x_k}{x_k + s_k} - C_k(x_k),$$
where $C_k$ is the cost function of agent $k$. The first term represents the probability of winning the rent. If unit rent is assumed, then $f_k$ is the expected profit of agent $k$. Let $(x_1^*, \ldots, x_n^*)$ be an equilibrium of the game, and let $s_k^* = \sum_{l \ne k} x_l^*$. Assume that in a neighborhood of this equilibrium the best response is unique for each player, and the best response functions are differentiable. Let $R_k(s_k)$ denote the best response of player $k$ for a given $s_k$.
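For concreteness, a best-response computation in the Cournot special case can be sketched as follows. The linear inverse demand $p(Q) = a - bQ$ and linear cost $C_k(x) = cx$ are our illustrative assumptions, not the paper's; the function name and default parameter values are ours.

```python
def cournot_best_response(s, a=10.0, b=1.0, c=1.0):
    """Best response of a firm with payoff x*(a - b*(x + s)) - c*x:
    the first-order condition a - 2*b*x - b*s - c = 0 gives
    x = (a - c - b*s) / (2*b), clipped at zero for large rival output s."""
    return max(0.0, (a - c - b * s) / (2.0 * b))
```

With these parameters the symmetric duopoly equilibrium output $x^* = (a - c)/(3b) = 3$ is a fixed point of the best-response map.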
3.
THE DYNAMIC MODEL WITH LAG STRUCTURE
We assume that in a disequilibrium situation each firm adjusts to the desired level of output according to the adaptive rule
where > 0 is the speed of adjustment, is the expectation by player on the output of the rest of the players, and is the expectation on his/her own output. We consider the situation in which each firm experiences a time lag in obtaining information about the rivals’ output as well as a time lag in management receiving or implementing information about its own output. Russel et al. (1986) used the adjustment process
which is a differential-difference equation with an infinite eigenvalue spectrum. In economic situations the lags and are usually uncertain; therefore we will model time lags in a continuously distributed manner, which allows us to use finite-dimensional bifurcation techniques. Expectations and are therefore modeled by using the same lag structure as given earlier in Invernizzi and Medio (1991) and in Chiarella and Khomin (1996). In particular, we assume that
and
where the weighting function
is assumed to have the form
Here we assume that T > 0, and is a nonnegative integer. Notice that this weighting function has the following properties:
(a) for the weights are exponentially declining, with the most weight given to the most current output;
(b) for zero weight is assigned to the most recent output, rising to a maximum at and declining exponentially thereafter;
(c) the area under the weighting function is unity for all T and
(d) as increases, the function becomes more peaked around For sufficiently large values of the function may for all practical purposes be regarded as very close to the Dirac delta function centered at
(e) as the function tends to the Dirac delta function.
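These properties can be verified numerically for the gamma-type kernel commonly used in this literature (the parametrization below follows our reading of Invernizzi and Medio (1991) and should be taken as an assumption): w(s) = (1/T)e^{-s/T} for m = 0, and w(s) = ((m/T)^{m+1}/m!) s^m e^{-ms/T} for m ≥ 1.

```python
import math

def weight(s, T, m):
    """Gamma-type lag kernel: exponentially declining weights for m = 0;
    for m >= 1, zero weight at s = 0, a peak at s = T, exponential decline after."""
    if m == 0:
        return math.exp(-s / T) / T
    return (m / T) ** (m + 1) / math.factorial(m) * s ** m * math.exp(-m * s / T)

def area_and_peak(T, m, upper=50.0, n=100000):
    # trapezoidal check that the kernel integrates to one, and peak location
    h = upper / n
    vals = [weight(i * h, T, m) for i in range(n + 1)]
    area = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    peak = max(range(n + 1), key=vals.__getitem__) * h
    return area, peak

area, peak = area_and_peak(T=2.0, m=5)  # area close to 1, peak at s = T
```

The unit area corresponds to property (c), and the peak at s = T to property (b); increasing m sharpens the peak as in property (d).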
Property (d) implies that for large values of
so that the model with discrete time lags is approached when is chosen sufficiently large. Property (e) implies that for small values of T,
in which case we would recover the case of no information lags usually considered in the literature. Substituting equation (3.4) into (3.2) and (3.3), and the resulting expressions for and into equation (3.1), a system of nonlinear integro-differential equations is obtained around the equilibrium. In order to analyze the local dynamic behavior of the system we consider the linearized system. Letting and denote the deviations of and from their equilibrium levels, the linearized system can then be formulated as follows:
where
4.
BIFURCATION ANALYSIS IN THE GENERAL CASE
The characteristic equation of the Volterra integro-differential equations (3.7) can be obtained by a technique expounded by Miller (1972) but originally due to Volterra (1931). We seek the solution in the form
Substituting this form into equation (3.7) and allowing
we have
By introducing the notation
with
and
with
equation (4.2) can be simplified as
Therefore the characteristic equation has the form
Notice that in the most general case the left-hand side is a polynomial in of degree The determinant (4.6) can be expanded by using a special idea introduced earlier by Okuguchi and Szidarovszky (1999). Introduce the notation
to see that equation (4.6) can be rewritten as
where we have used the simple fact that for any pair of vectors
and
This relation can be easily proved by finite induction on the dimension. A value is an eigenvalue if either
for at least two players, or
We recall that in order to apply the Hopf bifurcation theorem, we need to study the behaviour of the eigenvalues as they cross the imaginary axis. As a first step we need to obtain the conditions under which the eigenvalue equations (4.8) and (4.9) yield pure complex roots. We consider each of these equations in turn. Consider first equation (4.8). A value solves this equation if
Assume now that is a pure complex number: In order to separate the real and imaginary parts of the left hand side, introduce the following operators. Let
be a polynomial with real coefficients
and
Then it is easy to see that with any real
By introducing the polynomials
and
we have
Define polynomials
and
Then equation (4.10) can be rewritten as
Equating the real and imaginary parts to zero leads to the following:
and
If is considered as the bifurcation parameter, then any real solution must satisfy equation
Notice that the left-hand side has odd degree, and no terms of even degree are present in its polynomial form. Therefore zero is always a root. However, it is difficult to find simple conditions that guarantee the existence of nonzero real roots. Differentiating equation (4.10) with respect to we obtain
If P denotes the multiplier of
then equation (4.15) implies that
Therefore
and so
Assuming that the numerator is nonzero, the Hopf bifurcation theorem (see, for example, Guckenheimer and Holmes (1983)) implies that there is a limit cycle for in the neighborhood of Consider next equation (4.9). It can be rewritten as
with
Assume now that
then equation (4.18) simplifies as
Separating the real and imaginary parts yields the following two equations:
and
In this case we have more freedom than in the previous case, since now we have bifurcation parameters. Assume that with some values of these equations have a common nonzero real solution Differentiating equation (4.18) we have
If is nonzero, then the Hopf bifurcation theorem implies the existence of a limit cycle around We may summarise the above analysis by stating that the existence of a limit cycle is guaranteed by two conditions: (i) equations (4.14) and/or (4.19) have nonzero real solutions; (ii) the real part of the derivative (4.16) or (4.21) is nonzero. Condition (i) is difficult to check in general. In the next section we will show cases in which nonzero real roots exist and also cases in which there is no nonzero real root. In the general case computer methods are available for solving these polynomial equations. A comprehensive summary of methods for solving nonlinear equations can be found, for example, in Szidarovszky and Yakowitz (1978). After a real root is found, can be obtained by solving equation (4.12) (or (4.13)) or (4.19) (or (4.20)), and the resulting value has to be substituted into the derivative expressions (4.16) or (4.21). Finally, we have to check whether the real part of the derivative is nonzero.
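To illustrate how condition (i) can be checked, take a hypothetical fifth-degree instance of an equation like (4.14), with only odd-degree terms: p(ω) = ω⁵ + c₃ω³ + c₁ω. Since p(ω) = ω·q(ω²) with q(x) = x² + c₃x + c₁, the nonzero real roots come in pairs ±√x for each positive real root x of q. The coefficients below are made up for illustration:

```python
import math

def nonzero_real_roots(c3, c1):
    """Nonzero real roots of w^5 + c3*w^3 + c1*w = 0.
    Solve q(x) = x^2 + c3*x + c1 = 0; each positive root x yields +/- sqrt(x)."""
    disc = c3 * c3 - 4.0 * c1
    if disc < 0.0:
        return []  # q has no real roots: w = 0 is the only real root
    out = []
    for x in ((-c3 + math.sqrt(disc)) / 2.0, (-c3 - math.sqrt(disc)) / 2.0):
        if x > 0.0:
            out.extend([math.sqrt(x), -math.sqrt(x)])
    return sorted(out)

roots = nonzero_real_roots(-5.0, 4.0)  # q(x) = x^2 - 5x + 4 has roots x = 1, 4
```

Higher-degree instances are handled the same way: substitute x = ω² and apply a standard polynomial root finder to the resulting equation of half the degree.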
5.
THE SYMMETRIC CASE
In this section identical firms are assumed, that is, it is assumed that Therefore as well. Assume in addition that the initial values are identical. Then equations (4.3), (4.4), and (4.5) imply that the characteristic equation has the form
Notice that this equation is very similar to (4.10); therefore the same idea will be used to examine pure complex roots. Introduce the polynomials
and
A complex number
is a solution of equation (5.1) if and only if
Equating again the real and imaginary parts to zero we have the following two equations:
and
Consider again as the bifurcation parameter. Then there is a real solution if and only if satisfies equation
Notice that this is a polynomial equation. As in the general case, it is difficult to find a simple condition that guarantees the existence of a nonzero real root; the degree of this polynomial equation is again odd, and no even-degree terms are present in its polynomial form.
Differentiating equation (5.1) with respect to we have
Let now denote the multiplier of
then (5.6) can be rewritten as
Therefore
Assuming that the numerator is nonzero, the Hopf bifurcation theorem implies the existence of a limit cycle for in the neighborhood of As a special case, assume first that the firm’s own information lag S is much smaller than T, the lag in obtaining information about rival firms. Thus we select S = 0, and equation (5.1) reduces to the following:
If
then this equation simplifies as
The real parts of the roots are negative for If 1 – then there are two nonzero real roots. If = 0, then and is negative. Therefore there is no nonzero pure complex root. Consider next the case of Then equation (5.9) has the special form
Let be a pure complex root. Substituting it into equation (5.11) and equating the real and imaginary parts we see that
Thus from the second equality in equation (5.12) we see that the combination of parameter values for which pure complex roots are possible is given by the relation
Simple calculation and relations (5.12) imply that at
Thus the Hopf bifurcation theorem implies the existence of a limit cycle for in the neighborhood of
In order to illustrate a case when there are lags in obtaining information on both the rivals’ and own outputs and no pure complex root exists, select Then equations (5.3) and (5.4) are simultaneously satisfied if
and
By substituting
into equation (5.17) we see that
With fixed values of if or even if S – T is a small positive number, then becomes negative. In such cases no pure complex root exists. If S – T > then there is a nonzero real root. However, this case, in which the own information lag is longer than the lag in obtaining information about rival firms, seems economically unrealistic.
6.
SPECIAL OLIGOPOLY MODELS
Consider first the classical Cournot model (2.2). Assuming that
a simple calculation shows that
and in the case of identical firms,
which is negative for The sign coincides with that given in (5.13). Considering equation (6.3) together with the relation (5.13), we see that the parameters and T can be selected so that there is a limit cycle for all
Consider next the classical Cournot model with the same price function as before, but assume that the cost function of firm is a nonlinear function Assuming an interior best response, we find that for all firms
Hence
can be obtained by solving equation
for In order to guarantee a unique solution, assume that in the neighborhood of the equilibrium and that the right-hand side is strictly increasing, i.e.
which holds if
in the neighborhood of the equilibrium. Differentiate equation (6.4) with respect to to obtain
which implies that
By using relation (6.4) again we see that at the equilibrium
where
Notice that becomes negative if the output of each firm is smaller than the output of the rest of the industry. Figure 2 illustrates the value of as a function of Notice also that with the appropriate choice of (namely it can take any negative value. Thus from equation (5.13) we see that this version of the classical Cournot model can yield limit cycles with any number of firms.
Assume next that the price function is linear, and the cost functions are nonlinear. Then the profit of firm can be obtained as
The best response is the solution of the equation
for Assume that in the neighborhood of the equilibrium, and that the right-hand side strictly increases in order to guarantee uniqueness, i.e. the condition
holds. Differentiating equation (6.9) with respect to
Under assumption (6.10) the value of the function with independent variable shows that is always negative. The shape of is similar to the one shown in Figure 2. By controlling the value of any negative value can be obtained for showing that limit cycles exist for all If is linear, then is zero, therefore = for all Comparing this value to (6.3) we see that no satisfies it, so no limit cycle can be guaranteed in this case. Consider now the model of labor-managed oligopolies (2.3) with a linear or nonlinear and with and where all parameters are positive. The profit of firm per unit of labor can then be given as
which is mathematically equivalent to the classical Cournot model with hyperbolic price function and cost functions
Consider again labor-managed oligopolies (2.3) and now assume that
with all parameters being positive. Assume again that the best response is interior in the neighborhood of the equilibrium; then an easy calculation shows that
Notice that the right-hand side is strictly decreasing in and a straightforward calculation shows that
Notice that and a comparison to (5.13) shows that limit cycles may be born for The cases of nonlinear and can be examined similarly to classical Cournot oligopolies; therefore the details are not discussed here. Consider finally the case of rent-seeking oligopolies (2.4). Notice that they are mathematically identical to classical Cournot oligopolies with the selection of Therefore our conclusions for the classical model apply here as well.
7.
CONCLUSIONS
A dynamic model with continuously distributed lags in obtaining information about rivals’ output, as well as in the firms obtaining or implementing information about their own outputs, was examined. Classical bifurcation theory was applied to the governing integro-differential equations. Time lags can also be modeled by using differential-difference equations; however, in that case one has to deal with an infinite spectrum, which makes the use of bifurcation theory analytically intractable. In addition, fixed time lags are not realistic in real economic situations. We have derived the characteristic equation in the general case, so the existence of pure complex roots can be analyzed by using standard numerical techniques. The derivatives of the best response functions were selected as the bifurcation parameters. The derivatives of the pure complex roots with respect to the bifurcation parameters were given in closed form, which makes the application of the Hopf bifurcation theorem straightforward. The classical Cournot model, labor-managed oligopolies, and rent-seeking games were examined as special cases. If identical firms are present with linear cost functions and a hyperbolic price function, then under a special adjustment process limit cycles are guaranteed for a sufficiently large number of firms. With nonlinear cost functions, however, limit cycles may be born with any number of firms. If the price as well as the costs are linear, no limit cycle is guaranteed; however, if the costs are nonlinear, then limit cycles can be born with arbitrary values of Similar conclusions have been reached for labor-managed oligopolies. Rent-seeking games are mathematically equivalent to the classical Cournot model with hyperbolic price function; therefore the same conclusions can be given as those presented earlier.
REFERENCES
Arnold, V.I. (1978). Ordinary Differential Equations. MIT Press, Cambridge, MA.
Carr, J. (1981). Applications of Center Manifold Theory. Springer-Verlag, New York.
Chiarella, C. and A. Khomin. (1996). An Analysis of the Complex Dynamic Behavior of Nonlinear Oligopoly Models with Time Lags. Chaos, Solitons & Fractals, Vol. 7, No. 12, pp. 2049-2065.
Cox, J.C. and M. Walker. (1998). Learning to Play Cournot Duopoly Strategies. J. of Economic Behavior and Organization, Vol. 36, pp. 141-161.
Cushing, J.M. (1977). Integro-differential Equations and Delay Models in Population Dynamics. Springer-Verlag, Berlin/Heidelberg/New York.
Guckenheimer, J. and P. Holmes. (1983). Nonlinear Oscillations, Dynamical Systems and Bifurcations of Vector Fields. Springer-Verlag, New York.
Invernizzi, S. and A. Medio. (1991). On Lags and Chaos in Economic Dynamic Models. J. Math. Econ., Vol. 20, pp. 521-550.
Jackson, E.A. (1989). Perspectives of Nonlinear Dynamics. Vols. 1 & 2. Cambridge University Press.
Kopel, M. (1996). Simple and Complex Adjustment Dynamics in Cournot Duopoly Models. Chaos, Solitons & Fractals, Vol. 7, No. 12, pp. 2031-2048.
Kubicek, M. and M. Marek. (1986). Computational Methods in Bifurcation Theory and Dissipative Structures. Springer-Verlag, Berlin/Heidelberg/New York.
Miller, R.K. (1972). Asymptotic Stability and Perturbations for Linear Volterra Integrodifferential Systems. In Delay and Functional Differential Equations and Their Applications, edited by K. Schmitt. Academic Press, New York.
Okuguchi, K. (1976). Expectations and Stability in Oligopoly Models. Springer-Verlag, Berlin/Heidelberg/New York.
Okuguchi, K. and F. Szidarovszky. (1999). The Theory of Oligopoly with Multiproduct Firms (2nd Edition). Springer-Verlag, Berlin/Heidelberg/New York.
Russel, A.M., J. Rickard and T.D. Howroyd. (1986). The Effects of Delays on the Stability and Rate of Convergence to Equilibrium of Oligopolies. Econ. Record, Vol. 62, pp. 194-198.
Szidarovszky, F. and S. Yakowitz. (1978). Principles and Procedures of Numerical Analysis. Plenum Press, New York/London.
Volterra, V. (1931). Leçons sur la Théorie Mathématique de la Lutte pour la Vie. Gauthier-Villars, Paris.
Chapter 13
A DIFFERENTIAL GAME OF DEBT CONTRACT VALUATION
A. Haurie
University of Geneva, Geneva, Switzerland
F. Moresino
Cambridge University, United Kingdom
Abstract
This paper deals with a problem of uncertainty management in corporate finance. It represents, in a continuous-time setting, the strategic interaction between a firm owner and a lender when a debt contract has been negotiated to finance a risky project. The paper takes its inspiration from a model by Anderson and Sundaresan (1996), where a simplifying assumption on the information structure was used. This model is a good example of the possible contribution of stochastic games to modern finance theory. In our development we consider the two possible approaches to the valuation of risky projects: (i) the discounted expected net present value when the firm and the debt are not traded on a financial market, and (ii) the equivalent risk neutral valuation when the equity and the debt are considered as derivatives traded on a spanning market. The Nash equilibrium solution is characterized qualitatively.
1.
INTRODUCTION
In Anderson and Sundaresan (1996) an interesting dynamic game model of debt contracts was proposed and used to explain some observed discrepancies in the yield spread of risky debt. The model is cast in a discrete-time setting, with a simplifying assumption on the information structure allowing for a relatively easy sequential formulation of the equilibrium conditions as a sequence of Stackelberg solutions in which the firm owner is the leader and the lender is the follower.
In the present paper we revisit the Anderson-Sundaresan model, but we formulate it in continuous time and characterize a fully dynamic Nash equilibrium solution. The purpose of the exercise is not to complement the results obtained in Anderson and Sundaresan (1996) (this will be done in another paper using again the discrete-time formulation), but rather to explore the possible mixing of Black and Scholes valuation principles (Black and Scholes, 1973) with stochastic differential game concepts. The debt contract valuation problem provides a very natural framework in which antagonistic parties act strategically to manage risk in an uncertain environment. The paper is organized as follows: in section 2 we describe the firm, which envisions developing a risky project, needs an amount C to launch it, and negotiates a debt contract to raise this money. In section 3 we formulate a stochastic differential game between the firm owner and the lender in a situation where there is no market on which to trade debt or equity. In section 4 we explore the equivalent risk neutral valuation principles applied to equity and debt values when there is a market where these derivatives can be traded. In section 5 we analyze qualitatively the Nash solution of the stochastic differential game, characterized by the Hamilton-Jacobi-Bellman equations satisfied by the equilibrium values of equity and debt. In section 6 we discuss a reformulation of the problem where liquidation can take place only at discrete time periods.
2.
THE FIRM AND THE DEBT CONTRACT
A firm has a project which is characterized by a stochastic state equation in the form of a geometric Brownian process with drift:
where W is a standard Wiener process, is the instantaneous growth rate, and is the instantaneous variance. This state could represent, for example, the price of the output from the project. The firm expects a stream of cash flows defined as a function of the state of the project. Therefore, if the firm has a discount rate the equity of the unlevered firm, that is, the debt-free firm, when evaluated as the net present value of expected cash flows, is given by
Using a standard technique of stochastic calculus one can characterize the function as the solution of the following differential equation
The boundary condition (1.4) comes from the fact that a project with zero value will remain with a zero value and thus will generate no cash flow. An interesting case is the one where since (1.4) rewrites now as
A linear function can be used as a test function. It satisfies the boundary condition (1.6), and it satisfies (1.5) if A is such that This defines and therefore for all Later on, in section 4, we will see another way of evaluating the equity which does not require specifying the discount rate when there is a market spanning the risk of the project. Suppose that the firm needs to borrow the amount C to finance its project. The financing is arranged through the issue of a debt contract with a lender. The contract is defined by the following parameters, also called the contract terms:
outstanding principal amount P
maturity T
grace period
coupon rate
The contract terms define a debt service that we represent by a function It gives the cumulative contractual payments that have to be made up to time The contract may also include an interest rate in case of delayed payments and an interest rate in case of advanced payments. If one assumes a linear amortization the function is the solution to
This function is illustrated in Figure 1.1. The strategic structure of the problem
is due to the fact that the firm controls its payments and may decide not to abide by the contract at some time. At time let the variable give the state of the debt service, which is the cumulative payments made by the firm up to time This state variable evolves according to the following differential equation
where is the payment at time and We assume that the payment rate is upper bounded by the cash flows generated by the project
The strategic structure of the problem is also due to the fact that the lender is authorized to take control of the firm when the owner is late in his payments, i.e. at any time where is the set
The lender’s action is an impulse control. If at time the lender takes control and liquidates the firm, he receives the minimum of the debt balance and the liquidation value of the firm, i.e.
A possible form for the liquidation value is
This assumes that the lender will find another firm ready to buy the project at equity value and K is a constant liquidation cost. For the example of debt contract given above, the debt balance is
The argument of Anderson and Sundaresan is that, if the firm is in financial distress and its liquidation value is too low, the lender may have an advantage in not liquidating, even if the condition is satisfied. This strategic interaction between the firm owner and the lender will be modelled as a noncooperative differential game where the payoffs are linked to the equity and debt values, respectively. These values will depend on the time the state of the project and the state of the debt service and will be determined either by optimizing the expected utility of the agents or, if equity and debt can be considered as derivatives traded on a market, through the so-called equivalent risk neutral valuation.
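A concrete debt-service schedule can be sketched numerically. The payment-rate form below is a hypothetical instance (coupon paid on the outstanding balance, plus linear repayment of the principal after a grace period); the paper’s function s(t) is defined only through the contract terms P, T, the grace period, and the coupon rate:

```python
def payment_rate(t, P, T, grace, coupon):
    """Hypothetical contractual payment rate: coupon on the outstanding
    balance; after the grace period, also linear repayment of the principal."""
    if t < grace:
        return coupon * P
    balance = P * (T - t) / (T - grace)   # linearly amortized principal
    return coupon * balance + P / (T - grace)

def cumulative_service(t, P, T, grace, coupon, n=10000):
    # trapezoidal integral of the payment rate: the cumulative function s(t)
    h = t / n
    vals = [payment_rate(i * h, P, T, grace, coupon) for i in range(n + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

s_T = cumulative_service(10.0, P=100.0, T=10.0, grace=2.0, coupon=0.05)
```

With a zero coupon and no grace period the rate is the constant P/T, so s(T) = P; with a positive coupon, s(T) exceeds P by the accumulated interest.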
3.
A STOCHASTIC GAME
Let us assume that the risk involved in developing the project, or in financing it through the agreed debt contract, cannot be spanned by assets traded on a market. So we represent the strategic interaction between two individuals: the firm owner, who strives to maximize the expected present value of the net cash flow using a discount rate and the lender, who maximizes the present value of the debt using a discount rate The state of the system is the pair An admissible strategy for the firm owner is given by a measurable function A strategy for the lender is a stopping time defined by where B is a Borel set Associated with a strategy pair we define the payoffs of the two players
For the firm’s owner
For the lender
Remark 1 It is in the treatment of the debt service that our model differs significantly from Anderson and Sundaresan (1996). The creditor does not forget late payments. The creditor can also take control at maturity, if the condition holds. The firm is allowed to “overpay”, and it is thus possible that be positive at maturity. It is then normal that the firm gets back this amount at time T, as indicated in (1.13). Definition 1 A strategy pair is a subgame perfect Nash equilibrium for this stochastic game if, for any the following holds
We shall not pursue here the characterization of the equilibrium strategies. Indeed, we recognize that it will be affected by several assumptions regarding the attitude of the agents toward risk (i.e. their utility functions) and their relative discount rates. What is problematic, in practice, is that these data are not readily observable, and it is difficult to assume that they are common knowledge
for the agents. In the next section we shall use another approach which is valid when the equity and debt can be considered as derivatives obtained from assets that are traded or spanned by an efficient market. We will see that the real option or equivalent risk neutral valuation method eliminates the need to identify these parameters when defining an equilibrium solution.
4.
EQUIVALENT RISK NEUTRAL VALUATION
Throughout the rest of the paper we make the following assumptions characterizing the financial market in which the debt contract takes place.
Assumption 1 We assume that the following conditions hold:
A1: no transaction costs
A2: trading takes place continuously
A3: assets are assumed to be perfectly divisible
A4: unlimited short selling is allowed
A5: borrowing and lending at the risk-free rate, i.e. the risk-free asset can be sold short
A6: the firm has no impact on the risk structure of the whole market
A7: no arbitrage
A8: there exists a risk-free security B, for example a zero-coupon bond, paying the risk-free interest rate
The dynamics of B are given by
To simplify, assume that the firm’s project has a value V that is perfectly correlated with an asset which is traded. We assume that this asset pays no dividend, so its entire return is from capital gains. Then evolves according to
Here the drift rate should be equal to the expected rate of return from holding an asset with these risk characteristics, according to CAPM theory (Duffie, 1992)
where is the risk-free rate, is the market price of risk, and is the correlation of with the market portfolio. One also calls the risk-adjusted expected rate of return that investors would require to own the project.
The project is assumed to pay dividends where is the payout ratio. Then the value V evolves according to the following geometric Brownian motion
where If we repeat the developments of (1.3)-(1.6), but with and we can check that V is indeed the equity value of the unlevered firm when one uses the risk-adjusted rate of return as the discount rate. Again we assume that the firm needs an amount C to launch the project. The firm, which is not default free, cannot sell the risk-free security short, since this would be equivalent to borrowing at the risk-free rate. Therefore assumption A5 applies only to investors who may buy the firm’s equity or debt. The firm’s owner and the lender are now interested in maximizing the equity and the debt value of the levered firm, respectively. To this end they act strategically, playing a dynamic Nash equilibrium based on the evolutions of equity and debt.
4.1.
DEBT AND EQUITY VALUATIONS WHEN BANKRUPTCY IS NOT CONSIDERED
At time the firm pays to contribute to the debt service, and the remaining cash, given by is a dividend paid to the equity holder. Assume that the lender’s strategy is never to ask for bankruptcy, while the borrower’s strategy is given by a feedback law We are interested in defining the value of equity under these strategies. The equity is a function of the time the value of the hypothetical unlevered firm V, and the state of the debt service and is denoted by To simplify notation we omit the arguments and simply refer to E. One constructs a portfolio composed of risk-free bonds B and the derivative E replicating the unlevered firm. To avoid arbitrage opportunities this portfolio must provide the same return as the unlevered firm, since it has the same risk. Applying Ito’s lemma we obtain the following stochastic equation for the equity value dynamics:
The dynamics of E and V are perfectly correlated. Therefore it is possible to construct a self-financing portfolio consisting of shares of the risk-free asset B and share of the equity E (all dividends paid by the equity are immediately reinvested in the portfolio), which replicates the risky term of V
and pays the same dividend as V. The portfolio value at time is
Keeping in mind that the equity pays a dividend and the portfolio pays a dividend the strategy is self-financing if
which leads to
In order to get a risk replication, the weight must be given by
Then the weight must verify
We can now write the stochastic equation satisfied by the portfolio value
One observes that, as intended, the portfolio replicates the risk of V. Due to assumption A6, must have the same return as V; otherwise there would be arbitrage opportunities. Moreover, since at time zero the value of
the portfolio is equal to the value of V, we must have for all Matching the drift terms and multiplying by one obtains the following partial differential equation that has to be satisfied by the equity value on the domain
At maturity, one should have
A similar equation could be derived for the other derivative (debt) value D
At maturity one should have
Remark 2 As we assume that no bankruptcy occurs, we have to suppose that, at maturity, the balance of the debt service will be cleared if the value of the project permits it. This is represented in the terminal conditions (1.28) and (1.30). This formulation is interesting, since the model then reduces to a single-controller framework. The firm’s owner controls the trajectory but runs the risk of having a terminal value drawn to 0. Since the payments are upper bounded by the cash flow value it might be optimal for the borrower to anticipate the payments, for some configurations of We shall not pursue that development, but rather consider the more realistic case where the lender can liquidate the firm when it is late in its payments.
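PDE valuations of this Black-Scholes type can be sanity-checked by Monte Carlo simulation under the risk-neutral dynamics dV = (r − δ)V dt + σV dW*. As a minimal sketch with hypothetical parameters, the claim paying V_T itself must price to V₀e^{−δT}, the dividend-discounted project value:

```python
import math
import random

def risk_neutral_price(payoff, v0, r, delta, sigma, T, n_paths, seed=0):
    """Estimate e^{-rT} E*[payoff(V_T)], with V_T sampled exactly from the
    risk-neutral GBM dV = (r - delta) V dt + sigma V dW*."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        vT = v0 * math.exp((r - delta - 0.5 * sigma ** 2) * T
                           + sigma * math.sqrt(T) * z)
        total += payoff(vT)
    return math.exp(-r * T) * total / n_paths

# The claim paying V_T itself must price to v0 * exp(-delta * T)
price = risk_neutral_price(lambda v: v, v0=100.0, r=0.04, delta=0.03,
                           sigma=0.25, T=1.0, n_paths=50000)
```

The same estimator, with the payoff replaced by the contractual terminal sharing of value between owner and lender, gives an independent check on the terminal conditions once the strategies are fixed.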
4.2.
DEBT AND EQUITY VALUATIONS WHEN LIQUIDATION MAY OCCUR
Suppose now that both actors have chosen a strategy, the lender having a stopping-time strategy which defines the conditions under which liquidation occurs. After liquidation the equity E is not available to construct the portfolio Therefore such a portfolio has a return which is smaller than or equal to the
return of V. One obtains the following relations for E and D respectively
Equality holds at points where there is no liquidation. At a liquidation time the following boundary conditions hold
where
and
At maturity T when the boundary conditions are
Consider a strategy pair Then we can rewrite the debt value as
and the equity value as
where is the auxiliary process defined by
and denotes the expected value w.r.t. the probability measure induced by the strategies (Here is the stochastic process induced by the feedback law ) This is the usual change of measure occurring in Black–Scholes evaluations.
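The displayed formula was lost in reproduction. In standard risk-neutral (Feynman–Kac) notation, a representation of the type invoked here reads as follows (a generic sketch, with $Q$ the equivalent measure induced by the strategies and $r$ the risk-free rate; the model's payment and liquidation terms are omitted):

```latex
D(t, V) \;=\; \mathbb{E}^{Q}\!\left[\, e^{-r (T - t)}\, D\bigl(T, \tilde V(T)\bigr) \;\middle|\; \tilde V(t) = V \,\right].
```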
5. DEBT AND EQUITY VALUATIONS FOR NASH EQUILIBRIUM STRATEGIES
Now let us reformulate the noncooperative dynamic game sketched out in Section 3, but now using the real option approach to value debt and equity. The players’ strategies are now chosen in such a way that a Nash equilibrium is reached at all points . Under appropriate regularity conditions, a feedback Nash equilibrium will be characterized by the HJB equations satisfied by the equity and debt values
The boundary conditions at maturity T when are
The boundary conditions in are9
The optimal strategy for the firm is “bang-bang” and is defined by the switching manifold
The equilibrium strategy is given by
The manifold with equation determines the behavior of the firm. If the equation can be solved as then
The lender takes control as soon as the following conditions are satisfied
6. LIQUIDATION AT FIXED TIME PERIODS
Often, in practice, the debt service is paid at fixed periods . The lender can only take control at a payment period, if the condition holds. This is the framework used in Anderson and Sundaresan (1996). If one assumes, as in Anderson and Sundaresan (1996), that at each period the unpaid debt service can trigger a liquidation, but that if no liquidation occurs the unpaid service is forgotten, then the state variable becomes unnecessary. At each period a payment is due. If the value does not satisfy the contract terms, i.e. if
the lender can liquidate. The HJB equations now become
The game simplifies considerably. The only relevant state variable is the value of the firm V(t) at t. The firm’s strategy is now: pay the debt service at , otherwise don’t pay anything.
7. CONCLUSION
The design of the “best” debt contract would be the combination of design parameters that maximizes the Nash-equilibrium value of equity E(0; (0,V)) while the Nash-equilibrium debt value D(0; (0,V)) is at least equal to the needed amount C. This is a very complicated problem that can only be addressed by direct search methods, in which the Nash equilibria are computed for a variety of design parameter values. In this paper we have concentrated on the evaluation of the equity and debt values when, given some contract terms, the firm owner and the lender act strategically and play a dynamic Nash equilibrium. The interesting aspect of this model, which was already present in Anderson and Sundaresan (1996), is the use of an equivalent risk-neutral, or real option, valuation technique in a stochastic game. The contribution of this paper has been to extend the model to a continuous-time setting and to consider a hereditary effect of past deviations from the contracted debt service. We feel that this formulation is relatively general and that it contributes to the extension of game theory to Black–Scholes economics.
NOTES 1. In all generality the control of the firm owner could be described as a process which is adapted to . For our purpose the definition as a feedback law will suffice. 2. Notice that the overpayment is a form of investment by the firm. We could allow it to invest a part of the cash flows in another type of asset; this would be more realistic from a financial point of view but it would further complicate the model. 3. Note that we have assumed here that the agents are optimizing discounted cash flows, not the utility of them. If risk aversion has to be incorporated in the model then the equilibrium characterization will be harder. 4. It is defined by where and are the expected return and standard deviation of the market portfolio respectively. 5. This can be verified by constructing a portfolio where all dividends are immediately reinvested in the project. Such a portfolio is a replication of and must, therefore, follow the same dynamics as . 6. Notice that the usual way to obtain the Black–Scholes equation would have been to construct a self-financing portfolio composed of the risk-free asset B and the underlying asset V in such proportions that it is a replication of the derivative (either E or D). Then, according to the last two assumptions, we know that this portfolio has to give the same return as the underlying asset. However, in our case the unlevered firm is not traded; so we proceed in a symmetric way and construct a self-financing portfolio composed of the risk-free asset B and the derivative E that will replicate V. 7. Such a dynamic adaptation of the portfolio composition is feasible, as is measurable with respect to the filtration generated by W(t). 8. One uses the notation , arbitrarily small. 9. One uses the notation , arbitrarily small.
REFERENCES
Anderson, R.W. and S. Sundaresan. (1996). Design and Valuation of Debt Contracts, The Review of Financial Studies, Vol. 9, pp. 37-68.
Black, F. and M. Scholes. (1973). The Pricing of Options and Corporate Liabilities, The Journal of Political Economy, Vol. 81, pp. 637-654.
Dixit, A.K. and R.S. Pindyck. (1993). Investment under Uncertainty, Princeton University Press.
Duffie, D. (1992). Dynamic Asset Pricing Theory, Princeton University Press.
Chapter 14
HUGE CAPACITY PLANNING AND RESOURCE PRICING FOR PIONEERING PROJECTS
David Porter
College of Arts and Sciences
George Mason University
Abstract
Pioneering projects are systems that are designed and operated with new, virtually untested, technologies. Thus, decisions concerning the capacity of the initial project, its expansion over time and its operations are made under uncertainty. This is particularly true for NASA’s earth-orbiting Space Station. A model is constructed that describes the input-output structure of the Space Station, in which cost and performance are uncertain. It is shown that when there is performance uncertainty, the optimal pricing policy is not to price at expected marginal cost. The optimal capacity decisions require the use of contingent contracts prior to construction to determine the optimal expansion path.
1. INTRODUCTION
A pioneering project is defined as a system in which both the reliability and the actual operations have considerable uncertainties (see Merrow et al. (1981)). Managing such projects is a daunting task. Management decision making in the face of these sizable uncertainties is the focus of this paper. In particular, the decisions on the initial system capacity and the design flexibility for future growth of the project will be addressed. The typical management process for pioneering projects is to design to a wish list of requirements from those who will be using the project’s resources during its operations (this is sometimes referred to by engineers as user “desirements”). The main management policy of the designers is to hold margins of resources for each subsystem to mitigate the uncertainties that may arise in development and operations. The amount of reserve to be held is not well defined, but it is sometimes referred to as an insurance pool. However, unlike an insurance pool, the historical result for pioneering projects has been universal “bad luck” by all subsystems, which depletes the reserves (see Merrow et al. (1981) and Wessen and Porter (1998)). Rarely are incentive
systems used to obtain information on system reliability and the cost of various designs.1 Instead of focusing on these types of management policies, we want to determine the optimal planning, organizational and incentive systems for managing pioneering projects. The application that will be used throughout is NASA’s Space Station. The Space Station is an integrated earth-orbiting system. The Station is to supply resources for the operation of scientific, commercial and technology-based payloads. The structure of the Station design is an interrelated system of inputs and outputs. For example, the design of the power subsystem has a profound effect on the design of the propulsion subsystem, since power will be provided via photovoltaics: the solar array subsystem creates drag on the Station, which will require reboosts from the propulsion subsystem. Fox and Quirk (1987) developed an input-output model of the Station’s subsystems in which the coefficients of the input-output matrix are random. This model has been analyzed for the case with uniform distributions over the coefficients and cost parameters, with a safety-first constraint on the net output of the Station. Using semi-variances, they determine the distribution of costs over the Station’s net outputs. Using some test data from engineering subsystems, the model has been exercised for the uniform Leontief system (see Quirk et al. (1989)). While this attempt at modelling the interaction of the subsystems and the uncertainty in cost and performance has shown that the Station is very sensitive to performance uncertainties and that the errors propagate, it offers very little in the way of policies to help guide the design process. One of the main features of the operation of the Station is that information about subsystem performance and cost accrues over time (see Furniss (2000)). Thus, design and allocation decisions need to take into account the future effects on resource availability.
For the Station, a capacity for each subsystem must be selected at “time zero,” along with the design parameters of each subsystem. Next, users of the Station must design their individual payloads and be scheduled within the capacity constraints of the system. After payloads are manifested, the Station operations parameters must be determined. After the Station starts operating, a decision on how to grow the system must be made. The timing of decisions can be found in Diagram 1 below. An important aspect of the analysis to follow is that information that was previously unknown becomes available during the decision timeline.
The decision variables include the vector of initial subsystem design capacities X, the vector of initial design parameters v, the vector of planned resources u that will be used by payloads, the realization of resources q that are available to payloads for operations, and the capacity expansion, redesign and operations after there is experience with the capabilities of the Station. Not pictured in Diagram 1 is the fact that actual Station operating capabilities are realized between time and . It is at that time that the uncertainty is resolved and known to all parties. Within the confines of the decision structure defined above, the following questions are addressed: Given the uncertainty over cost and performance, what should be the crucial considerations in the initial capacity and design of the Station? How should outputs of the Station be priced to users; in particular, can management rely on traditional long-run marginal cost principles to allocate outputs and services efficiently?2
2. THE MODEL
The Space Station can be viewed as a multiproduct “input-output” system in which decisions are related across time. In addition, there is uncertainty as to the cost and performance of the system, and this uncertainty is resolved over time. The Station is a series of subsystems (e.g. logistics, environmental control, electric power, propulsion, etc.) which require inputs from each other to operate. With a project as new and complex as the Station, the actual operations are uncertain. Specifically, the performance and cost of each subsystem are uncertain. Let denote the gross capacity of subsystem i and the design parameters of subsystem i. Let denote the vector of resources of other subsystems utilized by subsystem i. Let denote the random variable associated with the cost and performance of each subsystem. The input-output structure of the Station is given by:
Let i = 1,...,n, so that there is a total of n subsystems that constitute the Station. If the Station subsystems were related by a fixed linear-coefficient technology, (1) could be represented by Y = AX, where Y would be an n × 1 matrix representing the internal use of subsystem outputs, A would be the n × n matrix of input-output coefficients whose entries are random variables, and X is the n × 1 matrix of subsystem capacities. We consider the more general nonlinear structure for our model.
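The fixed-coefficient special case lends itself to a quick numerical illustration. The sketch below (all capacities and coefficients are invented for illustration, not data from the paper) draws random input-output matrices A and simulates the net capacity X − AX left over for payloads, in the spirit of the safety-first analysis attributed to Fox and Quirk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-subsystem Station (say, power, propulsion, logistics).
# Gross capacities X and a random input-output matrix A whose entries are
# drawn uniformly around nominal cross-utilization coefficients.
X = np.array([100.0, 80.0, 60.0])
A_nominal = np.array([[0.00, 0.10, 0.05],
                      [0.15, 0.00, 0.10],
                      [0.05, 0.05, 0.00]])

def net_capacity(A):
    """Net capacity available to payloads: X - Y, where Y = A X."""
    return X - A @ X

# Monte Carlo over the random coefficients: A = A_nominal * U, U ~ Uniform(0.5, 1.5).
draws = np.array([net_capacity(A_nominal * rng.uniform(0.5, 1.5, A_nominal.shape))
                  for _ in range(10_000)])

print("expected net capacity:", draws.mean(axis=0).round(1))
print("5th-percentile (safety-first) net capacity:",
      np.percentile(draws, 5, axis=0).round(1))
```

With random coefficients, the safety-first view focuses on a low percentile of net capacity rather than its mean, which is why performance uncertainty propagates into the capacity decision.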
Costs are assumed to be separable across subsystems so that total cost can be represented as
where is the cost of maintaining subsystem i with capacity and design in state . At time t=0, the initial subsystem designs and capacities are selected under uncertainty. Let , i=1,...,n, denote subsystem net capacities so that
Thus, the amount of capacity available to service payloads will be a function of X, . The model that has been described up to this point is defined in terms of stock units. Changing the model to accommodate resource flows would unduly complicate it: using flow resources would take into account peak loads and other temporal phenomena that are important in the scheduling of payloads, but would detract from the general management decision-making issues that are the focus of this paper. At time , payloads are manifested to the Station. Let be the contract signed with payload j to deliver services. In general, could be a list of contingent resource levels as well as expected resource deliveries and monetary damages from the failure to deliver services. The important issue to note is that the contract must be made before is known. Let denote payload j’s consumption of the capacity of subsystem i and define . Given the set of contracts signed at , the vector of subsystem capacities available at time will be committed for through . The mapping of contract specifications to resource commitments will be defined through . If the contracts m are based on actual resource delivery, so that nonperformance contracting is infeasible, it would be the case that for all . At time the value of is realized, the actual level of subsystem capacities is known, and contract commitments and penalties are assessed. At time management must decide on the increments to initial subsystem capacities and the redesign choices . One of the most discussed items in pioneering projects (see Abel and Eberly (1998)) is the dynamic nature of costs and performance based on the installed capacity. How one designs the Station to incorporate modular additions, and the level of experience gained from operating a complex system, can have profound effects on the incremental costs of expansion. We capture the notion of growth flexibility through the function f(X,v).
This function enters the cost and performance elements of the project at the growth stage. Specifically, at we have:
where is known and resources are allocated among payloads. We assume that . That is, the more flexible the project, the smaller will be the corresponding cross-utilization of subsystems and costs when the decision to expand is made. Definition 1 A project is said to exhibit learning-by-doing properties if . This definition captures the idea that the larger (or more complex) the project, the larger will be the benefits of learning to operate the system. If it is costly to dismantle or redesign existing capacity, then the project will have what is commonly referred to as capital irreversibilities. Definition 2 A project is said to have capacity irreversibilities if for . Definition 3 A project is said to have design irreversibilities if . To find the system optimum, we compare benefits and costs. The benefit side of the model is one of the most difficult to estimate in practice. Clearly, however, the payload designer is the individual in the best position to determine these benefits. The difficulty in obtaining this information is that there is an incentive to overstate the benefits, since NASA does not base its pricing systems on this information; it uses it only to aid in its subsystem design decisions. Nonetheless, we model the benefit side using a per-period dollar benefit function for each payload j. Specifically, let denote payload j’s monetary payoff from consuming units of subsystem capacities, where . The present value of benefits at time t=0, starting at , is given by:
where r is “the” discount rate. The present value of benefits between time t=0 and is given by:
Turning to the cost side of the objective function, the present value of costs at time t=0, if the state is , is:
The present value of system costs at time for state is given by:
In order to maximize net benefits, we must choose initial capacities and designs, a growth path, and an allocation of subsystem capacities to maximize:
subject to equations (4), (6), and (7), where is the expectation operator over . This is a dynamic programming problem, so we must solve the problem at first, taking the t=0 decisions as given, and then proceed to solve the and t=0 problems. For the most part we are interested in the comparative-static properties of the model and the pricing solutions. For the solution to the problem and its associated comparative statics we will use a form of the envelope theorem for dynamic programming. Recall that the envelope theorem gives:
where is the solution to the problem of choosing x so as to maximize for fixed subject to , and is the Lagrangian, so that4
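The displayed equations of this passage were lost in reproduction. In standard notation, the envelope theorem being invoked is the following (a generic sketch, with symbols $f$, $g$, $\lambda$ not taken from the original):

```latex
V(\alpha) = \max_{x} f(x,\alpha) \quad \text{s.t. } g(x,\alpha)=0,
\qquad
\mathcal{L}(x,\lambda,\alpha) = f(x,\alpha) + \lambda\, g(x,\alpha),
\\[4pt]
\frac{dV}{d\alpha}
= \left.\frac{\partial \mathcal{L}}{\partial \alpha}\right|_{x=x^{*}(\alpha),\,\lambda=\lambda^{*}(\alpha)}
= \frac{\partial f}{\partial \alpha}\bigl(x^{*}(\alpha),\alpha\bigr)
+ \lambda^{*}(\alpha)\,\frac{\partial g}{\partial \alpha}\bigl(x^{*}(\alpha),\alpha\bigr).
```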
This theorem allows an easy way to find the marginal effects of the parameters of the model. Extending this model to the dynamic programming case is easy. We define the maximization problem at as:
where X is the vector of choice variables and is a vector of fixed parameters. Let be a solution to (14) and define . Suppose we can partition , where are the choice variables from the t=0 problem and are fixed parameters for all periods. At t=0 the optimization problem is:
Since is a parameter in , we can find:
Differentiating , we obtain:
The first order conditions for require that:
Adding (16) to (17) and using (18) we find:
Thus, at t=0 we can safely ignore the “indirect” effects of the parameters on the solution at . We now use this fact to obtain some results from this model.
3. RESULTS
We will present a series of results, starting with the easiest case and steadily adding complexity to the decision-making model.
3.1. COST AND PERFORMANCE UNCERTAINTY
We assume that f(X,v)= 0. At the value of is realized and expansion and redesign decisions are made. Thus the maximization problem becomes:
Subject to
where . The necessary conditions for a maximum are:
where is the Lagrange multiplier for the ith constraint. Rewriting (22) and (23) in matrix notation we find:
where is the vector of multipliers, is the vector of marginal capacity costs, is the vector of marginal redesign costs, is the diagonal matrix of marginal capacity cross-utilizations and is the diagonal matrix of marginal redesign cross-utilizations. Result 1 (standard marginal cost pricing): Optimal decisions are made when . That is, capacity adjustments and redesigns should equate marginal costs with demands at those prices for resources.
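In the linear (Leontief) special case, Result 1's marginal-cost condition has a closed form: with cross-utilization matrix A and direct marginal capacity cost vector c, the shadow prices satisfy λ = c + Aᵀλ, i.e. λ = (I − Aᵀ)⁻¹c. A minimal numerical sketch with invented numbers (the linear special case, not the paper's general nonlinear model):

```python
import numpy as np

# Illustrative Leontief cross-utilization matrix: A[i, j] is the amount of
# subsystem i's output consumed per unit of subsystem j's capacity.
A = np.array([[0.00, 0.10, 0.05],
              [0.15, 0.00, 0.10],
              [0.05, 0.05, 0.00]])

# Invented direct marginal capacity costs for the three subsystems.
c = np.array([4.0, 2.0, 1.0])

# Shadow prices solve lambda = c + A^T lambda, i.e. (I - A^T) lambda = c.
lam = np.linalg.solve(np.eye(3) - A.T, c)

print("shadow prices:", lam.round(3))
# Each price exceeds the direct cost because it also carries the cost of
# the inputs drawn from the other subsystems.
```

The gap between λ and c is exactly the cross-utilization burden that Result 1 says prices must reflect.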
Moving to the problem, is known and X, v have been selected; the only issue remaining is allocating the available resources among payloads. Since future benefits and costs are not affected by the allocation, we need only maximize benefits given revealed supply, subject to any commitments made at through the contracts . Formally,
subject to for positive levels of output. Let be the maximum value of (27) subject to the net output constraint for different values of X, v, and . For fixed values of X and v, gives the optimal contingent plans of the payloads. The solution to (27) is the maximum benefit that can be derived subject to the supply constraint. This is nothing more than the use of a spot market to clear supply and demand. Result 2 (spot market): At time , when is known, if there are no contracts restricting supply, prices should be set by spot contracting to solve (27). If contracts are signed prior to , then for the contracts to create incentives for the optimal allocation of resources, the contracts must be dependent on . In particular, for fixed values of X and v, prices p must be a function of and solve the following problem:
The prices for net outputs q at , defined in the equations above, will be functions of , i.e. contingent contracts. If we let denote the prices that solve the equations above, this means that contracts signed at t=0 must be contingent on . For practical purposes, such a complete set of contingent contracts is infeasible. However, a less restrictive set of contracts can be designed that can increase efficiency. Let and , where denotes the sample space for . Definition 4 A priority contract of class for resource , where there are classes, is a contract that specifies delivery of output if and only if for , for
Thus the dispatch of resources, once the uncertainty is resolved, is governed by the priority class in which the resource resides.5 We will examine this contract type when we examine the problem. The problem is essentially the planning of resource use by payloads given the installed capacity X and the design v. If no contracts have been signed at t=0, then the problem is either to sign contingent contracts by solving (28) and (29) above, or just to assign payloads to the Station so that the resource demands of all the manifested payloads are less than , and use spot markets to clear the demand once is revealed. Suppose that for payload j, the net benefits derived from the payload are a function of the payload design.6 In particular, let us represent the design variable for payload j by , so that net benefits can be modelled as , where denotes gross benefits from the payload and denotes the costs of designing and operating the payload. We assume that benefits and costs are increasing functions of net output use and payload design, i.e., . If priority contracts are utilized and if denotes the price for priority class k for resource i, then each payload j would maximize:
where E is the expectation operator over the joint distribution of priority classes and net outputs. The maximization equations and market-clearing constraints become
If priority classes are not used to create contracts, but instead, when is realized, a Vickrey auction (see Vickrey (1961)) is conducted wherein each payload j submits a two-tuple and the auctioneer computes winners as follows:
The Vickrey auction has a dominant strategy under which each bidder reveals their true value. However, since each bidder does not know the value of or the values of other participants prior to developing their payloads, they must calculate the probability that resources will be available and that they have one of the winning bids. In particular, payload j must select based on the probability that their value will be greater than the other bidders’ and that they fit within the resource constraints (35). Since the only way to increase one’s probability of being selected is to fit within the capacity constraints, enters into the decision calculus through the functions B, C and the probability of fitting. That is, if j selects a larger , then for a smaller choice of we can obtain the same net benefits but increase the probability of fitting (see Figure 1.1 above). Thus, there is an incentive to design less resource-intensive payloads, which are inefficient payload designs relative to the use of priority contracts.7 Result 3 (priority contracting): The use of priority contracts results in a more efficient assignment of resources than the use of spot contracts. Result 4 (inefficient payload design): If payloads have design choices, then reliance on a spot market when there are performance uncertainties will result in inefficient payload designs.
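The second-price logic behind the truthfulness of the Vickrey auction can be sketched for a single unit of capacity (the paper's setting is multi-resource; the bidder names and values below are invented for illustration):

```python
# Sketch of a sealed-bid Vickrey (second-price) auction for a single unit of
# subsystem capacity -- a simplification of the multi-resource setting in the
# text.

def vickrey_auction(bids):
    """bids: dict bidder -> bid. Returns (winner, price paid).

    The winner is the highest bidder but pays the second-highest bid,
    which is what makes truthful bidding a dominant strategy.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

payloads = {"materials-lab": 12.0, "earth-obs": 9.5, "biotech": 7.0}
winner, price = vickrey_auction(payloads)
print(winner, "wins and pays", price)   # pays the second-highest bid, 9.5
```

Because the payment does not depend on the winner's own bid, shading one's bid can only lose the auction, never lower the price paid, which is the dominant-strategy property the text relies on.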
We are now in a position to examine the t=0 problem. The maximization problem becomes:
The first order conditions for installed capacities are:
Using our envelope theorem from Section 2, (37) can be written as:
where the second term is evaluated at and the last terms are evaluated at . Using equation (22) we find that
The above equation, along with the fact that , yields:
Thus, capacity should be selected so that:
If there is no performance uncertainty, then a simple rule is to price where marginal benefits equal expected marginal cost at time t=0, i.e. . However, when there is performance uncertainty, we again need priority contracting to obtain information on the marginal benefits of reliability to be equated with incremental costs. Result 5: If there are no performance uncertainties, then prices should be set to expected marginal cost and capacity should be built to equate demand and supply at these prices. If there are performance uncertainties, priority contracting should be used to obtain demand information on reliability.8 We now turn to an environment in which there are capital irreversibilities or learning-by-doing effects.
3.2. COST UNCERTAINTY AND FLEXIBILITY
Suppose that there is no performance uncertainty, but that flexibility is an issue. Thus, we need to introduce constraints (4)-(7) into the model. At , the first order conditions will implicitly define the expansion path , redesign and demands as functions of X, v, and f. Since there is no performance uncertainty, we do not need to consider the contingent allocation problems at . Thus, examining the decisions that are affected at time t=0, using the implicit supply and demand functions at time and applying our envelope theorem, we find the following first order conditions:
Equation (42) implies that payloads should be charged for resource use. Using (30) we can derive the pricing equation:
Result 6 (capacity commitment pricing): If there are capital irreversibilities, i.e. , then the optimal pricing policy is to charge an amount greater than long-run expected marginal cost. Specifically, a capacity commitment charge of needs to be charged9 in order to ensure that optimal capacity is installed. Definition 5 Project A is said to have greater capital irreversibilities or lower learning-by-doing effects than Project B if and only if
From (45) we can obtain the following result: Result 7 (installed capacity): The larger (smaller) the capital irreversibility (learning-by-doing) effects, the smaller should be the installed capacity.
3.3. PERFORMANCE UNCERTAINTY AND FLEXIBILITY
Returning to the model described at the beginning of Section 3 and applying the envelope theorem when flexibility is present, we obtain the following first order conditions:
Compared to equation (40), we have an additional term associated with flexibility. Thus, the solution entails the use of contingent or priority contracting to obtain contingent demands and optimally ration supply, while incorporating the effects of capital irreversibilities on performance and costs. With capital irreversibilities, solving the reliability problem by installing additional capacity or more sophisticated designs is costly if you are wrong, since changing the expansion path is expensive. Thus the additional terms in the equations above tell us: Result 8 (demand reduction): When there are capital irreversibilities (learning-by-doing), the priority contract prices will be higher (lower), so that resource demands are suppressed (enhanced), reflecting the cost (benefit) of committing to capacity at time t=0.
4. CONCLUSION
We can now address the questions listed at the beginning of this paper. Concerning the initial size of the project: if the technology is such that additions to capacity or design changes are costly, then a smaller, more flexible system should be built and operated until more information concerning performance is obtained. If the technology is such that irreversibilities are small and marginal productivity increases “dramatically” as operators learn how the technology works, then a larger, more capable system should be built initially. These flexibility issues also need to be part of the pricing regime, to suppress the appetite of payloads to demand more resources. For the second question that was posed, there is a better alternative than what NASA currently uses to price resources. In particular, it is clear that the use of priority contracts is essential and that prices should not be based solely on incremental costs. Project management should devote more effort to obtaining accurate information concerning cost and performance distributions and should provide more flexible contract procedures to assist users in contingent planning. In addition, contract regimes should be put in place to aid operators in the
rationing of resources when there are performance shortfalls. The market for manifesting payloads and the scheduling of resources should be interactive and should be conducted years before actual operations. In this way price information can guide payload design and Station growth paths.
NOTES 1. One notable exception is the management policy used on the Jet Propulsion Laboratory’s Cassini Mission to Saturn (see Wessen and Porter (1997)). 2. The Space Shuttle went through this process in determining the price to charge private companies for the use of the Shuttle bay to launch satellites and middeck lockers for R&D payloads. The initial pricing policy was to charge for the maximum percentage of Shuttle mass or volume capacity used by a payload times the expected long-run marginal cost of a Shuttle launch, or short-run marginal cost if demand faltered; see Waldrop (1982). 3. In general, , so that benefits are in terms of contract specifications based on the realization of . We investigate the structure of such contracts in Section 3.1, but for now we only consider use of subsystem capacities. 4. It is assumed that and are continuously differentiable. 5. This type of contract has been investigated by Harris and Raviv (1981) as a way for a monopolist to segment the market. Chao and Wilson (1987) examine this form of contracting to price reliability in an electric network. 6. Resources expended to design payloads are among the most expensive elements in payload development. Payloads can be designed to be more autonomous so as to use less astronaut time, but this may increase the use of on-board power and data storage. These trade-offs are extremely important but are usually handled in a haphazard manner (see Polk (1998)). 7. This phenomenon has been found on the Space Shuttle due to the reduced number of flight opportunities for science payloads. The Shuttle program has created a Get-Away-Special (GAS) container for payloads that do not require man-tending or special environmental controls. The number of GAS payloads has increased 10-fold since the beginning of the Shuttle program. 8. Noussair and Porter (1992) have developed an auction process to allocate a specific form of priority contract. The experiments they conduct show that their auction results in highly efficient allocations and outperforms proportional rationing schemes. 9. A somewhat similar result can be found in Yildizoglu (1994).
REFERENCES
Abel, A. and J. Eberly. (June 1998). “The Mix and Scale of Factors with Irreversibility and Fixed Costs of Investment,” Carnegie-Rochester Conference Series on Public Policy 48:101-135.
Chao, H. and R. Wilson. (December 1987). “Priority Service: Pricing, Investment and Market Organization,” American Economic Review 77:899-916.
Fox, G. and J. Quirk. (October 1985). “Uncertainty and Input-Output Analysis,” JPL Economics Research Series 23.
Furniss, T. (July 2000). “International Space Station,” Spaceflight 42:267-289.
Harris, M. and A. Raviv. (June 1981). “A Theory of Monopoly Pricing Schemes with Demand Uncertainty,” American Economic Review 71:347-365.
Merrow, E., K. Philips and C. Myers. (September 1981). “Understanding Cost Growth and Performance Shortfalls in Pioneering Process Plants,” Rand Corporation Report R-2569-DOE.
300
MODELING UNCERTAINTY
Noussair, C. and D. Porter. (1992). “Allocating Priority with Auctions: An Experimental Analysis,” Journal of Economic Behavior and Organization 19:169-195.
Polk, C. (1993). The Organization of Production: Moral Hazard and R&D. Ph.D. Dissertation, California Institute of Technology, Pasadena, California.
Quirk, J., M. Olson, H. Habib-agahi and G. Fox. (May 1989). “Uncertainty and Leontief Systems: An Application to the Selection of Space Station Designs,” Management Science 35:585-596.
Waldrop, M. (1982). “NASA Struggles with Space Shuttle Pricing,” Science 216:278-279.
Wessen, R. and D. Porter. (August 1997). “A Management Approach for Allocating Instrument Development Resources,” Space Policy 13:191-201.
Wessen, R. and D. Porter. (March 1998). “Market-Based Approaches for Controlling Space Mission Costs: The Cassini Resource Exchange,” Journal of Reduced Mission Operations Costs 2:119-132.
Vickrey, W. (1961). “Counterspeculation, Auctions and Competitive Sealed-Bid Tenders,” Journal of Finance 16:8-37.
Yildizoglu, M. (July 1994). “Strategic Investment, Uncertainty and Irreversibility Effect,” Annales d’Economie et de Statistique 35:87-106.
Chapter 15 AFFORDABLE UPGRADES OF COMPLEX SYSTEMS: A MULTILEVEL, PERFORMANCE-BASED APPROACH James A. Reneke
Matthew J. Saltzman Margaret M. Wiecek Dept. of Mathematical Sciences Clemson University Clemson SC 29634-0975
Abstract
A modeling and methodological approach to complex system decision making is proposed. A system is modeled as a multilevel network whose components interact and for which decisions on affordable upgrades of the components are to be made under uncertainty. The system is studied within a framework of overall performance analysis over a range of exogenous environments and in the presence of random inputs. The methodology makes use of stochastic analysis and multiple-criteria decision analysis. An illustrative example of upgrading an idealized industrial production system, with complete computations, is included.
1.
INTRODUCTION
In this paper we address the problem of upgrading a complex system where there are multiple levels of decision making, the effects of the choices interact, and the choices are accompanied by uncertainty. We propose a framework for choosing, from among a number of alternatives, a set of affordable upgrades to a system that exhibits these characteristics. We present the methodology in terms of an illustrative example which highlights its application to each of these areas of difficulty. The system of interest is decomposed into multiple levels, each consisting of a network describing the interactions of the components at that level. The process
constructs conceptual models working from top to bottom, and evaluates the impact of proposed upgrades using computational models from bottom to top. In general, as one proceeds down the levels of decision making, alternatives depend on more independent variables representing exogenous influences, and hence on more uncertainty. The upgraded enterprise operates within a range of environments represented by the exogenous variables, so the best set of choices might not be optimal for any particular environment. Understanding the tradeoffs leads to better choices. Because of the interaction of choices for component upgrades, the decision maker must focus on overall enterprise performance. Major upgrades of some components may have little impact on overall performance, while minor upgrades of other components could have a large impact. Finally, upgrade options must be evaluated in terms of their performance in noisy environments and the contribution of component uncertainty to enterprise uncertainty.

Relation of this work to the literature. The problem of upgrading complex systems has a wide range of applications in business and engineering. Korman et al. (1996) review the process of making upgrade-or-replacement decisions for a chemical distribution system. Hewlett-Packard Corporation used operations research methods to develop design changes that improved the performance of a printer production line, as reported by Burman et al. (1998). Samuelson (1999) describes how a modular software design allowed the “plug-in” replacement of the control software and improved the performance of a telephone system of International Telesystems Corporation. Su et al. (2000) study two distribution management systems of a Taiwan power company and consider system upgrades or migration to improve performance. In engineering communities, the upgrade problem has been extensively studied.
As upgrades are part of recoverable manufacturing, researchers in the area of manufacturing planning and control have been involved. Yield impact models are applied by McIntyre and Meits (1994) to evaluate the cost effectiveness of upgrades in semiconductor manufacturing. Hart and Cook (1995) propose a checklist-type approach to decision making about upgrade versus replacement in the manufacturing environment. Upgrade decisions in the context of life-cycle decision making for flexible manufacturing systems are examined by Yan et al. (2000). Some authors view upgrades from the perspective of system modernization. Wallace et al. (1996) propose a system-modernization decision framework and analyze engineering tradeoffs made in modernizing a manufacturing design system to use distributed object technology. Another perspective is taken by van Voorthuysen and Platfoot (2000). They develop a system to identify relationships between critical process variables for the purpose of process improvement and upgrading, and they demonstrate the methodology on a case study of a commercial printing process.
The upgrade problem for complex systems is related to the system design problem. Hazelrigg (2000) critiques classical systems engineering methodologies. He proposes properties that a design methodology should satisfy, including independence of the engineering discipline; inclusion of uncertainty and risk; consistency and rationality in comparing alternative solutions, independent of the order in which the solutions are considered; and the capability of rank ordering candidate solutions. The method should not impose preferences on the decision maker nor constraints on the decision-making process, should have positive association with information (the more the better), and should be derivable from axioms. Papalambros (2000) and Papalambros and Michelena (2000) survey the application of optimization methods to general systems design problems. In addition to identifying optimal structural and control configurations of physical artifacts, optimization methods can be applied to the decomposition of systems into subsystems and the integration of optimal subsystem designs into the final overall design. Rogers (1999) proposes a system for decomposition of design projects based on the concept of a design structure matrix (DSM). Rogers and Salas (1999) describe a Web-based design management tool that uses the DSM. Models and methodologies for upgrade decisions have also been studied by mathematical and management scientists and by operations researchers. Equipment upgrade and replacement, along with equipment capacity expansion to meet demand growth, is studied by Rajagopalan et al. (1998) and Rajagopalan (1998). The former presents a model for making acquisition and upgrade decisions to meet future demand growth and develops a stochastic dynamic programming algorithm, while the latter unifies capacity expansion and equipment replacement within a deterministic integer programming model. Replacement and repair decisions are studied by Makis et al.
(2000) with the objective of finding the repair/replacement policy minimizing the long-run expected average cost of a system. The problem is formulated as a continuous-time decision problem and the results are based on the theory of jump processes. Carrillo and Gaimon (2000) introduce an optimal control model that couples the improvement of system performance with decisions about the selection and timing of process change alternatives and the related knowledge creation and accumulation. Majety et al. (1999) describe a system-as-network model for reliability allocation in system design. In this model, the methods are associated with reliability measures as well as costs. Cost is the objective to be minimized and a certain level of overall system reliability is to be achieved. Here the objective is linear and the constraints are potentially nonlinear, depending on the structure of the network. Luman (1997; 2000) describes an analysis of upgrades to a complex, multilevel weapons system (a “system of systems”). The methodology treats cost as
the independent variable, allowing the decision maker to analyze the tradeoff between cost and overall performance. Luman addresses the need to understand the relationship between overall performance and the performance measures of major subsystems. The upgrade problem is modeled as a complex nonlinear optimization problem. Closed-form approximations can adequately represent some systems, but for greater complexity, simulation-based optimization methods may be required. The methodology is illustrated on a mine countermeasures system involving subsystems for reconnaissance and neutralization.

Nature of the results. This paper presents a comprehensive modeling and decision methodology in which the modeling techniques are designed for decision making and decisions are based on system performance over a range of operating conditions, constrained by affordability considerations. The decision methodology makes use of multicriteria decision analysis within a framework that allows for uncertainty. A thorough exposition of the ideas in the paper is given in terms of an illustrative example of an industrial production system. The problem has two decision levels and illustrates how one might model interactions among possible choices. Finally, the role of uncertainty is discussed in terms of the risk associated with the final design. The paper provides computational schemes which can be adapted to more complex systems. Performance is represented by response surfaces or functions with respect to independent variables defining operating conditions. Each component is modeled as a transformation of an input performance function to an output function. Interactions between components at a single level are modeled through the performance functions: the transformation describing a component may take account of the output function of another component as part of its input. In the modeling phase, the relationships between tasks or components are developed from the top level down.
Each component at a given level is decomposed into subtasks, and the relationships between tasks are modeled as a network. In the decision phase, selected affordable upgrade alternatives at the lower level are passed to the higher level, where they are combined to form alternative upgrade plans for that level. These, in turn, are passed up to the next higher level. At the top (enterprise) level, the decision maker chooses a most desirable selection of upgrades from among alternatives passed up. At each level, multicriteria optimization is used to identify upgrades that are near optimal over the range of operating conditions. Stochastic linearization representations (Reneke, 1998) are extended to systems with two independent variables and used to assess risk associated with the chosen solution. This technique has been applied previously only to systems with one independent variable. Of
note, simulations are obtained as matrix multiplications and the methods return basic statistics exactly.

Significance of the results. While we do not claim that we have completely met the criteria sketched by Hazelrigg, we assert that our approach satisfies a majority of the desired features and does not suffer from pitfalls common to other methods. The reliance on performance functions rather than on the physical system makes our model independent of the engineering discipline and allows for rational analysis by a decision maker who is unfamiliar with all the details of the system under consideration. Performance functions can potentially be derived from underlying physical (or economic) models where appropriate, or they can be based on simulations. This independence also means that the approach can be applied to non-physical systems as well as the physical systems traditionally in the purview of systems engineering. The process limits the amount of information which must be passed between levels. Of course, if all information is available to decision makers at adjacent levels then there is no need to distinguish between levels. The use of stochastic analysis guarantees that decisions are analyzed under conditions of uncertainty, with exogenous variables modeling uncertainty at each level. Through the use of multicriteria analysis, the set of feasible decisions at the lower level is rationally and consistently narrowed down to a set of candidate decisions that are passed up to the higher level, at which they become feasible alternatives. The final preferred decision at the top level is made according to the preferences expressed by the decision maker. The work described here opens up new directions for research. The approach as outlined in the example provides a framework for many focused investigations.
In particular, we see opportunities to apply the system modeling/simulation methodology to problems based on real data, to clarify the use of stochastic analysis in a multiple-criteria decision process, and to examine numerical methods for large-scale simulations, i.e., systems with many components and choices. More research is needed to explore the application of multicriteria optimization in aiding the decision maker to arrive at a preferred decision. Since the resulting problems might be of high dimension, techniques have to be developed to reduce the problems to a level at which the analysis can be easily comprehended.

Outline of the remainder of the paper. Section 2 describes the modeling process. At each level, each task or component is modeled as a linear transformation defined on a set of performance functions. All of the performance functions at a given level depend on the same set of exogenous variables. Relations among component choices, i.e., which components influence which, are describable using a network with nodes (components) representing operators
and links representing performance functions. A bi-level example of a simple industrial production model is introduced and the modeling of tasks and their interactions is carried out. We take up, in order, simulations of components and computation of the system performance function in terms of external influences. These computational tools are used in the discussion of decision methods which follows. The decision process is described in Section 3. At each level, beginning with the lowest, a multicriteria optimization technique is used to search for nondominated upgrade choices from among affordable alternatives for each task. These solutions form the set of choices that are integrated by the decision maker to form the upgrade alternatives that are passed to the next higher level. The decision maker at the top level chooses a single preferred solution from among combinations of alternatives provided for the subsystems at this level. Section 4 discusses stochastic analysis, including the two sources of decision uncertainty, the modeling of component risk, and the evaluation of a final design (solution) in terms of system risk. The technical details of the stochastic analysis appear in the Appendix. Finally, in the conclusion (Section 5) we discuss the applicability of the methodology and the value of our approach, and we list some questions to be resolved.
2.
MULTILEVEL COMPLEX SYSTEMS
In general, a complex system is viewed as a system composed of several decision levels with the top level referred to as the master level and lower levels
(see Figure 15.1). Each level below the master level might have multiple decision makers. Conceptual modeling proceeds from the top down. At the master level, the overall system (the enterprise) is modeled as a single task. At every level but the lowest, a task can be decomposed into subtasks assigned to lower level decision makers who do not interact. The lowest level is reached when tasks are not decomposed further. Each decision maker, in decomposing a task, develops a conceptual model of interactions among subtasks (see Figure 15.2). Tasks are viewed as networks whose nodes represent subtasks to be accomplished, with arcs representing performance functions. Also, for each task, the subtasks are accomplished in an environment modeled by a set of exogenous variables, so the performance functions may depend on several independent variables. The network structure indicates how the performance of each subtask influences and is influenced by other subtasks. Thus a node (a subtask) acts on performance functions represented by inbound arcs to produce a performance function associated with outbound arcs. Computational models are developed and decisions made from the bottom up. During the decision making phase, methods for accomplishing the tasks are associated with the nodes. Starting at the lowest level, decision makers develop alternative feasible methods for their tasks, develop computational models for these methods, and pass methods and models up to the next higher level. Methods for a task are feasible provided they satisfy the cost constraint assigned to that task. Computational models at the next higher level are constructed from the sub-method models using the previously developed conceptual models. Selections are made from the alternative feasible methods and these methods and models are passed up to the next higher level.
At the master level, computational models of feasible methods for the system task are constructed and a preferred method is chosen from among the alternatives. The overall goal is to find a preferred method satisfying the enterprise cost constraint, i.e., an affordable selection of upgrades. Methods are modeled by means of linear operators relating the input performance of a task to the output performance. The performance functions may be vector or scalar valued and are defined on the space of exogenous variables characteristic for the task. By restricting models of methods to linear operators, computational models at higher levels can be obtained by “plugging” the alternatives from below into the conceptual models. Models for methods (linear operators) are based on stochastic linearizations of the underlying components. The theory and methodology of stochastic linearization are discussed elsewhere (Reneke, 1997; 2001). The convergence theory is well developed for the one variable (signal) case (Reneke et al., 1987; Reneke, 1998). The theory for the multiple variable (response surface) case is emerging but is complete for significant special cases. Stochastic linearization,
discussed further in Section 4, provides an “easy to apply” modeling methodology based on natural assumptions for sub-method behavior. In the two-dimensional case, real-valued performance functions can be thought of as (response) surfaces. Further, discrete surfaces can be represented as matrices. Finite-dimensional approximations for each linear operator in this case require two matrices: a linear operator M acting on a discrete response surface, stored as a matrix, can be represented by a pair of matrices multiplying the surface matrix on the left and on the right. The decision making process for a task involves a choice of efficient selections of methods (linear operators) from among feasible selections of methods by which the task may be performed (see Figure 15.2). The process starts at the lowest level, where there is a collection of unrelated simple tasks that cannot be further decomposed and which do not interact. Each task is assigned to a decision maker who selects a set of efficient methods for the task. These efficient methods are then passed to the higher level.
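The two-matrix representation can be sketched numerically. Everything concrete below is an illustrative assumption (the text's original symbols and dimensions were lost in extraction); the sketch uses the form in which one matrix multiplies the discrete surface on the left and the other on the right:

```python
import numpy as np

# Hypothetical discrete response surface S on an m-by-n grid of the
# two exogenous variables (m, n and the kernels A, B are illustrative).
m, n = 4, 5
rng = np.random.default_rng(0)
S = rng.random((m, n))        # discrete response surface stored as a matrix

A = 0.5 * np.eye(m)           # m-by-m matrix acting on the first index
B = 2.0 * np.eye(n)           # n-by-n matrix acting on the second index

def apply_operator(A, B, S):
    """Apply the linear operator represented by the pair (A, B): S -> A @ S @ B."""
    return A @ S @ B

T = apply_operator(A, B, S)
assert T.shape == S.shape     # the operator maps surfaces to surfaces
```

With these particular diagonal kernels the two scalings cancel, so T equals S; in general A and B would encode a component's response along each exogenous variable.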
Each level task, as well as the master-level task, is evaluated for every efficient selection of methods for every subtask, and a set of efficient selections of methods for that task is produced. In particular, a level task's output performance function is computed for every feasible, i.e., affordable, selection of methods for its subtasks. This process yields the set of feasible selections of methods for the task. The set of efficient selections of methods is then found among the feasible selections and passed to the higher level, at which the task becomes a subtask. When a set of efficient selections of methods is produced for the master-level task, this level's decision maker chooses a preferred selection of methods from among the efficient selections; this becomes the optimal selection for the system and concludes the decision making process. As stated above, each level task has a set of exogenous variables modeling the task's uncertain environment. When efficient selections of methods are passed from subtasks up to a higher level task, exogenous variables not common to all subtasks making up the task are eliminated. The set of efficient selections of methods for the master-level task is computed in the space of exogenous variables characteristic only for that task, and all exogenous variables are eliminated when the optimal selection has been found.
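A minimal sketch of this bottom-up filtering, with made-up method names, costs, and budget (none of these come from the text): combined selections are formed from the subtasks' candidate methods and only the affordable ones are retained, before the efficiency screening of Section 3 further prunes the set.

```python
from itertools import product

# Illustrative candidate methods (name, cost) for three lowest-level subtasks.
methods = {
    "M": [("M1", 10), ("M2", 25)],
    "F": [("F1", 30), ("F2", 15)],
    "D": [("D1", 20)],
}
budget = 60  # hypothetical cost constraint for the higher-level task

# Form all combined selections and keep the affordable (feasible) ones;
# these would then be screened for efficiency and passed up a level.
feasible = [
    sel for sel in product(*methods.values())
    if sum(cost for _, cost in sel) <= budget
]
print(len(feasible))  # → 3 affordable selections out of 4 combinations
```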
2.1.
AN ILLUSTRATIVE EXAMPLE
In this paper, we focus on a bi-level example modeling a simplified industrial production system. The model consists of two levels. The master level represents the entire production process. The lower level consists of three components: a component M for acquiring and storing raw materials, a component F for fabricating the product, and a component D for distributing the product. (See Figure 15.3.) Each component corresponds to a specific task, and the performances of the tasks interact to produce the enterprise performance. As mentioned above, the system is part of a larger environment, beyond the control of the decision maker, which exerts an influence on system performance. We assume that the external environment is characterized by two variables: the potential demand for the product and the total industry capacity for producing the product. The influence of each component on other components is characterized by a performance function depending on these two variables. Also, there is a performance function reflecting the availability of inputs of raw material and a performance function reflecting actual customer demand for product from this particular enterprise. In general, the performance functions will be vector valued, but for this simplified example we assume the functions of two variables are scalar valued, i.e., scalar fields. The units do not have to be the same for each component.
Note that the two environmental variables reflect the importance assigned to the materials component. We could have decided that the distribution component was the most important and chosen different environmental variables. The system to this point can be visualized as in Figure 15.4, where the arrows indicate relations between components. For instance, the performance of the fabrication component depends on the performance of the materials component and influences the performance of the distribution component. The performance functions require engineering insight and may be based on detailed investigation of the individual components. For this simplified presentation we assume that one performance function models the availability of raw material in the market, one models material purchased and stored and available for fabrication, one models product available for distribution, one models customer demand, and one is the enterprise performance function, measuring unmet consumer demand. The raw-material availability function might look like Figure 15.6. (Note that in this figure and subsequent figures the surfaces are computed at grid points; the independent axes are labeled by the grid indices.) As potential demand and industry capacity increase, the supply of material increases and prices fall. If there is no
potential demand or no production capacity, then the supply of raw materials must be low. The values are scaled between 0 and 1, but there is nothing significant in this. Associated with each component (or subtask) is a method modeled by a linear operator which transforms the sum of the performance functions affecting that component into that component's performance function; we use the component designation for the associated linear operator. We complete the diagram of the master-level network in Figure 15.5. The added feedback operators are dictated by nature, i.e., are not subject to the decision maker, and all arrows emanating from a component modeling
component performance are equal. Thus the diagram models not flows of material (there is no conservation law) but the interrelationships among components. The model given above describes an existing system. In the upgrade problem we assume that each subtask can be accomplished using one of several methods. The overall goal is to optimize enterprise performance by choosing one method (linear operator) for each of the subtasks. Note that the choices interact, complicating the decision process.
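Since the original symbols were lost in extraction, here is a purely hypothetical discretization of a raw-material availability surface with the qualitative behavior described above: increasing in both potential demand and industry capacity, low when either is zero, and scaled to [0, 1].

```python
import numpy as np

# Grid over the two exogenous variables: potential demand and industry capacity.
demand = np.linspace(0.0, 1.0, 21)
capacity = np.linspace(0.0, 1.0, 21)
D, C = np.meshgrid(demand, capacity, indexing="ij")

# Hypothetical functional form: availability grows with both variables and
# vanishes when either is zero, matching the qualitative description.
m0 = np.sqrt(D * C)

assert m0.min() == 0.0 and abs(m0.max() - 1.0) < 1e-12
```

Plotting m0 against the grid indices would give a surface of the kind shown in Figure 15.6.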
2.2.
COMPUTATIONAL MODELS FOR THE EXAMPLE
To illustrate the methodology we have provided linear transformations to model M, F, and D, and assume that the feedback is the negative of the identity operator. Again, the input function is chosen as in Figure 15.7. To aid in visualizing the effect of applying the component operators, we apply them
to the input function. The results are displayed in Figure 15.7. Of interest is how the operators (methods) transform an input function. The response of M increases rapidly where the exogenous variables are small but increasing; the response is flatter where they are large. Similar observations could be made for F and D. Combining the component models in the feedback network of Figure 15.5, we obtain the enterprise performance, Figure 15.8.
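The feedback composition of Figure 15.5 can be sketched as a fixed-point iteration. Everything concrete here is an illustrative assumption: the component operators are simple pointwise scalings rather than the text's actual transformations, and the feedback is the negative identity as in the text.

```python
import numpy as np

grid = (10, 10)
rng = np.random.default_rng(1)
m0 = rng.random(grid)   # exogenous raw-material availability surface
d0 = rng.random(grid)   # exogenous customer-demand surface

# Stand-ins for the component operators M, F, D (pointwise scalings).
M = lambda s: 0.9 * s
F = lambda s: 0.8 * s
D = lambda s: 0.7 * s

# Iterate the loop to a fixed point: each component transforms the sum of
# the performance functions feeding it; the feedback operator is -identity.
f_out = np.zeros(grid)
for _ in range(200):
    m_out = M(m0 - f_out)   # materials respond to availability minus feedback
    f_new = F(m_out)        # fabrication responds to the materials output
    if np.max(np.abs(f_new - f_out)) < 1e-12:
        break
    f_out = f_new

d_out = D(f_out)            # distribution responds to fabrication
e = d0 - d_out              # enterprise performance: unmet customer demand
```

Because the loop gain here is 0.72, the iteration contracts to the unique fixed point f = 0.72 m0 / 1.72; any stable choice of operators would converge similarly.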
3.
MULTIPLE CRITERIA DECISION MAKING
Multiple criteria decision making (MCDM) is a field of system science dealing with complex systems performing in the presence of multiple conflicting criteria. In our multilevel model, the space of exogenous variables of each task creates such an environment since the best task performance varies depending upon regions or even points of this space. There is no single response surface representing the best task performance but rather a set of alternatives (response surfaces) competing with each other over that space. Analogously, there is no single response surface representing the best overall system performance. Within this framework we apply two classic stages of MCDM: the optimization stage of screening for efficient solutions (nondominated outcomes) from among feasible alternatives and then the decision stage of choosing a preferred efficient solution (nondominated outcome). Several effective decision aids supporting the latter are collected by Olson (1996), while theoretical foundations and advances are thoroughly analyzed by Roy (1996). The optimization stage is performed for every task at every level of the system and involves the elimination of task response surfaces that are outperformed by other surfaces over the entire space of exogenous variables for this task. The optimization produces a set of candidate methods for a task that will be passed up to the next higher level at which this task becomes a subtask.
The decision stage is performed only for the master-level task and yields a final preferred response surface and the corresponding selection of methods which determine a preferred selection of upgrades for the system. In this section, we first formulate an optimization problem suitable for the optimization of a task and then discuss how a preferred selection of upgrades could be found. The presented approaches and methods are then illustrated on the bi-level example.
3.1.
GENERATING CANDIDATE METHODS
The quality of the upgrade decision is gauged by a system performance measure that is not available in closed form but rather in the form of a response surface determined by the values achieved at some preselected grid points. The decision maker is interested in choosing a selection of methods for each task such that the system performance is at its best at every grid point at which this performance is evaluated. Given the space of exogenous variables of a task, every grid point of this space can be viewed as a criterion (or scenario) with respect to which the task performance should be optimized, and the decision maker is to choose a selection that would be as good as possible across all the criteria (scenarios). In this context, the upgrade problem becomes a multiple-criteria optimization problem, or a multiple-grid-point optimization problem, and should be treated within a multiple-criteria decision making (MCDM) framework. However, it is not a typical MCDM problem, in which a preferred decision has to be made with respect to several (typically fewer than 10) objective functions available in closed form. The upgrade problem will usually have a large number of grid points (100-200), so the resulting optimization problem will have the same large number of criteria, represented by the values on the response surfaces. Consider the set of all feasible selections of methods for a task at some level and the set of all surfaces associated with the feasible selections. The task's decision maker has obtained the selections from lower level decision makers and checked the feasibility of the selections against a cost constraint for the task. A selection of methods is feasible if it satisfies the cost constraint for the task. Consider also the set of grid points selected in the domain of exogenous variables (see page 308).
For each surface, consider its value at each grid point. Depending on the application, the optimal performance level at a grid point might be the maximum or minimum achievable, or it may be a particular fixed target value between the maximum and minimum. The collection of target values at the grid points determines a reference surface S* passing through the
target value at each grid point. The surface S* might not be a response surface for the system. A surface associated with one selection is said to dominate a surface associated with another selection if at each grid point the value of the first is at least as close to the reference value as that of the second, and is strictly closer at at least one grid point. Thus, a nondominated surface is one for which no other surface associated with a feasible selection of methods is at least as close to the reference at every grid point and strictly closer at some grid point. The method selections associated with each nondominated surface are called efficient. With respect to this model, generating the nondominated surfaces of the multiple grid optimization problem leads to solving the following multiple objective program
where the minimum is vector valued.
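A minimal sketch of the screening just formulated, under illustrative assumptions: candidate surfaces are rows of a matrix of values at the grid points, the reference surface is all ones, and a surface is kept only if no other feasible surface is at least as close to the reference everywhere and strictly closer somewhere.

```python
import numpy as np

def nondominated_surfaces(surfaces, reference):
    """Indices of surfaces not dominated w.r.t. closeness to `reference`.

    surfaces: k-by-p array (k candidate surfaces, p grid points);
    reference: length-p array of target values at the grid points.
    """
    dist = np.abs(surfaces - reference)   # per-grid-point deviation
    keep = []
    for i in range(len(dist)):
        dominated = any(
            np.all(dist[j] <= dist[i]) and np.any(dist[j] < dist[i])
            for j in range(len(dist)) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Three candidate surfaces over four grid points; target value 1 everywhere.
S = np.array([[0.9, 0.8, 0.7, 0.6],
              [0.6, 0.7, 0.8, 0.9],
              [0.5, 0.6, 0.7, 0.8]])   # third row is dominated by the second
print(nondominated_surfaces(S, np.ones(4)))  # → [0, 1]
```

The first two surfaces survive because each is closer to the target at some grid points and farther at others, so neither dominates the other.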
3.2.
CHOOSING A PREFERRED SELECTION OF UPGRADES
Consider the set of all nondominated surfaces for the master-level task. Given this set, the master decision maker is interested in choosing a preferred nondominated surface and an efficient selection of methods that produces the chosen surface. This process can be performed in many different ways, due to the incomplete preference information provided by the multiple grid optimization problem at the master level and the resulting need to introduce additional information to resolve the selection process. One can apply a variety of methods developed in the area of multiple criteria decision analysis (MCDA) that deal with choosing the best alternative from among a finite number of choices. In this study we explore two methods that we adapt to our specific application. One is the well known method of choosing a preferred nondominated solution with respect to the reference (ideal) solution, while the other combines the preference ranking of solutions with their costs to produce a new, higher-level multicriteria problem. Ranking nondominated surfaces. If the target is to maximize the performance level, then we can construct the reference surface as follows: at every grid point we find a surface that yields the best system performance. This is accomplished by solving the single grid point optimization
316
MODELING UNCERTAINTY
problem at each grid point
Let be an optimal solution of (3.1) with the optimal value Having solved problem (3.1) for every grid point, we obtain a collection of optimal surfaces. Now define a utopia surface S* as the upper envelope of this collection, made of the portions of the optimal surfaces visible from above. The surface S* might not be a response surface, but it represents ideal system performance over the entire domain of exogenous variables. Reference surfaces for performance measures to be minimized can be constructed similarly. For finite target values, the reference surface simply passes through the target value at each grid point. We now choose a preferred nondominated surface as the one closest to the reference surface, where a norm of choice measures the distance between surfaces in the multidimensional space of grid points. When constructing the norm, the decision maker can assign different weights to each grid point in order to model the probability of the system having to perform within a neighborhood of the values of exogenous variables related to that point. One may partition the domain of exogenous variables into B subsets and assign to each of them a probability chosen by the decision maker, with the probabilities summing to one. Hence, the weight assigned to each point in a subset equals the probability of the subset divided by the number of grid points it contains. In addition to variations in the probability of encountering certain operating conditions, the decision maker may make judgments regarding the importance of achieving optimal performance under such conditions. For example, there may be a rare combination of operating conditions under which optimal system performance is critical. These judgments can be captured by eliciting from the decision maker weights associated with the grid points (or with regions, as above). Larger weights penalize poor performance at those points. Define the weighted distance for each surface by computing
where
We then solve the problem
Affordable Upgrades of Complex Systems
The preferred surface is the one that achieves the minimum. Note that the definition of the norm can be extended to treat over-performance and underperformance asymmetrically if appropriate. Ranking versus cost. An alternative approach is to compute a weighted distance from the reference surface for each surface. This value can be treated as a single criterion in a bicriteria optimization problem, where nondominated surfaces (based on grid points) are to be compared based on value and implementation cost. The higher-order decision problem is
where is the minimum cost of an efficient selection of methods corresponding to nondominated surface and the minimum is vector valued. The solutions to this problem are nondominated in the sense that any decrease in distance from the reference surface comes at the expense of an increase in cost and any decrease in cost is accompanied by an increase in distance. Identifying such solutions can be accomplished using weighted Tchebycheff norms or other techniques for bicriteria optimization. The decision maker must then choose a most preferred nondominated surface using other preference identification techniques.
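The reference-surface construction and weighted-norm ranking described above can be sketched numerically. The following is a minimal illustration with hypothetical data (three candidate surfaces on a three-point grid), assuming performance is to be maximized so that the utopia surface is the pointwise maximum:

```python
import numpy as np

# rows: nondominated surfaces; columns: grid points (performance, larger better)
surfaces = np.array([[0.8, 0.6, 0.9],
                     [0.7, 0.9, 0.7],
                     [0.9, 0.5, 0.8]])

# utopia surface: pointwise best performance over all candidate surfaces
utopia = surfaces.max(axis=0)

# region probabilities chosen by the decision maker; grid point g belongs
# to region[g], and its weight is that region's probability divided by
# the number of grid points in the region
region = np.array([0, 0, 1])
p = np.array([0.5, 0.5])
counts = np.bincount(region)
w = p[region] / counts[region]

# weighted l1 distance of each surface from the utopia surface
dist = (w * np.abs(surfaces - utopia)).sum(axis=1)
print(dist.argmin())  # → 0
```

The region weights implement the scheme described in the text: each region's probability is split equally among its grid points, so rarely visited regions contribute less to the distance.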
3.3.
APPLICATION TO THE EXAMPLE
We continue with our outline of the example and assume that each of the three decision makers at level 1 has already chosen efficient methods for the task from among feasible (affordable) methods for that task (i.e., those methods that satisfy a cost constraint for the task). Assume for illustration that there are three efficient methods for task M of acquiring and storing raw materials to be used by the fabrication component. Materials are to be purchased on an open market with supplies and prices influenced by the environmental variables, size of potential demand for the product and total industry capacity for producing the product. The methods are
1 Purchase for “just in time delivery,” minimizing storage before fabrication.
2 Purchase and storage of critical raw materials subject to scarcity, minimizing interruptions in the fabrication process.
3 Development of flexible handling and storage facilities capable of responding to changing raw material requirements.
We do not discuss modeling these methods but assume the methods are equally effective under “normal” operating conditions, with costs for implementation increasing from method 1 to method 3. Recall that the models are to serve the decision maker. A detailed model of any of the suggested methods might be very complex, with many variables and parameters. Further, any reasonable physical model would almost certainly be nonlinear. However, the decision maker is only concerned with how a method transforms an input performance function into an output performance function. Even so, these simplified models must be restricted to linear transformations so the decision maker can account for interactions among choices. Similar comments can be made about the other two components, F and D. So for each component we assume three efficient choices of methods in ascending order of cost. The enterprise performance function for the lowest cost choices for M, F, and D is given in Figure 15.9. Among the twenty-seven choices of methods assumed to be feasible for the master-level task, seven selections produce nondominated enterprise performance functions. The efficient selections are {111, 311, 312, 321, 322, 331, 332}, where a selection such as 312 means we have chosen method 3 for M, method 1 for F, and method 2 for D. Recall that every nondominated surface exceeds any other surface at one or more points. We might attempt comparing the nondominated surfaces on subregions of the set of exogenous variables in order to pick out preferred solutions. In Figures 15.10 and 15.11, we produce graphs of the diagonals of the nondominated enterprise functions. To illustrate the selection of a preferred surface at the master level, we calculate the weighted norm of the difference between each nondominated surface and the reference surface, which passes through performance level 0 at each grid point. Based on the shape of the input performance function in the example, we partition the domain of exogenous variables into three diagonal bands and assign probability 1/2 to the central band and probability 1/4 to the bands on each side.
The decision maker assigns importance 1 to each grid point. The weighted norm is then computed for each nondominated surface. The cost of the methods associated with a surface is taken to be the sum of the digits in the triple identifying the surface. Thus, we have the data in Table 15.1. The norms and costs are plotted in Figure 15.12.
It can be seen that the surface with the most desirable norm is 312. When norm and cost are treated as two criteria to be minimized simultaneously, the nondominated surfaces with respect to these criteria are 111, 311, and 312. The decision maker must select from among these the final surface and associated methods.
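The screening of norm versus cost can be illustrated in a few lines. The cost of a selection is its digit sum, as in the text; the norm values below are hypothetical stand-ins for Table 15.1, chosen so that the screening reproduces the selections reported in the preceding paragraph:

```python
# hypothetical weighted norms for the seven efficient selections
# (Table 15.1 is not reproduced here; these values are illustrative only)
norms = {'111': 0.40, '311': 0.30, '312': 0.25, '321': 0.33,
         '322': 0.29, '331': 0.35, '332': 0.31}
cost = {s: sum(int(d) for d in s) for s in norms}   # cost = digit sum, as in the text

def bicriteria_nondominated(norms, cost):
    """Selections not dominated in (norm, cost), both minimized."""
    keep = []
    for s in norms:
        if not any((norms[t] <= norms[s] and cost[t] <= cost[s]
                    and (norms[t] < norms[s] or cost[t] < cost[s]))
                   for t in norms if t != s):
            keep.append(s)
    return sorted(keep)

print(bicriteria_nondominated(norms, cost))  # → ['111', '311', '312']
```

With these illustrative norms, 312 has the best norm, and the only selections surviving the bicriteria screen are the cheap option 111, the intermediate 311, and the best-norm 312, mirroring the trade-off the decision maker faces in the text.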
4.
STOCHASTIC ANALYSIS
In practice, systems must be analyzed under uncertainty. The first source of uncertainty is the environment in which the system operates, modeled in this paper by the exogenous variables. Another source of uncertainty comes from system inputs. In the example we can consider either or both of the system inputs as random. Figure 15.13 illustrates the case of random customer demand and Figure 15.14 shows the corresponding enterprise performance. We always assume that the random input functions have been modeled, leading to a stochastic decision problem which we can discuss in terms of risk. The nature of the process for generating candidate methods in the previous section precludes the use of models with random inputs. Therefore we envision resolution of the stochastic decision problem in two steps.
1 Optimize the deterministic system as in the previous section using expected inputs.
2 Analyze the risk for the preferred choices using random inputs.
If the risk is unacceptable, then iterate after eliminating the component choices that contribute the most to enterprise risk. This pruning process will be aided by considering the results obtained from simulations of the complete model.
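The two-step procedure above can be sketched as a Monte Carlo risk check. Everything here is a hypothetical stand-in for the real models: the damping factors mimic methods that attenuate input noise to different degrees, and risk is measured as the variance of simulated enterprise performance:

```python
import random

def enterprise_performance(selection, noise):
    """Hypothetical stand-in for the system model: maps a method
    selection and an input-noise sample to a performance value."""
    damping = {'111': 1.0, '311': 0.5, '312': 0.3}[selection]
    return 1.0 - damping * abs(noise)

def risk(selection, n_samples=10_000, seed=0):
    """Step 2: estimate performance variability under random inputs."""
    rng = random.Random(seed)
    vals = [enterprise_performance(selection, rng.gauss(0.0, 1.0))
            for _ in range(n_samples)]
    mean = sum(vals) / n_samples
    return sum((v - mean) ** 2 for v in vals) / n_samples

# selections surviving the deterministic step 1 are then compared by risk
for s in ('111', '311', '312'):
    print(s, risk(s))
```

Using a common seed makes the comparison a paired one: every selection sees the same noise samples, so differences in estimated risk reflect the selections rather than sampling noise.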
4.1.
RANDOM SYSTEMS AND RISK
Risk is associated with random variability in the enterprise performance function. If we admit random elements into our system model, then random variability in performance cannot be eliminated. However, the variability can be “shaped” by our choices of methods for system tasks. Methods are graded in terms of risk by their ability to damp out variation in the input function: high risk methods damp out the least and low risk methods the most. In general, the goal is risk reduction, which in the model is equivalent to reducing random variation in the enterprise performance function. The master-level decision maker must determine whether the variability in the enterprise performance as given in Figure 15.14 is acceptable when customer demand, for example, has the variability pictured in Figure 15.13. A system upgrade decision problem forces us to balance cost and risk. Simplistically choosing the cheapest option for every method might result in a design that is unacceptable in terms of risk. Alternatively, always choosing the least risky method for each task might result in a design that is too costly to implement. The upgrade problem leads us to search for those components where the cheaper (perhaps riskier) choices have the least effect on risk for the overall system.
4.2.
APPLICATION TO THE EXAMPLE
The stochastic version of the performance function modeling availability of raw materials is assumed to be of the form where Z is a two-dimensional random surface defined over the domain of exogenous variables. We make a similar assumption for modeling random customer demand. Consider the three efficient methods for task (or component) D from Subsection 3.3, pictured in Figure 15.15 in terms of their transformations. In Figure 15.16 we see the action of these transformations. (The same sample surface for Z was used in all of these graphs and all of the succeeding ones.) The random variation of the component response is reduced more as we proceed from the cheapest to the most expensive alternative. Thus we conclude that the methods can be ranked from riskiest to least risky in the same order. Under “normal” conditions, i.e., no noise, the three methods as part of a complete system perform about the same. See Figure 15.17 for the enterprise performance functions for the three options. In Figure 15.18 we see the three enterprise functions for the stochastic cases. Clearly the least risky option yields the best performance for the overall system. In Figure 15.19 we see the system performance when the randomness enters through the other input. The three options for D again perform about the same: the noise enters the system too late to be influenced by D. If a riskier option costs less, then under these circumstances that option merits consideration. This discussion of risk concludes our treatment of the stochastic optimization problem. However, the introduction of random input surfaces enables us to explain the modeling methodology used for the component operators in the decision process. This material is presented in Appendix 15.A.
5.
CONCLUSIONS
This paper presents a novel system-science approach to complex systems decision making. A system to be upgraded is modeled as a multilevel network whose nodes represent tasks to be accomplished and whose arcs describe the interactions of component performance measures. Since a task can be carried out by a collection of methods, a decision maker is interested in choosing a best method for each task so that the system performance is optimized. Thus the well-known concept of a network, typically used to model physics-based subsystems, components of the system, and the flows between them, now becomes a carrier of higher-level performance analysis conducted on the tasks the system is supposed to accomplish and on the available methods. This performance-based rather than physics-based framework is independent of engineering discipline and results in other desirable properties highlighted in the paper. At each decision level, the system performance depends upon exogenous variables modeling the uncertainty related to the environment in which the system may operate. This source of uncertainty is accounted for in the decision process through the use of multicriteria optimization. Uncertainty in
the selection chosen for implementation associated with random inputs (i.e., performance functions) is assessed using a stochastic analysis technique. The methodology has been examined for the case of two exogenous variables and is under development for the more general case. Also required for application are component models, an understanding of the relations between components, and mean input surfaces. Finally, for stochastic inputs the random part of the input surface is limited to the standard Wiener surface. The value of our approach consists of
1 The comprehensive approach we have taken to multilevel decisions.
2 The simplified modeling methodology for decision making.
3 The framework allowing for decisions with uncertainty.
Many questions have yet to be resolved: How to extend the methodology to more than two exogenous variables? How to quantify risk in the context of the upgrade problem? What is the proper relationship for physics based models vs. performance based models in decision making? In addition, it will be necessary to develop processes for carrying out the analysis that can be implemented by decision makers working with real systems.
APPENDIX: STOCHASTIC LINEARIZATION The method of stochastic linearization is reviewed in the next subsection in the context of stochastic processes. In the last subsection we consider stochastic linearization in the context of random surfaces with the aid of an assumption on the surface covariance kernels. The factorization of the surface kernels following from this assumption enables us to present a representation of linear operators generating the random surfaces.
1.
ORIGIN OF STOCHASTIC LINEARIZATION
Normal processes have the property that all of the information about the process obtainable from observations is already contained in the process mean (a function of one variable) and the process covariance (a function of two variables). Thus normal processes are the easiest to analyze. Stochastic processes generated by linear systems are normal. Stochastic linearization is a method of analysis which attempts to approximate the underlying generating system (assuming a zero mean response) with a linear system which in turn generates a normal process having the mean and covariance of the original process. If the underlying system has a known dynamical model (the case considered in control applications), then stochastic linearization has a certain flavor. If the only information available for the underlying system is the mean and covariance of the observation process (the case considered in signal processing applications), then stochastic linearization has a different flavor. Our work derives from the latter case (Cover et al., 1996). Suppose a given process has mean function and covariance function where and Here, is the and is the matrix defined by (This shorthand is used throughout the remainder of this section.) If is (0, 1)-normal with independent components, then where is the upper Cholesky factor of has mean and covariance i.e., provides a stochastic linearization of the underlying system generating the given process. The linear system approximates the underlying system in that the two have the same second order statistics.
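The Cholesky construction just described can be sketched numerically. The covariance kernel below, min(s, t), is an illustrative choice (the Wiener kernel), not data from the paper; the point is only that the linearization reproduces the prescribed mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

# target second-order statistics of the observed process at n sample times
n = 4
m = np.zeros(n)                    # mean function samples
t = np.arange(1, n + 1, dtype=float)
K = np.minimum.outer(t, t)         # covariance matrix, here min(s, t)

# upper Cholesky factor C with K = C.T @ C
# (np.linalg.cholesky returns the lower factor L = C.T)
C = np.linalg.cholesky(K).T

# x = m + C.T z with z standard normal has mean m and covariance C.T C = K
z = rng.standard_normal((n, 100_000))
x = m[:, None] + C.T @ z

print(np.allclose(np.cov(x), K, atol=0.1))
```

The sample covariance of the simulated process matches the target matrix to within Monte Carlo error, which is the sense in which the linearization and the original system share second-order statistics.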
2.
STOCHASTIC LINEARIZATION FOR RANDOM SURFACES
For scalar fields defined on two-dimensional rectangles, sufficient conditions on the system covariance kernel are given for the development of a system linearization based on a factorization of the system covariance kernel. The mathematical question of limitations imposed by the covariance condition can be discussed in terms of properties of the resulting linearizations. A set of reasonable conditions on linear systems can be shown to be equivalent to the covariance condition. Thus the methods apply to a rich class of random fields.
Basic hypothesis. The central idea in this subsection is to produce a condition on the covariance kernel of a random field (function of two variables) which, for fields F generated by linear systems, i.e., of the form F = AW, is sufficient to produce a representation of A. (Here W is the standard Wiener field; see below.) If F is not generated by a linear system, then the method produces a linearization A of the unknown nonlinear operator Â. In this case the utility of the linearization depends upon the particular application and upon how nearly the (perhaps uncheckable) condition holds.
The condition is the following: for each pair of points (u1, v1) and (u2, v2) in [0, U] × [0, V], where cov(·) is the covariance operator. (See Vanmarcke (1988), p. 82.) Note that the condition is on the observation field F rather than on the underlying unmodeled system. In general, the independence expressed in the condition might not hold. However, such independence is implicit in the common engineering practice of exploring a given physical system by allowing only one quantity to vary at a time.
The Standard Wiener field. The standard Wiener field W on [0, U] × [0, V] has the following defining properties: sample fields are continuous, and if (u1, v1) and (u2, v2) lie in [0, U] × [0, V], then E[W(u1, v1)] = 0 and E[W(u1, v1)W(u2, v2)] = min(u1, u2) min(v1, v2), where E[·] is the expectation operator. Simulation of the standard Wiener field proceeds as follows. On a grid of points, a field built by accumulating independent mean-zero Gaussian increments with the appropriate variances has mean zero and the covariance above, i.e., it is indistinguishable from W at the grid points.
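A minimal simulation sketch of the construction just outlined, assuming a uniform grid: independent Gaussian increments with variance du·dv per cell are accumulated in both directions. This is the standard discrete construction of the Brownian sheet, not code from the paper:

```python
import numpy as np

def wiener_field(U, V, nu, nv, rng):
    """Sample the standard Wiener field on an nu-by-nv grid over [0,U]x[0,V].

    Independent Gaussian increments with variance du*dv per cell are
    accumulated in both directions, giving E[W] = 0 and
    cov(W(u1,v1), W(u2,v2)) = min(u1,u2) * min(v1,v2) at the grid points.
    """
    du, dv = U / nu, V / nv
    incr = rng.standard_normal((nu, nv)) * np.sqrt(du * dv)
    return incr.cumsum(axis=0).cumsum(axis=1)

rng = np.random.default_rng(1)
# check the covariance at two fixed grid points by Monte Carlo
samples = np.array([wiener_field(5.0, 5.0, 10, 10, rng) for _ in range(20_000)])
a = samples[:, 3, 7]   # value at (u, v) = (2.0, 4.0)
b = samples[:, 5, 5]   # value at (u, v) = (3.0, 3.0)
print(np.mean(a * b))  # ≈ min(2.0, 3.0) * min(4.0, 3.0) = 6.0
```

The Monte Carlo average of the product of the field at two points recovers the min-product covariance, which is the defining second-order property quoted above.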
Linear models.
We provide a recipe for representations of linear systems generating random fields satisfying the basic hypothesis. The recipe works for a large class of systems. Recall from Section 2 that finite-dimensional approximations for a linear operator T require two matrices. For instance, where is an matrix, W is an matrix, and is an matrix. The representation problem for a linear transformation T reduces to a search for appropriate matrices and If and then for and we have
Hence, TW has the mean and covariance of F and so T is the stochastic linearization of the system generating F. Construction of linear models in the example. The recipe yields, for the component operator M, the representation and where and If the modeler has observations and available then and can be estimated from the data. This scenario might correspond to the case where an implementation of M exists and is available for experimentation.
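The two-matrix representation of a finite-dimensional linear operator acting on a field, as used in the recipe above, can be sketched as pre- and post-multiplication. Phi and Psi below are hypothetical smoothing matrices standing in for the factors a modeler would estimate from data or choose from a class of simple shapes:

```python
import numpy as np

rng = np.random.default_rng(2)

# a discretized Wiener field W on a small grid
n = 6
du = 1.0 / n
W = (rng.standard_normal((n, n)) * du).cumsum(axis=0).cumsum(axis=1)

# a finite-dimensional linear operator T acting on fields: TW = Phi @ W @ Psi.
# Phi and Psi are hypothetical shaping matrices; here simple moving-average
# smoothers that damp row-wise and column-wise variation respectively.
Phi = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
Phi /= Phi.sum(axis=1, keepdims=True)   # rows sum to one (an averaging operator)
Psi = Phi.T

F = Phi @ W @ Psi
print(F.shape)  # → (6, 6)
```

In the paper's setting the two matrices would be fitted so that TW reproduces the mean and covariance of the observed field; here they simply demonstrate the pre/post-multiplication form of the operator.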
If the modeler does not have observations to use in estimating and then they might be based on engineering experience or intuition. Except for scaling, the shapes of and will likely come from a class of simple shapes. The shapes chosen in the example are representative of the possibilities for and Some small amount of data might be useful in establishing the appropriate scales for the operators. Note that in the example the domains were assumed equal ([0, U] = [0, V] = [0, 5]); there is no requirement that this be true. Similar comments hold for the construction of the component operators F and D.
REFERENCES Burman, M., S. B. Gershwin and C. Suyematsu. (1998). “Hewlett-Packard uses operations research to improve the design of a printer production line,” Interfaces 28, 24–36. Carrillo, J.E. and Ch. Gaimon. (2000). “Improving manufacturing performance through process change and knowledge creation,” Management Science 46, 265–288. Cover, A., J. Reneke, S. Lenhart, and V. Protopopescu. (1996). “RKH space methods for low level monitoring and control of nonlinear systems,” Math. Models and Methods in Applied Sciences 6, 77–96. Hart, D.T. and E. D. Cook. (1995). “Upgrade versus replacement: a practical guide to decision-making,” IEEE Transactions on Industry Applications 31, 1136–1139. Hazelrigg, G.A. (2000). “Theoretical foundations of systems engineering,” presented at INFORMS National Meeting, San Antonio. Korman, R.S., D. Capitanio and A. Puccio. (1996). “Upgrading a bulk chemical distribution system to meet changing demands,” MICRO 14, 37–41. Luman, R.R. (1997). “Quantitative decision support for upgrading complex systems of systems,” D.Sc. thesis, The George Washington University, Washington, DC. Luman, R.R. (2000). “Upgrading complex systems of systems: a CAIV methodology for warfare area requirements allocation,” Military Operations Research 5, 53–75. Makis, V., X. Jiang and K. Cheng. (2000). “Optimal preventive replacement under minimal repair and random repair cost,” Mathematics of Operations Research 25, 141–156. Majety, S.R.V., M. Dawande, and J. Rajgopal. (1999). “Optimal reliability allocation with discrete cost-reliability data for components,” Operations Research 47, 899–906. McIntyre, M.G. and J. Meitz. (1994). “Applying yield impact models as a first pass in upgrade decisions,” Proceedings of the IEEE/SEMI 1994 Advanced Semiconductor Manufacturing Conference and Workshop, Cambridge, MA, November 1994, 147–149. Olson, D.L. (1996). Decision Aids for Selection Problems, Springer, New York.
Papalambros, P.Y. (2000). “Extending the Optimization Paradigm in Engineering Design,” Proceedings of the 3rd International Symposium on Tools and Methods of Competitive Engineering, Delft, The Netherlands. Papalambros, P.Y. and N. F. Michelena. (2000). “Trends and challenges in system design and optimization,” Proceedings of the International Workshop on Multidisciplinary Design Optimization, Pretoria, South Africa. Rajagopalan, S. (1998). “Capacity expansion and equipment replacement: a unified approach,” Operations Research 46, 846–857. Rajagopalan, S., M. S. Singh and T. E. Morton. (1998). “Capacity expansion and replacement in growing markets with uncertain technological breakthroughs,” Management Science 44, 12–30. Reneke, J., R. Fennell, and R. Minton. (1987). Structured Hereditary Systems, Marcel Dekker, New York. Reneke, J. (1997). “Stochastic linearizations based on random fields,” Proceedings of the 2nd World Congress of Nonlinear Analysts, Athens, Greece, Nonlinear Analysis, Theory, Methods & Applications 30, 265–274. Reneke, J. (1998). “Stochastic linearization of nonlinear point dissipative systems,” www.math.clemson.edu/affordability/. Reneke, J. (in preparation, 2001). “Reproducing kernel Hilbert space methods for spatial signal analysis”. Rogers, J.L. (1999). “Tools and techniques for decomposing and managing complex design projects,” Journal of Aircraft 36, 266–274. Rogers, J.L. and A. O. Salas. (1999). “Toward a more flexible Web-based framework for multidisciplinary design,” Advances in Engineering Software 30, 439–444. Roy, B. (1996). Multicriteria Methodology for Decision Aiding, Kluwer, Dordrecht. Samuelson, D.A. (1999). “Predictive dialing for outbound telephone call centers,” Interfaces 29, 66–81. Su, C.L., C. N. Lu and M. C. Lin. (2000). “Wide area network performance study of a distribution management system,” International Journal of Electrical Power & Energy Systems 22, 9–14. Vanmarcke, E. (1988).
Random Fields: Analysis and Synthesis, The MIT Press, Cambridge, MA. van Voorthuysen, E.J. and R. A. Platfoot. (2000). “A flexible data acquisition system to support process identification and characterization,” Journal of Engineering Manufacture 214, 569–579. Wallace, E., P. C. Clements and K. C. Wallnau. (1996). “Discovering a system modernization decision framework: a case study in migrating to distributed object technology,” Proceedings of the 1996 IEEE Conference on Software Maintenance, ICSM, Monterey, CA, November 1996, 185–195.
Yan, P., M.-Ch. Zhou and R. Caudill. (2000). “Life cycle engineering approach to FMS development,” Proceedings of IEEE International Conference on Robotics and Automation ICRA 2000, San Francisco, CA, April 2000, 395– 400.
Chapter 16 ON SUCCESSIVE APPROXIMATION OF OPTIMAL CONTROL OF STOCHASTIC DYNAMIC SYSTEMS Fei-Yue Wang Department of Systems and Industrial Engineering University of Arizona Tucson, Arizona 85721
George N. Saridis Department of Electrical, Computer and Systems Engineering Rensselaer Polytechnic Institute Troy, New York 12180
Abstract
An approximation theory of optimal control for nonlinear stochastic dynamic systems has been established. Based on the generalized Hamilton-Jacobi-Bellman equation of the cost function of nonlinear stochastic systems, general iterative procedures for approximating the optimal control are developed by successively improving the performance of a feedback control law until a satisfactory suboptimal solution is achieved. A successive design scheme using upper and lower bounds of the exact cost function has been developed for the infinite-time stochastic regulator problem. The determination of the upper and lower bounds requires the solution of a partial differential inequality instead of equality. Therefore it provides a degree of flexibility in the design method over the exact design method. Stability of the infinite-time sub-optimal control problem was established under not very restrictive conditions, and stable sequences of controllers can be generated. Several examples are used to illustrate the application of the proposed approximation theory to stochastic control. It has been shown that in the case of linear quadratic Gaussian problems, the approximation theory leads to the exact solution of optimal control.
Keywords:
Hamilton-Jacobi-Bellman equation, optimal control, nonlinear stochastic systems
1.
INTRODUCTION
The problem of controlling a stochastic dynamic system, such that its behavior is optimal with respect to a performance cost, has received considerable attention over the past two decades. From a theoretical as well as practical point of view, it is desirable to obtain a feedback solution to the optimal control problem. In situations of linear stochastic systems with additive white Gaussian noise and quadratic performance indices (so-called LQG problems), the separation theorem is directly applicable, and the optimal control theory is well established (Aoki, 1967; Wonham, 1970; Kwakernaak and Sivan, 1972; Sage and White, 1977). However, due to difficulties associated with the mathematics of stochastic processes, only fragmentary results are available for the design of optimal control of nonlinear stochastic systems. On the other hand, there is a need to design optimal and suboptimal controls for practical implementation in engineering applications (Panossian, 1988). The objective of this paper is to develop an approximation theory that may be used to find feasible, practical solutions to the optimal control of nonlinear stochastic systems. To this end, the problem of stochastic control is addressed from an inverse point of view: given an arbitrarily selected admissible feedback control, it is desirable to compare it to other feedback controls with respect to a given performance cost, and to successively improve its design so that it converges to the optimal.
Various direct approximations of the optimal control have been widely studied for nonlinear deterministic systems (Rekasius, 1964; Leake and Liu, 1967; Saridis and Lee, 1979; Saridis and Balaram, 1986), and have appeared more promising than the linearization-type approximation methods, which have met with limited success (Al’brekht, 1961; Lukes, 1969; Nishikawa et al., 1962). For stochastic systems, a method of successive approximation to solve the Hamilton-Jacobi-Bellman equation for a stochastic optimal control problem using quasilinearization was proposed in Ohsumi (1984), but systematic procedures for the construction of suboptimal controllers were not established. This paper presents a theoretical procedure to develop suboptimal feedback controllers for stochastic nonlinear systems (Wang and Saridis, 1992), as an extension of the Approximation Theory of Optimal Control developed by Saridis and Lee (1979) for deterministic nonlinear systems. The results are organized as follows. Section 2 gives the mathematical preliminaries of the stochastic optimal control problem. Section 3 describes major theorems that can be used for the construction of successively improved controllers. For the infinite-time stochastic regulator problem, a design theory using upper and lower bounds of the cost function and the corresponding stability considerations are discussed in Section 4. Two proposed design procedures are outlined and illustrated with
several examples in Section 5. Section 6 concludes the paper with remarks regarding its history and notes to Dr. Sid Yakowitz.
2.
PROBLEM STATEMENT
For the purpose of obtaining explicit expressions, and without loss of generality since the results can be immediately generalized, consider a nonlinear stochastic control system described by the following stochastic differential equation:
where is the state vector of the stochastic system, is a control vector, is a specified compact set of admissible controls, and is a separable Wiener process; and are measurable system functions. Equation (1.1) was studied first by Itô (1951) and later, under less restrictive conditions, by Doob (1953), Dynkin (1953), Skorokhod (1965) and Kushner (1971). It is assumed that the control law satisfies the following conditions: i) Linear Growth Condition,
ii) Uniform Lipschitz Condition,
where is the Euclidean norm and is some constant. For a given initial state (deterministic) and feedback control the performance cost of the system is defined as
with nonnegative functions and J is also called the Cost Function of system (1.1). The infinitesimal generator of the stochastic process specified by (1.1) is defined to be,
where has compact support and is continuous up to all its second order derivatives (Dynkin, 1953), and and are transpose and trace operators, respectively. The differential operators are defined as
A pre-Hamiltonian function of the system with respect to the given performance cost (1.2) and a control law is defined as
The optimal control problem of stochastic system can be stated now, as follows: Optimal Stochastic Control Problem: For a given initial condition and the performance cost (1.2), find such that
If it is assumed that the optimal control law exists and that the corresponding value function is sufficiently smooth, then and V* may be found by solving the well-known Hamilton-Jacobi-Bellman equation (Bellman, 1956),
Unfortunately, except in the case of linear quadratic Gaussian controls, where the problem has been solved by Wonham (1970), a closed-form feedback solution of the Hamilton-Jacobi-Bellman equation for the optimal stochastic control problem cannot be obtained in general when the system of Equation (1.1) is nonlinear. Instead, one may consider the optimal control problem relaxed to that of finding an admissible feedback control law that has an acceptable, but not necessarily optimal, cost function. This gives rise to a stochastic suboptimal control solution that could conceivably be obtained with less difficulty than the original optimal stochastic control problem. The exact conditions for acceptability of a given cost function should be determined from practical considerations for the specific problem. The solution of the stochastic suboptimal control problem that converges successively to the optimal is discussed in more detail in the next two sections.
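The evaluate-then-improve loop that underlies the suboptimal approach can be checked in the scalar linear-quadratic case, where the Riccati equation provides the exact answer. The sketch below is a deterministic-equivalent illustration of the iteration, not the paper's general procedure; for LQG problems the additive noise shifts the cost by a constant and leaves the optimal gain unchanged:

```python
import math

# scalar linear system dx = (a x + b u) dt, cost integral of (q x^2 + r u^2) dt
a, b, q, r = 1.0, 1.0, 1.0, 1.0

k = 3.0   # initial stabilizing gain: a - b*k < 0
for _ in range(20):
    # evaluate the current control u = -k x: solve the (generalized HJB /
    # Lyapunov) equation 2 (a - b k) P + q + r k^2 = 0 for the cost P
    P = (q + r * k * k) / (-2.0 * (a - b * k))
    # improve: minimize the pre-Hamiltonian, giving k = b P / r
    k = b * P / r

# the iterates converge to the Riccati solution 2 a P - b^2 P^2 / r + q = 0
P_star = (a * r + math.sqrt(a * a * r * r + q * r * b * b)) / (b * b)
print(abs(P - P_star) < 1e-9)  # → True
```

Each pass evaluates the current feedback law exactly and then improves it, so the cost iterates decrease monotonically toward the optimal value, which is the behavior the theorems of the next section establish for the general nonlinear stochastic case.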
3.
SUB-OPTIMAL CONTROL OF NONLINEAR STOCHASTIC DYNAMIC SYSTEMS
This section contains the main results of the approximation theory for the optimal solution of nonlinear stochastic control problems. Two theorems, one for the evaluation of the performance of control laws and the other for the construction of lower and upper bounds of value functions, are established first. Then theoretical procedures that can lead to the iterative design of suboptimal controls are developed based on those theorems.
Theorem 1. (Performance evaluation of a control law) Assume that V : is an arbitrary function with continuous V, and , and that it satisfies the condition
where is a suitable constant. Then the necessary and sufficient conditions for to be the value function of an admissible fixed feedback control law i.e.,
are
Proof: From (1.7), using Itô’s integration formula (Itô, 1951), it follows that:
Therefore,
The sufficient condition results from the above equation. For the necessary condition, assume that Then from the above equation, and for
Therefore,
has to be true for all
hence
which proves the necessary condition.
Remark 1: For a feedback controller that makes system (1.1) unstable, the associated does not satisfy the assumptions of Theorem 1, in particular condition (1.7).
Remark 2: The relation in Eqs. (1.9) and (1.10), called the Generalized Hamilton-Jacobi-Bellman equation for the stochastic control system (1.1), reduces to the Hamilton-Jacobi-Bellman partial differential equation (1.6) when the control is replaced by the optimal control
Remark 3: The exact cost function for a given control law is found by solving the Generalized Hamilton-Jacobi-Bellman partial differential equation. This equation, though simpler and more general than the Hamilton-Jacobi-Bellman equation, is difficult to solve in closed form.
Remark 4: One can show that Theorem 1 remains true for a more general stochastic process defined by the stochastic differential equation and the general form of the performance cost.
Since it is generally difficult to find the exact cost functions satisfying Eqs. (1.9) and (1.10) of Theorem 1, the following theorem introduces a method of constructing lower and upper bounds of the cost functions. This method can be used for the design of simpler suboptimal controllers based only on the upper bounds to value functions.
Theorem 2 (Lower and upper bounds of cost functions) For an admissible fixed control law and a continuous function with for all if the function satisfies Eq. (1.7) with continuous V, and and
then
That is,
is an upper (or a lower) bound of the cost function of system (1.1).
Proof: By procedure similar to the proof of Theorem 1‚ it can be shown that
Therefore‚ from Eqs. (1.11) and (1.12)‚ it follows that‚
This completes the proof.
Remark 1: In general, the function in this theorem does not need to be calculated. However, stating the inequality as in (1.11) gives an additional degree of flexibility that enables the determination of an upper (or lower) bound to the cost function
Remark 2: The function in this theorem is the exact cost function for a system with a performance cost augmented by
where is a terminal cost function such that Having established the two theorems for the evaluation of the performance of a given feedback control law, it is now necessary to develop algorithms to improve the control law. The following Theorems 3-5 provide a theoretical procedure for designing suboptimal feedback controllers based on Theorem 1,
while Theorem 6 presents a method for constructing upper and lower bounds to the optimal cost function, which can be used to evaluate the acceptability of suboptimal controllers.
Theorem 3. Given admissible controls and with and , let and be the corresponding cost functions satisfying Eqs. (1.7) and (1.8) for and , respectively, and define the Hamiltonian function for and 2,
where
It is shown that
when
Proof: Let
then
Therefore
which‚ from assumption (1.18)‚ implies that
In addition‚ from Eq. (1.10) of Theorem 1‚
However, applying Itô's integration formula to along the trajectory generated by the control it follows that
Hence‚
Remark: One should not try to find by subtracting from directly, based on their individual Itô formulas evaluated along the trajectories generated by their corresponding controls. This is due to the fact that the two state trajectories, generated by and respectively, are different.
A combination of Theorems 1 and 3, where represents the cost function of system (1.1) when driven by control yields an inequality that serves as a basis for suboptimal control algorithms that iteratively reduce the performance cost of the system. This is outlined by the following theorem.
Theorem 4. Assume that there exists a control and a corresponding function satisfying Eqs. (1.7) and (1.8) of Theorem 1. If there exists a function satisfying the same conditions of Theorem 1, whose associated control has been selected to satisfy
then
Proof: Since control and the corresponding value function must satisfy Eqs. (1.9) and (1.10)‚ according to Theorem 1‚ it follows that for every
This can be rewritten as
that is‚
Similarly‚ one can find‚
Since‚
It follows from Eqs. (1.22) and (1.23) that
Hence‚ according to Theorem 2‚
which proves the theorem.
Remark: Clearly, condition (1.20) in Theorem 4 is much easier to test than condition (1.18) in Theorem 3, since (1.20) involves neither the time derivative nor the infinitesimal generator
Based on Theorems 3 and 4, the following theorem establishes a sequence of feedback controls that successively improves the performance cost of the system and converges to the optimal feedback control.
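In the notation above (writing the pre-Hamiltonian as the generator term plus the running cost, an assumed but standard convention), the improvement step that the following theorem formalizes is:

```latex
u_{i+1}(x,t) \;=\; \arg\min_{u}\Big[\, \mathcal{L}^{u} V_{i}(x,t) \;+\; L(x,u) \Big],
```

followed by solving the Generalized Hamilton-Jacobi-Bellman equation for the new cost function $V_{i+1}$.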
Theorem 5. Let a sequence of pairs satisfy Eqs. (1.7) and (1.8) of Theorem 1, and let be obtained by minimizing the pre-Hamiltonian function corresponding to the previous cost function that is,
then the corresponding cost functions satisfy the inequality,
Thus, by selecting the pairs sequentially in the above manner, the resulting sequence converges monotonically to the optimal cost function V*, and the corresponding sequence converges to the optimal control associated with V*.
Proof: Since the control of (1.24) and the corresponding cost function satisfy (1.9) and (1.10) of Theorem 1, it follows from (1.22) of Theorem 4 that
where
Therefore‚ application of (1.19) of Theorem 2 yields
From (1.10)‚
hence, Itô's integration formula applied to along the trajectory generated by leads to the inequality,
that is,
which proves (1.25).
To show the convergence of the sequence‚ note that is a non-negative and monotonically decreasing sequence that satisfies (1.7). Therefore‚ the following limits exist:
and
for all and where is the limit of the cost functions . The corresponding limit of control sequence can be identified from (1.24) as‚
Clearly, and thus obtained, still satisfy Eqs. (1.9) and (1.10) of Theorem 1. However, from the construction of the control sequence, minimizes the pre-Hamiltonian function associated with the value function In other words, and satisfy the Hamilton-Jacobi-Bellman equation for the optimal control of the stochastic system (1.1)
Hence‚
are the optimal control and optimal value function of the stochastic control problem (1.5). Remark 1: It follows from this theorem that the optimal feedback control and the optimal cost function V* are related by
which is a relationship that results from the minimization of the Hamiltonian function associated with the stochastic system (1.1).
Remark 2: As indicated by the conditions, in applying the theorem, assumptions must be made a priori regarding the admissibility of the successively
derived control laws and their corresponding value functions. However, for a nonlinear stochastic control system as in (1.1), the admissibility of the new control laws is not always easy to show.
Finally, the following theorem presents a method for the construction of an upper (or a lower) bound of the optimal cost function Since the optimal cost function is extremely difficult to find, its upper (or lower) bounds can provide a practical measure to evaluate the effectiveness of the sub-optimal controllers.
Theorem 6. Assume that there exists a function satisfying condition (1.7) of Theorem 1, for which the associated control
is an admissible one. Then‚ is an upper (or a lower) bound to the optimal cost function of system (1.1)‚ if it satisfies the following conditions
where is continuous and for all
Proof: From (1.31) and the Hamilton-Jacobi-Bellman equation, it is obvious that the optimal control and the optimal cost function are related
and similarly‚ for
For
where
and
subtracting (1.35) from (1.36) yields
From assumption (1.34)‚
Therefore, application of Itô's integration formula along the trajectory generated by control yields,
So is an upper bound to the optimal cost function
For the opposite inequality, subtracting (1.36) from (1.35) leads to,
where By using condition (1.34) and Itô's integration formula, it can be shown that
In this case, is a lower bound to the optimal cost function
Theorems which lead to the design of simpler sub-optimal controllers based on the upper and lower bounds of cost functions may also be constructed. A more detailed discussion of such derivations for the infinite-time stochastic regulator problem is given in the next section.
4.
THE INFINITE-TIME STOCHASTIC REGULATOR PROBLEM
The infinite-time stochastic regulator problem is defined as a control problem for the nonlinear stochastic system (1.1) with infinite duration All state trajectories generated by admissible controls in must be bounded uniformly in For the infinite-time stochastic regulator problem, assuming that the system is stable, the Performance Cost exists and is defined as
A discussion of the stability of system (1.1) with Eq. (1.37) as is given in the next section. Applying Itô's integration formula before taking the limit, the cost function becomes
where satisfies Eqs. (1.7) and (1.8) of Theorem 1 for the possible state trajectories, which is true for all All the theorems developed in the previous section remain valid for the infinite-time stochastic regulator problem, except that the terminal conditions at in those theorems are no longer required. Moreover, in this case theorems can be constructed which lead to the iterative design of simpler sub-optimal controls based only on the upper and lower bounds. Since these bounds can be obtained without solving the partial differential equation (1.9) of Theorem 1, such theorems have great potential for application. Two such theorems, corresponding to Theorems 3 and 4, are given in the sequel.
Theorem 7. Given admissible controls and with and let and be their corresponding cost functions defined by (1.37). If there exist function pairs and satisfying (1.11) of Theorem 2 for and respectively, then,
When
Proof: Following the same procedure used in the proof of Theorem 3, one can show that
where
Therefore‚ Itô’s integration formula yields‚
hence‚
which implies that
The next theorem is the counterpart of Theorem 4‚ and its proof can be carried out by the same procedure used in Theorem 4. Theorem 8. Assume that there exists a control and a function pair satisfying (1.11) of Theorem 2. If there exists a
function pair satisfying the same condition of Theorem 2, of which the associated control has been selected to satisfy
then‚
where and are the cost functions of and respectively. Note that neither Theorem 7 nor Theorem 8 is true for the stochastic system (1.1) with a cost function defined by (1.2).
Stability of the infinite-time approximate control problem will be treated in a manner similar to the deterministic case; that is, it suffices to show that the Performance Index (1.37) of system (1.1) is bounded for all the controls generated by the Approximation Theory.
Lemma 1. If Theorem 7 and/or Theorem 8 is satisfied, stability of the infinite horizon systems driven by the subsequent controls generated by the approximation theory is guaranteed if the first controller is selected to yield a bounded
The proof of the above Lemma is established directly from the statements of Theorems 7 and/or 8. In order to prove the stability of the system for the first controller, the following steps are appropriate:
1. System (1.1) must be Completely Controllable for all admissible controls
2. Since all the states are assumed available for measurement, system (1.1) is obviously Completely Observable.
3. The Performance Cost (1.37) is bounded because,
From Itô’s integration formula‚
where
Select the first control to satisfy the previous theorems of the Approximation Theory‚ and the condition‚
Then‚ using the Mean Value Theorem‚
Then‚
The boundedness of the Performance Index of the infinite-time problem establishes the stability of the system for all the controllers derived from the Approximation Theory.
5.
PROCEDURE FOR ITERATIVE DESIGN OF SUB-OPTIMAL CONTROLLERS
The optimal feedback control and its associated satisfying the Hamilton-Jacobi-Bellman equation, Eq. (1.6), obviously satisfy all the theorems developed in Section 3. However, in most cases of nonlinear stochastic control systems, the optimal solution is very difficult, if not impossible, to implement, either because the solution is unavailable or because some of the states are not available for measurement. In both cases, the theory developed in Section 3 may serve to obtain controllers which make the system stable and can then be successively modified to approximate the optimal solution. Upper and lower bounds of the value function of the nonlinear stochastic system may be used to evaluate the effectiveness of the approximation.
5.1.
EXACT DESIGN PROCEDURE
This approach, based on the assumption that the cost function for a control can be found to satisfy Eqs. (1.9) and (1.10) of Theorem 1, may be implemented according to Theorems 3 and 4 by the following procedure:
1. Select a feedback control law for system (1.1); set
2. Find a to satisfy Theorem 1 for
3. Obtain and a to satisfy Theorem 1, and Theorems 3 or 4 for and ; is an improved controller;
4. From Theorem 6, find a lower bound to the optimal cost function and then use as a measure to evaluate as an approximation to the optimal control If acceptable, stop.
5. If the approximation is not acceptable, repeat Step 2 by increasing the index by one and continue.
The improved controller in Step 3 can also be constructed by using Theorem 5, if the corresponding cost function can be obtained. When a lower bound to the optimal value function is difficult to find, Step 4 can be omitted, and the criterion for the acceptability of the approximation then has to be determined based on other considerations.
Example 1. (Linear Stochastic Systems) In order to better understand the method, the design procedure of a sub-optimal controller will first be applied to a linear stochastic system, the optimal solution of which is well known. The linear stochastic system is described by the following differential equation:
The cost function of the system has the quadratic form
The infinitesimal generator of the linear stochastic process is
Assume first a linear non-optimal control‚
where is a feedback matrix. The corresponding cost function is assumed to be
where and can be found by solving Eqs. (1.9) and (1.10) of Theorem 1, i.e.,
and
The feedback law is improved by using Theorem 5. From Eq. (1.24)‚
and the corresponding cost function is assumed to be
where and are determined by solving the equations, i.e.,
As approaches S, the solution of the matrix Riccati equation,
and correspondingly, the control approaches
which is the optimal control for linear stochastic systems with a quadratic performance criterion (Wonham, 1970). This solution demonstrates the use of Theorem 5 to sequentially improve the control parameters towards the optimal values in a Linear Quadratic Gaussian system with a well-known solution.
Example 2. The second example illustrates the design method with the following nonlinear first-order stochastic system:
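For the scalar version of Example 1, the successive improvement of Theorem 5 can be sketched numerically: each sweep solves a Lyapunov-type policy-evaluation equation for the current gain and then improves the gain, and the quadratic cost coefficient converges to the Riccati solution. All numerical values below (a, b, q, r, the initial gain) are illustrative assumptions, not taken from the chapter.

```python
import math

# Illustrative scalar LQ problem (all values assumed): dx = (a x + b u) dt + sigma dw,
# cost E ∫ (q x^2 + r u^2) dt.  In this linear-quadratic case the successive
# approximation of Theorem 5 reduces to Kleinman's policy iteration.
a, b, q, r = 1.0, 1.0, 1.0, 1.0

def evaluate(k):
    """Policy evaluation: solve the scalar Lyapunov equation
    2 (a - b k) p + q + r k^2 = 0 for the quadratic cost coefficient p."""
    assert a - b * k < 0, "control must be stabilizing"
    return (q + r * k * k) / (2.0 * (b * k - a))

def improve(p):
    """Policy improvement: minimize the pre-Hamiltonian, giving k = b p / r."""
    return b * p / r

k = 2.0                      # initial stabilizing feedback gain (assumed)
history = []
for _ in range(10):
    p = evaluate(k)          # cost of the current control law
    history.append(p)
    k = improve(p)           # improved control law

# Positive root of the algebraic Riccati equation (b^2/r) p^2 - 2 a p - q = 0
p_riccati = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
print(history[-1], p_riccati)
```

The sequence of evaluated costs is monotonically non-increasing, mirroring the monotone convergence asserted in Theorem 5.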
with a cost function selected to represent a minimum-error, minimum-input-energy specification criterion for a regulator control problem
The infinitesimal generator of the stochastic process becomes
First, assume a linear control law,
The corresponding cost function is assumed to be
Equation (1.9) of Theorem 1 yields
which is true for
Next‚ select a higher order control law‚
Such a controller was selected to be of the same order as the partial derivative of the cost function as Theorem 5 suggests. The corresponding cost function is assumed to be,
In this case‚ in order to satisfy Eq. (1.9)‚ one must solve
which is true for
To satisfy Theorem 4, controllers and must satisfy,
which yields
Their corresponding cost functions can be compared‚
If Theorem 5 is to be used for the above , and must be selected according to
and a satisfying Eq. (1.9) of Theorem 1 exists if
Comparing the cost functions‚ one finds that‚
In both cases, since the performance of the system has been improved by replacing a linear controller with a nonlinear controller. All the above controllers make the origin an equilibrium point of the system.
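The comparison carried out in Example 2 can be reproduced in spirit by simulation. Since the original coefficients were lost in extraction, the sketch below uses an assumed scalar system dx = (-x + u) dt + sigma dw and an assumed running cost x^2 + x^4 + u^2, estimating the finite-horizon cost of a linear and a linear-plus-cubic law by Euler-Maruyama Monte Carlo; the gains, noise level, and horizon are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in for Example 2: dx = (-x + u) dt + sigma dw with running
# cost q1*x^2 + q2*x^4 + u^2, simulated from a common initial state.
sigma, dt, T, n_paths = 0.2, 1e-3, 5.0, 200
q1, q2 = 1.0, 1.0
x0 = 1.5

def mc_cost(feedback):
    """Euler-Maruyama Monte Carlo estimate of E ∫_0^T (q1 x^2 + q2 x^4 + u^2) dt."""
    n = int(T / dt)
    x = np.full(n_paths, x0)
    cost = np.zeros(n_paths)
    for _ in range(n):
        u = feedback(x)
        cost += (q1 * x**2 + q2 * x**4 + u**2) * dt
        x += (-x + u) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    return cost.mean(), x

j_lin, x_lin = mc_cost(lambda x: -x)                # linear law (gain assumed)
j_cub, x_cub = mc_cost(lambda x: -x - 0.3 * x**3)   # cubic correction (assumed)
print(j_lin, j_cub)
```

The two estimates here use independent noise draws; a common-random-numbers variant would sharpen the comparison between the two laws.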
5.2.
APPROXIMATE DESIGN PROCEDURES FOR THE REGULATOR PROBLEM
In many cases, the selection of a to satisfy (1.9) and (1.10) of Theorem 1 is a very difficult task. In such cases, approximate design procedures which use the upper and lower bounds of the cost function obtained through Theorem 2 can be constructed. For the infinite-time stochastic regulator problem, the following design procedure is proposed based on Theorems 7 and 8 in Section 4:
1. Select a feedback control law for system (1.1); set
2. For an find a for to satisfy Theorem 2 for a lower bound.
3. Obtain a and for an and find a for to satisfy Theorem 2 as an upper bound. The found should also satisfy conditions (1.40) or (1.41) for the improvement of performance.
4. Using a lower bound to the optimal cost function, which is determined according to Theorem 6, the approximation of the optimal control can be measured. If acceptable, stop.
5. If the approximation is not acceptable, repeat Step 2 by increasing the index by one and continue.
Example 3. The design method is illustrated with the following nonlinear first-order stochastic regulator problem:
with a cost function
The infinitesimal generator of the stochastic process is
For a linear control law‚
The lower bound of its cost function is assumed to be
Application of Eq. (1.11) of Theorem 2 leads to
which is satisfied by
For a higher order control law,
the upper bound to its value function is assumed to be
In this case‚ application of (1.11) yields‚
which is true for
where are arbitrary. Improvement of performance occurs if Eq. (1.40) or Eq. (1.41) is satisfied, which leads to
which, together with the rest of the inequalities, produces acceptable values for and For example, one can show that
are a set of acceptable values. The lower and upper bounds of the value functions in this case are found to be,
Improvement of performance occurs by replacing a linear controller with a nonlinear controller of the same order as the partial derivative of Note that in this case the actual cost function of the control cannot be found by simply using the method applied in Example 2. This approach has great potential for application since one does not have to solve the partial differential equation of Theorem 1 every time.
6.
CLOSING REMARKS BY FEI-YUE WANG
In the Fall of 1987 at Rensselaer, I took the course ECSE 35649: State Estimation and Stochastic Control from Professor George N. Saridis, then my Ph.D. advisor. One day in class, we had an argument regarding the Entropy reformulation of optimal control theory, and I went to the blackboard to show my derivation. George was upset and dismissed the class for the day. That was how this paper started. For the rest of the course, I concentrated on formulating and proving the theorems and developing iterative design procedures, and by the end of the semester, I had most of the results shown here. However, I did not find any useful results regarding the Entropy formulation of the optimal stochastic control problem, the reason for its beginning and a favorite topic of George's. Until the end of my Rensselaer years, I never found the time to complete the paper, partly because my effort was then focused on Intelligent Control Systems. I came to Tucson in the middle of the hot summer of 1990. Sid was one of the first few colleagues I talked to at the SIE Department, and he became a good friend ever since. Over the years until his untimely death in 1999, I received generous support and advice on many issues from Sid, and admired his strong passion for scholarly work, especially analytical research work. He often commented that a paper without an equation can hardly be a good paper, which in part affected my decision to complete this paper, since there are certainly a lot of equations here. I would like to acknowledge that most of the results in this paper were reported at the IEEE 1992 International Conference on Decision and Control in Tucson, Arizona (see Wang and Saridis, 1992). When I completed the final version of this paper for journal submission, George asked me to add the Entropy reformulation of the Generalized Hamilton-Jacobi-Bellman equation to the paper along the lines of Saridis (1988) and Jaynes (1957).
I agreed that this was a good idea, but I could not find a meaningful way to do so at that time, or, actually, even today. Then in 1994, George developed a way to reformulate the Approximation Theory of Stochastic Optimal Control, added it to the paper as an additional section, and submitted it to Control – Theory and Advanced Technology in Japan (Saridis and Wang, 1994), a journal which was discontinued in 1994.
I believe the theoretical procedures outlined in this paper could be an extremely useful foundation for the development of a numerical package for designing practical controllers for nonlinear stochastic systems in complex applications found in economic, social, political, and management fields. This will be a direction for future research.
ACKNOWLEDGMENTS I would like to dedicate this paper to the memory of Professor Sidney J. Yakowitz.
REFERENCES
Al'brekht, E. G. (1961). On the optimal stabilization of nonlinear systems. J. Appl. Math. Mech. (ZPMM), 25, 5, 1254-1266.
Aoki, M. (1967). Optimization of Stochastic Systems. Academic Press, N.Y.
Bellman, R. (1956). Dynamic Programming. Princeton University Press, Princeton, N.J.
Doob, J. L. (1953). Stochastic Processes. Wiley, N.Y.
Dynkin, E. B. (1953). Markov Processes. Academic Press, N.Y.
Itô, K. (1951). On Stochastic Differential Equations. Mem. Amer. Math. Soc., 4.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 4, 106.
Kushner, H. (1971). Introduction to Stochastic Control. Holt, Rinehart and Winston, N.Y.
Kwakernaak, H. and R. Sivan. (1972). Linear Optimal Control Systems. Wiley, N.Y.
Leake, R. J. and R.-W. Liu. (1967). Construction of suboptimal control sequences. SIAM J. Control, 5, 1, 54-63.
Lukes, D. L. (1969). Optimal regulation of nonlinear dynamical systems. SIAM J. Control, 7, 1, 75-100.
Nishikawa, Y., N. Sannomiya and H. Itakura. (1962). A method for suboptimal design of nonlinear feedback systems. Automatica, 7, 6, 703-712.
Ohsumi, A. (1984). Stochastic control with searching a randomly moving target. Proc. of American Control Conference, San Diego, CA, 500-504.
Panossian, H. V. (1988). Algorithms and computational techniques in stochastic optimal control. C. T. Leondes (ed.), Control and Dynamic Systems, 28, 1, 1-55.
Rekasius, Z. V. (1964). Suboptimal design of intentionally nonlinear controllers. IEEE Trans. Automatic Control, AC-9, 4, 380-386.
Sage, A. P. and C. C. White. (1977). Optimum Systems Control. Prentice-Hall, Englewood Cliffs, N.J.
Saridis, G. N. and Fei-Yue Wang. (1994). Suboptimal control of nonlinear stochastic systems. Control – Theory and Advanced Technology, Vol. 10, No. 4, Part 1, pp. 847-871.
Saridis, G. N. (1988). Entropy formulation for optimal and adaptive control. IEEE Trans. Automatic Control, AC-33, 8, 713-721.
Saridis, G. N. and J. Balaram. (1986). Suboptimal control for nonlinear systems. Control-Theory and Advanced Technology (C-TAT), 2, 3, 547-562.
Saridis, G. N. and C. S. G. Lee. (1976). An approximation theory of optimal control for trainable manipulators. IEEE Trans. Systems, Man, and Cybernetics, SMC-9, 3, 152-159.
Skorokhod, A. V. (1965). Studies in the Theory of Random Processes. Addison-Wesley, Reading, MA.
Wang, Fei-Yue and G. N. Saridis. (1992). Suboptimal control for nonlinear stochastic systems. Proceedings of the 31st Conference on Decision and Control, Tucson, AZ, Dec.
Wonham, W. M. (1970). Random differential equations in control theory. A. T. Bharucha-Reid (ed.), Probabilistic Methods in Applied Mathematics, Academic Press, N.Y.
Chapter 17
STABILITY OF RANDOM ITERATIVE MAPPINGS
László Gerencsér
Computer and Automation Institute, Hungarian Academy of Sciences, H-1111 Budapest, Kende u. 13-17, Hungary*
Abstract
1.
We consider a sequence of not necessarily i.i.d. random mappings that arise in discrete-time fixed-gain recursive estimation processes. This is related to the sequence generated by the discrete-time deterministic recursion defined in terms of the averaged field. The tracking error is majorized by an L-mixing process, the moments of which can be estimated. Thus we get a discrete-time stochastic averaging principle. The paper is a simplification and extension of Gerencsér (1996).
INTRODUCTION
Most of the commonly used identification methods for linear stochastic systems can be considered as special cases of a general estimation scheme, which was proposed in Djereveckii and Fradkov (1974); Ljung (1977), and further elaborated in Djereveckii and Fradkov (1981); Ljung and Söderström (1983). This scheme can be described as follows. Let us define a parameter-dependent, stochastic process by the state-space equation:
where are matrices of appropriate dimensions that depend on a parameter and D is an open domain. Here is an exogenous noise source, which is assumed to be a vector-valued, wide-sense stationary stochastic process. The matrices are assumed to have

*Partially supported by the National Foundation of Hungary under contract No. T 032932.
their eigenvalues inside the unit disc, and A(.), B(.) and C(.) are of Let be a quadratic function, defined on , all components of which are homogeneous quadratic functions of Let
It is easy to see that is well-defined. The general abstract estimation problem is then defined as follows: find the solution of the nonlinear algebraic equation using observed values of for various The solution will be denoted by It is assumed that is locally identifiable‚ i.e. is non-singular. The following stochastic gradient-type algorithm was proposed in Djereveckii and Fradkov (1974); Ljung (1977):
where is an on-line estimate of the frozen-parameter value defined by
This algorithm is closely related to the following frozen-parameter procedure:
The above general scheme includes a number of important identification procedures, such as the recursive maximum-likelihood method for Gaussian ARMA processes, or the maximum-likelihood method for the identification of finite-dimensional, multivariable linear stochastic systems (cf. Caines, 1988; Ljung and Söderström, 1983). In fact, for multivariable linear stochastic systems the general estimation scheme is not directly applicable, and a simple extension is needed: take a function which is linear in Q and in and then define As an extreme case, may not depend on at all. This is the case with the celebrated LMS method in adaptive filtering. Here a particular component of a wide-sense stationary signal is approximated by a linear combination of the remaining components. Formally: let be an
second order stationary process, where is a real-valued process. Consider the minimization problem
(1.9)
where is an -valued weighting vector. This is equivalent to solving
The on-line computations of the best weights is obtained by the LMS method:
In many practical applications the system to be estimated has a slowly changing dynamics. In this case instead of accumulating past information we gradually forget them. This is accomplished by replacing the gain in (1.4) or in (1.7) by a fixed gain‚ say Thus we arrive at the procedure:
Fixed-gain procedures are quite well known in adaptive filtering. The normalized tracking error has been analyzed from a number of aspects. Weak convergence, a central limit theorem and an invariance principle, when have been derived in Kushner and Schwartz (1984); Györfi and Walk (1996); Gerencsér (1995); Joslin and Heunis (2000); Kouritzin (1994). Bounds for higher order moments for any fixed 0 have been derived in Gerencsér (1995). In this paper we consider general fixed-gain recursive estimation processes that include the processes given by (1.12). In modern terminology these can be considered iterated random maps. Thus we consider random iterative processes of the form
where
is a random variable defined over some probability space for 1, and where D is an open domain. Fixed-gain estimation processes have been analyzed in a number of interesting cases beyond LMS. Stability results for Markovian time-varying systems are given in Meyn and Caines (1987). Time-varying regression-type models have been investigated in Bittanti and Campi (1994); Campi (1994); Guo (1990). A powerful and general result in a Markovian setting is given as Theorem 7, Chapter 4.4, Part II of Benveniste et al. (1990). The advance of Gerencsér (1996) is the extension of the so-called ODE method (ODE for Ordinary Differential Equations) to general fixed-gain estimation schemes, and a more complete
characterization of the tracking error process. The focus of that paper was on processes in continuous time. The application of the so-called ODE method to discrete-time fixed-gain recursive estimation processes requires the often painful analysis of the effect of the discretization error. The main contribution of the present paper is an extension of Gerencsér (1996) to discrete-time processes and the development of a discrete-time ODE method, where ODE now stands for Ordinary Difference Equation. This new method is much more convenient for applications and also gives a more accurate characterization of the error process. These advantages will be exploited in the forthcoming paper (Gerencsér and Vágó, 2001) to analyze the convergence properties of the so-called SPSA (simultaneous perturbation stochastic approximation) methods, developed in Spall (1992); Spall (1997); Spall (2000), when applied to noise-free optimization. In the conditions below we use the definitions and notations given in the Appendix. The key technical condition that ensures a stochastic averaging effect is Condition 1.2.
Condition 1.1 The random fields H and are bounded for say K and We let
Condition 1.2 H and are L-mixing uniformly in for and in respectively, with respect to a pair of families of
Condition 1.3 The mean field EH is independent of i.e. we can write
The process defined by
will be compared with the discrete-time deterministic process
The conditions imposed on the averaged field G will are described in terms of (1.16) given below‚ this is more convenient for applications. Thus consider the continuous-time deterministic process defined by
Condition 1.4 The function G defined on D is continuous and bounded, together with its first and second partial derivatives, say
363
Stability of Random Iterative Mappings
Here [...] denotes the operator norm. The condition ensures the existence and uniqueness of the solution of (1.16) in some finite or infinite time interval for any [initial condition], which will be denoted by [...]. It is well-known that [the solution] is a continuously differentiable function of [the initial condition], and [its derivative] also exists and is continuous in [the initial condition]. The time-homogeneous flow associated with (1.16) is defined as the mapping [...]. Let [D₀] be a subset of D such that for [...] we have [...] for any [...]. For any fixed t the image of [...] under [the flow] will be denoted as [...]. The union of these sets will be denoted by [...].
The neighborhood of a set will be denoted by [...], i.e. [...] for some [...]. Finally, the interior of a compact domain is denoted by int[...]. In the following condition a certain stability property of (1.16) is described:

Condition 1.5 There exist compact domains [...] such that we have [...] for some [...].
Condition 1.6 The autonomous ordinary differential equation (1.16) is exponentially asymptotically stable with respect to initial perturbations, i.e. for some [...] and [...] we have [...] for all [...].
Theorem 1.1 Assume that Conditions 1.1-1.6 are satisfied, and [...]. Then for [...] we have [...]. Moreover, setting [...] for some fixed c, we have [...], where [...] is an L-mixing process with respect to [...], and for all [...] we have [...]; here [...] depends only on [...] and the domains [...], and [...] is a system's constant, i.e. it depends only on K, L, and [...].
The above theorem has two new features compared to previous results by other authors: first, the higher-order moments of the tracking error are estimated; second, the upper bound is shown to be an L-mixing process. An interesting consequence is the following stochastic averaging result: Theorem 1.2 Let the conditions of Theorem 1.1 be satisfied. Let F(.) be a continuously differentiable function of [...] defined in D. Then we have
with probability 1‚ where c is a deterministic constant which is independent of In the above studies the boundedness of the estimator process is ensured by strong conditions‚ among others the boundedness of the random field. It is not known if these conditions can be relaxed. A globally convergent procedure has been developed in Yakowitz (1993) for the case of Kiefer-Wolfowitz-type stochastic approximation procedures with decreasing gain‚ but it is not known if the technique developed there can be extended to general‚ fixed gain estimation processes. On the other hand‚ averaging is also important in deterministic systems (cf. Sanders and Verhulst‚ 1985). The use of deterministic techniques in a stochastic setting is not yet fully explored.
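To make the discrete-time ODE picture concrete, here is a minimal numerical sketch. All specifics below (a scalar LMS-type field, the gain 0.05, Gaussian data) are illustrative assumptions, not the paper's general setting: a fixed-gain recursion is run alongside the ordinary difference equation driven by its averaged field G, and the stochastic iterate tracks the deterministic one.

```python
import random

random.seed(0)

THETA_STAR = 2.0   # "true" parameter (illustrative value, not from the paper)
LAM = 0.05         # fixed gain
EX2 = 1.0          # E[x^2] for x ~ N(0, 1)

def G(theta):
    # Averaged field of the LMS recursion below:
    # G(theta) = E[x * (y - x * theta)] = EX2 * (THETA_STAR - theta)
    return EX2 * (THETA_STAR - theta)

theta = 0.0   # stochastic fixed-gain iterate
z = 0.0       # deterministic iterate of the ordinary difference equation
for _ in range(2000):
    x = random.gauss(0.0, 1.0)                          # regressor
    y = THETA_STAR * x + 0.1 * random.gauss(0.0, 1.0)   # noisy observation
    theta += LAM * x * (y - x * theta)   # fixed-gain (LMS) recursion
    z += LAM * G(z)                      # z_{k+1} = z_k + LAM * G(z_k)

print(theta, z)   # both end up near THETA_STAR
```

The deterministic iterate converges geometrically, while the stochastic iterate hovers around it with a tracking error of order of the square root of the gain, which is the phenomenon the theorems above quantify.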
2. PRELIMINARY RESULTS
In the first result of this section we show a simple method to couple continuous-time and discrete-time processes. Then we show that the discrete flow defined by (1.15) is exponentially stable with respect to initial perturbations for sufficiently small [gain].

Lemma 2.1 Assume that Conditions 1.4, 1.5 and 1.6 are satisfied. Let [...] be the solution of (1.16) and [...] be the solution of (1.15). Then if [...], then [...] will stay in [...] and [...] for all [...].

Proof: Let [...] denote the piecewise linear extension of [...]. Taking into account Lemma 0.2 we can write, as long as [...] for all [...], [...] where [...] denotes the integer part of [...]. Taking into account Condition 1.6 we get
Now [...]; hence if [...], then [...] will be smaller than the distance between [...] and [...], where [...] denotes the complement of the set [...]; hence [...] will stay in [...] for all [...]. Thus the proposition of the lemma is proved. An interesting property of the discrete flow defined by (1.15) is that it inherits exponential stability with respect to initial perturbations if [the gain] is sufficiently small. Let [...] denote the solution of (1.15) with initial condition [...].

Lemma 2.2 Assume that Conditions 1.4, 1.5 and 1.6 are satisfied. Then [...] implies that [...] will stay in [...]; moreover for any [...] we have [...] whenever [the gain] is sufficiently small.
Proof: Write [...], where [...] and [...]. We have [...]. Differentiating with respect to [...] we get [...]. First we consider the local error [...]; a second-order Taylor-series expansion gives [...], or equivalently [...]. From (1.16) we have [...]. Setting [...]
Thus [...]. Differentiating (2.5) with respect to [...] we get [...]. To estimate the first term in the integral we differentiate (2.6) with respect to [...] to obtain [...], where * denotes tensor product. Note that by Condition 1.6 we have [...] and hence [...]. Thus we finally get
Returning to (2.4), the first term under the first integral on the right hand side is estimated using Lemma 0.3. For the second term we write [...]. Differentiating with respect to [...] we get [...]. Thus the absolute value of the first integral on the right hand side of (2.4) is majorized by [...]. For the absolute value of the second integral we get the upper bound [...]. Altogether we get
where [...] is a system constant. Write [...]. It follows by standard arguments that [...], where the latter is the solution of [...] with [...]. Then we get [...]. Writing [...] we have [...]. Obviously we have [...] for any [...] whenever [the gain] is sufficiently small, and thus the proposition of the lemma follows.
3. THE PROOF OF THEOREM 1.1
First it is easy to show that [...] for all [...] whenever [...]; the proof is almost identical with the proof of Lemma 2.1. Let us subdivide the set of integers into intervals of length T. In the interval [...] we consider the solution of (1.15), denoted by [...], starting from [...] at time [...]. Thus [...] is defined by [...]. Now we prove that for [...] we have [...], where [...] is defined in terms of the random field [...] along deterministic trajectories by [...]. For the proof first note that Condition 1.5 and an obvious adaptation of Lemma 2.1 ensure that [...] for all [...] whenever [...]. There is no loss of generality in assuming that [...], and then we have
In the first integral take the supremum over [...] and over the initial condition [...], which enters implicitly through [...]. Then we get the random variable defined in (3.2) with [...]. Since H is Lipschitz-continuous in [...] with Lipschitz constant L, the second term is majorized by [...]. Applying a discrete-time Bellman-Gronwall lemma we get the claim for [...]. Now Lemma 2.2 of Gerencsér (1996), which itself is proved by direct computations using theorems of Gerencsér (1989), implies that the process [...] is L-mixing with respect to [...], and for any [...] we have [...], where [...] depends only on [...] and the domains [...]. Let the solution of (1.15) be denoted by [...]. Then we claim that for any [...] and for sufficiently small [...]
where [...] and [...] is a system's constant, its value being given by [...]. The proof is based on a representation analogous to (2.3). Using (3.1) and Lemma 2.2, with the role of [...] taken by [...], we get the claim. It follows that we have for any [...] [...] with [...]. This is obtained by applying Lemma 0.1 for the process [...] with [...]. To complete the proof of Theorem 1.1 we combine the estimates (3.5), (3.7), (3.4).
APPENDIX The basic concepts of the theory of L-mixing processes developed in Gerencsér (1989) will be presented. Let a probability space be given, let [...] be an open domain and let [...] be a parameter-dependent stochastic process. Alternatively, we may consider [...] as a time-varying random field. We say that [...] is M-bounded if for all [...]

Here | · | denotes the Euclidean norm. We shall use the same terminology if [...] or t degenerates into a single point. Also we shall use the following notation: if [...] is M-bounded we write [...]. Moreover, if [...] is a positive real-valued function we write [...] if
Let [...] be a family of monotone increasing [σ-algebras] and [...] be a monotone decreasing family of [σ-algebras]. We assume that for all [...], [...] and [...] are independent. For [...] we set [...]. A stochastic process is L-mixing with respect to [...], uniformly in [...], if it is M-bounded and, if we set for [...], where [...] is a positive integer, then
The following lemma is a direct consequence of Lemma 2.4 of Gerencsér (1989), which itself is verified by direct calculations:

Lemma 0.1 Let [...] be a discrete-time L-mixing process with respect to a pair of families of [σ-algebras], with [...]. Define the process [...]. Then [...] is L-mixing with respect to [...].
To capture the smoothness of a stochastic process [...] we define, for [...], [...]. A stochastic process [...] is M-Lipschitz-continuous in [...] if the process [...] is M-bounded, i.e. if for all [...] we have [...]. We define [M-Lipschitz-continuity] with respect to [...] in an analogous way. Finally we introduce the notations
Two simple analytical lemmas follow. Due to its importance the proof of the first lemma will be given; the proof of the second lemma is given in Gerencsér (1996).

Lemma 0.2 (cf. Geman, 1979) Let [...] be a function satisfying Conditions 1.4, 1.5 and let [...] be the solution of (1.16). Further let [...] be a continuous, piecewise continuously differentiable curve such that [...]. Then for [...]
Proof: Consider the function [...]. Obviously the left hand side of (0.3) can be written as [...]. Write [...]. Taking into account the equality [...], which follows from the identity [...] after differentiation with respect to [...], we get the lemma.
Lemma 0.3 Under Conditions 1.4, 1.5 and 1.6 we have
REFERENCES

Benveniste, A., M. Métivier, and P. Priouret. (1990). Adaptive algorithms and stochastic approximations. Springer-Verlag, Berlin.
Bittanti, S. and M. Campi. (1994). Bounded error identification of time-varying parameters by RLS techniques. IEEE Trans. Automat. Contr., 39(5):1106–1110.
Caines, P. E. (1988). Linear Stochastic Systems. Wiley.
Campi, M. (1994). Performance of RLS identification algorithms with forgetting factor: a [...] approach. J. of Mathematical Systems, Estimation and Control, 4:1–25.
Djereveckii, D. P. and A.L. Fradkov. (1974). Application of the theory of Markov processes to the analysis of the dynamics of adaptation algorithms. Automation and Remote Control, (2):39–48.
Djereveckii, D. P. and A.L. Fradkov. (1981). Applied theory of discrete adaptive control systems. Nauka, Moscow. In Russian.
Geman, S. (1979). Some averaging and stability results for random differential equations. SIAM Journal of Applied Mathematics, 36:87–105.
Gerencsér, L. (1989). On a class of mixing processes. Stochastics, 26:165–191.
Gerencsér, L. (1995). Rate of convergence of the LMS algorithm. Systems and Control Letters, 24:385–388.
Gerencsér, L. (1996). On fixed gain recursive estimation processes. J. of Mathematical Systems, Estimation and Control, 6:355–358. Retrieval code for full electronic manuscript: 56854.
Gerencsér, L. and Zs. Vágó. (2001). A general framework for noise-free SPSA. In Proceedings of the 40th Conference on Decision and Control, CDC'01, submitted.
Guo, L. (1990). Estimating time-varying parameters by the Kalman-filter based algorithm: stability and convergence. IEEE Trans. Automat. Contr., 35:141–157.
Györfi, L. and H. Walk. (1996). On the average stochastic approximation for linear regression. SIAM J. Control and Optimization, 34(1):31–61.
Joslin, J. A. and A. J. Heunis. (2000). Law of the iterated logarithm for a constant-gain linear stochastic gradient algorithm. SIAM J. on Control and Optimization, 39:533–570.
Kouritzin, M. (1994). On almost-sure bounds for the LMS algorithm. IEEE Trans. Informat. Theory, 40:372–383.
Kushner, H. J. and A. Schwartz. (1984). Weak convergence and asymptotic properties of adaptive filters with constant gains. IEEE Trans. Informat. Theory, 30(2):177–182.
Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Contr., 22:551–575.
Ljung, L. and T. Söderström. (1983). Theory and practice of recursive identification. The MIT Press.
Meyn, S. P. and P.E. Caines. (1987). A new approach to stochastic adaptive control. IEEE Trans. Automat. Contr., 32:220–226.
Sanders, J. A. and F. Verhulst. (1985). Averaging methods in nonlinear dynamical systems. Springer-Verlag, Berlin.
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Automat. Contr., 37:332–341.
Spall, J. C. (1997). A one-measurement form of simultaneous perturbation stochastic approximation. Automatica, 33:109–112.
Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Trans. Automat. Contr., 45:1839–1853.
Yakowitz, S. (1993). A globally convergent stochastic approximation. SIAM J. Control and Optimization, 31:30–40.
Part V

Chapter 18

'UNOBSERVED' MONTE CARLO METHODS FOR ADAPTIVE ALGORITHMS

Victor Solo
School of Electrical Engineering and Telecommunications
University of New South Wales
Sydney NSW 2052, Australia
vsolo@syscon.ee.unsw.edu.au
Keywords: Markov Chain, Monte Carlo, Adaptive Algorithm.

Abstract: Many Signal Processing and Control problems are complicated by the presence of unobserved variables. Even in linear settings this can cause problems in constructing adaptive parameter estimators. In previous work the author investigated the possibility of developing an on-line version of so-called Markov Chain Monte Carlo methods for solving these kinds of problems. In this article we present a new and simpler approach to the same group of problems based on direct simulation of unobserved variables.
1. EL SID
When I first started wandering away from the statistics beaten track into the physical and engineering sciences I soon encountered a small but intrepid band of ‘statistical explorers’; statisticians who had taken the pains necessary to break into other areas (where their mathematical skills did not excite the same level of awe as they might have in some of the social or biological sciences) but who were having a considerable impact, out of proportion to their numbers, elsewhere. Of course there were the better known names, the Speakes and Burtons of these strange lands: Tukey certainly was there in geophysics; and Kiefer in information theory. But there were others, less well known, the Bakers and Marchands, perhaps more enterprising, but no less deserving of respect. Sid
374
MODELING UNCERTAINTY
was one of these, but different even then; too maverick to belong to a group of mavericks! I first encountered Sid’s work in adaptive control in the 1970s (Yakowitz, 1969). Then a little later I was astounded to discover he’d worked on the marvellous ill-conditioned inverse problem of aquifer modelling (Neuman and Yakowitz, 1979). And yet again there was his work on Kriging (Yakowitz and Szidarovszky, 1984), providing some rare clarity on a much mis-discussed subject. He was worrying about nonparametric estimation of stochastic processes long before it became fashionable: e.g. Yakowitz (1985) and earlier references. He was, then, an early player in areas which became extremely important. In his research Sid cut to the heart of the problem and asked the hard (and embarrassing) questions; he obviously took application seriously but saw the utility of rigour too. At a personal level he was very humble and very encouraging to other younger researchers. Because Sid worked on so many diverse problems he did not get the kind of recognition he deserved but he certainly rates along with others who did. Given that his first work was on adaptive algorithms I am pleased to offer a contribution on this topic. This work, with a statistical bent, would appeal to Sid; although he would not be pleased by the lack of rigour!
2. INTRODUCTION
In many problems of real-time parameter estimation there is a necessity to estimate unobserved signals. These may be states or auxiliary variables measured with error. In this paper we concentrate on the latter problem of errors in variables. Thus in Computer Vision, consider the fundamental problem of estimating three-dimensional motion from a sequence of two-dimensional images using a feature-based method - a highly non-linear problem (Tekalp, 1995). Many current methods ignore the presence of noise in the measurements, which makes the problem an errors-in-variables problem. Kanatani (1996) has tackled the problem of noise but his techniques only apply in high signal-to-noise-ratio cases and under independence assumptions. An approach that overcomes this is in Ng and Solo (2001). In multiuser communication based on spread spectrum methods, detection methods may require spreading codes which may not be known and so have to be estimated. They occur in a nonlinear fashion (Rappaport, 1996). To construct an adaptive parameter estimator for a non-linear problem there are two routes. In the model-free or stochastic approximation approach (Benveniste et al., 1990; Ljung, 1983; Solo and Kong, 1995) an analytic expression is needed for the gradient of the likelihood with respect to the parameters. In the model-based approach, which usually leads to approximations based on the
extended Kalman filter (Solo and Kong, 1995; Ljung, 1983), this gradient is also needed. But there are many problems where it is not possible to develop analytic expressions for the likelihood, much less its gradient. Errors in variables typically produce such a situation. In the statistics literature in the last decade or so, Markov Chain Monte Carlo methods (MCMC) (Roberts and Rosenthal, 1998), originating in Physics (Metropolis et al., 1953) and Image Processing (Geman and Geman, 1984), have provided a powerful simulation-based methodology for solving these kinds of problems. In previous work (Solo, 1999) we have pursued the possibility of using this method in an offline setting. Here, based on more recent work of the author (Solo, 2000a; 2000b), we develop an on-line version of an alternative and simpler method. We use a simple binary classification problem as an exploratory platform. For the errors in variables case we consider below there seems to be no previous literature (aside from Solo (2000a)) using the model-free approach. For the model-based approach one ends up appending the parameters being tracked to the states and pursues a nonlinear filtering approach. There are examples of this, e.g. Kitagawa (1998), but not really aimed at the kind of errors in variables problem envisaged here. The remainder of the paper is organised as follows. In section 2 we describe the problem (without measurement errors) and briefly review an adaptive estimator. In section 3 we discuss the same problem but where the auxiliary (classifying) variables are measured with error and describe the ’Unobserved’ Monte Carlo method for parameter estimation (offline). In section 4 we develop an online version of the estimator and sketch a convergence analysis. Conclusions are offered in section 5. In the sequel we use [...] for the density of an unobserved signal or variable, [...] for a conditional density, and [...] for the density of an observed variable.
We also use [...] to denote the probability of a binary event; this should cause no confusion.
3. ON-LINE BINARY CLASSIFICATION
Suppose we have iid (independent identically distributed) data [...] where [...] are binary random variables, e.g. yes/no results in an object (e.g. a bolt on a conveyor belt) recognition task, and [...] are regressors or classifying auxiliary variables (e.g. a spatial moment estimated from an image of the object). A typical model linking the recognition variable to the classifying variable is (Ripley, 1996) the logistic regression model, which specifies the conditional distribution of [...] given [...] as,
In our previous work (Solo, 1999) we gave, as background, details of the simple instantaneous steepest-descent algorithm for estimating the parameter on-line, based on the one-point likelihood, namely
And an averaging stability analysis (Kushner, 1984; Benveniste et al., 1990; Solo and Kong, 1995; Sastry and Bodson, 1989) was sketched. This is all reasonably straightforward.
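A minimal sketch of such an instantaneous steepest-descent update for the logistic model may help fix ideas. The scalar parameter, the step size, and the Gaussian regressors below are illustrative assumptions of ours, not taken from the paper:

```python
import math
import random

random.seed(1)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

THETA_TRUE = 1.5   # parameter generating the data (illustrative)
MU = 0.05          # small fixed step size

theta = 0.0
for _ in range(20000):
    x = random.gauss(0.0, 1.0)                                # classifying variable
    y = 1 if random.random() < sigmoid(THETA_TRUE * x) else 0  # binary outcome
    # instantaneous steepest descent on the negative one-point log-likelihood:
    # d/dtheta log P(y | x; theta) = (y - sigmoid(theta * x)) * x
    theta += MU * (y - sigmoid(theta * x)) * x

print(theta)   # hovers in a neighbourhood of THETA_TRUE
```

With a fixed step size the iterate does not converge but jitters around the true parameter, which is exactly the behaviour the averaging analysis sketched above describes.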
4. BINARY CLASSIFICATION WITH NOISY MEASUREMENTS OF CLASSIFYING VARIABLES - OFFLINE

Now we complicate the problem by assuming that the auxiliary variables are measured with error. The model becomes:
The unknowns are [...], and we denote the data as [...]. Then direct maximum likelihood estimation (off-line) for [...] is not straightforward because, although the density [...] is given and the conditional density [...] is easily computed, there is no closed-form analytic expression for the marginal density of
and hence for the likelihood.
’Unobserved’ Monte Carlo Methods for Adaptive Algorithms
377
In such a case a natural idea is to try the EM (expectation maximization) algorithm (Dempster et al., 1977), which requires the kernel [...] (wherein subscript 0 denotes expectation with the value [...]).
But the conditional density is intractable and so the kernel cannot be constructed. The idea now is to generate an approximation to the kernel (Wei and Tanner, 1990)
where [...] are supposed to be iid draws from [...].
We cannot directly generate draws from this conditional density since the normalising constant (or partition function) cannot be calculated. Note that this partition function is precisely the density that we cannot calculate. And now MCMC comes into play, because we can instead generate draws from a Markov chain which has [the target] as its limit or invariant or 'steady-state' density. There are numerous MCMC algorithms available (Roberts and Rosenthal, 1998). An alternative to EM is MC-NR (Monte Carlo Newton-Raphson) (Kuk and Cheng, 1997), in which the idea is to get [the required quantities] by simulation also. The EM framework remains useful. We find
So generate [draws] from an MCMC relating to [the conditional density] and use e.g. the steepest descent (or NR) update.
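As an illustration of generating such conditional draws when the normalising constant is unavailable, here is a random-walk Metropolis sketch. The Gaussian prior q, the Gaussian measurement noise, and all numerical values are our assumptions (not the paper's model), chosen so that the posterior mean is known in closed form for checking:

```python
import math
import random

random.seed(2)

W = 1.0        # one observed noisy measurement (illustrative)
SIG_E = 0.5    # measurement-noise standard deviation (assumed)

def log_target(u):
    # log p(u | W) up to the intractable normalising constant:
    # log f(W | u) + log q(u), with q(u) standard normal
    return -0.5 * ((W - u) / SIG_E) ** 2 - 0.5 * u ** 2

u = 0.0
draws = []
for k in range(50000):
    prop = u + random.gauss(0.0, 1.0)      # random-walk proposal
    if math.log(random.random()) < log_target(prop) - log_target(u):
        u = prop                           # accept; otherwise keep current u
    if k >= 5000:                          # discard burn-in
        draws.append(u)

mean = sum(draws) / len(draws)
# conjugate-Gaussian check: exact posterior mean is W / (1 + SIG_E**2) = 0.8
print(mean)
```

Only the unnormalised log-density is evaluated, which is precisely why MCMC applies here even though the partition function is intractable.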
This idea was developed into an online procedure in the previous work (Solo, 1999). But here we take a different route.
In recent work (Solo, 2000a; 2000b) the author has found a simpler route to Monte Carlo based estimation by means of direct simulation of the unobserved variable. The method is dubbed ’unobserved’ Monte Carlo (UMC).1 Referring to (4.1) we see that if we draw iid samples from the density q(u) then a Monte Carlo estimate of [the likelihood] is
Similarly the gradient is given by (with
which leads to a Monte Carlo estimator
We can get an estimate of [the score] by dividing (4.5) by (4.3). Detailed discussion of the technique in an offline setting is given in Solo (2000a) and Solo (2000b). Here we pursue an on-line version.
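The offline UMC estimators, the likelihood sum (4.3), the gradient sum (4.5), and their ratio, can be sketched in a few lines. The scalar logistic model, standard-normal q(u), and all constants below are illustrative assumptions of ours:

```python
import math
import random

random.seed(3)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

THETA = 1.0     # parameter value at which we evaluate (illustrative)
Y = 1           # one observed binary outcome
M = 200000      # number of iid draws from q

num = 0.0   # Monte Carlo sum for the theta-derivative of p(Y; theta), cf. (4.5)
den = 0.0   # Monte Carlo sum for p(Y; theta) itself, cf. (4.3)
for _ in range(M):
    u = random.gauss(0.0, 1.0)           # u ~ q, standard normal (assumed)
    p1 = sigmoid(THETA * u)
    f = p1 if Y == 1 else 1.0 - p1       # f(Y | u; theta)
    num += f * (Y - p1) * u              # f times d/dtheta log f
    den += f

like = den / M       # UMC estimate of the marginal p(Y; theta)
score = num / den    # ratio estimate of d/dtheta log p(Y; theta)
print(like, score)   # by symmetry the exact values here are 0.5 and 0.0
```

In this symmetric toy case the exact marginal is 1/2 and the exact score is 0 for every theta, which makes the Monte Carlo estimates easy to sanity-check.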
5. BINARY CLASSIFICATION WITH ERRORS IN CLASSIFYING VARIABLES - ONLINE
To get an online or adaptive algorithm we follow the usual approach (Solo and Kong, 1995) and use an ’instantaneous’ gradient. So (4.5) leads us to
where [...] is a draw based on [...]. This is much simpler than the MCMC approach developed in Solo (1999), since here the Monte Carlo draws are generated directly, not as a result of an auxiliary limiting procedure. Note that this will not converge to the maximum likelihood estimator; rather it is an estimator in its own right. It is not our aim here to attempt a full-blown convergence and performance analysis of this algorithm. But we can sketch a stability analysis. We now develop an averaging analysis using the kind of heuristics discussed e.g. in Solo and Kong (1995)2. Sum over an interval of extent N to get
Now if [the gain] is small and N not too large, then [the parameter] in the sum can be approximated by its value at the beginning of the sum, while [the draw] can be approximated by a draw from [...]. So we get [...]. Continuing, this can be approximated by its average [...]. Note the unusual feature that, to find the averaged system, [...] is the true density of [...].
Now argue in reverse
It is interesting that this is the same averaged system found in the previous work (Solo, 1999) based on MCMC simulation. Continuing we find [...]. We now note trivially that [...] is an equilibrium point of the averaged system since [...]. We also get local exponential stability of the linearized system since
And under certain regularity conditions R is positive definite since it is the Fisher information. Using the more formal methods in Solo and Kong (1995) we can link behaviour of the averaged system to the original algorithm through so-called hovering theorems; thus (5.1) has no equilibrium points and so, under certain regularity conditions, hovers or jitters around the equilibrium points of the averaged system. This analysis will be developed in detail elsewhere. However we can sketch a global stability analysis. If we replace the discrete averaged system (5.2) by a continuous-time one (this is discussed in Solo (1999)),
i.e. an ODE. Then there is a natural Lyapunov function, namely
By Jensen's inequality it has the property [...], and assuming one can show identifiability of [the parameter], [...] if and only if [...]. We then additionally assume [...] is continuous. Also we see that [...] so that
Thus [...] is uniformly bounded and converges. We assume [...] is continuous and [...] as [...]. For our example this holds e.g. with [...] and [...]. This then ensures that [...] is bounded. Since [...], all limit points of [...] must be stationary points of V. And local identifiability will ensure these are isolated.
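Putting the pieces together, here is a simulation sketch of the online algorithm. Because the displayed update (5.1) is suppressed in this scan, the exact single-draw form is not recoverable; below we use a small self-normalised batch of fresh draws from q per observation as an assumption, together with an illustrative logistic errors-in-variables model of our own choosing:

```python
import math
import random

random.seed(4)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

THETA_STAR = 1.0   # parameter generating the data (illustrative)
SIG_E = 0.5        # std of the measurement noise in w = u + e (assumed)
MU = 0.05          # fixed gain
M = 50             # fresh draws from q per observation (our assumption)

theta = 0.0
for _ in range(4000):
    # one observation (y, w); the classifying variable u itself stays hidden
    u = random.gauss(0.0, 1.0)
    w = u + random.gauss(0.0, SIG_E)
    y = 1 if random.random() < sigmoid(THETA_STAR * u) else 0
    # self-normalised Monte Carlo estimate of the instantaneous score
    num = den = 0.0
    for _ in range(M):
        v = random.gauss(0.0, 1.0)                      # v ~ q
        p1 = sigmoid(theta * v)
        fy = p1 if y == 1 else 1.0 - p1                 # f(y | v; theta)
        fw = math.exp(-0.5 * ((w - v) / SIG_E) ** 2)    # proportional to density of w given v
        num += fy * fw * (y - p1) * v
        den += fy * fw
    theta += MU * num / max(den, 1e-300)

print(theta)   # jitters around a neighbourhood of THETA_STAR
```

With the fixed gain the iterate hovers near the true parameter rather than converging, matching the hovering behaviour described by the averaging analysis above.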
6. CONCLUSIONS
In this paper we have pursued the issue of on-line use of simulation methods for adaptive parameter estimation in nonstandard problems. Here we have concentrated on errors in variables problems. We have shown that a simpler simulation method than previously developed has great promise for these kinds of problems. Previously problems like these have been ignored, or treated approximately by Taylor series expansions. And it now appears to be possible to do much better. The careful reader can cast an eye back over our discussion to see that nothing special about the errors in variables setup is invoked. The approach developed here extends in a straightforward way to deal with partially observed dynamical systems. It should be noted however that there will remain some problems where even to draw from the density of the unobserved signal will itself require an MCMC approach. In future work we will look at implementations for more realistic problems and study stability and performance in more detail.
NOTES

1. The current and related works were completed in the second half of 1999 after a sabbatical. Recently the author became aware of work in so-called sequential Monte Carlo [?], of which to some extent UMC is a special case.

2. A more formal analysis can be developed using the methods in Solo and Kong (1995) but will be pursued elsewhere.
REFERENCES

Benveniste, A., M. Metivier, and P. Priouret. (1990). Adaptive algorithms and stochastic approximations. Springer-Verlag, New York.
Dempster, A.P., N.M. Laird, and D.B. Rubin. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39:1–38.
Geman, S. and D. Geman. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Machine Intell., 6:721–741.
Kanatani, K. (1996). Statistical Optimization for Geometric Computation: Theory and Practice. North-Holland, Amsterdam.
Kuk, A.Y.C. and Y. W. Cheng. (1997). The Monte Carlo Newton-Raphson algorithm. Jl Stat Computation Simul.
Kitagawa, G. (1998). Self organising state space model. Jl. Amer. Stat. Assoc., 93:1203–1215.
Kushner, H.J. (1984). Approximation and weak convergence methods for random processes with application to stochastic system theory. MIT Press, Cambridge MA.
Ljung, L. (1983). Theory and practice of recursive identification. MIT Press, Cambridge, Massachusetts.
Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. (1953). Equations of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1091.
Ng, L. and V. Solo. (2001). Errors-in-variables modelling in optical flow estimation. IEEE Trans. Im. Proc., to appear.
Neuman, S.P. and S. Yakowitz. (1979). A statistical approach to the inverse problem of aquifer hydrology, I. Water Resour. Res., 15:845–860.
Rappaport, T.S. (1996). Wireless Communication. Prentice Hall, New York.
Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge UK.
Roberts, G.O. and J.S. Rosenthal. (1998). Markov chain Monte Carlo: Some practical implications of theoretical results. Canadian Jl. Stat., 26:5–31.
Sastry, S. and M. Bodson. (1989). Adaptive Control. Prentice Hall, New York.
Solo, V. and X. Kong. (1995). Adaptive Signal Processing Algorithms. Prentice Hall, New Jersey.
Solo, V. (1999). Adaptive algorithms and Markov chain Monte Carlo methods. In Proc. IEEE Conf. Decision Control 1999, Phoenix, Arizona. IEEE.
Solo, V. (2000a). ’Unobserved’ Monte Carlo method for system identification of partially observed nonlinear state space systems, Part I: Analog observations. In Proc. JSM2001, Atlanta, Georgia, August, to appear. Am. Stat. Assocn.
Solo, V. (2000b). ’Unobserved’ Monte Carlo method for system identification of partially observed nonlinear state space systems, Part II: Counting process observations. In Proc. 39th IEEE CDC, Sydney, Australia. IEEE.
Tekalp, M. (1995). Digital Video Processing. Prentice-Hall, Englewood Cliffs, N.J.
Wei, G.C.G. and M. Tanner. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Jl. Amer. Stat. Assoc., 85:699–704.
Yakowitz, S. (1969). Mathematics of Adaptive Control Processes. Elsevier, New York.
Yakowitz, S. (1985). Nonparametric density estimation, prediction and regression for Markov sequences. Jl. Amer. Stat. Assoc., 80:215–221.
Yakowitz, S. and F. Szidarovszky. (1984). A comparison of kriging with nonparametric regression methods. Jl Mult Anal.
Chapter 19

RANDOM SEARCH UNDER ADDITIVE NOISE

Luc Devroye
School of Computer Science
McGill University
Montreal, Canada H3A 2K6

Adam Krzyzak
Department of Computer Science
Concordia University
Montreal, Canada H3G 1M8
1. SID'S CONTRIBUTIONS TO NOISY OPTIMIZATION
From the early days in his career, Sid Yakowitz showed interest in noisy function optimization. He realized the universality of random search as an optimization paradigm, and was particularly interested in the minimization of functions Q without making assumptions on the form of Q. Especially the noisy optimization problem appealed to him, as exact computations of Q often come at a tremendous cost, while rough or noisy evaluations are computationally cheaper. His early contributions were with Fisher (Fisher and Yakowitz, 1976; Yakowitz and Fisher, 1973). The present paper builds on these fundamental papers and provides further results along the same lines. It is also intended to situate Sid’s contributions in the growing random search literature. Always motivated by the balance between accurate estimation or optimization and efficient computations, Sid then turned to so-called bandit problems, in which noisy optimization must be performed within a given total computational effort (Yakowitz and Lowe, 1991). The computational aspects of optimization brought him closer to learning, and his work there included studies of game playing strategies (Yakowitz, 1989; Yakowitz and Kollier, 1992), epidemiology (Yakowitz, 1992; Yakowitz, Hayes and Gani, 1992) and communication theory (Yakowitz and Vesterdahl, 1993).
Sid formulated machine learning invariably as a noisy optimization problem, both over finite and infinite sample spaces: Yakowitz (1992), Yakowitz and Lugosi (1990), and Yakowitz, Jayawardena and Li (1992) summarize his main views and results in this respect. Another thread he liked to follow was stochastic approximation, and in particular the Kiefer-Wolfowitz method (1952) for the local optimization in the presence of noise. In a couple of technical reports in 1989 and in his 1993 SIAM paper, Sid presented globally convergent extensions of this method by combining ideas of random search and stochastic approximation. We have learned from his insights and shared his passion for nonparametric estimation, machine learning and algorithmic statistics. Thank you, Sid.
2. FORMULATION OF SEARCH PROBLEM
We wish to locate the global minimum of a real-valued function Q on some search domain [...], a subset of [...]. As we pose it, this problem may have no solution. First of all, the function Q may not have a minimum on [...] (consider [...]); and if a minimum exists, it may not be unique (consider the real line and [...]); and if it exists and is unique, it may be nearly impossible to find it exactly, although we can hope to approximate it in some sense. But is even that possible? Take for example the function on [...] defined by [...] everywhere except on a finite set A, on which the function takes the value -1. Without a priori information about the location of the points of A, it is impossible to locate any point in A, and thus to find a global minimum. To get around this, we take a probabilistic view. Assume that we can probe our space with the help of a probability distribution [...], such as the uniform density on [...] or the standard normal distribution on [...]. If X is a random variable with probability distribution [...], we can define the global minimum by the essential infimum: [...]. This means that [...] and [...] for all [...]. The value of [...] depends heavily on [...]. It is the smallest possible value that we can hope to reach in a search process if the search is carried out at successive independent points with common probability distribution [...]. For example, if [...] is the uniform distribution on [...], then [...] is the (Lebesgue) essential infimum of Q in the unit cube. To push the formalism to an extreme, we could say that a couple [...] defines a search problem if (i) Q is a Borel measurable function on [...], (ii) [...] is a probability measure on the Borel sets of [...], (iii) [...]. Formally, a search algorithm is a sequence of mappings [...] from [...] to [...], where [...] is a place at which [...] is
Random Search Under Additive Noise
385
computed or evaluated. The objective is to have tend to and, if possible, to assure that the rate of this convergence is fast. In random search methods, the mapping is replaced by a distribution on that is a function of and is a random variable drawn from that distribution. The objective remains the same. Noisy optimization problems will be formally defined further on in the paper.
3. RANDOM SEARCH: A BRIEF OVERVIEW
Random search methods are powerful optimization techniques. They include pure random search, adaptive random search, simulated annealing, genetic algorithms, neural networks, evolution strategies, nonparametric estimation methods, bandit problems, simulation optimization, clustering methods, probabilistic automata and random restart. The ability of random search methods to locate the global extremum has made them an indispensable tool in many areas of science and engineering. The explosive growth of random search is partially documented in books such as those by Aarts and Korst (1989), Ackley (1987), Ermoliev and Wets (1988), Goldberg (1989), Holland (1992), Pintér (1996), Schwefel (1977, 1981), Törn and Žilinskas (1989), Van Laarhoven and Aarts (1987), Wasan (1969) and Zhigljavsky (1991). Random search algorithms are usually easy and inexpensive to implement. Since they either ignore the past or use a small collection of points from iteration to iteration, they are easily parallelizable. Convergence of most random search procedures is not affected by the cost function, in particular its smoothness or multimodality. In a minimax sense, random search is more powerful than deterministic search: it is nearly the best method in the worst possible situation (discontinuities, high dimensionality, multimodality) but possibly the worst method in the best situation (smoothness, continuity, unimodality) (Jarvis, 1975). The simplest random search method, pure random search, can be used to select a starting point for more sophisticated random search techniques, and can also act as a benchmark against which the performance of other search algorithms is measured. Also, random search is much less sensitive than deterministic search to function evaluations perturbed by additive noise, and that motivates the present paper. In ordinary random search, we denote by the best estimate of the (global) minimum after n iterations, and by a random probe point. In pure random search,
are i.i.d. with a given fixed probability measure over the parameter space The simple ordinary random search algorithm is given below:
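In Python, with an illustrative quadratic objective and uniform probes on [0, 1] (both placeholders, not from the text), the scheme reads:

```python
import random

def pure_random_search(Q, sample, n):
    """Pure random search: keep the best of n i.i.d. probe points.
    `sample` draws one point from the fixed probe distribution mu."""
    best = sample()
    best_val = Q(best)
    for _ in range(n - 1):
        w = sample()          # probe point, i.i.d. from the fixed distribution mu
        v = Q(w)
        if v < best_val:      # the best point is replaced only on strict improvement
            best, best_val = w, v
    return best, best_val

# Illustration on a toy problem: Q(x) = (x - 0.3)^2 over [0, 1], uniform probes.
random.seed(1)
x_best, q_best = pure_random_search(lambda x: (x - 0.3) ** 2, random.random, 5000)
```

Note that nothing about the objective is used beyond comparisons of function values, which is the source of the method's distribution-free behavior discussed below.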
In local random search on a discrete space, usually is a random neighbor of where the definition of a neighborhood depends upon the application. In
local random search in a Euclidean space, one might set
where is a random perturbation usually centered at zero. The fundamental properties of pure random search (Brooks, 1958) are well documented. Let be the distribution function of Then is approximately distributed as where E is an exponential random variable. This follows from the fact that if F is nonatomic,
Note first of all the distribution-free character of this statement: its universality is both appealing and limiting. We note in passing here that many papers have been written about how one could decide to stop random search at a certain point. To focus the search somewhat, random covering methods may be considered. For example, Lipschitz functions may be dealt with in the following manner (Shubert, 1972): at the trial points we know Q and can thus derive piecewise linear bounds on Q. The next trial point is given by
where C is the Lipschitz constant. This is a beautiful approach, whose implementation for large d seems very hard. For noisy problems, or when the dimension is large, a random version of this was proposed in Devroye (1978). If is compact, is taken uniformly in minus the union of the balls centered at the with radius If C is unknown, replace it in the formula for the radius by and let such that and (example: for Then almost surely. Global random search is a phrase used to denote many methods. Some of these methods proceed in a local manner, yet find a global minimum. Assume for example that we set
where are i.i.d. normal random vectors, and is a given deterministic sequence. The new probe point is not far from the old best point, as if one is trying to mimic local descent algorithms. However, over a compact set, global convergence takes place whenever This is merely due to the fact that form a cloud that becomes dense in the expanding sphere of radius Hence, we will never get stuck in a local minimum. The convergence result does not put any restrictions on Q. The
above result, while theoretically pleasing, is of modest value in practice as must be adapted to the problem at hand. A key paper in this respect is by Matyas (1965), who suggests making adaptive and setting
where is a preferred direction that is made adaptive as well. A rule of thumb, that may be found in several publications (see Devroye, 1972, and more recently, Bäck, Hoffmeister and Schwefel, 1991), is that should increase after a successful step, and decrease after a failure, and that the parameters should be adjusted to keep the probability of success around 1/5. Schumer and Steiglitz (1968) and others investigate the optimality of similar strategies for local hill-climbing. Alternately, may be found by a one-dimensional search along the direction given by (Bremermann, 1968; Gaviano, 1975). In simulated annealing, one works with random probes as in random search, but instead of letting be the best of (the probe point) and (the old best point), a randomized decision is introduced, that may be reformulated as follows (after Hajek and Sasaki, 1989):
where is a positive constant depending for now on only and is an i.i.d. sequence of positive random variables. The best point thus walks around the space at random. If the temperature, is zero, we obtain ordinary random search. If is a random walk over the parameter space. If and is exponentially distributed, then we obtain the Metropolis Markov chain or the Metropolis algorithm (Metropolis et al., 1953; Kirkpatrick, Gelatt and Vecchi, 1983; Meerkov, 1972; Cerny, 1985; Hajek and Sasaki, 1989). Yet another version of simulated annealing has emerged, called the heat bath Markov chain (Geman and Hwang, 1986; Aluffi-Pentini et al., 1985), which proceeds by setting
where now are i.i.d. random variables and is the temperature parameter. If the are distributed as the extreme-value distribution (with distribution function then we obtain the original version of the heat bath Markov chain. Note that each is then distributed as log log(1/U) where U is uniform [0,1], so that computer simulation is not hampered. The two schemes are not dramatically different. The heat bath Markov chain as we presented it here has the feature that function evaluations are intentionally corrupted by noise. This clearly reduces the information content and must slow down the algorithm. Most random search algorithms take random steps but
do not add noise to measurements; in simulated annealing, one deliberately destroys valuable information. It should be possible to formulate an algorithm that does not corrupt expensive function evaluations with noise (by storing them) and outperforms the simulated annealing algorithm in some sense. One should be careful though and only compare algorithms that occupy equal amounts of storage for the program and the data. We now turn to the choice of In view of the representation given above, it is clear that is bounded from below by a constant times as is the threshold we allow in steps away from the minimum. Hence the need to make small. This need clashes with the condition of convergence (typically, must be at least for some constant ). The condition of convergence depends upon the setting (the space and the definition of given We briefly deal with the specific case of finite-domain simulated annealing below. In continuous spaces, progress has been made by Vanderbilt and Louie (1984), Dekkers and Aarts (1991), Bohachevsky, Johnson and Stein (1986), Gelfand and Mitter (1991), and Haario and Saksman (1991). Other key references on simulated annealing include Aarts and Korst (1989), Van Laarhoven and Aarts (1987), Anily and Federgruen (1987), Gidas (1985), Hajek (1988), and Johnson, Aragon, McGeoch and Schevon (1989). Further work seems required on an information-theoretic proof of the inadmissibility of simulated annealing and on a unified treatment of multistart and simulated annealing, where multistart is a random search procedure in which one starts at a randomly selected place at given times or whenever one is stuck in a local minimum. On a finite connected graph, simulated annealing proceeds by picking a trial point uniformly at random from its neighbors. Assume the graph is regular, i.e., each node has an equal number of neighbors. 
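As a concrete illustration of the finite-graph setting just described, here is a minimal Metropolis-style run on the simplest regular graph, the n-cycle; the toy objective, the fixed temperature, and the step count are illustrative assumptions:

```python
import math
import random

def circular_Q(i, n=100, target=70):
    """Toy objective on a cycle of n nodes: scaled circular distance to `target`."""
    d = abs((i - target) % n)
    return min(d, n - d) / 10.0

def anneal_on_cycle(Q, n_nodes=100, steps=50000, temperature=1.0, seed=3):
    """One Metropolis run on a regular graph (here: the n-cycle, degree 2).
    The trial point is a uniformly random neighbor; uphill moves are accepted
    with probability exp(-(Q(trial) - Q(current)) / T)."""
    rng = random.Random(seed)
    current = rng.randrange(n_nodes)
    best, best_val = current, Q(current)
    for _ in range(steps):
        trial = (current + rng.choice((-1, 1))) % n_nodes  # uniform random neighbor
        delta = Q(trial) - Q(current)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = trial
        if Q(current) < best_val:                          # remember the best point seen
            best, best_val = current, Q(current)
    return best, best_val

best_node, best_val = anneal_on_cycle(circular_Q)
```

Keeping the running best point in storage, as done here, is exactly the kind of bookkeeping the annealing chain itself forgoes.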
If we keep the temperature fixed, then there is a limiting distribution for called the Gibbs distribution or Maxwell-Boltzmann distribution: for the Metropolis algorithm, the asymptotic probability of node is proportional to Interestingly, this is independent of the structure of the graph. If we now let then with probability tending to one, belongs to the collection of local minima. With probability tending to one, belongs to the set of global minima if additionally, (for example, for will do). Here is the maximum of all depths of strictly local minima (Hajek, 1988). The only condition on the graph is that all connected components of are strongly connected for any The slow convergence of puts a severe lower bound on the convergence rate of simulated annealing. Let us consider optimization on a compact of and let Q be bounded there. If we let have a fixed density that is bounded from below by a constant times the indicator of the unit ball, then in the Metropolis algorithm converges to the global minimum in probability if yet
Bohachevsky, Johnson and Stein (1986) adjust during the search to make the probability of accepting a trial point hover near a constant. Nevertheless, if is taken as above, the rate of convergence to the minimum is bounded from below by which is much slower than the polynomial rate we would have if Q were multimodal but Lipschitz. Several ideas deserve more attention as they lead to potentially efficient algorithms. These are listed here in arbitrary order. In 1975, Jarvis introduced competing searches such as competing local random searches. If N is the number of such searches, a trial (or time unit) is spent on the search with probability where is adapted as time evolves; a possible formula is to replace by where is a weight, and are constants, and is the trial point for the competing search. More energy is spent on promising searches. This idea was pushed further by several researchers in one form or another. Several groups realized that when two searches converge to the same local minimum, many function evaluations could be wasted. Hence the need for on-line clustering, the detection of points that belong somehow to the same local valley of the function. See Becker and Lago (1970), Törn (1974, 1976), de Biase and Frontini (1978), Boender et al (1982), and Rinnooy Kan and Timmer (1984, 1987). The picture is now becoming clearer—it pays to keep track of several base points, i.e., to increase the storage. In Price’s controlled random search for example (Price, 1983), one has a cloud of points of size about where is the dimension of the space. A random simplex is drawn from these points, and the worst point of this simplex is replaced by a trial point, if this trial point is better. The trial point is picked at random inside the simplex. Independently, the German school developed the Evolutionsstrategie (Rechenberg, 1973; Schwefel, 1981). Here a population of base points gives rise to a population of trial points. 
Of the group of trial points, we keep the best N, and repeat the process. Bilbro and Snyder (1991) propose tree annealing: all trial points are stored in tree format, with randomly picked leaves spawning two children. The leaf probabilities are determined as products of edge probabilities on the path to the root, and the tree represents the classical tree partition of the space. Their approach is at the same time computationally efficient and fast. To deal with high-dimensional spaces, the coordinate projection method of Zakharov (1969) and Hartman (1973) deserves some attention. Picture the space as being partitioned by a N × · · · × N regular grid. With each marginal interval of each coordinate we associate a weight proportional to the likelihood that the global minimum is in that interval. A cell is grabbed at random in the grid according to these (product) probabilities, and the marginal weights are
updated. While this method is not fool-proof, it attempts at least to organize global search effort in some logical way. Consider a population of points, called a generation. By selecting good points, modifying or mutating good points, and combining two or more good points, one may generate a new generation, which, hopefully, is an improvement over the parent generation. Iterating this process leads to the evolutionary search method (Bremermann, 1962, 1968; Rechenberg, 1973; Schwefel, 1977; Jarvis, 1975) and a body of methods called genetic algorithms (Holland, 1975). Mutations may be visualized as little perturbations by noise vectors in a continuous space. However, if is the space then mutations become bit flips, and combinations of points are obtained by merging bit strings in some way. The term cross-over is often used. In optimization on graphs, mutations correspond to picking a random neighbor. The selection of good points may be extinctive or preserving, elitist or non-elitist. It may be proportional or based on ranks. As well, it may be adaptive and allow for immigration (new individuals). In some cases, parents never die and live in all subsequent generations. The population size may be stable or explosive. Intricate algorithms include parameters of the algorithm itself as part of the genetic structure. Convergence is driven by mutation and can be proved under conditions not unlike those of standard random search. Evolution strategies aim to mimic true biological evolution. In this respect, the early work of Bremermann (1962) makes for fascinating reading. Ackley’s thesis (1987) provides some practical implementations. In a continuous space, the method of generations as designed by Ermakov and Zhigljavsky (1983) lets the population size change over time. To form a new generation, parents are picked with probability proportional to
and random perturbation vectors are added to each individual, where is to be specified. The latter are distributed as where the ’s are i.i.d. and is a time-dependent scale factor. This tends to maximize Q if we let tend to infinity at a certain rate. For more recent references, see Goldberg (1989), Schwefel (1995) or Banzhaf, Nordin and Keller (1998).
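A minimal sketch of this generational keep-the-best selection; the (5 + 20) population sizes, the geometrically decaying mutation scale, and the sphere objective are illustrative assumptions that stand in for the adaptive rules of a real Evolutionsstrategie:

```python
import random

def evolve(Q, dim=2, pop=5, children=20, generations=200, sigma0=0.5, seed=7):
    """Bare-bones (pop + children) evolution strategy on R^dim: mutate parents
    with Gaussian noise of decaying scale, then keep the best `pop` individuals."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop)]
    sigma = sigma0
    for _ in range(generations):
        offspring = []
        for _ in range(children):
            parent = rng.choice(population)
            offspring.append([p + rng.gauss(0, sigma) for p in parent])
        # (pop + children) selection: parents survive if still among the best
        population = sorted(population + offspring, key=Q)[:pop]
        sigma *= 0.99          # deterministic decay stands in for true adaptation
    return population[0], Q(population[0])

sphere = lambda x: sum(v * v for v in x)   # toy objective with minimum at the origin
x_best, q_best = evolve(sphere)
```

Since parents may survive indefinitely, this is a "preserving, elitist" selection in the terminology above.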
4. NOISY OPTIMIZATION BY RANDOM SEARCH: A BRIEF SURVEY
Here is a rather general optimization problem: for each point we can observe a random process with almost surely, where Q is the function to be minimized. We refer to this as the noisy optimization model. For example, at we can observe independent copies of where is measurement noise satisfying and Averaging
these observations naturally leads to a sequence with the given convergence property. In simulation optimization, may represent a simulation run for a system parametrized by It is necessary to take large for accuracy, but taking too large would be wasteful for optimization. Beautiful compromises are awaiting the analyst. Finally, in some cases, is known to be the expected value or an integral, as in or where A is a fixed set and T is a given random variable. In both cases, may represent a certain Monte Carlo estimate of which may be made as accurate as desired by taking large enough. By additive noise, we mean that each is corrupted by an independent realization of a random variable Z, so that we can only observe The first question to ask is whether ordinary random search is still convergent. Formally, if are independent realizations of Z, the algorithm generates trials and at observes Then is defined as the trial point among with the lowest value Assume that with probability at least is sampled according to a fixed distribution with support on Even though the decisions are arbitrary, as in simulated annealing, and even though there is no converging temperature factor, the above algorithm may be convergent in some cases, i.e., in probability. For stable noise, i.e., noise with distribution function G satisfying
such as normally distributed noise, or indeed, any noise with tails that decrease faster to zero than exponential, then we have convergence in the given sense. The reader should not confuse our notion of stability which is taken from the order statistics literature (Geffroy, 1958) with that of the stable distribution. Stable noise is interesting because an i.i.d. sequence drawn from G, satisfies in probability for some sequence See for example Rubinstein and Weissman (1979). Additional results are presented in this paper. In noisy optimization in general, it is possible to observe a sample drawn from distribution at each with possibly different for each The mean of is If there are just two and the probe points selected by us are where each of the is one of the then the purpose in bandit problems is to minimize
in some sense (by, e.g., keeping it small). This minimization is with respect to the sequential choice of the probe points. Obviously, we would like all
to be exactly at the best but that is impossible since some sampling of the non-optimal value or values is necessary. Similarly, we may sometimes wish to minimize
where is the global minimum of Q. This is relevant whenever we want to optimize a system on the fly, such as an operational control system or a game-playing program. Strategies have been developed based upon certain parametric assumptions on the or in a purely nonparametric setting. A distinction is also made between finite horizon and infinite horizon solutions. With a finite number of bandits, if at least one is nondegenerate, then for any algorithm, we must have for some constant on some optimization problem (Robbins, 1952; Lai and Robbins, 1985). In the case of bounded noise, Yakowitz and Lowe (1991) devised a play-the-leader strategy in which the trial point is the best point seen thus far (based on averages) unless for some integer ( and are fixed positive numbers), at which times is picked at random from all possible choices. This guarantees Thus, the optimum is missed at most log times out of Another useful strategy for parametric families was proposed by Lai and Robbins (1985). Here confidence intervals are constructed for all The with the smallest lower confidence interval endpoint is sampled. Exact lower bounds were derived by them for this situation. For two normal distributions with means and variances and Holland (1973) showed that Yakowitz and Lugosi (1989) illustrate how one may optimize an evaluation function on-line in the Japanese game of gomoku. Here each represents a Bernoulli distribution and is nothing but the probability of winning against a random opponent with parameters In a noisy situation when is uncountable, we may minimize Q if we are given infinite storage. More formally, let be trial points, with the only restriction being that at each with probability at least is sampled from a distribution whose support is the whole space (such as the normal density, or the uniform density on a compact).
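A sketch of the play-the-leader idea; the schedule of forced explorations at the perfect squares, the Gaussian loss noise, and the two-armed setup are illustrative assumptions:

```python
import random

def play_the_leader(means, horizon=2000, seed=11):
    """Play-the-leader on a finite set of noisy alternatives: pull the arm with
    the best running average, except at sparse forced-exploration times n = m^2,
    where a uniformly random arm is pulled instead."""
    rng = random.Random(seed)
    k = len(means)
    totals, counts = [0.0] * k, [0] * k
    for arm in range(k):                        # initialize: one pull per arm
        totals[arm] += means[arm] + rng.gauss(0, 0.1)
        counts[arm] += 1
    forced = {m * m for m in range(1, int(horizon ** 0.5) + 1)}
    for n in range(k, horizon):
        if n in forced:
            arm = rng.randrange(k)              # sparse forced exploration
        else:                                   # otherwise follow the leader
            arm = min(range(k), key=lambda a: totals[a] / counts[a])
        totals[arm] += means[arm] + rng.gauss(0, 0.1)   # observe a noisy loss
        counts[arm] += 1
    return counts

counts = play_the_leader([0.2, 0.8])            # arm 0 has the smaller mean loss
```

The sparsity of the forced explorations is what keeps the number of suboptimal pulls logarithmic in spirit.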
The support of a random variable X is the smallest closed set S such that We also make sure that at least observations are available for each at time If the noise is additive, we may consider the pairings for all the observations at each of and recording all values of the number of wins of over where a win occurs when for a pair of observations For each let and define as the trial point with maximal value. If and then almost surely (Devroye, 1977; Fisher and Yakowitz, 1973).
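A sketch of this pairing-and-wins rule; the three probe points and the Cauchy noise (chosen because it has no mean, so averaging would fail while ranks do not) are illustrative assumptions:

```python
import math
import random

def select_by_wins(Q, points, reps=200, seed=5):
    """Rank-based selection under additive noise: draw `reps` observations at
    each point, pair them up across points, and count pairwise 'wins' (a win is
    a smaller noisy value). Return the point with the most wins."""
    rng = random.Random(seed)
    cauchy = lambda: math.tan(math.pi * (rng.random() - 0.5))  # heavy-tailed noise
    obs = [[Q(p) + cauchy() for _ in range(reps)] for p in points]
    wins = [0] * len(points)
    for i in range(len(points)):
        for j in range(len(points)):
            if i != j:
                wins[i] += sum(a < b for a, b in zip(obs[i], obs[j]))
    return points[max(range(len(points)), key=lambda i: wins[i])]

best = select_by_wins(lambda x: x * x, [0.0, 1.0, 2.0])
```

Only the signs of the pairwise differences matter, which is why no tail condition on the noise is needed.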
Interestingly, there are no conditions whatever on the noise distribution. With averaging instead of a statistic based on ranks, a tail condition on the noise would have been necessary. Details and proofs are provided in this paper. For non-additive noise,
for all (where Y is drawn from ) suffices for example when is obtained by minimizing the at the trial points. Gurin (1966) was the first to explore the idea of averages of repeated measurements. Assume again the condition on the selection of trial points and let denote the average of observations. Then, if Gurin proceeds by setting
This is contrary to all principles of simulated annealing, as we are gingerly accepting new best points by virtue of the threshold Devroye (1976) has obtained some sufficient conditions for the strong convergence of One set includes and (a very strong condition indeed). If and for each is stochastically smaller than Z where for some then and are sufficient as well. In the latter case, the conditions insure that with probability one, we make a finite number of incorrect decisions. Other references along the same lines include Marti (1982), Pintér (1984), Karmanov (1974), Solis and Wets (1978), Koronacki (1976) and Tarasenko (1977).
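A sketch of Gurin-style acceptance with a safety threshold; the growing sample-size schedule, the fixed threshold of 0.1, and the toy objective are illustrative choices and are not calibrated to the sufficient conditions above:

```python
import random

def gurin_search(Q, horizon=300, seed=9):
    """Gurin-flavored noisy random search: at step n, average a growing number
    of noisy observations at the current best point and at a fresh uniform
    trial point; accept the trial only if it beats the best by a threshold."""
    rng = random.Random(seed)
    def noisy_avg(x, m):                    # average of m noisy observations at x
        return sum(Q(x) + rng.gauss(0, 1.0) for _ in range(m)) / m
    best = rng.random()
    for n in range(1, horizon + 1):
        m = 20 + n                          # growing sample size (an assumed schedule)
        trial = rng.random()
        if noisy_avg(trial, m) < noisy_avg(best, m) - 0.1:   # threshold acceptance
            best = trial
    return best

best = gurin_search(lambda x: (x - 0.5) ** 2)
```

The threshold deliberately biases the comparison in favor of the incumbent, which is the "gingerly accepting" behavior described above.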
5. OPTIMIZATION AND NONPARAMETRIC ESTIMATION
To extract the maximum amount of information from past observations, we might store these observations and construct a nonparametric estimate of the regression function where Y is an observation from Assume that we have pairs where a diverging number of are drawn from a global distribution, and the are corresponding noisy observations. Estimate by which may be obtained by averaging those whose is among the nearest neighbors of It should be obvious that if almost surely, then almost surely if To this end, it suffices for example that that the noise be uniformly bounded, and that be compact. Such nonparametric estimates may also be used to identify local minima.
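A sketch of such a nearest-neighbor regression estimate used to locate an approximate minimizer; the toy objective, the Gaussian noise, and the choice k = 100 are illustrative assumptions:

```python
import random

def knn_estimate(data, x, k):
    """Nearest-neighbor regression estimate of Q(x): average the noisy
    observations whose design points are among the k nearest neighbors of x."""
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

# Noisy observations of Q(x) = (x - 0.7)^2 at uniform design points.
rng = random.Random(13)
Q = lambda x: (x - 0.7) ** 2
data = [(x, Q(x) + rng.gauss(0, 0.5)) for x in (rng.random() for _ in range(2000))]

# Use the smoothed estimate itself to pick an approximate minimizer on a grid.
grid = [i / 100 for i in range(101)]
x_hat = min(grid, key=lambda x: knn_estimate(data, x, k=100))
```

Averaging over neighbors trades a small bias (the neighborhood has positive width) for a large reduction in noise variance.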
6. NOISY OPTIMIZATION: FORMULATION OF THE PROBLEM
We consider a search problem on a subset B of where is the probability distribution of a generic random variable X that has support on B. Typically, is the uniform distribution on B. For every it is possible to obtain an i.i.d. sequence distributed as where is a random variable (“the noise”) with a fixed but unknown distribution. We can, if we wish, demand to see as little or as much of the sequence as we wish. With this formulation, it is still possible to define a random search procedure such that almost surely for all search problems and all distributions of Note that we do not even assume that has a mean. Throughout this paper, F is the distribution function of The purpose of this paper is to draw attention to such universally convergent random search algorithms that do not place any conditions on and F, just as Sid Yakowitz showed us in 1973 (Yakowitz and Fisher, 1973).
7. PURE RANDOM SEARCH
In this section, we analyze the behavior of unaltered pure random search under additive noise. The probe points form an i.i.d. sequence drawn from a distribution with probability distribution At each probe point we observe where the ’s are i.i.d. random variables distributed as Then we define
This is nothing but the pure random search algorithm, employed as if we were unaware of the presence of any noise. Our study of this algorithm will reveal how noise-sensitive or robust pure random search really is. Not unexpectedly, the behavior of the algorithm depends upon the nature of the noise distribution. The noise will be called stable if for all
where G is the distribution function of and 0/0 is considered as zero. This will be called Gnedenko’s condition (see Lemma 1). A sufficient condition for stability is that G has a density and
EXAMPLES. If G does not have a left tail (i.e., for some then the noise is stable. Normal noise is also stable, but double exponential noise is not. In fact, the exponential distribution is on the
borderline between stability and non-stability. Distributions with a diverging hazard rate as we travel from the origin out to are stable. Thus, stable noise distributions have small left tails. In fact, for all The reason why stable noise will turn out to be manageable is that min is basically known to fall into an interval of arbitrarily small positive length around some deterministic value with probability tending to one for some sequence It could thus happen that as yet this is not a problem. This was also observed by Rubinstein and Weissman (1979). Next, we obtain a necessary and sufficient condition for the weak convergence of for the pure random search algorithm. THEOREM 1. If is stable, then in probability. Conversely, if is not stable, then does not tend to in probability for any search problem for which for all small enough, > 0. We picked the name “stable noise” because the minimum of is stable in the sense used in the literature on order statistics, that is, there exists a sequence such that in probability. We will prove the minimal properties needed further on in this section. The equivalence property A of Lemma 1 is due to Gnedenko (1943), while parts B and C are inherent in the fundamental paper of Geffroy (1958). LEMMA 1.
A. G is the distribution function of stable noise if and only if in probability for some sequence
B. If and then Note: in probability for some sequence then as for all Also, if and as for all
C. If the noise distribution is not stable, then there exist positive constants, a sequence, a subsequence, and an such that for all
PROOF. We begin with property B. Note that by assumption, and thus Also, implies Observe that for any This shows that eventually, Thus, and Let us turn to A. We first show that B implies Gnedenko’s condition. We can assume without loss of generality that is monotone decreasing since can be replaced by in view of property B. For every we find such that Thus, and
Thus,
The case of bounded is trivial, so assume Now let (and thus ), and deduce that Since is arbitrary, we obtain Gnedenko’s condition. Next, part A follows if we can show that Gnedenko’s condition implies the existence of the sequence proving the existence of is equivalent to proving the existence of such that and for all Let us take From the definition of G, we note that for any Thus, by the Gnedenko condition, for
Similarly,
This concludes the proof of part A. For part C, we see that necessarily We define By assumption, there exists an a sequence and an such that for all Next, by definition of we note that for infinitely many indices we have These define the subsequence that we will use. Observe that for all while for all with
PROOF OF THEOREM 1. Let F be the distribution function of and let G be the distribution function of We first show that stable noise is sufficient for convergence. For brevity, we denote by Furthermore, is an arbitrary constant, and Observe that the event is implied by where is the event that for some and simultaneously, and is the event that for all we either have or The convergence follows if we can show that and as
This tends to zero by property B of Lemma 1. Next,
where we used property B of Lemma 1 again. This concludes the proof of the sufficiency. The necessity is obtained as follows. Since G is not stable, we can find positive constants a sequence a subsequence and an such that and for all (Lemma 1, property C). Let be in this subsequence and let be the multinomial random vector with the number of values in and respectively. We first condition on this vector. Clearly, if for some in the second interval we have while for all in the first interval, we have then Thus, the conditional probability of this event is
To un-condition, we use the multinomial moment generating function
where and are the parameters of the multinomial distribution. This yields
provided that This can be guaranteed, since we can make smaller without compromising the validity of the lower bound. This concludes the proof of Theorem 1.
8. STRONG CONVERGENCE AND STRONG STABILITY
There is a strong convergence analog of Theorem 1. We say that G is strongly stable noise if the minimal order statistic of is strongly stable, i.e., there exists a sequence of numbers such that almost surely as THEOREM 2. If is strongly stable, then almost surely. PROOF. Since strong stability implies stability, we recall from Lemma 1 that we can assume without loss of generality that For let Observe that in any case as Let be arbitrary, and let be the set of indices between 1 and for which Let denote the cardinality of this set. As is binomial, it is easy to see that except possibly finitely often with probability one. Define
Since almost surely, we have and almost surely. Define Consider the following inclusion of events (assuming for convenience that is integer-valued):
The event on the right hand side has zero probability. Hence so does the event on the left hand side. It is more difficult to provide characterizations of strongly stable noises, although several sufficient and a few necessary conditions are known. For an in-depth treatment, we refer to Geffroy (1958). It suffices to summarize a few key results here. The following condition due to Geffroy is sufficient:
This function comes close to being necessary. Indeed, if G is strongly stable, and is monotone in the left tail beyond some point, then Geffroy’s condition must necessarily hold. If G has a density then another sufficient condition is that
It is easy to verify now that noise with density is strongly stable for and is not stable when The borderline is once again close to the double exponential distribution. To more clearly identify the threshold cases consider the noise distribution function given by where is a positive function. It can be shown that for constant the noise is stable but not strongly stable. However, if as then G is strongly stable (Geffroy, 1958).
9. MIXED RANDOM SEARCH
Assume next that we are not using pure random search, in the hope of assuring consistency for more noise distributions, or speeding up the method of the previous section. A certain minimum amount of global search is needed in any case. So, we will consider the following prototype model of an algorithm: has distribution given by where is a sequence of probabilities, and is an arbitrary probability distribution that may depend upon the past (all with and all observations made up to time ). We call this mixed random search, since with probability the trial point is generated according to the pure random search distribution In the noiseless case, is necessary and sufficient for strong convergence of to for all search problems and all ways of choosing One is tempted to think that under stable noise and the same condition on we still have at least weak convergence. Unfortunately, this is not so. What has gone wrong is that it is possible that too few probe points have small Q-values, and that the smallest corresponding to these probe values is not small enough to “beat” the smallest among the other probe values. In the next theorem, we establish that convergence under stable noise conditions can only be guaranteed when a positive fraction of the search effort is spent on global search, e.g. when Otherwise, we can still have convergence of but we lose the guarantee, as there are several possible counterexamples. THEOREM 3. If for some then under stable noise conditions, in probability, and under strong stable noise conditions, almost surely. Conversely, if then there exists a search problem, a stable noise distribution G, and a way of choosing the sequence such that does not converge weakly to PROOF. We only prove the first part with strong stability. We mimic the proof of Theorem 2 with the following modification: let be the set of indices between 1 and for which and is generated according to the portion of the removed mixture distribution. 
Note that is distributed as where the are i.i.d. Bernoulli random variables with parameter Note that almost surely,
so that except possibly finitely often with probability one. Then apply the event inclusion of Theorem 2 with The weak convergence is obtained in a similar fashion from the inclusion
where we use the notation of Theorem 2. All events on the right-hand side have probability tending to zero with
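The prototype algorithm of this section can be sketched as follows (Python; the local component of the mixture — a uniform window around the current best point — and the noiseless acceptance test are our illustrative choices):

```python
import random

def mixed_random_search(q, n_iter, p_n, seed=0, width=0.05):
    """Mixed random search on [0, 1]: at step n the trial point is drawn
    from the global (uniform) distribution F with probability p_n(n), and
    otherwise from a history-dependent local distribution around the
    current best point (our illustrative choice for that component)."""
    rng = random.Random(seed)
    x = rng.random()
    y = q(x)                      # noiseless observations in this sketch
    for n in range(1, n_iter + 1):
        if rng.random() < p_n(n):
            z = rng.random()                              # global F-step
        else:
            z = min(1.0, max(0.0, x + rng.uniform(-width, width)))
        if q(z) < y:
            x, y = z, q(z)
    return x

# p_n bounded away from zero, as Theorem 3 requires under noise.
x_hat = mixed_random_search(lambda x: abs(x - 0.7), 3000, lambda n: 0.5)
```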
10. STRATEGIES FOR GENERAL ADDITIVE NOISE
From the previous sections, we conclude that under some circumstances, noise can be tolerated in pure random search. However, it is very difficult to verify whether the noise at hand is indeed stable; and the rate of convergence takes a terrible beating with some stable noise distributions. There are algorithms whose convergence is guaranteed under all noise distributions, and whose rate of convergence depends mainly upon the search distribution F, and not on the noise distribution! Such niceties do not come free: a slower rate of convergence results even when the algorithm operates under no noise; and in one of the two strategies discussed further on, the storage requirements grow unbounded with time. How can we proceed? If we stick to the idea of trying to obtain improvements of by comparing observations drawn at with observations drawn at then we should be very careful not to accept as the new unless we are reasonably sure that Thus, several noise-perturbed observations are needed at each point, and some additional protection is needed in terms of thresholds that give the edge in a comparison. This is only natural, since embodies all the information gathered so far, and we should not throw it away lightly. To make the rate of convergence less dependent upon the noise distribution, we should consider comparisons between observations that are based on the relative ranks of these only. This kind of solution was first proposed in Devroye (1977). Yakowitz and Fisher (1973) proposed another strategy, in which no information is ever thrown away. We store for each all the observations ever made at it. At time draw more observations at all the previous probe points and at a new probe point and choose from the entire pool of probe points. From an information-theoretic point of view, this is a clever though costly policy. The decision of which probe point to take should, once again, be based upon ranks.
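A hedged sketch of the repeated-sampling idea with a protective threshold (Python; the statistic — a difference of sample means rather than the rank statistics favoured in the text — and the fixed threshold are our illustrative choices):

```python
import random

def accept_challenger(obs_best, obs_new, threshold):
    """Replace the incumbent only when the challenger's observations beat
    the incumbent's by more than `threshold`; the threshold protects the
    information accumulated in the incumbent point. Comparing sample
    means is our illustrative choice here."""
    mean_best = sum(obs_best) / len(obs_best)
    mean_new = sum(obs_new) / len(obs_new)
    return mean_new < mean_best - threshold

def sequential_search(q, n_iter, m, threshold, noise_scale=0.1, seed=0):
    """At each step, draw a fresh global candidate, take m noisy
    observations at it and at the incumbent, and keep the incumbent
    unless the candidate clears the threshold."""
    rng = random.Random(seed)
    x = rng.random()
    for _ in range(n_iter):
        z = rng.random()          # candidate from the global distribution
        obs_x = [q(x) + noise_scale * rng.gauss(0, 1) for _ in range(m)]
        obs_z = [q(z) + noise_scale * rng.gauss(0, 1) for _ in range(m)]
        if accept_challenger(obs_x, obs_z, threshold):
            x = z
    return x
```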
Yakowitz and Fisher (1973) employ the empirical distribution functions of the observations at each probe point. Devroye (1977), in a related approach, advocates the use of the Wilcoxon-Mann-Whitney rank statistic and modifications of it. Note that the fact that no information is ever discarded makes these algorithms nonsequential in nature; this is good for parallel implementations, but notoriously bad when rates of convergence are considered.
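One rank-based comparison of two probe points via their empirical distribution functions can be sketched as follows (Python; the concrete decision rule — comparing the two one-sided Kolmogorov-Smirnov distances — is our assumption, since the text only requires that comparisons be based on ranks):

```python
import bisect

def ecdf(sample):
    """Return the empirical distribution function of a sample."""
    s = sorted(sample)
    n = len(s)
    return lambda x: bisect.bisect_right(s, x) / n

def smaller_sample(sample_a, sample_b):
    """Decide which sample looks stochastically smaller by comparing the
    one-sided Kolmogorov-Smirnov distances between the two empirical
    distribution functions (0 means sample_a, 1 means sample_b)."""
    Fa, Fb = ecdf(sample_a), ecdf(sample_b)
    grid = sorted(set(sample_a) | set(sample_b))
    d_ab = max(Fa(x) - Fb(x) for x in grid)  # evidence that a is smaller
    d_ba = max(Fb(x) - Fa(x) for x in grid)
    return 0 if d_ab >= d_ba else 1

def most_wins(samples):
    """Round-robin selection over the pool: play one match per pair and
    return the index with the most wins (ties go to the smallest index)."""
    wins = [0] * len(samples)
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            w = i if smaller_sample(samples[i], samples[j]) == 0 else j
            wins[w] += 1
    return max(range(len(samples)), key=lambda k: (wins[k], -k))
```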
Random Search Under Additive Noise
Consider first the nonsequential strategy in its crudest form: define as the “best” among the first probe points, which in turn are i.i.d. random vectors with probability distribution Let be a sequence of integers to be chosen by the user. We make sure that at each time the function is sampled times at each If the previous observations are not thrown away, then this means that new observations are necessary, at and at each of The observations are stored in a giant array. As an index of the “goodness” of we could consider the average
The best point is the one with the smallest average. Clearly, this strategy cannot be universal, since for good performance it is necessary that the law of large numbers applies, and thus that where is the noise random variable. However, in view of its simplicity and importance, we will return to this solution in a later subsection. If we order all the components of the vector so as to obtain then other measures of goodness may include the medians of these components, or “quick and dirty” methods such as Gastwirth’s statistic (Gastwirth, 1966)
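These measures of goodness can be sketched as follows (Python; the tertile indices in the Gastwirth-type statistic, with weights 0.3/0.4/0.3, are our rough rendering of it):

```python
def average(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def gastwirth(xs):
    """A Gastwirth-type location statistic: a weighted combination of the
    tertiles and the median (weights 0.3 / 0.4 / 0.3); the index choices
    below are a rough approximation of the tertiles."""
    s = sorted(xs)
    n = len(s)
    lo_t = s[max(0, n // 3 - 1)]          # rough lower tertile
    hi_t = s[min(n - 1, (2 * n) // 3)]    # rough upper tertile
    return 0.3 * lo_t + 0.4 * median(s) + 0.3 * hi_t

def best_probe(observations, stat=average):
    """observations: one list of noisy samples per probe point; return the
    index of the probe point with the smallest chosen statistic."""
    values = [stat(obs) for obs in observations]
    return min(range(len(values)), key=values.__getitem__)
```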
We might define as that point among with the smallest value of We could also introduce the notion of pairwise competitions between the for example, we say that wins against if The with the most wins is selected to be This view leads to precisely that with the smallest value of However, pairwise competitions can go further, as we now illustrate. Yakowitz and Fisher (1973) thought it useful to work with the empirical distribution functions Our approach may be considered as a tournament between in which matches are played, one per pair Let
We say that wins its match with when We define as that member of with the most wins. In case of ties, we take the member with the smallest index. We call this the tournament method. Rather than using the Kolmogorov-Smirnov based statistic suggested above, one might also
consider other rank statistics based on medians of observations or upon the generalized Wilcoxon statistic (Wilcoxon, 1945; Mann and Whitney, 1947), where only comparisons between are used. The number of function evaluations used up to iteration is In addition, in some cases, some sorting may be necessary, and in nearly all the cases, the entire array needs to be kept in storage. Also, there are matches to determine the wins, leading to a complexity, in the iteration alone, of about This seems to be extremely wasteful. We discuss some time-saving modifications further on. We provide a typical result here (Theorem 4) for the tournament method. THEOREM 4. Let be chosen by the tournament method. If then almost surely as PROOF. Fix and let G be the noise distribution, and consider the empirical distribution function obtained from the observations taken at Note that it can be considered as an estimate of In fact, by the Glivenko-Cantelli lemma, we know that the estimation error vanishes uniformly almost surely. But much more is true. By an inequality due to Dvoretzky, Kiefer and Wolfowitz (1956), in a final form derived by Massart (1990), we have, for all $\epsilon > 0$,

$$\mathbf{P}\Bigl\{\sup_{x}\,\bigl|F_N(x)-F(x)\bigr|>\epsilon\Bigr\}\ \le\ 2e^{-2N\epsilon^{2}},$$

where $F_N$ is the empirical distribution function of $N$ i.i.d. observations drawn from a distribution function $F$.
For
we define the positive constant
and let be the event that for all Clearly, then, An important event for us is the event that all matches with have a fair outcome, that is, wins against Let us compute a bound on the probability of We observe that so that To see this, fix with Then if
but this in turn implies that
which is impossible. Thus, every such match has a fair outcome.
Consider next the tournament, and partition the into four groups according to whether belongs to one of these intervals: The cardinalities of the groups are If holds, then any member of group 1 wins its match against any member of groups 3 and 4, for at least wins. Any member of group 4 can at most win against other members of group 4 or all members of group 3, for at most wins. Thus, the tournament winner must come from groups 1, 2 or 3, unless there is no in any of these groups. Thus,
We showed the following:
This is summable in for every so that we can conclude almost surely by the Borel-Cantelli lemma. It is a simple exercise to modify Theorem 4 to include mixing schemes, i.e., schemes of sampling in which has probability measure where and is an arbitrary probability measure. We note that still converges to almost surely if we merely add the standard mixing condition
Following the proof of Theorem 4, we notice indeed that for arbitrary N,
Hence,
Here we used an argument as in the proof of Theorem 3 in the section on mixed random search. The bound tends to 0 with N, and thus almost surely if Consider the simplest sequential strategy, comparable to a new player entering the tournament at each iteration and playing against the best player seen thus far. Assume that the are i.i.d. with probability measure and that at iteration we obtain samples and at and respectively. For comparing performances, we use suitably modified statistics such as
where is the empirical distribution function for the sample, and is the empirical distribution function for the other sample. Define according to the rule
where is a threshold designed to give some advantage to since the information contained in is too valuable to throw away without some form of protection. We may introduce mixing as long as we can guarantee that for any Borel set A, and is the usual mixing coefficient. This allows us to combine global and local search and to concentrate global search efforts in promising regions of the search domain. THEOREM 5 (Devroye, 1978). Assume that the above sequential strategy is used with mixing, and that comparisons are based upon Assume furthermore that If then in probability as If
then PROOF.
then
Now,
with probability one as We will use the fact that if
in probability, and
almost surely. This follows easily from the inequality
where we used the Dvoretzky-Kiefer-Wolfowitz inequality. Thus, the summability condition is satisfied. To obtain the weak convergence, we argue as follows. Define
and
Then
A bit of analysis shows that when or But we have already seen that
Define
and either
Then, for
Thus, when and either or we have weak convergence of to This concludes the proof of Theorem 5. Choosing and is an arbitrary process. Can we do without? For example, can we boldly choose and still have guaranteed convergence for all search problems and all noises? The answer is yes, provided that increases faster than quadratically in This result may come as a bit of a surprise, since we based the proof of Theorem 5 on the observation that the event occurs finitely often with probability one. We will now allow this event to happen infinitely often with any positive probability, but by increasing quickly enough, the total sum of the "bad" moves is finite almost surely. THEOREM 6. Assume that the sequential strategy is used with mixing, and that comparisons are based upon Assume furthermore that
and
Then
with probability one as
PROOF. Let be arbitrary. Observe that
Thus, strong convergence follows from weak convergence if we can prove that
This follows if
and
By the Beppo-Levi theorem, the former condition is implied by
By the Beppo-Levi theorem, the latter condition is implied by
Define
We recall that where
We note that if L is the distance between the third and first quartiles of G, then This is easily seen by partitioning the interval between the two quartiles, of length L, into intervals of length or less. The
maximum probability content of one of these intervals is at least From the proof of Theorem 5,
This is summable in
as required. Also,
and this is summable in To establish the weak convergence of we begin with the following inclusion of events, in which is a positive integer:
where
Here we used the notation of the proof of Theorem 5. By estimates obtained there, we note that
provided that is large enough. For any fixed we can choose so large that this upper bound is smaller than a given small constant. Thus, is smaller than a given small constant if we first choose large enough, and then choose appropriately. This concludes the proof of Theorem 6 when is used. Both the sequential and nonsequential strategies can be applied to the case in which we compare points on the basis of the average of observations made at This is in fact nothing more than the situation we will encounter when we wish to minimize a regression function. Indeed, taking averages would only make sense when the mean exists. Assume thus that we have the regression model where is a real-valued function, and is some random variable. For fixed we cannot observe realizations of but rather an i.i.d. sample where and the are i.i.d. In the additive noise case, we have the special form where is a zero mean random variable. We first consider the nonsequential model, in which the probe points are i.i.d. with probability measure The convergence is established in Theorem 7. Clearly, the choice of (and thus our cost) increases with the size of the tail or tails of In the best scenario, should increase faster than THEOREM 7. Let be chosen on the basis of the smallest value of Assume that is such that Then in probability as when condition A holds: for some (a sufficient condition for this is and We have strong convergence if B or C hold: (condition B) for all in an open neighborhood of the origin, and (condition C) for some and Finally, for any noise with zero moment, there exists a sequence such that almost surely as PROOF. Let be the event that for all We note that
Arguing as in the proof of Theorem 4, we have
This implies weak convergence of to if From the work of Baum and Katz (1965) (see Petrov (1975, pp. 283-286)), we retain that for all if where If for some then by Theorem 28 of Petrov (1975), so condition A is satisfied. Finally, if for all in an open neighborhood of the origin, then for some constant (see e.g. Petrov (1975, pp. 54-55)). This proves the weak convergence portion of the theorem. The strong convergence under condition B follows without trouble from the above bounds and the Borel-Cantelli lemma. For condition C, we need the fact that if for every where is a constant, then we have
(see Petrov, 1975, p. 284). Now observe that under condition C,
for some constant This tends to 0 with N. The last part of the theorem follows from the weak law of large numbers. Indeed, there exists a function with as such that Thus, Clearly, this is summable in if we choose so large that Therefore, the last part of the theorem follows by the Borel-Cantelli lemma. The remarks regarding rates of convergence made following Theorem 4 apply here as well. What is new, though, is that we have
lost the universality, since we have to impose conditions on the noise. If we apply the algorithm with some predetermined choice of we have no guarantee whatsoever that the algorithm will be convergent. And even if we knew the noise distribution, it may not be possible to avoid divergence for any manner of choosing
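For concreteness, the averaging-based nonsequential strategy for the regression model can be sketched as follows (Python; the uniform probe distribution, the schedule m_t, and the test function are our illustrative choices):

```python
import random

def regression_search(r, n_iter, m_t, noise_scale=0.5, seed=0):
    """Nonsequential search for the minimum of a regression function r:
    at iteration t, draw a new uniform probe point, then add m_t(t) fresh
    noisy observations at every probe point seen so far; report the probe
    point with the smallest running average."""
    rng = random.Random(seed)
    points, sums, counts = [], [], []
    for t in range(1, n_iter + 1):
        points.append(rng.random())
        sums.append(0.0)
        counts.append(0)
        m = m_t(t)
        for i, x in enumerate(points):
            for _ in range(m):
                sums[i] += r(x) + noise_scale * rng.gauss(0.0, 1.0)
            counts[i] += m
    best = min(range(len(points)), key=lambda i: sums[i] / counts[i])
    return points[best]

x_hat = regression_search(lambda x: (x - 0.25) ** 2, 60, lambda t: t)
```

With light-tailed noise and a growing schedule, the sample means identify a good probe point; the text's warning is that for a fixed schedule there is no such guarantee across all noise distributions.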
11. UNIVERSAL CONVERGENCE
In the search for universal optimization methods, we conclude with the following observation. Let Q be a function on the positive integers with finite infimum. Assume that for each there exists an infinite sequence of i.i.d. random variables called the observations. We have A search algorithm is a sequence of functions
where as a function of
describes a distribution. The sequence is called the history. For each starting with we generate a pair from the distribution given by This pair allows us to look at Thus, after iterations, we have accessed at most observations. A search algorithm needs a function of that maps to to determine which integer is taken as the best estimate of the minimum:
A search algorithm is a sequence of mappings A search algorithm is universally convergent if for all functions Q with and all distributions of in probability. We do not know if a universally convergent search algorithm exists. The difficulty of the question follows from the following observation. At time we have explored at most integers and looked at at most observations. Assume that we have observations at each of the first integers (consider this as a present of additional observations). Let us average these observations, and define as the integer with the smallest average. While at each integer, the law of large numbers holds, it is not true that the averages converge at the same rate to their means, and this procedure may actually see diverge to infinity in some probabilistic sense.
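The averaging heuristic of the last paragraph, under the simplifying "present" of m observations at each of the first t integers, can be sketched as follows (Python; all names and the bounded-noise example are ours — with heavy-tailed noise, the nonuniformity of the convergence over the integers is precisely the obstacle):

```python
import random

def best_by_average(observe, t, m, rng):
    """Given an observation oracle observe(k, rng) for integers k = 1..t,
    average m observations at each integer and return the integer with
    the smallest average. The law of large numbers holds at each fixed k,
    but the averages need not converge at a uniform rate over k."""
    def avg(k):
        return sum(observe(k, rng) for _ in range(m)) / m
    return min(range(1, t + 1), key=avg)

# Q(k) = |k - 5| with bounded noise: the averages identify k = 5 easily.
rng = random.Random(0)
k_hat = best_by_average(lambda k, rng: abs(k - 5) + rng.uniform(-0.4, 0.4),
                        10, 200, rng)
```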
REFERENCES
Aarts, E. and J. Korst. (1989). Simulated Annealing and Boltzmann Machines, John Wiley, New York.
Ackley, D.H. (1987). A Connectionist Machine for Genetic Hillclimbing, Kluwer Academic Publishers, Boston. Aluffi-Pentini, F., V. Parisi, and F. Zirilli. (1985). “Global optimization and stochastic differential equations,” Journal of Optimization Theory and Applications, vol. 47, pp. 1–16. Anily, S. and A. Federgruen. (1987). “Simulated annealing methods with general acceptance probabilities,” Journal of Applied Probability, vol. 24, pp. 657–667. Banzhaf, W., P. Nordin, and R. E. Keller. (1998). Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications, Morgan Kaufmann, San Mateo, CA. Baum, L. and M. Katz. (1965). “Convergence rates in the law of large numbers,” Transactions of the American Mathematical Society, vol. 121, pp. 108–123. Becker, R.W. and G. V. Lago. (1970). “A global optimization algorithm,” in: Proceedings of the 8th Annual Allerton Conference on Circuit and System Theory, pp. 3–12. Bekey, G.A. and M. T. Ung. (1974). “A comparative evaluation of two global search algorithms,” IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-4, pp. 112–116. Bilbro, G.L. and W. E. Snyder. (1991). “Optimization of functions with many minima,” IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-21, pp. 840–849. Boender, C.G.E., A. H. G. Rinnooy Kan, L. Stougie, and G. T. Timmer. (1982). “A stochastic method for global optimization,” Mathematical Programming, vol. 22, pp. 125–140. Bohachevsky, I.O., M. E. Johnson, and M. L. Stein. (1986). “Generalized simulated annealing for function optimization,” Technometrics, vol. 28, pp. 209–217. Bremermann, H.J. (1962). “Optimization through evolution and recombination,” in: Self-Organizing Systems, (edited by M. C. Yovits, G. T. Jacobi and G. D. Goldstein), pp. 93–106, Spartan Books, Washington, D.C. Bremermann, H.J. (1968). “Numerical optimization procedures derived from biological evolution processes,” in: Cybernetic Problems in Bionics, (edited by H. L.
Oestreicher and D. R. Moore), pp. 597–616, Gordon and Breach Science Publishers, New York. Brooks, S.H. (1958). “A discussion of random methods for seeking maxima,” Operations Research, vol. 6, pp. 244–251. Brooks, S.H. (1959). “A comparison of maximum-seeking methods,” Operations Research, vol. 7, pp. 430–457. Bäck, T., F. Hoffmeister, and H.-P. Schwefel. (1991). “A survey of evolution strategies,” in: Proceedings of the Fourth International Conference on Genetic Algorithms, (edited by R. K. Belew and L. B. Booker), pp. 2–9, Morgan Kaufmann Publishers, San Mateo, CA. Cerny, V. (1985). “Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm,” Journal of Optimization Theory and Applications, vol. 45, pp. 41–51. Dekkers, A. and E. Aarts. (1991). “Global optimization and simulated annealing,” Mathematical Programming, vol. 50, pp. 367–393. Devroye, L. (1972). “The compound random search algorithm,” in: Proceedings of the International Symposium on Systems Engineering and Analysis, Purdue University, vol. 2, pp. 195–110. Devroye, L. (1976). “On the convergence of statistical search,” IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-6, pp. 46–56. Devroye, L. (1976). “On random search with a learning memory,” in: Proceedings of the IEEE Conference on Cybernetics and Society, Washington, pp. 704–711. Devroye, L. (1977). “An expanding automaton for use in stochastic optimization,” Journal of Cybernetics and Information Science, vol. 1, pp. 82–94. Devroye, L. (1978a). “The uniform convergence of nearest neighbor regression function estimators and their application in optimization,” IEEE Transactions on Information Theory, vol. IT-24, pp. 142–151. Devroye, L. (1978b). “Rank statistics in multimodal stochastic optimization,” Technical Report, School of Computer Science, McGill University. Devroye, L. (1978c). “Progressive global random search of continuous functions,” Mathematical Programming, vol. 15, pp. 330–342. Devroye, L. (1979). “Global random search in stochastic optimization problems,” in: Proceedings of Optimization Days 1979, Montreal. de Biase, L. and F. Frontini. (1978). “A stochastic method for global optimization: its structure and numerical performance,” in: Towards Global Optimization 2, (edited by L. C. W. Dixon and G. P. Szegö), pp. 85–102, North Holland, Amsterdam. Dvoretzky, A., J. C. Kiefer, and J. Wolfowitz. (1956).
“Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator,” Annals of Mathematical Statistics, vol. 27, pp. 642–669. Ermakov, S.M. and A. A. Zhiglyavskii. (1983). “On random search for a global extremum,” Theory of Probability and its Applications, vol. 28, pp. 136–141. Ermoliev, Yu. and R. Wets. (1988). “Stochastic programming, an introduction,” in: Numerical Techniques of Stochastic Optimization, (edited by R. J.-B. Wets and Yu. M. Ermoliev), pp. 1–32, Springer-Verlag, New York. Fisher, L. and S. J. Yakowitz. (1976). “Uniform convergence of the potential function algorithm,” SIAM Journal on Control and Optimization, vol. 14, pp. 95–103.
Gastwirth, J.L. (1966). “On robust procedures,” Journal of the American Statistical Association, vol. 61, pp. 929–948. Gaviano, M. (1975). “Some general results on the convergence of random search algorithms in minimization problems,” in: Towards Global Optimization, (edited by L. C. W. Dixon and G. P. Szegö), pp. 149–157, North Holland, New York. Geffroy, J. (1958). “Contributions à la théorie des valeurs extrêmes,” Publications de l’Institut de Statistique des Universités de Paris, vol. 7, pp. 37–185. Gelfand, S.B. and S. K. Mitter. (1991). “Weak convergence of Markov chain sampling methods and annealing algorithms to diffusions,” Journal of Optimization Theory and Applications, vol. 68, pp. 483–498. Geman, S. and C.-R. Hwang. (1986). “Diffusions for global optimization,” SIAM Journal on Control and Optimization, vol. 24, pp. 1031–1043. Gidas, B. (1985). “Global optimization via the Langevin equation,” in: Proceedings of the 24th IEEE Conference on Decision and Control, Fort Lauderdale, pp. 774–778. Gnedenko, B.V. (1943). “Sur la distribution du terme maximum d’une série aléatoire,” Annals of Mathematics, vol. 44, pp. 423–453. Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, Mass. Gurin, L.S. (1966). “Random search in the presence of noise,” Engineering Cybernetics, vol. 4, pp. 252–260. Gurin, L.S. and L. A. Rastrigin. (1965). “Convergence of the random search method in the presence of noise,” Automation and Remote Control, vol. 26, pp. 1505–1511. Haario, H. and E. Saksman. (1991). “Simulated annealing process in general state space,” Advances in Applied Probability, vol. 23, pp. 866–893. Hajek, B. (1988). “Cooling schedules for optimal annealing,” Mathematics of Operations Research, vol. 13, pp. 311–329. Hajek, B. and G. Sasaki. (1989). “Simulated annealing—to cool or not,” Systems and Control Letters, vol. 12, pp. 443–447. Holland, J.H. (1973).
“Genetic algorithms and the optimal allocation of trials,” SIAM Journal on Computing, vol. 2, pp. 88–105. Holland, J.H. (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis With Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, Mass. Jarvis, R.A. (1975). “Adaptive global search by the process of competitive evolution,” IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-5, pp. 297–311. Johnson, D.S., C. R. Aragon, L. A. McGeoch, and C. Schevon. (1989). “Optimization by simulated annealing: an experimental evaluation; part I, graph partitioning,” Operations Research, vol. 37, pp. 865–892.
Rinnooy Kan, A.H.G. and G. T. Timmer. (1984). “Stochastic methods for global optimization,” American Journal of Mathematical and Management Sciences, vol. 4, pp. 7–40. Karmanov, V.G. (1974). “Convergence estimates for iterative minimization methods,” USSR Computational Mathematics and Mathematical Physics, vol. 14(1), pp. 1–13. Kiefer, J. and J. Wolfowitz. (1952). “Stochastic estimation of the maximum of a regression function,” Annals of Mathematical Statistics, vol. 23, pp. 462–466. Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. (1983). “Optimization by simulated annealing,” Science, vol. 220, pp. 671–680. Koronacki, J. (1976). “Convergence of random-search algorithms,” Automatic Control and Computer Sciences, vol. 10(4), pp. 39–45. Kushner, H.J. (1987). “Asymptotic global behavior for stochastic approximation via diffusion with slowly decreasing noise effects: global minimization via Monte Carlo,” SIAM Journal on Applied Mathematics, vol. 47, pp. 169–185. Lai, T.L. and H. Robbins. (1985). “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, pp. 4–22. Mann, H.B. and D. R. Whitney. (1947). “On a test of whether one of two random variables is stochastically larger than the other,” Annals of Mathematical Statistics, vol. 18, pp. 50–60. Marti, K. (1982). “Minimizing noisy objective functions by random search methods,” Zeitschrift für Angewandte Mathematik und Mechanik, vol. 62, pp. T377–T380. Marti, K. (1992). “Stochastic optimization in structural design,” Zeitschrift für Angewandte Mathematik und Mechanik, vol. 72, pp. T452–T464. Massart, P. (1990). “The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality,” Annals of Probability, vol. 18, pp. 1269–1283. Matyas, J. (1965). “Random optimization,” Automation and Remote Control, vol. 26, pp. 244–251. Meerkov, S.M. (1972). “Deceleration in the search for the global extremum of a function,” Automation and Remote Control, vol. 33, pp. 2029–2037. Metropolis, N., A. W.
Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. (1953). “Equation of state calculations by fast computing machines,” Journal of Chemical Physics, vol. 21, pp. 1087–1092. Mockus, J.B. (1989). Bayesian Approach to Global Optimization, Kluwer Academic Publishers, Dordrecht, Netherlands. Männer, R. and H.-P. Schwefel. (1991). Parallel Problem Solving from Nature, vol. 496, Lecture Notes in Computer Science, Springer-Verlag, Berlin. Petrov, V.V. (1975). Sums of Independent Random Variables, Springer-Verlag, Berlin.
Pinsky, M.A. (1991). Lecture Notes on Random Evolution, World Scientific Publishing Company, Singapore. Pintér, J. (1984). “Convergence properties of stochastic optimization procedures,” Mathematische Operationsforschung und Statistik, Series Optimization, vol. 15, pp. 405–427. Pintér, J. (1996). Global Optimization in Action, Kluwer Academic Publishers, Dordrecht. Price, W.L. (1983). “Global optimization by controlled random search,” Journal of Optimization Theory and Applications, vol. 40, pp. 333–348. Rechenberg, I. (1973). Evolutionsstrategie—Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, Stuttgart. Rinnooy Kan, A.H.G. and G. T. Timmer. (1987). “Stochastic global optimization methods part II: multi level methods,” Mathematical Programming, vol. 39, pp. 57–78. Rinnooy Kan, A.H.G. and G. T. Timmer. (1987). “Stochastic global optimization methods part I: clustering methods,” Mathematical Programming, vol. 39, pp. 27–56. Robbins, H. (1952). “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, pp. 527–535. Rubinstein, R. Y. and I. Weissman. (1979). “The Monte Carlo method for global optimization,” Cahiers du Centre d’Etude de Recherche Operationelle, vol. 21, pp. 143–149. Schumer, M.A. and K. Steiglitz. (1968). “Adaptive step size random search,” IEEE Transactions on Automatic Control, vol. AC-13, pp. 270–276. Schwefel, H.-P. (1977). Modellen mittels der Evolutionsstrategie, Birkhäuser Verlag, Basel. Schwefel, H.-P. (1981). Numerical Optimization of Computer Models, John Wiley, Chichester. Schwefel, H.-P. (1995). Evolution and Optimum Seeking, Wiley, New York. Sechen, C. (1988). VLSI Placement and Global Routing using Simulated Annealing, Kluwer Academic Publishers. Shorack, G.R. and J. A. Wellner. (1986). Empirical Processes with Applications to Statistics, John Wiley, New York. Shubert, B.O. (1972). 
“A sequential method seeking the global maximum of a function,” SIAM Journal on Numerical Analysis, vol. 9, pp. 379–388. Solis, F.J. and R. B. Wets. (1981). “Minimization by random search techniques,” Mathematics of Operations Research, vol. 6, pp. 19–30. Tarasenko, G.S. (1977). “Convergence of adaptive algorithms of random search,” Cybernetics, vol. 13, pp. 725–728. Törn, A. (1974). Global Optimization as a Combination of Global and Local Search, Skriftserie Utgiven av Handelshogskolan vid Abo Akademi, Abo, Finland.
Törn, A. (1976). “Probabilistic global optimization, a cluster analysis approach,” in: Proceedings of the EURO II Conference, Stockholm, Sweden, pp. 521–527, North Holland, Amsterdam. Törn, A. and A. Žilinskas. (1989). Global Optimization, Lecture Notes in Computer Science, vol. 350, Springer-Verlag, Berlin. Uosaki, K., H. Imamura, M. Tasaka, and H. Sugiyama. (1970). “A heuristic method for maxima searching in case of multimodal surfaces,” Technology Reports of Osaka University, vol. 20, pp. 337–344. Vanderbilt, D. and S. G. Louie. (1984). “A Monte Carlo simulated annealing approach to optimization over continuous variables,” Journal of Computational Physics, vol. 56, pp. 259–271. Van Laarhoven, P.J.M. and E. H. L. Aarts. (1987). Simulated Annealing: Theory and Applications, D. Reidel, Dordrecht. Wasan, M.T. (1969). Stochastic Approximation, Cambridge University Press, New York. Wilcoxon, F. (1945). “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, pp. 80–83. Yakowitz, S. (1992). “Automatic learning: theorems for concurrent simulation and optimization,” in: 1992 Winter Simulation Conference Proceedings, (edited by J. J. Swain, D. Goldsman, R. C. Crain and J. R. Wilson), pp. 487–493, ACM, Baltimore, MD. Yakowitz, S.J. (1989). “A statistical foundation for machine learning, with application to go-moku,” Computers and Mathematics with Applications, vol. 17, pp. 1095–1102. Yakowitz, S.J. (1989). “A globally-convergent stochastic approximation,” Technical Report, Systems and Industrial Engineering Department, University of Arizona, Tucson, AZ. Yakowitz, S.J. (1989). “On stochastic approximation and its generalizations,” Technical Report, Systems and Industrial Engineering Department, University of Arizona, Tucson, AZ. Yakowitz, S.J. (1992). “A decision model and methodology for the AIDS epidemic,” Applied Mathematics and Computation, vol. 55, pp. 149–172. Yakowitz, S.J. (1993).
“Global stochastic approximation,” SIAM Journal on Control and Optimization, vol. 31, pp. 30–40. Yakowitz, S.J. and L. Fisher. (1973). “On sequential search for the maximum of an unknown function,” Journal of Mathematical Analysis and Applications, vol. 41, pp. 234–259. Yakowitz, S.J., R. Hayes, and J. Gani. (1992). “Automatic learning for dynamic Markov fields, with applications to epidemiology,” Operations Research, vol. 40, pp. 867–876.
Yakowitz, S.J., T. Jayawardena, and S. Li. (1992). “Theory for automatic learning under Markov-dependent noise, with applications,” IEEE Transactions on Automatic Control, vol. AC-37, pp. 1316–1324. Yakowitz, S.J. and M. Kollier. (1992). “Machine learning for blackjack counting strategies,” Journal of Forecasting and Statistical Planning, vol. 33, pp. 295– 309. Yakowitz, S.J. and W. Lowe. (1991). “Nonparametric bandit methods,” Annals of Operations Research, vol. 28, pp. 297–312. Yakowitz, S.J. and E. Lugosi. (1990). “Random search in the presence of noise, with application to machine learning,” SIAM Journal on Scientific and Statistical Computing, vol. 11, pp. 702–712. Yakowitz, S.J. and A. Vesterdahl. (1993). “Contribution to automatic learning with application to self-tuning communication channel,” Technical Report, Systems and Industrial Engineering Department, University of Arizona. Zhigljavsky, A.A. (1991). Theory of Global Random Search, Kluwer Academic Publishers, Hingham, MA.
Chapter 20
RECENT ADVANCES IN RANDOMIZED QUASI-MONTE CARLO METHODS
Pierre L’Ecuyer
Département d’Informatique et de Recherche Opérationnelle
Université de Montréal, C.P. 6128, Succ. Centre-Ville, Montréal, H3C 3J7, CANADA
[email protected]
Christiane Lemieux Department of Mathematics and Statistics University of Calgary, 2500 University Drive N.W., Calgary, T2N 1N4, CANADA
[email protected]
Abstract
We survey some of the recent developments on quasi-Monte Carlo (QMC) methods, which, in their basic form, are a deterministic counterpart to the Monte Carlo (MC) method. Our main focus is the applicability of these methods to practical problems that involve the estimation of a high-dimensional integral. We review several QMC constructions and different randomizations that have been proposed to provide unbiased estimators and for error estimation. Randomizing QMC methods allows us to view them as variance reduction techniques. New and old results on this topic are used to explain how these methods can improve over the MC method in practice. We also discuss how this methodology can be coupled with clever transformations of the integrand in order to reduce the variance further. Additional topics included in this survey are the description of figures of merit used to measure the quality of the constructions underlying these methods, and other related techniques for multidimensional integration.
MODELING UNCERTAINTY
420
1. Introduction
To approximate the integral of a real-valued function f defined over the s-dimensional unit hypercube, given by

mu = ∫_{[0,1)^s} f(u) du,    (20.1)

a frequently-used approach is to choose a point set P_n = {u_1, ..., u_n} ⊂ [0,1)^s and then take the average value of f over P_n,

Q_n = (1/n) Σ_{i=1}^n f(u_i),    (20.2)

as an approximation of mu. Many problems can be formulated as in (20.1), e.g., when simulation is used to estimate an expectation and each simulation run requires s calls to a pseudorandom number generator that outputs numbers between 0 and 1; see Section 2 for more details. If f is very smooth and s is small, the product of one-dimensional integration rules such as the rectangular or trapezoidal rules can be used to define P_n (Davis and Rabinowitz 1984). When these conditions are not met, the Monte Carlo method (MC) is usually more appropriate. It amounts to choosing P_n as a set of n i.i.d. uniformly distributed vectors over [0,1)^s. With this method, Q_n is an unbiased estimator of mu whose error can be approximated via the central limit theorem, and whose variance is sigma^2/n, where

sigma^2 = ∫_{[0,1)^s} f^2(u) du − mu^2,

assuming that sigma^2 < ∞ (i.e., f is square-integrable). This means that the error associated with the MC method has probabilistic order O(n^{−1/2}). Quasi-Monte Carlo (QMC) methods can be seen as a deterministic counterpart to the MC method. They are based on the idea of using point sets P_n that are more regularly distributed over [0,1)^s to construct the approximation (20.2) than the random point set associated with MC. The fact that QMC methods are deterministic suggests that one has to make assumptions on the integrand f in order to guarantee a certain level of accuracy for Q_n. In other words, the improved regularity of P_n comes with worst-case functions for which the QMC approximation is bad. For this reason, the usual way to analyze QMC methods consists in choosing a set F of functions and a definition of discrepancy D(P_n) to measure, in some
way, how far from the uniform distribution on [0,1)^s is the empirical distribution induced by P_n. Once F and D(·) are determined, one can usually derive upper bounds on the deterministic error E_n = |Q_n − mu| of the following form (Niederreiter 1992b; Hickernell 1998a):

E_n ≤ D(P_n) V(f),    (20.3)

where V(f) is a measure of the variability of f such that V(f) < ∞ for all f in F. A well-known special case of (20.3) is the Koksma-Hlawka inequality (Hlawka 1961), in which F is the set of functions having bounded variation V(f) in the sense of Hardy and Krause, and D(P_n) is the rectangular-star discrepancy D*(P_n). To compute this particular definition of D(P_n), one considers all rectangular boxes in [0,1)^s aligned with the axes and with a "corner" at the origin, and then takes the supremum, over all these boxes, of the absolute difference between the volume of a box and the fraction of points of P_n that fall in it. The requirement that V(f) < ∞ imposes in particular certain differentiability conditions on f; see Niederreiter (1992b), page 19, for a precise definition. It is clear from (20.3) that a small value of D(P_n) is desirable for the set P_n. This leads to the notion of low-discrepancy sequences, which are sequences of points such that if P_n is constructed by taking the first n points of the sequence, then D(P_n) is significantly smaller than the discrepancy of a typical set of n i.i.d. uniform points. The term low-discrepancy point set usually refers to a set obtained by taking the first n points of a low-discrepancy sequence, although it is sometimes used in a looser way to describe a point set that has a better uniformity than an independent and uniform point set. In the case where D(P_n) is the rectangular-star discrepancy, it is common to say that a sequence is a low-discrepancy sequence if D*(P_n) ∈ O(n^{−1}(ln n)^s). Following this, the usual argument supporting the superiority of QMC methods over MC is to say that if Q_n is obtained using a low-discrepancy point set, then the error bound (20.3) is in O(n^{−1}(ln n)^s), which for a fixed dimension s is a better asymptotic rate than the rate O(n^{−1/2}) associated with MC. For this reason, one expects QMC methods to approximate mu with a smaller error than MC if n is sufficiently large.
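For intuition, the rectangular-star discrepancy is easy to compute exactly in one dimension, where for sorted points it reduces to D*_n = max_i max(i/n − x_(i), x_(i) − (i−1)/n). The sketch below is our own illustration (not code from the chapter): it compares the first 64 points of the van der Corput sequence in base 2, a classical low-discrepancy sequence, with an i.i.d. uniform sample.

```python
import random

def star_discrepancy_1d(points):
    """Exact rectangular-star discrepancy of a 1-d point set."""
    x = sorted(points)
    n = len(x)
    return max(max((i + 1) / n - x[i], x[i] - i / n) for i in range(n))

def van_der_corput(i, b=2):
    """Radical inverse of i in base b (the 1-d van der Corput sequence)."""
    v, base = 0.0, 1.0 / b
    while i > 0:
        v += (i % b) * base
        i //= b
        base /= b
    return v

n = 64
vdc = [van_der_corput(i) for i in range(n)]
random.seed(1)
unif = [random.random() for _ in range(n)]

print(star_discrepancy_1d(vdc))   # 0.015625 = 1/64: these 64 points are exactly the grid {i/64}
print(star_discrepancy_1d(unif))  # typically an order of magnitude larger for a random sample
```

The first 2^m van der Corput points in base 2 form the grid {i/2^m}, whose star discrepancy 1/n is the smallest possible for an n-point set, while a uniform random sample has discrepancy of order n^{−1/2}.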
However, the dimension s does not need to be very large in order to have n^{−1/2} < n^{−1}(ln n)^s for large values of n. For example, if s = 10, one must have n of the order of 10^39 to ensure that n^{−1}(ln n)^{10} < n^{−1/2}, and thus the superiority of the convergence rate of the QMC bound over MC is meaningful only for values of n that are much too large for practical purposes. Nevertheless, this does not mean that QMC methods cannot improve upon MC in practice, even for problems of large dimension. Arguments
supporting this are that, first, the upper bound given in (20.3) is a worst-case bound for the whole set F; it does not necessarily reflect the behavior of the error on a given function in this set. Second, it happens often in practice that even if the dimension s is large, the integrand f can be well approximated (in a sense to be specified in the next section) by a sum of low-dimensional functions. In that case, a good approximation for mu can be obtained by simply making sure that the corresponding projections of P_n on these low-dimensional subspaces are well distributed. These observations have recently led many researchers to turn to other tools than the setup that goes with (20.3) for analyzing and improving the application of QMC methods to practical problems, where the dimension is typically large, or even infinite (i.e., there is no a priori bound on s). In connection with these new tools, the idea of randomizing QMC point sets has been an important contribution that has extended the practical use of these methods. The purpose of this chapter is to give a survey of these recent findings, with an emphasis on the theoretical results that appear most useful in practice. Along with explanations describing why these methods work, our goal is to provide the reader with tools for applying QMC methods to his own specific problems. Our choice of topics is subjective, and we do not cover all the recent developments regarding QMC methods. We refer the reader to, e.g., Fox (1999), Hellekalek and Larcher (1998), Niederreiter (1992b), and Spanier and Maize (1994) for other viewpoints. Also, the fact that we chose not to say more about inequalities like (20.3) does not mean that they are useless. In fact, the concept of discrepancy turns out to be useful for defining selection criteria on which exhaustive or random searches for "good" point sets can be based, as we will see later.

Furthermore, we think it is important to be aware of the discouraging order of magnitude of the n required for the rate O(n^{−1}(ln n)^s) to be better than O(n^{−1/2}), and to understand that this problem is simply a consequence of the fact that placing points uniformly in [0,1)^s becomes harder and harder as s increases, because the space to fill becomes too large. This suggests that the success of QMC methods in practice is due to a clever choice of point sets exploiting the features of the functions that are likely to be encountered, rather than to an unexplainable way of breaking the "curse of dimensionality". Highly-uniform point sets can also be used for estimating the minimum of a function instead of its integral, sometimes in a context where function evaluations are noisy. This is discussed in Chapter 6 of Niederreiter (1992b) and Chapter 13 of Fox (1999), and was also the subject of collaborative work between Sid Yakowitz and the first author (Yakowitz, L'Ecuyer, and Vázquez-Abad 2000).
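To make the practical comparison concrete, here is a small sketch of our own (the integrand, the sample sizes, and the lattice parameters n = 1021, a = 76 are arbitrary illustrative choices, not taken from the chapter). It estimates mu = ∫_{[0,1)^5} u_1···u_5 du = 1/32 with plain MC and with a Korobov lattice rule randomized by a random shift modulo 1, a randomization discussed later in Section 5.

```python
import random
from math import prod

s = 5
f = lambda u: prod(u)              # integrand; exact integral is 2**-s = 0.03125

# Plain Monte Carlo
random.seed(7)
n_mc = 4096
mc = sum(f([random.random() for _ in range(s)]) for _ in range(n_mc)) / n_mc

# Korobov lattice with random shifts (n and a are illustrative choices)
n, a = 1021, 76
lattice = [[(i * a**j % n) / n for j in range(s)] for i in range(n)]
estimates = []
for _ in range(10):                # average over 10 independent random shifts
    shift = [random.random() for _ in range(s)]
    pts = [[(x + d) % 1.0 for x, d in zip(p, shift)] for p in lattice]
    estimates.append(sum(f(p) for p in pts) / n)
qmc = sum(estimates) / len(estimates)

print(abs(mc - 2**-s), abs(qmc - 2**-s))  # both errors are small; the shifted lattice is typically closer
```

The shifted-lattice estimator is unbiased for every shift, so averaging over shifts also yields an empirical error estimate, which is the main practical payoff of randomizing a QMC rule.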
This chapter is organized as follows. In Section 2, we give some insight on how point sets should be constructed, by using an ANOVA decomposition of the integrand over low-dimensional subspaces. Section 3 recalls the definition of different families of low-discrepancy point sets. In Section 4, we present measures of quality (or selection criteria) for low-discrepancy point sets that take into account the properties of the decomposition discussed in Section 2. Various randomizations that have been proposed for QMC methods are described in Section 5. Results on the error and variance of approximations based on (randomized) QMC methods are presented in Section 6. The purpose of Section 7 is to briefly review different classes of transformations that can be applied to the integrand for reducing the variance further, whether or not they exploit the structure of the point set P_n. Integration methods that lie somewhere between MC and QMC but that exploit specific properties of the integrand more directly are discussed in Section 8. Conclusions and ideas for further research are given in Section 9.
2. A Closer Look at Low-Dimensional Projections
We mentioned earlier that as the dimension s increases, it becomes difficult to cover the unit hypercube very well with a fixed number of points. However, if instead our goal is to make sure that, for some chosen subsets I ⊆ {1, ..., s}, the projections P_n(I) of P_n over the subspace of [0,1)^s indexed by the coordinates in I are evenly distributed, the task is easier. By doing this, one can get a small integration error if the chosen subsets I match the most important terms in the functional ANOVA decomposition of f, which we now explain. The functional ANOVA decomposition (Efron and Stein 1981; Hoeffding 1948; Owen 1998a) writes a square-integrable function f as a sum of orthogonal functions as follows:

f(u) = Σ_{I ⊆ {1,...,s}} f_I(u),

where f_I corresponds to the part of f that depends only on the variables in I: f_∅(u) = mu, and

f_I(u) = ∫_{[0,1)^{s−|I|}} f(u) du_{I^c} − Σ_{J ⊂ I, J ≠ I} f_J(u)

if I is non-empty, where the integration is over the coordinates not in I. Moreover, this decomposition is such that ∫_{[0,1)^s} f_I(u) f_J(u) du = 0 if I ≠ J. Defining

sigma_I^2 = ∫_{[0,1)^s} f_I^2(u) du

for non-empty I, we then have sigma^2 = Σ_I sigma_I^2. The best mean-square approximation of f − mu by a sum of d-dimensional (or less) functions is Σ_{0<|I|≤d} f_I. Also, the relative importance sigma_I^2/sigma^2 of each component indicates which variables or which subsets of variables are the most important (Hickernell 1998b). A function f has effective dimension d (in the superposition sense) if Σ_{0<|I|≤d} sigma_I^2 ≥ 0.99 sigma^2 (Caflisch, Morokoff, and Owen 1997). Functions defined over many variables but having a low effective dimension often arise in practical applications (Caflisch, Morokoff, and Owen 1997; Lemieux and Owen 2001). The concept of effective dimension was actually introduced (in a different form than above) by Paskov and Traub (1995) in the context of financial pricing, to explain the much smaller error obtained with QMC methods compared with MC for a problem defined over a 360-dimensional space. A broad class of problems that are likely to have low effective dimension (relative to s) are those arising from simulation applications. To see this, note that simulation is typically used to estimate the expectation of some measure of performance defined over a stochastic system, and proceeds by transforming, in a more or less complicated way, a sequence of numbers between 0 and 1 produced by a pseudorandom generator into an observation of the measure of interest. Hence it fits the framework of equation (20.1), with s equal to the number of uniforms required for each simulation, and f taken as the mapping that transforms a point in [0,1)^s into an observation of the measure of performance. In that context, it is frequent that the uniform numbers generated close to each other in the simulation (i.e., corresponding to dimensions that are close together) are associated with random variables that interact more together. In other words, for these applications it is often the subsets I containing nearby indices, and not too many of them, that are the most important in the ANOVA decomposition.
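As a small worked illustration of our own (the function and parameters are hypothetical, not from the chapter), consider the product f(u) = prod_{j=1}^s (1 + alpha(u_j − 1/2)). Its ANOVA components are known in closed form: sigma_I^2 = (alpha^2/12)^{|I|}, so the variance carried by each order of interaction, and hence the effective dimension in the superposition sense, can be computed directly.

```python
from math import comb

s, alpha = 10, 0.5
r = alpha**2 / 12                    # variance of each one-dimensional factor
sigma2 = (1 + r)**s - 1              # total variance: prod over j of (1 + r), minus mu^2 = 1

# variance carried by ANOVA components of order exactly k: C(s, k) * r^k
by_order = [comb(s, k) * r**k for k in range(1, s + 1)]
assert abs(sum(by_order) - sigma2) < 1e-12   # binomial identity check

# effective dimension: smallest d whose cumulative share reaches 99% of sigma2
cum, d_eff = 0.0, None
for k, v in enumerate(by_order, start=1):
    cum += v
    if cum >= 0.99 * sigma2:
        d_eff = k
        break
print(d_eff)   # 2: almost all the variance sits in projections of order <= 2
```

Even though the function depends on all 10 variables, 99% of its variance is captured by the one- and two-dimensional ANOVA terms, which is exactly the situation where well-chosen low-dimensional projections pay off.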
This suggests that to design point sets that will work well for this type of problem, one should consider the quality of the projections corresponding to these "important" subsets. We present, in Section 4, measures of quality defined on this basis. We conclude this section by recalling two important properties related to the projections of a point set P_n in [0,1)^s. Firstly, we say that P_n is fully projection-regular (Sloan and Joe 1994; L'Ecuyer and Lemieux 2000) if each of its projections over a non-empty subset I of dimensions contains n distinct points. Such a property is certainly desirable, for the lack of it means that some of the ANOVA components of f are integrated by less than n points even if n evaluations of f have been done. Secondly, we say that P_n is dimension-stationary (L'Ecuyer and Lemieux 2000) if P_n({i_1, ..., i_d}) = P_n({i_1 + j, ..., i_d + j}) for any 1 ≤ i_1 < ... < i_d ≤ s and j ≥ 0 such that i_d + j ≤ s; that is, only the spacings between the indices in I are relevant in the definition of the projections of a dimension-stationary point set, and not their individual values. Hence not all non-empty projections of P_n need to be considered when measuring the quality of P_n, since many are the same. Another advantage of dimension-stationary point sets is that because the quality of their projections does not deteriorate as the first index increases, they can be used to integrate functions that have important ANOVA components associated with subsets I having a large first index. Therefore, when working with those point sets it is not necessary to try rewriting f so that the important ANOVA components are associated with subsets I having a small first index (as is often done; see, e.g., Fox 1999). We underline that not all types of QMC point sets have these properties.
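These two definitions are straightforward to test numerically. The sketch below is our own illustration: it checks both properties for a small three-dimensional Korobov lattice point set with n = 101 and generator a = 12 (the parameters of Figure 20.1, extended to s = 3; this construction is defined formally in Section 3.1).

```python
n, a, s = 101, 12, 3
# Korobov lattice point set: coordinate j of point i is (i * a^j mod n) / n
pts = [tuple((i * a**j % n) / n for j in range(s)) for i in range(n)]

def proj(points, I):
    """Projection of the point set over the coordinate subset I (0-indexed)."""
    return {tuple(p[j] for j in I) for p in points}

# fully projection-regular: every non-empty projection keeps n distinct points
subsets = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
assert all(len(proj(pts, I)) == n for I in subsets)

# dimension-stationary: projections depend only on the spacings between indices
assert proj(pts, (0, 1)) == proj(pts, (1, 2))
assert proj(pts, (0,)) == proj(pts, (1,)) == proj(pts, (2,))
print("fully projection-regular and dimension-stationary")
```

Both properties hold here because gcd(a, n) = 1; replacing a by a multiple of 101 would collapse some projections to a single point.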
3. Main Constructions
In this section, we present constructions for low-discrepancy point sets that are often used in practice. We first introduce lattice rules (Sloan and Joe 1994) and a special case of this construction called Korobov rules (Korobov 1959), which turn out to fit in another type of construction based on the successive overlapping vectors produced by a recurrence defined over a finite ring. This type of construction is also used to define pseudorandom number generators (PRNGs) with huge period lengths when the underlying ring has a very large cardinality; low-discrepancy point sets are, in contrast, obtained by using a ring with a small cardinality. For this reason, we refer to this type of construction as small PRNGs, and discuss it after having introduced digital nets (Niederreiter 1992b), which form an important family of low-discrepancy point sets that also provides examples of small PRNGs. Various digital net constructions are described. We also recall the Halton sequence (Halton 1960), and discuss a method by which the number of points in a Korobov rule can be increased sequentially, thus offering an alternative to digital sequences. Additional references regarding the implementation of QMC methods are provided at the end of the section.
3.1. Lattice Rules
The general construction for lattice rules has been introduced by Sloan and his collaborators (see Sloan and Joe 1994 and the references therein), building upon ideas developed by Korobov (1959, 1960), Bakhvalov (1959), and Hlawka (1962). The following definition is taken from the expository book of Sloan and Joe (1994).

Definition: Let v_1, ..., v_s be a set of vectors in R^s, linearly independent over R and with coordinates in [0,1). Define

L_s = { v = Σ_{j=1}^s z_j v_j : each z_j ∈ Z },    (20.4)

and assume that Z^s ⊆ L_s. The approximation Q_n based on the point set P_n = L_s ∩ [0,1)^s is a lattice rule.

The number of points n in the rule is equal to the inverse of the absolute value of the determinant of the matrix V whose rows are the vectors v_1, ..., v_s. This number is called the order of the rule. Note that the basis for L_s is not unique, but the absolute value of the determinant of V remains constant over all choices of basis. Figure 20.1 gives an example of a point set that corresponds to a two-dimensional lattice rule with n = 101. The two vectors shown in the figure form a basis for the lattice L_2; other pairs of vectors generate the same lattice. These 101 points cover the unit square quite uniformly. They are also placed very regularly on equidistant parallel lines, for several families of lines. For example, each of the basis vectors mentioned above determines one family of lines that are parallel to this vector. This regularity property stems from the lattice structure and it holds for any lattice rule; in more than two dimensions, the lines are simply replaced by equidistant parallel hyperplanes (Knuth 1998; Conway and Sloane 1999). For P_n to cover the unit hypercube quite well, the successive hyperplanes should never be too far apart (to avoid wide uncovered gaps), for any choice of family of parallel hyperplanes. Selection criteria for lattice rules, based on the idea of minimizing the distance between successive hyperplanes for the "worst-case" family, are discussed in Section 4.1. From now on, we refer to point sets giving rise to lattice rules as lattice point sets. Each lattice point set has a rank associated with it, which can be defined as the smallest integer r such that P_n can be obtained by taking all integer combinations, modulo 1, of r vectors independent over R. Alternatively, the rank can be defined as the smallest number of cyclic groups whose direct sum yields P_n (Sloan and Joe 1994). For example, if a_1, ..., a_s and n are positive integers such that gcd(a_j, n) = 1 for at least one j, then the lattice point set

P_n = { (i/n)(a_1, ..., a_s) mod 1 : i = 0, ..., n − 1 }    (20.5)

has rank 1 and contains n distinct points. It can also be obtained by taking v_1 = (a_1, ..., a_s)/n and v_j = e_j for j = 2, ..., s in (20.4), where e_j is a vector of zeros with a one in the j-th position. The condition gcd(a_j, n) = 1 for all j is necessary and sufficient for a rank-1 lattice point set to be fully projection-regular. A Korobov rule is obtained by choosing an integer a and taking a_j = a^{j−1} mod n in (20.5), for all j. In this case, having a and n relatively prime is a necessary and sufficient condition for P_n to be both fully projection-regular and dimension-stationary (L'Ecuyer and Lemieux 2000). The integer a is usually referred to as the generator of the lattice point set P_n. For instance, the point set given in Figure 20.1 has a generator equal to 12. In Section 3.3, we describe an efficient way of constructing P_n for Korobov rules when n is prime and a is a primitive element modulo n. In the definition of a lattice rule, we assumed Z^s ⊆ L_s. This implies that L_s has a period of 1 in each dimension. A lattice with this property is called an integration lattice (Sloan and Joe 1994). A necessary and sufficient condition for L_s to be an integration lattice is that the inverse V^{−1} of the matrix V has only integer entries. In this case, it can
be shown that if the determinant of V is 1/n, then there exists a basis for L_s with coordinates of the form k/n, where k ∈ Z. We assume from now on that all lattices considered are integration lattices. The columns of the inverse matrix V^{−1} form a basis of the dual lattice of L_s, defined as

L_s* = { h ∈ R^s : h · v ∈ Z for all v ∈ L_s }.

If the determinant of V is 1/n, then L_s* contains n times less points per unit of volume than Z^s. Also, L_s* is periodic with a period of n in each dimension. As we will see in Sections 4.1 and 6.1, this dual lattice plays an important role in the error and variance analysis for lattice rules, and in the definition of selection criteria.
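The lattice structure of the point set of Figure 20.1 can be verified directly: since P_n = L_2 ∩ [0,1)^2 and L_2 contains Z^2, the n = 101 points must be closed under addition modulo 1. The sketch below is our own check; it uses exact rational arithmetic (`fractions.Fraction`) so that the modulo-1 comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

n, a = 101, 12
# two-dimensional Korobov lattice point set of Figure 20.1
pts = {(Fraction(i, n), Fraction(i * a % n, n)) for i in range(n)}

# closure under addition modulo 1: P_n = L_2 ∩ [0,1)^2 is a group mod 1
closed = all(((x1 + x2) % 1, (y1 + y2) % 1) in pts
             for (x1, y1) in pts for (x2, y2) in pts)
print(len(pts), closed)   # 101 True
```

Closure holds because adding the points with indices i and i' gives the point with index (i + i') mod n; a generic (non-lattice) point set would fail this test immediately.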
3.2. Digital Nets
We first recall the general definition of a digital net in base b, a concept that was first introduced by Sobol' (1967) in base 2, and subsequently generalized by Faure (1982), Niederreiter (1987), and Tezuka (1995). The following definition is from Niederreiter (1992b), with the convention, as in Tezuka (1995), that the generating matrices contain an infinite number of rows (although often, only a finite number of these rows are nonzero).

Definition 1 Let b ≥ 2, s ≥ 1, and m ≥ 1 be integers. Choose

1. a commutative ring R with identity and with cardinality b;

2. bijections psi_r from Z_b = {0, 1, ..., b − 1} to R, for r = 0, ..., m − 1;

3. bijections eta_{j,r} from R to Z_b, for 1 ≤ j ≤ s and r ≥ 1;

4. generating matrices C_1, ..., C_s of dimension ∞ × m over R.

For i = 0, ..., b^m − 1, let i = Σ_{r=0}^{m−1} a_r b^r be the digit expansion of i in base b. Consider the vector

y_j = C_j (psi_0(a_0), ..., psi_{m−1}(a_{m−1}))^T,

where each element y_{j,r} of y_j is in R. For 1 ≤ j ≤ s, let

u_{i,j} = Σ_{r=1}^∞ eta_{j,r}(y_{j,r}) b^{−r}.

Then P_n = { u_i = (u_{i,1}, ..., u_{i,s}) : i = 0, ..., b^m − 1 } is a digital net over R in base b.

This scheme has been used to construct point sets having a low-discrepancy property that can be described by introducing the notion of (q_1, ..., q_s)-equidistribution (Lemieux and L'Ecuyer 2001). Let q = q_1 + ... + q_s, where the q_j are non-negative integers, and consider the b^q boxes obtained by partitioning [0,1)^s into b^{q_j} equal intervals along the j-th axis. If each of these boxes contains exactly b^{m−q} points from a point set P_n, where n = b^m ≥ b^q, then P_n is said to be (q_1, ..., q_s)-equidistributed in base b. If a digital net P_n is (q_1, ..., q_s)-equidistributed whenever q_1 + ... + q_s ≤ m − t for some integer t, it is called a (t, m, s)-net (Niederreiter 1992b). The smallest integer t having this property is a widely-used measure of uniformity for digital nets, and we call it the t-value of P_n. Note that any point set with b^m points is trivially an (m, m, s)-net, and that the smaller t is, the better is the quality of P_n. Criteria for measuring the equidistribution of digital nets are discussed in more detail in Section 4.2.

Figure 20.2 shows an example of a two-dimensional point set with n = 81 points in base 3 having the best possible equidistribution; that is, its t-value is 0 and thus any partition of the unit square into ternary boxes of sizes 3^{−q_1} × 3^{−q_2} with q_1 + q_2 = 4 is such that exactly one point is in each box. The figure shows two examples of such a partition; the other partitions (not shown there) correspond to the remaining choices of (q_1, q_2). For all of these partitions, each rectangle contains one point of P_n. This point set contains the first 81 points of the two-dimensional Faure sequence in base 3. In this case, all bijections in the definition above are the identity function over Z_3, the generating matrix C_1 for the first dimension is the identity matrix, and C_2 is the upper-triangular Pascal matrix modulo 3, whose entry on row r and column c (indices starting at 0) is the binomial coefficient C(c, r) mod 3; its upper-left 4 × 4 corner is

1 1 1 1
0 1 2 0
0 0 1 0
0 0 0 1
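The Faure construction just described is short to code. The sketch below is our own illustration of the scheme (digits are handled least-significant first, and the generating-matrix entry C(l, r)·j^{l−r} mod b reproduces the identity matrix for the first dimension and the Pascal matrix for the second). Coordinates are kept as exact integer numerators over b^m so that the box counts are not affected by rounding, and the (q_1, q_2)-equidistribution is verified for every partition with q_1 + q_2 = 4, i.e., the t-value is 0.

```python
from math import comb

b, m, s = 3, 4, 2                  # base, number of digits, dimension
n = b**m                           # 81 points

def faure_point(i):
    a = [(i // b**l) % b for l in range(m)]        # digits of i in base b
    coords = []
    for j in range(s):
        # output digits: d_r = sum_l comb(l, r) * j**(l - r) * a_l  (mod b)
        # j = 0 gives the identity matrix, j = 1 the Pascal matrix
        d = [sum(comb(l, r) * j**(l - r) * a[l] for l in range(r, m)) % b
             for r in range(m)]
        # integer numerator k such that the coordinate equals k / b**m
        coords.append(sum(d[r] * b**(m - 1 - r) for r in range(m)))
    return tuple(coords)

pts = [faure_point(i) for i in range(n)]           # coordinates scaled by b**m

# t-value 0: every box of every 3^q1-by-3^q2 partition with q1+q2 = 4 holds one point
ok = True
for q1 in range(m + 1):
    q2 = m - q1
    boxes = {(kx // b**(m - q1), ky // b**(m - q2)) for (kx, ky) in pts}
    ok = ok and len(boxes) == n    # 81 distinct boxes => exactly one point per box
print(ok)   # True
```

Since each partition has exactly b^m = 81 boxes, checking that the 81 points land in 81 distinct boxes is equivalent to checking one point per box.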
The general definition of this type of construction is given below. From now on, b is assumed to be a prime power and, in all the constructions described below, R is taken equal to the finite field F_b. Also, we assume that a bijection from Z_b to F_b has been chosen to identify the elements of F_b with the "digits" 0, 1, ..., b − 1, and all bijections psi_r and eta_{j,r} are defined according to this bijection. In particular, when b is prime, these bijections are equal to the identity function over Z_b and all operations are performed modulo b. The base b and the generating matrices therefore completely describe these constructions, because the digit expansion of the j-th coordinate of u_i is obtained by simply multiplying C_j with the digit expansion of i in base b. The goal is then to choose the generating matrices so that the equidistribution property mentioned above holds for boxes that are as small as possible. In terms of these matrices, this roughly means that we want them to have a large number of rows that are linearly independent. For example, if for each j the first m rows of the matrix C_j are linearly independent, then each box obtained by partitioning the j-th axis into b^m equal intervals has one point from P_n. In particular, this implies that P_n is fully projection-regular. Digital sequences in base b (see, e.g., Larcher 1998 and Niederreiter 1992b) are infinite sequences obtained in the same way as digital nets except that the generating matrices have an infinite number of columns; the first b^m points of the sequence thus form a digital net for each m. For example, Sobol', Generalized Faure, and Niederreiter sequences (to be discussed below) are all defined as digital sequences, since the "recipe" to add extra columns in the generating matrices is given by the method.

We now describe specific well-known constructions of digital nets with a good equidistribution. We do not discuss the recent constructions proposed by Niederreiter and Xing (1997, 1998), as they require the introduction of many concepts from algebraic function fields that go well beyond the scope of this chapter. These sequences are built so as to optimize the asymptotic behavior of their t-value as a function of the dimension s, for a fixed base b. See Pirsic (2001) for a definition of these sequences and a description of a software implementation.

3.2.1 Sobol' Sequences. Here the base is b = 2, and the specification of each generating matrix C_j requires a primitive polynomial p_j(z) over F_2 and integers to initialize a recurrence, based on p_j, that generates the direction numbers defining C_j. The method specifies that the polynomial p_j should be the j-th one in the list of primitive polynomials over F_2 sorted by increasing degree (within each degree, Sobol' specifies a certain order, which is given in the code of Bratley and Fox (1988)). There remain the initial parameters to control the quality of the point set. Assume p_j(z) = z^d + a_1 z^{d−1} + ... + a_{d−1} z + 1, where each a_k is in {0, 1}.
The direction numbers are rationals of the form

v_k = m_k / 2^k,   k = 1, 2, ...,

where m_k is an odd integer smaller than 2^k. The first d values v_1, ..., v_d must be (carefully) chosen, and the following ones are obtained through the recurrence

v_k = a_1 v_{k−1} xor a_2 v_{k−2} xor ... xor a_{d−1} v_{k−d+1} xor v_{k−d} xor (v_{k−d}/2^d),

where xor denotes a bit-by-bit exclusive-or operation, and v_{k−d}/2^d means that the binary expansion of v_{k−d} is shifted by d positions to the right (i.e., v_{k−d} is divided by 2^d). These direction numbers are then used to define C_j, whose entry on the r-th row and k-th column is given by the r-th bit of the binary expansion of v_k. A good choice of the initial values m_1, ..., m_d (or v_1, ..., v_d) in each dimension is important for the success of this method, especially when n is small. The implementation of Bratley and Fox (1988) uses the initial values given by Sobol' and Levitan (1976). More details on how to choose these initial values to optimize certain quality criteria are given in Section 4.2.

3.2.2 Generalized Faure Sequences. These sequences were introduced by Faure (1982) and generalized by Tezuka (1995). For this type of sequence, the base b is the smallest prime power larger than or equal
to the dimension s (which means that these sequences are practical only for small values of s). An important feature of this construction is that the t-value takes the best possible value, t = 0, provided n is of the form b^m for some integer m. The generating matrices take the form C_j = A_j P^{j−1}, where P is the Pascal matrix (i.e., with entries given by binomial coefficients, as in the example above) and A_j is an arbitrary non-singular lower-triangular matrix. The original Faure sequence is obtained by taking b prime and A_j equal to the identity matrix for all j. By allowing the matrices A_j to differ from the identity matrix, point sets that avoid some of the defects observed on the original Faure sequence can be built (Morokoff and Caflisch 1994; Boyle, Broadie, and Glasserman 1997; Tan and Boyle 2000). The Generalized Faure sequence of Tezuka and Tokuyama (1994) corresponds to one particular choice of the matrices A_j. Recently, Faure suggested another form of Generalized Faure sequence that consists in taking A_j equal to the lower-triangular matrix with all nonzero entries equal to 1, for all j (Faure 2001).

3.2.3 Niederreiter Sequences. This construction has been proposed by Niederreiter (1988). We first describe a special case (Niederreiter 1992b, Section 4.5), and then briefly explain how it can be generalized. The base b is assumed to be a prime power, and for this special case the generating matrices are defined by s distinct monic irreducible polynomials p_1(z), ..., p_s(z) over F_b. Let w_j = deg(p_j) for j = 1, ..., s. In what follows, F_b((z^{−1})) represents the field of formal Laurent series over F_b, that is,

F_b((z^{−1})) = { Σ_{l ≥ l_0} c_l z^{−l} : l_0 ∈ Z, c_l ∈ F_b }.

To find the element on the k-th row and l-th column of C_j, write k − 1 = q w_j + r, where q is the unique integer satisfying 0 ≤ r < w_j, and consider the expansion

z^r / p_j(z)^{q+1} = Σ_{l ≥ 1} a_j(q, r, l) z^{−l},    (20.7)

in which the coefficients a_j(q, r, l) ∈ F_b are uniquely determined. The element on the k-th row and l-th column of C_j is then given by a_j(q, r, l). For these sequences, the t-value is given by t = Σ_{j=1}^s (w_j − 1), which suggests that to minimize t, the p_j should be taken equal to the monic irreducible polynomials of smallest degree over F_b. In the general definition of Niederreiter (1988), the numerator in (20.7) can be multiplied by a different polynomial for each pair of dimension and row index, and the p_j just need to be pairwise relatively prime polynomials. Tezuka (1995), Section 6.1.2, generalizes this concept a step further by removing more restrictions on this numerator.

3.2.4 Polynomial Lattice Rules. This construction can be seen as a lattice rule in which the ring of integers is replaced by a ring of polynomials over a finite field. As we explain below, it generalizes the construction discussed by Niederreiter (1992a), Niederreiter (1992b), Section 4.4, and Larcher, Lauss, Niederreiter, and Schmid (1996), and is studied in more detail by Lemieux and L'Ecuyer (2001).

Definition: Let v_1, ..., v_s be a set of vectors of formal Laurent series in F_b((z^{−1}))^s, linearly independent over F_b((z^{−1})), where b is a prime power. Define
L_s = { v(z) = Σ_{j=1}^s q_j(z) v_j(z) : q_j(z) ∈ F_b[z] },

and assume that F_b[z]^s ⊆ L_s. Let phi be the mapping that evaluates each component of a vector in L_s at z = b (that is, the series Σ_l c_l z^{−l} is mapped to the number Σ_l c_l b^{−l}). The approximation Q_n based on the point set P_n = phi(L_s) ∩ [0,1)^s is a polynomial lattice rule. The number of points n in the rule is called the order of the rule and is equal to b^k, where k is the degree of the inverse of the determinant of the matrix V whose rows are the vectors v_1, ..., v_s.

Most definitions and results that we mentioned for lattice rules, which from now on are referred to as standard lattice rules, have their counterpart for polynomial lattice rules, as we now explain. First, we refer to point sets that define polynomial lattice rules as polynomial lattice point sets, whose rank is equal to the smallest number r of basis vectors required to write P_n as

P_n = { phi(trunc(q_1(z) v_1(z) + ... + q_r(z) v_r(z))) : q_j(z) ∈ F_b[z] }.

In the expression above, trunc(·) represents the operation by which all non-negative powers of z in its argument are dropped. The construction introduced by Niederreiter (1992a), also discussed in Niederreiter (1992b), Section 4.4, and Larcher, Lauss, Niederreiter, and Schmid (1996), and sometimes referred to as the "method of optimal polynomials", is a polynomial lattice point set of rank 1, since it can be obtained as

P_n = { phi(trunc((h(z)/p(z))(q_1(z), ..., q_s(z)))) : h(z) ∈ F_b[z], deg(h) < k },    (20.9)

where p(z) is a polynomial of degree k over F_b and the q_j(z) are polynomials in F_b[z]/(p(z)), the ring of polynomials over F_b modulo p(z). A Korobov polynomial lattice point set is obtained by taking q_j(z) = q(z)^{j−1} mod p(z) in (20.9) for some q(z) in F_b[z]/(p(z)). As in the standard case, the condition gcd(q(z), p(z)) = 1 is necessary and sufficient to guarantee that a Korobov polynomial lattice point set is dimension-stationary and fully projection-regular (Lemieux and L'Ecuyer 2001). An efficient way of generating this type of point set when p(z) is a primitive polynomial is described in the next subsection. The condition F_b[z]^s ⊆ L_s implies that L_s has a period of 1 in each dimension, and we call L_s a polynomial integration lattice in this case. If V represents a matrix whose rows are formed by basis vectors of L_s, then L_s is a polynomial integration lattice if and only if all entries of V^{−1} are in F_b[z]. In that case, a basis for L_s can be found whose vectors have coordinates that are polynomials divided by the (polynomial) inverse of the determinant of V. Finally, the columns of V^{−1} form a basis of the dual lattice L_s* of L_s, which is defined by

L_s* = { h(z) ∈ F_b[z]^s : h(z) · v(z) ∈ F_b[z] for all v(z) ∈ L_s },

where the scalar product is defined as h(z) · v(z) = Σ_{j=1}^s h_j(z) v_j(z). This construction is a special case of digital nets, in which the entries of the generating matrices C_1, ..., C_s are given by coefficients of the formal Laurent series associated with a triangular basis of the lattice L_s (see Lemieux and L'Ecuyer 2001, Lemma A.2, for more details).
In other words, the first columns of each matrix C_j contain the coefficients associated with the first basis vector, the next columns are associated with the second basis vector, and so on. For polynomial lattice point sets of rank 1, the corresponding generating matrices can be described a bit more easily (Niederreiter 1992b, Section 4.4), as we now explain. For each dimension j, consider the formal Laurent series expansion

q_j(z)/p(z) = Σ_{l=1}^∞ c_{j,l} z^{−l}.

The first k coefficients c_{j,1}, ..., c_{j,k} are obtained by solving a triangular linear system whose matrix A is built from the coefficients of p(z) (see, e.g., Lemieux and L'Ecuyer 2001), and the coefficients c_{j,l} for l > k satisfy the recurrence determined by p(z) = z^k + alpha_1 z^{k−1} + ... + alpha_k, i.e.,

c_{j,l} = −(alpha_1 c_{j,l−1} + alpha_2 c_{j,l−2} + ... + alpha_k c_{j,l−k}),

where the minus sign represents the subtraction in F_b. The entries of the generating matrices are then given by these coefficients: the entry on row r and column l of C_j is c_{j,r+l−1}, so that C_j is the Hankel matrix built from the expansion of q_j(z)/p(z). Note that in the definition of Niederreiter (1992b) and Pirsic and Schmid (2001), the matrices C_j are restricted to a finite number of rows.
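As a small worked example of our own for this rank-1 construction in base b = 2 (the choices p(z) = z^3 + z + 1 and q(z) = z are hypothetical illustrations), the expansion digits of h(z)/p(z) can be produced by polynomial long division with bitmask arithmetic, and the resulting 8 two-dimensional Korobov polynomial lattice points can be checked to be fully projection-regular.

```python
P, K = 0b1011, 3            # p(z) = z^3 + z + 1 over F_2, k = deg p
Q = 0b10                    # q(z) = z
W = 3                       # number of output digits per coordinate

def expand(h, w):
    """First w coefficients c_1..c_w of the Laurent expansion of h(z)/p(z)."""
    digits = []
    for _ in range(w):
        h <<= 1                       # multiply the remainder by z
        c = (h >> K) & 1              # next Laurent coefficient
        if c:
            h ^= P                    # reduce modulo p(z)
        digits.append(c)
    return digits

def mulmod(a, b2):
    """Product a(z)*b(z) mod p(z) over F_2, polynomials stored as bitmasks."""
    r = 0
    while b2:
        if b2 & 1:
            r ^= a
        b2 >>= 1
        a <<= 1
        if (a >> K) & 1:
            a ^= P
    return r

pts = []
for h in range(2**K):                 # h(z) runs over all polynomials of degree < k
    coords, g = [], h
    for _ in range(2):                # s = 2: uses q_j(z) = q(z)^(j-1) mod p(z)
        coords.append(sum(c * 2**-(i + 1) for i, c in enumerate(expand(g, W))))
        g = mulmod(g, Q)
    pts.append(tuple(coords))

# fully projection-regular: each 1-d projection is the full grid {0, 1/8, ..., 7/8}
print(sorted(x for x, _ in pts) == [i / 8 for i in range(8)],
      sorted(y for _, y in pts) == [i / 8 for i in range(8)])   # True True
```

Both one-dimensional projections are complete because h(z) maps bijectively to the first k expansion digits of h(z)/p(z), and multiplication by q(z) (which is relatively prime to p(z)) permutes the residues h(z) mod p(z).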
3.3. Constructions Based on Small PRNGs
Consider a PRNG based on a recurrence over a finite ring R and a specific output function from R to [0, 1). The idea proposed here is to define as the set of all overlapping in that can be
obtained by running the PRNG from all possible initial states in R. More precisely, let be a bijection over R called the transition function of the PRNG, and define
for The sequence obtained from any seed has a period of at most |R|. Now let be an output function, and define for

We call the point set
where a recurrence-based point set. The requirement that be a bijection guarantees that is dimension-stationary, and therefore fully projection-regular (L’Ecuyer and Lemieux 2000). In concrete realizations, it is often the case that the recurrence has two main cycles, one of period length |R| – 1, and the other of period length 1, which contains 0, the zero of R. In this case, can be generated very efficiently (provided the function is easy to compute) by running the PRNG for steps to obtain the point set
and adding the vector to this set. An important advantage of this construction is that it can be used easily on problems for which the integrand depends on a random and possibly unbounded number of variables. For this type of function, one needs a point set whose defining parameters are independent of and for which points of arbitrary size can be generated. Recurrence-based point sets satisfy these requirements since they are determined by the functions and the set R, and those are independent of the dimension. Also, points of any size can be obtained by running the PRNG as long as required. Note that when the coordinates of each point in have a periodic structure, but by randomizing as in Section 5, this periodicity disappears and the coordinates of each point become mutually independent, so one can have without any problem. The link between PRNG constructions and QMC point sets was pointed out by Niederreiter (1986). Not only have similar constructions been proposed in both fields, but some of the selection criteria used in the two settings also have interesting connections. Criteria measuring the uniformity of the point set given in (20.10) are also
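The construction above can be sketched in code; everything in the following is illustrative (the modulus m = 31, multiplier a = 3, and output function g(x) = x/m are arbitrary toy choices, not parameters from the text):

```python
# Sketch of a recurrence-based point set built from a small PRNG.
# The state space is R = {0, 1, ..., m-1}; f is the (bijective)
# transition function and g maps a state to [0, 1).

def recurrence_based_point_set(f, g, states, s):
    """All s-dimensional points of overlapping output values,
    one point per initial state in `states`."""
    points = []
    for x0 in states:
        x, point = x0, []
        for _ in range(s):
            point.append(g(x))
            x = f(x)
        points.append(tuple(point))
    return points

m, a = 31, 3                    # toy modulus and multiplier
f = lambda x: (a * x) % m       # bijective transition function on R
g = lambda x: x / m             # output function from R to [0, 1)
P = recurrence_based_point_set(f, g, range(m), s=2)
```

Because f is a bijection, every coordinate of the resulting point set runs through the same set of values, which is exactly the dimension-stationarity property mentioned above.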
required in the PRNG context because in this case, this point set can be seen as the sampling space for the PRNG when it is used on a problem requiring uniform numbers per run. See L’Ecuyer (1994) for more details on this aspect of PRNGs. We now describe two particular cases of this type of construction that provide an alternative way of generating Korobov-type lattice point sets (either standard or polynomial).

Example 1. Let for some prime define for some nonzero and let This type of PRNG is a linear congruential generator (LCG) (Knuth 1998; L’Ecuyer 1994), and the recurrence-based point set is a Korobov lattice point set. When is a primitive element modulo (see Lidl and Niederreiter 1994 for the definition), the recurrence (20.12) has the maximal period of for any nonzero seed, and can thus be generated efficiently using (20.12) and (20.11).

Example 2. Let be a primitive polynomial over (see Lidl and Niederreiter 1994 for the definition), be a positive integer, and
let be given by the composition where is defined as the evaluation of a formal Laurent series over at that is,

This type of PRNG is called a polynomial LCG (L’Ecuyer 1994; Tezuka 1995), and the recurrence-based point set is a Korobov polynomial lattice point set. When and are relatively prime, the recurrence (20.13) has maximal period length. Polynomial LCGs can also be obtained via Tausworthe-type linear feedback shift-register sequences (Tausworthe 1965): the idea is to use the recurrence
over and to define the output at step as
which in practice is typically truncated to the word-length of the computer. Tezuka and L’Ecuyer give an efficient algorithm to generate the output under some specific conditions (Tezuka and L’Ecuyer 1991; L’Ecuyer 1996). Combining a few Tausworthe generators to define the output can greatly help improve the quality of the associated set, as explained by L’Ecuyer (1996), Tezuka and L’Ecuyer (1991), and Wang and Compagner (1993). Another way of enhancing the quality of point sets based on linear recurrences modulo 2 is to use tempering transformations (Matsumoto and Kurita 1994; Matsumoto and Nishimura 1998; L’Ecuyer and Panneton 2000). Note that these transformations generally destroy the lattice structure of (Lemieux and L’Ecuyer 2001). However, the point set obtained is still a digital net and therefore it can be studied under this more general setting. Conditions that preserve the dimension-stationarity of under these transformations are given by Lemieux and L’Ecuyer (2001). The idea of combining different constructions to build sets with better equidistribution properties is also discussed in Niederreiter and Pirsic (2001) in the more general setting of digital nets.
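A minimal sketch of a Tausworthe-type generator of this kind, using the primitive trinomial x⁵ + x² + 1 (an arbitrary toy choice, not a parameter from the text), whose bit recurrence is x_i = x_{i−3} ⊕ x_{i−5} and whose bit stream therefore has period 2⁵ − 1 = 31:

```python
# Toy Tausworthe (linear feedback shift-register) generator over F_2.

def tausworthe_bits(seed, k, lag, n_bits):
    """Generate n_bits of the bit sequence x_i = x_{i-lag} XOR x_{i-k},
    starting from a k-bit seed."""
    stream = list(seed)
    while len(stream) < n_bits:
        stream.append(stream[-lag] ^ stream[-k])
    return stream

def tausworthe_output(stream, step, L, n):
    """Form n outputs u_i in [0, 1) from L successive bits taken
    starting at position i*step (truncated binary expansion)."""
    return [sum(stream[i * step + j] / 2.0 ** (j + 1) for j in range(L))
            for i in range(n)]

bits = tausworthe_bits([1, 0, 0, 1, 1], k=5, lag=3, n_bits=93)
us = tausworthe_output(bits, step=5, L=5, n=10)
```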
3.4. Halton sequence
This sequence was introduced by Halton (1960) for constructing point sets of arbitrary length, and is a generalization of the one-dimensional van der Corput sequence (Niederreiter 1992b). Although it is not a digital sequence, it uses similar ideas and can thus be thought of as an ancestor of those sequences. For the point in the sequence is given by where the integers are typically chosen as the first prime numbers sorted in increasing order, and is the radical-inverse function in base defined by

where the integers are the coefficients in the expansion of i.e., as in the digital net definition.
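The radical-inverse function and the resulting Halton points can be sketched as a direct implementation of the digit reversal just described:

```python
def radical_inverse(i, b):
    """psi_b(i): reflect the base-b digits of i about the radix point."""
    u, f = 0.0, 1.0 / b
    while i > 0:
        u += (i % b) * f    # next base-b digit of i, scaled
        i //= b
        f /= b
    return u

def halton_point(i, bases):
    """i-th point of the Halton sequence in the given bases
    (typically the first prime numbers)."""
    return tuple(radical_inverse(i, b) for b in bases)
```

For instance, i = 5 gives 101 in base 2, so the first coordinate with bases (2, 3) is 0.101 in binary, i.e. 0.625.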
3.5. Sequences of Korobov rules
With (infinite) digital sequences, one can always add new points to until the estimation error is deemed small enough. The lattice point sets that we have discussed so far, on the other hand, contain only a fixed number of points. We now consider a method, discussed in (Sobol’ 1967; Maize 1981; Hickernell and Hong 1997; Hickernell, Hong, L’Ecuyer, and Lemieux 2001), for generating an infinite sequence of point sets in such that for is a Korobov (polynomial) lattice point set and Sequences based on nested Korobov lattice point sets can be constructed by choosing a nonzero odd integer and defining
where Hickernell, Hong, L’Ecuyer, and Lemieux (2001) give tables of generators for building these sequences, chosen with respect to different selection criteria explained in Section 4.1. One way of constructing a sequence based on Korobov polynomial lattice point sets is to choose a polynomial of degree 1 in (i.e., and a generating polynomial such that Then define
where is defined as on page 433, and mod In this case, the sequence turns out to be a special case of a digital sequence; see also Larcher (1998), page 187, for the particular case where and for a more general setting based on irrational formal Laurent series.
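The nesting property of such sequences can be illustrated in code: with base 2 and an odd generator (a = 5 below is an arbitrary illustrative choice, not a tabulated value), each 2^k-point Korobov lattice point set is contained in the 2^(k+1)-point one, so points can be added without discarding earlier evaluations:

```python
# Nested Korobov lattice point sets in base 2 (illustrative sketch).

def korobov_points(n, a, s):
    """Rank-1 Korobov lattice point set: (i/n) * (1, a, a^2, ...) mod 1."""
    gen = [pow(a, j, n) for j in range(s)]
    return [tuple((i * g % n) / n for g in gen) for i in range(n)]

# Point sets with 4, 8, 16, 32 points; each is nested in the next.
nested = [korobov_points(2 ** k, 5, 2) for k in range(2, 6)]
```

The containment holds because point i of the n-point set equals point 2i of the 2n-point set when the generator is odd.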
3.6. Implementations
The points of a digital net in base can be generated efficiently using a Gray code. This idea was first suggested by Antonov and Saleev (1979) for the Sobol’ sequence, and for other constructions by, e.g., Hong and Hickernell (2001), Pirsic and Schmid (2001), Tezuka (1995), and Bratley, Fox, and Niederreiter (1992). Assuming the idea is to modify the order in which the points are generated by replacing the digits from the expansion of by the Gray code
for which satisfies the following property: the Gray code for and only differ in one position; if is the smallest index such that then is the digit whose value changes, and it becomes in the Gray code for (Tezuka 1995, Theorem 6.6). This reduces the generation time because only a dot product of two vectors has to be performed in each dimension to generate a point. It is sometimes suggested in the literature (Bratley, Fox, and Niederreiter 1992; 1998; Acworth, Broadie, and Glasserman 1997; Fox 1986; Morokoff and Caflisch 1994) that for most low-discrepancy sequences, an initial portion of the sequence should be discarded because of the so-called “leading-zeros phenomenon”. For sequences such as Sobol’s that require initial parameters, this problem can be avoided (at least partially) by choosing these parameters carefully. Using a sufficiently large number of points and randomizing the point set can also help alleviate this problem. We refer the reader to the papers mentioned above for more details. The FORTRAN code of Bratley and Fox (1988) can be used to generate the Sobol’ sequence for and is available from the Collected Algorithms of the ACM at www.acm.org/calgo, where the code of Fox (1986) for generating the Faure sequence can also be found, as well as a code for Niederreiter’s sequence (Bratley, Fox, and Niederreiter 1994). A code to generate Generalized Faure sequences (provided the matrices have been precomputed) is given by Tezuka (1995). Owen has written code to generate scrambled nets; it is available at www-stat.stanford.edu/~owen/scramblednets, and is free for noncommercial use. A C++ library called libseq was recently developed by Friedel and Keller (2001), in which they use efficient algorithms to generate scrambled digital sequences, Halton sequences, and other techniques such as Latin Hypercube and Supercube Sampling (McKay, Beckman, and Conover 1979; Owen 1998a). This library can be found at www.multires.caltech.edu/software/libseq/index.html.
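The Gray-code idea can be sketched with the standard binary reflected Gray code, for which consecutive codes differ in exactly one binary digit, namely the one indexed by the lowest set bit of i + 1:

```python
def gray_code(i):
    """Binary reflected Gray code of the nonnegative integer i."""
    return i ^ (i >> 1)
```

Because only one digit changes from one point index to the next, a digital-net generator in Gray-code order can update each coordinate with a single XOR of a precomputed column vector instead of a full matrix-vector product.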
There are also a few commercial software packages to generate different QMC point sets; e.g., QR Streams at www.mathdirect.com/products/qrn/, and the FinDer software of Paskov and Traub (1995) at www.cs.columbia.edu/~ap/html/finder.html.
4. Measures of Quality
In this section, we present a number of criteria that have been proposed in the literature for measuring the uniformity (or non-uniformity) of a point set in the unit hypercube i.e., for measuring the
discrepancy between the distribution of the points of and the uniform distribution, in the context of QMC integration. In one dimension (i.e., several such measures are widely used in statistics for testing the goodness of fit of a data set with the uniform distribution; e.g., the Kolmogorov-Smirnov (KS), Anderson-Darling, and chi-square test statistics. Chi-square tests also apply in more than one dimension, but their efficiency vanishes quickly as the dimension increases. The rectangular-star discrepancy discussed earlier turns out to be one possible multivariate generalization of the KS test statistic. Other variants of this measure are discussed by Niederreiter (1992b), and connections between discrepancy measures and goodness-of-fit tests used in statistics are studied in Hickernell (1999). The asymptotic behavior of quality measures like the rectangular-star discrepancy is known for many constructions. For example, the Halton sequence has a rectangular-star discrepancy in but with a hidden constant that grows superexponentially with This is also true for the Sobol’ sequence, but with a smaller hidden constant. By contrast, Faure and Niederreiter-Xing sequences are built so that this hidden constant goes to zero exponentially fast as the dimension goes to infinity. In practice however, as soon as the number of dimensions exceeds a few units, these general measures of discrepancy become unsatisfactory because they are too hard to compute. A more practical and less general approach is to define measures of uniformity that exploit the structure of a given class of highly structured point sets. Here we concentrate our discussion on these types of criteria, starting with criteria for standard lattice rules, then covering those for digital nets.
4.1. Criteria for standard lattice rules
The criteria discussed here all relate to the dual lattice defined in Section 3. The first criterion we introduce has a nice geometrical interpretation, and is often used to measure the quality of LCGs through the so-called spectral test (Fishman and Moore III 1986; Fishman 1990; Knuth 1998; L’Ecuyer 1999). It was introduced by Coveyou and MacPherson (1967) and was used by Entacher, Hellekalek, and L’Ecuyer (2000) to choose lattice point sets. It amounts to computing the Euclidean length of the shortest nonzero vector in i.e.,
The quantity turns out to be equal to the inverse of the distance between the successive hyperplanes in the family of most distant parallel
hyperplanes on which the points of lie. This distance should be as small as possible so there are no wide gaps in without any point from Equivalently, should be as large as possible. Algorithms for computing are discussed in, e.g., Dieter (1975), Fincke and Pohst (1985), Knuth (1998), and L’Ecuyer and Couture (1997). For instance, the dual of the basis shown in Figure 20.1 contains the shortest vector in given by h = (5,8), so in this case. This test can be applied not only to but also to any projection of More precisely, assume and let be the integration lattice such that The idea is to compute where for all To define a criterion in which is computed for several subsets I, it is convenient to normalize so that the same scale is used to compare the different projections. This can be achieved by using upper bounds derived from the “best” possible lattice in dimensions (not necessarily an integration lattice). Such bounds can be found in, e.g., Conway and Sloane (1999) and L’Ecuyer (1999). Criteria using these ideas have been used for measuring LCGs (Fishman and Moore III 1986; Fishman 1990; L’Ecuyer 1999), usually for subsets I containing successive indices, i.e., of the form The following criterion is more general and has been used to construct tables of “good” Korobov lattice point sets by L’Ecuyer and Lemieux (2000):
where Let the range of a subset be defined as The criterion considers projections over successive indices whose range is at most over pairs of indices whose range is at most over triplets of indices whose range is at most etc. Note that for dimension-stationary point sets, it is sufficient to do as in the definition of and to only consider subsets I having 1 as their first index. The next criterion is called the Babenko-Zaremba index, and is similar to except that a different norm is used for measuring the vectors h in It is defined as follows (Niederreiter 1992b):
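For toy parameter sizes, the shortest-vector computation underlying these criteria can be sketched by exhaustive search over the dual lattice of a rank-1 Korobov rule; the parameters in the example are arbitrary illustrative choices, and real implementations use the specialized algorithms cited above rather than brute force:

```python
import itertools
import math

def spectral_length(n, a, s, bound):
    """Euclidean length of the shortest nonzero vector h in the dual of a
    rank-1 Korobov lattice, i.e. h . (1, a, ..., a^(s-1)) = 0 (mod n),
    found by brute-force search over |h_j| <= bound (toy sizes only)."""
    gen = [pow(a, j, n) for j in range(s)]
    best = math.inf
    for h in itertools.product(range(-bound, bound + 1), repeat=s):
        if any(h) and sum(hj * gj for hj, gj in zip(h, gen)) % n == 0:
            best = min(best, math.hypot(*h))
    return best
```

For example, with n = 8 and a = 5 in two dimensions, the shortest dual vector is (2, −2) (or (−2, 2)), of length 2√2.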
where max It has been used by Maisonneuve (1972) to provide tables of “good” Korobov rules, but its computation is typically much more time-consuming than computing Both and can be seen as special cases of more general criteria, such as the general discrepancy measures defined, e.g., by Hickernell (1998b), Theorem 3.8, or the generalized spectral test of Hellekalek (1998), Definition 5.2. These criteria use a general norm to measure h and apply to point sets that do not necessarily come from a lattice. The following criterion, called uses the same norm as the Babenko-Zaremba index, but it sums a fixed power of the length of all vectors in the dual lattice instead of only considering the smallest one. For an arbitrary integer it is defined as (Sloan and Joe 1994; Hickernell 1998b)
When is even, it simplifies to (Sloan and Joe 1994; Hickernell 1998b)
where is the Bernoulli polynomial of degree (Sloan and Joe 1994), and it can then be computed in time. This criterion has been used in, e.g., Haber (1983), Sloan and Walsh (1990), and Sloan and Joe (1994) to choose lattice point sets. The definition of has been generalized by Hickernell (1998b) by the introduction of weights that take into account the relative importance of each subset I (e.g., with respect to the ANOVA decomposition of The generalization is defined as
where is the set of nonzero indices of h. Assuming that is even and the are product-type weights, i.e., that

this “weighted” becomes

which can still be computed in time. By letting for we recover the criterion A normalized version of this
criterion and the quality measure described above have been used for selecting generators defining sequences of nested Korobov lattice point sets by Hickernell, Hong, L’Ecuyer, and Lemieux (2001). Note that can be considered as a special case of the weighted spectral test of Hellekalek (1998), Definition 6.1. Various other measures of quality for lattice rules can be found in Niederreiter (1992b), Sloan and Joe (1994), Hickernell (1998b), and the references therein.
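For even the product-form expression above makes this criterion cheap to evaluate; the sketch below does this for α = 2 and a rank-1 Korobov rule, using the Bernoulli polynomial B₂(x) = x² − x + 1/6 (the Korobov parameters in the example are arbitrary illustrative choices):

```python
import math

def p_alpha_2(n, a, s):
    """P_alpha for alpha = 2 and a rank-1 Korobov rule, via the product
    formula P_2 = -1 + (1/n) * sum_i prod_j (1 + 2*pi^2 * B_2(u_ij)),
    where u_ij is coordinate j of lattice point i."""
    def B2(x):
        return x * x - x + 1.0 / 6.0
    gen = [pow(a, j, n) for j in range(s)]
    total = 0.0
    for i in range(n):
        prod = 1.0
        for g in gen:
            prod *= 1.0 + 2.0 * math.pi ** 2 * B2((i * g % n) / n)
        total += prod
    return total / n - 1.0
```

A quick sanity check: in one dimension the dual lattice is the multiples of n, so P₂ reduces to 2 Σ_{k≥1} (nk)^{−2} = π²/(3n²), which the code reproduces.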
4.2. Criteria for digital nets
As we mentioned in Section 3.2, the of a digital net is often used to assess its quality. To compute this value, one has to find which is the largest integer such that for any integers satisfying the vectors are linearly independent, where denote the row of the generating matrix of for Hence, computing is typically quite time-consuming since for a given integer there are different vectors satisfying and (Pirsic and Schmid 2001). If we define as the value of the for the projection an equivalent definition of is
That is, measures the regularity of all projections and returns the worst case. Inside this definition of we can also normalize the value of so that projections over subspaces of different dimensions are judged more equitably, in the same way as the value is used to normalize in the criterion To do so, we can use the lower bound for in dimensions given by Niederreiter and Xing (1998)

and define, for example. The idea behind this is that a large value of for a low-dimensional subset I is usually worse than when I is high-dimensional, and therefore it should be penalized more. The definition of the that we have used so far is of a geometrical nature, similar to the interpretation of as the inverse of the distance between the hyperplanes of a lattice. Interestingly, just like
the can also be related to the length of a shortest vector in a certain dual space (Tezuka 1995; Larcher, Niederreiter, and Schmid 1996; Niederreiter and Pirsic 2001; Couture and L’Ecuyer 2000), as we now explain. Our presentation follows Niederreiter and Pirsic (2001). Let be the generating matrices associated with a digital net in base with points. Let C be the matrix obtained by concatenating the transposes of the first rows of each that is, if denotes the matrix containing the first rows of then The analysis of Niederreiter and Pirsic (2001) assumes that the generating matrices are but we will explain shortly why it remains valid even if we start with matrices and truncate them. Let be the null space of the row space of C, i.e., We refer to as the dual space of the digital net from now on. Define the following norm on for any nonzero let Define the norm of a vector by

The following result about the is proved by Niederreiter and Pirsic (2001):
We now explain why this result is valid even if the matrices have been truncated to their first rows. Let denote the dual space that would be obtained without the truncation. Observe that by definition,

Also, using Proposition 1 of Niederreiter and Pirsic (2001) and the fact that the dimension of the row space of C is not larger than we have that Therefore,
which means that (20.19) is also true if we replace by From now on, we assume that actually represents the dual space obtained without truncating the generating matrices Also, we view as a subspace of which means that each element in is represented by a vector of polynomials over and the norm is defined by with the convention that deg(0) = –1. In the special case where is a polynomial lattice point set, the dual space corresponds to the dual lattice If we define the norm
then
where is the resolution of the polynomial lattice point set. This result is discussed by Couture, L’Ecuyer, and Tezuka (1993), Couture and L’Ecuyer (2000), Lemieux and L’Ecuyer (2001), and Tezuka (1995). The resolution is often used for measuring the quality of PRNGs based on linear recurrences modulo 2, such as Tausworthe generators (Tootill, Robinson, and Eagle 1973; L’Ecuyer 1996). From a geometrical point of view, the resolution is the largest integer for which is equidistributed. Obviously, if This concept can be extended from polynomial lattice point sets to general digital nets by replacing by above. More precisely, we have:

Proposition 1 Let be a digital net in base and let be the dual space of The resolution of satisfies:
where ||h|| is defined as in (20.20). Proof: The proof of this proposition requires results given in the forthcoming sections, and it can be found in the appendix. The resolution can be computed for any projection of for let
where is the dual of the row space of the matrix The following criterion has been proposed to select polynomial lattice
point sets (Lemieux and L’Ecuyer 2001), and it could also be used for any digital net:
where the set has the same meaning as in the definition of the criterion in (20.14). Another criterion is the digital version of the quality measure It is closely related to the dyadic diaphony (Hellekalek and Leeb 1997) and the weighted spectral test (Hellekalek 1998), and was introduced by Lemieux and L’Ecuyer (2001) for polynomial lattice point sets in base 2. It uses a norm W(h) defined as
and is defined for any integer and weights as

where and the weights
In the special case where is even, are product-type weights as in (20.15), we have that
where and it is assumed that when We conclude this subsection with references giving numerical values of the previous criteria for specific point sets. The value of the for the Sobol’ sequence can be found in Niederreiter and Xing (1998) for dimensions these values are compared in that paper with those obtained from the improved base-2 sequences proposed by Niederreiter and Xing (1997, 1998). The “Salzburg Tables” given by Pirsic and Schmid (2001) list optimal polynomial pairs and their associated to build Korobov polynomial rules. Generalized Faure sequences are built so that but with the drawback that the base must be at least as large as the dimension. Hence only small
powers of the bases are typically used in practice, and the quality of is measured only for (or fewer) projections, where is small. This illustrates the fact that for comparisons between different constructions to be fair, the value of the should be considered in conjunction with the base Algorithms to compute are discussed by Pirsic and Schmid (2001). The resolution has been used by Sobol’ for finding optimal values to initialize the recurrences defining the direction numbers in his construction (Sobol’ 1967; Sobol’ 1976). More precisely, his Property A means that the first points of the sequence have the maximal resolution of 1, and his Property means that the first points have the maximal resolution of 2. Following ideas from Morokoff and Caflisch (1994) and Cheng and Druzdzel (2000), a criterion related to is used to find initial direction numbers for the Sobol’ sequence in dimensions in the forthcoming RandQMC library (Lemieux, Cieslak, and Luttmer 2001); the maximum in (20.21) is taken over all two-dimensional subsets I of the form where is the dimension for which we want to find initial direction numbers, and Examples of parameters for polynomial lattice point sets chosen with respect to for different values of are given by Lemieux and L’Ecuyer (2001).
5. Randomizations
Once a construction is chosen for and the approximation given by (20.2) is computed, it is usually important to have an estimate of the error For that purpose, upper bounds of the form (20.3) are not very useful, since they are usually much too conservative, in addition to being hard to compute and restricted to a possibly small set of functions. Instead, one can randomize the set so that: 1) each point in the randomized point set has a uniform distribution over 2) the regularity (or low-discrepancy property) of as measured by a specific quality criterion, is preserved under the randomization. The first property guarantees that the approximation
is an unbiased estimator of When the second property holds, the variance of the estimator can usually be expressed in a way that establishes a relation between the optimization of the criterion whose value is preserved under the randomization and the minimization of the estimator’s variance. Hence these two properties support viewing randomized QMC methods as variance reduction techniques that preserve the unbiasedness of the MC estimator. In practice, the variance of can be estimated by generating i.i.d. copies of through independent replications of the randomization. This estimator can then be compared with the estimated variance of the MC estimator to assess the effectiveness of QMC on any particular problem. We now describe some randomizations that have these two properties and that have been proposed in the literature for the constructions presented in the preceding section.
5.1. Random shift modulo 1
The following randomization was originally proposed by Cranley and Patterson (1976) for standard lattice rules. Some authors have suggested that it could also be used for other low-discrepancy point sets (Tuffin 1996; Morohosi and Fushimi 2000). Let be a given point set in and an random vector uniformly distributed over The randomly shifted estimator based on is defined as
When is a lattice point set, the length of the shortest vector associated with any projection is preserved under this randomization. An explicit expression for the variance of in that case will be given in Section 6.1. With this randomization, each shifted point mod 1) is uniformly distributed over Therefore, even if the dimension is much larger than the number of points and many coordinates are equal within a given point (for instance, when comes from an LCG with a small period, as in Section 3.3), these coordinates become mutually independent after the randomization. Hence each point has the same distribution as in the MC method; the difference with MC is that the points of the shifted lattice are not independent. See L’Ecuyer and Lemieux (2000), Section 10.3, for a concrete numerical example with and These properties also hold for the other randomizations described below.
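One replicate of the randomly shifted estimator can be sketched as follows for a rank-1 Korobov rule; the parameters are arbitrary illustrative choices, and the uniform shift is the only source of randomness:

```python
import random

def shifted_korobov_estimate(f, n, a, s, rng):
    """Cranley-Patterson estimator: average f over a Korobov lattice
    point set shifted modulo 1 by one uniform random vector."""
    delta = [rng.random() for _ in range(s)]        # the random shift
    gen = [pow(a, j, n) for j in range(s)]
    total = 0.0
    for i in range(n):
        point = [((i * g % n) / n + d) % 1.0 for g, d in zip(gen, delta)]
        total += f(point)
    return total / n

est = shifted_korobov_estimate(lambda u: u[0], 64, 19, 2, random.Random(12345))
```

Averaging independent replicates (fresh shifts) gives the unbiased estimator, and their sample variance gives the error estimate discussed above.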
5.2. Digital shift
When is a digital net in base the counterpart of the previous method is to consider the expansion of the random vector and to add it to each point of using operations over More precisely,
if and we compute where
and let This randomization was suggested to us by Raymond Couture for point sets based on linear recurrences modulo 2 (see also Lemieux and L’Ecuyer 2001). It is also used in an arbitrary base (along with other more time-consuming randomizations) in Hong and Hickernell (2001) and Matoušek (1998), as we will see in Section 5.4. It is best suited for digital nets in base and its application preserves the resolution and the of any projection.
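In base 2 the digital shift is simply an XOR of binary digit expansions, which can be sketched as:

```python
def digital_shift_b2(u, delta, L=53):
    """Base-2 digital shift: XOR the first L binary digits of u with
    those of the random shift delta (L = 53 covers double precision)."""
    scale = 1 << L
    return (int(u * scale) ^ int(delta * scale)) / scale
```

Applying the same shift twice recovers the original point, since XOR is its own inverse.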
5.3. Scrambling
This randomization was proposed by Owen (1995), and it also preserves the of a digital net and its resolution, for any projection. It works as follows: in each dimension partition the interval [0, 1) into equal parts and permute them uniformly and randomly; then partition each of these sub-intervals into equal parts and permute them uniformly and randomly; etc. More precisely, to scramble L digits one needs to randomly and uniformly generate several independent permutations of the integers [0... (assuming a specific bijection has been chosen to identify the elements in with those in if is not prime), where and compute

where

In practice, L may be chosen equal to the word-length of the computer, and the digits for are then dropped. However, as
Matoušek (1998) points out, if and no two points have the same first digits in each dimension (i.e., for each the unidimensional projection has a maximal resolution of then the permutations after level are independent for each point and therefore the random digits for can be generated uniformly and independently over Hence in this case we do not need to generate any permutation after level Owen (1995), Section 3.3, suggested a similar implementation. When and are large, the amount of memory required for storing all the permutations becomes very large, and only a partial scrambling might then be feasible; that is, scramble digits, and generate the remaining ones randomly and uniformly (Fox 1999), or use the permutations for the digit (Tan and Boyle 2000). A clever way of avoiding storage problems is discussed by Matoušek (1998), and a related idea is used in Morohosi’s code (which can be found at www.misojiro.t.u-tokyo.ac.jp/~morohosi) for scrambling Faure sequences. The idea is to avoid storing all the permutations by appropriately reinitializing the underlying PRNG so that the permutations can be regenerated as they are needed. This is especially useful when the base is large, which happens when Faure sequences are used in large or even moderate dimensions.
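A minimal sketch of nested uniform scrambling in base 2: a permutation of two sub-intervals reduces to one random bit, drawn lazily per node of the digit tree and shared across points (the van der Corput points used below are just an illustrative input):

```python
import random

def owen_scramble_b2(points, L=16, seed=1):
    """Nested uniform scrambling in base 2: at each node of the binary
    digit tree, an independent random bit decides whether the two child
    intervals are swapped, i.e. the digit is XORed with a bit that
    depends on all preceding digits of that coordinate."""
    rng = random.Random(seed)
    flips = {}                      # (dimension, digit-prefix) -> random bit
    scrambled = []
    for point in points:
        newp = []
        for j, u in enumerate(point):
            digits = int(u * (1 << L))      # first L binary digits of u
            prefix, out = 1, 0              # leading 1 encodes prefix length
            for k in range(L - 1, -1, -1):
                bit = (digits >> k) & 1
                f = flips.setdefault((j, prefix), rng.getrandbits(1))
                out = (out << 1) | (bit ^ f)
                prefix = (prefix << 1) | bit
            newp.append(out / (1 << L))
        scrambled.append(tuple(newp))
    return scrambled

vdc = [0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]   # van der Corput
S = owen_scramble_b2([(u,) for u in vdc], L=16, seed=42)
```

Each dyadic interval keeps exactly one point (here, one point per interval of length 1/8), illustrating that the net properties survive the scrambling.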
5.4. Random Linear Scrambling
Matoušek (1998) proposes an alternative scrambling method that does not require as much randomness and storage. It borrows ideas from the scrambling technique and from transformations proposed by Faure and Tezuka (2001) and Tezuka (1995). This method is also discussed by Hong and Hickernell (2001), where it is called “Owen-Scrambling”; our presentation follows theirs, but we prefer the name used by Matoušek to avoid any confusion with the actual scrambling proposed by Owen. The idea is to generate nonsingular lower-triangular matrices with elements chosen randomly, independently, and uniformly over (the elements on the main diagonal of each are chosen over where is such that and vectors with entries independently and uniformly distributed over The digits of a randomized coordinate are then obtained as
where all operations are performed in
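A sketch of this random linear scrambling in base 2 (the digit count is illustrative, and a fixed seed stands in for storing the random matrix and shift so the same scramble applies to every point):

```python
import random

def linear_scramble_b2(u, L=16, seed=7):
    """Random linear scrambling sketch, base 2: scrambled digits
    y = M x + e over F_2, with M a random nonsingular lower-triangular
    matrix (ones on the diagonal) and e a random digit vector."""
    rng = random.Random(seed)
    M = [[rng.getrandbits(1) if k < i else 1 if k == i else 0
          for k in range(L)] for i in range(L)]
    e = [rng.getrandbits(1) for _ in range(L)]
    x = [(int(u * (1 << L)) >> (L - 1 - i)) & 1 for i in range(L)]
    y = [(sum(M[i][k] & x[k] for k in range(i + 1)) + e[i]) % 2
         for i in range(L)]
    return sum(yi / (1 << (i + 1)) for i, yi in enumerate(y))

# Scramble the 8-point one-dimensional grid with 3 digits.
outs = [linear_scramble_b2(i / 8, L=3, seed=5) for i in range(8)]
```

Because M is unit lower-triangular, the affine map on digit vectors is a bijection, so distinct digit prefixes stay distinct and equidistribution is preserved.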
5.5. Others
We briefly mention some other ideas that can be used to randomize QMC point sets. In addition to the linear scrambling, Matoušek (1998) proposes randomization techniques for digital sequences that are easier to generate than the scrambling method, while retaining enough randomness for the purpose of some specific theoretical analyses. Hong and Hickernell (2001) suggest another form of linear scrambling that incorporates transformations proposed by Faure and Tezuka (2001). Randomizations that use permutations in each dimension to reorder the Halton sequence are discussed by Braaten and Weller (1979) and Morokoff and Caflisch (1994). Wang and Hickernell (2000) propose to randomize this sequence by randomly generating its starting point. Some authors (Ökten 1996; Spanier 1995) suggest partitioning the set of dimensions into two subsets (typically of successive indices, i.e., and and then using a QMC method (randomized or not) on one subset and MC on the other. One of the justifications for this approach is that some digital nets (e.g., the Sobol’ sequence) are known to have projections with better properties when I contains small indices; this suggests using QMC on the first few dimensions and MC on the remaining ones. However, this argument becomes irrelevant if a dimension-stationary point set is used. More importantly, if the QMC point set is randomized and can be shown (or presumed) to do no worse than MC in terms of its variance, then there is no advantage or “safety net” gained by using MC on one part of the problem. Estimators obtained by “padding” randomized QMC point sets with MC are analyzed by Owen (1998a) and Fox (1999). Owen (1998a) discusses some other padding techniques, as well as a method called Latin Supercube Sampling to handle very high-dimensional problems.
6. Error and Variance Analysis
In this section, we study the error for the approximations based on low-discrepancy point sets, and the variance of estimators obtained by randomizing these sets. All the results mentioned here are obtained by using a particular basis for the set of square-integrable functions over to expand the integrand In each case, the basis is carefully chosen, depending on the structure of so that the behavior of the approximation on the basis functions is easy to analyze. The results presented here describe, for a given square-integrable function, the variance of different randomized QMC estimators. This is one possible approach for analyzing the performance of these estimators;
Recent Advances in Randomized Quasi-Monte Carlo Methods
Different aspects of randomized QMC methods have been studied in the literature. In particular, Hickernell and Woźniakowski (2001) have recently studied the behavior of different types of error on Hilbert spaces of integrands in a general setting that includes MC, scrambled nets, and randomly shifted lattices. They consider the “worst-case error” (the error on the worst integrand for a given QMC approximation), the “random-case error” (the expected error, e.g., in the root-mean-square sense, of a randomized QMC estimator on a given integrand), and the “average-case error” (the average error over a class of integrands), and they give explicit formulas for these errors. This type of analysis provides useful connections between variance expressions such as those studied here (which correspond to the random-case setting) and the discrepancy measures discussed in the previous section (which correspond to the worst-case setting). Specific results for scrambled-net estimators are given by Heinrich, Hickernell, and Yue (2001b).
6.1. Standard Lattices and Fourier Expansion
For standard lattice rules, the following result suggests that expanding $f$ in a Fourier series is appropriate for error and variance analysis. Recall that the Fourier basis for $L^2([0,1)^s)$ is orthonormal and given by $\{e^{2\pi\iota\,\mathbf{h}\cdot\mathbf{u}},\ \mathbf{h}\in\mathbb{Z}^s\}$, where $\iota = \sqrt{-1}$.

Lemma 1 (Sloan and Joe 1994, Lemma 2.7) If $P_n$ is a lattice point set with dual lattice $L_s^\perp$, then for any $\mathbf{h}\in\mathbb{Z}^s$,
$$\frac{1}{n}\sum_{i=0}^{n-1} e^{2\pi\iota\,\mathbf{h}\cdot\mathbf{u}_i} = \begin{cases} 1 & \text{if } \mathbf{h}\in L_s^\perp, \\ 0 & \text{otherwise.} \end{cases}$$

Hence, the lattice rule integrates a basis function with no error when $\mathbf{h}\notin L_s^\perp$, and with error 1 otherwise. Using this, we get the following result:

Proposition 2 Let $P_n$ be a lattice point set, $\hat\mu_{RS}$ be defined as in (20.22), and $\hat f(\mathbf{h})$ be the Fourier coefficient of $f$ evaluated at $\mathbf{h}$. (From Sloan and Joe 1994) If $f$ has an absolutely convergent Fourier series, then
$$\left|\frac{1}{n}\sum_{i=0}^{n-1} f(\mathbf{u}_i) - \mu\right| \;\le\; \sum_{\mathbf{0}\neq\mathbf{h}\in L_s^\perp} |\hat f(\mathbf{h})|.$$
(From L’Ecuyer and Lemieux 2000) If $f$ is square-integrable, then
$$\mathrm{Var}(\hat\mu_{RS}) = \sum_{\mathbf{0}\neq\mathbf{h}\in L_s^\perp} |\hat f(\mathbf{h})|^2, \qquad (20.25)$$
whereas for the MC estimator based on $n$ points,
$$\mathrm{Var}(\hat\mu_{MC}) = \frac{1}{n}\sum_{\mathbf{0}\neq\mathbf{h}\in\mathbb{Z}^s} |\hat f(\mathbf{h})|^2.$$
The result (20.25) was proved independently by Tuffin (1998), but under the stronger assumption that $f$ has an absolutely convergent Fourier series. Notice that, by contrast with the MC estimator, there is no factor of $1/n$ multiplying the sum of squared Fourier coefficients for the randomly shifted lattice rule estimator $\hat\mu_{RS}$. Hence in the worst case, the variance of $\hat\mu_{RS}$ could be $n$ times as large as the MC estimator’s variance. This worst case corresponds to an extremely unlucky pairing of function and point set, for which all the Fourier mass of $f$ is concentrated on the dual lattice. However, in the expression for the variance of $\hat\mu_{RS}$, the coefficients are summed only over the dual lattice $L_s^\perp$, which contains $n$ times fewer points than the set $\mathbb{Z}^s$ over which the sum is taken in the MC case. Therefore, if the dual lattice is such that the squared Fourier coefficients are smaller “on average” over $L_s^\perp$ than over $\mathbb{Z}^s$, then the variance of $\hat\mu_{RS}$ will be smaller than the variance of $\hat\mu_{MC}$. From the results given in the previous proposition, different bounds on the error and variance can be obtained by making additional assumptions on the integrand (Sloan and Joe 1994; Hickernell 1998b; Hickernell 2000). Most of these bounds involve the quality measures discussed in the previous section; hence a point set that minimizes one of these criteria minimizes a bound on the error or variance for the class of functions for which those bounds hold. Such analyses often provide arguments in favor of these criteria. A different type of analysis, based on the belief that the largest squared Fourier coefficients tend to be associated with “short” vectors $\mathbf{h}$, corresponding to the low-frequency terms of $f$, suggests that the lattice point set should be chosen so that $L_s^\perp$ does not contain those “short” vectors. From this point of view, a criterion based on the length of the shortest nonzero vector of the dual lattice seems appropriate, since it makes sure that $L_s^\perp$ does not contain vectors with a small Euclidean length. This criterion also has the advantage of usually being much faster to compute (Entacher, Hellekalek, and L’Ecuyer 2000; Hickernell, Hong, L’Ecuyer, and Lemieux 2001).
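To make the random shift concrete, the following minimal sketch implements a randomly shifted rank-1 lattice rule (a Cranley–Patterson rotation) and estimates its variance from independent replicates. The generating vector $z = (1, 233)$ with $n = 610$ is a Fibonacci lattice chosen by us for illustration only, not a parameter recommended by this chapter.

```python
import random

def shifted_lattice_estimate(f, n, z, rng):
    """One replicate of a randomly shifted rank-1 lattice rule:
    u_i = (i * z / n + Delta) mod 1, with Delta ~ U[0,1)^s."""
    s = len(z)
    shift = [rng.random() for _ in range(s)]
    total = 0.0
    for i in range(n):
        u = [((i * z[j]) / n + shift[j]) % 1.0 for j in range(s)]
        total += f(u)
    return total / n

# Test integrand with known integral 1 over [0,1)^2.
f = lambda u: 4.0 * u[0] * u[1]

rng = random.Random(42)
z = [1, 233]                       # Fibonacci lattice for n = 610
reps = [shifted_lattice_estimate(f, 610, z, rng) for _ in range(30)]
mean = sum(reps) / len(reps)
var_rqmc = sum((x - mean) ** 2 for x in reps) / (len(reps) - 1)
print(mean, var_rqmc)              # mean close to 1
```

Each replicate is unbiased for $\mu$, so the empirical variance across independent shifts estimates $\mathrm{Var}(\hat\mu_{RS})$; for this smooth integrand it comes out far below the MC variance $\mathrm{Var}[f(\mathbf{U})]/610 \approx 1.3\times 10^{-3}$.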
6.2. Digital Nets and Haar or Walsh Expansions
Recall that digital nets are usually built so as to satisfy different equidistribution properties with respect to partitions of the unit hypercube into boxes. For this reason, when studying their associated error and variance, it is convenient to use a basis consisting of step functions that are constant over boxes. Both the Walsh and the Haar basis functions have this property. In addition, the Walsh functions form an orthonormal basis of $L^2([0,1)^s)$.

6.2.1 Scrambled-type estimators. We first define the Haar basis functions in base $b$, following the presentation of Owen (1997) and Heinrich, Hickernell, and Yue (2001a). A multivariate Haar wavelet basis function is indexed by a vector $\mathbf{k}$ of positive integers, which fixes the level of the wavelet in each coordinate, together with auxiliary index vectors that fix its position and shape within that level. Now consider the part of the Haar expansion of $f$ that depends on the basis functions associated with a given vector $\mathbf{k}$; i.e., let $f_{\mathbf{k}}$ be the sum of the corresponding terms of the expansion. The function $f_{\mathbf{k}}$ is also a step function, which is constant within the boxes obtained by partitioning the $j$th axis of $[0,1)^s$ into $b^{k_j}$ equal intervals, for each $j$. Owen (1997) shows that the variance of the estimator $\hat\mu_{scr}$ based on a scrambled digital net with $n = b^k$ points is given by
$$\mathrm{Var}(\hat\mu_{scr}) = \frac{1}{n}\sum_{\mathbf{k}} \Gamma_{\mathbf{k}}\,\sigma_{\mathbf{k}}^2, \qquad (20.27)$$
where $\sigma_{\mathbf{k}}^2$ is the variance of $f_{\mathbf{k}}(\mathbf{U})$, as defined in (20.26), and the gain coefficient $\Gamma_{\mathbf{k}}$ depends on the equidistribution properties of the digital net and on the definition of the scrambling. Assuming that the quality parameter of the net is $t$, we have (Owen 1998b):
$$\Gamma_{\mathbf{k}} \;\le\; b^t\left(\frac{b+1}{b-1}\right)^s.$$
Using this, Owen obtains the following bound on the variance of the scrambled-net estimator:

Proposition 3 (Owen 1998b, Theorem 1) Let $\hat\mu_{scr}$ be the estimator constructed from a scrambled digital net with $n = b^k$ points. For any square-integrable function $f$,
$$\mathrm{Var}(\hat\mu_{scr}) \;\le\; b^t\left(\frac{b+1}{b-1}\right)^s \frac{\sigma^2}{n}. \qquad (20.28)$$
Hence the variance of the scrambled-net estimator cannot be larger than the MC estimator’s variance times a certain constant (independent of $n$, but growing exponentially with $s$ for a fixed base $b$). For a $(0,m,s)$-net, $t = 0$, which implies that this constant can be bounded by $e \simeq 2.718$ (Owen 1997). In the case where $f$ satisfies certain smoothness properties (its mixed partial derivatives satisfy a Lipschitz condition), Owen shows that the $\sigma_{\mathbf{k}}^2$ decrease rapidly as $\mathbf{k}$ grows; under this assumption, he obtains a bound in $O(n^{-3}(\log n)^{s-1})$ for the variance of the scrambled-net estimator. Other results on the asymptotic properties of the scrambled-net estimator that use Haar wavelets can be found in Heinrich, Hickernell, and Yue (2001a), Heinrich, Hickernell, and Yue (2001b), and the references cited there. Haar series are also considered in the context of QMC integration by, e.g., Entacher (1997) and Sobol’ (1969). An important point to mention is that the scrambling is not the only randomization for which (20.27) holds; the result is valid for any randomization satisfying the following properties (Hong and Hickernell 2001; Matoušek 1998):

1. each point in the randomized point set is uniformly distributed over $[0,1)^s$;

2. for any two distinct points of the randomized set, a pairwise condition on the joint distribution of their digits holds: roughly, the digits beyond those shared by the two corresponding original points must be (a) uniformly distributed, (b) properly spread over their possible values, and (c) uncorrelated across the two points; we refer the reader to Hong and Hickernell (2001) for the precise statement.
Hong and Hickernell (2001) have shown that the random linear scrambling mentioned in Section 5.4 satisfies these properties, so the bound (20.28) given in Proposition 3 holds for linearly scrambled estimators as well. This is interesting since this method has a faster implementation than the scrambling of Section 5.3. Note that the digital shift does not satisfy property 2(b), since the digits of any two shifted points are modified by the same shift.

6.2.2 Digitally shifted estimators. To study the variance of a shifted digital net, we use a Walsh expansion for $f$. Walsh series have also been used to analyze the error produced by (nonrandomized) digital nets by Larcher, Niederreiter, and Schmid (1996), Larcher, Lauss, Niederreiter, and Schmid (1996), and Larcher and Pirsic (1999). In the presentation of the forthcoming results, the vector $\mathbf{h}$ will be used both to represent sequences of digits in base $b$ and elements of the corresponding ring of formal series; when required, we will use the bijection between these two spaces to go back and forth between them. For any $\mathbf{h}$, the Walsh basis function in $\mathbf{h}$ is defined as
$$\psi_{\mathbf{h}}(\mathbf{u}) = e^{2\pi\iota\,(\mathbf{h}\cdot\mathbf{u})/b},$$
where $\mathbf{h}\cdot\mathbf{u}$ is a digit-wise scalar product whose operations are performed in $\mathbb{Z}_b$. For any $\mathbf{u}$ and $\mathbf{v}$ we have that
$$\psi_{\mathbf{h}}(\mathbf{u}\oplus\mathbf{v}) = \psi_{\mathbf{h}}(\mathbf{u})\,\psi_{\mathbf{h}}(\mathbf{v}),$$
where $\oplus$ corresponds to a digit-by-digit addition over $\mathbb{Z}_b$ (as if we were adding the corresponding elements in the ring of formal series). See Larcher, Niederreiter, and Schmid (1996) and Larcher and Pirsic (1999) for more information on generalized definitions of Walsh series in the context of QMC integration. Let $\tilde f(\mathbf{h})$ denote the Walsh coefficient of $f$ in $\mathbf{h}$, that is,
$$\tilde f(\mathbf{h}) = \int_{[0,1)^s} f(\mathbf{u})\,\overline{\psi_{\mathbf{h}}(\mathbf{u})}\,d\mathbf{u}. \qquad (20.29)$$
The following result may be interpreted as the digital counterpart of the result stated in Lemma 1. Recall that $P_n^\perp$ denotes the dual space of $P_n$.

Lemma 2 Let $P_n$ be a digital net in base $b$ with $n = b^k$ points. For any $\mathbf{h}$, we have
$$\frac{1}{n}\sum_{i=0}^{n-1} \psi_{\mathbf{h}}(\mathbf{u}_i) = \begin{cases} 1 & \text{if } \mathbf{h}\in P_n^\perp, \\ 0 & \text{otherwise.} \end{cases}$$

Larcher, Niederreiter, and Schmid (1996) have shown that the above sum is 0 when $\mathbf{h}\notin P_n^\perp$.

Proof: If $\mathbf{h}\in P_n^\perp$, then $\mathbf{h}\cdot\mathbf{u}_i = 0$ for all $i$, and the result follows easily. If $\mathbf{h}\notin P_n^\perp$, we are interested in the scalar products $\mathbf{h}\cdot\mathbf{u}_i$ for $i = 0,\ldots,n-1$. These are the images of the digit vectors of the $n$ points under a linear mapping onto $\mathbb{Z}_b$; since the dimension of the image of this mapping is 1, the dimension of its kernel is $k-1$. Hence each element of $\mathbb{Z}_b$ has $b^{k-1}$ pre-images under this mapping, and therefore, as a multiset, $\{\mathbf{h}\cdot\mathbf{u}_i,\ i = 0,\ldots,n-1\}$ contains $b^{k-1}$ copies of each element of $\mathbb{Z}_b$. Using this and the fact that $\sum_{j=0}^{b-1} e^{2\pi\iota j/b} = 0$, the result immediately follows.

Using this lemma, we get the following result, which is analogous to that presented in Proposition 2. It is proved by Lemieux and L’Ecuyer (2001) for the case where $b = 2$ and $P_n$ is a polynomial lattice point set.

Proposition 4 Let $P_n$ be a digital net in base $b$. For any function $f$ having an absolutely convergent Walsh series expansion, we have
$$\left|\frac{1}{n}\sum_{i=0}^{n-1} f(\mathbf{u}_i) - \mu\right| \;\le\; \sum_{\mathbf{0}\neq\mathbf{h}\in P_n^\perp} |\tilde f(\mathbf{h})|.$$
Let $\tilde P_n$ be a digitally shifted digital net in base $b$ and $\hat\mu_{DS}$ be the associated estimator. For any square-integrable function $f$, we have
$$\mathrm{Var}(\hat\mu_{DS}) = \sum_{\mathbf{0}\neq\mathbf{h}\in P_n^\perp} |\tilde f(\mathbf{h})|^2, \qquad (20.30)$$
and the variance of the MC estimator based on $n$ points is given by
$$\mathrm{Var}(\hat\mu_{MC}) = \frac{1}{n}\sum_{\mathbf{0}\neq\mathbf{h}} |\tilde f(\mathbf{h})|^2.$$
Proof: Assume $\tilde P_n = \{\mathbf{u}_i \oplus \boldsymbol{\Delta},\ i = 0,\ldots,n-1\}$, where $\boldsymbol{\Delta}$ is uniformly distributed over $[0,1)^s$ and $\oplus$ corresponds to a digit-by-digit addition in $\mathbb{Z}_b$. Then we can write $\hat\mu_{DS} = g(\boldsymbol{\Delta})$, where the function $g$ is defined by
$$g(\boldsymbol{\Delta}) = \frac{1}{n}\sum_{i=0}^{n-1} f(\mathbf{u}_i \oplus \boldsymbol{\Delta}).$$
If $f$ is square-integrable, then $g$ is also square-integrable. In addition, $\mathrm{Var}(\hat\mu_{DS}) = \sum_{\mathbf{0}\neq\mathbf{h}} |\tilde g(\mathbf{h})|^2$, because Parseval’s equality holds for the Walsh series expansion (see Golubov, Efimov, and Skvortsov 1991, for example). Now, for any $\mathbf{h}$ we have
$$\tilde g(\mathbf{h}) = \int_{[0,1)^s} g(\boldsymbol{\Delta})\,\overline{\psi_{\mathbf{h}}(\boldsymbol{\Delta})}\,d\boldsymbol{\Delta}
= \frac{1}{n}\sum_{i=0}^{n-1}\int_{[0,1)^s} f(\mathbf{u}_i\oplus\boldsymbol{\Delta})\,\overline{\psi_{\mathbf{h}}(\boldsymbol{\Delta})}\,d\boldsymbol{\Delta}
= \frac{1}{n}\sum_{i=0}^{n-1}\int_{[0,1)^s} f(\mathbf{v})\,\overline{\psi_{\mathbf{h}}(\mathbf{v}\ominus\mathbf{u}_i)}\,d\mathbf{v}
= \tilde f(\mathbf{h})\,\frac{1}{n}\sum_{i=0}^{n-1}\psi_{\mathbf{h}}(\mathbf{u}_i). \qquad (20.31)$$
In the above display, the third expression is obtained by letting $\mathbf{v} = \mathbf{u}_i\oplus\boldsymbol{\Delta}$, and thus $\boldsymbol{\Delta} = \mathbf{v}\ominus\mathbf{u}_i$, where $\ominus$ denotes a digit-by-digit subtraction in $\mathbb{Z}_b$; the last one uses $\overline{\psi_{\mathbf{h}}(\mathbf{v}\ominus\mathbf{u}_i)} = \overline{\psi_{\mathbf{h}}(\mathbf{v})}\,\psi_{\mathbf{h}}(\mathbf{u}_i)$.
From (20.31) and Lemma 2, the result follows. The variance of the MC estimator is obtained by applying Parseval’s equality.

To see the connection with scrambled-type estimators, we use the following result (proved for $b = 2$ by Lemieux 2000), whose proof is given in the appendix. The result also makes a connection between variance expressions in terms of Haar and Walsh expansions, because $\sigma_{\mathbf{k}}^2$ is defined in terms of Haar coefficients.

Lemma 3 Let $\mathbf{k}$ be a vector of positive integers and $\sigma_{\mathbf{k}}^2$ be defined as in (20.26). If $f$ is square-integrable, then
$$\sigma_{\mathbf{k}}^2 = \sum_{\mathbf{h}\in\mathcal{H}_{\mathbf{k}}} |\tilde f(\mathbf{h})|^2,$$
where $\mathcal{H}_{\mathbf{k}} = \{\mathbf{h} : b^{k_j - 1} \le h_j < b^{k_j}$ if $k_j > 0$, and $h_j = 0$ otherwise$\}$.
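The digital shift that appears in Proposition 4 is particularly cheap to implement: in base 2 it amounts to one bitwise XOR per coordinate per point. A minimal sketch of ours (the two-dimensional net below uses identity and bit-reversal generating matrices, i.e., the Hammersley/van der Corput construction, as a convenient example of a digital net):

```python
import random

W = 32                                # digits of precision in base 2

def bit_reverse(i, k):
    """Reverse the k low-order bits of the integer i."""
    r = 0
    for _ in range(k):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def digital_shift(points, rng):
    """Random digital shift in base 2: XOR the same random W-bit vector
    into every point, coordinate by coordinate."""
    s = len(points[0])
    shift = [rng.getrandbits(W) for _ in range(s)]
    return [[x ^ shift[j] for j, x in enumerate(p)] for p in points]

# A small 2-dimensional digital net in base 2 with n = 2^4 points.
k = 4
net = [[i << (W - k), bit_reverse(i, k) << (W - k)] for i in range(2 ** k)]

rng = random.Random(7)
pts = [[x / 2.0 ** W for x in p] for p in digital_shift(net, rng)]

# The shift preserves the net's equidistribution: each coordinate still
# has exactly one point in every dyadic interval of length 1/16.
for j in range(2):
    assert sorted(int(u[j] * 2 ** k) for u in pts) == list(range(2 ** k))
print(pts[0])
```

Each shifted point is uniformly distributed over $[0,1)^2$, so the estimator remains unbiased, while the equidistribution properties of the net are preserved.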
Hence in the digital shift case, in comparison with MC, the contribution of a basis function to the variance expression is either multiplied by $n$ (if $\mathbf{h}$ is in the dual space) or by 0, whereas in the scrambled case, this contribution is multiplied by 0 for “small” vectors, and otherwise by a factor that can be upper-bounded by a quantity independent of $n$. The fact that this factor can be as large as $n$ in the digital shift case prevents us from bounding $\mathrm{Var}(\hat\mu_{DS})$ by a constant times $\sigma^2/n$. Similarly, the case of smooth functions yields a variance bound in $O(n^{-2}(\log n)^{s-1})$ for digitally shifted estimators, which is larger by a factor of $n$ than the order of the bound obtained for scrambled-type estimators. On the other hand, the shift is a very simple randomization that is easy to implement; the estimator can typically be constructed in the same (or less) time as the MC estimator based on the same number of points. Based on the expression (20.30) for the variance of the digitally shifted estimator, the same type of heuristic arguments as those given for randomly shifted lattice rules can be used to justify selection criteria for choosing digital nets. That is, if we assume that the largest Walsh coefficients are those associated with “small” vectors $\mathbf{h}$, then it is reasonable to choose $P_n$ so that the dual space does not contain those small vectors. If we use the norm $\|\mathbf{h}\|$ defined in (20.20) to measure $\mathbf{h}$, this suggests using a criterion based on the resolution; if instead we use the norm $V(\mathbf{h})$ defined in (20.18), then the corresponding criterion or the variant defined in (20.16) should be employed. We refer the reader to Hellekalek and Leeb (1997) and Tezuka (1987) for additional connections between Walsh expansions and nonuniformity measures (e.g., the so-called “Walsh spectral test” of Tezuka). Note that criteria based on the resolution are faster to compute than those based on $V(\mathbf{h})$,
because the latter require examining a much larger number of vectors $\mathbf{h}$ for a given $n$.

7. Transformations of the Integrand
So far, our description of how to use QMC methods can be summarized as follows: choose a construction and a randomization; choose a selection criterion; find a good point set with respect to this criterion (or use a precomputed table of “good” point sets); randomize the point set, and compute $\hat\mu$ as an estimator for $\mu$. If the selection criterion mimics the variance of $\hat\mu$ well enough, one should obtain a low-variance estimator with this approach. Most of the selection criteria presented in Section 4 are defined so that they imitate, more or less, the variance of $\hat\mu$ for a large class of functions; i.e., they provide “general-purpose” low-discrepancy point sets. However, once the problem at hand is known, the variance can sometimes be reduced further by making use of information on $f$ in a clever way. In particular, techniques used to reduce the MC variance can also be used in combination with QMC methods. Examples of such techniques are antithetic variates, control variates, importance sampling, and conditional Monte Carlo. These methods can all be seen as transformations applied to $f$ in order to reduce its variability; that is, one replaces $f$ by a function $g$ such that $E[g(\mathbf{U})] = \mu$ and (hopefully) $\mathrm{Var}[g(\mathbf{U})] < \mathrm{Var}[f(\mathbf{U})]$. If the function $g$ requires more computation time for its evaluation, one should make sure that the variance reduction gained is worth the extra effort, i.e., that the efficiency is improved. A second class of methods that can reduce the variance of QMC estimators for certain functions are dimension reduction methods. Among these are the Brownian bridge technique of Caflisch and Moskowitz (1995), approaches based on principal components (Acworth, Broadie, and Glasserman 1997; Åkesson and Lehoczy 2000), and various methods discussed by Fox (1999) for generating Poisson and other stochastic processes.
Typically, these methods are used when $f$ is defined in terms of a stochastic process whose sample path is generated using the uniform numbers provided by a point of $P_n$. The goal is then to generate the sample path in a way that decreases the effective dimension of $f$. This is usually achieved by using a method that gives a lot of importance to a few of the uniform numbers. As an illustration, we describe in the example below the Brownian bridge technique, which can be used to generate the sample paths of a Brownian motion.
Example 3 As often happens in financial simulations (see, e.g., Boyle, Broadie, and Glasserman 1997; Caflisch, Morokoff, and Owen 1997), suppose we want to generate the sample path of a Brownian motion $B$ at times $t_1 < \cdots < t_d$ using uniform numbers $u_1,\ldots,u_d$. For instance, this Brownian motion might be driving the price process of an asset on which an option has been written (Duffie 1996). Instead of generating these observations sequentially (that is, by using $u_j$ to generate the Gaussian random variable $B(t_j)$ given $B(t_{j-1})$), $u_1$ is used to generate $B(t_d)$, then $u_2$ is used to generate $B(t_{d/2})$ given $B(t_d)$, and so on. This can be done easily since, for $t_l < t_j < t_r$, the distribution of $B(t_j)$ given $B(t_l)$ and $B(t_r)$ is Gaussian with parameters depending only on these conditioning times and values. By generating the Brownian motion path this way, more importance is given to the first few uniform numbers, since they determine important aspects of the path such as its value at the end, middle, first quarter, etc.

Another type of transformation that can sometimes reduce the actual dimension of the problem (and not only the effective one) is the conditional Monte Carlo method; see L’Ecuyer and Lemieux (2000), Section 10.1, for example, where this method is used with randomly shifted lattice rules. This method can also be viewed as a “smoothing technique”, i.e., it replaces the integrand by a smoother function (in this case, a conditional expectation) that, e.g., satisfies the conditions required to obtain variance convergence rates such as those in Section 6.2. We refer the reader to Spanier and Maize (1994) and Fox (1999), Chapters 7 and 8, for more on smoothing techniques that can be used in combination with QMC methods.
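The bridge construction of Example 3 can be sketched as follows, assuming for simplicity that the number of observation times is a power of 2 (function and variable names are ours). The conditional mean and variance used at each step are the standard Brownian bridge formulas.

```python
import math
import random

def brownian_bridge_path(times, zs):
    """Sample a standard Brownian motion at the given sorted times using
    the Brownian bridge order: zs[0] fixes the terminal value, and later
    N(0,1) variates fill in midpoints conditionally on the endpoints
    already generated. len(times) must be a power of 2 in this sketch."""
    d = len(times)
    assert d & (d - 1) == 0, "this sketch assumes d is a power of 2"
    t = [0.0] + list(times)
    B = [0.0] * (d + 1)
    B[d] = math.sqrt(t[d]) * zs[0]        # B(t_d) ~ N(0, t_d)
    idx, step = 1, d
    while step > 1:
        half = step // 2
        for left in range(0, d, step):
            right, mid = left + step, left + half
            dt_l, dt_r = t[mid] - t[left], t[right] - t[mid]
            # Given B at the endpoints, B at the midpoint is Gaussian:
            mean = B[left] + dt_l / (dt_l + dt_r) * (B[right] - B[left])
            sd = math.sqrt(dt_l * dt_r / (dt_l + dt_r))
            B[mid] = mean + sd * zs[idx]
            idx += 1
        step = half
    return B[1:]

rng = random.Random(3)
times = [(i + 1) / 8.0 for i in range(8)]   # t_j = j/8, j = 1..8
path = brownian_bridge_path(times, [rng.gauss(0.0, 1.0) for _ in range(8)])
print(path)
```

With RQMC, the i.i.d. normals would instead come from the successive coordinates of a randomized point, so that the most important features of the path (terminal value, midpoint, quarter points) depend on the first, best-distributed coordinates.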
8. Related Methods
We now discuss integration methods that are closely related to QMC methods but do not exactly fit the framework presented so far. First, a natural extension of the estimator $\hat\mu$ would be to assign weights to the different evaluation points; that is, for a point set $P_n = \{\mathbf{u}_0,\ldots,\mathbf{u}_{n-1}\}$, define
$$\hat\mu_w = \sum_{i=0}^{n-1} w_i f(\mathbf{u}_i),$$
with $\sum_{i=0}^{n-1} w_i = 1$. Hickernell (1998b) proved that when $P_n$ is a lattice point set, a certain measure of discrepancy, defined so that these weights are accounted for, is minimized by setting the weights all equal to $1/n$. In other words, using weights is useless in this case.
However, if $P_n$ is not restricted to have a particular form, then it can be shown that in some cases allowing different weights brings a significant improvement (Yakowitz, Krimmel, and Szidarovszky 1978). For example, one can use the MC method with weights defined by the Voronoï tessellation induced by the uniform and random points $\mathbf{u}_0,\ldots,\mathbf{u}_{n-1}$; more precisely, one takes $w_i$ equal to the Lebesgue measure of the Voronoï cell of $\mathbf{u}_i$ in $[0,1)^s$. For sufficiently smooth integrands, this approach yields an estimator whose variance converges faster than the MC rate. Weighted approximations also based on Voronoï tessellations are discussed by Pagès (1997).

A closely related idea is used in stratified sampling (Cochran 1977). In this method, the unit hypercube is partitioned into $n$ cells $C_0,\ldots,C_{n-1}$ of equal volume, and $\mathbf{u}_i$ is uniformly distributed over $C_i$, for $i = 0,\ldots,n-1$. The stratified sampling estimator is then
$$\hat\mu_{st} = \frac{1}{n}\sum_{i=0}^{n-1} f(\mathbf{u}_i).$$
It can be shown (Cochran 1977) that for any square-integrable function, $\mathrm{Var}(\hat\mu_{st}) \le \mathrm{Var}(\hat\mu_{MC})$. The amount of variance reduction depends on the definition of the cells and their interaction with the integrand $f$.

An integration method that is guaranteed to yield an estimator with a variance not larger than the MC estimator’s for monotone functions is Latin Hypercube Sampling (LHS) (McKay, Beckman, and Conover 1979). It uses a point set whose unidimensional projections are evenly distributed (i.e., one point per interval $[i/n, (i+1)/n)$, for $i = 0,\ldots,n-1$). To construct this point set, one generates $s$ random, uniform, and independent permutations $\pi_1,\ldots,\pi_s$ of the integers from 0 to $n-1$, and independent shifts $\Delta_{ij}$ uniformly distributed over $[0,1)$. The $j$th coordinate of the $i$th point is then defined as $u_{ij} = (\pi_j(i) + \Delta_{ij})/n$.
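The LHS construction just described can be sketched as follows (a minimal sketch; function and variable names are ours):

```python
import random

def latin_hypercube(n, s, rng):
    """Latin hypercube sample of n points in [0,1)^s: in each coordinate j,
    a random permutation assigns one point to each interval [i/n, (i+1)/n),
    and an independent uniform shift places the point within its interval."""
    perms = []
    for _ in range(s):
        p = list(range(n))
        rng.shuffle(p)
        perms.append(p)
    return [[(perms[j][i] + rng.random()) / n for j in range(s)]
            for i in range(n)]

rng = random.Random(9)
pts = latin_hypercube(16, 3, rng)

# One-dimensional projections are evenly distributed: exactly one point
# per interval [i/16, (i+1)/16) in every coordinate.
for j in range(3):
    assert sorted(int(u[j] * 16) for u in pts) == list(range(16))
print(pts[0])
```

Each point taken individually is uniform over $[0,1)^s$, so the estimator built from these points is unbiased; the stratified one-dimensional projections are what produce the variance reduction.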
Additional results on this method can be found in, e.g., Avramidis and Wilson (1996), Owen (1998a), and the references therein. In particular, see Owen (1992a) and Loh (1996b) for results showing that the LHS estimator obeys a central-limit theorem. A related method that extends
the uniformity property of LHS and has close connections with digital nets is randomized orthogonal array sampling; we refer the reader to Owen (1992b, 1994, 1995, 1997) and Loh (1996a) for more on this method.
9. Conclusions and Discussion
We have described various QMC constructions that can be used for multidimensional numerical integration, and presented measures of quality that can help in selecting parameters for a given construction. We also discussed different randomizations, and provided results on the variance of estimators obtained by applying these randomizations. In particular, we gave a new result that expresses the variance of an estimator based on a digitally shifted net as a sum of squared Walsh coefficients over the dual space of the net. In the near future, we plan to compare empirically various constructions and randomizations on practical problems, to study selection criteria and compare their effectiveness, and to investigate in more detail the effect of transformations, such as those discussed in Section 7, on the variance of randomized QMC estimators.
Appendix: Proofs

Proof of Proposition 1: The result is obtained by first generalizing Proposition 5.2 of Lemieux and L’Ecuyer (2001) to arbitrary digital nets in base $b$. This can be done by using Lemma 2 from Section 6.2. More precisely, we show that $P_n$ is equidistributed if and only if the corresponding vectors are excluded from the dual space; this is condition (20.A.1). Consider the class of all real-valued functions that are constant on each of the boxes in the definition of equidistribution. Clearly, $P_n$ is equidistributed if and only if the corresponding point set integrates every such function with zero error. But due to its periodic structure, each such function has a Walsh expansion of the form (20.A.2), whose coefficients
are real numbers, and where $\varphi$ is the bijection between the two representations mentioned in Section 6.2.2. When a term of the expansion is nonzero, there exist a coordinate and an integer identifying the digits on which it depends; recall that the coefficients involved come from the representations of $\mathbf{h}$ and of $i$, respectively. When $i$ goes from 0 to $n-1$, the corresponding dot product is equal to each number between 0 and $b-1$ equally often. Hence, if we first integrate with respect to that coordinate when computing the Walsh coefficient via (20.29), any such term from the sum (20.A.2) will be 0, and (20.A.1) follows. Now, for any nonzero $\mathbf{h}$ of the required form, the basis function $\psi_{\mathbf{h}}$ belongs to the class considered above. Hence, by Proposition 4, the error obtained by using $P_n$ to integrate $\psi_{\mathbf{h}}$ is zero when $\mathbf{h}\notin P_n^\perp$, since the only nonzero Walsh coefficient of $\psi_{\mathbf{h}}$ is the one evaluated in $\mathbf{h}$ (and it is equal to 1). From this, we see that if $P_n$ is equidistributed, then the corresponding vectors cannot lie in $P_n^\perp$; hence if $P_n$ has a resolution of $\ell$, then it is equidistributed at that resolution, and one direction of the equivalence follows. We now show the other direction, which will prove the result. Since the resolution is $\ell$, the point set is not equidistributed at resolution $\ell + 1$. Therefore, the matrix $L$ formed by concatenating the transposes of the first $\ell + 1$ rows of each generating matrix has a row space whose dimension is strictly smaller than the maximum possible. Hence there exists a nonzero vector $\mathbf{x}$ orthogonal to this row space; furthermore, its leading coefficients cannot all be zero, since that would contradict our assumption that $P_n$ has a resolution of $\ell$. Define $\mathbf{h}$ from $\mathbf{x}$ accordingly. Since $L$ is just a truncated version of $C$, and the coefficients of $\mathbf{h}$ for higher powers of $b$ are zero for all $j$, we have that $C\mathbf{h} = \mathbf{0}$, and therefore $\mathbf{h}\in P_n^\perp$, which proves the result.

Proof of Lemma 3: Recall that $\sigma_{\mathbf{k}}^2$ is the variance of $f_{\mathbf{k}}$,
where $f_{\mathbf{k}}$ is a step function constant on the boxes obtained by partitioning the $j$th axis of $[0,1)^s$ into $b^{k_j}$ equal intervals, for each $j$. Using the same notation as for the preceding proof, $f_{\mathbf{k}}$ has a Walsh expansion. Hence, from the proof of Proposition 1, we know which of its Walsh coefficients can be nonzero. Assume now that there exists one coordinate whose index falls outside the range associated with $\mathbf{k}$; we need to verify that the corresponding Walsh coefficient vanishes. Now,
since the function $f_{\mathbf{k}}$ is constant over any interval of the form determined by the partition, while the basis function oscillates within such an interval, (20.A.3) is equal to zero and the result is proved. To show (20.A.4), it suffices to observe that the same cancellation argument applies for any such coordinate, which proves the result.
Acknowledgments This work was supported by NSERC-Canada individual grants to the two authors and by an FCAR-Québec grant to the first author. We thank Bennett L. Fox, Fred J. Hickernell, Harald Niederreiter, and Art B. Owen for helpful comments and suggestions.
References

Acworth, P., M. Broadie, and P. Glasserman. (1997). A comparison of some Monte Carlo and quasi-Monte Carlo techniques for option pricing. In Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, ed. P. Hellekalek and H. Niederreiter, Number 127 in Lecture Notes in Statistics, 1–18. Springer-Verlag. Åkesson, F., and J. P. Lehoczy. (2000). Path generation for quasi-Monte Carlo simulation of mortgage-backed securities. Management Science 46:1171–1187. Antonov, I. A., and V. M. Saleev. (1979). An economic method of computing LPτ-sequences. Zh. Vychisl. Mat. Mat. Fiz. 19:243–245. In Russian. Avramidis, A. N., and J. R. Wilson. (1996). Integrated variance reduction strategies for simulation. Operations Research 44:327–346.
Bakhvalov, N. S. (1959). On approximate calculation of multiple integrals. Vestnik Moskovskogo Universiteta, Seriya Matematiki, Mehaniki, Astronomi, Fiziki, Himii 4:3–18. In Russian. Boyle, P., M. Broadie, and P. Glasserman. (1997). Monte Carlo methods for security pricing. Journal of Economic Dynamics & Control 21 (8-9): 1267–1321. Computational financial modelling. Braaten, E., and G. Weller. (1979). An improved low-discrepancy sequence for multidimensional quasi-Monte Carlo integration. Journal of Computational Physics 33:249–258. Bratley, P., and B. L. Fox. (1988). Algorithm 659: Implementing Sobol’s quasirandom sequence generator. ACM Transactions on Mathematical Software 14 (1): 88–100. Bratley, P., B. L. Fox, and H. Niederreiter. (1992). Implementation and tests of low-discrepancy sequences. ACM Transactions on Modeling and Computer Simulation 2:195–213. Bratley, P., B. L. Fox, and H. Niederreiter. (1994). Algorithm 738: Programs to generate Niederreiter’s low-discrepancy sequences. ACM Transactions on Mathematical Software 20:494–495. Caflisch, R. E., W. Morokoff, and A. Owen. (1997). Valuation of mortgage-backed securities using Brownian bridges to reduce effective dimension. The Journal of Computational Finance 1 (1): 27–46. Caflisch, R. E., and B. Moskowitz. (1995). Modified Monte Carlo methods using quasi-random sequences. In Monte Carlo and QuasiMonte Carlo Methods in Scientific Computing, ed. H. Niederreiter and P. J.-S. Shiue, Number 106 in Lecture Notes in Statistics, 1–16. New York: Springer-Verlag. Cheng, J., and M. J. Druzdzel. (2000). Computational investigation of low-discrepancy sequences in simulation algorithms for bayesian networks. In Uncertainty in Artificial Intelligence Proceedings 2000, 72–81. Cochran, W. G. (1977). Sampling techniques. Second ed. New York: John Wiley and Sons. Conway, J. H., and N. J. A. Sloane. (1999). Sphere packings, lattices and groups. 3rd ed. Grundlehren der Mathematischen Wissenschaften 290. 
New York: Springer-Verlag. Couture, R., and P. L’Ecuyer. (2000). Lattice computations for random numbers. Mathematics of Computation 69 (230): 757–765. Couture, R., P. L’Ecuyer, and S. Tezuka. (1993). On the distribution of k-dimensional vectors for simple and combined Tausworthe sequences. Mathematics of Computation 60 (202): 749–761, S11–S16. Coveyou, R. R., and R. D. MacPherson. (1967). Fourier analysis of uniform random number generators. Journal of the ACM 14:100–119.
Cranley, R., and T. N. L. Patterson. (1976). Randomization of number theoretic methods for multiple integration. SIAM Journal on Numerical Analysis 13 (6): 904–914. Davis, P., and P. Rabinowitz. (1984). Methods of numerical integration. Second ed. New York: Academic Press. Dieter, U. (1975). How to calculate shortest vectors in a lattice. Mathematics of Computation 29 (131): 827–833. Duffie, D. (1996). Dynamic asset pricing theory. Second ed. Princeton University Press. Efron, B., and C. Stein. (1981). The jackknife estimator of variance. Annals of Statistics 9:586–596. Entacher, K. (1997). Quasi-Monte Carlo methods for numerical integration of multivariate Haar series. BIT 37:846–861. Entacher, K., P. Hellekalek, and P. L’Ecuyer. (2000). Quasi-Monte Carlo node sets from linear congruential generators. In Monte Carlo and Quasi-Monte Carlo Methods 1998, ed. H. Niederreiter and J. Spanier, 188–198. Berlin: Springer. Faure, H. (1982). Discrépance de suites associées à un système de numération (en dimension s). Acta Arithmetica 41:337–351. Faure, H. (2001). Variations on (0,s)-sequences. Journal of Complexity. To appear. Faure, H., and S. Tezuka. (2001). A new generation of To appear. Fincke, U., and M. Pohst. (1985). Improved methods for calculating vectors of short length in a lattice, including a complexity analysis. Mathematics of Computation 44:463–471. Fishman, G. S. (1990), Jan. Multiplicative congruential random number generators with modulus 2^β: An exhaustive analysis for β = 32 and a partial analysis for β = 48. Mathematics of Computation 54 (189): 331–344. Fishman, G. S., and L. S. Moore III. (1986). An exhaustive analysis of multiplicative congruential random number generators with modulus 2^31 − 1. SIAM Journal on Scientific and Statistical Computing 7 (1): 24–45. Fox, B. L. (1986). Implementation and relative efficiency of quasirandom sequence generators. ACM Transactions on Mathematical Software 12:362–376. Fox, B. L. (1999). Strategies for quasi-Monte Carlo. Boston, MA: Kluwer Academic. Friedel, I., and A. Keller.
(2001). Fast generation of randomized low-discrepancy point sets. In Monte Carlo and Quasi-Monte Carlo Methods 2000, ed. K.-T. Fang, F. J. Hickernell, and H. Niederreiter: Springer. To appear.
Golubov, B., A. Efimov, and V. Skvortsov. (1991). Walsh series and transforms: Theory and applications, Volume 64 of Mathematics and Applications: Soviet Series. Boston: Kluwer Academic Publishers. Haber, S. (1983). Parameters for integrating periodic functions of several variables. Mathematics of Computation 41:115–129. Halton, J. H. (1960). On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik 2:84–90. Heinrich, S., F. J. Hickernell, and R.-X. Yue. (2001a). Integration of multivariate Haar wavelet series. Submitted. Heinrich, S., F. J. Hickernell, and R. X. Yue. (2001b). Optimal quadrature for Haar wavelet spaces. submitted. Hellekalek, P. (1998). On the assessment of random and quasirandom point sets. In Random and Quasi-Random Point Sets, ed. P. Hellekalek and G. Larcher, Volume 138 of Lecture Notes in Statistics, 49–108. New York: Springer. Hellekalek, P., and G. Larcher. (Eds.) (1998). Random and quasi-random point sets, Volume 138 of Lecture Notes in Statistics. New York: Springer. Hellekalek, P., and H. Leeb. (1997). Dyadic diaphony. Acta Arithmetica 80:187–196. Hickernell, F. J. (1998a). A generalized discrepancy and quadrature error bound. Mathematics of Computation 67:299–322. Hickernell, F. J. (1998b). Lattice rules: How well do they measure up? In Random and Quasi-Random Point Sets, ed. P. Hellekalek and G. Larcher, Volume 138 of Lecture Notes in Statistics, 109–166. New York: Springer. Hickernell, F. J. (1999). Goodness-of-fit statistics, discrepancies and robust designs. Statistical and Probability Letters 44:73–78. Hickernell, F. J. (2000). What affects accuracy of quasi-Monte Carlo quadrature? In Monte Carlo and Quasi-Monte Carlo Methods 1998, ed. H. Niederreiter and J. Spanier, 16–55. Berlin: Springer. Hickernell, F. J., and H. S. Hong. (1997). Computing multivariate normal probabilities using rank-1 lattice sequences. 
In Proceedings of the Workshop on Scientific Computing (Hong Kong), ed. G. H. Golub, S. H. Lui, F. T. Luk, and R. J. Plemmons, 209–215. Singapore: Springer-Verlag. Hickernell, F. J., H. S. Hong, P. L’Ecuyer, and C. Lemieux. (2001). Extensible lattice sequences for quasi-Monte Carlo quadrature. SIAM Journal on Scientific Computing 22 (3): 1117–1138. Hickernell, F. J., and H. Woźniakowski. (2001). The price of pessimism for multidimensional quadrature. Journal of Complexity 17. To appear.
Hlawka, E. (1961). Funktionen von beschränkter Variation in der Theorie der Gleichverteilung. Ann. Mat. Pura Appl. 54:325–333. Hlawka, E. (1962). Zur angenäherten Berechnung mehrfacher Integrale. Monatshefte für Mathematik 66:140–151. Hoeffding, W. (1948). A class of statistics with asymptotically normal distributions. Annals of Mathematical Statistics 19:293–325. Hong, H. S., and F. J. Hickernell. (2001). Implementing scrambled digital sequences. Submitted for publication. Knuth, D. E. (1998). The art of computer programming, volume 2: Seminumerical algorithms. Third ed. Reading, Mass.: Addison-Wesley. Korobov, N. M. (1959). The approximate computation of multiple integrals. Dokl. Akad. Nauk SSSR 124:1207–1210. In Russian. Korobov, N. M. (1960). Properties and calculation of optimal coefficients. Dokl. Akad. Nauk SSSR 132:1009–1012. In Russian. Larcher, G. (1998). Digital point sets: Analysis and applications. In Random and Quasi-Random Point Sets, ed. P. Hellekalek and G. Larcher, Volume 138 of Lecture Notes in Statistics, 167–222. New York: Springer. Larcher, G., A. Lauss, H. Niederreiter, and W. C. Schmid. (1996). Optimal polynomials for (t,m,s)-nets and numerical integration of multivariate Walsh series. SIAM Journal on Numerical Analysis 33 (6): 2239–2253. Larcher, G., H. Niederreiter, and W. C. Schmid. (1996). Digital nets and sequences constructed over finite rings and their application to quasi-Monte Carlo integration. Monatshefte für Mathematik 121 (3): 231–253. Larcher, G., and G. Pirsic. (1999). Base change problems for generalized Walsh series and multivariate numerical integration. Pacific Journal of Mathematics 189:75–105. L’Ecuyer, P. (1994). Uniform random number generation. Annals of Operations Research 53:77–120. L’Ecuyer, P. (1996). Maximally equidistributed combined Tausworthe generators. Mathematics of Computation 65 (213): 203–213. L’Ecuyer, P. (1999). Tables of linear congruential generators of different sizes and good lattice structure.
Mathematics of Computation 68 (225): 249–260. L’Ecuyer, P., and R. Couture. (1997). An implementation of the lattice and spectral tests for multiple recursive linear random number generators. INFORMS Journal on Computing 9 (2): 206–217. L’Ecuyer, P., and C. Lemieux. (2000). Variance reduction via lattice rules. Management Science 46 (9): 1214–1235.
REFERENCES
L’Ecuyer, P., and F. Panneton. (2000). A new class of linear feedback shift register generators. In Proceedings of the 2000 Winter Simulation Conference, ed. J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick, 690–696. Piscataway, NJ: IEEE Press. Lemieux, C. (2000). L’utilisation de règles de réseau en simulation comme technique de réduction de la variance. Ph. D. thesis, Université de Montréal. Lemieux, C., M. Cieslak, and K. Luttmer. (2001). RandQMC user’s guide. In preparation. Lemieux, C., and P. L’Ecuyer. (2001). Selection criteria for lattice rules and other low-discrepancy point sets. Mathematics and Computers in Simulation 55 (1–3): 139–148. Lemieux, C., and A. B. Owen. (2001). Quasi-regression and the relative importance of the ANOVA components of a function. In Monte Carlo and Quasi-Monte Carlo Methods 2000, ed. K.-T. Fang, F. J. Hickernell, and H. Niederreiter: Springer. To appear. Lidl, R., and H. Niederreiter. (1994). Introduction to finite fields and their applications. Revised ed. Cambridge: Cambridge University Press. Loh, W.-L. (1996a). A combinatorial central limit theorem for randomized orthogonal array sampling designs. Annals of Statistics 24:1209–1224. Loh, W.-L. (1996b). On Latin hypercube sampling. The Annals of Statistics 24:2058–2080. Maisonneuve, D. (1972). Recherche et utilisation des “bons treillis”, programmation et résultats numériques. In Applications of Number Theory to Numerical Analysis, ed. S. K. Zaremba, 121–201. New York: Academic Press. Maize, E. (1981). Contributions to the theory of error reduction in quasi-Monte Carlo methods. Ph. D. thesis, Claremont Graduate School, Claremont, CA. Matoušek, J. (1998). On the L2-discrepancy for anchored boxes. Journal of Complexity 14:527–556. Matsumoto, M., and Y. Kurita. (1994). Twisted GFSR generators II. ACM Transactions on Modeling and Computer Simulation 4 (3): 254–266. Matsumoto, M., and T. Nishimura. (1998).
Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation 8 (1): 3–30. McKay, M. D., R. J. Beckman, and W. J. Conover. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21:239–245.
Morohosi, H., and M. Fushimi. (2000). A practical approach to the error estimation of quasi-Monte Carlo integration. In Monte Carlo and Quasi-Monte Carlo Methods 1998, ed. H. Niederreiter and J. Spanier, 377–390. Berlin: Springer. Morokoff, W. J., and R. E. Caflisch. (1994). Quasi-random sequences and their discrepancies. SIAM Journal on Scientific Computing 15:1251–1279. Niederreiter, H. (1986). Multidimensional numerical integration using pseudorandom numbers. Mathematical Programming Study 27:17–38. Niederreiter, H. (1987). Point sets and sequences with small discrepancy. Monatshefte für Mathematik 104:273–337. Niederreiter, H. (1988). Low-discrepancy and low-dispersion sequences. Journal of Number Theory 30:51–70. Niederreiter, H. (1992a). Low-discrepancy point sets obtained by digital constructions over finite fields. Czechoslovak Math. Journal 42:143–166. Niederreiter, H. (1992b). Random number generation and quasi-Monte Carlo methods, Volume 63 of SIAM CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM. Niederreiter, H., and G. Pirsic. (2001). Duality for digital nets and its applications. Acta Arithmetica 97:173–182. Niederreiter, H., and C. Xing. (1997). The algebraic-geometry approach to low-discrepancy sequences. In Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, ed. P. Hellekalek, G. Larcher, H. Niederreiter, and P. Zinterhof, Volume 127 of Lecture Notes in Statistics, 139–160. New York: Springer-Verlag. Niederreiter, H., and C. Xing. (1998). Nets, (t,s)-sequences, and algebraic geometry. In Random and Quasi-Random Point Sets, ed. P. Hellekalek and G. Larcher, Volume 138 of Lecture Notes in Statistics, 267–302. New York: Springer. Ökten, G. (1996). A probabilistic result on the discrepancy of a hybrid-Monte Carlo sequence and applications. Monte Carlo Methods and Applications 2:255–270. Owen, A. B. (1992a). A central limit theorem for Latin hypercube sampling. Journal of the Royal Statistical Society B 54 (2): 541–551.
Owen, A. B. (1992b). Orthogonal arrays for computer experiments, integration and visualization. Statistica Sinica 2:439–452. Owen, A. B. (1994). Lattice sampling revisited: Monte Carlo variance of means over randomized orthogonal arrays. Annals of Statistics 22:930–945. Owen, A. B. (1995). Randomly permuted (t,m,s)-nets and (t,s)-sequences. In Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, ed. H. Niederreiter and P. J.-S. Shiue, Number 106 in Lecture Notes in Statistics, 299–317. Springer-Verlag. Owen, A. B. (1997). Monte Carlo variance of scrambled equidistribution quadrature. SIAM Journal on Numerical Analysis 34 (5): 1884–1910. Owen, A. B. (1998a). Latin supercube sampling for very high-dimensional simulations. ACM Transactions on Modeling and Computer Simulation 8 (1): 71–102. Owen, A. B. (1998b). Scrambling Sobol and Niederreiter-Xing points. Journal of Complexity 14:466–489. Pagès, G. (1997). A space quantization method for numerical integration. Journal of Computational and Applied Mathematics 89:1–38. Paskov, S., and J. Traub. (1995). Faster valuation of financial derivatives. Journal of Portfolio Management 22:113–120. Pirsic, G. (2001). A software implementation of Niederreiter-Xing sequences. In Monte Carlo and Quasi-Monte Carlo Methods 2000, ed. K.-T. Fang, F. J. Hickernell, and H. Niederreiter: Springer. To appear. Pirsic, G., and W. C. Schmid. (2001). Calculation of the quality parameter of digital nets and application to their construction. Journal of Complexity. To appear. Sloan, I. H., and S. Joe. (1994). Lattice methods for multiple integration. Oxford: Clarendon Press. Sloan, I. H., and L. Walsh. (1990). A computer search of rank-2 lattice rules for multidimensional quadrature. Mathematics of Computation 54:281–302. Sobol’, I. M. (1967). The distribution of points in a cube and the approximate evaluation of integrals. U.S.S.R. Comput. Math. and Math. Phys. 7:86–112. Sobol’, I. M. (1969). Multidimensional quadrature formulas and Haar functions. Moscow: Nauka. In Russian. Sobol’, I. M. (1976). Uniformly distributed sequences with an additional uniform property. USSR Comput. Math. Math. Phys. 16:236–242. Sobol’, I. M., and Y. L. Levitan. (1976). The production of points uniformly distributed in a multidimensional cube.
Technical Report Preprint 40, Institute of Applied Mathematics, USSR Academy of Sciences. In Russian. Spanier, J. (1995). Quasi-Monte Carlo methods for particle transport problems. In Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, ed. H. Niederreiter and P. J.-S. Shiue, Volume 106 of Lecture Notes in Statistics, 121–148. New York: Springer-Verlag.
Spanier, J., and E. H. Maize. (1994). Quasi-random methods for estimating integrals using relatively small samples. SIAM Review 36:18–44. Tan, K. S., and P. P. Boyle. (2000). Applications of randomized low discrepancy sequences to the valuation of complex securities. Journal of Economic Dynamics and Control 24:1747–1782. Tausworthe, R. C. (1965). Random numbers generated by linear recurrence modulo two. Mathematics of Computation 19:201–209. Tezuka, S. (1987). Walsh-spectral test for GFSR pseudorandom numbers. Communications of the ACM 30 (8): 731–735. Tezuka, S. (1995). Uniform random numbers: Theory and practice. Norwell, Mass.: Kluwer Academic Publishers. Tezuka, S., and P. L’Ecuyer. (1991). Efficient and portable combined Tausworthe random number generators. ACM Transactions on Modeling and Computer Simulation 1 (2): 99–112. Tezuka, S., and T. Tokuyama. (1994). A note on polynomial arithmetic analogue of Halton sequences. ACM Transactions on Modeling and Computer Simulation 4:279–284. Tootill, J. P. R., W. D. Robinson, and D. J. Eagle. (1973). An asymptotically random Tausworthe sequence. Journal of the ACM 20:469– 481. Tuffin, B. (1996). On the use of low-discrepancy sequences in Monte Carlo methods. Technical Report No. 1060, I.R.I.S.A., Rennes, France. Tuffin, B. (1998). Variance reduction order using good lattice points in Monte Carlo methods. Computing 61:371–378. Wang, D., and A. Compagner. (1993). On the use of reducible polynomials as random number generators. Mathematics of Computation 60:363–374. Wang, X., and F. J. Hickernell. (2000). Randomized Halton sequences. Math. Comput. Modelling 32:887–899. Yakowitz, S., J. E. Krimmel, and F. Szidarovszky. (1978). Weighted Monte Carlo integration. SIAM Journal on Numerical Analysis 15:1289–1300. Yakowitz, S., P. L’Ecuyer, and F. Vázquez-Abad. (2000). Global stochastic optimization with low-discrepancy point sets. Operations Research 48 (6): 939–950.
Part VI

Chapter 21

SINGULARLY PERTURBED MARKOV CHAINS AND APPLICATIONS TO LARGE-SCALE SYSTEMS UNDER UNCERTAINTY

G. Yin
Department of Mathematics
Wayne State University
Detroit, MI 48202
[email protected]
Q. Zhang Department of Mathematics University of Georgia Athens, GA 30602
[email protected]
K. Yin and H. Yang Department of Wood and Paper Science University of Minnesota St. Paul, MN 55108 kyin @ crn.umn.edu, hyang @ ece.umn.edu
Abstract
This chapter is concerned with large-scale hybrid stochastic systems, in which the dynamics involve both continuously evolving components and discrete events. Corresponding to different discrete states, the dynamic behavior of the underlying system could be markedly different. To reduce the complexity of these systems, singularly perturbed Markov chains are used to characterize the system. Asymptotic expansions of probability vectors and the structural properties of these Markov chains are provided. The ideas of decomposition and aggregation are presented using two typical optimal control problems. Such an approach leads to control policies that are simple to obtain and perform nearly as well as the optimal ones with substantially reduced complexity.
Key words. singular perturbation, Markov chain, near optimality, optimal control, LQG, MDP.
1. INTRODUCTION
In memory of our distinguished colleague and dear friend Sidney Yakowitz, who made significant contributions to mathematics, control and systems theory, and operations research, we write this chapter to celebrate his lifetime achievements and to survey some of the most recent developments in singularly perturbed Markov chains and their applications to control and optimization of large-scale systems under uncertainty. These developments are related to Sid's work on automatic learning, adaptive control, and nonparametric theory (see Lai and Yakowitz (1995); Yakowitz (1969); Yakowitz et al. (1992); Yakowitz et al. (2000) and the references therein). Many adaptive control and learning problems give rise to Markov decision processes. In solving such problems, one often has to face the curse of dimensionality. The singular perturbation approach is an effort in the direction of complexity reduction. Our study is motivated by the desire to solve numerous control and optimization problems in engineering, operations research, management, biology, and the physical sciences. To show why Markovian models are useful and preferable, we begin with the well-known Leontief model, a dynamic system of a multi-sector economy (see, for example, Kendrick (1972)). The classical formulation is as follows. Let there be $n$ sectors, let $x_i(t)$ denote the output of sector $i$ at time $t$, and let $d_i(t)$ denote the demand for the product of sector $i$ at time $t$. Denote $x(t) = (x_1(t), \ldots, x_n(t))'$ and $d(t) = (d_1(t), \ldots, d_n(t))'$, let $a_{ij}$ be the amount of commodity $i$ that sector $j$ needs in production and $b_{ij}$ the proportion of commodity $i$ that is transferred to commodity $j$, and write $A = (a_{ij})$ and $B = (b_{ij})$. The matrix $B$ is termed a Leontief input-output matrix. The Leontief dynamic model is given by

$$x(t) = A x(t) + B \dot x(t) + d(t).$$
In the classical Leontief model, the coefficients are fixed. Nevertheless, in reality, more often than not, not only are $A$, $B$, and $D$ time varying, but they are also subject to discrete shifts in regime: episodes across which the behavior of the corresponding dynamic system is markedly different. As a result, a promising alternative to the traditional model is to allow sudden, discrete changes in the values of the parameters, which leads to a “hybrid” or “switching
model” governed by a Markov chain:

$$x(t) = A(\alpha(t))\, x(t) + B(\alpha(t))\, \dot x(t) + d(t),$$
where $\alpha(t)$ is a continuous-time Markov chain (see Yin and Zhang (2001) for more details). In fact, the use of Markovian models has assumed a prominent role in time series analysis, financial engineering, and economic systems (see Hamilton and Susmel (1994); Hansen (1992) and the references therein). In addition to the modeling point mentioned above, many systems arising from communication, manufacturing, reliability, and queueing networks, among others, exhibit jump discontinuities in their sample paths. A common practice is to resort to Markovian jump processes in modeling and optimization. This chapter is devoted to such Markovian models having large state spaces with complex structures, which frequently appear in various applications and which may cause serious obstacles in obtaining optimal controls for the underlying systems. The traditional dynamic programming approach for obtaining optimal controls does not work well for such systems. The large size that renders computation infeasible is known as the “curse of dimensionality.” Owing to the pervasive applications of Markovian formulations in numerous areas, there has been resurgent interest in further exploring various properties of Markov chains. Hierarchical structure, a feature common to many systems of practical concern, has proved very useful for reducing complexity. As pointed out by Simon and Ando (1961), all systems in the real world have a certain hierarchy. Therefore it is natural to use the ideas of decomposition and aggregation for complexity reduction. Because the transitions (switches or jumps) among various states of a large-scale system often occur at different rates, the decomposition and aggregation of the states of the corresponding Markov chain can be achieved according to their rates of change. Taking advantage of the hierarchical structure, the first step is to divide a large task into smaller pieces.
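A toy simulation can make the switching model above concrete. The sketch below, in Python, simulates a scalar Markov-modulated linear system $dx/dt = a(\alpha)x + d(\alpha)$ with a two-state regime chain; the generator, the coefficients, and the Euler step size are illustrative assumptions, not values taken from this chapter.

```python
import random

# Hypothetical two-regime switching linear system: the regime alpha(t) is a
# continuous-time Markov chain with generator Q; within regime i the scalar
# state evolves as dx/dt = a[i] * x + d[i].  All numbers are illustrative.
random.seed(0)

Q = [[-0.5, 0.5],
     [ 1.0, -1.0]]       # generator: nonnegative off-diagonals, zero row sums
a = [-0.2, -1.0]         # regime-dependent drift coefficients
d = [ 1.0,  0.2]         # regime-dependent forcing (e.g., demand)

def simulate(T=20.0, dt=1e-3):
    x, alpha, t = 0.0, 0, 0.0
    hold = random.expovariate(-Q[alpha][alpha])   # exponential holding time
    while t < T:
        if hold <= 0.0:                           # regime switch occurs
            alpha = 1 - alpha                     # only two regimes here
            hold = random.expovariate(-Q[alpha][alpha])
        x += dt * (a[alpha] * x + d[alpha])       # Euler step within regime
        t += dt
        hold -= dt
    return x

print(simulate())   # wanders between the regime equilibria 5.0 and 0.2
```

Between switches the path relaxes toward the current regime's equilibrium $-d_i/a_i$, so the sample path hops between markedly different behaviors, exactly as the “hybrid” description suggests.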
The subsequent decompositions and aggregations will lead to a simpler version of the originally formidable system. Such an idea has been applied to queueing networks for resource organization, to computer systems for memory-level aggregation, to economic models for complexity reduction (see Courtois (1977); Simon and Ando (1961)), and to manufacturing systems for production planning (see Sethi and Zhang (1994)). Owing to their different time scales, the problems fit reasonably well into the framework of singular perturbation in control and optimization; see Abbad et al. (1992); Gershwin (1994); Pan and Başar (1995); Sethi and Zhang (1994); Yin and Zhang (1997b); Yin and Zhang (1998) and the references therein. Related work on singularly perturbed Markov chains can be found in Di Masi et al. (1995); Pervozvanskii and Gaitsgori (1988); Phillips and Kokotovic (1981), among others. In Khasminskii et al. (1996), we analyzed singularly perturbed systems with an approach combining matched asymptotic expansion techniques and Markov properties. This combined approach enables us to obtain the desired asymptotic expansions of the probability vector and transition probabilities. We have furthered our understanding by considering singularly perturbed chains under weak and strong interactions (Khasminskii et al., 1997). This line of research has been carried forward substantially to include asymptotic normality, exponential error bounds, Markov chains with countable state spaces, and the associated semigroups (see Yin and Zhang (1998)), which are the cornerstones for an in-depth investigation of various asymptotic properties and which are useful for many applications. Studies of the structural properties of the underlying singularly perturbed Markov chains and other properties of related occupation measures are in Zhang and Yin (1996); Zhang and Yin (1997) and Yin et al. (2000a); Yin et al. (2000b); Yin et al. (2000c). The asymptotically optimal and nearly optimal controls of dynamic systems driven by singularly perturbed Markov chains have been examined in Yin and Zhang (1997a); Zhang et al. (1997); Zhang and Yin (1999). Numerical methods and a simulation study can be found in Yang et al. (2001). These results have been extended to discrete-time models in Yin and Zhang (2000); see also Liu et al. (2001); Yang et al. (2001). To proceed, we present an illustrative example of a production planning model for a failure-prone manufacturing system. The system consists of a single machine whose production capacity is modeled by a finite-state continuous-time Markov chain $\alpha(t)$ having a generator $Q$ and taking values in $\mathcal M = \{1, \ldots, m\}$. For example, 1 represents the machine being down, $m$ represents full-capacity operation, and the additional states represent different machine capacities in between.
Let $x(t)$, $u(t)$, and $z(t)$ denote the surplus, the rate of production, and the rate of demand, respectively. Here $z(t)$ can be either deterministic or stochastic. The system equation is

$$\dot x(t) = u(t) - z(t), \quad x(0) = x.$$
Our objective is to choose the production rate $u(\cdot)$, subject to the constraint that $u(t)$ not exceed the capacity determined by $\alpha(t)$, so as to minimize the infinite-horizon discounted cost function

$$J = E \int_0^\infty e^{-\rho t}\, G(x(t), u(t))\, dt,$$
where $\rho > 0$ is a discount factor and $G(\cdot)$ is a running cost function. Using the dynamic programming approach (Fleming and Rishel, 1975), for each capacity state we can write the differential equation (1.3) satisfied by the value function (the optimal value of the cost);
see Sethi and Zhang (1994) and Yin and Zhang (1998), Appendix, for details. Even if the demand is constant, it is difficult to obtain a closed-form solution of the optimal control, not to mention the added difficulty due to a possibly large state space (i.e., $m$ being a large number), since finding the optimal control by dynamic programming requires solving $m$ equations of the form (1.3). To overcome the difficulty, we introduce a small parameter $\varepsilon > 0$ and assume that the Markov chain is generated by $Q^\varepsilon = Q/\varepsilon$, where $Q$ is an irreducible generator of a continuous-time Markov chain. Using a singular perturbation approach, we can derive a limit problem in which the stochastic capacity is replaced by its average with respect to the stationary measure. The limit problem is in fact deterministic and is much simpler to solve. Intuitively, a higher-level manager in a manufacturing firm need not know every detail of floor events; an averaged overview suffices for upper-level decision making; see Sethi and Zhang (1994) for more discussion along this line. Mathematically, for sufficiently small $\varepsilon$, the problem can be approximated by a limit control problem with dynamics
in which the capacity process is averaged out; here $\nu = (\nu_1, \ldots, \nu_m)$ is the stationary distribution (or steady-state probabilities) of the Markov chain generated by $Q$, satisfying $\nu Q = 0$ and $\sum_{i=1}^m \nu_i = 1$. In the cost function, the expectation can be replaced by the average with respect to the stationary distribution $\nu$,
and the associated value function satisfies a single equation.

As can be seen from the above example, the essence is that, by using a small parameter $\varepsilon$ and letting $\varepsilon \to 0$, the study of the detailed variation
is replaced by its corresponding “average” with respect to the stationary distribution. In what follows, we will consider more complex systems. However, even for the above seemingly simple example, the averaging approach reduces the complexity noticeably. Note that the small parameter may not be present in the original system. To facilitate the analysis and hierarchical decomposition, we introduce it into the system. How small does $\varepsilon$ have to be? In applications, constants such as 0.1 or even 0.5 could be considered small if all other coefficients are of the order of magnitude 1. Only the relative order of magnitude matters. The mathematical results to be presented herein serve as guidelines for various approximations and for the estimation of error bounds. The asymptotic results for the underlying system (as $\varepsilon \to 0$) provide insights into the structure of the system and heuristics for applications. A thorough understanding of the intrinsic behavior of the systems is instructive and beneficial to explorations of the rich diversity of applications in hierarchical production planning, Markov decision processes, random evolution, and control and optimization of stochastic dynamic systems involving singularly perturbed Markov chains. Since the subject discussed in this chapter is at the intersection of singular perturbation and stochastic processes, our approaches consist of both analytic techniques (purely deterministic) and stochastic analysis (purely probabilistic). To make the chapter appeal to a wide audience in the fields of systems science, operations research, management science, applied mathematics, and engineering, we emphasize the motivation and present the main results. Interested readers can refer to the references provided for further study and technical details. The rest of the chapter is organized as follows. Section 2 discusses the formulation of singularly perturbed Markov chains.
To handle singularly perturbed systems driven by Markov chains, it is essential to have a thorough understanding of the underlying probabilistic structure. These properties are given in Section 3. Based on a couple of motivational examples, Section 4 presents decomposition/aggregation and nearly optimal controls and focuses on the optimal controls of linear quadratic regulators. Some additional remarks are made in Section 5. To make the chapter self-contained, we provide mathematical preliminaries and necessary background in the appendix.
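The averaging step invoked in the production-planning example can be sketched numerically: solve $\nu Q = 0$ with $\sum_i \nu_i = 1$ for the stationary distribution of an irreducible generator, then replace a regime-dependent quantity by its $\nu$-average. The 3-state generator and the capacity values below are hypothetical, chosen only for illustration.

```python
# Stationary distribution nu of an irreducible generator Q: solve nu Q = 0
# with sum(nu) = 1, then average a regime-dependent quantity under nu.
# The 3-state generator and capacities below are illustrative only.

def stationary(Q):
    n = len(Q)
    # Transpose Q (so rows encode the equations Q' nu' = 0) and replace the
    # last, redundant equation by the normalization sum(nu) = 1.
    M = [[Q[j][i] for j in range(n)] for i in range(n)]
    M[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Plain Gaussian elimination with partial pivoting.
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= f * M[c][k]
            b[r] -= f * b[c]
    nu = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * nu[k] for k in range(r + 1, n))
        nu[r] = (b[r] - s) / M[r][r]
    return nu

Q = [[-2.0, 1.0, 1.0],
     [ 1.0, -3.0, 2.0],
     [ 0.0, 2.0, -2.0]]
capacity = [0.0, 1.0, 2.0]          # machine capacity in each state
nu = stationary(Q)
avg = sum(n_i * c for n_i, c in zip(nu, capacity))
print(nu, avg)
```

For this $Q$ the solution is $\nu = (2/11, 4/11, 5/11)$, so the averaged capacity is $14/11 \approx 1.27$; it is a deterministic average of this kind that replaces the stochastic capacity in the limit problem.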
2. SINGULARLY PERTURBED MARKOV CHAINS
In this section we introduce singularly perturbed Markovian models in both continuous time and discrete time. The discussion is confined to the cases of stationary processes for simplicity. The power of the approach to be presented, however, is better reflected by nonstationary models.
2.1. CONTINUOUS-TIME CASE
Many natural phenomena involve processes changing at different rates. To describe such fast/slow two-time scales in the formulation, we introduce a small real number $\varepsilon > 0$. Let $\alpha^\varepsilon(t)$ be a continuous-time Markov chain with finite state space $\mathcal M$ and generator

$$Q^\varepsilon = \frac{1}{\varepsilon}\,\tilde Q + \hat Q, \tag{2.4}$$

such that $\tilde Q$ and $\hat Q$ are themselves generators. A Markov chain with generator given in (2.4) is a singularly perturbed Markov chain. The simple example given below illustrates the effect of the small parameter $\varepsilon$.

Example. Suppose that $\alpha^\varepsilon(\cdot)$ is a Markov chain having four states, with a generator of the form (2.4).
Simulation of sample paths of the Markov chain yields Fig. 21.1, which displays paths for two different values of $\varepsilon$. Observe that the small parameter $\varepsilon$ has a squeezing effect which rescales the sample paths of the Markov chain. When $\varepsilon$ is small, the Markov chain generated by $\tilde Q/\varepsilon$ undergoes rapid variations. This, combined with $\hat Q$, produces a generator that has both a rapidly varying part and a slowly changing part with weak and strong interactions. Let $p^\varepsilon(t)$ be the probability vector associated with the Markov chain. Then it is known that $p^\varepsilon(t)$ satisfies the differential equation

$$\frac{d p^\varepsilon(t)}{dt} = p^\varepsilon(t)\, Q^\varepsilon, \quad p^\varepsilon(0) = p^0, \tag{2.6}$$
where $p^0$ is a probability vector (i.e., all of its components $p^0_i \ge 0$ and $\sum_i p^0_i = 1$). Equation (2.6) is known as the forward equation. Take, for simplicity, $Q^\varepsilon$ to be constant in time. Then the solution of (2.6) is $p^\varepsilon(t) = p^0 \exp(Q^\varepsilon t)$. A direct use
of the singular perturbation idea may not work here since, although (2.6) is a linear system of differential equations, the generator has an eigenvalue 0. A first glance may lead one to believe that $p^\varepsilon(t)$ is unbounded as $\varepsilon \to 0$. However, this is not the case. Setting aside the eigenvalue 0, the rest of the spectrum of the generator lies in the left half of the complex plane, producing a rapid convergence to the stationary distribution. In view of (2.4), the matrix $\tilde Q/\varepsilon$ dominates the asymptotics. Let us concentrate on the fast-changing part of the generator first. According to A. N. Kolmogorov, the states of any Markov chain can be classified as either recurrent or transient. It is also known that any finite-state Markov chain has at least one recurrent state, i.e., not all states are transient. As a result (Iosifescu, 1980, p. 94), by appropriate arrangements, it is always possible to write the corresponding generator as either

$$\tilde Q = \operatorname{diag}(\tilde Q^1, \ldots, \tilde Q^l) \tag{2.7}$$
or

$$\tilde Q = \begin{pmatrix} \tilde Q^1 & & & \\ & \ddots & & \\ & & \tilde Q^l & \\ \tilde Q^{*,1} & \cdots & \tilde Q^{*,l} & \tilde Q^{*} \end{pmatrix}. \tag{2.8}$$
The generator (2.7) corresponds to a Markov chain that has $l$ recurrent classes, whereas (2.8) corresponds to a Markov chain that includes transient states in addition to the recurrent classes. For the stationary case, the forms (2.7) and (2.8) together exhaust all possibilities of practical concern. In the discussion to follow, we will use the notion of weak irreducibility (see the appendix for a definition), which is an extension of the usual notion of irreducibility. We will also deal with partitioned matrices of the forms (2.7) and (2.8). Nevertheless, in lieu of recurrent classes, we will consider weakly irreducible classes, a generalization of the classical formulation.
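The squeezing effect of $\varepsilon$ can be checked numerically by integrating the forward equation (2.6) with a generator of the form (2.4). The sketch below uses a crude Euler scheme and illustrative 2-state generators; it is only meant to show that, as $\varepsilon$ shrinks, $p^\varepsilon(t)$ is pushed toward the stationary distribution of the fast part over a fixed short horizon.

```python
# Forward-equation sketch: p'(t) = p(t) Q_eps with Q_eps = Qt/eps + Qh.
# Integrating with a small Euler step shows the "squeezing" effect: the
# smaller eps is, the faster p(t) settles near the stationary distribution
# of the fast part.  The 2-state generators below are illustrative.

def forward(p0, Qt, Qh, eps, T, dt=1e-5):
    n = len(p0)
    Qe = [[Qt[i][j] / eps + Qh[i][j] for j in range(n)] for i in range(n)]
    p = list(p0)
    for _ in range(int(T / dt)):
        # Euler step for dp/dt = p Qe (row-vector convention)
        p = [p[i] + dt * sum(p[k] * Qe[k][i] for k in range(n))
             for i in range(n)]
    return p

Qt = [[-1.0, 1.0], [1.0, -1.0]]   # fast part; stationary dist (0.5, 0.5)
Qh = [[-0.2, 0.2], [0.3, -0.3]]   # slow perturbation
for eps in (1.0, 0.1, 0.01):
    p = forward([1.0, 0.0], Qt, Qh, eps, T=0.1)
    print(eps, p)   # approaches (0.5, 0.5) as eps decreases
```

Since the rows of $Q^\varepsilon$ sum to zero, the Euler iteration conserves total probability, and the run with the smallest $\varepsilon$ has already collapsed onto the fast chain's stationary distribution by time $t = 0.1$.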
2.2. TIME-SCALE SEPARATION
Our formulation can accommodate and indicate different rates of change among different states. In model (2.7), for example, a Markov chain generated by $\tilde Q/\varepsilon$ has $l$ classes of weakly irreducible states, within which the transitions take place at a faster pace. It follows that the state space of the Markov chain generated by (2.4) has a natural classification $\mathcal M = \mathcal M_1 \cup \cdots \cup \mathcal M_l$. A similar interpretation holds for the generator given by (2.8). Note that the component $\tilde Q^{*,k}$ on the last row of the matrix in (2.8) denotes transitions from the transient states to the $k$th weakly irreducible class, and $\tilde Q^{*}$ denotes the transitions among those transient states. In what follows, we demonstrate how any given generator can be rewritten as (2.4) after appropriate rearrangements. For simplicity, we present a case where the dominating part of the generator corresponds to a chain with weakly irreducible classes. The example to follow is drawn from Chapter 3 of Yin and Zhang (1998). Consider a generator $Q$ given by
with the corresponding state space.

Step 1. Separate the entries of the matrix based on their orders of magnitude. The numbers {1, 2} are at a scale (order of magnitude) different from that of
the numbers {10, –11, –12, 21, –22, 30, –33}. So we write Q as
Step 2. Adjust the entries to make each of the above matrices a generator. This requires moving entries so that each of the two matrices satisfies the conditions of a generator, i.e., nonnegative off-diagonal elements, nonpositive diagonal elements, and zero row sums. After such a rearrangement, the matrix Q becomes
Step 3. Permute the columns and rows so that the dominating matrix is of the desired block-diagonal form (corresponding to the weakly irreducible blocks). In this example, exchanging the order of two of the states in the state space yields
Let
Then
Note that the decomposition procedure is not unique. We can also write
Therefore, by using elementary row and column operations and some rearrangements, we have transformed the matrix Q into the form (2.4). The above procedure is applicable to time-dependent generators as well. It can also be used to incorporate generators with transient states.
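Steps 1 and 2 can be mimicked mechanically: split the off-diagonal entries of Q by order of magnitude, then rebuild the diagonals so that each part again has zero row sums. The 3-by-3 matrix and the threshold below are hypothetical, and the block permutation of Step 3 is not performed here.

```python
# A sketch of Steps 1-2: split a generator Q into a dominant part (large
# entries) and a small part, then repair the diagonals so that each part
# again has zero row sums, hence is itself a generator (up to the block
# permutation of Step 3).  The matrix and threshold are illustrative.

def split_generator(Q, threshold):
    n = len(Q)
    big = [[0.0] * n for _ in range(n)]
    small = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue            # diagonals are rebuilt below
            if abs(Q[i][j]) >= threshold:
                big[i][j] = Q[i][j]
            else:
                small[i][j] = Q[i][j]
        big[i][i] = -sum(big[i][j] for j in range(n) if j != i)
        small[i][i] = -sum(small[i][j] for j in range(n) if j != i)
    return big, small

Q = [[-11.0, 10.0, 1.0],
     [ 21.0, -22.0, 1.0],
     [  2.0,  0.0, -2.0]]
big, small = split_generator(Q, threshold=5.0)
print(big)     # fast part: entries of magnitude >= 5
print(small)   # slow part: the remaining entries
```

Adding the two parts back together recovers Q off the diagonal, and each part separately has zero row sums, which is exactly the property demanded in Step 2.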
3. PROPERTIES OF THE SINGULARLY PERTURBED SYSTEMS
This section contains several asymptotic properties of singularly perturbed Markov chains. It begins with asymptotic expansions, proceeds to occupation measures, continues with functional central limit results, and concludes with large deviations and exponential error bounds.
3.1. ASYMPTOTIC EXPANSION
The basic question concerns the probability distribution of the Markov chain as $\varepsilon \to 0$. For the continuous-time case, let the Markov chain have state space $\mathcal M$ and generator (2.4). Our study focuses on the forward equation (with time-dependent generators):
where $Q^\varepsilon(t) = \tilde Q(t)/\varepsilon + \hat Q(t)$. The detailed proofs of the following results can be found in Yin and Zhang (1998), Chapter 6.

Theorem 1. Suppose that $\tilde Q(t)$ is given by (2.7) such that, for each $t \in [0, T]$ and each $k \le l$, $\tilde Q^k(t)$ is weakly irreducible; for some positive integer $n$, $\tilde Q(\cdot)$ and $\hat Q(\cdot)$ are $(n+1)$-times and $n$-times continuously differentiable on $[0, T]$, respectively. Moreover, $\tilde Q^{(n+1)}(\cdot)$ and $\hat Q^{(n)}(\cdot)$ are Lipschitz on $[0, T]$. Then, for sufficiently small $\varepsilon$, the probability vector $p^\varepsilon(t)$ can be approximated to a desired accuracy by a series in $\varepsilon$, in that we can construct two sequences $\{\varphi_i(t)\}$ and $\{\psi_i(t/\varepsilon)\}$ such that

$$p^\varepsilon(t) = \sum_{i=0}^{n} \varepsilon^i \left(\varphi_i(t) + \psi_i(t/\varepsilon)\right) + e^\varepsilon(t). \tag{3.10}$$
Moreover, all the $\varphi_i$ and $\psi_i$ are sufficiently smooth, and

$$\sup_{0 \le t \le T} |e^\varepsilon(t)| \le K \varepsilon^{n+1}$$

for some constant $K > 0$. The asymptotic expansion enables us to simplify the computation. Note that $p^\varepsilon(t) \to \varphi_0(t)$ as $\varepsilon \to 0$ for $t > 0$. Thus, in lieu of using $p^\varepsilon(t)$, we can use the limit of the probability distribution. In many cases, we actually need both the limit of the probability distribution and certain higher-order corrections. For the scaled occupation measures to be studied in the subsequent section, for example, the first-order terms play an essential role. If transient states are included, with $\tilde Q(t)$
given by (2.8), we need all eigenvalues of $\tilde Q^{*}(t)$ to have negative real parts. In accordance with (3.10), for sufficiently small $\varepsilon$, $p^\varepsilon(t)$ can be approximated by a series with an error bound of the order $O(\varepsilon^{n+1})$. The approximation is uniform in $t \in [0, T]$; it reflects the two time scales, the regular time $t$ and the stretched time $t/\varepsilon$. The terms involving $t$ are known as outer expansion terms, and the terms involving $t/\varepsilon$ are the initial-layer corrections. The term $\varphi_0(t)$ approximates $p^\varepsilon(t)$ well for those $t$ away from a layer of thickness $O(\varepsilon)$, but it does not satisfy the initial condition; therefore we introduce $\psi_0(t/\varepsilon)$ to accommodate it. The outer expansion terms are all smooth and well behaved, and all the initial-layer corrections decay exponentially. The method of obtaining the $\varphi_i$ and $\psi_i$ is constructive. The key is to choose the initial conditions appropriately so that they match both the outer expansion and the initial-layer correction. The procedure requires the use of weak irreducibility (see the definition in the appendix, in particular (6.54)), certain orthogonality conditions, and the Fredholm alternative, which is stated as follows. Let $B$ denote an $n \times n$ matrix. For any $\lambda$, define an operator $A$ as $A = \lambda I - B$, where $I$ is the $n \times n$ identity matrix. Then the adjoint operator $A^{*}$ is given by $A^{*} = \bar\lambda I - B'$.
Assume that $A$ is given as above. Then one of the following two alternatives holds: (1) the homogeneous equation $Ay = 0$ has only the zero solution, in which case $\lambda \in \rho(A)$ (the resolvent set of $A$), $A^{-1}$ is bounded, and the inhomogeneous equation $Ay = f$ has exactly one solution for each $f$; (2) the homogeneous equation $Ay = 0$ has a nonzero solution, in which case the inhomogeneous equation $Ay = f$ has a solution iff $\langle f, y^{*}\rangle = 0$ for every solution $y^{*}$ of the adjoint equation $A^{*} y^{*} = 0$. After obtaining the formal expansion of the asymptotic series, we proceed to validate the expansion by estimating the corresponding error bounds. The interested reader is referred to Yin and Zhang (1998), Chapters 4 and 6, for detailed treatments.

Discrete-time Problems. For discrete-time cases, we must use transition probability matrices in lieu of a generator. Similar to the continuous-time case, suppose that the transition matrix is given by

$$P^\varepsilon = P + \varepsilon Q,$$
where $\varepsilon > 0$ is a small parameter, $P$ is the transition matrix of a discrete-time Markov chain, and $Q$ is a generator. Then any transition probability
matrix of a finite-state Markov chain with stationary transition probabilities can be put in the form (Iosifescu, 1980, p. 94) of either

$$P = \operatorname{diag}(P^1, \ldots, P^l)$$
or
where each $P^k$ is itself a transition matrix within the $k$th weakly irreducible class, for $k = 1, \ldots, l$, and the last row corresponds to the transient states. The two-time-scale interpretation becomes a normal rate of change versus a slow rate of change. Let $p^\varepsilon_k$ be the solution of

$$p^\varepsilon_{k+1} = p^\varepsilon_k P^\varepsilon, \quad p^\varepsilon_0 = p^0,$$
where P is a transition matrix and Q is a generator. Parallel to the development of the continuous-time systems, we can obtain the following asymptotic expansions. where
represents the approximation error, and
The term is the outer expansion and is the boundary-layer correction. For a detailed treatment, we refer the reader to Yin and Zhang (2000).
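The two-time-scale behavior in the discrete-time case can be seen in a small numerical sketch. All matrices below are hypothetical illustrations, not data from the text: for a transition matrix of the form P plus a small perturbation times a generator, the distribution enters an initial layer of only a few steps, after which it stays within a small neighborhood of the quasi-stationary distribution of P.

```python
import numpy as np

# Hypothetical two-state example: P is the fast transition matrix and Q
# a generator perturbing it; P_eps = P + eps * Q as in the expansion above.
P = np.array([[0.3, 0.7],
              [0.6, 0.4]])          # irreducible transition matrix
Q = np.array([[-0.5, 0.5],
              [ 1.0, -1.0]])        # generator: rows sum to zero
eps = 1e-3
P_eps = P + eps * Q                 # rows still sum to 1

# Quasi-stationary distribution nu of P: nu P = nu, sum(nu) = 1,
# the leading term of the expansion away from the initial layer.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
nu, *_ = np.linalg.lstsq(A, b, rcond=None)

p = np.array([1.0, 0.0])            # initial distribution
errs = []
for k in range(200):
    errs.append(np.abs(p - nu).sum())
    p = p @ P_eps

# The initial-layer correction decays geometrically: after a few steps
# the distribution is within roughly O(eps) of nu.
print(errs[0], errs[20])
```

The geometric decay rate of the initial layer here is governed by the second eigenvalue of P, while the residual gap of order eps reflects the slowly varying part of the expansion.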
3.2.
OCCUPATION MEASURES
The asymptotic expansion obtained in the previous section is based on a purely deterministic approach. To study how the underlying Markov chain evolves, it is also important to investigate the stochastic characteristics of the Markov chain. A quantity of particular interest is the so-called occupation measure, or occupation time, which indicates the amount of time the chain spends in a given state. Let us begin with the continuous-time case in which and is weakly irreducible. Using the idea of the previous
section, we can develop asymptotic expansions of In this case, the leading term in the asymptotic expansion is the quasi-stationary distribution. Consider the time that the Markov chain spends in state up to time That is, It is easy to show that this integral can be approximated by which is a non-random function and is the mean (as ) of the occupation measure indexed by Thus, we define a sequence of centered occupation measures as
and write If, in lieu of assuming and the weak irreducibility of has more than one block as indicated in (2.7), then a nonrandom limit cannot be expected. To proceed, aggregate the states in each of the subspaces into one state and define if This aggregated process in general is not Markov. However, it can be shown that converges weakly to a Markov chain with generator Recall that a sequence of random variables converges weakly to if and only if for any bounded and continuous function If the Markov chain corresponding to has more than one weakly irreducible class, a simple “deterministic approximation” as in (3.16) by the quasi-stationary distribution is not enough. A pertinent definition of the occupation measure becomes
The immediate questions are: How good is such an approximation? What is the asymptotic distribution of Using probability basics, we can show that That is, a mean squares estimate for the unscaled occupation measure holds. Such an estimate implies that a scaled sequence of the occupation measures may have a nontrivial asymptotic distribution. Define
which is a scaled sequence of occupation measures. The scaling is suggested by the central limit theorem. The detailed proof of the following results can be found in Yin and Zhang (1998), Chapter 7.
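The aggregation described above, replacing each weakly irreducible block by a single state, can be sketched numerically. The four-state generator below is a hypothetical illustration: the fast part is block-diagonal with two weakly irreducible blocks, and each entry of the aggregated generator is the quasi-stationary distribution of a block applied to the corresponding block of the slow part, summed over columns.

```python
import numpy as np

def stationary(Qg):
    """Quasi-stationary distribution nu of a weakly irreducible
    generator: nu Qg = 0 and sum(nu) = 1, solved by least squares."""
    n = Qg.shape[0]
    A = np.vstack([Qg.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    nu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return nu

# Hypothetical four-state example with two weakly irreducible blocks.
Qt = np.array([[-1.0,  1.0,  0.0,  0.0],
               [ 2.0, -2.0,  0.0,  0.0],
               [ 0.0,  0.0, -3.0,  3.0],
               [ 0.0,  0.0,  1.0, -1.0]])   # block-diagonal fast part
Qh = np.array([[-0.5,  0.0,  0.5,  0.0],
               [ 0.0, -0.4,  0.0,  0.4],
               [ 0.3,  0.0, -0.3,  0.0],
               [ 0.0,  0.2,  0.0, -0.2]])   # slow part (a generator)
blocks = [slice(0, 2), slice(2, 4)]

# Aggregated generator: entry (i, j) is nu^i times block (i, j) of the
# slow part, applied to a vector of ones.
Qbar = np.empty((2, 2))
for i, bi in enumerate(blocks):
    nu_i = stationary(Qt[bi, bi])
    for j, bj in enumerate(blocks):
        Qbar[i, j] = nu_i @ Qh[bi, bj] @ np.ones(2)

print(Qbar)   # rows of the aggregated generator sum to zero
```

The resulting two-by-two matrix generates the limit of the aggregated process on the two super-states.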
Theorem 2. Assume that for each and is weakly irreducible. Suppose is twice differentiable with Lipschitz continuous second derivative and is differentiable with Lipschitz continuous derivative. For each and let be a bounded and Lipschitz continuous deterministic function. Then converges weakly to a switching diffusion where
and
is a standard m-dimensional Brownian motion.
For the definitions of weak convergence and jump diffusions, see the appendix. Loosely speaking, a switching diffusion process is a combination of switching processes and diffusion processes. It possesses both diffusive behavior and jump properties. Switching diffusions are widely used in many applications. For example, in a manufacturing system, the demand may be modeled as a diffusion process, whereas the machine production rate may be modeled as a Markov jump process. Discrete-time Chains. For discrete-time Markov chains, analogous results can be obtained. Consider the weakly irreducible case, for and define a sequence of occupation measures
Then define a continuous-time interpolation:
With definition (3.19),
can be written recursively as a difference equation
where
Define an aggregated process of by
Furthermore, define continuous-time interpolations with the interpolation interval on [0, T], by
Similar to the continuous-time case, a mean squares estimate can be obtained and the weak convergence of can be derived. Subsequently, for each and each define the normalized occupation measure:
Observe that To proceed, define a continuous-time interpolation:
For each fixed we intuitively expect that the random variable has a normal distribution. This is indeed the case, as illustrated by Fig. 21.2. We used 1000 replications (each replication corresponding to one realization). The computational results are displayed in the form of a histogram. On the other hand, keeping the sample point fixed and taking as a function of only yields a sample path as shown in Fig. 21.3. Note the diffusion-type behavior displayed in the figure. Moreover, it can be shown that converges weakly to a switching diffusion process and that the asymptotic covariance, depending on the zeroth order initial layer terms, is computable. The detailed development of the asymptotic distribution is quite involved; we refer readers to Yin and Zhang (1998), Chapter 7, for the case containing weakly irreducible classes, and to Yin et al. (2000a) for cases including transient states.
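A quick Monte Carlo experiment, with hypothetical two-state data, illustrates the mean squares estimate and the square-root-of-epsilon scaling for the discrete-time occupation measures: the centered, epsilon-weighted sum has mean square of order epsilon, while the scaled version has variance of order one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state chain, single irreducible block.
P = np.array([[0.3, 0.7],
              [0.6, 0.4]])
Q = np.array([[-0.5, 0.5],
              [ 1.0, -1.0]])
eps, T = 0.02, 1.0
n = int(T / eps)                      # number of steps covering [0, T]
P_eps = P + eps * Q

# Quasi-stationary distribution of P.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
nu, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)

reps = 1000
occ = np.empty(reps)                  # centered occupation measure of state 0
for r in range(reps):
    s, acc = 0, 0.0
    for _ in range(n):
        acc += (1.0 if s == 0 else 0.0) - nu[0]
        s = rng.choice(2, p=P_eps[s])
    occ[r] = eps * acc                # eps * sum of centered indicators

print(np.mean(occ**2))                # mean-square estimate: O(eps)
print(np.mean(occ**2) / eps)          # scaled sequence: O(1) variance
```

A histogram of the scaled values would exhibit the approximately normal shape discussed in the text.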
3.3.
LARGE DEVIATIONS AND EXPONENTIAL BOUNDS
3.3.1 Large Deviations. We begin with a simple example to review the concept of large deviations. Suppose that is a sequence of i.i.d. random variables with mean 0 and variance For simplicity, assume that is Gaussian, which allows us to get an explicit representation of the moment generating function. Let be the partial sum We are interested in the probability of the event for The well-known strong law of large numbers indicates
with probability one (w.p.1). The central limit theorem implies that
Then for large is a rare event in that its probability goes to 0. Although some approximations can be derived, neither the law of large numbers nor the central limit theorem can give a precise quantification of this rareness or its associated probability. A more detailed description beyond the normal deviation range is needed; that is the role of the large deviations approach. Consider the Cramér–Legendre transformation L, defined by
The last line above is a consequence of the fact that is normally distributed. Chernoff's bound,
indicates that is “exponentially equivalent to” In other words, the probability of the rare event is exponentially small. To illustrate the use of large deviations for singularly perturbed Markov chains, let us consider the discrete-time problem. For simplicity, assume that P has only one block and is irreducible. Then the centered occupation measures become
where
denotes the component of the stationary distribution of P. Let and Using the approach in Dembo and Zeitouni (1998), pp. 71-75, define a matrix with entries
where denotes the usual inner product. It follows that is also irreducible. For each let be the Perron–Frobenius eigenvalue of the matrix (see Seneta (1981) for a definition). Define
By using the Gärtner-Ellis Theorem (see Dembo and Zeitouni (1998)), we can derive the following large deviations bounds: For any set
where
and
denote the interior and the closure of G, respectively, and
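For the i.i.d. Gaussian example that opened this subsection, the exponential decay asserted by Chernoff's bound can be checked by simulation. The numbers below are illustrative choices: with rate function x squared over twice the variance, the Monte Carlo estimate of the tail probability stays below the bound and both decay exponentially in n.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, x = 1.0, 0.25
reps = 50_000

# Rate function for the Gaussian case: L(x) = x^2 / (2 sigma^2).
L = x**2 / (2 * sigma**2)

results = {}
for n in (25, 50, 100):
    # Partial sums S_n of n i.i.d. N(0, sigma^2) variables.
    S = rng.normal(0.0, sigma, size=(reps, n)).sum(axis=1)
    results[n] = np.mean(S / n >= x)          # estimate of P(S_n / n >= x)
    print(n, results[n], np.exp(-n * L))      # estimate vs. Chernoff bound
```

The estimates sit strictly below the bound for every n, and the gap on a logarithmic scale narrows only slowly, consistent with the prefactor ignored by the exponential rate.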
3.3.2 Exponential Bounds. Many control and optimization problems (see Zhang (1995); Zhang (1996)) involve singularly perturbed Markov chains and require exponential-type upper bounds on the error probabilities. Such exponential bounds also provide alternative ways to establish asymptotic normality. Such an idea was used in Zhang and Yin (1996) for error estimates of in continuous-time Markov chains, where is weakly irreducible. In fact, such a bound is crucial in obtaining asymptotically optimal controls of nonlinear dynamic systems (Zhang et al., 1997). Bounds involving only weakly irreducible classes are obtained in Yin and Zhang (1998); the inclusion of transient states is treated in Yin et al. (2000e). Concentrating on weakly irreducible states only, we have the following theorem. Theorem 3. Suppose the Markov chain has time-dependent generator with being of the form (2.7). Suppose that for each and is weakly irreducible; is differentiable on
[0, T] with its derivative being Lipschitz, and is also Lipschitz. Then there exist and K > 0 such that for and
where
is a constant satisfying
in which
An application of the error bounds yields that for any
The above results are mainly for time-varying generators. Better results can be obtained for constant and where the corresponding conclusion is that there exist positive constants and K such that for and
4.
CONTROLLED SINGULARLY PERTURBED MARKOVIAN SYSTEMS
In this section, by using a continuous-time LQG (linear quadratic Gaussian) regulator problem and a discrete-time LQ problem, we illustrate (1) the idea of hybrid system modeling under the influence of singularly perturbed Markov chains, (2) the decomposition and aggregation, and (3) the associated nearly optimal controls. We show that the complexity of the underlying systems can be reduced by means of a hierarchical control approach. The main ideas are presented and the end results are demonstrated. The interested reader can consult Zhang and Yin (1999) for the hybrid LQG problem and Yang et al. (2001) for the discrete-time LQ problem for further reading.
4.1.
CONTINUOUS-TIME HYBRID LQG
For some finite T > 0, we consider the following linear system in a time horizon [0,T],
where denotes state variables‚ denotes control variables‚ and are well defined and have finite values for any is a standard Brownian motion‚ and is a Markov chain with a finite state space The classical work of LQG assumes a fixed model (i.e.‚ A‚ B etc. are all fixed)‚ which excludes those systems subject to discrete-event variations. Similar to the Leontief model given in the introduction‚ however‚ a model with switching is a better alternative in many applications. The premise of such a model is that many important movements arise from discrete events. To accommodate this‚ we have included in (4.30) both continuous dynamics and discrete-event perturbations‚ which justifies the name hybrid systems. Our objective is to find an optimal control to minimize the expected quadratic cost function
where
is the expectation given and are symmetric nonnegative definite matrices; and D are symmetric positive definite matrices; and are independent. The LQG problems are important since many real-world applications can be approximated by linear or piecewise linear systems. Following the classical approach given in Fleming and Rishel (1975) and denoting the value function (the optimal cost) by we write
The problem is reduced to finding and It can be solved by substituting the above quadratic function into the dynamic programming equation of then solving the resulting so-called Riccati equations for and the equations for The solution of the problem completely depends on the solutions of the Riccati equations. Since many applications involve large-scale systems, we consider that the Markov chain has a large state space ( is large).
A measure of complexity is the number of equations to be solved. Since the Markov chain is involved, instead of solving one equation, one needs to solve a system of equations. When is large, the computation involved may be infeasible. To overcome the difficulty, we introduce a small parameter and let the generator of the Markov chain consist of a rapidly changing part and a slowly varying part of the form (2.4). Note that is obtainable using the procedure described in the previous section. The resulting Markov chain becomes and the dynamic equation and the cost function can be rewritten as
and
To be more specific, assume has a block-diagonal form (2.7), where are weakly irreducible, for and A small parameter results in the system under consideration having a two-time-scale behavior (Delebecque and Quadrat, 1981; Phillips and Kokotovic, 1981). Using dynamic programming techniques (see Fleming and Rishel (1975)), the value function = satisfies the following Hamilton-Jacobi-Bellman (HJB) equations: for and
with the boundary condition where is defined in (6.52). Following the approach in Fleming and Rishel (1975), let
for some matrix-valued function and a real-valued function Substituting (4.35) into (4.34) and comparing the coefficients of lead to the
following Riccati equations for
and
where is as defined in (6.52). In accordance with the results of Fleming and Rishel (1975)‚ these equations have unique solutions. In view of the positive definiteness of the optimal control has the form:
In solving this problem, one will face the difficulty caused by large dimensionality if the state space is large (i.e., is a large number), where a total of equations must be solved. To resolve this, we aggregate the states in into one state and obtain an aggregated process defined by when The process is not necessarily Markovian. However, using certain probabilistic arguments, we have shown in Yin and Zhang (1998); Yin et al. (2000a) that converges weakly to the process generated by which has the form
with being the stationary distribution of For and
and
uniformly on [0‚ T] as
where
(for each
are the unique solutions to the differential equations
and
respectively, with Based on this result, we can use the optimal control of the limit problem to construct controls of the original problem. It can be shown that the controls so constructed are asymptotically optimal (Zhang and Yin, 1999). To proceed, we illustrate the ideas by examining the following example. Consider a four-state Markov chain having the same as in (2.5). Suppose that the generator has the form such that consists of and each of which is a 2 × 2 matrix. Thus and Suppose that the dynamic system (4.32) is a 2-dimensional one with the cost function (4.33). Choosing a time horizon [0, T] with T = 5, we discretize the system equations with step size The trajectories of and are given in Fig. 21.4. The computation was done on a Sun Workstation ULTRA 5 and the CPU times were recorded. The average CPU time for solving the original problem was 4.59 seconds, whereas the CPU time used in solving the problem via aggregation and averaging was only 2.46 seconds. Thus compared with the solution of the original system, only a little more than 50% of the computational effort is needed to find the nearly optimal control policy. The simulation results show that our algorithm approximates the optimal solutions well.
4.2.
DISCRETE-TIME LQ
This section discusses the discrete-time LQ problem involving a singularly perturbed Markov chain. Many systems and physical models in economics‚ biology‚ and engineering are represented in discrete time mainly because various measurements are only available at discrete instants. As a result‚ the planning decisions‚ strategic policies‚ and control actions regarding the underlying systems are made at discrete times. The continuous-time models can be regarded as an approximation to the “discrete” reality. For example‚ using
a dynamic programming approach to solve a stochastic control problem (with Markov chain driving noise) yields the so-called Hamilton-Jacobi-Bellman equation (Fleming and Rishel, 1975). This equation often needs to be solved numerically via discretization. Simulation and digitization also lead to various discrete-time systems. Furthermore, owing to the rapid progress in computer control technology, the study of sampled-data systems has become increasingly important. All these make the study of discrete-time systems necessary. Let be a discrete-time Markov chain with transition probability matrix given by (3.11), where P is a transition probability matrix and Q is a generator. Use to denote the integer part for any For a finite real number and the dynamic system is given by
where
is the state‚ is the control‚ and are well defined and have finite values for each and is a sequence of random variables with zero mean. Define a sequence of cost functions
where is the expectation given and Our objective is to find an optimal control to minimize the expected quadratic cost function The formulation of (4.42) is useful in reducing the complexity as described in the last section. That is, the original system model without the small parameter is considered first. To reduce the computational burden, we introduce a small parameter into the system, which results in a singularly perturbed model of the form (4.42). To obtain the desired asymptotic results, applying the dynamic programming principle under suitable assumptions with a slight modification of the argument in Bertsekas (1995) yields a system of dynamic programming equations. Denote the value functions by Then the dynamic programming (DP) equation is
for
Let
Assume
Define
Then we have the following Riccati equations
and
The optimal feedback control for the LQ problem is linear in the state variable:
For each define the piecewise constant interpolated processes and as
Instead of solving the Riccati equations, we can use aggregation to obtain a limit system, in which the functions involved are averaged out with respect to the stationary distributions of the Markov chain. Although the original problem is a discrete-time one, the limit dynamic system is in continuous time.
As noted before, the complexity of the problem depends on the number of equations to be solved. Thus the reduction of complexity is achieved in that only equations need to be solved as compared to the equations in the original problem. If is substantially larger than then the complexity is significantly diminished. It can be shown that converges weakly to where is the solution of the hybrid system
where
and
In addition, and converge to and that are solutions of (4.40) and (4.41), respectively. As an example, consider a four-state Markov chain with transition probability matrix
For a two-dimensional dynamic system (4.42) and the cost (4.43), we work with a time horizon of with T = 5. We use step size to discretize the limit Riccati equations. Take The trajectories of and are given in Fig. 21.5 for The simulation results show that the continuous-time hybrid LQG problem closely approximates the corresponding discrete-time linear quadratic regulator problem. Using this limit system, we can find its optimal control
where for
and
Subsequently‚ using the optimal control of the limit system‚ we construct a control for the original discrete-time control system
where
Equivalently‚ we can write
Such a control can be shown to be asymptotically optimal. As in the continuous-time case, finding the nearly optimal control requires the solution of Riccati equations, whereas solving the original problem requires solving Riccati equations. Thus, when we will have substantial savings. In the example presented, we used and Again, the computation was done on a Sun Workstation ULTRA 5. A CPU time of 2.21 seconds was needed to solve the problem via aggregation and averaging as compared to 5.58 seconds used in solving the original problem, which amounts to a 60% saving.
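The backward recursion behind the discrete-time Riccati equations above can be sketched for a scalar two-regime system. All data below are hypothetical, and Lam denotes the one-step conditional expectation, with respect to the transition matrix, of the next-stage Riccati variables; the recursion is the standard one for discrete-time Markov jump linear quadratic regulation.

```python
import numpy as np

# Hypothetical scalar data for a two-state modulating chain.
A = [1.05, 0.95]       # state coefficient per regime
B = [1.0, 0.5]         # control coefficient per regime
M, D, MT = 1.0, 1.0, 0.5
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # transition matrix of the chain

N = 200                           # horizon length
K = np.full(2, MT)                # terminal condition
for _ in range(N):                # backward recursion
    Lam = P @ K                   # Lam_i = sum_j p_ij K_j
    K = np.array([M + A[i]**2 * Lam[i]
                  - (A[i] * Lam[i] * B[i])**2 / (D + B[i]**2 * Lam[i])
                  for i in range(2)])

# Optimal feedback is linear in the state:
# u = -(D + B_i^2 Lam_i)^{-1} B_i Lam_i A_i x in regime i.
gains = [-B[i] * Lam[i] * A[i] / (D + B[i]**2 * Lam[i]) for i in range(2)]
print(K, gains)
```

For a horizon of this length the recursion effectively reaches its stationary values, mirroring the near-constant gains away from the terminal time in the continuous-time case.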
5.
FURTHER REMARKS
This work provides a survey on using singularly perturbed Markov chains to model uncertainty in various applications, with the aim of reducing the complexity of large-scale systems. We have mainly focused on finite-state Markov chains. For future work, we point out: It will be of theoretical interest and practical concern to study Markov chains having countable state spaces. In addition, including both diffusions and jumps in various models is another important area of investigation. Such switching diffusion models can be used for manufacturing systems, in which the diffusion represents the random demand and the jump process delineates the machine capacity, and for stock-investment models, in which hybrid switching geometric Brownian motion models can be used to capture market trends and other economic factors such as interest rates and business cycles. Although some results have been reported in Il'in et al. (1999a); Il'in et al. (1999b); Yin (2001); Yin and Kniezeva (1999), much more in-depth investigation is needed. For partially observable systems, there is growing interest in analyzing filtering and state estimation procedures that involve continuous dynamics and discrete-event interventions. Another area that remains largely untouched is the study of stability of singularly perturbed Markovian systems. Here the focus is on the asymptotic properties of the underlying systems as 0 and
All of these and their applications in signal processing‚ learning‚ economic systems‚ and financial engineering open up new avenues for further research and investigation.
6.
APPENDIX: MATHEMATICAL PRELIMINARIES
To make the chapter as self-contained as possible‚ we briefly review some basics of stochastic processes and related topics in this appendix.
6.1.
STOCHASTIC PROCESSES
Let be a probability space, where is the sample space, is the σ-algebra of subsets, and P is the probability measure defined on A collection of or simply is called a filtration if for It is understood that is complete in the sense that it contains all null sets. A probability space together with a filtration is termed a filtered probability space A stochastic process defined on is a function of two variables and where is a sample point and is a parameter taking values from a suitable index set For each fixed is a random variable and for each fixed is a function of called a sample function or a realization. It is customary to suppress the dependence on from the notation. The index set may be discrete, e.g., or continuous, e.g., (an interval of real numbers). A process is adapted to a filtration if for each is an random variable; is progressively measurable if for each the process restricted to is measurable with respect to the in where denotes the Borel sets of A progressively measurable process is measurable and adapted, whereas the converse is not generally true. However, any measurable and adapted process with right-continuous sample paths is progressively measurable. A real-valued or vector-valued stochastic process is said to be a martingale on with respect to if, for each is and = w.p. 1 (with probability one) for all If we only say that is a martingale without specifying the filtration, the filtration is taken to be the one generated by A random vector is Gaussian if its characteristic function has the form
where is a constant vector in is the usual inner product, denotes the pure imaginary number satisfying and A is a symmetric nonnegative definite matrix. A process is a Gaussian process if its finite-dimensional distributions are Gaussian. That is, for any and is a Gaussian vector. If for any and
are independent, then is a process with independent increments. An Gaussian random process for is called a Brownian motion, if B(0) = 0 w.p.1, B(·) is a process with independent increments, B(·) has continuous sample paths w.p.1, and the increments have Gaussian distribution with and for some nonnegative definite matrix where denotes the covariance of Given processes and a process defined as
is called a diffusion. The concept of weak convergence is a substantial generalization of convergence in distribution in elementary probability theory. Let P and denote probability measures defined on a metric space The sequence is weakly convergent to P if
for every bounded and continuous function on Suppose that and X are random variables associated with and P, respectively. The sequence converges to X weakly if for any bounded and continuous function on F, as A family of probability measures defined on a metric space F is tight if for each there exists a compact set such that
For more details on definitions‚ notation‚ and properties of stochastic processes‚ see the references Hillier and Lieberman (1995); Karlin and Taylor (1975); Karlin and Taylor (1981). A more mathematically inclined reader may refer to Chung (1967); Davis (1993); Ethier and Kurtz (1986).
6.2.
MARKOV CHAINS
A Markov process is a process having the property that‚ given the current state‚ the probability of the future behavior of the process is not altered by any additional knowledge of its past behavior. That is‚ for any any
and any set A (an interval of the real line)
A Markov process having piecewise constant sample paths and taking values in either (for some positive integer or is called a Markov chain. The set is termed the state space of the Markov chain. The function
is known as the transition probability. If the transition probabilities depend only on the time difference (not on then we say that the Markov chain has stationary transition probabilities. Much of our recent work concerns the asymptotic properties of nonstationary singularly perturbed Markov chains. The interested reader can find more advanced treatments in the references of the current chapter. Throughout this chapter, we consider finite-state Markov chains with state space Unless explicitly stated, the state space of is given by for some positive integer Note that the transition probabilities form an matrix that is called a transition matrix. Such a matrix has nonnegative entries with row sums equal to 1. For a discrete-time Markov chain, the one-step (stationary) transition matrix is given by whereas for a continuous-time Markov chain, the transition probability matrix is given by To characterize a discrete-time Markov chain, we use its transition probabilities, whereas for a continuous-time Markov chain, the use of a generator is a better alternative. Loosely speaking, a generator prescribes the rates of change (“time derivative”) of the transition probabilities. Formally, (for a stationary Markov chain) it is given by
where I is an identity matrix. For a suitable function denotes
For a nonstationary Markov chain we have to modify the definition of the generator. A matrix is a generator of if is Borel measurable for all and is uniformly bounded‚
that is, there exists a constant K such that for all and and (2) for all bounded real-valued functions defined on
is a martingale.
has a unique non-negative solution‚ where for a positive integer denotes an column vector with all entries being 1. The unique solution is termed a quasi-stationary distribution. The definition above is a slight generalization of the traditional notion of irreducibility and stationary distribution. We allow the and the to be time dependent; we require only nonnegative solutions of (6.54) instead of positive solutions.
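System (6.54) can be solved numerically at each fixed time by appending the normalization row to the singular linear system and solving in the least-squares sense. The time-dependent generator below is a hypothetical illustration.

```python
import numpy as np

def quasi_stationary(Q):
    """Solve nu Q = 0 with nu summing to 1 (system (6.54)) by least
    squares; Q is the generator evaluated at a fixed time."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    nu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return nu

# Hypothetical time-dependent generator Q(t) on two states.
def Q(t):
    lam, mu = 1.0 + t, 2.0        # time-varying up-rate, constant down-rate
    return np.array([[-lam, lam],
                     [  mu, -mu]])

for t in (0.0, 1.0):
    print(t, quasi_stationary(Q(t)))
```

At t = 0 this returns the distribution (2/3, 1/3), and at t = 1 the rates balance and it returns (1/2, 1/2), illustrating the time dependence allowed by the definition.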
6.3.
CONNECTIONS OF SINGULARLY PERTURBED MODELS: CONTINUOUS TIME VS. DISCRETE TIME
This section illustrates the close connection between continuous-time singularly perturbed Markov chains and their discrete-time counterparts; some of the related discussions are in Kumar and Varaiya (1984) and Yin and Zhang (1998) (see also Fox and Glynn (1990) for work on simulation). Starting with a continuous-time singularly perturbed Markov chain (as defined in the last section), we can construct a discrete-time singularly perturbed Markov chain Conversely, starting from the discrete-time singularly perturbed Markov chain we can construct a continuous-time singularly perturbed Markov chain which has the same distribution as the original process Consider a continuous-time singularly perturbed Markov chain with state space and generator where and are generators of some Markov chains with stationary transition probabilities. The probability vector
satisfies the forward equation
Introduce a new variable Then the probability vector satisfies the rescaled forward equation
Denote
and fix
Let
Define
It is clear that all entries of are nonnegative and where for a positive integer denotes an column vector with all entries being 1. Therefore‚ is a transition probability matrix. Eq. (6.55) can be written as
The solution of the above forward equation is
Consider a discrete-time Markov chain having transition matrix Then the corresponding probability vector satisfies
This together with (6.57) yields
Suppose is a Poisson process with rate q, independent of Then for each nonnegative integer
Let be a sequence of “waiting times,” which is a sequence of independent and identically distributed exponential random variables. Define the time at which the event occurs. Now define a piecewise constant process
Then
Thus the reconstructed process and the original process have the same distribution.
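The distributional identity underlying this construction (uniformization) can be verified numerically: with q at least as large as every diagonal rate in magnitude, and the transition matrix formed as the identity plus the generator divided by q, the Poisson mixture of powers of that matrix reproduces the matrix exponential, as in (6.58). The two-state generator below is hypothetical.

```python
import numpy as np
from math import exp, factorial

# Hypothetical two-state generator and uniformization rate.
Q = np.array([[-1.0, 1.0],
              [ 2.0, -2.0]])
q = 3.0                               # q >= max_i |q_ii|
P = np.eye(2) + Q / q                 # a genuine transition matrix
p0 = np.array([1.0, 0.0])
t = 0.7

# p0 expm(Q t) via a truncated Taylor series (small matrix, small t).
M = np.eye(2)
expQt = np.zeros((2, 2))
for k in range(60):
    expQt += M
    M = M @ (Q * t) / (k + 1)
p_exact = p0 @ expQt

# Poisson mixture over powers of P, as in (6.58).
p_unif = np.zeros(2)
Pn = np.eye(2)
for n in range(60):
    p_unif += exp(-q * t) * (q * t)**n / factorial(n) * (p0 @ Pn)
    Pn = Pn @ P

print(p_exact, p_unif)                # the two agree to rounding error
```

The agreement is exact in principle, since the Poisson mixture of powers of the uniformized matrix is algebraically identical to the matrix exponential of the generator.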
Acknowledgement. We would like to thank Felisa Vázquez-Abad and Pierre L'Ecuyer for critical reading of an early version of the manuscript and for providing us with detailed comments and suggestions, which led to a much improved presentation. The research of G. Yin was supported in part by the National Science Foundation under grant DMS-9877090. The research of Q. Zhang was supported in part by the USAF Grant F30602-99-2-0548 and ONR Grant N00014-96-1-0263. The research of K. Yin and H. Yang was supported in part by the Minnesota Sea Grant College Program through the NOAA Office of Sea Grant, U.S. Department of Commerce, under grant NA46-RG0101.
REFERENCES Abbad‚ M.‚ J.A. Filar‚ and T.R. Bielecki. (1992). Algorithms for singularly perturbed limiting average Markov control problems‚ IEEE Trans. Automat. Control AC-37‚ 1421-1425. Bertsekas‚ D. (1995). Dynamic Programming and Optimal Control‚ Vol. I & II‚ Athena Scientific‚ Belmont‚ MA. Billingsley‚ P. (1999). Convergence of Probability Measures‚ 2nd Ed.‚ J. Wiley‚ New York. Blankenship‚ G. (1981). Singularly perturbed difference equations in optimal control problems‚ IEEE Trans. Automat. Control T-AC 26‚ 911-917. Courtois‚ P.J. (1977). Decomposability: Queueing and Computer System Applications‚ Academic Press‚ New York. Chung‚ K.L. (1967). Markov Chains with Stationary Transition Probabilities‚ Second Edition‚ Springer-Verlag‚ New York. Davis‚ M.H.A. (1993). Markov Models and Optimization‚ Chapman & Hall‚ London. Delebecque‚ F. and J. Quadrat. (1981). Optimal control for Markov chains admitting strong and weak interactions‚ Automatica 17‚ 281-296.
Dembo, A. and O. Zeitouni. (1998). Large Deviations Techniques and Applications, Springer-Verlag, New York. Di Masi, G.B. and Yu. M. Kabanov. (1995). A first order approximation for the convergence of distributions of the Cox processes with fast Markov switchings, Stochastics Stochastics Rep. 54, 211-219. Ethier, S.N. and T.G. Kurtz. (1986). Markov Processes: Characterization and Convergence, J. Wiley, New York. Fleming, W.H. and R.W. Rishel. (1975). Deterministic and Stochastic Optimal Control, Springer-Verlag, New York. Fox, G. and P. Glynn. (1990). Discrete-time conversion for simulating finite-horizon Markov processes, SIAM J. Appl. Math., 50, 1457-1473. Gershwin, S.B. (1994). Manufacturing Systems Engineering, Prentice Hall, Englewood Cliffs. Hillier, F.S. and G.J. Lieberman. (1995). Introduction to Operations Research, McGraw-Hill, New York, 6th Ed. Hamilton, J.D. and R. Susmel. (1994). Autoregressive conditional heteroskedasticity and changes in regime, J. Econometrics, 64, 307-333. Hansen, B.E. (1992). The likelihood ratio test under nonstandard conditions: Testing Markov trend model of GNP, J. Appl. Economics, 7, S61-S82. Hoppensteadt, F.C. and W.L. Miranker. (1977). Multitime methods for systems of difference equations, Studies Appl. Math. 56, 273-289. Il'in, A., R.Z. Khasminskii, and G. Yin. (1999a). Singularly perturbed switching diffusions: Rapid switchings and fast diffusions, J. Optim. Theory Appl. 102, 555-591. Il'in, A., R.Z. Khasminskii, and G. Yin. (1999b). Asymptotic expansions of solutions of integro-differential equations for transition densities of singularly perturbed switching diffusions: Rapid switchings, J. Math. Anal. Appl. 238, 516-539. Iosifescu, M. (1980). Finite Markov Processes and Their Applications, Wiley, Chichester. Karlin, S. and H.M. Taylor. (1975). A First Course in Stochastic Processes, 2nd Ed., Academic Press, New York. Karlin, S. and H.M. Taylor. (1981).
A Second Course in Stochastic Processes‚ Academic Press‚ New York. Kendrick‚ D. (1972). On the Leontief dynamic inverse‚ Quart. J. Economics‚ 86‚ 693-696. Khasminskii‚ R.Z. and G. Yin. (1996). On transition densities of singularly perturbed diffusions with fast and slow components‚ SIAM J. Appl. Math.‚ 56‚ 1794-1819. Khasminskii‚ R.Z.‚ G. Yin‚ and Q. Zhang. (1996). Asymptotic expansions of singularly perturbed systems involving rapidly fluctuating Markov chains‚ SIAM J. Appl. Math. 56‚ 277-293.
MODELING UNCERTAINTY
Khasminskii, R.Z., G. Yin, and Q. Zhang. (1997). Constructing asymptotic series for probability distribution of Markov chains with weak and strong interactions, Quart. Appl. Math. LV, 177-200.
Kumar, P.R. and P. Varaiya. (1984). Stochastic Systems: Estimation, Identification and Adaptive Control, Prentice-Hall, Englewood Cliffs, NJ.
Kushner, H.J. (1972). Introduction to Stochastic Control Theory, Holt, Rinehart and Winston, New York.
Kushner, H.J. (1990). Weak Convergence Methods and Singularly Perturbed Stochastic Control and Filtering Problems, Birkhäuser, Boston.
Kushner, H.J. and G. Yin. (1997). Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York.
Lai, T.-L. and S. Yakowitz. (1995). Machine learning and nonparametric bandit theory, IEEE Trans. Automat. Control 40, 1199-1210.
Liu, R.H., Q. Zhang, and G. Yin. (2001). Nearly optimal control of singularly perturbed Markov decision processes in discrete time, to appear in Appl. Math. Optim.
Pan, Z.G. and T. Başar. (1995). H∞-control of Markovian jump linear systems and solutions to associated piecewise-deterministic differential games, in New Trends in Dynamic Games and Applications, G.J. Olsder (Ed.), 61-94, Birkhäuser, Boston.
Pervozvanskii, A.A. and V.G. Gaitsgori. (1988). Theory of Suboptimal Decisions: Decomposition and Aggregation, Kluwer, Dordrecht.
Phillips, R.G. and P.V. Kokotovic. (1981). A singular perturbation approach to modelling and control of Markov chains, IEEE Trans. Automat. Control 26, 1087-1094.
Ross, S. (1983). Introduction to Stochastic Dynamic Programming, Academic Press, New York.
Seneta, E. (1981). Non-negative Matrices and Markov Chains, Springer-Verlag, New York.
Sethi, S.P. and Q. Zhang. (1994). Hierarchical Decision Making in Stochastic Manufacturing Systems, Birkhäuser, Boston.
Simon, H.A. and A. Ando. (1961). Aggregation of variables in dynamic systems, Econometrica 29, 111-138.
Thompson, W.A. Jr. (1988). Point Process Models with Applications to Safety and Reliability, Chapman and Hall, New York.
Tse, D.N.C., R.G. Gallager, and J.N. Tsitsiklis. (1995). Statistical multiplexing of multiple time-scale Markov streams, IEEE J. Selected Areas Comm. 13, 1028-1038.
White, D.J. (1992). Markov Decision Processes, Wiley, New York.
Yakowitz, S. (1969). Mathematics of Adaptive Control Processes, Elsevier, New York.
Yakowitz, S., R. Hayes, and J. Gani. (1992). Automatic learning for dynamic Markov fields with application to epidemiology, Oper. Res. 40, 867-876.
Yakowitz, S., P. L'Ecuyer, and F. Vázquez-Abad. (2000). Global stochastic optimization with low-dispersion point sets, Oper. Res. 48, 939-950.
Yang, H., G. Yin, K. Yin, and Q. Zhang. (2001). Control of singularly perturbed Markov chains: A numerical study, to appear in J. Australian Math. Soc. Ser. B: Appl. Math.
Yin, G. (2001). On limit results for a class of singularly perturbed switching diffusions, to appear in J. Theoretical Probab.
Yin, G. and M. Kniazeva. (1999). Singularly perturbed multidimensional switching diffusions with fast and slow switchings, J. Math. Anal. Appl. 229, 605-630.
Yin, G. and J.F. Zhang. (2001). Hybrid singular systems of differential equations, to appear in Scientia Sinica.
Yin, G. and Q. Zhang. (1997a). Control of dynamic systems under the influence of singularly perturbed Markov chains, J. Math. Anal. Appl. 216, 343-367.
Yin, G. and Q. Zhang (Eds.). (1997b). Mathematics of Stochastic Manufacturing Systems, Proc. 1996 AMS-SIAM Summer Seminar in Applied Mathematics, Lectures in Applied Mathematics, LAM 33, Amer. Math. Soc., Providence, RI.
Yin, G. and Q. Zhang. (1998). Continuous-time Markov Chains and Applications: A Singular Perturbation Approach, Springer-Verlag, New York.
Yin, G. and Q. Zhang. (2000). Singularly perturbed discrete-time Markov chains, SIAM J. Appl. Math. 61, 834-854.
Yin, G., Q. Zhang, and G. Badowski. (2000a). Asymptotic properties of a singularly perturbed Markov chain with inclusion of transient states, Ann. Appl. Probab. 10, 549-572.
Yin, G., Q. Zhang, and G. Badowski. (2000b). Singularly perturbed Markov chains: Convergence and aggregation, J. Multivariate Anal. 72, 208-229.
Yin, G., Q. Zhang, and G. Badowski. (2000c). Occupation measures of singularly perturbed Markov chains with absorbing states, Acta Math. Sinica 16, 161-180.
Yin, G., Q. Zhang, and G. Badowski. (2000d). Decomposition and aggregation of large-dimensional Markov chains in discrete time, preprint.
Yin, G., Q. Zhang, and Q.G. Liu. (2000e). Error bounds for occupation measure of singularly perturbed Markov chains including transient states, Probab. Eng. Informational Sci. 14, 511-531.
Yin, G., Q. Zhang, H. Yang, and K. Yin. (2001). Discrete-time dynamic systems arising from singularly perturbed Markov chains, to appear in Nonlinear Anal., Theory, Methods Appl.
Zhang, Q. (1995). Risk sensitive production planning of stochastic manufacturing systems: A singular perturbation approach, SIAM J. Control Optim. 33, 498-527.
Zhang, Q. (1996). Finite state Markovian decision processes with weak and strong interactions, Stochastics Stochastics Rep. 59, 283-304.
Zhang, Q. and G. Yin. (1996). A central limit theorem for singularly perturbed nonstationary finite state Markov chains, Ann. Appl. Probab. 6, 650-670.
Zhang, Q. and G. Yin. (1997). Structural properties of Markov chains with weak and strong interactions, Stochastic Process. Appl. 70, 181-197.
Zhang, Q. and G. Yin. (1999). On nearly optimal controls of hybrid LQG problems, IEEE Trans. Automat. Control 44, 2271-2282.
Zhang, Q., G. Yin, and E.K. Boukas. (1997). Controlled Markov chains with weak and strong interactions: Asymptotic optimality and application in manufacturing, J. Optim. Theory Appl. 94, 169-194.
Chapter 22

RISK–SENSITIVE OPTIMAL CONTROL IN COMMUNICATING AVERAGE MARKOV DECISION CHAINS

Rolando Cavazos–Cadena
Departamento de Estadística y Cálculo, Universidad Autónoma Agraria Antonio Narro, Buenavista, Saltillo, COAH 25315, México* †
Emmanuel Fernández–Gaucherand
Department of Electrical & Computer Engineering & Computer Science, University of Cincinnati, Cincinnati, OH 45221-0030, USA‡
emmanuel@ececs.uc.edu
Abstract
This work concerns discrete–time Markov decision processes with denumerable state space and bounded costs per stage. The performance of a control policy is measured by a (long–run) risk–sensitive average cost criterion associated with a utility function with constant risk sensitivity coefficient λ, and the main objective of the paper is to study the existence of bounded solutions to the risk–sensitive average cost optimality equation for arbitrary values of λ. The main results are as follows: When the state space is finite, if the transition law is communicating, in the sense that under an arbitrary stationary policy transitions are possible
*This work was partially supported by a U.S.–México Collaborative Program, under grants from the National Science Foundation (NSF-INT 9602939) and the Consejo Nacional de Ciencia y Tecnología (CONACyT) (No. E 120.3336), and by the United Engineering Foundation under grant 00/ER-99.
†We dedicate this paper to our wonderful colleague, Sid Yakowitz, a scholar and friend whom we greatly miss.
‡The support of the PSF Organization under Grant No. 200-350–97–04 is deeply acknowledged by the first author.
between every pair of states, the optimality equation has a bounded solution for arbitrary non-null λ. However, when the state space is infinite and denumerable, the communication requirement and a strong form of the simultaneous Doeblin condition do not, in general, yield a bounded solution to the optimality equation when the risk sensitivity coefficient has a sufficiently large absolute value.
Keywords: Markov decision processes, Exponential utility function, Constant risk sensitivity, Constant average cost, Communication condition, Simultaneous Doeblin condition, Bounded solutions to the risk–sensitive optimality equation.

1. INTRODUCTION
This work considers discrete–time Markov decision processes (MDP's) with (finite or infinite) denumerable state space and bounded costs. Besides a standard continuity–compactness requirement, the main structural feature of the decision model is that, under the action of each stationary policy, every pair of states communicate (see Assumption 2.3 below). On the other hand, it is assumed that the decision maker grades two different random costs according to the expected value of an exponential utility function with (non-null) constant risk sensitivity coefficient λ, and the performance index of a control policy is the risk–sensitive (long–run) average cost criterion. Within this context, the main purpose of the paper is to study the existence of bounded solutions to the risk–sensitive average cost optimality equation corresponding to a non-null value of λ (i.e., the λ-optimality equation), which, under the continuity–compactness conditions in Assumption 2.1, yields an optimal stationary policy with constant risk–sensitive average cost. Thus, we are concerned in this paper with fundamental theoretical issues. The reader is referred to a growing body of literature on the application of risk-sensitive models in operations research and engineering, e.g., Fernández-Gaucherand and Marcus (1997); Avila-Godoy et al. (1997); Avila-Godoy and Fernández-Gaucherand (1998); Avila-Godoy and Fernández-Gaucherand (2000); Shayman and Fernández-Gaucherand (1999). The study of stochastic dynamical systems with risk–sensitive criteria can be traced back, at least, to the work of Howard and Matheson (1972), Jacobson (1973), and Jaquette (1973; 1976).
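To fix ideas, the exponential-utility grading just described can be illustrated numerically. The sketch below (a hedged illustration, not part of the paper; the two-point cost distribution is invented) computes the certain equivalent (1/λ) log E[e^{λX}] of a discrete random cost, which exceeds the mean for λ > 0 (risk-averse) and falls below it for λ < 0 (risk-seeking).

```python
import math

def certain_equivalent(costs, probs, lam):
    """Certain equivalent (1/lam) * log E[exp(lam * X)] of a discrete
    random cost X; lam is the constant risk sensitivity coefficient."""
    if lam == 0.0:
        # Risk-neutral limit: plain expectation.
        return sum(p * x for p, x in zip(probs, costs))
    m = sum(p * math.exp(lam * x) for p, x in zip(probs, costs))
    return math.log(m) / lam

costs, probs = [0.0, 10.0], [0.5, 0.5]  # mean cost is 5
print(certain_equivalent(costs, probs, 1.0))   # risk-averse: above 5
print(certain_equivalent(costs, probs, -1.0))  # risk-seeking: below 5
```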
In particular, in Howard and Matheson (1972) the case of MDP's with finite state and action spaces was considered and, under Assumption 2.3 below and assuming aperiodicity of the transition matrix induced by each stationary policy, a solution to the λ-optimality equation was obtained via the Perron–Frobenius theory of positive matrices for arbitrary λ ≠ 0. Recently, there has been an increasing interest in MDP's endowed with risk–sensitive criteria (Cavazos–Cadena and Fernández–Gaucherand, 1998a–d; Fernández–Gaucherand and Marcus, 1997; Fleming and McEneaney, 1995; Fleming and Hernández–Hernández, 1997b; Hernández–Hernández and Marcus, 1996; James et al., 1994; Marcus et al., 1996; Runolfsson, 1994; Whittle, 1990). In particular, it was shown in Cavazos–Cadena and Fernández–Gaucherand (1998a–d) that, even when the state space is finite, a strong recurrence restriction, like the simultaneous Doeblin condition in Assumption 2.2 below, guarantees the existence of solutions to the λ-optimality equation if λ is sufficiently small, establishing an interesting contrast with the results in Howard and Matheson (1972); this contrast provides the motivation for considering the role of the communication property in the existence of solutions to the λ-optimality equation. The main result of the paper, stated below as Theorem 3.1, can be seen as an extension of the results in Howard and Matheson (1972), in that, avoiding other conditions (e.g., aperiodicity), it is shown that Assumptions 2.1 and 2.3 together imply the existence of solutions to the λ-optimality equation when the state space is finite, and also as a complement to Theorem 3.1 in Cavazos–Cadena and Fernández–Gaucherand (1998a), in that it is shown that, incorporating the communication Assumption 2.3 and requiring the finiteness of the state space, the existence of solutions to the λ-optimality equation is guaranteed for every non-null sensitivity coefficient. On the other hand, it is shown, via a detailed example, that Theorem 3.1 does not extend to the denumerable state space case; see Example 3.1 and Propositions 3.1 and 3.2 in Section 3. In contrast with recent work based on the theory of stochastic games, which presently is a commonly used technique to study risk–sensitive criteria (Hernández–Hernández and Marcus, 1996; Hordijk, 1974; James et al., 1994; Runolfsson, 1994), the approach followed below relies on more self-contained methods within the MDP literature. Indeed, as in Cavazos–Cadena and Fernández–Gaucherand (1998a), the arguments are based on the study of a parameterized expected–total cost problem with stopping time, and the result is obtained by an appropriate selection of the parameter; this approach allows us to include both the risk–averse and risk–seeking cases.
For other recent approaches in the area of stochastic decision processes, the reader is referred to the work of Yakowitz et al., e.g., Lai and Yakowitz (1995); Yakowitz (1995). For earlier work on the foundations of stochastic decision processes, see Yakowitz (1969). The organization of the paper is as follows: In Section 2 the decision model is introduced, and the main result is stated in Section 3 in the form of Theorem 3.1, which assumes a finite state space; also, an example is used to show that the conclusions in Theorem 3.1 cannot be extended to MDP's with infinite and denumerable state space, in general. Section 4 contains the basic consequences of Assumptions 2.1 and 2.3 to be used in the sequel, whereas in Section 5 the auxiliary expected–total cost stopping problems are introduced, and fundamental properties of these problems are stated in Section 6. Then, in Section 7 a proof of Theorem 3.1 is given, and the paper concludes in Section 8 with some brief comments.
Notation. Throughout the remainder, R and N stand for the set of real numbers and the set of nonnegative integers, respectively. Given a nonempty set S endowed with a σ-field, B(S) denotes the space of all measurable, real-valued and bounded functions defined on S, and ||·|| is the corresponding supremum norm. On the other hand, given B ⊂ S, I_B denotes the indicator function of B, that is, I_B(x) = 1 if x ∈ B and I_B(x) = 0 otherwise. Finally, for an event W the corresponding indicator function is denoted by I[W] and, as usual, all relations involving conditional expectations are supposed to hold true almost everywhere with respect to the underlying probability measure without explicit reference.
2. THE DECISION MODEL
Following standard notation (Arapostathis et al., 1993; Hernández–Lerma and Lasserre, 1996; Puterman, 1994), let an MDP be specified by the four-tuple M = (S, A, C, P), where the state space S is denumerable, the separable metric space A is the control (decision, action) set, C: S × A → R is the cost function, and P = [p_xy(a)] is the controlled transition law. The interpretation of the model M is as follows: At each time t ∈ N, the state of a dynamical system is observed, say x_t = x ∈ S, and an action a_t = a ∈ A is chosen. Then a cost C(x, a) is incurred and, regardless of the previous states and actions, the state of the system at time t + 1 will be y ∈ S with probability p_xy(a); this is the Markov property of the decision model. Notice that it is assumed that every a ∈ A is an admissible action at each state but, as noted in Borkar (1984), this condition does not imply any loss of generality. The following standard assumption is assumed to hold throughout the sequel.

Assumption 2.1. (i) The control set A is compact. (ii) For each x, y ∈ S, the mappings a ↦ C(x, a) and a ↦ p_xy(a) are continuous on A.
Policies. For each t ∈ N, the space H_t of histories up to time t is recursively defined by H_0 = S and H_t = H_{t-1} × A × S; a generic element of H_t is denoted by h_t = (x_0, a_0, ..., x_{t-1}, a_{t-1}, x_t), where x_k ∈ S and a_k ∈ A. A control policy π = {π_t} is a sequence where each π_t is a stochastic kernel on A given H_t; that is, for each h_t ∈ H_t, π_t(· | h_t) is a probability measure on A and, for each Borel subset B ⊂ A, π_t(B | ·) is a measurable mapping in h_t; the number π_t(B | h_t) is the probability that the chosen action belongs to B when the system is driven by π and h_t has been observed. Throughout the remainder Π denotes the class of all policies. Given the policy π being used and the initial state x_0 = x, the distribution of the state–action process {(x_t, a_t)} is uniquely determined via Ionescu Tulcea's theorem (Hernández–Lerma and Lasserre, 1996; Hinderer, 1970; Puterman, 1994); such a distribution is denoted
by P_x^π, whereas E_x^π stands for the corresponding expectation operator. Set F = ∏_{x∈S} A, so that F consists of all functions f: S → A. A policy π is stationary if there exists f ∈ F such that, under π, at each time t ∈ N the action applied is determined by a_t = f(x_t). The class of stationary policies is naturally identified with F and, with this convention, F ⊂ Π. Under the action of each stationary policy f ∈ F, the state process {x_t} is a Markov chain with stationary transition mechanism (Arapostathis et al., 1993; Hernández–Lerma and Lasserre, 1996; Puterman, 1994; Ross, 1994).
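The stationary-policy mechanism just described can be sketched in a few lines of code; the two-state MDP, its transition rows, and the policy f below are hypothetical choices made purely for illustration.

```python
import random

# Hypothetical two-state MDP: states {0, 1}, actions {'a', 'b'};
# P[x][u] is the transition row p(y | x, u) over y = 0, 1.
P = {
    0: {'a': [0.9, 0.1], 'b': [0.5, 0.5]},
    1: {'a': [0.2, 0.8], 'b': [0.7, 0.3]},
}
f = {0: 'a', 1: 'b'}  # stationary policy: one fixed action per state

def simulate(x0, n, seed=0):
    """Simulate n steps; under the stationary policy f the state process
    is a Markov chain with transition probabilities p(y | x, f(x))."""
    rng = random.Random(seed)
    path, x = [x0], x0
    for _ in range(n):
        x = rng.choices([0, 1], weights=P[x][f[x]])[0]
        path.append(x)
    return path

print(simulate(0, 10))
```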
Utility Function. For each λ ≠ 0, define the utility function U_λ with (constant) risk sensitivity λ as follows: For x ∈ R,

U_λ(x) = sgn(λ) e^{λx}.   (2.1)

For a given λ ≠ 0 and a (bounded) random cost X, the corresponding certain equivalent C_λ(X) with respect to U_λ is implicitly defined by

U_λ(C_λ(X)) = E[U_λ(X)],   (2.2)

so that a controller with risk sensitivity λ who grades a random cost according to the expectation of U_λ is indifferent between incurring the random cost X or paying the corresponding certain equivalent C_λ(X) for sure. Notice that (2.1) and (2.2) together yield that

C_λ(X) = (1/λ) log E[e^{λX}].   (2.3)

Using Jensen's inequality it follows that, when X is non constant, C_λ(X) > E[X] (resp. C_λ(X) < E[X]) if λ > 0 (resp. λ < 0). A controller assessing a random cost according to the expectation of U_λ is referred to as risk–averse when λ > 0 and risk–seeking if λ < 0; if λ = 0 the decision maker is risk–neutral.

Performance Index. Let π ∈ Π, x ∈ S and λ ≠ 0 be arbitrary. When the system is driven by π and x is the initial state, J_n(λ, x, π) denotes the certain equivalent of the total cost incurred up to time n with respect to U_λ, i.e.,
J_n(λ, x, π) = (1/λ) log E_x^π[exp(λ Σ_{t=0}^{n−1} C(x_t, a_t))]

(see (2.3)), whereas the long–run λ-sensitive average cost under π starting at x is defined by

J(λ, x, π) = lim sup_{n→∞} (1/n) J_n(λ, x, π).   (2.4)

The optimal λ-sensitive average cost at state x is given by

J*(λ, x) = inf_{π∈Π} J(λ, x, π),   (2.5)

and a policy π* ∈ Π is λ-optimal if

J(λ, x, π*) = J*(λ, x) for every x ∈ S.   (2.6)
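The per-stage certain equivalent entering the criterion above can be estimated by straightforward Monte Carlo. The sketch below is illustrative only: the two-state chain and per-stage costs are invented, and a fixed stationary policy is assumed so that actions are suppressed.

```python
import math
import random

# Hypothetical two-state chain under a fixed stationary policy
# (actions suppressed), with per-stage costs c[x] and sensitivity lam.
P = [[0.9, 0.1], [0.2, 0.8]]
c = [1.0, 3.0]
lam = 0.5

def n_stage_average(x0, n, runs=5000, seed=0):
    """Monte Carlo estimate of (1/n) * (1/lam) * log E[exp(lam * S_n)],
    where S_n is the accumulated n-stage cost (per-stage certain
    equivalent); it always lies between min(c) and max(c)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(runs):
        x, total = x0, 0.0
        for _ in range(n):
            total += c[x]
            x = rng.choices([0, 1], weights=P[x])[0]
        acc += math.exp(lam * total)
    return math.log(acc / runs) / (lam * n)

print(n_stage_average(0, 40))
```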
Remark 2.1. From (2.3)–(2.6) it is not difficult to see that |J(λ, x, π)| ≤ ||C|| for every π and x, so that the optimal λ-sensitive average cost satisfies |J*(λ, x)| ≤ ||C||; see, for instance, Cavazos–Cadena and Fernández–Gaucherand (1998a).

Communication and Stability Assumptions. In the risk–neutral case it is well–known that a strong recurrence condition is necessary for the existence of a bounded solution to the corresponding average cost optimality equation (Arapostathis et al., 1993; Bertsekas, 1987; Fernández-Gaucherand et al., 1990; Hernández–Lerma and Lasserre, 1996; Kumar and Varaiya, 1986; Puterman, 1994), which in turn yields an optimal stationary policy in a standard way. For the risk–sensitive average criterion in (2.4)–(2.6), it was shown in Cavazos–Cadena and Fernández–Gaucherand (1998a) that the following stability condition is necessary for the existence of a bounded solution to the λ-optimality equation; see (3.1) below.

Assumption 2.2. (Simultaneous Doeblin Condition (Arapostathis et al., 1993; Hernández–Lerma and Lasserre, 1996; Hordijk, 1974; Puterman, 1994; Ross, 1994; Thomas, 1980).) There exist a state z ∈ S and a positive constant K such that

E_x^f[τ] ≤ K for all x ∈ S and f ∈ F,

where for each x ∈ S the first passage time τ to state z is defined by

τ = min{n ≥ 1 : x_n = z},
with the (usual) convention that the minimum of the empty set is +∞. It was shown in Cavazos–Cadena and Fernández–Gaucherand (1998a–d) that Assumptions 2.1 and 2.2 yield a bounded solution to the λ-optimality equation in (3.1) below whenever the risk sensitivity coefficient λ is small enough. To study the existence of bounded solutions to the λ-optimality equation for arbitrary λ ≠ 0, the following additional condition, used first in Howard and Matheson (1972), will be employed.
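The simultaneous Doeblin condition can be probed numerically by estimating the expected first passage time from each starting state. The three-state transition matrix below is a hypothetical communicating chain chosen for illustration, not the chain of the paper.

```python
import random

# A hypothetical three-state communicating chain under a fixed policy.
P = [[0.5, 0.5, 0.0],
     [0.3, 0.3, 0.4],
     [0.1, 0.0, 0.9]]

def first_passage(x0, z, rng):
    """tau = min{n >= 1 : x_n = z}, as in Assumption 2.2."""
    x, n = x0, 0
    while True:
        x = rng.choices(range(3), weights=P[x])[0]
        n += 1
        if x == z:
            return n

def mean_passage(x0, z, runs=4000, seed=1):
    rng = random.Random(seed)
    return sum(first_passage(x0, z, rng) for _ in range(runs)) / runs

# The simultaneous Doeblin condition asks for one state z and one bound K
# that work for every initial state (and every stationary policy).
for x0 in range(3):
    print(x0, mean_passage(x0, z=0))
```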
Assumption 2.3. (Communication). Under every stationary policy each pair of states communicate; i.e., given x, y ∈ S and f ∈ F, there exists n ∈ N such that P_x^f[x_n = y] > 0.

Remark 2.2. It will be shown in the sequel that, when the state and action spaces are finite, Assumption 2.3 implies that Assumption 2.2 holds true.

Relation to the work of Howard–Matheson. Under Assumption 2.4 below, it was proved in Howard and Matheson (1972), via the Perron–Frobenius theory of positive matrices, that for arbitrary λ ≠ 0 the λ-optimality equation has a (bounded) solution (rewards instead of costs were considered in Howard and Matheson (1972)).

Assumption 2.4. (a) The state and action spaces are finite (notice that in this situation Assumption 2.1 is automatically satisfied); (b) Assumption 2.3 holds and the transition matrix induced by an arbitrary stationary policy is aperiodic.

On the other hand, it has been recently shown in Cavazos–Cadena and Fernández–Gaucherand (1998a) that, even when the state space is finite, under Assumptions 2.1 and 2.2 the λ-optimality equation has a bounded solution only if the risk sensitivity coefficient is small enough, and an example was given showing that this conclusion cannot be extended to arbitrary values of λ. The difference between the conclusions in Cavazos–Cadena and Fernández–Gaucherand (1998a) and Howard and Matheson (1972) stems from the different settings of the two papers. In particular, Assumption 2.3 is imposed in Howard and Matheson (1972), but not in Cavazos–Cadena and Fernández–Gaucherand (1998a), and an additional aperiodicity condition is used in the former reference.
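For a fixed stationary policy on a finite communicating, aperiodic chain, the Howard–Matheson characterization can be reproduced numerically: the risk-sensitive average cost of the policy equals (1/λ) log ρ, where ρ is the Perron root of the matrix with entries e^{λ c(x)} p_xy. The transition matrix, costs, and λ below are illustrative choices, not data from the paper.

```python
import math

# Fixed stationary policy on a finite communicating, aperiodic chain.
P = [[0.9, 0.1], [0.2, 0.8]]
c = [1.0, 3.0]
lam = 0.5

# Howard-Matheson: risk-sensitive average cost = (1/lam) * log(rho),
# with rho the Perron root of M[x][y] = exp(lam * c[x]) * P[x][y].
M = [[math.exp(lam * c[x]) * P[x][y] for y in range(2)] for x in range(2)]

def perron_root(A, iters=500):
    """Power iteration for the dominant eigenvalue of a positive matrix."""
    n = len(A)
    v, rho = [1.0] * n, 1.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(w)
        v = [wi / rho for wi in w]
    return rho

# The resulting average cost lies between min(c) = 1 and max(c) = 3.
print(math.log(perron_root(M)) / lam)
```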
3. MAIN RESULTS
The main problem considered in the paper consists in studying whether Assumptions 2.1–2.3 yield (bounded) solutions to the λ-optimality equation for arbitrary values of the risk sensitivity coefficient λ. It turns out that the answer depends on the state space: If S is finite, Assumption 2.3, combined with Assumption 2.1, implies the existence of a solution to the λ-optimality equation for arbitrary λ ≠ 0; this result is presented below as Theorem 3.1 and gives an extension of that in Howard and Matheson (1972). On the other hand, as will be shown via a detailed example, such a conclusion cannot be extended to the case in which S is countably infinite, thus providing an extension to the results in Cavazos–Cadena and Fernández–Gaucherand (1998a).

Finite State Space Models.
The following theorem shows that Assumptions 2.1 and 2.3 are sufficient to guarantee a (bounded) solution to the λ-optimality equation for arbitrary values of the non-null risk sensitivity coefficient λ. Hence, our results extend those in Howard and Matheson (1972) in that the aperiodicity in Assumption 2.4(b) is not required, and in that our results hold for both the risk-seeking and risk-averse cases.

Theorem 3.1. Let the state space S be finite and suppose that Assumptions 2.1 and 2.3 hold true. In this case, for every λ ≠ 0 there exist a constant g ∈ R and a function h ∈ B(S) such that the following are true.

(i) The pair (g, h) satisfies the λ-optimality equation

g + h(x) = min_{a∈A} [ C(x, a) + (1/λ) log Σ_{y∈S} p_xy(a) e^{λ h(y)} ],  x ∈ S.   (3.1)

(ii) J*(λ, x) = g for each x ∈ S.

(iii) For every x ∈ S, the term in brackets on the right–hand side of (3.1) is a continuous function on A; thus, it has a minimizer f(x) ∈ A, and the corresponding policy f ∈ F is λ-optimal. Moreover, the constant g in (3.1) is unique.

(iv) The pair (g, h(·) − h(z)), where z ∈ S is arbitrary but fixed, satisfies (3.1).
Remark 3.1. Theorem 3.1 provides an extension to the results in Howard and Matheson (1972), in that both the risk–averse and risk–seeking cases are considered, the action space is not restricted to be finite, and the requirement of aperiodicity of the transition matrices associated with stationary policies is avoided; notice that this latter condition is essential in the development of the Perron–Frobenius theory of positive matrices, which was the key tool employed in Howard and Matheson (1972). Also, observing that Assumption 2.3 yields the Doeblin condition in Assumption 2.2 when the state space is finite (see Theorem 4.1 in the next section), Theorem 3.1 above can be seen as an extension to Theorem 3.1 in Cavazos–Cadena and Fernández–Gaucherand (1998a), in that, by restricting the framework to a finite state space, Assumptions 2.1 and 2.3 yield a solution to the λ-optimality equation for every λ ≠ 0. In Cavazos–Cadena and Fernández–Gaucherand (1998a), it was shown that the λ-optimality equation admits a bounded solution only for λ small enough, in contrast to the claims in Hernández–Hernández and Marcus (1996). The (somewhat technical) proof of Theorem 3.1 will be presented in Section 7, after the necessary preliminaries stated in the following three sections. As
in Cavazos–Cadena and Fernández–Gaucherand (1998a), the main idea is to consider a family of auxiliary parameterized stopping problems for the MDP endowed with a risk–sensitive expected–total cost criterion, and then Theorem 3.1 will be obtained by an appropriate selection of the parameter.

Denumerable State Space Models. In view of Theorem 3.1 above it is natural to ask the following: Do Assumptions 2.1–2.3 together yield, for every λ ≠ 0, a bounded solution to the λ-optimality equation when the state space is countably infinite? Such a solution is indeed obtained under the above assumptions for the risk-neutral case (Arapostathis et al., 1993). The following example shows that the answer to the question above is negative.

Example 3.1. For each positive integer define
where … is selected in such a way that …. Now define the MDP as follows: The state space S is countably infinite, the action space A = {a} is a singleton, the transition law is defined by (3.2), whereas the cost function is given by (3.3).
In the following proposition it will be proved that Assumptions 2.1–2.3 hold in this example, and then that the λ-optimality equation does not have a bounded solution if |λ| is large enough.

Proposition 3.1. For the MDP in Example 3.1 above, Assumptions 2.1–2.3 hold true and, moreover, the transition matrix in (3.2) is aperiodic.

Proof. Assumption 2.1 clearly holds in this example, since A is a singleton. To verify Assumption 2.2, let x ∈ S and notice that, since from state x transitions are possible to … or to state …, it is not difficult to see that
and from (3.2) it follows that
so that for every
verifying Assumption 2.2. Now observe that, by (3.2) and (3.4), … and … so that … and then the communication condition in Assumption 2.3 holds. Notice, finally, that … so that the transition law in (3.2) is aperiodic. Observe now that, for Example 3.1, the Poisson equation reduces to the following:
In the following proposition it will be shown that this equation does not admit a bounded solution if |λ| is large enough. First, let … be determined by
Proposition 3.2. For the MDP in Example 3.1, with …, the following assertions hold:
(i)
(ii) (iii) For … there is no pair satisfying (3.5). Proof. (i) From (3.2) and (3.4), it is not difficult to see that for every positive integer … so that
and it is clear that
is equivalent to
(ii) Using (3.7) and part (i), it follows that
(iii) Let … and suppose that … satisfies (3.5). From Theorem 3.1(i) in Cavazos–Cadena and Fernández–Gaucherand (1998a) (see also the verification theorem in Hernández–Hernández and Marcus (1996) for the case …) it follows that … is the optimal average cost at every state, i.e., … Then, from Remark 2.1 it follows that … where … Next, observe that the Poisson equation (3.5) can be written as
and then (see Lemma 4.1 below)
Next observe that … for …, whereas … for …, so that (3.8) yields that for every
To continue, observe that the following properties hold true:
(a) … is finite and then … surely, since …; see also (2.8).
(b) As …
(c) Using … and (3.9), Fatou's lemma yields that …, which is equivalent to …
(d) The following inequality holds:
Using (a)–(d), (3.9) implies, via the dominated convergence theorem, that
i.e.,
In this case, part (ii) implies that … so that … In short, it has been proved that if a pair … exists such that (3.5) is satisfied, then …

4. BASIC TECHNICAL PRELIMINARIES
This and the following two sections contain the ancillary technical results that will be used to establish Theorem 3.1. The main objective is to collect some basic consequences of Assumptions 2.1 and 2.3 which will be useful in the sequel. These results are presented below as Theorems 4.1 and 4.2 and Lemma 4.1. Throughout the remainder of the paper, the state space S is assumed to be finite. The first result establishes that Assumptions 2.1 and 2.3 together imply a strong form of the simultaneous Doeblin condition.

Theorem 4.1. Under Assumptions 2.1 and 2.3, there exists … such that …

The proof of this theorem, based on ideas related to the risk–neutral average cost criterion, is contained in Appendix A. Suppose now that the initial state is … The following result provides a lower bound for the probability of reaching a state … before returning to …
Theorem 4.2. Let with arbitrary but fixed and suppose that Assumptions 2.1 and 2.3 hold true. In this case, (i) There exists a constant
such that, for every
(ii) Given
there exists a positive integer
such that
The arguments leading to the proof of this result are also based on ideas using the risk–neutral average cost criterion, and are presented in Appendix B. The following lemma will be useful in the proof of Theorem 3.1 (which is presented in Section 7). Lemma 4.1. Suppose that Assumptions 2.1 and 2.3 hold true, and let be such that for some and
Let
be fixed.
(i) For every positive integer
(ii) For each
and (iii)
Proof. (i) Define the sequence of random variables for In this case, for each the Markov property and (4.1) together yield
so that
and and
i.e., … is a martingale with respect to the probability measure … and the standard filtration, regardless of the initial state. Therefore, by optional stopping, for every … and … so that
(ii) By Theorem 4.1, on the event (see (2.8)),
always holds, so that, since
and then (4.2) and Fatou’s Lemma yield that
Notice now that for a given Theorem 4.1 implies that there exists a positive integer such that so that
and since S is finite, there exist
Now set
such that
and observe that
so that
by the dominated convergence theorem.
(iii) Since part (ii) yields that
Observe now that (4.2) is equivalent to
and, via part (ii), this equality implies that
5. AUXILIARY EXPECTED–TOTAL COST PROBLEMS: I
The method of analysis of the risk sensitive criterion in (2.4)–(2.6) employed in the sequel will be based on auxiliary stopping problems endowed with a risk–sensitive total expected cost criterion; see also Cavazos–Cadena and Fernández–Gaucherand (1998a). Formally, let … be a given state and let … be a fixed one–stage cost function. The system will be allowed to run until the state … is reached in a positive time, and it will be stopped at that moment without incurring any terminal cost. The total operation cost, namely …, will be assessed according to the expected value of …, and both the maximization and minimization of this criterion will be considered: For … define
whereas
In Section 7, these criteria will be used to establish Theorem 3.1 by selecting where is an appropriate constant. The remainder of the section establishes fundamental properties of the criteria in (5.1) and (5.2) which follow from Assumptions 2.1 and 2.3; recall that the state space is assumed to be finite.
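The stopping criteria in (5.1) and (5.2) can be sketched for an uncontrolled chain (a singleton action set), where the supremum and infimum coincide: run until the distinguished state is reached at a positive time, accumulate a one-stage cost D along the way, and average exp(λ · total). The chain, costs, λ, and distinguished state below are illustrative choices only.

```python
import math
import random

# Uncontrolled sketch of the stopping criterion: starting at x0, run the
# chain until state z is reached at a positive time, accumulating the
# one-stage cost D(x_t) for t = 0, ..., tau - 1.
P = [[0.5, 0.5], [0.6, 0.4]]
D = [1.0, 2.0]
lam, z = 0.2, 0

def stopped_cost(x0, rng):
    x, total = x0, 0.0
    while True:
        total += D[x]
        x = rng.choices([0, 1], weights=P[x])[0]
        if x == z:
            return total

def criterion(x0, runs=20000, seed=2):
    """Monte Carlo estimate of E[exp(lam * total stopped cost)]."""
    rng = random.Random(seed)
    return sum(math.exp(lam * stopped_cost(x0, rng)) for _ in range(runs)) / runs

# For this chain a direct computation gives a value close to 1.97.
print(criterion(0))
```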
Theorem 5.1. Let
be arbitrary but fixed. In this case,
(i) For each
(ii) If (a)
then for every
moreover, (b) The following optimality equation holds:
Similarly,
(iii ) If
(a)
then for every
and
(b) The following optimality equation holds:
Proof. (i) Let be an arbitrary policy, and observe that for every Jensen’s inequality yields
where K is as in Theorem 4.1. Therefore,
completing the proof of part (i), since it is clear that …; see (5.1) and (5.2).
(ii) Let … and … be arbitrary and, for a fixed state …, define the combined policy … as follows: At time … for every …, whereas for … and
with
if If for some
for for
whereas and
then
In words, a controller selecting actions according to … operates as follows: At time … the action applied is chosen using …; the decision maker keeps on using … until … is reached in a positive time, and at that moment the controller switches to … as if the process had started again. Let … be the σ-field generated by the history vector …, and observe that, by the Markov property, the definition of … yields that for each positive integer
and taking expectation with respect to
it follows that
where the equality used that and coincide before the state is reached in a positive time. Therefore, under the condition that it follows
which, since
was arbitrary, yields
Selecting such that (which is possible, by Theorem 4.2(ii)), it follows that establishing part (a), whereas part (b) follows along the lines used in the proof of Theorem 3.1 in Cavazos–Cadena and Fernández–Gaucherand (1998a). (iii) Suppose that such that
In this case, there exists a policy
Next, by the Markov property, for each positive
where the shifted policy is defined by Taking the expectation with respect to
and
it follows that
and after selecting such that (by Theorem 4.2(ii)), it follows that establishing part (a), whereas part (b) can be proved using the same arguments as in the proof of Theorem 3.1 in Cavazos–Cadena and Fernández–Gaucherand (1998a). Theorem 5.2. Let that
and
be arbitrary but fixed, and suppose
In this case, there exists a positive constant c such that
The proof relies on the following two lemmas. Lemma 5.1. Let that
and
be given and suppose
In this case, Proof. Define and, as before, let
and be the
for generated by the history vector Notice now that for arbitrary and
where the shifted policy is defined by and the Markov property was used to obtain the second equality. Therefore, (5.5) yields that
so that is a submartingale with respect to each probability measure and the filtration By optional stopping, for each positive integer
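The optional stopping display that followed was lost; the standard statement being applied is presumably this (generic symbols):

```latex
% If (M_n) is a submartingale with respect to the filtration (\mathcal{F}_n)
% under each probability measure P^{\pi}_{x}, and T is a stopping time, then
% T \wedge n is a bounded stopping time and optional sampling gives
\mathbb{E}^{\pi}_{x}\big[M_{T \wedge n}\big] \;\ge\; \mathbb{E}^{\pi}_{x}\big[M_{0}\big]
\qquad\text{for every } n = 1, 2, \dots,
% and letting n \to \infty (monotone or dominated convergence, as appropriate)
% transfers the bound to M_T.
```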
where it was used that on the event (by Theorem 4.1), so that letting increase to in the last inequality, it follows that
and then, using the condition
this yields that
Observe now that
1, and the conclusion follows, since and were arbitrary. Lemma 5.2. Let be a given function and let
be such that,
In this case (i) There exists
such that
Moreover, (ii) Proof. Let
for every be such that
(i) The starting point is the optimality equation in Theorem 5.1(ii):
Next, define
and
by
where Combining these definitions with (5.6), it is clear that
whereas for
(5.6)–(5.9) yield
and in combination with (5.10) it follows that
Then, Lemma 5.1 implies that
Next, observe that
(see (5.8)), so that, setting
in (5.11),
and then (ii) First, notice that the inequality in (5.12) yields
Next observe that on the event which so that (5.12) yields
there exists with
for
inequality that together with (5.13) implies
and combining this with
it follows that
Notice now that Q on the Borel sets of
so that, setting
by Theorem 4.2(i), and define the measure by
by Jensen’s inequality. Next, observe that
where K is as in Theorem 4.1, and combining this inequality with (5.14) and (5.15) it follows that
To conclude, observe that the mapping is increasing in so that the last inequality yields
where
is as in Theorem 4.2. Then,
so that, by (5.1), (5.2) and (5.11),
After these preliminaries, Theorem 5.2 can now be proved. Proof of Theorem 5.2. Let be such that and write where and N is the number of elements of S. Given a positive integer set and let denote the indicator function of i.e.,
For
consider the following claim:
(a) Applying Lemma 5.2 with
(5.16) hold true for
(b) Suppose that (5.16) is valid for In this case, from an application of Lemma 5.2 with instead of and with replacing D, it follows that there exists such that for every
Then, setting
so that (5.16) holds true for From (a) and (b), it follows that (5.16) is valid for
that is, for for all
6.
AUXILIARY EXPECTED–TOTAL COST PROBLEMS: II
In this section, by examining (3.1) and (5.1)–(5.2), and by an appropriate choice of the one-stage cost function D(·, ·) in the auxiliary expected-total cost MDP, candidate solutions needed to establish Theorem 3.1 are constructed. Theorem 6.1. Let
and
(i) There exists
such that
be arbitrary but fixed.
Similarly, (ii) There exists
such that
The proof of this result is presented in two parts below, as Lemmas 6.1 and 6.2. First, it is convenient to introduce some useful notation.
Definition 6.1. For each
and
and
the sets
are defined as follows:
and
Notice that when so that for every policy and then and (see (5.1) and (5.2)), which yields that both sets in Definition 6.1 are nonempty. On the other hand, observe that when then which yields that for every and then, this implies that and are contained in Lemma 6.1. For given this case, Proof. Let
and
let
In
be a sequence such that
In this case, for every policy
it follows that
Via the monotone convergence theorem, this inequality and (6.1) together yield that
and since
is arbitrary, it follows that
i.e., To conclude, suppose that In this case, Theorem 5.2 with instead of D and instead of yields a positive constant such that so that contradicting the definition of as the supremum of Therefore, and the proof is complete. The following lemma presents results for similar to those in Lemma 6.1; however, the proof is substantially different.
and then (see (5.1)–(5.2)), (ii) Select a sequence
By part (i), for each
for every such that
there exists
such that
where the positive integer N is arbitrary. Since is a compact metric space, there is no loss of generality in assuming that, in addition to (6.3), for some In this case, Assumption 2.1 yields, via Proposition 18 in Royden (1968), p. 232, that , and then (6.4) implies that
and since the integer N > 0 is arbitrary, it follows that
Therefore,
so that
(iii) By parts (i) and (ii), there exists a policy
It will be shown, by contradiction, that that
such that
Thus, suppose
and define a modified MDP by setting and In this case, using (6.5), Lemma 5.2 applied to model yields a positive constant such that and then so that and this contradicts the definition of as the supremum of Therefore and the proof is complete. The two previous lemmas yield Theorem 6.1 immediately. Proof of Theorem 6.1. Let be fixed. Setting part (i) follows from Lemma 6.1, whereas defining Lemma 6.2 yields part (ii).
7.
PROOF OF THEOREM 3.1
After the basic technical results presented in the previous sections, Theorem 3.1 can be established as follows. Proof of Theorem 3.1. Let and be fixed, and define as follows:
whereas
(i) By Theorem 6.1, (7.1) and (7.2) yield that
and
and then, the optimality equations in Theorem 5.1 imply that
equalities that can be condensed into a single one: For
which is equivalent to the
(3.1).
(ii) and (iii) These parts can be obtained as in Theorem 3.1 in Cavazos– Cadena and Fernández–Gaucherand (1998a), or from the verification theorem in Hernández–Hernández and Marcus (1996) for the case
(iv) Let
be such that
where
for some fixed state It will be shown that and that First, notice that from Theorem 3.1 in Cavazos– Cadena and Fernández–Gaucherand (1998a) it follows that so that is the optimal average cost at every state, which is uniquely determined, so that Next, using Assumption 2.1, select such that is a minimizer of the term in brackets in (7.3), so that
In this case, Lemma 4.1 with and respectively, yields that for every
(notice that
instead of
and D ( . ) ,
and
on the event
and
To conclude, consider the following two cases. Case 1: of Lemma 4.1 (i),
Since
in this situation (7.3) implies that and following the same arguments as in the proof
and then, using that
and (7.4) and (7.5) imply that
so that and then
the reverse inequality can be obtained in a similar way,
Case 2: for every in (7.7), so that
Using again that (7.3) yields and then (7.6)–(7.8) occur if the inequalities are reversed and is replaced by Therefore for every and, similarly, so that
8.
CONCLUSIONS
This paper considered Markov decision processes endowed with the risk–sensitive average cost optimality criterion in (2.4)–(2.6). The main result of the paper, namely Theorem 3.1, shows that under standard continuity–compactness conditions (see Assumption 2.1), the communication condition in Assumption 2.3 guarantees the existence of a solution to the stated in (3.1) for arbitrary values of when the state space is finite. Furthermore, it was shown via Example 3.1 that the conclusions in Theorem 3.1 cannot be extended to the case of countably infinite state space models. Hence, the results presented in the paper significantly extend those in Howard and Matheson (1972), and also the recent work of the authors presented in Cavazos–Cadena and Fernández–Gaucherand (1998a-d); see Remark 3.1.
REFERENCES
Arapostathis, A., V.S. Borkar, E. Fernández–Gaucherand, M.K. Ghosh and S.I. Marcus. (1993). Discrete–time controlled Markov processes with average cost criteria: a survey, SIAM Journal on Control and Optimization, 31, 282–334.
Avila-Godoy, G., A. Brau and E. Fernández-Gaucherand. (1997). “Controlled Markov chains with discounted risk-sensitive criteria: applications to machine replacement,” in Proc. 36th IEEE Conference on Decision and Control, San Diego, CA, pp. 1115–1120.
Avila-Godoy, G. and E. Fernández-Gaucherand. (1998). “Controlled Markov chains with exponential risk-sensitive criteria: modularity, structured policies and applications,” in Proc. 37th IEEE Conference on Decision and Control, Tampa, FL, pp. 778–783.
Avila-Godoy, G.M. and E. Fernández-Gaucherand. (2000). “Risk-Sensitive Inventory Control Problems,” in Proc. Industrial Engineering Research Conference 2000, Cleveland, OH.
Bertsekas, D.P. (1987). Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs.
Borkar, V.S. (1984). On minimum cost per unit of time control of Markov chains, SIAM Journal on Control and Optimization, 21, 965–984.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998a). Controlled Markov Chains with Risk-Sensitive Criteria: Average Cost, Optimality Equations, and Optimal Solutions, ZOR: Mathematical Methods of Operations Research. To appear.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998b). Controlled Markov Chains with Risk–Sensitive Average Cost Criterion: Necessary Conditions for Optimal Solutions Under Strong Recurrence Assumptions. Submitted for publication.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998c). Markov Decision Processes with Risk–Sensitive Average Cost Criterion: The Discounted Stochastic Games Approach. Submitted for publication.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998d). The Vanishing Discount Approach in Markov Chains with Risk–Sensitive Criteria. Submitted for publication.
Fernández–Gaucherand, E., A. Arapostathis and S.I. Marcus. (1990). Remarks on the Existence of Solutions to the Average Cost Optimality Equation in Markov Decision Processes, Systems and Control Letters, 15, 425–432. Fernández-Gaucherand, E. and S.I. Marcus. (1997). Risk-Sensitive Optimal Control of Hidden Markov Models: Structural Results. IEEE Transactions on Automatic Control, 42, 1418-1422.
Fleming, W.H. and W.M. McEneaney. (1995). Risk–sensitive control on an infinite horizon, SIAM Journal on Control and Optimization, 33, 1881–1915.
Fleming, W.H. and D. Hernández–Hernández. (1997). Risk sensitive control of finite state machines on an infinite horizon I, SIAM Journal on Control and Optimization, 35, 1790–1810.
Hernández–Hernández, D. and S.I. Marcus. (1996). Risk sensitive control of Markov processes in countable state space, Systems & Control Letters, 29, 147–155.
Hernández–Lerma, O. and J.B. Lasserre. (1996). Discrete-Time Markov Control Processes, Springer, New York.
Hinderer, K. (1970). Foundations of Non–stationary Dynamic Programming with Discrete Time Parameter, Lecture Notes in Operations Research and Mathematical Systems, No. 33, Springer, New York.
Hordijk, A. (1974). Dynamic Programming and Potential Theory, Mathematical Centre Tract No. 51, Mathematisch Centrum, Amsterdam.
Howard, R.A. and J.E. Matheson. (1972). Risk–sensitive Markov decision processes, Management Science, 18, 356–369.
Jacobson, D.H. (1973). Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games, IEEE Transactions on Automatic Control, 18, 124–131.
Jaquette, S.C. (1973). Markov decision processes with a new optimality criterion: discrete time, The Annals of Statistics, 1, 496–505.
Jaquette, S.C. (1976). A utility criterion for Markov decision processes, Management Science, 23, 43–49.
James, M.R., J.S. Baras and R.J. Elliott. (1994). Risk–sensitive control and dynamic games for partially observed discrete–time nonlinear systems, IEEE Transactions on Automatic Control, 39, 780–792.
Kumar, P.R. and P. Varaiya. (1986). Stochastic Systems: Estimation, Identification and Adaptive Control, Prentice-Hall, Englewood Cliffs.
Lai, T.L. and S. Yakowitz. (1995). “Machine Learning and Nonparametric Bandit Theory,” IEEE Trans. Auto. Control, 40, 1199–1209.
Loève, M. (1980). Probability Theory I, Springer, New York.
Marcus, S.I., E. Fernández-Gaucherand, D. Hernández-Hernández, S. Coraluppi and P. Fard. (1996). Risk Sensitive Markov Decision Processes, in Systems & Control in the Twenty-First Century, Progress in Systems and Control, Birkhäuser, C.I. Byrnes, B.N. Datta, D.S. Gilliam and C.F. Martin, eds., 263–279.
Puterman, M. (1994). Markov Decision Processes, Wiley, New York.
Ross, S.M. Applied Probability Models with Optimization Applications, Holden–Day, San Francisco.
Royden, H.L. (1968). Real Analysis, Macmillan, London.
Runolfsson, T. (1994). The equivalence between infinite horizon control of stochastic systems with exponential–of–integral performance index and stochastic differential games, IEEE Transactions on Automatic Control, 39, 1551–1563.
Shayman, M.A. and E. Fernández-Gaucherand. (1999). “Risk–sensitive decision theoretic troubleshooting,” in Proc. 37th Annual Allerton Conference on Communication, Control, and Computing, September 22–24.
Thomas, L.C. (1980). Connectedness conditions for denumerable state Markov decision processes, in: R. Hartley, L.C. Thomas and D.J. White, eds., Recent Advances in Markov Decision Processes, Academic Press, New York.
Whittle, P. (1990). Risk–sensitive Optimal Control, Wiley, New York.
Yakowitz, S., T. Jayawardena and S. Li. (1992). “Machine Learning for Dependent Observations, with Application to Queueing Network Optimization,” IEEE Trans. Auto. Control, 37, 1316–1324.
Yakowitz, S. “The Nonparametric Bandit Approach to Machine Learning,” in Proc. 34th IEEE Conf. Decision & Control, New Orleans, LA, 568–572.
Yakowitz, S. Mathematics of Adaptive Control Processes, Elsevier, New York.
APPENDIX A: PROOF OF THEOREM 4.1 Theorem 4.1 stated in Section 4 is a consequence of the following. Theorem A. Let the state space be finite and suppose that Assumptions 2.1 and 2.3 hold true. In this case, the simultaneous Doeblin condition is valid. More explicitly, there exists such that Notice that the conclusion of Theorem A refers to the class of stationary policies, whereas Theorem 4.1 involves the family of all policies. However, Theorem 4.1 can be deduced from Theorem A as in Cavazos–Cadena and Fernández–Gaucherand (1998a) or Hernández–Hernández and Marcus (1996). The proof of Theorem A has been divided into three steps, presented in the following three lemmas. To begin with, notice that, since S is finite, Assumption 2.2 implies that the Markov chain induced by a stationary policy has a unique invariant distribution, denoted by and characterized by (Loève, 1980)
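The simultaneous Doeblin condition asserted in Theorem A, whose display was lost, is standard and presumably reads as follows (generic symbols):

```latex
% Simultaneous Doeblin condition: there exist a state z \in S, an integer
% m \ge 1 and a constant \alpha \in (0, 1] such that
P^{f}_{x}\big[X_m = z\big] \;\ge\; \alpha
\qquad\text{for every } x \in S \text{ and every stationary policy } f.
```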
Moreover, from Assumption 2.2 it follows that Lemma A.1. (i) The mapping is continuous in
for every state Consequently,
(ii) For each is finite and continuous in Proof. (i) Let be such that To show that it is clearly sufficient to prove that an arbitrary limit point of coincides with Thus, let be a limit point of and notice that, without loss of generality, taking a subsequence if necessary, it can be supposed that
Using that S is finite, it is clear that properties; moreover, for each
and
since each
has these
where the last equality used (A.1) and Assumption 2.1. Therefore, has all the properties to be the unique invariant distribution of the transition matrix determined by so that (ii) As already noted, by Assumption 2.3, so that using the equality (Loève, 1980), the assertion follows from part (i). Lemma A.2. Let
and
(i) There exist a positive integer
with
be arbitrary.
such that
(ii) For each Proof. (i) By Assumption 2.3, there exists a positive integer such that
Let
(ii) Let
so that and notice that
and states
and notice that since for so that (see (A.3))
Next, set
be as in part (i), and notice that
(a) (b)
whereas for
(c) Therefore, denoting the property yields,
generated by
by
the Markov
Combining (a)-(c), it follows that
so that
Since that
and
Lemma A.3. Let
be fixed and let
by Lemma A.1(ii), it follows
be the indicator function of
i.e.,
and set
Then, (i)
and there exists
(ii) Define
Then,
Proof. (i) Since
such that
by
and the following (risk-neutral optimality) equation is satisfied:
and
for
and then Lemma A.1 yields the existence of case, since
it is clear that so that
at which the supremum is achieved. In this by Assumption 2.3.
(ii) Notice that the equality follows from (A.5). Also, using the Markov property, it is not difficult to see that for every
which is equivalent to
To establish (A.6), pick an arbitrary pair and the policy by
and define the discrepancy function
and Combining these definitions with (A.8) it follows that for every
and then [ ],
Combining this last equality with (A.7) and using the fact that Therefore (see (A.9)), pair was arbitrary,
it follows that and then, since the
and (A.6) follows combining this inequality with (A.8). Proof of Theorem A. The notation is as in the statement of Lemma A.3. From the optimality equation (A.6), and using that it follows that for every and
and an induction argument yields that, for every
Observe now that so that (A.10) implies that
and letting
since
increase to
whereas for positive
it follows that
is strictly less than 1, by Lemma A.3 (i),
To conclude observe that in this argument
so that
is arbitrary, so that setting
APPENDIX B: PROOF OF THEOREM 4.2 The proof of Theorem 4.2 relies on the following lemma. Lemma B.1. Suppose that Assumptions 2.1 and 2.3 hold true. For given define the number of times in which the state process visits before returning to the initial state in a positive time, by
In this case, for each
where K is as in Theorem 4.1. [Recall that Proof. For each positive
so that, setting
= the
is the indicator function of
notice that given
generated by
the Markov property yields
]
where the shifted policy Therefore,
is determined by
where K is as in Theorem 4.1. Hence, taking expectation with respect to
it follows that
and then
yields
which, using that
To conclude, observe that given
on the event
and
so that
Proof of Theorem 4.2. (i) Let
be given with Since the simultaneous Doeblin condition holds, there exists such that (Arapostathis et al., 1993; Jaquette, 1973; Jaquette, 1976)
where without loss of generality, it is assumed that
If for each is such that minimizes the term in brackets in (B.1), it follows that [ ]
On the other hand, (B.1) implies that for each
and
where the equality used that as well as using the Markov property, yields that for each positive integer
Next, an induction argument and
Observe now that
and Since
(by Theorem 4.1), almost surely:
so that as
the following convergences hold
and Therefore‚ (B.3) allows to conclude‚ via the dominated convergence theorem‚ that
Setting
in this inequality it follows, since
that
and, since an application of Lemma B.1 implies that the conclusion follows by selecting (ii) Let and the equality some positive integer
with
be given. By part (i), it follows that and the overall conclusion is obtained observing that
and the and using for
Lemma 6.2. Let
and
(i) For each
there exists
such that
In this case,
Let (ii)
be fixed.
and
(iii) Proof. (i) Notice that Theorem 5.1 (iii) yields that
since
and then, by Assumption 2.1, there exists a policy
so that
such that for every
From this equality, a simple induction argument using the Markov property yields that for every positive integer and
and since
this last inequality implies, for each
that
Chapter 23 SOME ASPECTS OF STATISTICAL INFERENCE IN A MARKOVIAN AND MIXING FRAMEWORK George G. Roussas University of California, Davis*
Abstract
This paper is a contribution to a special volume in memory of our departed colleague, Sidney Yakowitz, of the University of Arizona, Tucson. The material discussed is taken primarily from the existing literature on Markovian and mixing processes. Emphasis is given to the statistical inference aspects of such processes. In the Markovian case, both parametric and nonparametric inferences are considered. In the parametric component, the classical approach is used, whereupon the maximum likelihood estimate and the likelihood ratio function are the main tools. Also, methodology is employed based on the concept of contiguity and related results. In the nonparametric approach, the entities of fundamental importance are unconditional and conditional distribution functions, probability density functions, conditional expectations, and quantiles. Asymptotic optimal properties are stated for the proposed estimates. In the mixing context, three modes of mixing are entertained, but only one, the strong mixing case, is pursued to a considerable extent. Here the approach is exclusively nonparametric. As in the Markovian case, the entities estimated are distribution functions, probability density functions and their derivatives, hazard rates, and regression functions. Basic asymptotic optimal properties of the proposed estimates are stated, and precise references are provided. Estimation is preceded by a discussion of probabilistic results necessary for statistical inference. It is hoped that this brief and selected review of the literature on statistical inference in Markovian and mixing stochastic processes will serve as an introduction to this area of research for those who entertain such an interest. The reason for selecting this particular area of research for a review is that a substantial part of Sidney's own contributions has been in this area.
Keywords:
Approximate exponential, asymptotic normality, consistency (weak, strong, in quadratic mean, uniform, with rates), contiguity, differentiability in quadratic mean, design (fixed, stochastic), distribution function, estimate (maximum likelihood, maximum probability, nonparametric, of a distribution function, of a parameter, of a probability density function, of a survival function, of derivatives, of hazard rate, parametric, recursive), likelihood ratio test, local asymptotic normality, Markov processes, mixing processes, random number of random variables, stopping times, testing hypotheses.
*This work was supported in part by a research grant from the University of California, Davis.
1.
INTRODUCTION
Traditionally, much of statistical inference has been carried out under the basic assumption that the observations involved are independent random variables (r.v.s). Most of the time, this is augmented by the supplemental assumption that the r.v.s also have the same distribution, so that we are dealing, in effect, with independent identically distributed (i.i.d.) r.v.s. This set-up is based on two considerations, one of which is mathematical convenience, and the other the fact that these requirements are, indeed, met in a broad variety of applications. As for the stochastic or statistical models employed, they are classified mainly as parametric and nonparametric. In the former case, it is stipulated that the underlying r.v.s are drawn from a model of known functional form except for a parameter belonging to a (usually open) subset the parameter space, of the Euclidean space In the latter case, it is assumed that the underlying model is not known, except that it is a member of a wide class of possible models obeying some very broad requirements. The i.i.d. paradigm described above has subsequently been extended to cover cases where dependence of successive observations is inescapable. An early attempt to model such situations was the introduction of Markov processes, based on the Markovian property. According to this property, and in a discrete time parameter framework, the conditional distribution of given the entire past and the present, depends only on the present, For statistical inference purposes, most of the time we assume the existence of probability density functions (p.d.f.s). Then, if these p.d.f.s depend on a parameter as described above, we are dealing with parametric inference (about if not, we are in the realm of nonparametric inference. Both cases will be addressed below.
According to Markovian dependence, the past is irrelevant, or, to put it perhaps differently, the past is summarized by the present in our attempt to make probability statements about the future. Should that not be the case, then new statistical models must be invented, where the entire past enters into the picture. A class of such models is known under the general name of mixing. There are several modes of mixing used in the literature, but in this paper we shall confine ourselves to only three of them. The fundamental characteristic embodied in the concept of mixing is that the past and the future, as expressed by the underlying stochastic process, are approximately independent, provided they are sufficiently far apart. Precise definitions are given in Section 3 (see
Definitions 3.1 - 3.3). In a way, mixing models provide an enlargement of the class of Markovian models, as it can be demonstrated that most (but not all) Markovian processes are also mixing processes. Unlike the Markovian case, where we have the options of considering a parametric or a nonparametric framework, in a mixing set-up the nonparametric approach is literally forced upon us. To the knowledge and in the experience of this author, there is no general natural way of inserting a parameter in the model to turn it into a parametric one. Lest it be misunderstood that Markovian and mixing models are the only ones providing for dependence of the underlying r.v.s, it should be stated here categorically that there is a plethora of other models available in the literature. Some of them, such as various time series models, have proven highly successful in many applications. There are several reasons for confining ourselves to Markovian and mixing processes alone here. One is parsimony, another is this author's better familiarity with these kinds of processes, and still a third reason, not the least of which, is that these are areas (in particular, Markovian processes) where Sidney Yakowitz himself contributed heavily through his research. The technical parts of this work and relevant references will be given in the following sections. The present section is closed by stating that all limits in this paper are taken as the sample size tends to infinity, unless otherwise specified. This paper is not intended as an exhaustive review of the literature on statistical inference under Markovian dependence or under mixing. Rather, its purpose is to present only some aspects of statistical inference, mostly estimation, and primarily in a nonparametric framework. Such topics are taken, to a large extent, from relevant works of this author and his collaborators.
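The three mixing modes of Definitions 3.1 - 3.3 are not reproduced here; the strong (α-) mixing case, the one pursued below, is standardly defined as follows (generic symbols, not necessarily the author's notation):

```latex
% With \mathcal{F}_{1}^{k} = \sigma(X_1, \dots, X_k) and
% \mathcal{F}_{k+n}^{\infty} = \sigma(X_{k+n}, X_{k+n+1}, \dots), the process is
% strongly mixing when \alpha(n) \to 0 as n \to \infty, where
\alpha(n) \;=\; \sup_{k \ge 1}\,
  \sup\big\{\,|P(A \cap B) - P(A)\,P(B)| \;:\;
  A \in \mathcal{F}_{1}^{k},\; B \in \mathcal{F}_{k+n}^{\infty}\,\big\}.
```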
References are also given to selected works of other researchers with emphasis to the generous contributions to this subject matter by Sidney Yakowitz.
2.
MARKOVIAN DEPENDENCE
Let be r.v.s constituting a (strictly) stationary Markov process, defined on a probability space open subset of and taking values in the real line As it has already been stated, most often in Statistics we assume the existence of p.d.f.s. To this effect, let be the initial p.d.f. (the p.d.f. of with respect to a dominating measure such as the Lebesgue measure), and let be likewise the p.d.f. of the joint distribution of Then is the p.d.f. of the transition distribution of given
2.1.
PARAMETRIC CASE - THE CLASSICAL APPROACH
In this framework, there are two basic problems of statistical inference to be addressed: estimation of and testing hypotheses about Estimation of is usually done by means of the maximum likelihood principle, leading to a Maximum Likelihood Estimate (MLE). This calls for maximizing, with respect to the likelihood (or log-likelihood) function. Thus, consider the likelihood function
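The displays (2.1) and (2.2) were lost in reproduction; in this stationary Markov setting they presumably take the standard form

```latex
% Likelihood and log-likelihood of X_1, \dots, X_n under the Markov model:
L_n(\theta) \;=\; f(X_1; \theta) \prod_{i=2}^{n} f(X_i \mid X_{i-1}; \theta),
\qquad
\ell_n(\theta) \;=\; \log f(X_1; \theta)
  \;+\; \sum_{i=2}^{n} \log f(X_i \mid X_{i-1}; \theta),
```

where f(·; θ) is the initial p.d.f. and f(· | ·; θ) the transition p.d.f. introduced above.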
and let
where all logarithms are taken with base Under suitable regularity conditions, an MLE exists and enjoys several optimality properties. Actually, the parametric inference results in a Markovian framework were well summarized in the monograph by Billingsley (1961a), to which the interested reader is referred for the technical details. From (2.1) and (2.2), we have the likelihood equations
or just
by omitting the first term in (2.3), since all results are asymptotic. The results to be stated below hold under certain conditions imposed on the underlying Markov process. These conditions include strict stationarity for the process, as already mentioned, absolute continuity of the initial and the transition probability measures, joint measurability of the respective p.d.f.s, their differentiability up to order three, integrability of certain suprema, finiteness of certain expectations, and nonsingularity of specified matrices. Their precise formulation is as follows and can be found in Billingsley (1961a); see also Billingsley (1961b). Condition (C1). (i) For each
and
there exists a stationary where
the Borel each
in is measurable with respect to and a probability measure on for each
for
(ii) For each and each assumes a unique stationary distribution i.e., there is a unique probability measure on such that (iii) There is a measure on such that, for each and the p.d.f. is jointly measurable with respect to also, for each and each with p.d.f. jointly measurable with respect to Condition (C2). (i) For each (ii) For each of
and each
the set
is independent
(iii) For each and in up to the third order partial derivatives of with respect to exist and are continuous throughout For
and
ranging from 1 to
set:
Then (iv) For each that, for each
where
and
there is a neighborhood of it, lying in and and as above, it holds:
such
(v) For each where
the
Condition (C3). Let be a let be a mapping from into assume that each and the matrix throughout
matrix
is nonsingular,
open subset of
and Set and has third order partial derivatives has rank
Then Theorem 2.1 in Billingsley (1961a) states that: Theorem 2.1. Under conditions (C1)-(C2) and for each the following are true: (i) There is a solution of (2.4), with tending to 1; (ii) This solution is a local maximum of (that is, essentially, of the log-likelihood function); (iii) The vector in (iv) If is another sequence which satisfies (i) and (iii), then Thus, this theorem ensures, essentially, the existence of a consistent (in the probability sense) MLE of The MLE of the previous theorem is also asymptotically normal, when properly normalized. Rephrasing the relevant part of Theorem 2.2 in Billingsley (1961a) (see also Billingsley (1961b)), we have Theorem 2.2. Let be the MLE whose existence is ensured by Theorem 2.1. Then, under conditions (C1)-(C2),
where the covariance
is given by the expression
Turning now to a testing hypothesis problem, let g be as described in condition (C3), and suppose that we are interested in testing the (composite) hypothesis H₀: θ = g(γ), γ ∈ Γ, at level of significance α. In testing H₀, the likelihood ratio test is employed, and the (approximate) determination of the cut-off point is to be obtained. For this purpose, the asymptotic distribution of the quantity −2 log λ_n, with λ_n the likelihood ratio statistic, is required.
This is, actually, given in Theorem 3.1 of Billingsley (1961a), which states that:

Theorem 2.3. Under conditions (C1)-(C3),

−2 log λ_n → χ²(s − c) in distribution,  (2.5)

where λ_n is the likelihood ratio formed from the likelihood in (2.4), s is the dimension of Θ, and c is the dimension of Γ. In particular, if H₀: θ = θ₀ for some θ₀ ∈ Θ, then (2.5) becomes −2 log λ_n → χ²(s) in distribution.
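The chi-square limit of Theorem 2.3 is easy to observe numerically in the simplest Markovian setting, a two-state chain, where the MLEs of the transition probabilities are the empirical transition frequencies. The following Python sketch (all function names are illustrative, not from the references) tests whether the two rows of the transition matrix coincide, i.e., whether successive observations are independent, so that s − c = 2 − 1 = 1:

```python
import random
from math import log

def simulate_chain(P, n, x0=0, seed=0):
    """Simulate n steps of a two-state Markov chain with transition matrix P."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n):
        x = 1 if rng.random() < P[x][1] else 0
        path.append(x)
    return path

def lrt_independence(path):
    """-2 log lambda_n for H0: both rows of the transition matrix coincide
    (successive observations independent) against the full two-state model."""
    counts = [[0, 0], [0, 0]]
    for a, b in zip(path, path[1:]):
        counts[a][b] += 1
    def ll(p, row):  # Bernoulli log-likelihood of one row of transition counts
        return row[0] * log(1 - p) + row[1] * log(p)
    # Full model: a separate probability of moving to state 1 from each state.
    p_full = [counts[i][1] / sum(counts[i]) for i in (0, 1)]
    # Null model: one common probability, pooled over both current states.
    p_null = (counts[0][1] + counts[1][1]) / (sum(counts[0]) + sum(counts[1]))
    return 2 * (sum(ll(p_full[i], counts[i]) for i in (0, 1))
                - sum(ll(p_null, counts[i]) for i in (0, 1)))

path = simulate_chain([[0.7, 0.3], [0.4, 0.6]], 2000)
stat = lrt_independence(path)  # compare with the chi-square(1) cut-off 3.841 at alpha = 0.05
```

Under the alternative simulated here (row probabilities 0.3 versus 0.6), the statistic lands far beyond the χ²(1) cut-off; under the null it fluctuates according to that reference distribution.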
Classical results on the MLE in the i.i.d. case were derived by Wald (1941, 1943). See also Roussas (1965b) for an extension to the Markovian case of a certain result by Wald, and Roussas (1968a) for an asymptotic normality result of the MLE, again in a Markovian framework. Questions regarding the asymptotic efficiency of the MLE, again in the i.i.d. case, are discussed in Bahadur (1964), and for generalized MLEs in Weiss and Wolfowitz (1966). When an estimate is derived by the principle of maximizing the probability of concentration over a certain class of sets, the resulting estimate is called a maximum probability estimate. Much of the relevant information may be found in Wolfowitz (1965), and in Weiss and Wolfowitz (1967, 1970, 1974). Parametric statistical inference for general stochastic processes is discussed in Basawa and Prakasa Rao (1980).
2.2. PARAMETRIC CASE - THE LOCAL ASYMPTOTIC NORMALITY APPROACH
When two parameter points, θ and θ*, say, are sufficiently far apart, then any reasonable statistical procedure should be capable of distinguishing between them. Things are more complicated, however, if the parameter points are close together. A more precise formulation of the problem is as follows. Choose an arbitrary point θ in Θ, and let {θ_n} be a sequence of parameter points tending to θ. Then study statistical procedures which will lead to differentiation of θ_n and θ. This is one of the things achieved by the introduction of the concept of local asymptotic normality of the log-likelihood. This concept was introduced and developed by Le Cam (1960). A fundamental role in these developments is played by the concept of contiguity, also due to Le Cam. Contiguity may be viewed as a measure of closeness of two sequences of probability measures, and is akin to asymptotic mutual absolute continuity. There are several characterizations of contiguity available. Perhaps the simplest of them is as follows. For n ≥ 1, consider the measurable spaces (𝒳_n, 𝓐_n), and let P_n and P′_n be probability measures defined on 𝓐_n. The sequences {P_n} and {P′_n} are said to be contiguous if, for A_n ∈ 𝓐_n, P_n(A_n) → 0 whenever P′_n(A_n) → 0, and vice versa. Subsequent relevant references are Le Cam (1966), Le Cam (1986),
Chapter 6, and Le Cam and Yang (2000). In a Markovian framework, results of this type are discussed in Roussas (1972). For a brief description of the basic concepts and results under contiguity, see Roussas (2000). As in the previous section, let {X_n} be a stationary Markov process defined on the probability space (𝒳, 𝓐, P_θ); let 𝓐_n be the σ-field induced by the r.v.s X₀, X₁, …, X_n, and let P_{n,θ} be the restriction of P_θ to 𝓐_n. For simplicity, it will be assumed that the measures P_{n,θ} and P_{n,θ*} are mutually absolutely continuous for any θ, θ* ∈ Θ and all n ≥ 0, so that the likelihood function L_n(θ*, θ) = dP_{n,θ*}/dP_{n,θ} is well defined. Then L_n(θ*, θ) is expressed as follows:

L_n(θ*, θ) = [f(X₀; θ*)/f(X₀; θ)] × ∏_{j=1}^n [f(X_{j−1}, X_j; θ*)/f(X_{j−1}, X_j; θ)].  (2.7)

Replace L_n(θ*, θ) by its logarithm, and set Λ_n(θ*, θ) = log L_n(θ*, θ) for the resulting log-likelihood. One of the objectives of this section is to discuss the asymptotic distribution of Λ_n(θ_n, θ) under the probability measures P_{n,θ} and P_{n,θ_n}. Clearly, an immediate consequence of it would be the (approximate) determination of the cut-off point when testing the null hypothesis that θ is the true parameter value against appropriate (local) alternatives θ_n, and also the determination of the power of the test, when the test is based on the likelihood function. A basic assumption made here is that, for each θ ∈ Θ, the random function [f(X₀, X₁; ·)]^(1/2) is differentiable in quadratic mean at θ when the probability measure P_θ is used. That is, there is a random vector φ(θ), the derivative in quadratic mean of [f(X₀, X₁; ·)]^(1/2) at θ, such that

E_θ{[f(X₀, X₁; θ + h)]^(1/2) − [f(X₀, X₁; θ)]^(1/2) − h′φ(θ)}² = o(|h|²) as h → 0,
where the prime denotes transpose. (For a more extensive discussion, the interested reader is referred to Lind and Roussas (1977).) Define the random vector Δ_n(θ) and the covariance Γ(θ) by:

Δ_n(θ) = (2/n^(1/2)) Σ_{j=1}^n φ(X_{j−1}, X_j; θ)/[f(X_{j−1}, X_j; θ)]^(1/2),  (2.9)

Γ(θ) = 4 E_θ{φ(X₀, X₁; θ) φ′(X₀, X₁; θ)/f(X₀, X₁; θ)}.  (2.10)
Then, Theorems 2.4-2.6 stated below hold under the following set of assumptions.

(A1) For each θ ∈ Θ, the underlying Markov process is (strictly) stationary and ergodic.

(A2) The probability measures P_{n,θ} and P_{n,θ*} are mutually absolutely continuous for all θ, θ* ∈ Θ and all n ≥ 0.

(A3) (i) For each θ ∈ Θ, the random function [f(X₀, X₁; ·)]^(1/2) is differentiable in quadratic mean with respect to the parameter at the point θ, when P_θ is employed, and let φ(θ) be the derivative in quadratic mean involved. (ii) φ(θ) is measurable with respect to the σ-field of Borel subsets of the appropriate Euclidean space. (iii) The covariance Γ(θ), defined by (2.10), is positive definite for every θ ∈ Θ.

(A4) Certain continuity requirements, in quadratic mean, hold for the derivative φ(·) and for the likelihood defined in (2.7), as θ* → θ and n → ∞.

For some comments on these assumptions, and examples where they are satisfied, see pages 45-52 in Roussas (1972). Theorems 2.4-2.6, given below, are a consolidated restatement of Theorems 4.1-4.6, pages 53-54, and Theorem 1.1, page 72, in the reference just cited.

Theorem 2.4. Let Δ_n(θ) and Γ(θ) be defined by (2.9) and (2.10), respectively, and let θ_n = θ + h n^(−1/2), h ∈ ℝˢ. Then, under assumptions (A1)-(A4):
(i) Λ_n(θ_n, θ) − h′Δ_n(θ) + (1/2) h′Γ(θ)h → 0 in P_θ-probability;
(ii) Δ_n(θ) → N(0, Γ(θ)) in distribution under P_θ;
(iii) Λ_n(θ_n, θ) → N(−(1/2) h′Γ(θ)h, h′Γ(θ)h) in distribution under P_θ.
Also:

Theorem 2.5. In the notation of the previous theorem, and under the same assumptions:
(i) Λ_n(θ_n, θ) − h′Δ_n(θ) + (1/2) h′Γ(θ)h → 0 in P_{n,θ_n}-probability;
(ii) Δ_n(θ) → N(Γ(θ)h, Γ(θ)) in distribution under P_{n,θ_n};
(iii) Λ_n(θ_n, θ) → N((1/2) h′Γ(θ)h, h′Γ(θ)h) in distribution under P_{n,θ_n}.
It should be mentioned at this point that, unlike the classical approach, Theorem 2.5 is obtained from Theorem 2.4 without much effort at all. This is so because of the contiguity of the sequences {P_{n,θ}} and {P_{n,θ_n}}, established in Proposition 6.1, pages 65-66, in Roussas (1972), in conjunction with Corollary 7.2, page 35, Lemma 7.1, pages 36-37, and Theorem 7.2, pages 38-39, in the same reference. As an application of Theorems 2.4 and 2.5, one may construct tests, based essentially on the log-likelihood function, which are either asymptotically uniformly most powerful or asymptotically uniformly most powerful unbiased.
Application 2.1. Let θ be real-valued, and consider testing the hypothesis H₀: θ = θ₀ against the alternative A: θ > θ₀ at level of significance α. Define the test functions φ_n by:

φ_n = 1, γ_n, or 0, according as Δ_n(θ₀) >, =, or < c_n,

where c_n and γ_n are defined by E_{θ₀} φ_n = α for all n. Then the sequence {φ_n} is asymptotically uniformly most powerful (AUMP) in the sense that, if {φ_n*} is any other sequence of test functions of level α, then lim sup_n [E_{θ_n} φ_n* − E_{θ_n} φ_n] ≤ 0, uniformly over bounded sets of local alternatives θ_n.
For the justification of the above assertion, and certain variations of it, see Theorems 3.1 and 3.2, Corollaries 3.1 and 3.2, and the subsequent examples, in pages 100-107 of Roussas (1972). A rough interpretation of part (i) of Theorem 2.4 is that, for all sufficiently large n and for θ_n in a neighborhood of θ, the likelihood function behaves as follows:

L_n(θ_n, θ) ≈ exp[h′Δ_n(θ) − (1/2) h′Γ(θ)h];
that is, L_n is approximately exponential, with Δ_n(θ) being the all-important random vector appearing in the exponent. This statement is made precise as follows. Let Δ_n*(θ) be a suitable truncated version of Δ_n(θ), for which

Δ_n*(θ) − Δ_n(θ) → 0 in P_θ-probability,

and define the (exactly) exponential measure R_{n,h} by

dR_{n,h} = C_n(h) exp[h′Δ_n*(θ)] dP_{n,θ},

where C_n(h) is a normalizing constant. Then, we have:

Theorem 2.6. Let θ_n = θ + h_n n^(−1/2), where {h_n} is any bounded sequence in ℝˢ. Then, under assumptions (A1)-(A4), the sup-norm distance between P_{n,θ_n} and R_{n,h_n} tends to 0.
This result provides for the following application.

Application 2.2. Let θ be real-valued, and consider testing the hypothesis H₀: θ = θ₀ against the alternative A: θ ≠ θ₀ at asymptotic level of significance α. Define the test functions φ_n by:

φ_n = 1 if Δ_n(θ₀) < c_{1n} or Δ_n(θ₀) > c_{2n}, and φ_n = 0 otherwise,

where c_{1n} and c_{2n} are chosen so that the level and unbiasedness side conditions are satisfied asymptotically. Then the sequence {φ_n} is asymptotically uniformly most powerful unbiased (AUMPU) of asymptotic level of significance α. That is, it is asymptotically unbiased, lim inf_n E_{θ_n} φ_n ≥ α along the local alternatives θ_n, and, for any other sequence {φ_n*} which is also of asymptotic level of significance α and asymptotically unbiased, it holds lim sup_n [E_{θ_n} φ_n* − E_{θ_n} φ_n] ≤ 0.
For a discussion of this application, see Theorem 5.1, pages 115-121, in Roussas (1972). Theorem 2.6 may be used in testing such hypotheses even in cases not covered by the above applications. The result obtained enjoys several asymptotically optimal properties. Actually, the
discussion of such properties is the content of Theorems 2.1 and 2.2, pages 170-171, Theorem 4.1, pages 183-184, and Theorems 6.1 and 6.2, pages 191-196, in Roussas (1972). This theorem can also be employed in obtaining a certain representation of the asymptotic distribution of a class of estimates of θ; these estimates need not be MLEs. To be more precise, for an arbitrary h ∈ ℝˢ and any θ ∈ Θ, set θ_n = θ + h n^(−1/2), so that θ_n ∈ Θ for all sufficiently large n. Let {T_n} be a class of estimates of θ defined as follows: n^(1/2)(T_n − θ_n) has a limiting distribution under P_{n,θ_n} which does not depend on h.

Then the limiting probability measure is represented as a convolution of two probability measures, one of them being the N(0, Γ⁻¹(θ)) distribution, where Γ(θ) is given in (2.10). The significance of this result derives from its usefulness as a tool for studying asymptotic efficiency of estimates and minimax issues. For its precise formulation and proof, the reader is referred to Theorem 3.1, pages 136-141, in Roussas (1972). Some of the original papers on which the above results are based are Roussas (1965a), Le Cam (1966), Johnson and Roussas (1969, 1970, 1972), Hájek (1970), Inagaki (1970), and Roussas and Soms (1973). Some concrete examples are treated in Stamatelos (1976). In Roussas (1968b), contiguity results are employed to study asymptotic efficiency of estimates considered earlier by Schmetterer (1966) in the framework of independence. In Roussas (1979), results such as those stated in Theorems 2.4 and 2.5 were obtained for general stochastic processes.

Statistical procedures are often carried out in accordance with a stopping time. A stopping time defined on the sequence {X_n} is a nonnegative integer-valued r.v. ν_n, tending nondecreasingly to ∞ a.s., and such that, for each k, the event (ν_n = k) is determined by X₀, …, X_k. Next, let {τ_n} be positive real numbers tending to ∞, such that ν_n/τ_n converges in probability, and let Δ and Γ be defined by (2.9) and (2.10), respectively, with n replaced by ν_n. Then suitable versions of Theorems 2.4 and 2.5 hold true. Relevant details may be found in Roussas and Bhattacharya (1999a); see Theorems 1.3.1 and 1.3.2 there. Also, for a version of Theorem 2.6 in the present framework, see Theorem 1.2.3 in Roussas and Bhattacharya (1999b). See also the papers Akritas et al. (1979) and Akritas and Roussas (1979). A concrete case where random times appear in a natural way is the case of semi-Markov processes. A continuous time parameter semi-Markov process with finite state space consists of r.v.s defined on a probability space and taking values in a finite set. Furthermore, the process moves from state to state according to transition probabilities,
and it stays in any given state i, before it moves to the next state j, a random amount of time. This time depends on both i and j, and it is not necessarily exponential, as is the case in Markovian processes. It may be shown that, under suitable conditions, the process may be represented in terms of a Markov chain taking values in the finite state space, and of stopping times taking values in {1, 2, …}. For an extensive discussion of this problem, the interested reader is referred to Roussas and Bhattacharya (1999b). See also Akritas and Roussas (1980).
2.3. THE NONPARAMETRIC CASE
Suppose that the i.i.d. r.v.s X₁, …, X_n have a p.d.f. f, and that we wish to estimate f nonparametrically at a given point x. Although there are several ways of doing this, we are going to focus exclusively on the so-called kernel method of estimation. This method originated with Rosenblatt (1956a), and it was further developed by Parzen (1962). Presently, there is a vast amount of literature on this subject matter. Thus, f(x) is estimated by f_n(x), defined by

f_n(x) = (1/(n b_n)) Σ_{j=1}^n K((x − X_j)/b_n),
where the kernel K is a known function, usually a p.d.f., and {b_n} is a bandwidth sequence of positive numbers tending to 0. Under suitable regularity conditions, it is shown that the estimate f_n(x) possesses several optimal properties, such as consistency, uniform consistency over compact subsets of ℝ or over the entire real line, rates of convergence, asymptotic normality, and asymptotic efficiency. Derivatives of f may also be estimated likewise, and properties for the estimates can be established similar to the ones just cited. No further elaboration will be made here on this matter in the context of i.i.d. r.v.s. Some excellent references on this subject are given at the end of this subsection. Up to 1969, the entire literature on kernel estimation of a p.d.f. and related quantities was exclusively restricted to the i.i.d. framework. The papers Roussas (1969a, b) were the first ones to address such issues in a Markovian set-up. These papers were followed by those of Rosenblatt (1969, 1970, 1971). So, let {X_n} be a stationary Markov process with initial distribution function (d.f.) F, one-step transition d.f. F(·|x), initial p.d.f. f(·), joint p.d.f. f(·, ·) of (X_{j−1}, X_j), and one-step transition p.d.f. f(·|x). Two of the most important problems here are those of (nonparametrically) estimating the transition p.d.f. f(y|x) and the transition d.f. F(y|x). In the process of doing so, one also estimates F, f(·), and f(·, ·). For a brief description of the basics, consider the segment X₀, X₁, …, X_n from the underlying Markov process, and estimate f(x) by
f_n(x) = (1/(n b_n)) Σ_{j=1}^n K((x − X_{j−1})/b_n),  (2.16)

where the kernel K is a known p.d.f. and {b_n} is a bandwidth sequence as described above. The joint p.d.f. f(x, y) is estimated by f_n(x, y), where

f_n(x, y) = (1/(n b_n²)) Σ_{j=1}^n K((x − X_{j−1})/b_n) K((y − X_j)/b_n).  (2.17)

Then, for each x with f_n(x) > 0, the natural estimate of the transition p.d.f. f(y|x) is

f_n(y|x) = f_n(x, y)/f_n(x).  (2.18)
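The estimates (2.16)-(2.18) translate directly into code. A minimal Python sketch, under the assumption of a standard normal kernel (function names are illustrative only):

```python
from math import exp, pi, sqrt

def gauss(u):
    """Standard normal kernel K."""
    return exp(-0.5 * u * u) / sqrt(2 * pi)

def f_marg(x, data, b):
    """Kernel estimate of the marginal p.d.f. f(x), as in (2.16)."""
    n = len(data) - 1  # number of pairs (X_{j-1}, X_j)
    return sum(gauss((x - data[j - 1]) / b) for j in range(1, n + 1)) / (n * b)

def f_joint(x, y, data, b):
    """Kernel estimate of the joint p.d.f. f(x, y) of (X_{j-1}, X_j), as in (2.17)."""
    n = len(data) - 1
    return sum(gauss((x - data[j - 1]) / b) * gauss((y - data[j]) / b)
               for j in range(1, n + 1)) / (n * b * b)

def f_trans(y, x, data, b):
    """Plug-in estimate of the one-step transition p.d.f. f(y|x), as in (2.18)."""
    return f_joint(x, y, data, b) / f_marg(x, data, b)
```

Note that, by construction, the estimated transition p.d.f. integrates to 1 in y for every x at which f_marg is positive, since the y-kernel integrates the bandwidth out of (2.17).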
Some of the properties of the estimates given by (2.16)-(2.18) are summarized in the theorem below. The assumptions under which these results hold, as well as their justification, can be found in Theorems 2.2, 3.1, 4.2, 4.3 and Corollary 3.1 of Roussas (1969a), and in Theorems 4.1 and 4.2 of Roussas (1988b).

Theorem 2.7. Under suitable regularity conditions (see material presented below, right after Theorem 2.12), the estimates f_n(x), f_n(x, y) and f_n(y|x) of f(x), f(x, y) and f(y|x), respectively, have the following properties:
(i) Asymptotic unbiasedness: E f_n(x) → f(x) and E f_n(x, y) → f(x, y).
(ii) Consistency in quadratic mean: E[f_n(x) − f(x)]² → 0 and E[f_n(x, y) − f(x, y)]² → 0, and these convergences are uniform over compact subsets of ℝ and ℝ², respectively.
(iii) Weak consistency: f_n(y|x) → f(y|x) in probability, and this convergence is uniform over compact sets of continuity points.
(iv) Strong consistency with rates: δ_n⁻¹[f_n(x) − f(x)] → 0 a.s. and δ_n⁻¹[f_n(x, y) − f(x, y)] → 0 a.s., for a suitable norming sequence δ_n → 0.
(v) Uniform strong consistency with rates: δ_n⁻¹ sup{|f_n(x) − f(x)|; x ∈ J} → 0 a.s. and δ_n⁻¹ sup{|f_n(x, y) − f(x, y)|; (x, y) ∈ J′} → 0 a.s., for any compact sets J and J′ and some suitable norming sequence δ_n → 0.
(vi) Asymptotic normality: (n b_n)^(1/2)[f_n(x) − E f_n(x)] → N(0, σ²(x)) and (n b_n²)^(1/2)[f_n(x, y) − E f_n(x, y)] → N(0, σ²(x, y)) in distribution,
where σ²(x) and σ²(x, y) are expressed in terms of f(x), f(x, y) and the kernel K.

Remark 2.1. In part (vi), centering at f(x) and f(x, y) is also possible.

Let us turn now to the estimation of the other important quantity, namely, the transition d.f. F(y|x). The initial d.f. F may also be estimated, and the usual popular estimate of it would be the empirical d.f.; namely,

F̂_n(x) = (1/(n + 1)) Σ_{j=0}^n I(X_j ≤ x).

However, regarding the d.f. F(·|x), no such estimate is available. Instead, F(y|x) is estimated, naturally enough, as follows:

F_n(y|x) = ∫_{−∞}^y f_n(t|x) dt.  (2.19)

Of course, F could be also estimated in a similar manner; that is,

F_n(x) = ∫_{−∞}^x f_n(t) dt.  (2.20)

The estimates given by (2.19) and (2.20) are seen to enjoy the familiar Glivenko-Cantelli uniform strong consistency property; namely:

Theorem 2.8. Under suitable regularity conditions, and with F_n(·) and F_n(·|x) given by (2.20) and (2.19), respectively, it holds:
(i) sup{|F_n(x) − F(x)|; x ∈ ℝ} → 0 a.s.;
(ii) sup{|F_n(y|x) − F(y|x)|; y ∈ ℝ} → 0 a.s., for all x with f(x) > 0.
The justification of these results is given in Theorems 3.1 and 3.2 of Roussas (1969b). In addition, the estimate F_n(y|x) is shown to be asymptotically normal, as the following result states. Its proof is found in Roussas (1991a); see Theorem 2.3 there.

Theorem 2.9. Under certain regularity conditions, the estimate F_n(y|x), given by (2.19), is asymptotically normal; that is,

(n b_n)^(1/2)[F_n(y|x) − F(y|x)] → N(0, σ²(x, y)) in distribution,

where the asymptotic variance σ²(x, y) is expressed in terms of F(·|x), f(x) and the kernel K.
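A computable variant of such a conditional d.f. estimate is the kernel-weighted indicator form, one of several plug-in versions consistent with the smoothing above; the following Python sketch (names illustrative, not the exact estimator of the references) conveys the idea:

```python
from math import exp

def kweight(u):
    # Gaussian-shaped weight; any bounded kernel p.d.f. works here
    return exp(-0.5 * u * u)

def cond_cdf(y, x, data, b):
    """Kernel-weighted estimate of the one-step transition d.f. F(y|x): the
    weighted fraction of successors X_j not exceeding y, where the weights
    concentrate on pairs whose predecessor X_{j-1} lies near x."""
    num = den = 0.0
    for prev, nxt in zip(data, data[1:]):
        w = kweight((x - prev) / b)
        den += w
        if nxt <= y:
            num += w
    return num / den if den > 0 else 0.0
```

The estimate is, by construction, a proper d.f. in y: it is nondecreasing, equal to 0 below the smallest successor, and equal to 1 above the largest one.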
Another important problem in the present context is that of estimating all (finite, conditional) moments of X_n, given X_{n−1} = x; namely,

m_k(x) = E(X_nᵏ | X_{n−1} = x), k = 1, 2, … .  (2.21)

The natural estimate of m_k(x) would be

∫ yᵏ f_n(y|x) dy,  (2.22)

where f_n(y|x) is given by (2.18). It turns out that the quantity in (2.22) is somewhat complicated, and it is replaced, instead, by m_{nk}(x), where

m_{nk}(x) = Σ_{j=1}^n X_jᵏ K((x − X_{j−1})/b_n) / Σ_{j=1}^n K((x − X_{j−1})/b_n).  (2.23)

The estimate (2.23) is consistent and not much different from the estimate (2.22), as the following result states.

Theorem 2.10. Under suitable regularity conditions, and for the estimates given by (2.23) and (2.22):

m_{nk}(x) − ∫ yᵏ f_n(y|x) dy → 0 and m_{nk}(x) → m_k(x), in probability.

These facts are established in Theorem 4.1 of Roussas (1969b).

Remark 2.2. In (2.21), we could consider, more generally, a conditional expectation of the form E[g(X_n) | X_{n−1} = x], for some (measurable) function g,
provided, of course, E|g(X_n)| < ∞. The estimation procedure, and the establishment of consistency of the respective estimate, would be quite similar. As is well known, still another important problem here is that of estimating the p-th quantile of F(·|x), for any p ∈ (0, 1). It is assumed that, for each x, the equation F(y|x) = p has a unique root, the p-th quantile of F(·|x), to be denoted by ξ_p(x). An obvious estimate of ξ_p(x) would be any root of the equation F_n(y|x) = p. For reasons of definiteness, we consider the smallest such root, to be denoted by ξ_{np}(x). Regarding this estimate, the following consistency result holds, as is proven in Theorem 5.1 of Roussas (1969b).

Theorem 2.11. Under suitable regularity conditions, and with ξ_{np}(x) as defined above, it holds that ξ_{np}(x) → ξ_p(x) in probability, and F_n(ξ_{np}(x)|x) → p.
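Both the ratio estimate (2.23) and a conditional-quantile plug-in of the kind used in Theorem 2.11 are short computations; a minimal Python sketch (the kernel choice, the bisection search, and all names are illustrative assumptions):

```python
from math import exp

def kern(u):
    # Gaussian-shaped weight (any kernel-type p.d.f. would do)
    return exp(-0.5 * u * u)

def nw_moment(x, data, b, k=1):
    """Ratio (Nadaraya-Watson type) estimate, as in (2.23), of E(X_n^k | X_{n-1} = x)."""
    num = den = 0.0
    for prev, nxt in zip(data, data[1:]):
        w = kern((x - prev) / b)
        num += w * nxt ** k
        den += w
    return num / den

def cond_quantile(p, x, data, b, lo=-10.0, hi=10.0, tol=1e-6):
    """Approximate smallest root of F_n(y|x) = p by bisection, where F_n is a
    kernel-weighted conditional d.f. -- a plug-in in the spirit of Theorem 2.11."""
    def F(y):
        num = den = 0.0
        for prev, nxt in zip(data, data[1:]):
            w = kern((x - prev) / b)
            den += w
            if nxt <= y:
                num += w
        return num / den
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) >= p:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Since F(·|x) is estimated by a nondecreasing function of y, bisection between points where the estimate is below and above p converges to the desired smallest root.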
This section is concluded with certain aspects of recursive estimation. The transition d.f. F(y|x) was estimated by F_n(y|x), defined in (2.19) in terms of f_n(y|x). In turn, f_n(y|x) was defined in (2.18) by means of f_n(x, y) and f_n(x), given in (2.17) and (2.16), respectively. Suppose now that we are faced with the same estimation problems, but the observations become available in a sequential manner. On the basis of the segment X₀, X₁, …, X_n of r.v.s from an underlying Markov process, we construct estimates such as f_n(x), f_n(x, y) and f_n(y|x). Next, suppose that a new observation X_{n+1} becomes available. Then the question arises as to how this observation is incorporated in the available estimates, without starting from scratch. Clearly, this calls for a recursive formula which would facilitate such an implementation. At this point, we are going to consider a recursive estimate for f(y|x). In the first place, appropriate estimates for f(x) and f(x, y), to be denoted here by f_n*(x) and f_n*(x, y), are, respectively:

f_n*(x) = (1/n) Σ_{j=1}^n (1/b_j) K((x − X_{j−1})/b_j),  (2.24)

f_n*(x, y) = (1/n) Σ_{j=1}^n (1/b_j²) K((x − X_{j−1})/b_j) K((y − X_j)/b_j),  (2.25)

and, of course,

f_n*(y|x) = f_n*(x, y)/f_n*(x).  (2.26)

Then f_n*(x) and f_n*(x, y) are given recursively by the formulas

f_n*(x) = [(n − 1)/n] f_{n−1}*(x) + (1/(n b_n)) K((x − X_{n−1})/b_n),  (2.27)

f_n*(x, y) = [(n − 1)/n] f_{n−1}*(x, y) + (1/(n b_n²)) K((x − X_{n−1})/b_n) K((y − X_n)/b_n),  (2.28)

where f₀*(x) = f₀*(x, y) = 0, and all n ≥ 1.
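Recursions of the type (2.27)-(2.28) update the estimates with O(1) work per new observation. A minimal Python sketch of such a recursive kernel scheme, at a fixed evaluation point (class and attribute names are illustrative assumptions):

```python
from math import exp, pi, sqrt

def K(u):
    """Standard normal kernel."""
    return exp(-0.5 * u * u) / sqrt(2 * pi)

class RecursiveKernel:
    """Recursive kernel estimates of f(x) and f(x, y) at a fixed point (x, y),
    updated in the spirit of (2.27)-(2.28): each new pair costs O(1) work."""
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.n = 0
        self.f_marg = 0.0   # running estimate of f(x)
        self.f_joint = 0.0  # running estimate of f(x, y)

    def update(self, prev, nxt, b):
        """Fold in the new pair (X_{n-1}, X_n) = (prev, nxt) with bandwidth b_n = b."""
        self.n += 1
        keep = (self.n - 1) / self.n
        kx = K((self.x - prev) / b)
        self.f_marg = keep * self.f_marg + kx / (self.n * b)
        self.f_joint = keep * self.f_joint + kx * K((self.y - nxt) / b) / (self.n * b * b)

    def f_cond(self):
        """Plug-in estimate of the transition p.d.f. f(y|x), as in (2.26)."""
        return self.f_joint / self.f_marg if self.f_marg > 0 else 0.0
```

With a constant bandwidth the recursion reproduces the corresponding batch sums exactly; with a decreasing sequence b_j it accumulates the sums appearing in (2.24)-(2.25).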
Below, we state asymptotic normality only for the recursive estimate f_n*(y|x), whose proof is found in Roussas (1991b), Theorem 2.1.

Theorem 2.12. Let the recursive estimate f_n*(y|x) of f(y|x) be given by (2.26) (see also relations (2.24)-(2.25) and (2.27)-(2.28)). Then, under suitable regularity conditions, this estimate is asymptotically normal; that is,

(n b_n)^(1/2)[f_n*(y|x) − f(y|x)] → N(0, σ²(x, y)) in distribution,

where σ²(x, y) involves f(x), f(x, y), the kernel K, and the limit of (1/n) Σ_{j=1}^n (b_n/b_j), which is assumed to exist and to be in (0, 1).

As has already been mentioned, precise sets of conditions under which Theorems 2.7-2.12 hold, as well as their proofs, may be found in the references cited. However, we thought it would be appropriate to list here a set of conditions out of which one would choose those actually used in the proof of a particular theorem. These assumptions are divided into three groups: those pertaining to the underlying process, those referring to the kernel employed, and, finally, the conditions imposed on the bandwidth.

Assumptions imposed on the underlying process.
1. The underlying process is a (strictly) stationary Markov process which satisfies hypothesis (D₀) (see Doob (1953), page 221), or the weaker assumption of being geometrically ergodic (see Rosenblatt (1970), page 202).
2. (i) The process has 1-dimensional and 2-dimensional p.d.f.s (with respect to the appropriate Lebesgue measure), denoted by f(·) and f(·, ·), respectively. Then the one-step transition p.d.f. is f(y|x) = f(x, y)/f(x), provided f(x) > 0.
(ii) The p.d.f.s f(·) and f(·, ·) are bounded and continuous.
(iii) The p.d.f.s f(·) and f(·, ·) have continuous second order derivatives.
(iv) The p.d.f.s f(·) and f(·, ·) are Lipschitz of order 1; i.e., |f(x) − f(x′)| ≤ C|x − x′|, and similarly for f(·, ·).
(v) The p.d.f. f(·) satisfies the condition: f(x) > 0 at the points x where conditional quantities are estimated.
3. Let f_m(·, ·) be the joint p.d.f. of the r.v.s X₀ and X_m, m ≥ 2. Then the f_m(·, ·), m ≥ 2, are bounded uniformly in m.
4. The second order derivatives of the joint p.d.f.s of the r.v.s X₀ and X_m satisfy the condition of being bounded uniformly in m.
5. The joint p.d.f.s of the r.v.s (X₀, X_j, X_k) are bounded, and so are the joint p.d.f.s of (X₀, X_j, X_k, X_l).
6. The one-step transition d.f. F(·|x) of the process has a unique p-th quantile for each p ∈ (0, 1) and each relevant x.
7. For suitable points x and y, the transition p.d.f. f(y|x) is continuous in x and y.

Assumptions imposed on the kernel. The real-valued function K defined on ℝ is a bounded p.d.f. such that:
(i) |u| K(u) → 0 as |u| → ∞.
(ii) ∫ u K(u) du = 0 and ∫ |u|ʳ K(u) du < ∞, for suitable r > 0.
(iii) K is continuous.
(iv) K is Lipschitz of order 1; i.e., |K(u) − K(u′)| ≤ C|u − u′|.
(v) The derivative K′ exists except, perhaps, for finitely many points, and it is bounded.

Assumptions imposed on the bandwidths. {b_n} is a sequence of positive numbers such that:
(i) b_n → 0 and n b_n → ∞ as n → ∞.
(ii) In some cases, b_n = C n^(−δ) with 0 < δ < 1, and subject to some further restrictions.
(iii) In the recursive case, the limit of (1/n) Σ_{j=1}^n (b_n/b_j) exists and lies in (0, 1).
Remark 2.3. In reference to the hypothesis (D₀) imposed on the process, a brief description of it is as follows.

Hypothesis (D₀): (a) There is a (finite-valued, non-trivial) measure φ on the Borel σ-field, an integer ν ≥ 1, and a positive ε, such that: P⁽ᵛ⁾(x, A) ≤ 1 − ε for all x, if φ(A) ≤ ε, where P⁽ᵛ⁾ is the ν-step transition measure of the process. (b) There is only a single ergodic set, and this set contains no cyclically moving subsets. (For part (a) above, and for the definition of ergodic sets, as well as cyclically moving subsets, see Doob (1953), pages 192 and 210-211.)

Regarding the concept of geometric ergodicity, here are some relevant comments. Suppose T is the one-step transition probability operator of the underlying process; i.e., if f(y|x) is the one-step transition p.d.f. and g is a (real-valued, measurable) bounded function, then

(Tg)(x) = ∫ g(y) f(y|x) dy.

For n ≥ 2, the operator Tⁿ is defined recursively. For g as above, let ‖g‖₂ be the L₂-norm of g with respect to the measure induced by the 1-dimensional p.d.f. f of the process; i.e.,

‖g‖₂ = [∫ g²(x) f(x) dx]^(1/2).

For such g with ∫ g(x) f(x) dx = 0, the norm ‖Tⁿ‖ of the operator Tⁿ is defined by:

‖Tⁿ‖ = sup{‖Tⁿg‖₂; ‖g‖₂ ≤ 1},

where Tⁿg means the n-fold application of T to g. Then the process is said to be geometrically ergodic if the operator T satisfies the condition ‖Tⁿ⁰‖ < 1 for some positive integer n₀.

Remark 2.4. It is to be mentioned that all results in Theorems 2.8-2.12, where the p.d.f.s f(·) and f(·, ·) and the one-step transition d.f. F(·|x) are involved, hold for points x and y such that x is a continuity point of f(·) and (x, y) is a continuity point of f(·, ·). Also, it should be pointed out that the Glivenko-Cantelli Theorem for the empirical d.f. does not require the Markovian property; it holds under stationarity alone.
Sidney Yakowitz’ contributions to Markovian inference, either as a sole author or as a co-author, have been profound. A sample of some of the relevant works is presented here. The early work of Sidney Yakowitz was devoted to applying stochastic process modelling to a variety of problems related to daily stream flow records, Yakowitz (1972); to daily river flows in arid regions, Yakowitz (1973); to hydrologic chains, Yakowitz (1976a); to water table predictions, Yakowitz (1976b); to rivers in the southwest, Yakowitz (1977a); to daily river flow, Yakowitz (1977b); to hydrologic time series, Yakowitz and Denny (1973); and to statistical inference on stream flow processes, Denny, Kisiel and Yakowitz (1974). The papers Yakowitz (1976a, 1977b) have a strong theoretical component related to Markov processes. In Yakowitz (1979), the author considers a Markov chain with stationary transition d.f. F(y|x), which is assumed to be continuous in y. He proceeds to construct an estimate of it which, under certain regularity conditions, satisfies the Glivenko-Cantelli Theorem as a function of y. At that time, this was a substantial improvement over a similar result obtained earlier by the present author. In the paper Yakowitz (1985), the author undertakes the nonparametric estimation problem of the transition p.d.f. and of the conditional expectation E[g(X_n) | X_{n−1} = x] (g measurable), in a class of Markovian processes which includes those satisfying the so-called Doob hypothesis (D) (see Doob (1953), page 192). Results are illustrated by an analysis of riverflow records. In Yakowitz (1989), the author considers a Markov chain with state space in ℝ which satisfies the so-called Harris condition, and proceeds with the nonparametric estimation of the initial p.d.f., as well as of the regression function. The approach employed is based on kernel methodology. In Yakowitz and Lowe (1991), the authors frame the Bandit problem as a Markovian decision problem, according to which, at each decision time, a r.v.
is selected from a finite collection of r.v.s‚ and an outcome is observed. In Yakowitz et al. (1992)‚ the authors study the problem of minimizing a function on the basis of a certain observed loss. This loss is expressed in terms of a Markov decision process and the corresponding control sequence. In Yakowitz (1993)‚ the author proves consistency of the nearest neighbor regression estimate in a Markovian time series context. In Lai and Yakowitz (1995)‚ the authors refine and pursue further the study of the Bandit problem in the framework of controlled Markov processes. Finally‚ in the paper Gani‚ Yakowitz and Blount (1997)‚ the authors study both deterministic and stochastic models for the HIV spread in prisons‚ and provide some simulations of epidemics. For computations in the case of stochastic modelling‚ a new technique is employed for bounding variability in a Markov chain with a large state space.
Results on various aspects of nonparametric estimation in the i.i.d. context are contained in the following papers and books; namely‚ Watson and Leadbetter (1964a‚ b)‚ Bhattacharya (1967)‚ Schuster (1969‚ 1972)‚ Nadaraya (1964a‚ 1970)‚ Yamato (1973)‚ Devroye and Wagner (1980)‚ Cheng and Lin (1981)‚ Mack and Silverman (1982)‚ and Georgiev (1984a‚ b). Also‚ Devroye and Györfi (1985)‚ Silverman (1986)‚ Devroye (1987)‚ Rüschendorf (1987)‚ Eubank (1988)‚ and Scott (1992). The papers Prakasa Rao (1977) and Nguyen (1981‚ 1984) refer to a Markovian framework. Finally‚ a general reference on nonparametric estimation is the book Prakasa Rao (1983).
3. MIXING
3.1. INTRODUCTION AND DEFINITIONS
As it has already been mentioned in the introductory section, mixing is a kind of dependence which allows the entire past, in addition to the present, to influence the future. There are several modes of mixing, but we are going to concentrate on three of them, which go by the names of strong mixing or α-mixing, ρ-mixing, and weak mixing or φ-mixing. There is a large probabilistic literature on mixing; however, in this paper, we focus on estimation problems in mixing models. A brief study of those probability results pertaining to statistical inference may be found in Roussas and Ioannides (1987, 1988) and Roussas (1988a). All mixing definitions will be given for stationary stochastic processes, although the stationarity condition is not necessary. The concept of strong mixing or α-mixing was introduced by Rosenblatt (1956b) and is as follows.

Definition 3.1. Consider the stationary sequence of r.v.s …, X₋₁, X₀, X₁, …, and denote by 𝓕ᵏ and 𝓕_{k+n} the σ-fields induced by the r.v.s (…, X_{k−1}, X_k) and (X_{k+n}, X_{k+n+1}, …), respectively. Then the sequence is said to be strong mixing or α-mixing, with mixing coefficient α(n), if

α(n) = sup{|P(A ∩ B) − P(A)P(B)|; A ∈ 𝓕ᵏ, B ∈ 𝓕_{k+n}, k ∈ ℤ} → 0 as n → ∞.  (3.1)
The meaning of (3.1) is clear. It states that the past and the future‚ defined on the underlying process‚ are approximately independent‚ provided they are sufficiently far apart.
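To make (3.1) concrete, consider a stationary two-state Markov chain. The quantity computed below restricts A and B to events determined by the single coordinates X₀ and X_n, so it is only a lower bound for α(n), but it exhibits the geometric decay of dependence that α-mixing formalizes (the chain and all names are illustrative):

```python
def mat_mul(A, B):
    """2x2 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def alpha_lower_bound(P, pi, n):
    """max over i, j of |P(X_0 = i, X_n = j) - P(X_0 = i) P(X_n = j)| for a
    stationary two-state chain: a lower bound for the coefficient alpha(n)."""
    Pn = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        Pn = mat_mul(Pn, P)
    return max(abs(pi[i] * Pn[i][j] - pi[i] * pi[j])
               for i in range(2) for j in range(2))

# Hypothetical chain; its stationary law pi solves pi P = pi.
P = [[0.9, 0.1], [0.2, 0.8]]
pi = [2 / 3, 1 / 3]
decay = [alpha_lower_bound(P, pi, n) for n in (1, 5, 10, 20)]
```

Here the bound is proportional to 0.7ⁿ, 0.7 being the second eigenvalue of the transition matrix, so the dependence between past and future dies off geometrically.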
The next definition is that of ρ-mixing. Thus:

Definition 3.2. In the notation of the previous definition, the sequence of r.v.s is said to be ρ-mixing, with mixing coefficient ρ(n), if

ρ(n) = sup{|Corr(ξ, η)|; ξ ∈ L₂(𝓕ᵏ), η ∈ L₂(𝓕_{k+n}), k ∈ ℤ} → 0 as n → ∞.  (3.2)
Remark 3.1. One arrives at Definition 3.2 by first defining the maximal correlation between the σ-fields 𝓕ᵏ and 𝓕_{k+n}, with ξ being 𝓕ᵏ-measurable and η being 𝓕_{k+n}-measurable, both square-integrable; next, specializing it to indicator functions, to obtain a second coefficient; and, finally, modifying that coefficient by taking the supremum over k. It is shown that the resulting coefficients are related by two-sided inequalities (3.3), so that they are all equivalent, in the sense that one of them tends to 0 if and only if each of the others does. The third and last mode of mixing to be considered here is the weak mixing or φ-mixing, defined below.

Definition 3.3. In the notation of Definition 3.1, the sequence of r.v.s is said to be weak mixing or φ-mixing, with mixing coefficient φ(n), if

φ(n) = sup{|P(B|A) − P(B)|; A ∈ 𝓕ᵏ with P(A) > 0, B ∈ 𝓕_{k+n}, k ∈ ℤ} → 0 as n → ∞.  (3.4)
The concept of φ-mixing is due to Ibragimov (1959). It is shown that the mixing coefficients α(n), ρ(n) and φ(n) are related as follows:

4α(n) ≤ ρ(n) ≤ 2φ^(1/2)(n),  (3.5)

so that φ-mixing implies ρ-mixing, and ρ-mixing implies α-mixing. Accordingly, the assignment of the terms "weak mixing" and "strong mixing" to φ-mixing and α-mixing, respectively, is a misnomer. The kind of mixing which has proven to be most useful in applications is α-mixing. It has been established that several classes of stochastic processes are α-mixing under rather weak regularity conditions. For some examples, the interested reader is referred to Section 3 of Roussas and Ioannides (1987). The question naturally arises as to when a stochastic process satisfies a mixing condition. Some answers to such a question are provided in the papers by Davydov (1973), Kesten and O'Brien (1976), Gorodetskii (1977), Withers (1981a), Pham and Tran (1985), Athreya and Pantula (1986), Pham (1986), and Doukhan (1994).
3.2. COVARIANCE INEQUALITIES
In statistical applications, mixing often enters the picture in the form of covariance bounds, moment inequalities, and exponential probability inequalities. With this in mind, we present below a series of theorems. In the following theorem, the functions ξ and η are defined on the underlying stochastic process; they are 𝓕ᵏ-measurable and 𝓕_{k+n}-measurable, respectively, and they may be either real-valued or complex-valued.

Theorem 3.1. (i) Under φ-mixing, for real-valued ξ and η,

|Cov(ξ, η)| ≤ 2 φ^(1/p)(n) ‖ξ‖_p ‖η‖_q,

and twice that bound for complex-valued ξ and η, where p, q > 1 with 1/p + 1/q = 1. (ii) Under ρ-mixing, and for ξ and η in L₂,

|Cov(ξ, η)| ≤ ρ(n) ‖ξ‖₂ ‖η‖₂.

(iii) Under α-mixing, for real-valued ξ and η,

|Cov(ξ, η)| ≤ C α^(1/r)(n) ‖ξ‖_p ‖η‖_q,

and twice that bound for complex-valued ξ and η, where C is a numerical constant and p, q, r > 1 with 1/p + 1/q + 1/r = 1.
Theorem 3.2. Let ξ and η be as above, and let also |ξ| ≤ C₁ a.s. and |η| ≤ C₂ a.s. Then: (i) Under φ-mixing, for real-valued ξ and η,

|Cov(ξ, η)| ≤ 2 C₁ C₂ φ(n),

and twice that bound for complex-valued ξ and η. (ii) Under ρ-mixing, for real-valued ξ and η,

|Cov(ξ, η)| ≤ C₁ C₂ ρ(n),

and twice that bound for complex-valued ξ and η. (iii) Under α-mixing, for real-valued ξ and η,

|Cov(ξ, η)| ≤ 4 C₁ C₂ α(n),

and twice that bound for complex-valued ξ and η.
There are certain useful generalizations of the above two theorems, which are stated below. For this purpose, the following notation is required. Let the integers k₁ < k₂ < … < k_m be such that k_{i+1} − k_i ≥ n, i = 1, …, m − 1, and let 𝓕_i be the σ-field induced by the r.v.s with indices in the i-th block. The functions ξ₁, …, ξ_m are 𝓕_i-measurable, respectively, and they may be either real-valued or complex-valued.

Theorem 3.3. (i) Under φ-mixing, for real-valued ξ₁, …, ξ_m,

|E(ξ₁ ⋯ ξ_m) − (Eξ₁) ⋯ (Eξ_m)| ≤ C(m) φ^κ(n) ∏_{i=1}^m ‖ξ_i‖_{p_i},

and similarly, with a larger constant, for complex-valued ξ₁, …, ξ_m, where p_i > 1 with Σ_{i=1}^m 1/p_i ≤ 1, and the exponent κ depends on the p_i. (ii) Under α-mixing, an analogous bound holds, with φ^κ(n) replaced by α^κ′(n), where now Σ_{i=1}^m 1/p_i < 1 and κ′ = 1 − Σ_{i=1}^m 1/p_i; and similarly, with a larger constant, for complex-valued ξ₁, …, ξ_m.

Finally, the following generalized version of Theorem 3.2 holds.

Theorem 3.4. Let ξ₁, …, ξ_m be as described above, and let also |ξ_i| ≤ C_i a.s., i = 1, …, m. Then: (i) Under φ-mixing, for real-valued ξ₁, …, ξ_m,

|E(ξ₁ ⋯ ξ_m) − (Eξ₁) ⋯ (Eξ_m)| ≤ 2(m − 1)(∏_{i=1}^m C_i) φ(n),

and twice that bound for complex-valued ξ₁, …, ξ_m. (ii) Under ρ-mixing, the same form of bound holds with φ(n) replaced by ρ(n). (iii) Under α-mixing, the same form of bound holds with φ(n) replaced by 4α(n).
Detailed proofs of Theorems 3.1 - 3.4 may be found in Sections 5 - 7 of Roussas and Ioannides (1987). Specifically‚ see Theorems 5.1 - 5.5 and Corollary 5.1; Theorems 6.1 - 6.2 and Proposition 6.1; Theorems 7.1 - 7.4 and Corollaries 7.1 - 7.2 in the reference just cited.
3.3. MOMENT AND EXPONENTIAL PROBABILITY BOUNDS
For the segment X₁, …, X_n of r.v.s, set S_n = Σ_{j=1}^n X_j for their sum, and let ν ≥ 2 be any real number. For statistical inference purposes, often a bound of the moment E|S_n|^ν is needed, assuming, of course, that this moment is finite. Such a bound is required, for example, when the Markov inequality is used. If the X_j are i.i.d., such a bound is easy to establish. A similar result has also been established for the case that the r.v.s are coming from a stationary Markov process; namely, E|S_n|^ν ≤ C n^(ν/2), for some (> 0) constant C (see Doob (1953), Lemma 7.4, page 225). Such an inequality holds true as well under mixing. More specifically, one has the theorem stated below. The main conditions under which this theorem, as well as the remaining theorems in this subsection, hold are grouped together right after the end of Theorem 3.16.

Theorem 3.5. Let X₁, …, X_n be a segment from a stationary sequence of r.v.s satisfying any one of the φ-, ρ- or α-mixing properties. Assume that EX₁ = 0 and E|X₁|^ν < ∞. Then, under some additional requirements, it holds that

E|S_n|^ν ≤ C n^(ν/2), C a positive constant.  (3.6)
The proof of (3.6) is carried out by induction on ν, and the relevant details may be found in Roussas (1988a). Although inequality (3.6) does provide a bound for probabilities of the form P(|S_n| > nε), via the Markov inequality, a stronger bound is often required. In other words, a Bernstein-Hoeffding type bound would be desirable in the present set-up. Results of this type are available in the literature, and the one stated here is taken from Roussas and Ioannides (1988).

Theorem 3.6. Let the r.v.s X₁, …, X_n be as in the previous theorem, and let ε > 0. Then, under certain additional regularity conditions, it holds that

P(|S_n| > nε) ≤ C₁ exp(−C₂ n ε²), C₁ and C₂ (> 0) constants, with ε (and n) subject to an additional restriction. Also, a bound of the same exponential form holds for P(|S_n|/(n δ_n) > ε), where δ_n is a (> 0) norming factor (determining the rate at which S_n/(n δ_n) converges a.s. to 0).

A discussion of some basic properties of mixing may be found in Bradley (1981, 1985). Issues regarding moment inequalities are dealt with in the references Longnecker and Serfling (1978), Yoshihara (1978), and Yokoyama (1980). Papers where the Central Limit Theorem and the Strong Law of Large Numbers are discussed are Ibragimov (1963), Philipp (1969), Withers (1981b), Peligrad (1985), and Takahata (1993). Finally, books where mixing is discussed, along with other subject matters, are those by Ibragimov and Linnik (1971), Hall and Heyde (1980), Rosenblatt (1985), Yoshihara (1993, 1994a, b, 1995), Doukhan (1994) (Sections 1.4.1 and 1.4.2), and Bosq (1996).
3.4. SOME ESTIMATION PROBLEMS
The estimation problems to be addressed here are those of estimating a d.f., a p.d.f. and its derivatives, the hazard rate, fixed design regression, and stochastic design regression. The approach is nonparametric and the methodology is kernel methodology. Any one of the three modes of mixing discussed earlier is assumed to prevail. In all that follows, is a segment from a stationary stochastic process obeying any one of the three modes of mixing: Actually, many results are stated below only for and therefore are valid for the remaining two stronger modes of mixing. 3.4.1 Estimation of the Distribution Function or Survival Function. Let F be the d.f. of the so that defines the corresponding
Some Aspects of Statistical Inference in a Markovian and Mixing Framework
survival function. So, if is an estimate for then is estimated by Due to this relationship, statements made about also hold for and vice versa. At this point, it should be pointed out that results stated below also hold for the case that the are multi-dimensional vectors. For the sake of simplicity, we restrict ourselves to real-valued r.v.s. The simplest estimate of is the empirical d.f. defined by
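In symbols, the empirical d.f. evaluates, at each point x, the proportion of observations not exceeding x, and the survival function is estimated by one minus the empirical d.f. As an illustration only (the code and function names below are ours, not the chapter's), a minimal NumPy sketch:

```python
import numpy as np

def empirical_df(sample, x):
    """Empirical d.f.: the fraction of observations X_i with X_i <= x."""
    sample = np.asarray(sample)
    return np.mean(sample <= x)

def empirical_survival(sample, x):
    """Survival function estimate: one minus the empirical d.f. at x."""
    return 1.0 - empirical_df(sample, x)
```

This mirrors the relationship noted above: any statement about the d.f. estimate translates directly into one about the survival estimate, and vice versa.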
This estimate enjoys a number of optimal properties (mostly asymptotic) summarized in the following theorem. Theorem 3.7. The empirical d.f. defined by (3.7), as an estimate of has the following properties, under mixing and some additional regularity conditions: (i) Unbiasedness: (ii) Strong consistency: (iii) Strong uniform consistency: (iv) Strong uniform consistency with rates: where is any compact subset of determines the rate of convergence. Also,
and the (positive) norming factor
(v) Asymptotic normality: where
(vi) The
at the rate specified by the convergence;
(vii) Asymptotic uncorrelatedness: For the estimates and are asymptotically uncorrelated at the rate specified by the convergence:
where
Remark 3.2. In reference to part (iii) of the theorem, it must be mentioned that stationarity alone is enough for its validity; mixing is not required. A detailed listing of the assumptions under which these results hold, as well as their proofs, may be found in Cai and Roussas (1992) (Corollary 2.1 and Theorem 3.2), Roussas (1989b) (Propositions 2.1 and 2.2), and Roussas (1989c) (Theorem 3.1 and Propositions 4.1 and 4.2). However, see also the material right after the end of Theorem 3.16. An additional relevant reference, among others, is the paper by Yamato (1973). 3.4.2 Estimation of a Probability Density Function and its Derivatives. The setting here remains the same as above, except that we assume the existence of a p.d.f. of the d.f. F. The problem then is that of estimating and, perhaps, its derivatives as well. As in the previous subsection, although results below have been established for multi-dimensional random vectors, they will be stated here for real-valued r.v.s. To start with, is estimated by where
it is recalled here that K is a kernel (a known p.d.f.), and is a bandwidth with The estimate has several optimal properties which are summarized in the following theorem. stands for the set of continuity points of Theorem 3.8. In the mixing framework and under other additional assumptions, the estimate defined by (3.8) has the following properties: (i) Asymptotic unbiasedness: (ii) Strong consistency with rates: for some (iii) Uniform a.s. consistency with rates:
Also,
and
(iv) Asymptotic normality:
and
(v) Joint asymptotic normality: For any distinct continuity points of where is a diagonal matrix with diagonal elements Also,
(vi) Asymptotic uncorrelatedness: For any
with
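The kernel estimate of (3.8) averages kernel bumps of width equal to the bandwidth, centered at the observations. A minimal sketch, illustrative only: the Gaussian kernel and the function names are our choices, not the chapter's.

```python
import numpy as np

def kernel_density(sample, x, bandwidth, kernel=None):
    """Kernel density estimate at x: average of K((x - X_i)/h) / h over the sample."""
    sample = np.asarray(sample, dtype=float)
    if kernel is None:
        # Standard normal density, one common choice of known p.d.f. K.
        kernel = lambda u: np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)
    u = (x - sample) / bandwidth
    return kernel(u).sum() / (len(sample) * bandwidth)
```

Since K is itself a p.d.f., the resulting estimate integrates to one and is nonnegative, which is one practical appeal of the kernel approach.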
Formulation of the assumptions under which the above results hold, as well as their proofs, may be found in Roussas (1990a) (Theorems 3.1, 4.1 and 5.1), Roussas (1988b) (Theorems 2.1, 2.2, 3.1 and 3.2), Cai and Roussas (1992) (Theorems 4.1 and 4.4), and Roussas and Tran (1992a) (Theorem 7.1). Suppose now that the r-th order derivative of exists, and let us estimate it by where
where is the r-th order derivative of the kernel Regarding this estimate, it may be shown that it enjoys properties similar to the ones for One of them is stated below. Theorem 3.9. In the mixing framework and under other additional assumptions, the estimate defined by (3.9) is uniformly strongly consistent with rates; namely,
and, in particular,
provided These results are discussed in Cai and Roussas (1992) (Theorem 4.4). There is an extensive literature on this subject matter. The following constitute only a sample of relevant references dealing with various estimates and their behavior: Bradley (1983), Masry (1983, 1989), Robinson (1983, 1986), Tran (1990), and Roussas and Yatracos (1996, 1997). 3.4.3 Estimating the Hazard Rate. Hazard analysis has broad applications in systems reliability and survival analysis. It was therefore thought appropriate to touch upon some basic issues in this subject matter. Recall at
this point that if F and are the d.f. and the p.d.f. of a r.v. X, then the hazard rate is defined as follows,
If the r.v. X represents the lifetime of an item, then measures the instantaneous rate at which the item fails just beyond time given that it has survived to time In practice, is unknown and is estimated by where
where is given in (3.8), and is the empirical d.f. given in (3.7). The estimate has several properties, inherited from those of the density and d.f. estimates. Some of them are summarized below.
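The hazard estimate of (3.11) is thus the ratio of the kernel density estimate of (3.8) to the empirical survival function derived from (3.7). A hedged sketch, with the Gaussian kernel and all names being our own assumptions:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)

def hazard_estimate(sample, x, bandwidth):
    """Hazard rate estimate at x: kernel density over empirical survival."""
    sample = np.asarray(sample, dtype=float)
    density = gaussian_kernel((x - sample) / bandwidth).sum() / (len(sample) * bandwidth)
    survival = np.mean(sample > x)          # empirical survival function at x
    return density / survival if survival > 0 else float("inf")
```

For a sample from the standard exponential distribution, whose true hazard is constant and equal to one, such an estimate at an interior point should hover near one for moderate sample sizes.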
Theorem 3.10. In the mixing framework and under other additional assumptions, the estimate defined in (3.11) has the following properties: (i) Strong pointwise consistency: (ii) Uniform a.s. consistency with rates: where J is any compact subset and is a norming factor specifying the rate of convergence. Also, and, in particular, provided (iii) Asymptotic normality: and
where (iv) Joint asymptotic normality: For any distinct continuity points of where is a diagonal matrix with diagonal elements
Precise statement of assumptions, under which the above results hold, and their justification may be found in Roussas (1989b) (Theorems 2.1, 2.2 and 2.3), Cai and Roussas (1992) (Theorem 4.2), Roussas (1990a) (Theorem 4.2), and Roussas and Tran (1992a) (Theorem 8.1). Relevant also are the references Watson and Leadbetter (1964a, b). 3.4.4 A Smooth Estimate of F and As is apparent from the previous discussions, in estimation problems, we often assume the existence of the p.d.f. of the d.f. F of a r.v. X. In such a case, F is a smooth curve, and one may find it more reasonable to estimate it by a smooth curve, unlike which is a step function. This approach was actually used in Roussas (1969b), and it may be pursued here as well, to a limited extent. Thus, is estimated by where
Then the hazard rate may also be estimated by the expression given below.
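The smooth d.f. estimate replaces the indicator function in the empirical d.f. by the integrated kernel. A sketch under the assumption of a standard normal kernel, whose integral is the normal c.d.f.; the function name is ours:

```python
import math

def smooth_df(sample, x, bandwidth):
    """Smooth d.f. estimate: average of the integrated kernel at (x - X_i)/h.

    With a standard normal kernel K, the integrated kernel is the normal c.d.f.,
    expressible through the error function erf.
    """
    phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return sum(phi((x - xi) / bandwidth) for xi in sample) / len(sample)
```

Unlike the empirical d.f., this estimate is everywhere differentiable in x, which is what makes it attractive when F itself is known to be smooth.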
It may be shown that the estimate has properties similar to those of except that it is asymptotically unbiased rather than unbiased. Furthermore, it is shown that, under certain regularity conditions, the smooth estimate is superior to the standard estimate, the empirical d.f., in a certain second-order asymptotic efficiency sense. The measure of asymptotic comparison is sample size, and the criterion used is mean square error. We do not elaborate on this any further; the interested reader is referred to Roussas (1996) and references cited there. We proceed to close this section with one optimal property of the estimate Namely, Theorem 3.11. In the mixing set-up and under other additional assumptions, the estimate given in (3.13) is uniformly strongly consistent with rates over
compact subsets J of
namely,
and, in particular,
provided For the justification of this result, the interested reader is referred to Cai and Roussas (1992) (Theorem 4.3). It appears that the idea of using a smooth estimate for a d.f., such as the one employed above, belongs to Nadaraya (1964b). 3.4.5 Recursive Estimation. As has already been seen, a p.d.f. may also be estimated recursively. This approach provides a handy way of incorporating an incoming observation into an existing estimate. Even in today's age of high-speed computers, the technique of recursive estimation may provide time-saving advantages, not to mention the virtues of the principle of parsimony. Furthermore, the resulting recursive estimate enjoys certain optimal properties relative to a non-recursive estimate, such as reduction of the variance of the asymptotic normal distribution. Consider the recursive estimate of defined as follows:
Then, it is easily seen that
Here
is a sequence of bandwidths such that
The estimate has several optimality properties, some of which are summarized below.
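The recursive estimate of (3.14) can be updated in constant time per observation at a fixed point x, since it is a running average of the kernel contributions K((x - X_i)/h_i)/h_i, each with its own bandwidth. An illustrative sketch; the class and function names are ours:

```python
import math

def gauss(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

class RecursiveDensity:
    """Recursive kernel density estimate at a fixed point x.

    Each new observation X_n with bandwidth h_n updates the estimate by

        f_n(x) = ((n - 1)/n) * f_{n-1}(x) + K((x - X_n)/h_n) / (n * h_n),

    a running average of per-observation kernel contributions.
    """

    def __init__(self, x):
        self.x = x
        self.n = 0
        self.value = 0.0

    def update(self, obs, bandwidth):
        self.n += 1
        contrib = gauss((self.x - obs) / bandwidth) / bandwidth
        # Running-mean form of the recursion above.
        self.value += (contrib - self.value) / self.n
        return self.value
```

The bandwidth sequence is supplied by the caller (a typical choice would be one shrinking in n), and an incoming observation is absorbed without revisiting past data, which is the time-saving advantage mentioned above.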
Theorem 3.12. In the mixing framework and under additional suitable conditions, the recursive estimate defined in (3.14) has the following properties:
(i) Asymptotic unbiasedness: (ii) Possible reduction of the asymptotic variance:
(iii) Asymptotic uncorrelatedness: (iv) Joint asymptotic normality: For any distinct continuity points of
where C is a diagonal matrix with diagonal elements Also,
Remark 3.3. The results in parts (ii) and (iv) justify the statement made earlier about the reduction of the variance of the asymptotic normal distribution. The justification of the statements in Theorem 3.12 may be found in Roussas and Tran (1992a) (relations (2.6), (2.10), (2.15), and Theorems 3.1 and 4.1). Now, if is the empirical survival function, it can be written as follows:
By means of (3.17), it can be easily seen that
so that a recursive formula for define by
is also available. By means of
and
Then the estimate may be evaluated recursively by way of the formula below; namely,
Among the several asymptotic properties of is the one stated in the following result (see Theorem 6.1 in Roussas and Tran (1992a)). Theorem 3.13. Under mixing and suitable additional conditions, the estimate defined in (3.20) has the following joint asymptotic normality property; that is, for any distinct continuity points of
where C* is a diagonal matrix with diagonal elements
Remark 3.4. Applying this last result for and comparing it with the result stated in Theorem 3.1(iii), one sees the superiority of in terms of asymptotic variance (for as it compares to From among several relevant references, we mention those by Masry (1986), Masry and Györfi (1987), Györfi and Masry (1990), Wegman and Davis (1979), and Roussas (1992), the last two concerning themselves with the independent case. 3.4.6 Fixed Design Regression. The problem addressed in this subsection is the following. For suppose one selects design points in a compact set S of at which respective observations are taken. It is assumed that these r.v.s have the following structure
where is an unknown real-valued function defined on and are random errors. The problem at hand is to estimate for by means of and The estimator usually used in this context is a linear weighted average of the More precisely, if stand for said estimator evaluated at then
where and are weights subject to certain requirements.
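A concrete instance of such a linear weighted average is the Priestley-Chao type estimator for an equispaced design on [0, 1], with kernel weights K((x - x_i)/h) / (n h). The following sketch is ours, not the chapter's, and the Gaussian kernel is an assumed choice:

```python
import math

def pc_regression(design, responses, x, bandwidth):
    """Priestley-Chao type fixed design regression estimate at x.

    For an (approximately) equispaced design on [0, 1], the weights are
    w_i(x) = K((x - x_i)/h) / (n * h), so the estimate is a kernel-weighted
    average of the responses y_i.
    """
    n = len(design)
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sum(k((x - xi) / bandwidth) * yi
               for xi, yi in zip(design, responses)) / (n * bandwidth)
```

At an interior design point, the weights approximately sum to one, so a constant response function is reproduced up to a small discretization and boundary error.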
This problem has been studied extensively in the i.i.d. case (see, for example, Priestley and Chao (1972), Gasser and Müller (1979), Ahmad and Lin (1984), and Georgiev (1988)). Under mixing conditions, however, this problem has been dealt with only in the last decade or so. Here, we present a summary of some of the most important results which have been obtained in the mixing framework. Theorem 3.14. Under mixing assumptions and further suitable conditions, the fixed design regression estimate defined in (3.22) has the following properties: (i) Asymptotic unbiasedness: (ii) Consistency in quadratic mean: (iii) Strong consistency: (iv) Asymptotic normality: Precise statement of conditions under which the above results hold, as well as their proofs, can be found in Roussas (1989a) (Theorems 2.1, 2.2 and 3.1), and Roussas et al. (1992) (Theorems 2.1 and 3.1). 3.4.7 Stochastic Design Regression. The set-up presently differs from the one in the previous subsection in that both the and the are r.v.s. More precisely, one has at hand pairs coming from a stationary process, where the are real-valued r.v.s and the are random vectors. It is assumed that
is finite, and then the problem is that of estimating on the basis of the observations at hand. Before we proceed further, we present another formulation of the problem which provides more motivation for what is to be done. To this effect, let be real-valued r.v.s forming a stationary time series. Suppose we wish to predict the r.v. on the basis of the previous r.v.s As predictor, we use the assuming, of course, that this conditional expectation is finite. By setting the pairs form a stationary sequence, and the problem of prediction in the time series setting is equivalent to that of estimating the conditional expectation on the basis of the available observations. Actually, one may take it one step further by considering a (known) real-valued function
defined on and entertain the problem of predicting by means of the (assumed to be finite) conditional expectation
Once again, by letting be as above, and by setting the problem becomes again that of estimating the conditional expectation So, it suffices to concentrate on estimating given in (3.23). The proposed estimate is defined by
The estimate enjoys several optimal properties; one of them is recorded below.
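A standard concrete choice for an estimate of the conditional expectation in (3.23) is a Nadaraya-Watson type ratio of kernel-weighted responses to kernel weights. The sketch below is illustrative only; the kernel choice and the names are ours:

```python
import math

def nw_regression(xs, ys, x, bandwidth):
    """Nadaraya-Watson type estimate of E(Y | X = x):

        sum_i K((x - X_i)/h) * Y_i  /  sum_j K((x - X_j)/h)
    """
    k = lambda u: math.exp(-0.5 * u * u)       # Gaussian-shaped weights
    weights = [k((x - xi) / bandwidth) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        return float("nan")                    # no mass near x
    return sum(w * yi for w, yi in zip(weights, ys)) / total
```

In the time-series formulation above, xs would hold the lagged values and ys the values to be predicted, so the same routine serves as a one-step-ahead predictor.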
Theorem 3.15. Under the basic assumption of mixing and further suitable conditions, the regression estimate given in (3.24) is strongly consistent with rates, uniformly over compact subsets of namely,
where J is any compact subset of and the (positive) norming factor specifies the rate of convergence. For details, the interested reader is referred to Roussas (1990b) (Theorem 4.1). A recursive version of the estimate is also available and is defined as follows:
This estimate is asymptotically normal (see Theorems 2.1 - 2.3 in Roussas and Tran (1992b)), as the following theorem states. Theorem 3.16. Under the basic assumption of mixing and further suitable conditions, the recursive regression estimate given in (3.25) has the properties stated below. (i) Asymptotic normality: For any continuity point
for
and
for which
where
(ii) Joint asymptotic normality: For any distinct continuity points of and for which
where the covariance matrix D is a diagonal matrix with diagonal elements given by
Precise formulation of conditions under which Theorems 3.5 - 3.16 hold, as well as their proofs, can be found in references already given. In order for the reader to get a taste of the assumptions used, a brief description of them is presented below. They are divided into three groups as was done in the material presented right after the end of Theorem 2.12. Assumptions imposed on the underlying process.
(i) The underlying process is (strictly) stationary and satisfies one of the mixing conditions:
(ii) The mixing coefficients involved satisfy certain summability conditions. Furthermore, in some of the results, they are required to satisfy some conditions also involving other entities (such as bandwidths and rates of convergence). For example, for coefficients it may be required that: or for some or or for some or for some and a certain sequence of positive numbers (iii) The 1-dimensional p.d.f. of the process is continuous; it is Lipschitz of order 1; it has bounded and continuous second order derivative; J a compact subset of (iv) The joint p.d.f. of and satisfies the condition
(v) The 1-dimensional d.f. F is Lipschitz of order 1; i.e., Assumptions imposed on the kernel. The kernel K is a bounded p.d.f. such that:
(i)
as
(ii)
and
for suitable
(iii) K is Lipschitz of order 1; i.e.,
(iv) The derivative exists, is continuous and of bounded variation, and
Assumptions imposed on the bandwidths. is a sequence of positive numbers such that:
(i) (ii) (iii)
(iv) They also satisfy some additional conditions involving other entities (such as mixing coefficients and rates of convergence). (v) In particular, in the recursive case, the bandwidths satisfy conditions such as: for some there is a sequence of positive numbers such that and The formulation of the conditions employed in the fixed regression, as well as the stochastic regression case, requires the introduction of a large amount of notation. We choose not to do so, and refer the reader to the original papers already cited. It is to be emphasized that, in any one of the results stated above, Theorems 3.5 - 3.16, the proof requires only some of the assumptions just listed. Early papers on regression were those by Nadaraya (1964a, 1970) and Watson (1964). Subsequently, there has been a large number of contributions in this area. Those by Burman (1991), Masry (1996), and Masry and Fan (1997) are
only a sample of them. In monograph form, there is the contribution of Härdle (1990). Finally, it should be mentioned at this point that there is a significant number of papers and/or monographs dealing with dependencies which do not necessarily involve mixing. Some of them are those by Földes (1974), Györfi (1981), Castellana and Leadbetter (1986), Morvai, Yakowitz and Györfi (1996), Tran, Roussas, Yakowitz and Truong (1996), Morvai, Yakowitz and Algoet (1997), and the monographs by Müller (1988) (for the independent case) and Györfi et al. (1989). Sidney Yakowitz, either alone or in collaboration with others, has also made significant contributions to statistical inference under mixing conditions. The papers below represent a limited sample of such works. In Yakowitz (1987), the author studies the problem of estimating a nonlinear regression by nearest-neighbor methodology. The setting is that of a time series satisfying a mixing condition. In Tran and Yakowitz (1993), the authors establish asymptotic normality for a nearest-neighbor estimate of a p.d.f. in a random field framework satisfying a mixing condition. In Györfi, Morvai and Yakowitz (1998), the authors address the forecasting problem in a time-series set-up. They show that various plausible prediction problems are unsolvable under the weakest assumptions, whereas others are solvable by predictors which are known to be consistent under mixing conditions. Furthermore, in work as yet unpublished, Heyde and Yakowitz (2001) show that there is no procedure which would provide consistent density or regression estimates for every process. Consistency is contingent upon assumptions regarding rates of decay of the mixing coefficients; such decay assumptions play an essential role in the analysis of any estimation scheme. Finally, it should be mentioned that Yakowitz has made contributions in other areas of research as well. However, such contributions are not included here, as they fall outside the scope of this work.
ACKNOWLEDGMENTS Thanks are due to an anonymous reviewer whose constructive comments, resulting from a careful and expert reading of the manuscript, helped improve the original version of this work.
REFERENCES Ahmad, I. A. and P. E. Lin. (1984). Fitting a multiple regression function. Journal of Statistical Planning and Inference 9, 163 - 176.
Akritas, M.G. and G. G. Roussas. (1979). Asymptotic expansion of the loglikelihood function based on stopping times defined on Markov processes. Annals of the Institute of Statistical Mathematics 31A, 21 - 38. Akritas, M.G., M. L. Puri, and G. G. Roussas. (1979). Sample size, parameter rates and contiguity - the i.i.d. case. Communication in Statistics - Theoretical Methods A8, 71 - 83. Akritas, M.G. and G. G. Roussas. (1980). Asymptotic inference in continuous time semi-Markov processes. Scandinavian Journal of Statistics 7, 73 - 79. Athreya, K.B. and S. G. Pantula. (1986). Mixing properties of Harris chains and autoregressive processes. Journal of Applied Probability 23, 880 - 892. Bahadur, R.R. (1964). On Fisher’s bound for asymptotic variances. Annals of Mathematical Statistics 35, 1545 - 1552. Basawa, I.V. and B. L. S. Prakasa Rao. (1980). Statistical Inference for Stochastic Processes. Academic Press, New York. Bhattacharya, P.K. (1967). Estimation of probability density and its derivatives. Series A 29, 373 - 382. Billingsley, P. (1961a). Statistical Inference for Markov Processes. University of Chicago Press, Chicago. Billingsley, P. (1961b). The Lindeberg-Lévy theorem for martingales. Proceedings of the American Mathematical Society 12, 788 - 792. Bosq, D. (1996). Nonparametric Statistics for Stochastic Processes. Springer Verlag, Berlin. Bradley, R.C. (1981). Central limit theorem under weak dependence. Journal of Multivariate Analysis 11, 1 - 16. Bradley, R.C. (1983). Asymptotic normality of some kernel-type estimators of probability density. Statistics and Probability Letters 1, 295 - 300. Bradley, R.C. (1985). Basic properties of strong mixing conditions. In: Dependence in Probability and Statistics, E. Eberlein and M.S. Taqqu (Eds.) 165 192, Birkhäuser, Boston. Burman, P. (1991). Regression function estimation from dependent observations. Journal of Multivariate Analysis 36, 263 - 279. Cai, Z. and G. G. Roussas. (1992). 
Uniform strong estimation under with rates. Statistics and Probability Letters 15, 47 - 55. Castellana, J.V. and M. R. Leadbetter. (1986). On the smoothed probability density estimation for stationary processes. Stochastic Processes and their Applications 21, 179 - 193. Cheng, K.F. and P. E. Lin. (1981). Nonparametric estimation of a regression function. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 223 - 233. Davydov, Y.A. (1973). Mixing conditions for Markov chains. Theory of Probability and its Applications. 18, 312 - 328.
Denny, J., C. Kisiel, and S. Yakowitz. (1974). Statistical inference on stream flow processes with Markovian characteristics. Water Resources Research 10, 947 - 954. Devroye, L, and T. J. Wagner. (1980). On the convergence of kernel estimators of regression function with applications in discrimination. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 51, 15 - 25. Devroye, L. and L. Györfi. (1985). Nonparametric Density Estimation: The View. John Wiley and Sons, Toronto. Devroye, L. (1987). A Course in Density Estimation. Birkhäuser, Boston. Doob, J.L. (1953). Stochastic Processes. Wiley, New York. Doukhan, P. (1994). Mixing: Properties and examples. Lecture Notes in Statistics No. 85, Springer-Verlag, New York. Eubank, R. (1988). Spline Smoothing and Nonparametric Regression. MarcelDekker, New York. Földes, A. (1974). Density estimation for dependent samples. Studia Scientiarum Mathematicarum Hungarica. 9, 443 - 452. Gani, J., S. Yakowitz, and M. Blount. (1997). The spread and quarantine of HIV infection in a prison system. SIAM Journal of Applied Mathematics 57,1510 - 1530. Gasser, T. and H.-G. Müller. (1979). Kernel estimation of regression function. In: Smoothing Techniques for Curve Estimation Lecture Notes in Mathematics. 757, 23 - 68. Springer-Verlag, Berlin. Georgiev, A.A. (1984a). Kernel estimates of functions and their derivatives with applications. Statistics and Probability Letters 2, 45 - 50. Georgiev, A.A. (1984b). Speed of convergence in nonparametric kernel estimation of a regression function and its derivatives. Annals of the Institute of Statistical Mathematics 36, 455 - 462. Georgiev, A.A. (1988). Consistent nonparametric multiple regression: The fixed design case. Journal of Multivariate Analysis 25, 100 - 110. Gorodetskii, V.V. (1977). On the strong mixing property for linear sequences. Theory of Probability and its Applications 22, 411 - 413. Györfi, L. (1981). Strongly consistent density estimate from ergodic samples. 
Journal of Multivariate Analysis 11, 81 - 84. Györfi, L., Härdle, W., Sarda, P. and Vieu, P. (1989). Nonparametric Curve Estimation from Time Series. Springer-Verlag, Berlin. Györfi, L. and E. Masry. (1990). The and strong consistency of recursive kernel density estimation from dependent samples. IEEE Transactions on Information Theory 36, 531 - 539. Györfi, L., G. Morvai, and S. Yakowitz. (1998). Limits to consistent on-line forecasting for ergodic time series. IEEE Transactions on Information Theory 44, 886 - 892.
Hájek, J. (1970). A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 14, 323 - 330. Hall, P. and C.C. Heyde. (1980). Martingale Limit Theory and Its Applications. Academic Press, New York. Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge. Heyde, C. and S. Yakowitz. (2001). Unpublished work. Ibragimov, I.A. (1959). Some limit theorems for strict sense stationary stochastic processes. Doklady Akademii Nauk SSSR 125, 711 - 714. Ibragimov, I.A. (1963). A central limit theorem for a class of dependent random variables. Theory of Probability and its Applications 8, 83 - 94. Ibragimov, I.A. and Yu. V. Linnik. (1971). Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff Publishing, Groningen, The Netherlands. Inagaki, N. (1970). On the limiting distribution of a sequence of estimators with uniformity property. Annals of the Institute of Statistical Mathematics 22, 1 - 13. Johnson, R.A. and G.G. Roussas. (1969). Asymptotically most powerful tests in Markov processes. Annals of Mathematical Statistics 40, 1207 - 1215. Johnson, R.A. and G.G. Roussas. (1970). Asymptotically optimal tests in Markov processes. Annals of Mathematical Statistics 41, 918 - 938. Johnson, R.A. and G.G. Roussas. (1972). Applications of contiguity to multiparameter hypothesis testing. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability 1, 195 - 226. Kesten, H. and G.L. O'Brien. (1976). Examples of mixing sequences. Duke Mathematical Journal 43, 405 - 415. Lai, T.L. and S. Yakowitz. (1995). Machine learning and nonparametric bandit theory. IEEE Transactions on Automatic Control 40, 1199 - 1209. Le Cam, L. (1960). Locally asymptotically normal families of distributions. University of California Publications in Statistics 3, 37 - 98. Le Cam, L. (1966). Likelihood functions for large numbers of independent observations.
In: Research Papers in Statistics, Festschrift for J. Neyman, F.N. David (Ed.) 167 - 187. John Wiley and Sons, New York. Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York. Le Cam, L. and G. Yang. (2000). Asymptotics in Statistics: Some Basic Concepts, 2nd ed. Springer-Verlag, New York. Lind, B. and G.G. Roussas. (1977). Cramér-type conditions and quadratic mean differentiability. Annals of the Institute of Statistical Mathematics 29, 189 - 201.
Longnecker, M. and R.J. Serfling. (1978). Moment inequalities for under general stationary mixing sequences. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 43, 1 - 21. Mack, Y.P. and B.W. Silverman. (1982). Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 61, 405 - 415. Masry, E. (1983). Probability density estimation from sampled data. IEEE Transactions on Information Theory 29, 696 - 709. Masry, E. (1986). Recursive probability density estimation for weakly dependent stationary processes. IEEE Transactions on Information Theory 32, 254 - 267. Masry, E. and L. Györfi. (1987). Strong consistency and rates for recursive probability density estimators of stationary processes. Journal of Multivariate Analysis 22, 79 - 93. Masry, E. (1989). Nonparametric estimation of conditional probability densities and expectations of stochastic processes: Strong consistency and rates. Stochastic Processes and their Applications 32, 109 - 127. Masry, E. (1996). Multivariate local polynomial regression estimation for time series: Uniform strong consistency and rates. Journal of Time Series Analysis 17, 571 - 599. Masry, E. and J. Fan. (1997). Local polynomial estimation of regression functions for mixing processes. Scandinavian Journal of Statistics 24, 165 - 179. Morvai, G., S. Yakowitz, and L. Györfi. (1996). Nonparametric inference for ergodic, stationary time series. Annals of Statistics 24, 370 - 379. Morvai, G., S. Yakowitz, and P. Algoet. (1997). Weakly convergent nonparametric forecasting for stationary time series. IEEE Transactions on Information Theory 43, 483 - 498. Müller, H.-G. (1988). Nonparametric Regression Analysis of Longitudinal Data. Lecture Notes in Statistics No. 46, Springer-Verlag, Heidelberg. Nadaraya, E.A. (1964a). On estimating regression. Theory of Probability and its Applications 9, 141 - 142. Nadaraya, E.A. (1964b).
Some new estimates for distribution functions. Theory of Probability and its Applications 9, 491 - 500. Nadaraya, E.A. (1970). Remarks on nonparametric estimates for density function and regression curves. Theory of Probability and its Applications 15, 134 - 137. Nguyen, H.T. (1981). Asymptotic normality of recursive density estimators in Markov processes. Publication of the Institute of Statistics, University of Paris 26, 73-93. Nguyen, H.T. (1984). Recursive nonparametric estimation in stationary Markov processes. Publication of the Institute of Statistics University of Paris 29, 65 - 84.
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics 33, 1065 - 1076. Peligrad, M. (1985). Convergence rates of the strong law for stationary mixing sequences. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 70, 307 - 314. Pham, T.D. and L.T. Tran. (1985). Some strong mixing properties of time series models. Stochastic Processes and their Applications 19, 297 - 303. Pham, T.D. (1986). The mixing property of bilinear and generalized random coefficient autoregressive models. Stochastic Processes and their Applications 23, 291 - 300. Philipp, W. (1969). The remainder in the central limit theorem for mixing stochastic processes. Annals of Mathematical Statistics 40, 601 - 609. Prakasa Rao, B.L.S. (1977). Berry-Esseen bound for density estimators of stationary Markov processes. Bulletin of Mathematical Society 17, 15 - 21. Prakasa Rao, B.L.S. (1983). Nonparametric Functional Estimation. Academic Press, New York. Priestley, M.B. and M.T. Chao. (1972). Nonparametric function fitting. Journal of the Royal Statistical Society Series B 34, 385 - 392. Robinson, P.M. (1983). Nonparametric estimators for time series. Journal of Time Series Analysis 4, 185 - 207. Robinson, P.M. (1986). On the consistency and finite-sample properties of nonparametric kernel time series regression, autoregression and density estimators. Annals of the Institute of Statistical Mathematics 38, Part A, 539 - 549. Rosenblatt, M. (1956a). Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics 27, 823 - 835. Rosenblatt, M. (1956b). A central limit theorem and a strong mixing condition. Proceedings of the National Academy of Sciences, U.S.A. 42, 43 - 47. Rosenblatt, M. (1969). Conditional probability density and regression estimators. In: Multivariate Analysis II, P.R. Krishnaiah (Ed.) 25 - 31. Academic Press, New York. Rosenblatt, M. (1970). Density estimates and Markov sequences.
In: Nonparametric Techniques in Statistical Inference, M. Puri (Ed.). Cambridge University Press, Cambridge. Rosenblatt, M. (1971). Curve estimates. Annals of Mathematical Statistics 42, 1815 -1842. Rosenblatt, M. (1985). Stationary Sequences and Random Fields. Birkhäuser, Boston. Roussas, G.G. (1965a). Asymptotic inference in Markov processes. Annals of Mathematical Statistics 36, 978 - 992. Roussas, G.G. (1965b). Extension to Markov processes of a result by A. Wald about the consistency of the maximum likelihood estimate. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 4, 69 - 73.
602
MODELING UNCERTAINTY
Roussas, G.G. (1968a). Asymptotic normality of the maximum likelihood estimate in Markov processes. Metrika 14, 62 - 70. Roussas, G.G. (1968b). Some applications of the asymptotic distribution of the likelihood functions to the asymptotic efficiency of estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 10, 252 - 260. Roussas, G.G. (1969a). Nonparametric estimation in Markov processes. Annals of the Institute of Statistical Mathematics 21, 73 - 78. Roussas, G.G. (1969b). Nonparametric estimation of the transition distribution function of a Markov process. Annals of Mathematical Statistics 40, 1386 1400. Roussas, G.G. (1972). Contiguity of Probability Measures: Some Applications in Statistics. Cambridge University Press, Cambridge. Roussas, G.G. and A. Soms. (1973). On the exponential approximation of a family of probability measures and a representation theorem of Hájek-Inagaki. Annals of the Institute of Statistical Mathematics 25, 27 - 39. Roussas, G.G. (1979). Asymptotic distribution of the log-likelihood function for stochastic processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 47, 31 - 46. Roussas, G.G. and D. Ioannides. (1987). Moment inequalities for mixing sequences of random variables. Stochastic Analysis and Applications 5, 61 120. Roussas, G.G. (1988a). A moment inequality of for triangular arrays of random variables under mixing conditions, with applications. In: Statistical Theory and Data Analysis II, K. Matusita (Ed.) 273 - 292. North-Holland, Amsterdam. Roussas, G.G. (1988b). Nonparametric estimation in mixing sequences of random variables. Journal of Statistical Planning and Inference 18, 135 -149. Roussas, G.G. and D. Ioannides. (1988). Probability bounds for sums of triangular arrays of random variables under mixing conditions. In: Statistical Theory and Data Analysis II, K. Matusita (Ed.) 293 - 308. North-Holland, Amsterdam. Roussas, G.G. (1989a). 
Consistent regression estimation with fixed design points under dependence conditions. Statistics and Probability Letters 8, 41 - 50. Roussas, G.G. (1989b). Hazard rate estimation under dependence conditions. Journal of Statistical Planning and Inference 22, 81 - 93. Roussas, G.G. (1989c). Some asymptotic properties of an estimate of the survival function under dependence conditions. Statistics and Probability Letters 8, 235 - 243. Roussas, G.G. (1990a). Asymptotic normality of the kernel estimate under dependence conditions: Application to hazard rate. Journal of Statistical Planning and Inference 25, 81 - 104.
REFERENCES
603
Roussas, G.G. (1990b). Nonparametric regression estimation and mixing conditions. Stochastic Processes and their Applications 36, 107 - 116. Roussas, G.G. (1991a). Estimation of transition distribution function and its quantiles in Markov processes: Strong consistency and asymptotic normality. In: Nonparametric Functional Estimation and Related Topics, G.G. Roussas (Ed.) 443 - 462. Kluwer Academic Publishers, The Netherlands. Roussas, G.G. (1991b). Recursive estimation of the transition distribution function of a Markov process: Asymptotic normality. Statistics and Probability Letters 11, 435 - 447. Roussas, G.G. (1992). Exact rates of almost sure convergence of a recursive kernel estimate of a probability density function: Application to regression and hazard rate estimation. Journal of Nonparametric Statistics 1, 171 -195. Roussas, G.G. and L.T. Tran. (1992a). Joint asymptotic normality of kernel estimates under dependence, with applications to hazard rate. Journal of Nonparametric Statistics 1, 335 - 355. Roussas, G.G. and L.T. Tran. (1992b). Asymptotic normality of the recursive kernel regression estimate under dependence conditions. Annals of Statistics 20, 98 -120. Roussas, G.G., L.T. Tran, and D.A. loannides. (1992). Fixed design regression for time series: Asymptotic normality. Journal of Multivariate Analysis 40, 262 - 291. Roussas, G.G. (1996). Efficient estimation of a smooth distribution function under In: Research Developments in Probability and Statistics, Festschrift in honor of Madan L. Puri, E. Brunner and M Denker (Eds.), 205 -217. VSP, The Netherlands. Roussas, G.G. and Y.G. Yatracos. (1996). Minimum distance regression type estimates with rates under weak dependence. Annals of the Institute of Statistical Mathematics 48, 267 - 281. Roussas, G.G. and Y.G. Yatracos. (1997). Minimum distance estimates with rates under mixing. In: Research Papers in Probability and Statistics, Festschrift for Lucien Le Cam, D. Pollard, E. Torgersen and G.L. 
Yang (Eds.) 337 - 344. Springer-Verlag, New York. Roussas, G.G. and D. Bhattacharya. (1999a). Asymptotic behavior of the loglikelihood function in stochastic processes when based on a random number of random variables. In: Semi-Markov Models and Applications, J. Janssen and N. Limnios (Eds.) 119 -147. Kluwer Academic Publishers, The Netherlands. Roussas, G.G. and D. Bhattacharya. (1999b). Some asymptotic results and exponential approximation in semi-Markov models. In: Semi-Markov Models and Applications, J. Janssen and N. Limnios (Eds.) 149 -166. Kluwer Academic Publishers, The Netherlands.
604
MODELING UNCERTAINTY
Roussas, G.G. (2000). Contiguity of Probability Measures. Encyclopaedia of Mathematics, Supplement II. Kluwer Academic Publishers, pages 129 -130, The Netherlands. Rüschendorf, L. (1987). Consistency of estimates for multivariate density functions and for the mode. Series A 39, 243 - 250. Schmetterer, L. (1966). On the asymptotic efficiency of estimates. In: Research Papers in Statistics, Festschrift for J. Neyman; F.N. David (Ed.) 301 - 317. John Wiley and Sons, New York. Schuster, E.F. (1969). Estimation of a probability density function and its derivatives. Annals of Mathematical Statistics 40, 1187 - 1195. Schuster, E.F. (1972). Joint asymptotic distribution of the estimated regression function at a finite number of distinct points. Annals of Mathematical Statistics 43, 84 - 88. Scott, D.W. (1992). Multivariate Density Estimation. Wiley, New York. Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London. Stamatelos, G.D. (1976). Asymptotic distribution of the log-likelihood function for stochastic processes. Some examples. Bulletin of Mathematical Society. Gréce 17, 92 - 116. Takahata, H. (1993). On the rates in the central limit theorem for weakly dependent random variables. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 62, 477 - 480. Tran, L.T. (1990). Kernel density estimation under dependence. Statistics and Probability Letters 10, 193-201. Tran, L.T. and S. Yakowitz. (1993). Nearest neighbor estimators for random fields. Journal of Multivariate Analysis 44, 23 - 46. Tran, L.T., G.G. Roussas, S. Yakowitz, and V. Truong. (1996). Fixed-design regression for linear time series. Annals of Statistics 24, 975 - 991. Wald, A. (1941). Asymptotically most powerful tests of statistical hypotheses. Annals of Mathematical Statistics 12, 1 - 19. Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. 
Transactions of the American Mathematical Society 54, 426 - 482. Watson, G.S. (1964). Smooth regression analysis. Series A 26, 359 372. Watson, G.S. and M.R. Leadbetter. (1964a). Hazard Analysis. I. Biometrika 51, 175 -184. Watson, G.S. and M.R. Leadbetter. (1964b). Hazard Analysis. Series A 26, 110-116. Wegman, E.J. and H.I. Davis. (1979). Remarks on some recursive estimators of a probability density. Annals of Statistics. 7, 316 - 327.
REFERENCES
605
Weiss, L. and J. Wolfowitz. (1966). Generalized maximum likelihood estimators. Theory of Probability and its Applications 11, 58-81. Weiss, L. and J. Wolfowitz. (1967). Maximum probability estimators. Annals of the Institute of Statistical Mathematics 19, 193 - 206. Weiss, L. and J. Wolfowitz. (1970). Maximum probability estimators and asymptotic sufficiency. Annals of the Institute of Statistical Mathematics 22, 225 244. Weiss, L. and J. Wolfowitz. (1974). Maximum Probability Estimators and Related Topics. Lecture Notes in Mathematics 424, Springer-Verlag, BerlinHeidelberg-New York. Withers, C.S. (1981a). Conditions for linear processes to be strong mixing. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 477 480. Withers, C.S. (1981b). Central limit theorems for dependent variables I. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 509 - 534. Wolfowitz, J. (1965). Asymptotic efficiency of the maximum likelihood estimator. Theory of Probability and its Applications 10, 247 - 260. Yakowitz, S. (1972). A statistical model for daily stream flow records with applications to the Rillito river. Proceedings, International Symposium on Uncertainties in Hydrologic and Water Resources Systems, 273 - 283. University of Arizona, Tucson. Yakowitz, S. (1973). A stochastic model for daily river flows in an arid region. Water Resources Research 9, 1271 - 1285. Yakowitz, S. and J. Denny. (1973). On the statistics of hydrologic time series. Proceedings, 17th Annual Meeting of the Arizona Academy of Sciences 3, 146 -163. Tucson. Yakowitz, S. (1976a). Small sample hypothesis tests of Markov order, with applications to simulated and hydrologic chain. Journal of the American Statistical Association 71, 132 -136. Yakowitz, S. (1976b). Model-free statistical methods for water table predictions. Water Resources Research 12, 836 - 844. Yakowitz, S. (1977a). Statistical models and methods for rivers in the southwest. 
Proceedings, 21st Annual Meeting of the Arizona Academy of Sciences. Las Vegas. Yakowitz, S. (1977b). Computational Probability and Simulation. AddisonWesley, Reading, Mass. Yakowitz, S. (1979). Nonparametric estimation of Markov transition functions. Annals of Statistics 7, 671 - 679. Yakowitz, S. (1985). Nonparametric density estimation, prediction, and regression for Markov sequences. Journal of the American Statistical Association 80, 215 - 221.
606
MODELING UNCERTAINTY
Yakowitz, S. (1987). Nearest-neighbor methods for time series analysis. Journal of Time Series Analysis 8, 235 - 247. Yakowitz, S. (1989). Nonparametric density and regression estimation for Markov sequences without mixing assumptions. Journal of Multivariate Analysis 30, 124 - 136. Yakowitz, S. and W. Lowe. (1991). Nonparametric bandit methods. Annals of Operation Research 28, 297 - 312. Yakowitz, S., T. Jayawardena, and S. Li. (1992). Theory for automatic learning under partially observed Markov-dependent noise. IEEE Transactions on Automatic Control 37, 1316 - 1324. Yakowitz, S. (1993). Nearest neighbor regression estimation for null-recurrent Markov time series. Stochastic Processes and their Applications 48, 311 318. Yamato, H. (1973). Uniform convergence of an estimator of a distribution function. Bulletin of Mathematical Society 15, 69 - 78. Yokoyama, R. (1980). Moment bounds for stationary mixing sequences. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 52, 45-57. Yoshihara, K. (1978). Moment inequalities for mixing sequences. Kodai Mathematics Journal 1, 316 - 328. Yoshihara, K. (1993). Weakly Dependent Stochastic Sequences and Their Applications. Vol. II: Asymptotic Statistics based on Weakly Dependent Data. Sanseido Co. Ltd., Tokyo. Yoshihara, K. (1994a). Weakly Dependent Stochastic Sequences and Their Applications. Vol. IV: Curve Estimation based on Weakly Dependent Data. Sanseido Co. Ltd., Tokyo. Yoshihara, K. (1994b). Weakly Dependent Stochastic Sequences and Their Applications. Vol. V: Estimators based on Time Series. Sanseido Co, Ltd., Tokyo. Yoshihara, K. (1995). Weakly Dependent Stochastic Sequences and Their Applications. Vol. VI: Statistical Inference based on Weakly Dependent Data. Sanseido, Co, Ltd., Tokyo.
Part VII

Chapter 24

STOCHASTIC ORDERING OF ORDER STATISTICS II

Philip J. Boland
Department of Statistics, University College Dublin, Belfield, Dublin 4, Ireland
Taizhong Hu
Department of Statistics and Finance, University of Science and Technology, Hefei, Anhui 230026, People's Republic of China

Moshe Shaked
Department of Mathematics, University of Arizona, Tucson, Arizona 85721, USA

J. George Shanthikumar
Industrial Engineering & Operations Research, University of California, Berkeley, California 94720, USA
Abstract: In this paper we survey some recent developments involving comparisons of order statistics and spacings in various stochastic senses.
Keywords: Reliability theory, systems, IFR, DFR, hazard rate order, likelihood ratio order, dispersive order, sample spacings.

1. INTRODUCTION
Order statistics are basic probabilistic quantities that are useful in the theory of probability and statistics. Almost every student of probability and statistics encounters these random variables at an early stage of his or her studies, because these statistics are associated with an elegant theory, are useful in applications, and are also a convenient tool for illustrating, in a basic (though not trivial) way, probabilistic concepts such as transformations, conditional probabilities, lack of independence, and the foundations of stochastic processes. In the area of statistical inference, order statistics are the basic quantities used to define observable functions such as the empirical distribution function, and in reliability theory they are the lifetimes of systems.

In 1996 Boland, Shaked and Shanthikumar wrote a survey (which appeared in 1998 (Boland, Shaked, and Shanthikumar, 1998)) covering most of what had been developed in the area of stochastic ordering of order statistics up to that time. During the last few years this area has experienced an explosion of new developments. In this paper we try to describe and summarize some of these recent developments.

The notation that we use in this paper is the following. Let X_1, X_2, ..., X_n be independent random variables which may or may not be identically distributed. The corresponding order statistics are denoted by X_{(1:n)} ≤ X_{(2:n)} ≤ ... ≤ X_{(n:n)}. Thus, X_{(1:n)} = min{X_1, ..., X_n} and X_{(n:n)} = max{X_1, ..., X_n}. If Y_1, Y_2, ..., Y_m is another collection of independent random variables, then the corresponding order statistics are denoted by Y_{(1:m)} ≤ Y_{(2:m)} ≤ ... ≤ Y_{(m:m)}. Many of the recent results in the literature yield stochastic comparisons of X_{(i:n)} with Y_{(j:m)} whenever i ≤ j and n − i ≥ m − j. This is to be contrasted with previous results, which yielded stochastic comparisons of the above order statistics only under restrictions such as i = j and/or m = n. In this paper we emphasize the newer kind of comparisons. Of course, by a simple choice of the indices i and j and the sample sizes n and m, one usually can obtain the older results from the newer ones.

In this paper we also cover some recent results which stochastically compare sample spacings. We assume throughout the paper that the X_i and the Y_j have absolutely continuous distribution functions, though many of the results which are described below are valid also in the more general case where the distribution functions of these random variables are quite general. In this paper "increasing" and "decreasing" mean "nondecreasing" and "nonincreasing," respectively. For any random variable X and any event A we denote
by [X|A] any random variable whose distribution function is the conditional distribution of X given A. Unless stated otherwise, all the stochastic orders that are studied in this paper are described and extensively analyzed in Shaked and Shanthikumar (1994).
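The order-statistic notation introduced above is easy to make concrete in code. The following minimal Python sketch (the function name is ours, purely illustrative) computes X_{(i:n)} from a sample:

```python
def order_statistic(sample, i):
    """Return the i-th order statistic X_{(i:n)} (1-indexed) of the sample."""
    if not 1 <= i <= len(sample):
        raise ValueError("rank i must satisfy 1 <= i <= n")
    return sorted(sample)[i - 1]

sample = [2.3, 0.7, 1.9, 3.1]
n = len(sample)
# X_{(1:n)} is the sample minimum and X_{(n:n)} is the sample maximum.
print(order_statistic(sample, 1), order_statistic(sample, n))  # 0.7 3.1
```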
2. LIKELIHOOD RATIO ORDERS COMPARISONS
In this section we describe some recent results which yield orderings of order statistics with respect to various likelihood ratio orders. Recall the definition of the likelihood ratio order when the compared random variables have interval supports (possibly infinite) that need not be identical. Let X and Y be two absolutely continuous random variables, each with an interval support. Let l_X and u_X be the left and the right endpoints of the support of X. Similarly define l_Y and u_Y. The values l_X, u_X, l_Y, and u_Y may be infinite. Let f and g denote the density functions of X and Y, respectively. We say that X is smaller than Y in the likelihood ratio order, denoted as X ≤_lr Y, if

    f(x) g(y) ≥ f(y) g(x) for all x ≤ y;    (2.1)

equivalently, if the ratio g(x)/f(x) is increasing in x over the union of the supports of X and Y. Note that in (2.1), when we divide, we use the convention a/0 = ∞ when a > 0. In particular, it is seen that if X ≤_lr Y then l_X ≤ l_Y and u_X ≤ u_Y.

Shanthikumar and Yao (1986) have introduced and studied an order which they called the shifted likelihood ratio order. The following definition is slightly more general than the definition of Shanthikumar and Yao (1986), who considered only nonnegative random variables. We say that X is smaller than Y in the up shifted likelihood ratio order, denoted as X ≤_{lr↑} Y, if

    X ≤_lr Y + δ for every δ ≥ 0.    (2.2)

In the sequel we will also touch upon another stochastic order, introduced in Lillo, Nanda and Shaked (2001), which is defined as follows. Let X and Y be two absolutely continuous random variables with support (0, ∞). We say that X is smaller than Y in the down shifted likelihood ratio order, denoted as X ≤_{lr↓} Y, if

    X ≤_lr [Y − δ | Y > δ] for every δ ≥ 0.

Note that in the above definition we compare only nonnegative random variables. This is because for the down shifted likelihood ratio order we cannot take an analog of (2.2), such as X ≤_lr Y − δ for every δ ≥ 0, as a definition. The reason is that here, by taking δ very large, it is seen that practically there are no random variables that satisfy such an order relation. Note that in the definition above, the right hand side [Y − δ | Y > δ] can take on (when δ varies) any value in the
610
MODELING UNCERTAINTY
right neighborhood of 0. Therefore we restricted the support of the compared random variables to (0, ∞).

Lillo, Nanda and Shaked (2001) have obtained the following results. Let X_1, X_2, ..., X_n be independent (not necessarily i.i.d. (independent and identically distributed)) random variables, and let Y_1, Y_2, ..., Y_m be other independent (not necessarily i.i.d.) random variables, all having absolutely continuous distributions. Then

    X_i ≤_lr Y_j for all i and j ⟹ X_{(i:n)} ≤_lr Y_{(j:m)} whenever i ≤ j and n − i ≥ m − j,    (2.3)

and

    X_i ≤_{lr↑} Y_j for all i and j ⟹ X_{(i:n)} ≤_{lr↑} Y_{(j:m)} whenever i ≤ j and n − i ≥ m − j.    (2.4)

For nonnegative random variables with support (0, ∞), Lillo, Nanda and Shaked (2001) were not able to prove a complete analog of (2.3) and (2.4) for the ≤_{lr↓} order. Rather, they could only show that the minima can be compared, that is,

    X_i ≤_{lr↓} Y_j for all i and j ⟹ X_{(1:n)} ≤_{lr↓} Y_{(1:m)}.    (2.5)

In fact, Lillo, Nanda and Shaked (2001) showed that it is not possible to replace ≤_{lr↑} by ≤_{lr↓} in (2.4). When the X_i [respectively, Y_j] above are i.i.d. then from (2.3)–(2.5) it follows that

    X_1 ≤_lr Y_1 ⟹ X_{(i:n)} ≤_lr Y_{(j:m)} whenever i ≤ j and n − i ≥ m − j,    (2.6)

and

    X_1 ≤_{lr↑} Y_1 ⟹ X_{(i:n)} ≤_{lr↑} Y_{(j:m)} whenever i ≤ j and n − i ≥ m − j,    (2.7)
    X_1 ≤_{lr↓} Y_1 ⟹ X_{(1:n)} ≤_{lr↓} Y_{(1:m)}.    (2.8)

By taking Y_1 =_st X_1 in (2.6) we obtain a comparison of order statistics from the same distribution (though the sample sizes may differ). Specifically, from (2.6) it follows that if the X_i are i.i.d. then

    X_{(i:n)} ≤_lr X_{(j:m)} whenever i ≤ j and n − i ≥ m − j;    (2.9)

this result has been obtained by Raqab and Amin (1996) and, independently, by Khaledi and Kochar (1999).
In general, a random variable X is not comparable in the up [respectively, down] shifted likelihood ratio order with itself, unless it has a logconcave [respectively, logconvex] density function. Thus, from (2.7) it follows that if the X_i above are i.i.d. with a logconcave density function then

    X_{(i:n)} ≤_{lr↑} X_{(j:m)} whenever i ≤ j and n − i ≥ m − j.

Similarly, from (2.8) it follows that if the X_i above are i.i.d. with a logconvex density function then

    X_{(1:n)} ≤_{lr↓} X_{(1:m)}.

These results can be found in Lillo, Nanda and Shaked (2001).
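The plain likelihood ratio order defined in (2.1) lends itself to a quick numerical sanity check: on a grid inside the common support, the ratio g/f should be nondecreasing. The sketch below is our own illustration (and only a heuristic, since a finite grid cannot prove monotonicity), using two exponential densities:

```python
import math

def lr_smaller(f, g, grid):
    """Heuristically check X <=_lr Y by testing that g(x)/f(x) is
    nondecreasing over a grid contained in the common support."""
    ratios = [g(x) / f(x) for x in grid]
    return all(r1 <= r2 + 1e-12 for r1, r2 in zip(ratios, ratios[1:]))

# Exponential densities: X ~ Exp(2), Y ~ Exp(1); here g(x)/f(x) grows like
# e^x / 2, so X <=_lr Y should be reported, and the reverse should fail.
f = lambda x: 2.0 * math.exp(-2.0 * x)
g = lambda x: 1.0 * math.exp(-1.0 * x)
grid = [0.01 * k for k in range(1, 500)]
print(lr_smaller(f, g, grid))  # True
print(lr_smaller(g, f, grid))  # False
```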
3. HAZARD AND REVERSED HAZARD RATE ORDERS COMPARISONS
In this section we describe some recent results which yield orderings of order statistics with respect to the hazard and reversed hazard rate orders. An elegant new result is also given in this section. Recall the definition of the hazard rate and the reversed hazard rate orders when the compared random variables have supports that need not be identical. Let X and Y be two continuous (not necessarily nonnegative) random variables, each with an interval support, which we denote by (l_X, u_X) and (l_Y, u_Y), respectively; l_X and l_Y may be −∞, and u_X and u_Y may be ∞. Let F and G be the distribution functions of X and Y, respectively, and let F̄ ≡ 1 − F and Ḡ ≡ 1 − G be the corresponding survival functions. We say that X is smaller than Y in the hazard rate order, denoted as X ≤_hr Y, if

    F̄(x) Ḡ(y) ≥ F̄(y) Ḡ(x) for all x ≤ y;    (3.1)

that is, if Ḡ(x)/F̄(x) is increasing in x. Note that in (3.1), when we divide, we use the convention a/0 = ∞ when a > 0. In particular, it is seen that if X ≤_hr Y then u_X ≤ u_Y. If the hazard rate functions r_X ≡ f/F̄ and r_Y ≡ g/Ḡ are well defined, then

    X ≤_hr Y if, and only if, r_X(t) ≥ r_Y(t) for all t.    (3.2)

We say that X is smaller than Y in the reversed hazard rate order, denoted as X ≤_rh Y, if

    F(x) G(y) ≥ F(y) G(x) for all x ≤ y;

that is, if G(x)/F(x) is increasing in x.
Again it is seen that if X ≤_rh Y then l_X ≤ l_Y. In Shaked and Shanthikumar (1994) it is shown that

    X ≤_hr Y ⟹ φ(X) ≤_hr φ(Y)    (3.3)

for any continuous function φ which is strictly increasing on (l_X, u_Y); also, X ≤_rh Y ⟹ φ(X) ≤_rh φ(Y) for any such function φ. In Nanda and Shaked (2000) it is shown that

    X ≤_hr Y ⟹ φ(X) ≥_rh φ(Y)    (3.4)

for any continuous function φ which is strictly decreasing on (l_X, u_Y); also,

    X ≤_rh Y ⟹ φ(X) ≥_hr φ(Y)    (3.5)

for any such function φ. The latter two implications correct a mistake in Theorems 1.B.2 and 1.B.22 in Shaked and Shanthikumar (1994): the parenthetical statements there are incorrect. These two implications often enable us to transform results about the hazard rate order into results about the reversed hazard rate order, and vice versa.

The first result regarding ordering order statistics in the sense of the hazard and the reversed hazard rate orders is the following useful proposition; later (see Theorem 3.2) we use it in order to obtain a new stronger result.

Proposition 3.1. Let X_1, X_2, ..., X_n [respectively, Y_1, Y_2, ..., Y_n] be independent (not necessarily i.i.d.) absolutely continuous random variables, all with support (a, b) for some −∞ ≤ a < b ≤ ∞.

(i) If X_i ≤_hr Y_j for all i and j, then X_{(k:n)} ≤_hr Y_{(k:n)}, k = 1, 2, ..., n.

(ii) If X_i ≤_rh Y_j for all i and j, then X_{(k:n)} ≤_rh Y_{(k:n)}, k = 1, 2, ..., n.
Boland and Proschan (1994) proved part (i) of Proposition 3.1 for nonnegative random variables; however, by (3.3), the inequality in part (i) is valid under the weaker assumptions of Proposition 3.1. Part (ii) of Proposition 3.1 strengthens Corollary 3.1 of Nanda, Jain and Singh (1998). It is worthwhile noting that, by using (3.4) and (3.5), it can be shown that part (ii) is actually equivalent to part (i); see Nanda and Shaked (2000) for details.

We now state and prove a new result which strengthens and unifies some previous results. The following theorem gives hazard rate and reversed hazard rate analogs of (2.3) and (2.4).

Theorem 3.2. Let X_1, ..., X_n be independent (not necessarily i.i.d.) random variables, and let Y_1, ..., Y_m be other independent (not necessarily i.i.d.) random variables, all having absolutely continuous distributions with support (a, b) for some −∞ ≤ a < b ≤ ∞. Then

    X_i ≤_hr Y_j for all i and j ⟹ X_{(i:n)} ≤_hr Y_{(j:m)} whenever i ≤ j and n − i ≥ m − j,    (3.6)

and

    X_i ≤_rh Y_j for all i and j ⟹ X_{(i:n)} ≤_rh Y_{(j:m)} whenever i ≤ j and n − i ≥ m − j.    (3.7)

Proof. First we prove (3.6). Assume that X_i ≤_hr Y_j for all i and j. We will now show that there exists a random variable Z with support (a, b) such that X_i ≤_hr Z ≤_hr Y_j for all i and j. Let r_{X_i} and r_{Y_j} denote the hazard rate functions of the indicated random variables. From the assumption that X_i ≤_hr Y_j for all i and j it follows by (3.2) that

    min_i r_{X_i}(t) ≥ max_j r_{Y_j}(t), t ∈ (a, b).

Let r be a function which satisfies

    min_i r_{X_i}(t) ≥ r(t) ≥ max_j r_{Y_j}(t), t ∈ (a, b);

for example, let r ≡ min_i r_{X_i}. It can be shown that r is indeed a hazard rate function. Let Z be a random variable with the hazard rate function r. Then indeed X_i ≤_hr Z ≤_hr Y_j for all i and j. Now, let Z_1, Z_2, ... be i.i.d. random variables distributed as Z. Then, for i ≤ j and n − i ≥ m − j,

    X_{(i:n)} ≤_hr Z_{(i:n)} ≤_lr Z_{(j:m)} ≤_hr Y_{(j:m)}

(the two outer comparisons follow from Proposition 3.1, and the middle one from (2.9)), and the desired result follows from the fact that the likelihood ratio order implies the hazard rate order. With the aid of (3.4) and (3.5) it can be shown that statement (3.7) is equivalent to (3.6).

Theorem 3.2 extends and unifies the relatively restricted Theorems 3.8 and 3.9 of Nanda and Shaked (2000). In light of (4.1) in Section 4, one may wonder whether the conclusion in (3.6) holds if it is only assumed there that X_i ≤_hr Y_i for all i (rather than for all i and j). In order to see that this is not the case, consider the
independent exponential random variables X_1 and X_2 with hazard rates 1 and 3, respectively, and the independent exponential random variables Y_1 and Y_2 with hazard rates 1 and 2, respectively. Then X_1 ≤_hr Y_1 and X_2 ≤_hr Y_2. However, it is easy to verify that X_{(2:2)} and Y_{(2:2)} are not comparable in the hazard rate order. Similarly, the conclusion in (3.7) does not follow from the mere assumption that X_i ≤_rh Y_i for all i.

We end this section by stating some results involving hazard rate and reversed hazard rate comparisons of order statistics constructed from one set of random variables. Let X_1, X_2, ..., X_n be independent (not necessarily i.i.d.) absolutely continuous random variables, all with support (a, b) for some −∞ ≤ a < b ≤ ∞. Then

    X_{(k:n)} ≤_hr X_{(k:n−1)} ≤_hr X_{(k+1:n)}, k = 1, 2, ..., n − 1,    (3.8)

and

    X_{(k:n)} ≤_rh X_{(k:n−1)} ≤_rh X_{(k+1:n)}, k = 1, 2, ..., n − 1.    (3.9)

The first inequality in (3.8) was proven in Boland, El-Neweihi and Proschan (1994) for nonnegative random variables; however, by (3.3), the inequality holds also without the nonnegativity assumption. The second inequality in (3.8) is taken from Hu and He (2000). The inequalities in (3.9) can be found in Block, Savits and Singh (1998) and in Hu and He (2000). Again, using (3.4) and (3.5) it can be shown that (3.9) is actually equivalent to (3.8); see Nanda and Shaked (2000) for details.

For the next inequalities we need to have, among the X_i, a "largest" [respectively, "smallest"] variable in the sense of the hazard [reversed hazard] rate order. Again, let X_1, X_2, ..., X_n be independent (not necessarily i.i.d.) absolutely continuous random variables, all with support (a, b) for some −∞ ≤ a < b ≤ ∞. Then:

(i) If X_i ≤_hr X_n for i = 1, 2, ..., n − 1, then X_{(n−1:n)} ≤_hr X_{(n:n)}.

(ii) If X_1 ≤_rh X_i for i = 2, 3, ..., n, then X_{(1:n)} ≤_rh X_{(2:n)}.
Boland, El-Neweihi and Proschan (1994) proved part (i) above for nonnegative random variables; however, again by (3.3), the inequality in part (i) is valid without the nonnegativity assumption. Part (ii) above is Theorem 4.2 of Block, Savits and Singh (1998). Again, using (3.4) and (3.5) once more it can be shown that part (ii) is equivalent to part (i); see Nanda and Shaked (2000) for details.
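The hazard rate order comparisons in this section can likewise be probed numerically: by (3.1), X ≤_hr Y amounts to the survival-function ratio Ḡ/F̄ being nondecreasing. Below is a small grid-based sketch (our own, and heuristic only), using two Weibull survival functions with a common shape parameter:

```python
import math

def hr_smaller(sf_x, sf_y, grid):
    """Heuristically check X <=_hr Y by testing that the survival-function
    ratio sf_y(t)/sf_x(t) is nondecreasing over a grid."""
    ratios = [sf_y(t) / sf_x(t) for t in grid]
    return all(a <= b + 1e-12 for a, b in zip(ratios, ratios[1:]))

# Weibull survival functions with common shape 2: scale 1 for X, scale 2 for Y.
# The ratio is exp(0.75 * t**2), which is increasing, so X <=_hr Y.
sf_x = lambda t: math.exp(-(t / 1.0) ** 2)
sf_y = lambda t: math.exp(-(t / 2.0) ** 2)
grid = [0.05 * k for k in range(1, 200)]
print(hr_smaller(sf_x, sf_y, grid))  # True
print(hr_smaller(sf_y, sf_x, grid))  # False
```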
4. USUAL STOCHASTIC ORDER COMPARISONS
Recall the definition of the usual stochastic order. Let X and Y be random variables with survival functions F̄ and Ḡ, respectively. We say that X is smaller than Y in the usual stochastic order, denoted as X ≤_st Y, if

    F̄(x) ≤ Ḡ(x) for all x.

The usual stochastic order is implied by the orders studied in Sections 2 and 3. Therefore, comparisons of order statistics, associated with one collection of random variables, follow from previous results in these sections. For example, in (2.9), (3.8) and (3.9), the orders ≤_lr, ≤_hr, and ≤_rh can be replaced by ≤_st. However, when we try to compare order statistics that are associated with two different sets of random variables (that is, a set of X_i and a set of Y_j), we may get new results, because the assumption that an X_i is smaller than a Y_j in the usual stochastic order is weaker than a similar assumption involving any of the orders discussed in Sections 2 and 3. At present there are not many such results available. We just mention one recent result that has been derived, independently, by Belzunce, Franco, Ruiz and Ruiz (2001) and by Nanda and Shaked (2000). Let X_1, ..., X_n [respectively, Y_1, ..., Y_m] be independent (not necessarily i.i.d.) absolutely continuous random variables, all with support (a, b) for some −∞ ≤ a < b ≤ ∞, and suppose that X_k ≤_st Y_k for each k for which both X_k and Y_k are defined. Then for any i ≤ j and n − i ≥ m − j we have that

    X_{(i:n)} ≤_st Y_{(j:m)}.    (4.1)
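In the i.i.d. case the usual stochastic ordering of order statistics can even be verified exactly, since P(X_{(i:n)} ≤ t) is the binomial tail sum of C(n, k) F(t)^k (1 − F(t))^{n−k} over k = i, ..., n. The following sketch (an illustration we add here, not from the chapter) evaluates this formula and confirms that the distribution function decreases in the rank i, which is exactly X_{(i:n)} ≤_st X_{(i+1:n)}:

```python
from math import comb

def cdf_order_statistic(i, n, p):
    """P(X_{(i:n)} <= t) for an i.i.d. sample of size n, where p = F(t)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(i, n + 1))

# For any fixed t (i.e., fixed p = F(t)), the CDF decreases in the rank i;
# pointwise domination of survival functions is the usual stochastic order.
p = 0.4
vals = [cdf_order_statistic(i, 5, p) for i in range(1, 6)]
print(all(a >= b for a, b in zip(vals, vals[1:])))  # True
```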
5. STOCHASTIC COMPARISONS OF SPACINGS
Let X_1, X_2, ..., X_n be nonnegative random variables with the associated order statistics X_{(1:n)} ≤ X_{(2:n)} ≤ ... ≤ X_{(n:n)}. The corresponding spacings are defined by U_{i:n} ≡ X_{(i:n)} − X_{(i−1:n)}, i = 1, 2, ..., n (where X_{(0:n)} ≡ 0), and the corresponding normalized spacings are defined as D_{i:n} ≡ (n − i + 1)(X_{(i:n)} − X_{(i−1:n)}), i = 1, 2, ..., n. The normalized spacings are of importance in reliability theory because they are the building blocks of the TTT (total time on test) statistic. Kochar (1998) has surveyed the literature on stochastic comparisons of spacings and normalized spacings up to 1997. In this section we emphasize more recent advances.

Let X_1, X_2, ..., X_n be i.i.d. nonnegative random variables. If the common distribution function F is IFR (increasing failure rate; that is, log F̄ is concave on [0, ∞)) then the normalized spacings satisfy

    D_{i+1:n} ≤_st D_{i:n}, i = 1, ..., n − 1, and D_{i:n} ≤_st D_{i:n+1}, i = 1, ..., n.    (5.1)
The above two results were obtained by Barlow and Proschan (1966), who also showed that if F is DFR (decreasing failure rate; that is, log F̄ is convex on [0, ∞)) then the inequalities above are reversed. Kochar and Kirmani (1995) strengthened this DFR result as follows. Let X_1, X_2, ..., X_n be i.i.d. nonnegative random variables. If the common distribution function F is DFR then the normalized spacings satisfy

    D_{i:n} ≤_hr D_{i+1:n}, i = 1, 2, ..., n − 1.    (5.2)

Khaledi and Kochar (1999) proved, in addition to the above, that if F is DFR then the normalized spacings satisfy

    D_{i:n} ≤_hr D_{i:n−1} and D_{i:n} ≤_hr D_{i+1:n+1}.    (5.3)

Summarizing (5.1)–(5.3) we have, under the assumption that F is DFR, that

    D_{i:n} ≤_hr D_{j:m} whenever i ≤ j and n − i ≥ m − j.

Kochar and Kirmani (1995) claimed that if X_1, ..., X_n are i.i.d. nonnegative random variables with a common logconvex density function then D_{i:n} ≤_lr D_{i+1:n}; however, Misra and van der Meulen (2001) showed via a counterexample that this is not correct. Pledger and Proschan (1971) showed that if X_1, ..., X_n are exponential random variables with possibly different parameters (or, more generally, with decreasing proportional hazard rate functions) then

    D_{i:n} ≤_st D_{i+1:n}, i = 1, 2, ..., n − 1.

For a particular choice of the parameters of the exponential random variables, Khaledi and Kochar (2000b) showed that the above inequality holds with ≤_hr replacing ≤_st. In an effort to extend this comparison to the likelihood ratio order, Kochar and Korwar (1996) obtained the following result, which compares any normalized spacing to the first normalized spacing. Again, if X_1, ..., X_n are exponential random variables with possibly different parameters then

    D_{1:n} ≤_lr D_{j:n}, j = 2, 3, ..., n.
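The spacings, normalized spacings, and total time on test mentioned at the start of this section are straightforward to compute. A short sketch of our own (it takes X_{(0:n)} = 0, as is customary for nonnegative samples):

```python
def spacings(sample):
    """Spacings U_i = X_(i) - X_(i-1), with X_(0) = 0 for nonnegative data."""
    xs = [0.0] + sorted(sample)
    return [b - a for a, b in zip(xs, xs[1:])]

def normalized_spacings(sample):
    """Normalized spacings D_i = (n - i + 1)(X_(i) - X_(i-1))."""
    n = len(sample)
    return [(n - i) * u for i, u in enumerate(spacings(sample))]

sample = [1.0, 3.0, 4.0, 8.0]
print(spacings(sample))             # [1.0, 2.0, 1.0, 4.0]
print(normalized_spacings(sample))  # [4.0, 6.0, 2.0, 4.0]
# The normalized spacings sum to the total time on test of the full sample:
print(sum(normalized_spacings(sample)) == sum(sample))  # True
```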
The following unpublished result, which compares spacings, is due to Joag-Dev (1995).

Theorem 5.1. Let X_1, X_2, ..., X_n be i.i.d. random variables with a finite support, and with an increasing [decreasing] density function over that support. Then the adjacent spacings satisfy U_{i+1:n} ≤_st U_{i:n} [U_{i:n} ≤_st U_{i+1:n}], i = 2, 3, ..., n − 1.

Proof. Let F and f denote, respectively, the distribution function and the density function of the X_i's. Given X_{(i−1:n)} = x and X_{(i+1:n)} = y, the conditional density of U_{i:n} at the point u ∈ (0, y − x) is proportional to f(x + u), and the conditional density of U_{i+1:n} at the point u ∈ (0, y − x) is proportional to f(y − u). Since f is increasing [decreasing], it is seen that, conditionally, U_{i+1:n} ≤_lr U_{i:n} [U_{i:n} ≤_lr U_{i+1:n}], and therefore, conditionally, U_{i+1:n} ≤_st U_{i:n} [U_{i:n} ≤_st U_{i+1:n}]. But the usual stochastic order is closed under mixtures, and this yields the stated result.

For a sample of i.i.d. nonnegative random variables, Hu and Wei (2000) studied sums of adjacent spacings, which they called generalized spacings. For example, the range of a sample, X_{(n:n)} − X_{(1:n)}, is a generalized spacing. Hu and Wei (2000) showed that if F is DFR [IFR] then the generalized spacings are similarly ordered.
This result generalizes (5.3). We end this section by describing a few results that compare normalized and generalized spacings from two different random samples. The following two results have been derived by Khaledi and Kochar (1999). Let X_1, X_2, ..., X_n be i.i.d. nonnegative random variables with an absolutely continuous common distribution function F, and let Y_1, Y_2, ..., Y_m be i.i.d. nonnegative random variables with a possibly different absolutely continuous common distribution function G. As above, denote the normalized spacings that are associated with the X_i by D_{i:n}. Also, denote the normalized spacings that are associated with the Y_i by D*_{i:m}. If X_1 ≤_hr Y_1, and if either F or G is DFR, then

    D_{i:n} ≤_st D*_{i:n}, i = 1, 2, ..., n (when m = n).    (5.4)

If X_1 ≤_hr Y_1, and if either F or G is DFR, then

    D_{i:n} ≤_st D*_{j:m} whenever i ≤ j and n − i ≥ m − j.    (5.5)

In fact, (5.4) can be obtained from (5.5) by taking j = i and m = n. Of course, if we set n − i = m − j in the above inequalities (so that the normalizing constants agree), then we obtain, as a special case, comparisons of the spacings (rather than the normalized spacings) that are associated with these two random samples.

Let the generalized spacings from a sample of i.i.d. nonnegative random variables be as described above, and consider also the generalized spacings from another sample of i.i.d. nonnegative random variables. Hu and Wei (2000) showed that if the two parent distributions are comparable in the hazard rate order, and if one of them is DFR, then the corresponding generalized spacings are ordered as well.

Now let X_1, ..., X_n be independent exponential random variables with hazard rates λ_1, ..., λ_n, respectively, and let Y_1, ..., Y_n be i.i.d. exponential random variables with common hazard rate λ̄ ≡ (λ_1 + ... + λ_n)/n. As above, let D_{i:n} and D*_{i:n} be the normalized spacings associated with the X_i and the Y_i, respectively. Kochar and Rojo (1996) showed that

    D*_{i:n} ≤_st D_{i:n}, i = 1, 2, ..., n.

In fact, they showed a stronger result, namely, that the random vector (D*_{1:n}, D*_{2:n}, ..., D*_{n:n}) is smaller than the random vector (D_{1:n}, D_{2:n}, ..., D_{n:n}) in the multivariate likelihood ratio order (see Shaked and Shanthikumar (1994) for the definition).
6. DISPERSIVE ORDERING OF ORDER STATISTICS AND SPACINGS
An order that is useful in reliability theory, and in other areas of applied probability, is the dispersive order. Let X (with distribution function F) and Y (with distribution function G) be two random variables. We say that X is smaller than Y in the dispersive order, denoted by X ≤_disp Y, if

    F⁻¹(β) − F⁻¹(α) ≤ G⁻¹(β) − G⁻¹(α) whenever 0 < α ≤ β < 1,

where F⁻¹ and G⁻¹ are the right continuous inverses of F and G, respectively. It is well-known (see, for example, Shaked and Shanthikumar (1994)) that

    X ≤_disp Y if, and only if, G⁻¹(F(x)) − x is increasing in x.    (6.1)

Bartoszewicz (1985) and Bagai and Kochar (1986) have shown that

    X ≤_hr Y and X or Y is DFR ⟹ X ≤_disp Y.    (6.2)

In this section we describe some recent results involving stochastic ordering of order statistics and spacings in the dispersive order. First note that from (3.8) it follows that, if X_1, X_2, ..., X_n are independent (not necessarily i.i.d.) absolutely continuous DFR random variables, all with support (0, ∞), then we get the following result of Kochar (1996):

    X_{(1:n)} ≤_disp X_{(1:n−1)}, n = 2, 3, ....

This follows from (3.8), with the aid of (6.2), and from the fact that X_{(1:n)}, being a minimum of independent DFR random variables, is DFR. If X_1, X_2, ..., X_n are i.i.d. DFR random variables, then Khaledi and Kochar (2000a) showed that

    X_{(i:n)} ≤_disp X_{(j:m)} whenever i ≤ j and n − i ≥ m − j.
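The dispersive order defined at the top of this section can be probed directly through quantile differences. The following sketch is our own (a finite grid of probability levels makes this a heuristic check, not a proof); it compares two exponential distributions, whose quantile functions are available in closed form:

```python
import math

def disp_smaller(qx, qy, probs):
    """Heuristically check X <=_disp Y: quantile differences of Y dominate
    those of X over a grid of probability pairs."""
    return all(
        qy(b) - qy(a) >= qx(b) - qx(a) - 1e-12
        for a in probs for b in probs if a <= b
    )

# Exponential quantile functions: Exp(rate) has F^{-1}(p) = -ln(1-p)/rate.
q_exp = lambda rate: (lambda p: -math.log(1.0 - p) / rate)
probs = [0.05 * k for k in range(1, 20)]
# Exp(2) is less dispersed than Exp(1), since its quantile gaps are halved.
print(disp_smaller(q_exp(2.0), q_exp(1.0), probs))  # True
print(disp_smaller(q_exp(1.0), q_exp(2.0), probs))  # False
```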
Now let X_1, X_2, ... be a collection of i.i.d. random variables, and let Y_1, Y_2, ... be another collection of i.i.d. random variables. Bartoszewicz (1986) showed that

    X_1 ≤_disp Y_1 ⟹ X_{(i:n)} ≤_disp Y_{(i:n)}, i = 1, 2, ..., n.    (6.3)

Under the assumption X_1 ≤_disp Y_1, Alzaid and Proschan (1992) have obtained some monotonicity results, in i and n, of the differences between the corresponding order statistics. In order to obtain a more general inequality than (6.3) we need to assume more. Khaledi and Kochar (2000a) showed that if X_1, X_2, ... is a sequence of i.i.d. random variables, if Y_1, Y_2, ... is another sequence of i.i.d. random variables, if X_1 ≤_disp Y_1, and if X_1 or Y_1 is DFR, then

    X_{(i:n)} ≤_disp Y_{(j:m)} whenever i ≤ j and n − i ≥ m − j.

Khaledi and Kochar (2000c) showed that if X_1, ..., X_n are independent exponential random variables with hazard rates λ_1, ..., λ_n, respectively, and if Y_1, ..., Y_n are i.i.d. exponential random variables with common hazard rate λ̃ ≡ (λ_1 λ_2 ... λ_n)^{1/n}, then Y_{(n:n)} ≤_disp X_{(n:n)}. This strengthens a result of Dykstra, Kochar and Rojo (1997), who proved the inequality when the Y_i are i.i.d. exponential random variables with hazard rate λ̄ ≡ (λ_1 + ... + λ_n)/n.

Consider now the normalized spacings that are associated with the i.i.d. random variables X_1, X_2, ..., X_n. From (5.1)–(5.3), from (6.2), and from the fact that the spacings here are DFR (see Barlow and Proschan (1966)), it is seen that if X_1, ..., X_n are i.i.d. nonnegative DFR random variables, then we get the following results of Kochar and Kirmani (1995) and of Khaledi and Kochar (1999):

    D_{i:n} ≤_disp D_{i+1:n}, i = 1, 2, ..., n − 1, and D_{i:n} ≤_disp D_{i:n−1}, i = 1, 2, ..., n − 1,

or, in summary,

    D_{i:n} ≤_disp D_{j:m} whenever i ≤ j and n − i ≥ m − j.
We end this section by considering the spacings that are associated with the nonnegative i.i.d. absolutely continuous random variables and the spacings that are associated with the nonnegative i.i.d. absolutely continuous random variables Define the random vectors and Recall that means that
for all increasing functions for which the expectations exist. The following theorem is stated in Bartoszewicz (1986) with an incomplete proof.
Theorem 6.1. Let U and V be as above. If then In particular,
Proof. Let F and G denote the distribution functions of and respectively. Define and Clearly, Furthermore, from (6.1) we have that The fact that now follows from a well-known property of the multivariate order
Theorem 2.7 on page 182 of Kamps (1995) extends (6.4) to the spacings of the so-called generalized order statistics. Rojo and He (1991) proved a converse of Theorem 6.1. Specifically, they showed that if for all then
7.
A SHORT SURVEY ON FURTHER RESULTS
In this last section we briefly mention some results that give stochastic comparisons of order statistics and spacings in senses different from those described in Sections 2–6. Arnold and Villasenor (1998) obtained various results comparing order statistics from the uniform (0,1) distribution in the sense of the Lorenz order. In particular, one of their results can be described as "sample medians exhibit less variability as sample size increases." A comparison of two maxima of independent (not necessarily i.i.d.) random variables in the increasing convex order is implicit in Theorem 9 of Li, Li and Jing (2000). Barlow and Proschan (1975, pages 107–108) obtained some results comparing order statistics (in their words, systems) in the convex, star, and subadditive transform orders (see Shaked and Shanthikumar (1994), Section 3.C for a discussion on these orders). Oja (1981) showed that the convex transform order between two distributions implies that the ratios of spacings that correspond to samples from these distributions are ordered in the usual stochastic order. In (1998a), Bartoszewicz showed that the star transform order between two distributions implies that some ratios of moments of order
statistics from these distributions have some interesting monotonicity properties. In (1998b) Bartoszewicz obtained various comparisons of functions of order statistics from different samples in the sense of the Laplace transform order. Wilfling (1996) described some comparison results of order statistics from exponential and Pareto distributions in the Laplace transform order.
ACKNOWLEDGMENTS
We thank Baha-Eldin Khaledi for useful comments on a previous version of this paper.
REFERENCES
Alzaid, A. A. and F. Proschan. (1992). Dispersivity and stochastic majorization. Statistics and Probability Letters 13, 275–278.
Arnold, B. C. and J. A. Villasenor. (1998). Lorenz ordering of order statistics and record values. In Handbook of Statistics, Volume 16 (Eds: N. Balakrishnan and C. R. Rao), Elsevier, Amsterdam, 75–87.
Bagai, I. and S. C. Kochar. (1986). On tail-ordering and comparison of failure rates. Communications in Statistics—Theory and Methods 15, 1377–1388.
Barlow, R. E. and F. Proschan. (1966). Inequalities for linear combinations of order statistics from restricted families. Annals of Mathematical Statistics 37, 1574–1592.
Barlow, R. E. and F. Proschan. (1975). Statistical Theory of Reliability and Life Testing, Probability Models, Holt, Rinehart, and Winston, New York, NY.
Bartoszewicz, J. (1985). Dispersive ordering and monotone failure rate distributions. Advances in Applied Probability 17, 472–474.
Bartoszewicz, J. (1986). Dispersive ordering and the total time on test transformation. Statistics and Probability Letters 4, 285–288.
Bartoszewicz, J. (1998a). Applications of a general composition theorem to the star order of distributions. Statistics and Probability Letters 38, 1–9.
Bartoszewicz, J. (1998b). Characterizations of the dispersive order of distributions by the Laplace transform. Statistics and Probability Letters 40, 23–29.
Belzunce, F., M. Franco, J.-M. Ruiz, and M. C. Ruiz. (2001). On partial orderings between coherent systems with different structures. Probability in the Engineering and Informational Sciences 15, 273–293.
Block, H. W., T. H. Savits, and H. Singh. (1998). The reversed hazard rate function. Probability in the Engineering and Informational Sciences 12, 69–90.
Boland, P. J., E. El-Neweihi, and F. Proschan. (1994). Applications of the hazard rate ordering in reliability and order statistics. Journal of Applied Probability 31, 180–192.
Boland, P. J. and F. Proschan. (1994). Stochastic order in system reliability theory. In Stochastic Orders and Their Applications (Eds: M. Shaked and J. G. Shanthikumar), Academic Press, San Diego, 485–508.
Boland, P. J., M. Shaked, and J. G. Shanthikumar. (1998). Stochastic ordering of order statistics. In Handbook of Statistics, Volume 19 (Eds: N. Balakrishnan and C. R. Rao), Elsevier, Amsterdam, 89–103.
Dykstra, R., S. Kochar, and J. Rojo. (1997). Stochastic comparisons of parallel systems of heterogeneous exponential components. Journal of Statistical Planning and Inference 65, 203–211.
Hu, T. and F. He. (2000). A note on comparisons of systems with respect to the hazard and reversed hazard rate orders. Probability in the Engineering and Informational Sciences 14, 27–32.
Hu, T. and Y. Wei. (2000). Stochastic comparisons of spacings from restricted families of distributions. Technical Report, Department of Statistics and Finance, University of Science and Technology of China.
Joag-Dev, K. (1995). Personal communication.
Kamps, U. (1995). A Concept of Generalized Order Statistics, B. G. Teubner, Stuttgart.
Khaledi, B. and S. Kochar. (1999). Stochastic orderings between distributions and their sample spacings — II. Statistics and Probability Letters 44, 161–166.
Khaledi, B. and S. Kochar. (2000a). On dispersive ordering between order statistics in one-sample and two-sample problems. Statistics and Probability Letters 46, 257–261.
Khaledi, B. and S. Kochar. (2000b). Stochastic properties of spacings in a single outlier exponential model. Technical Report, Indian Statistical Institute.
Khaledi, B. and S. Kochar. (2000c). Some new results on stochastic comparisons of parallel systems. Technical Report, Indian Statistical Institute.
Kochar, S. C. (1996). Dispersive ordering of order statistics. Statistics and Probability Letters 27, 271–274.
Kochar, S. C. (1998). Stochastic comparisons of spacings and order statistics. In Frontiers in Reliability (Eds: A. P. Basu, S. K. Basu and S. Mukhopadhyay), World Scientific, Singapore, 201–216.
Kochar, S. C. and S. N. U. A. Kirmani. (1995). Some results on normalized spacings from restricted families of distributions. Journal of Statistical Planning and Inference 46, 47–57.
Kochar, S. C. and R. Korwar. (1996). Stochastic orders for spacings of heterogeneous exponential random variables. Journal of Multivariate Analysis 57, 69–83.
Kochar, S. C. and J. Rojo. (1996). Some new results on stochastic comparisons of spacings from heterogeneous exponential distributions. Journal of Multivariate Analysis 59, 272–281.
Li, X., Z. Li, and B.-Y. Jing. (2000). Some results about the NBUC class of life distributions. Statistics and Probability Letters 46, 229–237.
Lillo, R. E., A. K. Nanda, and M. Shaked. (2001). Preservation of some likelihood ratio stochastic orders by order statistics. Statistics and Probability Letters 51, 111–119.
Misra, N. and E. C. van der Meulen. (2001). On stochastic properties of spacings. Technical Report, Department of Mathematics, Katholieke Universiteit Leuven.
Nanda, A. K., K. Jain, and H. Singh. (1998). Preservation of some partial orderings under the formation of coherent systems. Statistics and Probability Letters 39, 123–131.
Nanda, A. K. and M. Shaked. (2000). The hazard rate and the reversed hazard rate orders, with applications to order statistics. Annals of the Institute of Statistical Mathematics, to appear.
Oja, H. (1981). On location, scale, skewness and kurtosis of univariate distributions. Scandinavian Journal of Statistics 8, 154–168.
Pledger, G. and F. Proschan. (1971). Comparisons of order statistics and spacings from heterogeneous distributions. In Optimizing Methods in Statistics (Ed: J. S. Rustagi), Academic Press, New York, 89–113.
Raqab, M. Z. and W. A. Amin. (1996). Some ordering results on order statistics and record values. IAPQR Transactions 21, 1–8.
Rojo, J. and G. Z. He. (1991). New properties and characterizations of the dispersive ordering. Statistics and Probability Letters 11, 365–372.
Shaked, M. and J. G. Shanthikumar. (1994). Stochastic Orders and Their Applications, Academic Press, Boston.
Shanthikumar, J. G. and D. D. Yao. (1986). The preservation of likelihood ratio ordering under convolution. Stochastic Processes and Their Applications 23, 259–267.
Wilfling, B. (1996). Lorenz ordering of power-function order statistics. Statistics and Probability Letters 30, 313–319.
Chapter 25
VEHICLE ROUTING WITH STOCHASTIC DEMANDS: MODELS & COMPUTATIONAL METHODS
Moshe Dror
Department of Management Information Systems
The University of Arizona
Tucson, AZ 85721, USA
mdror@bpa.arizona.edu
Abstract
In this paper we provide an overview and modeling details regarding vehicle routing in situations in which customer demand is revealed only when the vehicle arrives at the customer’s location. Given a fixed capacity vehicle, this setting gives rise to the possibility that the vehicle on arrival does not have sufficient inventory to completely supply a given customer’s demand. Such an occurrence is called a route failure and it requires additional vehicle trips to fully replenish such a customer. Given a set of customers, the objective is to design vehicle routes and response policies which minimize the expected delivery cost by a fleet of fixed capacity vehicles. We survey the different problem statements and formulations. In addition, we describe a number of the algorithmic developments for constructing routing solutions. Primarily we focus on stochastic programming models with different recourse options. We also present a Markov decision approach for this problem and conclude with a challenging conjecture regarding finite sums of random variables.
1.
INTRODUCTION
Consider points indexed in a bounded subset B in the Euclidean space. Given a distance matrix between the point pairs, a traveling salesman problem solution for points is described by a cyclic permutation such that is minimized over all cyclic permutations where represents the point in the position in In this generic traveling salesman problem (TSP) statement it is assumed that all the elements (positions of the points and the corresponding distances) are known in advance. In a setting like this, stochasticity can be introduced in a number of ways. First, consider that a
problem instance is represented by a subset of the given points. Say a subset is selected. Suppose that the problem instance represented by S occurs with probability Now suppose that given a cyclic permutation on the solution is calculated with respect to the point ordering determined by by simply skipping the points, implying that if however then the term appearing in the summation of pair distances is for the minimal positive integer such that For a given the expected distance is Thus, this stochastic version of the TSP is to find a cyclic permutation over that minimizes the expected distance. This stochastic optimization problem is known as the probabilistic TSP or PTSP for short (see Jaillet, 1985). Another stochastic version of the TSP can be stated in terms of the distance matrix by assuming a nonnegative random variable with a known probability distribution for each pair The value serves as a travel time factor for the distance We simply define a new random variable as a random travel time between points and In this setting an optimal cyclic permutation would be one which minimizes the expected TSP distance with respect to the matrix. In real-life terms, this setting seems appropriate when all points have to be visited; however, the travel time between pairs of points is a random variable with a known distribution (Laipala, 1978). Clearly, there are real-life examples for both of the above problems and for a hybrid of the two. However, in this chapter we survey yet a different setting of a stochastic routing problem which we refer to as a vehicle routing with stochastic demands (SVRP). For instance, in the case of automatic bank teller machines the daily demand for cash from each machine is uncertain. The maximal amount of cash that may be carried by an armored cash delivery vehicle is limited for security and insurance reasons. 
Thus, given a preplanned route (sequence of cash machines), there might not be enough cash on the designated vehicle to supply all the machines on its route, resulting in a delivery stockout (referred to as a route failure), forcing a decision of how to supply the machines which have not been serviced because the cash ran out. The problem can be described as follows: Given the points in some bounded subset space, a point (the depot), and a positive real value (capacity of a vehicle). Associate with each a bounded, nonnegative random variable (demand at The objective is to construct ( is determined as a part of the solution) cyclic paths all sharing the point – {0}, and some paths may be sharing other points as well, such that for each realization of the demand vector the demand
on each of the cyclic paths does not exceed Q–vehicle capacity, the total realized demand is less than or equal to and all realized demands are satisfied. A key assumption is that the value is revealed only when a vehicle visits the point for the first time. Note that the actual construction of the cyclic paths might depend on demand realizations at the points already visited. The objective is to find a routing solution, perhaps in the form of routing rules, which has a minimal expected distance. Note that some demand values might be split-delivered over a number of these cyclic paths. The problem can be viewed as a single vehicle problem with multiple routes all visiting the point {0} - the depot. The above version of the SVRP has been associated with real-life settings such as sludge disposal (Larson, 1988) and delivery of home heating oil (Dror, Ball, and Golden, 1985), among others. In these two examples there is no advance reporting of the values which represent current inventory levels or the size of replenishment orders. Thus, the amount to be delivered becomes known only as a given site is first visited.
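As a numerical aside, the PTSP objective introduced earlier in this section (the expected length of a fixed cyclic permutation when each point is independently present and absent points are simply skipped) can be estimated by simulation. The points, presence probability, and trial count below are made-up illustration data:

```python
import math
import random

def tour_length(order, pts):
    # Length of the closed tour visiting `order` cyclically.
    return sum(math.dist(pts[a], pts[b])
               for a, b in zip(order, order[1:] + order[:1]))

def expected_ptsp_length(order, pts, p, trials=20000, seed=1):
    """Monte Carlo estimate of the PTSP expected length for a fixed
    permutation, each point present independently with probability p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        present = [i for i in order if rng.random() < p]
        if len(present) >= 2:
            total += tour_length(present, pts)
    return total / trials

pts = [(0, 0), (3, 0), (3, 3), (0, 3)]   # a unit-free square, side 3
order = [0, 1, 2, 3]
full = tour_length(order, pts)           # perimeter: 12.0
est = expected_ptsp_length(order, pts, p=0.9)
assert est < full                        # skipping points can only shorten the tour
```

Under Euclidean distances the estimate is strictly below the full tour length, since skipping a point shortcuts the tour by the triangle inequality.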
2.
AN SVRP EXAMPLE AND SIMPLE HEURISTIC RESULTS
Consider a single route. Given customers 1,2,..., and a depot indexed 0, we have different routes for the customers, each corresponding to a cyclic permutation over {0,1,2,..., }. Let be the set of all routes (cyclic permutations) and for a particular route denote by the first customer, the second customer, and in general by the customer. would be the last customer before returning to the depot. The cost for route is simply
Denote by the probability that a vehicle of capacity Q runs out of the delivered commodity at or before customer on route Assuming that the customers’ demands are independent random variables with nonnegative means, the sequence is a nondecreasing sequence for any Moreover, for normally distributed customer demands with coefficient of variation and the expected value for any the sequence is strictly convex (see Dror and Trudeau, 1986). Given the above condition that (i.e., the expected demand of customers on the route is less than the vehicle’s capacity), in Example 1 below, we show that the probabilities of the vehicle running out of commodity at points along the route can be quite significant, and can result in a vehicle being forced to return to the depot for refill. Thus, one cannot
assume a priori that a route delivery sequence will be carried out as planned without interruption. In addition, given an ordered sequence of customer visits on an undirected network (symmetric distances), we show in Example 2 below that, unlike in the deterministic vehicle routing problem, it makes a difference whether such a sequence is visited as ordered "from left to right" or in reverse "from right to left". Example 1: Assume a route consisting of 20 customers. Customers’ demands are represented by independent, identically distributed normal random variables We can tabulate the function for two different values of the coefficient of variation where for a small such as The probability of not reaching the last customer (customer = 20) without being forced to return to the depot for refill is 0.3228 and 0.4052 respectively for coefficient of variation and 20/21 (Dror and Trudeau, 1986). For the higher coefficient of variation the probability of never reaching the customer is also high (= 0.3085). Thus, the likelihood of a route failure in these settings is not negligible. Example 2: Assume a route with five customers located at (5,0), (5,5), (0,5), (–5,5), and (–5,0), and a depot located at (0,0), forming a rectangle of length 10 and width 5. The customers are denoted as 1,2,3,4, and 5, respectively. Assume a straight-line travel distance between the locations and the expected demands of and 4, and for the customer located at (–5,0). We also assume a very simple recourse policy in the case of route failure by servicing each customer who was not delivered a full demand on the planned route individually (with multiple back-and-forth trips). We denote the route planned counterclockwise by and the (opposite direction) clockwise route as The expected travel distance for the
route is calculated as follows:
The expected travel distance for the route is calculated similarly.
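The failure probabilities discussed in Example 1 are easy to compute in closed form for i.i.d. normal demands, since the partial demand sums are again normal. The parameters below (mean 1, standard deviation 0.5, capacity 21 for 20 customers) are assumed for illustration and are not the chapter's exact data:

```python
from statistics import NormalDist

def failure_prob_by(k, mu, sigma, Q):
    """P(total demand of the first k route customers exceeds Q),
    for i.i.d. Normal(mu, sigma^2) customer demands."""
    partial_sum = NormalDist(mu=k * mu, sigma=sigma * k ** 0.5)
    return 1.0 - partial_sum.cdf(Q)

# Assumed data: 20 customers, mean demand 1, capacity 21, so the expected
# route demand (20) is below capacity yet failure is far from negligible.
probs = [failure_prob_by(k, mu=1.0, sigma=0.5, Q=21.0) for k in range(1, 21)]
assert all(a <= b for a, b in zip(probs, probs[1:]))   # nondecreasing in k
```

With these assumed numbers the failure probability by the last customer comes out to roughly one third, echoing the order of magnitude reported in Example 1.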
In the case that the customers’ demands are independent, normally distributed random variables with mean demands as described above and an identical coefficient of variation of the expected travel distances for the two routes are quite different. For it is equal to 40.5563, and for the expected travel distance is 48.8362 (Dror and Trudeau, 1986). A classical, and perhaps the most popular, heuristic for constructing vehicle-routing solutions for the case of deterministic customers’ demands is the so-called Clarke and Wright heuristic (Clarke and Wright, 1964). The thrust of the Clarke and Wright route construction is the observation that if two deliveries on two different routes can be joined on a single route, savings can be realized. In the VRP the savings are calculated for each pair of delivery points as (assuming a symmetric travel matrix). The savings are ordered and the customers are joined according to the largest saving available, as long as the demand of the combined route does not exceed the capacity of the vehicle – Q. This basic savings idea can be generalized for the stochastic vehicle routing case as follows: = [the expected cost of the route with customer on it] + [the expected cost of the route with customer on it] - [the expected cost of the combined route where customer immediately precedes customer ]. When computing the savings terms in the stochastic version of the Clarke and Wright heuristic one has to account for the direction of the route. For each pair of points two different directional situations have to be considered and only the one with the highest saving value will be kept. In Dror and Trudeau (1986), the above stochastic Clarke and Wright heuristic was implemented to construct a routing solution for a 75-customer test problem in Eilon, Watson-Gandy, and Christofides (1971). The truck capacity for this experiment is set at Q = 160.
The depot is located at the center point of an 80 × 80 square; however, we do not list here the coordinates of all 75 points. Table 1 lists the routes constructed by the stochastic savings heuristic. It is interesting to note that some of the 10 routes constructed have a high probability of failure. For instance, routes 9 and 10 have probabilities of failure of 0.239 and 0.679, respectively. The expected demand of route 10 exceeds the truck capacity. However, upon closer examination of the two routes we find that the probability of route failure before the last customer is negligible, and the failures, if occurring at all, are only likely to materialize as the truck arrives at the last customer, who is located very close to the depot
630
MODELING UNCERTAINTY
(customer 4 on route 9 and customer 75 on route 10 are both located near the depot). Thus, the projected penalties (the cost) for route failures (a round trip to the depot and back) are very small.
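The savings mechanism behind the heuristic can be sketched in its deterministic form; the stochastic version described above replaces each cost term by an expected route cost and must evaluate both orientations of every candidate merge. All coordinates, demands, and the capacity below are made-up:

```python
import math

depot = (0.0, 0.0)
customers = {1: (10.0, 0.0), 2: (10.0, 5.0), 3: (0.0, 10.0), 4: (-8.0, 2.0)}
demand = {1: 6, 2: 5, 3: 7, 4: 4}
Q = 12                                   # vehicle capacity

def c(i, j):
    pi = depot if i == 0 else customers[i]
    pj = depot if j == 0 else customers[j]
    return math.dist(pi, pj)

# Savings of serving i and j consecutively on one route: c(0,i) + c(0,j) - c(i,j).
savings = sorted(((c(0, i) + c(0, j) - c(i, j), i, j)
                  for i in customers for j in customers if i < j),
                 reverse=True)

routes = [[i] for i in customers]        # start with one route per customer

def route_of(i):
    return next(r for r in routes if i in r)

for s, i, j in savings:
    ri, rj = route_of(i), route_of(j)
    if ri is rj or sum(demand[k] for k in ri + rj) > Q:
        continue
    # merge only end-to-end so every route remains a single open path
    if ri[-1] == i and rj[0] == j:
        merged = ri + rj
    elif rj[-1] == j and ri[0] == i:
        merged = rj + ri
    elif ri[0] == i and rj[0] == j:
        merged = ri[::-1] + rj
    elif ri[-1] == i and rj[-1] == j:
        merged = ri + rj[::-1]
    else:
        continue
    routes.remove(ri)
    routes.remove(rj)
    routes.append(merged)

assert all(sum(demand[k] for k in r) <= Q for r in routes)
```

On this toy instance the greedy merges produce the two capacity-feasible routes {1, 2} and {3, 4}.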
2.1.
CHANCE CONSTRAINED MODELS
A different way of dealing with demand uncertainty and the potential route failure when a vehicle does not have enough commodity left to satisfy all customers’ demands is by modeling the stochastic vehicle routing problem in the form of chance-constrained programming. Essentially, given customer demand parameters, such as demand distributions with their means and variances, one subjectively specifies a control probability for a route not to incur a route failure. Following Stewart and Golden (1983), we restate below their chance-constrained SVRP formulation.
where is a binary decision variable which takes the value 1 if vehicle travels directly from to and is 0 otherwise; NV denotes the number of available vehicles; is the set of feasible routes for the traveling salesman problem with NV salesmen. and Q are defined as before and is the maximum allowable probability that a route might fail. The chance-constrained SVRP model presented above is in the spirit of mathematical models developed by Charnes and Cooper in the 1950s and early 1960s. One of the main premises of such models was that complicated stochastic optimization problems are convertible into equivalent deterministic problems while controlling for the probability of
Vehicle Routing with Stochastic Demands: Models & Computational Methods
631
"bad" events such as route failures for the SVRP. Stewart and Golden (1983) showed that this conversion process (from stochastic to deterministic) of the SVRP is possible for some random demand distributions. In addition, a number of simple penalty-based SVRP models have also been proposed by the same authors and are restated below.
where is a fixed penalty incurred each time route fails.

where is a penalty per unit demand in excess of Q on route and is the expected number of units by which the demand on route exceeds the capacity of the vehicle. The apparent problem with the above modeling approach for the SVRP is that designing a complete set of vehicle routes while controlling or penalizing route failures, irrespective of their likely location and the cost of such failures, might result in bad routing decisions. It is important to remember that the cost of the recourse action taken in response to route failure is more critical than the mere likelihood of such failure.
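For independent normally distributed demands, the chance constraint on a single route has the classical deterministic equivalent alluded to above: the route is admissible when the route's mean demand plus a normal quantile times the route's demand standard deviation does not exceed Q. A sketch, with hypothetical route data:

```python
from statistics import NormalDist

def route_is_admissible(route_stats, Q, alpha):
    """Chance-constraint check for one route with independent normal demands:
    P(total demand > Q) <= alpha  <=>  sum(mu) + z_{1-alpha} * sqrt(sum(var)) <= Q."""
    mu = sum(m for m, _ in route_stats)
    var = sum(s * s for _, s in route_stats)
    z = NormalDist().inv_cdf(1.0 - alpha)
    return mu + z * var ** 0.5 <= Q

# Hypothetical route: three customers given as (mean, std) demand pairs.
stats = [(40.0, 10.0), (50.0, 5.0), (30.0, 8.0)]
assert route_is_admissible(stats, Q=160.0, alpha=0.05)
assert not route_is_admissible(stats, Q=125.0, alpha=0.05)
```

Note how the check penalizes demand variance, not just mean load; it says nothing, however, about where on the route a failure would occur or what it would cost, which is exactly the criticism raised above.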
3.
MODELING SVRP AS A STOCHASTIC PROGRAMMING WITH RECOURSE PROBLEM
Given that a customer’s exact demand is revealed only when the delivery vehicle visits the customer and that the driver is given the sequence in which the customers need to be serviced while allowing him to return to the depot and refill even before the vehicle actually runs out of commodity, how would we design an optimal routing sequence, including the option of dynamic decision rules for when to refill the vehicle? This is the theme of a recent paper by W.-H. Yang, K. Mathur, and R.H. Ballou (2000), in which a customer’s stochastic demand does not exceed the vehicle capacity Q. For simplicity, they assume that has a discrete distribution with possible values with probability mass function In their solution of the SVRP, Yang et al. adopt a simple recourse action of returning to the depot whenever the vehicle runs out of stock, in addition to preset refill decisions based on the demand realizations along the route. Hence the vehicle may return to the depot before stockouts actually occur. What is
632
MODELING UNCERTAINTY
interesting in this simple recourse setting is that given a route (represented as a sequence of customers for each customer there exists a quantity such that the optimal decision in terms of the expected length of the route, is to continue to customer after serving customer if the quantity on the vehicle at that point is and to return to the depot to refill if it is More formally, suppose that upon completion of delivery to customer the remaining quantity in the vehicle is Let denote the total expected cost for completing the deliveries from customer onward. Let be the set of all possible loads on the vehicle after the delivery to customer then, satisfies the following dynamic programming recursion,
with the boundary condition Assuming that the cost matrix satisfies the triangle inequality, then which leads to Theorem 1 below. Theorem 1. (Yang et al., 2000) For each customer there exists a quantity such that the optimal decision, after serving node is to continue to node if or return to the depot if For the proof we refer the reader to the original paper and to Yang’s (1996) thesis. The computation of the values is recursive and for a given routing sequence requires computing the two parts of equation (1). Since there are many different routing options, constructing the best route for the SVRP with this simple recourse scheme requires direct examination of these routing options (say by enumeration) and is only feasible for small problems. In Yang (1996), a branch-and-bound algorithm is described which produces optimal solutions to some typical problems of up to 10 customers. Since this recourse policy allows for restocking the vehicle at any point on its route, even when it is clear that the total customer demand exceeds the vehicle’s capacity, it is not necessary to consider multiple routes. In fact, Yang, Mathur, and Ballou (2000) prove that a single route is more efficient than a multiple-route system. Obviously, this result assumes no other routing constraints, such as time duration, which might require implementation of a multiple-route system.
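The recursion (1) behind Theorem 1 can be sketched for a fixed route with small discrete demand distributions. Everything below (distances, capacity, demands) is made-up illustration data; the recourse modeled is a round trip to the depot on failure, plus the option of a preventive restocking trip between consecutive customers:

```python
from functools import lru_cache

def dist(a, b):
    # Symmetric toy distance matrix; node 0 is the depot.
    D = [[0, 4, 6, 5],
         [4, 0, 3, 6],
         [6, 3, 0, 4],
         [5, 6, 4, 0]]
    return D[a][b]

Q = 5                                   # vehicle capacity
route = [1, 2, 3]                       # fixed visiting sequence
demand_pmf = {1: {1: 0.5, 2: 0.5},      # customer -> {demand value: probability}
              2: {1: 0.5, 3: 0.5},
              3: {2: 0.5, 4: 0.5}}

@lru_cache(maxsize=None)
def f(j, q):
    """Expected remaining distance after serving the j-th route customer with load q."""
    if j == len(route) - 1:
        return dist(route[j], 0)        # done: return to the depot
    here, nxt = route[j], route[j + 1]

    def serve_next(load):
        # Travel to the next customer carrying `load`; its demand is then revealed.
        exp = 0.0
        for xi, p in demand_pmf[nxt].items():
            if xi <= load:              # enough stock: deliver and continue
                exp += p * f(j + 1, load - xi)
            else:                       # route failure: round trip to refill
                exp += p * (2 * dist(nxt, 0) + f(j + 1, Q - (xi - load)))
        return exp

    go_on = dist(here, nxt) + serve_next(q)
    restock = dist(here, 0) + dist(0, nxt) + serve_next(Q)
    return min(go_on, restock)

# Expected route length, starting fully loaded from the depot.
first = route[0]
total = dist(0, first) + sum(p * f(0, Q - xi)
                             for xi, p in demand_pmf[first].items())
```

Scanning, for each position j, the load values at which the restock branch attains the minimum recovers the threshold quantities of Theorem 1 on this toy instance.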
Vehicle Routing with Stochastic Demands: Models & Computational Methods
633
Laporte, Louveaux, and Van hamme (2001) examine the same problem from a somewhat different perspective. The SVRP is examined and an optimal solution methodology by means of an integer L-shaped method is proposed for the simple recourse of back-and-forth vehicle trips to the depot for refill in the case of route failure. We restate below the main points of the solution approach from Laporte et al. (2001), and Laporte and Louveaux (1993), as applied to the SVRP model in Dror et al. (1989). However, since Yang et al. (2000) have shown that a single route design is more efficient than multiple routes, and (based on Theorem 1, Yang et al. 2000) a better recourse policy is obtained by allowing the return to the depot for refill even if the vehicle did not run out of commodity at node we modify the L-shaped solution approach. That is, we assume that a single delivery route will be traced until either a route failure occurs followed by a round trip to the depot, or, based on the result of Theorem 1, the vehicle returns to the depot to refill before continuing the route.
3.1.
THE MODEL
Let be the cost of the routing solution when is the vector of the routing decisions ( if the vehicle goes directly from node to node and otherwise), and q is a vector of the customer demands which are revealed one at a time when the vehicle visits the customer. Clearly, both and are random variables. The cost of a particular routing solution predicated on a realization of is simply an additive term Thus, the SVRP model can be stated as follows
subject to
The cost term can be viewed as a two-part cost of a given solution. In the first part we have the term cx denoting the cost of the initial pre-planned
routing sequence represented by x, and the second part, denoted by Q(x, q), would be the cost of recourse given x and a realization of q. Thus, Q(x, q) reflects the cost of return trips incurred by route failures and decisions to return for refill before route failures, minus some resulting savings. In this representation, we write simply Note that the two routing vectors and x are not the same. The binary vector x represents an initial TSP route, whereas is the binary routing vector which includes all the recourse decisions. We want to keep the vector binary, and for this purpose we assume that (i) the probability of a node demand being greater than the capacity of a vehicle is zero, and (ii) the probability that a vehicle, upon a failure, will go back to the depot after returning to a node to complete its delivery is also zero. Since the vector x represents a routing solution for a single route, it satisfies constraints (3)-(6) ((3) and (4) with equality only). Setting the expectation with respect to q denoted as the objective function (2) becomes
In principle, the function Q(x) can be of any sign and is bounded (from below and above). However, in our case Constraints (3)-(6) ensure connectivity and conservation of flow. At this point we describe the function Q(x, q) in the standard framework of the Two-Stage Stochastic Linear Programs. In the second stage the customer deliveries are explicitly represented.
where y is the binary vector representing the recourse-initiated trips to the depot. T(q) represents the deliveries made by the x vector given q, is the demand realization for q which has to be met (delivered) either by x or the recourse y. The integer L-shaped solution method of Laporte et al. (2001) is a branch-and-cut procedure modified for the stochastic setting such as the SVRP by combining steps based on the combinatorial nature of the problem with an estimation of the stochastic cost contribution for each partial solution. We restate below the model for the problem statement required by the branch-and-cut (integer L-shaped) method. (SVRP)
subject to
3.2.
THE BRANCH-AND-CUT PROCEDURE
The branch-and-cut procedure operates on the so-called ‘current problem’ (CP) obtained from the SVRP by: (a) relaxing the subtour elimination constraints (12), (b) at times solving a linear programming relaxation of it, and (c) replacing Q(x) by a lower bound in the estimation of the solution value of (CP). This method assumes that, given a feasible solution x, Q(x) can be computed exactly. Moreover, a finite lower bound L on Q(x) is assumed to be available. Below, we repeat the steps of the procedure as presented in Laporte et al. (2001).

1 Set the iteration count to zero and introduce the bounding constraint θ ≥ L into (CP). Set the value of the best known solution equal to +∞. The only pendant node corresponds to the initial current problem.

2 Select a pendant node from the list. If none exists, stop.

3 Increment the iteration count and solve (CP). Let (x, θ) be an optimal solution.

4 Check for any violations of constraints (12) or of the integrality requirements, and introduce at least one violated constraint. At this stage, valid inequalities or lower bounding functionals may also be generated. Return to Step 2. Otherwise, if cx + θ is no better than the value of the best known solution, fathom the current node and return to Step 1.

5 If the solution is not integer, branch on a fractional variable. Append the corresponding subproblems to the list of pendant nodes and return to Step 1.

6 Compute Q(x) and the candidate value cx + Q(x). If it improves on the best known solution value, update the latter.

7 If θ ≥ Q(x), then fathom the current node and return to Step 1. Otherwise add a valid cut to (CP) and return to Step 2.
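As an illustration only, the fathoming logic above can be sketched as a toy branch-and-bound over binary vectors in which the recourse cost Q(x) is priced exactly only at integer candidates and replaced by the constant lower bound L everywhere else. The sketch below is hypothetical and deliberately simplified (no LP relaxation, no cut generation); it is not the Laporte et al. implementation:

```python
def l_shaped_bb(c, Q, L):
    """Toy integer L-shaped search: minimize c.x + Q(x) over binary x.
    A constant lower bound L on Q(.) stands in for the recourse cost
    while fathoming; Q is evaluated exactly only at integer leaves."""
    n = len(c)
    best_val, best_x = float("inf"), None
    stack = [()]                          # each node fixes a prefix of x
    while stack:
        fixed = stack.pop()
        # optimistic bound: fixed part at face value, free variables at
        # their cheapest settings, recourse replaced by its bound L
        bound = sum(ci * xi for ci, xi in zip(c, fixed))
        bound += sum(min(ci, 0) for ci in c[len(fixed):]) + L
        if bound >= best_val:
            continue                      # fathom this node
        if len(fixed) == n:               # integer leaf: price recourse exactly
            val = sum(ci * xi for ci, xi in zip(c, fixed)) + Q(fixed)
            if val < best_val:
                best_val, best_x = val, fixed
        else:                             # branch on the next free variable
            stack.append(fixed + (0,))
            stack.append(fixed + (1,))
    return best_val, best_x
```

For instance, with c = [1, -2, 3], Q(x) equal to the number of ones in x, and the valid bound L = 0, the sketch returns the optimum -1 at x = (0, 1, 0).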
MODELING UNCERTAINTY

3.3. COMPUTATION OF A LOWER BOUND ON Z* AND ON Q(X)
Since the solution to the SVRP described here includes the policy option of returning to the depot to refill and continue the route even if the vehicle is not empty (as in Yang et al. 2000), the expected optimal solution value is never more than that obtained by the L-shaped method of Laporte et al. (2001). If we denote the optimal solution value obtained by implementing the L-shaped method of Laporte et al. and the optimal solution value when solving the SVRP incorporating the Yang et al. (2000) recourse options, then the latter is no greater than the former. Moreover, both terms of the objective function (7), cx and the expected recourse cost, in the Laporte et al. (2001) solution are at least as large as the corresponding two terms when the recourse options are expanded as in Yang et al. (2000). Below we describe the lower bounding calculations as in Yang (1996). Since we are describing a branch-and-bound enumerative tree, we assume that a node (a current problem (CP) in the L-shaped method) in such a tree can be represented by two sets E and I. The set E is the set of variables which are set to zero, that is, the arcs which are excluded from the TSP solution. The set I consists of the variables which are set to one, these being the variables which describe a (partial) TSP tour. If we denote the current branch-and-bound node as node t, the corresponding E and I sets will be marked by a superscript t. We describe the partial TSP tour determined by the set I^t as a sequence of nodes starting at the depot 0. To find the least expected cost given the sets (and the corresponding TSP path) we note that
The cost of the traveling salesman path through the nodes of the partial tour, ending in its last node, can certainly serve as a lower bound on the first part of the above cost in (**). In addition, since the expected recourse cost is a nonincreasing function of the available capacity, evaluating it at full capacity gives a lower bound on the second component of the above cost in (**). Since there are a number of good codes which solve, in reasonable time, large TSP instances of about 10,000 nodes (Applegate et al. 1998), we assume that such a TSP code can be called to generate an optimal path solution for the connected set ending in its last node, and we denote the corresponding TSP solution value accordingly. To further tighten the lower bound we can employ a number of additional procedures. For instance, given that the vehicle has a
finite capacity and the expected demand of the set of unrouted customers is greater than the vehicle capacity, the vehicle is likely to exhaust its load and go back to the depot before completing the route through the set of unconnected customers. At any node in that set the vehicle returns to the depot when either (a) there is a route failure, in which case the vehicle has to resume the route at that node with the additional cost of a back-and-forth trip, or (b) the vehicle goes back to the depot for a refill after replenishing a customer and then proceeds directly to the next customer, incurring an additional detour cost. In order to find a valid lower bound, one can rank all costs of types (a) and (b) among the nodes in the unrouted set and compute the least expected additional cost as follows:
where the ranked costs and their number are as just described. Clearly, this count is an upper bound on the number of potential replenishments. Note that this is a conservative lower bound which requires only a modest amount of calculation. One lower bound calculation for the expected cost through the partial tour is obtained by assuming that the vehicle starts the partial tour at its full capacity. A somewhat tighter lower bound is described in Yang (1996). Combining the three lower bounds, we obtain a valid lower bound at a node of the branch-and-bound tree. In case one wants to relax the corresponding TSP path problem and solve its linear programming relaxation, adding subtour elimination constraints and valid TSP facets and inequalities as in Laporte et al. (2001), a weaker lower bound at the node is obtained. However, this way one combines the TSP solution with the stochastic programming solution procedure. Essentially, at a current branch-and-bound node we have both the cx value and the lower bound estimate on the expected recourse cost, making it possible to use the 7 steps as outlined in Laporte et al. (2001). With this procedure we should be able to solve to optimality problems with up to 100 customers, as Laporte et al. did, but the solution quality should be better since we are proposing to incorporate the findings of Yang et al. (2000).
In this context of restricted recourse (the original delivery sequence will be followed), it is important to mention the work of Bertsimas (1992) and Bertsimas et al. (1995). In these two papers, given an identical discrete demand distribution for each customer, an asymptotically optimal heuristic is proposed together with a closed-form expression for computing the expected length of such a tour, followed by an empirical investigation of the heuristic’s behavior (first paper). In the second paper, enhancement rules are proposed using dynamic programming for selecting return trips to the depot.
4. MULTI-STAGE MODEL FOR THE SVRP
This section is based on Dror (1993), which provides a multi-stage stochastic programming model for the SVRP without restricting the recourse options a priori. In principle, a multi-stage stochastic programming model can be viewed as follows: Given a certain initial (deterministic) information vector (think of it as the distance matrix and vehicle capacity Q), a decision vector is selected at some cost. After executing it (for instance, determining and visiting the first customer), a new information vector is obtained and a new decision is selected, again at a cost. This sequence of information - decision - information - decision continues for up to K stages. At each stage the decisions are subject to constraints and depend on the actual realizations of the vectors of random variables. Essentially, the decisions represent a recourse after new information, that is, a new realization of values for the random variables, has been observed. A multi-stage stochastic (linear integer) program with fixed recourse can be modeled in principle as follows (see also Birge, 1985): (SVRP)
Minimize

subject to
where the cost vectors are known, the random vectors are defined on an appropriate probability space, and the constraint matrices are correspondingly dimensioned real-valued matrices. For the SVRP, the first-stage vector represents routing decisions made in the first period. Future decision vectors depend on previous realizations of decisions and the distribution parameters of the demands, as well as the problem’s static constraints (like the vehicle capacity Q). The adoption of this model for the SVRP is a relatively straightforward exercise, presented next. But first we state a few simple observations regarding properties of the SVRP multi-stage solution. The two assumptions are that (i) no individual demand is greater than the vehicle capacity, and (ii) if a vehicle visits a customer, it has to deliver to that customer whatever fraction of the customer’s demand the vehicle carries on arrival. If the vehicle is empty it has to return directly to the depot (i.e., no visits just for collecting demand information are allowed). Under these assumptions, and with a distance matrix which satisfies the triangle inequality in the strong sense, Dror (1993) observed that: (1) Any arc is traversed at most once in an optimal SVRP solution. (2) The number of trips to the depot in an optimal SVRP solution is bounded. (3) In an optimal SVRP solution no customer will be visited more than a bounded number of times. The above three observations allow us to represent the SVRP as a problem on a ‘new’ expanded (minimal) graph with the property that any optimal stochastic vehicle routing solution of the original problem corresponds to a Hamiltonian cycle on the ‘new’ graph. Moreover, any Hamiltonian cycle on the ‘new’ graph describes an SVRP solution which satisfies the three observations above, given that the realization of customers’ demands makes such a solution feasible with respect to the vehicle’s capacity. The ‘new’ expanded graph is constructed as follows: Duplicate each customer node, so that each such node is represented by multiple duplicate nodes.
The depot is likewise represented by several nodes, so the ‘new’ graph has correspondingly many nodes in total. All nodes representing a single customer are interconnected by arcs of zero distance and connected to the other nodes by arcs with the same distance as in the original graph. The same is applied to the depot nodes. On this new graph we seek a Hamiltonian tour (a TSP solution), which is equivalent to minimizing the expected cost of the SVRP as a multi-stage problem. First, a notational convention. Let a cyclic permutation of the node numbers be given, with the following property:
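The construction of the expanded graph can be sketched as follows. The exact number of duplicates per node is dictated by observations (2) and (3) above; here it is supplied as a parameter `copies` (an illustrative interface, since a hypothetical caller must provide those bounds):

```python
def expand_graph(dist, copies):
    """Build the 'new' expanded graph described above: node i of the
    original graph (0 = depot) is replaced by copies[i] duplicate
    nodes.  Duplicates of the same original node are linked by
    zero-distance arcs; arcs between duplicates of different nodes
    keep the original distance."""
    owner = [i for i, k in enumerate(copies) for _ in range(k)]  # duplicate -> original
    m = len(owner)
    big = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            if owner[a] != owner[b]:
                big[a][b] = dist[owner[a]][owner[b]]
    return big, owner
```

A Hamiltonian tour on the returned matrix then encodes a multi-visit routing plan on the original graph, with repeat visits to a customer costing nothing.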
(the depot node), (customer nodes). The permutation preserves the relative order of each set of nodes in the new graph which represent a single node in the original graph. Given that a customer is fully replenished, say on a given visit to that customer, then the rest of the nodes corresponding to that customer are visited in succession at no additional cost, which is consistent with the property of the permutation. Let
and similarly define the corresponding quantity. Set
The interpretation is that the cyclic permutation selects which customer is visited first, which second, etc. Note that after the first visit to a customer, assuming that he will be replenished within a short time during which his consumption is negligible, the customer’s demand is known with certainty. However, visiting a customer for the purpose of obtaining the exact demand value, without any delivery, is not allowed.
4.1. THE MULTI-STAGE MODEL
The SVRP is stated as a multi-stage stochastic programming model with stages over the expanded complete graph with nodes. Minimize
subject to
where the flow variables denote the flow from node to node and cannot exceed the vehicle capacity Q. Each node in the graph is visited exactly once. Constraints (25) are stated in a symbolic form expressing the fact that disconnected subtours are not allowed in the solution. The above model is stated purely for a conceptual understanding of the routing decisions with respect to new demand information as it is revealed one customer at a time. From the computational perspective, this model is not very practical and would require a number of restrictive assumptions (in terms of demand distributions, recourse options, etc.) before operational policies could be computed.
5. MODELING SVRP AS A MARKOV DECISION PROCESS
We restate here a Markov decision process model for the SVRP as presented in Dror (1993), which is very similar to the model presented in Dror et al. (1989). Consider a single vehicle of fixed capacity Q located at the depot, and assume simply that no customer ever demands a quantity greater than Q. In essence, all our assumptions are the same as before. One can assume discrete demand distributions at the customers’ locations in an attempt to reduce the size of the state space (Secomandi, 1998); however, in this model presentation we assume continuous demand distributions. A basic rule regarding vehicle deliveries is that once a vehicle arrives at a customer location, it delivers the customer’s full demand or as much of the demand as it has available. Thus, upon arrival (and after delivery) only one decision has to be taken in case there is some commodity amount left on the vehicle: which location to move to next. The vehicle
automatically returns to the depot only when empty, or when all customers have been fully replenished. In the initial state the vehicle is at the depot, fully loaded, and no customer has been replenished. The final state of the system is when all customers have been fully replenished and the vehicle is back at the depot. The states of the system are recorded each time the vehicle arrives for the first time at one of the customer locations and each time the vehicle enters the depot. These first-arrival times are the transition times and correspond to the times at which decisions are taken. The state of the system at a transition time is described by a vector recording the position of the vehicle and the commodity level in the vehicle; a vehicle at the depot is automatically full. If a customer has been visited, then his exact demand is known, and after a replenishment (partial or complete) the remaining demand is recorded. If the customer has not been visited yet, his entry is set to -1. (That is, the demand is unknown.) The state space is the subset S of such vectors which satisfies the above conditions. Given a transition time, a decision is selected from the decision space; a decision means that the vehicle goes from its present position to a customer whose demand is yet undetermined, and on its route it replenishes a subset P of customers (via a shortest path over that set and its end points) whose demands are already known. In that case the vehicle might also visit the depot. In many cases the subset P may be empty, for instance at the first and second transition times. A decision is admissible only if it satisfies these conditions, and the process terminates once every customer has been fully replenished. For each state, denote the set of admissible decisions and the set of admissible state-decision pairs. At a transition time the system is in some state and a decision has to be taken.
The time until the next transition depends on the state and the decision and is deterministic (based on the deterministic distance matrix). The next state is generated according to the probability distribution which governs the demand variables; more specifically, the demand variable at the vehicle location at the next transition time. Suppose a state is observed at the current transition time and a decision is taken. Then the time until the next transition is simply the travel time to the location of the vehicle at the next transition time; in the case when a subset P of known-demand customers is replenished on the way, the time until the next transition is the shortest path travel time through the set P
ending at the new location. In this model we assume that the service time is zero. Let the transition law be given; that is, for every Borel subset of the state space, it gives the probability that the next state belongs to that subset, given the current state and decision. A control policy is a function assigning an admissible decision to each state. An optimal solution policy for this SVRP Markov decision model minimizes the expected travel time described by a decision sequence, starting from the initial state and ending at the terminating state. That is, it minimizes the following expected value
where the sum runs over the number of transitions, and each term is the usual travel time between consecutive vehicle positions if the next transition occurs without visiting any other nodes, or the shortest-time path between them if a subset of nodes is replenished between the two transition times. Since this model was presented in Dror et al. (1989) and Dror (1993), there has been (to our knowledge) only one substantial attempt to solve this hard problem as a Markov decision model: this was done by Secomandi (1998) and explored further in the form of heuristics in Secomandi (2000). Secomandi (1998) reports solving problems with up to 10 customers. However, in order to simplify the state space somewhat, Secomandi assumes identical discrete demand distributions for all the customers. Clearly, this is a very hard problem for a number of reasons. One is the size of the state space. Another is that some subproblems might require exact TSP path solutions. However, so far this is the most promising methodology for solving the SVRP exactly without narrowly restricting the policy space.
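To make the reoptimization idea concrete, the following is a small illustrative sketch (not Secomandi's algorithm) that computes the optimal expected cost-to-go for a tiny instance by memoized backward recursion. It adopts simplifying assumptions stated loudly here: identical discrete demand distributions, no demand exceeding Q, and the convention that a failure triggers an immediate back-and-forth trip to the depot before the next decision; a preventive restock option is also allowed. All names and the recourse convention are illustrative:

```python
from functools import lru_cache

def svrp_mdp(dist, Q, demands_pmf, customers):
    """Expected cost-to-go for a tiny SVRP, treated as a Markov
    decision process and solved by memoized recursion.  Every
    customer has the same discrete demand pmf; on a route failure
    the vehicle makes a back-and-forth trip to the depot and
    completes the delivery before the next decision is taken."""
    @lru_cache(maxsize=None)
    def V(loc, load, unserved):
        if not unserved:
            return dist[loc][0]           # all served: return to the depot
        best = float("inf")
        for j in unserved:
            rest = unserved - frozenset({j})
            for restock in (False, True): # optional preventive refill
                if restock:
                    travel, q = dist[loc][0] + dist[0][j], Q
                else:
                    travel, q = dist[loc][j], load
                exp = 0.0
                for d, p in demands_pmf:
                    if d <= q:            # demand met in one pass
                        exp += p * V(j, q - d, rest)
                    else:                 # failure: refill round trip
                        exp += p * (2 * dist[j][0] + V(j, Q - (d - q), rest))
                best = min(best, travel + exp)
        return best
    return V(0, Q, frozenset(customers))
```

On a symmetric three-node instance with unit distances, capacity 2, and deterministic demands of 2, the sketch correctly prices the compulsory restock between the two customers and returns an expected cost of 4.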
6. SVRP ROUTES WITH AT MOST ONE FAILURE – A MORE ‘PRACTICAL’ APPROACH
In Section 2 of this paper we examined probabilities of route failure for the case of independent normally distributed demands as a function of the coefficient of variation. We illustrated that for ‘reasonably’ constructed vehicle routes (routes with the total expected demand of the customers below the vehicle capacity) the likelihood of route failure is not negligible. However, it is quite reasonable to assume that the likelihood of more than 1 or 2 failures on a route is negligible. Knowing that the number of SVRP route failures is small is certainly of practical interest and is the topic of the paper by Dror et al. (1993). Here we restate the main contributions of that paper.
In a typical vehicle routing setting with a large number of customers - as in the case of propane distribution (Dror et al. 1985, Larson 1988) - time constraints for completing the daily deliveries cast the problem as a multi-vehicle delivery problem, with each vehicle assigned a route on which the customers’ demand values are uncertain. From the point of view of a single route, designing a route with the likelihood of numerous route failures makes little or no practical sense. In practice, a dispatcher of propane delivery vehicles seldom expects a delivery truck to return more than once to the depot for a refill in order to satisfy all customers’ demands. The probability of a route failure at a given customer of a route is simple to express; we will return to this probability later. However, in this section we assume that the probability of more than one failure on a route is negligible (see Section 2). Moreover, we assume that if a failure occurs on a route, it will occur only at the last customer. That is, we are given a set N of customers, and a failure can occur only at the end of the route. In this case, one can construct a tour assuming that in the case of a route failure at the last customer the vehicle returns to the depot to refill and then incurs the cost of a back-and-forth trip to the last customer. To construct an optimal SVRP solution in this case requires solving a TSP in which, for every customer, the cost of closing the route at that customer is adjusted by the expected cost of the back-and-forth trip. A better solution might be obtained by permitting a return to the depot for a refill at a suitable location along the route, thus preventing the potential route failure at the last node. This recourse option can be formulated as a TSP by introducing 2 artificial nodes in the following manner: If the first artificial node is entered from one of the nodes in N (note that the depot is denoted by {0}), it indicates that the solution requires a ‘preventive refill’ (a return to the depot before the last customer).
If the second artificial node is entered from one of the nodes in N, the solution contains no ‘preventive refill’. Thus, only one of the two artificial nodes will be visited. The costs associated with the two artificial nodes are:
Define singleton sets for the customers and the depot, and a set consisting of the two artificial nodes. The problem then becomes that of solving what is referred to as an asymmetric Generalized TSP (GTSP) over these sets. In the GTSP one has to construct a minimal cost tour which visits exactly one
element in each of the subsets. Solution methodologies for the GTSP have been developed by Noon and Bean (1991). In addition, one can transform a GTSP into a TSP (Noon and Bean, 1993). For instance, in our case set the cost of the arc between the two artificial nodes to -M for some large constant M, forcing the two nodes to be visited consecutively, and then reset the remaining costs accordingly. This makes the transformation complete. Solving the above TSP over the expanded node set solves our SVRP (with a failure potentially occurring only at the last node), allowing us to incorporate the option of preventive refills.
7. THE DROR CONJECTURE
A very important aspect of the SVRP is the computation and the properties of the probability of a route failure at a given customer on the route. In the SVRP we assume that the customers’ demands are independent random variables. In some cases, for the sake of more ‘tractable’ computational studies, we assume that these demands are independent, identically distributed random variables. Given a sequence of independent, identically distributed random variables with a positive mean value and a variance, and a positive fixed value Q, consider the probability that the sum of the first m of the random variables, but not of the first m - 1, exceeds the value Q. In other words,
In principle, in the expression above the random variables need not be independent or identical. They are just ordered, and the first m are summed into a partial sum. The connection to the SVRP as described in this paper is obvious. However, this partial sum describes a setting which is not particular to the SVRP. In fact, in the stochastic processing and reliability literature this quantity is referred to as a ‘threshold detection probability’. It provides a measure of the likelihood of overstepping a boundary Q at exactly the m-th trial in successive steps accumulating the effects of a randomly sampled phenomenon. Some other related results, together with a conjecture statement, are described in Kreimer and Dror (1990). More specifically, Kreimer and Dror (1990) address the following questions: (1) What is the most likely number of trials required to overstep the threshold value Q? (2) In what range of m is the sequence of these probabilities monotonically increasing (decreasing)?
The second question can also be stated in terms of the convexity (concavity) of the corresponding partial sums. Kreimer and Dror (1990) prove that for a number of distributions the sequence of threshold detection probabilities, m = 1, 2, . . ., is monotonically increasing over an initial range of m. However, here we would like only to state a conjecture which originated in Dror (1983). Attempts to prove (or disprove) this conjecture over the years have not been successful.

Conjecture: Given Q > 0 and a cumulative distribution function whose mean and variance satisfy certain properties - a restriction on the coefficient of variation, a mean greater than or equal to the median, and one further distributional condition - the sequence of threshold detection probabilities is monotonically increasing over the stated range of m.
For some distributions, not all of these restrictions are necessary. We have proven this conjecture for the normal distribution and a few other distributions. However, the (as yet unproven) claim is that this monotonicity property is true in general! In the SVRP it is important to analyze the properties of these probabilities. In some practical cases encountered by the author, such as designing cost-efficient distribution of propane, their convexity properties play a crucial role (Trudeau and Dror, 1992).
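A quick Monte Carlo check of the monotonicity claim is easy to run for the normal case. The sketch below estimates the threshold detection probabilities, the chance that the running demand sum first reaches Q at the m-th customer, for i.i.d. normal demands (negative draws truncated at zero); all parameter values are illustrative:

```python
import random

def failure_probs(Q, mu, sigma, n, trials=200_000, seed=1):
    """Estimate, for m = 1..n, the probability that the running sum
    of i.i.d. truncated-normal demands first reaches the threshold Q
    at the m-th customer, by straightforward simulation."""
    rng = random.Random(seed)
    counts = [0] * n
    for _ in range(trials):
        s = 0.0
        for m in range(n):
            s += max(0.0, rng.gauss(mu, sigma))  # truncated normal demand
            if s >= Q:
                counts[m] += 1                   # first failure at customer m+1
                break
    return [c / trials for c in counts]
```

With Q = 10, unit-mean demands, and a small standard deviation, the estimated sequence rises steeply toward the tenth customer, consistent with the conjectured monotonicity up to the most likely failure point.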
8. SUMMARY
This paper deals with the stochastic vehicle routing problem, which involves designing routes for fixed-capacity vehicles serving (delivering to or collecting from, but not both) a set of customers whose individual demands are only revealed when the vehicles arrive to provide service. At the beginning, we describe the problem and provide examples which demonstrate the impact demand uncertainty might have on the routing solutions. Since this topic has been examined by academics and routing professionals for over 20 years, there is a considerable body of research papers. We certainly have not covered them all in this overview.
The vehicle routing problem with stochastic demands is a very hard problem, and we have attempted to cover all the significant developments for solving it. Starting with early work on simple heuristics, such as the stochastic Clarke and Wright savings procedure, followed by chance-constrained formulations and stochastic programming with recourse models, we have attempted a broad overview. In the literature, the most frequently encountered papers have focused on stochastic programming models with limited recourse options, such as back-and-forth vehicle trips to the depot for a refill in the event of route failure. This approach has been examined more recently using the so-called L-shaped optimization method (Laporte et al. 2001). Other approaches (Yang et al. 2000) have added interesting recourse options which could improve on the solution quality. However, the most promising approach is that of modeling the problem as a Markov decision process, presented in Dror et al. (1989) and Dror (1993), with significant modeling and computational progress made recently by Secomandi (1998, 2000). In short, the stochastic vehicle routing problem is very easy to state but, like a number of other similar problems, very hard to solve. It combines combinatorial elements with stochastic elements. The problem is ‘real’ in the sense that we can point out numerous real-life applications; unfortunately, the present state of the art for solving the problem is not very satisfactory. It is a challenging problem, and we look forward to significant improvements in solution procedures, hopefully in the near future.
REFERENCES

Applegate, D., R. Bixby, V. Chvátal, and W. Cook. (1998). "On the solution of traveling salesman problems", Documenta Mathematica, Extra Volume ICM 1998, III, 645-656.
Bertsimas, D.J. (1992). "A vehicle routing problem with stochastic demand", Operations Research 40, 574-585.
Bertsimas, D.J., P. Chervi, and M. Peterson. (1995). "Computational approaches to stochastic vehicle routing problems", Transportation Science 29, 342-352.
Birge, J.R. (1985). "Decomposition and partitioning methods for multistage stochastic linear programs", Operations Research 33, 989-1007.
Clarke, G. and J.W. Wright. (1964). "Scheduling of vehicles from a central depot to a number of delivery points", Operations Research 12, 568-581.
Dror, M. (1983). The Inventory Routing Problem, Ph.D. Thesis, University of Maryland, College Park, Maryland, USA.
Dror, M. (1993). "Modeling vehicle routing with uncertain demands as a stochastic program: Properties of the corresponding solution", European Journal of Operational Research 64, 432-441.
Dror, M. and P. Trudeau. (1986). "Stochastic vehicle routing with modified savings algorithm", European Journal of Operational Research 23, 228-235.
Dror, M., M.O. Ball, and B.L. Golden. (1985). "Computational comparison of algorithms for inventory routing", Annals of Operations Research 4, 3-23.
Dror, M., G. Laporte, and P. Trudeau. (1989). "Vehicle routing with stochastic demands: Properties and solution framework", Transportation Science 23, 166-176.
Dror, M., G. Laporte, and F.V. Louveaux. (1993). "Vehicle routing with stochastic demands and restricted failures", ZOR – Zeitschrift für Operations Research 37, 273-283.
Eilon, S., C.D.T. Watson-Gandy, and N. Christofides. (1971). Distribution Management: Mathematical Modelling and Practical Analysis, Griffin, London.
Jaillet, P. (1985). "Probabilistic traveling salesman problem", Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA.
Kreimer, J. and M. Dror. (1990). "The monotonicity of the threshold detection probability in stochastic accumulation process", Computers & Operations Research 17, 63-71.
Laipala, T. (1978). "On the solutions of the stochastic traveling salesman problems", European Journal of Operational Research 2, 291-297.
Laporte, G. and F.V. Louveaux. (1993). "The integer L-shaped method for stochastic integer programs with complete recourse", Operations Research Letters 13, 133-142.
Laporte, G., F.V. Louveaux, and L. van Hamme. (2001). "An integer L-shaped algorithm for the capacitated vehicle routing problem with stochastic demands", Operations Research (forthcoming).
Larson, R.C. (1988). "Transportation of sludge to the 106-mile site: An inventory routing algorithm for fleet sizing and logistical system design", Transportation Science 22, 186-198.
Noon, C.E. and J.C. Bean. (1991). "A Lagrangian based approach to the asymmetric Generalized Traveling Salesman Problem", Operations Research 39, 623-632.
Noon, C.E. and J.C. Bean. (1993). "An efficient transformation of the Generalized Traveling Salesman Problem", INFOR 31, 39-44.
Secomandi, N. (1998). "Exact and Heuristic Dynamic Programming Algorithms for the Vehicle Routing Problem with Stochastic Demands", Doctoral Dissertation, University of Houston, USA.
Secomandi, N. (2000). "Comparing neuro-dynamic programming algorithms for the vehicle routing problem with stochastic demands", Computers & Operations Research 27, 1201-1225.
Stewart, W.R., Jr. and B.L. Golden. (1983). "Stochastic vehicle routing: A comprehensive approach", European Journal of Operational Research 14, 371-385.
Stewart, W.R., Jr., B.L. Golden, and F. Gheysens. (1983). "A survey of stochastic vehicle routing", Working Paper MS/S, College of Business and Management, University of Maryland at College Park.
Trudeau, P. and M. Dror. (1992). "Stochastic inventory routing: Stockouts and route failure", Transportation Science 26, 172-184.
Yang, W.-H. (1996). "Stochastic Vehicle Routing with Optimal Restocking", Ph.D. Thesis, Case Western Reserve University, Cleveland, OH.
Yang, W.-H., K. Mathur, and R.H. Ballou. (2000). "Stochastic vehicle routing problem with restocking", Transportation Science 34, 99-112.
Chapter 26

LIFE IN THE FAST LANE: YATES’S ALGORITHM, FAST FOURIER AND WALSH TRANSFORMS

Paul J. Sanchez
Operations Research Department
Naval Postgraduate School
Monterey, CA 93943
John S. Ramberg
Systems and Industrial Engineering
University of Arizona
Tucson, AZ 85721

Larry Head
Siemens Energy & Automation, Inc.
Tucson, AZ 85715
Abstract
Orthogonal functions play an important role in factorial experiments and time series models. In the latter half of the twentieth century orthogonal functions became prominent in industrial experimentation methodologies that employ complete and fractional factorial experiment designs, such as Taguchi orthogonal arrays. Exact estimates of the parameters of linear model representations can be computed effectively and efficiently using “fast algorithms.” The origin of “fast algorithms” can be traced to Yates in 1937. In 1958 Good created the ingenious fast Fourier transform, using Yates’s concept as a basis. This paper is intended to illustrate the fundamental role of orthogonal functions in modeling, and the close relationship between two of the most significant of the fast algorithms. This in turn yields insights into the fundamental aspects of experiment design.
1. INTRODUCTION
Our purpose in writing this paper is to illustrate the role of orthogonal functions in factorial design, and one usage in Walsh and Fourier analysis for signal and image processing. We also want to exhibit the relationship between the Yates “fast Factorial Algorithm” and the fast Walsh transform, and to show how Yates’s algorithm contributed to the development of fast Fourier transforms. We would like to de-mystify the “black box” aura which often surrounds the presentation of these algorithms in a classroom setting, and to encourage the discussion of the topic of computationally efficient algorithms. We think that this approach is valuable because it demonstrates close links between statistics and a number of other fields, such as thermodynamics and signal processing, which are often viewed as quite divergent. Orthogonal functions serve many purposes in a wide variety of applications. For example, orthogonal design matrices or arrays play important roles in the statistical design and analysis of experiments. Discrete Fourier and Walsh transforms play comparable roles in digital signal and image processing. An important distinction between the various application areas is the data collection scheme employed. In statistical experiment design the functions represent the factors and their interactions, and the experiment is typically run in a randomized order. In signal processing the data are collected over time and represented in terms of a set of orthogonal functions which are explicit functions of time. Historically, the importance of these methods created interest in developing effective and efficient computational approaches, which are often called “fast” algorithms. Fast algorithms produce the same mathematical result as the standard algorithms, and are typically more computationally stable, yielding better numerical accuracy and precision. 
Thus they should not be confused with so-called "quick and dirty" statistical techniques, which yield only approximate results. Nair (1990) has indicated the importance of attracting applications articles in the new technological areas of physical science and engineering, such as semiconductors. Hoadley and Kettenring (1990) have stressed the importance of communication between statisticians, engineers, and physical scientists. Stoffer (1991) has discussed statistical applications based on the Walsh–Fourier transform, and has outlined the existing Walsh–Fourier theory for real-time stationary time series. Understanding the relationship of experiment design to other orthogonal-function-based techniques and the corresponding fast algorithms should be useful in enhancing this communication. At first glance, high speed computation abilities might seem to negate the need for the computational efficiency available using fast transforms. In pattern recognition problems and signal processing the amount of data being processed
Life in the Fast Lane: Yates’s Algorithm, Fast Fourier and Walsh Transforms
is a major reason for their continued importance. Relating the algorithms to well-known statistical techniques provides a mechanism for determining confidence limits when the signal is subject to random fluctuation, such as variation in natural lighting or low signal-to-noise ratios. In factorial experiments, where data sets are much smaller, the systematic computation format is advantageous for implementing the algorithm as a program, irrespective of language, and can even be used to perform analyses in spreadsheets. Finally, these algorithms tend to have better numerical stability.
2. LINEAR MODELS
Generally, the role of linear models in factorial experiments, Walsh analysis, and Fourier analysis is not made explicit. The methods of analysis which we will discuss are all based upon linear models. A discrete indexed linear model can be represented in matrix form in either a deterministic or statistical context as

$$y = B\beta \qquad (2.1)$$

or

$$y = B\beta + \epsilon, \qquad (2.2)$$

respectively. In both cases $y$ is an $n \times 1$ column vector, $B$ is an $n \times p$ matrix representing the selected basis, and $\beta$ is a $p \times 1$ column vector of unknown parameters. In (2.2), $\epsilon$ is an $n \times 1$ column vector of random errors with zero mean and variance $\sigma^2$. In both cases, when $n \ge p$ and $B$ is of rank $p$, the parameters can be determined (estimated) uniquely by least squares as

$$\hat{\beta} = (B^TB)^{-1}B^Ty. \qquad (2.3)$$

The determined (estimated) parameters can be viewed as a projection of the data onto the basis represented by $B$, as opposed to the original basis, which was composed of the first $n$ elementary vectors. If the basis is orthogonal, $B^TB$ is a $p \times p$ diagonal matrix, and hence so is its inverse. Thus $\beta$ is determined (estimated) by

$$\hat{\beta} = DB^Ty, \qquad (2.4)$$

where $D$ is a $p \times p$ diagonal matrix whose $i$th element is the inverse of the squared norm of the $i$th column vector of the basis. If all basis vectors have the same norm, the multiplication by $D$ reduces to multiplication by a scalar constant. For the special case $n = p$ with columns of common squared norm $n$, the least squares estimator simplifies to

$$\hat{\beta} = \frac{1}{n}B^Ty.$$
In this case, the projection onto the basis B is one-to-one, i.e., y and $\hat{\beta}$ are completely interchangeable, since either can be constructed (reconstructed) from the other. We will be applying the linear model in three settings: Factorial Analysis, Walsh Analysis, and Fourier Analysis. The table in Appendix A serves as a quick reference to notation.
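To make the projection concrete, the orthogonal-basis estimator (2.4) can be sketched in a few lines of Python. The function names and the 4-point basis below are ours, chosen purely for illustration:

```python
def orthogonal_ls(columns, y):
    """Least squares against an orthogonal basis: each coefficient is the
    inner product <b_i, y> divided by the squared norm <b_i, b_i>, so no
    matrix inversion is needed (estimator (2.4) with D diagonal)."""
    return [sum(b * v for b, v in zip(col, y)) / sum(b * b for b in col)
            for col in columns]

def reconstruct(columns, beta):
    """Compute B*beta, rebuilding y from the estimated coefficients."""
    n = len(columns[0])
    return [sum(b * col[t] for b, col in zip(beta, columns)) for t in range(n)]

# Illustrative 4-point orthogonal (+1/-1) basis and data vector.
cols = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
beta = orthogonal_ls(cols, [3.0, 5.0, 2.0, 8.0])
```

When $n = p$, as here, `reconstruct(cols, beta)` returns the original data exactly, illustrating the one-to-one projection.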
2.1. FACTORIAL ANALYSIS
2.1.1 Definitions and Background. A factorial experimental design is conducted by controlling the levels of the various factors during the course of the experiment. In a $2^k$ factorial experiment there are $k$ factors, each of which is set at "low" and "high" levels. These will be designated $-1$ and $+1$, and we will use "$-$" and "$+$" to denote $-1$ and $+1$, respectively. The factor settings are determined by enumerating all combinations of low and high settings for each factor. This is demonstrated for three factors, $k = 3$, in Figure 26.1. It is easily seen that there are $2^3 = 8$ points, each of which can be represented as a three-dimensional vector. The corner labeled 1 has factors one, two, and three set at low, low, and low settings, respectively, and can be represented as a triplet $(-, -, -)$. Corner 2 sets factors one, two, and three to high, low, and low, respectively, and yields the triplet $(+, -, -)$. In general, if there are $k$ factors the design can be represented as $2^k$ vectors of length $k$, where each of
the elements specifies the level setting of a corresponding factor at a single design point. We can construct the design matrix by assigning the vector at design point $i$ as the $i$th row of the matrix. Alternatively, we can consider the settings of one particular factor in the order given by the numbering of the design points. This provides the column vectors of the design matrix. Both viewpoints yield the same design matrix for any given experiment. The column vectors $x_1, x_2, \dots, x_k$, called the generating vectors for a $2^k$ factorial experiment, are, for $k = 3$:

$$x_1 = (-,+,-,+,-,+,-,+)^T, \quad x_2 = (-,-,+,+,-,-,+,+)^T, \quad x_3 = (-,-,-,-,+,+,+,+)^T.$$
Each factor in an experiment is assigned a (unique) generating vector. For an experiment with one replication, each run corresponds to a row, which is selected randomly (without replacement), and each factor is set according to the level indicated by the generating vector. For an experiment with $b$ blocks, this process is repeated $b$ times, each time with a new random sequence. For an experiment with $r$ replications, each row appears $r$ times, and the rows are selected in a completely random order. A single observation of the response is obtained for each row, resulting in $2^k$ observations for a full factorial experiment with one replication. This set of observations is subsequently sorted into standard order for analysis. In this case the order in which each run was carried out should be recorded to allow assessment of model assumptions. The generating vectors are used to construct the analysis matrix. When two or more factors affect the outcome non-additively, based on their combined values, we say that there is an interaction term in the model. For example, when two factors interact in the form $x_1x_2$, this term may also be included in the model. The generating vectors and all possible interaction terms form the contrast matrix. The addition of a column of ones completes the basis, and results in the analysis matrix. The columns of the analysis matrix representing factor interactions are obtained by the element-wise multiplication of all the generating vectors of the factors involved.
In general, the analysis matrix for a full $2^k$ factorial experiment can be written in matrix form in Yates's standard order as, for $k = 3$,

$$X = \begin{bmatrix}
+ & - & - & + & - & + & + & -\\
+ & + & - & - & - & - & + & +\\
+ & - & + & - & - & + & - & +\\
+ & + & + & + & - & - & - & -\\
+ & - & - & + & + & - & - & +\\
+ & + & - & - & + & + & - & -\\
+ & - & + & - & + & - & + & -\\
+ & + & + & + & + & + & + & +
\end{bmatrix} \qquad (2.7)$$

with columns $x_0, x_1, x_2, x_1x_2, x_3, x_1x_3, x_2x_3, x_1x_2x_3$. It is constructed from the generating vectors as follows. At stage 0, X consists solely of the column vector $x_0$, which is a column of $+$'s. At stage $i$, for $i = 1, \dots, k$, X is expanded by appending to it the element-wise product of $x_i$ with each vector in the matrix at stage $i-1$, thus doubling the number of columns. Since the number of columns doubles at each of the $k$ iterations, and the column vectors are of length $2^k$, it follows that X is a $2^k \times 2^k$ square matrix. It is also of full rank, since the columns are mutually orthogonal. The table of contrasts, employed in many texts for the analysis of factorial designs, is the analysis matrix excluding $x_0$. (Contrast vectors always sum to zero; $x_0$ does not fulfill that requirement and hence is not a contrast.)
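The stage-wise doubling construction can be expressed directly in code. The following Python sketch (our own naming) builds the columns of X in Yates's standard order:

```python
def analysis_matrix(k):
    """Columns of the 2^k x 2^k analysis matrix in Yates's standard order.

    Stage 0 is the all-plus column x0; stage i appends the element-wise
    product of the generating vector x_i with every column built so far,
    doubling the column count at each of the k stages."""
    n = 2 ** k

    def gen(i):
        # Generating vector x_i: alternating blocks of -1 then +1,
        # each block of length 2^(i-1).
        block = 2 ** (i - 1)
        return [1 if (t // block) % 2 else -1 for t in range(n)]

    cols = [[1] * n]                                  # stage 0: x0
    for i in range(1, k + 1):
        xi = gen(i)
        cols += [[a * b for a, b in zip(xi, c)] for c in cols]
    return cols
```

For k = 3 this reproduces the columns of (2.7): $x_0, x_1, x_2, x_1x_2, x_3, x_1x_3, x_2x_3, x_1x_2x_3$.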
The vector $x_0$ is inserted as the first column. The generating vectors and interactions complete the basis. For example, the product vector $x_1x_2$, obtained from the interaction of $x_1$ and $x_2$, is found as the fourth column of (2.7). The column vectors of equation (2.7) are represented graphically in Figure 26.2.

2.1.2 The Model. Let y be the vector of responses, or response totals or means (depending upon the exact form of analysis of interest), which we wish to relate to the factors in some fashion. One commonly used model is the linear model given by (2.2) in Section 2. This is often written as

$$y = X\beta + \epsilon, \qquad (2.8)$$

where $\beta$ is the vector of coefficients corresponding to the columns of X, and X plays the role of the basis matrix B in (2.2). The vector $\beta$ is indexed by $0, 1, 2, 12, 3, 13, 23, 123$. Thus the index identifies the factor or interaction.
2.1.3 The Coefficient Estimator. The familiar least squares estimator for $\beta$ is

$$\hat{\beta} = (X^TX)^{-1}X^Ty.$$

It is easily verified that the columns of X all have modulus $\sqrt{2^k}$, so that

$$X^TX = 2^kI,$$

where I is the identity matrix. Thus, if least squares is used to estimate $\beta$, the estimator is given from (2.4) as $\hat{\beta} = 2^{-k}X^Ty$, and only one matrix multiplication is required to obtain the estimates. The estimator has the same number of elements as the data vector, y, and can be viewed as an orthogonal transformation of the data vector to the parameter space. Furthermore, the observation vector can be computed from the parameter vector by $y = X\hat{\beta}$. Note that the transformation is information preserving, since the original vector can be recovered from the parameter vector. Finally, a smoothed (or parsimonious) predictor of y can be obtained by setting certain elements of $\hat{\beta}$ to zero (perhaps those which are not statistically significant). If we call the new vector $\tilde{\beta}$, then the smoothed predictor is given by $\tilde{y} = X\tilde{\beta}$.
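The one-multiplication estimator $\hat{\beta} = 2^{-k}X^Ty$ can be sketched as follows (illustrative Python, our naming; the column construction uses the fact that the 1-bits of the column index select the interacting factors):

```python
def factorial_effects(y):
    """Estimate beta_hat = 2^(-k) X^T y for a 2^k factorial experiment
    whose responses y are listed in Yates's standard order."""
    n = len(y)
    k = n.bit_length() - 1
    assert n == 2 ** k, "length must be a power of two"

    def col(j):
        # Column j of X: element-wise product of the generating vectors
        # selected by the 1-bits of j (column 0 is all ones).
        signs = []
        for t in range(n):
            s = 1
            for bit in range(k):
                if (j >> bit) & 1:
                    s *= 1 if (t >> bit) & 1 else -1
            signs.append(s)
        return signs

    return [sum(c * v for c, v in zip(col(j), y)) / n for j in range(n)]
```

This is the direct $O(2^{2k})$ computation; Section 4 shows how Yates's algorithm reduces the work to $O(k2^k)$.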
2.2. WALSH ANALYSIS
Walsh functions are binary-valued functions of time. They first appeared in the published literature in Walsh (1923). While the time index can be either continuous or discrete, we will concentrate on discrete-index applications. One can view them as similar to sine/cosine functions, in that they can be used as a complete orthogonal basis for representing time series, but Walsh functions differ in that they are not periodic. The first eight Walsh functions are plotted (in continuous time) in Figure 26.3 to illustrate this. The discrete-time vector representation can be obtained by taking the right-continuous limit of the continuous-time function at each of the sample points. The binary nature of Walsh functions makes them a good choice for digital signal processing applications, or many other settings in which we wish to accurately represent discontinuous functions using a relatively small number of terms. We will deal only with a few properties relevant to our application area, experimental design. See Beauchamp (1984) for a more detailed treatment of the functions and their properties.

2.2.1 Definitions and Background. By convention Walsh functions are also encoded as $\{\pm 1\}$. Three different ordering schemes are used to index them, each of which has merits for some types of applications. These are
known as natural, dyadic, and sequency ordering. We will concentrate on the natural (also known as Hadamard) order because it corresponds directly to the Yates ordering. However, the indexing scheme used in sequency ordering results in a convenient notation for specifying factor interactions. In addition, the sequency index is directly related to how often the function varies about its mean, which closely corresponds to the concept of frequency in trigonometric functions. This may be of interest to the experimenter in certain settings (Sanchez and Sanchez, 1991). In all three indexing schemes the function numbering starts at zero, and the function indexed by zero is a vector of ones. As with the factorial designs, we can derive Walsh functions from a complete enumeration of all combinations of high and low settings of the $k$ factors. This is illustrated for $k = 3$ by Figure 26.4. The three indexing schemes briefly discussed above are obtained by sampling the eight design points in different (systematic) orderings. Note that the Hadamard ordering is precisely the reverse of the Yates ordering. This yields a set of column vectors which can be used to construct an analysis matrix, exactly as we did with the factorial analysis. The corresponding
column generating vectors are:

$$h_1 = (+,-,+,-,+,-,+,-)^T, \quad h_2 = (+,+,-,-,+,+,-,-)^T, \quad h_4 = (+,+,+,+,-,-,-,-)^T.$$

Note that the index of a generating vector equals the length of its runs of adjacent pluses or minuses, and that $h_{2^{j-1}} = -x_j$ for $j = 1, \dots, k$, where $x_j$ is the factorial generating vector with index $j$. To determine the value of a Walsh function whose index is not a power of 2, decompose the index into its binary representation and take the element-wise product of all Walsh functions corresponding to a 1 bit. For example, $h_5 = h_4h_1$ since the binary representation of 5 is 101. As with factorial designs, the total number of vectors generated in this fashion is $2^k - 1$. To obtain a complete orthogonal basis, we again insert the vector of ones, $h_0$. The basis is constructed by placing $h_j$ as the $(j+1)$st column. In general, the matrix for a Walsh analysis can be written in Hadamard order as

$$H = [h_0, h_1, h_2, \dots, h_{2^k-1}].$$

Thus if $k = 3$,
the basis is

$$H = \begin{bmatrix}
+ & + & + & + & + & + & + & +\\
+ & - & + & - & + & - & + & -\\
+ & + & - & - & + & + & - & -\\
+ & - & - & + & + & - & - & +\\
+ & + & + & + & - & - & - & -\\
+ & - & + & - & - & + & - & +\\
+ & + & - & - & - & - & + & +\\
+ & - & - & + & - & + & + & -
\end{bmatrix} \qquad (2.10)$$

The generating vectors $h_1$, $h_2$, and $h_4$ are the second, third, and fifth columns of (2.10), respectively. The product vector $h_3$, obtained from the interaction of $h_1$ and $h_2$, for example, is found as the fourth column.
Note that this is the analysis matrix X from Yates's algorithm given in (2.7), to within a scale factor of $\{\pm 1\}$ for each column.

2.2.2 The Model. Let y be the vector of data which we wish to represent in terms of our basis H. By convention the time index starts at zero, i.e., the vector consists of $y_0, y_1, \dots, y_{2^k-1}$. One commonly used model is the linear model given by (2.1) in Section 2. This can be written as

$$y = HY,$$

where Y is the vector of Walsh coefficients corresponding to the columns of H, and H plays the role of the basis matrix B in (2.1).

2.2.3 Discrete Walsh Transforms. It is easily verified that the columns of H all have modulus $\sqrt{2^k}$, so that

$$H^TH = 2^kI,$$

where I is the identity matrix. Thus the matrix form for the discrete Walsh transform (DWT) of a vector y of length $2^k$ is defined as

$$\hat{Y} = 2^{-k}H^Ty.$$

Note that the DWT is the least squares estimator for Y. Also note that $H^T = H$ due to the symmetric nature of the Hadamard matrix, so it is correct to write the DWT as

$$\hat{Y} = 2^{-k}Hy.$$

In other words, the DWT is its own inverse to within a scale factor of $2^{-k}$. The transform notation emphasizes the fact that the transformation is information preserving – the transform vector contains all of the information in the original vector. In other words, the transform is not actually changing the data, but rather is changing our viewpoint of the data. Both the original and the transform vectors represent exactly the same point in space. All we have done in the transformation is to change the set of axes from which we choose to view that point. Any orthogonal basis could be used for viewing the data, but some bases might be more interesting than others because of the physical interpretation we could place on the results. For statistical interpretation, an appropriate choice of basis would be the set of vectors we used to determine the inputs to an experiment. If the outputs fall solidly in some subspace corresponding to certain inputs, we infer that those inputs are important factors in determining the output.
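The Sylvester recursion $H_{2^k} = H_2 \otimes H_{2^{k-1}}$ gives a compact way to generate the natural-ordered basis and to check the self-inverse property numerically. A Python sketch (our function names):

```python
def hadamard(k):
    """Hadamard matrix of order 2^k in natural (Sylvester) order."""
    H = [[1]]
    for _ in range(k):
        # [[H, H], [H, -H]]: append each row to itself, then to its negation.
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def dwt(y):
    """Discrete Walsh transform Y_hat = 2^(-k) H y (H is symmetric)."""
    n = len(y)
    H = hadamard(n.bit_length() - 1)
    return [sum(h * v for h, v in zip(row, y)) / n for row in H]
```

Because $HH = 2^kI$, applying `dwt` twice and rescaling by $2^k$ recovers the original vector.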
2.3. FOURIER ANALYSIS
The Fourier transform is a common tool in engineering applications for analyzing time series, solving differential equations, or performing convolutions. For a thorough yet accessible introduction to Fourier transforms and other time series methods, we refer the reader to the text by Chatfield (1984). When the data consist of a discrete set of samples, the discrete Fourier transform (DFT) is the applicable tool.

2.3.1 Definitions and Background. Let $w$ be the generating vector whose elements are given by $w_t = e^{i2\pi t/N}$ for $t = 0, 1, \dots, N-1$, where $i = \sqrt{-1}$. The set of indices consists of the ordered set of integers $0, 1, \dots, N-1$. If we now set the power (element-wise) of $w$ to each index $j$ in turn, the resulting set of vectors $w^0, w^1, \dots, w^{N-1}$ are mutually orthogonal. We can construct a matrix, which we will call F, using these vectors as the columns. For example, if $N = 8$ the generating vector is

$$w = \left(1, e^{i\pi/4}, e^{i\pi/2}, e^{i3\pi/4}, e^{i\pi}, e^{i5\pi/4}, e^{i3\pi/2}, e^{i7\pi/4}\right)^T$$

and thus the analysis matrix is:

$$F = \left[e^{i2\pi tj/8}\right]_{t,j=0,\dots,7}. \qquad (2.11)$$
Since these are periodic functions, (2.11) can be simplified by evaluating the exponents modulo $2\pi$. The result is

$$F = \left[e^{i\pi(tj \bmod 8)/4}\right]_{t,j=0,\dots,7}. \qquad (2.12)$$

This matrix can be expressed in a simpler form as the matrix of exponents after dividing the exponents by $i\pi/4$:

$$\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\
0 & 2 & 4 & 6 & 0 & 2 & 4 & 6\\
0 & 3 & 6 & 1 & 4 & 7 & 2 & 5\\
0 & 4 & 0 & 4 & 0 & 4 & 0 & 4\\
0 & 5 & 2 & 7 & 4 & 1 & 6 & 3\\
0 & 6 & 4 & 2 & 0 & 6 & 4 & 2\\
0 & 7 & 6 & 5 & 4 & 3 & 2 & 1
\end{bmatrix}$$

This notation makes clear that the relationship between the columns is comparable to that in a factorial design array. Note, however, that the interaction columns are the sums of the corresponding factor columns here, since products of powers of a common base can be expressed in terms of the sums of the exponents. We can also express the transform in terms of trigonometric functions using the relationship $e^{i\theta} = \cos\theta + i\sin\theta$. This is useful for computational purposes – we can keep track of the magnitudes of the real and imaginary components separately for each element. In other words, each complex scalar element of the vector of data is represented by two scalar values, corresponding to the real and imaginary components. Complex multiplication can be equivalently expressed in matrix terms. If $z_1 = a + bi$ and $z_2 = c + di$, then $z_1z_2$ can be written as

$$\begin{bmatrix} a & -b \\ b & a \end{bmatrix}\begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} ac - bd \\ ad + bc \end{bmatrix}.$$
The multiplication of two complex scalars becomes a matrix multiplication of a 2 × 2 matrix by a 2 × 1 vector to yield a 2 × 1 vector. It can thus be seen that the Fourier basis represented by (2.12) can be represented as a 16 × 16 real matrix, in which each complex entry $e^{i\theta}$ is replaced by the 2 × 2 block

$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \qquad (2.13)$$

It is interesting to note that F is a symmetric matrix, but the real-valued matrix (2.13) is not.
2.3.2 The Model. Let y be the vector of data which we wish to represent in terms of our basis F. As with the Walsh basis, the time index starts at zero by convention, i.e., the vector consists of $y_0, y_1, \dots, y_{N-1}$. One commonly used model is the linear model given by (2.1) in Section 2. This can be written as $y = FY$, where Y is the vector of Fourier coefficients corresponding to the columns of F, and F plays the role of the basis matrix in (2.1). The least squares estimator for Y is obtained exactly as in (2.3). As with the other bases we have discussed, the vectors are mutually orthogonal, so the normal-equations term forms a diagonal matrix and the estimator simplifies to a single matrix multiplication. However, the first and last columns have a different modulus than the other terms – they are scaled by $1/N$ while the rest are scaled by $2/N$. Note that F is symmetric about its diagonal, so that it is its own transpose. However, since F is complex, the matrix used in the estimator must be its complex conjugate, i.e., it is given as equation (2.12) with the signs of all exponents reversed. We will designate the complex conjugate henceforth as $\bar{F}$. When the data are real-valued, as with statistical applications, the imaginary part of each observation is zero and can be omitted. Hence we can eliminate the even-numbered rows in equation (2.14). (Rows are eliminated, rather than columns, because the estimators are obtained from $\bar{F}^T$.) The result is a 16 × 8 matrix – clearly eight of the rows are redundant, since we only need an 8 × 8 matrix to have a complete basis. The question of which vectors to include in the basis can be resolved by examining Figure 26.5. The set of points obtained by evaluating the powers of $w$ form a circle of unit radius in the complex plane. Note that for $N/2 < j < N$, $\cos(2\pi jt/N) = \cos(2\pi(N-j)t/N)$ and $\sin(2\pi jt/N) = -\sin(2\pi(N-j)t/N)$. In other words, frequencies in the range $(N/2, N)$ can
be expressed as linear functions of frequencies in the range $[0, N/2]$, so we can obtain a complete basis from the sine and cosine terms in the latter range. For $N = 8$ the result is a reduced matrix (call it $F_R$), whose columns are $1$, $\cos(2\pi t/8)$, $\sin(2\pi t/8)$, $\cos(4\pi t/8)$, $\sin(4\pi t/8)$, $\cos(6\pi t/8)$, $\sin(6\pi t/8)$, and $\cos(\pi t)$:

$$F_R = \begin{bmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 1\\
1 & c & c & 0 & 1 & -c & c & -1\\
1 & 0 & 1 & -1 & 0 & 0 & -1 & 1\\
1 & -c & c & 0 & -1 & c & c & -1\\
1 & -1 & 0 & 1 & 0 & -1 & 0 & 1\\
1 & -c & -c & 0 & 1 & c & -c & -1\\
1 & 0 & -1 & -1 & 0 & 0 & 1 & 1\\
1 & c & -c & 0 & -1 & -c & -c & -1
\end{bmatrix}, \qquad c = \frac{\sqrt{2}}{2}.$$

It can be easily verified that the columns form a real-valued orthogonal basis. The columns of $F_R$ are plotted in Figure 26.6.
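The reduced real basis and its unequal column scalings can be checked numerically. In this Python sketch (our names; the 1/N and 2/N factors follow from the squared column norms N and N/2):

```python
import math

def real_fourier_coeffs(y):
    """Coefficients of y against the reduced real Fourier basis
    1, cos(2*pi*t/N), sin(2*pi*t/N), ..., cos(pi*t).

    The constant and Nyquist columns have squared norm N, the others N/2,
    so their coefficients are scaled by 1/N versus 2/N."""
    N = len(y)
    coeffs = [sum(y) / N]                                        # constant term
    for j in range(1, N // 2):
        coeffs.append(2 / N * sum(y[t] * math.cos(2 * math.pi * j * t / N)
                                  for t in range(N)))            # cosine term j
        coeffs.append(2 / N * sum(y[t] * math.sin(2 * math.pi * j * t / N)
                                  for t in range(N)))            # sine term j
    coeffs.append(sum(y[t] * math.cos(math.pi * t)
                      for t in range(N)) / N)                    # Nyquist term
    return coeffs
```

Fitting a pure cosine of frequency one returns a coefficient of 1 for the first cosine column and (numerically) zero elsewhere.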
3. AN EXAMPLE
In this section we will present a vector of data and show how the coefficients are calculated for each of the three bases we have discussed. The results will then be compared. Suppose we have run a planned experiment in which the three factors were varied in a controlled fashion. In practice it is usually recommended that
such experiments be run in a randomized order, to try to reduce the effects of sampling order. The data are then pre-sorted into standard order. For a classical factorial analysis, the data are placed in the order specified by the design cube of Figure 26.1. Suppose our sample, in standard order, is the vector
We can then analyze it by calculating the estimator given in equation (2.3). The resulting estimates are
In other words We can analyze the same vector in terms of the Walsh basis. The resulting vector estimate is:
and the corresponding model is
Recall that $h_2$ corresponds to factor 2, and $h_4$ corresponds to factor 3. Note that the estimated coefficients have the same magnitude but the opposite sign to those obtained from the traditional analysis. The explanation is straightforward – we have analyzed the data in the wrong order for the Walsh basis. Looking at Figure 26.4 it can be seen that the observation which has all three settings low is the last observation for a Walsh-based experiment. In fact, the ordering is exactly the reverse of the standard. Pre-sorting the data into reverse order results in a model whose coefficients agree with the factorial analysis in both sign and magnitude. Finally, we analyze the data set using Fourier analysis. The resulting vector of estimated coefficients is
which corresponds to the model
The fit can be seen in Figure 26.7. Note that for this particular data set more terms are required to obtain a fit in the Fourier basis. The Walsh and factorial bases require exactly the same number of terms, because the basis vectors are the same to within a scale factor of $\pm 1$. In general, the number of non-zero Fourier coefficients will be different from the number of non-zero Walsh or factorial coefficients. Which number is larger depends entirely on the data set being analyzed.
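Since the chapter's numerical sample vector did not survive reproduction here, the sign-flip phenomenon can be illustrated with a small hypothetical k = 2 data vector (all names and numbers below are ours, not the chapter's):

```python
def effects(columns, y):
    """beta_hat = (1/n) B^T y for an orthogonal +1/-1 basis given by columns."""
    n = len(y)
    return [sum(c * v for c, v in zip(col, y)) / n for col in columns]

# Factorial basis in Yates's standard order (k = 2).
x0, x1, x2 = [1, 1, 1, 1], [-1, 1, -1, 1], [-1, -1, 1, 1]
x12 = [a * b for a, b in zip(x1, x2)]
# Walsh basis in natural order: the generating vectors satisfy h_j = -x_j.
h1, h2 = [-v for v in x1], [-v for v in x2]

y = [10.0, 14.0, 12.0, 16.0]          # hypothetical responses
fact = effects([x0, x1, x2, x12], y)
walsh = effects([x0, h1, h2, x12], y)
```

Here `fact` and `walsh` agree for the mean and the interaction coefficient, while the main-effect coefficients have the same magnitude but opposite sign, as described above.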
4. FAST ALGORITHMS
Recall that for any orthogonal basis, the least squares estimator is obtained from a single matrix multiplication. If the number of operations required to compute this can be substantially reduced, the result is a computationally efficient algorithm. These are often called "fast" algorithms. The fast Fourier transform (FFT) is probably the best known of these fast algorithms. At first glance, high speed computation capabilities might seem to negate the need for the computational efficiency available using fast transforms. A straightforward implementation of the discrete Fourier transform requires $O(2^{2k})$ calculations, where $k$ is the number of factors, while the corresponding FFT requires $O(k2^k)$ calculations. The amount of time required for a single-threaded computer implementation of the algorithm to compute the transform is proportional to the number of calculations being performed. Figure 26.8
illustrates the relative efficiency of the fast algorithms in terms of computational time requirements. The horizontal axis is $k$, and the vertical axis plots the number of computations which must be performed for that value of $k$. The plot illustrates that the savings can be substantial for even relatively small values of $k$. This is particularly noticeable in a desktop computing environment. Further, the fast algorithms are often preferable in terms of numerical stability, since they perform fewer calculations to yield the same answers. The systematic format of Yates's algorithm is also advantageous in the programming of computations associated with factorial or fractional factorial designs; see Nelson (1982) for an illustration. Finally, simple techniques such as Yates's algorithm also make it feasible to analyze factorial designs on spreadsheets without the need to program.
4.1. YATES'S FAST FACTORIAL ALGORITHM
Yates's algorithm provides a computationally efficient method for estimating the parameters $\beta$ in the model given in equation (2.8). Many of us, when exposed to the algorithm for the first time, are surprised and curious about its operation. Since it is usually viewed as a means to an end, most of us do not explore its makeup. Indeed, Yates gives no explanation of the origins of the algorithm. Most statisticians appear to be familiar with Yates's algorithm, but few have more than a vague notion of why the algorithm works. We shall see in this and subsequent sections that the algorithm is based on elegant symmetries in the original experimental design, and that the central idea behind Yates's algorithm is applicable to orthogonal functions in general. Recall that $\hat{\beta} = 2^{-k}X^Ty$, since $X^TX = 2^kI$. Yates's algorithm can then be understood by representing the matrix $X^T$ as the product of another matrix multiplied by itself $k$ times. Specifically, we find a special matrix M such that $M^k = X^T$. We'll first illustrate this matrix factorization for the case $k = 2$ and then generalize. For $k = 2$ we have

$$X = \begin{bmatrix} + & - & - & + \\ + & + & - & - \\ + & - & + & - \\ + & + & + & + \end{bmatrix}$$

so that

$$X^T = \begin{bmatrix} + & + & + & + \\ - & + & - & + \\ - & - & + & + \\ + & - & - & + \end{bmatrix}$$
and thus $X^T = M^2$, where

$$M = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 \end{bmatrix}.$$
The form of the matrix factorization for arbitrary values of $k$ is the $2^k \times 2^k$ matrix M in which row $i$, for $i = 1, \dots, 2^{k-1}$, contains 1's in columns $2i-1$ and $2i$, and row $2^{k-1}+i$ contains $-1$ in column $2i-1$ and $1$ in column $2i$, with zeros elsewhere. The calculation of $\hat{\beta}$ can be equivalently expressed as $\hat{\beta} = 2^{-k}M(M(\cdots(My)))$. At first glance this appears to be worse than the original algorithm, but the multiplication is not explicitly performed for the zero terms in M. The matrix X contains no zero elements, while M in fact contains precisely two non-zero elements per row or column, as can be seen above. This means that we perform only additions or subtractions for each of the M matrices in the factored form of the algorithm. Since there are $k$ such matrix factors, the total amount of work in Yates's approach is $k2^k$ rather than the $2^{2k}$ operations required if least squares is applied straightforwardly. Paraphrasing, for $N = 2^k$ we have $k = \log_2 N$, so the algorithm is $O(N\log_2 N)$ in complexity rather than $O(N^2)$. Figure 26.9 illustrates how Yates's algorithm works for the case of $k = 3$. This flow diagram contains equivalent information to the matrix representation – it describes how to combine those terms with non-zero coefficients. The diagram could also be used as a schematic for implementing Yates's algorithm in hardware. Straight lines indicate that a term is to be added, while dotted lines indicate that the term is to be subtracted. Observations are input on the left, and combined as indicated by the lines to produce intermediate values $u_i$ and $v_i$ and then contrasts $c_i$. Thus, at the first intermediate stage we have

$$u_1 = y_1 + y_2,\quad u_2 = y_3 + y_4,\quad u_3 = y_5 + y_6,\quad u_4 = y_7 + y_8,$$
$$u_5 = y_2 - y_1,\quad u_6 = y_4 - y_3,\quad u_7 = y_6 - y_5,\quad u_8 = y_8 - y_7;$$

at the second intermediate stage we obtain

$$v_1 = u_1 + u_2,\quad v_2 = u_3 + u_4,\quad v_3 = u_5 + u_6,\quad v_4 = u_7 + u_8,$$
$$v_5 = u_2 - u_1,\quad v_6 = u_4 - u_3,\quad v_7 = u_6 - u_5,\quad v_8 = u_8 - u_7;$$
and at the final stage we get

$$c_1 = v_1 + v_2,\quad c_2 = v_3 + v_4,\quad c_3 = v_5 + v_6,\quad c_4 = v_7 + v_8,$$
$$c_5 = v_2 - v_1,\quad c_6 = v_4 - v_3,\quad c_7 = v_6 - v_5,\quad c_8 = v_8 - v_7.$$

Note that this is exactly the same result we would obtain by explicitly performing the multiplication $X^Ty$. The contrasts can be scaled to produce estimates of $\beta$ by either pre- or post-multiplying by $2^{-k}$, which corresponds to the scale factor of the least squares estimator. For general values of $k$, each column is $2^k$ in length, and there are $k+1$ columns, with $k$ transformations being applied. Yates's is a constant-geometry algorithm, i.e., exactly the same sequence of operations is applied in each of the $k$ transformations. This is undoubtedly a benefit when the calculations are being performed by hand. Given the flow diagram, the operations can be expressed algorithmically. We will need several storage vectors of length $2^k$, which are labeled $y^{(j)}$ for $j = 0, 1, \dots, k$. We define $y_i^{(j)}$ to be the $i$th element of the data vector at the $j$th iteration. The algorithm follows.

For $j = 1, \dots, k$ and $i = 1, \dots, 2^{k-1}$:
$$y_i^{(j)} \leftarrow y_{2i-1}^{(j-1)} + y_{2i}^{(j-1)}, \qquad y_{2^{k-1}+i}^{(j)} \leftarrow y_{2i}^{(j-1)} - y_{2i-1}^{(j-1)}.$$

The symbol $\leftarrow$ indicates assignment of a value to a variable. References to $y^{(0)}$ should use the original vector of data. The output of the algorithm resides in the vector $y^{(k)}$. The gains in computational efficiency arise from the sparseness in the matrix M which factors the $X^T$ matrix. The factorization which corresponds to Yates's algorithm is not unique. For example, it is easily verified that an alternative pair of sparse factors
yields the same matrix for a $2^2$ experimental design. Further, it is not necessary for the component matrices of the factorization to be identical (i.e., we may have $M_1 \ne M_2 \ne \dots \ne M_k$). The efficiency comes from the sparseness, not the symmetry, of the matrix factors. As we will see in the next section, there exists a whole class of "fast" algorithms which can be used to evaluate factorial experiments.
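A runnable rendering of Yates's constant-geometry algorithm in Python (our sketch; each pass forms pairwise sums in the top half of the vector and pairwise differences in the bottom half):

```python
def yates(y):
    """Yates's fast factorial algorithm: returns the contrasts X^T y using
    k passes of additions/subtractions instead of a full matrix multiply."""
    n = len(y)
    k = n.bit_length() - 1
    assert n == 2 ** k, "length must be a power of two"
    v = list(y)
    for _ in range(k):
        v = [v[2 * i] + v[2 * i + 1] for i in range(n // 2)] + \
            [v[2 * i + 1] - v[2 * i] for i in range(n // 2)]
    return v  # divide by 2^k to obtain the least squares estimates
```

For example, `yates([1, 2, 3, 4])` returns the contrasts `[10, 2, 4, 0]`, matching $X^Ty$ computed directly for the $k = 2$ design.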
4.2. FAST WALSH TRANSFORMS
As with the Yates's algorithm, the FWT is obtained by matrix factorization. However, in this case the matrices are not all identical. This presents no problem – we noted in an earlier discussion that the efficiency of the algorithm comes from the sparseness of the matrix factors, not from any symmetry properties. The FWT can be represented in matrix form as

$$\hat{Y} = 2^{-k}M_1M_2\cdots M_ky,$$

where, for example, the sparse factors can be written in terms of Kronecker products of the $2 \times 2$ Hadamard matrix as

$$M_j = I_{2^{k-j}} \otimes H_2 \otimes I_{2^{j-1}}, \qquad H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},$$

and $I_m$ denotes the $m \times m$ identity matrix.
The resulting fast algorithm is depicted in Figure 26.10. As with the Yates's factorization, we find that the FWT algorithm presented here is not unique. There are many published algorithms for performing FWTs (Beauchamp, 1984) – hence our claim in the previous section that many fast algorithms exist for evaluating factorial experiments. Figure 26.11 illustrates the basic set of operations used for the FWT, called a butterfly flow diagram (for obvious reasons). Note that all the operations in the natural-ordered FWT can be expressed as butterflies. This is quite significant, because the butterfly has the property that a pair of values is operated on to produce a new pair, which go in the corresponding locations. Since the data in a butterfly are used independently of data anywhere else in the algorithm, the storage initially allocated for the data vector can be reused at subsequent stages. The result is an in-place algorithm, which is highly efficient in both storage and computational time. However, since this is not a
constant-geometry algorithm, it is possibly less desirable than Yates's for hand calculation. Given the basic operation specified by a butterfly, the natural-ordered FWT is described algorithmically below. (The symbol $\leftarrow$ indicates assignment to a variable.)

For $h = 1, 2, 4, \dots, 2^{k-1}$, and for each block starting at $s = 0, 2h, 4h, \dots$:
$$y_t \leftarrow y_t + y_{t+h}, \qquad y_{t+h} \leftarrow y_t - y_{t+h}, \qquad t = s, \dots, s+h-1,$$
where the right-hand sides use the values of $y_t$ and $y_{t+h}$ prior to the assignment.
Comparable results exist for the other Walsh ordering schemes. In fact, one way of performing the sequency ordered FWT is to apply the algorithm just described after pre-sorting the data into a particular order.
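The butterfly-based, in-place FWT just described can be sketched in Python as follows (our naming):

```python
def fwt_inplace(y):
    """Natural-ordered fast Walsh transform built from in-place butterflies.

    Overwrites y with H y; dividing by 2^k afterwards gives the DWT.
    Each butterfly replaces the pair (a, b) with (a + b, a - b) in place,
    so no extra storage is needed."""
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for t in range(start, start + h):
                a, b = y[t], y[t + h]
                y[t], y[t + h] = a + b, a - b      # one butterfly
        h *= 2
    return y
```

Applying the transform twice returns $2^k$ times the original vector, illustrating that the DWT is its own inverse up to scale.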
4.3. FAST FOURIER TRANSFORMS
We will use the complex basis for notational simplicity in deriving a fast Fourier transform (FFT) algorithm. As with Walsh functions, there are actually several FFTs. The one we will illustrate here is based on a result known as the Danielson–Lanczos lemma. Let $Y_j$ be the $j$th Fourier transform coefficient, and define the corresponding contrast transform to be $c_j = NY_j$. Danielson and Lanczos showed that $c_j$ could be expressed in terms of two other transforms, based on the even and odd data points, as follows:

$$c_j = c_j^{(e)} + W^jc_j^{(o)}, \qquad W = e^{-i2\pi/N},$$

where $c_j^{(e)}$ and $c_j^{(o)}$ are the contrast transforms of subsequences of length N/2 composed of the even and odd data points, respectively. In other words, the transform coefficient can be obtained as a weighted sum of two other transform coefficients based on shorter subsequences, where the weighting factor is a constant to the $j$th power. The lemma can now be applied recursively until subsequences of length one are obtained. At each stage of the recursion the size of the data set is halved, so it can be applied at most $\log_2 N$ times for each of the N coefficients. The result is a computational technique which is O(N log N) in complexity. By making use of the fact that our basis consists of circular functions, we can evaluate the coefficients modulo $2\pi$. In doing so, we find a great deal of redundancy in the corresponding matrices. We can also note that the algorithm logically groups the data first by the least significant bit of the binary representation of the index, then by the second least significant bit, and so on, up to the most significant bit. By pre-sorting the data in reverse order of the bits, we find the data are appropriately grouped to perform the transformation in place. This pre-sorting is an O(N) operation, so it does not significantly impact the overall computation time, while reducing the storage requirements. For example, the DFT for a set of eight points can be represented in matrix form in terms of the complex conjugate of equation (2.12). It is easy to verify via matrix multiplication that
$$\bar{F} = M_1M_2M_3S,$$

to within modulo $2\pi$ on each of the exponents, where $S$ is the permutation matrix that sorts the data into bit-reversed order and $M_1$, $M_2$, and $M_3$ are sparse butterfly matrices with two non-zero entries per row, each entry a power of $e^{-i\pi/4}$.
The S matrix contains only one non-zero entry per row and column, and so sorts the data set as described earlier in O(N) time. If desired, the ones can be replaced with $1/N$ so that the estimators are properly scaled with no additional work. Note that the functional form of the matrices for the FFT is identical to that for the FWT. Only the coefficients have changed. (In fact, calculating the sequency-ordered FWT involves pre-multiplying by exactly the same sorting matrix.) This is the basis of generalized transform algorithms, which use the same matrix structure and substitute different sets of coefficients to perform the various transforms of interest.
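A recursive Python sketch of the Danielson–Lanczos idea (our names; this computes the unscaled contrast transform, not the bit-reversed in-place variant):

```python
import cmath

def fft(y):
    """Contrast transform c_j = sum_t y_t e^(-2*pi*i*t*j/N), computed by
    splitting into even- and odd-indexed subsequences (Danielson-Lanczos)."""
    N = len(y)
    if N == 1:
        return list(y)
    even = fft(y[0::2])                    # transform of even-indexed points
    odd = fft(y[1::2])                     # transform of odd-indexed points
    W = cmath.exp(-2j * cmath.pi / N)      # principal N-th root of unity
    # c_j = even_j + W^j odd_j; the second half uses W^(j+N/2) = -W^j.
    return [even[j] + W ** j * odd[j] for j in range(N // 2)] + \
           [even[j] - W ** j * odd[j] for j in range(N // 2)]
```

Dividing the result by N gives the least squares Fourier coefficients.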
5. CONCLUSIONS
It is clear that many of the researchers referenced herein were aware of and influenced by Yates’s algorithm. However, the fundamental role of orthogonal transform theory, and the relationships among the various “fast algorithms,” appear to be unfamiliar to many statisticians. The fields of orthogonal factorial experimental design and orthogonal transform theory appear at first glance to have evolved in parallel, with little cross-communication. Despite the initial impact of Yates’s work, many statisticians treat orthogonal transform theory and the closely related field of spectral analysis as tools for “time series,” and appear unaware of the applicability of this work to factorial experimental designs. Researchers in the field of digital signal processing (DSP) have significantly extended the work of Yates, Good, Cooley, and Tukey. We believe that it is worthwhile for statisticians to become familiar with the DSP research in orthogonal function decomposition for a number of reasons. The FWT offers two benefits relative to the traditional Yates’s algorithm: it is an in-place algorithm, and it is its own inverse. The DSP literature is based upon generalized transform theory:
results can be generalized to many orthogonal designs, not just to 2^k and 3^k factorials. We have illustrated this using the FFT.
DSP literature offers a consistent notation and framework for representing orthogonal designs.
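The two FWT benefits cited above (in-place operation and self-inversion) can be checked directly with a short sketch, our own and not from the chapter: applying a natural-order fast Walsh-Hadamard transform twice returns N times the original data, so the same routine serves as its own inverse after division by N.

```python
def fwht(a):
    """In-place fast Walsh-Hadamard transform, natural (Hadamard) order.
    Applying it twice multiplies the data by N, so fwht is its own
    inverse up to the scale factor 1/N."""
    n = len(a)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for k in range(start, start + h):
                u, v = a[k], a[k + h]
                a[k], a[k + h] = u + v, u - v
        h *= 2
    return a
```

No separate inverse routine or extra working array is needed, which is exactly the practical advantage over the traditional tabular form of Yates’s algorithm.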
REFERENCES
Ahmed, N., and K. R. Rao. (1971). “The generalised transform.” Proc. Applic. Walsh Functions, Washington, D.C., AD727000, 60–67.
Beauchamp, K. G. (1984). “Applications of Walsh and Related Functions.” Academic Press, London.
Chatfield, C. (1984). “The Analysis of Time Series: An Introduction.” Chapman and Hall, New York.
Cooley, J. W., and J. W. Tukey. (1965). “An algorithm for the machine calculation of complex Fourier series.” Math. Comput., 19, 297–301.
Good, I. J. (1958). “The interaction algorithm and practical Fourier analysis.” J. Roy. Stat. Soc. (London), B20, 361–372.
Heideman, M. T., D. H. Johnson, and C. S. Burrus. (1984). “Gauss and the history of the fast Fourier transform.” IEEE ASSP Magazine, Oct. 1984, 14–21.
Hoadley, A. B., and J. R. Kettenring. (1990). “Communications between statisticians and engineers/physical scientists.” Technometrics, 32(3), 243–247.
Kiefer, J., and J. Wolfowitz. (1959). “Optimum designs in regression problems.” Ann. Math. Stat., 30, 271–294.
Manz, J. W. (1972). “A sequency-ordered fast Walsh transform.” IEEE Trans. Audio Electroacoust., AU-20, 204–205.
Nelson, L. S. (1982). “Analysis of two-level factorial experiments.” J. Quality Technology, 14(2), 95–98.
Pratt, W. K., J. Kane, and H. C. Andrews. (1969). “Hadamard transform image coding.” Proc. IEEE, 57, 58–68.
Sanchez, P. J., and S. M. Sanchez. (1991). “Design of frequency domain experiments for discrete-valued factors.” Applied Mathematics and Computation, 42(1), 1–21.
Stoffer, D. S. (1991). “Walsh-Fourier analysis and its statistical applications.” J. American Statistical Association, 86(414), 461–485.
Walsh, J. L. (1923). “A closed set of normal orthogonal functions.” Amer. J. Math., 45, 5–24.
Yates, F. (1937). “The Design and Analysis of Factorial Experiments.” Technical Communication No. 35, Imperial Bureau of Soil Science, London.
APPENDIX A: TABLE OF NOTATION
Chapter 27
UNCERTAINTY BOUNDS IN PARAMETER ESTIMATION WITH LIMITED DATA
James C. Spall
The Johns Hopkins University Applied Physics Laboratory, Laurel, MD 20723-6099
e-mail: [email protected]
Abstract
Consider the problem of determining uncertainty bounds for parameter estimates with a small sample size of data. Calculating uncertainty bounds requires information about the distribution of the estimate. Although many common parameter estimation methods (e.g., maximum likelihood, least squares, maximum a posteriori, etc.) have an asymptotic normal distribution, very little is usually known about the finite-sample distribution. This paper presents a method for characterizing the distribution of an estimate when the sample size is small. The approach works by comparing the actual (unknown) distribution of the estimate with an “idealized” (known) distribution. Some discussion and analysis are included that compare the approach here with the well-known bootstrap and saddlepoint methods. Example applications of the approach are presented in the areas of signal-plus-noise modeling, nonlinear regression, and time series correlation analysis. The signal-plus-noise problem is treated in greatest detail; this problem arises in many contexts, including state-space modeling, the problem of combining several independent estimates, and quantile calculation for projectile accuracy analysis. Key Words: Small sample, parameter estimation, system identification, uncertainty regions, M-estimates, signal-plus-noise, nonlinear regression, time series correlation.
Acknowledgments and Comments: This work was supported by U.S. Navy Contract N00024-98D-8124. Dr. John L. Maryak of JHU/APL provided many helpful comments, and Mr. Robert C. Koch of the Federal National Mortgage Association (Fannie Mae) provided valuable computational assistance in carrying out the example. A preliminary version of this paper was published in the Proceedings of the IEEE Conference on Decision and Control, December 1995. This paper is dedicated to the memory of Sid Yakowitz—a scholar and a gentleman.
1. INTRODUCTION
Meaningful inference in parameter estimation usually involves an estimation process and an uncertainty calculation. For many estimators—such as least squares, maximum likelihood, minimum prediction error, maximum a posteriori, etc.—there exists an asymptotic theory that provides the basis for determining probabilities and uncertainty regions in large samples (e.g., Hoadley, 1971; Ljung, 1978; Serfling, 1980). However, except for relatively simple cases, it is generally not possible to determine this uncertainty information in the small-sample setting. This paper presents an approach to determining small-sample probabilities and uncertainty regions for a general class of multivariate M-estimators (M-estimates are those found as the solution to a system of equations, and include those estimates mentioned above). Theory and implementation aspects will be presented. The approach is based on a simple—but apparently unexamined—idea. Suppose that the statistical model being used is some distance (to be defined below) away from an “idealized” model, where the small-sample distribution of the M-estimate for the idealized model is known. Then the known probabilities and uncertainty regions for the idealized model provide the basis for computing the probabilities and uncertainty regions in the actual model. The distance may be reflected in a conservative adjustment to the idealized quantities. This approach is fundamentally different from other finite-sample approaches (see below), where the accuracy of the relevant approximations is tied to the size of the sample versus the deviation from an idealized model. The M-estimation framework for the approach encompasses most estimators of practical interest and allows us to develop concrete regularity conditions that are largely in terms of the score function (the score function is typically the gradient of the objective function, which is being set to zero to create the system of equations that yields the estimate).
One of the significant challenges in assessing the small-sample behavior of M-estimates is that they are usually nonlinear, implicitly defined functions of the data. Perhaps the most popular current approach to small-sample analysis is computer-based resampling, most notably the bootstrap (e.g., Efron and Tibshirani, 1986; Hall, 1992; and Hjorth, 1994). The main appeal of this approach is its relative ease of use, even for complex estimation problems. Rutherford and Yakowitz (1991) show how the bootstrap applies in the nonparametric regression problem, for which an analytical treatment would rarely be possible. Resampling techniques make few analytical demands on the user, instead shifting the burden to one of computation. However, the bootstrap may provide a highly inaccurate description of M-estimate uncertainties in small samples (e.g., Lunneborg, 2000, pp. 97–98). This poor performance is inherently linked to the limited amount of information in the small sample, with little improvement possible through a larger amount of resampling.
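The small-sample weakness of the bootstrap noted above can be illustrated with a toy Monte Carlo study (our own construction; the exponential population, n = 5, and the 95% nominal level are arbitrary illustrative choices): the nominal 95% percentile interval for a mean covers the true value less often than advertised, and more bootstrap resamples do not fix this.

```python
import random

def percentile_interval(draws, alpha=0.05):
    """Equal-tailed bootstrap percentile interval at nominal level 1 - alpha."""
    s = sorted(draws)
    return s[int(alpha / 2 * len(s))], s[int((1 - alpha / 2) * len(s)) - 1]

def bootstrap_coverage(n=5, n_boot=300, n_mc=600, seed=1):
    """Monte Carlo estimate of the true coverage of a nominal 95% bootstrap
    percentile interval for the mean of an exponential(1) population."""
    rng = random.Random(seed)
    true_mean, hits = 1.0, 0
    for _ in range(n_mc):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means = []
        for _ in range(n_boot):
            resample = [rng.choice(sample) for _ in range(n)]
            means.append(sum(resample) / n)
        lo, hi = percentile_interval(means)
        if lo <= true_mean <= hi:
            hits += 1
    return hits / n_mc
```

With these settings the estimated coverage should fall noticeably below the nominal 0.95; the shortfall reflects the limited information in five observations, not the number of resamples.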
Other relatively popular methods for small-sample probability and uncertainty region calculation are those based on series expansions, particularly the Edgeworth and saddlepoint approximations (e.g., Daniels, 1954; Davison and Hinkley, 1988; Reid, 1988; Field and Ronchetti, 1990; and Ghosh, 1994). However, as noted in Reid (1988, p. 223), “saddlepoint approximations have not yet had much impact on statistical practice.” More recently, Goutis and Casella (1999) and Huzurbazar (1999) discuss some of the challenges to implementation in large-scale problems (“...derivations and implementations rely on tools such as exponential tilting, Edgeworth expansions, Hermite polynomials, complex integration, and other advanced notions”). These references focus on easier problems than the types of problems motivating our work here. The major limiting factors of these series-based methods are the cumbersome analytical form and numerical calculations involved. Hence, much of the literature in this area focuses on the relatively tractable setting of estimates that are a smooth function of a sample mean (Reid, 1988; Kolassa, 1991; Fraser and Reid, 1993; Lieberman, 1994; and Chen and Do, 1994). This setting, however, is severely limiting in practice. Field and Ronchetti (1990, Sect. 4.5) consider the multivariate M-estimation problem. While their method has reasonable regularity conditions, the implementation appears challenging in all but the simplest problems. As an example M-estimate, Field and Ronchetti partially work out the solution for a simple linear regression problem, but in even this simple problem the solution is incomplete due to an unknown constant of integration (which does affect the interval probabilities). Field and Ronchetti (1990) also note difficulties in going to more general problems (“...it is not possible to find explicit solutions”). The essential relationship of the small-sample approach here to the analytical (saddlepoint) methods above is as follows.
The saddlepoint methods make strong analytical and computational demands on the user and appear infeasible in many of the small-sample multivariate M-estimation problems encountered in practice (where the estimate is usually implicitly defined and must be found numerically). Although the approach here is generally easier to use for multivariate M-estimates, it requires that an idealized setting be identified from which to make adjustments. This is not always possible. Section 3 provides several distinct examples to illustrate how the idealized case may be determined in practice. A fundamental distinction between the saddlepoint method and the method here is in the nature of the errors in the probability calculations. For the saddlepoint, these errors are in terms of the sample size n, and are typically of order 1/n; for the approach here, the error is in terms of the deviation from the idealized case. In particular, for the above-mentioned measure of deviation ε, the error is of order ε for any n for which the estimate is defined. In addition, the implied constant of the order term can be explicitly bounded if needed. Hence, while traditional methods have the desirable property of disappearing error as n → ∞, the small-sample approach here has disappearing error as the model deviation ε → 0. The philosophy here is
that one is working with small samples, and desirable properties as n → ∞ may not be relevant. The remainder of this paper is organized as follows. Section 2 describes the fundamental problem and formally introduces the concept of the idealized distribution. Associated artificial estimators and data that will be used in characterizing the probabilities of interest for the real estimator and data are also introduced in this section. Section 3 summarizes how the formulation in Section 2 applies in the areas of signal-plus-noise modeling, nonlinear regression, and time series correlation analysis. Section 4 presents the main theoretical results, which characterize the error between the idealized (known) probability and the real (unknown) probability of the parameter estimate lying in a particular compact set (i.e., uncertainty region). Section 5 presents a thorough analysis of the signal-plus-noise example introduced in Section 3, including a numerical evaluation. Section 6 offers a summary and some concluding remarks. The Appendix presents technical details and a proof of the Theorem.
2. PROBLEM FORMULATION
Suppose we have a vector of data x (representing a sample of size n) whose distribution depends on a parameter vector θ and a known scalar ε ≥ 0, where θ is to be estimated by maximizing some objective function (say, a log-likelihood, as in maximum likelihood). The estimate θ̂ = θ̂(ε) is the quantity for which we wish to characterize the uncertainty when n is small. It is assumed to be found as the objective-maximizing solution to the score equation:
s(θ̂, ε, x) = 0  (2.1)

(e.g., s(·) represents the gradient of a log-likelihood function with respect to θ when θ̂ represents a maximum likelihood estimate). Suppose further that if ε = 0 (the “idealized” case), then the distribution for the estimate is known for the small n of interest. Our goal is to establish conditions under which probabilities for θ̂ with ε ≠ 0 (the real case) are close to the known probabilities in the idealized case. In particular, we show that the difference between the unknown and known probabilities for the estimates is proportional to ε when ε is small. This justifies using the known distribution for θ̂ when ε = 0 to construct approximate uncertainty regions for θ̂ when ε is small. Further, when ε is not so small, we show how the difference between the real and idealized probabilities can be approximated or bounded. The complexity of θ̂ as a function of x (via (2.1)) makes direct calculation of the small-sample distribution of θ̂ impossible in all but the simplest cases. To characterize probabilities for the estimate, we introduce two artificial estimators that have the same distribution as θ̂ when ε = 0 and when ε ≠ 0, respectively. This is analogous to the use of the well-known Skorokhod representation (Serfling,
1980, p. 23), where an easier-to-analyze “artificial” random process (defined on a different probability space) is used to analyze the weak convergence and other properties of the real random process of interest. The two artificial estimators, say θ̂(0) and θ̂(ε), are based, respectively, on fictitious vectors of data, x(0) and x(ε), of the same dimension as x. To construct the two fictitious data vectors, we introduce a random vector z (same dimension as x), with associated transformations φ₀(z) and φ_ε(z) (φ₀ being the same as φ_ε at ε = 0), with the θ underlying the transformations being the same as the unknown (true) θ generating the real data, and such that x(0) = φ₀(z) and x(ε) = φ_ε(z) have the same distribution as x for ε = 0 and the chosen ε ≠ 0, respectively. Then, from (2.1),

s(θ̂(0), 0, φ₀(z)) = 0,  (2.2a)

and, for ε ≠ 0,

s(θ̂(ε), ε, φ_ε(z)) = 0.  (2.2b)
As we will see in the examples of Section 3, it is relatively simple to specify the transformations φ₀(·) and φ_ε(·) given the basic problem structure associated with (2.1). The fundamental point in the machinations above is that the distributions of θ̂(0) and θ̂(ε) are identical to the distributions of the estimate under ε = 0 and ε ≠ 0, even though the various quantities (z, x(0), x(ε), etc.) have nothing per se to do with the real data and associated estimate. Our goal in Section 4 is to establish conditions under which probabilities for θ̂(ε) are close to the known probabilities for θ̂(0), irrespective of the sample size n. This provides a basis for approximating (or bounding) the probabilities and uncertainty regions for θ̂ under ε ≠ 0 through knowledge of the corresponding quantities for ε = 0. Throughout the remainder of this paper, we use the order notation O(ε) and o(ε) to denote terms such that O(ε)/ε is bounded and o(ε)/ε approaches zero, respectively, as ε → 0.
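The artificial-estimator construction above can be sketched numerically with a hypothetical scalar model of our own choosing: z is normal, the transformation φ_ε adds a cubic distortion, and the M-estimate is the sample mean (the solution of the score Σ(x_i − θ) = 0). Generating θ̂(0) and θ̂(ε) from the same z draws mimics the shared-z coupling, and the interval probabilities for the two estimates separate only gradually as ε grows.

```python
import random

def artificial_estimates(eps, n=5, n_mc=4000, seed=7):
    """Monte Carlo draws of an M-estimate under a shared z, in the spirit of
    (2.2a, b).  Hypothetical model: x_i = z_i + eps*z_i**3 with z_i ~ N(0,1);
    the estimate (the sample mean) solves the score sum_i (x_i - theta) = 0.
    eps = 0 is the idealized, exactly normal case."""
    rng = random.Random(seed)   # fixed seed => same z stream for every eps
    draws = []
    for _ in range(n_mc):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x = [zi + eps * zi ** 3 for zi in z]
        draws.append(sum(x) / n)
    return draws

def interval_prob(draws, a=0.0, h=0.5):
    """Estimated P(a - h <= theta_hat <= a + h) from the Monte Carlo draws."""
    return sum(1 for d in draws if a - h <= d <= a + h) / len(draws)
```

Comparing `interval_prob(artificial_estimates(0.0))` with the same quantity at small and large ε shows the idealized probability drifting away from the actual one as the model deviation grows.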
3. THREE EXAMPLES OF APPROPRIATE PROBLEM SETTINGS
To illustrate the range of estimation problems for which the small-sample approach is useful, this section sketches how the approach would be applied in three distinct M-estimation problems. Further detailed analysis (including numerical results) for the first of these problems is provided in Section 5.
3.1. Example 1: Parameter Estimation in Signal-Plus-Noise Model with Non-i.i.d. Data

Consider the problem of estimating the mean and covariance matrix of a random signal when the measurements of the signal include added independent noise with known distributional characteristics. In particular, suppose we have observations x_i distributed N(μ, Q + R_i), i = 1, 2, …, n, where the noise covariances R_i are known and the signal parameters μ and Q (whose unique elements are represented in vector form by θ) are to be jointly determined using maximum likelihood estimation (MLE). From the form of the score vector (see Section 5), we find that there is generally no closed-form solution (and no known finite-sample distribution) for the MLE when R_i ≠ R_j for at least one pair i ≠ j. This corresponds to the actual case of interest. We also found that the saddlepoint method was analytically intractable for this problem (due to the relative complexity of the score vector) and that the bootstrap method worked poorly in sample sizes of practical interest (e.g., Spall (1995) discusses this further). Estimation problems of this type (with either scalar or multivariate data) have been considered in many different problem contexts in control and statistics. Here are some examples: Shumway et al. (1981) and Sun (1982) in a state-space model identification problem; Rao et al. (1981) in the estimation of a random effects input-output model; James and Venables (1993) and the National Research Council (1992, pp. 143–144) in a problem of combining independent estimates of coefficients; Ghosh and Rao (1994) in small area estimation in statistical survey sampling; and Hui and Berger (1983) in the empirical Bayesian estimation of a dose response curve.
One of the author’s interests in this type of problem lies in estimating projectile impact means and covariance matrices from noisy observations of varying quality; these are then used in calculating CEP values (the 50% circular quantile values) for measuring projectile accuracy, as in Spall and Maryak (1992) and Shnidman (1995). In addition, for general multivariate versions of this problem, Smith (1985) presents an approach for ensuring that the MLE of the covariance matrix is positive semidefinite, and Spall and Chin (40) present an approach for data influence and sensitivity analysis. Central to implementing the small-sample approach is the identification of the idealized case and the definition of ε relative to the problem structure. We can write the inverse covariances as a common (known) inverse matrix plus an ε-scaled known perturbation matrix for each i. (We are using the inverse form here to relate the various matrices since the more natural parameterization in the score vector is in terms of the inverse covariances, as discussed in Section 5. However, this is not required, as the basic ideas would also apply in working with the noninverse form.) If ε = 0 (the idealized case of identical R_i), the distribution of the estimate is normal-Wishart for all sample sizes of at least two. For this application, the Theorem in Section 4 provides the basis for determining whether uncertainty regions from this idealized distribution are acceptable approximations to the unknown uncertainty regions resulting from nonidentical R_i. In employing the Theorem (via (2.2a, b)), we let the fictitious data be generated from normal variates z with the covariances parameterized as above. In cases with a larger degree of difference in the R_i (as expressed through a larger ε), this idealized approximation for the uncertainty regions may not be adequate—implied constants associated with the O(ε) bound of the Theorem provide a means of altering the idealized uncertainty regions (these implied constants depend on terms other than ε). This example illustrates the apparent arbitrariness sometimes present in specifying a numerical value of ε (e.g., if the elements of the perturbation matrices are made larger, then the value of ε must be made proportionally smaller to preserve algebraic equivalence). This apparent arbitrariness has no effect on the fundamental limiting process ε → 0, as it is only the relative values of ε that have meaning after the other parameters (e.g., Q, etc.) have been specified. In particular, the numerical value of the bound does not depend on the way in which the deviation from the idealized case is allocated to ε and to the other parameters; in this example, the bound depends on the products of ε and the perturbation matrices, which are certainly not arbitrary. We will return to this signal-plus-noise example in Section 5 for a more thorough treatment.
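The scalar version of this estimation problem (the one treated in detail in Section 5) can be sketched in code; the data, variable names, and solver choices below are our own illustrative assumptions, not the chapter's. Observations are x_i ~ N(mu, q + r_i) with known noise variances r_i, and the MLE is found by alternating a precision-weighted mean update for mu with a bisection solve of the score equation for q. In the idealized identical-r case, the routine reproduces the closed form mu_hat = sample mean and q_hat = sample variance minus r.

```python
def mle_signal_plus_noise(x, r):
    """Sketch of the scalar signal-plus-noise MLE: x_i ~ N(mu, q + r_i)
    with known r_i.  mu given q is the precision-weighted mean; q given mu
    solves its score equation by bisection; the updates are alternated."""
    n = len(x)

    def mu_given(q):
        w = [1.0 / (q + ri) for ri in r]
        return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

    def q_score(q, mu):
        # derivative of the log-likelihood in q, up to a positive factor
        return sum((xi - mu) ** 2 / (q + ri) ** 2 - 1.0 / (q + ri)
                   for xi, ri in zip(x, r))

    q = sum((xi - sum(x) / n) ** 2 for xi in x) / n   # sample-variance start
    for _ in range(100):
        mu = mu_given(q)
        lo, hi = 0.0, 10.0 * max((xi - mu) ** 2 for xi in x) + 1.0
        if q_score(lo, mu) <= 0.0:
            q_new = 0.0            # boundary solution: q_hat = 0
        else:
            for _ in range(200):   # bisection: score > 0 at lo, < 0 at hi
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if q_score(mid, mu) > 0.0 else (lo, mid)
            q_new = 0.5 * (lo + hi)
        if abs(q_new - q) < 1e-12:
            q = q_new
            break
        q = q_new
    return mu_given(q), q
```

When the r_i differ (the actual case of interest), the same routine still runs, but no closed-form distribution for the resulting estimate is available, which is exactly the situation the small-sample approach addresses.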
3.2. Example 2: Nonlinear Input-Output (Regression) Model
Although the standard linear regression framework is appropriate for modeling input-output relationships in some problems, a great number of practical problems have inherent nonlinearities. In particular, suppose that data are modeled as coming from the relationship x = h(θ, ε) + w, where h(·) is a nonlinear mapping and w is a random noise term. Typically, least squares, Bayesian, or MLE techniques are used to find an estimate of θ. In contrast to the linear setting (with normally distributed noise terms), the finite-sample distribution of the estimate will rarely be known in the nonlinear setting. (In particular, although the problem of estimating parameters in nonlinear regression models is frequently solvable using numerical optimization methods, the “situation is much worse when considering the accuracy of the obtained estimates” [Pazman, 1990].) The small-sample approach here is appropriate when the degree of nonlinearity is moderate; the corresponding idealized case is a linear regression setting that, in a sense illustrated below, is close to the actual nonlinear setting. Relative to (2.2a, b), it is natural to choose φ_ε(z) = h(θ, ε) + z, where z has the joint distribution of the noise w. Let us illustrate the ideas for two specific nonlinear cases. First, suppose that h is a quadratic function of θ, with coefficients that are vectors or matrices (as appropriate) of known constants and with the quadratic term weighted by ε. Such a setting might arise in an inversion problem of attempting to recover an unknown input value from observed outputs (as is, e.g., the main theme of fields such as statistical pattern recognition, image analysis, and signal processing); this particular quadratic model also plays a major role in experimental design and response surface identification (e.g., Walter and Pronzato, 1990; Joshi et al., 1994; and Bisgaard and Ankenman, 1996). Clearly, in the ε = 0 case, we have the standard linear regression model, for which the distribution of the estimate is known for noise having certain distributions (e.g., normal). As with the example of Subsection 3.1, the apparent arbitrariness in specifying ε is accommodated since the product of ε and the quadratic coefficients is the inherent expression of nonlinearity appearing in the bound. The second case pertains to a common model in econometrics. Suppose that h represents the constant elasticity of substitution (CES) production function relating labor and capital inputs to production output within a sector of the economy (Kmenta, 1971, pp. 462–464 or Nicholson, 1978, pp. 200–201). This model includes a “substitution parameter,” which we represent by ε. After making a standard log transformation, the CES model takes a regression form in which the three parameters within the vector θ represent parameters of economic interest, with capital and labor inputs entering for each firm i. As discussed in Kmenta (1971, p. 463) and Nicholson (1978, p. 201), when ε → 0, non-trivial limiting arguments show that the CES function reduces to the well-known (log-linear) Cobb-Douglas production function, representing the idealized case here. Hence, uncertainty regions for the estimate in the CES model can be derived from the standard linear regression-based uncertainty regions for the Cobb-Douglas function through use of the Theorem of Section 4.
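A toy numerical illustration of the quadratic case (our own model and data, not the chapter's): y_i = θu_i + ε(θu_i)² plus noise, fit by least squares with a golden-section search over θ; the search interval is assumed to bracket a single minimum. With ε = 0 the routine reduces to ordinary linear least squares, and fitting ε > 0 data under the ε = 0 (idealized linear) model shows the estimate deviating as the nonlinearity grows.

```python
def fit_theta(u, y, eps, lo=0.0, hi=5.0):
    """Least-squares estimate of the scalar theta in the hypothetical model
    y_i = theta*u_i + eps*(theta*u_i)**2, via golden-section search on
    [lo, hi] (assumed to contain a single minimum of the sum of squares).
    eps = 0 gives ordinary linear least squares."""
    def sse(th):
        return sum((yi - (th * ui + eps * (th * ui) ** 2)) ** 2
                   for ui, yi in zip(u, y))
    g = (5 ** 0.5 - 1) / 2          # golden ratio conjugate
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(200):
        if sse(c) < sse(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)
```

For u = (1, 2, 3, 4) and noise-free data generated with θ = 2 and ε = 0.1, the correctly specified fit recovers θ = 2, while refitting the same data under the idealized ε = 0 model pulls the estimate up to 10/3, illustrating how the linear idealization degrades as ε grows.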
3.3. Example 3: Estimates of Serial Correlation for Time Series

A basic problem in time series and dynamic modeling is to determine whether a sequence of measurements is correlated over time, and, if so, to determine the maximum order of correlation (i.e., the maximum number of time steps apart for which the elements in the sequence are correlated). A standard approach for testing this hypothesis is to construct estimates of correlation coefficients for varying order correlations, and then to use the known distribution of the estimates to test against the hypothesis of zero correlation. Let us suppose that we construct MLEs of the jth-order correlation coefficients, j = 1, 2, . . . . Our interest here is in the case where the data are non-normally distributed. This contrasts, for example, with the small-sample approach in Cook (1988), which is oriented to normal (and autoregressive) models. (By the result in Bickel and Doksum (1977, pp. 220–221), we know that standard correlation estimate forms
in, say, Anderson (1971, Subsection 6.1), correspond to the MLE when the data are normally distributed.) To define the likelihood function for performing the ML estimation, one needs to choose a particular model of the non-normality. Although the method here can work with any of the common non-normal models, let us consider the fairly simple approach of supposing the data are distributed according to a nonlinear transformation of a normal random vector. (Two other ways may also be appropriate: (i) suppose that the data are distributed according to a mixture distribution where at least one of the distributions in the mixture is normal and where the weighting on the other distributions is expressed in terms of ε; or (ii) suppose that the data are composed of a convolution of two random vectors, one of which is normal and the other being non-normal with a weighting expressed by ε.) In particular, consistent with (2.2a, b), suppose that x has the same distribution as φ_ε(z), where z is a normally distributed random vector and φ_ε is a transformation, with ε measuring the degree of nonlinearity. Since φ₀ is a linear transformation, the resulting artificial estimate θ̂(0) has one of the finite-sample distributions shown in Anderson (1971, Sect. 6.7) or Wilks (1962, pp. 592–593) (the specific form of the distribution depends on the properties of the eigenvalues of the matrices defining the time series progression). Note that, aside from entering the score function through the artificial data, ε appears explicitly (à la (2.1)) through its effect on the form of the distribution (and hence the likelihood function) for the data x or x(ε). Then, provided that ε is not too large, the Theorem in Section 4 (with or without the implied constant of the bound, as appropriate) can be used with the known finite-sample distribution to determine set probabilities for testing the hypothesis of sequential uncorrelatedness in the non-normal case of interest.
4. MAIN RESULTS
4.1. Background and Notation

This section presents the main result, showing how the difference between the unknown and known probabilities for the estimate lying in a rectangle decreases as ε → 0. In particular, we will be interested in characterizing the probabilities associated with rectangles of the form [a − h, a + h]:
where a is the midpoint vector and h is the vector of (positive) half-widths. (Of course, by considering a union of arbitrarily small rectangles, the results here can be applied to a nonrectangular compact set subject to an arbitrarily small error.) As discussed in Section 2, we will use the artificial estimates θ̂(0) and θ̂(ε) in analyzing this difference. After some background material is presented in Subsection 4.1, Subsection 4.2 presents the Theorem showing that the difference in probabilities is O(ε). Subsection 4.3 presents a computable bound for the implied constant of the O(ε) bound. An expression of critical importance in the Theorem below (and in the calculation of the implied constants for the result in the Theorem—see Subsection 4.3) is the gradient dθ̂(ε)/dε. From the fact that θ̂(ε) depends on ε and on x(ε) = φ_ε(z), we have by the total derivative formula of calculus

dθ̂(ε)/dε = ∂θ̂(ε)/∂ε + [∂θ̂(ε)/∂x(ε)ᵀ] ∂φ_ε(z)/∂ε.  (4.1)
When the score s(·) is a continuously differentiable function in a neighborhood of the θ̂ and ε of interest, and when the inverse [∂s/∂θᵀ]⁻¹ exists at these values, the implicit function theorem applies to two of the gradients on the right-hand side of (4.1):

∂θ̂/∂x(ε)ᵀ = −[∂s/∂θᵀ]⁻¹ ∂s/∂x(ε)ᵀ,  ∂θ̂/∂ε = −[∂s/∂θᵀ]⁻¹ ∂s/∂ε,  (4.2)
where the right-hand sides of the expressions in (4.2) are evaluated at the θ̂(ε) and ε of interest. All references to s = s(·) here and in the Theorem correspond to the definition in (2.1), with the artificial data replacing x, as in (2.2a). The remaining gradient on the right-hand side of (4.1) is obtainable directly as ∂φ_ε(z)/∂ε. Note that θ̂(ε) and its derivative in (4.1) are evaluated at the true θ (the θ generating the data), in contrast to the other expressions in (4.1) and (4.2), which are evaluated at the estimated θ. One interesting implication of (4.1) and (4.2) is that dθ̂(ε)/dε is explicitly calculable even though θ̂(ε) is, in general, only implicitly defined. From the implicit function theorem, the computation of dθ̂(ε)/dε for the important special case of ε = 0 (see notation below) relies on the above-mentioned assumptions of continuous differentiability for s(·) holding for ε both slightly positive and slightly negative. The following notation will be used in the Theorem conditions and proof: Consistent with usage above, a subscript i or j on a vector (say, on θ̂(ε), z, etc.) denotes the ith or jth component. We will also use the inverse image of the rectangle [a − h, a + h] relative to θ̂(0), i.e., the set of z values for which θ̂(0) lies in the rectangle; likewise, the inverse image relative to θ̂(ε) is defined with θ̂(ε) in place of θ̂(0).
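The claim that the gradient is explicitly calculable even though the estimate is only implicitly defined can be checked numerically on a toy scalar score of our own devising, s(θ, ε) = θ − 1 − εθ² (so the estimate solves θ = 1 + εθ²): the implicit-function-theorem gradient −(∂s/∂θ)⁻¹(∂s/∂ε) matches a finite difference of the bisection-computed root.

```python
def score(theta, eps):
    """Hypothetical scalar score: the estimate solves theta = 1 + eps*theta**2."""
    return theta - 1.0 - eps * theta ** 2

def solve_estimate(eps, lo=0.0, hi=1.9):
    """theta_hat(eps): root of the score equation, found by bisection
    (the score is increasing in theta on [lo, hi] for small eps)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid, eps) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ift_gradient(theta, eps):
    """d theta_hat / d eps = -(ds/dtheta)**(-1) * (ds/deps),
    evaluated at the estimate, per the implicit function theorem."""
    ds_dtheta = 1.0 - 2.0 * eps * theta
    ds_deps = -theta ** 2
    return -ds_deps / ds_dtheta
```

Here the analytic gradient at ε = 0.05 agrees with the finite-difference slope of the numerically solved root to several decimal places, even though no closed form for θ̂(ε) was ever used.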
4.2. Order Result on Small-Sample Probabilities

The main theoretical result of this paper is presented in the Theorem below. The proof is provided in the Appendix together with certain technical regularity conditions. These conditions are quite modest, as discussed in the remarks following their presentation in the Appendix (and as demonstrated in the signal-plus-noise example of Section 5). The regularity conditions pertain essentially to smoothness properties of the score vector and to characteristics of the distribution of z.

Theorem. Let θ̂(0) and θ̂(ε) be as given in (2.2a, b), and let a ± h be continuity points of the associated distribution functions. Then, under regularity conditions C.1–C.5 in the Appendix,

P(θ̂(ε) ∈ [a − h, a + h]) − P(θ̂(0) ∈ [a − h, a + h]) = O(ε).
4.3. The Implied Constant of the O(ε) Bound
Through the form of the calculations in the proof of the Theorem, it is possible to produce computable implied constants for the bound, i.e., constants c(a, h) such that

|P(θ̂(ε) ∈ [a − h, a + h]) − P(θ̂(0) ∈ [a − h, a + h])| ≤ c(a, h)ε + o(ε).  (4.3)
We present one such constant here; another is presented in Spall (1995). The constant here will tend to be conservative in that it is based on upper bounds to certain quantities in the proof of the Theorem. This conservativeness may be desirable in cases where ε is relatively large, to ensure that the inequality in (4.3) is preserved in practical applications (i.e., when the o(ε) term is ignored).² The constant in Spall (1995) is less conservative and is determined through a computer resampling procedure. Aside from the assumptions of the Theorem, there are two additional conditions under which c(a, h) is valid: (i) the elements of z are mutually independent, with each element having a bounded, continuous density function on some set such that every point of the relevant inverse image is interior to this set, and (ii) the gradient dθ̂(ε)/dε (from Subsection 4.1) is uniformly bounded on this set (note that, even when dθ̂(ε)/dε is continuous on the set, the Theorem itself does not require boundedness, since the gradient may be unbounded). Although these conditions may only be approximately satisfied in some practical problems, the value of c(a, h) computed as if the conditions were true may still provide critical insight into the true (uncomputable) constant (it should be no surprise that we have to impose additional conditions to obtain an analytical constant, given that the computation of implied constants in order bounds appearing in the statistics and identification literature is almost always notoriously difficult and is usually not carried out). Nevertheless, condition (i) can sometimes be fully satisfied, as illustrated in the problem of Section 5. Furthermore, in other problems, it can often be at least approximately satisfied by forming a new z through scaling the original z by the square root of the Fisher information matrix under ε = 0 (this uses the fact that many estimates are asymptotically normally distributed with covariance matrix given by the inverse Fisher information matrix, and the fact that independence and uncorrelatedness are equivalent under joint normality). Let us begin our derivation of c(a, h) by developing constants for each of the two probabilities in (A.6) of the Theorem proof. Then, by (A.1), c(a, h) will be twice the sum (over j) of the two constants for (A.6). Let β denote the bound on the magnitude of dθ̂(ε)/dε that is implied by condition (ii). (In practice, if this bound is not available analytically, it could be estimated by randomly sampling z a large number of times, taking the estimated β as the maximum observed magnitude of the gradient.) Then, we have for one of the two probabilities in (A.6),

² The issue of ignoring higher-order error terms (à la o(ε)) is common in many small- and large-sample estimation methods. For example, the saddlepoint, bootstrap, and central limit theorem, in general, all have unquantifiable higher-order error terms in terms of n.
where is the marginal density function for is defined in the Appendix (eqn. (A.4)), and the last line follows from the independence condition (i) and the mean value theorem. For (the other probability in (A.6)), we follow an identical line of reasoning to obtain
where Because the true values of and will not be known in practice, and could be chosen as the midpoint of the (assumed small) intervals in which they lie. Thus from (A.1) and (A.6),
Uncertainty bounds in parameter estimation with limited data
5. APPLICATION OF THEOREM FOR THE MLE OF PARAMETERS IN SIGNAL-PLUS-NOISE PROBLEM
5.1. Background
This section returns to the example of Subsection 3.1, and presents an analysis of how the small-sample approach would apply in practice. In particular, consider independent scalar observations distributed where the are known and are to be estimated (jointly) using maximum likelihood. As mentioned in Subsection 3.1, when for at least one i, j (the actual case), no closed-form expression (and hence no computable distribution) is generally available for When for all i, j (the idealized case), the distribution of is known (see (5.2a, b) below). For this estimation problem, Subsection 5.2 discusses the regularity conditions of the Theorem and comments on the calculation of the implied constant c(a, h), and Subsection 5.3 presents some numerical results. This two-parameter estimation problem is one where the other analytical techniques discussed in Section 1 (i.e., Edgeworth expansion and saddlepoint approximation) are impractical because of the unwieldy calculations required (say, as related to the cumulant generating function and its inverse). When using the distribution for as an approximation to the actual distribution (when justified by the Theorem), we choose a value of Q corresponding to the “information average” of the individual’s i.e., Q is such that (The idea of summing information terms for different measurements is analogous to the idea in Rao (1973, pp. 329–331).) As discussed in Subsection 3.1, deviations of order from the common Q are then naturally expressed in the inverse domain: where the are some fixed quantities (discussed below).
Working with information averages has proven desirable as a way of down-weighting the relative contribution of the larger versus what their contribution would be, say, if Q were a simple mean of the (from (5.1) below, we see that the score expression also down-weights the data associated with larger ). A further reason to favor the information average is that the score is naturally parameterized directly in terms of through use of the relationship Hence, represents the mean of the natural nuisance parameters in the problem. Finally, we have found numerically that the “idealized” probabilities computed with the information
average have provided more accurate approximations to the true probabilities when the vary moderately than, say, idealized probabilities based on an average equal to the mean Note, however, that any type of average will work when the are sufficiently close since if and only if when Q > 0. The log-likelihood function, for the estimation of is
where
from which the score expression is found:
Since
we will consider only those sets of interest (i.e., such that ). This does not preclude having a practical estimate come from (in which case one would typically set to 0); however, in specifying confidence sets, we will only consider those points in space that make physical sense. (Note that if n is reasonably large and/or the are reasonably small relative to then from will almost always be positive.)
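As an illustrative complement to this setup, the following sketch computes the joint MLE numerically under the model we infer from the discussion above, namely x_i ~ N(mu, Q + q_i) with the q_i known: for any fixed Q the score in mu is solved by a precision-weighted mean, so mu can be profiled out and Q found by a one-dimensional grid search. The specific model form, the parameter values, the grid bounds, and the reading of the "information average" as a harmonic mean are our assumptions, not taken from the chapter.

```python
import math
import random

def mle_signal_plus_noise(x, q, grid_steps=2000, q_max=2.0):
    """Joint MLE of (mu, Q), assuming x_i ~ N(mu, Q + q_i) with q_i known.

    For fixed Q the score in mu vanishes at the precision-weighted mean,
    so mu is profiled out and the profile log-likelihood is maximized
    over a grid of Q >= 0 values.
    """
    def weighted_mean(Q):
        w = [1.0 / (Q + qi) for qi in q]
        return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

    def profile_loglik(Q):
        mu = weighted_mean(Q)
        return -0.5 * sum(math.log(Q + qi) + (xi - mu) ** 2 / (Q + qi)
                          for xi, qi in zip(x, q))

    grid = [q_max * k / grid_steps for k in range(grid_steps + 1)]
    Q_hat = max(grid, key=profile_loglik)
    return weighted_mean(Q_hat), Q_hat

# Nuisance variances echoing one case of Subsection 5.3; their "information
# average" (read here as a harmonic mean) reproduces the reported 0.04.
q = [0.0323, 0.0323, 0.0323, 0.0625, 0.0625]
q_info = len(q) / sum(1.0 / qi for qi in q)

rng = random.Random(1)
x = [rng.gauss(1.0, math.sqrt(0.2 + qi)) for qi in q]   # hypothetical mu = 1, Q = 0.2
mu_hat, Q_hat = mle_signal_plus_noise(x, q)
```

With n = 5 the estimate of Q is quite variable (and can land at the boundary Q = 0), which is exactly the small-sample behavior the chapter is concerned with.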
5.2. Theorem Regularity Conditions and Calculation of Implied Constant
The first step in checking the conditions for the Theorem is to define the artificial data sequences, and associated artificial estimators and From the definitions in Subsection 3.1, the two artificial MLEs are
where As required to apply the Theorem, has a known distribution (the same, of course, as for the of interest from (5.1) when In particular, and satisfy
Spall (1995) includes a verification of the regularity conditions C.1–C.4 in the Appendix (C.5 is immediate by the definition of We assume that Q > 0; then for all in a neighborhood of 0 (i.e., is well-defined as a variance for all near 0, including as required by the implicit function theorem in computing as discussed in Subsection 4.1). The two conditions for calculating the constant c(a, h) introduced in Subsection 4.3 are also satisfied. In particular, (i) and are independent with a bounded continuous density function, and (ii) is bounded by the fact that is continuous on (via the implicit function theorem) and is a bounded set. Although an analytical form is available for it may be easier in practice to approximate max for each j by randomly sampling over This yields estimates of and and is the procedure used in computing c(a, h) in Subsection 5.3 below. The probabilities for j = 1, 2 are readily available by the normal and chi-squared distributions for and Likewise, the density-based values are easily approximated by taking as an intermediate (we use mid) point of the appropriate interval This provides all the elements needed for a practical determination of as illustrated in Subsection 5.3.
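The random-sampling approximation of the maximum mentioned above can be sketched generically as follows; the objective function and sampling region below are hypothetical stand-ins, since the actual quantities are problem-specific.

```python
import random

def estimate_sup(g, sample_point, num_samples=100_000, seed=0):
    """Approximate the supremum of |g| over a region by random sampling.

    g            -- function whose maximum absolute value is sought
    sample_point -- draws one random point from the region of interest
    """
    rng = random.Random(seed)
    return max(abs(g(sample_point(rng))) for _ in range(num_samples))

# Hypothetical example: sup of |x * (1 - x)| over [0, 1] is 0.25.
est = estimate_sup(lambda x: x * (1 - x), lambda rng: rng.random())
```

The estimate is a lower bound on the true supremum, so in the conservative-constant context one might inflate it slightly before use.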
5.3. Numerical Results
This subsection presents results of a numerical study of the above MLE problem. Our goals here are primarily to compare the accuracy of uncertainty intervals based on the small-sample theory with the actual (empirically determined) regions. We also briefly examine the performance of asymptotic MLE theory. In this study, we took n = 5 and generated data according to with such that (so the average, in an information sense, is 0.04 according to the discussion of Subsection 5.1). As discussed above, we estimate and are interested in uncertainty intervals for and For ease of presentation and interpretation, we will focus largely on the marginal distributions and uncertainty intervals for each of and this also is partly justified by the fact that and are approximately independent (i.e., when either or n is large, and are independent). We will report results for and which correspond to values of equal to {0.0323, 0.0323, 0.0323, 0.0625, 0.0625} and {0.0271, 0.0271, 0.0271, 0.143, 0.143}, respectively. Results for these two values
of are intended to represent the performance of the small-sample theory for a small- and a moderate-sized Spall (1995) shows that for the portion of the small-sample uncertainty intervals differ little from the true intervals or those obtained by asymptotic theory. Hence, we focus below on the part of Figure 5.1 depicts three density functions for for each of the and cases: (i) the “true” density based on the marginal histogram constructed from estimates of determined from independent sets of n = 5 measurements (a smoother in the SAS/GRAPH software system (SAS Institute, 1990, Reference Volume 1, p. 416) was used to smooth out the small amount of jaggedness in the empirical histogram), (ii) the small-sample density from (5.2b) (corresponding to the idealized case), and (iii) the asymptotic-based normal density with mean = 1 and variance given by the appropriate diagonal element of the inverse Fisher information matrix
3 Note that for this example, the noise levels are relatively small, so techniques such as those in Chesher (1991) may have been useful. However, this example was chosen with small noise only to avoid the “tangential” (to this paper) issue of coping with negative variance estimates; there is nothing inherent in the small-sample approach that requires small noise (e.g., the other two examples discussed in Section 3 do not fit into the small-noise framework).
for We see that with the true and small-sample densities are virtually identical throughout the domain while the asymptotic-based density is dramatically different. For there is some degradation in the match between the true and idealized small-sample densities, but the match is still much better than that between the true and asymptotic-based densities. Of course, it is the purpose of the adjustment based on c(a, h) to compensate for such a discrepancy in confidence interval calculation. Note that the true densities illustrate the frequency with which we can expect to see a negative variance estimate, which is an inherent problem due to the small size of the sample (the asymptotic-based density significantly overstates this frequency). Because of the relatively poor performance of the asymptotic-based approach, we focus below on comparing uncertainty regions from only the true distributions and the small-sample approach.4 Figure 5.2 translates the above into a comparison of small-sample uncertainty regions with the true regions. Included here are regions based on the term of the Theorem when quantified through use of the constant, c(a, h). The small-sample regions are “nominal” regions in the sense that the distributions in (5.2a, b)
4 However, as one illustration of the comparative performance of the small-sample and asymptotic approaches for confidence region calculation, consider the CEP estimation problem mentioned in Subsection 3.1 above. The CEP estimates are based on the signal-plus-noise estimation example of this section. For the e = 0.15 case, the 90% confidence interval of interest for the CEP estimate was approximately 30% narrower when using the small-sample approach than when using the standard asymptotic approach. Hence, by more properly characterizing the distribution of the underlying parameter estimates, we are able to extract more information about the CEP quantity of interest.
and are evaluated at the true and (consistent with a hypothesis testing framework where there is an assumed “true” ). The indicated interval end points were chosen based on preserving equal probability (0.025) in each tail, with the exception of the conservative case; here the lower bound went slightly below 0 using symmetry, so the lower end point was shifted upward to 0 with a corresponding adjustment made to the upper end point to preserve at least 95% coverage. (Spall (1995) includes more detail on how the probability adjustment was translated into an uncertainty interval adjustment.) For the case, we see that the idealized small-sample bound is identical to the true bound (this, of course, is the most desirable situation since there is then no need to work with the c(a, h)-based adjustment). As expected, the uncertainty intervals with the conservative (c(a, h)-based) adjustments are wider. For the case, there is some degradation in the accuracy of coverage for the idealized small-sample interval, which implies a greater need to use the conservative interval to ensure the intended coverage probability for the interval. The above study is fully representative of others that we have conducted for this estimation framework (e.g., nominal coverage probabilities of 90% and 99% and other values of ). These studies illustrate that with relatively small values of the idealized uncertainty intervals are very accurate, but that with larger values (e.g., or 0.30) the idealized interval becomes visibly too short. In these cases, the c(a, h)-based adjustment to the idealized interval provides a means for broadening the coverage to encompass the true uncertainty interval.
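A Monte Carlo sketch of the kind of coverage comparison reported here: in the idealized equal-q_i case, n·S/(Q + q) (with S the ML variance estimate) follows a chi-squared distribution with n − 1 degrees of freedom, which inverts to a nominal 95% interval for Q. The interval construction, the parameter values, and the treatment of the unequal-q_i case are our assumptions for illustration, not the paper's exact procedure.

```python
import math
import random

def coverage_of_idealized_interval(Q, q_list, trials=20_000, seed=2):
    """Empirical coverage of the nominal 95% chi-squared interval for Q.

    Idealized case: if every q_i equals q, then n*S/(Q + q) ~ chi^2_{n-1},
    where S = sum((x_i - xbar)^2)/n, inverting to the interval
    [n*S/chi2_hi - q, n*S/chi2_lo - q].  We apply the same recipe (with q
    replaced by the mean of the q_i) even when the q_i differ, mimicking
    the comparison in this subsection.
    """
    n = len(q_list)
    assert n == 5, "chi-squared quantiles below are for n - 1 = 4 df"
    chi2_lo, chi2_hi = 0.484, 11.143          # 0.025 / 0.975 quantiles of chi^2_4
    q_bar = sum(q_list) / n
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, math.sqrt(Q + qi)) for qi in q_list]
        xbar = sum(x) / n
        S = sum((xi - xbar) ** 2 for xi in x) / n
        lo, hi = n * S / chi2_hi - q_bar, n * S / chi2_lo - q_bar
        hits += lo <= Q <= hi
    return hits / trials

cov_equal = coverage_of_idealized_interval(0.04, [0.04] * 5)            # idealized case
cov_unequal = coverage_of_idealized_interval(0.04, [0.0271] * 3 + [0.143] * 2)
```

In the equal-q_i run the empirical coverage sits near the nominal 0.95; applying the same idealized interval when the q_i differ is exactly the situation the c(a, h)-based adjustment is meant to guard against.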
6. SUMMARY AND CONCLUSIONS
Making inference in parameter estimation with limited data is a problem encountered in many applications. Although statistical techniques such as the bootstrap and saddlepoint approximation have shown some promise in the small-sample setting, there remain serious difficulties in accuracy and feasibility for the type of multivariate M-estimation problems that are frequently encountered in practice. For a range of problem settings, the approach described here is able to provide accurate information about the estimate uncertainties. The primary restriction is that an idealized case must be identified (where the estimate uncertainty is known for the given sample size) with which the estimate uncertainty for the actual case is compared. This idealized case is related to the actual problem, but differs in that an analytical solution to the distribution of the estimate is available. A computable bound is available to characterize the difference in set probabilities for the (known) idealized case and (unknown) actual problem. A Theorem was presented that provides the basis for comparing the actual and idealized cases. The Theorem provides an order bound on the difference between the cases and a means for quantifying the implied constant of the order
bound. Implementations of the approach were discussed for three distinct well-known identification settings to illustrate the range of potential applications. These were a signal-plus-noise maximum likelihood estimation problem, a general nonlinear regression setting, and a problem in non-Gaussian time series correlation analysis. The signal-plus-noise example was considered in greatest detail. This problem arose for the author in the analysis and (state-space) modeling of a naval defense system. The small-sample approach was relatively easy to implement for this problem and yielded accurate results. The required idealized case was one where the data are i.i.d. When the actual data were moderately non-i.i.d., it was shown that the small-sample approach yielded results close to the true (uncomputable) uncertainty regions. In cases where the actual data deviated a greater amount from the i.i.d. setting, the small-sample approach yielded conservative uncertainty regions based on a quantification of the implied constant of the order bound of the Theorem. This example provided a realistic illustration of the types of issues arising in using the approach in a practical application. While this paper provides a new method for small-sample analysis, further work would enhance the applicability of the approach. One area would be to explore, in detail, the application to problems other than the signal-plus-noise problem. Two candidate problems in time series and nonlinear regression were sketched in Section 3. Another valuable extension of the current work would be to automate (as much as possible) the computation of the implied constant that is used to provide a conservative uncertainty region. It would also be useful to carry out a careful comparative analysis of the relative accuracy, ease of implementation, and applicability of the saddlepoint method, the bootstrap, and the method of this paper.
Despite these open problems, the method here offers a viable approach to a range of small-sample estimation problems.
APPENDIX: THEOREM REGULARITY CONDITIONS AND PROOF (SECTION 4)
Regularity conditions C.1–C.5 for the Theorem are as follows. These are built from the score function (2.1) (or equivalently, (2.2a, b)) in terms of the artificial estimators and
C.1 Let be a bounded set is valid). Further, if suppose that is uniformly bounded away from 0 on If then, except in an open neighborhood of (i.e., a region such that for some radius > 0 an n-ball of this radius around any point in is contained within this region), we have uniformly bounded away from 0 on
C.2 Except on a set of exists on Further, for each
0
= probability measure for on a.s.
and
C.3 For each when suppose that there exists an open neighborhood of (see C.1) such that and exist continuously in the neighborhood. Further, for each j and sign ±, there exists some scalar element in z, say such that we have
C.4
C.5 For each j = 1, 2, . . . , p, the distribution of z is absolutely continuous in an open neighborhood of (see C.1).
Remarks on C.1–C.5. Although some of the regularity conditions may seem arcane, the conditions may be easily checked in some applications (see, e.g., the problem in Section 5). In checking C.1, note that will often be a bounded set (see Section 5), which automatically implies that is bounded. The other requirement on being bounded away from 0 on is straightforward to check since has a known distribution. Given the form of the score s(·) and transformation eqns. (4.1) and (4.2) readily suggest when C.2 will be satisfied. Note that when is bounded and is continuous on (see example in Section 5), the last part of the condition will be automatically satisfied since will be bounded above. The conditions in C.3 pertaining to can generally be checked directly since is typically available in closed form (its distribution is certainly available in closed form); the condition on can be checked through use of (4.1) and (4.2) (as in checking C.2). As illustrated in Section 5, to satisfy the near-symmetry condition, C.4, it is sufficient that have a continuous density near and that (from (4.1)) be uniformly bounded on (of course, this condition can also be satisfied in settings where is not bounded in this way). Finally, C.5 is a straightforward condition to check since the analyst specifies the distribution for z. Note that none of the conditions impose any requirements of bounded moments for (or equivalently, for which would be virtually impossible to check for most practical M-estimation problems.
Briefly, the conditions above are used in the proof as follows. The bounded sets assumed in C.1 ensure that a domain of interest can be covered by a finite
number of “small” neighborhoods. The fact that the derivative exists and is bounded away from 0 in C.2 ensures that an important ratio with in the denominator exists a.s. The assumption in C.3 regarding the continuous differentiability of certain expressions ensures (via the implicit function theorem) the local one-to-oneness of certain transformations, which in turn ensures the existence of certain density functions. C.4 allows for the substitution of an easier probability expression for a more complicated probability (to within “negligible” error). Finally, C.5 guarantees that local density functions exist for the “artificial” data vector z, which then can be used via the mean value theorem to characterize a set probability of interest.
In the proof of the Theorem below, we use the following two Lemmas, which have relatively straightforward proofs given in Spall (1995).
Lemma 1 For a random vector let be continuity points for its associated distribution function F(·). Then
where For Lemma 2, represents a term such that converges to 0 in probability as
Lemma 2 Let and A be two random variables and one event (respectively), with dependent on the introduced above. Suppose that where and that Then
Proof of Theorem. Without loss of generality, we take
Let us define
and
as appropriate). Then by the dominated convergence theorem and Laha and Rohatgi (1979, pp. 145–146),
For the probabilities in the first sum, we know that the event occurs only if Letting we have that the event for the jth summand corresponds to
For the probabilities in the second sum on the right-hand side of (A.1), we know that so the event for the jth summand corresponds to
By condition C.4 we know that the probability of the event on the r.h.s. of (A.2a) can be bounded above to within by
Hence, from (A.2a, b), to within each of the 2p probabilities associated with the summands in (A.1) may be bounded above by the expression in (A.3). Now, for almost all z in the set we know by condition C.2 and Taylor’s theorem that
(suppressing the dependence on z). For all j = 1, 2, …, p, we know by C.2 that a.s. on Hence the quantity
is well-defined a.s. on
Note that the expression in (A.3) is bounded above by
Since Lemma 2 applies (see Spall (1995)), we can replace (to within error) the two probabilities in (A.5) by the following probabilities that do not depend on and are easier to analyze:
We now show that these two probabilities are
This will establish the main result to be proved.
To show the above, Spall (1995) shows that the conditional (on ) densities for exist near 0 for each j and ± sign such that We then use these densities to characterize the probabilities in (A.6). (When an the corresponding probability in (A.6) is by C.1 and C.2.) To characterize these densities, we first use C.2, C.3, and the inverse function theorem to establish that local densities exist in a finite number of disjoint regions within The inverse and implicit function theorems are then used to establish the form for the joint density for and (representing the elements of z after the scalar element from C.3 has been removed) in each of the local regions (likewise for Then can be integrated out, leaving the densities of interest for in each local region. On each of these local regions, it can be shown that the relevant probability is Taking the union over the finite number of local regions establishes that the probabilities in (A.6) are Q.E.D.
REFERENCES

Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
Bickel, P. J., and K. A. Doksum (1977). Mathematical Statistics. Holden-Day, Oakland, CA.
Bisgaard, S., and B. Ankenman (1996). Standard errors for the eigenvalues in second-order response surface models. Technometrics, 38:238–246.
Chen, Z., and K.-A. Do (1994). The bootstrap method with saddlepoint approximations and importance sampling. Statistica Sinica, 4:407–421.
Chesher, A. (1991). The effect of measurement error. Biometrika, 78:451–462.
Cook, P. (1988). Small-sample Bayesian frequency-domain analysis of autoregressive models. In J. C. Spall, editor, Bayesian Analysis of Time Series and Dynamic Models, pp. 101–126. Marcel Dekker, New York.
Daniels, H. E. (1954). Saddlepoint approximations in statistics. Annals of Mathematical Statistics, 25:631–650.
Davison, A. C., and D. V. Hinkley (1988). Saddlepoint approximations in resampling methods. Biometrika, 75:417–431.
Efron, B., and R. Tibshirani (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy (with discussion). Statistical Science, 1:54–77.
Field, C., and E. Ronchetti (1990). Small Sample Asymptotics. IMS Lecture Notes–Monograph Series (vol. 13). Institute of Mathematical Statistics, Hayward, CA.
Fraser, D. A. S., and N. Reid (1993). Third-order asymptotic models: Likelihood functions leading to accurate approximations for distribution functions. Statistica Sinica, 3:67–82.
Ghosh, J. K. (1994). Higher Order Asymptotics. NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 4. Institute of Mathematical Statistics, Hayward, CA.
Ghosh, M., and J. N. K. Rao (1994). Small area estimation: An appraisal (with discussion). Statistical Science, 9:55–93.
Goutis, C., and G. Casella (1999). Explaining the saddlepoint approximation. American Statistician, 53:216–224.
Hall, P. (1992). The Bootstrap and the Edgeworth Expansion. Springer-Verlag, New York.
Hjorth, J. S. U. (1994). Computer Intensive Statistical Methods. Chapman and Hall, London.
Hoadley, B. (1971). Asymptotic properties of maximum likelihood estimates for the independent not identically distributed case. Annals of Mathematical Statistics, 42:1977–1991.
Hui, S. L., and J. O. Berger (1983). Empirical Bayes estimation of rates in longitudinal studies. Journal of the American Statistical Association, 78:753–760.
Huzurbazar, S. (1999). Practical saddlepoint approximations. American Statistician, 53:225–232.
James, A. T., and W. N. Venables (1993). Matrix weighting of several regression coefficient vectors. Annals of Statistics, 21:1093–1114.
Joshi, S. S., H. D. Sherali, and J. D. Tew (1994). An enhanced RSM algorithm using gradient-deflection and second-order search strategies. In J. D. Tew et al., editors, Proceedings of the Winter Simulation Conference, pp. 297–304.
Kmenta, J. (1971). Elements of Econometrics. Macmillan, New York.
Kolassa, J. E. (1991). Saddlepoint approximations in the case of intractable cumulant generating functions. In Selected Proceedings of the Sheffield Symposium on Applied Probability, IMS Lecture Notes–Monograph Series (vol. 18), pp. 236–255. Institute of Mathematical Statistics, Hayward, CA.
Laha, R. G., and V. K. Rohatgi (1979). Probability Theory. Wiley, New York.
Lieberman, O. (1994). On the approximation of saddlepoint expansions in statistics. Econometric Theory, 10:900–916.
Ljung, L. (1978). Convergence analysis of parametric identification methods. IEEE Transactions on Automatic Control, AC-23:770–783.
Lunneborg, C. E. (2000). Data Analysis by Resampling: Concepts and Applications. Duxbury, Pacific Grove, CA.
National Research Council (1992). Combining Information: Statistical Issues and Opportunities for Research. National Academy of Sciences, Washington, DC.
Nicholson, W. (1978). Microeconomic Theory. Dryden, Hinsdale, IL.
Pazman, A. (1990). Small-sample distributional properties of nonlinear regression estimators (a geometric approach) (with discussion). Statistics, 21:323–367.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
Rao, P. S. R. S., J. Kaplan, and W. G. Cochran (1981). Estimators for the one-way random effects model with unequal error variances. Journal of the American Statistical Association, 76:89–97.
Reid, N. (1988). Saddlepoint methods and statistical inference (with discussion). Statistical Science, 3:213–238.
Rutherford, B., and S. Yakowitz (1991). Error inference for nonparametric regression. Annals of the Institute of Statistical Mathematics, 43:115–129.
SAS Institute (1990). SAS/GRAPH Software: Reference, Version 6, First Edition. SAS Institute, Cary, NC.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
Shnidman, D. A. (1995). Efficient computation of the circular error probable (CEP) integral. IEEE Transactions on Automatic Control, 40:1472–1474.
Shumway, R. H., D. E. Olsen, and L. J. Levy (1981). Estimation and tests of hypotheses for the initial mean and covariance in the Kalman filter model. Communications in Statistics—Theory and Methods, 10:1625–1641.
Smith, R. H. (1985). Maximum likelihood mean and covariance matrix estimation constrained to general positive semi-definiteness. Communications in Statistics—Theory and Methods, 14:2163–2180.
Spall, J. C., and D. C. Chin (1990). First-order data sensitivity measures with applications to a multivariate signal-plus-noise problem. Computational Statistics and Data Analysis, 9:297–307.
Spall, J. C., and J. L. Maryak (1992). A feasible Bayesian estimator of quantiles for projectile accuracy from non-i.i.d. data. Journal of the American Statistical Association, 87:676–681.
Spall, J. C. (1995). Uncertainty bounds for parameter identification with small sample sizes. Proceedings of the IEEE Conference on Decision and Control, pp. 3504–3515.
Sun, F. K. (1982). A maximum likelihood algorithm for the mean and covariance of nonidentically distributed observations. IEEE Transactions on Automatic Control, AC-27:245–247.
Walter, E., and L. Pronzato (1990). Qualitative and quantitative experiment design for phenomenological models—A survey. Automatica, 26:195–213.
Wilks, S. S. (1962). Mathematical Statistics. Wiley, New York.
Chapter 28

A TUTORIAL ON HIERARCHICAL LOSSLESS DATA COMPRESSION

John C. Kieffer
Department of Electrical and Computer Engineering
University of Minnesota
Minneapolis, MN 55455
Abstract

Hierarchical lossless data compression is a compression technique that has been shown to effectively compress data in the face of uncertainty concerning a proper probabilistic model for the data. In this technique, one represents a data sequence x using one of three kinds of structures: (1) a tree called a pointer tree, which generates x via a procedure called “subtree copying”; (2) a data flow graph, which generates x via a flow of data sequences along its edges; or (3) a context-free grammar, which generates x via parallel substitutions accomplished with the production rules of the grammar. The data sequence is then compressed indirectly via compression of the structure which represents it. This article is a survey of recent advances in the rapidly growing field of hierarchical lossless data compression. In the article, we illustrate how the three distinct structures for representing a data sequence are equivalent, outline a simple method for designing compact structures for representing a data sequence, and indicate the level of compression performance that can be obtained by compression of the structure representing a data sequence.

1. INTRODUCTION
A modern day data communication system must be capable of transmitting data of all types, including textual data, speech/audio data, or image/video data. The block diagram which follows depicts a data communication system, consisting of encoder, channel, and decoder:
The data sequence that is to be transmitted through the communication channel will need to have allocated to it a certain portion of the available channel bandwidth. In order that the amount of channel bandwidth that is utilized be kept at a minimum, data compression must take place, in which the data sequence is encoded by the encoder into a binary codeword for transmission through the communication channel. The compression scheme in Fig. 1, consisting of encoder and decoder, is called a lossless scheme if the decoder always reconstructs the original data sequence, and is called a lossy scheme if reconstruction error is allowed. Hierarchical data compression is a rapidly growing subfield of information theory that started about 20 years ago. In hierarchical data compression, one employs structures such as trees, graphs, or grammars for data representation, thereby allowing progressive data reconstruction across scales from low resolution to high resolution. In order to better accommodate the demands of communication systems users, the next generation of data compression standards will be scalable in this sense, and this is the reason why hierarchical data compression is now coming into prominence. A hierarchical data compression scheme can either be a lossy scheme or a lossless scheme. Although the hierarchical lossy data compression schemes are not the subject of this paper, this paragraph gives some pointers to the literature on this subject. The most common hierarchical lossy compression schemes are based on wavelets (Strang and Nguyen, 1996; Ch. 11; Chui, 1992; Ch. 7). In a wavelet-based scheme, a signal is lossily compressed by exploiting its wavelet decomposition
where is an appropriately chosen wavelet function. Each wavelet coefficient provides a different characteristic of the signal as the scaling parameter and the spatial parameter are varied. Compression of the signal results from expanding in bits each coefficient whose magnitude exceeds a given threshold. Several important hierarchical data compression schemes based upon the wavelet decomposition have been developed, including the Burt-Adelson pyramid algorithm (Burt and Adelson, 1983), the Shapiro zerotree algorithm (Shapiro, 1993), and the Said-Pearlman set partitioning algorithm (Said and Pearlman, 1996). The second most common hierarchical lossy compression schemes are the fractal-based schemes—we refer the reader to the references Barnsley and Hurd (1993) and Fisher (1995) for a discussion of this topic. The present paper is devoted to coverage of hierarchical lossless data compression schemes, since until recently, knowledge concerning how to design efficient hierarchical lossless compression schemes lagged behind knowledge
concerning how to design efficient hierarchical lossy compression schemes. The beginnings of the subject of hierarchical lossless data compression can be traced to the papers Kawaguchi and Endo (1980); Cameron (1988); and Kourapova and Ryabko (1995). In these papers, a context-free grammar is used to model some (but not all) of the dependencies in the data, and then the data is compressed by making use of the rules of the grammar. In recent years, a new idea has been exploited, namely, the idea of constructing a grammar from the data which models all of the dependencies in the data—since the data is reconstructible from the grammar itself, the problem of compression of the data reduces to the problem of compression of the grammar representing the data. In recent years, this idea has spawned many advances in the theory and design of hierarchical lossless data compression schemes, which shall be surveyed in this paper. Fig. 2 gives the encoding part of a hierarchical lossless data compression scheme, and Fig. 3 gives the decoding part. In the encoding part, a transform is applied to a given data sequence that is to be compressed, transforming into a data structure that is suitable for representing The data structure is typically one of three kinds of structures: (i) a pointer tree (Sec. 1.1); (ii) a data flow graph (Sec. 1.2); or (iii) a context-free grammar (Sec. 1.3). The encoder in Fig. 2 operates indirectly, compressing the data structure representing the given data sequence rather than directly applying itself to The decoding part unravels the operations done in the encoding part—the decoder and encoder are inverse mappings, and the inverse transform and transform are inverse mappings.
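To make the grammar idea concrete, here is a toy transform in the spirit of digram-replacement (Re-Pair-style) schemes: greedily replace the most frequent adjacent pair of symbols by a new nonterminal, together with the expansion that recovers the sequence, mirroring the transform/inverse-transform pairing of Figs. 2 and 3. This is an illustrative sketch, not any specific algorithm surveyed in this chapter.

```python
from collections import Counter

def build_grammar(seq, min_count=2):
    """Greedy grammar construction: repeatedly replace the most frequent
    adjacent pair of symbols with a fresh nonterminal.  Returns the start
    rule (a symbol list) and a dict mapping nonterminals to their pairs."""
    rules, next_id, seq = {}, 0, list(seq)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or pairs.most_common(1)[0][1] < min_count:
            return seq, rules
        pair = pairs.most_common(1)[0][0]
        nt = ('N', next_id)
        next_id += 1
        rules[nt] = list(pair)
        out, i = [], 0
        while i < len(seq):                       # left-to-right replacement pass
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out

def expand(symbols, rules):
    """Inverse transform: expand nonterminals until only terminals remain."""
    out = []
    for s in symbols:
        out.extend(expand(rules[s], rules) if s in rules else [s])
    return out

start, rules = build_grammar("abababab")
assert ''.join(expand(start, rules)) == "abababab"
```

Since the sequence is reconstructible from (start, rules) alone, compressing the grammar compresses the sequence, which is exactly the indirection described above.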
This paper is organized as follows. In the rest of this introductory section, we present examples illustrating the pointer tree, data flow graph, and context-free grammar structures for data sequence representation. In Section 2, equivalences between the tree, graph, and grammar structures are indicated. A simple algorithm for the design of a good structure to use in hierarchical lossless compression is presented in Section 3. Section 4 is concerned with the encoder operation in hierarchical lossless compression schemes. Finally,
MODELING UNCERTAINTY
Section 5 presents some results concerning the compression performance afforded by hierarchical lossless compression schemes. In particular, we will see that hierarchical lossless compression schemes represent a good choice for the communication system designer who is faced with uncertainty concerning the “true” probability distribution which models the data that is encountered in the communication system—such compression schemes can perform well no matter what the probability model for the data may be.
Terminology and Notation. Throughout this paper, we assume some finite data alphabet to be fixed in the background. When we refer to a sequence as a data sequence, we simply mean that the sequence consists of finitely many entries from the fixed finite alphabet. If x is a sequence whose entries are x_1, x_2, . . . , x_n, respectively, then we write the sequence as x_1 x_2 · · · x_n (i.e., without commas separating the entries). If u_1, u_2, . . . , u_k are sequences, then u_1 u_2 · · · u_k denotes the sequence obtained by left-to-right concatenation of these sequences (see Example 3 for an illustration of the concatenation procedure). In this tutorial, when we refer to a tree, we will always mean a finite ordered rooted tree in the sense of Knuth (1973). Such a tree has a unique designated vertex which is the root vertex of the tree, and each of the finitely many nonroot vertices is reachable via a unique path starting from the root. The vertices which have children are called nonterminal vertices. We shall assume that each nonterminal vertex has at least two children, which have a designated ordering. The vertices of a tree which have no children are called leaf vertices. We will represent a tree pictorially, adopting the convention that the root vertex will be placed at the top of the picture, and that the tree’s edges extend downward, with the children of each nonterminal vertex appearing in the picture from left to right on the same level, corresponding to the designated ordering of these children.
There are two important orderings of the vertices of a tree that we shall make use of that are compatible with the separate orderings of the children of each nonterminal vertex: (1) the breadth-first ordering, and (2) the depth-first ordering. Example 1: Fig. 4 illustrates the depth-first and breadth-first orderings of the vertices of a tree.
1.1. POINTER TREE REPRESENTATIONS
We can represent a data sequence x using a tree called a pointer tree, which satisfies the following properties:
- Every leaf vertex of the tree is labelled by either a symbol from the data alphabet or by a pointer label “pointing to” some nonterminal vertex of the tree. (A vertex containing a pointer label shall be called a pointer vertex.)
- There exists at least one leaf vertex of the tree which is labelled by a pointer label.
- The data sequence x represented by the pointer tree can be recovered from the pointer tree via “subtree copying.”
We explain the “subtree copying” procedure. Define a data tree to be a tree in which every leaf vertex of the tree is labelled by a symbol from the data alphabet. Suppose T is a pointer tree, and that v is a leaf vertex of T pointing to a nonterminal vertex w of T. Suppose the subtree of T rooted at w is a data tree. Let T′ be the tree obtained by appending to v a copy of the subtree of T rooted at w. We say that the tree T′ is obtained from the tree T by one round of subtree copying. The tree T′ is either a pointer tree possessing one less pointer vertex than T, or it is a data tree. Suppose we start with a pointer tree T_0 having exactly k pointer vertices, and are able to construct trees T_1, T_2, . . . , T_k such that T_i is obtained from T_{i−1} via one round of subtree copying, i = 1, 2, . . . , k. Then, T_k must be a data tree. It can be shown that if T_0, T′_1, . . . , T′_k is some other sequence of trees in which T′_i is obtained from T′_{i−1} via subtree copying, then T′_k = T_k. Therefore, we may characterize T_k as the unique data tree obtainable via finitely many rounds of subtree copying, starting from the pointer tree T_0. We call T_k the data tree induced by the pointer tree T_0. Order the leaf vertices of T_k in depth-first order. If we write down the sequence of data labels that we see according to this ordering, we obtain a data sequence x, which we will term the data sequence represented by the pointer tree T_0.
Example 2: Fig. 5 illustrates a pointer tree. By convention, we take the pointer labels to mean that we are pointing to the nonterminal vertices labelled 1, 2, 3, respectively.
(The internal labels 1, 2, 3 are really not needed; we shall see how to get along without these labels in Section 3.) The reader can easily verify that the tree in Fig. 6 is obtainable from the Fig. 5 tree via four rounds of subtree copying. Therefore, this tree is the data tree induced by the Fig. 5 tree. (The reader can also verify that the Fig. 6 tree is obtained regardless of the order in which the four rounds of subtree copying are done; several different orderings of the rounds of subtree copying are valid.) The data labels on the data tree in Fig. 6, read in depth-first order, give us the data sequence represented by the pointer tree in Fig. 5.
In Fig. 6, we have kept the internal labels 1,2,3 that were in the Fig. 5 tree, in order to illustrate an important principle. Notice that the subtrees of the Fig. 6 data tree rooted at vertices 1,2,3 are distinct. Our important principle is the following: One should strive to find a pointer tree representing a given data sequence so that any two leaf vertices of the pointer tree which point to distinct vertices should have distinct data trees appended to them in the rounds of subtree copying. If a given pointer tree does not obey this property, then it can be reduced to a simpler pointer tree.
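The subtree-copying recovery of Section 1.1 is easy to mechanize. Below is a minimal Python sketch; the tree encoding (nested lists for nonterminal vertices, strings for data leaves, and a Ptr marker for pointer leaves, with pointer targets given as depth-first indices of nonterminal vertices, root = 0) is our own illustrative convention, not notation from the paper, and the example tree is hypothetical rather than the Fig. 5 tree.

```python
# A tree is encoded as: a string (data leaf), a Ptr (pointer leaf), or a
# list of children (nonterminal vertex). Nonterminal vertices are numbered
# in depth-first order; Ptr(k) points at the k-th nonterminal vertex.

class Ptr:
    def __init__(self, target):
        self.target = target  # depth-first index of the nonterminal pointed to

def nonterminals(tree, acc=None):
    """List the nonterminal vertices (subtrees) of the tree in depth-first order."""
    if acc is None:
        acc = []
    if isinstance(tree, list):
        acc.append(tree)
        for child in tree:
            nonterminals(child, acc)
    return acc

def induced_data_tree(tree):
    """Perform rounds of subtree copying until no pointer vertices remain."""
    targets = nonterminals(tree)
    def expand(t):
        if isinstance(t, Ptr):
            return expand(targets[t.target])  # copy the pointed-to subtree
        if isinstance(t, list):
            return [expand(child) for child in t]
        return t                              # data leaf
    return expand(tree)

def represented_sequence(tree):
    """Read the data labels of the induced data tree in depth-first leaf order."""
    data_tree = induced_data_tree(tree)
    def leaves(t):
        if isinstance(t, list):
            return "".join(leaves(c) for c in t)
        return t
    return leaves(data_tree)

# Hypothetical pointer tree: the second leaf points back at nonterminal 1.
example = [["a", "b"], Ptr(1), "b"]
print(represented_sequence(example))  # -> "ababb"
```

One round of subtree copying corresponds to replacing one Ptr leaf by a copy of its target; the recursion above simply performs all the rounds at once, which is legitimate because the induced data tree is unique.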
1.2. DATA FLOW GRAPH REPRESENTATIONS
Let x be a data sequence. A data flow graph which represents x has the following properties:
(p.1) The data flow graph is a finite directed acyclic graph which contains one or more vertices (called input vertices) having outgoing edges but no incoming edges, a unique vertex (called the output vertex) having at least one incoming edge but no outgoing edge, and possibly one or more vertices which are neither input nor output vertices. There is at least one directed path from each input vertex to the output vertex.
(p.2) Each noninput vertex of the data flow graph possesses two or more incoming edges, and there is an implicit ordering of these incoming edges.
(p.3) Each input vertex V contains a label s(V) which is a symbol from the data alphabet.
(p.4) The input labels uniquely determine a data sequence label s(V) for each noninput vertex V of the graph. The labels on the vertices of the graph satisfy the equations
s(V) = s(V_1) s(V_2) · · · s(V_m),    (1.1)
where V varies through all noninput vertices and V_1, V_2, . . . , V_m are the vertices at the beginnings of the ordered edges ending at V. (The fact that the graph is acyclic guarantees the existence and uniqueness of the solutions to equations (1.1).)
(p.5) The data sequence represented by the data flow graph is the sequence of data symbols s(V) computed at the output vertex V.
Property (p.4) indicates why we call our graph a data flow graph. Visualize “data flow” through the graph in several cycles. In the first cycle, one computes the label s(V) at each vertex V whose incoming edges all start at input vertices. In each succeeding cycle, one computes the label s(V) at each vertex V whose incoming edges all start at vertices whose labels have been determined previously. In the final cycle, the data sequence represented by the data flow graph is computed at the output vertex. The following example illustrates the procedure.
Example 3: Let us compute the data sequence represented by the data flow graph in Fig. 7. We suppose that the direction of flow along edges is from left to right, and that incoming edges to a vertex are ordered from top to bottom. On the first cycle, we compute the labels of the vertices all of whose incoming edges start at input vertices. The second cycle yields the labels of the next layer of vertices, and as a result of the third cycle, we obtain the labels of the vertices feeding the output vertex. Computation of the label at the output vertex on the fourth and last cycle tells us what the represented data sequence is.
One property of a good data flow graph for representing a data sequence should be mentioned here: The sequences computed at the vertices of the data flow graph should be distinct. If this property fails, then two or more vertices of the data flow graph can be merged to yield a simpler data flow graph which also represents the given data sequence.
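The cycle-by-cycle computation of the labels s(V) can be sketched as follows. The dictionary encoding of the graph (each noninput vertex listing its ordered predecessors) is an assumption of ours, and the example graph is hypothetical rather than the one in Fig. 7.

```python
# A data flow graph: input vertices carry a one-symbol label; every noninput
# vertex lists its ordered predecessors. Acyclicity guarantees that repeated
# passes eventually determine every label, as in equations (1.1).

def evaluate_dfg(inputs, preds, output):
    """inputs: {vertex: symbol}; preds: {vertex: ordered predecessor list};
    output: name of the output vertex. Returns the represented data sequence."""
    labels = dict(inputs)
    while output not in labels:
        progressed = False
        for v, ps in preds.items():
            if v not in labels and all(p in labels for p in ps):
                # concatenate predecessor labels in their implicit order
                labels[v] = "".join(labels[p] for p in ps)
                progressed = True
        if not progressed:
            raise ValueError("graph is not acyclic or is disconnected")
    return labels[output]

# Hypothetical example: two input vertices feeding two intermediate vertices.
inputs = {"A": "a", "B": "b"}
preds = {"V1": ["A", "B"],         # V1 computes "ab" on the first cycle
         "V2": ["V1", "V1", "B"],  # V2 computes "ababb" on the second cycle
         "OUT": ["V2", "V1"]}      # the output vertex computes "ababbab"
print(evaluate_dfg(inputs, preds, "OUT"))  # -> "ababbab"
```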
1.3. CONTEXT-FREE GRAMMAR REPRESENTATIONS
A context-free grammar G for representing a data sequence can be specified in terms of its production rules. Each production rule of G takes the form
V → α_1 α_2 · · · α_m,    (1.2)
where V, the left member of the production rule (1.2), is a variable of the grammar G, and each α_i belonging to the right member of the production rule (1.2) is either a variable of G or a symbol from the data alphabet. The variables of G shall be denoted by capital letters S_1, S_2, S_3, . . .; the symbols in the data alphabet are distinct from the variables of G, and shall be denoted by lower case letters a, b, c, . . .. The variable S_1 is a special variable called the root variable of G; it is the unique variable which does not appear in the right members of the production rules of G. For each variable V of the grammar G, it is required that there be one and only one production rule (1.2) of G whose left member is V; such a grammar is said to be deterministic. With these assumptions,
one is assured that the language L(G) generated by G must satisfy one of the following two properties: L(G) consists of exactly one data sequence; or L(G) is empty. To see which of these two properties holds, one performs rounds of parallel substitutions using the production rules of G, starting with the root variable S_1. In each round of parallel substitutions, one starts with a certain sequence of variables and data symbols generated from the previous round; each variable in this sequence is replaced by the right member of the production rule whose left member is that variable—all of the substitutions are done simultaneously. There are only two possibilities:
Possibility 1: After finitely many rounds of parallel substitutions, one encounters a data sequence for the first time; or
Possibility 2: One never encounters a data sequence, no matter how many rounds of parallel substitutions are performed.
Let K be the number of variables of the grammar G. In Kieffer and Yang (2000), it is shown that if one does not encounter a data sequence after K rounds of parallel substitutions, then Possibility 2 must hold. This gives us an algorithm which runs in a finite amount of time to determine whether or not Possibility 1 holds. Suppose Possibility 1 holds. Let x be the data sequence generated by the grammar G after finitely many rounds of parallel substitutions. Then, L(G) = {x}. We call x the data sequence represented by G. We list the requirements that shall be placed on any grammar G that is used for representing a data sequence:
Requirement (i): The grammar G is a deterministic context-free grammar.
Requirement (ii): The production rules of G generate a data sequence after finitely many rounds of parallel substitutions.
Requirement (iii): The right member of each production rule of G should contain at least two entries.
Requirement (iv): Every production rule of G must be used at least once in the finitely many rounds of parallel substitutions that generate a data sequence.
Requirements (i) and (ii) have been mentioned previously. Requirements (iii) and (iv) are new, but are natural requirements to make, since the grammar
could be made simpler if (iii) or (iv) were not true. A grammar G satisfying requirements (i)-(iv) shall be called an admissible grammar.
Example 4: Let G be the admissible grammar whose production rules are
S_1 → S_2 S_3 S_4 S_4 S_5 S_2
S_2 → S_6 S_6 b
S_3 → a S_6
S_4 → S_3 b a
S_5 → b b
S_6 → a S_5
Starting with the root variable S_1, the sequences that are obtained via rounds of parallel substitutions are:
S_1
S_2 S_3 S_4 S_4 S_5 S_2
S_6 S_6 b a S_6 S_3 b a S_3 b a b b S_6 S_6 b
a S_5 a S_5 b a a S_5 a S_6 b a a S_6 b a b b a S_5 a S_5 b
a b b a b b b a a b b a a S_5 b a a a S_5 b a b b a b b a b b b
a b b a b b b a a b b a a b b b a a a b b b a b b a b b a b b b    (1.3)
The data sequence x represented by G is thus
x = abbabbbaabbaabbbaaabbbabbabbabbb.    (1.4)
Traverse the entries of the rows of the display (1.3) in the top-down, left-to-right order; if you write down the first appearances of the variables you encounter in order of their appearance, you obtain the ordering S_1, S_2, S_3, S_4, S_5, S_6, which is in accordance with the numbering of the variables of G. We can always number the variables of an admissible grammar so that this property will be true, and shall always do so in the future. The rounds of parallel substitutions that we went through to obtain the sequence (1.4) are easily accomplished via the four line Mathematica program
S = {1};
P = {{2,3,4,4,5,2}, {6,6,b}, {a,6}, {3,b,a}, {b,b}, {a,5}};
Do[S = Flatten[S /. Table[i -> P[[i]], {i, Length[P]}]], {i, Length[P]}];
S
(in which the variable S_i is represented by the integer i),
which the reader is encouraged to try. You can run this program given any admissible grammar to find the data sequence represented by the grammar. All that needs to be changed each time is the second line of the program, which gives the right members of the production rules of the grammar. Notice that the grammar G in Example 4 obeys the following two properties: Property 1: Every variable of G except the root variable appears at least twice in the right members of the production rules of G. Property 2: If you slide a window of width two along the right members of the production rules of G, you will never encounter two disjoint windows containing the same sequence of length two. An admissible grammar satisfying Property 1 and Property 2 is said to be irreducible. There is a growing body of literature on the design of hierarchical lossless compression schemes employing irreducible grammars (Nevill-Manning and Witten, 1997a-b; Kieffer and Yang, 2000; Yang and Kieffer, 2000).
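For readers without Mathematica, here is a Python sketch of the same expansion. The grammar is taken directly from the second line of the program above (variable S_i is represented by the integer i, terminals by the strings "a" and "b"), and the K-round termination test of Kieffer and Yang (2000) is built in.

```python
# Right members of the production rules of the Example 4 grammar, exactly as
# in the Mathematica program: entry i-1 is the right member of variable i.
P = [[2, 3, 4, 4, 5, 2],
     [6, 6, "b"],
     ["a", 6],
     [3, "b", "a"],
     ["b", "b"],
     ["a", 5]]

def expand(rules):
    """Perform rounds of parallel substitutions starting from the root
    variable 1. If no data sequence appears after len(rules) rounds,
    Possibility 2 holds and L(G) is empty."""
    s = [1]
    for _ in range(len(rules)):
        if all(isinstance(sym, str) for sym in s):
            break
        # replace every variable in s simultaneously by its right member
        s = [t for sym in s
               for t in (rules[sym - 1] if isinstance(sym, int) else [sym])]
    if not all(isinstance(sym, str) for sym in s):
        return None  # Possibility 2: the grammar generates no data sequence
    return "".join(s)

print(expand(P))  # -> "abbabbbaabbaabbbaaabbbabbabbabbb"
```

Changing P to the right members of any admissible grammar yields the data sequence that grammar represents, exactly as with the Mathematica program.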
2. EQUIVALENCES BETWEEN STRUCTURES
In the previous section, we discussed pointer trees, data flow graphs, and admissible grammars as three different types of structures for representing a data sequence. These structures are equivalent in the sense that if you have a structure of one of the three types representing a data sequence x, then a structure of either one of the other two types representing x can easily be constructed. It is the purpose of this section to illustrate this phenomenon.
2.1. EQUIVALENCE OF POINTER TREES AND ADMISSIBLE GRAMMARS
Suppose we are given a pointer tree representing a data sequence x. Here is a four-step procedure for using the pointer tree to find an admissible grammar G representing x.
Step 1: Let v_1, v_2, . . . , v_K be the breadth-first ordering of the nonterminal vertices of the pointer tree. Label these vertices with the labels S_1, S_2, . . . , S_K, respectively.
Step 2: Each leaf vertex of the pointer tree contains either a pointer label or a data label. For each leaf vertex V containing a pointer label, assign to V a new label consisting of the label S_i of the nonterminal vertex to which V points. For each leaf vertex containing a data label, keep this label unchanged.
Step 3: For each i = 1, 2, . . . , K, form the production rule
S_i → β_1 β_2 · · · β_m,    (2.5)
where β_1, β_2, . . . , β_m are the respective labels of the children of vertex v_i (children ordered from left to right).
Step 4: The grammar G consists of the production rules (2.5), for i = 1, 2, . . . , K.
Example 5: Referring to the pointer tree in Fig. 5, we see that the corresponding grammar has the production rules (2.6).
This is clear if we relabel the Fig. 5 tree according to Fig. 8.
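The four-step conversion from a pointer tree to a grammar is easy to mechanize. In the Python sketch below, the tree encoding (nested lists for nonterminal vertices, strings for data leaves, and a ('ptr', j) tuple for a leaf pointing at the j-th nonterminal vertex in breadth-first order) is our own convention, and the example tree is hypothetical rather than the Fig. 5 tree.

```python
from collections import deque

def tree_to_grammar(tree):
    """Convert a pointer tree into production rules: rule i lists the child
    labels of the i-th nonterminal vertex in breadth-first order (variables
    are the integers 1, 2, ..., K; data labels are strings)."""
    # Step 1: breadth-first ordering of the nonterminal vertices.
    order, queue = [], deque([tree])
    while queue:
        v = queue.popleft()
        if isinstance(v, list):
            order.append(v)
            queue.extend(v)
    index = {id(v): i + 1 for i, v in enumerate(order)}
    # Steps 2-4: one production rule per nonterminal vertex.
    rules = {}
    for v in order:
        right = []
        for child in v:
            if isinstance(child, list):
                right.append(index[id(child)])  # child is a nonterminal vertex
            elif isinstance(child, tuple):
                right.append(child[1])          # pointer leaf becomes a variable
            else:
                right.append(child)             # data leaf keeps its data label
        rules[index[id(v)]] = right
    return rules

# Hypothetical pointer tree: the breadth-first nonterminals are the root (1)
# and its first child (2); the second leaf of the root points back at vertex 2.
tree = [["a", "b"], ("ptr", 2), "b"]
print(tree_to_grammar(tree))  # -> {1: [2, 2, 'b'], 2: ['a', 'b']}
```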
In order to go from an admissible grammar to a pointer tree, one can reverse the four-step procedure given at the beginning of this subsection. The reverse procedure can give rise to more than one possible pointer tree for the same data sequence—just choose one of them. The following example illustrates the technique. Example 6: Suppose we start with the grammar (2.6). From the 9 production rules listed in (2.6), one forms the following 9 trees, each of depth one:
Any tree which can be built up from these 9 trees is a pointer tree for the same data sequence represented by the grammar (2.6). (Start the tree building process by joining two of the trees in the array (2.7) to form a single tree—joining of two trees is accomplished by merging the root vertex of one tree with a leaf vertex of the other tree, where these two vertices have the same label S_i. This gives an array of 8 trees; join two of the trees in this array. Repeated joinings, 8 of them in all, gradually reduce the original array (2.7) to a single tree, the desired pointer tree.) One of the pointer trees constructible by this method is given in Fig. 8; another one is given in Fig. 9.
2.2. EQUIVALENCE OF ADMISSIBLE GRAMMARS AND DATA FLOW GRAPHS
Let G be an admissible grammar representing the data sequence x. We describe how to obtain a data flow graph DFG(G) representing x. Let K be the number of variables of G; then these variables are S_1, S_2, . . . , S_K. Let J be the number of distinct symbols from the data alphabet which appear in the right members of the production rules of G. Let E be the sum of the lengths of the right members of the production rules of G. Grow a directed acyclic graph with K + J vertices and E edges as follows:
Step 1: Draw K + J vertices on a piece of paper. Label K of them with the labels S_1, S_2, . . . , S_K, respectively. Label the remaining J of them with the J distinct data symbols appearing in the right members of the production rules.
Step 2: For each i = 1, 2, . . . , K, let
S_i → α_1 α_2 · · · α_m
be the production rule of G whose left member is S_i, where α_1, α_2, . . . , α_m are each either symbols from {S_1, S_2, . . . , S_K} or symbols from the data alphabet. Draw m outgoing edges from vertex S_i and label these edges 1, 2, . . . , m, respectively. Make these edges terminate at the vertices with labels α_1, α_2, . . . , α_m, respectively.
Step 3: For each edge that you have drawn on your piece of paper, reverse the direction of the edge. For each vertex with a label S_i, remove the label. What you now see on your piece of paper is the data flow graph DFG(G). (For each vertex of DFG(G) which is not an input vertex, the incoming edges to that vertex have the implicit ordering in accordance with the edge labelling performed in Step 2.)
It is not hard to see that the above procedure is invertible—from a given data flow graph representing a data sequence x, one can obtain an admissible grammar G representing x such that the given data flow graph is DFG(G). Example 7: Let G be the grammar with the production rules (2.8).
Apply the above three-step procedure to G in order to obtain DFG(G). Redraw the pictorial representation of DFG(G) so that all edges go from left to right, so that the input vertices are on the left and the output vertex is on the right, and so that the incoming edges to each noninput vertex go from top to bottom according to the implicit ordering of these edges. This gives us the data flow graph in Fig. 7 (without the vertex labels). Conversely, given the data flow graph in Fig. 7, it is a simple exercise to construct the grammar (2.8).
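A Python sketch of the three-step construction of DFG(G); the edge-list representation is our own convention. After the reversal in Step 3, the ordered outgoing edges of the vertex for variable S_i become the ordered incoming edges of that vertex, which is exactly the predecessor ordering one needs when evaluating the graph as in Example 3.

```python
def grammar_to_dfg(rules):
    """rules: {variable: right member list} with variables as ints and data
    symbols as strings. Returns (inputs, preds): the input-vertex labels and,
    for each noninput vertex, its ordered list of predecessor vertices."""
    # Step 1: one vertex per variable, one per distinct data symbol.
    symbols = {s for right in rules.values() for s in right if isinstance(s, str)}
    inputs = {("sym", s): s for s in symbols}
    # Steps 2-3: the ordered edges drawn out of variable i, once reversed,
    # become the ordered incoming edges of the vertex for variable i.
    preds = {}
    for var, right in rules.items():
        preds[("var", var)] = [("sym", s) if isinstance(s, str) else ("var", s)
                               for s in right]
    return inputs, preds  # ("var", 1), the root variable, is the output vertex

# Grammar of Example 4, written as in the Mathematica program's second line:
rules = {1: [2, 3, 4, 4, 5, 2], 2: [6, 6, "b"], 3: ["a", 6],
         4: [3, "b", "a"], 5: ["b", "b"], 6: ["a", 5]}
inputs, preds = grammar_to_dfg(rules)
print(len(inputs), len(preds))  # 2 input vertices (a, b), 6 noninput vertices
```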
3. DESIGN OF COMPACT STRUCTURES
We have seen that we can equivalently represent a data sequence using pointer trees, data flow graphs, or admissible grammars. The structure used to represent a given data sequence (whether it be a tree, graph, or grammar) should be “compact,” in order for us to be able to gain an advantage in compressing the data sequence by compressing the tree, graph, or grammar which represents it. Here are design principles which make clear what we mean by a “compact” structure:
If the representing structure is to be a pointer tree or a data flow graph, design the structure to be compact in the sense that the number of edges is small relative to the length of the data sequence that is represented.
If the representing structure is to be a context-free grammar, design the grammar to be compact in the sense that the total length of the right members of its production rules is small relative to the length of the data sequence that is represented.
Various design methods for finding a compact structure to represent a given data sequence are addressed in the papers Nevill-Manning and Witten (1997a-b); Kieffer et al. (1998); Kieffer and Yang (2000); Yang and Kieffer (2000). We shall not discuss all of these design methods here. Instead, we discuss one particular design method which is both simple and useful. We explain how to obtain a good pointer tree for representing each data sequence of a fixed length n. Choose a binary tree T_n with n leaf vertices as follows:
If n is a power of two, take T_n to be the full binary tree of depth log_2 n. (T_n consists of one root vertex, two vertices of depth one, four vertices of depth two, etc., until we have n vertices of depth log_2 n.)
If n is not a power of two, first grow the full binary tree of depth ⌊log_2 n⌋. This tree has 2^⌊log_2 n⌋ leaf vertices. Choose n − 2^⌊log_2 n⌋ of the leaf vertices, and grow two edges from each of them. The resulting binary tree, which has n leaf vertices, is taken to be T_n.
Now let x = x_1 x_2 · · · x_n be any data sequence of length n that we wish to represent by a pointer tree. Let w_1, w_2, . . . , w_n be the depth-first ordering of the leaf vertices of T_n. Label each w_i with x_i, and let T_n(x) denote this labelled tree. For each nonterminal vertex V of T_n(x), let T(V) be the subtree of T_n(x) rooted at V. Each tree T(V) is drawn pictorially, with each nonterminal vertex carrying no label and each leaf vertex carrying a data label. Two T(V)’s are regarded to be the same if they look the same pictorially. To obtain a pointer tree from T_n(x) that represents x, we have to prune certain vertices from T_n(x). The following algorithm determines those vertices of T_n(x)
that have to be pruned.
PRUNING ALGORITHM
Step 1: List the nonterminal vertices of T_n(x) in depth-first order. Let this list be
u_1, u_2, . . . , u_r.    (3.9)
Step 2: Traverse the list (3.9) from left to right, underlining each u_i for which the subtree T(u_i) has not been seen previously. Let U be the set of underlined vertices in the list (3.9), and let W be the set consisting of each nonunderlined vertex in the list (3.9) whose father in T_n(x) belongs to U.
Step 3: Prune from the tree T_n(x) all vertices which are successors of the vertices in W. (If this step has been done correctly, then the set of nonterminal vertices of the pruned tree will be U, and the set of leaves of the pruned tree which are not leaves of T_n(x) will be W.)
Step 4: Attach a pointer label to each vertex V in W which points to the unique vertex Ṽ in U for which T(V) and T(Ṽ) are the same. The pruned tree, with the pointer labels attached, is a pointer tree representing the data sequence x.
Example 8: We find a pointer tree representation for a binary data sequence x of length 12. Forming the tree T_12 and then the labelled tree T_12(x), we obtain the tree in Fig. 10. Notice that we have enumerated the nonterminal vertices of T_12(x) in Fig. 10 in depth-first order. Executing Steps 1 and 2, we see that U = {1, 2, 3, 4, 6, 7} and W = {5, 8, 9}.
The vertices of T_12(x) which are successors of the vertices in W are now pruned from T_12(x) to obtain the pointer tree in Fig. 11. Pointer labels must be placed at the vertices in W indicating the vertices in U that they point to, and this is done as follows:
Since vertex 5 must point to vertex 4 of T_12(x), no pointer label is assigned to vertex 5 in Fig. 11.
Vertex 8 can point only to vertex 4 or vertex 7 of T_12(x). Since it actually points to the first of these, vertex 8 is assigned pointer label “1” in Fig. 11.
Vertex 9 can only point to vertex 3 or vertex 6 of the Fig. 10 tree. Since it actually points to the second of these, pointer label “2” is assigned to vertex 9 in Fig. 11.
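For data sequences whose length is a power of two, Steps 1 and 2 of the PRUNING ALGORITHM can be sketched compactly in Python. The tuple encoding of trees and the returned (U, W) summary, where W maps each pointer vertex to the vertex it points to, are our own conventions.

```python
def balanced_tree(x):
    """Build T_n(x) for n a power of two: a leaf is its data symbol, and a
    nonterminal vertex is the pair of its two child subtrees."""
    if len(x) == 1:
        return x
    m = len(x) // 2
    return (balanced_tree(x[:m]), balanced_tree(x[m:]))

def prune(x):
    """Steps 1-2 of the PRUNING ALGORITHM: list nonterminal vertices in
    depth-first order, underline first occurrences of each subtree shape,
    and return U (underlined vertices) together with W, mapping each
    nonunderlined vertex whose father is in U to the vertex it points to."""
    tree = balanced_tree(x)
    order, parent = [], {}           # depth-first order; father of each vertex
    def dfs(t, par):
        if isinstance(t, tuple):
            order.append(t)
            i = len(order)           # 1-based depth-first index of this vertex
            parent[i] = par
            dfs(t[0], i)
            dfs(t[1], i)
    dfs(tree, None)
    first, U = {}, set()
    for i, t in enumerate(order, start=1):
        if t not in first:           # underline the first occurrence of T(u_i)
            first[t] = i
            U.add(i)
    W = {i: first[t] for i, t in enumerate(order, start=1)
         if i not in U and parent[i] in U}
    return sorted(U), W

U, W = prune("abababab")
print(U, W)  # -> [1, 2, 3] {4: 3, 5: 2}
```

For x = abababab, the subtree rooted at vertex 4 duplicates vertex 3, and the one rooted at vertex 5 duplicates vertex 2, so the pruned tree keeps nonterminals 1, 2, 3 and turns 4 and 5 into pointer vertices.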
4. ENCODING METHODOLOGY
Let us now refer back to the two parts of a hierarchical lossless compression scheme given in Figs. 2 and 3. From the preceding sections, the reader understands the nature of a transform that can be used in Fig. 2 and the corresponding inverse transform in Fig. 3. To be precise, we have learned three distinct options for (transform, inverse transform) in dealing with a data sequence x:
Option 1: Supposing n to be the length of x, one could transform x into the pointer tree which represents it, via the PRUNING ALGORITHM of Section 3. This pointer tree can then be the “data structure” in Fig. 2. The inverse transform in Fig. 3 then employs the subtree copying method of Section 1.1 to obtain x from the pointer tree.
Option 2: One could take the “data structure” in Fig. 2 to be the admissible grammar G associated with the pointer tree, as in Section 2.1. The inverse transform in Fig. 3 then employs several rounds of parallel substitutions to obtain x from G, as in Example 4.
Option 3: The “data structure” in Fig. 2 could be the data flow graph DFG(G) formed from the grammar G of Option 2, as described in Section 2.2. Example 3 illustrates the inverse transform method, via which x is computed from DFG(G) by a flow of data sequences along the edges of DFG(G).
We have not yet discussed how the encoder compresses the data structure in Fig. 2 that is presented to it; this section addresses this question. Because of the equivalences between the tree, graph, and grammar structures discussed in Section 2, it suffices to explain the encoder’s methodology for the pointer tree data structure in Fig. 2 only. Thus, we assume that the data structure in Fig. 2 that is to be assigned a binary codeword by the encoder is the pointer tree introduced in Section 3, where n is the length of the data sequence x (assumed to satisfy n ≥ 2). The binary codeword generated by the encoder will consist of the concatenation of the three binary sequences σ_1, σ_2, σ_3 discussed below.
The sequence σ_1. The purpose of this sequence is to let the decoder know what n is. There is a simple way to do this (see Example 9) in which σ_1 consists of roughly 2 log_2 n binary symbols.
The sequence σ_2. The purpose of this sequence is to let the decoder know the structure of the unlabelled pointer tree, i.e., the pointer tree without the pointer labels and without the data labels. If a vertex of the pointer tree is a nonterminal vertex, there will be a corresponding entry of σ_2 equal to 1; if the vertex is a leaf, there will be a corresponding entry of σ_2 equal to 0 (see Example 9).
The sequence σ_3. The purpose of this sequence is to let the decoder know each data label and each pointer label that has to be entered into the unlabelled pointer tree to obtain the pointer tree.
Example 9: As in Example 8, we consider the data sequence x of length n = 12. Referring to the tree in Fig. 10 and the pointer tree in Fig. 11, we can see what the sequences σ_1, σ_2, σ_3 generated by the encoder have to be.
In general, the length n can be processed to form σ_1 in two steps:
Step 1: Expand the integer n into its binary representation b_1 b_2 · · · b_k, consisting of k binary symbols, with b_1 = 1 the most significant bit.
Step 2: Generate σ_1 = b_2 c_2 b_3 c_3 · · · b_k c_k, where the flag digit c_i is 0 for i < k and c_k = 1. (That is, omit the leading digit b_1, which is always 1, and follow every remaining digit of the binary representation by a flag digit telling the decoder whether it is the last.)
In this particular case, Step 1 gives us the binary expansion of the integer 12, which is 1100, and then Step 2 gives us σ_1 = 100001.
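A Python sketch of this length code follows; the exact flag convention (omitting the leading 1 and pairing each remaining digit with a last-digit flag) is an assumption on our part, chosen to be consistent with the six-bit worked example.

```python
def encode_length(n):
    """Encode a length n >= 2: drop the leading 1 of the binary expansion of
    n, then follow each remaining bit with a flag (0 = more bits, 1 = last)."""
    bits = bin(n)[3:]                # binary expansion without the leading '1'
    return "".join(b + ("1" if i == len(bits) - 1 else "0")
                   for i, b in enumerate(bits))

def decode_length(s):
    """Invert encode_length; also return the number of bits consumed, so the
    decoder knows where the next part of the codeword begins."""
    bits, i = "1", 0
    while True:
        bits += s[i]
        if s[i + 1] == "1":          # the flag says this was the last bit
            return int(bits, 2), i + 2
        i += 2

print(encode_length(12))                  # -> "100001", six bits
print(decode_length("100001" + "01010"))  # -> (12, 6)
```

Because the flag bits make the code self-delimiting, the decoder can recover n and then continue reading the rest of the codeword from the returned offset.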
Let us now determine how we can form σ_2 in order to convey to the decoder the tree in Fig. 11 without the data and pointer labels. The first part of the binary codeword received by the decoder (the σ_1 part) tells the decoder that n = 12, whereupon the decoder knows that the nonterminal vertices of T_12(x) are the vertices 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 in Fig. 10. Of these, the decoder knows that vertices 1, 2, 3, 4 will automatically be nonterminal vertices in Fig. 11. If the remaining vertices are processed by encoder and decoder in the breadth-first order 9, 6, 10, 11, 5, 7, 8, then the decoder learns the structure of the pruned tree with the transmission of 5 bits. Specifically, a 0 is transmitted for vertex 9 (indicating that this vertex is a leaf vertex in Fig. 11), vertices 10, 11 are deleted from the list (since these vertices cannot belong to the pruned tree), and then a bit is sent for each of vertices 6, 5, 7, 8 to indicate which are leaf vertices and which are nonterminal vertices in Fig. 11. This gives us σ_2 = 01010.
Finally, we discuss how the encoder constructs the sequence σ_3. The entries of σ_3 tell the decoder what the data labels and the pointer labels are in Fig. 11. The four data labels are encoded as 0, 1, 1, 1, respectively. The pointer vertices are vertices 5, 8, 9 (see Fig. 10). The decoder already knows where vertex 5 points (as discussed in Example 8), so no pointer label needs to be encoded for vertex 5. The pointer labels 1, 2 on vertices 8, 9 can be very simply encoded as 0, 1, respectively. (For a very long data sequence, one would instead use arithmetic coding to encode the resulting large number of pointer labels—see Kieffer and Yang (2000) and Kieffer et al. (2000) for the arithmetic coding details which we have omitted here.) Concatenating the encoded data labels 0, 1, 1, 1 with the encoded pointer labels 0, 1, we see that σ_3 = 011101.
The total length of the binary codeword generated by the encoder is the sum of the lengths of σ_1, σ_2, and σ_3, which is 6 + 5 + 6 = 17 bits. If we assume that the decoder knows the length of the data sequence to be 12, but does not know which binary data sequence of length 12 is being transmitted, then it is not necessary to form σ_1, and the encoder’s binary codeword consists of σ_2 and σ_3 only. In this case, the length of the binary codeword is 11 bits. We are achieving a modest level of data compression in this example, since transmitting x to the decoder without compression would take 12 > 11 bits; for a much longer data sequence, more compression than this could be achieved.
5. PERFORMANCE UNDER UNCERTAINTY
How well do hierarchical lossless compression schemes perform? This section is devoted to answering this question. First, let us focus on the size of
the pointer tree, as measured by the number of vertices it has. We let t(x) denote the number of vertices in the pointer tree representing the data sequence x. It has been shown (Liaw and Lin, 1992; Kieffer et al., 2000) that there is a positive constant C such that the following is true.
Theorem 1 For every n ≥ 2 and for every data sequence x of length n,
t(x) ≤ Cn / log_2 n.
Theorem 1 tells us that the ratio t(x)/n is small for large n. In other words, the size of the pointer tree is small relative to the length of the data sequence x, which is a compactness condition on the tree. From our discussion at the beginning of Section 3, this suggests to us that the use of the pointer tree in a hierarchical lossless compression scheme might lead to good compression performance. Specifically, we consider the hierarchical lossless compression scheme in which, for each n and each data sequence x of length n, the pointer tree representing x is compressed according to the procedure described in Section 4. In the ensuing development, we will see that this pointer tree based hierarchical lossless compression scheme performs extremely well.
Let A denote our fixed finite data alphabet. We consider how well our hierarchical lossless compression scheme can compress a random data sequence X^n = X_1 X_2 · · · X_n of length n generated according to the probability mass function p(x) = Pr[X^n = x]. Information theory tells us that for any lossless compression scheme, the expected length of the binary codeword into which X^n is encoded cannot be less than the entropy H(X^n) = −Σ_x p(x) log_2 p(x), and that the best lossless compression scheme for encoding X^n (the Huffman code (Cover and Thomas, 1991)) assigns a binary codeword of expected length no worse than H(X^n) + 1. Unfortunately, the Huffman code can be constructed only if the probability model p is known. However, what if we are uncertain about the probability model? Remarkably, for large n, our pointer tree based hierarchical lossless compression scheme provides us with near-optimum compression performance regardless of the probability model according to which the data is generated (provided that we assume stationarity of the model).
In other words, if faced with uncertainty about the true data model, one can employ a hierarchical lossless compression scheme, not based on any probability model, which performs as well as a special-purpose lossless compression scheme based upon the true data model, asymptotically as the length of the data sequence grows without bound. The following theorem, proved in Kieffer and Yang (2000) and Kieffer et al. (2000), makes this precise.
Theorem 2 Let X_1, X_2, X_3, . . . be a stationary random process in which each random variable X_i takes its values in the given finite data alphabet A. For each n, let X^n be the random data sequence X_1 X_2 · · · X_n. Then,
A Tutorial on Hierarchical Lossless Data Compression
731
the expected codeword length arising from the encoding of this sequence with our pointer tree based hierarchical lossless compression scheme satisfies the bound (5.10).
Discussion: We point out subcases of Theorem 2 that occur for special classes of stationary processes. First, suppose that the process is memoryless, meaning that the random variables are statistically independent, each having the same marginal probabilities. Then, because of the independence, the joint entropy in (5.10) can be expressed simply in terms of the per-symbol entropy. Also, it is known in the memoryless case (see Kieffer and Yang (2000); Kieffer et al. (2000)) that the remaining term in (5.10) can be expressed via the size of the pointer tree as given by Theorem 1. Therefore, for a memoryless process, equation (5.10) reduces to a simpler bound. Next, suppose that the process is a stationary first-order Markov process with given transition probabilities and marginal probabilities as in (5.11). Then, equation (5.10) reduces to a corresponding first-order bound.
These extensions can be taken as far as one wishes. In general, if the data is generated by a stationary Markov process of any finite order, one can define a nonnegative real constant (the so-called order entropy of the process) such that a bound of the form (5.12) holds.
There are other lossless compression schemes for which the conclusion of Theorem 2 holds for arbitrary stationary processes, and for which asymptotics of the form (5.12)
732
MODELING UNCERTAINTY
occur for arbitrary Markov processes of finite order. For example, the Lempel-Ziv compression scheme (Ziv and Lempel, 1978) is another such scheme. It is an open question whether the hierarchical lossless scheme presented in this paper or the Lempel-Ziv scheme gives the smaller constant in the remainder term in (5.12). However, hierarchical lossless compression schemes have some advantages over the Lempel-Ziv scheme. Two of these advantages are: (1) hierarchical schemes are easily scalable; (2) hierarchical schemes sometimes yield state-of-the-art compression performance in practical applications.
Acknowledgement. This work was supported by National Science Foundation Grants NCR-9627965 and CCR-9902081.
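Returning to the Lempel-Ziv comparison above: for concreteness, here is a minimal sketch of the incremental parsing at the heart of the 1978 Lempel-Ziv scheme. The function name and output format are mine; a real encoder would further serialize the phrase list into bits.

```python
def lz78_parse(s):
    """Return the LZ78 phrase list: each phrase is a pair
    (index of the longest previously seen prefix, new symbol)."""
    dictionary = {"": 0}   # phrase -> index; the empty phrase has index 0
    phrases = []
    w = ""                 # current partial match
    for ch in s:
        if w + ch in dictionary:
            w += ch        # extend the match with the next symbol
        else:
            phrases.append((dictionary[w], ch))
            dictionary[w + ch] = len(dictionary)
            w = ""
    if w:                  # flush a trailing phrase that matched an entry
        phrases.append((dictionary[w[:-1]], w[-1]))
    return phrases

print(lz78_parse("abababab"))
# → [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (0, 'b')]
```

The asymptotic optimality of the scheme rests on the fact that, for stationary sources, the number of distinct parsed phrases grows slowly relative to the sequence length, much as Theorem 1 bounds the size of the pointer tree.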
NOTES
1. The length of the codeword is at worst the number of vertices of the pointer tree, which is bounded by Theorem 1.
2. The smallest value of C for which Theorem 1 is true is not known.
REFERENCES
Barnsley, M. and L. Hurd. (1993). Fractal Image Compression. Wellesley, MA: AK Peters, Ltd.
Burt, P. and E. Adelson. (1983). “The Laplacian Pyramid as a Compact Image Code,” IEEE Trans. Commun., Vol. 31, pp. 532–540.
Cameron, R. (1988). “Source Encoding Using Syntactic Information Source Models,” IEEE Trans. Inform. Theory, Vol. 34, pp. 843–850.
Chui, C. (ed.) (1992). Wavelets: A Tutorial in Theory and Applications. New York: Academic Press.
Cover, T. and J. Thomas. (1991). Elements of Information Theory. New York: Wiley.
Fisher, Y. (ed.) (1995). Fractal Image Compression: Theory and Application. New York: Springer-Verlag.
Kawaguchi, E. and T. Endo. (1980). “On a Method of Binary-Picture Representation and its Application to Data Compression,” IEEE Trans. Pattern Anal. Machine Intell., Vol. 2, pp. 27–35.
Kieffer, J. and E.-H. Yang. (2000). “Grammar-Based Codes: A New Class of Universal Lossless Source Codes,” IEEE Trans. Inform. Theory, Vol. 46, pp. 737–754.
Kieffer, J., E.-H. Yang, G. Nelson, and P. Cosman. (2000). “Universal Lossless Compression Via Multilevel Pattern Matching,” IEEE Trans. Inform. Theory, Vol. 46, pp. 1227–1245.
Knuth, D. (1973). The Art of Computer Programming, Volume 1: Fundamental Algorithms. Reading, MA: Addison-Wesley.
Kourapova, E. and B. Ryabko. (1995). “Application of Formal Grammars for Encoding Information Sources,” Probl. Inform. Transm., Vol. 31, pp. 23–26.
Liaw, H.-T. and C.-S. Liu. (1992). “On the OBDD-Representation of General Boolean Functions,” IEEE Trans. Computers, Vol. 41, pp. 661–664.
Nevill-Manning, C. and I. Witten. (1997a). “Identifying Hierarchical Structure in Sequences: A Linear-Time Algorithm,” Jour. Artificial Intell. Res., Vol. 7, pp. 67–82.
Nevill-Manning, C. and I. Witten. (1997b). “Compression and Explanation Using Hierarchical Grammars,” Computer Journal, Vol. 40, pp. 103–116.
Said, A. and W. Pearlman. (1996). “A New Fast and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees,” IEEE Trans. Circuits Sys. Video Technol., Vol. 6, pp. 243–250.
Shapiro, J. (1993). “Embedded Image Coding Using Zerotrees of Wavelet Coefficients,” IEEE Trans. Signal Proc., Vol. 41, pp. 3445–3462.
Strang, G. and T. Nguyen. (1996). Wavelets and Filter Banks. Wellesley, MA: Wellesley-Cambridge Press.
Yang, E.-H. and J. Kieffer. (2000). “Efficient Universal Lossless Data Compression Algorithms Based on a Greedy Sequential Grammar Transform–Part One: Without Context Models,” IEEE Trans. Inform. Theory, Vol. 46, pp. 755–777.
Ziv, J. and A. Lempel. (1978). “Compression of Individual Sequences via Variable-Rate Coding,” IEEE Trans. Inform. Theory, Vol. 24, pp. 530–536.
Part VIII

Chapter 29

EUREKA! BELLMAN’S PRINCIPLE OF OPTIMALITY IS VALID!

Moshe Sniedovich
Department of Mathematics and Statistics
The University of Melbourne
Parkville VIC 3052, Australia
[email protected]
Abstract
Ever since Bellman formulated his Principle of Optimality in the early 1950s, the Principle has been the subject of considerable criticism. In fact, a number of dynamic programming (DP) scholars identified specific difficulties with the common interpretation of Bellman’s Principle and proposed constructive remedies. In the case of stochastic processes with a non-denumerable state space, the remedy requires the incorporation of the faithful "with probability one" clause. In this short article we are reminded that if one sticks to Bellman’s original version of the principle, then no such fix is necessary. We also reiterate the central role that Bellman’s favourite "final state condition" plays in the theory of DP in general and in the validity of the Principle of Optimality in particular.
Keywords:
dynamic programming, principle of optimality, final state condition, stochastic processes, non-denumerable state space.
1.
INTRODUCTION
All of us are familiar with Bellman’s Principle of Optimality (Bellman, 1957, p. 83) and the major role that it played in Bellman’s monumental work on DP. What is not so well known - yet very well documented - is that Bellman’s Principle of Optimality has been the subject of serious criticism, e.g. Denardo and Mitten (1967), Karp and Held (1967), Yakowitz (1969), Porteus (1975), Morin (1982), Sniedovich (1986, 1992). In fact, almost every aspect of the Principle - e.g. its exact meaning, its validity, its role in DP - is problematic in the sense that scholars have conflicting views on
the matter. For the purposes of this discussion it will suffice to provide two pairs of quotes. The first pair refers to the title "Principle":
... Equation (3.24) is a fundamental equation of Dynamic Programming. It expresses a fundamental principle, the principle of optimality (Bellman [B4], [B5]), which can also be expressed in the following way: … Kushner (1971, p. 87)
The term principle of optimality is, however, somewhat misleading; it suggests that this is a fundamental truth, not a consequence of more primitive things. Denardo (1982, p. 16)
The second pair refers to the validity of the Principle:
Roughly the principle of optimality says the following rather obvious fact … Bertsekas (1976, p. 48)
To see that Bellman’s original statement of the Principle of Optimality also does not hold, simply flip the graph around ... Morin (1982, p. 669)
To motivate the discussion that follows we consider two typical counter-examples to the validity of the principle. The first features a deterministic problem, the second a stochastic process. However, before we do this let us recall that Bellman’s phrasing of the principle is as follows (Bellman, 1957, p. 83):
PRINCIPLE OF OPTIMALITY: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Counter-Example 1: Consider the network depicted in Figure 29.1 and assume that the objective is to determine the shortest path from node 1 to node 5, where the length of a path is equal to the length of the longest arc on that path. By inspection, there
are two optimal paths, namely p=(1,2,3,5) and q=(1,2,4,5), both of length 3. Consider now the optimal path q and the state (node) resulting from the first decision, namely node 2. Clearly, the remaining decisions of q, namely the subpath (2,4,5), do not constitute an optimal policy with respect to node 2: this subpath is clearly longer than (2,3,5). Hence, the optimal path q does not obey the Principle of Optimality.
Counter-Example 2: Consider the following naive stochastic game: there are two stages (n=1,2), and at each stage there are two feasible decisions. The dynamics of the process are as follows: The process starts at stage 1 with a given state. Upon making the first decision, the process moves to the next stage, where we observe a new state. Then we make the second decision, and the process terminates. The second state is a continuous random variable on the set S=[0,1] whose conditional density function depends on both the first decision and the first state. The return generated by the game is equal to the sum of the two decisions. The objective is to determine a policy so as to maximize the expected value of the total return. Clearly, the best policy is always to select the larger feasible decision at each stage. This policy obeys the Principle of Optimality. But consider the policy that makes the same decisions except at a single realization of the second state, where it selects the smaller decision.
Because the second state is a continuous random variable on S=[0,1], the probability of that particular realization is equal to zero under this policy. Hence, this policy is also optimal. But this policy does not obey the Principle of Optimality: selecting the smaller decision when we are at stage 2 observing that state is not an optimal policy with regard to this realization of the second state. The traditional fixes for the apparent bugs in the conventional interpretation of the Principle are as follows:
1 To disarm counter-examples such as our Counter-Example 1, it is assumed that the objective function of the process satisfies certain separability and (strict) monotonicity conditions (e.g. Yakowitz (1969)).
2 To disarm counter-examples such as our Counter-Example 2, the clause "with probability one" is added (e.g. Yakowitz (1969), Porteus (1975)) to the premise of the Principle.
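Counter-Example 1 can be checked mechanically. Figure 29.1 is not reproduced here, so the arc lengths below are assumptions chosen to be consistent with the text: both p and q attain minimax length 3, yet the tail of q is strictly worse than the best continuation from node 2.

```python
# Hypothetical arc lengths consistent with the counter-example in the text
# (Figure 29.1 is not reproduced, so these values are assumed).
arc = {(1, 2): 3, (2, 3): 1, (3, 5): 1, (2, 4): 2, (4, 5): 2}

def minimax_length(path):
    """Path 'length' = length of the longest arc on the path."""
    return max(arc[(u, v)] for u, v in zip(path, path[1:]))

p = (1, 2, 3, 5)
q = (1, 2, 4, 5)
assert minimax_length(p) == minimax_length(q) == 3   # both paths optimal overall
# ...but the remaining decisions of q are not optimal from node 2:
assert minimax_length((2, 4, 5)) > minimax_length((2, 3, 5))
```

The failure arises because the minimax objective is monotone but not strictly monotone, which is exactly what the separability and strict-monotonicity fix above rules out.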
While these fixes are fine, there are other possibilities. In particular, it is possible to fix these - and other - bugs by adhering to Bellman’s original formulation of DP and the Principle of Optimality. The main objective of this paper is to briefly show how this can be done.
2.
REMEDIES
In this section we provide two remedies to the Principle. These remedies not only fix the bugs discussed above, they also indicate how elegantly Bellman dealt with a number of thorny modelling and technical aspects of DP.
Remedy 1: Final state formulation
A close examination of Bellman’s work on DP reveals that Bellman continually struggled with the following dilemma: how do you keep the formulation of DP simple, yet enable it to tackle complex problems? The Principle of Optimality was conceived as a device that would keep the description of the basic ideas of DP simple. In particular, in contrast to the DP models developed in the 1960s with the stated goal of putting DP on a rigorous mathematical foundation (e.g. Denardo and Mitten (1967), Karp and Held (1967)), Bellman’s original treatment of DP paid very little attention to the objective function of the process. As a matter of fact, Bellman systematically and consistently avoided the need to deal with this issue by a very drastic assumption: the overall return from the decision process depends only on the final state of the process. Readers who are sceptical about this fact are invited to read (carefully) Bellman’s first book on DP, where they can find the following definition of an optimal policy:
. . . Let us now agree to the following terminology: A policy is any rule for making decisions which yields an allowable sequence of decisions; and an optimal policy is a policy which maximizes a preassigned function of the final state variable ... Bellman (1957, p. 82)
As argued in Sniedovich (1992), it is not surprising that this type of objective function satisfies the traditional (strict) monotonicity conditions (Mitten, 1964; Yakowitz, 1969); hence, the Principle of Optimality is (trivially) valid in the context of Bellman’s final state model of DP. It should be stressed that Bellman’s final state approach to the modelling aspects of DP is an extremely useful tool of thought, and it should not come as a surprise that it is still used as such (e.g. Woeginger (2000)). A close examination of Bellman’s treatment of stochastic processes (Bellman, 1957) immediately reveals that Bellman also devised a supremely elegant alternative to the "with probability one" clause suggested by Yakowitz (1969) and Porteus (1975). It works like this:
Remedy 2: Conditional Modified Problems
The conventional interpretation of the Principle of Optimality deals with two optimization problems: the initial problem (before any decision is made) and the problem associated with the state resulting from the first decision. That is, the initial problem is the problem we have to solve when we are at the initial stage n=1 observing the initial state of the process. The second problem is the problem we face when we are at stage n=2, observing the state resulting from the first decision. Counter-Example 2 indicates that even though this state results from an implementation of an optimal policy with respect to the initial problem, there is no guarantee that the remainder of this policy is optimal with regard to the second problem. But suppose that we compare the second problem with the following problem: we still have to solve the original problem, but under the assumption that the first decision and the state resulting from it are given. Call this the conditional problem. Is it true that if a policy is optimal with respect to the conditional problem then it is also optimal with respect to the second problem? Needless to say, there are of course problems where this condition is not satisfied. However, from our perspective this is not the point, because - following Bellman - we require the model to be a final state model, in which case the above condition is trivially satisfied. Rather, the point is that there is no need here for the "with probability one" clause. Before we address this issue any further it will be instructive to re-examine Counter-Example 2 and see whether it satisfies the above condition.
Counter-Example 2: Revisited
The objective functions associated with the second problem and with the conditional problem lead to the same optimal second decision, regardless of what value the observed state takes. Thus, the above condition is satisfied.
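Counter-Example 2 can also be simulated. The details left unspecified above are filled in with assumptions here: decisions in {0, 1}, return d1 + d2, and a second state uniform on [0, 1]. The policy that deviates only on the null event {x2 == 0.5} earns the same expected return as the everywhere-greedy policy, which is exactly why the raw Principle appears to fail without the "with probability one" clause.

```python
import random

# Assumed concretization of Counter-Example 2: decisions in {0, 1},
# return d1 + d2, second state x2 uniform on [0, 1].
random.seed(0)

def play(policy2, trials=100_000):
    """Monte Carlo estimate of the expected return under second-stage
    policy policy2 (the first decision is always the greedy d1 = 1)."""
    total = 0.0
    for _ in range(trials):
        d1 = 1
        x2 = random.random()   # continuous state: point events have prob. 0
        total += d1 + policy2(x2)
    return total / trials

always_one = lambda x2: 1                       # optimal everywhere
deviant    = lambda x2: 0 if x2 == 0.5 else 1   # suboptimal on a null set only

assert play(always_one) == play(deviant) == 2.0
```

Both policies are optimal for the initial problem, yet the deviant one is not optimal for the conditional problem at the realization x2 = 0.5, matching the discussion above.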
In short, this alternative to the "with probability one" fix works well, and not only in the framework of final state models.
3.
THE BIG FIX
What emerges from the above analysis is that if we stick to (1) Bellman’s final state model and (2) the relationship between the modified problems and their respective conditional problems, then the Principle of Optimality is alive and well and does not require any fixing. With regard to the relationship between the modified problems and their respective conditional problems, this is a typical Markovian property: it entails that a modified problem and its conditional counterpart have the same optimal solutions regardless of the specific values of the given first decision and resulting state. This implies that the set of optimal
solutions of the conditional problem does not depend on the particular decision and transition that led to the current state, but only on that state. This is a Markovian property par excellence. The nice thing about this condition is that it enables us to drop both the final state requirement and the traditional monotonicity property. In short, the Markovian condition can act as the ultimate fix for the Principle of Optimality, in that its validity guarantees the validity of the Principle of Optimality and of the functional equation of DP.
4.
THE REST IS MATHEMATICS
Technical details concerning the Markovian condition and its role in DP can be found elsewhere (e.g. Sniedovich (1992)). Here we just illustrate how it can be used in a formal axiomatic manner as an alternative to the traditional monotonicity condition and the "with probability one" clause. So let us first consider the following deterministic sequential decision problem: Problem
s.t.
where S denotes the state space, T is the transition function, denotes the set of feasible decisions at stage n if the system is in state and - the objective function - is a real valued function of the state and decision variables. It is assumed that the initial state of the system, is given. We refer to Problem as the initial problem. Let denote the set of all the feasible solutions to this problem and let denote the set of all (global) optimal solutions to this problem. It is assumed that is not empty. We now decompose the objective function, in the usual DP manner by assuming that there exists a function and a sequence of functions such that: and
where
for This leads to the notion of modified problems: Problem
Let denote the set of all feasible solutions to Problem and let denote the set of all optimal solutions to this problem. It is assumed that for any pair, the set is not empty. In anticipation of the introduction of the Markovian condition, we also consider the following conditional modified problems: Problem
subject to (4.7)-(4.8). Let denote the set of all feasible solutions to Problem and let denote the set of all optimal solutions to this problem. It is assumed that the set is not empty for any quadruplet Observe that by construction Clearly then (by inspection), Lemma 1
for all
and
Our beloved Markovian condition can now be stated formally as follows: Markovian Condition (Deterministic processes):
for all
and
The following is then an immediate consequence of the definition of the Markovian property. Corollary 1 If the Markovian condition (deterministic processes) holds then
for all
and
Therefore, substituting (4.12) into (4.10) yields Corollary 2 If the Markovian condition (deterministic processes) holds then
for all stages and states. This is the famous functional equation of DP. In short, in the context of deterministic processes, the Markovian condition can be used as a sufficient condition for the validity of the Principle of Optimality and the functional equation of DP. Let us now examine the situation in the context of stochastic processes. We shall do this on the fly, as an extension of the deterministic case. Observe that in the context of stochastic processes, the state observed at stage n+1 is not determined uniquely by the state and decision pertaining to stage n. Rather, it is determined by a (conditional) probabilistic law that depends on these two entities. In short, the state variable is a random variable. It is therefore natural to introduce the notion of a decision making rule, that is, a rule according to which the decisions are determined. Furthermore, in anticipation of the Markovian condition, it is natural to consider the class of Markovian decision rules, namely Markovian policies. Formally, a
Markovian policy is a rule that, for each stage-state pair, assigns an element of the corresponding feasible decision set. Consider the set of all the feasible Markovian policies associated with our model; by definition, it consists of all the Markovian decision rules satisfying the following condition:
Our initial problem can thus be stated as follows: Problem
where the criterion is the expected value of the return generated by the policy given the initial state. Similarly, the modified and conditional problems are defined as follows: Problem
Problem
Then clearly, Corollary 3
for all
and
Note that the expectation is taken with respect to the next-state random variable, whose probability function is conditioned on the current state and decision. Consider the sets of all optimal solutions to the modified problem and to the conditional problem, respectively. Then the Markovian condition for stochastic processes can be stated as follows: Markovian Condition (Stochastic processes):
and
for all
Now, assume that the decomposition scheme of the objective function is separable under conditional expectation, namely suppose that
and
for all
Under this condition, Corollary 2 entails that
Hence, Corollary 4 If the objective function is separable under conditional expectation and the Markovian condition (stochastic processes) holds, then
for all stages and states. This is the famous DP functional equation for stochastic processes. Note that here, as in (4.18), expectation is taken with respect to the next-state random variable, whose probability function is conditioned on the current state and decision. In summary, then, the Markovian condition works well in both deterministic and stochastic processes.
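The functional equation just discussed can be sketched as a backward recursion. Everything concrete below (states, decisions, rewards, transition law, horizon) is an illustrative assumption, not taken from the chapter; the recursion V_n(s) = max_d [r(s,d) + E[V_{n+1}(S') | s, d]] is the generic form.

```python
states = [0, 1]
decisions = [0, 1]
N = 3  # horizon (assumed)

def reward(s, d):
    """Per-stage return (assumed for the sketch)."""
    return s + d

def transition(s, d):
    """Conditional law of the next state given (s, d); assumed values."""
    p = 0.7 if d == s else 0.4
    return {s: p, 1 - s: 1.0 - p}

V = {s: 0.0 for s in states}       # terminal values V_N
policy = {}
for n in reversed(range(N)):       # stages N-1, ..., 0
    V_new = {}
    for s in states:
        # V_n(s) = max_d [ r(s,d) + sum_t P(t | s,d) * V_{n+1}(t) ]
        best = max(
            (reward(s, d) + sum(p * V[t] for t, p in transition(s, d).items()), d)
            for d in decisions
        )
        V_new[s], policy[(n, s)] = best
    V = V_new

# With rewards increasing in d, the greedy decision d = 1 is optimal
# at every stage-state pair in this toy model.
assert all(d == 1 for d in policy.values())
```

The dictionary `policy` is exactly a Markovian policy in the sense defined above: it assigns a decision to each stage-state pair, independently of how the state was reached.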
5.
REFINEMENTS
The Markovian condition can be refined a bit to reflect the fact that for the DP functional equation to be valid, it is sufficient that the Principle of Optimality be satisfied by one policy, rather than by all the optimal policies. This leads to the following: Weak Markovian Condition (Deterministic processes):
for all
and
where denotes the empty set.
In words, each conditional modified problem shares at least one optimal solution with each modified problem giving rise to it. Corollary 5 If the Weak Markovian condition (deterministic processes) holds then the DP functional equation (4.13) is valid. Weak Markovian Condition (Stochastic processes):
for all
and
Corollary 6 If the Weak Markovian condition (stochastic processes) holds then the DP functional equation (4.22) is valid.
In words, any conditional modified problem shares at least one optimal solution with each modified problem giving rise to it. Corollary 7 If the objective function is separable under conditional expectation and the Weak Markovian condition (stochastic processes) holds then the DP functional equation (4.22) is valid. The following observations should be made with regard to the distinction drawn above between the Markovian condition and the Weak Markovian condition:
1. DP functional equations based on the Weak Markovian condition do not guarantee the recovery of all the optimal solutions.
2. There is not always a choice in this matter. That is, some objective functions are Markovian in nature, so it is not possible to formulate for them valid DP functional equations that satisfy the Weak Markovian condition but do not satisfy the Markovian condition.
3. The distinction is more pronounced in the context of deterministic processes than in stochastic processes. This is a reflection of the fact that the "separation under conditional expectation" condition (4.20) is very demanding; hence the class of objective functions satisfying the Weak Markovian condition - hence the Markovian condition - is relatively small.
4. The distinction between the Markovian and Weak Markovian conditions is similar - but not identical - to the distinction between monotone and strictly monotone decomposition schemes (Sniedovich, 1992).
The question naturally arises: what happens if the Markovian condition does not hold?
6.
NON-MARKOVIAN OBJECTIVE FUNCTIONS
Suppose that we have a sequential decision model whose objective function cannot be decomposed in such a way as to satisfy the (Weak) Markovian condition. In this case, the process of deriving the DP functional equation breaks down. This could happen because the objective function is not separable, that is, it cannot be decomposed, or because, if it can be decomposed, the decomposition scheme does not satisfy the (Weak) Markovian condition. There is no simple foolproof recipe for handling such cases. Rather, there seem to be two basic approaches, and the choice between them is very much problem dependent. The first approach - which might be called "Pure DP" - regards the difficulty as a DP difficulty and attempts to fix it by reformulating the problem - in particular the state variable - so as to force the objective function to obey the Markovian condition. Typically this leads to an expansion of the state space and/or the return space, with obvious implications for the efficiency of the algorithm used to solve the resulting DP functional equation. The second approach - which might be called "Hybrid" - attempts to deal with the difficulty without altering the structure of the DP model. Instead, other methods are employed to deal with the non-Markovian nature of the objective function. Thus, the resulting approach is based on a collaborative scheme between DP and other optimization methods. As an example, consider the case where the objective function of the sequential decision model is of the following form:
Clearly, this function is not separable and therefore it cannot be decomposed in a DP manner unless the structure of the model is expanded. There are two ways to expand the DP model so as to satisfy the Markovian property in this case: The first involves an expansion of the return space resulting from replacing the objective function g by two objective functions:
and viewing the problem as a multi-objective problem. Since g is monotone increasing in both of its arguments, an optimal solution to the original problem is a Pareto efficient solution to the multi-objective problem. Thus, multi-objective DP can be used to generate all the efficient solutions to the multi-objective problem, from which the optimal solution to the original problem can be recovered (e.g. Carraway et al. (1990), Domingo and Sniedovich (1993)). The second possible expansion is in the state space. Here the difficulty caused by the square root term in (6.1) is resolved by incorporating the variable
in the state variable of the DP model so that the expanded state variable is of the form We can then consider the “expanded” objective function
where
and
Observe that the expanded objective function is Markovian with respect to the expanded state variable. A possible "Hybrid" approach to the problem is to linearize the square root term on the right hand side of (6.1) and consider the parametric objective function:
The point is that, for any given value of the parameter, this function is separable and Markovian with respect to the original state and decision variables. The idea is then to identify a value for the parameter such that, if we optimize the problem using this value, we obtain an optimal solution to the original problem. This typically involves a line search, which in turn requires solving the parametric problem for a number of values of the parameter. Under appropriate conditions, composite concave programming can be used for this purpose (Sniedovich, 1992).
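The hybrid idea can be sketched as follows. The candidate (r, s) pairs below are assumptions standing in for the feasible solutions of an underlying sequential decision problem; for each trial multiplier the surrogate objective is separable, so in a real application that inner step would be an ordinary additive DP rather than an enumeration. In this toy instance the line search recovers the true optimum (in general a duality gap is possible).

```python
from math import sqrt

# Assumed candidate solutions, each summarized by its pair (r, s).
candidates = [(6.0, 1.0), (3.0, 9.0), (4.0, 4.0), (1.0, 16.0)]

def g(rs):
    r, s = rs
    return r + sqrt(s)          # the non-separable objective (6.1)

def solve_parametric(lam):
    """The surrogate r + lam*s IS separable, so for the real problem this
    step would be an ordinary additive DP; here we simply enumerate."""
    return max(candidates, key=lambda rs: rs[0] + lam * rs[1])

# Line search over the multiplier: each lam yields a separable problem;
# keep the best parametric solution under the true objective g.
best = max((solve_parametric(k / 20.0) for k in range(1, 21)), key=g)
assert g(best) == max(g(rs) for rs in candidates)   # optimum recovered here
```

Composite concave programming refines this crude grid search by exploiting the concavity of the square root to steer the choice of the multiplier.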
7.
DISCUSSION
One of the fascinating aspects of Bellman’s work on dynamic programming is his attempt to capture the essence of the method by a short non-technical description. Over the years this description - The Principle of Optimality - has become synonymous with dynamic programming. Unfortunately, the non-technical nature of the description has also led to difficulties with common interpretations of the Principle, which in turn led to criticism of Bellman’s work itself. It was shown in this paper that a proper reading and interpretation of Bellman’s formulation of dynamic programming in general and the Principle in particular can overcome the above difficulties.
REFERENCES
Bellman, R. (1957). Dynamic Programming, Princeton University Press, Princeton, NJ.
Bertsekas, D.P. (1976). Dynamic Programming and Stochastic Control, Academic Press, NY.
Carraway, R.L., T.L. Morin, and H. Moskowitz. (1990). Generalized dynamic programming for multicriteria optimization, European Journal of Operational Research, 44, 95-104.
Denardo, E.V. and L.G. Mitten. (1967). Elements of sequential decision processes, Journal of Industrial Engineering, 18, 106-112.
Denardo, E.V. (1982). Dynamic Programming Models and Applications, Prentice-Hall, Englewood Cliffs, NJ.
Domingo, A. and M. Sniedovich. (1993). Experiments with algorithms for nonseparable dynamic programming problems, European Journal of Operational Research, 67, 172-187.
Karp, R.M. and M. Held. (1967). Finite-state processes and dynamic programming, SIAM Journal of Applied Mathematics, 15, 693-718.
Kushner, H. (1971). Introduction to Stochastic Control, Holt, Rinehart and Winston, NY.
Mitten, L.G. (1964). Composition principles for synthesis of optimal multistage processes, Operations Research, 12, 414-424.
Morin, T.L. (1982). Monotonicity and the principle of optimality, Journal of Mathematical Analysis and Applications, 88, 665-674.
Porteus, E. (1975). An informal look at the principle of optimality, Management Science, 21, 1346-1348.
Sniedovich, M. (1986). A new look at Bellman’s principle of optimality, Journal of Optimization Theory and Applications, 49, 161-176.
Sniedovich, M. (1992). Dynamic Programming, Marcel Dekker, NY.
Woeginger, G.J. (2000). When does a dynamic programming formulation guarantee the existence of a fully polynomial time approximation scheme (FPTAS)?, INFORMS Journal on Computing.
Yakowitz, S. (1969). Mathematics of Adaptive Control Processes, Elsevier, NY.
Chapter 30

REFLECTIONS ON STATISTICAL METHODS FOR COMPLEX STOCHASTIC SYSTEMS

Marcel F. Neuts
Department of Systems and Industrial Engineering
The University of Arizona
Tucson, AZ 85721, U.S.A.
[email protected]
Abstract
Remembering conversations with Sidney Yakowitz on statistical methods for stochastic systems, the author reflects on the difficulties of such methods and describes several specific problems on which he and his students have worked.
Keywords:
statistical methods for stochastic systems, computer experimental methods.
1.
THE CHANGED STATISTICAL SCENE
Sid Yakowitz and I often had spirited discussions about statistical methodology for stochastic engineering systems. Sid is a friend who left many fond memories. It is a tribute to him that his ideas and opinions still stir our minds. This article offers some views on future developments of statistical methodology for probability models and is my way of honoring the memory of a departed friend. We were graduate students around 1960, a decade after statistics fully came into its own as an academic discipline and was widely recognized as an indispensable tool in industrial, agricultural, and socio-political practice. In those days, gathering data was always expensive and major computation was a chore. The challenge to statistical methods was to squeeze useful insight from simple models that, one hoped, captured the principal functional relations underlying the data and could reasonably well account for the variability in those data. In stochastic modelling, a similar situation prevailed. Those were the days when queues, inventories, and epidemics were analyzed in detail under the simplest Markovian assumptions and by using models with very few parameters.
Analytically explicit results were at a premium. Though burdened with ad hoc, unrealistic assumptions, these models provided useful information such as the equilibrium conditions for certain queues, simple operational rules for the control of single-product inventories, and threshold theorems for epidemics. It took a generation or longer before the impact of the computer was fully appreciated but it surely changed the worlds of probability and statistics. Starting in the early 1970s, I zealously argued that applied probability ought to place greater emphasis on the mathematical structure underlying algorithmic solutions than on rare cases of analytic tractability. Long before that was a commonly accepted view, I stated that a good computer algorithm offers greater versatility and provides deeper physical insight than the heavily constrained explicit formulas of our textbooks. Early on, Sid independently reached similar conclusions. He was among the first to teach a course on algorithmic methods for statistics and probability and in 1977, published the book Computational Probability and Simulation (Yakowitz, 1977), one of the earliest in algorithmic probability. He occasionally asked me why, rather than devoting my efforts to algorithms for stochastic models, I had not done really useful work and developed statistical procedures for these same models instead. In brief, my answer went as follows: Algorithmic methodology for probability models requires a complete rethinking of all existing theory. Based on insight into the structure of the models, it is more abstract and mathematically quite challenging. The constituent skills of algorithmic thinking are closer to those of pure mathematics, say, of functional analysis and algebra, than of the analytic manipulations that make up so much of applied mathematical education and practice. The broad change of perspective needed of the researchers in our field will (and did) take time. 
However, if that be true for probability models, it is nothing compared to the technical difficulties, the radical paradigm shift, and the social changes needed for statistical methods to meet the challenges and opportunities of this age of rapid, massive information processing. Why is the present time so exciting and, equally, so difficult for statistical research? It is exciting because of the superbly enhanced means of gathering and analyzing data. It is difficult because all existing statistical methods were conceived and developed in an era of small data sets and onerous computation. The principal statistical descriptors, such as moments and the familiar estimators, arose out of work with univariate data. Decades of descriptive statistics and of inferential analysis of data sets preceded and inspired the formal mathematical development of statistics. Issues such as the best (always to mean, most appropriate to the situation at hand) measures of central tendency and variability were discussed for a long time. Now, the task of retooling is immense, the data sets are huge and cannot easily be visualized, and the time pressure imposed by the rapidly evolving scientific and technological needs is enormous.
Reflections on Statistical Methods for Complex Stochastic Systems
When we were students, multivariate and time series analyses were already beautiful mathematical theories. Sid had a more direct interest in these than I, but we agreed that the computational burden and the paucity of tractable results made their application to actual data a daunting task. I, for one, was happy to leave that to people in biology, economics, and the social sciences where the reality of multivariate, highly dependent data could not be overlooked. Work on stochastic models kept me well occupied and I could only follow developments in statistics from a distance. From colleagues at Purdue University and elsewhere, I learned about Bayesian procedures, about selection and ranking, about variable selection in linear models, and other such work. During the 1970s, there clearly was a growing preoccupation with numerical results obtained by substantial algorithms, yet the statistical laboratory and the computing center remained clearly separated worlds. In the first, people engaged in statistical thinking; in the second, one sought advice and found help with massive computer jobs. In 1980 or so, during a one-day visit to Princeton University, I vividly experienced the thrill of seeing a changed, enriched statistical scene. A graduate student demonstrated a software package for time series that he was developing. A rich variety of statistical estimators, tests, and data transformations could be interactively implemented to serve in the exploration of one or more traces of a time series. Algorithm and computation, once barriers between methodological and physical insight, had become our faithful servants, if not yet our allies. The doctoral student had excellent knowledge of statistical theory and of the computer’s capabilities. He combined them in a creative, synergistic research project. There are now many highly numerate statistical researchers; the years since 1980 brought major progress in the algorithmization of statistics.
Professionally written statistical software is now readily available. Judging by the textbooks, by my experience during the latter years of my teaching career, and by visits to universities in many countries, academic education in statistics still lags far behind these developments. With few exceptions, students learn the elementary mathematics underlying the most classical estimators and tests, not the deeper, substantive insights needed to use existing software packages with confidence and competence. When trying to stir interest and enthusiasm, ponderous preaching about generalities is counter-productive. When asked for a look-ahead talk, I prefer to choose some specific problems that are just beyond our present capabilities. After explaining why they are important, I speculate about promising new approaches - promising in that they may get the job done, not merely in that they lead to easily publishable papers. In Neuts (1998), I so discussed selected problems in stochastic modelling. That area could benefit from greater emphasis on understanding the physical behavior of the models.
Next, I shall mention illustrative problems suggested by questions in traffic engineering and manufacturing. As academic researchers we can view those problems from the rear lines, without the immediacy of deadlines. By taking a general view, we can develop well-justified versatile methodology. If at all successful, that methodology will find new, unanticipated applications.
2. MEASURING TELETRAFFIC DATA STREAMS
The measurement of Internet data streams, needed for network design and control, is a rich source of statistical problems. These are obviously relevant and timely. We consider just one such problem and see where it can lead. A record of a data stream is called a trace. It is a list of all teletraffic cells in the stream over a span of time. It gives their arrival times, destinations, types, lengths, and whatever other information is germane to their proper routing. A few seconds of observation can yield traces of one billion items of data or more. Each trace is only one observation; its length corresponds to the dimension of the multivariate data. Obtaining traces is not easy. It requires special equipment and careful planning; the mere fact of recording the trace interferes with the data stream on the microscopic time scale where individual transactions take place. For use in design work, teletraffic engineers maintain a limited collection of representative traces. It is much easier to obtain counts of the cells flowing past an observer during successive intervals of length a. The choice of a requires attention. How large can we choose a without losing important information about the traffic stream? Using only counts, can we reconstitute a reasonable approximation to the original trace for use, say, in future simulation experiments? Choosing a is clearly a recasting of the old statistical problem of grouping data for easier analysis or for plotting in histograms. A look at the literature of the past shows how long these issues were debated, even for ordinary, univariate data. The stated questions do not have clear-cut answers. In actuality, much depends on the context and on the specific application, but methodological studies serve to clarify criteria for the choice of a. Once a is chosen, how can we extract pertinent information from, say, ten million counts?
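For concreteness, the grouping of a trace into counts over windows of length a can be sketched as follows. This is an illustrative fragment of my own; the function name and the synthetic trace are assumptions, not part of any published procedure:

```python
import numpy as np

def counts_from_trace(arrival_times, a):
    """Group the arrival epochs of a trace into counts over
    consecutive windows [0, a), [a, 2a), ... (the grouping step)."""
    arrival_times = np.asarray(arrival_times, dtype=float)
    horizon = arrival_times.max()
    n_windows = int(np.ceil(horizon / a))
    # index of the window containing each arrival; clamp the endpoint
    idx = np.minimum((arrival_times // a).astype(int), n_windows - 1)
    return np.bincount(idx, minlength=n_windows)

# Example on a short synthetic trace of 1000 arrival epochs
rng = np.random.default_rng(0)
trace = np.sort(rng.uniform(0.0, 100.0, size=1000))
counts = counts_from_trace(trace, a=1.0)
assert counts.sum() == 1000   # no cells lost in the grouping
```

Every statistic computed from the counts, rather than from the trace itself, is implicitly conditioned on this choice of a.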
Obvious, inexpensive operations such as calculating a few sample means and variances or plotting the marginal empirical density of the counts are readily done, but what, if anything, do they actually tell us? Extracting information from long strings of counts lies at the origin of a growing literature on discrete time series. With one caveat, I believe that to be a promising area for useful statistical work. Judging from the published work of which I am aware, there is little interest in descriptive statistics and not much contact with actual data sets. It is only a mild generalization - if one that is painfully unkind - that academic statisticians do not look at data and applied
users do not look at theory. In an effort to belie that quip, one of my later doctoral students, David Rauschenberg (Rauschenberg, 1994), examined ways of summarizing long strings of counts in short strings of informative icons that reflect the qualitative behavior of the counts over long substrings. Although it is regrettably unpublished, I consider his thesis a seminal work leading to data-analytic procedures that merit much further attention. Reconstituting a traffic stream from counting statistics is only an example in a vast class of problems dealing with random transformations of point processes. During the early 1990s, I worked with several Ph.D. students on local poissonification, the operation whereby the events during successive intervals of length a are uniformly redistributed over these intervals; see Neuts et al. (1992) and Neuts (1994). What was our qualitative thinking behind that construction? If you only have the event counts over intervals of length a, you cannot recover information about the exact arrival times during those intervals. You can redistribute the points regularly, place them all in a "group arrival" in the middle of the interval, or, as we did, you can imagine that they are uniformly and randomly distributed - as they would be if the original process were a Poisson process. What we studied has the intuitive flavor of a "linear interpolation." That intuition was indeed borne out by some formulas for lower order moments that we derived. Unless a is large, differences between the original and the reconstituted processes should not matter greatly - exactly the same idea that underlies the grouping of univariate or bivariate data. The statement of that intuition is vague. One can assail it with criticism or, constructively, one can give technically precise formulations that are amenable to rigorous scientific inquiry. In Neuts et al.
(1992), we initiated a theoretical study of local poissonification for the family of versatile benchmark processes known as MAPs (Markovian arrival processes); see, e.g., Neuts (1992), Narayana and Neuts (1992), and Lucantoni (1993). I wish we had been able to pursue that study further along the following lines: The pertinent engineering question is whether and when we can use counting data instead of detailed but expensive traces. The answer to that question is context-dependent. It depends on the service mechanisms to which the traffic is offered. In a queueing context, for example, when there is any appreciable queueing at all, the operation of service systems is little, if at all, affected by slight perturbations in the arrival times of packets. Whether the packet comes a little earlier or a little later only means that it spends a little more or a little less time waiting in the queue. With the restrictive assumptions needed for classical queueing analysis, it is impossible to model the effect of local poissonification (or of other transformations known in the engineering literature as traffic shaping) by standard analytic or algorithmic methods. Moreover, to compare an input stream and its
poissonifications for various values of a, it is not enough to treat each case separately. Using simulation terminology, one should run simultaneous, parallel simulations in which the various poissonifications of a given input stream are subject to identical service times. Valid comparisons are possible only when that is done. Experimental physicists know that, in meaningful comparison studies, one varies only one or two parameters between experimental runs, keeping all other conditions as much as possible the same. People with solid grounding in probability understand that, to compare two experiments (and not merely some simple descriptor, such as a mean), you formalize both on the same probability space. Therein lies the fundamental difficulty of - and the serious scientific reservations to - the many engineering approximations common in applied stochastic models. For approximate models to be scientifically validated, we need to compare differences in the realizations of the stochastic processes, not merely in crude descriptors such as the mean or standard deviation of the delay. A major difficulty in doing so is the paucity of theoretical results on multivariate stochastic processes and the extreme difficulty of obtaining them. For some years now, I have increasingly realized the importance of computer experimentation in stochastic modelling. In Neuts (1997) and Neuts (1998), I adduce reasons for that importance. As a case in point, the study of the effect of the window size a leads to a pretty, seminal computer experiment. We generate a large database of (say, 10 million or more) interarrival times and we construct poissonifications of that random point sequence with K different values of a. These we offer (in parallel) to single servers with identical strings of service times, generated from a common distribution F(·).
We so obtain K realizations of the queueing process that differ only in the value of the parameter a. What are some technical issues that arise in the design and analysis of that experiment? In the first place, note that a common input stream is used for all K poissonifications. Poissonification does not affect the order of the arrivals. We may therefore think of each arrival as a clearly identified job. The service time of that job is unaffected by the choice of a. Therefore, the original input and all K poissonified streams are subject to a common sequence of service times. If the original arrival stream comes from a stationary point process, it is easy to ensure that all K poissonified streams are also realizations of stationary processes. The most interesting part is the analysis of the output of the experiment. As we are mainly interested in the differences between the queue with the original input and each of the K models with poissonified arrivals, I would form the sequences of differences between the delay of each job in the original queue and in each of the poissonified models. Each such sequence is a trace of a stationary process; we can apply established statistical procedures to it. In comparing distinct sequences of differences, that is, comparing the results
for different values of a, we must bear in mind that these are highly dependent stochastic processes. It is likely that only data-analytic comparisons remain possible. Qualitative conclusions from such comparisons need to be validated by replications of the entire experiment with different, independently generated data sets. The choice of the statistics used in comparisons, the informative representation and summary of data, and the efficient performance of the experiments all present interesting new questions and challenges. Experience gained from one experimental study facilitates future ones, and therein lies the potential growth of this field. The problem and the methodological approach that I have just described have important counterparts in engineering practice. I already mentioned the traffic traces of telecommunications applications. How are different traces or simulated traces from a proposed traffic model compared? A common practice is to use various measured or simulated arrival flows as input to single or multiple queues with constant service times. For many highly calibrated manufacturing or communications devices, the assumption of constant processing times is plausible. These input processes are typically offered - in parallel simulations - to servers with different holding times. A given input to various servers with different constant holding times is then interpreted as though input streams of various rates were offered to a server with a single, fixed holding time. Measured quantities, such as the average delay or the frequency of loss in a finite-buffer model, are typically quite robust. Useful engineering conclusions are drawn from them, although without a formal statistical justification. The high dependence between the various simulated realizations and the heuristic manner in which estimates are obtained offer challenges to statistical analysis. In both problems I have mentioned, the general methodological issue is the same.
How can we meaningfully measure differences between (dependent) stochastic processes whose realizations are relatively minor perturbations of each other?
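The parallel experiment described in this section can be prototyped in a few lines. The sketch below is my own illustration, not the original study: `poissonify` implements local poissonification as defined earlier, single-server delays are computed with the standard Lindley recursion, and per-job delay differences are formed across several window lengths a. All names, sample sizes, and distributions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def poissonify(arrivals, a):
    """Local poissonification: redistribute the arrivals of each window
    [k*a, (k+1)*a) uniformly over that window, preserving the counts."""
    arrivals = np.asarray(arrivals, dtype=float)
    n_windows = int(np.ceil(arrivals.max() / a))
    out = []
    for k in range(n_windows):
        n_k = np.count_nonzero((arrivals >= k * a) & (arrivals < (k + 1) * a))
        out.append(np.sort(rng.uniform(k * a, (k + 1) * a, size=n_k)))
    return np.concatenate(out)

def waiting_times(arrivals, services):
    """Lindley recursion: W_{n+1} = max(0, W_n + S_n - (A_{n+1} - A_n))."""
    inter = np.diff(arrivals)
    w = np.zeros(len(arrivals))
    for n in range(len(inter)):
        w[n + 1] = max(0.0, w[n] + services[n] - inter[n])
    return w

# Parallel experiment: one input stream, several window sizes a,
# and one common string of service times for every poissonification.
arrivals = np.sort(rng.uniform(0.0, 1000.0, size=2000))   # original trace
services = rng.exponential(0.4, size=len(arrivals))        # common F(.)
w_orig = waiting_times(arrivals, services)
for a in (0.5, 2.0, 8.0):
    w_pois = waiting_times(poissonify(arrivals, a), services)
    diffs = w_pois - w_orig        # per-job delay differences
    print(a, np.mean(np.abs(diffs)))
```

Because every stream sees the same jobs in the same order with the same service times, the difference sequences isolate the effect of a, exactly the same-probability-space comparison advocated above.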
3. MONITORING QUEUEING BEHAVIOR
I have long been interested in procedures for monitoring the behavior of queueing systems. Monitoring differs from control. Its objective is to signal unusual excursions of the current workload (queue length or backlog of work) and to classify such excursions into different categories to identify their causes. When the cause of unusual excursions is identified, we can take appropriate control actions. A discussion of a monitor for a classical queueing model is found in Neuts (1988), and applications are described in Widjaja et al. (1996) and Li et al. (1998). For simple, tractable models, we can define and compute a profile curve. In brief, that is a stochastic upper envelope of the rare excursions above
a threshold K. As explained in Neuts (1988), the monitor becomes active when the queue length exceeds K and becomes dormant again when it returns to values below K. However, when an active monitor detects that the queue length exceeds the profile curve, a call for a control action ensues. The path of the excursion between K and the profile curve reveals the nature of the excursion, and that is important to appropriate control intervention. Monitoring is another subject that, had time and personnel resources permitted it, I would have pursued in greater depth. It is obvious that monitoring multi-dimensional stochastic processes with complex dependence, so prevalent in queueing networks and production systems, presents the greatest methodological challenge. I shall discuss two cases in some detail.

Low-Dimensional Processes: The profile curve is an involved analogue of the familiar control limits of quality control. It is much more involved as it deals with highly dependent stochastic process data. The situation is even more complex for multivariate stochastic processes. We could define control limits for low-dimensional stochastic processes as follows: A realization of the process traces out a path in a low-dimensional Euclidean space. As the process is typically observed at discrete time epochs, our observations form a set S of isolated data points. The successive peeled convex hulls of our data set are informative statistical descriptors of the multivariate process. We start with the standard convex hull of S and we delete all its extreme points. The convex hull of the remaining set is the first peeled convex hull of S. We repeat the peeling procedure on that hull to construct the second peeled convex hull, and so on. We can define a monitoring region, for instance, as the difference between the convex hull of S and the first peeled convex hull to contain fewer than 95 percent of the original data points. The monitor is dormant when the process stays within that inner peeled convex hull.
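The peeling construction just described can be implemented with standard computational-geometry tools. The sketch below is my own illustration on synthetic bivariate data; the function names and the 95 percent threshold parameter are assumptions:

```python
import numpy as np
from scipy.spatial import ConvexHull

def inner_peeled_hull(points, keep_fraction=0.95):
    """Repeatedly delete the extreme points of the current convex hull
    until the hull contains fewer than keep_fraction of the original
    data points; return that inner hull and the peeling depth."""
    pts = np.asarray(points, dtype=float)
    n0 = len(pts)
    depth = 0
    while len(pts) >= keep_fraction * n0 and len(pts) > 3:
        hull = ConvexHull(pts)
        pts = np.delete(pts, hull.vertices, axis=0)  # peel one layer
        depth += 1
    return ConvexHull(pts), depth

def is_dormant(hull, x):
    """The monitor stays dormant while the observation x satisfies all
    facet inequalities of the inner peeled hull (i.e., lies inside it)."""
    return bool(np.all(hull.equations[:, :-1] @ x + hull.equations[:, -1] <= 1e-9))

# Illustrative monitoring run on a synthetic bivariate point cloud
rng = np.random.default_rng(0)
data = rng.normal(size=(400, 2))
hull, depth = inner_peeled_hull(data)
print(depth, is_dormant(hull, np.zeros(2)), is_dormant(hull, np.array([10.0, 10.0])))
```

Observations falling between the original hull and this inner hull lie in the monitoring region; the facet test makes the dormant/active decision cheap at each new epoch.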
When it enters the monitoring region, we track the magnitude and the gradient of the changes in the process as information for whatever control actions are appropriate. The implementation of such a monitoring scheme relies on methods that are now standard in computational geometry. A detailed statistical analysis of the procedure, if possible at all, will be limited to very special processes. An extensive experimental exploration using various interesting multivariate processes with few restrictions would, I believe, inspire greater confidence in the statistical soundness and the practical utility of such a monitoring procedure.

Stochastic Networks: Modern technology abounds with examples of stochastic networks. Only the rarest yield to modelling analysis. While, of necessity, engineers have extensive experience with measurements and monitoring of network traffic, their practice rests on the scantest of statistical methodology. Taking queueing networks as a working metaphor, I wondered how we could possibly monitor the traffic in such networks. A detailed observation of all
network transactions is nearly always impractical, if not infeasible. On the other hand, we can usually observe, at little cost, the increments and decrements of the job load within sub-nodes of the network. Usually, these correspond to arrivals to and departures from the sub-node, but there are also cases where jobs increase by splitting or may disappear from the node without completing service. My last doctoral student, Jian-Min Li, and I initiated work on the process of the epochs of increments and decrements of a network node (Neuts and Li, 1999; Neuts and Li, 2000). We studied the fluctuations in the occurrences of increments and decrements by imagining a competition for runs between these two kinds of events. For a given, stable node in steady-state, we expect the departures and arrivals to be well interlaced. A quick, large increase in the arrivals suggests a rapid buildup in the content of the node. A long string of departures may indicate that the node is being starved for work. The idea behind our work was to use the statistical characteristics of normal fluctuations in the network to construct one or more monitors for selected subnodes. Due to various circumstances, our work remained in its initial stages. We believe, however, that the results were promising and merit further attention. Sid was definitely right. Statistical methods for stochastic models are sorely needed and of the greatest practical importance. Their development is challenging; it will require new, daring, and unconventional approaches.
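To make the runs idea concrete, here is a small illustrative sketch of my own, not the procedure of Neuts and Li: increments and decrements at a node are coded as +1 and -1, and unusually long runs of either kind are flagged. The function names and the threshold are assumptions.

```python
import numpy as np

def run_lengths(events):
    """Lengths and types of maximal runs in a +/-1 sequence of
    increments (arrivals) and decrements (departures) at a node."""
    events = np.asarray(events)
    change = np.flatnonzero(np.diff(events) != 0)      # run boundaries
    bounds = np.concatenate(([0], change + 1, [len(events)]))
    return np.diff(bounds), events[bounds[:-1]]

def flag_excursions(events, threshold):
    """Signal runs longer than `threshold`: a long run of +1's suggests
    a rapid buildup, a long run of -1's a node starved for work."""
    lengths, types = run_lengths(events)
    return [(int(t), int(l)) for l, t in zip(lengths, types) if l > threshold]

# Example: well-interlaced traffic with one abnormal burst of arrivals
events = np.array([+1, -1, +1, -1, +1, +1, +1, +1, +1, -1, +1, -1])
print(flag_excursions(events, threshold=3))   # [(1, 5)]
```

Calibrating the threshold against the run-length distribution of the node in normal, steady-state operation is exactly where the statistical characterization of normal fluctuations enters.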
ACKNOWLEDGMENTS
This research of M. F. Neuts was supported in part by NSF Grant Nr. DMI-9988749.
REFERENCES
Li, J-M., M. F. Neuts, and I. Widjaja. (1998). Congestion detection in ATM networks. Performance Evaluation, 34:147–168.
Lucantoni, D.M. (1993). The BMAP/G/1 queue: A Tutorial. In Lorenzo Donatiello and Randolph Nelson, editors, Performance Evaluation of Computer and Communication Systems: Joint Tutorial Papers of Performance ’93 and Sigmetrics ’93, pages 330–358. Springer-Verlag, Berlin.
Narayana, S. and M. F. Neuts. (1992). The first two moment matrices of the counts for the Markovian arrival process. Communications in Statistics: Stochastic Models, 8(3):459–477.
Neuts, M.F. (1988). Profile curves for the M/G/1 queue with group arrivals. Communications in Statistics: Stochastic Models, 4(2):277–298.
Neuts, M.F. (1992). Models based on the Markovian arrival process. IEICE Transactions on Communications, E75-B(12):1255–1265.
Neuts, M.F. (1994). The Palm measure of a poissonified stationary point process. In Ramón Gutierrez and Mariano J. Valderrama, editors, Selected Topics on Stochastic Modelling, pages 26–40. Singapore: World Scientific.
Neuts, M.F. (1997). Probability Modelling in the Computer Age. Keynote address, International Conference on Stochastic and Numerical Modelling and Applications, Utkal University, Bhubaneswar, India.
Neuts, M.F. (1998). Some promising directions in algorithmic probability. In Attahiru S. Alfa and Srinivas R. Chakravarthy, editors, Advances in Matrix Analytic Methods for Stochastic Models, pages 429–443. Neshanic Station, NJ: Notable Publications, Inc.
Neuts, M.F. and J-M. Li. (1999). Point processes competing for runs: A new tool for their investigation. Methodology and Computing in Applied Probability, 1:29–53.
Neuts, M.F. and J-M. Li. (2000). The input/output process of a queue. Applied Stochastic Models in Business and Industry, 16:11–21.
Neuts, M.F., D. Liu, and S. Narayana. (1992). Local poissonification of the Markovian arrival process. Communications in Statistics: Stochastic Models, 8(1):87–129.
Rauschenberg, D.E. (1994). Computer-Graphical Exploration of Large Data Sets from Teletraffic. PhD thesis, The University of Arizona, Tucson, Arizona.
Widjaja, I., M. F. Neuts, and J-M. Li. (1996). Conditional Overflow Probability and Profile Curve for ATM Congestion Detection. IEEE.
Yakowitz, S. (1977). Computational Probability and Simulation. Addison-Wesley, Reading, MA.
Author Index
Aarts, 385, 388, 410, 412, 416 Abbad, 510 Abel, 203, 288, 299 Abramowitz, 128 Ackley, 385, 410 Acworth, 466 Adelson, 712, 732 Agrawal, 46, 52 Agresti, 196, 198 Ahmad, 590, 595 Ahmed, 682 Aizerman, 206, 221 Akad, 470 Akademi, 415 Akritas, 566, 595 Algoet, 11, 226–227, 231, 235, 246–247, 594 Altman, 32 Aluffi-Pentini, 411 Alzaid, 618, 621 Amin, 623 Anantharam, 38, 46, 52 Andersen, 79, 87, 90, 198 Anderson, 269–270, 273–274, 281–282, 693 Ando, 512 Andrews, 682 Anily, 388, 411 Ankenman, 692 Anorina, 79, 90 Antonov, 466 Aoki, 334, 357 Applegate, 636, 647 Aragon, 388, 413 Arapostathis, 544–545 Araujo, 84–85, 90 Arnold, 250, 267, 620–621 Asmussen, 183 Athreya, 577, 595 Avila-Godoy, 545 Avramidis, 466 Baccelli, 32 Badowski, 513 Bagai, 618, 621 Bahadur, 63, 90, 561, 595 Bailey, 235, 246
Bakhvalov, 466 Balarm, 334 Ball, 627, 648 Ballou, 631–632, 649 Banach, 77, 79–80, 84, 86 Banks, 32, 38, 42, 52 Banzhaf, 390, 411 Baras, 546 Barettino, 153 Barlow, 621 Barnsley, 732 Bartoszewicz, 618–621 Basawa, 561, 595 Basu, 622 Bauer, 208, 210, 221 Baum, 409, 411 Bayes, 41, 225, 690–691 512 Bean, 648 Beato, 153 Beauchamp, 657, 676, 682 Becker, 411 Beckman, 471 Beekmann, 203, 223 Bekey, 411 Bellman, 270, 333–334, 336, 338, 344–345, 349, 356–357, 496, 500, 735–736, 738–739, 748 Belmont, 510 Belzunce, 615, 621 Benardi, 131 Benveniste, 51–52, 370, 374, 376, 381 Berger, 690 Bernardi, 131, 152 Bernoulli, 39–40, 118, 392, 399, 443 Berry, 52 Bertsekas, 510, 545, 736, 748 Bertsimas, 32, 638, 647 Berzofsky, 152 Bhattacharya, 566, 575, 595, 602 Bhubaneswar, 760 Bianchi, 227 Biase, 389, 412 Bickel, 692 Bilbro, 389, 411
762 Billingsley, 32, 66–67, 90, 510, 558, 560, 595 Bina, 131, 133, 152 Birge, 647 Bisgaard, 692 Bittanti, 370 Bixby, 647 Black, 270, 282 Blaisdell, 131, 152–153 Blankenship, 22, 32, 510 Block, 621 Blount, 11, 575 Blum, 50, 52 Bodson, 376, 381 Boender, 389, 411 Bohachevsky, 411 Boland, 608, 612, 614, 621 Bolshoy, 153 Boltzmann, 388 Borkar, 45, 52, 545 Borovkov, 32, 62–63, 65, 72, 75, 84, 90–91 Bosq, 581, 595 Boyle, 467, 474 Braaten, 467 Bradley, 581, 585, 595–596 Bramson, 14, 32 Bratley, 467 Brau, 545 Braverman, 206, 221 Breiman, 246 Bremermann, 387, 390, 411 Brezzi, 39–43, 52 Broadie, 466–467 Brooks, 411 Bucher, 153 Buckingham, 131, 152 Bucy, 22, 32 Bunick, 154 Burman, 302, 329, 594, 596 Burrus, 682 Burt, 712, 732 Buyukkoc, 39, 54 Byrnes, 546 Caflisch, 467, 472 Caines, 370–371 Calladine, 131, 141, 153 Cameron, 732 Campi, 370 Cantelli, 242, 402–403, 409, 569, 574 Cao, 115 Capitanio, 329 Carr, 250, 267 Carraway, 748 Carrillo, 303, 329 Casella, 687 Castelana, 594 Castellana, 596 Cauchy, 180
MODELING UNCERTAINTY Caudill, 331 Cavazos–Cadena, 516–517, 520–523, 525, 529, 532, 543–545 Cavert, 99, 114 Cease, 152 Cerny, 412 Cesa-Bianchi, 246 Cesaro, 176 Chakravarthy, 760 Chang, 39–41, 52 Chao, 298–299, 590 Charnes, 630 Chatfield, 662, 682 Chebyshev, 67 Chen, 32, 687 Cheng, 329, 381 Chernoff, 37, 53 Chesher, 700 Chevi, 647 Chiarella, 252, 267 Chin 690 Chistyakov, 79 Chistyakov, 79, 91 Cholesky, 327 Chow, 246 Christofides, 629, 648 Chui, 732 Chung, 510 Chvatal, 647 Cieslak, 471 Clark, 51, 53, 156, 629, 647 Clarke, 184, 647 Clements, 330 Cobb, 692 Cochran, 467 Cohen, 96, 98–99, 114 Columban, 2 Compagner, 474 Conover, 471 Conway, 467 Cook, 329, 647, 692 Cooley, 682 Cooper, 630 Cornette, 152 Corput, 438 Cosman, 732 Cournot, 2, 249–251, 263–267 Courtois, 510 Cover, 327, 329, 730, 732 Cox, 198–199, 250, 267 Cranley, 467 Cristion, 51, 54 Crothers, 131, 153 Cushing, 250, 267 Dai, 32–33 Daley, 4, 11 Daniels, 687 Danielson, 678
AUTHOR INDEX Datta, 546 Davis, 510, 603 Davison, 687 Davydov, 577, 596 Dawande, 329 Dekker, 199, 223, 330, 749 Dekkers, 388, 412 Delebecque, 496, 510 DeLisi, 152 Dembo, 58, 91, 510 Dempster, 381 Denardo, 735–736, 738, 748–749 Denny, 6–7, 574, 596 Devroye, 202–203, 205, 207, 216, 221, 236, 246, 386–387, 392–393, 400, 404, 412, 575, 596 Di Masi, 511 Dietrich, 9, 11 Diggle, 198 Dirac, 252 Dixit, 283 Djereveckii, 370 Do, 687 Doeblin, 516–517, 520, 522, 526 Doksum, 692 Domingo, 747, 749 Doob, 335, 357, 572–573, 575, 580, 596 Dordrecht, 221–222, 414 Douglas, 692 Doukhan, 577, 581, 596 Down, 33–34 Drew, 131, 141, 152–153 Dror, 5, 627–629, 633, 638–639, 641, 643–649 Druzdzel, 467 Duckstein, 6–7 Duff, 8 Duffie, 283, 468 Dunford, 203, 218, 221 Dupuis, 15, 23, 33 Dvoretzky, 402, 412 Dykstra, 621 Dynkin, 335–336, 357 Eberly, 288, 299 Edgeworth, 687 Efimov, 469 Efron, 205, 221, 468, 686 Eilon, 629, 648 Eisenberg, 133, 153 Elliot, 546 Ellis, 493 Embrechts, 79, 85, 91 Endo, 732 Entacher, 468 Erlang, 62, 65 Ermakov, 390, 412 Ermoliev, 157, 183, 385, 412 Essunger, 115 Etemadi, 221
763 Ethier, 33, 511 Eubank, 575, 596 Fabian, 53 Fabius, 40, 53 Fahrmeir, 198 Fauci, 96, 99, 114 Faure, 429–430, 432, 440–441, 447, 451, 468 Fayolle, 14, 33 Feder, 227, 229, 246–247 Federgruen, 388, 411 Feller, 63, 65, 67, 91, 118, 128 Fennell, 330 Fernández–Gaucherand, 516–517, 520–523, 525, 529, 532, 543–545 Fernández-Gaucherand, 545 Field, 687 Fill, 63, 91 Fincke, 468 Firth, 198 Fisher, 6, 379, 383, 392, 394, 400–401, 412, 416, 696, 700, 732 Fishwick, 471 Fleming, 479, 500, 511, 516, 545 Fokianos, 190–191, 198 Foss, 32 Fourier, 132–134, 138, 453–454, 651–653, 662, 664, 667, 669, 678 Fox, 286, 299, 466–468, 511 Fradkov, 370 Franco, 621 Fraser, 687 Fredholm, 486 Freund, 246 Friedel, 468 Frobenious, 516, 521–522 Frontini, 389, 412 Furniss, 286, 299 Fushimi, 472 Gaimon, 329 Gaitsgori, 512 Gaivoronski, 157, 183 Galambosi, 11 Gallager, 512 Gamarnik, 32 Gani, 3, 5, 9–11, 383, 513, 575, 596 Garrett, 117, 125, 128 Gastwirth, 401, 412 Gauss, 703 Gaviano, 387, 413 Geffroy, 391, 395, 398–399, 413 Gelatt, 53, 387, 414 Gelfand, 388, 413 Geman, 369–370, 381, 387, 413 Georgiev, 575, 590, 596–597 Gerencsér, 245–247, 362, 370 Gershwin, 329, 511 Gheysens, 649
764 Ghosh, 687, 690 Gibbs, 388 Gidas, 388, 413 Giddings, 133, 154 Giessen, 152 Gill, 198, 383 Gilliam, 546 Giroux, 157, 184 Gittins, 38–40, 44, 53 Gladyshev, 206, 222 Glasserman, 466–467 Glynn, 157, 184, 511 Gnedenko, 395–396, 413 Goldberg, 385, 390, 413 Golden, 627, 630–631, 648–649 Goldenshluger, 245–247 Goldie, 91 Golub, 469 Golubov, 468 Good, 682 Gorodetskii, 577, 597 Gosh, 545 Goutis, 687 Graves, 53 Groningen, 597 Grunstein, 141, 153 Gruyter, 221 Guckenheimer, 250, 258, 267 Guo, 370 Gurin, 393, 413 Gutman, 246 Györfi, 202, 207, 227, 234–235, 238, 246–248, 371 Györfi, 221–222 Haario, 388, 413 Haase, 96, 99, 114 Haber, 469 Habib-agahi, 299 Hadamard, 659–661 Hajek, 387–388, 413 Hall, 686 Halton, 425, 440–441, 452, 469 Hamilton, 511 Hansen, 511 Hardy, 203, 222, 421 Harp, 154 Harris, 22–23, 298, 575 Hart, 302, 329 Hartjey, 547 Hartman, 389 Hasenbein, 32 Haussler, 246 Hayes, 9–10, 383, 416, 513 Hazelrigg, 305, 329 He, 621, 623 Heidelberg, 222–223 Heideman, 682 Heinrich, 469
MODELING UNCERTAINTY Held, 735, 738, 749 Hellekalek, 466, 468–470, 472 Helmbold, 246 Henry, 114 Hermite, 687 Hernández–Hernández, 546 Heunis, 371 Heyde, 59, 79, 87, 91, 202, 223, 581, 594 Hickernell, 452, 466, 468–471, 473–474 Hilbert, 453 Hillier, 511 Hinderer, 546 Hingham, 417 Hinkley, 687 Hipel, 9 Hjorth, 686 Hlawka, 421, 469–470 Hoadley, 652, 682, 686 Hoeffding, 470, 581 Hoffmeister, 387, 411 Holland, 53, 385, 390, 392, 412–413, 416 Holmes, 250, 258, 267 Hopf, 249–250, 255, 258–259, 261–262, 267 Hoppensteadt, 511 Hordjik, 546 Howard, 546 Howroyd, 268 Hu, 62–63, 91, 614, 617, 621 Huffman, 730 Hui, 690 Hurd, 732 Hutter, 8–9 Huzurbazar, 687 Hwang, 387, 413 Hymer, 5 Härdle, 222 Ibragimov, 577, 581, 597 Imamura, 416 Inagaki, 566, 597 Invernizzi, 252, 267 Ioannides, 575, 577, 580–581, 601 Ionescu, 518 Ioshikhes, 131, 153 Iosifescu, 482, 487, 511 Jackson, 250, 268 Jacobson, 516, 546 Jacod, 199 Jailet, 648 Jaillet, 626 Jain, 622 James, 546, 690 Jaquette, 546 Jarvis, 34, 385, 389–390, 413 Jawayardena, 547 Jayawardena, 10–11, 48, 54, 384, 417, 604 Jaynes, 356–357 Jensens, 205
AUTHOR INDEX Jiang, 329 Jing, 622 Joag-Dev, 622 Johnson, 388–389, 411, 413, 566, 597, 682 Joines, 471 Jones, 38, 53 Joshi, 692 Joslin, 371 Kabanov, 511 Kaebling, 44, 53 Kalman, 97, 105, 111–114, 375 Kamps, 620, 622 Kanatani, 381 Kane, 682 Kao, 68, 91 Karlin, 153, 511 Karlsson, 9 Karmanov, 393, 414 Karp, 735, 738, 749 Katz, 409, 411 Kaufman, 411 Kaufmann, 198–199 Kawaguchi, 732 Kedem, 3, 5, 10, 190–191, 198–199 Keenan, 199 Keller, 390, 468 Kendrick, 511 Kesten, 577, 597 Kettenring, 652, 682 Khaledi, 610, 616–620, 622 Khasminskii, 22, 33, 478, 511 Khomin, 252, 267 Kiefer, 50, 53, 373, 402, 412, 414, 682 Kieffer, 11, 248, 719, 721, 725, 729–733 Kirkpatrick, 35, 53, 414 Kirmani, 615–616, 619, 622 Kirschner, 96, 99, 114 Kisiel, 6, 574, 596 Kitagawa, 381 Kivinen, 227, 229, 247 Kleinrock, 175, 183 Klimov, 53 Klug, 141, 154 Klüppelberg, 91 Kmenta, 692 Kniazeva, 513 Knuth, 732 Koch, 685 Kochar, 610, 615–619, 621–622 Kohler, 221 Kokotovic, 496, 512 Kolassa, 687 Kollier, 10, 44, 54, 383, 417 Kong, 34, 374–376, 378–379, 381 Kopel, 250, 268 Koppel, 4 Korman, 329
Korobov, 425–427, 434, 437, 439, 442, 444, 447, 470 Koronacki, 393, 414 Korst, 385, 388, 410 Korwar, 616, 622 Kourapova, 733 Kouritzin, 371 Krause, 421 Kreimer, 645–646, 648 Krimmel, 7, 474 Kronecker, 209, 220–221 Krzysztofowicz, 6–7, 202–203, 207, 216, 221 Kubicek, 250, 268 Kuk, 381 Kumar, 14, 33–34, 37, 53, 512, 546 Kurita, 471 Kurtz, 33, 511 Kushner, 15–17, 22, 25, 32–34, 51, 53, 156, 183, 335, 357, 371, 376, 381, 414, 512, 736, 749 Kwakernaak, 334, 357 L’Ecuyer, 184, 474, 513 Lafeuillade, 96, 99, 114 Lago, 389, 411 Lagrange, 162 Laha, 705 Lai, 10, 37, 39–44, 46–50, 52–53, 59, 91, 392, 414, 512, 546, 575, 598 Laipala, 626, 648 Laird, 381 Lam, 114 Lanczos, 678 Laplace, 620 Laporte, 633–637, 647–648 Larson, 627, 644, 648 Laurent, 432–433, 435, 437, 439 Lauss, 470 Le Cam, 561, 566, 598 Leadbetter, 575, 586, 594, 603 Leake, 334, 357 Lebesgue, 51, 60, 73, 75, 384, 463, 557, 572 Lee, 334 Leeb, 469 Legendre, 492 Lehoczy, 466 Leibler, 38, 46 Lem, 397 Lemieux, 469–471 Lempel, 732–733 Lenhart, 329 Leontief, 286, 476, 495 Levitan, 473 Levy, 96 Li, 10, 54, 417, 622, 759–760 Liang, 198 Liapunov, 13, 15, 22–25, 30–31 Liaw, 730, 733
Lidl, 471 Lieberman, 511, 687 Lillo, 610–611, 622 Lin, 34, 330, 575, 590, 730 Lindley, 176 Linnik, 79, 91, 581 Lipschitz, 23, 156, 158, 160, 335, 386, 456, 485, 489, 494, 572–573, 593 Littlestone, 227, 247 Littman, 44, 53 Liu, 334, 512–513, 733, 760 Ljung, 51, 54, 184, 360, 371, 374, 381, 686 Loh, 471 Longnecker, 581, 598 Lorenz, 620 Louie, 416 Louveaux, 633, 648 Lowe, 9, 43–44, 48, 54, 59, 68, 93, 383, 392, 417, 575 Loève, 222, 546 Lu, 14, 34, 330 Lucantoni, 759 Lugosi, 9, 44, 54, 202, 205, 221, 246–247, 384, 392, 417 Lui, 469 Luk, 469 Lukes, 334, 357 Luman, 304, 329 Lunneburg, 686 Luttmer, 471 Lyapunov, 380 Macmillan, 6 MacPherson, 467 MacQueen, 206, 222 Mai, 10, 44, 54, 474 Maisonneuve, 471 Majety, 303, 329 Makis, 303, 329 Makovian, 742 Malyshev, 14, 33–34 Manbir, 5 Mandl, 45, 54 Mann, 402, 414 Manning, 721, 725 Manz, 682 Marchands, 373 Marcus, 516, 522, 525, 543, 545–546 Marek, 250, 268 Margaht, 152 Markov, 3–4, 14, 22–24, 35, 37–38, 44–48, 235, 373, 375, 377, 387, 476–483, 485, 487–489, 492–496, 498, 500–502, 504, 506–509, 515–516, 518–519, 527, 531–533, 540, 544, 556–558, 561, 563, 566–567, 571–572, 574–575, 580, 625, 641, 643, 647, 731–732 Markowitz, 115
Marti, 393, 414 Martingale, 191 Maryak, 685, 690 Masry, 585, 590, 594, 598 Massart, 402, 414 Mateo, 411–412 Matheson, 546 Mathur, 631–632, 649 Matous, 451 Matsumoto, 471 Matyas, 387, 414 McCullagh, 186, 191, 196, 199 McDiarmid, 205, 222 McDougall, 153 McEneany, 516 McGeoch, 388 McIntyre, 302, 329 Mckay, 471 McLachlan, 133, 153 Mcleod, 9 Medio, 268 Meerkov, 414 Meits, 302 Meitz, 329 Mengeritsky, 131, 153 Menshikov, 33 Merhav, 227, 246–247 Merrow, 285, 299 Methuen, 198 Metivier, 51–52, 381 Metropolis, 381, 414 Meyn, 33–34, 371 Michelena, 330 Miller, 253, 268 Minton, 330 Miranker, 511 Misra, 616, 622 Mitten, 735, 738, 749 Mitter, 98, 115, 388, 413 Mocarski, 153 Mockus, 414 Monro, 50, 54 Monroe, 155 Moore, 53 Moran, 4 Morin, 748–749 Morohosi, 471 Morokoff, 467, 472 Mortensen, 38, 54 Morton, 330 Morvai, 11, 202, 222, 226–227, 247–248, 594, 598 Moskowitz, 467, 748 Muelen, 616 Murray, 7–8 Muyldermans, 131, 153 Myers, 299
Männer, 414 Métivier, 370 Nadaraya, 202, 222, 575, 587, 594, 599 Nagaev, 58–59, 64, 66, 75, 79–80, 88–92 Nair, 652 Nanda, 609–615, 622 Naokes, 9 Narayana, 759–760 Nauk, 470 Nauka, 370 Nelder, 186, 191, 199 Nelson, 682, 732 Nemirovski, 156, 184 Neuman, 7, 381 Neumann, 115 Neuts, 759–760 Nevill-Manning, 721, 725, 733 Neweihi, 614 Neyman, 222, 598 Ng, 381 Nguyen, 575, 599, 712, 733 Nicholson, 692 Niederreiter, 430–431, 447, 466–473 Nishikawa, 334, 357 Nishimura, 471 Nobel, 202, 222, 227, 236, 247 Noon, 648 Noordhoff, 33 Nordin, 390, 411 Noren, 6 Notermans, 114 Noussair, 298 Ohsumi, 334, 357 Oja, 620, 623 Okuguchi, 249, 254, 268 Ollivier, 154 Olson, 299, 329 Orlando, 115 Ornstein, 235, 247 Owen, 440, 456, 464, 466–467, 471–473 Pan, 512 Panneton, 471 Panossian, 334, 357 Pantaleo, 114 Pantula, 577 Papalambros, 303, 329–330 Papanicolaou, 22, 32 Pareto, 62–63, 620, 747 Parisi, 411 Parker, 6 Parzen, 567, 599 Pascal, 62 Paskov, 473 Patterson, 468 Pazman, 691
Pearlman, 712, 733 Pebbles, 7 Peligrad, 581, 599 Pentini, 387 Perelson, 95–97, 104, 108–110, 112, 115 Perkins, 34 Pervozvanskii, 512 Petrov, 63, 92, 409, 414 Pflug, 157, 184 Pham, 577, 599 Philipp, 581, 599 Phillips, 496, 512 Philips, 299 Piatak, 96, 115 Pina, 131, 153 Pindyck, 283 Pinelis, 10, 59, 71, 77, 79, 82, 92 Pinsky, 414 Pintér, 414 Pirsic, 470, 472–473 Pisier, 247 Pitman, 92 Platfoot, 330 Pledger, 623 Plemmons, 469 Poggi, 115 Pohst, 468 Polyak, 155–157, 184 Porter, 285, 298–299 Porteus, 735, 737–738, 749 Prakasa, 561, 575 Pratt, 682 Price, 389, 415 Priestley, 590, 599 Priouret, 51–52, 370, 381 Profizi, 115 Prohorov, 75 Pronzato, 692 Proschan, 612, 614–616, 618–621, 623 Protopopescu, 329 Puccio, 329 Pura, 470 Puterman, 546 Qaqish, 199 Quadrat, 510 Quirk, 286, 299 Rabinowitz, 468 Rajagopalan, 303 Rajagopolan, 330 Rajgopal, 329 Ranga, 90 Rao, 63, 90, 190, 199, 561, 575, 599, 682, 690, 697 Rappaport, 381 Raqab, 610, 623 Rastrigin, 413 Raton, 154 Rauschenberg, 755, 760
Raviv, 298 Rechenberg, 389–390, 415 Reid, 687 Reidel, 416 Reiman, 15, 17, 34 Rekasius, 334, 357 Reneke, 304, 307, 329–330 Riccati, 351, 495, 497, 501–502, 504 Rickard, 268 Riemann, 73 Rijn, 33 Rinnooy Kan, 389, 411, 415 Ripley, 375, 381 Rishel, 479, 500, 511 Rissanen, 245, 247 Robbins, 37, 41, 50, 53–54, 59, 91–92, 155, 206, 222, 392, 414–415 Roberts, 381 Rockafellar, 158, 184 Rogers, 303, 330 Rogozin, 62, 91 Rohatgi, 79, 92, 705 Rojo, 617, 619–623 Ronchetti, 687 Rosenblatt, 567, 572, 576, 581, 599–600 Rosenbluth, 381, 414 Rosenthal, 381 Ross, 512, 546 Rothschild, 38, 54 Roussas, 11, 221, 561–571, 575, 577, 580–581, 583–587, 589–590, 592, 600–603 Roy, 330 Royden, 546 Rozonoer, 206, 221 Rubin, 381 Rubinstein, 157, 175, 183–184, 391, 395, 415 Ruiz, 615, 621 Runolfsson, 546 Russel, 268 Rustagi, 53, 623 Rutherford, 8–9, 686 Ryabko, 733 Rybko, 14, 34 Ryzin, 206 Saag, 115 Sacks, 54, 222 Sagar, 6 Sage, 334, 357 Said, 712, 733 Sakhanenko, 92 Saksman, 388, 413 Salas, 303, 330 Saleev, 466 Salzburg, 447 Samuelson, 330
Sanchez, 659, 682 Sanders, 371 Sanseido, 605 Sarda, 597 Saridis, 334, 356–358 Sasaki, 387, 413 Sastry, 376, 381 Satchwell, 131, 133, 153 Savits, 614, 621 Schachtel, 131, 152–153 Schapire, 246 Schevon, 388 Schmetterer, 566, 602 Schmid, 470, 473 Schoenfeld, 191, 199 Scholes, 270, 279, 282–283 Schumer, 387, 415 Schuster, 7–8, 575, 602 Schwartz, 203, 218, 221, 371 Schwefel, 385, 387, 389–390, 415 Scott, 602 Sechen, 415 Secomandi, 641, 643, 647–648 Seidman, 14, 33–34 Sen, 8 Seneta, 512 Serfling, 581, 686, 688 Sethi, 512 Shaked, 608–615, 618, 620–621, 623 Shanthikumar, 608–609, 611–612, 618, 620, 623 Shapiro, 157, 175, 178–179, 184, 712, 733 Shayman, 546 Sherwood, 1 Shiue, 467, 473 Shnidman, 690 Shorack, 415 Shrader, 131, 153 Shubert, 386, 415 Shumway, 134, 136, 138–139, 153, 690 Siegmund, 63–64, 92, 206, 222 Sijthoff, 33 Silverman, 575, 602 Simon, 512 Singer, 227, 229, 247 Singh, 330, 621–622 Sivan, 334 Skorokhod, 13, 15–16, 18, 21–23, 335, 358, 688 Skvortsov, 469 Sloan, 426, 473 Sloane, 467 Slud, 190–191, 199 Smirnov, 401, 441 Smith, 7, 690 Sniedovich, 735, 738, 740, 746–749 Snyder, 389, 411 Soldano, 154
Solis, 393, 415 Solo, 34, 374–379, 381 Soms, 566, 600 Spall, 54, 371, 685, 690, 695, 699–700, 702, 705, 707 Spanier, 468–469, 472–473 Speakes, 373 Spiegelman, 202, 207, 222 Spouge, 152 Spragins, 6 Stackelberg, 269 Stamatelos, 566, 602 Staskus, 114 Steele, 205, 222 Stegun, 128 Steiglitz, 387, 415 Stein, 205, 221, 388–389, 468 Steiner, 157, 172 Stevens, 5 Stewart, 153, 630–631, 648 Stoffer, 131, 133–134, 136–139, 148, 152–153, 652, 682 Stolyar, 14, 34 Stone, 202, 222 Stout, 247 Strang, 712, 733 Su, 330 Sueoka, 131, 153 Sugiyama, 416 Sulzer, 115 Sun, 690 Sundaram, 38, 42, 52 Sundaresan, 269–270, 273–274, 281–282 Susmel, 511 Sussman, 130, 154 Suyematsu, 329 Szidarovszky, 2, 5–9, 249, 254, 259, 268, 382, 474 Szilagyi, 8 Söderström, 371 Taguchi, 651 Takahata, 581, 602 Tan, 95–96, 98–99, 105, 109–112, 114–115, 474 Tanner, 382 Tarasenko, 393, 415 Tasaka, 416 Tashkent, 90 Taubner, 622 Tausworthe, 438, 446, 474 Taylor, 139, 190, 380, 511 Tchebycheff, 317 Tekalp, 382 Teller, 381 Teneketzis, 46, 52 Terwilliger, 153 Teugels, 83, 92 Tezuka, 438, 467–468, 474 Thomas, 547, 730, 732
Thompson, 512 Tibshirani, 686 Timmer, 414 Tkachuk, 59, 92–93 Todorovic, 3 Todorovich, 9 Tokuyama, 474 Tootill, 474 Tran, 10–11, 603 Traub, 473 Travers, 131, 141, 153–154, 726 Trifonov, 130–131, 141, 153–154 Trudeau, 628–629, 646, 648–649 Truss, 153 Tse, 512 Tsitsiklas, 32 Tsitsiklis, 512 Tsypkin, 157, 184, 247 Tuffin, 474 Tukey, 373, 682 Tutz, 198 Tyler, 153 Törn, 415 Uberbacher, 131, 154 Ung, 411 Uosaki, 416 Valderrama, 760 Van hamme, 633, 648 Van Laarhoven, 385, 388, 416 Van Ryzin, 222 Van Troung, 11 Vande Vate, 32–33 Vanderbilt, 388, 416 Vanmarcke, 328, 330 Varaiya, 38–39, 45, 52, 54, 512, 546 Vecchi, 53, 387 Venables, 690 Veraverbeke, 79, 85, 91 Verhulst, 364, 371 Verlag, 34, 114, 184, 415, 595 Vesterdahl, 383, 417 Viari, 133, 154 Vickrey, 294, 299 Vieu, 597 Villasenor, 620–621 Vinogradov, 58, 79, 93 Volterra, 253, 268 Voorthuysen, 302, 330 Vovk, 227, 248 Vychisl, 466 Vágó, 362, 370 Vázquez-Abad, 474 Wagner, 202, 207, 221, 575 Wald, 561, 603 Waldrop, 298–299 Walk, 157, 184, 203, 222, 371 Walker, 250, 267
Wallace, 302, 330 Wallnau, 330 Walrand, 38–39, 52, 54 Walsh, 455, 457–460, 464–465, 473, 651–653, 657, 659–661, 664, 667, 669, 678, 682 Walter, 692 Wang, 334, 356, 358, 474 Warmuth, 227, 229, 246–247 Wasan, 385, 416 Watson, 222, 603 Watson-Gandy, 648 Webb, 96, 99, 114 Wegman, 590, 603 Wei, 382, 617, 622 Weibull, 62–63 Weiss, 33, 96, 99, 115, 153, 561, 603 Weissman, 96, 99, 114, 391, 395, 415 Wellesley, 732–733 Wellner, 415 Wessen, 285, 298–299 Wets, 385, 393, 412 Wheeden, 219, 222 Whisenant, 139 White, 334, 512, 547 Whitney, 414 Whittle, 44, 54, 547 Widjaja, 759–760 Wiener, 18, 22, 73, 270, 326–328, 335 Wilcoxon, 416 Wiley, 32–33, 53, 90, 128, 183, 198–199, 357, 410, 415, 510–512, 596, 602 Wilfling, 620, 623 Wilks, 693 William, 7 Williams, 7, 15, 22–23, 33 Wilson, 298–299, 466 Winston, 357, 512, 749 Wishart, 690 Withers, 603 Witten, 721, 725, 733 Woeginger, 738, 749 Woham, 334 Wolfowitz, 50, 53, 402, 405, 414, 561, 603
Wong, 199 Wonham, 336, 351, 358 Wright, 629, 647 Wu, 115 Xiang, 95–96, 105, 110–112, 114–115 Xing, 431, 441, 447, 472 Yakowitz, 1, 4–11, 36, 43–44, 47–49, 51–54, 59, 68, 92–93, 152, 198, 202, 222–223, 247–248, 259, 268, 371, 381–384, 392, 394, 400–401, 412, 416–417, 422, 474, 476, 512–513, 517, 546–547, 555, 557, 574–575, 594–596, 598–599, 603–604, 685–686, 735, 737–738, 749, 751–752, 760 Yamato, 575, 583, 604 Yan, 302, 330 Yang, 34, 115, 227, 248, 510, 513, 561, 631–633, 636–637, 647, 649, 719, 721, 725, 729–733 Yao, 609 Yates, 651–652, 655, 657, 659, 661, 670, 672, 675–676, 678, 681–682 Yatracos, 585, 602 Ye, 96, 98–99, 109–110, 115 Yildizoglu, 298–299 Yin, 34, 53, 511–514 Ying, 44, 54 Yokoyama, 581, 604 Yoshihara, 581, 604–605 Yu, 469 Yudin, 156, 184 Zakharov, 389 Zaremba, 442–443, 471 Zeevi, 245–247 Zeger, 198–199 Zeitouni, 58, 91, 511 Zeller, 203, 223 Zhang, 32, 493, 498, 510–514 Zhigljavsky, 385, 390, 412, 417 Zhou, 331 Zhurkin, 131, 154 Zinterhof, 472 Zirilli, 411 Ziv, 732–733 Zolotarev, 85, 93 Zupancic, 114 Zygmund, 219