OPTIMIZATION METHODS FOR APPLICATIONS IN STATISTICS

James E. Gentle
George Mason University

© 2004 by James E. Gentle. All Rights Reserved.
Preface Optimization of functions, that is, minimization or maximization, is ubiquitous in statistical methods. Many methods of inference are built on the principle of maximum likelihood, in which an assumed probability density function or a probability function is optimized with respect to its parameters, given the observed realizations of the random variable whose distribution is described by the function. Other methods of inference involve fitting of a model to observed data in such a way that the deviations of the observations from the model are a minimum. Many important methods of statistics are used before any inferences are made. In fact, ideally, statistical methods are considered before data are collected, and the sampling method or the design of the experiment is determined so as to maximize the value of the data for making inferences, usually by minimizing the variance of estimators. The first three chapters of this book are introductory. Chapter 1 poses common statistical techniques as optimization problems. Chapter 2, on computer arithmetic, is somewhat detailed, and perhaps can be skimmed so long as its main point is understood and remembered: computer arithmetic is different from arithmetic on real numbers. Chapter 3 provides some basic definitions and discusses some important properties of continuous functions that are used in subsequent chapters. Chapters 4 and 5 continue with the focus on continuous functions, generally twice-differentiable functions. These functions occur in common optimization problems, and the ideas underlying the methods that address optimization of these functions have more general applicability. 
Because optimization of differentiable functions generally involves the solution of a system of equations, Chapter 4 covers basic methods for solving a system of equations, or finding the “roots of the equations”, and Chapter 5 covers the optimization techniques themselves, including ones that can be used when derivatives are not available. Except for the simple (but important) case when the equations are linear, the methods of optimization are iterative, with the solution being approached through a sequence of steps. In Chapter 6, we consider a class of optimization problems for functions over a discrete domain, and therefore methods based on iterative steps in a dense domain are not applicable. These are often called combinatorial optimization problems, because the solutions are combinations of discrete values. The methods are iterative, with the solution being approached through a sequence of steps, which are usually, but not necessarily, restricted to the discrete domain of the problem. Chapter 7 discusses optimization under constraints. Chapter 8 addresses two essentially unrelated issues in optimization: multiple optima, and multiple criteria for which optimization is to be performed. The brief Chapter 9 discusses computer software for optimization. Finally, in Chapter 10, we discuss optimization in various problems in statistics. In some sense, this chapter is just a more detailed version of Chapter 1 to show how the methods of the intervening chapters can be applied.

The mathematical prerequisites for this text include analysis and linear algebra at a level normally attained through an undergraduate program of study in statistics, mathematics, or other natural sciences. The text assumes some familiarity with programming, although if this is lacking, the reader can generally achieve the requisite level by conscientious attention to the programming exercises.

The text is written in a narrative style. “Theorems”, “Propositions”, and “Lemmas” occur in the text without special designation and numbers, and “Proofs” occur without my telling the reader that a Proof is being given. This possibly detracts from the usefulness of the book as a reference, but I believe the narrative flows more smoothly. It is hoped that the reader will remain mentally engaged without the necessity of being alerted to a Theorem with a capital “T”.

No particular software system is used in this book, but in some exercises either Fortran or C is required, as they are in addressing many serious problems in optimization. Libraries in these languages are very useful, and the text often refers to routines in the IMSL Libraries. The text uses R, S-Plus, and Matlab in some examples, and describes the facilities in those packages for the methods discussed.
It would be useful for the reader to have access to one or both of these packages, but other packages such as PV-Wave could serve as well. Some exercises require use of a package that performs symbolic manipulation, such as Maple or Mathematica. The bibliography refers to sources for software, much of which is readily accessible over the Internet. The exercises comprise an important part of the text. In some cases only a pencil is required. In other cases computing is necessary and the solution can be presented as simple computer output. Some exercises require an exposition in a few paragraphs; other exercises require several pages of discussion. In the latter type of exercise, often a computational study must be designed and conducted.
A Word about Notation I try to be very consistent in notation. Most of the notation is “standard”. Appendix B contains a list of notation, but a general summary here may be
useful. Terms that represent mathematical objects, such as variables, functions, and parameters, are generally printed in an italic font. The exceptions are the standard names of functions, operators, and mathematical constants, such as sin, log, E (the expectation operator), d (the differential operator), e (the base of the natural logarithm), and so on. I tend to use Greek letters for parameters and English letters for almost everything else, but in a few cases, I am not consistent in this distinction. I do not distinguish vectors and scalars in the notation; thus, “x” may represent either a scalar or a vector, and xi may represent either the ith element of an array or the ith vector in a set of vectors. I use uppercase letters for matrices and the corresponding lowercase letters with subscripts for elements of the matrices. I generally use uppercase letters for random variables and the corresponding lowercase letters for realizations of the random variables. Sometimes I am not completely consistent in this usage, especially in the case of random samples and statistics.
Acknowledgements

I thank my colleagues in the School of Computational Sciences at George Mason University for many useful discussions on computational statistics. I thank John Kimmel of Springer for his encouragement and advice on this book and other books he has worked with me on. I also thank the reviewers for their comments and suggestions. I thank my wife María, to whom this book is dedicated, for everything.

I used TeX via LaTeX to write the book, and I used S-Plus and R to generate the graphics. I did all of the typing, programming, etc., myself, so all mistakes are mine. I would appreciate receiving notice of errors as well as suggestions for improvement. Notes on this book, including errata, are available at http://www.scs.gmu.edu/~jgentle/optbk/
Fairfax County, Virginia
James E. Gentle May 27, 2004
Contents

Preface

1 Statistical Methods as Optimization Problems

2 Numerical Computations
  2.1 Computer Storage and Manipulation of Data
    2.1.1 The Floating-Point Model for the Reals
    2.1.2 The Fixed-Point Number System
  2.2 Numerical Algorithms and Analysis
    2.2.1 Error in Numerical Computations
    2.2.2 Efficiency
  Exercises

3 Basic Definitions and Properties of Functions
  3.1 Shapes of Functions
  3.2 Stationary Points of Functions
  3.3 Function Spaces
    3.3.1 Inner Products and Norms
    3.3.2 Hilbert Spaces
  3.4 Approximation of Functions
  Exercises

4 Finding Roots of Equations
  4.1 Linear Equations
    4.1.1 Direct Methods
    4.1.2 Iterative Methods
  4.2 Nonlinear Equations
    4.2.1 Basic Methods for a Single Equation
    4.2.2 Systems of Equations
  Exercises

5 Unconstrained Descent Methods in Dense Domains
  5.1 Direction of Search
  5.2 Line Searches
  5.3 Steepest Descent
  5.4 Newton’s Method
  5.5 Accuracy of Optimization Using Gradient Methods
  5.6 Quasi-Newton Methods
  5.7 Fitting Models to Data Using Least Squares; Gauss-Newton Methods
  5.8 Iteratively Reweighted Least Squares
  5.9 Conjugate Gradient Methods
  5.10 The EM Method and Some Variations+
  5.11 Fisher Scoring+
  5.12 Stochastic Search Methods
  5.13 Derivative-Free Methods
    5.13.1 Nelder-Mead Simplex Method
    5.13.2 Price Controlled Random Search Method
    5.13.3 Ralston-Jennrich Dud Method for Least Squares
  5.14 Summary of Continuous Descent Methods
  Exercises

6 Unconstrained Combinatorial Optimization; Other Direct Search Methods
  6.1 Simulated Annealing
  6.2 Evolutionary Algorithms
  6.3 Guided Direct Search Methods
  6.4 Neural Networks+
  6.5 Other Combinatorial Search Methods
  Exercises

7 Optimization under Constraints
  7.1 Constrained Optimization in Dense Domains
  7.2 Constrained Combinatorial Optimization
  Exercises

8 Multiple Extrema and Multiple Objectives
  8.1 Multiple Extrema and Global Optimization
  8.2 Optimization with Multiple Criteria
  8.3 Optimization under Soft Constraints
  Exercises

9 Software for Optimization
  9.1 Fortran and C Libraries
  9.2 Optimization in General-Purpose Interactive Systems
  9.3 Software for General Classes of Optimization Problems
  9.4 Modeling Languages and Data Formats
  9.5 Testbeds for Optimization Software
  Exercises

10 Applications in Statistics
  10.1 Fitting Models with Data
  10.2 Fitting by Minimizing Residuals
    10.2.1 Statistical Inference Using Least Squares+
    10.2.2 Fitting Using Other Criteria for Minimum Residuals+
    10.2.3 Fitting by Minimizing Residuals while Controlling Influence+
    10.2.4 Fitting with Constraints+
    10.2.5 Subset Regression; Variable Selection+
    10.2.6 Multiple Criteria Fitting+
  10.3 Maximum Likelihood Estimation
    10.3.1 Maximum Likelihood Estimation with Constraints
  10.4 Optimal Design and Optimal Sample Allocation
    10.4.1 D-Optimal Designs+
    10.4.2 Optimal Sample Allocation
  10.5 Clustering and Classification*
  10.6 Multidimensional Scaling
  10.7 Time Series Forecasting*
  Exercises

A Solutions and Hints for Selected Exercises

B Notation and Definitions

Bibliography
  Literature in Statistical Computing
  World Wide Web, News Groups, List Servers, and Bulletin Boards
  References for Software Packages
  References to the Literature

Author Index

Subject Index
Chapter 1
Statistical Methods as Optimization Problems

Optimization problems, that is, maximization or minimization problems, arise in many areas of statistics. Statistical estimation and modeling are both usually special types of optimization problems. In a common method of estimation, we maximize a likelihood, which is a function proportional to a probability density at the point of the observed data. In another method of estimation and in standard modeling techniques, we minimize a norm of the residuals. The best fit of a model is often defined in terms of a minimum of a norm, such as least squares. Prior to collection of data, we design an experiment or a survey so as to minimize experimental or sampling errors.

Some of the simpler and more common optimization problems in statistics can be solved easily, perhaps by solving a system of linear equations. Many other problems, however, do not have closed form solutions, and the solutions must be approximated by iterative methods.

Statistical Models

A general form of a statistical model is

$$y = f(x; \theta) + \epsilon, \tag{1.1}$$

in which $y$ and $x$ represent observable variables, $\theta$ represents a vector of parameters with unknown and unobservable values, $f$ is some given function or function within a given class of functions, and $\epsilon$ is an unobservable residual or error, usually assumed to have some random distribution. (Notice I do not use special notation to distinguish vectors from scalars; in this model it is likely that $x$ is a vector.)

A basic problem in data analysis is to fit this model using observed data. Fitting the model involves estimation of $\theta$. The more formal methods of statistical inference involve estimation of $\theta$ as a preliminary step. First, we decide
on a method of estimation, and then after estimating $\theta$, we describe properties of the estimate and make inferences about $y$ and its relationship with $x$.

The most familiar form of this model is the linear model

$$y = x^\top \beta + \epsilon, \tag{1.2}$$
where $x$ and $\beta$ are vectors. In this model $\beta$ is a fixed, but unknown quantity. The problem of fitting the model is to estimate $\beta$. To develop an approach to the estimation of $\beta$, we first assign a variable to take the place of $\beta$.

Least Squares of Residuals

One approach to fitting the model (1.2) is least squares. The method requires a set of observations, $y_i$ and $x_i$. For each pair of observations, we form a residual, $r_i(b) = y_i - x_i^\top b$, in terms of a variable $b$ in place of the unknown estimand $\beta$. Using $n$ observations, the ordinary least squares estimator of $\beta$ is the solution to the optimization problem

$$\min_b \sum_{i=1}^n (r_i(b))^2. \tag{1.3}$$
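A minimal numerical sketch of problem (1.3) may help fix ideas. It assumes the simplest linear model, a single regressor plus an intercept, so that the minimizer can be written directly from the normal equations. The function names and the data are hypothetical and chosen only for illustration.

```python
# Ordinary least squares for the line y = b0 + b1*x.
# Assumption: one regressor plus an intercept; the data are made up.

def ols_line(x, y):
    """Return (b0, b1) minimizing sum((y_i - b0 - b1*x_i)**2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope from the normal equations
    b0 = ybar - b1 * xbar     # intercept from the normal equations
    return b0, b1

def sum_sq(b0, b1, x, y):
    """Objective function of problem (1.3) for this model."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]
b0, b1 = ols_line(x, y)
print(b0, b1, sum_sq(b0, b1, x, y))
```

Both coefficients are linear combinations of the $y_i$, and moving either one away from the computed value can only increase the sum of squares.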
This optimization problem is relatively simple, and its solution can be expressed in closed form as the solution of a system of linear equations. The solution to the optimization problem is a linear combination of the $y_i$, and under flexible assumptions about the probability distribution of the random error term, some simple statistical properties of the estimator are relatively easy to determine. Furthermore, it is easy to see that this estimator is optimal in a certain sense among a broad class of estimators. If the distribution of the random error term is normal, even more statistical properties of the estimator can be determined.

A variation of this optimization problem arises when the original model (1.1) is nonlinear. Again, we form the residuals, $y_i - f(x_i; t)$, and the nonlinear least squares estimate for $\theta$ is the solution to the optimization problem

$$\min_t \sum_{i=1}^n (y_i - f(x_i; t))^2, \tag{1.4}$$
where we are using the vector $t$ as a variable in place of the fixed but unknown $\theta$. This least squares problem is much more difficult both computationally and conceptually than the linear least squares problem. In general, there is no closed form solution.

[Figure 1.1: Residuals in a Linear and a Nonlinear Model. Two panels, each plotting y against x.]

[Figure 1.2: Contours of Sums of Squares Functions for a Linear and a Nonlinear Model. Two panels, each plotting b1 against b2.]

Minimizing Residuals

The idea of minimizing the residuals from the observed data in the model is intuitively appealing. Because there is a residual at each observation, however, “minimizing the residuals” is not well-defined without additional statements. When there are several things to be minimized, we must decide on some way of combining them into a single measure. It is then the single measure that we seek to minimize. In the least squares problems mentioned above, we merely sum the squares of the individual residuals. This seems reasonable.

Next we note the obvious: some residuals are positive and some are negative, so our objective cannot be to minimize them directly, but rather to minimize some function of their absolute values. When there is only a single quantity to minimize, minimizing any increasing function of that quantity, such as its square if the quantity is nonnegative, is equivalent to minimizing the quantity itself. We would arrive at the same point (that is, the same value of the variable over which the minimization is performed) if we minimized some other increasing function of that quantity.
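The following sketch illustrates this point numerically, and also what changes once several residuals are summed. It assumes the simplest possible model, $f(x; t) = t$, a single location parameter, and finds the minimizers by a crude grid search. The data and the grid are illustrative assumptions.

```python
# Minimizers of the sum of squared and the sum of absolute residuals
# for the model f(x; t) = t. Data and grid are made up for illustration.

y = [1.0, 2.0, 3.0, 4.0, 10.0]

def sum_squares(t):
    return sum((yi - t) ** 2 for yi in y)

def sum_abs(t):
    return sum(abs(yi - t) for yi in y)

# Candidate values of t: 0.00, 0.01, ..., 12.00
grid = [i / 100 for i in range(0, 1201)]

# One residual only: squaring does not move the minimizer.
single_sq = min(grid, key=lambda t: (y[0] - t) ** 2)
single_abs = min(grid, key=lambda t: abs(y[0] - t))

# Several residuals summed: the two criteria give different points.
t_sq = min(grid, key=sum_squares)   # the sample mean
t_abs = min(grid, key=sum_abs)      # the sample median

print(single_sq, single_abs, t_sq, t_abs)
```

With a single residual both criteria lead to the same point. With five residuals, the sum of squares is minimized at the mean (4.0) and the sum of absolute values at the median (3.0).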
If, however, we are to minimize a sum of several quantities, applying a given increasing function to each quantity prior to summing may result in a different point of minimization than if we apply some other increasing function.

For the general objective of minimizing the residuals, therefore, we have alternatives. Instead of minimizing the sum of their squares, we may fit the model by minimizing some other quantity, resulting in an optimization problem such as

$$\min_t \sum_{i=1}^n \rho(y_i - f(x_i; t)), \tag{1.5}$$
where $\rho(\cdot)$ is some function that increases in $|y_i - f(x_i; t)|$. Depending on $\rho(\cdot)$, this problem may be much more difficult both computationally and conceptually than the least squares problem. One common choice of $\rho$ is just the absolute value itself, and the problem of fitting the model is the optimization problem

$$\min_t \sum_{i=1}^n |y_i - f(x_i; t)|. \tag{1.6}$$
There is no closed form solution to this simple problem.

Finally, we might reconsider how we choose to combine several individual residual values into a single measure. Simply summing them seems obvious. In practice, however, there may be other considerations. We may want to treat some residuals differently from others. This may be because we consider the observation on which a particular residual is based to be more precise than some other observation; therefore we may choose to give that residual more weight. Alternatively, we may realize that some observations do not seem to fit our model the same way most of the other observations fit the model; therefore, we may adaptively choose to give those residuals less weight. These considerations lead to a slightly more general formulation of the problem of fitting the statistical model by minimizing residuals, resulting in the optimization problem

$$\min_t \sum_{i=1}^n w(y_i, x_i, t)\, \rho(y_i - f(x_i; t)), \tag{1.7}$$
where $w(y_i, x_i, t)$ is a nonnegative function. Because in practice the weight in this minimization problem is usually not explicitly a function of $y_i$, $x_i$, and $t$, we often write $w(y_i, x_i, t)$ as a simple fixed weight, $w_i$.

A common instance of problem (1.7) is the weighted linear least squares problem with fixed weights, in which the function to be minimized is

$$\sum_{i=1}^n w_i (y_i - x_i^\top b)^2.$$
The weights do not materially change the complexity of this problem. It has a closed form solution, just as the unweighted (or equally-weighted) problem (1.3) does.
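As a hedged illustration of that closed form, the sketch below again assumes a single regressor plus an intercept and uses fixed weights $w_i$. The function name, the data, and the weights are hypothetical.

```python
# Weighted least squares for the line y = b0 + b1*x with fixed weights.
# Assumption: one regressor plus an intercept; data and weights are made up.

def wls_line(x, y, w):
    """Return (b0, b1) minimizing sum(w_i * (y_i - b0 - b1*x_i)**2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar)
              for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

w_equal = [1.0] * 5                  # equal weights: ordinary least squares
w_down = [1.0, 1.0, 1.0, 1.0, 0.1]   # last observation is downweighted

b_equal = wls_line(x, y, w_equal)
b_down = wls_line(x, y, w_down)
print(b_equal, b_down)
```

With equal weights the result coincides with the ordinary least squares fit. Changing the weights changes the estimate but not the nature of the computation.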
Maximum Likelihood Estimation

Another way of fitting the model $y = f(x; \theta) + \epsilon$ is by maximizing the likelihood function that arises from the probability distribution of $\epsilon$. Given the data, this is the optimization problem

$$\max_t \prod_{i=1}^n p(y_i - f(x_i; t)), \tag{1.8}$$
where $p(\cdot)$ is the probability function or the probability density function of the random error. Again we are using the vector $t$ as a variable in place of the fixed but unknown vector $\theta$. Optimization problems of this type can be quite formidable computationally. Although the statistical properties of the maximum likelihood estimators are often quite difficult to work out, they generally have relatively simple asymptotic behavior.

Experimental Design

Other optimization problems in statistics arise in optimal design of experiments and in the construction of optimal sampling plans. In design of experiments, we often assume a linear relationship between $y$ and an $m$-vector $x$, and we anticipate collecting $n$ observations, $(y_i, x_i)$, into an $n$-vector $y$ and an $n \times m$ matrix $X$. We may express the relationship as

$$y = \beta_0 1 + X\beta + \epsilon.$$

Under the assumption that the residuals are independently distributed with a constant variance, $\sigma^2$, the variance-covariance matrix of estimable linear functions of the least squares solution is formed from $(X^\top X)^{-} \sigma^2$. Because we may be able to choose the values in $X$, that is, the settings at which we make observations, we may attempt to choose them in such a way as to minimize variances of certain estimators. The variance of a particular estimator is minimized by maximizing some function of $X^\top X$. There are various ways we may attempt to minimize the variances of a collection of estimators. A common method in experimental design results in the optimization problem

$$\max_{\text{all factor settings}} \det(X^\top X). \tag{1.9}$$
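A toy instance of problem (1.9) can be solved by complete enumeration. The sketch assumes a single factor restricted to three hypothetical levels, four observations, and a design matrix whose rows are $(1, x_i)$, that is, with the intercept column included. For that matrix $\det(X^\top X) = n \sum x_i^2 - (\sum x_i)^2$.

```python
from itertools import combinations_with_replacement

def det_xtx(xs):
    """det(X'X) when the rows of X are (1, x_i)."""
    n = len(xs)
    return n * sum(x * x for x in xs) - sum(xs) ** 2

levels = [-1.0, 0.0, 1.0]   # hypothetical allowed factor settings
n_obs = 4                   # hypothetical number of observations

# The order of the runs does not affect X'X, so a design is a multiset.
designs = list(combinations_with_replacement(levels, n_obs))
best = max(designs, key=det_xtx)

print(len(designs), best, det_xtx(best))
```

Here only 15 designs exist, and the best one puts two observations at each extreme level. The number of candidate designs grows quickly with the number of factors, levels, and observations, which is why enumeration is rarely practical.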
If there are many possible values of $X$, this may be a difficult computational problem.

Optimal Sampling Design

In designing a sampling plan, we are often presented with the problem of allocating the sample sizes $n_h$ and $m_h$ within various strata and across multiple stages. For given population sizes $N_h$ and known within-strata variances $v_{1h}$ and $v_{2h}$, the optimization problem has the form

$$\min_{n_h, m_h} \; \sum_h N_h \left( \frac{N_h}{n_h} - 1 \right) v_{1h} + \sum_h \frac{N_h^2}{n_h m_h}\, v_{2h}. \tag{1.10}$$
Clustering and Classification

Less formal statistical methods also use optimization. In K-means clustering, for example, we seek a partition of a dataset into a preset number of groups $k$ that minimizes the variation within each group. Each variable may have a different variation, of course. The variation of the $j$th variable in the $g$th group is measured by the within sum-of-squares:

$$s^2_{j(g)} = \frac{\sum_{i=1}^{n_g} \left( x_{ij(g)} - \bar{x}_{j(g)} \right)^2}{n_g - 1},$$
where $n_g$ is the number of observations in the $g$th group, and $\bar{x}_{j(g)}$ is the mean of the $j$th variable in the $g$th group. For data with $m$ variables there are $m$ such quantities. In K-means clustering the optimization problem is

$$\min_{\text{all partitions}} \sum_{g=1}^k \sum_{j=1}^m s^2_{j(g)}. \tag{1.11}$$
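For a very small dataset, problem (1.11) can be solved by examining every partition. The sketch below is only an illustration of the objective function, not a practical K-means algorithm. It assumes $k = 2$, made-up data with $m = 2$ variables, and, because of the divisor $n_g - 1$, it excludes partitions that contain a group with fewer than two observations.

```python
from itertools import product

def within_variation(rows):
    """Sum over variables j of s^2_{j(g)} for one group (needs >= 2 rows)."""
    n = len(rows)
    m = len(rows[0])
    total = 0.0
    for j in range(m):
        mean = sum(r[j] for r in rows) / n
        total += sum((r[j] - mean) ** 2 for r in rows) / (n - 1)
    return total

def objective(data, labels, k):
    """Objective of problem (1.11) for the partition given by labels."""
    groups = [[r for r, g in zip(data, labels) if g == h] for h in range(k)]
    if any(len(g) < 2 for g in groups):
        return float("inf")   # assumption: such partitions are excluded
    return sum(within_variation(g) for g in groups)

# Made-up data: six observations on two variables.
data = [(1.0, 2.0), (1.2, 1.9), (0.8, 2.2),
        (5.0, 6.0), (5.3, 6.1), (4.9, 5.8)]
k = 2

# Fix the label of the first observation so that each partition is
# counted once rather than once per relabeling of the groups.
candidates = [lab for lab in product(range(k), repeat=len(data))
              if lab[0] == 0]
best = min(candidates, key=lambda lab: objective(data, lab, k))

print(best, objective(data, best, k))
```

The search recovers the two visually obvious groups. The number of partitions grows so quickly with the number of observations that enumeration is feasible only for toy problems.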
When groups or classes are known, the problem of determining to which group a given observation belongs is called “classification”. In classification, we determine optimal discriminators to define class membership.

Domain of an Optimization Problem

We can observe an important difference between two groups of the examples above. In problems similar to (1.3) or (1.8), we have assumed that $\beta$ or $\theta$ could be any real number (or vector), and so the variable $b$ or $t$ in the optimization problem could range over the real numbers. We say the domain for each variable is the reals. In the optimal experimental design problem (1.9), the domain is the set of $n \times m$ real matrices. In the sampling allocation problem (1.10), on the other hand, the values of $n_h$ and $m_h$ must be nonnegative integers. The domain of problem (1.11) is also different; it is the collection of all partitions of a set. In an important way, however, the domain of problem (1.11) is similar to the domain of nonnegative integers; in fact, a reasonable approach to solving this problem depends on an indexing of the partitions using integers.

The domains of these various optimization problems are very different; in one case the domain is dense, and in the other case the domain is countable, in fact, it is finite.
Optimization Problems with Constraints

An important type of variation on the optimization problems occurs when the model specifies that the unknown parameter is in some given subregion of Euclidean space; that is, the parameter space may not be the full Euclidean space. In this case, we generally constrain the variable that we substitute for the parameter also to be in the parameter space. Instead of proceeding as in the examples above, we formulate an optimization problem with constraints. For example, if it is known that $\beta$ in the model (1.2) is in some given region $S$, instead of the unconstrained optimization problem (1.3), we have the constrained optimization problem

$$\begin{array}{ll} \min_b & \sum_{i=1}^n (y_i - x_i^\top b)^2 \\ \text{s.t.} & b \in S. \end{array} \tag{1.12}$$
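A one-parameter sketch of problem (1.12) shows how a constraint can change the solution. It assumes regression through the origin, $y_i = x_i b + \epsilon_i$, and takes $S$ to be a closed interval. With a single parameter the objective is a convex quadratic in $b$, so the constrained minimizer is the unconstrained one clipped to the interval. This shortcut is an assumption of the sketch and does not carry over to a general region $S$ or to several parameters. The data and the interval are made up.

```python
# Least squares through the origin with b constrained to [lower, upper].
# The clipping step is valid only because there is a single parameter
# and the objective is a convex quadratic in b.

def sum_sq(b, x, y):
    return sum((yi - xi * b) ** 2 for xi, yi in zip(x, y))

def constrained_ls(x, y, lower, upper):
    """Return (unconstrained, constrained) least squares values of b."""
    b_unc = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    b_con = min(max(b_unc, lower), upper)   # project onto the interval S
    return b_unc, b_con

x = [1.0, 2.0, 3.0]
y = [1.5, 2.9, 4.6]
b_unc, b_con = constrained_ls(x, y, 0.0, 1.0)   # S = [0, 1], hypothetical

print(b_unc, b_con, sum_sq(b_unc, x, y), sum_sq(b_con, x, y))
```

The unconstrained minimizer lies outside $S$, so the constrained solution sits on the boundary of $S$ and has a larger sum of squares.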
These constraints make the simple least squares problem much more difficult, both computationally and conceptually. The statistical properties of a least squares estimator subject to these constraints are not easy to determine.

Any of the other optimization problems we formulated above for estimating parameters can also be modified to accommodate constraints on the parameters. The problem of designing a sampling plan, for example, often includes constraints on the total cost of conducting the survey or constraints on the coefficient of variation of the estimators. Instead of the unconstrained optimization problem (1.10), we have an optimization problem whose solution is subject to some constraints.

Optimization of Multiple Objectives

In most practical applications of optimization, there is more than one objective. A simple example of this is the general optimization problem of minimizing the residuals in order to fit a model. Because there are many residuals, we must decide how we want to minimize them all simultaneously. Of course, the obvious solution to this quandary is just to minimize the sum (of some function) of the residuals. Even in this simple example, however, there may be reasons to combine the residuals differentially, as in weighted regression.

In a problem of optimal sampling design the general objective is to minimize the variance of an estimator. Realistically, there are many estimates that result from a single survey; there are several attributes (estimands) crossed with several strata. The question of how to minimize the variances of all within-strata estimators in a single sampling design requires consideration of the relative importance of the estimates, any constraints on variances and/or coefficients of variation, and constraints on the cost of the survey.

There are various ways of accommodating multiple objectives and constraints. The simplest, of course, is to form a weighted sum. Constraints can also be incorporated as a weighted component of the objective function. In most cases, a certain amount of interaction between the decision maker and the optimization procedure is required.

Formulation of an Optimization Problem

The general examples above represent a variety of optimization problems that arise in statistics. Notice the basic procedure in the estimation problems. We begin with a model or a probability statement with fixed but unknown parameters and variables that are realizations of random variables. We then define a minimization or maximization problem in which the realizations of the random variables are fixed, and variables are substituted for the unknown parameters. We call the function being minimized or maximized in an optimization problem the objective function.

We have seen examples whose objective functions have domains that are dense and others with domains that are countable. We have seen examples with domains that are unbounded and others with domains that are constrained.

In formulating a statistical problem as an optimization problem, we must be careful not to change the statistical objectives. The objective function and the constraints should reflect the desired statistical methodology. If a different or incorrect optimization problem is formulated because of computational considerations, we should at least be aware of the effect of the change. An example in the literature of how available software can cause the analyst to reformulate the objective function began with the problem of fitting a linear regression model with linear constraints; that is, a problem like (1.12), in which the constraints on $b$ were of the form $Ab \le c$.
It turns out that a optimization problem like (1.6), that is, least absolute values regression, with f (xi ; b) = x> i b, that is linear regression, and with constraints of the form Ab ≤ c, can be formulated as a linear programming problem (see Charnes, Cooper, and Ferguson, 1955) and solved easily using available software. At the time, there was no readily available software for constrained least squares regression, so the reformulated problem was solved. This reformulation may or may not have undesirable side effects on the overall analysis. The solution to an optimization problem is in some sense “best” for that problem and its objective function. This fact may mean that the solution is considerably less good for some other optimization problem. It is often the case, therefore, that an optimal solution is not robust to assumptions about the phenomenon being studied. Use of optimization methods is likely to magnify the effects of any assumptions. Optimization Methods The methods of optimization depend on the nature of the functions involved. The first consideration is the domain of the function. Many functions of interest have continuous (dense) domains, and many of the methods of optimization are developed for such functions. Many methods for such functions also assume
the functions are differentiable, so for functions with dense domains, the differentiability of the functions is of primary concern. The simplest differentiable functions to work with are polynomials, and among these, linear and quadratic functions are particularly simple. Optimization methods for differentiable functions are often developed using a quadratic function as a prototype function. Many functions of interest have discrete domains. Points in the domain are combinations of allowable values of a set of variables, and the problem is one of combinatorial optimization. For either discrete or dense domains, restrictions within the domain can make a relatively simple problem much more difficult. There are many problems that can arise in using optimization methods in statistics. One of the first, which we discussed above, is the incorrect formulation of the statistical problem as an optimization problem. Once the problem is formulated correctly, we must be careful in the methods used to solve the problem. A simple example of this occurs in maximum likelihood estimation, that is, solving the optimization problem (1.8) with the objective function

$$L(t; y, x) = \prod_{i=1}^{n} p(y_i - f(x_i; t)).$$

From elementary calculus we know that if
$L(t; y, x)$ is differentiable in $t$, and if it has a maximum at $t^*$ for given $y$ and $x$, then $\partial L(t; y, x)/\partial t\,|_{t^*} = 0$. The equation $\partial L(t; y, x)/\partial t = 0$ is called the “likelihood equation”, and a solution to it, $t_0$, is called a root of the likelihood equation (RLE). The fact from calculus just quoted, however, does not say that an RLE yields a maximum of $L$; the RLE can be just any stationary point, including a minimal point. We can ensure that an RLE is a local maximum of $L$ if we find the matrix of second derivatives to be negative definite at the RLE, but another problem in using an RLE as an MLE (maximum likelihood estimate) is that the maximum of the likelihood over the parameter space may occur on the boundary of the space, and the derivatives there may not be zero. There are many other problems that can arise in using optimization methods. For example, there may be more than one point of optimality. This kind of problem brings into question the formulation of the statistical method as an optimization problem.

Software for Optimization

Problems often occur because of numerical problems in the optimization algorithms used. Optimization problems can be notoriously ill-conditioned, due to, among other things, highly variable scaling in the variables. (An ill-conditioned problem can be described simply as one in which small changes in the input or in computations at intermediate stages may result in large changes in the solution.) Although good computational algorithms are known for a wide range of optimization problems and stable software packages implementing those algorithms are available, in the current state of the science, we cannot assume that computer software can solve a general optimization problem without some intervention by the user. For many classes of optimization problems, in fact, the required level of user interaction is quite high.

Literature in Optimization

Because of the differences in optimization problems resulting from differences in the properties of the functions and of their domains, the research literature in optimization is quite diverse. Some journals that emphasize optimization, especially the computational aspects, are Journal of Optimization Theory and Applications, Journal of Global Optimization, Mathematical Programming, Optimization Methods and Software, and SIAM Journal on Optimization. In addition, journals on numerical analysis, operations research, statistics, or various fields of application often contain articles on optimization. There are also many books on various aspects of optimization.

The Next Chapters

Before we can discuss numerical optimization of functions properly, we need to understand some general principles of computer arithmetic, and some basic ways of dealing with continuous functions on the computer. We therefore begin with some discussion of numerical computations on computers (Chapter 2), and then we consider some fundamental properties of functions (Chapter 3) and the solution of systems of equations (Chapter 4). Solution of systems of equations is a subtask in many algorithms for optimization. In Chapters 5 through 8 we discuss the main topic of the book, that is, optimization. We first describe methods of optimization of continuous (and usually, twice-differentiable) functions; then we discuss optimization of graphs or functions over countable domains; next we consider the problem of optimization when the set of acceptable solutions is subject to constraints; and finally we consider problems in which there are multiple objective functions. In Chapter 9 we discuss software for optimization. The final chapter, Chapter 10, returns to the general topic of the present introductory chapter, that is, applications in statistics.
In that chapter, we consider the applications and the optimization methodology in more detail.
Chapter 2
Numerical Computations

Data may be numbers, text, or images. For each type of data, there are several ways of coding that can be used to store the data electronically, and specific ways the data may be manipulated. Our main interest in optimization is in numeric data and how computations are done with numeric data. There are standards for the representation of numeric data and for operations on the data. The Institute of Electrical and Electronics Engineers (IEEE) has been active in promulgating these standards, and the standards themselves are designated by IEEE numbers. In Section 2.1, we emphasize some of the important properties of computer numbers and computer arithmetic that often affect numerical computations in surprising ways. There are many details about computer arithmetic that we will skip, however, and the interested reader is referred to Gentle (1998), Chapter 1, for more details about most of the topics in this chapter. In Section 2.2 we consider some general properties of computer algorithms. Again, we skip many of the details.
2.1 Computer Storage and Manipulation of Data
In order to represent a numeric quantity in a fixed and relatively small number of digits or bits, the two most relevant things are the magnitude of the number and the precision to which the number is to be represented. Whenever a set of numbers is to be used in the same context, we must find a method of representing the numbers that will accommodate their full range and will carry enough precision for all of the numbers in the set and for whatever operations we may perform on the numbers. Practical design considerations require that the data storage area of the computer be divided into units of fixed size to represent individual numeric quantities. The fixed grouping of bits used to represent a single number in a computer is called a “word”, or a “storage unit”. The fixed length of storage units commonly used in computers is either 32 or 64 bits.
The most important mathematical structure in most optimization applications is the field of reals, $\mathbb{R}$, and so we need a system of computer numbers that behaves similarly to the reals.
2.1.1 The Floating-Point Model for the Reals
Elements in a large subset of real numbers are generally represented in a scheme similar to what is called “scientific notation”, or in a type of logarithmic notation. Elements of the reals not in this subset are approximated by elements in the subset. This representation requires a base or radix, a mantissa or significand, and an exponent. If the radix is $b$, and integer “digits” $d_i$ are such that $0 \le d_i < b$, and there are enough bits in the significand to represent $p$ digits, then a real number is approximated by

$$\pm 0.d_1 d_2 \cdots d_p \times b^{e}, \tag{2.1}$$

where $e$ is an integer. This is the standard model for the floating-point representation. (The $d_i$ are called “digits” from the common use of base 10.) In current practice, $b$ is almost always 2, and so the $d_i$ can be represented in bits. Because within a fixed number of digits, the radix point is not fixed, this scheme is called floating-point representation, and the set of such numbers is denoted by $\mathbb{F}$. The notation $\mathbb{F}$ is also used to denote the system built on this set. The system consists of the set of floating-point numbers together with rules for combining the numbers in ways that simulate mathematical operations such as addition and multiplication. A floating-point number is also sometimes called “real”. The floating-point computer numbers, although they may be called “real”, do not correspond to the real numbers in a natural way. In particular, the set of floating-point numbers is finite; a given real number, even a rational, of any size may or may not have an exact representation by a floating-point number; and the floating-point numbers do not occur uniformly over the real number line. Figures 2.1 and 2.2 illustrate this.

[Figure 2.1: The Floating-Point Number Line, Nonnegative Half. Labeled points: $0$, $2^{-2}$, $2^{-1}$, $2^{0}$, $2^{1}$.]

[Figure 2.2: The Floating-Point Number Line, Nonpositive Half. Labeled points: $-2^{1}$, $-2^{0}$, $-2^{-1}$, $-2^{-2}$, $0$.]

Within the allowable range, a mathematical integer can be exactly represented by a computer floating-point number; but rational fractions depend not
only on their magnitude, but also on the factors of the denominator. The simple rule, of course, is that the number must be a rational number whose denominator in reduced form factors into only primes that appear in the factorization of the base. In base 2, only rational numbers whose factored denominators contain only 2’s have an exact, finite representation. For a given real number $x$, we will occasionally use the notation $[x]_c$ to indicate the floating-point number used to approximate $x$, and we will refer to the exact value of a floating-point number as a computer number. We will also use the phrase “computer number” to refer to the value of a computer fixed-point number. It is important to understand that the set of floating-point numbers $\mathbb{F}$ is a proper, finite subset of $\mathbb{R}$. The number of bits allocated to the exponent $e$ must be sufficient to represent numbers within a reasonable range of magnitudes; that is, so that the smallest number in magnitude that may be of interest is approximately $b^{e_{\min}}$, and the largest number of interest is approximately $b^{e_{\max}}$, where $e_{\min}$ and $e_{\max}$ are, respectively, the smallest and the largest allowable values of the exponent. Because $e_{\min}$ is likely negative and $e_{\max}$ is positive, the exponent requires a sign. In practice, most computer systems handle the sign of the exponent by defining a bias, and then subtracting the bias from the value of the exponent evaluated without regard to a sign. The parameters $b$, $p$, and $e_{\min}$ and $e_{\max}$ are so fundamental to the operations of the computer that on most computers they are fixed, except for a choice of two or three values for $p$, and maybe two choices for the range of $e$. In order to ensure a unique representation for all numbers (except 0), most floating-point systems require that the leading digit in the significand be nonzero, unless the magnitude is less than $b^{e_{\min}}$. A number with a nonzero leading digit in the significand is said to be normalized.
If the base is 2, in a normalized representation, the first digit in the significand is always 1; therefore, it is not necessary to fill that bit position, and so we effectively have an extra bit in the significand. The leading bit, which is not represented, is called a “hidden bit”. This requires a special representation for the number 0, however. In a typical computer using a base of 2 and 64 bits to represent one floating-point number, 1 bit may be designated as the sign bit, 52 bits may be allocated to the significand, and 11 bits allocated to the exponent. The arrangement of these bits is somewhat arbitrary, and of course, the physical arrangement on some kind of storage medium would be different from the “logical” arrangement. A common logical arrangement assigns the first bit as the sign bit, the next 11 bits as the exponent, and the last 52 bits as the significand. The range of exponents for the base of 2 in this typical computer would be 2,048. If this range is split evenly between positive and negative values, the range of orders of magnitude of representable numbers would be from $-308$ to $308$. The bits allocated to the significand would provide roughly 16 decimal places of precision.
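The parameters $b$, $p$, $e_{\min}$ and $e_{\max}$ and the rule for exact representability can be inspected directly. The following is a minimal sketch, assuming a Python interpreter whose `float` is the 64-bit binary format described above; the helper name `is_exact` is hypothetical and introduced only for illustration.

```python
import sys
from fractions import Fraction

# Parameters of the platform's floating-point format: radix b, precision p,
# and the exponent limits. Python reports the exponent limits for a
# significand of the form 0.d1 d2 ... dp, as in model (2.1), so they are one
# larger than limits quoted for a significand of the form d1.d2 ... dp.
info = sys.float_info
print(info.radix, info.mant_dig, info.min_exp, info.max_exp)

def is_exact(numerator: int, denominator: int) -> bool:
    """Return True if numerator/denominator is stored as a float without rounding."""
    stored = Fraction(numerator / denominator)   # exact value of the stored float
    return stored == Fraction(numerator, denominator)

print(is_exact(1, 2))    # denominator is a power of 2
print(is_exact(3, 8))    # denominator is a power of 2
print(is_exact(1, 10))   # denominator contains the prime 5
print(is_exact(1, 3))    # denominator contains the prime 3
```

Under the stated assumption, only the first two fractions are stored exactly, in agreement with the rule that the reduced denominator may contain only primes that divide the base.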
As mentioned above, the set of floating-point numbers is not uniformly distributed over the ordered set of the reals. There are the same number of floating-point numbers in the interval $[b^{i}, b^{i+1}]$ as in the interval $[b^{i+1}, b^{i+2}]$, even though the second interval is $b$ times as long as the first. Figures 2.1 and 2.2 illustrate this. The density of the floating-point numbers is generally greater closer to zero. Notice that if floating-point numbers are all normalized, the spacing between 0 and $b^{e_{\min}}$ is $b^{e_{\min}}$ (that is, there is no floating-point number in that open interval), whereas the spacing between $b^{e_{\min}}$ and $b^{e_{\min}+1}$ is $b^{e_{\min}-p+1}$. Most systems do not require floating-point numbers less than $b^{e_{\min}}$ in magnitude to be normalized. This means that the spacing between 0 and $b^{e_{\min}}$ can be $b^{e_{\min}-p}$, which is more consistent with the spacing just above $b^{e_{\min}}$. When these nonnormalized numbers are the result of arithmetic operations, the result is called “graceful” or “gradual” underflow. The spacing between floating-point numbers has some interesting (and, for the novice computer user, surprising!) consequences. For example, if 1 is repeatedly added to $x$, by the recursion $x^{(k+1)} = x^{(k)} + 1$, the resulting quantity does not continue to get larger. Obviously, it could not increase without bound, because of the finite representation. It does not even approach the largest number representable, however! (This is assuming that the parameters of the floating-point representation are reasonable ones.) In fact, if $x$ is initially smaller in absolute value than $b^{e_{\max}-p}$ (approximately), the recursion $x^{(k+1)} = x^{(k)} + c$ will converge to a stationary point for any value of $c$ smaller in absolute value than $b^{e_{\max}-p}$. The way the arithmetic is performed would determine these values precisely; as we shall see below, arithmetic operations may utilize more bits than are used in the representation of the individual operands.
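The stagnation of the recursion $x^{(k+1)} = x^{(k)} + c$ is easy to observe. The sketch below assumes 64-bit binary floating-point numbers with $p = 53$ and round-to-nearest arithmetic; the function name `iterate_add` and the starting value are illustrative choices, not part of the text.

```python
def iterate_add(x0: float, c: float, max_steps: int = 1_000_000):
    """Apply x <- x + c until the stored value stops changing.

    Returns the stationary value and the number of steps that changed x.
    """
    x = x0
    for k in range(max_steps):
        x_next = x + c
        if x_next == x:        # the addend is lost entirely in the rounding
            return x, k
        x = x_next
    return x, max_steps

# Just below 2**53 consecutive floats are 1 apart; from 2**53 on they are 2 apart.
start = 2.0 ** 53 - 10.0
stationary, steps = iterate_add(start, 1.0)
print(stationary == 2.0 ** 53, steps)   # True 10
```

With these assumptions the sequence stops at $2^{53}$, which is far below the largest representable number (about $1.8 \times 10^{308}$).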
The spacings of numbers just smaller than 1 and just larger than 1 are particularly interesting. This is because we can determine the relative spacing at any point by knowing the spacing around 1. These spacings at 1 are sometimes called the “machine epsilons”, denoted $\epsilon_{\min}$ and $\epsilon_{\max}$ (not to be confused with $e_{\min}$ and $e_{\max}$). It is easy to see from the model for floating-point numbers in (2.1) that

$$\epsilon_{\min} = b^{-p} \quad\text{and}\quad \epsilon_{\max} = b^{1-p}.$$

The more conservative value, $\epsilon_{\max}$, sometimes called “the machine epsilon”, or $\epsilon_{\mathrm{mach}}$, provides an upper bound on the rounding that occurs when a floating-point number is chosen to represent a real number. A floating-point number
near 1 can be chosen within $\epsilon_{\max}/2$ of a real number that is near 1. This bound, $\frac{1}{2} b^{1-p}$, is called the unit roundoff.

[Figure 2.3: Relative Spacings at 1: “Machine Epsilons”. A number line with labeled points $0$, $\frac14$, $\frac12$, $1$, $2$; $\epsilon_{\min}$ and $\epsilon_{\max}$ mark the spacings just below and just above 1.]

These machine epsilons are also called the “smallest relative spacing” and the “largest relative spacing” because they can be used to determine the relative spacing at the point $x$. If $x$ is not zero, the relative spacing at $x$ is approximately

$$\frac{x - (1 - \epsilon_{\min})x}{x}$$

or

$$\frac{(1 + \epsilon_{\max})x - x}{x}.$$

Notice we say “approximately”. First of all, we do not even know that $x$ is representable. Although $(1 - \epsilon_{\min})$ and $(1 + \epsilon_{\max})$ are members of the set of floating-point numbers by definition, that does not guarantee that the product of either of these numbers and $[x]_c$ is also a member of the set of floating-point numbers. However, the quantities $[(1 - \epsilon_{\min})[x]_c]_c$ and $[(1 + \epsilon_{\max})[x]_c]_c$ are representable (by the definition of $[\cdot]_c$ as a floating-point number approximating the quantity within the brackets); and, in fact, they are respectively the next smaller number than $[x]_c$ (if $[x]_c$ is positive, or the next larger number otherwise), and the next larger number than $[x]_c$ (if $[x]_c$ is positive). The spacings at $[x]_c$ therefore are $[x]_c - [(1 - \epsilon_{\min})[x]_c]_c$ and $[(1 + \epsilon_{\max})[x]_c - [x]_c]_c$. As an aside, note that this implies it is probable that $[(1 - \epsilon_{\min})[x]_c]_c = [(1 + \epsilon_{\min})[x]_c]_c$.
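The two machine epsilons and the relative spacing at other points can be measured rather than derived. This is a sketch under the assumption of 64-bit binary floats ($b = 2$, $p = 53$) and Python 3.9 or later, where `math.nextafter` is available; the helper `spacing_above` is a hypothetical name used only here.

```python
import math
import sys

p = sys.float_info.mant_dig                  # precision p of the float format

# Spacings immediately above and below 1.
eps_max = math.nextafter(1.0, 2.0) - 1.0     # expected to equal b**(1 - p)
eps_min = 1.0 - math.nextafter(1.0, 0.0)     # expected to equal b**(-p)

print(eps_max == 2.0 ** (1 - p), eps_min == 2.0 ** (-p))
print(eps_max == sys.float_info.epsilon)     # Python's "epsilon" is eps_max

def spacing_above(x: float) -> float:
    """Distance from the positive float x to the next larger float."""
    return math.nextafter(x, math.inf) - x

# The relative spacing at a positive normalized x lies between the two epsilons.
for x in (1.0, 3.0, 1.0e10, 1.0e-5):
    rel = spacing_above(x) / x
    print(x, eps_min < rel <= eps_max)
```

The unit roundoff in this setting is `eps_max / 2`, which coincides with `eps_min` when $b = 2$.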
[Figure 2.4: Relative Spacings. A number line around the point $x$, with the spacings $[[x]_c - (1 - \epsilon_{\min})[x]_c]_c$ below $x$ and $[(1 + \epsilon_{\max})[x]_c - [x]_c]_c$ above $x$ marked.]

In practice, to compare two numbers $x$ and $y$, we must compare $[x]_c$ and $[y]_c$. We consider $x$ and $y$ different if

$$[|y|]_c < [|x|]_c - [(1 - \epsilon_{\min})[|x|]_c]_c,$$

or if

$$[|y|]_c > [|x|]_c + [(1 + \epsilon_{\max})[|x|]_c]_c.$$

The relative spacing at any point obviously depends on the value represented by the least significant digit in the significand. This digit (or bit) is called the “unit in the last place”, or “ulp”. The magnitude of an ulp depends of course on the magnitude of the number being represented. Any real number within the range allowed by the exponent can be approximated within $\frac{1}{2}$ ulp by a floating-point number. As we have indicated, different computers represent numeric data in different ways. There has been some attempt to provide standards, at least in the range representable and in the precision for floating-point quantities. There are two IEEE standards that specify characteristics of floating-point numbers (IEEE, 1985). The IEEE Standard 754 (sometimes called the “binary standard”) specifies the exact layout of the bits for two different precisions, “single” and “double”. In both cases, the standard requires that the radix be 2. For single precision, $p$ must be 24, $e_{\max}$ must be 127, and $e_{\min}$ must be $-126$. For double precision, $p$ must be 53, $e_{\max}$ must be 1023, and $e_{\min}$ must be $-1022$. The IEEE Standard 754 also defines two additional precisions, “single extended” and “double extended”. For each of the extended precisions, the standard sets bounds on the precision and exponent ranges, rather than specifying them exactly. The extended precisions have larger exponent ranges and greater precision than the corresponding precision that is not “extended”. Additional information about the IEEE Standards for floating-point numbers can be found in Overton (2001).

Special Floating-Point Numbers

It is convenient to be able to represent certain special numeric entities, such as infinity or “indeterminate” ($0/0$), which do not have ordinary representations in any base-digit system. Although 8 bits are available for the exponent in the single-precision IEEE binary standard, $e_{\max} = 127$ and $e_{\min} = -126$.
This means there are two unused possible values for the exponent; likewise, for the double-precision standard there are two unused possible values for the exponent. These extra possible values for the exponent allow us to represent certain special floating-point numbers. An exponent of $e_{\min} - 1$ allows us to handle 0 and the numbers between 0 and $b^{e_{\min}}$ unambiguously even though there is a hidden bit (see the discussion above about normalization and gradual underflow). The special number 0 is represented with an exponent of $e_{\min} - 1$ and a significand of $00\ldots0$. An exponent of $e_{\max} + 1$ allows us to represent $\pm\infty$ or the indeterminate value. A floating-point number with this exponent and a significand of 0 represents $\pm\infty$ (the sign bit determines the sign, as usual). A floating-point number with this exponent and a nonzero significand represents an indeterminate value such as $0/0$. This value is called “not-a-number”, or NaN. In statistical data processing, a NaN is sometimes used to represent a missing value. Because a
NaN is indeterminate, if a variable $x$ has a value of NaN, then $x \ne x$. Also, because a NaN can be represented in different ways, a programmer must be careful in testing for NaNs. Some software systems provide explicit functions for testing for a NaN. The IEEE binary standard recommends that a function isnan be provided to test for a NaN.

Computer Operations on Numeric Data

As we have emphasized above, the numerical quantities represented in the computer are used to simulate or approximate more interesting quantities, namely the real numbers or perhaps the integers. Obviously, because the sets (computer numbers and real numbers) are not the same, we cannot define operations on the computer numbers that would yield the same field as the familiar field of the reals. In fact, because of the nonuniform spacing of floating-point numbers, we would suspect that some of the fundamental properties of a field may not hold. Depending on the magnitudes of the quantities involved, it is possible, for example, that if we compute $ab$ and $ac$ and then $ab + ac$, we may not get the same thing as if we compute $(b + c)$ and then $a(b + c)$. Just as we use the computer quantities to simulate real quantities, we define operations on the computer quantities to simulate the familiar operations on real quantities. Designers of computers attempt to define computer operations so as to correspond closely to operations on real numbers, but we must not lose sight of the fact that the computer uses a different arithmetic system. The basic operational objective in numerical computing, of course, is that a computer operation, when applied to computer numbers, yields computer numbers that approximate the number that would be yielded by a certain mathematical operation applied to the numbers approximated by the original computer numbers.
Just as we introduced the notation $[x]_c$ above to denote the computer floating-point number approximation to the real number $x$, we occasionally use the notation $[\circ]_c$ to refer to a computer operation that simulates the mathematical operation $\circ$. Thus, $[+]_c$ represents an operation similar to addition, but which yields a result in a set of computer numbers. (We use this notation only where necessary for emphasis, however, because it is somewhat awkward to use it consistently.) The failure of the familiar laws of the field of the reals, such as the distributive law cited above, can be anticipated by noting that $[[a]_c \, [+]_c \, [b]_c]_c \ne [a + b]_c$,
or by considering the simple example in which all numbers are rounded to one decimal and so $\frac13 + \frac13 \ne \frac23$ (that is, $.3 + .3 \ne .7$). The three familiar laws of the field of the reals (commutativity of addition and multiplication, associativity of addition and multiplication, and distribution of multiplication over addition) result in the independence of the order in which operations are performed; the failure of these laws implies that the order of the operations may make a difference. When computer operations are performed sequentially, we can usually define and control the sequence fairly easily. If the computer performs operations in parallel, the resulting differences in the orders in which some operations may be performed can occasionally yield unexpected results. Because the operations are not closed, special notice may need to be taken when the operation would yield a number not in the set. Adding two numbers, for example, may yield a number too large to be represented well by a computer number. When an operation yields such an anomalous result, an exception is said to exist.

Floating-Point Operations; Errors

As we have seen, real numbers within the allowable range may or may not have an exact floating-point representation, and the computer operations on the computer numbers may or may not yield numbers that represent exactly the real number that would result from mathematical operations on the numbers. If the true result is $r$, the best we could hope for would be $[r]_c$. As we have mentioned, however, the computer operation may not be exactly the same as the mathematical operation being simulated, and further, there may be several operations involved in arriving at the result. Hence, we expect some error in the result. If the computed value is $\tilde{r}$ (for the true value $r$), we speak of the absolute error,

$$|\tilde{r} - r|,$$

and the relative error,

$$\frac{|\tilde{r} - r|}{|r|}$$

(so long as $r \ne 0$).
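The dependence on the order of operations, and the two error measures just defined, can be illustrated with three decimal fractions. The sketch assumes 64-bit binary floats with round-to-nearest arithmetic; the function `relative_error` is a hypothetical helper, and exact rational arithmetic from the standard library stands in for the "true" real-number result $r$.

```python
from fractions import Fraction

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c      # one order of evaluation
right = a + (b + c)     # the other order
print(left == right)    # False: floating-point addition is not associative
print(left, right)

def relative_error(computed: float, exact: Fraction) -> Fraction:
    """|computed - exact| / |exact|, evaluated without further rounding."""
    return abs(Fraction(computed) - exact) / abs(exact)

# True result r for the real numbers 1/10, 2/10 and 3/10.
r = Fraction(1, 10) + Fraction(2, 10) + Fraction(3, 10)

print(float(relative_error(left, r)))
print(float(relative_error(right, r)))
```

Both relative errors are tiny, of the order of the unit roundoff, yet the two computed values differ; neither order is "the" correct one in general.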
An important objective in numerical computation obviously is to ensure that the error in the result is small. Ideally, the result of an operation on two floating-point numbers would be the same as if the operation were performed exactly on the two operands (considering them to be exact also) and then the result were rounded. Attempting to do this would be very expensive in both computational time and complexity of the software. If care is not taken, however, the relative error can be very large. Consider, for example, a floating-point number system with $b = 2$ and $p = 4$. Suppose we want to add 8 and $-7.5$. In the floating-point system we would be faced with the problem:

$$\begin{array}{rll} 8: & 1.000 & \times\, 2^{3} \\ 7.5: & 1.111 & \times\, 2^{2} \end{array}$$
To make the exponents the same, we have

$$\begin{array}{rll} 8: & 1.000 & \times\, 2^{3} \\ 7.5: & 0.111 & \times\, 2^{3} \end{array} \qquad\text{or}\qquad \begin{array}{rll} 8: & 1.000 & \times\, 2^{3} \\ 7.5: & 1.000 & \times\, 2^{3} \end{array}$$

The subtraction will yield either $0.000_2$ or $1.000_2 \times 2^{0}$, whereas the correct value is $1.000_2 \times 2^{-1}$. Either way, the absolute error is $0.5_{10}$, and the relative error is 1. Every bit in the significand is wrong. The magnitude of the error is the same as the magnitude of the result. This is not acceptable. (More generally, we can show that the relative error in a similar computation could be as large as $b - 1$, for any base $b$.) The solution to this problem is to use one or more guard digits, which are extra digits in the significand that participate in the arithmetic operation in what is called chaining (see Gentle, 1998, page 21). When several numbers $x_i$ are to be summed, it is likely that as the operations proceed serially, the magnitudes of the partial sum and the next summand will be quite different. In such a case, the full precision of the next summand is lost. This is especially true if the numbers are of the same sign. As we mentioned earlier, a computer program to implement serially the algorithm implied by $\sum_{i=1}^{\infty} i$ will converge to some number much smaller than the largest floating-point number. Another kind of error that can result because of the finite precision used for floating-point numbers is catastrophic cancellation. This can occur when two rounded values of approximately equal magnitude and opposite signs are added. (If the values are exact, cancellation can also occur, but it is benign.) After catastrophic cancellation, the digits left are just the digits that represented the rounding. Suppose $x \approx y$, and that $[x]_c = [y]_c$. The computed result will be zero, whereas the correct (rounded) result is $[x - y]_c$. The relative error is 100%. This error is caused by rounding, but it is different from the “rounding error” discussed above. Although the loss of information arising from the rounding error is the culprit, the rounding would be of little consequence were it not for the cancellation.
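Both effects described above, the loss of a small summand and catastrophic cancellation, show up in very short computations. The following sketch assumes 64-bit binary floats; the particular constants are illustrative, and `math.fsum` is used only as one standard-library way of rearranging a summation, not as a method proposed in the text.

```python
import math

# Loss of a small summand: near 1e16 consecutive floats are 2 apart,
# so adding 1 leaves the stored value unchanged.
big, small = 1.0e16, 1.0
print(big + small == big)            # True

# Catastrophic cancellation: x and y differ as real numbers,
# but they round to the same floating-point number.
delta = 1.0e-17
x = 1.0 + delta                      # rounds to exactly 1.0
y = 1.0
computed = x - y                     # 0.0, whereas the true difference is delta
print(computed, abs(computed - delta) / delta)   # relative error is 1 (100%)

# Order matters: the naive left-to-right sum loses the 1.0 entirely,
# while a summation that tracks the lost low-order parts recovers it.
terms = [1.0e16, 1.0, -1.0e16]
print(sum(terms), math.fsum(terms))  # 0.0 1.0
```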
To avoid catastrophic cancellation, watch for possible additions of quantities of approximately equal magnitude and opposite signs, and consider rearranging the computations. An example of catastrophic cancellation familiar to statisticians may occur in the computation of the sample sum of squares:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2. \tag{2.2}$$

This quantity is $(n - 1)s^2$, where $s^2$ is the sample variance. Either expression in equation (2.2) can be thought of as describing an algorithm. The expression on the left implies the “two-pass” algorithm that first computes $\bar{x}$ and then sums $(x_i - \bar{x})^2$. With these computations, the quantities are likely to be of relatively equal magnitude. They are of the same sign, so there will be no catastrophic cancellation in the early stages when the terms
being accumulated are close in size to the current value of the accumulating sum, $b$. There will be some accuracy loss as the sum $b$ grows, but the addends $(x_i - a)^2$, where $a$ is the computed value of $\bar{x}$, remain roughly the same size. The accumulated rounding error, however, may not be too bad. The expression on the right of equation (2.2) implies a “one-pass” algorithm in which $x_i^2$ and $x_i$ are summed at the same time. If the $x_i$’s have magnitudes larger than 1, the algorithm has built up two relatively large quantities, $b$ (here the sum of the $x_i^2$) and $na^2$. These quantities may be of roughly equal magnitude; subtracting one from the other may lead to catastrophic cancellation. (See Gentle, 1998, pages 31 through 33, for further discussion of these computations and alternatives.) The IEEE Binary Standard 754 (IEEE, 1985) applies not only to the representation of floating-point numbers, but also to certain operations on those numbers. The standard requires correct rounded results for addition, subtraction, multiplication, division, remaindering, and extraction of the square root. It also requires that conversion between fixed-point numbers and floating-point numbers yields correct rounded results.

Exceptions and Special Floating-Point Numbers

The standard also defines how exceptions should be handled. The exceptions are divided into five types: overflow, division by zero, underflow, invalid operation, and inexact operation. If an operation on floating-point numbers would result in a number beyond the range of representable floating-point numbers, the exception, called overflow, is generally very serious. (It is serious in fixed-point operations, also, if it is unplanned. Because we have the alternative of using floating-point numbers if the magnitude of the numbers is likely to exceed what is representable in fixed-point, the user is expected to use this alternative. If the magnitude exceeds what is representable in floating-point, however, the user must resort to some indirect means, such as scaling, to solve the problem.)
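The contrast between the two algorithms implied by equation (2.2) can be sketched as follows. The variable names `a` and `b` follow the usage in the text; the function names and the data set (four values near $10^9$ whose exact sum of squared deviations is 90) are illustrative assumptions, and 64-bit binary floats are assumed.

```python
def sum_squares_two_pass(xs):
    """Left side of (2.2): compute the mean a first, then sum (x_i - a)**2."""
    n = len(xs)
    a = sum(xs) / n
    b = 0.0
    for x in xs:
        b += (x - a) ** 2
    return b

def sum_squares_one_pass(xs):
    """Right side of (2.2): accumulate sum x_i and sum x_i**2 together."""
    n = len(xs)
    s = 0.0
    b = 0.0
    for x in xs:
        s += x
        b += x * x
    a = s / n
    return b - n * a * a     # difference of two large, nearly equal quantities

# Deviations from the mean 1e9 + 10 are -6, -3, 3, 6, so the exact result is 90.
data = [1.0e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

print(sum_squares_two_pass(data))   # 90.0
print(sum_squares_one_pass(data))   # far from 90; the squares near 1e18 are not stored exactly
```

For these data the two-pass result is exact, while the one-pass result is dominated by rounding in quantities near $4 \times 10^{18}$; it need not even be nonnegative.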
Division by zero does not cause overflow; it results in a special number if the dividend is nonzero. The result is either $\infty$ or $-\infty$, and these have special representations, as we have seen. Underflow occurs whenever the result is too small to be represented as a normalized floating-point number. As we have seen, a nonnormalized representation can be used to allow a gradual underflow. An invalid operation is one for which the result is not defined because of the value of an operand. The invalid operations are addition of $\infty$ to $-\infty$, multiplication of $\pm\infty$ and 0, 0 divided by 0, $\pm\infty$ divided by $\pm\infty$, extraction of the square root of a negative number (some systems, such as Fortran, have a special type for complex numbers and deal correctly with them), and remaindering any quantity with 0 or remaindering $\pm\infty$ with any quantity. An invalid operation results in a NaN. Any operation with a NaN also results in a NaN. Some systems distinguish two types of NaN, a “quiet NaN” and a “signaling NaN”. An inexact operation is one for which the result must be rounded. For
example, if all p bits of the significand are required to represent both the multiplier and multiplicand, approximately 2p bits would be required to represent the product. Because only p are available, however, the result must be rounded.
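The special numbers and several of the exceptions discussed above can be observed from a high-level language, although each language decides how much of the underlying behavior to expose. The sketch below assumes Python with 64-bit binary floats; note that Python itself raises `ZeroDivisionError` for float division by zero rather than returning $\pm\infty$, which is a property of the language, not of the floating-point standard.

```python
import math
import sys

inf = math.inf

# Invalid operations yield a NaN, and a NaN propagates through later arithmetic.
nan = inf - inf
print(math.isnan(nan), math.isnan(0.0 * inf), math.isnan(nan + 1.0))

# A NaN compares unequal to everything, including itself, so test with isnan.
print(nan == nan, math.isnan(nan))           # False True

# Overflow in multiplication produces infinity.
print(1.0e308 * 10.0 == inf)                 # True

# Gradual underflow: halving the smallest normalized number gives a nonzero
# (nonnormalized) result; halving the smallest nonnormalized number gives 0.
tiny = sys.float_info.min
print(tiny / 2.0 > 0.0, 5e-324 / 2.0 == 0.0)

# Python raises an exception for division by zero instead of returning infinity.
try:
    1.0 / 0.0
except ZeroDivisionError as exc:
    print("division by zero:", exc)
```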
Comparison of Reals and Floating-Point Numbers

For most applications, the system of floating-point numbers simulates the field of the reals very well. It is important, however, to be aware of some of the differences in the two systems. The last four properties in Table 2.1 are properties of a field (except for the divergence of $\sum_{x=1}^{\infty} x$). The important facts are that $\mathbb{R}$ is an uncountable field, and $\mathbb{F}$ is a more complicated finite mathematical structure.
2.1.2 The Fixed-Point Number System
Because an important set of numbers is a finite set of reasonably sized integers, efficient schemes for representing these special numbers are available in most computing systems. The scheme is usually some form of a base 2 representation, and may use one storage unit (this is most common), two storage units, or one half of a storage unit. The numbers represented in this scheme are called fixed-point numbers, and the set of such numbers is denoted by $\mathbb{I}$. They are also called “integers”. Unlike the floating-point numbers, the fixed-point numbers are uniformly distributed over their range. If the set of integers includes the negative numbers also, some way of indicating the sign must be available. The first bit in the bit sequence (usually one storage unit) representing an integer is usually used to indicate the sign; if it is 0, a positive number is represented; if it is 1, a negative number. In a common method for representing negative integers, called “twos-complement representation”, the sign bit is set to 1, and the remaining bits are set to their opposite values (0 for 1; 1 for 0) and then 1 is added to the result. If the bits for 5 are ...00101, the bits for $-5$ would be ...11010 + 1, or ...11011. If there are $k$ bits in a storage unit (and one storage unit is used to represent a single integer), the integers from 0 through $2^{k-1} - 1$ would be represented in ordinary binary notation using $k - 1$ bits. An integer $i$ in the interval $[-2^{k-1}, -1]$ would be represented by the same bit pattern by which the nonnegative integer $2^{k-1} + i$ is represented, except the sign bit would be 1. The twos-complement representation makes arithmetic operations particularly simple. It is easy to see that the largest integer that can be represented in the twos-complement form is $2^{k-1} - 1$, and the smallest integer is $-2^{k-1}$. A representation scheme such as that described above is called fixed-point representation or integer representation.
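The twos-complement rule can be checked on small word sizes. This is a minimal sketch; the function name `twos_complement_bits` and the choice $k = 8$ are assumptions made for illustration, and Python's unbounded integers are masked to $k$ bits to imitate a fixed storage unit.

```python
def twos_complement_bits(i: int, k: int) -> str:
    """Return the k-bit twos-complement bit pattern of the integer i."""
    lo, hi = -(1 << (k - 1)), (1 << (k - 1)) - 1
    if not lo <= i <= hi:
        raise ValueError(f"{i} is not representable in {k} bits")
    # Masking with 2**k - 1 keeps the low k bits, which for a negative i
    # is the same pattern as flipping the bits of -i and adding 1.
    return format(i & ((1 << k) - 1), f"0{k}b")

k = 8
print(twos_complement_bits(5, k))      # 00000101
print(twos_complement_bits(-5, k))     # 11111011
print(twos_complement_bits(127, k))    # 01111111, the largest value 2**(k-1) - 1
print(twos_complement_bits(-128, k))   # 10000000, the smallest value -2**(k-1)

# Apart from the sign bit, the pattern for a negative i is that of 2**(k-1) + i.
low_bits = twos_complement_bits(-5, k)[1:]
print(int(low_bits, 2) == 2 ** (k - 1) + (-5))   # True
```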
Table 2.1: Differences in Real Numbers and Floating-Point Numbers

| | $\mathbb{R}$ | $\mathbb{F}$ |
|---|---|---|
| cardinality: | uncountable | finite |
| measure: | for $x < y$, $\mu((x, y)) \propto y - x$ | $y - x = w - z$, but $\#(x, y) \ne \#(z, w)$ |
| continuity: | if $x < y$, $\exists z \ni x < z < y$ | $x < y$, but no $z \ni x < z < y$ |
| | $\mu([x, y]) = \mu((x, y))$ | $\#[x, y] > \#(x, y)$ |
| closure: | $x, y \in \mathbb{R} \Rightarrow x + y \in \mathbb{R}$ | not closed wrt addition |
| | $x, y \in \mathbb{R} \Rightarrow xy \in \mathbb{R}$ | not closed wrt multiplication (exclusive of infinities) |
| identity: | $a + x = x$, for any $x$ | $a + x = x$, but $a + y \ne y$ |
| | $x - x = a$, for any $x$ | $a + x = x$, but $x - x \ne a$ |
| | $a = 0$, unique | $a + x = b + x$, but $b \ne a$ |
| | $\sum_{x=1}^{\infty} x$ diverges | $\sum_{x=1}^{\infty} x$ converges, if interpreted as $(\cdots((1 + 2) + 3)\cdots)$ |
| associativity: | $x, y, z \in \mathbb{R} \Rightarrow (x + y) + z = x + (y + z)$ | not associative |
| | $(xy)z = x(yz)$ | not associative |
| distributivity: | $x, y, z \in \mathbb{R} \Rightarrow x(y + z) = xy + xz$ | not distributive |

The notation $\mathbb{I}$ is also used to denote the system built on this set. This
system is similar in some ways to a field instead of a ring, which is what the integers $\mathbb{Z}$ are.

There are several variations of the fixed-point representation. The number of bits used and the method of representing negative numbers are two aspects that generally vary from one computer to another. Even within a single computer system, the number of bits used in fixed-point representation may vary; it is typically one storage unit or a half of a storage unit. In a fixed-point representation scheme using $k$ bits, the range of representable numbers is of the order of $2^k$, usually from approximately $-2^{k-1}$ to $2^{k-1}$. Numbers outside of this range cannot be represented directly in the fixed-point scheme. Likewise, nonintegral numbers cannot be represented.

Fixed-Point Operations

The operations of addition, subtraction, and multiplication for fixed-point numbers are performed in an obvious way that corresponds to the similar operations on the ring of integers. Subtraction is addition of the additive inverse. (In the usual twos-complement representation we described earlier, all fixed-point numbers have additive inverses except $-2^{k-1}$.) Because there is no multiplicative inverse, however, division is not multiplication by the inverse. The result of division with fixed-point numbers is the result of division with the corresponding real numbers rounded toward zero. This is not considered an exception.

As we indicated above, the set of fixed-point numbers together with addition and multiplication is not the same as the ring of integers, if for no other reason than the set is finite. Under the ordinary definitions of addition and multiplication, the set is not closed under either operation. The computer operations of addition and multiplication, however, are defined so that the set is closed. These operations occur as if there were additional higher-order bits and the sign bit were interpreted as a regular numeric bit.
The result is then whatever would be in the standard number of lower-order bits. If the higher-order bits would be necessary, the operation is said to overflow. If fixed-point overflow occurs, the result is not correct under the usual interpretation of the operation, so an error situation, or an exception, has occurred. Most computer systems allow this error condition to be detected, but most software systems do not take note of the exception. The result, of course, depends on the specific computer architecture. On many systems, aside from the interpretation of the sign bit, the result is essentially the same as would result from a modular reduction. There are some special-purpose algorithms that actually use this modified modular reduction, although such algorithms would not be portable across different computer systems.

The subsets of numbers that we need in the computer depend on the kinds of numbers that are of interest for the problem at hand. Often, however, the kinds of numbers of interest change dramatically within a given problem. For example, we may begin with integer data in the range from 1 to 50. Most simple operations such as addition, squaring, and so on, with these data would allow a
single paradigm for their representation. The fixed-point representation should work very nicely for such manipulations. Something as simple as a factorial, however, immediately changes the paradigm. It is unlikely that the fixed-point representation would be able to handle the resulting large numbers. When we significantly change the range of numbers that must be accommodated, another change that occurs is the ability to represent the numbers exactly. If the beginning data are integers between 1 and 50, and no divisions or operations leading to irrational numbers are performed, one storage unit would almost surely be sufficient to represent all values exactly. If factorials are evaluated, however, the results cannot be represented exactly in one storage unit and so must be approximated (even though the results are integers). When data are not integers, it is usually obvious that we must use approximations, but it may also be true for integer data.
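The behavior of fixed-point overflow, division, and factorials discussed above can be simulated in a few lines. This is a minimal sketch under stated assumptions: a 32-bit twos-complement storage unit, wraparound on overflow (which, as noted, depends on the architecture), and the hypothetical helper names `wrap` and `fixed_div`.

```python
import math


def wrap(value, k=32):
    """Reduce an exact integer to the k-bit twos-complement value that would
    remain in the low-order bits after an overflow (assumed wraparound)."""
    mask = (1 << k) - 1
    low = value & mask                      # keep only the k low-order bits
    # If the top retained bit is set, it is read as the sign bit.
    return low - (1 << k) if low >> (k - 1) else low


def fixed_div(a, b):
    """Fixed-point division: the real quotient rounded toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


largest = 2 ** 31 - 1
overflowed = wrap(largest + 1)              # wraps to the smallest integer
fact12 = wrap(math.factorial(12))           # still representable exactly
fact13 = wrap(math.factorial(13))           # no longer equal to 13!
quotient = fixed_div(-7, 2)                 # rounds toward zero
```

Note that Python's own `//` operator rounds toward negative infinity, which is why `fixed_div` is written out separately here.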
2.2 Numerical Algorithms and Analysis
We will use the term “algorithm” rather loosely, but always in the general sense of a method or a set of instructions for doing something. Algorithms are sometimes distinguished as “numerical”, “semi-numerical”, and “non-numerical”, depending on the extent to which operations on real numbers are simulated. (Technically an algorithm must terminate after a finite, but perhaps unknown, number of steps.)

Algorithms and Programs

Algorithms are expressed by means of a flowchart, a series of steps, or in a computer language or pseudolanguage. The expression in a computer language is a source program or module; hence, we sometimes use the words “algorithm” and “program” synonymously. The program is the set of computer instructions that implement the algorithm. A poor implementation can render a good algorithm useless. A good implementation will preserve the algorithm’s accuracy and efficiency, and will detect data that are inappropriate for the algorithm. Robustness is more a property of the program than of the algorithm. The exact way an algorithm is implemented in a program depends of course on the programming language, but it also may depend on the computer and associated system software. A program that will run on most systems without modification is said to be portable.

The two most important aspects of a computer algorithm are its accuracy and its efficiency. Although each of these concepts appears rather simple on the surface, each is actually fairly complicated, as we shall see.
2.2.1 Error in Numerical Computations
An “accurate” algorithm is one that gets the “right” answer. Knowing that the right answer may not be representable, and rounding within a set of operations may result in variations in the answer, we often must settle for an answer that is “close”. As we have discussed previously, we measure error, or closeness, either as the absolute error or the relative error of a computation.

Another way of considering the concept of “closeness” is by looking backward from the computed answer, and asking what perturbation of the original problem would yield the computed answer exactly. The backward analysis is followed by an assessment of the effect of the perturbation on the solution. Although backward error analysis may not seem as natural as the “forward” analysis (in which we assess the difference in the computed and true solutions), it is easier to perform because all operations in the backward analysis are performed in $\mathbb{F}$ instead of in $\mathbb{R}$. Each step in the backward analysis involves numbers in the set $\mathbb{F}$, that is, numbers that could actually have participated in the computations that were performed. Because the properties of the arithmetic operations in $\mathbb{R}$ do not hold in $\mathbb{F}$, and at any step in the sequence of computations the result in $\mathbb{R}$ may not exist in $\mathbb{F}$, it is very difficult to carry out a forward error analysis.

There are other complications in assessing errors. Suppose the answer is a vector, such as a solution to a linear system. What norm do we use to compare closeness of vectors? Another, more complicated situation for which assessing correctness may be difficult is random number generation. It would be difficult to assign a meaning to “accuracy” for such a problem.

The basic source of error in numerical computations is the inability to work with the reals. The field of reals is simulated with a finite set. This has several consequences.
A real number is rounded to a floating-point number; the result of an operation on two floating-point numbers is rounded to another floating-point number; and passage to the limit, which is a fundamental concept in the field of reals, is not possible in the computer.

Rounding errors that occur just because the result of an operation is not representable in the computer’s set of floating-point numbers are usually not too bad. Of course, if they accumulate through the course of many operations, the final result may have an unacceptably large accumulated rounding error. Another, more pernicious effect of rounding can occur in a single operation, resulting in catastrophic cancellation, as we have discussed previously.

Measures of Error and Bounds for Errors

For the simple case of representing the real number $r$ by an approximation $\tilde{r}$, we define absolute error, $|\tilde{r} - r|$, and relative error, $|\tilde{r} - r|/|r|$ (so long as $r \ne 0$). These same types of measures are used to express the errors in numerical computations. As we indicated above, however, the result may not be a simple real number; it may consist of several real numbers. For example, in statistical data analysis, the numerical result, $\tilde{r}$, may consist of estimates of several regression
coefficients, various sums of squares and their ratio, and several other quantities. We may then be interested in some more general measure of the difference of $\tilde{r}$ and $r$, $\Delta(\tilde{r}, r)$, where $\Delta(\cdot, \cdot)$ is a nonnegative, real-valued function. This is the absolute error, and the relative error is the ratio of the absolute error to $\Delta(r, r_0)$, where $r_0$ is a baseline value, such as 0. When $r$, instead of just being a single number, consists of several components, we must measure error differently. If $r$ is a vector, the measure may be some norm, and in that case, $\Delta(\tilde{r}, r)$ may be denoted by $\|\tilde{r} - r\|$. A norm tends to become larger as the number of elements increases, so instead of using a raw norm, it may be appropriate to scale the norm to reflect the number of elements being computed.

However the error is measured, for a given algorithm we would like to have some knowledge of the amount of error to expect or at least some bound on the error. Unfortunately, almost any measure contains terms that depend on the quantity being evaluated. Given this limitation, however, often we can develop an upper bound on the error. In other cases, we can develop an estimate of an “average error”, based on some assumed probability distribution of the data comprising the problem.

In a Monte Carlo method we estimate the solution based on a “random” sample, so just as in ordinary statistical estimation, we are concerned about the variance of the estimate. We can usually derive expressions for the variance of the estimator in terms of the quantity being evaluated, and of course we can estimate the variance of the estimator using the realized random sample. The standard deviation of the estimator provides an indication of the distance around the computed quantity within which we may have some confidence that the true value lies. The standard deviation is sometimes called the “standard error”, and nonstatisticians speak of it as a “probabilistic error bound”.
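The Monte Carlo “standard error” just described can be illustrated with a small sketch. The integrand $e^u$ on $(0, 1)$, the sample size, the seed, and the helper name `mc_mean` are all assumptions chosen for illustration; the true value $e - 1$ is known here, so the absolute and relative errors defined above can be computed alongside the estimated standard error.

```python
import math
import random


def mc_mean(g, n, rng):
    """Monte Carlo estimate of E[g(U)] for U ~ Uniform(0, 1),
    returned together with its estimated standard error."""
    values = [g(rng.random()) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)   # sample variance
    return mean, math.sqrt(var / n)                         # std. error of the mean


rng = random.Random(12345)                  # fixed seed so the run is repeatable
estimate, std_err = mc_mean(math.exp, 10_000, rng)

true_value = math.e - 1.0                   # integral of exp(u) over (0, 1)
abs_err = abs(estimate - true_value)        # absolute error
rel_err = abs_err / abs(true_value)         # relative error (true value is nonzero)
```

In practice the true value is unknown, and only `std_err` is available as an indication of how far the estimate may be from it.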
It is often useful to identify the “order of the error”, whether we are concerned about error bounds, average expected error, or the standard deviation of an estimator. In general, we speak of the order of one function in terms of another function, as the argument of the functions approaches a given value. A function $f(t)$ is said to be of order $g(t)$ at $t_0$, written $O(g(t))$ (“big O of $g(t)$”), if there exists a positive constant $M$ such that $|f(t)| \le M|g(t)|$ as $t \to t_0$. This is the order of convergence of one function to another function at a given point. If our objective is to compute $f(t)$ and we use an approximation $\tilde{f}(t)$, the order of the error due to the approximation is the order of the convergence. In this case, the argument of the order of the error may be some variable that defines the approximation. For example, if $\tilde{f}(t)$ is a finite series approximation to $f(t)$ using, say, $k$ terms, we may express the error as $O(h(k))$, for some function $h(k)$. Typical orders of errors due to the approximation may be $O(1/k)$,
$O(1/k^2)$, or $O(1/k!)$. An approximation with order of error $O(1/k!)$ is to be preferred over one with order of error $O(1/k)$ because the error is decreasing more rapidly. The order of error due to the approximation is only one aspect to consider; roundoff error in the representation of any intermediate quantities must also be considered.

The special case of convergence to the constant zero is often of interest. A function $f(t)$ is said to be “little o of $g(t)$” at $t_0$, written $o(g(t))$, if

$$f(t)/g(t) \to 0 \quad \text{as } t \to t_0.$$

If the function $f(t)$ approaches 0 at $t_0$, $g(t)$ can be taken as a constant and $f(t)$ is said to be $o(1)$.

Big O and little o convergence are defined in terms of dominating functions. In the analysis of algorithms it is often useful to consider analogous types of convergence in which the function of interest dominates another function. This type of relationship is similar to a lower bound. A function $f(t)$ is said to be $\Omega(g(t))$ (“big omega of $g(t)$”), if there exists a positive constant $m$ such that

$$|f(t)| \ge m|g(t)| \quad \text{as } t \to t_0.$$

Likewise, a function $f(t)$ is said to be “little omega of $g(t)$” at $t_0$, written $\omega(g(t))$, if $g(t)/f(t) \to 0$ as $t \to t_0$.

Usually the limit on $t$ in order expressions is either 0 or $\infty$, and because it is obvious from the context, mention of it is omitted. The order of the error in numerical computations usually provides a measure in terms of something that can be controlled in the algorithm, such as the point at which an infinite series is truncated in the computations. The measure of the error usually also contains expressions that depend on the quantity being evaluated, however.

Sources of Error in Numerical Computations

Some algorithms are exact, such as an algorithm to multiply two matrices that just uses the definition of matrix multiplication. Other algorithms are approximate because the result to be computed does not have a finite closed-form expression. An example is the evaluation of the normal cumulative distribution function. One way of evaluating this is by use of a rational polynomial approximation to the distribution function. Such an expression may be evaluated with very little rounding error, but the expression has an error of approximation. We need to have some knowledge of the magnitude of the error. For algorithms that use approximations, it is often useful to express the order of the error in terms of some quantity used in the algorithm or in terms of some aspect of the problem itself.

When solving a differential equation on the computer, the differential equation is often approximated by a difference equation. Even though the differences used may not be constant, they are finite and the passage to the limit can
never be effected. This kind of approximation leads to a discretization error. The amount of the discretization error has nothing to do with rounding error. If the last differences used in the algorithm are $\delta t$, then the error is usually of order $O(\delta t)$, even if the computations are performed exactly.

Another type of error occurs when the algorithm uses a series expansion. The infinite series may be exact, and in principle the evaluation of all terms would yield an exact result. The algorithm uses only a finite number of terms, and the resulting error is truncation error. When a truncated Taylor’s series is used to evaluate a function at a given point $x_0$, the order of the truncation error is the derivative of the function that would appear in the first unused term of the series, evaluated at $x_0$.

Algorithms and Data

The performance of an algorithm may depend on the data. Heuristically, data for a given problem are ill-conditioned if small changes in the data may yield large changes in the solution. Consider the problem of finding the roots of a high-degree polynomial, for example. Wilkinson (1959) gave an example of a polynomial that is very simple at first glance, yet whose solution is very sensitive to small changes of the values of the coefficients:

$$\begin{aligned} f(x) &= (x - 1)(x - 2) \cdots (x - 20) \\ &= x^{20} - 210x^{19} + \cdots + 20! \end{aligned}$$
While the solution is easy to see from the factored form, the solution is very sensitive to perturbations of the coefficients. For example, changing the coefficient $-210$ to $-(210 + 2^{-23})$ changes the roots drastically; in fact, 10 of them are now complex. Of course the extreme variation in the magnitudes of the coefficients should give us some indication that the problem may be ill-conditioned.

We attempt to quantify the condition of a set of data for a particular set of operations by means of a condition number. Condition numbers are defined to be positive and so that large values of the numbers mean that the data or problems are ill-conditioned. A useful condition number for the problem of finding roots of a function can be defined in terms of the derivative of the function in the vicinity of a root. We will also see that condition numbers must be used with some care. For example, according to the condition number for finding roots, Wilkinson’s polynomial is well-conditioned. In the solution of a linear system of equations, the coefficient matrix determines the condition of this problem, and we generally define the condition number for a matrix with respect to the problem of solving a linear system of equations.

The ability of an algorithm to handle a wide range of data, and either to solve the problem as requested or to determine that the condition of the data does not allow the algorithm to be used is called the robustness of the algorithm.
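The sensitivity of the roots of Wilkinson’s polynomial can be gauged without computing any perturbed roots. The sketch below is an illustration added here, not Wilkinson’s own analysis: it assumes the perturbed polynomial is $f(x) - \delta x^{19}$ with $\delta = 2^{-23}$, and uses the standard first-order result that the root at $j$ then moves at the rate $j^{19}/f'(j)$ per unit of $\delta$, where $f'(j) = \prod_{i \ne j}(j - i)$. The name `root_sensitivity` is hypothetical, and exact rational arithmetic is used so that rounding plays no role.

```python
from fractions import Fraction
from math import factorial


def root_sensitivity(j, n=20):
    """First-order rate d(root_j)/d(delta) for (x-1)...(x-n) - delta*x^(n-1),
    evaluated at delta = 0 in exact rational arithmetic."""
    # f'(j) = prod_{i != j} (j - i) = (-1)^(n-j) * (j-1)! * (n-j)!
    fprime = (-1) ** (n - j) * factorial(j - 1) * factorial(n - j)
    return Fraction(j ** (n - 1), fprime)


delta = Fraction(1, 2 ** 23)
# Linearized displacement of each root for the perturbation 2^(-23).
shifts = {j: float(root_sensitivity(j) * delta) for j in range(1, 21)}
most_sensitive = max(shifts, key=lambda j: abs(shifts[j]))
```

For the larger roots the linearized displacements exceed the spacing between adjacent roots, so they cannot be read as actual displacements; they indicate only that the linear approximation breaks down, which is consistent with roots coalescing and becoming complex. The smallest roots, by contrast, barely move.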
Another concept that is quite different from robustness is stability. An algorithm is said to be stable if it always yields a solution that is an exact solution to a perturbed problem; that is, for the problem of computing $f(x)$ using the input data $x$, an algorithm is stable if the result it yields, $\tilde{f}(x)$, is $f(x + \delta x)$ for some (bounded) perturbation $\delta x$ of $x$. Stated another way, an algorithm is stable if small perturbations in the input or in intermediate computations do not result in large differences in the results.

The concept of stability, for an algorithm, should be contrasted with the concept of condition, for a problem or a dataset. If a problem is ill-conditioned, a stable algorithm (a “good algorithm”) will produce results with large differences for small differences in the specification of the problem. This is because the exact results have large differences. An algorithm that is not stable, however, may produce large differences for small differences in the computer description of the problem, which may involve rounding, truncation, or discretization, or for small differences in the intermediate computations performed by the algorithm.

The concept of stability arises from backward error analysis. The stability of an algorithm may depend on how continuous quantities are discretized, as when a range is gridded for solving a differential equation. See Higham (2002) for an extensive discussion of stability.

Reducing the Error in Numerical Computations

An objective in designing an algorithm to evaluate some quantity is to avoid accumulated rounding error and to avoid catastrophic cancellation. In the discussion of floating-point operations above, we have seen an example of where catastrophic cancellation was the culprit in a numerical computation that involved many individual computations. In that example there is negligible effect of accumulated rounding error.
Often when a finite series is to be evaluated, it is necessary to accumulate a set of terms of the series that have similar magnitude, and then combine this with similar partial sums. It may also be necessary to scale the individual terms by some very large or very small multiplicative constant while the terms are being accumulated, and then remove the scale after some computations have been performed.
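The effect of the order in which terms are accumulated can be seen in a small experiment. This sketch assumes IEEE double precision (the usual Python `float`); the data, one term equal to 1 followed by many terms equal to $10^{-16}$, are contrived for illustration, and `naive_sum` is a hypothetical helper. It illustrates only the grouping of terms of similar magnitude, not the scaling device mentioned above.

```python
import math


def naive_sum(terms):
    """Accumulate the terms from left to right in floating point."""
    total = 0.0
    for t in terms:
        total += t
    return total


terms = [1.0] + [1e-16] * 100_000

# Large term first: each 1e-16 is below half the spacing of floats near 1,
# so every addition rounds back to 1.0 and the small terms are lost.
large_first = naive_sum(terms)

# Small terms first: they accumulate among themselves to about 1e-11,
# which is large enough to register when finally combined with 1.0.
small_first = naive_sum(sorted(terms))

# math.fsum tracks exact partial sums and serves as a reference value.
reference = math.fsum(terms)
```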
2.2.2 Efficiency
The efficiency of an algorithm refers to its usage of computer resources. The two most important resources are the processing units and memory. The amount of time the processing units are in use and the amount of memory required are the key measures of efficiency. A limiting factor for the time the processing units are in use is the number and type of operations required. Some operations take longer than others; for example, the operation of adding floating-point numbers may take more time than the operation of adding fixed-point numbers. This, of course, depends on the computer system and on what kinds of floating-point or
fixed-point numbers we are dealing with. If we have a measure of the size of the problem, we can characterize the performance of a given algorithm by specifying the number of operations of each type, or just the number of operations of the slowest type.

If more than one processing unit is available, it may be possible to perform operations simultaneously. In this case the amount of time required may be drastically smaller for an efficient parallel algorithm than it would for the most efficient serial algorithm that utilizes only one processor at a time. An analysis of the efficiency must take into consideration how many processors are available, how many computations can be performed in parallel, and how often they can be performed in parallel.

Often instead of the exact number of operations, we use the order of the number of operations in terms of the measure of problem size. If $n$ is some measure of the size of the problem, an algorithm has order $O(f(n))$ if, as $n \to \infty$, the number of computations $\to cf(n)$, where $c$ is some constant. For example, to multiply two $n \times n$ matrices in the obvious way requires $O(n^3)$ multiplications and additions; to multiply an $n \times m$ matrix and an $m \times p$ matrix requires $O(nmp)$ multiplications and additions. In the latter case, $n$, $m$, and $p$ are all measures of the size of the problem.

Notice that in the definition of order there is a constant $c$. Two algorithms that have the same order may have different constants, and in that case are said to “differ only in the constant”. The order of an algorithm is a measure of how well the algorithm “scales”; that is, the extent to which the algorithm can deal with truly large problems.

Let $n$ be a measure of the problem size, and let $b$ and $q$ be constants. An algorithm of order $O(b^n)$ has exponential order, one of order $O(n^q)$ has polynomial order, and one of order $O(\log n)$ has log order. Notice that for log order, it does not matter what the base is. Also, notice that $O(\log n^q) = O(\log n)$.
For a given task with an obvious algorithm that has polynomial order, it is often possible to modify the algorithm to address parts of the problem so that in the order of the resulting algorithm one $n$ factor is replaced by a factor of $\log n$.

Although it is often relatively easy to determine the order of an algorithm, an interesting question in algorithm design involves the order of the problem, that is, the order of the most efficient algorithm possible. A problem of polynomial order is usually considered tractable, whereas one of exponential order may require a prohibitively excessive amount of time for its solution. An interesting class of problems consists of those for which a solution can be verified in polynomial time, yet for which no polynomial algorithm is known to exist. Such a problem is called a nondeterministic polynomial, or NP, problem. “Nondeterministic” does not imply any randomness; the term comes from the idea of a machine that could guess a candidate solution and then verify it in polynomial time, and for the problems of interest here no polynomial algorithm for determining the solution is known. Most interesting NP problems can be shown to be equivalent to each other in order by reductions that require polynomial time. Any problem in this subclass of NP problems is equivalent in some sense to all other problems in the subclass and so such a problem is said to be NP-complete.

For many problems it is useful to measure the size of a problem in some standard way and then to identify the order of an algorithm for the problem with separate components. A common measure of the size of a problem is $L$, the length of the stream of data elements. An $n \times n$ matrix would have length proportional to $L = n^2$, for example. To multiply two $n \times n$ matrices in the obvious way requires $O(L^{3/2})$ multiplications and additions, as we mentioned above. In analyzing algorithms for more complicated problems, we may wish to determine the order in the form $O(f(n)g(L))$, because $L$ is an essential measure of the problem size, and $n$ may depend on how the computations are performed. For example, in the linear programming problem, with $n$ variables and $m$ constraints with a dense coefficient matrix, there are order $nm$ data elements. Algorithms for solving this problem generally depend in the limit on $n$, so we may speak of a linear programming algorithm as being $O(n^3 L)$, for example, or of some other algorithm as being $O(\sqrt{n} L)$. (In defining $L$, it is common to consider the magnitudes of the data elements or the precision with which the data are represented, so that $L$ is the order of the total number of bits required to represent the data. This level of detail can usually be ignored, however, because the limits involved in the order are generally not taken on the magnitude of the data, only on the number of data elements.)

The order of an algorithm (or, more precisely, the “order of operations of an algorithm”) is an asymptotic measure of the operation count as the size of the problem goes to infinity. The order of an algorithm is important, but in practice the actual count of the operations is also important. In practice, an algorithm whose operation count is approximately $n^2$ may be more useful than one whose count is $1000(n \log n + n)$, although the latter would have order $O(n \log n)$, which is much better than that of the former, $O(n^2)$.
When an algorithm is given a fixed-size task many times, the finite efficiency of the algorithm becomes very important. The number of computations required to perform some tasks depends not only on the size of the problem, but also on the data. For example, for most sorting algorithms, it takes fewer computations (comparisons) to sort data that are already almost sorted than it does to sort data that are completely unsorted. We sometimes speak of the average time and the worst-case time of an algorithm. For some algorithms these may be very different, whereas for other algorithms or for some problems these two may be essentially the same. Our main interest is usually not in how many computations occur, but rather in how long it takes to perform the computations. Because some computations can take place simultaneously, even if all kinds of computations required the same amount of time, the order of time may be different from the order of the number of computations.
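The dependence of the operation count on the data, and not only on the problem size, can be made concrete by counting comparisons. Insertion sort is used here purely as an assumed example of a sorting algorithm with this property; the text does not single out any particular method, and the function name is hypothetical.

```python
def insertion_sort_comparisons(data):
    """Sort a copy of data by insertion sort and count the comparisons made.
    Returns (sorted_list, number_of_comparisons)."""
    a = list(data)
    comparisons = 0
    for i in range(1, len(a)):
        item = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= item:        # item is already in place relative to a[j]
                break
            a[j + 1] = a[j]         # shift the larger element one slot right
            j -= 1
        a[j + 1] = item
    return a, comparisons


n = 100
_, best = insertion_sort_comparisons(range(n))              # already sorted
_, worst = insertion_sort_comparisons(range(n, 0, -1))      # reverse order
nearly = list(range(n))
nearly[10], nearly[11] = nearly[11], nearly[10]             # one adjacent swap
_, almost = insertion_sort_comparisons(nearly)
```

For this method the sorted input needs $n - 1$ comparisons and the reversed input $n(n-1)/2$, so the best-case and worst-case counts differ in order, not just in the constant.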
In addition to the actual processing, the data may need to be copied from one storage position to another. Data movement slows the algorithm, and may cause it not to use the processing units to their fullest capacity. When groups of data are being used together, blocks of data may be moved from ordinary storage locations to an area from which they can be accessed more rapidly. The efficiency of a program is enhanced if all operations that are to be performed on a given block of data are performed one right after the other.

Iterations and Convergence

Most optimization algorithms are iterative; that is, groups of computations form successive approximations to the desired solution. In a program, this usually means a loop through a common set of instructions in which each pass through the loop changes the initial values of operands in the instructions. We will generally use the notation $x^{(k)}$ to refer to the computed value of $x$ at the $k$th iteration.

An iterative algorithm terminates when some convergence criterion or stopping criterion is satisfied. An example is to declare that an algorithm has converged when $\Delta(x^{(k)}, x^{(k-1)}) \le \epsilon$, where $\Delta(x^{(k)}, x^{(k-1)})$ is some measure of the difference of $x^{(k)}$ and $x^{(k-1)}$ and $\epsilon$ is a small positive number. Because $x$ may not be a single number, we must consider general measures of the difference of $x^{(k)}$ and $x^{(k-1)}$. For example, if $x$ is a vector, the measure may be some norm. In that case, $\Delta(x^{(k)}, x^{(k-1)})$ may be denoted by $\|x^{(k)} - x^{(k-1)}\|$.

An iterative algorithm may have more than one stopping criterion. Often, a maximum number of iterations is set, so that the algorithm will be sure to terminate whether it converges or not. (As noted above, some people define the term “algorithm” to refer only to methods that converge. Under this definition, whether or not a method is an “algorithm” may depend on the input data, unless a stopping rule based on something independent of the data, such as number of iterations, is applied.
In any event, it is always a good idea, in addition to stopping criteria based on convergence of the solution, to have a stopping criterion that is independent of convergence and that limits the number of operations.)

The convergence ratio of the sequence $x^{(k)}$ to a constant $x_0$ is

$$\lim_{k \to \infty} \frac{\Delta(x^{(k+1)}, x_0)}{\Delta(x^{(k)}, x_0)},$$

if this limit exists. If the convergence ratio is greater than 0 and less than 1, the sequence is said to converge linearly. If the convergence ratio is 0, the sequence is said to converge superlinearly. Other measures of the rate of convergence are based on

$$\lim_{k \to \infty} \frac{\Delta(x^{(k+1)}, x_0)}{\left(\Delta(x^{(k)}, x_0)\right)^r} = c, \qquad (2.3)$$
(again, assuming the limit exists, i.e., $c < \infty$). In (2.3), the exponent $r$ is called the rate of convergence, and the limit $c$ is called the rate constant. If $r = 2$ (and $c$ is finite), the sequence is said to converge quadratically. It is clear that for any $r > 1$ (and finite $c$), the convergence is superlinear. Convergence defined in terms of equation (2.3) is sometimes referred to as “Q-convergence”, because the criterion is a quotient, and specific rates of convergence may then be referred to as “Q-linear”, “Q-quadratic”, and so on.

The convergence rate is often a function of $k$, say $h(k)$. The convergence is then expressed as an order in $k$, $O(h(k))$.

Improving Efficiency

There are many ways to attempt to improve the efficiency of an algorithm. Often the best way is just to look at the task from a higher level, and attempt to construct a new algorithm. Many obvious algorithms are serial methods that would be used for hand computations, and so are not the best for use on the computer.

An effective general method of developing an efficient algorithm is called divide and conquer. In this method, the problem is broken into subproblems, each of which is solved, and then the subproblem solutions are combined into a solution for the original problem. In some cases, this can result in a net savings either in the number of computations, resulting in improved order of computations, or in the number of computations that must be performed serially, resulting in improved order of time.

Let the time required to solve a problem of size $n$ be $t(n)$, and consider the recurrence relation $t(n) = p\,t(n/p) + cn$, for $p > 1$ and $c$ nonnegative. Then $t(n) = O(n \log n)$. Divide and conquer strategies can sometimes be used together with a simple method that would be $O(n^2)$ if applied directly to the full problem to reduce the order to $O(n \log n)$.
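The divide and conquer recurrence can be evaluated directly to see the $n \log n$ growth. This is a sketch under assumptions the text does not state: $p$ is an integer greater than 1, $n$ is a power of $p$, and the base case is $t(1) = 0$. Under those assumptions the recurrence has the closed form $t(n) = c\,n \log_p n$.

```python
import math


def t(n, p=2, c=1.0):
    """Evaluate t(n) = p * t(n/p) + c * n with the assumed base case t(1) = 0.
    Intended for n equal to a power of the integer p > 1."""
    if n <= 1:
        return 0.0
    return p * t(n // p, p, c) + c * n


n = 1024
divide_and_conquer_cost = t(n)              # follows the recurrence
closed_form = n * math.log2(n)              # c * n * log_p(n) with c = 1, p = 2
direct_cost = float(n * n)                  # cost of a direct O(n^2) method
```

For $n = 1024$ the recurrence gives a cost two orders of magnitude below $n^2$, and the gap widens as $n$ grows.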
Although there have been orders of magnitude improvements in the speed of computers because the hardware is better, the order of time required to solve a problem is dependent almost entirely on the algorithm. The improvements in efficiency resulting from hardware improvements are generally differences only in the constant. The practical meaning of the order of the time must be considered, however, and so the constant may be important.

Some algorithms are designed so that each step is as efficient as possible, without regard to what future steps may be part of the algorithm. An algorithm that follows this principle is called a greedy algorithm. A greedy algorithm is often useful in the early stages of computation for a problem, or when a problem lacks an understandable structure.
Bottlenecks and Limits

There is a maximum rate of floating-point operations possible for a given computer system. This rate depends on how fast the individual processing units are, how many processing units there are, and how fast data can be moved around in the system. The more efficient an algorithm is, the closer its achieved rate is to the maximum rate.

For a given computer system, there is also a maximum rate possible for a given problem. This has to do with the nature of the tasks within the given problem. Some kinds of tasks can utilize various system resources more easily than other tasks. If a problem can be broken into two tasks, $T_1$ and $T_2$, such that $T_1$ must be brought to completion before $T_2$ can be performed, the total time required for the problem depends more on the task that takes longer. This tautology has important implications for the limits of efficiency of algorithms. It is the basis of “Amdahl’s law” or “Ware’s law” (Amdahl, 1967) that puts limits on the speedup of problems that consist of both tasks that must be performed sequentially and tasks that can be performed in parallel.

The efficiency of an algorithm may depend on the organization of the computer, on the implementation of the algorithm in a programming language, and on the way the program is compiled.
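The text cites Amdahl's law without writing it out. In its usual form, if a fraction $f$ of the work can be spread over $n$ processors and the remainder must run serially, the speedup is $1/((1 - f) + f/n)$. The sketch below evaluates that formula; the function name and the example fraction 0.9 are illustrative assumptions, not values from the text.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup given by the usual form of Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

# Speedup for a task that is 90% parallelizable (an assumed figure).
speedups = {n: amdahl_speedup(0.9, n) for n in (1, 10, 100, 1000)}

# Bound on the speedup as the number of processors grows without limit.
limit = 1.0 / (1.0 - 0.9)
```

However many processors are added, the speedup stays below `limit`, because the serial part of the work is untouched.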
Exercises

2.1. Machine characteristics.
(a) Write a program to determine the smallest and largest relative spacings. Use it to determine them on the machine you are using.
(b) Write a program to determine whether your computer system implements gradual underflow.
(c) Write a program to determine the bit patterns of $+\infty$, $-\infty$, and NaN on a computer that implements the IEEE binary standard. (This may be more difficult than it seems.)

2.2. What is the rounding unit ($\frac{1}{2}$ ulp) in the IEEE Standard 754 double precision?

2.3. Consider the standard model (2.1) for the floating-point representation: $\pm 0.d_1 d_2 \cdots d_p \times b^e$, with $e_{\min} \le e \le e_{\max}$. Your answers may depend on an additional assumption or two. Either choice of (standard) assumptions is acceptable.
(a) How many floating-point numbers are there?
(b) What is the smallest positive number?
(c) What is the smallest number larger than 1?
(d) What is the smallest number $X$, such that $X + 1 = X$?
(e) Suppose $p = 4$ and $b = 2$ (and $e_{\min}$ is very small and $e_{\max}$ is very large). What is the next number after 20 in this number system?

2.4.
(a) Define parameters of a floating-point model so that the number of numbers in the system is less than the largest number in the system.
(b) Define parameters of a floating-point model so that the number of numbers in the system is greater than the largest number in the system.

2.5. Suppose that a certain computer represents floating point numbers in base 10, using eight decimal places for the mantissa, two decimal places for the exponent, one decimal place for the sign of exponent, and one decimal place for the sign of the number.
(a) What is the “smallest relative spacing” and the “largest relative spacing”? (Your answer may depend on certain additional assumptions about the representation; state any assumptions.)
(b) What is the largest number $g$, such that $417 + g = 417$?
(c) Discuss the associativity of addition using numbers represented in this system. Give an example of three numbers, $a$, $b$, and $c$, such that using this representation, $(a + b) + c \ne a + (b + c)$, unless the operations are chained. Then show how chaining could make associativity hold for some more numbers, but still not hold for others.
(d) Compare the maximum rounding error in the computation $x + x + x + x$ with that in $4 * x$. (Again, you may wish to mention the possibilities of chaining operations.)

2.6. Consider the same floating-point system of Exercise 2.5.
(a) Let $X$ be a random variable uniformly distributed over the interval $[1 - .000001, 1 + .000001]$. Develop a probability model for the representation $[X]_c$. (This is a discrete random variable with 111 mass points.)
(b) Let $X$ and $Y$ be random variables uniformly distributed over the same interval as above. Develop a probability model for the representation $[X + Y]_c$. (This is a discrete random variable with 121 mass points.)
(c) Develop a probability model for $[X]_c \,[+]_c\, [Y]_c$. (This is also a discrete random variable with 121 mass points.)

2.7. Give an example to show that the sum of three floating-point numbers can have a very large relative error.

2.8. Errors in computations.
(a) Explain the difference in truncation and cancellation.
(b) Why is cancellation not a problem in multiplication?

2.9. Consider the problem of computing $w = x - y + z$, where each of $x$, $y$, and $z$ is nonnegative. Write a robust expression for this computation.
Chapter 3
Basic Definitions and Properties of Functions

A function is a set of ordered pairs no two of which have the same first element. The set of all first elements is the domain of the function, and the set of all second elements is the range. An element in the domain is called an argument of the function, and an element in the range is called a value of the function. We denote both the function, that is, the set, and the rule of correspondence between the elements of the pairs in the set by the same symbol; for example, we may speak of the function $f$, or the rule $f$, which we may write as $f(\cdot)$. For $x$ in the domain, we denote the second element in the pair as $f(x)$.

The elements in a function may be any type of object. In most of our applications in optimization both the domain and the range consist of real numbers or tuples of real numbers. If the elements in the range are single real numbers, the function is a scalar function; if they are tuples, it is a vector function. The most common type of function we will consider is a scalar function of a vector argument. (The reader is reminded that I do not distinguish a vector by any special notation.) Another important type of function is one whose domain is countable in one or more dimensions. Functions over discrete domains generally require methods different from those for continuous functions.

Optimization of Functions

The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function $f$ occurs and, usually, to find the value of the function at that point. In some cases the value of the function at its optimum is known, so all that is to be done is to determine the location of the point. Finding the value at a given location is usually trivial compared to other aspects of the problem; however, if the optimal function value is known, that knowledge can sometimes be used to increase the efficiency in finding the location of the optimum.
We use the term “optimum” or “extremum” to refer to a minimum or maximum. We commonly consider the minimization problem only. This is without loss of generality, because maximizing $f(x)$ is equivalent to minimizing its negative, $-f(x)$. The general unconstrained optimization problem can be stated as the problem of finding the vector $x_*$ where the minimum occurs,

$$\arg\min_x f(x) = x_*, \tag{3.1}$$

or of finding the minimum value of the function

$$\min_x f(x) = f(x_*).$$
The function $f$ is called the objective function. The elements of $x$ are often called decision variables. In a statistical estimation or modeling problem, the decision variables are the parameters to be estimated.

If the domain is countable, we say the optimization problem is “discrete”. If the domain is continuous in all variables, we say the optimization problem is “continuous” and we refer to the domain as “dense”. The techniques for functions over discrete domains are generally different from those for functions over dense domains. Most of this chapter and Chapters 4 and 5 concern continuous functions over dense domains, while Chapter 6 addresses optimization over discrete domains. Many optimization problems have both discrete variables and continuous variables. In notation such as $f(x)$, we implicitly assume $x$ is in the domain of $f$, and so the elements of $x$ are discrete or continuous as required by the definition of the function.
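The distinction between the minimizer (arg min) and the minimum value, and the device of maximizing a function by minimizing its negative, can be illustrated on a small discrete domain. This is only a sketch: the helper `argmin`, the two example functions, and the domain of integers from −5 to 5 are all hypothetical choices made for illustration.

```python
def argmin(func, domain):
    """Return a point of the finite iterable `domain` at which func is smallest."""
    return min(domain, key=func)

def f(x):
    # Example objective function to be minimized.
    return (x - 2) ** 2 + 1

def g(x):
    # Example function to be maximized.
    return -(x + 1) ** 2 + 3

domain = range(-5, 6)

x_star = argmin(f, domain)   # location of the minimum (the arg min)
f_min = f(x_star)            # value of the minimum

# Maximizing g is done by minimizing -g over the same domain.
x_max = argmin(lambda x: -g(x), domain)
g_max = g(x_max)
```

Exhaustive search of this kind is feasible only because the domain is tiny; it is meant to fix the notation, not to suggest a method.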
3.1
Shapes of Functions
A set of points, $S$, is convex if for all $x_1$ and $x_2$ in $S$, and for $0 < \alpha < 1$, the point $\alpha x_1 + (1 - \alpha)x_2$ is in $S$. An extreme point of a convex set $S$ is a point $x$ in $S$ that cannot be represented as $x = \alpha x_1 + (1 - \alpha)x_2$ for any two distinct points $x_1$ and $x_2$ in $S$ and $0 < \alpha < 1$. For any $d$, $\mathbb{R}^d$ is convex with no extreme points. A convex set in $\mathbb{R}^d$ with exactly $d + 1$ extreme points is called a simplex. A convex function on a convex set $S$ is a function $f$ such that for $0 < \alpha < 1$, and for all $x_1$ and $x_2$ in $S$,

$$f\big(\alpha x_1 + (1 - \alpha)x_2\big) \le \alpha f(x_1) + (1 - \alpha) f(x_2). \tag{3.2}$$
Figure 3.1: Convex Sets and Extreme Points

Strict convexity of a function is the condition in which the inequality above is strict. Although less often encountered, convexity of a vector function or matrix function is defined similarly by applying the definitions to each element of the function. Concavity and strict concavity are the conditions in which the inequality in (3.2) is reversed. Functions of these four types are illustrated in Figure 3.2. If $f$ is convex, $-f$ is concave. A concave function is sometimes said to be “concave down”, and a convex function is said to be “concave up”.

For a convex function $f$ of a scalar variable, if its first derivative exists, the derivative is nondecreasing. If its second derivative $f''$ exists, then

$$f''(x) \ge 0 \quad \text{for all } x.$$

A second derivative that is positive everywhere implies strict convexity. (The converse can fail at isolated points: $f(x) = x^4$ is strictly convex, yet $f''(0) = 0$.) Likewise, the second derivative of a concave function is nonpositive, and a second derivative that is negative everywhere implies strict concavity.

For a differentiable function of a vector argument, the derivatives also provide information about the local shape of the function. In this case, we consider the vector of derivatives, the gradient,

$$\nabla f(x) = \left(\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_m}\right).$$

(We often write a vector in the horizontal notation as in the equation above, but whenever we perform multiplication operations on vectors or subsetting operations on matrices, we consider a vector to be a column vector; that is, it behaves in many ways as a matrix with one column.)

Figure 3.2: Convex and Concave Functions (panels: Concave, Convex, Strictly Concave, Strictly Convex)

For a convex function $f$ of a vector variable, if its gradient exists, each element $\partial f(x)/\partial x_i$ is nondecreasing in $x_i$. As in the scalar case, if a function $f$ of a vector argument is twice-differentiable, more information about a stationary point can be obtained from the second derivatives, which comprise a matrix, called the Hessian, which is denoted by $H_f$, and defined as

$$H_f = \nabla\nabla f(x) = \nabla^2 f(x) = \left(\frac{\partial^2 f(x)}{\partial x_i\,\partial x_j}\right) = \frac{\partial^2 f(x)}{\partial x\,\partial x^T}.$$

Notice that the Hessian is a function, so we often specify the point at which it is evaluated in the ordinary function notation, $H_f(x)$. The symbol $\nabla^2 f(x)$ is also sometimes used to denote the Hessian, but because $\nabla^2 f(x)$ is often used to denote the Laplacian (which is the trace of the Hessian, that is, the sum of its diagonal elements), we will use $H_f(x)$ to denote the Hessian.

For a convex function of a vector variable, if the Hessian exists, it is positive semidefinite. A Hessian that is positive definite everywhere implies strict convexity. Generally, if the domain of a convex function is not specified, the function
is assumed to be convex over the reals. Although the definitions of convexity and concavity as stated above apply to functions with continuous domains, analogous concepts can be defined for discrete problems. A function f (x) that is positive on an interval is said to be log convex if log f (x) is convex. Similar definitions apply for concavity. The normal probability density function shown in Figure 3.3, for example, is log concave. More generally, f (x) is said to be T convex (or T concave) if T (f (x)) is convex (or concave).
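The definition of log concavity can be checked numerically against inequality (3.2). The sketch below tests the inequality on a grid for the negative logarithm of the standard normal density (log concavity of $p$ is convexity of $-\log p$) and for the density itself. The helper name `is_convex_on_grid`, the grid, the values of $\alpha$, and the tolerance are assumptions made for illustration; passing a grid check is evidence of convexity, not a proof.

```python
import math

def normal_pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def is_convex_on_grid(h, points, alphas=(0.25, 0.5, 0.75), tol=1e-12):
    """Check inequality (3.2) for h at every pair of grid points."""
    for x1 in points:
        for x2 in points:
            for a in alphas:
                lhs = h(a * x1 + (1 - a) * x2)
                rhs = a * h(x1) + (1 - a) * h(x2)
                if lhs > rhs + tol:
                    return False
    return True

grid = [i / 4.0 for i in range(-12, 13)]   # points from -3 to 3

def neg_log_pdf(x):
    return -math.log(normal_pdf(x))

log_concave = is_convex_on_grid(neg_log_pdf, grid)   # -log p is convex
pdf_convex = is_convex_on_grid(normal_pdf, grid)     # p itself is not convex
```

The density fails the convexity check near its mode, while its logarithm is concave throughout, which is the situation Figure 3.3 depicts.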
Figure 3.3: Log Concave Function (panels: $p(x)$ against $x$, and $\log(p(x))$ against $x$)

A useful fact about convex functions that follows immediately by induction on the definition (3.2) is a form of Jensen’s inequality: if $f$ is convex,

$$f\left(\sum_{i=1}^n \alpha_i x_i\right) \le \sum_{i=1}^n \alpha_i f(x_i), \tag{3.3}$$

for all nonnegative $\alpha_i$ such that $\sum_{i=1}^n \alpha_i = 1$. In the case of strict convexity, equality holds only if $x_1 = x_2 = \cdots = x_n$.
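Jensen's inequality (3.3) is easy to verify numerically for a particular convex function. In the sketch below, the function name `jensen_gap`, the choice of the exponential as the convex function, and the points and weights are illustrative assumptions.

```python
import math

def jensen_gap(f, xs, alphas):
    """Right side of (3.3) minus left side; nonnegative when f is convex."""
    assert all(a >= 0 for a in alphas)
    assert abs(sum(alphas) - 1.0) < 1e-12
    mean_point = sum(a * x for a, x in zip(alphas, xs))
    return sum(a * f(x) for a, x in zip(alphas, xs)) - f(mean_point)

xs = [-1.0, 0.5, 2.0, 3.0]
alphas = [0.1, 0.2, 0.3, 0.4]

gap = jensen_gap(math.exp, xs, alphas)              # distinct points
gap_equal = jensen_gap(math.exp, [1.5] * 4, alphas) # all points equal
```

The gap is strictly positive for the distinct points and vanishes (up to rounding) when all the points coincide, in line with the remark on strict convexity.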
3.2
Stationary Points of Functions
Methods for finding the minimum of a function of continuous variables may be quite different from methods for finding the minimum over a countable set of choices. Discrete optimization problems involve inspection of combinations. For example, the problem of determining the best way to travel between a set of cities so as to visit each one at least once involves selection of an optimal permutation of the list of cities. The objective function for this problem is the total distance, or some other measure of the cost in traveling between all the cities in the order specified. Although the set of possibilities is countable, optimization problems of this type often require extensive computations. Sometimes the objective function is very complicated and expensive or time-consuming
to evaluate; the evaluation of the objective function may even involve making physical observations.

The objective function may have an optimum at only one point or may have optima at many points. If

$$f(x_*) \le f(x) \quad \text{for all } x,$$

$f(x_*)$ is called a global minimum and $x_*$ is called a global minimizer. If $f(x_*) < f(x)$ at all points $x \ne x_*$, $f(x_*)$ and $x_*$ are called a strict global minimum and a strict global minimizer respectively. For a function $f$ with continuous domain, if

$$f(x_*) \le f(x) \quad \text{for all } x \text{ such that } \|x - x_*\| < \delta$$

for some $\delta > 0$, $f(x_*)$ is called a local minimum and $x_*$ is called a local minimizer. Notice that a global minimum is also a local minimum. The qualifier “strict” is also applied to local minima in the same way as with global minima.

The terms “global”, “local”, and “strict” are also used in discrete problems. A local minimum in a discrete problem is a point that has a smaller function value than the function values of nearby points, for some definition of “nearby points”.

Differentiable Functions

If $f$ is a differentiable function of a scalar variable, then for a local or global minimizer $x_*$, $f'(x_*) = 0$. This is obvious from the definition of the derivative and the conditions at the minimum, $f(x_*) \le f(x_* - \delta)$ and $f(x_*) \le f(x_* + \delta)$ for small $\delta > 0$. The derivative is also zero at a local or global maximizer, so obviously this condition is not sufficient to identify a minimum. The derivative may also be zero at a point that is neither a minimum nor a maximum, for example, at $x = 0$ for $f(x) = x^3$. Such a point is called an inflection point. Any point $x_s$ such that $f'(x_s) = 0$ is called a stationary point. Stationary points can be nuisances when we attempt to find maxima or minima.

If a function $f$ of a scalar argument is twice-differentiable, information about a stationary point can be obtained from the second derivative. If $x_s$ is a stationary point and $f''(x_s) > 0$, the stationary point is a minimum; if $f''(x_s) < 0$, the stationary point is a maximum; if $f''(x_s) = 0$, the second derivative alone does not settle the matter, and the point may be an inflection point (as for $x^3$ at 0) or an extremum (as for $x^4$ at 0). Again, this is easily seen by using the definition of the derivative of $f'$ at $x_s$ and the fact that $f'(x_s) = 0$. At a minimum with $f''(x_s) > 0$, for example, $f'(x_s - \delta) < 0$ and $f'(x_s + \delta) > 0$ for small $\delta > 0$.

Figure 3.4: Stationary Points of a Continuous Function (labeled points: Global Maximum, Inflection Point, Local Minimum, Global Minimum)

Scalar Functions of Vectors

If the vector $x_*$ is a minimum point of the function $f$, and $f$ is continuously differentiable in a neighborhood of $x_*$, then, similar to the derivative of a function of a scalar, $\nabla f(x_*) = 0$. If this were not the case, we could let $p = -\nabla f(x_*)$ and write

$$p^T \nabla f(x_*) = -\big(\nabla f(x_*)\big)^T \nabla f(x_*) < 0,$$

and then, because $\nabla f(x)$ is continuous in a neighborhood of $x_*$, for some positive $t_0$ we would have

$$p^T \nabla f(x_* + tp) < 0 \quad \text{for all } t \in [0, t_0].$$
Now, if we choose $t_1 \in (0, t_0]$, from the mean-value theorem (Taylor’s formula with remainder), we have

$$f(x_* + t_1 p) = f(x_*) + t_1 p^T \nabla f(x_* + t_2 p) \quad \text{for some } t_2 \in (0, t_1).$$

This, however, would mean $f(x_* + t_1 p) < f(x_*)$, contradicting the assumption that $x_*$ is a minimum. The fact that $\nabla f(x_*) = 0$ at a minimum point is sometimes called the “first-order necessary condition” for $x_*$ to be a local minimum.

If $\nabla f(x_s) = 0$, $x_s$ is a stationary point; but it may or may not be a minimum. Also, of course, if it is a minimum, it may only be a local minimum. A stationary point may be a local or global minimum, a local or global maximum, an inflection point, or a saddlepoint. A saddlepoint is a point, $x_1$, in $\mathbb{R}^m$ such that, for some vectors $v_1$ and $v_2$, $f(x_1) < f(x_1 + \alpha v_1)$ and $f(x_1) > f(x_1 + \alpha v_2)$, for $0 < |\alpha| < \alpha_0$ for a given $\alpha_0 > 0$.
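A concrete saddlepoint makes the definition above easier to see. The sketch below uses $f(x_1, x_2) = x_1^2 - x_2^2$, which is an example chosen for illustration (it is not necessarily the function plotted in the text), with a central-difference gradient. The function names, step size, and directions $v_1 = (1, 0)$ and $v_2 = (0, 1)$ are assumptions.

```python
def f(x1, x2):
    # Example function with a saddlepoint at the origin.
    return x1 ** 2 - x2 ** 2

def gradient(func, x1, x2, h=1e-6):
    """Central-difference approximation to the gradient of a function of two variables."""
    d1 = (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)
    d2 = (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h)
    return d1, d2

# The origin is a stationary point: the gradient vanishes there.
g = gradient(f, 0.0, 0.0)

# Along v1 = (1, 0) the function rises; along v2 = (0, 1) it falls.
alphas = [-0.1, -0.01, 0.01, 0.1]
up_along_v1 = all(f(a, 0.0) > f(0.0, 0.0) for a in alphas)
down_along_v2 = all(f(0.0, a) < f(0.0, 0.0) for a in alphas)
```

The first-order necessary condition holds at the origin, yet the point is neither a minimum nor a maximum.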
Figure 3.5: Function with a Saddlepoint (axes: $x_1$, $x_2$, and $f(x_1, x_2)$)

In the case of a function with a discrete domain, a saddlepoint can be defined in a similar fashion, although the usual situation is that for some points in a
neighborhood of $x_1$ the function values are greater than $f(x_1)$ and for other points in that neighborhood the function values are less.

If $x_*$ is a minimum point and $H_f(x)$ is continuous in a neighborhood of $x_*$, then $H_f(x_*)$ is positive semidefinite. If $H_f(x_*)$ were not positive semidefinite, we could choose a vector $p$ such that $p^T H_f(x_*) p < 0$. In this case, because $H_f(x)$ is continuous near $x_*$, there would be an interval $[0, t_0]$ such that for any (scalar) $t \in [0, t_0]$, $p^T H_f(x_* + tp) p < 0$. Now, if we choose $t_1 \in (0, t_0]$ and express $f(x_* + t_1 p)$ in a first-order Taylor formula with remainder, we would have for some $t_2 \in (0, t_1)$,

$$\begin{aligned}
f(x_* + t_1 p) &= f(x_*) + t_1 p^T \nabla f(x_*) + \tfrac{1}{2} t_1^2\, p^T H_f(x_* + t_2 p)\, p \\
&= f(x_*) + \tfrac{1}{2} t_1^2\, p^T H_f(x_* + t_2 p)\, p \\
&< f(x_*),
\end{aligned}$$

where the second line uses $\nabla f(x_*) = 0$. But this contradicts the assumption that $f(x_*)$ is a minimum; hence, we must conclude $H_f(x_*)$ is positive semidefinite. The fact that $H_f(x_*)$ is positive semidefinite at a minimum point is sometimes called the “second-order necessary condition” for $x_*$ to be a local minimum.

A sufficient condition for $x_*$ to be a strict local minimum, given the continuity assumptions about the derivatives, is that $\nabla f(x_*) = 0$ and $H_f(x_*)$ is positive definite. Because $H_f(x)$ is continuous in a neighborhood of $x_*$, $H_f(x)$ is positive definite within a sufficiently small distance, say $r$, of $x_*$. For any vector $p$ such that $\|p\| < r$,

$$\begin{aligned}
f(x_* + p) &= f(x_*) + p^T \nabla f(x_*) + \tfrac{1}{2}\, p^T H_f(x_* + \alpha p)\, p \\
&= f(x_*) + \tfrac{1}{2}\, p^T H_f(x_* + \alpha p)\, p,
\end{aligned}$$

where $0 < \alpha < 1$. Because $x_* + \alpha p$ is within $r$ of $x_*$, $H_f(x_* + \alpha p)$ is positive definite and so $p^T H_f(x_* + \alpha p)\, p > 0$ for $p \ne 0$. This implies $f(x_* + p) > f(x_*)$ for any nonzero vector $p$ small enough; hence $f(x_*)$ is a strict local minimum.

A maximum of a function, either a global or local maximum, is sometimes called its mode. This term is often used in reference to a probability density. A function with no local maxima other than a single strict maximum is called unimodal. If $x_\wedge$ is the mode of $f$, then $f(x_\wedge) > f(x)$ for $x \ne x_\wedge$, $f(x)$ is an increasing function for $x \le x_\wedge$, and $f(x)$ is a decreasing function for $x \ge x_\wedge$.

A multivariate function may be unimodal in some directions and not in others. When we say a function is unimodal, we mean it is unimodal in all directions in its domain. Some functions may be unimodal in some projections, or along some slices. A function of $m$ variables is said to be orthounimodal at the mode $x_\wedge$ in $\mathbb{R}^m$, if $f(x_\wedge) > f(x)$ for $x \ne x_\wedge$ and for each $i$, $f(x_1, x_2, \ldots, x_m)$ is an increasing function in $x_i$ for $x_i \le x_{\wedge i}$ and a decreasing function in $x_i$ for $x_i \ge x_{\wedge i}$. See Dharmadhikari and Joag-Dev (1988) for discussion of variations.
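The second-order conditions discussed above can be sketched in code for a function of two variables. The example approximates the Hessian by finite differences and inspects its two eigenvalues at a point that is assumed to be stationary already. All names, the step size, the tolerance, and the three test functions are hypothetical choices for illustration; this is not a general-purpose routine.

```python
import math

def hessian_2d(f, x, y, h=1e-4):
    """Finite-difference approximation to the Hessian entries (fxx, fxy, fyy)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return fxx, fxy, fyy

def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius

def classify_stationary_point(f, x, y, tol=1e-6):
    """Classify a point already known to be stationary, using the Hessian."""
    lo, hi = eigenvalues_sym_2x2(*hessian_2d(f, x, y))
    if lo > tol:
        return "strict local minimum"    # Hessian positive definite
    if hi < -tol:
        return "strict local maximum"    # Hessian negative definite
    if lo < -tol and hi > tol:
        return "saddlepoint"             # Hessian indefinite
    return "inconclusive"                # semidefinite: test does not decide

bowl = lambda x, y: x * x + x * y + y * y
saddle = lambda x, y: x * x - y * y
quartic = lambda x, y: x ** 4 + y ** 4

results = [classify_stationary_point(fn, 0.0, 0.0) for fn in (bowl, saddle, quartic)]
```

The quartic has a strict minimum at the origin even though its Hessian there is zero, which shows that positive definiteness is sufficient but not necessary.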
Vector Functions

The objective of minimizing a vector requires clarification. Just as in the problem of minimizing the vector of residuals in fitting a statistical model, as we discussed on page 2, we must have some single real-valued measure of the vector. A simple and useful measure is a norm, which for a vector $x$ we denote by $\|x\|$, and which for vectors $x$ and $y$ (with the same number of elements) and scalar $a$ satisfies

1. Nonnegativity and mapping of the identity: if $x \ne 0$, then $\|x\| > 0$, and $\|0\| = 0$
2. Relation of scalar multiplication to real multiplication: $\|ax\| = |a|\,\|x\|$ for real $a$
3. Triangle inequality: $\|x + y\| \le \|x\| + \|y\|$

For a vector function $f(x)$, the stationary points of interest are the stationary points of $\|f(x)\|$, for some norm.

There are many norms that could be defined for vectors. The sum of squares of the residual vector $r_i = y_i - f(x_i; t)$ used in equations (1.3) and (1.4) (see page 2) is the square of a norm. In equation (1.5) or (1.7), if $\rho(r_i) = |r_i|^p$ for $p \ge 1$, and the sum is raised to the power $1/p$, the norm is an $L_p$ norm, denoted as $\|\cdot\|_p$. More specifically, it is defined as

$$\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}. \tag{3.4}$$

This is also sometimes called the Minkowski norm. It is easy to see that the $L_p$ norm satisfies the first two conditions above. For general $p \ge 1$ it is somewhat more difficult to prove the triangle inequality (which for the $L_p$ norms is also called the Minkowski inequality), but for some special cases it is straightforward (see Exercise 3.15 for $p = 2$). The most common $L_p$ norms, and in fact, the most commonly used vector norms, are:

• $\|x\|_1 = \sum_i |x_i|$, also called the Manhattan norm because it corresponds to sums of distances along coordinate axes, as one would travel along the rectangular street plan of Manhattan.
• $\|x\|_2 = \sqrt{\sum_i x_i^2} = \sqrt{\langle x, x\rangle}$, also called the Euclidean norm, or the vector length. This is the square root of the inner product of the vector with itself.
• $\|x\|_\infty = \max_i |x_i|$, also called the max norm or the Chebyshev norm.
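The three common norms just listed, and the triangle inequality they satisfy, can be computed directly from definition (3.4). The sketch below is a plain-Python illustration; the function name `lp_norm` and the example vectors are assumptions. It also evaluates the norm for a large $p$ to show how it approaches the max norm.

```python
def lp_norm(x, p):
    """L_p norm of a sequence of reals; p may be float('inf') for the max norm."""
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0, 0.0]
y = [1.0, 2.0, -2.0]
x_plus_y = [a + b for a, b in zip(x, y)]

norm_1 = lp_norm(x, 1)                 # sum of absolute values
norm_2 = lp_norm(x, 2)                 # Euclidean length
norm_inf = lp_norm(x, float("inf"))    # largest absolute value

# Triangle inequality for several values of p.
triangle_ok = all(
    lp_norm(x_plus_y, p) <= lp_norm(x, p) + lp_norm(y, p) + 1e-12
    for p in (1, 2, 3, float("inf"))
)

# For large p the L_p norm is close to the max norm.
norm_50 = lp_norm(x, 50)
```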
The $L_\infty$ norm is defined by taking the limit in an $L_p$ norm. An $L_p$ norm is also called a $p$-norm, or 1-norm, 2-norm, or $\infty$-norm in those special cases.

The inner product or dot product of the $m$-vectors $x$ and $y$, denoted by $\langle x, y\rangle$, is defined by

$$\langle x, y\rangle = \sum_{i=1}^m x_i y_i. \tag{3.5}$$

The $L_2$ norm of $x$ is $\langle x, x\rangle^{1/2}$.

Bounds and Other Constraints in Optimization Problems

If there are no restrictions on $x$, the problem is an unconstrained optimization problem. Often, however, there are limits on $x$; for example, the elements of $x$ must be such that $l_i \le x_i \le u_i$. These simple limits on the decision variables are called bounds. In addition to such simple bounds, $x$ may be required to satisfy other constraints. Constraints are of the general form $g(x) \le b$, where $g$ is a vector-valued function. Bounds are just simple constraints. An equality constraint such as $g_1(x) = b_1$ can be formulated as two inequality constraints, $g_1(x) \le b_1$ and $-g_1(x) \le -b_1$. In some cases, the constraints may just be the domain of the objective function; that is, the objective function may be undefined outside of the region specified by the constraints.

A point that satisfies all constraints is said to be feasible. In an unconstrained optimization problem, all points are feasible. The general optimization problem with bounds and constraints is written as

$$\begin{array}{ll}
\min_x & f(x) \\
\text{s.t.} & -x \le -l \\
 & x \le u \\
 & g(x) \le b.
\end{array} \tag{3.6}$$

This general optimization problem is called a mathematical programming problem. In this standard formulation, the notation “s.t.” means “such that” (it is also commonly read as “subject to”).

Some constraints may be difficult to express in terms of a function, $g$. The decision variables may be required to take on values only in the set of integers, for example. To simplify the notation, sometimes the bounds or other conditions on $x$ are used to define a set $S$, so the constraints are just stated as $x \in S$, properly defined. The set $S$ is called the feasible region.

The inequalities in the constraints allow equality. For continuous variables the problem would not be well-defined otherwise. Any constraint that is satisfied by equality at a given point is called an active constraint at that point, and the point is a boundary point. A point that satisfies all constraints, but is not a boundary point is an interior point.
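The notions of feasible point, active constraint, boundary point, and interior point in problem (3.6) can be made concrete with a small check. The sketch below is illustrative only: the function name `classify_point`, the tolerance, and the example problem (bounds $0 \le x_i \le 1$ and one constraint $x_1 + x_2 \le 1$) are assumptions, not taken from the text.

```python
def classify_point(x, lower, upper, g, b, tol=1e-9):
    """Return (feasible, active) for the bounds and constraints of problem (3.6).

    Each entry of `residuals` is (name, value) with the requirement value <= 0.
    """
    residuals = []
    for i, (xi, li, ui) in enumerate(zip(x, lower, upper)):
        residuals.append((f"lower bound {i}", li - xi))   # -x_i <= -l_i
        residuals.append((f"upper bound {i}", xi - ui))   # x_i <= u_i
    for j, (gj, bj) in enumerate(zip(g(x), b)):
        residuals.append((f"constraint {j}", gj - bj))    # g_j(x) <= b_j
    feasible = all(r <= tol for _, r in residuals)
    active = [name for name, r in residuals if abs(r) <= tol]
    return feasible, active

# Hypothetical example problem.
lower, upper = [0.0, 0.0], [1.0, 1.0]
g = lambda x: [x[0] + x[1]]
b = [1.0]

interior = classify_point([0.25, 0.25], lower, upper, g, b)
on_constraint = classify_point([0.5, 0.5], lower, upper, g, b)
on_bound = classify_point([0.0, 0.5], lower, upper, g, b)
infeasible = classify_point([1.0, 1.0], lower, upper, g, b)
```

A feasible point with an empty active list is an interior point; a feasible point with at least one active constraint is a boundary point.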
General Approaches to Optimization

Various special cases of the general optimization problem require very different approaches. Although there are general purpose algorithms that can be used on almost any optimization problem, for a given special case, these algorithms are likely to be quite inefficient relative to an algorithm designed for that case. If, for example, $f$ and $g$ in problem (3.6) are linear, the problem is called a linear program or a linear programming problem, and there are very efficient methods for solving such linear programming problems that take advantage of the special linear structure. (Notice that the word “program” is sometimes used more or less synonymously with “problem”.) If the values of the decision variables are constrained to a set of integers, the problem is called an integer program or a mixed integer program, and in this case, other specialized algorithms must be used. Chapters 5 and 6 discuss optimization methods for continuous functions and discrete functions respectively.

An unconstrained problem is generally simpler to solve than one with constraints. One general approach to solving an optimization problem with constraints is to begin from an unconstrained optimal solution and then, through a series of steps, to move the solution into the region that satisfies the constraints. This approach may not be reliable if the objective function is not well-behaved outside of the feasible region. Another general approach is to begin inside the feasible region, and to move within the region in directions that decrease the objective function. The optimal solution is often on the boundary of the feasible region, so some common approaches, called active set methods, move along the boundary of the feasible region, seeking decreases in the objective function. We discuss constrained optimization problems in Chapter 7.
3.3
Function Spaces
The algebra of functions is somewhat analogous to the algebra of vectors. There are operations between scalars and functions and between functions. The operations on functions, such as addition, generally are defined only for two functions with the same domain, just as operations on vectors are generally defined only for two vectors with the same number of elements. The domain of the function is analogous to the index set of a vector. The addition operator for real-valued functions is defined in terms of ordinary addition of reals. The addition is defined as the ordinary addition of the function values at each point in the common domain. A set of functions closed under function addition is called a function space. Multiplication of a real-valued function by a real-valued scalar is defined simply as the usual product of the function value and the scalar. A simple linear combination of the functions f (·) and g(·) is af (·) + g(·), where a is a scalar. If a given function can be formed by a linear combination of one or more functions, the set of functions (including the given one) is said to be linearly
dependent; conversely, if in a set of functions no one function can be represented as a linear combination of any of the others, the set of functions is said to be linearly independent. Multiplication of two functions, as with multiplication of vectors, can be defined in different ways. One is simple element-wise multiplication, and this is what is meant by simple juxtaposition: $f(x)g(x)$.
3.3.1
Inner Products and Norms
In an expression similar to equation (3.5), the inner product or dot product of the real functions $f$ and $g$ over the interval $(a, b)$, denoted by $\langle f, g\rangle_{(a,b)}$, or usually just by $\langle f, g\rangle$, is defined as

$$\langle f, g\rangle_{(a,b)} = \int_a^b f(x) g(x)\,dx,$$

if the (Lebesgue) integral exists. (There are more general inner products of functions, but this one is most useful for our purposes.) The inner product for functions has the following properties:

1. Nonnegativity and mapping of the identity: if $f \ne 0$, then $\langle f, f\rangle > 0$ and $\langle 0, 0\rangle = 0$.
2. Commutativity: $\langle f, g\rangle = \langle g, f\rangle$.
3. Factoring of scalar multiplication in dot products: $\langle af, g\rangle = a\langle f, g\rangle$ for real $a$.
4. Relation of function addition to addition of dot products: $\langle f + g, h\rangle = \langle f, h\rangle + \langle g, h\rangle$.

To avoid questions about integrability, we generally restrict attention to functions whose dot products with themselves exist; that is, to functions that are square Lebesgue integrable over the region of interest. The set of such square integrable functions is denoted $L^2(a, b)$. In many cases, the range of integration is the real line, and we may use the notation $L^2(\mathbb{R})$, or often just $L^2$, to denote that set of functions and the associated inner product.

The Cauchy-Schwarz inequality for the inner products of functions is

$$\langle f, g\rangle \le \langle f, f\rangle^{1/2} \langle g, g\rangle^{1/2}.$$

This is easy to see, by first observing for every real number $t$,

$$\begin{aligned}
0 &\le \langle tf + g,\; tf + g\rangle \\
&= \langle f, f\rangle t^2 + 2\langle f, g\rangle t + \langle g, g\rangle \\
&= at^2 + bt + c,
\end{aligned}$$
where the constants $a$, $b$, and $c$ correspond to the inner products in the preceding equation. This nonnegative quadratic in $t$ cannot have two distinct real roots, hence the discriminant, $b^2 - 4ac$, must be less than or equal to zero; that is,

$$\left(\tfrac{1}{2} b\right)^2 \le ac.$$

By substituting and taking square roots, we get the Cauchy-Schwarz inequality. It is also clear from this proof that equality holds only if $f = 0$ or if $g = rf$, for some scalar $r$.

We sometimes define function inner products with respect to a weight function, $w(x)$, or with respect to the measure $\mu$, where $d\mu = w(x)\,dx$,

$$\langle f, g\rangle_{(\mu;a,b)} = \int_a^b f(x) g(x) w(x)\,dx,$$

if the integral exists. Often both the weight and the range are assumed to be understood, and the simpler notation $\langle f, g\rangle$ is used.

The norm of a function $f$, denoted generically as $\|f\|$, is defined in terms of an integral of some transformation of the function. The most common norm for a real-valued function is the $L_p$ norm, denoted as $\|f\|_p$, which is defined similarly to the $L_p$ vector norm as:

$$\|f\|_p = \left(\int_a^b |f(x)|^p\, w(x)\,dx\right)^{1/p},$$

if the integral exists. The set of functions for which this integral exists is often denoted by $L^p_{(\mu;a,b)}$. The most common $L_p$ function norm is the $L_2$ norm, which is often denoted simply by $\|f\|$. As with the $L_2$ vector norm, this norm is related to the inner product: $\|f\|_2 = \langle f, f\rangle^{1/2}$. The space consisting of the set of functions whose $L_2$ norms over $\mathbb{R}$ exist together with this norm is denoted $L^2$.

To emphasize the measure of the weighting function, the notation $\|f\|_\mu$ is sometimes used. (The ambiguity of the possible subscripts on $\|\cdot\|$ is usually resolved by the context.) For functions over finite domains, the weighting function is most often the identity.

A normal function is one whose norm is 1. Although this term can be used with respect to any norm, it is generally reserved for the $L_2$ norm, that is, the norm arising from the $L_2$ inner product. A function whose integral (over a relevant range, usually $\mathbb{R}$) is 1 is also called a normal function. (Although this latter meaning is similar to the standard one, the latter meaning may include functions that are not square integrable.) Density and weight functions are often normalized; that is, scaled so as to be normal.
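The function inner product, the $L_2$ norm, the Cauchy-Schwarz inequality, and normalization can all be illustrated with a simple quadrature. The sketch below uses the midpoint rule on $(0, 1)$ with unit weight; the function names, the number of subintervals, and the example functions $f(x) = x$ and $g(x) = x^2$ are assumptions made for illustration. For these two functions the exact values are $\langle f, g\rangle = 1/4$, $\|f\|_2 = 1/\sqrt{3}$, and $\|g\|_2 = 1/\sqrt{5}$.

```python
import math

def inner(f, g, a, b, w=lambda x: 1.0, n=20000):
    """Midpoint-rule approximation to the integral of f(x) g(x) w(x) over (a, b)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        total += f(x) * g(x) * w(x)
    return h * total

def l2_norm(f, a, b, w=lambda x: 1.0):
    """L_2 norm as the square root of the inner product of f with itself."""
    return math.sqrt(inner(f, f, a, b, w))

f = lambda x: x
g = lambda x: x * x

ip = inner(f, g, 0.0, 1.0)
norm_f = l2_norm(f, 0.0, 1.0)
norm_g = l2_norm(g, 0.0, 1.0)

# Cauchy-Schwarz: <f, g> <= ||f|| ||g||.
cauchy_schwarz_holds = ip <= norm_f * norm_g

# Normalizing f: scale it so that its L_2 norm is 1.
f_normalized = lambda x: f(x) / norm_f
norm_of_normalized = l2_norm(f_normalized, 0.0, 1.0)
```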
3.3.2
Hilbert Spaces
For approximation methods it may be important to know that a sequence of functions (or vectors) within a given space converges to a function (or vector) in that space. A sequence $\{f^{(i)}\}$ in an inner product space is said to converge to $f^*$ if given $\epsilon > 0$, there exists an integer $M$ such that $\|f^{(i)} - f^*\| \le \epsilon$ for all $i \ge M$. (This convergence of the norm is uniform convergence. There is also a condition of pointwise convergence of a sequence of functions, that depends on the argument of each function in the sequence.) A sequence is said to be a Cauchy sequence if given $\epsilon > 0$, there exists an integer $M$ such that $\|f^{(i)} - f^{(j)}\| \le \epsilon$ for all $i, j \ge M$. An inner product space in which every Cauchy sequence converges to a member of the space is said to be complete. Such a closed space is called a Hilbert space. The finite-dimensional vector space $\mathbb{R}^d$ and the space of square-integrable functions $L_2$ are both Hilbert spaces. They are, by far, the two most important Hilbert spaces for our purposes. The convergence properties of the iterative methods we often employ in smoothing and in optimization methods generally derive from the fact that we are working in Hilbert spaces.
3.4
Approximation of Functions
There are two reasons we discuss approximation of functions here. One reason is that often in optimization problems, we replace the functions of interest with functions that approximate them. The approximating functions are easier to work with than the original functions. Another reason is that we may want to estimate an unknown function using observed data from the unknown function. An approach to this statistical estimation problem is to approximate the unknown function with some other function and then to fit the approximating function using the observed data. How well one function approximates another function is usually measured by a norm of the difference in the functions over the relevant range. If $g$ approximates $f$, $\|g - f\|_\infty$ is likely to be the norm of interest. This is the norm most often used in numerical analysis when the objective is interpolation or quadrature. In problems with noisy data, or when $g$ may be very different from $f$, $\|g - f\|_2$ may be the more appropriate norm. This is the norm most often used in estimating probability density functions, for example.

Basis Sets in Function Spaces

If each function in a linear space can be expressed as a linear combination of the functions in a set $G$, then $G$ is said to be a generating set, a spanning set, or a basis set for the linear space. (These three terms are synonymous.) The basis
sets for finite-dimensional vector spaces are finite; for most function spaces of interest, the basis sets are infinite. A set of real scalar-valued functions $\{q_k\}$ is orthogonal over the domain $D$ with respect to the nonnegative weight function $w(x)$ if the inner product with respect to $w(x)$ of $q_k$ and $q_l$, $\langle q_k, q_l\rangle$, is 0 if $k \ne l$; that is,
$$\int_D q_k(x)q_l(x)w(x)\,dx = 0, \quad k \ne l. \qquad (3.7)$$
If, in addition,
$$\int_D q_k(x)q_k(x)w(x)\,dx = 1,$$
the functions are called orthonormal. The weight function can also be incorporated into the individual functions to form a different set, $\tilde{q}_k(x) = q_k(x)w^{1/2}(x)$. This set of functions also spans the same function space and is orthogonal over $D$ with respect to a constant weight function. Basis sets consisting of orthonormal functions are generally easier to work with and can be formed from any basis set. Given two nonnull, linearly independent functions, $q_1$ and $q_2$, two orthonormal functions, $\tilde{q}_1$ and $\tilde{q}_2$, that span the same space can be formed as
$$\begin{array}{rcl}
\tilde{q}_1(\cdot) &=& \dfrac{1}{\|q_1\|}\, q_1(\cdot), \\[2ex]
\tilde{q}_2(\cdot) &=& \dfrac{1}{\|q_2 - \langle\tilde{q}_1, q_2\rangle\tilde{q}_1\|}\,\bigl(q_2(\cdot) - \langle\tilde{q}_1, q_2\rangle\tilde{q}_1(\cdot)\bigr).
\end{array} \qquad (3.8)$$
These are the Gram-Schmidt function transformations. They can easily be extended to more than two functions to form a set of orthonormal functions from any set of linearly independent functions.

Series Expansions in Basis Functions

Our objective is to represent a function of interest, $f(x)$, over some domain $D$, as a linear combination of “simpler” functions, $q_0(x), q_1(x), \ldots$:
$$f(x) = \sum_{k=0}^{\infty} c_k q_k(x). \qquad (3.9)$$
There are various ways of constructing the qk functions. If they are developed through a linear operator on a function space, they are called eigenfunctions, and the corresponding ck are called eigenvalues. We choose a set {qk } that spans some class of functions over the given domain D. A set of orthogonal basis functions is often the best choice because
they have nice properties that facilitate computations and a large body of theory about their properties is available. If the function to be estimated, $f(x)$, is continuous and integrable over a domain $D$, the orthonormality property allows us to determine the coefficients $c_k$ in the expansion (3.9):
$$c_k = \langle f, q_k\rangle. \qquad (3.10)$$
The coefficients $\{c_k\}$ are called the Fourier coefficients of $f$ with respect to the orthonormal functions $\{q_k\}$. In applications, we approximate the function using a truncated orthogonal series. The error due to finite truncation of the infinite series after the term with index $j$ is the residual function $f - \sum_{k=0}^{j} c_k q_k$. The mean squared error over the domain $D$ is the scaled, squared $L_2$ norm of the residual,
$$\frac{1}{d}\left\| f - \sum_{k=0}^{j} c_k q_k \right\|^2, \qquad (3.11)$$
where $d$ is some measure of the domain $D$. (If the domain is the interval $[a, b]$, for example, one choice is $d = b - a$.) A very important property of Fourier coefficients is that they yield the minimum mean squared error for a given set of basis functions $\{q_i\}$; that is, for any other constants, $\{a_i\}$, and any $j$,
$$\left\| f - \sum_{k=0}^{j} c_k q_k \right\|^2 \le \left\| f - \sum_{k=0}^{j} a_k q_k \right\|^2 \qquad (3.12)$$
(see Exercise 3.17). In applications of statistical data analysis, after forming the approximation, we then estimate the coefficients from equation (3.10) by identifying an appropriate probability density that is a factor of the function of interest, $f$. (Note again the difference in “approximation” and “estimation”.) Expected values can be estimated using observed or simulated values of the random variable and the approximation of the probability density function.

The basis functions are generally chosen to be easy to use in computations. Common examples include the Fourier trigonometric functions $\sin(kt)$ and $\cos(kt)$ for $k = 1, 2, \ldots$, orthogonal polynomials such as Legendre, Hermite, and so on, splines, and wavelets. We discuss orthogonal polynomials below, and discuss splines beginning on page 57. For use of wavelets in estimating functions we refer the reader to Antoniadis, Gregoire, and McKeague (1994). More general applications of wavelets are considered in the articles in Antoniadis and Oppenheim (1995).

Orthogonal Polynomials

The most useful type of basis function depends on the nature of the function being estimated. The orthogonal polynomials are useful for a very wide range
of functions. Orthogonal polynomials of real variables are their own complex conjugates. It is clear that for the $k$th polynomial in the orthogonal sequence, we can choose an $a_k$ that does not involve $x$, such that $q_k(x) - a_k x q_{k-1}(x)$ is a polynomial of degree $k - 1$. Because any polynomial of degree $k - 1$ can be represented by a linear combination of the first $k$ members of any sequence of orthogonal polynomials, we can write
$$q_k(x) - a_k x q_{k-1}(x) = \sum_{i=0}^{k-1} c_i q_i(x).$$
Because of orthogonality, all $c_i$ for $i < k - 2$ must be 0. Therefore, collecting terms, we have, for some constants $a_k$, $b_k$, and $c_k$, the three-term recursion that applies to any sequence of orthogonal polynomials:
$$q_k(x) = (a_k x + b_k)\,q_{k-1}(x) - c_k\,q_{k-2}(x), \quad \text{for } k = 2, 3, \ldots. \qquad (3.13)$$
This recursion formula is often used in computing orthogonal polynomials. The coefficients in this recursion formula depend on the specific sequence of orthogonal polynomials, of course. This three-term recursion formula can also be used to develop a formula for the sum of products of orthogonal polynomials $q_i(x)$ and $q_i(y)$:
$$\sum_{i=0}^{k} q_i(x)q_i(y) = \frac{1}{a_{k+1}}\,\frac{q_{k+1}(x)q_k(y) - q_k(x)q_{k+1}(y)}{x - y}. \qquad (3.14)$$
This expression, which is called the Christoffel-Darboux formula, is useful in evaluating the product of arbitrary functions that have been approximated by finite series of orthogonal polynomials.

There are several widely used complete systems of univariate orthogonal polynomials. The different systems are characterized by the one-dimensional intervals over which they are defined and by their weight functions. The Legendre, Chebyshev, and Jacobi polynomials are defined over $[-1, 1]$ and hence can be scaled into any finite interval. The weight function of the Jacobi polynomials is more general, so a finite sequence of them may fit a given function better, but the Legendre and Chebyshev polynomials are simpler and so are often used. The Laguerre polynomials are defined over the half line $[0, \infty)$, and the Hermite polynomials are defined over the reals, $(-\infty, \infty)$. Any of these systems of polynomials can be developed easily by beginning with the basis set $1, x, x^2, \ldots$ and orthogonalizing them by use of equations (3.8) and their extensions. Table 3.1 summarizes the ranges and weight functions for these standard orthogonal polynomials.
Table 3.1: Orthogonal Polynomials

Polynomial Series    Range        Weight Function
Legendre             [−1, 1]      1 (uniform)
Chebyshev            [−1, 1]      $(1 - x^2)^{1/2}$ (symmetric beta)
Jacobi               [−1, 1]      $(1 - x)^{\alpha}(1 + x)^{\beta}$ (beta)
Laguerre             [0, ∞)       $x^{\alpha - 1}e^{-x}$ (gamma)
Hermite              (−∞, ∞)      $e^{-x^2/2}$ (normal)
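The orthogonalization of $1, x, x^2, \ldots$ mentioned above can be sketched for the Legendre case (constant weight over $[-1, 1]$). This is an illustration under stated assumptions, not the author's code: polynomials are stored as coefficient lists, the normalization step of equations (3.8) is skipped so that the arithmetic stays exact in rationals, and each result is rescaled to equal 1 at $x = 1$. All function names are hypothetical.

```python
from fractions import Fraction

def mono_inner(i, j):
    """Integral of x^i * x^j over [-1, 1] with weight 1."""
    n = i + j
    return Fraction(2, n + 1) if n % 2 == 0 else Fraction(0)

def inner(p, q):
    """Inner product of polynomials given as coefficient lists (low order first)."""
    return sum(pi * qj * mono_inner(i, j)
               for i, pi in enumerate(p) for j, qj in enumerate(q))

def gram_schmidt(max_degree):
    """Orthogonalize 1, x, ..., x^max_degree; results are left unnormalized."""
    basis = []
    for k in range(max_degree + 1):
        v = [Fraction(0)] * k + [Fraction(1)]      # the monomial x^k
        for q in basis:
            coef = inner(q, v) / inner(q, q)       # projection coefficient
            for i, qi in enumerate(q):             # deg q < k, so index i is valid
                v[i] -= coef * qi
        basis.append(v)
    return basis

def scale_to_one_at_one(p):
    """Rescale so the polynomial equals 1 at x = 1."""
    value_at_one = sum(p)
    return [c / value_at_one for c in p]

polys = [scale_to_one_at_one(p) for p in gram_schmidt(3)]
for p in polys:
    print([str(c) for c in p])
```

Dividing each polynomial by the square root of `inner(p, p)` instead would give the orthonormal version that equations (3.8) describe.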
The Legendre polynomials have a constant weight function and are defined over the interval $[-1, 1]$. The first few (unnormalized) Legendre polynomials are
$$\begin{array}{ll}
P_0(t) = 1 & P_1(t) = t \\
P_2(t) = (3t^2 - 1)/2 & P_3(t) = (5t^3 - 3t)/2 \\
P_4(t) = (35t^4 - 30t^2 + 3)/8 & P_5(t) = (63t^5 - 70t^3 + 15t)/8
\end{array} \qquad (3.15)$$
Graphs of these polynomials are shown in Figure 3.6. The normalizing constant for the $k$th Legendre polynomial is determined by noting
$$\int_{-1}^{1} (P_k(t))^2\,dt = \frac{2}{2k + 1}.$$
The recurrence formula for the Legendre polynomials is
$$P_k(t) = \frac{2k - 1}{k}\,tP_{k-1}(t) - \frac{k - 1}{k}\,P_{k-2}(t). \qquad (3.16)$$
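Recurrence (3.16), the coefficient formula (3.10), and the mean squared error (3.11) fit together in a short computation. The sketch below is illustrative only: the Simpson quadrature, the function names, and the choice of $f(t) = e^{-t}$ over $[-1, 1]$ are assumptions. The orthonormal functions are taken to be $q_k = \sqrt{(2k+1)/2}\,P_k$, which follows from the normalizing constant given above.

```python
import math

def legendre(k, t):
    """Evaluate P_k(t) by the three-term recurrence (3.16)."""
    p_prev, p = 1.0, t                  # P_0 and P_1
    if k == 0:
        return p_prev
    for m in range(2, k + 1):
        p_prev, p = p, ((2 * m - 1) * t * p - (m - 1) * p_prev) / m
    return p

def simpson(func, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    s = func(a) + func(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * func(a + i * h)
    return s * h / 3.0

def q(k, t):
    """Orthonormal Legendre function on [-1, 1]."""
    return math.sqrt((2 * k + 1) / 2.0) * legendre(k, t)

def fourier_coefficients(f, j):
    """c_k = <f, q_k> for k = 0, ..., j, as in equation (3.10)."""
    return [simpson(lambda t, k=k: f(t) * q(k, t), -1.0, 1.0)
            for k in range(j + 1)]

def mean_squared_error(f, coefs):
    """Equation (3.11) with d = b - a = 2."""
    def resid_sq(t):
        approx = sum(c * q(k, t) for k, c in enumerate(coefs))
        return (f(t) - approx) ** 2
    return simpson(resid_sq, -1.0, 1.0) / 2.0

f = lambda t: math.exp(-t)              # illustrative function
coefs = fourier_coefficients(f, 5)
errors = [mean_squared_error(f, coefs[: j + 1]) for j in range(6)]
print(errors)                           # decreases as terms are added
```

Replacing any `coefs[k]` by a different constant and recomputing the error gives a numerical check of inequality (3.12).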
The Hermite polynomials are orthogonal with respect to a Gaussian, or standard normal, weight function. A series using these Hermite polynomials is often called a Gram-Charlier series. (These are not the standard Hermite polynomials, but they are the ones most commonly used by statisticians because the weight function is proportional to the normal density.) The first few Hermite polynomials are
$$\begin{array}{ll}
H^e_0(t) = 1 & H^e_1(t) = t \\
H^e_2(t) = t^2 - 1 & H^e_3(t) = t^3 - 3t \\
H^e_4(t) = t^4 - 6t^2 + 3 & H^e_5(t) = t^5 - 10t^3 + 15t
\end{array} \qquad (3.17)$$
The recurrence formula for the Hermite polynomials is
$$H^e_k(t) = tH^e_{k-1}(t) - (k - 1)H^e_{k-2}(t). \qquad (3.18)$$
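Recurrence (3.18) and the orthogonality with respect to the normal weight can be checked numerically. In this sketch the truncation of the integral to $[-12, 12]$, the trapezoid rule, and the function names are assumptions made for illustration.

```python
import math

def hermite_e(k, t):
    """Evaluate H^e_k(t) by the recurrence (3.18)."""
    h_prev, h = 1.0, t                  # H^e_0 and H^e_1
    if k == 0:
        return h_prev
    for m in range(2, k + 1):
        h_prev, h = h, t * h - (m - 1) * h_prev
    return h

def normal_density(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def weighted_inner(m, n, lim=12.0, steps=4000):
    """Approximate the integral of H^e_m * H^e_n * normal density over the reals.

    The range is truncated to [-lim, lim] and the trapezoid rule is used;
    both are assumptions of this sketch.
    """
    h = 2.0 * lim / steps
    total = 0.0
    for i in range(steps + 1):
        t = -lim + i * h
        wgt = 0.5 if i in (0, steps) else 1.0
        total += wgt * hermite_e(m, t) * hermite_e(n, t) * normal_density(t)
    return total * h

print(hermite_e(4, 2.0))        # same value as t^4 - 6 t^2 + 3 at t = 2
print(weighted_inner(2, 3))     # close to 0, by orthogonality
print(weighted_inner(3, 3))     # close to 3! = 6
```

With the normal density as the weight, the squared norm of $H^e_k$ is $k!$, so dividing by $\sqrt{k!}$ gives orthonormal functions.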
[Figure 3.6: Legendre Polynomials. Curves $P_0$ through $P_5$; vertical axis $P_i(x)$, horizontal axis $x$, both over $[-1, 1]$.]

As an example of the use of orthogonal polynomials to approximate a given function, consider the expansion of $f(x) = e^{-x}$ over the interval $[-1, 1]$. The coefficients are determined by equation (3.10). Graphs of the function and the truncated series approximations using up to six terms ($j = 0, 1, \ldots, 5$) are shown in Figure 3.7. Each truncated series is the best linear combination of the Legendre polynomials (in terms of the $L_2$ norm) of the function using no more than $j + 1$ terms.

Multivariate Orthogonal Polynomials

Multivariate orthogonal polynomials can be formed easily as tensor products of univariate orthogonal polynomials. The tensor product of the functions $f(x)$ over $D_x$ and $g(y)$ over $D_y$ is a function of the arguments $x$ and $y$ over $D_x \times D_y$:
$$h(x, y) = f(x)g(y).$$
If $\{q_{1,k}(x_1)\}$ and $\{q_{2,l}(x_2)\}$ are sequences of univariate orthogonal polynomials, a sequence of bivariate orthogonal polynomials can be formed as
$$q_{kl}(x_1, x_2) = q_{1,k}(x_1)\,q_{2,l}(x_2). \qquad (3.19)$$
[Figure 3.7: Approximations with Legendre Polynomials. Curves labeled exact, j = 0, and j = 1; vertical axis exp(-x) and approximations, horizontal axis $x$ over $[-1, 1]$.]

These polynomials are orthogonal in the same sense as in equation (3.7), where the integration is over the two-dimensional domain. Similarly as in equation (3.9), a bivariate function can be expressed as
$$f(x_1, x_2) = \sum_{k=0}^{\infty}\sum_{l=0}^{\infty} c_{kl}\,q_{kl}(x_1, x_2), \qquad (3.20)$$
with the coefficients being determined by integrating over both dimensions. Although obviously such product polynomials, or radial polynomials, would emphasize features along coordinate axes, they can nevertheless be useful for representing general multivariate functions. Often, it is useful to apply a rotation of the coordinate axes. The weight functions, such as those for the Jacobi polynomials, that have various shapes controlled by parameters can also often be used in a mixture model of the function of interest. The weight function for the Hermite polynomials can be generalized by a linear transformation (resulting in a normal weight with mean $\mu$ and variance $\sigma^2$), and the function of interest may be represented as a mixture of general normals.

Splines

The approach to function approximation that we pursued in the previous section makes use of a finite subset of an infinite basis set consisting of polynomials of degrees $p = 0, 1, \ldots$. This approach yields a smooth approximation $\hat{f}(x)$. (“Smooth” means an approximation that is continuous and has continuous derivatives. These are useful properties of the approximation.) The
polynomials in $\hat{f}(x)$, however, cause oscillations that may be undesirable. The approximation oscillates a number of times one less than the highest degree of the polynomial used. Also, if the function being approximated has quite different shapes in different regions of its domain, the global approach of using the same polynomials over the full domain may not be very effective.

Another approach is to subdivide the interval over which the function is to be approximated and then on each subinterval use polynomials with low degree. The approximation at any point is a sum of one or more piecewise polynomials. Even with polynomials of very low degree, if we use a large number of subintervals, we can obtain a good approximation to the function. Zero-degree polynomials, for example, would yield a piecewise constant function that could be very close to a given function if enough subintervals are used. Using more and more subintervals, of course, is not a very practical approach. Not only is the approximation a rather complicated function, but it may be discontinuous at the interval boundaries. We can achieve smoothness of the approximation by imposing continuity restrictions on the piecewise polynomials and their derivatives. This is the approach in spline approximation and smoothing. Multivariate splines are generally formed as tensor products of univariate splines.

The polynomials are of degree no greater than some specified number, often just 3. This means, of course, that the class of functions for which these piecewise polynomials form a basis is the set of polynomials of degree no greater than the degree of polynomial in the basis; hence, we do not begin with an exact representation as in equation (3.9). In spline approximation, the basis functions are polynomials over given intervals and zero outside of those intervals.
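The remark about zero-degree polynomials can be illustrated directly. In this sketch the subintervals are of equal width and the constant on each is the function value at the midpoint; both are assumed choices, as are the function names and the test function $e^{-x}$ on $[-1, 1]$.

```python
import math

def piecewise_constant(f, a, b, m):
    """Approximate f on [a, b] by its midpoint value on each of m equal subintervals."""
    width = (b - a) / m
    values = [f(a + (i + 0.5) * width) for i in range(m)]

    def approx(x):
        # Index of the subinterval containing x; x = b goes to the last one.
        i = min(int((x - a) / width), m - 1)
        return values[i]

    return approx

def sup_error(f, g, a, b, n=10001):
    """Largest |f - g| on an evenly spaced grid (approximates the sup norm)."""
    grid = [a + (b - a) * i / (n - 1) for i in range(n)]
    return max(abs(f(x) - g(x)) for x in grid)

f = lambda x: math.exp(-x)
for m in (4, 16, 64):
    g = piecewise_constant(f, -1.0, 1.0, m)
    print(m, sup_error(f, g, -1.0, 1.0))
```

The error shrinks roughly in proportion to the subinterval width, but the approximation jumps at every subinterval boundary, which is the lack of smoothness that spline continuity conditions remove.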
The polynomials have specified contact at the endpoints of the intervals; that is, their derivatives of a specified order are continuous at the endpoints. The endpoints are called “knots”. The finite approximation therefore can be smooth and, with the proper choice of knots, is close to the function being approximated at any point. The approximation, $\hat{f}(x)$, formed as a sum of such piecewise polynomials is called a “spline”. The “order” of a spline is the number of free parameters in each interval. (For polynomial splines, the order is the degree plus 1.) There are three types of spline basis functions commonly used:

• truncated power functions (or just power functions). For $k$ knots and degree $p$, there are $k + p + 1$ of these: $1, x, \ldots, x^p, ((x - z_1)_+)^p, \ldots, ((x - z_k)_+)^p$. Sometimes, the constant is not used, so there are only $k + p$ functions. These are nice when we are adding or deleting knots. Deletion of the $i$th knot, $z_i$, is equivalent to removal of the basis function $((x - z_i)_+)^p$.

• B-splines. B-splines are probably the most widely used set of splines, and they are available in many software packages. The IMSL Library,
for example, contains three routines for univariate approximations using B-splines, with options for variable knots or constraints, and routines for two- and three-dimensional approximations using tensor product B-splines. The influence of any particular B-spline coefficient extends over only a few intervals, so B-splines can provide good fits to functions that are not smooth. The B-spline functions also tend to be better conditioned than the power functions. The mathematical development of B-splines is more complicated than the power functions. De Boor (2002) provides a comprehensive development, an extensive discussion of their properties, and several Fortran routines for using B-splines and other splines.

• “natural” polynomial splines. These basis functions are such that the second derivative of the spline expansion is 0 for all $x$ beyond the boundary knots. This condition can be imposed in various ways. An easy way is just to start with any set of basis functions and replace the degrees of freedom from two of them with the condition that every basis function have zero second derivative for all $x$ beyond the boundary knots. For natural cubic splines with $k$ knots, there are $k$ basis functions. There is nothing “natural” about the natural polynomial splines. A way of handling the end conditions that is usually better is to remove the second and the penultimate knots and to replace them with the requirement that the basis functions have contact one order higher. (For cubics, this means that the third derivatives match.)

Some basis functions for various types of splines over the interval $[-1, 1]$ are shown in Figure 3.8.

Interpolating Splines

Splines can be used for interpolation, approximation, and estimation. An interpolating spline fit matches each of a given set of points. Each point is usually taken as a knot, and the continuity conditions are imposed at each point. It makes sense to interpolate points that are known to be exact.
The reason to use an interpolating spline is usually to approximate a function at points other than those given (maybe for quadrature), so applied mathematicians may refer to the results of the interpolating spline as an “approximation”. An interpolating spline is used when a set of points are assumed to be known exactly (more or less).

Smoothing Splines

The other way of using splines is for approximation or smoothing. The individual points may be subject to error, so the spline may not go through any of the given points. In this usage, the splines are evaluated at each abscissa point, and the ordinates are fitted by some criterion (such as least squares) to the spline.
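A least squares fit with the truncated power functions described above can be sketched as follows. This is a hedged illustration, not a recommended implementation: it forms the normal equations explicitly and solves them with a small elimination routine that interchanges rows, the function names are hypothetical, and the degree is assumed to be at least 1. As noted above, the power functions tend to be less well conditioned than B-splines, so this approach is suitable only for small problems.

```python
def truncated_power_basis(x, knots, degree):
    """Values at x of 1, x, ..., x^p, ((x - z_1)_+)^p, ..., ((x - z_k)_+)^p."""
    if degree < 1:
        raise ValueError("this sketch assumes degree >= 1")
    powers = [x ** d for d in range(degree + 1)]
    truncated = [max(x - z, 0.0) ** degree for z in knots]
    return powers + truncated

def solve(a, b):
    """Solve a small dense linear system by elimination with row interchanges."""
    n = len(b)
    a = [row[:] + [bi] for row, bi in zip(a, b)]    # augmented copy
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n + 1):
                a[i][j] -= m * a[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (a[i][n] - s) / a[i][i]
    return x

def fit_spline(xs, ys, knots, degree):
    """Least squares coefficients from the normal equations B'B c = B'y."""
    rows = [truncated_power_basis(x, knots, degree) for x in xs]
    m = len(rows[0])
    btb = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    bty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(m)]
    return solve(btb, bty)

def spline_value(x, coefs, knots, degree):
    basis = truncated_power_basis(x, knots, degree)
    return sum(c * b for c, b in zip(coefs, basis))

# Made-up data lying exactly on |x|, which is a linear spline with one knot at 0.
xs = [-1.0 + 0.1 * i for i in range(21)]
ys = [abs(x) for x in xs]
coefs = fit_spline(xs, ys, knots=[0.0], degree=1)
print(coefs)
```

With noisy ordinates the same call returns the least squares spline for the chosen knots; it no longer passes through the points.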
[Figure 3.8: Spline Basis Functions. Four panels: “B-Splines; Order = 2; 4 Knots”, “B-Splines; Order = 4; 4 Knots”, “Natural Cubic Splines with 4 Knots”, and “Power Basis; Order = 4; 4 Knots”. Each panel plots the basis functions against $x$ and the knots over $[-1, 1]$.]

Choice of Knots in Smoothing Splines

The choice of knots is a difficult problem when the points are measured subject to error. One approach is to include the knots as decision variables in the fitting optimization problem. This approach may be ill-posed. A common approach is to add (pre-chosen) knots in a stepwise manner. Another approach is to use a regularization method (addition of a component to the fitting optimization objective function that increases for roughness or for some other undesirable characteristic of the fit).

Kernel Methods

Another approach to function estimation and approximation is to use a filter or kernel function to provide local weighting of the observed data. This approach ensures that at a given point the observations close to that point influence the estimate at the point more strongly than more distant observations. A standard method in this approach is to convolve the observations with a unimodal function that decreases rapidly away from a central point. This function is the
filter or the kernel. A kernel has two arguments representing the two points in the convolution, but we typically use a single argument that represents the distance between the two points. Some examples of univariate kernel functions are shown below.

uniform:    $K_u(t) = 0.5$,  for $|t| \le 1$,
quadratic:  $K_q(t) = 0.75(1 - t^2)$,  for $|t| \le 1$,
normal:     $K_n(t) = \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}$,  for all $t$.
The kernels with finite support are defined to be 0 outside that range. Often, multivariate kernels are formed as products of these or other univariate kernels. In kernel methods, the locality of influence is controlled by a window around the point of interest. The choice of the size of the window is the most important issue in the use of kernel methods. In practice, for a given choice of the size of the window, the argument of the kernel function is transformed to reflect the size. The transformation is accomplished using a positive definite matrix, V , whose determinant measures the volume (size) of the window. In the univariate case, the size of the window is just the width h. The argument of the kernel is transformed to s/h, so the function that is convolved with the function of interest is K(s/h)/h.
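The three kernels and the scaling $K(s/h)/h$ can be sketched in a few lines. As one common instance of a kernel method, the example averages the scaled kernel over the observations, which gives a kernel density estimate; that choice, the function names, and the data values are assumptions made for illustration.

```python
import math

def k_uniform(t):
    return 0.5 if abs(t) <= 1 else 0.0

def k_quadratic(t):
    return 0.75 * (1 - t * t) if abs(t) <= 1 else 0.0

def k_normal(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def kernel_estimate(x, data, h, kernel=k_normal):
    """Average over the observations x_i of K((x - x_i)/h)/h."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]     # made-up observations
for h in (0.2, 0.5, 1.0):
    print(h, kernel_estimate(0.0, data, h))
```

Small values of the window width `h` make the estimate follow individual observations closely; larger values smooth more heavily, which is why the choice of window size matters so much.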
Exercises

3.1. Give an example of a set of points in $\mathbb{R}^3$ that constitute a simplex.

3.2. Consider the function $f(x, y) = ax^2 + bxy + cy^2$.
(a) What is the Hessian, $H_f$?
(b) Under what conditions on a, b, and c is f convex?
(c) Under what conditions on a, b, and c would f have a saddlepoint?

3.3. For the function in Exercise 3.2, choose values of a, b, and c that make f convex, and plot the function over a rectangular domain centered on (0, 0). This can be done in S-Plus by the following statements:

    # Define function
    fun <- function(x, y, aa, bb, cc) {aa*x^2 + bb*x*y + cc*y^2}
    # Initialize grid projection points
    # The grid will be the outer product of these vectors
    x <- c(seq(-1.0, 1.0, 0.1))
    y <- c(seq(-1.0, 1.0, 0.1))
    # Initialize a, b, and c to appropriate values
    aa <- 1
    bb <- 1
    cc <- 1
    # Produce perspective plot and save structure for labeling
    zout <- persp(x, y, outer(x,y,FUN=fun,aa,bb,cc), axes=F)
    # Label axes
    # (this may require some experimentation to get them in the right place)
    text(perspp( -.5, -1.8, 0, zout), "x1", crt=25)
    text(perspp(-1.6, -.7, 0, zout), "x2", crt=-40)
3.4. For the function in Exercise 3.2, choose values of a, b, and c that cause f to have a saddlepoint, and plot the function over a rectangular domain centered on (0, 0).

3.5. In what essential way does the Hessian for your function in Exercise 3.3 differ from the Hessian for your function in Exercise 3.4?

3.6. Now consider a more interesting function:
$$g(x, y) = \begin{cases} \sin\left(\sqrt{x^2 + y^2}\right)\Big/\sqrt{x^2 + y^2}, & \text{if } x^2 + y^2 \ne 0; \\ 1, & \text{otherwise.} \end{cases}$$
(a) Plot $g(x, y)$ for $-10 \le x \le 10$ and $-10 \le y \le 10$.
(b) Plot the Hessian $H_g(x, y)$ over the same range.
(c) Determine a value of $c$ such that $H_g(x, y)$ is positive semidefinite for $x^2 + y^2 = c^2$. Identify this circle in your plot of $g(x, y)$.
(d) Determine a value of $d$ such that $-H_g(x, y)$ is positive semidefinite for $x^2 + y^2 = d^2$. Identify this circle in your plot of $g(x, y)$.

3.7. Suppose $f(x)$ and $g(x)$ are convex functions over $\mathbb{R}^d$. Show that $h(x) = f(x) + g(x)$ is a convex function over $\mathbb{R}^d$.

3.8. For each of the following distributions state whether or not the probability density function is concave and whether or not it is log concave.
(a) normal distribution (do the parameters matter?).
(b) gamma distribution with shape parameter 0.5 and scale parameter 1.0.
(c) gamma distribution with shape parameter 2.0 and scale parameter 1.0.
(d) beta distribution with parameters 0.5 and 0.5.
(e) beta distribution with parameters 2.0 and 0.5.
(f) beta distribution with parameters 2.0 and 2.0.
(g) uniform distribution over the interval (0, 1).

3.9. What is the mode of
(a) the gamma distribution with shape parameter 0.5 and scale parameter 1.0?
(b) the gamma distribution with shape parameter 2.0 and scale parameter 1.0?
(c) the beta distribution with parameters 2.0 and 0.5?
(d) the beta distribution with parameters 2.0 and 2.0?
Does the beta distribution with parameters 0.5 and 0.5 have a mode?

3.10. Consider a mixture of two normal distributions, $N(\mu, \sigma^2)$ and $N(\mu + d, k^2\sigma^2)$. Suppose the mixture consists of a proportion $p$ of the first normal and a proportion $1 - p$ of the second.
(a) Let $p = 0.5$ and $k = 1$. What are the conditions on $d$ for the mixture to be unimodal?
(b) Let $p = 0.05$ and $k = 1$. What are the conditions on $d$ for the mixture to be unimodal?
(c) Let $p = 0.5$ and $k = 5$. What are the conditions on $d$ for the mixture to be unimodal?
You may want to produce some plots of the mixture densities to get a better feel for the question. Notice that the question is independent of $\mu$ and $\sigma^2$.

3.11. If $X$ is a random variable with probability density function $p(x)$, the expected value of any function of $X$, say $g(X)$, is
$$E(g(X)) = \int_{-\infty}^{\infty} g(x)p(x)\,dx,$$
provided the integral exists. Show that if $g$ is a twice-differentiable convex function and $E(X)$ is finite,
$$g(E(X)) \le E(g(X)).$$
This is another form of Jensen’s inequality. It also holds if $g$ is not twice-differentiable, but the proof is more difficult.

3.12. Use Jensen’s inequality to show that if $X$ is a random variable, then
$$\bigl(E(X)\bigr)^2 \le \bigl(E(|X|)\bigr)^2 \le E\bigl(X^2\bigr),$$
if the expectations exist.

3.13. Use Jensen’s inequality to show that if $X$ is a random variable such that $\Pr(X > 0) = 1$, then
$$\frac{1}{E(X)} \le E\left(\frac{1}{X}\right),$$
if the expectations exist.

3.14. Suppose $x_1, x_2, \ldots, x_n$ are positive numbers. Use Jensen’s inequality to derive the well-known relationship among the arithmetic mean, the geometric mean, and the harmonic mean:
$$\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}\right)^{-1} \le \left(\prod_{i=1}^{n} x_i\right)^{1/n} \le \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Hint: Define a random variable $X$ that takes the values $x_i$ with equal probability, and take a log transform of $X$.

3.15. Show that the $L_2$ vector norm satisfies the triangle inequality (page 46). Hint: Express the $L_2$ vector norm as an inner product, and formulate and use the Cauchy-Schwarz inequality for the inner products of vectors (similar to the Cauchy-Schwarz inequality for the inner products of functions on page 49).
3.16. Show that if $p < 1$ in equation (3.4), the resulting expression is not a norm.

3.17. Prove that the Fourier coefficients form the finite expansion in basis functions with the minimum mean squared error (that is, prove inequality (3.12) on page 53). Hint: Write $\|f - a_0 q_0\|^2$ as a function of $a_0$, $\langle f, f\rangle - 2a_0\langle f, q_0\rangle + a_0^2\langle q_0, q_0\rangle$, differentiate, set to zero for the minimum, and determine $a_0 = c_0$ (equation (3.10)). This same approach can be done in multidimensions for $a_0, a_1, \ldots, a_k$, or else induction can be used from $a_1$ on.
Chapter 4
Finding Roots of Equations

Because of the special properties of derivatives of functions at the optima of the functions, we may solve an optimization problem by solving a simpler problem, namely finding a point at which a function is zero. More generally, a common problem in scientific computing is to solve a nonlinear system $g(x) = b$, where $g$ is a vector-valued function of a vector argument; that is, $x$ is an $m$-vector and $b$ is an $n$-vector. This is the general case of the linear system $Ax = b$, where $A$ is a matrix. By writing $f(x) = g(x) - b$, we change the problem to one of solving the equation
$$f(x) = 0. \qquad (4.1)$$
“Solving the equation” means finding the value of $x$, say $x_0$, that makes the equation true. The point $x_0$ is called a “root” or a “zero” of the function. If the objective function in an optimization problem is quadratic in the decision variables, as in the least squares problem, the derivatives in terms of the decision variables are linear. The optimum, therefore, may be found by solving a linear system. When the objective function is not quadratic, but the function is differentiable, the optimum may be found by solving a system of nonlinear equations.
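As a small illustration of the quadratic case, consider fitting a straight line by least squares. Setting the two partial derivatives of the sum of squares to zero gives a 2 by 2 linear system, which the sketch below solves in closed form. The function name, the use of Cramer's rule, and the data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for the model y = a + b x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Setting the derivatives of sum (y - a - b x)^2 to zero gives
    #   n  * a + sx  * b = sy
    #   sx * a + sxx * b = sxy
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("x values are all equal; the system is singular")
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]        # made-up data
ys = [1.1, 2.9, 5.2, 6.8]
print(fit_line(xs, ys))
```

For more than a few parameters the linear system would be solved by a general method rather than by an explicit formula.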
4.1
Linear Equations
One of the most common problems in numerical computing is to solve the linear system $Ax = b$, that is, for given $A$ and $b$, to find $x$ such that the equation holds. The system is said to be consistent if there exists such an $x$, and in that case a solution $x$ may be written as $A^{-}b$, where $A^{-}$ is some inverse of $A$. If $A$ is square and of full rank, we can write the solution as $A^{-1}b$.
It is important to distinguish the expression $A^{-1}b$ or $A^{+}b$, which represents the solution, from the method of computing the solution. We would never compute $A^{-1}$ just so we could multiply it by $b$ to form the solution $A^{-1}b$. There are two general methods of solving a system of linear equations: direct methods and iterative methods. A direct method uses a fixed number of computations that would in exact arithmetic lead to the solution; an iterative method generates a sequence of approximations to the solution. Iterative methods often work well for very large sparse matrices.
4.1.1
Direct Methods
Gaussian Elimination and Matrix Factorizations

The most common direct method for the solution of linear systems is Gaussian elimination. The basic idea in this method is to form equivalent sets of equations, beginning with the system to be solved, $Ax = b$, or
$$\begin{array}{rcl}
a_1^{\mathrm{T}}x &=& b_1 \\
a_2^{\mathrm{T}}x &=& b_2 \\
\vdots &=& \vdots \\
a_n^{\mathrm{T}}x &=& b_n,
\end{array}$$
where $a_j^{\mathrm{T}}$ is the $j$th row of $A$. An equivalent set of equations can be formed by a sequence of elementary operations on the equations in the given set. These elementary operations on equations are essentially the same as the elementary operations on the rows of matrices discussed in Section ??. There are two kinds of elementary operations: an interchange of two equations,
$$\begin{array}{rcl}
a_j^{\mathrm{T}}x = b_j &\leftarrow& a_k^{\mathrm{T}}x = b_k \\
a_k^{\mathrm{T}}x = b_k &\leftarrow& a_j^{\mathrm{T}}x = b_j,
\end{array}$$
which affects two equations simultaneously, and the replacement of a single equation with a linear combination of it and another equation:
$$a_j^{\mathrm{T}}x = b_j \;\leftarrow\; c_j a_j^{\mathrm{T}}x + c_k a_k^{\mathrm{T}}x = c_j b_j + c_k b_k,$$
where $c_j \ne 0$. If $c_k = 0$ in this operation, it is the simple elementary operation of scalar multiplication of a single equation. The interchange operation can be accomplished by premultiplication by an elementary permutation matrix (see page ??):
$$E_{jk}Ax = E_{jk}b.$$
Likewise, the linear combination elementary operation can be effected by premultiplication by a matrix formed from the identity matrix by replacing its $j$th
row by a row with all zeros except for $c_j$ in the $j$th column and $c_k$ in the $k$th column. Such a matrix is denoted by $E_{jk}(c_j, c_k)$, for example,
$$E_{23}(c_2, c_3) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & c_2 & c_3 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
Both $E_{jk}$ and $E_{jk}(c_j, c_k)$ are called elementary operator matrices. The elementary operation on the equation $a_2^{\mathrm{T}}x = b_2$ in which the first equation is combined with it using $c_1 = -a_{21}/a_{11}$ and $c_2 = 1$ will yield an equation with a zero coefficient for $x_1$. Generalizing this, we perform elementary operations on the second through the $n$th equations to yield a set of equivalent equations in which all but the first have zero coefficients for $x_1$. Next, we perform elementary operations using the second equation with the third through the $n$th equations, so that the new third through the $n$th equations have zero coefficients for $x_2$. The sequence of equivalent equations is
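(1)
\[
\begin{array}{rcl}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &=& b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &=& b_2 \\
 &\vdots& \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &=& b_n
\end{array}
\]
(2)
\[
\begin{array}{rcl}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &=& b_1 \\
a_{22}^{(1)}x_2 + \cdots + a_{2n}^{(1)}x_n &=& b_2^{(1)} \\
 &\vdots& \\
a_{n2}^{(1)}x_2 + \cdots + a_{nn}^{(1)}x_n &=& b_n^{(1)}
\end{array}
\]
\[
\vdots
\]
(n)
\[
\begin{array}{rcl}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &=& b_1 \\
a_{22}^{(1)}x_2 + \cdots + a_{2n}^{(1)}x_n &=& b_2^{(1)} \\
 &\vdots& \\
a_{n-1,n-1}^{(n-2)}x_{n-1} + a_{n-1,n}^{(n-2)}x_n &=& b_{n-1}^{(n-2)} \\
a_{nn}^{(n-1)}x_n &=& b_n^{(n-1)}
\end{array}
\]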
This last system is easy to solve. It is upper triangular. The last equation in the system yields
\[
x_n = \frac{b_n^{(n-1)}}{a_{nn}^{(n-1)}}.
\]
By back substitution we get
\[
x_{n-1} = \frac{b_{n-1}^{(n-2)} - a_{n-1,n}^{(n-2)} x_n}{a_{n-1,n-1}^{(n-2)}},
\]
and the rest of the $x$’s in a similar manner. Thus, Gaussian elimination consists of two steps, the forward reduction, which is order $O(n^3)$, and the back substitution, which is order $O(n^2)$.
The only obvious problem with this method arises if some of the $a_{kk}^{(k-1)}$’s used as divisors are zero (or very small in magnitude). These divisors are called “pivot elements”. Suppose, for example, we have the equations
\[
\begin{array}{rcl}
0.0001x_1 + x_2 &=& 1 \\
x_1 + x_2 &=& 2
\end{array}
\]
The solution is $x_1 = 1.0001$ and $x_2 = 0.9999$. Suppose we are working with 3 digits of precision (so our solution is $x_1 = 1.00$ and $x_2 = 1.00$). After the first step in Gaussian elimination we have
\[
\begin{array}{rcl}
0.0001x_1 + x_2 &=& 1 \\
-10{,}000\,x_2 &=& -10{,}000
\end{array}
\]
and so the solution by back substitution is $x_2 = 1.00$ and $x_1 = 0.000$. The $L_2$ condition number of the coefficient matrix is 2.618, so even though the coefficients do vary greatly in magnitude, we certainly would not expect any difficulty in solving these equations. A simple solution to this potential problem is to interchange the equation having the small leading coefficient with an equation below it. Thus, in our example, we first form
\[
\begin{array}{rcl}
x_1 + x_2 &=& 2 \\
0.0001x_1 + x_2 &=& 1
\end{array}
\]
so that after the first step we have
\[
\begin{array}{rcl}
x_1 + x_2 &=& 2 \\
x_2 &=& 1
\end{array}
\]
and the solution is $x_2 = 1.00$ and $x_1 = 1.00$. Another strategy would be to interchange the column having the small leading coefficient with a column to its right. Both the row interchange and the column interchange strategies could be used simultaneously, of course. These processes, which obviously do not change the solution, are called pivoting. The equation or column to move into the active position may be chosen in such a way that the magnitude of the new diagonal element is the largest possible. Performing only row interchanges, so that at the $k$th stage the equation with
\[
\max_{i=k}^{n} \left|a_{ik}^{(k-1)}\right|
\]
is moved into the $k$th row, is called partial pivoting. Performing both row interchanges and column interchanges, so that
\[
\max_{i=k;\,j=k}^{n;\,n} \left|a_{ij}^{(k-1)}\right|
\]
is moved into the $k$th diagonal position, is called complete pivoting. See Exercises ?? and ??.
It is always important to distinguish descriptions of effects of actions from the actions that are actually carried out in the computer. Pivoting is “interchanging” rows or columns. We would usually do something like that in the computer only when we are finished and want to produce some output. In the computer, a row or a column is determined by the index identifying the row or column. All we do for pivoting is to keep track of the indices that we have permuted.
There are many more computations required in order to perform complete pivoting than are required to perform partial pivoting. Gaussian elimination with complete pivoting can be shown to be stable; that is, the algorithm yields an exact solution to a slightly perturbed system, $(A + \delta A)x = b$. For Gaussian elimination with partial pivoting there are examples that show that it is not stable. These examples are somewhat contrived, however, and experience over many years has indicated that Gaussian elimination with partial pivoting is stable for most problems occurring in practice. For this reason, together with the computational savings, Gaussian elimination with partial pivoting is one of the most commonly used methods for solving linear systems. See Gentle (1998), Chapter 3, for a further discussion of these issues.
There are two modifications of partial pivoting that result in stable algorithms. One is to add one step of iterative refinement (see Section ??, page ??) following each pivot. It can be shown that Gaussian elimination with partial pivoting together with one step of iterative refinement is unconditionally stable. Another modification is to consider two columns for possible interchange in addition to the rows to be interchanged. This does not require nearly as many computations as complete pivoting does.
Higham (1997) shows that this method, suggested by Bunch and Kaufman (1977) and used in LINPACK and LAPACK, is stable. Direct methods of solution of linear systems all use some form of matrix factorization, as discussed in Section ?? beginning on page ??. The LU factorization is the most commonly used method to solve a linear system.
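The effect of pivoting in the small example above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the names `round_sig` and `gauss_solve`, and the `digits` argument (which imitates low-precision arithmetic by rounding every intermediate result to a fixed number of significant decimal digits), are hypothetical and are not taken from any library. Rows are exchanged by swapping list references, which is a form of the index bookkeeping described above.

```python
from math import floor, log10


def round_sig(x, digits):
    """Round x to `digits` significant decimal digits (None means no rounding)."""
    if digits is None or x == 0.0:
        return x
    return round(x, digits - 1 - floor(log10(abs(x))))


def gauss_solve(A, b, pivot=True, digits=None):
    """Solve Ax = b by forward reduction and back substitution.

    With pivot=True, partial pivoting is used: at stage k the row whose
    entry in column k is largest in magnitude is moved into row k.
    A zero pivot raises ZeroDivisionError; this sketch does not guard it.
    """
    n = len(b)
    # Work on an augmented copy [A | b] so the caller's data are untouched.
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]

    def fl(x):
        return round_sig(x, digits)

    # Forward reduction, O(n^3).
    for k in range(n - 1):
        if pivot:
            p = max(range(k, n), key=lambda i: abs(M[i][k]))
            M[k], M[p] = M[p], M[k]  # swaps references, not the row contents
        for i in range(k + 1, n):
            m = fl(M[i][k] / M[k][k])
            M[i][k] = 0.0
            for j in range(k + 1, n + 1):
                M[i][j] = fl(M[i][j] - fl(m * M[k][j]))

    # Back substitution, O(n^2).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n]
        for j in range(i + 1, n):
            s = fl(s - fl(M[i][j] * x[j]))
        x[i] = fl(s / M[i][i])
    return x


A = [[0.0001, 1.0], [1.0, 1.0]]
b = [1.0, 2.0]
print(gauss_solve(A, b, pivot=False, digits=3))  # [0.0, 1.0]: x1 is lost
print(gauss_solve(A, b, pivot=True, digits=3))   # [1.0, 1.0]
```

With `digits=None` the same function runs in ordinary double precision and returns values close to $x_1 = 1.0001$ and $x_2 = 0.9999$.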
Choice of Direct Method
An important consideration for the various direct methods is the efficiency of the method for certain patterned matrices. If a matrix initially has a large number of zeros, it is important to preserve zeros as the matrix is operated on. This helps to avoid unnecessary computations. Pissanetzky (1984) discusses some of the ways of doing this. The iterative methods discussed in the next section are often more useful for sparse matrices.
Another important consideration is how easily an algorithm lends itself to implementation on advanced computer architectures. Many of the algorithms for linear algebra can be vectorized easily. It is now becoming more important to be able to parallelize the algorithms. The iterative methods discussed in the next section can often be parallelized more easily.
4.1.2
Iterative Methods
An iterative method for solving the linear system $Ax = b$ obtains the solution by a sequence of successive approximations.
The Gauss-Seidel Method with Successive Overrelaxation
One of the simplest iterative procedures is the Gauss-Seidel method. In this method, we begin with an initial approximation to the solution, $x^{(0)}$. We then compute an update for the first element of $x$:
\[
x_1^{(1)} = \frac{1}{a_{11}} \left( b_1 - \sum_{j=2}^{n} a_{1j} x_j^{(0)} \right).
\]
Continuing in this way for the other elements of $x$, we have for $i = 1, \ldots, n$
\[
x_i^{(1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(0)} \right),
\]
where no sums are performed if the upper limit is smaller than the lower limit. After getting the approximation $x^{(1)}$, we then continue this same kind of iteration for $x^{(2)}, x^{(3)}, \ldots$.
We continue the iterations until a convergence criterion is satisfied. As we discussed on page 32, this criterion may be of the form
\[
\Delta(x^{(k)}, x^{(k-1)}) \le \epsilon,
\]
where $\Delta(x^{(k)}, x^{(k-1)})$ is a measure of the difference of $x^{(k)}$ and $x^{(k-1)}$, such as $\|x^{(k)} - x^{(k-1)}\|$. We may also base the convergence criterion on $\|r^{(k)} - r^{(k-1)}\|$, where $r^{(k)} = b - Ax^{(k)}$.
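The element-by-element update above can be sketched in Python as follows. This is a minimal illustration, not a production solver: the function name `gauss_seidel`, the use of the maximum absolute change as the measure $\Delta$, and the small test system are all assumptions made for the example.

```python
def gauss_seidel(A, b, x0, eps=1e-10, kmax=500):
    """Gauss-Seidel iteration for Ax = b, starting from x0.

    Each element is updated in place, so the update of x[i] already uses
    the new values x[0..i-1] and the old values x[i+1..n-1].
    Returns the approximate solution and the number of sweeps used.
    """
    n = len(b)
    x = [float(v) for v in x0]
    for k in range(1, kmax + 1):
        x_old = x[:]
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
        # Convergence measure: largest elementwise change in this sweep.
        if max(abs(x[i] - x_old[i]) for i in range(n)) <= eps:
            return x, k
    raise RuntimeError("algorithm did not converge in kmax iterations")


# A small symmetric positive definite system chosen for illustration.
A = [[5.0, 2.0], [2.0, 3.0]]
b = [18.0, 16.0]
x, sweeps = gauss_seidel(A, b, [0.0, 0.0])
print(x, sweeps)
```

The diagonal elements must be nonzero for the divisions to be defined; the sketch does not reorder the equations to arrange this.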
The Gauss-Seidel iterations can be thought of as beginning with a rearrangement of the original system of equations as
\[
\begin{array}{rcl}
a_{11}x_1 &=& b_1 - a_{12}x_2 - \cdots - a_{1n}x_n \\
a_{21}x_1 + a_{22}x_2 &=& b_2 - \cdots - a_{2n}x_n \\
 &\vdots& \\
a_{(n-1)1}x_1 + a_{(n-1)2}x_2 + \cdots + a_{(n-1)(n-1)}x_{n-1} &=& b_{n-1} - a_{(n-1)n}x_n \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &=& b_n
\end{array}
\]
In this form, we identify three matrices: a diagonal matrix $D$, a lower triangular $L$ with 0’s on the diagonal, and an upper triangular $U$ with 0’s on the diagonal:
\[
(D + L)x = b - Ux.
\]
We can write this entire sequence of Gauss-Seidel iterations in terms of these three fixed matrices,
\[
x^{(k+1)} = (D + L)^{-1} \left( -U x^{(k)} + b \right). \tag{4.2}
\]
This method will converge for any arbitrary starting value $x^{(0)}$ if and only if the spectral radius of $(D + L)^{-1} U$ is less than 1. (See Golub and Van Loan, 1996, for a proof of this.) Moreover, the rate of convergence increases with decreasing spectral radius.
Gauss-Seidel may be unacceptably slow, so it may be modified so that the update is a weighted combination of the regular Gauss-Seidel update and the previous value. This kind of modification is called successive overrelaxation, or SOR. The update is given by
\[
\left( \frac{1}{\omega} D + L \right) x^{(k+1)} = \frac{1}{\omega} \left( (1 - \omega) D - \omega U \right) x^{(k)} + b,
\]
where the relaxation parameter $\omega$ is usually chosen between 0 and 2. Values between 0 and 1 give a weighted average of the Gauss-Seidel update and the previous value, and values greater than 1 overrelax the update. For $\omega = 1$ the method is the ordinary Gauss-Seidel method. See Exercises ??, ??, and ??.
Solution of Linear Systems as an Optimization Problem; Conjugate Gradient Methods
The problem of solving the linear system $Ax = b$ is equivalent to finding the minimum of the function
\[
f(x) = \frac{1}{2} x^{\mathrm{T}} A x - x^{\mathrm{T}} b. \tag{4.3}
\]
For symmetric $A$, setting the derivative of $f$ to 0 shows that a stationary point of $f$ occurs at the point $x$ where $Ax = b$. If $A$ is also positive definite, the minimum of $f$ is at $x = A^{-1}b$, and the value of $f$ at the minimum is $-\frac{1}{2} b^{\mathrm{T}} A^{-1} b$.
The minimum point can be approached iteratively by starting at a point $x^{(0)}$, moving to a point $x^{(1)}$ that yields a smaller value of the function, and continuing to move to points yielding smaller values of the function. The $k$th point is $x^{(k-1)} + \alpha_k d_k$, where $\alpha_k$ is a scalar and $d_k$ is a vector giving the direction of the movement. Hence, for the $k$th point we have the linear combination,
\[
x^{(k)} = x^{(0)} + \alpha_1 d_1 + \cdots + \alpha_k d_k.
\]
The convergence criterion is based on $\|x^{(k)} - x^{(k-1)}\|$ or on $\|r^{(k)} - r^{(k-1)}\|$, where $r^{(k)} = b - Ax^{(k)}$.
At the point $x^{(k)}$, the function $f$ decreases most rapidly in the direction of the negative gradient, $-\nabla f(x^{(k)})$. The negative gradient is just the residual, $r^{(k)} = b - Ax^{(k)}$. If this residual is 0, no movement is indicated, because we are at the solution. Moving in the direction of steepest descent may cause a slow convergence to the minimum. (The curve that leads to the minimum on the quadratic surface is obviously not a straight line.) A good choice for the sequence of directions, which we now denote by $p_1, p_2, \ldots$, is such that
\[
p_k^{\mathrm{T}} A p_i = 0, \quad \text{for } i = 1, \ldots, k - 1.
\]
Such a vector $p_k$ is said to be $A$-conjugate to $p_1, p_2, \ldots, p_{k-1}$. The path defined by the directions $p_1, p_2, \ldots$ and the distances $\alpha_1, \alpha_2, \ldots$ is called the conjugate gradient. A conjugate gradient method for solving the linear system is shown in Algorithm 4.1.
Algorithm 4.1 The Conjugate Gradient Method for Solving $Ax = b$, Starting with $x^{(0)}$
0. Set $k = 0$; $r^{(k)} = b - Ax^{(k)}$; $s^{(k)} = A^{\mathrm{T}} r^{(k)}$; $p^{(k)} = s^{(k)}$; and $\gamma^{(k)} = \|s^{(k)}\|_2^2$.
1. If $\gamma^{(k)} \le \epsilon$, set $x = x^{(k)}$ and terminate.
2. Set $q^{(k)} = Ap^{(k)}$.
3. Set $\alpha^{(k)} = \dfrac{\gamma^{(k)}}{\|q^{(k)}\|_2^2}$.
4. Set $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$.
5. Set $r^{(k+1)} = r^{(k)} - \alpha^{(k)} q^{(k)}$.
6. Set $s^{(k+1)} = A^{\mathrm{T}} r^{(k+1)}$.
7. Set $\gamma^{(k+1)} = \|s^{(k+1)}\|_2^2$.
8. Set $p^{(k+1)} = s^{(k+1)} + \dfrac{\gamma^{(k+1)}}{\gamma^{(k)}} p^{(k)}$.
9. If $k < k_{\max}$, set $k = k + 1$ and go to 1; otherwise, issue message that ‘algorithm did not converge in $k_{\max}$ iterations’.
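The steps of Algorithm 4.1 translate almost line for line into Python. The sketch below uses plain lists so that it is self-contained; the helper names `matvec`, `transpose`, and `dot`, the default tolerance, and the small test system are assumptions made for illustration only.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, eps=1e-20, kmax=100):
    """Follow Algorithm 4.1 step by step; returns (x, iterations used)."""
    At = transpose(A)
    x = [float(v) for v in x0]
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]       # step 0
    s = matvec(At, r)
    p = s[:]
    gamma = dot(s, s)
    for k in range(kmax + 1):
        if gamma <= eps:                                    # step 1
            return x, k
        q = matvec(A, p)                                    # step 2
        alpha = gamma / dot(q, q)                           # step 3
        x = [xi + alpha * pi for xi, pi in zip(x, p)]       # step 4
        r = [ri - alpha * qi for ri, qi in zip(r, q)]       # step 5
        s = matvec(At, r)                                   # step 6
        gamma_new = dot(s, s)                               # step 7
        beta = gamma_new / gamma
        p = [si + beta * pi for si, pi in zip(s, p)]        # step 8
        gamma = gamma_new                                   # step 9: next k
    raise RuntimeError("algorithm did not converge in kmax iterations")


A = [[5.0, 2.0], [2.0, 3.0]]
b = [18.0, 16.0]
x, iterations = conjugate_gradient(A, b, [0.0, 0.0])
print(x, iterations)
```

Because the quantity tested in step 1 is a squared norm, the tolerance `eps` is on the scale of a squared residual; a caller who thinks in terms of the norm itself would pass the square of the desired bound.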
For example, the function (4.3) arising from the system
\[
\begin{pmatrix} 5 & 2 \\ 2 & 3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} 18 \\ 16 \end{pmatrix}
\]
has level contours as shown in Figure ??, and the conjugate gradient method would move along the line shown, toward the solution at $x = (2, 4)$.

Figure 4.1: Solution of a Linear System Using a Conjugate Gradient Method (axes: $x_1$ horizontal, $x_2$ vertical)
4.2
Nonlinear Equations
We describe several general methods for solving a system of nonlinear equations. Each of the methods may be the best for some given problem, and it is important to understand how these methods work. There are some specialized methods, such as for finding the roots of a polynomial, but we will not discuss them. We first consider methods for a single equation in a scalar variable.
4.2.1
Basic Methods for a Single Equation
Let us first consider the special case of equation (4.1) in which $f$ is a scalar-valued function of a scalar variable $x$. If there is no closed form for the inverse $f^{-1}(\cdot)$, and if $f$ is continuous, then the solution is effected by an iterative process. This iterative process must have a convergence criterion or stopping criterion to decide when the solution is “close enough”. In some cases, the primary interest is in the value of the solution itself; and in other cases, the main interest is in the value of the function at the computed solution. The criterion may be based on a small positive number, $\epsilon$, to bound the distance of the computed solution from $x_0$, or to bound the value of $|f(x)|$ at the computed solution. The number of iterations allowed must also be bounded; in fact, if there is no stopping criterion independent of the “goodness” of the solution, the method of solution is not an “algorithm”, in a common definition of that term.
In the following discussion, we assume $f$ is a continuous function, and that a solution $x_0$ exists. We will illustrate the methods with the function
\[
f(x) = x^3 - 4x^2 + 18x - 115, \tag{4.4}
\]
which has a single real root at $x = 5$. (As we mentioned earlier, there are special algorithms for polynomials, which we will not discuss. The standard algorithms are given by Jenkins and Traub, 1970a, 1970b, and 1972. A program by Jenkins, 1975, is available in the ACM CALGO and in the IMSL Libraries. Hull and Mathon (1996) describe a modification of the basic method that works better in the case of a multiple root of the polynomial.)
Fixed-Point Method
A general type of iteration for problems such as (4.1) is called a fixed-point method. In this problem the fixed-point method uses the fact that at the solution $x_0 = f(x_0) + x_0$. The fixed-point iteration is then
\[
x_0^{(k+1)} = f\left(x_0^{(k)}\right) + x_0^{(k)},
\]
after starting with any value $x_0^{(0)}$.
Bisection Method
One of the simplest iterative methods for solving $f(x) = 0$ is the bisection method. The method begins with two values that bracket the solution, and then tightens the interval by halves. We assume that there are values $x_l$ and $x_u$, with $x_l < x_u$, such that $f(x_l) \le 0$,
$f(x_u) \ge 0$, and $f(x_0) = 0$. (If $f(x_l) \ge 0$ and $f(x_u) \le 0$, we can relabel the points.) The method is shown in Algorithm 4.2.
Algorithm 4.2 Bisection to Find a Root of an Equation
0. Set $k = 0$, and find an interval $[x_l, x_u]$ in which a solution lies.
1. Set $k = k + 1$ and set $x_0^{(k)} = (x_u + x_l)/2$.
2. If $\mathrm{sign}\left(f\left(x_0^{(k)}\right)\right) = \mathrm{sign}(f(x_l))$, then
   2.a. set $x_l = x_0^{(k)}$;
   otherwise
   2.b. set $x_u = x_0^{(k)}$.
3. If $x_u - x_l \le \epsilon$, return the solution as $x_0^{(k)}$; otherwise, if $k < k_{\max}$, go to step 1 (which increments $k$); otherwise, issue message that ‘algorithm did not converge in $k_{\max}$ iterations’.
In the example in Figure 4.2, the interval is successively halved, first by moving the upper bound down, then moving the lower bound up, then moving the lower bound up again, and so on. In each step the approximation to the solution $x_0^{(k)}$ is the midpoint of an interval, and then becomes an endpoint of the interval in the next step.
The bisection method is very easy to understand and to implement. The solution always remains within a known interval. After $k$ steps, the length of that interval is $2^{-k}$ times its initial length, so the error of the approximation is of order $2^{-k}$. Each iteration gains one more bit of accuracy. Because the ratio of the lengths of successive intervals is constant, the bisection method converges linearly.
The iterations beginning with those shown in Figure 4.2, and continuing until 11 digits of accuracy are attained, are shown in Table 4.1. The length of the interval is 7 initially. After 35 steps, it is approximately $7 \cdot 2^{-35}$.
The stopping rule in Algorithm 4.2 is based on the length of the interval. It is clear that the algorithm must converge using this stopping rule; in fact, beginning with $x_l$ and $x_u$, the algorithm terminates after exactly $\lceil \log_2((x_u - x_l)/\epsilon) \rceil$
steps.
At the solution, $f\left(x_0^{(k)}\right)$ should be close to zero. An alternative stopping rule could be based on this value; that is, for a given $\epsilon > 0$, stop when
\[
\left| f\left(x_0^{(k)}\right) \right| \le \epsilon.
\]
The bisection method requires that the function be continuous within the initial interval. The function need not be differentiable, however.

Figure 4.2: Bisection to Find $x_0$, so that $f(x_0) = 0$

Table 4.1: Bisection Iterations
 k   x_l                x_u                  k   x_l                x_u
 0   2.00000000000000   9.00000000000000    18   4.99998855590820   5.00001525878906
 1   2.00000000000000   5.50000000000000    19   4.99998855590820   5.00000190734863
 2   3.75000000000000   5.50000000000000    20   4.99999523162842   5.00000190734863
 3   4.62500000000000   5.50000000000000    21   4.99999856948853   5.00000190734863
 4   4.62500000000000   5.06250000000000    22   4.99999856948853   5.00000023841858
 5   4.84375000000000   5.06250000000000    23   4.99999940395355   5.00000023841858
 6   4.95312500000000   5.06250000000000    24   4.99999982118607   5.00000023841858
 7   4.95312500000000   5.00781250000000    25   4.99999982118607   5.00000002980232
 8   4.98046875000000   5.00781250000000    26   4.99999992549419   5.00000002980232
 9   4.99414062500000   5.00781250000000    27   4.99999997764826   5.00000002980232
10   4.99414062500000   5.00097656250000    28   4.99999997764826   5.00000000372529
11   4.99755859375000   5.00097656250000    29   4.99999999068677   5.00000000372529
12   4.99926757812500   5.00097656250000    30   4.99999999720603   5.00000000372529
13   4.99926757812500   5.00012207031250    31   4.99999999720603   5.00000000046566
14   4.99969482421875   5.00012207031250    32   4.99999999883585   5.00000000046566
15   4.99990844726563   5.00012207031250    33   4.99999999965075   5.00000000046566
16   4.99990844726563   5.00001525878906    34   4.99999999965075   5.00000000005821
17   4.99996185302734   5.00001525878906    35   4.99999999985448   5.00000000005821

Newton’s Method
Newton’s method for a differentiable function is based on the first-order Taylor series of the function about a point near the solution:
\[
f(x) \approx f\left(x_0^{(k)}\right) + \left(x - x_0^{(k)}\right) f'\left(x_0^{(k)}\right).
\]
As before, the solution is approached through the iterates, $x_0^{(k)}, x_0^{(k+1)}, \ldots$. The update is obtained by assuming
\[
f\left(x_0^{(k+1)}\right) = 0,
\]
and solving the Taylor series approximation
\[
f\left(x_0^{(k+1)}\right) \approx f\left(x_0^{(k)}\right) + \left(x_0^{(k+1)} - x_0^{(k)}\right) f'\left(x_0^{(k)}\right).
\]
If $f'\left(x_0^{(k)}\right) \neq 0$, this approximation yields
\[
x_0^{(k+1)} = x_0^{(k)} - \frac{f\left(x_0^{(k)}\right)}{f'\left(x_0^{(k)}\right)}. \tag{4.5}
\]
Newton’s method uses the slope of the function at one point to choose the next point: the next point is where the tangent line at the current point crosses zero. The method is given in Algorithm 4.3.
Algorithm 4.3 Newton’s Method to Find a Root of an Equation
0. Set $k = 0$, and determine an approximation $x_0^{(k)}$.
1. Solve for $x_0^{(k+1)}$ in
\[
f'\left(x_0^{(k)}\right) \left(x_0^{(k+1)} - x_0^{(k)}\right) = -f\left(x_0^{(k)}\right);
\]
   that is, set
\[
x_0^{(k+1)} = x_0^{(k)} - \frac{f\left(x_0^{(k)}\right)}{f'\left(x_0^{(k)}\right)},
\]
   if $\left(f'\left(x_0^{(k)}\right)\right)^{-1}$ exists.
2. If $\left| x_0^{(k+1)} - x_0^{(k)} \right| \le \epsilon$, return the solution as $x_0^{(k+1)}$; otherwise, if $k < k_{\max}$, set $k = k + 1$ and go to step 1; otherwise, issue message that ‘algorithm did not converge in $k_{\max}$ iterations’.
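Algorithms 4.2 and 4.3 can be compared directly on the example function (4.4). The Python sketch below is only an illustration: the function names `bisect` and `newton`, the default tolerances, and the starting values (the interval $[2, 9]$ from Table 4.1 and a Newton start at $x = 9$) are choices made for this example. In `bisect`, the counter is advanced once per pass, at step 1.

```python
import math


def f(x):
    """The example function (4.4)."""
    return x**3 - 4 * x**2 + 18 * x - 115


def fprime(x):
    """Derivative of (4.4)."""
    return 3 * x**2 - 8 * x + 18


def bisect(f, xl, xu, eps=1e-10, kmax=200):
    """Algorithm 4.2: returns (approximate root, number of steps)."""
    fl = f(xl)
    if fl * f(xu) > 0:
        raise ValueError("the interval does not bracket a solution")
    for k in range(1, kmax + 1):
        x0 = (xu + xl) / 2
        # xl only ever moves to points where f has the same sign as at the
        # original xl, so comparing against the stored value fl is enough.
        if f(x0) * fl > 0:
            xl = x0
        else:
            xu = x0
        if xu - xl <= eps:
            return x0, k
    raise RuntimeError("algorithm did not converge in kmax iterations")


def newton(f, fprime, x0, eps=1e-10, kmax=50):
    """Algorithm 4.3: returns (approximate root, number of updates)."""
    x = x0
    for k in range(kmax + 1):
        slope = fprime(x)
        if slope == 0:
            raise ZeroDivisionError("derivative is zero; Newton step undefined")
        x_new = x - f(x) / slope
        if abs(x_new - x) <= eps:
            return x_new, k + 1
        x = x_new
    raise RuntimeError("algorithm did not converge in kmax iterations")


root_b, steps_b = bisect(f, 2.0, 9.0, eps=1e-6)
root_n, steps_n = newton(f, fprime, 9.0)
print(root_b, steps_b, math.ceil(math.log2(7.0 / 1e-6)))
print(root_n, steps_n)
```

The bisection step count printed in the first line can be checked against the bound $\lceil \log_2((x_u - x_l)/\epsilon) \rceil$ given earlier; Newton’s method needs far fewer updates here because the start is close enough to the root for the iteration to converge rapidly.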
The stopping rule in Algorithm 4.3 is based on the interval between two successive approximations, just as the stopping rule of the bisection method is based on the length of the interval. As we mentioned in discussing the bisection method, there may also be some interest in $f\left(x_0^{(k)}\right)$. This value should be near zero, and it could also be used as a stopping criterion.
Newton’s method is easy to understand and to implement if the derivative is available. In Figure 4.3, we show Newton’s method applied to the same function we used bisection on in Figure 4.2. In the example in Figure 4.3, Newton’s method proceeds in an orderly fashion toward the zero of the function.

Figure 4.3: Newton’s Method to Find $x_0$, so that $f(x_0) = 0$

Notice in Figure 4.3 that the derivatives (the slopes) are decreasing, as the solution is approached from the right side. This could cause some problems with the method, because the denominator in step 1 of Algorithm 4.3 becomes
small. In this example the derivative, $f'(x) = 3x^2 - 8x + 18$, is not zero at the solution. (See Exercise 4.2, page 89. The derivative of the function in Exercise 4.2b is zero at the solution.)
To investigate the convergence of Newton’s method, consider the first-order Taylor series with remainder, expanded about a point near the solution, $x_0^{(k)}$, and evaluated at the solution $x_0$:
\[
f(x_0) = f\left(x_0^{(k)}\right) + \left(x_0 - x_0^{(k)}\right) f'\left(x_0^{(k)}\right) + \frac{1}{2} f''(\xi) \left(x_0 - x_0^{(k)}\right)^2 = 0.
\]
Using equation (4.5), we have
\[
\frac{x_0^{(k+1)} - x_0}{\left(x_0^{(k)} - x_0\right)^2} = \frac{1}{2} \, \frac{f''(\xi)}{f'\left(x_0^{(k)}\right)}.
\]
So, if the limit, as $k \to \infty$, of the ratio on the right exists, the convergence is quadratic (see page 32 in Section 2.2). It is clear that if $f'\left(x_0^{(k)}\right) = 0$ at any point, the method may fail. Even if the derivatives are not zero, however, Newton’s method may diverge unless the starting point is sufficiently close to the solution. Two ways in which Newton’s method can go wrong are illustrated in Figures 4.4 and 4.5. In both of these examples, the failure of Newton’s method occurs because the starting point is too far away from the zero. The possibility of this occurring makes the choice of starting value very important. In the bisection method, we do not have to be concerned about this, so long as we can find values that bracket the solution.
A modification of Newton’s method is to use an approximation to the derivative:
\[
f'\left(x_0^{(k)}\right) \approx \frac{f\left(x_0^{(k)} + h\right) - f\left(x_0^{(k)}\right)}{h}.
\]
This is sometimes called the “discrete Newton’s method”. It is also essentially the same as the next method we discuss.
Secant Method
The secant method is similar to Newton’s method in using the slope to determine successive points in the iteration. Newton’s method uses the derivative or the tangent at a given point, and the secant method uses the slope of the function between two given points to choose the next point. The method is given in Algorithm 4.4.

Figure 4.4: Failure of Newton’s Method

Algorithm 4.4 Secant Method to Find a Root of an Equation
0. Set $k = 1$, and determine approximations $x_0^{(k-1)}$ and $x_0^{(k)}$.
1. Set
\[
x_0^{(k+1)} = x_0^{(k)} - \frac{f\left(x_0^{(k)}\right) \left(x_0^{(k)} - x_0^{(k-1)}\right)}{f\left(x_0^{(k)}\right) - f\left(x_0^{(k-1)}\right)}.
\]
2. If $\left| x_0^{(k+1)} - x_0^{(k)} \right| \le \epsilon$, return the solution as $x_0^{(k+1)}$; otherwise, if $k < k_{\max}$, set $k = k + 1$ and go to step 1; otherwise, issue message that ‘algorithm did not converge in $k_{\max}$ iterations’.
The intersection of the line between the two points on the function and the $x$-axis is taken as the next point at which to evaluate the function, as we see in Figure 4.6. The choice of $x_0^{(0)}$ and $x_0^{(1)}$ is arbitrary, although just as in Newton’s method, if they are too far away from the solution, the method may not converge. The two points in the secant method may or may not bracket a root.
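A direct transcription of Algorithm 4.4 into Python might look like the following sketch. The function name `secant`, the default tolerance, and the two starting values (9 and 8, both to the right of the root of the example function (4.4)) are assumptions chosen for illustration.

```python
def f(x):
    """The example function (4.4)."""
    return x**3 - 4 * x**2 + 18 * x - 115


def secant(f, x_prev, x_curr, eps=1e-10, kmax=100):
    """Algorithm 4.4: returns (approximate root, number of updates)."""
    f_prev, f_curr = f(x_prev), f(x_curr)
    for k in range(1, kmax + 1):
        denom = f_curr - f_prev
        if denom == 0:
            raise ZeroDivisionError("secant line is horizontal; step undefined")
        # Zero of the line through (x_prev, f_prev) and (x_curr, f_curr).
        x_next = x_curr - f_curr * (x_curr - x_prev) / denom
        if abs(x_next - x_curr) <= eps:
            return x_next, k
        # Shift the pair of points; only one new function evaluation is needed.
        x_prev, f_prev = x_curr, f_curr
        x_curr, f_curr = x_next, f(x_next)
    raise RuntimeError("algorithm did not converge in kmax iterations")


root, updates = secant(f, 9.0, 8.0)
print(root, updates)
```

Storing the two function values avoids re-evaluating $f$ at a point already visited, so each update costs one function evaluation and no derivative evaluation.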

Figure 4.5: Failure of Newton’s Method; Another Example

Figure 4.6: Secant Method to Find $x_0$, so that $f(x_0) = 0$

Regula Falsi Method
The regula falsi or false position method is similar to the secant method, except that the two starting points are chosen so as to bracket a solution, and as in the bisection method, each successive point is chosen so that it together with one of the two previous points brackets a solution. The method given in Algorithm 4.5 is a slight modification of the ordinary regula falsi method, and is sometimes called the “modified” regula falsi method. Because the “unmodified” regula falsi method (which omits steps 2.a.ii and 2.b.ii in Algorithm 4.5) should not be used, we just refer to the method given here as regula falsi. Algorithm 4.5 is also sometimes called the “Illinois method”.
Algorithm 4.5 Regula Falsi to Find a Root of an Equation
0. Set $k = 0$; find an interval $[x_l, x_u]$ in which a solution lies; set $f_l = f(x_l)$; set $f_u = f(x_u)$; and set $x_0^{(k)} = x_l$.
1. Set $x_0^{(k+1)} = \dfrac{x_l f_u - x_u f_l}{f_u - f_l}$.
2. If $f_l \, f\left(x_0^{(k+1)}\right) < 0$, then
   2.a.i. set $x_u = x_0^{(k+1)}$ and $f_u = f\left(x_0^{(k+1)}\right)$.
   2.a.ii. if $f\left(x_0^{(k)}\right) f\left(x_0^{(k+1)}\right) > 0$, then set $f_l = f_l/2$.
   Otherwise,
   2.b.i. set $x_l = x_0^{(k+1)}$ and $f_l = f\left(x_0^{(k+1)}\right)$.
   2.b.ii. if $f\left(x_0^{(k)}\right) f\left(x_0^{(k+1)}\right) > 0$, then set $f_u = f_u/2$.
3. If $x_u - x_l \le \epsilon$, return the solution as $x_0^{(k+1)}$; otherwise, if $k < k_{\max}$, set $k = k + 1$ and go to step 1; otherwise, issue message that ‘algorithm did not converge in $k_{\max}$ iterations’.
The regula falsi method generally converges more slowly than the secant method, but it is more reliable, because the solution remains bracketed. Figure 4.7 illustrates two iterations of the method.

Figure 4.7: Regula Falsi to Find $x_0$, so that $f(x_0) = 0$

Stochastic Approximation
In practical applications we often cannot evaluate $f(x)$ precisely. Instead, we make observations that are contaminated with random errors or noise. At $x_0^{(k)}$, instead of $f\left(x_0^{(k)}\right)$, we observe
\[
y_0^{(k)} = f\left(x_0^{(k)}\right) + \epsilon_k.
\]
A fixed-point iteration of the form
\[
x_0^{(k+1)} = x_0^{(k)} + \hat{f}\left(x_0^{(k)}\right) \tag{4.6}
\]
could be used, where $\hat{f}\left(x_0^{(k)}\right)$ is an estimate of the value of $f$ at $x_0^{(k)}$, based on some observations of $y_0^{(k)}$.
Alternatively, the model of interest may be a random process, and we may be interested in some function of the random process, $f(x)$. For example, we may model an observable process by a random variable $Y$ with probability density function $p_Y(y, x)$, where $x$ is a parameter of the distribution. We may be interested in the mean of $Y$ as a function of the parameter $x$,
\[
f(x) = \int y \, p_Y(y, x) \, \mathrm{d}y.
\]
If we know $p_Y(y, x)$ and can perform the integration, the problem of finding a zero of $f(x)$ (or, more generally, finding $x$ such that the mean, $f(x)$, is some specified level) is similar to the other problems we have discussed. Often in practice we do not know $p_Y(y, x)$, but we are able to take observations on $Y$.
These observations could be used to obtain $\hat{f}\left(x_0^{(k)}\right)$, and the recursion (4.6) used to find $x$. Each observation on $Y$ is an estimate of $f(x)$, so recursion (4.6) can be rather simple. For a sequence of observations on $Y$, $y_1, y_2, \ldots$, Robbins and Monro (1951) suggested the recursion
\[
x_0^{(k+1)} = x_0^{(k)} + \alpha^{(k)} y_k, \tag{4.7}
\]
where $\alpha^{(k)}$ is a decreasing sequence of positive numbers similar to $1/f'\left(x_0^{(k)}\right)$ in Newton’s method (4.5), page 77, when the approach is from the left.
Convergence in the Robbins-Monro procedure is not deterministic because of the random variables. Our interest must be in convergence in probability or convergence with probability 1. In order to guarantee convergence, the deterministic sequence of $\alpha^{(k)}$ must satisfy certain norm properties (see Kushner and Yin, 1997).
Multiple Roots
It is possible that the function has more than one root, and we may want to find them all. A common way of addressing this problem is to use different starting points in the iterative solution process. Plots of the points evaluated in the iterations may also be useful. In general, if the number of different roots is unknown, there is no way of finding all of them with any assurance. If the number of roots in a given interval is known, however, and if the function is twice continuously differentiable in the interval, a “guided” bisection algorithm of Kavvadias and Vrahatis (1996) can be used to find all of them with certainty. We refer the interested reader to their paper for the details.
Accuracy of the Solution
As with most problems in numerical computations, the accuracy we can expect in finding the roots of a function varies from problem to problem; some problems are better conditioned than others. A measure of the condition of the problem of finding the root $x_0$ can be developed by considering the error in evaluating $f(x)$ in the vicinity of $x_0$. Suppose a bound on this error is $\epsilon$, so
\[
|\hat{f}(x_0) - f(x_0)| \le \epsilon, \quad \text{or} \quad |\hat{f}(x_0)| \le \epsilon,
\]
where $\hat{f}(x_0)$ is the computed value approximating $f(x_0)$. Let $[x_l, x_u]$ be the largest interval about $x_0$ such that
\[
|f(x)| \le \epsilon, \quad \text{if } x \in [x_l, x_u]. \tag{4.8}
\]
Within this interval, the computed value $\hat{f}(x)$ can be either positive or negative just due to error in computing the value. A stable algorithm for finding the root of the function yields a value in the interval, but no higher accuracy can be expected.
If $f(x)$ can be expanded in a Taylor series about $x_0$, we have
\[
f(x) \approx f(x_0) + f'(x_0)(x - x_0),
\]
or
\[
f(x) \approx f'(x_0)(x - x_0).
\]
Now applying the bound in (4.8) to the approximation, we have that the interval is approximately
\[
x_0 \pm \frac{\epsilon}{f'(x_0)},
\]
if the derivative exists and is nonzero. Therefore, if the derivative exists and is nonzero, a quantitative measure of the condition of the problem is
\[
\frac{1}{f'(x_0)}.
\]
This quantity is a condition number of the function $f$ with respect to finding the root $x_0$. In Figure 4.8, we can see the sensitivity of a root-finding algorithm to the condition number.

Figure 4.8: Condition of the Root of $f(x) = 0$: Two Possibilities (two panels, each plotting $f(x)$ against $x$ with the interval endpoints $x_l^{(k)}$ and $x_u^{(k)}$ marked)

Wilkinson (1959) considered the polynomial
\[
f(x) = (x - 1)(x - 2) \cdots (x - 20)
\]
for studying rounding error in determining roots (see page 28). Very small perturbations in the coefficients of the polynomial lead to very large changes in the roots; hence, we referred to the problem as ill-conditioned. The derivative of that function in the vicinity of the roots is very large, so the condition number defined above would not indicate any conditioning problem. As we pointed out, however, that problem is ill-conditioned because of the extreme variation in the magnitude of the coefficients.
This kind of situation is common in numerical analysis. Condition numbers do not always tell an accurate story; they should be viewed only as indicators, not as true measures of the condition.
4.2.2
Systems of Equations
If the argument of the function is an $m$-vector and the function value is an $n$-vector, equation (4.1), $f(x) = 0$, represents a system of equations:
\[
\begin{array}{rcl}
f_1(x_1, x_2, \ldots, x_m) &=& 0 \\
f_2(x_1, x_2, \ldots, x_m) &=& 0 \\
 &\vdots& \\
f_n(x_1, x_2, \ldots, x_m) &=& 0.
\end{array}
\tag{4.9}
\]
Each of the functions $f_i$ is a scalar-valued function of the vector $x$. Solution of systems of nonlinear equations can be a significantly more computationally intensive problem than solution of a single equation.
Whether or not the system of equations (4.9) has a solution is not easy to determine. A system that has a solution is said to be consistent, just as a consistent linear system. If $n > m$, the system may be overdetermined (just as the linear system (??) on page ?? in Chapter ??), and it is very likely that no solution exists. In this case, a criterion, such as least squares, for a good approximate solution must be chosen. Even if $n = m$, we do not have easy ways of determining whether a solution exists, as we have for the linear system.
Newton’s Method
As we have seen in the previous sections, the solution of nonlinear equations proceeds iteratively to points at which the function is ever closer to zero. The derivative or an approximation to the derivative is used to decide which way to move from a given point. For a scalar-valued function of several variables, say $f_1(x)$, we must consider the slopes in various directions, that is, the gradient $\nabla f_1(x)$. In a system of equations such as (4.9), we must consider all of the gradients; that is, the slopes in various directions of all of the scalar-valued functions. The matrix whose rows are the transposes of the gradients is called the Jacobian. We denote the Jacobian of the function $f$ by $J_f$. The transpose of the Jacobian, that is, the matrix whose columns are the gradients, is denoted by $\nabla f$ for the vector-valued function $f$. (Note that the symbol $\nabla$ can denote either a vector or a matrix, depending on whether the function to which it is applied is scalar- or vector-valued.) Thus, the Jacobian for the system above is
\[
J_f =
\begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_m} \\
\vdots & & & \vdots \\
\dfrac{\partial f_n}{\partial x_1} & \dfrac{\partial f_n}{\partial x_2} & \cdots & \dfrac{\partial f_n}{\partial x_m}
\end{pmatrix}
= (\nabla f)^{\mathrm{T}}. \tag{4.10}
\]
Notice that the Jacobian is a function, so we often specify the point at which it is evaluated in the ordinary function notation, $J_f(x)$. Newton’s method described above for a single equation in one variable can be used to determine a vector $x_0$ that solves this system, if a solution exists, or to determine that the system does not have a solution. For the vector-valued function in the system of equations (4.9), the first-order Taylor series about a point $x_0^{(k)}$ is
\[
f(x) \approx f\left(x_0^{(k)}\right) + J_f\left(x_0^{(k)}\right)\left(x - x_0^{(k)}\right).
\]
This first-order Taylor series is the basis for Newton’s method, shown in Algorithm 4.6.

Algorithm 4.6 Newton’s Method for a System of Equations (Compare with Algorithm 4.3, page 77.)
0. Set $k = 0$, and determine an approximation $x_0^{(k)}$.
1. Solve for $x_0^{(k+1)}$ in
\[
J_f\left(x_0^{(k)}\right)\left(x_0^{(k+1)} - x_0^{(k)}\right) = -f\left(x_0^{(k)}\right).
\]
2. If $\left\|x_0^{(k+1)} - x_0^{(k)}\right\| \le \epsilon$, return the solution as $x_0^{(k+1)}$; otherwise, if $k < k_{\max}$, set $k = k + 1$ and go to step 1; otherwise, issue message that ‘algorithm did not converge in $k_{\max}$ iterations’.
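The steps of Algorithm 4.6 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (`solve_linear`, `newton_system`), the use of the maximum norm in the stopping rule, and the two-equation test system are all hypothetical choices that do not come from the text. The linear system in step 1 is solved by Gaussian elimination with partial pivoting; the Jacobian is never inverted.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data are not modified.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular Jacobian")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        acc = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - acc) / m[i][i]
    return x


def newton_system(f, jac, x, eps=1e-10, kmax=50):
    """Newton iteration for f(x) = 0 with m = n (square Jacobian)."""
    for _ in range(kmax):
        # Step 1: solve J (x_new - x) = -f(x) for the step.
        step = solve_linear(jac(x), [-v for v in f(x)])
        x_new = [xi + si for xi, si in zip(x, step)]
        # Step 2: stop when the step is small (maximum norm, an assumption).
        if max(abs(si) for si in step) <= eps:
            return x_new
        x = x_new
    raise RuntimeError("algorithm did not converge in kmax iterations")


# Hypothetical test system: x1^2 + x2^2 = 4 and x1 * x2 = 1.
def f(x):
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]


def jac(x):
    return [[2.0 * x[0], 2.0 * x[1]], [x[1], x[0]]]


root = newton_system(f, jac, [2.0, 0.5])
```

For this test system one root has $x_1 + x_2 = \sqrt{6}$ and $x_1 - x_2 = \sqrt{2}$, which gives a simple check on the value returned in `root`.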
Notice that in general $m$ and $n$ are not equal, and the system in step 1 is $n$ equations in $m$ unknowns. If, however, $m = n$, and the Jacobian is nonsingular, the solution in step 1 is
\[
x_0^{(k+1)} = x_0^{(k)} - \left(J_f\left(x_0^{(k)}\right)\right)^{-1} f\left(x_0^{(k)}\right). \tag{4.11}
\]
It is important to remember that this expression does not imply that the Jacobian matrix should be inverted. Linear systems are not solved this way (see Section 4.1). Expressions involving the inverse of a matrix provide a compact representation, and so we often write equations such as (4.11).

Sometimes, the Jacobian is replaced by a finite-difference approximation,
\[
\frac{\partial f_i}{\partial x_j} \approx \frac{f_i(x_1, x_2, \ldots, x_j + h, \ldots, x_m) - f_i(x_1, x_2, \ldots, x_j, \ldots, x_m)}{h},
\]
for $h > 0$. Use of this approximation in place of the Jacobian is called the “discrete Newton’s method”. This, of course, doubles the number of function evaluations per iteration, but it does avoid the computation of the derivatives.

The number of computations in Newton’s method may be reduced by assuming that the Jacobian (or the discrete approximation) does not change much from one iteration to the next. A value of the Jacobian may be used in a few subsequent iterations. The number of computations can also be reduced if the Jacobian has a special structure, as is often the case in important applications, such as in solving systems of differential equations. It may be sparse or banded. In these cases, use of algorithms that take advantage of the special structure will reduce the computations significantly.

Other ways of reducing the computations in Newton’s method use an estimate of the derivative that is updated within each iteration. This kind of method is called quasi-Newton. We will discuss quasi-Newton methods for optimization problems in Section 5.6.

If the ranges of the variables in a nonlinear system are quite different, the solution may not be very accurate.
The accuracy can often be improved considerably by scaling the variables and the function values so that they all have approximately the same range. Scaling of a variable $x_i$ is just a multiplicative transformation: $y_i = \sigma x_i$. Of course, the ranges of the values of the variables may not be known in advance, so it may be necessary to do some preliminary computations in order to do any kind of useful scaling.

Stochastic Approximation

The Robbins-Monro stochastic approximation (see page 84),
\[
x_0^{(k+1)} = x_0^{(k)} + \alpha^{(k)} y_k,
\]
extends immediately to the case in which $x$ and $y$ are vectors. The weights in the update, $\alpha^{(k)}$, can either be constant for each element of $y_k$, or the weights
may be elements of a diagonal matrix that allows different weights for each element of $y_k$. More generally, a matrix that reflects the correlational structure of $y_k$ may be used in the update. We have a recursion similar to the Newton update (4.11) with $\left(J_f\left(x_0^{(k)}\right)\right)^{-1}$ replaced by some $A^{(k)}$, and $f\left(x_0^{(k)}\right)$ replaced by $y_k$. A further extension allows the observations on the underlying random vector $Y$ to be correlated. We will encounter the Robbins-Monro procedure again in Section 5.12, but for more details on it and related methods we refer the reader to Kushner and Yin (1997).
Exercises

Exercises 4.1 through 4.4 require you to write simple programs to find the zeros of functions. Use Fortran, C, Matlab, S-Plus, PV-Wave, or any other general-purpose language. Your program modules should be independent of the function and should allow the user to input starting values and stopping criteria.

4.1. Write a program module to implement the bisection method to find a root of a given function, which is input together with values that bracket a root, and an epsilon as the stopping criterion. Your program should check that the two starting values are legitimate. Use bisection to determine the first zero of the Bessel function of the first kind, of order 0:
\[
J_0(x) = \frac{1}{\pi} \int_0^{\pi} \cos(x \sin\theta)\, d\theta.
\]
(This function is available in Matlab, besselj; in PV-Wave, beselj; in the IMSL Library, bsj0/dbsj0; and in the Unix math library, j0.)

4.2. Write a program module similar to that of Exercise 4.1 to implement Newton’s method to find a root of a given function, which is input together with its derivative, a starting value, and two stopping criteria: an epsilon and a maximum number of iterations.
(a) Observe the performance of the method on the function
\[
f(x) = x^3 - 14x^2 + 68x - 115,
\]
which is the function used in the examples in this chapter. Start with $x_0^{(0)} = 9$. Print $x_0^{(k)}$ to 10 digits, and observe the number of correct digits at each iteration until the solution is accurate to 10 digits. Produce a table similar to Table 4.1 on page 77. What is the rate of convergence?
(b) Now observe the performance of the method on the function
\[
f(x) = x^3 - 15x^2 + 75x - 125,
\]
whose solution is also 5. Again start with $x_0^{(0)} = 9$. What is the rate of convergence? What is the difference?
4.3. Write a program module similar to that of Exercise 4.1 to implement the secant method to find a root of a given function, which is input together with two starting values, and two stopping criteria: an epsilon and a maximum number of iterations. Observe the performance of the method on the function
\[
f(x) = x^3 - 14x^2 + 68x - 115.
\]
Produce a table similar to Table 4.1 on page 77.

4.4. Write a program module similar to that of Exercise 4.1 to implement the regula falsi method to find a root of a given function, which is input together with two starting values, and two stopping criteria: an epsilon and a maximum number of iterations. Your program should check that the two starting values are legitimate. Observe the performance of the method on the function
\[
f(x) = x^3 - 14x^2 + 68x - 115.
\]
Produce a table similar to Table 4.1 on page 77.

4.5. Compare the performance of the three methods in Exercises 4.2 through 4.4 and that of the bisection method for the given polynomial. Consider such things as rate of convergence and ease of use of the method.

4.6. Now consider the same function $f$ as in the previous exercises, except assume that the value of $f(x)$ cannot be observed exactly. More precisely, suppose when we attempt to compute $f(x)$, we get the value
\[
\tilde{f}(x) = f(x) + \epsilon,
\]
where $\epsilon$ is a realization of a normal random variable with mean 0 and variance 0.0001. Compare the performance of the methods in Exercises 4.1 through 4.4 for finding a zero of $\tilde{f}(x)$. Consider such things as rate of convergence and ease of use of the method. What other issues are relevant?
Chapter 5
Unconstrained Descent Methods in Dense Domains

We now return to the problem of finding $x_*$ such that
\[
\min_x f(x) = f(x_*),
\]
where $x$ is an $m$-vector and $f$ is a real scalar-valued function. In this chapter we consider the continuous optimization problem in which $x$ is a vector in a dense subset of $\mathbb{R}^m$. In Chapter 6 we discuss discrete optimization, in which $x$ is restricted to a countable subset.

In the present chapter we generally assume the function is differentiable in all variables, and we often assume it is twice differentiable in all variables. The properties of derivatives and their characterization of stationary points discussed in Chapter 3 are the basis for most optimization methods for differentiable functions. Sometimes, rather than using the exact derivatives it is more efficient to use approximations such as finite differences. If the function is not differentiable, but is “well-behaved”, the methods based on finite differences often allow us to determine the optimum.

For the time being we will consider the problem of unconstrained optimization. The methods we describe are the basic ones whether constraints are present or not.

Solution of an optimization problem is usually an iterative process, moving from one point on the function to another. The basic things to determine are
• the direction or path, $p$, in which to step and
• how far to step. (The step length is $\|\alpha p\|$, for the scalar $\alpha$.)
5.1
Direction of Search
For a differentiable function, from any given point, an obvious direction to move is the negative gradient, or a direction that has an acute angle with the negative
gradient. We call a vector $p$ such that
\[
p^T \nabla f(x) < 0
\]
a descent direction at the point $x$. For a function of a single variable, this direction of course is just the negative of the sign of the derivative of $f$ at $x$. If $\nabla f(x) \ne 0$, we can express $p$ as
\[
R p = -\nabla f(x), \tag{5.1}
\]
for some positive definite matrix $R$. A particular choice of $R$ determines the direction. A method that determines the direction in this manner is called a “gradient method”.

Numerical computations for quantities such as $p^T \nabla f(x)$ that may be close to zero must be performed with some care. We sometimes impose the requirement
\[
p^T \nabla f(x) < -\epsilon,
\]
for some positive number $\epsilon$, so as to avoid possible numerical problems for quantities too close to zero.

Once a direction is chosen, the best step is the longest one for which the function continues to decrease. These heuristic principles of choosing a “good” direction and a “long” step guide our algorithms, but we must be careful in applying the principles.
5.2
Line Searches
Although the first thing we must do is to choose a descent direction, in this section we consider the problem of choosing the length of a step in a direction that has already been chosen. In subsequent sections we return to the problem of choosing the direction. We assume the direction chosen is a descent direction.

The problem of finding a minimum is similar to, but more complicated than, the problem of finding a zero of a function that we discussed in Chapter 4. In finding a root of a continuous function of a single scalar variable, two values can define an interval in which a root must lie. Three values are necessary to identify an interval containing a local minimum. Nearby points in a descent direction form a decreasing sequence, and any point with a larger value defines an interval containing a local minimum.

After a direction of movement $p^{(k)}$ from a point $x^{(k)}$ is determined, a new point, $x^{(k+1)}$, is chosen in that direction:
\[
x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}, \tag{5.2}
\]
where $\alpha^{(k)}$ is a positive scalar, called the step length factor. (The step length itself is $\|\alpha^{(k)} p^{(k)}\|$.)
Obviously, in order for the recursion (5.2) to converge, $\alpha^{(k)}$ must approach 0. A sequence of $\alpha^{(k)}$ that converges to 0, even in descent directions, clearly does not guarantee that the sequence $x^{(k)}$ will converge to $x_*$, however. This is easily seen in the case of the function of the scalar $x$, $f(x) = x^2$, starting with $x^{(0)} = 3$ and $\alpha^{(0)} = 1$, proceeding in the descent direction $-x$ (that is, with $p^{(k)} = -1$), and updating the step length factor as $\alpha^{(k+1)} = \frac{1}{2}\alpha^{(k)}$. The step lengths clearly converge to 0, and while the sequence $x^{(k)}$ goes in the correct direction, it converges to 1, not to the point of the minimum of $f$, $x_* = 0$.

Choice of the “best” $\alpha^{(k)}$ is an optimization problem in one variable:
\[
\min_{\alpha^{(k)}} f\left(x^{(k)} + \alpha^{(k)} p^{(k)}\right), \tag{5.3}
\]
for fixed $x^{(k)}$ and $p^{(k)}$. An issue in solving the original minimization problem for $f(x)$ is how to allocate the effort between determining a good $p^{(k)}$ and choosing a good $\alpha^{(k)}$. Rather than solving the minimization problem to find the best value of $\alpha^{(k)}$ for the $k$th direction, it may be better to get a reasonable approximation, and move on to choose another direction from the new point.

One approach to choosing a good value of $\alpha^{(k)}$ is to use a simple approximation to the one-dimensional function we are trying to minimize:
\[
\rho(\alpha) = f\left(x^{(k)} + \alpha p^{(k)}\right).
\]
A useful approximation is a second- or third-degree polynomial that interpolates $\rho(\alpha)$ at three or four nearby points. The minimum of the polynomial can be found easily, and the point of the minimum may be a good choice for $\alpha^{(k)}$.

A simpler approach, assuming $\rho(\alpha)$ is unimodal over some positive interval, say $[\alpha_l, \alpha_u]$, is just to perform a direct search along the path $p^{(k)}$. A bisection method or some other simple method for finding a zero of a function as we discussed in Section 4.2.1 could be modified and used.

Another approach for developing a direct search method is to choose two points $\alpha_1$ and $\alpha_2$ in $[\alpha_l, \alpha_u]$, with $\alpha_1 < \alpha_2$, and then, based on the function values of $\rho$, to replace the interval $I = [\alpha_l, \alpha_u]$ with either $I_l = [\alpha_l, \alpha_2]$ or $I_u = [\alpha_1, \alpha_u]$. In the absence of any additional information about $\rho$, we choose the points $\alpha_1$ and $\alpha_2$ symmetrically, in such a way that the lengths of both $I_l$ and $I_u$ are the same proportion, say $\tau$, of the length of the original interval $I$. This leads to $\tau^2 = 1 - \tau$, the golden ratio. The search using this method of reduction is called the golden section search, and is given in Algorithm 5.1.
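The example given earlier in this section, in which the step length factors converge to 0 but the iterates do not reach the minimum, is easy to check numerically. The following Python sketch is only an illustration; the function name `halving_steps` and the number of iterations are arbitrary choices, and the direction is taken as the unit descent direction for $f(x) = x^2$.

```python
def halving_steps(x0=3.0, alpha0=1.0, n_iter=60):
    """Iterate x <- x + alpha * p with alpha halved after every step."""
    x, alpha = x0, alpha0
    for _ in range(n_iter):
        # Unit descent direction for f(x) = x**2: move toward 0.
        p = -1.0 if x > 0 else 1.0
        x = x + alpha * p
        alpha *= 0.5
    return x, alpha


x_final, alpha_final = halving_steps()
```

The total distance moved is $1 + \frac{1}{2} + \frac{1}{4} + \cdots = 2$, so `x_final` is essentially 1 even though `alpha_final` is essentially 0.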
Algorithm 5.1 Golden Section Search
0. Set $\tau = \left(\sqrt{5} - 1\right)/2$ (the golden ratio). Set $\alpha_1 = \alpha_l + (1 - \tau)(\alpha_u - \alpha_l)$ and set $\alpha_2 = \alpha_l + \tau(\alpha_u - \alpha_l)$. Set $\rho_1 = \rho(\alpha_1)$ and $\rho_2 = \rho(\alpha_2)$.
1. If $\rho_1 > \rho_2$,
1.a. set $\alpha_l = \alpha_1$, set $\alpha_1 = \alpha_2$, set $\alpha_2 = \alpha_l + \tau(\alpha_u - \alpha_l)$, set $\rho_1 = \rho_2$, and set $\rho_2 = \rho(\alpha_2)$; otherwise,
1.b. set $\alpha_u = \alpha_2$, set $\alpha_2 = \alpha_1$, set $\alpha_1 = \alpha_l + (1 - \tau)(\alpha_u - \alpha_l)$, set $\rho_2 = \rho_1$, and set $\rho_1 = \rho(\alpha_1)$.
2. If $\alpha_u - \alpha_l > \epsilon$ (a preset tolerance) go to step 1; otherwise, return the solution as $\alpha_1$.

The golden section search is robust, but it is only linearly convergent, like the bisection method of Algorithm 4.2. (This statement about convergence applies just to this one-dimensional search, which is a subproblem in our optimization problem of interest.)

Another criterion for a direct search is to require
\[
f\left(x^{(k)} + \alpha^{(k)} p^{(k)}\right) \le f\left(x^{(k)}\right) + \tau \alpha^{(k)} \left(p^{(k)}\right)^T \nabla f\left(x^{(k)}\right), \tag{5.4}
\]
for some $\tau$ in $\left(0, \frac{1}{2}\right)$. This criterion is called the sufficient decrease condition, and the approach is called the Goldstein-Armijo method after two early investigators of the technique. After choosing $\tau$, the usual procedure is to choose $\alpha$ as the largest value in $1, \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots$ that satisfies the inequality.

If the step length is not too long, the descent at $x^{(k)}$ in the given direction will be greater than the descent in that direction at $x^{(k)} + \alpha^{(k)} p^{(k)}$. This leads to the so-called curvature condition:
\[
\left| \left(p^{(k)}\right)^T \nabla f\left(x^{(k)} + \alpha^{(k)} p^{(k)}\right) \right| \le \eta \left| \left(p^{(k)}\right)^T \nabla f\left(x^{(k)}\right) \right|, \tag{5.5}
\]
for some $\eta$ in $(0, 1)$.

Moré and Thuente (1994) describe other ways of doing the line search, and provide some empirical results on the performance of the searches.
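Algorithm 5.1 translates almost line for line into code. The Python sketch below is one possible rendering, not a definitive implementation; the function name, the default tolerance, and the test function $\rho(\alpha) = (\alpha - 0.3)^2 + 1$ on $[0, 2]$ are assumptions made for illustration. Note that only one new evaluation of $\rho$ is needed in each pass through step 1.

```python
import math


def golden_section_search(rho, a_l, a_u, tol=1e-7):
    """Golden section search for a minimum of a unimodal rho on [a_l, a_u]."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0
    a1 = a_l + (1.0 - tau) * (a_u - a_l)
    a2 = a_l + tau * (a_u - a_l)
    r1, r2 = rho(a1), rho(a2)
    while a_u - a_l > tol:
        if r1 > r2:
            # Step 1.a: the minimum lies in [a1, a_u]; reuse a2 as the new a1.
            a_l = a1
            a1, r1 = a2, r2
            a2 = a_l + tau * (a_u - a_l)
            r2 = rho(a2)
        else:
            # Step 1.b: the minimum lies in [a_l, a2]; reuse a1 as the new a2.
            a_u = a2
            a2, r2 = a1, r1
            a1 = a_l + (1.0 - tau) * (a_u - a_l)
            r1 = rho(a1)
    return a1


# Hypothetical unimodal test function with its minimum at 0.3.
alpha_star = golden_section_search(lambda a: (a - 0.3) ** 2 + 1.0, 0.0, 2.0)
```

Because the interval shrinks by the factor $\tau \approx 0.618$ at each pass, roughly 35 passes are needed to reduce an interval of length 2 to the default tolerance used here.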
5.3
Steepest Descent
We now turn to the problem of choosing a descent direction. Most methods we will consider are gradient methods, that is, they satisfy (5.1):
\[
R p = -\nabla f(x).
\]
From a given point $x^{(k)}$, the function $f$ decreases most rapidly in the direction of the negative gradient, $-\nabla f\left(x^{(k)}\right)$. A greedy algorithm uses this steepest descent direction; that is,
\[
p^{(k)} = -\nabla f\left(x^{(k)}\right), \tag{5.6}
\]
and so the update in equation (5.2) is
\[
x^{(k+1)} = x^{(k)} - \alpha^{(k)} \nabla f\left(x^{(k)}\right).
\]
The step length factor $\alpha^{(k)}$ is chosen by a method described in Section 5.2.

The steepest descent method is robust so long as the gradient is not zero. The method, however, is likely to change directions often, and the zigzag approach to the minimum may be quite slow (see Exercise 5.1a). For a function with circular contours, steepest descent proceeds quickly to the solution. For a function whose contours are ellipses, as the function in Exercise 5.1 (page 121), for example, the steepest descent steps will zigzag toward the solution. A matrix other than the identity may deform the elliptical contours so they are more circular. In Newton’s method discussed next, we choose the Hessian.
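The contrast between circular and elliptical contours can be seen in a small experiment. The Python sketch below combines the steepest descent direction (5.6) with the Goldstein-Armijo choice of $\alpha$ as the largest value in $1, \frac{1}{2}, \frac{1}{4}, \ldots$ satisfying (5.4). The function names, the value $\tau = 10^{-4}$, the tolerance, and the two test functions $f(x) = \frac{1}{2}(x_1^2 + c\,x_2^2)$ with $c = 1$ and $c = 25$ are all assumptions chosen for illustration.

```python
def backtracking(f, x, p, g, tau=1e-4):
    """Largest alpha in 1, 1/2, 1/4, ... meeting sufficient decrease (5.4)."""
    slope = sum(pi * gi for pi, gi in zip(p, g))  # p^T grad f(x), negative
    fx = f(x)
    alpha = 1.0
    while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + tau * alpha * slope:
        alpha *= 0.5
    return alpha


def steepest_descent(f, grad, x, tol=1e-6, kmax=10000):
    """Return the final point and the number of steps taken."""
    for k in range(kmax):
        g = grad(x)
        if max(abs(gi) for gi in g) <= tol:
            return x, k
        p = [-gi for gi in g]  # steepest descent direction (5.6)
        alpha = backtracking(f, x, p, g)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x, kmax


def make_quadratic(c):
    """f(x) = (x1^2 + c * x2^2) / 2; contours are circles when c = 1."""
    f = lambda x: 0.5 * (x[0] ** 2 + c * x[1] ** 2)
    grad = lambda x: [x[0], c * x[1]]
    return f, grad


f_circ, g_circ = make_quadratic(1.0)
f_ell, g_ell = make_quadratic(25.0)
x_circ, n_circ = steepest_descent(f_circ, g_circ, [3.0, 1.0])
x_ell, n_ell = steepest_descent(f_ell, g_ell, [3.0, 1.0])
```

With circular contours the negative gradient points directly at the minimum and a single full step suffices; with the elongated contours the second coordinate changes sign repeatedly and many more steps are needed, which is the zigzag behavior described above.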
5.4
Newton’s Method
To find the minimum of the scalar-valued function $f(x)$, under the assumptions that $f$ is convex and twice differentiable, we can seek the zero of $\nabla f(x)$ in the same way that we find a zero of a vector-valued function using the iteration in equation (4.11), page 88. We begin by forming a first-order Taylor series expansion of $\nabla f(x)$, which is equivalent to a second-order Taylor series expansion of $f(x)$. In place of a vector-valued function we have the gradient of the scalar-valued function, and in place of a Jacobian, we have the Hessian $H_f$, which is the Jacobian of the gradient. Setting the gradient to zero, we obtain an iteration similar to equation (4.11):
\[
x^{(k+1)} = x^{(k)} - \left(H_f\left(x^{(k)}\right)\right)^{-1} \nabla f\left(x^{(k)}\right). \tag{5.7}
\]
Use of this recursive iteration is Newton’s method. The method is also often called the Newton-Raphson method.
In one dimension, the Newton recursion is just
\[
x^{(k+1)} = x^{(k)} - \frac{\nabla f\left(x^{(k)}\right)}{\nabla^2 f\left(x^{(k)}\right)}
= x^{(k)} - \frac{f'\left(x^{(k)}\right)}{f''\left(x^{(k)}\right)}.
\]
The second-order Taylor series approximation to $f$ about the point $x_*$,
\[
f(x) \approx f(x_*) + (x - x_*)^T \nabla f(x_*) + \frac{1}{2}(x - x_*)^T H_f(x_*)(x - x_*), \tag{5.8}
\]
is exact if $f$ is a quadratic function. In that case, $H_f$ is positive definite, and the terms in equation (5.7) exist and yield the solution immediately. When $f$ is not quadratic, but is sufficiently regular, we can build a sequence of approximations by quadratic expansions of $f$ about approximate solutions. This means, however, that the Hessian may not be positive definite and its inverse in (5.7) may not exist.

Once more, it is important to state that we do not necessarily compute each term in an expression. We choose mathematical expressions for their understandability; we choose computational methods for their robustness, accuracy, and efficiency. Just as we commented on page 88 concerning inversion of the Jacobian, we comment here that we do not compute the Hessian and then compute its inverse, just because that appears in equation (5.7). We solve the linear systems
\[
H_f\left(x^{(k)}\right) p^{(k)} = -\nabla f\left(x^{(k)}\right) \tag{5.9}
\]
by more efficient methods such as Cholesky factorizations. Once we have the solution to equation (5.9), equation (5.7) becomes
\[
x^{(k+1)} = x^{(k)} + p^{(k)}. \tag{5.10}
\]
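A Newton step computed by way of (5.9) and (5.10), rather than by inverting the Hessian, might look like the following Python sketch. The helper names and the quadratic test function are hypothetical, and the plain Cholesky routine shown here assumes the Hessian is symmetric positive definite (it raises an error otherwise).

```python
import math


def cholesky(h):
    """Lower-triangular L with L L^T = h, for symmetric positive definite h."""
    n = len(h)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = h[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][i] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return low


def cholesky_solve(h, b):
    """Solve h p = b using forward and back substitution on the factor."""
    n = len(b)
    low = cholesky(h)
    z = [0.0] * n
    for i in range(n):  # forward substitution: L z = b
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    p = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T p = z
        acc = sum(low[k][i] * p[k] for k in range(i + 1, n))
        p[i] = (z[i] - acc) / low[i][i]
    return p


def newton_step(grad, hess, x):
    """One update x + p, where H p = -grad f(x), as in (5.9) and (5.10)."""
    p = cholesky_solve(hess(x), [-gi for gi in grad(x)])
    return [xi + pi for xi, pi in zip(x, p)]


# Hypothetical quadratic: f(x) = (x1^2 + 25 x2^2) / 2 + x1 x2 - x1.
grad = lambda x: [x[0] + x[1] - 1.0, x[0] + 25.0 * x[1]]
hess = lambda x: [[1.0, 1.0], [1.0, 25.0]]
x_new = newton_step(grad, hess, [3.0, 1.0])
```

Because the test function is quadratic, the single step lands on the minimum, $(25/24, -1/24)$, up to rounding.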
Newton’s method, by scaling the path by the Hessian, is more likely to point the path in the direction of a local minimum, whereas the steepest descent method, in ignoring the second derivative, follows a path along the gradient that does not take into account the rate of change of the gradient. This is illustrated in Figure 5.1.

For functions that are close to a quadratic within a region close to the minimum, Newton’s method can be very effective so long as the iterations begin close enough to the solution. In other cases Newton’s method may be unreliable. The problems may be similar to those illustrated in Figures 4.4 and 4.5 (page 80) for finding a root.

One way of increasing the reliability of Newton’s method is to use a damped version of the update (5.10),
\[
x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)},
\]
for which a line search is used to determine an appropriate step length factor $\alpha^{(k)}$.

[Figure 5.1: Steepest Descent and Newton Steps]

When the function is not quadratic, the Hessian may not be positive definite, and so a modified Cholesky factorization may be used. In this approach, positive quantities are added as necessary during the decomposition of the Hessian. This changes the linear system (5.9) to the system
\[
\left(H_f\left(x^{(k)}\right) + D^{(k)}\right) p^{(k)} = -\nabla f\left(x^{(k)}\right), \tag{5.11}
\]
where $D^{(k)}$ is a diagonal matrix with nonnegative elements.

Another method of increasing the reliability of Newton’s method is to restrict the movements to regions where the second-order Taylor expansion (5.8) is a good approximation. This region is called a “trust region”. At the $k$th iteration, the second-order Taylor series approximation provides a scaled quadratic model $q^{(k)}$:
\[
q^{(k)}(s) = f\left(x_*^{(k)}\right) + s^T \nabla f\left(x_*^{(k)}\right) + \frac{1}{2} s^T H_f\left(x_*^{(k)}\right) s, \tag{5.12}
\]
where $s = x - x_*^{(k)}$.

When the Hessian is indefinite, $q^{(k)}$ is unbounded below, so it is obviously not a good model of $f\left(x_*^{(k)} + s\right)$ if $s$ is large. We therefore restrict $\|s\|$, or better we restrict $\|D^{(k)} s\|$ for some scaling matrix $D^{(k)}$. For some $\tau^{(k)}$, we require
\[
\|D^{(k)} s\| < \tau^{(k)}, \tag{5.13}
\]
and we choose $s^{(k)}$ as the point where the quadratic $q^{(k)}$ achieves its minimum subject to this restriction. How much we should restrict $s$ depends on how good the quadratic approximation is. If
\[
\frac{f\left(x_*^{(k)}\right) - f\left(x_*^{(k)} + s^{(k)}\right)}{f\left(x_*^{(k)}\right) - q^{(k)}\left(s^{(k)}\right)}
\]
is close to 1, that is, if the approximation is good, we increase $\tau^{(k)}$; if it is small or negative, we decrease $\tau^{(k)}$. Implementation of these methods requires some rather arbitrary choices of algorithm parameters.
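The rule for adjusting $\tau^{(k)}$ can be sketched as a small function. Everything specific in this Python fragment is an assumption made for illustration: the function name, the thresholds 0.25 and 0.75, and the factors 2 and 1/2 are typical of the “rather arbitrary choices” just mentioned and are not values given in the text. The sketch also assumes that the model predicts a strict decrease, so that the denominator is positive.

```python
def update_radius(f_old, f_new, q_new, radius, low=0.25, high=0.75):
    """Compare actual and predicted reduction, then adjust the radius.

    f_old is f at the current point, f_new is f at the trial point, and
    q_new is the value of the quadratic model at the trial step.
    """
    actual = f_old - f_new
    predicted = f_old - q_new  # assumed positive
    ratio = actual / predicted
    if ratio > high:  # model is good: allow longer steps
        return ratio, 2.0 * radius
    if ratio < low:  # model is poor (or f increased): shorten steps
        return ratio, 0.5 * radius
    return ratio, radius


good = update_radius(10.0, 8.0, 8.0, 1.0)
poor = update_radius(10.0, 10.5, 8.0, 1.0)
```

In the first call the actual and predicted reductions agree, so the ratio is 1 and the radius is doubled; in the second the function increased, the ratio is negative, and the radius is halved.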
5.5
Accuracy of Optimization Using Gradient Methods
The problem of finding a minimum of a function is somewhat more difficult than that of finding a zero of a function discussed in Chapter 4. Our intuition should tell us this is the case. In one dimension, a zero of a function can be determined by successively bracketing a zero with two points. An interval containing a minimum of a function requires three points to determine it.

Another way of comparing the accuracy of the solution of a nonlinear equation and the determination of the minimum of a function is to consider the Taylor expansion:
\[
f(x) = f(\tilde{x}) + (x - \tilde{x}) f'(\tilde{x}) + \frac{1}{2}(x - \tilde{x})^2 f''(\tilde{x}) + \cdots.
\]
In the problem of finding a zero $x_0$, $f'(x_0)$ is nonzero, and for $\tilde{x}$ close to $x_0$, $(f(x) - f(\tilde{x}))$ is approximately proportional to $(x - \tilde{x})$, where the constant of proportionality is $f'(\tilde{x})$. A small value of the difference $(x - \tilde{x})$ results in a proportionate difference $(f(x) - f(\tilde{x}))$. On the other hand, in the problem of finding the minimum $x_*$, $f'(x_*)$ is zero, and for $\tilde{x}$ close to $x_*$, $(f(x) - f(\tilde{x}))$ is approximately proportional to $(x - \tilde{x})^2$, where the constant of proportionality is $\frac{1}{2} f''(\tilde{x})$. A small value of the difference $(x - \tilde{x})$ results in a smaller difference $(f(x) - f(\tilde{x}))$. In finding roots of an equation we may set a convergence criterion proportional to the machine epsilon, $\epsilon_{\mathrm{mach}}$. In optimization problems, we often set a convergence criterion proportional to $\sqrt{\epsilon_{\mathrm{mach}}}$.
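The effect is easy to observe in double precision. In the following Python sketch the two functions are hypothetical examples: `f` has a minimum at 1 and `g` has a simple zero at 1. A perturbation of size $\sqrt{\epsilon_{\mathrm{mach}}}/4$ in $x$ is invisible in the values of `f` but fully visible in the values of `g`.

```python
import math
import sys

eps_mach = sys.float_info.epsilon  # 2**-52 for IEEE double precision


def f(x):
    """Function with a minimum at x = 1 (f' is zero there)."""
    return (x - 1.0) ** 2 + 1.0


def g(x):
    """Function with a zero at x = 1 (g' is nonzero there)."""
    return x - 1.0


delta = math.sqrt(eps_mach) / 4.0
change_near_min = f(1.0 + delta) - f(1.0)
change_near_zero = g(1.0 + delta) - g(1.0)
```

Here `change_near_min` is exactly zero, because $\delta^2$ is below the rounding level of the function value 1, while `change_near_zero` equals `delta`. No amount of care in the search can locate the minimizer of `f` more finely than this from function values alone.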
5.6
Quasi-Newton Methods
All gradient descent methods determine the path of the step by the system of equations,
\[
R^{(k)} p^{(k)} = -\nabla f\left(x^{(k)}\right).
\]
The steepest descent method chooses $R^{(k)}$ as the identity, $I$, in these equations. As we have seen, for functions with eccentric contours, the steepest descent
method traverses a zigzag path to the minimum. Newton’s method chooses $R^{(k)}$ as the Hessian, $H_f\left(x_*^{(k)}\right)$, which results in a more direct path to the minimum. Aside from the issues of consistency of the resulting equation (5.11) and the general problems of reliability, a major disadvantage of Newton’s method is the computational burden of computing the Hessian, which is $O(m^2)$ function evaluations, and solving the system, which is $O(m^3)$ arithmetic operations, at each iteration.

Instead of using the Hessian at each iteration, we may use an approximation, $B^{(k)}$. We may choose approximations that are simpler to update and/or that allow the equations for the step to be solved more easily. Methods using such approximations are called quasi-Newton methods or variable metric methods.

Because
\[
H_f\left(x^{(k+1)}\right)\left(x^{(k+1)} - x^{(k)}\right) \approx \nabla f\left(x^{(k+1)}\right) - \nabla f\left(x^{(k)}\right),
\]
we choose $B^{(k)}$ so that
\[
B^{(k+1)}\left(x^{(k+1)} - x^{(k)}\right) = \nabla f\left(x^{(k+1)}\right) - \nabla f\left(x^{(k)}\right). \tag{5.14}
\]
This is called the secant condition. (Note the similarity to the secant method for finding a zero discussed in Sections 4.2.1 and 4.2.2.) We express the secant condition as
\[
B^{(k+1)} s^{(k)} = y^{(k)}, \tag{5.15}
\]
where
\[
s^{(k)} = x^{(k+1)} - x^{(k)}
\]
and
\[
y^{(k)} = \nabla f\left(x^{(k+1)}\right) - \nabla f\left(x^{(k)}\right).
\]
The system of equations in (5.15) does not fully determine $B^{(k)}$ of course. Because $B^{(k)}$ is approximating $H_f\left(x^{(k)}\right)$, we may want to require that it be symmetric and positive definite.

The most common approach in quasi-Newton methods is first to choose a reasonable starting matrix $B^{(0)}$ and then to choose subsequent matrices by additive updates,
\[
B^{(k+1)} = B^{(k)} + B_a^{(k)},
\]
subject to preservation of symmetry and positive definiteness. The general steps in a quasi-Newton method are
0. Set $k = 0$ and choose $x^{(k)}$ and $B^{(k)}$.
1. Compute $s^{(k)}$ as $\alpha^{(k)} p^{(k)}$, where $B^{(k)} p^{(k)} = -\nabla f\left(x^{(k)}\right)$.
2. Compute $x^{(k+1)}$ and $\nabla f\left(x^{(k+1)}\right)$.
3. Check for convergence and stop if converged.
4. Compute $B^{(k+1)}$.
5. Set $k = k + 1$, and go to 1.

Within these general steps there are two kinds of choices to be made: the way to update the approximation $B^{(k)}$, and, as usual, the choice of the step length factor $\alpha^{(k)}$.

There are several choices for the update $B_a^{(k)}$ that preserve symmetry and positive definiteness (or at least nonnegative definiteness). One simple choice is the rank-one symmetric matrix
\[
B_a^{(k)} = \frac{1}{\left(y^{(k)} - B^{(k)} s^{(k)}\right)^T s^{(k)}} \left(y^{(k)} - B^{(k)} s^{(k)}\right)\left(y^{(k)} - B^{(k)} s^{(k)}\right)^T. \tag{5.16}
\]
This update results in a symmetric matrix that satisfies the secant condition no matter what the previous matrix $B^{(k)}$ is. (You are asked to do the simple algebra to show this in Exercise 5.3.) If $B^{(k)}$ is positive definite, this update results in a positive definite matrix $B^{(k+1)}$ so long as $c^{(k)} \le 0$, where $c^{(k)}$ is the denominator:
\[
c^{(k)} = \left(y^{(k)} - B^{(k)} s^{(k)}\right)^T s^{(k)}.
\]
Even if $c^{(k)} > 0$, positive definiteness can be preserved by shrinking $c^{(k)}$ to $\tilde{c}^{(k)}$ so that
\[
\tilde{c}^{(k)} < \frac{1}{\left(y^{(k)} - B^{(k)} s^{(k)}\right)^T \left(B^{(k)}\right)^{-1} \left(y^{(k)} - B^{(k)} s^{(k)}\right)}.
\]
Although this adjustment is not as difficult as it might appear, the computations to preserve positive definiteness and, in general, good condition of the $B^{(k)}$ account for a major part of the effort in quasi-Newton methods.

Other, more common choices for $B_a^{(k)}$ are the rank-two Broyden updates of the form
\[
B_a^{(k)} = -\frac{1}{\left(s^{(k)}\right)^T B^{(k)} s^{(k)}} B^{(k)} s^{(k)} \left(B^{(k)} s^{(k)}\right)^T
+ \frac{1}{\left(y^{(k)}\right)^T s^{(k)}} y^{(k)} \left(y^{(k)}\right)^T
+ \sigma^{(k)} \left(\left(s^{(k)}\right)^T B^{(k)} s^{(k)}\right) v^{(k)} \left(v^{(k)}\right)^T, \tag{5.17}
\]
where $\sigma^{(k)}$ is a scalar in $[0, 1]$, and
\[
v^{(k)} = \frac{1}{\left(y^{(k)}\right)^T s^{(k)}} y^{(k)} - \frac{1}{\left(s^{(k)}\right)^T B^{(k)} s^{(k)}} B^{(k)} s^{(k)}.
\]
Letting σ (k) = 0 in (5.17) yields the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update, which is one of the most widely used methods. If σ (k) = 1, the method is called the Davidon-Fletcher-Powell (DFP) method.
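The Broyden family (5.17) is short to code, and the secant condition (5.15) gives a convenient numerical check. In the Python sketch below, the helper names and the small test data are hypothetical; `sigma=0.0` gives the BFGS update and `sigma=1.0` the DFP update. The secant condition holds for every $\sigma^{(k)}$ because $\left(v^{(k)}\right)^T s^{(k)} = 0$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(a, v):
    return [dot(row, v) for row in a]


def broyden_update(b, s, y, sigma=0.0):
    """Return B + B_a with B_a given by (5.17)."""
    n = len(s)
    bs = matvec(b, s)
    sbs = dot(s, bs)  # s^T B s
    ys = dot(y, s)  # y^T s, assumed positive
    v = [yi / ys - bsi / sbs for yi, bsi in zip(y, bs)]
    return [
        [
            b[i][j]
            - bs[i] * bs[j] / sbs
            + y[i] * y[j] / ys
            + sigma * sbs * v[i] * v[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical data with y^T s = 3.5 > 0.
b0 = [[2.0, 0.0], [0.0, 1.0]]
s0 = [1.0, 0.5]
y0 = [2.0, 3.0]
b_bfgs = broyden_update(b0, s0, y0, sigma=0.0)
b_dfp = broyden_update(b0, s0, y0, sigma=1.0)
```

Applying either updated matrix to `s0` reproduces `y0` up to rounding, and both results are symmetric.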
The Broyden updates will preserve the positive definiteness of $B^{(k)}$ so long as
\[
\left(y^{(k)}\right)^T s^{(k)} > 0.
\]
This is the curvature condition (see (5.5) on page 94). If the curvature condition is not satisfied, $s^{(k)}$ could be scaled so as to satisfy this inequality. (Scaling $s^{(k)}$ of course changes $y^{(k)}$ also.) Alternatively, the update of $B^{(k)}$ can just be skipped, and the updated step is determined using the previous value, $B^{(k)}$. This method is obviously quicker, but it is not as reliable.

Inspection of either the rank-one updates (5.16) or the rank-two updates (5.17) reveals that the number of computations is $O(m^2)$. If the updates are done to the inverses of the $B^{(k)}$’s or to their Cholesky factors, the computations required for the updated directions are just matrix-vector multiplications and hence can also be computed in $O(m^2)$ computations.

It is easily seen that the updates can be done to the inverses of the $B^{(k)}$’s using the Sherman-Morrison formula (equation (3.19) on page 110 of Gentle, 1998) for rank-one updates, or the Woodbury formula (equation (3.20) of Gentle, 1998) for more general updates. Using the Woodbury formula, the BFGS update, for example, results in the recursion,
\[
\left(B^{(k+1)}\right)^{-1} =
\left(I - \frac{1}{\left(y^{(k)}\right)^T s^{(k)}} s^{(k)} \left(y^{(k)}\right)^T\right)
\left(B^{(k)}\right)^{-1}
\left(I - \frac{1}{\left(y^{(k)}\right)^T s^{(k)}} y^{(k)} \left(s^{(k)}\right)^T\right)
+ \frac{1}{\left(y^{(k)}\right)^T s^{(k)}} s^{(k)} \left(s^{(k)}\right)^T.
\]
The best way of doing the inverse updates is to perform them on the Cholesky factors instead of on the inverses. The expression above for updating the inverse shows that this can be done. Another important property of the quasi-Newton methods is that they can be performed without explicitly storing the B (k) ’s, which could be quite large in large-scale optimization problems. The storage required in addition to that for B (k) is for the vectors s(k) and y (k) . If B (k) is a diagonal matrix, the total storage is O(m). In computing the update at the (k + 1)th iteration, limited-memory quasi-Newton methods assume that B (k−j) is diagonal at some previous iteration. The update for the (k + 1)th iteration can be computed by vector-vector operations beginning back at the (k − j)th iteration. In practice, diagonality is assumed at the fourth or fifth previous iteration; that is, j is taken as 4 or 5. Quasi-Newton methods are available in most of the widely-used mathematical software packages. Broyden updates are the most commonly used in these packages, and of the Broyden updates, BFGS is probably the most popular. Nocedal (1992) discusses the various choices for updates in quasi-Newton methods and provides some comparisons. Khalfan, Byrd, and Schnabel (1993) and Byrd, Nocedal, and Schnabel (1994) also provide comparisons of update methods. Their results showed that the simple rank-one update (5.16) is often a superior method.
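The general steps of a quasi-Newton method listed earlier can be put together in a short program. The Python sketch below is one possible arrangement under stated assumptions, not a definitive implementation: it carries the inverse of $B^{(k)}$ directly and applies the standard BFGS inverse recursion, it starts from $B^{(0)} = I$, it uses a simple backtracking line search with $\tau = 10^{-4}$, and it skips the update whenever $\left(y^{(k)}\right)^T s^{(k)} \le 0$. The function names and the quadratic test function are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(a, v):
    return [dot(row, v) for row in a]


def bfgs_inverse_update(h, s, y):
    """(I - rho s y^T) h (I - rho y s^T) + rho s s^T, with rho = 1 / y^T s."""
    n = len(s)
    rho = 1.0 / dot(y, s)
    hy = matvec(h, y)  # h is symmetric, so y^T h equals (h y)^T
    yhy = dot(y, hy)
    return [
        [
            h[i][j]
            - rho * (s[i] * hy[j] + hy[i] * s[j])
            + (rho * rho * yhy + rho) * s[i] * s[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


def quasi_newton(f, grad, x, tol=1e-8, kmax=200):
    n = len(x)
    # Step 0: the inverse of B^(0) = I is the identity.
    h = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for k in range(kmax):
        if max(abs(gi) for gi in g) <= tol:  # step 3
            return x, k
        p = [-v for v in matvec(h, g)]  # step 1: B p = -grad f
        alpha, fx, slope = 1.0, f(x), dot(p, g)
        while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5
        s = [alpha * pi for pi in p]
        x_new = [xi + si for xi, si in zip(x, s)]  # step 2
        g_new = grad(x_new)
        y = [a - b for a, b in zip(g_new, g)]
        if dot(y, s) > 0.0:  # step 4, skipped if curvature fails
            h = bfgs_inverse_update(h, s, y)
        x, g = x_new, g_new  # step 5
    return x, kmax


# Hypothetical test function with elliptical contours and minimum at 0.
f = lambda x: 0.5 * (x[0] ** 2 + 25.0 * x[1] ** 2)
grad = lambda x: [x[0], 25.0 * x[1]]
x_min, n_iter = quasi_newton(f, grad, [3.0, 1.0])
```

On this test function the first step is a short steepest descent step, after which the updated approximation already reflects the different curvatures in the two coordinates and full steps are accepted.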
Truncated Newton Methods

Another way of reducing the computational burden in Newton-type methods is to approximate the solution of the path direction
\[
R^{(k)} p^{(k)} = -\nabla f\left(x^{(k)}\right),
\]
where $R^{(k)}$ is either the Hessian, as in Newton’s method, or an approximation, as in a quasi-Newton method. In a truncated Newton method, instead of solving for $p^{(k)}$, we get an approximate solution using only a few steps of an iterative linear equation solver, such as the conjugate gradient method. The conjugate gradient method is particularly suitable because it uses only matrix-vector products, so the matrix $R^{(k)}$ need not be stored. This can be very important in large-scale optimization problems that involve a large number of decision variables. How far to continue the iterations in the solution of the linear system is a major issue in tuning a truncated Newton method.
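The inner solver of a truncated Newton method might be sketched as follows. This Python fragment is an illustration only: the function names are hypothetical, the matrix is assumed to be symmetric positive definite, and it is supplied only through a function `hess_vec` that returns the product of the matrix with a vector, so the matrix itself is never stored.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def truncated_cg(hess_vec, g, max_iter):
    """Approximately solve R p = -g with at most max_iter CG iterations."""
    n = len(g)
    p = [0.0] * n
    r = [-gi for gi in g]  # residual of R p = -g at p = 0
    d = list(r)  # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr == 0.0:
            break
        hd = hess_vec(d)
        alpha = rr / dot(d, hd)
        p = [pi + alpha * di for pi, di in zip(p, d)]
        r = [ri - alpha * hdi for ri, hdi in zip(r, hd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return p


# Hypothetical 2 x 2 example: R = [[1, 1], [1, 25]] and gradient g.
hess_vec = lambda v: [v[0] + v[1], v[0] + 25.0 * v[1]]
g = [3.0, 28.0]
p_one = truncated_cg(hess_vec, g, 1)  # truncated after one iteration
p_two = truncated_cg(hess_vec, g, 2)  # exact, up to rounding, when m = 2
```

Even the one-iteration result is a descent direction, since it is a positive multiple of the negative gradient; this is what makes truncation safe when the matrix is positive definite.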
5.7
Fitting Models to Data Using Least Squares; Gauss-Newton Methods
One of the most important applications that involve minimization is the fitting of a model to data. In this problem, we have a function f that relates one variable, say y, to other variables, say the m-vector t. The function involves some unknown parameters, say the d-vector θ: y = f (t; θ).
(5.18)
The data consist of $n$ observations on the variables $y$ and $t$. Fitting the model is usually done by minimizing some norm of the vector of residuals
\[
r_i(\theta) = y_i - f(t_i; \theta). \tag{5.19}
\]
The decision variables are the parameters $\theta$. The optimal values of $\theta$, often denoted as $\hat{\theta}$, are called “estimates”.

Because the data are observed and so are constants, the residuals are functions of $\theta$ only. The vector-valued function $r(\theta)$ maps $\mathbb{R}^d$ into $\mathbb{R}^n$.

The most common norm to minimize to obtain the fit is the $L_2$ or Euclidean norm. The scalar-valued objective function then is
\[
s(\theta) = \sum_{i=1}^{n} \left(y_i - f(t_i; \theta)\right)^2
= \sum_{i=1}^{n} \left(r_i(\theta)\right)^2
= \left(r(\theta)\right)^T r(\theta). \tag{5.20}
\]
This problem is called least squares regression. If the function $f$ is nonlinear in $\theta$, the functions $r_i$ are also nonlinear in $\theta$, and the problem is called nonlinear least squares regression.

“Modified” Gauss-Newton Method

The gradient and the Hessian for a least squares problem have special structures that involve the Jacobian of the residuals, which is a vector function of the parameters. The gradient of $s(\theta)$ is
\[
\nabla s(\theta) = \left(J_r(\theta)\right)^T r(\theta).
\]
The Jacobian of $r$ is also part of the Hessian of $s$:
\[
H_s(\theta) = \left(J_r(\theta)\right)^T J_r(\theta) + \sum_{i=1}^{n} r_i(\theta) H_{r_i}(\theta). \tag{5.21}
\]
In this maze of notation the reader should pause to remember the shapes of these arrays, and their meanings in the context of fitting a model to data. Notice, in particular, that the dimension of the space of the optimization problem is $d$, instead of $m$ as in the previous problems. We purposely chose a different letter to represent the dimension so as to emphasize that the decision variables may have a different dimension from that of the independent (observable) variables. The space of an observation has dimension $m + 1$ (the $m$ elements of $t$, plus the response $y$); and the space of the observations as points $y_i$ and corresponding model values $f(t_i; \theta)$ has dimension $n$.
• $t_i$ is an $m$-vector. In the modeling context, these are the independent variables.
• $y$ is an $n$-vector, and it together with the $n$ $t_i$ vectors are constants in the optimization problem. In the modeling context, these are observations.
• $\theta$ is a $d$-vector. This is the vector of parameters.
• $r(\cdot)$ is an $n$-vector. This is the vector of residuals.
• $J_r(\cdot)$ is an $n \times d$ matrix.
• $H_{r_i}(\cdot)$ is a $d \times d$ matrix.
• $s(\cdot)$ is a scalar. This is the data-fitting criterion.
• $\nabla s(\cdot)$ is a $d$-vector.
• $H_s(\cdot)$ is a $d \times d$ matrix.
CHAPTER 5. UNCONSTRAINED DESCENT METHODS
In the vicinity of the solution $\theta_*$, the residuals $r_i(\theta)$ should be small, and $H_s(\theta)$ may be approximated by neglecting the second term in equation (5.21). Using this approximation and the gradient descent equation, we have
$$\bigl(J_r(\theta^{(k)})\bigr)^{T} J_r(\theta^{(k)})\, p^{(k)} = -\bigl(J_r(\theta^{(k)})\bigr)^{T} r(\theta^{(k)}). \qquad (5.22)$$
It is clear that the solution $p^{(k)}$ is a descent direction; that is, if $\nabla s(\theta^{(k)}) \ne 0$,
$$\bigl(p^{(k)}\bigr)^{T} \nabla s(\theta^{(k)}) = -\bigl(J_r(\theta^{(k)})\, p^{(k)}\bigr)^{T} \bigl(J_r(\theta^{(k)})\, p^{(k)}\bigr) < 0.$$
The update step is determined by a line search in the direction of the solution of equation (5.22):
$$\theta^{(k+1)} - \theta^{(k)} = \alpha^{(k)} p^{(k)}.$$
The search is usually required to satisfy the sufficient decrease condition (5.4) and the curvature condition (5.5). This method is called the Gauss-Newton algorithm. Because many years ago (prior to Hartley, 1961) the step was often taken simply as $p^{(k)}$, a method that uses a variable step length factor $\alpha^{(k)}$ is sometimes called a “modified Gauss-Newton algorithm”. It is the only kind to use, so we just call it the “Gauss-Newton algorithm”.

In the case of a linear model, equation (5.18) becomes $y = t^{T}\theta$. The data, consisting of $n$ observations on $y$ and the $m$-vector $t$, result in an $n$-vector of residuals, $r = y - T\theta$, where $T$ is the $n \times m$ matrix whose rows are the observed $t^{T}$. The Gauss-Newton algorithm for this linear least squares problem yields the solution in one step (see equation (3.23) on page 111 of Gentle, 1998).

If the residuals are small and if the Jacobian is nonsingular, the Gauss-Newton method behaves much like Newton’s method near the solution. The major advantage is that second derivatives are not computed. If the residuals are not small or if $J_r(\theta^{(k)})$ is poorly conditioned, the Gauss-Newton method can perform very poorly. If $J_r(\theta^{(k)})$ is not of full rank, just as we do in the linear case, we could choose the solution corresponding to the Moore-Penrose inverse, which has the shortest Euclidean length:
$$p^{(k)} = -\bigl(J_r(\theta^{(k)})\bigr)^{+} r(\theta^{(k)}). \qquad (5.23)$$
(Compare equation (3.26) on page 113 of Gentle, 1998.) If the matrix is nonsingular, the Moore-Penrose inverse is the usual inverse.
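The Gauss-Newton iteration described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two-parameter model $f(t;\theta) = \theta_1 e^{\theta_2 t}$, the function names, and the simple step-halving line search (which checks only for a decrease in $s$, not the conditions (5.4) and (5.5)) are all choices made here, not part of the text.

```python
import math

def residuals(theta, t, y):
    # r_i(theta) = y_i - f(t_i; theta) for the assumed model theta1 * exp(theta2 * t)
    return [yi - theta[0] * math.exp(theta[1] * ti) for ti, yi in zip(t, y)]

def jacobian(theta, t):
    # Row i holds the partial derivatives of r_i with respect to theta1 and theta2
    return [(-math.exp(theta[1] * ti), -theta[0] * ti * math.exp(theta[1] * ti))
            for ti in t]

def sum_sq(theta, t, y):
    return sum(r * r for r in residuals(theta, t, y))

def gauss_newton(theta, t, y, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        r = residuals(theta, t, y)
        J = jacobian(theta, t)
        # Form the normal equations (5.22): (J^T J) p = -J^T r
        a11 = sum(j[0] * j[0] for j in J)
        a12 = sum(j[0] * j[1] for j in J)
        a22 = sum(j[1] * j[1] for j in J)
        b1 = -sum(j[0] * ri for j, ri in zip(J, r))
        b2 = -sum(j[1] * ri for j, ri in zip(J, r))
        det = a11 * a22 - a12 * a12          # assumes J has full column rank
        p = ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)
        # Step-halving line search: shrink alpha until s decreases
        alpha, s0 = 1.0, sum_sq(theta, t, y)
        while alpha > 1e-12 and sum_sq(
                (theta[0] + alpha * p[0], theta[1] + alpha * p[1]), t, y) >= s0:
            alpha /= 2.0
        new = (theta[0] + alpha * p[0], theta[1] + alpha * p[1])
        if max(abs(new[0] - theta[0]), abs(new[1] - theta[1])) < tol:
            return new
        theta = new
    return theta
```

With exact data generated from $\theta = (2, 0.5)$ at $t = 0, 1, \ldots, 4$ and the starting value $(1, 0.3)$, the full step overshoots on the first iteration, so the step length factor is what keeps the iteration on track.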
Levenberg-Marquardt Method

Another possibility, which is similar to what is done in linear ridge regression (see Exercise 6.2 on page 179 of Gentle, 1998), is to add a conditioning matrix to $\bigl(J_r(\theta^{(k)})\bigr)^{T} J_r(\theta^{(k)})$ in equation (5.22). A simple choice is $\tau I_d$, and the equation for the update becomes
$$\Bigl(\bigl(J_r(\theta^{(k)})\bigr)^{T} J_r(\theta^{(k)}) + \tau I_d\Bigr) p^{(k)} = -\bigl(J_r(\theta^{(k)})\bigr)^{T} r(\theta^{(k)}).$$
A better choice may be a scaling matrix, $S^{(k)}$, that takes into account the variability in the columns of $J_r(\theta^{(k)})$; hence we have for the update
$$\Bigl(\bigl(J_r(\theta^{(k)})\bigr)^{T} J_r(\theta^{(k)}) + \lambda^{(k)} \bigl(S^{(k)}\bigr)^{T} S^{(k)}\Bigr) p^{(k)} = -\bigl(J_r(\theta^{(k)})\bigr)^{T} r(\theta^{(k)}). \qquad (5.24)$$
The basic requirement for the matrix $\bigl(S^{(k)}\bigr)^{T} S^{(k)}$ is that it improve the condition of the coefficient matrix. There are various ways of choosing this matrix. One is to transform the matrix $\bigl(J_r(\theta^{(k)})\bigr)^{T} J_r(\theta^{(k)})$ so it has 1’s along the diagonal (this is equivalent to forming a correlation matrix from a variance-covariance matrix), and to use the scaling vector to form $S^{(k)}$. The nonnegative factor $\lambda^{(k)}$ can be chosen to control the extent of the adjustment. The sequence $\lambda^{(k)}$ must go to 0 for the solution to converge.

Equation (5.24) can be thought of as a Lagrangian multiplier formulation of the constrained problem (see Chapter 7):
$$\min_{x} \; \tfrac{1}{2} \bigl\| J_r(\theta^{(k)})\, x + r(\theta^{(k)}) \bigr\|^2 \quad \text{s.t.} \quad \bigl\| S^{(k)} x \bigr\| \le \delta_k.$$
The Lagrange multiplier $\lambda^{(k)}$ is zero if $p^{(k)}$ from equation (5.23) satisfies $\|p^{(k)}\| \le \delta_k$; otherwise it is chosen so that $\|S^{(k)} p^{(k)}\| = \delta_k$.

Use of an adjustment such as in equation (5.24) is called the Levenberg-Marquardt algorithm. This is probably the most widely used method for nonlinear least squares. The method can be thought of as a trust region method, with $\delta_k$ being the radius of the trust region, comparable to $\tau^{(k)}$ in (5.13).

Just as in ridge regression (see Exercise 6.2b on page 179 of Gentle, 1998), the computations for equation (5.24) can be performed efficiently by recognizing that the system is the normal equations for the least squares fit of
$$\begin{pmatrix} -r(\theta^{(k)}) \\ 0 \end{pmatrix} \approx \begin{pmatrix} J_r(\theta^{(k)}) \\ \sqrt{\lambda^{(k)}}\, S^{(k)} \end{pmatrix} p.$$
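A minimal sketch of the damped update follows, using the simple conditioning matrix $\lambda I_d$ rather than a scaling matrix $S^{(k)}$. The model $f(t;\theta) = \theta_1 e^{\theta_2 t}$, the function names, and the rule for adjusting the damping factor (divide by 10 after a step that reduces $s$, multiply by 10 otherwise) are assumptions made for illustration; the text does not prescribe a particular rule.

```python
import math

def model(theta, ti):
    # Assumed two-parameter model: theta1 * exp(theta2 * t)
    return theta[0] * math.exp(theta[1] * ti)

def sum_sq(theta, t, y):
    return sum((yi - model(theta, ti)) ** 2 for ti, yi in zip(t, y))

def lm_step(theta, t, y, lam):
    """Solve (J^T J + lam * I) p = -J^T r for the two-parameter model."""
    a11 = a12 = a22 = g1 = g2 = 0.0
    for ti, yi in zip(t, y):
        e = math.exp(theta[1] * ti)
        r = yi - theta[0] * e
        j1, j2 = -e, -theta[0] * ti * e      # row of the Jacobian of r
        a11 += j1 * j1
        a12 += j1 * j2
        a22 += j2 * j2
        g1 += j1 * r
        g2 += j2 * r
    a11 += lam
    a22 += lam
    det = a11 * a22 - a12 * a12
    return ((-g1 * a22 + a12 * g2) / det, (-a11 * g2 + a12 * g1) / det)

def levenberg_marquardt(theta, t, y, lam=1.0, tol=1e-12, max_iter=200):
    s = sum_sq(theta, t, y)
    for _ in range(max_iter):
        p = lm_step(theta, t, y, lam)
        trial = (theta[0] + p[0], theta[1] + p[1])
        s_trial = sum_sq(trial, t, y)
        if s_trial < s:
            # Accept the step and relax the damping, so lam tends toward 0
            theta, s = trial, s_trial
            lam /= 10.0
        else:
            # Reject the step and increase the damping (shorter step next time)
            lam *= 10.0
        if max(abs(p[0]), abs(p[1])) < tol:
            break
    return theta
```

Increasing the damping factor shortens the step and turns it toward the steepest descent direction, which plays the role of shrinking the trust region radius $\delta_k$.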
Variance-Covariance of the Parameter Estimators

In fitting models to data, we always wish to know how dependent our estimated parameters are on the particular set of data used. A careful assessment of this dependence is usually predicated on formulation of the model (5.18) as
$$y = f(t; \theta) + E,$$
where $E$ is a random variable with an assumed probability distribution. The estimates, $\hat{\theta}$, are therefore functions of realizations of the random variable, and their variability that results from the variability in the data can be assessed by the variance-covariance matrix of the random variable of which $\hat{\theta}$ is a realization. (We usually use slightly less precise terminology, and refer to the variance-covariance matrix of $\hat{\theta}$.)

In simple cases where $f$ is linear, $E$ has a normal distribution with mean 0 and constant variance, $\sigma^2$, and the observations are independent, the problem is particularly simple. In the more familiar notation of linear regression, $y = X\beta$, where $y$ is a vector of observations, and $X$ is a matrix of the corresponding observations, the estimates are the solution to $X^{T}X\beta = X^{T}y$. Furthermore, the variance-covariance matrix for $\hat{\beta}$ is $(X^{T}X)^{-1}\sigma^2$. A good estimate of $\sigma^2$, $\hat{\sigma}^2$, is $(y - X\hat{\beta})^{T}(y - X\hat{\beta})/(n - m)$, where $n$ is the number of observations and $m$ is the length of $\beta$.

In problems of fitting models to data, we often make the assumption that the residuals are independently and identically distributed as normals. Even with this assumption, however, it may not be possible to write a simple expression for the variance-covariance matrix of $\hat{\theta}$. Using a linear approximation that follows from the linear regression model described above, we may approximate the variance-covariance matrix as
$$\Bigl(\bigl(J_r(\hat{\theta})\bigr)^{T} J_r(\hat{\theta})\Bigr)^{-1} \hat{\sigma}^2,$$
from the analogue, $X^{T}X\beta = X^{T}y$, of equation (5.22). The estimate of $\sigma^2$ is taken as the sum of the squared residuals, divided by $n - d$, where $d$ is the number of estimated elements in $\theta$.

From equation (5.21), if the residuals are small, the Hessian is approximately equal to the cross-product of the Jacobian, and so an alternate expression for the variance-covariance matrix is
$$\bigl(H_s(\hat{\theta})\bigr)^{-1} \hat{\sigma}^2.$$
This latter expression would be more useful if Newton’s method or a quasi-Newton method is used in the solution of the least squares problem.
5.8
Iteratively Reweighted Least Squares
Often in applications, the residuals in equation (5.19) are not given equal weight in fitting the model. This may be because the reliability or precision of the observations on $y$ and $t$ may be different. For weighted least squares, instead of (5.20) we have the objective function
$$s_w(\theta) = \sum_{i=1}^{n} w_i \bigl(r_i(\theta)\bigr)^2. \qquad (5.25)$$
The weights add no complexity to the problem, and the Gauss-Newton methods of the previous section apply immediately, with $\tilde{r}(\theta) = W^{1/2} r(\theta)$, where $W$ is a diagonal matrix containing the weights.

The simplicity of the computations for weighted least squares suggests a more general usage of the method. Suppose for fitting the model (5.18) we choose to minimize some other $L_p$ norm of the residuals $r_i$ in (5.19). The objective function then is
$$s_p(\theta) = \sum_{i=1}^{n} \bigl|y_i - f(t_i; \theta)\bigr|^{p} = \sum_{i=1}^{n} \frac{1}{\bigl|y_i - f(t_i; \theta)\bigr|^{2-p}} \bigl|y_i - f(t_i; \theta)\bigr|^{2}. \qquad (5.26)$$
This leads to an iteration on the least squares solutions. Beginning with $\bigl|y_i - f(t_i; \theta^{(1)})\bigr| = 1$, we form the recursion that results from the approximation
$$s_p(\theta^{(k+1)}) \approx \sum_{i=1}^{n} \frac{1}{\bigl|y_i - f(t_i; \theta^{(k)})\bigr|^{2-p}} \bigl|y_i - f(t_i; \theta^{(k+1)})\bigr|^{2}.$$
Hence, we solve a weighted least squares problem, and then form a new weighted least squares problem using the residuals from the previous problem. This method is called iteratively reweighted least squares or IRLS. The iterations over the residuals are outside the loops of iterations to solve the least squares problems, so in nonlinear least squares, IRLS results in nested iterations.

There are some problems with the use of reciprocals of powers of residuals as weights. The most obvious problem arises from very small residuals. This is usually handled by use of a fixed large number as the weight.

Iteratively reweighted least squares can also be applied to other norms,
$$s_\rho(\theta) = \sum_{i=1}^{n} \rho\bigl(y_i - f(t_i; \theta)\bigr),$$
but the approximations for the updates are not as good. Green (1984) and Street, Carroll, and Ruppert (1988) discuss IRLS methods for more general norms. Heiberger and Becker (1992) address some of the software development issues in using IRLS in regression programs.
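The recursion for the $L_p$ criterion can be illustrated with a straight-line model, for which each weighted least squares problem has a closed-form solution. This is a sketch under assumptions made here: the model $f(t;\theta) = \theta_1 + \theta_2 t$, the function names, and the cap `eps` on small residuals (which gives the “fixed large number” $1/\texttt{eps}^{2-p}$ as the weight) are illustrative choices.

```python
def weighted_line_fit(t, y, w):
    # Closed-form weighted least squares fit of y = a + b * t
    sw = sum(w)
    tbar = sum(wi * ti for wi, ti in zip(w, t)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (ti - tbar) * (yi - ybar) for wi, ti, yi in zip(w, t, y))
    sxx = sum(wi * (ti - tbar) ** 2 for wi, ti in zip(w, t))
    b = sxy / sxx
    return ybar - b * tbar, b

def irls_line(t, y, p=1.0, n_iter=50, eps=1e-8):
    w = [1.0] * len(t)                      # begin with all weights equal to 1
    for _ in range(n_iter):
        a, b = weighted_line_fit(t, y, w)
        # New weights 1 / |residual|^(2 - p); residuals below eps are capped
        w = [1.0 / max(abs(yi - a - b * ti), eps) ** (2.0 - p)
             for ti, yi in zip(t, y)]
    return a, b
```

For example, with $t = 0, 1, \ldots, 6$ and $y = 1 + 2t$ except for one grossly wrong response at $t = 3$, the ordinary least squares intercept is pulled toward the outlier, while the $p = 1$ iterations settle on the line through the other six points.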
5.9
Conjugate Gradient Methods
A quadratic function is well-behaved for the problem of finding its optimum. For this reason, as we have seen, many optimization methods are developed in the context of a quadratic function,
$$f(x) = \tfrac{1}{2} x^{T} A x + x^{T} b + c,$$
in which $A$ is positive definite. The methods also often work for other types of functions for which optima exist, because the quadratic function is a good local model of the other functions. Functions with singularities or with extreme variability in the neighborhood of the optimum generally present difficult optimization problems.

Using the quadratic function above, we can describe another method closely related to quasi-Newton methods. The updates, as usual, are
$$x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)};$$
hence, at the $k$th step we have the linear combination
$$x^{(k)} = x^{(1)} + \alpha^{(1)} p^{(1)} + \cdots + \alpha^{(k-1)} p^{(k-1)}.$$
In the conjugate gradient method, these steps are chosen so that
$$\bigl(p^{(k)}\bigr)^{T} A p^{(i)} = 0, \quad \text{for } i = 1, \ldots, k - 1;$$
that is, $p^{(k)}$ is “$A$-conjugate” to $p^{(1)}, p^{(2)}, \ldots, p^{(k-1)}$. Thus, in the case of the quadratic objective function, the steps are orthogonal to each other with respect to the Hessian. This orthogonality of the directions makes the steps more efficient. At each iteration, we determine the optimal step length by a line search.

The problem of solving a linear system $Ax = b$ is equivalent to finding a minimum of $(Ax - b)^{T}(Ax - b)$, which is a quadratic function of the form shown above. See page 104 of Gentle, 1998, for the use of the conjugate gradient method for solving a linear system of equations.
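For the quadratic $f(x) = \frac{1}{2}x^{T}Ax + x^{T}b + c$, the conjugate directions and the exact line search have closed forms. The sketch below is an illustration only; the function names and the particular update for the new direction (the ratio of successive squared gradient norms, a standard choice that the text does not spell out) are assumptions.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, x, tol=1e-12):
    """Minimize 0.5 x^T A x + x^T b for symmetric positive definite A."""
    r = [-(g + bi) for g, bi in zip(matvec(A, x), b)]   # negative gradient
    p = r[:]                                            # first direction
    directions = []
    for _ in range(len(b)):
        rr = dot(r, r)
        if rr < tol:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)        # exact line search along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        directions.append(p)
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r, r) / rr
        # The new direction is A-conjugate to the previous ones
        p = [ri + beta * pi for ri, pi in zip(r, p)]
    return x, directions
```

In exact arithmetic the minimum of an $m$-variable quadratic is reached in at most $m$ such steps, which is why the loop runs only `len(b)` times.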
5.10
The EM Method and Some Variations+
If the Hessian can ***==

The idea of the conjugate direction method can be generalized.

A simple approach to determining the optimum of the function $f(x)$, where $x$ is an $m$-vector, is to choose the direction of each step as the direction of one of the coordinate axes. As in equation (5.3), the problem is
$$\min_{\alpha^{(k)}} f\bigl(x^{(k)} + \alpha^{(k)} p^{(k)}\bigr),$$
but in this case, $p^{(k)} = e_i$, where $e_i$ is an $m$-vector with all elements zero except the $i$th element, which is one. These would not be gradient directions except in the case that $f$ is the sum of functions involving only the separate elements of $x$. Although the method is easy to implement and will converge to a local minimum (under fairly modest regularity assumptions), it is not likely to be very efficient.

A slight generalization of this is to choose two types of directions, $p_1$ and $p_2$, that span the space $\mathbb{R}^m$. We choose
$$p_1^{(k)} = \sum_{i=1}^{m} a_i^{(k)} e_i$$
and
$$p_2^{(k)} = \sum_{i=1}^{m} b_i^{(k)} e_i$$
a step as to fix $x_2, \ldots, x_m$ arbitrarily, optimize $f$ with respect to $x_1$, then fix $x_1$ at the optimum, leave $x_3, \ldots, x_m$ as before, optimize $f$ with respect to $x_2$, continue this process for all elements of $x$, and then iterate, using the

Although it is generally not efficient to do so,

The EM method is a method of solving an optimization problem through a sequence of pairs of steps in which one of each pair addresses a simpler optimization problem.

Dempster, Laird, and Rubin (1977)
missing data

Consider the multinomial distribution with 4 outcomes, that is, the multinomial with probability function,
$$p(x_1, x_2, x_3, x_4) = \frac{n!}{x_1!\,x_2!\,x_3!\,x_4!}\, \pi_1^{x_1} \pi_2^{x_2} \pi_3^{x_3} \pi_4^{x_4},$$
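with $n = x_1 + x_2 + x_3 + x_4$ and $1 = \pi_1 + \pi_2 + \pi_3 + \pi_4$. Suppose that we assume that the probabilities are related by a single parameter, $\theta$:
$$\pi_1 = \tfrac{1}{2} + \tfrac{1}{4}\theta, \qquad \pi_2 = \tfrac{1}{4} - \tfrac{1}{4}\theta, \qquad \pi_3 = \tfrac{1}{4} - \tfrac{1}{4}\theta, \qquad \pi_4 = \tfrac{1}{4}\theta,$$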
where $0 \le \theta \le 1$. This is the example that Dempster, Laird, and Rubin (1977) considered when they studied the EM algorithm. The model goes back to an example discussed by Fisher (1925) in Statistical Methods for Research Workers.

Given an observation $(x_1, x_2, x_3, x_4)$, the log-likelihood function is
$$l(\theta) = x_1 \log(2 + \theta) + (x_2 + x_3) \log(1 - \theta) + x_4 \log(\theta) + c$$
and
$$dl(\theta)/d\theta = \frac{x_1}{2 + \theta} - \frac{x_2 + x_3}{1 - \theta} + \frac{x_4}{\theta}.$$
Estimate $\theta$ using the data that Dempster, Laird, and Rubin used: $n = 197$ and $x = (125, 18, 20, 34)$. (Note the equation $dl(\theta)/d\theta = 0$ is a quadratic in $\theta$, so it could be solved explicitly.)

To use the EM algorithm on this problem, we can think of a multinomial with five classes, which is formed from the original multinomial by splitting the first class into two with associated probabilities $1/2$ and $\theta/4$. The original variable $x_1$ is now the sum of $x_{11}$ and $x_{12}$. Under this reformulation, we now have a maximum likelihood estimate of $\theta$ by considering $x_{12} + x_4$ (or $x_2 + x_3$) to be a realization of a binomial with $n = x_{12} + x_4 + x_2 + x_3$ and $\pi = \theta$ (or $1 - \theta$). However, we do not know $x_{12}$ (or $x_{11}$). Proceeding as if we had a five-outcome multinomial observation with two missing elements, we have the log-likelihood for the complete data,
$$l_c(\theta) = (x_{12} + x_4) \log(\theta) + (x_2 + x_3) \log(1 - \theta),$$
and the maximum likelihood estimate for $\theta$ is
$$\frac{x_{12} + x_4}{x_{12} + x_2 + x_3 + x_4}.$$
The E-step of the iterative EM algorithm fills in the missing or unobservable value with its expected value given a current value of the parameter, $\theta^{(k)}$, and the observed data. Because $l_c(\theta)$ is linear in the data, we have
$$E\bigl(l_c(\theta)\bigr) = E(x_{12} + x_4) \log(\theta) + E(x_2 + x_3) \log(1 - \theta).$$
Under this setup, with $\theta = \theta^{(k)}$,
$$E_{\theta^{(k)}}(x_{12}) = x_1\, \frac{\tfrac{1}{4}\theta^{(k)}}{\tfrac{1}{2} + \tfrac{1}{4}\theta^{(k)}} = x_{12}^{(k)}.$$
We now maximize $E_{\theta^{(k)}}\bigl(l_c(\theta)\bigr)$. This maximum occurs at
$$\theta^{(k+1)} = \bigl(x_{12}^{(k)} + x_4\bigr) / \bigl(x_{12}^{(k)} + x_2 + x_3 + x_4\bigr).$$
The following Matlab statements execute a single iteration.

    function [x12kp1,tkp1] = em(tk,x)
    % E-step: expected value of x12 given the current value tk
    x12kp1 = x(1)*tk/(2+tk);
    % M-step: maximize the expected complete-data log-likelihood
    tkp1 = (x12kp1 + x(4))/(sum(x)-x(1)+x12kp1);

Beginning with t = 0.5, we get 0.6082 ...
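The same E-step and M-step can be iterated to convergence. The following is a sketch in Python rather than Matlab; the function names, the tolerance, and the iteration limit are assumptions made here for illustration.

```python
def em_step(theta, x):
    # E-step: expected count in the second part of the split first class
    x12 = x[0] * theta / (2.0 + theta)
    # M-step: binomial maximum likelihood estimate from the completed data
    return (x12 + x[3]) / (x12 + x[1] + x[2] + x[3])

def em(theta, x, tol=1e-10, max_iter=500):
    for _ in range(max_iter):
        new = em_step(theta, x)
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta
```

With $x = (125, 18, 20, 34)$ and a starting value of 0.5, the first iterate is $59/97 \approx 0.6082$, in agreement with the value quoted above, and the iterates approach the root of $197\theta^2 - 15\theta - 68 = 0$ in $(0, 1)$, which is the explicit solution of $dl(\theta)/d\theta = 0$.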
For Newton’s method, the negative of the Hessian is
$$\frac{x_1}{(2 + \theta)^2} + \frac{x_2 + x_3}{(1 - \theta)^2} + \frac{x_4}{\theta^2},$$
and for scoring, the expected value of the information is
$$\frac{n}{4} \left( \frac{1}{2 + \theta} + \frac{2}{1 - \theta} + \frac{1}{\theta} \right),$$
which we obtain by taking $E(X_i)$ for each element of the multinomial random variable. Using the Matlab statements

    function [l, dl, ie] = fishnr(x,t)
    % log-likelihood, score, and observed information (negative Hessian)
    l = x(1)*log(2+t) + (x(2)+x(3))*log(1-t) + x(4)*log(t);
    dl = x(1)/(2+t) - (x(2)+x(3))/(1-t) + x(4)/t;
    ie = x(1)/(2+t)^2 + (x(2)+x(3))/(1-t)^2 + x(4)/t^2;

and

    function [l, dl, ei] = fishscor(x,t)
    % log-likelihood, score, and expected information
    l = x(1)*log(2+t) + (x(2)+x(3))*log(1-t) + x(4)*log(t);
    dl = x(1)/(2+t) - (x(2)+x(3))/(1-t) + x(4)/t;
    ei = sum(x)*(1/(2+t) + 2/(1-t) + 1/t)/4;

to define functions, we iterate over the statements

    [l, dl, ie] = fishnr(x,t);
    t = t + dl/ie

and

    [l, dl, ei] = fishscor(x,t);
    t = t + dl/ei

Beginning with t = 0.5, with Newton’s method we get 0.6364 0.6270 0.6268 0.6268 and for scoring we get 0.6332 0.6265 0.6268 0.6268.

McLachlan and Krishnan (1997)
Chan and Ledolter (1995) Monte Carlo implementation time series
“supplemented EM” SEM algorithm Meng and Rubin (1991)
“stochastic EM” SEM algorithm Celeux and Diebolt (1985) and Diebolt and Ip (1996)
ECM algorithm Meng and Rubin (1993)
Rai and Matthews (1993), Improving the EM algorithm
EM – difficulty with E: Wei and Tanner (1990) MCEM Monte Carlo implementation; data augmentation; use MCMC
EM – difficulty with M: Lange (1995) suggested an EM gradient algorithm. Instead of regular M step, use one or more iterations of Newton’s method in the gradient direction provided by the E-step.
*** problems with local minima ... Arslan, Constable, and Kent (1993)
Augmented data scoring Ma and Hudson (1998)
Accelerating the EM Algorithm
Jamshidian and Jennrich (1993), Acceleration of the EM algorithm by using conjugate gradient methods
Jamshidian and Jennrich (1997), Acceleration of the EM algorithm by using quasi-Newton methods
Meng and van Dyk (1997), The EM algorithm – an old folk-song sung to a fast new tune, **** also gives history
5.11
Fisher Scoring+
An important optimization problem in statistical applications is the maximization of a likelihood function. If $p(y; \theta)$ is a probability function or a probability density function, with a fixed parameter $\theta$, that describes the distribution of a random variable with realization $y$, the associated likelihood function is the function
$$L(\theta; y) = p(y; \theta),$$
in which the parameter is the variable, and a realization of the random variable is the fixed parameter. For computational convenience, the log-likelihood,
$$l_L(\theta; y) = \log L(\theta; y),$$
is often used.

When data for $y$ are available, the likelihood or log-likelihood is often used to determine an estimate of the parameter $\theta$. A reasonable estimate is the value of $\theta$ that maximizes the likelihood function (or, equivalently, its log). Such an estimator has certain desirable properties as the number of observations on $y$ grows without bound.

A common quasi-Newton method for optimizing $l_L(\theta; y)$ is Fisher scoring, in which the Hessian in Newton’s method is replaced by its expected value. The iterates then are
$$\hat{\theta}_{k+1} = \hat{\theta}_k - \Bigl(E\bigl(H_{l_L}(\hat{\theta}_k \mid y)\bigr)\Bigr)^{-1} \nabla l_L(\hat{\theta}_k \mid y). \qquad (5.27)$$

Modified Fisher scoring using Jacobi or Gauss-Seidel subiterations: Ma and Hudson (1997)
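As an illustration of the difference between Newton’s method and Fisher scoring, the two iterations can be compared on the one-parameter multinomial example of Section 5.10. This sketch is written in Python; the function names are assumptions made here, and the information functions return the negative of the Hessian and the negative of its expected value, so the update adds the ratio.

```python
def score(theta, x):
    # Derivative of the log-likelihood for the multinomial example
    return x[0] / (2 + theta) - (x[1] + x[2]) / (1 - theta) + x[3] / theta

def observed_info(theta, x):
    # Negative of the second derivative (used by Newton's method)
    return (x[0] / (2 + theta) ** 2 + (x[1] + x[2]) / (1 - theta) ** 2
            + x[3] / theta ** 2)

def expected_info(theta, x):
    # Negative of the expected second derivative (used by Fisher scoring)
    n = sum(x)
    return n * (1 / (2 + theta) + 2 / (1 - theta) + 1 / theta) / 4

def iterate(theta, x, info, n_iter=4):
    path = []
    for _ in range(n_iter):
        theta = theta + score(theta, x) / info(theta, x)
        path.append(theta)
    return path
```

Calling `iterate(0.5, [125, 18, 20, 34], observed_info)` gives the Newton iterates and passing `expected_info` gives the scoring iterates; the first iterates are approximately 0.6364 and 0.6332 respectively.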
5.12
Stochastic Search Methods
The Robbins-Monro stochastic approximation (see pages 84 and 88) can also be applied to optimization problems. In the update equation (4.7),
$$x_*^{(k+1)} = x_*^{(k)} + \alpha^{(k)} y_k,$$
the random element $y_k$ is the negative gradient or an approximation to it. Kiefer and Wolfowitz (1952) used stochastic approximation with finite differences for a simple regression problem, and so methods like this are often called Kiefer-Wolfowitz procedures. Chin (1993) reviews some of the methods for general optimization problems. Spall (1992) describes a method called simultaneous perturbation stochastic approximation (SPSA) that differs from the Kiefer-Wolfowitz procedure by using only two evaluations of the objective function to approximate the gradient. The method is given in Algorithm 5.2.

Algorithm 5.2 Simultaneous Perturbation Stochastic Approximation (SPSA)

0. Set $k = 0$, and choose an initial point, $x^{(k)}$.
1. Generate a random vector $p^{(k)}$, whose components are independent and from a suitable distribution with a mean of 0.
2. Compute the simultaneous perturbation approximation $\hat{g}^{(k)}$ to the gradient, $g(x^{(k)}) = \nabla f(x^{(k)})$:
$$\hat{g}^{(k)} = \begin{pmatrix} \dfrac{f\bigl(x^{(k)} + c^{(k)} p^{(k)}\bigr) - f\bigl(x^{(k)} - c^{(k)} p^{(k)}\bigr)}{2 c^{(k)} p_1^{(k)}} \\ \vdots \\ \dfrac{f\bigl(x^{(k)} + c^{(k)} p^{(k)}\bigr) - f\bigl(x^{(k)} - c^{(k)} p^{(k)}\bigr)}{2 c^{(k)} p_m^{(k)}} \end{pmatrix}.$$
3. Update:
$$x_*^{(k+1)} = x_*^{(k)} - \alpha^{(k)} \hat{g}^{(k)}.$$
4. If $\bigl|x_*^{(k+1)} - x_*^{(k)}\bigr| \le \epsilon$, return the solution as $x_*^{(k)}$; otherwise, if $k < k_{\max}$, set $k = k + 1$ and go to step 1; otherwise, issue message that ‘algorithm did not converge in $k_{\max}$ iterations’.
The efficiency in this method arises from the simplicity of the gradient approximation. The objective function is only evaluated at two points to compute the approximation. Spall (1992) recommends that the sequences $\alpha^{(k)}$ and $c^{(k)}$ be chosen as $\alpha^{(k)} = \alpha_0 (k + 1)^{-p_\alpha}$ and $c^{(k)} = c_0 (k + 1)^{-p_c}$. While $\alpha_0$ and $c_0$ may require some numerical experimentation for the given problem, based on general empirical results, Spall recommends $p_\alpha = 0.602$ and $p_c = 0.101$. Spall (1992) recommends that the elements of $p^{(k)}$ be chosen in step 1 from a symmetric Bernoulli distribution with mass points $(-1, 1)$. Other distributions also satisfy the requirements to guarantee convergence. The normal distribution and the uniform distribution, however, do not. Spall and Cristion (1994) described some modifications to the simultaneous perturbation stochastic approximation method for use in situations in which the objective function is changing over time.
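Algorithm 5.2 with the gain sequences and the symmetric Bernoulli perturbations recommended above can be sketched as follows. The function name, the default values of `alpha0` and `c0`, the fixed seed, and the use of a fixed number of iterations in place of the convergence test in step 4 are assumptions made here for illustration.

```python
import random

def spsa(f, x, alpha0=0.1, c0=0.1, p_alpha=0.602, p_c=0.101,
         n_iter=2000, seed=0):
    rng = random.Random(seed)
    x = list(x)
    for k in range(n_iter):
        a_k = alpha0 * (k + 1) ** (-p_alpha)      # step length sequence
        c_k = c0 * (k + 1) ** (-p_c)              # perturbation size sequence
        # Step 1: symmetric Bernoulli perturbation with mass points -1 and 1
        p = [rng.choice((-1.0, 1.0)) for _ in x]
        # Step 2: two function evaluations serve every component of g_hat
        f_plus = f([xi + c_k * pi for xi, pi in zip(x, p)])
        f_minus = f([xi - c_k * pi for xi, pi in zip(x, p)])
        diff = (f_plus - f_minus) / (2.0 * c_k)
        g_hat = [diff / pi for pi in p]
        # Step 3: update
        x = [xi - a_k * gi for xi, gi in zip(x, g_hat)]
    return x
```

Note that the number of function evaluations per iteration is two regardless of the dimension of $x$, whereas a finite-difference gradient would need a number of evaluations proportional to the dimension.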
5.13
Derivative-Free Methods
In the previous discussions we have generally assumed that the objective function is differentiable. If the function is differentiable, and the derivatives are available, methods that use the gradient are generally the most efficient ones; although depending on the cost of evaluation of derivatives, more efficient algorithms may avoid evaluation of the derivatives at every iteration. If the function is differentiable but the derivatives are not available, numerical derivatives or other approximations to the gradient should generally be used. For continuous functions that are not differentiable or whose derivatives are difficult to compute or to approximate, we need derivative-free methods. Also in the case of noisy functions that cannot be evaluated exactly, methods that do not directly use derivatives may be better.
5.13.1
Nelder-Mead Simplex Method
The Nelder-Mead simplex method (Nelder and Mead, 1965) is a derivative-free, direct search method. The steps are chosen so as to ensure a local descent, but neither the gradient nor an approximation to it is used. In this method, to find the minimum of a function, f , of m variables, a set of m + 1 extreme points (a simplex) is chosen to start with, and iterations proceed by replacing the point that has the largest value of the function with a point that has a smaller value. This yields a new simplex and the procedure continues. The method is shown in Algorithm 5.3.
Algorithm 5.3 Nelder-Mead Simplex Method

0. Set tuning factors: reflection coefficient, $\alpha > 0$; expansion factor, $\gamma > 1$; contraction factor, $0 < \beta < 1$; and shrinkage factor, $0 < \delta < 1$. Choose an initial simplex, that is, $m + 1$ extreme points (points on the vertices of a convex hull).
1. Evaluate $f$ at each point in the current simplex, obtaining the values $f_1 \le f_2 \le \cdots \le f_m \le f_{m+1}$. Label the points correspondingly, that is, let $x_{m+1}$ correspond to $f_{m+1}$, and so on.
2. Reflect the worst point: let $x_r = (1 + \alpha) x_a - \alpha x_{m+1}$, where $x_a = \sum_{i=1}^{m} x_i / m$, and let $f_r = f(x_r)$.
3. If $f_1 \le f_r \le f_m$, accept reflection: replace $x_{m+1}$ by $x_r$, and go to step 6.
4. If $f_r < f_1$, compute expansion: $x_e = \gamma x_r + (1 - \gamma) x_a$. If $f(x_e) < f_1$,
   4.a. accept expansion: replace $x_{m+1}$ by $x_e$; otherwise,
   4.b. replace $x_{m+1}$ by $x_r$.
   Go to step 6.
5. If $f_m < f_r < f_{m+1}$, let $f_h = f_r$; otherwise, let $f_h = f_{m+1}$. Let $x_h$ be the corresponding point. Compute contraction: $x_c = \beta x_h + (1 - \beta) x_a$. If $f(x_c) \le f(x_h)$,
   5.a. accept contraction: replace $x_{m+1}$ by $x_c$; otherwise,
   5.b. shrink simplex: for $i = 2, 3, \ldots, m + 1$, replace $x_i$ by $\delta x_i + (1 - \delta) x_1$.
6. If convergence has not occurred (see below) and a preset limit on the number of iterations has not been exceeded, go to step 1; otherwise, return the solution as $x_1$.

There are three common ways of assessing convergence of the Nelder-Mead algorithm. All three, or variations of them, may be used together.
• The amount of variation in the function values at the simplex points. This is measured by the sample variance,
$$s_f^2 = \frac{1}{m + 1} \sum (f_i - \bar{f})^2,$$
where $\bar{f}$ is the sample mean of $f_1, f_2, \ldots, f_{m+1}$. Convergence is declared if $s_f^2 < \epsilon$. This stopping criterion can lead to premature convergence, just because the simplex points happen to lie close to the same level curve of the function.
• The total of the norms of the differences in the points in the new simplex and those in the previous simplex. (In any iteration except shrinkage, there is only one point that is replaced.) This is one of several stopping criteria proposed by Parkinson and Hutchinson (1972).
• The size of the simplex. Dennis and Woods (1987) suggested measuring this by
$$\frac{\max_i \|x_i - x_1\|}{\max(1, \|x_1\|)}$$
and terminating when this measure is sufficiently small.
Figure 5.2 illustrates one iteration of the algorithm in a two-dimensional problem. In two dimensions, the iterations are those of a triangle tumbling downhill vertex over edge and deforming itself as it goes.
[Figure 5.2 appears here; the labeled points in the plot are x1, x2, x3, xm, and xr.]

Figure 5.2: One Nelder-Mead Iteration. In this step, “x2” becomes “x3”; “x1” becomes “x2”, and “xr” becomes “x1”.

Although the Nelder-Mead algorithm may be slow to converge, it is a very useful method for several reasons. The computations in any iteration of the algorithm are not extensive. No derivatives are needed; in fact, not even the function values themselves are needed, only their relative values. The method is therefore well-suited to noisy functions; that is, functions that cannot be evaluated exactly.

There have been many suggestions for improving the Nelder-Mead method. Most have concentrated on the stopping criteria or the tuning parameters. The various tuning parameters allow considerable flexibility, but there are no good general guidelines for their selection.

Barton and Ivey (1996) describe modifications of the Nelder-Mead algorithm for stochastic functions; that is, noisy functions or functions whose values have an additive random component. Their modifications include reevaluating the stochastic functions at the points in the simplex considered for replacement.

It is a simple matter to introduce randomness in the decisions made at various points in the Nelder-Mead algorithm. As we discuss in Section 8.1, this may be useful for finding the global optimum of a function with many local optima. If some decisions are made randomly, however, the convergence criteria must be modified to reflect the fact that the iterations may no longer be strictly descending.
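The steps of the Nelder-Mead simplex method can be sketched compactly. The following is an illustration under assumptions made here: the default tuning factors ($\alpha = 1$, $\gamma = 2$, $\beta = \delta = 1/2$), the function name, and the use of the sample variance of the function values as the only convergence test are choices for this sketch, not recommendations from the text.

```python
def nelder_mead(f, simplex, alpha=1.0, gamma=2.0, beta=0.5, delta=0.5,
                tol=1e-20, max_iter=1000):
    pts = [list(p) for p in simplex]
    m = len(pts) - 1
    for _ in range(max_iter):
        pts.sort(key=f)                          # order so that f1 <= ... <= f(m+1)
        fv = [f(p) for p in pts]
        fbar = sum(fv) / (m + 1)
        if sum((v - fbar) ** 2 for v in fv) / (m + 1) < tol:
            break                                # variance criterion met
        # Reflect the worst point through the centroid of the best m points
        xa = [sum(p[j] for p in pts[:m]) / m for j in range(m)]
        xr = [(1 + alpha) * a - alpha * w for a, w in zip(xa, pts[m])]
        fr = f(xr)
        if fv[0] <= fr <= fv[m - 1]:
            pts[m] = xr                          # accept reflection
        elif fr < fv[0]:
            xe = [gamma * r + (1 - gamma) * a for r, a in zip(xr, xa)]
            pts[m] = xe if f(xe) < fv[0] else xr # expansion or reflection
        else:
            xh = xr if fr < fv[m] else pts[m]
            xc = [beta * h + (1 - beta) * a for h, a in zip(xh, xa)]
            if f(xc) <= f(xh):
                pts[m] = xc                      # accept contraction
            else:                                # shrink toward the best point
                pts = [pts[0]] + [
                    [delta * xi + (1 - delta) * x1 for xi, x1 in zip(p, pts[0])]
                    for p in pts[1:]]
    pts.sort(key=f)
    return pts[0]
```

Only comparisons among function values drive the branching, which is the property that makes the method tolerant of functions whose values are known only in relative terms.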
5.13.2
Price Controlled Random Search Method
Price (1977) proposed a method called controlled random search, in which a simplex is chosen randomly from a fixed set of points, and a random point in the simplex is reflected to obtain a new candidate point. In this method, to find the minimum of a function, $f$, of $m$ variables, first a random set of $n$ points is chosen, where $n$ is an arbitrary number greater than $m$ (Khuri, 1993, has suggested $n = 10m$, and Křivý and Tvrdík, 1995, recommended $\max(10, m^2)$). The function is evaluated at each of the points, and the best point in the set, $x_*^{(0)}$ (yielding $f_*^{(0)}$), is determined. In the $k$th iteration, from the set of $n$ points, $m$ are chosen randomly and their centroid, $x_a = \sum_{i=1}^{m} x_i / m$, is computed. Another point $x_{m+1}$ is chosen randomly from the remaining set of $n - m$ points, and is reflected through the centroid, to obtain $x_r$; that is, $x_r = (1 + \alpha) x_a - \alpha x_{m+1}$ (Price chose $\alpha = 1$). If $f(x_r) < f_*^{(k-1)}$, then $f_*^{(k)}$ is updated and $x_*$ is replaced by $x_r$; otherwise, $x_r$ is discarded. The iterations are continued until a stopping criterion is satisfied.

Aside from a convergence criterion, there are only two parameters to be chosen in the controlled random search method, the number of fixed points to retain, and the reflection parameter. The number of points to choose, $n$, obviously should increase as the number of variables, $m$, increases. The optimal relationship increases faster than a polynomial, so while a linear increase may work well in low dimensions, in higher dimensions, larger sets of fixed points must be maintained. The larger the value of $n$, the less likely that the iterations will become stuck in a local minimum. One of the advantages of the controlled random search method is that it is likely to find a global minimum even if the function has multiple local minima.

The choice of the reflection parameter, $\alpha$, depends on the smoothness of the function. It can be chosen larger for smoother functions. Tvrdík and Křivý (1995) suggested that $\alpha$ be chosen randomly, uniformly over $(0, 8)$, and reported that this choice worked well for most problems they considered.

Křivý and Tvrdík (1995) suggested various other modifications to the basic controlled random search algorithm, including one that used ideas of genetic algorithms. Ali, Törn, and Viitanen (1997) incorporated local optimizing searches within the steps of the controlled random search algorithm. Křivý, Tvrdík, and Krpec (2000) report numerical experiments in fitting 14 different nonlinear models using two of their modified controlled random search algorithms and the standard algorithms in four different statistical packages (NCSS, Systat, S-Plus, and SPSS). Surprisingly, they found the modified controlled random search methods to work better. The software packages used Gauss-Newton, Levenberg-Marquardt, or simplex methods.

One stopping criterion for the controlled random search method is a maximum scaled range of the function values of the points in the fixed set. As with any iterative method, of course, a limit on the number of iterations is also a stopping criterion.

One possibility for speeding up the controlled random search method is selection of the point to be reflected as the point out of the $m + 1$ points defining the simplex with the smallest function value. Another possibility is to select the $m + 1$ simplex points from the $n$ possible points with different probabilities, so as to favor the points with smaller function values. While both of these modifications may speed convergence in some cases, they do so at the risk of becoming stuck in a local minimum.

For finding the global optimum of a function with many local optima, we could, in fact, introduce more randomness in the procedure. It is a simple matter to introduce randomness in the decision of whether or not to accept the candidate point at each iteration. As we discuss in Section 8.1, more randomness in the optimization method increases the chances of finding a global optimum.
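A controlled random search can be sketched as follows. Note one deliberate substitution: this sketch uses the replacement rule usually attributed to Price, in which a candidate that improves on the worst point in the set replaces that worst point, rather than the replacement rule as worded above. The function name, the default $n = 10m$, the box `bounds` for the initial random set, the fixed seed, and the fixed iteration count are likewise assumptions for illustration.

```python
import random

def controlled_random_search(f, bounds, n=None, alpha=1.0,
                             n_iter=5000, seed=0):
    rng = random.Random(seed)
    m = len(bounds)
    n = n or 10 * m
    # Initial random set of n points and their function values
    pts = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    vals = [f(p) for p in pts]
    for _ in range(n_iter):
        # Choose m points for the centroid and one more point to reflect
        idx = rng.sample(range(n), m + 1)
        chosen, reflect = idx[:m], idx[m]
        xa = [sum(pts[i][j] for i in chosen) / m for j in range(m)]
        xr = [(1 + alpha) * a - alpha * x for a, x in zip(xa, pts[reflect])]
        fr = f(xr)
        # Assumed rule: the candidate replaces the current worst point
        worst = max(range(n), key=vals.__getitem__)
        if fr < vals[worst]:
            pts[worst], vals[worst] = xr, fr
    best = min(range(n), key=vals.__getitem__)
    return pts[best], vals[best]
```

Because candidates are formed only from points already in the set, the set gradually contracts around regions of small function values without any use of derivatives.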
5.13.3
Ralston-Jennrich Dud Method for Least Squares
A secant method that often works well for least squares was proposed by Ralston and Jennrich (1978a). The method is a modification of the Gauss-Newton algorithm (see Section 5.7) that uses secant hyperplanes instead of the tangent hyperplanes defined by the gradient or approximations of the gradient. This secant method is called “dud” (doesn’t use derivatives). The least squares problem is
$$\min_{\theta} \; s(\theta) = \sum_{i=1}^{n} r_i^2(\theta). \qquad (5.28)$$
This problem usually arises as a natural criterion for fitting a model to data by selecting an optimal value for the parameter vector $\theta$:
$$r_i(\theta) = y_i - f(t_i; \theta),$$
where $y_i$ and the $m$-vector $t_i$ are known (observations) and $\theta$ is a $d$-vector to be determined (see page 102).

The dud method approximates $f(t_i; \theta)$ by a secant hyperplane defined by $d + 1$ points on the surface of $f$. To prepare for the $(k + 1)$th iteration, given $d + 1$ values of $\theta$, of which any $d$ are linearly independent, label them in such a way that $\theta_{d+1}^{(k)}$ yields the smallest value for $s(\theta)$ in equation (5.28). Then, with the $d + 1$ given points
$$\theta_1^{(k)}, \theta_2^{(k)}, \ldots, \theta_{d+1}^{(k)}$$
and the corresponding values of $s(\theta)$,
$$s_1^{(k)}, s_2^{(k)}, \ldots, s_{d+1}^{(k)},$$
we express the hyperplane as
$$h(\alpha) = \sum_{j=1}^{d} \alpha_j f\bigl(t_i; \theta_j^{(k)}\bigr) + \Bigl(1 - \sum_{j=1}^{d} \alpha_j\Bigr) f\bigl(t_i; \theta_{d+1}^{(k)}\bigr), \qquad (5.29)$$
and likewise the variables as
$$\theta = \sum_{j=1}^{d} \alpha_j \theta_j^{(k)} + \Bigl(1 - \sum_{j=1}^{d} \alpha_j\Bigr) \theta_{d+1}^{(k)}. \qquad (5.30)$$
Then, in the $(k + 1)$th iteration, the dud method
1. finds the value $\alpha_*$ that minimizes the distance between $h(\alpha)$ and $y$;
2. finds a new point $\theta_N$ from equation (5.30); and
3. replaces one of the $\theta_j^{(k)}$ with $\theta_N$.

There are several possible ways of proceeding, and various possibilities for certain computational details. The computations can be defined easily as a linear least squares problem by first expressing equation (5.29) as
$$h(\alpha) = f\bigl(t; \theta_{d+1}^{(k)}\bigr) + F^{(k)} \alpha,$$
where $F^{(k)}$ is the $n \times d$ matrix with columns
$$f\bigl(t; \theta_j^{(k)}\bigr) - f\bigl(t; \theta_{d+1}^{(k)}\bigr),$$
where $f(t; \cdot)$ is the vector with elements $f(t_i; \cdot)$. We likewise express equation (5.30) as
$$\theta = \theta_{d+1}^{(k)} + G^{(k)} \alpha,$$
where $G^{(k)}$ is the $d \times d$ matrix with columns
$$\theta_j^{(k)} - \theta_{d+1}^{(k)}.$$
The minimizer of the function
$$\bigl(y - h(\alpha)\bigr)^{T} \bigl(y - h(\alpha)\bigr)$$
120
CHAPTER 5. UNCONSTRAINED DESCENT METHODS
is of the familiar form,
$$\alpha^* = \Bigl( \bigl(F^{(k)}\bigr)^{\mathrm T} F^{(k)} \Bigr)^{-1} \bigl(F^{(k)}\bigr)^{\mathrm T} \Bigl( y - f\bigl(t; \theta_{d+1}^{(k)}\bigr) \Bigr), \tag{5.31}$$
and the new point is
$$\theta_N = \theta_{d+1}^{(k)} + G^{(k)} \alpha^*.$$
As usual in this kind of iteration, we may encounter a nearly singular $F^{(k)}$. In that case, $\alpha^*$ can be chosen as any least squares solution. Again we remark that the expression for $\alpha^*$ above does not imply a computational method; we certainly do not invert the matrix in equation (5.31).
The choice of the point to be replaced by $\theta_N$ requires some care so as to ensure that the points are systematically replaced. Another important consideration is how the initial starting points are to be selected. The simplest, and probably the most common, method is just to evaluate the function $s$ over a grid of values and choose $d + 1$ points from the grid that have generally smaller values, but which also provide some reasonable coverage of the grid.
One of the main problems of course is the existence of local minima. It is often better to restart the algorithm with a different set of $d + 1$ points than to attempt to provide such a good starting set that a single run of the algorithm would be expected to converge to a global minimum. Inspection of the function values within planes of the d-dimensional grid can be helpful in identifying likely points of local minima.
Various computational considerations and suggestions are discussed by Ralston and Jennrich (1978b). Powell (1965) described an efficient method for updating in a nonlinear least squares algorithm that used numerical approximations to derivatives. That method could also be used in dud, but it would result in only minimal improvement in most cases, especially if the number of observations n is large and if it is expensive to evaluate f.
This algorithm lends itself to weighted and iteratively reweighted least squares, as described in Section 5.8. This allows its use in a data fitting problem even when the criterion of fit is the minimum of some other norm of the residual vector.
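One iteration of dud might be sketched in Python as follows. This is only an illustration under stated assumptions: the name `dud_step`, the calling convention `f(t, theta)`, and the rule of discarding the point with the largest residual sum of squares are choices made here, not details taken from Ralston and Jennrich. The least squares solution is obtained with `numpy.linalg.lstsq`, so no matrix is inverted and a nearly singular matrix F is tolerated.

```python
import numpy as np

def dud_step(f, t, y, thetas):
    """One dud iteration (illustrative sketch).

    f(t, theta) returns the n-vector of model values,
    thetas is a (d+1) x d array holding the current points.
    """
    thetas = np.asarray(thetas, dtype=float)
    # Residual sum of squares s(theta) at each current point.
    s = np.array([np.sum((y - f(t, th)) ** 2) for th in thetas])
    # Relabel so that the last point has the smallest s.
    thetas = thetas[np.argsort(-s)]
    base = thetas[-1]
    f_base = f(t, base)
    # Columns of F and G are differences from the best point.
    F = np.column_stack([f(t, th) - f_base for th in thetas[:-1]])
    G = np.column_stack([th - base for th in thetas[:-1]])
    # Least squares solution for alpha, as in equation (5.31).
    alpha, *_ = np.linalg.lstsq(F, y - f_base, rcond=None)
    theta_new = base + G @ alpha
    # Assumed replacement rule: drop the worst point (now first).
    return np.vstack([thetas[1:], theta_new]), theta_new
```

When the model is linear in the parameters, the secant hyperplane coincides with the model, so a single call should reproduce the ordinary least squares estimate; this gives a convenient check of an implementation.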
5.14
Summary of Continuous Descent Methods
Descent methods in dense domains are based on iterations consisting of two choices:
1. the direction in which to step
2. how far to step
A gradient descent method chooses the direction $p$ based on (5.1): $Rp = -\nabla f(x)$. In the major variants of gradient methods, the directions are chosen as follows.
• steepest descent (5.6), page 95,
$$p^{(k)} = -\nabla f\bigl(x^{(k)}\bigr).$$
• Newton’s method (or Newton-Raphson) (5.9), page 96,
$$H_f\bigl(x^{(k)}\bigr)\, p^{(k)} = -\nabla f\bigl(x^{(k)}\bigr).$$
• quasi-Newton methods (5.1) and (5.14), page 99,
$$B^{(k)} p^{(k)} = -\nabla f\bigl(x^{(k)}\bigr).$$
• Gauss-Newton methods for least-squares problems (5.22), page 104,
$$\bigl(J_r(\theta^{(k)})\bigr)^{\mathrm T} J_r(\theta^{(k)})\, p^{(k)} = -\bigl(J_r(\theta^{(k)})\bigr)^{\mathrm T} r(\theta^{(k)}).$$
• Levenberg-Marquardt modifications of Gauss-Newton methods (5.24),
$$\Bigl( \bigl(J_r(\theta^{(k)})\bigr)^{\mathrm T} J_r(\theta^{(k)}) + \lambda^{(k)} \bigl(S^{(k)}\bigr)^{\mathrm T} S^{(k)} \Bigr) p^{(k)} = -\bigl(J_r(\theta^{(k)})\bigr)^{\mathrm T} r(\theta^{(k)}).$$
The idea behind the Levenberg-Marquardt modification of the Gauss-Newton methods can also be applied in Newton’s method and in quasi-Newton methods. In applications, the adjustments are often taken as $\lambda^{(k)} I$; that is, the identity is used in place of the squared scaling matrix $\bigl(S^{(k)}\bigr)^{\mathrm T} S^{(k)}$.
These methods may also use approximations to the gradient direction $\nabla f\bigl(x^{(k)}\bigr)$. Other descent methods, such as SPSA (page 113) and dud (page 118), are based on explicit approximations to the gradient direction. The Nelder-Mead (page 114) and Price (page 117) methods use a simplex in a secant hyperplane to determine the direction of the step.
In each method, after a direction has been determined, the length of the step must be determined. The gradient methods usually employ some type of line search, as described in Section 5.2.
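The direction equations summarized above can be written as a few short Python functions. This is a minimal sketch with hypothetical function names; each function solves the corresponding linear system rather than forming an inverse, and the Levenberg-Marquardt version defaults to the identity in place of the scaling matrix S.

```python
import numpy as np

def steepest_descent_dir(grad):
    # p = -grad f(x)
    return -grad

def newton_dir(hess, grad):
    # Solve H_f(x) p = -grad f(x).
    return np.linalg.solve(hess, -grad)

def gauss_newton_dir(J, r):
    # Least squares solution of J p = -r, which satisfies
    # the normal equations J^T J p = -J^T r.
    p, *_ = np.linalg.lstsq(J, -r, rcond=None)
    return p

def levenberg_marquardt_dir(J, r, lam, S=None):
    # Solve (J^T J + lam S^T S) p = -J^T r; S defaults to the identity.
    S = np.eye(J.shape[1]) if S is None else S
    return np.linalg.solve(J.T @ J + lam * S.T @ S, -J.T @ r)
```

For the quadratic of Exercise 5.1 at the point (5, 1), the gradient is (10, 10) and the Hessian is diag(2, 10), so `newton_dir` returns (−5, −1) and a full Newton step lands on the minimum at (0, 0).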
Exercises
5.1. Consider the function $f(x) = x_1^2 + 5x_2^2$, whose minimum obviously is at (0, 0).
(a) Plot contours of f. (You can do this easily in S-Plus or Matlab, for example.)
(b) In the steepest descent method, determine the first 10 values of $\alpha^{(k)}$, $f\bigl(x^{(k)}\bigr)$, $\nabla f\bigl(x^{(k)}\bigr)$, and $x^{(k)}$, starting with $x^{(0)} = (5, 1)$. For the step length, use the optimal value (equation (5.3), page 93).
(c) Plot contours of the scaled quadratic model (5.12) of f at the point (5, 1).
(d) Repeat Exercise 5.1b using Newton’s method. (How many steps does it take?)
(e) Repeat Exercise 5.1b using SPSA (Algorithm 5.2).
5.2. Now consider a modification of Exercise 5.1. Suppose the function and its derivatives are measured with a random Gaussian error. The function actually observed is $f(x) = x_1^2 + 5x_2^2 + \epsilon$, where $\epsilon$ has a N(0, 0.01) distribution, that is, a normal distribution with a mean of 0 and a standard deviation of 0.1. The minimum of the expected value of the function is at (0, 0). Also, any measurement of the derivative has a random additive term.
(a) In the steepest descent method, determine the first 10 values of $\alpha^{(k)}$, $f\bigl(x^{(k)}\bigr)$, $\nabla f\bigl(x^{(k)}\bigr)$, and $x^{(k)}$, starting with $x^{(0)} = (5, 1)$. For the step length, use the optimal value.
(b) Repeat Exercise 5.2a using Newton’s method.
(c) Repeat Exercise 5.2a using SPSA.
5.3. Show that the rank-one update of equation (5.16), page 100, results in a matrix $B^{(k+1)}$ that satisfies the secant condition (5.14).
5.4. Derive an expression for $\bigl(B^{(k+1)}\bigr)^{-1}$ in terms of $\bigl(B^{(k)}\bigr)^{-1}$ when the rank-one update of equation (5.16), page 100, is used.
5.5. Formulate a quasi-Newton method with the rank-one update (equation (5.16), page 100) to find the minimum of the function in Exercise 5.1. Start with $B^{(0)} = I$ and $x^{(0)} = (5, 1)$.
5.6. Many natural phenomena, such as phosphorescence or radioactive emissions, decay exponentially over time. Suppose the following measurements (with appropriate units) of the variable y were made with a crude instrument at regular 5 second time intervals beginning at $t_0 = 5$:
1.71, 1.07, 0.62, 0.65, 0.17, 0.14, 0.08, 0.09
(a) Use least squares to fit the model $y = \theta_1 \exp(\theta_2 t)$.
i. Formulate the Gauss-Newton method for this problem, and show the first two steps, beginning with $\theta = (1, 1)$.
ii. Now use a program for nonlinear least squares, such as the IMSL Fortran routine rnlin, the C routine nonlinear_regression, or the S-Plus function nls, for example. Plot the data and your fitted model.
(b) Transform the model by taking logs of both sides, and again fit the model with least squares. What is the difference between this fit and the one using the raw model and data?
(c) Now use the same software that you used in Exercise 5.6a to fit the model using least absolute values. Plot the data and your fitted model. What is the difference between the fit using least absolute values and that using least squares? Which observation contributes most to the difference in the fits?
5.7. According to Maxwell-Boltzmann theory, the probability density of the velocity of a gas molecule is proportional to
$$(m/(kT))^{3/2}\, e^{-mv^2/(2kT)}\, v^2,$$
where v is the velocity, T is the absolute temperature, m is the molecular mass, and k is Boltzmann’s constant. Determine the mode of this distribution (the point where it achieves its maximum value – the “most likely” velocity). Your solution is called the most probable speed. (Make sure you choose the correct critical point for the maximum.)
Chapter 6
Unconstrained Combinatorial Optimization; Other Direct Search Methods
== see Fouskakis and Draper (2002) Stochastic optimization: A review, ISI Review, 70, 315–349.
== must use stochastic methods – cannot explore the space because it’s too large.
If the objective function is differentiable and the derivatives are available, methods described in the previous chapter that make use of the gradient and Hessian or simple approximations to the gradient and Hessian are usually the most effective ones. Even if the derivatives are not available or do not exist everywhere for a continuous objective function, the methods that use approximations to gradients are usually best. If the objective function is not differentiable, however, or if it is very rough, some kind of direct search for the optimum may be necessary.
In some cases the objective function is noisy, perhaps with an additive random error term that prevents exact evaluation. In these cases also it may not be effective to use gradient or approximate-gradient methods. The Nelder-Mead simplex method may work in these cases. Stochastic search methods, such as SPSA, which uses a type of gradient approximation, may also be effective. Other stochastic search methods, such as those described in this chapter in the context of a countable domain, may be useful for rough or noisy functions.
Another important type of optimization problem is that in which the decision variables are discrete. The solution may be a configuration of a finite set of points, that is, a graph. In the traveling salesperson problem, for example, we seek a configuration of cities that provides a path with minimal total length
that visits each point in a set. In the vehicle routing problem, a fleet of vehicles stationed at a depot must make deliveries to a set of cities and it is desired to route them so as to minimize the time required to make all the deliveries. In a resource scheduling problem, a set of machines or workers are to be assigned to a set of tasks, so as to minimize the time required to complete all the tasks, or so as to minimize idle time of the resources. These kinds of problems are examples of combinatorial optimization.
Direct search methods move from point to point using only the values of the function; they do not use derivative information, or approximations to derivatives. In some methods new points are chosen randomly, and then the decision to move to a new point is based on the relative values of the function at the old and new points. A tree or other graph of points may help to organize the points to visit in the search. Sometimes, based on points that have already been evaluated, sets of other points can be ruled out. In tree-based search methods, such fathoming or branch-and-bound techniques may greatly enhance the overall efficiency of the search. “Tabu” methods keep lists of points that are not likely to lead to an optimum.
There are several variations of direct searches. Some search methods use heuristics that mimic certain natural systems. The articles in the collection by Aarts and Lenstra (1997) describe several types of search algorithms and discuss various applications to which the methods have been applied.
In all direct search methods the new points are accepted or not based on the objective function values. Some search methods allow iterations that do not monotonically decrease the objective function values. These methods are especially useful when there are local minima. In these iterations, if the new point is better, then it is used for picking a point in the next iteration.
If the new point is not better, there are three possible actions:
• discard the point and find another one to consider
• accept the new point anyway
• declare the search to have converged
6.1
Simulated Annealing
Simulated annealing is a method that simulates the thermodynamic process in which a metal is heated to its melting temperature and then is allowed to cool slowly so that its structure is frozen at the crystal configuration of lowest energy. In this process the atoms go through continuous rearrangements, moving toward a lower energy level as they gradually lose mobility due to the cooling. The rearrangements do not result in a monotonic decrease in energy, however. The density of energy levels at a given temperature ideally is exponential, the so-called Boltzmann distribution, with a mean proportional to the absolute temperature. (The constant of proportionality is called “Boltzmann’s
constant”). This is analogous to a sequence of optimization iterations that occasionally go uphill. If the function has local minima, going uphill occasionally is desirable.
Metropolis et al. (1953) developed a stochastic relaxation technique that simulates the behavior of a system of particles approaching thermal equilibrium. (This is the same paper that described the Metropolis sampling algorithm.) The energy associated with a given configuration of particles is compared to the energy of a different configuration. If the energy of the new configuration is lower than that of the previous one, the new configuration is immediately accepted. If the new configuration has a larger energy, it is accepted with a nonzero probability. This probability is larger for small increases than for large increases in the energy level.
One of the main advantages of simulated annealing is that the process is allowed to move away from a local optimum. Although the technique is heuristically related to the cooling of a metal, as in the application of Metropolis et al. (1953), it can be successfully applied to a broader range of problems. It can be used in any kind of optimization problem, but it is particularly useful in problems that involve configurations of a discrete set, such as a set of particles whose configuration can continuously change, or a set of cities in which the interest is an ordering for shortest distance of traversal.
Kirkpatrick, Gelatt, and Vecchi (1983) discussed various applications, and the method became widely used following the publication of that article. Collins, Eglese, and Golden (1988) provide an annotated bibliography for the development of the method as well as for a variety of problems in which it has found application.
The Basic Algorithm
In simulated annealing, a “temperature” parameter controls the probability of moving uphill; when the temperature is high, the probability of acceptance of any given point is high, and the process corresponds to a pure random walk. When the temperature is low, however, the probability of accepting any given point is low; and in fact, only downhill points are accepted. The behavior at low temperatures corresponds to a gradient search. As the iterations proceed and the points move lower on the surface (it is hoped), the temperature is successively lowered. An “annealing schedule” determines how the temperature is adjusted. In the description of simulated annealing in Algorithm 6.1, recognizing the common applications in combinatorial optimization, we refer to the argument of the objective function as a “state”, rather than as a “point”.
Algorithm 6.1 Simulated Annealing
0. Set k = 1 and initialize state s.
1. Compute T(k).
2. Set i = 0 and j = 0.
3. Generate state r and compute δf = f(r) − f(s).
4. Based on δf, decide whether to move from state s to state r. If δf ≤ 0, accept; otherwise, accept with a probability P(δf, T(k)). If state r is accepted, set i = i + 1.
5. If i is equal to the limit for the number of successes at a given temperature, go to step 1.
6. Set j = j + 1. If j is less than the limit for the number of iterations at a given temperature, go to step 3.
7. If i = 0, deliver s as the optimum; otherwise, if $k < k_{\max}$, set k = k + 1 and go to step 1; otherwise, issue message that ‘algorithm did not converge in $k_{\max}$ iterations’.
For optimization of a continuous function over a region, the state is a point in that region. A new state or point may be selected by choosing a radius r and a point on the d-dimensional sphere of radius r centered at the previous point. For a continuous objective function, the movement in step 3 of Algorithm 6.1 may be a random direction to step in the domain of the objective function. In combinatorial optimization, the selection of a new state in step 3 may be a random rearrangement of a given configuration.
Parameters of the Algorithm: The Probability Function
There are a number of tuning parameters to choose in the simulated annealing algorithm. These include such relatively simple things as the number of repetitions or when to adjust the temperature. The probability of acceptance and the type of temperature adjustments present more complicated choices. One approach is to assume that at a given temperature, T, the states have a known probability density (or set of probabilities, if the set of states is countable), $p_S(s, T)$, and then to define an acceptance probability to move from state $s_k$ to $s_{k+1}$ in terms of the relative change in the probability density from $p_S(s_k, T)$ to $p_S(s_{k+1}, T)$.
In the original application of Metropolis et al., the objective function was the energy of a given configuration, and the probability of an energy change of δf at temperature T is proportional to exp(−δf/T). Even when there is no underlying probability model, the probability in step 4 of Algorithm 6.1 is often taken as
$$P(\delta f, T(k)) = e^{-\delta f / T(k)}, \tag{6.1}$$
although a completely different form could be used. The exponential distribution models energy changes in ensembles of molecules, but otherwise it has no intrinsic relationship to a given optimization problem.
The probability can be tuned in the early stages of the computations so that some reasonable proportion of uphill steps are taken. In empirical studies of optimization of continuous functions, Bohachevsky, Johnson, and Stein (1986) found that early acceptance rates of 50% to 90% of uphill moves worked well. They suggest use of a factor that reduces the probability as the state moves closer to the optimum. In some optimization problems, the value of the function at the optimum, $f^*$, is known, and the problem is only to determine the location of the optimum. In such cases, they use a factor $(f - f^*)^g$ in the exponent. If the value $f^*$ is not known but a reasonable estimate is available, they suggest use of the estimate. The estimate could be updated as the algorithm proceeds.
Parameters of the Algorithm: The Cooling Schedule
There are various ways the temperature can be updated in step 1. The probability of the method converging to the global optimum depends on a slow decrease in the temperature. In practice, the temperature is generally decreased by some proportion of its current value:
$$T(k + 1) = b(k)\, T(k). \tag{6.2}$$
We would like to decrease T as rapidly as possible, yet have a high probability of determining the global optimum. Geman and Geman (1984) showed that under the assumptions that the energy distribution is Gaussian and the acceptance probability is of the form (6.1), the probability of convergence goes to 1 if the temperature decreases as the inverse of the logarithm of the time, that is, if $b(k) = (\log(k))^{-1}$ in equation (6.2). Under the assumption that the energy distribution is Cauchy, a similar argument allows $b(k) = k^{-1}$, and a uniform distribution over bounded regions allows $b(k) = \exp(-c_k k^{1/d})$, where $c_k$ is some constant, and d is the number of dimensions (see Ingber, 1989).
A constant temperature is often used in simulated annealing for optimization of continuous functions. Alrefaei and Andradóttir (1999) also suggested use of a constant temperature for optimization of noisy functions. The additive and multiplicative adjustments, c(k) and b(k), are usually taken as constants, rather than varying with k.
Van Laarhoven and Aarts (1987), Collins, Eglese, and Golden (1988), and Hajek (1988) describe several other methods of updating the temperature.
For functions of many continuous variables, Siarry et al. (1997) suggest using the basic simulated annealing approach on a sequence of lower-dimensional spaces. This approach can reduce the total number of computations, and would be particularly useful when the cost of evaluation of the function is very high.
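The steps of Algorithm 6.1 might be coded in Python along the following lines. This is a sketch under several assumptions rather than a definitive implementation: the acceptance probability is the exponential form (6.1); the temperature follows equation (6.2) with a constant multiplier b; reaching the success limit in step 5 is treated as ending the current temperature stage, so that the temperature is lowered before continuing; and the function name, parameter names, and default values are all hypothetical.

```python
import math
import random

def simulated_annealing(f, s, neighbor, T0=1.0, b=0.9, k_max=200,
                        max_success=20, max_iter=100, rng=None):
    """Return (state, f(state), converged) following Algorithm 6.1."""
    rng = rng or random.Random(0)
    T = T0
    fs = f(s)
    for k in range(1, k_max + 1):            # step 1: temperature T(k)
        i = 0                                # step 2: success counter
        for j in range(max_iter):            # step 6: iteration limit
            r = neighbor(s, rng)             # step 3: candidate state
            fr = f(r)
            delta = fr - fs
            # Step 4: always accept downhill, sometimes accept uphill.
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                s, fs = r, fr
                i += 1
            if i >= max_success:             # step 5: success limit
                break
        if i == 0:                           # step 7: nothing accepted
            return s, fs, True
        T *= b                               # equation (6.2), constant b
    return s, fs, False
```

As in Algorithm 6.1, the state delivered is the current one, not necessarily the best one visited; a practical code would usually also keep track of the best state encountered.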
Other Variations
A method somewhat similar to simulated annealing was developed by Aluffi-Pentini, Parisi, and Zirilli (1988a, 1988b). Their method, which is designed for continuous optimization problems, searches along solution trajectories of stochastic differential equations that govern a diffusion process. The cooling is continuous. Their method also does well in moving away from local optima. The differences between this method and standard simulated annealing seem to depend more on values of tuning parameters than on any fundamental difference between the two methods.
In some cases it may be desirable to exercise more control over the random walk that forms the basis of simulated annealing. For example, we may keep a list of “good” points, perhaps the b best points found so far. After some iterations, we may return to one or more of the good states and begin the walk anew.
Gelfand and Mitter (1989) and Gutjahr and Pflug (1996) studied the performance of simulated annealing for optimization of noisy functions. They derived convergence properties that depend on the manner in which the temperature is decreased. Alrefaei and Andradóttir (1999) suggested simulated annealing algorithms for noisy functions that use a constant temperature. One of their procedures uses the number of times a point is visited to estimate the optimal solution.
Simulated annealing is often used in conjunction with other optimization methods. Brooks and Morgan (1994) suggest using simulated annealing to determine starting points for other optimization methods, and Brooks (1995) provides a program that implements the simulated annealing selection of a number of starting points. Multiple starting points may allow the subsequent optimization method to identify several local optima.
When gradient information is available, even in a limited form, simulated annealing is generally not as efficient as other methods that use that information.
The main advantages of simulated annealing include its simplicity, its ability to move away from local optima, and the wide range of problems to which it can be applied. Corana, Marchesi, Martin, and Ridella (1987) compared a version of simulated annealing with other methods, including Nelder-Mead, and found the simulated annealing method to be more robust but more expensive in terms of number of function evaluations.
Ingber (1989) suggests periodically “re-annealing”, by adjusting the temperature periodically, based on numerical derivatives computed during the previous iterations in the algorithm. When the exponential cooling schedule, $T(k + 1) = \exp(-c_k k^{1/d})\, T(k)$, mentioned above is also used, he calls this “very fast re-annealing” or “adaptive simulated annealing”.
Simulated annealing proceeds as a random walk through the domain of the objective function. There are many opportunities for parallelizing such a process. The most obvious is starting multiple walks on separate processors. Aarts and Korst (1989) discuss various ways of performing simulated annealing
on parallel processors.
Applications
Simulated annealing has been successfully used in a range of optimization problems, including probability density smoothing (Deutsch, 1996), classification (Sutton, 1991), construction of minimum volume ellipsoids (Woodruff and Rocke, 1993), and optimal design (see Section 10.4.1, page 196).
The Canonical Example: The Traveling Salesperson Problem
The traveling salesperson problem can serve as a prototype of the problems in which the simulated annealing method has had good success. In this problem, a state is an ordered list of points (“cities”) and the objective function is the total distance between all the points in the order given (plus the return distance from the last point to the first point). One simple rearrangement of the list is the reversal of a sublist, that is, for example,
(1, 2, 3, 4, 5, 6, 7, 8, 9) → (1, 6, 5, 4, 3, 2, 7, 8, 9).
Another simple rearrangement is the movement of a sublist to some other point in the list, for example, moving the sublist (2, 3, 4, 5, 6) to the position marked by the arrow,
(1, 2, 3, 4, 5, 6, 7, 8,↑ 9) → (1, 7, 8, 2, 3, 4, 5, 6, 9).
(The first of these rearrangements is called a “2-change”, because in the graph defining the salesperson’s circuit, exactly two edges are replaced by two others; the movement of a sublist replaces three edges. The circuit is a Hamilton closed path.)
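The two rearrangements in the example can be sketched as Python functions that could serve as the state generator in step 3 of Algorithm 6.1. The function names and the 0-based index conventions are assumptions made for this illustration; `edges` is a small helper for counting how many edges of the circuit a rearrangement replaces, and `tour_length` assumes cities are given as coordinate pairs.

```python
import math

def tour_length(tour, pts):
    """Total length of the closed circuit; pts maps city -> (x, y)."""
    n = len(tour)
    return sum(math.dist(pts[tour[m]], pts[tour[(m + 1) % n]])
               for m in range(n))

def reverse_sublist(tour, i, j):
    """Reverse the cities in positions i..j (0-based, inclusive)."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def move_sublist(tour, i, j, k):
    """Move positions i..j to just after original position k (k > j)."""
    seg = tour[i:j + 1]
    rest = tour[:i] + tour[j + 1:]
    pos = k - len(seg) + 1       # where position k ends up in rest, plus 1
    return rest[:pos] + seg + rest[pos:]

def edges(tour):
    """Set of undirected edges of the closed circuit."""
    n = len(tour)
    return {frozenset((tour[m], tour[(m + 1) % n])) for m in range(n)}

tour = list(range(1, 10))
print(reverse_sublist(tour, 1, 5))   # the reversal shown in the text
print(move_sublist(tour, 1, 5, 7))   # the sublist movement shown in the text
```

In a simulated annealing code, a candidate state would typically be produced by drawing the indices at random and applying one of these functions, with `tour_length` as the objective function.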
6.2
Evolutionary Algorithms
There are many variations of methods that use evolutionary strategies. These methods are inspired by biological evolution, and often use terminology from biology. Genetic algorithms mimic the behavior of organisms in a competitive environment in which only the fittest and their offspring survive. Decision variables correspond to “genotypes” or “chromosomes”; a point or a state is represented by a string (usually a bit string); and new values of the decision variables are produced from existing points by “crossover” or “mutation”. The set of points at any stage constitutes a “population”. The points that survive from one stage to another are those yielding lower values of the objective function.
The ideas of numerical optimization using processes similar to biological evolution are old (see Mühlenbein, 1997, for some prehistory), but the current algorithms derive from the work of Rechenberg (1973) and Holland (1975 and 1992). Back (1996) provides descriptions and discussions of various evolutionary algorithms.
Genetic Algorithms
In most iterations it is likely that the new population includes a higher proportion of fit organisms (points yielding better values of the objective function) than the previous population, and that the best of the organisms is better than the best in the previous population. The biological analogy of a single iteration of a genetic algorithm is represented in Figure 6.1.
Population → Mating Pool Selected → Mate Pairs Selected → Mating → Offspring → New Population
Figure 6.1: The Biological Analogy of One Iteration of a Genetic Algorithm
Parametrizations
The first consideration in applying a genetic algorithm to an optimization problem is how to represent the problem in terms of the “chromosomes”, “populations”, and “fitness” of an evolutionary process. To use a genetic algorithm in the standard optimization problem
$$\min_{x \in S} f(x),$$
the values of the decision variables are represented in binary notation as substrings of a bit string of length l that corresponds to a representation of a chromosome, and the fitness is a function g that is monotonically related to f. If g is fitness and we are to minimize f, then logically g would increase as f decreases. We usually do not interpret the relationship so literally. For practical purposes, g is often chosen so g(x) = f(x), and improved fitness is
interpreted as a decrease in g. If the decision variables are continuous, they are discretized as necessary to fit in a reasonable bit string. Discrete decision variables may be allocated a number of bits sufficient to represent all of their possible values.
Consider the minimization of the function
$$f(x_1, x_2) = x_1 - 2x_1^2 + 3x_1 x_2 - x_2^2.$$
Obviously, for this problem we would not use a genetic algorithm or any other stochastic method, but we can use it to illustrate a parametrization of the problem. Suppose we represent values of both $x_1$ and $x_2$ in binary notation using strings of length 8 in which the fourth position from the left is the unit position (for example, 3.5 = 00111000). An organism is represented by a pair of such bit strings. We choose the fitness, g, as −f. Some organisms and their fitness are shown in Table 6.1. We see that organisms s2 and s3 are most fit within this population. The corresponding chromosomes would be good candidates for propagation.
Table 6.1: Chromosomes and Fitness for a Genetic Algorithm
Organism    x1          x2          g
s1          00010000    00010000    -1.0
s2          00111000    00010000    11.5
s3          00110000    00000000    15.0
s4          00100000    00110000    -3.0
s5          00100000    00100000    -2.0
s6          01001000    01010000    -6.5
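The parametrization in this example can be checked with a few lines of Python. The sketch below decodes a bit string whose fourth position from the left is the unit position and evaluates the fitness g = −f; the function names are chosen only for this illustration.

```python
def decode(bits, unit_pos=4):
    """Value of a fixed-point bit string; position unit_pos has weight 1."""
    return sum(int(b) * 2.0 ** (unit_pos - 1 - i)
               for i, b in enumerate(bits))

def f(x1, x2):
    # Objective function of the example.
    return x1 - 2 * x1**2 + 3 * x1 * x2 - x2**2

def fitness(chrom_x1, chrom_x2):
    # Fitness g = -f, evaluated on the decoded chromosome pair.
    return -f(decode(chrom_x1), decode(chrom_x2))

print(decode("00111000"))                 # 3.5
print(fitness("00111000", "00010000"))    # 11.5, organism s2
print(fitness("00110000", "00000000"))    # 15.0, organism s3
```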
Evolution Strategies Rechenberg formalized evolution strategies, two of which are called • (µ + λ)-ES • (µ, λ)-ES In a (µ + λ)-ES procedure, µ parents produce λ offspring and the best µ of the parents and offspring survive to the next generation. In a (µ, λ)-ES procedure, µ survivors are selected only from the offspring. The former method is the more commonly used evolution strategy, often with µ = λ = 1. The latter method is more similar to the method of Holland, which is sometimes identified as the genetic algorithm. Terminology varies somewhat, and we will not attempt to sort it out here, but rather proceed to describe the basic genetic algorithm (also called the “canonical genetic algorithm”.)
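The difference between the two selection rules can be seen in a short Python sketch of a single generation. The Gaussian perturbation used to produce offspring, the choice of a random parent for each offspring, and all names and default values are assumptions for illustration; the text above specifies only how the survivors are selected.

```python
import random

def es_generation(parents, f, lam, plus=True, sigma=0.1, rng=random):
    """One generation of a (mu + lambda)-ES or (mu, lambda)-ES.

    parents is a list of mu points (lists of floats); f is minimized.
    """
    mu = len(parents)
    if not plus and lam < mu:
        raise ValueError("(mu, lambda) selection needs lam >= mu")
    offspring = []
    for _ in range(lam):
        p = rng.choice(parents)
        # Assumed mutation: add independent N(0, sigma^2) noise.
        offspring.append([x + rng.gauss(0.0, sigma) for x in p])
    # Plus: survivors come from parents and offspring together.
    # Comma: survivors come from the offspring only.
    pool = parents + offspring if plus else offspring
    return sorted(pool, key=f)[:mu]
```

With `plus=True` the best objective value in the population can never get worse from one generation to the next, whereas with `plus=False` it can, which is one way such a procedure may move away from a local optimum.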
Evolution Method
Algorithm 6.2 provides an outline of a genetic algorithm. There are several decisions that must be made in order to apply the algorithm. The first, as mentioned above, is to decide how to represent the values of decision variables in terms of chromosomes, and to decide how to evaluate the objective function in terms of a chromosome. Then, an initial population must be chosen.
Algorithm 6.2 Genetic Algorithm
0. Determine a representation of the problem, and define an initial population, $x_1, x_2, \ldots, x_n$, for n even.
1. Assign probabilities $p_i$ to each item in the population and choose (with replacement) a probability sample of size n. This is the reproducing population.
2. Randomly pair all items in the reproducing population. Form a new population of size n from the n/2 pairs in the reproducing population, using various mutation and recombination rules.
3. If convergence criteria are met, stop, and deliver the best item in the population, s, as the optimum; otherwise, go to step 1.
Mutation and Recombination Rules
There are several possibilities for producing a new generation of organisms from a given population. Some methods mimic sexual reproduction, that is, the combining of chromosomes from two organisms, and some methods are like asexual reproduction or mutation. A genetic algorithm may involve all of these methods, perhaps chosen randomly with fixed or varying probabilities.
Three simple methods are crossover, for combining two chromosomes, and inversion and mutation, for yielding a new chromosome from a single one. In crossover of two chromosomes each containing l bits, for a randomly selected j from 1 to l, the first j bits are taken from the chromosome of the first organism and the last l − j bits are taken from the chromosome of the second organism. In inversion, for j and k randomly selected from 1 to l, the bits between positions j and k are reversed, while all others remain the same.
In mutation, a small number of bits are selected randomly and are changed, from 0 to 1 or from 1 to 0. The number of bits to change may be chosen randomly, perhaps from a Poisson distribution, truncated at l. These operations are illustrated in Table 6.2. In the example operations shown in Table 6.2, crossover occurs between the third and fourth bits; inversion occurs for the bits between (and including) the third and the sixth; and mutation occurs at the second and fourth bits.
Table 6.2: Reproduction Rules for a Genetic Algorithm
Operation    Generation k                                        Generation k + 1
Crossover    $s_1^{(k)}$ = 11001001, $s_2^{(k)}$ = 00111010   →   $s_1^{(k+1)}$ = 11011010
Inversion    $s_1^{(k)}$ = 11101011                            →   $s_1^{(k+1)}$ = 11010111
Mutation     $s_1^{(k)}$ = 11101011                            →   $s_1^{(k+1)}$ = 10111011
As with simulated annealing, indeed, as with almost any optimization method, for a given problem, genetic algorithms may require a good deal of ad hoc tuning. Grefenstette (1986) has suggested general guidelines for selecting these control parameters. In the case of genetic algorithms, there are various ways of encoding the problem, of adopting an overall strategy, and of combining organisms in a current population to yield the organisms in a subsequent population. The reader is referred to Michalewicz (1996), Jennison and Sheehan (1995), Whitley (1994), or Koza (1994a) for more complete discussion of the methods.
Genetic algorithms can be implemented in parallel rather directly. Some of the issues in parallelizing genetic algorithms are discussed by Mühlenbein (1992).
An interesting application of genetic algorithms is in genetic programming, as developed by Koza (1992, 1994b) and Koza, Bennett, and Andre (1999). Genetic algorithms are used to develop a computer program, given only a description of the problem to be solved. Genetic programming proceeds to structure program elements consisting of simple computations and control structures such as loops and iterations into a complete program. The fitness is the proximity of the output of the program to the desired solution.
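The crossover, inversion, and mutation rules illustrated in Table 6.2, together with steps 1 and 2 of Algorithm 6.2, might be sketched in Python as follows. Chromosomes are represented as strings of 0s and 1s and bit positions are counted from 1, as in the text. The rank-based selection probabilities, the mutation rate `p_mut`, and the function names are assumptions made for this sketch; the algorithm itself leaves these choices open.

```python
import random

def crossover(a, b, j):
    """First j bits from a, remaining bits from b."""
    return a[:j] + b[j:]

def inversion(a, j, k):
    """Reverse the bits in positions j..k (1-based, inclusive)."""
    return a[:j - 1] + a[j - 1:k][::-1] + a[k:]

def mutation(a, positions):
    """Flip the bits at the given 1-based positions."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[c] if i + 1 in positions else c
                   for i, c in enumerate(a))

def next_generation(pop, g, rng=random, p_mut=0.1):
    """Steps 1 and 2 of Algorithm 6.2; larger g means fitter."""
    n, l = len(pop), len(pop[0])
    ranked = sorted(pop, key=g)                    # least fit first
    weights = list(range(1, n + 1))                # assumed rank-based p_i
    parents = rng.choices(ranked, weights=weights, k=n)
    rng.shuffle(parents)                           # random pairing
    new = []
    for a, b in zip(parents[::2], parents[1::2]):
        j = rng.randint(1, l - 1)
        for child in (crossover(a, b, j), crossover(b, a, j)):
            if rng.random() < p_mut:
                child = mutation(child, {rng.randint(1, l)})
            new.append(child)
    return new

print(crossover("11001001", "00111010", 3))   # 11011010
print(inversion("11101011", 3, 6))            # 11010111
print(mutation("11101011", {2, 4}))           # 10111011
```

The three printed strings reproduce the entries in the right-hand column of Table 6.2.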
6.3 Guided Direct Search Methods
Tabu search simulates the human memory process in maintaining a list of recent steps. The list is called a tabu list. The purpose of the list is to prevent the search from backtracking. Before a potential step is selected, the search procedure checks the tabu list to determine if the step is in the recent path to this point. The tabu list can be implemented by penalizing the objective function. Tabu search is systematic and deterministic. The length of the tabu list determines how well the procedure works. A short list may result in some cycling because of the shorter memory. A long list, on the other hand, increases the computational burden. An “aspiration function” allows the tabu status of a potential step to be overridden if the aspiration level is attained. This also could allow for cycling. The aspiration
function can be implemented by rewarding the objective function. The basic ideas of tabu search are summarized by Glover (1986) and Glover and Laguna (1997). Tabu search has been successfully used in a range of optimization problems, including variable selection in models (Drezner, Marcoulides, and Salhi, 1999) and optimal design (see Section 10.4.1, page 196).
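A minimal sketch of the idea follows, under several assumptions of ours: the states are bit tuples, a step flips a single bit, the tabu list holds the indices of the most recently flipped bits, and the aspiration level is the best objective value found so far. The tabu list is kept explicitly here rather than by penalizing the objective function, and the name `tabu_search` and its parameters are hypothetical.

```python
from collections import deque


def tabu_search(f, x0, n_iter=50, tabu_len=3):
    """Minimize f over bit tuples using single-bit flips and a tabu list."""
    x = tuple(x0)
    best_x, best_f = x, f(x)
    tabu = deque(maxlen=tabu_len)          # indices of recently flipped bits
    for _ in range(n_iter):
        candidates = []
        for i in range(len(x)):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            fy = f(y)
            # Aspiration: a tabu step is allowed if it beats the best so far.
            if i not in tabu or fy < best_f:
                candidates.append((fy, i, y))
        if not candidates:
            break
        fy, i, y = min(candidates)         # best admissible step, even if uphill
        x = y
        tabu.append(i)
        if fy < best_f:
            best_x, best_f = y, fy
    return best_x, best_f


# Toy objective: Hamming distance to a fixed target string.
target = (1, 0, 1, 1, 0)
hamming = lambda x: sum(a != b for a, b in zip(x, target))
best_x, best_f = tabu_search(hamming, (0, 0, 0, 0, 0))
print(best_x, best_f)
```

The search always moves to the best admissible neighbor, so after it reaches the target it moves away again; the best point found is retained separately. The `tabu_len` argument controls the trade-off between cycling and computational burden discussed above.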
6.4 Neural Networks+
Neural networks are rules for associating a set of states of a system with each other. In the early developments of neural networks the states corresponded closely to the conditions of biological neural units. More generally, however, the units are abstractions, and the neural network defines how the state of one unit affects, or is affected by, the states of other units. Whether or not the units are abstractions, terminology from neuroscience is used in describing the system. The phrase “artificial neural network” is sometimes used to refer to a network that does not correspond directly to a network of biological neurons.

Processing by Neurons

Neurons can be constructed to process their inputs in various ways. One of the simplest operations is a weighted addition of the inputs. The neuron may have a weight $w_i$ associated with a particular source. If $x_i$ represents an input from that source, the weighted sum of p inputs is
\[ u = \sum_{i=1}^{p} w_i x_i . \]
The sum may then be transformed further. Often, especially if the inputs are nonnegative, the weighted sum is compared with a threshold value, say θ, and it is set to some base value, perhaps 0, if the raw weighted sum is less than θ. Even though the inputs are combined linearly, the overall operation is nonlinear. Figure 6.2 represents a simple model of a neuron that processes inputs by forming a weighted sum and comparing it with a threshold. In an analogy to a biological model, the weights in the linear sum are called “synaptic weights”. Although thresholded linear combinations are the most common type of functions in a neuron, the processing functions can take a variety of forms.

[Figure 6.2: Nonlinear Processing Element. Inputs x_1, ..., x_p are multiplied by the synaptic weights w_1, ..., w_p and summed; the activation function compares the sum u with θ to produce the output y.]

Feed-Forward Networks

The neurons are linked together in a network, in which the outputs of several neurons are combined to form single outputs, and in which the outputs of some neurons are used as inputs to other neurons. In optimization and estimation, the most common type of neural network is the feed-forward network, in which the effects of the individual units are one-way. Thus, in the network there are “input” units and “output” units. The input units do not do any processing, so they are generally referred to merely as “nodes”. The output units process input, in a manner analogous to biological units, so they are often referred to as “neurons”. There are other units that provide one-way connections between the input and output units. Each unit in this middle layer may process the input from one or more input nodes before feeding input to one or more output neurons. If all paths between the units are one-way, the units can be arranged into layers as shown in Figure 6.3. Feedforward networks of this type are also called multi-layer perceptrons.

[Figure 6.3: Feedforward Network with One Hidden Layer. The units are arranged in an input layer, a hidden layer, and an output layer.]

The network shown in Figure 6.3 is partially connected; that is, not every node is connected to every other node. If the connections follow regular patterns, it is often convenient to depict the partially connected network as a multidimensional lattice. Most simple networks are fully connected; each node is connected to every other node.

There are other types of artificial neural networks. A common variation is a recurrent network, in which there are feedback loops; that is, the neurons in the output layer provide input back to the input nodes or to the neurons in the middle layer. The structures of artificial neural networks can be quite complicated. Feedback loops often have delay mechanisms, which can be deterministic or random.

Adaptive Processing Elements

Figure 6.4 represents a simple adaptive linear element, or adaline. The adaline may use a least squares algorithm to compare its output with a desired output. Based on the difference in the output and the desired output, the adaline may make adjustments to its processing functions. The adaline can be built into a multilayer feedforward network, called a “madaline” (Widrow and Lehr, 1990).

Neural nets are useful for noisy function optimization. General references for neural networks are Maren, Harston, and Pap (1990) and Haykin (1994). Other references include Poli and Jones (1994), on a neural net model for prediction, with a comparison to a Newton algorithm; Morris and Wong (1992), on systematic initialization of local search procedures and application to the synthesis of neural networks; White (1992), on nonparametric estimation of conditional quantiles using neural networks; and Chen and Jain (1993), on function approximation. Ripley (1993, 1994, 1996) discusses several applications of neural networks in statistics, including classification (Ripley, 1994) and pattern recognition (Ripley, 1996). See also Warner and Misra (1996), “Understanding neural networks as statistical tools”, The American Statistician 50, 284–293, and Cheng and Titterington (1994), “Neural networks: A review from a statistical perspective” (with discussion), Statistical Science 9, 2–54.

Software for Neural Networks

The Matlab Neural Network Toolbox provides several supervised and unsupervised network paradigms. It also produces portable C code. Other programs are bpsim, MIRRORS/II, and rps.
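The two processing elements described in this section, the thresholded weighted sum and the adaptive linear element, can be sketched as follows. The text says only that the adaline may use a least squares algorithm; the update shown is the usual least-mean-squares (Widrow-Hoff) rule, which we assume here as one such algorithm. The function names, the learning rate `eta`, and the training data are hypothetical.

```python
def neuron(x, w, theta, base=0.0):
    """Weighted sum of the inputs, set to `base` when it is below theta."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    return u if u >= theta else base


def adaline_fit(samples, eta=0.1, epochs=200):
    """Least-mean-squares updates: w <- w + eta * (d - w.x) * x."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, d in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))    # linear output
            e = d - y                                    # error: desired minus output
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]
    return w


print(neuron([1.0, 2.0], [0.5, 0.25], theta=0.8))   # weighted sum 1.0 passes
print(neuron([1.0, 2.0], [0.5, 0.25], theta=1.5))   # below threshold, base value

# Desired responses generated from the weights (2, -1).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
weights = adaline_fit(samples)
print(weights)
```

With consistent data and a small enough `eta`, the fitted weights approach the weights that generated the desired responses; with noisy data the updates would instead hover around the least squares solution.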
[Figure 6.4: Adaptive Linear Element. The weighted inputs x_1, ..., x_p are combined and passed through φ(·) to give the output y, which is compared with the desired response d to form the error e.]
6.5 Other Combinatorial Search Methods
A number of other stochastic combinatorial search methods have been developed. Some of these methods derive from the stochastic approximations in the Robbins-Monro procedure (equation (4.7)). Kushner and Yin (1997) describe these algorithms. As we have mentioned, stochastic algorithms are particularly useful in noisy function optimization. The book by Cook et al. (1997) covers many additional topics and methods in combinatorial optimization. Cook et al. emphasize the kernel algorithms that are utilized in optimization methods.
Exercises

6.1. The now classic application of simulated annealing is the traveling salesperson problem. The objective is to develop an ordered list of a set of cities, so that each city occurs at least once in the list, and that if the cities are visited in the order of the list, the total distance is minimized. The method, as indicated in the text, is to begin with an initial ordered list and to compute its total distance, to make a random change in the list and compute its distance, and then to accept the new list if its distance is less than the previous distance or else to accept the new list with a probability that is decreasing in the difference of the old distance and the new distance. Design and write a simulated annealing program for the traveling salesperson problem.

(a) Use your program to determine the optimal order in which to visit the cities in the mileage chart below. Assume you return to the starting city.

Each row gives the distances from the row city to the cities that follow it in the list. Columns are labeled by abbreviated city names, in the same order as the rows (FrRo is Front Royal).

                   Blac Char Culp Fair FrRo Lync Mana Rich Roan Will
  Alexandria        263  117   70   15   71  178   28  104  233  148
  Blacksburg             151  193  249  203   94  238  220   41  257
  Charlottesville              47  102  124   66   91   71  120  120
  Culpeper                          55   44  108   44   89  164  133
  Fairfax                                57  163   23  106  218  150
  Front Royal                                157   45  133  174  177
  Lynchburg                                       157  110   52  165
  Manassas                                              96  207  140
  Richmond                                                  189   51
  Roanoke                                                        215
(b) Determine an optimal order for beginning at Alexandria and ending at Williamsburg (not returning to Alexandria).

6.2. K-means clustering is a method of clustering observations into a preset number of groups k in such a way as to minimize the total of the within sums-of-squares,
\[ \sum_{g=1}^{k} \sum_{j=1}^{m} \sum_{i=1}^{n_g} \left( x_{ij(g)} - \bar{x}_{j(g)} \right)^2 , \]
where $n_g$ is the number of observations in the $g$th group, $x_{ij(g)}$ is the $i$th observation on the $j$th variable in the $g$th group, and $\bar{x}_{j(g)}$ is the mean of the $j$th variable in the $g$th group. Write a simulated annealing program to perform K-means clustering; that is, to minimize the objective function above. Use your program to form four clusters of the data
\[
\begin{bmatrix} x_1 & x_2 & \cdots & x_{16} \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 & 3 & 3 & 3 & 3 & 4 & 4 & 4 & 4 \\
1 & 2 & 3 & 4 & 1 & 2 & 3 & 4 & 1 & 2 & 3 & 4 & 1 & 2 & 3 & 4
\end{bmatrix} .
\]
See Zeger, Vaisey, and Gersho (1992) for discussion of a simulated annealing algorithm in this context.

6.3. Develop a genetic algorithm to solve the K-means clustering problem in Exercise 6.2. Compare the performance of simulated annealing and the genetic algorithm on this problem.
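As a starting point for Exercises 6.2 and 6.3, the objective function alone can be evaluated as sketched below. This is not a solution to either exercise; it only computes the total within sums-of-squares for a given assignment of the sixteen observations to groups. The function name `within_ss` and the two trial assignments are ours.

```python
def within_ss(points, labels):
    """Total within-group sum of squares for a given assignment to groups."""
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    total = 0.0
    for members in groups.values():
        m = len(members[0])
        means = [sum(p[j] for p in members) / len(members) for j in range(m)]
        total += sum((p[j] - means[j]) ** 2 for p in members for j in range(m))
    return total


# The data of Exercise 6.2: x1 = (1, 1), x2 = (1, 2), ..., x16 = (4, 4).
points = [(a, b) for a in range(1, 5) for b in range(1, 5)]

by_first = [a for a, b in points]                      # group on the first variable
quadrants = [2 * (a > 2) + (b > 2) for a, b in points]  # four 2 x 2 blocks

print(within_ss(points, by_first))    # 20.0
print(within_ss(points, quadrants))   # 8.0
```

A simulated annealing or genetic algorithm for these exercises would propose changes to the label vector and use a function of this kind to score each proposal.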
Chapter 7

Optimization under Constraints

Jamshidian, Mortaza (2004), On algorithms for restricted maximum likelihood estimation,

The general optimization problem for a scalar-valued function in m variables with r constraints is
\[ \min_{x} \; f(x) \quad \text{s.t.} \quad g(x) \le b, \tag{7.1} \]
where x is m-dimensional and g(x) ≤ b is a system of r inequalities. This formulation can include equality constraints by expressing an equality as two inequalities. A point satisfying the constraints is called a feasible point, and the set of all such points is called the feasible region. For a given point $x_j$, a constraint $g_i$ such that $g_i(x_j) = b_i$ is called an active constraint.

Any of the unconstrained optimization methods we have described can be modified to handle constraints by first ensuring that the starting points satisfy the constraints and then explicitly incorporating checks at each iteration to ensure that any new point also satisfies the constraints. If the new point does not satisfy the constraints, then some of the parameters of the algorithm may be adjusted and a new point generated (this is a possible approach in the Nelder-Mead simplex method, for example), or, in random methods such as the Price controlled random search method, the new point is simply discarded and a new point chosen. Although this is a straightforward procedure, it is unlikely to be very efficient computationally.

Unconstrained methods can be used efficiently if a sequence of unconstrained problems that converges to the problem of interest can be defined. Although there may be problems with the objective function in regions that are not feasible, this method can often be very effective.
Another approach to solving constrained problems is to incorporate the constraints into the objective function. One way in which this is done is by use of supplementary variables, as discussed below. Another way is to define transformations of the variables so that the objective increases rapidly near constraint boundaries. See Box (1965) for discussion of this type of approach.
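The simplest of the approaches mentioned in this introduction, checking feasibility at each iteration and discarding any candidate point that violates the constraints, can be sketched as follows. The random local search used here is a stand-in of ours and is not the Price controlled random search method; the test problem, the function names, and the step size are likewise assumptions made for the illustration.

```python
import random


def random_search(f, feasible, x0, n_iter=2000, step=0.5, seed=0):
    """Random local search that discards infeasible candidate points."""
    if not feasible(x0):
        raise ValueError("the starting point must satisfy the constraints")
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    for _ in range(n_iter):
        cand = [xi + rng.uniform(-step, step) for xi in x]
        if not feasible(cand):
            continue                  # infeasible: discard and draw another point
        fc = f(cand)
        if fc < fx:                   # keep only improving feasible points
            x, fx = cand, fc
    return x, fx


# Hypothetical problem: minimize (x1 - 2)^2 + (x2 - 2)^2 subject to x1 + x2 <= 2.
f = lambda x: (x[0] - 2) ** 2 + (x[1] - 2) ** 2
feasible = lambda x: x[0] + x[1] <= 2
x, fx = random_search(f, feasible, [0.0, 0.0])
print(x, fx)    # the constrained minimum is at (1, 1), where f = 2
```

Near the boundary of the feasible region a large share of the candidates are discarded, which illustrates why this straightforward procedure is unlikely to be very efficient.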
7.1 Constrained Optimization in Dense Domains
In a constrained optimization problem over a dense domain, the most important concerns are the shape of the feasible region and the smoothness of the objective function. The problem is much easier if the feasible region is convex, and fortunately most constrained real-world problems have convex feasible regions. The smoothness of the objective function is important, because if it is twice-differentiable, we may be able to use the known properties of derivatives at function optima to find those optima. Some methods of dealing with constraints incorporate the constraints into the objective function. For such a method the shape of the feasible region is important because the derivatives of the combined objective function depend on the functions defining the constraints.

Equality Constraints

We will first consider some simple problems. Equality constraints are generally much easier to handle than inequalities, and we generally write the constraints explicitly as equalities, rather than as a pair of inequalities in the form of problem (7.1):
\[ \min_{x} \; f(x) \quad \text{s.t.} \quad g(x) = b. \tag{7.2} \]
An optimization problem with equality constraints can often be transformed into an equivalent unconstrained optimization problem. For any feasible point, all equality constraints are active constraints. An important class of equality constraints is that of linear constraints, Ax = b, where A is an r × m (with r ≤ m) matrix of rank s. With g(x) = Ax, we have
\[ \min_{x} \; f(x) \quad \text{s.t.} \quad Ax = b. \]
If the linear system is consistent (that is, rank([A|b]) = s), the feasible set is nonnull. The rank of A must be less than m, or else the constraints completely determine the solution to the problem. If the rank of A is less than r, however, some rows of A and some elements of b could be combined into a smaller number of constraints. We will therefore assume A is of full row rank; that is, rank(A) = r.
If $x_c$ is any feasible point, that is, $Ax_c = b$, then any other feasible point can be represented as $x_c + p$, where p is any vector in the null space of A, $\mathcal{N}(A)$. The dimension of $\mathcal{N}(A)$ is m − r, and its order is m. If B is an m × (m − r) matrix whose columns form a basis for $\mathcal{N}(A)$, all feasible points can be generated by $x_c + Bz$, where $z \in \mathrm{I\!R}^{m-r}$. Hence, we need only consider the restricted variables
\[ x = x_c + Bz, \]
and the function
\[ h(z) = f(x_c + Bz). \]
The argument of this function is a vector with only m − r elements, instead of m elements, as in the original function f. The unconstrained minimum of h, however, is the solution of the original constrained problem.

Now, if we assume differentiability, the gradient and Hessian of the reduced function can be expressed in terms of the original function:
\[ \nabla h(z) = B^T \nabla f(x_c + Bz) = B^T \nabla f(x), \]
and
\[ H_h(z) = B^T H_f(x_c + Bz) B = B^T H_f(x) B. \]
The relationships of the properties of stationary points to the derivatives, as described in Chapter 1, give the conditions that determine a minimum of this reduced objective function; that is, $x_*$ is a minimum if and only if
• $B^T \nabla f(x_*) = 0$,
• $B^T H_f(x_*) B$ is positive definite, and
• $A x_* = b$.
These relationships then provide the basis for the solution of the optimization problem. This simple constrained optimization problem could be solved using the same methods as discussed in Chapter 5.

Because the columns of the m × m matrix $[B | A^T]$ span $\mathrm{I\!R}^{m}$, we can represent the vector $\nabla f(x_*)$ as a linear combination of the columns of B and $A^T$, that is,
\[ \nabla f(x_*) = B z_* + A^T \lambda_* , \]
where $z_*$ is an (m − r)-vector and $\lambda_*$ is an r-vector. Because $\nabla h(z_*) = 0$, $B z_*$ must also vanish, and we have
\[ \nabla f(x_*) = A^T \lambda_* = J_g(x_*)^T \lambda_* . \tag{7.3} \]
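The reduction to the null space of A can be traced numerically on a small problem. The sketch below is an illustration under assumptions of ours: the objective $f(x) = x_1^2 + x_2^2 + x_3^2$, the single constraint $x_1 + x_2 + x_3 = 1$, the hand-chosen basis B, and the helper `solve2` are all hypothetical, and because f is quadratic a single Newton step on h is exact.

```python
def solve2(M, v):
    """Solve a 2 x 2 linear system M z = v by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(v[0] * M[1][1] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - v[0] * M[1][0]) / det]


A = [1.0, 1.0, 1.0]                           # one constraint row: r = 1, m = 3
B = [[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]    # columns are a basis for N(A)
xc = [1.0, 0.0, 0.0]                          # a feasible point: A xc = 1

grad_f = lambda x: [2.0 * xi for xi in x]     # gradient of f; the Hessian is 2 I

# Reduced gradient B^T grad f(xc) and reduced Hessian B^T (2 I) B.
g = grad_f(xc)
red_grad = [sum(B[i][k] * g[i] for i in range(3)) for k in range(2)]
red_hess = [[2.0 * sum(B[i][k] * B[i][l] for i in range(3)) for l in range(2)]
            for k in range(2)]

# One Newton step on h(z) = f(xc + B z), starting from z = 0.
z = solve2(red_hess, [-c for c in red_grad])
x_star = [xc[i] + sum(B[i][k] * z[k] for k in range(2)) for i in range(3)]

# Multiplier from grad f(x*) = A^T lambda (least squares for a single row).
gs = grad_f(x_star)
lam = sum(a * gi for a, gi in zip(A, gs)) / sum(a * a for a in A)
print(x_star, lam)    # approximately (1/3, 1/3, 1/3) and 2/3
```

At the computed point the reduced gradient $B^T \nabla f(x_*)$ vanishes and $\nabla f(x_*)$ is a multiple of the constraint row, as in (7.3).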
Thus, at the optimum, the gradient of the objective function is a linear combination of the columns of the transpose of the Jacobian of the constraints. The elements of the linear combination vector $\lambda_*$ are called Lagrange multipliers. The condition expressed in (7.3) implies that the objective function cannot be reduced any further without violating the constraints.

We can also see this in another simple example with equality constraints. In this example the objective function is linear, and the single equality constraint is quadratic:
\[ \min_{x} \; f(x) = 2x_1 + x_2 \quad \text{s.t.} \quad g(x) = x_1^2 - x_2 = 1. \]
The optimum is $x_* = (-1, 0)$. The gradient of f(x) is $\nabla f(x) = (2, 1)$, that of g(x) is $\nabla g(x) = (2x_1, -1)$, and $\nabla g(x_*) = (-2, -1)$. As we see in Figure 7.1, at the optimum,
\[ \nabla f(x_*) = -\nabla g(x_*) = -J_g(x_*)^T . \]

[Figure 7.1: Linear Objective and Quadratic Equality Constraint. Axes x_1 and x_2; the plot shows f(x), the constraint curve g(x), the optimum x_*, and the gradient ∇f(x_*).]
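As a check on this example, and assuming that the constraint is used to eliminate $x_2$, the optimum and the multiplier in the form of (7.3) can be worked out directly:

\[
x_2 = x_1^2 - 1
\;\Longrightarrow\;
f = 2x_1 + x_1^2 - 1, \qquad
\frac{df}{dx_1} = 2 + 2x_1 = 0
\;\Longrightarrow\;
x_* = (-1, 0),
\]
\[
\nabla f(x_*) = (2, 1) = \lambda_* \,(-2, -1) = \lambda_* \nabla g(x_*)
\;\Longrightarrow\;
\lambda_* = -1 .
\]

The second derivative of the reduced function is 2, which is positive, so the stationary point is a minimum along the constraint curve, with $f(x_*) = -2$.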
The Lagrangian Function

The relationship between the gradient of the objective function and the Jacobian of the constraint function motivates the definition of the Lagrangian function:
\[ L(x, \lambda) = f(x) + \lambda^T (g(x) - b), \tag{7.4} \]
where λ is an r-vector, the elements of which are called Lagrange multipliers. The derivatives of the Lagrangian function can be analyzed in a manner similar to the analysis of the derivatives of the objective function in Chapter 1 to determine necessary and sufficient conditions for a minimum subject to equality constraints.

General Constrained Optimization over Dense Domains

Inequality constraints present significant challenges in optimization problems. The extent of the difficulty depends on the type of the constraint. The simplest constraints are “box constraints”, or simple bounds on the variables. Next are linear constraints of the form l ≤ Ax ≤ u. Finally, general nonlinear constraints are the most complicated. As in other cases of optimization over dense domains, we will usually assume that the objective function is twice differentiable in all variables. We will only indicate some of the general approaches, and refer the interested reader to other literature such as Nash and Sofer (1996) or Nocedal and Wright (1999) for more extensive discussions.

When there are both equality and inequality constraints, it is more convenient for the discussion to write the equality constraints explicitly as equalities, rather than as a pair of inequalities in the form of problem (7.1):
\[
\begin{array}{ll}
\min_{x} & f(x) \\
\text{s.t.} & g_1(x) = b_1, \\
 & g_2(x) \le b_2.
\end{array}
\tag{7.5}
\]
For any feasible point all equality constraints are active, while any of the inequality constraints $g_2(x) \le b_2$ may or may not be active. The following well-known theorem is proved in Nocedal and Wright (1999).

Let L(x, λ) be the Lagrangian and let $x_*$ be a solution to problem (7.5). If the gradients of the active constraints at $x_*$, $\nabla g_2^{(a)}(x_*)$, are linearly independent, then there exists $\lambda_*$ such that
\[ \nabla_x L(x_*, \lambda_*) = 0, \]
and for all active constraints, $g_2^{(a)}$ with corresponding $\lambda^{(a)}$,
\[ \lambda_*^{(a)} \le 0 \quad \text{and} \quad \lambda_*^{(a)} g_2^{(a)}(x_*) = 0. \]

These necessary conditions are called the Karush-Kuhn-Tucker conditions, or just Kuhn-Tucker conditions. The Karush-Kuhn-Tucker conditions allow identification of potential solutions. These conditions, together with sufficient conditions involving second derivatives of L(x, λ), form the basis for a variety of
algorithms for constrained optimization of differentiable functions. The reader is referred to Nocedal and Wright (1999) for details.

As mentioned previously, another approach to solving constrained problems is to formulate a sequence of simpler problems that converges to the problem of interest. Fiacco and McCormick (1968) described a method of formulating a sequence of unconstrained problems that converges to the given constrained problem, called the sequential unconstrained minimization technique (SUMT). See Nash (1998) for further discussions of the method. A possible problem arises in this approach if the behavior of the objective function is different outside the feasible region from its behavior when the constraints are satisfied.

Quadratic Objective Function with Linear Inequality Constraints

A common form of the general constrained optimization problem (7.1) has a quadratic objective function and linear inequality constraints:
\[ \min_{x} \; c^T x + x^T H x \quad \text{s.t.} \quad Ax \le b. \tag{7.6} \]
This is called a quadratic programming problem. If H is positive semidefinite, the problem is particularly simple, and there are efficient methods for solving a quadratic programming problem that make use of the fact that if $x_*$ is a solution, then there exists $\lambda_*$ such that $\nabla_x L(x_*, \lambda_*) = 0$; that is, with the Lagrangian as defined in (7.4) and H symmetric,
\[ 2Hx_* + A^T \lambda_* = -c. \tag{7.7} \]
Goldfarb and Idnani (1983) described an algorithm that uses this approach, and some software packages that solve quadratic programming problems require the user to formulate the problem in that form (see, for example, Schrage, 1997). Quadratic programming has been used extensively in portfolio analysis following the work of Markowitz (1952). The optimization problem is defined in terms of the recent rates of growth and the covariances of those rates of growth for a set of assets under consideration for inclusion in the portfolio. See Exercise 7.2. A number of algorithms based on sequential quadratic programming problems are used for more general constrained optimization problems. As in the unconstrained sequences, the violations of the constraints are built into the objective functions of later stages. Schittkowski (1985) gave a program NLPQL that implements sequential quadratic programming. Fan, Sarkar, and Lasdon (1988) developed a sequential algorithm called successive quadratic programming that is somewhat more robust. As mentioned above, a disadvantage of a formulation of a sequence of approximate problems is that the problems generally do not maintain feasibility of the solution to the original problem. In some cases the objective function may not even be defined outside of the feasible region. Panier and Tits (1993) described a sequential approach to quadratic programming problems whose
solutions are feasible. The method is called feasible sequential quadratic programming. They gave a program called FSQP that implements the method.

==
EM algorithm, linear constraints: Kim and Taylor (1995) JASA 708
EM algorithm, nonlinear constraints, Lagrange method: Razzaghi and Kodell (2001) Commun. Stat.
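The idea of a sequence of unconstrained problems that converges to the constrained problem, discussed earlier in this section, can be illustrated with a quadratic exterior penalty. This is one common choice and is an assumption of ours; it is not necessarily the form used in SUMT by Fiacco and McCormick. The one-dimensional problem, the golden section minimizer, and the sequence of penalty weights are all hypothetical.

```python
def golden_section(phi, lo, hi, tol=1e-10):
    """Minimize a unimodal function phi on the interval [lo, hi]."""
    r = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - r * (b - a), a + r * (b - a)
        if phi(c) < phi(d):
            b = d          # the minimum lies in [a, d]
        else:
            a = c          # the minimum lies in [c, b]
    return (a + b) / 2


# Hypothetical problem: minimize f(x) = (x - 2)^2 subject to x <= 1.
f = lambda x: (x - 2) ** 2
violation = lambda x: max(0.0, x - 1.0)

for k in range(7):
    mu = 10.0 ** k
    # Unconstrained problem k: the objective plus a weighted squared violation.
    penalized = lambda x, mu=mu: f(x) + mu * violation(x) ** 2
    x = golden_section(penalized, -10.0, 10.0)
    print(mu, x)     # x = (2 + mu) / (1 + mu), approaching 1 from outside
```

Each unconstrained minimizer lies slightly outside the feasible region, which illustrates the remark above that the approximating problems generally do not maintain feasibility; this matters when the objective function is not defined there.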
7.2 Constrained Combinatorial Optimization
Constraints in combinatorial optimization problems are usually handled by restricting the mechanism that generates new points to generate only feasible points.

The Simplex Method in Linear Programming

The basic linear program, which is often written as
\[
\begin{array}{ll}
\min_{x} & z = c^T x \\
\text{s.t.} & x \ge 0, \\
 & Ax \le b,
\end{array}
\tag{7.8}
\]
is a problem over a dense domain. A solution to the problem, however, occurs at a vertex of the polytope formed by the constraints. Because the set of vertices is finite, the solution can be determined by inspecting a finite number of possibilities. It is in this sense that the linear programming problem is similar to other combinatorial optimization problems.
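The observation that a solution can be found by inspecting a finite number of vertices can be made concrete for two variables. The sketch below is not the simplex method; it simply enumerates every intersection of two constraint boundaries, keeps the feasible ones, and compares objective values. The function name `lp_by_vertices` and the example data are ours, and an unbounded or infeasible problem is not handled.

```python
from itertools import combinations


def lp_by_vertices(c, A, b):
    """Minimize c.x subject to A x <= b and x >= 0 (two variables) by
    inspecting every vertex of the feasible polygon."""
    # Every constraint as a row (a1, a2, rhs) meaning a1*x1 + a2*x2 <= rhs.
    rows = [(a[0], a[1], bi) for a, bi in zip(A, b)]
    rows += [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]       # x1 >= 0 and x2 >= 0
    best = None
    for (p1, p2, pr), (q1, q2, qr) in combinations(rows, 2):
        det = p1 * q2 - p2 * q1
        if abs(det) < 1e-12:
            continue                                   # parallel boundaries
        # Intersection of the two boundary lines, by Cramer's rule.
        x = ((pr * q2 - p2 * qr) / det, (p1 * qr - pr * q1) / det)
        if all(a1 * x[0] + a2 * x[1] <= r + 1e-9 for a1, a2, r in rows):
            z = c[0] * x[0] + c[1] * x[1]
            if best is None or z < best[0]:
                best = (z, x)
    return best


# Hypothetical example: min -x1 - 2 x2  s.t.  x1 + x2 <= 4, x1 <= 3, x2 <= 2.
z_star, x_star = lp_by_vertices([-1.0, -2.0],
                                [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
                                [4.0, 3.0, 2.0])
print(z_star, x_star)
```

The number of candidate intersections grows combinatorially with the number of constraints and variables, which is why a method that steps through the vertices selectively is needed in practice.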
The linear programming problem is generally most easily solved by a simplex method, which steps through the vertices efficiently. We will not describe the simplex method here, but rather refer the reader to texts on linear programming, for example, Dantzig (1963), Murtagh (1981), Chvátal (1983), or Nash and Sofer (1996). The points labeled “$s^{(k)}$”, “$s^{(k+1)}$”, and so on in Figure 7.3 may represent the progress of a simplex algorithm along the extreme points of the feasible region toward the solution $x_*$. More efficient methods for very large-scale linear programs are based on interior-point methods such as those developed by Karmarkar (1984) (see Gonzaga, 1992, or Nash and Sofer, 1996, for a description). An interior-point method may proceed along points such as those labeled “$i^{(k)}$”, “$i^{(k+1)}$”, and so on in Figure 7.3 until the algorithm appears to slow, and then move to a vertex at “$i^{(k+4)}$” and switch over to a simplex algorithm for the final iterations toward the solution $x_*$. The interior-point method uses a barrier function to proceed through the dense interior of the feasible region. This approach treats the problem as one in combinatorial optimization only in the latter stages.
[Figure 7.2: A Linear Programming Problem. The Parallel Lines Are in the Direction of the Coefficient Vector c. Axes x_1 and x_2; the plot shows the feasible region Ax ≤ b, the lines z = c^T x, and the solution x_* with z_* = c^T x_*.]

Linear programming is a good example of how a specialized algorithm can perform very differently for some variation of the underlying optimization problem.
Special formulations of the simplex method make very significant differences in the speed of the solution. The problem of fitting a linear regression under the criterion of least absolute values is a linear programming problem, but its solution is much more efficient when the simplex method is accelerated by taking into account its special structure, as done by Barrodale and Roberts (1974). Arthanari and Dodge (1981) discuss this and other optimization problems in statistics that can be formulated in such a way that their special structure leads to more efficient mathematical programming problems.

An important variation of linear programming is integer programming, in which the decision variables are restricted to be integers. In mixed integer programming some variables are restricted to be integers and others are not.

Network Optimization

Variations of the basic linear programming problem also include the transportation problem, the traveling salesman problem, and other network problems. The methods for linear programming can be applied directly. Sometimes, however, it is more efficient to use one of the methods discussed for combinatorial optimization in Chapter 6.

[Figure 7.3: Simplex Method and Interior Point Method for Linear Programming Problem. Axes x_1 and x_2; the simplex iterates s^(k), ..., s^(k+4) move along vertices of the feasible region Ax ≤ b, and the interior-point iterates i^(k), ..., i^(k+4) move through its interior, toward the solution x_*.]
Exercises

7.1. Consider the quadratic programming problem,
\[
\begin{array}{ll}
\min_{x,\,y,\,z} & 3x^2 + 2y^2 + z^2 + 2xy - xz - 0.8yz \\
\text{s.t.} & x + y + z = 1 \\
 & 1.3x + 1.2y + 1.08z \ge 1.12 \\
 & x \le 0.75 \\
 & y \le 0.75 \\
 & z \le 0.75
\end{array}
\]
(This is the form of a simple portfolio optimization problem. Because x, y, and z are not restricted to be nonnegative, short-selling is allowed.) Put this problem in the form of equation (7.6), and identify all of the variables in the new formulation of the problem. (This would be the first step in solving a quadratic programming problem using some software packages.) 7.2. In the allocation of financial assets it is generally desirable to maximize expected returns (growth) and to minimize risk. A commonly accepted definition of risk is
the variability in the rates of return. The expected return is generally measured by the sample mean of past rates of return. The variability is generally measured by a sample variance of past rates of return. Suppose we have n possible assets to choose from. If the individual assets have average growth rates $g_1, g_2, \ldots, g_n$, and we put a proportion $p_i$ into the $i$th asset, the overall growth rate of the portfolio of these assets is
\[ d(p) = \sum_{i=1}^{n} p_i g_i . \]
The variance of the portfolio depends on the variances of the individual assets, and on their covariances (expressing how they tend to move in the same or opposite directions). Assume the individual assets have standard deviations of $s_1, s_2, \ldots, s_n$, and the n × n matrix R contains the correlations of the individual assets, measured over some time in the past that is considered to be replicated in the present and near-future. Adopt this notation: g is the vector of the $g_i$'s; p is the vector of the $p_i$'s; s is the vector of the $s_i$'s; and S = diag(s) is the n × n diagonal matrix containing the $s_i$'s along the diagonal. The overall variance then is
\[ v(p) = p^T S R S p. \]
For a set of possible investments, the “optimal” allocation is usually defined as the p vector that minimizes v for a given value of d. Because the $p_i$'s represent proportions, we assume $\sum_i p_i = 1$. For 8 assets, a sample of rates of return yielded

g = (0.39, 0.88, 0.53, 0.88, 0.79, 0.71, 0.25, 0.27)

s = (5.50, 7.03, 6.22, 7.04, 6.01, 4.30, 2.01, 1.56)

and
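\[
R =
\begin{bmatrix}
1.00 \\
0.41 & 1.00 \\
0.30 & 0.62 & 1.00 \\
0.25 & 0.42 & 0.35 & 1.00 \\
0.58 & 0.54 & 0.48 & 0.40 & 1.00 \\
0.71 & 0.44 & 0.34 & 0.22 & 0.56 & 1.00 \\
0.26 & 0.22 & 0.27 & 0.14 & 0.25 & 0.36 & 1.00 \\
0.33 & 0.26 & 0.28 & 0.16 & 0.29 & 0.42 & 0.92 & 1.00
\end{bmatrix}
\]
(R is symmetric; only one triangle is shown.)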
(The data correspond to indexes of publicly traded equities in six countries, plus United States 30-Day Treasury Bills and euros, all measured in terms of United States dollars, over a 216-month period between 1978 and 1995. The data are given in Richard O. Michaud, 1998, Efficient Asset Management, Harvard Business School Press, Boston.)

(a) Under the definition of optimality and the given data, determine the optimal portfolio that has d = 0.7. You may wish to use the IMSL routine qprog or quadratic_prog.
(b) Now consider the quadratic objective function, v(p) − λd(p), where λ is a constant. For various choices of λ, determine the optimal portfolio, $p_*^{(\lambda)}$, and the corresponding values of $v_*^{(\lambda)}$ and $d_*^{(\lambda)}$. Plot a parametric curve of $d_*^{(\lambda)}$ versus $v_*^{(\lambda)}$. This curve is called the “efficient frontier”.

(c) The most obvious problem in the use of this procedure to select an “optimal” portfolio is the relevance of a portfolio that was optimal over some past period of time to future performance. Given some assumption of stationarity, however, there remain several issues about the quantities used in the problem formulation. The most basic of these issues concerns measurement methods. Again, given some assumptions of the appropriate ways to measure things such as past growth rate, however, additional problems arise because of the inherent randomness. From our most basic assumptions in the formulation of the problem, we must recognize that g, s, and R are realizations of random variables. Issues relating to the use of these realizations of random variables as parameters in an optimization problem have often been discussed in the financial literature, for example, Jobson and Korkie (1981) and Chopra and Ziemba (1993). Several possible ways of using simulation to deal with this problem come to mind. Resampling of the original data to generate new realizations of g, s, and R would probably be effective. A simpler way would be to assume that the realized value of g is in fact the mean of a random variable with variance-covariance matrix SRS, and merely to simulate realizations of g from a multivariate normal distribution with this mean and covariance (see, for example, Gentle, 2003, pages 197 and 198). Use this approach and generate 1,000 realizations of g. For each, repeat Exercise 7.2b. This yields 1,000 efficient frontier curves. Now determine the curve formed by the means $\bar{d}_*^{(\lambda)}$ and $\bar{v}_*^{(\lambda)}$, where the means are taken over the 1,000 values on the efficient frontiers.
Compare this curve with the efficient frontier curve of Exercise 7.2b. Jobson and Korkie (1981) carried out a similar Monte Carlo study.
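As a starting point for Exercise 7.2, the data and the two functions d(p) and v(p) can be set up as sketched below. This is not a solution to the exercise; it only evaluates the growth rate and the variance for a given allocation. The variable and function names are ours, and the equal-weight allocation is just a trial value.

```python
g = [0.39, 0.88, 0.53, 0.88, 0.79, 0.71, 0.25, 0.27]
s = [5.50, 7.03, 6.22, 7.04, 6.01, 4.30, 2.01, 1.56]
lower = [                                   # lower triangle of R, row by row
    [1.00],
    [0.41, 1.00],
    [0.30, 0.62, 1.00],
    [0.25, 0.42, 0.35, 1.00],
    [0.58, 0.54, 0.48, 0.40, 1.00],
    [0.71, 0.44, 0.34, 0.22, 0.56, 1.00],
    [0.26, 0.22, 0.27, 0.14, 0.25, 0.36, 1.00],
    [0.33, 0.26, 0.28, 0.16, 0.29, 0.42, 0.92, 1.00],
]
n = len(g)
# Fill in the full symmetric correlation matrix.
R = [[lower[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]


def growth(p):
    """d(p): the sum of p_i g_i."""
    return sum(pi * gi for pi, gi in zip(p, g))


def variance(p):
    """v(p) = p^T S R S p with S = diag(s)."""
    return sum(p[i] * s[i] * R[i][j] * s[j] * p[j]
               for i in range(n) for j in range(n))


equal = [1.0 / n] * n
print(growth(equal), variance(equal))
```

Putting everything into a single asset recovers that asset's own growth rate and variance, which is a convenient check on the setup before a quadratic programming routine is applied.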
Chapter 8

Multiple Extrema and Multiple Objectives

In practical applications, an optimization problem can rarely be stated in absolute terms. A local optimum may be preferable to a global optimum, because of issues that may not even be apparent until the optima are identified. Likewise, in many applications, the constraints are not necessarily essential. After inspecting alternative near optimal solutions and solutions to an unconstrained problem that almost satisfy the constraints, the penalty for violating the constraints may not be as important as the gain in the optimal solution. Also, in most practical applications, the objective is not just some simple function; there are multiple objectives.
8.1
Multiple Extrema and Global Optimization
It is possible that the function has more than one extreme point, or local optimum. As in the case of solving a system of nonlinear equations that we discussed earlier, a common way of addressing this problem is to use different starting points in the iterative solution process. Plots of the points evaluated in the iterations may also be useful. In general, there is no way of finding all of the extreme points or the global optimum with any assurance. In fact, for any given deterministic method for finding a global optimum, an objective function can be constructed with a global optimum (probably a spike) that the method will not find. The way to find a global optimum is to cause the optimization method to take different paths toward a point at which it will converge. In the absence of specific knowledge of the shape of the objective function, randomly diverting the course of the iterations is likely to be the best way of searching for the global optimum. There are three places in the algorithms at which randomization can be introduced:
• random selection of starting points
• random selection of subsequent points to be considered
• random acceptance of a point under consideration
For the deterministic descent methods discussed in Chapter 5, random selection of starting points is the obvious method to choose. The stochastic methods discussed in Section 5.12 and the controlled random search method of Section 5.13 use both random starting points and random steps. Most of the methods discussed in Chapter 6 use randomization in all three ways. It is also easy to add random acceptance to some of the other methods, for example, Nelder-Mead and controlled random search.
Chin (1993) discusses an implementation of the simultaneous perturbation stochastic approximation method (Algorithm 5.2, page 113) for global optimization. Masri and Bekey (1980) describe a global optimization algorithm that uses random searches adapted to the recent movements of the algorithm, so as to improve the likelihood of searching in a good direction. Rabinowitz (1995) describes a stochastic algorithm for global optimization with constraints.
The difficulty of the problem obviously depends on the number and distribution of local optima. If the number of extrema in a given interval is known, and if the function is twice continuously differentiable in the interval, a “guided” bisection algorithm of Kavvadias and Vrahatis (1996) can be used to find all of them with certainty. We refer the interested reader to their paper for the details. The Journal of Global Optimization is devoted to research in this area.
For the case of the statistical application of maximum likelihood, Gan and Jiang (1999) describe a statistical test that a maximum of the likelihood function is the global optimum. Their test is based on the fact that, under suitable regularity conditions, the log-likelihood l with parameter θ satisfies the equation
$$ \mathrm{E}\left(\frac{\partial^2 l}{\partial\theta^2}\right) + \mathrm{E}\left(\left(\frac{\partial l}{\partial\theta}\right)^2\right) = 0. $$
8.2
Optimization with Multiple Criteria
The objective function in a given application may actually be quite complicated. For example, in a statistical procedure based on least squares, the effect of a single outlier on the solution to the minimization problem may be unacceptable. The more appropriate objective function may be least squares for residuals that are small or moderate, and least squares of scaled residuals for the larger residuals. Many specific objective functions have been proposed to allow for differential weighting of the residuals or to use a different function of the residuals, rather than the square function. For robust statistics, formulation of an appropriate objective function is usually the primary issue. In statistical procedures that attempt to achieve a minimum mean squared error (MSE), there are two things that are minimized: the square of the bias and
the variance. The objective function is just the sum of these two quantities, so it is a simple and natural generalization of the objective function in minimum variance unbiased estimation when the feasible space is extended beyond unbiased estimators.
A simple generalization of a single objective function is a set of objectives. Optimization procedures that explicitly recognize the existence of multiple objectives can then be developed. Whereas a standard optimization problem usually has an objective of the form min{f(x) = z}, the general multicriteria problem can be formulated as
$$
\begin{array}{l}
\min\{f_1(x) = z_1\} \\
\min\{f_2(x) = z_2\} \\
\qquad\vdots \\
\min\{f_n(x) = z_n\} \\
\text{s.t. } x \in S.
\end{array}
\qquad (8.1)
$$
The vector of $z_i$’s in (8.1) is called the criterion vector. A criterion vector is nondominated if there does not exist another feasible criterion vector all of whose elements are less than the corresponding elements of the given vector. (The terms dominate and dominated are then defined by the common language semantics.) In most nontrivial multicriteria problems there exists a set of nondominated criterion vectors. Although no solution is “best”, for any solution that does not result in a nondominated criterion vector, there is a “better” solution. Techniques for multiple criteria optimization generally prescribe some systematic exploration of the set of nondominated criterion vectors.
Within the set of parameter vectors, the concept of dominance leads to that of efficiency. A point $x_* \in S$ is efficient if and only if there does not exist another feasible point yielding a criterion vector that dominates the criterion vector associated with $x_*$. Other terms synonymous with efficiency are Pareto optimality and admissibility.
The most common way of addressing the problem of optimizing with respect to more than one criterion is to form a weighted sum of the objective functions, and then to proceed as in a standard problem in mathematical programming. There are also other ways, such as the reference point method, for solving this problem. Steuer (1986) describes these methods, and also discusses the practical problems of using an approach that effectively weights the criteria a priori. The problems arise because we usually do not have an explicit utility function. Even if a reasonable a priori formulation of a single objective were possible, it is generally desirable to explore the space of tradeoffs within the feasible region that contains near-optimal points.
Human intervention is almost always involved in multiple criteria optimization. Steuer (1986) discusses interactive procedures for multiple criteria optimization.
Some of the procedures work only for linear objective functions, and others make implicit assumptions about the user’s utility function. The methods generally employ iterative projections of an unbounded line segment in the criterion space onto the nondominated surface of the feasible region (see also Korhonen and Wallenius, 1986). The available computer programs implementing this general method work only for linear problems. An important aspect of the methods is a graphical display that aids the user in interacting with the computations. This strategy is also applicable to nonlinear problems by replacing the linear programming module with a nonlinear code. The underlying computations for the nonlinear problem are more extensive, of course; and there may be a need to provide more than one nonlinear programming module. Any of the methods could be improved with more integrated graphics. For a nonlinear problem, the graphics to display the tradeoffs among the various criteria may be far more complicated.
Most of the work in multicriteria optimization has involved both linear objective functions and linear constraints. There has been some work in the area of multicriteria optimization for nonlinear problems (see, for example, the survey by Weistroffer and Narula, 1991, and the book by Miettinen, 1999). Any approach to multicriteria optimization involves solution of one or more ordinary optimization problems, and a variety of algorithms is available for solving the basic nonlinear optimization problems.
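The ideas of dominance and of a weighted sum of objectives can be illustrated with a small sketch. It is not taken from the text: the five criterion vectors are hypothetical, the function names are our own, and dominance is taken in the strict sense used above (every element smaller).

```python
def dominates(u, v):
    """True if every element of criterion vector u is less than that of v."""
    return all(ui < vi for ui, vi in zip(u, v))


def nondominated(vectors):
    """Return the criterion vectors that no other vector dominates."""
    return [v for v in vectors
            if not any(dominates(u, v) for u in vectors if u is not v)]


def weighted_sum_choice(vectors, weights):
    """Pick the criterion vector with the smallest weighted sum."""
    return min(vectors,
               key=lambda z: sum(w * zi for w, zi in zip(weights, z)))


# Hypothetical criterion vectors (z1, z2) for five feasible points.
candidates = [(1.0, 9.0), (2.0, 4.0), (3.0, 5.0), (4.0, 1.0), (5.0, 6.0)]

front = nondominated(candidates)
choice_equal = weighted_sum_choice(candidates, (0.5, 0.5))
choice_skewed = weighted_sum_choice(candidates, (0.8, 0.2))
```

In this example different weights select different nondominated vectors, which is one reason that fixing the weights a priori may hide tradeoffs the analyst would want to see.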
8.3
Optimization under Soft Constraints
Because of the standard formulation of optimization problems as a single “objective” function together with a set of “constraints”, practitioners generally set up their problems in this way. In many real-world applications, however, given a choice between a solution that satisfies all of the constraints and a point that slightly violates some constraints but yields a much better value of the objective function, the practitioner would rethink the constraints and possibly accept the “nonfeasible” point.
The way to approach most applications is to place a value or cost on results or decisions. The objective function does this; it represents the value of something. Hard constraints ignore gradations of value; they are either satisfied or they are not. Often it is better to allow the constraints to be violated, but to construct the objective function so as to attempt to satisfy them. We thus treat the constraints as “soft”. For example, we may modify the constrained problem,
$$ \min_x f(x) \quad \text{s.t.} \quad g(x) \le b, $$
to the unconstrained problem,
$$ \min_x f(x) + h(g(x) - b), $$
where h is a function whose minimum occurs at all points x such that g(x) ≤ b. If h is a simple function with only two values, say 0 and M, and M is a very large number relative to the range of f, the unconstrained problem is very similar to the constrained problem. We may choose h to be an increasing function in g(x) − b for g(x) ≥ b. This will have the effect of ensuring that the solution “nearly” satisfies the constraints.
Not all constraints can or should be treated as soft constraints. Some constraints represent physical limitations and must be satisfied.
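As a small illustration of an increasing penalty h, the following sketch replaces the hard constraint x ≤ 1 in the problem of minimizing (x − 3)² by a quadratic penalty on the violation. The test problem, the penalty form h(t) = μ max(0, t)², the values of μ, and the golden section search used for the one-dimensional minimization are all our own choices for illustration.

```python
import math


def golden_section(fun, lo, hi, tol=1e-7):
    """Minimize a unimodal function of one variable on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - r * (hi - lo)
    d = lo + r * (hi - lo)
    while hi - lo > tol:
        if fun(c) < fun(d):
            hi, d = d, c                 # minimum lies in [lo, d]
            c = hi - r * (hi - lo)
        else:
            lo, c = c, d                 # minimum lies in [c, hi]
            d = lo + r * (hi - lo)
    return 0.5 * (lo + hi)


def f(x):
    """Objective whose unconstrained minimum, x = 3, violates x <= 1."""
    return (x - 3.0) ** 2


def soft_objective(x, mu, b=1.0):
    """f(x) + h(g(x) - b) with g(x) = x and h(t) = mu * max(0, t)**2."""
    violation = max(0.0, x - b)
    return f(x) + mu * violation ** 2


# Larger mu makes the soft constraint behave more like a hard one.
solutions = {mu: golden_section(lambda x: soft_objective(x, mu), -10.0, 10.0)
             for mu in (1.0, 10.0, 1000.0)}
```

For this particular problem the minimizer is (3 + μ)/(1 + μ), so the computed solutions move from 2 toward the boundary x = 1 as μ grows, without ever satisfying the constraint exactly.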
Exercises
8.1. Consider the optimization problem in Exercise 7.1, page 149. Define an optimization problem in which it is desired to minimize the same objective function, but x, y, and z are required to be nonnegative; the upper bounds on x, y, and z are not hard constraints, but are “desirable”; and it is desired to maximize the linear combination 1.3x + 1.2y + 1.08z.
Chapter 9
Software for Optimization
Most of the comprehensive scientific software packages such as the IMSL Libraries, Matlab, and S-Plus have modules for solution of systems of nonlinear equations and for optimization.
It is possible for a user to access computational servers for optimization over the internet, so that the user client does not need to run the software. Czyzyk, Mesnier, and Moré (1998) describe a system called NEOS that provides server capability for optimization problems. Problems can be submitted to the NEOS system by email, or by interfaces available over the internet. More information about NEOS is available at
http://www-neos.mcs.anl.gov/
Casanova and Dongarra (1998) describe a system called Netsolve that uses the NEOS system with remote procedure calls. Ferris, Mesnier, and Moré (2000) describe a simpler interface using Condor (see Epema et al., 1996) together with NEOS.
It is difficult to design general-purpose software for optimization problems because the problems tend to be somewhat specialized and different solution methods are necessary for different problems. There are several specialized software packages for optimization. Some address general optimization problems for continuous nonlinear functions, with or without constraints. There are several packages for linear programming. These often also handle quadratic programming problems, as well as other variations, such as mixed integer problems and network problems.
Another reason it is difficult to design general-purpose software for optimization problems is that the formulation of the problems in simple computer interfaces is difficult.
Finally, the need for an initial guess may complicate the design of optimization software, especially for the unsophisticated user. The software would be hard-pressed to decide on a reasonable starting value, however. Sometimes an obvious default such as $x^{(0)} = 0$ will work, and there are some software packages
that will choose such a starting value if the user does not supply a value. Most packages require, or at least allow, the user to input a starting value.
Because the development of a mathematical model that can be communicated easily to the computer is an important but difficult aspect of optimization problems, there are packages that implement modeling languages, and many of the general packages accept optimization problems expressed in these languages.
Moré and Wright (1993) provide a survey of software for nonlinear systems and optimization problems, both linear and nonlinear. Hans Mittelmann and Peter Spellucci maintain a guide to non-commercial optimization software at
http://plato.la.asu.edu/guide.html
The magazine ORMS Today, published by Informs, periodically surveys software for optimization. A survey for nonlinear programming, for example, is in the June, 1998, issue, pages 36–45; and a survey for linear programming is in the August, 1999, issue, pages 64–71.
9.1
Fortran and C Libraries
There are a number of reliable optimization algorithms available for use in Fortran and C. Many of these programs are available in the ACM TOMS collection at netlib:
http://www.netlib.org/liblist.html
Two of the most widely used Fortran and C libraries are the IMSL Libraries and the Nag Library. They provide a large number of routines for optimization. Both libraries are available in both Fortran and C versions, and in both single and double precision.
The Optimization Subroutine Library (OSL) is an IBM product that provides a collection of tools for solving linear programming (LP), quadratic programming (QP), and mixed integer programming (MIP) problems. The MIP solver is capable of handling either a linear or quadratic objective function. Individual OSL components implement state-of-the-art algorithms in code that takes special advantage of the characteristics of the platforms on which they run. These components can be combined into applications as simple as “input, solve, output,” or as complicated as a knowledgeable practitioner may care to create. Both serial and parallel versions are available. OSL subroutines are written primarily in portable FORTRAN, with a few assembler language routines to enhance performance.
OSL includes routines for linear programming, quadratic programming, network problems, and mixed-integer programming. There are a number of utility routines for input/output, matrix manipulation, and control querying and setting. There are also routines for performing sensitivity and parametric analyses. Data may be input in any format, or generated as needed, and passed on to OSL modules in internal arrays.
OSL is available on numerous platforms, from PCs to mainframes. It is available either as standalone solvers or as a library for developing custom applications.
There are a large number of other optimization software packages in either Fortran or C. One of the widely used ones is GRG2, which implements reduced gradient methods (Lasdon, Waren, Jain, and Ratner, 1978), and which is distributed by Windward Technologies Inc. (1995) and Frontline Systems. There is also a version for large sparse problems, LSGRG2 (Smith and Lasdon, 1992). Conn, Gould, and Toint (1992) developed a package called LANCELOT for solving very large-scale nonlinear optimization problems.
Some widely used constrained optimization programs based on sequential quadratic programming are NLPQL (Schittkowski, 1985), FSQP (Panier and Tits, 1993), and NPSOL (Gill et al., 1992). NLPQL does not maintain feasibility and so may not be as reliable, especially for objective functions that may not be well-behaved when the constraints are violated. NPSOL has an option to require it to maintain feasibility. FSQP (“Feasible Sequential Quadratic Programming”) always maintains feasibility. NPSOL is available in the Nag Library as E04UCF.
Lukšan and Vlček (2001) describe four different optimization subroutines that do not require derivatives. The programs use variations of sequential quadratic programming and quasi-Newton methods. All of these programs were written in Fortran. The subroutines of Lukšan and Vlček are available in the ACM TOMS collection at netlib.
Another widely available package is NL2SOL, which has been successively modified over the years (Dennis, Gay, and Welsch, 1981a, 1981b; Gay, 1983; Gay and Welsch, 1988; and Bunch, Gay, and Welsch, 1993).
Many of the Fortran and C subprogram libraries also have interactive interfaces that reduce the programming burden in using such libraries.
Examples of Use of the IMSL Libraries
The IMSL Libraries have eleven routines for unconstrained optimization and twelve routines for constrained optimization. The documentation provides a decision tree to identify the appropriate routine for a given problem. The first node in the tree is for unconstrained or constrained. Under the unconstrained branch, the next node is for univariate or multivariate. Under the unconstrained multivariate branch, there are two special branches for least squares problems and for very large-scale problems. For unconstrained problems not in these categories, the next choice is between smooth and nonsmooth objective functions. For smooth functions, the next choice is whether a first derivative is available, and if it is, the next choice is whether a second derivative is available.
The branch of the decision tree corresponding to constrained problems likewise has a variety of choices depending on type of objective function, type of constraints, derivative information, and smoothness of the function. For constrained optimization, the first question is whether or not the constraints are linear. For linear constraints, a further question is whether the constraints are
simple box constraints, that is, bounds on the variables. Other factors determining the routine to use are the type of derivative information available, the size of the problem, and the smoothness of the objective function.
There are separate IMSL routines for single and double precision. The names of the Fortran routines share a common root; the double precision version has a D as its first character, usually just placed in front of the common root. Functions that return a floating point number, but whose mnemonic root begins with an I through an N, have an A in front of the mnemonic root for the single precision version, and have a D in front of the mnemonic root for the double precision version.
Likewise, the names of the C functions share a common root. The function name is of the form imsl_f_root_name for single precision and imsl_d_root_name for double precision.
Consider the problem of determining the unconstrained minimum of the two-dimensional Rosenbrock function,
$$ f(x) = (1 - x_1)^2 + 100(x_2 - x_1^2)^2. $$
The IMSL Fortran routine UMING/DUMING uses a quasi-Newton method to solve an unconstrained problem of this type. The single precision routine is invoked by the statement
      call uming (fcn, grad, n, xguess, xscale, fscale, iparam, rparam, x, fvalue)
To use a Fortran program to solve this problem, we first write either a Fortran subroutine or a function for the mathematical function and either a Fortran subroutine or a function for its gradient. These are EXTERNAL modules. The routine UMING/DUMING requires these functions to be passed as subroutines with specific forms:
FCN, a user-supplied SUBROUTINE to evaluate the function to be minimized. The usage is CALL FCN (N, X, F), where N is the length of X (input), X is the vector at which the function is to be evaluated (input, and not to be changed by FCN), F is the function value at the point X (output).
GRAD, a user-supplied SUBROUTINE to evaluate the gradient at the point X. The usage is CALL GRAD (N, X, G), where N is the length of X and of G (input), X is the vector at which the gradient is to be evaluated (input, and not to be changed by GRAD), G is the gradient vector at the point X (output).
The Fortran modules for the Rosenbrock function and its gradient are shown in Figure 9.1. The other arguments in UMING/DUMING are:
N, the dimension of the problem (input),
XGUESS, vector of length N containing the initial guess of the minimum (input),
XSCALE, vector of length N containing the scaling factors for the variables (input),
FSCALE, scaling factors for the function and gradient (input),
IPARAM, parameter vector of length 7 (input/output),
RPARAM, parameter vector of length 7 (input/output),
X, the point at which the minimum occurs (output),
FVALUE, the value of the function at the minimum (output).
A program to solve the minimization problem is shown in Figure 9.1. The scales are set to 1 for both the variable and the function; and the default values are used for the parameter vectors IPARAM and RPARAM.
The IMSL C function to solve this problem is min_uncon_multivar, which is available in two precisions:
float *imsl_f_min_uncon_multivar
double *imsl_d_min_uncon_multivar.
There are only two required arguments for imsl_f_min_uncon_multivar:
float fcn (int n, float x[]), the function to be minimized,
int n, the number of variables.
The same arguments as in the Fortran version are also available, but they all have default values. If the gradient is not supplied, numerical approximations are used. A C program to solve the minimization problem using the same settings as in the Fortran program is shown in Figure 9.2. The final 0 in the invocation of imsl_f_min_uncon_multivar is required to indicate the end of the argument list.
Some of the most common constrained optimization problems are quadratic programming problems of the form (7.6) on page 146. The IMSL Libraries provide a quadratic programming routine, qprog (Fortran) or quadratic_prog (C), that implements the method of Goldfarb and Idnani (1983).
The IMSL Libraries also provide utility routines for finite difference approximations to the gradient, Hessian, and Jacobian. These routines, cdgrd, fdgrd, fdhes, and fdjac, are useful when building an optimization program, and they are used internally in some of the IMSL optimization programs.
It is generally better to provide program modules that actually compute the derivatives, rather than numerical approximations to them. Packages that do symbolic computations, such as Maple and Mathematica, can be used to determine mathematical expressions for the derivatives. Functions to compute the derivatives can also be written using software for automatic differentiation, as we discuss on page 167.
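Even when the derivatives are coded by hand, a finite difference approximation is a convenient check on the code. The following sketch, in Python rather than Fortran or C, compares the analytic gradient of the Rosenbrock function with a central difference approximation at the starting point used in the figures. The function names mirror those in Figure 9.1, but `central_diff_gradient` and the step size h are our own choices, not part of the IMSL Libraries.

```python
def rosbrk(x):
    """Two-dimensional Rosenbrock function."""
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2


def rosgrd(x):
    """Analytic gradient of the two-dimensional Rosenbrock function."""
    return [-2.0 * (1.0 - x[0]) - 400.0 * (x[1] - x[0] ** 2) * x[0],
            200.0 * (x[1] - x[0] ** 2)]


def central_diff_gradient(fun, x, h=1e-5):
    """Approximate the gradient of fun at x by central differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((fun(x_plus) - fun(x_minus)) / (2.0 * h))
    return grad


x0 = [-1.0, 2.0]
analytic = rosgrd(x0)
numeric = central_diff_gradient(rosbrk, x0)
max_abs_error = max(abs(a - b) for a, b in zip(analytic, numeric))
```

A large discrepancy between the two gradients usually indicates an error in the hand-coded derivative; a small one reflects only the truncation and rounding error of the difference quotient.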
C     Fortran 77 program
      parameter (n=2)
      integer iparam(7)
      real f, fscale, rparam(7), x(n), xguess(n), xscale(n)
      external rosbrk, rosgrd, uming
C
      data xguess/-1.0,2.0/, xscale/1.0,1.0/, fscale/1.0/
C
C     Setting iparam(1) to 0 requests the default parameter values
      iparam(1) = 0
      call uming (rosbrk, rosgrd, n, xguess, xscale, fscale, iparam,
     &            rparam, x, f)
      print *, ' The solution is ', x
      print *, ' The function value is ', f
      print *, ' The number of iterations was ', iparam(3)
      print *, ' The number of function evaluations was ', iparam(4)
      print *, ' The number of gradient evaluations was ', iparam(5)
      end
C
C     The two-dimensional Rosenbrock function
      subroutine rosbrk (n, x, f)
      integer n
      real x(n), f
C
      f = (1.0 - x(1))**2 + 100.0 * (x(2) - x(1)*x(1))**2
C
      return
      end
C
C     The gradient of the two-dimensional Rosenbrock function
      subroutine rosgrd (n, x, g)
      integer n
      real x(n), g(n)
C
      g(1) = -2. * (1.0 - x(1)) - 400.*(x(2)-x(1)*x(1))*x(1)
      g(2) = 200.*(x(2)-x(1)*x(1))
C
      return
      end
Figure 9.1: IMSL Fortran Program to Find an Unconstrained Minimum
9.2
Optimization in General-Purpose Interactive Systems
General-purpose interactive systems such as Matlab, S-Plus, Gauss, and PV-Wave usually provide some functions for optimization. These are generally easier to use than the Fortran or C libraries, but the types of problems they solve are often more limited, and there are fewer available options to control the computations. An example of the use of the Matlab function to solve the same two-dimensional unconstrained Rosenbrock problem is shown in Figure 9.3.
/* C program */
#include <imsl.h>
#include <stdio.h>

static float rosbrk(int, float[]);
static void rosgrd(int, float[], float[]);

int main()
{
    int i, n = 2;
    float *result, fx;
    static float xguess[2] = {-1.0e0, 2.0e0};
    static float grad_tol = 0.0001;

    result = imsl_f_min_uncon_multivar (rosbrk, n,
                 IMSL_XGUESS, xguess,
                 IMSL_GRAD, rosgrd,
                 IMSL_GRAD_TOL, grad_tol,
                 IMSL_FVALUE, &fx,
                 0);
    printf (" The solution is ");
    for (i=0; i<n; i++) printf ("%f ", result[i]);
    printf ("\n The function value is %f\n", fx);
    return 0;
}

/* The two-dimensional Rosenbrock function */
static float rosbrk (int n, float x[])
{
    return (1.0 - x[0])*(1.0 - x[0])
           + 100.0 * (x[1] - x[0]*x[0])*(x[1] - x[0]*x[0]);
} /* end of function */

/* The gradient of the two-dimensional Rosenbrock function */
static void rosgrd (int n, float x[], float g[])
{
    g[0] = -2. * (1.0 - x[0]) - 400.*(x[1]-x[0]*x[0])*x[0];
    g[1] = 200.*(x[1]-x[0]*x[0]);
} /* end of function */
Figure 9.2: IMSL C Program to Find an Unconstrained Minimum
In Matlab the function is defined in a Matlab M-file. The function fmins in Figure 9.3 uses a Nelder-Mead simplex method.
PV-Wave is a general-purpose interactive system that provides many of the capabilities of the IMSL Libraries. The C function min_uncon_multivar used in Figure 9.2, for example, is available in PV-Wave with a simpler interface. Many other routines from Fortran and C Libraries are available with interactive interfaces.
A graphical user interface is available for OSL on some systems. This provides some of the OSL functionality in a point and click environment in which no programming is required.
M-File:
    function f = rosbrk(x)
    f = (1 - x(1))^2 + 100 * (x(2) - x(1)^2)^2;

Statements to find and print minimum:
    [x, out] = fmins('rosbrk', [-1,2]);
    x
    rosbrk(x)
Figure 9.3: Matlab Statements to Find an Unconstrained Minimum
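For readers without access to Matlab, the kind of derivative-free simplex search that fmins performs can be sketched in a few lines of Python. This is a basic textbook variant of the Nelder-Mead method, not the Matlab implementation; the initial simplex (axis steps of 0.5), the reflection, expansion, contraction, and shrinkage coefficients, and the stopping rule on the spread of function values are our own assumptions.

```python
def rosbrk(x):
    """Two-dimensional Rosenbrock function, as in Figure 9.3."""
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2


def nelder_mead(fun, x0, step=0.5, ftol=1e-12, max_iter=5000):
    """Minimize fun by a basic Nelder-Mead simplex search (illustrative)."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex displaced along each axis.
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    values = [fun(v) for v in simplex]

    for _ in range(max_iter):
        # Order the vertices from best to worst.
        order = sorted(range(n + 1), key=lambda k: values[k])
        simplex = [simplex[k] for k in order]
        values = [values[k] for k in order]
        if values[-1] - values[0] < ftol:
            break
        worst = simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]

        def along(t):
            # Point on the line through the centroid and the worst vertex.
            return [c + t * (w - c) for c, w in zip(centroid, worst)]

        reflected = along(-1.0)
        f_r = fun(reflected)
        if f_r < values[0]:
            expanded = along(-2.0)
            f_e = fun(expanded)
            if f_e < f_r:
                simplex[-1], values[-1] = expanded, f_e
            else:
                simplex[-1], values[-1] = reflected, f_r
        elif f_r < values[-2]:
            simplex[-1], values[-1] = reflected, f_r
        else:
            # Outside contraction if the reflection helped, else inside.
            contracted = along(-0.5 if f_r < values[-1] else 0.5)
            f_c = fun(contracted)
            if f_c < min(f_r, values[-1]):
                simplex[-1], values[-1] = contracted, f_c
            else:
                # Shrink every vertex halfway toward the best one.
                best = simplex[0]
                simplex = [best] + [[b + 0.5 * (vj - b)
                                     for b, vj in zip(best, v)]
                                    for v in simplex[1:]]
                values = [values[0]] + [fun(v) for v in simplex[1:]]

    k_best = min(range(n + 1), key=lambda k: values[k])
    return simplex[k_best], values[k_best]


x_min, f_min = nelder_mead(rosbrk, [-1.0, 2.0])
```

The call mirrors the Matlab statements in Figure 9.3: the same function and the same starting point, with the minimizer expected near (1, 1).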
9.3
Software for General Classes of Optimization Problems
Because of the general complexity of optimization problems, special-purpose software has been developed for different types of problems. This reduces the complexity both of the user interface and of the computational algorithms. We will mention some software packages for various types of problems, but for a more extensive survey, we refer to Moré and Wright (1993).
General Optimization Problems
The difficulty in defining a user interface for optimization problems of a wide variety of types means that most of the software that addresses general optimization problems is in the form of Fortran or C libraries. The user interface requires specification of the problem in one of those languages. The IMSL Libraries, the Nag Library, and the Optimization Subroutine Library (OSL) discussed earlier are the most widely used packages that provide capabilities for solving a wide range of types of optimization problems.
Most packages provide a choice of computational methods. Many include the Nelder-Mead simplex method as one of the choices because of the simplicity of its interface (no derivatives) and because of its robustness.
Linear Programming and Quadratic Programming
Although Fortran and C libraries often provide routines for linear programming, the special structure of the problems makes it easy to define simpler user interfaces than Fortran or C modules. Well-developed software for linear programming has been available for a long time. An early very large-scale package produced by the IBM Corporation was called MPS. The format for specifying the problem and providing the data that this package required is called MPS format, and most software packages for linear programming allow for this format.
Currently a very popular format for linear programming packages is a spreadsheet format, which allows specification of the problem and input of the data in a spreadsheet that is compatible with the very popular spreadsheet programs such as Excel and Lotus.
The magazine ORMS Today, published by Informs, periodically provides surveys of linear programming packages. The August, 1999, issue describes several systems and gives contact information for the distributors of the packages.
Some of the more commonly used linear programming packages include OSL, particularly through its “non-programming” interface; Cplex; MINOS; Lindo and Lingo; and SAS. These linear programming software packages also solve quadratic programming problems. The modeling systems GAMS and AMPL are also widely used for linear programming problems.
Least Squares
There is a wide range of software for least squares problems. Most of the general-purpose software includes special routines for least squares. Packages for statistical data analysis often include functions for nonlinear least squares. For example, in the IMSL Libraries the routine rnlin performs least squares fits of general models and in S-Plus the function nls performs the computations for nonlinear least squares regression. A more general function in S-Plus, ms, minimizes a sum of nonlinear functions over parameters. Bouvier and Huet (1994) give a more general set of S-Plus functions for nonlinear regression, nls2, and Huet et al. (1996) illustrate their use in a number of examples.
Many of the packages for linear programming, such as OSL, also include abilities for quadratic programming. The linear objective function is replaced by a quadratic function and otherwise the interfaces for linear programming and quadratic programming are the same.
Bunch, Gay, and Welsch (1993) give Fortran subroutines for nonlinear least squares in nonlinear regression models.
These routines also perform maximum likelihood and robust fitting of nonlinear regression models.
Automatic Differentiation
For algorithms that require the derivative of a function, it is often convenient to use software that performs symbolic differentiation, such as Maple. Even for functions that appear relatively simple, it is a good idea to double check the derivative by performing the differentiation on the computer. The software for the symbolic differentiation will also produce the appropriate Fortran or C code for the derivative of the function.
Another approach is to use a software system that operates directly on the Fortran or C code for the function to produce corresponding code for the derivative of the function. Dobmann, Liepelt, and Schittkowski (1995) and Bischof et al. (1996) have developed systems that perform automatic differentiation
on functions written in Fortran. Griewank, Juedes, and Utke (1996) have developed a package for automatic differentiation of functions written in C and C++. The systems of Dobmann, Liepelt, and Schittkowski (1995) and Griewank, Juedes, and Utke (1996) are available in CALGO (see page 224).
The ADIFOR system of Bischof et al. (1996) accepts the user’s Fortran 77 source code for the function and a specification of dependent and independent variables. ADIFOR then generates code that computes the partial derivatives of all of the specified dependent variables with respect to all of the specified independent variables. ADIFOR is available at
http://www.mcs.anl.gov/adifor/
Bischof, Roh, and Mauer (1997) also produced a version of ADIFOR for C, called ADIC.
Griewank (2000) discusses various techniques of automatic differentiation, including methods for sparse problems, higher derivatives, and nonsmooth problems. He also discusses software for automatic differentiation.
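The systems mentioned above transform source code. A different and much simpler way to see the idea of automatic differentiation is through operator overloading, in which each arithmetic operation propagates a derivative along with a value (the “forward mode”). The following sketch is only an illustration of that idea and is not how ADIFOR or the other cited systems work; the class `Dual` and the function `gradient` are our own names, and only addition, subtraction, and multiplication are supported.

```python
class Dual:
    """A value together with its derivative with respect to one input."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    @staticmethod
    def lift(other):
        """Treat a plain number as a constant (derivative zero)."""
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.der + o.der)

    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.val - o.val, self.der - o.der)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule.
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)

    __rmul__ = __mul__


def rosbrk(x1, x2):
    """Two-dimensional Rosenbrock function, written with +, -, * only."""
    u = 1.0 - x1
    v = x2 - x1 * x1
    return u * u + 100.0 * (v * v)


def gradient(fun, point):
    """Gradient of fun at point, one forward pass per coordinate."""
    grad = []
    for i in range(len(point)):
        args = [Dual(p, 1.0 if j == i else 0.0)
                for j, p in enumerate(point)]
        grad.append(fun(*args).der)
    return grad


g = gradient(rosbrk, [-1.0, 2.0])
```

At the point (−1, 2) the result agrees with the hand-coded gradient in Figure 9.1, and, unlike a finite difference approximation, it involves no truncation error.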
9.4
Modeling Languages and Data Formats
Many computational problems in science and statistics can be stated in a very straightforward manner. For software for these problems, there are simple inputs that are just scalars or dense arrays, and the output is likewise simple, a few scalars or arrays. Optimization problems by their very nature are somewhat more difficult to set up for input to computer software; the input consists of functions and gradients, relationships, and initial guesses. Often we want more than just a point solution; we want to know something about the progress toward the solution and we want to know the sensitivity of the problem to other values in the neighborhood of the solution.
In many applications in which optimization problems arise, the structure of the problem is fixed, and values required to define the problem are sparse in input that consists of potentially very large arrays or other data structures. Adoption of a standard format in which to specify the problem can greatly facilitate the input of data. Likewise, a standard format in which to describe the solution helps the analyst to understand the results, and perhaps to try other scenarios.
The IBM package for linear programming that dates to the 1960s, called MPS, defined a standard format for specifying a linear programming problem. This MPS format is still widely used and most software packages for linear programming allow for this format. The format is very efficient for large-scale problems. Currently a very popular format for linear programming problems is a spreadsheet format, which allows specification of the problem and input of the data in a spreadsheet. Many linear programming software packages such as OSL have input/output abilities for both MPS and spreadsheet formats.
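To give a feeling for what an MPS specification looks like, the following sketch writes a very small linear program in the sectioned layout of the format (ROWS, COLUMNS, RHS, BOUNDS, ENDATA). It assumes the whitespace-separated “free” variant accepted by many current solvers rather than the fixed-column layout of the original format, and the helper `to_mps`, the row name COST, and the example problem are hypothetical. Every variable is assumed to have an entry in the objective.

```python
def to_mps(name, objective, constraints, upper_bounds=None):
    """Return a small LP as text in a whitespace-separated MPS layout.

    objective:    {variable: cost coefficient}
    constraints:  {row name: (sense, {variable: coefficient}, rhs)},
                  with sense 'L' (<=), 'G' (>=), or 'E' (=)
    upper_bounds: optional {variable: upper bound}
    """
    lines = ["NAME " + name, "ROWS", " N COST"]
    for row, (sense, _, _) in constraints.items():
        lines.append(" %s %s" % (sense, row))
    lines.append("COLUMNS")
    for var in objective:
        # The nonzero entries are listed column by column.
        lines.append(" %s COST %g" % (var, objective[var]))
        for row, (_, coefs, _) in constraints.items():
            if var in coefs:
                lines.append(" %s %s %g" % (var, row, coefs[var]))
    lines.append("RHS")
    for row, (_, _, rhs) in constraints.items():
        lines.append(" RHS %s %g" % (row, rhs))
    if upper_bounds:
        lines.append("BOUNDS")
        for var, bound in upper_bounds.items():
            lines.append(" UP BND %s %g" % (var, bound))
    lines.append("ENDATA")
    return "\n".join(lines)


# Hypothetical problem: minimize x + 2y subject to x + y >= 1 and x <= 4.
text = to_mps("EXAMPLE",
              {"X": 1, "Y": 2},
              {"LIM1": ("G", {"X": 1, "Y": 1}, 1)},
              {"X": 4})
mps_lines = text.splitlines()
```

Because only the nonzero coefficients are listed, the size of the file grows with the number of nonzeros rather than with the product of the numbers of rows and columns, which is why the format suits large sparse problems.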
GAMS (The Scientific Press, 1988) and AMPL (Fourer, Gay, and Kernighan, 1993) are systems built on modeling languages. The GAMS language, which is somewhat similar to Fortran in appearance, provides concise algebraic statements that are readily comprehensible to persons with a mathematics background. AMPL provides an interactive command environment for defining optimization problems. Both of these packages provide for input and output of problem specifications in standard formats such as MPS. These packages are often used as front-end interfaces for other optimization packages. AMPL provides an easy interface for several optimization programs and nonlinear solvers, including Cplex, GRG2, LANCELOT, Minos, NPSOL, and OSL. A complete list, as well as additional information about AMPL, is available at http://www.ampl.com/
9.5
Testbeds for Optimization Software
There are two different kinds of issues involved in gaining confidence in the results of numerical computations. One type of question relates to the quality of the algorithm and the software. Analysis of the algorithm, review of the software, and finally empirical testing of the software can provide general answers to the question of whether our computations are correct. For a given problem, however, the real question is how good is the value we have computed. One approach is to perturb the problem in a way that has a known effect on the exact solution. Comparing the computed solution of the original problem to the computed solution of the perturbed problem can alert us to possible inaccuracies; or, conversely, it can give us confidence in the computed solutions. The simplest perturbation that sometimes works is to change the precision. This does not perturb the mathematical problem, so the exact solution does not change. (The exact solution to the problem that the software is actually given may change. Our interest, however, is not in the question of whether the software did well with the approximations in the input data. In the real world, we want real solutions.) It is exceptionally difficult to verify the solution to an optimization problem. One way of gaining confidence in our computed solution is to use different starting values and see if we converge to the same solution. This approach is also used to help to get a “better” optimum in the case of multiple optima. When we try different starting values and converge to different solutions, we are left with the uncomfortable feeling that the problem has many optima, and we have just visited some local optima. Because of the difficulties of assessing the accuracy of computed results in a given optimization problem, it is important to have a wide variety of test datasets to use in validating optimization software. The development of such testbeds began many years ago. Hoffman et al. 
(1953) described several test problems, and gave their solutions. More recently, Bongartz et al. (1995) describe an extensive testbed for optimization called CUTE. Many of the test
sets are available from netlib (see page 224 for information on netlib). Floudas et al. (1999) present a collection of test problems gathered from a wide range of applications. They provide input files either in the GAMS format or the MINOPT format. A very simple problem that is often used to test optimization software is the Rosenbrock function:
\[ f(x) = \sum_{j=1,3,\ldots,d-1} \left[ (1 - x_j)^2 + 100\,(x_{j+1} - x_j^2)^2 \right], \]
for an even integer d. This function is one generalization of the original Rosenbrock function that has two variables (see Rosenbrock, 1969). The function has a banana-shaped valley, with a minimum at (1, 1, . . . , 1). A slightly different generalization of the Rosenbrock function is one of five test functions, sometimes called De Jong’s tests, that are widely used in testing combinatorial algorithms, especially genetic and other evolutionary algorithms. (See Kenneth A. De Jong, 1976, Analysis of the behavior of a class of genetic adaptive systems, unpublished Ph.D. dissertation, University of Michigan, Ann Arbor.) The other four tests in De Jong’s test suite include a simple circular quadratic,
\[ f(x) = \sum_{j=1}^{d} x_j^2; \]
a step function,
\[ f(x) = 6d + \sum_{j=1}^{d} \lfloor x_j \rfloor; \]
a quartic with noise,
\[ f(x) = \sum_{j=1}^{d} \left( j x_j^4 + e_j \right), \]
where e_j is a realization of a normal (0,1) random variable; and a version of the “Shekel foxholes” function,
\[ f(x) = \frac{1}{500} + \sum_{j=1}^{25} \frac{1}{j + \sum_{i=1}^{2} (x_i - a_{ij})^p}, \]
where p is an even integer, usually 2, and
a_{1j} = −32, −16, 0, 16, 32, −32, −16, 0, 16, 32, . . .
and
a_{2j} = −32, . . . , −32, −16, . . . , −16, 0, . . . , 0, 16, . . . , 16, 32, . . . , 32.
Kennedy and Gentle (1980, pages 493–498) list seventeen other test problems from the literature; and Hock and Schittkowski (1985) and Schittkowski (1987) provide extensive collections of test problems.
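The test functions above are short enough to code directly, which is convenient when checking an optimizer. The following is a minimal sketch in Python using only the standard library; the function names are our own, the indices are shifted to Python's 0-based convention, and the foxholes grid is built under the assumption that the a_{1j} cycle through the five values while the a_{2j} repeat each value five times, as the listing above suggests.

```python
import math
import random

def rosenbrock(x):
    """Generalized Rosenbrock function; the sum runs over j = 1, 3, ..., d-1."""
    if len(x) % 2 != 0:
        raise ValueError("d must be even")
    total = 0.0
    for j in range(0, len(x), 2):  # 0-based positions of x_1, x_3, ...
        total += (1.0 - x[j]) ** 2 + 100.0 * (x[j + 1] - x[j] ** 2) ** 2
    return total

def sphere(x):
    """Simple circular quadratic."""
    return sum(xj ** 2 for xj in x)

def step(x):
    """Step function: 6d plus the sum of the floors of the coordinates."""
    return 6 * len(x) + sum(math.floor(xj) for xj in x)

def noisy_quartic(x, rng=random):
    """Quartic with additive normal(0,1) noise; rng is any object with gauss()."""
    return sum((j + 1) * xj ** 4 + rng.gauss(0.0, 1.0) for j, xj in enumerate(x))

def foxholes(x, p=2):
    """Version of the Shekel foxholes function for a 2-vector x."""
    grid = [-32, -16, 0, 16, 32]
    a1 = grid * 5                             # -32, -16, 0, 16, 32, -32, ...
    a2 = [g for g in grid for _ in range(5)]  # -32 (five times), -16 (five times), ...
    total = 1.0 / 500.0
    for j in range(25):
        total += 1.0 / ((j + 1) + (x[0] - a1[j]) ** p + (x[1] - a2[j]) ** p)
    return total
```

Because the noisy quartic returns a different value on each call, passing a seeded `random.Random` instance as `rng` makes a test run reproducible.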
Test problems are useful for developers of algorithms, but because the developers have tested their software on standard collections of test problems, these test problems can give the user of the software a false sense of security. The test problems may or may not be representative of the user’s problem. In addition to a fixed set of test problems, it is important to be able to generate problems with specific characteristics, so as to test the ability of the optimization software to solve problems with given features. Dembo and Steihaug (1985) describe a test problem generator for unconstrained optimization problems. Facchinei, Júdice, and Soares (1997a, 1997b) describe a method of generating test datasets with box constraints, beginning with unconstrained problems with known solutions.
Exercises
9.1. Pick a minimization program available to you and test it with the Rosenbrock function:
\[ f(x) = \sum_{j=1,3,\ldots,d-1} \left[ (1 - x_j)^2 + 100\,(x_{j+1} - x_j^2)^2 \right]. \]
Let d = 2, 4, 8, 16.
Chapter 10
Applications in Statistics Many statistical methods require solution of optimization problems. The ubiquitous method of least squares, for example, is used to fit a model to data so as to minimize the sum of the squared residuals. In maximum likelihood estimation, the objective is to optimize (maximize) the likelihood function; and in minimum variance unbiased estimation, the objective is to optimize (minimize) the variance of an estimator chosen from the space of unbiased estimators. To determine an optimal sample design is a functional maximization problem in which the design points are chosen to maximize information (subject to a suitable definition of this latter term.) In the application of statistics to process improvement, the objective is to optimize some measure of expected performance over a range of process parameters. Whenever a statistical problem is in fact an optimization problem, it is important to formulate it correctly as such. It is necessary to identify clearly the objective function and the constraints that are appropriate for the original statistical problem. Whether or not the optimization problem has constraints should not have an effect on the formulation of the objective function. The availability of software also should not determine the formulation of the problem. Sometimes, however, the problem is modified and formulated so that available software can solve it. In the first published application of linear programming to solve a constrained regression problem, for example, the objective function was the sum of the absolute deviations rather than the sum of the squares of the deviations (Charnes, Cooper, and Ferguson, 1955). The criterion in their approach for fitting the model was least absolute values rather than least squares. 
The problem was formulated in this way because there were constraints on the regression coefficients, and the available linear programming software could easily handle the constraints but could not easily handle a quadratic objective function. The standard regression problem is an optimization problem. It is to use given observations in the vector y and the matrix X to fit the model y = Xβ + ε.
“Fitting” the model means to determine a suitable estimate for β. The first steps in building a model usually involve some reasoning about what kind of model makes sense for the phenomenon being studied, and an initial guess of a functional form
\[ y \approx f(x; \theta) \tag{10.1} \]
that expresses an average or ensemble relationship. In the relation (10.1), y and x usually represent observable (vector) variables (possibly observable only with error) and θ usually represents a vector of parameters. Within this relationship the observations may have random noise. The next steps in building a model involve the use of data to refine the functional form and estimate any unknown parameters. Statistical estimation, of course, involves confidence statements based on assumed or inferred probability distributions. The process of fitting a model with data is one of the most important activities in statistics. There are many variations on the simple theme of fitting a model: the model may be linear or nonlinear; various norms of the residual vector may be chosen to be minimized; there may be constraints to be imposed on the estimates; and so on. In this chapter we consider some of these variations. We also discuss some other statistical problems that involve optimization.
10.1
Fitting Models with Data
Models may be descriptions of the behavior of a random variable or descriptions of the relationship among variables. In statistical applications the relationships are often stochastic and/or the measurements of the variables are made with error. The process of building a model using empirical data is one of the most important problems in the natural sciences and applied mathematics. Model fitting with data is often referred to as an “inverse problem” in some areas of applied mathematics. This is because many of the mathematical applications have involved development of models using very little (or no) data. A general model may arise initially from casual observations or from first principles. In either case, however, the specific form of the model is chosen to correspond to observed data. Observational Data Observations of y and corresponding observations of x are used to determine a functional form, that is, what f to use, and then to determine θ for a given functional form f . Given data, we may write a relationship analogous to equation (10.1): y ≈ f (X; θ), where y is an n-vector of observations on the variable y from (10.1) and X is an n × m matrix of n observations on the m variables of the vector x. We often
use the notation (xi, yi) to represent an observation. In this notation xi is the transpose of the ith row of X. (Notice in the notation of equation (10.1), xi would be the ith element of the vector x. This notation may seem somewhat difficult at first, but the context usually makes it clear.) The objective is to use the given X and y to fit the model (10.1). Observed data constitute a discrete function. (A function is a set of ordered pairs of a special type.) This discrete function is an approximation of a subset of the continuous function f that is our (assumed) model. There are two aspects of the function composed of the data that warrant comment. A function is a set of ordered pairs no two of which have the same first value. The discrete function defined by the data, however, may have multiple elements with the same first value. There are various practical ways of reconciling this with our definition of f as a function. The addition of a random additive error to f is one way. Another aspect of the observational data that needs mention is the possible multiplicity of observations. Although it may be the case that (xi, yi) = (xj, yj) while i ≠ j, we do not consider these two observations to be the same, and reduce the cardinality of the set by 1, as we would do in ordinary set theory. In observational data, each member of the set of observations contains an additional element: its index or some other identifying quantity.
Models for Interpolation
The model (10.1) can be viewed as expressing an exact relationship for a fixed set of values, that is, the discrete function made up of some observations is a subset of the function f. That exact fit would provide an approximation at other values that x may assume. This approach of fitting the given or observed values of y and X is called interpolation.
An interpolant, which fits the y values exactly, is likely not to be very smooth, as we see in Figure 10.1, where we have plotted some observed data and have drawn curves through the data. Another reason that forming a model by interpolation of observed data may not be so useful is the fact that a function that fits all the data exactly will likely not have a very simple form. The requirement to fit all data values exactly may also mean that the relationship is not a (single-valued) function. Once the functional form f is chosen, it would be difficult or impossible to determine θ that would interpolate a given dataset. For interpolation, there must be considerable freedom to choose the function f also. In practice, the function is often chosen from a fixed class of functions, such as those that are piecewise polynomials, that have a rich set of parameters that determine the specific model. Models for Smoothing Data The process of selecting a relatively simple model that provides a good approximation to the data is called “smoothing”. For a given functional form f , the
parameter θ is chosen so that the observed values yi are close to the smoothed values f(xi; θ̂), for some θ̂.

[Figure 10.1: Interpolation. Two panels, "Linear" and "Cubic Spline", each plotting y against x.]

In Figure 10.2, we see two different smoothing models for the same data. In the plot on the left, we have a simple straight line that approximates the data. The straight line does not fit the three observations with the largest x values. The plot on the right is of two straight lines, recognizing that the relationship seems to be different for the points with larger values of x. Within any local region a straight line may provide a relatively good fit, as seen in the plot on the right in Figure 10.2. Here, the model may be
\[ y = \begin{cases} a_1 + b_1 x, & x \le x_0, \\ a_2 + b_2 x, & x > x_0. \end{cases} \]
This model has two separate functional forms. A different functional form is used in the plot in Figure 10.3. This approximation seems to fit the data better, and it captures an important apparent structure in the data.
Iterative Model Selection
A common approach to the general modeling problem is to assume a form for f, based on previous knowledge, based on “common sense”, or based on the principle of Occam’s razor (“simpler is better”); next to determine a “good” θ for that functional form; then to inspect closely the “goodness” of the fit of the
model to the data; and to iterate over this three-step process, as depicted in Figure 10.4. In many situations in the physical sciences, we can develop models from first principles, based on accepted physical laws. Whether or not we explicitly follow a process as shown in Figure 10.4, science over the ages progresses this way.

[Figure 10.2: Linear Approximation. Two panels, "Linear Approximation" and "Piecewise Linear Approximation", each plotting y against x.]

Forms of Models
The form of the model may be linear in both x and θ, for example,
\[ f(x; \theta) = \theta_0 + \theta_1 x_1 + \theta_2 x_2; \]
it may be linear only in θ, for example,
\[ f(x; \theta) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 e^{x_2}; \]
or it may be nonlinear in both x and θ, for example,
\[ f(x; \theta) = \theta_0 e^{\theta_1 + \theta_2 x_1}. \]
Although the last model above can be “linearized” by taking logs of both sides (assuming θ0 > 0), the transformation changes the correspondence of the model to observed data. The residuals of the observed data from the linearized model are not additive in the original untransformed model.
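The point about the residuals can be made explicit with a short worked comparison. The error terms ε and δ below are symbols introduced only for this sketch, under the assumption that the error enters additively on whichever scale the model is fitted; the text itself does not fix an error structure.

```latex
\begin{align*}
\text{original model:}\quad
  & y = \theta_0 e^{\theta_1 + \theta_2 x_1} + \epsilon, \\
\text{linearized model:}\quad
  & \log y = \log\theta_0 + \theta_1 + \theta_2 x_1 + \delta
  \;\Longleftrightarrow\;
  y = \theta_0 e^{\theta_1 + \theta_2 x_1}\, e^{\delta}.
\end{align*}
```

An error that is additive on the log scale is therefore multiplicative on the original scale, so minimizing the residuals of the linearized model is not the same problem as minimizing the residuals of the original one.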
[Figure 10.3: Approximation with a Parametric Curve. One panel, "Approximation with Gamma Curve", plotting y against x.]

Models that are nonlinear in the parameters are the most difficult, both from a computational perspective and from a standpoint of deriving exact distributions of parameter estimators. The variables of interest may all occur explicitly in the dataset, or they may be implicit, such as a variable indicating the order in which the observations are made, or, possibly, the spatial location of the observation.
Nonparametric Smoothing
Rather than using a model such as (10.1) in which we identify a functional form for f that is dependent only on the parameter θ, we may use the data to determine values of y that correspond to x, and never explicitly develop a smoothing function corresponding to f(x; θ). This alternate approach is called nonparametric smoothing. In nonparametric smoothing the function may be expressed only as a set of ordered pairs S = {(ŷi, xi)} corresponding to the original dataset {(yi, xi)}. In addition to representing the function as a set of ordered pairs, a rule for determining ŷg corresponding to a given value xg not in the original dataset may be specified. This is usually done by defining a function f̂(·) that interpolates the function S.
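As a concrete illustration of these two ingredients, the sketch below builds the set S with a running mean and then defines an interpolating rule for new values xg. Both choices are assumptions made for the example (the text does not prescribe a particular smoother), the function names are hypothetical, and the pairs are stored as (xi, ŷi) rather than (ŷi, xi) for convenience.

```python
from bisect import bisect_right

def running_mean_smooth(x, y, half_width=1):
    """Return S as a list of (x_i, yhat_i), sorted by x.

    Each yhat_i is the mean of the y values of the observations within
    half_width positions of observation i in x-order.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    smoothed = []
    for i in range(len(xs)):
        lo = max(0, i - half_width)
        hi = min(len(xs), i + half_width + 1)
        smoothed.append((xs[i], sum(ys[lo:hi]) / (hi - lo)))
    return smoothed

def interpolate(smoothed, xg):
    """Rule for yhat_g: linear interpolation through S, constant beyond the ends."""
    xs = [point[0] for point in smoothed]
    if xg <= xs[0]:
        return smoothed[0][1]
    if xg >= xs[-1]:
        return smoothed[-1][1]
    k = bisect_right(xs, xg)  # xs[k-1] <= xg < xs[k]
    (x0, y0), (x1, y1) = smoothed[k - 1], smoothed[k]
    return y0 + (y1 - y0) * (xg - x0) / (x1 - x0)
```

For example, `interpolate(running_mean_smooth(x, y), 1.5)` gives a smoothed value at a point that need not be in the dataset. A wider `half_width` gives a smoother but less local fit.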
[Figure 10.4: Parametric Modeling. Flowchart of a loop to iterate over: choose form of f(x; θ); determine θ̂ that provides a good fit; compare fitted values with data.]

Criteria for Approximation and Fitting
The basic idea in model fitting is that the discrete function constituted by the observed data approximates another function, usually with a continuous domain. The process of estimating functions with continuous domains and smoothing data often involves expansions of functions in terms of other simpler functions or it involves convolving two functions. As we have seen, the methods of estimation of the parameters in a model generally involve finding the optimum of some function. There are two common types of functions that may be optimized. One comes from an assumed probability density, and the other comes from measures of deviation of the observed data from what the model would predict. The use of the probability density leads to maximum likelihood estimation or related methods. The use of measures of deviation of the data from the fitted model leads to least squares estimation and related methods. In the case of the linear regression model, if the errors have a normal distribution, the two optimization methods, least squares and maximum likelihood, yield the same estimators. Charnes, Frome, and Yu (1976) show there is an equivalence of generalized (weighted) least squares and maximum likelihood estimation for additive errors from a distribution of the exponential class. In later sections we consider some of the computational issues for fitting models with data. These include least squares for nonlinear models, including estimation with constraints, and estimation by minimization of other measures of the deviation of the data from the model, and maximum likelihood estimation.
Statistical Inference with Models and Data Fitting models is one aspect of statistical inference. For a given model the fit is determined by statistical estimates of the parameters. Statistical inference goes beyond estimation of the parameters, however. The estimators are random variables and it is important to know something of their distribution. Their expected values and variances are particularly important. In general, we prefer estimators whose expected values are equal to the parameters being estimated, and whose variances are small. The expected values and variances, together with the statistics computed from the observed data, may be used to form confidence bounds for the parameters. For linear models without constraints the distributions of the estimators are generally relatively simple. For nonlinear models or for models with constraints on the parameters, however, it is often difficult to determine the distribution of the estimators. Also, estimators computed by nonlinear combinations of the observations may have relatively complicated distributions. A fundamental property of data is whether or not there is an ordering of the data. For many kinds of statistical analysis, a random sample is assumed to be identically and independently distributed (i.i.d.). This assumption is realistic for many situations of interest. In other situations, however, there is likely to be some lack of independence in the data either because of the chronological order in which it was measured, or because of spatial relationships among the observed units. In sequential data, such as a time series or a signal received over time, whether or not the sequential nature of the data is accompanied by a dependence is an important question to address. For example, in order for data generated by a pseudorandom number generator to be useful, the successive values should be “independent”. 
One way of assessing the independence is to compute the autocorrelations (although, of course, 0 correlations do not imply independence). Another way is to make some kind of domain transform, such as a finite Fourier transform.
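The sample autocorrelation at a given lag is simple to compute. The sketch below uses the common estimator that divides the lagged sum of cross products by the total sum of squares about the mean; the function name is our own, and other normalizations are also in use.

```python
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given nonnegative lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    numer = sum((series[t] - mean) * (series[t + lag] - mean)
                for t in range(n - lag))
    return numer / denom

# Usage: lag-1 to lag-5 autocorrelations of values from Python's generator.
rng = random.Random(12345)
u = [rng.random() for _ in range(1000)]
lags = [autocorrelation(u, k) for k in range(1, 6)]
```

For a generator that behaves well, the values in `lags` should all be near zero, although, as noted above, small autocorrelations do not by themselves establish independence.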
10.2
Fitting by Minimizing Residuals
An intuitively appealing approach to fit a model y = f(x; θ) using data (yi, xi) is to determine θ∗ that minimizes the residuals ri(θ) = yi − f(xi; θ). Here, we write ri(θ) to emphasize the dependence of the residuals on the value of the parameter; usually we will just write ri. An approximation or a fitted model may be deemed adequate if the residual vector is “small”. Smallness may be defined in terms of a norm of this vector. The most common norm used
for this purpose is the L2 norm, in which case the criterion for fitting is called “least squares”. Another useful norm is the L1 norm and the criterion is called “least absolute values”. Other norms may be useful, or some other completely different criterion may be appropriate. In general, we seek to minimize
\[ \sum_{i=1}^{n} g(r_i), \tag{10.2} \]
for some function g that is nondecreasing in |ri|. We may or may not make assumptions about the distribution of the residuals. Under certain assumptions about random distributions of residuals, some statistical property, such as maximum likelihood, may be relevant. The ri are vertical distances as shown in Figure 10.5. The model is fit so as to minimize ∑ g(ri). Often, of course, g(r) = r², and so the fit is least squares. The computations for fitting by least squares are generally simpler than those for fitting by minimizing other norms of the residual vector. The methods discussed in Section 5.7 or Section 5.13 are appropriate for use in least-squares fitting. The first consideration in choosing the method is usually whether or not derivatives are available.
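The criterion (10.2) separates cleanly into a model, its residuals, and the function g. The sketch below makes that separation explicit and, for the special case of a straight-line model with g(r) = r², uses the familiar closed-form least squares solution. The straight-line model and all function names are assumptions chosen for illustration.

```python
def residuals(f, theta, xs, ys):
    """Vertical distances r_i = y_i - f(x_i; theta)."""
    return [y - f(x, theta) for x, y in zip(xs, ys)]

def criterion(r, g):
    """The quantity in (10.2): the sum of g(r_i) over the residuals."""
    return sum(g(ri) for ri in r)

def line_model(x, theta):
    """Illustrative model f(x; theta) = theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x

def least_squares_line(xs, ys):
    """Closed-form minimizer of the sum of squared residuals for line_model."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return (ybar - slope * xbar, slope)
```

Passing `g=lambda r: r * r` to `criterion` gives the least squares objective and `g=abs` gives the least absolute values objective, so the same residual vector can be scored under either criterion.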
[Figure 10.5: Residuals. Observations plotted about the curve y = f(x; θ), with the vertical distances r1, ..., r5 from the observations to the curve.]

If the residuals have a normal distribution, maximum likelihood estimation is equivalent to minimizing the L2 norm of the residuals (see equation (10.11)). This is simple to generalize to any distribution whose probability density is monotone in a norm ‖y − f1(x; θ1)‖ with no restrictions on θ1. (Here, f1 and θ1 arise from the full model f(x; θ).) The maximum likelihood estimates for θ1
are the values that minimize ‖y − f1(x; θ1)‖. For example, if the residuals have a double exponential, or Laplace, distribution,
\[ R \sim e^{-|y - f(x;\theta)|}, \]
the maximum likelihood estimate of θ is the point that minimizes the L1 norm of the residual vector.
Minimizing Weighted Residuals
In many cases, we may not want to treat all of the residuals the same. If some observations are assumed to have greater precision than others, we may want to apply nonnegative weights, wi, to the residuals. The objective then is to minimize
\[ \sum_{i=1}^{n} w_i\, g(r_i). \tag{10.3} \]
Because the variance of some types of observational data is proportional to the mean of the data, it is sometimes appropriate to take weights
\[ w_i \propto f(x_i; \theta)^{\alpha}. \]
Orthogonal Distances
Another type of residual we may wish to measure is the distance of the observation to the surface of the model. This distance is measured along the normal from the surface to the observation, and is sometimes called the “orthogonal distance”. The di are orthogonal distances as shown in Figure 10.6. The model is fit so as to minimize ∑ g(di). Often, of course, g(d) = d², and so the fit is “orthogonal least squares”. Fitting a model by minimizing the sum of the squared distances is called orthogonal distance regression and the criterion is sometimes called total least squares. Orthogonal distances have been studied most extensively in the case of the linear model, y ≈ Xβ. This criterion is sometimes suggested for the errors-in-variables model. This model has the form y = (X + ∆)β + E, where both ∆ and E are random variables. See Fuller (1987) for a discussion of the errors-in-variables model. Golub and Van Loan (1980) and Van Huffel and Vandewalle (1991) discuss some of the computational details of total least squares. Boggs et al. (1989) provided software for weighted orthogonal distance regression.
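For a straight line in the plane, orthogonal least squares has a closed form, which makes a convenient small example. The sketch below is not the software of Boggs et al.; it is a minimal illustration, with hypothetical function names, that uses the standard result that the fitted line passes through the centroid of the data in the direction of greatest spread. The line is returned as a point and an angle so that vertical lines cause no difficulty.

```python
import math

def orthogonal_line_fit(xs, ys):
    """Line minimizing the sum of squared orthogonal distances.

    Returns (cx, cy, angle): the line passes through the centroid (cx, cy)
    with direction (cos(angle), sin(angle)).
    """
    n = len(xs)
    cx = sum(xs) / n
    cy = sum(ys) / n
    sxx = sum((x - cx) ** 2 for x in xs)
    syy = sum((y - cy) ** 2 for y in ys)
    sxy = sum((x - cx) * (y - cy) for x, y in zip(xs, ys))
    # Direction of greatest spread: tan(2 * angle) = 2 sxy / (sxx - syy).
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return cx, cy, angle

def orthogonal_distances(xs, ys, cx, cy, angle):
    """Signed distances d_i of the observations from the line, along its normal."""
    s, c = math.sin(angle), math.cos(angle)
    return [-(x - cx) * s + (y - cy) * c for x, y in zip(xs, ys)]
```

When the line is not vertical its slope is `math.tan(angle)`, and the criterion with g(d) = d² is the sum of the squares of the values returned by `orthogonal_distances`.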
[Figure 10.6: Orthogonal Distances. Observations plotted about the curve y = f(x; θ), with the orthogonal distances d1, ..., d5 from the observations to the curve.]

Ammann and Van Ness (1988, 1989) describe an iterative method that is applicable to any norm, so long as a method is available to compute a value of β that minimizes the norm of the vertical distances in the model. The method is
1. determine $b^{(k)}$ that minimizes the norm of $(y^{(k-1)} - X^{(k-1)}\beta)$,
2. transform the matrix $[\,y^{(k-1)} \,|\, X^{(k-1)}\,]$ to $[\,y^{(k)} \,|\, X^{(k)}\,]$ by a rotation matrix that makes the kth fit horizontal,
3. set k = k + 1 and go to 1.
This is repeated until there is only a small change. An appropriate rotation matrix is Q in the QR decomposition of
\[ \begin{bmatrix} I_m & 0 \\ (b^{(k)})^{T} & 1 \end{bmatrix}. \]
10.2.1
Statistical Inference Using Least Squares+
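Variance-covariance estimates: using Gauss-Newton or dud we may take
\[ \widehat{\Sigma} = s^2 \left( \left( \frac{df}{d\theta} \right)^{T} \left( \frac{df}{d\theta} \right) \right)^{-1}, \]
where
\[ s^2 = \left( y - f(X; \theta) \right)^{T} \left( y - f(X; \theta) \right) / (n - m). \]
If the Hessian H is available, we may take
\[ \widehat{\Sigma} = s^2 H^{-1}. \]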
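The first of these estimates needs only the matrix of derivatives of f with respect to θ, evaluated at the fitted value, and the residual vector. The sketch below assumes both are already available from the fitting routine; the function names and the small Gauss-Jordan inverse are our own, chosen so that the example needs nothing beyond the standard library. For problems of any size, a linear algebra library would be the usual choice.

```python
def invert(a):
    """Inverse of a small dense matrix by Gauss-Jordan with partial pivoting."""
    m = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(m)]
           for i, row in enumerate(a)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]

def covariance_estimate(jacobian, resid):
    """s^2 (J^T J)^{-1}, where J is n x m and resid holds the n residuals."""
    n, m = len(jacobian), len(jacobian[0])
    s2 = sum(r * r for r in resid) / (n - m)
    jtj = [[sum(row[a] * row[b] for row in jacobian) for b in range(m)]
           for a in range(m)]
    return [[s2 * v for v in row] for row in invert(jtj)]
```

For a model that is linear in θ, the rows of `jacobian` are just the rows of X, and the result reduces to the usual estimate for linear least squares.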
10.2.2
Fitting Using Other Criteria for Minimum Residuals+
When fitting a model y = f (x; θ) using data (yi , xi ) by minimizing the residuals ri = yi − f (xi ; θ), we must first choose a way of measuring the total size of the residuals, that is we must choose some mapping of the vector of residuals into the real numbers. A common way, as we have discussed, is to use the sum of their squares, that is, the L2 norm of the vector of residuals. This method is computationally simple and it leads to estimators that maximize the likelihood if the residuals have a normal distribution. Minimizing other measures of the residuals may result in maximum likelihood estimates when the residuals have other distributions. In general, we may fit the model by choosing θ as the value that minimizes g(r) for some reasonable function g. Instead of attempting to choose g so as to yield maximum likelihood estimates, we may focus on other properties of the estimates that result from minimizing a particular g(r). In some cases the probability distribution of the model residual error implies that the least squares estimators may be subject to undesirable excessive variation. The least squares estimate is affected more by observations with large residuals than by those with small ones. In those cases a “robust” estimator may be better. The least absolute values estimate is affected equally by large and small residuals. Some functions of the residuals that may be used as criteria for fitting the model are shown in Table 10.1. In each case, weights could also be applied to the residuals. The weights themselves may depend on the values of the residuals and on the values of x. The M estimator is the same as the least squares estimator if ρ(z) = z 2 . An interesting class of M estimators are the Lp estimators, for which ρ(z) = |z|p , with p ≥ 1, which correspond to estimates that minimize the Lp norms of the residuals (actually, the pth powers of the norms). Depending on the form of the function ρ, this may or may not be a simple optimization problem. 
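The effect of the choice of ρ is easiest to see in the simplest possible model, one with only a location parameter m, which we assume here purely for illustration. The sketch below minimizes the sum of |yi − m|^p by a ternary search, which is adequate because the criterion is convex in m for p ≥ 1; the search and the function names are our own choices, not methods discussed in the text.

```python
def lp_criterion(m, ys, p):
    """Sum of |y_i - m|^p, the pth power of the Lp norm of the residuals."""
    return sum(abs(y - m) ** p for y in ys)

def lp_location(ys, p, iterations=200):
    """Minimize lp_criterion over m by ternary search on [min(ys), max(ys)]."""
    lo, hi = min(ys), max(ys)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if lp_criterion(m1, ys, p) < lp_criterion(m2, ys, p):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# Usage: one wild observation among five.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
ls_estimate = lp_location(ys, 2)    # close to the sample mean, 22
lav_estimate = lp_location(ys, 1)   # close to the sample median, 3
```

With p = 2 the single large observation pulls the estimate far from the bulk of the data, while with p = 1 it has no more influence than any other observation on the same side of the estimate.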
L1 Regression
The L1 criterion in regression leads to an estimate of β that minimizes the sum of the absolute deviations:
\[ \min_{b} \sum_{i=1}^{n} | y_i - x_i^{T} b |. \tag{10.4} \]
Table 10.1: Functions of Residuals for Fitting Models

Name                              Function of residuals        S-Plus function
least squares                     $\sum_{i=1}^n r_i^2$         lsfit
least absolute values             $\sum_{i=1}^n |r_i|$         l1fit
least Lp norm (p finite)          $\sum_{i=1}^n |r_i|^p$
minimax (least L∞ norm)           $\max_{i=1}^n |r_i|$
least trimmed squares             $\sum_{i=1}^h r_{(i)}^2$     ltsreg
least trimmed absolute values     $\sum_{i=1}^h |r_{(i)}|$
least median of squares           $r_{(n/2)}^2$                lmsreg
M                                 $\sum_{i=1}^n \rho(r_i)$     rreg

The estimator is sometimes called the least absolute values or LAV estimator. The idea of using least absolute values rather than least squares is an old one. Boscovitch in the 18th century proposed use of this criterion for fitting linear models, and described a geometric method for determining the estimates. His geometric algorithm was not very practical, however. Because the L1 criterion poses a much more difficult computational problem than least squares, interest in it waned. The solution to the minimization problem (10.4) may not be unique. Consider the model in which the X matrix is a single column of 1’s, that is, a model with no covariates. The problem is just to estimate a single location parameter. The L1 criterion of minimizing ∑ |yi − m| leads to the sample median. If the sample size n is even, however, any value between the order statistics y(n/2) and y(n/2+1) yields the same value for this norm of the residuals. These two order statistics are “extreme-point” estimates, and all other estimates can be expressed as convex combinations of them: w y(n/2) + (1 − w) y(n/2+1). We can also easily observe examples of the nonuniqueness and the extreme-point solutions in a simple model with one covariate, y = β0 + β1 x + ε. In Figures 10.7 and 10.8 extreme-point solutions are shown for some special patterns of observations. It is easy to see from Figures 10.7 and 10.8 that the lines shown all have
the same L1 residual norm, and furthermore that the norm is minimal. In both cases, any convex combination of the extreme-point lines shown is an L1 fit.

[Figure 10.7: Extreme-Point L1 Estimates, Even Number of Observations. Observations plotted as y against x.]

Whatever the dimension of the X matrix, we use the generic term “L1 plane” to refer to the equation that uses any estimator that minimizes the L1 residual norm. The L1 estimator has some interesting geometric properties. Let N+ be the number of points above a fitted plane and N− be the number of points below the fitted plane. The following properties hold.
• For any L1 plane |N+ − N−| ≤ k.
• For any L1 plane not passing through any observations |N+ − N−| = 0.
• For odd sample size, there is an observation through which every L1 plane passes.
• No observation lies between two L1 planes.
• There exists an L1 plane that passes through at least k observations when X is of full rank k.
The estimate is invariant to movement of a single y value, so long as it is not an extreme point, and so long as it is not moved in such a way as the
corresponding residual changes sign. This is a very strong robustness property. The estimate is, however, sensitive to movement of the x values.

[Figure 10.8: Extreme-Point L1 Estimates, Odd Number of Observations. Observations plotted as y against x.]

For the case of a scalar x (that is, a simple linear model), a type of sensitivity analysis that identifies boxes within which movement of the x and y will not affect the fit has been proposed by Narula and Wellington (1985) and Narula, Sposito, and Wellington (1993). Sensitivity analysis can be added easily to the solution algorithm to yield bounds on each y value within which movement will not affect the fit. Except for extreme points, one side of each bound is infinite.
Algorithms for L1 Regression
Until efficient algorithms for linear programming were developed, the algorithm by Singleton (1940) was the best for L1 regression. It is practical only for relatively small problems, however. Charnes, Cooper and Ferguson (1955) considered a constrained regression problem and formulated it as a linear programming problem. Their emphasis was on the constraints. To make a linear programming problem out of it, their objective was to minimize the absolute deviations rather than the squared deviations. The constrained linear regression problem, which normally would have been fitted by least squares, was the first published formulation of the L1 problem as a standard linear programming problem. Without the additional constraints considered by Charnes et al., the L1 problem is the linear programming problem (see problem (7.8), page 147):
\[ \begin{array}{ll} \min\limits_{b} & 1^{T} (e^{+} + e^{-}) \\ \text{s.t.} & Xb + Ie^{+} - Ie^{-} = y \\ & e^{+},\, e^{-} \ge 0 \\ & b \ \ \text{unrestricted}, \end{array} \tag{10.5} \]
where $e^+$ and $e^-$ are n-vectors, 1 is an n-vector of ones, and I is the n × n identity matrix. The dual problem needs only a basis of size k × k (see Exercise 10.6, page 204).

The most significant improvement on the regular LP solution was made by Barrodale and Roberts (1973). Their modification works on the primal problem, and basically speeds up the simplex procedure by skipping steps. It also does not use the large storage space of the straightforward primal formulation.

The L1 estimate can also be computed by iteratively reweighted least squares (IRLS) (see Section 5.8), using

$$
\sum_{i=1}^n |y_i - x_i^T \beta| = \sum_{i=1}^n |y_i - x_i^T \beta|^2 / |y_i - x_i^T \beta|.
\tag{10.6}
$$
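The linear program (10.5) can be handed directly to a general LP solver. The following is a minimal sketch, assuming SciPy's `linprog` with the HiGHS backend is available; the function name `l1_regression_lp` and the small data set are hypothetical and are not taken from the text. It uses the straightforward primal formulation, so it carries the large storage cost that the Barrodale and Roberts modification avoids.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression_lp(X, y):
    """Solve (10.5): min 1'(e+ + e-) s.t. Xb + e+ - e- = y, e+, e- >= 0, b free."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    # Decision vector is [b (k entries), e_plus (n entries), e_minus (n entries)].
    cost = np.concatenate([np.zeros(k), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    # b is unrestricted; the deviation variables are nonnegative.
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:k], res.fun

# Hypothetical data: four points on the line y = 1 + 2x and one gross outlier.
x = np.arange(5.0)
X = np.column_stack([np.ones(5), x])
y = 1.0 + 2.0 * x
y[4] = 100.0
b, l1_norm = l1_regression_lp(X, y)
print(b, l1_norm)
```

In this example the fitted line passes through the four collinear observations, which illustrates the insensitivity of the L1 fit to movement of a single y value.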
Zero residuals in (10.6) can cause computational problems. One solution would be to set the zero residuals to a very small number and proceed. Schlossmacher (1973) suggested deleting an observation with a zero residual from the iteration in which it is zero. Surprisingly, this method works fairly well, although there is no guarantee or proof of its convergence.

The recent advances in interior-point methods initiated by Karmarkar (1984) (see Nash and Sofer, 1996) suggest they may be worthwhile if the sample size is very large. Several people have suggested the use of Karmarkar’s algorithm for the L1 problem, but except for very large problems the methods based on that algorithm are not competitive with the modified simplex method.

Lp Regression

For 1 < p < 2, iteratively reweighted least squares (IRLS), as in (10.6), is easy to implement and will generally work fairly well. The formulation is

$$
\sum_{i=1}^n |y_i - x_i^T \beta|^p = \sum_{i=1}^n |y_i - x_i^T \beta|^2 / |y_i - x_i^T \beta|^{2-p}.
\tag{10.7}
$$

As before, zero residuals can cause problems. The most effective way of dealing with a zero residual is to set the contribution of that observation to the Lp norm to a large value.
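The reweighting in (10.7) can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the author's code: the names `lp_irls` and `lp_objective` are hypothetical, the data are made up, and near-zero residuals are handled by the simple device of clamping them to a small number `eps` (the first option mentioned for (10.6)) rather than by any of the other remedies discussed.

```python
import numpy as np

def lp_objective(X, y, b, p):
    """Sum of |residual|^p, the quantity on the left of (10.7)."""
    return np.sum(np.abs(y - X @ b) ** p)

def lp_irls(X, y, p=1.5, n_iter=200, eps=1e-8, tol=1e-10):
    """Lp fit by iteratively reweighted least squares, weights 1/|r|^(2-p)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.linalg.lstsq(X, y, rcond=None)[0]      # start from least squares
    for _ in range(n_iter):
        r = y - X @ b
        # Clamp tiny residuals so the weights stay finite (an assumption).
        w = np.maximum(np.abs(r), eps) ** (p - 2.0)
        sw = np.sqrt(w)
        # Weighted least squares via row scaling by sqrt(w).
        b_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    return b

# Hypothetical data: a roughly linear trend with one large residual.
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = np.array([2.1, 2.4, 3.2, 3.4, 4.1, 4.4, 5.2, 5.4, 6.1, 12.0])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_l15 = lp_irls(X, y, p=1.5)
print(b_ols, b_l15)
```

With p = 2 all weights equal 1 and the iteration reproduces the least squares fit, which is a convenient check on an implementation.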
M Regression

Estimators chosen to minimize the sum of general functions of residuals as in equation (10.2) are called “M estimators” because they are in some sense similar to maximum likelihood estimators. Determining the M estimator that minimizes

$$
\sum \rho(y_i - x_i^T b, x_i, s)
$$

for a linear model is equivalent to solving the nonlinear system

$$
\sum x_i \psi(y_i - x_i^T b, x_i, s) = 0,
\tag{10.8}
$$

where $\psi(r, t, s) = \partial \rho(r, t, s)/\partial r$. In the case where ρ is a function of $(y_i - x_i^T b)/s$ only, the solution can be determined as a weighted least squares problem in which the weight in the kth iteration is

$$
\frac{\psi(y_i - x_i^T b^{(k-1)})/s^{(k-1)}}{y_i - x_i^T b^{(k-1)}}.
$$
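As an illustration of the weighted least squares iteration for (10.8), the sketch below uses Huber's ψ function and a scale estimate based on the median absolute deviation. Both of those choices, the tuning constant 1.345, the function names, and the data are assumptions made for the example; the text does not prescribe a particular ρ or scale estimator. The weights are taken as ψ(u)/u with u the scaled residual, a common form of the reweighting that differs from the expression above only by a factor common to all observations.

```python
import numpy as np

def huber_psi(u, c=1.345):
    """Huber's psi: identity in [-c, c], constant outside (an assumed choice)."""
    return np.clip(u, -c, c)

def m_estimate_irls(X, y, c=1.345, n_iter=100, tol=1e-10):
    """M estimate for a linear model by iteratively reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ b
        # Scale from the median absolute deviation (an assumed estimator).
        s = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
        u = r / s
        # Weight psi(u)/u, with the limiting value 1 at u = 0.
        w = np.ones_like(u)
        nz = np.abs(u) > 1e-12
        w[nz] = huber_psi(u[nz], c) / u[nz]
        sw = np.sqrt(w)
        b_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    return b

# Hypothetical data: slope near 0.5, with the last observation an outlier.
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = np.array([2.1, 2.4, 3.2, 3.4, 4.1, 4.4, 5.2, 5.4, 6.1, 12.0])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_huber = m_estimate_irls(X, y)
print(b_ols, b_huber)
```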
Several other robust methods for regression fitting have been suggested, for example, trimmed least squares regression (Ruppert and Carroll, 1980); least trimmed squares regression and least trimmed absolute values regression (Bassett, 1991; Hawkins and Olive, 1999, gave an algorithm); and least median of squares regression (see Rousseeuw, 1984). Souvaine and Steele (1987) discussed algorithms. Hawkins (1993b) gives a feasible set algorithm. Xu and Shiue (1993) give a parallel algorithm.

Functional Least Squares Regression

Welsh (1985) (see Meintanis and Donatos, 1997)

Regression with Equality Constraints Using Iteratively Reweighted Least Squares

It is easy to see that the generalized least squares solution for the model y = Xβ + ε, with the linear equality constraints Lβ = c, is

$$
\hat\beta_{W,C} = (X^T W X)^+ X^T W y + (X^T W X)^+ L^T \left(L (X^T W X)^+ L^T\right)^+ \left(c - L (X^T W X)^+ X^T W y\right),
$$

for the weights in W. Thus, it is straightforward to compute any of the iteratively reweighted least squares estimates subject to equality constraints.
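The closed form for the constrained estimate translates almost symbol for symbol into code. The sketch below assumes NumPy's `pinv` for the Moore-Penrose inverse; the function name `constrained_wls`, the weights, the constraint, and the data are all hypothetical choices for illustration.

```python
import numpy as np

def constrained_wls(X, y, W, L, c):
    """Weighted least squares estimate subject to L b = c, via the closed form."""
    A_pinv = np.linalg.pinv(X.T @ W @ X)          # (X'WX)^+
    b_w = A_pinv @ X.T @ W @ y                    # unconstrained weighted fit
    M_pinv = np.linalg.pinv(L @ A_pinv @ L.T)     # (L (X'WX)^+ L')^+
    # Correct the unconstrained fit by the amount needed to satisfy L b = c.
    return b_w + A_pinv @ L.T @ M_pinv @ (c - L @ b_w)

def weighted_rss(X, y, W, b):
    r = y - X @ b
    return r @ W @ r

# Hypothetical quadratic model with the constraint beta_1 + beta_2 = 1.
x = np.arange(6.0)
X = np.column_stack([np.ones(6), x, x ** 2])
y = np.array([1.0, 2.7, 5.8, 11.1, 17.9, 27.2])
W = np.diag([1.0, 1.0, 0.5, 0.5, 2.0, 2.0])
L = np.array([[0.0, 1.0, 1.0]])
c = np.array([1.0])
b_c = constrained_wls(X, y, W, L, c)
print(b_c, L @ b_c)
```

Within an IRLS loop, W would be recomputed from the current residuals at each iteration and this function called in place of the unconstrained weighted fit.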
10.2.3
Fitting by Minimizing Residuals while Controlling Influence+
Some of the methods of fitting that we discussed in the previous section are based just on minimizing some function of the residual vector, $\hat{r} = y - f(X; \beta)$. If the function of the residual vector increases without bound when only a relatively small number of the observations have large residuals, that small set of observations can exert an undue influence on the fit. This is the case when the function is the L2 norm. The purpose of considering other functions of the residuals is to make the fit robust to a few large residuals. In the case of the L1 norm, up to 50% of the residuals can increase without bound (that is, they can “break down”) without affecting the fit. (Such a high breakdown percentage is rather pointless, because if that many observations do not fit the model well, surely a different model should be considered. Nevertheless, some data analysts seek methods with a high breakdown point. It does have a certain theoretical appeal.)

Although there is no explicit term in the model for aberrations in the values of X, such outlying values can also cause problems in fitting the model. The values of some independent variables may be such that those observations exert an unduly large influence on the fit.

Reweighting with Both Residuals and Leverages

The influence due to the independent variables is called leverage, and in linear regression models that are fit by least squares, it is measured by the diagonal of the hat matrix $X(X^T X)^{-1} X^T$. Chatterjee and Mächler (1997) propose a simple method of iteratively reweighted least squares that incorporates the leverage.

Minimum Volume Ellipsoids

Rousseeuw and Leroy (1987) recommended search. Woodruff and Rocke (1993) studied use of simulated annealing, genetic algorithms, and tabu search.
10.2.4
Fitting with Constraints+
Often in estimation problems we know the parameter must lie in some particular region. Occasionally, we may know that some function of the parameters must satisfy an equality. Equality constraints can generally be incorporated into the estimation procedure rather easily. It is more common to have inequality constraints, such as restrictions that one or more parameters are nonnegative. These kinds of restrictions are somewhat more difficult to satisfy.
Least Squares with Restrictions

Consider the linear regression model y = Xβ + ε, with the linear equality constraints Lβ = c. Determining the value of β that yields a minimum sum of squares of residuals and satisfies these linear equality constraints is a particularly simple problem, and it has a closed-form solution. It is far more common, however, to have constraints of the form Lβ ≤ c. This problem is much harder. A common approach to computations with restrictions is to use branch and bound methods with fathoming in a tree.

Figure 10.9: A Tree for Branch and Bound
10.2.5
Subset Regression; Variable Selection+
Leaps and bounds: Furnival and Wilson (1974).
Tabu search: Drezner, Marcoulides, and Salhi (1999), tabu search model selection in multiple regression analysis.

Principal Components Regression+

Partial Least Squares+

Comparisons: Frank and Friedman (1993).
10.2.6
Multiple Criteria Fitting+
A new technique in model fitting is the use of multiple objectives, for example, minimizing several norms of the residual vector. Various techniques from multiple criteria optimization can be used. One useful formulation is an objective function that is a weighted combination of norms. A very simple one is

$$
w L_2(r) + (1 - w) L_1(r).
$$

This multicriteria objective is implemented in an interactive computer program (that uses a general unconstrained minimization routine). With this objective function, the data analyst can vary w and observe the effect on the residual vector r. This adaptive technique can utilize subjective evaluation of plots of the fit and/or residuals. It is useful to do any model fitting with this objective function, even if a traditional fit (L2) will ultimately be used.
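The effect of varying w can be explored with any general unconstrained minimizer. The sketch below is not the interactive program referred to above; it is a minimal stand-in that assumes SciPy's `minimize` with the Nelder-Mead method (chosen because the objective is not differentiable at zero residuals), and the function names and data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def combined_objective(b, X, y, w):
    """w * L2 norm + (1 - w) * L1 norm of the residual vector."""
    r = y - X @ b
    return w * np.sqrt(np.sum(r ** 2)) + (1.0 - w) * np.sum(np.abs(r))

def fit_combined(X, y, w):
    """Minimize the combined objective, starting from the least squares fit."""
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    res = minimize(combined_objective, b0, args=(X, y, w), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 5000})
    return res.x

# Hypothetical data with one large residual.
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = np.array([2.1, 2.4, 3.2, 3.4, 4.1, 4.4, 5.2, 5.4, 6.1, 12.0])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Vary w and observe how the coefficients (and hence the residuals) move.
fits = {w: fit_combined(X, y, w) for w in (0.0, 0.5, 1.0)}
for w, b in fits.items():
    print(w, b)
```

Setting w = 1 recovers the least squares fit, since the L2 norm and the sum of squares have the same minimizer.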
10.3
Maximum Likelihood Estimation
One way of expressing the relation (10.1) is in terms of the distribution of a random variable Y, given values of another variable x and a value of a parameter θ. This interpretation can be made whether or not x and θ are considered to be realizations of random variables. In a Bayesian approach, for example, the parameter θ is usually considered to be a realization of a random variable. If x and θ are assumed to arise from a probability distribution, the analysis may involve the joint distribution of Y and the random variables X and Θ from which x and θ are realized.

If y is a sample from a distribution with a density p(y; x, θ), where x is a given vector of covariates and θ is an unknown (vector) parameter to be estimated, the likelihood function is

$$
L(\theta) = c\, p(y; x, \theta),
$$

where c is a constant. Notice that in the likelihood function the roles of the function argument and the parameter are exchanged; the argument of the likelihood function is the parameter of the probability density function. The maximum likelihood principle of estimation of θ involves maximizing this function of θ.

Suppose, for example, that $y_1, y_2, \ldots, y_n$ is a simple random sample from a gamma distribution with probability density function

$$
p(y) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, y^{\alpha-1} e^{-y/\beta}, \quad \text{for } 0 \le y < \infty,
$$
where α > 0 and β > 0 are unknown parameters and Γ(α) is the complete gamma function. There are no covariates. The likelihood is

$$
L(\alpha, \beta) = \left(\frac{1}{\Gamma(\alpha)\beta^{\alpha}}\right)^{n} e^{-\sum_{i=1}^{n} y_i/\beta} \prod_{i=1}^{n} y_i^{\alpha-1}.
\tag{10.9}
$$
The maximum likelihood estimates (MLE) for α and β are the values that maximize L(α, β) subject to α > 0 and β > 0. In Exercise 10.1, you are asked to obtain these estimates from a simple dataset.

As another example, let $y_1, y_2, \ldots, y_n$ be a simple random sample with no covariates from a normal distribution with unknown mean µ and unknown variance σ² > 0. The likelihood is

$$
L(\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2}.
$$

As is often the case with common probability distributions, rather than trying to determine max L(µ, σ²) directly, it is easier to work with the log of the likelihood, the log-likelihood, $l_L$:

$$
l_L(\theta) = \log L(\theta).
$$

In the case of the normal sample, we have

$$
l_L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2.
$$

It is a simple matter to set the derivatives to zero and solve, yielding

$$
\mu_* = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
$$

and

$$
\sigma_*^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2.
$$
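The closed-form estimates for the normal sample are easy to check numerically. The following sketch uses only the Python standard library; the function names are hypothetical, and the data are the five observations from Exercise 10.1, used here purely as a convenient illustration for the normal case. It compares the log-likelihood at the estimates with its value at nearby parameter values.

```python
import math

def normal_mle(y):
    """Closed-form MLEs: the sample mean and the variance with divisor n."""
    n = len(y)
    mu = sum(y) / n
    sigma2 = sum((yi - mu) ** 2 for yi in y) / n
    return mu, sigma2

def normal_loglik(y, mu, sigma2):
    """Log-likelihood of a normal sample, as given in the text."""
    n = len(y)
    ss = sum((yi - mu) ** 2 for yi in y)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - ss / (2.0 * sigma2))

y = [3.0, 5.0, 2.0, 6.0, 7.0]
mu_hat, s2_hat = normal_mle(y)
l_max = normal_loglik(y, mu_hat, s2_hat)

# The log-likelihood should not increase when either parameter is perturbed.
nearby = [normal_loglik(y, mu_hat + dm, s2_hat + ds)
          for dm in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)]
print(mu_hat, s2_hat, l_max, max(nearby) <= l_max)
```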
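First, we note that $\mu_*$ and $\sigma_*^2$ satisfy the constraints on the parameter space. Next, we check the Hessian at $\theta_* = (\mu_*, \sigma_*^2)$:

$$
\frac{\partial^2 l_L(\theta)}{\partial\theta\,\partial\theta^T}
= -\begin{bmatrix}
\dfrac{n}{\sigma^2} & \dfrac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - \mu) \\[2ex]
\dfrac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - \mu) & \dfrac{1}{\sigma^6}\sum_{i=1}^{n}(y_i - \mu)^2 - \dfrac{n}{2\sigma^4}
\end{bmatrix}
= -\begin{bmatrix}
\dfrac{n}{\sigma_*^2} & 0 \\[2ex]
0 & \dfrac{n}{2\sigma_*^4}
\end{bmatrix},
$$

where the second form is the first evaluated at $\theta_*$.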
This is negative definite, so $(\mu_*, \sigma_*^2)$ is the unique maximizer of $l_L(\mu, \sigma^2)$.

Maximum likelihood estimation for the parameters in a model such as (10.1) requires the identification of a probability distribution. A common formulation of the model is y = f(x; θ) + E, where E is a random variable with probability density p(e; x, θ). If we have a simple random sample $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$, we have indirect observations on E: each $y_i - f(x_i; \theta)$, where θ is the true but unknown value of the parameter. Suppose E has a normal distribution with mean 0 and variance σ². We have the log-likelihood

$$
l_L(\theta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - f(x_i; \theta)\right)^2.
\tag{10.10}
$$

The maximum of the log-likelihood in this case occurs at the minimum of the L2 norm of the vector

$$
r = y - f(x; \theta),
\tag{10.11}
$$

where y is the vector of $y_i$ and f(x; θ) is the vector of $f(x_i; \theta)$. If the distribution of the residuals is normal, fitting by maximum likelihood is equivalent to fitting by least squares.

Depending on the form of f(x; θ), it may or may not be a simple matter to determine the maximum of this function. For example, if f(x; θ) is the linear combination of the covariates, $x^T\theta$, where x and θ are vectors of the same length, differentiation of the log-likelihood yields the familiar normal equations

$$
X^T y - X^T X\theta = 0,
$$

where X is the matrix whose rows are the $x_i$ and y is the vector of $y_i$.

Aside from the intuitive appeal of estimates that maximize a “likelihood”, under certain assumptions on the distribution of the underlying random variables and on the range of the parameter space, the maximum likelihood estimators have certain desirable properties. Also, under general assumptions, the asymptotic distribution of the maximum likelihood estimators is known; therefore, approximate statements of inference can be made.

When the parameter space is restricted, the unconstrained maximum likelihood estimates may not be in the allowable range. The correct approach, of course, is to use a constrained optimization procedure to obtain the “restricted maximum likelihood estimates”. Restrictions on the parameter space, however, result in a more complicated distribution of the maximum likelihood estimators.

Another problem that can occur in the use of maximum likelihood is that the likelihood function may have multiple maxima. Computationally, this presents the same kinds of problems discussed in Chapter 8. Gan and Jiang (1999) use the fact that

$$
\mathrm{E}\left(\frac{\partial^2 l_L}{\partial\theta^2}\right) + \mathrm{E}\left(\left(\frac{\partial l_L}{\partial\theta}\right)^2\right) = 0
$$

to develop a statistical test that a maximum of the likelihood function is the global optimum. In some cases maximum likelihood estimation presents interesting problems because of spikes in the likelihood function.
10.3.1
Maximum Likelihood Estimation with Constraints
Maximum Likelihood with Restrictions

See Kim and Taylor (1995).

Journal of Statistical Computation and Simulation, Volume 74, 135-153: “Making REML computationally feasible for large data sets: use of the Gibbs sampler”, David A. Harville.

Abstract: REML (restricted maximum likelihood) has become the preferred method for estimating variance components. Except for relatively simple special cases, the computation of REML estimates requires the use of an iterative algorithm. A number of algorithms have been proposed; they can be classified as derivative-free, first-order, or second-order. The computational requirements of a first-order algorithm are only moderately greater than those of a derivative-free algorithm and are considerably less than those of a second-order algorithm. First-order algorithms include the EM algorithm and various algorithms derived from the REML likelihood equations by the method of successive approximations. They also include so-called linearized algorithms, which appear to have superior convergence properties. With conventional numerical methods, the computations required to obtain the REML iterates can be very extensive, so much so as to be infeasible for very large data sets (with very large numbers of random effects). The Gibbs sampler can be used to compute the iterates of a first-order REML algorithm. This is accomplished by adapting, extending, and enhancing results on the use of the Gibbs sampler to invert positive definite matrices. In computing the REML iterates for large data sets, the use of the Gibbs sampler provides an appealing alternative to the use of conventional numerical methods.
10.4
Optimal Design and Optimal Sample Allocation
Many times the data available to a statistician have been collected haphazardly or for some purpose other than to address the question at hand. Ideally, however, before data are collected, a plan can be developed so as to ensure that the data have high information content for the question to be addressed.
10.4.1
D-Optimal Designs+
When an experiment is designed to explore the effects of some variables (usually called “factors”) on another variable, the settings of the factors (independent variables) should be determined so as to yield a maximum amount of information from a given number of observations. The basic problem is to determine from a set of candidates the best rows for the data matrix X. For example, if there are six factors and each can be set at three different levels, there is a total of $3^6 = 729$ combinations of settings. In many cases, because of the expense in conducting the experiment, only a relatively small number of runs can be made. If, in the case of the 729 possible combinations, only 30 or so runs can be made, the scientist must choose the subset of combinations that will be most informative. A row in X may contain more elements than just the number of factors (because of interactions), but the factor settings completely determine the row.

We may quantify the information in terms of variances of the estimators. If we assume a linear relationship expressed by

$$
y = \beta_0 1 + X\beta + \epsilon,
$$

and make certain assumptions about the probability distribution of the residuals, the variance-covariance matrix of estimable linear functions of the least squares solution is formed from $(X^T X)^{-}\sigma^2$. (The assumptions are that the residuals are independently distributed with a constant variance, σ². We will not dwell on the statistical properties here, however.) If the emphasis is on estimation of β, then X should be of full rank. In the following we assume X is of full rank, that is, that $(X^T X)^{-1}$ exists.

An objective is to minimize the variances of estimators of linear combinations of the elements of β. We may identify three types of relevant measures of the variance of the estimator $\hat\beta$: the average variance of the elements of $\hat\beta$, the maximum variance of any elements, and the “generalized variance” of the vector $\hat\beta$. The property of the design resulting from maximizing the information by reducing these measures of variance is called, respectively, A-optimality, E-optimality, and D-optimality. They are achieved when X is chosen as follows:
• A-optimality: minimize $\mathrm{trace}((X^T X)^{-1})$.
• E-optimality: minimize $\rho((X^T X)^{-1})$.
• D-optimality: minimize $\det((X^T X)^{-1})$.

Pukelsheim (1993) discusses these types of optimal designs and other issues relating to optimal design of experiments. Using the properties of eigenvalues and determinants (see Gentle 1998, Section 2.1.11), we see that E-optimality is achieved by maximizing the smallest eigenvalue of $X^T X$ (the reciprocal of $\rho((X^T X)^{-1})$) and D-optimality is achieved by maximizing $\det(X^T X)$.
The D-optimal criterion is probably used most often. If the residuals have a normal distribution (and the other distributional assumptions are satisfied), the D-optimal design results in the smallest volume of confidence ellipsoids for β. (See Titterington, 1975, Nguyen and Miller, 1992, and Atkinson and Donev, 1992. Identification of the D-optimal design is related to determination of a minimum volume ellipsoid for multivariate data.) Woodruff and Rocke (1993) studied use of simulated annealing for the minimum volume ellipsoid. The computations required for the D-optimal criterion are the simplest, and this may be another reason it is used often.

To construct an optimal X with a given number of rows, n, from a set of N potential rows, one usually begins with an initial choice of rows, perhaps random, and then determines the effect on the determinant of exchanging a selected row with a different row from the set of potential rows. If the matrix X has n rows and the row vector $x^T$ is appended, the determinant of interest is $\det(X^T X + xx^T)$ or its inverse. Using the relationship det(AB) = det(A) det(B), it is easy to see that

$$
\det(X^T X + xx^T) = \det(X^T X)\left(1 + x^T (X^T X)^{-1} x\right).
\tag{10.12}
$$

Now if a row $x_+^T$ is exchanged for the row $x_-^T$, the effect on the determinant is given by

$$
\det(X^T X + x_+ x_+^T - x_- x_-^T) = \det(X^T X) \left(1 + x_+^T (X^T X)^{-1} x_+ - x_-^T (X^T X)^{-1} x_- \left(1 + x_+^T (X^T X)^{-1} x_+\right) + \left(x_+^T (X^T X)^{-1} x_-\right)^2\right).
\tag{10.13}
$$

Following Miller and Nguyen (1994), writing $X^T X$ as $R^T R$ from the QR decomposition of X, and introducing $z_+$ and $z_-$ as the solutions of $R^T z_+ = x_+$ and $R^T z_- = x_-$, we can write the right-hand side of (10.13), apart from the factor $\det(X^T X)$ and the constant 1, as

$$
z_+^T z_+ - z_-^T z_- (1 + z_+^T z_+) + (z_-^T z_+)^2.
\tag{10.14}
$$

Even though there are n(N − n) possible pairs $(x_+, x_-)$ to consider for exchanging, various quantities in (10.14) need be computed only once. The corresponding $(z_+, z_-)$ are obtained by substitution using the triangular matrix $R^T$.
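The exchange quantities can be verified numerically on a small design. The sketch below is illustrative only: the function name `exchange_gain`, the 5 × 3 design, and the candidate row are hypothetical. With $X^T X = R^T R$, it obtains each z by solving the triangular system with $R^T$, so that $z^T z = x^T (X^T X)^{-1} x$. It uses NumPy's general `solve` for brevity; a triangular solver would exploit the structure of R.

```python
import numpy as np

def exchange_gain(R, x_plus, x_minus):
    """Return the quantity (10.14) and its upper bound (10.15).

    R is the triangular factor with X'X = R'R; x_plus enters the design
    and x_minus leaves it.
    """
    z_plus = np.linalg.solve(R.T, x_plus)
    z_minus = np.linalg.solve(R.T, x_minus)
    a = z_plus @ z_plus
    c = z_minus @ z_minus
    b = z_minus @ z_plus
    gain = a - c * (1.0 + a) + b ** 2
    bound = a - c          # Cauchy-Schwarz: b**2 <= a*c, so gain <= a - c
    return gain, bound

# Hypothetical design: intercept plus two factors at levels -1, 0, 1.
X = np.array([[1.0, -1.0, -1.0],
              [1.0,  1.0, -1.0],
              [1.0, -1.0,  1.0],
              [1.0,  0.0,  0.0],
              [1.0,  1.0,  0.0]])
R = np.linalg.qr(X, mode="r")

x_minus = X[3].copy()                    # row considered for removal
x_plus = np.array([1.0, 1.0, 1.0])       # candidate row
gain, bound = exchange_gain(R, x_plus, x_minus)

# Direct check against the determinants before and after the exchange.
X_new = X.copy()
X_new[3] = x_plus
ratio = np.linalg.det(X_new.T @ X_new) / np.linalg.det(X.T @ X)
print(gain, bound, ratio)
```

In a full exchange algorithm the bound would be compared with the best gain found so far before the full quantity is computed for a pair.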
Miller and Nguyen use the Cauchy-Schwarz inequality to show that the quantity (10.14) can be no larger than

$$
z_+^T z_+ - z_-^T z_-;
\tag{10.15}
$$

hence, when considering a pair $(x_+, x_-)$ for exchanging, if the quantity (10.15) is smaller than the largest value of (10.14) found so far, then the full computation of (10.14) can be skipped. Miller and Nguyen also suggest not allowing the last point added to the design to be considered for removal in the next iteration and not allowing the last point removed to be added in the next iteration.

The procedure begins with an initial selection of design points, yielding the n × m matrix $X^{(0)}$ that is of full rank. At the kth step, each row of $X^{(k)}$ is considered for exchange with a candidate point, subject to the restrictions mentioned above. Equations (10.14) and (10.15) are used to determine the best exchange. If no point is found to improve the determinant, the process terminates. Otherwise, when the optimal exchange is determined, $R^{(k+1)}$ is formed using the updating methods discussed in the previous sections. (The programs of Gentleman, 1974, can be used.)

Atkinson (1992) proposed a segmented algorithm using simulated annealing for D-optimality. Jung and Yum (1996) used tabu search for construction of exact D-optimal designs.
10.4.2
Optimal Sample Allocation
Many authors have considered the general problem of optimal sampling design in stratified or multistage sampling (see, for example, Bethel, 1985, 1989a, 1989b, and Mergerson, 1988, 1989). The techniques generally involve some kind of mathematical programming (see Huddleston, Claypool, and Hocking, 1970, or Arthanari and Dodge, 1981). Chromy (1987) discusses surveys that produce multiple estimates, and considers a multicriteria optimization approach to minimize the variances of several estimators. This is, of course, a very common situation in survey sampling. Another situation that may result in multiple criteria is the case of different surveys that share the same design, but which may have differing frequencies of data collection.

The survey design is often a two-stage sample in which both stages are stratified samples. The objective is to allocate the sample over the various strata so as to minimize the variances of the estimators. The estimators are ratios of weighted sums of stratified sample quantities. Their variances generally must be approximated by forming linear approximations of the estimators. Even if a reasonable a priori formulation of a single objective were possible, it is generally desirable to explore the space of tradeoffs within the feasible region that contains near-optimal points.
The approximation to the variance of the ith estimator is generally of the form

$$
V_i = \sum_h N_h \left(\frac{N_h}{n_h} - 1\right) v_{1h} + \sum_h \frac{N_h^2}{n_h m_h}\, v_{2h},
\tag{10.16}
$$
where $N_h$ is the population size (number of establishments) in the hth stratum and the $v_{1h}$ and $v_{2h}$ are the variance components of the first and second stages (see Valliant and Gentle, 1997, and Gentle, Narula, and Valliant, 1997). These are estimated from previous surveys. The number of establishments sampled in the hth stratum is $n_h$ (the first stage), and the number of occupations sampled within each establishment in that stratum is $m_h$ (the second stage). (In practice, there may be slight variations in the numbers of occupations sampled within the establishments in a given stratum.)

The basic problem is to minimize variances of the estimators, subject to constraints on the total sample size (roughly, the cost) and on the variances of certain estimators. The “decision variables”, that is, the variables of the optimization problem, are the sample sizes in the individual strata. In addition to the constraints, there are also simple bounds on the sample sizes. A lower bound of at least 2 is desirable, so as to allow computation of an estimate of the variance within each stratum. The optimization problem, therefore, has the general form:

$$
\begin{array}{ll}
\min\limits_{n_h, m_h \in I} & \sum_i w_i V_i(n_h, m_h) \\
\text{s.t.} & 2 \le n_h \le N_h \\
 & 2 \le m_h \le M_h \\
 & \sum c_h n_h \le B_1 \\
 & \sum n_h m_h / \sum n_h \le B_2 \\
 & V_k \le V_{0k}
\end{array}
\tag{10.17}
$$

The upper bound on the second stage, $M_h$, is a small integer (between 4 and 12) that depends on the total employment in the hth stratum. The restriction to integers may or may not be important.

In the program built by Valliant and Gentle (1997) the user can interactively adjust the bounds in the constraints. Setting the bounds very large is equivalent to removing the constraint.

It is often useful to define some constraints also to be components of the objective function. For example, in problem (10.17), instead of the objective function shown, we may form the objective function as

$$
\sum_i w_i V_i(n_h, m_h) + w_0 \sum_h c_h n_h.
\tag{10.18}
$$
Although a quantity like (10.18) may not make much sense (being the sum of variances and a total cost), this is a useful way to formulate the objective
function. It is a weighted combination of the objective function in problem (10.17) and one of the constraints in that problem. If the programmer sets up the objective function in this way, the user can set $w_0$ to zero and have a problem just as the original one. The user can also set $w_i$ to zero for all i ≠ 0, and set $B_1$ (the bound on the total cost) to a very large value. In this case the optimization problem is one of minimizing the total cost subject to given bounds on the variances.

The optimization problem has a nonlinear objective function and nonlinear constraints. The variables (the sample sizes) are restricted to integer values. Fortunately, however, the restriction to integer values usually does not make a lot of difference. The effects of using integer solutions that are near to the continuous solution can be investigated by assigning them as trial allocations. In the continuous version of this problem (without the integer restrictions), both the objective function and the constraints are smooth.

For the multiple linear regression model,

$$
y = X\beta + \epsilon,
\tag{10.19}
$$

robust estimators are often defined by an optimization problem whose objective is to minimize some other function of the residuals,

$$
r_i = y_i - \sum_{j=1}^{k} \beta_j x_{ij}.
$$

Examples of objective functions are ones based on an Lp norm,

$$
\min_{\beta} \sum_{i=1}^{n} |r_i|^p,
$$

for some p ≥ 1, or on a more general function of the scaled residuals,

$$
\min_{\beta} \sum_{i=1}^{n} \phi\left(\frac{r_i}{s}\right),
$$

where φ is a convex function, and s is some scale parameter (possibly estimated from the given data).

Because some criteria are better in one situation while other criteria are better in other situations, the idea of combining the criteria arises naturally. So long as the functions of the residuals to be minimized are norms, it is an obvious extension to form an objective function that is the weighted sum of two or more norms, because any such function is still a norm. Gentle, Kennedy, and Sposito (1976) suggest combining the least squares and least absolute values norms in regression fitting. Narula and Wellington (1979) propose an objective function that is a weighted sum of the sum of absolute residuals and the maximum absolute residual (the L∞ norm).
In general, a weighted combined criterion, based on q individual criteria, is

$$
\min_{\beta} \sum_{j=1}^{q} \sum_{i=1}^{n} w_j m_j\left(\frac{r_i}{s}\right),
$$

where $w_j \ge 0$.

Another problem for which there is no obvious single objective function is probability density estimation. The basic problem in density estimation is, given data $x_1, x_2, \ldots, x_n$ from an unknown population, to estimate the probability density function, p(x). The density function satisfies p(x) ≥ 0 and

$$
\int_{\mathbb{R}^d} p(x)\, dx = 1.
$$

It is known, however, that no estimator $\hat{p}$ exists that is unbiased in the sense that

$$
\mathrm{E}_p(\hat{p}(x)) = p(x), \quad \forall x \in \mathbb{R}^d,
$$

and that also has the properties that $\hat{p}(x) \ge 0$ and $\int_{\mathbb{R}^d} \hat{p}(x)\, dx = 1$. Hence, in general, we seek estimators having various types of consistency and having some optimal property, such as minimum mean-squared-error. The object being estimated is a function, so the criteria are usually applied to an integral of some function involving $\hat{p}$. A common measure to minimize is the asymptotic mean integrated squared error (AMISE).

For the commonly-used fixed-window (univariate) kernel density estimator of p, which is of the form

$$
\hat{p}_h(x) = (nh)^{-1} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right),
$$

the variable of the optimization problem is just the window width h. (How to choose K is an optimization problem in functional analysis, and the optimal choice of the kernel does not yield significant gains.) Solving the optimization problem in h involves use of estimates of functionals of the density.

Although the optimization problem (minimizing the estimated AMISE in the variable h) is not simple, a more appropriate objective function might be one with explicit weights on the two components of the MISE, the bias squared and the variance. Most approaches to the problem use equal weights for the two components, although at the asymptotic optimum the variance component is generally several times the bias component.

The choice of the window width is a choice between the two components of the MISE. As h increases, the bias increases; as h decreases, the variance increases. In heuristic terms, as h increases, the smoothness increases (so structure may be obscured); as h decreases, the roughness increases (so noise increases).
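The fixed-window kernel estimator is short enough to write out directly, which makes it easy to try several window widths on the same data. The sketch below uses only the Python standard library and assumes a Gaussian kernel for K; the kernel choice, the function names, the evaluation grid, and the data are illustrative assumptions. As a sanity check it also approximates the integral of the estimate by a Riemann sum for each h.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Fixed-window kernel density estimate (nh)^(-1) * sum K((x - x_i)/h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)

data = [3.0, 5.0, 2.0, 6.0, 7.0]           # hypothetical observations
step = 0.01
grid = [-10.0 + step * i for i in range(3001)]

# Evaluate the estimate for several window widths and check that each
# integrates to approximately 1 over a grid wide enough to hold the tails.
areas = {}
peaks = {}
for h in (0.5, 1.0, 2.0):
    values = [kde(x, data, h) for x in grid]
    areas[h] = sum(values) * step
    peaks[h] = max(values)
    print(h, round(areas[h], 4), round(peaks[h], 4))
```

Plotting the values for each h against the grid shows the tradeoff described above between smoothness and roughness.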
Density estimation and smoothing, in general, are applications in which an exploratory approach is very useful. Insight and a better understanding of the data can often be obtained by using several different window widths on a given data set. Within the broad general objective of understanding the data, there are several possible objective functions that determine how a model is fit to the data.
10.5
Clustering and Classification*
10.6
Multidimensional Scaling
10.7
Time Series Forecasting*
Exercises

10.1. Consider the problem of obtaining the maximum likelihood estimates for the parameters of a gamma distribution. The likelihood is given in equation (10.9), page 193. Given the observations {3, 5, 2, 6, 7}, use an optimization routine to determine the MLE for α and β. You may wish to work with the log-likelihood $l_L(\alpha, \beta)$. Remember the constraints on α and β.

10.2. Consider the problem of using the maximum likelihood principle to estimate the parameters of a Weibull distribution, which is often used in reliability studies. The probability density function for the (two-parameter) Weibull is

$$
p(x) = \frac{\alpha}{\beta^{\alpha}}\, x^{\alpha-1} e^{-(x/\beta)^{\alpha}}, \quad x \ge 0.
$$
(a) Given the observations {3, 5, 2, 6, 7}, use an optimization routine to determine the MLE for α and β.
(b) Let α = 5 and β = 2 and generate a pseudorandom sample of size 100. (S-Plus has a function rweibull to do this; IMSL has a Fortran subroutine rnwib and a C function random_weibull to do this; or you can use the inverse CDF method.) Now, using the sample, determine the MLE for α and β.

10.3. Consider 5 correlated binary random variables with marginal probabilities .1, .3, .5, .7, .9, and with pairwise correlations

1.0  0.2  0.1  0.2  0.3
     1.0  0.2  0.3  0.1
          1.0  0.1  0.2
               1.0  0.2
                    1.0
For these binary parameters, determine the pairwise normal correlations:

$$
\Phi_2(z_{\pi_i}, z_{\pi_j}; r_{ij}) = \rho_{ij}\sqrt{\pi_i(1 - \pi_i)\pi_j(1 - \pi_j)} + \pi_i\pi_j.
$$
10.4. Consider the multinomial distribution with 4 outcomes, that is, the multinomial with probability function
$$p(x_1, x_2, x_3, x_4) = \frac{n!}{x_1!\,x_2!\,x_3!\,x_4!}\,\pi_1^{x_1}\pi_2^{x_2}\pi_3^{x_3}\pi_4^{x_4},$$
with $n = x_1 + x_2 + x_3 + x_4$ and $1 = \pi_1 + \pi_2 + \pi_3 + \pi_4$. Suppose that we assume that the probabilities are related by a single parameter, θ:
$$\pi_1 = \tfrac{1}{2} + \tfrac{1}{4}\theta, \qquad \pi_2 = \tfrac{1}{4} - \tfrac{1}{4}\theta, \qquad \pi_3 = \tfrac{1}{4} - \tfrac{1}{4}\theta, \qquad \pi_4 = \tfrac{1}{4}\theta,$$
where 0 ≤ θ ≤ 1. This is the example that Dempster, Laird, and Rubin (1977) considered when they studied the EM algorithm. The model goes back to an example discussed by Fisher (1925) in Statistical Methods for Research Workers. Given an observation $(x_1, x_2, x_3, x_4)$, the log likelihood function is
$$l(\theta) = x_1 \log(2 + \theta) + (x_2 + x_3)\log(1 - \theta) + x_4 \log(\theta) + c$$
and
$$\frac{dl(\theta)}{d\theta} = \frac{x_1}{2+\theta} - \frac{x_2 + x_3}{1-\theta} + \frac{x_4}{\theta}.$$
Use Newton, scoring, and the EM algorithm to determine the maximum likelihood estimate for θ using the data that Dempster, Laird, and Rubin used: n = 197 and x = (125, 18, 20, 34). (Note that the equation dl(θ)/dθ = 0 is a quadratic in θ, so it could be solved explicitly.)
10.5. Consider data on the oxidation of ammonia to nitric acid in an industrial process. Data were collected at an industrial plant over a period of 21 consecutive days. The data shown below are from K. A. Brownlee (1965), Statistical Theory and Methodology in Science and Engineering, John Wiley & Sons, Inc. The data are also available as an S-Plus data set.

    % ×10 NH4 → HNO3   air flow   temp.   acid conc.
           y              x1        x2        x3
          42              80        27        89
          37              80        27        88
          37              75        25        90
          28              62        24        87
          18              62        22        87
          18              62        23        87
          19              62        24        93
          20              62        24        93
          15              58        23        87
          14              58        18        80
          14              58        18        89
          13              58        17        88
          11              58        18        82
          12              58        19        93
           8              50        18        89
           7              50        18        86
           8              50        19        72
           8              50        19        79
           9              50        20        80
          15              56        20        82
          15              70        20        91
Use a program for least squares fitting to fit the model $y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$ by minimizing residuals in the following ways.
(a) Least squares of residuals in the y direction.
(b) Least absolute values of residuals in the y direction.
(c) Least $L_{1.5}$ norm of residuals in the y direction.
(d) Least squares of residuals normal to the fitted plane.
(e) Least absolute values of residuals normal to the fitted plane.
(f) Least $L_{1.5}$ norm of residuals normal to the fitted plane.

10.6. Formulate the dual of the linear program (10.5) for the determination of $L_1$ regression coefficients.

10.7. Consider ways of controlling the leverage when fitting $y = X\beta + \epsilon$ by minimizing an $L_p$ norm. First, begin by using artificial data to study influence when fitting using the $L_p$ norm. As before, there are two main aspects to consider in choosing the data: the pattern of X and the values of the residuals in ε. The true values of β are not too important, so β can be chosen as 1. Use 20 observations. First, use just one independent variable ($y_i = \beta_0 + \beta_1 x_i + \epsilon_i$). Generate 20 $x_i$'s more or less equally spaced between 0 and 10; generate 20 $\epsilon_i$'s; and form the corresponding $y_i$'s. Fit the model using iteratively reweighted least squares to obtain the $L_p$ fit, and plot the data and the model. Now, set $x_{20} = 20$, set $\epsilon_{20}$ to various values, form the $y_i$'s, and fit the model for each value. Notice the influence of $x_{20}$. Do similar studies with 3 independent variables. (Do not plot the data, but perform the computations and observe the effect.)

We may measure the influence of a point by the distance from the point to the mean of the x's: $\Delta(x_i, X^T 1/n)$. The mean, $\bar{x} = X^T 1/n$, is an $L_2$ quantity. It may be better to use a different $L_p$-based measure of the center of the $x_i$. Let
$$\tilde{x}_p = \arg\min_a \sum_i \|a - x_i\|_p$$
and consider the $L_p$ distance $\Delta(x_i, \tilde{x}_p) = \|x_i - \tilde{x}_p\|_p$. Now consider an $L_p$ fit of $y = X\beta + \epsilon$ with weights $w_i$ that are values of a decreasing function of $\Delta(x_i, \tilde{x}_p)$. Now, using datasets similar to those used in the previous part of this exercise, study the use of various weighting schemes to control the influence. A weight function that may be interesting is
$$w_i = \min\left(w_{\max},\ \Delta(x_i, \tilde{x}_p)^{\alpha}\right),$$
where $w_{\max}$ is some large number and α is a small negative number. In this weight function you may want to choose α = −0.5, and inspect some values of $\Delta(x_i, \tilde{x}_p)$ prior to choosing $w_{\max}$. (The problem, of course, arises from the possibility of near-zero values of $\Delta(x_i, \tilde{x}_p)$.) Carefully write up a clear description of your study, with tables and plots.
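As a starting point for Exercise 10.7, the iteratively reweighted least squares idea can be sketched in Python for the single-predictor case. This is only a sketch under stated assumptions: the function names, the residual floor `floor`, the fixed iteration count, and the handling of a zero distance in `leverage_weights` are illustrative choices, not prescriptions from the text.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def irls_lp_fit(x, y, p=1.5, n_iter=50, floor=1e-8):
    """Approximate the L_p fit of a line by iteratively reweighted least squares.

    Each pass solves a weighted least squares problem whose weights
    |r_i|^(p-2) come from the previous residuals. The floor on |r_i|
    (an assumption of this sketch) avoids division by zero when p < 2.
    """
    w = [1.0] * len(x)
    for _ in range(n_iter):
        b0, b1 = weighted_line_fit(x, y, w)
        r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        w = [max(abs(ri), floor) ** (p - 2.0) for ri in r]
    return b0, b1


def leverage_weights(deltas, w_max, alpha=-0.5):
    """Weights w_i = min(w_max, delta_i**alpha); a zero distance gets w_max."""
    return [min(w_max, d ** alpha) if d > 0 else w_max for d in deltas]
```

The first pass uses unit weights, so it reproduces the ordinary least squares fit; later passes downweight large residuals when p < 2. The leverage weights would multiply the residual-based weights in a fuller study.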
Appendix A
Solutions and Hints for Selected Exercises

3.11. Let E(X) = µ, and apply the mean value theorem about µ to get $g(X) = g(\mu) + (X - \mu)g'(\xi)$, where ξ is between X and µ. Because g is convex and twice-differentiable, $g''(x) \ge 0$ for all x. If X ≥ ξ ≥ µ, then because $g''$ is nonnegative, $g'(\xi) \ge g'(\mu)$, and so
$$(X - \mu)g'(\mu) \le (X - \mu)g'(\xi) = g(X) - g(\mu).$$
If, on the other hand, X < ξ < µ, then $g'(\xi) \le g'(\mu)$, and we have the same inequality,
$$g'(\mu)(X - \mu) \le (X - \mu)g'(\xi) = g(X) - g(\mu).$$
Taking expectations of both sides, we have $g'(\mu)E(X - \mu) \le E(g(X)) - E(g(\mu))$. But E(X − µ) = 0, and so $g(E(X)) \le E(g(X))$.

3.15. The triangle inequality for the $L_2$ norm on vectors is
$$\sqrt{\sum (x_i + y_i)^2} \le \sqrt{\sum x_i^2} + \sqrt{\sum y_i^2},$$
or
$$\sum (x_i + y_i)^2 \le \sum x_i^2 + 2\sqrt{\sum x_i^2}\sqrt{\sum y_i^2} + \sum y_i^2.$$
Now,
$$\sum (x_i + y_i)^2 = \sum x_i^2 + 2\sum x_i y_i + \sum y_i^2,$$
and by the Cauchy-Schwarz inequality for vector inner products,
$$\sum x_i y_i \le \sqrt{\sum x_i^2}\sqrt{\sum y_i^2},$$
so the triangle inequality follows.
6.2. The groups are (x1 , x2 , x5 , x6 )
(x3 , x4 , x7 , x8 )
(x9 , x10 , x13 , x14 )
(x11 , x12 , x15 , x16 )
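10.2a. $\hat{\alpha} = 2.7992$, $\hat{\beta} = 5.1889$. The following S-Plus code can be used:

    datavals <- c(3,5,2,6,7)
    mlest <- ms(~-sum(log(dweibull(datavals,walpha,wbeta))),
                start=list(walpha=1,wbeta=1))

10.4. For Newton, the negative of the Hessian (the observed information) is
$$\frac{x_1}{(2+\theta)^2} + \frac{x_2 + x_3}{(1-\theta)^2} + \frac{x_4}{\theta^2},$$
and for scoring, the expected value of the information is
$$\frac{n}{4}\left(\frac{1}{2+\theta} + \frac{2}{1-\theta} + \frac{1}{\theta}\right),$$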
which we obtain by taking $E(X_i)$ for each element of the multinomial random variable. Using the Matlab statements

    function [l, dl, ie] = fishnr(x,t)
    % log likelihood, score, and observed information at t
    l = x(1)*log(2+t) + (x(2)+x(3))*log(1-t) + x(4)*log(t);
    dl = x(1)/(2+t) - (x(2)+x(3))/(1-t) + x(4)/t;
    ie = x(1)/(2+t)^2 + (x(2)+x(3))/(1-t)^2 + x(4)/t^2;

and

    function [l, dl, ei] = fishscor(x,t)
    % log likelihood, score, and expected information at t
    l = x(1)*log(2+t) + (x(2)+x(3))*log(1-t) + x(4)*log(t);
    dl = x(1)/(2+t) - (x(2)+x(3))/(1-t) + x(4)/t;
    ei = sum(x)*(1/(2+t) + 2/(1-t) + 1/t)/4;

to define functions, we iterate over the statements

    [l, dl, ie] = fishnr(x,t);
    t = t + dl/ie

and

    [l, dl, ei] = fishscor(x,t);
    t = t + dl/ei

Beginning with t = 0.5, with Newton-Raphson we get 0.6364, 0.6270, 0.6268, 0.6268, and for scoring we get 0.6332, 0.6265, 0.6268, 0.6268.

To use the EM algorithm on this problem, we can think of a multinomial with five classes, which is formed from the original multinomial by splitting the first class into two with associated probabilities 1/2 and θ/4. The original variable $x_1$ is now the sum of $x_{11}$ and $x_{12}$. Under this reformulation, we now have a maximum likelihood estimate of θ by considering $x_{12} + x_4$ (or $x_2 + x_3$) to be a realization of a binomial with $n = x_{12} + x_4 + x_2 + x_3$ and π = θ (or 1 − θ). However, we do not know $x_{12}$
(or $x_{11}$). Proceeding as if we had a five-outcome multinomial observation with two missing elements, we have the log likelihood for the complete data,
$$l_c(\theta) = (x_{12} + x_4)\log(\theta) + (x_2 + x_3)\log(1 - \theta),$$
and the maximum likelihood estimate for θ is
$$\frac{x_{12} + x_4}{x_{12} + x_2 + x_3 + x_4}.$$
The E-step of the iterative EM algorithm fills in the missing or unobservable value with its expected value given a current value of the parameter, $\theta^{(k)}$, and the observed data. Because $l_c(\theta)$ is linear in the data, we have
$$E(l_c(\theta)) = E(x_{12} + x_4)\log(\theta) + E(x_2 + x_3)\log(1 - \theta).$$
Under this setup, with $\theta = \theta^{(k)}$,
$$E_{\theta^{(k)}}(x_{12}) = x_1\,\frac{\tfrac{1}{4}\theta^{(k)}}{\tfrac{1}{2} + \tfrac{1}{4}\theta^{(k)}} = x_{12}^{(k)}.$$
We now maximize $E_{\theta^{(k)}}(l_c(\theta))$. This maximum occurs at
$$\theta^{(k+1)} = (x_{12}^{(k)} + x_4)/(x_{12}^{(k)} + x_2 + x_3 + x_4).$$
The following Matlab statements will execute a single iteration.

    function [x12kp1,tkp1] = em(tk,x)
    % E-step: expected value of x12 given the current theta
    x12kp1 = x(1)*tk/(2+tk);
    % M-step: binomial MLE with x12 filled in
    tkp1 = (x12kp1 + x(4))/(sum(x)-x(1)+x12kp1);

Beginning with t = 0.5, we get 0.6082 ...

10.6.
$$\min_{d}\ y^T d \quad \text{s.t.} \quad X^T d = 0, \quad -1 \le d_i \le 1 \ \text{ for all } i.$$
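Returning to the solution of Exercise 10.4 above, the three iterations can also be sketched in Python and checked against the explicit root of the quadratic that the exercise mentions. The function names below are illustrative assumptions that mirror the Matlab statements, and the closed-form expression comes from clearing denominators in dl(θ)/dθ = 0, which gives nθ² − (x1 − 2x2 − 2x3 − x4)θ − 2x4 = 0.

```python
import math

x = (125, 18, 20, 34)  # data used by Dempster, Laird, and Rubin (1977)


def score(x, t):
    """First derivative of the log likelihood at t."""
    return x[0] / (2 + t) - (x[1] + x[2]) / (1 - t) + x[3] / t


def observed_info(x, t):
    """Negative second derivative of the log likelihood at t."""
    return x[0] / (2 + t) ** 2 + (x[1] + x[2]) / (1 - t) ** 2 + x[3] / t ** 2


def expected_info(x, t):
    """Expected information, using E(X_i) = n * pi_i."""
    return sum(x) * (1 / (2 + t) + 2 / (1 - t) + 1 / t) / 4


def newton_step(x, t):
    return t + score(x, t) / observed_info(x, t)


def scoring_step(x, t):
    return t + score(x, t) / expected_info(x, t)


def em_step(x, t):
    x12 = x[0] * t / (2 + t)                      # E-step
    return (x12 + x[3]) / (sum(x) - x[0] + x12)   # M-step


def iterate(step, x, t, n):
    """Apply `step` n times from t and return the list of iterates."""
    path = []
    for _ in range(n):
        t = step(x, t)
        path.append(t)
    return path


def closed_form(x):
    """Positive root of n*t^2 - b*t - 2*x4 = 0 with b = x1 - 2*x2 - 2*x3 - x4."""
    n = sum(x)
    b = x[0] - 2 * (x[1] + x[2]) - x[3]
    return (b + math.sqrt(b * b + 8 * n * x[3])) / (2 * n)
```

Calling `iterate(newton_step, x, 0.5, 4)` or `iterate(scoring_step, x, 0.5, 4)` traces the sequences quoted above; EM needs more iterations because its convergence is only linear.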
Appendix B
Notation and Definitions

All notation used in this work is “standard”, and in most cases it conforms to the ISO conventions. (The notable exception is the notation for vectors.) I have opted for simple notation, which, of course, results in a one-to-many map of notation to object classes. Within a given context, however, the overloaded notation is generally unambiguous. I have endeavored to use notation consistently. This appendix is not intended to be a comprehensive listing of definitions. The Subject Index, beginning on page 241, is a more reliable set of pointers to definitions, except for symbols that are not words.
General Notation

Uppercase italic Latin and Greek letters, A, B, E, Λ, and so on are generally used to represent either matrices or random variables. Random variables are usually denoted by letters nearer the end of the Latin alphabet, X, Y, Z, and by the Greek letter E. Parameters in models (that is, unobservables in the models), whether or not they are considered to be random variables, are generally represented by lowercase Greek letters. Uppercase Latin and Greek letters, especially P, in general, and Φ, for the normal distribution, are also used to represent cumulative distribution functions. Also, uppercase Latin letters are used to denote sets. Lowercase Latin and Greek letters are used to represent ordinary scalar or vector variables and functions. No distinction in the notation is made between scalars and vectors; thus, β may represent a vector and $\beta_i$ may represent the ith element of the vector β. In another context, however, β may represent a scalar. All vectors are considered to be column vectors, although we may write a vector as $x = (x_1, x_2, \ldots, x_n)$. Transposition of a vector or a matrix is denoted by a superscript T. Uppercase calligraphic Latin letters, F, V, W, and so on, are generally used to represent either vector spaces or transforms. Subscripts generally represent indexes to a larger structure; for example, $x_{ij}$ may represent the (i, j)th element of a matrix, X. A subscript in parentheses represents an order statistic. A superscript in parentheses represents an iteration; for example, $x_i^{(k)}$ may represent the value of $x_i$ at the kth step of an iterative process.
$x_i$
The ith element of a structure (including a sample, which is a multiset).
$x_{(i)}$
The ith order statistic.
$x^{(i)}$
The value of x at the ith iteration.
Realizations of random variables and placeholders in functions associated with random variables are usually represented by lowercase letters corresponding to the uppercase letters; thus, ε may represent a realization of the random variable E. A single symbol in an italic font is used to represent a single variable. A Roman font or a special font is often used to represent a standard operator or a standard mathematical structure. Sometimes, a string of symbols in a Roman font is used to represent an operator (or a standard function); for example, exp represents the exponential function, but a string of symbols in an italic font on the same baseline should be interpreted as representing a composition (probably by multiplication) of separate objects; for example, $exp$ represents the product of e, x, and p. A fixed-width font is used to represent computer input or output; for example, a = bx + sin(c). In computer text, a string of letters or numerals with no intervening spaces or other characters, such as bx above, represents a single object, and there is no distinction in the font to indicate the type of object. Some important mathematical structures and other objects are:
IR
The field of reals, or the set over which that field is defined.
IR^d
The usual d-dimensional vector space over the reals, or the set of all d-tuples with elements in IR.
IR^d_+
The set of all d-tuples with positive real elements.
IC
The field of complex numbers, or the set over which that field is defined.
ZZ
The ring of integers, or the set over which that ring is defined.
IG(n)
A Galois field defined on a set with n elements.
C^0, C^1, C^2, . . .
The set of continuous functions, the set of functions with continuous first derivatives, and so forth.
i
The imaginary unit, $\sqrt{-1}$.
Computer Number Systems

Computer number systems are used to simulate the more commonly used number systems. It is important to realize that they have different properties, however. Some notation for computer number systems follows.
IF
The set of floating-point numbers with a given precision, on a given computer system, or this set together with the four operators, +, -, *, and /. In some useful ways, IF is similar to IR; see page 12.
II
The set of fixed-point numbers with a given length, on a given computer system, or this set together with the four operators, +, -, *, and /. In some useful ways, II is similar to ZZ; see page 21.
$e_{\min}$ and $e_{\max}$
The minimum and maximum values of the exponent in the set of floating-point numbers with a given length (see page 13).
$\epsilon_{\min}$ and $\epsilon_{\max}$
The minimum and maximum spacings around 1 in the set of floating-point numbers with a given length (see page 14).
$\epsilon$ or $\epsilon_{\mathrm{mach}}$
The machine epsilon, the same as $\epsilon_{\min}$ (see page 14).
[·]c
The computer version of the object · (see page 17).
NaN
Not-a-Number (see page 16).
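The spacings around 1 and the behavior of NaN can be observed directly. The following Python sketch assumes IEEE double-precision arithmetic with round-to-nearest (which standard Python floats use on common platforms); the function names are illustrative. The smaller spacing lies just below 1 and the larger just above 1, which is presumably what the two spacing symbols above denote.

```python
import math
import sys


def spacing_above_one():
    """Gap between 1.0 and the next larger float, found by repeated halving."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


def spacing_below_one():
    """Gap between 1.0 and the next smaller float, found by repeated halving."""
    eps = 1.0
    while 1.0 - eps / 2.0 < 1.0:
        eps /= 2.0
    return eps


nan = float("nan")
nan_equals_itself = (nan == nan)  # NaN compares unequal to everything, itself included
```

Note that `sys.float_info.epsilon` reports the spacing above 1, so it is twice the spacing below 1; which of the two a given author calls "machine epsilon" varies.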
General Mathematical Functions and Operators

Functions such as sin, max, span, and so on that are commonly associated with groups of Latin letters are generally represented by those letters in a roman font. Generally, the argument of a function is enclosed in parentheses, for example, sin(x), but often for the very common functions, the parentheses are omitted: sin x. In expressions involving functions, parentheses are generally used for clarity, for example, $(E(X))^2$ instead of $E^2(X)$. Operators such as d (the differential operator) that are commonly associated with a Latin letter are generally represented by that letter in a roman font.
×
Binary operator denoting multiplication of elements of a field or ring.
×
Binary operator denoting the cartesian product of two sets. The result is the set of ordered pairs of elements from the operand sets. This product is also called the direct product and the cross product.
×
Binary operator denoting the cross product of two vectors in IR3 . The phrase “cross product” is also used to refer to elementwise multiplication of the values of a variable, but the symbol × is not used to represent this operation.
|x|
The modulus of the real or complex number x; if x is real, |x| is the absolute value of x.
⌈x⌉
The ceiling function evaluated at the real number x: ⌈x⌉ is the smallest integer greater than or equal to x.
⌊x⌋
The floor function evaluated at the real number x: ⌊x⌋ is the largest integer less than or equal to x.
#S
The cardinality of the set S.
$I_S(\cdot)$
The indicator function:
$$I_S(x) = \begin{cases} 1, & \text{if } x \in S; \\ 0, & \text{otherwise.} \end{cases}$$
If x is a scalar, the set S is often taken as the interval (−∞, y], and in this case, the indicator function is the Heaviside function, H, evaluated at the difference of the argument and the upper bound on the interval: $I_{(-\infty,y]}(x) = H(y - x)$. (An alternative definition of the Heaviside function is the same as this, except that H(0) = 1/2.) In higher dimensions, the set S is often taken as the product set,
$$A^d = (-\infty, y_1] \times (-\infty, y_2] \times \cdots \times (-\infty, y_d] = A_1 \times A_2 \times \cdots \times A_d,$$
and in this case, $I_{A^d}(x) = I_{A_1}(x_1) I_{A_2}(x_2) \cdots I_{A_d}(x_d)$, where $x = (x_1, x_2, \ldots, x_d)$. The derivative of the indicator function is the Dirac delta function, δ(·).
δ(·)
The Dirac delta “function”, defined by
$$\delta(x) = 0 \ \text{ for } x \ne 0, \qquad \int_{-\infty}^{\infty} \delta(t)\,dt = 1.$$
The Dirac delta function is not a function in the usual sense. We do, however, refer to it as a function. For any continuous function f, we have the useful fact
$$\int_{-\infty}^{\infty} f(y)\,dI_{(-\infty,y]}(x) = \int_{-\infty}^{\infty} f(y)\,\delta(y - x)\,dy = f(x).$$
min f(·) or min(S)
The minimum value of the real scalar-valued function f, or the smallest element in the countable set of real numbers S.
argmin f(·)
The value of the argument of the real scalar-valued function f that yields its minimum value.
⊕
Bitwise binary exclusive-or. The operator also is used as the direct sum of vector spaces.
O(f (n))
Big O; g(n) = O(f (n)) means there exists a positive constant M such that |g(n)| ≤ M |f (n)| as n → ∞. g(n) = O(1) means g(n) is bounded from above.
o(f (n))
Little o; g(n) = o(f (n)) means g(n)/f (n) → 0 as n → ∞. g(n) = o(1) means g(n) → 0 as n → ∞.
oP (f (n))
Convergent in probability; $X(n) = o_P(f(n))$ means that for any positive ε, Pr(|X(n)/f(n)| > ε) → 0 as n → ∞.
Ω(f (n))
Big Ω; g(n) = Ω(f (n)) means there exists a positive constant m such that |g(n)| ≥ m|f (n)| as n → ∞. g(n) = Ω(1) means g(n) is bounded from below.
ω(f (n))
Little ω; g(n) = ω(f(n)) means f(n)/g(n) → 0 as n → ∞. g(n) = ω(1) means |g(n)| → ∞ as n → ∞.
d
The differential operator. The derivative with respect to the variable x is denoted by d/dx.
$f', f'', \ldots, f^{k'}$
For the scalar-valued function f of a scalar variable, differentiation (with respect to an implied variable) taken on the function once, twice, . . ., k times.
$f^T$
For the vector-valued function f, the transpose of f (a row-vector).
∇f
For the scalar-valued function f of a vector variable, the gradient (that is, the vector of partial derivatives), also often denoted as $g_f$.
∇f
For the vector-valued function f of a vector variable, the transpose of the Jacobian, which is often denoted as $J_f$; so $\nabla f = J_f^T$ (see below).
$J_f$
For the vector-valued function f of a vector variable, the Jacobian, that is, the matrix whose (i, j)th element is $\partial f_i(x)/\partial x_j$.
$H_f$ or ∇∇f or $\nabla^2 f$
For the scalar-valued function f of a vector variable, the Hessian. The Hessian is the transpose of the Jacobian of the gradient. Except in pathological cases it is symmetric. The element in position (i, j) is $\partial^2 f(x)/\partial x_i \partial x_j$. The symbol $\nabla^2 f$ is sometimes also used to denote the trace of the Hessian (the sum of its diagonal elements), in which case it is called the Laplacian.
$f \star g$
The convolution of the functions f and g,
$$(f \star g)(t) = \int f(x) g(t - x)\,dx.$$
The convolution is a function.
Cov(f, g)
For the functions f and g whose integrals are zero, the covariance of f and g at lag t;
$$\mathrm{Cov}(f, g)(t) = \int f(x) g(t + x)\,dx.$$
The covariance is a function; its argument is called the lag. Cov(f, f)(t) is called the autocovariance of f at lag t, and Cov(f, f)(0) is called the variance of f.
Corr(f, g)
For the functions f and g whose integrals are zero, the correlation of f and g at lag t;
$$\mathrm{Corr}(f, g)(t) = \frac{\int f(x) g(t + x)\,dx}{\sqrt{\mathrm{Cov}(f, f)(0)\,\mathrm{Cov}(g, g)(0)}}.$$
The correlation is a function; its argument is called the lag. Corr(f, f)(t) is called the autocorrelation of f at lag t.
$f \otimes g$
The tensor product of the functions f and g,
$$(f \otimes g)(w) = f(x) g(y) \quad \text{for } w = (x, y).$$
The operator is also used for the tensor product of two function spaces, and for the Kronecker product of two matrices.
$f^T$ or $T f$
The transform of the function f by the functional T. $f^F$ usually denotes the Fourier transform of f. $f^L$ usually denotes the Laplace transform of f. $f^W$ usually denotes a wavelet transform of f.
δ
A perturbation operator; δx represents a perturbation of x, and not a multiplication of x by δ, even if x is a type of object for which a multiplication is defined.
∆(·, ·)
A real-valued difference function; ∆(x, y) is a measure of the difference of x and y; for simple objects, ∆(x, y) = |x − y|; for more complicated objects, a subtraction operator may not be defined, and ∆ is a generalized difference.
$\tilde{x}$
A perturbation of the object x; $\Delta(x, \tilde{x}) = \delta x$.
Ave(S)
An average (of some kind) of the elements in the set S.
$\langle f^r \rangle_p$
The rth moment of the function f with respect to the density p.
$\bar{x}$
The mean of a sample of objects generically denoted by x.
$\bar{x}$
The complex conjugate of the object x; that is, if x = r + ic, then $\bar{x}$ = r − ic.
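The gradient and Hessian defined in this section can be approximated numerically when analytic derivatives are not at hand. The following Python sketch uses central differences; the function names `gradient` and `hessian` and the default step sizes are illustrative assumptions, and a production routine would choose the steps more carefully.

```python
def gradient(f, x, h=1e-5):
    """Central-difference approximation to the gradient of f at the point x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g


def hessian(f, x, h=1e-4):
    """Central-difference approximation to the Hessian of f at the point x.

    Element (i, j) approximates the second partial derivative of f
    with respect to x_i and x_j.
    """
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            xpp, xpm, xmp, xmm = list(x), list(x), list(x), list(x)
            xpp[i] += h
            xpp[j] += h
            xpm[i] += h
            xpm[j] -= h
            xmp[i] -= h
            xmp[j] += h
            xmm[i] -= h
            xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4.0 * h * h)
    return H


def example(x):
    """Test function x1^2 * x2 + 3 * x2^2 (written with 0-based indexes)."""
    return x[0] ** 2 * x[1] + 3.0 * x[1] ** 2
```

For `example` at the point (1, 2), the exact gradient is (4, 13) and the exact Hessian has rows (4, 2) and (2, 6); the approximate Hessian is symmetric up to rounding, as expected for a smooth function.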
Special Functions

log x
The natural logarithm evaluated at x.
sin x
The sine evaluated at x (in radians), and similarly for other trigonometric functions.
x!
The factorial of x. If x is a positive integer, x! = x(x − 1) · · · 2 · 1. For other values of x, except negative integers, x! is often defined as x! = Γ(x + 1).
Γ(α)
The complete gamma function. For α not equal to a nonpositive integer,
$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\,dt.$$
We have the useful relationship, Γ(α) = (α − 1)!. An important argument is 1/2, and $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.
$\Gamma_x(\alpha)$
The incomplete gamma function:
$$\Gamma_x(\alpha) = \int_0^{x} t^{\alpha-1} e^{-t}\,dt.$$
B(α, β)
The complete beta function:
$$B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1 - t)^{\beta-1}\,dt,$$
where α > 0 and β > 0. A useful relationship is
$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$
$B_x(\alpha, \beta)$
The incomplete beta function:
$$B_x(\alpha, \beta) = \int_0^x t^{\alpha-1}(1 - t)^{\beta-1}\,dt.$$
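The relationships among these special functions are easy to check numerically. The sketch below uses Python's standard `math.gamma` for the complete gamma function and a simple midpoint rule for the incomplete integrals; the helper names and the number of subintervals are illustrative assumptions, and the midpoint rule is adequate only for parameter values at which the integrand is bounded.

```python
import math


def midpoint(f, lo, hi, n=2000):
    """Midpoint-rule approximation to the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))


def beta(a, b):
    """Complete beta function via the gamma-function relationship."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def incomplete_gamma(x, a):
    """Integral of t^(a-1) e^(-t) from 0 to x (assumes a >= 1 here)."""
    return midpoint(lambda t: t ** (a - 1) * math.exp(-t), 0.0, x)


def incomplete_beta(x, a, b):
    """Integral of t^(a-1) (1-t)^(b-1) from 0 to x (assumes a, b >= 1 here)."""
    return midpoint(lambda t: t ** (a - 1) * (1 - t) ** (b - 1), 0.0, x)
```

With these definitions, `incomplete_beta(1.0, a, b)` should agree with `beta(a, b)` to within the quadrature error, and `incomplete_gamma(x, 1)` should be close to 1 − exp(−x).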
Vectors, Vector Spaces, and Matrices

sign(x)
For the vector x, a vector of units corresponding to the signs:
$$\mathrm{sign}(x)_i = \begin{cases} 1 & \text{if } x_i > 0, \\ 0 & \text{if } x_i = 0, \\ -1 & \text{if } x_i < 0; \end{cases}$$
with a similar meaning for a scalar. The sign function is also sometimes called the signum function, and denoted sgn(·).
$L_p$
For real p ≥ 1, a norm formed by accumulating the pth powers of the moduli of individual elements in an object and then taking the (1/p)th power of the result.
‖·‖
In general, the norm of the object ·. Often, however, specifically either the $L_2$ norm, or the norm defined by an inner product.
$\|\cdot\|_p$
In general, the $L_p$ norm of the object ·.
$\|x\|_p$
For the vector x, the $L_p$ norm:
$$\|x\|_p = \left(\sum |x_i|^p\right)^{1/p}.$$
$\|X\|_p$
For the matrix X, the $L_p$ norm:
$$\|X\|_p = \max_{\|v\|_p = 1} \|Xv\|_p.$$
$\|f\|_p$
For the function f, the $L_p$ norm:
$$\|f\|_p = \left(\int |f(x)|^p\,dx\right)^{1/p}$$
(see page 50).
$\|X\|_F$
For the matrix X, the Frobenius norm:
$$\|X\|_F = \sqrt{\sum_{i,j} x_{ij}^2}.$$
⟨x, y⟩
The inner product of x and y.
$\kappa_p(A)$
The $L_p$ condition number of the nonsingular square matrix A with respect to inversion.
diag(v)
For the vector v, the diagonal matrix whose nonzero elements are those of v; that is, the square matrix, A, such that $A_{ii} = v_i$ and for i ≠ j, $A_{ij} = 0$.
diag($A_1, A_2, \ldots, A_k$)
The block diagonal matrix whose submatrices along the diagonal are $A_1, A_2, \ldots, A_k$.
vec(A)
The vector consisting of the columns of the matrix A, all strung into one vector; if the column vectors of A are $a_1, a_2, \ldots, a_m$ then $\mathrm{vec}(A) = (a_1^T, a_2^T, \ldots, a_m^T)$.
vech(A)
For the symmetric matrix A, the vector consisting of the unique elements all strung into one vector: $\mathrm{vech}(A) = (a_{11}, a_{21}, a_{22}, a_{31}, \ldots, a_{m1}, \ldots, a_{mm})$.
trace(A)
The trace of the square matrix A, that is, the sum of the diagonal elements.
rank(A)
The rank of the matrix A, that is, the maximum number of independent rows (or columns) of A.
ρ(A)
The spectral radius of the matrix A (the maximum absolute value of its eigenvalues).
det(A)
The determinant of the square matrix A; det(A) = |A|.
|A|
The determinant of the square matrix A; |A| = det(A).
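Several of the definitions in this section translate directly into a few lines of code. The following Python sketch represents vectors as lists and matrices as lists of rows; all function names are illustrative assumptions. For the matrix case only p = 1 is shown, using the standard fact that the induced $L_1$ norm is the maximum absolute column sum.

```python
import math


def vector_norm(x, p):
    """L_p norm of the vector x for real p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def matrix_norm_1(X):
    """Induced L_1 norm of the matrix X: the maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in X) for j in range(len(X[0])))


def frobenius_norm(X):
    """Square root of the sum of squared elements of X."""
    return math.sqrt(sum(xij ** 2 for row in X for xij in row))


def vec(A):
    """Columns of A strung into one vector."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]


def vech(A):
    """Unique elements of symmetric A in the order a11, a21, a22, a31, ..."""
    return [A[i][j] for i in range(len(A)) for j in range(i + 1)]


def trace(A):
    """Sum of the diagonal elements of the square matrix A."""
    return sum(A[i][i] for i in range(len(A)))
```

For example, `vec([[1, 2], [3, 4]])` strings the columns (1, 3) and (2, 4) together, while `vech` of a symmetric 2 × 2 matrix returns its three unique elements.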
Special Vectors and Matrices

1 or $1_n$
A vector (of length n) whose elements are all 1's.
0 or $0_n$
A vector (of length n) whose elements are all 0's.
I or $I_n$
The (n × n) identity matrix.
$e_i$
The ith unit vector (with implied length).
$E_{jk}$
The (j, k)th elementary permutation matrix.
Models and Data

A form of model used often in statistics and applied mathematics has three parts: a left-hand side representing an object of primary interest; a function of another variable and a parameter, each of which is likely to be a vector; and an adjustment term to make the right-hand side equal the left-hand side. The notation varies depending on the meaning of the terms. One of the most common models used in statistics, the linear regression model with normal errors, is written as
$$Y = \beta^T x + E. \qquad \text{(B.1)}$$
The adjustment term is a random variable, denoted by an uppercase epsilon. The term on the left-hand side is also a random variable. This model does not represent observations or data. A slightly more general form is
$$Y = f(x; \theta) + E. \qquad \text{(B.2)}$$
A single observation or a single data item that corresponds to model (B.1) may be written as $y = \beta^T x + \epsilon$ or, if it is one of several, $y_i = \beta^T x_i + \epsilon_i$. Similar expressions are used for a single data item that corresponds to model (B.2). In these cases, rather than being a random variable, ε or $\epsilon_i$ may be a realization of a random variable, or it may just be an adjustment factor with no assumptions about its origin. A set of n such observations is usually represented in an n-vector y, a matrix X with n rows, and an n-vector ε: $y = X\beta + \epsilon$ or $y = f(X; \theta) + \epsilon$. The model is not symmetric in y and x. The error term is added to the systematic component that involves x. This has implications in estimation and model fitting.
Bibliography

As might be expected, the literature in the interface of computer science, numerical analysis, and statistics is quite diverse; and relevant articles on optimization methods and applications in statistics are likely to appear in journals devoted to quite different disciplines. There are at least ten journals and serials whose titles contain some variants of both “computing” and “statistics”; but there are far more journals in numerical analysis and in areas such as “computational physics”, “computational biology”, and so on that publish articles relevant to the fields of statistical computing. There are two well-known learned societies whose primary focus is in statistical computing: the International Association for Statistical Computing (IASC), which is an affiliated society of the International Statistical Institute, and the Statistical Computing Section of the American Statistical Association (ASA). The Statistical Computing Section of the ASA has a regular newsletter carrying news and notices as well as articles on practicum. Also, the activities of the Society for Industrial and Applied Mathematics (SIAM) are often relevant to statistical computing. There are two regular conferences in the area of statistical computing: COMPSTAT, held biennially in Europe and sponsored by the IASC, and the Interface Symposium, generally held annually in North America and sponsored by the Interface Foundation of North America with cooperation from the Statistical Computing Section of the ASA. In addition to literature and learned societies in the traditional forms, an important source of communication and a repository of information are computer databases and forums. In some cases the databases duplicate what is available in some other form, but often the material and the communications facilities provided by the computer are not available elsewhere.
Literature in Statistical Computing

In the Library of Congress classification scheme, most books on statistics, including statistical computing, are in the QA276 section, although some are classified under H, HA, and HG. Numerical analysis is generally in QA279, and computer science in QA76. Many of the books in the interface of these
disciplines are classified in these or other places within QA. Current Index to Statistics, published annually by the American Statistical Association and the Institute for Mathematical Statistics, contains both author and subject indexes that are useful in finding journal articles or books in statistics. The Index is available in hard copy and on CD-ROM. The CD-ROM version with software developed by Ron Thisted and Doug Bates is particularly useful. In passing, I take this opportunity to acknowledge the help this database and software were to me in tracking down references for this book. The Association for Computing Machinery (ACM) publishes an annual index, by author, title, and keyword, of the literature in the computing sciences. Mathematical Reviews, published by the American Mathematical Society (AMS), contains brief reviews of articles in all areas of mathematics. The areas of “Statistics”, “Numerical Analysis”, and “Computer Science” contain reviews of articles relevant to computational statistics. The papers reviewed in Mathematical Reviews are categorized according to a standard system that has slowly evolved over the years. In this taxonomy, called the AMS MR classification system, “Statistics” is 62Xyy; “Numerical Analysis”, including random number generation, is 65Xyy; and “Computer Science” is 68Xyy. (“X” represents a letter and “yy” represents a two-digit number.) Mathematical Reviews is available to subscribers via the World Wide Web at MathSciNet: http://www.ams.org/mathscinet/ There are various handbooks of mathematical functions and formulas that are useful in numerical computations. Three that should be mentioned are Abramowitz and Stegun (1964), Spanier and Oldham (1987), and Thompson (1997). Anyone doing serious scientific computations should have ready access to at least one of these volumes. Almost all journals in statistics have occasional articles on computational statistics and statistical computing. 
The following is a list of journals, proceedings, and newsletters that emphasize this field.

ACM Transactions on Mathematical Software, published quarterly by the ACM (Association for Computing Machinery). This journal publishes algorithms in Fortran and C. The ACM collection of algorithms is sometimes called CALGO. The algorithms published during the period 1975 through 1999 are available on a CD-ROM from ACM. Most of the algorithms are available through netlib at http://www.netlib.org/liblist.html

ACM Transactions on Modeling and Computer Simulation, published quarterly by the ACM.

Applied Statistics, published quarterly by the Royal Statistical Society. (Until 1998, included algorithms in Fortran. Some of these algorithms, with corrections, were collected by Griffiths and Hill, 1985. Most of the algorithms are available through statlib at Carnegie Mellon University.)
Communications in Statistics - Simulation and Computation, published quarterly by Marcel Dekker. (Until 1996, included algorithms in Fortran. Until 1982, this journal was designated as Series B.)

Computational Statistics, published quarterly by Physica-Verlag. (Formerly called Computational Statistics Quarterly.)

Computational Statistics. Proceedings of the xxth Symposium on Computational Statistics (COMPSTAT), published biennially by Physica-Verlag. (Not refereed.)

Computational Statistics & Data Analysis, published by North Holland. Number of issues per year varies. (This is also the official journal of the International Association for Statistical Computing, and as such incorporates the Statistical Software Newsletter.)

Computing Science and Statistics. This is an annual publication containing papers presented at the Interface Symposium. Until 1988, these proceedings were named Computer Science and Statistics: Proceedings of the xxth Symposium on the Interface. From 1988 until 1992, the proceedings were named Computing Science and Statistics: Proceedings of the xxth Symposium on the Interface. (The 24th symposium was held in 1992.) In 1997, Volume 29 was published in two issues: Number 1, which contains the papers of the regular Interface Symposium; and Number 2, which contains papers from another conference. The two numbers are not sequentially paginated. These proceedings are now published by the Interface Foundation of North America. (Not refereed.)

Journal of Computational and Graphical Statistics, published quarterly by the American Statistical Association.

Journal of Statistical Computation and Simulation, published irregularly in four numbers per volume by Gordon and Breach.

Proceedings of the Statistical Computing Section, published annually by the American Statistical Association. (Not refereed.)

SIAM Journal on Scientific Computing, published bimonthly by SIAM. This journal was formerly SIAM Journal on Scientific and Statistical Computing. (Is this a step backward?)

Statistical Computing & Graphics Newsletter, published quarterly by the Statistical Computing and the Statistical Graphics Sections of the American Statistical Association. (Not refereed and not generally available in libraries.)

Statistics and Computing, published quarterly by Chapman & Hall.
World Wide Web, News Groups, List Servers, and Bulletin Boards The best way of storing information is in a digital format that can be accessed by computers. In some cases the best way for people to access information is by computers; in other cases the best way is via hard copy, which means that
the information stored on the computer must go through a printing process resulting in books, journals, or loose pages. A huge amount of information and raw data is available online. Much of it is in publicly accessible sites. Some of the repositories give space to ongoing discussions to which anyone can contribute. There are various ways of remotely accessing the computer databases and discussion groups. The high-bandwidth wide-area network called the “Internet” is the most important way to access information. Early development of the Internet was due to initiatives within the United States Department of Defense and the National Science Foundation. The Internet is making fundamental changes to the way we store and access information. The references that I have cited in this text are generally traditional books, journal articles, or compact disks. This usually means that the material has been reviewed by someone other than the author. It also means that the author possibly has newer thoughts on the same material. The Internet provides a mechanism for the dissemination of large volumes of information that can be updated readily. The ease of providing material electronically is also the source of the major problem with the material: it is often half-baked and has not been reviewed critically. Another reason that I have refrained from making frequent reference to material available over the Internet is the unreliability of some sites. The average life of a Web site is measured in weeks. For statistics, one of the most useful sites on the Internet is the electronic repository statlib, maintained at Carnegie Mellon University, which contains programs, datasets, and other items of interest. The URL is http://lib.stat.cmu.edu. The collection of algorithms published in Applied Statistics is available in statlib. These algorithms are sometimes called the ApStat algorithms. The statlib facility can also be accessed by email or anonymous ftp at [email protected]. 
An automatic email processor will reply with files that provide general information or programs and data. The general introductory file can be obtained by sending email to the address above with the message “send index”. Another very useful site for scientific computing is netlib, which was established by research workers at AT&T (now Lucent) Bell Laboratories and national laboratories, primarily Oak Ridge National Laboratories. The URL is http://www.netlib.org. The Collected Algorithms of the ACM (CALGO), which are the Fortran, C, and Algol programs published in ACM Transactions on Mathematical Software (or in Communications of the ACM prior to 1975), are available in netlib, under the TOMS link. There is also an X Windows, socket-based system for accessing netlib, called Xnetlib; see Dongarra, Rowan, and Wade (1995).
The Guide to Available Mathematical Software (GAMS), to which I have referred several times in this book, can be accessed at http://gams.nist.gov. A different interface, using Java, is available at http://math.nist.gov/HotGAMS/. There are two major problems in using the WWW to gather information. One is the sheer quantity of information and the number of sites providing information. The other is the “kiosk problem”; anyone can put up material. Sadly, the average quality is affected by a very large denominator. The kiosk problem may be even worse than a random selection of material; the “fools in public places” syndrome is much in evidence. There is not much that can be done about the second problem. It was not solved for traditional postings on uncontrolled kiosks, and it will not be solved on the WWW. For the first problem, there are remarkable programs that automatically crawl through WWW links to build a database that can be searched for logical combinations of terms and phrases. Such systems and databases have been built by several people and companies. A neophyte can be quickly disabused of an exaggerated sense of the value of such search engines by doing a search on “Monte Carlo”. Aside from the large number of hits that relate to a car and to some place in Europe, the hits (in mid-1998) that relate to the interesting topic are dominated by references to some programs for random number generation put together by a group at a university somewhere. (Of course, “interesting” is in the eye of the beholder.) It is not clear at this time what the media for the scientific literature will be within a few years. Many of the traditional journals will be converted to an electronic version of some kind. Journals will become Web sites. That is for certain; the details, however, are much less certain. Many bulletin boards and discussion groups have already evolved into “electronic journals”. 
A publisher of a standard commercial journal has stated that “we reject 80% of the articles submitted to our journal; those are the ones you can find on the Web”.
References for Software Packages There is a wide range of software used in the computational sciences. Some of the software is produced by a single individual who is happy to share the software, sometimes for a fee, but who has no interest in maintaining the software. At the other extreme is software produced by large commercial companies whose continued existence depends on a process of production, distribution, and maintenance of the software. Information on much of the software can be obtained from GAMS. Some of the free software can be obtained from statlib or netlib.
The names of many software packages are trade names or trademarks. In this book, the use of names, even if the name is not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.
References to the Literature The following bibliography obviously covers a wide range of topics in statistical computing and computational statistics. Except for a few of the general references, all of these entries have been cited in the text. The purpose of this bibliography is to help the reader get more information; hence I eschew “personal communications” and references to technical reports that may or may not exist. Those kinds of references are generally for the author rather than for the reader. In some cases, important original papers have been reprinted in special collections, such as Samuel Kotz and Norman L. Johnson (Editors) (1997), Breakthroughs in Statistics, Volume III, Springer-Verlag, New York. In most such cases, because the special collection may be more readily available, I list both sources.
A Note on the Names of Authors In these references, I have generally used the names of authors as they appear in the original sources. This may mean that the same author will appear with different forms of names, sometimes with given names spelled out, and sometimes abbreviated. In the author index, beginning on page 237, I use a single name for the same author. The name is generally the most complete (i.e., least abbreviated) of any of the names of that author in any of the references. This convention may occasionally result in an entry in the author index that does not occur exactly in any references. For example, a reference to J. Paul Jones together with one to John P. Jones, if I know that the two names refer to the same person, would result in an Author Index entry for John Paul Jones. Aarts, Emile, and Jan Korst (1989), Simulated Annealing and Boltzmann Machines, John Wiley & Sons, New York. Aarts, Emile, and Jan Karel Lenstra (Editors) (1997), Local Search in Combinatorial Optimization, John Wiley & Sons, New York. Abramowitz, Milton, and Irene A. Stegun (Editors) (1964), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, National Bureau of Standards (NIST), Washington. (Reprinted by Dover Publications, Inc., New York.) Alrefaei, Mahmoud H., and Sigrún Andradóttir (1999), A simulated annealing algorithm with constant temperature for discrete stochastic optimization, Management Science 45, 748–764. Aluffi-Pentini, Filippo; Valerio Parisi; and Francesco Zirilli (1988a), A global optimization algorithm using stochastic differential equations, ACM Transactions on Mathematical Software 14, 345–365.
Aluffi-Pentini, Filippo; Valerio Parisi; and Francesco Zirilli (1988b), Algorithm 667: SIGMA — A stochastic-integration global minimization algorithm, ACM Transactions on Mathematical Software 14, 366–380. Arslan, Olcay; Patrick D. L. Constable; and John T. Kent (1993), Domains of convergence for the EM algorithm: a cautionary tale in the location estimation problem, Statistics and Computing 3, 103–108. Arthanari, T. S., and Yadolah Dodge (1981), Mathematical Programming in Statistics, John Wiley & Sons, New York. Atkinson, A. C. (1992), A segmented algorithm for simulated annealing, Statistics and Computing 2, 221–230. Atkinson, A. C., and A. N. Donev (1992), Optimum Experimental Designs, Oxford University Press, Oxford, United Kingdom. Back, Thomas (1996), Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms, Oxford University Press, Oxford, United Kingdom. Barrodale, I., and F. D. K. Roberts (1974), Algorithm 478: Solution of an overdetermined system of equations in the l1 norm, Communications of the ACM 17, 319–320. Barton, Russel R., and John S. Ivey, Jr. (1996), Nelder-Mead simplex modifications for simulation optimization, Management Science 42, 954–973. Bassett, Gilbert W. (1991), Equivalent, monotone, 50% breakdown estimators, The American Statistician 45, 135–137. Bates, Douglas M., and John M. Chambers (1992), Nonlinear models, Statistical Models in S (edited by John M. Chambers and Trevor J. Hastie), Wadsworth & Brooks/Cole, Pacific Grove, California, 421–454. Becker, Richard A.; John M. Chambers; and Allan R. Wilks (1988), The New S Language, Wadsworth & Brooks/Cole, Pacific Grove, California. Bethel, James (1985), An optimum allocation algorithm for multivariate surveys, Proceedings of the Survey Research Section, ASA, 209–212. Bethel, James (1989a), Minimum variance estimation in stratified sampling, Journal of the American Statistical Association 84, 260–265. 
Bethel, James (1989b), Sample allocation in multivariate surveys, Survey Methodology 15, 47–57. Birkes, David, and Yadolah Dodge (1993), Alternative Methods of Regression, John Wiley & Sons, New York. Bischof, C.; A. Carle; P. Khademi; and A. Mauer (1996), ADIFOR 2.0: Automatic differentiation of Fortran 77 programs, IEEE Computational Science and Engineering 3, Number 3, 18–32. Bischof, C.; L. Roh; and A. Mauer (1996), ADIC: An extensible automatic differentiation tool for ANSI-C, Software — Practice and Experience 27, 1427–1456. Bohachevsky, Ihor O.; Mark E. Johnson; and Myron L. Stein (1986), Generalized simulated annealing for function optimization, Technometrics 28, 209–217. Bongartz, I.; A. R. Conn; Nick Gould; and Ph. L. Toint (1995), CUTE: constrained and unconstrained testing environment, ACM Transactions on Mathematical Software 21, 123–160. Bouvier, Annie, and Sylvie Huet (1994), nls2: Non-linear regression by S-Plus functions, Computational Statistics & Data Analysis 18, 187–190. Box, M. J. (1965), A comparison of several current optimization methods and the use of transformations in constrained problems, Computer Journal 8, 67–77. Brooks, Stephen P. (1995), A hybrid optimization algorithm, Applied Statistics 44, 530–533. Brooks, S. P., and B. J. T. Morgan (1994), Automatic starting point selection for function optimization, Statistics and Computing 4, 173–177. Bunch, David S.; David M. Gay; and Roy E. Welsch (1993), Algorithm 717: Subroutines for maximum likelihood and quasi-likelihood estimation of parameters in nonlinear regression models, ACM Transactions on Mathematical Software 19, 109–130. Byrd, Richard H.; Jorge Nocedal; and Robert B. Schnabel (1994), Representations of quasi-Newton matrices and their use in limited memory methods, Mathematical Programming 63, 129–156.
Celeux, G., and J. Diebolt (1985), The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem, Computational Statistics Quarterly 2, 73–82. Chambers, John M. (1997), The evolution of the S language, Computing Science and Statistics 28, 331–337. Chan, K. S., and Johannes Ledolter (1995), Monte Carlo EM estimation for time series models involving counts, Journal of the American Statistical Association 90, 242–252. Charnes, A.; W. W. Cooper; and R. O. Ferguson (1955), Optimal estimation of executive compensation by linear programming, Management Science 1, 138–150. Charnes, A.; E. L. Frome; and P. L. Yu (1976), The equivalence of generalized least squares and maximum likelihood estimates in the exponential family, Journal of the American Statistical Association 71, 169–171. Chatterjee, Samprit, and Martin Mächler (1997), Robust regression: A weighted least squares approach, Communications in Statistics — Theory and Methods 26, 1381–1394. Chen, D. S., and R. C. Jain (1993), A robust back-propagation learning algorithm for function approximation, Artificial Intelligence Frontiers in Statistics (edited by D. J. Hand), Chapman & Hall, London, 217–240. Chin, Daniel C. (1993), Performance of several stochastic approximation algorithms in the multivariate Kiefer-Wolfowitz setting, Computing Science and Statistics 25, 289–295. Chopra, Vijay K., and William T. Ziemba (1993), The effects of errors in means, variances, and covariances on optimal portfolio choice, Journal of Portfolio Management 19(2), 6–11. Chromy, J. R. (1987), Design optimization with multiple objectives, Proceedings of the Survey Research Section, ASA, 194–199. Chvátal, Vašek (1983), Linear Programming, W. H. Freeman and Company, New York. Collins, N. E.; R. W. Eglese; and B. L. Golden (1988), Simulated annealing – An annotated bibliography, American Journal of Mathematical and Management Sciences 8, 208–307. Conn, A. R.; N. I. M. Gould; and Ph. L. 
Toint (1992), LANCELOT: A Fortran Package for Large-Scale Nonlinear Optimization (Release A), Springer-Verlag, New York. Cook, William J.; William H. Cunningham; William R. Pulleyblank; and Alexander Schrijver (1997), Combinatorial Optimization, John Wiley & Sons, New York. Corana, A.; M. Marchesi; C. Martin; and S. Ridella (1987), Minimizing multimodal functions of continuous variables with the “simulated annealing” algorithm, ACM Transactions on Mathematical Software 13, 262–280. Dantzig, George B. (1963), Linear Programming and Extensions, Princeton University Press, Princeton. Dembo, R. S., and T. Steihaug (1985), A test problem generator for large-scale unconstrained optimization, ACM Transactions on Mathematical Software 11, 97–102. Dempster, A. P.; N. M. Laird; and D. B. Rubin (1977), Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society, Series B 39, 1–38. Dennis, John E., Jr.; David M. Gay; and Roy E. Welsch (1981a), An adaptive nonlinear least-squares algorithm, ACM Transactions on Mathematical Software 7, 348–368. Dennis, John E., Jr.; David M. Gay; and Roy E. Welsch (1981b), Algorithm 573: NL2SOL – An adaptive nonlinear least-squares algorithm, ACM Transactions on Mathematical Software 7, 369–383. Dennis, J. E., Jr., and D. J. Woods (1987), Optimization on microcomputers: The Nelder-Mead simplex algorithm, Microcomputers in Large Scale Computing (edited by Arthur Wouk), Society for Industrial and Applied Mathematics, Philadelphia, 116–122. Deutsch, Clayton V. (1996), Constrained smoothing of histograms and scatterplots with simulated annealing, Technometrics 38, 266–274. Dharmadhikari, S., and K. Joag-Dev (1988), Unimodality, Convexity and Applications, Academic Press, New York. Diebolt, Jean, and Eddie H. S. Ip (1996), Stochastic EM: method and application, Practical Markov Chain Monte Carlo (edited by W. R. Gilks, S. Richardson, and D. J. 
Spiegelhalter), Chapman & Hall, London, 259–273.
Dobmann, M.; M. Liepelt; and K. Schittkowski (1995), Algorithm 746: PCOMP, a Fortran code for automatic differentiation, ACM Transactions on Mathematical Software 21, 233–266. Dongarra, Jack; Tom Rowan; and Reed Wade (1995), Software distribution using Xnetlib, ACM Transactions on Mathematical Software 21, 79–88. Dorfman, A. H. (1989), Iterated reweighted least squares revisited: The tension between simulation and practice, Computer Science and Statistics: Proceedings of the Twenty-first Symposium on the Interface (edited by Kenneth Berk and Linda Malone), American Statistical Association, 280–283. Drezner, Zvi; George A. Marcoulides; and Said Salhi (1999), Tabu search model selection in multiple regression analysis, Communications in Statistics — Simulation and Computation 28, 349–367. Duarte, Antonio Marcos, and Beatriz Vaz de Melo Mendes (1998), Interior point algorithms for LSAD and LMAD estimation, Computational Statistics 13, 233–256. Facchinei, Francisco; Joaquim Júdice; and João Soares (1997a), Generating box-constrained optimization problems, ACM Transactions on Mathematical Software 23, 443–447. Facchinei, Francisco; Joaquim Júdice; and João Soares (1997b), Algorithm 774: Fortran subroutines for generating box-constrained optimization problems, ACM Transactions on Mathematical Software 23, 448–450. Fan, Y.; S. Sarkar; and L. Lasdon (1988), Experiments with successive quadratic programming algorithms, Journal of Optimization Theory and Applications 56, 359–383. Fiacco, Anthony V., and Garth P. McCormick (1968), Nonlinear Programming: Sequential Unconstrained Minimization Techniques, Research Analysis Corporation, McLean, Virginia. (Reprinted by Society for Industrial and Applied Mathematics, Philadelphia, 1990.) Floudas, Christodoulos A.; Panos M. Pardalos; Claire Adjiman; William R. Esposito; Zeynep H. Gumus; Stephen T. Harding; John L. Klepeis; Clifford A. Meyer; and Carl A. 
Schweiger (1999), Handbook of Test Problems in Local and Global Optimization, Kluwer Academic Publishers, Dordrecht. Fourer, Robert; David M. Gay; and Brian W. Kernighan (1993), AMPL: A Modeling Language for Mathematical Programming, Duxbury Press, Boston. Frank, Ildiko E., and Jerome H. Friedman (1993), A statistical view of some chemometrics regression tools (with discussion), Technometrics 35, 109–148. Furnival, George M., and Robert W. Wilson, Jr. (1974), Regression by leaps and bounds, Technometrics 16, 499–511. Gan, Li, and Jiming Jiang (1999), A test for global maximum, Journal of the American Statistical Association 94, 847–854. Gay, David M. (1983), Algorithm 611: Subroutines for unconstrained minimization using a model/trust-region approach, ACM Transactions on Mathematical Software 9, 503–524. Gay, David M., and Roy E. Welsch (1988), Maximum likelihood and quasi-likelihood for nonlinear exponential family regression models, Journal of the American Statistical Association 83, 990–998. Gentle, James E. (1998), Numerical Linear Algebra for Applications in Statistics, Springer-Verlag, New York. Gentle, James E. (2003), Random Number Generation and Monte Carlo Methods, second edition, Springer-Verlag, New York. Gentle, J. E.; W. J. Kennedy; and V. A. Sposito (1976), Properties of the L1-estimate space, Proceedings of the Statistical Computing Section, ASA, 163–164. Gentle, James E.; Subhash C. Narula; and Richard L. Valliant (1997), Multicriteria optimization in sampling design, Statistics of Quality (edited by Subir Ghosh, William R. Schucany, and William B. Smith), Marcel Dekker, Inc., New York, 411–425. Gentleman, Robert, and Ross Ihaka (1997), The R language, Computing Science and Statistics 28, 326–330. Gentleman, W. M. (1974), Algorithm AS 75: Basic procedures for large, sparse or weighted linear least squares problems, Applied Statistics 23, 448–454. Gill, P. E.; W. Murray; M. A. Saunders; and M. H. 
Wright (1992), Some theoretical properties of an augmented Lagrangian merit function, Advances in Optimization and Parallel Computing (edited by P. M. Pardalos), North-Holland, Amsterdam, 101–128.
Glover, Fred (1986), Future paths for integer programming and links to artificial intelligence, Computer and Operations Research 5, 533–549. Glover, Fred, and Manuel Laguna (1997), Tabu Search, second printing, Kluwer Academic Publishers, Boston. Gonin, René, and Arthur H. Money (1989), Nonlinear Lp-Norm Estimation, Marcel Dekker, Inc., New York. Gonzaga, C. C. (1992), Path-following methods for linear programming, SIAM Review 34, 167–224. Green, P. J. (1984), Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion), Journal of the Royal Statistical Society, Series B 46, 149–192. Griewank, Andreas (2000), Evaluating Derivatives. Principles and Techniques of Algorithmic Differentiation, Society for Industrial and Applied Mathematics, Philadelphia. Griewank, Andreas; David Juedes; and Jean Utke (1996), Algorithm 755: ADOL-C, a package for the automatic differentiation of algorithms written in C/C++, ACM Transactions on Mathematical Software 22, 131–167. Griffiths, P., and I. D. Hill (Editors) (1985), Applied Statistics Algorithms, Ellis Horwood Limited, Chichester, United Kingdom. Gutjahr, W. J., and G. Ch. Pflug (1996), Simulated annealing for noisy cost functions, Journal of Global Optimization 8, 1–13. Härdle, W.; S. Klinke; and B. A. Turlach (1995), XploRe: An Interactive Statistical Computing Environment, Springer-Verlag, New York. Hartley, H. O. (1961), The modified Gauss-Newton method for fitting of nonlinear regression functions by least squares, Technometrics 3, 269–280. Hawkins, Douglas M. (1993a), A feasible solution algorithm for minimum volume ellipsoid estimator in multivariate data, Computational Statistics 8, 95–107. Hawkins, Douglas M. (1993b), The feasible set algorithm for least median of squares regression, Computational Statistics & Data Analysis 16, 81–101. 
Hawkins, Douglas M., and David Olive (1999), Applications and algorithms for least trimmed sum of absolute deviations regression, Computational Statistics & Data Analysis 32, 119–134. Haykin, Simon (1994), Neural Networks. A Comprehensive Foundation, Macmillan Publishing Company, Englewood Cliffs, New Jersey. Heiberger, R. M., and R. A. Becker (1992), Design of an S function for robust regression using iteratively reweighted least squares, Journal of Computational and Graphical Statistics 1, 181–196. Hill, Tim; Marcus O’Connor; and William Remus (1996), Neural network models for time series forecasts, Management Science 42, 954–973. Hock, Willi, and Klaus Schittkowski (Editors) (1985), Test Examples for Nonlinear Programming Codes, Springer-Verlag, Berlin. Hoffman, A.; M. Mannos; D. Sokolowsky; and N. Wiegmann (1953), Computational experience in solving linear programs, Journal of the Society for Industrial and Applied Mathematics 1, 17–33. Holland, John H. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor. (Reprinted with additional material by The MIT Press, Cambridge, Massachusetts, 1992.) Huddleston, H. F.; P. L. Claypool; and R. R. Hocking (1970), Optimum sample allocation to strata using convex programming, Applied Statistics 19, 273–278. Huet, Sylvie; Annie Bouvier; Marie-Anne Gruet; and Emmanuel Jolivet (1996), Statistical Tools for Nonlinear Regression: A Practical Guide with S-PLUS Examples, Springer-Verlag, New York. Hull, T. E., and R. Mathon (1996), The mathematical basis and a prototype implementation of a new polynomial rootfinder with quadratic convergence, ACM Transactions on Mathematical Software 22, 261–280. Jamshidian, Mortaza, and Robert I. Jennrich (1997), Acceleration of the EM algorithm by using quasi-Newton methods, Journal of the Royal Statistical Society, Series B 59, 569–587.
Jamshidian, Mortaza (2004), On algorithms for restricted maximum likelihood estimation, Computational Statistics & Data Analysis 45, 137–157. Jenkins, M. A. (1975), Algorithm 493: Zeros of a real polynomial, ACM Transactions on Mathematical Software 1, 178–189. Jenkins, M. A., and J. F. Traub (1970a), A three-stage algorithm for real polynomials using quadratic iteration, SIAM Journal on Numerical Analysis 7, 545–566. Jenkins, M. A., and J. F. Traub (1970b), A three-stage variable-shift iteration for polynomial zeros and its relation to generalized Rayleigh iteration, Numerische Mathematik 14, 252–263. Jenkins, M. A., and J. F. Traub (1972), Zeros of a complex polynomial, Communications of the ACM 15, 97–99. Jennison, Christopher, and Nuala Sheehan (1995), Theoretical and empirical properties of the genetic algorithm as a numerical optimizer, Journal of Computational and Graphical Statistics 4, 296–318. Jobson, J. D., and Bob Korkie (1981), Putting Markowitz theory to work, Journal of Portfolio Management 7(4), 70–74. Jung, Joo Sung, and Bong Jin Yum (1996), Construction of exact D-optimal designs by tabu search, Computational Statistics & Data Analysis 21, 181–191. Karmarkar, N. (1984), A new polynomial-time algorithm for linear programming, Combinatorica 4, 373–395. Kavvadias, Dimitris J., and Michael N. Vrahatis (1996), Locating and computing all the simple roots and extrema of a function, SIAM Journal on Scientific Computing 17, 1232–1248. Kendall, Maurice G., and Alan Stuart (1968), The Advanced Theory of Statistics, Volume 3, Design and Analysis and Time Series, second edition, Hafner Publishing Company, New York. Kendall, Maurice G., and Alan Stuart (1969), The Advanced Theory of Statistics, Volume 1, Distribution Theory, third edition, Hafner Publishing Company, New York. Kendall, Maurice G., and Alan Stuart (1973), The Advanced Theory of Statistics, Volume 2, Inference and Relationship, third edition, Hafner Publishing Company, New York. 
Kennedy, William J., and James E. Gentle (1980), Statistical Computing, Marcel Dekker, Inc., New York. Khalfan, H. Fayez; R. H. Byrd; and R. B. Schnabel (1993), A theoretical and experimental study of the symmetric rank-one update, SIAM Journal on Optimization 3, 1–24. Khuri, André I. (1993), Advanced Calculus with Applications in Statistics, John Wiley & Sons, New York. Kim, Dong K., and Jeremy M. G. Taylor (1995), The restricted EM algorithm for maximum likelihood estimation under linear restrictions on the parameters, Journal of the American Statistical Association 90, 708–716. Kirkpatrick, S.; C. D. Gelatt; and M. P. Vecchi (1983), Optimization by simulated annealing, Science 220, 671–679. Koenker, Roger, and Stephen Portnoy (1997), The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators (with discussion), Statistical Science 12, 279–299. Korhonen, P., and J. Wallenius (1986), Some theory and an approach to solving sequential multiple-criteria decision problems, Journal of the Operational Research Society 37, 501–508. Koza, John R. (1992), Genetic Programming: On the Programming of Computers by Means of Natural Selection, The MIT Press, Cambridge, Massachusetts. Koza, John R. (1994a), Genetic programming as a means for programming computers by natural selection, Statistics and Computing 4, 87–112. Koza, John R. (1994b), Genetic Programming II: Automatic Discovery of Reusable Programs, The MIT Press, Cambridge, Massachusetts. Koza, John R.; Forrest H. Bennett; and David Andre (1999), Genetic Programming III: Darwinian Invention and Problem Solving, Academic Press, New York. Kushner, Harold J., and G. George Yin (1997), Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York.
Lange, Kenneth (1995), A gradient algorithm locally equivalent to the EM algorithm, Journal of the Royal Statistical Society, Series B 57, 425–437. Lasdon, L. S.; A. D. Waren; A. Jain; and M. Ratner (1978), Design and testing of a GRG code for nonlinear optimization, ACM Transactions on Mathematical Software 4, 34–50. Ma, Jun, and H. Malcolm Hudson (1998), An augmented data scoring algorithm for maximum likelihood, Communications in Statistics — Theory and Methods 27, 2761–2776. Marazzi, A. (1993), Algorithms, Routines and S Functions for Robust Statistics, Wadsworth & Brooks/Cole, Pacific Grove, California. Maren, Alianna; Craig Harston; and Robert Pap (1990), Handbook of Neural Computing Applications, Academic Press, San Diego. Markowitz, Harry M. (1952), Portfolio selection, Journal of Finance 7(1), 77–91. Masri, S. F., and G. A. Bekey (1980), A global optimization algorithm using adaptive random search, Applied Mathematics and Computation 7, 353–375. McLachlan, Geoffrey J., and Thriyambakam Krishnan (1997), The EM Algorithm and Extensions, John Wiley & Sons, New York. Meintanis, S. G., and G. S. Donatos (1997), A comparative study of some robust methods for coefficient-estimation in linear regression, Computational Statistics & Data Analysis 23, 525–540. Meng, Xiao-Li, and Donald B. Rubin (1991), Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm, Journal of the American Statistical Association 86, 899–909. Meng, X.-L., and D. B. Rubin (1993), Maximum likelihood estimation via the ECM algorithm: a general framework, Biometrika 80, 267–278. Meng, Xiao-Li, and David van Dyk (1997), The EM algorithm – an old folk-song sung to a fast new tune (with discussion), Journal of the Royal Statistical Society, Series B 59, 511–567. Mergerson, James W. (1988), ALLOC.P: A multivariate allocation program, The American Statistician 42, 85. Mergerson, James W. 
(1989), A generalized univariate optimal allocation program, The American Statistician 43, 128. Metropolis, N.; A. W. Rosenbluth; M. N. Rosenbluth; A. H. Teller; and E. Teller (1953), Equation of state calculations by fast computing machines, Journal of Chemical Physics 21, 1087–1092. Michalewicz, Zbigniew (1996), Genetic Algorithms + Data Structures = Evolution Programs, second edition, Springer-Verlag, New York. Michie, Donald; David J. Spiegelhalter; and Charles C. Taylor (1994), Machine Learning, Neural, and Statistical Classification, Ellis Horwood, New York. Miller, Alan J., and Nam-Ky Nguyen (1994), A Fedorov exchange algorithm for D-optimal design, Applied Statistics 43, 669–678. Moré, Jorge J., and David J. Thuente (1994), Line search algorithms with guaranteed sufficient decrease, ACM Transactions on Mathematical Software 21, 286–307. Moré, Jorge J., and Stephen J. Wright (1993), Optimization Software Guide, Society for Industrial and Applied Mathematics, Philadelphia. Morgenthaler, Stephan (1992), Least-absolute-deviations fits for generalized linear models, Biometrika 79, 747–754. Morris, R. J. T., and W. S. Wong (1992), Systematic initialization of local search procedures and application to the synthesis of neural networks, Computer Science and Statistics: Proceedings of the Twenty-second Symposium on the Interface (edited by Connie Page and Raoul LePage), Springer-Verlag, New York, 209–214. Mühlenbein, Heinz (1992), Parallel genetic algorithms in combinatorial optimization, Computer Science and Operations Research: New Developments in Their Interfaces (edited by Osman Balci, Ramesh Sharda, and Stavros A. Zenios), Pergamon Press, New York, 441–456. Mühlenbein, Heinz (1997), Genetic algorithms, Local Search in Combinatorial Optimization (edited by Emile Aarts and Jan Karel Lenstra), John Wiley & Sons, New York, 137–171. Murtagh, B. A. (1981), Advanced Linear Programming: Computation and Practice, McGraw-Hill, New York.
Narula, Subhash C.; Vince A. Sposito; and John F. Wellington (1993), Intervals which leave the minimum sum of absolute errors regression unchanged, Applied Statistics 42, 369–378. Narula, S. C., and J. F. Wellington (1979), Linear regression using multiple-criteria, Multiple Criteria Decision Making: Theory and Application (edited by G. Fandel and T. Gal), Springer-Verlag, New York, 266–277. Narula, Subhash C., and John F. Wellington (1985), Interior analysis for the minimum sum of absolute errors regression, Technometrics 27, 181–188. Nash, Stephen G. (1998), SUMT (Revisited), Operations Research 46, 763–775. Nash, Stephen G., and Ariela Sofer (1996), Linear and Nonlinear Programming, McGraw-Hill, New York. Nelder, J. A., and R. Mead (1965), A simplex method for function minimization, Computer Journal 7, 308–313. Nguyen, Nam-Ky, and Alan J. Miller (1992), A review of some exchange algorithms for constructing D-optimal designs, Computational Statistics & Data Analysis 14, 489–498. Nocedal, Jorge (1992), Theory of algorithms for unconstrained optimization, Acta Numerica 1992, Cambridge University Press, Cambridge, United Kingdom, 199–242. Nocedal, Jorge, and Stephen J. Wright (1999), Numerical Optimization, Springer-Verlag, New York. Overton, Michael L. (2001), Numerical Computing with IEEE Floating Point Arithmetic, Society for Industrial and Applied Mathematics, Philadelphia. Panier, E. R., and A. L. Tits (1993), On combining feasibility, descent and superlinear convergence in inequality constrained optimization, Mathematical Programming 59, 261–276. Parkinson, J. M., and D. Hutchinson (1972), An investigation into the efficiency of variants of the simplex method, Numerical Methods for Non-linear Optimization (edited by F. A. Lootsma), Academic Press, London, 115–135. Poli, I., and R. D. Jones (1994), A neural net model for prediction, Journal of the American Statistical Association 89, 117–121. Powell, M. J. D. 
(1965), A method for minimizing a sum of squares of nonlinear functions without calculating derivatives, Computer Journal 8, 303–307. Price, W. L. (1977), A controlled random search procedure for global optimization, Computer Journal 20, 367–370. Pukelsheim, Friedrich (1993), Optimal Design of Experiments, John Wiley & Sons, New York. Rabinowitz, F. Michael (1995), A stochastic algorithm for global optimization with constraints, ACM Transaction on Mathematical Software 21, 194–213. Rai, S. N., and D. E. Matthews (1993), Improving the EM algorithm, Biometrics 49, 587–591. Ralston, Mary L., and Robert I. Jennrich (1978a), Dud, a derivative-free algorithm for nonlinear least squares, Technometrics 20, 7–14. Ralston, Mary L., and Robert I. Jennrich (1978b), Derivative-free nonlinear regression, Computer Science and Statistics: Proceedings of the Tenth Symposium on the Interface (edited by D. Hogben), U. S. Government Printing Office, Washington, 312–322. Rechenberg, I. (1973), Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution, Frommann-Holzboog, Stuttgart. Ripley, Brian D. (1993), Statistical aspects of neural networks, Networks and Chaos – Statistical and Probabilistic Aspects (edited by O. E. Barndorff-Nielsen, J. L. Jensen, and W. S. Kendall), Chapman & Hall, London, 40–123. Ripley, Brian D. (1994), Neural networks and related methods for classification (with discussion), Journal of the Royal Statistical Society, Series B 56, 409–456. Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, United Kingdom. Robbins, Herbert, and Sutton Monro (1951), A stochastic approximation method, Annals of Mathematical Statistics 22, 400–407. Rosenbrock, H. H. (1960), An automatic method for finding the greatest or the least values of a function, Computer Journal 3, 175–184. Ross, G. J. S. (1990), Nonlinear Estimation, Springer-Verlag, New York. Rousseeuw, P. J. 
(1984), Least median of squares regression, Journal of the American Statistical Association 79, 871–880.
Rousseeuw, Peter J., and Annick M. Leroy (1987), Robust Regression and Outlier Detection, John Wiley & Sons, New York.
Ruppert, David, and Raymond J. Carroll (1980), Trimmed least squares estimation in the linear model, Journal of the American Statistical Association 75, 828–838 (Corrections, 1982, ibid. 77, 954).
Schittkowski, K. (1985), NLPQL: a Fortran subroutine for solving constrained nonlinear programming problems, Annals of Operations Research 5, 485–500.
Schittkowski, Klaus (Editor) (1987), More Test Examples for Nonlinear Programming Codes, Springer-Verlag, Berlin.
Schlossmacher, E. J. (1973), An iterative technique for absolute deviations curve fitting, Journal of the American Statistical Association 68, 857–859.
Schrage, Linus E. (1997), Optimization Modeling with Lindo, fifth edition, Brooks/Cole, Pacific Grove, California.
Siarry, Patrick; Gérard Berthiau; François Durdin; and Jacques Haussy (1997), Enhanced simulated annealing for globally minimizing functions of many-continuous variables, ACM Transactions on Mathematical Software 23, 209–228.
Singleton, R. R. (1940), A method of minimizing the sum of absolute values of deviations, Annals of Mathematical Statistics 11, 301–310.
Smith, S., and L. Lasdon (1992), Solving large sparse nonlinear programs using GRG, ORSA Journal on Computing 4, 1–15.
Souvaine, D. L., and J. M. Steele (1987), Time- and space-efficient algorithms for least median of squares regression, Journal of the American Statistical Association 82, 794–801.
Spall, James C. (1992), Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Transactions on Automatic Control 37, 332–341.
Spall, James C., and John A. Cristion (1994), Nonlinear adaptive control using neural networks: estimation with a smoothed form of simultaneous perturbation gradient approximation, Statistica Sinica 4, 1–27.
Spanier, Jerome, and Keith B. Oldham (1987), An Atlas of Functions, Hemisphere Publishing Corporation, Washington. (Also Springer-Verlag, Berlin.)
Stark, P. B., and R. L. Parker (1995), Bounded-variable least-squares: an algorithm and applications, Computational Statistics 10, 129–141.
Steuer, Ralph E. (1986), Multiple Criteria Optimization: Theory, Computation, and Application, John Wiley & Sons, New York.
Street, James O.; Raymond J. Carroll; and David Ruppert (1988), A note on computing robust regression estimates via iteratively reweighted least squares, The American Statistician 42, 152–154.
Sutton, Clifton D. (1991), Improving classification trees with simulated annealing, Computer Science and Statistics: Proceedings of the Twenty-third Symposium on the Interface (edited by Elaine M. Keramidas), Interface Foundation of North America, 396–402.
The Scientific Press (1988), GAMS: A User’s Guide, The Scientific Press, San Francisco.
Thompson, Joe F.; Bharat Soni; and Nigel P. Weatherrill (1998), Handbook of Grid Generation, CRC Press, Boca Raton.
Thompson, William J. (1997), Atlas for Computing Mathematical Functions: An Illustrated Guide for Practitioners with Programs in C and Mathematica, John Wiley & Sons, New York.
Tierney, Luke (1990), Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, John Wiley & Sons, New York.
Titterington, D. M. (1975), Optimal design: Some geometrical aspects of D-optimality, Biometrika 62, 313–320.
Valliant, Richard, and James E. Gentle (1997), An application of mathematical programming to sample allocation, Computational Statistics & Data Analysis 25, 337–360.
Van Laarhoven, P. J. M., and E. H. L. Aarts (1987), Simulated Annealing: Theory and Applications, Reidel Publishing, Dordrecht.
Wei, Greg C. C., and Martin A. Tanner (1990), A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms, Journal of the American Statistical Association 85, 699–704.
Welsh, A. H. (1985), An angular approach for linear data, Biometrika 72, 441–450.
White, Halbert (1992), Nonparametric estimation of conditional quantiles using neural networks, Computer Science and Statistics: Proceedings of the Twenty-second Symposium on the Interface (edited by Connie Page and Raoul LePage), Springer-Verlag, New York, 190–199.
Whitley, Darrell (1994), A genetic algorithm tutorial, Statistics and Computing 4, 65–85.
Windward Technologies Inc. (1995), User’s Guide for GRG2 Optimization Library, Windward Technologies Inc., Meadows, Texas.
Woodruff, David L., and David M. Rocke (1993), Heuristic search algorithms for the minimum volume ellipsoid, Journal of Computational and Graphical Statistics 2, 69–95.
Xu, Chong-wei, and Wei-Kei Shiue (1993), Parallel algorithms for least median of squares regression, Computational Statistics & Data Analysis 16, 349–362.
Zeger, Kenneth; Jacques Vaisey; and Allen Gersho (1992), Globally optimal vector quantizer design by stochastic relaxation, IEEE Transactions on Signal Processing 40, 310–322.
Author Index
Aarts, Emile H. L., 104, 107, 108
Aarts, Emile, 4
Abramowitz, Milton, 196
Adjiman, Claire, 179
Alrefaei, Mahmoud H., 107, 108
Aluffi-Pentini, Filippo, 107
Amdahl, G. M., 33
Ammann, Larry, 150
Andradóttir, Sigrún, 107, 108
Andre, David, 112
Andrews, Angus P., 28
Arslan, Olcay, 91
Arthanari, T. S., 5, 127, 161
Atkinson, A. C., 160, 161
Back, Thomas, 109
Bareiss, E. H., 23
Barlow, J. L., 23
Barrodale, I., 127, 154
Barton, Russel R., 96
Bassett, Gilbert W., 155
Bates, Douglas M., 5, 196
Becker, Richard A., 87
Bekey, G. A., 132
Bennett, Forrest H., 112
Berthiau, Gérard, 107
Bethel, James, 161
Bickel, Peter J., 35
Birge, John R., 5
Bischof, Christian H., 177
Björck, Åke, 5
Blackford, L. S., 21
Boggs, Paul T., 149
Bohachevsky, Ihor O., 107
Bongartz, I., 179
Bouvier, Annie, 177
Box, M. J., 120
Brooks, Stephen P., 108
Bunch, David S., 170, 177
Byrd, Richard H., 81, 149
Calvetti, Daniela, 23
Carle, A., 177
Carroll, Raymond J., 87, 155
Celeux, G., 91
Chaitin-Chatelin, Françoise, 23
Chan, K. S., 91
Chan, Tony F., 28, 29
Charnes, A., 137, 146, 153
Chatterjee, Samprit, 157
Chen, D. S., 115
Chin, Daniel C., 92, 132
Chopra, Vijay K., 129
Chromy, J. R., 161
Chvátal, Vašek, 126
Claypool, P. L., 161
Cleary, A., 21
Cody, W. J., 15, 16
Collins, N. E., 105, 107
Conn, Andrew R., 170, 179
Constable, Patrick D. L., 91
Cook, William J., 4, 116
Coonen, Jerome T., 16
Cooper, W. W., 137, 153
Corana, A., 108
Cristion, John A., 93
Cunningham, William H., 4, 116
Dantzig, George B., 126
De Jong, Kenneth A., 179
Dembo, R. S., 180
Demmel, James W., 21
Dempster, Arthur P., 23, 89
Dennis, John E., Jr., 95, 170
Deutsch, Clayton V., 108
Dharmadhikari, S., 46
Dhillon, I., 21
Diebolt, Jean, 91
Dobmann, M., 177
Dodge, Yadolah, 5, 127, 161
Donaldson, Janet R., 149
Donev, A. N., 160
Dongarra, Jack J., 21, 199
Drezner, Zvi, 113, 157
Duarte, Antonio Marcos, 155
Durdin, François, 107
Eglese, R. W., 105, 107
Esposito, William R., 179
Facchinei, Francisco, 180
Fan, Y., 124
Ferguson, R. O., 137, 153
Fiacco, Anthony V., 124
Floudas, Christodoulos A., 179
Fourer, Robert, 178
Frank, Ildiko E., 158
Frayssé, Valérie, 23
Friedman, Jerome H., 158
Frome, E. L., 146
Fuller, Wayne A., 149
Furnival, George M., 157
Gan, Li, 132, 140
Garey, M. R., 30
Gay, David M., 170, 177, 178
Gelatt, C. D., 105
Gelfand, S. B., 108
Gentle, James E., 162, 164, 180
Gentleman, W. M., 161
Gersho, Allen, 117
Giles, C. Lee, 200
Gill, P. E., 4
Glover, Fred, 113
Goldberg, David, 15
Golden, B. L., 105, 107
Goldfarb, D., 124, 174
Golub, Gene H., 29, 149
Gonzaga, C. C., 126
Gould, Nick I. M., 170, 179
Green, Peter J., 87
Grewal, Mohinder S., 28
Griewank, Andreas, 177
Griffiths, P., 197
Gruet, Marie-Anne, 177
Gumus, Zeynep H., 179
Gutjahr, W. J., 108
Hammarling, S., 21
Hanson, Richard J., 5
Harding, Stephen T., 179
Harston, Craig, 115
Hartley, H. O., 84
Haussy, Jacques, 107
Hawkins, Douglas M., 155, 156
Haykin, Simon, 115
Heiberger, Richard M., 87
Higham, Nicholas J., 27
Hill, I. D., 197
Hock, Willi, 180
Hocking, R. R., 161
Hoffman, A., 179
Holland, John H., 109
Huddleston, H. F., 161
Hudson, H. Malcolm, 91
Huet, Sylvie, 177
Hull, T. E., 54
Hutchinson, D., 95
Idnani, A., 124, 174
Ip, Eddie H. S., 91
Ivey, John S., Jr., 96
Jain, A., 170
Jain, R. C., 115
Jamshidian, Mortaza, 92
Jenkins, M. A., 54
Jennison, Christopher, 112
Jennrich, Robert I., 92, 98, 99
Jiang, Jiming, 132, 140
Joag-Dev, K., 46
Jobson, J. D., 129
Johnson, D. S., 30
Johnson, Mark E., 107
Jolivet, Emmanuel, 177
Jones, R. D., 115
Júdice, Joaquim, 180
Juedes, David, 177
Jung, Joo Sung, 161
Karmarkar, N., 126, 155
Kavvadias, Dimitris J., 64, 132
Kennedy, William J., 164, 180
Kent, John T., 91
Kernighan, Brian W., 178
Khademi, P., 177
Khalfan, H. Fayez, 81
Khuri, André I., 97
Kiefer, J., 92
Kim, Dong K., 157
Kirkpatrick, Scott, 105
Klepeis, John L., 179
Koenker, Roger, 155
Korhonen, P., 134
Korkie, Bob, 129
Korst, Jan, 108
Koza, John R., 112
Krishnan, Thriyambakam, 91
Křivý, Ivan, 97
Kushner, Harold J., 64, 69, 116
Laguna, Manuel, 113
Laird, N. M., 89
Lange, Kenneth, 91
Lasdon, L. S., 124, 170
Lawrence, Steve, 200
Lawson, Charles L., 5
Ledolter, Johannes, 91
Lenstra, Jan Karel, 4, 104
Leroy, Annick M., 157
Lesk, Michael, 201
LeVeque, Randall J., 29
Lewis, John Gregg, 28
Liem, C. B., 35
Liepelt, M., 177
Linnainmaa, Seppo, 23
Louveaux, Francois, 5
Lü, T., 35
Ma, Jun, 91
Mächler, Martin, 157
Mannos, M., 179
Marchesi, M., 108
Marcoulides, George A., 113, 157
Maren, Alianna, 115
Markowitz, Harry M., 124
Martin, C., 108
Masri, S. F., 132
Mathon, R., 54
Matthews, D. E., 91
Mauer, A., 177
McCormick, Garth P., 124
McLachlan, Geoffrey J., 91
Mead, R., 94
Meng, Xiao-Li, 91, 92
Mergerson, James W., 161
Metropolis, N., 105
Meyer, Clifford A., 179
Michalewicz, Zbigniew, 112
Michie, Donald, 165
Miller, Alan J., 160, 161
Mitter, S. K., 108
Money, Arthur H., 155
Monro, Sutton, 64
Moré, Jorge J., 5, 74, 169, 175
Morgan, B. J. T., 108
Morris, R. J. T., 115
Mühlenbein, Heinz, 109, 112
Murray, W., 4
Murtagh, B. A., 126
Narula, Subhash C., 134, 153, 162, 164
Nash, Stephen G., 4, 123, 124, 126, 155
Nelder, J. A., 94
Nemhauser, George L., 4
Nguyen, Nam-Ky, 160, 161
Nocedal, Jorge, 4, 81, 123, 124
O’Connor, Marcus, 165
Oldham, Keith B., 196
Olive, David, 155
Osborne, M. R., 5
Panier, E. R., 124, 170
Pap, Robert, 115
Pardalos, Panos M., 179
Parisi, Valerio, 107
Parkinson, J. M., 95
Petitet, A., 21
Pflug, G. Ch., 108
Poli, I., 115
Portnoy, Stephen, 155
Powell, M. J. D., 100
Price, W. L., 96
Pukelsheim, Friedrich, 160
Pulleyblank, William R., 4, 116
Rabinowitz, F. Michael, 132
Rai, S. N., 91
Ralston, Mary L., 98, 99
Ratner, M., 170
Rechenberg, 109
Remus, William, 165
Ren, H., 21
Rice, John R., 20
Ridella, S., 108
Ripley, Brian D., 115
Robbins, Herbert, 64
Roberts, F. D. K., 127, 154
Rocke, David M., 108, 157, 160
Roh, L., 177
Rosenbluth, A. W., 105
Rosenbluth, M. N., 105
Rosenbrock, H. H., 179
Rousseeuw, Peter J., 156, 157
Rowan, Tom, 199
Rubin, Donald B., 23, 89, 91
Ruppert, David, 87, 155, 156
Rustagi, Jagdish S., 5
Salhi, Said, 113, 157
Sarkar, S., 124
Schittkowski, Klaus, 124, 170, 177, 180
Schlossmacher, E. J., 155
Schnabel, Robert B., 81, 149
Schrage, Linus E., 124
Schrijver, Alexander, 4, 116
Schweiger, Carl A., 179
Seber, G. A. F., 5
Sheehan, Nuala, 112
Shih, T. M., 35
Shiue, Wei-Kei, 156
Siarry, Patrick, 107
Singleton, R. R., 153
Smith, S., 170
Soares, João, 180
Sofer, Ariela, 4, 123, 126, 155
Sokolowsky, D., 179
Souvaine, D. L., 156
Spall, James C., 92, 93
Spanier, Jerome, 196
Spiegelhalter, David J., 165
Sposito, Vince A., 153, 164
Stanley, K., 21
Steele, J. M., 156
Stegun, Irene A., 196
Steihaug, T., 180
Stein, Myron L., 107
Steuer, Ralph E., 133, 134
Street, James O., 87
Sutton, Clifton D., 108
Tanner, Martin A., 91
Taylor, Charles C., 165
Taylor, Jeremy M. G., 157
Teller, A. H., 105
Teller, E., 105
Thisted, Ronald A., 196
Thompson, William J., 196
Thuente, David J., 74
Tits, A. L., 124, 170
Titterington, D. M., 160
Toint, Ph. L., 170, 179
Traub, J. F., 54
Tvrdík, Josef, 97
Utke, Jean, 177
Vaisey, Jacques, 117
Valliant, Richard, 162
Van Dyk, David, 92
Van Huffel, S., 149
Van Laarhoven, P. J. M., 107
Van Loan, Charles F., 149
Van Ness, John, 150
Vandewalle, J., 149
Vaz de Melo Mendes, Beatriz, 155
Vecchi, M. P., 105
Vrahatis, Michael N., 64, 132
Wade, Reed, 199
Wallenius, J., 134
Waren, A. D., 170
Watts, Donald G., 5
Wei, Greg C. C., 91
Weistroffer, H. R., 134
Wellington, John F., 153, 164
Welsch, Roy E., 170, 177
Welsh, A. H., 156
Whaley, R. C., 21
White, Halbert, 115
Whitley, Darrell, 112
Wiegmann, N., 179
Wild, C. J., 5
Wilkinson, J. H., 22, 23, 26, 65
Wilson, Robert W., 157
Wolfowitz, J., 92
Wolsey, Laurence A., 4
Wong, W. S., 115
Woodruff, David L., 108, 157, 160
Woods, D. J., 95
Wright, M. H., 4
Wright, Stephen J., 4, 5, 123, 124, 169, 175
Xu, Chong-Wei, 156
Yahav, Joseph A., 35
Yin, G. George, 64, 69, 116
Yu, P. L., 146
Yum, Bong Jin, 161
Zeger, Kenneth, 117
Ziemba, William T., 129
Subject Index

A
A-optimality 159
absolute error 18, 22
ACM Transactions on Mathematical Software 196, 199
ACM Transactions on Modeling and Computer Simulation 196
algorithm, definition 34, 54
Alta Vista (Web search engine) 200
Amdahl’s law 33
AMPL (modeling system) 176, 178
AMS MR classification system 196
Applied Statistics 196, 199
ArcView (software) 201
ASCII code 7

B
backward error analysis 22, 27
base point 9
base 10
beta function 192
BFGS (method) 80
bias, in exponent of floating-point number 11
big O (order) 24, 29
big omega (order) 25
bisection method 54
Boltzmann distribution 104
branch and bound 104, 157
Broyden update 80

C
C (programming language) 170
CALGO (Collected Algorithms of the ACM) 196, 199
cancellation error 21, 27
Cartesian product 189
catastrophic cancellation 20
Cauchy-Schwarz inequality 48
CDF (cumulative distribution function) 187
chaining of operations 19
classification 165
Collected Algorithms of the ACM (CALGO) 196, 199
combinatorial optimization 104
Communications in Statistics — Simulation and Computation 197
complete space 49
COMPSTAT 195, 197
Computational Statistics & Data Analysis 197
Computational Statistics 197
Computing Science and Statistics 197
concave function 40
condition (problem or data) 26
condition number of a function with respect to finding a root 65
condition number with respect to computing a sample standard deviation 28
condition number 26, 28, 65
conjugate gradient method 82, 87
consistent system of equations 66
constrained least squares, equality constraints 156, 157
convergence criterion 33, 54
convergence ratio 34
convex function 40
Cplex (software) 176
cross product 189
cumulative distribution function 187
Current Index to Statistics 196
curse of dimensionality 35

D
D-optimality 159, 160, 161
determinant of a matrix 160
DFP (method) 80
differentiation, symbolic 177
Dirac delta function 189
direct product 189
discretization error 25, 34
divide and conquer 32
dot product 48
double precision 15
dud method for least squares 98
dud secant for least squares 98

E
E-optimality 159
ECDF (empirical cumulative distribution function) 187
ECM algorithm 91
EM method 88
empirical cumulative distribution function 187
error, absolute 18, 22
error, cancellation 21, 27
error, discretization 25
error, measures of 23, 24
error, relative 18, 22
error, rounding 21, 22
error, rounding, models of 23, Exercise 2.6: 37
error, truncation 25
error bound 23
error of approximation 25
errors-in-variables 149
exception, in computer operations 17, 20, 21
Exponent Graphics (software) 201
exponent 10
exponential order 29
extended precision 15
extrapolation 34

F
fan-in algorithm 32
fathoming 104, 157
feasible point 46
feasible region 47
feed-forward network 114
Fisher scoring 92
fixed-point method 54
fixed-point representation 9
floating-point representation 9
FLOP, or flop 31
FLOPS, or flops 31
Fortran 170, 90
FSQP (software) 125, 170
function space 47
functional least squares regression 156

G
gamma function 192
GAMS (modeling optimization system) 176, 178
GAMS (Guide to Available Mathematical Software) 199
GAMS, electronic access 199
Gauss-Newton method 84, 98
generalized least squares with equality constraints 156
genetic algorithm 108, 157
global optimization 131
golden section search 73
Goldstein-Armijo method 74
graceful underflow 13
gradient of a function 44, 66
gradual underflow 13, 21
greedy algorithm 32, 75
GRG2 (software) 170
guard digit 19

H
Heaviside function 189
Hessian 45
hidden bit 11
Hilbert space 49
HotBot (Web search engine) 200
html 198

I
IEEE standards 8, 15, 20
ill-conditioned (problem or data) 26
ill-conditioned data 26
IMSL Libraries 170, 201
incomplete gamma function 192
indicator function 189
infinity, floating-point representation 16, 20, 21
inner product 48
integer programming 127
integer representation 9
Interface Symposium 195, 197
interior-point method 126, 155
International Association of Statistical Computing (IASC) 195, 197
Internet 198
interpolation 142
inverse problem 141
IRLS (iteratively reweighted least squares) 86, 87, 154, 155
isnan 16
iterative method 33
iteratively reweighted least squares 86, 87, 154, 155

J
Jacobian 66
Java (programming language) 199
Jensen’s inequality 41, 52
Journal of Computational and Graphical Statistics 197
Journal of Statistical Computation and Simulation 197

K
Kalman filter 28
Karmarkar algorithm 126, 155
Karush-Kuhn-Tucker conditions 123
Kiefer-Wolfowitz procedure 92

L
L1 regression 152
Lp regression 155
Lagrange multiplier 122
Lagrangian function 122
Laplacian operator 190
LAV regression 152
least absolute values 152
least median of squares regression 156
least squares regression 83, 98
least trimmed absolute values regression 155
least trimmed squares regression 155, 156
Levenberg-Marquardt algorithm 85
limited-memory quasi-Newton method 81
Lindo (software) 176
line search 72
linear convergence 34
linear independence 48
linear programming software 176
linear programming 47, 125
Lisp-Stat (software) 202
little o (order) 24
little omega (order) 25
log concave function 41
log convexity 41
log order 29
LSGRG2 (software) 170

M
M estimator 151, 155
M regression 155
MACHAR 15, Exercise 2.1: 36
machine epsilon 13, 78
Maple (software) 202
Mathematica (software) 201
mathematical programming problem 47
Mathematical Reviews 196
mathml 199
Matlab (software) 201
maximum likelihood method 138, 165
minimum volume ellipsoid 157
MINOS (software) 176
missing data 89
missing value, representation of 16
mixed integer programming 127
modified Gauss-Newton algorithm 84
Mosaic (Web browser software) 198
MPS format 176, 178
MR classification system 196
multi-layer perceptron 114
multiple roots of a function 64

N
Nag Libraries 170
NaN (“not-a-number”) 16, 21
Nelder-Mead simplex method 94, 119
netlib 179, 196, 199
neural network 113
Newton’s method 56, 66, 75
Newton-Raphson 75
NL2SOL (software) 170
NLPQL (software) 124, 170
noisy function optimization 96, 108, 115, 116
nonlinear regression 83, 86, 98, 150
nonparametric smoothing 145
norm, function 49
normal function 49
normalized floating-point numbers 11
Northern Light (Web search engine) 200
not-a-number (“NaN”) 16
NP-complete problem 30

O
Occam’s razor 144
optimization of stochastic (noisy) functions 96, 108, 116
optimization 1
order of computations 29
order of convergence 24
order of error 24
orthogonal distance regression 149
OSL (software) 170
overflow, in computer operations 18, 20

P
polynomial order 29
portability 21
precision, double 15
precision, extended 15
precision, single 15
Price controlled random search (for optimization) 96, 119
probabilistic error bound 24
Proceedings of the Statistical Computing Section 197
PV-Wave (software) 202

Q
Q-convergence 34
quadratic convergence 34
quadratic programming 124, 176
quadratic programming software 176
quasi-Newton method 79

R
radix 10
rank-two update 80
rate constant 34
rate of convergence 34
real numbers 9
reduced gradient methods 170
register, in computer processor 19
regression 150
regression, nonlinear 83, 86, 98
regula falsi 61
relative error 18, 22
relative spacing 13
Richardson extrapolation 35
ridge regression 84
Robbins-Monro procedure 64, 69, 116
robust estimation with equality constraints 156
robustness (algorithm or software) 26
root of a function 20, 54
Rosenbrock function 179, 180
rounding error 21, 22

S
S, S-Plus (software) 202
sample variance, computing 27
SAS optimization software 176
scaling of an algorithm 29
scoring 92
secant method 59
SEM (stochastic EM) algorithm 91
SEM (supplemented EM) algorithm 91
sequential quadratic programming 124, 170
sequential unconstrained minimization techniques 124
Sherman-Morrison formula 81
SIAM Journal on Scientific Computing 197
sign bit 8
significand 10
simplex algorithm, linear programming 126
simplex method 94, 119
simplex 40
simulated annealing 104, 157
simultaneous perturbation stochastic approximation 92, 132
single precision 15
smoothing 145
splitting extrapolation 35
SQP (software) 124
stability 26, 65
standard deviation, computing 27
Statistical Computing Section of the American Statistical Association 195, 197
Statistical Computing & Graphics Newsletter 197
Statistics and Computing 197
statlib 197, 199
steepest descent 75
step length factor 73
stiff data 28
stochastic approximation 62, 68, 92, 116
stopping criterion 33, 54
storage unit 8, 10
successive quadratic programming 124
SUMT 124
superlinear convergence 34
symbolic differentiation 177

T
T convexity 41
tabu search 112, 157
total least squares 149
tree 157
trimmed least squares regression 155, 156
truncated Newton method 82
truncation error 25
trust region 77, 85
twos-complement representation 8, 18

U
ulp (“unit in the last place”) 14
underflow, in computer operations 13, 21
unit in the last place 14
unit roundoff 13
URL 199

V
variable metric method 79
variance, computing 27

W
W3C (World Wide Web Consortium) 198
Ware’s law 33
Web browser 198
weighted least squares with equality constraints 156
weighted least squares 86
Woodbury formula 81
word, computer 8, 10
World Wide Web (WWW) 198
World Wide Web Consortium 198

X
xml 198
Xnetlib 199
XploRe (software) 202

Y
Yahoo (Web search engine) 200

Z
zero of a function 20, 54