SEQUENTIAL METHODS IN PATTERN RECOGNITION AND MACHINE LEARNING
This is Volume 52 in MATHEMATICS IN SCIENCE AND ENGINEERING A series of monographs and textbooks Edited by RICHARD BELLMAN, University of Southern California A complete list of the books in this series appears at the end of this volume.
SEQUENTIAL METHODS IN PATTERN RECOGNITION AND MACHINE LEARNING K. S. FU School of Electrical Engineering Purdue University Lafayette, Indiana
ACADEMIC PRESS New York and London 1968
COPYRIGHT © 1968 BY ACADEMIC PRESS, INC. ALL RIGHTS RESERVED. NO PART OF THIS BOOK MAY BE REPRODUCED IN ANY FORM, BY PHOTOSTAT, MICROFILM, OR ANY OTHER MEANS, WITHOUT WRITTEN PERMISSION FROM THE PUBLISHERS.
ACADEMIC PRESS, INC. 111 Fifth Avenue, New York, New York 10003
United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD. Berkeley Square House, London W. 1
LIBRARY OF CONGRESS CATALOG CARD NUMBER: 68-8424
PRINTED IN THE UNITED STATES OF AMERICA
PREFACE

During the past decade there has been a considerable growth of interest in problems of pattern recognition and machine learning. This interest has created an increasing need for methods and techniques for the design of pattern recognition and learning systems. Many different approaches have been proposed. One of the most promising techniques for the solution of problems in pattern recognition and machine learning is the statistical theory of decision and estimation. This monograph treats the problems of pattern recognition and machine learning by use of sequential methods in statistical decision and estimation theory. The material presented in this volume is primarily based on the research carried out by the author and his co-workers, Dr. G. P. Cardillo, Dr. C. H. Chen, Dr. Y. T. Chien, and Dr. Z. J. Nikolic, during the past several years. In presenting the material, emphasis is placed upon the development of basic theory and computation algorithms in a systematic fashion. Although many different types of experiments have been performed to test the methods discussed, for illustrative purposes only experiments in English-character recognition have been presented. The monograph is intended to be of use both as a reference for system engineers and computer scientists and as a supplementary textbook for courses in pattern recognition and adaptive and learning systems. The presentation is kept concise. As background to this monograph, it is assumed that the reader has adequate preparation in college mathematics and an introductory course on probability theory and mathematical statistics. The subject matter may be divided into two major parts: (1) pattern recognition and (2) machine learning. Roughly speaking, six approaches are presented; they are divided equally among Chapters 2 through 7. After a brief review of several important approaches in pattern recognition in Chapter 1, two methods for feature selection and ordering, in terms of an information theoretic approach and the Karhunen-Loève expansion, are presented in Chapter 2. In addition to the
application of Wald's sequential probability ratio test and the generalized sequential probability ratio test to pattern classification problems, three techniques are discussed, namely, the modified sequential probability ratio test with time-varying stopping boundaries (Chapter 3), the backward procedure using dynamic programming (Chapter 4), and the nonparametric sequential ranking procedure (Chapter 5). The application of dynamic programming to both feature ordering and pattern classification is also included in Chapter 4. A brief introduction to sequential analysis is given in Appendix A. Bayesian estimation techniques (Chapter 6) and the stochastic approximation procedure (Chapter 7) are introduced as learning techniques in sequential recognition systems. Both supervised and nonsupervised learning schemes are discussed. Relationships between Bayesian estimation techniques and the generalized stochastic approximation procedure are demonstrated. Methods are also suggested for the learning of slowly time-varying parameters. The method of potential functions, because of its close relationship to the stochastic approximation procedure, is briefly presented in Appendix G. Some of the material in the monograph has been discussed in several short courses at Purdue University, Washington University, and UCLA. Most of the material has been taught in both regular and seminar courses at Purdue University and the University of California at Berkeley. For a regular course in pattern recognition and machine learning, many other approaches should also be discussed. Unfortunately, because of the limited scope of the monograph, those promising approaches cannot be covered in detail here. Instead, a very brief remark on other related approaches and interesting research problems is given in the last section of each chapter. There is, no doubt, still some work not mentioned even in these remarks, due to the author's oversight or ignorance.

Lafayette, Indiana
August, 1968
K. S. Fu
ACKNOWLEDGMENTS
It is the author's pleasure to acknowledge the encouragement of Dr. M. E. VanValkenburg, Dr. L. A. Zadeh, Dr. T. F. Jones, Dr. W. H. Hayt, Jr., Dr. J. C. Hancock, and Dr. J. R. Lehmann. He owes a debt of gratitude to Dr. Richard Bellman, who read the manuscript and contributed many valuable suggestions. The author is also indebted to his colleagues and students at Purdue University and the University of California at Berkeley, who, through many helpful discussions during office and class hours, coffee breaks, and late evenings, assisted in the preparation of the manuscript. Particular suggestions and errata lists were provided by Dr. Z. J. Nikolic and Dr. Y. T. Chien. The author and his co-workers at Purdue have been very fortunate in having the consistent support of the National Science Foundation for the research in pattern recognition and machine learning. The major part of the manuscript was completed during the author's sabbatical year (1967) at the Department of Electrical Engineering and Computer Science, University of California, Berkeley. The environment and the atmosphere in Cory Hall and on Telegraph Avenue definitely stimulated the improvement and the early completion of the manuscript. In addition, the author wishes to thank Mrs. Patricia Gress for her efficient and careful typing of the manuscript.
CONTENTS

Preface

1. Introduction
1.1 Pattern Recognition
1.2 Deterministic Classification Techniques
1.3 Training in Linear Classifiers
1.4 Statistical Classification Techniques
1.5 Sequential Decision Model for Pattern Classification
1.6 Learning in Sequential Pattern Recognition Systems
1.7 Summary and Further Remarks
References

2. Feature Selection and Feature Ordering
2.1 Feature Selection and Ordering-Information Theoretic Approach
2.2 Feature Selection and Ordering-Karhunen-Loève Expansion
2.3 Illustrative Examples
2.4 Summary and Further Remarks
References

3. Forward Procedure for Finite Sequential Classification Using Modified Sequential Probability Ratio Test
3.1 Introduction
3.2 Modified Sequential Probability Ratio Test-Discrete Case
3.3 Modified Sequential Probability Ratio Test-Continuous Case
3.4 Procedure of Modified Generalized Sequential Probability Ratio Test
3.5 Experiments in Pattern Classification
3.6 Summary and Further Remarks
References

4. Backward Procedure for Finite Sequential Recognition Using Dynamic Programming
4.1 Introduction
4.2 Mathematical Formulation and Basic Functional Equation
4.3 Reduction of Dimensionality
4.4 Experiments in Pattern Classification
4.5 Backward Procedure for Both Feature Ordering and Pattern Classification
4.6 Experiments in Feature Ordering and Pattern Classification
4.7 Use of Dynamic Programming for Feature-Subset Selection
4.8 Suboptimal Sequential Pattern Recognition
4.9 Summary and Further Remarks
References

5. Nonparametric Procedure in Sequential Pattern Classification
5.1 Introduction
5.2 Sequential Ranks and Sequential Ranking Procedure
5.3 A Sequential Two-Sample Test Problem
5.4 Nonparametric Design of Sequential Pattern Classifiers
5.5 Analysis of Optimal Performance and a Multiclass Generalization
5.6 Experimental Results and Discussions
5.7 Summary and Further Remarks
References

6. Bayesian Learning in Sequential Pattern Recognition Systems
6.1 Supervised Learning Using Bayesian Estimation Techniques
6.2 Nonsupervised Learning Using Bayesian Estimation Techniques
6.3 Bayesian Learning of Slowly Varying Patterns
6.4 Learning of Parameters Using an Empirical Bayes Approach
6.5 A General Model for Bayesian Learning Systems
6.6 Summary and Further Remarks
References

7. Learning in Sequential Recognition Systems Using Stochastic Approximation
7.1 Supervised Learning Using Stochastic Approximation
7.2 Nonsupervised Learning Using Stochastic Approximation
7.3 A General Formulation of Nonsupervised Learning Systems Using Stochastic Approximation
7.4 Learning of Slowly Time-Varying Parameters Using Dynamic Stochastic Approximation
7.5 Summary and Further Remarks
References

APPENDIX A. Introduction to Sequential Analysis
1. Sequential Probability Ratio Test
2. Bayes' Sequential Decision Procedure
References

APPENDIX B. Optimal Properties of Generalized Karhunen-Loève Expansion
1. Derivation of Property (i)
2. Derivation of Property (ii)

APPENDIX C. Properties of the Modified SPRT

APPENDIX D. Enumeration of Some Combinations of the k_j's and Derivation of Formula for the Reduction of Tables Required in the Computation of Risk Functions

APPENDIX E. Computations Required for the Feature Ordering and Pattern Classification Experiments Using Dynamic Programming

APPENDIX F. Stochastic Approximation: A Brief Survey
1. Robbins-Monro Procedure for Estimating the Zero of an Unknown Regression Function
2. Kiefer-Wolfowitz Procedure for Estimating the Extremum of an Unknown Regression Function
3. Dvoretzky's Generalized Procedure
4. Methods of Accelerating Convergence
5. Dynamic Stochastic Approximation
References

APPENDIX G. The Method of Potential Functions or Reproducing Kernels
1. The Estimation of a Function with Noise-Free Measurements
2. The Estimation of a Function with Noisy Measurements
3. Pattern Classification-Deterministic Case
4. Pattern Classification-Statistical Case
References

Author Index
Subject Index
CHAPTER 1
INTRODUCTION
1.1 Pattern Recognition
The problem of pattern recognition is that of classifying or labeling a group of objects on the basis of certain subjective requirements. The objects classified into the same pattern class usually have some common properties. The classification requirements are subjective since different types of classifications occur under different situations. For example, in recognizing English characters, there are twenty-six pattern classes. However, in distinguishing English characters from Chinese characters, there are only two pattern classes, i.e., English and Chinese. Human beings perform the task of pattern recognition at almost every level of the nervous system. More recently, engineers have faced the problem of designing machines for pattern recognition. Preliminary results have been very encouraging. There have been some successful attempts to design or to program machines to read printed or typed characters, identify bank checks, classify electrocardiograms, recognize some spoken words, play checkers and chess, and sort photographs. Other applications of pattern recognition include handwritten character or word recognition, general medical diagnosis, system fault identification, seismic wave classification, target detection, weather prediction, speech recognition, etc. The simplest approach to pattern recognition is probably that of "template matching." In this case, a set of templates or prototypes, one for each pattern class, is stored in the machine. The input pattern (with unknown classification) is compared with the template of each class, and the classification is based on a preselected matching criterion or similarity criterion. In other words, if the input pattern matches the template of the ith pattern class better than it matches any other template, then the input is classified as from the ith pattern class. Usually, for the simplicity of the machine, the templates are stored
in their raw-data form. This approach has been used for some existing printed-character recognizers and bank-check readers. The disadvantage of the template-matching approach is that it is sometimes difficult to select a good template for each pattern class and to define a proper matching criterion. The difficulty is especially remarkable when large variations and distortions are expected in the patterns belonging to one class. The recognition of handwritten characters is a good example of this case. A more sophisticated approach is that, instead of matching the input pattern with the templates, the classification is based on a set of selected measurements extracted from the input pattern. These selected measurements, called "features," are supposed to be invariant or less sensitive with respect to the commonly encountered variations and distortions, and also to contain fewer redundancies. Under this proposition, pattern recognition can be considered as consisting of two subproblems. The first subproblem is what measurements should be taken from the input patterns. Usually, the decision of what to measure is rather subjective and also dependent on the practical situation (for example, the availability of measurements, the cost of measurements, etc.). Unfortunately, at present there is very little general theory for the selection of feature measurements. However, there are some investigations concerned with the selection of a subset and the ordering of features in a given set of measurements. The criterion for feature selection or ordering is often based on either the importance of the features in characterizing the patterns or the contribution of the features to the performance of recognition (i.e., the accuracy of recognition). The second subproblem in pattern recognition is the problem of classification (or making a decision on the class assignment of the input patterns) based on the measurements taken from the selected features. The device or machine which extracts the feature measurements from input patterns is called a feature extractor. The device or machine which performs the function of classification is called a classifier. A simplified block diagram of a pattern recognition system
Fig. 1.1. A pattern recognition system.
is shown in Fig. 1.1.† Thus, in general terms, the template-matching approach may be interpreted as a special case of the second, "feature-extraction" approach, in which the templates are stored in terms of feature measurements and a special classification criterion (matching) is used for the classifier.

† The division into two parts is primarily for convenience rather than necessity.

1.2 Deterministic Classification Techniques
The concept of pattern classification may be expressed in terms of the partition of the feature space (or a mapping from the feature space to a decision space). Suppose that N features are to be measured from each input pattern. Each set of N features can be considered as a vector X, called a feature (measurement) vector, or a point in the N-dimensional feature space Ω_X. The problem of classification is to assign each possible vector or point in the feature space to a proper pattern class. This can be interpreted as a partition of the feature space into mutually exclusive regions, where each region corresponds to a particular pattern class. Mathematically, the problem of classification can be formulated in terms of "discriminant functions" [1]. Let ω_1, ω_2, ..., ω_m be designated as the m possible pattern classes to be recognized, and let

\[ X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \tag{1.1} \]

be the feature (measurement) vector, where x_i represents the ith feature measurement. Then the discriminant function D_i(X) associated with the pattern class ω_i, i = 1, ..., m, is such that if the input pattern represented by the feature vector X is in class ω_i, denoted X ~ ω_i, the value of D_i(X) must be the largest. That is, for all X ~ ω_i,

\[ D_i(X) > D_j(X), \qquad i, j = 1, \ldots, m, \quad i \neq j \tag{1.2} \]

Thus, in the feature space Ω_X, the boundary of the partition, called the decision boundary, between the regions associated with class ω_i and class ω_j, respectively, is expressed by the following equation:

\[ D_i(X) - D_j(X) = 0 \tag{1.3} \]
Fig. 1.2. A classifier.
A general block diagram for the classifier using criterion (1.2) and a typical two-dimensional illustration of (1.3) are shown in Figs. 1.2 and 1.3, respectively. Many different forms satisfying condition (1.2) can be selected for D_i(X). Several important discriminant functions are discussed in the following.
Fig. 1.3. An example of partition in a two-dimensional feature space.
A. Linear Discriminant Functions

In this case a linear combination of the feature measurements x_1, x_2, ..., x_N is selected for D_i(X), i.e.,

\[ D_i(X) = \sum_{k=1}^{N} w_{ik} x_k + w_{i,N+1}, \qquad i = 1, \ldots, m \tag{1.4} \]

The decision boundary between the regions in Ω_X associated with ω_i and ω_j is of the form

\[ D_i(X) - D_j(X) = \sum_{k=1}^{N} w_k x_k + w_{N+1} = 0 \tag{1.5} \]
with w_k = w_{ik} − w_{jk} and w_{N+1} = w_{i,N+1} − w_{j,N+1}. Equation (1.5) is the equation of a hyperplane in the feature space Ω_X. A general linear discriminant computer is shown in Fig. 1.4.

Fig. 1.4. A linear discriminant computer.

If m = 2, on the basis of (1.5), i, j = 1, 2 (i ≠ j), a threshold logic device as shown in Fig. 1.5 can be employed as a linear classifier (a classifier using linear discriminant functions).

Fig. 1.5. A linear two-class classifier.
From Fig. 1.5, let D(X) = D_1(X) − D_2(X); then

\[ \text{if output} = +1, \text{ i.e., } D(X) > 0, \text{ then } X \sim \omega_1 \]
\[ \text{if output} = -1, \text{ i.e., } D(X) < 0, \text{ then } X \sim \omega_2 \tag{1.6} \]
When the number of pattern classes is more than two, m > 2, several threshold logic devices can be connected in parallel so that the combinations of the outputs from, say, M threshold logic devices will be sufficient for distinguishing m classes when 2^M ≥ m. Alternatively, the general configuration of Figs. 1.2 and 1.4 can also be used.
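To make the threshold-logic rule concrete, here is a minimal Python sketch of a two-class linear classifier implementing (1.5) and (1.6); the weight values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def linear_two_class(x, w):
    """Threshold logic device for two classes (Eqs. 1.5-1.6).

    x: feature vector of length N; w: weight vector of length N+1,
    where w[-1] is the threshold weight w_{N+1}.
    Returns +1 (decide omega_1) or -1 (decide omega_2).
    """
    d = np.dot(w[:-1], x) + w[-1]   # D(X) = D_1(X) - D_2(X)
    return 1 if d > 0 else -1

# Hypothetical hyperplane x1 + x2 - 1 = 0 in a two-dimensional feature space
w = np.array([1.0, 1.0, -1.0])
print(linear_two_class(np.array([0.9, 0.8]), w))   # +1 -> omega_1
print(linear_two_class(np.array([0.1, 0.2]), w))   # -1 -> omega_2
```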
B. Minimum-Distance Classifier

An important class of linear classifiers is that using the distances between the input pattern and a set of reference vectors or prototype points in the feature space as the classification criterion. Suppose that m reference vectors R_1, R_2, ..., R_m are given, with R_j associated with the pattern class ω_j. A minimum-distance classification scheme with respect to R_1, R_2, ..., R_m is to classify the input X as from ω_i, i.e.,

\[ X \sim \omega_i \quad \text{if } |X - R_i| \text{ is the minimum} \tag{1.7} \]

where |X − R_i| is the distance defined between X and R_i. For example, |X − R_i| may be defined as

\[ |X - R_i| = \left[ (X - R_i)^T (X - R_i) \right]^{1/2} \tag{1.8} \]

where the superscript T represents the transpose operation on a vector. From (1.8),

\[ |X - R_i|^2 = X^T X - X^T R_i - R_i^T X + R_i^T R_i \tag{1.9} \]

Since X^T X is not a function of i, the corresponding discriminant function for a minimum-distance classifier is essentially

\[ D_i(X) = X^T R_i + R_i^T X - R_i^T R_i, \qquad i = 1, \ldots, m \tag{1.10} \]
which is linear. Hence, a minimum-distance classifier is also a linear classifier. The performance of a minimum-distance classifier is of course dependent upon an appropriately selected set of reference vectors.
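The equivalence between the distance rule (1.7)-(1.8) and the linear discriminant (1.10) can be checked numerically. A minimal sketch, with invented reference vectors:

```python
import numpy as np

def min_distance_classify(x, refs):
    """Minimum-distance classifier (Eqs. 1.7-1.10).

    refs: list of m reference vectors R_i, one per pattern class.
    Returns the index i maximizing D_i(X) = 2 X^T R_i - R_i^T R_i,
    which is equivalent to choosing the nearest R_i.
    """
    scores = [2.0 * (x @ r) - r @ r for r in refs]
    return int(np.argmax(scores))

refs = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # hypothetical prototypes
x = np.array([2.5, 2.0])
# Same answer as the explicit nearest-distance rule (1.7)
assert min_distance_classify(x, refs) == int(np.argmin([np.linalg.norm(x - r) for r in refs]))
print(min_distance_classify(x, refs))  # 1
```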
C. Piecewise Linear Discriminant Functions

The concept adopted in Section B can be extended to the case of minimum-distance classification with respect to sets of reference vectors. Let R_1, R_2, ..., R_m be the m sets of reference vectors associated with classes ω_1, ω_2, ..., ω_m, respectively, and let the reference vectors in R_j be denoted R_j^{(k)}, i.e.,

\[ R_j^{(k)} \in R_j, \qquad k = 1, \ldots, u_j \]

where u_j is the number of reference vectors in the set R_j. Define the distance between an input feature vector X and R_j as

\[ |X - R_j| = \min_{k=1,\ldots,u_j} |X - R_j^{(k)}| \tag{1.11} \]

That is, the distance between X and R_j is the smallest of the distances between X and each vector in R_j. The classifier will assign the input to the pattern class which is associated with the closest vector set. If the distance between X and R_i^{(k)}, |X − R_i^{(k)}|, is defined as in (1.8), then the discriminant function used in this case is essentially

\[ D_i(X) = \max_{k=1,\ldots,u_i} \left\{ X^T R_i^{(k)} + (R_i^{(k)})^T X - (R_i^{(k)})^T R_i^{(k)} \right\}, \qquad i = 1, \ldots, m \tag{1.12} \]

Let

\[ D_i^{(k)}(X) = X^T R_i^{(k)} + (R_i^{(k)})^T X - (R_i^{(k)})^T R_i^{(k)} \tag{1.13} \]

Then

\[ D_i(X) = \max_{k=1,\ldots,u_i} \left\{ D_i^{(k)}(X) \right\}, \qquad i = 1, \ldots, m \tag{1.14} \]
It is noted that D_i^{(k)}(X) is a linear combination of the features; hence the class of classifiers using (1.12) or (1.14) is often called piecewise linear classifiers [1]. An example of the piecewise linear classifier is the α-perceptron, which is shown in Fig. 1.6.

Fig. 1.6. An α-perceptron.
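A sketch of the piecewise linear rule (1.12)-(1.14), under the assumption that each prototype set is given explicitly; the prototypes below are hypothetical.

```python
import numpy as np

def piecewise_linear_classify(x, ref_sets):
    """Piecewise linear classifier (Eqs. 1.12-1.14).

    ref_sets: list of m prototype sets; ref_sets[i] holds the u_i
    reference vectors R_i^(k) of class omega_i.
    D_i(X) = max_k [2 X^T R_i^(k) - (R_i^(k))^T R_i^(k)], so the class
    owning the single closest prototype wins.
    """
    d = [max(2.0 * (x @ r) - r @ r for r in refs) for refs in ref_sets]
    return int(np.argmax(d))

# Hypothetical two-prototype-per-class example in two dimensions
ref_sets = [[np.array([0.0, 0.0]), np.array([4.0, 4.0])],
            [np.array([0.0, 4.0]), np.array([4.0, 0.0])]]
print(piecewise_linear_classify(np.array([3.6, 3.9]), ref_sets))  # 0
```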
D. Polynomial Discriminant Functions

An rth-order polynomial discriminant function can be expressed as

\[ D_i(X) = w_{i1} f_1(X) + w_{i2} f_2(X) + \cdots + w_{iL} f_L(X) + w_{i,L+1} \tag{1.15} \]

where f_j(X) is of the form

\[ x_{k_1}^{n_1} x_{k_2}^{n_2} \cdots x_{k_r}^{n_r} \qquad \text{for } k_1, k_2, \ldots, k_r = 1, \ldots, N \text{ and } n_1, n_2, \ldots, n_r = 0 \text{ and } 1 \tag{1.16} \]

The decision boundary between any two classes is also in the form of an rth-order polynomial. In particular, if r = 2, the discriminant function is called a quadric discriminant function.
In this case,

\[ f_j(X) = x_{k_1}^{n_1} x_{k_2}^{n_2} \qquad \text{for } k_1, k_2 = 1, \ldots, N, \quad n_1, n_2 = 0 \text{ and } 1 \tag{1.17} \]

Typically, the number of terms is

\[ L = \tfrac{1}{2} N(N + 3) \tag{1.19} \]
In general, the decision boundary for quadric discriminant functions is a hyperhyperboloid. Special cases include the hypersphere, the hyperellipsoid, and the hyperellipsoidal cylinder. A general quadric discriminant computer is shown in Fig. 1.7.
Fig. 1.7. A quadratic discriminant computer.
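The quadric case can be implemented by expanding each pattern into the L = ½N(N + 3) terms of (1.17) and applying a linear discriminant to the expanded vector. A minimal sketch of the expansion:

```python
import numpy as np
from itertools import combinations

def quadric_features(x):
    """Expanded feature vector for a quadric discriminant (Eqs. 1.15-1.19).

    For N raw features, produces the L = N(N+3)/2 terms: squares x_k^2,
    cross products x_k1 x_k2 (k1 < k2), and the linear terms x_k.
    A quadric discriminant is then the linear form w . f + w_{L+1}.
    """
    x = np.asarray(x, dtype=float)
    squares = x ** 2
    cross = np.array([x[i] * x[j] for i, j in combinations(range(len(x)), 2)])
    f = np.concatenate([squares, cross, x])
    assert len(f) == len(x) * (len(x) + 3) // 2   # L = N(N+3)/2
    return f

f = quadric_features([1.0, 2.0, 3.0])
print(len(f))   # 9 = 3*(3+3)/2
```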
1.3 Training in Linear Classifiers
The two-class linear classifier discussed in Section 1.2 can easily be implemented by a single threshold logic device. If the patterns from different classes are linearly separable (can be separated by a hyperplane in the feature space Ω_X), then with correct values of the coefficients or weights w_1, w_2, ..., w_{N+1} in (1.5), the achievement of perfectly correct recognition is possible. However, in practice, the proper values of the weights are usually not available. Under such circumstances, it is proposed that the classifier be designed to have the capability of estimating the best values of the weights from the input patterns. The basic idea is that, by observing patterns with known classifications, the classifier can automatically adjust the weights in order to achieve correct recognitions. The performance of the classifier is supposed to improve as more and more patterns are observed. This process is called training or learning, and the
patterns used as the inputs are called training patterns. Several simple training rules are briefly introduced in this section. Let Y be an augmented feature vector, defined as

\[ Y = \begin{bmatrix} x_1 \\ \vdots \\ x_N \\ 1 \end{bmatrix} \tag{1.20} \]

where X is the feature vector of a pattern. Consider two sets of training patterns T_1' and T_2' belonging to two different pattern classes ω_1 and ω_2, respectively. Corresponding to the two training sets there are two sets of augmented vectors T_1 and T_2; each element in T_1 and T_2 is obtained by augmenting the patterns in T_1' and T_2', respectively. That the two training sets are linearly separable means that a weight vector W exists (called the solution weight vector) such that

\[ Y^T W > 0 \quad \text{for each } Y \in T_1 \]
\[ Y^T W < 0 \quad \text{for each } Y \in T_2 \tag{1.21} \]

where

\[ W = \begin{bmatrix} w_1 \\ \vdots \\ w_N \\ w_{N+1} \end{bmatrix} \tag{1.22} \]
The so-called "error-correction" training procedure for a linear classifier can be summarized in the following way. For any Y ∈ T_1, the product Y^T W must be positive, i.e., Y^T W > 0. If the output of the classifier is erroneous (i.e., Y^T W < 0) or undefined (i.e., Y^T W = 0), then let the new weight vector be

\[ W' = W + \alpha Y \tag{1.23} \]

where α > 0 is called the correction increment. On the other hand, for Y ∈ T_2, Y^T W < 0. If the output of the device is erroneous (i.e., Y^T W > 0) or undefined, then let

\[ W' = W - \alpha Y \tag{1.24} \]
Before training begins, W may be preset to any convenient values. Three rules for choosing α are suggested:

(i) Fixed increment rule. α is any fixed positive number.

(ii) Absolute correction rule. α is taken to be the smallest integer which will make the value of Y^T W cross the threshold of zero. That is,

\[ \alpha = \text{the smallest integer greater than } \frac{|Y^T W|}{Y^T Y} \tag{1.25} \]

(iii) Fractional correction rule. α is chosen such that

\[ |Y^T W - Y^T W'| = \lambda |Y^T W|, \qquad 0 < \lambda \tag{1.26} \]

or, equivalently,

\[ \alpha = \lambda \frac{|Y^T W|}{Y^T Y} \tag{1.27} \]
The convergence of the three error-correction rules can be proved [1]. By convergence, it is meant that if the two training sets are linearly separable, the sequence of weight vectors produced by the training rule converges to a solution weight vector in a finite number of training steps or iterations.
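A minimal sketch of the fixed-increment rule (1.23)-(1.24), with hypothetical linearly separable training sets; by the convergence result just quoted, the loop terminates with a solution weight vector.

```python
import numpy as np

def fixed_increment_train(T1, T2, alpha=1.0, max_epochs=100):
    """Fixed-increment error-correction rule (Eqs. 1.23-1.24).

    T1, T2: lists of augmented training vectors Y (feature vector with
    a trailing 1, Eq. 1.20) for classes omega_1 and omega_2.
    Seeks W with Y^T W > 0 on T1 and Y^T W < 0 on T2 (Eq. 1.21).
    """
    W = np.zeros(len(T1[0]))
    for _ in range(max_epochs):
        errors = 0
        for Y in T1:
            if Y @ W <= 0:         # erroneous or undefined output
                W = W + alpha * Y  # Eq. (1.23)
                errors += 1
        for Y in T2:
            if Y @ W >= 0:
                W = W - alpha * Y  # Eq. (1.24)
                errors += 1
        if errors == 0:
            return W               # a solution weight vector
    return W

T1 = [np.array([2.0, 1.0, 1.0]), np.array([1.5, 2.0, 1.0])]    # hypothetical
T2 = [np.array([-1.0, -0.5, 1.0]), np.array([-2.0, 0.0, 1.0])]
print(fixed_increment_train(T1, T2))
```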
1.4 Statistical Classification Techniques

In Section 1.2, the feature measurements x_1, x_2, ..., x_N are assumed to be deterministic quantities. However, in many cases the patterns in one class are expected to have large variations in feature measurements, and the noise effect involved in taking these measurements cannot be neglected. One approach proposed is to consider x_1, x_2, ..., x_N as random variables, where x_i is the noisy measurement of the ith feature. For each pattern class ω_j, j = 1, ..., m, assume that the multivariate (N-dimensional) probability density (or distribution) function of the feature vector X, p(X/ω_j), and the probability of occurrence of ω_j, P(ω_j), are known. On the basis of the a priori information p(X/ω_j) and P(ω_j), j = 1, ..., m, the function of a classifier is to perform the classification task so as to minimize the probability of misrecognition. The problem of pattern classification can now be formulated as a statistical decision problem (the testing of m statistical hypotheses) by defining a decision function d(X), where d(X) = d_i
means that the hypothesis H_i: X ~ ω_i is accepted [1]-[3]. Let L(ω_i, d_j) be the loss incurred by the classifier if the decision d_j is made when the input pattern is actually from ω_i. The conditional loss (or conditional risk) for X ~ ω_i is

\[ r(\omega_i, d) = \int_{\Omega_X} L(\omega_i, d(X))\, p(X/\omega_i)\, dX \tag{1.28} \]

For a given set of a priori probabilities P = {P(ω_1), P(ω_2), ..., P(ω_m)}, the average loss (or average risk) is

\[ R(P, d) = \sum_{i=1}^{m} P(\omega_i)\, r(\omega_i, d) \tag{1.29} \]

Substituting (1.28) into (1.29) and letting

\[ r_X(P, d) = \sum_{i=1}^{m} L(\omega_i, d(X))\, P(\omega_i)\, p(X/\omega_i) \tag{1.30} \]

(1.29) becomes

\[ R(P, d) = \int_{\Omega_X} r_X(P, d)\, dX \tag{1.31} \]
The term r_X(P, d) is defined as the a posteriori conditional average loss of the decision d for the given feature measurements X. The problem is to choose a proper decision d_j, j = 1, ..., m, to minimize the average loss R(P, d) or to minimize the maximum of the conditional loss r(ω_i, d) (minimax criterion†). The optimal decision rule which minimizes the average loss is called the Bayes rule. From (1.31) it is sufficient to consider each X separately and to minimize r_X(P, d). If d* is an optimal decision in the sense of minimizing the average loss, then

\[ r_X(P, d^*) = \min_{d_j} r_X(P, d_j) \tag{1.32} \]

that is,

\[ \sum_{i=1}^{m} L(\omega_i, d^*)\, P(\omega_i)\, p(X/\omega_i) \le \sum_{i=1}^{m} L(\omega_i, d_j)\, P(\omega_i)\, p(X/\omega_i), \qquad j = 1, \ldots, m \tag{1.33} \]
† In some classification problems, the information about a priori probabilities is not available. The minimax criterion (with respect to the least favorable a priori distribution) is suggested as a classification procedure [19]-[21].
For the (0, 1) loss function, i.e.,

\[ L(\omega_i, d_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \tag{1.34} \]

the average loss is essentially also the probability of misrecognition. In this case, the Bayes decision rule is that

\[ d^* = d_i, \quad \text{i.e., } X \sim \omega_i, \quad \text{if } P(\omega_i)\, p(X/\omega_i) \ge P(\omega_j)\, p(X/\omega_j) \text{ for all } j = 1, \ldots, m \tag{1.35} \]
Define the likelihood ratio between classes ω_i and ω_j as

\[ \lambda_{ij} = \frac{p(X/\omega_i)}{p(X/\omega_j)} \tag{1.36} \]

then (1.35) becomes

\[ d^* = d_i \quad \text{if } \lambda_{ij} \ge \frac{P(\omega_j)}{P(\omega_i)} \text{ for all } j = 1, \ldots, m \tag{1.37} \]
The classifier that implements the Bayes decision rule for classification is called a Bayes classifier. A simplified block diagram of a Bayes classifier is shown in Fig. 1.8.
Fig. 1.8. A Bayes classifier.
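A sketch of the Bayes rule (1.35) under the (0, 1) loss; the priors and the gaussian class densities below are hypothetical stand-ins for whatever p(X/ω_i) is available.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, priors, densities):
    """Bayes decision rule for (0,1) loss (Eq. 1.35).

    priors: list of P(omega_i); densities: callables returning p(X/omega_i).
    Returns the index of the class maximizing P(omega_i) p(X/omega_i).
    """
    scores = [P * p(x) for P, p in zip(priors, densities)]
    return int(np.argmax(scores))

# Hypothetical two-class example with unit-covariance gaussian densities
densities = [multivariate_normal(mean=[0, 0]).pdf,
             multivariate_normal(mean=[2, 2]).pdf]
print(bayes_classify(np.array([1.8, 1.5]), [0.5, 0.5], densities))  # 1
```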
It is noted from (1.35) that the corresponding discriminant function implemented by a Bayes classifier is essentially

\[ D_i(X) = P(\omega_i)\, p(X/\omega_i), \qquad i = 1, \ldots, m \tag{1.38} \]

or, equivalently,

\[ D_i(X) = \log P(\omega_i)\, p(X/\omega_i), \qquad i = 1, \ldots, m \tag{1.39} \]
The decision boundary between the regions in Ω_X associated with ω_i and ω_j is

\[ D_i(X) - D_j(X) = 0 \tag{1.40} \]

or

\[ P(\omega_i)\, p(X/\omega_i) - P(\omega_j)\, p(X/\omega_j) = 0 \tag{1.41} \]
As an illustrative example, suppose that p(X/ω_i), i = 1, ..., m, is a multivariate gaussian density function with mean vector M_i and covariance matrix K_i, i.e.,

\[ p(X/\omega_i) = (2\pi)^{-N/2} |K_i|^{-1/2} \exp\left[ -\tfrac{1}{2}(X - M_i)^T K_i^{-1} (X - M_i) \right], \qquad i = 1, \ldots, m \tag{1.42} \]

Then the decision boundary expressed by (1.41) is

\[ \log\frac{P(\omega_i)}{P(\omega_j)} - \tfrac{1}{2}\log\frac{|K_i|}{|K_j|} - \tfrac{1}{2}\left[ (X - M_i)^T K_i^{-1}(X - M_i) - (X - M_j)^T K_j^{-1}(X - M_j) \right] = 0 \tag{1.43} \]

Equation (1.43) is, in general, a hyperquadric. If K_i = K_j = K, (1.43) reduces to

\[ X^T K^{-1}(M_i - M_j) - \tfrac{1}{2}(M_i + M_j)^T K^{-1}(M_i - M_j) + \log\frac{P(\omega_i)}{P(\omega_j)} = 0 \tag{1.44} \]
which is a hyperplane. It is noted from (1.35) that the Bayes decision rule with the (0, 1) loss function is also the unconditional maximum-likelihood decision rule. Furthermore, the (conditional) maximum-likelihood decision may be regarded as the Bayes decision rule (1.35) with equal a priori probabilities, i.e., P(ω_i) = 1/m, i = 1, ..., m.
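For the equal-covariance case, the boundary (1.44) is linear, so the Bayes decision reduces to thresholding one linear function of X. A numpy sketch with hypothetical class parameters:

```python
import numpy as np

def gaussian_linear_boundary(Mi, Mj, K, Pi, Pj):
    """Coefficients of the hyperplane boundary (Eq. 1.44) for K_i = K_j = K.

    Returns (a, b) such that the boundary is a^T X + b = 0 and
    deciding omega_i corresponds to a^T X + b > 0.
    """
    Kinv = np.linalg.inv(K)
    a = Kinv @ (Mi - Mj)
    b = -0.5 * (Mi + Mj) @ Kinv @ (Mi - Mj) + np.log(Pi / Pj)
    return a, b

# Hypothetical parameters
Mi, Mj = np.array([2.0, 2.0]), np.array([0.0, 0.0])
a, b = gaussian_linear_boundary(Mi, Mj, np.eye(2), 0.5, 0.5)
x = np.array([1.8, 1.5])
print('omega_i' if a @ x + b > 0 else 'omega_j')   # omega_i
```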
1.5 Sequential Decision Model for Pattern Classification

In the statistical classification systems described in Section 1.4, all the N features are observed by the classifier at one stage.† As a matter of fact, the cost of feature measurements has not been taken into consideration. It is evident that an insufficient number of feature measurements will not be able to give satisfactory results in correct classification. On the other hand, an arbitrarily large number of features to be measured is impractical. If the cost of taking feature measurements is to be considered, or if the features extracted from input patterns are sequential in nature, we are led to apply sequential decision procedures [4, 5] to this class of pattern recognition problems [6]. The problem is especially pertinent when the cost of taking a feature measurement is high. For example, if the feature to be measured is in an industrial process and the measurement requires that the process be interrupted or completely stopped, or if elaborate equipment, excessive time, or a complicated and risky operation (in biomedical applications) is required to perform the measurement, then these factors may prohibit its use. Thus there is a balance between the information provided by a feature measurement and the cost of taking it. A trade-off between the error (misrecognition) and the number of features to be measured can be obtained by taking feature measurements sequentially and terminating the sequential process (making a decision) when a sufficient or desirable accuracy of classification has been achieved. Since the feature measurements are to be taken sequentially, the order of the features to be measured is important. It is expected that the features should be ordered such that measurements taken in that order will lead to a terminal decision earlier. The problem of feature ordering is a rather special problem in sequential recognition systems. It is an objective of this monograph to cover the recent developments in the application of sequential decision procedures to feature selection, feature ordering, and pattern classification. Application of sequential decision procedures to pattern classification was proposed by Fu. If there are two pattern classes to be recognized, Wald's sequential probability ratio test (SPRT) can be applied [4].‡ At the nth stage of the sequential process, that is, after the nth feature measurement is taken, the classifier computes the sequential probability ratio

\[ \lambda_n = \frac{p_n(X/\omega_1)}{p_n(X/\omega_2)} \tag{1.45} \]

† This is also known as a fixed-sample-size decision procedure.
‡ A brief introduction to sequential analysis is given in Appendix A.
where p_n(X/ω_i), i = 1, 2, is the (multivariate, n-dimensional) conditional probability density function of X for pattern class ω_i. The λ_n computed by (1.45) is then compared with two stopping boundaries A and B. If

\[ \lambda_n \ge A, \quad \text{then the decision is that } X \sim \omega_1 \tag{1.46} \]

and if

\[ \lambda_n \le B, \quad \text{then the decision is that } X \sim \omega_2 \tag{1.47} \]

If B < λ_n < A, then an additional feature measurement will be taken and the process proceeds to the (n + 1)th stage. The two stopping boundaries are related to the error (misrecognition) probabilities by the following expressions (see Appendix A):

\[ A = \frac{1 - e_{21}}{e_{12}}, \qquad B = \frac{e_{21}}{1 - e_{12}} \tag{1.48} \]
where e_ij is the probability of deciding X ~ ω_i when actually X ~ ω_j is true, i, j = 1, 2. Following Wald's sequential analysis, it has been shown that a classifier using the SPRT has an optimal property for the case of two pattern classes; that is, for given e_12 and e_21 there is no other procedure with at least as low error probabilities or expected risk and with a shorter average number of feature measurements than the sequential classification procedure. Equations (1.46) and (1.47), with equality signs, represent the decision boundaries which partition the feature space into three regions: the region associated with ω_1, the region associated with ω_2, and the region of indifference (or null region). The region between the two boundaries is the region of indifference, in which no terminal decision is made. It can be seen that the decision boundaries in a sequential classification process vary with the number of feature measurements n. For example, suppose that x_1, x_2, ... are independent feature measurements with p(x_j/ω_i), j = 1, 2, ..., i = 1, 2, a univariate gaussian density function with mean m_i and variance σ². For simplicity of computation, instead of λ_n, log λ_n is computed. After the first feature measurement x_1 is taken,
\[ \log \lambda_1 = \frac{1}{\sigma^2} \left[ (m_1 - m_2) x_1 - \tfrac{1}{2}(m_1^2 - m_2^2) \right] \tag{1.49} \]
Comparing log λ_1 with log A and log B: if

\[ x_1 \ge \frac{\sigma^2}{m_1 - m_2} \log A + \tfrac{1}{2}(m_1 + m_2), \quad \text{then } X \sim \omega_1 \tag{1.50} \]

if

\[ x_1 \le \frac{\sigma^2}{m_1 - m_2} \log B + \tfrac{1}{2}(m_1 + m_2), \quad \text{then } X \sim \omega_2 \tag{1.51} \]

and if neither inequality is satisfied,
then x_2 will be taken and the process proceeds to the second stage. After the second feature measurement is taken,

\[ \log \lambda_2 = \frac{1}{\sigma^2} \left[ (m_1 - m_2)(x_1 + x_2) - (m_1^2 - m_2^2) \right] \tag{1.52} \]
Proceeding as before: if

\[ x_1 + x_2 \ge \frac{\sigma^2}{m_1 - m_2} \log A + (m_1 + m_2), \quad \text{then } X \sim \omega_1 \tag{1.53} \]

if

\[ x_1 + x_2 \le \frac{\sigma^2}{m_1 - m_2} \log B + (m_1 + m_2), \quad \text{then } X \sim \omega_2 \tag{1.54} \]

and if neither inequality is satisfied,
then x_3 will be taken and the process proceeds to the third stage. Hence, at the nth stage of the process,

\[ \log \lambda_n = \frac{1}{\sigma^2} \left[ (m_1 - m_2) \sum_{i=1}^{n} x_i - \frac{n}{2}(m_1^2 - m_2^2) \right] \tag{1.55} \]
The classification procedure becomes such that if

\[ \sum_{i=1}^{n} x_i \ge \frac{\sigma^2}{m_1 - m_2} \log A + \frac{n}{2}(m_1 + m_2), \quad \text{then } X \sim \omega_1 \tag{1.56} \]

if

\[ \sum_{i=1}^{n} x_i \le \frac{\sigma^2}{m_1 - m_2} \log B + \frac{n}{2}(m_1 + m_2), \quad \text{then } X \sim \omega_2 \tag{1.57} \]

and if neither inequality is satisfied,
then x_{n+1} will be taken. It is noted that the decision boundaries, defined by (1.56) and (1.57) with equality signs, are two parallel hyperplanes in the feature space. Note further that the separation or distance between the boundaries (the width of the region of indifference) is

\[ \frac{\sigma^2}{m_1 - m_2} \log A - \frac{\sigma^2}{m_1 - m_2} \log B = \frac{\sigma^2}{m_1 - m_2} \log \frac{A}{B} \tag{1.58} \]
which is proportional to σ²/(m_1 − m_2). For given error probabilities e_12 and e_21, the average number of measurements for termination of the process depends directly on σ² and inversely on (m_1 − m_2).
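A sketch of the two-class gaussian SPRT, combining the stopping boundaries (1.48) with the decision rule (1.56)-(1.57); the class parameters, error targets, and measurement stream below are hypothetical.

```python
import numpy as np

def gaussian_sprt(measurements, m1, m2, sigma2, e12, e21):
    """Two-class SPRT for gaussian features (Eqs. 1.48, 1.56-1.57).

    Assumes m1 > m2. Returns (decision, n): decision 1 (omega_1),
    2 (omega_2), or None if the stream ran out before termination.
    """
    logA = np.log((1.0 - e21) / e12)          # Eq. (1.48)
    logB = np.log(e21 / (1.0 - e12))
    s, c = 0.0, sigma2 / (m1 - m2)
    for n, x in enumerate(measurements, start=1):
        s += x
        if s >= c * logA + n * (m1 + m2) / 2.0:   # Eq. (1.56)
            return 1, n
        if s <= c * logB + n * (m1 + m2) / 2.0:   # Eq. (1.57)
            return 2, n
    return None, len(measurements)

rng = np.random.default_rng(0)
xs = rng.normal(1.0, 1.0, size=100)   # stream actually drawn from omega_1
print(gaussian_sprt(xs, m1=1.0, m2=-1.0, sigma2=1.0, e12=0.01, e21=0.01))
```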
For more than two pattern classes, m > 2, the generalized sequential probability ratio test (GSPRT) can be used [5]. At the nth stage, the generalized sequential probability ratio for each pattern class is computed as

\[ U_n(X/\omega_i) = \frac{p_n(X/\omega_i)}{\left[ \prod_{j=1}^{m} p_n(X/\omega_j) \right]^{1/m}}, \qquad i = 1, 2, \ldots, m \tag{1.59} \]

The U_n(X/ω_i) is then compared with the stopping boundary of the ith pattern class, A(ω_i), and the decision procedure is to reject the pattern class ω_i from further consideration; that is, X is not considered to be in the class ω_i if

\[ U_n(X/\omega_i) < A(\omega_i), \qquad i = 1, 2, \ldots, m \tag{1.60} \]

The stopping boundary is determined by the following relationship:

\[ A(\omega_i) = \frac{1 - e_{ii}}{\left[ \prod_{j=1}^{m} (1 - e_{jj}) \right]^{1/m}}, \qquad i = 1, 2, \ldots, m \tag{1.61} \]
After the rejection of pattern class ω_i from consideration, the total number of pattern classes is reduced by one and a new set of generalized sequential probability ratios is formed. The pattern classes are rejected sequentially until only one is left, which is accepted as the recognized class. The rejection criterion suggested, though somewhat conservative, will usually lead to a high percentage of correct recognition because only the pattern classes which are the most unlikely to be true are rejected. For two pattern classes, m = 2, the classification procedure (1.60) is equivalent to Wald's SPRT, and the optimality of the SPRT holds. For m > 2, whether the optimal property is still valid remains to be justified. However, the classification procedure is close to optimal in that the average number of feature measurements required to reject a pattern class from further consideration is nearly minimum when two hypotheses (the hypothesis of a pattern class to be rejected and the hypothesis of a class not rejected) are considered. A general block diagram for a sequential recognition system is shown in Fig. 1.9. Computer simulations for English character recognition will be described in Section 2.3.

Fig. 1.9. A sequential pattern recognition system.
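A sketch of the GSPRT rejection procedure of (1.59)-(1.61), computed in the log domain for numerical stability. The three gaussian classes and the error targets e_ii are hypothetical, and the ratios and boundaries are re-formed over the surviving classes after each rejection, as described above.

```python
import numpy as np

def gsprt_classify(xs, log_densities, e):
    """GSPRT by sequential rejection (Eqs. 1.59-1.61).

    log_densities[i](x): log p(x/omega_i) for a single measurement
    (features assumed independent); e[i]: target error e_ii.
    Classes are rejected until one remains.
    """
    alive = list(range(len(log_densities)))
    loglik = np.zeros(len(log_densities))   # running log p_n(X/omega_i)
    for x in xs:
        loglik += np.array([ld(x) for ld in log_densities])
        while len(alive) > 1:
            # log U_n(X/omega_i), Eq. (1.59), over surviving classes
            logU = {i: loglik[i] - np.mean([loglik[j] for j in alive]) for i in alive}
            # log A(omega_i), Eq. (1.61), over surviving classes
            logA = {i: np.log(1 - e[i]) - np.mean([np.log(1 - e[j]) for j in alive])
                    for i in alive}
            rejects = [i for i in alive if logU[i] < logA[i]]   # Eq. (1.60)
            if not rejects:
                break
            alive.remove(min(rejects, key=lambda i: logU[i]))   # least likely first
        if len(alive) == 1:
            return alive[0]
    return None   # no terminal decision within the available measurements

# Hypothetical unit-variance gaussian classes with means -2, 0, 2
logd = [lambda x, mu=mu: -0.5 * (x - mu) ** 2 for mu in (-2.0, 0.0, 2.0)]
rng = np.random.default_rng(1)
print(gsprt_classify(rng.normal(2.0, 1.0, 50), logd, e=[0.01, 0.01, 0.01]))  # 2
```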
A pattern classifier using a standard sequential decision procedure, SPRT or GSPRT, may be unsatisfactory because: (i) an individual classification may require more feature measurements than can be tolerated; and (ii) the average number of feature measurements may become extremely large if the e_ij's are chosen to be very small. In practical situations, it may become virtually necessary to interrupt the standard procedure and resolve among various courses of action. This can be achieved by truncating the sequential process at n = N. For example, the truncated sequential decision procedure
for the SPRT will be the following. Carry out the regular SPRT until either a terminal decision is made or stage N of the process is reached. If no decision has been reached at stage N, decide X ~ ω_1 if λ_N ≥ 1, or decide X ~ ω_2 if λ_N < 1. In a pattern classifier using the truncated GSPRT, at n = N the input pattern is classified as belonging to the class with the largest generalized sequential probability ratio. Under the truncated procedure the process must terminate in at most N stages. Truncation is a compromise between an entirely sequential procedure and a classical, fixed-sample-size decision procedure such as (1.35). It is an attempt to reconcile the good properties of both procedures: the sequential property of examining measurements as they accumulate and the classical property of guaranteeing that the tolerances will be met with a specified number of available measurements.
1.6 Learning in Sequential Pattern Recognition Systems
In previous sections, all the information relevant to the statistical characteristics of the patterns in each class, for example P(ω_i) and p(X/ω_i), is assumed completely known. However, in practice, the information required for the optimal design of feature extractors or classifiers is often only partially known. One approach suggested is to design a pattern recognition system which has the capability of learning the unknown information during its operation. The decisions (feature selections or classifications) are made on the basis of the learned information. If the learned information gradually approaches the true information, then the decisions based on the learned information will eventually approach the optimal decisions, as if all the information required were known. Therefore, during the system's operation, the performance of the system is gradually improved. The process which acquires the necessary information for decisions during the system's operation, and which improves the system's performance, is usually called "learning." Several approaches based on statistical estimation theory have been proposed for the estimation (learning) of unknown information. If the unknown information is the parameter values of a given function, such as p(X/ω_i) or the equation of a decision boundary, parametric estimation techniques can be applied. If both the form and the parameter values of a function are unknown, in general, nonparametric techniques should be used. However, as will be seen later, both
cases can be formulated as problems of successive estimation of unknown parameters. During the operation of a pattern recognition system, the system learns (estimates) the necessary information about each pattern class by actually observing various patterns. In other words, the unknown information is obtained from these observed patterns. Depending upon whether or not the correct classifications of the observed input patterns are known, the learning process performed by the system can be classified into "learning with a teacher" or "supervised learning" and "learning without a teacher" or "nonsupervised learning." In the case of supervised learning, Bayesian estimation [7] and stochastic approximation [8] can be used to successively estimate (learn) unknown parameters in a given form of the feature distribution of each class, p(X/ω_i). The successive estimation of continuous conditional probabilities of each pattern class can be performed by applying the potential function method [9] or stochastic approximation. The similarities between certain Bayesian estimation schemes and the generalized stochastic approximation algorithm have been demonstrated [10, 15]. It has also been shown that certain learning algorithms of the potential function method belong to the class of stochastic approximation algorithms [11]-[13]. In nonsupervised learning, the correct classifications of the observed patterns are not available, and the problem of learning is often reduced to a process of successive estimation of some unknown parameters in either a mixture distribution of all possible pattern classes or a known decision boundary. One property of the SPRT which can be used to improve the accuracy of classification is the reduction of the error (misrecognition) probability by varying the stopping boundaries. It has been shown by Wijsman [16] that in the SPRT, if the upper stopping boundary A is increased and the lower stopping boundary B is decreased, then at least one of the error probabilities, e_12 and e_21, decreases, unless the new SPRT (after varying the stopping boundaries) is equivalent to the old one, in which case the error probabilities are unchanged. This property of the SPRT can easily be extended to the GSPRT [17]. For sequential classification of m (m ≥ 2) pattern classes, if all the stopping boundaries A(ω_i), i = 1, ..., m, are nonincreasing after each feature measurement is taken (and at least one of them is decreasing), then the error probability will be reduced as the number of feature measurements increases. The adjustment of the stopping boundaries can be prespecified by the
designer or determined during the system's operation ("on-line") by checking from U_n(X/ω_i) which pattern class is the most likely to be rejected; the stopping boundary corresponding to that pattern class can then be decreased by a much smaller amount than those of the others. Another advantage of varying the stopping boundaries is that, by starting with relatively large values of the stopping boundaries and gradually decreasing them, the average number of feature measurements will not be excessively large in comparison with the case in which small values of the stopping boundaries are used throughout the whole process. Consequently, the probability of misrecognition and the average number of feature measurements can be somewhat simultaneously controlled by properly adjusting the stopping boundaries. Several approaches for learning in sequential pattern recognition systems will be discussed in detail in Chapters 6 and 7.
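As a minimal illustration of learning by successive estimation (the subject of Chapters 6 and 7), the sketch below updates an estimate of an unknown class mean recursively after each labeled observation; this simple recursion is a special case of the stochastic approximation procedures surveyed in Appendix F. The data are hypothetical.

```python
import numpy as np

def recursive_mean(samples):
    """Successive (supervised) estimation of an unknown class mean.

    m_n = m_{n-1} + (1/n)(x_n - m_{n-1}); each labeled pattern refines
    the estimate, so decisions based on it improve as n grows.
    """
    m, history = 0.0, []
    for n, x in enumerate(samples, start=1):
        m += (x - m) / n
        history.append(m)
    return m, history

rng = np.random.default_rng(2)
m, hist = recursive_mean(rng.normal(3.0, 1.0, 500))
print(round(m, 2))   # close to the true (unknown) mean 3.0
```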
1.7 Summary and Further Remarks

In this chapter, the problem of pattern recognition is described. In general, there are two subproblems involved, namely, feature extraction and classification. Several approaches to pattern classification, including the deterministic discriminant function approach and fixed-sample-size and sequential statistical decision approaches, are briefly presented. When the cost of taking feature measurements is considered, as in most practical problems, the sequential decision approach becomes particularly attractive. In the absence of complete a priori knowledge for designing an optimal recognition system, the requirement of learning in pattern recognition is emphasized. Several learning techniques are briefly introduced. Learning schemes with and without external supervision are respectively defined. Due to the limited scope of the monograph, several other approaches to pattern recognition [24]-[31] have not been discussed. New training procedures have recently been proposed for linear and piecewise linear classifiers. Koford and Groner [32] have proposed a training procedure based on the least mean-square error criterion for linear classifiers. Duda and Fossum [33] have suggested an error-correction training procedure, though without a convergence proof, for piecewise linear classifiers. Instead of applying training patterns sequentially, training procedures with patterns applied in groups
have been proposed [34]-[36]. In general, group-pattern training procedures converge to the optimum weight vector in fewer iterations than single-pattern (sequentially applied) training procedures. The trade-off is the increase in computation and storage requirements.
References

1. N. J. Nilsson, "Learning Machines: Foundations of Trainable Pattern-Classifying Systems." McGraw-Hill, New York, 1965.
2. G. Sebestyen, "Decision-Making Processes in Pattern Recognition." Macmillan, New York, 1962.
3. C. K. Chow, An optimum character recognition system using decision functions. IRE Trans. Electron. Computers 6, 247-254 (1957).
4. A. Wald, "Sequential Analysis." Wiley, New York, 1947.
5. F. C. Reed, A sequential multidecision procedure. Proc. Symp. on Decision Theory and Appl. Electron. Equipment Develop., USAF Develop. Center, Rome, New York, April 1960.
6. K. S. Fu, A sequential decision model for optimum recognition. "Biological Prototypes and Synthetic Systems," Vol. I. Plenum Press, New York, 1962.
7. N. Abramson and D. Braverman, Learning to recognize patterns in a random environment. IRE Trans. Inform. Theory 8, 558-563 (1962).
8. N. V. Loginov, Methods of stochastic approximation. Avtomat. i Telemeh. 27, 185-204 (1966).
9. M. A. Aiserman, E. M. Braverman, and L. I. Rozonoer, Potential functions technique and extrapolation in learning system theory. Proc. Congr. IFAC, 3rd, June, London, 1966.
10. Y. T. Chien and K. S. Fu, On Bayesian learning and stochastic approximation. IEEE Trans. System Sci. Cybernetics 3, 28-38 (1967).
11. M. A. Aiserman, E. M. Braverman, and L. I. Rozonoer, The Robbins-Monro process and the method of potential functions. Avtomat. i Telemeh. 26, 1951-1954 (1965).
12. Ya. Z. Tsypkin, Establishing characteristics of a function transformer from randomly observed points. Avtomat. i Telemeh. 26, 1947-1950 (1965).
13. C. C. Blaydon, On a pattern recognition result of Aiserman, Braverman and Rozonoer. IEEE Trans. Inform. Theory 12, No. 1, 82-83 (1966).
14. R. L. Kashyap and C. C. Blaydon, Recovery of functions from noisy measurements taken at randomly selected points and its applications to pattern recognition. Proc. IEEE 54, No. 8, 1127-1129 (1966).
15. K. S. Fu, Y. T. Chien, Z. J. Nikolic, and W. G. Wee, On the stochastic approximation and related learning techniques. Tech. Rept. TR-EE66-6. Purdue Univ., Lafayette, Indiana, April 1966.
16. R. A. Wijsman, A monotonicity property of the sequential probability ratio test. Ann. Math. Statist. 31, 677-684 (1960).
17. K. S. Fu and C. H. Chen, Pattern recognition and machine learning using sequential decision approach. Tech. Rept. TR-EE65-9. School of Elec. Eng., Purdue Univ., Lafayette, Indiana, April 1964.
18. T. Marill and D. M. Green, Statistical recognition functions and the design of pattern recognizers. IRE Trans. Electron. Computers 9, 472-477 (1960).
19. A. Wald, "Statistical Decision Functions." Wiley, New York, 1950.
20. G. B. Wetherill, "Sequential Methods in Statistics." Methuen, London, and Wiley, New York, 1966.
21. D. Blackwell and M. A. Girshick, "Theory of Games and Statistical Decisions." Wiley, New York, 1954.
22. I. Selin, Sequential detection. "Detection Theory," Chapter 9. Princeton Univ. Press, Princeton, New Jersey, 1965.
23. E. L. Lehmann, "Testing Statistical Hypotheses." Wiley, New York, 1959.
24. M. Eden, On the formalization of handwriting. Proc. Appl. Math. Symp., 1961, 12. Amer. Math. Soc., Providence, Rhode Island.
25. R. Narasimhan, A linguistic approach to pattern recognition. Rept. No. 121. Digital Computer Lab., Univ. of Illinois, Urbana, Illinois, July 1962.
26. H. Freeman, On the digital computer classification of geometric line patterns. Proc. Nat. Electron. Conf. 18, 312-314 (1962).
27. A. G. Frantsuz, An optimal pattern recognition algorithm. Eng. Cybernetics No. 5, 62-70 (1965).
28. G. H. Ball and D. J. Hall, ISODATA, a novel method of data analysis and pattern classification. Tech. Rept. Stanford Res. Inst., Menlo Park, California, April 1963.
29. T. M. Cover and P. E. Hart, Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13, No. 1, 21-27 (1967).
30. A. B. J. Novikoff, Integral geometry as a tool in pattern recognition. Proc. Bionics Symp. 1960, 247-262. USAF Wright-Patterson AFB, Ohio.
31. G. H. Ball, An application of integral geometry to pattern recognition. Rept. ONR Contract No. 3438/00. Stanford Res. Inst., Menlo Park, California, 1962.
32. J. S. Koford and G. F. Groner, The use of an adaptive threshold element to design a linear optimal pattern classifier. IEEE Trans. Inform. Theory 12, No. 1, 42-50 (1966).
33. R. O. Duda and H. Fossum, Pattern classification by iteratively determined linear and piecewise linear discriminant functions. IEEE Trans. Electron. Computers 15, No. 2, 220-232 (1966).
34. Y. C. Ho and R. L. Kashyap, An algorithm for linear inequalities and its applications. IEEE Trans. Electron. Computers 14, No. 5, 683-688 (1965).
35. J. B. Rosen, Pattern separation by convex programming. J. Math. Anal. Appl. 10, 123-134 (1965).
36. W. G. Wee and K. S. Fu, An adaptive procedure for multiclass pattern classification. IEEE Trans. Electron. Computers 17, No. 2, 178-182 (1968).
37. L. Kanal, F. Slaymaker, D. Smith, and W. Walker, Basic principles of some pattern recognition systems. Proc. Nat. Electron. Conf. 18, 279-295 (1962).
38. F. Rosenblatt, "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms." Spartan Books, Washington, D. C., 1961.
39. K. Steinbuch and B. Widrow, A critical comparison of two kinds of adaptive classification networks. IEEE Trans. Electron. Computers 14, 737-740 (1965).
40. J. T. Chu and J. C. Chueh, Error probability in decision functions for character recognition. J. Assoc. Comput. Mach. 14, 273-280 (1967).
CHAPTER 2
FEATURE SELECTION AND FEATURE ORDERING
2.1 Feature Selection and Ordering-Information Theoretic Approach
As mentioned in Section 1.1, the selection of features is an important problem in pattern recognition and is closely related to the performance of classification. Also, in sequential pattern recognition systems, the ordering of features for successive measurements is important. The purpose of feature ordering is to provide, at successive stages of the sequential classification process, the feature which is the most "informative" among all possible choices of features for the next measurement, so that the process can be terminated earlier. The problem of feature ordering may be considered as a problem of feature selection if, at each stage of the sequential classification process, the feature subset to be selected contains only one feature. Approaches from the viewpoint of information theory have been suggested for evaluating the "goodness" of features. Both the divergence and the average information about the pattern classes characterized by the features have been proposed as criteria of feature "goodness." The concept of divergence is closely related to the discriminatory power between two pattern classes with a gaussian-distributed feature measurement vector X. The use of the information measure and the divergence as criteria for feature selection or ordering is also implied by the comparison of the expected risks when the Bayes decision rule is employed for the classification process. A function in the form of an entropy or average information has been proposed by Lewis [1] as a criterion for feature selection and ordering. Assume that each feature f_j, j = 1, ..., N, can take v_j possible values; a particular value of f_j is denoted by f_j(k), k = 1, ..., v_j. Associated
with each f_j, a number G_j which measures the "goodness" of f_j is to be determined experimentally. The number G_j is in general a statistic obtained by evaluating f_j over a large sample of the patterns to be recognized. The following relationships between G_j and the percentage of correct recognition have been suggested as a guide for the selection of G_j.

(i) If G_j > G_q, then the percentage of correct recognition using f_j only must be greater than the percentage of correct recognition using f_q only.

(ii) If G_j > G_q, then, for any set of features F, the percentage of correct recognition using features f_j and F must be greater than the percentage of correct recognition using f_q and F.

(iii) The percentage of correct recognition using F is a linear function of the sum of the G_j values for the features in F.

Since no single-number statistic satisfies either (ii) or (iii) in general, a statistic which at least satisfies (ii) and (iii) over a fairly wide range of situations (though not all) is proposed. The requirement that G_j be a single number suggests that G_j may be selected as an expected value of some function. Assuming statistical independence among the feature measurements, it is suggested that
\[ G_j = \sum_{i=1}^{m} \sum_{k=1}^{v_j} P[\omega_i, f_j(k)] \log \gamma\{P[\omega_i, f_j(k)]\} \tag{2.1} \]
A logarithmic function is selected because of the additive property of G_j required by (ii). In view of (i), γ should be a measure of the correlation between f_j and ω_i. The proposed γ is

\[ \gamma\{P[\omega_i, f_j(k)]\} = \frac{P[\omega_i, f_j(k)]}{P(\omega_i)\, P[f_j(k)]} \tag{2.2} \]
Thus

\[ G_j = \sum_{i=1}^{m} \sum_{k=1}^{v_j} P[\omega_i, f_j(k)] \log \frac{P[\omega_i, f_j(k)]}{P(\omega_i)\, P[f_j(k)]} \tag{2.3} \]
From (2.3), G_j can be interpreted as the mutual information of the feature f_j and the pattern classes ω_1, ..., ω_m [9].
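A sketch of the criterion (2.3), assuming the joint probabilities P[ω_i, f_j(k)] are available as a table (the tables below are invented): G_j vanishes for a feature independent of the classes and grows with statistical dependence.

```python
import numpy as np

def feature_goodness(joint):
    """Mutual information criterion G_j (Eq. 2.3).

    joint: m x v_j array with joint[i, k] = P[omega_i, f_j(k)].
    """
    joint = np.asarray(joint, dtype=float)
    p_class = joint.sum(axis=1, keepdims=True)    # P(omega_i)
    p_value = joint.sum(axis=0, keepdims=True)    # P[f_j(k)]
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (p_class @ p_value)[mask])).sum())

# Hypothetical binary feature over two classes
informative   = np.array([[0.45, 0.05], [0.05, 0.45]])
uninformative = np.array([[0.25, 0.25], [0.25, 0.25]])
print(feature_goodness(informative), feature_goodness(uninformative))  # > 0, ~0
```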
The application of divergence as a criterion for feature selection and ordering has been proposed by Marill and Green [2]. Assume that, for ω_i, X is distributed according to a multivariate gaussian density function with mean vector M_i and covariance matrix K, i.e.,

\[ p(X/\omega_i) = (2\pi)^{-N/2} |K|^{-1/2} \exp\left[ -\tfrac{1}{2}(X - M_i)^T K^{-1}(X - M_i) \right] \tag{2.4} \]
Let the likelihood ratio be

\[ \lambda = \frac{p(X/\omega_i)}{p(X/\omega_j)} \tag{2.5} \]

and let

\[ L = \log \lambda = \log p(X/\omega_i) - \log p(X/\omega_j) \tag{2.6} \]
Substituting (2.4) into (2.6), we obtain

\[ L = X^T K^{-1}(M_i - M_j) - \tfrac{1}{2}(M_i + M_j)^T K^{-1}(M_i - M_j) \tag{2.7} \]

and

\[ E[L/\omega_i] = \tfrac{1}{2}(M_i - M_j)^T K^{-1}(M_i - M_j) \tag{2.8} \]
Define the divergence between ω_i and ω_j as [3]

\[ J(\omega_i, \omega_j) = E[L/\omega_i] - E[L/\omega_j] \tag{2.9} \]

Then, from (2.8) and (2.9),

\[ J(\omega_i, \omega_j) = (M_i - M_j)^T K^{-1}(M_i - M_j) \tag{2.10} \]
It is noted that in (2.10), if K = I, the identity matrix, then J(ω_i, ω_j) represents the squared distance between M_i and M_j. If a fixed-sample-size or nonsequential Bayes decision rule is used for the classifier, then for P(ω_i) = P(ω_j) = ½, from (1.37),

\[ X \sim \omega_i \quad \text{if } \lambda > 1, \text{ or } L > 0 \]
\[ X \sim \omega_j \quad \text{if } \lambda < 1, \text{ or } L < 0 \]

The probability of misrecognition is

\[ e = \tfrac{1}{2} P[L > 0/\omega_j] + \tfrac{1}{2} P[L < 0/\omega_i] \tag{2.11} \]
2.1.
INFORMATION THEORETIC APPROACH
27
Similarly, p ( L / w j )is also a gaussian density function with mean -*J and variance J. Thus, e =4
I (2771)-1/2exp[- Ht + i$J>/Jl m
dt
J
-m
Let Y=-
then =
t f BJ
(2.13)
47
I,,, ( 2 ~ ) - l /exp[ ~ m
-+yz] dy
(2.14)
It is noted that, from (2.14), e is a monotonically decreasing function of J ( w c , mi).Therefore, features selected or ordered according to the magnitude of J(wi , wi)will imply their corresponding discriminatory power between wt and w j . For more than two pattern classes, the criterion of maximizing the minimum divergence or the expected divergence between any pair of classes has been proposed for signal detection and pattern recognition problems [4]-[6]. The expected divergence between any pair of classes is given by
For the distributions given in (2.4),
Let d2 = M i n J ( w i , 193
wj),
i #fi
(2.17)
then (2.18)
28
2.
FEATURE SELECTION AND FEATURE ORDERING
Hence (2.19)
T h e tightest upper bound of d occurs when 1 - C ~ “ = , [ P ( is ~ Lthe J~)]~ maximum. This maximum is 1 - ( l / m ) which yields (2.20)
T h e bound, as indicated in (2.20), can be achieved by taking various combinations of features from a given feature set, or, alternatively, by gradually increasing the number of features N such that the feature subset selected will correspond to the case where d2 is the closest value to mJ(w)/(m - 1). I n general, there may be more than one feature subset which satisfies the criterion. I n sequential recognition systems, since the features are measured sequentially a slightly different approach with a similar viewpoint from information theory can be used for “on-line” ordering of features [6].I n the application of SPRT or GSPRT for classification, the knowledge of what pattern classes are more likely to be true (at the input of the recognition system) is used to determine the “goodness” of features. Let 7 be the available number of features at any stage of the sequential process, 7 N, and fj, j = 1,..., 7 , be the j t h feature. The criterion of choosing a feature for the next measurement, following Lewis’ approach, is a single-number statistic which is an expectation of a function describing the correlation among pattern classes, previous feature measurements, and each of the remaining features. Such a statistic associated with fj can be expressed as, after n (noisy) feature measurements x1 , x2 ,..., x, were taken,
<
j=1,
Since
..., Y,
N=7+n
(2.21)
2.2. KARHUNEN-LOBVE
29
EXPANSION
(2.23)
It is noted that P(fj/wi, x1 , x2 ,..., x,) is the a posteriori distribution of fi for class wi after x1 , x2 ,..., x, were taken. The term Ij(n) is the conditional entropy or the mutual information of fi and w1 ,..., w, after n feature measurements x1 ,..., x, were taken. The feature ordering criterion is the maximization of Ij(n).The ordering procedure is to compute Ij(n) for all j = 1,..., I and select the feature for the (n + 1)th measurement which gives the largest value of Ij(n). As the number of feature measurements increases, the a posteri distribution corresponding to the input pattern class gradually plays a dominant role in Ij(n), and the feature which best characterizes the input pattern class is the most likely to be chosen earlier than the others. A different approach for feature selection and ordering based on the backward programming will be discussed in Chapter 4. 2.2
Feature Selection and Ordering-Karhunen-Lo8ve
Expansion
An alternative approach is proposed in this section for the feature selection and ordering in which complete knowledge of the probabilistic descriptions of the input patterns under consideration is not required. The basic viewpoint is essentially that of preweighting the features according to their relative importance in characterizing the input patterns, regardless of the specific classification scheme used in a recognition system. Here, “relative importance” is interpreted in the sense of (i) committing less error when the representation of patterns is subject to approximation because of truncated finite measurements, and (ii) carrying more information with regard to the discrimination of classes. With this point of view, an optimal feature selection and ordering procedure has been developed under the framework of a Karhunen-Lokve expansion [7], [8]. The procedure described in the following is considered as a generalized version [181.
30
2.
FEATURE SELECTION AND FEATURE ORDERING
A generalized Karhunen-Lokve expansion is presented first in continuous case and then its corresponding form in discrete case. Consider the observation of a stochastic process {X(t),0 t T). The observed random function X ( t ) is generated from one of the m possible stochastic processes {Xi(+ 0 t T } , i = 1, 2, ...,m,corresponding to the m pattern classes, respectively. Let the random functions have the expansion
< <
< <
c Vikvk(t) m
&(t)
=
for all t E (0, T), i
=
1,..., m
(2.24)
k=l
where the Vik’sare random coefficients satisfying E( Vik)= O.+ { y k ( t ) ) is a set of deterministic orthonormal coordinate functions over (0, T). Define a covariance function K ( t , s) for the m stochastic processes as follows:
where X z ( t ) is the complex conjugate of Xi(t). After substituting (2.24) into (2.25) we have m
Let the random coefficients ViTik)ssatisfy the conditions
i=l
i=l
(2.27) if k f j
(2.28)
That is, if the expansion in (2.24) exists for Xi(t),i = 1,..., m, and the random coefficients satisfy the conditions in (2.27), then the t This can be achieved by centralizing all the random functions and is therefore assumed without loss of generality.
2.2.
31
K A R H U N E N - L O ~ E EXPANSION
covariance function K(t, s) must have the representation as (2.28). Furthermore, from (2.28),
If the summation and the integration can be interchanged, (2.29) becomes
(2.30)
= uk2vk(t)
The expansion in (2.24) in which { V k ( t ) } is determined by (2.29) or (2.30) through the defined covariance function K ( t , s) is called the generalized Karhunen-Lokve expansion. The generalized Karhunen-Lokve expansion described above has been shown to have the following optimal properties: (i) it minimizes the mean-square error committed by taking only a finite number of terms in the infinite series of the expansion; and (ii) it minimizes the entropy function defined over the variances of the random coefficients in the expansion. The details of the proofs for the optimal properties are given in Appendix B. It is noted that the necessary conditions stated in (2.27) essentially mean that the random coefficients between each pair of coordinate functions among all classes of stochastic processes should be uncorrelated. However, the random coefficients between each pair of coordinate functions for a single class should not be uncorrelated.+ If instead of the random function X ( t ) being continuously observed over (0, T), only the sampled measurements from the random function are taken, then the desired representation becomes
xi =
[:‘I
m
and xi,
=
1 VikPki,
i = 1,...,m, j
=
1,..., N
(2.31)
k=l
XiN
T h e V i k ’ S are random coefficients and P k j is the jth component of the coordinate vector k in { V k } which is a set of orthonormal coordinate t The same conclusion seems to have been also reached from the information theoretic point of view by Barabash [9].
32
2.
FEATURE SELECTION AND FEATURE ORDERING
vectors. Define the discrete analog of the covariance function K ( t , s) for the m stochastic processes as m
cc m
=
m
\
m
m
pktEl.2
k=l j=1
1p ( w i )
00
=
E(vikv:)
i=l
t,s
(2.32)
akzpktpK*, k=l
=
1, ..., N
Furthermore, by the orthonormality of the coordinate vectors,
1K(t, s, p k , s = s=1 c j=1 u?&t&ks s=1 N
N
c
w
m
=
j=1
c N
O?/.Cjt
s=l
p%ks
(2.33)
= Ok2pkt
The generalized Karhunen-Lokve expansion in discrete case becomes m
xij =
1v i k p k j , k=l
i
=
1,..., m, j = I,.. ,N
where the pkj’s satisfy (2.33) and the random coefficient mined by (2.34), for each K
c xi+&, N
vik =
j=l
i
=
I,
..,m
vik
is deter-
(2.34)
It is noted that (2.33) is the discrete equivalent of the integral equation defined in (2.30). The coordinate vectors of the generalized KerhunenLoCve expansion are essentially the eigenvectors determined from K(t,4. The optimal properties of the minimized mean-square error and entropy function of the generalized Karhunen-Lohe expansion lead to an optimal procedure for feature selection and ordering. By properly constructing the generalized Karhunen-Lohe coordinate system through (2.30) or (2.33) and arranging the coordinate function {ylc(t)}or coordinate vectors {Q} according to the descending order of their associated eigenvahes (Tk2, feature measurements taken according to this order will contain the maximum amount of infor-
2.2. KARHUNEN-LOBVE
EXPANSION
33
mation about the input patterns whenever the recognition process stops at a finite number of measurements. The following theorem, which summarizes the results obtained in this section, will furnish a convenient way of constructing an optimal coordiante system.
< <
Let {Xi(t),0 t T } and P(w,), i = 1,..., m, be the m stochastic processes and their corresponding probabilities of occurrences, respectively. Let the random functions have the expansion
Theorem 2.1
c m
Xi@) =
k=l
VikVk(t)
where the V
c.." P(w,) E(Vikv;)
= U,2Skj
i=l
where
6kj
is the Kronecker delta function and
c P ( q )Var(Vik) m
ukz =
i=l
The proof of sufficiency has been given in the process of deriving the optimal properties of the generalized Karhunen-Lohe expansion in Appendix B. T o show the necessity, assume that the covariance function K ( t , s) has the representation of (2.28) where {yk(t)} is determined by (2.30). From (2.24),
(2.35)
=
lTdt '1 ds f '1 cp:(t)
0
=
T
0
0
i=l
P(wi)E[X,(t)X?(s)]vj(s)
ds K(t,s) vj(s) =
dt p):(t)
0
(2.36)
34
2.
FEATURE SELECTION AND FEATURE ORDERING
Note that Theorem 2.1 also holds for the discrete case with obvious substitutions of corresponding discrete quantities. From Theorem 2.1 it is easily seen that the construction of the desired coordinate system can be viewed as finding the coordinate functions (or vectors) in which the coordinate coefficients are mutually uncorrelated so that the conditions in (2.27) are satisfied. The procedure is basically that of decorrelating the coordinate coefficients over the ensemble of all pattern samples from different classes. In many recognition problems where the covariance functions are real and symmetric, the decorrelation process simply amounts to the diagonalization of the covariance functions under consideration. The actual procedure for feature selection and ordering is summarized into the following steps in terms of discrete case.+ Step 1 . Obtain the covariance function K ( t , s) defined in (2.32) from the feature vectors extracted from the given pattern samples. If the components of the feature vectors assume real values, K ( t , s) is a real symmetric matrix. Step 2. Find the eigenvalues and their associated eigenvectors for K ( t , s ) . Let the eigenvectors be normalized and lexicographically arranged according to the descending order of their associated eigenvalues. The set of orthonormal vectors thus obtained constitutes the generalized Karhunen-Lohe coordinate system. Step 3. Make the transformation defined in (2.34) where the pki’s are the components of the orthonormal eigenvectors obtained from step 2. The resulting V*lik’~ are the desired coordinate coefficients in terms of the generalized Karhunen-Lokve coordinate system.
It is noted that the complete ordering of feature measurements is achieved in the course of rearranging the eigenvectors according to the descending order of their associated eigenvalues. These eigenvalues are nothing but the variances of the transformed coefficients. From (B.6) m
t The reason for presenting the discrete case here is that most of the practical experiments in pattern recognition are processed with sampled data on a digitial computer.
2.3.
35
ILLUSTRATIVE EXAMPLES
we can obtain an expression for the average minimized error over the time period (0, T),
JI 1 'f Min
i=l
P(w,) E[I Ri,(t)12]\ dt
=
1-
5
k=l
uk2
(2.37)
>
which is a decreasing function of n. Since u12 uZ22 uk2 3 ~2 k 2 + ..., ~ the complete ordering of feature measurements according to the descending order of eigenvalues will produce smaller error with respect to any other ordering when the recognition process terminates at a finite number of measurements. Also, since the proposed procedure is independent of the classification scheme used in the recognition system, the problem of selecting a feature subset from a given set of features can be viewed as a subproblem of feature ordering. The procedure of completely ordering the coordinate vectors will allow us to select a subset of r (r N) feature measurements with minimized mean-square error by simply picking out the first Y coordinate vectors in the resulting generalized KarhunenLokve system. A computational example will be given in Section 2.3 to illustrate the procedure.
<
2.3
Illustrative Examples
T h e statistical pattern recognition techniques discussed in Chapters
1 and 2 were applied to handprinted and handwritten English character
recognition. The pattern samples used for the experiments were obtained by asking subjects to write characters in a 2-inch square (for handwritten characters) or in a circle with 2-inch diameter (for handwritten characters). Eight features, x1 ,..., x8 , were selected for handprinted characters, and eighteen features, x1 ,..., xI8, for handwritten characters, as shown in Fig. 2. I. Each feature measurement is the distance measured along a predetermined path from the edge of the square or the circle to the edge of the character (see also Fig. 2.1). These features were chosen somewhat arbitrarily with the hope that, for each pattern class, the probability distribution function of the features was close to a multivariate gaussian distribution. I n all the experiments simulated on the digital computer, the assumption of gaussian distribution for each class was made.+ t The gaussian assumption for feature distribution has been studied by many authors [lo], [Ill.
2.
36
FEATURE SELECTION AND FEATURE ORDERING X8
x7
8
5
6
Fig. 2.1. Typical samples and their feature measurements for handwritten and handprinted characters a and b.
The mean vector and the covariance matrix were estimated from a number of test samples by computing the sample mean vector and the sample covariance matrix. Example 1 T h e sequential probability ratio test and the generalized sequential probability ratio test were used for classifying handprinted and handwritten English characters. The mean vectors and the covariance matrices for the features were estimated from fifty test samples for each character.
(i) Recognition of Handprinted Characters A and B. The eight features were assumed statistically independent, and were measured sequentially. The stopping boundaries were preset symmetrically (B = -A) at successive stages as f2.1,
52.2,
f2.3,
f2.4,
42.5,
32.6,
f2.7
f2.8
The classification process was truncated at the eigth stage. T h e features, if ordered, were arranged according to the descending order of their corresponding divergences. For characters A and B in this experiment, the order is x 4 , xl,x, , x 2 , x 8 , x 5 ,x, , x 3 . The recognition results are shown in Table 2.1. I t is noted that, for the same set of stopping boundaries, the classification process for the case of
2.3.
37
ILLUSTRATIVE EXAMPLES
Table 2.1 Input Features unordered output
A
A
39 1
B
1
Features ordered
I
A
B
1
39 1
1 39
I I I
39
6.16 47
Average number of measurements Number of truncations
' '
I
3.58 24
I
t
6
INPUT
l
2
,
l
4
,
l
6
,
I)
l
'
n
( Numbw of Otwvotiona)
Fig. 2.2. The divergence versus the number of feature measurements when the successive measurements are ordered or unordered.
2.
38
FEATURE SELECTION AND FEATURE ORDERING
ordered features terminates earlier than that for the unordered features (i.e., following the natural order x1 , x2 ,..., xs). The divergence versus the number of feature measurements when the successive measurements were ordered or unordered is shown in Fig. 2.2. Although the divergence for all eight features is constant regardless of ordering, the first few measurements can be very effective in leading to correct classifications when the measurements are taken in such an order as suggested. (ii) Recognition of Handwritten Characters a and b. The eighteen features xl,x2 ,..., x18 were assumed dependent and were measured sequentially. The classification process was truncated at the eighteenth stage. The reduction of the average number of measurements using the SPRT is quite evident from the following results. Case a. Stopping boundaries A and B were fixed at +1.5 and respectively.
- 1.5,
Input
a
b
output
b
Case b. Stopping boundaries A and B varied from f l . O to f1.85 with increment of f0.05 at each stage. Input
a
b
output
b
The average number of feature measurements required in this case is 5.516. From a computer simulation, it was found that in order to achieve the same recognition accuracy the recognition process using a fixed-sample size Bayes decision procedure [(P(a)= P(b)] would require nine features. (iii) Recognition of Handwritten Characters a, b, c, and d. The conditions were the same as those in (ii) except that the GSPRT was
2.3.
39
ILLUSTRATIVE EXAMPLES
used for classification and the features were ordered according to their corresponding values of Ij(n). The stopping boundaries A(w,) were all set equal to 0.9. The recognition results are shown in Table 2.2. It is noted that with a slight sacrifice of the recognition Table 2.2 Input Features unordered output U
b C
d
u
2
b
8 0 1 2 7 0 3 2 1 0
c
d
1 0 9 0 2
2 0 1 7
Average number of measurements for each class
Ed4 Eb(n)
E m Ed4
10.40 9.32 15.30 11.65
I I
I
''
'
I I I I I I I I
I
Features ordered u
b
c
2 6 4 8 0 2 5 4 4 1 1 8 0 0 0 2
d
5 4 0 1
4.73 3.00 4.70 3.62
accuracy, the average number of feature measurements required to reach a terminal decision when successive measurements are properly ordered 'can be much reduced. Example 2 The feature selection criterion described in (2.17)(2.20) was applied to the selection of the best subset of six features from the eight features xl, x2 ,...,x, given for the handprinted characters P,D, V , and J. There were twenty-eight possible feature subsets with six features. The six features resulting in the tightest upper bound of d = J(wi,wj) were x 2 , x 4 , x 5 , x 6 , x 7 , x , . The same test samples were then used to test the classification accuracy. The percentage of correct recognition using the six features was 91.7 % and the percentage of correct recognition using all eight features was 93.1 yo. I n both cases, the feature measurements were not ordered. The results are shown in Table 2.3. Example 3 The feature selection and ordering procedure described in Section 2.2 was applied to the recognition of characters a, b, c, and d.
2.
40
FEATURE SELECTION AND FEATURE ORDERING Table 2.3
,
I I
Input: all eight features are used
D
P
V
J
I
32
1 I I
O
2 33 0 1
1 0 35 0
1 3 0 32
Output
D
P
V
J
I‘
D P V
33 3 0 0
2 34 0 0
1 0 35 0
1 3 O 32
J
output: features x2, x p,xs,xg, x, ,x8 are used
I
1
The mean vectors and the covariance matrices were estimated from sixty samples for each character (240 character samples altogether). The classifier of the recognition system implements a Bayes decision rule with equal a priori probabilities and (0, 1) loss function. The feature measurement vector X was transformed into a new vector V = [ V , , V , ,..., VISITaccording to the ordering transformation formula in (2.34). The optimal coordinate vectors {vk),K = 1,..., 18, are essentially the eigenvectors which are lexicographically ordered according to the descending order of their associated eigenvalues of the covariance function defined in (2.32). The ordered eigenvalues and the optimal coordinate vectors computed from the 240 character samples are presented in Table 2.4. Note that the largest eigenvalue is 42.508 compared to the smallest value 0.063. I n this case, the smaller eigenvalues truly indicate the insignificance of the corresponding coordinates in characterizing the character samples. Since the ordering procedure only involves a linear transformation, the classifier implements the same decision function as that with no feature ordering except that the classification is based on the transformed measurement vector V. A simplified flow diagram which shows the computer-simulated recognition system with features ordered and unordered is given in Fig. 2.3. The recognition of the 240 samples in each case was first carried out by assigning the class membership based on the first two feature measurements and the percentage of correct recognition was computed. The procedure was then repeated by adding two successive measurements at a time until all the eighteen features were exhausted. The recognition results are presented in Fig. 2.4 where the percentage of correct recognition is plotted against the number of measurements. It is noted that the effect of ordering the feature
2.3.
41
ILLUSTRATIVE EXAMPLES
Table 2.4
ORDERED EIGENVALUB AND EIGENVECTORS COMPUTED FROM CHARACTER SAMPLES
Eigen-
Eigenvectors
values
42.508 26.515 8.546 7.237 5.719 4.527 3.891 3.032 2.070 1.904 1.296 0.951 0.650 0.423 0.214 0.173 0.085 0.063
0.040 0.167 0.018 0.031 -0.255 -0.031 0.382 -0.061 -0.280 0.097 0.257 0.177 0.046 0.432 0.060 0.115 0.270 -0.014 0.674 0.148 0.256 -0.591 0.149 -0.186 0.029 0.517 0.141 0.223 0.013 0.002 0.000 0.009 0.017 0.011 0.018 -0.044
0.020 -0.000 0.045 0.108 -0.026 0.024 0.290 -0.011 -0.290 0.098 -0.007 -0.002 0.049 0.460 0.047 -0.167 -0.180 0.293 -0.041 -0.047 -0.222 -0.319 -0.368 0.543 -0.342 -0.498 0.036 -0.047 -0.500 -0.059 -0.022 0.015 -0.462 0.018 0.141 0.024
-0.016 -0.134 0.084 0.059 -0.019 -0.011 0.264 0.177 -0.264 0.224 -0.038 -0.273 0.022 0.410 0.022 -0.430 -0.116 0.359 -0.080 -0.073 -0.163 0.327 -0.194 -0.465 -0.203 0.070 -0.045 0.071 0.007 0.009 -0.153 -0.037
-0.020 -0.295 0.097 -0.232 0.039 -0.070 0.251 0.572 -0.235 0.582 -0.076 -0.213 0.009 -0.089 0.012 0.261 -0.029 -0.148 -0.145 0.057 -0.138 -0.130 -0.078 0.151 0.014 0.039 -0.041 -0.015 0.205 -0.053 0.249 -0.011 0.600 0.441 0.001 0.006 -0.583 0.719 -0.002 0.001
-0.033 -0.245 0.097 -0.470 0.058 -0.437 0.261 0.058 -0.213 -0.104 -0.126 0.521 0.005 0.170 -0.025 -0.137 0.012 -0.070 -0.120 -0.424 -0.109 0.075 0.036 0.032 0.042 0.028 -0.058 0.008 0.562 0.063 0.488 -0.004 -0.427 -0.014 -0.301 -0.000
-0.050 -0.217 0.117 -0.401 0.054 -0.284 0.202 -0.300 -0.242 -0.322 -0.119 -0.576 -0.008 0.125 -0.035 0.323 0.079 0.215 -0.156 0.083 -0.088 -0.077 0.176 -0.007 0.257 0.023 -0.200 -0.012 0.224 -0.036 -0.766 0.035 -0.210 0.007 0.108 -0.003
-0.055 -0.427 0.095 -0.203 0.083 0.154 0.138 -0.118 -0.187 -0.088 -0.058 -0.032 0.007 -0.136 -0.148 -0.613 0.083 -0.265 -0.161 0.422 0.079 -0.278 0.307 -0.002 0.492 0.013 -0.326 -0.079 -0.563 0.041 0.289 -0.005 0.064 0.009 -0.121 0.027
0.009 -0.611 0.068 0.100 0.033 0.538 0.041 -0.078 -0.151 -0.098 -0.050 0.271 0.231 0.232 -0.097 0.393 -0.492 0.062 -0.094 0.007 0.251 0.151 0.313 -0.056 0.077 -0.014 0.650 0.051 -0.035 -0.008 -0.036 -0.003 0.006 -0.004 -0.003 -0.014
0.144 -0.419 0.003 0.664 -0.070 -0.568 -0.050 -0.174 0.039 0.094 -0.037 -0.005 0.493 0.078 0.111 0.034 -0.504 -0.092 0.214 -0.003 0.247 -0.054 -0.031 -0.015 -0.040 0.020 -0.580 -0.015 0.105 -0.009 0.001 0.024 -0.015 -0.023 0.041 0.009
Pattern Input
n Start
Generate observation vector X = (x,. x2. ..., x 1 8 ) Without feature ordering Estimate mean vectors and covariance matrix
1'
With feature ordering
Order eigenvectors and form ordering transformations
vector V = ( V ,.V2 ,...,V , * ) through the ordering transformations
f1
f Compute likelihood ratios
Classifier
*
t
'
Classifier
Decide pattern class
t
Fig. 2.3.
A simplified flow diagram of computer-simulated recognition system. 42
2.4.
IV 60
2
43
SUMMARY AND FURTHER REMARKS
4
o OrdaedObaervotionr
A Unordered Observotknr
6 8 10 12 Number of Feoture Obwnotionr
Fig. 2.4.
14
16
18
Recognition results.
measurements is reflected in the fact that a considerably higher recognition rate can be obtained during the first few measurements. This performance is particularly important when the number of feature measurements is limited by the data-processing unit of the recognition system. 2.4 Summary and Further Remarks
Two special methods for feature selection and ordering, using the concept of entropy and divergence and the formulation of KarhunenLokve expansion, are discussed in this chapter. Computer simulations for English character recognition have been employed to illustrate sequential classification procedures with ordered and unordered feature measurements. Besides divergence, some other distance
44
2.
FEATURE SELECTION AND FEATURE ORDERING
measures [3], [12]-[16] might be also useful in the study of feature selection and ordering problems. The explicit relationship between divergence and probability of misrecognition has been easily derived for gaussian distributed pattern classes with equal covariance matrices. It will be useful to study the explicit relationship when the covariance matrices are not equal or the patterns are not gaussian distributed [21], [22]. Similarly, the feature selection procedure using the KarhunenLoPve expansion has made use of only second order statistics and certainly involves only linear transformations on the feature space. As a matter of fact, Watanabe [8] has proved the equivalence between the Karhunen-Lohe expansion and the factor analysis commonly used by experimental psychologists. It might be desirable in many problems that higher order statistics and nonlinear transformations [171, although much more difficult to handle and implement, should be taken into consideration for the best use of feature information. References 1. P. M. Lewis, The characteristic selection problem in recognition systems. I R E Trans. Inform. Theory 8, 171-178 (1962). 2. T. Marill and D. M. Green, On the effectiveness of receptors in recognition systems. IEEE Tmns. Inform. Theory 9, 11-17 (1963). 3. S. Kullback, “Information Theory and Statistics.” Wiley, New York, 1959. 4. T. L. Grettenberg, A criterion for statistical comparison of communication systems with applications to optimal signal selection. T R No. 2004-4, SEL-62-013. Stanford Electron. Labs., Stanford, California, February 1962. 5. T. L. Grettenberg, Signal selection in communication and radar systems. IEEE Trans. Inform. Theory 9, 265-275 (1963). 6. K. S. Fu and C. H. Chen, Sequential decisions, pattern recognition, and machine learning. Tech. Rept. TR-EE65-6. School of Elect. Eng., Purdue Univ., Lafayette, Indiana, April 1965. 7. K. Karhunen, ’iiber Lineare Methoden in der Wahrscheinlichkeitsrechnung.Ann. Acad. Sci. Fennicae Ser. A I 37, (1947) [English translation by I. Selin is available as On Linear Methods in Probability Theory, T-131. The RAND Corp., Santa Monica, California, August 1960.1 8. S. Watanabe, Karhunen-Loeve expansion and factor analysis-theoretical remarks and applications. Proc. Conf. Inform. Theory 4th, Prague, 1965. 9. Yu. L. Barabash, On properties of symbol recognition, Eng. Cybernetics, No. 5, 71-77 (1965). 10. 3. A. Lebo,On the selection of decision criteria and the estimation of probabilities in pattern recognition. Ph. D. Thesis (Tech. Rept. TR-EE64-16). School of Elec. Eng., Perdue Univ., Lafayette, Indiana, September 1964. 11. T . Marill and D. M. Green, Statistical recognition functions and the design of pattern recognizers. I R E Trans. Electron. Computers 9, 472-477 (1960).
REFERENCES
45
12. P. C. Mahalanobis, On the generalized distance in statistics. Proc. Nut. Inst. Sci. India 122, 49-55 (1936). 13. A. Bhattacharyya, On a measure of divergence between two multinomial populations. Sankhyd 6, 401-406 (1946). 14. T. Kailath, The divergence and bhattacharyya distance measures in signal detection. IEEE Trans. Commun. Technol., 15, No. 1, 52-60 (1967). 15. H. Kobayashi and J. B. Thomas, Distance measures and related criteria. Proc. Ann. C m f . Circuit and System Theory, 5th, Allerton, October 1967, 491-500 16. T. T. Kadota and L. A. Shepp, On the best finite set of linear observables for discriminating two ‘gaussian signals’. IEEE Trans. Inform. Theory 13, No. 2, 278-284 (1967). 17. G. Sebestyen, “Decision-making Processes in Pattern Recognition.” Macmillan, New York, 1962. 18. Y. T. Chien and K. S. Fu, On the generalized Karhunen-Loeve expansion. IEEE Trans. Inform. Theory 13, No. 2 518-520 (1967). 19. J. T. Tou and R. P. Heydron, Some approaches to optimum feature extraction, In “Computer and Information Sciences-11” (J. T. Tou, ed.). Academic Press, New York, 1967. 20. H. D. Block, N. J. Nilsson and R. 0. Duda, Determination and detection of features in patterns. In “Computer and Information Sciences” (J. T. Tou and R. H. Wilcox, eds.). Spartan Books, Washington, D. C., 1964. 21. P. J. Min, D. A. Landgrebe, and K. S. Fu, On feature selection in multiclass pattern recognition. Proc. 2nd Princeton Conf. Information Sciences and Systems, March 1968, pp. 453-457. 22. P. J. Min, On feature selection in multiclass pattern recognition. Ph.D. Thesis (Tech. Rept. TR-EE68-17). School of Elec. Eng., Purdue University, Lafayette, Indiana, June 1968.
CHAPTER 3
FORWARD PROCEDURE FOR FINITE SEQUENTIAL CLASSIFICATION USING MODIFIED SEQUENTIAL PROBABILlTY RATIO TEST
3.1
Introduction
As described in Section 1.5, the error probabilities eij can be prespecified in SPRT and GSPRT. However, in this case, the number of feature measurements required for a terminal decision is a random variable which, in general, depends upon the specified eij and has a positive probability of being greater than any constant. Since it is impractical to allow an arbitrarily large number of feature measurements to terminate the sequential process, we are frequently interested in setting an upper bound for the number of feature measurements within which the pattern classifier must make a terminal decision. An abrupt truncation of the process as described in Section 1.5 is an answer. But the abrupt truncation is considered as an inefficient procedure because if the value of sequential probability ratio is large and the number of feature measurements is near the truncation value, a small number of additional feature measurements will not, in general, permit much chance of rejecting any pattern class whatever the measurements may be. In this chapter, the problem of terminating the sequential process at a finite number of feature measurements using forward computation procedure (i.e., SPRT or GSPRT) is presented. The application arises, in practice, when the feature extractor of a recognition system has only a finite number of suitable feature measurements available to the classifier, or when the cost of taking more feature measurements is found to be too high as the number of measurements exceeds a certain limit. In either case, the urgency to terminate the process 46
3.2.
SEQUENTIAL PROBABILITY RATIO TEST-DISCRETE
CASE
47
becomes greater when the available feature measurements are to be exhausted. Instead of using abrupt truncation this problem is studied by considering time-varying stopping boundaries for the sequential classification process. The idea of varying the stopping boundaries as a function of time or number of feature measurements, similar to the one used in Section 1.6, enables us to investigate the behavior of a modified SPRT (with time-varying stopping boundaries) as compared with the standard Wald’s SPRT with constant stopping boundaries A and B. Since the stopping boundaries are constructed and employed in the direction of usual time sequence starting with the first feature measurement, the term “forward procedure” is emphasized here to distinguish from the “backward procedure” discussed in Chapter 4. 3.2
Modified Sequential Probability Ratio Test-Discrete
Case
-
The modified SPRT is formulated as follows [1]-[3] : Let Ei(n) be the expected number of feature measurements when X w i, i = 1,2, that is, a terminal decision is made. Subject to the requirement that when X is classified as from class oi,the probability of misrecognition will be at most eii (i # j ) , the problem is to give a procedure with time-varying stopping boundaries for deciding between X w1 and X w2 such that E,(n) is a minimum. The procedure of modified SPRT can be stated as follows: Let gl(n) and g2(n)be either constants or monotonically nonincreasing and nondecreasing functions of n, respectively. The classifier continuously takes measurements as long as the sequential probability ratio An lies between egt(n)and egz(nt,that is, the sequential process continues by taking additional feature measurements as long as
-
-
8%‘”) < x, < 81(.),
n
= 1, 2,
...
If and if
the decision is that X A,, >, ~ ? l ( ~ ) then , hn
then the decision is that X < \ egz(n),
N
3.1 w1 w2
In this formulation, it is seen that the standard Wald’s SPRT can be considered as a special case of the modified SPRT where gl(n) and g,(n) are constants. The fact that, in general, gl(n) and g2(n) can be
48 3.
FINITE SEQUENTIAL CLASSIFICATION-FORWARD
PROCEDURE
made functions of n enables us to design a sequential classifier such that the expected number of feature measurements in reaching a terminal decision and the probability of misrecognition may be controlled in advance. Consider the modified sequential probability ratio test defined in (3.1) and (3.2) for which
g2(n) =
3
4’ (1 - -
<
(3.4)
where 0 < rl , r2 1, a’ > 0, b’ > 0, and N is the prespecified number of feature measurements where the truncation occurs and the
Q*(n)
Fig. 3.1.
I
Graphical representation of gl(n) and gz(n) as functions of n.
classifier is forced to reach a terminal decision. The graphical representation of gl(n) and gz(tz)as functions of n is shown in Fig. 3.1. Let
Then the modified sequential probability ratio test is defined in
-
-
and the violation of either one of the inequalities is associated with w1 or X w z . It is noted that as N .--t 00, the classification of X
3.2.
SEQUENTIAL PROBABILITY RATIO TEST-DISCRETE
CASE
49
(3.6) defines the standard Wald's SPRT where a' = log A and b' = -log B. The derivatives of gl(n) andg,(n) at n = 0 are --r,a'/N and r2b'/N, respectively; they characterize the initial slopes of the convergent boundaries and therefore determine the rate of convergence to N when the process is to be truncated. As in Section 1.5, it will be interesting to see that change of decision boundaries in the case of modified SPRT. Use the same example as that discussed in Section 1.5. For the modified SPRT defined by (3.1) and (3.2), (1.56) and (1.57) become, respectively,
and
It is noted that the decision boundaries, defined by (3.7) and (3.8) with equality signs, are again two parallel hyperplanes in the feature space. The separation between the boundaries is
which is no longer a constant (but a function of n) as that in (1.58). If gl(n) and g2(n) are specified by (3.3) and (3.4), then, as n -+ N, the separation between the two decision boundaries approaches zero. Consequently when n = N, the region associated with w1 and the region associated with w 2 meet, eliminating the region of indifference, then a terminal decision must be made. Let El(n),i = 1,2, be the expected number of feature measurements for the modified SPRT when X wi , and let E,I*(n) and Ei**(n) be the corresponding expectations when the lower and upper stopping boundaries, respectively, are violated. Assume that ei2 and eLl, the error probabilities upon the termination of the modified SPRT, are very small so that 1 - ei2 'v 1 and 1 - eLl 'v 1. This assumption is not necessary byt it greatly simplifies the resulting expressions. Following this assumption the equation
-
E;(n)
=
e;,E;*(n)
+ (1 - e;,) Ei**(n)
(3.10)
50 3.
FINITE SEQUENTIAL CLASSIFICATION-FORWARD
PROCEDURE
is replaced by E;(n)
(3.1 1)
= E;**(n)
Assume that the feature measurements are independently and identically distributed. Let
By using a well-known result of sequential analysis [4], [5], together with often-mentioned neglect of excess over the boundaries, we obtain (see Appendix A) E;(z,
+ + z,) = Ei(L,)
E;(z,
+ + x,)
**.
N
E;**(n) E,(z)
(3.12)
and
<
= eb,E;*[--6’(1
+ (1
-
- u)+7
e;,) E;* *[a’(1 - up]
(3.13)
When eil 1, the first term on the right-hand side of (3.13) can be neglected. Then by (3.12), E;**(n) E,(z) cv E;**[u’(l N
UP]
E;**(u’{~- T
+ +[Y~(T,- l)] u2 - **-})
~ U
(3.14)
Thus (3.15)
where all the conditional moments of u higher than the first are neglected. T o obtain the error probability ei2 the following relation is used
Or, equivalently, E;** exp[-u’(1 - up]N e;,/(l - eb,)
(3.17)
3.2.
SEQUENTIAL PROBABILITY RATIO TEST-DISCRETE
CASE
51
Taking into account 1 - eil N 1, neglecting the conditional moments of u higher than the first in the Taylor series expansion about u = 0, and substituting (3.15) for E;**(n), we get
-
(3.18)
Equations (3.15) and (3.18) apply when X w2 is true, by replacing a’ by-b’, ei2 by eil , and E;(n) by El(@). Now consider the standard Wald’s SPRT with upper stopping boundary A = ea and lower stopping boundary B = cb.If e12 and e21 are very small, E,(n) N a/E,(z) and caN e12. Suppose a’ = a, that is, the boundary of the standard Wald’s SPRT and the modified SPRT begin (tz = 0) at the same points. Then (3.15) and (3.18) can be rewritten as (3.19)
and (3.20)
From (3.19) and (3.20), it is important to observe the following relationships: (3.21)
This is to say that, because of the convergent property of the timevarying stopping boundaries, the modified SPRT requires a less expected number of feature measurements. The amount of reduction is controlled by the design parameter rl as shown in the left inequality of (3.21). (ii) ei2 is greater than e12 since ~laEl(n)l“
+ YlEl(41
is a positive quantity. This result is to be expected due to the optimality of the standard Wald’s SPRT. In fact, if ei2 were set equal to e12, the modified SPRT will have a larger expected number of feature measurements than that of the Wald’s SPRT. This in turn implies also that the modified SPRT must begin at a’ > a.
52 3.
FINITE SEQUENTIAL CLASSIFICATION-FORWARD
PROCEDURE
With regard to the results obtained above, it becomes clear that, by properly constructing the time-varying stopping boundaries for the sequential classification process, the following purposes can be accomplished: (i) the classification process always terminates by a prespecified maximum number of feature measurements; (ii) the expected number of feature measurements is controllable and usually less than that required for the standard Wald’s SPRT with fixed parallel stopping boundaries; (iii) it is possible by adjusting the starting points of stopping boundaries to achieve error probabilities as low as those in Wald’s SPRT. 3.3
Modified Sequential Probability Ratio Test-Continuous
Case
Analogous to the discrete case presented in Section 3.2, the continuous case of modified SPRT is now described [l], [6]. Let { X l ( t ) ,t 3 0} and { X 2 ( t ) t, 3 0} be two different stochastic processes, corresponding to two pattern classes subjected to a random environment (due to noise, distortion, etc.). The classifier measures continuously, beginning at t = 0, a process { X ( t ) ,t >, 0} in the feature space and wishes to decide, as soon as possible, whether { X ( t ) } is {Xl(t)}or {Xz(t)}.Let t , be the time when the classifier reaches a terminal decision. I n general, t , is a random variable. Let Ei(tT) denote the expected value of t , when { X ( t ) }= {Xi(t)},i = 1, 2. Subject to the requirement that when { X ( t ) }= {Xi(t)}the probability of an incorrect classification will be at most eji , j (#i) = 1, 2, the problem is to give a decision procedure for classifying between {Xl(t)} and {X2(t)}such that Ei(tT) is a minimum for i = 1, 2. This is simply the same formulation for stochastic processes with continuous time parameter as that originally given by Wald for stochastic processes with discrete time parameter. Assume that the stochastic processes associated with the two pattern classes satisfy the following condition: For every t 2 0, X ( t ) is a sufficient statistic for the process, that is, given X ( t ) the conditional distribution of X(7), 0 T t, is (with probability 1) the same for the processes { X l ( t ) }and {X2(t)}.
< <
3.3.
SEQUENTIAL PROBABILITY RATIO TEST-CONTINUOUS
CASE
53
Let (3.22)
(3.23)
The modified SPRT can be stated as follows: Let gl(t) and g2(t) be either constants or monotonically nonincreasing and nondecreasing functions of t, respectively. The classifier continues to measure { X ( t ) }as long as eg,(t)
< h(X(t)) <
t 20
(3.24)
As soon as A(X(t))2 ,PI('), the classifier stops measuring {X(t)} and decides that { X ( t ) }= {Xl(t)}.Similarly, as soon as A(X(t)) @ a ( ' ) , the classifier stops measuring { X ( t ) )= {X2(t)}.Consider that
<
g&)
t a' (1 - T
=
)
(3.25)
T=
(3.26)
and &(t) = 4' (1 -
< <
--)Tt
<
where 0 t T , 0 < rl , r2 1, a' > 0, 6' > 0, and T is the prespecified observation time at which the sequential process is truncated and the classifier is forced to make a terminal decision. Equation (3.24) is then reduced to
< L(X(t))< a' (1 - -)T t
,
t 20
(3.27)
and the violation of either one of the inequalities is associated with the classification of { X ( t ) }= {Xl(t)}or { X ( t ) }= {X2(t)}. It is noted that as T + 00 the modified SPRT reduces to the standard Wald's SPRT with continuous time parameter where a' = log A and b' = -log B. Also, -rla'/T and r2b'lT are the derivatives ofg,(t) andg,(t) at t = 0, respectively. Let E;(tT),i = 1, 2, be the expected termination time for the modified SPRT when { X ( t ) }= {Xi(t)}. Analogous to the discrete case, the following
54 3.
FINITE SEQUENTIAL CLASSIFICATION-FORWARD
PROCEDURE
relationships in terms of continuous time parameter are developed (see Appendix C for detailed derivations): (3.28) (3.29)
Parallel to (3.19), (3.20)’ and (3.21), we obtain for continuous case, (3.30) (3.31)
and 1 +rl
< E;(tT) < El(tT)
(3.32)
From (3.31), ei2 > e12 since YluEl(tT)/[T
+
ylEl(tT)l
is always positive. In this formulation, the modifed SPRT with continuous time parameter essentially includes the standard Wald’s SPRT with discrete time parameter as a special case where gl(t) and g2(t) are constants and t is considered as belonging to some nonnegative integer set (0, 1,2,...}. Also because of the use of continuous time parameter some of the approximated relations due to Wald (by neglecting the excess over the boundaries) become exact with probability 1. 3.4 Procedure of Modified Generalized Sequential Probability Ratio Test
Generally speaking, the principle of constructing the time-varying stopping boundaries for Wald’s SPRT also applies to the generalized sequential probability ratio test [7] when the number of pattern classes to be recognized is more than two. In the following, the proce-
3.4.
PROCEDURE
55
dure of modified GSPRT (with time-varying stopping boundaries) for continuous time parameter is described. The case for discrete time parameter can be analogously derived. Let {Xr(t),t >, 01, i = 1,2, ..., m, be the hypothesized stochastic process associated with the ith pattern class w iwhose probability density function is p(X(t)/wi). The classifier continuously measures a stochastic process { X ( t ) ,t 201 at its input and decides, as soon as possible, to classify the input stochastic process as one of the m possible stochastic processes. In a modified GSPRT, the generalized sequential probability ratio for each pattern class is computed upon the measurement of X ( t ) , at time instant t,
and is compared with the stopping boundary gi(t), i = 1, ..., m. As soon as U(X(t)/w,)< gz(t> (3.34) the pattern class wi is dropped from consideration, and the number of possible pattern classes is reduced by one for the next computation. The process of forming the generalized sequential probability ratio continues until there is only one pattern class retained; this pattern class is then assigned to the input. Note that the stopping boundaries gi(t), i = 1,...,m, are, in general, functions of time and need not be identical for all classes. Similar to the ones suggested for the modified SPRT, a simple class of convergent boundaries may assume the form
(3.35) I n fact, the spirit of the modified GSPRT relies on an optimal construction of these functions such that all the pattern classes but one are dropped from consideration by a prespecified time T. It remains to determine the error probabilities eij and the expected termination time Ei(tT) in terms of the design parameters, such as T, r i , etc. Following the approach taken by Reed, the modified GSPRT defined in (3.33) and (3.34) may be viewed as a special Markov process with continuous time parameter. The probability aspects of the modified GSPRT are not as yet completely known. I n turn, in the next section, an algorithmic construction of the timevarying stopping boundaries will be given, and experimental results
56 3.
FINITE SEQUENTIAL CLASSIFICATION-FORWARD
PROCEDURE
will be used to illustrate how the desirable performance resulting from the modified SPRT may also be achieved in the case of modified GSPRT. 3.5 Experiments in Pattern Classification
The modified SPRT and the modified GSPRT described in previous sections have been applied to the classification of handwritten English characters a, b, and c. Sixty samples of each character were processed in establishing the necessary statistics for the construction of suitable mathematical models. The same eighteen features used in the experiments in Section 2.3 were used here. Each input pattern was represented by a sequence of eighteen measurements denoted by a feature vector in the 18-dimensional feature space. The process of discretizing the measurements into ten possible values in the examples described in the following is simply for the purpose of their being easily simulated on a digital computer, with the understanding that the results will apply without any modification to stochastic processes with discrete time parameter. Experiment 1 The feature distributions for each pattern class is assumed to be multivariate gaussian. Let p,(X/o,), i = 1, 2, 3, represent the multivariate gaussian densities for characters a, b, and c, respectively, at the nth stage of the sequential classification process. X is an n-dimensional feature vector denoting the successive measurements of (xl ,x2 ,..., xn), n N = 18. This is the case that the classification process terminates in no more than eighteen feature measurements. Specifically,
<
p,(X/wi) = [(2~ ),/2
IK
l1/2]-l
exp[- $(X - M i ) X - l ( X - Mi)],
i = 1 , 2 , 3 (3.36) where Mi is the mean vector for class m i , K is the n x n common covariance matrix for all three classes. Sample means and sample covariances estimated from the sixty samples were used for Mi and K. Binary classification of characters a and b using the modified SPRT In this case, m = 2, the logarithm of sequential probability ratio computed at the nth stage is
Case a
+ M1)TK-'(M2 - Ml)
L, = X X - 1 ( M 2 - Ml) - +(Ma
(3.37)
3.5.
EXPERIMENTS I N PATTERN CLASSIFICATION
57
The upper stopping boundary for the modified SPRT is chosen as gdn)
= a'(1
n
-18)'
n
=
1,2,..., 18
(3.38)
and the lower stopping boundary as gz(n)
1
l"s)
4' 1 - - ,
(
n
=
1,2,..., 18
(3.39)
whera a' and --b' are the starting boundaries adjusted in such a way that the various levels of error probabilities and the expected number of feature measurements are obtained. The result is shown in Fig. 3.2 in which the trade-off between error and classification time (in terms
I
Fig. 3.2.
1
5
I
10
I
IS
1
Recognition of characters a and b (normal model):
- modified SPRT.
I
20 25 Percontoga of Misrecognition
I
--- standard SPRT;
58 3.
FINITE SEQUENTIAL CLASSIFICATION-FORWARD
PROCEDURE
of the expected number of measurements) is demonstrated. On the same figure, the corresponding curve for the standard Wald’s SPRT (by setting constant stopping boundaries of various levels) is shown for the purpose of comparison. It is noted that while the same power of classification between two characters may be achieved, the modified SPRT has shown its capability of reducing the classification time up to 40 % for small error probabilities. Case b
Multiclassification of characters a, b, and c using the modified GSPRT
In this case, i = 1,2, 3, the generalized sequential probability ratio for each class computed at nth stage is
where m may assume the value 2 or 3, depending on the number of pattern classes under consideration at each stage. Let
l2
gi(n) = G’ 1 - - ,
(
TI
=
1,2,..., 18
(3.41)
where G’ > 0 and i = 1,2, 3. The classification procedure of the modified GSPRT is to drop the class wi from consideration at the nth stage if U,(X/wi)
3.5.
59
EXPERIMENTS I N PATTERN CLASSIFICATION
18-
1
8
16-
'S 14-
b s z #ki
a
I*-
10
-
8-
6-
I
30
1
I
I
I
20
40
hcmtage of Miamcognitin
Fig. 3.3. Recognition of characters a, b, and c (normal model): GSPRT; - modified GSPRT.
---
standard
pattern (character a or b) form a discrete time homogeneous firstorder Markov chain, that is, P(.n
I x1
9
x2 ,***,
xn-1
9
4 = P(% I Xn-1
9
mi)
(3.42)
Let the state space of the Markov process be denoted by S = ( S , , S, ,..., Sl0)where the Si's are the ten possible values of feature measurements (that is, S1= 1, S, = 2, etc., any measurements with values beyond ten were truncated to be ten). The state transition probability matrices M i= [pi(j,K)], i = 1, 2, corresponding to characters a and b, respectively, are defined as follows:
p i ( j , k) = Prob{x,
= S,
Ix
~ = - ~S,},
j,k
=
1,2,..., 10 (3.43)
that is, the conditional probability that the nth measurement belongs to state Ek ,given that the (n - 1)th measurement belongs to state Si. It is assumed that the Markov chain is ergodic, and that the transition
60 3.
FINITE SEQUENTIAL CLASSIFICATION-FORWARD
Estimation of [P,(i,j)l
PROCEDURE
Recognition
v
I
I
'-i
Start
Set frequency
Start
counting matrices
measurements of an input sample
measurements of an input sample
Count Nii
XI*XZ.
- - - .XI8
Add individual elements of matrices to the relevant frequency counter
Estimated elements of probability matrices from stored frequencies
by using counted Nii and estimated transition probabilities
Yes 11
Yes \r
Fig. 3.4. Computer flow diagrams for the estimation and recognition procedures in Experiment 2.
3.5.
EXPERIMENTS I N PATTERN CLASSIFICATION
61
probability matrices are known a priori (estimated from the given samples, see Fig. 3.4 for the flow diagrams). Then the modified SPRT is to compute the sequential probability ratio A, at the nth stage
which is an immediate consequence of the Markov property (3.42). Taking the logarithm of A,, we obtain
Njxl o Pdj, gPm k) ,
= 5,K
j,k
=
1,2,..., 10
(3.45)
for a sufficiently large number of measurements [S], where Njk is the number of transitions from state Sjto state S, ,and Cj,,Njk = (n - 1) which is the total number of transitions at the nth stage of the process. By employing the same time-varying stopping boundaries chosen in case (a) of Experiment 1, L, is computed and a terminal decision or classification is made upon the crossing of either boundary. The performance curves showing the relationship between error and the expected number of feature measurements for both the modified SPRT and the standard Wald's SPRT, respectively, are presented in Fig. 3.5. Two additional results of this experiment may be mentioned here. (i) It is seen that in the Markovian-dependent model, only the transition probabilities pi(j, K) and the number of transitions Njk are needed in performing the SPRT. Complex computations of matrix operations, especially in the case of high-dimensional feature space, is avoided and therefore the computer time is greatly reduced.
62 3.
FINITE SEQUENTIAL CLASSIFICATION-FORWARD
I
L
2
1
4
I
6
I
8
1
10
PROCEDURE
I
12
I
14
Percentage of Mimcognition Fig. 3.5. Recognition of characters a and b (Markovian model): SPRT; - modified SPRT.
---
standard
(ii) In comparing Figs. 3.5 and 3.2, a consistent reduction of the expected number of feature measurements has been found. It appears that Markov dependence among the feature measurements may provide a much more valid approximation to the physical situation of English character recognition in the experiments performed. 3.6 Summary and Further Remarks
A forward procedure for finite sequential classification problems is presented in this chapter. The proposed procedure requires the
REFERENCES
63
modification of Wald’s sequential probability ratio test by constructing convergent time-varying stopping boundaries with which the sequential probability ratios are compared at each stage. The fact that the stopping boundaries monotonically converge guarantees the finite termination of the classification process, but not without sacrificing the optimal properties of the original Wald’s test. The resulting procedure, however, is simple and efficient as is demonstrated in the computer-simulated experiments of character recognition. Comparisons have been made between the modified SPRT and Wald’s SPRT with respect to the probabilities of misrecognition and the expected number of feature measurements. Both discrete and continuous (time parameter) cases of the modified test are discussed. The modified SPRT has also been extended to the modified GSPRT for multiclass classification problems. The forms of the time-varying stopping boundaries, as suggested by (3.3) and (3.4), certainly cannot be considered as optimal. A more quantitative study on the proper selection of a’,b’, rl , and r2 , or even the forms of the time-varying stopping boundaries, should be useful in establishing more efficient test procedures. References 1. Y. T. Chien and K. S. Fu, A modified sequential recognition machine using timevarying stopping boundaries. IEEE Trans. Inform. Theory 12, No. 2, 206-214 (1966). 2. T. W. Anderson, A modification of the sequential probability ratio test to reduce the sample size. Ann. Math. Statist. 31, 165-197 (1960). 3. J. J. Bussgang and M. B. Marcus, Truncated sequential hypothesis tests. Memo. RM-4268-APRA. The Rand Corp., Santa Monica, California, November 1964. 4. A. Wald, “Sequential Analysis.” Wiley, New York, 1947. 5. C. R. Rao, “Linear Statistical Inference and Its Applications,” Section 7c.2. Wiley, New York, 1965. 6. A. Dvoretzky. J. Kiefer, and J. Wolfowitz, Sequential decision problems for processes with continuous time parameter, testing hypotheses. Ann. Math. Statist. 24, 254-264 (1953). 7. F. C. Reed, A sequential multi-decision procedure. Proc. Symp. on Decision Theory and Appl. Electron. Equipment Develop., USAF Develop. Center, Rome, New York, April 1960. 8. R. M. Phataford, Large sample sequential analysis of Markovian observations. J . India Statist. Assoc. 1, No. 3, 152-160 (1963).
CHAPTER 4
BACKWARD PROCEDURE FOR FINITE SEQUENTIAL RECOGNITION USING DYNAMIC PROGRAMMING
4.1
Introduction
As mentioned in Section 1.5, many pattern recognition problems may be considered as sequential decision processes in which the number of observations is necessarily finite. The method of modifying the sequential probability ratio test described in Chapter 3 is certainly one approach of solving this class of decision problems. However, the optimality of the original decision procedure is frequently sacrificed, especially in the multiple decision case (m > 2). The optimal Bayes sequential decision procedure which minimizes the expected risk including the cost of observations is essentially a backward procedure [l]. It is intended to show in this chapter that, as an alternative approach to the modified sequential probability ratio test, the dynamic programming [2]-[8] provides a feasible computational technique for a class of sequential recognition systems with finite stopping rules. The intuitive argument of using dynamic programming for finite sequential recognition problems can be stated as follows: Consider a sequential decision process. With observations taken one at a time, each stage of the process is a decision problem including both the choice of closing the sequence of observations and making a terminal decision, and the choice of taking an additional observation. It is easy to determine the expected risk involved in the decision when the procedure is terminated, but it is not easy to find the expected risk involved in taking an additional observation. For the case of taking one more observation, the expected risk is that of continuing and then doing the best possible from then on. Consequently, in order to determine the best decision at the present stage 64
4.2.
MATHEMATICAL FORMULATION-BASIC
EQUATION
65
(i.e., whether to continue the process or not) it is necessary to know the best decision in the future. I n other words, as far as seeking the optimal decision procedure is concerned, the natural time order of working from the present to the future is of little use because the present optimum essentially involves the future optimum. The only alternative to keep the true optimality is to work backwards in time, i.e., from the optimal future behavior to deduce the optimal present behavior, and so on back into the past. The entire available future must be considered in deciding whether to continue the process or not, and the method of dynamic programming provides just such an optimization procedure, working backwards from a prespecified last stage to the very first stage. I n the problems of sequential recognition where the decision procedure is to terminate at a finite number of observations, the termination point can be used as a convenient starting point (i.e., the last stage) for backward computation. 4.2 Mathematical Formulation and Basic Functional Equation
The way in which the dynamic programming is carried out in the finite optimal sequential decision procedure is by applying the principle of optimality. As stated by Bellman [2], “an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.” In essence, it is equivalent to saying that if an optimal policy is pursued, then at each stage of the sequential process the remaining decisions must themselves form an optimal policy from the state reached to the terminal point of the process. Consider the successive observations or feature measurements xl, x2 ,..., x, , n = 1, 2,..., with known distribution function of x,+~ given the sequence x1 ,...,x, , P(x,+, I x1 ,..., x,). After the observation of each feature measurement, the decisions made by the classifier include both the choice of closing the sequence of feature measurements and making a terminal decision (to decide the pattern class based on the observed feature measurements), and the choice of making another observation of the next feature measurement before coming to a terminal decision. Let p , ( x , , x2 ,..., x,)
be the minimum expected risk of the entire sequential decision process, having observed
4.
66
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
the
sequence of feature measurements >*-*, X n ; be the cost of continuing the sequential C(x, , x2 ,..., x,) process at the nth stage, i.e., taking an additional feature measurement, xn+l ; R(x, , x2 ,..., x, ; d,) be the risk of making terminal decision di (i.e., the ith pattern class is accepted by the classifier), i = 1 , 2,...,m, on the basis of the feature measurements x l , x2 ,..., x, . 9
X2
If the classifier decides to stop the process, the expected risk is Mini R(xl ,x2 ,..., xn ; d,) by employing an optimal decision rule. If the classifier decides to continue the process and to take on more feature measurement x , + ~ ,the expected risk is C(X1
9
x2
,..*)4
+ f P,+l(Xl
9
I Xl
x2 ,.**, x, ,X,+l) dP(X,,l
****,
x ),
where the integration is carried over the admissible region of x ~ + ~ . Hence, by the principle of optimality, the basic functional equation governing the infinite sequence of the expected risk pn(xl, x2 ,..., x,), n = 1, 2, ..., is p,(x1 , x2 ,*-.,x), Continue: =
Min
Stop:
C(x,
+ J”
,...,x,)
,.-,X, x,+i) Min R(xl ,..., x, ; di) ~n+i(Xi
@(x,+I
I
,.-,x,)
(4.1)
In the case of finite sequential decision processes where a terminal decision has to be made at or before a preassigned stage number N (for example, only a maximum of N feature measurements available for observation), the optimal stopping rule can be determined backwards starting from the given risk function (or specified error probabilities) of the last stage. That is, at Nth stage let, pN(xl ,x2 ,..., xN) = Min R(x, ,x2 ,...,xN ; 4)
(44
and compute the expected risk for stage number less than N through the functional equation (4.1). Specifically, starting with the known
4.2.
MATHEMATICAL FORMULATION-BASIC
67
EQUATION
(or given) value for p N ( x l ,x2 ,...,x N ) in (4.2), we have at (N - 1)th stage, > x2
PN-l(%
xN-l)
)***)
Continue: C(x,, x2 ,..., xN-l)
+s
= Min
d x 1
)***)
x N ) dp(xN
I x1
$***?
xN-l)
(4.3)
Stop: Min R(x, ,..., xN-l ; di) 1 in which pN(xl ,...,x N ) is obtained from (4.2). At (N - 2)th stage, PN-2(x1
9
x2
9***9
xN-2)
Continue: C(xl ,x2 ,..., xN+)
= Min
[
in which pN-l(xl PZ(X1
9
x2)
= Min
+s
PN-l(%
?.**)
%N-l) dp(xN-l
I
,-**,
xN--2)
(4.4)
Stop: Min R(xl ,...,x ~ ;di) - ~ 1
,..., x ~ - is~ obtained ) from (4.3). At second stage
[
Continue:
c(xl
%2)
+
P3(x1
Stop: Min R(xl , x2 ; di) I
x2
,x3) dp(x3
x1
x2)
(4.5)
in which p3(xl , x2 , x3) is obtained from the third stage. At first stage,
in which pz(xl , x2) is obtained from (4.5). One can easily see the computational difficulty arising from this formulation. Aside from the necessary memory locations for estimating the high-order conditional probabilities, the storage in a computer required for calculating the risk functions alone is already enormous. For example, suppose there are eight feature measurements available for successive observations (N = 8) and each measurement can take on one of ten values (discrete case); in order to resolve (4.1) through the recursive equations just described the total storage required for storing all the possible risk functions pn(xl ,x2 ,..., xn), n = 1, 2,..., 8, is 10 + lo2 + ... lo8. Because of this type of computational
+
68
4.
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
difficulty, methods toward the reduction of storage requirement are of major concern in designing a truly sequential recognition system with optimal stopping rule. This is the subject of discussion in the next section.
4.3
Reduction of Dimensionality
4.3.1 USEOF SUFFICIENT STATISTICS The first possible solution to reduce the dimensionality is the use of sufficient statistics in describing the recognition process under consideration. Let each feature measurement assume one of the r discrete values El , E, ,..., E,. (a quantization of feature space). Assume that the features of each pattern class are characterized by a multinomial distribution, i.e., for each m i , i = I, ..., m, there exists a probability function
(4.7)
where p , is the probability of occurrence of Ei for class m i , IT=, pij = 1, and ki is the number of occurrences of Ei, ki = n. Since the statistic (k, , k, ,...,k,.;n) is sufficient to characterize the multinomial distribution it is reasonable to assume that only the number of occurrences of Ei , ki ,j = 1 , 2,..., Y , not its order, is important in making a decision. Then the functional equation (4.1) becomes
x;=l
Pn(k1 > k, ,...,k,) TContinue:
=
Min
C(k, , k, ,...,k,)
4.3.
69
REDUCTION OF DIMENSIONALITY
where P(wJ is the a priori probability for class w i. Specifically, at Nth stage, P N ( 4 9 K, ,-**, k,) = ?in Wl k, ,.**, K, ; 4) (4.9) 9
at (N - 1)th stage,
PN-l(k1 K, ,..*,k,) Continue: C(K, ,K, 9
=
+
Min
,...,K,)
m
c t
i=l
P(wi)
Stop: Min R(K, ,K, a
,...)K j + 1,.*-,K,)
P i j P ~ ( h
j=1
(4.10)
,..., k, ;di)
at first stage, P l ( h 9 A,
=
?.*.9
Min
k,) Continue: C(Kl ,k, ,..., K,)
[
+
c PijfJpP(k1 K, + r
m
i=l
P(4
,-*-*
i=l
Stop: ?in R(kl , K,
,...,K,
1 9 - v
K,)
(4.11)
;d,)
The risk function p,(kl , k , ,..., k,) is then determined for each and every sequence of k , ,k , ,..., k, , where & ki = n, n = 1 , 2,..., N. I n addition, the optimal stopping rule is also determined at each stage. That is, if the risk of stopping is less than the expected risk of continuing for a given history of feature measurements, the sequential process is terminated. The actual optimal structure of the resulting procedure is obtained in the course of solving the functional equation (4.8). In resolving (4.8), it is also required to compute the minimum termination risk Min, R(kl ,k, ,..., k, ;di)at each stage. The Bayes decision rule is employed here to illustrate the computation procedure, although in practice other proper optimal decision rules may be chosen according to the statistical knowledge at hand. Let L(wi,dj) be the loss incurred by making the terminal decision dj when the input pattern is really from class w i . Then the risk function of deciding that the pattern belongs to class m i ,having observed the joint event [kl , k , ,..., k,], can be written as m
R(kl ,k, ,..., K, ;dj) = C P(wi)L(wi ,dj) P(Kl ,..., k, I wi) i=l
(4.12)
70
4.
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
The quantity Min, R(kl , k, ,..., k, ; di) is in fact the risk attained when the sequential process stops. It is worth noting that, similar to the dicussion in Section 1.4, in the case of (0, 1) loss function, i.e., L ( w i ,d j ) = 0 =1
if i
=j
if i # j
(4.13)
the decision procedure reduces to: decide dj if P(wj)P(k, ,k, ,...,k, I wi)
> P(w,) P(k, ,k, ,..., k, I w i ) for all i # j (3.14)
and the risk attained is R(kl , k, ,..., k, ; dj). The way in which the reduction of dimensionality can be achieved is due to the assumption of independent measurements implied by the ignorance of the ordering of the occurrence of Ej . This assumption allows the reduction of storage requirement from C:==,yn to C,"==, (,+;-l) by simply realizing the constraints that C;='=, kj = n and n N . Detailed results on this type of reduction are given in Appendix D.
<
4.3.2 ASSUMPTION OF MARKOVIAN DEPENDENCE In many pattern recognition problems, the assumption of independent measurements and the ignorance of measurement ordering can not be made without leading to intolerable rate of misrecognition. Frequently, a more feasible approximation to the true state of affairs is to consider the simple Markovian dependence among the feature measurements successively observed by the classifier (higher order Markovian dependence may also be assumed if it is relevant). The assumption of this type of dependence relation has the obvious advantage of providing a more sophisticated model for the physical process under consideration while still retaining a certain degree of mathematical simplicity. In the solution of recursive equation (4. I), a reduction in dimensionality can also be achieved by replacing the high-order conditional probability P(X,+~I x1 ,..., x), by a set of first order transition probabilities when the underlying Markov processes are properly defined. The procedure of this replacement is described as follows. Let the feature measurements xl, x, ,..., x, be considered as a discrete time homogeneous first-order Markov chain, with the state space being the Y quanta E l , E, ,..., E, . The sequence x1 , x, ,..., x,
4.3.
71
REDUCTION OF DIMENSIONALITY
known to be generated by one of the m possible pattern classes with transition probability matrices [P,(i,j)], u = 1, 2, ..., m, where PJi, j ) = Prob{x,
1 xn-,
= Ej
= Ei ; w,},
n = l , 2 ,..., N and u = l , 2 ,..., m
(4.15)
Let the risk function at the nth stage be defined by pn(kll , k,, ,..., k,, ; k,, ,..., k,, ;...; k,, ,..., k,.,.) as the expected risk after having observed x1 ,x2 ,..., x, in which kij transitions have been made from state Et to state Ei,i,j = 1,...,Y, and I
*
(4.16)
The continuing risk in (4.1) is then computed as
The functional equation governing the Markovian sequence of feature measurements becomes
[Continue: = Min
1
C(k,, ,..., k,,
;...; k,, ,..., k,,)
x ~ n + l ( & ,-*-, k,, ;**., kij Stop: Min R(kll ,..., k,, ..; k,, U
;.
+ I,*..; ,..., k,,
(4.18) kr,
,***,
kw)
; d,)
Equation (4.18) can be solved again by working backwards with the terminal condition as pN(k11
,..., k,, ;...; k,, ,...,k,,)
= Min R(k,,
,...,k,, ;...; k,, ,..., k,,
;d,) (4.19)
72
4.
SEQUENTIAL RECOGNITI,ON-BACKWARD
PROCEDURE
where
Notice that the price to be paid for considering the Markovian dependence is an increase of storage requirement from to when compared with the case of independent measurements assumed in Section 4.3.1. However, the required storage is still considerably reduced in comparison with the conventional method, where high-order conditional probabilities and risk functions for all possible sequences of measurement history in the original feature space would have to be determined.
x$=lr+rl)
4.4
Experiments in Pattern Classification
In order to test the formulation and procedure outlined in Sections 4.2 and 4.3, a particular pattern classification problem was taken. The pattern classes considered were the handprinted English characters D, J, P, and V (denoted as class wl, w 2 , w3 , and 0 4 ,respectively). Thirty-six samples of each character were processed to estimate the various required statistics about each class. Eight radial intersection
'No
10
6
Fig. 4.1.
D,
I , p, v.
5
II
8
7
9
Typical samples and their measurements for handprinted characters
4.4.
73
EXPERIMENTS I N PATTERN CLASSIFICATION
measurements (taking on values from 1 to 20), the same as those used in Section 2.3 for A and B, were used as successive feature measurements, as shown in Fig. 4.1. T o reduce the dimensionality in computation, the method described in Section 4.3.1 was used, assuming that the feature measurements were statistically independent. Furthermore, each feature measurement was quantized into five levels, El ,E, ,..., E5 ; and only events consisting of a feature measurement falling into a particular quantum were considered. Parameter values (pil ,p , ,...,pi5) characterizing the multinomial distribution of class w, were estimated by preparing a histogram and computing the frequency ratios for the occurrence of the quantized events. The approximate cumulative distribution function of feature measurements thus obtained is given in Table 4.1. F ( i , j ) represents the number of Table 4.1
DISTRIBUTIONS OF FEATURE MEASUREMENTS FOR ENGLISH CHARACTERS D, J, P, AND V
1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0 1 4 17 31 68 124 199 264 287 288 288 288 288 288 288 288 288 288 288
0 0 5 21 51 100 121 136 158 187 223 223 252 272 280 284 286 288 288 288
0 1
0 0
12
3 13 28 45 81 143 197 226 235 235 237 249 264 270 276 282 285 288
40 80 128 1ti8 242 269 277 277 281 283 286 288 288 288 288 288 288
74
4.
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
measurements from the class w iwhich falls below the integer valuej; = 1 , 2,..., 20. For example, if quantum 3 was chosen to be (6, lo), then, for class w 2 ,
i = 1 , 2 , 3 , 4 and j
p23
= [F(2, 10)
-F(2,6)]/36 x 8
(7 Start
stop and decide class
+
1 + Read in learning samples of ith class i=l,2,3,4
t
Quantize and calculate probabilities P . . , j = 1 , 2 ,..., r
Quantify the nth measurement ofthesample
Read in a priori probabilities measurement costs and loss matrix
Set n=n+ 1
-
Read in character sample
I
I
Generate all possible sequences of (k,I kI .-. k,)
function at the nth stage
9
I
Calculate stopping risks G, and obtain MinG1=A I
No
continuing greater
.
Fig. 4.2.
-
Calculate risk of continuing
=B
A simplified flow diagram.
4.4.
75
EXPERIMENTS I N PATTERN CLASSIFICATION
A simplified flow diagram which shows both the programming for the calculation and storage of the various quantities required in the classification procedure, and also the programming for the actual classification procedure itself is given in Fig. 4.2. The results produced by this program under various experimental conditions are presented below. In addition to the dynamic programming procedure, experiments were also run to determine the accuracies attainable using all the eight features identically quantized (nonsequential or fixed-sample size Bayes classification procedure). The results then allowed a fair comparison of the expected number of feature measurements required and the correct classifications achieved for the sequential and nonsequential procedures. Experiment 1
Experimental conditions: (i) All measurement costs are equal =0.03/measurement. (ii) Loss function qwi,
.
dj) = 0,
a =j
.
= A (constant),
i #j P(wl) = P(o,) = P(w3) = P(w4) = 0.25. (iii) (iv) Quantum partitions: El = (0, 51, E, = (5, 71, E , E4 = (8, 91, E5 = (9, 201.
=
(7, 81,
ClassiJcation results: (i) Dynamic programming procedure No. of patterns classified as
D D
J
P V
2
J
P
recognition
Total no. of required measurements
70 98 36 72
148 132 166 127
yo of correct
True class
V
5 5 2 4 1 3 5 0 0 16 6 13 1 1 9 0 2 6
573
Overall accuracy: 68 %
4.
76
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
(ii) Nonsequential procedure No. of patterns classified as True class
D D
J
2
P V
J
P
4 3 1 2 3 8
7
5
7
V
4
1
correct recognition
Total no. of required measurements
288 288
yo of
6
5 6
67 64
7
4
47
288
3 2 1
59
288
Overall accuracy: 59 Yo
Experiment 2
Experimental conditions:
(i), (ii), (iii) same as Experiment 1. (iv) Quantum partitions: E, = (0, 61, E , E, = (8, 111, E, = (11, 201.
=
(6, 71, E ,
(7, 81,
=
Classification results:
(i) Dynamic programming procedure ~
~
~
No. of patterns classified as
yo of
True class
D D
2
J
0
P
13 0
V
J
6 2
Overall accuracy: 67 yo
P
2
Total no. of required measurements
0
72
7
6
3
75
240 137
3
19
1
53
226
2 2 5
70
129 732
9
8
V
correct recognition
4.4.
77
EXPERIMENTS I N PATTERN CLASSIFICATION
(ii) Nonsequential procedure No. of patterns classified as True class
D
J
P
V
2
7 2 7 0 0 2 4 9 3 12 3 20 1 0 5 5 2 6
D
J P V
yo of correct recognition
Total no. of required measurements
75 67 55 72
288 288 288 288
1152
Overall accuracy: 67 yo
Experiment 3
Experimental conditions:
(i), (iii), (iv), same as Experiment 2. (ii) Loss function-the loss due to misrecognition when a pattern is true from class w3 ; i.e., character P, equal four times that of other misrecognitions.
Dynamic programming procedure
ClassiJication results:
No. of patterns classified as True class
D D
2
J
0 0 0
P V
J
% of correct recognition
P
V
0 3 4
0
5
15 0 6 0 9 2 1
100
21 0 6
3
58
58
Total no. of required measurements
87 114 83 143
427
Overall accuracy: 55 yo
4.
78
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
Experiment 4
Experimental conditions: (i) The cost of measurements varies linearly with the measurement number from 0.01 to 0.08. (ii), (iii), (iv) same as Experiment 2.
ClassiJication results:
Dynamic programming procedure
No. of patterns classified as
True class
D D
1 P V
J
22 3 0 2 8 15 5 0 12
P
V
11
0
5 15 2
3 1 22
yo of correct recognition
Total no. of required measurements
61 78 42 61
172 117 185 119
593
Overall accuracy: 61 yo
The results of the four experiments are summarized and discussed as follows: (1) Experiments 1 and 2 show a comparison of classification results between the dynamic programming procedure and a nonsequential classification procedure for two different quantizations of feature space. As expected, in each experiment the required number of measurements using the dynamic programming procedure is much less than that using the nonsequential procedure, without sacrificing significantly the percentage of correct classification. (2) Experiment 3 shows the effect of employing an unsymmetric loss function. By weighting the misrecognitions most heavily on class w3 , it is possible to achieve 100yo correct classification for that class. While it is true that this result is accomplished at the expense of causing greater errors in classifying patterns from other classes, the change of loss function does make the classification procedure more flexible and useful in distinguishing various errors.
4.5.
BACKWARD PROCEDURE
79
(3) Experiment 4 shows the results of varying the cost of taking measurements (a linear cost function was used). It is noted that the greater expense of the later measurements causes the classification procedure to terminate much sooner, but at the expense of the classification accuracy. 4.5
Backward Procedure for Both Feature Ordering and Pattern Classification
In previous sections of this chapter, the dynamic programming procedure has been applied to the pattern classification problem without considering the ordering of feature measurements. However, as mentioned in Section 2.1, in order to terminate the sequential recognition process earlier the ordering of feature measurements is often rather important. In this section, a more general sequential pattern classifier is considered. The classifier so designed has the additional capability of selecting the best feature for next measurement. In other words, in the process of sequential decisions, if the decision is to continue taking an additional measurement it also, in the meantime, selects the best feature for the next measurement. Let FN = (fl ,...,fN) be the set of N features extracted by the feature extractor in their natural order. Let Ftn = ( f t , ,...,f in ) , n = 1,..., N, be a particular sequence of n features measured by the classifier at the nth stage of the sequential recognition process. The remaining features available for further measurements at the nth stage will be F, = FN - Ftn. Note that the feature f l i may be any one of the elements in FN and the (noisy) feature measurement corresponding to f i , is represented by a random variable xi as in the previous formulation. Similar to those in Section 4.2, the following terms are defined: pn(x1 ,..., x, I Ftn) is the minimum expected risk of the entire sequential recognition process, having observed the sequence of feature measurements x1 ,..., x, when the particular sequence of features Ftnis selected. C(xl ,..., x, I Fin) is the cost of continuing the sequential recognition process at the nth stage when Fin is selected. R(x, ,..., x, ; di I Fla) is the expected risk of making terminal decision di ,i = 1,..., m, on the basis of the feature measurements x1 ,..., x, when Flnis selected.
80
4.
PROCEDURE
SEQUENTIAL RECOGNITION-BACKWARD
I x1 ,...,x, ;F,-) is the conditional probability disP(x,+, tribution of x , + ~ when fin+, is selected, given the sequence of measurements x1 ,..., x , on the sequence of features F t n . When the classifier decides to stop the process and make a terminal decision at the nth stage, the expected risk is simply Mini R(x, ,..., x , ; di I F,,).If the classifier decides to take an additional measurement, then the measurement must be optimally selected from the remaining features F, in order to minimize the risk. That is, the expected risk of measuring the (n 1)th feature is Min 1C(x19*.*9
f tn+l'Fn
x,
I Ft-1
+J
+
Pn+1(X1
x dP(X,+l
Y*..>
x,
,%+, I Ft, ,ft,+,)
;ft,+,I x1
*.**Y
3,
;&,)I
Therefore, by the principle of optimality the basic functional equation governing the sequential recognition process becomes Pn(X1 > * * * 9
xn I Ft,)
Continue:
Min
it,+,~Fn
= Min
x W % + l ;ft,+,I x1 ) * * * , x, Stop: Min R(xl ,..., x, ;di I Ftn)
&,)I
1
(4.20)
Again, (4.20) can be recursively solved by setting the terminal condition to be PN(%
)**'?
xN
1 F t ~ >= Min R(xl
)*.*)
xN
; di
IF t ~ >
(4.21)
and computing backwards for risk functions R , , n < N. The major difference between the solution of (4.20) and that of (4.1) lies in the fact that the optimal stopping rules obtained from the present solution are automatically accompanied by a best sequence of features capable of minimizing the expected risk upon termination. 4.6
Experiments in Feature Ordering and Pattern Classification
T o test the formulation and the optimality of the procedure outlined in Section 4.5, the English character recognition problem described
4.6.
EXPERIMENTS IN FEATURE ORDERING
81
in Section 4.4 was again used. Only three pattern classes D, J, and P were considered, each represented by thirty-six samples which were processed both to obtain the probability distribution used and to test the technique. Same as the example in Section 4.4, eight radial intersection measurements quantized into twenty quanta were used as features, and a histogram procedure employed to estimate the probability that a given feature falls in a given quantum, conditioned on the fact that a particular character was measured. All feature measurements were assumed statistically independent. In order to reduce the dimensionality, the Bayesian statistic (a posteriori probability) was used. At each stage of the process, the conditional probability that the sample was from each class was calculated, given the past history of feature measurements. That is, after x1 was measured,
wheref,, is the feature selected to measure by the classifier and x1 is the outcome of the measurement. These quantities are then used as the a priori probabilities for the next stage (the second stage in this case) of the process. The procedure can be formulated recursively as, at the nth stage,
Thus it can be seen that by using this procedure, all information provided by the past history of feature selection and measurement outcomes is contained in the a posteriori probabilities calculated by (4.23). The classifying decision at the final stage depends on the a posteriori probability of occurrence of each class having measured all eight features in addition to the loss due to misrecognitions. For computational purposes each a posteriori probability was quantized into twenty equal devisions. Thus the probability space was quantized into a total of 210 quanta as shown in Fig. 4.3. The loss due to misrecognition was assumed equal to one in all cases, i.e.,
L(Wi,di)= 0, =1,
i=j i#j
0.5
I .o
P(D)
I
P(p)
o,s
(b)
1.0
0.03 0.075 0.075 0.125 0.125 0.125 0.175 0.175 0.175 0.175 0.225 0.225 0.225 0.225 0.225 0.275 0.275 0.275 0.275 0.275 0.275 0.325 0.325 0.325 0.325 0.325 0.325 0.325 0.375 0.375 0.375 0.375 0.375 0.375 0.375 0.375 0.425 0.425 0.425 0.425 0.425 0.425 0.425 0.425 0.425 0.475 0.475 0.475 0.475 0.475 0.475 0.475 0.475 0.475 0.475 0.500 0.525 0.525 0.525 0.525 0.525 0.525 0.525 0.525 0.525 0.475 0.450 0.500 0.550 0.575 0.575 0.575 0.575 0.575 0.575 0.525 0.475 0.425 0.400 0.450 0.500 0.550 0.600 0.625 0.625 0.625 0.575 0.525 0.475 0.425 0.375 0.350 0.400 0.450 0.500 0.550 0.600 0.650 0.625 0.575 0.525 0.475 0.425 0.375 0.325 0.300 0.350 0.400 0.450 0.500 0.550 0.600 0.625 0.575 0.525 0.475 0.425 0.375 0.325 0.275 0.250 0.300 0.350 0.400 0.450 0.500 0.550 0.600 0.575 0.525 0.475 0.425 0.375 0.325 0.275 0.225 0.200 0.250 0.300 0.350 0.400 0.450 0.500 0.550 0.575 0.525 0.475 0.425 0.375 0.325 0.275 0.225 0.175 0.150 0.200 0.250 0.300 0.3500.400 0.450 0.500 0.550 0.525 0.475 0.425 0.375 0.325 0.275 0.225 0.175 0.125 0.100 0.150 0.200 0.250 0.300 0.350 0.400 0.450 0.500 0.525 0.475 0.425 0.375 0.325 0.275 0.225 0.175 0.125 0.075 0.050 0.100 0.150 0.200 0.250 0.300 0.350 0.400 0.450 0.500 0.475 0.425 0.375 0.325 0.275 0.225 0.175 0.125 0.075 0.025
+
Fig. 4.3. (a) The classification decision boundary-the letter indicates the decision to be made. (b) Expected of making a classifying decision.
cost
4.6.
EXPERIMENTS IN FEATURE ORDERING
83
The cost of feature measurement is O.Ol/measurement. In all experiments the a priori probability of each class was taken to be equal to one-third. The expected risk of making a decision for various a posteriori probabilities is printed in the corresponding quantum. The decison boundary diagram shown in Fig. 4.3(a) is interpreted as being that if the a posteriori probabilities fall in a quantum labeled a D, J, or P,the input pattern is classified as a D, J, or P,respectively. The same quantizations were used at every stage of the process, including the calculation of decision boundaries for the selection of features. Detailed illustration of these computations is given in Appendix E. Three experiments were performed. The purpose of these experiments was to allow a verification of the optimal properties of the proposed procedure and a comparison of results obtained from using the proposed procedure and other statistical classification procedures. Experiment 1: Sequential classification with feature ordering. Experiment 2: Nonsequential Bayes classification using all eight features. Experiment 3: Sequential classification without feature ordering. Table 4.2 summarizes the results concerning the accuracy of recognition and the number of feature measurements required for classification. Table 4.3 indicates the costs of the various classification procedures. The results of the three experiments are summarized and discussed as follows. (1) It is seen that the same percentage of correct recognition is obtained for all three classification procedures. In fact, it turned out that the misrecognitions were made on exactly the same patterns. (2) From Table 4.2, it should be noted that even though the sequential classification procedure without feature ordering required fewer feature measurements to classify patterns from class J than the sequential procedure with feature ordering, it did require more measurements for the entire process. It appears that the sequential procedure with feature ordering may cause poorer performance in some particular cases, but on the average over the entire process it produces better results. This is expected since the optimization was carried out over the entire process.
4.
84
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
Table 4.2
ACCURACYOF CLASSIFICATION No. of Patterns classified as True class
D
J
P
yo of correct recognition
Total no. of required measurements
(i) Experiment 1
D
J P
33 0 7
0
3
36 0
0 29
91.6 100 80.6
147 82 135 364
Overall accuracy: 90.7 yo
(ii) Experiment 2
D
33
0
3
J
0 7
36 0
0 29
P
91.6 100 80.6
288 288 288 864
Overall accuracy: 90.7 yo
(iii) Experiment 3
D
33
J
0 7
P
Overall accuracy: 90.7 yo
0 36 0
3 0 29
91.6 100 80.6
187 61 189 436
4.6.
EXPERIMENTS IN FEATURE ORDERING
85
Table 4.3
COSTSOF CLASSIFICATION PROCJSSES (i) Classificationof 36 Samples of D (Class
No. of required measurements Cost of measurements Expected risks of 36 classifying decisions Combined total cost
(ii)
wl)
Exp. 1
Exp. 2
Exp. 3
147 1.47 1.67 3.14
288 2.88 1.87 4.75
187 1.87 1.43 3.30
Exp. 1
Exp. 2
Exp. 3
82 0.82 0.90 1.72
288 2.88 0.90 3.78
61 0.61 0.90 1.51
Classification of 36 Samples of .J (Class w e )
No. of required measurements Cost of measurements Expected risks of 36 clasifying decisions Combined total cost
(iii) Classificationof 36 Samples of P (Class
No. of required measurements Cost of measurements Expected risks of 36 classifying decisions Combined total cost
w3)
Exp. 1
Exp. 2
Exp. 3
135 1.35 3.075 4.425
288 2.88 3.025 5.905
189 1.89 3.275 5.165
(iv) Cumulative Results of Classifying All 108 Samples
No. of required measurements Total cost of measurements Total expected risks of 108 decisions Combined total cost
Exp. 1
Exp. 2
Exp. 4
364 3.64 5.645 9.285
864 8.64 5.795 14.435
437 4.37 5.60 9.97
86
4.
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
(3) The sequential procedure with feature ordering required about 60 % less feature measurements than the nonsequential Bayes pro-
cedure while the sequential procedure without feature ordering required about 50% less. (4) From Table 4.3, it is seen that the overall total cost of the recognition process for the sequential procedure with feature ordering is minimal. The row labeled total expected risk of classifications was obtained by summing the expected risks of recognition and is an indication of the confidence with which classifying decisions are made. The sequential procedure with feature ordering costs about 64% as much as the nonsequential procedure, while the sequential procedure without feature ordering costs about 68% as much. 4.7
Use of Dynamic Programming for Feature-Subset Selection
The proposed dynamic programming procedure for feature ordering and pattern classification can be modified to allow the selection of an optimum subset of features from a given set. Two particular cases are discussed in this section. If an abruptly truncated sequential decision procedure is to be used for pattern classification, it would be important to select the best subset with size equal to the truncated length from a given set of features. The dynamic programming procedure also provides the answer to this type of feature selection problem. Consider, for example, that it is desired to recognize the characters D, J, P using a forward sequential decision procedure with no more than five (independent but not identically distributed) features. The problem is to select a best subset of size five from the eight given features. Assume that the a priori probabilities for each class are given. The feature-subset selection problem can be solved by searching from the memory for the minimum expected risk decision boundaries among all boundaries for which five features remain. I n the example given in Section 4.6, if the a priori probabilities are assumed P(ul)= P(w& = 0.25 and p(u3) = 0 5 then the subsets (f8 ,f6 , f 3 ,f2 ,fl), (f7 ,fS ,f3 , f 2 P f l ) , and (f6 ,f5 ,f 3 ,f2 ,fl) all yield the same minimum expected risk for the process. Any one of the three ordered features subsets is an optimal subset with five features. If a nonsequential Bayes or maximum-likelihood decision procedure is to be used for the pattern classification, the dynamic programming procedure can also be applied to determine the best feature subset
4.7.
DYNAMIC PROGRAMMING FOR FEATURE-SUBSET SELECTION
87
from a given set of features. The only difference from the case just treated above is that, in this case, the cost of taking measurements becomes zero. A computer simulation was performed using the same example in Section 4.6. The a priori probabilities of the three classes are assumed equal with one-third each. The loss due to misrecognition equals to one in all cases. Using the dynamic programming procedure, the expected risk of every subset of the eight features was calculated. The classification of all 108 pattern samples was performed using each subset. In all, a total of C&l (!) = 256 classification studies were made. The results are summarized in Figs. 4.4, 4.5, and 4.6.
I
Fig. 4.4.
Expected Coal of Decision
Experimental relationship between percent error and expected cost.
Figure 4.4 shows the relationship between the expected risk of decision and the percentage of misrecognitions. Because the loss due to misrecognition is equal to one in all situation, linear relationship is expected between the percentage of misrecognition and the expected risk. Both the theoretical relationship and the actual regression line obtained from experimental results are also shown in the figure. Figure 4.5 indicates the bounds within which the results for the feature subsets with various sizes fall. The results show that the expected risk is a good indicator for the classification accuracy. The
88
4.
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
j-Indicote6 the No. of Features Used For Recognition
Fig. 4.5. Minimums and maximums; the number of errors versus the expected cost of decision.
variation in the expected percentage of misrecognitions for feature subsets with various sizes is demonstrated in Fig. 4.6. The numbers associated with the plotted points indicate the best and the worst feature subsets, respectively. 4.8 Suboptimal Sequential Pattern Recognition
In Section 4.5, a backward procedure has been developed for constructing the optimal solution for feature ordering and pattern classification. In general, the knowledge of the joint probability density functions of all the features and the a priori probabilities for each class are required in the computation of (4.20). From the computational point of view the procedure is often difficult to implement without large-scale computation facilities. If certain assumptions (e.g., independence of feature measurements) can be made in the
4.8.
89
SUBOPTIMAL SEQUENTIAL PATTERN RECOGNITION
'7- 40 -
- .40
30 -
- .30
Erron
20
-
- .20
-
678 -.I0
10
I
I
I
I
2
I
3
I
4
I
5
124567 124578 6
1245678 7
8
No. d Footurn
Fig. 4.6. Minimum and maximum; the number of errors versus the number of features measured.
practical recognition problems, the optimal procedure can be implemented as that described in Section 4.6. However, it will still be desirable to develop an approximation to the optimal procedure so the computations involved can be much simplified. I n this section, an approximation scheme which leads to a suboptimal solution is discussed, and comparisons are made with the optimal procedure to show the trade-off between the optimality and the computational difficulties. The approximation made which leads to a suboptimal solution is that, at each stage, the classifier considers the next stage to be terminal, that is, a classification decision must be made at the next stage (onestage ahead truncation). The following three different cases are chosen to illustrate the effectiveness of the suboptimal solution:
Case 1: optimal solution when the feature measurements for each class are independent; Case 2: suboptimal (one-stage ahead truncation) solution when the feature measurements for each class are independent;
90
4.
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
Case 3: suboptimal solution when the feature measurements for each class have a first order Markov dependence.
A comparison of cases 1 and 2 displays the effects of the truncation approximation. On the other hand, a comparison of cases 1 and 3 allows a determination as to the relative advantage of increasing the computational complexity either through (i) increasing the knowledge of the statistical dependence of the feature measurements while truncating the backward procedure, or (ii) simplifying the probability assumptions and carrying out the entire backward programming computation. Case 1
In this case
Equation (4.23) can be used to compute the a posteriori probabilities of each class at each stage. Let P, be the set of a posteriori probabilities of the occurrence of each class computed by (4.23), i.e., P,
= {P(w, I x,
,...,x,
;Ftn); i = 1,..., m}
Then the basic recursive equation (4.20) using statistic reduces to
L Case 2
(4.25)
P, as the sufficient
Stop: Min{R(d, I P,)}
Equation (4.26), in this case, becomes Continue: Min lC(xl ,...,x, I Fin) f
%+I
It is noted that, in (4.27), the averaging process is always over the terminal stage costs, and is easily performed as the sequential recognition process proceeds. In this way the requirement for the storage
4.8.
SUBOPTIMAL SEQUENTIAL PATTERN RECOGNITION
91
(i.e., the storage of the cost surfaces at each stage) is greatly reduced, and the resulting computations simplified. Case 3
For this case, P@n+l;ftn+,
I Xl
9'**9
*n
;Ftn)= P(*n+1
I *n ;hn)
(4.28)
and the a posteriori probabilities are calculated by P(Wi I *1
,... Xn ;Ft,) 9
The sufficient statistic is then (P, ;x, ;ft,)and (4.20) becomes A P n ;*n ;ft,)
Continue: Min lC(xl ,..., xn [F,,) in+1
= Min
(4.30)
L
Stop: Min{R(d, I Pn)} 6
It can be seen that for cases 1 and 2, all information provided by the past history of feature selection and measurement outcomes up to and including the nth stage is contained in the a posteriori probabilities calculated. All that need be done is to keep tracking what features have been measured and the a posteriori probabilities calculated. Thus, the actual values of measurement outcomes and the order of features measured can be dropped from consideration, thereby allowing the reduction in computations and storage. In case 3, additional storage is required in order to save the last feature measurement. More serious memory requirements are necessitated by the storage of the transition matrices, and by the added dimension of dependence of the cost function at each stage. Of course, the main computational advantage remains the fact that the expectation (averaging) in (4.30) is taken only over the terminal stage and can be easily computed at each stage of the process. T o test the formulations in Cases 1, 2, and 3, the recognition of handprinted English characters D,J , P was again used as an example. The same training samples were used as that in Section 4.6 to establish
92
4.
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
Table 4.4
RFSULTS OF OPTIMALAND SUBOPTIMAL SOLUTIONS No. of patterns classified as True class
J
D
P
Expected loss Total no. of Cost of of of correct measurements feature classification recognition required measurements decisions
%
Case 1
D J P
3
0 7
3
36
0
0
3
0
29
91.6
147 82 135
100
80.6
1.47 0.82 1.35
1.675 0.900 3.075
-
-
-
-
90.7
364
3.64
5.650
1.53 0.78 1.14
1.675 0.900 3.050
Total expected loss of the entire process: 9.29
Case 2
D J P
3
3
0 36
0
7
3
91.6
0
100 83.4
29
0
153 78 I14
-
-
-
-
91.7
345
3.45
5.625
0.85 0.66 0.88
0.925 0.900 2.075 3.900
Total expected loss of the entire process: 9.07t
Case 3
D J P
3
5 0 1
0 36 0
1
0
35
97.5
85 66 88
100
97.5
-
-
-
98.2
239
2.39
-
Total expected loss of the entire process: 6.29 t
The peculiar result with a lower expected loss than the optimum was due to the
4.9.
SUMMARY AND FURTHER REMARKS
93
the probability density functions and the transition probabilities required. Table 4.4 summarizes the results obtained with the cost of measuring any feature at any stage equal to 0.01, and the loss of making any classification error equal to 1.0. It appears from the results in Table 4.4 that, for this example, by using Markovian statistics and the one-stage ahead truncation approximation, we would be able to take the correlation between feature measurements into account and still retain the capability to implement a sequential recognition process which approaches the optimum. Of course, if the feature measurements are truly independent then the Markov assumption would result in no improvement over an independence assumption. 4.9
Summary and Further Remark
I n this chapter, the dynamic programming approach has been proved useful in designing a finite sequential classifier whose optimal structure is considered as a multistage decision process. It is shown that the actual decision structure of the sequential classifier, which includes both the choice of continuing and the choice of stopping the sequence of measurements, is obtained by recursively optimizing the risk functions in a backward manner. The backward procedure guarantees the termination of classification processes within a prespecified number of feature measurements (finiteness) and, in the meantime, also preserves the optimality of minimizing the average risk. Methods of reducing the computational difficulty and storage requirement have been suggested to make the multistage decision process suitable for numerical solution. It is true while the assumptions made on (i) independent measurements and (ii) Markov-dependent measurements are only approximations of the true state of affairs, nevertheless they provide a ready solution to the optimal design of many recognition problems. When it is desirable for the recognition system to perform “on-line” selection of feature measurements for successive observations, the approach of using dynamic programming presents a possibility of designing a recognition system for ~~
fact that one incorrectly classified pattern accounted for an expected loss of classification of 0.475 using the optimal procedure and only 0.25 using the approximation. Neglecting this single pattern it is seen that the results become more reasonable.
94
4.
SEQUENTIAL RECOGNITION-BACKWARD
PROCEDURE
both feature selection and pattern classification. Computer-simulated experiments on character recognition, including comparisons between sequential and nonsequential classifiers, have illustrated the validity and feasibility of the dynamic programming approach. There has not been much quantitative comparison of performance between the forward and the backward sequential classification procedures other than the degree of optimality and computational difficulty. This lack of comparison makes it difficult to determine exactly which procedure is more appropriate for a particular problem on hand. Although a suboptimal backward procedure-a one-stage ahead truncation procedure- has been suggested as a compromise, the degradation of performance in general cannot be quantitatively determined beforehand. References 1. D. Blackwell and M. A. Girshick, “Theory of Games and Statistical Decisions.” 1 Wiley, New York, 1954. 2. R. Bellman, “Dynamic Programming.” Princeton Univ. Press, Princeton, New Jersey, 1957. 3. R. Bellman, R. Kalaba, and D. Middleton, Dynamic programming, sequential estimation and sequential detection processes. Proc. Nut. Acad. Sci.47, 338-341 (1961). 4. D. V. Lindley, Dynamic programming and decision theory, Appl. Statist. (lo), 39-51 (1961). 5. E. B. Dynkin, The optimum choice of the instant for stopping a Markov process. Sooiet Math. Dokl. 4, No. 3, 627-629 (1963). 6. P. C. Fishburn, A general theory of finite sequential decision processes. Tech. Paper RAC-TP-143. Res. Anal. Corp., McLean, Virginia, February 1965. 7. R. A. Howard, “Dynamic Programming and Markov Processes” Wiley, New York, 1960. 8. G. B. Wetherill, “Sequential Methods in Statistics:” Methuen, London and Wiley, New York, 1966. 9. K. S. Fu, Y. T. Chien, and G. P. Cardillo, A dynamic programming approach to sequential pattern recognition. IEEE Trans. Electron. Computers 16,790-803 (1967). 10. K. S. Fu and G. P. Cardillo, An optimum finite sequential procedure for feature selection and pattern classification. IEEE Trans. Auto. Control 12, 588-591 (1967). 11. Y. T. Chien and K. S. Fu, An optimal pattern classification system using dynamic programming. Intern. J. Math. Biosciences 1, No. 3, 439-461 (1967). 12. G. P. Cardillo and K. S. Fu, A dynamic programming procedure for sequential pattern classification and feature selection. Intern. J. Math. Biosciences. 1, No. 3, 463-491 (1967). 13. B. R. Bhat, Bayes solution of sequential decision problem for Markov dependent observations. Ann. Math. Statist. 35, 1656-1662 (1964).
REFERENCES
95
14. R. Bellman and R. Kalaba, On the role of dynamic programming in statistical communication theory. IRE Trans. Inform. Theory 3, No. 3, 197-203 (1957). 15. H. H. Goode, Deferred decision theory. In “Recent Developments in Information and Decision Processes” (R. E. Macho1 and P. Gray, eds.). MacMillan, New York, 1962.
CHAPTER 5
NONPARAMETRIC PROCEDURE IN SEQ UE NTIAL PATTERN CLASS1FICAT10N
5.1
Introduction
The design of a sequential pattern classification system for classifying patterns in a random environment (noise, distortion, etc.) has been primarily concerned with the case where the following assumptions are made: (i) a sufficient number of feature measurements is always available and thus the classification process can be prolonged forever if needed;
(ii) the statistical knowledge about the patterns in each class is either completely known a priori or can be estimated by the classification system through some learning processes. The first difficulty, which arises from the prolonged experimentation, can be avoided by either modifying the standard Wald’s sequential probability ratio test so that the classification process will terminate at a prespecified finite number of featuye measurements as described in Chapter 3, or simply employing the dynamic programming procedure which determines the optimal stopping boundaries by computing backwards from the last feature measurement up to the first as discussed in Chapter 4. An equally important but perhaps less explored case of design is the one which will relax the constraint in (ii) so that no assumption or actual knowledge is needed on the form of the underlying probability distributions associated with each pattern class [1]-[8]. The purpose of this chapter, therefore, is to introduce a nonparametric approach to the design of a sequential pattern classification system using Wald’s SPRT. It is noted that in order to carry out the computation in Wald’s SPKT, an assumption or actual knowledge is needed on the specific 96
5.2.
SEQUENTIAL RANKING PROCEDURE
97
forms of the probability density functions, pn(X/wl)and pn(X/w2). This is essentially what has been done in the experiments presented in previous chapters where the feature vectors are assumed to be samples from known probability distributions describable within a set of parameters (for example, mean vectors and covariance matrices in gaussian distributions), known to or estimated by the classification system. It may frequently happen that this knowledge is not available or any simplified assumption cannot be justified due to the lack of a priori information about the random patterns or due to the changing statistics of the operating environment. In either case, nonparametric methods would have to be pursued so as to obtain a more realistic mathematical model in approximating the physical situation. In statistical decision theory, many nonparametric schemes are based on the set of ranks determined by sample measurements. In the following sections, a sequential ranking procedure [9] is employed and the resulting performance analyzed in the design of a binary classifier so that the nonparametric setting of Wald’s SPRT can be naturally applied. A generalized procedure capable of classifying patterns from more than two classes is also discussed. 5.2
Sequential Ranks and Sequential Ranking Procedure
It was remarked in the previous section that in order to apply Wald’s SPRT in the nonparametric setting, we would have to replace the feature measurement vector X = [xl, x2 ,...,xnIT by a vector of ranks T = [Tl, T2,..., Tn]. The rank Tifor xi is 2, Z = 1, 2,..., n, if and only if xi is the Zth smallest measurement with respect to the set of measurements xl,x2 ,..., x, . Because of the sequential nature of taking the feature measurements in SPRT, we are naturally led to the idea of sequentially ranking the measurements every time a new measurement is taken without having to rerank all the preceding measurements in the entire feature vector. T o see exactly how such a procedure may be derived, it is beneficial to look into the ordinary (nonsequential) reranking process which is described as follows. Suppose that the feature measurements x1 Ix2 ,..., x, are taken successively, and each time a new measurement is taken the entire set of measurements is reranked. Let Tii be the rank of xj with respect to the entire set of measurements (xl, x2 ,..., xi) at the ith stage of
98
5.
NONPARAMETRIC PROCEDURE
process, where i = 1 , 2,..., n and j = 1 , 2,..., i. Then the following two groups of vectors will describe the ordinary reranking process: Successive measurement
Ordinary rank
set
vector
It should be pointed out that the vector [T,,, T,,,..., Tnn]alone completely determines the reranking process, in the sense that each ordinary rank vector listed above can be reconstructed given only the ranks T i $ ,i = 1, 2, ..., n, where Tit is the rank of xi relative to the measurement set (xl, x2 ,..., xi). I n fact, it is easily seen that a feature measurement can be ranked as it is measured, relative to the entire proceding measurements, without reranking the previous measurements, and still retain in the information which would come from reranking all the preceding measurements. This method of ranking the measurements is one which fits in naturally with the idea of sequential decision procedure, when the measurements are taken successively in accordance with a specified stopping rule. T o formally present this idea which leads to the development of a nonparametric sequential classification procedure, the following definition and lemma are first given: Definition The “sequential rank” of x, relative to the set of measurements (x,, x2 ,...,x,) is S, if x, is the (S,)th smallest in (XI , x2 ,...,x n).
Thus the sequential rank of x1 is always 1, the sequential rank of 1 or 2 depending on x2 < x1 or x1 < x2 ,and the sequential rank of x3 is 1, 2,or 3, according to whether x3 is the smallest, the next largest, or the largest in the set of measurements (x,, x 2 , x3), etc. In the sequel, the sequential rank vector for the feature measurement vector X = [x, , x,,..., x,]~ will be denoted by S(n) = [Sl s,,**., &I. x2 is either
?
5.2.
SEQUENTIAL RANKING PROCEDURE
99
Lemma There is a one-to-one correspondence between the set of n! possible orderings xi, < xi, < *.- < xi, and the n! possible sequential rank vectors [S, , S, ,..., S,] for the feature measurement vector = [x, , x , ,..., X,]T.
x
Proof [ 9 ] , [lo] Consider the vector [x, , x, ,..., x,ITwhere the xi aren distincts real numbers and the set {[xi, , xi? ,..., xJT} consisting of the n! vectors and possible orderings obtained by permuting the coordinates of [ x , , x, ,...,xJT. Now define the mapping 9 from the set {[xi, , xi2 ,..., x i J T } into the set {[r, ,r2 ,..., rJT: rl = 1; r2 = 1, 2;...; r, = 1, 2,..., n} by setting the jth coordinate of ?(xi, , xiz ,..., xi,) equal to the rank of x. in the set xil , xi, ,..., xi, , that is, the jth coordinate is r if xi, is the rth smallest among xi, , xi2 ,..., xi$ . The mapping is one-to-one and onto.
The significance of this lemma, which will become clear later, may be summarized as follows: If we consider each < xi,, of a feature measurement vector ordering, say xi, < xie < X = [x, ,x, ,..., x,]*, and use the definition given above to obtain the associated sequential ranks S , , S, ,..., S , , the sequential rank vector will be uniquely determined. Conversely, the sequential rank vector uniquely determines the original ordering. Since a particular ordering xil < xi, < ... < xi, also uniquely determines the ordinary rank vector [T,,, T,, ,..., T,,], there exists a one-to-one mapping between the set of sequential rank vectors and the set of ordinary rank vectors for all possible orderings. In order to provide a smooth transition of the Wald's SPRT to its nonparametric setting, it is necessary to find the probability distribution for the sequential rank vectors. There are two significant findings in nonparametric statistics which can be used to obtain, respectively, the exact calculation of sequential rank distribution and a practical application in nonparametric testing problems. One is due to the fact that there exists a one-to-one correspondence between the ordered measurements (hence the ordinary rank vectors) and the sequential rank vectors. It follows that the distribution of the sequential ranks is completely determined since the distribution for the ordinary ranks can be easily calculated [l 11. A second useful finding is the basic assumption of Lehmann alternatives frequently made in nonparametric tests. It will be shown later in this chapter that this assumption, although necessary, is not quite as restrictive as it appears to be when used in the nonparamatric design of sequential classification systems.
---
100
5.
NONPARAMETRIC PROCEDURE
Consider first the distribution of the sequential rank vectors. Using the fact that there is a unique relation between the ordered feature measurements and the sequential rank vectors, the distribution of the sequential rank vectors is also completely specified by
where Pi5(xij)indicates the distribution function of xi5 and the xij's are assumed to be independent in this calculation. For the special case when the distribution functions Pi(xJ are taken to be Lehmann alternatives [ 121, then
Using (5.1) and substituting (5.3), we obtain P(x,
< x2 < < x,) **-
5.3.
SEQUENTIAL TWO-SAMPLE TEST PROBLEM
101
By relabeling the xi's, the probability of any order of the xi's can be found using (5.5), giving all the values needed in (5.1) to specify the distribution of the sequantial rank vectors. 5.3 A Sequential Two-Sample Test Problem
As a basic model for the nonparametric design of a sequential classification system, a sequential two-sample test problem is described in this section. Suppose there are available two measurement vectors of successive measurements X = [x, , x, ,..., x,IT and Y = b1,yz ,...,y,]', each sampled from some probability distributions. The problem is to test the hypothesis that the two distributions are the same against the alternative hypothesis that they are different, using as few measurements as possible. Let the successive measurements x1 , x2 ,..., x, and y1 ,y, ,...,yn be independent random variables, and assume that we wish to test hypothesis Ho :
G
alternative Hl :
G =f(P(X))
= P(X)
with the assumptions that P ( X ) is the probability distribution of X andf(P(X)) the distribution of Y. In order to use the Wald's SPRT based on the sequential ranks, the measurements will be arranged so that they can be taken alternatively as x1 ,y, , x, ,y, ,..., x, , y, . Let the combined measurements at the kth stage be denoted by a vector V ( k ) = [vl, v, ,..., vk] where vl = x,, v, = yl, etc. Let S(k) = [S, , S, ,..., S,] be the sequential rank vector for V(k), and let
be the sequential probability ratio at the kth stage of the process.
5.
102
NONPARAMETRIC PROCEDURE
Under the hypothesis H,, , P,(S(k) = S/H,,) = l/k! for a certain outcome vector S of S(k) and therefore, P,(S(k) = S/Hl) can be computed by noting that each outcome S corresponds, in a one-to-one manner, to a particular ordering of the combined measurements of the xi's and yi's. That is, it is sufficient to compute P(Ul
< % < ... < %/HI) ...
= --mi
t,gt,<.
j
..g t p < m
fi
(5-7)
dfi(P(ti)),
where fi(P(ti))= P(t,) when vi is an x, and f,(P(t,)) =f ( P ( t , ) ) when vi is a y. Again in the case of Lehmann alternatives we have against
H,,:
G =P(X)
Hl :
G = f ( P ( X ) ) = P'(X),
Y
>0
Using (5.5) and 5.7) we obtain, for k even
The sequential probability ratio at the kth stage then reduces to
and
+
As the (k 1)th measurement is taken, A, becomes Aktl. Using the 1)th observation, Sk+l,we can rewrite sequential rank for the (k (5.10) and (5.11) to obtain h k f l as follows:
+
5.3.
SEQUENTIAL TWO-SAMPLE TEST PROBLEM
103
To complete the nonparametric SPRT procedure, it only remains to set the pair of stopping boundaries with which the sequential probability ratios are compared. The crossing of either boundary results in a terminal decision as described previously. Note that in the process of forming the sequential probability ratio from one stage to another, we need to know only the sequential rank S_{k+1} of the (k + 1)th measurement and the vector A(k) = [A_1, A_2, ..., A_k] defined in (5.9). If S_{k+1} is determined to be z, then the (k + 1)th measurement comes between the (z - 1)th and the (z + 1)th smallest measurement. In this way a new vector A(k + 1) = [A_1, A_2, ..., A_{z-1}, A*, A_z, ..., A_k] with (k + 1) elements is formed, where A* = 1 if the (k + 1)th measurement is an x, and A* = r if it is a y. The sequential probability ratio is then obtained through (5.12) or (5.13) using S_{k+1} and A(k + 1). In summary, the nonparametric SPRT can be reduced to the following steps:

Step 1. Obtain the sequential rank S_{k+1} for the (k + 1)th measurement.
Step 2. Form the vector A(k + 1) from A(k) and S_{k+1}.
Step 3. Compute the sequential probability ratio λ_{k+1} by (5.12) or (5.13) and compare it with the stopping boundaries.
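These three steps translate directly into a short program. The following Python sketch is not part of the original text; it keeps the A-vector in the order of the combined measurements and evaluates the ratio in the product form implied by (5.10)-(5.13). The tie-breaking rule is an added assumption, since the text does not treat tied measurements.

```python
import math

def seq_rank(v_new, previous):
    """Sequential rank: 1 + number of preceding measurements smaller than
    v_new. Ties go to the earlier measurement (an added assumption)."""
    return 1 + sum(1 for v in previous if v < v_new)

def nonparametric_sprt(xs, ys, r, e10, e01):
    """Sequential-rank SPRT of H0: G = P(X) against H1: G = P^r(X).
    Measurements are taken alternately x1, y1, x2, y2, ...
    Returns (decision, number of combined measurements used)."""
    A = (1.0 - e01) / e10          # upper boundary: accept H1
    B = e01 / (1.0 - e10)          # lower boundary: accept H0
    combined, scores = [], []      # ordered sample and its A-vector (5.9)
    stream = [m for pair in zip(xs, ys) for m in pair]
    labels = ['x', 'y'] * min(len(xs), len(ys))
    for k, (v, lab) in enumerate(zip(stream, labels), start=1):
        s = seq_rank(v, combined)                      # Step 1
        combined.insert(s - 1, v)                      # Step 2: update A(k)
        scores.insert(s - 1, 1.0 if lab == 'x' else r)
        num, den, run = math.factorial(k), 1.0, 0.0    # Step 3: form ratio
        for a in scores:
            run += a
            num *= a
            den *= run
        ratio = num / den          # equals lambda_k of (5.10)-(5.11)
        if ratio >= A:
            return 'H1', k
        if ratio <= B:
            return 'H0', k
    return None, len(stream)
```

Run on the character samples of Section 5.4 with r = 0.5 and e_10 = e_01 = 0.1, this sketch terminates by accepting H_0 at the 22nd combined measurement, in agreement with the worked example given there.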
It should be remarked that the above procedure does not require the reranking of all previous measurements to compute the sequential probability ratios. In fact, once the sequential rank of a measurement [and consequently the vector A(k)] is determined, it remains unaltered in later computations. The resulting test procedure is therefore greatly simplified and, as expected, fits in more naturally with the successively received feature measurements, whose sequential ranks need to be assigned only when they are received. A natural question to ask here concerns the validity of assuming Lehmann alternatives to distinguish significant departures from a hypothesis in practical applications. While it is true that the proposed alternatives are primarily motivated by their simplicity in developing a useful test procedure, they are justifiable in many classification problems for the following reasons: (1) It seems that when nonparametric methods are appropriate, one usually does not have very precise knowledge about the alternatives. What one would like to know is whether the alternatives are representative of the principal type of deviations from the
hypothesis, in terms of which one can study the ability of various tests to detect such deviations. Lehmann alternatives have been found to provide the typical deviations which usually prevail in many probability distributions describing various pattern classes. An illustrative example, given in Lehmann's work [12], is reproduced in Fig. 5.1, where three Lehmann alternatives (r = 2, 3, 6) are plotted against a gaussian distribution with density function p(x) = (2π)^{-1/2} exp(-x²/2). The typical deviations are evidenced by their shift in mean values and variation in standard deviations.

Fig. 5.1. Normalized gaussian density function and Lehmann alternatives.

(2) While other types of alternatives are also readily available in nonparametric statistics, for example, G = P(X + a) or G = pP(X) + qP²(X), they again depict alternatives with similar deviations without suggesting a new measure of the difference between the hypothesis and the alternatives. In addition, the computation of the sequential probability ratio under these alternatives would be much more complicated than that under Lehmann alternatives,
since the distribution of the ranks would depend not only on a (or p and q) but also on P(X). In the following sections, discussions will be centered on the application of the sequential two-sample test model to the design of a nonparametric classification system. Analysis of the classifier thus designed will then be made in terms of the performance criterion. Moreover, a selection rule will be found to determine a suitable Lehmann alternative when the statistics of the patterns are completely unknown, so that this seemingly unusual assumption may find its interpretation and appropriateness in practical applications.

5.4 Nonparametric Design of Sequential Pattern Classifiers
It is assumed that the input pattern samples to the classifier are known to be taken from two distinct classes of nonparametric distributions, each of which represents a class of patterns. The problem is to seek a nonparametric design of a sequential pattern classifier which, for a pair of prespecified error probabilities, will classify the unknown patterns into one of the two pattern classes with the fewest expected number of feature measurements. Clearly, the assumption of nonparametric statistics and the requirement of optimal termination suggest that the solution be obtained by considering the classification procedure as a sequential two-sample test in the nonparametric setting [14]. Suppose there are available some classified pattern samples† from each pattern class. Let these samples, each with n measurements, be denoted by two sets of feature vectors {X^1} = {[x_1^1, x_2^1, ..., x_n^1]} and {X^2} = {[x_1^2, x_2^2, ..., x_n^2]} belonging to classes ω_1 and ω_2, respectively. Let the input pattern to be presented for classification be denoted by a feature vector Y = [y_1, y_2, ..., y_n] whose components are measured successively by the classifier. To determine which class the input pattern Y belongs to, it is sufficient to decide whether Y comes from a distribution to which {X^1} belongs. Since in the case of binary classification the acceptance of a pattern in one class implies the rejection in the other (once the terminal decision is made), it is sufficient for the classifier to decide whether Y and {X^1} (or {X^2}) come from the same distribution.

† As can be seen in the next chapter, these samples with known classifications are called learning measurements (observations) or learning samples.
Furthermore, if the set {X^1} consists of learning samples which are representative of pattern class ω_1, the classification process reduces to a sequential two-sample test problem in which the classifier performs the test of the hypothesis

H_0:   G = P(X)

against the alternative

H_1:   G = P^r(X),   r > 0

where P(X) is the probability distribution of the x_j's in X^1, and G = P^r(X) is the probability distribution of the y_j's in Y. From the discussion in Section 5.3, it follows that a simple application of the sequential probability ratio test, based on the sequential ranks of the combined measurements, will result in the decision of either accepting or rejecting the hypothesis H_0, corresponding to the recognition of pattern class ω_1 or ω_2, respectively. To illustrate the above procedure of classifying unknown patterns, the following computational example is considered. Suppose there are two pattern samples, X and Y, drawn from the populations of handwritten English characters a and b. Each pattern sample is represented by an 18-dimensional vector (see Section 2.3) whose components will be measured successively by the classifier, namely,
X = [x_1, x_2, ..., x_18]
  = [4.3, 4.9, 5.3, 5.5, 5.4, 4.8, 4.2, 4.7, 6.6, 7.5, 6.2, 5.2, 5.2, 7.0, 8.8, 7.9, 6.0, 4.6]

Y = [y_1, y_2, ..., y_18]
  = [4.0, 5.0, 6.0, 7.0, 7.0, 6.0, 5.0, 5.0, 6.0, 7.0, 6.0, 5.0, 5.0, 8.0, 8.0, 7.0, 6.0, 4.0]

The combined sample for measurement is denoted by

V(k) = [v_1, v_2, ..., v_k] = [x_1, y_1, x_2, y_2, ...],   k = 1, 2, ..., 36

Consider that the vector X is a learning sample taken from the pattern class ω_1. The problem is to test whether the sample vector Y comes from the same class of distribution as that of X, after taking as few measurements as possible. Let r = 0.5 in the Lehmann
alternative to the hypothesis H_0 (r = 1), and let e_10 = e_01 = 0.1, where e_ij is the probability of accepting the hypothesis H_i when actually H_j is true, i, j = 0, 1. Then the stopping boundaries in Wald's SPRT are approximated by computing

A = (1 - e_01)/e_10 = 9.0,   B = e_01/(1 - e_10) ≈ 0.11
Using (5.10) and (5.11) or (5.12) and (5.13), it is a rather straightforward computation to obtain the sequential probability ratios, based on the sequential ranks of the combined sample, as follows:

λ_1 = 1.00    λ_9  = 0.74    λ_16 = 0.16
λ_2 = 1.33    λ_10 = 0.49    λ_17 = 0.19
λ_3 = 1.60    λ_11 = 0.30    λ_18 = 0.18
λ_4 = 1.07    λ_12 = 0.27    λ_19 = 0.18
λ_5 = 1.33    λ_13 = 0.23    λ_20 = 0.12
λ_6 = 0.89    λ_14 = 0.20    λ_21 = 0.14
λ_7 = 1.02    λ_15 = 0.44    λ_22 = 0.10
λ_8 = 0.68
Since λ_22 = 0.10 < 0.11 = B, the classifier decides to accept the hypothesis H_0 (reject H_1) at the 22nd measurement of the combined sample (the 11th measurement of the unknown pattern sample Y), and the unknown pattern Y is classified as belonging to class ω_1.

5.5 Analysis of Optimal Performance and a Multiclass Generalization
It is noted that for any given value of r in the Lehmann alternatives to the hypothesis, the sequential classification procedure obtained above is optimal if two parallel stopping boundaries are used (standard Wald SPRT). That is, on the average the procedure requires fewer measurements to reach a terminal decision than any other test procedure, sequential or nonsequential, with error probabilities at least as low. From the practical design standpoint, the value of r can be freely chosen (as it was in the example given in
Section 5.4) by the designer. However, any change of r will, in general, affect the classifier in its structure as well as its performance. Therefore, it is desirable to analyze the system's performance by deriving an approximate relation between the parameter r in the Lehmann alternatives and the expected number of measurements required to reach a terminal decision. The result of this derivation, as will be shown shortly, is the establishment of a suitable way of choosing the value of r such that a more efficient design may be achieved without seriously increasing the system's complexity. For convenience, define the following equivalent SPRT by rewriting (5.10). For k even, let

γ_k = λ_k^{-1} = ∏_{i=1}^{k} Q_i   (5.14)

The process of forming γ_k continues as long as

A^{-1} < γ_k < B^{-1}   (5.15)

The hypothesis H_0 is accepted as soon as log γ_k ≥ log B^{-1}, and is rejected as soon as log γ_k ≤ log A^{-1}, where A and B are the stopping boundaries as defined in (1.48). Let W_k = log γ_k, and define

Q_i = r^{-1/2} Ā_i   (5.16)

Ā_i = (1/i) Σ_{j=1}^{i} A_j   (5.17)

Then, after taking logarithms, (5.14) can be rewritten as

W_k = log ∏_{i=1}^{k} Q_i = Σ_{i=1}^{k} u_i,   u_i = log Q_i   (5.18)
It is seen that by redefining the SPRT procedure in this manner, the logarithm of the sequential probability ratio at the kth stage of the process can be represented as the sum of k random variables u_1, u_2, ..., u_k. Before deriving the expected number of measurements, the following
results established by Parent [9] are needed. Consider the case when the hypothesis H_0 is true; then

P(A_j = 1) = P(A_j = r) = ½   (5.19)

giving

E(A_j) = E(Ā_i) = ½(1 + r)   and   E(Q_i) = r^{-1/2} E(Ā_i) = ½(r^{-1/2} + r^{1/2})   (5.20)

Also, the joint probabilities are given by

P(A_i = 1, A_j = 1) = P(A_i = r, A_j = r) = (k - 2)/(4(k - 1))
P(A_i = 1, A_j = r) = P(A_i = r, A_j = 1) = k/(4(k - 1))   (5.21)

where i ≠ j. Using these probabilities, the variance of Q_i can be obtained by simple computations as

Var(Q_i) = r^{-1} Var(Ā_i)   (5.22)
Note that Var(Q_i) is decreasing in i, i = 1, 2, ..., k. Expanding log Q_i about the mean value of Q_i, we have

u_i = log Q_i = log E(Q_i) + [Q_i - E(Q_i)]/ξ_{i,k}   (5.23)

where E(Q_i) = ½(r^{-1/2} + r^{1/2}) > 1 and ξ_{i,k} is bounded away from zero. Taking the expectation of u_i, (5.23) becomes

E(u_i) = log ½(r^{-1/2} + r^{1/2}) + Var(Q_i)(1/ξ_{i,k})   (5.24)

Since the interest is mainly in the case when i ≈ k (i.e., the stages when the sequential process is about to terminate, so that Var(Q_i) is diminishingly small), the second term in E(u_i) can be dropped from consideration, especially if k is sufficiently large. Thus (5.24) can be approximated by

E_r(u) = E(u_i) ≈ log ½(r^{-1/2} + r^{1/2})   (5.25)

where the subscript r indicates the dependence of E(u) on r.
The expected number of measurements can now be determined. Following the notation used in Chapter 3, let E_r*(W_k) denote the conditional expected value of W_k given that the lower stopping boundary is crossed at the kth measurement, namely W_k ≤ log A^{-1}. Let E_r**(W_k) denote the conditional expected value of W_k given that the upper stopping boundary is crossed, i.e., W_k ≥ log B^{-1}. Then, under the hypothesis H_0,

E_r(k) = [E_r(u)]^{-1} {e_10 E_r*(W_k) + (1 - e_10) E_r**(W_k)}   (5.26)

Neglecting the excess over the boundaries, i.e., treating the inequalities W_k ≤ log A^{-1} and W_k ≥ log B^{-1} as equalities, we can write, with probability 1,

E_r*(W_k) = log A^{-1} = log [e_10/(1 - e_01)]   (5.27)

E_r**(W_k) = log B^{-1} = log [(1 - e_10)/e_01]   (5.28)
Substituting (5.25), (5.27), and (5.28) into (5.26), we obtain

E_r(k) = [log ½(r^{-1/2} + r^{1/2})]^{-1} [e_10 log (e_10/(1 - e_01)) + (1 - e_10) log ((1 - e_10)/e_01)]   (5.29)

If e_10 is chosen to be very small, that is, e_10 ≪ 1, (5.29) can be written as

E_r(k) ≈ log [(1 - e_10)/e_01] / log ½(r^{-1/2} + r^{1/2})   (5.30)

which is an approximate relation between the expected number of measurements and the parameter value r, given e_10 and e_01. Consider the case when r < 1. It is clear that the denominator of E_r(k) in (5.30) increases as r decreases. Since log[(1 - e_10)/e_01] is a constant for given e_10 and e_01, E_r(k) tends to decrease accordingly. On the other hand, when r > 1, the denominator increases as r increases, which, in turn, makes E_r(k) decrease. It can be concluded that the expected number of measurements E_r(k) decreases as the value of r departs from unity, regardless of whether r < 1 or r > 1. A qualitative plot of E_r(k) vs. r, which depicts the functional relation of (5.30), is shown in Fig. 5.2. There is a natural generalization from the sequential two-sample test to the case where the number of pattern classes is more than two.
Fig. 5.2. Plot of Eq. (5.30) showing the relation between the expected number of observations and the parameter r in the Lehmann alternatives.
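The shape of Fig. 5.2 can be reproduced numerically from (5.30). The sketch below is illustrative only (not from the original text); the error probabilities are hypothetical choices, and the approximation assumes e_10 ≪ 1 and r ≠ 1:

```python
import math

def expected_measurements(r, e10=0.1, e01=0.1):
    """Approximate E_r(k) from Eq. (5.30); valid away from r = 1."""
    denom = math.log(0.5 * (r ** -0.5 + r ** 0.5))
    return math.log((1.0 - e10) / e01) / denom

for r in [0.05, 0.2, 0.5, 2.0, 5.0, 20.0]:
    print(f"r = {r:5.2f}   E_r(k) ~ {expected_measurements(r):6.1f}")
```

The printed values decrease symmetrically as r moves away from unity in either direction, since ½(r^{-1/2} + r^{1/2}) is invariant under r → 1/r, consistent with the conclusion drawn above.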
Let P(X/ω_i), i = 1, 2, ..., m, be the (unknown) probability distributions for the m pattern classes, and let the set of learning samples from class ω_i be denoted by {X^i} = {[x_1^i, x_2^i, ..., x_n^i]}, i = 1, 2, ..., m. To determine the pattern class to which a pattern sample Y = [y_1, y_2, ..., y_n] belongs, the two-sample test can be applied to each and every class ω_i to test the hypothesis

H_0:   G = P(X/ω_i),   i = 1, 2, ..., m

against the alternative

H_1:   G = f(P(X/ω_i)) = P^{r_i}(X/ω_i),   r_i > 0,   i = 1, 2, ..., m

Consider the successive measurements of the m combined sample vectors to be

V_i(k) = [v_{i1}, v_{i2}, ..., v_{ik}] = [x_1^i, y_1, x_2^i, y_2, ...],   i = 1, 2, ..., m

where k = 1, 2, ..., 2n.
The corresponding sequential rank vectors are determined to be

S_1(k) = [S_11, S_12, ..., S_1k]
S_2(k) = [S_21, S_22, ..., S_2k]
   ...
S_m(k) = [S_m1, S_m2, ..., S_mk]

where k = 1, 2, ..., 2n. At the kth measurement of the m combined sample vectors V_1(k), V_2(k), ..., V_m(k), the sequential probability ratios are computed, i.e.,

λ_ik = P_k(S_i(k)/H_1) / P_k(S_i(k)/H_0),   i = 1, 2, ..., m   (5.31)
where P_k(S_i(k)/H_0) or P_k(S_i(k)/H_1) is the probability of the sequential rank vector S_i(k), given the hypothesis that Y belongs to the class ω_i, or the alternative that Y does not. Following (5.8), for k even, (5.31) becomes

λ_ik = k! r_i^{k/2} ∏_{j=1}^{k} (Σ_{l=1}^{j} A_il)^{-1}   (5.32)

where, with the A_il arranged in the order of the combined measurements as in (5.9),

A_ij = 1 if v_ij is an x from X^i,   A_ij = r_i if v_ij is a y   (5.33)

Adopting the rejection criterion of the GSPRT, the pattern class ω_i is dropped from consideration at the kth measurement if

λ_ik ≥ A_k(ω_i)   (5.34)

The process of forming λ_ik continues until there is only one sequential probability ratio left not satisfying the above inequality, and its associated hypothesis is then accepted as the true pattern class to which Y belongs. Note that the upper stopping boundaries A_k(ω_i) in (5.34) are generally functions of both the hypothesis under test and the number of stages k under consideration. In practice, A_k(ω_i) may be set to [1 - (e_01)_i]/(e_10)_i, where (e_10)_i and (e_01)_i are the two types of error probabilities for the hypothesis associated with class ω_i. As shown in (5.34), the lower stopping boundaries B_k(ω_i) have been made negligibly small, by setting (e_01)_i arbitrarily small, to prevent the test from accepting a hypothesis prematurely when the alternative H_1 is true. In any case, the stopping boundaries should be determined in such a way that they minimize the effect of possible
ambiguity, for example, a rejection of all pattern classes in a situation where it is known that the input sample is from one of the pattern classes.

5.6 Experimental Results and Discussions
To determine the effectiveness of the sequential ranking procedure in constructing a nonparametric sequential classifier, a computer-simulation experiment was carried out. The experiment consisted in classifying the handwritten English characters a and b as described in Section 2.3, with the exception that the probability distributions of patterns in each class were assumed to be nonparametric and unknown. Let the learning sample from character a be denoted by
X^a = [x_1^a, x_2^a, ..., x_18^a]

which was obtained, in this experiment, from the estimated mean vector of sixty samples of character a. The classifier then tests the hypothesis that an input sample Y comes from the distribution of character a, or the alternative that it does not (i.e., to accept that Y belongs to the class of character b). A flow diagram of the computer-simulated classification procedure is given in Fig. 5.3. For the purpose of illustration, some sixty pattern samples of characters a and b were tested on the computer. The classification results are summarized in Fig. 5.4, in which the error probability (average percentage of the two types of misrecognition) is plotted against the average number of feature measurements required to make a terminal decision. The classification experiment was repeated with r in the Lehmann alternatives as a running parameter. It is seen that the experiment tends to verify the theoretical conclusions in two respects: (i) The sequential ranking procedure and the resulting two-sample test model do provide an effective nonparametric procedure for sequential classification in which the error probability decreases as the number of measurements increases, as usually expected in Wald's SPRT. (ii) For a specified error probability, fewer measurements are required to make a terminal decision by increasing the value of r if r > 1, or by decreasing the value of r if r < 1, which is the relation concluded in (5.30). This relation is particularly useful in selecting a desirable Lehmann alternative for a certain pattern class in the absence of any statistical knowledge about the pattern samples. Although no direct verification of the validity of the assumption of Lehmann alternatives was attempted in the experiment, the simulation result does indicate
Fig. 5.3. Computer flow diagram for the recognition experiments using the nonparametric technique.
Fig. 5.4. Performance curves; recognition of characters a and b (nonparametric model).
possible low error probabilities in the classification of character samples if proper Lehmann alternatives are chosen and a sufficient number of measurements is available.

5.7 Summary and Further Remarks

A nonparametric setting for Wald's sequential probability ratio test based on sequential ranks has been discussed in this chapter. The essential feature of the sequential ranks lies in the fact that a new measurement can be ranked as it is measured, relative to the preceding measurements, without reranking all the previous measurements. One application of this ranking scheme is in the design of a sequential recognition system to classify patterns with nonparametric statistics. The solution is obtained by formulating the classification procedure in terms of a sequential two-sample problem where the classifier wishes to decide whether or not an X-population and a Y-population have the same probability distribution. With the assumption of Lehmann alternatives in the two-sample test, a simple design of a nonparametric sequential classifier is developed. Both intuitive and theoretical justifications have been given for the use and selection of suitable Lehmann alternatives. A generalization of the two-sample test procedure to the multiclass classification problem
has also been suggested. Computer-simulated experiments have shown satisfactory results regarding the verification of theoretical conclusions and the classification of English characters. The nonparametric sequential classification procedure proposed in this chapter is a rather special approach based on the sequential probability ratio test and the assumption of Lehmann alternatives. It should be interesting to explore more general results and possible extensions by considering alternatives other than Lehmann alternatives or other nonparametric decision procedures.

References

1. G. H. Ball, Data analysis in the social sciences: What about the details? Proc. Fall Joint Computer Conference, 533-599 (1965).
2. G. Sebestyen and J. Edie, An algorithm for nonparametric pattern recognition. IEEE Trans. Electronic Computers 15, 908-915 (1966).
3. J. Owen, Nonparametric pattern recognition, Part I and Part II. TR No. 1 and No. 2, July/October. Information Research Associates, Inc., Waltham, Massachusetts, 1965.
4. M. A. Aiserman, E. M. Braverman, and L. I. Rozonoer, The probability problem of pattern recognition learning and the method of potential functions. Avtomatika i Telemekhanika 25, 1175-1190 (1964).
5. T. M. Cover and P. E. Hart, Nearest neighbor pattern classification. IEEE Trans. Information Theory 13, 21-27 (1967).
6. D. F. Specht, Generation of polynomial discriminant functions for pattern recognition. IEEE Trans. Electronic Computers 16, 308-319 (1967).
7. G. F. Hughes, On the mean accuracy of statistical pattern recognizers. IEEE Trans. Information Theory 14, 55-63 (1968).
8. E. G. Henrichon, On nonparametric methods for pattern recognition. Ph.D. Thesis (TR-EE68-18), Purdue University, Lafayette, Indiana, June 1968.
9. E. A. Parent, Sequential ranking procedure. Tech. Rept. No. 80, Dept. of Statistics, Stanford Univ., Stanford, California, 1965.
10. O. Barndorff-Nielsen, On the limit behavior of extreme order statistics. Ann. Math. Statist. 34, 992-1002 (1963).
11. W. Hoeffding, Optimum nonparametric tests. Proc. Symp. Math. Statist. and Probability, 2nd, Berkeley, 1950, pp. 83-92. Univ. of California Press, Berkeley, California, 1951.
12. E. L. Lehmann, The power of rank tests. Ann. Math. Statist. 24, 23-43 (1953).
13. I. R. Savage and J. Sethuraman, Stopping time of a rank-order sequential probability ratio test based on Lehmann alternatives. Ann. Math. Statist. 37, No. 5, 1154-1160 (1966).
14. K. S. Fu and Y. T. Chien, Sequential recognition using a nonparametric ranking procedure. IEEE Trans. Inform. Theory 13, 484-492 (1967).
15. D. A. S. Fraser, "Nonparametric Methods in Statistics." Wiley, New York, 1957.
16. I. R. Savage, Contributions to the theory of rank order statistics: the two-sample case. Ann. Math. Statist. 27, 590-615 (1956).
CHAPTER 6
BAYESIAN LEARNING IN SEQUENTIAL PATTERN RECOGNITION SYSTEMS
6.1 Supervised Learning Using Bayesian Estimation Techniques
As pointed out in Section 1.6, in the absence of complete a priori knowledge, pattern recognition systems can be designed to learn the necessary information from their input observations. Depending upon whether the correct classifications of the input observations are available or not, the learning process can be classified into supervised and nonsupervised learning schemes. Various techniques have been proposed for the design of learning systems. Two problems are of primary interest in sequential pattern recognition: the problem of learning an unknown probability density function and that of learning an unknown probability measure. Supervised learning schemes using Bayesian estimation techniques are discussed in this section [1]-[3]. When the form of the probability density function p(X/ω_i) is known but some parameters θ of the density function are unknown, the unknown parameters can be learned (estimated) by iterative applications of Bayes' theorem. It is assumed that there exists an a priori density function p_0(θ) for the unknown parameter θ (in general, vector-valued) which reflects the initial knowledge about θ. Consider what happens to the knowledge about θ when a sequence of independent, identically distributed feature vectors X_1, X_2, ..., X_n, all from the same pattern class, is observed. The function p_0(θ) changes to the a posteriori density function p(θ/X_1, ..., X_n) according to Bayes theorem. For example, the a posteriori density function of θ given the first observation X_1 is†

p(θ/X_1) = p(X_1/θ) p_0(θ) / ∫ p(X_1/θ) p_0(θ) dθ   (6.1)

† Since all the learning observations X_1, ..., X_n are from the same class, ω_i can be dropped from each term in (6.1) without causing any confusion.
After X_1 and X_2 are observed, the a posteriori density function of θ is

p(θ/X_1, X_2) = p(X_2/θ) p(θ/X_1) / ∫ p(X_2/θ) p(θ/X_1) dθ   (6.2)

In general,

p(θ/X_1, ..., X_n) = p(X_n/θ) p(θ/X_1, ..., X_{n-1}) / ∫ p(X_n/θ) p(θ/X_1, ..., X_{n-1}) dθ   (6.3)

The required probability density function can be computed by

p(X/X_1, ..., X_n, ω_i) = ∫ p(X/θ, ω_i) p(θ/X_1, ..., X_n, ω_i) dθ,   n = 1, 2, ...   (6.4)

where the first term on the right-hand side of (6.4), p(X/θ, ω_i), is known, and the second term, p(θ/X_1, ..., X_n, ω_i), is obtained from (6.3). The central idea of Bayesian estimation is to extract information from the observations X_1, X_2, ..., X_n about the unknown parameter θ through successive applications of the recursive Bayes formula. It is known [1] that, on the average, the a posteriori density function becomes more concentrated and converges to the true value of the parameter so long as the true value is not excluded by the a priori density function of the parameter. In each of the supervised learning schemes to be discussed, the iterative application of Bayes theorem can be accomplished by a fixed computational algorithm. This is made possible by carefully selecting a reproducing a priori density function for the unknown parameter, so that the a posteriori density functions after each iteration are members of the same family as the a priori density function (i.e., the form of the density function is preserved and only the parameters of the density function are changed).† The learning schemes are then reduced to successive estimations of parameter values.

† Some important results concerning the necessary and sufficient conditions admitting a reproducing density function can be found in the work of Spragins [4].
6.1.1 LEARNING THE PARAMETERS OF A GAUSSIAN DISTRIBUTION

A. Learning the Mean Vector M, with Known Covariance Matrix K

In this case, the unknown parameter θ to be learned is M, whose uncertainty can be reflected by assigning a proper reproducing a priori density function p_0(θ) = p_0(M). Let

p(X/M) = [(2π)^{N/2} |K|^{1/2}]^{-1} exp[-½(X - M)^T K^{-1} (X - M)]

and assign

p_0(M) = [(2π)^{N/2} |Φ_0|^{1/2}]^{-1} exp[-½(M - M_0)^T Φ_0^{-1} (M - M_0)]

where M_0 represents the initial estimate of the mean vector and Φ_0 is the initial covariance matrix which reflects the uncertainty about M_0. From the reproducing property of the gaussian density function it is known [2], [3] that, after successive applications of Bayes' formula, the a posteriori density function p(M/X_1, ..., X_n), given the learning observations X_1, ..., X_n, is again a gaussian density function with M_0 and Φ_0 replaced by the new estimates M_n and Φ_n. The new estimates M_n and Φ_n are, respectively, the conditional mean and covariance of M after n learning observations X_1, ..., X_n, i.e.,

M_n = E[M_{n+1}/X_1, ..., X_n] = (n^{-1}K)(Φ_0 + n^{-1}K)^{-1} M_0 + Φ_0(Φ_0 + n^{-1}K)^{-1} ⟨X⟩   (6.5)

and

Φ_n = Cov[M_{n+1}/X_1, ..., X_n] = (n^{-1}K)(Φ_0 + n^{-1}K)^{-1} Φ_0   (6.6)

where ⟨X⟩ = (1/n) Σ_{i=1}^{n} X_i denotes the sample mean.
Or, in terms of a recursive relationship, (6.5) and (6.6) can be written as

M_n = K(Φ_{n-1} + K)^{-1} M_{n-1} + Φ_{n-1}(Φ_{n-1} + K)^{-1} X_n   (6.7)

and

Φ_n = K(Φ_{n-1} + K)^{-1} Φ_{n-1}   (6.8)

Equation (6.5) shows that M_n can be interpreted as a weighted average of the a priori mean vector M_0 and the sample information ⟨X⟩, with the weights being (n^{-1}K)(Φ_0 + n^{-1}K)^{-1}
and Φ_0(Φ_0 + n^{-1}K)^{-1}, respectively. The nature of this interpretation can be seen more easily in the special case where

Φ_0 = α^{-1}K,   α > 0   (6.9)
Then (6.5) and (6.6) become

M_n = [α/(n + α)] M_0 + [n/(n + α)] ⟨X⟩   (6.10)

and

Φ_n = K/(n + α)   (6.11)
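As an illustration (not from the original text), the recursions (6.7) and (6.8) can be run directly; the covariance K, the true mean, and the vague prior below are hypothetical values:

```python
import numpy as np

def bayes_mean_update(M_prev, Phi_prev, K, x):
    """One step of Eqs. (6.7)-(6.8): posterior mean and covariance of M."""
    G = np.linalg.inv(Phi_prev + K)
    M_new = K @ G @ M_prev + Phi_prev @ G @ x   # Eq. (6.7)
    Phi_new = K @ G @ Phi_prev                  # Eq. (6.8)
    return M_new, Phi_new

rng = np.random.default_rng(0)
K = np.eye(2)                      # known covariance of the observations
M_true = np.array([1.0, -2.0])     # true (unknown) mean to be learned
M, Phi = np.zeros(2), 10.0 * np.eye(2)   # vague prior: M0 = 0, Phi0 large
for n in range(200):
    x = rng.multivariate_normal(M_true, K)
    M, Phi = bayes_mean_update(M, Phi, K, x)
print(M)   # approaches M_true; Phi shrinks like K/n, per (6.11)
```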
As n → ∞, M_n → ⟨X⟩ and Φ_n → 0, which means that, on the average, the estimate M_n will approach the true mean vector M of the gaussian density function.†

† Since the sample mean ⟨X⟩ is an unbiased estimate of the true mean vector M.

B. Learning the Covariance Matrix K, with Zero (or Known) Mean Vector

In this case θ = K is the parameter to be learned. Let K^{-1} = Q and assign the a priori density function for Q to be the Wishart density function with parameters (K_0, ν_0) [22], i.e.,

p_0(Q) = C_{N,ν_0} |Q|^{(ν_0 - N - 2)/2} exp[-(ν_0/2) tr(K_0 Q)]   for Q ∈ Q_+
p_0(Q) = 0   otherwise   (6.12)

where Q_+ denotes the subset of the Euclidean space of dimension ½N(N + 1) in which Q is positive definite, and C_{N,ν_0} is the normalizing constant determined by

∫_{Q_+} p_0(Q) dQ = 1   (6.13)

K_0 is a positive definite matrix which reflects the initial knowledge of K, and ν_0 is a scalar which reflects the confidence in the initial estimate K_0. It can be shown that, by successive applications of Bayes' formula, the a posteriori density function of Q,
p(Q/X_1, ..., X_n), is again a Wishart density function with parameters K_0 and ν_0 replaced by K_n and ν_n, where

K_n = (ν_0 K_0 + n⟨XX^T⟩)/(ν_0 + n)   (6.14)

ν_n = ν_0 + n   (6.15)

and

⟨XX^T⟩ = (1/n) Σ_{i=1}^{n} X_i X_i^T   (6.16)

Equation (6.14) can again be interpreted as the weighted average of the a priori knowledge K_0 and the sample information contained in ⟨XX^T⟩.
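The update is thus pure bookkeeping on the pair (K_n, ν_n). A minimal sketch, assuming the weighted-average form given above (the numerical values are hypothetical, not from the original text):

```python
import numpy as np

def wishart_update(K0, v0, X):
    """Posterior parameters (K_n, v_n) after observing the rows of X,
    following the weighted-average form of Eqs. (6.14)-(6.16)."""
    n = X.shape[0]
    S = (X.T @ X) / n               # sample second moment <XX^T>, Eq. (6.16)
    v_n = v0 + n                    # Eq. (6.15)
    K_n = (v0 * K0 + n * S) / v_n   # Eq. (6.14): weighted average
    return K_n, v_n

# hypothetical run: prior guess K0 = I held with weight v0 = 5
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
print(wishart_update(np.eye(2), 5, X)[0])   # approaches the true covariance
```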
C. Learning the Mean Vector M and the Covariance Matrix K

In this case, θ = (M, Q) and Q = K^{-1}. An appropriate a priori density function for the unknown parameter θ is found to be Gaussian-Wishart; i.e., M is distributed according to a gaussian density function with mean vector M_0 and covariance matrix Φ_0 = μ_0^{-1}K, and Q is distributed according to a Wishart density function with parameters ν_0 and K_0. It can be shown that, by successive applications of Bayes' formula, the a posteriori density function of θ, p(θ/X_1, ..., X_n) = p(M, Q/X_1, ..., X_n), is again a Gaussian-Wishart probability density function with parameters ν_0, μ_0, M_0, and K_0 replaced by ν_n, μ_n, M_n, and K_n, respectively, where

ν_n = ν_0 + n   (6.17)

μ_n = μ_0 + n   (6.18)

M_n = [μ_0/(μ_0 + n)] M_0 + [n/(μ_0 + n)] ⟨X⟩   (6.19)

and

K_n = (1/ν_n)[ν_0 K_0 + μ_0 M_0 M_0^T + (n - 1)S + n⟨X⟩⟨X⟩^T - μ_n M_n M_n^T]   (6.20)

where

S = [1/(n - 1)] Σ_{i=1}^{n} (X_i - ⟨X⟩)(X_i - ⟨X⟩)^T   (6.21)

is the sample covariance matrix. Equation (6.19) is the same as (6.10) except that α is replaced by μ_0. Equation (6.20) can be interpreted as follows. The first two terms on the right-hand side are weighted estimates of the noncentralized moments of X; the term (ν_0 K_0 + μ_0 M_0 M_0^T) represents the a priori knowledge, and [(n - 1)S + n⟨X⟩⟨X⟩^T] represents the sample information. The last term on the right-hand side is generated from the new estimate of the mean of X.
6.1.2 LEARNING THE PARAMETERS OF A BINOMIAL DISTRIBUTION
It seems rather obvious and reasonable to interpret the new estimates of parameters in terms of the weighted average of a priori knowledge and sample information in the case of a gaussian distribution. Unfortunately, the interpretation becomes much less obvious for distributions other than gaussian. The difficulty involved can be illustrated by examining the case of the binomial distribution b(n, p) with parameters (n, p) [5], [6]. Consider the Bernoulli process with parameter θ = p. Let x_1, x_2, ..., x_n denote the observed samples of the process, where each x (1 or 0) is drawn from a distribution b(1, p), 0 < p < 1. If r = Σ_{i=1}^{n} x_i, then r is the number of ones (successes) at the nth observation, which has the binomial distribution b(n, p). That is, the conditional density function of r, given p, is

P(r/p) = C(n, r) p^r (1 - p)^{n-r}   (6.22)

Notice that the sample outcome y = (r, n) is a sufficient statistic of dimension two for the parameter θ = p. Suppose that p is unknown and is to be learned through the sample outcome (r, n). As in the case of the gaussian distribution, an appropriate a priori density function assigned for p is the beta probability density function [22]

p_0(θ) = p_0(p) = [B(r_0, n_0 - r_0)]^{-1} p^{r_0 - 1} (1 - p)^{n_0 - r_0 - 1}   (6.23)

where B(r_0, n_0 - r_0) is the beta function with parameters r_0 and n_0, which are assigned positive constants reflecting the initial knowledge about the unknown parameter p. It can be easily verified by Bayes theorem that the a posteriori density of p, given (r, n), is again a beta density function with parameters r_n and n_n,

p(p/r, n) = [B(r_n, n_n - r_n)]^{-1} p^{r_n - 1} (1 - p)^{n_n - r_n - 1}   (6.24)

where

r_n = r_0 + r   (6.25)
and

n_n = n_0 + n   (6.26)
At first glance at (6.25) and (6.26), it seems rather natural to regard r and n as the sample information used to update the initial knowledge r_0 and n_0, respectively. In doing so, however, one cannot give an interpretation in the sense of a weighted average of initial knowledge and sample information as in the case of the gaussian distribution. The difficulty lies in the fact that neither component of the statistic (r, n) can be considered as a measure of information in a sample from a Bernoulli process, and it follows that it would not be sensible to consider either component of the parameter (r_0, n_0) as a measure of the knowledge underlying the a priori distribution. To remedy this situation, let

m = r/n,   m_0 = r_0/n_0   (6.27)

Since for given m an increase in n implies an increase in r, the sample information seems to be unambiguously measured by (m, n). Substituting m_0 n_0 for r_0 in (6.23),

p(p/m_0, n_0) = [B(m_0 n_0, n_0(1 - m_0))]^{-1} p^{m_0 n_0 - 1} (1 - p)^{n_0(1 - m_0) - 1},   n_0 > 0   (6.28)

Thus the a posteriori parameters are given by

m_n = (n_0 m_0 + nm)/(n_0 + n),   n_n = n_0 + n   (6.29)

Note that the expected value of m is p, so m is a natural estimate of p. Equation (6.29) can be interpreted as the weighted average of a priori knowledge and sample information, as in the gaussian case.
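A minimal sketch of the update (6.29) in the (m, n) parameterization; the prior weight and the counts below are hypothetical values, not from the original text:

```python
def beta_update(m0, n0, successes, trials):
    """Eq. (6.29): m is the current estimate of p, n the weight of
    evidence behind it."""
    n_new = n0 + trials
    m_new = (n0 * m0 + successes) / n_new   # weighted average of prior and data
    return m_new, n_new

# hypothetical run: prior guess p ~ 0.5 with weight 10, then 42/60 successes
m, n = beta_update(0.5, 10, 42, 60)
print(m, n)   # 0.671..., 70
```

The estimate m_n is pulled from the prior guess m_0 toward the empirical frequency r/n, with relative weights n_0 and n.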
6.2
Nonsupervised Learning Using Bayesian Estimation Techniques
Since the correct classifications of learning observations are unknown in the case of nonsupervised learning, it is almost impossible to precisely associate each learning observation with the distribution
of the correct pattern class for updating information. Instead, one approach is to formulate the problem of nonsupervised learning as a problem of estimating the parameters in an overall distribution (called a mixture distribution [7]) comprising the component distributions. The learning observations are considered to come from the mixture distribution, and the component distributions may be the distributions of each pattern class or the distributions corresponding to the various partitions in the observation (feature) space. A mixture distribution (or density) function results when the set of learning observations X_1, X_2, ..., X_n can be partitioned in W ways, Z_1^n, Z_2^n, ..., Z_W^n. For example, if each of the X_i's is possibly generated by one of the m classes of distributions, then W = m^n. The mixture distribution is defined as

P(X) = Σ_{i=1}^{W} P(X/Z_i^n) P(Z_i^n)   (6.30)

where P(X/Z_i^n) is called the ith-partition conditional distribution, and P(Z_i^n) the ith mixing parameter. If the mixture distribution is considered to be characterized by sets of parameters, then the parameter-conditional mixture distribution P(X/θ, P) can be constructed from the family of ith-partition, parameter-conditional distributions {P(X/θ_i, Z_i^n); i = 1, ..., W} and the two sets of parameters θ = {θ_1, θ_2, ..., θ_W} and P = {P(Z_1^n), P(Z_2^n), ..., P(Z_W^n)}. The basic mixture equation (6.30) becomes

P(X/θ, P) = Σ_{i=1}^{W} P(X/θ_i, Z_i^n) P(Z_i^n)   (6.31)

The problem of nonsupervised learning can then be reduced to that of finding a unique solution for θ and P, given P(X/θ, P). It is known that the class of mixture distributions which may have a unique solution for θ and P is limited, and whether it admits a unique solution will depend upon the identifiability of the mixture distribution [8], [9]. The parameter-conditional mixture distribution can be considered as the image under a mapping, say φ, of the parameter sets θ and P defined by (6.31). P(X/θ, P) is said to be identifiable if φ is a one-to-one mapping of θ and P onto P(X/θ, P). It is noted that the question of whether P(X/θ, P) is identifiable is one of unique characterization. That is, for a particular family of the partitions, the mixture distribution P(X/θ, P) uniquely
determines the sets of parameters θ and P. It is clear that if the nonsupervised learning problem is such that the mixture distribution is not uniquely characterized by θ and P (not identifiable), then there exists no unique solution to the underlying problem. The method of estimating these parameters, θ and P, seems to depend on the particular problem at hand. In this section, nonsupervised learning using Bayesian estimation techniques is discussed.

A. Estimation of Parameters of a Decision Boundary

Consider n learning observations, denoted by x_1, x_2, ..., x_n, drawn from one of two pattern classes, ω_1 and ω_2, having univariate gaussian distributions with some unknown parameters. The optimum decision boundary to minimize the probability of misrecognition in classifying observations into one of the two pattern classes is, in general, a function of the a priori probabilities, the means, and the variances. In particular, in the nonsequential Bayes classification process (Section 1.4), if the a priori probabilities are equal and the variances are equal, the optimum decision boundary is known to be the mean of the two means. When the learning scheme is a supervised one, the two means can be easily learned from the classified learning observations. In the case of nonsupervised learning, the problem can be viewed as one of estimating the mean of the mixture distribution p(x), where [10]
p(x) = Σ_{i=1}^{2} P(ω_i) p(x/ω_i)
     = ½ [σ(2π)^{1/2}]^{-1} exp[-(x - m_1)²/2σ²] + ½ [σ(2π)^{1/2}]^{-1} exp[-(x - m_2)²/2σ²]   (6.32)
From (6.32) it is easily seen that the optimum decision boundary is simply the mean of the mixture distribution p(x). The sample mean

x̄ = (1/n) Σ_{i=1}^{n} x_i   (6.33)

is applied to estimate (learn) the mean of p(x). The approach can be extended to problems concerning unequal a priori probabilities and mixtures of multivariate gaussian distributions. The solutions to these generalizations are shown to rely on the estimation of higher moments of the mixture distribution [11].
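A small simulation (not from the original text; all numerical values hypothetical) illustrates the boundary estimate (6.33):

```python
import numpy as np

# Nonsupervised estimate of the decision boundary (equal priors, equal
# variances): the boundary (m1 + m2)/2 is the mean of the mixture (6.32),
# estimated by the sample mean (6.33) of the unclassified observations.
rng = np.random.default_rng(1)
m1, m2, sigma = -1.0, 3.0, 1.0           # hypothetical class means / std
labels = rng.integers(0, 2, size=5000)   # true classes, never revealed
x = rng.normal(np.where(labels == 0, m1, m2), sigma)
boundary = x.mean()                      # Eq. (6.33)
print(boundary)                          # close to (m1 + m2)/2 = 1.0
```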
B. Estimation of Parameters in Mixture Distributions

Assume there are two pattern classes ω_1 and ω_2. The form of the probability density function p(X/ω_1) is known but a parameter θ is unknown, and p(X/ω_2) is completely known. The problem is to learn the parameter θ from the learning observations X_1, X_2, ..., X_n with unknown classifications. Since the correct classifications of the learning observations are unknown, each observation can be considered as coming either from class ω_1 or from class ω_2. If the sequence X_1, ..., X_n is partitioned into all possible combinations in which the two classes ω_1 and ω_2 can occur, there will be 2^n such combinations. Let Z_i^n be the ith partition of the sequence X_1, ..., X_n; then the a posteriori probability density is obtained by

p(θ/X_1, ..., X_n) = Σ_{i=1}^{2^n} p(θ/X_1, ..., X_n, Z_i^n) P(Z_i^n/X_1, ..., X_n)   (6.34)

The problem is now reduced to that of supervised learning for each of the 2^n partitions [12]. The result of estimation is obtained by taking the weighted sum of the results obtained from each partition, with the weights being the probabilities of occurrence of each partition, P(Z_i^n/X_1, ..., X_n), i = 1, ..., 2^n. It can be seen from (6.34) that the number of computations grows exponentially with n, and for this reason the approach does not seem practical for large numbers of learning observations. An alternative solution to this problem has been developed in order to avoid the difficulty of exponential growth in computation [13]. By applying Bayes theorem,

p(θ/X_1, ..., X_n) = p(X_n/θ, X_1, ..., X_{n-1}) p(θ/X_1, ..., X_{n-1}) / p(X_n/X_1, ..., X_{n-1})   (6.35)

Assuming conditional independence of the learning observations,

p(X_n/θ, X_1, ..., X_{n-1}) = p(X_n/θ)   (6.36)

and

p(X_n/θ) = p(X_n/θ, ω_1) P(ω_1) + p(X_n/ω_2) P(ω_2)   (6.37)
which is a mixture of p(X/θ, ω_1) and p(X/ω_2). The assumption of conditional independence is the fundamental reason why there is no need to store all the learning observations. Substituting (6.37) into (6.35), a recursive expression for estimating θ is obtained as

p(θ/X_1, ..., X_n) = [p(X_n/θ, ω_1) P(ω_1) + p(X_n/ω_2) P(ω_2)] p(θ/X_1, ..., X_{n-1}) / p(X_n/X_1, ..., X_{n-1})   (6.38)

If p(θ/X_1, ..., X_{n-1}), p(X_n/θ, ω_1), p(X_n/ω_2), P(ω_1), and P(ω_2) are known, p(θ/X_1, ..., X_n) can be computed by (6.38). Assume that P(ω_1) and P(ω_2) are known. In order to compute p(X_n/θ, ω_1) and p(θ/X_1, ..., X_n) for all values of θ, it must be assumed that θ can be finitely quantized so that the number of computations can be kept finite. For multiclass problems, if more than one pattern class has unknown parameters, let θ_i be the unknown parameter associated with pattern class ω_i, i = 1, ..., m. Assuming conditional independence of the learning observations X_1, ..., X_n and independence of the θ_i, i = 1, ..., m, the recursive equation (6.39) for estimating θ_i, analogous in form to (6.38), can be obtained.
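A minimal sketch of the recursion (6.38), with θ quantized to a finite grid as the text requires. The particular densities, priors, and grid below are hypothetical choices, not from the original text:

```python
import numpy as np

# Recursive nonsupervised learning of an unknown class-1 mean via Eq. (6.38).
# Hypothetical setting: p(x/theta, w1) = N(theta, 1), p(x/w2) = N(4, 1) known,
# P(w1) = P(w2) = 0.5, true theta = 0.0.

def gpdf(x, mean, sd=1.0):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

theta_grid = np.linspace(-5.0, 5.0, 201)
posterior = np.full(theta_grid.shape, 1.0 / theta_grid.size)   # p0(theta)

rng = np.random.default_rng(2)
for _ in range(500):
    from_w1 = rng.random() < 0.5                   # true class, never revealed
    x = rng.normal(0.0 if from_w1 else 4.0, 1.0)
    mixture = 0.5 * gpdf(x, theta_grid) + 0.5 * gpdf(x, 4.0)  # bracket of (6.38)
    posterior *= mixture
    posterior /= posterior.sum()                   # denominator of (6.38)

print(theta_grid[posterior.argmax()])              # concentrates near 0.0
```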
In general, either p_0(θ_i) or P(ω_i), i = 1, ..., m, must be different. Otherwise, the computations for all θ_i's will learn the same thing (since they compute the same quantity) and the system as a whole will learn nothing.

6.3 Bayesian Learning of Slowly Varying Patterns
In Section 6.1, the parameters (for example, the mean vector of a multivariate gaussian distributed pattern class) to be learned are considered fixed but unknown. The problem in which the parameters to be learned change slowly in a random manner is treated in this section. For illustrative purposes, the problem of learning the mean vector M of a gaussian distribution is again used [2]. Assume that
the change of M is slow compared with the observation time of the learning observations, so that M changes only slightly from one observation to the next. Mathematically,

p(X/M_n) = [(2π)^{N/2} |K|^{1/2}]^{-1} exp[-½(X - M_n)^T K^{-1} (X - M_n)]   (6.40)

where M_n is a function of n and is to be learned from a sequence of classified learning observations X_1, X_2, ..., X_n. Let X_n = M_n + η_n, n = 1, 2, ..., where η_n is the (zero-mean) noise component contaminating the measurement. The η_1, ..., η_{n-1}, η_n, η_{n+1}, ... are assumed statistically independent of each other, and η_n, n = 1, 2, ..., are also independent of M_n. From the slowly varying nature of M_n, assume that M_n is developed by a random-walk process where the random steps are independent gaussian vectors Δ_n. That is,

M_n = M_{n-1} + Δ_{n-1}
M_{n+1} = M_n + Δ_n = M_{n-1} + Δ_{n-1} + Δ_n = ... = M_0 + Δ_0 + Δ_1 + ... + Δ_n   (6.41)

where Δ_n is gaussian distributed with mean zero and covariance matrix K_Δ. Roughly speaking, M_n can be considered as arising from a series of independent steps of length Δ_j (a random variable) being added together. However, the model is inconvenient to apply due to the fact that M_n as defined is the sum of a large number of identically distributed random variables; as n increases, the components of M_n become unbounded with probability 1. This difficulty can be eliminated by introducing a constant 0 < a < 1 and changing (6.41) to

M_n = aM_{n-1} + Δ_{n-1}
M_{n+1} = aM_n + Δ_n = a²M_{n-1} + aΔ_{n-1} + Δ_n = ... = a^{n+1}M_0 + a^n Δ_0 + a^{n-1}Δ_1 + ... + Δ_n   (6.42)

Later, a → 1 in the final answer, so that the modification is only temporary. As in the case of supervised Bayesian learning, the successive estimates of the mean vector M_n are essentially the conditional
expectations of X_{n+1}, given the sequence of learning observations X_1, X_2, ..., X_n, that is,

M_n = E[X_{n+1}/X_1, ..., X_n] = E[M_{n+1}/X_1, ..., X_n]   (6.43)

Similarly, Φ_n is the conditional covariance of M_{n+1},

Φ_n = Cov[M_{n+1}/X_1, ..., X_n]   (6.44)

Let

M' = E[M_n/X_1, ..., X_n]   (6.45)

and

Φ' = Cov[M_n/X_1, ..., X_n]   (6.46)

By iterative applications of Bayes' formula, as was done in Section 6.1, the following results are obtained:

M' = K(Φ_{n-1} + K)^{-1} M_{n-1} + Φ_{n-1}(Φ_{n-1} + K)^{-1} X_n   (6.47)

and

Φ' = K(Φ_{n-1} + K)^{-1} Φ_{n-1}   (6.48)

Since

M_{n+1} = aM_n + Δ_n   (6.49)

then

M_n = aM' = aK(Φ_{n-1} + K)^{-1} M_{n-1} + aΦ_{n-1}(Φ_{n-1} + K)^{-1} X_n   (6.50)

and

Φ_n = a²Φ' + K_Δ = a²K(Φ_{n-1} + K)^{-1} Φ_{n-1} + K_Δ   (6.51)

For large n, Φ_n ≈ Φ_{n-1}; solving (6.51) for Φ_n with slowly varying mean M_n (a ≈ 1), we obtain

Φ_n ≈ (K_Δ K^{-1})^{1/2} K   (6.52)

Using (6.52) in (6.50) and expanding [I + (K_Δ K^{-1})^{1/2}]^{-1} in a Neumann series [16], we have

M_n ≈ [I - (K_Δ K^{-1})^{1/2}] M_{n-1} + (K_Δ K^{-1})^{1/2} X_n   (6.53)
Equation (6.53) again shows that the new estimate M_n is a weighted average of the a priori mean vector and the sample information. A special example is given to bring out the significance of the results. Let

K_Δ = β²K   (6.54)

and (6.53) becomes

M_n = (1 - β) M_{n-1} + β X_n   (6.55)
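Equation (6.55) is a familiar exponential-smoothing update. A minimal scalar sketch follows (not from the original text; K = 1 and β = 0.1 are hypothetical values):

```python
import numpy as np

# Tracking a slowly drifting mean with Eq. (6.55). With K = 1, the
# random-walk steps of (6.41) have std beta, so that K_delta = beta^2 K.
rng = np.random.default_rng(3)
beta, M_true, M_hat = 0.1, 0.0, 0.0
for n in range(1000):
    M_true += rng.normal(0.0, beta)         # random-walk drift of the mean
    x = M_true + rng.normal(0.0, 1.0)       # noisy learning observation
    M_hat = (1 - beta) * M_hat + beta * x   # Eq. (6.55)
print(M_true, M_hat)                        # M_hat follows the drifting mean
```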
From (6.55), the slowly varying M_n is tracked by adding βX_n to an attenuated version of the previous estimate M_{n-1} as new learning observations arrive. It is noted that if M_n is stationary, then K_Δ → 0 as a → 1. Consequently, (6.50) and (6.51) reduce to (6.7) and (6.8), respectively.

6.4 Learning of Parameters Using an Empirical Bayes Approach
In the previous sections of this chapter, Bayesian estimation techniques have been applied to the estimation of unknown parameters θ in a probability distribution function when the a priori distribution of θ is assumed to have a convenient form (reproducing distributions). If θ itself is a random variable and its a priori distribution P(θ) is unknown, a more general formulation based on the empirical Bayes approach is suggested [17], [18]. In this section, the estimation of unknown parameters in a probability distribution using the empirical Bayes approach is presented. It is known that the unconditional distribution function of X can be expressed as

P(X) = ∫ P(X/θ) dP(θ)   (6.56)

where P(X/θ) is the conditional distribution of X given θ. Let the estimate of θ be of the form φ(X). Then

E{[φ(X) - θ]²} = ∫∫ [φ(X) - θ]² dP(X/θ) dP(θ)   (6.57)

which is a minimum when

φ(X) = φ_P(X) = E(θ/X) = ∫ θ dP(X/θ) dP(θ) / ∫ dP(X/θ) dP(θ)   (6.58)
The random variable φ_P(X) defined by (6.58) is the Bayes estimator of θ corresponding to the a priori distribution P(θ). Equation (6.58) is of course the expected value of the a posteriori distribution of θ given X. If P(θ) is known, then (6.58) is a computable function. If P(θ) is unknown, let θ_1, ..., θ_n be the sequence generated corresponding to the sequence of learning observations X_1, ..., X_n. Assume that θ_n, n = 1, 2, ..., are independent with common distribution P(θ) and that the distribution of X_n depends only on θ_n. At the nth stage of the estimation process, i.e., after taking n learning observations X_1, ..., X_n, if the previous values θ_1, ..., θ_{n-1} are by now known, the empirical distribution function of θ can be formed as

P_{n-1}(θ) = [number of terms θ_1, ..., θ_{n-1} which are ≤ θ] / (n - 1)   (6.59)

The estimate of θ_n, θ̂_n, can then be obtained from (6.58) by replacing P(θ) by P_{n-1}(θ), i.e.,

θ̂_n = φ_{P_{n-1}}(X_n)   (6.60)

Since, as n → ∞, P_n(θ) → P(θ) with probability 1, θ̂_n → φ_P in the limit. In many practical situations, the sequence θ_1, ..., θ_{n-1} is not available. It is possible to infer from the sequence of learning observations X_1, ..., X_n the approximate form of the unknown P(θ), or at least to approximate the value of the functional of P(θ) defined by (6.58). Consider that for any given X, the empirical frequency ratio

P_n(X) = [number of terms X_1, ..., X_n which = X] / n   (6.61)

tends to P(X) with probability 1 as n → ∞. From (6.56), for certain classes of P(θ) and the kernel P(X/θ), an approximation of P(θ) can be obtained. The following examples are chosen to illustrate the procedure.

Example 1. P(X/θ) is the Poisson distribution, i.e.,

P(X/θ) = e^{-θ} θ^X / X!,   X = 0, 1, ...;   θ > 0   (6.62)
P(θ) belongs to the class of all distribution functions on the positive real axis. In this case, from (6.56),

P(X) = ∫_0^∞ (e^{-θ} θ^X / X!) dP(θ)   (6.63)

Then, from (6.58),

φ_P(X) = ∫_0^∞ θ e^{-θ} θ^X dP(θ) / ∫_0^∞ e^{-θ} θ^X dP(θ)   (6.64)

From (6.63) and (6.64), the following relation can be written

φ_P(X) = (X + 1) P(X + 1) / P(X)   (6.65)

Let

η_n(x) = (x + 1) [number of terms x_1, ..., x_n which = x + 1] / [number of terms x_1, ..., x_n which = x]   (6.66)

Then, regardless of the unknown P(θ), we have, as n → ∞,

η_n(x) → φ_P(x)   with probability 1   (6.67)

This suggests using as an estimate of the unknown θ_n the computable quantity η_n(x_n), in the hope that as n → ∞,

E[η_n(x_n) - θ_n]² → E[φ_P - θ]²
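The estimator (6.66) requires nothing beyond the empirical counts of the observed x values. In the sketch below (not from the original text), the two-point prior is a hypothetical choice used only to generate the data:

```python
import numpy as np
from collections import Counter

# Empirical Bayes estimation for the Poisson case, Eq. (6.66).
rng = np.random.default_rng(4)
thetas = rng.choice([1.0, 5.0], size=20000)   # theta_i from the unknown P(theta)
history = rng.poisson(thetas)                 # x_i ~ Poisson(theta_i)
counts = Counter(history.tolist())

def robbins_estimate(x):
    """Eq. (6.66): (x + 1) * #{x_i = x + 1} / #{x_i = x}."""
    return (x + 1) * counts[x + 1] / max(counts[x], 1)

print(robbins_estimate(2))   # approaches the Bayes estimator (6.65) as n grows
```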
If θ has all its probability concentrated at a single value θ_0, i.e.,

p(θ) = δ(θ - θ_0)   (6.68)

the Bayes estimator of θ is

φ_P(X) = θ_0   (6.69)

which does not involve X at all. Hence,

E[φ_P - θ]² = 0   (6.70)

and

E[(X - θ)²] = E(θ) = θ_0   (6.71)
Example 2. P(X/θ) is the binomial distribution

P(x/θ) = C(r, x) θ^x (1 - θ)^{r-x},   x = 0, 1, ..., r   (6.72)

where r is the total number of trials, x the number of successes, and θ the unknown probability of success in each trial. P(θ) may be taken from the class of all distribution functions on the interval (0, 1). In this case,

P_r(x) = ∫_0^1 C(r, x) θ^x (1 - θ)^{r-x} dP(θ)   (6.73)

and

φ_P(x) = ∫_0^1 θ^{x+1} (1 - θ)^{r-x} dP(θ) / ∫_0^1 θ^x (1 - θ)^{r-x} dP(θ)   (6.74)

From (6.73) and (6.74), we can write

φ_{P,r-1}(x) = [(x + 1)/r] P_r(x + 1) / P_{r-1}(x)   (6.75)

Let

P_{n,r}(x) = [number of terms x_1, ..., x_n which = x] / n   (6.76)

then P_{n,r}(x) → P_r(x) with probability 1 as n → ∞. Now consider the sequence of learning observations x_1', x_2', ..., x_n', where x_k' denotes the number of successes in the first (r - 1) out of the r trials which produced x_k successes, and let

P_{n,r-1}(x) = [number of terms x_1', ..., x_n' which = x] / (n - 1)   (6.77)

Thus, as n → ∞, P_{n,r-1}(x) → P_{r-1}(x) with probability 1. Define

η_{n,r}(x) = [(x + 1)/r] P_{n,r}(x + 1) / P_{n,r-1}(x)   (6.78)
then, as n → ∞,

η_{n,r}(x) → [(x + 1)/r] P_r(x + 1) / P_{r-1}(x) = φ_{P,r-1}(x)   with probability 1   (6.79)

This means that if η_{n,r}(x_n') is used as the estimate of θ_n, the estimator, for large n, will do about as well as if the a priori distribution P(θ) were known. It is noted that if (6.56) is considered as a mixture distribution, then the estimation of θ and P(θ) from the sequence of learning observations X_1, ..., X_n is related directly to the problem of nonsupervised learning discussed in Section 6.2. The identifiability condition of a mixture distribution will again play an important role in this class of estimation problems [8], [18].
6.5 A General Model for Bayesian Learning Systems
Pugachev has proposed a general model for Bayesian learning systems which includes various learning schemes in one formulation [19]-[21]. With a probably slightly different interpretation, the model is presented in this section. The central idea of Pugachev's model is to consider a real teacher (or trainer) who might not know the correct answer (for example, the correct classification of a learning observation) exactly. Let the input of the learning system be X. Corresponding to the input X, let the output of the learning system be Q̂, the output of the teacher be Q', and the desired output be Q. In general, X, Q, Q̂, and Q' are vector-valued random variables. The input-output relationship of the teacher can be expressed by the conditional probability density function p_t(Q'/X, Q, Q̂). In the special case of an ideal teacher (or trainer), the teacher knows the desired output Q, i.e.,

p_t(Q'/X, Q, Q̂) = δ(Q' - Q)   (6.80)

which is independent of X and Q̂, where δ(Y) is the Dirac delta function. For any teacher, Q' in general does not coincide with Q. A simple block diagram of the system and teacher is shown in Fig. 6.1. If the teacher trains the system by demonstrating a sequence of learning observations X_1, ..., X_n, then

p_t(Q'/X, Q, Q̂) = p_t(Q'/X)   (6.81)
Fig. 6.1. A general block diagram of learning systems.
which is independent of Q and Q̂. If the teacher trains the system by evaluating the system's performance (from its output), then

p_t(Q'/X, Q, Q̂) = p_t(Q'/X, Q̂)   (6.82)

which is independent of Q. The operation of the system is represented by the conditional probability density function p_s(Q̂/X). In the special case of a Bayesian optimal system,

p_s(Q̂/X) = δ(Q̂ - Q*)   (6.83)

where Q* is the Bayesian optimal output with respect to a given criterion. For example, for a Bayesian optimal classifier,

R(X, Q*) = min_{Q̂} R(X, Q̂)   (6.84)

For illustrative purposes, the discrete case is considered here. Let p(X, Q) be the joint probability density function of X and Q; in general, the function may contain a linear combination of δ functions at the points (X, Q) to which nonzero probabilities correspond. Assume that the functions p_t, p_s, and p are known functions of their arguments and depend on a finite number of unknown parameters which form the components of a vector θ. As discussed in Section 6.1, the unknown parameters θ can be estimated (learned) using Bayes formula. Let the a priori density function of θ be p_0(θ), and the corresponding output sequences of the system and the teacher for the input sequence of learning observations X_1, ..., X_n be Q̂_1, ..., Q̂_n and Q_1', ..., Q_n', respectively. Assume that X_1, ..., X_n are independent;
then the a posteriori probability density function of θ can be expressed as

p(θ/X_1, ..., X_n; Q̂_1, ..., Q̂_n; Q_1', ..., Q_n') = K p_0(θ) ∏_{i=1}^{n} p^i(X_i, Q̂_i, Q_i'/θ)   (6.85)

where K is a normalization constant. The superscript in p^i(X_i, Q̂_i, Q_i'/θ) indicates that the function may be different at different times. In the special case of an ideal teacher, Q' coincides with the desired output Q, so that

p(X, Q̂, Q'/θ) = p_s(Q̂/X, θ) p(X, Q/θ)   (6.86)

and (6.85) becomes

p(θ/X_1, ..., X_n; Q̂_1, ..., Q̂_n; Q_1, ..., Q_n) = K_1 p_0(θ) ∏_{i=1}^{n} p_s(Q̂_i/X_i, θ) p^i(X_i, Q_i/θ)   (6.87)

In the case that a real teacher trains the system by demonstrating a sequence of learning observations,

p(X, Q̂, Q'/θ) = p(X/θ) p_t(Q'/X, θ)   (6.88)

where

p(X/θ) = ∫ p(X, Q/θ) dQ   (6.89)

Then (6.85) becomes

p(θ/X_1, ..., X_n; Q_1', ..., Q_n') = K_2 p_0(θ) ∏_{i=1}^{n} p^i(X_i/θ) p_t(Q_i'/X_i, θ)   (6.90)

In the case that the system learns using its own decisions,

p(θ/X_1, ..., X_n; Q̂_1, ..., Q̂_n) = K_3 p_0(θ) ∏_{i=1}^{n} p^i(X_i/θ) p_s(Q̂_i/X_i, θ)   (6.91)
which is independent of the teacher's output. This class of learning systems is sometimes called decision-directed learning systems. Since the optimal system is defined in the sense of the Bayes criterion, it is easily seen that in this sense the ideal teacher may not be the best one. In fact, no system can learn, in general, to reproduce exactly the desired output Q as the ideal teacher does (for example, the case of zero probability of misrecognition). It can learn only to find appropriate estimators of Q. Hence, the teacher which trains the system to find optimal Bayes estimators of the desired output should be considered the best one. The output of such a teacher coincides with the output Q* of the optimal system minimizing the average risk. Assume that an operator A(θ) can be determined such that

Q* = A(θ)X   (6.92)

so that, for the best teacher,

Q' = Q* = A(θ)X   (6.93)

Consequently, the distribution of θ is entirely concentrated on the subset of values of θ defined by the equations

Q_i' = A(θ)X_i,   i = 1, ..., n   (6.94)

If there exist r such equations with a unique solution for θ, then for any n ≥ r the distribution of θ is concentrated at one point, which corresponds to the true values of the unknown parameters θ. This can be done by solving the r equations simultaneously from r pairs of training samples (X_i, Q_i'). It is noted that two kinds of information are needed for this operation. The first kind is the knowledge of (6.94), and the second kind is the r pairs of training samples with Q_i' = Q_i*. With this amount of information, and by solving r equations as in (6.94) simultaneously, the teacher is able to learn the true values of the unknown parameters θ, and from then on the output of the teacher will be Q*, which is Bayes optimal. The Q* from the output of the teacher is then used to train the system, and, in turn, the system will approach a Bayes optimal system.

Example. Let the real teacher be a binary linear classifier as shown in Fig. 6.2. For gaussian distributed pattern classes with equal
covariance matrix, the Bayes optimal decision boundary is essentially a hyperplane.

Fig. 6.2. A linear classifier.

Let

Y = [x_1, x_2, 1]^T   and   W = [w_1, w_2, w_3]^T   (6.95)
Then the Bayes optimal output of the classifier can be expressed by

Q' = Q* = Σ_{i=1}^{2} w_i x_i + w_3 = W^T Y = A(W)Y   (6.96)

Thus, by applying three pairs of training samples (X_1, Q_1*), (X_2, Q_2*), and (X_3, Q_3*), the true values of the unknown parameters w_1, w_2, and w_3 can be obtained by solving the three equations

Q_i* = A(W)Y_i,   i = 1, 2, 3

simultaneously, with Y_i the augmented vector of (6.95) formed from the sample X_i.
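A minimal sketch of this finite-sample recovery (not from the original text; the weights and sample points are hypothetical values):

```python
import numpy as np

# Recovering the classifier weights from r = 3 training pairs (X_i, Q_i*)
# by solving the three equations above as a linear system.
W_true = np.array([2.0, -1.0, 0.5])                 # unknown w1, w2, w3
X = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]) # three pattern samples
Y = np.hstack([X, np.ones((3, 1))])                 # augmented vectors Y_i
Q_star = Y @ W_true                                 # teacher's optimal outputs
W = np.linalg.solve(Y, Q_star)                      # unique solution for W
print(W)                                            # recovers [2.0, -1.0, 0.5]
```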
It is noted that the training (or supervised learning) of the system based on the output of the real teacher using Bayesian techniques is an asymptotic process; i.e., the estimated parameter values converge to the true values only asymptotically (or after an infinite sequence of learning samples). However, with a known Bayes optimal operator A(θ) for the teacher, the teacher can learn the unknown parameters with only a finite number of training samples. (Presumably, the teacher has the capability of solving a set of r simultaneous equations.) If the system has the same structure as that of the teacher,
i.e., the Bayes optimal operator A(θ) can also be used for the system, then there will be no difference between the teacher and the system; both of them can learn the unknown parameters with a finite set of training samples. In general, the teacher is considered to have a more complicated structure than the system, or at least the capability of solving equations such as (6.94).

6.6 Summary and Further Remarks
Bayes estimation techniques have been applied to the learning (estimation) of unknown parameters in a probability distribution (or density) function. If the parameters are fixed but unknown, by assuming a convenient form for the a priori distribution of the parameters, the true parameter values can be learned through an iterative application of Bayes formula. Both supervised and nonsupervised learning schemes are discussed. In nonsupervised learning, the unclassified learning observations are considered as coming from the mixture distribution with the probability distributions of each class as component distributions. The Bayes estimation procedure can also be extended to the case where the unknown parameters are slowly time-varying. If the unknown parameters are themselves random variables with unknown a priori distribution functions, the empirical Bayes approach is suggested for the estimation of the parameters. Finally, the general Bayesian learning model proposed by Pugachev is presented. In this model, learning processes with an ideal teacher and with a real teacher can be put into one mathematical formulation. The class of reproducing distribution functions plays an important role in obtaining simple computational algorithms for the Bayesian estimation of unknown parameters. This may practically limit the applications of the Bayesian estimation techniques. In using the mixture formulation, a unique solution can be obtained only if the identifiability condition is satisfied. However, it is not obvious that efficient computational algorithms can be obtained for the estimation even if the mixture is identifiable. It would be interesting to study the mixture estimation problems from the computational viewpoint, especially in the high-dimensional cases where the number of unknown parameters is large. Similarly, it would be desirable, from a practical viewpoint, to investigate the computational algorithms derived from Pugachev's general learning model.
140
6.
BAYESIAN LEARNING
References 1. T. W. Anderson, “An Introduction to Multivariate Statistical Analysis.” Wiley, New York, 1958. 2. N. Abramson and D. Braverman, Learning to recognize patterns in a random environment. IRE Trans. Inform. Theory 8, 58-63 (1962). 3. D. G. Keehn, A note on learning for Gaussian properties. IEEE Trans. Inform. Theory 11, 126-132 (1965). 4. J. D. Spragins, Jr., A note on the iterative application of Bayes rule. IEEE Trans. Inform. Theory 11, 544-549 (1965). 5 . Y. T. Chien and K. S. Fu, On Bayesian learning and stochastic approximation. IEEE Trans. Systems Sci. Cybernetics 3, 28-38 (1967). 6. R. Bellman, “Adaptive Control Processes-A Guided Tour.” Princeton Univ. Press, Princeton, New Jersey, 1961. 7. H. Teicher, On the mixture of distributions. Ann. Math. Statist. 31, 55-73 (1960). 8. H. Teicher, Identifiability of finite mixtures. Ann. Math. Statist. 34, 1265-1269 (1963). 9. S. J. Yakowitz and J. D. Spragins, On the identifiability of finite mixtures. Ann. Math. Statist. 39, 209-214 (1968). 10. D. B. Cooper and P. W. Cooper, Nonsupervised adaptive signal detection and pattern recognition. Information and Control 7, 416-444 (1964). 11. P. W. Cooper, Some topics on nonsupervised adaptive detection for multivariate normal distributions. In “Computer and Information Sciences-11’’ (J. T. Tou, ed.). Academic Press, New York, 1967. 12. R. F. Daly, The adaptive binary-detection problem on the real line. Rept. 2003-3. Stanford Electron. Labs., Stanford, California, February 1962. 13. S. C. Fralick, Learning to recognize patterns without a teacher. IEEE Trans. Inform. Theory 13, 57-64 (1967). 14. D. F. Stanat, Nonsupervised pattern recognition through the decomposition of probability functions. Tech. Rept. Sensory Intelligence Lab., Univ. of Michigan, Ann Arbor, Michigan, April 1966. 15. J. W. Sammon, An adaptive technique for multiple signal detection and identification. In “Pattern Recognition” (L. Kanal, ed.). Thompson Book Co., Washington, D.C., 1968. 16. E. T. Whittaker and G. N. Watson, “Modern Analysis.” Cambridge Univ. Press, London and New York. 1958. 17. H. Robbins, An empirical Bayes approach to statistics. Proc. Symp. Math. Statist. and Probability, 3rd, Berkeley, 1956, 1, 157-164. Univ. of California Press, Berkely, California, 1956. 18. H. Robbins, The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35, 1-20 (1964). 19. V. S. Pugachev, A Bayes approach to the theory of learning systems. Preprints, Proc. I F A C Conf., 3rd, June 1966. 20. V. S. Pugachev, Optimal training algorithms for automatic systems with nonideal trainers. Dokl. Akad. Nauk SSSR 172, No. 5, 1039-1042 (1967). 21. V. S. Pugachev, Optimal learning systems. Dokl. Akad. Nauk SSSR 175, No. 5, 762-764 (1967). 22. H. Cramtr, “Mathematical Methods of Statistics.” Princeton Univ. Press, Princeton, New Jersey, 1961.
CHAPTER 7
LEARNING IN SEQUENTIAL RECOGNITION SYSTEMS USING STOCHASTIC APPROXIMATION
7.1
Supervised Learning Using Stochastic Approximation
Stochastic approximation is a scheme for successive estimation of a sought quantity (the unknown parameter to be estimated) when, due to the stochastic nature of the problem, the measurements or observations have certain errors. A brief introduction of stochastic approximation is given in Appendix F. Supervised learning schemes using stochastic approximation are discussed in this section. In the learning of an unknown probability P(wi) from a set of classified learning observations X , , X , ,..., X , , let ni denote the number of times that the observations are from class w i, CL1ni = n and C?=,P(w,) = 1. Since the correct classifications of the learning observations are known, so is ni . If the the initial estimate of P(wi) is Po(wi), 0 Po(w,) 1, Po(wi)= I, then the successive estimates of P(w,) can be formed by the following stochastic approximation algorithm
<
xF=l
<
pn+l(wi) = pn(wi)
ni - pn(wi)], + yn+l [T
i = 1, ..., m
Since E(n,) = nP(wi), conditions
( 2 ) =P(
E-
and
wi)
are always satisfied. When {yn) is chosen such that 1
> yn > 0,
m
m
yn =
co
n=l
141
and
n=l
yn2 <
co
142
7.
LEARNING USING STOCHASTIC APPROXIMATION
the successive estimates {P,(wJ} approach the quantity P(w,), i = 1,..., m, in the mean square sense and with probability 1. I n the case of learning an unknown probability density function from the observations X , , X , ,..., X , , let p ( X ) be approximated by a finite series [l]
where y i ( X ) ,i
=
1, 2 ,..., M, is a system of orthonormal functions, i.e.,
.
=1,
2=j
.
(7.5)
The following stochastic approximation algorithm is proposed for the estimation of c i s : Ci,,+l
= Ci,,
+ Y,[Vi(X,)
- Ci,nI,
i
= 1Y.,
(7.6)
where {yn} satisfies the conditions (7.3). Therefore, as n + a, approaches c i , i = 1,..., M, in the mean square sense and with probability 1. A special case in the supervised learning of a probability density function is that the form of the density function is known but some parameters unknown. In this case, the unknown parameters can be estimated from the observations X , ,X , ,..., X , by using the stochastic approximation algorithm. An alternative discussion is presented here for the following dual purpose: to illustrate the application of stochastic approximation to parameter estimation and to show the relationship between the Bayesian estimation technique described in Section 6.1 and the stochastic approximation procedure [2]. A. Learning the Mean Vector M of a Gaussian Distribution Recall (6.7) and (6.8) M, @,
+ K)-'M,-i + = K(@,-l + = K(@,-i
@,-I(@,-i
+ K)-lX,
K)-1@,-1
(6.7) (6.8)
By adding and subtracting M,-, at the right-hand sides, (6.7) becomes M,
= Mn-1
+
@,-1(@,-i
+ K)-'(X, - Mn-1)
(7.7)
7.1.
143
SUPERVISED LEARNING
I t is noted that (7.7) is also a special algorithm of stochastic approximation. The nature of this algorithm can be easily seen in the special case when Q0 = ol-lK. Then (7.7) becomes
Mn Let X ,
=M
= w-1
+ (n + .)-'(Xm
(7.8)
- Mn-1)
+ 7, be the N-dimensional noise vector v n = ( ~ n l~, n ' ,
TnN)
satisfying the conditions of mean zero and finite variance for each component and for all n. Also, let Yn
= (n
+
(7.9)
a)-1
which satisfies conditions (7.3). Equation (7.8) then becomes
where F,
=
1 - y n = 1 - (n Fn
>0
+ a)-l satisfying and
n Fn m
=0
n=l
Since 11 Mo 11 < co is assumed for the initial estimate, and E[ll q n 1121 is bounded above by B, i.e., N
E[II ~n
1121= C E [ ( T ~ ~ )< 'I B
(7.13)
(-00
i=l
hence,
E[ll '~Mo1121
+ C E[ll m
n=l
~
n
117~ Gn E[ll Mo 117 + B
m
Yn2 n=l
< 00
(7-14)
144
7.
LEARNING USING STOCHASTIC APPROXIMATION
which verifies Dvoretzky’s condition (N4) (Appendix F). Condition (N5) is satisfied for any measurable function qw(Ml,..., M,J. Therefore, by Dvoretzky’s theorem (special case 11) lim E[ll Mn - M \Iz]
=0
(7.15)
P(lim Mn = M } = 1
(7.16)
n-rm
n+w
which simply means that (7.8) is a special case of stochastic approximation with the convergence of the extimates to the true mean vector in the mean square sense and with probability 1.
B. Learning the Covariance Matrix K of a Gaussian Distribution Rewrite the estimation equation (6.14) as
Since Kn-l
=
+
voK, C:GIXiXiT v,+n-1
then (7.17) becomes
(7.19)
which satisfies conditions (7.3). It can be shown that (7.18) is again a special algorithm of Dvoretzky’s stochastic approximation procedure [2]. As a result, the estimates obtained from (7.18) converge to the true covariance matrix (for every element) in the mean square sense and with probability 1. It is also possible to verify that the Bayesian estimation of both mean vector and covariance matrix of a gaussian distribution form a stochastic approximation algorithm. The detailed analysis is omitted here.
7.1.
145
SUPERVISED LEARNING
C. Learning the Parameter of a Binomial Distribution Equation (6.29) can be rewritten as mn = mn-1
Let = (n
3/n
+ (n + no)-'(m
(7.20)
- mn-1)
+ no)-'
(7.21)
which satisfies conditions (7.3), and (7.22)
m=P+Tn
so that
E(qn) = 0
<
E [ ( V ~ ) ~B ]
and
< co
(7.23)
Define Tn(m,'
"2
(7.24)
mn-1) = (1 - mn) mn-lt- YnP
v-**s
Then (7.20) becomes mn = Tn(m19 m2
)***!
mn-1)
in which Tn(m, s**.> mm-1)
-P I
+ ynqn I
= (1 - mn) mn-1
I nt,-1 - p I
=Fn
where Fn = 1 - yn = 1 - (n
(7.25)
-P I (7.26)
+ no)-l
satisfying
>0
F,
and
m
n F n= 0 n=l
Dvoretzky's condition (R3) (Appendix F) can be easily verified since m
W
C E[(mn~n)~I
mn2 n==l
n=1
Finally,
E[ynvn I ml
>-..) mn-11
< 03
(7.27)
=0
is true by the assumption on yn and therefore gives condition (R4). Thus, by Dvoretzky's theorem, lim E[(mn - p)2] = 0
(7.28)
P(lim mn = p} = 1
(7.29)
n-m
n-rm
146
7.
LEARNING USING STOCHASTIC APPROXIMATION
Equations (7.28) and (7.29) again conclude that the convergence of the estimates to the true value of the parameter p, in the mean square sense and with probability 1, is an immediate consequence of bringing the Bayesian estimation technique into the general framework of stochastic approximation.
D. Mean Square Error and Optimum Learning Sequence It is often desirable in practice to determine the rate of convergence for a learning algorithm in terms of the mean square error of estimation. As an example, consider the algorithm (7.10) Mn - M
Let
= (1 - YnXMn-1-
Vn2 = E[ll Mn - M
M)
+ ynvn
(7.30)
117
(7.31)
Then, from (7.30). Vn2 = (1 - Yn)"vI",,
< (1 - Yn)":-,
+ yn2E[II qn 117
(7.32)
+ yn2B
where B is the upper bound of E[ll 7, I12]. Iterating the expression (7.321, Vn2< y$B y;-$F: y,2BF&1F~i+2 F,2
+
+ +
+ + y12BF2zF32 *.*
Fn2
+ V$FCF22
***
Fa2
(7.33)
where V,Z = E[ll M , - M/I2]is the mean square error of the initial estimate M, and F, = 1 - y,. Since
(7.34) (7.33) becomes
(7.35)
7.1.
147
SUPERVISED LEARNING
+
<
That is, for sufficiently large n, Vn2 B/(n a), irrespective of the initial error and thus the rate of convergence, is in the order of l/n. As a result of the above derivation, it is possible to obtain an optimal learning sequence by choosing the {y,} such that the mean square error of the estimate is minimized at each and every iteration. For the purpose of simplicity, consider a one-dimensinal version of (7.30) for learning the mean m of a gaussian distribution. Let m, be the nth estimate of m and x, = m the noisy observation at nth stage; (7.30) becomes
+ en
mn - m
+ ynSn (1 - yJ2E[(mn-, - mI21 +
= (1
Thus, E[(mn - mI21 =
- yJ(mn-1
or
v," = (1 - yn)":-,
(7.36)
- m)
+ yn22
~n"E[(5n>7
(7.37)
where Vz = E[(m, - m)z] and a stationary noise has been assumed so that E[(&J2] = an2= uz for all n. It remains to select a sequence {y> satisfying conditions (7.3) and making the mean square error Vn2 as small as possible. This can be achieved by setting the first derivative of Vn2 with respect to y, equal to zero and solving for y,. The result is (7.38)
Let the initial mean square error be Vo2.Using (7.37) and (7.38), iterate Vn2 and y, alternatively to obtain the optimal sequence yz and the minimized mean square error Vi*, respectively, as follows: Y: =
1
n
+ (."VO2)
=-
12
1
+a
(7.39)
where 01 = u2/Vo2is the ratio of the variance of the gaussian distribution to the variance of the initial estimate of the mean. It is interesting to note that the optimal sequence yz is exactly the same weighting sequence used in (7.8) which is derived from the Bayesian estimation. The minimized mean square error Vk* is also the variance obtained in the same algorithm (7.8).
148 7.2
7.
LEARNING USING STOCHASTIC APPROXIMATION
Nonsupervised Learning Using Stochastic Approximation
In this section, the problem of nonsupervised learning is formulated in general terms as the problem of estimating parameters in a mixture distribution. The stochastic approximation procedure is applied to estimate the unknown parameters [2], [3]. For illustrative purposes, it is interesting to note that the algorithm (6.33) is also a special algorithm of stochastic approximation with {yn} being a harmonic sequence. Equation (6.33) can be rewritten as mn = m,, = mn-1
with m, = 0 where yn
+ n-l(xn - rnn-J + Yn(xn
(7.41)
- mn-1)
= rl, satisfying
conditions (7.3). Let
T n ( m l $ . * -mn-1) ? = (1 - yn) m,,-l
+
(7.42)
Ynm
where m is the true value of the mean to be learned. Then
I Tn(ml ,..., mnPl) - m I
= Fn
I mn-1
-m
I
(7.43)
where F, = 1 - yn = 1 - n-l, satisfying Dvoretzky's condition (R2). Hence, the convergence of the estimates to the true mean in the mean square sense and with probability 1 is implied by Dvoretzky's theorem. Let the follwing assumptions be made for the underlying nonsupervised learning process. (i)
There are m classes of probability distributions corresponding
to m pattern classes of which the a priori probabilities, P(q), ..., P(w,),
P(w,) = 1, are fixed but unknown. (ii) The probability distribution (or density) function of each class W , is characterized by a set of parameters O i . (iii) Learning observations are assumed drawn from the mixture distribution constructed by the component distributions, i.e., m
P(X/e ,P ) =
C P(X/ei i=l
9
mi) P ( W i >
(7.44)
where P(Xl8, P) denotes the mixture distribution function characterized by 8 = (8, , O2 ,..., 8,) and P = {P(wl),P(w2),..., P(w,)}.
7.2. (iv) There
exist
149
NONSUPERVISED LEARNING
unbiased
estimates
of
certain
statistics
H = { H ( X ) }for the mixture (e.g., first moment, second moment, etc.). The functional relationship between H and the parameter sets, 8 and P, is known. For example, an equation such as F ( H , e, P ) = o
(7.45)
is known at each stage of the learning process. (v) Additional equations relating the parameter sets, say G(8, P)= 0, are available to give a unique solution for the unknown parameter 8 and P.
If (i)-(v) are satisfied with probability 1, then the true parameters 8 and P are defined in the limit by F(H, 8, P) = 0 and G(8, P)= 0. The learning process is then reduced to that of finding the unique solution for 8 and P through the functional relationships F and G where F can be obtained from the successive estimates { H ( X ) }and G is given a priori or sought by some auxiliary estimation procedures. Examples are given in the following to illustrate the estimation of 8 an P by using stochastic approximation.
7.2.1 ILLUSTRATIVE EXAMPLES Example 1 Let m = 2, P (ol) = Pl, and P(w2)= 1 - Pl. Each component density function is characterized by its mean and variance, i.e.,
-
,Q, W i ) ,
i = 1,2 (7.46) The mixture density function characterized by 8 = {m,, u12 ,m 2 ,u22} and Plis given by p(x/wJ
&/mi
P ( X P , Pl)= PlP(x/ml >
U12,
w1)
+ (1 - P1)p(x/m2*
u22, W 2 )
(7.47)
The problem is to learn the unknown parameters 8 and Plfrom the unclassified learning observations x1 ,x2 ,..., x, generated from P(XP,
PI.
Let the first, second, and third moment of x with respect to p(@, P) be computed from (7.47) E(x) = P P l + (1 - Pl) m2 E(X2) = P 1 h 2 E ( 2 ) = Pl(ml3
+ + (1 U12)
= Pl(% - m2)
-
P1)(mz2
+
022)
+
m2
+ 3m1a12) + (1 - Pl)(m2' + 3m202')
(7.48) (7.49) (7.50)
150
7.
LEARNING USING STOCHASTIC APPROXIMATION
Consider that u12 = u22 = u2 and P, are fixed and known. Let m2 = 0 and m, be the parameters to be learned. From (7.48), m1 can be solved directly Case a
m1
(7.51)
= E(X)/Pl
It remains only to obtain the successive estimates of E(x). Stochastic approximation can be applied to give an asymptotically unbiased extimate of E(x), which will in turn give an unbiased estimate of m, . Let En(x) be the nth estimate of E ( x ) and define En(x) = En-l(x)
where yn satisfies yn > 0, by Dvoretzky's theorem,
C,"l
+ YJxn - En-l(x)I yn =
00,
C;=,yn2 < 00.
(7.52)
Thus,
lim E{[En(x)- E(x)I2} = 0
(7.53)
P{limEn(x) = E(x)} = 1
(7.54)
n+m
vb+m
which imply (7.55) (7.56)
That is, the true value of m, is learned in the mean square sense and with probability 1. Let u12 = u22 = u2 be known and m2 = 0. The problem is to learn P, and m, . Equations (7.48) and (7.49) give
Case b
(7.57) (7.58)
The solution of m, and P, can be obtained by successfully estimating the first and second moment of the mixture distribution and alternatively applying (7.57) and (7.58). The stochastic approximation procedure can be applied as in case a to give asymptotically unbiased estimates of the moments E(x) and E(x2), which in turn will give the unbiased estimates of m, and P, .
7.2.
NONSUPERVISED LEARNING
151
Case c Let m2 = 0, and let m, , P, , and u2 (= u12 = u ~ be~ the ) parameters to be learned through the first three moments. Equations (7.48), (7.49), and (7.50) become (7.59)
E(x) = Plml
+ u2 E(X3) = P1(m13 + 3m102)
E(x2) = P1ml2
(7.60) (7.61)
Solving (7.59), (7.60), and (7.61) simultaneously gives m12
- 3E(x)m1
+ 3E(x2) - E(xa)/E(x)= 0
(7.62)
Since (7.62) is a second order equation in m, , it can be readily shown that the discriminant is 9E2(x) - 12E(x2)
+ 4E(x3)/E(x)
Substituting (7.59), (7.60), and (7.61) for the moments, the discriminant becomes m12(3P, - 2)2. Since m, # 0, the condition for unique solution is P, = g. Thus, if P, = $ for the mixture distribution then the parameters Pl, u2, and m, can be learned through (7.59), (7.60), and (7.61) by defining stochastic approximation algorithms to give asymptotically unbiased estimates of E(x), E(x2), and E(x3). Otherwise, the problem will have multiple solutions. I t is noted that at each stage of the above learning process,the parameters are successively estimated through the moment estimators of the mixture distributions. The assumDtions made in these cases are essentially the constraints which sometimes have to be put on the component distributions in order to achieve a unique solution for the unknown parameters using only first, second, and third order statistics. I n the case that 012 and u22 are unknown and unequal, then higher moments of the mixture distribution will be needed to give sufficient functional relations for the unknown parameters. Usually, as in case c, multiple solutions are expected in solving the simultaneous nonlinear equations resulting from this procedure. A unique solution is attainable only if more information about the parametrs is available to assure that the solution obtained is the one characterizing the mixture distribution. These additional constraints may well be inter-
152
7.
LEARNING USING STOCHASTIC APPROXIMATION
preted as the conditions of identifiability [5] in learning the mixture distribution using stochastic approximation. Example 2 Let m = 2, P (w,) = P, , and P(w,) = 1 - PI. Each component in the mixture is a binomial distribution bi(k,pi), i = 1, 2. Let x1 , x2 ,..., x, be independently identically distributed random variables with probability function
P(x, = x) = for x = 0,
= 0,
1,..., k
(7.63)
otherwise
The problem is to learn p, and p, with the assumption that P , is known. Let the first and second moments of x respect to P(x, = x) be computed from (7.63) E(4
+ (1 4% PI[^ + k(k - 1) P17 + (1 - f'~)[kPz + k(k - 1) Pzz]
= PlkPI
E(x2)=
-P l)
(7.64) (7.65)
T o obtain the unique solution for the parameters p , and p, assume that 0 < p , < p , < 1 and require that k 3 2. Then it can be determined that (7.66) E(x)
=
T- +
1-
P,
E(x2) - E ( x )
[
k(k
-
1)
P(x)
(7.67)
Through (7.66) and (7.67) successive estimates of p , and p, can be obtained by defining stochastic approximation algorithms to establish the estimates of E ( x ) and E(x2). Since the stochastic approximation algorithms give asymptotically unbiased estimates of the moments E ( x ) and E(x2),this in turn will give the unbiased estimates of p , and p, . Note that assumptions (i) k 3 2 and (ii) p , < p , are the a priori information required to obtain the unique characterization for the mixture distribution when PI is known. The first assumption is a requirement of sample size and the second is a condition imposed on the parameters of the component distributions.
7.2.
153
NONSUPERVISED LEARNING
CASE 7.2.2 A SPECIAL
If, in (5.83), P = {P(wl),P(w2),..., P(w,)} is the only set of parameters which is unknown, then (7.68)
A stochastic approximation algorithm can be derived to directly estimate the a priori probabilities P(wd),i = 1,..., m. The problem is formulated as that of choosing P*(wi), i = 1,..., m, to minimize ~31,r91 I
=
/I
2
m
[P*(wi) - P(wi)]p(-X/wi)/ dX
+ 2X [ C P*(wi) - 11 m
(7.69)
i=l
i=l
where h is a Lagrange multiplier. The solution is obtained by simultaneously solving the system of linear equations obtained by setting the partial derivative of I with respect to P*(wi) equal to zero for i = 1,...,m, which results in the following equation DP*
= E[p(X/w)] - XU
(7.70)
where (i) D is the matrix with elements di,
=
/ p(X/w,)p(X/w,)dX,
i,j
=
1,..., m
(7.71)
(ii) P* is an m-dimensional column vector with the ith component P*(wi), i = 1,..., m, (iii) E [ p ( X / w ) ]is an m-dimensional column vector with the ith component E[p(X/wi)]= J p ( X / w , ) p ( X )dX,
(iv)
i = 1,..., m
(7.72)
U is an m-dimensional column vector of m components all equal to one.
It is required that det D,the determinant of D,is not equal to zero. The ith component of P* which satisfies (7.70) is then
(7.73)
154
7.
LEARNING USING STOCHASTIC APPROXIMATION
where Dir is the adjunct of dir in the matrix D. Since (7.70) has a unique solution, and m < CO, then P(wJ = P*(w,) = E[F,(X)] where
The a priori probability P(wJ can be estimated from the unclassified independent learning observations X,, X 2 ,..., X, drawn from the mixture p ( X ) . Let Pn(wi)denote the nth estimate of P(w,). The following stochastic approximation algorithm is proposed for the estimation of P(o,), i = 1,..., m:
+
(7.75) P,(w,)= P,&Ji) yn[F,(X,) - Pn-l(Wi>], i = 1, ..., m Po(wi)= 1. The sequence {yn} satisfies where P,,(w,) 0 with Y n > 0, (1 - Y n ) > 0, I I L (1 - Yn) = 0, and C L Yn2 < a (Dvoretzky's special case 11). By Dvoretzky's theorem, lim E[ll P, - P 112]
12-03
P{lim P, 12-03
(7.76)
=0
= P} =
(7.77)
1
where P is the m-dimensional column vector with ith component P(w,) and P, the m-dimensional column vector with ith component P,(wi). From (7.75), the expected value of the norm
11 P,
-P
112
E[II Pn - P 11'
= (P,- P)'(Pn ==
(1
- yn)'
- P)
E[II Pn-1 - f'll'l
+ ynE[II Vn 117
(7.78)
where 7, is an m-dimensional column vector with the ith component
vni = F .Z(X)- W J i )
(7.79)
qni is a random variable with mean zero and variance at. Let
(7.80)
Then from (7.78),
0 B E[Il Pn
-
P 11'
= (1 - yn)2v:-l
< ve2
+yn22
(7.82)
7.3.
155
FORMULATION OF NONSUPERVISED LEARNING SYSTEMS
Similar to (7.39) and (7.40), by iterating (7.82) for
the minimal value of Vw2is (7.83)
when (7.84)
7.3 A General Formulation of Nonsupervised Learning Systems Using Stochastic Approximation
Tsypkin [lo], [l 11 has recently proposed a general fomulation for nonsupervised learning systems using variational technique [121 and stochastic approximation. The formulation and its subsequent learning algorithm are briefly discussed in this section. Let QXi be the region in the feature space Q, associated with X mi. For each region Qxi, i = 1,...,m, a loss function L,(X, 8) is introduced where 8 = (8, ,...,em>is the set of unknown parameters corresponding to each region. The functions L,(X, e), i = 1,..., m, evaluate the loss when X - u i or X E Q ~Then ~ . the average loss (risk) due to misclassification is N
m
(7.85)
When the regions QXi, i = 1, ..., m,in the feature space Q, are disjoint, (7.85) can be written as (7.86)
where p ( X ) is the mixture density function defined by (7.87)
156
7.
LEARNING USING STOCHASTIC APPROXIMATION
By introducing the characteristic function ci(X,0)
=
1
=O
if X
E Qxi
(7.88)
if X$sZxi
(7.86) can be written in the following form: m
-
As indicated in Chapter 1, the problem of classification can be considered as that of partitioning the feature space Qx into such regions Qxi, i = 1,..., m, for which the average loss is minimal. Since the probability density function p ( X ) in (7.89) is a mixture density function the unknown parameters can be learned from unclassified learning samples. The approach used to obtain the necessary conditions for the minimum is to set the variation of R(8) with respect to the parameters O l e , k = 1,..., m, equal to zero, i.e.,
[c m
V O k W )= E
4 x 9
i=l
0) V,Li(X,
41
+ E 1c vBkEi(x, e)L,(x,el] i=l
= 0,
K
=
1,...,m (7.90)
Consider first the second term at the right-hand side of (7.90). the gradient of the characteristic function ei(X, O), Oeiei(X,e), is a multidimensional delta function [ 131. From the properties of delta functions, the second term in (7.90) is equal to zero for all X except the points X = X o which are located on the decision boundaries Aik , i, k = 1,..., m, between the regions Qxi and Qxk. Thus,
7.3.
157
FORMULATION OF NONSUPERVISED LEARNING SYSTEMS
For the points located on the decision boundary Ai,, X L,(Xo,0)
-
L,(Xo, 0)
= 0,
i, k
=
= Xo,
1,..., m
(7.92)
Hence (7.90) reduces to
vOkzqe) =E [
c m
e) v,&,(x, el]
. i ( ~ ,
2=1
k = 1,..., m
= 0,
(7.93)
Equations (7.92) and (7.93) are the necessary (but not sufficient) conditions for a minimum of the average loss. I n order to obtain the optimal decision (classification) rules, let
On the decision boundaries and with optimal parameter values O
fik(xo, e*)
e*)
e*)
= L ~ ( X O , - L,(XO,
=
o
=
0*,
(7.95)
The functions fik(Xo,O*) are different from zero within each region. Specifically, for X E QXi,
jik(x, e*) < 0,
K
=
1,..., m, K # i
(7.96)
The decision rule (7.96) is uniquely defined by the loss functions L,(X, O), K = 1,..., m, and it can be completely determined if the parameters Of ,..., 02 can be estimated on the basis of observed (unclassified) learning samples. The algorithm of estimating (learning) the parameters can be derived by considering the problem as one of finding the root of the regression equation (7.93). The stochastic approximation procedure is applied to estimate O? ,..., 0% , i.e.,
where Ok,n is the nth estimate of O$ L l
E~(X,,
= (4,n-l =
1
=0
, %n-1
,
9.a.)
~m,n-,>
if fik(Xn, On-l) otherwise
<0
for all k #
(7.98)
7.
158
LEARNING USING STOCHASTIC APPROXIMATION
If (i)
Yk,n
is a sequence of positive numbers satisfying W
c W
Yk,n =
co
and
n=l
n=l
&n
< co
(7.99)
(ii) L,(X, 0), i = 1,..., m, are differentiable functions of the parameters 8, ,..., Om , (iii) There exists a real number d such that for all real values of the parameters d1 ,..., 0,
(7.100)
(7.102)
where Rk
=
(7.103)
then the algorithm (7.97) converges to 02, k = 1,..., m, with probability 1 [lo]. Tsypkin has shown that several other nonsupervised learning algorithms [14], [15] can be formulated as special cases of the general algorithm (7.97).
7.4
Learning of Slowly Time-Varying Parameters Using Dynamic Stochastic Approximation
The problem of learning the parameters of a probability distribution has been treated as a successive estimation of conditional distributions through the optimal use of learning observations. In Sections 7.1 and 7.2, the algorithms of stochastic approximation have been proposed for the estimation of the unknown but fixed parameters. In the case where the unknown parameters are time-varying (not fixed),
7.4.
LEARNING OF SLOWLY TIME-VARYING PARAMETERS
159
these algorithms will not be able to produce the true parameter values. I n this section, algorithms developed on the basis of the dynamic stochastic approximation procedure proposed by DupaE [16] are presented for the estimation of slowly time-varying parameters [17]. A brief introduction of the dynamic stochastic approximation procedure has been included in Appendix F. Both supervised and nonsupervised learning problems are investigated by using this approach. In general, the proposed learning algorithms will consist of a two-step approximation procedure to be performed at each stage of the learning process. The first step is designed to correct the timevarying trend of the parameters being learned; the second step is made by means of an ordinary stochastic approximation procedure, based on the observation of a new learning measurement. The convergence of the algorithms can be proved by using Dvoretzky's approach [181.
7.4.1 SUPERVISED LEARNING Consider that a sequence of learning observations, XI , X , ,..., X , , is known to be from the same pattern class characterized by the conditional probability density function p(X/8,), where 8, is the set of unknown time-varying parameters. For the simplicity of illustration, a one-dimensional case is considered; let 0, be the unknown timevarying mean of p(x/8,) with known variance un2. The parameter 0, varies slowly with time according to the following situation:
en+,
+
= (1
n-1)
6,
+ w-~),
3 1
(7.104)
Denote the nth estimate of 9, by 8,. Then a two-step stochastic approximation algorithm is defined as follows: let x,+~ be the learning observation received at the (n 1)th stage,
+
first step: second step:
+ n-l)On + ~ ( n - . ) en+, = 0; + yn(xn+l - 0;) 0;
= (1
(7.105) (7.106)
where yn is a sequence of positive numbers satisfying the conditions m
n=1
(1 - m)(l
+ n-')
= 0,
(1 - yn)
> 0,
m
C yn2 < co
n=l
(7.107)
7.
160
LEARNING USING STOCHASTIC APPROXIMATION
By making use of Dvoretzky’s approach [18], it can be shown that the estimate I ! ? ~ . + ~converges to On+, in the mean square sense and with probability 1. Rewriting (7.106), k+l
+ + .-w, + O’(n-9 +
(1 - m) = (1 - Y J ( 1 =
YnXn+1
where O’(n-u)
+
= (1
and %+l =
vnfl represents the noisy
n-1)-1
+
en+,
’Y,%+l
+ Y,%+l
O(n-w)
(7.108) (7.109) (7.110)
%+l
component of the learning observation x,+~
with
and
E(qn) = 0
E(qn2)= am2
(7.111)
Equation (7.108) can be written as a Dvoretzky’s stochastic approximation algorithm &+l = w j l
with
T,(d1
,...,on)
7.”)
6,)
(1 - m)(l
1
+ Ynrln+1
(7.112)
+ .-‘)(en + O‘(nPu))+ ’YA+~ (7.113)
as the noise-free (deterministic) transformation. From (7.1 13) and
(7.104),
I
L(4,--.,Q en+, I -
=
+ .-w, + O’(n-9 +Y A + l I (1 - y,)(l + n-l) I 4, - 0, I l(1 -
- en+,
=
= F,
I 8,
where F,
=
I
(7.1 14)
+ n-l)
(7.115)
-
(1 - m)(l
0,
Suppressing the random part of (7.112) and using the relation (7.114) recursively, I &+l - en,, I = F, I 0, - en I
7.4. Thus, if
LEARNING OF SLOWLY TIME-VARYING
n:=, F, = 0 (by properly choosing y,),
PARAMETERS
then
161
On+, converges
to On+, for any finite values of 8, and 8,. Now let x, = ynvn+, . From (7.111) it is obvious that E(zn)= 0
and E(xn2)= yn2crn+,. Suppose the noise component of x,+, that its variance u:+, is bounded by u2 for all n. Then
C E(zn2)< 1 3/n2 < a W
is such
W
n=l
n=l
(7.117)
which establishes the fact that the sum of the noise variances is finite. Finally, let (7.118) vn2 = E[& - en)2] After squaring and taking the expectation of both sides of (7.108), we have the recursive relation for the man square error of the estimates
V:+,
< Fn2Vn2 + un2u2
(7.1 19)
in which the inequality comes from the substitution of the upper bound of the noise variance u2 for an2. With Bn2= yn2u2, iterating the expression (7.1 19) V:+1
< Fn2Vn2+ Bn2 < F;F;-, F:V: tF;F:-,
.**
1.-
F2B12+ * * *
+ Fn2F:-,B:-, + Fn2B:-,t B,,' n-1
2VI
+ C Bi2b2-i + Bn2
(7.120)
i=l
where (7.121) If V12< 00, it is readily seen that the right-hand side, and hence the left-hand side, of (7.120) becomes zero as n + 00. Thus, lim E[(& - en)2]
n+m
=
o
(7.122)
The convergence of 6, to 8 with probability 1 can be deduced by using the argument similar to Dvoretzky's proof for the stationary case [18].
162
7.
LEARNING USING STOCHASTIC APPROXIMATION
A further manipulation of (7.119) will result in an optimal yn sequence. The result is a minimax solution since only the maximum possible mean square error will be minimized in carrying out the optimization process. Rewrite (7.1 19) as V:+1
< [(n + l)/n12(1 -
Yn)’vn2
+ m202
(7.123)
The minimum of the right-hand side of (7.123) is achieved (by setting the first derivative with respect to yn equal to zero) at
+ u“n/(n + 1)]2 Vn2
yn =
vn2
(7.124)
Substituting the expression of (7.124) into (7.123) for yn , we obtain the following recursive relation for the mean square error (7.125)
Let the initial estimate have the mean square error V12 = E[dl and define a = a2/V2. Using (7.124) and (7.125), iterate Vn2and yn alternatively to obtain the optimal sequence y i and the minimized mean square error Vi:l , respectively, as follows: (7.126)
It should be noted that the optimal yn sequence thus obtained can be easily shown to satisfy the conditions stated in (7.107) and a simple = 0. substitution of (7.126) for yn into (7.115) gives n,mZlFn Notice from (7.127) that the value of V:* approaches zero, in the order of n-l, as n -+ co. The rate of convergence in terms of mean square error, however, is slightly slower than that of the ordinary (one-step) stochastic approximation algorithm, that is, (7.105) and (7.106) would be replaced by a single equation fin,,,
= fin
+
Yn‘n(Xn+1 - fin)
(7.128)
A quantitative comparison of these two algorithms can be made by
computing the mean square errors of the estimates resulting from
7.4.
LEARNING OF SLOWLY TIME-VARYING
PARAMETERS
163
both cases. The optimal (7.128) has been given in (7.39) and (7.40). The result of these computations are shown in Table 7.1. It is clear Table 7.1 a = l
Stationary V z l
Number of stage n
0 (initial est.)
(12
Time-varying V z z 02
1 2 3 4 5 6
0.502 0.330~ 0.250~ 0.2002 0.170~ 0.14a2
0.82 0.640~ 0.530~ 0.450~ 0.4002 0.350~
10 20 30
0.0902 0.050~ 0.030~
0.240~ 0.1402 0.090~
that the mean square error in the time-varying case is greater except in the limiting case as n+ a,where the error for both estimates vanishes. The principle of the dynamic (two-step) stochastic approximation algorithm is equally applicable to the estimation of other parameters (other than mean) provided that they are also slowly time-varying. Obvious modifications regarding the correction term in (7.106) must be made depending on the particular parameter under estimation. In general, the use of the dynamic stochastic approximation algorithm is justified when 8, is a linear or nearly linear function of n (Appendix F). It is noted that if the exact nature of O(n-p) in (7.104) is unknown and, consequently, the first step correction can be expressed as
0;
= (1
+ n-1) on + O(n-.)
(7.129)
where 6 ( n - p ) is in general different from O(n-w), then, from (7.129) and (7.106), On+, = (1 - yn>(l n-l>[en (1 n-l)-lO(n-u>l
+
+ Ynf'n+l + Ynvn+l
+ +
(7.130)
164
7.
LEARNING U S I N G STOCHASTIC APPROXIMATION
c l(1 m
n=l
-
m)[O(n-”)- O(n-’)Il <
(7.134)
Dvoretzky’s theorem can still be applied, and hence the convergence of 0, to 8, can be obtained. 7.4.2
NONSUPERVISED LEARNING
The dynamic (two-step) stochastic approximation algorithm discussed in Section 7.4.1 can be extended to the case of nonsupervised learning by using the formulation of mixture distribution (Section 7.2). The problem is then reduced to the estimation of unknown timevarying parameters which characterize the mixture distribution as learning observations are received. T h e following examples are given to illustrate the applications. Example 1 Consider a two-class classification problem with a priori probabilities P, and ( 1 - P,),respectively. Let the conditional density functions of feature measurements for each class, p(x/m,,, ; u:,,) and p(x/m,,, ; u;,,), respectively, be gaussian densities characterized by their respective means (wz1,,; m2,n) and variances ( ~ 1 2 , ;~ u;,,). Denote the time-varying parameter set (P,; m,,, ; m2,,; u;,, ; a;,,) by 0,. The mixture density function is then given by P W n ) = PnP(x/m,.n
;
4 . n ) + (1 - Pn)P(x/%*n ;4 . n )
(7.135)
7.4.
LEARNING OF SLOWLY TIME-VARYING
PARAMETERS
165
Consider the simplest case where m,, is the only time-varying parameter to be estimated, and m2,,= 0, P, = P, u?,, = u:,, = u2 (known). Let ml,n+l
= (1
+
(7.136)
n-lh1.n
which is a special case of (7.104) with p arbitrarily large. Compute the first moment of x with respect to p(x/O,) from (7.135), &(x) = Pnm1.n
+ (1 -
PnIm2,n
(7.137)
= P9.n
Then En+,(x) - (1
+ n-')
En(X) =
+ n-') Pm1.n (1 + n-')
P 9 . n - (1
= P[q,n-
ml,n]
(7-138)
That is, E,(x) varies in a similar manner as m,,, , hence satisfying the special case of condition (7.104) when p is arbitrarily large. Now, consider E,(x) as a time-varying parameter of the mixture density function and let &(x) be the nth estimate of E,(x). Apply the dynamic stochastic approximation algorithm,
+
= EXX)
&+,CX>
where EA(X) = (1
+
yn(xn+l
n-1)
- EX%))
En(X)
(7.139) (7.140)
and the yn sequence satisfies the conditions in (7.107). Using the results established [e.g., (7.124)], we obtain lim E[gn(x)- En(x)]2 = O
(7.141)
P(Iim &(x) = E,(x)} = 1
(7.142)
n-m
n+oo
That is, E,(x), and consequently m,,, = E,(x)/P from (7.137), can be learned asymptotically in mean square and with probability 1. It is also clear that, from (7.137), the same procedure can be applied to learn the unknown and time-varying P, if m1,, becomes known and time-invariant. Example 2 In Example 1, referring to (7.135), consider the case where m,,, = m, , m2,, = 0, P, = P,and m, and P are known. The
166
7.
LEARNING USING STOCHASTIC APPROXIMATION
problem is to estimate the time-varying variance Let = (1
,,a :
C T ~=, u ~ g,,
+ n-') an2
= un2.
(7.143)
Compute the second moment with respect to p(x/e,) from (7.135), En(x2)
=Pn(4.n = Pm12
+ 4 . n ) + (1 - PnI(m22.n + 4 . n )
+ an2
(7.144)
Then, using (7.143) and (7.144), we have En+,(x2) - (1
+ n-'I
U X 2 )
+ (1 + n-l) u22 (1 + n - l ) ( P m 1 2 + an2) = Pm12 - (1 + n-l) PmI2
= Pm12
-
(7.145)
- -n-lPm12
Since Pm12is a constant, E,(x2) again satisfies the condition in (7.104) with p = 1. In order to learn En(x2) and hence urn2[from (7.144)], the dynamic stochastic approximation algorithm similar to (7.139) may be applied to obtain estimates which converge to the true parameter value in mean square and with probability 1. Similar to the cases treated in Section 7.2, the method given in the above examples can also be applied to estimate unknown and timevarying parameters of the mixture distribution with component distributions other than gaussian. 7.4.3
AN ACCELERATED DYNAMIC STOCHASTIC APPROXIMATION ALGORITHM
It was noted in Section 7.4.1 that the convergence of estimates to the true parameter value for the dynamic stochastic approximation algorithm would be slightly slower than that in the ordinary algorithm for the corresponding stationary case. This degradation of performance is clearly due to the presence of time-varying trend in the parameters to be learned. Frequently, as a matter of practical significance, we would wish to speed up the rate of convergence. As an illustrative example, consider a special case of learning the mean of a probability density function with p arbitrarily large, that is, en+, = (1 n-')8, . Rewrite (7.106) as
+
on+,
- 0; = Y n ( x n + l -
0;)
(7.146)
7.4.
167
LEARNING OF SLOWLY TIME-VARYING PARAMETERS
It is seen that fewer fluctuations in the sign changes of
en+,
indicate that is still far away from On+, , whereas more frequent is near On+l. Consequently, the consign changes may indicate vergence of the dynamic stochastic approximation algorithm can be accelerated by using Kesten's acceleration scheme [ 191. Define
0;
+
e;
= (1
+
B1
= Y1
(7.149)
Bn
= Ys(n)
(7.150)
en,
where
=
Bn(Xn+1
n-1)
-0 );
(7.147)
en
(7.148)
and
Tabk 7.2
Number of stages, n
Sign of 4+,- 4,:
1 2 3 4 5 6
7
8 9 10
+ + Total corrections:
Corrections Unmodified algorithm
Modified algorithm
0.80 0.64 0.53 - 0.45 0.40 0.35 0.31 - 0.28 0.26 0.24
0.80 0.80 0.80 - 0.64 0.53 0.53 0.53 - 0.45 0.40 0.40
+ 2.80
+ 3.70
168
7.
LEARNING USING STOCHASTIC APPROXIMATION
in which @(x) = 1
=O
if x < 0 if x > O
(7.152)
This means that the modified algorithm takes a different yn every time (Oi+l - &) and - 0i-J differ in sign; otherwise yn remains unchanged. Table 7.2 shows that the modification is capable of accelerating the learning process. It gives the result that when the sign changes are not very often (four times in the first ten stages), the total correction for the estimates of the modified algorithm reaches 3.7 units while the correction for the unmodified algorithm is only 2.8 units. 7.5 Summary and Further Remarks
In this chapter, the stochastic approximation procedure has been applied to supervised and nonsupervised learning problems. The problems of estimating parameters, probability measure, and probability density functions have been treated. Mean square error is used as a performance measure of the estimation procedures. The optimal y n sequence is obtained in the sense of minimizing the mean square error at every iteration. Relationships between Bayesian estimation and stochastic approximation have been discussed. I n some cases, the Bayesian learning (estimation) algorithms have been shown to fall into the framework of Dvoretzky’s general stochastic approximation procedure. Consequently, the convergence in mean square sense and with probability 1 is guaranteed. The mixture formulation is again used for nonsupervised learning. The stochastic approximation procedure is employed to estimate the unknown parameters in a mixture distribution (or density) function. Dynamic stochastic approximation is applied to the learning (estimation) of slowly timevarying parameters. The procedure in general consists of a two-step approximation to be performed at each stage of the learning process. The first step is designed to correct the time-varying trend of the parameters being learned; the second step is made by means of an ordinary stochastic approximation procedure. In addition to Bayesian estimation, learning techniques such as linear reinforcement and the potential function method (Appendix G) have also been shown, in some cases, to fall into the general framework
7.5.
SUMMARY AND FURTHER REMARKS
169
of stochastic approximation [20-251. It may be concluded that these techniques are mathematically similar, i.e., they possess the same type of convergence and even the same type of convergence rate. However, from an engineering viewpoint, the computational difficulties involved as well as the a priori information required in each learning technique are different. Further investigation on the selection of different forms of T,(X, ,..., X,) for faster rate of convergence should be interesting. The study will offer an opportunity to investigate the optimum properties of the existing stochastic approximation algorithms and the relationship between the complexity of T,(X, ,..., X,) selected and the rate of convergence. The problem of learning nonstationary (time-varying) parameters has received more attention recently. The results obtained so far are rather restricted. A further study of a general (dynamic) stochastic approximation procedure for the estimation of time-varying parameters may give a possible solution.
References 1. Ya. Z. Tsypkin, Use of the stochastic approximation method in estimating unknown distribution densities from observations. Avtomat. i Telemeh. 27, No. 3, 432-434 (1966). 2. Y. T. Chien and K. S. Fu, On Bayesian learning and stochastic approximation. IEEE Trans. System Sci. Cybernetics 3, 28-38 (1967). 3. Z. J. Nikolic and K. S. Fu, A mathematical model of learning in an unknown random environment. Proc. Nut. Electron. Conf. 22, 607-612 (1966). 4. R. L. Kashyap and C. C. Blaydon, Estimation of probability density and distribution functions. Proc. Conf. Circuits and Systems, Ist, Asilomar, November 1967. 5. H. Teicher, Identifiability of finite mixtures. Ann. Math. Statist. 34, 1265-1269 (1963). 6. Z. J. Nikolic and K. S. Fu, Algorithm for learning without external supervision and its applications to learning control systems. IEEE Trans. Automatic Control 11, 414422 (1966). 7. E. A. Patrick and J. C. Hanock, Nonsupervised sequential classification and recognition of patterns. ZEEE Trans. Inform. Theory 12, 362-372 (1966). 8. Z. J. Nikolic and K. S. Fu, On the estimation and decomposition of mixtures using stochastic approximation. Southwestern IEEE Conf. Record (1967). 9. G . N. Saridis, Z. J. Nikolic and K. S. Fu, Stochastic approximation algorithms for system identification, estimation and decomposition of mixtures. Proc. Conf. Circuit and System Theory, 5th, Allerton, October 1967. 10. Ya. Z . Tsypkin and G. K. Kel’mans, Recursive algorithms of self-learning. Izv. Akad. Nauk SSSR, Tekhn. Kibernetika No. 5, 70-80 (1967).
170
7.
LEARNING USING STOCHASTIC APPROXIMATION
11. Ya. Z. Tsypkin, Self-learning-What is i t ? IEEE Intern. Cono., New York, March 1968. 12. M. I. Schlesinger, On arbitrary pattern recognition. “Reading Automata.” Naukova Dumka, Kiev, 1965. 13. I. M. Gel’fand and G. E. Shilov, Properties and Operations. “Generalized functions,” Vol. 1. Academic Press, New York, 1967. 14. A. A. Dorofeyuk, The algorithms of learning pattern recognition without a teacher, based on the potential function method. Aotomut. i Telemeh. 27, No. 10, 1728-1736 (1966). 15. E. M. Braverman, Potential function method in the problem of learning pattern recognition without a teacher. Aotomat. i Telemeh. 27, No. 10, 1748-1770 (1966). 16. V. DupaE, A dynamic stochastic approximation method. Ann. Math. Statist. 36, 1695-1702 (1965). 17. Y. T. Chien and K. S. Fu, Learning in nonstationary environment using dynamic stochastic approximation. Proc. Conf. Circuit and System Theory, 5th, Allerton, 1967. Monticello, Illinois. 18. A. Dvoretzky, On stochastic approximation. Proc. Symp. Math. Statist. and Probability 3rd, Berkeley, 1956, 1, pp. 39-55. Univ. of California Press, Berkeley, California, 1956. 19. H. Kesten, Accelerated stochastic approximation. Ann. Math. Statist. 29, 41-59 (1958). 20. K. S. Fu, Relationships among various learning techniques in pattern recognition systems. I n “Pattern Recognition” (L. Kanal, ed.). Thompson Book Co., Washington, D.C., 1968. 21. K. S. Fu and Z. J. Nikolic, On some reinforcement techniques and their relation to the stochastic approximation. IEEE Trans. Autom. Control 11, 756-758 (1966). 22. M. A. Aiserman, E. M. Braverman, and L. I. Rozonoer, The Robbins-Monro process. Automat. i Telemeh. 26, No. 11, 1951-1954 (1965). 23. Ya. A. Tsypkin, Establishing characteristics of a function transformer from randomly observed points. Aotomat. i Telemah. 26, No. 11, 1947-1950 (1965). 24. C. C. Blaydon, On a pattern recognition result of Aiserman, Braverman and Rozonoer. IEEE Trans. Inform. Theory 12, No. 1, 82-83 (1966.). 25. K. S. Fu, On learning techniques in Engineering Cybernetic systems, Cybernetica No. 3, 194-213 (1967). 26. R. L. Kashyap and C. C. Blaydon, Recovery of functions from noisy measurements taken at randomly selected points. Proc. IEEE 54, No. 8, 1127-1128 (1966).
APPENDIX A
INTRODUCTION TO SEQ UENTIAL A NALYSlS
Sequential experimentation (or sequential sampling procedure) is an area of statistics which is both of practical importance and also of great theoretical interest. A sequential experiment may be defined as one in which the course of the experiment depends in some way upon the results obtained. The sequential nature is in general exhibited in two ways. First, there is the sequential choice of the experiment to be performed, in that the observations or measurements used in the next and succeeding stages depend on earlier results. Secondly, a rule for termination of the experiment has to be formulated; this rule should allow the experiment to continue until there is evidence that a combination of observations taken so far will yield a near optimal decision. These two aspects, although strongly interconnected, are, separate problems. The first problem deals with the sampling plan, i.e., strategies of how to use observations as they become available, so as to find quickly the (approximate) optimal decision yield, while the second aspect deals with how, precisely, the optimal decision is to be made before terminating the experiment. Sequential analysis is usually delimited by experiments in which stopping rule considerations alone are involved. In this appendix, the main theoretical results of sequential analysis relating to its applications to pattern classification are summarized. I . Sequential Probability Ratio Test
The most important discovery in sequential analysis is Wald’s sequential probability ratio test (SPRT) [l]. The SPRT is designed to decide between two simple hypotheses. Suppose that a random variable x has a probability density functionp(xi8) where 8 is the param171
172
APPENDIX A
eter to be tested. The problem is to test the hypothesis Hl that 6’ = O1 against the hypothesis H , that 6’ = 8,. The test constructed decides in favor of either 8, or 0, on the basis of observations x1 , x2 ,.... Suppose that if Hl is true we wish to decide for Hl with probability at least (1 - e21), while if H , is true, we wish to decide for H , with probability at least (1 - el,). For a fixed sample size (nonsequential) test, the optimum solution to this problem was provided by Neyman and Pearson [2]. They have shown that for a given number of observations n, the test giving smallest eI2 (i.e., the most powerful test) depends on the likelihood ratio A, where
and the test decides to accept or reject the hypothesis H I according as A, is less than or greater than a constant. The value of this constant can be chosen to give the test the correct size e21 and in principle n can be chosen to give the test power (1 - el,). It is noted that eZ1 and e21 are the so-called “error of the first kind” and “error of the second kind,” respectively. Wald’s SPRT is analogous to this, and has an analogous optimal property. The test procedure is as follows: Continue taking observations as long as B
A (A.3) and stop taking observations and decide to accept the hypothesis H , as soon as An < B (A.4) The constants A and B are called the upper and the lower stopping boundaries respectively. They can be chosen to obtain approximately the probabilities of error el, and e2, prescribed. Suppose that at nth stage of process, it is found that A, = A (A.5) leading to the terminal decision of accepting Hl . From (A.1) and (AS),
P(XIH1) = 4
GVfz)
(A4
INTRODUCTION TO SEQUENTIAL ANALYSIS
173
which is equivalent to
where both integrations are over the region consisting of all observations that lead to the acceptance of Hl . By the definitions of e12 and e21, (A.7) reduces to
Similarly, when
&=B
(A.9)
then e21
= B(1 - e12)
(A.lO)
Solving (A.8) and (A.10), we obtain A
= (1 - e21)/e12
(A.ll)
B
= e21N - e12)
(A.12)
It is noted that the choice for stopping boundaries A and B results in error probabilities e12 and eS1 if continuous observations are made and the exact equality of (A.5) and (A.9) can be obtained. For discrite observations, the test based on the stopping boundaries given by (A.11) and (A.12) may result in error probabilities different from e12 and e21 due to the neglected excess over the boundaries. This discrepancy is of minor consequence; and choosing A and B according to (A. 11) and (A. 12) provides essentially the same protection against both kinds or errors. Nevertheless, in general, (A.13) (A.14)
It has also been noted that, in the process of proving (A.11) and (A.12), it nowhere assumes the observations to be independent. From (A.5) and (A.9), again by neglecting the excess over the boundaries, L, = log A, = log A with probability e12 when H2 is true L, = log A with probability (1 - ezl) when Hl is true when H2 is true L, = log B with probability (1 - e12) L, = log B with probability eZ1 when Hl is true
174
APPENDIX A
Let Ei(L,) be the conditional expectation of L, when Hi is true, and let (A.15)
It follows directly that = (1
&(Ln)
- ezl) log A
+ (1 -
E2(Ln) = el2 log A
Define yi
+
e21
log B
(A.16)
e12)
log B
(A.17)
1 if no decision is made up to the (i - 1)th stage = 0 if a decision is made at an earlier stage
=
Then yi is clearly a function of x1 ,..., xi-1 only and is independent of xi and hence independent of zi= z(xi). Consider the sum
c xi n
Ln =
i=l
=~
1
+
~~
Taking expectations
12
+ + znym + ~
2
(A.18)
( c air*) m
E(L,)
=E
c
i=l
m
=
E(ZiYi)
i=l
c E(Yi) m
=E(4
i=l m
= E(z)
C P(n 3 i )
i=l
= E(z) E(n)
(A.19)
Therefore, from (A. 16), the average number of observations when Hl is true can be expressed as (A.20) Similarly, from (A.17), (A.21)
INTRODUCTION TO SEQUENTIAL ANALYSIS
175
It can be shown that the SPRT terminates with probability 1 both under Hl and H2 [l], [13]. Wald and Wolfowitz [3] have shown that for assigned error probabilities (eI2, e21), the SPRT minimizes the average number of observations, El(n) and E2(n).It is noted that the operation of SPRT is essentially independent of a priori probabilities p(Hi), i = 1, 2, although the probability of error necessarily depends on the a priori data. Several authors have extended the SPRT to more general situations. Cox [4] has given an interesting example of an SPRT with dependent observations which has a higher average number of observations at both Hl and H 2 . The extension to the test of composite hypotheses has been made by Wald [l] and Cox [4], [5], respectively. Wald suggests the use of weight functions to reduce the problem involving composite hypotheses to one of simple hypotheses. Unfortunately, there is no general method available for choosing weight functions with suitable properties. The weight-functions method suggested by Wald can be interpreted as forming modified hypotheses by integrating out the nuisance parameters. Cox [4] and Armitage [6] have proposed a second way out of the difficulty of constructing sequential tests for composite hypotheses by considering a sequence formed by transforming the original observations; the transformation is so chosen that the new (transformed) sequence does not depend on nuisance parameters. The SPRT will be performed in terms of the new sequence of observations. A standard SPRT may be unsatisfactory because (i) and individual test may last longer than can be tolerated, and (ii) the average number of observations becomes extremely large if e12 and are chosen to be 2t very small. In some situations it may become virtually necessary to interrupt the test procedure and resolve between the alternative courses of action. As suggested by Wald, this can be achieved by truncating the sequential process at n = N. The new rule of the truncated SPRT will be the following. Carry out the standard SPRT until either a decision is made or stage N of the test is reached. If no decision has been reached at stage N, accept the hypothesis Hl if A, > 1, or accept the hypothesis H , if A, < 1 . Under the new rule the test must terminate in at most N stages. Truncation is a compromise between an entirely sequential test and a fixed sample size test. It is an attempt to reconcile the good features of both of them: the sequential feature of examining observations as they accumulate and the fixed-sample size feature of guaranteeing that the
176
APPENDIX A
tolerance will be met with a specified sample size or number of observations. The SPRT has been also extended to the test of three hypotheses by Armitage [7] and Sobel and Wald [8]. Armitage proposes to use all three possible SPRT simultaneously, and Sobel and Wald propose to use two of the three SPRT. For multiple hypotheses, Reed [9] has proposed a generalized sequential probability ratio test (GSPRT). Suppose there are m hypotheses. At nth stage, the generalized sequential probability ratios for each hypothesis are defined as
The decision rule of GSPRT is as follows. Compare U,(X/Hi)with the stopping boundary of hypothesis H i , A(H,), and reject Hifrom consideration if U,(X/H,) < A(H,),
i
=
1, ...,m
(A.23)
The stopping boundary is determined by the following relationship
i
=
1,..., m
(A.24)
where e,, is the probability of accepting Hi when actually Hpis true. After the rejection of hypothesis Hi from consideration, the total number of hypotheses is reduced by one and a new set of generalized sequential probability ratios is formed. The hypotheses are rejected sequentially until only one is left, which is accepted. For m = 2, the GSPRT is equivalent to the SPRT and the optimal property of SPRT is preserved. For m > 2, whether the optimal property is still valid remains to be justified. Nevertheless, the GSPRT, from a practical viewpoint, is easy to implement. 2. Bayes' Sequential Decision Procedure
In the decision theoretic formulation of a fixed-sample size Bayes decision problem (testing of m statistical hypotheses), the optimal
INTRODUCTION TO SEQUENTIAL ANALYSIS
177
(Bayes) decision d* is chosen to minimize the average loss R(P,d) where R(P, 4 R(P, d*)
P
r(H,
9
m
=
2 P(H,) r(H, ,4,
i=l
(A.25)
= M P R(P, d,)
the set of a priori probabilities ..., P(H,)}, (A.26) ,4 P(X/H,)dX
= {P(Hl),
4 =J
QX
w,
is the conditional loss or risk. L(Hi , di) is the loss incurred if the decision di is made (i-e., to accept the hypothesis Hi) when actually Hi is true. It can be shown that for the loss function L ( H , , d,)
=
the Bayes decision d* = di if
1 - s,
.
; :1
a=j
=
P(H,)p(X/H,)2 P(H,)p(X/H,)
.
(A.27)
i+j
for all j
=
1,..., m
(A.28)
Let the likelihood ratio between Hi and Hibe (A.29)
then (14) becomes d*
= di
ift (A.30)
I n the Bayes sequential decision problem, the observations are taken in sequence. At each stage, the decision-maker decides on the basis of the information thus far collected on whether to stop the process and make a terminal decision or to take another observation. If the observations were costless, the decision-maker would not alter its behavior, since it could not lose and might actually gain by taking all the observations available. However, in most practical situations, observations are costly, and the decision-maker might greatly improve t It is noted that the result is consistent with the result obtained by Neyman and Pearson as shown by (A.1).
178
APPENDIX A
its situation if at each stage of the process it balances the cost of taking future observations against the expected gain in information for such observations. Consider that there are m hypotheses Ifl, H2 ,..., H , . For any sequential sampling plan S with elements Si, j = 0, 1, ..., N, and a priori probabilities P(H,), i = 1, ..., m, the average risk (or the sequential risk) can be written as [lo] R(P, s, 4
=
cP(K)c 1 7n
N
i=l
,=I
.
sj
[C,(X)
+w, d,(X))I P(X/Hi) dX 9
(A.31)
where C i ( X ) is the cost of observations x1 ,..., xi, and d i ( X ) is the decision function based on observations x1 ,..., xi . The sequential sampling plan S is, in general, represented by a partition of Qx with elements Si, j E J = (0,1, ..., N} such that each Sjis a cylinder set over K = {r E J : 0 < r < j } . The decision function di(X) provides a terminal decision rule for Si.At each stage, the average risks for continuing the sequential process, i.e., taking an additional observation, and for stopping the process and making a terminal decision can be calculated, respectively. The Bayes decision is optimal in the sense of minimizing the average risk. Therefore, the decision of whether to continue the sequential process or to stop the process and make a terminal decision can be obtained by comparing the corresponding risks at each stage. If N is bounded, the sequential process must be terminated at N stages, i.e., at Nth stage [lo], [Ill. average risk of making a terminal decision = average risk
of taking an additional observation
(A.32)
Let pn(xl,x2,...,x,) be the minimum average risk of the entire sequential decision procedure, having observed the sequence of observations xl,x2 ,..., x, ; let C(xl ,x2 ,..., x,) be the cost of taking one more observation at the nth stage of process; and let R(xl , x2 ,..., x, ; d,) be the average risk of making a terminal decision di after taking observations xl,x2 ,..., x, . If the decision procedure terminates, the average risk is Mini R(xl ,x2 ,...,x, ; di) by employing an optimal decision rule. If it continues, taking an additional observation x,+~ , the average risk is C(X1
,x2 ,*-*, x,)
+J
Pn+1(Xl
9
x2
Y..'?
9
x,)
dP(%+l
I Xl
9.*.>
xn)
INTRODUCTION TO SEQUENTIAL ANALYSIS
179
The basic functional equation governing the sequence of average risk function is Pn(X1
9
x2
, 4 ** * 9
Continue: =
Min
C(xl, x2 ,..., x,) Pn+1(X1
,x2 *.**,
*,
9
%+l)dP(xn+,
Stop: Min R(x, ,..., x, ;di)
I x1 ,-..> 4 (A.33)
Following (A.32), at Nth stage of process, pN(xl
, x2 ,...,xN)
= Min
R(x, ,x2 ,...,xN ;di)
(A.34)
By induction backward, the Bayes sequential decision procedure can be carried out from the last stage to the initial stage of the process through a recursive relationship. Dynamic programming gives a computation procedure for this problem [12]. If N is not bounded, then the Bayes sequential decision procedure governing the infinite sequence of observations for two hypotheses is equivalent to the Wald’s SPRT. Hence, SPRT is also optimal in the sense that for given a priori probabilities it minimizes the average risk [3], [lo]. References 1. A. Wald, “Sequential Analysis.” Wiley, New York, 1947. 2. J. Neyman and E. S. Pearson, On the use and interpretation of certain test criteria for purpose of statistical inference. Biometrika Pt I, 175-240; Pt. 11, 263-294 (1928). 3. A. Wald and J. Wolfowitz, Optimum character of the sequential probability ratio test. Ann. Math. Statist. 19, 326-339 (1948). 4. D. R. Cox, Sequential tests for composite hypotheses. Proc. Cambridge Phil. SOC.48, 290-299 (1952). 5. D. R. Cox, Large sample sequential tests for composite hypotheses. SankhyE 25, 5-12 (1963). 6. P. Armitage, Some sequential tests of student’s hypothesis. J. Roy. Statist. SOC. Suppl. 9, 250-263 (1947). 7. P. Armitage, Sequential analysis with more than two alternative hypothesis and its relation to discriminant function analysis. J. Roy. Statist. SOC.Ser. B 12, 137-144 (1950). 8. M. Sobel and A. Wald, A sequential decision procedure for choosing one of three hypotheses concerning the unknown mean of a normal distribution. Ann. Math. Statist. 20, 502-522 (1949).
180
APPENDIX A
9. F. C. Reed, A sequential multi-decision procedure. Proc. Symp. on Decision Theory and Applications Electron. Equipment Develop., USAF Developm. Center, Rome, New York, April 1960. 10. D. Blackwell and M. A. Girshick, “Theory of Games and Statistical Decisions,” Chapters 9 and 10. Wiley, New York, 1954. 11. G. B. Wetherill, “Sequential Methods in Statistics,” Chapter 7. Methuen, London, 1966. 12. R. Bellman, R. Kalaba, and D. Middleton, Dynamic programming, sequential estimation and sequential detection processes. Proc. Nut. Acad. Sci. 47, 338-341, 1961. 13. C. R. Rao, “Linear Statistical Inference and Its Applications,” Chapter 7c. Wiley, New York, 1965. 14. R. H. Berk, Asymptotic properties of sequential probability ratio tests. Ph. D. Thesis, Dept. of Statist., Harvard Univ., Cambridge, Massachusetts, 1964. 15. A. Wald, “Statistical Decision Functions.” Wiley, New York, 1950. 16. R. M. Phatarfod, Sequential analysis of dependent observations. Biometrika 52, 157-165 (1965).
APPENDIX B
O P T I M A L PROPERTIES O F GENERALIZED K A R HU N E N - L O k V E E X P A N S I O N
The derivation of the optimal properties of the generalized Karhunen-Lokve expansion stated in Section 2.2 is given in this appendix. 1. Derivation of Property ti)
Let {t,hk(t)}be a set of arbitrary orthonormal coordinate functions and let (2.24) be written as
x$(t)= k
n
l
v$k#k(t)
+
&n(t),
=
1,*-*,
(B. 1)
where Ri,(t) is the remainder when the expression terminates at k = n. Define the expected value of the square of the modulus of the remainders by
f
1-1
+i)
E[I &&)l21
The problem is to find a set of coordinate functions which gives the best approximation to the random function Xi(t), in the sense of minimizing
2 P ( 4 E[I &M21
d=1
over all possible expansions having the same number of terms. 181
182
APPENDIX B
In terms of (B.1) m
m
From (2.27) and (2.24), we have m
m
OPTIMAL PROPERTIES OF K A R H U N E N - L O ~ E EXPANSION
183
Substitute (2.27), (B.3), and (B.4) into (B.2),
2. Derivation of Property (ii)
Let Xi(t) be square integrable and normalized+ for each random function observed and i = 1, ..., m so that
Then from (2.24), we have
m
and
t The normalization process is merely for mathematical convenience.
184
APPENDIX B
Let
which are essentially the eigenvalues of the integral equations defined in (2.30). Since pk >, 0 and
c m
k=l
m
Pk =
m
11
I vik l2
p(wi)
m
=
p(%) = 1
(B.ll)
I=1
k=l i=l
the Pk’S form a probability distribution on the generalized KarhunenL o h e coordinate functions {4k(t)}.Define an entropy function for the Pk’S of the { 4 k ( t ) ) ’
c m
H[{+k(t))l
=-
k=l
Pk
(B.12)
logPk
If the pL)s are ordered such that p1
2 p2 2
*”
2 Pk 2 Pk+l 2
*“
(B.13)
then for any other similarly ordered pk’s associated with any arbitrary set of coordinate functions {$k(t)),we have n
n
(B.14)
Hence, (B.15)
APPENDIX C
PROPERTIES O F T H E M O D I F I E D SPRT
This appendix covers the derivation of the expected value of termination time t , and the error probabilities for the modified SPRT. First, several lemmas and corollaries are given for the case of continuous time parameter. Similar results for the case of discrete time parameter may be found in Bussgang and Marcus. These lemmas and corrollaries will then be used for the derivation of the expected termination time and the error probabilities. Let p f ( t T ) and p$(t,) be the probability measures, corresponding to the stochastic processes {Xl(t),t 0} and {Xz(t),t 2 0}, respectively, of the set of sample functions { X ( t ) ,t 0) which cause the lower inequality in (3.24) to be violated for the first time at time t , . And, let p?*(t,) and p?*(t,) be the corresponding probability measures with the violation of upper inequality in (3.24). Let f f ( t , ) , i = 1,2, be the conditional probability measure that the sequential classification process terminates at time tT given that {&(t), t 2 0}, i = 1,2, is the true stochastic process (or the pattern belonging to class w 6 ) being measured at the input, and that the lower inequality is violated. Corresponding conditional probability measure f?*(t,), i = 1, 2, for the violation of upper inequality is similarly defined. Let Ef(t,) and Ef*(t,), i = 1,2, be the corresponding conditional expectations with respect to the above conditional probability measures. Then we have the following lemmas. Lemma 1
185
186
APPENDIX C
Proof Let A be the event that the SPRT terminates at the time t, , and let B be the event that the test procedure violates the upper stopping boundary when {Xl(t),t 2 0} is the true process being measured. Then
Pr(B) =
/
m
0
~T*(T)~T
(C-3)
Also,
Pr(A n B ) = p;*(t,) Therefore, f$*(tT) = Pr(A/B) = Pr(A n B)/Pr(B)
Other parts of the lemma can be proved in exactly the same manner. Lemma 2 Consider the Wald's SPRT with continuous time parameter (the assumption of neglecting the excess over the boundaries is no longer needed, since, in the continuous case, the excess is zero with probability 1) which leads to the classification of { X ( t ) }= {Xl(t)}.We have
fF*(tT) = f : * ( t , )
for
t,
>O
(C-6)
Similarly, in the case when { X ( t ) }is classified as {X2(t)},
f%)
=fz*(tT)
< <
for
tT
>0
(C.7)
Proof t t,, be the sample function which leads Let X ( t ) , 0 to the classification of { X ( t ) }= { X l ( t ) }at time t , ; then
or p(X(t,)/w,) = e B l ( t T ) P ( X ( t T ) / 4 (C.9) Let Q,(t,) be the set of all sample functions for which the above equality holds, conditioned by the fact that (3.24) was satisfied for all t < t , ; then p(X(t)/w,)dX(tT) = eBJtr)
P(X(t,)/w,) dX(tT) (C-10)
PROPERTIES OF THE MODIFIED SPRT
Or, equivalently, l(tT)
P1**W = 8
**( T)
P2
187
(C.11)
Noting that in the Wald's SPRT with continuous time parameter, (C.12)
with probability 1, so we have (C.13)
(C.14) (C.15)
By Lemma 1, (C.16) (C.17)
Substituting back into (C. 13) we obtain (1 - ezdfi**(t,> = e,,f:*(t,) e12
which is fT*(t,) = f g * ( t , ) . The other part of the lemma can be similarly proved. Corollary 1 In the Wald's SPRT with continuous time parameter, the conditional moments of the termination time t , , if they exist, are all equal, i.e.,
(C.18)
El**(t,V) = Ez**(t,')
E1*(tT') = EZ(tT'), Proof
r
>0
(C.19)
Immediately follows from the application of Lemma 2.
188 Corollary 2
APPENDIX C
For the modified SPRT, the following relations are
true:
Proof
(C.11)
Hence,
The proof of (C.21) is similar, and (C.22) follows from the application of (C.20) and (C.21). Let {L(t),t >, O} be a stochastic process whose expected value exists and is equal to E(L). Let {Z(tT),t , 2 0} be another stochastic process where Z ( t T )= JkL(t)dt and t, is a random variable whose expected value E(tT) exists. If t , and L(t), for all t, are statistically independent, i.e., p ( t T ,L(t)) = p(t,)p(L(t)), and the expected value of Z(tT)is finite, then Theorem 1
Proof
PROPERTIES OF THE MODIFIED SPRT
189
Since the expected value is assumed to be finite, apply Fubini’s theorem,
= E(L)
1
t T P ( t T ) dtT = EPT)
*
E(L)
Q.E.D.
tr
Now, let Ei(tT),i = 1, 2, be the conditional expected value of termination time t , for the modified SPRT, and let e ; $ , i # j = 1,2, be the corresponding error probabilities. Assume that ei2 and e$ are small such that 1 - e& N 1 and 1 - ehl N 1; we have
+ (1 - e&) R;**(tT)
E;(tT)= eLIE;*(tT)
N
E;**(tT)
(C.25)
Using Theorem 1 and following the path of Bussgang’s derivation, we find that
E; [ s t ’ y t ) dt] = E;(tT)E;(L) N E;**(tT)El(L)
(C.26)
0
where
El [ f r L ( t ) dt]
- (tT/T)IT*)
= ei,E;*{--b’[l
0
+ (1 - e;,) N
E;**{a’[l
E;**{a’[l - (tT/T)lrl} -
(tT/T)lrl)
(C.27)
Letting u = tT/T and equating (C.26) and (C.27), we obtain E;**(t,) E,(L) = E;**[a’(l - U ) r l ]
(C.28)
By expanding (1 - u)rl in (C.28) and neglecting all the conditional moments of u higher than the first, we have a’
E’**(tT)
=E,(L) + (rlu’/T)
(C.29)
190
APPENDIX C
To obtain the expression for error probability eiz for the modified SPRT, use the relation (C.21) of Corollary 2 withg,(t) = a’(1 - up, then
E;**[e-a’(l-u)Ti1 = 4 2 / ( 1 - 4 1 )
(C.30)
Again, the error probabilities are assumed very small and the higher order moments of u are neglected,
E;**[~-U’(~-T,U)] ei2
(C.31)
Using the first order approximation for the exponential term in (C.31) and substituting (C.29) with equality sign, we get (C.32)
Finally, consider the standard Wald’s SPRT with upper stopping boundary A = ea and lower stopping boundary B = ecb. If the error probabilities are small, then following Wald’s derivation El(tT) = u/E,(L)
and
ca- e12/(1 - e21)
= el2
Suppose a = =a‘, that is the bondaries of the standard Wald’s SPRT and the modified SPRT begin at the same value, then
(C.34)
APPENDIX D
E N U M E R A T I O N OF SOME C O M B I N A T I O N S O F T H E kj’s A N D D E R I V A T I O N O F FORMULA FOR T H E R E D U C T I O N OF TABLES R E Q U I R E D IN T H E C O M P U T A T I O N OF RISK F U N C T I O N S
We have defined that kj is the number of occurrences of the event E j , j = 1,2,..., I, and & kj = N , which is the number of stages. The total number of tables for storing the risk functions of all possible sequences of the kj’s can be enumerated by first observing the general pattern of tables k,
kz
0 0
0 0
... ... ... ...
kr-s
0 0
kv-2
0 0
...
... ... 0 0
0 0
0 0
0 0
...
kr-1
0
0
0 2
N-1 N - 2
.
N-3 N-2 N-1 N
0 0
0 0
... ... ...
0 0
1
0 1
N-1 N - 2
2
1
N-3 N-2 N-1
1
... ...
...
0
3 2
... ...
... 0
kr
...
0
191
(N
+ 1)
terms
1
0
.
1 0
( N ) terms
192
APPENDIX D
... ... ... ... ... ... ... ... ... ... ... ... ... ...
0 0
0
0
0
0
0 0
0 0
0
0
0
0 0
0 0 0
0 0
0 0
0
0
...
0 0
0
2 2
2
1
N-4 N-3 N-2
N-2 N-3
.
( N - 1) terms
3 terms
2 1 0
0 0
0
0 0 0
N-2 N-2 N-2
0 1 2
2 1 0
...
0 0
N-1 N-I
0 1
0
2terms
...
0
N
0
0
1 term
... ... ...
the number of terms in the (Kr--$, =
0 1
1
K,) subtables is obviously
+ 2 + 3 + + ( N - 1) + N + ( N + 1) = &(N+ 1)(N+ 2)
(D.1) Now let k, , k, ,...,kr-3 be also varied. The total number of terms in the entire (k, , It, ,..., k,) table is obtained by summing over all the variables for the expression of (D.l) in which N is replaced by j j 7 - kl - k, - kr--3. This total number is equal to (n$.~l) as stated in Section 4.3. That is, ..I
the number of terms in the entire table
* = k,=O c N
c
N-kl-kr--*.-kr-4 k,=O
ENUMERATION OF
K
i
’
~
FOR - ~REDUCTION ~ ~ ~OF TABLES ~ ~
193
This equality is proven below for all r > 3 by mathematical induction. For r = 1,2,3, the number of tables can also be shown equal to (N$Iyl) by enumerating the subtables (Q, ( K , ,A,), and (kk,K, , K3) directly. Proof of (D.2)
Let r = 4. Then
+ 1)(N - kl + 2) 2 - ( N + 1)(N+ 2)+ ( W N + I)+ 2 2 - ( N + 3)(N+ 2)(N + 1)
c (N
kl-0
- A,
...
I 1.2 2
3.2.1
= ( N +4 -41- 1 ) which implies that (D.2) is true for r = 4. We shall show: (D.2) is true for some r = I implies that (D.2) is true for r = I + 1. Before doing this, we should first establish the following equality which is true under the hypothesis that (D.2) is true for some I = I: for I > 3,
Equality (D.3) can also be proved by induction. Let I = 4 N kl-0
(N -Kl
+ 1)
+ 1) + ( N )+ ( N - 1) + .* + 1 = c j = ( N + 1)(N + 2)
= (N N+l
2
5=1
=
(N + 2 ) = ( N + 4 -2)
(D.3) true for I
4-2
=4
We now show (D.3) is true for some 1 = m + (D.3) is true for l=m+1.ForI=m+1,
c c *.. c ( N - K K ,- K 2 - -Kme2 + 1) cc c “- K , - ... -Km4 + 1) N
N-kl
N-kl-..*-km-a
***
kl=O k,=O
=
k,-+
N
N-k1
N-k1-...-km
--p
*..
kl=O k,=O
km-a=O
+(N-hK, - ... - kmJ
+ + 13
194
APPENDIX D
x +[(N - k, - * - * - kn+3 =
("
+ m-1
')
+ 1)(N - k, -
-kfn3
+ 2)
by hypothesis
Thus we have shown (D.3) is true for I = m -+(D.3) is true for I = m + 1. Hence (D.3) is true for all I > 3. Now we can go back to show that (D.2) is true for r = 1 -+ (D.2) is true for I = I 1. For r = 1 + 1, we have
+
Since the first term in the right-hand side of (D.4) is ("2~') (D.4) can be written as N
1-1
N-kl
N-k1-.---
kl=O kz=O
kz-+
k 1-4
ENUMERATION OF =
v v ... L N
K
N-k1
N-kl-***-k
kl=O ks=O
khS=0
- L
v L
j
'
~
- FOR~
REDUCTION ~ ~ ~ OF~ TABLES ~ ~
195
z-4
and applying the equality (D.3), giving
c c ..- c N
N-kl
N-kl-**.-k
k,=O ke=O
t-4
khcs=O
-("+I--) 1-1
-(N+I-2)
1-2
Substituting this relation into (D.5) we have 1-4
1-1
kl=O kz=O
k,,=O
(D.5) Again by applying (D.3) in the same manner, (D.5) can be successively reduced to
= c (N + E - x 1 N+l
x=l
1-1
which means that (D.2) is true for r = I for all I > 3 and the number of tables is
+ 1.
Thus (D.2) is true
APPENDIX E
C O M P U T A T I O N S REQUIRED F O R T H E FEATURE O R D E R I N G AND PATTERN CLASS1FICAT10N E X PERI ME N T S U S I N G DYNAMIC P R O G R A M M I N G
This appendix presents the detailed computational procedure for the example in Section 4.6.The quantization of the probability space into 210 quanta was used at each stage of the classification process, including the calculation of decision boundaries for the ordering of features. For example, at the start of the eighth stage of the process, seven features have already been measured and one remains. For each of the possible remaining features a decision boundary must be determined. The possible decisions include the choice of classifying (making a terminal decision) the input pattern or taking the last available feature measurement on the basis of (4.20). therefore, at this stage of the process eight sets of decision boundaries must be calculated, one for each feature which could be the remaining feature at the start of the eighth stage. Figure E.1 shows, as an example, the decision boundary obtained for the case where only feature f3 remains to be measured. The number “3” in the quantum of Fig. E.l(a) indicates that it is expected to have smaller risk to measuref, than to make a classifying decision, while a letter indicates that it is the classifying decision which should be made. The procedure continues in the same manner for the seventh stage. At the start of the seventh stage, two features are available for measurement. Since there are (:) = 28 different possible pairs of eight features, twenty-eight decision boundaries must be calculated by using (4.20). At each succeeding stage in the process, analogous computations are encountered. Table E. 1 shows the number of decision boundaries 196
0.153 0.173 0.152
0.226 0.183 0.140
0.iTI 0.230 0.i76 0.130 0.300 0.248 0.222 0.166 0.130 0.334 0.299 0.253 0.189 0.158 0.104 0.370 0.323 0,286 0.221 0.170 0.119 0.087 0.400 0.349 0.299 0.243 0.199 0.147 0.099 0.058 0.396 0.363 0.301 0.264 0.209 0.164 0.122 0.073 0.025
__c
P(D)
Fig. E.1. (a) The feature selection decision surface. The numbers in each quantum indicate .that feature 3 is to be measured (3) or that a classification decision is to be made (D, J, P ) . (b) Expected cost of feature selection or classification.
+ Read in training samples
t 7 1 ~
-
Calculate cost of making classification decision for each quantum in probability space (210 quanta)
Store on tape and print out
Generate decision functions for each possible remaining feature for the next-blast stage, i.e., stage at which the last remaining feature is measured
Store on tape and print out I Generation of decision functions for each stage is shown more completely in Fig. E.2 (b)
Generate decision functions for the f i i t stage of the process, i.e., the best feature at the first stage of the process for each quantum in the probability space
rk-----== Store on tape and print out
Fig. E.2.
Detailed flow diagram for the generation of
199
COMPUTATIONS FOR ORDERING AND CLASSIFICATION EXPERIMENTS
Obtain a combination of three features from the 8 and number each combination distinctly so that no two different combinations have the same number
--c
NU = F3 + (Fz-1) (&-21/2+ V i - 1 ) (Fi-2)(F1-3)/6 where fi is the ith feature such that Fl > F2 >F3, and Nu is the number assigned to combination (F,, F,, 4)
t
Generate a distinct quantum in the a priori probability space and assign a number to it: PNu=L + J ( J = I ) , J = 1,20, L = l , J , where L a n d J are the row and column numbers of the quantum respectively; fNu is number assigned to quantum Calculate the a priori probability represented by the quantum
p(A)=O.OS ( 1 -O.O25OL),pIB)= 1025 -O.OW,p(C)= I - p ( A ) - p ( B )
I
Calculate the a priori probability that sample is from each of the classes given that the ith feature fell in thejth quantum; do this for each of the three features in the combination (6, Fz 4)
t
I
Find the quantum number associated with each of the a posteriori probabilities for each of the three features
f
Calculate cost of continuing for each of the three features in the combination ( F l ,F,, 4). In doing this, the proper ordering number for the remaining two features must be determined in order to locate the optimum cost function for that stage
I
f
Compire costs of continuing for each feature to determine the cheapest feature to measure if the processes were to continue
f
I
I Locate cost of a classificationdecision for the present a priori probability I 1 Is the cost of continuing greater than the cost of classifying? tYes Classify according t o decision criterion previously found Have a l l combinations of three features been tried? es
I Proceed t o the 5th stam decision functions for each stage, specifically the sixth stage.
I
200
APPENDIX E
Table E.1
NUMBER OF DECISION BOUNDARIES AND STORAGE LOCATIONS REQUIRED
Stage number
Number of decision Number of expected Number of storage risk surfaces locations required boundaries 1
8 28 56 70 56 28 8 1
256
420 = 218(1 + 1) 3360 = 218(8 + 8) 11,760 23,520 29,400 23,520 11,760 3,360 420 107,520
reqclired for each stage of the process in addition to the number of storage locations needed for each stage. Computer flow diagrams for the calculations are given in Fig. E.2. As the decision boundaries for the various stages of the process were calculated, they were stored and subsequently rearranged for use in the classification experiments. The rearrangement was necessary since the data were generated in the reverse order in which they were to be used. To make use of the decision boundaries, the memory was searched to obtain the pertinent decision boundary among the decision boundaries generated at each stage. The optimum decision, to continue or to stop the process, can then be immediately determined.
APPENDIX F
STOCHASTIC A P P R O X I M A T I O N A BRIEF SURVEY
1. Robbins-Monro Procedure for Estimating the Zero of an Unknown Regression Function
Let y be a random variable with probability distribution function H ( y / x ) depending on a parameter x. Assume that the regression function
exists, and for a real 01 the equation
has a unique root 8. By observing y at different values of x, it is necessary to estimate 8. Let the nth estimate of 8 be x,. Robbins and Monro [I] proposed the following recursive algorithm: starting with an arbitrary initial estimate x1 , then
where y(x,) is the observation y at x = x, . The following theorems give the convergence properties for the Robbins-Monro procedure [119
121.
Theorem 1
fying
(9
Let {a,} be a sequence of positive real numbers satism
m
n=1
a,, = 00
and 201
*=l
an2< m
(F.4)
202
APPENDIX F
The regression function M ( x ) satisfies the following conditions: (ii) I M(x)I (iii)
< C (i.e., M(x) is bounded by a constant C < co) [Y - M(x)l2w y / 4 < u2 < co
(F.5)
< 8, M(8) = a, M(x) > a for x > 8 (F.6) (v) M(x) strictly increasing when 1 x - 8 I < 6 for 6 > 0 inf I 6 I M(x) - a I > 0. (vi) (F-7) (iv) M(x) < a for x
124
Then lim E[(xn- 8)2]
woo
=0
(F.8)
Theorem 2 Let M(x) be the regression function satisfying the following conditions:
(i) (ii)
I M(x)l < A
J [y -
~ ( x ) 1 2d
(iii) M(x) < a (iv)
+B I x I
inf
8,<1a+t?l<8,
If, moreover, a,
<
~(y/x)
for x
for A, B
>0
(F-9)
< co
u2
< 8,
(F.lO)
M(x) > a
I M(x) - a I > 0 for any pair
> 0, C,"=la,
=
P{limx, n-tm
co and C;,l
= 8} =
1
>8 ,a2).
for x (6,
an2 <
(F.11) (F.12)
co, then (F.13)
Blum [3] and Gladyshev [4] have extended the Robbins-Monro procedure to the multidimensional case. The continuous case of Robbins-Monro procedure has been discussed by Driml and Nedoma [5], Hang and SpaEek [6], and Driml and Hang [7]. 2. Kiefer-Wolfowitz Procedure for Estimating the Extremum of an Unknown Regression Function
Following Robbins and Monro's formulation, it is necessary to estimate the unique extremum O(maximum or minimum which is known to be unique) of M(x), or equivalently, to estimate the unique root O of the equation M'(x) = 0 (F.14)
STOCHASTIC APPROXIMATION-A
203
BRIEF SURVEY
Kiefer and Wolfowitz proposed the following procedure [8] : = x,
% I + ,
fan [ A x , cn
"+"
+ cn) -
r(xn
-4 1
(F.15)
where the sign is used for estimating maximum and the sign for the minimum. It is noted that [Ax,
+
cn)
"-"
- r(xn - c n ) l / k
can be interpreted as the average slope which approximates the gradient at x, . Hence, algorithm (F.15) may be considered as a stochastic version of the gradient technique in hill-climbing. Let the regression function M(x) satisfy the following
Theorem 3
conditions:
J [Y - M(x)12dH(y/x) <
(9
u2
< 00
(F.16)
(ii) M(x) is strictly increasing when x < 8 and strictly decreasing when
> e.
There exist j? > 0 and B
> 0 such that
+ I x" - 8 I 3
implies I M(x') - M(x")l < B I x' - x" I (F.17) There exist p > 0 and R > 0 such that (iii) I x' - 0 I
I x' - x" I < p implies I M(x') - M(x")l < R
(iv) For every 8
(F.18)
> 0, there exists ~ ( 6 )> 0 such that
(v) 1 x - 8 1 > S implies
I M(x
o,~n~,z
+ 4 - M ( . - .)I €
> T(S) (F.19)
If {a,} and {c,} are sequences of positive real numbers satisfying m
(vi) then
C a,
n=1
m
m
= 03,
C a,c, < 03, n=l
and
n=1
lim I?[(%, - 8)T = 0
n+w
where {x,} is defined by (F.15).
C( s r < 00,
(F.20)
204
APPENDIX F
It is noted that conditions (F.17), (F.18), and (F.19) are the Lipschitz regulartity conditions. Blum has proved that Theorem 3 holds even when (F.17) is not satisfied. In this case, Blum has also proved that P{limxn = e} = 1 n+m
The Kiefer-Wolfowitz procedure has been extended to multidimensional case by Blum [3] and Sacks [lo], and to continuous case by Sakrison [ 1I]. 3. Dvoretzky’s Generalized Procedure
Dvoretzky [121 has suggested that any stochastic approximation procedure may be viewed as an ordinary deterministic (error-free) successive approximation method with a random noise component superimposed upon it. On the basis of this concept, a generalized stochastic approximation algorithm is proposed as xn+l
=
T n ( x l > * * *xn) ,
+ zn
(F.21)
where T,(xl ,..., x,) is the error-free transformation and z, is the random noise component. Theorem 4
satisfying
Let {a,},
{fin}, and
{y,} be nonnegative real numbers
(F.22)
lim an = 0
PI)
n+m m
(F.23) (F.24) n=l
Let T, be measurable transformations satisfying (D4) I T n ( r l > - - - ,I n ) - 0 I for all real rl , ..., r, . Also,
<
,(1
+
Pn)
I rn - 8 I
-~
n ]
(F-25)
m
(I:5)
1 E(zn2)< a
(F.26)
*=l
(D6)
E[zn I X I ,..., xn] = 0
(F.27)
STOCHASTIC APPROXIMATION-A
205
BRIEF SURVEY
with probability 1 for all n. Then, the sequence {x), defined by ) converges to the sought quantity 8 in the (F.21) with E ( x , ~< mean square sense and with probability 1, i.e. [9], [12] lim E[(xn - O ) 7 = 0, n+m
P{lim xn = O} n-1w
=
1
It is noted that if, in (F.21), Tn(% ,.*.,
Xn)
= xn
+ 4. - M ( 4 l
(F.28) (F.29)
zn = %[M(x,)-Y(Xn)l,
the procedure is reduced to the Robbins-Monro procedure. Similarly, if Tn(X, >...> Xn) = xn f (~n/Cn)[M(Xn cn) - W X n - cn)l (F.30)
+
zn = (an/Cn>[r(%
+ cn) -
WXn
+ cn) -Y(%
- Cn)
+
M(Xn
-4
1
(F.31) the procedure is essentially the Kiefer-Wolfowitz procedure. Dvoretzky’s proof of the theorem was simplified by Wolfowitz [13], revealing more of the essential structure of the process. The multidimensional generalization of Theorem 4 has been proved by Gray [14]. Two special cases of Dvoretzky’s procedure are presented in the following as they are extensively used in Section 7.1. Special Case I (real random variables). Let T, be measurable transformations
(R1)
I
Tnh ,**.,
In)
-0 I
<
FnI yn
- el
(F.32)
where {F,) is a sequence of positive numbers satisfying (F.33)t
t The condition (R2) actually also implies the requirement that bounded for all s > T . See also Molverton and Rawgen [26].
nz=rF,,is uniformly
206
APPENDIX F
with probability 1 for all n, imply lim E[(xn - 8)T
n+m
P{limxn = 8} n+m
=0 =
1
Special Case I1 (normed linear space). Suppose that X, and 2, assume values in a normed linear space 52 with 11 Y 11 denoting the norm of Y. Let 6 be an element of 52 and T, which are measurable transformations from the nth Cartesian power of 52 into 52 and assume that
(N1)
II Tn(r1 ***.)
rn)
- 8 It d Fn It rn - 8 II
(F.34)
where {F,} is a sequence of positive numbers satisfying m
HFn=O
(N2)
n=1
Define (N3)
Xn+1 =
Tn(X1 )***s Xn)
+ zn
(F.35)
Then the conditions (N4)
X1 1121 < a,
Xn) (N5) E[lldX1~****
00
1 E[II z n 1121 < a
n=l
(F.36)
+ zn 117 Q E[ll v ( X ~ ~X-n*117~ + E[ll zn 117
for every measureable function v(Xl ,..., X,) imply lim E[ll Xn - 8 11T
n-m
P{lim 11 Xn - 8 11 n+w
(F.37)
=0
(F.38)
1
(F.39)
= 0) =
Block also proposed a more general type of stochastic approximation taking place in a normed vector space [15]. 4.
Methods of Accelerating Convergence
Two approaches have been suggested for accelerating the convergence of stochastic approximation procedure. The first approach
STOCHASTIC APPROXIMATION-A
BRIEF SURVEY
207
is to accelerate convergence by selecting a proper weighting sequence {a,} or {a,} and {c,}, etc. An intelligent way of choosing the weighting sequence based on the information concerned with the behavior of the regression function, intuitively speaking, should improve the rate of convergence. Historically, the first method of accelerating the convergence of a stochastic approximation procedure was proposed by Kesten [16]. The basic idea is that when the estimate is far from the sought quantity 8 there will be few changes of sign of (x, - x,-~). Near the goal 8 we would expect overshooting to cause oscillation from one side of 8 to the other. Kesten proposed using the number of sign changes of (x, - x,-~) to indicate whether the estimate is near or far from 8. Specifically, the quantity a, is not decreasing if (x, - x,-~) retains its sign. Mathematically, the algorithm can be written in the form of Dvoretzky’s procedure where dl = a , , d2 = a 2 , d, = ds(,) , and
+1 03
s(n) = 2
@[(Xi
i=l
with @(x) = 1
=O
- x<-l)(x&l
- Xi+)]
(F.41)
if x < 0 if x > O
This means that d, is constant so long as (x, - x,-~) and (x,-~ - x,-~) have the same sign. The algorithm (F.40)converges with probability 1. Fabian has proposed the following accelerated algorithms [171: xn+1 = xn
%+l= Xn
+ an s€!n[. + (anlcn)
- Y@n)l for Robbins-Monro procedure
Sgn[Y(Xn
+ 4-
(F.42)
-4 1
for Kiefer-Wolfowitz procedure
(F.43)
Algorithms (F.42) and (F.43) converge to their sought quantities, respectively, only in a comparatively narrow class of problems in which the distribution function of the random variable y is symmetric with respect to 8. Another scheme of accelerating convergence proposed by Fabian is the application of an analogy of steepest descent method. The scheme can be summarized as follows. For given x,
208
APPENDIX F
and y(xn), take a series of (noisy) observations V , of the quantity M(x, kay(x,)), k = 1,2,... . Assume the Vk’s are independent of x, and yn . Select a, = ka when
+
sgn V , = sgn V ,
=
... = sgn V,,
= sgn
V , = -sgn V,,
(F.44)
for Robbins-Monro procedure, or when V,
> Vz > ..* > V,-1 > V , < V,,,
(F.45)
for Kiefer-Wolfowitz procedure. Under rather general conditions on V , , the suggested scheme converges with probability 1. The second approach for accelerating convergence is by taking more observations at each stage of iteration. Intuitively speaking, taking mor observations at each stage will explore the regression function more in detail than the original stochastic approximation procedure [18], [19], and, consequently, the extra information can be utilized to improve the rate of convergence. Venter and Fabian have proposed accelerated algorithms for the Robbins-Monro and Kiefer-Wolfowitz procedures, respectively [20], [21]. For illustrative purposes, Venter’s procedure is that of estimating the slope of the regression function at the root by taking two observations at each stage and using this information to improve the rate of convergence and the asymptotic variance of the Robbins-Monro procedure. The recursive algorithm
where y l and y: are random variables with their conditional distributions given y; , y i , k = 1,..., (n - l), independent and identical to that of y(xn c,J and y(xn - c,), respectively. {c,} and {d,} are two sequences of positive numbers. A, is an estimate of the slope a defined as follows: assume that 0 < a < a < b < co with a and b known, let
+
n
B,
=
n-l
and A,
=a = B,
=b
C ( y ; - $)/2~, j=1
(F.47)
if Bn < a
otherwise if B n > b
(F.48)
The algorithm (F.46) converges with probability 1. If also E(x12)< co, it converges in the mean square sense.
STOCHASTIC APPROXIMATION-A
209
BRIEF SURVEY
The same idea can be carried over to the Kiefer-Wolfowitz procedure. I n this case, three observations are taken at each stage of itereation, and the appropriate second order differences of the observations are used to estimate the second order derivative of the regression function at the maximum (or minimum). This information would then be utilized to determine the next estimate x,+, of the maximum (or minimum). I n a similar idea proposed by Fabian, the Kiefer-Wolfowitz procedure can be modified in such a way as to be almost as speedy as the Robbins-Monro procedure. The modification consists of taking more observations at every stage of iteration and utilizing this information to eliminate (smooth out) the effect of all higher order derivatives of the regression function. 5. Dynamic Stochastic Approximation
Fabian and DupaE have considered the case in stochastic approximation where the sought quantity 8 moves during the iteration process. The following presentation is based on DupaE’s discussion [22].
A. The ModiJied Robbim-Monro Procedure Let M,(x)
= M(x - 0,
+ el),
n = 1,2, ...
(F.49)
such that 8, is the unique root of M,(x) = 0. Let {a,} be a sequence of positive numbers, and let x1 be an arbitrary random variable (initial estimate). Define xn+, = x,* - u,y(x,*), n = 1,2,..., (F.50) where (F.51)
(F.52)
E[Y(Xn*)
and WY(X,*) x,
,..., x,] \< u2 < ca
(F.53)
+
The meaning of the algorithm (F.50) is the following: at the (n 1)th stage of iteration an estimate of On+, is determined. Start from the preceding estimate x, , first make a correction based on (F.51), then estimate the value of M,,, at xf by means of the observation y(xz)
210
APPENDIX F
and, finally, take a further correction, -a,y(x,*). It will be seen from Theorem 5 and its corollary, that the use of the algorithm is justified when 8, is a linear (nearly linear) function of n. Theorem 5
Suppose that the following conditions are satisfied:
(i) M(x) < 0 for x
< 0,
and
M(x) > 0 for x
> 0,
(F.54)
There exist K O ,Kl such that (ii) KOI x - 8,
For n
=
(iii)
I
< I M(x)I < Kl I x - 8,
for -cn
I
(F.55)
1, 2,..., a, = ulna,
a
> 0,
< x < +cn
4 < <1
(F.56)
OL
8, varies in such a way that ( 3
en+,
- (1
+ n-y,= o(n-a),
>a
(F.57)
Further, E(xl2) < 00
(4
(F.58)
Then (x, - 0,) approaches zero in the mean, and EL(%,- e,)~]
o(n-3
for
w
- O(n-2(~-")) for
w
=
Qa
<~
O L
(F.59)
Corollary Under the assumption of Theorem 5, let 0, be a linear function of n, then
Let 8, be proportional to a, then E[(xn -
= O(n-a)
for
4 < OL < 1
(F.61)
The mean square convergence (as well as convergence with probability 1) of the algorithm (F.50)can also be deduced from Dvoretzky's theorem, even under slightly more general conditions on 0,.
STOCHASTIC APPROXIMATION-A
BRIEF SURVEY
21 1
Under the assumptions (F.50), (F.52), (F.53), (F.54), (F.55)and (F.58),and replacing conditions (F.56)and (F.57)by
Theorem 6
m
C a,, < co
lim nu, = 00,
n-)m
en+,
- (1
(F.62)
,=l
+ n-')e,
(F.63)
= O(a,)
it holds that lim E[(xn- 0,)q
=0
(F.64)
P{lim x, = 0,) = 1
(F.65)
n-ta
n-m
B. The Modified Kiefer- Wolfom-tzProcedure
+
Let M,(x) = M ( x - 0, el), n = 1, 2,..., since 8, is the unique maximum of M,(x). Let {u,} and {c,} be two sequences of positive numbers, and let x1 be any arbitrary initial estimate. Define %,+I
= x,*
+ (am/cn)[y(xn* + c,) +
+ c,)
where x$ = (1 n-l)x, , and y(xz variables such that their conditional are M,+,(x,* c,) and M,+l(xz - c,), variances are bounded by a constant independent.
+
n = 1,2,... (F.66)
- c,)],
-y(x,*
and y(x$ - c,) are random expectations, given x1 ,...,x, , respectively, their conditional uz,and they are conditionally
Theorem 7 Suppose that the following conditions are satisfied: M(x) is increasing for x < and decreasing for x > O1 There exist
.
K , , K 3 ,K4 such that
K , I x - 4 I d I W x ) l < K3 I x I M"(x)l < K4 for -co For n
=
- 01 I
< x < +00
(F.67)
1, 2,..., a, = a/na,
a
> 0,
c, = c/ny,
c
> 0,
Q O1
< 01 < 1
01
(F.68)
-
8, varies in such a way that
em+, - (1
+ n-l)e,
= O(n-m),
w
> 1~
(F.69)
212
APPENDIX
F
Further, E ( x I 2 )< co. Then (xn - On) approaches zero in the mean, and E[(xn - 4JZ]= O(n-(a-zv)) = O(n-.2(0-ol).)
for
w
> $a - y
for
w
< 3:.
-y
(F.70)
References
1. H. Robbins and S. Monro, A stochastic approximation method. Ann. Math. Statist. 22, No. 1, 400-407 (1951). 2. J. A. Blum, Approximation methods which converge with probability one. Ann. Math. Statist. 25, NO. 2, 382-386 (1954). 3. J. A. Blum, Multidimensional stochastic approximation procedures. Ann. Math. Statist. 25, No. 4, 737-744 (1965). 4. E. G. Gladyshev, On stochastic approximation. Teor. Veroyatnost. i Primenen. 10, No. 2, (1965). 5. M. Driml and N. Nedoma, Stochastic approximations for continuous random processes. Trans. Conf. Inform. Theory Statist. Dec. Functions, Random Processes, 2nd, Prague, 1960. 6. 0. Hang and A. SpaEek, Random fixed point approximation by differentiable trajectories. Trans. Conf. Inform. Theory, Statist. Dec. Functions, Random Processes, 2nd, Prague, 1960. 7. M. Driml and 0. Hang, Continuous stochastic approximations. Trans. Conf. Inform. Theory, Statist. Dec. Functions, Random Processes, 2nd, Prague, 1960. 8. J . Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23, No. 3, 462-466 (1952). 9. J. A. Blum, A note on stochastic approximation. Proc. Amer. Math. SOC.9, 404-407 (1958). 10. J. Sacks, Asymptotic distribution of stochastic approximations. Ann. Math. Statist. 29, NO. 2, 373-405 (1958). 11. D. L. Sakrison. A continuous Kiefer-Wolfowitz procedure for random processes. Ann. Math. Statist. 35, No. 2, 59C599 (1964). 12. A. Dvoretzky, On stochastic approximation. Proc. Symp. Math. Statist. and Probability Jrd, Berkeley, 1956, 1. Univ. of California Press, Berkeley, California, 1956. 13. J. Wolfowitz, On stochastic approximation methods. Ann. Math. Statist. 27, 1151-1 156 (1956). 14. K. B. Gray, Application of stochastic approximation to the optimization of random circuits. Proc. Symp. Appl. Math. 16th, 1964, 16. Am. Math. SOC.,Providence, Rhode Island 1964. 15. H. D. Block, On stochastic approximation. Unpublished Rep. Dept of Math., Cornell Univ., Ithaca, New York, 1956. 16. H. Kesten, Accelerated stochastic approximation. Ann. Math. Statist. 29, No. 1, 41-59 (1958). 17. V. Fabian, Stochastic approximation methods. Czechoslooak Math. J. 10, No. 1, 123-1 59 (1 960).
REFERENCES
213
18. D. Burkholder, On a class of stochastic approximation processes. Ann. Math. Statist. 27, No. 4, 1044-1059 (1956). 19. H. D. Block, Estimates of error for two modification of the Robbins-Monro stochasticapproximationprocess. Ann. Math. Statist. 28, No. 4,1003-1010 (1957). 20. J. H. Venter, An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38, NO.1, 181-190 (1967). 21. V. Fabian, Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Statist. 38, No. 1, 191-200 (1967). 22. V. DupaE, A dynamic stochastic approximation method. Ann. Math. Statist. 36, 1695-1702 (1965). 23. D. J. Wilde, “Optimum Seeking Methods.” Prentice-Hall, Englewood Cliffs, New Jersey, 1964. 24. V. Fabian, A stochastic approximation method for finding optimal conditions in experimental work and in self-adapting systems. Aplikace Matematiky, 6 , 162-183 (1961). 25. L. Schmetterer, Stochastic approximation. Proc. Symp. Math. Statist. and Probability, 4th, Berkeley, 1961, 1. Univ. of California Press, Berkeley, California, 1961. 26. C. T. Molverton and J. T. Rawgen, A counterexample to Dvoretzky’s stochastic approximation theorem. IEEE Trans. Inform. Theory 14,157-158 (1968).
APPENDIX G
T H E M E T H O D O F POTENTIAL FUNCTIONS O R R E P R O D U C I N G KERNELS
The potential function method introduced and studied extensively by Aiserman, Braverman, and Rozonoer has been used for successive approximations of unknown uniformly bounded continuous functions which may be either deterministic or stochastic. The unknown function, for example, may be a discriminant function, a response function of a system, or a probability distribution (or density) function. In this appendix the method of potential functions is briefly introduced [ 11-[5], and several applications are described. Consider a function f ( X ) of which the exact behavior is unknown. The only information available is some knowledge about the value off(X) at certain points, X , , X, ,..., X , . The problem of learning is, from the available information, to construct a function which converges to f ( X ) in a certain sense. The space of X, Qx , is in general multidimensional and the points of observation, XI ,..., X , , cannot be chosen at will but occur independently in a random manner. Also, the information off(Xi), i = 1, ..., n, may be noisy or only partially measurable [for instance, only the sign off(X,) can be measured]. Hence, the usual extrapolation techniques are practically inapplicable for solving the learning (estimation) problems. This is where the application of potential function method is suitable. The general formulation of the potential function method can be stated as follows. Let y i ( X ) ,i = 1,2,..., be a complete set of functions defined on Qx . Suppose that the function f (X) to be learned can be represented by the expansion (G. 1)
The coefficients ci are unknown a priori. The learning measurements or observations X, ,..., X , are assumed statistically independent and 214
215
POTENTIAL FUNCTIONS OR REPRODUCING KERNELS
distributed according to an unknown probability density function p(X). A function of two variables, called the “potential function,” is introduced as
c hi2pi(X)Vi(Y) 00
K(x,Y ) z=
(G.3
i=l
where the h i s are real numbers chosen in such a way that the function
K ( X , Y) is bounded. After n observations, X, ,..., X , , are taken,
the nth estimate of the functionf(X), denoted byf,(X), can be computed from the following general algorithm fn(X) =fn-l(X)
+ rnK(X,
(G.3) withf,(X) = 0. Y, is dependent upon the type of information received aboutf(X,). The function f ( X ) is assumed to be sufficiently smooth so the condition
c (Ci/hd2 <
xn)
m
i=1
((3.4)
03
is always satisfied. Particularly, if f ( X ) can be represented by a finite-sum expansion
c M
f ( X >=
i=l
((3.5)
CiVi(X)
then the potential function may be chosen to be M
K(X. Y )=
1 hi2Vi(X)V i ( Y )
(G.6)
i=l
and the condition (G.4) is automatically satisfied for addition, it also follows that
c
# 0. In
M
fnV) =
i=l
CinVi(X1
(G.7)
+
(G.8)
where Cin = Ci.n-1
~nhi~dxn)
Four possible applications are now discussed. 1. The Estimation of a Function with Noise-Free Measurements
Let y
y2 ,..., yn
=f
( X ) and assume that the learning observations yl, ,..., X , be
, where yi =f(&), are noise-free. Let also
xl
216
APPENDIX G
independently distributed according to some unknown probability density function p ( X ) . The learning algorithm (G.3) can be applied to estimate f ( X ) with rn =
yn Sgn[Yn
-.fn-l(Xn)]
((3.9)
where sgn(u) = +1
for u
- -1
for u
>0
<0
(G.lO)
and y n is a sequence of positive numbers satisfying m
m
C yn = 00
h=l
and
1 yn2 <
00
(G.ll)
h=l
(G.12)
where A is an arbitrary positive constant satisfying A
> 4 M ~Kx( X , X )
(G.13)
Then under the condition (G .4), the following convergent properties can be proved: (i)
lim JQc I f ( x )-fn(x)IA X ) d x
n+m
=0
(G.14)
when (G.9) is used, and (ii)
lim EX.f(X)-fn(X>l2>= 0 n-m
(G.15)
when (G.12) is used. If conditions (G.5) and (G.6) are satisfied, and for any
where p1 ,...,p M do not vanish simultaneously, thenf,(X) converges to
f ( X ) according to (G.14) and (G.15) not only in probability but also with probability 1.
POTENTIAL FUNCTIONS OR REPRODUCING KERNELS
2.
217
The Estimation of a Function with Noisy Measurements
In this case, the observations y1 ,yz ,...,y, are noisy. Let Yn = f ( X n )
+ 5,
(G.17)
where theit',$. are independent random variables (noise) with zero mean and finite covariances. Also, the conditional probability density function p([,/X,) is assumed not a function of n. Under such conditions, it is suggested that, in the algorithm (G.3), (G.18)
rn = r n [ m -fn-l(Xn)I
where y, satisfies the conditon (G. 11). It can be shown that lim E { [ f ( X )-fn(X)I2) n-m
=0
(G.19)
that is, by applying the algorithm (G.3) with (G.18), the estimate fn(X) converges tof(X) in the mean square sense. Similarly, if (G.5), (G.6), and (G. 16) are satisfied, thenf,(X) also converges tof(X) with probability 1. 3. Pattern Classification-Deterministic
Case
Suppose that the input patterns to the classifier are from one of the two possible pattern classes, w1 and w2 , and assume that w1 and w g are mutually exclusive. The decision functionf(X) used by the classifier is such that, for an E > 0, if
f ( X ) > E,
then X - w l
if
f ( X ) < -E,
then X - w 2
A special case which is commonly used is sgnf(X) = +1, -
-1,
X
-
(G.20)
w1
X-Wz
(G.21)
Let the learning observations Xl ,..., X, from both pattern classes be independently distributed according to some unknown probability density function p(X). The XI ,..., X, are the feature vectors charac-
218
APPENDIX G
terizing the input patterns (learning samples) with known classifications. The algorithm (G.3) can be used to establish the estimates f n ( X )with r n = Hsgnf(Xn)
- ~gnfn-l(Xn)l
(G.22)
If the condition (G.4) is satisfied, then the following convergent property can be proved (G.23)
If, in addition to the condition (G.4), the statistics of the learning observations satisfy the condition that, for 0 < k < a,if X , ,..., X , are not completely separated (correctly classified) by f , ( X ) , there is a strictly positive probability of occurrence of X,, to reduce the misclassifications, then it is possible to find such a k with probability 1. In other words, with probability 1 f,(X) converges tof(X) within a finite number of iterations (observations), k. More specifically, the rate of convergence and the stopping rule for the learning process can be investigated in terms of the number of corrections made in the case of misclassifications. Let L be an infinite sequence of learning observations drown from w1 and w2 .It can be shown that there exists a quantity J, (G.24)
which is independent from the choice of L such that the number of corrections (of misclassifications) S J. The learning process is considered to be terminal if after S corrections of misclassifications there are no corrections during the subsequent S, observations. Let the probability of misclassification after S S, learning observations be Ps+s,(~). Then S l can be selected according to the following stopping rule. For E > 0 and 6 > 0
<
+
if log €6
st
> log(1 - €)
(G.26)
219
POTENTIAL FUNCTIONS OR REPRODUCING KERNELS
4. Pattern Classification-Statistical Case
In this case, the classification of input patterns is based on the set of probabilities P(wJX), i = 1, ..., m, where m is the number of pattern classes. Since P ( o , / X ) are unknown a priori, the potential function method is applied to estimate the probabilities. Let f ( X ) = P(w,/X) and consider a random function +(X) such that +(X)= 1 =O
where 6,is the complement of
if X
-wi
if X - G i
mi.
(G.27)
It is noted that, from (G.27),
P{+(X) = l} = f ( X ) = P ( w , / X ) P{+(X) = O} = 1 - f ( X )
(G.28) (G.29) (G.30) (G.31)
where g, is a random variable with zero mean and finite variance. Then the problem is reduced to that of estimatingf(X) from noisy measurements as in section 2. An alternative approach is to consider a “- operator” such that when it operates on a function #(X),
&x>= 0 =+(X) =1
if
<+ ( X )< 0
-00
0 <+(X) < 1
if if
1
< + ( X ) < 00
(G.32)
When the nth observation X, is taken, the classifier assigns it with probability fnn-l(Xn)to W , and with probability (1 - f,+,(X,)> to wi. Taking into account the information as to the actual membership of X, , four possibilities will occur: ( m i , mi), (wi ,Gi), (Wi , wi), and (6,, Wi). Here the first symbol indicates the pattern class to which X, actually belongs and the second the pattern class to which X , has been assigned by the classifier. Then, for the sequence r, in the algorithm (G.3) defined by rn = 0 for (wi,ai)and ( G i , G i ) = Yn for (wi,Gi) - --yn for (Gi , w i ) (G.33)
220
APPENDIX G
where yn is a sequence of positive numbers satisfying (G.ll), the successive estimates fn(X) converge to f ( X ) = P ( q / X ) in the mean square sense. Two methods have been suggested for the selection of potential functions. The first method is to select a certain system of functions, y i ( X ) , first. A set of orthonormal functions is usually a convenient choice. The potential function is then constructed according to (G.2). The second method sugggested is to select a symmetrical function of two variables, X and Y, directly as a potential function K ( X , Y). If the concept of distance between X and Y is defined in Qz , it is convenient to choose the potential function as a distance function. However, it is necessary to guarantee that the function selected is representable by (G.2). The following theorem is considered useful in this aspect. Let SZ, be either a bounded region of an N-dimensional Euclidean space EN or a discrete finite set of points in EN.Furthermore, let the function K(I z I), where
Theorem
I z 12 = z12 + z22 + *.*
+
zhr2
be a continuous function in EN whose multidimensional Fourier transform
J'
J' K(I z I) exp [ - j
N
..-
7 . ~ dz, ~ ~ ~ dzN 1
k=l
is positive at any point V = (q, v2 ,..., v N ) .Then, for X, Y E SZ, , the potential function K(I X - Y I) can be expanded in a series of the form (G.2) where v i ( X )is a complete system of functions inL,(X). It should be noted that the theorem only indicates the condition guaranteeing the required expandability of the potential function, but says nothing about the generation of v i ( X )by the given potential function. If sZx satisfies the conditions given in the theorem and a distance function p2(X, Y) is defined as N
f 2 ( xy,, =
2
k=l
(xk
- yk)'
(G.34)
then it is helpful to choose the potential function of the form K(p(X, Y)). For example, K ( X , Y) may be selected as K(X, y ) = e-awx,n
(G.35)
221
REFERENCES
of which the Fourier transform is of the form N
-
(G.36)
As pointed out by Simmons recently [6], there is a close resemblance between the potential function method and the method of reproducing kernels [7]. It is known from the theory of Fredholm’s integral equations [8] that if a set of orthonormal functions y4(X),i = 1, 2, ..., is complete in some set of functions G ( X ) , [ G ( X )CLJ-functions, the kernel K ( X , Y) in this set is given by
c .I,bi(X)Vi(Y) m
K(-% Y ) =
i=l
(G.37)
where rli is the eigenvalue of K ( X , Y) associated with the normalized eigenfunction qi(X). The expression (G.37) for K ( X , Y) guarantees that r14 and q4 are paired eigenvalues and eigenfunctions. If the summation terminates after some finite number M, then K ( X , Y) is a special Pincherle-Goursat kernel+ with M eigenvalues. Equation (G.37) is then equivalent to the potential function defined by (G.6). The requirement that G ( X ) and therefore y 4 ( X )be contained in the class of L2 functions is equivalent to the bounds imposed on y i ( X ) and K ( X , X ) in the method of potential functions. References 1. M. A. Aiserman, E. M. Braverman, and L. I. Rozonoer, Theoretical foundations of the potential function method in pattern recognition. Avtomat. i Telemeh. 25, 917-936 (1964). 2. M. A. Aiserman, E. M. Braverman, and L. I. Rozonoer, The probability problem of pattern recognition learning and the method of potential functions. Aotomat. i Telemeh. 25, 1307-1323 (1964). 3. M. A. Aiserman, E. M. Braverman, and L. I. Rozonoer, The method of potential functions for the problem of restoring the characteristic of a function converter from randomly observed points. Avtomat. i Telemeh. 25, 1705-1714 (1964). + A Pincherle-Goursat kernel is of the form
K ( X , Y)= ~ ; = 1 9 * ( X ) $ m where {qi(X)}and {$s(Y)}are two sets of linearly independent L, functions over the argument range R.
222
APPENDIX G
4. E. M. Braverman, On the potential function method. Aotomat. i Telemeh. 26, 2205-2213 (1965). 5. E. M. Braverman and E. S. Pyatnitskii, Estimation of the rate of convergence of algorithms based on the potential function method. Aotomat. i Telemeh. 27, 95-112 (1966). 6. G. J. Simmons, Iterative storage of multidimensional functions in discrete distributed memories. In “Computer and Information Sciences - 11” (J. T. Tou, ed.), pp. 261-280. Academic Press, New York, 1967. 7. N. Aronszajn, Theory of reproducing kernels. Trans. Amer. Math. Soc. 68, 337-404 (1950). 8. P. G. Tricomi, “Integral Equations.” Wiley, New York, 1957.
AUTHOR INDEX Numbers in parentheses are reference numbers and indicate that an author’s work is referred to, although, his name is not cited in the text. Numbers in italics show the page on which the complete reference is listed.
Abramson, N., 20(7), 22, 117(2), 119(2), 127(2), 140 Aiserman, M. A., 20(9), 20(11), 22, 96(4), 116, 169(22), 170, 214, 221 Anderson, T. W., 47(2), 63, 117(1), 118(l), 140 Armitage, P., 175 176, 179 Aronszajn, N., 221(7), 222 Ball, G. H., 21(28, 31), 23, 96(1), 116 Barabash, Yu. L., 25(9), 31, 44 Barndorff-Nielsen, O., 99(10), 216 Bellman, R., 64(2. 3), 65, 94,95,122(6), 140, 179(12), 180 Berk, R. H., 180 Bhat, B. R., 94 Bhattacharyya, A., 44(13), 45 Blackwell, D., 11(20), 23, 64(1), 94, 178(10), 179(10), 180 Blaydon, C. C., 20(13), 22, 169, I70 Block, H. D., 45, 206(15), 208(19), 212, 213 Blum, J. A., 201(2), 202,204,205(9), 212 Braverman, D., 20(7), 22, 117(2), 119(2), 127(2), 140 Braverman, E. M., 20(9), 20(11), 22, 96(4), 116, 158(15), 169(22), 170, 214, 221, 222 Burkholder, D., 208(18), 213 Bussgang, J. J., 47(3), 63 Cardillo, G. P., 94 Chen, C. H., 20(17), 22, 27(6), 28(6), 44 Chien, Y. T., 20(10, 15), 22, 29(18), 45, 47(1), 52(1), 63, 94, 105(14), 116,
122(5), 140, 142(2), 144(2), 148(2), 159(17), 169, 170 Chow, C. K., 11(3), 22 Chu, J. I . , 23 Chueh, J. C., 23 Cooper, D. B., 125(10), 140 Cooper, P. W., 125(10), 126(11), 140 Cover, T. M., 21(29), 23, 96(5), 116 Cox, D. R., 175, 179 Cram&, H., 120(22), 122(22), 140 Daly, R. F., 126(12), 140 Dorofeyuk, A. A., 158(14), 170 Driml, M., 202, 212 Duda, R. O., 21, 23, 45 DupaE, V., 159, 170, 209, 213 Dvoretzky, A., 52(6), 63, 159, 160(18), 161(18), 170, 204, 205(12), 212 Dynkin, E. B., 64(5), 94 Eden, M., 21(24), 23 Edie, J., 96(2), 116 Fabian, V., 207, 208, 212, 213 Fishbum, P. C., 64(6), 94 Fossum, H., 21, 23 Fralick, S. C., 126(13), 140 Frantsuz, A. G., 21(27), 23 Frazer, D. A. S., 116 Freeman, H., 21(26), 23 Fu, K. S., 14(6), 20(10, 15, 17), 22, 23, 27(6), 28(6), 29(18), 44, 45, 47(1), 52(1), 63, 94, 105(14), 116, 122(5), 140,142(2), 144(2), 148(2,3), 153(8,9), 159(17), 169, 170
223
224
AUTHOR INDEX
Gel’fand, I. M., 170
Girshick, M. A., 11(20), 23, 64(1), 94, 178(10), 179(10), 180
Gladyshev, E. G., 202, 212
Goode, H. H., 95
Gray, K. B., 205, 212
Green, D. M., 23, 26, 35(11), 44
Grettenberg, T. L., 27(4, 5), 44
Groner, G. F., 21, 23
Hall, D. J., 21(28), 23
Hancock, J. C., 169
Hanš, O., 202, 212
Hart, P. E., 21(29), 23, 96(5), 116
Henrichon, E. G., 96(8), 116
Heydorn, R. P., 45
Ho, Y. C., 22(34), 23
Hoeffding, W., 99(11), 116
Howard, R. A., 64(7), 94
Hughes, G. F., 96(1), 116
Kadota, T. T., 44(16), 45
Kailath, T., 44(14), 45
Kalaba, R., 64(3), 94, 95, 179(12), 180
Kanal, L., 23
Karhunen, K., 29(7), 44
Kashyap, R. L., 22, 23, 169, 170
Keehn, D. G., 117(3), 119(3), 140
Kel’mans, G. K., 155, 158(10), 169
Kesten, H., 170, 207, 212
Kiefer, J., 52(6), 63, 203, 212
Kobayashi, H., 44(15), 45
Koford, J. S., 21, 23
Kullback, S., 26(3), 44
Landgrebe, D. A., 44(21), 45
Lebo, J. A., 35(10), 44
Lehmann, E. L., 23, 100, 104, 116
Lewis, P. M., 24, 44
Lindley, D. V., 64(4), 94
Loginov, N. V., 20(8), 22
Mahalanobis, P. C., 44(12), 45
Marcus, M. B., 47(3), 63
Marill, T., 23, 26, 35(11), 44
Middleton, D., 64(3), 94, 179(12), 180
Min, P. J., 44(21, 22), 45
Molverton, C. T., 205, 213
Monro, S., 201, 212
Narasimhan, R., 21(25), 23
Nedoma, N., 202, 212
Neyman, J., 172, 179
Nikolic, Z. J., 20(15), 22, 148(3), 153(8, 9), 169, 170
Nilsson, N. J., 3(1), 7(1), 11(1), 22, 45
Novikoff, A. B. J., 21(30), 23
Owen, J., 96(3), 116
Parent, E. A., 97(9), 99(9), 109, 116
Patrick, E. A., 169
Pearson, E. S., 172, 179
Phatarfod, R. M., 63, 180
Pugachev, V. S., 134(19-21), 140
Pyatnitskii, E. S., 214(5), 222
Rao, C. R., 50(5), 63, 180
Rawgen, J. T., 205, 213
Reed, F. C., 14(5), 22, 54(7), 63, 176, 180
Robbins, H., 130(17, 18), 134(18), 140, 201, 212
Rosen, J. B., 22(35), 23
Rosenblatt, F., 23
Rozonoer, L. I., 20(9), 20(11), 22, 96(4), 116, 169(22), 170, 214, 221
Sacks, J., 204, 212
Sakrison, D. L., 204, 212
Sammon, J. W., 140
Saridis, G. N., 153(9), 169
Savage, I. R., 100(13), 116
Schlesinger, M. I., 155(12), 170
Schmetterer, L., 213
Sebestyen, G., 11(2), 22, 44(17), 45, 96(2), 116
Selin, I., 23
Sethuraman, J., 100(13), 116
Shepp, L. A., 44(16), 45
Shilov, G. E., 170
Simmons, G. J., 221, 222
Slaymaker, F., 23
Smith, D., 23
Sobel, M., 176, 179
Spaček, A., 202, 212
Specht, D. F., 96(6), 116
Spragins, J. D., 118, 124(9), 140
Stanat, D. F., 140
Steinbuch, K., 23
Teicher, H., 124(7, 8), 134(8), 140, 152(5), 169
Thomas, J. B., 44(15), 45
Tou, J. T., 45
Tricomi, F. G., 221(8), 222
Tsypkin, Ya. Z., 20(12), 22, 142(1), 155, 158(10), 169, 170
Venter, J. H., 208, 213
Wald, A., 11(19), 14(4), 22, 23, 45, 50(4), 63, 171(1), 175, 176, 179, 180
Walker, W., 23
Watanabe, S., 29(8), 44
Watson, G. N., 140
Wee, W. G., 20(15), 22, 23
Wetherill, G. B., 11(20), 23, 64(8), 94, 178(11), 180
Whittaker, E. T., 140
Widrow, B., 23
Wijsman, R. A., 20, 22
Wilde, D. J., 213
Wolfowitz, J., 52(6), 63, 175, 179, 203, 205, 212
Yakowitz, S. J., 124(9), 140
SUBJECT INDEX
α-Perceptron, 7
Bayes risk, 11, 70, 135, 177
Bayes’ sequential decision procedure, 176
  using dynamic programming, 64, 179
Bayesian estimation (learning), 20
  empirical Bayes approach, 130
  general model, 134
  nonsupervised, 123
  of slowly varying patterns, 127
  supervised, 117
Classification techniques
  deterministic, 3
  statistical
    fixed sample size, 10
    sequential, 13
Classifier, 2, 4
  Bayes, 12
  linear, 5, 137
  minimum-distance, 5
Covariance function, 30, 33, 34
Decision boundary, 3, 82, 83
  hyperellipsoid, 8
  hyperhyperboloid, 8
  hyperplane, 5, 13, 17
  hypersphere, 8
Discriminant computer
  linear, 5
  quadric, 8
Divergence, 24, 26
  expected, 27
Empirical Bayes approach, 130
Entropy, 24, 31, 32, 184
Feature extractor, 2
Feature ordering and pattern classification
  backward procedure, 79
  experiments, 80-86
  suboptimal, 88
Feature selection and ordering, 24
  experiments, 36-43
  information theoretic approach, 24
  Karhunen-Loève expansion, 29
  use of dynamic programming, 79, 86
Feature space, 3
Finite sequential classification, 46, 64
  backward procedure, 64
    experiments, 72-79
  forward procedure, 46
    experiments, 56-62
  using dynamic programming, 64
Karhunen-Loève expansion, 29
  optimal properties, 31, 181
Learning, 19
  nonsupervised, 20
  supervised, 20
Lehmann alternatives, 99
  selection of, 107-111
Mutual information, 25, 29
Nonsupervised learning, 20
  general model using stochastic approximation, 155
  using Bayesian estimation techniques, 123
  using dynamic stochastic approximation, 164
  using stochastic approximation, 148
Pattern recognition, 1
Potential functions, 215
  method of, 20, 168, 214
Reduction of dimensionality, 68
  Markovian dependence, 70
  use of sufficient statistics, 68, 191
Reproducing kernels, 214, 221
Sequential analysis, introduction, 171
Sequential pattern classification, 13
  nonparametric, 96
    experiments, 113-115
Sequential probability ratio test, 14, 171
  generalized, 17
  modified, 46-54
    generalized, 54-56
    properties, 54, 185
  nonparametric, 103
  truncated, 18
Sequential ranks, 97
Sequential ranking procedure, 97
Sequential rank vector, 98
  distribution of, 99
Sequential two-sample test, 101
Stochastic approximation, 20, 201
  accelerated dynamic, 166
  accelerated procedures, 206
  Dvoretzky’s procedure, 204
  dynamic, 158, 209
  Kiefer-Wolfowitz procedure, 202
  Robbins-Monro procedure, 201
Stopping boundaries, 15, 17, 103
  lower, 172
  time-varying, 47
  upper, 172
Supervised learning, 20
  using Bayesian estimation techniques, 117
  using dynamic stochastic approximation, 159
  using stochastic approximation, 141
Template-matching, 1
Training, 8
  absolute correction rule, 10
  fixed increment rule, 10
  fractional correction rule, 10
  group-pattern, 22
  linear classifier, 8
  piecewise linear classifier, 21
Mathematics in Science and Engineering
A Series of Monographs and Textbooks
Edited by RICHARD BELLMAN, University of Southern California
1. TRACY Y. THOMAS. Concepts from Tensor Analysis and Differential Geometry. Second Edition. 1965
2. TRACY Y. THOMAS. Plastic Flow and Fracture in Solids. 1961
3. RUTHERFORD ARIS. The Optimal Design of Chemical Reactors: A Study in Dynamic Programming. 1961
4. JOSEPH LASALLE and SOLOMON LEFSCHETZ. Stability by Liapunov’s Direct Method with Applications. 1961
5. GEORGE LEITMANN (ed.) Optimization Techniques: With Applications to Aerospace Systems. 1962
6. RICHARD BELLMAN and KENNETH L. COOKE. Differential-Difference Equations. 1963
7. FRANK A. HAIGHT. Mathematical Theories of Traffic Flow. 1963
8. F. V. ATKINSON. Discrete and Continuous Boundary Problems. 1964
9. A. JEFFREY and T. TANIUTI. Non-Linear Wave Propagation: With Applications to Physics and Magnetohydrodynamics. 1964
10. JULIUS T. TOU. Optimum Design of Digital Control Systems. 1963
11. HARLEY FLANDERS. Differential Forms: With Applications to the Physical Sciences. 1963
12. SANFORD M. ROBERTS. Dynamic Programming in Chemical Engineering and Process Control. 1964
13. SOLOMON LEFSCHETZ. Stability of Nonlinear Control Systems. 1965
14. DIMITRIS N. CHORAFAS. Systems and Simulation. 1965
15. A. A. PERVOZVANSKII. Random Processes in Nonlinear Control Systems. 1965
16. MARSHALL C. PEASE, III. Methods of Matrix Algebra. 1965
17. V. E. BENES. Mathematical Theory of Connecting Networks and Telephone Traffic. 1965
18. WILLIAM F. AMES. Nonlinear Partial Differential Equations in Engineering. 1965
19. J. ACZEL. Lectures on Functional Equations and Their Applications. 1966
20. R. E. MURPHY. Adaptive Processes in Economic Systems. 1965
21. S. E. DREYFUS. Dynamic Programming and the Calculus of Variations. 1965
22. A. A. FEL’DBAUM. Optimal Control Systems. 1965
23. A. HALANAY. Differential Equations: Stability, Oscillations, Time Lags. 1966
24. M. NAMIK OGUZTORELI. Time-Lag Control Systems. 1966
25. DAVID SWORDER. Optimal Adaptive Control Systems. 1966
26. MILTON ASH. Optimal Shutdown Control of Nuclear Reactors. 1966
27. DIMITRIS N. CHORAFAS. Control System Functions and Programming Approaches (In Two Volumes). 1966
28. N. P. ERUGIN. Linear Systems of Ordinary Differential Equations. 1966
29. SOLOMON MARCUS. Algebraic Linguistics; Analytical Models. 1967
30. A. M. LIAPUNOV. Stability of Motion. 1966
31. GEORGE LEITMANN (ed.) Topics in Optimization. 1967
32. MASANAO AOKI. Optimization of Stochastic Systems. 1967
33. HAROLD J. KUSHNER. Stochastic Stability and Control. 1967
34. MINORU URABE. Nonlinear Autonomous Oscillations. 1967
35. F. CALOGERO. Variable Phase Approach to Potential Scattering. 1967
36. A. KAUFMANN. Graphs, Dynamic Programming, and Finite Games. 1967
37. A. KAUFMANN and R. CRUON. Dynamic Programming: Sequential Scientific Management. 1967
38. J. H. AHLBERG, E. N. NILSON, and J. L. WALSH. The Theory of Splines and Their Applications. 1967
39. Y. SAWARAGI, Y. SUNAHARA, and T. NAKAMIZO. Statistical Decision Theory in Adaptive Control Systems. 1967
40. RICHARD BELLMAN. Introduction to the Mathematical Theory of Control Processes Volume I. 1967 (Volumes II and III in preparation)
41. E. STANLEY LEE. Quasilinearization and Invariant Imbedding. 1968
42. WILLIAM AMES. Nonlinear Ordinary Differential Equations in Transport Processes. 1968
43. WILLARD MILLER, JR. Lie Theory and Special Functions. 1968
44. PAUL B. BAILEY, LAWRENCE F. SHAMPINE, and PAUL E. WALTMAN. Nonlinear Two Point Boundary Value Problems. 1968
45. IU. P. PETROV. Variational Methods in Optimum Control Theory. 1968
46. O. A. LADYZHENSKAYA and N. N. URAL’TSEVA. Linear and Quasilinear Elliptic Equations. 1968
47. A. KAUFMANN and R. FAURE. Introduction to Operations Research. 1968
48. C. A. SWANSON. Comparison and Oscillation Theory of Linear Differential Equations. 1968
49. ROBERT HERMANN. Differential Geometry and the Calculus of Variations. 1968
50. N. K. JAISWAL. Priority Queues. 1968
51. HUKUKANE NIKAIDO. Convex Structures and Economic Theory. 1968
52. K. S. FU. Sequential Methods in Pattern Recognition and Machine Learning. 1968
53. YUDELL LUKE. The Special Functions and Their Approximations (In Two Volumes). 1968
54. ROBERT P. GILBERT. Function Theoretic Methods in the Theory of Partial Differential Equations.
In preparation
V. LAKSHMIKANTHAM and S. LEELA. Differential and Integral Inequalities
MASAO IRI. Network Flow, Transportation, and Scheduling: Theory and Algorithms
HENRY HERMES and JOSEPH P. LASALLE. Functional Analysis and Time Optimal Control