2,---,fll with
combination
coefficients
al,a2,---,an
.
The vector
a,^, +fl2p2 +"- + aB% is the zero vector 0 if and only if a1,flJ,>--,flHareall zeros. If all vectors in V can be written as linear combinations of yn,^, •••,%, then these vectors form a set called the span of ^ , , ^ s - - - , % . Or in other words, we say that
4.1. Approximation and best approximation
71
precisely, we claim that l,xM•-•,*,, are a basis for IT,,. To prove this we need to show that {1) 1, x{, • • •, xn are linearly independent, (2) they span II B . Property (2) is clear from the definition of Tln . As for (1): suppose q,,q,•••,&,, are scalars such that aa + atx + • • • + anx" = 0 . The following concepts at least point out the possibility of how to construct an approximant based on linear combination. Theorem 4.1 (Best approximation) Let V be a normed linear space with norm I | . Let ^l,
A solution to the minimizing problem is usually called a "best approximant" to y from VB. How many solutions are there? The answer is connected with convexity conditions. Let V be a vector space. A subset S of V is convex if for any two members fH, q>2 of S, the set of all members of form 2,0 <, t < 1 called the line segment from ^ to ip%, also belongs to S. Theorem 4.2 (Uniqueness) Let V be a normed linear space with strictly convex norm, flies best approximations from finite dimensional subspaces are unique. Approximation is transformed into the problem searching for the bases which define a normed linear space with strictly convex norm. Depending on the sense in which the approximation is realized, or depending on the norm definition in equation (4.4), there are three types of approximation approaches: I.
Interpolatory approximation: The parameters a, are chosen so that on a fixed prescribed set of points x* (k = I,... ,n) we have
72
4. Approximation and B-splim Junction
=*
(4.5)
Sometimes, we even further require that, for each i, the first r, derivatives of the approximant agree with those of y at x t 2.
Least-square approximation: The parameters at are chosen so as to minimize
_
(4.6)
2 *=•'
3.
Min-Max approximation: the parameters a* are chosen so as to minimize n
- max y - J ] a,% -> min
(4,7)
4.2 Polynomial basis Choosing q>t = x', we have polynomials as approximants. Weierstrass theorem guarantees this is at least theoretically feasible Theorem 4.3 (Weierstrass approximation theorem) Let f(x) be a continuous function on the interval [a,b]. Then for any £>0, there exists an integer n and a polynomial pn such that (4-8) In fact, if [a,&]=[0,l], the Bernstein polynomial
converges to f(x) as n -» QO , Weierstrass theorems (and in feet their original proofs) postulate existence of some sequence of polynomials converging to a prescribed continuous function uniformly on a bounded closed intervals. When V = L^a; b] for some interval [a; b], and 1 < p < 1 and the norm is the
4.2, Polynomial basis
73
usual /Miorm, it is known that best approximante are unique. This follows from the fact that these norms are strictly convex, i.e. Jxj < r,||y| < r implies |e+y\\ < 2r unless x = y. A detailed treatment of this matter, including proof that these norms are strictly convex can be extracted from the material in PJ. Davis, Interpolation and Approximation, chapter 7. However, polynomial approximants are not efficient in some sense. Take Lagrange interpolation for instance. If xv,x1,---,xn are « distinct numbers at which the values of the function/are given, men the interpolating polynomial p is found from
)£*«
(4-10)
where Lnk(x) is
~
x
i
The error in the approximation is given by -x,).
where £(x) is in the smallest interval containing Introducing the Lebesque function
(4.12)
x,xl,xn.
and a norm ||/|| = max|/(x)t. Then
WNIkl
(4.14)
74
4. Approximation and B-spline function
This estimate is known to be sharp, that is, there exists a function for which the equality holds. Equally spaced points may have bad consequences because it can be shown that
IKI&Ce"'2.
(4.16)
As n increases, function value gets larger and larger, and entirely fails to approximate the function / This situation ean be removed if we have freedom free to choose the interpolation points for the interval [a,b]. Chebyshev points are known to be a good choice. * t = - \ a + b + (a~b)cos^~^-
* 2L
.
(4.17)
n-l J
The maximum value for the associated Lebesque function is it logK+4. |r'i|< — II
It
(4.18)
ft-
Using Chebyshev points, we therefore obtain the following error bounds for polynomial interpolation
4
X1K denotes the linear space of all polynomials of degree n on [a,b]. We may further show that
Thus by using the best interpolation scheme, we can still only hope to reduce the error from interpolation using Chebyshev points by less than a factor -logn+5.
(4.21)
4.2. Polynomial basis
75
Example 4.3 Runge example This is a well-known numerical example studied by Runge when he interpolated data based on a simple fimction of )
(4.22)
on an interval of [-1,1]. For example, take six equidistantly spaced points in [-1, 1 ] and find y at these points as given in Table 4.1. Now through these six points, we can pass a fifth order polynomial /j(x) = 0.56731-1.7308x'+1.2019x4,
-l:Sx
(4.23)
On plotting the fifth order polynomial and the original function in Figure 4.2, we can see that the two do not match well. We may consider choosing more points in the interval [-1, 1] to get a better match, but it diverges even more. In fact, Runge found that as the order of the polynomial becomes infinite, the polynomial diverges in the interval of-1 < x < 0.726 and 0.726 < x < 1. How much can we improve the situation if Chebyshev points are used? Reconsider this problem, but take six non-equidistantly spaced points in[-l, 1] calculated from equation (4.17) and find y at these points as given in Table 4.2. Now through these six points, we can pass a new fifth order polynomial / s '(x) = 0.355-0.716x 2 +0.399%", -1 S i < 1. Table 4,1: Six equidistantly spaced points on [-1,1] X y
-1.0 -0.6 -0.2 0.2 0.6 1.0
1 l+25x 2 0.03846 0.1 0.5 0.5 0.1 0.03846
(4.24)
Table 4.2: Six unequidistantly spaced points on [-1,1] X }
-1.0 -0.809 -0.309 0.309 0.809 1.0
1 l+25x 2 0.03846 0.05759 0.29520 0.29520 0.05759 0.03S46
On plotting the fifth order polynomial and the original function in Figure 4.3, we can see that the two match better at the two ends, but do not match well around
76
4, Approximation and B-splinefunction
the cental portion. So Chebyshev points remove instability at the cost of loosing accuracy around the central portion. This is natural if we look at equation (4.17) which puts more points near the two ends. Chebyshev points are nearly optimal. In other words, even if we employ an alternative knot sequence better than Chebyshev points, we would not gain much. We have to seek other methods to reduce errors. As suggested by equation (4,19), we have the following two approaches (1) Increase n, (2) Decrease the interval [a,b]. 1 Runge function
5-th order polynomial
0.8 0.6 0.4 0.2 0 -0.2
-1
-0.5
0
0.5
1
Figure 4.2: 5* order polynomial vs. exact function 1 0.8
Runge function
5-th order polynomial
0.6 0.4 0.2 0 -1
-0.5
0
0.5
1
Figure 4.3: 5th order polynomial on Chebyshev points
11
4.3. B-splines
Approaeh (1) is not a good dioice because increasing n may produce disastrous consequence in many cases. The linear combination coefficients become nearly linearly dependent. Approach (2) seems the unique option for us, which results in approximations such as B-splines. 4.3 B-splines 4.3.1 Definitions The truncated power basis can be numerically bad and it has some drawbacks as a theoretical tool. Early workers (Sehoenberg, 1946) defined and used special splines called B-splines. Their interest was mostly in theoretical studies with them because the age of modern scientific computing hadn't really begun. Much later, de Boor (1978) discovered properties of B-splines that make them well suited to computation. The B-spline function can be defined either through the divided difference or through recursion relationship. We first use the latter to give the definition of The B-spline function. At the end of this section, B-spline definition based on the former is also provided. Let A = {x,} (i = 0,lt...,n) be a non-decreasing knot sequence. Consider a function s(x) defined on the interval [JED,XB] . Its vertical coordinate for x, is yt. An order 1 (or degree 0 polynomial) B-spline is defined as a step function as shown in Figure 4.4 (a). 5
'
G, [I
x*[x,,xM] xe[x,,xM]'
(4.25)
Bf(x) is 2-th B-spline of order 1.
(a) Order 1
(b) Order 2
( c ) Order 3
Figure 4.4 The first three order B-splines
78
4. Approximation and B-spline function The function s(x) defined on the interval [x0, JCB ] is then expressed as (4.26)
An order 2 (or degree 1) B-spline is a broken line as shown in Figure 4.4 (b). Consider the equation defining the broken line. The algebraic equation for straight line (x,»>»,)tQ (xM,yM) is
x-x,
(4.27)
The algebraic equation for straight line (xM,yi+l)
-x X
i+2
to {xi+1,yM}
's
x-xi+
(4.28)
X
i+\
Comparing the above two equations we define X- -X, X
Bf(x) =
M
X
-xf -X
M
(4.29)
,-X.j
otherwise
Then the whole broken line from (xa,y0) to (xn,yn) can be defined as (4-30) If equation (4.25) is employed, equation (4.29) can be rewritten in the form of (4.31)
4,3. B-splines
79
This relationship can be generalized to higher order B-splines. Replacing order 2 by order k and order 1 by k-\, we obtain (4.32) This is the well-known recursion relationship for B-splines. We may use it to define a B-spline function of higher orders. So an order 3 B-spline is defined by
~X
(4.33)
Substituting equations (4.25) and (4.31) into above equation yields, after lengthy manipulations,
Bf (x) = (x, - x , ) f (X°+> ~X{''HiX°+> ~X)H(x-x,)
(4.34)
where H(x) is Heaviside function defined by
[
if
(4.35,
The index s-i-3, and the function ws(x) is a product of the form
w.Crt = Suppose the knot sequence is equidistance and there are n+\ equidistance knots, c = x0 < xx < • • • < xx = d, which divide the internal [c,d] into n subintervals. For conveniences in later mathematical expressions, three more knots at each end are defined: x_ 3 , x_2, x_t, xH+l, xn+1 and xn+i, It is customary to set x_3 = x_j =x_l=x0 = c and xH - JCB+, = xatt = xn+3 = d. It is clear that n+l knots define N splines. Therefore, N=n+\. With such definitions, we can plotS/(x) in Figure 4.5. Example 4.4 B-splines of order 3 Let A - {0,1,2,3}. Substituting it into the expression above yields
4. Approximation and B-spline Junction
80
0.5
K
n+1
Figure 4.5 B-splines of order 3 (degree 2)
= ~(3-xf - | { 2 - x ) 2
(4.37a) (4.37b)
2 < x < 3 , Bl(*) = -(3-*)*.
(4.37c)
but note (4J7d)
(4.37e) S,3(x)is continuous at x = \ and x = 2 . Its first derivative is continuous at x = 1 and x = 2 .The second derivative is discontinuous at x = 1 and x = 2. In cases without confiision, we also simply use B, (x) by omitting the symbol for order k. From equation (4.32) we may defme order 4 B-splines. The results turn out to be similar to order 3 B-splines in form. In particular,
4.3. B-splines
81
(X H
: ~*W-».) )
(4.38a)
where the index s=/-4, and the function w/x) is a product of the form -*™).
(4.38b)
In most applications, B-splines of order 3 and 4 are employed. For B-splines of higher orders, we do not give their explicit expressions here. 4,3,2 B-spIine basis sets Can we use B-splines for meaningful calculations? Do B-splines form a basis set for V? The Curry-Sehoenberg theorem says yes! Theorem 4,4 Curry-Schoenberg theorem For a given strictly increasing sequence Z = {iit'"t^K*i}> negative
1=1
integer
sequence
v = {vj,-",vA,}
with
am
all
a
^
S^ven
w !
" "
, v,£k,
set
let A = {JC 1 ,---,X B+4 } be any
non-
i»2
decreasing sequence so that (1)
xlixj£---^xk<
(2) for i = 2,—N the number ^ appears k-v, times in A.
(4.39)
Then the sequence fl,* • • • 5* of B-splines of order k for the knot sequence A is a basis for Hk&v, considered as functions on [x* »*„+]]• This theorem provides the necessary information for generating a B-spline basis set using the recurrence relations above. It specifies how to generate the knot sequence A with the desired amount of smoothness. The choice of the first and last k knots is arbitrary within the specified limits. In the sequence A, the amount of smoothness at a breakpoint xt is governed by the number of times x, appears in the sequence A .Fewer appearances means more smoothness. A convenient choice for the first and last knot points is to make the first & knot points equal to Xj and the last k knot points equal to xN+l .This
82
4. Approximation and B-spline function
corresponds essentially to imposing no continuity conditions at £j and §N+l. Proof; See der Hart (2000). 4.3.3 Linear independence of B-spline functions The following theorem proves the linear independence of B-splines. Theorem 4.5 Let pt be a linear fimctional given by the rule (4.40) ax \%
r=0
with i+k-l
n
fr
-lA
(4.41)
and T}, an arbitrary point in the open interval (a^ ,X; +i ). Then PlB^S,,foratlij
.
Proof: see de Boor (1978) or van der Hart (2000).
(4.42) D
4.3.4 properties of B-splines B-splines have some nice properties which make them appealing for function approximation. (1) (Local support or compact support) B* (x) is a non-zero polynomial on ^ £ x 5 xuk.
(4.43)
This follows directly from definition. At each x, only k B-splines are non-zero. (2) {Normalization)
4.3. B-splines | > * ( x ) = l.
83 (4-44)
This can be seen from recursion relations. (3)
(Nannegativity) Bf(x)>0.
(4.45)
Again follows from recursion relation. (4) (Differentiation) (4.46) X
t
X
i*k
X
M
Proof. See der Hart (2000).
•
Note that the same knot set is used for splines of order k and k - 1 .In the (convenient) case that the first and last knot point appear k times in the knot set ,the first and last spline in the set 5,w , are identically zero. (5) (Integration) "t
k +\
(4.47) X
i
(6) Function bounds if xt: £ x < xM and / = ^ a, 5. ,then in{aw_A>--.,a,} S/(*)Smax{o,. +w ,•••,«,}.
(4.48)
(7) B-splines are a relatively well-conditioned basis set. There exists a constant Dk .which depends on k but not on the knot set A,so that for all i (4.49)
84
4. Approximation and B-spiine function In general, Dk » 2*" 3/s , Cancellation effects in the B-spline summation are limited. (8) (Least-square property) If f(x)eC2 [a,b], then there exists a unique set of real constant flj ( i = -k, •••,«-1) solving the minimization problem
(4.50) These coefficients can be obtained from the linear system of equation CA = b where the matrix C is C=
ffi*(*)B*(x)*
(4.51)
.
(4.52)
The right hand side vector b is
Blt(x)(k\ .
(4.53)
And the unknown vector A is A = {a. 4 ,- s a B .,f.
(4.54)
Example 4.5 Comparison of various approximation methods The effectiveness of the above-mentioned approximation methods can be numerically demonstrated using Runge example in Example 4.3. Consider three interpolation methods: (1) Lagrange interpolation on equally spaced points (denoted as uniform polynomial in Figures 4.6 and 4.7); (2) Lagrange interpolation on non-equally spaced points (denoted as nonuniform polynomial in Figures 4.6 and 4.7); and (3) B-splines. They are obtained as follows.
4.3. B-splines
85
Consider Lagrange interpolation given in equations (4,10) and (4,11) of the form If the fonction values f(xk) at n points % are given, then the function value evaluated by the Lagrange interpolant at any point x is
= £/(**)**(*),
(4.55)
where Lk(x) is
»=n
x-x,
(4.56)
Consider B-spline interpolation. At the given n points xk, the interpolant must satisfy (4.57)
4(*») = /(**>.
where af are unknown coefficients to be determined by solving the system of linear equations using Gauss Elimination Method, to say
5,00
'« N
(4.58)
Figure 4.6 shows the results obtained from the three methods based on 10 points. Uniform polynomial behaves very well around the central portion, even better than B-splines. But it yields bad results around the two ends, validating the conclusions obtained by Runge long ago. Nonuinform polynomial behaves better than uniform polynomial around the two ends, but worst among the three around the central portion. This is due to the feet that nonuniform polynomial uses less points in the central portion. Around the central portion, the performance of Bsplines is almost as same as the uniform polynomial, while around the two ends, it is comparable with that of the nonuniform polynomial. Among the three, it is clear that B-splines are the best interpolants. As the number of points is increases to 20, the difference between B-splines and nonuniform polynomial is not significant, but it remains distinguishable. The uniform polynomial behaves badly around the two ends, as shown in Figure 4.7.
86
4. Approximation and B-splim function
In the figure, the curve of B-splines is not identified due to its closeness to the true Runge function and the difficulty to differentiate them. The effectiveness of B-splines can be well demonsttated by this example.
Runge iunction
B-splines
Nonuniform polynomial
0.5
i Uniform polynoniial -0.5
-1
0.5
0
-0.5
1
Figure 4.6 Comparison of various interpolation methods for 10 points. In the figure, uniform refers to equally spaced points and nonuniform refers to unequally spaced points.
V\ \ V
/
Nonuniform polynomial
Uniform / polynomial/
0.5
-1
-0.5
0
0.5
1
Figure 4.7 Comparison of various interpolation methods for 20 points. The B-spline interpolating curve is almost coincident with the Runge function.
4.4. Two-dimentional B-splines
87
4.4 Two-dimensional B-splines Two-dimensional B-splines may refer to B-splines defined on an arbitrary surface in the 3-D space, or on a plane. It would take a lot of space here if we went deep into the details of 2-D B-splines defined on an arbitrary surface. Considering what will be employed in the subsequent chapters, we focus our discussions on 2-D B-splines defined on a plane, the simplest case in 2-D space. This simplifies the presentation a lot. The reader who is interested in more general theory of 2-D B-splines is referred to de Boor (1978). In the following throughout the book, 2-D B-splines refer solely to those defined on a plane. The simplest 2-D B-splines are obtained by direct product of two 1-D Bsplines in the form of BtJ(x,y) = Bi(x)BJ(y).
(4.59)
It is bell-shaped in a 2-D space. A 2-D B-spline can also be obtained by replacing the argument in a 1 Bspline by the radial distance r = ^{x-xjf+(y-yjf centre of i-th B-spline. In notation,
3
2
-(2-Sf, 6
0,
, where (*,,;>,) is the
1<SS2
(4.60)
S>2
where S - r/A f , a, = IS/ltth2. B-splines defined in this way are called Radial B-spline Function, RBF in short. In the equation, h, defines the radius of the circle centered at (xf,y,) inside which Bt(r,hf) does not vanish. 4.5 Concluding remarks Approximation theory has a long history, but it remains active due to the fact feat it is not easy to find a flexible enough yet universal approximation tool for so many cases encountered in real-world applications. Polynomials had been an effective tool for theoretical analysis. It is not very suited to computational purpose due to its over-sensitivity to local changes as the order is high. The most effective way to reduce approximation errors is to decrease the
88
4, Approximation and B-spline function
interval [a,b]. This leads to the imroduetion of B-splines, The B-spline is nearly optimal choice for approximation. B-splines have nice properties suited to approximating complicated functions. The best properties of B-splines are that they are flexible enough to yield satisfactory approximation to a given fimction while maintaining stability. Only fundamental things about B-splines are introduced. One of important developments made in recent years, the so-called nonuniform rational B-splines Surface (NURBS), is not mentioned in this chapter for the apparent reasons. The interested reader is referred to Piegl & Tiller (1997),
Chapter 5
Disorder, entropy and entropy estimation
Entropy is one of most elegant concepts in science. Accompanying each progress in the conceptual development of entropy is a big forward step in science, Entropy was first introduced into science as a thermodynamic concept in 1865 for solving the problem of irreversible process. Defining entropy as a measure of the unavailability of a system's thermal energy for conversion into mechanical work, Clausius phrased the second thermodynamic law by claiming that tiie entropy of an isolated system would never decrease. In 1877, Boltanan gave interpretation of entropy in the framework of statistics. Entropy as a mathematical concept appeared first in Shannon's paper (1948) on information theory. This is a quantum jump, having great impact on modem communication theory. Another important progress for mathematical entropy was made by Kullback (1957) in 1950s. Entropy is thus an elegant tool widely used by both mathematicians and physicists. Entropy is yet one of the most difficult concepts in science. Confusions often arise about its definition and applicability due to its abstract trait This results from the phenomena, known as disorder or uncertainty, described by entropy. In fact, entropy is a measure of disorder. In this chapter, entropy as a mathematical concept will be first elucidated, followed by the discussions on how to construct unbiased estimators of entropies S.1 Disorder and entropy Entropy describes a broad class of phenomena around us, disorder. Its difficult mathematical definition does not prevent us from gaining an intuitive understanding of it. Example 5.1 Matter The unique properties of the three-states of matters (solid, gas, and liquid)
89
90
5. Disorder, entropy and entropy estimation
result from differences in the arrangements and of the particles making up of them. The particles of a solid are strongly attracted to each other, and are arranged in a regularly repeating pattern, or in order, to maintain a GAS solid shape. Gas expands in every direction as there are few bonds sublimation among its particles. Gas is a state of matter without order. Liquid flows because its particles are not held rigidly, but the attraction between the particles is sufficient to give a definite volume. So liquid is a state in between ordered solid and disordered gas.See Figure 5.1, Take water for instance. As freeze temperature drops below 0°C, ail SOLID L1DQID particles suddenly switch to an ordered state called crystal. And they Figure 5.1 Arrangement of particles vibrate around their equilibrium in different states of matters positions with average amplitude As. The ordered state is broken as temperature increases higher above 100°C, water becomes vaporized gas. Gas particles do not have equilibrium positions and they go around without restriction. Their mean free path Ag is much larger than the other two, that is, we have the following inequality temperature increases above 0°C, and particles travel around with larger average free distance A,
kt « At « A .
(5.1)
Disorder does matter in determining matter states. If the amount of disorder is low enough, matter will be in the state of solid; and if the amount of disorder is high enough, matter will be in the state of gas. Example 5.2: Digit disorder Take number as another example. Each number can be expressed as a sequence of digit combination using 0 to 9. For example, - = 0.499999..., - = 0.285714285714285714..., V2 = 1.1415926.....
(5.2)
J. /. Disorder and entropy
91
Except the first two digits, the string for 1/2 exhibits simple pattern by straightforwardly repeating digit 9. The string representing 1/7 is more complicated than that representing 1/2, but it remains ordered in the sense that it is a repetition of the six digits 285714. The third string representing -Jl does not show any order, all digits placed without order. We thus say that the first series is the best ordered and the last worst ordered. In other words, the amount of disorder of the first series is least while the third the largest. From the two examples above, it is plausible that disorder is a phenomenon existing both in nature and in mathematics. More examples on disorder can be cited, some of which are given in the following Example 5.3: Airport disorder An airport is in order with all flights arriving and departing on time. Someday and sometime, a storm may destroy the order, resulting in a state that massy passengers wait in halls due to delayed or even cancelled flights. The passengers might get more and more excited, creating a disordered pattern. Disorder is so abstract that we hardly notice it if not for physicists and mathematicians. Disorder is important for it determines the state of a system. It is desirable to introduce a quantity able to describe the amount of disorder. We will see that we are able to define a quantity, called entropy, for quantitatively defining the amount of disorder. And //(in fact, it is the capital Greek letter for E, the first letter of entropy), is frequently used to denote entropy To find the definition of entropy, we return to Example 5.1. Consider a onedimensional bar of length L. If the bar is filled with ice, there will be approximately Nt=L/A, particles in the bar. If the bar is filled with water, there will be approximately N(= Lf At particles in the bar. And if the bar is filled with gas, there will be approximately Ng = L/Ag particles. Because of equation (5.1), we have N,»Nt»Ng.
(5.3)
Therefore, the number of particles in the bar should be relevant to the amount of disorder in the system, or entropy. In other words, entropy H is a function of the number of particles in the bar, H = H{N). And H{N) should be an increasing function of N. Suppose now that the length of the bar is doubled. The number of particles in the new bar will be doubled, too. And how about entropy? The change in entropy cannot be the number itself iV, because otherwise the change of entropy for ice will be N,, for water JV, and for gas Ng, The length of the bar is doubled, but the increase of entropy is not same for the three cases. This is not acceptable
92
J. Disorder, entropy and entropy estimation
if we hope that entropy is a general concept. An alternative method to view the problem is to define entropy in such a way that if the bar is doubled in length or the number of particles is doubled, the entropy increment is one. If the number of particles is four-fold increased, the entropy increment is 2. Then we have
,..,
(5.4)
Note that H(l) = 0 because there does not exist any disorder for one particle. Solving the above equation, we are led to say that entropy is given This is a heuristic introduction to entropy. In the following sections, we will ignore the particular examples in the above, and tarn to abstract yet rigorous definition of entropy. 5.1.1 Entropy of finite schemes To generalize the entropy definition introduced in the above, we must notice that the number JV used in the above is just an average value, hi more general cases, N is a random variable and should be replaced by probability. Consider a bar filled with AT particles. They are not necessarily arranged in an equidistance way. Suppose, the free distance of the first particle is A,, the free distance of the second particle is Aj,..., and the free distance of the N-th particle is AN. If the bar is filled with particles, all of which have free distance A,, then the entropy would be H{ = log{£/!,) based on the above definition. If the bar is filled with particles, all of which have distance Aj, the entropy would be H2 = logfi/lj) And if the bar is filled with particles all of which have distance A^, the entropy would be Hn = log(£ / AN ) . Suppose now the bar is filled with particles of various free distances. We will have to use the averaged quantity to represent the amount of disorder of the system, that is, ff = ~ ( l o g « 1 + l o g ^ + " - + logn w } n
(5.5)
where n = w, + n2 + • • • + nN is the total number of particles. If 1 /«, is replaced by probability, we obtain the following generalized entropy concept based on probability theory.
5.1, Disorder and entropy
93
A complete system of events Ai,A2,--',An in probability theory means a set of events such that one and only one of them must occur at each trial (e.g., the appearance of 1,2,3,4,5 or 6 points in throwing a die). In the case N-2 we have a simple alternative or pair of mutually exclusive events (e. g. the appearance of heads or tails in tossing a coin). If we are given the events Ah A3, .... An of a complete system, together with their probabilities JJ, , j % ,-••,/?„ {pi 2; 0, ^ p, = 1), then we say that we have a finite scheme (5.6) ft •••
P»)
In the case of a "true" die, designating the appearance of / points by A, (1 s i £ 6 ), we have the finite scheme
P\
Pi
Pi
P«
Pi
P«
From the finite scheme (5.6) we can generate a sequence of the form AjA1A]A%Aft.., .The sequence is an ordered one if Ai,At,---tAll appear in a predictable way; otherwise disordered. Therefore, every finite scheme describes a state of disorder. In the two simple alternatives
0.5 0.5)
^0.99 0.01
the first is much more disordered than the second. If a random experiment is made following the probability distribution of the first, we may obtain a sequence which might look like AlAiA2AlA2AlAzA[.,., It is hard for us to know which will be the next. The second will be different, and the sequence generated from it might look like AlAlAlAiAlA)AfAx... Ms are almost sure that the next letter is 4 with small probability to make mistake. We say that the first has more amount of disorder than the second. We sometimes use uncertainty instead of disorder by saying that the first is much more uncertain than the second. The correspondence of the two words uncertainty and disorder can be demonstrated by Equation (5.6). Disorder is more suitable for describing the state of the sequences generated from finite scheme (5.6) while uncertainty is more suitable for describing the finite scheme
94
J, Disorder, entropy and entropy estimation
itself. Large uncertainty implies that all or some of assigned values of probabilities are close. In the extreme case, all probabilities are mutually equal, being 1/n , A sequence generated from such a scheme would be highly disordered because each event has equal probability of occurrence in the sequence. On the other extreme, if fee probability for one of the events is much higher than the rest, the sequence produced from such scheme will look quite ordered. So the finite scheme is of low uncertainty. Thus, disorder and uncertainty are two words defining the same state of a finite scheme. The scheme
4
(5-9)
"'}
0.3 0.7 J represents an amount of uncertainty intermediate between the previous two. The above examples show that although all finite schemes are random, their amount of uncertainty is in fact not same. It is thus desirable to infroduce a quantity which in a reasonable way measures the amount of uncertainty associated with a given finite scheme. The quantity
can serve as a very suitable measure of the uncertainty of the finite scheme (5.6). The logarithms are taken to an arbitrary but fixed base, and we always take pk logj% = 0 if pk = 0 . The quantity H(pl,p2,---,pH)i8 called the entropy of the finite scheme (5.6), pursuing a physical analogy with Maxwell entropy in thermodynamics. We now convince ourselves that this function actually has a number of properties which we might expect of a reasonable measure of uncertainty of a finite scheme. 5.1.2 Axioms of entropy Aided by the above arguments, entropy can be rigorously introduced through the following theorem. Theorem 5.1 Let H(p1,p1,---,plt)be
ajunction defined for any integer n and n
for all values- P 1 ,/ 7 2»'"»A
suc
^
tnat
ft £0,(& = l,2,---,«), ^pk
= 1 . If for
any n this fimction is continuous with respect to all its arguments, and if it has the following properties (1), (2), and (3),
5. /. Disorder and entropy
95
n
(1) For given n and for ^pk
= 1, the function H (p,, p2, - • •, pH ) takes its
largest value far pk = 1 / n, {k = 1,2, • • •, n), (2)
H(AB)
^H(A)+HA(B),
(3)
H(p,,pi,—,pll,Q) = ff(pt,p1}---,pj. (Adding the impossible event or any number of impossible events to a scheme does not change its entropy.) then we have ^
(5,11)
where c is a positive constant and the quantity Hd{B) = ^ipkHk{E)
is the
k
mathematical expectation of the amount of additional information given by realization of the scheme B after realization of scheme A and reception of the corresponding information. This theorem shows that the expression for the entropy of a finite scheme which we have chosen is the only one possible if we want it to have certain general properties which seem necessary in view of the actual meaning of the concept of entropy (as a measure of uncertainty or as an amount of information). The proof can be found in Khinchin (1957). Consider a continuous random variable distributed as f(x) on an interval [a,b]. Divide the interval into n equidistance subintervals using knot sequence 4\' ii >'"'»C+i • The probability for a point to be in the &-th subinterval is
(5.12)
where Ax = £i+1 -f t is the subinterval length. Substituting it in equation (5.11) yields
(5.13)
96
5. Disorder, entropy and entropy estimation
The second term on the right hand side of the above equation is a constant if we n
n
n
note iSx^jf{^k)=^iMf{§k)=^ipk
=1 and log Ax is a constant. So only the
first term on the right hand side is of interest. As division number becomes large, H -> so, the first term on the right hand side is just the integral H(f,f) = -c\f{x)\ogf{x)dx.
(5.14)
where two arguments are in the expression H(f,f). In the subsequent sections, we will encounter expression H(f,g) indicating that the function after the logarithm symbol in equation (5.14) is g. Equation (5.14) is the definition of entropy for a continuous random variable. The constant c = 1 is often assumed. A question that is often asked is: since the probability distribution already describes the probability characteristics of a random variable, why do we need entropy? Yes, probability distribution describes the probability characteristics of a random variable. But it does not tell which one is more random if two probability distributions are given. Entropy is used for comparing two or more probability distributions, but a probability distribution describes the randomness of one random variable. Suppose that the entropy of random variable X is 0.2 and that of random variable Y is 0.9. Then we know that the second random variable is more random or uncertain than the first. In this sense, entropy assign an uncertainty scale to each random variable. Entropy is indeed a derived quantity from probability distribution, but it has value of its own. This is quite similar to the mean or variance of a random variable. In fact, entropy is the mathematical expectation of -log/(jc), a quantity defined by some authors as information. Example 5.4 Entropy of a random variable Suppose a random variable is normally distributed as f(x)=
.—- exp V2
From definition (5.14) we have
= -jf(x)loBf(x)dx 00
- J : &*"*{_
exp
-.* 7J log5 -7==-exp - v n 7 \\ttc 2«x J * " W 2 ^ f f ^ | 2cr
(5.15)
5.2. Kullhack information and model uncertainty
97
I
(5.16)
The entropy is a monotonic ftinction of variance independent of the mean. Larger variance means larger entropy and viee versa. Therefore, entropy is a generalization of the concept variance, measuring data scatters around the mean. This is reasonable because widely scattered data are more uncertain than narrowly scattered data. 5.2 Kullback information and model uncertainty In reality show The wheel of Fortune a puzzle with a slight hint showing its category is given to three contestants. The puzzle may be a phrase, a famous person's name, an idiom, etc. After spinning the wheel, the first contestant has a chance to guess which letter is in the puzzle. If he/she succeeds, he/she has the second chance to spin the wheel and guess again which letter is in the puzzle. If he/she fails, the second contestant will spin the wheeltocontinue the game, and so on untilfeepuzzle is unraveled finally. We simplify the example a little. The process for solving the puzzle is in feet a process for reducing uncertainty, that is, entropy. At the very beginning, which letter will appear in the puzzle is quite uncertain. The guessing process is one that each contestant assigns a probability distribution to the 26 letters. As the guessing process continues, more and more information has been obtained. And the probability assigned to letters by each contestant gets closer and closer to the answer. Suppose at an intermediate step the probability distribution given by a contestant is
(fU7)
The question is to solve the puzzle, how much more information is needed? In other words, how far away is the contestant from the true answer? Each contestant speaks loud a letter which he/she thinks should be in the puzzle. Whether the letter is in the puzzle or not, we obtain information about the puzzle. And the letters given by the contestants form a sample, the occurrence probability of which is (5.18) «,!«,!.••«„
Its entropy is
98
J. Disorder, entropy and entropy estimation log p(B) = ~y]—lag qk
(5.19)
where the constant term is neglected. From the large number theorem, we conclude that as the sample size n, becomes large, the above entropy becomes 1
n
- lim — log p{B) = - V pk log qk .
(5.20)
Denoting the term on the right hand of the above equation by
We conclude that H(p,q) is a new entropy concept interpreted as follows. Suppose the true probability distribution is given by equation (5.6). We take a sample from the population, and obtain a probability distribution given by equation (5.17). The entropy estimated by equation (5.21) is entropy H(p,q). Therefore, H(p,q)represents the entropy of the scheme p measured by model if. More precisely, the entropy of a random variable is model-dependent. If a model other than the true one is used to evaluate the entropy, the value is given by H(p,q). We note that tog A is the entropy of the finite scheme under consideration. The difference between H(p,q) and H(p,p) represents the amount of information needed for solving the puzzle, that is, l(p,q) = H(p,q)-H(p,p) = f > t log^-.
(5.23)
I(p,q) is defined as Kullhack information. It may also be interpreted as the amount of uncertainty introduced by using model q to evaluate the entropy of p .
5.2, Kullback information and model uncertainty
99
Theorem 5.2 Kullback l(p,q)has the following properties: (1) (2) J(p,q) = Q if and only if
pk=qk.
Proof; Let x>0 and define function / ( x ) = logx—x+l . f{x) takes its maximum value 0 at point x = l , Thus, it holds that / ( J C ) ^ O . That is, log x < x - 1 . The equality is valid only when x = 1. Setting * = qk I p k , we have
Pk
Pk
and
w *
Pk «
\Pk
j
w
w
Multiplying minus one on both sides of the above equation, we obtain
Jftlog-^->0. The equality holds true only when pk =qk.
(5.26) •
The above concepts can be generalized to continuous random variable. Suppose X be a continuous random variable with pdf f(x). The entropy measured by g(x)is
H(f, g) = ~ J / t o log g(x)dx.
(5.2?)
The difference between the true entropy and the entropy measured by g(x) is the Kullback information l(f, g) = H{f, g) - H(f, / ) = f/(*) I o g 4 ^ •
C 5 - 28 )
J. Disorder, entropy and entropy estimation
100
Besides the above interpretation, I(f,g) may also be interpreted in the following way. Suppose a sample is taken from X, which entropy is H(/, / ) . Because the sample is a subset of the population, it cannot contain all information of the population. Some of information of the population must be missing. The amount of information missing is I(f,g) if g(x) represents the pdf fully describing sample distribution. In this sense, I(f,g) represents the gap between the population and a sample. Theorem 5.3 Kullback information / { / , g) for continuous distribution satisfies (1)
(2) I(f, g) = 0 if and only if f = g.
(5.29)
Kullback information is interpreted as the amount of information missing. The first Property indicate that the missing amount of information is always positive, a reasonable conclusion. The second Property imply that an arbitrarily given distribution cannot fully carry the information contained by another distribution unless they are same. Note that Kullback information is not symmetric, that is,
/(/.ir)
(5.30)
Example 5.5 Entropies of two normal random variables Suppose two normal random variables are distributed as 1
/(*) =
-exp
and g(x) =
1
2a
I -exp •jhi;
From definitions (5.27) and (5.28) we have
#,*) = -]/(*) log *(*)<* 1
I (x-{tf |. I 1 I (x-vf PI - h r 5 ^ Jog^ ^ ^ e x p l - It1
ex
J
'-hk^-^l*
i
^ r
(x-y) 2
2r2
dx
2r2
(5.31)
5.2. Kullback information and model uncertainty
101
(5.32a) Comparing equation (5.16) we see that the entropy H{f,g) is no longer independent of the mean. It is a function of the variances and means of these two random variables. The two variances cannot be exchanged in equation (5.32a), and thus H(f, g) is not symmetric, satisfying equation (5.30). Kullback information is
exp -
-f i =
r
—
exp -
2o
"
Similarly, we have
gf
g\
^ \
i
( f ]
(5.32c)
Summing up equations (5.16) and (5.32b) yields (5.33) The term on the right hand side of equation (5.33) is obtained from equation (5.32a). The above equation has validated equation (5,28) through a particular example. It is not difficult to verify equation (5.30) through equations (5.32b) and (5.32c), but the procedure is a little lengthy, thus neglected here.
102
J. Disorder, entropy and entropy estimation
Theorem 5.4 If the pdf of a random variable isf(x), then for any statistical model g(x) other than f(x), the entropy H(f,f) is smallest. In notation, H(f,g)ZH(f,f).
(5.34)
Proof. Applying equation (5.29) to equation (5.28) leads immediately to equation (5.28). • Kullback information / ( / , g) is a useful concept, having played an important role in communication theory. But here we emphasize its influence on statistics. Statistics is characterized by two typical procedures: estimation and hypothesis testing. Both share one thing in common: inference as we mentioned in Chapter 2. The essence of inference is to make decision based on incomplete information. Suppose the true pdf (model) of a random variable is f(x) which is not known. What is known is a candidate model g(x) which can be obtained from a sample. Because the size of a sample is finite, it cannot contain all the information in the population. g(x) determined through the sample is generally not equal to f(x). Thus, the two models f(x) and g(x) are different, resulting in model uncertainty. In a statistical inference problem, the true model f(x) is not known. What is known is the sample-determined model g(x). All decisions are made based on g(x). Because of model uncertainty, the entropy evaluated by g(x) cannot be equal to the true entropy H(f,f). Equation (5.34) shows that the entropy predicted by any pdf (or model) other than the true pdf (or model) is larger than the true entropy. If we want to estimate the entropy H(f, / ) , we need to know the entropy resulting from model uncertainty. Consider a special case in that the difference between two probability distributions / and g is small such that Taylor expansion is valid. The difference between the two distributions is estimated by the relative percentage squared,
f
(5.35)
Because A is • a function of x, the total difference should be an integral with respect to x weighted by / . So we have
5.2. Kullback information and model uncertainty
J/Acfe = J/ii^pldk = J(g - f)^p-dx.
103
(5.36)
If x is small, we have expansion (5.37) So equation (5.36) is rewritten in the form of
[l +
% ss..
(5.38)
The right hand side is in fact the divergence of the two probabilities, because
J(f ~ / ) log[l+tejQdx = f(g - / ) log^r dx ^ & + J/log^ A =J(f,g)
(5.39)
Using symbols for Kullback information, the last term is ).
(5.40)
J(f,g) has a special name called divergence oftwopmbability distributions. It measures the uncertainty resulting from evaluating entropy/ by use of model g, quantifying the difference between two probability distributions. Therefore, J{f*g) defines model uncertainty. J{f, g) has the following properties. Theorem 5.5 Divergence of two probability distributions J(f,g) satisfies
(1) J(f,g)>Q, (2) JW,g) = J(g,f), (3)
(5.41)
Jif, g) = 0 if and only iff = g.
But J(f, g) does not satisfy the triangle inequality ).
(5.42)
104
J. Disorder, entropy and entropy estimation
Proof: These properties are natural conclusions obtained from the definition of the divergence of two probability distributions, J(f, g) = / ( / , g) + I(g, f). o Example 5.6 Divergence of two probabilities Consider Example 5.5 with the two probability density functions given by equation (5.31). Direct summing up equations (5.32b) and (5.32c) yields J(f,g) = I(f,g) + Kg,f)
(5.43)
It is not difficult to see that J(f, g) > 0 if we note that the first two terms on the right hand side of equation (5.43) are no less than zero and the third term is no less than zero, too. The difference of two normal distributions is characterized by two parameters: their variances and means. If the means of two normal random variables are same, the difference is described by the second term on the right hand side of equation (5.43), being a function of their variances only. If their variances are same, their difference is a function of only their means given by the third term, hi general, the difference is given by equation (5.43), a junction of their variances and means. We now return to the discussions of model uncertainly. In section 2.3.2, we mentioned sampling error resulting from the difference of a statistic and the true parameter under consideration. What is missing there is there is the errors resulting from misuse of models. If a model is not properly specified, errors will be induced. In traditional statistics, an underlined assumption is that the true model is always known. What is unknown is the parameter(s) present in the model. Introduction of information theory into statistics indicates this is not enough. If model uncertainty is not correctly handled, non-statistical errors will come into our analysis and misleads our subsequent statistical decision. Model uncertainly is one of the most important contributions information theory has made to statistics. In a typical statistical inference problem, therefore, we find two sources of uncertainties: one resulting from the random variable itself and the model uncertainty. In notation, the total statistical entropy (TSE), is = H(f,f)+J(f,g). It may also be written in the following forms = H(f,f)+J(f,g) = mf,f)+I(f,g)+I(g,f)~H(f,g)+I(g,f)
(5.44)
5,3. Estimation of entropy based on large samples
105
The first term on the right hand side of equation (5.45) is the entropy of/measured by g, and the second term is the amount of information missing in the process of replace g by / In other words, if g is known, and from g to find / the required amount of information is I(g,f) .This is Sample X schematically shown in Figure 5.2 . Two procedures are shown in the figure. In the first procedure indicated by a downward arrowed line, a sample is drawn from/ Including I(g,f) in equation Figure 5.2 Uncertainty present in (5.45) is important. This can be seen inference process by comparing equations (5.16),(5.33) and (5.43). In these equations, if we let T2 -> 0, we may see that TSE has a form closer to # ( / » / ) than # ( / , g ) , t h a t i i , (5.46) he first term on the right hand side is as same as that of H(f,f)in equation (5.16) while this term does not turn up in the expression for ff(f,g). It is plausible that TSE has stronger capability to recover / from g than H ( / , g). This will be numerically demonstrated in the subsequent chapters. 5.3 Estimation of entropy based on large samples In the above, discussions are focused on entropies which are defined in the framework of probability theory. We now turn to the entropy carried by a sample, so discussing the issue more in the framework of statistics. If the pdf of a random variable (or the distribution of a finite scheme) is known, the above-mentioned entropies can be calculated through simple manipulations'. In most cases in applications, however, the pdf is not known. And statistical estimation must be employed to find the distribution through a random sample taken from the population. In Chapter 2, it was mentioned that a good estimator is required to have three properties: consistency, unbiasedness and efficiency. It is also pointed out there
5, Disorder, entropy and entropy estimation
106
that the first two properties are very important. Estimators constructed for estimating entropies are also required to be consistent, unbiased and efficient. Unfortunately, to construct an unbiased estimator of entropy is not an easy work to do. We thus relax the restriction, trying to construct asymptotically unbiased estimator. In other words, these estimators are unbiased only when the sample size is large. It is well known that entropy estimation is not trivial. Statistical fluctuations of the random sample used to estimate unknown parameters induce both statistical and systematic deviations of entropy estimates. In the naive ('likelihood') estimator one replaces the pdf / ( * ) in the Shannon entropy / / ( / , / ) = - J / ( x ) log f{x)dx by an estimate / ( x ) , More precisely, the naive estimator (5.47) leads to a systematic underestimation of the entropy H. Take M-L estimator for instance. If a sample is drawn from the population, and the unknown parameter is estimated from the sample, the entropy estimated by use of equation (5,47) would yield a value smaller than the true value in most cases. Therefore, numerous studies have been conducted to build unbiased estimators for entropy. In the following, a variety of estimators for various entropies are given.
-0.2 0
50
100
150
100
Figure 5.3 Naive estimate of entropy Htfj) using equation (5.47)
5. J, Estimation of entropy based on large samples
107
Example 5.7 Biased estimate of entropy Reconsider Example 5.4 . The entropy of a normal random variable is
H(f,f) = In this example we assume that the variance is 1. If a sample is drawn from the population, the unbiased estimate of the variance a1 is
(5-48)
108
5. Disorder, entropy and entropy estimation
Theorem 5.6 The asymptotically unbiased estimator of entropy H(X | a 0 ) = / / ( / , / } based on a large sample of size ns is (5.50)
Proof. Expand_/fx|a) around the true value of a0
i 2
(5.51)
,daf
where Aat =at-e^ is the difference between estimated and true values. If the following notations are used, (,52)
equation (5.51) becomes
fix| •) « /" +f^Aa, +L-f£ 9a,
(5.53)
2 dafiaj
Similarly we have the following expansion for logarithm
I a) * log/ 0
(5.54)
2
Therefore, we have
H(f, / ) = H(X | a) = - J/(* | a) log / ( * | &)dx
•dx.
(5.55)
J. J. Estimation of entropy based on large samples
109
In the above, we replace H{f,f) by H{X\k) to reveal the influence of statistical fluctuations. In short form, equation (5.55) is (5.56) where %j*Q
•0
da, T =-
^log/0
(5.57a)
da,
3f° 9log/0 -dx AaAa, 1 da,
\r
log/
(5.57c)
•dx
flog/" J
(5.57b)
da.
dx iMfAaj. dajdaj
(5.57d)
\
The first term is the entropy measured by the frue model, denoted by (5.58) The term T, is a normal random variable because each Aaf is a normal random variable with zero mean. The mean of T, about sample is thus zero, that is, £j7;=0.
(5.59)
Estimates of the rest three terms are generally not zero. They result in biased estimate of entropy. To estimate T21, we rewrite matrix in Tlt in the form of m
r J
0a,
da,
(5.60)
which is the expectation of the product of tiie two logarithmic terms in the equation above. In other words, we have the following relationship
5. Disorder, entropy and entropy estimation
110 \dlog/0
1
dlog/8"
da
*
(5.61)
da
J.
This is the fisherian information matrix. With the notations defined above, T2\ becomes (5.62) Aa,. is asymptotically a normal random variable with zero mean, and thus F2/ is a random variable following chi-square distribution with mean nfins, where nf is the number of free parameters in the model. Taking average on both sides of the equation above yields the following estimate (5.63)
To estimate 7#, we rewrite Equation (5.57c) in the following form
? Sa, daj f dafieij
tbc.
(5.64)
The first term on the right hand side is
0
-If
r051og/°
1 % da,
aa,.
Slog/0 5Si.
(5.65)
and the second term on the right hand side is
*
0
J/
f dafia
52
f;
(5.66)
Because the integral of probability density function is a constant. Therefore, the asymptotically unbiased estimate of TJ2 is
5.3. Estimation of entropy based on large samples \ E^MM 2
i^. 2 n.
111 (5,6?)
Estimating T33 is a little complicated. In Chapter 4, we introduced that a continuous function can be approximated to any desired degree by a polynomial ft
of the form f(x) = J ^ x * » where ak (1 <; k <, n) are unknown coefficients. If we use a polynomial to approximate pdf f(x), we then conclude that the second derivatives of f{x) are zeros, meaning that Tn = 0. In summary, we have )] = H(x\&)+^. 2 n,
(5.68)
Thus, the asymptotically unbiased estimator of the Shannon entropy is given by the term on the right hand side of equation (5.68), that being the right hand side of equation (5.50). a The proof is lengthy, but the return is rewarding. It not only points out that the naYve estimator (5.47) does yield a systematic deviation from the true value, but also quantify the systematic deviation. This estimator is applicable to continuous and discrete cases. In the discrete case, the first term on the right hand side is replaced by (5.69) where q is maximum likelihood estimator of frequency. And nf is the cell number of histogram minus 1. Example 5.8 Bias estimation Reconsider Example 5.7. Two cases were considered in Example 5.7. For the ease of sample size 10, the bias is obtained from the figure to be around -0.05. Based on equation (5.50), the bias should be - n / / 2 « , =-1/20 = -0.05 . The numerical example and theoretical evaluation is same. For the case of sample size 50, the theoretical value for the bias is -nf I2nt= -1/100 = -0.01. The bias obtained from numerical calculation is -0.011. In this example, only one parameter a1 is present. Thus, «/=!.
112
J. Disorder, entropy and entropy estimation
If the naive estimator is used, then not only Shannon entropy is biased but also Kullback information is biased. We now turn to constructing asymptotically unbiased estimator of Kullback information. From the definition, we have (5.70)
I[f,f(X | The second term on the right hand side is
(5.71)
1og/°«fr- \f
Using expansion of l o g / ( J f | a ) in equation (5.45) and the definition of the entropy defined by the true model yields the following equation
/[/°,/Pf|a)]=J/° log fdx (5.72)
dx
The first and second terms on the right hand side of the equation above cancel each other with only the third term left. Therefore, we obtain
4
(5.73)
21 Equation (5.56) already predicts the unbiased estimate of Tn to be
nflns.
Theorem 5,7 The asymptotically unbiased estimator of Kullback information is
In,
(5.74)
With these, we may obtain the estimator for the third entropy we are interested in. Theorem 5.8 The asymptotically unbiased estimator of H(f, g) is (5.75)
J, J, Estimation of entropy based on large samples
113
The fourth estimator is about the divergence of two probabilities. It measures the difference between two probabilities. In this case, it measures the difference between the true model f(x) = f(x\a") and the candidate model
Theorem 5.9 The asymptotically unbiased estimator of J(f,g) is •) = X
(5.76)
Example 5.9 Estimation of various entropies In this example, numerical values are given to demonstrate the asymptotically unbiased estimators constructed in theorems 5.7~ 5.9. The normal distribution is used for the example. As before the unbiased estimate of the variance cr2 is
The true and estimated models (pdfs) are, respectively
^ l
(5.78)
From definitions we have ,
1,
o-2
1
ff2
1.
.,
1
1
(5.79b) In the above equations, tr2 = 1 is assumed. Repeat the calculations performed in Example 5.7, we obtain numerical values for the quantities in the above as shown in Figure 5.4. Their theoretical values are also given in the figure. The particular values for
114
5. Disorder, entropy and entropy estimation
Kullback information and the divergence are very close to their theoretical values derived in this section. What is missing in the figure is H(f, g ) , which is a function of I{f,g). If the latter is accurately estimated, then H(f,g) accurately estimated, too.
can be
0.1
J(f,g)
Figure 5.4 Numerical example demonstrating the asymptotically unbiased estimators and the distance to their theoretical values
5.3.2 Asymptotically unbiased estimator of TSE and AIC
The purpose of finding all the previous estimators is to find the estimator for TSE.
Theorem 5.10 The asymptotically unbiased estimator of TSE is
ME = \hat{H}(f,\hat{f}) + \hat{J}(f,g) = H(X | \hat{a}) + \frac{3n_f}{2n_s}.
(5.80)
Note that a special name, measured entropy (ME), is used to denote the estimator of TSE here. This is a useful estimator. It states an unpleasant fact. Suppose we are given a model f(x | a) with unknown parameter a. Using a sample from the population to estimate the unknown parameter a, we obtain \hat{a} and f(x | \hat{a}). Equation (5.80) shows that f(x | \hat{a}) may not be the true model f(x | a). Estimation always
induces some amount of uncertainty. The second term on the right hand side of equation (5.80) deserves particular attention because it quantifies the systematic deviation of entropy estimation. An analogue is a digital camera. The quality of a digital camera is assessed by its pixel count: the more pixels, the clearer the photos. The second term on the right hand side is inversely proportional to the "pixels" here: the smaller the term, the closer the candidate model is to the true model. Another quantity of interest is the unbiased estimation of the log-likelihood function.

Theorem 5.11 The asymptotically unbiased estimator of the log-likelihood function

L(X | \hat{a}) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log f(x_i | \hat{a})    (5.81)

is

\hat{L} = H(X | \hat{a}) + \frac{n_f}{n_s}.    (5.82)
Proof. Rewrite equation (5.54):

\log f(x | \hat{a}) \approx \log f^0 + \sum_{i}\frac{\partial \log f^0}{\partial a_i}\,\Delta a_i + \frac{1}{2}\sum_{i}\sum_{j}\frac{\partial^2 \log f^0}{\partial a_i \, \partial a_j}\,\Delta a_i \, \Delta a_j.    (5.83)
Substituting it into equation (5.81) yields

L \approx -\frac{1}{n_s}\sum_{t=1}^{n_s}\log f^0(x_t) - \frac{1}{n_s}\sum_{t=1}^{n_s}\sum_{i}\frac{\partial \log f^0(x_t)}{\partial a_i}\,\Delta a_i - \frac{1}{2n_s}\sum_{t=1}^{n_s}\sum_{i}\sum_{j}\frac{\partial^2 \log f^0(x_t)}{\partial a_i \, \partial a_j}\,\Delta a_i \, \Delta a_j.    (5.84)
As the sample size is large, the first term approaches

-\frac{1}{n_s}\sum_{t=1}^{n_s}\log f(x_t | a^0) \to H(f,f)    (5.85)

based on the law of large numbers. The second term, as mentioned above, will asymptotically approach zero because it is a sum of normal random variables.
The third term will approach

-\frac{1}{2n_s}\sum_{t=1}^{n_s}\sum_{i}\sum_{j}\frac{\partial^2 \log f^0(x_t)}{\partial a_i \, \partial a_j}\,\Delta a_i \, \Delta a_j \to T_s.    (5.86)

This is in fact T_s in equation (5.57c). It is asymptotically a chi-square random variable with n_f degrees of freedom, and its expectation with respect to the sample X is

E_X[T_s] = \frac{n_f}{2n_s}.    (5.87)
We have shown in the previous section that the asymptotically unbiased estimator of H(f,\hat{f}) is

\hat{H}(f,\hat{f}) = H(X | \hat{a}) + \frac{n_f}{2n_s}.    (5.88)

Furthermore, the first term on the right hand side of the above equation can be estimated by use of the following estimator:

H(X | \hat{a}) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log f(x_i | \hat{a}).    (5.89)
Substituting equations (5.87) and (5.89) into (5.84) and taking expectation on both sides, we obtain the asymptotically unbiased estimator for the log-likelihood function L given by equation (5.82). □

A special name is given to this estimator, Akaike Information Criterion, or AIC for short. Historically, it was first obtained by Akaike (Sakamoto, 1993). The AIC here is not in its original form, differing by a factor of 2n_s. It is beneficial to make a comparison between AIC and ME. If the sample is large enough, the log-likelihood function in equation (5.81) asymptotically satisfies

L(X | \hat{a}) \to H(f,g)    (5.90)

if g(x) = f(x | \hat{a}) is used. Referring to Figure 5.2, AIC is an estimator of the entropy contained in the
sample X. From the definition, ME estimates the uncertainty associated with the total statistical process. AIC predicts only the entropy present in the estimation process, without considering whether the model can recover the true model.

Example 5.10 ME and AIC The true and estimated models (pdfs) are, respectively,

f(x) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right),    (5.91)

\hat{f}(x) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}}\exp\!\left(-\frac{x^2}{2\hat{\sigma}^2}\right).    (5.92)
The theoretical value of TSE and its estimate are, respectively,

TSE = H(f,f) + J(f,g) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2} + \frac{\sigma^2}{2\hat{\sigma}^2} + \frac{\hat{\sigma}^2}{2\sigma^2} - 1,    (5.93a)

ME = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log\hat{\sigma}^2 + \frac{1}{2} + \frac{3n_f}{2n_s}.    (5.93b)
Comparison of these two quantities is plotted in Figure 5.5. The two quantities are pleasantly close to each other, numerically validating equation (5.80).
Figure 5.5 Numerical comparison of TSE and ME
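A rough numerical check of this comparison can be sketched as follows. The snippet below is an illustrative Monte Carlo experiment (not the book's code): it assumes the zero-mean normal setting of this example with n_f = 1, uses the closed-form normal expressions derived above for TSE and ME, and averages both over repeated samples. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def tse_and_me(n_s, n_rep=2000):
    """Average theoretical TSE and its estimator ME over n_rep samples of size n_s.
    Zero-mean normal model, true variance sigma^2 = 1, one free parameter (n_f = 1)."""
    tse_vals, me_vals = [], []
    for _ in range(n_rep):
        x = rng.normal(0.0, 1.0, size=n_s)
        s2 = np.mean(x**2)                              # M-L estimate of the variance
        # theoretical TSE = H(f,f) + J(f,g) for the fitted model
        tse = 0.5*np.log(2*np.pi) + 0.5 + 0.5/s2 + 0.5*s2 - 1.0
        # ME = H(X|a_hat) + 3 n_f / (2 n_s)
        h_naive = 0.5*np.log(2*np.pi*s2) + np.mean(x**2)/(2*s2)   # = -(1/n_s) sum log f(x_i|a_hat)
        me = h_naive + 3.0/(2*n_s)
        tse_vals.append(tse)
        me_vals.append(me)
    return np.mean(tse_vals), np.mean(me_vals)

for n_s in (10, 50, 200):
    tse, me = tse_and_me(n_s)
    print(f"n_s={n_s:4d}  mean TSE={tse:.4f}  mean ME={me:.4f}")
```

As the sample size grows, the two averages should approach each other, mirroring the closeness seen in Figure 5.5.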
5.4 Entropy estimation based on small samples
All estimators obtained in the previous section have been obtained based on the assumption that the sample size is large. In the case of small samples, we will employ different techniques, as outlined in the following. In section 2.3.2, we have seen that the sample mean is a random variable, too. If the sample size n_s is very small, say n_s = 1, the sample mean and the random variable are identically distributed. On the other hand, if the sample size n_s is large, the sample mean is asymptotically distributed as a normal random variable. If the sample size is not so large as to guarantee that the sample mean is close to the normal distribution, nor so small as to enable us to compute the sample distribution by using the method presented in section 2.2.2, then we have to develop new methods for estimation. The two cases mentioned above, very small samples and large samples, share one thing in common: the sample mean is treated as a random variable. For the case in between the two extremes, it is natural to assume the sample mean is a random variable, too. Therefore, in the most general case, the unknown parameter \theta in f(x | \theta) is treated as a random variable. In doing so, we change from the traditional statistics practice of determining the parameter \theta through the sample X into determining the distribution of the parameter \theta through the sample X. In notation,

X \to \theta \;\Longrightarrow\; X \to P(\theta | X)
(5.94)
where P(\theta | X) is the pdf of the parameter \theta to be estimated from the given sample X. This is the basic assumption of Bayesian statistics. In the framework of Bayesian statistics, P(\theta | X) is written in the form of

P(\theta | X) = \frac{P(X | \theta)\,P(\theta)}{\int P(X | \theta)\,P(\theta)\,d\theta}    (5.95)
where, in Bayesian language,

P(X | \theta) = \prod_i f_X(x_i | \theta) is the sample occurrence probability given \theta,
P(\theta) is the prior distribution of \theta,
P(X) = \int P(X | \theta)\,P(\theta)\,d\theta is the sample occurrence probability, or marginal probability,
P(\theta | X) is the posterior distribution of \theta.    (5.96)
Now consider the problem of how to measure the uncertainty of the parameter \theta. Such uncertainty comes from two sources: the sample itself, and the uncertainty of \theta after the realization of the sample. The uncertainty associated with the sample is

H(X) = -\int P(X)\log P(X)\,dX = -E_X \log P(X).
(5.97)
In the framework of Bayesian statistics, P(X) is given by equation (5.96). Consider two events: A = \theta and B = \theta | X. Then the uncertainty associated with the parameter \theta is described by

H(\theta) = H(X) + H_X(\theta)
(5.98)
if property (2) of Theorem 5.1 is employed. The uncertainty associated with event B is defined by

H_X(\theta) = \int P(X)\,H(\theta | X)\,dX = E_X\,H(\theta | X)
(5.99)
where

H(\theta | X) = -\int P(\theta | X)\log P(\theta | X)\,d\theta = -E_{\theta | X}\log P(\theta | X).
(5.100)
Equation (5.97) shows that -\log P(X) is an unbiased estimator of H(X), and equation (5.100) shows that -\log P(\theta | X) is an unbiased estimator of H(\theta | X). Therefore, we obtain the unbiased estimator of the entropy of the parameter \theta, that is,

\hat{H}(\theta) = -\log P(X) - \log P(\theta | X).
(5.101)
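The two building blocks of equation (5.101) can be computed numerically for a simple case. The sketch below is illustrative only: it assumes a normal likelihood with known unit variance and a normal prior for the mean \theta, evaluates the marginal probability P(X) of equation (5.96) and the posterior of equation (5.95) on a grid, and then forms the estimate of (5.101) at the posterior mode (one possible choice of \theta at which to evaluate it).

```python
import numpy as np

rng = np.random.default_rng(1)

# assumed setup (illustrative): x_i ~ N(theta, 1), prior theta ~ N(0, 2^2)
x = rng.normal(1.0, 1.0, size=20)

theta = np.linspace(-10.0, 10.0, 4001)          # grid over the parameter
dtheta = theta[1] - theta[0]

# log P(X | theta) on the grid
loglik = -0.5*len(x)*np.log(2*np.pi) - 0.5*((x[:, None] - theta[None, :])**2).sum(axis=0)
log_prior = -0.5*np.log(2*np.pi*4.0) - theta**2/(2*4.0)

joint = np.exp(loglik + log_prior)
p_x = np.sum(joint) * dtheta                    # marginal probability P(X), eq. (5.96)
posterior = joint / p_x                         # posterior P(theta | X), eq. (5.95)

i_map = np.argmax(posterior)                    # posterior mode (illustrative evaluation point)
h_theta = -np.log(p_x) - np.log(posterior[i_map])   # eq. (5.101)
print("-log P(X)        =", -np.log(p_x))
print("posterior mode   =", theta[i_map])
print("H_hat(theta)     =", h_theta)
```

The same grid evaluation of -log P(X) is what underlies the prior-selection criterion discussed in section 5.5.2.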
5.5 Model selection
Model selection will be specially treated in Chapters 6 to 10. We will, however, present the general theory of model selection here. Traditional statistics has focused on parameter estimation, implicitly assuming that the statistical model, or the pdf of a random variable, is known. This is in fact not the case. In real-world applications, and in most cases, the model is unknown. Referring to Figure 4.1, we have shown three possible models (lognormal, normal and Weibull) to approximate the unknown pdf under consideration. Purely from the graph, it is hard to decide which is the best fit to the observed data. Therefore, we do not have any convincing
support to assume that the statistical model under consideration is known. This signifies a shift from the traditional assumption that the statistical model under consideration is known with unknown parameters, to the contemporary view that the statistical model under consideration is also unknown, with unknown parameters. Such a slight change makes a big difference, because this problem has not been discussed in traditional statistics. In summary, in modern statistics, both the model and the parameters are unknown. How to determine a statistical model is thus the biggest challenge we face once we drop the assumption that the model under consideration is known. It is at this point that information theory comes into play. The prevailing solution is as follows: suppose we are given a group of possible statistical models. We are satisfied if there exists some criterion able to tell us which model is the best among all possible models. We call this procedure model selection. By model selection, we change the problem from determining statistical models to selecting models. To be able to select the best model, we thus need to do two things. The first is that the group of possible models should be so flexible and universal that the model we are after is included in the group of possible models. This problem has been touched on in Chapter 4, being a procedure for function approximation. The second is that we have a criterion at hand to help us select the best model from the group of possible models. This problem is the main focus of this section.

5.5.1 Model selection based on large samples
Suppose we are given a group of possible models \{f_i(x | \theta_i)\} (i = 1, \ldots, m), each of which contains an unknown parameter \theta_i. Draw a sample X from the population, and use the sample to estimate the unknown parameter \theta_i in each
model by some method (the M-L method, say), yielding the estimate \hat{\theta}_i. Suppose the true model is among the group \{f_i(x | \theta_i)\}. Then the true model must minimize TSE, and Theorem 5.10 gives the asymptotically unbiased estimate of TSE, denoted by ME. Combining the two together, we obtain the criterion for model selection.

Theorem 5.12 Among all possible models, the true model minimizes

ME = \hat{H}(f,\hat{f}) + \hat{J}(f,g) = H(X | \hat{\theta}_i) + \frac{3n_f}{2n_s} \to \min,    (5.102)

where \hat{\theta}_i is the M-L estimate of the unknown parameter \theta_i based on large samples.
An alternative criterion for model selection is AIC.

Theorem 5.13 Among all possible models, the true model minimizes

AIC = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log f(x_i | \hat{\theta}) + \frac{n_f}{n_s} \to \min.    (5.103)
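The criterion of Theorem 5.13 is easy to apply in practice. The following minimal sketch (illustrative, not from the book) fits two hypothetical candidate families to the same sample by maximum likelihood — an exponential model with one free parameter and a lognormal model with two — and selects the one with the smaller AIC of equation (5.103).

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.5, size=200)     # sample from the (unknown) population
n_s = len(x)

# Candidate 1: exponential model, one free parameter (n_f = 1)
lam = 1.0 / x.mean()                                  # M-L estimate
loglik_exp = np.sum(np.log(lam) - lam*x)

# Candidate 2: lognormal model, two free parameters (n_f = 2)
mu, s2 = np.log(x).mean(), np.log(x).var()            # M-L estimates on the log scale
loglik_ln = np.sum(-np.log(x) - 0.5*np.log(2*np.pi*s2) - (np.log(x) - mu)**2/(2*s2))

# AIC of equation (5.103): minus the mean log-likelihood plus n_f / n_s
aic = {"exponential": -loglik_exp/n_s + 1/n_s,
       "lognormal":   -loglik_ln/n_s + 2/n_s}
best = min(aic, key=aic.get)
print(aic, "-> selected model:", best)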
It should be pointed out that there are possibly other criteria for model selection. But to the best knowledge of the author, applying ME and AIC to model selection has been studied most extensively up to now, and thus only these two criteria are introduced here. Theorems 5.12 and 5.13 solve our problem of model selection. They are very useful tools for large-sample problems.

Example 5.11 Model selection: polynomial regression (Sakamoto et al., 1993) In Table 5.1 is given a group of paired data (x_i, y_i), which are plotted in Figure 5.6. We want to know the relationship between x and y. The general relationship between the pair as shown in Figure 5.6 can be approximated by use of a polynomial (polynomial regression)

y = a_0 + a_1 x + \cdots + a_m x^m.
(5.104)
In this example, the polynomial regression takes the particular form of

y_i = a_0 + a_1 x_i + \cdots + a_m x_i^m + \varepsilon_i    (5.105)
where \varepsilon_i are independent normal random errors and m is the degree of the regression polynomial. This model is the sum of a polynomial in the deterministic variable x_i and a random error \varepsilon_i, resulting in a random variable y_i. This regression polynomial of degree m is a normal variable y_i with a_0 + a_1 x_i + \cdots + a_m x_i^m as the mean and an unknown \sigma^2 as the variance; that is, its pdf is given by

f(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2}{2\sigma^2}\right).    (5.106)
In the following, this regression polynomial of degree m is written as MODEL(m). MODEL(0) is such a regression polynomial that it is
independent of the variable x, distributed as N(a_0, \sigma^2). This model has two unknown parameters, a_0 and \sigma^2. MODEL(1) is such a regression polynomial that it is distributed as N(a_0 + a_1 x_i, \sigma^2), with three unknown parameters a_0, a_1 and \sigma^2. Similarly, MODEL(2) is a parabolic curve. If observed data (x_i, y_i) are given, we may perform regression analysis to find the unknown parameters involved in the model. The detailed procedure is given in the following for easy understanding.
Table 5.1 Observed data pairs

i:     1      2      3      4      5      6      7      8      9      10     11
x_i:   0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
y_i:   0.012  0.121  -0.097 -0.061 -0.080 0.037  0.196  0.077  0.343  0.448  0.434

Figure 5.6 Observed paired data (x_i, y_i)
(1) Likelihood analysis. Suppose n sets of data (x_1, y_1), \ldots, (x_n, y_n) are given. They are to be fitted using MODEL(m):

y_i = a_0 + a_1 x_i + \cdots + a_m x_i^m + \varepsilon_i.

The pdf of the model is given in equation (5.106). Then the likelihood function of these n sets of data is
L(a_0,\ldots,a_m,\sigma^2) = \prod_{i=1}^{n} f(y_i) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2\right).    (5.107)
The corresponding log-likelihood is then given by

\log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2.    (5.108)

(2) Likelihood maximization. If the maximum likelihood method is employed, the unknown parameters can be found by maximizing the log-likelihood function in equation (5.108). This is equivalent to minimizing

S = \sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2.    (5.109)

This is in fact the least-squares method. Summarizing the above procedure, we conclude that the problem of polynomial regression is reduced to the least-squares method. To minimize S in equation (5.109), a_0,\ldots,a_m must satisfy the conditions

\frac{\partial S}{\partial a_j} = 0, \qquad j = 0,1,\ldots,m.    (5.110)
From these equations, we may find the system of linear equations the M-L estimates \hat{a}_0, \ldots, \hat{a}_m should satisfy:
\begin{pmatrix} n & \sum x_i & \cdots & \sum x_i^m \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{m+1} \\ \vdots & & & \vdots \\ \sum x_i^m & \sum x_i^{m+1} & \cdots & \sum x_i^{2m} \end{pmatrix}\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^m y_i \end{pmatrix}.    (5.111)
Solving this system of linear equations gives the M-L estimates. One more parameter, \sigma^2, remains undetermined. By differentiating the log-likelihood in equation (5.108) with respect to \sigma^2, we may find the equation the M-L estimate \hat{\sigma}^2 should satisfy:

\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2 = 0.    (5.112)

Solving this equation yields the condition the variance must meet:
\hat{d}(m) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_m x_i^m\right)^2.    (5.113)

Here \hat{d}(m) denotes the variance \sigma^2 corresponding to MODEL(m). For example, MODEL(-1) is a simplified notation representing the case that y_i is independent of x, so that y_i is a normal random variable with zero mean and variance \sigma^2, that is, y_i \sim N(0,\sigma^2). Using these symbols, the maximum log-likelihood is

\ell(y_1,\ldots,y_n | \hat{a}_0,\ldots,\hat{a}_m,\hat{\sigma}^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\hat{d}(m) - \frac{n}{2}.
(5.114)
(3) Model selection. In MODEL(m), there are m+2 parameters: the regression coefficients
a_0,\ldots,a_m and the variance \sigma^2. Based on equation (5.93b) for calculating ME in the case of the normal distribution, we have, after some simple manipulations,

ME = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log\hat{d}(m) + \frac{1}{2} + \frac{3n_f}{2n_s},    (5.115a)

and AIC is, after simple manipulations,

AIC = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log\hat{d}(m) + \frac{1}{2} + \frac{n_f}{n_s},    (5.115b)
where the number of free parameters is n_f = m+2 and the sample size is n_s = 11. With the above preparations, we reconsider the data given at the beginning of this example. Straightforward calculations yield

\sum_{i=1}^{11} x_i = 5.5, \quad \sum_{i=1}^{11} x_i^2 = 3.85, \quad \sum_{i=1}^{11} x_i^3 = 3.025, \quad \sum_{i=1}^{11} x_i^4 = 2.5333,

\sum_{i=1}^{11} y_i = 1.430, \quad \sum_{i=1}^{11} x_i y_i = 1.244, \quad \sum_{i=1}^{11} x_i^2 y_i = 1.11298, \quad \sum_{i=1}^{11} y_i^2 = 0.586738.
ME and AIC values are then calculated for the different models. For example, the results for MODEL(-1) (zero-mean normal distribution) are

ME = \frac{1}{2}\log(2\pi) + \frac{1}{2} + \frac{1}{2}\log 0.0533 + \frac{3}{2}\times\frac{1}{11} = 0.0894,

AIC = \frac{1}{2}\log(2\pi) + \frac{1}{2} + \frac{1}{2}\log 0.0533 + \frac{1}{11} = 0.0439.

Continuing similar calculations, we determine the regression polynomials of various degrees and find the variances corresponding to each regression polynomial. Based on such calculations, both ME and AIC values can be easily evaluated. In Table 5.2, such calculation results are summarized up to degree 5. As the number of free parameters increases, the variance decreases fast as long as the number of free parameters is smaller than 3. As the number of free parameters rises above 4, the variance does not change much. Both ME and AIC are minimized when the second-degree regression polynomial is employed.
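The calculations summarized in Table 5.2 below can be reproduced with a short script. The sketch that follows is illustrative (variable names are hypothetical): it uses the data of Table 5.1, fits each polynomial by least squares as in equation (5.111), computes the variance of equation (5.113), and evaluates ME and AIC from equations (5.115a) and (5.115b); up to rounding, the output matches the table.

```python
import numpy as np

# Data of Table 5.1
x = np.arange(11) / 10.0
y = np.array([0.012, 0.121, -0.097, -0.061, -0.080, 0.037,
              0.196, 0.077, 0.343, 0.448, 0.434])
n = len(y)

print("degree  n_f  variance      ME        AIC")
for m in range(-1, 6):
    if m == -1:                       # MODEL(-1): zero-mean normal, no regression part
        resid = y
    else:                             # least-squares fit of a degree-m polynomial
        X = np.vander(x, m + 1, increasing=True)      # columns 1, x, x^2, ..., x^m
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
    d = np.mean(resid**2)             # M-L variance, eq. (5.113)
    n_f = m + 2                       # free parameters: a_0..a_m and sigma^2
    base = 0.5*np.log(2*np.pi) + 0.5*np.log(d) + 0.5
    me = base + 3*n_f/(2*n)           # eq. (5.115a)
    aic = base + n_f/n                # eq. (5.115b)
    print(f"{m:5d} {n_f:4d}   {d:.5f}  {me:8.3f}  {aic:8.3f}")
```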
Table 5.2 Variance, ME and AIC for the regression polynomials of various degrees

Degree   Free parameters   Variance   ME       AIC
-1       1                 0.05326    0.089    0.044
 0       2                 0.03635    0.035    -0.056
 1       3                 0.01329    -0.333   -0.469
 2       4                 0.00592    -0.600   -0.782
 3       5                 0.00514    -0.535   -0.762
 4       6                 0.00439    -0.477   -0.750
 5       7                 0.00423    -0.360   -0.678
Based on the results given in Table 5.2, MODEL(2),

y_i = 0.03582 - 0.49218\,x_i + 0.97237\,x_i^2 + \varepsilon_i,

minimizes both ME and AIC, thus being assessed as the best regression polynomial for the data given in Table 5.1. In fact, the data in Table 5.1 were generated from the following parabolic equation:

y_i = 0.05 - 0.4\,x_i + 0.9\,x_i^2 + \varepsilon_i,

where \varepsilon_i are normal random variables with zero mean and variance 0.01. It is somewhat remarkable that both ME and AIC do pick the best model from the group of candidate models. The coefficients of the two equations do not differ much.

5.5.2 Model selection based on small samples
Bayesian statistics, which is characterized by treating unknown parameters as random variables, has undergone rapid development since World War II. It is already one of the most important branches of mathematical statistics. But a long-standing argument about Bayesian statistics concerns the choice of the prior distribution. Initially, the prior distribution used to be selected based on the user's preference and experience, more subjectively than objectively. Note that for the same statistical problem, we may have different estimates of the posterior distribution due to different choices of prior distribution. The prior distribution is determined a priori. This is somewhat annoying because such
ambiguity in the final results prevents applied statisticians, engineers and those interested in applying Bayesian statistics to various fields from accepting the methodology. The situation has changed in recent years since information theory was combined with Bayesian methodology. The basic solution strategy is to turn the problem into one of model selection. Although there are some rules helping us choose the prior distribution in the Bayesian method, the choice of prior distribution is in general not unique. Suppose we have a group of possible prior distributions selected by different users or by different methods. By using some criterion similar to those for large samples, we are able to find the best prior distribution among the possible candidates. Referring to equation (5.101), we note that the true model for \theta minimizes H(\theta) according to Theorem 5.4 or equation (5.28). Thus, the best prior should minimize

H(\theta) = -\log P(X) - \log P(\theta | X) \to \min.
(5.116)
Note that the above equation is equivalent to the following two minimization procedures:

-\log P(X) \to \min,
(5.117a)
-\log P(\theta | X) \to \min.
(5.117b)
This is because H(\theta) is a linear function of -\log P(X) and -\log P(\theta | X): the minimum of H(\theta) is attained if and only if the two terms on the right hand side of equation (5.116) are minimized. We use a special name, Bayesian Measured Entropy (MEB), to denote -\log P(X). Then, based on equation (5.117a), we have

Theorem 5.14 The best prior distribution must satisfy

MEB = -2\log P(X) \to \min.
(5.118)
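In practice Theorem 5.14 amounts to computing the marginal probability P(X) once per candidate prior and keeping the prior with the smallest -2 log P(X). The following minimal sketch (illustrative only) assumes a normal likelihood with known unit variance and two hypothetical candidate normal priors for the mean, and evaluates the criterion by grid integration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.5, 1.0, size=8)            # a small sample; x_i ~ N(theta, 1) assumed

theta = np.linspace(-20, 20, 8001)
dtheta = theta[1] - theta[0]
loglik = -0.5*len(x)*np.log(2*np.pi) - 0.5*((x[:, None] - theta[None, :])**2).sum(axis=0)

# two candidate priors for theta (illustrative): a tight and a diffuse normal
candidate_priors = {"N(0, 1)":    -0.5*np.log(2*np.pi*1.0)   - theta**2/2.0,
                    "N(0, 10^2)": -0.5*np.log(2*np.pi*100.0) - theta**2/200.0}

for name, log_prior in candidate_priors.items():
    p_x = np.sum(np.exp(loglik + log_prior)) * dtheta   # marginal probability P(X)
    meb = -2.0*np.log(p_x)                               # eq. (5.118)
    print(f"prior {name:10s}  MEB = {meb:.3f}")
```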
Here the constant 2 in front of the logarithm of the marginal probability appears for historical reasons. The criterion is also called the Akaike Bayesian Information Criterion (ABIC) (Akaike, 1978; Akaike, 1989). Equation (5.118) is in fact an integral equation for the unknown prior distribution P(\theta). Theoretically, the solution exists. So equation (5.118) selects
the best prior from the candidate models. In equation (5.117b), \theta is unknown. Using the Bayes rule (5.95), equation (5.117b) states that

-\log P(\theta | X) = -\log\frac{P(X | \theta)\,P(\theta)}{P(X)} \to \min.
(5.119)
In this equation, only the parameter \theta is unknown. Therefore, the above equation yields a point estimate of \theta. It plays the same role as the M-L method, because it degenerates to M-L estimation if the prior distribution is a uniform one.

5.6 Concluding remarks
Entropy and entropy estimation are the basic ideas in this chapter. Based on them, we have constructed a number of unbiased estimators for several entropies. The three most important estimators are ME and AIC for large samples and MEB for small samples. Model selection had not been touched in traditional statistics. It is one of the focuses of modern statistics, and thus deserves special attention. Entropy-based methods have been introduced in this chapter to attack this problem. They will be widely applied in the subsequent chapters.
Chapter 6
Estimation of 1-D complicated distributions based on large samples
A random phenomenon can be fully described by a sample space and random events defined on the sample space. This is, however, not convenient for more complicated cases. The introduction of random variables and their distributions provides a powerful and complete mathematical tool for describing random phenomena. With the aid of random variables and their distributions, sample spaces and random events are no longer needed. Thus, we focus on random variables and their distributions. The most frequently used distribution form is the normal distribution. Although the normal distribution is well studied, it should be noted that the normal distribution is a rare distribution in real-world applications, as we mentioned in the previous chapters. Whenever we apply the normal distribution to a real-world problem, we must a priori introduce a significant assumption. Our concern is thus about how to avoid such an a priori assumption and place the distribution form under consideration on a more objective and sound basis. Attempts have been made to solve the problem by introducing more distribution forms: Weibull, Gamma, etc. We have about a handful of such distribution forms. They are specially referred to as special distributions. Special distributions usually have good properties for analysis, but not enough capability to describe the characteristics of the random phenomena encountered in practical applications. Such examples are numerous. Ocean waves are of primary concern for shipping. Wave height and wave period are random in nature. People began to study the joint distribution of wave height and period at least over one hundred years ago. Millions of sample points have been collected, but their distributions are still under extensive study today, in part due to the lack of powerful distribution forms for describing the random properties of ocean waves. Complicated distributions are not rare in real-world applications, but special distributions are not enough to describe them.
Another serious problem is that the a priori assumption is arbitrarily used in practical applications and analyses. People involved in stock market analysis often complain about the wide use of the normal distribution for analyzing price fluctuations. Whenever the normal distribution or any other special distribution is used, it implies that we force the phenomenon under consideration to vary in accordance with the a priori assumed law. If price fluctuations are not distributed as the special distributions we assume, we put ourselves in fact at the hand of God; any prediction based on such an assumption is prone to be misleading rather than a profitable investment. In summary, at least two concerns must be addressed before we adopt a specific distribution form for the random variable under consideration. The first is about the distribution form, whose capability should be powerful enough to describe most (we cannot say all) random variables, either simple or complicated. The second concern is about objective determination of distributions. The strategy for solving the first concern is to introduce a family of distributions able to describe complicated distributions that special distributions cannot. This strategy reflects the recent developments in estimating complicated distributions: construction of a parameterized family \Phi that is much more flexible than special distributions (Sakamoto et al., 1983; Zong & Lam, 1998). It is so powerful that it meets our demands in most cases, whether the distribution under consideration is simple or complicated. The strategy for solving the second problem is that, instead of using one a priori assumed distribution, a family \Phi of distributions or models is considered. Based on a random sample, the best model is selected from the family according to some criterion. This process, called model selection, convinces us that the distribution or model is selected based on the information in the sample and that it is the best among a group of candidate models. If the group of candidate models is large enough, the model chosen in such a way must be the distribution we are after. Combining the two strategies together leads to the recent development of information-theoretic methods for estimating complicated distributions. In this chapter, estimation of complicated 1-D distributions based on large samples using B-splines is introduced.

6.1 General problems about pdf approximation
To ensure that the family
\Phi is flexible enough that the pdf under consideration is covered in this family, we need to know what kind of function such a pdf is. In other words, for any given random variable, is it possible to approximate its pdf by use of some simpler functions? Here we have a nice theorem to answer the question.

Theorem 6.1 (Lebesgue's decomposition theorem) Any distribution function F(x) can be written in the form of

F(x) = \alpha_1 F_1(x) + \alpha_2 F_2(x) + \alpha_3 F_3(x)
(6.1)
where \alpha_i \ge 0 (i = 1,2,3) and \alpha_1 + \alpha_2 + \alpha_3 = 1. F_1(x) is absolutely continuous, being continuous everywhere and differentiable except at countably many points; in other words, F_1(x) is differentiable almost everywhere. F_2(x) is a step function with a countable number of jumps, and F_3(x) is singular. Proof: see Bhat (1985).
n
This theorem states that any pdf is a sum of three types of functions. Because F_3(x) is singular, and thus pathological, it is often dropped in analyses.
F_2(x) has the form F_2(x) = \sum_{x_i \le x} p_i, where p_i = \Pr(X = x_i). Hence, the random variable X has a finite probability of occurring at the discrete values x_1, x_2, \ldots and zero probability of taking any other values. Differentiating both sides of equation (6.1), and noting that the derivative of a step function is a delta function, we have

f(x) = \alpha_1 f_1(x) + \alpha_2\sum_i p_i\,\delta(x - x_i)
(6.2)
where f_1(x) is at least continuous almost everywhere. Note that at those points where f_1(x) is not continuous, f(x) can be expressed as \sum_j q_j\,\delta(x - x_j).
Considering the second term on the right hand side of equation (6.2), we have

Corollary 6.1 If f_1(x) is continuous, then a pdf can be written in the form of

f(x) = \alpha_1 f_1(x) + \alpha_2\sum_i p_i\,\delta(x - x_i)
(6.3).
If the singular part is not counted, both Theorem 6.1 and Corollary 6.1 indicate that a pdf is either continuous or discrete, or a combination of both. Approximating a pdf can thus be studied separately for continuous and discrete random variables. This simplifies our analysis a lot, and we may thus address the two issues separately. First of all, we notice that approximating the distribution of a discrete random variable is not a challenge if the sample size is large enough. This is due to the fact that the distribution of a discrete random variable is expressed by finitely or countably many discrete probability values. Thus, approximating such a distribution does not involve the issue of how to approximate the distribution form. As more and more data are collected, or as the sample size gets larger and larger, we have better estimates of the distribution. In the case of small samples, however, large statistical errors may arise. This latter issue will be discussed in Chapter 8 for 1-D discrete distributions and in Chapter 9 for 2-D discrete distributions. In this chapter, we focus on how to approximate the pdf of a continuous random variable. Consider a continuous random variable X defined on the interval [c,d]. Corollary 6.1 leads us to the conclusion that \Phi is the set of all continuous functions f(x) satisfying

Non-negative condition: f(x) \ge 0 for x \in [c,d],    (6.4a)
Normalization condition: \int_c^d f(x)\,dx = 1.    (6.4b)
We are thus after continuous approximants satisfying the above two constraints. We now turn to the problem of how to construct f(x).

6.2 B-spline approximation of a continuous pdf
Consider a continuous pdf. We are tempted to use polynomials to approximate f(x), and many researchers have done so over the years. This is, however, a method of serious limitations because of the properties of polynomial approximation discussed in Chapter 4. Here two issues should be considered separately: the first is the capability of polynomial approximation and the second the stability of polynomial approximation. The answer to the former question is affirmative: any continuous function can be approximated by a polynomial to desired accuracy if the order of the polynomial can be any integer. But the answer to the second question is negative: the stability of polynomial approximation is poor, this being particularly true as the order of the approximating polynomial is large. Based on these, we have a picture of the properties of polynomial approximations in mind. For any given continuous function, we are able to find a polynomial
which is so close to the given function that their difference can be neglected. If the function value at a point has a slight change, however, disaster might occur because all the coefficients of the polynomial may suffer big changes. If the function to be approximated is badly behaved anywhere in the interval, then the approximation is poor everywhere. This is particularly true if uniform spacing of knots is employed. This global dependence on local properties leads to unstable approximation. Studies of other approximating functions, such as truncated power basis functions, are also not satisfactory. Based on the introduction in Chapter 4, B-splines are a satisfactory approximating tool for our purpose. A linear combination of B-spline functions is therefore assumed to be able to approximate the pdf f(x) of a continuous random variable X in the form of

f(x | \mathbf{a}) = \sum_{i=1}^{N} a_i B_i(x),    (6.5)

where a_i (i = 1,2,\ldots,N) are the linear combination coefficients, \mathbf{a} = (a_1, a_2, \ldots, a_N)^T is the coefficient vector and N is the number of B-spline functions used to approximate the pdf. B_i(x) is the B-spline function of chosen order. In equation (6.5), we use f(x | \mathbf{a}) to indicate the dependence of f(x) on the coefficient vector \mathbf{a}. Based on the Curry-Schoenberg theorem introduced in Chapter 4, B-splines form a basis for the vector space of spline functions defined on a bounded interval. Having limited summation cancellation effects, B-splines are a relatively well conditioned basis set. Moreover, B-splines are linearly independent. These properties indicate that B-splines are suited to the purpose of approximating a continuous pdf. As given in Chapter 4, the third order and fourth order B-spline functions are frequently employed in applications. For convenience, they are rewritten here. The third order B-spline function is of the following form,
g,(s) = fr - * J E ( * " ' ~X)*H(X'+> ~X)H(X-X,).
(6.6a)
where H(x) is Heaviside function defined by
»M-{? ' " I The index s=i-3, and the function ws(x) is a product of the form
(6.6b)
f] »=0
Equidistant B-splines are assumed in the above. Suppose there are equidistance points, c = xa <xl<---<xB = d, which divide the internal into n subintervals. For convenience in later mathematical expressions, more knots at each end are defined; x_%, x_2, x_t, xH+i, xn+2 and xn+i,
n+1 [c,d] three It is
customary to set *_, = x_2 = x_, = x9 = c and xn = xnH = xn+J = xn+i = d . It is clear that w+1 knots define N splines. Therefore, N=n+l, The fourth order B-splines are similar with the third order B-splines in form. They are
where the index s=i-4, and the function ws(x) is a product of the form
Let us return to equation (6.1). The integral of f(x | a) over the distribution range [c,d] must be one, that is,
\int_c^d f(x | \mathbf{a})\,dx = 1
(6.9)
Substituting equation (6.5) into equation (6.9), we obtain
\sum_{i=1}^{N} a_i c_i = 1
(6.10a)
To obtain the second term in the above equation, the compact support property of B-splines — that an order 3 B-spline is nonzero only on [x_{i-3}, x_i] and vanishes elsewhere (see Chapter 4) — is used. The above formulation is also valid for order 4 B-splines. If order 4 B-splines are used, the last equality in equation (6.10a)
remains valid, that is, 2 ^ c » integral of the Mh B-spline,
=
* wi^1 c< differently defined. Here, c, denotes the
6.3. Estimation c,= Jj j (jc)A =a:'~Xf-3 for order $B-splines,
I
(6.10b)
X, —X,
F
C) =
135
BAx)dx =——— 4
for order 4 B-splines.
(6.10c)
To meet the requirement mat a pdf be positive imposed by equation (6.4a), we simply set a,*0.i'~l,2,....N
(6.11)
This is a sufficient condition, but not a necessary condition. Equations (6.5), (6.10) and (6.11) complete the approximation of a continuous pdf. Once the combination coefficients are given, the pdf of a continuous random variable is defined. In the following section, statistical methods are employed to find the combination coefficients. 6.3 Estimation 6,3.1 Estimation from sample data To determine the coefficient vector a, a random sample of size ns is drawn from the population. Let Hie sample points be xt (£ = 1,2, ••-,«,). If N is given in equation (6.5), a can be estimated using the maximum likelihood method. The maximum likelihood method is just one of the possible methods for estimating the coefficient vector a. It is also feasible to apply other methods of estimation, but the systematic methodology for estimating the coefficient vector a developed in Chapters 6 through 10 is totally based on the maximum likelihood metiiod, and thus the M-L method is employed here and in the subsequent chapters. Based on the maximum likelihood method, the estimation problem is formulated as the following optimization problem: For a given N, find vector a such that it satisfies L = £ log f{xc | a) -+ max
(6.12a)
subject to the constraints: 5> ( e,=l
(6.12b)
136 O/ fe0
6. l-D estimation based on large samples , (i' = l,2..",tf) .
(6.12c)
Equations (6.12a)~(6.12c) define a nonlinear programming problem (NLP). Being a linear function of a, f(x | a) is a continuous function in the space defined by equations (6.5) and (6.12). So is the log-likelihood function L. Theorem 6.2 The problem defined by equation (6.12) has a unique solution. Proof. From Weierstrass" theorem, which states that a continuous function defined on a compact interval must have an extreme, we conclude that the solution to equation (6,12) exists. It is provable, as given in the appendix to this Chapter, that there exists only one extreme point over the entire feasible domain for this nonlinear programming problem, see the appendix. • It is known that the most difficult thing in optimization is that the objective function has multiple extremes. A search scheme is often trapped at local extremes and fails to find the global optimum. In terms of this, the property that the problem defined by equation (6.12) has only one extreme point in the entire feasible domain is really a remarkable property. This makes numerical treatments much easier and no special cautions are needed. Therefore, if a local maximum solution is found to equation (6.12), it must be the global optimum solution because the solution is unique. Very often it is difficult to find a solution to a nonlinear programming problem. Even the problem defined by equation (6.12) has only one extreme point, a code based on a general-purpose method may turn out be computationally inefficient. In most applications of optimization research, the number of unknowns is restricted within several parameters, say 2 to 5. For the problem defined by equation (6.12), however, the number of unknowns is of the range of 10~50. In some cases, the number of unknowns may be over 100. For such optimization problems of large number of unknowns, general-purpose optimization methods are usually not applicable. This is particularly true for 2-D cases as will be discussed in Chapter 7. It is desirable to develop a particular method to find the solution in an efficient way. So in the appendix, an iterative formula is derived, which reads (6-B)
We have q £,(£)/ f(x | a) 51 because the nominator is a term in the nonnegative denominator, we conclude OS a, S l / c , . This is in agreement with equation
6.3. Estimation
137
(6.12b). The iteration foimuk remains valid even if f{x, j a) = 0 . To see this, note that every term in f(x{ j a) is nonnegative. So f(xt | a) = 0 implies that each term in / ( x , |a) must be zero. That is, alBi(x,) = Q . Because a.Bt(x)/f(x
| a) < 1 , a< '
l
must be finite even if it is of the type - .
The suggested initial values are 1
(6,14)
A small number, say 10"4, is prefixed. The iteration starts from the initial value given in (6.14). The iteration continues until the difference between the previous and present values of the combination coefficients is smaller than the prefixed small number. Numerical tests have shown that it takes several to several tens of iterations to reach the optimum. This iteration formula is shown to be very computationally efficient, making the methods presented here feasible as a statistical tool on a PC or a laptop. Equations (6,5) and (6.13) give complete solution to finding the continuous pdf based on a large sample if the number N of B-splines is given. A code based on the method is given in the floppy attached to this book and a brief description of the code is given in Chapter 12. The inputs of the code are the number N of Bsplines, the distribution interval [c,d] and the observed data (sample) ^ ( 1 = 1,2,...,^). The model assumes that the random variable under consideration must be disfributed in a finite interval [e,d\. If a random variable is distributed on an infinitely large interval, the model introduced here is used in the sense of approximation. The method requires input of raw data without treatment. If observed data are treated by some approaches, variants of the above method may be used, as demonstrated in the following section. 6,3,2 Estimation from a histogram More often than not, a pdf is expressed in the form of a histogram. Suppose a histogram is composed of K cells as shown in Figure 6.1. The histogram is formed from n, sample points and there are k, sample points in k-th cell (k=l,2,.,.,K), respectively. The nodes of each cell are denoted by §k and Ijk+i to differentiate from knots of B-splines.
6. I-D estimation based on large samples
138
If the sample points have a distribution defined by f{x), the probability for the event that n* points fall in &-th cell is given by the following multinomial distribution n.!
(6.15)
where qk is the partial probability of f(x), It is assumed again that f(x) is approximated by a linear combination given by equation (6.5). The partial probability qk relates to the combination coefficients a through B,(x)dx
(6.16)
If <4 denotes the integral in the last term, that is,
Figure 6.1 Schematic figure of a histogram
(6.17)
equation (6,16) would be N
(6.18) M
6.3. Estimation
139
Equation (6.17) can be numerically or analytically integrated. The analytical form is, however, so complicated that it does not exhibit much superiority. Numerical quadratures such as Gauss quadrature or Simpson rule are all suitable tools to use. They do not need much computer time for 1-D cases. The log-likelihood function is obtained by taking logarithms on both sides of equation (6.15) «! logP = log——f
- + «, log*?, +nilogql-
+ nK logqK
(6.19)
The first term on the right hand side is a constant because «* (k-l,2,,,,K) is observed value. So only the rest terms are included in the log-likelihood function
A slight change is made in the above equation by dividing the right hand side of equation (6.19) with sample size n, and introducing the observed frequency/?*. Similar to equation (6.12), the best estimate of a pdf based on a histogram is formulated as follows For given N, find vector a so that it satisfies K
-»max
(6.21a)
subject to the constraints: |>,.c,. = l
(6.21b)
af^0,
(6.21c)
(f = l,2,-,JV).
The solution to the above problem must satisfy a. = _L x y5^£*. n
A
,- = 1,2,...,^
(6.22)
*-i ft (a)
This iterative formula can be obtained in the similar way as equation (6.13) and its proof is neglected here. Again we may use this equation as an iterative formula to find the coefficients
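A compact implementation of the sample-data iteration described above can be sketched as follows. This is not the code supplied with the book: the multiplicative update follows equation (6.13), while the clamped equidistant knot layout, the Cox-de Boor basis construction, the uniform starting values and all function names are illustrative choices made here.

```python
import numpy as np

def bspline_basis(x, knots, k):
    """All order-k B-splines (degree k-1) at the points x, by Cox-de Boor recursion.
    len(knots) - k basis functions are returned, one per column."""
    x = np.asarray(x, dtype=float)
    B = np.zeros((x.size, len(knots) - 1))
    for i in range(len(knots) - 1):                       # order-1 splines: span indicators
        B[:, i] = (x >= knots[i]) & (x < knots[i + 1])
    last = np.nonzero(np.diff(knots) > 0)[0][-1]          # close the right end of the last span
    B[x == knots[-1], last] = 1.0
    for m in range(2, k + 1):                             # raise the order step by step
        Bn = np.zeros((x.size, len(knots) - m))
        for i in range(len(knots) - m):
            d1, d2 = knots[i+m-1] - knots[i], knots[i+m] - knots[i+1]
            left = (x - knots[i]) / d1 * B[:, i] if d1 > 0 else 0.0
            right = (knots[i+m] - x) / d2 * B[:, i+1] if d2 > 0 else 0.0
            Bn[:, i] = left + right
        B = Bn
    return B

def fit_bspline_pdf(sample, lo, hi, N, k=3, iters=200, tol=1e-6):
    """M-L coefficients a_i of f(x|a) = sum_i a_i B_i(x) on [lo, hi] via the
    multiplicative update of equation (6.13)."""
    knots = np.concatenate((np.full(k-1, lo), np.linspace(lo, hi, N - k + 2), np.full(k-1, hi)))
    Bmat = bspline_basis(sample, knots, k)                # shape (n_s, N)
    c = (knots[k:] - knots[:-k]) / k                      # integrals of the B-splines, eq. (6.10)
    a = np.full(N, 1.0 / (hi - lo))                       # uniform start: sum_i a_i c_i = 1
    n_s = len(sample)
    for _ in range(iters):
        f = np.maximum(Bmat @ a, 1e-300)                  # f(x_t | a)
        a_new = (Bmat / f[:, None]).sum(axis=0) * a / (n_s * c)
        if np.max(np.abs(a_new - a)) < tol:
            return a_new, knots
        a = a_new
    return a, knots

# Example: exponential sample restricted to a finite interval, as the model requires
rng = np.random.default_rng(0)
sample = rng.exponential(size=2000)
sample = sample[sample <= 5.0]
a, knots = fit_bspline_pdf(sample, 0.0, 5.0, N=6, k=3)
print("coefficients:", np.round(a, 4),
      " check sum(a_i c_i) =", np.round(np.sum(a * (knots[3:] - knots[:-3]) / 3), 6))
```

Note that the update automatically preserves the normalization constraint and the non-negativity of the coefficients at every iteration, which is why no explicit constraint handling is needed.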
6. 1-D estimation based on large samples
140
6.4 Model selection In the previous discussions, N is always assumed fixed. How to specify N, however, remains a problem. If, for example, two different N's are used to approximate the same pdf, we would obtain two models. The question immediately arises: which model is better? Before answering the question, let's consider the following example. Example 6.1 Model selection Assume a true distribution is given by
fexp +0.2x|-7Lexp{~2(;c-7)2}l
xe[0,10].
(6.23)
from which 200 random numbers are generated as the given sample. The following two models are used to approximate g(x): (6.24a) (6.24b)
Given N=7
—
N=50
—
-L(N=7)=359 -L(N=50)=330
8
10
Figure 6.2 Fallacy of Likelihood function as a tool for model selection where order 3 B-splines are used. Based on the procedures introduced in section
6.4. Moeiel selection
141
6.3.1, the unknown parameter a is estimated. The estimated pdfs using the two models are plotted in Figure 6.2. The values of the log-likelihood functions for both cases are, respectively L(N = 7) = -359 and L(N = 50) = -330.
(6.25)
It is clear from Figure 6.2 that the model f,(x) is closer to the true distribution, but_^a(x) is not. Based on the values of the log-likelihood function, however, fm(x) is better than / 7 (x) because the former has larger likelihood value. We have two conclusions from this example: 1) Model selection must be properly handled. It has significant influence on the estimation accuracy. It is not true that the more parameters the better the model is. It seems there exists an optimum number of B-splines; 2) Likelihood function fails as a quantitative evaluation tool for model selection. We need a new tool to serve as a quantitative measure of model selection. Fallacy of likelihood function as a quantitative evaluation tool for model selection results from the fact that M-L estimator is biased. There exist several criteria for model selection, among which are Akaike Information Criterion (AIC) and Measured Entropy (ME) introduced in Chapter 5. Whatever is random is uncertain. The amount of uncertainty is measured in the information theory by entropy. Uncertainty comes from two sources: the uncertainty of the random variable itself and the uncertainty of the statistical model resulting from approximation. The uncertainty of the random variable itself is measured by the entropy of the true model of the form
//(/,/) =-f/log/&.
(6.26)
The uncertainty resulting from model approximation is measured by the divergence between the true and the candidate models
fix)
(6-27)
The best model should minimize the sum of the total uncertainty: Hif, f) + J(f, g) -*• min The asymptotically unbiased estimator of H(/, / ) is
(6.28)
142 H{f,f)
6. I-D estimation based on large samples = ~\f{x \ a)log/(x | &)dx+^
(6.29)
where n_s is the number of sample points, n_f is the number of free parameters in the model, equal to N-1 in light of the equality constraint in equation (6.10), and \hat{a} is the maximum likelihood estimate of a. The asymptotically unbiased estimator of J(f,g) is

\hat{J}(f,g) = \frac{N-1}{n_s}.    (6.30)

And thus, the asymptotically unbiased estimator of equation (6.28) is

Measured Entropy = ME = -\int f(x | \hat{a})\log f(x | \hat{a})\,dx + \frac{3(N-1)}{2n_s}
(6.31)
In chapter 5, as an asymptotical approximant to likelihood function, Akaike Information Criterion (AIC) is estimated by
AIC = -\frac{1}{n_s}\sum_{t=1}^{n_s}\log f(x_t | \hat{a}) + \frac{N-1}{n_s}.    (6.32)

Note that the coefficients in front of the last terms in the two equations above are different because they are obtained on different bases. Aided with the above-mentioned criteria, the best estimate of the pdf (the optimum N) can be found through the following procedure: Suppose \hat{a} is the maximum likelihood estimate of a for given N. Find N so that

ME(N) = -\int f(x | \hat{a})\log f(x | \hat{a})\,dx + 3\times\frac{N-1}{2n_s} \to \min
(6.33a)
Or if AIC is used, Suppose a is the maximum likelihood estimate of a for given N. Find N so that AIC = - — X l o S fix, | a ) + — -> min
(6.33b)
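The search over N can be automated with a short loop. The sketch below is illustrative and assumes the `bspline_basis` and `fit_bspline_pdf` helpers from the sketch in section 6.3 are available; the ME integral is approximated by a simple quadrature on a fine grid, which is one possible choice among several.

```python
import numpy as np
# assumes bspline_basis and fit_bspline_pdf from the earlier sketch are defined

rng = np.random.default_rng(0)
sample = rng.exponential(size=200)
sample = sample[sample <= 5.0]
n_s = len(sample)

grid = np.linspace(0.0, 5.0, 2001)                  # quadrature grid for the ME integral
dx = grid[1] - grid[0]
best = None
for N in range(3, 12):
    a, knots = fit_bspline_pdf(sample, 0.0, 5.0, N=N, k=3)
    f_data = np.maximum(bspline_basis(sample, knots, 3) @ a, 1e-300)
    aic = -np.mean(np.log(f_data)) + (N - 1) / n_s                       # eq. (6.33b)
    f_grid = np.maximum(bspline_basis(grid, knots, 3) @ a, 1e-300)
    me = -np.sum(f_grid * np.log(f_grid)) * dx + 3 * (N - 1) / (2 * n_s) # eq. (6.33a)
    print(f"N={N:2d}  AIC={aic:.4f}  ME={me:.4f}")
    if best is None or aic < best[1]:
        best = (N, aic)
print("AIC-selected number of B-splines:", best[0])
```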
6.4. Model selection
143
The above procedures are also an optimization process. Given JV,, find the maximum estimate of a through equation (6.13), and compute corresponding ME (Nt). Give another Nt > Nt and find corresponding ME( JVj). Continue the process until a prefixed large N. And then find the N which minimizes ME or AIC. If a is estimated from a histogram, the above formula for ME has no change in the form: Find N so that ME(N) = - | / ( x | a) log f{x | &)dx+—
1 _» min
(6.34a)
But the formula for AIC is slightly changed Find N so that AIC(N) = - V pk log qk (a)+ *-i
-> min
(6.34b)
«,
To obtain equation (6.34b), divide the interval [c,d] into K subintervals \,§k >&+/]• Denote the length of the subinterval by Ak and the number of points falling into &-th cell by nk. Then the first term on the right hand side of equation (6.33b) is rewritten in the form of
(6.35)
Neglecting the terms are constants on the right hand side of equation (6.35) results in equation (6.34b). The integral in equation (6.34a) can only be evaluated through numerical methods, say Gauss quadrature. For one-dimensional problem, computer time is not a big issue and thus choice of a numerical quadrature scheme does not exhibit significant impact on the numerical accuracy and computational efficiency.
144
6, 1-D estimation based on large samples
6.5 Numerical examples In the following examples, we assume the true pdf f(x) is given, and generate ns random numbers from this distribution by employing the method presented in Chapter 3. Using these random data we estimate the coefficients a and N, based on the above analyses. Example 6.2 The exponential distribution Suppose an exponential distribution of the form is given f(x) = exp(-x)
(6.36)
From this distribution, generate a sample of size 20 by use of PRNG introduced in Chapter 3. The generated data are .67Q55220E-02 .19119940E+01 .90983310E+00 .32424040E+00 .45777790E-01
.88418260E+00 .17564670E+01 .1Q453880E+01 35282420E+00 J3858910E+00
.32364390E+00 J181957GE+01 J9749570E-01 J88Q6860E+00 .13271010E+01
.64127740E+00 .10687590E+01 .22005460E+01 .24852700E+00 .10658780E+01
If MODEL(N) denotes the model for approximating the pdf, that is, MODEL(N): / ( x ] a ) = £ a, £,.{*)
(6.37)
we obtain estimates for the following models using equation (6.13). (a) M0DEL(3) representing 3 B-spline functions are used to approximate the pdf a, = 0.4998, fl2 = 7.43 x 10"8, a, = 5.37 x 10~M . Wife parameters in the above substituting the parameters in MODEL(N), the log-likelihood function, ME and AIC can be calculated from the following equation | a)
(6.38a)
6.5. Numerical examples ME(N) = - [f{x | a)log f{x | a ) * + ^ ^ — ^ J 2«, = —-Ylog/(x t |a}+—•—-
145 (6.38b) (6.38c)
that is, £ = 20.13, ME = 1.51, AIC = l.U This model is in feet approximated by the first B-spline and the rest two Bsplines have coefficients nearly equaling zero. (b) MODEL(4) having 4 B-splines a, =0.999, a 2 =2.65>dtr\ ^ = 0 , « 4 = 0 I = 15.05 , ME = 0.89, AIC = 0.90 (b) MODEL(5) having 5 B-splines a, = 0.953,a2 =0.273, a, = 3,82x 10"\ a 4 = 0 , 5 s = 0 £ = 15.05, ME = 1.13, Among these three models, MODEL(4) minimizes both ME and AIC, thus being the best model for the data presented at the beginning of this example. We now use two larger samples to perform the estimation. Suppose two random samples are generated from (6.36). The size of the first sample is 100 and the second 200. Some of the calculated results are given in Table 6.1. N starts from 3 and stops at 11. Values of Log-likelihood function L, ME and AIC are given in the table. From the Table, the log-likelihood function —L is nearly a monotonic function of N. Again we see that it cannot specify the most suitable model. For the case of «s=lQ0, ME and AIC take their minimums at N=4, both yielding the same estimates. The estimated pdfs for iV=3,4,10 are shown in Figure 6.3 (a). For the case of J¥=3, the curve does not fully represent the characteristics of the given curve because the number of B-splines is not enough. In the case of iV==lG, the curve ejdubits more humps than expected. The curve corresponding to N=4 is competitively closer to the original curve than the rest two. Our visual observations and quantitative assessment by use of ME and AIC yield consistent conclusions. In general, the estimate is acceptable. The estimate is improved a lot if the sample size is increased to 200 as shown
146
6. I -D estimation based on large samples
in Figure 6.3 (b). However, ME and AIC are different in this case. ME attains minimum at N=5 and AIC attains minimum at i¥=6. They do not differ too much if viewed from the figure. On the entire interval, the curves for iV=5 and N=6 are almost coincident except on the interval [2.3, 5], where the curve for N=6 is visibly lower than the original curve and the curve for N—5. Table 6,1 The exponential distribution
N 3 4 5 6 7 8 9 10 11
N 3 4 5 6 7 S 9 10 11
-L 106.5 91.6 92.1 91.5 91.1 90.7 90.3 90,4 89.9
(a) «s=100 ME 1.39 0.980 0.985 1.000 0.984 1.019 1.027 1.039 1.052
AIC 1.085 0.946 0.961 0.965 0.971 0.977 0.983 0.994 0.999
-L 213.3 182.8 183.2 181.7 181.9 180J 180.0 180.5 189.9
(b) «,=200 ME 1.376 0.960 0JS2 0.953 0.959 0.954 0.957 0.974 0.981
AIC 1.077 0.939 0.936 0.934 0.939 0.939 0.940 0.948 0.949
6.5. Numerical examples
147
1.2 1 OJ
1
0.6
ens:
o
t»
0.4
•a
0.2 Si
o
0
Figure 6.3 The exponential distribution Hence, comparison with the original curve has revealed that the curve for N=5 (ME-based best) is slightly better than the curve for N=6 (AlC-based best). This is not surprising if we recall the assumptions and derivations in Chapter 5. AIC is an asymptotically approximant to likelihood function, but ME accounts for model uncertainty. It is interesting to note that section 6.1 gives satisfactory expression of continuous pdfe based on approximation theory, section 6.2 estimates the unknown parameters based on statistical theory while section 6.3 solves the problem of model selection based on information theory. None of the above-
148
6. I-D estimation based on large samples
mentioned three theories eould solve the problem of pdf estimation in such a broad sense if they were not combined together. Thus, this chapter and the present example well demonstrate the power of interdisciplinary studies. Example 6.3 The normal distribution The next example considered is that X is normally distributed as
(639)
Again two samples («s=100 and «/=200) were generated from the distribution. The estimated results are shown in Table 6.2, with N starting from 3 and stopping at 13. For the first case («,,=100), ME and AIC predict different models. ME indicates N=7 is the most suitable model while AIC predicts the best model is given by N=E. The difference is solved by plotting me estimated pdfe in Figure 6.4(a), In the figure, the curve for i¥=5 is also plotted. It shows poor correlation to the original curve.
Table 6.2 The normal distribution (a)
N 3 4 S 6 7 8 9 10 11 12 13
-L 173.5 173.2 147.1 151.7 136.8 135.3 134.9 135.1 134.6 134.1 134.0
(b) nx=200
n,,=100
ME(N) 1.984 1.993 1.760 1J20 1.431 1.483 1.483 1.500 1.498 1.498 1.527
AIC 1.755 1.762 1.511 1.567 1.428 1.423 1.429 1.441 1.446 1.451 1.457
N 3 4 5 6 7 8 9 10 11 12 13
-L 348.5 348.5 299.1 308.1 283.2 282.7 282.1 281.1 281.4 281.2 281.2
ME(N) 1.969 1.976 1.730 1.788 1.458 1.480 1.471 1.477 1.488 1.490 1.495
AIC 1.753 1.757 1.516 1.565 1.446 1.449 1.450 1.451 1.457 1.461 1.466
ft 5. Numerical examples
149
0.5 +3
u 01
J
0.2
13
•s 1
.-.
'
If/ \ \ N=7{ME) — /t--~.A\N=8{AIC) --'
0.3 a
Given
(a) «s=100
0.4
0.1 0
V 4 X
10 §
(b) «s=200
0.4 4.
0.3 0.2
1 !
0.1 0 0
1
Figure 6.4 The normal distribution The model iV=7 (ME-based best) is closer to the original curve in terms of shape. It keeps the symmetry of the original curve while the curve for i¥=§ (AICbased best) loses such symmetry. However, both show some quantitative deviations from the original curve. Generally speaking, the model N=7 is slightly better than JV=8. In Figure 6.4 (b) are shown the results obtained from the sample n/=200 and in Table 6.2 the values for likelihood function, ME and AIC are given. In this case, ME and AIC yield same prediction that N=7 is the best. This is in agreement with visual observation from Figure 6.4 (b). Among the three
150
6. I-D estimation based on large samples
curves plotted in the figure, N=l is closest to the original. To see the efficiency of the iterative formula (6.12), the iterative process for three combination coefficients a2,a^ and a6 for the case N = B are plotted in Figure 6.5. After ten iterations, the results for the three coefficients are already very close to the final solutions. The convergence rate is thus remarkable. This is not special case. Numerical experiments yield the same conclusions. In general, the convergence rate of the iterative formula (6.12) is quite satisfactory, giving convergent results after about ten or several tens of iterations. U.4 , . - • • "
"
0.3 0.2 0.1 0
1
0
Iteration number 10
20
30
40
Figure 6.5 Convergence of linear combination coefficients Example 6AA Compound distribution Consider a more complicated example, in which X has the following mixed distribution: (6.40a)
jg(x)dx where g(x) is a function defined on [0,10]
(6.40b) + 0.2x
xe[0,10]
151
6.S. Numerical examples
The definition domain is [0,10]. Three random samples of size «,= 30 and nx = 50 and «s=1000 were generated, respectively. Estimations were separately perfonned for these three samples using the above procedures. The results for likelihood function, ME and AIC are given in Table 6.3. N starts from 6 and stops at 20. Table 6.3 The compound distribution
"N 6
7 8 9 11 12 13
-L 173.1 166.3 171.6 160.7 158.3 159.3 157.2
ME(N) 1.750 1.813 1.881 1.818 1.743 1.811 1.738
(a}«,-100 AIC(N) N 1.781 14 1.723 15 1.781 16 1.687 17 1.683 18 1.703 19 1.692 20
-L 157.6 157.5 156.5 157.0 156.4 156.3 155.9
ME(N) 1.785 1.800 1.778 1.817 1.837 1.819 1.857
AIC(N) .706 .715 .715 .730 .734 .743 .749
ME(N) 1.679 1.671 1.670 1.674 1.679 1.676 1.682
AIC(N) 1.653 1.653 1.654 1.655 1.658 1.659 1.661
ME{N) 1.663 1.654 1.650 1.656 1.657 1.655 1.659
AIC(N) 1.644 1.643 1.643 1.644 1.645 1.646 1.646
(b)«s=500
N
-L
6
862.9 835.5 847.0 819.0 812.3 817.8 812.2
7 8 9 11 12 13
N
-L
6
1727 1674 1697 1642 1629 1640 1629
7 8 9 11 12 13
ME(N) 1.703 1.739 1.727 1.706 1.667 1.689 1.660
AIC(N) 1.736 1.683 1.708 1.654 1.645 1.658 1.648
ME(N) 1.695 1.732 1.722 1.695 1.656 1.677 1.646
(c) «/=1000 AIC(N) N 1.732 14 1.680 15 1.704 16 1.650 17 1.639 18 1.651 19 1.641 20
N
-L
14 15 16 17 18 19 20
813.5 812.4 811.9 811.7 812.1 811.4 811.3
-L 1631 1629 1628 1628 1628 1628 1628
6. I-D estimation based on large samples
152
U.4
yx / \
0.3 / 0.2
•8
1
0.1
(c)ns=1000 \
/
N=ll N=13
—
N=20 Given
\
_
/ VA
•
.
_
•
X
0
2
4
6
8
10
Figure 6.6 Influence of sample size on estimation accuracy
6.5. Numerical examples
153
In general, for all three easesj the maximum likelihood function is a decreasing function of the number of B-splines, N. But as N is large enough, say N is larger than 13, the likelihood function is almost a constant, varying little as N increases further. This is particularly true as the sample size is large, see Table 6.3(c). Comparison of Table 6,3 (a) and (c) shows the influence of the second term (the number of free parameter over the sample size) in AIC. As sample size is large, see Table 6.3 (c), the influence of the second term is not significant, and thus AIC is nearly a constant as the likelihood. As fee sample size is small, see Table 6.3 (a), the second term in AIC becomes more important. For all the three cases, ME-based best models are given by i¥=13 while AICbased best model are given by N = l l . From the table, ME value is favorably larger than AIC value for each N. This is reasonable because ME accounts for one more term of uncertainty, l(g,f), than AIC, Some of the results are plotted in Figure 6.6. In the figure, the curve for N-2Q is plotted for the sake of comparison. If sample size is small, the difference between the estimated pdf and the given distribution is significant, see Figure 6.6 (a). As sample size increases, the difference becomes smaller and smaller, indicating a convergence trend, see Figure 6.3 (b) and (c). Figures 6,6(b) and (c) do not show apparent difference from the viewpoint of statistics. Example 6.5 Estimation from histogram data This example demonstrates the estimation procedure presented in section 6.3.2, Take the sample n/=500 in Example 6,4 for instance. A K==15 histogram is formed on the interval 10,10] as shown in Figure 6,7. With the estimation procedure neglected, the results for the likelihood function, ME and AIC are tabulated in Table 6.4 for each N. The ME-based best model is N=\S, while AIC does not change at all as the sample size is above 17, failing to find the best model. In figure 6.7 is plotted the estimated pdf and the original curve. The estimated pdf (N=15) is very close to the true distribution in comparison of the curve for JV=13. ME performs well for this example. The computer times for all these calculations were within one minute on a Pentium 4 PC, thanks to the introduction of the iterative formulas. The methods are computer-oriented, making estimation automatically performed once the histogram is given.
154
6. 1-D estimation based on large samples Table 6.4 Estimate by histogram data (B/=500)
N 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0
-L
ME
AIC
2.141 2.129 2.102 2.096 2.078 2.084 2.068 2.075 2.065 2.069 2.065 2.065 2.065 2.065
2.235 2.168 2.200 2.136 2.151 2.148 2.124 2.136 2.107 2.126 2.108 2.112 2.109 2.108
2.169 2.157 2.130 2.124 2.107 2.112 2.096 2.104 2.094 2.098 2.093 2.093 2.093 2.093
4
6
8
10
X Figure 6.7 Estimate from histogmm date
Example 6.6 Revisit of entropy estimation Entropy estimation has been theoretically studied in Chapter 5. In this example, Example 6.4 is used to demonstrate how entropy estimates vary with unknown parameters. Suppose sample size n, = 500. Four quantities defined by
155
6.5. Numerical examples 3 AT-1 2 »
(6.41a) (6.41b)
JV-1 2 n.
(6.41c)
JV-1
(6.4 Id)
The values of these quantities are plotted in Figure 6.8, in which the vertical ordinate shows the values of the above-defined quantities and the horizontal ordinate shows the number of parameters. 1.9
1.8
I
1.7
1.6
1.5 10
20
30
Figure 6.S Entropy estimation As sample size increases, the quantity £ 3 tends to be a constant. Because E% is in fact the estimate of the entropy of the random variable under consideration, constancy of £ 3 as sample size is large is just the right estimate of the entropy. The rest three quantities decrease as sample size increases from small to medium size. As sample size increases further, they begin to increase. Basically, Ei and AIC are quite close, because the latter is the asymptotic estimate of the
156
6. 1-D estimation based on large samples
former. This verifies the theoretical analysis presented in Chapter 5. Over quite large range, E\ is largest among the three. This is due to the fact that Ej measures the uncertainty in the whole process. The domain marked by a circle in the figure is the area of interest because the three quantities take minima in this domain. The critical numbers of parameters around which the four quantities take their respective minima are close. Within this domain, the curves for the four quantities defining four types of entropies show large fluctuations. Outside this domain, fluctuations are not significant, all curves varying in a relatively smooth way. It is within this fluctuation domain that the four entropies take their minima. The fluctuations are not meaningless. Take the curve for £ 2 for instance. It is minimized at iV=13. Its value at N=\2 is, however, much larger than the entropy value at JV==12. Such big difference at these two points enables us to convincingly find the number of B-splines minimizing entropy values. 6,6 Concluding Remarks If the methods presented in this chapter are placed at a larger background, a word should be mentioned about estimation methodologies in general. The task of estimating a probability density from data is a fundamental one in reliability engineering, quality control, stock market, machine learning, upon which subsequent inference, learning, and decision-making procedures are based. Density estimation has thus been heavily studied, under three primary umbrellas: parametric, semi-parametric, and nonparametric. Parametric methods are useful when the underlying distribution is known in advance or is simple enough to be well-modeled by a special distribution. Semi-parametric models (such as mixtures of simpler distributions) are more flexible and more forgiving of the user's lack of the true model, but usually require significant computation in order to fit the resulting nonlinear models. Nanparametric methods like the method in this Chapter assume the least structure of the three, and take the strongest stance of letting the data speak for themselves (Silverman,1986). They are useful in the setting of arbitrary-shape distributions coming from complex real-world data sources. They are generally the method of choice in exploratory data analysis for this reason, and can be used, as the other types of models, for the entire range of statistical settings, from machine learning to pattern recognition to stock market prediction (Gray & Moore, 2003). Nonparamefric methods make minimal or no distribution assumptions and can be shown to achieve asymptotic estimation optimality for ANY input distribution under them. For example using the methods in this chapter, with no assumptions at all on the true underlying distribution, As more data are observed, the estimate converges to the true density (Devroye & Gyorfi, 1985). This is clearly a property that no particular parameterization can achieve. For this reason nonparametric estimators are the focus of a considerable body of advanced
6.6 Concluding Remarks
157
statistical theory (Rao, 1983; Devroye & Lugosi, 2001). We will not spend more space on nonparametric estimation here. The interested reader is referred to these books for more exposure to nonparametric estimation. Nonparametric estimations apparently often come at the heaviest computational cost of the three types of models. This has, to date, been the fundamental limitation of nonparametric methods for density estimation. It prevents practitioners from applying them to the increasingly large datasets that appear in modem real-world problems, and even for small problems, their use as a repeatedly-called basic subroutine is limited. This restriction has been removed by the method presented in this chapter. As a nonparametric estimation method, the advantages of the proposed method are summarized as follows; (1) In the proposed methods the pdf is estimated only from the given sample. No prior information is necessary of the distribution form; (2) The pdf can be estimated by a simple iterative formula, which makes the methods computationally effectively. The methods are computeroriented; (3) The methods provided in this chapter are capable of approximating probability distributions with satisfactory accuracy; (4) As sample size increases, the estimated pdf gets closer to the true distribution; (5) ME and AIC analysis are able to pick up the most suitable one from a group of candidate models. ME is more delicate than AIC, but in most cases they yield same estimates. But note that ME or AIC are not valid if the number of free parameters «/ is too large, otherwise the central limit theorem fails. The cases that the number of free parameters is larger than the sample size are treated in Chapters 8 and 9. Without giving more examples, it should be pointed out that the fourth order B-splines yield same satisfactory results as the third order B-splines. The reader may verify this using the codes given in Chapter 12. The method presented in this chapter, a nonparametric one, also exhibits a shift of estimation strategy from traditional statistics. The typical procedures to determining a probability distribution as shown in Figure 6.9 are (1) SPECIFICATION: Assume a distribution t y p e / ( x | S ) , where 0 is an unknown parameter; from the family <& of distributions; (2) ESTIMATION: Take a sample from the population. Use statistical methods (moment method, maximum likelihood method, etc) to estimate the unknown parameter #as a function of sample X; (3) TESTING: Test if the estimated distribution matches the data well. If not, return to step (1), specify another model and repeat the same procedures until a good statistical model is found. This is the strategy of Fisher's statistical inference. We are in fact taught in
158
6. I'D estimation based on large samples
this way in universities (Akaike, 1980), The methods presented in this chapter, however, determine a distribution through the following procedures; (1) SPECIFICATION: Assume a distribution f(x\0) , from the parameterized family <&; (2) ESTIMATION; Take a sample from the population. Use statistical methods (mainly maximum likelihood method) to estimate the unknown parameter Q as a function of sample X; (3) SELECTION: Find the most appropriate model based on some criterion. These procedures are slightly different from Fisher's strategy in form, but they bring some different methods into the procedures. At the stage of SPECIFICATION, <& is an enhanced family of parameterized functions. TESTING in the Fisher's strategy is replaced by SLECTION, SELECTION is in fact an optimization process searching for the most appropriate model. The importance of model selection is exemplified in this Chapter. In most modern methods for nonparametrie estimations, however, model selection is ignored. Therefore, the tradeoff between model uncertainty and statistical fluctuations are not considered. Only through the introduction of information theory can we properly handle the problem of model selection.
DATAJT
I *
SPECIFICATION
\
f(ff)
ESTIMATION
Iw No
TESTING
T
Yes
OUTPUT Figure 6.9 Strategy of Fisher's statistical inference
Appendix
159
Appendix: Non-linear programming problem and the uniqueness of the solution Before we proceed to find the iterative formula to estimate the combination coefficients, we need to introduce some results in NLP theory. These results can be found on monograph on NLP, say Luenberger (1984). Consider the following nonlinear programming problem: jL{a,,a2,...,aAr)-»max
(6.A.1)
gj(a},a2,--;aJ>Q, j<=l,2,-,m
(6.A.2)
«^Q, i = l,2,~;N.
(6A.3)
Equation (6.A.2) defines m constraints. Writing the Lagrangian in the following form: (6.A.4) we introduce the following Lemma Lemma A.1: If a" =(a',al,---,a°N) is an optimum solution to the above problem, then there must exist non-negative multipliers «* (j=l,2,,,,,m), which satisfies the Kuhn-Tucker conditions together with a0 in the following equations: — iOforall at &0; af — 1 = 0, i = l,--,N da, Ida,) yOfoll^Q;
wJ— =0, J-l,—,m
(6.A.5)
(6.A.6)
Lemma A.2: Let a0 satisfy the above necessary conditions. Then the sufficient condition for a0 to be the global maximum is
A(a,u") ^ A(a°,uV f
8A(
f'"^ (a, -a?)
da satisfied for all a £ 0 .
(6.A.7)
160
6. 1-D estimation based on large samples
We now turn to our problem defined by equation (6.12). In our problem, there is no constate like equation(6.A.2), but we have an equality constraint. Modifying the Lagrangian slightly by introducing a Lagrange multiplier A, we have
Jjtfi -l) J
(6.A.8)
The derivatives of A with respect to a are
^
jMW
/ = i,2,...,i¥ .
(6.A.9)
Let a0 be an optimum. Then it must satisfy the second equality in equation (6.A.5), that is,
JdA JdA\
^Aflfofr)
.
i = l,2,-,N.
(6.A.10)
Summing up ihe above equations over i we obtain 0
(6 AJ1)
-
Exchanging the order of summation signs leads to iVci)
0
y
0 =
i = 1 2 ,...,jV.
(6.A.12)
|a ) tT The denominator of the first term is independent on fee index i. Thus the summation is performed only on the numerator, which turns out to befx(xt | a 0 ). The sum in the second term is one if equation (6.10) is used. Then we have ft
' ' '
'ffiX^a}
(6.A.13)
Appendix
161
From which we are led to the conclusion A = -«, . Substituting this into equation (6,A. 10), we obtain --n,c,a?=O, i = l,2,—,JV.
(6.A.14)
Hence, we arrive at the following iteration formula Theorem A.1 if a0 is a solution to NLP defined by equation (6.12) then it must satisfy
This is equation (6.13). Next we will prove that if a0 is an optimum solution to fee problem, it must be the global optimum. According to Taylor expansion of multi-variate function we have:
aa,dak
= A(a )+ 7.—-—&at +~\,Z*. <»i
So,.
2
l=i
j=i
—i-J-J-&a,Aak
(6.A.16)
dOfdOf.
where 0 < ^ l . For brevity, we wrote A(a,n°) = A(a) in the above equation. From Equation (6.A.S), we have
The third term on the right hand side of Equation (6.A.16) is less than zero, as shown below,
162
6. 1-D estimation based on large samples
f *=l
1 *w/
1 f \i=i
Y J
Based on equation (6.A. 16), it is concluded that
A(a) < A(a°) + ± ^ 1 Aa,
(6.A.19)
da Based on Lemma A.2, we have Theorem A.2: If a0 is a solution to the problem defined by equation (6.12), then it must be the global optimum. Meanwhile, if there are two solutions to the problem, they must be identical, because they are all global optimum. In other words, the solution is unique. So far, the theorems proven in this appendix reveal the attractive properties of the model established in this chapter. The solution is unique, and its value is given by a simple iteration
Chapter 7
Estimation of 2-D complicated distributions based on large samples
Estimation of 2-D distributions involves the same basic procedures as those described for estimation of 1-D distributions in Chapter 6. It is thus straightforward to extend the methods for 1-D estimation to 2-D estimation. Many equations are similar in character to those for 1-D cases, but there are some large and small differences that must be considered. First of all, there are, more options for approximating a bivariate distribution than for a univariate distribution. Geometrically, a line in 1-D space corresponds to either a rectangle or a circle in 2-D space. A 2-D B-spline can be constructed either using the product of two 1-D B-splines (forming a rectangle) or using a 1D B-spline which argument is radial distance {forming a circle). A bivariate distribution is then approximated using either of the two types of B-splines. Both approximation methods will be described in this Chapter. Computational efficiency is not an important factor to be considered for 1-D estimation, but it is a very important factor for 2-D estimation due to the sharp rise in numbers of unknown parameters and sample size. It is beneficial for any attempts to reduce computational time for estimating a bivariate distribution. 2-D distributions are more fascinating than 1-D distributions. 1-D distributions can be assessed through visual check while 2-D distributions are more difficult to be assessed by visual check. Information criteria like ME are necessary. Most studies over the years in multivariate statistics have been focused on the multi-variate normal distribution (Anderson, 1958; Sakamoto, 1993). In the realworld applications, 2-D distributions are no less encountered than 1-D distributions. Cta the other hand, there remains a lack of general methods for estimating non normal distributions for a multivariate. Thus methods for estimating 2-D distributions are given in this Chapter.
163
164
7. 2-D estimation based on large samples
7,1 B-Spline Approximation of a 2-D pdf As same as in Chapter 6, only continuous pdfs are discussed here. Consider a 2-D rectangular domain, in which is defined a bivariate continuous random vector (X, Y). An orthogonal coordinate OXY is defined in this domain. Along the two coordinates, M and N B-splines are respectively used to approximate the joint pdf f(x, y) of (X, Y), as shown in Figure 7.1, in the form of
f(x,y | a) =
(7-1) 1=1 ,/=l
where av (i-l,2,..,,M, j-I,2,..-,N)
are the linear combination coefficients. Re-
arranging the coefficients so as to form a vector, we obtain
Using the normalizing condition that a pdf f(x,y)
o oo o ° e 080 00
c^ocr
0
o
o
o
integrates to one, we have
° o °
a» o
°8 ° o o °L
1 0.5 0 X0
X,
Xj
X M -2
Figure 7.1 B-spIine approximation of 2-D pdf
7,1, B-spline approximation of a 2-D pdf
165
where c$ denotes the integral of the ij-Hi B-spline, and takes values as follows: 11
rj
x —x y ~y ci} = f BXx)dx f B(y)dy = ' M x J J~3
(7,4a)
for order 3 B-splines.
(7.4b) •»-*
for order 4 B-splines. To ensure f(x, y) 2:0, we simply set as before atJ > 0, (F=1J, ...,Af,j=l,2, ...,N)
(7.4c)
Another way to approximate a 2-D pdf is to use the so-called Radial B-spline Functions (RBF), which are defined in the radial symmetric form, (see Figure 7.2)
-(2-Sf, 1<SS 6 o, s;>2
where r = ^j(x-xi)3+(y-yif
, S = rjhf , a^isflah1
(7.5)
, and (xhy$ is the
centre of i-th RBF. B,{r,ht) integrates to one and ht denotes the size (radius) of the domain in which Bjfr.hj) does not vanish. A,has significant influences on the shapes of the radial B-splines. In Figure 7.2 are shown three radial B-splines for different A,. In the figure, the radial B-splines are centered at (3,5) and (7.5,10), respectively. In Figure 7.2(a), ^=2, describing "over fat" B-splines. In Figure 7.2(c), hr0.5, defining "over thin" B-splines. Figure 7.2(b) represents proper B-splines, in which B-splines are defined in such a way that the distance of two B-splines is equal to hi. hj may be different from one RBF to another. It may also be same for all RBFs.
7. 2-D estimation based on large samples
166
(a) h=2 0.1
r
m
10 Q
y
10 0 (e) h=0.5
0 0
10
Figure 7,2 Influence of parameter ht on the form of radial B-splines
7.2. Estimation
167
Using RBF, a bivariate pdf is approximated in the form of M.
,
: ytf
(7-6)
where M is the total number of RBFs. Because a RBF integrates to one, we have it
u
J ] atct = ]T a, = 1, if c( = 1 is defined.
(7,7)
As same as before, it is required that the coefficients must be greater than or equal to zero, that is, af>Q,
i = l,...,M.
(7.8)
If RBF is replaced by another function which is of similar characters of RBF, mat is, RBS does not vanish only within a small region, the approximation formulated above remains valid. In general, a bivariate distribution can be approximated by a linear combination of the following form M
f{x,y\m) = 2>d(/\*>, r = ^(x-x,f + {y-y,f
(7.9)
where function $(r,h,), often called basis function, takes non-zero values within a small circle, and vanishes outside the circle. In other words, it is a locally compact supported function. Kernel density estimation developed by Silverman (1986) is such an example. The list of the particular forms of basis function $fa",ht) is long, such as Gaussian function, multi-quadries. Whatever the forms are, the estimation procedures are fundamentally same. We remain to focus on the two types of 2-D B-splines as described above, keeping in mind that the basic procedures apply to other basis functions. 7.2 Estimation 7.2.1 Estimation from sample data Using the maximum likelihood method to estimate the unknown parameters, we first of all take a random sample of size «,s from the population. Let the sample points be {xt,y() (t = 1,2,•••,#,). If MxN is given in equation (7.2), estimating a is equivalent to the following nonlinear programming problem;
168
7. 2-D estimation based on large samples
For given M and N, find vector a such that it satisfies (7-10) subject to the constraints:
at*Q.{i = l,2,-,M,j>*l,-.N)
(7.12)
The nonlinear programming problem above also applies to RBFs if equation (7.11) and (7.12) are replaced by equations (7.7) and (7.8), respectively. Using vector and matrix notations, we may easily conclude that all the conclusions valid for 1-D problem remains valid for the 2-D problem defined above. In summary we have Theorem 7.1 (1) The solution to the problem defined by (7.10)~(7,12) exists; (2) There exists unique extreme point, which is global maximum; and (3) The solution can be found through the following iterative formula
startingfrom the initial values
If RBFs are used, the corresponding iterative formula becomes
starting from the initial value
7,2. Estimation «,=— • 1 M
169 (7-16)
1.2.2 Computation acceleration If the iterative formulas (7.13) or (7,15) were not used, it would be very timeeonsuming to perform the 2-D estimation because of the huge numbers of operations involved, or it would require computational time that is too long to be practically meaningful. Even so, there remains room for further acceleration of computational speed. A quick estimate of operation numbers per iteration demanded by equation (7.13) can be easily made to be around MxNxnt, Suppose 10x10 = 100 Bsplines are used, and the sample size is 400. Then the number of operations per iteration is about 40,000, which is a large number. We note from equation (7.13) that for most observation points, B){x)Bj{y)is zero. It does not vanish only for a small number of points which fall in its nonzero region. To make use of this fact (we know that this is the local support property of B-splines), the iteration is not done over all observation points, but only over those on which Bi{^Bi{y)i& not zero. DtJ is used to denote the index set of those observations that contribute to Bt (x)Bj (y), that is, Bl(Xl)Bl(yt)*Q
if
teD,
• (
3(*,)fl,(y,) = 0 if teDt '
'
}
Before iteration begins, make computations to find sets DtJ based on equation (7.17). Introduction of the index set Dy leads to a new iterative formula ±
fl^tW|
{71g)
nsctJ Sot / W (*(.J' ( I«) Because the rest terms are zero, a^ forms a matrix, whose non-zero elements given by equation (7.18) is banded. If a uniform bivariate distribution is considered, the mamematical expectation of the observation number in Dy is E[DH ] = ^ . * (Ml)(iVl)
(7.19)
170
7. 2-D estimation based on large samples
The number of operations per iteration is then estimated at MxNx E[DtJ]« ns. Consider previous example again. Using formula equation (7.18), the operation number is around 400 per iteration, a remarkable reduction! If the bivariate distribution is not uniform, equation (7.19) seems no longer valid. The n*ue case would be: some index sets Dy contain more elements than the rest, and these sets take more operations than the others. One extreme case is that all sample points fall in one index set and the rest do not have elements. The operation number in that index set is around «, and the operation number in the rest is zero for there is no element in them. The total operation number per iteration remains around «,, if equation (7.18) is used. In general, if some index sets contain more elements and more they take more operations, then some other sets contain less elements and they take less operations. On average, the estimate given by equation (7.19) remains valid with the understanding that it is averaged over all index sets. This is a crude estimate, but it does help us to establish the idea that equation (7.18) saves up to M x N fold computer time vs equation (7.13). If M and N are 10 to say, then the computational speed is about 100 fold increased. The arguments above also apply to equation (7. 15) for RBFs. The sum over all sample points in the equation should be replaced by the sum over the index set Dh any element of which makes ^ ( r , , ^ ) nonzero. This technique is employed in other fields such as molecular dynamics and computational mechanics to accelerating numerical simulation speeds. 1-D and 2-D computations are significantly different resulting from the dramatic difference in numbers of unknown parameters. Any computing, if performed over all points in a 2-D space, would not be practical to be computationally effective. 7,2.3 Estimation from a histogram Suppose a 2-D histogram is composed of MH x NH cells as shown in Figure 7.3. The histogram is formed from «, sample points, and there are «,,, sample points in sf-th cell (s=!,2,..,,Mm t=l,2,...,N/f), respectively. Each cell is defined by the interval K . # t + i ] x f e ' ^ + i ] - Here Greek letters are used to differentiate from knots of B-splines. If the sample obeys the distribution defined by f(x,y), the probability for the event that ns, points fall in st-th cell is given by the following multinomial distribution
7,2. Estimation
171
where g» is the partial probability of f(x, y), and it relates to the combination coefficients a through
Figure 7.3 2-D histogram composed of MH x NH cells
4/+1 %*{
IJ
=J
l "fc+1 M
N
(7.21)
= £ £ % J jB,(x)Bj(y)tfydx Use <4«i to denote the integral in the last term, that is, (7.22)
and equation (7.21) would be M
N
(7.23)
The log-likelihood function is obtained by taking logarithms on both sides of equation (7.20)
172
7, 2-D estimation based on large samples
L = log P = 2 ] ]T p s( log 9W + constant, />„ = nm I ns,
(7.24)
»=i i=\
Similar to equation (6.12), the best estimate of a pdf based on a bivariate histogram is formulated as follows For given M and N, find vector a so that it satisfies 1 =
Z E Pi l o 81i -*1m a x
C?-25a)
!=J 1=1
subject to /te constraints: (7.25b) # a0,
(i = l,2,-,M;j = l,2,-;N),
(7.25c)
The solution to the above problem must satisfy
^ f i ^
1
1 2 Mj l
2
N
(7.25d)
Again we may use this equation as an iterative formula to find the coefficients. If RBFs are used, we have the following formulation For given M and N, find vector a so that it satisfies (7.26a) subject to the constraints: |>,C(=1
(7.26b)
q £ 0 , (i = l,2,---,M)
(7.26c)
where
7.3, Model selection
*" *
173
(7.27a)
«= J )*,(*",*>$«&
(7.27b)
The solution is (7.28)
7.3 Model selection Based on ME analysis, the number of free parameters in a model must be defined. For bivariate distributions, we have two definitions for the number of free parameters as follows MxN-l M-l
if equation (7.1) is used ifequation(7.6)(RBFs)isused
For bivariate distributions, the best estimate of pdf can be found through the following procedures: Suppose a is the maximum likelihood estimate of afar given number of unknown parameters. The best model satisfies ME = -jjf(x,y\
&)lagf(x,y
| n)dxdy+^i-
-> min
In the case that AIC is used, Suppose a is the maximum likelihood estimate of a for given number of parameters. The best model satisfies
(7.30a)
174
7. 2-D estimation based on large samples = -—Y
log f(Xt,yf\i)+?*—*
min
(7.30b)
If a is estimated from a histogram, the above formulations are changed slightly as follows: Find M and N so that ME(M, N) ~ - 2 J 2 , Q*i (fi) l o §au C & )+—- -» m m
(7-31 a)
or Fiwrf M anrf JV" so that A1C(N) = - ] ^ f > w logfM(a) + ^ - • min. i-l
(=1
(7.31b)
«,
7.4 Numerical examples In examples 7.1 and 7.2, the methods presented in this Chapter are examined by using the following procedure; given a distribution -> from the distribution generate a random sample -> from the generated sample estimate the pdf -> compare the estimated and given pdfe to assess estimation accuracy. Example 7,1 Two dimensional normal pdf. This example has been considered in Chapter 3, where a method for generating random numbers (ARM) was discussed. Consider a bivariate normal distribution defined by (7.32a)
Here g(x,y) = 0.&xh(x,y 15.5,5.5,0.6,0.6,0.3) +02xh(x,y 17,7,0.4,0.4,0.3) which in turn is defined by
(7.32b)
7.4, Numerical examples
175
(7-33)
The shape of f(x,y)
is shown in Figure 7.4. The definition domain [0,9]x[Q,9]
Three random samples (xt,yf) { £ = 1,2, • • •, n,) with sample size n, =200,400 and 500 were respectively generated from the distribution above. Figure 7.5 shows the sample scatter in the plane.
Figure 7,4 The given distribution
Figure 7.5 The generated random sample (n,=500)
First of all, rectangular B-splines (product of two one-dimensional B-splines) were used. M and N are varied from 6 to 15. In total, there are 10 x 10 = 100 case were calculated. Some of results are given in Table 7.1. In the ease of ns~200, the ME-based optimum number is M=10 and iV=10, while AlC-based optimum number is M=8 and N-%, see Table 7.1 (a). In Figure 7.6 we plot two pdfe, one estimated based on M=N=10 B-splines, see Figure 7.6 (a). This is in fact the case for ME to be minimum. Figure 7.6 (b) shows the pdf estimated by using M=N=8 B-splines, mat is, best estimate based on minimum AIC. Comparing these two plots, we see that Figure 7.6 (a) is better
176
7, 2-D estimation based on large samples
than Figure 7.6 (b), showing that ME analysis is more suited to the model selection than AIC. The sample size is not large, and thus even the estimation based on minimum ME is not very close to the true distribution. Particularly, the second small peak in the region [7,8]x[7,8] is not well described. Table 7.1 ME and AIC values (a) ns = 200 M 8 9 10 10 11 12
N 8 9 10 12 10 12
ME 2.963 3.058 2.869 2.941 2.984 3.038
-L 431.1 426.1 398.1 389.9 397.1 378.6
AIC 2.470 2.530 2.485 2.544 2.531 2.608
,n, = 400 M 10 10 11 11 12 14
N 10 14 12 14 12 14
-L 833.5 818.9 812.3 809.1 804.3 799.0
ME 2.56i 2.643 2.459 2.652 2.581 2.779
AIC 2.331 2.395 2.358 2.405 2.368 2.485
ME 2.480 2.528 2.453 2.530 2.474 2.622
AIC 2.271 2.322 2.291 2.328 2.296 2.391
= 500 M 10 10 11 11 12 14
N 10 14 12 14 12 14
-L 1036 1022 1015 1011 1005 1001
7.4, Numerical examples
177
(a) ME-based
0.4
0.3 0.2 0.1 0
0.4
(b)AIC-based
0.3 0.2 0.1 0 3
Figure 7.6 ME-based and AlC-based best estimates for«,, =200
The optimal (M,N) for sample size ns = 400 are M=l 1 and N=12 based on minimum ME. If AIC is applied, the optimum number of B-splines are M=N=10. The estimated two dimensional pdfs are given in Figure 7.7 (a)-(b) for minimum ME and minimum AIC. For this case, same conclusion as above can be obtained. AIC underutilizes the number of B-splines, resulting in incapability to capture the small peak. ME-based optimum number of B-splines yields better results.
7. 2-D estimation based on large samples
178 0.4
(a)ME-bestM=ll,iV=12
0.3 0.2 0.1 0 3
(b)AIC-best M=1G,JV=1O
0.4 0.3 0.2 0.1 0
Figure 7,7 The ME-based best estimated pdfi for sample »j=400
The results for sample ns = 500 are given in Figure 7 J for ME-based and AlC-based optimum estimations. The optimum number of B-splines based on minimizing ME is M=ll and N=12 while AlC-based optimum number is M=N=10. If sample size is kept same, ME analysis yields better estimations. For all the samples discussed above, ME selects better models than AIC does. As discussed in Chapter 5, ME takes two uncertainties into consideration: the uncertainty associated with the random variable itself and flie uncertainty resulting from model approximation. AIC, on the other hand, is asymptotically unbiased estimate of likelihood function without considering the uncertainty
7.4. Numerical examples
179
associated with model approximation. In terms of this fact, we believe that ME analysis is more suited to distribution analysis.
(a) ME-best M=UfN=12
0,4
(b)AIC-best
0.3 0.2
0.1 0 3
Figure 7.8 The ME-based best estimated pdfe for sample n,=5O0 To investigate the influence of the number of B-splines on estimation accuracy, two cases are considered: M=N=% and M=N-20. The sample used for the estimation is 200. The estimated results for these two cases are plotted in Figure 7.9.
180
7. 2-D estimation based on large samples
Figure 7.9 Influence of number of B-splines Observations similar to 1-D case indicate that the number of B-splines has important influence on the estimation accuracy. If the number is underutilized, the global trend variation cannot be fully described; if the number is overshoot, the influences of local changes resulting from statistical fluctuations are magnified. In either way, estimation is not satisfactory. Therefore, ME analysis helps us to find the right number of B-splines so that both underutilization and overshooting are avoided.
7.4, Numerical examples
181
Hence, ME analysis, or information theory in the broader sense, is really interesting and marvelous. It is able to find therightnumber of B-splines from a lot of candidate models. Example 7.2 Revisit of Example 7,1 with RBFas the appraximant In example 7.1 products of two 1-D B-splines have been used as approximants for 2-D pdf. In this example, we use radial B-spline functions as approximants. The data used are as same as those given in Example 7.1. Draw a sample of size nt = 200 from the population. MODEL(iV) denotes the approximation using N radial B-spline functions, that is, MODEL(N) f(x,y\a) = y,a,Sl(r,hl),
r = J(x~xly+{y-yiy
(7.34)
If the sample is given, the coefficients are determined through the iterative formula
e^—xtff1^'
'•t=^t-xif+iyl-ylf
(7.35)
Suppose M and N B-splines are used in the X- and Y-directions, respectively. If these B-splines are uniformly distributed along the two directions, hi is determined by , (b-a c-d") h =max ,——
,_,,. (7.36)
where [a,b] denotes the interval X lies in and [c,d\ denotes the interval Y lies in, ME and AIC are then calculated from
ME = - \\f(x,y |fi)tog/<*,y\ i)dufy+-£-
(7.37)
AIC = -—£\agf{xf,yt
(7.38)
|a)+^
For example, for MODEL(36), we have
182
7. 2-D estimation based on large samples
h, = max] — , — 1 = 1, L=447.69, ME=2.726 & AIC=2.413 V 6 6
(7.39)
Continuing the procedures, we obtain results for differing models with some of the results shown in Table 7.2 (a). The ME-based best estimate is given by MODEL(64), that is, N=M. The best estimate by minimizing AIC is MODEL(36). Both cases are plotted in figure 7.10 (a) and (b). As same as 1-D case, ME-based best estimate is better than AlC-based best estimate because the latter underutilizes the number of B-splines.
Table 7.2 Part of calculated results in the case of RBF used
(a) n. = 200 N 36 48 56 64 72 80 88 96 104 112 120
L 447.69 454.79 444.01 422.62 428.20 421.63 422.04 423.74 420.75 421.22 421.79
ME 2.726 2.862 2.917 2.707 2.846 2.834 2.904 2.975 2.986 3.078 3.118
= 800 AIC 2.413 2,508 2.495 2.428 2.496 2.503 2.545 2.593 2.618 2.661 2.703
N 70 80 90 100 120 140 160 170 180 190 200
L 1721.35 1673.05 1652.99 1630.63 1636.85 1632.24 1628.23 1630.66 1630.49 1628.14 1629.33
ME 2.523 2.360 2.289 2.247 2.305 2.339 2.360 2.392 2.405 2.417 2.443
AIC 2.237 2.190 2.177 2.162 2.194 2.214 2.234 2.249 2.261 2.271 2.285
If we try a larger sample of size ns — 800, estimation accuracy will be improved. Without giving the details, some of results are listed in Table 7.2 (b). In this case, both ME and AIC are minimized by MODEL(IOO) and the corresponding pdf is plotted in Figure 7,12(e). In summary, RBF is applicable to distribution estimation, too. As sample size is small, AIC is prone to underutilize B-splines. As sample size is large, ME- and AlC-based best estimations converge the true distribution.
7.4. Numerical examples
(a) nx =200, ME best
0.4 0.3
(c) ^ = 8 0 0 , ME & AIC best
0.2 0.1 0 3
9
3
Figure 7.10 Estimation using RBF for two samples
183
7, 2-D estimation based on large samples
184
Example 7.3 Joint distribution of wave height and wave period in the Pacific (Histogram) The joint distribution of wave height and wave period shows high irregularity worldwide (Watanabe, 1993), Here, the distribution of wave height and wave period is approximated by the two dimensional B-spIine functions. The data are taken from the records measured aboard ships for winters in the Pacific (Watanabe, 1993). The wave periods range from 0 to 16 seconds and wave heights range from 0 to 16 meters. The histogram of the data is shown in Figure 7.5 and the observed date are given in Table 7.3. Order 4 B-spline functions are used in this example, only for the sake of comparison. The estimation is based on histogram using the method presented in section 7.2.5. The results of the analysis are partially given in Table 7.4. The best numbers of B-splines are M=17 for wave height and iV=26 for wave period. The estimated pdf is shown in Figure 7.11. In the figure, H represents wave height in meter and T represents period in second. The vertical axis is probability density. For this case, AIC and ME predict the same results.
Table 7.3 Wave height and wave period data in the Pacific (winter)
T\H 0.000.751.752.753.754.755.756.757.758.759.7510.7511.7512.7513.7514.75-
069756 204031 81364 15627 3463 1093 449 37 21 26 17 12
2 0 1 2
513988 135590 J28686 41338 19535 3421 1446 120 40 40 26 13 9 2 2 8
64312 77195 141318 72427 21112 5818 2621 4526 404 272 187 9 15 3 0 9
72217 43463 104275 86094 31431 8470 3468 2465 496 309 214 12 10 5 1 6
82072 33394 85139 95867 54283 17808 6835 3760 1216 627 485 68 90 24 12 8
7.4. Numerical examples
185
(continued)
T\H 0.000.751.752.753.754.755.756.757.758.759.7510.7511.7512.7513.7514.75-
9641 107.92 27368 36476 30493 12482 4880 2695 865 480 416 66
m 16 17 12
10844 9245 21975 28595 28275 16224 7595 3539 1407 682 691 97 402 34 20 121
11236 1918 4999 7067 6954 5000 2563 1270 538 304 234 55 61 17 13 32
1257 3159 6148 8068 8688 7042 4558 2494 1141 530 534 125 116 50 24 43
1322 1167 2192 2151 2103 1603 1190 698 341 196 197 35 46 17 10 17
1461 2688 5012 3949 3577 2898 2419 1712 953 619 493 163 145 39 21 78
Table 7.4 ME and AIC values for wave height-period joint distribution (winter, ns = 19839000)
M
N
-L
ME
AIC
15 16 17 17 18 18 19 20
15 16 26 29 18 27 19 17
3.601 3.600 3.589 3.588 3.592 3.589 3.590 3.587
3.846 3.839 3.817 3.819 3.837 3.819 3.824 3.843
3.601 3.599 3.589 3.590 3.592 3.589 3.590
3.587
The distribution form shown in Figure 7.11 is too complicated to be represented by using simple distribution forms. Without powerful B-splines and ME analysis, it would not be possible for us to find such complicated distributions in an easy and pleasant way.
186
7. 2-D estimation based on large samples
Figure 7.11 Estimated joint distribution of wave-height and waveperiod
7.S Concluding remarks Bivariate distributions are frequently encountered in applications. And general methods for estimating underlined joint distribution remain lacking. For a given bivariate sample or histogram, it is not easy to work out the distributions. The method presented in this Chapter resembles in character that presented in Chapter 6, but it is more useful. For in 1-D cases, we have alternative methods to evaluate the underlined distributions, but in 2-D cases, we hardly have an alternative method directly estimating distribution from a sample or histogram. In this sense, the method developed in this Chapter is both necessary and meaningful. Several factors have significant influence on estimation accuracies. They are sample size, the number of B-splines and criteria for model selection. Numerical examples show that as sample size get larger and larger, the method presented in this Chapter can give estimates converging to the true distribution, Underutilization of B-splines is unable to describe important characters in the pdf form, while overshooting of B-splines is sensitive to local changes. ME and AIC analysis finds the right number of B-splines to keep the balance between global character in the disfribution form and local changes. ME predicts better results than AIC, validated both numerically and theoretically. It is recommended to use ME analysis
7.5. Concluding remarks
187
The estimations are not sensitive to the order of B-splines, if only order 3 and order 4 B-splines are considered. Therefore, both order 3 and order B-splines are suitable choice in real-world applications. Again the functions of the method are summarized as follows If a random sample from the population is drawn, the method can directly estimate the distribution from the data. The estimation procedure is composed of three steps (1) Given M and N, find the M-L estimates of the linear combination coefficients, (2) For different M and N, compare ME values and find the M and N minimizing ME, (3) The model with M and N minimizing ME is the pdf we are after.
This page intentionally left blank
Chapter 8
Estimation of 1-D complicated distribution based on small samples
In Chapters 6 and 7, a systematic method has been introduced to estimate the density function of a random variable through given samples. In the method, a pdf is approximated by a linear combination of B-spline functions, and the best number of B-splines (best model) necessary for the approximation is determined by minimizing Measured Entropy (ME) or AIC, As pointed out in the closing of Chapter 6, the method works well for large samples. When the sample size is small, however, the method presented in the previous chapters cannot yield satisfactory results. The estimated combination coefficients show strong irregularities as we will see later in this chapter. In passing, it is pointed out that it is always hard to answer the question: how large a large sample should be and how small a small sample should be. So we need to clarify it before we proceed further. Here by large sample we mean that the sample size is much greater than the number of unknown parameters to be determined from the sample. The rest cases are defined as small samples. For example, if 50 B-splines are used (50 unknown parameters) and we have only 40 sample observations, then this is a small sample problem. On the other hand, if 50 B-splines are used, but we have 200 sample observations, then it is a large sample problem. "Large" and "small" are used in the relative sense throughout the book. To overcome the shortcoming of the method previously presented, Bayesian approach, in which all parameters are treated as random variables, is employed to improve the estimation accuracy. So the new method to be introduced in this chapter is characterized by a preliminary prediction-correction process. At the preliminary prediction stage, we still use a linear combination of B-spline functions to approximate a pdf as we did before, but the number of B-spline functions is prefixed and may be much greater than sample size. Strongly influenced by statistical fluctuations, the combination coefficients are highly
189
190
8, 1-D estimation based on small samples
irregular. At the correction stage, the smoothness restriction on the combination coefficients is introduced, based on which the so-called smooth prior distribution is constructed. By combining the information obtained from the preliminary prediction and from smooth prior distribution in the Bayes' rule, the influence of statistical fluctuations is effectively removed, and greatly improved estimate, which is close to the true distribution, can be obtained. So in the method to be presented, a new strategy is applied. It is based on the belief that a number of B-splines (say 50 or 100) large enough have sufficient flexibility to represent most distribution functions we are interested in. The coefficients estimated based on a small sample would produce poor estimates. The reason is that the amount of information provided in sample data is less than that neeessary for determining the unknown parameters. Prior distribution provides an alternative to pool information from sources other than sample data. The total amount of information provided in sample data and by prior distribution is enough for determining all unknown parameters. Bayesian methods have found a variety of applications in the fields of science and engineering. Particularly mentioned is the field of structural reliability, a field to study if a structure (ship, building, dam, aircraft, etc) Mis within design life cycle. Because these engineering structures are so durable that their failure probabilities are very small. Thus the available feilure data are scarce, any inference based on which is dubious. Bayesian methods are thus specially preferred. (Ditlevsen, 1994; Sander & Badoux, 1991; Manners, 1994; Zong & Lam, 2002). 8.1 Statistical influence of small sample on estimation In Chapter 6, a linear combination of B-spline functions was used to approximate the pdf of a continuous random variable X as
(8.1)
To determine the coefficient vector a, a random sample of size ns is taken from the population. Let the sample observation point be x f {l=l,2,...,n s ). Then the maximum likelihood estimate of a is given by the following simple iterative scheme, see equation (6.13),
(8.2)
8.1. Statistical influenece of small sample on estimation
191
The model above works well for cases where N is greatly smaller than the sample size ns. In this chapter, we will extend the above method to the cases where the relationship above is violated. Such extension is needed when the distribution under consideration is complicated and the accessible data are limited. Consider the following example to see the influence of statistical fluctuations on estimation. Example 8.1 Statistical influence of small sample on estimation Suppose the true distribution is given by
(8.3B)
where g(x) is a function defined on [0,10]
(8.3b)
By use of the method for generating random numbers, 40 random numbers were generated from f(x) as a given sample. The following model, in which 50 B-splines are used, is employed to approximate f(x)
/(x| «) = £<**,(*)
(8.4)
Note that the number of B-splines (N = 50) is greater than the sample point number («,=40). Using the generated sample, the unknown parameters in fee above model are estimated from equation (8.2) and the results are shown in Figure 8.1. The predicted pdf exhibits noise-like irregularities, unable to describe the distribution under consideration at all. The high irregularities are attributed to statistical fluctuations. Recall 1-D examples in Chapter 6, where the number of B-splines was around 10, and the number of sample points was always above 30. It is thus not surprising that good results were obtained in Chapter 6.
192
8. 1~D estimation based on small samples
Preliminary prediction
4
6
N = 50
8
10
Figure 8.1 Influence of statistical fluctuations on the estimation If a is estimated from histogram data, the same irregularities are observable, too. The irregularities remind us of treating the coefficients as random variables. In feet, even in the case of large sample, they are tteated as random variables, too. The sample size is always finite, and the coefficients always deviate from the true values to somewhat extent. The difference is that in the case of large samples, deviations of the estimated values are small, asymptotically normally distributed. In the case of small samples, however, statistical fluctuations are not small and the distribution is not close to the normal distribution. Taking the parameter as random variables represents a radical shift of statistical methodologies from traditional Fisher's statistics to Bayesian statistics. Bayesian method will thus be employed in the following. 8.2 Construction of smooth Bayesian priors 8.2,1 Analysis of statistical fluctuations Due to statistical fluctuations, the predicted parameter a is generally not coincident with its true value b, and there exists a deviation w,, as schematically shown in Figure S.2, a, =
(8.5)
where a is either estimated directly from sample data or from histogram data. Define two vectors
S.2. Construction of smooth Bayesian priors
= (bl,bi,~;btlf,
193 (8.6)
where the superscript "7™ denotes matrix transpose. From the Central Limit Theorem, a, are asymptotically normally distributed with mean bt as the sample size becomes large. It is therefore reasonable to assume that wt is a normal random variable with zero mean and common variance er1.
.-••••
Figure 8.2 Deviations between true and predicted coefficients
Generally, the smaller the sample size is, the larger a2 is. Hence we have
(8.7)
Once b is given, me likelihood fimction for a is then given by
(8Ja)
which is obtained from the independence assumption. Taking the common factor out of the product symbol, we obtain
194
8. 1-D estimation based on small samples
( b>
°
Using HI to denotes the distance in Euclidian space of the form
Equation (8.8b) can be rewritten in more compact form
] e x pi-^-4} LJ_|| a _bfl
(8.8d)
Because a is predicted directly from the sample, P{& | b) contains all the necessary information in sample data. This is the preliminary prediction as we mentioned at the beginning of this chapter. 8.2.2 Smooth prior distribution of combination coefficients The reason for us not to be satisfied with the preliminary prediction in Example 8.1, or the reason for us to say that the preliminary prediction is not a good estimate, is that the predicted pdf is quite irregular and we cannot see the global variation trend of the random variable under consideration. In other words, a good estimate should be smooth enough and exhibit the global variation trend of the random variable under consideration. We would say that smoothness is the prior information on b. Smoothness is just an abstract expression, but we may quantify it using mathematical terms. Because smoothness is mathematically defined by equating the right and left derivatives at one point, we have
f
-=H
(8-9)
where the subscript + and - denote the left and right derivatives, respectively. Approximating the equation above by use of central difference, we obtain
8.2. Construction of smooth Bayesian priors
195 (8.10)
Or, we may write them in the following form, (8.11) where e, denotes the approximation errors resulting from discretization, or neglect of higher order terms in the derivatives. Figure 8.3 depicts how the errors result from approximation. Whenever straight lines are used to approximate the slopes of a curve, finite errors are observed. When ef is zero, ^ is simply the mean of bM and d,_,, and thus b is in feet a straight line, which is of course smooth. If e, is very large, the right and left derivatives are much different, and b is not smooth. Therefore, e = (eJ,ei,--,eK_])basically defines the smoothness of b. Because we require that b be smooth, the mathematical expectation of e, is surely zero. We fiirther assume that e^s are mutually independent and identical normal random variables with zero mean and same variance T2. With these assumptions, we have
(8.12a)
Figure 8.3 Errors resulting from derivative
196
& 1-D estimation based on small samples
Noting that e r e = e£ + e2 + > • •+ej_,, we rewrite the equation as
F(e) =
1
(8,12b)
Introducing the matrix 1 -2 1
1 -2 1
1 -2
1
(8.13)
1 -2
1
we may rewrite equation (8.11) in the matrix form = Db.
(8.14)
Substituting the equation above into equation (8,12b), we obtain the prior distribution for the true combination coefficient b
(8.15) Introducing another new parameter a?2 =a2 lt%, we obtain fi-Z
exPi-|L-||Dbf
(8.16)
The distribution above, obtained based on the smoothness assumption, is called smooth prior distribution. It is purely constructed from our requirement
8.2. Construction of smooth Bayesian priors
197
that b be smooth, and thus it contains the information not in sample data. On the other hand, F(a | b) solely contains the information in sample data. We thus have two sources of information: sample date and prior distribution. One basic idea behind Bayesian statistics is to make Ml use of these two sources of information. It is possible to qualitatively study the behavior of P(b), Variance o* is estimated from sample, and thus fixed with respect to P(b). The parameter that is changeable is to2. As t»2 is large, the exponential function in equation (8.16) vanishes in the most part of the distribution domain of e = D b , except near e = 0. So F(b) is characterized by sharp rise near the origin, but fast dying away from the origin. This is schematically shown in Figure 8.4 (a). The differences among elements of vector b must be small. So vector b represents a nearly straight line connecting bi and by, see Figure 8.4(b).
(a)
P(b)
Figure 8.4 Correlation of o 2 and the behavior of vector b
If to2 is small, the exponential function in equation (8.16) does not vanish in the most part of the distribution domain of e = Db, and P(b) is characterized by wide spread. The differences among the elements of vector b are likely large, and
198
8. 1-D estimation based on small samples
vector b might represent a curve of irregular shape as shown in Figure 8.4 (c). Therefore, i*(b) basically controls the shape of vector b. 8.3 Bayesian estimation of complicated pdf 8.3.1 Bayesian point estimate Two models have been established up to now. One is the likelihood function P(a | b ) , which pools the information in sample data, and the other the prior distribution P(b), which defines the smoothness of b. Combining them in the Bayes' rule introduced in chapter 2, we obtain the posterior distribution P(b | a) = C x P(a | b)P(b)
(8.17)
where C is the normalizing constant. Substituting equations (8Jd) and (8.16) into equation (8.17) yields
(8.18a)
Further simplification leads to
.
1
expi
V2s-cr}
r[|a-bf+
I 2er L"
ffi2l|Dbf"|i
(8.18b)
-lj
In Chapter 5, we have shown that the best estimate should maximize the poster distribution. This is called Bayesian point estimate. Equivalently, it can be obtained by minimizing £? 2 (b), Q2 (b) = ||a - b f + w1 jDbf ~» min Setting
(8.19)
8.3. Bqyesian estimation of complicated pdf
= 0, i =
l,2,-'-,j
199
(8-20)
we conclude that b must satisfy (8.21a) the solution to which is ~ l a, and a1 = i
(8.21b)
Here I is the unit matrix of the form
1=
(8.22)
Equation (8.21a) defines a system of linear equations, which can be solved by several methods. Gauss elimination is fee most frequently used one, and this issue will be detailed in section 8.3.3.
(8.23)
The pdf calculated from following equation, which is obtained from Equation (8.1) with a replaced by b is thus an estimate based on Bayesian method. Note the denominator in the equation above, which is included to make the pdf under consideration to satisfy the normalization condition, that is, a pdf should integrate to one.
200
8. 1-D estimation based on small samples
Two more parameters remain undetermined. They are eo2 and e 2 . Once they are given, we may obtain Bayesian point estimate b from equation (8.21) immediately. In the next section, an entropy analysis is used to find the most suitable co2. 8,3.2 Determination of parameter co2 Parameter c*2 confrols the scatter in a. LargeCT2represents large deviation of estimated parameter a from the true parameter b, and exhibits highly irregular traits in pdf, as shown in Figure 8.1. Small o 2 represents closeness of the estimated distribution to the true one. From equation (8.21), we note that b = a if ca2=0, and the Bayesian estimate degenerates to the preliminary prediction. From equation (8.19), we note that b is a straight line if a2 —> ao . Therefore, a2 and ra2 are very important in our analysis. In most analyses using Bayesian approach, these two parameters, or one of them is subjectively selected, subjecting Bayesian statistics to the criticism by objectivists. It is, therefore, demanded for any attempt to find or build an "objective" way to determine these parameters. Consider the marginal probability P(a | a2, a-1) = j>(a 1 b)P(b)db
(8.24)
which describes the averaged behavior of a, and gives the true distribution of the occurrence of event a. Because it is independent of parameter vector b, we can obtain the estimates of m2 and u2 by maximizing the marginal probability. Find a/ and a* such that P(z | to1, a2) = j>(a |b)P(b)cfl» -> max.
(8.25)
As mentioned in Chapter 5, this optimization underlines the request that the uncertainty associated with the unknown parameters is minimum. If b were not treated as a random variable and took a particular value, this principle would degenerate to the Maximum Likelihood Principle. Rewritten in the following logarithmic form, the marginal probability is
8.3. Bayesian estimation of complicated pdf
201
MEB(©2, er2) = - 2 log P(a | o 2 , v2) = - 2 log J>(a | b)P(b)
(8.26)
MEB, denoting the logarithm of the probability, is the abbreviation of Bayesian Measured Entropy. The coefficient 2 in the front of the integral symbol is included for historical reason. Introducing the following two matrices in equation (8.26)
Ho"
(O7)
HOD™ I
We have
Q1 = ||a - bf + m1 |Dbf = ||x - Fbf.
(8.28)
Furliiermore, using the following equality Q2 = ||x - F b f = ||x - Fb| 2 + ||F(b - b)||J = q2 + |F(b - b)| 2
(8.29)
where q3 is given in Equation (8.21), we may write MEBfa/, n1} in the form of MEB(©2 .tr 1 ) = -21og |P(a|b)P(b)db
r
(830)
-
The last integral in the equation above is in feet the multidimensional normal distribution with the following normalization factor
-=L-
|F r F|" 2 , | F r F = determinant of F r F .
W2*oJ '
'
'
(8.31)
202
8. 1-D estimation based on small samples
Thus we obtain MES(©2,cr2) = -(JV-2)log© 2 +(#-2)log<7 2 +-^-+log|F r F| +constant
(8.32)
The best model is obtained by minimizing MEB, equation (8.25). In other words, the derivatives of MEB with respect to cf2and
©
or do
(8.33a)
Bto
0
(8.33b)
From equation (8.33a) we obtain estimate of a 2
JV-2
(g.34)
Equation (8.33b) is so complicated that we cannot find an explicit solution to it. Instead of solving it analytically, we turn to numerically solve the following equivalent problem MEB(©2 ) = ~(N~ 2) log ®% + (N - 2) log q2 + log|F r F| -> min
(8.35)
which is obtained by substituting equation (8.33a) into equation (8.32). Note that we write MEB(ro2) in stead of MEBCm^o2) because o 2 is a known parameter in equation (8.35). This is in fact a nonlinear programming problem free of constraints. Its solution exists because MEB(t»2) is a continuous function of to2. HJ2 can take any value in the interval (0,oo) . However, in applications, m1 e [10"4,10s] is a suitable choice. It is difficult to mathematically prove that MEB has a unique minimum. But from numerical experiments performed up to now by various authors, it seems MEB has unique solution. The typical dependence of MEB on 2 changes
8.3. Bqyesian estimation of complicated pdf
203
from small to large. At certain value of co2, MEB changes from a decreasing function to a slowly increasing function. This is numerically explored in Example 8.2. Several methods may be used to solve the nonlinear programming problem defined by equation (8.35). In the following procedures, the simplest method, dividing the interval into equidistance subintervals, is employed. In summary, the Bayesian estimation is composed of following steps (1) Obtain preliminary prediction a from equation (8.2) as input; (2) Divide [lO^.lO 3 ] into JV& equidistance subintervals. Loop over (a) Take a tn2 from each subinterval; (b) Obtain smooth Bayesian point estimate b from equation (8.21) for the given (a2; (c) Estimate MEB(BJ 2 ) from Equation (8.35). (3) Choose the m2 which minimizes MEB. The value JVj is approximately around 20-100. If this division is not enough, make a finer division of a smaller subinterval and compute corresponding MEB. Note that it is not necessary to find the exact value of oo2 that minimizes MEB. If the reader is not satisfied with the optimization method above, he or she may try other optimiation methods, among which Genetic Algorithm (GA) has received extensive studies in recent years. Using GA, step 2 a) will be changed to random selection of a point in the interval [lO^JO 3 ]. For one-dimensional cases, however, these optimization methods do not show significant differences, and thus we do not explore further, here. In spite of seemingly complexity, the above analysis can be easily implemented and yields quite satisfactory and robust results. The relevant codes are provided in me floppy attached to this book, and some examples will be given in section 8.4. 8.3.3 Calculating b and determinant of |FTF| In terms of numerical procedures, three issues are of major concern. The first is to numerically find b defined in equation (8.21b) or equally speaking, numerically solve the linear equation system (8.21a). The second is to numerically find the determinant of matrix |F T F|, and the third is to numerically minimize MEB in equation (8.35). The last issue has been addressed in the section above, and we focus on the rest two numerical issues. To solve a linear equation system Cb=a like equation ( O l a ) where C is an NxNcoefficient matrix, Gauss Elimination is most frequently used. It consists of two parts: forward elimination and back substitution. In the forward
8. I-D estimation based on small samples
204
substitution, transform C into unit upper triangular through a series of elementary matrix operations. Gauss Elimination is also often used to evaluate matrix determinant. The Elimination comes into play due to the fact mat if the elements in lower triangle of the matrix are all zeroes, the determinant is simply the product of the diagonals. This also applies in the upper triangle zeroes. It means:
IfC =
0 0 0
0
cr c2. 0 0
,orC =
0 0
0 0
% o
(8.36)
then the determinant is (8.37)
So, the elimination works to shape the matrix into one of these forms. Because of its wide availability, the method is not detailed here. 8.4 Numerical examples In the first example, we assume a true pdf f(x), and generate nt random numbers from this distribution. Using these random data we estimate the coefficient vector b based on the analysis above, hi the next two examples, the present method is applied to two practical problems. Example 8.2 A Compound distribution Consider the following mixed distribution of a random variable X (8.38a)
/(*) = !
- ^
+0.2x
(8.38b)
8.4, Numerical examples
205
This is as same as that given in Example 8,1. And the preliminary prediction is plotted in both Figure 8.1 and Figure 8.5. First assume N = 50 and generate ns - 40 random numbers. Following the steps given in section 8.3.2, the optimum Bayesian estimate for the ease is found, a/ is varied from 0.01 to 1000. And the interval is divided into 100 equidistance subintervals. Some of the search results are shown in Table 8.1. And the minimum MEB is found around f^lOO. (The exact value should be ©2=98).
= 50 MEB=19J
Q
•a 1 a,
0.1 0
0
2
4
6
8
10
Figure 8.5 Bayesian estimation based on 40 sample points
200
400
600
800
m2 Figure 8.6 Relationship between MEB and
1000
206
8. 1-D estimation based on small samples Table 8.1 Dependence of MEB on to2 («, = 40)
MEB 0,01
30. 60. 90. 100.
78.75 21.57 20.10 19.82 19.81
MEB 110. 150. 200. 500. 1000.
19.83 20.01 20.33 21.81 22.81
Figure 8.5 shows the estimate based on «»=40. Compared with preliminary prediction, the Bayesian estimate is much improved and is close to the true distribution as shown in the figure. If we notice the noise-like irregularity of the preliminary prediction and the closeness of the Bayesian estimate to the true distribution, the usefulness of the analysis employed here is strongly supported. The search process for the optimum MEB for this case is shown in Figure 8.6. From the figure, we see that after the optimum point, MEB does not change much with ©2. The function relationship between to2 and MEB is quite simple. This is also true in all reported simulations, which are given or not given here. Thus, it is a numerical observation that there exists only one optimum solution for MEB and that MEB is a simple and quite smooth function of OJ2. The same relationship can be quantitatively observed from Table 8.1. A sharp drop in MEB values is observed for oa2 to be varied from 0.01 to 100. After that, MEB exhibits very slow increase for co2 to be varied from 100 to 1000, To investigate the sample influence, three samples of pseudo random numbers are generated from the given distribution. The sample size are 15,20 and 40, respectively. The optimum estimated pdfe for the three samples are plotted in Figure 8.7. As expected, as sample size increases, the estimate becomes better. What is impressive is that the estimates based on 15 and 20 sample points are also quite close to the given distribution. Figure 8.7 shows the estimates for three particular values of co2. The estimated pdfe for t»2 = 0.1, co2 = 98 and
207
8.4. Numerical examples 0.6
A
0.5
1 u
a
1
i
B s == 40
Givetf
0.4 0.3 0.2
N == 50
MEB=65.2
•K.
« 2 =2000 MEB=23.4
1 ' \
©2 = 98
MEB=19J
0.1 0.0 0
2
4
6
8
10
X Figure 8.7 Influences of co2 on the estimation Example 8.3 Distribution of ice loads on propeller blades Ship navigation in polar waters presents a formidable challenge to ships' propulsion systems as large ice pieces impinging on their propeller blades sometimes result in stresses exceeding the yield strength of the blade material. Damage to propellers is costly and can also spell disaster if a ship becomes disabled in a remote area. Ship operators, propulsion system designers, regulatory bodies and classification societies are all concerned with the safe passage of vessels in ice-infested waters. To better define the requirements for good design for ice navigation, extensive measurements of propeller ice loads at full scale were made by several organizations over the past two decades. Huge number of data was collected. Special distributions (Weibull, Normal et al) used to be employed to fit sample distributions. Zong and Lam (2000) applied the present method to estimate impact forces on propeller blades measured on a vessel, the MV Robert LeMeur (Laskow et al, 1986). The impact forces showed quite complicated distribution forms in all reported cases. One figure, which shows the angular location of blade impact force is reproduced in the form of histogram in Figure 8.9. In the figure, the abscissa denotes the start orientation of blade interaction in degree. Zong and Lam (2000) used 100 B-splines to approximate the distribution. The sample size is 181. From these data, the distribution is estimated and the results are plotted in Figure 8.8 with solid line. The optimum &2 and MEB are 24 and 681.7, respectively. The majority of the ice impacts occur at 40 degrees and 240 degrees (Figure 8.8).
S, 1-D estimation based on small samples
208
Because the method treats sample data automatically, it reduces the time for estimation greatly.
0,010
N=100 « s = 181 MEB=-681.7
Histogram
V
0.008
Estimated pdf 0.006
1\
0.004 0.002 0.000
0
40
80
120
160
200 240 280
320 360
Angular location (degree) Figure 8 J Blade impact force along angular location
Example 8.4 Distribution of maximum bow stress in sea state 7for LASH Italia Large impulse loads are experienced by a body during impact with water. This is often designated as slamming. Both fore and bottom parts of a ship are exposed to slamming, as well as the deck between the two hulls of a catamaran or a surface effect ship. Slamming loads can lead to structural damage as well as induce whipping. Current trend to produce innovative, lighter and faster ships, increases the probability of slamming and, in addition, lighter structures are more prone to slamming damage than conventional structures. Both aspects ask for a better understanding and treatment of slamming loads and, in general, slamming loads are random due to the fact that sea waves are random. It is highly desirable in the shipbuilding industry to obtain reliable information on the frequency and magnitude of slamming. Petrie et al (1986) reported the measured data on maximum bow stress resulting tram slamming in green seas based on 42 voyages. Because mere are a lot of data collected, a systematic and automatic method is needed to analyze the distributions of the slamming forces and bow stesses in order to save cost and manpower. The above model was
8.5, Application to discrete random distributions
209
applied to analyze the data. One example is presented here in Figure 8.9. The abscissa denotes the IS minute intervals of maximum bow stress in k psi in sea state 7 for LASH Italia.
0.6
a °-
Histogram
4
N=100
Estimated pdf
I 0.2 0.1 0 2 3 4 Maximum bow stress (kpsi) Figure 8.9 Estimated pdf and histogram of maximum bow stress As before, 100 B-splines were used to approximate the distribution. The sample size is 81. In this case, more B-splins than data are used. The optimum estimated pdf is shown in Figure 8.9 with solid line and the original data are given in histogram, The optimum a2 and MEB are 38 and 82.6, respectively. 8.5 Application to discrete random distributions The technique developed in this chapter is applicable to discrete disnibutions in the case of small samples. To see this, we consider the finite scheme as follows A=
A 4 -
4,
(8.39)
If sample size is large, the maximum likelihood estimate of pt is obtained by maximizing likelihood function (8.40)
210
8. 1-D estimation based on small samples
where q; = — is the frequency observed for event 4 • n, Solving the equation yields the maximum likelihood estimate of pt in the following P,=%-
(8.41)
If sample size is large, smooth estimates are expected to obtain. On the opposite, if sample size is small, large fluctuations are expected in the estimate p t . To remove the irregularities present in p t , we assume that A=*,+w(.
(8.42)
This is exactly equation (8.5). Therefore, all approaches developed thereafter are applicable to determining b,, smooth estimates of %. The details are neglected. 8.6 Concluding remark! 8.6.1 Characterization of the method In this chapter, a method that can directly identify an appropriate pdf for a continuous random variable based on a small sample is presented. Three models are established. One is the likelihood function that pools the information from sample date, one is the smooth prior distribution that defines the smoothness of the unknown parameters, and the last is Bayesian Measured Entropy that helps us to find the most suitable (a2 (and the prior distribution) in an "objective" way. The usefulness of the method is examined using numerical simulations. It has been found that the estimated pdfs under consideration based on the present analysis are stable and yield satisfactory results even for small samples. The method is characterized by assumption-free. We do no assume the specific form of the distribution. This is attractive in applications because 1) the possibility for subjective evaluation of a particular sample distribution is greatly reduced; 2) dependence on experts' opinions is minimized and a person with basic training can use the method without difficulty; and 3) the sample size may be either large or small. It is particularly suitable for the cases where many sets of data demand analysis. After the observed data are input into a computer, all the rest analysis can be automatically done using the current method. The CPU time, as mentioned before, is surprisingly short. In a typical problem, to find the optimum solution, it usually takes several tens of seconds on a Pentium 4 PC, even the simplest optimization method is used.
8.6. Concluding remarks
211
8.6.2 Comparison with the method presented in Chapter 6. The solution strategies are different for the methods presented here and in Chapter 6, To see that, we point out the hidden assumptions in using the methods in Chapter 6: (1) A function composed of more B-spIines has stronger capabilities to describe a complicated sample distribution than a function composed of less B-splines; (2) A function composed of more B-splines is less stable than a function composed of less B-splines; (3) ME analysis finds the optimum point on which capability and stability are balanced based on sample observations. The fundamental assumptions behind the method introduced in this Chapter are (1) A function composed of large number B-splines has enough capability to describe a complicated distribution; (2) Such function must be unstable, vulnerable to statistical fluctuation; and (3) A new smoothing technique is introduced to remove the influences resulting from the statistical fluctuation. We believe that the method introduced in this Chapter is more powerful than that introduced in Chapter 6 based on the fact that the current method applies to large and small samples. But the method introduced in Chapter 6 is the basic input for the current method 8.6.3 Comments on Bayesian approach In Bayesian statistics, all unknown parameters are treated as random variables obeying prior distribution. Prior distribution contains the information that is not available in the sample, and must be specified elsewhere. Thus, it is very important to construct the prior distributions. The frequently used four ways are (see Chapter 4 for details), (1) (2) (3) (4)
Determination of prior distribution by use of historical data; Information free prior distribution; Equally ignorant principle; and Maximum entropy prior distribution
In the cases that historical data are not available, the first method is no longer applicable. The rest three methods share something in common. They by to assume prior distributions in such a manner that they do not contain information,
212
S. 1-D estimation based on small samples
or contain information as less as possible. In the extreme case, the prior distribution is a uniform one spread over a finite interval. In this chapter, another method to construct prior distributions is presented; construct them through reasonable physical restrictions like smoothness. Here the method used to construct the smooth prior distributions is a simple one, assuming that the curve connecting three points bt.\, bt and bM is a straight line. It is reasonable to assume that this curve is a parabolic curve. In doing so, equation (8.11) is replaced by e, = U(bM + V , ) - l « 4 - 3 ( * M + 4-*)
(8-43)
The rest is as same as those previously presented. Some techniques for smoothing are available, and most of them can be easily embedded into the current method. Whatever smooth prior disfribution is used, it should minimize MEB. Minimum MEB plays the role of replacing subjective evaluation by objective evaluation. This is impressive if we recall the subjective character of Bayesian statistics. There seems no objection to the positive use of subjective information as proposed in Bayesian approach. But Bayesian approach is subject to criticism due to the arbitrary use of prior distributions. It is individual-dependent. Utilization of MEB enables us to reduce such arbitrariness or subjectiveness to the minimum extent and put Bayesian analysis on a more objective foundation.
Chapter 9
Estimation of 2-D complicated distribution based on small samples
As mentioned in the previous chapters, the method presented in Chapters 6 and 7 apply to cases of large samples. In Chapter 8, a Bayesian method is developed for 1-D distributions based on small samples. The method is characterized by prediction-correction two-steps procedures. In this chapter, the Bayesian method for small samples is extended to 2-D distributions. The structure of this Chapter is almost completely as same as Chapter 8 for the sake of comparison and easy understanding. This chapter is an extension of Chapter 8 in terms of dimensionality. In Chapter 8 was discussed Bayesian estimation of 1-D distributions while in this chapter to be discussed is Bayesian method for estimating 2-D distributions. This chapter is also an extension of Chapter 7 in terms of method. In Chapter 7 was presented a method for estimating 2-D distributions based on large samples while this chapter to be discussed is a method for estimating 2-D distribution based on small samples. Therefore, the present chapter is an extension of the methods presented in the previous two chapters. The method to be presented here was first proposed by Akaike (Tanabe, 1983). It applies to only observations on equidistance lattice points. The method was later extended to arbitrarily observed data on non-lattice points through introduction of B-splines {Zong et al, 1995). Zong & Lam (2002) made a further generalization by using the method to determining complicated probability distributions, 9.1 Statistical influence of small samples on estimation In Chapter 7, a 2-D random variable is approximated by a linear combination of B-spline functions. In a 2-D rectangular domain is defined a bivariate continuous random vector (Jf, Y) and an orthogonal coordinate QXY. Along the
213
214
9. 2-D estimation based on small samples
two coordinates, M and N B-splines are used to approximate the joint pdf f(x, y) of (X, Y) in the form of
f(x,y) » f{x,y I a) = £fX*,(*)*,O0
(9.1)
where atJ (i=l,2,,.,,M, j=l,2,...,N) are the linear combination coefficients and vector a is a = (fl lls -.. 1 o uv a 2 |,-'-a 2JV ,•»•,%,—.s^f.
(9.2)
Using the following formula, we are able to find the combination coefficients
where
} for order 3 B-splines,
(9.4a)
for order 4 B-splines.
(9.4b)
x
3
3
c s = \ B,(x)$k \ Bj *<-* *>-*
Suppose the true pdf is given in Example 3.9, see Figure 3.9(a). For easy reference, Figure 3.9(a) is reproduced in Figure 9.1. «»=200 random numbers are generated from the true distribution, as shown in Figure 3.9(b) . We use MxN = 40x40 = 1600 B-splines to approximate the pdf under consideration, that is,
f(x,y)« f(x,y\ a) = Jfl^Cx^Cy).
(9.5)
9,1. Statistical infltmnce of small samples based on estimation
215
Slightly different from 1-D cases, in 2-D cares the number of B-splines ( MxN = 40x40 = 1600 ) is much greater than the sample point number («.,=200), up to 8-fold in this case. Recall in Chapter 8, 50 B-splines have been used for 40 sample points. Therefore, 2-D estimations are more challenging than 1-D estimations.
Figure 9.1 The assumed pdf
Figure 9.2 Influence of statistical fluctuations on estimation accuracy With the generated sample, the unknown parameters in the above model are estimated from equation (9.3) and the results as shown in Figure 9.2. The predicted pdf are even more irregular than 1-D cases. As mentioned in Chapter 8, the highly irregularities are due to statistic fluctuations.
216
9, 2-D estimation based on small samples
9.2 Construction of smooth 2-d Bayesian priors 9.2.1 Analysis of statistical fluctuations Due to statistical fluctuations, the predicted parameter a is not consistent with its true value b, and there exists a deviation wlt so that we have ^ y y
i = l,2,.~,M;j = l,2,-,N.
(9.6)
where
where the superscript "7** denotes matrix transverse. The Large Number Theorem asserts that atf are asymptotically normally distributed with mean bv as the sample size is large enough. It is assumed that v/tj is a normal random variable with zero mean and common variance a1. Generally, the smaller the sample is, the larger a1 is. Hence we have P()
L
\ \ \ \\ ,
2cr J
i = l,2,-,M;j = l,2,-,N. l,2,,M;j l,2,,N.
(9.8)
With b is given, the likelihood fiinction for a is then given by
Rewriting it by taking the common fiictor out of the product symbol, we obtain
(9 9b)
-
If I is used to denotes the distance in Euclidian space of the form It It
INI=J^I+&K+•••+&»
*
( 9 - 9c )
9.2. Smooth prior distribution of combination eoeffisnts
217
equation (9Jb) can be rewritten in more compact form
Because a is predicted directly from the sample, P(a|b) contains all the necessary information in sample data. This is the preliminary prediction as we mentioned in the beginning of this chapter. 9.2,2 Smooth prior distribution of combination coefficients To remove the irregularities from the estimations, we require b be smooth. This piece of information is not contained in the sample data at all. It is our perception of how the data should change in space. In 1-D case, the smoothness condition is obtained by equating left- and rightderivatives. In the case of 2-D space, smoothness is somewhat difficult to obtain. The reason is that defining 2-D smoothness requires derivatives in two directions. So there are many ways to address the problem of smoothness. It is not appropriate to discuss 2-D smoothness in detail here. We focus our attention on a special class of 2-D smoothness defined in complex variable theory. The functions extensively studied in complex variable theory are called analytical functions. If an analytical function is continuous, it will be differentiable infinitely many times. This is a nice property and thus an analytical function is very smooth. The real and imaginary parts of an analytical function satisfy Laplace equation, respectively. Laplace equation is also called harmonic equation, so functions solving harmonic equation are called harmonic functions. On the other hand, if two functions solve Laplace equation, and they are orthogonal, then they form an analytical function. Reminded by this property, we impose smoothness condition on b by requiring that it be a harmonic function. Because b is discrete in nature, we require that b satisfy the discrete form of Laplace equation, that is, for those B-splines on the four boundaries, en=bMil+bt_hl-2bn, +
i = 2,3,--,M~l, 2
«u=*ij + . *b-i- V J = 2,3,-,tf-l, «w=V*+JU*-2*W. * = 2»3,-,M~l, *MJ = hM.m + bM,M -2bMj, j =2 , 3 , - , JV-1. For those that are not on the four boundaries, we have
(9.10a) (9.10b) (9.10c) (9.10d)
9. 2-D estimation based on small samples
218 +b
~4bn
< -U
(9.11)
where eff denotes the approximation errors resulting from discretization, or neglect of higgler order terms in the derivatives. Because we require that b be smooth, the mathematical expectation of ej} is surely zero. We further assume that etJ are mutually independent and identical normal random variables with zero mean and same variance T2. With these assumptions, we have
(9.12a)
Denoting eTe = ef2 + e|, + • • • + e ^ , we may rewrite the above equation in the form of
_L
1
(9.12b)
"2/ Introducing the matrix D, I
D,
I (9.13a)
D=
D, where
D,=
1 -2 1 1 -2
1
1
-2
(9.13b)
1 1 -2
1
9.3. Formulation ofBayesian estimation ofcomplicatedpdf —2 1 -4 1
1 -4
219
(9.13c)
1 -2
And I is the unit matrix, D is an (MN-4)xMN matrix, e is an ( M V - 4 ) x l vector and a is an vector, we may rewrite equations (9.9) and (9.12) in the matrix form (9.14)
= Db.
Substituting above equation into equation (9.12), we obtain the prior distribution for the true combination coefficient b i
— 2r
e
"p 1 - ^ J
(9.15) Introducing a new parameter (9.16) we obtain
-
llDbll
(9.17)
The above prior distribution is purely constructed from our demanding that b be smooth, and thus it contains the information unavailable in sample data. On the other hand, F ( a | b ) solely contains the information in sample data. One basic idea behind Bayesian statistics is to make fall use of these two sources of information. 9.3 Formulation of Bayesian estimation of complicated pdf 9 J.I Bayesian point estimate The two models established up to now, that is, the likelihood function
220
9. 2-D estimation based on small samples
P(a | b) pooling infomiation from sample data, and the prior distribution P(b), defining the smoothness of b, can be combined in the Bayes1 theorem introduced in Chapter 8 in the following way F(b|a) = Cx/'(a|b)P(b)
(9,18)
where C is the normalizing constant. Substituting equations (9.9d) and (9.17) into equation (9.18) yields
•vAff-4
-LUI
expj-^lDbir \.
(9.19a)
Further simplification leads to
|__i_[| _
exp |__i_[| aa_bjf +S
J
lDbf ] |
(9.19b)
In Bayesian statistics, the Bayesian point estimation of b can be obtained by maximizing the posterior distribution. Or equivalently, it can be obtained by minimizing Q1 (b), Q2 (b) = fa - b f + e>2 jDbf -> min
(9.20)
Setting -3f~ = G, i = l,2,—,M;j
= 1,2,.-,N
(9.21)
we obtain the solution b = (I + o2JfHTl a,q1=Q1 (b).
(9.22)
The pdf calculated from following equation, which is obtained from Equation (9.1) with a replaced by b is thus an estimate based on Bayesian method
9.4. Houholder transform
f(x,y\ b) = JtZbyBXxJBjiy).
221
(9.23)
Once w2 is given, we can obtain Bayesian point estimation b from equation (9.22). Given different eo2, we may obtain different estimates b . Which estimate is the most suitable remains a problem. In the next section, an entropy analysis is used to find the most suitable a2. 9.3,2 Determination of parameter co2 From equation (9.22), we note that b = a if c/=0, and the Bayesian estimation degenerates to the preliminary prediction. From Equation (9.20), we note that b is a plane if a2 -> «o, Thus, e^ is a very important factor in our analysis. We hope to determine a/ in an "objective" way. Note that the marginal probability P(a | m2) = J>(a | b)P(b)db
(9.24)
describes the averaged behavior of a, and gives the true distribution of the occurrence of a. Because it is independent of parameter vector b, we can obtain the estimation « 2 andCT2by maximizing the marginal probability. Or, Find a/ such that P(a j m1) = j>(a |b)F(b)db -» max.
(9.25)
Rewritten in the following logarithmic form, the marginal probability is MEB{©2) = - 2 log F(a | &1) = - 2 log j>(a j b ) P ( b ) * .
(9.26)
Substituting F(a | b) and F(b) in equation (9.26) and denoting,
x=
,F =
F r F | = determinant of F r F,
, and
(9.27)
222
9. 2-D estimation based on small samples
we have (9.28) Furthermore, using «f|2
g 2 = Qx - Fb|| = |x - F b | + |[F(b - b)| = f2 + F(b -1
(9.29)
we may write MEBfa?) in the form of 2
) = -21ogJP(a|b)P(b)db
lima v2MM
t
2
(9.30)
= -21og
The last integral in the equation above is in fact the multidimensional normal distribution with the following normalizing factor 1
MN
FrF
1/2
(9.31)
where the parallel sign j | denotes the determinant of the matrix inside the parallel signs. Thus we obtain MEB(® 2 } = -(MM - 4) log Q}2 + (MN - 4) logo"2
• + log|FrFJ + constant.
(9.32)
The best model is obtained by minimizing MEB. Differentiating MEB in the equation above with respect to a 2 and setting the derivative to zero, we obtain (9.33)
9.4. Householder transform
223
Substituting 2
2
+ (M¥-4)logf 2 + log|F r F|-» min (9.34)
In summary, the Bayesian estimation is composed of following steps (!) Preliminary prediction from Equation (9.2); (2) Smooth Bayesian point estimation from Equation (9.22) for a given (3) Estimation of MEB(«J 2 ) from Equation (9.32); (4) Repeat steps (2) and (3) for different m2 and choose the m2 which minimizes MEB, It should be pointed out that special numerical treatments are needed to find the determinant of the matrix |FTF| because the size of this matrix is very big. For example, if M=N=40, the size of this matrix is of the order of 3200 x 1600, which is hardly solvable by using simple numerical methods. This matrix is, however, a sparse one. We may Householder reduction method to solve it. 9.4 Householder Transform Recall in Chapter 8 that the determinant F r F is found through Gauss elimination method. It is not, however, feasible here to use the method for finding the determinant F F in equation (9,34) for the size of matrix F F is so large that the computer time becomes unbearable. An alternative method must be used instead. The proper method for mis case is the so-called Householder reduction method, which is suitable for large-scale sparse matrix. Householder method is composed of three steps. First of all, transform a real symmetric matrix A into a tridiagonal matrix C. Then the eigenvalues of matrix C is calculated by use of root-finding method and the corresponding eigenvectors are found. Finally, the determinant is solved using the eigenvalues by use of the following theorem from linear algebra. Theorem 9.1 If Ai,A2,---,An determinant of A is
are eigenvalues of the matrix A, then the
Based on this theorem, we focus on finding the eigenvalues of a matrix using Householder transfer. Details are given in the appendix to this chapter.
224
9, 2-D estimation based on small samples
Figure 9.3 Bayesian estimation based on 200 sample points.
8900 8700 03
8500 8300
20 Figure 9.4 Relationship between MEB and to2 and the search for the optimum point (minimum MEB}
9.5, Numerical examples
225
9.S Numerical examples In the first example, we assume a true pdf f(x,y), and generate ns random points from this distribution. Using these random data we estimate the coefficient vector b based on the above analysis. In the second example, the present method is applied to a practical problem. Example 9,1 Normally correlated 2-dimensional pdf Suppose the true distribution is given by Equation (9.3). It is further assumed that M=JV=4Q (totally 40 x 40 =1600 B-splines are used). Then we generate n, = 200 random points. The shape of f(x,y) is shown in Figure 9,2 and the random points are shown in Figure 3.9 (b). By following the steps given in section 9.3, the optimum Bayesian estimation is found as shown in Figure 9.3 for this case. Compared with preliminary prediction, the Bayesian estimation is much improved and is close to the true distribution as shown in Figure 9.2. If we notice the noise-like irregularity in the preliminary prediction and the closeness of the Bayesian estimation to the true distribution, the usefulness of the analysis employed in this paper is strongly supported. The searching process for the optimum MEB is shown in Figure 9.4. From the figure, we see that after the optimum point, MEB does not change much with m2. The function relationship between eo2 and MEB is quite simple. Thus, it is a rule of thumb (because it is just our observation without mathematical justification) that there exists only one optimum solution for MEB( &t2) and that MEB( m2) is a simple and quite smooth function of m2. To see the influence of sample size on the estimation, three samples of pseudo random points are generated from the given distribution. The sample sizes are 100,200 and 300, respectively. The optimum estimated pdf for the three samples are plotted in Figure 9.5. What is impressive is that the estimations based on 100, 200 and 300 sample points are quite close to each other. Figure 9.6 shows the estimations for three specific a)2 values. The estimated pdf for o 2 = 0.01, m2 = 8 and m2 = 200 are plotted in Figure 9.6(a)-(c). If m2 is very small (say, a>2 =0.01), or the variance t2 of b is very large, the Bayesian estimation is close to the preliminary prediction, and the smoothness information is ignored in the estimation. On the other hand, if©2 is very large (say, m2 =200), or the variance r 2 of b is very small, the estimated pdf tends to be a flat plane, and the sample information is ignored in the estimation. Thus there are two extremes. On one extreme, the smoothness information is ignored, and on the other extreme the sample information is ignored. By aid of Bayesian approach, we successfully combine the two sources of information and obtain greatly unproved estimation.
9. 2-D estimation based on small samples
226
®2=W MEB=8830
(a) Estimation based on 100 sample points ro2 = 8 MEB=8335
I •a 8
9
J
(b) Estimation based on 200 sample points co2 = 9 MEB=8027
(c) Estimation based on 300 sample points Figure 9.5 Estimation based on three different samples
Probability density
Probability density
Probability density
1°
(I O
r 8'
f T3
I
KJ
9. 2-D estimation based an small samples
228
But, it should be mentioned that around the optimum point, MEB varies very slowly. For example, the MEB differences for a?2 =10 and m1 =2 in this example is less than 1%. Example 9.2 Joint distribution ofwave-height and wave-period
H(m)
Figure 9.7 The Bayesian estimation of the joint distribution of wave-height and wave-period (M=N=30). H is wave height and Tis wave period. This problem has been studied in Chapter 8 as an example for large sample. Here we use the method developed in this chapter to solve the problem again. The data of wave height and wave period are taken from the records measured by ships for winters in the Pacific (Zong, 2000). The wave periods range from 0 seconds to 16 seconds and wave heights range from 0 meters to 16 meters. We use 900 B-spline functions to approximate the distribution (Af=30 and i\N30). The optimum a1 =0.01 and MEB=W*. The estimated pdf is shown in Figure 9.7 9.6 Application to discrete random distributions The methodology presented in sections 9,2~9.4 has been applied to logistic model. In this section, we apply it to discrete random variable to show its capability. Consider a bivariate discrete random vector of the form
9.7. Concluding remarks
229
pn (9.35) p M2
r
•••
p *MN.
The fee M-L estimate of PtJ is n, S-
(9.36)
where ntJ is the number of event A^. If sample size is small, large fluctuations are expected in the estimate p,. To remove the irregularities present in pt, we assume that
Again we obtain (9.6). From here, the formulas presented in sections 9.2-9.4 are applicable. 9.7 Concluding remarks We are often faced with the cases where observed samples show complex distributions and it is difficult to approximate the samples with well known simple pdfs. In such situations, we have to estimate the pdf directly fiom samples. Especially influenced by statistical fluctuations, estimation based on small samples becomes more difficult. In this paper, a method mat can directly identify an appropriate pdf for a 2dimensional random vector from a given small sample is presented. Three models are established in this paper. One is the likelihood function, which pools the information in sample data, one is the smooth prior distribution which defines the smoothness of the unknown parameters, and the last is the MEB which helps us to find the most suitable m1 (and the prior distribution) in an "objective" way. The usefulness of the method is examined with numerical simulations. It has been found that the estimated pdfe under consideration based on the present analysis are stable and yield satisfactory results even for small samples.
230
9. 2-D estimation based on small samples
Appendix: Householder transform A.I Tridiagonalization of a real symmetric matrix The special case of matrix that is tridiagonai, that is, has nonzero elements only on the diagonal plus or minus one column, is one that occurs frequently. For tridiagonai sets, the procedures of LU decomposition, forward- and back substitution each take only O(N) operations, and the whole solution can be encoded very concisely. Naturally, one does not reserve storage for the full N x N matrix, but only for the nonzero components, stored as three vectors. The purpose is to find the eigenvalues and eigenvectors of a square matrix A. The optimum strategy for finding eigenvalues and eigenvectors is, first, to reduce the matrix to a simple form, only then beginning an iterative procedure. For symmetric matrices, the preferred simple form is tridiagonai. Instead of trying to reduce the matrix all the way to diagonal form, we are content to stop when the matrix is tridiagonai. This allows the procedure to be carried out in a finite number of steps, unlike the Jacobi method, which requires iteration to convergence. The Householder algorithm reduces an n*n symmetric matrix A to tridiagonai form by n - 2 orthogonal transformations. Each transformation annihilates the required part of a whole column and whole corresponding row. The basic ingredient is a Householder matrix P, which has the form P = I-2wwr
(9.A.1)
where w is a real vector with |w|2 = 1. (In the present notation, the outer or matrix product of two vectors, a and b is written as a b r , while the inner or scalar product of the vectors is written as a r b.) The matrix P is orthogonal, because P2=(l-2wwr)-(l-2wwr) = I - 4 w - w r + 4 w . ( w T - w ) - w T =1
(9.A.2)
Therefore P = P - I . But P r = P, and so P r = P-I, proving orthogonality. Rewrite P as T
P =I~
(9.A.3) XT
where the scalar H is
Appendix „
231
1I |
(9.A.4)
21 '
and u can now be any vector. Suppose x is the vector composed of the first column of A. Choose (9.A.5)
u=x+xe
where ei is the unit vector [1, 0,. . . , 0 ] r , and the choice of signs will be made later. Then
{jx] +1 This shows that the Householder matrix P acts on a given vector x to zero all its elements except the first one. To reduce a symmetric matrix A to tridiagonal form, we choose the vector x for the first Householder matrix to be the lower n — 1 elements of the first column. Then the lower n — 2 elements will be zeroed:
10 0 P, A = 0
0
0
(«-D p *1
a,.,
irrelevant
0
k 0 0 0
(9.A.6)
irrelevant
Here we have written the matrices in partitioned form, with ("~13P denoting a Householder matrix with dimensions (n - 1) x (w - 1). The quantity k is simply plus or minus the magnitude of the vector [a2l»• • •, anl ] r . The complete orthogonal transformation is now
9. 2-D estimation based on small samples
232
k A' = P A P =
(9.A.7)
0 0
irrelevant
0 We have used the fact that P 7 = P. Now choose the vector x for the second Householder matrix to be the bottom « - 2 elements of the second column, and from it construct
1 0 0 0 1 0
0 0
0 0
(9.A.8)
0 0 The identity block in the upper left corner insures that the tridiagonalization achieved in the first step will not be spoiled by this one, while the (n ~ 2)~ dimensional Householder matrix tn~2)P2 creates one additional row and column of the tridiagonal output. Clearly, a sequence of « - 2 such transformations will reduce the matrix A to tridiagonal form. Instead of actually carrying out the matrix multiplications in P • A • P, we compute a vector
Au H
(9.A.11)
Then 1» _ A
U
U
\ _ A
H A' = A P A = A - p u r - u -
(9.A.12a) (9.A.12b)
where the scalar K is defined by
2H
(9.A.13)
Appendix
233
If we write qsp-Xu
(9.A.14)
then we have A' = A-qu r -uq r
(9.A.15)
This is the computationally useful formula, Most routinesforHouseholder reduction actually start in the n-th column of A, not the first as in the explanation above. In detail, the equations are as follows: At stage m(m= 1,2,..., «-2) the vector u has the form Kn.WVi
,-,0].
(9.A.16)
Here i = n-m + l = n,n-l,--',3
(9.A.17)
and the quantityCT(|JC|2 in our earlier notation) is
a Hanf +(aaf+-
+ (aIJ_lf
(9 A18)
We choose the sign of a in (9.A. 18) to be the same as the sign of at,._, to lessen round-off error. Variables are thus computed in the following order: a, H, if, p, K, q, A'. At any stage m, A is tridiagonal in its last m - 1 rows and columns. If the eigenvectors of the final tridiagonal matriK are found (for example, by the routine in the next section), then the eigenvectors of A can be obtained by applying the accumulated transformation Q=P,P2-PB_2
(9.A.19)
to those eigenvectors. We therefore form Q by recursion after all the P's have been determined: Q2 2 Q,= P, • Qi+I,
/= » - 3
1.
(9.A.20)
234
9. 2-D estimation based on small samples
A.2 Finding eigenvalues of a tridiagonal matrix by bisection method Tridiagonalization leads to the following tridiagonal matrix c,
b2
(9.A.21)
K Once our original, real, symmetric matrix has been reduced to tridiagonal form, one possible way to determine its eigenvalues is to find the roots of the characteristic polynomial pn(X) directly. The characteristic polynomial of a tridiagonal matrix can be evaluated for any trial value of X by an efficient recursion relation. Theorem A.1 Suppose b^O
(i=2,3,.,.,n).
For a«y X, the characteristic
polynomials form a Sturmian sequence {pt (^)}" =0 satisfying po(A) =
,
i = 2,-,n
(9.A22)
If a(A) denotes the number for the sign between two neighboring numbers to change, then the number of the eigenvalues of A smaller than A is a{A), The polynomials of lower degree produced during the recurrence form a Sturmian sequence that can be used to localize the eigenvalues to intervals on the real axis. A root-finding method such as bisection or Newton's method can then be employed to refine the intervals. Suppose all eigenvalues of A satisfy Al < A^ <-"
, the interval length of which is (b0
—aQ)/2p.
In detail, suppose that we are to perform bisection at the r-th step and that the middle point of [af_,,&,._,] is dr
Appendix
<*r =}<^~. +*,-.)
235
(9-A.23)
Then we may compute sequence {Pj(dr)}*L0 and detennine the number of sign-change a(dr),
a(dr)
Based on the following criterion
dr,br = br_{
(9.A.24)
Because one of the following equations must hold a{ar)
and
a{br)2:k
^ must be on the interval [aF, br ] . The upper and lower bounds, b0 and a0 ,for \ are determined by
\h = max(4 ±(|6;I +1&,,I)} [ao=mm{4±(|6,|+|*,+Ij)}
•I
i = l 2 ••• n
f9A251
where we let bx = 6W+I = 0 . A J Determing determinant of a matrix by its eigenvalues Once the eigenvalues are found, the determinant is (9.A.26)
This page intentionally left blank
Chapter 10
Estimation of the membership function
People think that mathematics is precise and exact. It seems that worldwide scientists and engineers focus their attentions on finding the most exact numbers in their studies with errors smaller than 0.1 %, 0.01% and even 0.001%. On the other hand, vague values and fuzzy language are more often used. Words like "tall" in "he is tall", "fat" in "she is fat" are all fuzzy words without clear definitions. 190 cm is definitely tall, but 175 may be tall or may not. Being tall cannot be measured by single index like height, and it is also influenced by one's figure, weight, face or even clothes. Mathematics is not unable to handle fuzzy phenomena. There is a mathematics branch, called fuzzy set theory able to describe such fuzzy things. Fuzzy set theory is introduced into mathematics by Zadeh in 1970s {Zimmermann, 1985). Since then, fuzzy set theory has been applied to language studies, control etc. The most important concept in fuzzy set theory is the membership function. Although fuzzy set theory and statistical estimation are two different branches of mathematics, the former can also be studied by use of the latter. So in this chapter, the method introduced in previous chapters is applied to determining membership functions. The membership function is usually determined by the user in applications. It seems that direct determination of the membership function based on sample data was proposed by Fujimoto et al (1994) and Zong et al (1995). 10.1 Introduction In traditional set theory, the boundaries between two sets are crisp, meaning that an element is either in set A or set B, but not in both. In Figure 10.1 (a) is shown three crisp sets intervals A^,Aj,^ on the real line, that is, 4 =[0,30], 4 =(30,70], 4 =(70,100]. 237
10. Estimation of the membership function
238
A3
A2
U< i ^
.
-
30
.
•
.
-
.
•
.
'
.
•
,
•
.
-
.
•
,
•
P
3
70
i
i
Y
YP •v
(a) Crisp sets
1
20/w\ 40
60XX 80
>
(b) Fuzzy sets
Figure 10.1 Crisp sets vs. fuzzy sets If we introduce step functions defined by [1 if
^[0,30]
(10.1a)
if x 6 (30,70] 0 if xe (30,70]'
(10.1b)
1 if x e (70,100] 0 if x* (70,100]'
(10.1c)
then i4,, ^ and ^ may be redefined by
4 ={x:fti(x)*0}t i = l,2,3.
(10.2)
In words, 4 1S a collection of those points which do not make the function ft, vanish. We give a special name membership function to//,, which is characterized by (1)
(10,3a)
(2)
(10.3b)
We do not have any reason to say that membership functions ft, must be step functions. In feet, functions satisfying equations (10.3) can be used as membership functions. For example, we may define ft, by
10.1. Introduction
2|i-B|
20<x<=30
20
30<xi40
20
1 1-
40<xfi50
£-5O 20
239
(10,4a)
xJ
20 0
70<JC
1 1-2
x-20 20 20 0
20
(10.4b) 30<x<40 40 <x
JC<50
(10.4c)
1-; 1 These fimctions are plotted in Figure 10,1 (b). Similar to Equation (10.2), we define three sets, specially written in the form of A,, At and ^ , by
, 1 = 1,2,3.
(10.5)
They are Juzzy sets. For crisp sets, if point Pe Aj, then P £ A] and P €A^, see figure 10.1 (a). But for fuzzy sets, a point can be in two or even more sets
240
10. Estimation of the membership function
simultaneously. For Point P in figure 10.1 (b), both fc # 0 and /<2 # 0 . Thus P e Ai a n d P e A^ from definition (10.5). So fuzzy sets do not have clear boundaries. Detailed presentation of the theory ran be found in Zimmermann (1985) One of prominent applications of fuzzy set theory is in the field of quantitative description of language variables due to the fact that language itself is fuzzy. Take age for instance. "Young", "middle-aged" and "old" are such language variables that their boundaries cannot be clearly defined. A person being 30 years old may either be young or be middle-aged. It is hard to draw a clear line between "young" and "middle-aged", so is "middle-aged" and "old". Therefore, ^ = young , A^ = middle-aged and A^ ~ old are three fuzzy sets. If we want to describe language variables like "young", "middle-aged" and "old" in the framework of crisp set theory, we have to define two critical numbers. For example, 30 years old and 70 years old are defined as the two critical numbers. Below 30 is young and above 30 is middle-aged. Below 70 is middle-aged and above 70 is old. This is in fact what is shown in Figure 10.1 (a) and defined in equation (10.1), Such definition is easy, but a little wired due to the absurdness that a person one day past his thirtieth birthday is middle-aged and another person one day to his thirtieth birthday is young. They are, however, may be only two days different in age. In fuzzy set theory, 30 years old is considered half young and half middle-aged. It may be interpreted at two levels. At the first level, half young and half middle-aged means that he is at the transition stage of life from young to middle-aged. At the second level, half young and half old means that if a survey is made of those 30 years old, half of them may be considered young while half of them are considered middle-aged. In this sense, fuzzy sets with unclear boundaries are in fact more informative and reasonable than crisp sets. This is schematically shown in Figure 10.1 (b) and defined by equation (10.4). Once the rationale for fuzzy sets is justified, the remaining question is how to determine the membership functions. Equation (10.4) and Figure 10.1(b) is in fact a way to define the membership functions for these three fuzzy sets. How to find them? To answer the question is more difficult than to ask. Over the years, several forms of functions have been employed for defining membership functions. Those in Figure 10.1 are two examples. In general, the forms of membership functions may be problem-dependent. Even for the same problem, they are also user-dependent. For the same problem, two users may use quite different forms of membership functions. Take height for instance. In Figure 10.2 are shown two different forms of membership functions for assessing one's height. Three fuzzy sets are defined: "short", "medium" and "tail". One may choose those in figure 10.2 (a) for membership functions and the other may well choose those in Figure 10.2.(b) as membership functions. Using different forms of membership functions will definitely influence the subsequent
241
I O.I. Introduction short
medium
\y
I
short
medium
tall
Heigh (cm)
\y
AA
150 160
tall
170
ISO
Height (cm) I I 150 160
170
180
Figure 10.2 Two different membership functions
M
20 (a) randomly prepared triangles
30
40
50
60
70
80
(b) fuzzy sample
Figure 10.3 Experiment on the perception of triangular area by a student assessments. Therefore, a systematic methodology is needed to build and determine membership functions on an objective base. Membership functions ft (x) may be interpreted as the probability of a point to be in a specific fuzzy set. That is, for a point x, the probability for it to be A, is fit {%). Note that //, (x)+f*2 (x) + fi3 {x) -1, meaning that a point must be in a set. In summary, the membership function lying at the heart of fuzzy set theory serves as two purposes. They define fuzzy sets themselves and they determine the probability of a point belonging to a specific fuzzy set. A membership function is in fact a probability density function detennining the possibility for a point to be in a fuzzy set. Therefore, the methods proposed in the previous chapters are also applicable to the determination of the membership functions.
242
10. Estimation of the membership junction
By aid of measured entropy analysis, the membership function ean be determined in an objective way. 10.2 Fuzzy experiment and fuzzy sample To statistically determine membership functions, we need to design a fuzzy experiment. A fuzzy experiment is slightly different from an ordinary statistical experiment. Fujimoto (1994) designed a fuzzy experiment. Here his experiment procedure is slightly adapted for general purposes. 10.2.1 How large is large? A survey designed to collect the information about people's perception of fuzzy concept like "large" and "small" is to prepare sixty one triangular sheets, the areas of which ranged from 20 cm2 to 80 cm2, to say. The side lengths and interior angles of the triangles were randomly determined so that all the triangles had different shapes, as shown in figure 10.3 (a). Three reference triangles, representing "large" with actual area 70, "medium" with actual area 50 and "small" with actual area 25, are also prepared. In the survey, the three reference triangles are shown to the people under test first. Then these triangular sheets are shown to each person in the survey one by one. For each sheet, he or she is requested to assess it using "small", "medium" or "large". After all the sheets are shown to him, a fuzzy sample shown in Figure 10.3 (b) is obtained. In the figure, the horizontal axis is the real triangle area and the vertical axis is the classification result. The experiment may be repeated to another person, and so on. Finally, a fuzzy sample as shown in Figure 10.3 (b) is obtained, from which membership functions are to be estimated. What is remarkable in the figure is that mere are overlapping regions among each class. That is, the boundaries among each class are fuzzy. So, the data are called fuzzy data. The fuzzy data roughly indicate the correlation between the linguistic expressions and the physical quantity under consideration. The question asked at the very beginning of this section on "how large is large" can now be answered based on the data in Figure 10.3. For this problem, "large" is a set ranging roughly from 60 to SO cm2. It is clear from the figure that triangles with area below 55 cm2 has never been classified as "large". Thus, it is safe to say that triangles with area above 55 cm2 are large while those between 55 and 65 cm2 are in transition stage from "large" to "medium". 10.2.2 Fuzzy data in physical sciences The above example is somewhat arbitrary because the experiment has been performed on humans. In physical sciences, however, such fuzzy data are also
10.2. Fuzzy experiment andjuzzy sample
243
available. Fluid flow in a pipe is a topic which has been of extensive interest in fluid mechanics. A parameter describing the flow state is called Reynolds number Re defined by (10.6) where U is maximum flow velocity in the pipe, d is the diameter of the pipe and v is the kinetic viscosity of the fluid in the pipe. For water, v = 10~".
Laminar Turbulent AAMAAA A,
pooooooo Laminar log Re Turbulent Figure 10.4 Fuzzy data for Laminar and Turbulent flow Reynolds number Re dominates flow state in the pipe. If it is small, the flow is laminar, a state fluid particles smoothly flow in the pipe. If it is large, the flow is turbulent, a state fluid particles irregularly flow in the pipe. It has been experimentally found that transition from laminar flow to turbulent flow is not unique. Over a wide range of Reynolds number, the transition may happen depending pipe wall smoothness. If the pipe wall is very smooth, the transition does not occur until fe=4Q000, while if the pipe wall is rough, the transition may occur around 2100. When many experimental results are plotted in one single figure, we obtain fuzzy data schematically shown in Figure 10.4. Note this figure is schematic rather man physically accurate. Fuzzy set theory also provides us with a new look at some old problems. A rod under compression load P may become unstable suddenly as the load gradually increases. This is a well known buckling problem in mechanics of materials. The critical load at which the rod becomes unstable is theoretically a deterministic value, but in real-world application is random. If many rods of same size and same material are tested, we obtain data showing large scatters
244
W. Estimation of the membership Junction
around the theoretical critical value of the rod. From the viewpoint of mathematics stood away from physical background, fuzzy data in Figures 10.4 and 10.5 are no different in nature.
Unstable AAMAAA A
0O 00 0000 Stable
Figure 10.5 Fuzzy data for Stable and Unstable rod Quite some examples in a variety of engineering fields and science disciplines can be given showing similar patterns as in Figures 10.3~10.5. All these phenomena can be treated using fuzzy set theory. But figure 10.3 and Figures 10.4-10.5 are slightly different in that the latter is physics-based phenomena requiring membership functions to be objectively determined while the former is human-related the membership functions of which are better to be determined objectively. This justifies the need to find membership functions in an objective way. 10.2.3 B-spline Approximation of the membership functions In this section, we will determine the membership function through the fuzzy data obtained in the previous section. Suppose the universe of discourse is X. Let xt(£ = 1,2, •••,*„) be a sample from X. Further, suppose the fuzzy sets are At , A2 , ... , AM and the corresponding membership functions are H\(x),fi2{x),---,iiu{x}. In the triangle experiment, x, is the area of a triangle and ^ , Aj and A^ correspond to "Small", "Medium" and "Large", respectively. As done in the previous Chapters, we again assume that the membership function pf (x) can be expressed in the form of a linear combination of B-spline functions in the universe of discoursed, i.e.,
/ ft 2. Fuzzy experiment and fuzzy sample
245
ft (x) = auBl (x) + «,2Bj (*) + »•+a m B N (x) 21 1
IN
22 2
^07j
N
In concise form we have N
Mi(*) = 2 a « S j W » i = l,--,M
(10.8)
where iV is the number of B-spline functions which consist of the membership functions, a(, are the combination coefficients, and B} (x) is the B-spline functions of chosen order and is of the following form if order 3 B-spline function is chosen JL (x UAX) - (Xf -X.j)
— rf H(x
>
r
— x\ n{X-Xj,
(lU.yj
Now we are to determine the parameters a y . From the process of classification, the membership function ft,(xt) is regarded as same as the probability that a sample point xt is classified into Ixaacy set A,, that is, Pr[xteA,] = iii(xt).
(10.11)
Therefore, we employ the likelihood analysis for the determination of the membership functions. The probability of the classification event for all the sample points xt (£ = 1,2, • • •, ns) is expressed by the following likelihood function.
x
- n The log-likelihood function is
l=\ xte
246
10. Estimation of the membershipjumtion M
M
N
N
M
where J>(x) = £2>,«,(*) = £(£« s )^ W = 1According to the B-spline function properties we have ^iBJ(x) = \.
(10.14)
From the above two equations the following relationship is obtained, fdalj=l,j
= l,-,N.
(10.15)
A membership function is always greater than or equal to zero. To guarantee this we simply impose the restriction that all parameters atJ are greater than or equal to zero, that is, o,^0
i = l,...,M;j = l,...,N.
(10.16)
Usually, we hope /<,(.*) is a decreasing function of * and fiM(x) is an increasing function. It is obvious that these are guaranteed by the following equations: Gf u £:s u + 1 , j = \,...,N-l, am
(10.17a) (10.17b)
Based on the above analyses the best estimates of the unknown parameters atj must satisfy (10.18) subject to
fX=l
j = l,-,N,
(10.19a)
10.3. ME analysis crMaeru+I «M»^«MJ+1
«^0
j = l,...,N-l,
247 (10.19b)
j = l,...,N-l, i = l,...,M;j = l»...,iV.
(10.19c) (10.19d)
This optimization model has a good property. The optimum solution or? is unique (the proof is given in appendix 10.A). That is, if we can find a local maximum point a°j by some method it must be the global optimum solution. The optimization method like the Flexible Tolerance Method introduced in Himmelblau (1972) may be employed to solve the maximization problem (10JH10.9). 10.3 ME analysis In the above optimization problem, the optimization parameters are N and afJ . If N is fixed, this problem can be solved by ordinary nonlinear programming methods. Because N is also an optimization parameter, measured entropy analysis must be used. Consider a fuzzy sample of size ns. n'f denotes me number of unknown parameters for me j-tft fuzzy set. Without equation (10.15), n'f would be equal to N . Equation (10.15) is an interlink among the M fuzzy sets. It is hard to distribute these N equalities among the M fuzzy sets. We may, however, avoid this difficulty using the method in the following. The entropy for i-th fuzzy set is *.
(10.20)
The corresponding asymptotically unbiased estimator of the measured entropy is
(10.21) Then the total measured entropy should be q
E— •
(10.22)
248
10. Estimation of the membership function
From the definition, the second term become
2"B,
2 s,
2
ns
Therefore, the best N solves the minimization problem ME
= - Z * [tt,{x\a)\ogf*,{x\*)dx + 2;(-M~X)N
.
(10.24)
If AIC is used, AIC = -maxfl} + J ] ttf = -max{£}+N(M -1) -» min
(10,25)
where the number of free parameters nf = M x N—N because there are N equality constraints in the above model. In the actual optimization process, the best N which makes ME or AIC minimum is numerically searched by the following procedure. First, a positive integer N is assumed. Then the maximum likelihood analysis is carried out to obtain the best estimates of au, The value of ME or AIC is calculated. Next, changing N gradually, and the corresponding atJ and MEs or AICs are calculated. By the comparison of these ME or AIC values, the best N is found. 10.4 Numerical Examples Example 10.1 Triangular area problem Figure 10.6 (a) shows the fuzzy data obtained from the experiment described in section 10.2. Using these date, we may perform the ME analysis and the likelihood analysis using the method presented before. The B-spline functions of order 3 are used. For the optimization of the likelihood function, the Flexible Tolerance Method (FTM) as outlined in the appendix to this chapter is used. In many nonlinear programming methods a considerable portion of the computational time is spent on satisfying rather rigorous constraint requirements. However, the FTM does not satisfy the constrainte first. But, the constraints are gradually satisfied as the search process proceeds toward the true solution. Tables 10.1 gives the results of the analysis. From the table, the minimum ME value is obtained if eleven B-splines are used, and minimum AIC value is
10.4 Numerical Examples
249
obtained if eight B-splines are used. The estimated membership functions based on minimum ME and minimum AIC are plotted in Figures 10.6 (b) and (e). The results obtained from ME analysis and AJC analysis are quite consistent although the numbers of B-splines in the two analyses are different.
Table 10.1 Results for three fuzzy sets
N 6 7 8 9 10 11 13 12 14 15
-L
ME
AIC
19.7 16.7 14.5 15.5 14.3 13.6 13.1 14.2 13.1 13.2
5.54 4.11 2.95
33.65 32.69 32.50* 35.57 36.26 37.63 41.13 40.22 43.09 45.20
3J0 3.19 2.52* 3.00 3.27 2.65 3.18
20
30 40 50 60 70 80 Area
0 20 30 40 50 60 70 80 20 30 40 50 60 70 80 Area Area Figure 10.6 Estimation of the membership fiinctions for the problem of triangular areas
Example 10.2 Analysis ofthejkzzy data of five classifications In shipbuilding industry, welding is a very important process, taking more than 40% workloads. Welding quality is directly related to the life of a ship.
10. Estimation of the membership function
250
Thus, welding quality must be controlled within allowable errors. One factor controlling welding quality is called misalignment S as schematically shown in Figure 10.7. The figure shows that two vertical plates are welded to a horizontal plate. The two vertical plates are required to be in one line. This is difficult because a welding worker cannot see the upper plate as he is welding the lower plate to the horizontal plate, and he cannot see the lower plate as he is welding upper plate to the horizontal plate. So after the welding work, a surveyor must do quality examination. The examination results are classified into five classes: "Very Good" (VG), "Good" (G), "Medium" (M), "Bad" (B) and "Very Bad" (VB). Because of several factors involved, the quality assessment is not a simple correlation, but a complicated link as shown in Figure 10.7 (a), in which is given one hundred sample points to show the correlation between misalignment S and quality assessments. VG G j
1—
(
— \ <5: misalignment M
-»j k - $ ( t: thickness
B VB
1 .
0.5
ft
(b)ME
AAA/
AM 0.5
1
1.5
0
0.5
Figure 10.7 Five classifications of misalignment Using fuzzy data in Figure 10.7 (a), the membership functions are estimated. Order 3 B-spIine fiinetians have been in the analysis. Table 10.2 gives the results of the analysis.
251
10.4. Numerical examples Table 10.2 Results for five fuzzy sets
N 4 5 6 7 8 9
-L
Me
AIC
N
-L
Me
AIC
82.4 1.73 98.4 10 1.49 56.3 98.3 71.7 1.68 11 1.54 93.7 55.5 101.5 67.0 1.48 12 1.64 91,5 55.0 105.0 62.0 1.57 91.9 55.1 1.71 109.1 13 57.0 14 51.7 1.73 1.45 109.7 92.0* 56.1 1.49 94.0 53.4 1.78 115.4 15 The minimum ME value is obtained at N=i and the minimum AIC is given as N=6, Figures 10.7 (b) and (c) compare the membership functions for the two cases. Example 10.3 Sample Size Influences In this example, the sample size influences of fuzzy data on the forms of membership functions are discussed. Suppose membership functions of fee forms shown in Figure 10.8 (a) are given, representing three fuzzy sets A^, A^ and A^ . Then 100, 200 and 300 pairs of uniform random numbers (Me,v,)are generated in the range O S M , ^10 and Ofiv, S i {t = ls2,---,n1). These pairs of random numbers are plotted in Figure 10.8 (b). Acceptation and Rejection Method introduced in Chapter 3 is employed to generate fuzzy data by the following procedure If vt < ft(uf) and vt > pi}{ut), then ut e 4; If vt <, fij{ut) <Mui)> ^m ut 6 4} • If vt > ft,(ut), then «f is rejected;
0
2
4
6
8
10
0
2
4
6
8
Figure 10.8 Given membership functions (a) and random numbers (b)
10
252
10. Estimation of the membership Junction
«, assumes three values, 100, 200 and 300. The three corresponding fuzzy samples are given in the left figures in Figure 10.9. Using the method presented in this Chapter, the membership functions corresponding to the three generated samples can be estimated. The results for ME values are given in Table 10.3 and plotted in Figure 10.9. In the Table, only ME results are given and AIC values are neglected.
Table 10.3 Sample size influences (a):Sample size: 100
(b) Sample size:200
N 5 6 7 8 9 10 11 12 13 14 15
(c): Sample size:300
-L
ME
-L
ME
-L
ME
39.5 34.5 35.1 33.8 32.0 32.0 32.0 32.0 31.5 31.4 31.6
5.70 4.07 4.50 4.14 4.24 3.76 3.86
73,4 69.0 69,2 68.7 68.1 68.3 67.5 67.6 66.9 66.9 67.5
5.63 3.76 4.03 3.84 3.62 3.72 3.67 3.63 3.65 3.64 3.63
123.1 107.9 109.9 107.3 107.2 127.4 106.4 106.7 105.9 105.4 106.5
5.60 3.77 4.12 3.88 3.S4 3.65 3.58 3.61 3.69 3.68 3.69
3J2 3.83 3.72 3.90
For n, = 100, minimizing ME yields iV=14. The profiles of the membership functions as shown in Figure 10.9 (a) are complicated because too many Bsplines are used. If sample size is over-small, statistical fluctuations have significant influence on the shapes of the membership functions. For «, = 200 and na =300 , minimizing ME yields same results at JV=9. The estimated membership functions are shown in Figures 9 (b) and 9(c), which look much better than Figure 10.9 (a). Figures 10.9 (b) and (c) do not exhibit significant differences, showing the convergence as sample size increases. Both are close to the given membership functions as shown in Figure 10.8.
10.5, Concluding remarks
253
(a)n, =100 L M S 8
0
2
4
6
8
10
10
0
0
2
4
6
8
10
8
10
8
10
Figure 10.9 Given membership functions (right) and fuzzy data (left) 10.5 Concluding Remarks In this chapter, a probabilistic model that can determine the forms of the membership functions based on experimental data is presented. In the method, the membership functions are approximated by a linear combination of B-spline functions. The best number of B-splines which compose the membership functions under consideration is determined by minimizing ME. And the best combination coefficients are determined probabilistically based on the likelihood analysis.
254
10. Estimation of the membership function
The features of the membership functions presented in this paper can be characterized as follows: (1) The membership functions can be automatically determined from the fuzzy data by the proposed method. No prior knowledge of the form of the membership functions is necessary in the estimation. (2) The method works well irrelevant to the number of classifications. Also, numerical calculations are easily performed because the optimum solution is unique. (3) If the sample size of fuzzy data becomes larger, the estimated membership function becomes more accurate.
Appendix
255
Appendix: Proof of uniqueness of the optimum solution Consider the following nonlinear programming problem.
subject to the following constraints, gi(xy,x2,-,xK)Z0,
i = l,2,-,m,
XjZQ,j = l,2,--;n.
(10.A.2) (10.A.3)
The Lagrangian is m
L = f(xy,x2,---,xn) + J^uigi(xl,xt,---,xn).
(10.A.4)
1=1
If x° is a local optimum solution in the above problem, then the sufficient condition for xa to be the global maximum point is:
±8Li^U°)(xi-x!).
(10.A.5)
where w° satisfies — SO for all K, >0; u^—) = 0, i = l,—,m.
(10.A.6)
The Lagrangian for our problem is
M«) = Z E log^(^)+E^ x(£a # -l)+ & ff «,fl(«), in which g s ' s are constraints given by the following equations.
(10.A.7)
256
10. Estimation of the membership fimction
>0
(10 A8)
According to Taylor expansion of multivanate function we have
M
M
<10A 9)
-
where 0 < 0 < 1 Because,
in which
(
1 if(s = j,i = l)or(s = N - 1 ST(* = y +1, i - l)or (* *= 0 otherwise
So
Therefore, the third term on right hand side of equation (10.A.9) becomes
Appendix
i
M
N
257
If
—y "yy f £-i JLi
£-i
-« />! j=i 1=1
Mi
1=1 1,64 Mi
j=
From equations (10.A.9) and (10.A.12), the following relationship is obtained, M ft
££f^
SI
(10.A.13)
The above equation indicates that if a0 is a local optimum solution it must be the global solution.
This page intentionally left blank
Chapter 11
Estimation of distributions by use of the maximum entropy method
Maximum Entropy Method (MEM) cannot be ignored when information theory is applied to finding the pdf of a random variable. Although MEM was explicitly formulated first in 1957 by Jaynes (1957a,b), it was implicitly used by Einstein in the early 20th century for solving problems in quantum statistical mechanics (Eisberg & Resnick, 1985). However, both Shannon (Shannon & Weaver, 1949; Khinehin, 1957) and Kullback (1957) did not touch the topic of MEM in their original works. The central idea in MEM is the Maximum Entropy (Maxent) Principle. It was proposed for solving problems short of information. When formulating the Maxent Principle, Jaynes (1957) believed that in the cases where only partial information is available about the problem under consideration, we should use fee probability maximizing entropy subject to constraints. All other probabilities imply that unable-to-prove assumptions or constraints are introduced into the inference such that the inference is biased. MEM is a fascinating tool. Formulated properly, all special distributions (the normal, exponential, Cauchy etc) can be determined by use of MEM. In other words, mese special distributions solve the governing equations of MEM. Expressed in Jaynes language, all known special distributions represent an unbiased probability distribution when some of information is not available. Besides academic research on MEM, MEM has been applied to a variety of fields for solving problems present in communication (Usher, 1984), economics (Golan et al, 1996; Fraser, 2000; Shen & Perloff, 2001), agriculture (Preekel, 2001) and imaging processing (Baribaud,1990). In this Chapter, application of MEM for estimating distributions based on samples is introduced.
259
260
1 L Estimation by use of the MEM
11,1 Maximum entropy The detective-crack-case story features the MEM. Initially, only very few information is available. And as many suspects as possible should be investigated without favoring some suspects. As investigation proceeds, some suspects are eliminated from the investigation, but others receive more extensive and intensive investigations. As more and more evidences are collected, the true murder is found. At each stage of the investigation, the principle behind the investigation is to involve all suspects for investigation. No suspect is eliminated from the investigation if without strong evidence to support to do so. Mathematically speaking, the probability for each suspect to commit the crime is p,,, i = I, • • •, M, Initially, all suspects are equally suspected, and thus Pi=l/M.
(11.1)
As more information is collected, some suspects are eliminated from the investigation due to alibis. If Mi suspects are excluded, the probability for each of the remaining suspects is />,=1/(M-M,).
(11.2)
Finally, as Mi = M - 1 , only one suspect is identified. Suppose there are ten suspects initially. Then the detecting process can be written in the following form Ai
A2
A3
A4
Aj
AB
AI
At
A$
A\a
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
0 0 0
0.2 0 0,2 0 0.2 0.2 0.2 0 0 0 0 0.5 0 0 0 0.5 0 0 0 0 0 0 0 0 1 0 0
(11.3)
Two observations from the process are of direct relevance to MEM. First of all, at each stage, the probability is always assigned to each suspect with equal probability without ftvoring one of the suspects. Why to do so? Without flather alibis, it is dangerous to eliminate a suspect from the investigation too early. This is in fact the rational behind the Maxent Principle. Because the uncertainty is expressed by entropy, the maxent principle states ~Y.Pt log/% -» max.
(11.4)
/ /. /. Maximum entropy
261
The second observation is that as more and more information is collected, the probability to find the true criminal becomes larger and larger. Initially, all suspects are of equal probability to commit the crime. This piece of information, expressed as a constraint, is 5>=I.
(11.5)
Equations (11.4) and (11.5) outline the essence of MEM. In words, if we are seeking a probability density function subject to certain constraints (e.g., a given mean or variance), use the density satisfying those constraints which makes entropy as large as possible, Jaynes (1957a,b) formulated the principle of maximum entropy, as a method of statistical inference. His idea is that this principle leads to the selection of a probability density function which is consistent with our knowledge and introduces no unwarranted information. Any probability density function satisfying the constraints which has smaller entropy will contain more information (less uncertainty), and thus says something stronger than what we are assuming. The probability density function with maximum entropy, satisfying whatever constraints we impose, is the one which should be least surprising in terms of the predictions it makes. It is important to clear up an easy misconception; the principle of maximum entropy does not give us something for nothing. For example, a coin is not fenjust because we don't know anything about it. In fact, to the contrary, the principle of maximum entropy guides us to the best probability distribution which reflects our current knowledge and it tells us what to do if experimental data does not agree with predictions coming from our chosen distribution: understand why the phenomenon being studied behaves in an unexpected way (find a previously unseen constraint) and maximize entropy over the distributions which satisfy all the constraints we are now aware of, including the new one. A proper appreciation of the principle of maximum entropy goes hand in hand with a certain attitude about the interpretation of probability distributions. A probability distribution can be viewed as: (1) a predictor of frequencies of outcomes over repeated trials, or (2) a numerical measure of plausibility that some individual situation develops in certain ways. Sometimes the first (frequency) viewpoint is meaningless, and only the second (subjective) interpretation of probability makes sense. For instance, we can ask about the probability that civilization will be wiped out by an asteroid in the next 10,000 years, or the probability that the Red Sox will win the World Series again. We illustrate the principle of maximum entropy in the following three theorems (Conrad, 2005).
262
/ /, Estimation by use of the MEM
Theorem 11.1 For a probability density function pi on a finite set {xu---,Xn}, H(pl,p2l-,pn)£logn
= H(-,-,-,-) « «
(11.6) n
withequalityifandonfy if pi is uniform, i.e., p(xf) = l/nforalli. Proof: see equation (5.13) in Chapter 5. Concretely, if pt,p%,,.., pnare nonnegative numbers with ^pi = 1, then Theorem 11.1 says -^pt
logpt < logw, with equality if and only if every p, is
l/n. Theorem 11.2 For a continuous probability density function fix) with variance
(11.7) with equality if and only iff(x) is Gaussian with variance a2, Le., for some fi we have 0140
V23-CT2
Note that the right hand side of equation (11.8) is the entropy of a Gaussian. This describes a conceptual role for Gaussians which is simpler than the Central Limit Theorem. Proof. Let /(x) be a probability density function with variance a 1 . Let ft be its mean. (The mean exists by definition of variance). Letting g(x) be the Gaussian with mean p and variance a2
U Splitting up the integral into two integrals, the first is — log(2ffitr2)
(11.9) since
11,1. Maximum entropy
263
Jf(%)dx =1", and the second is 1/2 since f/"(x)(x—/ij J& = O"2 by definition. Thus the total integral is — [l + log(2fffxa)], which is the entropy of g(x). Based on equation (5.45), we conclude that (*)Iogs(x)dr (11.10)
Theorem 11.3 For any continuous probability density Junction p on (0,1) with mean A ) Sl +logl
(11.11)
with equality if and only iff is exponential with mean, i.e.,
Proof Let^x) be a probability density function on ( 0 , » ) with mean A Letting
-'"1,-)f(x)lagg(x)dx = ["/(*)Jogfloga+yU .Since/has mean A, this integral is log A + 1, which is the entropy of g. Theorem 11.3 suggests that for an experiment with positive outcomes whose mean value is known, the most conservative probabilistic model consistent with that mean value is an exponential distribution. In each of Theorems 11.1, 11.2, and 11.3, entropy is maximized over distributions on a fixed domain satisfying certain constraints. The following Table summarizes these extra constraints, which in each case amounts to fixing the value of some integral. Distribution Uniform Normal with mean ju Exponential
Domain Finite (-so, oo) (0.OD)
Fixed Value None
l(x-/i)'ftx)dx
[xfixyt
How does one discover these extra constraints? They come from asking, for a given distribution g(x) (which we aim to characterize via maximum entropy),
264
II, Estimation by use of the MEM
what extra information about distributions jfljx) on the same domain is needed. For instance, in the setting ofTheorem 11.3, we want to realize an exponential distribution g(x) = (l/l)e" I '*on(0,®) as a maximum entropy distribution. For any distribution^) on (0,oo),
-]f(x)logg(x)dx =
j
= (\OgA)[f{x)dx+\[xf(x)dx
(11.12)
j [xf(x)dx To complete this calculation, we need to know the mean ofjfx). This is why, in Theorem 11.3, the exponential distribution is the one on (0, w) with maximum entropy having a given mean. The reader should consider Theorems 11.1 and 11.2, as well as later characterizations of distributions in terms of maximum entropy, in this light We turn to «-dimensional distributions, generalizing Theorem 11.2. Entropy is defined in terms of integrals over Rn. Theorem 11.4. For a continuous pmbability density function f on R" with fixed covariances (Ty (11.13) where £ = {0^ ) is the covariance matrix for fix). There is equality if and only if j(x) is an n-dimensional Gaussian density with covariances try, We recall the definition of the covariances o# . For an w-dimensional probability density function f(x), its means are /* = [ Xif(x)dx
and its
covariances are dx.
(11.14)
In particular, ers a 0 , When n = 1, a~v = u2 in the usual notation. The symmetric
11.2. Formulation of the maximum entropy method
265
matrix £ = (cTj,) is positive-definite, since the matrix ({vi.Vj}) is positivedefinite for any finite set of linearly independent vt in a real inner product space Proof. The Gaussian densities onR* are those probability density functions of the form. G(x)=
l
==g -<'"X«-rt»W)
{1U5)
where £ = (
-j«+log[(2ir) B dets]}
(11.16)
by a calculation left to the reader. (Hint: it helps in the calculation to write £ as the square of a symmetric matrix.) Now assume/ is any w-dimensional probability density function with means and covariances. Define pt to be its i-th mean and define <% to be its covarianee matrix. Let g be the «-dimensional Gaussian with the means and covariances off. The theorem now follows from Theorem 11.3 and the equation
}
(11.17)
whose verification boils down to checking that (11.18) which is easy to verify. (Hint: Diagonalize the quadratic form corresponding to £.) D 11.2 Formulation of the maximum entropy method We now turn to more precise description of the method by considering a continuous random variable.
266
//. Estimation by use of the MEM
Consider a continuous random variable X, the pdf of which is f(x) .Some properties about the distribution are known. These properties are supposed to be expressed in the form of • = pi,i = 0,l,2,--,N.
(11.19)
If 4a{x) = 1, then equation (11.63) states that the area of the pdf is a constant, usually one. If $ (x) = x, then equation (11.63) states that the first-order moment of the random variable is known. Different choice of $(x) can yield a large class of distributions. In case of no confusion, p , are still called moments. The MEM is formulated in the following form Given (*)/(*)& = A.
i = 0,l,2,-,N.
(11.20)
Find pdf f{x) such that - J/(JC) log f(x)dx -* max.
(11.21)
These equations can be formally solved by introducing Lagrangian defined by
A = - J/Wlog/(x)&+1> [ J/M#(x)<&- A ] •
(11-22)
If variational principle is used for/(a:), that is, / ( x ) is assigned a variation Sf independent of x, we have
-T"
•
(11.23)
-1-log/p/* If equation (11.21) is satisfied, the variational in equation (11.23) must be zero, that is, Sh. = 0. Because this holds true for any Sf - 0 , the integrand in equation (11.23) must also be zero. So
11.2. Formulation of the maximum entropy method
267
M
(11.24) (=0
Solving this equation results in (11.25) where & are obtained by solving equation (11.20) U(x)f(x)dx = p,,
i = Q,l,2,-,N.
(11.26)
Equations (11.25) and (11.26) are the most important equations in MEM. Note that there are iv" +1 unknowns ^ . Let us see some simple cases. Suppose N +1 = 1 .that is, there is only one unknown in equation (11.26), $j = 1 , and pa = 1 , we conclude immediately from equation (11.25) that f(x) is a uniform distribution. If rephrased, this conclusion is to say that uniform distribution solves maxent equations (11.21) given that the area of pdf is a constant on the interval of definition of random variable X. This is the conclusion of Theorem 11.1 generalized to continuous random variables If iv* +1 = 2 , jfc = 1, $ =x and xe(0,oo), then the function on the right hand side of equation (11.25) is an exponential function, meaning feat the exponential distribution solves the maxent equation (11.21). In other words, the exponential distribution is the maxent distribution given £(JQ. In section 11.2, theoretical tool has been employed to obtain these conclusions. In this section, we are led to the same conclusions by simple mathematical equations. If JV + 1 = 3 , $ ) = 1 , &=x, $l=x1 and xe(-®,oo), then exponent of the function on the right hand side of equation (11.25) is a polynomial of degree 2. So the function on the right hand side of equation (11.25) can be expressed as a function of the form exp| -1 + J_^ x
L
2
=
'•=«t J 1=0
I
(11.27)
268
/ /. Estimation by use of the MEM
where c represents all constant terms. This is the normal distribution, and thus the normal distribution is the maxent distribution solving equation (11.21) if the mean£(X) and the variance E(X3) are given. We may obtain the conclusions obtained in section 11.2 again by using the method presented above, All these distributions, however, share one common thing, that is, they are solved if particular forms of $(x) and //( are given. If the particular forms of $(x) are not known, how to find the maxent distribution / ( x ) ? Another common question is that the number M is small, usually two or three. If it is large, it becomes very difficult to solve equation (11.26). It is thus seen that the beautiful theory presented in section 11.1 cannot help us to find most distributions encountered in real-world applications. The key points to answering the question are (1) how to represent $(x) and (2) how to find fk. In the following sections, these two points will be addressed so that the maxent disttibution for a general question is found. 11.3 B-spline representation of fZ*,(x) Our purpose is to construct a general method to solve MEM equations (11.25) and (11.26). We may assume that $(x) = x', i = %%•••,M-I. This is neither efficient nor good, however, as we discussed in the previous chapters. A better alternative is again to use B-spline functions to approximate $ (x), that is, 1 Bi(x)
i =0 i =
-
(11.28)
!,•••,M
Here i - 0 is a special case requiring that the area of a pdf be always one. If we deliberately define Ba(x) = 1, equation (11.28) becomes #(x) = JU*y = 0 , l , - , M .
(11.29)
With such choice of $(x), we may further define &»
f»
(11.30)
If a sample is drawn from a population, then pi can be statistically determined by the following statistic
11.3. B-spline representation of $ (x)
269 {11.31}
Because the large number theorem ensures that as sample size is large, we have
—f>(*a
# = 1,—,JV.
(11.32)
Returning to equation (11.26), we transform MEM to a problem to solve the following N +1 equations with A as unknowns = /*,
i = 0,l,-,N.
(11.33)
This is a set of nonlinear equations. If the solution to equation (11.33) exists, we are able to prove that the solution to the set of solutions is unique. To prove this, suppose At and AJ are two sets of solutions to equation (11.33), that is, =p,,
(11.34a)
=A •
(11.34b)
Subtracting equation (11.34a) from equation (11 J4b) yields = 0.
(11.35)
From calculus we conclude that the term in the square bracket must be zero for the above equation is always zero for any At and %. In other words,
expf-1 + f)-Wx)1 = expf-1 + ftMB,{x)\.
(11.36)
This equation requires that the exponents on both sides must be equal, that is,
270
/ /, Estimation by use of the MEM
f>fl(x) = £ # « ( * ) . )=V
(11.37a)
1=0
We are thus led to the equation ) = 0.
(11.37b)
from which we conclude that At = A,' because equation (11.8 lb) is true for any x. By carefully choosing x along the real axis, we are able to obtain = 0 , j = 0,1,2,-,N.
(11.38)
So that the determinant of the matrix B${xj) is not zero. This is possible if xj is chosen in such a way that it maximizes BI(XJ) . Equation (11.38) is a set of linear equations, which has only zero solutions J» - A! = 0 if the determinant of the coefficient matrix B,{xj) is not zero. 11.4 Optimization solvers Uniqueness of the solution to equation (11.33) is a nice property. It is not an easy job to find the roots of equation (11.33) because of nonlinearity. The difficulty can be easily sunnounted by rewriting equation (11.33) in the form of optimization „
^
-,1
A-A 1=0
/
->min.
(11.39)
->min
(11.40a)
J
Or equivalently
B/(x)exp - l + ^^jB ( (x) ufc-/?f
subject to
11.5. Asymptotically unbiased estimate of At f
n
\
fexp -l + yU5,(x) \dx = \. J
\
t*
271
(11.40b)
)
Equations (11.25) and (11.26), (11.40) are all equivalent We are already familiar with optimization problems that appeared in Chapters 6-10. Three optimization methods have been used. They are iterative formulas used in Chapters 6 and 7, and Flexible Tolerance Method (FTM) used in Chapter 10. It seems hard to obtain a nice iterative formula as that obtained in Chapter 6. So we have to resort to FTM and GA. When solving equations (11.39) or (11.40), the search process is stopped if L < e or solution step number > Ne. Here £ is a prefixed small number, say 10"3. Ne is a prescribed large number, say 500. 11.5 Asymptotically unbiased estimate of A, We prove that solving At by equations (11.39) or (11.40) yields asymptotically unbiased estimate of ^ , To see this, note that in equation (11.31) p, are asymptotically normal random variables with zero mean based on the large number law. Denote the true value of Aj by A? and expand the left hand side of equation (11.26) with respect to the true value ^»°. Moreover, denoting pdf by / ( x | A), we obtain
(11.41)
The estimate At are dependent on the sample, thus being function of sample. Taking average on both sides of equation (11.26) about sample X yields
As mentioned above, & are asymptotically normal variables with mean pf. Thus, as sample size is large, Expi = pf, where superscript "0" represents true value of pi. Therefore, the second terms on the left hand side of equation (11.42) must be zero because the first term on the left hand side and the term on the right hand side of equation (11.42) are equal. In other words,
272
11. Estimation by use of the MEM
ExAj=Af.
(11.43)
Therefore, equation (11,39) or equations (11.40) yield asymptotically unbiased estimate of <4 • 11.6 Model selection MEM in fact takes the place of Maximum Likelihood Method, estimating the unknown parameters through sample observations. The estimation accuracy depends on several factors, one of which is the number of B-spline functions used in the estimation process. If the number of B-splines used in the estimation is over-small or over-large, accuracy will be lost. Again, accuracy and statistical errors must keep balance at an acceptable level. Because Maxent estimators are not like maximum likelihood estimators, the properties of the latter being well studied. The lack of M l knowledge of maxent estimators in the current case shows that we cannot directly apply the nice criteria for model selection like ME and AIC. To work out a criterion for model selection like ME based on maxent estimators remains a task under research. Here, however, we use an approximate approach to handle the problem. Consider Maxent Principle (11.21). If a large sample is taken from the population, the entropy is approximately equal to (11.44) Maximizing the left hand side term is equal to maximizing the right hand side. Therefore, Maxent Method is asymptotically as same as Maximum Likelihood Method Based on such APPROXIAMTION, we assume that ME remains valid here for maxent estimators. Then the best model should minimize ME in the way of ^ 2 n,
(11.45)
where J, is the maxent estimate of A and \N+1, N,
equation (11.39) is used equation (11.40) is used
.
(11.46)
We want to emphasize again that equation (11.45) is approximate in the sense
/ L 7. Numerical Examples
273
that MEM is asymptotically equal to M-L method, but maxent estimators are not necessarily equal to M-L estimators. Theoretically speaking, estimators maximizing entropy does not necessarily maximize likelihood functions. The drive for better estimator than ME is always needed in the future. It should be pointed out that AIC is no longer valid here because likelihood function does not exist here, 11.7 Numerical Examples
Example 11.1 Direct estimation of a normal distribution Given a normal distribution (11.47) Suppose seven B-splines are used to approximate it. The interval is [-1,1]. By defining
we obtain after numerical integration of the above equation. Their values are given in the following
=2,13x10-* =0.23, = 0.49, = 0.23, p, =6.08x10"*. One more condition requiring the area under the pdf be one,po = I, should be added in real -world applications. The Flexible Tolerance Method (FTM) is employed here to solve the problem. In the computation, the error tolerance for stopping computation is set at 10"4. That is, the optimization target is set at 10"4, or iteration number is smaller than 500, Results are shown in Figure 11.1.
274
11. Estimation by use of the MEM
-1
-0.5
0 X
0.5
1
Figure 11.1 Maxent estimate of the normal distribution In the figure, the given distribution is also plotted. Comparing the estimated and given distributions reveals that except at the right end, the agreement between the two curves is quite satisfactory.
Example 11.2 Estimation of the normal distribution based on sample In the above example, values for pi are obtained from theoretical calculations based on equation (11.48). In reality, these values are not known, and must statistically estimated from samples. Therefore, to be more practical, a sample of size n, is generated from the given distribution (11.47). If n, = 100, the estimated moments based on equation (11.31) are given in Table 11.1
Table 11.1 Theoretical and estimated values of moments ( / % = ! )
Estimated («»=50)
Estimated («»=150)
6.08 xlO"4
Estimated («s=100) 0
0
0
Pt
2.13x10"* 0.23
1.79X10"1 0.22
2.06xl0" a 0.23
UOxlO" 2 0.23
A
0.49
0.5
0.48
0.5
Ps
0.23
Items
Theoretical
Pi Pi
0.24 2
Pi
2.13xlQ"
Pi
6.0Sxl0" 4
0.26 2
2.36 xlO"
2.18xlO"2
0.23 2
I.77X10" 0
2.44xlO"z 1.45K10"4
275
11.7. Numerical Examples
Based on the above data, the distribution is estimated by minimizing equation (11,40). The estimated pdfs are shown in Figure 11.2 for n, = 100 .In the figure is also shown the given distribution for comparison. Except at the right end of the horizontal axis, the agreement between the two curves is in generally satisfactory.
2
1 Prob ili
3?
given 1.5
y*%.
estimated
£
\
1
M
a
0.5 0
-1
-0.5
0 X
n, = 100
V": 0.5
Figure 11.2 Maxent estimate of normal distribution based on sample using #=7 B-splines
To study the influence of sample size on estimation accuracy, two more computations have been made. These two samples have size 50 and 150, respectively. The estimated moment based on equations (11.31) are given in Table 11.1, too. FTM is used for these two cases. As N=7 B-splines are used, the estimated results are plotted in Figure 11,3 (a) and (b). As sample size is 50, the estimated pdf has a significant skewness towards the right side. Return back to Table 11.1 and look at p% and p$. As sample size is 50, these two moments differ about 0.03. These two moments determine the curve departing from being symmetrical about the central line to being more dense on the right side of the central line. Therefore, the accuracy of the statistical estimates of the moments has an important impact on the final estimation. This is clearer if we study the case for sample size to be equal to 150. The estimated moments for this case are listed in the fourth column in Table 11.1, p% and ps in this case are identical, and the curve has good symmetry about the central line around the centre, as shown in Figure 11.3 (b). Surprisingly, the right end error which appeared in Figures 11.1 -~11.3 disappears in this case, replaced
11. Estimation by use of the MEM
276
by a flat line. In general, increasing sample size does ensure the convergence of the estimation.
given „
1.5
1 •§
estimated
(a)«,=50
/""V
0.5
-1
0.5
-0.5
(b)«t = 150
given
1.5
estimated
/
Probabil
1 0.5
•
J
0 -1
-0.5
\ 0
0.5
1
Figure 11.3 Sample size influence on estimation accuracy
As we have seen in the previous chapters, the number of B-spHnes used for approximating pdf has important influence on estimation accuracy. Two cases are considered here, N=3 and i¥=ll, corresponding to 3 and 11 B-splines, respectively. The estimated pdfs for these two cases are shown in Figure 11.4. In Figure 11.4(a) is shown the estimated pdf for N=3 using a sample of size 100. The results are surprisingly good. The estimation for N=l 1 is, however, not good enough for practical use because it is oversensitive to local statistical fluctuations being a wavy curve. As in Chapter 6, we thus need to select models from the candidates using different number of B-splines. In this example, N changes from 3 to 8, in total 6 cases considered. For each,
11,7. Numerical Examples
27?
both / / = - J / { x | ^ ) l o g / ( x | A}t&and ME are calculated, as given in Table 11.2 Table 11.2 ME values for model selection (n s = 100 )
N 3 4 5
H
ME
0.00594 0.00580 0.00568
0.05094 0.06584 0.08068
Given: ......... Estimated:
1.5
N 6 7 8
H
ME
-0.00178 -0.00131 0.01080
0.08822 0.10369 0.13080
w
/* * \
!!=ioo
•
1 0.5
V
0 -1
0.5
-0.5
2
Given: Estimated:
1.5
.........
•• «,
=
100
"
1 0.5
•
/ \
0 -1
-0,5
0
0.5
1
Figure 11.4 Influence of number of B-splines
From the table, JV=3 minimizes ME meaning that the best estimate is given by using 3 B-splines. We have plotted JV=3 result in Figure 11.4(a). The good
278
/ /. Estimation by use of the MEM
agreement between the estimated and the given pdfs does validate the effectiveness of ME analysis. Example 11.3 Compound distribution This example was considered in Chapter 6. It is rewritten here for convenience. The distribution is given by /(*)=,og(j)
(11.49a)
] g(x)dx
where g(x) is a function defined on [0,10] 1
(11.49b)
From this distribution, 100 random numbers are generated as a random sample for estimation. Based on the sample, the maxent distribution is estimated using the method presented here. The results are given in Table 11.3. Table 11.3 Dependence of ME values on the number of B-splines Number of B-splines 7 3 9 10 11 12
ME .162145E+01 .172011E+01 .156969E+01 .160988E+01 .168089E+01 .157742E+G1
The minimum ME is obtained at N=9. The estimated results for N=5,9 and 12 are plotted in Figure 11.5. Similar observations can be amde again in the figure, that is, N=S represents the case of underutilizating B-splines and JV=12 represents the case for overshooting B-splines. From the comparison in the figure, AH? is better than the rest. Its corresponding ME is really smallest among the cases considered. Again, the capability of ME analysis is validated. From the figure, however, it is observable that the results are not better than
279
U.S. Concluding remarks
those presented in Chapter 6. To make equation (11.31) be more accurate, large samples are needed. Therefore, it is not safe to say that MEM is superior to M-L method now. 0.5
Given: .,„ ,. „ 0.4 3
1 I
•
tat
r
0.2 0.1 0
•-
N =S
0.3
0
t
\
•
#=12
2
4
6
8
10
Figure 11.5 Maxent estimate of compound distribution
11.8 Concluding Remarks MEM has a variety of applications. But studies on applying MEM to estimating distributions based on SAMPLE remain few. Traditionally, polynomials are used to approximate $(x), but the capability is severely limited by the inadequacy of polynomials as a flexible and robust interpolating tool. Introduction of B-splines is expected to open up a new direction for estimating complicated distributions.
This page intentionally left blank
Chapter 12
Code specifications
Short specifications to each code given in the CD-rom attoched to this book are presented in this chapter for the easy reference of the reader. 12.1 Plotting B-splines of order 3 12.1.1 Files in directory B-spline FORTRAN code for calculating B-splines: bspline.f Input file name : bspline.inp Output file name : bspline.dat 12.1.2 Specification: Inputs in input file: hsx ndx xsm xlg
: number of B-splines : division number : minimum x-value : maximum x-value
Output in the output file:
Subroutines function bn(k,x,bound): calculate B-spline values of order k+3 at point x with knot k+3 : order of B-spline, input x : point of interest, input
281
282
12. Code specifications bound bn
: knot sequence, input : B-spline value of order k+1 at point x with knot sequence bound, function wp(k,x,bound): calculate w'(x) in the definition of B- spline all arguments in the bracket are as same as above function hsd(x) X hsd
: Heaveside function, taking zero if x is smaller than 0 and 1 if x is not smaller than 0. : point of interest, input : value of Heaviside function, output
subroutine bd(xsm,xlg,nhx,xbound): calculate knot sequence hsx : number of B-splines ndx : division number xsm : minimum x-value xlg : maximum x-value nhx : hsx-2 xbound : knot sequence Example 12.1 15 B-splines Input 15,40 0, 10 Output (x,y) for plotting B-spline functions 1.2 Random number generation by ARM 12.2.1 Files in the directory of random FORTRAN for predicting random number Input file Output file
: random.f ; rand.inp : n.inp, histout, randO.dat
12.2.2 Specifications Inputs in the input file ns : sample size to generate np : =1, exponential distribution : =2, compound distribution : =3, normal distribution a,b : lower and upper bounds of the interval
12.3. Estimating 1-D distribution using B-splines
283
Outputs in fee output file x : random numbers distributed as the given pdf Subroutines subroutine rand(ix, yfl)
: generating uniform random number using the linear congruence method as given in the following
subroutine rand(ix» yfl) iffix .eq. 0) ix = 67107 ix=125*ix ix = ix - ix / 2796203 * 2796203 yfl= float(ix) yfl=yfl/2796203 return end function pdfs(x) : compound distribution used in the previous chapters function pdf(x) : exponential distribution function pdfh(x) : normal distribution 12.3 Estimating 1-D distribution using B-splines 12.3.1 Files in the directory shhl Filename :shdl,f Input file name : exp.inp Output file name : exp.out, exp.dat, exp.zme 12.3.2 Specifications Inputs: ns numx numf ndx xi(n) comer xlg xsm
; integer, sample size ; integer, smallest number of B-splines to be used : integer, largest number of B-splines used : integer, division number for plotting : real, sample point on X-axis : real, required accuracy : real, the largest value of x : real, the smallest value of x
284
12. Code specifications
Outputs:
x(nx) fie AIC zme
: estimated linear combination coefficient : likelihood functions : AIC value : measured entropy
Subroutines SUBROUTINE OBF(XQ,xs&); iteration formula function psi(x j ) : density function / where x is an array storing the coefficients function psib(x»xl) : calculate f(x( \ a) function bn(k,x,bound) : see section 12.1 function wp(kpc,bound) : see section 12.1 function hsd(x) : see section 12.1 subroutine bd(xsm,xlg,nhx,xbound): see section 12.1 subroutine enfropytxjeht)
: calculating entropy H .
12.4 Estimation of 2-D distribution: large sample 12.4.1 Files in the directory shd2 FORTRAN code shh2.f for estimating pdf using sample data (xc,yt:) Input file : uSQG.inp Output file : uSOO.out File storing ME values 12.4.2 Specifications Inputs: ns numx numxf numy
: sample size : smallest number of B-splines to be used in x-direction : largest number of B-splines to be used in x-direction : smallest number of B-splines to be used in y-
numyf ndx ndy xi(n) yi(N) conver ylg ysm
: largest number of B-splines to be used in y-direction : division number in x-axis for plotting, around 40 : division number in y-axis for plotting, around 40 : x-eoordinate of the sample point : y-eoordinate of the sample ; Required accuracy : The largest value of x : The smallest value of x
direction
12. S. Estimation ofl-D distribution from a histogram yig ysm Outputs: x(nx) AIC zme Fx
: The largest value of y : The smallest value of y : real array, estimated linear combination coefficients : AIC values : measured entropy : likelihood function
Subroutines SUBROUTINE iteration process function psi(xj) function psib(x,xl,yl) : calculate f(xe,yt j a) : see section 12.1 function bn(k,x,bound) : see section 12.1 function wp(k,x,bound) : see section 12,1 function hsd(x) subroutine bd(xsm,xlg,nhx,xbound): see section 12.1 subroutine entropy(x,eht) : calculating entropy H . 12.5 Estimation o f l - D distribution from a histogram 12,5.1 files in the directory shhl FORTRAN code for estimating pdf from a given histogram: shhl.f Input file: rulS.inp 12.5.2 Specifications Input:
M Ns num number
ndx xi(m) yi(m+l)
xlg
xsm CONVER Outputs: X(NX)
ane AIC
285
: Histogram cell number : Sample point number : Smallest number of B-splines used : Largest number of B-splines used : Division number for plotting : Cell heights : Cell coordinates : The largest value of x : The smallest value of x : Required accuracy
: Parameter values : ME value : AIC values
12, Code specifications
286 fx
: Likelihood functions
Subroutines SUBROUTINE OBF(xO,X,F) function 2Hnu(x,y) function qi(i,x)
: iteration for estimating pdf ; function value : qt =0^*^
subroutine simp(a,b,n,k,s): Simpson method for numerical quadrature subroutine trix : calculate coefficient etj function bn(k,x,bound) : see section function wp(k,x,bound) : see section function hsd(x) : see section subroutine bd(xsm,xlg,nhx,xbound): see section
12.1 12.1 12.1 12.1
12.6 Estimation of 2-D distribution from a histogram 12.6.1 files in the directory shhl FORTRAN code for estimating pdf from a given histogram: shh2.f Input file: wavew.inp 12.6.2 Specifications Input:
M N Ns Numx Numberx
ndx numy numbery
ndy xi(m,n) yix{m+l) yiy(n+l)
xlg xsm yig ysm, CONVER Outpute: X(NX)
AIC
: Cell number in X-direetion : Cell number in Y-direction : Sample point number : Smallest number of B-splines used : Largest number of B-splines used : Division number for plotting : Smallest number of B-splines used : Largest number of B-splines used : Division number for plotting : Cell heights : Cell coordinates : Cell coordinates : The largest value of x : The smallest value of x : The largest value of y : The smallest value of y : Required accuracy : Parameter values :
AIC
values
287
12.7. Estimation of2-D distribution using RBF : ME values zme : Likelihood functions fx Subroutines iteration for estimating pdf SUBROUTINE OBF(xO,X,F) ftmction value function zmu(x,y) function qi(i,x) subroutine simp(a,b,n,k)s} Simpson method for numerical quadrature subroutine trix calculate coefficient c s fiinction bn(k,x,bound) fiinction wp(k5x,bound) ftmction hsd(x) subroutine bi(xsm,xlg,nhx,xbound)
see section see section see section see section
12.1 12.1 12.1 12.1
12.7 Estimation of 2-D distribution using RBF 12.7.1 Files in the directory shr2 FORTRAN code shr2.f for estimating pdf using sample date Input file: uSOO.inp
(xt,yt)
12.7.2 Specifications Inputs: ns numx numxf numy numyf ndx ndy xi(n) yi(N) conver yig ysm yig • ysm Outputs: x(nx) AIC
: sample size : smallest number of B-splines to be used in x-direction : largest number of B-splines to be used in x-direction : smallest number of B-splines to be used in y-direction : largest number of B-splines to be used in y-direction : division number in x-axis for plotting, around 40 : division number in y-axis for plotting, around 40 : x-coordinate of the sample point : y-coordinate of the sample : Required accuracy : The largest value of x : The smallest value of x : The largest value of y : The smallest value of y : real array, estimated linear combination coefficients : AIC values
288
12. Cade specifications Zme Fx
: measured entropy : likelihood function
Subroutines: SUBROUTINE QBF(X0,x,fx) function psi{x j ) function psib(x,xx,yy) fiinction bn(i,x,y) subroutine entropy(x,eht)
: iterative solution process : density function value at sample point : density function :RBF : calculating entropy H .
Bibliography
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.
12. 13. 14. 15.
Anderson, T. W. (1958). Introduction to multivariate statistics analysis, Wiley, New York, Apostol, T. M. (1969), Calculus, vol. 2, John Wiley & Sons. Balian, R. (1991). From microphysks to macrophysics, methods and applications of statistical physics,vol. I, Springer-Verlag, New York. Balian, R,(1992), From Microphysks to Macrophysics, Methods and Applications of Statistical Physics,vol. II, Springer-Verlag, New York. Bhat, B. R. (1985): Modem Probability Theory, An introductory textbook (2nd edition) John Wiley & Sons, New York. Blum, L., Blum, M, & Schub M. (1986). "A simple unpredictable pseudorandom number generator," SIAMJ, of Computing, 15(2), 364-383. de Boor, C. (1972). "On calculating with B-splines," J. Approx. Theory, 6, 50-62. Buck, B. & Macaulay, V. A. (eds.) (1991). Maximum entropy in action, Clarendon Press, Oxford. Cemak, J. (1996). "Digital generators of chaos," Physics Letters A, 214, 151-160. Conrad, K. (2005). http://www.math.uconn.edu/~kconrad/blurbs/ Couture, R &. L'Ecuyer, P.0997). "Distribution properties of multiplywith-carry random number generators," Mathematics of Computation, 66, 591. Couture, R & L'Ecuyer, P. (1998): "Guest Editors' Introduction," ACM Transactions on Modeling and Computer Simulation, 8(1), 1-2. Cox, M. G. (1971). "The numerical evaluation of b-splines: division of analysis and computing," National Physkat Laboratory, DNAC 4, U.K. Crandall, S.H. (1980). "Non-Gaussian closure for random vibration of nonlinear oscillators," M, J. Non-linear Mech,\5,303-313. Davis, PJ. (1963). Interpolation and Approximation. Blaisdell Publishing Company, New York.
289
290
Bibiography
16. Eisberg, R. & Resniek, R. (1985). Quantum physics of atoms, molecules, solids, nuclei and particles, 2nd edition, John Wiley & Sons. 17. Elderton, W.P. (1953). Frequency curves and correlation, 4th ed.,Harren Press, New York. 18. Entaeher, Karl. (1998). "Bad subsequences of well-known linear eongruential pseudorandom number generators," ACM transactions on Modeling and Computer Simulation, 8(1), 61-70. 19. Er, G.K. (1998): "A method for multi-parameter PDF estimation of random variables," Structural Safety, 20,25-36. 20. Faddeev, D. K. (1956). "The notion of entropy of finite probabilistic schemes (Russian)," UspekhiMat, Nauk, 11,15-19. 21. Faux, LD. & Pratt, M. J. (1979). Computational geometry for design and manufacture, Wiley, New York. 22. Feinstein, A. (1958). The Foundations of Information Theory, McGraw-Hill, New York. 23. Fishman, G. S. (1996). Monte Carlo: Concepts, Algorithms, and Applications. Springer., New York. 24. Fog, A. (2000). How to optimize for the Pentium family of microprocessors. http://www.agner.org/assem. 25. Fog, A. (2001). Pseudo random number generators, http://www.agner.org/random. 26. Fraser, I. (2000). "An application of maximum entropy estimation: the demand for meat in the United Kingdom," Applied Economics, 32,45-59. 27. Fujimoto, Y., Shintaku, E., Zong, Z., Ishikura, H. & Isokami, T. (1994). "The model for determining the membership functions based on fuzzy data", J. Naval Architecture of Kansai, Japan, 236. 28. Fujimoto, Y., Shintaku, E. & Zong, Z. (1994). "Quantification of subjective information and its utilization in reliability engineering," J. of Naval Architecture Society of Japan, 176,615-624. 29. Gen, M. & Cheng, R. W. (1997). Genetic Algorithm and Engineering Design, A Wiley-Interscience Publication, John Wiley & Sons, Inc., New York. 30. Golan, A., Judge, G. & Miller, D. (1996). Maximum entropy econometrics; robust estimation with limited data, John Wiley and Sons, New York. 31. Goldman, S.(1955). Information Theory, Prentice-Hall, New York. 32. Harris, B. (1960). "Probability distributions related to random mappings," Annals of Mathematical Statistics, 31, 1045-1062. 33. Himmelblau, D. M. (1972). Applied nonlinear programmin,, McGraw-Hill, New York. 34. Hong, H.P., Lind N.C. (1996). "Approximate reliability analysis using normal polynomial and simulation results," Structural Safety,18,329-339.
Biliography
291
35. IEEE Computer Society (1985). IEEE standard for binary floating-point arithmetic (ANSI/IEEE Std 754-1985). 36. James, F. (1990). "A review of pseudorandom number generators," Computer Physics Communications, 60,329-344. 37. Jaynes, E.T. (1957a). "Information theory and statistical mechanics," Physical Review, 106,620-630. 38. Jaynes, E.T. (1957b). "Information theory and statistical mechanics II," Physical Review, 108,171-190. 39. Justice (ed.), J. (1986). Maximum entropy and bayesian methods in applied statistics, Cambridge Univ. Press, Cambridge. 40. Khinchin, A.I. (1957). Mathematical Foundations of information Theory, Dover Publication, Inc., New York. 41. Knuth, D. E. (1998). The art of computer programming, 2 (3rd ed.),Addison- Wesley. Reading, Mass. 42. Kullback, S.(1959). Information theory and statistics, Willey & Sons, New York. 43. Lam, K.Y., Zong, Z., Wang, Q.X. (2001) "Probabilistic failure of a cracked submarine pipeline subjected to underwater shock", Journal of Offshore Mechanics and Arctic Engineering, 123,134-140. 44. Larsen, R J . & Marx, M. L. (2001). An Introduction to Mathematical Statictics and Its Applications, 3 rd edition, Prentice Hall, NJ. 45. Levine, R. D. & Tribus, M. (eds.) (1979). The maximum entropy formalism, The MIT Press, Cambridge. 46. Lidl, R. & Niederreiter, H. (1986). Introduction to finite fields and their applications, Cambridge University Press. 47. Lind, N.C. & Chen, X. (1987). "Consistent distribution parameter estimation for reliability analysis," Structural Safety, 4,141-149, 48. Lind, N.C. & Nowak, A. S. (1988). "Pooling expert opinions on probability distributions," Journal of Engineering Mechanics, 114,341-389. 49. L'Ecuyer, P. (1997): "Bad lattice structures for vectors of non-successive values produced by some linear recurrences," INFORMS Journal of Computing, vol. 9, no. 1, pp. 57-60. 50. L'Ecuyer, P. (1999). "Good Parameters and Implementations for Combined Multiple Recursive Random Number Generators," Operations Research, 47(1), 159-164. 51. Lozover, O. & Preiss, K. (1981). "Automatic generation of cubic B-splinere presentation for a general digitized curve," Eurographics, Encarnacao,J.L. ed., North-Holland, 119-126. 52. Marsaglia, G., Narasimhan, B., & Zaman, A. (1990), "A random number generator for PC's," Computer Physics Communications, 60,345. 53. Marsaglia, G. (1997). DIEHARD, http://stat.6u.edu/~geo/diehard.html or http://www.cs.hku.hk/mtemet/randomCD.htol. 54. Matsumoto, M. & Nishimura, T. (1998). "Merseme Twister: A 623-
292
55. 56. 57.
58. 59. 60. 61.
62. 63. 64. 65. 66. 67. 68.
69.
70. 71.
72.
Bibliography Dimensionally Equidistributed Unifonn Pseudo-Random Number Qmerstor,"ACM Trans. Model. Comput. Simul. 8(1), 31-42. Milton, J.S., McTeer, P.M. & Corbet, J, J. (1997). Introduction to Statistics, WCB McGraw-Hill, Boston. Mood, A. M., Graybill, R. & Boes, D.C. (1974). Introduction to the theory of statistics, 3rd ed., International Student Edition. Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics, Philadelphia. Ostle, B. (1966). Statistics in Research, Oxford & IBH Publishing Co., Calcutta. Patil, G P., Kotz, S. & Ord, J. K. (eds.) (1975). Statistical Distributions in Scientific Work, 3, D. Reidel Publ. Company, Dordrecht. Preckel, P.V. (2001). "Least squares and entropy: a penalty function perspective," American Journal of Agricultural Economics, 83,366-377. Riesenfeld, R. F. (1973). Berstein-Bezier Methods for the Computer-Aided Design of Free-Form Curves and Surfaces. Ph.D. Dissertation.,Syracuse University, Syracuse, NY, U.S.A. Sakamoto, Y . , Ishikuro, M. & Kitagawa, G. (1993). Information statistics (in Japanese). Kyoritsu Publisher, Tokyo. Sehoenberg, I. J. (1946). "Contributions to the problem of approximation of equidistant data by analytic functions," Q, Appl. Math., 4,45-99. Schuster, H. G. (1995). Deterministic Chaos: An Introduction. 3'rd ed., VCH. Weinheim, Germany, Waelhroeek, H. & Zertuche, F. (1999). "Discrete Chaos," J. Phys. A, 32(1), 175-189. Shannon, C. E. & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL. Shen, E.Z. & Perloff, J.M. (2001). "Maximum entropy and Bayesian approaches to the ratio problem" Journal of Econometrics, 104,289-313. Shen, S. X.» Bing, F. S. & Wang, H. Z.(1990a). Handbook of Contemporary Engineering Mathematics (in Chinese) , 3, Hua Zhong University of Technology Press, Wuhan. Shen, S. X., Bing, F. S. & Wang, H. Z.(1990b). Handbook ofContemporary Engineering Mathematics (in Chinese) , 4, Hua Zhong University of Technology Press, Wuhan. Utts, J. M. & Heckard, R. F. (2002). Mind on statistics, Duxbury Thomson Learning, Australia, Wu, S. C , Abel, J. F. & Greenberg, D. P. (1977). "An interactive computer graphics approach to surface representation," Computer Graphicslmage Processing, 6, 703-712. Yamaguchi, F. (1978). "A new curve fitting method using a CAT computer display," Computer Graphics Image Processing, 7,425-437.
Biliography
293
73. Yeh, R. Z. (1973), Modem Probability Theory, Harper & Row Publisher, New York. 74. Zong, Z. & Lam, K.Y. (1998). "Estimation of complicated distributions using B-spline functions," Structural Safety, 20 (4), 341-355. 75. Zong, Z. & Lam, K.Y. (2000). "Bayesian estimation of complicated distributions", Structural Safety, 22(1), 81-95. 76. Zong, Z. & Lam, K.Y (2001). "Bayesian Estimation of 2-dimensional complicated distributions", Structural Safety, 23(2), 105-121. 77. Zong, Z., Lam K.Y. & Liu, G.R. (1999). "Probabilistic risk prediction of submarine pipelines subjected to underwater shock", Journal Of Offshore Mechanics And Arctic Engineering, 121,251 -254. 78. Zong, Z., Shintaku, E. & Fujimoto, Y. (1995). "A method to determine the membership functions based on fuzzy data", Bulletin of the Faculty of Engineering, Hiroshima University, 13(1), 11-21. 79. Zong, Z. & Bi, J. Y. (2005). Maximum entropy method for estimating probability distribution (in press). 80. Zhu, X.L. (2001). Fundamentals of Applied Information Theory, Tsinghua University Press., Beijing.
This page intentionally left blank
Index
1-D, 64,87,129,189,191,213,285
2-D, 64,163,213
Approximation, 67,137,163,181,244
    best, 69,71
    function, 68,69,82
Axiom, 94
Basis, 70-72,77,81,83,133
    B-spline, 81
    polynomial, 72
    truncated power, 77
Bayesian, 118,119,126,127,128,189-192,197-203,210-213,216,219-228
    estimation, 198,203,213,219,223,225,227
    measured entropy, 127,201-210
    method, 127,190,192,199,213,220
    point estimate, 198,200-203,219-221,223
    priors, 192,216,219
    statistics, 118,119,126,127,192,197,211,212,219,220
Bivariate, 29,65,163,164,167,170-174,186,213,229
B-spline, 67-69,77-88,130,132,133-144,145,153,156,163-170,175-180,185-189,207,209,211-217,225,244-253,272-279,281-287
    functions, 67-69,77,79,82,87,133,144,165,181,183,189,190,213,228,245,246,248,250,253,268,272,282
Chaos, 51
Chebyshev's inequality, 31,32
Combination, 70,71,77,90,133-135,147-150,159,164-167,171,187-190,194,196,213-219,283,284,287
    linear, 70,71,133,138,150,213,283,284,287
    coefficient, 70,77,133-138,150,159,164,171,187-190,194,196,214,217,219,283,284,287
Conditional, 6,7,14,15
    density function, 15
    distribution, 14
    probability, 6,14,15
Consistency, 37,38,40,105,117
Convex, 71,73
Covariance, 16,22,44
Criteria, 121,141,142,163,186
    model selection, 121,141,186
Determinant, 201,203,204,221-223,236,270
Deterministic, 2,10,49,51,121,244
Disorder, 89-94
Distribution, 9-23,26-36,40-67,93,96-150,153-158,163-174,185-187,213,214,219-229,259-268,273-287
    probability, 10,22,27-36,48,93,96-104,130,157,213,259,261
    chi-square, 21,23,56,57,110
    conditional, 14
    complicated, 67,129,130,279
    compound, 65,150,278,282,283
    discrete, 132,219
    Gauss, 20,22
    marginal, 14
    multinomial, 138
    normal, 20,21,61-67,100,104,110,112,118,125-130,148,222,268
Efficiency, 37,38,105,144,150
Elementary outcomes, 2
Entropy, 45,54,55,74,89-128,141,142,154-156,221,242,247,248,259-265,284-288
    estimation, 89,102,105-119,154,155
    measured, 89,99,105,109,114,141,142,242,247,248,284-288
    Bayesian measured, 127
Estimate, 27,34-44,98-128,135,141-148,190-210,219-230,246,248,271-278,284
Estimation, 27,34,35,40,44,89-119,128-158,163-191,198,203,206,208,213-229,237,254,259,272-278,284-287
    entropy, 89,105-107,112,114,118,128,154
Estimator, 27,35-44,89,105-119,128,141,142,157,247,272
    asymptotically unbiased, 39,106-116,142,247
    parameter, 27,35-39
Event, 2-16,22,49,93-95,119,129,138,170,200,210,229,245
    random, 2,49,129
Exclusive, 3,5,93
Experiment, 2-4,9,45-48,93,150,202,242-244
Function, 9-12,25,30-49,60,67-88,91-104,110-116,120-135,139-161,213-219,225-229,237-254,261-268,271,272,281-288
    approximation, 68,69,82,120
    cumulative density, 11
    joint density, 13,15
    joint distribution, 12-14
    Lebesgue, 80
    likelihood, 39-41,136,139,141,144,145-153,216,219,229,272,285-287
    membership, 237-246,249-254
    probability density, 10,41,49,61,110,241,261-265
    radial B-spline, 87,133,144
Fuzzy, 237,240-254
    data, 242-251,254
    sample, 242,247,252
    set, 237,240-251
Gaussian, 20-22
Histogram, 54,58,70,82,111,137,139,143,153,170-174,184,186,189,207,285,286
Householder transform, 223,231
Independence, 8,9,16,21,54,58,70,82,133,193
    linear, 71
    test, 58
Independent, 8,9,16,22,28-32,48,58,59,70,77,97,121-124,133,160,195,200,218,221
    linearly, 70,77,133
    independent and identically distributed, 22,28,218
Individuals, 26,27
Inference, 25-28,34,37,48,102,104,156,158,190,259,261
    statistical, 27,28,34,37,102,104,158,261
Information, 16,17,26,27,42-44,48,81,89,95-116,120,127,130,141,142,147,157,158,163-181,217-220,228,229,242,259-261
    Kullback information, 97,99-103,111-114
    Akaike, 116,141,142
Intersection, 3,4,7
Likelihood, 39-44,116,122-124,135-158,190,200,209,210,216,219,229,245,248,253
    function, 39-41,122,141,142,147-153,198,209,210,216,219,229,245,248
    log-likelihood, 39-43,123-245
    log-likelihood function, 39-43,116,123,136,139,141,144,145,245
Likely, 4-6
    equally, 5,6
Measured entropy, 127,141,142,284-288
    Bayesian, 127
Method, 1,11,22,28,39-50,58-67,76,84,85,91,107,118-135,140,153-155,163,167,175,184-192,199,203,204-212,213,220,223,228-231,237,247-254,259-287
    acceptance/rejection, 61,62,64
    Bayesian, 127,130,181-189,213,220
    linear congruence, 49,50,66,283
    maximum likelihood, 39,40,107,123,135,157,272
Model selection, 119,120-130,140,141,147,158,271
    criteria, 120
Modulus, 50,51
Moments, 18-20,44,266,274,275
    central, 18
    first, 18
    second, 18
    third, 19
Mutually exclusive, 3,5,93
Nonlinear, 51,136,156,159,202,203,247,248,255,269
Norm, 70-73,167,168
Objective, 126-129,136,200,210,212,221,230,241,242,244
Observations, 27,28,55,189,206,211,213,225,260
Optimization, 135,136,143,158,200,203,210,247,270-273
Parameters, 20,22,27,34,35,38,45,51-55,60,71,72,120-126,136,141-147,154-157,163,167,170,173,189-191,200,202,210,215,219,221,229,243-248,272,285,286
Period, 26,50-53,59,66,129,184
Polynomial, 68-78,82-87,121,126,132,133,235,267,279
    basis, 72
Population, 16,17,25-27,33-39,45-48,114,120,135,157,158,167,181,268
Power, 9,46,60,61,68,77,132,148,208,211
Prediction-correction, 189,213
Probability, 1-17,22,25-36,40-49,56,61,92-105,110,118,127,130-132,138,156,157,170,171,184,200,208,213,221,241,245,259-265
    density function, 10,22,49,61,104,110,261-265
    distribution, 20,26,27,33,35,36,40,48-68,93,96-104,130,157,259,261
Random, 2,9-22,27-41,47-67,92-155,164,167,174-178,187-190,192-195,199,203-206,213-218,224,229,244,251,259,266,271,278,282
    number, 49-66,140,144,174,191,251,278,282,283
Randomness, 1,2,10,40,49,51,54,55,57,96
Realizations, 27,39,50
Sample, 2-5,9-14,25-48,50-59,97-128,129-158,163-182,186-198,206-211,213-220,225-230,237,242-254,268-279,282-288
    fuzzy, 242,247,252
    large, 105,107,116,118-121,129,137,145,213,228,229,283,284
    small, 30,48,118,126,128,132,213,229,230
    space, 2-5,9-14,46,129
    size, 27,30-33,37-44,53-57,98,102,106,111,115,118,125,132,135,139,144,146,151-157,163-169,175-193,206-211,251-254,269,271,275,282-284,287
Sampling, 8,26-30,35,36,48,54,60-64,104
    distributions, 28,30,35,36,48,54
    error, 104
Set, 2,3,7,9,26,27,62-71,79-84,93,122,132-135,165,169,170,211,231,237-251,262,265,269,270,273
    crisp, 237,240
    fuzzy, 237,240-251
Skewness, 17,19,275
Smooth, 156,190-198,203,206,211,213,216-219,223,225,229,243
    prior distribution, 190,211,213,217,229
Span, 70,71
Space, 2-5,9-14,46,53,59,70,71,74,87,129,133,136,157,163,170,194,216,217
    vector, 70,71,133
    normed linear, 70,71
    sample, 2-5,9-14,129
Standard deviation, 19,32,45
Statistic, 7,25-27,33-37,46-48,55,68,89,102-105,118-120,126-132,153,157,163,192,197,200,212-220
Statistical, 1,12-17,25-28,34-40,45-48,51-54,59,102-108,116,119,120,126,135,137,141,147,156,158,180,190-192,212,237,242,252,259,261,272,276
Stochastic simulations, 49
Test, 33,45-48,54-59,137,157,242
    independence, 54,58
    uniformity, 54,56,57
    visual, 54,59
Testing, 8,21,34,45,46,54-58,102,157,158
    hypothesis, 34,45,102
Theorem, 6,7,31-34,43,44,48,55,71,72,81,82,94,95,98-103,107,112-121,127,131-133,136,157,161,162,216-235,261-269
    central limit, 55,157,262
    large number, 98,115,269
Unbiasedness, 37,38,105
Uncertainty, 27,89,93-104,114,116,119,141,147,153,156,158,178,260,261
    model, 97,102-104,147,158
Union, 3,4
Univariate, 163
Variable, 9-22,27-44,50-56,60,61,64,67,92,95-110,115,118-126,129-137,141,155,189,195,200,204,211,212,213,216-218,229,240,259,265,266,267,271
    random, 9-22,27-36,39,41,50-56,60,61,64,67,92,95-110,115,118,119,121,124,126,129-133,137-141,151,189,192-195,200,204,211,212,213,216,218,229,265-267,271
    uniform random, 19,50-56,60
    Gaussian, 20
Variances, 16-20,30,31,44,101,104,125,261-265,268
Mathematics in Science and Engineering
Edited by C.K. Chui, Stanford University

Recent titles:
C. De Coster and P. Habets, Two-Point Boundary Value Problems: Lower and Upper Solutions
Wei-Bin Zhang, Discrete Dynamical Systems, Bifurcations and Chaos in Economics
I. Podlubny, Fractional Differential Equations
E. Castillo, A. Iglesias, R. Ruiz-Cobo, Functional Equations in Applied Sciences
V. Hutson, J.S. Pym, M.J. Cloud, Applications of Functional Analysis and Operator Theory (Second Edition)
V. Lakshmikantham and S.K. Sen, Computational Error and Complexity in Science and Engineering
T.A. Burton, Volterra Integral and Differential Equations (Second Edition)
E.N. Chukwu, A Mathematical Treatment of Economic Cooperation and Competition Among Nations: with Nigeria, USA, UK, China and Middle East Examples
V.V. Ivanov and N. Ivanova, Mathematical Models of the Cell and Cell Associated Objects