2,---,fll with
combination
coefficients
al,a2,---,an
.
The vector
a,^, +fl2p2 +"- + aB% is the zero vector 0 if and only if a1,flJ,>--,flHareall zeros. If all vectors in V can be written as linear combinations of yn,^, •••,%, then these vectors form a set called the span of ^ , , ^ s - - - , % . Or in other words, we say that
4.1. Approximation and best approximation
71
precisely, we claim that l,xM•-•,*,, are a basis for IT,,. To prove this we need to show that {1) 1, x{, • • •, xn are linearly independent, (2) they span II B . Property (2) is clear from the definition of Tln . As for (1): suppose q,,q,•••,&,, are scalars such that aa + atx + • • • + anx" = 0 . The following concepts at least point out the possibility of how to construct an approximant based on linear combination. Theorem 4.1 (Best approximation) Let V be a normed linear space with norm I | . Let ^l,
A solution to the minimizing problem is usually called a "best approximant" to y from VB. How many solutions are there? The answer is connected with convexity conditions. Let V be a vector space. A subset S of V is convex if for any two members fH, q>2 of S, the set of all members of form 2,0 <, t < 1 called the line segment from ^ to ip%, also belongs to S. Theorem 4.2 (Uniqueness) Let V be a normed linear space with strictly convex norm, flies best approximations from finite dimensional subspaces are unique. Approximation is transformed into the problem searching for the bases which define a normed linear space with strictly convex norm. Depending on the sense in which the approximation is realized, or depending on the norm definition in equation (4.4), there are three types of approximation approaches: I.
Interpolatory approximation: The parameters a, are chosen so that on a fixed prescribed set of points x* (k = I,... ,n) we have
72
4. Approximation and B-splim Junction
=*
(4.5)
Sometimes, we even further require that, for each i, the first r, derivatives of the approximant agree with those of y at x t 2.
Least-square approximation: The parameters at are chosen so as to minimize
_
(4.6)
2 *=•'
3.
Min-Max approximation: the parameters a* are chosen so as to minimize n
- max y - J ] a,% -> min
(4,7)
4.2 Polynomial basis Choosing q>t = x', we have polynomials as approximants. Weierstrass theorem guarantees this is at least theoretically feasible Theorem 4.3 (Weierstrass approximation theorem) Let f(x) be a continuous function on the interval [a,b]. Then for any £>0, there exists an integer n and a polynomial pn such that (4-8) In fact, if [a,&]=[0,l], the Bernstein polynomial
converges to f(x) as n -» QO , Weierstrass theorems (and in feet their original proofs) postulate existence of some sequence of polynomials converging to a prescribed continuous function uniformly on a bounded closed intervals. When V = L^a; b] for some interval [a; b], and 1 < p < 1 and the norm is the
4.2, Polynomial basis
73
usual /Miorm, it is known that best approximante are unique. This follows from the fact that these norms are strictly convex, i.e. Jxj < r,||y| < r implies |e+y\\ < 2r unless x = y. A detailed treatment of this matter, including proof that these norms are strictly convex can be extracted from the material in PJ. Davis, Interpolation and Approximation, chapter 7. However, polynomial approximants are not efficient in some sense. Take Lagrange interpolation for instance. If xv,x1,---,xn are « distinct numbers at which the values of the function/are given, men the interpolating polynomial p is found from
)£*«
(4-10)
where Lnk(x) is
~
x
i
The error in the approximation is given by -x,).
where £(x) is in the smallest interval containing Introducing the Lebesque function
(4.12)
x,xl,xn.
and a norm ||/|| = max|/(x)t. Then
WNIkl
(4.14)
74
4. Approximation and B-spline function
This estimate is known to be sharp, that is, there exists a function for which the equality holds. Equally spaced points may have bad consequences because it can be shown that
IKI&Ce"'2.
(4.16)
As n increases, function value gets larger and larger, and entirely fails to approximate the function / This situation ean be removed if we have freedom free to choose the interpolation points for the interval [a,b]. Chebyshev points are known to be a good choice. * t = - \ a + b + (a~b)cos^~^-
* 2L
.
(4.17)
n-l J
The maximum value for the associated Lebesque function is it logK+4. |r'i|< — II
It
(4.18)
ft-
Using Chebyshev points, we therefore obtain the following error bounds for polynomial interpolation
4
X1K denotes the linear space of all polynomials of degree n on [a,b]. We may further show that
Thus by using the best interpolation scheme, we can still only hope to reduce the error from interpolation using Chebyshev points by less than a factor -logn+5.
(4.21)
4.2. Polynomial basis
75
Example 4.3 Runge example This is a well-known numerical example studied by Runge when he interpolated data based on a simple fimction of )
(4.22)
on an interval of [-1,1]. For example, take six equidistantly spaced points in [-1, 1 ] and find y at these points as given in Table 4.1. Now through these six points, we can pass a fifth order polynomial /j(x) = 0.56731-1.7308x'+1.2019x4,
-l:Sx
(4.23)
On plotting the fifth order polynomial and the original function in Figure 4.2, we can see that the two do not match well. We may consider choosing more points in the interval [-1, 1] to get a better match, but it diverges even more. In fact, Runge found that as the order of the polynomial becomes infinite, the polynomial diverges in the interval of-1 < x < 0.726 and 0.726 < x < 1. How much can we improve the situation if Chebyshev points are used? Reconsider this problem, but take six non-equidistantly spaced points in[-l, 1] calculated from equation (4.17) and find y at these points as given in Table 4.2. Now through these six points, we can pass a new fifth order polynomial / s '(x) = 0.355-0.716x 2 +0.399%", -1 S i < 1. Table 4,1: Six equidistantly spaced points on [-1,1] X y
-1.0 -0.6 -0.2 0.2 0.6 1.0
1 l+25x 2 0.03846 0.1 0.5 0.5 0.1 0.03846
(4.24)
Table 4.2: Six unequidistantly spaced points on [-1,1] X }
-1.0 -0.809 -0.309 0.309 0.809 1.0
1 l+25x 2 0.03846 0.05759 0.29520 0.29520 0.05759 0.03S46
On plotting the fifth order polynomial and the original function in Figure 4.3, we can see that the two match better at the two ends, but do not match well around
76
4, Approximation and B-splinefunction
the cental portion. So Chebyshev points remove instability at the cost of loosing accuracy around the central portion. This is natural if we look at equation (4.17) which puts more points near the two ends. Chebyshev points are nearly optimal. In other words, even if we employ an alternative knot sequence better than Chebyshev points, we would not gain much. We have to seek other methods to reduce errors. As suggested by equation (4,19), we have the following two approaches (1) Increase n, (2) Decrease the interval [a,b]. 1 Runge function
5-th order polynomial
0.8 0.6 0.4 0.2 0 -0.2
-1
-0.5
0
0.5
1
Figure 4.2: 5* order polynomial vs. exact function 1 0.8
Runge function
5-th order polynomial
0.6 0.4 0.2 0 -1
-0.5
0
0.5
1
Figure 4.3: 5th order polynomial on Chebyshev points
11
4.3. B-splines
Approaeh (1) is not a good dioice because increasing n may produce disastrous consequence in many cases. The linear combination coefficients become nearly linearly dependent. Approach (2) seems the unique option for us, which results in approximations such as B-splines. 4.3 B-splines 4.3.1 Definitions The truncated power basis can be numerically bad and it has some drawbacks as a theoretical tool. Early workers (Sehoenberg, 1946) defined and used special splines called B-splines. Their interest was mostly in theoretical studies with them because the age of modern scientific computing hadn't really begun. Much later, de Boor (1978) discovered properties of B-splines that make them well suited to computation. The B-spline function can be defined either through the divided difference or through recursion relationship. We first use the latter to give the definition of The B-spline function. At the end of this section, B-spline definition based on the former is also provided. Let A = {x,} (i = 0,lt...,n) be a non-decreasing knot sequence. Consider a function s(x) defined on the interval [JED,XB] . Its vertical coordinate for x, is yt. An order 1 (or degree 0 polynomial) B-spline is defined as a step function as shown in Figure 4.4 (a). 5
'
G, [I
x*[x,,xM] xe[x,,xM]'
(4.25)
Bf(x) is 2-th B-spline of order 1.
(a) Order 1
(b) Order 2
( c ) Order 3
Figure 4.4 The first three order B-splines
78
4. Approximation and B-spline function The function s(x) defined on the interval [x0, JCB ] is then expressed as (4.26)
An order 2 (or degree 1) B-spline is a broken line as shown in Figure 4.4 (b). Consider the equation defining the broken line. The algebraic equation for straight line (x,»>»,)tQ (xM,yM) is
x-x,
(4.27)
The algebraic equation for straight line (xM,yi+l)
-x X
i+2
to {xi+1,yM}
's
x-xi+
(4.28)
X
i+\
Comparing the above two equations we define X- -X, X
Bf(x) =
M
X
-xf -X
M
(4.29)
,-X.j
otherwise
Then the whole broken line from (xa,y0) to (xn,yn) can be defined as (4-30) If equation (4.25) is employed, equation (4.29) can be rewritten in the form of (4.31)
4,3. B-splines
79
This relationship can be generalized to higher order B-splines. Replacing order 2 by order k and order 1 by k-\, we obtain (4.32) This is the well-known recursion relationship for B-splines. We may use it to define a B-spline function of higher orders. So an order 3 B-spline is defined by
~X
(4.33)
Substituting equations (4.25) and (4.31) into above equation yields, after lengthy manipulations,
Bf (x) = (x, - x , ) f (X°+> ~X{''HiX°+> ~X)H(x-x,)
(4.34)
where H(x) is Heaviside function defined by
[
if
(4.35,
The index s-i-3, and the function ws(x) is a product of the form
w.Crt = Suppose the knot sequence is equidistance and there are n+\ equidistance knots, c = x0 < xx < • • • < xx = d, which divide the internal [c,d] into n subintervals. For conveniences in later mathematical expressions, three more knots at each end are defined: x_ 3 , x_2, x_t, xH+l, xn+1 and xn+i, It is customary to set x_3 = x_j =x_l=x0 = c and xH - JCB+, = xatt = xn+3 = d. It is clear that n+l knots define N splines. Therefore, N=n+\. With such definitions, we can plotS/(x) in Figure 4.5. Example 4.4 B-splines of order 3 Let A - {0,1,2,3}. Substituting it into the expression above yields
4. Approximation and B-spline Junction
80
0.5
K
n+1
Figure 4.5 B-splines of order 3 (degree 2)
= ~(3-xf - | { 2 - x ) 2
(4.37a) (4.37b)
2 < x < 3 , Bl(*) = -(3-*)*.
(4.37c)
but note (4J7d)
(4.37e) S,3(x)is continuous at x = \ and x = 2 . Its first derivative is continuous at x = 1 and x = 2 .The second derivative is discontinuous at x = 1 and x = 2. In cases without confiision, we also simply use B, (x) by omitting the symbol for order k. From equation (4.32) we may defme order 4 B-splines. The results turn out to be similar to order 3 B-splines in form. In particular,
4.3. B-splines
81
(X H
: ~*W-».) )
(4.38a)
where the index s=/-4, and the function w/x) is a product of the form -*™).
(4.38b)
In most applications, B-splines of order 3 and 4 are employed. For B-splines of higher orders, we do not give their explicit expressions here. 4,3,2 B-spIine basis sets Can we use B-splines for meaningful calculations? Do B-splines form a basis set for V? The Curry-Sehoenberg theorem says yes! Theorem 4,4 Curry-Schoenberg theorem For a given strictly increasing sequence Z = {iit'"t^K*i}> negative
1=1
integer
sequence
v = {vj,-",vA,}
with
am
all
a
^
S^ven
w !
" "
, v,£k,
set
let A = {JC 1 ,---,X B+4 } be any
non-
i»2
decreasing sequence so that (1)
xlixj£---^xk<
(2) for i = 2,—N the number ^ appears k-v, times in A.
(4.39)
Then the sequence fl,* • • • 5* of B-splines of order k for the knot sequence A is a basis for Hk&v, considered as functions on [x* »*„+]]• This theorem provides the necessary information for generating a B-spline basis set using the recurrence relations above. It specifies how to generate the knot sequence A with the desired amount of smoothness. The choice of the first and last k knots is arbitrary within the specified limits. In the sequence A, the amount of smoothness at a breakpoint xt is governed by the number of times x, appears in the sequence A .Fewer appearances means more smoothness. A convenient choice for the first and last knot points is to make the first & knot points equal to Xj and the last k knot points equal to xN+l .This
82
4. Approximation and B-spline function
corresponds essentially to imposing no continuity conditions at £j and §N+l. Proof; See der Hart (2000). 4.3.3 Linear independence of B-spline functions The following theorem proves the linear independence of B-splines. Theorem 4.5 Let pt be a linear fimctional given by the rule (4.40) ax \%
r=0
with i+k-l
n
fr
-lA
(4.41)
and T}, an arbitrary point in the open interval (a^ ,X; +i ). Then PlB^S,,foratlij
.
Proof: see de Boor (1978) or van der Hart (2000).
(4.42) D
4.3.4 properties of B-splines B-splines have some nice properties which make them appealing for function approximation. (1) (Local support or compact support) B* (x) is a non-zero polynomial on ^ £ x 5 xuk.
(4.43)
This follows directly from definition. At each x, only k B-splines are non-zero. (2) {Normalization)
4.3. B-splines | > * ( x ) = l.
83 (4-44)
This can be seen from recursion relations. (3)
(Nannegativity) Bf(x)>0.
(4.45)
Again follows from recursion relation. (4) (Differentiation) (4.46) X
t
X
i*k
X
M
Proof. See der Hart (2000).
•
Note that the same knot set is used for splines of order k and k - 1 .In the (convenient) case that the first and last knot point appear k times in the knot set ,the first and last spline in the set 5,w , are identically zero. (5) (Integration) "t
k +\
(4.47) X
i
(6) Function bounds if xt: £ x < xM and / = ^ a, 5. ,then in{aw_A>--.,a,} S/(*)Smax{o,. +w ,•••,«,}.
(4.48)
(7) B-splines are a relatively well-conditioned basis set. There exists a constant Dk .which depends on k but not on the knot set A,so that for all i (4.49)
84
4. Approximation and B-spiine function In general, Dk » 2*" 3/s , Cancellation effects in the B-spline summation are limited. (8) (Least-square property) If f(x)eC2 [a,b], then there exists a unique set of real constant flj ( i = -k, •••,«-1) solving the minimization problem
(4.50) These coefficients can be obtained from the linear system of equation CA = b where the matrix C is C=
ffi*(*)B*(x)*
(4.51)
.
(4.52)
The right hand side vector b is
Blt(x)(k\ .
(4.53)
And the unknown vector A is A = {a. 4 ,- s a B .,f.
(4.54)
Example 4.5 Comparison of various approximation methods The effectiveness of the above-mentioned approximation methods can be numerically demonstrated using Runge example in Example 4.3. Consider three interpolation methods: (1) Lagrange interpolation on equally spaced points (denoted as uniform polynomial in Figures 4.6 and 4.7); (2) Lagrange interpolation on non-equally spaced points (denoted as nonuniform polynomial in Figures 4.6 and 4.7); and (3) B-splines. They are obtained as follows.
4.3. B-splines
85
Consider Lagrange interpolation given in equations (4,10) and (4,11) of the form If the fonction values f(xk) at n points % are given, then the function value evaluated by the Lagrange interpolant at any point x is
= £/(**)**(*),
(4.55)
where Lk(x) is
»=n
x-x,
(4.56)
Consider B-spline interpolation. At the given n points xk, the interpolant must satisfy (4.57)
4(*») = /(**>.
where af are unknown coefficients to be determined by solving the system of linear equations using Gauss Elimination Method, to say
5,00
'« N
(4.58)
Figure 4.6 shows the results obtained from the three methods based on 10 points. Uniform polynomial behaves very well around the central portion, even better than B-splines. But it yields bad results around the two ends, validating the conclusions obtained by Runge long ago. Nonuinform polynomial behaves better than uniform polynomial around the two ends, but worst among the three around the central portion. This is due to the feet that nonuniform polynomial uses less points in the central portion. Around the central portion, the performance of Bsplines is almost as same as the uniform polynomial, while around the two ends, it is comparable with that of the nonuniform polynomial. Among the three, it is clear that B-splines are the best interpolants. As the number of points is increases to 20, the difference between B-splines and nonuniform polynomial is not significant, but it remains distinguishable. The uniform polynomial behaves badly around the two ends, as shown in Figure 4.7.
86
4. Approximation and B-splim function
In the figure, the curve of B-splines is not identified due to its closeness to the true Runge function and the difficulty to differentiate them. The effectiveness of B-splines can be well demonsttated by this example.
Runge iunction
B-splines
Nonuniform polynomial
0.5
i Uniform polynoniial -0.5
-1
0.5
0
-0.5
1
Figure 4.6 Comparison of various interpolation methods for 10 points. In the figure, uniform refers to equally spaced points and nonuniform refers to unequally spaced points.
V\ \ V
/
Nonuniform polynomial
Uniform / polynomial/
0.5
-1
-0.5
0
0.5
1
Figure 4.7 Comparison of various interpolation methods for 20 points. The B-spline interpolating curve is almost coincident with the Runge function.
4.4. Two-dimentional B-splines
87
4.4 Two-dimensional B-splines Two-dimensional B-splines may refer to B-splines defined on an arbitrary surface in the 3-D space, or on a plane. It would take a lot of space here if we went deep into the details of 2-D B-splines defined on an arbitrary surface. Considering what will be employed in the subsequent chapters, we focus our discussions on 2-D B-splines defined on a plane, the simplest case in 2-D space. This simplifies the presentation a lot. The reader who is interested in more general theory of 2-D B-splines is referred to de Boor (1978). In the following throughout the book, 2-D B-splines refer solely to those defined on a plane. The simplest 2-D B-splines are obtained by direct product of two 1-D Bsplines in the form of BtJ(x,y) = Bi(x)BJ(y).
(4.59)
It is bell-shaped in a 2-D space. A 2-D B-spline can also be obtained by replacing the argument in a 1 Bspline by the radial distance r = ^{x-xjf+(y-yjf centre of i-th B-spline. In notation,
3
2
-(2-Sf, 6
0,
, where (*,,;>,) is the
1<SS2
(4.60)
S>2
where S - r/A f , a, = IS/ltth2. B-splines defined in this way are called Radial B-spline Function, RBF in short. In the equation, h, defines the radius of the circle centered at (xf,y,) inside which Bt(r,hf) does not vanish. 4.5 Concluding remarks Approximation theory has a long history, but it remains active due to the fact feat it is not easy to find a flexible enough yet universal approximation tool for so many cases encountered in real-world applications. Polynomials had been an effective tool for theoretical analysis. It is not very suited to computational purpose due to its over-sensitivity to local changes as the order is high. The most effective way to reduce approximation errors is to decrease the
88
4, Approximation and B-spline function
interval [a,b]. This leads to the imroduetion of B-splines, The B-spline is nearly optimal choice for approximation. B-splines have nice properties suited to approximating complicated functions. The best properties of B-splines are that they are flexible enough to yield satisfactory approximation to a given fimction while maintaining stability. Only fundamental things about B-splines are introduced. One of important developments made in recent years, the so-called nonuniform rational B-splines Surface (NURBS), is not mentioned in this chapter for the apparent reasons. The interested reader is referred to Piegl & Tiller (1997),
Chapter 5
Disorder, entropy and entropy estimation
Entropy is one of most elegant concepts in science. Accompanying each progress in the conceptual development of entropy is a big forward step in science, Entropy was first introduced into science as a thermodynamic concept in 1865 for solving the problem of irreversible process. Defining entropy as a measure of the unavailability of a system's thermal energy for conversion into mechanical work, Clausius phrased the second thermodynamic law by claiming that tiie entropy of an isolated system would never decrease. In 1877, Boltanan gave interpretation of entropy in the framework of statistics. Entropy as a mathematical concept appeared first in Shannon's paper (1948) on information theory. This is a quantum jump, having great impact on modem communication theory. Another important progress for mathematical entropy was made by Kullback (1957) in 1950s. Entropy is thus an elegant tool widely used by both mathematicians and physicists. Entropy is yet one of the most difficult concepts in science. Confusions often arise about its definition and applicability due to its abstract trait This results from the phenomena, known as disorder or uncertainty, described by entropy. In fact, entropy is a measure of disorder. In this chapter, entropy as a mathematical concept will be first elucidated, followed by the discussions on how to construct unbiased estimators of entropies S.1 Disorder and entropy Entropy describes a broad class of phenomena around us, disorder. Its difficult mathematical definition does not prevent us from gaining an intuitive understanding of it. Example 5.1 Matter The unique properties of the three-states of matters (solid, gas, and liquid)
89
90
5. Disorder, entropy and entropy estimation
result from differences in the arrangements and of the particles making up of them. The particles of a solid are strongly attracted to each other, and are arranged in a regularly repeating pattern, or in order, to maintain a GAS solid shape. Gas expands in every direction as there are few bonds sublimation among its particles. Gas is a state of matter without order. Liquid flows because its particles are not held rigidly, but the attraction between the particles is sufficient to give a definite volume. So liquid is a state in between ordered solid and disordered gas.See Figure 5.1, Take water for instance. As freeze temperature drops below 0°C, ail SOLID L1DQID particles suddenly switch to an ordered state called crystal. And they Figure 5.1 Arrangement of particles vibrate around their equilibrium in different states of matters positions with average amplitude As. The ordered state is broken as temperature increases higher above 100°C, water becomes vaporized gas. Gas particles do not have equilibrium positions and they go around without restriction. Their mean free path Ag is much larger than the other two, that is, we have the following inequality temperature increases above 0°C, and particles travel around with larger average free distance A,
kt « At « A .
(5.1)
Disorder does matter in determining matter states. If the amount of disorder is low enough, matter will be in the state of solid; and if the amount of disorder is high enough, matter will be in the state of gas. Example 5.2: Digit disorder Take number as another example. Each number can be expressed as a sequence of digit combination using 0 to 9. For example, - = 0.499999..., - = 0.285714285714285714..., V2 = 1.1415926.....
(5.2)
J. /. Disorder and entropy
91
Except the first two digits, the string for 1/2 exhibits simple pattern by straightforwardly repeating digit 9. The string representing 1/7 is more complicated than that representing 1/2, but it remains ordered in the sense that it is a repetition of the six digits 285714. The third string representing -Jl does not show any order, all digits placed without order. We thus say that the first series is the best ordered and the last worst ordered. In other words, the amount of disorder of the first series is least while the third the largest. From the two examples above, it is plausible that disorder is a phenomenon existing both in nature and in mathematics. More examples on disorder can be cited, some of which are given in the following Example 5.3: Airport disorder An airport is in order with all flights arriving and departing on time. Someday and sometime, a storm may destroy the order, resulting in a state that massy passengers wait in halls due to delayed or even cancelled flights. The passengers might get more and more excited, creating a disordered pattern. Disorder is so abstract that we hardly notice it if not for physicists and mathematicians. Disorder is important for it determines the state of a system. It is desirable to introduce a quantity able to describe the amount of disorder. We will see that we are able to define a quantity, called entropy, for quantitatively defining the amount of disorder. And //(in fact, it is the capital Greek letter for E, the first letter of entropy), is frequently used to denote entropy To find the definition of entropy, we return to Example 5.1. Consider a onedimensional bar of length L. If the bar is filled with ice, there will be approximately Nt=L/A, particles in the bar. If the bar is filled with water, there will be approximately N(= Lf At particles in the bar. And if the bar is filled with gas, there will be approximately Ng = L/Ag particles. Because of equation (5.1), we have N,»Nt»Ng.
(5.3)
Therefore, the number of particles in the bar should be relevant to the amount of disorder in the system, or entropy. In other words, entropy H is a function of the number of particles in the bar, H = H{N). And H{N) should be an increasing function of N. Suppose now that the length of the bar is doubled. The number of particles in the new bar will be doubled, too. And how about entropy? The change in entropy cannot be the number itself iV, because otherwise the change of entropy for ice will be N,, for water JV, and for gas Ng, The length of the bar is doubled, but the increase of entropy is not same for the three cases. This is not acceptable
92
J. Disorder, entropy and entropy estimation
if we hope that entropy is a general concept. An alternative method to view the problem is to define entropy in such a way that if the bar is doubled in length or the number of particles is doubled, the entropy increment is one. If the number of particles is four-fold increased, the entropy increment is 2. Then we have
,..,
(5.4)
Note that H(l) = 0 because there does not exist any disorder for one particle. Solving the above equation, we are led to say that entropy is given This is a heuristic introduction to entropy. In the following sections, we will ignore the particular examples in the above, and tarn to abstract yet rigorous definition of entropy. 5.1.1 Entropy of finite schemes To generalize the entropy definition introduced in the above, we must notice that the number JV used in the above is just an average value, hi more general cases, N is a random variable and should be replaced by probability. Consider a bar filled with AT particles. They are not necessarily arranged in an equidistance way. Suppose, the free distance of the first particle is A,, the free distance of the second particle is Aj,..., and the free distance of the N-th particle is AN. If the bar is filled with particles, all of which have free distance A,, then the entropy would be H{ = log{£/!,) based on the above definition. If the bar is filled with particles, all of which have distance Aj, the entropy would be H2 = logfi/lj) And if the bar is filled with particles all of which have distance A^, the entropy would be Hn = log(£ / AN ) . Suppose now the bar is filled with particles of various free distances. We will have to use the averaged quantity to represent the amount of disorder of the system, that is, ff = ~ ( l o g « 1 + l o g ^ + " - + logn w } n
(5.5)
where n = w, + n2 + • • • + nN is the total number of particles. If 1 /«, is replaced by probability, we obtain the following generalized entropy concept based on probability theory.
5.1, Disorder and entropy
93
A complete system of events Ai,A2,--',An in probability theory means a set of events such that one and only one of them must occur at each trial (e.g., the appearance of 1,2,3,4,5 or 6 points in throwing a die). In the case N-2 we have a simple alternative or pair of mutually exclusive events (e. g. the appearance of heads or tails in tossing a coin). If we are given the events Ah A3, .... An of a complete system, together with their probabilities JJ, , j % ,-••,/?„ {pi 2; 0, ^ p, = 1), then we say that we have a finite scheme (5.6) ft •••
P»)
In the case of a "true" die, designating the appearance of / points by A, (1 s i £ 6 ), we have the finite scheme
P\
Pi
Pi
P«
Pi
P«
From the finite scheme (5.6) we can generate a sequence of the form AjA1A]A%Aft.., .The sequence is an ordered one if Ai,At,---tAll appear in a predictable way; otherwise disordered. Therefore, every finite scheme describes a state of disorder. In the two simple alternatives
0.5 0.5)
^0.99 0.01
the first is much more disordered than the second. If a random experiment is made following the probability distribution of the first, we may obtain a sequence which might look like AlAiA2AlA2AlAzA[.,., It is hard for us to know which will be the next. The second will be different, and the sequence generated from it might look like AlAlAlAiAlA)AfAx... Ms are almost sure that the next letter is 4 with small probability to make mistake. We say that the first has more amount of disorder than the second. We sometimes use uncertainty instead of disorder by saying that the first is much more uncertain than the second. The correspondence of the two words uncertainty and disorder can be demonstrated by Equation (5.6). Disorder is more suitable for describing the state of the sequences generated from finite scheme (5.6) while uncertainty is more suitable for describing the finite scheme
94
J, Disorder, entropy and entropy estimation
itself. Large uncertainty implies that all or some of assigned values of probabilities are close. In the extreme case, all probabilities are mutually equal, being 1/n , A sequence generated from such a scheme would be highly disordered because each event has equal probability of occurrence in the sequence. On the other extreme, if fee probability for one of the events is much higher than the rest, the sequence produced from such scheme will look quite ordered. So the finite scheme is of low uncertainty. Thus, disorder and uncertainty are two words defining the same state of a finite scheme. The scheme
4
(5-9)
"'}
0.3 0.7 J represents an amount of uncertainty intermediate between the previous two. The above examples show that although all finite schemes are random, their amount of uncertainty is in fact not same. It is thus desirable to infroduce a quantity which in a reasonable way measures the amount of uncertainty associated with a given finite scheme. The quantity
can serve as a very suitable measure of the uncertainty of the finite scheme (5.6). The logarithms are taken to an arbitrary but fixed base, and we always take pk logj% = 0 if pk = 0 . The quantity H(pl,p2,---,pH)i8 called the entropy of the finite scheme (5.6), pursuing a physical analogy with Maxwell entropy in thermodynamics. We now convince ourselves that this function actually has a number of properties which we might expect of a reasonable measure of uncertainty of a finite scheme. 5.1.2 Axioms of entropy Aided by the above arguments, entropy can be rigorously introduced through the following theorem. Theorem 5.1 Let H(p1,p1,---,plt)be
ajunction defined for any integer n and n
for all values- P 1 ,/ 7 2»'"»A
suc
^
tnat
ft £0,(& = l,2,---,«), ^pk
= 1 . If for
any n this fimction is continuous with respect to all its arguments, and if it has the following properties (1), (2), and (3),
5. /. Disorder and entropy
95
n
(1) For given n and for ^pk
= 1, the function H (p,, p2, - • •, pH ) takes its
largest value far pk = 1 / n, {k = 1,2, • • •, n), (2)
H(AB)
^H(A)+HA(B),
(3)
H(p,,pi,—,pll,Q) = ff(pt,p1}---,pj. (Adding the impossible event or any number of impossible events to a scheme does not change its entropy.) then we have ^
(5,11)
where c is a positive constant and the quantity Hd{B) = ^ipkHk{E)
is the
k
mathematical expectation of the amount of additional information given by realization of the scheme B after realization of scheme A and reception of the corresponding information. This theorem shows that the expression for the entropy of a finite scheme which we have chosen is the only one possible if we want it to have certain general properties which seem necessary in view of the actual meaning of the concept of entropy (as a measure of uncertainty or as an amount of information). The proof can be found in Khinchin (1957). Consider a continuous random variable distributed as f(x) on an interval [a,b]. Divide the interval into n equidistance subintervals using knot sequence 4\' ii >'"'»C+i • The probability for a point to be in the &-th subinterval is
(5.12)
where Ax = £i+1 -f t is the subinterval length. Substituting it in equation (5.11) yields
(5.13)
96
5. Disorder, entropy and entropy estimation
The second term on the right hand side of the above equation is a constant if we n
n
n
note iSx^jf{^k)=^iMf{§k)=^ipk
=1 and log Ax is a constant. So only the
first term on the right hand side is of interest. As division number becomes large, H -> so, the first term on the right hand side is just the integral H(f,f) = -c\f{x)\ogf{x)dx.
(5.14)
where two arguments are in the expression H(f,f). In the subsequent sections, we will encounter expression H(f,g) indicating that the function after the logarithm symbol in equation (5.14) is g. Equation (5.14) is the definition of entropy for a continuous random variable. The constant c = 1 is often assumed. A question that is often asked is: since the probability distribution already describes the probability characteristics of a random variable, why do we need entropy? Yes, probability distribution describes the probability characteristics of a random variable. But it does not tell which one is more random if two probability distributions are given. Entropy is used for comparing two or more probability distributions, but a probability distribution describes the randomness of one random variable. Suppose that the entropy of random variable X is 0.2 and that of random variable Y is 0.9. Then we know that the second random variable is more random or uncertain than the first. In this sense, entropy assign an uncertainty scale to each random variable. Entropy is indeed a derived quantity from probability distribution, but it has value of its own. This is quite similar to the mean or variance of a random variable. In fact, entropy is the mathematical expectation of -log/(jc), a quantity defined by some authors as information. Example 5.4 Entropy of a random variable Suppose a random variable is normally distributed as f(x)=
.—- exp V2
From definition (5.14) we have
= -jf(x)loBf(x)dx 00
- J : &*"*{_
exp
-.* 7J log5 -7==-exp - v n 7 \\ttc 2«x J * " W 2 ^ f f ^ | 2cr
(5.15)
5.2. Kullhack information and model uncertainty
97
I
(5.16)
The entropy is a monotonic ftinction of variance independent of the mean. Larger variance means larger entropy and viee versa. Therefore, entropy is a generalization of the concept variance, measuring data scatters around the mean. This is reasonable because widely scattered data are more uncertain than narrowly scattered data. 5.2 Kullback information and model uncertainty In reality show The wheel of Fortune a puzzle with a slight hint showing its category is given to three contestants. The puzzle may be a phrase, a famous person's name, an idiom, etc. After spinning the wheel, the first contestant has a chance to guess which letter is in the puzzle. If he/she succeeds, he/she has the second chance to spin the wheel and guess again which letter is in the puzzle. If he/she fails, the second contestant will spin the wheeltocontinue the game, and so on untilfeepuzzle is unraveled finally. We simplify the example a little. The process for solving the puzzle is in feet a process for reducing uncertainty, that is, entropy. At the very beginning, which letter will appear in the puzzle is quite uncertain. The guessing process is one that each contestant assigns a probability distribution to the 26 letters. As the guessing process continues, more and more information has been obtained. And the probability assigned to letters by each contestant gets closer and closer to the answer. Suppose at an intermediate step the probability distribution given by a contestant is
(fU7)
The question is to solve the puzzle, how much more information is needed? In other words, how far away is the contestant from the true answer? Each contestant speaks loud a letter which he/she thinks should be in the puzzle. Whether the letter is in the puzzle or not, we obtain information about the puzzle. And the letters given by the contestants form a sample, the occurrence probability of which is (5.18) «,!«,!.••«„
Its entropy is
98
J. Disorder, entropy and entropy estimation log p(B) = ~y]—lag qk
(5.19)
where the constant term is neglected. From the large number theorem, we conclude that as the sample size n, becomes large, the above entropy becomes 1
n
- lim — log p{B) = - V pk log qk .
(5.20)
Denoting the term on the right hand of the above equation by
We conclude that H(p,q) is a new entropy concept interpreted as follows. Suppose the true probability distribution is given by equation (5.6). We take a sample from the population, and obtain a probability distribution given by equation (5.17). The entropy estimated by equation (5.21) is entropy H(p,q). Therefore, H(p,q)represents the entropy of the scheme p measured by model if. More precisely, the entropy of a random variable is model-dependent. If a model other than the true one is used to evaluate the entropy, the value is given by H(p,q). We note that tog A is the entropy of the finite scheme under consideration. The difference between H(p,q) and H(p,p) represents the amount of information needed for solving the puzzle, that is, l(p,q) = H(p,q)-H(p,p) = f > t log^-.
(5.23)
I(p,q) is defined as Kullhack information. It may also be interpreted as the amount of uncertainty introduced by using model q to evaluate the entropy of p .
5.2, Kullback information and model uncertainty
99
Theorem 5.2 Kullback l(p,q)has the following properties: (1) (2) J(p,q) = Q if and only if
pk=qk.
Proof; Let x>0 and define function / ( x ) = logx—x+l . f{x) takes its maximum value 0 at point x = l , Thus, it holds that / ( J C ) ^ O . That is, log x < x - 1 . The equality is valid only when x = 1. Setting * = qk I p k , we have
Pk
Pk
and
w *
Pk «
\Pk
j
w
w
Multiplying minus one on both sides of the above equation, we obtain
Jftlog-^->0. The equality holds true only when pk =qk.
(5.26) •
The above concepts can be generalized to continuous random variable. Suppose X be a continuous random variable with pdf f(x). The entropy measured by g(x)is
H(f, g) = ~ J / t o log g(x)dx.
(5.2?)
The difference between the true entropy and the entropy measured by g(x) is the Kullback information l(f, g) = H{f, g) - H(f, / ) = f/(*) I o g 4 ^ •
C 5 - 28 )
J. Disorder, entropy and entropy estimation
100
Besides the above interpretation, I(f,g) may also be interpreted in the following way. Suppose a sample is taken from X, which entropy is H(/, / ) . Because the sample is a subset of the population, it cannot contain all information of the population. Some of information of the population must be missing. The amount of information missing is I(f,g) if g(x) represents the pdf fully describing sample distribution. In this sense, I(f,g) represents the gap between the population and a sample. Theorem 5.3 Kullback information / { / , g) for continuous distribution satisfies (1)
(2) I(f, g) = 0 if and only if f = g.
(5.29)
Kullback information is interpreted as the amount of information missing. The first Property indicate that the missing amount of information is always positive, a reasonable conclusion. The second Property imply that an arbitrarily given distribution cannot fully carry the information contained by another distribution unless they are same. Note that Kullback information is not symmetric, that is,
/(/.ir)
(5.30)
Example 5.5 Entropies of two normal random variables Suppose two normal random variables are distributed as 1
/(*) =
-exp
and g(x) =
1
2a
I -exp •jhi;
From definitions (5.27) and (5.28) we have
#,*) = -]/(*) log *(*)<* 1
I (x-{tf |. I 1 I (x-vf PI - h r 5 ^ Jog^ ^ ^ e x p l - It1
ex
J
'-hk^-^l*
i
^ r
(x-y) 2
2r2
dx
2r2
(5.31)
5.2. Kullback information and model uncertainty
101
(5.32a) Comparing equation (5.16) we see that the entropy H{f,g) is no longer independent of the mean. It is a function of the variances and means of these two random variables. The two variances cannot be exchanged in equation (5.32a), and thus H(f, g) is not symmetric, satisfying equation (5.30). Kullback information is
exp -
-f i =
r
—
exp -
2o
"
Similarly, we have
gf
g\
^ \
i
( f ]
(5.32c)
Summing up equations (5.16) and (5.32b) yields (5.33) The term on the right hand side of equation (5.33) is obtained from equation (5.32a). The above equation has validated equation (5,28) through a particular example. It is not difficult to verify equation (5.30) through equations (5.32b) and (5.32c), but the procedure is a little lengthy, thus neglected here.
102
J. Disorder, entropy and entropy estimation
Theorem 5.4 If the pdf of a random variable isf(x), then for any statistical model g(x) other than f(x), the entropy H(f,f) is smallest. In notation, H(f,g)ZH(f,f).
(5.34)
Proof. Applying equation (5.29) to equation (5.28) leads immediately to equation (5.28). • Kullback information / ( / , g) is a useful concept, having played an important role in communication theory. But here we emphasize its influence on statistics. Statistics is characterized by two typical procedures: estimation and hypothesis testing. Both share one thing in common: inference as we mentioned in Chapter 2. The essence of inference is to make decision based on incomplete information. Suppose the true pdf (model) of a random variable is f(x) which is not known. What is known is a candidate model g(x) which can be obtained from a sample. Because the size of a sample is finite, it cannot contain all the information in the population. g(x) determined through the sample is generally not equal to f(x). Thus, the two models f(x) and g(x) are different, resulting in model uncertainty. In a statistical inference problem, the true model f(x) is not known. What is known is the sample-determined model g(x). All decisions are made based on g(x). Because of model uncertainty, the entropy evaluated by g(x) cannot be equal to the true entropy H(f,f). Equation (5.34) shows that the entropy predicted by any pdf (or model) other than the true pdf (or model) is larger than the true entropy. If we want to estimate the entropy H(f, / ) , we need to know the entropy resulting from model uncertainty. Consider a special case in that the difference between two probability distributions / and g is small such that Taylor expansion is valid. The difference between the two distributions is estimated by the relative percentage squared,
f
(5.35)
Because A is • a function of x, the total difference should be an integral with respect to x weighted by / . So we have
5.2. Kullback information and model uncertainty
J/Acfe = J/ii^pldk = J(g - f)^p-dx.
103
(5.36)
If x is small, we have expansion (5.37) So equation (5.36) is rewritten in the form of
[l +
% ss..
(5.38)
The right hand side is in fact the divergence of the two probabilities, because
J(f ~ / ) log[l+tejQdx = f(g - / ) log^r dx ^ & + J/log^ A =J(f,g)
(5.39)
Using symbols for Kullback information, the last term is ).
(5.40)
J(f,g) has a special name called divergence oftwopmbability distributions. It measures the uncertainty resulting from evaluating entropy/ by use of model g, quantifying the difference between two probability distributions. Therefore, J{f*g) defines model uncertainty. J{f, g) has the following properties. Theorem 5.5 Divergence of two probability distributions J(f,g) satisfies
(1) J(f,g)>Q, (2) JW,g) = J(g,f), (3)
(5.41)
Jif, g) = 0 if and only iff = g.
But J(f, g) does not satisfy the triangle inequality ).
(5.42)
104
J. Disorder, entropy and entropy estimation
Proof: These properties are natural conclusions obtained from the definition of the divergence of two probability distributions, J(f, g) = / ( / , g) + I(g, f). o Example 5.6 Divergence of two probabilities Consider Example 5.5 with the two probability density functions given by equation (5.31). Direct summing up equations (5.32b) and (5.32c) yields J(f,g) = I(f,g) + Kg,f)
(5.43)
It is not difficult to see that J(f, g) > 0 if we note that the first two terms on the right hand side of equation (5.43) are no less than zero and the third term is no less than zero, too. The difference of two normal distributions is characterized by two parameters: their variances and means. If the means of two normal random variables are same, the difference is described by the second term on the right hand side of equation (5.43), being a function of their variances only. If their variances are same, their difference is a function of only their means given by the third term, hi general, the difference is given by equation (5.43), a junction of their variances and means. We now return to the discussions of model uncertainly. In section 2.3.2, we mentioned sampling error resulting from the difference of a statistic and the true parameter under consideration. What is missing there is there is the errors resulting from misuse of models. If a model is not properly specified, errors will be induced. In traditional statistics, an underlined assumption is that the true model is always known. What is unknown is the parameter(s) present in the model. Introduction of information theory into statistics indicates this is not enough. If model uncertainty is not correctly handled, non-statistical errors will come into our analysis and misleads our subsequent statistical decision. Model uncertainly is one of the most important contributions information theory has made to statistics. In a typical statistical inference problem, therefore, we find two sources of uncertainties: one resulting from the random variable itself and the model uncertainty. In notation, the total statistical entropy (TSE), is = H(f,f)+J(f,g). It may also be written in the following forms = H(f,f)+J(f,g) = mf,f)+I(f,g)+I(g,f)~H(f,g)+I(g,f)
(5.44)
5,3. Estimation of entropy based on large samples
105
The first term on the right hand side of equation (5.45) is the entropy of/measured by g, and the second term is the amount of information missing in the process of replace g by / In other words, if g is known, and from g to find / the required amount of information is I(g,f) .This is Sample X schematically shown in Figure 5.2 . Two procedures are shown in the figure. In the first procedure indicated by a downward arrowed line, a sample is drawn from/ Including I(g,f) in equation Figure 5.2 Uncertainty present in (5.45) is important. This can be seen inference process by comparing equations (5.16),(5.33) and (5.43). In these equations, if we let T2 -> 0, we may see that TSE has a form closer to # ( / » / ) than # ( / , g ) , t h a t i i , (5.46) he first term on the right hand side is as same as that of H(f,f)in equation (5.16) while this term does not turn up in the expression for ff(f,g). It is plausible that TSE has stronger capability to recover / from g than H ( / , g). This will be numerically demonstrated in the subsequent chapters. 5.3 Estimation of entropy based on large samples In the above, discussions are focused on entropies which are defined in the framework of probability theory. We now turn to the entropy carried by a sample, so discussing the issue more in the framework of statistics. If the pdf of a random variable (or the distribution of a finite scheme) is known, the above-mentioned entropies can be calculated through simple manipulations'. In most cases in applications, however, the pdf is not known. And statistical estimation must be employed to find the distribution through a random sample taken from the population. In Chapter 2, it was mentioned that a good estimator is required to have three properties: consistency, unbiasedness and efficiency. It is also pointed out there
5, Disorder, entropy and entropy estimation
106
that the first two properties are very important. Estimators constructed for estimating entropies are also required to be consistent, unbiased and efficient. Unfortunately, to construct an unbiased estimator of entropy is not an easy work to do. We thus relax the restriction, trying to construct asymptotically unbiased estimator. In other words, these estimators are unbiased only when the sample size is large. It is well known that entropy estimation is not trivial. Statistical fluctuations of the random sample used to estimate unknown parameters induce both statistical and systematic deviations of entropy estimates. In the naive ('likelihood') estimator one replaces the pdf / ( * ) in the Shannon entropy / / ( / , / ) = - J / ( x ) log f{x)dx by an estimate / ( x ) , More precisely, the naive estimator (5.47) leads to a systematic underestimation of the entropy H. Take M-L estimator for instance. If a sample is drawn from the population, and the unknown parameter is estimated from the sample, the entropy estimated by use of equation (5,47) would yield a value smaller than the true value in most cases. Therefore, numerous studies have been conducted to build unbiased estimators for entropy. In the following, a variety of estimators for various entropies are given.
-0.2 0
50
100
150
100
Figure 5.3 Naive estimate of entropy Htfj) using equation (5.47)
5. J, Estimation of entropy based on large samples
107
Example 5.7 Biased estimate of entropy Reconsider Example 5.4 . The entropy of a normal random variable is
H(f,f) = In this example we assume that the variance is 1. If a sample is drawn from the population, the unbiased estimate of the variance a1 is
(5-48)
108
5. Disorder, entropy and entropy estimation
Theorem 5.6 The asymptotically unbiased estimator of entropy H(X | a 0 ) = / / ( / , / } based on a large sample of size ns is (5.50)
Proof. Expand_/fx|a) around the true value of a0
i 2
(5.51)
,daf
where Aat =at-e^ is the difference between estimated and true values. If the following notations are used, (,52)
equation (5.51) becomes
fix| •) « /" +f^Aa, +L-f£ 9a,
(5.53)
2 dafiaj
Similarly we have the following expansion for logarithm
I a) * log/ 0
(5.54)
2
Therefore, we have
H(f, / ) = H(X | a) = - J/(* | a) log / ( * | &)dx
•dx.
(5.55)
J. J. Estimation of entropy based on large samples
109
In the above, we replace H{f,f) by H{X\k) to reveal the influence of statistical fluctuations. In short form, equation (5.55) is (5.56) where %j*Q
•0
da, T =-
^log/0
(5.57a)
da,
3f° 9log/0 -dx AaAa, 1 da,
\r
log/
(5.57c)
•dx
flog/" J
(5.57b)
da.
dx iMfAaj. dajdaj
(5.57d)
\
The first term is the entropy measured by the frue model, denoted by (5.58) The term T, is a normal random variable because each Aaf is a normal random variable with zero mean. The mean of T, about sample is thus zero, that is, £j7;=0.
(5.59)
Estimates of the rest three terms are generally not zero. They result in biased estimate of entropy. To estimate T21, we rewrite matrix in Tlt in the form of m
r J
0a,
da,
(5.60)
which is the expectation of the product of tiie two logarithmic terms in the equation above. In other words, we have the following relationship
5. Disorder, entropy and entropy estimation
110 \dlog/0
1
dlog/8"
da
*
(5.61)
da
J.
This is the fisherian information matrix. With the notations defined above, T2\ becomes (5.62) Aa,. is asymptotically a normal random variable with zero mean, and thus F2/ is a random variable following chi-square distribution with mean nfins, where nf is the number of free parameters in the model. Taking average on both sides of the equation above yields the following estimate (5.63)
To estimate 7#, we rewrite Equation (5.57c) in the following form
? Sa, daj f dafieij
tbc.
(5.64)
The first term on the right hand side is
0
-If
r051og/°
1 % da,
aa,.
Slog/0 5Si.
(5.65)
and the second term on the right hand side is
*
0
J/
f dafia
52
f;
(5.66)
Because the integral of probability density function is a constant. Therefore, the asymptotically unbiased estimate of TJ2 is
5.3. Estimation of entropy based on large samples \ E^MM 2
i^. 2 n.
111 (5,6?)
Estimating T33 is a little complicated. In Chapter 4, we introduced that a continuous function can be approximated to any desired degree by a polynomial ft
of the form f(x) = J ^ x * » where ak (1 <; k <, n) are unknown coefficients. If we use a polynomial to approximate pdf f(x), we then conclude that the second derivatives of f{x) are zeros, meaning that Tn = 0. In summary, we have )] = H(x\&)+^. 2 n,
(5.68)
Thus, the asymptotically unbiased estimator of the Shannon entropy is given by the term on the right hand side of equation (5.68), that being the right hand side of equation (5.50). a The proof is lengthy, but the return is rewarding. It not only points out that the naYve estimator (5.47) does yield a systematic deviation from the true value, but also quantify the systematic deviation. This estimator is applicable to continuous and discrete cases. In the discrete case, the first term on the right hand side is replaced by (5.69) where q is maximum likelihood estimator of frequency. And nf is the cell number of histogram minus 1. Example 5.8 Bias estimation Reconsider Example 5.7. Two cases were considered in Example 5.7. For the ease of sample size 10, the bias is obtained from the figure to be around -0.05. Based on equation (5.50), the bias should be - n / / 2 « , =-1/20 = -0.05 . The numerical example and theoretical evaluation is same. For the case of sample size 50, the theoretical value for the bias is -nf I2nt= -1/100 = -0.01. The bias obtained from numerical calculation is -0.011. In this example, only one parameter a1 is present. Thus, «/=!.
112
J. Disorder, entropy and entropy estimation
If the naive estimator is used, then not only Shannon entropy is biased but also Kullback information is biased. We now turn to constructing asymptotically unbiased estimator of Kullback information. From the definition, we have (5.70)
I[f,f(X | The second term on the right hand side is
(5.71)
1og/°«fr- \f
Using expansion of l o g / ( J f | a ) in equation (5.45) and the definition of the entropy defined by the true model yields the following equation
/[/°,/Pf|a)]=J/° log fdx (5.72)
dx
The first and second terms on the right hand side of the equation above cancel each other with only the third term left. Therefore, we obtain
4
(5.73)
21 Equation (5.56) already predicts the unbiased estimate of Tn to be
nflns.
Theorem 5,7 The asymptotically unbiased estimator of Kullback information is
In,
(5.74)
With these, we may obtain the estimator for the third entropy we are interested in. Theorem 5.8 The asymptotically unbiased estimator of H(f, g) is (5.75)
J, J, Estimation of entropy based on large samples
113
The fourth estimator is about the divergence of two probabilities. It measures the difference between two probabilities. In this case, it measures the difference between the true model f(x) = f(x\a") and the candidate model
Theorem 5.9 The asymptotically unbiased estimator of J(f,g) is •) = X
(5.76)
Example 5.9 Estimation of various entropies In this example, numerical values are given to demonstrate the asymptotically unbiased estimators constructed in theorems 5.7~ 5.9. The normal distribution is used for the example. As before the unbiased estimate of the variance cr2 is
The true and estimated models (pdfs) are, respectively
^ l
(5.78)
From definitions we have ,
1,
o-2
1
ff2
1.
.,
1
1
(5.79b) In the above equations, tr2 = 1 is assumed. Repeat the calculations performed in Example 5.7, we obtain numerical values for the quantities in the above as shown in Figure 5.4. Their theoretical values are also given in the figure. The particular values for
114
5. Disorder, entropy and entropy estimation
Kullback information and the divergence are very close to their theoretical values derived in this section. What is missing in the figure is H(f, g ) , which is a function of I{f,g). If the latter is accurately estimated, then H(f,g) accurately estimated, too.
can be
0.1
J(f,g)
Figure 5.4 Numerical example demonstrating the asymptotically unbiased estimators and the distance to their theoretical values
5.3.2 Asymptotically unbiased estimator of TSE and AIC
The purpose of finding all the previous estimators is to find the estimator for TSE.
Theorem 5.10 The asymptotically unbiased estimator of TSE is
ME = \hat{H}(f,\hat{f}) + \hat{J}(f,g) = H(X | \hat{a}) + \frac{3n_f}{2n_s}.
(5.80)
Note that a special name, measured entropy (ME), is used to denote the estimator of TSE here. This is a useful estimator. It states an unpleasant fact. Suppose we are given a model f(x | a) with unknown parameter a. Using a sample from the population to estimate the unknown parameter a, we obtain \hat{a} and f(x | \hat{a}). Equation (5.80) shows that f(x | \hat{a}) may not be the true model f(x | a). Estimation always
induces some amount of uncertainty. The second term on the right hand side of equation (5.80) deserves particular attention because it quantifies the systematic deviation of entropy estimation. An analogue is a digital camera. The quality of a digital camera is assessed by its pixel count: the more pixels, the clearer the photos. The second term on the right hand side is inversely proportional to the "pixels" here: the smaller the term, the closer the candidate model is to the true model. Another quantity of interest is the unbiased estimation of the log-likelihood function.

Theorem 5.11 The asymptotically unbiased estimator of the log-likelihood function

L(X | \hat{a}) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log f(x_i | \hat{a})    (5.81)

is

\hat{L} = H(X | \hat{a}) + \frac{n_f}{n_s}.    (5.82)
Proof. Rewrite equation (5.54):

\log f(x | \hat{a}) \approx \log f^0 + \sum_{i}\frac{\partial \log f^0}{\partial a_i}\,\Delta a_i + \frac{1}{2}\sum_{i}\sum_{j}\frac{\partial^2 \log f^0}{\partial a_i \, \partial a_j}\,\Delta a_i \, \Delta a_j.    (5.83)
Substituting it into equation (5.81) yields

L \approx -\frac{1}{n_s}\sum_{t=1}^{n_s}\log f^0(x_t) - \frac{1}{n_s}\sum_{t=1}^{n_s}\sum_{i}\frac{\partial \log f^0(x_t)}{\partial a_i}\,\Delta a_i - \frac{1}{2n_s}\sum_{t=1}^{n_s}\sum_{i}\sum_{j}\frac{\partial^2 \log f^0(x_t)}{\partial a_i \, \partial a_j}\,\Delta a_i \, \Delta a_j.    (5.84)
As the sample size is large, the first term approaches

-\frac{1}{n_s}\sum_{t=1}^{n_s}\log f(x_t | a^0) \to H(f,f)    (5.85)

based on the law of large numbers. The second term, as mentioned above, will asymptotically approach zero because it is a sum of normal random variables.
The third term will approach

-\frac{1}{2n_s}\sum_{t=1}^{n_s}\sum_{i}\sum_{j}\frac{\partial^2 \log f^0(x_t)}{\partial a_i \, \partial a_j}\,\Delta a_i \, \Delta a_j \to T_s.    (5.86)

This is in fact T_s in equation (5.57c). It is asymptotically a chi-square random variable with n_f degrees of freedom, and its expectation with respect to the sample X is

E_X[T_s] = \frac{n_f}{2n_s}.    (5.87)
We have shown in the previous section that the asymptotically unbiased estimator of H(f,\hat{f}) is

\hat{H}(f,\hat{f}) = H(X | \hat{a}) + \frac{n_f}{2n_s}.    (5.88)

Furthermore, the first term on the right hand side of the above equation can be estimated by use of the following estimator:

H(X | \hat{a}) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log f(x_i | \hat{a}).    (5.89)
Substituting equations (5.87) and (5.89) into (5.84) and taking expectation on both sides, we obtain the asymptotically unbiased estimator for the log-likelihood function L given by equation (5.82). □

A special name is given to this estimator, Akaike Information Criterion, or AIC for short. Historically, it was first obtained by Akaike (Sakamoto, 1993). The AIC here is not in its original form, differing by a factor of 2n_s. It is beneficial to make a comparison between AIC and ME. If the sample is large enough, the log-likelihood function in equation (5.81) asymptotically satisfies

L(X | \hat{a}) \to H(f,g)    (5.90)

if g(x) = f(x | \hat{a}) is used. Referring to Figure 5.2, AIC is an estimator of the entropy contained in the
sample X. From the definition, ME estimates the uncertainty associated with the total statistical process. AIC predicts only the entropy present in the estimation process, without considering whether the model can recover the true model.

Example 5.10 ME and AIC The true and estimated models (pdfs) are, respectively,

f(x) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right),    (5.91)

\hat{f}(x) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}}\exp\!\left(-\frac{x^2}{2\hat{\sigma}^2}\right).    (5.92)
The theoretical value of TSE and its estimate are, respectively,

TSE = H(f,f) + J(f,g) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2} + \frac{\sigma^2}{2\hat{\sigma}^2} + \frac{\hat{\sigma}^2}{2\sigma^2} - 1,    (5.93a)

ME = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log\hat{\sigma}^2 + \frac{1}{2} + \frac{3n_f}{2n_s}.    (5.93b)
Comparison of these two quantities is plotted in Figure 5.5. The two quantities are pleasantly close to each other, numerically validating equation (5.80).
Figure 5.5 Numerical comparison of TSE and ME
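A rough numerical check of this comparison can be sketched as follows. The snippet below is an illustrative Monte Carlo experiment (not the book's code): it assumes the zero-mean normal setting of this example with n_f = 1, uses the closed-form normal expressions derived above for TSE and ME, and averages both over repeated samples. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def tse_and_me(n_s, n_rep=2000):
    """Average theoretical TSE and its estimator ME over n_rep samples of size n_s.
    Zero-mean normal model, true variance sigma^2 = 1, one free parameter (n_f = 1)."""
    tse_vals, me_vals = [], []
    for _ in range(n_rep):
        x = rng.normal(0.0, 1.0, size=n_s)
        s2 = np.mean(x**2)                              # M-L estimate of the variance
        # theoretical TSE = H(f,f) + J(f,g) for the fitted model
        tse = 0.5*np.log(2*np.pi) + 0.5 + 0.5/s2 + 0.5*s2 - 1.0
        # ME = H(X|a_hat) + 3 n_f / (2 n_s)
        h_naive = 0.5*np.log(2*np.pi*s2) + np.mean(x**2)/(2*s2)   # = -(1/n_s) sum log f(x_i|a_hat)
        me = h_naive + 3.0/(2*n_s)
        tse_vals.append(tse)
        me_vals.append(me)
    return np.mean(tse_vals), np.mean(me_vals)

for n_s in (10, 50, 200):
    tse, me = tse_and_me(n_s)
    print(f"n_s={n_s:4d}  mean TSE={tse:.4f}  mean ME={me:.4f}")
```

As the sample size grows, the two averages should approach each other, mirroring the closeness seen in Figure 5.5.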
5.4 Entropy estimation based on small samples
All estimators obtained in the previous section have been obtained based on the assumption that the sample size is large. In the case of small samples, we will employ different techniques, as outlined in the following. In section 2.3.2, we have seen that the sample mean is a random variable, too. If the sample size n_s is very small, say n_s = 1, the sample mean and the random variable are identically distributed. On the other hand, if the sample size n_s is large, the sample mean is asymptotically distributed as a normal random variable. If the sample size is not so large as to guarantee that the sample mean is close to the normal distribution, nor so small as to enable us to compute the sample distribution by using the method presented in section 2.2.2, then we have to develop new methods for estimation. The two cases mentioned above, very small samples and large samples, share one thing in common: the sample mean is treated as a random variable. For the case in between the two extremes, it is natural to assume the sample mean is a random variable, too. Therefore, in the most general case, the unknown parameter \theta in f(x | \theta) is treated as a random variable. In doing so, we change from the traditional statistics practice of determining the parameter \theta through the sample X into determining the distribution of the parameter \theta through the sample X. In notation,

X \to \theta \;\Longrightarrow\; X \to P(\theta | X)
(5.94)
where P(\theta | X) is the pdf of the parameter \theta to be estimated from the given sample X. This is the basic assumption of Bayesian statistics. In the framework of Bayesian statistics, P(\theta | X) is written in the form of

P(\theta | X) = \frac{P(X | \theta)\,P(\theta)}{\int P(X | \theta)\,P(\theta)\,d\theta}    (5.95)
where, in Bayesian language,

P(X | \theta) = \prod_i f_X(x_i | \theta) is the sample occurrence probability given \theta,
P(\theta) is the prior distribution of \theta,
P(X) = \int P(X | \theta)\,P(\theta)\,d\theta is the sample occurrence probability, or marginal probability,
P(\theta | X) is the posterior distribution of \theta.    (5.96)
Now consider the problem of how to measure the uncertainty of the parameter \theta. Such uncertainty comes from two sources: the sample itself, and the uncertainty of \theta after the realization of the sample. The uncertainty associated with the sample is

H(X) = -\int P(X)\log P(X)\,dX = -E_X \log P(X).
(5.97)
In the framework of Bayesian statistics, P(X) is given by equation (5.96). Consider two events: A = \theta and B = \theta | X. Then the uncertainty associated with the parameter \theta is described by

H(\theta) = H(X) + H_X(\theta)
(5.98)
if property (2) of Theorem 5.1 is employed. The uncertainty associated with event B is defined by

H_X(\theta) = \int P(X)\,H(\theta | X)\,dX = E_X\,H(\theta | X)
(5.99)
where

H(\theta | X) = -\int P(\theta | X)\log P(\theta | X)\,d\theta = -E_{\theta | X}\log P(\theta | X).
(5.100)
Equation (5.97) shows that -\log P(X) is an unbiased estimator of H(X), and equation (5.100) shows that -\log P(\theta | X) is an unbiased estimator of H(\theta | X). Therefore, we obtain the unbiased estimator of the entropy of the parameter \theta, that is,

\hat{H}(\theta) = -\log P(X) - \log P(\theta | X).
(5.101)
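The two building blocks of equation (5.101) can be computed numerically for a simple case. The sketch below is illustrative only: it assumes a normal likelihood with known unit variance and a normal prior for the mean \theta, evaluates the marginal probability P(X) of equation (5.96) and the posterior of equation (5.95) on a grid, and then forms the estimate of (5.101) at the posterior mode (one possible choice of \theta at which to evaluate it).

```python
import numpy as np

rng = np.random.default_rng(1)

# assumed setup (illustrative): x_i ~ N(theta, 1), prior theta ~ N(0, 2^2)
x = rng.normal(1.0, 1.0, size=20)

theta = np.linspace(-10.0, 10.0, 4001)          # grid over the parameter
dtheta = theta[1] - theta[0]

# log P(X | theta) on the grid
loglik = -0.5*len(x)*np.log(2*np.pi) - 0.5*((x[:, None] - theta[None, :])**2).sum(axis=0)
log_prior = -0.5*np.log(2*np.pi*4.0) - theta**2/(2*4.0)

joint = np.exp(loglik + log_prior)
p_x = np.sum(joint) * dtheta                    # marginal probability P(X), eq. (5.96)
posterior = joint / p_x                         # posterior P(theta | X), eq. (5.95)

i_map = np.argmax(posterior)                    # posterior mode (illustrative evaluation point)
h_theta = -np.log(p_x) - np.log(posterior[i_map])   # eq. (5.101)
print("-log P(X)        =", -np.log(p_x))
print("posterior mode   =", theta[i_map])
print("H_hat(theta)     =", h_theta)
```

The same grid evaluation of -log P(X) is what underlies the prior-selection criterion discussed in section 5.5.2.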
5.5 Model selection
Model selection will be specially treated in Chapters 6 to 10. We will, however, present the general theory of model selection here. Traditional statistics has focused on parameter estimation, implicitly assuming that the statistical model, or the pdf of a random variable, is known. This is in fact not the case. In real-world applications, and in most cases, the model is unknown. Referring to Figure 4.1, we have shown three possible models (lognormal, normal and Weibull) to approximate the unknown pdf under consideration. Purely from the graph, it is hard to decide which is the best fit to the observed data. Therefore, we do not have any convincing
support to assume that the statistical model under consideration is known. This signifies a shift from the traditional assumption that the statistical model under consideration is known with unknown parameters, to the contemporary view that the statistical model under consideration is also unknown, with unknown parameters. Such a slight change makes a big difference, because this problem has not been discussed in traditional statistics. In summary, in modern statistics, both the model and the parameters are unknown. How to determine a statistical model is thus the biggest challenge we face once we drop the assumption that the model under consideration is known. It is at this point that information theory comes into play. The prevailing solution is as follows: suppose we are given a group of possible statistical models. We are satisfied if there exists some criterion able to tell us which model is the best among all possible models. We call this procedure model selection. By model selection, we change the problem from determining statistical models to selecting models. To be able to select the best model, we thus need to do two things. The first is that the group of possible models should be so flexible and universal that the model we are after is included in the group of possible models. This problem has been touched on in Chapter 4, being a procedure for function approximation. The second is that we have a criterion at hand to help us select the best model from the group of possible models. This problem is the main focus of this section.

5.5.1 Model selection based on large samples
Suppose we are given a group of possible models \{f_i(x | \theta_i)\} (i = 1, \ldots, m), each of which contains an unknown parameter \theta_i. Draw a sample X from the population, and use the sample to estimate the unknown parameter \theta_i in each
model by some method (the M-L method, say), yielding the estimate \hat{\theta}_i. Suppose the true model is among the group \{f_i(x | \theta_i)\}. Then the true model must minimize TSE, and Theorem 5.10 gives the asymptotically unbiased estimate of TSE, denoted by ME. Combining the two together, we obtain the criterion for model selection.

Theorem 5.12 Among all possible models, the true model minimizes

ME = \hat{H}(f,\hat{f}) + \hat{J}(f,g) = H(X | \hat{\theta}_i) + \frac{3n_f}{2n_s} \to \min,    (5.102)

where \hat{\theta}_i is the M-L estimate of the unknown parameter \theta_i based on large samples.
An alternative criterion for model selection is AIC.

Theorem 5.13 Among all possible models, the true model minimizes

AIC = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log f(x_i | \hat{\theta}) + \frac{n_f}{n_s} \to \min.    (5.103)
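The criterion of Theorem 5.13 is easy to apply in practice. The following minimal sketch (illustrative, not from the book) fits two hypothetical candidate families to the same sample by maximum likelihood — an exponential model with one free parameter and a lognormal model with two — and selects the one with the smaller AIC of equation (5.103).

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.5, size=200)     # sample from the (unknown) population
n_s = len(x)

# Candidate 1: exponential model, one free parameter (n_f = 1)
lam = 1.0 / x.mean()                                  # M-L estimate
loglik_exp = np.sum(np.log(lam) - lam*x)

# Candidate 2: lognormal model, two free parameters (n_f = 2)
mu, s2 = np.log(x).mean(), np.log(x).var()            # M-L estimates on the log scale
loglik_ln = np.sum(-np.log(x) - 0.5*np.log(2*np.pi*s2) - (np.log(x) - mu)**2/(2*s2))

# AIC of equation (5.103): minus the mean log-likelihood plus n_f / n_s
aic = {"exponential": -loglik_exp/n_s + 1/n_s,
       "lognormal":   -loglik_ln/n_s + 2/n_s}
best = min(aic, key=aic.get)
print(aic, "-> selected model:", best)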
It should be pointed out that there are possibly other criteria for model selection. But to the best knowledge of the author, applying ME and AIC to model selection has been studied most extensively up to now, and thus only these two criteria are introduced here. Theorems 5.12 and 5.13 solve our problem of model selection. They are very useful tools for large-sample problems.

Example 5.11 Model selection: polynomial regression (Sakamoto et al., 1993) In Table 5.1 is given a group of paired data (x_i, y_i), which are plotted in Figure 5.6. We want to know the relationship between x and y. The general relationship between the pair as shown in Figure 5.6 can be approximated by use of a polynomial (polynomial regression)

y = a_0 + a_1 x + \cdots + a_m x^m.
(5.104)
In this example, the polynomial regression takes the particular form of

y_i = a_0 + a_1 x_i + \cdots + a_m x_i^m + \varepsilon_i    (5.105)
where \varepsilon_i are independent normal random errors and m is the degree of the regression polynomial. This model is the sum of a polynomial in the deterministic variable x_i and a random error \varepsilon_i, resulting in a random variable y_i. This regression polynomial of degree m is a normal variable y_i with a_0 + a_1 x_i + \cdots + a_m x_i^m as the mean and an unknown \sigma^2 as the variance; that is, its pdf is given by

f(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2}{2\sigma^2}\right).    (5.106)
In the following, this regression polynomial of degree m is written as MODEL(m). MODEL(0) is such a regression polynomial that it is
independent of the variable x, distributed as N(a_0, \sigma^2). This model has two unknown parameters, a_0 and \sigma^2. MODEL(1) is such a regression polynomial that it is distributed as N(a_0 + a_1 x_i, \sigma^2), with three unknown parameters a_0, a_1 and \sigma^2. Similarly, MODEL(2) is a parabolic curve. If observed data (x_i, y_i) are given, we may perform regression analysis to find the unknown parameters involved in the model. The detailed procedure is given in the following for easy understanding.
Table 5.1 Observed data pairs

i:     1      2      3      4      5      6      7      8      9      10     11
x_i:   0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
y_i:   0.012  0.121  -0.097 -0.061 -0.080 0.037  0.196  0.077  0.343  0.448  0.434

Figure 5.6 Observed paired data (x_i, y_i)
(1) Likelihood analysis. Suppose n sets of data (x_1, y_1), \ldots, (x_n, y_n) are given. They are to be fitted using MODEL(m):

y_i = a_0 + a_1 x_i + \cdots + a_m x_i^m + \varepsilon_i.

The pdf of the model is given in equation (5.106). Then the likelihood function of these n sets of data is
L(a_0,\ldots,a_m,\sigma^2) = \prod_{i=1}^{n} f(y_i) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2\right).    (5.107)
The corresponding log-likelihood is then given by

\log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2.    (5.108)

(2) Likelihood maximization. If the maximum likelihood method is employed, the unknown parameters can be found by maximizing the log-likelihood function in equation (5.108). This is equivalent to minimizing

S = \sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2.    (5.109)

This is in fact the least-squares method. Summarizing the above procedure, we conclude that the problem of polynomial regression is reduced to the least-squares method. To minimize S in equation (5.109), a_0,\ldots,a_m must satisfy the conditions

\frac{\partial S}{\partial a_j} = 0, \qquad j = 0,1,\ldots,m.    (5.110)
From these equations, we may find the system of linear equations the M-L estimates \hat{a}_0, \ldots, \hat{a}_m should satisfy:
\begin{pmatrix} n & \sum x_i & \cdots & \sum x_i^m \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{m+1} \\ \vdots & & & \vdots \\ \sum x_i^m & \sum x_i^{m+1} & \cdots & \sum x_i^{2m} \end{pmatrix}\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^m y_i \end{pmatrix}.    (5.111)
Solving this system of linear equations gives the M-L estimates. One more parameter, \sigma^2, remains undetermined. By differentiating the log-likelihood in equation (5.108) with respect to \sigma^2, we may find the equation the M-L estimate \hat{\sigma}^2 should satisfy:

\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_i - \cdots - a_m x_i^m\right)^2 = 0.    (5.112)

Solving this equation yields the condition the variance must meet:
\hat{d}(m) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_m x_i^m\right)^2.    (5.113)

Here \hat{d}(m) denotes the variance \sigma^2 corresponding to MODEL(m). For example, MODEL(-1) is a simplified notation representing the case that y_i is independent of x, so that y_i is a normal random variable with zero mean and variance \sigma^2, that is, y_i \sim N(0,\sigma^2). Using these symbols, the maximum log-likelihood is

\ell(y_1,\ldots,y_n | \hat{a}_0,\ldots,\hat{a}_m,\hat{\sigma}^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\hat{d}(m) - \frac{n}{2}.
(5.114)
(3) Model selection. In MODEL(m), there are m+2 parameters: the regression coefficients
a_0,\ldots,a_m and the variance \sigma^2. Based on equation (5.93b) for calculating ME in the case of the normal distribution, we have, after some simple manipulations,

ME = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log\hat{d}(m) + \frac{1}{2} + \frac{3n_f}{2n_s},    (5.115a)

and AIC is, after simple manipulations,

AIC = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log\hat{d}(m) + \frac{1}{2} + \frac{n_f}{n_s},    (5.115b)
where the number of free parameters is n_f = m+2 and the sample size is n_s = 11. With the above preparations, we reconsider the data given at the beginning of this example. Straightforward calculations yield

\sum_{i=1}^{11} x_i = 5.5, \quad \sum_{i=1}^{11} x_i^2 = 3.85, \quad \sum_{i=1}^{11} x_i^3 = 3.025, \quad \sum_{i=1}^{11} x_i^4 = 2.5333,

\sum_{i=1}^{11} y_i = 1.430, \quad \sum_{i=1}^{11} x_i y_i = 1.244, \quad \sum_{i=1}^{11} x_i^2 y_i = 1.11298, \quad \sum_{i=1}^{11} y_i^2 = 0.586738.
ME and AIC values are then calculated for the different models. For example, the results for MODEL(-1) (zero-mean normal distribution) are

ME = \frac{1}{2}\log(2\pi) + \frac{1}{2} + \frac{1}{2}\log 0.0533 + \frac{3}{2}\times\frac{1}{11} = 0.0894,

AIC = \frac{1}{2}\log(2\pi) + \frac{1}{2} + \frac{1}{2}\log 0.0533 + \frac{1}{11} = 0.0439.

Continuing similar calculations, we determine the regression polynomials of various degrees and find the variances corresponding to each regression polynomial. Based on such calculations, both ME and AIC values can be easily evaluated. In Table 5.2, such calculation results are summarized up to degree 5. As the number of free parameters increases, the variance decreases fast as long as the number of free parameters is smaller than 3. As the number of free parameters rises above 4, the variance does not change much. Both ME and AIC are minimized when the second-degree regression polynomial is employed.
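The calculations summarized in Table 5.2 below can be reproduced with a short script. The sketch that follows is illustrative (variable names are hypothetical): it uses the data of Table 5.1, fits each polynomial by least squares as in equation (5.111), computes the variance of equation (5.113), and evaluates ME and AIC from equations (5.115a) and (5.115b); up to rounding, the output matches the table.

```python
import numpy as np

# Data of Table 5.1
x = np.arange(11) / 10.0
y = np.array([0.012, 0.121, -0.097, -0.061, -0.080, 0.037,
              0.196, 0.077, 0.343, 0.448, 0.434])
n = len(y)

print("degree  n_f  variance      ME        AIC")
for m in range(-1, 6):
    if m == -1:                       # MODEL(-1): zero-mean normal, no regression part
        resid = y
    else:                             # least-squares fit of a degree-m polynomial
        X = np.vander(x, m + 1, increasing=True)      # columns 1, x, x^2, ..., x^m
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
    d = np.mean(resid**2)             # M-L variance, eq. (5.113)
    n_f = m + 2                       # free parameters: a_0..a_m and sigma^2
    base = 0.5*np.log(2*np.pi) + 0.5*np.log(d) + 0.5
    me = base + 3*n_f/(2*n)           # eq. (5.115a)
    aic = base + n_f/n                # eq. (5.115b)
    print(f"{m:5d} {n_f:4d}   {d:.5f}  {me:8.3f}  {aic:8.3f}")
```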
Table 5.2 Variance, ME and AIC for the regression polynomials of various degrees

Degree   Free parameters   Variance   ME       AIC
-1       1                 0.05326    0.089    0.044
 0       2                 0.03635    0.035    -0.056
 1       3                 0.01329    -0.333   -0.469
 2       4                 0.00592    -0.600   -0.782
 3       5                 0.00514    -0.535   -0.762
 4       6                 0.00439    -0.477   -0.750
 5       7                 0.00423    -0.360   -0.678
Based on the results given in Table 5.2, MODEL(2),

y_i = 0.03582 - 0.49218\,x_i + 0.97237\,x_i^2 + \varepsilon_i,

minimizes both ME and AIC, thus being assessed as the best regression polynomial for the data given in Table 5.1. In fact, the data in Table 5.1 were generated from the following parabolic equation:

y_i = 0.05 - 0.4\,x_i + 0.9\,x_i^2 + \varepsilon_i,

where \varepsilon_i are normal random variables with zero mean and variance 0.01. It is somewhat remarkable that both ME and AIC do pick the best model from the group of candidate models. The coefficients of the two equations do not differ much.

5.5.2 Model selection based on small samples
Bayesian statistics, which is characterized by treating unknown parameters as random variables, has undergone rapid development since World War II. It is already one of the most important branches of mathematical statistics. But a long-standing argument about Bayesian statistics concerns the choice of the prior distribution. Initially, the prior distribution used to be selected based on the user's preference and experience, more subjectively than objectively. Note that for the same statistical problem, we may have different estimates of the posterior distribution due to different choices of prior distribution. The prior distribution is determined a priori. This is somewhat annoying because such
ambiguity in the final results prevents applied statisticians, engineers and those interested in applying Bayesian statistics to various fields from accepting the methodology. The situation has changed in recent years since information theory was combined with Bayesian methodology. The basic solution strategy is to turn the problem into one of model selection. Although there are some rules helping us choose the prior distribution in the Bayesian method, the choice of prior distribution is in general not unique. Suppose we have a group of possible prior distributions selected by different users or by different methods. By using some criterion similar to those for large samples, we are able to find the best prior distribution among the possible candidates. Referring to equation (5.101), we note that the true model for \theta minimizes H(\theta) according to Theorem 5.4 or equation (5.28). Thus, the best prior should minimize

H(\theta) = -\log P(X) - \log P(\theta | X) \to \min.
(5.116)
Note that the above equation is equivalent to the following two minimization procedures:

-\log P(X) \to \min,
(5.117a)
-\log P(\theta | X) \to \min.
(5.117b)
This is because H(\theta) is a linear function of -\log P(X) and -\log P(\theta | X): the minimum of H(\theta) is attained if and only if the two terms on the right hand side of equation (5.116) are minimized. We use a special name, Bayesian Measured Entropy (MEB), to denote -\log P(X). Then, based on equation (5.117a), we have

Theorem 5.14 The best prior distribution must satisfy

MEB = -2\log P(X) \to \min.
(5.118)
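In practice Theorem 5.14 amounts to computing the marginal probability P(X) once per candidate prior and keeping the prior with the smallest -2 log P(X). The following minimal sketch (illustrative only) assumes a normal likelihood with known unit variance and two hypothetical candidate normal priors for the mean, and evaluates the criterion by grid integration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.5, 1.0, size=8)            # a small sample; x_i ~ N(theta, 1) assumed

theta = np.linspace(-20, 20, 8001)
dtheta = theta[1] - theta[0]
loglik = -0.5*len(x)*np.log(2*np.pi) - 0.5*((x[:, None] - theta[None, :])**2).sum(axis=0)

# two candidate priors for theta (illustrative): a tight and a diffuse normal
candidate_priors = {"N(0, 1)":    -0.5*np.log(2*np.pi*1.0)   - theta**2/2.0,
                    "N(0, 10^2)": -0.5*np.log(2*np.pi*100.0) - theta**2/200.0}

for name, log_prior in candidate_priors.items():
    p_x = np.sum(np.exp(loglik + log_prior)) * dtheta   # marginal probability P(X)
    meb = -2.0*np.log(p_x)                               # eq. (5.118)
    print(f"prior {name:10s}  MEB = {meb:.3f}")
```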
Here the constant 2 in front of the logarithm of the marginal probability appears for historical reasons. The criterion is also called the Akaike Bayesian Information Criterion (ABIC) (Akaike, 1978; Akaike, 1989). Equation (5.118) is in fact an integral equation for the unknown prior distribution P(\theta). Theoretically, the solution exists. So equation (5.118) selects
the best prior from the candidate models. In equation (5.117b), \theta is unknown. Using the Bayes rule (5.95), equation (5.117b) states that

-\log P(\theta | X) = -\log\frac{P(X | \theta)\,P(\theta)}{P(X)} \to \min.
(5.119)
In this equation, only the parameter \theta is unknown. Therefore, the above equation yields a point estimate of \theta. It plays the same role as the M-L method, because it degenerates to M-L estimation if the prior distribution is a uniform one.

5.6 Concluding remarks
Entropy and entropy estimation are the basic ideas in this chapter. Based on them, we have constructed a number of unbiased estimators for several entropies. The three most important estimators are ME and AIC for large samples and MEB for small samples. Model selection had not been touched in traditional statistics. It is one of the focuses of modern statistics, and thus deserves special attention. Entropy-based methods have been introduced in this chapter to attack this problem. They will be widely applied in the subsequent chapters.
Chapter 6
Estimation of 1-D complicated distributions based on large samples
A random phenomenon can be fully described by a sample space and random events defined on the sample space. This is, however, not convenient for more complicated cases. The introduction of random variables and their distributions provides a powerful and complete mathematical tool for describing random phenomena. With the aid of random variables and their distributions, sample spaces and random events are no longer needed. Thus, we focus on random variables and their distributions. The most frequently used distribution form is the normal distribution. Although the normal distribution is well studied, it should be noted that the normal distribution is a rare distribution in real-world applications, as we mentioned in the previous chapters. Whenever we apply the normal distribution to a real-world problem, we must a priori introduce a significant assumption. Our concern is thus about how to avoid such an a priori assumption and place the distribution form under consideration on a more objective and sound basis. Attempts have been made to solve the problem by introducing more distribution forms: Weibull, Gamma, etc. We have about a handful of such distribution forms. They are specially referred to as special distributions. Special distributions usually have good properties for analysis, but not enough capability to describe the characteristics of the random phenomena encountered in practical applications. Such examples are numerous. Ocean waves are of primary concern for shipping. Wave height and wave period are random in nature. People began to study the joint distribution of wave height and period at least over one hundred years ago. Millions of sample points have been collected, but their distributions are still under extensive study today, in part due to the lack of powerful distribution forms for describing the random properties of ocean waves. Complicated distributions are not rare in real-world applications, but special distributions are not enough to describe them.
Another serious problem is that the a priori assumption is arbitrarily used in practical applications and analyses. People involved in stock market analysis often complain about the wide use of the normal distribution for analyzing price fluctuations. Whenever the normal distribution or any other special distribution is used, it implies that we force the phenomenon under consideration to vary in accordance with the a priori assumed law. If price fluctuations are not distributed as the special distributions we assume, we put ourselves in fact at the hand of God; any prediction based on such an assumption is prone to be misleading rather than a profitable investment. In summary, at least two concerns must be addressed before we adopt a specific distribution form for the random variable under consideration. The first is about the distribution form, whose capability should be powerful enough to describe most (we cannot say all) random variables, either simple or complicated. The second concern is about objective determination of distributions. The strategy for solving the first concern is to introduce a family of distributions able to describe complicated distributions that special distributions cannot. This strategy reflects the recent developments in estimating complicated distributions: construction of a parameterized family \Phi that is much more flexible than special distributions (Sakamoto et al., 1983; Zong & Lam, 1998). It is so powerful that it meets our demands in most cases, whether the distribution under consideration is simple or complicated. The strategy for solving the second problem is that, instead of using one a priori assumed distribution, a family \Phi of distributions or models is considered. Based on a random sample, the best model is selected from the family according to some criterion. This process, called model selection, convinces us that the distribution or model is selected based on the information in the sample and that it is the best among a group of candidate models. If the group of candidate models is large enough, the model chosen in such a way must be the distribution we are after. Combining the two strategies together leads to the recent development of information-theoretic methods for estimating complicated distributions. In this chapter, estimation of complicated 1-D distributions based on large samples using B-splines is introduced.

6.1 General problems about pdf approximation
To ensure that the family
\Phi is flexible enough that the pdf under consideration is covered in this family, we need to know what kind of function such a pdf is. In other words, for any given random variable, is it possible to approximate its pdf by use of some simpler functions? Here we have a nice theorem to answer the question.

Theorem 6.1 (Lebesgue's decomposition theorem) Any distribution function F(x) can be written in the form of

F(x) = \alpha_1 F_1(x) + \alpha_2 F_2(x) + \alpha_3 F_3(x)
(6.1)
where \alpha_i \ge 0 (i = 1,2,3) and \alpha_1 + \alpha_2 + \alpha_3 = 1. F_1(x) is absolutely continuous, being continuous everywhere and differentiable except at countably many points; in other words, F_1(x) is differentiable almost everywhere. F_2(x) is a step function with a countable number of jumps, and F_3(x) is singular. Proof: see Bhat (1985).
n
This theorem states that any pdf is a sum of three types of functions. Because F_3(x) is singular, and thus pathological, it is often dropped in analyses.
F_2(x) has the form F_2(x) = \sum_{x_i \le x} p_i, where p_i = \Pr(X = x_i). Hence, the random variable X has a finite probability of occurring at the discrete values x_1, x_2, \ldots and zero probability of taking any other values. Differentiating both sides of equation (6.1), and noting that the derivative of a step function is a delta function, we have

f(x) = \alpha_1 f_1(x) + \alpha_2\sum_i p_i\,\delta(x - x_i)
(6.2)
where f_1(x) is at least continuous almost everywhere. Note that at those points where f_1(x) is not continuous, f(x) can be expressed as \sum_j q_j\,\delta(x - x_j).
Considering the second term on the right hand side of equation (6.2), we have

Corollary 6.1 If f_1(x) is continuous, then a pdf can be written in the form of

f(x) = \alpha_1 f_1(x) + \alpha_2\sum_i p_i\,\delta(x - x_i)
(6.3).
If the singular part is not counted, both Theorem 6.1 and Corollary 6.1 indicate that a pdf is either continuous or discrete, or a combination of both. Approximating a pdf can thus be studied separately for continuous and discrete random variables. This simplifies our analysis a lot, and we may thus address the two issues separately. First of all, we notice that approximating the distribution of a discrete random variable is not a challenge if the sample size is large enough. This is due to the fact that the distribution of a discrete random variable is expressed by finitely or countably many discrete probability values. Thus, approximating such a distribution does not involve the issue of how to approximate the distribution form. As more and more data are collected, or as the sample size gets larger and larger, we have better estimates of the distribution. In the case of small samples, however, large statistical errors may arise. This latter issue will be discussed in Chapter 8 for 1-D discrete distributions and in Chapter 9 for 2-D discrete distributions. In this chapter, we focus on how to approximate the pdf of a continuous random variable. Consider a continuous random variable X defined on the interval [c,d]. Corollary 6.1 leads us to the conclusion that \Phi is the set of all continuous functions f(x) satisfying

Non-negative condition: f(x) \ge 0 for x \in [c,d],    (6.4a)
Normalization condition: \int_c^d f(x)\,dx = 1.    (6.4b)
We are thus after continuous approximants satisfying the above two constraints. We now turn to the problem of how to construct f(x).

6.2 B-spline approximation of a continuous pdf
Consider a continuous pdf. We are tempted to use polynomials to approximate f(x), and many researchers have done so over the years. This is, however, a method of serious limitations because of the properties of polynomial approximation discussed in Chapter 4. Here two issues should be considered separately: the first is the capability of polynomial approximation and the second the stability of polynomial approximation. The answer to the former question is affirmative: any continuous function can be approximated by a polynomial to desired accuracy if the order of the polynomial can be any integer. But the answer to the second question is negative: the stability of polynomial approximation is poor, this being particularly true as the order of the approximating polynomial is large. Based on these, we have a picture of the properties of polynomial approximations in mind. For any given continuous function, we are able to find a polynomial
which is so close to the given function that their difference can be neglected. If the function value at a point has a slight change, however, disaster might occur because all the coefficients of the polynomial may suffer big changes. If the function to be approximated is badly behaved anywhere in the interval, then the approximation is poor everywhere. This is particularly true if uniform spacing of knots is employed. This global dependence on local properties leads to unstable approximation. Studies of other approximating functions, such as truncated power basis functions, are also not satisfactory. Based on the introduction in Chapter 4, B-splines are a satisfactory approximating tool for our purpose. A linear combination of B-spline functions is therefore assumed to be able to approximate the pdf f(x) of a continuous random variable X in the form of

f(x | \mathbf{a}) = \sum_{i=1}^{N} a_i B_i(x),    (6.5)

where a_i (i = 1,2,\ldots,N) are the linear combination coefficients, \mathbf{a} = (a_1, a_2, \ldots, a_N)^T is the coefficient vector and N is the number of B-spline functions used to approximate the pdf. B_i(x) is the B-spline function of chosen order. In equation (6.5), we use f(x | \mathbf{a}) to indicate the dependence of f(x) on the coefficient vector \mathbf{a}. Based on the Curry-Schoenberg theorem introduced in Chapter 4, B-splines form a basis for the vector space of spline functions defined on a bounded interval. Having limited summation cancellation effects, B-splines are a relatively well conditioned basis set. Moreover, B-splines are linearly independent. These properties indicate that B-splines are suited to the purpose of approximating a continuous pdf. As given in Chapter 4, the third order and fourth order B-spline functions are frequently employed in applications. For convenience, they are rewritten here. The third order B-spline function is of the following form,
g,(s) = fr - * J E ( * " ' ~X)*H(X'+> ~X)H(X-X,).
(6.6a)
where H(x) is Heaviside function defined by
»M-{? ' " I The index s=i-3, and the function ws(x) is a product of the form
(6.6b)
f] »=0
Equidistant B-splines are assumed in the above. Suppose there are equidistance points, c = xa <xl<---<xB = d, which divide the internal into n subintervals. For convenience in later mathematical expressions, more knots at each end are defined; x_%, x_2, x_t, xH+i, xn+2 and xn+i,
n+1 [c,d] three It is
customary to set *_, = x_2 = x_, = x9 = c and xn = xnH = xn+J = xn+i = d . It is clear that w+1 knots define N splines. Therefore, N=n+l, The fourth order B-splines are similar with the third order B-splines in form. They are
where the index s=i-4, and the function ws(x) is a product of the form
Let us return to equation (6.1). The integral of f(x | a) over the distribution range [c,d] must be one, that is,
\int_c^d f(x | \mathbf{a})\,dx = 1
(6.9)
Substituting equation (6.5) into equation (6.9), we obtain
\sum_{i=1}^{N} a_i c_i = 1
(6.10a)
To obtain the second term in the above equation, the compact support property of B-splines — that an order 3 B-spline is nonzero only on [x_{i-3}, x_i] and vanishes elsewhere (see Chapter 4) — is used. The above formulation is also valid for order 4 B-splines. If order 4 B-splines are used, the last equality in equation (6.10a)
remains valid, that is, 2 ^ c » integral of the Mh B-spline,
=
* wi^1 c< differently defined. Here, c, denotes the
6.3. Estimation c,= Jj j (jc)A =a:'~Xf-3 for order $B-splines,
I
(6.10b)
X, —X,
F
C) =
135
BAx)dx =——— 4
for order 4 B-splines.
(6.10c)
To meet the requirement mat a pdf be positive imposed by equation (6.4a), we simply set a,*0.i'~l,2,....N
(6.11)
This is a sufficient condition, but not a necessary condition. Equations (6.5), (6.10) and (6.11) complete the approximation of a continuous pdf. Once the combination coefficients are given, the pdf of a continuous random variable is defined. In the following section, statistical methods are employed to find the combination coefficients. 6.3 Estimation 6,3.1 Estimation from sample data To determine the coefficient vector a, a random sample of size ns is drawn from the population. Let Hie sample points be xt (£ = 1,2, ••-,«,). If N is given in equation (6.5), a can be estimated using the maximum likelihood method. The maximum likelihood method is just one of the possible methods for estimating the coefficient vector a. It is also feasible to apply other methods of estimation, but the systematic methodology for estimating the coefficient vector a developed in Chapters 6 through 10 is totally based on the maximum likelihood metiiod, and thus the M-L method is employed here and in the subsequent chapters. Based on the maximum likelihood method, the estimation problem is formulated as the following optimization problem: For a given N, find vector a such that it satisfies L = £ log f{xc | a) -+ max
(6.12a)
subject to the constraints: 5> ( e,=l
(6.12b)
136 O/ fe0
6. l-D estimation based on large samples , (i' = l,2..",tf) .
(6.12c)
Equations (6.12a)~(6.12c) define a nonlinear programming problem (NLP). Being a linear function of a, f(x | a) is a continuous function in the space defined by equations (6.5) and (6.12). So is the log-likelihood function L. Theorem 6.2 The problem defined by equation (6.12) has a unique solution. Proof. From Weierstrass" theorem, which states that a continuous function defined on a compact interval must have an extreme, we conclude that the solution to equation (6,12) exists. It is provable, as given in the appendix to this Chapter, that there exists only one extreme point over the entire feasible domain for this nonlinear programming problem, see the appendix. • It is known that the most difficult thing in optimization is that the objective function has multiple extremes. A search scheme is often trapped at local extremes and fails to find the global optimum. In terms of this, the property that the problem defined by equation (6.12) has only one extreme point in the entire feasible domain is really a remarkable property. This makes numerical treatments much easier and no special cautions are needed. Therefore, if a local maximum solution is found to equation (6.12), it must be the global optimum solution because the solution is unique. Very often it is difficult to find a solution to a nonlinear programming problem. Even the problem defined by equation (6.12) has only one extreme point, a code based on a general-purpose method may turn out be computationally inefficient. In most applications of optimization research, the number of unknowns is restricted within several parameters, say 2 to 5. For the problem defined by equation (6.12), however, the number of unknowns is of the range of 10~50. In some cases, the number of unknowns may be over 100. For such optimization problems of large number of unknowns, general-purpose optimization methods are usually not applicable. This is particularly true for 2-D cases as will be discussed in Chapter 7. It is desirable to develop a particular method to find the solution in an efficient way. So in the appendix, an iterative formula is derived, which reads (6-B)
We have q £,(£)/ f(x | a) 51 because the nominator is a term in the nonnegative denominator, we conclude OS a, S l / c , . This is in agreement with equation
6.3. Estimation
137
(6.12b). The iteration foimuk remains valid even if f{x, j a) = 0 . To see this, note that every term in f(x{ j a) is nonnegative. So f(xt | a) = 0 implies that each term in / ( x , |a) must be zero. That is, alBi(x,) = Q . Because a.Bt(x)/f(x
| a) < 1 , a< '
l
must be finite even if it is of the type - .
The suggested initial values are 1
(6,14)
A small number, say 10"4, is prefixed. The iteration starts from the initial value given in (6.14). The iteration continues until the difference between the previous and present values of the combination coefficients is smaller than the prefixed small number. Numerical tests have shown that it takes several to several tens of iterations to reach the optimum. This iteration formula is shown to be very computationally efficient, making the methods presented here feasible as a statistical tool on a PC or a laptop. Equations (6,5) and (6.13) give complete solution to finding the continuous pdf based on a large sample if the number N of B-splines is given. A code based on the method is given in the floppy attached to this book and a brief description of the code is given in Chapter 12. The inputs of the code are the number N of Bsplines, the distribution interval [c,d] and the observed data (sample) ^ ( 1 = 1,2,...,^). The model assumes that the random variable under consideration must be disfributed in a finite interval [e,d\. If a random variable is distributed on an infinitely large interval, the model introduced here is used in the sense of approximation. The method requires input of raw data without treatment. If observed data are treated by some approaches, variants of the above method may be used, as demonstrated in the following section. 6,3,2 Estimation from a histogram More often than not, a pdf is expressed in the form of a histogram. Suppose a histogram is composed of K cells as shown in Figure 6.1. The histogram is formed from n, sample points and there are k, sample points in k-th cell (k=l,2,.,.,K), respectively. The nodes of each cell are denoted by §k and Ijk+i to differentiate from knots of B-splines.
6. I-D estimation based on large samples
138
If the sample points have a distribution defined by f{x), the probability for the event that n* points fall in &-th cell is given by the following multinomial distribution n.!
(6.15)
where qk is the partial probability of f(x), It is assumed again that f(x) is approximated by a linear combination given by equation (6.5). The partial probability qk relates to the combination coefficients a through B,(x)dx
(6.16)
If <4 denotes the integral in the last term, that is,
Figure 6.1 Schematic figure of a histogram
(6.17)
equation (6,16) would be N
(6.18) M
6.3. Estimation
139
Equation (6.17) can be numerically or analytically integrated. The analytical form is, however, so complicated that it does not exhibit much superiority. Numerical quadratures such as Gauss quadrature or Simpson rule are all suitable tools to use. They do not need much computer time for 1-D cases. The log-likelihood function is obtained by taking logarithms on both sides of equation (6.15) «! logP = log——f
- + «, log*?, +nilogql-
+ nK logqK
(6.19)
The first term on the right hand side is a constant because «* (k-l,2,,,,K) is observed value. So only the rest terms are included in the log-likelihood function
A slight change is made in the above equation by dividing the right hand side of equation (6.19) with sample size n, and introducing the observed frequency/?*. Similar to equation (6.12), the best estimate of a pdf based on a histogram is formulated as follows For given N, find vector a so that it satisfies K
-»max
(6.21a)
subject to the constraints: |>,.c,. = l
(6.21b)
af^0,
(6.21c)
(f = l,2,-,JV).
The solution to the above problem must satisfy a. = _L x y5^£*. n
A
,- = 1,2,...,^
(6.22)
*-i ft (a)
This iterative formula can be obtained in the similar way as equation (6.13) and its proof is neglected here. Again we may use this equation as an iterative formula to find the coefficients
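A compact implementation of the sample-data iteration described above can be sketched as follows. This is not the code supplied with the book: the multiplicative update follows equation (6.13), while the clamped equidistant knot layout, the Cox-de Boor basis construction, the uniform starting values and all function names are illustrative choices made here.

```python
import numpy as np

def bspline_basis(x, knots, k):
    """All order-k B-splines (degree k-1) at the points x, by Cox-de Boor recursion.
    len(knots) - k basis functions are returned, one per column."""
    x = np.asarray(x, dtype=float)
    B = np.zeros((x.size, len(knots) - 1))
    for i in range(len(knots) - 1):                       # order-1 splines: span indicators
        B[:, i] = (x >= knots[i]) & (x < knots[i + 1])
    last = np.nonzero(np.diff(knots) > 0)[0][-1]          # close the right end of the last span
    B[x == knots[-1], last] = 1.0
    for m in range(2, k + 1):                             # raise the order step by step
        Bn = np.zeros((x.size, len(knots) - m))
        for i in range(len(knots) - m):
            d1, d2 = knots[i+m-1] - knots[i], knots[i+m] - knots[i+1]
            left = (x - knots[i]) / d1 * B[:, i] if d1 > 0 else 0.0
            right = (knots[i+m] - x) / d2 * B[:, i+1] if d2 > 0 else 0.0
            Bn[:, i] = left + right
        B = Bn
    return B

def fit_bspline_pdf(sample, lo, hi, N, k=3, iters=200, tol=1e-6):
    """M-L coefficients a_i of f(x|a) = sum_i a_i B_i(x) on [lo, hi] via the
    multiplicative update of equation (6.13)."""
    knots = np.concatenate((np.full(k-1, lo), np.linspace(lo, hi, N - k + 2), np.full(k-1, hi)))
    Bmat = bspline_basis(sample, knots, k)                # shape (n_s, N)
    c = (knots[k:] - knots[:-k]) / k                      # integrals of the B-splines, eq. (6.10)
    a = np.full(N, 1.0 / (hi - lo))                       # uniform start: sum_i a_i c_i = 1
    n_s = len(sample)
    for _ in range(iters):
        f = np.maximum(Bmat @ a, 1e-300)                  # f(x_t | a)
        a_new = (Bmat / f[:, None]).sum(axis=0) * a / (n_s * c)
        if np.max(np.abs(a_new - a)) < tol:
            return a_new, knots
        a = a_new
    return a, knots

# Example: exponential sample restricted to a finite interval, as the model requires
rng = np.random.default_rng(0)
sample = rng.exponential(size=2000)
sample = sample[sample <= 5.0]
a, knots = fit_bspline_pdf(sample, 0.0, 5.0, N=6, k=3)
print("coefficients:", np.round(a, 4),
      " check sum(a_i c_i) =", np.round(np.sum(a * (knots[3:] - knots[:-3]) / 3), 6))
```

Note that the update automatically preserves the normalization constraint and the non-negativity of the coefficients at every iteration, which is why no explicit constraint handling is needed.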
6. 1-D estimation based on large samples
140
6.4 Model selection In the previous discussions, N is always assumed fixed. How to specify N, however, remains a problem. If, for example, two different N's are used to approximate the same pdf, we would obtain two models. The question immediately arises: which model is better? Before answering the question, let's consider the following example. Example 6.1 Model selection Assume a true distribution is given by
fexp +0.2x|-7Lexp{~2(;c-7)2}l
xe[0,10].
(6.23)
from which 200 random numbers are generated as the given sample. The following two models are used to approximate g(x): (6.24a) (6.24b)
Given N=7
—
N=50
—
-L(N=7)=359 -L(N=50)=330
8
10
Figure 6.2 Fallacy of Likelihood function as a tool for model selection where order 3 B-splines are used. Based on the procedures introduced in section
6.4. Moeiel selection
141
6.3.1, the unknown parameter a is estimated. The estimated pdfs using the two models are plotted in Figure 6.2. The values of the log-likelihood functions for both cases are, respectively L(N = 7) = -359 and L(N = 50) = -330.
(6.25)
It is clear from Figure 6.2 that the model f,(x) is closer to the true distribution, but_^a(x) is not. Based on the values of the log-likelihood function, however, fm(x) is better than / 7 (x) because the former has larger likelihood value. We have two conclusions from this example: 1) Model selection must be properly handled. It has significant influence on the estimation accuracy. It is not true that the more parameters the better the model is. It seems there exists an optimum number of B-splines; 2) Likelihood function fails as a quantitative evaluation tool for model selection. We need a new tool to serve as a quantitative measure of model selection. Fallacy of likelihood function as a quantitative evaluation tool for model selection results from the fact that M-L estimator is biased. There exist several criteria for model selection, among which are Akaike Information Criterion (AIC) and Measured Entropy (ME) introduced in Chapter 5. Whatever is random is uncertain. The amount of uncertainty is measured in the information theory by entropy. Uncertainty comes from two sources: the uncertainty of the random variable itself and the uncertainty of the statistical model resulting from approximation. The uncertainty of the random variable itself is measured by the entropy of the true model of the form
//(/,/) =-f/log/&.
(6.26)
The uncertainty resulting from model approximation is measured by the divergence between the true and the candidate models
fix)
(6-27)
The best model should minimize the sum of the total uncertainty: Hif, f) + J(f, g) -*• min The asymptotically unbiased estimator of H(/, / ) is
(6.28)
142 H{f,f)
6. I-D estimation based on large samples = ~\f{x \ a)log/(x | &)dx+^
(6.29)
where n_s is the number of sample points, n_f is the number of free parameters in the model, equal to N-1 in light of the equality constraint in equation (6.10), and \hat{a} is the maximum likelihood estimate of a. The asymptotically unbiased estimator of J(f,g) is

\hat{J}(f,g) = \frac{N-1}{n_s}.    (6.30)

And thus, the asymptotically unbiased estimator of equation (6.28) is

Measured Entropy = ME = -\int f(x | \hat{a})\log f(x | \hat{a})\,dx + \frac{3(N-1)}{2n_s}
(6.31)
In chapter 5, as an asymptotical approximant to likelihood function, Akaike Information Criterion (AIC) is estimated by
AIC = -\frac{1}{n_s}\sum_{t=1}^{n_s}\log f(x_t | \hat{a}) + \frac{N-1}{n_s}.    (6.32)

Note that the coefficients in front of the last terms in the two equations above are different because they are obtained on different bases. Aided with the above-mentioned criteria, the best estimate of the pdf (the optimum N) can be found through the following procedure: Suppose \hat{a} is the maximum likelihood estimate of a for given N. Find N so that

ME(N) = -\int f(x | \hat{a})\log f(x | \hat{a})\,dx + 3\times\frac{N-1}{2n_s} \to \min
(6.33a)
Or if AIC is used, Suppose a is the maximum likelihood estimate of a for given N. Find N so that AIC = - — X l o S fix, | a ) + — -> min
(6.33b)
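The search over N can be automated with a short loop. The sketch below is illustrative and assumes the `bspline_basis` and `fit_bspline_pdf` helpers from the sketch in section 6.3 are available; the ME integral is approximated by a simple quadrature on a fine grid, which is one possible choice among several.

```python
import numpy as np
# assumes bspline_basis and fit_bspline_pdf from the earlier sketch are defined

rng = np.random.default_rng(0)
sample = rng.exponential(size=200)
sample = sample[sample <= 5.0]
n_s = len(sample)

grid = np.linspace(0.0, 5.0, 2001)                  # quadrature grid for the ME integral
dx = grid[1] - grid[0]
best = None
for N in range(3, 12):
    a, knots = fit_bspline_pdf(sample, 0.0, 5.0, N=N, k=3)
    f_data = np.maximum(bspline_basis(sample, knots, 3) @ a, 1e-300)
    aic = -np.mean(np.log(f_data)) + (N - 1) / n_s                       # eq. (6.33b)
    f_grid = np.maximum(bspline_basis(grid, knots, 3) @ a, 1e-300)
    me = -np.sum(f_grid * np.log(f_grid)) * dx + 3 * (N - 1) / (2 * n_s) # eq. (6.33a)
    print(f"N={N:2d}  AIC={aic:.4f}  ME={me:.4f}")
    if best is None or aic < best[1]:
        best = (N, aic)
print("AIC-selected number of B-splines:", best[0])
```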
6.4. Model selection
143
The above procedures are also an optimization process. Given JV,, find the maximum estimate of a through equation (6.13), and compute corresponding ME (Nt). Give another Nt > Nt and find corresponding ME( JVj). Continue the process until a prefixed large N. And then find the N which minimizes ME or AIC. If a is estimated from a histogram, the above formula for ME has no change in the form: Find N so that ME(N) = - | / ( x | a) log f{x | &)dx+—
1 _» min
(6.34a)
But the formula for AIC is slightly changed Find N so that AIC(N) = - V pk log qk (a)+ *-i
-> min
(6.34b)
«,
To obtain equation (6.34b), divide the interval [c,d] into K subintervals \,§k >&+/]• Denote the length of the subinterval by Ak and the number of points falling into &-th cell by nk. Then the first term on the right hand side of equation (6.33b) is rewritten in the form of
(6.35)
Neglecting the terms are constants on the right hand side of equation (6.35) results in equation (6.34b). The integral in equation (6.34a) can only be evaluated through numerical methods, say Gauss quadrature. For one-dimensional problem, computer time is not a big issue and thus choice of a numerical quadrature scheme does not exhibit significant impact on the numerical accuracy and computational efficiency.
144
6, 1-D estimation based on large samples
6.5 Numerical examples In the following examples, we assume the true pdf f(x) is given, and generate ns random numbers from this distribution by employing the method presented in Chapter 3. Using these random data we estimate the coefficients a and N, based on the above analyses. Example 6.2 The exponential distribution Suppose an exponential distribution of the form is given f(x) = exp(-x)
(6.36)
From this distribution, generate a sample of size 20 by use of PRNG introduced in Chapter 3. The generated data are .67Q55220E-02 .19119940E+01 .90983310E+00 .32424040E+00 .45777790E-01
.88418260E+00 .17564670E+01 .1Q453880E+01 35282420E+00 J3858910E+00
.32364390E+00 J181957GE+01 J9749570E-01 J88Q6860E+00 .13271010E+01
.64127740E+00 .10687590E+01 .22005460E+01 .24852700E+00 .10658780E+01
If MODEL(N) denotes the model for approximating the pdf, that is, MODEL(N): / ( x ] a ) = £ a, £,.{*)
(6.37)
we obtain estimates for the following models using equation (6.13). (a) M0DEL(3) representing 3 B-spline functions are used to approximate the pdf a, = 0.4998, fl2 = 7.43 x 10"8, a, = 5.37 x 10~M . Wife parameters in the above substituting the parameters in MODEL(N), the log-likelihood function, ME and AIC can be calculated from the following equation | a)
(6.38a)
6.5. Numerical examples ME(N) = - [f{x | a)log f{x | a ) * + ^ ^ — ^ J 2«, = —-Ylog/(x t |a}+—•—-
145 (6.38b) (6.38c)
that is, £ = 20.13, ME = 1.51, AIC = l.U This model is in feet approximated by the first B-spline and the rest two Bsplines have coefficients nearly equaling zero. (b) MODEL(4) having 4 B-splines a, =0.999, a 2 =2.65>dtr\ ^ = 0 , « 4 = 0 I = 15.05 , ME = 0.89, AIC = 0.90 (b) MODEL(5) having 5 B-splines a, = 0.953,a2 =0.273, a, = 3,82x 10"\ a 4 = 0 , 5 s = 0 £ = 15.05, ME = 1.13, Among these three models, MODEL(4) minimizes both ME and AIC, thus being the best model for the data presented at the beginning of this example. We now use two larger samples to perform the estimation. Suppose two random samples are generated from (6.36). The size of the first sample is 100 and the second 200. Some of the calculated results are given in Table 6.1. N starts from 3 and stops at 11. Values of Log-likelihood function L, ME and AIC are given in the table. From the Table, the log-likelihood function —L is nearly a monotonic function of N. Again we see that it cannot specify the most suitable model. For the case of «s=lQ0, ME and AIC take their minimums at N=4, both yielding the same estimates. The estimated pdfs for iV=3,4,10 are shown in Figure 6.3 (a). For the case of J¥=3, the curve does not fully represent the characteristics of the given curve because the number of B-splines is not enough. In the case of iV==lG, the curve ejdubits more humps than expected. The curve corresponding to N=4 is competitively closer to the original curve than the rest two. Our visual observations and quantitative assessment by use of ME and AIC yield consistent conclusions. In general, the estimate is acceptable. The estimate is improved a lot if the sample size is increased to 200 as shown
146
6. I -D estimation based on large samples
in Figure 6.3 (b). However, ME and AIC are different in this case. ME attains minimum at N=5 and AIC attains minimum at i¥=6. They do not differ too much if viewed from the figure. On the entire interval, the curves for iV=5 and N=6 are almost coincident except on the interval [2.3, 5], where the curve for N=6 is visibly lower than the original curve and the curve for N—5. Table 6,1 The exponential distribution
N 3 4 5 6 7 8 9 10 11
N 3 4 5 6 7 S 9 10 11
-L 106.5 91.6 92.1 91.5 91.1 90.7 90.3 90,4 89.9
(a) «s=100 ME 1.39 0.980 0.985 1.000 0.984 1.019 1.027 1.039 1.052
AIC 1.085 0.946 0.961 0.965 0.971 0.977 0.983 0.994 0.999
-L 213.3 182.8 183.2 181.7 181.9 180J 180.0 180.5 189.9
(b) «,=200 ME 1.376 0.960 0JS2 0.953 0.959 0.954 0.957 0.974 0.981
AIC 1.077 0.939 0.936 0.934 0.939 0.939 0.940 0.948 0.949
6.5. Numerical examples
147
1.2 1 OJ
1
0.6
ens:
o
t»
0.4
•a
0.2 Si
o
0
Figure 6.3 The exponential distribution Hence, comparison with the original curve has revealed that the curve for N=5 (ME-based best) is slightly better than the curve for N=6 (AlC-based best). This is not surprising if we recall the assumptions and derivations in Chapter 5. AIC is an asymptotically approximant to likelihood function, but ME accounts for model uncertainty. It is interesting to note that section 6.1 gives satisfactory expression of continuous pdfe based on approximation theory, section 6.2 estimates the unknown parameters based on statistical theory while section 6.3 solves the problem of model selection based on information theory. None of the above-
148
6. I-D estimation based on large samples
mentioned three theories eould solve the problem of pdf estimation in such a broad sense if they were not combined together. Thus, this chapter and the present example well demonstrate the power of interdisciplinary studies. Example 6.3 The normal distribution The next example considered is that X is normally distributed as
(639)
Again two samples («s=100 and «/=200) were generated from the distribution. The estimated results are shown in Table 6.2, with N starting from 3 and stopping at 13. For the first case («,,=100), ME and AIC predict different models. ME indicates N=7 is the most suitable model while AIC predicts the best model is given by N=E. The difference is solved by plotting me estimated pdfe in Figure 6.4(a), In the figure, the curve for i¥=5 is also plotted. It shows poor correlation to the original curve.
Table 6.2 The normal distribution (a)
N 3 4 S 6 7 8 9 10 11 12 13
-L 173.5 173.2 147.1 151.7 136.8 135.3 134.9 135.1 134.6 134.1 134.0
(b) nx=200
n,,=100
ME(N) 1.984 1.993 1.760 1J20 1.431 1.483 1.483 1.500 1.498 1.498 1.527
AIC 1.755 1.762 1.511 1.567 1.428 1.423 1.429 1.441 1.446 1.451 1.457
N 3 4 5 6 7 8 9 10 11 12 13
-L 348.5 348.5 299.1 308.1 283.2 282.7 282.1 281.1 281.4 281.2 281.2
ME(N) 1.969 1.976 1.730 1.788 1.458 1.480 1.471 1.477 1.488 1.490 1.495
AIC 1.753 1.757 1.516 1.565 1.446 1.449 1.450 1.451 1.457 1.461 1.466
ft 5. Numerical examples
149
0.5 +3
u 01
J
0.2
13
•s 1
.-.
'
If/ \ \ N=7{ME) — /t--~.A\N=8{AIC) --'
0.3 a
Given
(a) «s=100
0.4
0.1 0
V 4 X
10 §
(b) «s=200
0.4 4.
0.3 0.2
1 !
0.1 0 0
1
Figure 6.4 The normal distribution The model iV=7 (ME-based best) is closer to the original curve in terms of shape. It keeps the symmetry of the original curve while the curve for i¥=§ (AICbased best) loses such symmetry. However, both show some quantitative deviations from the original curve. Generally speaking, the model N=7 is slightly better than JV=8. In Figure 6.4 (b) are shown the results obtained from the sample n/=200 and in Table 6.2 the values for likelihood function, ME and AIC are given. In this case, ME and AIC yield same prediction that N=7 is the best. This is in agreement with visual observation from Figure 6.4 (b). Among the three
150
6. I-D estimation based on large samples
curves plotted in the figure, N=l is closest to the original. To see the efficiency of the iterative formula (6.12), the iterative process for three combination coefficients a2,a^ and a6 for the case N = B are plotted in Figure 6.5. After ten iterations, the results for the three coefficients are already very close to the final solutions. The convergence rate is thus remarkable. This is not special case. Numerical experiments yield the same conclusions. In general, the convergence rate of the iterative formula (6.12) is quite satisfactory, giving convergent results after about ten or several tens of iterations. U.4 , . - • • "
"
0.3 0.2 0.1 0
1
0
Iteration number 10
20
30
40
Figure 6.5 Convergence of linear combination coefficients Example 6AA Compound distribution Consider a more complicated example, in which X has the following mixed distribution: (6.40a)
jg(x)dx where g(x) is a function defined on [0,10]
(6.40b) + 0.2x
xe[0,10]
151
6.S. Numerical examples
The definition domain is [0,10]. Three random samples of size «,= 30 and nx = 50 and «s=1000 were generated, respectively. Estimations were separately perfonned for these three samples using the above procedures. The results for likelihood function, ME and AIC are given in Table 6.3. N starts from 6 and stops at 20. Table 6.3 The compound distribution
"N 6
7 8 9 11 12 13
-L 173.1 166.3 171.6 160.7 158.3 159.3 157.2
ME(N) 1.750 1.813 1.881 1.818 1.743 1.811 1.738
(a}«,-100 AIC(N) N 1.781 14 1.723 15 1.781 16 1.687 17 1.683 18 1.703 19 1.692 20
-L 157.6 157.5 156.5 157.0 156.4 156.3 155.9
ME(N) 1.785 1.800 1.778 1.817 1.837 1.819 1.857
AIC(N) .706 .715 .715 .730 .734 .743 .749
ME(N) 1.679 1.671 1.670 1.674 1.679 1.676 1.682
AIC(N) 1.653 1.653 1.654 1.655 1.658 1.659 1.661
ME{N) 1.663 1.654 1.650 1.656 1.657 1.655 1.659
AIC(N) 1.644 1.643 1.643 1.644 1.645 1.646 1.646
(b)«s=500
N
-L
6
862.9 835.5 847.0 819.0 812.3 817.8 812.2
7 8 9 11 12 13
N
-L
6
1727 1674 1697 1642 1629 1640 1629
7 8 9 11 12 13
ME(N) 1.703 1.739 1.727 1.706 1.667 1.689 1.660
AIC(N) 1.736 1.683 1.708 1.654 1.645 1.658 1.648
ME(N) 1.695 1.732 1.722 1.695 1.656 1.677 1.646
(c) «/=1000 AIC(N) N 1.732 14 1.680 15 1.704 16 1.650 17 1.639 18 1.651 19 1.641 20
N
-L
14 15 16 17 18 19 20
813.5 812.4 811.9 811.7 812.1 811.4 811.3
-L 1631 1629 1628 1628 1628 1628 1628
6. I-D estimation based on large samples
152
U.4
yx / \
0.3 / 0.2
•8
1
0.1
(c)ns=1000 \
/
N=ll N=13
—
N=20 Given
\
_
/ VA
•
.
_
•
X
0
2
4
6
8
10
Figure 6.6 Influence of sample size on estimation accuracy
6.5. Numerical examples
153
In general, for all three easesj the maximum likelihood function is a decreasing function of the number of B-splines, N. But as N is large enough, say N is larger than 13, the likelihood function is almost a constant, varying little as N increases further. This is particularly true as the sample size is large, see Table 6.3(c). Comparison of Table 6,3 (a) and (c) shows the influence of the second term (the number of free parameter over the sample size) in AIC. As sample size is large, see Table 6.3 (c), the influence of the second term is not significant, and thus AIC is nearly a constant as the likelihood. As fee sample size is small, see Table 6.3 (a), the second term in AIC becomes more important. For all the three cases, ME-based best models are given by i¥=13 while AICbased best model are given by N = l l . From the table, ME value is favorably larger than AIC value for each N. This is reasonable because ME accounts for one more term of uncertainty, l(g,f), than AIC, Some of the results are plotted in Figure 6.6. In the figure, the curve for N-2Q is plotted for the sake of comparison. If sample size is small, the difference between the estimated pdf and the given distribution is significant, see Figure 6.6 (a). As sample size increases, the difference becomes smaller and smaller, indicating a convergence trend, see Figure 6.3 (b) and (c). Figures 6,6(b) and (c) do not show apparent difference from the viewpoint of statistics. Example 6.5 Estimation from histogram data This example demonstrates the estimation procedure presented in section 6.3.2, Take the sample n/=500 in Example 6,4 for instance. A K==15 histogram is formed on the interval 10,10] as shown in Figure 6,7. With the estimation procedure neglected, the results for the likelihood function, ME and AIC are tabulated in Table 6.4 for each N. The ME-based best model is N=\S, while AIC does not change at all as the sample size is above 17, failing to find the best model. In figure 6.7 is plotted the estimated pdf and the original curve. The estimated pdf (N=15) is very close to the true distribution in comparison of the curve for JV=13. ME performs well for this example. The computer times for all these calculations were within one minute on a Pentium 4 PC, thanks to the introduction of the iterative formulas. The methods are computer-oriented, making estimation automatically performed once the histogram is given.
154
6. 1-D estimation based on large samples Table 6.4 Estimate by histogram data (B/=500)
N 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0
-L
ME
AIC
2.141 2.129 2.102 2.096 2.078 2.084 2.068 2.075 2.065 2.069 2.065 2.065 2.065 2.065
2.235 2.168 2.200 2.136 2.151 2.148 2.124 2.136 2.107 2.126 2.108 2.112 2.109 2.108
2.169 2.157 2.130 2.124 2.107 2.112 2.096 2.104 2.094 2.098 2.093 2.093 2.093 2.093
4
6
8
10
X Figure 6.7 Estimate from histogmm date
Example 6.6 Revisit of entropy estimation Entropy estimation has been theoretically studied in Chapter 5. In this example, Example 6.4 is used to demonstrate how entropy estimates vary with unknown parameters. Suppose sample size n, = 500. Four quantities defined by
155
6.5. Numerical examples 3 AT-1 2 »
(6.41a) (6.41b)
JV-1 2 n.
(6.41c)
JV-1
(6.4 Id)
The values of these quantities are plotted in Figure 6.8, in which the vertical ordinate shows the values of the above-defined quantities and the horizontal ordinate shows the number of parameters. 1.9
1.8
I
1.7
1.6
1.5 10
20
30
Figure 6.S Entropy estimation As sample size increases, the quantity £ 3 tends to be a constant. Because E% is in fact the estimate of the entropy of the random variable under consideration, constancy of £ 3 as sample size is large is just the right estimate of the entropy. The rest three quantities decrease as sample size increases from small to medium size. As sample size increases further, they begin to increase. Basically, Ei and AIC are quite close, because the latter is the asymptotic estimate of the
156
6. 1-D estimation based on large samples
former. This verifies the theoretical analysis presented in Chapter 5. Over quite large range, E\ is largest among the three. This is due to the fact that Ej measures the uncertainty in the whole process. The domain marked by a circle in the figure is the area of interest because the three quantities take minima in this domain. The critical numbers of parameters around which the four quantities take their respective minima are close. Within this domain, the curves for the four quantities defining four types of entropies show large fluctuations. Outside this domain, fluctuations are not significant, all curves varying in a relatively smooth way. It is within this fluctuation domain that the four entropies take their minima. The fluctuations are not meaningless. Take the curve for £ 2 for instance. It is minimized at iV=13. Its value at N=\2 is, however, much larger than the entropy value at JV==12. Such big difference at these two points enables us to convincingly find the number of B-splines minimizing entropy values. 6,6 Concluding Remarks If the methods presented in this chapter are placed at a larger background, a word should be mentioned about estimation methodologies in general. The task of estimating a probability density from data is a fundamental one in reliability engineering, quality control, stock market, machine learning, upon which subsequent inference, learning, and decision-making procedures are based. Density estimation has thus been heavily studied, under three primary umbrellas: parametric, semi-parametric, and nonparametric. Parametric methods are useful when the underlying distribution is known in advance or is simple enough to be well-modeled by a special distribution. Semi-parametric models (such as mixtures of simpler distributions) are more flexible and more forgiving of the user's lack of the true model, but usually require significant computation in order to fit the resulting nonlinear models. Nanparametric methods like the method in this Chapter assume the least structure of the three, and take the strongest stance of letting the data speak for themselves (Silverman,1986). They are useful in the setting of arbitrary-shape distributions coming from complex real-world data sources. They are generally the method of choice in exploratory data analysis for this reason, and can be used, as the other types of models, for the entire range of statistical settings, from machine learning to pattern recognition to stock market prediction (Gray & Moore, 2003). Nonparamefric methods make minimal or no distribution assumptions and can be shown to achieve asymptotic estimation optimality for ANY input distribution under them. For example using the methods in this chapter, with no assumptions at all on the true underlying distribution, As more data are observed, the estimate converges to the true density (Devroye & Gyorfi, 1985). This is clearly a property that no particular parameterization can achieve. For this reason nonparametric estimators are the focus of a considerable body of advanced
6.6 Concluding Remarks
157
statistical theory (Rao, 1983; Devroye & Lugosi, 2001). We will not spend more space on nonparametric estimation here. The interested reader is referred to these books for more exposure to nonparametric estimation. Nonparametric estimations apparently often come at the heaviest computational cost of the three types of models. This has, to date, been the fundamental limitation of nonparametric methods for density estimation. It prevents practitioners from applying them to the increasingly large datasets that appear in modem real-world problems, and even for small problems, their use as a repeatedly-called basic subroutine is limited. This restriction has been removed by the method presented in this chapter. As a nonparametric estimation method, the advantages of the proposed method are summarized as follows; (1) In the proposed methods the pdf is estimated only from the given sample. No prior information is necessary of the distribution form; (2) The pdf can be estimated by a simple iterative formula, which makes the methods computationally effectively. The methods are computeroriented; (3) The methods provided in this chapter are capable of approximating probability distributions with satisfactory accuracy; (4) As sample size increases, the estimated pdf gets closer to the true distribution; (5) ME and AIC analysis are able to pick up the most suitable one from a group of candidate models. ME is more delicate than AIC, but in most cases they yield same estimates. But note that ME or AIC are not valid if the number of free parameters «/ is too large, otherwise the central limit theorem fails. The cases that the number of free parameters is larger than the sample size are treated in Chapters 8 and 9. Without giving more examples, it should be pointed out that the fourth order B-splines yield same satisfactory results as the third order B-splines. The reader may verify this using the codes given in Chapter 12. The method presented in this chapter, a nonparametric one, also exhibits a shift of estimation strategy from traditional statistics. The typical procedures to determining a probability distribution as shown in Figure 6.9 are (1) SPECIFICATION: Assume a distribution t y p e / ( x | S ) , where 0 is an unknown parameter; from the family <& of distributions; (2) ESTIMATION: Take a sample from the population. Use statistical methods (moment method, maximum likelihood method, etc) to estimate the unknown parameter #as a function of sample X; (3) TESTING: Test if the estimated distribution matches the data well. If not, return to step (1), specify another model and repeat the same procedures until a good statistical model is found. This is the strategy of Fisher's statistical inference. We are in fact taught in
158
6. I'D estimation based on large samples
this way in universities (Akaike, 1980), The methods presented in this chapter, however, determine a distribution through the following procedures; (1) SPECIFICATION: Assume a distribution f(x\0) , from the parameterized family <&; (2) ESTIMATION; Take a sample from the population. Use statistical methods (mainly maximum likelihood method) to estimate the unknown parameter Q as a function of sample X; (3) SELECTION: Find the most appropriate model based on some criterion. These procedures are slightly different from Fisher's strategy in form, but they bring some different methods into the procedures. At the stage of SPECIFICATION, <& is an enhanced family of parameterized functions. TESTING in the Fisher's strategy is replaced by SLECTION, SELECTION is in fact an optimization process searching for the most appropriate model. The importance of model selection is exemplified in this Chapter. In most modern methods for nonparametrie estimations, however, model selection is ignored. Therefore, the tradeoff between model uncertainty and statistical fluctuations are not considered. Only through the introduction of information theory can we properly handle the problem of model selection.
DATAJT
I *
SPECIFICATION
\
f(ff)
ESTIMATION
Iw No
TESTING
T
Yes
OUTPUT Figure 6.9 Strategy of Fisher's statistical inference
Appendix
159
Appendix: Non-linear programming problem and the uniqueness of the solution Before we proceed to find the iterative formula to estimate the combination coefficients, we need to introduce some results in NLP theory. These results can be found on monograph on NLP, say Luenberger (1984). Consider the following nonlinear programming problem: jL{a,,a2,...,aAr)-»max
(6.A.1)
gj(a},a2,--;aJ>Q, j<=l,2,-,m
(6.A.2)
«^Q, i = l,2,~;N.
(6A.3)
Equation (6.A.2) defines m constraints. Writing the Lagrangian in the following form: (6.A.4) we introduce the following Lemma Lemma A.1: If a" =(a',al,---,a°N) is an optimum solution to the above problem, then there must exist non-negative multipliers «* (j=l,2,,,,,m), which satisfies the Kuhn-Tucker conditions together with a0 in the following equations: — iOforall at &0; af — 1 = 0, i = l,--,N da, Ida,) yOfoll^Q;
wJ— =0, J-l,—,m
(6.A.5)
(6.A.6)
Lemma A.2: Let a0 satisfy the above necessary conditions. Then the sufficient condition for a0 to be the global maximum is
A(a,u") ^ A(a°,uV f
8A(
f'"^ (a, -a?)
da satisfied for all a £ 0 .
(6.A.7)
160
6. 1-D estimation based on large samples
We now turn to our problem defined by equation (6.12). In our problem, there is no constate like equation(6.A.2), but we have an equality constraint. Modifying the Lagrangian slightly by introducing a Lagrange multiplier A, we have
Jjtfi -l) J
(6.A.8)
The derivatives of A with respect to a are
^
jMW
/ = i,2,...,i¥ .
(6.A.9)
Let a0 be an optimum. Then it must satisfy the second equality in equation (6.A.5), that is,
JdA JdA\
^Aflfofr)
.
i = l,2,-,N.
(6.A.10)
Summing up ihe above equations over i we obtain 0
(6 AJ1)
-
Exchanging the order of summation signs leads to iVci)
0
y
0 =
i = 1 2 ,...,jV.
(6.A.12)
|a ) tT The denominator of the first term is independent on fee index i. Thus the summation is performed only on the numerator, which turns out to befx(xt | a 0 ). The sum in the second term is one if equation (6.10) is used. Then we have ft
' ' '
'ffiX^a}
(6.A.13)
Appendix
161
From which we are led to the conclusion A = -«, . Substituting this into equation (6,A. 10), we obtain --n,c,a?=O, i = l,2,—,JV.
(6.A.14)
Hence, we arrive at the following iteration formula Theorem A.1 if a0 is a solution to NLP defined by equation (6.12) then it must satisfy
This is equation (6.13). Next we will prove that if a0 is an optimum solution to fee problem, it must be the global optimum. According to Taylor expansion of multi-variate function we have:
aa,dak
= A(a )+ 7.—-—&at +~\,Z*. <»i
So,.
2
l=i
j=i
—i-J-J-&a,Aak
(6.A.16)
dOfdOf.
where 0 < ^ l . For brevity, we wrote A(a,n°) = A(a) in the above equation. From Equation (6.A.S), we have
The third term on the right hand side of Equation (6.A.16) is less than zero, as shown below,
162
6. 1-D estimation based on large samples
f *=l
1 *w/
1 f \i=i
Y J
Based on equation (6.A. 16), it is concluded that
A(a) < A(a°) + ± ^ 1 Aa,
(6.A.19)
da Based on Lemma A.2, we have Theorem A.2: If a0 is a solution to the problem defined by equation (6.12), then it must be the global optimum. Meanwhile, if there are two solutions to the problem, they must be identical, because they are all global optimum. In other words, the solution is unique. So far, the theorems proven in this appendix reveal the attractive properties of the model established in this chapter. The solution is unique, and its value is given by a simple iteration
Chapter 7
Estimation of 2-D complicated distributions based on large samples
Estimation of 2-D distributions involves the same basic procedures as those described for estimation of 1-D distributions in Chapter 6. It is thus straightforward to extend the methods for 1-D estimation to 2-D estimation. Many equations are similar in character to those for 1-D cases, but there are some large and small differences that must be considered. First of all, there are, more options for approximating a bivariate distribution than for a univariate distribution. Geometrically, a line in 1-D space corresponds to either a rectangle or a circle in 2-D space. A 2-D B-spline can be constructed either using the product of two 1-D B-splines (forming a rectangle) or using a 1D B-spline which argument is radial distance {forming a circle). A bivariate distribution is then approximated using either of the two types of B-splines. Both approximation methods will be described in this Chapter. Computational efficiency is not an important factor to be considered for 1-D estimation, but it is a very important factor for 2-D estimation due to the sharp rise in numbers of unknown parameters and sample size. It is beneficial for any attempts to reduce computational time for estimating a bivariate distribution. 2-D distributions are more fascinating than 1-D distributions. 1-D distributions can be assessed through visual check while 2-D distributions are more difficult to be assessed by visual check. Information criteria like ME are necessary. Most studies over the years in multivariate statistics have been focused on the multi-variate normal distribution (Anderson, 1958; Sakamoto, 1993). In the realworld applications, 2-D distributions are no less encountered than 1-D distributions. Cta the other hand, there remains a lack of general methods for estimating non normal distributions for a multivariate. Thus methods for estimating 2-D distributions are given in this Chapter.
163
164
7. 2-D estimation based on large samples
7,1 B-Spline Approximation of a 2-D pdf As same as in Chapter 6, only continuous pdfs are discussed here. Consider a 2-D rectangular domain, in which is defined a bivariate continuous random vector (X, Y). An orthogonal coordinate OXY is defined in this domain. Along the two coordinates, M and N B-splines are respectively used to approximate the joint pdf f(x, y) of (X, Y), as shown in Figure 7.1, in the form of
f(x,y | a) =
(7-1) 1=1 ,/=l
where av (i-l,2,..,,M, j-I,2,..-,N)
are the linear combination coefficients. Re-
arranging the coefficients so as to form a vector, we obtain
Using the normalizing condition that a pdf f(x,y)
o oo o ° e 080 00
c^ocr
0
o
o
o
integrates to one, we have
° o °
a» o
°8 ° o o °L
1 0.5 0 X0
X,
Xj
X M -2
Figure 7.1 B-spIine approximation of 2-D pdf
7,1, B-spline approximation of a 2-D pdf
165
where c$ denotes the integral of the ij-Hi B-spline, and takes values as follows: 11
rj
x —x y ~y ci} = f BXx)dx f B(y)dy = ' M x J J~3
(7,4a)
for order 3 B-splines.
(7.4b) •»-*
for order 4 B-splines. To ensure f(x, y) 2:0, we simply set as before atJ > 0, (F=1J, ...,Af,j=l,2, ...,N)
(7.4c)
Another way to approximate a 2-D pdf is to use the so-called Radial B-spline Functions (RBF), which are defined in the radial symmetric form, (see Figure 7.2)
-(2-Sf, 1<SS 6 o, s;>2
where r = ^j(x-xi)3+(y-yif
, S = rjhf , a^isflah1
(7.5)
, and (xhy$ is the
centre of i-th RBF. B,{r,ht) integrates to one and ht denotes the size (radius) of the domain in which Bjfr.hj) does not vanish. A,has significant influences on the shapes of the radial B-splines. In Figure 7.2 are shown three radial B-splines for different A,. In the figure, the radial B-splines are centered at (3,5) and (7.5,10), respectively. In Figure 7.2(a), ^=2, describing "over fat" B-splines. In Figure 7.2(c), hr0.5, defining "over thin" B-splines. Figure 7.2(b) represents proper B-splines, in which B-splines are defined in such a way that the distance of two B-splines is equal to hi. hj may be different from one RBF to another. It may also be same for all RBFs.
7. 2-D estimation based on large samples
166
(a) h=2 0.1
r
m
10 Q
y
10 0 (e) h=0.5
0 0
10
Figure 7,2 Influence of parameter ht on the form of radial B-splines
7.2. Estimation
167
Using RBF, a bivariate pdf is approximated in the form of M.
,
: ytf
(7-6)
where M is the total number of RBFs. Because a RBF integrates to one, we have it
u
J ] atct = ]T a, = 1, if c( = 1 is defined.
(7,7)
As same as before, it is required that the coefficients must be greater than or equal to zero, that is, af>Q,
i = l,...,M.
(7.8)
If RBF is replaced by another function which is of similar characters of RBF, mat is, RBS does not vanish only within a small region, the approximation formulated above remains valid. In general, a bivariate distribution can be approximated by a linear combination of the following form M
f{x,y\m) = 2>d(/\*>, r = ^(x-x,f + {y-y,f
(7.9)
where function $(r,h,), often called basis function, takes non-zero values within a small circle, and vanishes outside the circle. In other words, it is a locally compact supported function. Kernel density estimation developed by Silverman (1986) is such an example. The list of the particular forms of basis function $fa",ht) is long, such as Gaussian function, multi-quadries. Whatever the forms are, the estimation procedures are fundamentally same. We remain to focus on the two types of 2-D B-splines as described above, keeping in mind that the basic procedures apply to other basis functions. 7.2 Estimation 7.2.1 Estimation from sample data Using the maximum likelihood method to estimate the unknown parameters, we first of all take a random sample of size «,s from the population. Let the sample points be {xt,y() (t = 1,2,•••,#,). If MxN is given in equation (7.2), estimating a is equivalent to the following nonlinear programming problem;
168
7. 2-D estimation based on large samples
For given M and N, find vector a such that it satisfies (7-10) subject to the constraints:
at*Q.{i = l,2,-,M,j>*l,-.N)
(7.12)
The nonlinear programming problem above also applies to RBFs if equation (7.11) and (7.12) are replaced by equations (7.7) and (7.8), respectively. Using vector and matrix notations, we may easily conclude that all the conclusions valid for 1-D problem remains valid for the 2-D problem defined above. In summary we have Theorem 7.1 (1) The solution to the problem defined by (7.10)~(7,12) exists; (2) There exists unique extreme point, which is global maximum; and (3) The solution can be found through the following iterative formula
startingfrom the initial values
If RBFs are used, the corresponding iterative formula becomes
starting from the initial value
7,2. Estimation «,=— • 1 M
169 (7-16)
1.2.2 Computation acceleration If the iterative formulas (7.13) or (7,15) were not used, it would be very timeeonsuming to perform the 2-D estimation because of the huge numbers of operations involved, or it would require computational time that is too long to be practically meaningful. Even so, there remains room for further acceleration of computational speed. A quick estimate of operation numbers per iteration demanded by equation (7.13) can be easily made to be around MxNxnt, Suppose 10x10 = 100 Bsplines are used, and the sample size is 400. Then the number of operations per iteration is about 40,000, which is a large number. We note from equation (7.13) that for most observation points, B){x)Bj{y)is zero. It does not vanish only for a small number of points which fall in its nonzero region. To make use of this fact (we know that this is the local support property of B-splines), the iteration is not done over all observation points, but only over those on which Bi{^Bi{y)i& not zero. DtJ is used to denote the index set of those observations that contribute to Bt (x)Bj (y), that is, Bl(Xl)Bl(yt)*Q
if
teD,
• (
3(*,)fl,(y,) = 0 if teDt '
'
}
Before iteration begins, make computations to find sets DtJ based on equation (7.17). Introduction of the index set Dy leads to a new iterative formula ±
fl^tW|
{71g)
nsctJ Sot / W (*(.J' ( I«) Because the rest terms are zero, a^ forms a matrix, whose non-zero elements given by equation (7.18) is banded. If a uniform bivariate distribution is considered, the mamematical expectation of the observation number in Dy is E[DH ] = ^ . * (Ml)(iVl)
(7.19)
170
7. 2-D estimation based on large samples
The number of operations per iteration is then estimated at MxNx E[DtJ]« ns. Consider previous example again. Using formula equation (7.18), the operation number is around 400 per iteration, a remarkable reduction! If the bivariate distribution is not uniform, equation (7.19) seems no longer valid. The n*ue case would be: some index sets Dy contain more elements than the rest, and these sets take more operations than the others. One extreme case is that all sample points fall in one index set and the rest do not have elements. The operation number in that index set is around «, and the operation number in the rest is zero for there is no element in them. The total operation number per iteration remains around «,, if equation (7.18) is used. In general, if some index sets contain more elements and more they take more operations, then some other sets contain less elements and they take less operations. On average, the estimate given by equation (7.19) remains valid with the understanding that it is averaged over all index sets. This is a crude estimate, but it does help us to establish the idea that equation (7.18) saves up to M x N fold computer time vs equation (7.13). If M and N are 10 to say, then the computational speed is about 100 fold increased. The arguments above also apply to equation (7. 15) for RBFs. The sum over all sample points in the equation should be replaced by the sum over the index set Dh any element of which makes ^ ( r , , ^ ) nonzero. This technique is employed in other fields such as molecular dynamics and computational mechanics to accelerating numerical simulation speeds. 1-D and 2-D computations are significantly different resulting from the dramatic difference in numbers of unknown parameters. Any computing, if performed over all points in a 2-D space, would not be practical to be computationally effective. 7,2.3 Estimation from a histogram Suppose a 2-D histogram is composed of MH x NH cells as shown in Figure 7.3. The histogram is formed from «, sample points, and there are «,,, sample points in sf-th cell (s=!,2,..,,Mm t=l,2,...,N/f), respectively. Each cell is defined by the interval K . # t + i ] x f e ' ^ + i ] - Here Greek letters are used to differentiate from knots of B-splines. If the sample obeys the distribution defined by f(x,y), the probability for the event that ns, points fall in st-th cell is given by the following multinomial distribution
7,2. Estimation
171
where g» is the partial probability of f(x, y), and it relates to the combination coefficients a through
Figure 7.3 2-D histogram composed of MH x NH cells
4/+1 %*{
IJ
=J
l "fc+1 M
N
(7.21)
= £ £ % J jB,(x)Bj(y)tfydx Use <4«i to denote the integral in the last term, that is, (7.22)
and equation (7.21) would be M
N
(7.23)
The log-likelihood function is obtained by taking logarithms on both sides of equation (7.20)
172
7, 2-D estimation based on large samples
L = log P = 2 ] ]T p s( log 9W + constant, />„ = nm I ns,
(7.24)
»=i i=\
Similar to equation (6.12), the best estimate of a pdf based on a bivariate histogram is formulated as follows For given M and N, find vector a so that it satisfies 1 =
Z E Pi l o 81i -*1m a x
C?-25a)
!=J 1=1
subject to /te constraints: (7.25b) # a0,
(i = l,2,-,M;j = l,2,-;N),
(7.25c)
The solution to the above problem must satisfy
^ f i ^
1
1 2 Mj l
2
N
(7.25d)
Again we may use this equation as an iterative formula to find the coefficients. If RBFs are used, we have the following formulation For given M and N, find vector a so that it satisfies (7.26a) subject to the constraints: |>,C(=1
(7.26b)
q £ 0 , (i = l,2,---,M)
(7.26c)
where
7.3, Model selection
*" *
173
(7.27a)
«= J )*,(*",*>$«&
(7.27b)
The solution is (7.28)
7.3 Model selection Based on ME analysis, the number of free parameters in a model must be defined. For bivariate distributions, we have two definitions for the number of free parameters as follows MxN-l M-l
if equation (7.1) is used ifequation(7.6)(RBFs)isused
For bivariate distributions, the best estimate of pdf can be found through the following procedures: Suppose a is the maximum likelihood estimate of afar given number of unknown parameters. The best model satisfies ME = -jjf(x,y\
&)lagf(x,y
| n)dxdy+^i-
-> min
In the case that AIC is used, Suppose a is the maximum likelihood estimate of a for given number of parameters. The best model satisfies
(7.30a)
174
7. 2-D estimation based on large samples = -—Y
log f(Xt,yf\i)+?*—*
min
(7.30b)
If a is estimated from a histogram, the above formulations are changed slightly as follows: Find M and N so that ME(M, N) ~ - 2 J 2 , Q*i (fi) l o §au C & )+—- -» m m
(7-31 a)
or Fiwrf M anrf JV" so that A1C(N) = - ] ^ f > w logfM(a) + ^ - • min. i-l
(=1
(7.31b)
«,
7.4 Numerical examples In examples 7.1 and 7.2, the methods presented in this Chapter are examined by using the following procedure; given a distribution -> from the distribution generate a random sample -> from the generated sample estimate the pdf -> compare the estimated and given pdfe to assess estimation accuracy. Example 7,1 Two dimensional normal pdf. This example has been considered in Chapter 3, where a method for generating random numbers (ARM) was discussed. Consider a bivariate normal distribution defined by (7.32a)
Here g(x,y) = 0.&xh(x,y 15.5,5.5,0.6,0.6,0.3) +02xh(x,y 17,7,0.4,0.4,0.3) which in turn is defined by
(7.32b)
7.4, Numerical examples
175
(7-33)
The shape of f(x,y)
is shown in Figure 7.4. The definition domain [0,9]x[Q,9]
Three random samples (xt,yf) { £ = 1,2, • • •, n,) with sample size n, =200,400 and 500 were respectively generated from the distribution above. Figure 7.5 shows the sample scatter in the plane.
Figure 7,4 The given distribution
Figure 7.5 The generated random sample (n,=500)
First of all, rectangular B-splines (product of two one-dimensional B-splines) were used. M and N are varied from 6 to 15. In total, there are 10 x 10 = 100 case were calculated. Some of results are given in Table 7.1. In the ease of ns~200, the ME-based optimum number is M=10 and iV=10, while AlC-based optimum number is M=8 and N-%, see Table 7.1 (a). In Figure 7.6 we plot two pdfe, one estimated based on M=N=10 B-splines, see Figure 7.6 (a). This is in fact the case for ME to be minimum. Figure 7.6 (b) shows the pdf estimated by using M=N=8 B-splines, mat is, best estimate based on minimum AIC. Comparing these two plots, we see that Figure 7.6 (a) is better
176
7, 2-D estimation based on large samples
than Figure 7.6 (b), showing that ME analysis is more suited to the model selection than AIC. The sample size is not large, and thus even the estimation based on minimum ME is not very close to the true distribution. Particularly, the second small peak in the region [7,8]x[7,8] is not well described. Table 7.1 ME and AIC values (a) ns = 200 M 8 9 10 10 11 12
N 8 9 10 12 10 12
ME 2.963 3.058 2.869 2.941 2.984 3.038
-L 431.1 426.1 398.1 389.9 397.1 378.6
AIC 2.470 2.530 2.485 2.544 2.531 2.608
,n, = 400 M 10 10 11 11 12 14
N 10 14 12 14 12 14
-L 833.5 818.9 812.3 809.1 804.3 799.0
ME 2.56i 2.643 2.459 2.652 2.581 2.779
AIC 2.331 2.395 2.358 2.405 2.368 2.485
ME 2.480 2.528 2.453 2.530 2.474 2.622
AIC 2.271 2.322 2.291 2.328 2.296 2.391
= 500 M 10 10 11 11 12 14
N 10 14 12 14 12 14
-L 1036 1022 1015 1011 1005 1001
7.4, Numerical examples
177
(a) ME-based
0.4
0.3 0.2 0.1 0
0.4
(b)AIC-based
0.3 0.2 0.1 0 3
Figure 7.6 ME-based and AlC-based best estimates for«,, =200
The optimal (M,N) for sample size ns = 400 are M=l 1 and N=12 based on minimum ME. If AIC is applied, the optimum number of B-splines are M=N=10. The estimated two dimensional pdfs are given in Figure 7.7 (a)-(b) for minimum ME and minimum AIC. For this case, same conclusion as above can be obtained. AIC underutilizes the number of B-splines, resulting in incapability to capture the small peak. ME-based optimum number of B-splines yields better results.
7. 2-D estimation based on large samples
178 0.4
(a)ME-bestM=ll,iV=12
0.3 0.2 0.1 0 3
(b)AIC-best M=1G,JV=1O
0.4 0.3 0.2 0.1 0
Figure 7,7 The ME-based best estimated pdfi for sample »j=400
The results for sample ns = 500 are given in Figure 7 J for ME-based and AlC-based optimum estimations. The optimum number of B-splines based on minimizing ME is M=ll and N=12 while AlC-based optimum number is M=N=10. If sample size is kept same, ME analysis yields better estimations. For all the samples discussed above, ME selects better models than AIC does. As discussed in Chapter 5, ME takes two uncertainties into consideration: the uncertainty associated with the random variable itself and flie uncertainty resulting from model approximation. AIC, on the other hand, is asymptotically unbiased estimate of likelihood function without considering the uncertainty
7.4. Numerical examples
179
associated with model approximation. In terms of this fact, we believe that ME analysis is more suited to distribution analysis.
(a) ME-best M=UfN=12
0,4
(b)AIC-best
0.3 0.2
0.1 0 3
Figure 7.8 The ME-based best estimated pdfe for sample n,=5O0 To investigate the influence of the number of B-splines on estimation accuracy, two cases are considered: M=N=% and M=N-20. The sample used for the estimation is 200. The estimated results for these two cases are plotted in Figure 7.9.
180
7. 2-D estimation based on large samples
Figure 7.9 Influence of number of B-splines Observations similar to 1-D case indicate that the number of B-splines has important influence on the estimation accuracy. If the number is underutilized, the global trend variation cannot be fully described; if the number is overshoot, the influences of local changes resulting from statistical fluctuations are magnified. In either way, estimation is not satisfactory. Therefore, ME analysis helps us to find the right number of B-splines so that both underutilization and overshooting are avoided.
7.4, Numerical examples
181
Hence, ME analysis, or information theory in the broader sense, is really interesting and marvelous. It is able to find therightnumber of B-splines from a lot of candidate models. Example 7.2 Revisit of Example 7,1 with RBFas the appraximant In example 7.1 products of two 1-D B-splines have been used as approximants for 2-D pdf. In this example, we use radial B-spline functions as approximants. The data used are as same as those given in Example 7.1. Draw a sample of size nt = 200 from the population. MODEL(iV) denotes the approximation using N radial B-spline functions, that is, MODEL(N) f(x,y\a) = y,a,Sl(r,hl),
r = J(x~xly+{y-yiy
(7.34)
If the sample is given, the coefficients are determined through the iterative formula
e^—xtff1^'
'•t=^t-xif+iyl-ylf
(7.35)
Suppose M and N B-splines are used in the X- and Y-directions, respectively. If these B-splines are uniformly distributed along the two directions, hi is determined by , (b-a c-d") h =max ,——
,_,,. (7.36)
where [a,b] denotes the interval X lies in and [c,d\ denotes the interval Y lies in, ME and AIC are then calculated from
ME = - \\f(x,y |fi)tog/<*,y\ i)dufy+-£-
(7.37)
AIC = -—£\agf{xf,yt
(7.38)
|a)+^
For example, for MODEL(36), we have
182
7. 2-D estimation based on large samples
h, = max] — , — 1 = 1, L=447.69, ME=2.726 & AIC=2.413 V 6 6
(7.39)
Continuing the procedures, we obtain results for differing models with some of the results shown in Table 7.2 (a). The ME-based best estimate is given by MODEL(64), that is, N=M. The best estimate by minimizing AIC is MODEL(36). Both cases are plotted in figure 7.10 (a) and (b). As same as 1-D case, ME-based best estimate is better than AlC-based best estimate because the latter underutilizes the number of B-splines.
Table 7.2 Part of calculated results in the case of RBF used
(a) n. = 200 N 36 48 56 64 72 80 88 96 104 112 120
L 447.69 454.79 444.01 422.62 428.20 421.63 422.04 423.74 420.75 421.22 421.79
ME 2.726 2.862 2.917 2.707 2.846 2.834 2.904 2.975 2.986 3.078 3.118
= 800 AIC 2.413 2,508 2.495 2.428 2.496 2.503 2.545 2.593 2.618 2.661 2.703
N 70 80 90 100 120 140 160 170 180 190 200
L 1721.35 1673.05 1652.99 1630.63 1636.85 1632.24 1628.23 1630.66 1630.49 1628.14 1629.33
ME 2.523 2.360 2.289 2.247 2.305 2.339 2.360 2.392 2.405 2.417 2.443
AIC 2.237 2.190 2.177 2.162 2.194 2.214 2.234 2.249 2.261 2.271 2.285
If we try a larger sample of size ns — 800, estimation accuracy will be improved. Without giving the details, some of results are listed in Table 7.2 (b). In this case, both ME and AIC are minimized by MODEL(IOO) and the corresponding pdf is plotted in Figure 7,12(e). In summary, RBF is applicable to distribution estimation, too. As sample size is small, AIC is prone to underutilize B-splines. As sample size is large, ME- and AlC-based best estimations converge the true distribution.
7.4. Numerical examples
(a) nx =200, ME best
0.4 0.3
(c) ^ = 8 0 0 , ME & AIC best
0.2 0.1 0 3
9
3
Figure 7.10 Estimation using RBF for two samples
183
7, 2-D estimation based on large samples
184
Example 7.3 Joint distribution of wave height and wave period in the Pacific (Histogram) The joint distribution of wave height and wave period shows high irregularity worldwide (Watanabe, 1993), Here, the distribution of wave height and wave period is approximated by the two dimensional B-spIine functions. The data are taken from the records measured aboard ships for winters in the Pacific (Watanabe, 1993). The wave periods range from 0 to 16 seconds and wave heights range from 0 to 16 meters. The histogram of the data is shown in Figure 7.5 and the observed date are given in Table 7.3. Order 4 B-spline functions are used in this example, only for the sake of comparison. The estimation is based on histogram using the method presented in section 7.2.5. The results of the analysis are partially given in Table 7.4. The best numbers of B-splines are M=17 for wave height and iV=26 for wave period. The estimated pdf is shown in Figure 7.11. In the figure, H represents wave height in meter and T represents period in second. The vertical axis is probability density. For this case, AIC and ME predict the same results.
Table 7.3 Wave height and wave period data in the Pacific (winter)
T\H 0.000.751.752.753.754.755.756.757.758.759.7510.7511.7512.7513.7514.75-
069756 204031 81364 15627 3463 1093 449 37 21 26 17 12
2 0 1 2
513988 135590 J28686 41338 19535 3421 1446 120 40 40 26 13 9 2 2 8
64312 77195 141318 72427 21112 5818 2621 4526 404 272 187 9 15 3 0 9
72217 43463 104275 86094 31431 8470 3468 2465 496 309 214 12 10 5 1 6
82072 33394 85139 95867 54283 17808 6835 3760 1216 627 485 68 90 24 12 8
7.4. Numerical examples
185
(continued)
T\H 0.000.751.752.753.754.755.756.757.758.759.7510.7511.7512.7513.7514.75-
9641 107.92 27368 36476 30493 12482 4880 2695 865 480 416 66
m 16 17 12
10844 9245 21975 28595 28275 16224 7595 3539 1407 682 691 97 402 34 20 121
11236 1918 4999 7067 6954 5000 2563 1270 538 304 234 55 61 17 13 32
1257 3159 6148 8068 8688 7042 4558 2494 1141 530 534 125 116 50 24 43
1322 1167 2192 2151 2103 1603 1190 698 341 196 197 35 46 17 10 17
1461 2688 5012 3949 3577 2898 2419 1712 953 619 493 163 145 39 21 78
Table 7.4 ME and AIC values for wave height-period joint distribution (winter, ns = 19839000)
M
N
-L
ME
AIC
15 16 17 17 18 18 19 20
15 16 26 29 18 27 19 17
3.601 3.600 3.589 3.588 3.592 3.589 3.590 3.587
3.846 3.839 3.817 3.819 3.837 3.819 3.824 3.843
3.601 3.599 3.589 3.590 3.592 3.589 3.590
3.587
The distribution form shown in Figure 7.11 is too complicated to be represented by using simple distribution forms. Without powerful B-splines and ME analysis, it would not be possible for us to find such complicated distributions in an easy and pleasant way.
186
7. 2-D estimation based on large samples
Figure 7.11 Estimated joint distribution of wave-height and waveperiod
7.S Concluding remarks Bivariate distributions are frequently encountered in applications. And general methods for estimating underlined joint distribution remain lacking. For a given bivariate sample or histogram, it is not easy to work out the distributions. The method presented in this Chapter resembles in character that presented in Chapter 6, but it is more useful. For in 1-D cases, we have alternative methods to evaluate the underlined distributions, but in 2-D cases, we hardly have an alternative method directly estimating distribution from a sample or histogram. In this sense, the method developed in this Chapter is both necessary and meaningful. Several factors have significant influence on estimation accuracies. They are sample size, the number of B-splines and criteria for model selection. Numerical examples show that as sample size get larger and larger, the method presented in this Chapter can give estimates converging to the true distribution, Underutilization of B-splines is unable to describe important characters in the pdf form, while overshooting of B-splines is sensitive to local changes. ME and AIC analysis finds the right number of B-splines to keep the balance between global character in the disfribution form and local changes. ME predicts better results than AIC, validated both numerically and theoretically. It is recommended to use ME analysis
7.5. Concluding remarks
187
The estimations are not sensitive to the order of B-splines, if only order 3 and order 4 B-splines are considered. Therefore, both order 3 and order B-splines are suitable choice in real-world applications. Again the functions of the method are summarized as follows If a random sample from the population is drawn, the method can directly estimate the distribution from the data. The estimation procedure is composed of three steps (1) Given M and N, find the M-L estimates of the linear combination coefficients, (2) For different M and N, compare ME values and find the M and N minimizing ME, (3) The model with M and N minimizing ME is the pdf we are after.
This page intentionally left blank
Chapter 8
Estimation of 1-D complicated distribution based on small samples
In Chapters 6 and 7, a systematic method has been introduced to estimate the density function of a random variable through given samples. In the method, a pdf is approximated by a linear combination of B-spline functions, and the best number of B-splines (best model) necessary for the approximation is determined by minimizing Measured Entropy (ME) or AIC, As pointed out in the closing of Chapter 6, the method works well for large samples. When the sample size is small, however, the method presented in the previous chapters cannot yield satisfactory results. The estimated combination coefficients show strong irregularities as we will see later in this chapter. In passing, it is pointed out that it is always hard to answer the question: how large a large sample should be and how small a small sample should be. So we need to clarify it before we proceed further. Here by large sample we mean that the sample size is much greater than the number of unknown parameters to be determined from the sample. The rest cases are defined as small samples. For example, if 50 B-splines are used (50 unknown parameters) and we have only 40 sample observations, then this is a small sample problem. On the other hand, if 50 B-splines are used, but we have 200 sample observations, then it is a large sample problem. "Large" and "small" are used in the relative sense throughout the book. To overcome the shortcoming of the method previously presented, Bayesian approach, in which all parameters are treated as random variables, is employed to improve the estimation accuracy. So the new method to be introduced in this chapter is characterized by a preliminary prediction-correction process. At the preliminary prediction stage, we still use a linear combination of B-spline functions to approximate a pdf as we did before, but the number of B-spline functions is prefixed and may be much greater than sample size. Strongly influenced by statistical fluctuations, the combination coefficients are highly
189
190
8, 1-D estimation based on small samples
irregular. At the correction stage, the smoothness restriction on the combination coefficients is introduced, based on which the so-called smooth prior distribution is constructed. By combining the information obtained from the preliminary prediction and from smooth prior distribution in the Bayes' rule, the influence of statistical fluctuations is effectively removed, and greatly improved estimate, which is close to the true distribution, can be obtained. So in the method to be presented, a new strategy is applied. It is based on the belief that a number of B-splines (say 50 or 100) large enough have sufficient flexibility to represent most distribution functions we are interested in. The coefficients estimated based on a small sample would produce poor estimates. The reason is that the amount of information provided in sample data is less than that neeessary for determining the unknown parameters. Prior distribution provides an alternative to pool information from sources other than sample data. The total amount of information provided in sample data and by prior distribution is enough for determining all unknown parameters. Bayesian methods have found a variety of applications in the fields of science and engineering. Particularly mentioned is the field of structural reliability, a field to study if a structure (ship, building, dam, aircraft, etc) Mis within design life cycle. Because these engineering structures are so durable that their failure probabilities are very small. Thus the available feilure data are scarce, any inference based on which is dubious. Bayesian methods are thus specially preferred. (Ditlevsen, 1994; Sander & Badoux, 1991; Manners, 1994; Zong & Lam, 2002). 8.1 Statistical influence of small sample on estimation In Chapter 6, a linear combination of B-spline functions was used to approximate the pdf of a continuous random variable X as
(8.1)
To determine the coefficient vector a, a random sample of size ns is taken from the population. Let the sample observation point be x f {l=l,2,...,n s ). Then the maximum likelihood estimate of a is given by the following simple iterative scheme, see equation (6.13),
(8.2)
8.1. Statistical influenece of small sample on estimation
191
The model above works well for cases where N is greatly smaller than the sample size ns. In this chapter, we will extend the above method to the cases where the relationship above is violated. Such extension is needed when the distribution under consideration is complicated and the accessible data are limited. Consider the following example to see the influence of statistical fluctuations on estimation. Example 8.1 Statistical influence of small sample on estimation Suppose the true distribution is given by
(8.3B)
where g(x) is a function defined on [0,10]
(8.3b)
By use of the method for generating random numbers, 40 random numbers were generated from f(x) as a given sample. The following model, in which 50 B-splines are used, is employed to approximate f(x)
/(x| «) = £<**,(*)
(8.4)
Note that the number of B-splines (N = 50) is greater than the sample point number («,=40). Using the generated sample, the unknown parameters in fee above model are estimated from equation (8.2) and the results are shown in Figure 8.1. The predicted pdf exhibits noise-like irregularities, unable to describe the distribution under consideration at all. The high irregularities are attributed to statistical fluctuations. Recall 1-D examples in Chapter 6, where the number of B-splines was around 10, and the number of sample points was always above 30. It is thus not surprising that good results were obtained in Chapter 6.
192
8. 1~D estimation based on small samples
Preliminary prediction
4
6
N = 50
8
10
Figure 8.1 Influence of statistical fluctuations on the estimation If a is estimated from histogram data, the same irregularities are observable, too. The irregularities remind us of treating the coefficients as random variables. In feet, even in the case of large sample, they are tteated as random variables, too. The sample size is always finite, and the coefficients always deviate from the true values to somewhat extent. The difference is that in the case of large samples, deviations of the estimated values are small, asymptotically normally distributed. In the case of small samples, however, statistical fluctuations are not small and the distribution is not close to the normal distribution. Taking the parameter as random variables represents a radical shift of statistical methodologies from traditional Fisher's statistics to Bayesian statistics. Bayesian method will thus be employed in the following. 8.2 Construction of smooth Bayesian priors 8.2,1 Analysis of statistical fluctuations Due to statistical fluctuations, the predicted parameter a is generally not coincident with its true value b, and there exists a deviation w,, as schematically shown in Figure S.2, a, =
(8.5)
where a is either estimated directly from sample data or from histogram data. Define two vectors
S.2. Construction of smooth Bayesian priors
= (bl,bi,~;btlf,
193 (8.6)
where the superscript "7™ denotes matrix transpose. From the Central Limit Theorem, a, are asymptotically normally distributed with mean bt as the sample size becomes large. It is therefore reasonable to assume that wt is a normal random variable with zero mean and common variance er1.
.-••••
Figure 8.2 Deviations between true and predicted coefficients
Generally, the smaller the sample size is, the larger a2 is. Hence we have
(8.7)
Once b is given, me likelihood fimction for a is then given by
(8Ja)
which is obtained from the independence assumption. Taking the common factor out of the product symbol, we obtain
194
8. 1-D estimation based on small samples
( b>
°
Using HI to denotes the distance in Euclidian space of the form
Equation (8.8b) can be rewritten in more compact form
] e x pi-^-4} LJ_|| a _bfl
(8.8d)
Because a is predicted directly from the sample, P{& | b) contains all the necessary information in sample data. This is the preliminary prediction as we mentioned at the beginning of this chapter. 8.2.2 Smooth prior distribution of combination coefficients The reason for us not to be satisfied with the preliminary prediction in Example 8.1, or the reason for us to say that the preliminary prediction is not a good estimate, is that the predicted pdf is quite irregular and we cannot see the global variation trend of the random variable under consideration. In other words, a good estimate should be smooth enough and exhibit the global variation trend of the random variable under consideration. We would say that smoothness is the prior information on b. Smoothness is just an abstract expression, but we may quantify it using mathematical terms. Because smoothness is mathematically defined by equating the right and left derivatives at one point, we have
f
-=H
(8-9)
where the subscript + and - denote the left and right derivatives, respectively. Approximating the equation above by use of central difference, we obtain
8.2. Construction of smooth Bayesian priors
195 (8.10)
Or, we may write them in the following form, (8.11) where e, denotes the approximation errors resulting from discretization, or neglect of higher order terms in the derivatives. Figure 8.3 depicts how the errors result from approximation. Whenever straight lines are used to approximate the slopes of a curve, finite errors are observed. When ef is zero, ^ is simply the mean of bM and d,_,, and thus b is in feet a straight line, which is of course smooth. If e, is very large, the right and left derivatives are much different, and b is not smooth. Therefore, e = (eJ,ei,--,eK_])basically defines the smoothness of b. Because we require that b be smooth, the mathematical expectation of e, is surely zero. We fiirther assume that e^s are mutually independent and identical normal random variables with zero mean and same variance T2. With these assumptions, we have
(8.12a)
Figure 8.3 Errors resulting from derivative
196
& 1-D estimation based on small samples
Noting that e r e = e£ + e2 + > • •+ej_,, we rewrite the equation as
F(e) =
1
(8,12b)
Introducing the matrix 1 -2 1
1 -2 1
1 -2
1
(8.13)
1 -2
1
we may rewrite equation (8.11) in the matrix form = Db.
(8.14)
Substituting the equation above into equation (8,12b), we obtain the prior distribution for the true combination coefficient b
(8.15) Introducing another new parameter a?2 =a2 lt%, we obtain fi-Z
exPi-|L-||Dbf
(8.16)
The distribution above, obtained based on the smoothness assumption, is called smooth prior distribution. It is purely constructed from our requirement
8.2. Construction of smooth Bayesian priors
197
that b be smooth, and thus it contains the information not in sample data. On the other hand, F(a | b) solely contains the information in sample data. We thus have two sources of information: sample date and prior distribution. One basic idea behind Bayesian statistics is to make Ml use of these two sources of information. It is possible to qualitatively study the behavior of P(b), Variance o* is estimated from sample, and thus fixed with respect to P(b). The parameter that is changeable is to2. As t»2 is large, the exponential function in equation (8.16) vanishes in the most part of the distribution domain of e = D b , except near e = 0. So F(b) is characterized by sharp rise near the origin, but fast dying away from the origin. This is schematically shown in Figure 8.4 (a). The differences among elements of vector b must be small. So vector b represents a nearly straight line connecting bi and by, see Figure 8.4(b).
(a)
P(b)
Figure 8.4 Correlation of o 2 and the behavior of vector b
If to2 is small, the exponential function in equation (8.16) does not vanish in the most part of the distribution domain of e = Db, and P(b) is characterized by wide spread. The differences among the elements of vector b are likely large, and
198
8. 1-D estimation based on small samples
vector b might represent a curve of irregular shape as shown in Figure 8.4 (c). Therefore, i*(b) basically controls the shape of vector b. 8.3 Bayesian estimation of complicated pdf 8.3.1 Bayesian point estimate Two models have been established up to now. One is the likelihood function P(a | b ) , which pools the information in sample data, and the other the prior distribution P(b), which defines the smoothness of b. Combining them in the Bayes' rule introduced in chapter 2, we obtain the posterior distribution P(b | a) = C x P(a | b)P(b)
(8.17)
where C is the normalizing constant. Substituting equations (8Jd) and (8.16) into equation (8.17) yields
(8.18a)
Further simplification leads to
.
1
expi
V2s-cr}
r[|a-bf+
I 2er L"
ffi2l|Dbf"|i
(8.18b)
-lj
In Chapter 5, we have shown that the best estimate should maximize the poster distribution. This is called Bayesian point estimate. Equivalently, it can be obtained by minimizing £? 2 (b), Q2 (b) = ||a - b f + w1 jDbf ~» min Setting
(8.19)
8.3. Bqyesian estimation of complicated pdf
= 0, i =
l,2,-'-,j
199
(8-20)
we conclude that b must satisfy (8.21a) the solution to which is ~ l a, and a1 = i
(8.21b)
Here I is the unit matrix of the form
1=
(8.22)
Equation (8.21a) defines a system of linear equations, which can be solved by several methods. Gauss elimination is fee most frequently used one, and this issue will be detailed in section 8.3.3.
(8.23)
The pdf calculated from following equation, which is obtained from Equation (8.1) with a replaced by b is thus an estimate based on Bayesian method. Note the denominator in the equation above, which is included to make the pdf under consideration to satisfy the normalization condition, that is, a pdf should integrate to one.
200
8. 1-D estimation based on small samples
Two more parameters remain undetermined. They are eo2 and e 2 . Once they are given, we may obtain Bayesian point estimate b from equation (8.21) immediately. In the next section, an entropy analysis is used to find the most suitable co2. 8,3.2 Determination of parameter co2 Parameter c*2 confrols the scatter in a. LargeCT2represents large deviation of estimated parameter a from the true parameter b, and exhibits highly irregular traits in pdf, as shown in Figure 8.1. Small o 2 represents closeness of the estimated distribution to the true one. From equation (8.21), we note that b = a if ca2=0, and the Bayesian estimate degenerates to the preliminary prediction. From equation (8.19), we note that b is a straight line if a2 —> ao . Therefore, a2 and ra2 are very important in our analysis. In most analyses using Bayesian approach, these two parameters, or one of them is subjectively selected, subjecting Bayesian statistics to the criticism by objectivists. It is, therefore, demanded for any attempt to find or build an "objective" way to determine these parameters. Consider the marginal probability P(a | a2, a-1) = j>(a 1 b)P(b)db
(8.24)
which describes the averaged behavior of a, and gives the true distribution of the occurrence of event a. Because it is independent of parameter vector b, we can obtain the estimates of m2 and u2 by maximizing the marginal probability. Find a/ and a* such that P(z | to1, a2) = j>(a |b)P(b)cfl» -> max.
(8.25)
As mentioned in Chapter 5, this optimization underlines the request that the uncertainty associated with the unknown parameters is minimum. If b were not treated as a random variable and took a particular value, this principle would degenerate to the Maximum Likelihood Principle. Rewritten in the following logarithmic form, the marginal probability is
8.3. Bayesian estimation of complicated pdf
201
MEB(©2, er2) = - 2 log P(a | o 2 , v2) = - 2 log J>(a | b)P(b)
(8.26)
MEB, denoting the logarithm of the probability, is the abbreviation of Bayesian Measured Entropy. The coefficient 2 in the front of the integral symbol is included for historical reason. Introducing the following two matrices in equation (8.26)
Ho"
(O7)
HOD™ I
We have
Q1 = ||a - bf + m1 |Dbf = ||x - Fbf.
(8.28)
Furliiermore, using the following equality Q2 = ||x - F b f = ||x - Fb| 2 + ||F(b - b)||J = q2 + |F(b - b)| 2
(8.29)
where q3 is given in Equation (8.21), we may write MEBfa/, n1} in the form of MEB(©2 .tr 1 ) = -21og |P(a|b)P(b)db
r
(830)
-
The last integral in the equation above is in feet the multidimensional normal distribution with the following normalization factor
-=L-
|F r F|" 2 , | F r F = determinant of F r F .
W2*oJ '
'
'
(8.31)
202
8. 1-D estimation based on small samples
Thus we obtain MES(©2,cr2) = -(JV-2)log© 2 +(#-2)log<7 2 +-^-+log|F r F| +constant
(8.32)
The best model is obtained by minimizing MEB, equation (8.25). In other words, the derivatives of MEB with respect to cf2and
©
or do
(8.33a)
Bto
0
(8.33b)
From equation (8.33a) we obtain estimate of a 2
JV-2
(g.34)
Equation (8.33b) is so complicated that we cannot find an explicit solution to it. Instead of solving it analytically, we turn to numerically solve the following equivalent problem MEB(©2 ) = ~(N~ 2) log ®% + (N - 2) log q2 + log|F r F| -> min
(8.35)
which is obtained by substituting equation (8.33a) into equation (8.32). Note that we write MEB(ro2) in stead of MEBCm^o2) because o 2 is a known parameter in equation (8.35). This is in fact a nonlinear programming problem free of constraints. Its solution exists because MEB(t»2) is a continuous function of to2. HJ2 can take any value in the interval (0,oo) . However, in applications, m1 e [10"4,10s] is a suitable choice. It is difficult to mathematically prove that MEB has a unique minimum. But from numerical experiments performed up to now by various authors, it seems MEB has unique solution. The typical dependence of MEB on 2 changes
8.3. Bqyesian estimation of complicated pdf
203
from small to large. At certain value of co2, MEB changes from a decreasing function to a slowly increasing function. This is numerically explored in Example 8.2. Several methods may be used to solve the nonlinear programming problem defined by equation (8.35). In the following procedures, the simplest method, dividing the interval into equidistance subintervals, is employed. In summary, the Bayesian estimation is composed of following steps (1) Obtain preliminary prediction a from equation (8.2) as input; (2) Divide [lO^.lO 3 ] into JV& equidistance subintervals. Loop over (a) Take a tn2 from each subinterval; (b) Obtain smooth Bayesian point estimate b from equation (8.21) for the given (a2; (c) Estimate MEB(BJ 2 ) from Equation (8.35). (3) Choose the m2 which minimizes MEB. The value JVj is approximately around 20-100. If this division is not enough, make a finer division of a smaller subinterval and compute corresponding MEB. Note that it is not necessary to find the exact value of oo2 that minimizes MEB. If the reader is not satisfied with the optimization method above, he or she may try other optimiation methods, among which Genetic Algorithm (GA) has received extensive studies in recent years. Using GA, step 2 a) will be changed to random selection of a point in the interval [lO^JO 3 ]. For one-dimensional cases, however, these optimization methods do not show significant differences, and thus we do not explore further, here. In spite of seemingly complexity, the above analysis can be easily implemented and yields quite satisfactory and robust results. The relevant codes are provided in me floppy attached to this book, and some examples will be given in section 8.4. 8.3.3 Calculating b and determinant of |FTF| In terms of numerical procedures, three issues are of major concern. The first is to numerically find b defined in equation (8.21b) or equally speaking, numerically solve the linear equation system (8.21a). The second is to numerically find the determinant of matrix |F T F|, and the third is to numerically minimize MEB in equation (8.35). The last issue has been addressed in the section above, and we focus on the rest two numerical issues. To solve a linear equation system Cb=a like equation ( O l a ) where C is an NxNcoefficient matrix, Gauss Elimination is most frequently used. It consists of two parts: forward elimination and back substitution. In the forward
8. I-D estimation based on small samples
204
substitution, transform C into unit upper triangular through a series of elementary matrix operations. Gauss Elimination is also often used to evaluate matrix determinant. The Elimination comes into play due to the fact mat if the elements in lower triangle of the matrix are all zeroes, the determinant is simply the product of the diagonals. This also applies in the upper triangle zeroes. It means:
IfC =
0 0 0
0
cr c2. 0 0
,orC =
0 0
0 0
% o
(8.36)
then the determinant is (8.37)
So, the elimination works to shape the matrix into one of these forms. Because of its wide availability, the method is not detailed here. 8.4 Numerical examples In the first example, we assume a true pdf f(x), and generate nt random numbers from this distribution. Using these random data we estimate the coefficient vector b based on the analysis above, hi the next two examples, the present method is applied to two practical problems. Example 8.2 A Compound distribution Consider the following mixed distribution of a random variable X (8.38a)
/(*) = !
- ^
+0.2x
(8.38b)
8.4, Numerical examples
205
This is as same as that given in Example 8,1. And the preliminary prediction is plotted in both Figure 8.1 and Figure 8.5. First assume N = 50 and generate ns - 40 random numbers. Following the steps given in section 8.3.2, the optimum Bayesian estimate for the ease is found, a/ is varied from 0.01 to 1000. And the interval is divided into 100 equidistance subintervals. Some of the search results are shown in Table 8.1. And the minimum MEB is found around f^lOO. (The exact value should be ©2=98).
= 50 MEB=19J
Q
•a 1 a,
0.1 0
0
2
4
6
8
10
Figure 8.5 Bayesian estimation based on 40 sample points
200
400
600
800
m2 Figure 8.6 Relationship between MEB and
1000
206
8. 1-D estimation based on small samples Table 8.1 Dependence of MEB on to2 («, = 40)
MEB 0,01
30. 60. 90. 100.
78.75 21.57 20.10 19.82 19.81
MEB 110. 150. 200. 500. 1000.
19.83 20.01 20.33 21.81 22.81
Figure 8.5 shows the estimate based on «»=40. Compared with preliminary prediction, the Bayesian estimate is much improved and is close to the true distribution as shown in the figure. If we notice the noise-like irregularity of the preliminary prediction and the closeness of the Bayesian estimate to the true distribution, the usefulness of the analysis employed here is strongly supported. The search process for the optimum MEB for this case is shown in Figure 8.6. From the figure, we see that after the optimum point, MEB does not change much with ©2. The function relationship between to2 and MEB is quite simple. This is also true in all reported simulations, which are given or not given here. Thus, it is a numerical observation that there exists only one optimum solution for MEB and that MEB is a simple and quite smooth function of OJ2. The same relationship can be quantitatively observed from Table 8.1. A sharp drop in MEB values is observed for oa2 to be varied from 0.01 to 100. After that, MEB exhibits very slow increase for co2 to be varied from 100 to 1000, To investigate the sample influence, three samples of pseudo random numbers are generated from the given distribution. The sample size are 15,20 and 40, respectively. The optimum estimated pdfe for the three samples are plotted in Figure 8.7. As expected, as sample size increases, the estimate becomes better. What is impressive is that the estimates based on 15 and 20 sample points are also quite close to the given distribution. Figure 8.7 shows the estimates for three particular values of co2. The estimated pdfe for t»2 = 0.1, co2 = 98 and
207
8.4. Numerical examples 0.6
A
0.5
1 u
a
1
i
B s == 40
Givetf
0.4 0.3 0.2
N == 50
MEB=65.2
•K.
« 2 =2000 MEB=23.4
1 ' \
©2 = 98
MEB=19J
0.1 0.0 0
2
4
6
8
10
X Figure 8.7 Influences of co2 on the estimation Example 8.3 Distribution of ice loads on propeller blades Ship navigation in polar waters presents a formidable challenge to ships' propulsion systems as large ice pieces impinging on their propeller blades sometimes result in stresses exceeding the yield strength of the blade material. Damage to propellers is costly and can also spell disaster if a ship becomes disabled in a remote area. Ship operators, propulsion system designers, regulatory bodies and classification societies are all concerned with the safe passage of vessels in ice-infested waters. To better define the requirements for good design for ice navigation, extensive measurements of propeller ice loads at full scale were made by several organizations over the past two decades. Huge number of data was collected. Special distributions (Weibull, Normal et al) used to be employed to fit sample distributions. Zong and Lam (2000) applied the present method to estimate impact forces on propeller blades measured on a vessel, the MV Robert LeMeur (Laskow et al, 1986). The impact forces showed quite complicated distribution forms in all reported cases. One figure, which shows the angular location of blade impact force is reproduced in the form of histogram in Figure 8.9. In the figure, the abscissa denotes the start orientation of blade interaction in degree. Zong and Lam (2000) used 100 B-splines to approximate the distribution. The sample size is 181. From these data, the distribution is estimated and the results are plotted in Figure 8.8 with solid line. The optimum &2 and MEB are 24 and 681.7, respectively. The majority of the ice impacts occur at 40 degrees and 240 degrees (Figure 8.8).
S, 1-D estimation based on small samples
208
Because the method treats sample data automatically, it reduces the time for estimation greatly.
0,010
N=100 « s = 181 MEB=-681.7
Histogram
V
0.008
Estimated pdf 0.006
1\
0.004 0.002 0.000
0
40
80
120
160
200 240 280
320 360
Angular location (degree) Figure 8 J Blade impact force along angular location
Example 8.4 Distribution of maximum bow stress in sea state 7for LASH Italia Large impulse loads are experienced by a body during impact with water. This is often designated as slamming. Both fore and bottom parts of a ship are exposed to slamming, as well as the deck between the two hulls of a catamaran or a surface effect ship. Slamming loads can lead to structural damage as well as induce whipping. Current trend to produce innovative, lighter and faster ships, increases the probability of slamming and, in addition, lighter structures are more prone to slamming damage than conventional structures. Both aspects ask for a better understanding and treatment of slamming loads and, in general, slamming loads are random due to the fact that sea waves are random. It is highly desirable in the shipbuilding industry to obtain reliable information on the frequency and magnitude of slamming. Petrie et al (1986) reported the measured data on maximum bow stress resulting tram slamming in green seas based on 42 voyages. Because mere are a lot of data collected, a systematic and automatic method is needed to analyze the distributions of the slamming forces and bow stesses in order to save cost and manpower. The above model was
8.5, Application to discrete random distributions
209
applied to analyze the data. One example is presented here in Figure 8.9. The abscissa denotes the IS minute intervals of maximum bow stress in k psi in sea state 7 for LASH Italia.
0.6
a °-
Histogram
4
N=100
Estimated pdf
I 0.2 0.1 0 2 3 4 Maximum bow stress (kpsi) Figure 8.9 Estimated pdf and histogram of maximum bow stress As before, 100 B-splines were used to approximate the distribution. The sample size is 81. In this case, more B-splins than data are used. The optimum estimated pdf is shown in Figure 8.9 with solid line and the original data are given in histogram, The optimum a2 and MEB are 38 and 82.6, respectively. 8.5 Application to discrete random distributions The technique developed in this chapter is applicable to discrete disnibutions in the case of small samples. To see this, we consider the finite scheme as follows A=
A 4 -
4,
(8.39)
If sample size is large, the maximum likelihood estimate of pt is obtained by maximizing likelihood function (8.40)
210
8. 1-D estimation based on small samples
where q; = — is the frequency observed for event 4 • n, Solving the equation yields the maximum likelihood estimate of pt in the following P,=%-
(8.41)
If sample size is large, smooth estimates are expected to obtain. On the opposite, if sample size is small, large fluctuations are expected in the estimate p t . To remove the irregularities present in p t , we assume that A=*,+w(.
(8.42)
This is exactly equation (8.5). Therefore, all approaches developed thereafter are applicable to determining b,, smooth estimates of %. The details are neglected. 8.6 Concluding remark! 8.6.1 Characterization of the method In this chapter, a method that can directly identify an appropriate pdf for a continuous random variable based on a small sample is presented. Three models are established. One is the likelihood function that pools the information from sample date, one is the smooth prior distribution that defines the smoothness of the unknown parameters, and the last is Bayesian Measured Entropy that helps us to find the most suitable (a2 (and the prior distribution) in an "objective" way. The usefulness of the method is examined using numerical simulations. It has been found that the estimated pdfs under consideration based on the present analysis are stable and yield satisfactory results even for small samples. The method is characterized by assumption-free. We do no assume the specific form of the distribution. This is attractive in applications because 1) the possibility for subjective evaluation of a particular sample distribution is greatly reduced; 2) dependence on experts' opinions is minimized and a person with basic training can use the method without difficulty; and 3) the sample size may be either large or small. It is particularly suitable for the cases where many sets of data demand analysis. After the observed data are input into a computer, all the rest analysis can be automatically done using the current method. The CPU time, as mentioned before, is surprisingly short. In a typical problem, to find the optimum solution, it usually takes several tens of seconds on a Pentium 4 PC, even the simplest optimization method is used.
8.6. Concluding remarks
211
8.6.2 Comparison with the method presented in Chapter 6. The solution strategies are different for the methods presented here and in Chapter 6, To see that, we point out the hidden assumptions in using the methods in Chapter 6: (1) A function composed of more B-spIines has stronger capabilities to describe a complicated sample distribution than a function composed of less B-splines; (2) A function composed of more B-splines is less stable than a function composed of less B-splines; (3) ME analysis finds the optimum point on which capability and stability are balanced based on sample observations. The fundamental assumptions behind the method introduced in this Chapter are (1) A function composed of large number B-splines has enough capability to describe a complicated distribution; (2) Such function must be unstable, vulnerable to statistical fluctuation; and (3) A new smoothing technique is introduced to remove the influences resulting from the statistical fluctuation. We believe that the method introduced in this Chapter is more powerful than that introduced in Chapter 6 based on the fact that the current method applies to large and small samples. But the method introduced in Chapter 6 is the basic input for the current method 8.6.3 Comments on Bayesian approach In Bayesian statistics, all unknown parameters are treated as random variables obeying prior distribution. Prior distribution contains the information that is not available in the sample, and must be specified elsewhere. Thus, it is very important to construct the prior distributions. The frequently used four ways are (see Chapter 4 for details), (1) (2) (3) (4)
Determination of prior distribution by use of historical data; Information free prior distribution; Equally ignorant principle; and Maximum entropy prior distribution
In the cases that historical data are not available, the first method is no longer applicable. The rest three methods share something in common. They by to assume prior distributions in such a manner that they do not contain information,
212
S. 1-D estimation based on small samples
or contain information as less as possible. In the extreme case, the prior distribution is a uniform one spread over a finite interval. In this chapter, another method to construct prior distributions is presented; construct them through reasonable physical restrictions like smoothness. Here the method used to construct the smooth prior distributions is a simple one, assuming that the curve connecting three points bt.\, bt and bM is a straight line. It is reasonable to assume that this curve is a parabolic curve. In doing so, equation (8.11) is replaced by e, = U(bM + V , ) - l « 4 - 3 ( * M + 4-*)
(8-43)
The rest is as same as those previously presented. Some techniques for smoothing are available, and most of them can be easily embedded into the current method. Whatever smooth prior disfribution is used, it should minimize MEB. Minimum MEB plays the role of replacing subjective evaluation by objective evaluation. This is impressive if we recall the subjective character of Bayesian statistics. There seems no objection to the positive use of subjective information as proposed in Bayesian approach. But Bayesian approach is subject to criticism due to the arbitrary use of prior distributions. It is individual-dependent. Utilization of MEB enables us to reduce such arbitrariness or subjectiveness to the minimum extent and put Bayesian analysis on a more objective foundation.
Chapter 9
Estimation of 2-D complicated distribution based on small samples
As mentioned in the previous chapters, the method presented in Chapters 6 and 7 apply to cases of large samples. In Chapter 8, a Bayesian method is developed for 1-D distributions based on small samples. The method is characterized by prediction-correction two-steps procedures. In this chapter, the Bayesian method for small samples is extended to 2-D distributions. The structure of this Chapter is almost completely as same as Chapter 8 for the sake of comparison and easy understanding. This chapter is an extension of Chapter 8 in terms of dimensionality. In Chapter 8 was discussed Bayesian estimation of 1-D distributions while in this chapter to be discussed is Bayesian method for estimating 2-D distributions. This chapter is also an extension of Chapter 7 in terms of method. In Chapter 7 was presented a method for estimating 2-D distributions based on large samples while this chapter to be discussed is a method for estimating 2-D distribution based on small samples. Therefore, the present chapter is an extension of the methods presented in the previous two chapters. The method to be presented here was first proposed by Akaike (Tanabe, 1983). It applies to only observations on equidistance lattice points. The method was later extended to arbitrarily observed data on non-lattice points through introduction of B-splines {Zong et al, 1995). Zong & Lam (2002) made a further generalization by using the method to determining complicated probability distributions, 9.1 Statistical influence of small samples on estimation In Chapter 7, a 2-D random variable is approximated by a linear combination of B-spline functions. In a 2-D rectangular domain is defined a bivariate continuous random vector (Jf, Y) and an orthogonal coordinate QXY. Along the
213
214
9. 2-D estimation based on small samples
two coordinates, M and N B-splines are used to approximate the joint pdf f(x, y) of (X, Y) in the form of
f(x,y) » f{x,y I a) = £fX*,(*)*,O0
(9.1)
where atJ (i=l,2,,.,,M, j=l,2,...,N) are the linear combination coefficients and vector a is a = (fl lls -.. 1 o uv a 2 |,-'-a 2JV ,•»•,%,—.s^f.
(9.2)
Using the following formula, we are able to find the combination coefficients
where
} for order 3 B-splines,
(9.4a)
for order 4 B-splines.
(9.4b)
x
3
3
c s = \ B,(x)$k \ Bj *<-* *>-*
Suppose the true pdf is given in Example 3.9, see Figure 3.9(a). For easy reference, Figure 3.9(a) is reproduced in Figure 9.1. «»=200 random numbers are generated from the true distribution, as shown in Figure 3.9(b) . We use MxN = 40x40 = 1600 B-splines to approximate the pdf under consideration, that is,
f(x,y)« f(x,y\ a) = Jfl^Cx^Cy).
(9.5)
9,1. Statistical infltmnce of small samples based on estimation
215
Slightly different from 1-D cases, in 2-D cares the number of B-splines ( MxN = 40x40 = 1600 ) is much greater than the sample point number («.,=200), up to 8-fold in this case. Recall in Chapter 8, 50 B-splines have been used for 40 sample points. Therefore, 2-D estimations are more challenging than 1-D estimations.
Figure 9.1 The assumed pdf
Figure 9.2 Influence of statistical fluctuations on estimation accuracy With the generated sample, the unknown parameters in the above model are estimated from equation (9.3) and the results as shown in Figure 9.2. The predicted pdf are even more irregular than 1-D cases. As mentioned in Chapter 8, the highly irregularities are due to statistic fluctuations.
216
9, 2-D estimation based on small samples
9.2 Construction of smooth 2-d Bayesian priors 9.2.1 Analysis of statistical fluctuations Due to statistical fluctuations, the predicted parameter a is not consistent with its true value b, and there exists a deviation wlt so that we have ^ y y
i = l,2,.~,M;j = l,2,-,N.
(9.6)
where
where the superscript "7** denotes matrix transverse. The Large Number Theorem asserts that atf are asymptotically normally distributed with mean bv as the sample size is large enough. It is assumed that v/tj is a normal random variable with zero mean and common variance a1. Generally, the smaller the sample is, the larger a1 is. Hence we have P()
L
\ \ \ \\ ,
2cr J
i = l,2,-,M;j = l,2,-,N. l,2,,M;j l,2,,N.
(9.8)
With b is given, the likelihood fiinction for a is then given by
Rewriting it by taking the common fiictor out of the product symbol, we obtain
(9 9b)
-
If I is used to denotes the distance in Euclidian space of the form It It
INI=J^I+&K+•••+&»
*
( 9 - 9c )
9.2. Smooth prior distribution of combination eoeffisnts
217
equation (9Jb) can be rewritten in more compact form
Because a is predicted directly from the sample, P(a|b) contains all the necessary information in sample data. This is the preliminary prediction as we mentioned in the beginning of this chapter. 9.2,2 Smooth prior distribution of combination coefficients To remove the irregularities from the estimations, we require b be smooth. This piece of information is not contained in the sample data at all. It is our perception of how the data should change in space. In 1-D case, the smoothness condition is obtained by equating left- and rightderivatives. In the case of 2-D space, smoothness is somewhat difficult to obtain. The reason is that defining 2-D smoothness requires derivatives in two directions. So there are many ways to address the problem of smoothness. It is not appropriate to discuss 2-D smoothness in detail here. We focus our attention on a special class of 2-D smoothness defined in complex variable theory. The functions extensively studied in complex variable theory are called analytical functions. If an analytical function is continuous, it will be differentiable infinitely many times. This is a nice property and thus an analytical function is very smooth. The real and imaginary parts of an analytical function satisfy Laplace equation, respectively. Laplace equation is also called harmonic equation, so functions solving harmonic equation are called harmonic functions. On the other hand, if two functions solve Laplace equation, and they are orthogonal, then they form an analytical function. Reminded by this property, we impose smoothness condition on b by requiring that it be a harmonic function. Because b is discrete in nature, we require that b satisfy the discrete form of Laplace equation, that is, for those B-splines on the four boundaries, en=bMil+bt_hl-2bn, +
i = 2,3,--,M~l, 2
«u=*ij + . *b-i- V J = 2,3,-,tf-l, «w=V*+JU*-2*W. * = 2»3,-,M~l, *MJ = hM.m + bM,M -2bMj, j =2 , 3 , - , JV-1. For those that are not on the four boundaries, we have
(9.10a) (9.10b) (9.10c) (9.10d)
9. 2-D estimation based on small samples
218 +b
~4bn
< -U
(9.11)
where eff denotes the approximation errors resulting from discretization, or neglect of higgler order terms in the derivatives. Because we require that b be smooth, the mathematical expectation of ej} is surely zero. We further assume that etJ are mutually independent and identical normal random variables with zero mean and same variance T2. With these assumptions, we have
(9.12a)
Denoting eTe = ef2 + e|, + • • • + e ^ , we may rewrite the above equation in the form of
_L
1
(9.12b)
"2/ Introducing the matrix D, I
D,
I (9.13a)
D=
D, where
D,=
1 -2 1 1 -2
1
1
-2
(9.13b)
1 1 -2
1
9.3. Formulation ofBayesian estimation ofcomplicatedpdf —2 1 -4 1
1 -4
219
(9.13c)
1 -2
And I is the unit matrix, D is an (MN-4)xMN matrix, e is an ( M V - 4 ) x l vector and a is an vector, we may rewrite equations (9.9) and (9.12) in the matrix form (9.14)
= Db.
Substituting above equation into equation (9.12), we obtain the prior distribution for the true combination coefficient b i
— 2r
e
"p 1 - ^ J
(9.15) Introducing a new parameter (9.16) we obtain
-
llDbll
(9.17)
The above prior distribution is purely constructed from our demanding that b be smooth, and thus it contains the information unavailable in sample data. On the other hand, F ( a | b ) solely contains the information in sample data. One basic idea behind Bayesian statistics is to make fall use of these two sources of information. 9.3 Formulation of Bayesian estimation of complicated pdf 9 J.I Bayesian point estimate The two models established up to now, that is, the likelihood function
220
9. 2-D estimation based on small samples
P(a | b) pooling infomiation from sample data, and the prior distribution P(b), defining the smoothness of b, can be combined in the Bayes1 theorem introduced in Chapter 8 in the following way F(b|a) = Cx/'(a|b)P(b)
(9,18)
where C is the normalizing constant. Substituting equations (9.9d) and (9.17) into equation (9.18) yields
•vAff-4
-LUI
expj-^lDbir \.
(9.19a)
Further simplification leads to
|__i_[| _
exp |__i_[| aa_bjf +S
J
lDbf ] |
(9.19b)
In Bayesian statistics, the Bayesian point estimation of b can be obtained by maximizing the posterior distribution. Or equivalently, it can be obtained by minimizing Q1 (b), Q2 (b) = fa - b f + e>2 jDbf -> min
(9.20)
Setting -3f~ = G, i = l,2,—,M;j
= 1,2,.-,N
(9.21)
we obtain the solution b = (I + o2JfHTl a,q1=Q1 (b).
(9.22)
The pdf calculated from following equation, which is obtained from Equation (9.1) with a replaced by b is thus an estimate based on Bayesian method
9.4. Houholder transform
f(x,y\ b) = JtZbyBXxJBjiy).
221
(9.23)
Once w2 is given, we can obtain Bayesian point estimation b from equation (9.22). Given different eo2, we may obtain different estimates b . Which estimate is the most suitable remains a problem. In the next section, an entropy analysis is used to find the most suitable a2. 9.3,2 Determination of parameter co2 From equation (9.22), we note that b = a if c/=0, and the Bayesian estimation degenerates to the preliminary prediction. From Equation (9.20), we note that b is a plane if a2 -> «o, Thus, e^ is a very important factor in our analysis. We hope to determine a/ in an "objective" way. Note that the marginal probability P(a | m2) = J>(a | b)P(b)db
(9.24)
describes the averaged behavior of a, and gives the true distribution of the occurrence of a. Because it is independent of parameter vector b, we can obtain the estimation « 2 andCT2by maximizing the marginal probability. Or, Find a/ such that P(a j m1) = j>(a |b)F(b)db -» max.
(9.25)
Rewritten in the following logarithmic form, the marginal probability is MEB{©2) = - 2 log F(a | &1) = - 2 log j>(a j b ) P ( b ) * .
(9.26)
Substituting F(a | b) and F(b) in equation (9.26) and denoting,
x=
,F =
F r F | = determinant of F r F,
, and
(9.27)
222
9. 2-D estimation based on small samples
we have (9.28) Furthermore, using «f|2
g 2 = Qx - Fb|| = |x - F b | + |[F(b - b)| = f2 + F(b -1
(9.29)
we may write MEBfa?) in the form of 2
) = -21ogJP(a|b)P(b)db
lima v2MM
t
2
(9.30)
= -21og
The last integral in the equation above is in fact the multidimensional normal distribution with the following normalizing factor 1
MN
FrF
1/2
(9.31)
where the parallel sign j | denotes the determinant of the matrix inside the parallel signs. Thus we obtain MEB(® 2 } = -(MM - 4) log Q}2 + (MN - 4) logo"2
• + log|FrFJ + constant.
(9.32)
The best model is obtained by minimizing MEB. Differentiating MEB in the equation above with respect to a 2 and setting the derivative to zero, we obtain (9.33)
9.4. Householder transform
223
Substituting 2
2
+ (M¥-4)logf 2 + log|F r F|-» min (9.34)
In summary, the Bayesian estimation is composed of following steps (!) Preliminary prediction from Equation (9.2); (2) Smooth Bayesian point estimation from Equation (9.22) for a given (3) Estimation of MEB(«J 2 ) from Equation (9.32); (4) Repeat steps (2) and (3) for different m2 and choose the m2 which minimizes MEB, It should be pointed out that special numerical treatments are needed to find the determinant of the matrix |FTF| because the size of this matrix is very big. For example, if M=N=40, the size of this matrix is of the order of 3200 x 1600, which is hardly solvable by using simple numerical methods. This matrix is, however, a sparse one. We may Householder reduction method to solve it. 9.4 Householder Transform Recall in Chapter 8 that the determinant F r F is found through Gauss elimination method. It is not, however, feasible here to use the method for finding the determinant F F in equation (9,34) for the size of matrix F F is so large that the computer time becomes unbearable. An alternative method must be used instead. The proper method for mis case is the so-called Householder reduction method, which is suitable for large-scale sparse matrix. Householder method is composed of three steps. First of all, transform a real symmetric matrix A into a tridiagonal matrix C. Then the eigenvalues of matrix C is calculated by use of root-finding method and the corresponding eigenvectors are found. Finally, the determinant is solved using the eigenvalues by use of the following theorem from linear algebra. Theorem 9.1 If Ai,A2,---,An determinant of A is
are eigenvalues of the matrix A, then the
Based on this theorem, we focus on finding the eigenvalues of a matrix using Householder transfer. Details are given in the appendix to this chapter.
224
9, 2-D estimation based on small samples
Figure 9.3 Bayesian estimation based on 200 sample points.
8900 8700 03
8500 8300
20 Figure 9.4 Relationship between MEB and to2 and the search for the optimum point (minimum MEB}
9.5, Numerical examples
225
9.S Numerical examples In the first example, we assume a true pdf f(x,y), and generate ns random points from this distribution. Using these random data we estimate the coefficient vector b based on the above analysis. In the second example, the present method is applied to a practical problem. Example 9,1 Normally correlated 2-dimensional pdf Suppose the true distribution is given by Equation (9.3). It is further assumed that M=JV=4Q (totally 40 x 40 =1600 B-splines are used). Then we generate n, = 200 random points. The shape of f(x,y) is shown in Figure 9,2 and the random points are shown in Figure 3.9 (b). By following the steps given in section 9.3, the optimum Bayesian estimation is found as shown in Figure 9.3 for this case. Compared with preliminary prediction, the Bayesian estimation is much improved and is close to the true distribution as shown in Figure 9.2. If we notice the noise-like irregularity in the preliminary prediction and the closeness of the Bayesian estimation to the true distribution, the usefulness of the analysis employed in this paper is strongly supported. The searching process for the optimum MEB is shown in Figure 9.4. From the figure, we see that after the optimum point, MEB does not change much with m2. The function relationship between eo2 and MEB is quite simple. Thus, it is a rule of thumb (because it is just our observation without mathematical justification) that there exists only one optimum solution for MEB( &t2) and that MEB( m2) is a simple and quite smooth function of m2. To see the influence of sample size on the estimation, three samples of pseudo random points are generated from the given distribution. The sample sizes are 100,200 and 300, respectively. The optimum estimated pdf for the three samples are plotted in Figure 9.5. What is impressive is that the estimations based on 100, 200 and 300 sample points are quite close to each other. Figure 9.6 shows the estimations for three specific a)2 values. The estimated pdf for o 2 = 0.01, m2 = 8 and m2 = 200 are plotted in Figure 9.6(a)-(c). If m2 is very small (say, a>2 =0.01), or the variance t2 of b is very large, the Bayesian estimation is close to the preliminary prediction, and the smoothness information is ignored in the estimation. On the other hand, if©2 is very large (say, m2 =200), or the variance r 2 of b is very small, the estimated pdf tends to be a flat plane, and the sample information is ignored in the estimation. Thus there are two extremes. On one extreme, the smoothness information is ignored, and on the other extreme the sample information is ignored. By aid of Bayesian approach, we successfully combine the two sources of information and obtain greatly unproved estimation.
9. 2-D estimation based on small samples
226
®2=W MEB=8830
(a) Estimation based on 100 sample points ro2 = 8 MEB=8335
I •a 8
9
J
(b) Estimation based on 200 sample points co2 = 9 MEB=8027
(c) Estimation based on 300 sample points Figure 9.5 Estimation based on three different samples
Probability density
Probability density
Probability density
1°
(I O
r 8'
f T3
I
KJ
9. 2-D estimation based an small samples
228
But, it should be mentioned that around the optimum point, MEB varies very slowly. For example, the MEB differences for a?2 =10 and m1 =2 in this example is less than 1%. Example 9.2 Joint distribution ofwave-height and wave-period
H(m)
Figure 9.7 The Bayesian estimation of the joint distribution of wave-height and wave-period (M=N=30). H is wave height and Tis wave period. This problem has been studied in Chapter 8 as an example for large sample. Here we use the method developed in this chapter to solve the problem again. The data of wave height and wave period are taken from the records measured by ships for winters in the Pacific (Zong, 2000). The wave periods range from 0 seconds to 16 seconds and wave heights range from 0 meters to 16 meters. We use 900 B-spline functions to approximate the distribution (Af=30 and i\N30). The optimum a1 =0.01 and MEB=W*. The estimated pdf is shown in Figure 9.7 9.6 Application to discrete random distributions The methodology presented in sections 9,2~9.4 has been applied to logistic model. In this section, we apply it to discrete random variable to show its capability. Consider a bivariate discrete random vector of the form
9.7. Concluding remarks
229
pn (9.35) p M2
r
•••
p *MN.
The fee M-L estimate of PtJ is n, S-
(9.36)
where ntJ is the number of event A^. If sample size is small, large fluctuations are expected in the estimate p,. To remove the irregularities present in pt, we assume that
Again we obtain (9.6). From here, the formulas presented in sections 9.2-9.4 are applicable. 9.7 Concluding remarks We are often faced with the cases where observed samples show complex distributions and it is difficult to approximate the samples with well known simple pdfs. In such situations, we have to estimate the pdf directly fiom samples. Especially influenced by statistical fluctuations, estimation based on small samples becomes more difficult. In this paper, a method mat can directly identify an appropriate pdf for a 2dimensional random vector from a given small sample is presented. Three models are established in this paper. One is the likelihood function, which pools the information in sample data, one is the smooth prior distribution which defines the smoothness of the unknown parameters, and the last is the MEB which helps us to find the most suitable m1 (and the prior distribution) in an "objective" way. The usefulness of the method is examined with numerical simulations. It has been found that the estimated pdfe under consideration based on the present analysis are stable and yield satisfactory results even for small samples.
230
9. 2-D estimation based on small samples
Appendix: Householder transform A.I Tridiagonalization of a real symmetric matrix The special case of matrix that is tridiagonai, that is, has nonzero elements only on the diagonal plus or minus one column, is one that occurs frequently. For tridiagonai sets, the procedures of LU decomposition, forward- and back substitution each take only O(N) operations, and the whole solution can be encoded very concisely. Naturally, one does not reserve storage for the full N x N matrix, but only for the nonzero components, stored as three vectors. The purpose is to find the eigenvalues and eigenvectors of a square matrix A. The optimum strategy for finding eigenvalues and eigenvectors is, first, to reduce the matrix to a simple form, only then beginning an iterative procedure. For symmetric matrices, the preferred simple form is tridiagonai. Instead of trying to reduce the matrix all the way to diagonal form, we are content to stop when the matrix is tridiagonai. This allows the procedure to be carried out in a finite number of steps, unlike the Jacobi method, which requires iteration to convergence. The Householder algorithm reduces an n*n symmetric matrix A to tridiagonai form by n - 2 orthogonal transformations. Each transformation annihilates the required part of a whole column and whole corresponding row. The basic ingredient is a Householder matrix P, which has the form P = I-2wwr
(9.A.1)
where w is a real vector with |w|2 = 1. (In the present notation, the outer or matrix product of two vectors, a and b is written as a b r , while the inner or scalar product of the vectors is written as a r b.) The matrix P is orthogonal, because P2=(l-2wwr)-(l-2wwr) = I - 4 w - w r + 4 w . ( w T - w ) - w T =1
(9.A.2)
Therefore P = P - I . But P r = P, and so P r = P-I, proving orthogonality. Rewrite P as T
P =I~
(9.A.3) XT
where the scalar H is
Appendix „
231
1I |
(9.A.4)
21 '
and u can now be any vector. Suppose x is the vector composed of the first column of A. Choose (9.A.5)
u=x+xe
where ei is the unit vector [1, 0,. . . , 0 ] r , and the choice of signs will be made later. Then
{jx] +1 This shows that the Householder matrix P acts on a given vector x to zero all its elements except the first one. To reduce a symmetric matrix A to tridiagonal form, we choose the vector x for the first Householder matrix to be the lower n — 1 elements of the first column. Then the lower n — 2 elements will be zeroed:
10 0 P, A = 0
0
0
(«-D p *1
a,.,
irrelevant
0
k 0 0 0
(9.A.6)
irrelevant
Here we have written the matrices in partitioned form, with ("~13P denoting a Householder matrix with dimensions (n - 1) x (w - 1). The quantity k is simply plus or minus the magnitude of the vector [a2l»• • •, anl ] r . The complete orthogonal transformation is now
9. 2-D estimation based on small samples
232
k A' = P A P =
(9.A.7)
0 0
irrelevant
0 We have used the fact that P 7 = P. Now choose the vector x for the second Householder matrix to be the bottom « - 2 elements of the second column, and from it construct
1 0 0 0 1 0
0 0
0 0
(9.A.8)
0 0 The identity block in the upper left corner insures that the tridiagonalization achieved in the first step will not be spoiled by this one, while the (n ~ 2)~ dimensional Householder matrix tn~2)P2 creates one additional row and column of the tridiagonal output. Clearly, a sequence of « - 2 such transformations will reduce the matrix A to tridiagonal form. Instead of actually carrying out the matrix multiplications in P • A • P, we compute a vector
Au H
(9.A.11)
Then 1» _ A
U
U
\ _ A
H A' = A P A = A - p u r - u -
(9.A.12a) (9.A.12b)
where the scalar K is defined by
2H
(9.A.13)
Appendix
233
If we write qsp-Xu
(9.A.14)
then we have A' = A-qu r -uq r
(9.A.15)
This is the computationally useful formula, Most routinesforHouseholder reduction actually start in the n-th column of A, not the first as in the explanation above. In detail, the equations are as follows: At stage m(m= 1,2,..., «-2) the vector u has the form Kn.WVi
,-,0].
(9.A.16)
Here i = n-m + l = n,n-l,--',3
(9.A.17)
and the quantityCT(|JC|2 in our earlier notation) is
a Hanf +(aaf+-
+ (aIJ_lf
(9 A18)
We choose the sign of a in (9.A. 18) to be the same as the sign of at,._, to lessen round-off error. Variables are thus computed in the following order: a, H, if, p, K, q, A'. At any stage m, A is tridiagonal in its last m - 1 rows and columns. If the eigenvectors of the final tridiagonal matriK are found (for example, by the routine in the next section), then the eigenvectors of A can be obtained by applying the accumulated transformation Q=P,P2-PB_2
(9.A.19)
to those eigenvectors. We therefore form Q by recursion after all the P's have been determined: Q2 2 Q,= P, • Qi+I,
/= » - 3
1.
(9.A.20)
234
9. 2-D estimation based on small samples
A.2 Finding eigenvalues of a tridiagonal matrix by bisection method Tridiagonalization leads to the following tridiagonal matrix c,
b2
(9.A.21)
K Once our original, real, symmetric matrix has been reduced to tridiagonal form, one possible way to determine its eigenvalues is to find the roots of the characteristic polynomial pn(X) directly. The characteristic polynomial of a tridiagonal matrix can be evaluated for any trial value of X by an efficient recursion relation. Theorem A.1 Suppose b^O
(i=2,3,.,.,n).
For a«y X, the characteristic
polynomials form a Sturmian sequence {pt (^)}" =0 satisfying po(A) =
,
i = 2,-,n
(9.A22)
If a(A) denotes the number for the sign between two neighboring numbers to change, then the number of the eigenvalues of A smaller than A is a{A), The polynomials of lower degree produced during the recurrence form a Sturmian sequence that can be used to localize the eigenvalues to intervals on the real axis. A root-finding method such as bisection or Newton's method can then be employed to refine the intervals. Suppose all eigenvalues of A satisfy Al < A^ <-"
, the interval length of which is (b0
—aQ)/2p.
In detail, suppose that we are to perform bisection at the r-th step and that the middle point of [af_,,&,._,] is dr
Appendix
<*r =}<^~. +*,-.)
235
(9-A.23)
Then we may compute sequence {Pj(dr)}*L0 and detennine the number of sign-change a(dr),
a(dr)
Based on the following criterion
dr,br = br_{
(9.A.24)
Because one of the following equations must hold a{ar)
and
a{br)2:k
^ must be on the interval [aF, br ] . The upper and lower bounds, b0 and a0 ,for \ are determined by
\h = max(4 ±(|6;I +1&,,I)} [ao=mm{4±(|6,|+|*,+Ij)}
•I
i = l 2 ••• n
f9A251
where we let bx = 6W+I = 0 . A J Determing determinant of a matrix by its eigenvalues Once the eigenvalues are found, the determinant is (9.A.26)
This page intentionally left blank
Chapter 10
Estimation of the membership function
People think that mathematics is precise and exact. It seems that worldwide scientists and engineers focus their attentions on finding the most exact numbers in their studies with errors smaller than 0.1 %, 0.01% and even 0.001%. On the other hand, vague values and fuzzy language are more often used. Words like "tall" in "he is tall", "fat" in "she is fat" are all fuzzy words without clear definitions. 190 cm is definitely tall, but 175 may be tall or may not. Being tall cannot be measured by single index like height, and it is also influenced by one's figure, weight, face or even clothes. Mathematics is not unable to handle fuzzy phenomena. There is a mathematics branch, called fuzzy set theory able to describe such fuzzy things. Fuzzy set theory is introduced into mathematics by Zadeh in 1970s {Zimmermann, 1985). Since then, fuzzy set theory has been applied to language studies, control etc. The most important concept in fuzzy set theory is the membership function. Although fuzzy set theory and statistical estimation are two different branches of mathematics, the former can also be studied by use of the latter. So in this chapter, the method introduced in previous chapters is applied to determining membership functions. The membership function is usually determined by the user in applications. It seems that direct determination of the membership function based on sample data was proposed by Fujimoto et al (1994) and Zong et al (1995). 10.1 Introduction In traditional set theory, the boundaries between two sets are crisp, meaning that an element is either in set A or set B, but not in both. In Figure 10.1 (a) is shown three crisp sets intervals A^,Aj,^ on the real line, that is, 4 =[0,30], 4 =(30,70], 4 =(70,100]. 237
10. Estimation of the membership function
238
A3
A2
U< i ^
.
-
30
.
•
.
-
.
•
.
'
.
•
,
•
.
-
.
•
,
•
P
3
70
i
i
Y
YP •v
(a) Crisp sets
1
20/w\ 40
60XX 80
>
(b) Fuzzy sets
Figure 10.1 Crisp sets vs. fuzzy sets If we introduce step functions defined by [1 if
^[0,30]
(10.1a)
if x 6 (30,70] 0 if xe (30,70]'
(10.1b)
1 if x e (70,100] 0 if x* (70,100]'
(10.1c)
then i4,, ^ and ^ may be redefined by
4 ={x:fti(x)*0}t i = l,2,3.
(10.2)
In words, 4 1S a collection of those points which do not make the function ft, vanish. We give a special name membership function to//,, which is characterized by (1)
(10,3a)
(2)
(10.3b)
We do not have any reason to say that membership functions ft, must be step functions. In feet, functions satisfying equations (10.3) can be used as membership functions. For example, we may define ft, by
10.1. Introduction
2|i-B|
20<x<=30
20
30<xi40
20
1 1-
40<xfi50
£-5O 20
239
(10,4a)
xJ
20 0
70<JC
1 1-2
x-20 20 20 0
20
(10.4b) 30<x<40 40 <x
JC<50
(10.4c)
1-; 1 These fimctions are plotted in Figure 10,1 (b). Similar to Equation (10.2), we define three sets, specially written in the form of A,, At and ^ , by
, 1 = 1,2,3.
(10.5)
They are Juzzy sets. For crisp sets, if point Pe Aj, then P £ A] and P €A^, see figure 10.1 (a). But for fuzzy sets, a point can be in two or even more sets
240
10. Estimation of the membership function
simultaneously. For Point P in figure 10.1 (b), both fc # 0 and /<2 # 0 . Thus P e Ai a n d P e A^ from definition (10.5). So fuzzy sets do not have clear boundaries. Detailed presentation of the theory ran be found in Zimmermann (1985) One of prominent applications of fuzzy set theory is in the field of quantitative description of language variables due to the fact that language itself is fuzzy. Take age for instance. "Young", "middle-aged" and "old" are such language variables that their boundaries cannot be clearly defined. A person being 30 years old may either be young or be middle-aged. It is hard to draw a clear line between "young" and "middle-aged", so is "middle-aged" and "old". Therefore, ^ = young , A^ = middle-aged and A^ ~ old are three fuzzy sets. If we want to describe language variables like "young", "middle-aged" and "old" in the framework of crisp set theory, we have to define two critical numbers. For example, 30 years old and 70 years old are defined as the two critical numbers. Below 30 is young and above 30 is middle-aged. Below 70 is middle-aged and above 70 is old. This is in fact what is shown in Figure 10.1 (a) and defined in equation (10.1), Such definition is easy, but a little wired due to the absurdness that a person one day past his thirtieth birthday is middle-aged and another person one day to his thirtieth birthday is young. They are, however, may be only two days different in age. In fuzzy set theory, 30 years old is considered half young and half middle-aged. It may be interpreted at two levels. At the first level, half young and half middle-aged means that he is at the transition stage of life from young to middle-aged. At the second level, half young and half old means that if a survey is made of those 30 years old, half of them may be considered young while half of them are considered middle-aged. In this sense, fuzzy sets with unclear boundaries are in fact more informative and reasonable than crisp sets. This is schematically shown in Figure 10.1 (b) and defined by equation (10.4). Once the rationale for fuzzy sets is justified, the remaining question is how to determine the membership functions. Equation (10.4) and Figure 10.1(b) is in fact a way to define the membership functions for these three fuzzy sets. How to find them? To answer the question is more difficult than to ask. Over the years, several forms of functions have been employed for defining membership functions. Those in Figure 10.1 are two examples. In general, the forms of membership functions may be problem-dependent. Even for the same problem, they are also user-dependent. For the same problem, two users may use quite different forms of membership functions. Take height for instance. In Figure 10.2 are shown two different forms of membership functions for assessing one's height. Three fuzzy sets are defined: "short", "medium" and "tail". One may choose those in figure 10.2 (a) for membership functions and the other may well choose those in Figure 10.2.(b) as membership functions. Using different forms of membership functions will definitely influence the subsequent
241
I O.I. Introduction short
medium
\y
I
short
medium
tall
Heigh (cm)
\y
AA
150 160
tall
170
ISO
Height (cm) I I 150 160
170
180
Figure 10.2 Two different membership functions
M
20 (a) randomly prepared triangles
30
40
50
60
70
80
(b) fuzzy sample
Figure 10.3 Experiment on the perception of triangular area by a student assessments. Therefore, a systematic methodology is needed to build and determine membership functions on an objective base. Membership functions ft (x) may be interpreted as the probability of a point to be in a specific fuzzy set. That is, for a point x, the probability for it to be A, is fit {%). Note that //, (x)+f*2 (x) + fi3 {x) -1, meaning that a point must be in a set. In summary, the membership function lying at the heart of fuzzy set theory serves as two purposes. They define fuzzy sets themselves and they determine the probability of a point belonging to a specific fuzzy set. A membership function is in fact a probability density function detennining the possibility for a point to be in a fuzzy set. Therefore, the methods proposed in the previous chapters are also applicable to the determination of the membership functions.
242
10. Estimation of the membership junction
By aid of measured entropy analysis, the membership function ean be determined in an objective way. 10.2 Fuzzy experiment and fuzzy sample To statistically determine membership functions, we need to design a fuzzy experiment. A fuzzy experiment is slightly different from an ordinary statistical experiment. Fujimoto (1994) designed a fuzzy experiment. Here his experiment procedure is slightly adapted for general purposes. 10.2.1 How large is large? A survey designed to collect the information about people's perception of fuzzy concept like "large" and "small" is to prepare sixty one triangular sheets, the areas of which ranged from 20 cm2 to 80 cm2, to say. The side lengths and interior angles of the triangles were randomly determined so that all the triangles had different shapes, as shown in figure 10.3 (a). Three reference triangles, representing "large" with actual area 70, "medium" with actual area 50 and "small" with actual area 25, are also prepared. In the survey, the three reference triangles are shown to the people under test first. Then these triangular sheets are shown to each person in the survey one by one. For each sheet, he or she is requested to assess it using "small", "medium" or "large". After all the sheets are shown to him, a fuzzy sample shown in Figure 10.3 (b) is obtained. In the figure, the horizontal axis is the real triangle area and the vertical axis is the classification result. The experiment may be repeated to another person, and so on. Finally, a fuzzy sample as shown in Figure 10.3 (b) is obtained, from which membership functions are to be estimated. What is remarkable in the figure is that mere are overlapping regions among each class. That is, the boundaries among each class are fuzzy. So, the data are called fuzzy data. The fuzzy data roughly indicate the correlation between the linguistic expressions and the physical quantity under consideration. The question asked at the very beginning of this section on "how large is large" can now be answered based on the data in Figure 10.3. For this problem, "large" is a set ranging roughly from 60 to SO cm2. It is clear from the figure that triangles with area below 55 cm2 has never been classified as "large". Thus, it is safe to say that triangles with area above 55 cm2 are large while those between 55 and 65 cm2 are in transition stage from "large" to "medium". 10.2.2 Fuzzy data in physical sciences The above example is somewhat arbitrary because the experiment has been performed on humans. In physical sciences, however, such fuzzy data are also
10.2. Fuzzy experiment andjuzzy sample
243
available. Fluid flow in a pipe is a topic which has been of extensive interest in fluid mechanics. A parameter describing the flow state is called Reynolds number Re defined by (10.6) where U is maximum flow velocity in the pipe, d is the diameter of the pipe and v is the kinetic viscosity of the fluid in the pipe. For water, v = 10~".
Laminar Turbulent AAMAAA A,
pooooooo Laminar log Re Turbulent Figure 10.4 Fuzzy data for Laminar and Turbulent flow Reynolds number Re dominates flow state in the pipe. If it is small, the flow is laminar, a state fluid particles smoothly flow in the pipe. If it is large, the flow is turbulent, a state fluid particles irregularly flow in the pipe. It has been experimentally found that transition from laminar flow to turbulent flow is not unique. Over a wide range of Reynolds number, the transition may happen depending pipe wall smoothness. If the pipe wall is very smooth, the transition does not occur until fe=4Q000, while if the pipe wall is rough, the transition may occur around 2100. When many experimental results are plotted in one single figure, we obtain fuzzy data schematically shown in Figure 10.4. Note this figure is schematic rather man physically accurate. Fuzzy set theory also provides us with a new look at some old problems. A rod under compression load P may become unstable suddenly as the load gradually increases. This is a well known buckling problem in mechanics of materials. The critical load at which the rod becomes unstable is theoretically a deterministic value, but in real-world application is random. If many rods of same size and same material are tested, we obtain data showing large scatters
244
W. Estimation of the membership Junction
around the theoretical critical value of the rod. From the viewpoint of mathematics stood away from physical background, fuzzy data in Figures 10.4 and 10.5 are no different in nature.
Unstable AAMAAA A
0O 00 0000 Stable
Figure 10.5 Fuzzy data for Stable and Unstable rod Quite some examples in a variety of engineering fields and science disciplines can be given showing similar patterns as in Figures 10.3~10.5. All these phenomena can be treated using fuzzy set theory. But figure 10.3 and Figures 10.4-10.5 are slightly different in that the latter is physics-based phenomena requiring membership functions to be objectively determined while the former is human-related the membership functions of which are better to be determined objectively. This justifies the need to find membership functions in an objective way. 10.2.3 B-spline Approximation of the membership functions In this section, we will determine the membership function through the fuzzy data obtained in the previous section. Suppose the universe of discourse is X. Let xt(£ = 1,2, •••,*„) be a sample from X. Further, suppose the fuzzy sets are At , A2 , ... , AM and the corresponding membership functions are H\(x),fi2{x),---,iiu{x}. In the triangle experiment, x, is the area of a triangle and ^ , Aj and A^ correspond to "Small", "Medium" and "Large", respectively. As done in the previous Chapters, we again assume that the membership function pf (x) can be expressed in the form of a linear combination of B-spline functions in the universe of discoursed, i.e.,
/ ft 2. Fuzzy experiment and fuzzy sample
245
ft (x) = auBl (x) + «,2Bj (*) + »•+a m B N (x) 21 1
IN
22 2
^07j
N
In concise form we have N
Mi(*) = 2 a « S j W » i = l,--,M
(10.8)
where iV is the number of B-spline functions which consist of the membership functions, a(, are the combination coefficients, and B} (x) is the B-spline functions of chosen order and is of the following form if order 3 B-spline function is chosen JL (x UAX) - (Xf -X.j)
— rf H(x
>
r
— x\ n{X-Xj,
(lU.yj
Now we are to determine the parameters a y . From the process of classification, the membership function ft,(xt) is regarded as same as the probability that a sample point xt is classified into Ixaacy set A,, that is, Pr[xteA,] = iii(xt).
(10.11)
Therefore, we employ the likelihood analysis for the determination of the membership functions. The probability of the classification event for all the sample points xt (£ = 1,2, • • •, ns) is expressed by the following likelihood function.
x
- n The log-likelihood function is
l=\ xte
246
10. Estimation of the membershipjumtion M
M
N
N
M
where J>(x) = £2>,«,(*) = £(£« s )^ W = 1According to the B-spline function properties we have ^iBJ(x) = \.
(10.14)
From the above two equations the following relationship is obtained, fdalj=l,j
= l,-,N.
(10.15)
A membership function is always greater than or equal to zero. To guarantee this we simply impose the restriction that all parameters atJ are greater than or equal to zero, that is, o,^0
i = l,...,M;j = l,...,N.
(10.16)
Usually, we hope /<,(.*) is a decreasing function of * and fiM(x) is an increasing function. It is obvious that these are guaranteed by the following equations: Gf u £:s u + 1 , j = \,...,N-l, am
(10.17a) (10.17b)
Based on the above analyses the best estimates of the unknown parameters atj must satisfy (10.18) subject to
fX=l
j = l,-,N,
(10.19a)
10.3. ME analysis crMaeru+I «M»^«MJ+1
«^0
j = l,...,N-l,
247 (10.19b)
j = l,...,N-l, i = l,...,M;j = l»...,iV.
(10.19c) (10.19d)
This optimization model has a good property. The optimum solution or? is unique (the proof is given in appendix 10.A). That is, if we can find a local maximum point a°j by some method it must be the global optimum solution. The optimization method like the Flexible Tolerance Method introduced in Himmelblau (1972) may be employed to solve the maximization problem (10JH10.9). 10.3 ME analysis In the above optimization problem, the optimization parameters are N and afJ . If N is fixed, this problem can be solved by ordinary nonlinear programming methods. Because N is also an optimization parameter, measured entropy analysis must be used. Consider a fuzzy sample of size ns. n'f denotes me number of unknown parameters for me j-tft fuzzy set. Without equation (10.15), n'f would be equal to N . Equation (10.15) is an interlink among the M fuzzy sets. It is hard to distribute these N equalities among the M fuzzy sets. We may, however, avoid this difficulty using the method in the following. The entropy for i-th fuzzy set is *.
(10.20)
The corresponding asymptotically unbiased estimator of the measured entropy is
(10.21) Then the total measured entropy should be q
E— •
(10.22)
248
10. Estimation of the membership function
From the definition, the second term become
2"B,
2 s,
2
ns
Therefore, the best N solves the minimization problem ME
= - Z * [tt,{x\a)\ogf*,{x\*)dx + 2;(-M~X)N
.
(10.24)
If AIC is used, AIC = -maxfl} + J ] ttf = -max{£}+N(M -1) -» min
(10,25)
where the number of free parameters nf = M x N—N because there are N equality constraints in the above model. In the actual optimization process, the best N which makes ME or AIC minimum is numerically searched by the following procedure. First, a positive integer N is assumed. Then the maximum likelihood analysis is carried out to obtain the best estimates of au, The value of ME or AIC is calculated. Next, changing N gradually, and the corresponding atJ and MEs or AICs are calculated. By the comparison of these ME or AIC values, the best N is found. 10.4 Numerical Examples Example 10.1 Triangular area problem Figure 10.6 (a) shows the fuzzy data obtained from the experiment described in section 10.2. Using these date, we may perform the ME analysis and the likelihood analysis using the method presented before. The B-spline functions of order 3 are used. For the optimization of the likelihood function, the Flexible Tolerance Method (FTM) as outlined in the appendix to this chapter is used. In many nonlinear programming methods a considerable portion of the computational time is spent on satisfying rather rigorous constraint requirements. However, the FTM does not satisfy the constrainte first. But, the constraints are gradually satisfied as the search process proceeds toward the true solution. Tables 10.1 gives the results of the analysis. From the table, the minimum ME value is obtained if eleven B-splines are used, and minimum AIC value is
10.4 Numerical Examples
249
obtained if eight B-splines are used. The estimated membership functions based on minimum ME and minimum AIC are plotted in Figures 10.6 (b) and (e). The results obtained from ME analysis and AJC analysis are quite consistent although the numbers of B-splines in the two analyses are different.
Table 10.1 Results for three fuzzy sets
N 6 7 8 9 10 11 13 12 14 15
-L
ME
AIC
19.7 16.7 14.5 15.5 14.3 13.6 13.1 14.2 13.1 13.2
5.54 4.11 2.95
33.65 32.69 32.50* 35.57 36.26 37.63 41.13 40.22 43.09 45.20
3J0 3.19 2.52* 3.00 3.27 2.65 3.18
20
30 40 50 60 70 80 Area
0 20 30 40 50 60 70 80 20 30 40 50 60 70 80 Area Area Figure 10.6 Estimation of the membership fiinctions for the problem of triangular areas
Example 10.2 Analysis ofthejkzzy data of five classifications In shipbuilding industry, welding is a very important process, taking more than 40% workloads. Welding quality is directly related to the life of a ship.
10. Estimation of the membership function
250
Thus, welding quality must be controlled within allowable errors. One factor controlling welding quality is called misalignment S as schematically shown in Figure 10.7. The figure shows that two vertical plates are welded to a horizontal plate. The two vertical plates are required to be in one line. This is difficult because a welding worker cannot see the upper plate as he is welding the lower plate to the horizontal plate, and he cannot see the lower plate as he is welding upper plate to the horizontal plate. So after the welding work, a surveyor must do quality examination. The examination results are classified into five classes: "Very Good" (VG), "Good" (G), "Medium" (M), "Bad" (B) and "Very Bad" (VB). Because of several factors involved, the quality assessment is not a simple correlation, but a complicated link as shown in Figure 10.7 (a), in which is given one hundred sample points to show the correlation between misalignment S and quality assessments. VG G j
1—
(
— \ <5: misalignment M
-»j k - $ ( t: thickness
B VB
1 .
0.5
ft
(b)ME
AAA/
AM 0.5
1
1.5
0
0.5
Figure 10.7 Five classifications of misalignment Using fuzzy data in Figure 10.7 (a), the membership functions are estimated. Order 3 B-spIine fiinetians have been in the analysis. Table 10.2 gives the results of the analysis.
251
10.4. Numerical examples Table 10.2 Results for five fuzzy sets
N 4 5 6 7 8 9
-L
Me
AIC
N
-L
Me
AIC
82.4 1.73 98.4 10 1.49 56.3 98.3 71.7 1.68 11 1.54 93.7 55.5 101.5 67.0 1.48 12 1.64 91,5 55.0 105.0 62.0 1.57 91.9 55.1 1.71 109.1 13 57.0 14 51.7 1.73 1.45 109.7 92.0* 56.1 1.49 94.0 53.4 1.78 115.4 15 The minimum ME value is obtained at N=i and the minimum AIC is given as N=6, Figures 10.7 (b) and (c) compare the membership functions for the two cases. Example 10.3 Sample Size Influences In this example, the sample size influences of fuzzy data on the forms of membership functions are discussed. Suppose membership functions of fee forms shown in Figure 10.8 (a) are given, representing three fuzzy sets A^, A^ and A^ . Then 100, 200 and 300 pairs of uniform random numbers (Me,v,)are generated in the range O S M , ^10 and Ofiv, S i {t = ls2,---,n1). These pairs of random numbers are plotted in Figure 10.8 (b). Acceptation and Rejection Method introduced in Chapter 3 is employed to generate fuzzy data by the following procedure If vt < ft(uf) and vt > pi}{ut), then ut e 4; If vt <, fij{ut) <Mui)> ^m ut 6 4} • If vt > ft,(ut), then «f is rejected;
0
2
4
6
8
10
0
2
4
6
8
Figure 10.8 Given membership functions (a) and random numbers (b)
10
252
10. Estimation of the membership Junction
«, assumes three values, 100, 200 and 300. The three corresponding fuzzy samples are given in the left figures in Figure 10.9. Using the method presented in this Chapter, the membership functions corresponding to the three generated samples can be estimated. The results for ME values are given in Table 10.3 and plotted in Figure 10.9. In the Table, only ME results are given and AIC values are neglected.
Table 10.3 Sample size influences (a):Sample size: 100
(b) Sample size:200
N 5 6 7 8 9 10 11 12 13 14 15
(c): Sample size:300
-L
ME
-L
ME
-L
ME
39.5 34.5 35.1 33.8 32.0 32.0 32.0 32.0 31.5 31.4 31.6
5.70 4.07 4.50 4.14 4.24 3.76 3.86
73,4 69.0 69,2 68.7 68.1 68.3 67.5 67.6 66.9 66.9 67.5
5.63 3.76 4.03 3.84 3.62 3.72 3.67 3.63 3.65 3.64 3.63
123.1 107.9 109.9 107.3 107.2 127.4 106.4 106.7 105.9 105.4 106.5
5.60 3.77 4.12 3.88 3.S4 3.65 3.58 3.61 3.69 3.68 3.69
3J2 3.83 3.72 3.90
For n, = 100, minimizing ME yields iV=14. The profiles of the membership functions as shown in Figure 10.9 (a) are complicated because too many Bsplines are used. If sample size is over-small, statistical fluctuations have significant influence on the shapes of the membership functions. For «, = 200 and na =300 , minimizing ME yields same results at JV=9. The estimated membership functions are shown in Figures 9 (b) and 9(c), which look much better than Figure 10.9 (a). Figures 10.9 (b) and (c) do not exhibit significant differences, showing the convergence as sample size increases. Both are close to the given membership functions as shown in Figure 10.8.
10.5, Concluding remarks
253
(a)n, =100 L M S 8
0
2
4
6
8
10
10
0
0
2
4
6
8
10
8
10
8
10
Figure 10.9 Given membership functions (right) and fuzzy data (left) 10.5 Concluding Remarks In this chapter, a probabilistic model that can determine the forms of the membership functions based on experimental data is presented. In the method, the membership functions are approximated by a linear combination of B-spline functions. The best number of B-splines which compose the membership functions under consideration is determined by minimizing ME. And the best combination coefficients are determined probabilistically based on the likelihood analysis.
254
10. Estimation of the membership function
The features of the membership functions presented in this paper can be characterized as follows: (1) The membership functions can be automatically determined from the fuzzy data by the proposed method. No prior knowledge of the form of the membership functions is necessary in the estimation. (2) The method works well irrelevant to the number of classifications. Also, numerical calculations are easily performed because the optimum solution is unique. (3) If the sample size of fuzzy data becomes larger, the estimated membership function becomes more accurate.
Appendix
255
Appendix: Proof of uniqueness of the optimum solution Consider the following nonlinear programming problem.
subject to the following constraints, gi(xy,x2,-,xK)Z0,
i = l,2,-,m,
XjZQ,j = l,2,--;n.
(10.A.2) (10.A.3)
The Lagrangian is m
L = f(xy,x2,---,xn) + J^uigi(xl,xt,---,xn).
(10.A.4)
1=1
If x° is a local optimum solution in the above problem, then the sufficient condition for xa to be the global maximum point is:
±8Li^U°)(xi-x!).
(10.A.5)
where w° satisfies — SO for all K, >0; u^—) = 0, i = l,—,m.
(10.A.6)
The Lagrangian for our problem is
M«) = Z E log^(^)+E^ x(£a # -l)+ & ff «,fl(«), in which g s ' s are constraints given by the following equations.
(10.A.7)
256
10. Estimation of the membership fimction
>0
(10 A8)
According to Taylor expansion of multivanate function we have
M
M
<10A 9)
-
where 0 < 0 < 1 Because,
in which
(
1 if(s = j,i = l)or(s = N - 1 ST(* = y +1, i - l)or (* *= 0 otherwise
So
Therefore, the third term on right hand side of equation (10.A.9) becomes
Appendix
i
M
N
257
If
—y "yy f £-i JLi
£-i
-« />! j=i 1=1
Mi
1=1 1,64 Mi
j=
From equations (10.A.9) and (10.A.12), the following relationship is obtained, M ft
££f^
SI
(10.A.13)
The above equation indicates that if a0 is a local optimum solution it must be the global solution.
This page intentionally left blank
Chapter 11
Estimation of distributions by use of the maximum entropy method
Maximum Entropy Method (MEM) cannot be ignored when information theory is applied to finding the pdf of a random variable. Although MEM was explicitly formulated first in 1957 by Jaynes (1957a,b), it was implicitly used by Einstein in the early 20th century for solving problems in quantum statistical mechanics (Eisberg & Resnick, 1985). However, both Shannon (Shannon & Weaver, 1949; Khinehin, 1957) and Kullback (1957) did not touch the topic of MEM in their original works. The central idea in MEM is the Maximum Entropy (Maxent) Principle. It was proposed for solving problems short of information. When formulating the Maxent Principle, Jaynes (1957) believed that in the cases where only partial information is available about the problem under consideration, we should use fee probability maximizing entropy subject to constraints. All other probabilities imply that unable-to-prove assumptions or constraints are introduced into the inference such that the inference is biased. MEM is a fascinating tool. Formulated properly, all special distributions (the normal, exponential, Cauchy etc) can be determined by use of MEM. In other words, mese special distributions solve the governing equations of MEM. Expressed in Jaynes language, all known special distributions represent an unbiased probability distribution when some of information is not available. Besides academic research on MEM, MEM has been applied to a variety of fields for solving problems present in communication (Usher, 1984), economics (Golan et al, 1996; Fraser, 2000; Shen & Perloff, 2001), agriculture (Preekel, 2001) and imaging processing (Baribaud,1990). In this Chapter, application of MEM for estimating distributions based on samples is introduced.
259
260
1 L Estimation by use of the MEM
11,1 Maximum entropy The detective-crack-case story features the MEM. Initially, only very few information is available. And as many suspects as possible should be investigated without favoring some suspects. As investigation proceeds, some suspects are eliminated from the investigation, but others receive more extensive and intensive investigations. As more and more evidences are collected, the true murder is found. At each stage of the investigation, the principle behind the investigation is to involve all suspects for investigation. No suspect is eliminated from the investigation if without strong evidence to support to do so. Mathematically speaking, the probability for each suspect to commit the crime is p,,, i = I, • • •, M, Initially, all suspects are equally suspected, and thus Pi=l/M.
(11.1)
As more information is collected, some suspects are eliminated from the investigation due to alibis. If Mi suspects are excluded, the probability for each of the remaining suspects is />,=1/(M-M,).
(11.2)
Finally, as Mi = M - 1 , only one suspect is identified. Suppose there are ten suspects initially. Then the detecting process can be written in the following form Ai
A2
A3
A4
Aj
AB
AI
At
A$
A\a
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
0 0 0
0.2 0 0,2 0 0.2 0.2 0.2 0 0 0 0 0.5 0 0 0 0.5 0 0 0 0 0 0 0 0 1 0 0
(11.3)
Two observations from the process are of direct relevance to MEM. First of all, at each stage, the probability is always assigned to each suspect with equal probability without ftvoring one of the suspects. Why to do so? Without flather alibis, it is dangerous to eliminate a suspect from the investigation too early. This is in fact the rational behind the Maxent Principle. Because the uncertainty is expressed by entropy, the maxent principle states ~Y.Pt log/% -» max.
(11.4)
/ /. /. Maximum entropy
261
The second observation is that as more and more information is collected, the probability to find the true criminal becomes larger and larger. Initially, all suspects are of equal probability to commit the crime. This piece of information, expressed as a constraint, is 5>=I.
(11.5)
Equations (11.4) and (11.5) outline the essence of MEM. In words, if we are seeking a probability density function subject to certain constraints (e.g., a given mean or variance), use the density satisfying those constraints which makes entropy as large as possible, Jaynes (1957a,b) formulated the principle of maximum entropy, as a method of statistical inference. His idea is that this principle leads to the selection of a probability density function which is consistent with our knowledge and introduces no unwarranted information. Any probability density function satisfying the constraints which has smaller entropy will contain more information (less uncertainty), and thus says something stronger than what we are assuming. The probability density function with maximum entropy, satisfying whatever constraints we impose, is the one which should be least surprising in terms of the predictions it makes. It is important to clear up an easy misconception; the principle of maximum entropy does not give us something for nothing. For example, a coin is not fenjust because we don't know anything about it. In fact, to the contrary, the principle of maximum entropy guides us to the best probability distribution which reflects our current knowledge and it tells us what to do if experimental data does not agree with predictions coming from our chosen distribution: understand why the phenomenon being studied behaves in an unexpected way (find a previously unseen constraint) and maximize entropy over the distributions which satisfy all the constraints we are now aware of, including the new one. A proper appreciation of the principle of maximum entropy goes hand in hand with a certain attitude about the interpretation of probability distributions. A probability distribution can be viewed as: (1) a predictor of frequencies of outcomes over repeated trials, or (2) a numerical measure of plausibility that some individual situation develops in certain ways. Sometimes the first (frequency) viewpoint is meaningless, and only the second (subjective) interpretation of probability makes sense. For instance, we can ask about the probability that civilization will be wiped out by an asteroid in the next 10,000 years, or the probability that the Red Sox will win the World Series again. We illustrate the principle of maximum entropy in the following three theorems (Conrad, 2005).
262
/ /, Estimation by use of the MEM
Theorem 11.1 For a probability density function pi on a finite set {xu---,Xn}, H(pl,p2l-,pn)£logn
= H(-,-,-,-) « «
(11.6) n
withequalityifandonfy if pi is uniform, i.e., p(xf) = l/nforalli. Proof: see equation (5.13) in Chapter 5. Concretely, if pt,p%,,.., pnare nonnegative numbers with ^pi = 1, then Theorem 11.1 says -^pt
logpt < logw, with equality if and only if every p, is
l/n. Theorem 11.2 For a continuous probability density function fix) with variance
(11.7) with equality if and only iff(x) is Gaussian with variance a2, Le., for some fi we have 0140
V23-CT2
Note that the right hand side of equation (11.8) is the entropy of a Gaussian. This describes a conceptual role for Gaussians which is simpler than the Central Limit Theorem. Proof. Let /(x) be a probability density function with variance a 1 . Let ft be its mean. (The mean exists by definition of variance). Letting g(x) be the Gaussian with mean p and variance a2
U Splitting up the integral into two integrals, the first is — log(2ffitr2)
(11.9) since
11,1. Maximum entropy
263
Jf(%)dx =1", and the second is 1/2 since f/"(x)(x—/ij J& = O"2 by definition. Thus the total integral is — [l + log(2fffxa)], which is the entropy of g(x). Based on equation (5.45), we conclude that (*)Iogs(x)dr (11.10)
Theorem 11.3 For any continuous probability density Junction p on (0,1) with mean A ) Sl +logl
(11.11)
with equality if and only iff is exponential with mean, i.e.,
Proof Let^x) be a probability density function on ( 0 , » ) with mean A Letting
-'"1,-)f(x)lagg(x)dx = ["/(*)Jogfloga+yU .Since/has mean A, this integral is log A + 1, which is the entropy of g. Theorem 11.3 suggests that for an experiment with positive outcomes whose mean value is known, the most conservative probabilistic model consistent with that mean value is an exponential distribution. In each of Theorems 11.1, 11.2, and 11.3, entropy is maximized over distributions on a fixed domain satisfying certain constraints. The following Table summarizes these extra constraints, which in each case amounts to fixing the value of some integral. Distribution Uniform Normal with mean ju Exponential
Domain Finite (-so, oo) (0.OD)
Fixed Value None
l(x-/i)'ftx)dx
[xfixyt
How does one discover these extra constraints? They come from asking, for a given distribution g(x) (which we aim to characterize via maximum entropy),
264
II, Estimation by use of the MEM
what extra information about distributions jfljx) on the same domain is needed. For instance, in the setting ofTheorem 11.3, we want to realize an exponential distribution g(x) = (l/l)e" I '*on(0,®) as a maximum entropy distribution. For any distribution^) on (0,oo),
-]f(x)logg(x)dx =
j
= (\OgA)[f{x)dx+\[xf(x)dx
(11.12)
j [xf(x)dx To complete this calculation, we need to know the mean ofjfx). This is why, in Theorem 11.3, the exponential distribution is the one on (0, w) with maximum entropy having a given mean. The reader should consider Theorems 11.1 and 11.2, as well as later characterizations of distributions in terms of maximum entropy, in this light We turn to «-dimensional distributions, generalizing Theorem 11.2. Entropy is defined in terms of integrals over Rn. Theorem 11.4. For a continuous pmbability density function f on R" with fixed covariances (Ty (11.13) where £ = {0^ ) is the covariance matrix for fix). There is equality if and only if j(x) is an n-dimensional Gaussian density with covariances try, We recall the definition of the covariances o# . For an w-dimensional probability density function f(x), its means are /* = [ Xif(x)dx
and its
covariances are dx.
(11.14)
In particular, ers a 0 , When n = 1, a~v = u2 in the usual notation. The symmetric
11.2. Formulation of the maximum entropy method
265
matrix £ = (cTj,) is positive-definite, since the matrix ({vi.Vj}) is positivedefinite for any finite set of linearly independent vt in a real inner product space Proof. The Gaussian densities onR* are those probability density functions of the form. G(x)=
l
==g -<'"X«-rt»W)
{1U5)
where £ = (
-j«+log[(2ir) B dets]}
(11.16)
by a calculation left to the reader. (Hint: it helps in the calculation to write £ as the square of a symmetric matrix.) Now assume/ is any w-dimensional probability density function with means and covariances. Define pt to be its i-th mean and define <% to be its covarianee matrix. Let g be the «-dimensional Gaussian with the means and covariances off. The theorem now follows from Theorem 11.3 and the equation
}
(11.17)
whose verification boils down to checking that (11.18) which is easy to verify. (Hint: Diagonalize the quadratic form corresponding to £.) D 11.2 Formulation of the maximum entropy method We now turn to more precise description of the method by considering a continuous random variable.
266
//. Estimation by use of the MEM
Consider a continuous random variable X, the pdf of which is f(x) .Some properties about the distribution are known. These properties are supposed to be expressed in the form of • = pi,i = 0,l,2,--,N.
(11.19)
If 4a{x) = 1, then equation (11.63) states that the area of the pdf is a constant, usually one. If $ (x) = x, then equation (11.63) states that the first-order moment of the random variable is known. Different choice of $(x) can yield a large class of distributions. In case of no confusion, p , are still called moments. The MEM is formulated in the following form Given (*)/(*)& = A.
i = 0,l,2,-,N.
(11.20)
Find pdf f{x) such that - J/(JC) log f(x)dx -* max.
(11.21)
These equations can be formally solved by introducing Lagrangian defined by
A = - J/Wlog/(x)&+1> [ J/M#(x)<&- A ] •
(11-22)
If variational principle is used for/(a:), that is, / ( x ) is assigned a variation Sf independent of x, we have
-T"
•
(11.23)
-1-log/p/* If equation (11.21) is satisfied, the variational in equation (11.23) must be zero, that is, Sh. = 0. Because this holds true for any Sf - 0 , the integrand in equation (11.23) must also be zero. So
11.2. Formulation of the maximum entropy method
267
M
(11.24) (=0
Solving this equation results in (11.25) where & are obtained by solving equation (11.20) U(x)f(x)dx = p,,
i = Q,l,2,-,N.
(11.26)
Equations (11.25) and (11.26) are the most important equations in MEM. Note that there are iv" +1 unknowns ^ . Let us see some simple cases. Suppose N +1 = 1 .that is, there is only one unknown in equation (11.26), $j = 1 , and pa = 1 , we conclude immediately from equation (11.25) that f(x) is a uniform distribution. If rephrased, this conclusion is to say that uniform distribution solves maxent equations (11.21) given that the area of pdf is a constant on the interval of definition of random variable X. This is the conclusion of Theorem 11.1 generalized to continuous random variables If iv* +1 = 2 , jfc = 1, $ =x and xe(0,oo), then the function on the right hand side of equation (11.25) is an exponential function, meaning feat the exponential distribution solves the maxent equation (11.21). In other words, the exponential distribution is the maxent distribution given £(JQ. In section 11.2, theoretical tool has been employed to obtain these conclusions. In this section, we are led to the same conclusions by simple mathematical equations. If JV + 1 = 3 , $ ) = 1 , &=x, $l=x1 and xe(-®,oo), then exponent of the function on the right hand side of equation (11.25) is a polynomial of degree 2. So the function on the right hand side of equation (11.25) can be expressed as a function of the form exp| -1 + J_^ x
L
2
=
'•=«t J 1=0
I
(11.27)
268
/ /. Estimation by use of the MEM
where c represents all constant terms. This is the normal distribution, and thus the normal distribution is the maxent distribution solving equation (11.21) if the mean£(X) and the variance E(X3) are given. We may obtain the conclusions obtained in section 11.2 again by using the method presented above, All these distributions, however, share one common thing, that is, they are solved if particular forms of $(x) and //( are given. If the particular forms of $(x) are not known, how to find the maxent distribution / ( x ) ? Another common question is that the number M is small, usually two or three. If it is large, it becomes very difficult to solve equation (11.26). It is thus seen that the beautiful theory presented in section 11.1 cannot help us to find most distributions encountered in real-world applications. The key points to answering the question are (1) how to represent $(x) and (2) how to find fk. In the following sections, these two points will be addressed so that the maxent disttibution for a general question is found. 11.3 B-spline representation of fZ*,(x) Our purpose is to construct a general method to solve MEM equations (11.25) and (11.26). We may assume that $(x) = x', i = %%•••,M-I. This is neither efficient nor good, however, as we discussed in the previous chapters. A better alternative is again to use B-spline functions to approximate $ (x), that is, 1 Bi(x)
i =0 i =
-
(11.28)
!,•••,M
Here i - 0 is a special case requiring that the area of a pdf be always one. If we deliberately define Ba(x) = 1, equation (11.28) becomes #(x) = JU*y = 0 , l , - , M .
(11.29)
With such choice of $(x), we may further define &»
f»
(11.30)
If a sample is drawn from a population, then pi can be statistically determined by the following statistic
11.3. B-spline representation of $ (x)
269 {11.31}
Because the large number theorem ensures that as sample size is large, we have
—f>(*a
# = 1,—,JV.
(11.32)
Returning to equation (11.26), we transform MEM to a problem to solve the following N +1 equations with A as unknowns = /*,
i = 0,l,-,N.
(11.33)
This is a set of nonlinear equations. If the solution to equation (11.33) exists, we are able to prove that the solution to the set of solutions is unique. To prove this, suppose At and AJ are two sets of solutions to equation (11.33), that is, =p,,
(11.34a)
=A •
(11.34b)
Subtracting equation (11.34a) from equation (11 J4b) yields = 0.
(11.35)
From calculus we conclude that the term in the square bracket must be zero for the above equation is always zero for any At and %. In other words,
expf-1 + f)-Wx)1 = expf-1 + ftMB,{x)\.
(11.36)
This equation requires that the exponents on both sides must be equal, that is,
270
/ /, Estimation by use of the MEM
f>fl(x) = £ # « ( * ) . )=V
(11.37a)
1=0
We are thus led to the equation ) = 0.
(11.37b)
from which we conclude that At = A,' because equation (11.8 lb) is true for any x. By carefully choosing x along the real axis, we are able to obtain = 0 , j = 0,1,2,-,N.
(11.38)
So that the determinant of the matrix B${xj) is not zero. This is possible if xj is chosen in such a way that it maximizes BI(XJ) . Equation (11.38) is a set of linear equations, which has only zero solutions J» - A! = 0 if the determinant of the coefficient matrix B,{xj) is not zero. 11.4 Optimization solvers Uniqueness of the solution to equation (11.33) is a nice property. It is not an easy job to find the roots of equation (11.33) because of nonlinearity. The difficulty can be easily sunnounted by rewriting equation (11.33) in the form of optimization „
^
-,1
A-A 1=0
/
->min.
(11.39)
->min
(11.40a)
J
Or equivalently
B/(x)exp - l + ^^jB ( (x) ufc-/?f
subject to
11.5. Asymptotically unbiased estimate of At f
n
\
fexp -l + yU5,(x) \dx = \. J
\
t*
271
(11.40b)
)
Equations (11.25) and (11.26), (11.40) are all equivalent We are already familiar with optimization problems that appeared in Chapters 6-10. Three optimization methods have been used. They are iterative formulas used in Chapters 6 and 7, and Flexible Tolerance Method (FTM) used in Chapter 10. It seems hard to obtain a nice iterative formula as that obtained in Chapter 6. So we have to resort to FTM and GA. When solving equations (11.39) or (11.40), the search process is stopped if L < e or solution step number > Ne. Here £ is a prefixed small number, say 10"3. Ne is a prescribed large number, say 500. 11.5 Asymptotically unbiased estimate of A, We prove that solving At by equations (11.39) or (11.40) yields asymptotically unbiased estimate of ^ , To see this, note that in equation (11.31) p, are asymptotically normal random variables with zero mean based on the large number law. Denote the true value of Aj by A? and expand the left hand side of equation (11.26) with respect to the true value ^»°. Moreover, denoting pdf by / ( x | A), we obtain
(11.41)
The estimate At are dependent on the sample, thus being function of sample. Taking average on both sides of equation (11.26) about sample X yields
As mentioned above, & are asymptotically normal variables with mean pf. Thus, as sample size is large, Expi = pf, where superscript "0" represents true value of pi. Therefore, the second terms on the left hand side of equation (11.42) must be zero because the first term on the left hand side and the term on the right hand side of equation (11.42) are equal. In other words,
272
11. Estimation by use of the MEM
ExAj=Af.
(11.43)
Therefore, equation (11,39) or equations (11.40) yield asymptotically unbiased estimate of <4 • 11.6 Model selection MEM in fact takes the place of Maximum Likelihood Method, estimating the unknown parameters through sample observations. The estimation accuracy depends on several factors, one of which is the number of B-spline functions used in the estimation process. If the number of B-splines used in the estimation is over-small or over-large, accuracy will be lost. Again, accuracy and statistical errors must keep balance at an acceptable level. Because Maxent estimators are not like maximum likelihood estimators, the properties of the latter being well studied. The lack of M l knowledge of maxent estimators in the current case shows that we cannot directly apply the nice criteria for model selection like ME and AIC. To work out a criterion for model selection like ME based on maxent estimators remains a task under research. Here, however, we use an approximate approach to handle the problem. Consider Maxent Principle (11.21). If a large sample is taken from the population, the entropy is approximately equal to (11.44) Maximizing the left hand side term is equal to maximizing the right hand side. Therefore, Maxent Method is asymptotically as same as Maximum Likelihood Method Based on such APPROXIAMTION, we assume that ME remains valid here for maxent estimators. Then the best model should minimize ME in the way of ^ 2 n,
(11.45)
where J, is the maxent estimate of A and \N+1, N,
equation (11.39) is used equation (11.40) is used
.
(11.46)
We want to emphasize again that equation (11.45) is approximate in the sense
/ L 7. Numerical Examples
273
that MEM is asymptotically equal to M-L method, but maxent estimators are not necessarily equal to M-L estimators. Theoretically speaking, estimators maximizing entropy does not necessarily maximize likelihood functions. The drive for better estimator than ME is always needed in the future. It should be pointed out that AIC is no longer valid here because likelihood function does not exist here, 11.7 Numerical Examples
Example 11.1 Direct estimation of a normal distribution Given a normal distribution (11.47) Suppose seven B-splines are used to approximate it. The interval is [-1,1]. By defining
we obtain after numerical integration of the above equation. Their values are given in the following
=2,13x10-* =0.23, = 0.49, = 0.23, p, =6.08x10"*. One more condition requiring the area under the pdf be one,po = I, should be added in real -world applications. The Flexible Tolerance Method (FTM) is employed here to solve the problem. In the computation, the error tolerance for stopping computation is set at 10"4. That is, the optimization target is set at 10"4, or iteration number is smaller than 500, Results are shown in Figure 11.1.
274
11. Estimation by use of the MEM
-1
-0.5
0 X
0.5
1
Figure 11.1 Maxent estimate of the normal distribution In the figure, the given distribution is also plotted. Comparing the estimated and given distributions reveals that except at the right end, the agreement between the two curves is quite satisfactory.
Example 11.2 Estimation of the normal distribution based on sample In the above example, values for pi are obtained from theoretical calculations based on equation (11.48). In reality, these values are not known, and must statistically estimated from samples. Therefore, to be more practical, a sample of size n, is generated from the given distribution (11.47). If n, = 100, the estimated moments based on equation (11.31) are given in Table 11.1
Table 11.1 Theoretical and estimated values of moments ( / % = ! )
Estimated («»=50)
Estimated («»=150)
6.08 xlO"4
Estimated («s=100) 0
0
0
Pt
2.13x10"* 0.23
1.79X10"1 0.22
2.06xl0" a 0.23
UOxlO" 2 0.23
A
0.49
0.5
0.48
0.5
Ps
0.23
Items
Theoretical
Pi Pi
0.24 2
Pi
2.13xlQ"
Pi
6.0Sxl0" 4
0.26 2
2.36 xlO"
2.18xlO"2
0.23 2
I.77X10" 0
2.44xlO"z 1.45K10"4
275
11.7. Numerical Examples
Based on the above data, the distribution is estimated by minimizing equation (11,40). The estimated pdfs are shown in Figure 11.2 for n, = 100 .In the figure is also shown the given distribution for comparison. Except at the right end of the horizontal axis, the agreement between the two curves is in generally satisfactory.
2
1 Prob ili
3?
given 1.5
y*%.
estimated
£
\
1
M
a
0.5 0
-1
-0.5
0 X
n, = 100
V": 0.5
Figure 11.2 Maxent estimate of normal distribution based on sample using #=7 B-splines
To study the influence of sample size on estimation accuracy, two more computations have been made. These two samples have size 50 and 150, respectively. The estimated moment based on equations (11.31) are given in Table 11.1, too. FTM is used for these two cases. As N=7 B-splines are used, the estimated results are plotted in Figure 11,3 (a) and (b). As sample size is 50, the estimated pdf has a significant skewness towards the right side. Return back to Table 11.1 and look at p% and p$. As sample size is 50, these two moments differ about 0.03. These two moments determine the curve departing from being symmetrical about the central line to being more dense on the right side of the central line. Therefore, the accuracy of the statistical estimates of the moments has an important impact on the final estimation. This is clearer if we study the case for sample size to be equal to 150. The estimated moments for this case are listed in the fourth column in Table 11.1, p% and ps in this case are identical, and the curve has good symmetry about the central line around the centre, as shown in Figure 11.3 (b). Surprisingly, the right end error which appeared in Figures 11.1 -~11.3 disappears in this case, replaced
11. Estimation by use of the MEM
276
by a flat line. In general, increasing sample size does ensure the convergence of the estimation.
given „
1.5
1 •§
estimated
(a)«,=50
/""V
0.5
-1
0.5
-0.5
(b)«t = 150
given
1.5
estimated
/
Probabil
1 0.5
•
J
0 -1
-0.5
\ 0
0.5
1
Figure 11.3 Sample size influence on estimation accuracy
As we have seen in the previous chapters, the number of B-spHnes used for approximating pdf has important influence on estimation accuracy. Two cases are considered here, N=3 and i¥=ll, corresponding to 3 and 11 B-splines, respectively. The estimated pdfs for these two cases are shown in Figure 11.4. In Figure 11.4(a) is shown the estimated pdf for N=3 using a sample of size 100. The results are surprisingly good. The estimation for N=l 1 is, however, not good enough for practical use because it is oversensitive to local statistical fluctuations being a wavy curve. As in Chapter 6, we thus need to select models from the candidates using different number of B-splines. In this example, N changes from 3 to 8, in total 6 cases considered. For each,
11,7. Numerical Examples
27?
both / / = - J / { x | ^ ) l o g / ( x | A}t&and ME are calculated, as given in Table 11.2 Table 11.2 ME values for model selection (n s = 100 )
N 3 4 5
H
ME
0.00594 0.00580 0.00568
0.05094 0.06584 0.08068
Given: ......... Estimated:
1.5
N 6 7 8
H
ME
-0.00178 -0.00131 0.01080
0.08822 0.10369 0.13080
w
/* * \
!!=ioo
•
1 0.5
V
0 -1
0.5
-0.5
2
Given: Estimated:
1.5
.........
•• «,
=
100
"
1 0.5
•
/ \
0 -1
-0,5
0
0.5
1
Figure 11.4 Influence of number of B-splines
From the table, JV=3 minimizes ME meaning that the best estimate is given by using 3 B-splines. We have plotted JV=3 result in Figure 11.4(a). The good
278
/ /. Estimation by use of the MEM
agreement between the estimated and the given pdfs does validate the effectiveness of ME analysis. Example 11.3 Compound distribution This example was considered in Chapter 6. It is rewritten here for convenience. The distribution is given by /(*)=,og(j)
(11.49a)
] g(x)dx
where g(x) is a function defined on [0,10] 1
(11.49b)
From this distribution, 100 random numbers are generated as a random sample for estimation. Based on the sample, the maxent distribution is estimated using the method presented here. The results are given in Table 11.3. Table 11.3 Dependence of ME values on the number of B-splines Number of B-splines 7 3 9 10 11 12
ME .162145E+01 .172011E+01 .156969E+01 .160988E+01 .168089E+01 .157742E+G1
The minimum ME is obtained at N=9. The estimated results for N=5,9 and 12 are plotted in Figure 11.5. Similar observations can be amde again in the figure, that is, N=S represents the case of underutilizating B-splines and JV=12 represents the case for overshooting B-splines. From the comparison in the figure, AH? is better than the rest. Its corresponding ME is really smallest among the cases considered. Again, the capability of ME analysis is validated. From the figure, however, it is observable that the results are not better than
279
U.S. Concluding remarks
those presented in Chapter 6. To make equation (11.31) be more accurate, large samples are needed. Therefore, it is not safe to say that MEM is superior to M-L method now. 0.5
Given: .,„ ,. „ 0.4 3
1 I
•
tat
r
0.2 0.1 0
•-
N =S
0.3
0
t
\
•
#=12
2
4
6
8
10
Figure 11.5 Maxent estimate of compound distribution
11.8 Concluding Remarks MEM has a variety of applications. But studies on applying MEM to estimating distributions based on SAMPLE remain few. Traditionally, polynomials are used to approximate $(x), but the capability is severely limited by the inadequacy of polynomials as a flexible and robust interpolating tool. Introduction of B-splines is expected to open up a new direction for estimating complicated distributions.
This page intentionally left blank
Chapter 12
Code specifications
Short specifications to each code given in the CD-rom attoched to this book are presented in this chapter for the easy reference of the reader. 12.1 Plotting B-splines of order 3 12.1.1 Files in directory B-spline FORTRAN code for calculating B-splines: bspline.f Input file name : bspline.inp Output file name : bspline.dat 12.1.2 Specification: Inputs in input file: hsx ndx xsm xlg
: number of B-splines : division number : minimum x-value : maximum x-value
Output in the output file:
Subroutines function bn(k,x,bound): calculate B-spline values of order k+3 at point x with knot k+3 : order of B-spline, input x : point of interest, input
281
282
12. Code specifications bound bn
: knot sequence, input : B-spline value of order k+1 at point x with knot sequence bound, function wp(k,x,bound): calculate w'(x) in the definition of B- spline all arguments in the bracket are as same as above function hsd(x) X hsd
: Heaveside function, taking zero if x is smaller than 0 and 1 if x is not smaller than 0. : point of interest, input : value of Heaviside function, output
subroutine bd(xsm,xlg,nhx,xbound): calculate knot sequence hsx : number of B-splines ndx : division number xsm : minimum x-value xlg : maximum x-value nhx : hsx-2 xbound : knot sequence Example 12.1 15 B-splines Input 15,40 0, 10 Output (x,y) for plotting B-spline functions 1.2 Random number generation by ARM 12.2.1 Files in the directory of random FORTRAN for predicting random number Input file Output file
: random.f ; rand.inp : n.inp, histout, randO.dat
12.2.2 Specifications Inputs in the input file ns : sample size to generate np : =1, exponential distribution : =2, compound distribution : =3, normal distribution a,b : lower and upper bounds of the interval
12.3. Estimating 1-D distribution using B-splines
283
Outputs in fee output file x : random numbers distributed as the given pdf Subroutines subroutine rand(ix, yfl)
: generating uniform random number using the linear congruence method as given in the following
subroutine rand(ix» yfl) iffix .eq. 0) ix = 67107 ix=125*ix ix = ix - ix / 2796203 * 2796203 yfl= float(ix) yfl=yfl/2796203 return end function pdfs(x) : compound distribution used in the previous chapters function pdf(x) : exponential distribution function pdfh(x) : normal distribution 12.3 Estimating 1-D distribution using B-splines 12.3.1 Files in the directory shhl Filename :shdl,f Input file name : exp.inp Output file name : exp.out, exp.dat, exp.zme 12.3.2 Specifications Inputs: ns numx numf ndx xi(n) comer xlg xsm
; integer, sample size ; integer, smallest number of B-splines to be used : integer, largest number of B-splines used : integer, division number for plotting : real, sample point on X-axis : real, required accuracy : real, the largest value of x : real, the smallest value of x
284
12. Code specifications
Outputs:
x(nx) fie AIC zme
: estimated linear combination coefficient : likelihood functions : AIC value : measured entropy
Subroutines SUBROUTINE OBF(XQ,xs&); iteration formula function psi(x j ) : density function / where x is an array storing the coefficients function psib(x»xl) : calculate f(x( \ a) function bn(k,x,bound) : see section 12.1 function wp(kpc,bound) : see section 12.1 function hsd(x) : see section 12.1 subroutine bd(xsm,xlg,nhx,xbound): see section 12.1 subroutine enfropytxjeht)
: calculating entropy H .
12.4 Estimation of 2-D distribution: large sample 12.4.1 Files in the directory shd2 FORTRAN code shh2.f for estimating pdf using sample data (xc,yt:) Input file : uSQG.inp Output file : uSOO.out File storing ME values 12.4.2 Specifications Inputs: ns numx numxf numy
: sample size : smallest number of B-splines to be used in x-direction : largest number of B-splines to be used in x-direction : smallest number of B-splines to be used in y-
numyf ndx ndy xi(n) yi(N) conver ylg ysm
: largest number of B-splines to be used in y-direction : division number in x-axis for plotting, around 40 : division number in y-axis for plotting, around 40 : x-eoordinate of the sample point : y-eoordinate of the sample ; Required accuracy : The largest value of x : The smallest value of x
direction
12. S. Estimation ofl-D distribution from a histogram yig ysm Outputs: x(nx) AIC zme Fx
: The largest value of y : The smallest value of y : real array, estimated linear combination coefficients : AIC values : measured entropy : likelihood function
Subroutines SUBROUTINE iteration process function psi(xj) function psib(x,xl,yl) : calculate f(xe,yt j a) : see section 12.1 function bn(k,x,bound) : see section 12.1 function wp(k,x,bound) : see section 12,1 function hsd(x) subroutine bd(xsm,xlg,nhx,xbound): see section 12.1 subroutine entropy(x,eht) : calculating entropy H . 12.5 Estimation o f l - D distribution from a histogram 12,5.1 files in the directory shhl FORTRAN code for estimating pdf from a given histogram: shhl.f Input file: rulS.inp 12.5.2 Specifications Input:
M Ns num number
ndx xi(m) yi(m+l)
xlg
xsm CONVER Outputs: X(NX)
ane AIC
285
: Histogram cell number : Sample point number : Smallest number of B-splines used : Largest number of B-splines used : Division number for plotting : Cell heights : Cell coordinates : The largest value of x : The smallest value of x : Required accuracy
: Parameter values : ME value : AIC values
12, Code specifications
286 fx
: Likelihood functions
Subroutines SUBROUTINE OBF(xO,X,F) function 2Hnu(x,y) function qi(i,x)
: iteration for estimating pdf ; function value : qt =0^*^
subroutine simp(a,b,n,k,s): Simpson method for numerical quadrature subroutine trix : calculate coefficient etj function bn(k,x,bound) : see section function wp(k,x,bound) : see section function hsd(x) : see section subroutine bd(xsm,xlg,nhx,xbound): see section
12.1 12.1 12.1 12.1
12.6 Estimation of 2-D distribution from a histogram 12.6.1 files in the directory shhl FORTRAN code for estimating pdf from a given histogram: shh2.f Input file: wavew.inp 12.6.2 Specifications Input:
M N Ns Numx Numberx
ndx numy numbery
ndy xi(m,n) yix{m+l) yiy(n+l)
xlg xsm yig ysm, CONVER Outpute: X(NX)
AIC
: Cell number in X-direetion : Cell number in Y-direction : Sample point number : Smallest number of B-splines used : Largest number of B-splines used : Division number for plotting : Smallest number of B-splines used : Largest number of B-splines used : Division number for plotting : Cell heights : Cell coordinates : Cell coordinates : The largest value of x : The smallest value of x : The largest value of y : The smallest value of y : Required accuracy : Parameter values :
AIC
values
287
12.7. Estimation of2-D distribution using RBF : ME values zme : Likelihood functions fx Subroutines iteration for estimating pdf SUBROUTINE OBF(xO,X,F) ftmction value function zmu(x,y) function qi(i,x) subroutine simp(a,b,n,k)s} Simpson method for numerical quadrature subroutine trix calculate coefficient c s fiinction bn(k,x,bound) fiinction wp(k5x,bound) ftmction hsd(x) subroutine bi(xsm,xlg,nhx,xbound)
see section see section see section see section
12.1 12.1 12.1 12.1
12.7 Estimation of 2-D distribution using RBF 12.7.1 Files in the directory shr2 FORTRAN code shr2.f for estimating pdf using sample date Input file: uSOO.inp
(xt,yt)
12.7.2 Specifications Inputs: ns numx numxf numy numyf ndx ndy xi(n) yi(N) conver yig ysm yig • ysm Outputs: x(nx) AIC
: sample size : smallest number of B-splines to be used in x-direction : largest number of B-splines to be used in x-direction : smallest number of B-splines to be used in y-direction : largest number of B-splines to be used in y-direction : division number in x-axis for plotting, around 40 : division number in y-axis for plotting, around 40 : x-coordinate of the sample point : y-coordinate of the sample : Required accuracy : The largest value of x : The smallest value of x : The largest value of y : The smallest value of y : real array, estimated linear combination coefficients : AIC values
288
12. Cade specifications Zme Fx
: measured entropy : likelihood function
Subroutines: SUBROUTINE QBF(X0,x,fx) function psi{x j ) function psib(x,xx,yy) fiinction bn(i,x,y) subroutine entropy(x,eht)
: iterative solution process : density function value at sample point : density function :RBF : calculating entropy H .
Bibliography
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.
12. 13. 14. 15.
Anderson, T. W. (1958). Introduction to multivariate statistics analysis, Wiley, New York, Apostol, T. M. (1969), Calculus, vol. 2, John Wiley & Sons. Balian, R. (1991). From microphysks to macrophysics, methods and applications of statistical physics,vol. I, Springer-Verlag, New York. Balian, R,(1992), From Microphysks to Macrophysics, Methods and Applications of Statistical Physics,vol. II, Springer-Verlag, New York. Bhat, B. R. (1985): Modem Probability Theory, An introductory textbook (2nd edition) John Wiley & Sons, New York. Blum, L., Blum, M, & Schub M. (1986). "A simple unpredictable pseudorandom number generator," SIAMJ, of Computing, 15(2), 364-383. de Boor, C. (1972). "On calculating with B-splines," J. Approx. Theory, 6, 50-62. Buck, B. & Macaulay, V. A. (eds.) (1991). Maximum entropy in action, Clarendon Press, Oxford. Cemak, J. (1996). "Digital generators of chaos," Physics Letters A, 214, 151-160. Conrad, K. (2005). http://www.math.uconn.edu/~kconrad/blurbs/ Couture, R &. L'Ecuyer, P.0997). "Distribution properties of multiplywith-carry random number generators," Mathematics of Computation, 66, 591. Couture, R & L'Ecuyer, P. (1998): "Guest Editors' Introduction," ACM Transactions on Modeling and Computer Simulation, 8(1), 1-2. Cox, M. G. (1971). "The numerical evaluation of b-splines: division of analysis and computing," National Physkat Laboratory, DNAC 4, U.K. Crandall, S.H. (1980). "Non-Gaussian closure for random vibration of nonlinear oscillators," M, J. Non-linear Mech,\5,303-313. Davis, PJ. (1963). Interpolation and Approximation. Blaisdell Publishing Company, New York.
289
290
Bibiography
16. Eisberg, R. & Resniek, R. (1985). Quantum physics of atoms, molecules, solids, nuclei and particles, 2nd edition, John Wiley & Sons. 17. Elderton, W.P. (1953). Frequency curves and correlation, 4th ed.,Harren Press, New York. 18. Entaeher, Karl. (1998). "Bad subsequences of well-known linear eongruential pseudorandom number generators," ACM transactions on Modeling and Computer Simulation, 8(1), 61-70. 19. Er, G.K. (1998): "A method for multi-parameter PDF estimation of random variables," Structural Safety, 20,25-36. 20. Faddeev, D. K. (1956). "The notion of entropy of finite probabilistic schemes (Russian)," UspekhiMat, Nauk, 11,15-19. 21. Faux, LD. & Pratt, M. J. (1979). Computational geometry for design and manufacture, Wiley, New York. 22. Feinstein, A. (1958). The Foundations of Information Theory, McGraw-Hill, New York. 23. Fishman, G. S. (1996). Monte Carlo: Concepts, Algorithms, and Applications. Springer., New York. 24. Fog, A. (2000). How to optimize for the Pentium family of microprocessors. http://www.agner.org/assem. 25. Fog, A. (2001). Pseudo random number generators, http://www.agner.org/random. 26. Fraser, I. (2000). "An application of maximum entropy estimation: the demand for meat in the United Kingdom," Applied Economics, 32,45-59. 27. Fujimoto, Y., Shintaku, E., Zong, Z., Ishikura, H. & Isokami, T. (1994). "The model for determining the membership functions based on fuzzy data", J. Naval Architecture of Kansai, Japan, 236. 28. Fujimoto, Y., Shintaku, E. & Zong, Z. (1994). "Quantification of subjective information and its utilization in reliability engineering," J. of Naval Architecture Society of Japan, 176,615-624. 29. Gen, M. & Cheng, R. W. (1997). Genetic Algorithm and Engineering Design, A Wiley-Interscience Publication, John Wiley & Sons, Inc., New York. 30. Golan, A., Judge, G. & Miller, D. (1996). Maximum entropy econometrics; robust estimation with limited data, John Wiley and Sons, New York. 31. Goldman, S.(1955). Information Theory, Prentice-Hall, New York. 32. Harris, B. (1960). "Probability distributions related to random mappings," Annals of Mathematical Statistics, 31, 1045-1062. 33. Himmelblau, D. M. (1972). Applied nonlinear programmin,, McGraw-Hill, New York. 34. Hong, H.P., Lind N.C. (1996). "Approximate reliability analysis using normal polynomial and simulation results," Structural Safety,18,329-339.
Biliography
291
35. IEEE Computer Society (1985). IEEE standard for binary floating-point arithmetic (ANSI/IEEE Std 754-1985). 36. James, F. (1990). "A review of pseudorandom number generators," Computer Physics Communications, 60,329-344. 37. Jaynes, E.T. (1957a). "Information theory and statistical mechanics," Physical Review, 106,620-630. 38. Jaynes, E.T. (1957b). "Information theory and statistical mechanics II," Physical Review, 108,171-190. 39. Justice (ed.), J. (1986). Maximum entropy and bayesian methods in applied statistics, Cambridge Univ. Press, Cambridge. 40. Khinchin, A.I. (1957). Mathematical Foundations of information Theory, Dover Publication, Inc., New York. 41. Knuth, D. E. (1998). The art of computer programming, 2 (3rd ed.),Addison- Wesley. Reading, Mass. 42. Kullback, S.(1959). Information theory and statistics, Willey & Sons, New York. 43. Lam, K.Y., Zong, Z., Wang, Q.X. (2001) "Probabilistic failure of a cracked submarine pipeline subjected to underwater shock", Journal of Offshore Mechanics and Arctic Engineering, 123,134-140. 44. Larsen, R J . & Marx, M. L. (2001). An Introduction to Mathematical Statictics and Its Applications, 3 rd edition, Prentice Hall, NJ. 45. Levine, R. D. & Tribus, M. (eds.) (1979). The maximum entropy formalism, The MIT Press, Cambridge. 46. Lidl, R. & Niederreiter, H. (1986). Introduction to finite fields and their applications, Cambridge University Press. 47. Lind, N.C. & Chen, X. (1987). "Consistent distribution parameter estimation for reliability analysis," Structural Safety, 4,141-149, 48. Lind, N.C. & Nowak, A. S. (1988). "Pooling expert opinions on probability distributions," Journal of Engineering Mechanics, 114,341-389. 49. L'Ecuyer, P. (1997): "Bad lattice structures for vectors of non-successive values produced by some linear recurrences," INFORMS Journal of Computing, vol. 9, no. 1, pp. 57-60. 50. L'Ecuyer, P. (1999). "Good Parameters and Implementations for Combined Multiple Recursive Random Number Generators," Operations Research, 47(1), 159-164. 51. Lozover, O. & Preiss, K. (1981). "Automatic generation of cubic B-splinere presentation for a general digitized curve," Eurographics, Encarnacao,J.L. ed., North-Holland, 119-126. 52. Marsaglia, G., Narasimhan, B., & Zaman, A. (1990), "A random number generator for PC's," Computer Physics Communications, 60,345. 53. Marsaglia, G. (1997). DIEHARD, http://stat.6u.edu/~geo/diehard.html or http://www.cs.hku.hk/mtemet/randomCD.htol. 54. Matsumoto, M. & Nishimura, T. (1998). "Merseme Twister: A 623-
292
55. 56. 57.
58. 59. 60. 61.
62. 63. 64. 65. 66. 67. 68.
69.
70. 71.
72.
Bibliography Dimensionally Equidistributed Unifonn Pseudo-Random Number Qmerstor,"ACM Trans. Model. Comput. Simul. 8(1), 31-42. Milton, J.S., McTeer, P.M. & Corbet, J, J. (1997). Introduction to Statistics, WCB McGraw-Hill, Boston. Mood, A. M., Graybill, R. & Boes, D.C. (1974). Introduction to the theory of statistics, 3rd ed., International Student Edition. Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics, Philadelphia. Ostle, B. (1966). Statistics in Research, Oxford & IBH Publishing Co., Calcutta. Patil, G P., Kotz, S. & Ord, J. K. (eds.) (1975). Statistical Distributions in Scientific Work, 3, D. Reidel Publ. Company, Dordrecht. Preckel, P.V. (2001). "Least squares and entropy: a penalty function perspective," American Journal of Agricultural Economics, 83,366-377. Riesenfeld, R. F. (1973). Berstein-Bezier Methods for the Computer-Aided Design of Free-Form Curves and Surfaces. Ph.D. Dissertation.,Syracuse University, Syracuse, NY, U.S.A. Sakamoto, Y . , Ishikuro, M. & Kitagawa, G. (1993). Information statistics (in Japanese). Kyoritsu Publisher, Tokyo. Sehoenberg, I. J. (1946). "Contributions to the problem of approximation of equidistant data by analytic functions," Q, Appl. Math., 4,45-99. Schuster, H. G. (1995). Deterministic Chaos: An Introduction. 3'rd ed., VCH. Weinheim, Germany, Waelhroeek, H. & Zertuche, F. (1999). "Discrete Chaos," J. Phys. A, 32(1), 175-189. Shannon, C. E. & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL. Shen, E.Z. & Perloff, J.M. (2001). "Maximum entropy and Bayesian approaches to the ratio problem" Journal of Econometrics, 104,289-313. Shen, S. X.» Bing, F. S. & Wang, H. Z.(1990a). Handbook of Contemporary Engineering Mathematics (in Chinese) , 3, Hua Zhong University of Technology Press, Wuhan. Shen, S. X., Bing, F. S. & Wang, H. Z.(1990b). Handbook ofContemporary Engineering Mathematics (in Chinese) , 4, Hua Zhong University of Technology Press, Wuhan. Utts, J. M. & Heckard, R. F. (2002). Mind on statistics, Duxbury Thomson Learning, Australia, Wu, S. C , Abel, J. F. & Greenberg, D. P. (1977). "An interactive computer graphics approach to surface representation," Computer Graphicslmage Processing, 6, 703-712. Yamaguchi, F. (1978). "A new curve fitting method using a CAT computer display," Computer Graphics Image Processing, 7,425-437.
Biliography
293
73. Yeh, R. Z. (1973), Modem Probability Theory, Harper & Row Publisher, New York. 74. Zong, Z. & Lam, K.Y. (1998). "Estimation of complicated distributions using B-spline functions," Structural Safety, 20 (4), 341-355. 75. Zong, Z. & Lam, K.Y. (2000). "Bayesian estimation of complicated distributions", Structural Safety, 22(1), 81-95. 76. Zong, Z. & Lam, K.Y (2001). "Bayesian Estimation of 2-dimensional complicated distributions", Structural Safety, 23(2), 105-121. 77. Zong, Z., Lam K.Y. & Liu, G.R. (1999). "Probabilistic risk prediction of submarine pipelines subjected to underwater shock", Journal Of Offshore Mechanics And Arctic Engineering, 121,251 -254. 78. Zong, Z., Shintaku, E. & Fujimoto, Y. (1995). "A method to determine the membership functions based on fuzzy data", Bulletin of the Faculty of Engineering, Hiroshima University, 13(1), 11-21. 79. Zong, Z. & Bi, J. Y. (2005). Maximum entropy method for estimating probability distribution (in press). 80. Zhu, X.L. (2001). Fundamentals of Applied Information Theory, Tsinghua University Press., Beijing.
This page intentionally left blank
Index
1-D, 64,87,129,189,191,213,285
2-D, 64,163,213
Approximation, 67,137,163,181,244
    best, 69,71
    function, 68,69,82
Axiom, 94
Basis, 70-72,77,81,83,133
    B-spline, 81
    polynomial, 72
    truncated power, 77
Bayesian, 118,119,126,127,128,189-192,197-203,210-213,216,219-228
    estimation, 198,203,213,219,223,225,227
    measured entropy, 127,201-210
    method, 127,190,192,199,213,220
    point estimate, 198,200-203,219-221,223
    priors, 192,216,219
    statistics, 118,119,126,127,192,197,211,212,219,220
Bivariate, 29,65,163,164,167,170-174,186,213,229
B-spline, 67-69,77-88,130,132,133-144,145,153,156,163-170,175-180,185-189,207,209,211-217,225,244-253,272-279,281-287
    functions, 67-69,77,79,82,87,133,144,165,181,183,189,190,213,228,245,246,248,250,253,268,272,282
Chaos, 51
Chebyshev's inequality, 31,32
Combination, 70,71,77,90,133-135,147-150,159,164-167,171,187-190,194,196,213-219,283,284,287
    linear, 70,71,133,138,150,213,283,284,287
    coefficient, 70,77,133-138,150,159,164,171,187-190,194,196,214,217,219,283,284,287
Conditional, 6,7,14,15
    density function, 15
    distribution, 14
    probability, 6,14,15
Consistency, 37,38,40,105,117
Convex, 71,73
Covariance, 16,22,44
Criteria, 121,141,142,163,186
    model selection, 121,141,186
Determinant, 201,203,204,221-223,236,270
Deterministic, 2,10,49,51,121,244
Disorder, 89-94
Distribution, 9-23,26-36,40-67,93,96-150,153-158,163-174,185-187,213,214,219-229,259-268,273-287
    probability, 10,22,27-36,48,93,96-104,130,157,213,259,261
    chi-square, 21,23,56,57,110
    conditional, 14
    complicated, 67,129,130,279
    compound, 65,150,278,282,283
    discrete, 132,219
    Gauss, 20,22
    marginal, 14
    multinomial, 138
    normal, 20,21,61-67,100,104,110,112,118,125-130,148,222,268
Efficiency, 37,38,105,144,150
Elementary outcomes, 2
Entropy, 45,54,55,74,89-128,141,142,154-156,221,242,247,248,259-265,284-288
    estimation, 89,102,105-119,154,155
    measured, 89,99,105,109,114,141,142,242,247,248,284-288
    Bayesian measured, 127
Estimate, 27,34-44,98-128,135,141-148,190-210,219-230,246,248,271-278,284
Estimation, 27,34,35,40,44,89-119,128-158,163-191,198,203,206,208,213-229,237,254,259,272-278,284-287
    entropy, 89,105-107,112,114,118,128,154
Estimator, 27,35-44,89,105-119,128,141,142,157,247,272
    asymptotically unbiased, 39,106-116,142,247
    parameter, 27,35-39
Event, 2-16,22,49,93-95,119,129,138,170,200,210,229,245
    random, 2,49,129
Exclusive, 3,5,93
Experiment, 2-4,9,45-48,93,150,202,242-244
Function, 9-12,25,30-49,60,67-88,91-104,110-116,120-135,139-161,213-219,225-229,237-254,261-268,271,272,281-288
    approximation, 68,69,82,120
    cumulative density, 11
    joint density, 13,15
    joint distribution, 12-14
    Lebesgue, 80
    likelihood, 39-41,136,139,141,144,145-153,216,219,229,272,285-287
    membership, 237-246,249-254
    probability density, 10,41,49,61,110,241,261-265
    radial B-spline, 87,133,144
Fuzzy, 237,240-254
    data, 242-251,254
    sample, 242,247,252
    set, 237,240-251
Gaussian, 20-22
Histogram, 54,58,70,82,111,137,139,143,153,170-174,184,186,189,207,285,286
Householder transform, 223,231
Independence, 8,9,16,21,54,58,70,82,133,193
    linear, 71
    test, 58
Independent, 8,9,16,22,28-32,48,58,59,70,77,97,121-124,133,160,195,200,218,221
    linearly, 70,77,133
    independent and identically distributed, 22,28,218
Individuals, 26,27
Inference, 25-28,34,37,48,102,104,156,158,190,259,261
    statistical, 27,28,34,37,102,104,158,261
Information, 16,17,26,27,42-44,48,81,89,95-116,120,127,130,141,142,147,157,158,163-181,217-220,228,229,242,259-261
    Kullback information, 97,99-103,111-114
    Akaike, 116,141,142
Intersection, 3,4,7
Likelihood, 39-44,116,122-124,135-158,190,200,209,210,216,219,229,245,248,253
    function, 39-41,122,141,142,147-153,198,209,210,216,219,229,245,248
    log-likelihood, 39-43,123-245
    log-likelihood function, 39-43,116,123,136,139,141,144,145,245
Likely, 4-6
    equally, 5,6
Measured entropy, 127,141,142,284-288
    Bayesian, 127
Method, 1,11,22,28,39-50,58-67,76,84,85,91,107,118-135,140,153-155,163,167,175,184-192,199,203,204-212,213,220,223,228-231,237,247-254,259-287
    acceptance/rejection, 61,62,64
    Bayesian, 127,130,181-189,213,220
    linear congruence, 49,50,66,283
    maximum likelihood, 39,40,107,123,135,157,272
Model selection, 119,120-130,140,141,147,158,271
    criteria, 120
Modulus, 50,51
Moments, 18-20,44,266,274,275
    central, 18
    first, 18
    second, 18
    third, 19
Mutually exclusive, 3,5,93
Nonlinear, 51,136,156,159,202,203,247,248,255,269
Norm, 70-73,167,168
Objective, 126-129,136,200,210,212,221,230,241,242,244
Observations, 27,28,55,189,206,211,213,225,260
Optimization, 135,136,143,158,200,203,210,247,270-273
Parameters, 20,22,27,34,35,38,45,51-55,60,71,72,120-126,136,141-147,154-157,163,167,170,173,189-191,200,202,210,215,219,221,229,243-248,272,285,286
Period, 26,50-53,59,66,129,184
Polynomial, 68-78,82-87,121,126,132,133,235,267,279
    basis, 72
Population, 16,17,25-27,33-39,45-48,114,120,135,157,158,167,181,268
Power, 9,46,60,61,68,77,132,148,208,211
Prediction-correction, 189,213
Probability, 1-17,22,25-36,40-49,56,61,92-105,110,118,127,130-132,138,156,157,170,171,184,200,208,213,221,241,245,259-265
    density function, 10,22,49,61,104,110,261-265
    distribution, 20,26,27,33,35,36,40,48-68,93,96-104,130,157,259,261
Random, 2,9-22,27-41,47-67,92-155,164,167,174-178,187-190,192-195,199,203-206,213-218,224,229,244,251,259,266,271,278,282
    number, 49-66,140,144,174,191,251,278,282,283
Randomness, 1,2,10,40,49,51,54,55,57,96
Realizations, 27,39,50
Sample, 2-5,9-14,25-48,50-59,97-128,129-158,163-182,186-198,206-211,213-220,225-230,237,242-254,268-279,282-288
    fuzzy, 242,247,252
    large, 105,107,116,118-121,129,137,145,213,228,229,283,284
    small, 30,48,118,126,128,132,213,229,230
    space, 2-5,9-14,46,129
    size, 27,30-33,37-44,53-57,98,102,106,111,115,118,125,132,135,139,144,146,151-157,163-169,175-193,206-211,251-254,269,271,275,282-284,287
Sampling, 8,26-30,35,36,48,54,60-64,104
    distributions, 28,30,35,36,48,54
    error, 104
Set, 2,3,7,9,26,27,62-71,79-84,93,122,132-135,165,169,170,211,231,237-251,262,265,269,270,273
    crisp, 237,240
    fuzzy, 237,240-251
Skewness, 17,19,275
Smooth, 156,190-198,203,206,211,213,216-219,223,225,229,243
    prior distribution, 190,211,213,217,229
Span, 70,71
Space, 2-5,9-14,46,53,59,70,71,74,87,129,133,136,157,163,170,194,216,217
    vector, 70,71,133
    normed linear, 70,71
    sample, 2-5,9-14,129
Standard deviation, 19,32,45
Statistic, 7,25-27,33-37,46-48,55,68,89,102-105,118-120,126-132,153,157,163,192,197,200,212-220
Statistical, 1,12-17,25-28,34-40,45-48,51-54,59,102-108,116,119,120,126,135,137,141,147,156,158,180,190-192,212,237,242,252,259,261,272,276
Stochastic simulations, 49
Test, 33,45-48,54-59,137,157,242
    independence, 54,58
    uniformity, 54,56,57
    visual, 54,59
Testing, 8,21,34,45,46,54-58,102,157,158
    hypothesis, 34,45,102
Theorem, 6,7,31-34,43,44,48,55,71,72,81,82,94,95,98-103,107,112-121,127,131-133,136,157,161,162,216-235,261-269
    central limit, 55,157,262
    large number, 98,115,269
Unbiasedness, 37,38,105
Uncertainty, 27,89,93-104,114,116,119,141,147,153,156,158,178,260,261
    model, 97,102-104,147,158
Union, 3,4
Univariate, 163
Variable, 9-22,27-44,50-56,60,61,64,67,92,95-110,115,118-126,129-137,141,155,189,195,200,204,211,212,213,216-218,229,240,259,265,266,267,271
    random, 9-22,27-36,39,41,50-56,60,61,64,67,92,95-110,115,118,119,121,124,126,129-133,137-141,151,189,192-195,200,204,211,212,213,216,218,229,265-267,271
    uniform random, 19,50-56,60
    Gaussian, 20
Variances, 16-20,30,31,44,101,104,125,261-265,268
Mathematics in Science and Engineering
Edited by C.K. Chui, Stanford University

Recent titles:
C. De Coster and P. Habets, Two-Point Boundary Value Problems: Lower and Upper Solutions
Wei-Bin Zhang, Discrete Dynamical Systems, Bifurcations and Chaos in Economics
I. Podlubny, Fractional Differential Equations
E. Castillo, A. Iglesias, R. Ruiz-Cobo, Functional Equations in Applied Sciences
V. Hutson, J.S. Pym, M.J. Cloud, Applications of Functional Analysis and Operator Theory (Second Edition)
V. Lakshmikantham and S.K. Sen, Computational Error and Complexity in Science and Engineering
T.A. Burton, Volterra Integral and Differential Equations (Second Edition)
E.N. Chukwu, A Mathematical Treatment of Economic Cooperation and Competition Among Nations: with Nigeria, USA, UK, China and Middle East Examples
V.V. Ivanov and N. Ivanova, Mathematical Models of the Cell and Cell Associated Objects