Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos and J. Hartmanis
Editorial
Artificial Intelligence has become a major discipline under the roof of Computer Science. This is also reflected by a growing number of titles devoted to this fast developing field to be published in our Lecture Notes in Computer Science. To make these volumes immediately visible we have decided to distinguish them by a special cover as Lecture Notes in Artificial Intelligence, constituting a subseries of the Lecture Notes in Computer Science. This subseries is edited by an Editorial Board of experts from all areas of AI, chaired by Jörg Siekmann, who are looking forward to considering further AI monographs and proceedings of high scientific quality for publication. We hope that the constitution of this subseries will be well accepted by the audience of the Lecture Notes in Computer Science, and we feel confident that the subseries will be recognized as an outstanding opportunity for publication by authors and editors of the AI community. Editors and publisher
Lecture Notes in Artificial Intelligence Edited by J. Siekmann Subseries of Lecture Notes in Computer Science
419 Kurt Weichselberger Sigrid Pöhlmann
A Methodology for Uncertainty in Knowledge-Based Systems
Springer-Verlag Berlin Heidelberg New York London Paris Tokyo Hong Kong
Authors
Kurt Weichselberger, Sigrid Pöhlmann
Seminar für Spezialgebiete der Statistik, Universität München, Ludwigstraße 33, D-8000 München 22, FRG
CR Subject Classification (1987): I.2.3-4
ISBN 3-540-52336-7 Springer-Verlag Berlin Heidelberg New York
ISBN 0-387-52336-7 Springer-Verlag New York Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data banks. Duplication of this publication or parts thereof is only permitted under the provisions of the German Copyright Law of September 9, 1965, in its version of June 24, 1985, and a copyright fee must always be paid. Violations fall under the prosecution act of the German Copyright Law. © Springer-Verlag Berlin Heidelberg 1990. Printed in Germany. Printing and binding: Druckhaus Beltz, Hemsbach/Bergstr. 2145/3140-543210 - Printed on acid-free paper
FOREWORD
The number of publications on the management of uncertainty in expert systems has grown considerably over the last few years. Yet the discussion is far from drawing to a close. Again and again new suggestions have been made for the characterization and combination of uncertain information in expert systems. None of these proposals has been adopted generally. Most of the methods recommended introduce new concepts which are not founded on classical probability theory. This book, however, written by statisticians, investigates the possibility of giving a systematic treatment using the classical theory. It also takes into account that in many expert systems the available information is too weak to produce reliable point estimates for probability values. Therefore the handling of interval-valued probabilities is one of the main goals of this book. We have not dealt with all important aspects of these issues in our study. We intend to continue our research on the subject with the aim of solving those problems which still remain unsolved. Also we are aware of the fact that the experience of other researchers may throw new light on some of our statements. Therefore we are grateful for any criticism and for all suggestions concerning possible improvements to our treatment. We had the opportunity to discuss some parts of our study with Thomas Kämpke, Ulm, and owe valuable suggestions to him. Since our native tongue is German and we live in a German speaking environment, we had some difficulties as regards the English style. Louise Wallace, Plymouth, has supported us very much in this respect, although she bears no responsibility for remaining imperfections. Anneliese Hüser and Angelika Lechner, both from Munich, carefully managed the editing of a manuscript which progressed step by step to its final version. Dieter Schremmer, Munich, supported us by drawing the diagrams. Their help is greatly appreciated.
Munich, January 1990
Kurt Weichselberger, Sigrid Pöhlmann
CONTENTS

1. The aims of this study ......... 1
2. Interval estimation of probabilities ......... 7
3. Related theories ......... 29
   3.1 Choquet-capacities and sets of probability distributions ......... 29
   3.2 Choquet-capacities and multivalued mappings ......... 34
   3.3 Theory of belief functions ......... 38
   3.4 Combination rules of the Dempster-Shafer type ......... 44
   3.5 The methods used in the expert system MYCIN ......... 59
4. The simplest case of a diagnostic system ......... 67
   4.1 A solution without further assumptions ......... 67
   4.2 Solutions with double independence and related models ......... 75
5. Generalizations ......... 87
   5.1 The formalism ......... 87
   5.2 Some aspects of practical application ......... 95
6. Interval estimation of probabilities in diagnostic systems ......... 99
   6.1 An approach without additional information ......... 101
   6.2 Additional information about ~j ......... 103
   6.3 The combination rule for two units ......... 111
   6.4 The combination rule for more than two units ......... 118
7. A demonstration of the use of interval estimation ......... 121
Appendix: Application of Formula (3.21) to structures defined by k-PRIs ......... 127
References ......... 131
LIST OF DEFINITIONS

k-PRI ......... 7
k-dimensional probability interval ......... 7
reasonable ......... 7
structure ......... 9
feasible ......... 9
derivable ......... 10
degenerate ......... 19
global independence ......... 61
total independence ......... 61
interval-admissible ......... 61
mutual k-independence ......... 87
double independence ......... 89
k-independence ......... 104

Concepts which can be found in related theories are not included.
CHAPTER 1
The Aims of this Study

Expert systems of a certain kind rely essentially upon the availability of a method for handling uncertainty. These systems cannot be conceived without a decision first being made about the choice of this method. Obviously this is true for all expert systems using empirical knowledge which in itself is not absolutely certain. As an example, we could mention a medical expert system which draws conclusions from the observed symptoms about whether or not a certain disease is present. All conclusions of this type unavoidably contain an amount of uncertainty. The rules which lead to these conclusions should not be confused with logical rules and must not be treated in the same way.

We shall call expert systems of this type diagnostic systems. They are mostly found in the field of medicine, but can also be used for meteorological or geological purposes, and of course for the control of technical installations. We shall demonstrate some results of our study by means of an example which uses the alarm system of a power plant. Therefore the expression "diagnostic system" should always be understood in the sense of an expert system which relies upon empirical interdependences for drawing its conclusions and consequently requires the treatment of uncertainty.
In order to make it possible to decide upon an appropriate therapy, a quantitative measure of uncertainty has to be applied in all relevant cases of a diagnostic system. Additionally it may be sensible to establish rules which, in certain stages of the investigation, direct the investigator's efforts depending on the degree of certainty achieved for possible hypotheses. It is evident therefore, that for researchers who design diagnostic systems the question has to be answered, as to which method of measuring uncertainty should be employed. For more than three hundred years scientists, philosophers, mathematicians and statisticians have used the concept of probability to describe degrees of uncertainty. Over three centuries a huge amount of theoretical results and experiences concerning the applicability of probability theory in different fields of human knowledge has been accumulated. Nevertheless many doubts concerning the appropriateness of the use of probability in diagnostic systems have arisen during the last few decades. So one has to ask what new aspects have evolved in the construction of such systems, which could possibly result in the necessity to develop methods for measuring uncertainty beyond classical probability theory. First of all it must be stated that although the basic ideas prevailing in some considerations about diagnostic systems sound convincing, they violate fundamental requirements for reasonable handling of uncertainty. These ideas may be described as follows: If a certain fact is observed, a measure M1 of uncertainty concerning the hypothesis in question must exist. If in addition another fact is observed, which produces a measure M2 with respect to the same hypothesis, a combination
rule must be given, which yields the measure of uncertainty of this hypothesis resulting from both observations. Such a rule, which calculates the measure of uncertainty for the combined observation as a function of the measures M1 and M2 can never take into account the kind of mutual dependence of the two observed facts. It might well be that these facts nearly always occur together, if indeed they occur at all. In such a situation the second observation is redundant and should not be used to update the measure of uncertainty. In another situation the two facts very seldomly occur simultaneously and if they do, then this is an important indication concerning the hypothesis in question. If they do occur simultaneously, the updating of the measure of uncertainty should have drastic consequences. A combination rule which treats these two situations equally, can by no means be regarded as useful. The question arises: Should probability theory be blamed for not supporting the construction of such a combination rule? Yet more fundamental is the question: Is it justifiable to attribute a certain measure of uncertainty to the observation of a given fact, irrespective of the circumstances? Take the example of a medical diagnostic system: If a symptom Z is observed, and a measure of uncertainty is used concerning the hypothesis of the presence of a certain disease, can this measure remain valid, if this disease occurs much more frequently than before? Once again an appropriate use of probability theory reveals the kind of dependence prevailing in this case. However, this will not be a popular result, because it states that a diagnostic system using this type of measure of uncertainty cannot be applied to populations showing different frequencies of this disease. Later in this study we shall demonstrate that negligence with respect to the aspects mentioned above may result in the inclusion of information into a diagnostic system which is equivalent to ruining it. Another argument against a possible application of probability theory in diagnostic systems is as follows: While probability theory affords statements, using real numbers as measures of uncertainty, the informative background of diagnostic systems is often not strong enough to justify statements of this type. This is indeed a true concern of the conception of diagnostic systems not met by probability theory in its traditional form. However, it is possible to expand the framework of probability theory in order to meet these requirements without violating its fundamental assumptions. In our study we shall present elements of a systematic treatment of problems of this kind and refer to related theories. Therefore we believe that the weakness of estimates for measures of uncertainty as used in diagnostic systems represents a stimulus to enrich probability theory and the methodological apparatus derived from it, rather than an excuse for avoiding its theoretical claims. A third argument which is met in the discussion about the application of probability theory in diagnostic systems refers to the disputes about the foundations of that theory. It is easy to quote prominent probabilists who express completely contradictory opinions about the essential meaning
of a probability statement. It must however be noted, that those difficulties concerning the concept of probability originate mainly from the problems of statistical inference - which may be the object of an expert system, but never of a diagnostic system. When a diagnostic system is conceived, the experimental background of the information employed is not explicitly considered: Now the question is, which measure of uncertainty should be applied in order to describe this information irrespective of whether it stems from the experience of an expert or from the evaluation of a sample. Concerning this problem the queries about the foundations of probability cannot be a reason for turning away from the language of probability, and even more so because the probabilists are in agreement about the basic rules for the use of probability. Only these basic rules are required in a diagnostic system. Therefore as far as these systems are concerned, if a measure of uncertainty has to be developed we recommend that the language of probability be relied upon and that one refrains from interfering in the dispute about probability concepts. It should be explicitly stated that this recommendation emphasizes that all rules of classical probability theory must be respected and that any new principle which cannot be justified by this theory must be avoided. Nevertheless in our study we shall discuss methods which employ such principles if they have been proposed for use in diagnostic systems: the Dempster-Shafer rule of combination and the methods applied in the expert system MYCIN. Probably the main problem which arises from the construction of diagnostic systems is the combination of information stemming from different sources. We shall concentrate on the discussion of this problem, which has attracted much attention in recent literature on Artificial Intelligence [see e.g. KANAL and LEMMER, 1986; PEARL, 1988]. Since it is our concern to promote the use of probability theory for handling uncertainty in artificial intelligence, we shall investigate situations which may be described as follows: A number of sources of information are given, for instance the results of different parts of a compound medical test or the behaviour of different alarm units controlling the state of a power plant. For each of the sources of information a probability statement about a problem under consideration can be made, for instance concerning the state of health of the person tested or concerning the momentary state of the power plant. How can these probability statements stemming from different sources of information be combined into an overall probability statement? Since the circumstances suggest in many cases that we should consider the sources of information as if they were in a temporal sequence, the expression "updating of probability" may be used to describe this problem. To avoid confusion we shall carefully explain the difference between the problem we are concerned with and another problem, which is sometimes called "combining of probability distributions". The latter problem deals with subjective probability distributions stemming from different persons, and the aim is to find a single probability distribution which may be defined as being attributed to the group of persons as a whole. It assumes that all persons of the group possess the same stock of
information. Deviations between their probability statements can then be ascribed solely to their personal attitudes. An abundance of literature concerning this problem is available, going back as far as to Abraham Wald [GENEST, ZIDEK, 1986; LEHRER, WAGNER, 1981]. If the main difference between the two problems is kept in mind - i.e. different sources of information or one common stock of information - it should always be possible to distinguish between them, even if one has to combine probability statements stemming from different experts in a diagnostic system. To make the distinction as clear as possible we shall never use expressions like "expert view" in our study. In this way we also wish to demonstrate that we are not concerned about the origin of the probability statements used in diagnostic systems. Whether these statements are created through theoretical considerations, through evaluations of empirical results or through personal views of experts, does not influence the way we use them. We therefore assume that probability estimates are given under well defined conditions, regardless of whether these are point estimates or interval estimates. Certainly this must be seen as a realistic assumption, because it allows for situations in which little information is available and which therefore produce very wide probability intervals. In the case that absolutely no knowledge at all is given - if this ever occurs - this has to be described by an interval reaching from zero to one. It should be stated explicitly that we refrain from using fuzzy sets to define probability estimates. We believe that the use of interval estimates produces a degree of freedom large enough to distinguish between situations which may be relevant for the use in diagnostic systems. The combination of the theory of fuzzy sets with the methods proposed here would inevitably lead to further complications of these methods and consequently result in an impediment to their application. As already mentioned, the results described in this study are created by elementary probability theory. From the standpoint of this theory they provide no new insights. The methodology recommended for use in diagnostic systems depends upon the feature of the sources of information involved. We propose rules for combining two sources of information in the simple case of two states of nature and two symptoms distinguished for each source of information, if all relevant probabilities are given as numbers, but no assumption is made about mutual independence of the sources of information (Chapter 4.1). In the case of mutual independence of the sources - a concept which is discussed already in Chapter 3.5 - we provide basic results for two sources of information in Chapter 4.2 and generalize them in Chapter 5.1, so that no restrictions concerning the number of sources of information or the number of states of nature or the number of symptoms distinguished remain. In Chapter 5.2 we shall give an answer to the question of changing prior distributions, which was mentioned before, when we referred to the disease, whose frequency had been increased. All this is done under the assumption, that all probabilities are estimated by real numbers.
Interval estimates of probabilities are described in a general manner in Chapter 2, while theories related to this subject are reported in Chapters 3.1 to 3.3. The problem of combining information stemming from mutually independent sources in the case that probabilities are estimated by intervals is treated in Chapter 6. In that chapter we confine ourselves to the case that only two states of nature are distinguished and only two symptoms can be observed for each source of information. The resulting recommendations for two sources of information are described in Chapter 6.3 and those for more than two sources of information in Chapter 6.4. Their behaviour is demonstrated in Chapter 7. The cases of more than two states of nature or more than two possible symptoms afford additional methodological considerations which will have to be postponed for further studies. We hope to be able to include results referring to this problem in a second edition of this book. An important aspect of our investigation is the discussion of alternative combination rules. In Chapter 3.4 the Dempster-Shafer rule is described, which is often recommended in the literature. Apart from theoretical considerations, which bring to light a lack of justification for this rule, we demonstrate its behaviour in problems of practical relevance. The main result of this part of our study is shown in Example (3.11): the Dempster-Shafer rule can produce misleading results. Reasonable statements concerning the same problem are derived through probability theory in Chapter 4 and in Chapter 5.
Compare: Example (4.3) and Example (5.1)! These comparisons exclude the
Dempster-Shafer combination rule from the stock of methods which can be recommended. The expert system MYCIN introduces a technique relying on the construction of certainty factors. Its background and its behaviour are investigated in Chapter 3.5, where it turns out that it is also not suitable as a basis for diagnostic systems. We do not discuss all relevant recent literature on the subject of handling uncertainty in expert systems, which deserves careful consideration, primarily the book by Judea PEARL [1988], but we intend to do so in a later edition of our study. Since in Pearl's book a comprehensive bibliography can be found, we refrain from including a bibliography in our contribution.
CHAPTER 2
Interval Estimation of Probabilities

Many methodological considerations about diagnostic systems start with the assumption that probability estimates are given by intervals and not by real numbers. Therefore it seems worthwhile to discuss the formal aspects of such a situation. In this chapter an approach is presented which promises to qualify for use in diagnostic systems. Let us start with some definitions.

Definition (2.1): Let $\mathcal{E} = \{E_1, \ldots, E_k\}$. The unknown probability distribution $P(E_1), \ldots, P(E_k)$, with $\sum_{i=1}^{k} P(E_i) = 1$, has to be estimated. A set of intervals $[L_i; U_i]$, $i = 1, \ldots, k$, with
$$0 \le L_i \le U_i \le 1 \tag{2.1}$$
is called a k-dimensional probability interval for $\mathcal{E}$ (or a k-PRI), if
$$L_i \le P(E_i) \le U_i, \qquad i = 1, \ldots, k. \tag{2.2}$$ □
As this definition is quite general, it allows probability estimates which are indeed senseless.

Example (2.1): k = 2:
$$0.3 \le P(E_1) \le 0.4, \qquad 0.3 \le P(E_2) \le 0.5.$$
If the probability of E1 is at greatest 0.4, then the probability of the complementary event E2 = ¬E1 must be at least 0.6. Therefore the estimate described above is by no means reasonable. □
To avoid estimates of this type we introduce the concept of reasonable estimates.
Definition (2.2): A k-PRI is called reasonable, if there exists a non-empty set S of k-dimensional probability distributions so that (2.2) is valid for each probability distribution belonging to S. □

The probability estimate in the example above is not reasonable in the sense of Definition (2.2), since for k = 2 a probability distribution for which the probability of E1 is not greater than 0.4 and the probability of E2 is not greater than 0.5 does not exist. In the following theorem an equivalent condition for the existence of a reasonable probability estimate according to Definition (2.2) is given.
Theorem (2.1): A k-PRI is reasonable, iff
$$\sum_{i=1}^{k} L_i \le 1 \quad\text{and}\quad \sum_{i=1}^{k} U_i \ge 1. \tag{2.3}$$ □

Proof of Theorem (2.1): Let $\sum_{i=1}^{k} L_i \le 1$ and $\sum_{i=1}^{k} U_i \ge 1$; then there exists a value $y$ with $0 \le y \le 1$, so that
$$y \sum_{i=1}^{k} L_i + (1-y) \sum_{i=1}^{k} U_i = 1.$$
The probability distribution for which $P(E_i) = y L_i + (1-y) U_i$ belongs to the set S, which therefore is not empty.

Otherwise let $\sum_{i=1}^{k} L_i > 1$. If $P(E_i) \ge L_i$, $i = 1, \ldots, k$, it follows that $\sum_{i=1}^{k} P(E_i) > 1$. This is a contradiction to the assumption that $\{P(E_i),\, i = 1, \ldots, k\}$ is a probability distribution. The same is true if the upper limits violate the condition of Theorem (2.1). □
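Condition (2.3) is straightforward to check mechanically. The following Python sketch is our own illustration and not part of the original text (the function name is ours): it represents a k-PRI as a list of (L_i, U_i) pairs and tests it for reasonableness according to Theorem (2.1).

```python
def is_reasonable(pri):
    """Theorem (2.1): a k-PRI is reasonable iff sum(L_i) <= 1 <= sum(U_i).

    `pri` is a list of (L_i, U_i) pairs with 0 <= L_i <= U_i <= 1.
    """
    return sum(L for L, _ in pri) <= 1.0 <= sum(U for _, U in pri)

# The estimate of Example (2.1) is not reasonable, since U_1 + U_2 = 0.9 < 1,
# while the estimate of Example (2.2) below is reasonable.
print(is_reasonable([(0.3, 0.4), (0.3, 0.5)]))              # False
print(is_reasonable([(0.3, 0.4), (0.3, 0.5), (0.3, 0.5)]))  # True
```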
In the following an example is given for a reasonable probability estimate.

Example (2.2): k = 3:
$$0.3 \le P(E_1) \le 0.4, \qquad 0.3 \le P(E_2) \le 0.5, \qquad 0.3 \le P(E_3) \le 0.5.$$
There are many probability distributions which obey these inequalities, for instance
$$P(E_1) = 0.30, \quad P(E_2) = 0.30, \quad P(E_3) = 0.40$$
or
$$P(E_1) = 0.35, \quad P(E_2) = 0.35, \quad P(E_3) = 0.30.$$
Nevertheless this interval estimation is not satisfactory, because: If P(E1) is not less than 0.3 and the same is true for P(E2), it is not possible for P(E3) to reach the value 0.5. Therefore these intervals are in a certain sense "too large". Parts of the intervals exist which can never be reached by a probability distribution. For instance, the numbers greater than 0.4 are not possible values of the probabilities of E2 and E3; P(E3) = 0.5 could be achieved only if one of the lower limits for P(E1) and P(E2) were to be lowered. This would mean a loss of information and therefore is not recommended. □
To distinguish those estimates which do not show this undesirable property we shall investigate the exact meaning of a reasonable k-PRI: the set of k-dimensional probability distributions which is determined by this estimate. While up to now we have discussed the existence of some set S of probability distributions which are in accordance with (2.2), we shall now define the set of all probability distributions obeying (2.2).

Definition (2.3): Let $(\mathcal{L}, \mathcal{U})$ be a k-PRI. Then $S^*(\mathcal{L}, \mathcal{U})$ is the set of all k-dimensional probability distributions so that (2.2) is valid for each probability distribution belonging to $S^*(\mathcal{L}, \mathcal{U})$. We shall denominate $S^*$ as the structure of the k-PRI $(\mathcal{L}, \mathcal{U})$. □

Evidently the structure of a reasonable k-PRI is not empty, the structure of a non-reasonable k-PRI is empty. Using this definition we can introduce the concept of feasible estimates.

Definition (2.4): A reasonable k-PRI $(\mathcal{L}, \mathcal{U})$ is called feasible, if for each $i = 1, \ldots, k$ and every $p_i$ with $L_i \le p_i \le U_i$ there exists a probability distribution $P \in S^*(\mathcal{L}, \mathcal{U})$ with $P(E_i) = p_i$. □

Theorem (2.2): A reasonable k-PRI is feasible, iff for each $j = 1, \ldots, k$:
$$\text{a)}\quad \sum_{i=1}^{k} L_i + (U_j - L_j) \le 1 \qquad\qquad \text{b)}\quad \sum_{i=1}^{k} U_i - (U_j - L_j) \ge 1 \tag{2.4}$$ □
Proof of Theorem (2.2):

"⇐": Let $\sum_{i \ne j_1} L_i + U_{j_1} \le 1$ and $\sum_{i \ne j_2} U_i + L_{j_2} \ge 1$ for all $j_1, j_2 = 1, \ldots, k$. For any pair $j_1 \ne j_2$ the following inequalities hold:
$$\sum_{i \ne j_1, j_2} L_i + L_{j_2} + U_{j_1} \le 1 \quad\text{and}\quad \sum_{i \ne j_1, j_2} U_i + U_{j_1} + L_{j_2} \ge 1.$$
Then there exists a value $y$, $0 \le y \le 1$, with
$$y \Big( \sum_{i \ne j_1, j_2} L_i + L_{j_2} + U_{j_1} \Big) + (1 - y) \Big( \sum_{i \ne j_1, j_2} U_i + U_{j_1} + L_{j_2} \Big) = 1.$$
If we define
$$P(E_{j_1}) = U_{j_1}, \qquad P(E_{j_2}) = L_{j_2}, \qquad P(E_i) = (1 - y) U_i + y L_i \;\text{ for } i = 1, \ldots, k;\; i \ne j_1, j_2,$$
we achieve a probability distribution which takes the limits $U_{j_1}$ and $L_{j_2}$ of the given k-PRI. This holds for all $j_1 \ne j_2$. For any point $p_i$ with $L_i \le p_i \le U_i$ a probability distribution with $P(E_i) = p_i$ is gained as a weighted average of probability distributions $P'$, $P''$ with $P'(E_i) = L_i$ and $P''(E_i) = U_i$.

"⇒": Let $j \in \{1, \ldots, k\}$ and $P(E_j) = U_j$. Then $U_j + \sum_{i \ne j} P(E_i) = 1$. As $P(E_i) \ge L_i$ for all $i \ne j$, it follows that
$$U_j + \sum_{i \ne j} L_i \le 1.$$
Analogously for b). □
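For completeness, here is a small Python sketch of the feasibility test (2.4); it is our own illustration, not part of the original text, and the function name is ours. The numbers used for Example (2.2) are those given above.

```python
def is_feasible(pri, tol=1e-12):
    """Theorem (2.2): a reasonable k-PRI is feasible iff, for every j,
    a) sum(L_i) + (U_j - L_j) <= 1   and   b) sum(U_i) - (U_j - L_j) >= 1.
    `pri` is a list of (L_i, U_i) pairs."""
    sum_L = sum(L for L, _ in pri)
    sum_U = sum(U for _, U in pri)
    return all(sum_L + (U - L) <= 1.0 + tol and sum_U - (U - L) >= 1.0 - tol
               for L, U in pri)

# Example (2.2): reasonable, but condition a) fails for j = 2 and j = 3.
print(is_feasible([(0.3, 0.4), (0.3, 0.5), (0.3, 0.5)]))   # False
```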
In Example (2.2) the violation of the conditions described in Theorem (2.2) under a) for j = 2 and j = 3 can be recognized, and it is also evident that the estimate would be feasible if all the upper limits were 0.4. The question now arises whether it is possible to derive a feasible estimate from a reasonable one which is not feasible. In other words: We are searching for an algorithm determining which part of the intervals can be eliminated if a reasonable but not feasible estimate has to be converted into a feasible one. We first need a definition which describes this program:
Definition (2.5): A k-PRI $(\mathcal{L}', \mathcal{U}')$ is called derivable from a reasonable k-PRI $(\mathcal{L}, \mathcal{U})$, if
$$S^*(\mathcal{L}', \mathcal{U}') \supseteq S^*(\mathcal{L}, \mathcal{U}) \tag{2.5a}$$
and
$$L_i \le L_i' \quad\text{and}\quad U_i' \le U_i, \qquad i = 1, \ldots, k. \tag{2.5b}$$ □

Obviously such a k-PRI $(\mathcal{L}', \mathcal{U}')$ is always reasonable. Immediately it is realized that (2.5a) and (2.5b) lead to
$$S^*(\mathcal{L}, \mathcal{U}) = S^*(\mathcal{L}', \mathcal{U}'). \tag{2.6}$$
It is important to note that this definition excludes all kinds of trivial derivations of a weaker estimate out of a stronger one and corresponds only to those derivations of estimates for which the derived one is at least as strong as the original one. Such a derivation is possible only if some additional information is used which has not been taken into consideration in the original estimate. This information is the fact that the sum of all probabilities must be one. The derivation of feasible estimates from merely reasonable ones is described in the following theorem:
Theorem (2.3): Let $(\mathcal{L}, \mathcal{U})$ be a reasonable k-PRI. Then there is only one feasible k-PRI $(\mathcal{L}', \mathcal{U}')$ derivable from it, with
$$U_j' = \min\Big( U_j,\; 1 - \sum_{i \ne j} L_i \Big) \tag{2.7a}$$
$$L_j' = \max\Big( L_j,\; 1 - \sum_{i \ne j} U_i \Big) \tag{2.7b}$$ □
The proof of Theorem (2.3) makes use of the following two lemmata:

Lemma 1: If $\sum_{i \ne j^*} L_i + U_{j^*} \ge 1$, then $\sum_{i \ne j} U_i + L_j \ge 1$ for each $j \ne j^*$.

Lemma 2: If $\sum_{i \ne j^*} U_i + L_{j^*} \le 1$, then $\sum_{i \ne j} L_i + U_j \le 1$ for each $j \ne j^*$.

Proof of Lemma 1: Let $\sum_{i \ne j^*} L_i + U_{j^*} \ge 1$; then for each $j \ne j^*$:
$$\sum_{i \ne j} U_i + L_j = \sum_{i \ne j, j^*} U_i + U_{j^*} + L_j \ge \sum_{i \ne j, j^*} L_i + U_{j^*} + L_j = \sum_{i \ne j^*} L_i + U_{j^*} \ge 1.$$

Proof of Lemma 2: analogous to the proof of Lemma 1.

Conclusions from these lemmata: For a reasonable k-PRI which is not feasible, there are only three possibilities:
α) only a)-conditions, as defined in Theorem (2.2), are violated but no b)-condition;
β) only b)-conditions are violated but no a)-condition;
γ) there exists only one $j^*$ for which both conditions are violated.

Proof of Theorem (2.3):

a) To show: $(\mathcal{L}', \mathcal{U}')$ is derivable from $(\mathcal{L}, \mathcal{U})$. Obviously (2.5b) holds. Let $\{P(E_1), \ldots, P(E_k)\} \in S^*(\mathcal{L}, \mathcal{U})$; then $L_j \le P(E_j) \le U_j$ and $1 - \sum_{i \ne j} U_i \le P(E_j) \le 1 - \sum_{i \ne j} L_i$, so that
$$L_j' = \max\Big( L_j,\; 1 - \sum_{i \ne j} U_i \Big) \le P(E_j) \le \min\Big( U_j,\; 1 - \sum_{i \ne j} L_i \Big) = U_j'.$$
This means that $\{P(E_1), \ldots, P(E_k)\} \in S^*(\mathcal{L}', \mathcal{U}')$.

b) To show: If $(\mathcal{L}, \mathcal{U})$ is not feasible, $(\mathcal{L}', \mathcal{U}')$ satisfies the feasibility conditions.

Case α, only a)-conditions are violated: Let
$$J := \Big\{\, j : \sum_{i \ne j} L_i + U_j > 1 \,\Big\} \ne \emptyset, \qquad U_j' = \begin{cases} U_j & \text{for } j \notin J \\[2pt] 1 - \sum_{i \ne j} L_i & \text{for } j \in J \end{cases}$$
As no b)-condition is violated for $(\mathcal{L}, \mathcal{U})$, we need not change any $L_j$ and therefore $L_j' = L_j$ for all $j$. Together it follows that all a)-conditions are fulfilled for $(\mathcal{L}', \mathcal{U}')$.

If $|J| = 1$, $J = \{j^*\}$, then $\sum_{i \ne j^*} U_i + L_{j^*} \ge 1$ (otherwise it would be case γ). Consequently $\sum_{i \ne j^*} U_i' + L_{j^*}' = \sum_{i \ne j^*} U_i + L_{j^*} \ge 1$. For all $j \ne j^*$:
$$\sum_{i \ne j} U_i' + L_j' = \sum_{i \ne j, j^*} U_i + \Big( 1 - \sum_{i \ne j^*} L_i \Big) + L_j = 1 + \sum_{i \ne j, j^*} (U_i - L_i) \ge 1.$$
This means that all b)-conditions are satisfied if $|J| = 1$.

If $|J| \ge 2$: let $j_1, j_2 \in J$, $j_1 \ne j_2$. Then
$$\sum_{i \ne j} U_i' + L_j' = 1 + \sum_{i \ne j_1} (U_i' - L_i) - (U_j' - L_j) \ge 1 \quad\text{for all } j \ne j_1,$$
and analogously
$$\sum_{i \ne j} U_i' + L_j' = 1 + \sum_{i \ne j_2} (U_i' - L_i) - (U_j' - L_j) \ge 1 \quad\text{for all } j \ne j_2.$$
Hence for $|J| \ge 2$ all b)-conditions are satisfied.

Case β, only b)-conditions are violated: analogous to case α.

Case γ, there exists only one $j^*$ for which both conditions are violated:
$$U_{j^*}' = 1 - \sum_{i \ne j^*} L_i, \quad U_i' = U_i \text{ for } i \ne j^*; \qquad L_{j^*}' = 1 - \sum_{i \ne j^*} U_i, \quad L_i' = L_i \text{ for } i \ne j^*.$$
Therefore
$$\sum_{i \ne j^*} U_i' + L_{j^*}' = \sum_{i \ne j^*} U_i + L_{j^*}' = 1 \quad\text{and}\quad \sum_{i \ne j^*} L_i + U_{j^*}' = 1.$$
This means conditions a) and b) are fulfilled for $j = j^*$. That for $j \ne j^*$ conditions a) and b) are not violated for $(\mathcal{L}', \mathcal{U}')$ can be shown in the same way as in case α.

c) Uniqueness: Let $(\mathcal{L}'', \mathcal{U}'')$ be derivable from $(\mathcal{L}, \mathcal{U})$ and feasible. We want to show $(\mathcal{L}'', \mathcal{U}'') = (\mathcal{L}', \mathcal{U}')$. From Equation (2.6) it follows that $S^*(\mathcal{L}, \mathcal{U}) = S^*(\mathcal{L}', \mathcal{U}') = S^*(\mathcal{L}'', \mathcal{U}'')$. As both k-PRIs are feasible we can conclude
$$L_j' = L_j'' = \min_{S^*(\mathcal{L}, \mathcal{U})} P(E_j), \qquad U_j' = U_j'' = \max_{S^*(\mathcal{L}, \mathcal{U})} P(E_j). \qquad \square$$
The following example shows how a feasible interval estimation can be derived when a reasonable one is given.

Example (2.3): Consider a reasonable k-PRI for which only condition a) is violated, and only for $j = 2$, with $\sum_{i \ne 2} L_i = 0.70$. According to (2.7a),
$$U_2' = 1 - \sum_{i \ne 2} L_i = 1 - 0.70 = 0.30,$$
while all other limits remain unchanged. The feasible estimate derived from the reasonable one therefore differs from it only in the upper limit for P(E2). □
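The derivation (2.7a)/(2.7b) is easy to automate. The sketch below is our own illustration (not part of the original text; the function name is ours); applied to the figures of Example (2.2) as given above, it tightens the upper limits of E2 and E3 to 0.4.

```python
def derive_feasible(pri):
    """Theorem (2.3): derive the unique feasible k-PRI from a reasonable one via
    U_j' = min(U_j, 1 - sum_{i != j} L_i)  and  L_j' = max(L_j, 1 - sum_{i != j} U_i).
    `pri` is a list of (L_i, U_i) pairs of a reasonable k-PRI."""
    sum_L = sum(L for L, _ in pri)
    sum_U = sum(U for _, U in pri)
    return [(max(L, 1.0 - (sum_U - U)), min(U, 1.0 - (sum_L - L)))
            for L, U in pri]

# Example (2.2): all upper limits become 0.4 (up to floating-point rounding),
# which reproduces the remark following Theorem (2.2).
print(derive_feasible([(0.3, 0.4), (0.3, 0.5), (0.3, 0.5)]))
```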
One should note that the structure of a k-PRI is an important aspect not only if probabilities are estimated for the purposes of expert systems. Statisticians often describe their results by means of interval estimation which, applied to k-dimensional probability distributions or k-dimensional frequency distributions, is usually given as a k-PRI. So by discussing S* we also learn more about the effective content of such a statistical statement. It is easy to see that S* must always be convex: Any linear combination of elements of S* with non-negative coefficients adding up to 1 must obviously belong to S*. For k = 2 the set S* can obviously be regarded as the interval [L; U]. For larger values of k the structure S* is more complicated. It can be more easily understood by means of geometrical considerations. For this purpose we shall confine ourselves at first to the case k = 3. The set $S_0$ of all possible probability distributions, which is identical to S* in the case of no information being given about any of the probabilities, can be represented graphically in a symmetric way if we employ a system of triangular coordinates.
[Figure 1: The triangle representing $S_0$ and the representation of the set S* in the case that no probability is below 0.2.]
This graphic representation shows $S_0$ as a triangle. It corresponds to the fact that every element of $S_0$ can be constructed as a weighted average of the three "corners", the three distributions which assign the probability value 1 to E1, E2 or E3:
$$\begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix} = p_1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + p_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + p_3 \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \qquad p_i \in [0; 1], \; i = 1, 2, 3.$$
Any additional information has to be described by a subset of the triangle representing $S_0$. This must also be true with respect to a feasible 3-PRI, which is equivalent to a set $S^* \subseteq S_0$.

Example (2.4): In Figure 1 the representation of the structure is demonstrated for a feasible 3-PRI which originates from the information that each of the three probabilities is at least 0.2: $L_1 = L_2 = L_3 = 0.2$. The graphic representation demonstrates that this information leads to S*, which is also triangular in shape and includes no probability distribution with a $p_i$ larger than 0.6. This corresponds to the application of the condition for feasibility:
$$L_1 + L_2 + U_3 = 1 \;\Rightarrow\; U_3 = 0.6$$
$$L_1 + U_2 + L_3 = 1 \;\Rightarrow\; U_2 = 0.6$$
$$U_1 + L_2 + L_3 = 1 \;\Rightarrow\; U_1 = 0.6$$
It is easily seen that S* is always triangular in shape if the lower limits $L_1$, $L_2$ and $L_3$ are given and the upper limits are the mere results of feasibility. In these cases all elements of S* are to be calculated as weighted averages of the probability distributions in the three corners, which in Figure 1 are denoted by $s_1$, $s_2$ and $s_3$:
$$s_1 = \begin{pmatrix} 0.6 \\ 0.2 \\ 0.2 \end{pmatrix}, \qquad s_2 = \begin{pmatrix} 0.2 \\ 0.6 \\ 0.2 \end{pmatrix}, \qquad s_3 = \begin{pmatrix} 0.2 \\ 0.2 \\ 0.6 \end{pmatrix}.$$
Therefore S*, resulting from the estimate that no probability is smaller than 0.2, is described by
$$S^* = \Big\{ \lambda_1 s_1 + \lambda_2 s_2 + \lambda_3 s_3 \;\Big|\; \lambda_i \ge 0, \; i = 1, 2, 3; \; \sum_{i=1}^{3} \lambda_i = 1 \Big\}.$$
This result is by no means true for all feasible 3-PRIs. The difference between the consequences of upper limits and of lower limits for the probability is demonstrated by use of Examples (2.5) and (2.6), which employ only upper limits.

Example (2.5):
$$0.0 \le P(E_i) \le 0.8, \qquad i = 1, 2, 3.$$
It is easily seen that this 3-PRI is already feasible without any modification to the lower limits. Its graphic representation is to be found in Figure 2, denominated by $S_1^*$.
16
0 ca
o
(1,0,0) 4,0
o ~s
J
0,3
0
5tz
s: C/
0,7 0,6 -
x. S:
05 • 0,4
I
o,3 #
'4~
' O,2,
I
0,4
(o,-~,o)
c~
0
(0,o,~)
$14
.C
% o~
Figure 2
The g r a p h i c r e p r e s e n t a t i o n of S* a c c o r d i n g t o Example ( 2 . 5 ) , SI*( case of Example ( 2 . 6 ) , $2"( . . . . . . . ) .
) , and of S* i n the
The borders of $S_1^*$ are seen to be hexagonal in shape with the corners $s_{11}, \ldots, s_{16}$:
$$s_{11} = \begin{pmatrix} 0.8 \\ 0.0 \\ 0.2 \end{pmatrix}, \; s_{12} = \begin{pmatrix} 0.8 \\ 0.2 \\ 0.0 \end{pmatrix}, \; s_{13} = \begin{pmatrix} 0.2 \\ 0.8 \\ 0.0 \end{pmatrix}, \; s_{14} = \begin{pmatrix} 0.0 \\ 0.8 \\ 0.2 \end{pmatrix}, \; s_{15} = \begin{pmatrix} 0.0 \\ 0.2 \\ 0.8 \end{pmatrix}, \; s_{16} = \begin{pmatrix} 0.2 \\ 0.0 \\ 0.8 \end{pmatrix}.$$
Therefore the elements of $S_1^*$ must be represented as weighted averages of six vectors:
$$S_1^* = \Big\{ \sum_{i=1}^{6} \lambda_i s_{1i} \;\Big|\; \lambda_i \ge 0, \; \sum_{i=1}^{6} \lambda_i = 1 \Big\}.$$
It is worthwhile to explain how the six corners originate: Each upper limit $U_i$ cuts the corresponding apex of the triangle and creates two new corners of $S_1^*$, because there are two borders which meet at the apex of the original triangle. The number of the corners of $S_1^*$ is the product of three (cuttings of an apex) times two (corners created by each cutting). If the upper limits $U_i$ were smaller than in Example (2.5), larger parts of the apices would be cut, but the structure S* remains a hexagon.

Example (2.6): Only if each $U_i = 0.5$ is the hexagon reduced to a triangle ($s_{21}, s_{22}, s_{23}$), as can be seen in Figure 2. In this case the structure of the 3-PRI, denoted by $S_2^*$, is described by
$$S_2^* = \Big\{ \lambda_1 \begin{pmatrix} 0.5 \\ 0.5 \\ 0.0 \end{pmatrix} + \lambda_2 \begin{pmatrix} 0.5 \\ 0.0 \\ 0.5 \end{pmatrix} + \lambda_3 \begin{pmatrix} 0.0 \\ 0.5 \\ 0.5 \end{pmatrix} \;\Big|\; \lambda_i \ge 0, \; \sum_{i=1}^{3} \lambda_i = 1 \Big\}.$$
It is evident that by choosing smaller values for $U_i$ the number of the corners of S* either remains unaltered or may be reduced, but the number will never increase. In order to proceed to the general case of a feasible 3-PRI we shall make use of Example (2.7): Let a 3-PRI be
$$0.1 \le P(E_1) \le 0.4, \qquad 0.2 \le P(E_2) \le 0.5, \qquad 0.3 \le P(E_3) \le 0.6.$$
(It is easy to check that this is a feasible 3-PRI.) If we again use triangular coordinates, S* will also in this case be represented by a hexagon, as is seen in Figure 3. It is easy to recognize the way in which the six corners of S* can be calculated: Each of them corresponds to a probability distribution consisting of one probability equal to its lower limit and another one equal to the respective upper limit, provided that the third probability does not exceed its limits. In this example we arrive at the following values:
s1: P(E1) = L1 = 0.1 ; P(E2) = U2 = 0.5 ⇒ P(E3) = 0.4
s2: P(E1) = L1 = 0.1 ; P(E3) = U3 = 0.6 ⇒ P(E2) = 0.3
s3: P(E2) = L2 = 0.2 ; P(E3) = U3 = 0.6 ⇒ P(E1) = 0.2
s4: P(E1) = U1 = 0.4 ; P(E2) = L2 = 0.2 ⇒ P(E3) = 0.4
s5: P(E1) = U1 = 0.4 ; P(E3) = L3 = 0.3 ⇒ P(E2) = 0.3
s6: P(E2) = U2 = 0.5 ; P(E3) = L3 = 0.3 ⇒ P(E1) = 0.2
[Figure 3: The representation of S* in the case of Example (2.7).]
One should note that in all other cases, in which two probabilities are equal to one of their respective limits, the third probability does not lie between its limits. In Figure 3 we labelled these points $t_1, t_2, t_3, t_4, t_5, t_6$:

t1: P(E2) = U2 = 0.5 ; P(E3) = U3 = 0.6 ⇒ P(E1) = -0.1
t2: P(E1) = L1 = 0.1 ; P(E2) = L2 = 0.2 ⇒ P(E3) = 0.7
t3: P(E1) = U1 = 0.4 ; P(E3) = U3 = 0.6 ⇒ P(E2) = 0.0
t4: P(E2) = L2 = 0.2 ; P(E3) = L3 = 0.3 ⇒ P(E1) = 0.5
t5: P(E1) = U1 = 0.4 ; P(E2) = U2 = 0.5 ⇒ P(E3) = 0.1
t6: P(E1) = L1 = 0.1 ; P(E3) = L3 = 0.3 ⇒ P(E2) = 0.6

We may therefore describe S* of this example as follows:
$$S^* = \Big\{ \lambda_1 \begin{pmatrix} 0.1 \\ 0.5 \\ 0.4 \end{pmatrix} + \lambda_2 \begin{pmatrix} 0.1 \\ 0.3 \\ 0.6 \end{pmatrix} + \lambda_3 \begin{pmatrix} 0.2 \\ 0.2 \\ 0.6 \end{pmatrix} + \lambda_4 \begin{pmatrix} 0.4 \\ 0.2 \\ 0.4 \end{pmatrix} + \lambda_5 \begin{pmatrix} 0.4 \\ 0.3 \\ 0.3 \end{pmatrix} + \lambda_6 \begin{pmatrix} 0.2 \\ 0.5 \\ 0.3 \end{pmatrix} \;\Big|\; \lambda_i \ge 0, \; \sum_{i=1}^{6} \lambda_i = 1 \Big\}$$
and it is obvious that S* cannot be represented mathematically in a simpler way. This is, however, not true for every feasible 3-PRI. To describe results concerning the structure S* we introduce a new term.

Definition (2.6): A feasible k-PRI is called degenerate, if there exists a value $Q_i$ with $Q_i = L_i$ or $Q_i = U_i$ for each $i = 1, \ldots, k$, so that $\sum_{i=1}^{k} Q_i = 1$. □

It is easy to see that a feasible 3-PRI can be degenerate only if either $\sum_{i=1}^{3} L_i + (U_j - L_j) = 1$ or $\sum_{i=1}^{3} U_i + (L_j - U_j) = 1$ for an index $j \in \{1, 2, 3\}$.
We can now formulate:

Theorem (2.4): The structure S* of a 3-PRI is represented by a hexagon, if the 3-PRI is not degenerate.

Proof of Theorem (2.4): We first define the six corners of S* as in Example (2.7):

s1: P(E1) = L1 ; P(E2) = U2 ; P(E3) = 1 - L1 - U2
s2: P(E1) = L1 ; P(E2) = 1 - L1 - U3 ; P(E3) = U3
s3: P(E1) = 1 - L2 - U3 ; P(E2) = L2 ; P(E3) = U3
s4: P(E1) = U1 ; P(E2) = L2 ; P(E3) = 1 - U1 - L2
s5: P(E1) = U1 ; P(E2) = 1 - U1 - L3 ; P(E3) = L3
s6: P(E1) = 1 - U2 - L3 ; P(E2) = U2 ; P(E3) = L3

Due to the condition of feasibility none of these probabilities exceed their respective limits; for instance $L_3 \le 1 - L_1 - U_2 \le U_3$, because $L_1 + U_2 + L_3 \le 1$ and $L_1 + U_2 + U_3 \ge 1$. We now suppose that the 3-PRI is not degenerate. We want to show that in this case the representation of S* has exactly six corners, namely the corners $s_1, s_2, s_3, s_4, s_5$ and $s_6$. Therefore firstly it has to be shown that there are no additional corners. The possible candidates are those points where for two probabilities either the upper limits or the lower limits are reached, e.g.:

t1: P(E1) = 1 - U2 - U3 ; P(E2) = U2 ; P(E3) = U3
t6: P(E1) = L1 ; P(E2) = 1 - L1 - L3 ; P(E3) = L3

As the 3-PRI is not degenerate, we obtain for the point $t_1$: $P(E_1) = 1 - U_2 - U_3 < L_1$ because of $U_1 + U_2 + U_3 - (U_1 - L_1) > 1$, and for the point $t_6$: $P(E_2) = 1 - L_1 - L_3 > U_2$ because of $L_1 + L_2 + L_3 - (L_2 - U_2) < 1$. The other four points are excluded in the same way. Secondly it must be shown that the corners $s_1$ to $s_6$ are indeed six different points. If for two points $s_j$ and $s_{j'}$ all three respective probabilities are to be alike, then two equations must hold, e.g. if $s_1 = s_2$: $U_2 = 1 - L_1 - U_3$ and $1 - L_1 - U_2 = U_3$. As this is possible only if the 3-PRI is degenerate, it can be excluded due to the underlying assumption in Theorem (2.4). □
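As a quick numerical check of these corner formulas, here is a small Python sketch (our own illustration, not part of the original text; the function name is ours) that lists $s_1, \ldots, s_6$ for a feasible, non-degenerate 3-PRI and reproduces the corners found in Example (2.7).

```python
def hexagon_corners(pri3):
    """Corners s1..s6 of the structure S* of a feasible, non-degenerate 3-PRI,
    as defined in the proof of Theorem (2.4).  `pri3` is [(L1,U1),(L2,U2),(L3,U3)]."""
    (L1, U1), (L2, U2), (L3, U3) = pri3
    return [
        (L1, U2, 1 - L1 - U2),        # s1
        (L1, 1 - L1 - U3, U3),        # s2
        (1 - L2 - U3, L2, U3),        # s3
        (U1, L2, 1 - U1 - L2),        # s4
        (U1, 1 - U1 - L3, L3),        # s5
        (1 - U2 - L3, U2, L3),        # s6
    ]

# Example (2.7): reproduces, up to floating-point rounding, the corners
# (0.1,0.5,0.4), (0.1,0.3,0.6), (0.2,0.2,0.6), (0.4,0.2,0.4), (0.4,0.3,0.3), (0.2,0.5,0.3).
for corner in hexagon_corners([(0.1, 0.4), (0.2, 0.5), (0.3, 0.6)]):
    print(corner)
```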
The case of a degenerate 3-PRI is demonstrated through Example (2.8), which modifies the estimates of Example (2.7) in only one place.

Example (2.8): Let a 3-PRI be
$$0.1 \le P(E_1) \le 0.4, \qquad 0.2 \le P(E_2) \le 0.5, \qquad 0.1 \le P(E_3) \le 0.6.$$
This is a feasible estimate, but it is degenerate because of
$$U_1 + U_2 + L_3 = \sum_{i=1}^{3} U_i - (U_3 - L_3) = 1.$$
In Figure 4 the graphic representation of S* is seen to be a pentagon. In this case the structure S* of the 3-PRI is
$$S^* = \Big\{ \lambda_1 \begin{pmatrix} 0.1 \\ 0.5 \\ 0.4 \end{pmatrix} + \lambda_2 \begin{pmatrix} 0.1 \\ 0.3 \\ 0.6 \end{pmatrix} + \lambda_3 \begin{pmatrix} 0.2 \\ 0.2 \\ 0.6 \end{pmatrix} + \lambda_4 \begin{pmatrix} 0.4 \\ 0.2 \\ 0.4 \end{pmatrix} + \lambda_5 \begin{pmatrix} 0.4 \\ 0.5 \\ 0.1 \end{pmatrix} \;\Big|\; \lambda_i \ge 0, \; \sum_{i=1}^{5} \lambda_i = 1 \Big\}.$$
With the aid of Figure 4 it is easily recognized how the number of corners could be further reduced by changing the used estimate: to a number of four corners either by decreasing $L_2$ to 0 or by increasing $U_3$ to 0.7. It is also seen how a combination of variations of this type could transfer S* into a set which is represented by a triangle. □
[Figure 4: The representation of S* in the case of Example (2.8).]
The experience gained concerning the case of k=3 provides the basis for the general case. Evidently a corner of S* is produced if for all but one probability either the respective minimum Li of the estimate or the respective maximum Ui is used and if the last probability lies between the corresponding limits. The number of those corners of S* for a k-PRI is apparently at least equal to k, the value resulting from a non-informative k-PRI - or from a k-PRI which is generated only by lower limits Li. In such situations the upper limits Ui are mere consequences of the feasibility and do not influence the structure of S*, which for instance in the case of k=4 remains a tetrahedron. If upper limits come into effect of their own accord, they will increase the number of corners by cutting some of the k apices of the generalized tetrahedron. Each of these cuts produces up to k-1 corners of S* - and exactly k-1 corners if it remains separate from all other cuts. Therefore the number of corners of S* is always between k as a minimum and k(k-1) as a maximum.
Mathematically S* must be constructed as the set of all weighted averages of these corners. While this construction does not raise any theoretical problems, the calculation of S*, even for moderate k, causes considerable technical difficulties concerning the numerical evaluation. So the indeed inevitable price of lack of sharpness concerning the estimation of probabilities becomes higher if k is greater, and there is no legitimate way of avoiding this dilemma. One should realize that "probability intervals", if for instance k = 5, possess structures which may have up to 20 corners and in most cases will have no less than this. It is evident that mathematical treatment of such a structure is complicated. It could only be simplified if these structures were reduced artificially. We shall demonstrate the search for corners with two examples, which use k = 4. In the first of these examples the 4-PRI is not degenerate, whereas in the second it is degenerate. Because of the conditions of feasibility, in the case of a non-degenerate 4-PRI it is not possible for a corner to include three lower limits L or three upper limits U. Therefore only those 24 candidates remain, where either two L's and one U are combined with a suitable fourth probability, or one L and two U's are combined with the suitable fourth probability value. In the case of degeneration the search for corners may be confined to the same 24 candidates, since a corner with three U's, for instance with $U_1, U_2$ and $U_3$, is only possible if $U_1 + U_2 + U_3 + L_4 = 1$, so that for P(E1) = U1, P(E2) = U2, P(E3) = U3 the calculation yields P(E4) = L4; but in this case in the same way the triple P(E1) = U1, P(E2) = U2, P(E4) = L4 produces P(E3) = U3. Therefore a corner with three U's (or with three L's) is possible only if the same corner is created starting with only two U's and one L (or two L's and one U) and is consequently found among the 24 candidates.
Example (2.9): Let a 4-PRI be
$$L_1 = 0.00 \le P(E_1) \le 0.10 = U_1, \quad L_2 = 0.10 \le P(E_2) \le 0.30 = U_2, \quad L_3 = 0.20 \le P(E_3) \le 0.30 = U_3, \quad L_4 = 0.48 \le P(E_4) \le 0.52 = U_4.$$
Obviously it is feasible and non-degenerate. We shall designate a successful candidate by "s" and characterize a failing candidate by "t".

P(E1) = L1 = 0.00 ; P(E2) = L2 = 0.10 ; P(E3) = U3 = 0.30 ⇒ P(E4) = 0.60 : t
P(E1) = L1 = 0.00 ; P(E2) = U2 = 0.30 ; P(E3) = L3 = 0.20 ⇒ P(E4) = 0.50 : s1
P(E1) = U1 = 0.10 ; P(E2) = L2 = 0.10 ; P(E3) = L3 = 0.20 ⇒ P(E4) = 0.60 : t
P(E1) = U1 = 0.10 ; P(E2) = U2 = 0.30 ; P(E3) = L3 = 0.20 ⇒ P(E4) = 0.40 : t
P(E1) = U1 = 0.10 ; P(E2) = L2 = 0.10 ; P(E3) = U3 = 0.30 ⇒ P(E4) = 0.50 : s2
P(E1) = L1 = 0.00 ; P(E2) = U2 = 0.30 ; P(E3) = U3 = 0.30 ⇒ P(E4) = 0.40 : t
P(E1) = L1 = 0.00 ; P(E2) = L2 = 0.10 ; P(E4) = U4 = 0.52 ⇒ P(E3) = 0.38 : t
P(E1) = L1 = 0.00 ; P(E2) = U2 = 0.30 ; P(E4) = L4 = 0.48 ⇒ P(E3) = 0.22 : s3
P(E1) = U1 = 0.10 ; P(E2) = L2 = 0.10 ; P(E4) = L4 = 0.48 ⇒ P(E3) = 0.32 : t
P(E1) = U1 = 0.10 ; P(E2) = U2 = 0.30 ; P(E4) = L4 = 0.48 ⇒ P(E3) = 0.12 : t
P(E1) = U1 = 0.10 ; P(E2) = L2 = 0.10 ; P(E4) = U4 = 0.52 ⇒ P(E3) = 0.28 : s4
P(E1) = L1 = 0.00 ; P(E2) = U2 = 0.30 ; P(E4) = U4 = 0.52 ⇒ P(E3) = 0.18 : t
P(E1) = L1 = 0.00 ; P(E3) = L3 = 0.20 ; P(E4) = U4 = 0.52 ⇒ P(E2) = 0.28 : s5
P(E1) = L1 = 0.00 ; P(E3) = U3 = 0.30 ; P(E4) = L4 = 0.48 ⇒ P(E2) = 0.22 : s6
P(E1) = U1 = 0.10 ; P(E3) = L3 = 0.20 ; P(E4) = L4 = 0.48 ⇒ P(E2) = 0.22 : s7
P(E1) = U1 = 0.10 ; P(E3) = U3 = 0.30 ; P(E4) = L4 = 0.48 ⇒ P(E2) = 0.12 : s8
P(E1) = U1 = 0.10 ; P(E3) = L3 = 0.20 ; P(E4) = U4 = 0.52 ⇒ P(E2) = 0.18 : s9
P(E1) = L1 = 0.00 ; P(E3) = U3 = 0.30 ; P(E4) = U4 = 0.52 ⇒ P(E2) = 0.18 : s10
P(E2) = L2 = 0.10 ; P(E3) = L3 = 0.20 ; P(E4) = U4 = 0.52 ⇒ P(E1) = 0.18 : t
P(E2) = L2 = 0.10 ; P(E3) = U3 = 0.30 ; P(E4) = L4 = 0.48 ⇒ P(E1) = 0.12 : t
P(E2) = U2 = 0.30 ; P(E3) = L3 = 0.20 ; P(E4) = L4 = 0.48 ⇒ P(E1) = 0.02 : s11
P(E2) = U2 = 0.30 ; P(E3) = U3 = 0.30 ; P(E4) = L4 = 0.48 ⇒ P(E1) = -0.08 : t
P(E2) = U2 = 0.30 ; P(E3) = L3 = 0.20 ; P(E4) = U4 = 0.52 ⇒ P(E1) = -0.02 : t
P(E2) = L2 = 0.10 ; P(E3) = U3 = 0.30 ; P(E4) = U4 = 0.52 ⇒ P(E1) = 0.08 : s12

The structure S* of this 4-PRI is therefore the set of all weighted averages of the distributions $s_1$ to $s_{12}$. No simpler description of S* is possible. □
Example (2.10): Let a 4-PRI be
$$L_1 = 0.12 \le P(E_1) \le 0.17 = U_1, \quad L_2 = 0.15 \le P(E_2) \le 0.24 = U_2, \quad L_3 = 0.23 \le P(E_3) \le 0.36 = U_3, \quad L_4 = 0.28 \le P(E_4) \le 0.41 = U_4.$$
It is easy to verify that this 4-PRI is feasible, but it is degenerate, as
$$L_1 + U_2 + U_3 + L_4 = 0.12 + 0.24 + 0.36 + 0.28 = 1$$
and additionally
$$L_1 + U_2 + L_3 + U_4 = 0.12 + 0.24 + 0.23 + 0.41 = 1.$$
We shall designate constellations which produce the same corner using the same index of s.

P(E1) = L1 = 0.12 ; P(E2) = L2 = 0.15 ; P(E3) = U3 = 0.36 ⇒ P(E4) = 0.37 : s1
P(E1) = L1 = 0.12 ; P(E2) = U2 = 0.24 ; P(E3) = L3 = 0.23 ⇒ P(E4) = 0.41 : s2
P(E1) = U1 = 0.17 ; P(E2) = L2 = 0.15 ; P(E3) = L3 = 0.23 ⇒ P(E4) = 0.45 : t
P(E1) = U1 = 0.17 ; P(E2) = U2 = 0.24 ; P(E3) = L3 = 0.23 ⇒ P(E4) = 0.36 : s3
P(E1) = U1 = 0.17 ; P(E2) = L2 = 0.15 ; P(E3) = U3 = 0.36 ⇒ P(E4) = 0.32 : s4
P(E1) = L1 = 0.12 ; P(E2) = U2 = 0.24 ; P(E3) = U3 = 0.36 ⇒ P(E4) = 0.28 : s5
P(E1) = L1 = 0.12 ; P(E2) = L2 = 0.15 ; P(E4) = U4 = 0.41 ⇒ P(E3) = 0.32 : s6
P(E1) = L1 = 0.12 ; P(E2) = U2 = 0.24 ; P(E4) = L4 = 0.28 ⇒ P(E3) = 0.36 : s5
P(E1) = U1 = 0.17 ; P(E2) = L2 = 0.15 ; P(E4) = L4 = 0.28 ⇒ P(E3) = 0.40 : t
P(E1) = U1 = 0.17 ; P(E2) = U2 = 0.24 ; P(E4) = L4 = 0.28 ⇒ P(E3) = 0.31 : s7
P(E1) = U1 = 0.17 ; P(E2) = L2 = 0.15 ; P(E4) = U4 = 0.41 ⇒ P(E3) = 0.27 : s8
P(E1) = L1 = 0.12 ; P(E2) = U2 = 0.24 ; P(E4) = U4 = 0.41 ⇒ P(E3) = 0.23 : s2
P(E1) = L1 = 0.12 ; P(E3) = L3 = 0.23 ; P(E4) = U4 = 0.41 ⇒ P(E2) = 0.24 : s2
P(E1) = L1 = 0.12 ; P(E3) = U3 = 0.36 ; P(E4) = L4 = 0.28 ⇒ P(E2) = 0.24 : s5
P(E1) = U1 = 0.17 ; P(E3) = L3 = 0.23 ; P(E4) = L4 = 0.28 ⇒ P(E2) = 0.32 : t
P(E1) = U1 = 0.17 ; P(E3) = U3 = 0.36 ; P(E4) = L4 = 0.28 ⇒ P(E2) = 0.19 : s9
P(E1) = U1 = 0.17 ; P(E3) = L3 = 0.23 ; P(E4) = U4 = 0.41 ⇒ P(E2) = 0.19 : s10
P(E1) = L1 = 0.12 ; P(E3) = U3 = 0.36 ; P(E4) = U4 = 0.41 ⇒ P(E2) = 0.11 : t
P(E2) = L2 = 0.15 ; P(E3) = L3 = 0.23 ; P(E4) = U4 = 0.41 ⇒ P(E1) = 0.21 : t
P(E2) = L2 = 0.15 ; P(E3) = U3 = 0.36 ; P(E4) = L4 = 0.28 ⇒ P(E1) = 0.21 : t
P(E2) = U2 = 0.24 ; P(E3) = L3 = 0.23 ; P(E4) = L4 = 0.28 ⇒ P(E1) = 0.25 : t
P(E2) = U2 = 0.24 ; P(E3) = U3 = 0.36 ; P(E4) = L4 = 0.28 ⇒ P(E1) = 0.12 : s5
P(E2) = U2 = 0.24 ; P(E3) = L3 = 0.23 ; P(E4) = U4 = 0.41 ⇒ P(E1) = 0.12 : s2
P(E2) = L2 = 0.15 ; P(E3) = U3 = 0.36 ; P(E4) = U4 = 0.41 ⇒ P(E1) = 0.08 : t

So the two equations, each causing degeneracy, reduce the number of corners by two, but the same process reduces the number of failing candidates, which was 12 in the case of non-degeneracy, to eight. □
25
Example (2.11): We refer to Example (2.9), where the following estimates were given: 0.00 < P(E1) _<0.10 0.10 <_P(E2) _<0.30 0.20 < P(E3) _<0.30 0.48 _
consists of all weighted averages of the distributions Sl,...,812 as described in
Example (2.9). The probabilities of E1 v E2 for these distributions are: sl:
P (El v E2) = 0 . 3 0
s2:
P(E, v E2) = 0.20
s3 :
P (El v E2) = 0.30
s4:
P(E, v E2) = 0.20
s5:
P(E1 v E2) = 0.28
$6: s7: ss:
P(Et v E2) = 0.22 P(E1 v E2) = 0.32 P(E~ v E2) = O. 22
89:
P(E, v E2) = 0.28
st0:
P(E, vE2) : 0 . 1 8
su : sl2:
P (El V E2) = 0.32 P(E, v E2) = 0.18
Therefore the probability of the compound event E1 v E2 is a weighted average of the twelve probabilities calculated above. The maximum of this weighted average is evidently identical to the greatest of these probabilities, the minimum to the smallest. Therefore we arrive at the following result: 0.18_< P(g~ v E~) <_0.32.
[]
The question, whether it is necessary to go through the listing of all corners of S* in order to achieve this result, is answered by the following theorem. Theorem (2.5): If ( ~ ; ~ )
is a feasible k - P R I , I c (1,...,k} and B = U El, then icI
P,(B) : Max[ r. Li; 1 ~TUi] : Min P(B) iEI i P~S*
(2.8)
and P*(B) = Min[ ~ Vi; 1 - i ~iLi] = Max,P(B) i~I PcS
(2.9) []
26 Proof of Theorem (2.5): Obviously the extreme values must be found in a corner of S*. In such a corner at least k-1 probabilities take either their maximal or their minimal value. For the calculation of P*(B) we distinguish three cases:
1)
EVi+ E Li=l. ieI i~I This is the case of degeneracy: The corner with P(Ei) = U i for all i ~ I, and P(Ei) = Li
for all
i ~ I, produces P*(B) with:
P*(B) = E Ui = 1 - ~ ] L i i~l i~I Therefore (2.9) is valid in this case.
2)
E U i + E Li>l. i~I i~I k
Because of ~ Li < I it must be possible to find a corner for which i=I
P(Ei) =Li for all i g I
and
E P(Ei) + E L i = l . icI igI
This corner obviously produces P(B) = P*(B) = 1 -
Z Li < ~ Ui igI icI
in accordance with (2.9).
3)
EUi+ ELi
P(B) =P*(B) = ~ U i < l ieI
Z Li iFI
which again is in accordance with (2.9).
It is immediately seen, that the same kind of proof can be given for (2.8). Example (2.12): In the case of Example (2.11) we have B = E1 U E2, I = {1,2} , -~I = {3,4} E L i = 0.00 + 0.10 = 0.10 icI
[]
27
1-
~ Ui i~I
= I -
0.30 - 0.52 = 0.18
Ui : 0.10 + 0.30 = 0.40 ieI 1-
~ Li=1-0.20-0.48=0.32 ieI
Due to (2.8):
P, (B) -- Max [0.10, 0.18] = 0.18
Due to (2.9):
P* (B) -= Min [0.40, 0.32] : 0.32
The properties of structures of k - P R I s are of great importance when interval estimates are used and more than two states of nature are distinguished. The same is true, if any source of information distinguishes between more than two possible observations.
CHAPTER
3
Related Theories
3.1 Choquet-Capacities and Sets of Probability Distributions In this chapter we shall present a short report about a theory closely related to the contents of Chapter 2. In the years around 1953 Gustave Choquet introduced the concept of "capacities" to denominate set functions which do not obey the law of additivity but obey some law of either sub-additivity or supra-additivity [CHOQUET, 1953/54, 1959]. Subsequently the application of these ideas to probability theory was studied and propagated by Huber and Strassen [STRASSEN, 1964; HUBER and STRASSEN, 1973; HUBER, 1973, 1976]. The mathematics used by Choquet enable us to deal with interval-valued probabilities. The main aspect of the results gained by tIuber and Strassen is a theory of robust statistics, which considers sets of probability distributions with similar properties rather than one single probability distribution. WALLEY and FINE [1982] presented elements of a frequentist theory of interval-valued probabilities. We need not discuss their arguments, because the use of interval estimation for probabilities in expert systems does not rely on a frequentist interpretation. Instead we shall concentrate on aspects which are relevant to ttuber and Strassen's studies as well as to our own. Let S be a set of probability distributions; then for each set B, measurable with respect to each distribution in S, the following definitions are obvious: P*(B) = Sup P(B) PeS may be called "upper probability of B" and
P,(B) : Inf P(B) PcS may be called "lower probability of B". These two set functions are conjugate to each other by
P, (B) : 1 - P* (~B).
(3.1)
The properties of upper and lower probabilities are similar to those of probability as defined by the axioms of Kolmogorov. In the particular case of the sample space E0: P, (Eo) = P* (Eo) : 1
(3.2)
P , ( ¢ ) = P*(¢) = 0
(3.3)
and for the empty set ¢:
30 Instead of Kolmogorov's theorem of total additivity we have: If Bj and B2 are disjoint events, then
P*(B1 t3 B2) _ P,(B,) + P,(B2) Therefore upper probabilities obey a law of sub-additivity, lower probabilities a law of supra-additivity. On the other hand a condition of monotony is given: B1 C B2 ~ P*(B,) < P*(B2)
(3.4)
which again together with (3.1) produces its corresponding condition
Bl C B2 ~
P,(BI) < P,(B2)
and two further requirements of continuity which are onIy of relevance if infinite sample spaces are considered. Set functions with these properties are called Choquet-capacities. If, additionally, P*(B) obeys the following relation: P*(B1 U B2) + P*(B1 n B2) _
(3.5a)
it is called a 2-alternating capacity. According to (3.1), together with (3.5a), P.(B) obeys
P,(U~) + P,(B~) _
(3.5b)
and is called a 2-monotone capacity. The questions of necessary and sufficient conditions for set functions which are to serve as upper or lower probabilities in sample spaces of a general kind, have brought about set-theoretical oriented studies in the field of probability theory [HUBER, 1973; ANGER and LEMBCKE, 1985]. While there exist examples demonstrating that not all upper probabilities are 2-alternating [HUBER and STRASSEN, 1973], we shall prove a theorem, which combines the results of Chapter 2 directly with the concept of Choquet-capacities.
Theorem (3.1): Be ( ~ , U ) a feasible k-PRI and Bi = {Ell i EIj} ; B2 = {Ell i E I2}, where In, I2 c { 1 , . . . , k } . Then (3.5a) holds.
31
Proof of Theorem (3.1): Bin B 2 = { E i l i E I l n I2} Bi n-B2 = {Eil i e I1 N-I2} -~B1 O B2 = {Ell i e-,Ia A I2} "~BI A-~B2 = {Eil i E'~I1 fl-I2} Due to Theorem (2.5): P*(B1UB2)+P*(B1NB2) =Min[
E Ui; 1 -
L(IIUI2)
E Li] +Min[
"(IIUI2)]
E Ui; 1 -
I(IINI2)
This is equivalent to four inequalities: 1) P*(B1UB2)+P*(B1NB2)< E Ui + E Ui I1UI2
llNI2
2) P*(B1UB2) + P*(B1NB2) _< E Ui + 1 ELi x,oi2 -(i~na) 3) P*(B1UB2) +P*(B1NB2) < 1 -
ELi+ "(I1U]2)
4) P*(B1UB2) + P*(B1NB2) _<1 -
E Ui I1NI2
E Li + 1 -
"(IIUI2) Also due to (2.9) we have P*(BI) + P*(B2) : Min[ S D i ;
S Li] + Min[ E Ui; 1 - ELi]
1 -
I1
Li
-(~1n12)
~I1
J
12
912 J
Now we distinguish four cases: a), b), c) and d). a) E U i < I - ELi and E U i < I - ELi I1
-111
12
~12
Then:
P*(B1) + P*(B2) = E Ui + E Ui = I1 =
I2
Y, Ui + E Ui _> P*(B1 U B2) + P*(B1 N B2) llUI2 I1N12
because of inequality 1).
b)
EUi
E L i and E U i > I "7]1 12
ELi ~12
Then:
P*(B1) +P*(B2) = E U i + I I1 _>~Ui+l11 =
ELi_> -q2
ELi}~ ( U i - L i ) = "12 l,N('q2)
E Ui + 1 I1NI2
because of inequality 3)
E
Li >_P*(B1 U B2) + P*(BI N B2)
(-ql) N (-12)
E Li]
"(IInI2)]
32 C) Z Ui > 1 - ~ L i and Z Ui < 1 - ~ L i ll -ql I2 "M2 Then:
P*(B~) + P*(B2) = 1 - Z Li + Z Ui _> 711 [2 > 1 - Z Li + Z Ui ~ (Ui - Li) = "~I1 12 (-ql)NI2 = I-
Z
Li +
Z Ui _)P* (B1 U B2) + P* (Bx N B2)
(.i,)n(~i2) |~ni~ because of inequality 3) as well.
d)
~Ui>l-
11 Then:
~ L i and ~ U i > l "MI I2
ZLi "~I2
P*(B1) + P*(B2) = 1 - Z Li + 1 - Z Li = -ql -~I2 =2 -
Z
Li -
Z
Li >_pa(B1 U B2) + P*(B1 N B2)
(-~II)U (~12) ("ql) f] ("112) because of inequality 4). Therefore (3.5a) holds in each of the four cases. Example (3.1): We refer to Example (2.9), where the following estimates are given: 0.00 < P(E~) < 0.10 0.10 _
BI = El U E2 B2 = E2 U E3
According to (2.8) and (2.9) we have P,(B,) : Ma~[0.10, 1-0.S2] = 0.1S P*(B,) : Min[0.40, 1-0.68] : 0.32 P,(B2) = Ma~[0.30, 1-0.62] : 0.3S P*(B2) = Min[0.60, 1-0.48] = 0.52 Because B1 and B2 are not singletons, (2.8) and (2.9) cannot be applied directly and we have to explain them by the singletons El: B1UB2=E1UE2UE3='~E4 BI N B2 = E2
33
Therefore:
P,(B1 U B2) = 1 - P* (E4) = 0.48 P*(B1 U B2) = 1 - P,(E4) = 0.52 P,(B1 n B2) = 0.10 P*(B1 n B2) = 0.30
(3.5a) states: P*(B1U B2) + P*(B1 I3 B2) = 0.52 + 0.30 = 0.82 < P*(BI) + P*(B2) = = 0.32 + 0.52 = 0.84 and (3.5b): P,(B1 u B2) + P,(B1 n B2) = 0.48 + 0.10 = 0.58 > P,(BI) + P,(B2) = = 0 . 1 8 + 0 . 3 8 = 0.56 The question, whether (3.5a) is sufficient to produce a feasible k - P R I , is not of the same importance for our considerations as the question of necessity. We can therefore rely upon Huber and Strassen, who showed, that (3.1), (3.2), (3.3), (3.4) and (3.5a) in the case of a finite sample space produce upper and lower envelopes of a set of probability measures, which is equivalent to what we call a feasible k - P R I [HUBER and STRASSEN, 1973; HUBER, 1973]. Therefore we can state the following result: The set of feasible k-PRIs is identical to the set of 2-alternating (respectively 2-monotone) Choquet-capacities. While these results are mainly of theoretical interest as far as k - P R I s are concerned, they will be of great help when we come to characterize related concepts.
34
3.2 Choquet-Capacities and Multivalued Mappings In the years between 1963 and 1968 Arthur Dempster published a series of articles, in which he proposed a new type of theory of statistical inference [DEMPSTER, 1966, 1967a, 1967b, 1968a, 1968b]. He aimed at establishing a bridge between statistical inference as is proposed by Sir R.A. Fisher in his theory of fiducial probability and the theory of Bayesian statistical inference. He intended to allow probability statements about the parameters of a distribution to be made without any assumptions about a-priori probabilities of these parameters. In the general case his methods produce upper and lower probabilities, but not probabilities as real numbers. Therefore probabilities are described by intervals [P.(B), P*(B)], which are assigned to each measurable set B of parameter values. Obviously they must obey: 0 < P,(B) <_P(B) < P*(B) < 1.
(3.6)
The upper and lower limits P*(B) and P.(B) are generated by means of multivalued mappings. Without going into detail we may describe the derivation of these limits as follows: It has to be supposed that three spaces are given, 1) A sample space X, the elements of which are the possible realizations x of a variable X depending upon an unknown parameter 0. 2) A space fl with known probability distribution Pn, the elementary events of which are
= ~(x,0). 3) A parameter space O of the possible values of the parameter 0; to create probability statements about subsets B C O, if x is observed, is the goal of Dempster's analysis. The governing idea of this method is represented by the fact, that every x e X produces: a) A virtually multivalued mapping Fx(~) from ~t into O attributing to every element :~ those values of 0, which give rise to compatibility of x and ~. b) A subset ~2x of all those w, for which Px(W) is not empty, together with the probability distribution on ~x, designated by Px, which is the result of conditioning P~ with respect to the subset ~xCfl. If B represents a measurable subset of O, the lower probability of B according to Dempster's principle of inductive reasoning is defined by P,(B) = Px({W e •x : Px(W) c B , rx(W) ¢ ¢ } ) ,
B c 8.
(3.7)
Bc 0
(3.8)
While the upper probability is given by P*(B) = Px({W e ~x : rx(~) n B # ¢}),
It is an important property of this construction, that only those elements ~; of the set ft are admitted, for which Fx(Z) ¢ ¢ and that the measure Px, which is used, is taken as standardized, so that Px({~lrx(~) # ¢}) = 1.
(3.9)
35 Dempster's theory, which introduces new aspects concerning the relationship between random and non-random variables, has on the whole been unsuccessful. His contributions have not exerted any noteworthy influence upon the treatment of statistical inference. To facilitate the understanding of this model, ZADEH [1986] transfers Dempster's ideas to a situation of limited information about the values of a certain variable 7(0J) attributed to the elements e of a given population l~. It is supposed that only a non-empty set r(e) of values of the considered variable 7 is assigned to each element of the population and no further information about the true value of 7(x) is available. If N(B) is the number of dements ~a, for which 7(~a) belongs to a certain set B o O : N(B) = I{~ E f~: 7(w) E B}I , the following inequalities are obvious: N(B) _>N,(B) = I{w E a: r(~) c B}I and
N(B) _
l{~oE ~}: P(w) N B # ¢}1 •
Analogous inequalities are valid, if the number N is replaced by the proportion N(B)/]~ I. If a simple random sample of one element of ~ is considered, this proportion determines the probability that the outcome is an element of B. Therefore we have
P(B) > P,(B) and
N,(B)
=~
P(B) _
(3.10) (3.11)
If the probabilities of the elementary events "selection of ~" are not equal, (3.10) and (3.11) nmst be replaced by P(B) >_P,(B) = P({~ E ~: I'(~) C B})
(3.12a)
P(B) _
(3.12b)
These formulas can already be found in STRASSEN [1964], who refers to a result of CHOQUET [1953/54] according to which P , is a totally monotone capacity respectively P* a totally alternating capacity. The definition of a n-monotone respectively n-alternating set function F is the following:
F(B) _>E F(B N Bi)
- P, F(B N Bi N Bi) + - . . . -(-1) n F(B N B1M...NBn) i<j
(3.13a)
F(B) < F~F(B U Bi) - ~ F(B U Bi U Bj) + - . . . - ( - 1 ) " F(B U nlu...UBn) i i<j for allB, B 1 , . . . , B n C O n E ~ .
(3.13b)
i respectively
36 If then (3.13a) is valid for all n E ~1, F is called a totally monotone capacity; if (3.13b) is valid for all n e ~I, F is called a totally alternating capacity. In the case of a k - P R I the set O consists of k elements and therefore each B can be combined with at most 2 k - 1 different sets Bi with the consequence, that F is totally monotone respectively totally alternating if only (3.13a,h) are valid for n = 1,...,2k - 1. Obviously these results can be applied not only to Zadeh's model but also to Dempster's construction. Multivatued mappings therefore produce lower limits which are totally monotone and upper limits which are totally alternating. In order to make a clear distinction between the two models described in the last chapter and in the present one, we shall at first explain in more detail the situation treated in this chapter. This situation obviously arises from uncertainties about the exact values of some outcomes. Let us use the example introduced by ZADEH [1986] to illustrate the model. The population ~ consists of persons 0J, whose ages (in years) 7(e) are reported. It is supposed that for some of these persons the age is only known to lie in a certain interval r(w), for instance, between 30 and 35 years. Our aim is to calculate the probability that the true age 7(~o) of a randomly selected person lies in a given set B of ages, for instance, B = [33;34]. Under these circumstances it is possible to use the limits for this probability, described by (3.12a) and (3.12b). However, one may also proceed more directly: The probability for selecting the element ~ is then assigned to the set P(0:), but not to any part of it. In this way a probability assignment is created, which does not assign probability values to all singletons of O (i.e. single years of age) and only to those, like an ordinary probability assignment would do. Instead of which it assigns probabilities to those singletons 7(~) and to those intervals F(~) which are reported for at least one person ~. Considered in a more general manner, one arrives at a probability assignment which assigns probability values to singletons of O as well as to subsets of O. The lower limit P,(B) is produced, if all probability assignments to B and to parts of it (subsets as well as elements) are added. Therefore in the case of B =
[33;34] the
sum of the probabiIity assign-
ments to the single years 33 and 34 and that to the age group "33 or 34" represent the lower limit P,(B). It is easily seen that P,(B) generated in this way must obey the law P,(B1 U B~) >_P.(Bi) + P,(B2) and therefore is supra-additive. In the same manner it is realized that P*(B) is sub-additive. However, Choquet's results cited above are much stronger: P,(B) is a totally monotone capacity, P*(B) a totally alternating one. The situation which leads to this model will be found in empirical research, if limited accuracy of measurements produces a special kind of uncertainty. It therefore might be of use in an advanced
37 theory of measurement still to be conceived. However, in the present state of affairs, uncertainties about probabilities causing the employment of k-PRI's in diagnostic systems are never created in the manner described above. If in contrast we refer to the model underlying Chapter 2 as well as Chapter 3.1, we consider a probab!lity space O with a e-fieldS~ of measurable sets B and suppose, that not a single probability distribution but a set S of probability distributions is given. For every element B e 2 the set S produces a set of probability Values 5~(B) = {P(B): P E S}. We take P*(B) as the snpremum and P,(B) as the infimum of this set. If then O contains exactly k singletons Et,...,Ek, a k - P R I is produced. This model obviously is suitable, if for any reason probability assignments are uncertain. Every element of ,~(B) represents a possible probability assessment of B. A large set S characterizes a situation in which rather different probability assessments must be accepted. This is exactly the mathematical description of the state of affairs in many diagnostic systems. The probability of a certain state of nature, if a given symptom is observed, cannot be characterized by a single number because of differences between probability estimates. The set of admitted probability distributions is the structure of the k - P R I as defined in Chapter 2. It was stated it] Chapter 3.1, that upper and lower limits for P(B) derived from this model are Choquet-capacities as well, but P*(B) is a 2-alternating capacity and P,(B) a 2-monotone capacity. Therefore the comparison of the two models leads to the result: Although in both cases upper and lower probabilities as suprema and infima of the probabilities are produced, there remains a difference between the mathematical properties. While the model using a set of probability distributions produces 2-alternating and 2-monotone capacities, the model introduced by Dempster and based on multivalued mappings or an inaccurate information about outcomes requires totally alternating and totally monotone capacities. Obviously the class of totally monotone Choquet-capacities is a subset of the class of 2-monotone Choquet-capacities. DEMPSTER [1967a, pp.330-333] demonstrates that the class of estimates produced by nmltivalued mappings is properly contained in the class of k-PRIs without using the concept of Choquet-capacities. He gives examples of relations which must hold in the smaller class but not in the larger one. Nevertheless, he seems to propose the general use of upper and lower limits stemming from multivalued mappings, if uncertainties about probabilities give rise to applying interval estimates, but he does not report the necessary conditions for this use (as will be cited in Chapter 3.3). On the other hand, he gives a valuable report of previous literature on upper and lower probabilities. We shall refer again to the difference between the two described models in the following chapter.
38
3.3 Theory of Belief Functions In 1975 Glenn Sharer, a student of Dempster's, published "A Mathematical Theory of Evidence". In this book he applied Dempster's theory with respect to two particular aspects: interval-valued probabilities and the combination rule. In this chapter we shall discuss his treatment of interval-valued probabilities. Shafer introduces the concept of belief functions as the fundamental element of his theory. The degree of belief (or degree of support or belief function) is a property of propositions, which materially is equivalent to the lower probability as described in Chapter 3.2 or the lower limit of a probability interval. However, this equivalence is never stated explicitly in Shafer's book. He defines a belief function, Bel(B), by the following three properties, through which the analogy to the axioms of Kolmogorov can be recognized at once: Bel (¢) : 0 (3.14a)
Bel(O) = 1
(3.i4b)
Bel(B~U...UBn) > E Bel(Bi) + @1) 3 E Bel(BinBj) + . . . + ( - 1 ) nd Bel(Bln...nBn) i
(3.14c)
i<j
Condition (3.14c) is easily identified to be (3.13a) in another appearance - by using the symbol B
instead of B~U...UBn - so that it states: Bel(B) is a totally monotone capacity. (Yet Sharer does not use the concept of capacities.) In the special case, that BinBj = ¢ for i # j, (3.14c) becomes II
Bel(BiU . . . UBn) ->1~ - . = Bel (Bi) a result which means supra-additivity of Bel(B). Furthermore Sharer defines the upper probability P*(B) in a way, as if he had stated, that Bel(B) is a lower probability: P*(B) = 1 - Bel(-~B) Obviously supra-additivity of BeI(B) is sufficient to yield P*(B) > Bel(B)
(3.15) (3.16)
By means of the theory of Choquet-capacities, as cited in the last chapter, it is seen, that P*(B) must be a totally alternating capacity and that especially n
P*(BIU . . . UBa)
, if
BinBj = ¢ for i # j,
(3.17)
which says that P*(B) is sub-additive. The interpretation of Bel(B) as the lower and P*(B) as the upper probability of B is the basis of using the concept of belief functions in diagnostic systems. It is often argued that the possibility to use interval estimates of probabilities is the main merit of this theory.
39 Shafer extends his theory by a concept of "basic probability numbers" m(B). He defines these basic probability numbers for each subset B of O, so that o _< ra(B) _< 1 m(~b) = 0
• (8) : 1
(3.18a) (3.18b) (3.~8c)
Be0 The connection between basic probability numbers and belief functions is defined by
Bel(B) =
~ m(B')
(3.19)
B'cB Basic probability numbers show much similarity to probability numbers but they are not only assigned to singletons - as in the case of classical probability theory - but to virtually all subsets of O, including O itself. The connection between belief functions and basic probability numbers in a simple case is demonstrated in the example below. Example (3.2): Given: O = E U -B and:
Bel(E) = 0.4
Bel(TE) : 0.5
P,(E) : 0.4
P*(E) = 1 - 0.5 = 0.5
P,(-E) : 0 . 5
P*(~E) : 1 - 0.4 = 0.6
Then:
m(E) = 0.4, m(-~E) = 0.5
and:
m(8) : 1 - 0.4 - 0.5 = 0.1
Generally for a feasible 2-PRI: P,(E)=L~
P*(E)=U1
P, (-E) =L2=I-U,
P* (~E) :U2=1-L1
and therefore m(B)=L1
m(~E):L2
m({})= 1- L1-L2=U~-LI=U2-L2>_0 Shafer supposes throughout his book, that all basic probability numbers are given and through these numbers the belief functions may be calculated by means of Equation (3.19). Additionally through (3.15) all upper probabilities may be calculated. Therefore, if belief functions and upper probabilities are interpreted as the two limits of the probability interval, it is always possible to identify the probability intervals in cases in which a system of basic probability numbers obeying (3.18) is given. But where do we find these basic probability numbers? In a real situation it will be the reverse: Past experience has made it possible to state probability intervals for certain outcomes and there arises the question whether the Dempster-Shafer theory is applicable, i.e.: can basic probability numbers be identified, which fulfil the conditions (3.18)? Example (3.3) will demonstrate the methods involved.
40 Example (3.3): 8 = {E,, E~, E3} 0.2 < P(E 0 < 0.5 0.1 < P(E2) _<0.4 0.4 _
= 1
-
P*(E,)
=
0.5
= Bel(E2 U E3) = re(E2) + re(E3) + re(E2 U E3) = = 0.1 + 0 . 4 + m(E2 U E3) =:~
m(E2
U
E3)= 0 . 0
Bel(-~E2) : 1 - P*(E2) : 0.6 : B e l ( E 1 U E3) : re(El) +m(E3) +m(E1UE3) : = 0.2 + 0.4 + re(E1 U E3)
m(Z~ u Z3)= 0.0 Bel(-,E3) = 1 - P*(E3) : 0.4 : Bel(E1 O E2) : re(E1) + re(E2) + re(E1 U E2) : = 0 . 2 + 0.1 + re(E1 UE2) ::~ III(EI U ~2) = 0 . 1
Bel(8)
=1 = m(E1) + re(E2) + re(E3) + m(E1UE2) + m(E2UEs) + m(E1UE3) + m(EIUE2UE3) = = 0.2 + 0.1 + 0.4 + 0.1 + 0.0 + 0.0 + m(E, U E2 U E3)
re(e) = re(E1 u E2 u Ea) = 0.2 The two preceding examples could be interpreted as if it were possible to calculate basic probability numbers in all cases of reasonable or at least of feasible k-PRIs. This is not true. There is a large group of k-PRIs for which basic probability numbers cannot be found. This group will be introduced in the following example. Example (3.4): O : {E1,E2,E3} For simplicity we consider estimates which use the same probability interval for El, E2 and E3. In this case the estimates are reasonable if the interval contains the value 1/3. Let us assume that the following four competing estimates are under consideration.
41 a)
0.20 < P(Ei) _<0.40
i = 1,2,3
b) 0.25 _
i = 1,2,3
c)
0.30 < P(Ei) _<0.40
i = 1,2,3
d) 0.30 <_P(Ei) < 0.35
i = 1,2,3
It is easy to control that all of these PRIs are not only reasonable but also feasible. Nevertheless it will be shown that three of these four estimates cannot be used. A general necessary condition for the existence of non-negative basic probability numbers - which will be discussed later in detail is given by the following inequality: k
~] (Bel(Ei) + P*(Ei)) > 2.
i=l
This condition is violated in the case of the feasible 3-PRIs which are given above in cases a), b) and d). Only for c) does the sum of all lower and upper limits exceed the value 2. If we, for instance, try to calculate the basic probability numbers in case a) in the same way as we did in Example (3.3), we get: m(Ei) = 0.2 Bel(~Ei) = 1 - P*(Ei) = 0.6 m(EiUEj) = 0 . 2
fori#j
m({]) = m(Ex U E2 U E3) = - 0.2 The same calculation could be carried out for the estimates b) and d). For a general description of this group of PRIs we require the following Definition (3.1): A k - P R I is Dempster-Shafer-admissible (D-S-admissible), if there exists a set of non-negative basic probability numbers m(B) for all B C ® in accordance with
(3.18)
and with the interpretation
of Li as belief functions and Ui as upper probabilities.
[]
Through this definition the requirements of Shafer's theory of belief functions, concerning the concept of k - P R I s - which in itself is not a topic of Shafer's - are characterized. We can now formulate Theorem (3.2): A necessary condition for a k - P R I to be D-S-admissible is: k i=l
(Li + Vi) > 2
(3.20) []
42 Proof of Theorem (3.2): Be 0 =
{El . . . .
, Ek} k j~ILj - Li+~ti~2+;~i3' + "'" +;ti~n-I
Be] (TEl) = 1 - U i =
where
k j I-I E2 = Y, Y~ m(EJl U EJ2) Jl =2 J2 =1 k
Ji-1
jl ~" j2;~" and N1 analogously corresponds to the 1-times s u m m a t i o n of m(Ejl U . . . U Ejl ). S u m m i n g up we obtain: k k k k k Bel (-~Ei) = k E U I = k'jE1L j EILi + i=El(¢~ + ¢iE3 + . . . i=1 "= "= "= k k k k i__EIUi+ iEiLi = k - (k - 2) i__EILi-i_El(¢iE2+ . . . +¢Eik-i)
+7~ik-1)
As the n u m b e r of the (EjIU ... U EJl), with one index not equal to i, is exactly (k-l), it follows: k
i~l(Ui+Li) = k-
k (k-2)i~lLi-
(k-2)ra-
(k-a)N3-...-r,k_~
k For ~k = re(B) >_ 0 it is: k k E (Ui+Li) > k - (k-2) [iE1Li + ~a + Na + - . . + Ek-1 + Ek] = k - (k-2) = 2 i=1 = k because of ~ Li + ~ + ~3 + • • .+ ~k = 1 i=!
[]
Concerning the sufficiency of this condition we shall formulate a weaker statement. Theorem (3.3): A 3 - P R I is D - S - a d m i s s i b l e if it is feasible and condition (3.20) holds. Proof of Theorem (3.3): Be 8 = {El, E2, E3} m(Ei) = L i e [0;1]
(according to the assumption)
m(E~ U E2) = 1 - U3 - L1 - L2 > 0 (because of feasibility) m(E2 U E3), m(E~ U E3) a n a l o g o u s m(~)
: 1 - re(E1) - re(E2) - re(E3) - m(E1UE2) - m(E~UE3) - m(E2UE3) :
3 = 1 - E L i - (I-U3-LI-L2) - (I-U2-Lt-L3) - (I-UI-L2-L3) = i=l 3 3 = i=~lLi+ i~lUi- 2 E [0;1] (because of (3.20) and feasibility)
[]
43 We shall not discuss the sufficient conditions for D-S-admissibility in the case of k > 4 because these conditions are rather complicated and do not provide a new insight into this problem. We shall merely demonstrate, using a simple example, that condition (3.20) is not sufficient in the case of k = 4 . Example (3.5): Let O = {EI,Ea,Ea,E4} and 0.15 _
i=1,...,4.
Obviously this is a feasible 4 - P R I and condition (3.20) holds. Because of Theorem (2.5): P.(EitJEi) =Max[Li + Lj; 1 - Uk - U1] = =Max[0.30; 0.20] = 0.30 , i ¢ j ¢ k ¢ l m(Ei) = 0 . 1 5 , Therefore:
i = 1,...,4,
m(Ei tOEi) = 0.30 - 0.15 - 0.15 = 0.00.
Otherwise P,(EiUEjUEk) =1 - P*(EI) = 0.60 = Bel(ExtJEjUI~k) = = m(t~i)+m(Ej)+m(Ek)+m(l~itJl~i)+nl(EiUl~k)+m(l~iUt~k)+m(l~itgl~jUF,k) = = 0 . 1 5 + 0.15 + 0.15 + m(EiUEjUEk) m(Ei U Ej U Ek) = 0.15 This leads to re(E1 tOE2 tO E3 U E4) = - 0.20. Therefore this 4 - P R I is not D-S-admissible. The question arises as to why some feasible k-PRIs are not D-S-admissible. It should be remembered that the background of the basic probability number concept is a multivalued mapping as was used by Dempster. Therefore the belief functions are totally monotone Choquet-capacities as was stated in the preceding chapter. On the other hand, the lower probabilities of k-PRIs are 2-monotone Choquet-capacities which need not be totally monotone. There is no reason why, in a practical case, a lower probability of a k - P R I should be totally monotone. In Example
(3.4)
the
lower probabilities of the cases a), b) and d) are 2-monotone but not totally monotone. Only the lower probability of estimate c) is totally monotone and can be represented as a Dempster-Shafer belief function. The restriction to totally monotone Choquet-capacities is obviously an important limitation to the use of Sharer' s belief functions in the case of probability estimates as applied Jn diagnostic systems. There is no reason why only D-S-admissible k - P R I s should occur in diagnostic systems, and there is no remedy given by Shafer's theory for those cases in which D-S-admissibility is violated.
44
3.4 Combination Rules of Dempster-Shafer Type Without any doubt Dempster's Combination Rule represents that part of his theory of inference which attracts by far the most attention in the field of applied statistics. This rule is intended to act as a method to solve the problem of fusing different elements of information about the same subject, provided this information has the character of probability statements and stems from independent sources of information. There are two main aspects of this problem: It originates from methods of statistical inference which are intended to produce probability statements. In a methodology of this kind the question soon arises about combining conclusions based on independent samples and referring to the same unknown parameter. On the other hand, the second field of application for such a rule is the use of probability as a general language of uncertainty as is found in many diagnostic systems. Here exists an urgent need for a method of combining information which comes from different sources but refers to the same problem under investigation. In this case it is very difficult to define what is meant by "independent sources of information". Dempster himself who explicitly intends to produce a method for quite a general field of application is aware of these difficulties and writes: "The mechanism adopted here assumes independence of the sources, a concept whose real world meaning is not so easily described as its mathematical definition. Opinions of different people based on overlapping experiences could not be regarded as independent sources. Different measurements by different observers on different equipment would often be regarded as independent, but so would different measurements by one observer on one piece of equipment: here the question concerns independence of errors." [DEMPSTER, 1967a, p.335]. Dempster's explanation makes it clear that there is an important difference between the present approach and the various decision-theoretic methods to combine subjective probability distributions, methods which inevitably accept the presupposition that all probability distributions are influenced by some common information. Obviously classical probability theory does not provide immediate solutions for the problem Dempster has in mind. On the other hand, it is possible to find a problem treated by probability theory which shows some similarity to the problem of independent sources of information: Let there be 1 mutually independent random variables X1,...,X1 which possess the same set of possible outcomes: E1,...,Ek- If the additional information is given, that a certain experiment produced the same result for each of the 1 variables, then the probability that this outcome is El*, is given by Pl(Ei*) " . . . . P (El*) =
PI(Ei*)
k E Pl(Ei) " . . . - P : ( E i ) i=1 provided, that the denominator is not equal to zero.
(3.21)
45 If the analogy between this random experiment and the problem of independent sources of information is regarded as being sufficient, an analogous formula can be applied in treating this problem, but only if all probability values are known precisely. Then we would have 1 independent sources of information (ZI,...,Z1) and we would draw conclusions about a variable 0, which takes one of the values E,,...,Ek. From the source Zi we conclude with probability Pj(Ei), that 0 takes the value El. Then Equation (3.21) would determine the probability P(Ei*), which characterizes the conclusion from the combination of all 1 independent sources of information to the statement, that 0 takes the value El*. This type of procedure seems to be inevitable, if statistical inference in a non-bayesian manner is to be described using probability to characterize the conclusions. However, there is no obvious justification for this procedure except the afore-mentioned analogy, and the necessary reweighting or conditioning is often regarded as counter-intuitive. We shall later discuss the implications of the use of this formula. Dempster generalizes Formula (3.21) so that it becomes applicable to probability assignments produced by multivalued mappings. As shown in Chapter 3.3 the difference between this kind of probability assignments and the usual kind may be described using basic probability numbers. They are attributed to subsets not only to singletons. Dempster decides to apply a modification of Formula (3.21) to basic probability numbers. How is (3.21) generated? Using, because of "independence", the product rule, one obtains at first: P(E ~1)) =PI(Eil) " - , - " Pl(Eil) , whereby E(1) denotes an "event" which may - in analogy to the random experiment - be described as follows: The first source leads to Ell, the second to Ei2 and so on. Our knowledge that all sources contain information about the same topic, excludes all E(11, with the exception of those for which Ell . . . . . Ell. These E <11 can be identified with the common El, i ~- 1,...,k. Their original probability is described by the numerator of (3.21), while conditioning produces reweighting, using the denominator of (3.21). If these considerations are to be transferred to assignments, which attribute basic probability numbers to subsets Bi of O, the first step remains unchanged and leads to m(B~l') : ml(Bi~) ...." n~l(Bil) with B (11 being the symbol for concluding from the first source to subset Bil, from the second to Bi2 and so on. For Bil . . . . .
Bit and for Bil N ... Cl Bil --= ¢ an analogous procedure to that used
above is obvious. Therefore the question remains how to treat those B C1' for which Bil,...,Bil are neither identical nor disjunct. Any combination rule for basic probability numbers re(B) affords a decision as how to make allowance for m(B~ l, ), if B~ l> is of this kind. In constructing his combination rule, Dempster decides to attribute these basic probability numbers to the subsets Bil fl...NBil. This seems quite natural, as one would proceed in the same way if
46 experiments were described. However, we shall see later, that this is not the only plausible attitude, if sources of information are to be combined. The formula resulting from Dempster's procedure is, for 1 = 2, explicitly given by SHAFER [1976, p.60] :
.E. m,(Bi)m2(Bj)
l~J BinBj=B
m(B) =
(3.22) 1 -.E.ml(Bi)m2 (Sj) l~J
B inBi= ¢ The working of this formula is shown through Example (3.6). Two probability estimates are taken as stemming from independent sources and each of them is described by its basic probability numbers, m~ respectively m2. Example (3.6): O = {E,,
E2, E3}
m,(E1) = 0.2
m2(E,) : 0.3
~,(E~) :
~(E~)
O, 1
: O. 3
ml(E3) = 0.4
m2(E3) = 0.2
ml(E1UE2) = O. 1
m2(E1UE2) : 0.0
ml(E1Ug3) = O. 0 ml(EaUE3) = 0.0
m2(E1UE3) = O. 1 m2(E2UE3) = 0.0
m[([~) = 0.2
m2(0) = 0.1
From this we obtain: 0.2 _
0.3 _
0.1 < PI(Ea) < 0.4
0.3 _
O. 4 _
0.2_< P2(E3) _<0.4
ToapplyFormula(3.22) wefirstofallcNculatethenumeratorNreachBcO: N(E~) = ml(Ea)m2(E1) + ma(E1)m2(E1UE2) + ml(E1)m2(E1UE3) + mt(E1).m2(O) + + mI(E1UE2)m2(Et) + ml(E1UE3)m2(E~) + ml(O)m2(E1) + + mI(E1UEa)'ma(E1UE3) + ml(E1UE3)m2(E1UE2) = =0.2.0.3+0.2.0.0+0.2.0.1+0.2.0.t+0.1.0.3+0.0.0.3+ +0.2.0.3+0.1-0.1+0.0.0.0 = 0.20 Analogous: N(E2) = 0 . 1 3
N(E3) = 0 . 2 0
47
N(E1LIE2) = ml(EglE2) • m2(E1LIE2) + ml (E1LIE2) • m2(0) + ml(0). m2(E1UE2) = 0.01 N(E1UE3) = 0.02 N(E2UEa) = O. O0
N(O) ml(fl).m2(8) =
=
0.02
The denominator D of Formula (3.22) has to be calculated as the sum of all numerators. D = g N(B) = 0.20 + 0.13 + 0.20 + 0.01 + 0.02 + 0.00 + 0.02 = 0.58 Be0
B#¢
The following basic probability numbers result:
m(E,)
0.20 =gTgg
m(E2)
= 0.2241
m(Ea)
= 0.3448
= 0.3448
m(E1UE2) = 0.0172 m(E1UE3) = 0.0345 m(E2UEa) = 0 m(O)
= 0.0345
Now it is possible to derive the limits of the probabilities for the events El, E2 and E3 with respect to the Dempster-Shafer's combination rule: 0.3448 _
D
The Dempster-Shafer rule as described in (3.22) and demonstrated in Example (3.6) is often recommended for all cases of combining independent sources of information and especially for the use in diagnostic systems. It can easily by generalized to more than two independent sources of information, either directly or stepwise. In the latter case commutativity and associativity have been shown by S H A F E R [1976, pp.62-64]. Since this rule cannot be derived from laws of probability, it is legitimate to search for competing solutions for the considered problem and to compare the behaviour of such solutions. We start such investigation by discussing the possibilities to generalize Formula (3.21) to cases in which interval estimates of probabilities are used. At first we take into consideration that every k - P R I produces a structure, i.e. a set of probability distributions. If 1 structures are given, the application of (3.21) leads to a value P(Ei*) for every selection of one distribution out of each structure. In that way a set of values P(Ei*) is created, and the calculation of the Upper and lower limits for P(gi*) obviously is the straightforward generalization of (3.21).
48 Let us for simplicity take 1 = 2. The two k - P R I s correspond to the structures S* and S~, so that Sup Pj(Ei) = Uji PieS~ I n f Pj(Ei) =Lji PjeS]'
j=1,2
; i = l . . . . ,k.
If these two sources of information are combined, a structure S * = S~ ® S~ is produced. Each probability distribution which belongs to S* can be combined with each distribution belonging to S*2. For any such combination Equation (3.21) is used. We denote for the combined probability distribution P(Ei): Sup~P(Ei) = Ui pes t I n f P(Ei) = Li. PES* These upper and lower limits can be calculated by means of an algorithm, which is described in the Appendix. It is possible that they are much wider than the limits provided by (3.22). This will be demonstrated in the following example, which uses the values of the preceding example. Example (3.7): Lll = 0.2
Ull = 0.5
L21 = 0.3
U21 = 0.5
La2 = 0.1
U12 = 0.4
L~2 = 0.3
U22 = 0.4
L13 = 0.4
U13 --
0.6
L23 : 0.2
U23 = 0.4
Using the algorithm described in the Appendix we find: U1 = 0.6944. This is produced by P~(E,) = 0.5
P ~ ( E , ) = 0.5
PI(E2) = 0.1
P2(E2) = 0.3
PI(E3) = 0.4
P2(E3) = 0.2
The lower limit L~ = 0.1667 is produced by P,(E1) = 0.2
P 2 ( E , ) = 0.3
P~(E2) = 0.2
P2(E2) = 0.3
Px(E3) = 0.6
P2(E3) = 0.4
The upper limit for P(E2) is U2 = 0.5 and stems from PI(Ea) = 0.2
P2(E1) = 0.4
PI(E2) = 0.4
P2(E2) = 0.4
PI(E3) = 0.4
P2(E3) = 0.2
The corresponding lower limit L2 = 0.0833 is produced by PI(EI) = 0.5
P2(E1) = 0.5
el(E2) = 0.1
Pu(E2) = 0.3
PI(E3) = 0.4
P2(Ea) = 0.2
49
Finally Ua = 0.6667 stems from PI(E1) = 0.3
P ~ ( E , ) = O.3
PI(E2) = 0.1
P2(E2) = 0.3
PI(E3) = 0.6
P2(E3) = 0.4
and La = 0.2222 results if (3.21) is applied to: PI(E1) = 0.5 PI(E2) = 0.1
P2(E1) = 0.5 P2(E2) = 0.3
P,(E3) = 0.4
P2(E3) = 0.2
If these limits are compared with the results of the Dempster-Shafer rule as given in Example (3.6), it is obvious that the Dempster-Shafer limits are by far too narrow: P(EI) :
[0.3448, 0.4310] against [0.1667, 0.6944]
P(E2) :
[0.2241, 0.2759] against [0.0833, 0.5000]
P(E3) :
[0.3448, 0.4138] against [0.2222, 0.6667].
[]
The result of these comparisons demonstrates that the Dempster-Shafer rule may be a great deal too optimistic. The calculation of the true limits for each of the combined probabilities according to (3.21) is always possible if the application of the Dempster-Shafer rule is possible. Additionally it can be practiced in those cases, in which PRIs are used, which are feasible, but not D-S-admissible. The extension to more than two sources of information can be achieved stepwise in quite the same manner as for the Dempster-Shafer rule. For these reasons a justification to rely on the Dempster-Shafer generalization of Equation (3.21) is not to be seen. Furthermore it is noteworthy that in a way quite similar to that which leads to the Dempster-Shafer rule, another type of combination rule may be derived which does not have the tendency to be too optimistic, but the tendency to be too pessimistic. In this case the basic probability number m l ( B i l ) ' . . . ' m l ( B i l ) is always attributed to BilU...UBil. Since this subset of O can only be ¢ if each Bij = ¢, the necessity of conditioning does not exist if this procedure is applied. So far this is not a generalization of (3.21) but a modification. The reasons for justifying this method are similar to those of Dempster's generalization of (3.21): If frmn one source of information it is concluded that Bil occurs and from the other one it is concluded that Bi2 occurs, one should only conclude, that Bil or Bi2 occur. The resulting rule of combination is for 1=2 : m(B) = ft.
mx(Bi)m2(Bi) l~J BiUBj=B
The working of the formula will be demonstrated by
(3.23)
50 Example (3.8): Using the same assumptions as in Example (3.6) and (3.7) and the same set of data we obtain: m(E,) = ml(E,)'m2(Gl) =
0.2.0.3
= 0.06
m(E2) = ml(E2).m2(E2) = 0 . 1 . 0 . 3 = 0.03 m(E3) = m,(E3).ms(Ea) = 0 . 4 - 0 . 2 = 0.08 m(E1UE2) = ml(E1).m2(E2)+ml(E2).m2(E,) + + ml (ElUE2) [ms (El)+m2 (Es)+m~ (EIUEs)] + + ms (EIUE2) [m, (E,)+ml (Es)] = = 0 . 2 . 0 . 3 + 0 . 1 - 0 . 3 + 0.110.3 + 0 . 3 + 0.0] + 0 . 0 = 0 . 1 5 In an analogous way: m(E1UE3) =0.22 m(E2UE3) = 0 . 1 4 re(O) = 0.32 This leads to LI = 0.06 _
P,(E,) .P~(E,) + P,(E~).Ps(E:) < [(PI(E,) + P,(E~)]. [Ps(E,) + P~(E2)] as long as PffEI).P2(E1) > 0 or PffE2).P2(E1) > 0. That the denominator is enlarged, leads to reduction of those probabilities (or basic probability numbers) for which the numerator remains
51
unchanged, because the concerned events are not involved in the coarsening. Therefore we have to expect larger probabilities (or basic probability numbers) for the events not involved in coarsening, if the combination rule is applied first and coarsening takes place afterwards and smaller ones, if coarsening is done first. These Considerations cannot be applied to Formula (3.23), the "conservative" combination rule, which does not prescribe reweighting. It can be shown that this formula provides independence of the result from the sequence of the operations coarsening and combining. Therefore in this respect (3.23) is superior to both other methods of combination. We shall demonstrate these results by means of the data used in the preceding examples. Example (3,9): For the data of Example (3.6) and the coarsening El' = Et U E2 Ea' = E3 it follows o.4_< P~(EI') < o.6 0.4 < Pl(E3') <_0.6 and ml(El') = 0.4 ml(E3') = 0 . 4
0.6_< l'2(E~') _<0.8 0.2 <_P2(]~3') _<0.4
m2(El') = 0 . 6 m2(E3') =0.2 rex(B) = 0.2 m2(O) = 0.2 If the Dempster-Shafer combination rule is applied to the data gained by coarsening, the result is: m(E, ~) = 0.6471 m(E3 T) = 0.2941 m(8) = 0.0588
Therefore the result of coarsening first and combining afterwards is: 0.6471 < P(EI') < 0.7059 0.2941 < e(Ea') < 0.3529 If at first the Dempster-Shafer rule is used for combining the two sources, the result described in Example (3.6) is gained. If then coarsening is applied, re(E1 ' ) = 0. 5861 (if rounding is taken into consideration: 0.5862) m(E3 ' ) = 0. 3448 m(O) = 0.0690. Therefore we have 0. 5862 5 P (El ' ) _< 0.6552 0.3448 < P(Ea') _< 0.4138 The probability of E3 = E3', not involved in coarsening, is estimated considerably higher than in the case, that the reverse sequence was used. It is easy to construct examples for which these differences become much greater.
52 If we use the method producing the exact limits for the application of (3.21) to the two structures, we have to compare it with Example (3.7). Applying the method of the Appendix to the two structures created by the coarsening, we achieve: L1' = 0.5 and U3' = 0.5, both resulting from P1(EI') = 0 . 4 P2(EI') = 0 . 6 PI(E3' ) = 0.6 P2(E3') = 0.4 U1 ~ = 0.8571 and L3' = 0.1429, stemming from PI(EI') = 0.6
P2(EI') = 0.8
P,(E3') = 0.4 P2(E3') = 0.2 Therefore first coarsening and afterwards combining leads to 0.5000 < P ( E I ' ) < 0.8571 0.1429 _
0.24
m(E3') : m l ( E 3 ' ) ' m 2 ( E 3 ' ) = 0.08 m (0)
= ml ( E l ' ) . m 2 ( E 3 ' ) + m l ( E 3 ' ) . m 2 ( E l ' ) + m l ( E l ' ) . m 2 ( 8 ) + m l ( 8 ) - m 2 ( E l ' ) +ml (E3 |) "m2 (8) +ml (8) .m2 (E31 ) +ml (8) .m2 (8) = = 0.4.0.2+0.4.0.6+0.4.0.2+0.2.0.6+0.4.0.2+0.2.0.2+0.2.0.2 =
+
= 0.68 This produces the 2-PRI: 0.24_< P ( E I ' ) _<0.92 0.08 < P(E3') _<0 . 7 6 . If we, on the other hand, coarsen the result of Example (3.8) we obtain: m ( E l ' ) =m(E1) +m(E2) + m(E,UE2) = 0 . 2 4 re(E3') =re(E3) = 0 . 0 8 m (8) =1 - r e ( E , ' ) - m(E3') = 0.68 and therefore the same 2 - P R I which is gained, if coarsening is done first and the rule (3.23) is applied afterwards. In this way the independence from the sequence of the operations, which was stated above, is demonstrated.
[]
53 From the considerations described above and the demonstrations brought about by Example (3.9) we may conclude: The Dempster-Shafer rule is not the only method of generalizing (3.21). The procedure described in the Appendix may be applied in order to produce the upper and the lower limits for the combined structure, provided only that feasible k-PRIs are given - even if the limits for the initial distributions are not totally monotonous and not totally alternating capacities. Furthermore this procedure gives a clear answer to a meaningful question, namely: What are the greatest and the smallest probabilities according to (3.21), if the initial probabilities lie between their respective limits? Should we restrict ourselves to cases, for which basic probability numbers exist, two further methods can be suggested. Among them, the Dempster-Shafer rule produces narrower probability intervals, but the competing Formula (3.23) shows an important favourable quality which the Dempster-Shafer rule lacks. While the aspects discussed above were of a more or less practical nature we now turn to the question whether a combination rule referring to Equation (3.21) can be sufficient at all. As was pointed out earlier one should generally distinguish between the use of such rules in statistical inference and their use in diagnostic systems. The characteristic feature of the latter is the combination of pieces of evidence, each of which consists of an observation of one relevant attribute. It has to be supposed that for each observation of each of the attributes probability statements about the states of nature in question are given. If two assumptions are made, namely that the sources of information are mutually independent and that all probability statements can be described by numbers, not by intervals, a formula like (3.21) can be applied in order to combine two statements stemming from two sources. If formula (3.21) itself is employed, exactly the information contained in the two probability distributions - and nothing else - is used. Can this be sufficient? Before discussing this problem we shall consider the following example. Example (3.10): Let us examine two mutually independent attributes which provide information about a person's health. We suppose that each of the two actual observations produces a probability of 10% for the presence of a certain illness, a probability of 90% for its absence. This enables us to apply formula (3.21) and to calculate the probability of the presence of this illness with respect to both observations. The result would be: 1.27o. Afterwards we come to know which outcomes were possible for each of the two attributes. For simplicity let us assume that for both attributes only two outcomes had been possible and let us distinguish two different cases. In the first case each of the two unobserved outcomes would have produced a probability of 0.1% for the presence of the illness. In this case the illness in question is obviously a rather seldom one: The probability of its presence must lie between 0.1% and 10%.
54 However, both observations show themselves to be in favour of the illness, because both lead to the greater one of the two possible probability values. If each of the two observations separately produces a probability of 10% for the presence of the illness, can anyone accept the result of a combination rule, from which the combined observation produces a probability of 1.2%? Let us consider another case, in which the unobserved outcome of each of the two attributes would have produced a probability of 95% for the presence of the illness. Now the illness under consideration must be more frequent than in the first case: its probability lies between 10% and 95%. Yet both observations indicate against the presence of the illness. Therefore a combination of both observations should produce a probability which is smaller than the probability produced by one observation alone. Consequently in this case the result of a combination rule should be smaller than 10%. Q
If one follows the line of argumentation we used in this example, it becomes evident that in certain cases a combination rule originating from (3.21) cannot be satisfactory. In these cases the influence of the alternative outcome on the interpretation of combined observations can by no means be neglected. A combination rule which uses only probability statements based on the actual observations must inevitably produce counter-intuitive effects in some situations. As all rules of the type (3.21) cannot be derived from more general results of probability theory but must rather be seen as proposals, their behaviour is an important criterion concerning their applicability. Therefore the above considerations produce a very fundamental reason against the employment of (3.21) and of the Dempster-Shafer rule in diagnostic systems: These procedures do not use all the information which may be relevant in order to draw conclusions from different pieces of evidence. It should be noted, that our argumentation up to now by no means relies on anything like a Bayesian concept. The mere comparison of probability statements based on different observations demonstrates the necessity of taking the alternative into consideration. If a total probability of the presence of the sickness was mentioned, then it was done only in the sense of a trivial - and in most cases very inaccurate - estimate: It cannot he greater than the greatest and not he smaller than the smallest probability in the case of any observation. This is the type of statement which can be made for diagnostic systems but can be made extremely seldom - if at all - in the course of applying probability statements in statistical inference. We therefore confine our considerations to the use of combination rules in diagnostic systems and do not discuss the problem of applying such rules in non-bayesian theories which employ probability statements in the frame of statistical inference. In Chapter 4 we shall investigate the factual requirements of additional data in order to propose a method of combining independent sources of information in a way which satisfies the theoretical as well as the practical needs of describing uncertainty by means of probability in diagnostic systems.
55
Example (3.11) introduces the relevant data of a situation which will be employed several times during later chapters. In the first place these data are used in order to demonstrate the kind of absurd results which may originate from applying the Dempster-Shafer rule to diagnostic systems. Example (3.11): Let us consider a certain power plant which has two states. Let EN be the state of normal performance and EF the state of failure, 8 = {EF,EN}. We assume that this power station has alarm units. Each alarm unit has the following behaviour: If the alarm goes off the probability of a failure is 0.01, if it does not go off the probability of the failure is 0.0001. For simplicity of the calculation we have assumed, that these probabilities are numerically the same for all units. In addition, we assume that the units work independently and therefore allow the application of Dempster-Shafer's combination rule. If there are two units and both units give alarm we have (using Shafer's notation): mi(EF) = Be](EF) = 0.01 mi(Ei) = Bel(EN) = 0.99 mi(O)
=0
,
i=
1,2.
Then the combination rule yields: 0.01 • 0.01 P(EF) = 0 . 0 1 . 0 . 0 1 + 0 . 9 9 . 0 . 9 9
= 0.000102.
This result, which is not reasonable at all, would not be substantially altered, if we used interval estimation instead of point estimation for the probabilities. It demonstrates that the Dempster-Shafer rule cannot be relied upon in this case. ff one o f t h e t w o alarms goes off and the other one does not, we arrive at: ml(EF) = Bell(EF) = 0.01
m2(EF) = Belu(EF) = 0.0001
ml(EN) = Bell(EN) = 0.99
m2(EN) = Bel2(EN) = 0.9999
mi(0) = 0 Via the Dempster-Shafer rule, we obtain: 0.01 • 0.0001 P(E~) : 0 . 0 1 . 0 . 0 0 0 t + 0.99-0.9999
i = 1,2 .
: 0.00000101
If none of the alarms goes off: mi(EF) = Beli(EF) = 0.0001 mi(EN) = Beli(EN) = 0.9999 mi(O)
=0
,
i = 1,2 .
and 0.0001 0.0001 P(EF) = 0.0001.0.0001 + 0.9999.0.9999 •
=
1.0002.10-s
Altogether, according to Dempster-Shafer, the existence of two alarm units has considerably reduced the probability of a failure. This result can be extended to more than two units: For three
56 independent units, for instance, the probability of a failure in the worst case - if all alarms go off at the same time - is calculated through the Dempster-Shafer rule as 0.01 . 0 . 0 1 • 0.01 1.03.10_ 6 P(EF) = 0.01.0.01.0.01 + 0.99.0.99.0.99 =
[]
This demonstrates that the application of the Dempster-Shafer rule is extremely dangerous in these cases: The more alarms go off at the same time, the more stringent is the conclusion that the power plant is in normal state! These results are extremely useless for practical purposes. But they show a tendency which is typical of the Dempster-Shafer rule: Probabilities which are smaller than one half in the case of two states of nature (or smaller than 1/k in the case of k states of nature), are reduced by the application of the Dempster-Shafer rule, the others are increased. This means that very rare events can never be recognized even if many circumstances are strongly in favour of them. This tendency also exists in those cases where the deviations from 0.5 (in the case of two states of nature) are small: If the probabilities are 0.4 and 0.6 in both distributions, we arrive, following the application of the Dempster-Shafer rule, at a probability of 0.308 for the one result and 0.692 for the other. While in such moderate cases one is tempted to accept such a result, the example of the power station given above, shows that the application of the Dempster-Shafer rule can be extremely dangerous. To avoid the impression that Example (3.11) is very untypical, because it uses probabilities given by numbers, let us extend the assumptions to a case, in which probabilities are given by intervals. Example (3.12): We assume that in the case that the first unit gives alarm: 0.01 < PI(EF) ~ 0 . 0 2 , which produces: m~(EF) = 0.01 ms ( E N ) = 0 . 9 8
m,(S)
=0.01
Should the second unit give alarm, we have: 0.005 _
=
0.005
m2(EN) = 0.985 m2(O) = 0.010
57 Then the Dempster-Shafer rule yields (for the case that both units give alarm):
mI(EF).m2(EF) + m1(EF).m2(O) + mi(8).m2(EF) m(EF) 1 - m(ED.m2(~N)
0.01.0.005
- m~(EN)-m2(Er)
+ 0.01.0.010
+ 0.01.0.005
= 0.000203 1 -
0.01.0.985
-
0.98.0.005
m(EN) = 0.999696 m(0)
= 0.000101
0.000203 < P(Ev) ~ 0.000304.
[]
An example which questions the vMidity of reweighting as used in the D e m p s t e r - S h a ~ r rule, and may therefore be seen as an additionM point in our argument, was constructed by ZADEH and included in a research memorandum [1979]. Example (3.13):
0 = {E1,E2,E3}
PI(E1) = ml(E1) = 0.99
P2(E1) = m2(Ej) = 0.00
PI(E2) = ml(E2) = 0.01
P2(E2) : m2(E2) = 0.01
PI(E3) = ml(E3) = 0.00
P2(E3) : m2(E3) = 0.99
Using (3.21) or (3.22) for the calculation of the combined probability distribution, the result P(E2) = 1 is produced. Although this example is evidently of no practical relevance, it demonstrates the relative advantages of (3.23) by means of which we would have obtained: m(EQ = 0.0000
m(E1UE2) = 0.0099
m(E2) = 0.0001
m(EltJE3) : 0.9801
m(E3) = 0.0000
m(E2UE3) = 0.0099
re(O) = 0
and therefore: 0.0000 < P(E~) < 0.9900 o . o o o l < P(t~) < o.o~oo
0.0000 < P(E3) _<0.9900 In concluding our investigation of the theory of belief functions and of the Dempster-Shafer rule with respect to diagnostic systems we may state the following results: 1.
The theory of belief functions is applicable only in those cases in which the lower probabilities are described by totally monotone Choquet-capacities. As there is no material difference between k - P R I s fulfilling this requirement and other k-PRIs, the employment of this theory is impossible in many cases of practical interest.
58 2.
The Dempster-Shafer rule cannot be justified by considerations which are part of probability theory. If it is regarded as a practical advice it reveals, on the one hand, severe disadvantages to a higher extent than competing rules do. On the other hand, it suffers fundamentally from its neglect of data, which are intuitively of vital importance with respect to combining information in the frame of diagnostic systems.
3.
The application of the Dempster-Shafer rule leads to certain types of bias: small probabilities are reduced yet further and by a large degree. This means state of nature which seldom occurs, can never be detected, even if the actual observations produce comparatively great probabilities for it, as long as these probabilities are under l / k , where k is the number of the states of nature.
Altogether we come to the conclusion that neither the theory of belief functions nor the Dempster-Shafer rule may be regarded as satisfactory methods of evidential reasoning in diagnostic systems.
59
3.5 The Methods Used in the Expert System MYCIN In the years preceding 1976 Edward Hance Shortliffe and Bruce G. Buchanan developed the expert system MYCIN for use in special medical application. [SHORTLIFFE, 1976 with a bibliography on earlier publications]. It contained a model of inexact reasoning in medicine which was based on probability theory but conceived on intuitive grounds. Arguing that the requirements of classical probability theory are not given in those practical cases which are of interest with respect to MYCIN, Shortliffe and Buchanan developed ways of evidential reasoning using supplementary concepts. There are two main definitions, which introduce a "measure of increased belief MB" and a "measure of increased disbelief MD". The fundamental definitions are: Definition (3.2): 1 / MB(B,Z) = ~MaxfP(BIZ), P(B)] - P(B) 1 - P(B) t 1 MD(B,Z) = I'(B) - Min[P(BiZ), P(B)] P (B)
i f P(B) = 1 (3.24) otherwise i f P(B) = 0 (3.25) otherwise []
By P(B|Z) Shortliffe and Buchanan understand the probability of an outcome B given the observation Z, and by P(B) they understand the prior probability of B. In this way the concept used in MYCIN takes into account not only the probability which is determined by the observation but also the probability which was given prior to this observation. One should note that the Dempster-Shafer rule uses only the probabilities determined by the observations, but not the prior probability. The measure of increased belief is defined as a number larger than zero only in those cases where P(B|Z) > P(B). Such a value indicates that through the observation Z "belief in B was increased". MB is a simple fraction: the actual increase in the probability of B divided by the possible maximum of this increase. This construction therefore uses the same value to characterize very different situations. For instance, if the prior probability of B was 0.1 and by the observation Z it is increased to 0.2, then MB is 1/9. But the same value of MB is achieved if in another case the prior probability of B was 0.91 and the probability of B given the observation Z was 0.92. It is indeed questionable whether these two gains in probability may be regarded as the same increase in belief. In statistical theory there exist different measures for such situations, some of which are more formal, whereas others have a model as background. It must be accepted that the authors of MYCIN introduce a new kind of measure and express the opinion that practitioners treat situations like the two described above as equal with respect to the increase in probability resulting from new information.
A similar interpretation is possible for MD. Except in the case where the prior probability is zero, MD is different from zero only in those cases where the probability of B has been reduced due to the observation. Then MD expresses the part of the possible reduction which in fact has occurred. Therefore if a probability which was 0.20 a priori - P(B) - is reduced to 0.10 due to the observation Z, there emerges an MD = 0.5. If on the other hand the prior probability was 0.92 and it is reduced by Z to 0.46, the result is also MD = 0.5. Yet should Z reduce the probability from 0.92 to 0.91, MD would take the value 0.0109. It is interesting to compare this result with the one obtained for MB: an increase from 0.91 to 0.92 produces MB = 0.1111. Confusion, caused by the discrepancy between measures of absolute belief and measures of a change in belief, often arises in application, as has already been mentioned by other authors. Users of expert systems employ measures of absolute belief and handle them as if they were measures of change in belief and vice versa. With respect to (3.24) and (3.25) only one of the two measures MB and MD can be different from zero. In the original version of the expert system MYCIN situations were taken into account in which MB as well as MD were not equal to zero. These situations were created by the following combination rule, which was to be applied to MBs as well as to MDs:
MB(B,Z1∧Z2) = 0                                                       if MD(B,Z1∧Z2) = 1
MB(B,Z1∧Z2) = MB(B,Z1) + MB(B,Z2) - MB(B,Z1)·MB(B,Z2)                 otherwise        (3.26a)

MD(B,Z1∧Z2) = 0                                                       if MB(B,Z1∧Z2) = 1
MD(B,Z1∧Z2) = MD(B,Z1) + MD(B,Z2) - MD(B,Z1)·MD(B,Z2)                 otherwise        (3.26b)
The application of this combination rule is not limited to cases in which certain requirements of independence or similar conditions are fulfilled. On the contrary, it is meant as a general rule which should approximate the real values of MB and MD. SHORTLIFFE [1976, pp.182-184] reports a simulation of the effect of Formulas (3.26), according to which, in the vast majority of cases, this approximation did not produce values radically different from the true ones (using the certainty factor which we shall describe later). One should realize that such empirical comparisons can never remove the difficulties caused by the ignorance of the statistical behaviour of the two sources of information. If, for instance, the two observations which give rise to the MBs and MDs are very closely correlated in the case of both B and ¬B, it cannot be expected that the same combination rule gives results as reliable as in another case in which these two observations are totally uncorrelated. Therefore these Formulas (3.26), which are meant to approximate the true results of combinations, can never do so, because true values do not exist as long as the correlation of Z1 and Z2 remains unknown.
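The following short computational sketch (in Python; it is not part of the original treatment, and all function names are ours) illustrates Definitions (3.2) and the combination rule (3.26) with the numbers discussed above:

```python
# Sketch of MYCIN's measures of increased belief/disbelief, Definition (3.2),
# and the original combination rule (3.26). Function names are illustrative.

def mb(prior, posterior):
    """Measure of increased belief MB(B,Z), Formula (3.24)."""
    if prior == 1:
        return 1.0
    return (max(posterior, prior) - prior) / (1 - prior)

def md(prior, posterior):
    """Measure of increased disbelief MD(B,Z), Formula (3.25)."""
    if prior == 0:
        return 1.0
    return (prior - min(posterior, prior)) / prior

def combine(m1, m2):
    """Combination rule (3.26) for two MBs (or two MDs); the special case
    where the opposite measure equals 1 is not treated in this sketch."""
    return m1 + m2 - m1 * m2

# The two situations discussed in the text receive the same MB = 1/9:
print(mb(0.10, 0.20))   # 0.111...
print(mb(0.91, 0.92))   # 0.111...
# ...and two very different reductions receive the same MD = 0.5:
print(md(0.20, 0.10))   # 0.5
print(md(0.92, 0.46))   # 0.5
print(md(0.92, 0.91))   # 0.0109 (compare MB = 0.1111 for the reverse step)
```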
Although the author of MYCIN does not present a justification for the Formulas (3.26), an argument does nevertheless exist. J. Barclay ADAMS [SHORTLIFFE, BUCHANAN, 1985, p.263 ff] has shown that the formulas can be proven by means of probability theory, if certain requirements of independence are fulfilled. For this purpose we shall give the following definitions:
Definition (3.3): Let Bi ⊆ O, i = 1,...,k. Z1 and Z2 are (B1,...,Bk)-independent, if for all Bi: P(Z1∧Z2|Bi) = P(Z1|Bi)·P(Z2|Bi), i = 1,...,k. For simplicity in future we shall write P(Z|O) = P(Z) and call O-independence global independence.
Definition (3.4): a) If Z1 and Z2 are (O,B,¬B)-independent we call Z1 and Z2 totally independent with respect to B. b) If Z1 and Z2 are (B,¬B)-independent we call Z1 and Z2 double independent with respect to B. []
Using this notation we can formulate Adams' result as follows: 1) If Z1 and Z2 are (O,B)-independent, then (3.26b) holds. 2) If Z1 and Z2 are (O,¬B)-independent, then (3.26a) holds. Therefore, as long as one takes into account only those symptoms which contain evidence against B, and for that reason give rise to the construction of MDs, it is sufficient for the validity of MYCIN's combination rule that all the symptoms are independent with respect to O and B. As long as one only regards symptoms which bear evidence in favour of B, and which therefore give rise to the construction of MBs, the combination rule is valid if the symptoms are independent with respect to O and ¬B. If symptoms of both kinds are to be combined, neither of the two requirements is sufficient for the simultaneous validity of (3.26). This is only guaranteed if the symptoms Z1 and Z2 are totally independent with respect to B. Now it can be shown:
Theorem (3.4): If Z1 and Z2 are totally independent with respect to B, then either P(Z1|B) = P(Z1|¬B) or P(Z2|B) = P(Z2|¬B).
If the combination rule is to be valid for symptoms Z1 and Z2 which bear evidence in favour of, as well as against, B, at least one of the two symptoms must be independent of the states of nature under consideration. This fact indicates that this symptom is meaningless as far as the actual aim of the investigation is concerned.
Proof of Theorem (3.4):
I.)   P(Z1∧Z2) = P(Z1)·P(Z2)
II.)  P(Z1∧Z2|B) = P(Z1|B)·P(Z2|B)
III.) P(Z1∧Z2|¬B) = P(Z1|¬B)·P(Z2|¬B)
By means of the law of total probability, using II and III, it follows that:
P(Z1∧Z2) = P(Z1|B)·P(Z2|B)·P(B) + P(Z1|¬B)·P(Z2|¬B)·P(¬B)        (*)
Using I:
P(Z1∧Z2) = P(Z1)·P(Z2) = [P(Z1|B)·P(B) + P(Z1|¬B)·P(¬B)]·[P(Z2|B)·P(B) + P(Z2|¬B)·P(¬B)] =
= P(Z1|B)·P(Z2|B)·P²(B) + P(Z1|¬B)·P(Z2|¬B)·P²(¬B) + [P(Z1|¬B)·P(Z2|B) + P(Z1|B)·P(Z2|¬B)]·P(B)·P(¬B)        (**)
The subtraction (*) - (**) gives rise to:
0 = P(B)·P(¬B)·[P(Z1|B) - P(Z1|¬B)]·[P(Z2|B) - P(Z2|¬B)]
It can therefore be concluded that the MYCIN combination rule in its original version is consistent with probability theory only if either all symptoms are in favour of B, or if all symptoms bear evidence against B, or if all symptoms but one are meaningless with respect to B. These results show that in general there is no theoretical foundation for the combination rule of MYCIN in its original version. It should, however, be noted that such a theoretical foundation has never been claimed by Shortliffe. Concerning the practical application of the combination rule, MYCIN recommended firstly the combination of all the MBs stemming from symptoms which are in favour of the outcome B, followed by the combination of all MDs for which the symptoms bear evidence against B, and as a third step the combination of the MB and the MD which resulted from steps 1 and 2. As through the combination rule the possibility arises that MB as well as MD are different from zero, they are combined into a single number which is called the "certainty factor CF". The underlying idea is that a certainty factor is "a plausible representation of the numbers an expert gives when asked to quantify the strength of his judgmental rules." (SHORTLIFFE, 1976, pp.173/174 or SHORTLIFFE, BUCHANAN, 1985, p.251). The original definition of certainty factors is

CF(B,Z) = MB(B,Z) - MD(B,Z)        (3.27)
Shortliffe says of an expert who is asked to quantify the strength of his judgmental rules: "He gives a positive number (CF > 0) if the hypothesis is confirmed by observed evidence, suggests a negative number (CF < 0) if the evidence lends credence to the negation of the hypothesis, and says there is
no evidence at all (CF = 0) if the observation is independent of the hypothesis under consideration." [SHORTLIFFE, 1976, p.174]. As the certainty factor is meant to combine the information which is included in MB and MD, it is of the same nature as both of these measures: it contains information about the changes of knowledge, but it does not describe the knowledge itself. With respect to outcomes of low credibility the authors of MYCIN changed the strategy of combining evidence and the combination rule. The new combination rule calculates the certainty factor of the combined information through the certainty factors of the separate ones:

CF = CF1 + CF2 - CF1·CF2                         for CF1, CF2 > 0
CF = CF1 + CF2 + CF1·CF2                         for CF1, CF2 < 0        (3.28)
CF = (CF1 + CF2) / (1 - Min[|CF1|, |CF2|])       otherwise
Here CF1 and CF2 are the certainty factors which come from two different observations Z1 and Z2; CF is the certainty factor which arises from the combined observation of Z1 and Z2. It is understood that MBs, MDs and CFs are always calculated with respect to the original prior probability. The new combination formula (3.28) has obvious advantages: it secures - as Shortliffe and Buchanan stress - commutativity. Therefore it is not necessary to store all MBs and MDs. It is sufficient to store the last CF, because another sequence of the same observations would not change the result. [SHORTLIFFE, BUCHANAN, 1985, p.216]. The authors consider Formula (3.28) more plausible than the original formula. The main objection remains: it is not possible to judge the merits of any combination rule which does not take into account a measure of association or correlation between the observations concerned. In the rules MYCIN uses for the final decisions, the certainty factors of competing hypotheses are compared and therapy is based on hypotheses with high CF-values. This is described in the following way [SHORTLIFFE, BUCHANAN, 1985, pp.261/262]: "We have shown that the numbers thus calculated are approximations at best. Hence it is not justifiable simply to accept as correct the hypothesis with the highest CF after all relevant rules have been tried. Therapy is therefore chosen to cover for all identities of organisms that account for a sufficiently high proportion of the possible hypotheses on the basis of their CF's. This is accomplished by ordering them from highest to lowest and selecting all those on the list until the sum of their CF's exceeds z (where z is equal to 0.9 times the sum of the CF's for all confirmed hypotheses). This ad hoc technique therefore uses a semiquantitative approach in order to attain a comparative goal." (Italics in the original text.)
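To make the case distinction in (3.28) concrete, the following small sketch (in Python; not part of the original text, and the function name is ours) implements the revised combination and illustrates the commutativity stressed by the authors:

```python
# Sketch of the revised certainty-factor combination, Formula (3.28).

def combine_cf(cf1, cf2):
    if cf1 > 0 and cf2 > 0:
        return cf1 + cf2 - cf1 * cf2
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 + cf1 * cf2
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

# Commutativity: the order of the observations does not matter.
print(combine_cf(0.6, -0.4), combine_cf(-0.4, 0.6))   # both 0.333...
```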
This method can lead to very dangerous results if the prior probabilities for different hypotheses are unequal. This was shown by various authors and is demonstrated in the following example.
Example (3.14): Let the prior probabilities of three possible hypotheses B1, B2 and B3 be
P(B1) = 0.65        P(B2) = 0.30        P(B3) = 0.05
A certain observation Z more or less eliminates the possibility of B2 and transfers almost all of its probability to the hypothesis B3:
P(B1|Z) = 0.65      P(B2|Z) = 0.01      P(B3|Z) = 0.34
It follows:
MB(B1,Z) = 0                                         MD(B1,Z) = 0
MB(B2,Z) = 0                                         MD(B2,Z) = (0.30 - 0.01)/0.30 = 0.967
MB(B3,Z) = (0.34 - 0.05)/(1 - 0.05) = 0.305          MD(B3,Z) = 0
Therefore the certainty factors are (in order of magnitude)
CF(B3,Z) = +0.305
CF(B1,Z) = 0
CF(B2,Z) = -0.967
The rule recommended by MYCIN bases its therapy on the hypothesis B3, which has 100% of the sum of CFs for all confirmed hypotheses. Hypothesis B1 is not taken into account because its certainty factor is zero. But hypothesis B1 still has a probability of 65%.
[]
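The numbers of Example (3.14), together with one literal reading of the quoted selection rule, can be reproduced by the following sketch (Python; not part of the original text; the variable names and the concrete form of the z-threshold loop are ours):

```python
# Reproduction of Example (3.14): certainty factors (3.27) and the quoted
# therapy-selection rule with z = 0.9 times the sum of all confirmed CFs.

prior     = {"B1": 0.65, "B2": 0.30, "B3": 0.05}
posterior = {"B1": 0.65, "B2": 0.01, "B3": 0.34}

def cf(p, q):                      # CF = MB - MD, Formula (3.27)
    mb = 0.0 if p == 1 else (max(q, p) - p) / (1 - p)
    md = 0.0 if p == 0 else (p - min(q, p)) / p
    return mb - md

cfs = {b: cf(prior[b], posterior[b]) for b in prior}
print(cfs)          # approximately B1: 0.0, B2: -0.967, B3: +0.305

confirmed = {b: v for b, v in cfs.items() if v > 0}
z = 0.9 * sum(confirmed.values())
chosen, total = [], 0.0
for b, v in sorted(confirmed.items(), key=lambda kv: -kv[1]):
    chosen.append(b)
    total += v
    if total > z:
        break
print(chosen)       # ['B3'] -- B1, with probability 0.65, is never selected
```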
This is far from an extreme example. A rule which may yield results such as the one described in this example is liable to spoil valuable information and to be misleading in decision making, and therefore should not be recommended. To summarize: the method of certainty factors can be described as an attempt to develop a special language for the communication between experts and an expert system. This attempt proves to be a failure, as it does not meet the basic requirement of introducing a new language: an exact description of the concepts used. Only when both communication partners have agreed, at least to a certain extent, on these concepts, is communication possible. But in the case of MYCIN one partner - the expert system - develops ideas about what the many other partners have to bear in mind when they make a certain statement. It is very informative that the authors write: "Suppes pressed us early on to state whether we were trying to model how expert physicians do think or how they
ought to think. We argued that we were doing neither. Although we were of course influenced
by information regarding the relevant cognitive processes of experts ..., our goals were oriented much more toward the development of the high-performance computer program. Thus we sought to show that the CF model allowed MYCIN to reach good decisions comparable to those of experts and intelligible both to experts and to the intended user community of practicing physicians." [SHORTLIFFE, BUCHANAN, 1985, p.211, italics in the original text]. Obviously the system MYCIN does not provide a suitable basis for a new concept of credibility for use in diagnostic systems. On the other hand, if probability is used for this purpose in the traditional way, the majority of users are familiar with the employed concept. Others can be given short informative courses if they need some elements of probability, either to express their empirical knowledge or to use a diagnostic system properly. Therefore we strongly recommend the application of probability theory in its original version in diagnostic systems. What can be achieved in this way will be demonstrated in the following chapters.
CHAPTER 4
The Simplest Case of a Diagnostic System

4.1 A Solution Without Further Assumptions
In Chapter 3.4 we discussed the example of the power station (Example (3.11)) which has different alarm units. In order to construct a probabilistic model we shall assume that each alarm unit consists of a stochastic control mechanism. At certain time intervals, for instance every hour, all control mechanisms are read. We shall assume that there is a certain probability that such a reading produces an alarm signal and call these probabilities pj, with j = 1,...,l, if there are l mechanisms or - as we shall say generally in future - l units Aj, j = 1,...,l. For simplicity we shall take l = 2 as the simplest case of a diagnostic system and obtain the following table:

         Z1            ¬Z1
Z2       p             p2 - p              p2
¬Z2      p1 - p        1 - p1 - p2 + p     1 - p2
         p1            1 - p1              1

In the event that unit A1 causes an alarm we call this Z1 and similarly use Z2 when the alarm is caused by unit A2. In the event that unit Aj does not cause an alarm we call this ¬Zj (j = 1,2). To complete the description of the system we need in addition the probability that both units cause an alarm at the same time. We call this probability p. It characterizes the kind of dependence between the two alarm units. Obviously the following limits exist for p:
Max(0; p1+p2-1) ≤ p ≤ Min(p1, p2)
If p reaches its maximal possible value, the unit with the lower probability of alarm can never give alarm without the other one doing so at the same time. In the special case that p1 = p2, either both or neither unit gives alarm. Obviously in this case one of the two units is unnecessary. Without loss of generality we assume that p1 ≤ 0.5 and p2 ≤ 0.5, with the consequence that the minimal value of p is zero. If p = 0 this characterizes the situation in which both units never give alarm at the same time. This is the extreme case of negative association between the two events Z1 and Z2. It should be noted that these probabilities describe the behaviour of the two units without regard to the actual one of the two possible states of the power station, which we call the normal state EN and the failure or breakdown EF: O = {EN, EF}. We assume that the probabilities for a failure are given for Z1 and ¬Z1 as well as for Z2 and ¬Z2. We use the following notation:
P(EF|Zj) = ωj        P(EF|¬Zj) = ω̄j
(Regarding our example of alarm units in a power plant it is obvious that ωj must be much greater than ω̄j, j = 1,2.) Now it is possible to calculate the total probability of a failure in the power station; we call it ω:

ω = P(EF) = p1ω1 + (1-p1)ω̄1 = p2ω2 + (1-p2)ω̄2        (4.1)
Due to this result the probability of EF can be calculated in two different ways, and the parameters we use are only sensible if both calculations lead to the same value of ω. This restriction for the parameters p1, p2, ω1, ω̄1, ω2 and ω̄2 will always be taken for granted, as otherwise our model would contain a contradiction. Obviously ω = 0 and ω = 1 are not of interest and therefore are excluded in the following derivations. In the same way we shall exclude the case ωj = ω̄j, because such an alarm unit would yield no information. It is our aim to find a "combination rule" which combines the information about the respective states of the two alarm units. We shall use the following notation:
P(EF|Z1 ∧ Z2) = x++        P(EF|Z1 ∧ ¬Z2) = x+-
P(EF|¬Z1 ∧ Z2) = x-+       P(EF|¬Z1 ∧ ¬Z2) = x--
There are four equations which must be true for these probabilities. For instance, the probability that the first unit gives alarm and there is a breakdown is p1ω1. In this case the second unit can either give alarm or not. The probability that both units give alarm and there is a breakdown is p·x++, the probability that only the first, but not the second, unit gives alarm and there is a breakdown is (p1-p)·x+-. This relationship is described in the first of the following equations:

p·x++ + (p1-p)·x+-                      = p1ω1
        (p2-p)·x-+ + (1-p1-p2+p)·x--    = (1-p1)ω̄1
p·x++ + (p2-p)·x-+                      = p2ω2            (4.2)
        (p1-p)·x+- + (1-p1-p2+p)·x--    = (1-p2)ω̄2

The other three equations are analogous to the first one. (4.2) is a system of four linear equations for four unknown variables x++, x+-, x-+ and x--. Yet the rank of the matrix of this system is not four, because the sum of the first two equations is equal to the sum of the last two.
Therefore it is not to be expected that a unique solution exists. In the context of our model the only solutions which are of interest are those for which each of the four x-values lies in the interval [0;1]. It can be shown that such solutions do not exist for all possible parameters. Before discussing the general case we shall start with some special cases:

α) p = 0:
In this case, which has been mentioned already, it is not possible for both units to give alarm at the same time. Therefore x++ is not defined. Immediately it follows: x+- = ω1, x-+ = ω2, and if p1+p2 ≠ 1:
x-- = ((1-p1)ω̄1 - p2ω2) / (1-p1-p2)

β) p = p1+p2-1:
In this case x-- is not defined, because it is not possible that ¬Z1 and ¬Z2 occur simultaneously. Therefore: x+- = ω̄2, x-+ = ω̄1, and if p ≠ 0:
x++ = (p2ω2 - (1-p1)ω̄1) / p
If p = 0 (this means: p1 + p2 = 1) only two situations are possible: Z1 ∧ ¬Z2 or ¬Z1 ∧ Z2. Therefore ω1 = ω̄2 and ω2 = ω̄1 must necessarily be true. We obtain for the remaining probabilities:
x+- = ω1 = ω̄2        x-+ = ω2 = ω̄1

γ) p = p1:
In this case x+- does not exist and x++ = ω1, x-- = ω̄2, and if p ≠ p2:
x-+ = (p2ω2 - p1ω1) / (p2-p)

δ) p = p2:
x-+ does not exist, x++ = ω2, x-- = ω̄1, and if p ≠ p1:
x+- = (p1ω1 - p2ω2) / (p1-p)
If p = p1 = p2 then x+- and x-+ do not exist. Necessarily ω1 = ω2 and ω̄1 = ω̄2. Therefore:
x++ = ω1 = ω2        x-- = ω̄1 = ω̄2
If none of these special cases occurs, the conditions under which suitable solutions exist, and the resulting sets of solutions, are described in
Theorem (4.1): If
Max[0; p1+p2-1; p2ω2-(1-p1)ω̄1; p2(1-ω2)-(1-p1)(1-ω̄1)] ≤ p        (4.3)
then solutions exist for the System (4.2) such that x++, x+-, x-+, x-- lie in the interval [0;1]. The set of all solutions is the following:
{ (x++,  x+- = (p1ω1 - p·x++)/(p1-p),  x-+ = (p2ω2 - p·x++)/(p2-p),  x-- = ((1-p2)ω̄2 - p1ω1 + p·x++)/(1-p1-p2+p) ) }

where

Max[0; 1 - p1(1-ω1)/p; 1 - p2(1-ω2)/p; (p2ω2-(1-p1)ω̄1)/p] ≤ x++ ≤
≤ Min[1; p1ω1/p; p2ω2/p; 1 + ((1-p1)(1-ω̄1) - p2(1-ω2))/p]        (4.4a)

which leads to

Max[0; (p1ω1-p2ω2)/(p1-p); 1 - p1(1-ω1)/(p1-p); 1 - (1-p2)(1-ω̄2)/(p1-p)] ≤ x+- ≤
≤ Min[1; p1ω1/(p1-p); (1-p2)ω̄2/(p1-p); 1 + (p2(1-ω2) - p1(1-ω1))/(p1-p)]        (4.4b)

Max[0; (p2ω2-p1ω1)/(p2-p); 1 - p2(1-ω2)/(p2-p); 1 - (1-p1)(1-ω̄1)/(p2-p)] ≤ x-+ ≤
≤ Min[1; p2ω2/(p2-p); (1-p1)ω̄1/(p2-p); 1 + (p1(1-ω1) - p2(1-ω2))/(p2-p)]        (4.4c)

Max[0; ((1-p2)ω̄2-p1ω1)/(1-p1-p2+p); 1 - (1-p2)(1-ω̄2)/(1-p1-p2+p); 1 - (1-p1)(1-ω̄1)/(1-p1-p2+p)] ≤ x-- ≤
≤ Min[1; (1-p1)ω̄1/(1-p1-p2+p); (1-p2)ω̄2/(1-p1-p2+p); 1 + (p1(1-ω1) - (1-p2)(1-ω̄2))/(1-p1-p2+p)]        (4.4d)
Proof of Theorem (4.1): With the abbreviations
rj = pjωj,    r̄j = (1-pj)ω̄j,    j = 1,2
and
y++ = p·x++,    y+- = (p1-p)·x+-,    y-+ = (p2-p)·x-+,    y-- = (1-p1-p2+p)·x--
we obtain the following system of equations:

(1 1 0 0)   (y++)   (r1)
(0 0 1 1) · (y+-) = (r̄1)
(1 0 1 0)   (y-+)   (r2)
(0 1 0 1)   (y--)   (r̄2)

If we take into account that all probabilities are to be between 0 and 1, i.e.:
a) 0 ≤ y++ ≤ p    b) 0 ≤ y+- ≤ p1-p    c) 0 ≤ y-+ ≤ p2-p    d) 0 ≤ y-- ≤ 1-p1-p2+p
the system yields:
I)    Max[0; r1-Max(y+-); r2-Max(y-+)] ≤ y++ ≤ Min[p; r1-Min(y+-); r2-Min(y-+)]
II)   Max[0; r1-Max(y++); r̄2-Max(y--)] ≤ y+- ≤ Min[p1-p; r1-Min(y++); r̄2-Min(y--)]
III)  Max[0; r̄1-Max(y--); r2-Max(y++)] ≤ y-+ ≤ Min[p2-p; r̄1-Min(y--); r2-Min(y++)]
IV)   Max[0; r̄1-Max(y-+); r̄2-Max(y+-)] ≤ y-- ≤ Min[1-p1-p2+p; r̄1-Min(y-+); r̄2-Min(y+-)]
To solve this system by iteration we start with the trivial limits given by a) - d) and achieve:
a')   Max[0; r1-p1+p; r2-p2+p] ≤ y++ ≤ Min[p; r1; r2]
b')   Max[0; r1-p; r̄2-1+p1+p2-p] ≤ y+- ≤ Min[p1-p; r1; r̄2]
c')   Max[0; r̄1-1+p1+p2-p; r2-p] ≤ y-+ ≤ Min[p2-p; r̄1; r2]
d')   Max[0; r̄1-p2+p; r̄2-p1+p] ≤ y-- ≤ Min[1-p1-p2+p; r̄1; r̄2]
By continuing this procedure, we use the Maxima and Minima of a') - d') and find, considering r1 + r̄1 = r2 + r̄2, which is a version of (4.1):
a'')  Max[0; r1-p1+p; r1-r̄2; r2-p2+p] ≤ y++ ≤ Min[p; r1; r2; r1-r̄2+1-p1-p2+p]
b'')  Max[0; r1-p; r1-r2; r̄2-1+p1+p2-p] ≤ y+- ≤ Min[p1-p; r1; r̄2; r1-r2+p2-p]
c'')  Max[0; r̄1-1+p1+p2-p; r2-p; r2-r1] ≤ y-+ ≤ Min[p2-p; r̄1; r2; r2-r1+p1-p]
d'')  Max[0; r̄1-p2+p; r̄1-r2; r̄2-p1+p] ≤ y-- ≤ Min[1-p1-p2+p; r̄1; r̄2; r̄1-r2+p]
These limits reproduce themselves if used in I)-IV). Therefore they define the solutions of the system if they are not contradictory, i.e., if the upper limits are not smaller than the corresponding lower limits. From a'') it can be concluded that (4.3) is sufficient for a non-empty set of y++. The inequalities b'')-d'') do not impose any further restrictions upon p. []
Theorem (4.1) says that probabilities x++, x+-, x-+, x-- do not exist for all values of p. For instance, it is possible that p is too small. Let the probabilities of a breakdown in case of each positive signal Zj be comparatively large, while the probabilities of a breakdown in case of ¬Zj are extremely small. If p is near to zero the two units almost never give alarm at the same time. Yet if one unit gives alarm, the probability of a failure is comparatively large; if the other does not give alarm, the probability of a failure is extremely small. This is a contradiction: a solution cannot exist. The second feature of Theorem (4.1) is that in most cases in which solutions exist, they are not unique. This means that the information used in this model is not sufficient to determine a single value for the probability of a failure in the case of two alarms sounding. This contrasts very sharply with those ideas which assume that a real number as a "combined probability" must exist even in cases where there is much less information than in the model used here. We shall see later what type of information has to be added in order to ensure that a real number exists which represents the probability of failure in case of two alarms being given.
One should interpret p with respect to p1 and p2 as a characteristic of the mutual dependence of the two units. If for instance p = p1·p2, this means that the two signals are independent. But this type of independence is global independence: O-independence (see Chapter 3.5). It should not be confused with double independence for normal state and for breakdown. Only in cases which are of no importance to reality is it possible that global independence and double independence meet. We called this total independence. If one of these probabilities x++, x+-, x-+, x-- has to be estimated, the corresponding interval (4.4) gives the required information. Yet if for any purpose more than one of these probabilities has to be estimated for the same problem, then the functional dependence between x++ and x+-, for example, has to be taken into account. For each possible single value of x++ the single value for every other probability is uniquely determined. We shall discuss these results using the data of Example (3.11), which has already been used in discussing the Dempster-Shafer combination rule.
Example (4.1): We use the same parameters as in Example (3.11):
ω1 = ω2 = 0.01        ω̄1 = ω̄2 = 0.0001
For the probabilistic model used in this paragraph the information must be completed by the values of pj, e.g.:
p1 = p2 = 0.02
For simplicity of calculation it is again assumed that the parameters for both units are the same. Due to this assumption Equation (4.1) is automatically true and we obtain:
ω = p1ω1 + (1-p1)ω̄1 = p2ω2 + (1-p2)ω̄2 = 0.000298.
For the parameter p, Equation (4.3) leads to 0.000102 ≤ p ≤ 0.02.
The maximal value of p, due to the symmetry between the two units, is 0.02. This value means that both units either give alarm at the same time or do not give alarm at all. In this case one could treat both units as if they were only one unit, because they are completely dependent upon each other. If p is smaller than 0.02, the possibility arises that one unit gives alarm and the other does not. In the case of p = 0.0004 the two units are globally independent, and if p is still smaller, there is a negative association between them: if one gives alarm, the probability that the other does likewise is reduced. In the following table the admissible values of x++ are given for some selected values of p:

p = 0.02      :  x++ = 0.01
p = 0.01      :  0.0102 ≤ x++ ≤ 0.02
p = 0.001     :  0.1020 ≤ x++ ≤ 0.20
p = 0.0004    :  0.2550 ≤ x++ ≤ 0.50
p = 0.0002    :  0.5100 ≤ x++ ≤ 1.00
p = 0.000102  :  x++ = 1
It is seen that the intervals for x++ are comparatively large if p takes an intermediate value and that at the extreme points of p there is a unique solution for x++. Also it is seen that in the case of p = 0.02, if the two units operate as practically one, the probability model cannot be cheated: it reveals the same probability as in the case of only one unit. The smaller the value of p, the more valuable the set of the two units: if both give alarm the probability of a failure is now comparatively large; in the case of global independence (p = 0.0004) this probability is between 25.5% and 50%. It becomes still larger when there is negative association between the two units, and is 1.0 in the case of the smallest possible value of p (p = 0.000102). It should be noted that the probabilities as derived from the Dempster-Shafer combination rule are never found in the intervals of possible probabilities as determined by the model. The differences between the results of the two procedures applied to this problem are extreme.
[]
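The interval computations of Example (4.1), and of Example (4.2) below, can be reproduced by a short program. The following sketch (Python; not part of the original treatment; function and variable names are ours) implements the condition (4.3) and the bounds (4.4a):

```python
# Sketch of the admissibility condition (4.3) and the interval (4.4a) for x++.
# w1, w2 stand for omega_1, omega_2 and wb1, wb2 for their bar-versions.

def p_range(p1, p2, w1, w2, wb1, wb2):
    """Admissible values of p = P(Z1 and Z2) according to (4.3)."""
    lower = max(0.0, p1 + p2 - 1,
                p2 * w2 - (1 - p1) * wb1,
                p2 * (1 - w2) - (1 - p1) * (1 - wb1))
    return lower, min(p1, p2)

def x_pp_interval(p, p1, p2, w1, w2, wb1, wb2):
    """Bounds (4.4a) for x++ = P(EF | Z1 and Z2)."""
    lower = max(0.0,
                1 - p1 * (1 - w1) / p,
                1 - p2 * (1 - w2) / p,
                (p2 * w2 - (1 - p1) * wb1) / p)
    upper = min(1.0,
                p1 * w1 / p,
                p2 * w2 / p,
                1 + ((1 - p1) * (1 - wb1) - p2 * (1 - w2)) / p)
    return lower, upper

pars = dict(p1=0.02, p2=0.02, w1=0.01, w2=0.01, wb1=0.0001, wb2=0.0001)
print(p_range(**pars))                       # (0.000102, 0.02)
for p in (0.02, 0.01, 0.001, 0.0004, 0.0002, 0.000102):
    print(p, x_pp_interval(p, **pars))       # reproduces the table above
```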
We now alter the value of the probability for an alarm in the second unit, while all parameters for the first unit remain unchanged.
Example (4.2):
p1 = 0.02        p2 = 0.01
ω1 = 0.01        ω2 = 0.01
ω̄1 = 0.0001      ω̄2 = 0.0002
ω = 2.98·10⁻⁴
We find by means of (4.3): 2·10⁻⁶ ≤ p ≤ 0.01, and by means of (4.4a):

p = 0.000002  :  x++ = 1.0
p = 0.000010  :  0.200000 ≤ x++ ≤ 1.0
p = 0.000100  :  0.020000 ≤ x++ ≤ 1.0
p = 0.000200  :  0.010000 ≤ x++ ≤ 0.5000
p = 0.001000  :  0.002000 ≤ x++ ≤ 0.1000
p = 0.009902  :  0.000202 ≤ x++ ≤ 0.0101
p = 0.010000  :  x++ = 0.0100
There are two important differences between the results we obtain using these parameters and those obtained in the foregoing example: for some small values of p (around p = 0.0001) the intervals for x++ become very wide, so that in such a case the parameters contain very little information about the probability x++. Secondly, the lower limits of the interval for x++ are not a monotonic function of p. Instead they reach a minimum at a value of p which is determined by
p = p2 - (1-p1)ω̄1 = 0.009902
and show a very sharp increase from there to the terminal point p = 0.01. If no exact knowledge about p exists, the only reliable treatment is to use the union of the intervals for x++ with respect to all possible values of p. We shall later proceed to carry out a thorough investigation of situations in which parameters are defined only by intervals. Furthermore it should be noted that our description of the methods in this paragraph, using the example of a power station, does not confine the application of this method. Evidently, in a medical diagnostic system for instance, symptoms may be used instead of our alarm units and types of illness instead of our "state of the power station". Finally, it must be said that in principle all the methods used here could be transferred to more complicated models where there are more than two states of nature or more than two different signals which can be given by a unit, or even more than two units. The formal treatment of such models is technically more complicated and is not shown here, because we shall now move on to another model, which uses assumptions about independence and therefore is not only easier to handle but also provides stronger results.
4.2 Solutions with Double Independence and Related Models
The results of the analysis in Chapter (4.1) are comparatively weak. This is due to the fact that the models used there do not contain any statement about dependence or independence of the signals with respect to the two states of nature. These models use the global probabilities p1, p2 and p - which refer to O - while any measure of dependence in EF and EN must use conditional probabilities. When describing the formalism, we employ the notation already used in our example of the power station and find:

P(Zj ∧ EF) = pjωj                P(Zj ∧ EN) = pj(1-ωj)
P(Zj|EF) = pjωj/ω                P(Zj|EN) = pj(1-ωj)/(1-ω)
P(Z1 ∧ Z2 ∧ EF) = p·x++          P(Z1 ∧ Z2 ∧ EN) = p·(1-x++)
P(Z1 ∧ Z2|EF) = p·x++/ω          P(Z1 ∧ Z2|EN) = p·(1-x++)/(1-ω)
Now we define two measures of dependence between Z1 and Z2. The one is the measure of dependence of Z1 and Z2 with respect to EF and is called κF. The other is the measure of dependence of Z1 and Z2 under EN and is called κN. To avoid indefiniteness we have to suppose
0 < ω < 1  and  0 < ωj, ω̄j < 1,    j = 1,2.
These restrictions are well founded if the units are to work sensibly and therefore we shall also use them in the following chapters.
Definition (4.1):
κF = P(Z1 ∧ Z2|EF) / (P(Z1|EF)·P(Z2|EF))        κN = P(Z1 ∧ Z2|EN) / (P(Z1|EN)·P(Z2|EN))
It is easy to show that
Max[0; 1/P(Z1|EF) + 1/P(Z2|EF) - 1/(P(Z1|EF)·P(Z2|EF))] ≤ κF ≤ 1/Max[P(Z1|EF), P(Z2|EF)]
and the analogous relation holds for κN. Obviously κF = 1 has to be interpreted as EF-independence of Z1 and Z2. κF > 1 describes positive association between Z1 and Z2 with respect to EF: the probability that both events occur simultaneously is greater than it would be in the case of independence. Using Definition (4.1) we are able to derive the following theorem.
Theorem (4.2): If κF and κN describe the dependence of Z1 and Z2 with respect to EF and EN, then:

x++ = (κF·ω1ω2/ω) / (κF·ω1ω2/ω + κN·(1-ω1)(1-ω2)/(1-ω))        (4.5)
[]

Proof of Theorem (4.2): With the notation used above we obtain:
p·x++ = κF·(p1ω1/ω)·(p2ω2/ω)·ω    and    p·(1-x++) = κN·(p1(1-ω1)/(1-ω))·(p2(1-ω2)/(1-ω))·(1-ω)
Therefore:
p·x++ = κF·p1p2·ω1ω2/ω        (*)
and
p·(1-x++) = κN·p1p2·(1-ω1)(1-ω2)/(1-ω)
By summation:
p = p1p2·[κF·ω1ω2/ω + κN·(1-ω1)(1-ω2)/(1-ω)]        (**)
Dividing (*) by (**) we get Formula (4.5).        []
Due to the assumed knowledge of the kind of dependence between Z1 and Z2 under EF and EN we derive x++ as a number, not as an interval. Therefore we may conclude that the interval estimation of x++ in the preceding chapter is only due to the fact that no assumption was made about the conditional dependence or independence of Z1 and Z2 with relation to the two states of nature. It should be noted that (4.5) is a simple result of probability theory and does not contain any concepts which are not part of classical probability theory. Of course it does not use any ad hoc proposals for combining evidence. In fact results like (4.5) can be found in textbooks of probability theory. Any combination rule which is not in agreement with Formula (4.5) therefore contradicts elementary probability theory and cannot be justified by any argument about experimental verification. The most frequently used assumption about the conditional dependencies is the assumption of double independence of Z1 and Z2 in EF as well as in EN. This means κF = 1 and κN = 1. In this case Formula (4.5) is simplified to

x++ = (ω1ω2/ω) / (ω1ω2/ω + (1-ω1)(1-ω2)/(1-ω))        (4.6a)
This result is even more popular than (4.5), because the assumption of double independence is widely used. (4.6a) occurs when probabilities of a hypothesis are updated, provided that a total probability of the hypothesis is considered available - a situation which is often described as "Bayesian updating". The methods used in the expert system PROSPECTOR are of this kind and therefore may be regarded as almost identical [DUDA, HART, NILSSON, 1986]. The only relevant
difference between the two treatments is that PROSPECTOR uses likelihoods of the type P(Zj|EF) in its formula, while in (4.6a) probabilities of the type P(EF|Zj) = ωj are used. This changes the appearance of the results to some extent. The use of odds instead of probabilities by PROSPECTOR is purely superficial. It is possible to obtain, analogous to Theorem (4.2), the other probabilities of EF, for instance x+-, the probability of a failure in case of Z1 and ¬Z2. In this case we have to use corresponding values of κF and κN. For double independence all those values of κ take the value 1. Therefore formulas for x+-, x-+ and x-- can be derived in the same way as Formula (4.6a) for x++:

x+- = (ω1ω̄2/ω) / (ω1ω̄2/ω + (1-ω1)(1-ω̄2)/(1-ω))        (4.6b)

x-+ = (ω̄1ω2/ω) / (ω̄1ω2/ω + (1-ω̄1)(1-ω2)/(1-ω))        (4.6c)

x-- = (ω̄1ω̄2/ω) / (ω̄1ω̄2/ω + (1-ω̄1)(1-ω̄2)/(1-ω))        (4.6d)
Example (4.3): With the parameters used in Example (4.1) we obtain, under the assumption of double independence, the following four probabilities of a failure of the power plant:
x++ = (0.0001/0.000298) / (0.0001/0.000298 + 0.9801/0.999702) = 0.255
x+- = x-+ = (0.000001/0.000298) / (0.000001/0.000298 + 0.989901/0.999702) = 0.003377
x-- = (1·10⁻⁸/0.000298) / (1·10⁻⁸/0.000298 + 0.999800/0.999702) = 0.000034
These are results in accordance with the purpose of an alarm system: if there are two units and both go off, the probability of a failure is much greater than in the case that there is only one unit and this goes off. One should remember the misleading result gained by the Dempster-Shafer combination rule in Example (3.11)! The corresponding p-value derived from equation (**) in Theorem (4.2) is p = 0.000526 and is remarkably different from the value p = 0.0004, which is a consequence of global independence.
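The four numbers just computed can be obtained from a very small program. The following sketch (Python; not part of the original text; names are ours) implements the combination rule (4.6) under double independence:

```python
# Numerical sketch of Formulas (4.6a)-(4.6d); it reproduces Example (4.3).

def combine(a, b, w):
    """P(EF | two signs) from P(EF|sign1)=a, P(EF|sign2)=b, prior P(EF)=w."""
    num = a * b / w
    return num / (num + (1 - a) * (1 - b) / (1 - w))

w1, w2   = 0.01, 0.01        # P(EF|Z1), P(EF|Z2)
wb1, wb2 = 0.0001, 0.0001    # P(EF|not Z1), P(EF|not Z2)
w = 0.02 * w1 + 0.98 * wb1   # total probability (4.1), = 0.000298

print(combine(w1,  w2,  w))   # x++ = 0.255
print(combine(w1,  wb2, w))   # x+- = 0.003377
print(combine(wb1, w2,  w))   # x-+ = 0.003377
print(combine(wb1, wb2, w))   # x-- = 0.000034
```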
It may be asked how these results correspond to those of the preceding chapter, where for each p-value intervals for x++, x+-, x-+ and x-- were obtained. If we use the p-value of 0.000526 to calculate those intervals we find, due to Equations (4.4a-d):
0.1939 ≤ x++ ≤ 0.3802
0 ≤ x+-, x-+ ≤ 0.00503
0 ≤ x-- ≤ 0.000102.
Therefore all probabilities derived under the assumption of double independence lie in the corresponding intervals. []
The question whether this statement is generally true can be settled by the following theorem.
Theorem (4.3): The probabilities deduced from (4.6) (under the assumption of double independence) lie in the corresponding intervals (4.4). []
= plp
+ p,(1-p2)
=
(***)
-
+
= p,+,,,+,,.
due to: p + pl(1-p2)
~1 2 +.
1-~0
= Pl P2 ---7-+ P2
l-w
{
=
-
(1-aq) (1-~2) }
i-~
= Pl { 7x, [p2w2 + (1-p2)~2] + l-x, [p2(i-~;2)+ (l-p2) (i-~2)]} = p,
~- x + i 7 7 (1-0o [ -
= Pl (l-w) (1-52)]
As the inequalities (4.4) are valid for all solutions of the System (4.2), they must be valid for the solutions x++, x+_, x_+, x_, provided the p-value for double independence is used. m It becomes obvious that p according to (***) is necessary for double independence, as shown in the proof of Theorem (4.2), but is not sufficient. It should be noted that double independence of Zl and Z2 in the case of E F and EN comes nearest to what might be understood by "independent sources of information" as used by Dempster. But of
course the two concepts are not comparable directly, in that double independence is defined exactly, whereas the meaning of "independent sources of information" is not totally clear. With respect to what we call the similarity of the two concepts it is justifiable to compare the results of Dempster-Shafer's combination rule and (4.6). This is first done by means of Examples (3.11) and (4.3). The application of the Dempster-Shafer combination rule, in this case, gives results which are very different from those stemming from probability theory. If we compare the two formulas which are applied in both rules, the origin of the differences becomes obvious: while in (4.6) the total probability of EF is used as a denominator, it is not used at all in Dempster-Shafer's formula. Therefore the difference in the information used is described exactly by this "prior probability". It is easy to convert the Formulas (4.6) into Dempster-Shafer's formula if all total probabilities are equal to one another, so that the denominators may be reduced. Therefore the application of the Dempster-Shafer combination rule leads to the same results as the application of (4.6) if the total probabilities are:
P(EF) = P(EN) = 1/2
Such an assumption is of course very far from any estimation of the probability that a power station has a breakdown. This is the reason why Example (3.11) led to such obviously dangerous results, which are not seen in examples used by Shafer himself. As long as the regarded events Ei are more or less equally probable a priori, the application of Dempster-Shafer's rule comes near to the results of (4.6) and therefore leads to sensible results. It should be noted that the Dempster-Shafer rule applied to x++ does not use values of ω̄j, which influence the prior probability ω by (4.1). Therefore it is possible to imagine values of ω̄j which lead - in combination with any pj - to ω = 1/2. In this case the result of the Dempster-Shafer rule would be correct. For the example of the power station these ω̄j would have to be near to 1, in contrast to the true values of ω̄j, which are much smaller than ωj. Now the results of the application of the Dempster-Shafer rule in Example (3.11) become understandable. They are the same as those of the application of (4.6) provided ω̄j is near to 1, indicating that the signals Z1 and Z2 designate a state of low danger while ¬Zj designates high danger. In such a case it would be quite reasonable that Z1∧Z2 would signal an extremely low danger. Therefore the probability of EF in case of Z1∧Z2 would be in accordance with the results of the Dempster-Shafer rule. Of course this explanation is good only for x++. If one applies the Dempster-Shafer rule to x--, it amounts to the application of probability theory with the assumption that ωj is very near to 1. In the cases of x+- and x-+ the assumption would be that those ωj or ω̄j which are not used in the formula itself are near to 1. It is easily seen that, using the knowledge of ω1 and ω̄1 for the power station, a value of ω = 1/2 can be excluded, even if the values of p1 and p2 are not known. Therefore in such cases the application of Dempster-Shafer's rule, amounting to such an assumption, is not tolerable at all. The same is true in most medical diagnostic systems, when it is an important aim to detect rare diseases: if the Dempster-Shafer rule is applied, such a detection is impossible.
Concerning the strategy in cases of low information availability, generally it can be said:
a) It is hard to understand that the probabilities of a state Ei in the case of Z1 and Z2 could be known, while the total probability of Ei is not known at all. Therefore in most cases it must be possible to give at least a good estimation of ω, even if p1 and p2 are not known (which is not necessary for the application of (4.6)).
b) If it is absolutely impossible to estimate ω, at least we know that, due to (4.1), ω is a weighted average of ω1 and ω̄1; therefore ω lies between these two limits. And the same is true for ω2 and ω̄2: these are also limits for ω. Using the narrowest of these limits, interval estimates for x++ etc. can be derived from (4.6), even if "no information whatsoever about the prior exists".
Such interval estimates use the following facts:
1) Max_j Min(ωj, ω̄j) ≤ ω ≤ Min_j Max(ωj, ω̄j)        (4.7)
2) x++ as well as x+-, x-+ and x-- are monotonously decreasing functions of ω.
In the following example we demonstrate this kind of interval estimation:
Example (4.4): Using the same values for ωj and ω̄j as in Example (4.3), but assuming that ω remains unknown (and p is not known either), we have:
ω̄j = 0.0001 ≤ ω ≤ 0.01 = ωj
Due to the monotony of the probabilities x++ etc. as functions of ω, the upper limits are found by using ω = 0.0001 and the lower limits by using ω = 0.01:
0.01 ≤ x++ ≤ 0.505
0.0001 ≤ x+-, x-+ ≤ 0.01
0.99·10⁻⁶ ≤ x-- ≤ 0.0001
Although these limits are far-ranging, it is evident that the results of the application of Dempster-Shafer's combination rule - as in Example (3.11) - lie far outside these limits in all cases. This demonstrates: even if no information at all exists about the total probability, there is no excuse for the use of Dempster-Shafer's rule, at least in this example, because the information about ωj and ω̄j is sufficient to show that the results of this combination rule are far from reasonable.
[]
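A small sketch of this interval estimation (Python; not part of the original text; names are ours) evaluates (4.6) at the two limits for ω given by (4.7), using the monotonicity stated above:

```python
# Interval estimates of Example (4.4): omega is only known to lie between
# the limits of (4.7), and the probabilities of (4.6) decrease in omega.

def combine(a, b, w):
    num = a * b / w
    return num / (num + (1 - a) * (1 - b) / (1 - w))

w_hi, w_lo = 0.01, 0.0001          # limits for omega from (4.7)
for a, b, label in [(0.01, 0.01, "x++"), (0.01, 0.0001, "x+-"),
                    (0.0001, 0.0001, "x--")]:
    print(label, combine(a, b, w_hi), "to", combine(a, b, w_lo))
# x++ : 0.01       to 0.505
# x+- : 0.0001     to 0.01
# x-- : 0.99e-6    to 0.0001
```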
The derivation of x++ in (4.6) resulting from Equation (4.5) was carried out under the assumption that κF = κN = 1. Of course formulas which are analogous to (4.5) may be derived for x+-, x-+ and x--. Formulas (4.6) are special cases of these generalized formulas if the corresponding values of κF and κN are equal to one. In order to describe the necessary conditions for the validity of (4.6) we designate the coefficients which must be used in formulas analogous to (4.5):
κF++ = κF
κF+- = P(Z1 ∧ ¬Z2|EF) / (P(Z1|EF)·P(¬Z2|EF))
κF-+ = P(¬Z1 ∧ Z2|EF) / (P(¬Z1|EF)·P(Z2|EF))        (4.8)
κF-- = P(¬Z1 ∧ ¬Z2|EF) / (P(¬Z1|EF)·P(¬Z2|EF))
κN++ = κN;  κN+-, κN-+, κN-- analogous.
Now the question arises whether the condition that all κF and all κN are equal to 1 is necessary for the validity of Equations (4.6). It is obvious that (4.6a) is also derivable from (4.5) if
κF++ = κN++        (4.9)
Therefore the probability x++ may be described by (4.6a) not only in the case of double independence but also in those cases in which both symptoms are dependent upon each other, but the kind of dependence under EF and EN is equal in the sense of (4.9). One should be reluctant to use this kind of "dependence of equal strength". In the same way as κ++ may be used as a measure of dependence, the values of κ+-, κ-+ and κ-- may also be used. Therefore it would only be justifiable to talk about dependence of equal strength if all four measures were equal in the case of EF and EN:
κF++ = κN++        κF+- = κN+-        κF-+ = κN-+        κF-- = κN--        (4.10)
All these equations are true if each of the eight values κ is equal to one. This means double independence. The possibility that (4.10) is true in other cases is investigated in Theorem (4.4).
Theorem (4.4): (4.10) holds iff either all values κ are equal to 1 or
P(Zj|EF) = P(Zj|EN),    j = 1,2.        (4.11)
[]
According to this theorem the equality of all corresponding values of κF and κN is only possible in the case of double independence or in the trivial case that the signals have the same probability under EF and EN, and therefore the units are not suitable for distinguishing between the two states (or for recognizing a disease).
Proof of Theorem (4.4): Using the following table for EF:

        Z2                                  ¬Z2
Z1      κF++·P(Z1|EF)·P(Z2|EF)              κF+-·P(Z1|EF)·(1-P(Z2|EF))              P(Z1|EF)
¬Z1     κF-+·(1-P(Z1|EF))·P(Z2|EF)          κF--·(1-P(Z1|EF))·(1-P(Z2|EF))          1-P(Z1|EF)
        P(Z2|EF)                            1-P(Z2|EF)                              (4.12)

and an analogous table for EN, we obtain:
κF+- = (1 - κF++·P(Z2|EF)) / (1 - P(Z2|EF))        κN+- = (1 - κN++·P(Z2|EN)) / (1 - P(Z2|EN))
κF-+ = (1 - κF++·P(Z1|EF)) / (1 - P(Z1|EF))        κN-+ = (1 - κN++·P(Z1|EN)) / (1 - P(Z1|EN))
κF-- = (1 - P(Z1|EF) - P(Z2|EF) + κF++·P(Z1|EF)·P(Z2|EF)) / ((1-P(Z1|EF))·(1-P(Z2|EF)))
and an analogous result for κN--. By means of these results it can be shown that κF++ = κN++ and κF+- = κN+- together are only possible if κF++ = 1 or (4.11) holds. The analogous statements can be derived with respect to κF-+, κN-+, κF-- and κN--.
[]
We demonstrate these considerations by use of the data to be found in Example (4.1):
Example (4.5): Recalling that for the power station ωj = 0.01, ω̄j = 0.0001, pj = 0.02, j = 1,2, we find for the conditional probabilities:
P(Zj|EF) = P(EF|Zj)·P(Zj)/P(EF) = ωj·pj/ω = 0.6711.
In an analogous way: P(Zj|EN) = 0.0198. These results characterize the two units: if there is a breakdown, each unit gives alarm with a probability of about 2/3; if the power plant is in its normal state, the probability of a blind alarm is about 2% for each unit. The probability of both units giving alarm can be determined if κF++ and κN++ are chosen. For example we take κF++ = κN++ = 1.3 and obtain
P(Z1 ∧ Z2|EF) = 0.5856        P(Z1 ∧ Z2|EN) = 0.0005.
Table (4.12) can now be completed by subtraction:

EF:     0.5856    0.0856    0.6711
        0.0856    0.2433    0.3289
        0.6711    0.3289    1

EN:     0.0005    0.0193    0.0198
        0.0193    0.9609    0.9802
        0.0198    0.9802    1

From this we obtain:
κF+- = κF-+ = 0.3878        κF-- = 2.2495
κN+- = κN-+ = 0.9939        κN-- = 1.0001
If we now apply (4.5) to x++, we arrive at our previous result:
x++ = 0.255        (because of κF++ = κN++)
while
x+- = x-+ = 0.0013        (in contrast to 0.0034 in the case of double independence)
and
x-- = 0.000075        (in contrast to 0.000034).
[]
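The following sketch (Python; not part of the original text; variable names are ours, and the coefficient relations are those of Table (4.12) as given above) reproduces the numbers of Example (4.5):

```python
# Sketch of the generalized combination (4.5) with dependence measures
# kappa_F and kappa_N, and of the derived coefficients of Table (4.12).

w1 = w2 = 0.01; wb1 = wb2 = 0.0001; p1 = p2 = 0.02
w  = p1 * w1 + (1 - p1) * wb1                  # total probability, 0.000298

def x(a, b, kF, kN):
    """P(EF | two signs) according to (4.5) and its analogues."""
    num = kF * a * b / w
    return num / (num + kN * (1 - a) * (1 - b) / (1 - w))

zF = p1 * w1 / w                               # P(Zj|EF) = 0.6711
zN = p1 * (1 - w1) / (1 - w)                   # P(Zj|EN) = 0.0198
kFpp = kNpp = 1.3                              # chosen as in Example (4.5)
# coefficients for the other sign combinations, from Table (4.12):
kFpm = (1 - kFpp * zF) / (1 - zF)                        # 0.3878
kFmm = (1 - 2 * zF + kFpp * zF * zF) / (1 - zF) ** 2     # 2.2495
kNpm = (1 - kNpp * zN) / (1 - zN)                        # 0.9939
kNmm = (1 - 2 * zN + kNpp * zN * zN) / (1 - zN) ** 2     # 1.0001

print(x(w1,  w2,  kFpp, kNpp))   # x++ = 0.255
print(x(w1,  wb2, kFpm, kNpm))   # x+- = 0.0013
print(x(wb1, wb2, kFmm, kNmm))   # x-- = 0.000075
```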
Up to now in this chapter we have used a description of the stochastic behaviour of the two units Z1 and Z2 with respect to the two states of nature EF and EN. In previous chapters we used the concept of global independence, which is related to O, the union of EF and EN. Global independence is assured by
p = P(Z1 ∧ Z2) = P(Z1)·P(Z2) = p1·p2.
In Chapter (4.1) it was shown that for this p there results an interval for x++. If a further assumption is then made, either the independence in the case of EF or of EN, we describe situations which are called (O,EF)-independence or (O,EN)-independence. Because of the symmetry between them it is sufficient to discuss (O,EF)-independence. In this case x++ is derived in the following way:

P(EF|Z1 ∧ Z2) = P(Z1 ∧ Z2|EF)·P(EF) / P(Z1 ∧ Z2) = P(Z1|EF)·P(Z2|EF)·P(EF) / (P(Z1)·P(Z2)) = P(EF|Z1)·P(EF|Z2) / P(EF)        (4.13)

Of course the probability of EN in the case of (O,EF)-independence cannot be derived in an analogous way but must be calculated by

P(EN|Z1 ∧ Z2) = 1 - P(EF|Z1 ∧ Z2) = 1 - P(EF|Z1)·P(EF|Z2)/P(EF) =
= (P(EN|Z1) + P(EN|Z2) - P(EN) - P(EN|Z1)·P(EN|Z2)) / (1 - P(EN))        (4.14)

The formulas for x+-, x-+ and x-- correspond to (4.13) and (4.14). In the case of (O,EN)-independence the Formulas (4.13) and (4.14) (or their corresponding ones) must be interchanged. The effect of the assumption of "(O,EF)-independence" instead of "double independence" is shown in the following example.
Example (4.6): Using the previous parameters and (O,EF)-independence it follows that:
P(EF|Z1 ∧ Z2) = 0.3356        P(EN|Z1 ∧ Z2) = 0.6644
If we use the respective formulas for (O,EN)-independence:
P(EF|Z1 ∧ Z2) = 0.0196        P(EN|Z1 ∧ Z2) = 0.9804
In the case of double independence we had:
P(EF|Z1 ∧ Z2) = 0.255        P(EN|Z1 ∧ Z2) = 0.745.
If these results are compared with previous results, we find that the value for x++ lies in the interval which is derived from Theorem (4.1) if the value attributed to p describes global independence, i.e. p = 0.0004. This is a consequence of the fact that (O,EF)-independence is a special case of global independence. The analogous value of x++ in the case of (O,EN)-independence is not in concordance with the corresponding interval from Theorem (4.1). If the reason for this deviation is sought, it can be shown that (O,EN)-independence is not possible in this case. This is an example of the limitations on the construction of (O,EF)-independence and (O,EN)-independence, which we have already mentioned in Chapter 3.
[]
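The three assumptions can be compared numerically with a few lines of code. The following sketch (Python; not part of the original text; names are ours) evaluates (4.13), its (O,EN)-counterpart and (4.6a) for the parameters of Example (4.6):

```python
# Comparison of the three independence assumptions for Example (4.6).

w1 = w2 = 0.01
w = 0.000298                                     # P(EF) as before

pf_oef = w1 * w2 / w                             # (4.13): 0.3356
pf_oen = 1 - (1 - w1) * (1 - w2) / (1 - w)       # (O,EN) analogue: 0.0196
pf_dbl = (w1 * w2 / w) / (w1 * w2 / w + (1 - w1) * (1 - w2) / (1 - w))  # 0.255

print(pf_oef, pf_oen, pf_dbl)
```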
If we investigate the concept of these types of independence using the model of this chapter we find:

Theorem (4.5): With the notation used in this chapter, the assumption of (O,EN)-independence is possible iff
a)  (1-ω1)(1-ω2)/(1-ω) ≤ 1
b)  p2 ≤ ω1 / (1 - (1-ω1)(1-ω2)/(1-ω))
c)  p1 ≤ ω2 / (1 - (1-ω1)(1-ω2)/(1-ω))
d)  p1·p2·(1 - (1-ω1)(1-ω2)/(1-ω)) ≥ p1ω1 + p2ω2 - ω
The assumption of (O,EF)-independence is possible iff
a)  ω1ω2/ω ≤ 1
b)  p2 ≤ (1-ω1) / (1 - ω1ω2/ω)
c)  p1 ≤ (1-ω2) / (1 - ω1ω2/ω)
d)  p1·p2·(1 - ω1ω2/ω) ≥ p1(1-ω1) + p2(1-ω2) - (1-ω)
[]

The proof of this theorem relies upon the fact that for (O,EN)-independence not only O and EN but also EF must be a set for which the conditional probabilities of Z1∧Z2, Z1∧¬Z2, ¬Z1∧Z2, ¬Z1∧¬Z2 are non-negative numbers. If this is controlled, the proof is then straightforward. It should be noted that, in contrast to this result, it is possible to conceive double independence for all parameters which define sensible subsets EF and EN. This remarkable difference between the two concepts shows the superiority of the concept of double independence. In fact it is not easy to describe a situation in which the concept of (O,EF)-independence is obvious. On the other hand, double independence is a concept which is identical to those concepts most often used in practical statistical analysis. Finally we can conclude from the considerations described in this chapter: in all cases in which the assumption of double independence is possible, Formulas (4.6) will provide a reliable and justifiable combination rule. No other "combination rule" should be applied in such a case. If there are serious doubts with respect to double independence, it might be easiest to collect reasonable information about p, p1 and p2 and to apply (4.4), even if it results in interval estimates for the probabilities x++ etc. While reliable information about the values of κ would provide stronger estimates (point estimates) for x++ and the other probabilities of this type, it will often be impossible to obtain such information.
CHAPTER 5
Generalizations

5.1 The Formalism
While in the preceding chapter we used the simplest type of a model to describe the states of nature, the observations and the conclusions drawn from them, in this chapter we shall discuss the general case of combining information, provided that it consists of point estimates of probabilities. At first we shall generalize the model with respect to the number of states of nature:
O = ∪_{i=1,...,k} Ei
and the number of distinguishable signals or signs in both alarm units:
A1 = (Z11, ..., Z1n1);    A2 = (Z21, ..., Z2n2).
At a later stage we shall also discuss situations in which there are more than two alarm units. We shall use a model which is a direct generalization of the model of double independence used in the preceding chapter. For this model we require the following probabilities:
P(Z11), ..., P(Z1n1)    with    Σ_{r1=1,...,n1} P(Z1r1) = 1
P(Z21), ..., P(Z2n2)    with    Σ_{r2=1,...,n2} P(Z2r2) = 1

Definition (5.1): As a generalization of double independence as used in the preceding chapter we describe k-independence of the two units by:
P(Z1r1 ∧ Z2r2|Ei) = P(Z1r1|Ei)·P(Z2r2|Ei),        i = 1,...,k;  r1 = 1,...,n1;  r2 = 1,...,n2        (5.1)
[]
In this chapter we shall always suppose that k-independence is given and additionally that the probabilities P(Ei|Z1r1), P(Ei|Z2r2), r1 = 1,...,n1; r2 = 1,...,n2, are known or that at least reliable point estimates of these probabilities are available. Just as for the case k = 2 Equation (4.1) must be fulfilled, there are also requirements for these probabilities:

P(Ei) = Σ_{r1=1,...,n1} P(Ei|Z1r1)·P(Z1r1) = Σ_{r2=1,...,n2} P(Ei|Z2r2)·P(Z2r2),    i = 1,...,k        (5.2)

This is a system of (k-1) equations which limits our freedom in estimating the k·(n1+n2)-2 probabilities appearing in (5.2). Of course any analysis depends on the validity of Equation (5.2). Due to the assumption of k-independence we can immediately derive a generalization of (4.6):

P(Ei*|Z1r1 ∧ Z2r2) = P(Z1r1 ∧ Z2r2|Ei*)·P(Ei*) / P(Z1r1 ∧ Z2r2) = P(Z1r1 ∧ Z2r2|Ei*)·P(Ei*) / Σ_{i=1,...,k} P(Z1r1 ∧ Z2r2|Ei)·P(Ei) =

= P(Z1r1|Ei*)·P(Z2r2|Ei*)·P(Ei*) / Σ_{i=1,...,k} P(Z1r1|Ei)·P(Z2r2|Ei)·P(Ei) = [P(Ei*|Z1r1)·P(Ei*|Z2r2)/P(Ei*)] / Σ_{i=1,...,k} [P(Ei|Z1r1)·P(Ei|Z2r2)/P(Ei)]

i* = 1,...,k;  r1 = 1,...,n1;  r2 = 1,...,n2        (5.3)
As for (4.6), (5.3) is merely a consequence of probability theory and not a new result; this system of equations provides the combination rule which is supplied by classical probability theory for all cases in which the assumption of k-independence holds. It should be noted that it does not use the values of P(Zjrj) directly, but these values are important for the calculation of the values P(Ei). Before discussing these results in detail we shall make a further generalization, this time assuming that there are not two but l alarm units:
A1 = (Z11, ..., Z1n1)
A2 = (Z21, ..., Z2n2)
...
Al = (Zl1, ..., Zlnl)
In this case we must assume that the probabilities of the outcomes - or measurements - of each of the l alarm units are known:
P(Z11), ..., P(Z1n1)
P(Z21), ..., P(Z2n2)
...
P(Zl1), ..., P(Zlnl)
with Σ_{rj=1,...,nj} P(Zjrj) = 1,    j = 1,...,l

Definition (5.2): The assumption of k-independence as made in the case of l = 2 must now be converted into the assumption of mutual k-independence, which states
P(Z1r1 ∧ ... ∧ Zlrl|Ei) = ∏_{j=1,...,l} P(Zjrj|Ei),        rj = 1,...,nj;  j = 1,...,l;  i = 1,...,k        (5.4)
[]

This is a very strong assumption indeed, but it is justifiable if it may be taken that, whatever the state of nature, the alarm units are not influenced by each other. In addition we assume that the conditional probabilities P(Ei|Zjrj), i = 1,...,k; j = 1,...,l; rj = 1,...,nj, are either known or that they are estimated by real numbers. We shall later discuss situations in which only weaker estimates of these probabilities are available. Again a system of restrictions exists, which controls the credibility of the estimates used:

P(Ei) = Σ_{r1=1,...,n1} P(Ei|Z1r1)·P(Z1r1) = ... = Σ_{rl=1,...,nl} P(Ei|Zlrl)·P(Zlrl),    i = 1,...,k        (5.5)
Through these restrictions "prior probabilities" P(Ei) are once again defined which bear information contained in both the probabilities P(Zjrj) and the conditional probabilities P(Ei|Zjrj). It will be seen once again that the probabilities of the outcomes are used in the final result only in this condensed form. The construction of the combination rule now means the calculation of the conditional probability of the state of nature Ei* provided that Z1r1, ..., Zlrl are given. The result is given in the following
Theorem (5.1): Under the conditions of mutual k-independence the following combination rule can be stated:

P(Ei*|Z1r1 ∧ ... ∧ Zlrl) = [∏_{j=1,...,l} P(Ei*|Zjrj) / P(Ei*)^(l-1)] / [Σ_{i=1,...,k} ∏_{j=1,...,l} P(Ei|Zjrj) / P(Ei)^(l-1)]

i* = 1,...,k;  rj = 1,...,nj;  j = 1,...,l        (5.6)
[]
The proof of Theorem (5.1) follows exactly the procedure used in deriving Equation (5.3).
[]
Formula (5.6) may be converted into the following form

P(Ei*|Z1r1 ∧ ... ∧ Zlrl) = [P(Ei*)·∏_{j=1,...,l} (P(Ei*|Zjrj)/P(Ei*))] / [Σ_{i=1,...,k} P(Ei)·∏_{j=1,...,l} (P(Ei|Zjrj)/P(Ei))]        (5.7)
which can be useful in computing. Formula (5.6), like more specialized formulas, does not contain the probabilities of the outcomes P(Zjrj) directly but only through their influence on the total probabilities P(Ei). Therefore the question arises what knowledge is necessary in order to put (5.6) into practice. Obviously the conditional probabilities P(Ei|Zjrj) must be known; these are the main elements of the description of uncertainty in a diagnostic system. But this is not true in the same way for the probabilities P(Zjrj) of the outcomes. Is it sufficient that the P(Ei) are known? For the mere use of Formula (5.6) the knowledge of the total probabilities P(Ei) - together with the conditional probabilities - is obviously sufficient. However, it is by no means obvious that the given total probabilities and the conditional probabilities fit together in the model which constitutes the theoretical background of a well-founded application of the formula. A set of conditional probabilities and a set of total probabilities are only in accordance with respect to the model if probabilities of the outcomes exist which fulfill (5.5). Therefore the existence of suitable P(Zjrj) should always be controlled if Formula (5.6) is to be applied. It is easy to see that (5.5) consists of l subsystems, for instance when j = 1:

P(Ei) = Σ_{r1=1,...,n1} P(Ei|Z1r1)·P(Z1r1),    i = 1,...,k        (5.5a)

Each subsystem represents k equations for nj unknown variables P(Zjrj) (j = 1,...,l). As the sum of these k equations must reveal the identity 1 = 1, the number of linearly independent equations does not exceed k-1. On the other hand
Σ_{rj=1,...,nj} P(Zjrj) = 1,    j = 1,...,l
and therefore the number of linearly independent unknowns is nj-1. Obviously such a subsystem cannot be soluble for every possible prior distribution P(E1),...,P(Ek) if k > nj. Although in special constellations solutions may exist which even lie in [0;1], one would generally expect (5.5) to be insoluble if k > nj holds for any j, j ∈ {1,...,l}. A necessary - but not sufficient - condition for the existence of solutions in [0;1] is obviously the following:

Max_j Min_{rj} P(Ei|Zjrj) ≤ P(Ei) ≤ Min_j Max_{rj} P(Ei|Zjrj),    i = 1,...,k        (5.8)
If (5.8) does not hold, at least one alarm unit exists for which all the conditional probabilities of Ei are greater than P(Ei) or at least one for which they are all smaller than P(Ei). As (5.8) is by no means a guaranty that all systems of the type (5.5a) are soluble and provide solutions which may be taken as probabilities of the outcomes, the general rule must be: Before applying (5.6) it has to be proven, that the estimates of the prior probabilities and those of the conditional probabilities are in accordance with the assumption of the existence of nj P(ZJri), 0_
which obey (5.5). This may be done by means of a program which solves linear equations. Should this control fail, (5.6) must not be employed! To illustrate this formula let us turn to our Example (4.3) about the power plant. Example (5.1): For convenience we shall use the same parameters as in Example (4.3), but assume that there are more than two alarm units which are all mutually double independent; this means that they are independent in the case of the normal state of the power station as well as in the case of a breakdown. One should remember that in the case of only one signal the probability of a breakdown if the signal went off was 0.01. If there were two signals and both went off the probability of a breakdown rose to 0.255. Now let us assume that there are three signals and that each of them goes off at the same time: OJ1 tO2 ~ 3
x+++ = mtm2e3 (1-0~t)(1-~2) (1-~03) = 0.921 +
(l_m)2
In case of four signals going off at tile same time, the result of the corresponding formula is 0.9975. On the contrary, the Dempster-Shafer rule would ill these cases have resulted in practical absolute security that the power station is in a quite normal state; this was shown in Chapter 3.
[]
It is useful to compare Formula (5.3) with the general form of the Dempster-Shafer formula. It is evident, that (5.3) can be converted into the Dempster-ShMer formula, if all denominators may be reduced due to the fact that they are all the same. This means that the application of the Dempster-Shafer formula is compatible with probability theory if the "independence of the sources of information" is interpreted as k-independence of Zlj and Z2j, and if the prior probabilities of the different states of nature are all the sane. In this case: P(Ei) = 1/k
i=l,...,
k.
92 The results we have discussed for 1 = 2 may be easily generalized to cover situations with more than two units. Again we must assume that the two concepts of independence are taken as compatible. Then a comparison of the formulas reveals, that both show commutativity and associativity: neither the order of the observations, nor the order in which these observations are combined is important for the result. And the results are the same for both methods, if all prior probabilities are taken to be equal. Therefore it may be summarized, that not only in the case of I = 2, but also in the general case the application of the Dempster-Shafer formula is equivalent to the use of equal a-priori probabilities. All considerations of the preceding chapter concerning interval estimation of the prior probabilities through the conditional probabilities - if this is necessary - can be generalized without difficulty. We shall demonstrate the behaviour of Formula (5.3) using the following example. Example (5.2): We assume that there are three possible states of nature, which we conceive as three states of a power station. The state E1 denotes the normal state of the station, the state E2 means that there are some disturbances and the state E3 that there is a severe breakdown. In addition, we assume that there are two alarm units which work independently from each other. For convenience we shall assume that each of the two units has a first sign: "green" which is to be interpreted as "no danger" (Zit), a second sign is "yellow" and is interpreted as "be careful" (Zi2) and a third sign: "red" signifying "high danger" (Zja); j = 1, 2. We assume that the prior probabilities for the three states of the power station are known as P(E1) =0.994050
,
P(E2) = 0.004455 ,
P(E3) = 0.001495.
In addition, we know the conditional probabilities of the three states of the power station given the signs of each of the units. They are formulated as three matrices:
P(EllZir.) j
:
[0.9990 0.9950
0.95 0.95
0.70] 0.75 ,I
P(E21Zjrj )
:
0.0009 0.0038
0.04 0.03
0.20} 0.20
P(E31Zirj )
:
0.0001 0.0012
0.01 0.02
0.10} 0.05
Now we could apply Formula (5.3) as all necessary values are given, but we should not do so before having controlled, whether the prior probabilities and the conditional probabilities fit together in the sense that values exist for the probabilities P(Zjrj) so that Equations (5.5) are fulfilled.
93
0.994050 = 0.999 P(Zn) + 0.95 P(ZI2) + 0.70 P(ZI3) = 0.995 P(Z21) + 0.95 P(Z22) + 0.75 P(Z23) 0.004455 = 0.0009 P(Zn) + 0.04 P(Z,2) + 0.20 P(Z,3)
= 0.0038 P(Z2,) + 0.03 P(Z22) + 0.20 P(Z23) 0.001495 = 0.0001 P(Zn) + 0.01 P(Z~2) + 0.10 P(Zla) = 0.0012 P(Z~I) + 0.02 P(Z22) + 0.05 P(Z23) We arrive at the following values: P(Zn) = 0.95 P(Z12) : 0.04 P(Z13) = 0.01 P(Z2~) = 0.98735 P(Z22) = 0.01075 P(Z23) = 0.00190. All solutions of these systems are in the interval [0;1]. Therefore all the solutions are admissible and a credible probability model exists for which the used values are parameters. Now we can apply Formula (5.6) and obtain the following results:
rl~
1
2
3
i
0.999153 0.957559 0.736359
0.992310 0.692504 0.199483
0.945142 0.251766 0.041096
r'~
1
2
3
0.000767 0.034358 0.179285
0.006299 0.205455 0.401602
0.050664 0.630758 0.698659
r~
1
2
3
i
0.000080 0.008083 0.084357
0.001391 0.102041 0.398915
0.004194 0.117476 0.260245
P(E11Zlrl^Z2r2) :
P(E21Zlrl^Z2r2) :
P(E31Zlrl^Z2r2) :
The behaviour of the combination rule for 3-independent units in this example is easily recognized: The conditional probability of E~ is at least 0.7 if we observe only one unit, even if this shows "red". In the case of two units this probability falls to 4% (0.041096) if both units show "red". On the other hand, the probability for E1 is increased from 0.999 at greatest if one unit shows "green", to 0.999156 if both units show "green". If we want to investigate this behaviour more thoroughly, we have to compare the two units and we see: unit one is more capable of distinguishing the states
94 of nature, because its conditional probabilities with respect to "green" and "red" show larger differences than the respective ones of the second. Therefore generally a red sign in unit one increases the probability of state Ea more than a red sign given by the second unit. Finally, it should be noted that, due to large differences between the probabilities for E1 and for the other states, the requirements for the conditional probabilities in order to secure admissible solutions of System (5.5) are very strict. If some of the conditional probabilities were changed only slightly, a situation is possible in which the solutions of the Systems (5.5) cannot be interpreted as probabilities.
[]
Of course it may be argued that the assumptions used in this chapter are often unrealistic: In many cases one has to deal with interval estimates for probabilities. The methods involved in solving such problems are developed in Chapter 6.
95
5.2 Some Aspects of Practical Application Let us start the consideration of the practical aspects with a short story.
Example (5.3): Mr. K. lives in a country in which 40% of the population suffer from a new type of disease, which can be cured if detected early enough. So he goes to a doctor who tells him that there are two tests for this disease, each working independently from the other. In addition, the doctor knows the probability of a person suffering from the disease, if one of the two tests produces a positive or negative result. They are: Test 1:
Test 2:
Positive result:
wl = 0.8
Negative result:
~1 = 0.1
Positive result:
~o2 = 0.9
Negative result:
~ -- 0.2
Both tests are applied to Mr. K. and both give a negative result. The doctor has a computer which informs him: "The probability that this person has the disease is 0.04." When the doctor tells Mr. K. this result, the latter asks him: "Does this computer use an appropriate program? I hope it doesn't use MYCIN or Dempster-Shafer[" The doctor asks: "What do you know about programs appropriate for diagnostic systems?" And Mr.K. replies: "Well, I am a statistician myself and therefore I don~t trust the programs used in some of these systems." The doctor says: "No, my system uses classical probability theory; but I'm glad you told me that you are a statistician, because I recently received information, which states that only 25% of the statisticians suffer from this new type of disease. So I am sure we shall find out, that together with the two negative results of the tests, the probability that you have the disease is smaller than 4%." He changes the input of the computer from w = 0.4 to o0 = 0.25 and reads: "The probability that this person has the disease is 0.077." The doctor now believes that his computer has a breakdown, as he does not find this result credible at all. But Mr. K. proposes: "Why don't you use the information about my profession as if it were a third test? Employ the formula for three independent tests!" The doctor then asks him: "Are we supposed to regard your profession and the two tests as mutually independent?" Mr. K. answers: "Why not? After all we need double independence. Take the case, for example, that somebody has the disease. I am quite sure that for each of the tests the probability of a positive result is the same, whether this person is a statistician or not. And the same is true for a person who is healthy: His profession will not influence the outcome of the test."
96 The doctor seems to be convinced and feeds the computer the supplementary input: w3 = 0.25 and makes it use the appropriate formula for three independent tests. The computer produces: "The probability that this person has the disease is 0.0204." While Mr. K. is quite satisfied, the doctor does not know which treatment he should prescribe. His rules order him to apply a certain kind of treatment to all people with a probability greater than 0.05 and another one to people for whom the probability is 0.05 or smaller. But Mr. K. comforts him: "Trust the result which says that the probability is about 2%, because this result is correct. I shall explain why your first result was wrong if you wish." Indeed Mr. K. is right: The computer program must never be used in the way the doctor used it the first time. We shall demonstrate this fact by considering the behaviour of Test 1. We know that for a person, for whom originally the probability of having the disease was 0.4, after a negative result of the test this probability is now 0.1. But is this also true for a person, for whom the original or prior probability was 0.25? Obviously it cannot be true for people whose prior probability was smaller than 0.1, because such total probability could never be produced as a weighted average of 0.1 and 0.8. Furthermore: If originally, for Person A, the probability of having the disease was smaller than for Person B, why should this difference vanish after the test has produced the same result for both people? If we accept that there should be a difference in the probabilities after the test, how is this difference determined? In the general case no answer can be found to this question, because the regarded difference must depend upon the behaviour of the test. However, it is sensible to assume that, if the person has the disease, the conditional probability of a positive result does not depend upon the facts which determine the prior probability w, and that the same is true if the person does not have the disease. This assumption produces a probabilistic formalism equal to that found in the case of double independence. Therefore Mr. K.'s explanation to the doctor is in the first place nothing more than a report on how we arrived at Formula (4.6). He shows that for a person with prior probability co' (instead of w), in the case of the single test j, we have to use a2 i • a2 T
(5.9)
~j =
~i.~' + (1-~j)(1-~') ~]
1-~
instead of ~0j and apply an analogous formula for calculating ~ .
For Mr. K. himsdf w ~ = 0.25 and for that reason the probabilities that he has the disease are as follows:
97 Test 1:
Test 2:
Positive result:
w[ = 0.6667
Negative result:
~ = 0.0526
Positive result:
w~ = 0.8182
Negative result:
~ = 0.1111
Now it is possible to derive, in two different ways, the probability that he has the disease, if both tests show a negative result. The first way is to combine ~l' and ~2' with regard to his prior probability w' 0.25 using (4.6): 0.0526 - 0.1111 =
;l';i
0.25 0.0526.0.1111 0.9474.0.8889 0.25 + 0.75
(1-;I)(1-;~)
0.0204
The other method uses w' as if it was produced by a third test. This is the method Mr. K. recommended to the doctor: ~1"~2"~'
0.1-0.2.0.25 0.4 '~ - 0.0204 0.1.0.2.0.25 0.9.0.8.0.75 0.42 + 0.62
X___ -
~1"~2"~' (1-~1) (1-;2) ( 1 - a ' ) ~-~-- + (l_a)2
The fact that both computations must produce the same result, can easily be shown if we transform the probabilities into: [ -'=
(1-~i) (1-w')~ } t 1+ - - - -
5j .~' (l-r)
~J
1-; I (1-;j)(1-~')~ -
j
~;
=
1,2
~i "~' (i-~) l-x__ (1-;I)(1-~)~'
and analogously:
= x__
~ I - ~ (1-~')
Therefore: l-x__
(1-~l)(1-w')w
(1-~2)(1-w')w
w'
(1-~,)(1-~2)(1-~')w 2
x__
~1"~o' (1-~)
~2"~'(1-w)
(1-~')
~1"~2"~'(1-~)~
Exactly the same expression is derived, if l-x--- is calculated through the formula which was used X---
for
the second method.
In conclusion, we may state the following rules for the use of diagnostic systems:
98
1)
A diagnostic system conceived for a situation with a prior probability distribution P(E2),...,P(Ek) must not be transferred to a situation with another prior probability distribution P'(E1),...,P'(Ek), without being adapted.
2) Adaptation supposes that P(Zjrj [Ei) remains unchanged if P(Ei) is altered. 3) Adaptation can be achieved by using P(Ei*IZj r . ) ' P ' (El*) 1
p,(Ei, i ZJrl) =
P(Ei*) k
Z i:1
(5.10)
P(Ei [Zj r . ) "P'(Ei)
i P(Ei)
- according to (5.3) -together with the prior distribution P'(E1),...,P'(Ek).
4) Adaptation can also be achieved by applying Formula (5.6) of Theorem (5.1). We define P(Ei[ZD2d):=P'(Ei), i = l , . . . , k , designating Z1+2,2 as the observation that the prior distribution has changed from P(Ei) to P'(Ei). Then we have for r1~1 = 1 : 1 P' (El* [Z2r1 ^ . . - ^ Zlrl) = P(Ei*[Ztr 1 ^ . . . ^ Zlr 1 ^ Zl÷2,1) =
k
X
1+2
• i IJ, P(E *IZjrj) 1
1÷2
(5.11)
• i--1J, P(E~IZj,.)
5)
Both methods of adaptation produce the same result.
6)
It must be admitted that often it will not be possible to gain reliable point estimates W(Ei), since one does not possess enough knowledge concerning the situation to which the diagnostic system is transferred. In those cases both methods of adaptation can be combined with procedures described in the next chapter.
99
CHAPTER 6 Interval Estimation of Probabilities in Diagnostic Systems The argument is widely used, that classical probability theory cannot be applied in diagnostic systems, because in many cases the available information is not sufficient to establish reliable point estimates for the probabilities. The latter statement has to be accepted, especially in those cases where the estimation of probabilities stems from sampling. In such a case only interval estimation may claim a certain amount of reliability. Shafer's theory of belief functions is merited with the ability to handle this problem. In this chapter we aim to derive solutions - using probability theory - for the problem of combining information which contains interval estimation of probabilities. We shall confine ourselves to models which assume k-independence and shall restrict ourselves to the simplest of situations: There are only two states of nature, two alarm units and each unit has two different signals. In this case the assumption of k-independence means double independence or (EF,Es)-independence. Now we suppose that the probability x** of the failure EF has to be estimated, if both alarm units go off; this event is denoted by Z1~^Z21 (instead of Z~AZ2 as in Chapter 4). For a problem of this type we suggested - if the necessary requirements concerning the underlying model are given - the application of Formula (4.6a), which uses the probabilities
P(EFlZ11) :
~
P(EFIZ21) = w2
P(~) : ~. While in Formula (4.6a) it was supposed that exact values for these probabilities are given, we will now allow three interval estimations to be given. We shall call this the basic information for x,,: Lj < wj < Uj L <~
j = 1,2
(6.1)
Due to the restrictions described in Chapter 4 we shall always suppose that L,Lj < 1 and U,Uj > 0. It should be noted that the symbols L] and Uj in this chapter are not used for the same variables as the symbols Li and Ui in Chapter 2. Normally, the index i distinguishes states of nature, the index j different alarm units or sources of information. However, in Chapter 2 the index j is sometimes used to specify a certain i. As long as we take a look at situations in which only two states of nature are distinguished, the limits for the probabilities of EN are determined by the basic information, if the estimates are feasible.
100
In this chapter we shall discuss different ways of treating this problem. We aim to show that the most adequate method depends upon the kind of additional information available. By additional information we mean all information beyond the basic information on the probability which is to be calculated. Therefore additional information for x÷, is basic information for x__. It is obvious that in diagnostic systems additional information will be automatically available in nearly all cases, because not only x÷÷ is of interest, but also the other probabilities. Nevertheless, we shall use the expression "basic information" in the following chapters with respect to x+,. For the purpose of comparison we shall start investigating a situation in which indeed only basic information is available.
101
6.1 An Approach without Additional Information A first method, which is rather simple, will lead to weak results. In applying this method we combine Equation (4.6a) with (6.1) in a simple mathematical way. The result is given by
Li .L2 U L1.L2 + (1-LI)-(1-L2) U
U1 .Us < x.
<
L U~.U2 + (1-Ut). (1-U2)
(l-U)
L
,
(6.2)
(l-L)
if U < 1 and L > O. In the case U = 1 the lower limit for x** is equal to zero; the upper limit can be calculated by (6.2). In the case L = 0 the upper timit for x , is equal to one; the lower limit is given by (6.2). It is easy to justify this result: At first, it has to be shown that the used constellations are not impossible, i.e. that simultaneously w~ may take the value L~, ~ the value L2 and w the value U to produce the m i n i m u m of x,., or that wj may take the value Uj (j = 1,2) and co the value L in order to produce the m a x i m u m of x,,. It must be taken into account, that the application of (4.6a) presupposes the model which was employed in Chapter 4 and therefore presupposes Equation (4.1). It should be noted that (6.1) does not contain any information about pl, p2, ~1 and 7_~. Above all no information is given which prevents us from taking Pl and P2 to be very near to zero. In this case the values of co, col and a~ are practically independent from each other and co can take its maxinmm, while col and ~ take their minima or vice versa. Secondly, it must be shown that - by using the smallest possible values for Wl and ~ and the largest possible value for w - the minimum of x++ is indeed achieved. This fact and the symmetric result for the m a x i m u m are easily seen, if (4.6a) is converted to ~01 " (-02
Xl'a~2 + (1-~t)" (l-x2) Lo 1-0o
= {1+ (1-a/l)" (1-a/2) el
X2
OJ } -1
" ]~d
(6.3)
which shows that x+, is a monotonously increasing function in oal and ~ , but a monotonously decreasing function in w. On the one hand, (6.2) produces the weakest estimation of x,+: Under no circumstances whatsoever can these limits be exceeded, if (6.1) holds. On the other hand, it provides the most accurate estimate, if only (6.1) and no other information about the parameters concerned may be taken to be known. Therefore additional information is necessary, if the result (6.2) is to be improved. The fact that these intervals are really very large will be shown in the following examples.
102 Example (6.1): Returning to the Example (4.3) we assume: 0.01 _<~j _< 0.02 j=1,2 0.0003_<~ < 0.0006 and obtain 0.1453 _<x÷, _< 0. 5812 Example (6.2): 0.1 _<xi _<0.2 0.055 <x <0.128
j =1,2
0.0776 _<x+÷ < 0.5178
In the following the estimates derived by (6.2) will be called the trivial estimates of x++.
103
6.2 A d d i t i o n a l
Information
a b o u t ~j
It is obvious, that an improvement of (6.2) is only possible, if more information about the parameters used in the model is taken into account than was done in the derivation of (6.1). We should consider that there is a symmetry between wi and wj, which corresponds to the symmetry between the signs Zil ("the alarm unit j goes off") and Zj2 ("the alarm unit j does not go off"). Therefore an obvious assumption is, especially for diagnostic systems, that the same kind of information is given for wl = P(EFIZJ2) as was given for wj = P(EF[Zjt), even if ~j is not apparent in Formula (4.6a). Accordingly we assume: I~i <-~i <-Ui j : 1,2 (6.4) with the restrictions Li < 1 and Uj > O. In the same way as in Chapter 4 we exclude the case: [Lj,Uj] = [Li,Vi]. If the estimates (6.1) and (6.4) are to be used simultaneously, obviously the fundamental Equation (4.1) must be taken into consideration. The prior probability must, for each of the two alarm units, result in a linear combination of ~vj and ~i with weights pj and l-p1 which must lie in the interval [0;1]. Therefore interval estimates which do not allow for the existence of probabilities pj should not be admitted. The control of admissibility, in this sense (the exact definition will be given later), is only possible through a calculation of these probabilities, taking into account that each pj can result not only in a single value but also in an interval. As each pj is determined by only co, a:j and ~j (not by coj,, ~j,, j'¢j), in the first part of our considerations we will confine ourselves to a situation in which pi is determined through a:, col and ~l.
Therefore the following problem arises, if possibilities to improve the trivial estimate (6.2) are studied: If
L~ _ wl _
I:1 _<~1 _
(6.5)
L <w
which of the triples (wl,~,w) are possible in the sense, that the three values co1,~1 and a: can occur simultaneously? This question is very closely related to the problem of the existence of solutions pl for Equations (4.1). It is easily seen, that under the condition 0:l ~ vi:
p~ :
01
-
~1
(6.6)
is a mathematical solution of this equation, which, because of Pl = P(Zn), must, in the present context, lie in the interval [0;1]. This is obviously granted, if and only if w lies between wl and ~l. Therefore the first answer to this problem is: A triple above, if w lies between wl and ~ .
(wl,~,w) is possible in
the sense described
104
Such an answer would not take into account that Pl is a quantity of the same kind as w, ws and 51. As for this type of quantities interval estimates have to be used in diagnostic systems, we are entitled to assume that the estimate for pl is also an interval. If (6.6) was used, the set of solutions would depend upon Wl and ~ and so would yield another interval estimate of Pl for every (ws,~), but would not produce one single interval. We therefore are in search of an interval of ps-values so that, for every element Pl of this interval and for every (wl,~) between their respective limits, a w-value between L and U is obtained through (4.1). Additionally, for each of the two limits L and U there must exist at least one triple (wl,wl,pl) so that (4.1) produces this value. Therefore our concern is to find: pL, p~E [0,1] ; p L < p ~ SO that [L,IJ] : {o~: w = psws + (1-ps)~s ; Ll_<Ws_(U1, ]~1_(~1_(]J1, 0 < p L _ < p l < p U < l }
(6.7)
This leads to the following definition Definition
(6.1):
A System (6.5) is interval-admissible, if pL, pU exist, so that (6.7) holds. A necessary condition for the admissibility of such a triple of intervals can easily be derived from (6.6): L must lie between L1 and Ls
and
U must lie between Ux and U1.
If these conditions are not fulfilled, System (6.5) can never have solutions pl which lie in the interval [0;1]. In this situation there is a striking discrepancy between the interval estimations used. In order to discuss further conditions of interval-admissibility, we have to distinguish four possible cases concerning the location of the limits of ws and
Case a)
Case b)
Case e)
~s:
Us _>Us Ls > 15s
(The limits of w~ are greater than those of ~1)
Vl _>Us Ls < ]:s
(The limits of 0h are outside those of 51)
Vl < Us La _>]~s
(The limits of wl are between those of 5s)
105
Ut < UI
Case d)
Li < 17t
(The limits of ~1 are smaller than those of ~1)
As equality between two limits does not raise special problems, we are able to treat Case a) and Case d) in the same way: The limits of oa are either both greater than the respective limits of ~1 or they are both smaller. In Case b) and Case c) the interval for one probability - Wl respective ~1 contains the interval for the other one. Therefore we may treat these two situations in an analogous way. Let us at first consider Case a). Taking into account that w is a monotonous increasing function in Wl as well as in ~ , we can conclude that the following inequality must hold for every admitted Pl: plLa + (1-pl)I71 < w(pl) = plwl + (1-pl)51 _
(6.8)
~1 the
respective limits are used. Furthermore, if U1 > U~, we have to take into account that the upper limit is a monotonously increasing function of Pl. If therefore pU is chosen so that U : pUU1 +
(i-p~)131
pU _ U-U1
(6.9)
Ut -UI then for all p < pU the inequality
o)(pl) _
is valid.
The same derivation can be carried out for the lower limit of p~ according to (6.8). If L1 > L1, this linear function in Pl is monotonously increasing and if we solve L : pL, L, +
by
pL_ L-];, , L 1-I~1 there will be w(p,) >_L for all p >_pL.
(6.10)
In the case that U1 = U1, the inequality v(pl) _
I~ < L < L1 U, < U _
(6.11)
As long as pL < pU, then for all Pl between these two limits the respective w(pl) lies between L and U and these two limits are indeed obtained. Therefore in this case System (6.5) is interval-admissible.
106
Although (6.11) holds, it is nevertheless possible that pV < pL. In this situation values of p~ do not exist, for which w(pl) lies between its limits L and U for all values of wl and wi, which are in accordance with (6.5). In this situation these three interval estimations for the prior probability w and for the conditional probabilities w~ and wi cannot be used simultaneously: they do not fit together. Due to pL > p~ and because of the characteristics of Case a) the two implications hold:
pLL, + (l-pL)l~i = L
L U Pi>Pl ~
p~LI + (l-p~)l:, < L
(6.12a)
p~U~ + (l-p~)U~ > u
(6.12b)
pbpV p~U~ + (l-p~)U, : ~
~
It is obvious that under these circumstances the estimates used in (6.5) have to be modified, if they are to be used in the diagnostic system. Roughly speaking, the limits for w are too narrow compared with the limits for wl and ~1, or the limits of w~ and ~ are too wide compared with the limits for w. As long as all estimates used are taken as valid - especially the estimates for w - the estimates for wl or ~ may be improved, but it is not determined in which way. One may, for instance, improve the estimation of w~ by increasing L~ until the inequality in (6.12a) becomes an equation. This means that, as long as p~ > 0, the new value of Li is given by
Ltnew-
L-(1-p~)~,
pO
( > El )
(6.13)
supposing that this new value of L1 is not larger than Uv Application of L1new in Formula (6.10) produces: pL = p~. We could apply the same procedure to L1 with the result of Z1new -
L-p~LI
( > ]~1 ) ,
(6.14)
1-p~ again with the restriction that the new value must not be larger than U1. (6.10) produces pL = p~ also in this case. If for any reason the use of (6.13) or (6.14) is too restrictive for wl respectively for ~t it is possible to increase L1 as well as ~t until the inequality on the right side of (6.12a) becomes u an equation. In all of these cases Pl remains unchanged and becomes the only admissible value of Pi so that the interval for p~ degenerates to a single point. All the upper limits of w~ and ~ remain of course unchanged by this procedure; so the equation on the left side of (6.12b) remains true. Alternatively an analogous procedure can be used to diminish the upper limits U1 or U1 in one of the two following ways: ul"e~ -
U-(1-p~)U1 L ( < Vl ) Pl
(6.15)
107
or as long as pL < 1:
~ llew
U-p~U1 - _
( < l~i ) 1-pL
(6.16)
or a combined treatment can be applied to U1 and 13"1 in an analogous way as to L1 and [71. In all these cases the interval for Pl degenerates to the single point pL. Furthermore, it is possible to apply a mixture of the two mentioned procedures so that upper limits as well as lower limits are modified - in the extreme case all four limits. Which of these many possible procedures is applied, depends only on material considerations, not on formal ones. The question is, which of the possible improvements can be justified best with respect to the matter under consideration. Nevertheless, the possibility that none of these improvements seems to be acceptable, must be taken into account too. If the respective source of information is to be used at all, the information included in System (6.5), which is not interval-admissible, has to be modified in the other direction, i.e. weakened. In this case the interval [L;U] for the prior probability co has to be expanded. By doing so it is admitted that the original estimate for co may be untrue. The expansion of the interval [L;U] again can be carried out in different ways. The first method defines the lower limit L by Lnew = pUL, + (1-pU)];1 (6.17) and uses pU as the degenerated interval for Pl; another possibility results in Unew = pLU1 + (1-pL)U1
(6.18)
and leads to p} as the only value of p,. It is evident that a mixture of these two procedures is also possible.
Case d) - apart from the treatment of the equality sign - is analogous to Case a), if wl and ~1 respectively Pl and (l-p1) are interchanged. Case b) is determined by the fact that the interval for o01 contains the interval for ~l. The requirements for w, which stem from the fundamental Equation (4.1), lead to the condition L1 _
(6.19)
which is necessary for interval-admissibility of System (6.5). It is obvious that (6.8) is also valid in this case, but while the upper limit is still a monotonously increasing function of Pl, the lower limit is now a monotonously decreasing function of Pl. Therefore, if L1 ¢ L1 and U1 ¢ Ui, one arrives at the following two limits for pi: lp~ = L-r~ , ~p~_ U-~I L 1-1;1 U,-U1
(6.20)
108
The case that either Lt = ]51 or U1 = U1 will be discussed under the section for case bT). Both of the limits (6.20) are upper limits and the only lower limit for Px is zero. If pl is very near to zero, the limits (6.8) are very near to the limits of ~
and therefore must be contained in the
interval [L;U]. If lp~ = 2pV = pV, a unique interval [0;p~] for Pl emerges as a solution to the problem under consideration. System (6.5) is therefore interval-admissible. If lpO1 ¢ 2p U, a solution cannot be found: Either the one limit for w is violated or the other one is not reached at all. We distinguish: Case ba)
XpU < 2pU:
Then the following implications hold ~p~L, + (1-'p~)]:1 = L ~ 2pULl + (1-2p~)151 < L 2p~U1 + (1-2p~)U1 = U ~
lpUU1 + (1-1p~)U1 < U
(6.21a) (6.21b)
As System (6.5) is not interval-admissible, modifications to the used estimates must once again be made. As long as all estimates of System (6.5) are trustworthy, some of them can be improved in order to obtain an interval-admissible system: Either L1 or ]:1 can be increased until the right side of (6.21a) becomes an equation. The two alternative solutions are either L1new and I:1 with L1 new -
L-(1-2p~)rl 2p~
( > L1 )
(6.22)
or L1 together with LlneW, where L-2p~L1
~|neW _ _ _
(>[1)
(6.23)
1-2p~ As for Case a), a mixture of both procedures which establishes the equation on the right side of (6.21a), is also possible. In all these cases the interval for Pl is: [0;2pV]. Another method which improves estimates used in (6.5) is to lower U so that the right side of (6.21b) becomes an equation: Unew = lpUU1 + (1-tpU)U1 ( < U )
(6.24)
If this modification is used the information concerning wl and ~1 remains unchanged. In contrast to Case a) this method allows us to improve the estimate of w in order to establish an interval-admissible system.
109
Again it is up to the experts to judge which of the possible moderations is most acceptable. If none of these improvements can be justified, the only solution is to accept expansions of the given intervals. It is possible to enlarge U1 or Ut in order to yield an equality for (6.21b). This leads to: U1new and U1 with
u-(I-'pV)N U1new
'pV
( > U1 )
(6.25)
or to U1 and U1new with U- lpUU1 Vlnew =- ( > UI ) 1-1p U
(6.26)
In this case the probability interval which is a solution of the modified System (6.5) is [0;~pU]. Otherwise L can be diminished until the right side of (6.21a) becomes an equation. This leads to L new : 2p~L1 + (l-2pU)]:1 ( < L ) (6.27) and in this case the interval for p~ is: [0;2pV]. Case b~): ~pU > 2pU: this case is treated analogously. The improvement of the estimates is achieved by decreasing U1 or U~ (or both) or by increasing L. If this is not possible, either L1 or L1 have to be decreased or U must be increased. Case b7): If Lt = g~ (=L) and U1 < U~ the relation L = p~L~ + (1-pl) E1 is true for all Pl E [0;1]. If we calculate 2pV through (6.20) the interval [0;2pU~] renders a solution for all 0 < Pt _<2p~. Inequality (6.8) holds and the limits L and U are reached. Therefore (6.5) is interval-admissible together with [0;2pU]. In the same way we can conclude that for U, = U~ and LI < L1 System (6.5) is interval-admissible together with [0;tp~] according to (6.20). Case c) is easily derived from Case b) if cat and ~ are interchanged. Of course this leads to an interchange between p~ and 1-pl, so that we obtain two lower limits of Pl in this case, while the maximum value is always equal to one. The expressions (6.22) - (6.27) remain valid, if it is taken into account that we have to use the lower limits p} instead of pU. From now on we shall always assume that modifications have been made if necessary and that System (6.5) is interval-admissible. This will be taken for granted in the following chapter for both sources of information; later for all sources of information.
110
Moreover, due to Equation (4.1) the same estimate of w has to be used, i.e. the same L and the same U, for all sources of information. Some practical problems may arise, if in the course of making a system admissible for one source of information, modification of L and U is taken into consideration. In such a case the computations concerning the admissibility of the systems stemming from the other sources of information have to be repeated. There are no formal methods concerning the problem of bringing different systems of the type (6.5) into agreement so that a common estimate [L;U] for w can be employed. This is only dependent upon the respective reliability of all the information used.
111
6.3 The Combination Rule for Two Units In this chapter we shall confine ourselves to a situation in which only two alarm units exist. Therefore the description of the problem is given by the following parameters: Lt < tot <_ Ut
L2<_to2<_U2
L1 _<~t _
L2 _<~2 _<02
(6.28)
L_
for j=l,2
A value for w can appear simultaneously with a pair ( ~ , ~ ) , if it can appear simultaneously with both wa and w2. If we denominate Max x(vl,to2) =Min {Max[pjtojL + (1-pL)Uj; p?vj + (1-p°)Uj]} j=l~2 and Min to(to,,to2) = Max {Min[pLwj + (1-pL)]:j; pjtoj U + (1-p~)[:j] } j=l~2 we find Min to(~l,to2) _
(6.31a)
(6.31b)
(6.31c)
as the range of all total probabilities ~ which can occur simultaneously with ~1 and ~2, provided that: Min to(tol,w2) _<Max ~d(tol,to2)
(6.31d)
112
It is, however, by no means evident, that (6.31d) must hold for all pairs (0d, ~2). If Min [pLoj1 + (1-pL)]:t; p~0:t + (1-p~)l~t] > Max [pLw2 + (1-pL)]:2; p2~z2 + (1-pV)I:2] then all total probabilities 0:, which can occur simultaneously with ~ , are greater than those, which can occur simultaneously with x2. Analogously L + (1-pL)]~2; p2w2 U + (1-pU)]]2] > Max [pL~, + (l_pL)i]l; P ~ I + (1-p~)I:t] Min [P2X2 states that all values of ~ occurring simultaneously with ~2 are greater than those occurring simultaneously with wl. In any of these cases a total probability x, which can appear simultaneously with both 0~t and 0J2, does not exist. This must be interpreted in the sense that ~1 and x2 can never occur simultaneously. Consequently all pairs (xl,x2), for which (6.31d) does not hold, should not taken into account in searching for possible results of the combination rule. It will, however, be seen that these exclusions do not have an effect on the final estimate of x++, due to the following lemma and Theorem (6.1). Lemma: (6.31d) holds for zJ = Uj, j = 1,2 and for ~i = Li, J = 1,2. Let ~j = Uj, j = 1,2: Because of interval-admissibility of both systems, L * (1-p )Uj; p Vj + (1-pC)Uj] : C Max [pjU] must hold for j = 1,2. Therefore: Max ~(U~,U2) = U,
while for
L < pjUj + (1-pi) q _
so that L ~ iinx
(h,U2) _
If ~j = Lj, j = 1,2, interval-admissibility leads to
Li+ Min [pjL
pUL1 +
=L
and therefore Min ~(L1,L2) = L , while for all pL _
j = 1,2
and therefore L _<Max w(L1,L2) _
113
T h e o r e m (6.1)
With respect to (6.28) and (6.29) the following inequality holds, if 0 < Max ~o(Lt,L2) < 1 and 0 < M i n w(UI,U2) < 1: L1L2 U1U2 Max w(LI~L2) < x+÷ Min w(U,,U2) (6.32) L,L~ + (I-L,)(I-L2) U1U2 + (1-U 0 (l-U2) Min z(UI,U2) 1-Min z(U1,U2) n Max ~(LI,L2) l-Max ~(LI,L2) The proof of this theorem usesthefollowing Lemina: ~1~2 Min w(wl,w~) xu(~l,~ ) = ~1~2 + (1-~)(1-e2) Min X(el,X2) 1-Min e ( ~ , e 2 )
(6.33a)
as well as ~l&2 xL(~l,~2 ) : Max ~ ( ~ , ~ ) ~ + (1-~,)(~-~) Max a(al,a2) 1-Max a(~1,~2)
(6.33b)
are monotonously increasing functions of oa and ~2. The proof of this lemma is only demonstrated with respect to xu(wl,~); for XL(aJ1,0J2) the proof is quite analogous. We have to distinguish four cases. Let us first assume that a) Mill a(WI,X2) = pLw1 + (1-pL1)I~I. Therefore Xu(tO,,~2)
= [ q + (l-t°1) (l-t°2) l [pLItjI+(1-pL)I~I] ] 0~1~2 { 1- [pL~0~+(1-p~)]~l] } [
I
L
=
L
(1-~02) " ~11 [Pla/I+(1-Pl)]~I] ]-1 = = 1 ~2.]~xl[pL(l_w~)+(l_pL)(l_i:l)]
[
+
L L ]~1 (1-0~2) • [p,+(1-pl) ~ ]]-' w2[pl+(1-pl)
l-w1 ]
whereby it is shown that in this case Xu(Wl,w2) is a monotonously increasing function of wl and of a.~.
114
b) Mi~ ~ ( ~ , ~ )
= p ~ , + (1-p~)~, :
p~ is always used instead of p}, but otherwise the proof and its result remain unchanged. The cases c) Min o~(el,e2) = P2LeO2 + (1-pL)I72 and d) Min e(e,,a2) = p~c02 + (1-p~)I72 are gained by interchanging the parameters for the first and the second source of information and they produce the same result.
[]
Proof of Theorem (6.1): Due to monotony the largest possible value of x++ is found if we take cat = U~, ~ = U2 and = Min w(Ut,U2). The smallest possible value of x++ is achieved if ~ol = L1, ~ = L2 and ~o = Max ~o(L1,L2). From this the inequality (6.32) follows immediately and the limits given there are seen to be the best possible ones. Because of the monotony of xu(el,~2) and XL(e~,e2) it iS easily seen that the lower limit of x++ can never be greater than the upper one.
[]
The application of Theorem (6.1) leads to the best available limits for x+÷. In most cases it will result in a remarkable gain in information compared with the trivial estimate. If on the other hand, the estimation which is derived from Theorem (6.1) is not sufficient with respect to a certain demanded accuracy, it must be understood that this cannot be helped by any formal method. In such a case the requirements of higher accuracy can only be met if either the accuracy of basic estimates is improved or any new material assumption is used. The use of Theorem (6.1) can be extended to the case in which Max 0J(LI,L2) = 1 or to the case in which Min L0(Ut,U2) = 0, while Max 0~(L1,L2) = 0, respectively Min e(UI,U2) = 1, can be excluded due to the restrictions described above. Since Lj < 1 and Uj > 0 (j = 1,2), it can be concluded that the lower limit for x+÷ is zero, if Max ~0(L1,L2) = 1 and that the upper limit for x++ is one, if Min ~0(UI,U~) = 0. We shall compare the results of Theorem (6.1) with the trivial estimate (TR) given in (6.2), using the parameters of Examples (6.1) and (6.2). Example (6.3): 0.01
< wj < 0.02
0.0001 < ~j _<0.0002
( j : 1,2)
O. 0003 < La < 0.0006 With this additional information about 5j we obtain: pL = pU = 0.0202. This is not a typical case, because the common estimate for pl and p2 results in a single number, not in an interval. Applying (6.32) - which becomes very simple in this case - we derive 0.2039 < x++ < 0.4533, a remarkable gain in accuracy compared with the result of Example (6.1): 0.1453 < x++ < 0.5812.
D
115
Example (6.4): 0.1 _<xj _<0.2 0.01 _<~j < 0.02
j = 1,2
0.055<~ <0.128 With this additional information about ~j the limits for pj (j = 1,2) are 0.5 and 0.6. These values are applied to (6.32) and yield: 0.1447 < x÷÷ _<0.3476 Therefore, also in this example, the accuracy of the estimate is improved considerably (compared with: 0.0776 < x÷÷ < 0.5178). [] In analogy to Formulas (4.6) we apply Theorem (6.1) to the remaining three probabilities of our model and obtain (by replacing wj with ~i and pj with 1-pj and using the analogous restrictions as for Theorem (6.1) and the conclusions analogous to those above): L1E2 U1~2 Max w(LI~2) < x+-< Min w(Ux:U~) L,~2 + (1-Lt)(1-~2) U,U2 + (1-U,)(1-U2) Max ~(L1,~2) 1-Max ~(L1,~2) Min ~(U1,U2) 1-Min ~(U,,U2)
(6.34)
where Max w(L,,I:2) : Min { Max[pLL, + (1-pL)V,; pUL, + (1-pU)Ux] ; Max[(1-pL)1:2 + LU Min w(U,,U2) : Max { Min[pLU1 + (1-pL)I:,; p~U, + (1-pU)]~,] ; Min[(1-pL)U2 + P2LL2; (1-pU)U2 + pUL2]}. ~,L2 UaU2 Max w(~,,L2) < x-+ < Min w(U~,U2) ~,L2 + (1-~x)(1-L2) U,U2 + (1-U,)(1-U2) Max w(~,L2) 1-Max w(Et,L2) Min w(U1,U2) 1-Min w(U1,U2) where Max w(]:,,L2) = Min { Max[(1-pL)]:, + pLu,; (1-pU)]~, + pVU,] ; Max[p~L L ~ + (I-pL)U~; p~C L 2 + (1-p~)U~]} Min w(U1,U2) = Max { Min[(1-pL)U1 + plLa;L (i_pU)U1 + pUL,] ; Min[pLU2 + (1-pL)I:2; p2u2U+ (l_p2U)i:2]}
(6.35)
116
~t~2 Max v(~1,~2)
r,r2 Max x(E1,E2)
+
(~-r,)(~-~2) 1-Max x(El,~2)
UtU2 Min ~(Ui,U2)
< x_-<
uiu~ Min
+
x(U1,U2)
(6.36)
(1-v,)(1-u~) 1-Min ~(U1,U2)
where U Max w(I:l,I~2) : Min { Max[(1-pL)ITj + pjLUj; (1-p~)l:j + pjUj]} j=l~2
Min ¢(Ua,U2) = Max { Min[(1-pL)Ui + piLLJ; (1-p~)% + p~Ljl } j=l~2 To avoid misinterpretations: Each of the symbols Max v(., .) and Min v(., .) has to be understood solely according to its respective definition, while x(., .) is not used as a symbol of its own and changes its role from one formula to the next. For practical calculations it should be noted that in the Formulas (6.32) - together with (6.31) and (6.34) to (6.36) each of the eight weighted averages is used four times: that is once in each formula. If this is taken into consideration the effort of calculation is reduced remarkably. We shall demonstrate these results by an example which is a slight variation of Example (6.4). Example (6.5): Lt = 0.1 ~1 = 0.01
Ul= 0.2 Ut = 0.02 L = 0.055
Like in Example (6.4) we have: In the same way we find:
L2 = 0.12 E2 = 0.012
U2= 0.18 U2= 0.018
U= 0.128
p~ = o.5 p~ = 0.39sl
p7 = 0.6 p~ = o.6790
Therefore the system of estimates is interval-admissible. It is possible to demonstrate by means of this example the existence of pairs (xl,0J2) which cannot occur simultaneously. If, for instance, vt = U1 and x2 = L2 are used, it is found according to (6.31a) and (6.31b) Max x(U1,L2) : Min {Max[0.1100; 0. 12801; Max[0.0586; 0.0873] } : 0.0873 Min ~(U1,L2) = Max {Min[0.1050; 0.12401; Min[0.0550; 0.08531 } = 0. 1050, a result, which indicates that ~a = UI and v2 = L2 can never occur simultaneously.
117
In order to prepare for the application of (6.32) and (6.34) to (6.36) we calculate:
Then we find
pLL, + (1-pL)U1 = 0.0600
p~L, + (1-p~)U, = 0.0680
L p~L2 + (1-pL)U2 = 0.0586
p2L2U + (1-p~)~2 = 0.0873
pLu, + ( 1 - p i ) i : , = 0.1050
p~U, + (1-pU)L, = 0.1240
p~u2L + (1-p~)~2 = 0.0789
U p2U2 + (1-p2U)I:2 = 0.1261
Max ~(L1,L2) = Min(0.0680; 0.0873) = 0.0680 Min ~(U1,U2) = Max(0.1050; 0.0789) = 0.1050 Max x(Ll,I~2) = Min(0.0680; 0.1261) = 0.0680 Min 0;(U1,U2) = Max(0.1050; 0.0586) = 0.1050
Max w(E1,L2) = Min(0.1240; 0.0873) = 0.0873 Min ~(U-1,U2) = Max(0.0600; 0.0789) = 0.0789
Max o~(L1,L2) = Min(O.1240; 0.1261) = 0.1240 Min x(UI,U2) = Max(O.0600; 0.0586) = 0.0600 This leads to 0.1720 _<x÷, < 0.3187 0.0182 < x+- < 0.0376 0.0142 _<x-, < 0.0497 0.0009 _<x__ < 0.0058
It should be noted, that Max 0;(El,L2) > Min 0;(UI,U2)
and
Max 0;(El,E2) > Min 0;(U1,U2) are results in accordance with the model used. They demonstrate that total probabilities may exist which can occur together with the pair of greatest values of the conditional probabilities as well as with the pair of smallest.
[]
118
6.4 The Combination Rule for More than Two Units In the following we shall confine ourselves to the discussion of the following problem: There are two states of nature. There are 1 alarm units each with two different signs. The random events Zjl (respectively Zj2) are mutually double independent, j = 1,...,1 (see Definition (5.2)). There exists an interval estimate for each of the probabilities wj and wj: [Li,Uj] respectively
[l:i,vi]. There exists an estimate for w: [L,U]. Under these circumstances we are able to generalize straightforwardly the results described previously in Chapter 6.3. The problem of admissibility has already been mentioned: It is necessary that all the interval estimates fit together: There must exist pL V L U J , Pi e[O;1] , Pi -< Pi , J = 1,...,l so that each of the systems of type (6.5) is interval-admissible. Without loss of generality we shall discuss the calculation of an interval estimate for P(EFIZu^ . . . ^Zll) =: Xl, Each other sequence of the signs (Zjl or Zj2) can be treated in the same way by interchanging the parameters ~oj and ~j, respectively pj and (!-pj). The formula which describes the calculation of this probability is a specialization of Formula (5.6): ~1
" • • • "~01
o)1-1 Xl+ = ~gl"..-"0)1 + ( l - X l ) ' . . . "
~11
(6.37)
(1-~1)
(1-o~)1-1
and at the same time it is a generalization of Formula (4.6a). Therefore the same considerations can be used as those applied in the case of 1 = 2: For the trivial solutions we arrive at the following result: L~.....L1 UI-i LI-...'LI
+ (1-LI).....(1-L1)
U1 1
(1-U) 1-1
UI.....U1 Ll-I
_<Xl÷ _<
,
(6.38)
U,'...-U1 + (1-Ua).....(1-Ut) L1 1
( l - L ) 1-
if U < 1 and L > 0. In the case that L = 0 the upper limit for x]÷ is equal to one; the lower limit is calculated by (6.38). In the case that U = 1 the lower limit for xl, is equal to zero; the upper limit is given by (6.38).
119
Analogously to the derivation of Formula (6.32) we take into account that for each 1-tuple (wi,...,wl) the greatest value of xl, emerges if w adopts the smallest possible value with respect to w~,...,oa. The analogous result holds for the lower limit and the largest w which is possible for wl,...,Wl. As for I = 2 we arrive at a fundamental theorem. Theorem (6.2): Ifpjg" _
j = 1,...,1,
and 0 < Max ~(L1,...,L1) < 1 as well as 0 < Min o~(U~,...,U1) < 1 , then
LI • . . • "LI [Max
~(Ll,...,]J1)]l-i
xl+ L,.....L1
+
[Max w(L1 . . . . ,L1)] 1-1
(6.39)
(1-LI).....(1-L1) [1-Max w ( L I , . . . , L 1 ) ] l-I U1 • • • • .U1
[Min x ( U 1 , . . . , U 1 ) ] 1-1 (
u1.....u~ [Min w ( U 1 , . . . , U l ) ] 1-1
+
( 1 - o l ) . . . . . (1-u~) [1-Min ~ ( U 1 , . . . , U 1 ) ] 1-,
where Max w(wl, • • ., wl) =
Min
L U {Max[pLwj + (1-pi)Uj; pjwj + (1-pV)Uj] }
(6.4oa)
{Min [pjwj L + (1-pj)]~j; L pUwj + (l_pV)~j]}
(6.40b)
j = l ~ • • • ,1
Mina(0~l , . . . ,
al) =
Max j=l~
• ..
,1
[]
The proof of this theorem uses a lemma which is a simple generalization of the one used in the case of 1 = 2: The upper as well as the lower limit are monotonously increasing functions of each of the wj, if (6.40) is used. The proof of this lemma is analogous to the proof of the lemma for the special case. Using this lemma, Theorem (6.2) follows in a straightforward manner: The extreme values of xl+ are produced by taking the extreme values of oa,...,~.
[]
The case that Max w(L1,...,L1) = 1, leads to the result that the lower limit in (6.39) is zero; the case that Min ~(U1,...,U1) --- 0, to the result that the upper limit is one. This is concluded in the same way as for Theorem (6.1). It should be noted that (6.40) results in, at least potentially, an additional limitation of w for each alarm unit. Therefore in the case of more than two sources of information, (6.39) has to be applied
120
instead of calculating xl. "step by step", a method for which an analogous derivation is not possible. This is different to situations in which all probabilities are given with precision. Up to now in this chapter we have not treated the problem which was introduced in Chapter 5.2 discussing the possibility that Mr.K. has a certain disease. If supplementary information about an individual case is available, the prior probability for this individual may be different from that used in the diagnostic system. In Chapter 5.2 two equivalent methods of adapting the diagnostic system to this situation were described, assuming that all probabilities are given as numbers.
If those probabilities are given as intervals, both methods of adaptation may still be used, but it seems to be much easier to employ the one which is mentioned under number 4. We shall briefly discuss this method and its execution. As long as the supplementary information has the same structure as information gained through questions, Formula (6.39) can be transformed by using Z].i,l in the same way as in Formula (5.11). This presupposes, however, that estimates are given for P1.1 and for ~1.1. In many cases such estimates will not be available or it will not be sensible to conceive probabilities like Pl~l and ~1+~ at all. This will be true, if the diagnostic system was calculated some time ago and the prior probability has changed in the meantime. In such a case it is impossible to extend (6.40a) and (6.40b) to (1+1). On the other hand, however, any extension of (6.40a) could at most lead to a diminished value of Max ~(~i,-..,~0, which would result in an increased lower limit of (6.39); in a symmetric way any extension of (6.40h) could at most lead to a decreased upper limit of (6.39). Therefore we recommend the use of Max ~(~1,...,~0 from (6.40a) and Min ~(wl,...,~1) from (6.40b) also in the extended version of (6.39). Further generalizations which increase the number of states of nature beyond 2 or the number of distinguishable signs for each unit to more than 2, are postponed for more specialized studies. Approximate solutions are possible if one state of nature, respectively one sign, is selected and the remaining states of nature or the remaining signs are considered collectively so that the distinction is only made between two states of nature, respectively two signs. The mode of operation of the methods described in Chapter 6 is demonstrated by means of a more detailed example in the next chapter.
CHAPTER
7
A Demonstration of the Use of Interval Estimation In this chapter we wish to demonstrate the performance of interval estimation, as described in the preceding chapter, by use of an example which consists of a tiny expert system. We assume that the aim of this expert system is to provide a statement concerning only two states of nature. Let us for simplicity call the one state of nature "failure" and the other one "normality". Then it is possible to use the same symbols as in the preceding chapter: EF and EN. The prior information contains an estimate of the probability of EF: This probability lies between 1% and 2.5%. If we employ the symbols L and U as in the last chapter, we have L = 0.010 and U = 0.025. The expert system poses four questions. Each of the questions allows only two answers. We shall designate them by "yes" and "no". We assume that intervals are given for both answers which describe the probability of failure if either answer is given. For the answer "yes" we designate these limits by Li and U] (j = 1,...,4) and for the answer "no" we use ~l and g I. Concerning Question 1 our information is given by: L1 = 0.050
U1 = 0.100
I~ = 0.000
U1 = 0.005.
The respective information for Question 2 is: L2 = 0.100
U2 = 0.150
1:2 = 0.010
C~ = 0.020,
for Question 3: L3 = 0.050
U3 = 0.150
I~3 = 0.005
U3 = 0.010
and for Question 4: L4 = 0.015
U4 = 0.020
I:4 : o. 005
v4 : o. 050.
It should be noted that for the first three questions the answer "yes" is obviously in favour of the state of failure while answer "no" favours normality. In the case of Question 4 the situation is different: answer "yes" contains much more information than answer "no", because answer "yes" gives narrower limits for the probability of a failure, while the limits in case of answer "no" are much wider and the interval [L4,U4] is contained in [L4,U4]. Using this information we shall calculate the probability of a failure in the case that a certain sequence of answers is given. At first we must prove the interval-admissibility of the four estimates
122
with respect to the prior information: [L,U]. This is done by calculating the probability intervals [ pL, p~ ] using the Equations (6.9) and (6.10). In the case of Question 1 this leads to: : 0.2
pV : o . 2 1 o 5 .
: o.o
: 0.0385.
In the case of Question 2 the result is:
If we use the same method for calculating the limits of P3 we get: pL=0.1111 pU:0.1071. This result informs us of the fact that the system is not interval-admissible. We have to decide whether we believe the estimates [L,U] or the limits [L3,U3] respectively [E3,U3]. We assume that there is no reason to doubt the prior information and also that we decide not to change the interval [Es,U3]. Therefore we can either derive a new lower limit L3 or a new upper limit Us. We decide to decrease the upper limit using Formula (6.15) and obtain U3ew = 0.145. With this solution we accept the degeneration of the interval for Ps to the single value P3 = 0.1111. Question 4 corresponds to what was referred to as Case c) in Chapter 6, where it was shown that this case is derived from Case b), if w4 is exchanged for 54 and p4 for l-p4. Therefore we have to derive an interval [pL4;1] as an estimate for P4, the probability of the answer "yes" to Question 4. The two solutions according to (6.20) are: lp L = 0 . 5 ,
2pL4=0.8333.
Again it was decided not to alter the prior probabilities but to use this information in order to improve the estimate for ~4, the probability of failure in case of the answer "no" to Question 4. Therefore we calculate ~ e w : o - ~p~ u4 : o.
03,
1 - lp L which corresponds to Equation (6.25). With the necessary corrections the interval-admissible system is as follows: L1 = 0.050
U, = 0.100
I;1 = 0.000
lJi = 0.005
0.2 < p~ < 0.2105 L2 = O. 100
U2 = O. 150
[:2 = 0.010
172 = 0.020
0.0 < P2 ~ 0.0385
123
L3 = 0.050 1;3= 0.005
II3= 0.145 U~= 0.010
P3 = 0.1111
L4 = 0.015
U4 = 0.020
1:4 = 0.oo5
u4 = o:o3o
0 . 5 _< P4 <- 1 . 0 .
Now we suppose that the answers to those four questions are double independent so that we can combine the information contained in the answers to different questions by means of the formulas derived in Chapter 6. First of all we calculate the probability of a failure if the answer to the first two questions is "yes". If we do not use any additional information we apply Formula (6.2), the trivial estimate, and arrive at the following interval for x,÷: 0.1857 < x** _<0.6600. If we apply the combination rule (6.32), we calculate, using (6.31): Max 0;(L1,L2) = 0.0145 (instead of U = 0.025 for TR) and Min x(U1,U2) = 0.02 (instead of L = 0.01 for TR) and obtain the following limits: 0.2848 < x** <_0.4900. Therefore the probability of a failure is at least 28%, although each individual probability does not exceed 15%. For the last time we report the result of the application of the Dempster-Shafer rule: 0.0144 < x** < 0.0173. Again this is an absurd one! Now let us assume that the answer to Question 3 was "yes" too. If we use the trivial solution we obtain: 0.3189 <_x+÷, _<0.9702. While this result shows the tendency to increase the probability of failure due to the third answer "yes", the interval is even larger than that in the case of only Question 1 and Question 2. If we employ the combination rule (6.39) to these three answers we obtain the following result: 0. 5890 _<x~++ <_O. 8830 (using Max v(Lt,L2,L3) = 0.0144 and Min 0~(U1,U2,U3) = 0.0206 apart from differences which are caused by rounding). Obviously a third answer "yes" has increased the probability of a failure considerably.
With respect to Question 4, we first assume that once again the answer was "yes", remembering that this time the answer "yes" does not necessarily indicate an increase in the probability of failure, because the interval for ω̄4 includes the interval for ω4. The application of (6.38) leads to

0.2175 ≤ x++++ ≤ 0.9850 ,

which shows a loss of information even for the answer "yes", because the lower limit has decreased and the upper limit has increased. Therefore, if the answer is "yes" and we employ the trivial solution, Question 4 is of no use. With the combination rule the result is

0.5982 ≤ x++++ ≤ 0.8801

(with Max φ(L1,L2,L3,L4) = 0.0144 and Min ψ(U1,U2,U3,U4) = 0.0206). Indeed this shows some progress compared with the result stemming from the first three questions. Therefore we may conclude that Question 4 may be useful to those who employ the combination rule (6.39), as we in fact recommend.

It should be taken into account that asking Question 4 includes the possibility that the answer is "no". The probability of this result, (1 - p4), lies between 0 and 0.5 and must therefore be taken seriously into consideration. So it is useful to calculate additionally the interval estimate for the probability of failure in the case of the sequence of answers "yes, yes, yes, no". If TR is used the result is:

0.0840 ≤ x+++- ≤ 0.9901 .

This is a remarkable deterioration compared with the result for the first three questions. If TR is employed, any answer to Question 4 therefore results in a loss of information in the case that the first three questions are answered with "yes". Somebody who uses TR should avoid Question 4 in this case. If we employ the combination rule we obtain:

0.3296 ≤ x+++- ≤ 0.9175 ,

which again is a much weaker statement than the one gained from the first three answers. Since we are concerned with the application of the combination rule (6.39) we may conclude: if the answer to the first three questions is "yes", the answer to Question 4 results either in a small gain in information, if it is "yes", or in a serious loss of information, if it is "no". It is good policy to exclude this question in this particular case. Furthermore, one should check whether the situation is different at all if other answers are given to the first three questions. Otherwise it would be better to avoid Question 4 from the beginning.
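Continuing the sketch above (it reuses the hypothetical failure_interval together with L and U defined there), the four-answer cases just discussed can be checked numerically; the limits agree with the quoted values up to the rounding of Max φ and Min ψ:

```python
# Fourth answer "yes": append the "yes" limits [L4, U4] = [0.015, 0.020].
print(failure_interval([0.05, 0.10, 0.05, 0.015], [0.10, 0.15, 0.145, 0.020], U, L))
# TR: approx. (0.2176, 0.9850); text: 0.2175, 0.9850
print(failure_interval([0.05, 0.10, 0.05, 0.015], [0.10, 0.15, 0.145, 0.020], 0.0144, 0.0206))
# combination rule: approx. (0.60, 0.88); text: 0.5982, 0.8801

# Fourth answer "no": append the "no" limits [Lbar4, Ubar4] = [0.005, 0.030].
print(failure_interval([0.05, 0.10, 0.05, 0.005], [0.10, 0.15, 0.145, 0.030], U, L))
# TR: (0.0840, 0.9901)
print(failure_interval([0.05, 0.10, 0.05, 0.005], [0.10, 0.15, 0.145, 0.030], 0.0144, 0.0206))
# combination rule: approx. (0.33, 0.92); text: 0.3296, 0.9175
```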
It is a remarkable feature of diagnostic systems using interval estimates for ωj and ω̄j that questions like Question 4 may be a nuisance in all cases, because the intervals [Lj,Uj] and [L̄j,Ūj] are so wide or because they belong to Case b) or Case c), as defined in Chapter 6.2. Therefore we should test which of the possible questions are of use and which of them would spoil the information gained from the others and should consequently be avoided.

Concluding this argument we assume that Question 4 has been omitted and that therefore x+++, estimated through (6.39), may be taken as the final result for an individual case for which the estimate of the prior probability, as used in the diagnostic system, is correct. Should, however, supplementary information consist of an increased prior probability, an adaptation of the result is necessary. Let us assume that for this individual case

L' = 0.04     and     U' = 0.06
and that it is impossible to conceive probabilities p5 and ω̄5, which would be necessary if this information were to be used like information stemming from a "Question 5". Following the recommendation given in Chapter 6 we adapt the limits for x+++ gained so far by using L1·L2·L3·L' and U1·U2·U3·U' instead of L1·L2·L3 and U1·U2·U3, while Max φ(L1,L2,L3) and Min ψ(U1,U2,U3) remain unchanged. This leads to the final result

0.8029 ≤ x+++ ≤ 0.9583 ,

which demonstrates, compared with the corresponding limits for x+++ (0.5890 and 0.8830), the importance of the adaptation in this case. Further practical studies of diagnostic systems conceived in the way described in Chapter 6 will provide valuable information concerning the application of these methods.
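The effect of the adaptation can likewise be checked with the sketch above, under our own reading of the recommendation: the case-specific prior interval [L',U'] enters the products like one further independent item of information, while Max φ and Min ψ from the first three questions are kept. This reconstruction reproduces the quoted limits up to the rounding of Max φ and Min ψ:

```python
# Adaptation to the case-specific prior [L', U'] = [0.04, 0.06],
# reusing the hypothetical failure_interval from the sketch above:
print(failure_interval([0.05, 0.10, 0.05, 0.04], [0.10, 0.15, 0.145, 0.06], 0.0144, 0.0206))
# approx. (0.80, 0.96); the text gives 0.8029 and 0.9583
```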
APPENDIX
Application of Formula (3.21) to Structures Defined by k-PPDs
We have

P(Ei*) = [ P1(Ei*) · ... · Pl(Ei*) ] / [ Σ_{i=1}^{k} P1(Ei) · ... · Pl(Ei) ]                (3.21)

and

Sup_{Pj ∈ Sj*} Pj(Ei) = Uji ,     Inf_{Pj ∈ Sj*} Pj(Ei) = Lji ,     i = 1,...,k;  j = 1,...,l.

We designate

Ui* := Sup_{Pj ∈ Sj*, j=1,...,l} P(Ei*) ,     Li* := Inf_{Pj ∈ Sj*, j=1,...,l} P(Ei*) ,     i = 1,...,k,

and confine ourselves at first to the case l = 2. We start by calculating Ui* for a fixed i*, if Pj ∈ Sj*, j = 1,2:
Ui* = Sup [ P1(Ei*) · P2(Ei*) ] / [ Σ_{i=1}^{k} P1(Ei) · P2(Ei) ]
    = Sup [ 1 + Σ_{i≠i*} P1(Ei) · P2(Ei) / ( P1(Ei*) · P2(Ei*) ) ]^{-1}
    = [ 1 + Inf Σ_{i≠i*} P1(Ei) · P2(Ei) / ( U1i* · U2i* ) ]^{-1} .

Evidently the quotient of the sum in the numerator and P1(Ei*) · P2(Ei*) reaches its smallest value if P1(Ei*) = U1i* and P2(Ei*) = U2i*. Therefore Ui* is calculated by solving a problem of non-linear programming: the search for Inf Σ_{i≠i*} P1(Ei) · P2(Ei) under the restrictions defined by the two k-PRIs, combined with

Σ_{i≠i*} P1(Ei) = 1 - U1i*     and     Σ_{i≠i*} P2(Ei) = 1 - U2i* .                (*)
It is easy to see that the infimum cannot be found in the interior of the two sub-structures defined by (*); it must appear in a situation which may be described as a combination of two corners, one belonging to each of the sub-structures defined by (*). Therefore the search for the infimum, which in this case is indeed a minimum, can be achieved by means of a computer program which constructs all pairs of corners belonging to the two sub-structures.
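The corners in question can be enumerated directly: each sub-structure defined by (*) is a set of the form {x : l ≤ x ≤ u, Σ x = s}, and at a corner of such a set at most one coordinate lies strictly between its bounds. A minimal sketch of such an enumeration (in Python; the function name corners and its interface are our own, not the computer program referred to above):

```python
from itertools import product

def corners(lower, upper, total, tol=1e-9):
    """All corners (vertices) of {x : lower[i] <= x[i] <= upper[i], sum(x) = total}.
    At a corner at most one coordinate is strictly between its bounds, so every
    other coordinate is fixed at one of its bounds and a single free coordinate
    absorbs the remaining mass."""
    n = len(lower)
    found = []
    for free in range(n):
        others = [i for i in range(n) if i != free]
        for choice in product(*[(lower[i], upper[i]) for i in others]):
            rest = total - sum(choice)
            if lower[free] - tol <= rest <= upper[free] + tol:
                x = [0.0] * n
                for i, value in zip(others, choice):
                    x[i] = value
                x[free] = min(max(rest, lower[free]), upper[free])
                if all(max(abs(a - b) for a, b in zip(x, y)) > tol for y in found):
                    found.append(x)
    return found
```

For instance, corners([0.1, 0.2, 0.3], [0.4, 0.5, 0.7], 0.8) lists the corner points of the sub-structure used for U1 in the example given at the end of this appendix.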
The values of Li*, if Pj ∈ Sj*, j = 1,2, are found in an analogous way:

Li* = Inf [ P1(Ei*) · P2(Ei*) ] / [ Σ_{i=1}^{k} P1(Ei) · P2(Ei) ]
    = [ 1 + Sup Σ_{i≠i*} P1(Ei) · P2(Ei) / ( L1i* · L2i* ) ]^{-1} ,

if L1i* · L2i* ≠ 0. If L1i* · L2i* = 0, then obviously Li* = 0. This time we search for Sup Σ_{i≠i*} P1(Ei) · P2(Ei) under the restrictions defined by the two k-PRIs, combined with

Σ_{i≠i*} P1(Ei) = 1 - L1i*     and     Σ_{i≠i*} P2(Ei) = 1 - L2i* .                (**)
As this supremum must appear in a situation in which two corners of the sub-structures (**) are combined, it is indeed a maximum and may also be found by means of a computer program.

Whilst the procedures described up to now are sufficient to apply (3.21) to a situation in which two k-PRIs are combined, the extension of these methods to situations in which more than two k-PRIs are involved consists of a stepwise repetition of the same kind of analysis: once the first two k-PRIs have been combined, resulting in a single one, this can be combined with a third one, and so on. The final result does not depend upon the order of these operations, because commutativity and associativity are guaranteed. We shall demonstrate the application of Formula (3.21) to two k-PRIs with the following example.

Example: Let k = 4 and

L11 = 0.0   U11 = 0.2   L21 = 0.4   U21 = 0.5
L12 = 0.1   U12 = 0.4   L22 = 0.0   U22 = 0.2
L13 = 0.2   U13 = 0.5   L23 = 0.1   U23 = 0.3
L14 = 0.3   U14 = 0.7   L24 = 0.2   U24 = 0.4
The calculation of U1 requires the use of

Inf Σ_{i=2}^{4} P1(Ei) · P2(Ei)

when

Σ_{i=2}^{4} P1(Ei) = 0.8     and     Σ_{i=2}^{4} P2(Ei) = 0.5 .
Checking all pairs of corners, we find the minimum to be 0.12 (for instance if P1(E2) = 0.3, P1(E3) = 0.2, P1(E4) = 0.3 and P2(E2) = 0.0, P2(E3) = 0.3, P2(E4) = 0.2). Therefore we can calculate:

U1 = [ 1 + 0.12 / (0.2 · 0.5) ]^{-1} = 0.4545 .
In the same way we derive U2 = 0.4706, U3 = 0.7143 and U4 = 0.9333. Concerning the lower limits it is immediately seen that L1 = 0 and L2 = 0.
The calculation of L3 uses

Sup [ P1(E1) · P2(E1) + P1(E2) · P2(E2) + P1(E4) · P2(E4) ]

with

P1(E1) + P1(E2) + P1(E4) = 0.8     and     P2(E1) + P2(E2) + P2(E4) = 0.9

and results in:

L3 = [ 1 + 0.3 / (0.2 · 0.1) ]^{-1} = 1/16 = 0.0625 .

In the same manner we calculate L4 = 0.2143, so that our result of applying (3.21) to the combination of the two 4-PRIs is:

L1 = 0.0      U1 = 0.4545
L2 = 0.0      U2 = 0.4706
L3 = 0.0625   U3 = 0.7143
L4 = 0.2143   U4 = 0.9333 .
[]
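As a numerical check of this example, the following sketch enumerates all pairs of corners by brute force; it reuses the hypothetical corners helper sketched earlier in this appendix, and combine_two_PRIs is our own name, not the authors' program:

```python
def combine_two_PRIs(L1, U1, L2, U2):
    """Apply (3.21) to two k-PRIs: for each i*, fix P1(Ei*) and P2(Ei*) at the
    relevant bound and optimise the cross sum over all pairs of corners of the
    sub-structures (*) and (**). Assumes both k-PRIs are feasible."""
    k = len(L1)
    lower_out, upper_out = [], []
    for s in range(k):
        rest = [i for i in range(k) if i != s]
        # Upper limit: minimise sum_{i != i*} P1(Ei) * P2(Ei).
        V1 = corners([L1[i] for i in rest], [U1[i] for i in rest], 1 - U1[s])
        V2 = corners([L2[i] for i in rest], [U2[i] for i in rest], 1 - U2[s])
        m = min(sum(a * b for a, b in zip(x, y)) for x in V1 for y in V2)
        upper_out.append(1.0 / (1.0 + m / (U1[s] * U2[s])))
        # Lower limit: maximise the same sum, unless L1i* * L2i* = 0.
        if L1[s] * L2[s] == 0:
            lower_out.append(0.0)
        else:
            W1 = corners([L1[i] for i in rest], [U1[i] for i in rest], 1 - L1[s])
            W2 = corners([L2[i] for i in rest], [U2[i] for i in rest], 1 - L2[s])
            M = max(sum(a * b for a, b in zip(x, y)) for x in W1 for y in W2)
            lower_out.append(1.0 / (1.0 + M / (L1[s] * L2[s])))
    return lower_out, upper_out

# The two 4-PRIs of the example:
L1 = [0.0, 0.1, 0.2, 0.3]; U1 = [0.2, 0.4, 0.5, 0.7]
L2 = [0.4, 0.0, 0.1, 0.2]; U2 = [0.5, 0.2, 0.3, 0.4]
low, up = combine_two_PRIs(L1, U1, L2, U2)
# Expected: low approx. [0.0, 0.0, 0.0625, 0.2143], up approx. [0.4545, 0.4706, 0.7143, 0.9333]
```

The brute-force search over corner pairs is feasible here because k is small; for larger k the number of corner pairs grows quickly, which is why a dedicated program is recommended above.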
REFERENCES

ANGER, B., LEMBCKE, J. (1985): Infinitely subadditive capacities as upper envelopes of measures, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 68, 403-414.
BUCHANAN, B.G., SHORTLIFFE, E.H. (1985) (Eds.): Rule-based expert systems: the MYCIN experiments of the Stanford heuristic programming project, Addison-Wesley.
CHOQUET, G. (1953/54): Theory of capacities, Ann. Inst. Fourier 5, 131-295.
CHOQUET, G. (1959): Forme abstraite du théorème de capacitabilité, Ann. Inst. Fourier 9, 83-89.
DEMPSTER, A.P. (1966): New methods for reasoning towards posterior distributions based on sample data, Ann. Math. Stat. 37, 355-374.
DEMPSTER, A.P. (1967a): Upper and lower probabilities induced by a multivalued mapping, Ann. Math. Stat. 38, 325-339.
DEMPSTER, A.P. (1967b): Upper and lower probability inferences based on a sample from a finite univariate population, Biometrika 54, 515-528.
DEMPSTER, A.P. (1968a): A generalization of Bayesian inference, J. R. Stat. Soc., Ser. B, Vol. 30, 205-247.
DEMPSTER, A.P. (1968b): Upper and lower probabilities generated by a random closed interval, Ann. Math. Stat. 39, 957-966.
DUDA, R.O., HART, P.E., NILSSON, N.J. (1976): Subjective Bayesian methods for rule-based inference systems, in: Proceedings of the 1976 National Computer Conference, AFIPS, Vol. 45, 1075-1082.
GENEST, Ch., ZIDEK, J.V. (1986): Combining probability distributions: A critique and an annotated bibliography, Stat. Science 1, 114-148.
HUBER, P.J. (1973): The case of Choquet capacities in statistics, Bull. Int. Stat. Inst., Vol. XLV, Book 4, 181-188.
HUBER, P.J., STRASSEN, V. (1973): Minimax tests and the Neyman-Pearson lemma for capacities, Annals of Statistics 1, 251-263.
HUBER, P.J. (1976): Kapazitäten statt Wahrscheinlichkeiten? Gedanken zur Grundlegung der Statistik, Jahresberichte der Deutschen Mathematiker-Vereinigung 78, 81-92.
KANAL, L.N., LEMMER, J.F. (1986) (Eds.): Uncertainty in artificial intelligence, Machine intelligence and pattern recognition, Vol. 4, North-Holland.
LEHRER, K., WAGNER, C.G. (1981): Rational consensus in science and society, D. Reidel, Dordrecht, Holland.
PEARL, J. (1988): Probabilistic reasoning in intelligent systems: Networks of plausible inference, Morgan Kaufmann, San Mateo, California.
SHAFER, G. (1975): A mathematical theory of evidence, Princeton University Press.
SHORTLIFFE, E.H. (1976): Computer-based medical consultations: MYCIN, Elsevier Computer Science Library.
STRASSEN, V. (1964): Meßfehler und Information, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 2, 273-305.
WALLEY, P., FINE, T.L. (1982): Towards a frequentist theory of upper and lower probability, Annals of Statistics 10, 741-761.
ZADEH, L.A. (1979): On the validity of Dempster's rule of combination of evidence, Memorandum No. UCB/ERL M 79/24, Electronic Research Laboratory, University of California, Berkeley.
ZADEH, L.A. (1986): A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination, The AI Magazine, Summer 1986, 85-90.