    ∀x COMPANY(x) ⇒ [¬SS-TAXED(x) ⇒ x = Bull]
From a proof-theoretic point of view, circumscription offers the exciting prospect of a general formulation, as opposed to CWA and subimplication, for which not all theories are eligible for treatment by existing proof procedures. In fact, circumscription benefits from a very elegant proof-theoretic approach through a mere axiom schema, the so-called circumscription schema.

Definition
The circumscription schema for circumscription of the relation R in the theory T (in the form of the sentence† T{R, R₁, ..., Rₙ}) with varying relations R₁, ..., Rₙ is of the form

    [T{R', R'₁, ..., R'ₙ} ∧ (∀x R'(x) ⇒ R(x))] ⇒ (∀x R(x) ⇒ R'(x))

where R', R'₁, ..., R'ₙ are formulae and T{R', R'₁, ..., R'ₙ} is the theory T{R, R₁, ..., Rₙ} in which all occurrences of R, R₁, ..., Rₙ are replaced by R', R'₁, ..., R'ₙ respectively.

Intuition is often lost when it comes to selecting formulae for replacing the formula parameters R', R'₁, ..., R'ₙ in the circumscription schema. It is a difficult task (Reiter, 1982; Besnard, 1984; Lifschitz, 1985a) to find an instance† (of the circumscription schema) that ultimately leads to what could
† A (finite) theory can always be written in the form of a formula in which all variables occurring are quantified.
† An instance of a schema is a formula that results from substituting formulae for the formula parameters in the schema.
be called conclusive formulae. This is why it seems prudent to provide the reader with at least a rough idea of what the circumscription schema means. First, T{R', R'₁, ..., R'ₙ} testifies, if deducible from T, that R' is admissible for R in the sense that the way R is (incompletely) specified in T does not preclude R from being R' (only extensionally, of course). Secondly, (∀x R'(x) ⇒ R(x)) ensures that R is required (by the formulae of T) to be true whenever R' is true, so that if R' is admissible for R then restricting R to R' (by means of the formula (∀x R(x) ⇒ R'(x))) is the least that can be done in order to have R minimized. The circumscription schema is to be added to first-order predicate calculus† (a brief account of it is given in the Introduction, Appendix A, p. 8) for use by first-order inference rules as any first-order axiom schema, thus disturbing first-order predicate calculus as little as possible.

Definition

A formula F is derivable by circumscription of the relation R in the theory T with the relations R₁, ..., Rₙ varying, denoted T ⊢C/R(P) F, iff F is derivable by first-order predicate calculus from T supplemented with the corresponding circumscription schema, in symbols C[T{R/R₁, ..., Rₙ}] ⊢ F.†

† First-order predicate calculus is the syntactical part of first-order logic.
† The symbol ⊢ is the usual notation of derivability by first-order logic.

Let us see how all this works in practice. Returning to our illustration, let us circumscribe COMP-MANUF, with SS-TAXED being allowed to vary, in the theory T₄ consisting of the two following formulae:
    COMP-MANUF(Bull)
    ∀x COMPANY(x) ∧ ¬COMP-MANUF(x) ⇒ SS-TAXED(x)
Let us consider the instance yielded by the circumscription schema for the substitutions

    COMP-MANUF'(x) ≡ x = Bull
    SS-TAXED'(x) ≡ ¬x = Bull
Then T₄{COMP-MANUF', SS-TAXED'} consists of
    Bull = Bull
    ∀x COMPANY(x) ∧ ¬x = Bull ⇒ ¬x = Bull
Both formulae are valid, and, accordingly, they can be derived from any theory; the former is an immediate consequence of the axiom of reflexivity of
equality ∀x x = x; the latter comes from the axiom schema (A ∧ B) ⇒ B. As regards ∀x COMP-MANUF'(x) ⇒ COMP-MANUF(x), it is the formula

    ∀x x = Bull ⇒ COMP-MANUF(x)

which can be derived from COMP-MANUF(Bull) by virtue of Leibniz' substitutivity schema, namely ∀x∀y x = y ∧ A(x) ⇒ A(y) for every formula
A. Since the left member of the considered instance of the circumscription schema can be derived from T₄, we can use modus ponens (from formulae A and A ⇒ B infer formula B) to get ∀x COMP-MANUF(x) ⇒ COMP-MANUF'(x), that is,

    ∀x COMP-MANUF(x) ⇒ x = Bull
At this stage, we have used the circumscription schema directly to obtain the formula ∀x COMP-MANUF(x) ⇒ x = Bull, which can thus be said to be derivable from T₄ by circumscription (of COMP-MANUF in T₄ with SS-TAXED being allowed to vary). Symbolically, T₄ ⊢C/R(P) ∀x COMP-MANUF(x) ⇒ x = Bull, where C/R(P) denotes the circumscription of COMP-MANUF in T₄ with SS-TAXED being allowed to vary. From this formula and by sticking to pure first-order predicate calculus, it is possible to arrive at T₄ ⊢C/R(P) ∀x COMPANY(x) ⇒ [¬SS-TAXED(x) ⇒ x = Bull]. Details are as follows. The second formula of T₄ can be written in the form

    ∀x COMPANY(x) ⇒ [¬SS-TAXED(x) ⇒ COMP-MANUF(x)]

from which, using ∀x COMP-MANUF(x) ⇔ x = Bull (which can be derived from the formula obtained above and the formula ∀x x = Bull ⇒ COMP-MANUF(x) that we have already seen to be derivable from the formula COMP-MANUF(Bull) of T₄), we conclude

    ∀x COMPANY(x) ⇒ [¬SS-TAXED(x) ⇒ x = Bull]
Deciding which formulae are to be substituted for the formula parameters in the circumscription schema is fundamental with respect to obtaining conclusive formulae (particularly ∀x COMP-MANUF(x) ⇒ x = Bull), as many instances of the circumscription schema, for example the one arising from substituting COMPANY(x) ∧ x = Bull for COMP-MANUF'(x), lead nowhere.

We now consider non-monotonicity of circumscription through the theory T₅, which is the theory T₄ supplemented with the formula
    COMP-MANUF(Peugeot)
It is not possible to have

    ∀x COMP-MANUF(x) ⇒ x = Bull

derivable from T₅ by circumscription, i.e. circumscription is non-monotonic. However, it is possible to derive

    ∀x COMP-MANUF(x) ⇒ x = Bull ∨ x = Peugeot

from T₅ by circumscription, using the substitutions x = Bull ∨ x = Peugeot for COMP-MANUF'(x) and ¬x = Bull ∧ ¬x = Peugeot for SS-TAXED'(x).
6 CONCLUDING REMARKS
The account presented of the preferential-models approach to non-monotonic logics seems to leave out two major contributions to non-monotonic reasoning: default logic (Reiter, 1980) and autoepistemic logic (Chapter 4 and Moore, 1985). For the former, specifying an appropriate preemption preordering is rather easy for a restricted fragment like the one corresponding to CWA. Unfortunately, things get much more involved if non-atomic formulae are taken into account. For autoepistemic logic, working with a modal language adds another difficulty, mainly because a logic in which theories consist only of first-order formulae is easier to capture by means of a preemption preordering (rendering the effect of the non-monotonic, perhaps content-specific, inference rules). In any case, both default logic and autoepistemic logic are more general than CWA, subimplication and circumscription in that they require maximization of one or several relations in many theories.

That it would be impossible to characterize certain existing non-monotonic logics through the preferential-models approach would just mean that their semantic bases rely on non-constructive fixed points, which cannot be captured by any preemption preordering. It is only a matter of expression. But the principle of the preferential-models approach is not disputed: first formalize our intuitions about non-monotonic reasoning within a model-theoretic setting and then devise a formal system for it (thereby being concerned with constructing a subclass of the preferential models defined by means of non-constructive fixed points).
BIBLIOGRAPHY

AI (1980). Special Issue on Non-monotonic Logics. Artificial Intelligence 13, nos. 1-2. (Apart from the closed-world assumption, introduces the very first non-monotonic logics.)
Besnard, P. (1984). Vers une caractérisation de la circonscription. Rapport Inria. (Some results on circumscription, about its proof theory.)
Bossu, G. and Siegel, P. (1982). Nonmonotonic Reasoning and Databases. Advances in Database Theory (ed. H. Gallaire, J. Minker and J.-M. Nicolas), pp. 239-284. Plenum Press, New York. (A first (technical) glance at subimplication.)
Bossu, G. and Siegel, P. (1985). Saturation, Nonmonotonic Reasoning, and the Closed World Assumption. Artificial Intelligence 25, 13-63. (Where subimplication is defined.)
Chang, C. C. and Lee, R. C. T. (1973). Symbolic Logic and Mechanical Theorem Proving. Academic Press, New York. (The most readable textbook on Herbrand models and their role in the design of proof-theoretic systems, especially resolution.)
Colmerauer, A. (1979). Sur les bases théoriques de Prolog. Rapport de Recherche, Université d'Aix-Marseille 2. (A paper roughly along the lines of the previous item, but devoted to a particular system, namely Prolog (the logic programming language).)
Etherington, D. W., Mercer, R. E. and Reiter, R. (1985). On the adequacy of predicate circumscription for closed world reasoning. Computational Intelligence 1, 11-15. (A must as far as the formal study of circumscription is concerned.)
Fariñas del Cerro, L. and Herzig, A. (1988). An automated modal logic of elementary changes. Chapter 2 of this book. (An introduction to a different view of logic programming.)
Lifschitz, V. (1985a). Computing circumscription. Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-85), pp. 121-127. Kaufmann, Los Angeles. (As the title says.)
Lifschitz, V. (1985b). Closed world databases and circumscription. Artificial Intelligence 27, 229-235. (Results on when circumscription and the closed-world assumption meet.)
McCarthy, J. (1980). Circumscription - a form of non-monotonic reasoning. Artificial Intelligence 13, 27-39. (A beautiful exposition of the motivation for circumscription and its definition.)
McCarthy, J. (1986). Applications of circumscription to formalizing commonsense knowledge. Artificial Intelligence 28, 89-116. (More insight into circumscription.)
Marchal, B. (1988). Modal logic - a brief tutorial. Appendix B, Introduction to this book. (The basic definitions and concepts of modal logic.)
Moore, R. C. (1985). Semantical considerations on non-monotonic logic. Artificial Intelligence 25, 75-94. (Together with circumscription and default logic, the best proposal within the field of non-monotonic logics. Highly technical.)
Moore, R. C. (1988). Autoepistemic logic. Chapter 4 of this book. (Good background reading for the previous paper.)
Nicolas, J.-M. (1979). Contribution à l'étude théorique des bases de données: apports de la logique mathématique. Thèse d'état, Onera-Cert, Toulouse. (A study of the application of logic to database theory.)
NMRW (1984). Non Monotonic Reasoning Workshop, New Paltz. (A collection of papers, the standard of which is rather mixed but including valuable contributions to the study of non-monotonic logics.)
Reiter, R. (1978). On closed world databases. Logic and Databases (ed. H. Gallaire and J. Minker), pp. 55-76. Plenum Press, New York. (A pioneering and very insightful paper on non-monotonicity in logic systems.)
Reiter, R. (1980). A logic for default reasoning. Artificial Intelligence 13, 81-132. (An essential work in the field of non-monotonic logics.)
Reiter, R. (1982). Circumscription implies predicate completion (sometimes). Proc.
American Association for Artificial Intelligence Conf. (AAAI-82), pp. 418-420.
Kaufmann, Pittsburgh. (One more result on circumscription.)

DISCUSSION
C. Froidevaux: 1 Informal introduction to circumscription and CWA. Commonsense reasoning supposes very wide knowledge, so we prefer to use general statements rather than to store all the elementary facts they could generate. But in everyday life there are exceptions to almost all general assertions, and they are too many to all be mentioned explicitly: our knowledge is necessarily incomplete. Thus we need to use general statements admitting exceptions. This process introduces non-monotonicity in commonsense reasoning. The prototypical example is the classical inference about birds: from the facts that generally birds can fly and that ostriches are birds but cannot fly, we can infer that Tweety, provided that it is a bird, can fly. But if we subsequently discover that Tweety is an ostrich, the ability of Tweety to fly can no longer be derived. To express such general statements, the circumscription approach uses classical logic and introduces abnormality predicates. (We should point out that there are as many abnormality predicates as types of exceptions.) In the case of birds, we say that a bird that is not abnormal (with respect to the ability to fly) can fly. We obtain the following first-order formulae:
    (∀x) (bird(x) ∧ ¬abnormal(x) ⇒ flies(x))
    (∀x) (ostrich(x) ⇒ bird(x))
    (∀x) (ostrich(x) ⇒ ¬flies(x))
    bird(Tweety)
From these formulae, it can be inferred that an ostrich is abnormal, but to deduce that Tweety flies, we must know that only ostriches are abnormal. We require the well-known default principle (cf. Chapter 7): if something relevant is an abnormal object then this is explicitly stated; otherwise it can be reasonably considered as normal. More generally, this principle extends to the following one: the objects that can be shown to have a certain property P (here abnormal) are all (and only) the objects that satisfy this property: we circumscribe the extension of the predicate P. In order to do this, we have to minimize its extension. From a semantic point of view, we have introduced a notion of preferability between models: models where the extension of P is minimal are preferable.

Another form of non-monotonicity occurs with the assumption that only positive information is represented and that negative information can be derived by completion. This process considerably simplifies the storage of the data: we need only give the relevant positive facts, while the amount of negative information is typically very high. In the context of relational databases, this is known as the closed-world assumption (CWA): we assume that a relation instance is true only if it is given explicitly or else implied by one of the universal rules defining the relation. (In general, only Horn theories are considered.)
For example, let us consider a universe of blocks and a table as follows: A, B and C are blocks, D is a table, A is green, C is red, A is ON D, C is ON D, B is ON C, and if object X is ON object Y and object Y is ON object Z then X is ON Z. Under the CWA, we restrict our attention to the world where A, B and C are the only blocks, D is the only table, A is the only green thing and C is the only red thing. Moreover, if object X is ON object Y then (X, Y) ∈ {(C, D), (B, C), (B, D), (A, D)}. Hence in this world the colours of B and of D remain unknown and A is neither on B nor on C (a small sketch of this computation appears below). Thus the CWA leads us to prefer models where every predicate has a minimal extension. While circumscription focuses on some predicates for the minimization process, the CWA deals with all predicates.

This brief presentation highlights the fact that for these two formalisms the notion of first-order model is insufficient to capture non-monotonicity, and suggests that we resort to a concept of preferability between models: not all models are desirable. The preference criterion obviously depends on the formalism considered. Non-monotonicity results from the following observation: if M is a preferable model for a set of axioms T and if T ⊂ T', then M is not even necessarily a model of T'.
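The following toy sketch (ours; the variable names are hypothetical) spells out the closed-world reading of the blocks example: the positive ON facts are saturated under the transitivity rule, and every ground ON atom not so derived is taken to be false:

    blocks, tables = {"A", "B", "C"}, {"D"}
    green, red = {"A"}, {"C"}                  # colour facts, kept for completeness
    on = {("A", "D"), ("C", "D"), ("B", "C")}

    changed = True
    while changed:                             # saturate ON(X,Y) ∧ ON(Y,Z) ⇒ ON(X,Z)
        changed = False
        for (x, y) in list(on):
            for (y2, z) in list(on):
                if y == y2 and (x, z) not in on:
                    on.add((x, z))
                    changed = True

    universe = blocks | tables
    negatives = {(x, y) for x in universe for y in universe} - on   # CWA: false ON facts
    print(sorted(on))                          # [('A','D'), ('B','C'), ('B','D'), ('C','D')]
    print(("A", "B") in negatives, ("A", "C") in negatives)         # True True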
2 A preemption preordering on the models. Besnard and Siegel provide a general framework to define the semantics of both formalisms. To every theory T they attach a partition P of the set of its relation symbols, so that P = (R=, R+, R-, R•). R= denotes the set of relations that must be identical in all models, R+ the set of relations to be maximized, R- the set of relations to be minimized and R• the set of relations allowed to vary from model to model. This partitioning is used to compare models: ≤P denotes the preemption ordering with respect to P on the models. Recall the example of Section 2 of Besnard and Siegel's chapter. Let T be the following theory:

    T = {MAN(Emos), MAN(Socrates),
         (∀x) (MAN(x) ∧ ORDINARY(x) ⇒ ¬AGUTIED(x)),
         (∀x) (... ⇒ MAN(x))}

The choice of the partition P for T refers to the intuition underlying commonsense reasoning. The second assertion means that generally a man does not suffer from aguty. As ORDINARY is the most frequent case, we shall maximize the extension of the predicate ORDINARY, while AGUTIED, being an exceptional state, will be minimized. Since there is no logical relation between the properties ORDINARY and AGUTIED on the one hand, and FAST-WALKER on the other hand, the predicate FAST-WALKER is allowed to vary: we have no preference about the fact that Emos (or Socrates) is a FAST-WALKER or not. We shall also compare only models where the extensions of the predicate MAN are the same. Hence we get the partition P = (R= = {MAN}, R+ = {ORDINARY}, R- = {AGUTIED}, R• = {FAST-WALKER}). Let M and N be the following interpretations:
    M[MAN] = {Emos, Socrates},        N[MAN] = {Emos, Socrates}
    M[ORDINARY] = { },                N[ORDINARY] = {Emos}
    M[AGUTIED] = {Socrates},          N[AGUTIED] = { }
    M[FAST-WALKER] = {Socrates},      N[FAST-WALKER] = {Emos}
M and N are both models of T, but N is preferable to M with respect to the preemption ordering attached to P: N ≤P M. However, N is not yet the expected model for T. Let N' be as follows:
    N'[MAN] = {Emos, Socrates},       N'[ORDINARY] = {Emos, Socrates}
    N'[AGUTIED] = { },                N'[FAST-WALKER] = {Socrates, Emos}
N' is a model of T that is preferable to N, and no model is preferable to N'. We say that N' is a minimal model for ≤P. N' is a desirable model in the sense that we prefer overall a model where Emos and Socrates are ordinary men, who do not suffer from aguty. Minimal models for ≤P play a fundamental role in the specification of the semantics of the CWA and circumscription formalisms. It remains to give the form of the partition P precisely for each formalism. For the CWA, R- consists of all relations and R=, R+, R• are all empty sets. For circumscription the partition is more elaborate: if Q is the predicate to be circumscribed and R₁, ..., Rₙ the parameter relations then R= consists of all relations of the theory minus Q, R₁, ..., Rₙ; R- = {Q}; R+ = { }; and R• = {R₁, ..., Rₙ}. Unfortunately, the critical choice of the parameter relations is principally motivated by technical reasons, so that intuition is often missing. Roughly speaking, exception predicates must be minimized, while property predicates (e.g. FLIES, AGUTIED), about which we lack knowledge, are allowed to vary.

3 Limitations of this presentation. The possibility of describing different non-monotonic formalisms in a general framework is useful: comparisons between them can easily be carried out. For example, this framework emphasizes the relationship between the CWA and circumscription: it suggests that the CWA is equivalent to the circumscription of all predicates. Besnard and Siegel propose also to include in the framework the formalism of subimplication. But the partition provided (which is the same as for the CWA) highlights more the similarity between the CWA and subimplication than the difference. The preemption ordering as defined by Besnard and Siegel does not permit distinction between them. In fact, the essential difference lies in the special kind of models considered: on the one hand Herbrand models for the CWA; on the other hand, discriminant models for subimplication.

The task is more complex for other non-monotonic formalisms. Default logic cannot be described in this context, except for the fragment corresponding to the CWA. Bidoit and Hull (1986) have proved that minimal models (in the sense of subimplication) define a good semantics for theories whose defaults are all CWA defaults (i.e. normal defaults without prerequisites). But this does not carry over to normal defaults in general. Nor can this framework approach take into account Moore's modal non-monotonic logic, autoepistemic logic (see Chapter 4).

Partitioning of relations is unsatisfactory even for circumscription. Several versions of this formalism have been proposed. Prioritized circumscription has been conceived by Lifschitz to remedy some drawbacks of general circumscription. In some cases, we have to deal with several kinds of abnormality, so that more than one abnormality predicate must be introduced. If the minimizations of these predicates conflict with each other then we need a partial priority ordering between them. Hence prioritized circumscription requires minimization of predicates with respect to some priority ordering.

In conclusion, the preferential-model approach to non-monotonic logics should be
extended so that the preemption ordering is no longer based on a partitioning of the relations. Very recently, Shoham (1987) has made a proposal for doing this. The basic idea behind the construction of his framework is again the use of a preference relation on models, associated with a standard logic. But this preference relation can be any strict partial order on interpretations. In this framework classical notions such as satisfiability and entailment are redefined. Depending on the choice of the preference relation, different non-monotonic logics can be described within this framework. For subimplication and circumscription, the partial orders are analogous to those of Besnard and Siegel. But as this partial order can be very general, default logic and autoepistemic logic can also be dealt with. (In fact, Shoham handles another non-monotonic modal logic, namely Halpern and Moses' logic of minimal knowledge.) For example, Halpern and Moses' logic is treated by giving a preference criterion on Kripke interpretations. This framework makes comparisons between non-monotonic logics easier. Let us give such a result: circumscription can be reduced to the logic of minimal knowledge.
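As a concrete footnote to this comment, here is a minimal sketch (ours) of the preemption comparison induced by a partition P = (R=, R+, R-, R•), applied to the interpretations M, N and N' of the aguty example above; models are simply dictionaries from relation names to extensions:

    P_EQ, P_MAX, P_MIN = {"MAN"}, {"ORDINARY"}, {"AGUTIED"}   # FAST-WALKER varies

    def preferred(n, m):
        """n ≤P m: same extensions on R=, no smaller on R+, no larger on R-;
        the varying relations are unconstrained."""
        return (all(n[r] == m[r] for r in P_EQ) and
                all(n[r] >= m[r] for r in P_MAX) and
                all(n[r] <= m[r] for r in P_MIN))

    M = {"MAN": {"Emos", "Socrates"}, "ORDINARY": set(),
         "AGUTIED": {"Socrates"}, "FAST-WALKER": {"Socrates"}}
    N = {"MAN": {"Emos", "Socrates"}, "ORDINARY": {"Emos"},
         "AGUTIED": set(), "FAST-WALKER": {"Emos"}}
    N1 = {"MAN": {"Emos", "Socrates"}, "ORDINARY": {"Emos", "Socrates"},
          "AGUTIED": set(), "FAST-WALKER": {"Socrates", "Emos"}}   # N' of the text

    assert preferred(N, M) and preferred(N1, N)    # N ≤P M and N' ≤P N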
Flash Sheridan: A theory is a syntactic object: a collection of constant terms, function symbols, predicate symbols, sentences, and what have you. A model of a theory is some mathematical objects for that theory to be about: a constant symbol should mean some object, a relation symbol should mean some relation in extension (i.e. a set of ordered pairs), a predicate symbol should mean the extension of a predicate (i.e. a set), and a sentence should mean a truth value. The model is a model of the theory if the obvious things work out right; for instance, a true sentence means the truth value True. (It will make things easier to consider predicates as one-place relations. Actually, we worry about predicates more, so I shall consider relations as multiplace predicates.)

A model (of a given theory) is nicer than or as nice as another if the predicates that one wants smaller (R-) are no larger, the predicates that one wants larger (R+) are no smaller, the predicates that one insists stay the same (R=) do stay the same, and the predicates that one doesn't care about (R•) can do whatever they want. As in the mundane world, what is nicer will depend on context; we abbreviate "nicer than or as nice as" by ≤P, where P tells us the context (P = (R=, R+, R-, R•)). The objects and functions must stay the same (Besnard and Siegel call this minoring). A model is minimal if there is nothing strictly nicer than it. The following are examples:

Circumscription: In circumscription's opinion, what chiefly matters is that one predicate, the abnormality predicate AB, is minimized (so R- = {AB}). Some predicates (R•) are allowed to vary, some (R=) are not. So P = (R=, ∅, {AB}, R•).

Closed-world assumption: In the CWA, one wants to minimize everything. So every predicate is in R-. We, however, derive only ground literals from the CWA (a ground literal is either a relation symbol followed by the appropriate number of constant symbols, or the negation thereof); after we have derived all the ground literals we want, we then revert to normal deduction.

Subimplication: This uses the same definition of "nice", but we only consider discriminant models. A discriminant model is what set-theorists call a term model, which is merely a cheap trick. We consider the theory itself as the model. For instance, if we want a term model of set theory then the null set is the symbol "∅" and the number two is the term "{∅, {∅}}". In a decent set theory this works out very badly, since {∅, ∅} is the same set as {∅}, but these are obviously different terms. (Things are not quite as bad as that if one is careful, but this is leading us astray.)
This has advantages for (their example) French corporations. In a discriminant model Dassault ≠ Peugeot, since these are different names. It is nice to be able to prove that Dassault ≠ Peugeot, but I imagine this will be too crude. (For instance, a discriminant model will prove all sorts of things like Persia ≠ Iran, the morning star ≠ the evening star. This would greatly benefit users of pseudonymous credit cards. It will also prove Mother(John) ≠ Mary, so if "Mother(John) = Mary" is in your theory, it is now inconsistent. This is so because "Mother(John)" and "Mary" are different terms.)

The point to all this is that minimal models (i.e. models than which there is nothing nicer), in the varying senses of minimal (because the sense of "nicer" varies), are interesting. Something should follow from circumscription and some axioms if it is true in all circumscription-minimal models of the axioms, and from the CWA similarly. Besnard and Siegel do not so much prove this as define it to be true.
Reply to Froidevaux: Within the preferential-models framework, CWA and subimplication are characterized through the same preemption ordering, but there is nothing wrong with this because subimplication aims at safely generalizing CWA to universal theories. It is too hasty a claim to say that only the fragment of default logic corresponding to CWA can be captured by the preferential-models method. Indeed, there is no reason to contend that the free default fragment of default logic is out of the scope of this method (the free default fragment, which has some properties (Besnard, 1987), is obtained by constraining defaults to have no prerequisites (see Chapter 7)). On the other hand, surely not all of default logic comes under the realm of the framework presented. This is due to some peculiarities of default logic, such as the one corresponding to the inability of default logic to take into account default information for reasoning by cases (what is under consideration here is the reasoning scheme, relativized to the consistency requirements underlying the notion of inference developed in default logic, such that if some available default information makes it possible to conclude C from A and to conclude C from B then C can be concluded from A ∨ B). Even more unfortunate and unexpected is the fact that recent unpublished works seem to indicate that autoepistemic logic cannot be characterized by means of preferential models. To take into account prioritized circumscription, the definition of a preemption preordering has to be modified, but no change is needed as far as partitioning of relations is concerned. In conclusion, it is fair to say that Shoham (1987) independently proposed a framework for preference over models that goes far beyond the one presented here, but the price to pay for the increase in generality is, among other things, a complex redefinition of satisfiability and entailment.

Reply to Sheridan: For subimplication to be applied, it is an absolute requirement that the theory at hand doesn't have equalities (even conditional ones) in it: so if "Mother(John) = Mary" is in the theory then no inconsistency arises, because subimplication just doesn't apply to the theory. Such a requirement may seem very restrictive at first sight, but it is a usual assumption for databases, and subimplication has been devised to handle the querying of databases with disjunctive information. As indicated by Sheridan, a formula should be derivable by circumscription if it is true in all circumscription-minimal models of the theory. Unfortunately, this is not always the case (see e.g. Besnard, 1987).
Additional references

Besnard, P. (1987). An Introduction to Default Logic. Springer-Verlag, Berlin.
Bidoit, N. and Hull, R. (1986). Positivism vs. minimalism in deductive databases. Proc. ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, pp. 123-132.
Shoham, Y. (1987). A semantical approach to nonmonotonic logics. Proc. 2nd Int. Symp. on Logic in Computer Science (LICS 1987), pp. 275-279. Computer Society Press, Ithaca, New York.
6

An Intuitionistic Basis for Non-Monotonic Reasoning

MICHAEL R. B. CLARKE
Department of Computer Science, Queen Mary College, University of London, UK

D. M. GABBAY
Department of Computing, Imperial College of Science and Technology, University of London, UK
Abstract

Many systems of non-monotonic deduction have been proposed, but little attention has been given to saying what non-monotonic deduction is. We suggest the notion of restricted monotonicity as one plausible relaxation of the strict monotonicity condition of classical deduction. Gabbay's intuitionistic semantics and non-monotonic deduction rules are given and compared with the modally based fixed-point systems of Moore and of McDermott and Doyle. Systems with branching futures are shown not to satisfy restricted monotonicity in general. Implementation of a practical non-monotonic system is briefly discussed.
1 INTRODUCTION
Before proposing yet another logical system for non-monotonic reasoning, we start by suggesting some formal properties in terms of which non-monotonic systems in general might be classified.

Suppose that one is offered an inference system in some area of application where knowledge is represented by propositional formulae A₁, A₂, ..., Aₙ, B, etc. One tries to evaluate it by asking questions of the form: does A follow from A₁, A₂, ..., Aₙ? Or, in the usual meta-language of propositional logic, does A₁, A₂, ..., Aₙ ⊢ A hold? The answer, yes or no, indicates whether the pair ⟨{A₁, A₂, ..., Aₙ}, A⟩ is or is not an element of the system's consequence relation. Proceeding in this way, one builds up a sample of the consequence relation and can then ask whether this is a satisfactory logical inference system. Is the machine behaving logically? One way to check is to look at the meaning of the formulae A₁, A₂, ..., Aₙ, A and see whether the answers make
sense in the context of application. Suppose, however, that the necessary interface is as yet unavailable. Is there any test that we can apply to the answers to check whether or not the system is operating logically? How do we know that it is not just converting the Aᵢ to numbers and checking whether or not the number corresponding to A divides the product of the numbers corresponding to A₁, A₂, ..., Aₙ? If it is then this arithmetic method of answering will have "logical" properties such as B ⊢ B and, if A₁, A₂, ..., Aₙ ⊢ A then A₁, A₂, ..., Aₙ, B ⊢ A. We should not want to say that an arithmetically based system like this was logical. What conditions should a system satisfy to be characterized as logical? This question has been answered by Tarski and Scott. If the three conditions

    A₁, A₂, ..., Aₙ, B ⊢ B                              (reflexivity)

    A₁, A₂, ..., Aₙ ⊢ X     A₁, A₂, ..., Aₙ, X ⊢ B
    ----------------------------------------------      (transitivity)
    A₁, A₂, ..., Aₙ ⊢ B

    A₁, A₂, ..., Aₙ ⊢ B
    ----------------------------------------------      (monotonicity)
    A₁, A₂, ..., Aₙ, X ⊢ B
are satisfied then ⊢ is the consequence relation of a monotonic deductive system. So we do have a method of checking whether or not our hypothetical system is a monotonic deduction machine. The arithmetic method of answering will fail the transitivity condition. In general, logics will be characterized by the further conditions that they satisfy in addition to those above. It can be shown for example that intuitionistic provability is the smallest consequence relation closed under substitution that satisfies the deduction theorem.
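The arithmetic example can be made concrete; a minimal sketch (ours, with an arbitrary encoding) shows reflexivity and monotonicity holding while transitivity fails:

    from math import prod

    code = {"A1": 2, "X": 2, "B": 4}   # a hypothetical numeric encoding

    def entails(premises, conclusion):
        """A1, ..., An |- A iff A's number divides the product of the premises'."""
        return prod(code[p] for p in premises) % code[conclusion] == 0

    print(entails(["A1"], "X"))        # True:  2 divides 2         (so A1 |- X)
    print(entails(["A1", "X"], "B"))   # True:  4 divides 2 * 2     (so A1, X |- B)
    print(entails(["A1"], "B"))        # False: 4 does not divide 2 (transitivity fails)

The last line is exactly the transitivity failure: A₁ ⊢ X and A₁, X ⊢ B hold, yet A₁ ⊢ B does not.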
It is natural to consider similar criteria for non-monotonic deduction. Not surprisingly, they involve changing the monotonicity condition. But first consider another simple example. Suppose that a database 𝒫 contains assertions

    P1:  A ∧ B → C
    P2:  ¬D → B
    P3:  A
and that the deductive component is classical truth-functional logic. The way that deductions are actually carried out of course depends on the rules chosen: semantic tableau, natural deduction, etc. Here we take the following (monotonic) rules:
    R1:  X ∧ Y → Z
         -----------
         X → (Y → Z)

    R2:  X    X → Y
         ----------
         Y

    R3:  X → Y    Y → Z
         --------------
         X → Z
R1 is an axiom schema in many systems, R2 is modus ponens, R3 is the transitivity of implication. We write A₁, A₂, ..., Aₙ ⇒ B to mean that B is deduced (monotonically) via a single application of the rule R. Using these rules on the database 𝒫, we can make the successive deductions

    𝒫 ⇒ A → (B → C)
    𝒫, A → (B → C) ⇒ B → C
    𝒫, A → (B → C), B → C ⇒ ¬D → C
Thus 𝒫 ⊢ ¬D → C, and we cannot extract any more information from 𝒫 by classical deduction: we cannot, for example, deduce C from 𝒫. Attempting to get more out of the data, we might consider adding some more reasonable-looking inference rules. For example we might add a default rule stating that for any literal X not appearing in 𝒫 as the consequent of an implication, assert by default ¬X:
    D1:  X a literal and X not the consequent of an implication
         ------------------------------------------------------
         ¬X

This is a different kind of rule from R1, ..., R3. It says in effect that if there is no way that X can be deduced from 𝒫 then assume ¬X. For clarity we use a different symbol for this kind of default deduction and write A₁, A₂, ..., Aₙ ⇝ B. With this new possibility, we can get more out of the database. In fact we can deduce C without using R3:

    𝒫 ⇝ ¬D
    𝒫, ¬D ⇒ B
    𝒫, ¬D, B ⇒ A → (B → C)
    𝒫, ¬D, B, A → (B → C) ⇒ B → C
    𝒫, ¬D, B, A → (B → C), B → C ⇒ C
This is just an example, of course; other rules could have been used. Regardless of what the rules actually are, we should like to be able to draw general conclusions about such systems.
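The following truth-table sketch (ours, and only one way to mechanize D1: defaults are applied greedily and only when they remain consistent) reproduces the derivation above, showing that C is not a classical consequence of 𝒫 but becomes one once ¬D has been assumed:

    from itertools import product

    ATOMS = ["A", "B", "C", "D"]

    DB = [lambda v: not (v["A"] and v["B"]) or v["C"],   # P1: A ∧ B → C
          lambda v: v["D"] or v["B"],                    # P2: ¬D → B
          lambda v: v["A"]]                              # P3: A

    def models(theory):
        """All valuations of ATOMS satisfying every formula of the theory."""
        for bits in product([False, True], repeat=len(ATOMS)):
            v = dict(zip(ATOMS, bits))
            if all(f(v) for f in theory):
                yield v

    def entails(theory, goal):
        return all(goal(v) for v in models(theory))

    consequents = {"C", "B"}          # consequents of the implications P1 and P2
    theory = list(DB)
    for x in ATOMS:
        if x not in consequents:
            candidate = theory + [lambda v, x=x: not v[x]]   # default: assert ¬X
            if any(True for _ in models(candidate)):         # keep it if consistent
                theory = candidate

    print(entails(DB, lambda v: v["C"]))       # False: C is not classically derivable
    print(entails(theory, lambda v: v["C"]))   # True: with ¬D assumed, C follows

Here ¬A is rejected (it contradicts P3) and ¬D is kept, exactly as in the derivation above.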
Monotonic deductions are made up of chains of single-step rules such as R1, ..., R3 above. Properties of monotonic deduction are proved by mathematical induction on the length of the chain. Non-monotonic deductions can similarly be thought of as constructed from a mixture of single-step monotonic and non-monotonic rules, and properties of non-monotonic deduction can then be proved similarly from properties of the single-step rules. Clearly monotonicity is one property that we shall not be able to prove, and this has led to non-monotonic deduction being defined simply by the failure of monotonicity. Instead, the single-step rules can be required to satisfy the following properties.

(1) Consistency of new deductions with existing data. If A₁, A₂, ..., Aₙ is consistent and A₁, A₂, ..., Aₙ ⇝ B holds then A₁, A₂, ..., Aₙ, B is consistent.
(2) If B and C can separately be derived from A₁, A₂, ..., Aₙ by single-step rules then A₁, A₂, ..., Aₙ, B, C is consistent. In view of (1), this can also be written

    A₁, A₂, ..., Aₙ ⇝ B and A₁, A₂, ..., Aₙ ⇝ C iff A₁, A₂, ..., Aₙ ⇝ B ∧ C

This is arguable. It is valid for certain kinds of default system, but not those with multiple extensions and not for probabilistic reasoning, where for example we might say

    A₁, A₂, ..., Aₙ ⇝ A  if  Prob(A | A₁, A₂, ..., Aₙ) > 0.5
(3) Restricted monotonicity:

    A₁, A₂, ..., Aₙ ⇝ X     A₁, A₂, ..., Aₙ ⇝ B
    --------------------------------------------
    A₁, A₂, ..., Aₙ, X ⇝ B

We know that ⇝ is non-monotonic, but if we add a proposition that ⇝ tells us is consistent, it does not affect the deducibility of other deducible sentences.
We now define A₁, A₂, ..., Aₙ |∼ X iff for some B₁, B₂, ..., Bₘ there is a chain

    A₁, A₂, ..., Aₙ ⇝ B₁
    A₁, A₂, ..., Aₙ, B₁ ⇝ B₂
    ...
    A₁, A₂, ..., Aₙ, B₁, ..., Bₘ₋₁ ⇝ Bₘ
    A₁, A₂, ..., Aₙ, B₁, ..., Bₘ ⇝ X

It can then be shown (Gabbay, 1984) that the following properties hold for |∼:

    A₁, A₂, ..., Aₙ, B |∼ B                             (reflexivity)

    A₁, A₂, ..., Aₙ |∼ B     A₁, A₂, ..., Aₙ, B |∼ C
    ------------------------------------------------    (transitivity)
    A₁, A₂, ..., Aₙ |∼ C

    A₁, A₂, ..., Aₙ |∼ B     A₁, A₂, ..., Aₙ |∼ C
    ------------------------------------------------    (restricted monotonicity)
    A₁, A₂, ..., Aₙ, B |∼ C
2 INTUITIONISTIC VERSUS CLASSICAL LOGIC
So far we have given some examples of non-monotonic deduction and said something about properties that a non-monotonic consequence relation might or might not be expected to have. We now give the semantic basis for a particular system based on intuitionistic logic, and compare it with other systems and the criteria given earlier.

The classical understanding of truth is simplistic: statements are either true or false. The intuitionistic notion is more realistic: a statement may change its truth value over time, but not capriciously. At this moment it may fail to be true because we do not have sufficient grounds for asserting it. At some future time it may become true as more information becomes available. Although intuitionistic logic is weaker than classical logic in the sense that its theorems are a subset of classical theorems, it is stronger in the sense that if by a classical argument we establish, for example, P ∨ Q then this deduction is weaker than if we had derived it intuitionistically. Although intuitionistic logic uses the same names and symbols for the connectives as classical logic does, they are not definable in terms of each other in intuitionistic logic, and the semantics of ¬ and → in particular are rather different. Many classical deductions seem counter-intuitive when expressed in ordinary language, particularly those involving translations of material implication, for example (∀x)P(x) → Q ⊢ (∃x)(P(x) → Q), which has a fairly intricate classical proof by contradiction but is not provable intuitionistically. Let P(x) be "x plays well" and Q "we shall win", and ask oneself whether the deduction seems reasonable.
Many other examples of this type can be given, and whether the differences are significant depends presumably on the application. We shall go on to show, however, that many of the counter-intuitive results reported by McDermott and Doyle disappear when their notion of consistency is formulated intuitionistically.

Although, reasoning intuitionistically, we do not insist that every statement is either true or false, we do assume that every statement is either true or fails to be true. The situation may develop as more information becomes available, and statements that up to now have failed to be true may become true. We do assume, however, that statements that have been established as true can never again fail to be true. No further evidence can overturn a demonstration of truth. Negation is defined similarly: ¬P is true now if and only if P fails to be true in all possible future situations; we see now, at this moment, that we could never have grounds for asserting P.
3 SEMANTIC PRESENTATION OF AN INTUITIONISTIC SYSTEM

We start by giving a semantics for intuitionistic propositional logic in such a way as to show how it models the accumulation of information with time, together with a notion of consistency analogous to that of McDermott and Doyle. Consider a set T of moments of time and a reflexive and transitive time-ordering relation ≥, where t₂ ≥ t₁ means that t₂ is either later than t₁ or t₁ itself. As time goes on, we learn more and more about the world in the sense that the truth of more and more propositions becomes established. Let the propositional language L be the usual one, with the addition of the extra operator M, whose intended interpretation is that MP means it is consistent to assume P. Once known to be true, an atomic proposition remains true; up to that point it fails to be true. Label these two possibilities 1 and 0. They can be specified via a function h: T × L → {0, 1} having the following properties:

    (0) if h(t, A) = 1 for atomic A then h(s, A) = 1 for all s ≥ t;
    (1) h(t, A ∧ B) = 1 iff h(t, A) = 1 and h(t, B) = 1;
    (2) h(t, A ∨ B) = 1 iff h(t, A) = 1 or h(t, B) = 1;
    (3) h(t, ¬A) = 1 iff, for all s ≥ t, h(s, A) = 0;
    (4) h(t, A → B) = 1 iff, for all s ≥ t, if h(s, A) = 1 then h(s, B) = 1;
    (5) h(t, MA) = 1 iff there is an s ≥ t such that h(s, A) = 1.
Note that, although atoms and formulae not containing M are forever true once they become true, MA and formulae involving M can revert to falsehood. In the diagram, MA is true at t and t₁ but not at t₂. Note also that MA is not the same as ¬¬A (although in linear time it is). We say more later about the obvious connection with modal possibility.

[Figure: a branching future starting at t, with one branch through t₁, where A is established, and another through t₂, where A never holds.]
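The semantic clauses lend themselves to a direct implementation on finite frames. Here is a small sketch (ours) in which formulas are nested tuples and GE[s] collects the points later than or equal to s; it reproduces the branching-time behaviour of MA just described:

    # Points and the "later than or equal" relation of a small branching frame:
    T = ["t", "t1", "t2"]                      # t1 and t2 are two possible futures of t
    GE = {"t": {"t", "t1", "t2"}, "t1": {"t1"}, "t2": {"t2"}}
    VAL = {"A": {"t1"}}                        # atom A becomes established at t1 only

    def h(t, f):
        """Evaluate a formula at point t following clauses (0)-(5); formulas are
        atoms or tuples such as ("and", p, q), ("not", p), ("imp", p, q), ("M", p)."""
        if isinstance(f, str):                 # atoms; VAL must be persistent, as in (0)
            return t in VAL.get(f, set())
        op = f[0]
        if op == "and":
            return h(t, f[1]) and h(t, f[2])
        if op == "or":
            return h(t, f[1]) or h(t, f[2])
        if op == "not":                        # (3): the argument fails at every later point
            return all(not h(s, f[1]) for s in GE[t])
        if op == "imp":                        # (4): preserved at every later point
            return all(not h(s, f[1]) or h(s, f[2]) for s in GE[t])
        if op == "M":                          # (5): the argument holds at some later point
            return any(h(s, f[1]) for s in GE[t])
        raise ValueError(op)

    # MA is true at t and t1 but not at t2, and MA differs from ¬¬A at t:
    print([h(p, ("M", "A")) for p in T])       # [True, True, False]
    print(h("t", ("not", ("not", "A"))))       # False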
Intuitionistic entailment is defined as A ⊨ B if and only if, for any function h satisfying (0), ..., (5) above, whenever h(t, A) = 1, h(t, B) = 1. Intuitionistic provability ⊢ can then be defined in terms of axiomatic, natural-deduction or tableau systems that can be proved sound and complete with respect to the above semantics. We say something about automating this later. Our notion of non-monotonic provability |∼ will be based on ⊢, or equivalently ⊨. But first we check whether or not the consequences of this definition satisfy our intentions about the behaviour of the operator M.

MA and ¬A are semantically exclusive and exhaustive possibilities, so we have ⊨ MA ∨ ¬A. Also, ¬MA is semantically equivalent to ¬A, and A ⊨ MA is valid. Furthermore, MA → B is equivalent to ¬A ∨ B: either ¬A is true now or, if not, then MA is true now and hence B. Note that C ∨ ¬C is not a theorem, but is equivalent to MC → C. Also, MC → ¬C is equivalent to ¬C.

We have seen that ⊨ MB ∨ ¬B, but that is not quite the same as A ⊨ MB iff A ⊭ ¬B. Let the time points T be the subset of L generated by ¬, ∧ and → only, and let the ordering relation A ≥ B mean ⊨ A → B. For atomic p let h(A, p) = 1 iff A ⊨ p. For any formula B it can then be shown inductively that h(A, B) = 1 iff A ⊨ B. Now A ⊭ ¬B iff A ∧ B is consistent. But A ∧ B ≥ A since A ∧ B ⊨ A, so if A ⊭ ¬B then there is a state, namely A ∧ B, in the future of A in which B is established, and conversely if there is such a state in the future of A then A ⊭ ¬B. So A ⊨ MB iff A ⊭ ¬B.

Now consider again some of the features of McDermott and Doyle's (1980) system, also discussed in Gabbay (1982) and in Chapter 4 by Moore. Among several problematic cases for their logic, McDermott and Doyle cite the following:
    (1) {(MC → D), ¬D} is inconsistent;
    (2) MC does not follow from M(C ∧ D);
    (3) {MC, ¬C} is not inconsistent;
    (4) {MC → ¬C} is inconsistent, but {MC → ¬C, ¬C} proves ¬C.

They find {(MC → D), ¬D} inconsistent because ¬C cannot be shown to follow, forcing the assumption of MC. Moore also finds that this theory cannot be the foundation of a consistent set of beliefs. Intuitionistically MC → D is equivalent to ¬C ∨ D, so {(MC → D), ¬D} is equivalent to ¬C, justifying the intuition of McDermott and Doyle, who remedied the inconsistency of their classically based system by arbitrarily adding ¬C. Also, MC does follow intuitionistically from M(C ∧ D), and {MC, ¬C} is semantically inconsistent, while MC → ¬C is equivalent to ¬C. {MC → C} proves C ∨ ¬C, which says that either C is known now or ¬C is known, but does not say which.

The intuitionistic semantics can be extended to quantified formulae, and the system can also be given an axiomatic presentation. The details are in Gabbay (1982).
4 THE NON-MONOTONIC COMPONENT
So far we have shown that the intuitionistic semantic basis overcomes some of the problems with McDermott and Doyle's system, but we have not yet said what constitutes non-monotonic deduction in our system. We could now go on to extend the logic with fixed-point definitions as McDermott and Doyle do, and these are indeed of some interest, but, following the discussion in the first part of the chapter, we first define non-monotonic deduction iteratively in terms of chained single-step default rules.

The single-step non-monotonic rules are of the form A₁, A₂, ..., Aₙ ⇝ B if, for some P₁, ..., Pₖ such that A₁, A₂, ..., Aₙ, MP₁, ..., MPₖ are consistent, we have A₁, A₂, ..., Aₙ, MP₁, ..., MPₖ ⊨ B. We can also say as a special case that A₁, A₂, ..., Aₙ ⇝ B if A₁, A₂, ..., Aₙ ⊨ B with no default. As defined here, the intuitionistic deduction of B is semantically based. The corresponding proof might involve several basic steps such as modus ponens, whereas earlier we implied that ⇝ stood for only a single basic step. What is of interest, however, is whether ⇝ satisfies the consistency and monotonicity properties introduced earlier. We then define non-monotonic |∼ as above: A₁, A₂, ..., Aₙ |∼ X if for some B₁, ..., Bₘ there is a chain of such single-step deductions leading from A₁, A₂, ..., Aₙ through B₁, ..., Bₘ to X, as in Section 1.
The following are examples:

(1) {MP → P} ⇝ P because {(MP → P), MP} ⊨ P

(2) {(MP → P), (P ∧ MQ → R)} |∼ R because

    {(MP → P), (P ∧ MQ → R), MP} ⊨ P  and
    {(MP → P), (P ∧ MQ → R), P, MQ} ⊨ R
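Claims of the form Γ ⊨ B range over all frames and so cannot be verified by a single evaluation, but a brute-force search for countermodels over small frames is easy to sketch (ours; finding no model up to the bound is only evidence, not a proof). For example (1), no small model satisfies (MP → P), MP and ¬P together:

    from itertools import product

    MAXN = 3                                   # bound on the number of points

    def frames(n):
        """All reflexive-transitive 'later than or equal' relations on n points."""
        pts = list(range(n))
        pairs = [(a, b) for a in pts for b in pts if a != b]
        for bits in product([False, True], repeat=len(pairs)):
            ge = {a: {a} for a in pts}
            for (a, b), bit in zip(pairs, bits):
                if bit:
                    ge[a].add(b)
            if all(c in ge[a] for a in pts for b in ge[a] for c in ge[b]):
                yield ge

    def valuations(ge):
        """All persistent valuations for the single atom P."""
        pts = list(ge)
        for bits in product([False, True], repeat=len(pts)):
            val = {p for p, bit in zip(pts, bits) if bit}
            if all(s in val for p in val for s in ge[p]):   # persistence, clause (0)
                yield val

    def h(ge, val, t, f):                      # only the connectives needed here
        if f == "P":
            return t in val
        op = f[0]
        if op == "imp":
            return all(not h(ge, val, s, f[1]) or h(ge, val, s, f[2]) for s in ge[t])
        if op == "not":
            return all(not h(ge, val, s, f[1]) for s in ge[t])
        if op == "M":
            return any(h(ge, val, s, f[1]) for s in ge[t])

    theory = [("imp", ("M", "P"), "P"), ("M", "P"), ("not", "P")]
    found = any(all(h(ge, val, t, f) for f in theory)
                for n in range(1, MAXN + 1)
                for ge in frames(n)
                for val in valuations(ge)
                for t in ge)
    print(found)                               # False, supporting {(MP → P), MP} ⊨ P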
5 RESTRICTED MONOTONICITY
We stated earlier that if the consistency and restricted-monotonicity properties hold for single-step rules then they can be shown to be hereditary along chains of such rules. Thus it suffices to consider the single-step rules, and for simplicity we take these to be of the following form: A ⇝ B if A, MP ⊨ B for some P such that A, MP are consistent. As a preliminary, the following properties can be easily established:

    (1) A, MP are consistent iff A ⊭ ¬P;
    (2) if A, MP ⊨ B then A ⊨ B ∨ ¬P;
    (3) if A, MP are consistent and A, MP ⊨ B then A, B are consistent.
Now strict monotonicity

    A ⇝ X
    ---------
    A, B ⇝ X

does not hold for ⇝. A counter-example is (MP → P) ⇝ P but {(MP → P), ¬P} ⊨ ¬P. Suppose, however, that we take a B that is "more consistent" with A. This is the idea behind restricted monotonicity. Does

    A ⇝ B     A ⇝ X
    ----------------
    A, B ⇝ X

always hold? Again the answer in general is no. A counter-example is (MP → P) ⇝ P and (MP → P) ⇝ ¬P, because {(MP → P), M(¬P)} ⊨ ¬P, but {(MP → P), P, M(¬P)} is inconsistent. (It can be shown, however, that if P and Q are not the same then (MP → Q) ⇝ Q but not (MP → Q) ⇝ ¬Q.) Now suppose that {A, MP} ⊨ B and {A, MQ} ⊨ X, where A, B and A, MQ are separately consistent but A, B, MQ are inconsistent. Then A, MQ ⊨ ¬B. So we have A ⊨ B ∨ ¬P and A ⊨ ¬B ∨ ¬Q. This is possible in branching time, so restricted monotonicity need not hold, but in linear time the schema (α → β ∨ γ) → (α → β) ∨ (α → γ) is provable, so ⊨ (A → B) ∨ (A → ¬P) and ⊨ (A → ¬B) ∨ (A → ¬Q). But A ⊭ ¬P, so there is a state, namely A ∧ P, in the future of A in which P is established.
Therefore h(t, A → ¬P) = 0 for all t, and hence h(t, A → B) = 1 for all t. By a similar argument on Q, we also have h(t, A → ¬B) = 1 for all t. So in linear time the assumption that {A, B, MQ} is inconsistent leads to a contradiction, and

    A ⇝ B     A ⇝ X
    ----------------
    A, B ⇝ X

holds.
6 CONNECTION WITH MODAL LOGIC AND FIXED-POINT THEORIES

There is a well-known representation (originally due to Gödel) of the notion of intuitionistic truth within the modal logic S4. If [A] denotes the modal translation of the intuitionistic formula A then

    [A] = □A for atomic A
    [A ∧ B] = [A] ∧ [B]
    [A ∨ B] = [A] ∨ [B]
    [A → B] = □([A] → [B])
    [¬A] = □¬[A]

where connectives inside square brackets are intuitionistic and those outside are classical. We can now add to these

    [MA] = ◇[A]
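A sketch (ours) of the translation as a recursive function over the same tuple representation of formulas; only the final clause for M goes beyond Gödel's translation:

    def translate(f):
        """Godel translation into S4, extended with the clause for M; formulas
        are atoms or tuples, and "box"/"dia" stand for the S4 modalities."""
        if isinstance(f, str):
            return ("box", f)                              # [A] = □A for atomic A
        op = f[0]
        if op in ("and", "or"):
            return (op, translate(f[1]), translate(f[2]))
        if op == "imp":
            return ("box", ("imp", translate(f[1]), translate(f[2])))
        if op == "not":
            return ("box", ("not", translate(f[1])))
        if op == "M":
            return ("dia", translate(f[1]))                # [MA] = ◇[A]
        raise ValueError(op)

    print(translate(("imp", ("M", "C"), "D")))
    # ('box', ('imp', ('dia', ('box', 'C')), ('box', 'D'))), i.e. □(◇□C → □D)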
The modal translation of the intuitionistic MP → P (or equivalently P ∨ ¬P) is now □P ∨ □¬□P, which is an instance of the S5 axiom. Thus defaults of the form MP → P, normal defaults in Reiter's terminology, take us some way towards S5. If we had obtained the non-monotonic component by the fixed-point method then its modal translation would have been: T is the fixed point of the (intuitionistic) premises A₁, A₂, ..., Aₙ if T is the set of S4 consequences of

    {[A₁], [A₂], ..., [Aₙ]} ∪ {◇[P] | [¬P] ∉ T}

Note the difference that intuitionistic negation makes. McDermott and Doyle's definition was equivalent to

    {A₁, A₂, ..., Aₙ} ∪ {¬□¬P | ¬P ∉ T}

The intuitionistic notion of truth seems to give much the same effect as the other half of Moore's definition, which forces inclusion of □P if P ∈ T. Consider again the example {MC → D, ¬D}. The modal translation is {□(¬□¬□C → □D), □¬□D}, which is logically equivalent to {□(P → □Q), □P}, putting P for ¬□D and Q for ¬□C.
From the K axiom and modus ponens, we have □□Q, i.e. □Q, i.e. [¬C] as before. Unlike Moore and McDermott and Doyle, we do not now get a contradiction through having to block ¬□Q by including Q. Following the intuitionistic version of the fixed-point definition, □Q ∈ T, so ¬□Q need not be. There is no need, or even possibility in intuitionistic logic, of introducing Q.

7 AUTOMATING INTUITIONISTIC NON-MONOTONIC DEDUCTION

There are a number of ways to provide the monotonic component. We have experimented with a tableau implementation for intuitionistic logic based on the signed tableau rules in Fitting (1983), with two more rules for the new operator M:
    {tA₁, ..., tAₖ, fB₁, ..., fBₘ, fMX}
    -----------------------------------
    {tA₁, ..., tAₖ, fB₁, ..., fBₘ, t¬X}

where t and f say that the formula that follows is true or fails to be true in the world specified. This is straightforward to implement, but tableau systems do not always provide efficient or comprehensible proofs, so a backward-chaining method for intuitionistic logic based on natural deduction is also being developed. Gabbay (1987) gives a similar system for classical logic, and also shows how it can be extended to intuitionistic and modal logic.

What is not so clear is how the non-monotonic rules should be implemented. A natural approach is to try and extend the goal-directed backward-chaining methods already used for deductive databases. For many straightforward deductions this seems to work quite well. Consider once more the example {(MP → P), (P ∧ MQ → R)} |∼ R and suppose that we are thinking in terms of a Prolog-like system. The program is

    MP → P
    P ∧ MQ → R
and the query is

    ?|∼ R

Note that the query requests non-monotonic deduction. To show R, we have to show P ∧ MQ. To show P, we have to show MP, for which there is no explicit fact or rule. But MP is the same as failing to show ¬P, so we invoke a
new computation rule for formulae of the form MP. We try to show ¬P (monotonically) and fail. So MP can be assumed, and now we have to show MQ, for which there is no fact or rule, so once more we must try to show ¬Q, which fails, so MQ succeeds, and so on. Given a suitable complete theorem-prover that answers valid or not to any proposed propositional deduction, implementation of the new computation rule is relatively simple (a toy sketch is given below). Of course estimation of failure in the quantified case would need some heuristics.
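A toy sketch (ours; the rule and goal syntax are invented for the example, there is no loop checking, and a real system would call a complete intuitionistic prover where this sketch merely looks goals up) of the backward-chaining computation rule just described:

    RULES = {"P": [["M(P)"]],         # MP → P
             "R": [["P", "M(Q)"]]}    # P ∧ MQ → R
    FACTS = set()

    def show(goal, depth=0):
        """Backward chaining; a goal M(X) succeeds iff the attempt to show ¬X fails."""
        print("  " * depth + "show " + goal)
        if goal.startswith("M(") and goal.endswith(")"):
            return not show("¬" + goal[2:-1], depth + 1)   # consistency as failure
        if goal in FACTS:
            return True
        return any(all(show(g, depth + 1) for g in body)
                   for body in RULES.get(goal, []))

    print(show("R"))                  # True: ¬P and ¬Q fail, so MP and MQ succeed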
However, the additional rule, although necessary, is by no means sufficient. Suppose that we change the example slightly to

    MP → P
    ¬P ∧ MQ → R

We now have to show ¬P at the first step. In fact ¬P is an intuitionistic consequence of {(MP → P), M(¬P)}, but this time there is no rule in the database to directly trigger M(¬P). This step can be generated, however, by a computation rule that says that if one is otherwise stuck with a goal G then add MG if it is consistent to do so (this idea was suggested by Steve Reeves). We are currently experimenting with a practical system based on rules such as these.

BIBLIOGRAPHY

Fitting, M. C. (1983). Proof Methods for Modal and Intuitionistic Logic. Reidel, Dordrecht. (The title is self-explanatory; a careful and complete survey, especially good on tableau methods.)
Gabbay, D. M. (1982). Intuitionistic basis for non-monotonic logic. Proc. 6th Conf. on Automated Deduction (ed. D. W. Loveland). Lecture Notes in Computer Science, Vol. 138, pp. 260-273. Springer-Verlag, Berlin. (The paper that originally showed how McDermott and Doyle's notion of consistency could be given an intuitionistic semantics.)
Gabbay, D. M. (1984). Theoretical foundations for non-monotonic reasoning in expert systems. Research Report DoC 84/11, Dept. Computing, Imperial College of Science and Technology, University of London; also in Logics and Models of Concurrent Systems (ed. K. Apt), pp. 439-459. Springer-Verlag, Berlin. (The report in which the notion of restricted monotonicity is introduced, and in which proofs are given of results stated here.)
Gabbay, D. M. (1987). Programming in Pure Logic (forthcoming book). (Prolog-style backward-chaining computation rules are given for classical, intuitionistic and modal logic. Unlike Prolog, the resulting systems are complete.)
McDermott, D. and Doyle, J. (1980). Non-monotonic logic I. Artificial Intelligence 13, 41-72. (The paper that originally introduced the consistency operator M.)
Moore, R. C. (1988). Autoepistemic logic. Chapter 4 of this book.
Reiter, R. (1980). A logic for default reasoning. Artificial Intelligence 13, 81-132. (Introduces the notion of default reasoning.)
DISCUSSION
J. A. Campbell: "Non-monotonic reasoning" now means many things to many different people. For example, there is the loose or colloquial meaning: reasoning in which the conclusions of deductions change with time as new information is acquired. This meaning has been particularized in several different ways, e.g. circumscription (McCarthy, 1984). A typical middle-of-the-road particularization is default reasoning, of the kind represented by the work of Reiter (1980); other varieties are not hard to find (see e.g. Łukaszewicz, 1986; AAAI, 1984). Autoepistemic logic (Chapter 4) belongs in the same general non-monotonic family. The title of this chapter by Clarke and Gabbay suggests that intuitionistic logic is of some help or interest to people who may want to use any of the above systems. This is not untrue (or at least it fails to be untrue, i.e. it can be believed while we wait for further information), because it offers some technical fixes at a fairly basic level. The question of the extent to which the fixes can propagate to higher levels of any given non-monotonic logical system and therefore change the way in which its interpretation works is still open. It is certainly worth further research in the directions indicated below and in this chapter.

The essence of the intuitionistic view of logics, applicable also to the non-monotonic case as presented in this chapter, is that certain knowledge of the world is a non-decreasing quantity: at least for atomic propositions, the transition from "no evidence either way" to true or false is irreversible once it has happened. This is indeed a means of evolution of knowledge that makes non-monotonic reasoning necessary, but it is not clear that it matches many of the situations in which such reasoning will be needed in potential applications, for example real-world economic planning exercises against the background of pseudorandom behaviour of some (most?) governments. To take a simple case of an intuitionistically unruly effect, consider a component that is subject to faults followed by periods of recovery. When not "recovering", it is in either a "true" or a "false" state, with different implications for the system of which it is a part, in the two different states. Following recovery, there is a further period during which its state is not identifiable by the observer, although the state exists and affects the system. During that period, the observer must make inferences based on his beliefs about the component without having certainty about its state. This situation is quite realistic (for example the problem could be one of planning short-wave radio links, where the component is the ionosphere or part of it), but seems to be too complex for the basic intuitionistic semantic function h(t, A). This is a comment about the scope of the theory, not about its technical content.

There is a second comment about scope which should be made here: the discussion by Clarke and Gabbay refers to propositional expressions, but leaves open the question of what happens to the formal structure when higher-order expressions are involved; in particular, first-order expressions for typical versions of default reasoning (for which there seems no reason to believe that the behaviour will be nasty after further work on technical issues) and second-order for circumscription (a tougher problem!).
It is tempting to mention, in passing, that the example of the counter-intuitive meaning of ('v'x) P(x) -+ Q 1- (3x) (P(x) -+ Q) when P(x) is "x plays well" and Q is "we shall win" is an effective blow struck against classical first-order logic (or an effective demonstration of its defencelessness), but there is no evidence that it will not turn out to be an equally effective blow or demonstration at the expense of the intuitionistic approach, when that approach has been extended to the point where it can handle all first-order expressions satisfactorily. There are semantic issues here, to do with the
176
M. R. B. Clarke and D. M. Gahha1·
meaning of -+ and what everyday meanings it is reasonable to attach to predicate formulae, but these are rather different from anything else in this chapter (i.e. they may be the same for classical and intuitionistic logics) and probably from anything else in this entire book. Within the restricted (propositional-calculus) scope of the intuitionistic treatment in the paper, the main pay-off is that the most obvious defect of the seminal paper on non-monotonic reasoning by McDermott and Doyle (1980) is removed. This is that -,c does not follow naturally from MC-+ D and -,D. In addition, MC-+ c intuitionistically proves nothing significant ( C v -,C), which it does prove, is a paraphrase rather than a piece of new information), while McDermott and Doyle's view that it proves C looks too strong a view to model reality cautiously. (A cautious observer would call that view "jumping to conclusions".) These are both positive achievements. In Clarke and Gabbay's specification of the deduction ""'• however, a price is paid: further restriction of the scope of the formalism. This comes from the property (2). which requires that B and C should be consistent if both A., A 2 , •• • , An""' B and A., A 2 , .• • , An""' C. As the authors note, the condition rules out systems with multiple extensions. In a system with multiple extensions, each extension must make some kind of semantic sense, and thus correspond to the construction of the world-view of one conceivable rational agent: democratic diversity, in fact. A single-extension system is rather like the picture of the stage of a puppet theatre as seen by the puppet-master. There will certainly be occasions when that treatment is adequate or even necessary. but there will probably be at least as many other occasions when it is too limiting for the problem that one is trying to solve by non-monotonic methods. Property (2) can probably not be relaxed without destroying the essential structure and meaning of the "restricted monotonicity" property (3). In one respect, mentioned below, the destruction is not necessarily a bad thing. But a hard-headed potential user's reason (additional to any reason that a specialist in intuitionistic logic may have) for preserving property (3) is that it simplifies possible proof procedures and automation of this intuitionistic propositional calculus to the point where it has a fighting chance of being computationally efficient. Efficiency is not a problem for suitably simple propositional cases, for example the establishment of {(M P -+ P), (P " MQ -+ R)} 1- R, but in general the proof scheme needs much more attention if it is not to degenerate into a trial-and-error exercise. It seems that the maximum mileage has already been obtained from classical and technical non-monotonic rules of deduction; what would be helpful would be a third type of rule or set of rules to keep a chain of deductions moving in the right direction. towards a desired goal, and avoiding tautologies along the way. Obviously, soundness and completeness of any such scheme should receive early attention, for example as a way of protecting it against too-enthusiastic use of clues for good expressions to select in a next proof step that are given from outside the intuitionistic system on some external grounds of "plausibility". 
It would be amusing to consider whether it could ever be safe to receive clues of this kind, say from a belieffunctional (Chapter 9) computation running in parallel with the logical one, and what kinds of mechanisms would be needed to sanitise the advice, to the point where completeness etc. could be guaranteed. Customers using distributed and multi-agent planning methods in artificial intelligence are prone to ask for things at least as dangerous as this! Less adventurously, schemes for selecting the next move to make in a proof can be envisaged which rely on syntax or the previous history of a computation. An example
6
/ntuitionistic Bases for Non-Monotonic Reasoning
177
of the latter in a successful scheme of deduction in classical logic is ancestor-filtered resolution. A possibility that mixes the two is the idea that, in choosing expressions e andfin order to conclude e ".f ~ g, no more than one of e,.fand g should begin with the operator M. An example of the "syntax" type ("short expressions are likely to be nicer to concentrate on than long expressions, or at least they waste fewer computing resources if this concentration leads nowhere") is the unit-preference tactic used in resolution. Another syntactic example, even if the intention of including it in the paper is entirely different, is Clarke and Gabbay's rule Dl; there is no obvious reason why it should not serve two purposes at once. In general, there may be several ideas worth transferring to the problem of finding good intuitionistic proof procedures from the area of classical resolution where they have had past successes. Nilsson ( 1980) gives a useful catalogue of such resolution-based ideas. This chapter started by suggesting that it would propose some properties in terms of which non-monotonic systems might be classified. In outline. what it has proposed are two points on a possible scale of classification. A monotonic deduction system is specified by three properties for its consequence relation. One brand of (intuitionistic) non-monotonic system is specified by two properties in common with it and by a third property (restricted monotonicity) that is conceptually not far distant from 'classical" monotonicity. Yet there are many more brands of non-monotonic system, as the references mentioned at the beginning of these comments indicate. Going beyond restricted monotonicity, for example by trying to specify different families of multiple extensions, may provide a useful taxonomy of non-monotonic reasoning systems that is supported by an intuitionistic base. Taxonomies arc somewhat lacking at present, but are likely to be needed more as the amount and variety of published material on non-monotonic logics (in the "loose meaning" mentioned at the beginning of this discussion) increase.
Reply: The first point that Campbell makes is that the particular model we put forward of intuitionistic truth as accumulation of information in time does not cope with situations where truth values change in an arbitrary manner over time. The example he gives of an intermittently faulty component would seem to be best handled by state probabilities parametrized by time or possibly some kind of event calculus or similar temporal approach. The semantics that we propose is not intended to define a temporal logic, the time points are identified with the propositions known to be true at those points. Time is used as a metaphor in the same way that possible worlds are in modal logic. Campbell's second point is that we only discuss the propositional case. We could have made it clearer that intuitionistic logic can be given a first-order semantics in the usual way for Kripke models by specifying a partially-ordered set of domains of individuals for each time point or possible world. The original Gabbay (1982) paper also gives an axiomatic presentation that includes all the necessary axioms and rules for the first-order case. The formula (\lx)P(x) -+ Q I- (3x)(P(x) -+ Q) cannot be proved without reductio ad absurdum; it is not intuitionistically valid. The point we were trying to make here, by ~ay of a brief example en passant, was that intuitionistic implication and deduction is In many cases a better translation of the if ... then of everyday human reasoning than classical material implication and the deductions that result from having reductio ad absurdum as an available rule. McDermott and Doyle's system only proves C from MC-+ C when the nonmonotonic rules are used, just as the intuitionistic system does. It could be viewed as a
178
M. R. B. Clarke and D. M. Gahhay
justifiable cnuctsm of the intuitionistic system that MC-+ C is equivalent to C v --, C, i.e. neutral with respect to C or --,C. This is not the usual intention behind defaults. The question is related to the point about multiple extensions. We agree with John Campbell that multiple extensions are an inherent part of conjectural reasoning. What is needed is some way of saying that some extensions are preferred to others on the basis of current information. It is worth noting that, according to the definition we give of non-monotonic deduction, the theory M P -+ Q cannot non-monotonically prove 1Q if P and Q are different. We have enlarged slightly at the end of this chapter on how a practical nonmonotonic system might be implemented. What is needed now are practical applications and comparisons with other methods such as truth-maintenance systems.
Additional references AAAI (1984). Proc. AAAI Workshop on Non-Monotonic Reasoning, New Paltz, NY. American Association for Artificial Intelligence, Menlo Park, California. ¥-ukaszewicz, W. (1986). CC AI. J. Integrated Study Artificial Intel/. Cogn. Sci. Appl. Episternal. 3, 7-31. McCarthy, J. (1984). Applications of circumscription to formalizing common-sense knowledge. In AIAA (1984), pp. 295-324. Nilsson, N.J. (1980). Principles of Artificial Intelligence. Tioga Publishing Co., Palo Alto, California.
7
Inheritance in Semantic Networks and Default Logic CHRISTINE FROIDEVAUX Laboratoire de Recherche en lnformatique, Universite de Paris-Sud, Orsay, France
DANIEL KAYSER LIPN, Departement de Mathematiques et lnformatique, Universite de Paris-Nord, Vil/etaneuse, France
Abstract Hierarchies with exceptions are very common. Inheriting properties in such hierarchies is an important problem in automated reasoning. We present two solutions to this problem-Fahlman's semantic networks NETL and Reiter's Logic for Default Reasoning-and we examine how these two solutions are related.
1
INTRODUCTION
This paper examines the relationship ex1stmg between a knowledgerepresentation technique, namely semantic networks allowing exceptions, and a form of inference, default reasoning. Semantic nets are usually equipped with ad hoc inference procedures, while default reasoning is more firmly grounded, as far as theoretical background is concerned; unfortunately, it turns out that in the most general cases default logic has rather bad properties concerning decidability or proof theory. We thus focus on some more specific cases, namely type hierarchies, which are of major practical interest, where default reasoning and semantic nets can be related in various ways. 1.1
Semantic networks
The idea of using so-called "semantic" nets in order to represent some kinds of knowledge stems from a widely spread metaphor, which consists in using spatial expressions to convey semantic relationships (e.g. a semantic neighbourhood between concepts). Some early works, especially in the field of NON.STANDARD LOGICS FOR AUTOMATED REASONING ISBN 0·12-649520-3
Copyright CO /988 Academic Press Limited All rights of reproduction in any form reserved
C. Froidevaux and D. KaysC'r
180
information retrieval (e.g. Doyle, 1962), considered the possibility of implementing some sort of semantic graph in a computer. At that time, the li!lks in the net had no other interpretation than existence of a (possibly weighted) meaning relationship between the concepts. Since the publication of an influential paper by Quillian (1968), the idea of semantic nets has pervaded more areas of Artificial Intelligence, but, as no precise semantics was ascribed to nodes or to links in the original paper, many workers using Quillian's approach merely considered themselves free to give any meaning they wished to the relations expressed in a network. A noteworthy exception was the work of Shapiro (1971 ), which was arguably the first serious attempt to represent first-order logic formulae in a network formalism (we discuss the reasons for doing this in Section 1.3). In an important paper, Woods ( 1975) criticized the proliferation of graphic formalisms that had no well-defined semantic interpretation. A correspondence between some types of networks and certain classical first-order theories was subsequently provided by Schubert (1976) and Hayes (1977). A good presentation of "second-generation" networks is to be found in Findler ( 1979). Meanwhile, Fahlman (1979) presented a system, called NETL, that was primarily oriented toward parallel computers; although NETL's semantics is not always accurately defined, this system clearly embodies an important mechanism for default reasoning (see Section 1.2). A large part of the present chapter will be devoted to NETL or NETL-Iike inferencing facilities. More recent literature concerning semantic networks includes descriptions of KL-ONE (Brachman and Schmolze, 1985), Krypton (Brachman et a/., 1985) and KL-TWO (Vilain, 1984). 1.2
Default reasoning
Except in a few technical domains, knowledge is best expressed by means of general statements, which should not be interpreted as universal laws; for example Examples Turning the switch puts on the light (the bulb might be broken, ... ). Birds can fly (except ostriches, penguins, wounded birds, ... ). The seminar is held every monday (except during holidays, ... ).
Such facts can be represented well by formulae such as (1)
P(x)
1\
1exception 1 (x)
1\ ... 1\
iexception.(x) => Q(x)
But formulae of this kind do not express at all the fact that exceptions are
7
Inheritance in Semantic Networks and Default Logic
181
something exceptional: P receives exactly as much importance as any "exception;'' (in other words, if P(a) has been proved, and 1exceptioni(a) for all j I= i has also been proved, the implication remains blocked, as would be the case if not even P(a) was proved: the general case has no more weight than any bogus case). This would not be a problem if inference took place in a universe where an exhaustive description of each object is available (even in this case, more economical descriptions for rule (I) could help); unfortunately, most real-world applications are such that one cannot even dream of exhaustive descriptions. As Reiter ( 1978) says: "Default reasoning may well be the rule, rather than the exception, in reasoning about the world, since normally we must act in the presence of incomplete knowledge." Example A robot planning to have some light in the room should not have to prove that the bulb is OK, the wires are normal, the fuse is fine, .... Only if something wrong appears in the realization of the plan would it be time to check. As a matter of fact, human communication (and man-machine communication as well) is possible only under the pragmatic law (in the linguistic sense of the word "pragmatic") that if something relevant is in an exceptional state, it must be explicitly stated. Otherwise, it is reasonable to assume that everything is normal. Default reasoning deals exactly with this assumption. We present in Section 3 one formalization (Reiter, 1980) of default reasoning. Other formalizations include those of AI (1980), McDermott ( 1982), Moore ( 1985) (see also Chapter 4), Lukaszewicz (1985) and McCarthy ( 1986) (see also Chapter 5). 1.3
Inheritance
Representing knowledge would be futile if there were no procedure to draw inferences from the knowledge represented. We shall focus on the kind of default reasoning that can be achieved with semantic networks, but first it seems natural to answer the question: why use semantic networks in order to make inferences, instead of, say, usual logical formulae? Well, this is an ever-lasting point of debate. The arguments generally given in defence of the network representation are as follows. (i) Not all inferences are equally needed; while logic can, in principle, ?educe any deducible fact, some lines of reasoning have more practical Importance, and having a means of deducing quickly along these lines is definitely an advantage.
C. Froidevaux and D. Kaysn
182
The most useful deductive pattern is inheritance; it goes as follows: if it can be established that a (respectively the As) is a (respectively arc) member(s) of class B, and if it is known that every element of B has some property, then a (respectively the As) has (respectively have) that property. In first-order logic, this might be expressed for example as
(2)
('v'x) (A(x) =:. B(x))
(3)
('v'x) (B(x) =:. (3y)(Q(y)" P(x, y)))
Now, if assertions (2) and (3) are scattered among many irrelevant (for the present purpose) formulae, it might take a long time for an automatic theorem-prover to show that
(4)
('v'x) (A(x) =:. (3y)(Q(y)" P(x, y)))
holds. But this is precisely the kind of deduction that the indexing capabilities of a semantic net allow to be performed very quickly (see Section 2). (ii) The creation of new nodes and arcs seems easier, at least for a nonexpert, than the redaction of formulae. More important, a visual display can show at once all the relations concerning a given entity: this facility makes modification a much safer task than altering a formula without being aware of the side-effects of such a modification. It is worth noting that recent advances in semantic data models tend to favour network-like formalisms (see e.g. Hull, 1985), precisely because of the ease of modification and interrogation. (iii) As will be shown later, adding or removing exceptions is relatively fast and safe, while formulae like (I) need to be rewritten every time a new exception is taken into account. Alternatively, it could be argued that we now need two inference mechanisms in order to be complete: when inheritance has failed to yield the desired result, a standard theorem-prover must be triggered, and the time spent with the network is wasted; this is true but (a)
efforts are made to ensure that inheritance runs much faster than regular deduction-the waste of time is thus negligible; (b) some "hybrid" strategies (Brachman et al., 1985) may take advantage of the work already performed during inheritance to speed ur the deduction; (c) completeness is not always wanted (some users might prefer a message "It will take me much time to find the answer. Do you reallY
7
Inheritance in Semantic Networks and Default Logic
183
need it?"; only in the case where the user says "yes" is a complete deduction required).
2 AN EXAMPLE
Consider the following statements: when the road is clear, go ahead; when you have a red light in front of you, stop; when a policeman tells you to go ahead, go ahead. At first sight, they seem contradictory: nothing prevents you from being on a clear road, with a red light in front of you; the above statements conclude then on "go ahead" and "stop". What should be done is to consider the word "except" as implied at the end of the first two lines, i.e. if the road is clear go ahead, except when you have red light, in which case you must stop, except .... In the third assertion it is implicit that the road is clear. Now the situation is well defined in any case. Let us have a look at its representation in NETL:
GO:
RC: RL:
PO:
situation where you GO ahead; situation where Road is Clear; situation where road is clear, but Red Light is on; situation where road is clear, with red light, but POliceman tells you to go.
The arrow ~means "set inclusion" (e.g. RL ~ RC will be read as "every situation where a road is clear with Red Light on is a situation where Road is Clear"). The arrow means "typical set inclusion" (e.g. Rc ----. GO will be read as "every situation where Road is Clear is a situation Where you GO ahead, except if you are told otherwise"). The arrow -Hit+ rneans "typical set exclusion" (in the sense that the typical element of the first set does not belong to the second set), and RL -!+#+ GO should read "every situation where road is clear, but Red Light is on, is not a situation where
184
C. Froidevaux and D.
Kay.~,,,.
you GO, except if you are told otherwise". The arrow --- --+ means "exception" (i.e. cancels the information represented by the arc pointed at by the arrow). These kinds of arcs, with possible variations, are sufficient to yield the intuitive results, when the inference mechanism is enforced by the following marker-passing schema (similar to Fahlman, 1979). Put a marker, say M I, on the node corresponding exactly to the actual situation, and pass markers according to the rules M I ---+ becomes M I ---+ M I becomes M I
---+
MI
(but M I M3 unchanged)
M I '*++ becomes M I
~
M2
(but M I -++ M3 +f+ remains unchanged)
MI
----+
M3
MI
---+
----+
becomes M I
---+
remains
In other words, M3 behaves as an inhibitor: as soon as an arc is marked with M3, it becomes unable to pass any marker. M I is a "yes" and M2 is a "no". Markers are supposed to be passed in parallel synchronously (if the process is asynchronous, "lost races" between markers might happen: this is easily detected and fixed; see Fahlman, 1979). Moreover, some networks can yield to a situation where both M I and M2 appear on the same node. This inconsistency occurs only in what Fahlman calls "illegal networks"; Fahlman et al. ( 1981) provides a means to detect them. Example Is an RL situation an RC situation? Put M I on RL: rule M I ----. M I applies; M I appears on RC; the answer is "yes". Is it a GO situation? Rules M I -- --+ M3 and M I -ttl++ M2 apply alsP simultaneously; M I appears on RC, M2 on GO. At the next step, rule M I - - M3 ---+ applies; nothing else is applicable. Only M2 has appeared on GO; the answer is "no". Example What is a PO? Put M I on PO: rule M I ~ M I applies twice and propagates M I to RL and to GO; rule M I ----+ M3 applies. At the next step, rule M I --+ M I propagates M I from RL to RC; rules M I ~ M3 -++-' and Ml ----+ M3 apply. At the next step, rule Ml M3-- applies: nothing else is applicable. The answer is: a PO is an RL, an RC, and a GO For this question, working the marker-passing schema leads to the following network:
185
7 Inheritance in Semantic Networks and Default Logic
Remark 1
Depth can be increased, such as in the following example:
Regular years have 365 days, except when year number is divisible by 4, where they have 366 days, except when year number is divisible by 100, where they have 365 days, except when year number is divisible by 400, where they have 366 days. The network is as follows:
\
\
I
I I ~
The reader may convince him/herself that the propagation scheme works, Yielding the expected results, and that any deeper model could in principle be represented, although it seems unlikely to find in the real world four nested e~ceptions. Moreover, this example is less realistic than the previous one, 810ce all the knowledge is available at once, and there is hence no need to reason with defaults: an ordinary algorithm is sufficient.
186
C. Froidevaux and D. Kayser
Remark 2 NETL allows "role" links, to reflect situations like the one described by (3), yielding p
B
Q
• ..-...*--+*
(Read: for all x that is a B there exists a y that is a Q and that plays role P for x)
Nothing precludes having, besides a "strict" role link with the interpretation which would mean, as in similar given by (3), a "default" role link cases, "unless cancelled, there exists ... ", with possible exception links pointing at it. For instance, the following net would mean "every human has a father, who is human, except Adam":
Human
Adam
..--......
* ""'WWWW' > *
l/
Father
••
We shall not elaborate on the role exceptions in the rest of this paper, but they don't seem to conceal more difficulty than that we shall treat.
3
PRESENTATION OF THE DEFAULT LOGIC (REITER)
3.1 3.1.1
Definition of the theory Intuitive approach
Our presentation of the formalism will be essentially syntactic, because the proposals in order to have a semantical definition of default theories did not provide a more intuitive insight into these theories. This formalism presents many interesting features and is in our opinion an appropriate tool to handle default reasoning. In order to make the syntactical definitions more intuitive. we begin with a few informal considerations. Recall that by default reasoning, we mean the drawing of plausible inferences from less-than-conclusive evidence in the absence of information \l1 the contrary. The first idea is to introduce in the first-order formalism a basiL default operator, denoted 'tf where 'tf w means "w cannot be deduced from the given knowledge base". The first-order theory will be augmented with inference schemata like this for example: (5)
bird(x) 'tf (penguin(x) v ostrich(x) v ... ) jlies(x)
7 Jnherilance in Semantic Networks and Default Logic
187
such a formula means: "if x is a bird and if it cannot be proved that x is a penguin or an ostrich, then deduce that x flies." If we add the formula (6)
bird(tweety)
then we can infer that tweety flies. Default reasoning is non-monotonic in the sense that the addition of new statements may invalidate previously derived facts: the set of theorems does not grow monotonically with the set of axioms. Namely, in the example, if we add to the formulae (5) and (6) the axiom
(7)
penguin( tweet y)
then the theory is still consistent but default (5) is not applicable and the formula flies (tweety) is no longer a theorem. An important difference between classical logic and default reasoning is that a single set of axioms can have more than one set of conclusions. For example, consider a universe with two objects A and B. Assume an object is not a block unless it is required to be. Assume also that either A or B is a block. We get the two statements: (8)
ff Block(x) -,B/ock(x)
(9)
Block(A) v Block(B)
Default (8) means "if it cannot be proved that x is a block then deduce that x is not a block". Now neither Block(A) nor Block(B) can be proved using the classical inference rules, so that -,Block( A) and -,Block( B) should be provable by means of(8). But the statement -,Block( A) 1\ -,Block( B), which then becomes provable, is inconsistent with (9). In order to avoid this inconsistency, we will manage to have two possible extensions, one in which Block(A) and -,Block( B) are provable, and another in which Block( B) and -,Block( A) are provable. This feature of default reasoning explains the difficulty of providing a semantics for defaults that allows a set of axioms to give rise to several extensions. Another problem is related to the definition of the non-monotonic theorems. In the inference schema (10)
ffP Q
it is necessary to know that P is not in the set of all provable statements in Order to declare that Q is in that set. This amounts to saying that the set of Provable statements should be known before any proof begins. To avoid this
188
C. Froidevaux and D. Kay.l'<'r
apparent circularity a fixed-point construction (cf. Definition 2 in Section 3.1.2) must be used. The definition of the theorems is thus not constructive: it is not in general decidable whether a formula is a theorem, i.e. there is no algorithm that will tell us whether or not a given default is applicable Moreover, to tell whether a sequence of formulae is or is not a proof of a nonmonotonic theorem does not depend, in contrast with classical logic, on an individual check of each step. 3.1.2
Formal definition
We now provide a short formal definition of default logic (for a more thorough presentation the reader is referred to Reiter, 1980; Besnard, 1987). Definition 1 A default theory 11 = (D, W) consists of a set of closed firstorder formulae Wand a set of defaults D. A default is any expression of the form u(x):v 1(x), v2 (x), ... , v.(x) w(x)
where u(x), v 1 (x), v2 (x), ... , v.(x) and w(x) are well-formed formulae whose free variables are among those of x = x 1 , ... , xm; u(x) is called the prerequisite' of the default, v1 (x), v2 (x), ... , v.(x) are its justifications and w(x) is its consequent. The meaning of the default it as follows: if u(x) is known and if v1 (x), v2 (x), ... , v.(x) are consistent with what is known then w(x) is inferred. Note The operator If used in the previous section had only an intuitive meaning. The symbol ":" is here formally defined. The reader should be aware that ":B" attempts to model the intuitive expression lfiB.
In what follows, theorems are given only for closed defaults, i.e. for defaults without free variables. The results extend to an open default theory, b) considering the closed default theories obtained from the open default theory by instantiating all the free variables of the defaults with the terms of the corresponding Herbrand universe. We should point out that the defaults are not formulae of the first-order language, but are in some way specific oriented inference rules. "Default~ therefore function like meta-rules; they are instructions about how to create an extension of the incomplete theory" (Reiter, 1980): the set of defaults D yields a means of predicting conclusions in spite of some gaps in the firstorder knowledge W, by extending W. Any such extension will provide the theorems of the default theory and will be interpreted as an acceptable set ol beliefs that one may hold about the incompletely specified world W. Note
7
189
Inheritance in Semantic Networks and Default Logic
that not every default theory has an extension and some have more than one. A set of formulae is considered as being an extension if it is the fixed point of an operator r, which we now define. The definition of r must take into account the following properties: an extension must contain W, be deductively closed under first-order provability and be closed under the application of the defaults. More formally, we get the following.
Definition 2 Given a default theory A = (D, W), let S be a set of closed formulae and r(S) the smallest set satisfying the following three properties: ~
(i)
W
(ii)
r(S)
(iii)
If
r(S); =
Th(r(S))
u: V1, ... , w
Vn
(Th stands for "first-order theoremhood");
ED, u E r(S) and -, v1 ,
A set E is an extension for A iff r(E) operator r.
=
.•. , -, Vn
¢ S then wE r(S).
E, i.e. iff E is a fixed point of the
"r(S) is the minimal set of beliefs that we can have in view of S, where S indicates which justifications for beliefs are to be admitted" (Besnard, 1987). We give a characterization of extensions that makes this notion more intuitive.
Theorem 1 for A iff E =
Let A = (D, W) be a closed default theory. E is an extension E;, where
U;=o, .... oo
£0 = W Ei+ I = Th(Ed u
{l w
u: V1, ... ,
w
Vn
ED,whereueE;and-,v 1 ,
... ,-,vn¢E
}
fori;;;: 0
Because of the occurrence of E in the definition of E;+ 1 , the sequence of sets
E1 cannot be considered as constructive to obtain an extension. We give some examples to illustrate the definitions.
Example 1
01)
Consider the following assertions:
in general birds fly
02) canaries are birds 03)
tweety is a canary
190
C. Froidevaux and D. Kayser
the theory L\ 1 =(D., W.), where bird(x))} and
These assertions correspond to
W1
=
{canary(tweety), (Vx (canary(x)
=:.
_ {bird(x) :.fties(s)} fiies(x)
D.-
Then L\ 1 has a unique extension
E1
=
Th( {canary(tweety), (Vx) (canary(x) =:.
bird(x)), bird(tweety),fiies(tweety)J) Assume that we add the new assertion (14)
tweety does not fly
LetL\'1 be(D 1 , W'1 ),where W 1 = Wu{---,fiies(tweety)};thenL\'1 hasauniquc extension
£'1
=
Th({canary(tweety), (Vx) (canary(x) =:. bird(x)), bird(tweety), ifiies(tweety) J)
We can no longer deduce fiies(tweety). This example shows that default theories are, in general, non-monotonic.
Example 2 modified.
This example is taken from Dubois et al. ( 1985) and slightly
( 15)
generally, if Mary attends a meeting, Peter does not attend the meeting
( 16)
generally, if Peter attends a meeting, Mary does not attend the meeting
Let AM M (resp. AMP) be short for "Attends Meeting Mary" (resp. Attends Meeting Peter). These assertions translate into the following defaults: _ {AMM: -,AMP AMP: -,AMM} D2,------
-,AMP
-,AMM
Moreover, suppose that we know (17)
either Peter or Mary (or both) attend the meeting
Then we must add the first-order formula W 2 = {AMM v AMP}. Then L\ 2 = (D 2 , W 2 ) has a unique extension £ 2 = Th({AMM vAMP}). We can conclude only the global presence of Mary or of Peter. A couple of things are worth noticing here: first, in contrast with the block example, there is nothing in the extension that was not deducible from the first-order knowledge, and this is because none of the default prerequisites j-, provable. A consequence is that the statement of exclusion 1(AMM
1\
AMP)
7 Inheritance in Semantic Networks and
D~fault
191
Logic
which follows intuitively from the premises (if one is present then the other is absent, so that they are not both present), is not provable. It seems then advisable to modify the translation of the statements in order to get defaults without prerequisites (as in the block example), for instance:
, _ {:AMM , :AMP} D2---,AMP --,AMM "Generally, if x attends a meeting then y does not attend a meeting" is translated here as "if it is consistent to believe that x attends, then conclude that y does not attend". The statement of exclusion is now provable; as a matter of fact, !!2 = (D]., W 2 ) has two extensions: E]. = Th({AMM, -,AMP}) and E2 = Th({AMP, --,AMM}), and 1(AMM" AMP) is a member of both E]. and E2. Unfortunately, this translation is till inadequate. Suppose the formulation to be slightly altered, the premises being (15) generally, if Mary attends a meeting then Peter does not attend the meeting (18)
every time Bill attends then Peter also attends
(19)
Bill attends
We now get the following default theory: !l3 = (D3, W3),
and
W3
=
where D 3 = { :AMM} -,AMP
{AMB, AMB =AMP}
The reader expects the conclusion to be "Peter attends", but this is not provable with the given translation; as AM M can be consistently believed, the conclusion --,AMP is also attainable, and thus there is no extension at all. We are thus naturally led to a translation using seminormal defaults (see later); the initial formulation of the problem is thus rendered as where D'2
=
{:AMM" -,AMP, :AMP" --,AMM} -,AMP
--,AMM
W 2 = {AMM vAMP}
This translation has two advantages: the statement of exclusion belongs to both extensions E]. and E2 of ll2, and the adjunction of AMB =AMP and 4MB to the first default of A2 leads to the correct conclusion: AMP. From this example, it is clear that the translation of a set of general
192
C. Froidevaux and D. Kayser
statements into a set of defaults is no trivial matter. To our knowledge, as yet, no procedure has been designed that would do the job. Example 3 In the case where the satisfied prerequisites lead to opposite conclusions, we get two extensions as for the block example. Suppose that we have the following assertions (cf. Reiter, 1980): (20)
typically Republicans are not pacifists
(21)
typically Quakers are pacifists
(22)
Richard is both a Quaker and a Republican
The corresponding default theory is 11 4
=
(D 4 , W4 ), where
W4
=
{Quaker(Richard)
1\
Republican(Richard))
and
D
=
4
{Republican(x):---, Pacifis:(x) Quaker(x): Pacifist(x)} ---, Pacifist(x) ' Pacifist(x)
114 has two extensions: =
Th( {Quaker( Richard)
1\
Republican( Richard), Pacifist( Richard)})
E~ =
Th( {Quaker( Richard)
1\
Republican( Richard), ---,Pacifist( Richard)})
£4
The existence of two extensions is compatible with the fact that we cannot conclude about the warlike nature of Richard (the assertions are ambiguous). This situation corresponds to an inconsistent net in NETL. While default logic accommodates two mutually inconsistent views, NETL forces the user to choose among them, because of the "illegality" of the net. Example 4
Let 11 5 be (D 5 , W5 ), where W 5 =
0
and D 5 =
.-,A}.Then {T
11 5 has no extension. This means that intrinsically incoherent defaults can never be applied. Moreover, Reiter proved a result according to which defaults cannot bring any inconsistencies. Theorem 2 A closed default theory (D, W) has an inconsistent extension iff W is inconsistent. An important drawback of this formalism is that, given a default theory, we cannot know a priori whether it has an extension or not. It then seems natural to restrict ourselves to theories that are known to have an extension. Normal default theories enjoy this property.
7 Inheritance in Semantic Networks and Default Logic
193
rx./•
Definition 3 A closed normal default is a default of the form: where rx. and f3 are closed formulae. A closed normal default theory is a closed theory (D, W), where every default of D is normal.
We get the important following result: Theorem 3 extension.
(Reiter)
Every closed normal default theory has an
Normal default theories are also semi-monotonic; that is, we have the following: Theorem 4
(Semi-monotonicity)
(Reiter)
Suppose that D and
D' are sets of closed normal defaults with D £ D'. Let E be an extension for the closed normal default theory 11 = (D, W) and let 11' = (D', W). Then 11' has an extension E' such that E £ E'.
This property is very useful in the sense that it makes possible a proof theory that is local with respect to the defaults entering into the proof. Reiter (1980) defines the notion of default proof, which we shall not present here; we mention only the result, according to which a consistent closed normal default theory 11 has an extension E such that f3 E E iff f3 has a default proof with respect to 11. Owing to the necessity for satisfiability tests in establishing the default proof, the extension membership problem, even for closed normal default theories, is proved as being not semi-decidable. Unfortunately normal defaults can interact with each other so that they lead to the derivation of anomalous default assumptions. In order to avoid them, we need other default rules. For example, consider the following assertions (Reiter and Criscuolo, 1981): (23)
typically, high-school dropouts are adults
(24)
typically, adults are employed
We do not want to conclude, for a given high school dropout, that he/she is employed. Let John be a high school dropout. Now, if we need open normal defaults, we get the following default theory: 11 = (D, W), where
D = {high-school-dropout(x): adult(x), adult(x): employed(x)} adult(x) employed(x)
194
C. Froidevaux and D. Kayser
and
W = {high-school-dropout(John)}
Then A has an extension that contains employed(John). We can block the transitivity by replacing the second default by adult( x) : employed ( x) 1\ --,high-school-dropout( x) employed(x)
. rx.: p 1\ y Thus we get semi-normal defaults, 1.e. defaults of the form . SemiY
normal default theories are generally not desirable, because they do not enjoy some of the good properties of normal theories. They can fail to have some extension, they lack semi-monotonicity and their proof theory appears to be considerably more complex than that for normal theories. In some cases, semi-normal defauots can be avoided. Consider the following statements. (25)
typically university students are adults
(26)
typically adults are employed
(27)
typically university students are not employed
The use of normal defaults leads to some ambiguity. But in this case, the statements are compatible with the following statement that we can add: (28)
typically adults are not university students
Thus we can use normal defaults only as follows: D = {student(x): adult(x), adult(x): --, student(x), adult(x) --, student(x) adult(x)
1\
istudent(x): employed(x), student(x): 1employed(x)} employed(x) 1employed(x)
Let Peter be a student: W = {student( Peter)}. Then A = (D, W) has a unique extension that contains: 1employed(Peter). Let John be an adult: W' = (adult(John)}. Then A= (D, W') has a unique extension containing : 1student(John), employed(John).
In general it is not possible to avoid the use of semi-normal defaults, as will be shown in the next section.
3.2
Interpretation of NETL proposed by Etherington and Reiter
The discussion up to the end of Section 3 is restricted to the mechanisms of NETL that deal with is-a hierarchies with exceptions.
195
7 Inheritance in Semantic Networks and Default Logic
As in Reiter and Criscuolo (1981 ), Etherington and Reiter distinguish between prototypical facts such as "typically mammals give birth to live young" and hard facts about the world such as "all dogs are mammals". The former translate into default rules, while the latter translate into universal first-order formulae. The translation rules are as follows: A~B:
(a)
(\fx) (A(x) => B(x))
(strict is-a link)
("As are always Bs") A~B:
(b)
(\fx) (A(x) => --, B(x))
(strict is-not-a link)
("As are never Bs")
(c)
A--+ B:
(d)
A -Hit+ B:
(e)
c-- -+:
A(x): B(x)
(default is-a link)
B(x)
A(x): --, B(x)
(default is-not-a link)
--, B(x) (exception link)
This final link must have at its head a default link. It cannot be translated independently of this default link. There are two cases:
r-----
B
(i)
A(x):B(x)
1\
ICt(X)
1\ ..• 1\
--,Cn(X)
B(x)
c, ..... c. A(x): 1B(x)
1\
1C 1(x)
1\ •.. 1\
--,Cn(x)
--, B(x)
A
Recall the first example of Section 2, where the network was as on the left below: GO
~\
RC
\
The corresponding default theory is \
l)
Ri_ I
\.I
PO
A= (D, W) D
= {RC(x): GO(x)
" --, RL(x), _R_L_:_(x-=-):_--,_G_O--::-('-=-x'-=-)-,---"-'_P_O---'(-'-x)} GO(x) --, GO(x)
W = { (\fx) ( PO(x) => RL(x)), (\fx) (RL(x) => RC(x)). (\fx) ( PO(x) => GO(x))}
196
C. Froidevaux and D. Kayser
Let c be a PO situation. Let W 1 be Wu {PO(c)}. A 1 = (D, WI) has a unique extension that contains RL(c), RC(c) and GO(c); that is, c is an RL situation (road-is-clear-but-Red-Light-is-on), an RC situation (Road-isClear), and a GO situation (you-GO-ahead). Let b be an RL situation not known to be a PO situation. Let W 2 be Wu {RL(b)}. A2 = (D, W2 ) has a unique extension E that contains RC(b) and---, GO( b); b is an RC situation but is not a GO situation. Note that E also contains ---,PO( b). Etherington proved the following result: although it may have non-normal defaults, the default theory corresponding to an acyclic inheritance network with exceptions has at least one extension. An algorithm that computes the extensions is provided. Such default theories are said to be ordered semi-normal default theories. If a semi-normal default theory is not ordered then it does not necessarily have an extension, as the following example shows. Let A = (D, W), where W=
0,
D = {
: C & ---, B : B & ---, A : A & ---, C ' B ' A
C}
Then A has no extension. An important drawback of this formalization is that the translation of general statements explicitly mentions the exceptions. This gives rise to different problems. Either we assume that the exceptions are all previously known (an improbable assumption) or we accept continuous modification of the defaults as new exceptions are discovered. An increasing number of exceptions increases the complexity of the defaults. Moreover, the links of the network cannot be translated independently of each other. In Section 3.4 we propose another formalization using default logic that avoids the abovementioned drawbacks and is closer to the NETL structure. The basic idea bears some similarity to McDermott's (1982) proposal for handling exceptions. First we present Touretzky's proposition for handling is-a hierarchies with exceptions, a proposition that does not use semi-normal defaults. 3.3
Implicit ordering of defaults
Touretzky ( 1984) presents a formal analysis of inheritance under "i1iferential ordering". This concept allows him to define an implicit ordering relation « among defaults and to represent inheritance in a natural way, using default logic. To represent the is-a and is-not-a links between two classes P and Q. he uses the following normal defaults: P(x):Q(x) Q(x)
(' 1. k) 1s-a m ,
P(x):1Q(x) ---, Q(x)
. . (1s-not-a hnk)
Let di and di be two defaults of a normal default theory such that Pi(x) is the
7
Inheritance in Semantic Networks and Default Logic
197
prerequisite of di and Pi(x) is the prerequisite of di. He defines di « di to mean that either there exists a default with prerequisite Pi(x) and consequent Pj(x), or there exists another normal default dk such that di « dk and dk « di. This partial ordering induces an ordering (also denoted «) on the default proofs, by comparing the ordering of the last default rules used in each proof. (A default proof sequence is the equivalent of an inheritance path in semantic networks.) For example the three assertions (25)-(27) are translated into the following normal default theory:
D=
{d
1
=
student(x): adult(x), d 2 adult(x)
=
adult(x): ernployed(x), ernployed(x) d3
=
student(x): -, ernployed(x)} 1ernployed(x)
where d 1 « d 2 and d 3 « d 2 • (The second point comes from the fact that there exists a default, namely d 1 , whose prerequisite is the prerequisite of d3 and whose consequent is the prerequisite of d2 ). Let Peter be a student: W = {student (Peter)}. Let 11 = (D, W). We get two conflicting default proof sequences:
S1
=
student(Peter)--(dd- -+adult(Peter)--(d 2 )---+ employed( Peter)
S2 = student(Peter)--(d 3 )- -+iernployed(Peter) From d 3 « d 2 it follows that only S2 is a valid default proof: we obtain the desired result. With this ordering, we can decide which extension must be chosen when there are many contradictory possibilities while the network is intrinsically unambiguous. In the case of an ambiguous network, the implicit order forces the user to recognize that neither extension is to be preferred to the other. While there is no longer any need explicitly to mention exceptions to general laws, so that semi-normal defaults are avoided, this formalization does not allow general laws and exceptions to be distinguished. 3.4 Anotherformalization using default logic for is-a hierarchies With exceptions 3.4.1
Definition of the semantics used
The formal semantics proposed here for inheritance networks is more closely related to the definition of links, nodes and wires in NETL. As do Etherington and Reiter, we distinguish between hard facts and prototypical facts.
198
C. Froidevaux and D. Kayser
In this system a "default is-a link" between two nodes A and B is in fact identified with a node ("handle node") R;-the name of the assertion-linked by wires to nodes A and B. R; is called the "justification" of the assertion. The default is-a link between A and B will therefore be represented as follows: A - R; --+ B. It translates into the semi-normal default ~. = u,
A(x): R;(x)
B(x)
1\
B(x)
(is-a default rule)
(where A(x) and B(x) are property predicates and R;(x) is an assertion predicate). The default is-not-a link between A and B will be represented as follows: A
++ Ri -t++ B. (ji =
It translates into the semi-normal default A(x): R ·(x) 1
1\ ---, B(x)
---,B(x)
(is-not-a default rule)
Like the default is-a link, the cancel link is represented with a node for the name of the assertion. Its graphical representation is A -
R;- > B or A
1'
-++ Ri ++- > B 1'
I
I
Rk
Rk
.
I I
I
I
c
c
which translates into the default (jk =
C(x):Rk(x)
1\
---,R.(x)
---,R.(x)
for n = i or j
(exception default rule)
We make the following assumption: the justification of two different default rules cannot have the same assertion predicates. For every new assertion there will be a new assertion predicate created. Strict is-a links and strict is-not-a links are translated as in Section 3.2. Recall the example of Section 3.2. With our notation, we get the semantic network shown on opposite page. Let W1 be Wu { PO(a)}; A 1 = (D, W!) has a unique extension T 1 that contains PO(a), RL(a), RC(a) and GO(a). Let W2 be Wu {RL(h)j: A2 = (D, W 2 ) has a unique extension T 2 that contains RL(b), RC(b) and ---, GO(b ). Which are precisely the results we wanted. Note that in this case, we do not need the strict is-a link between PO and GO. We could as well have
7
199
Inheritance in Semantic Networks and Default Logic
The corresponding default theory is
Gr~
A= (D, W)
R,~
I\
I
RC
A
.,, l
RL
1\
1R 1(x)
R,
,' l
R4
RL(x): -,R 1(x)
I
R
I
D = {RC(x): GO(x) 1\ R 1(x), RL(x):---, GO(x) " R 2 (x), GO(x) ---, GO(x)
W
R 3 (x), PO(x): 1R 2 (x) 1\ R 4 (x)} 1R 2 (x)
= {('vx) (PO(x) =:. RL(x)), (\fx) (RL(x) =:. RC(x)),
(\fx) (PO(x) =:. GO(x))}
\l
PO
used an exception link between PO and R 3 . It translates into the default PO(x): ---, R 3 (x)
1\
R 5 (x)
---, R 3 (x) that we would use instead of the formula (\fx) (PO(x) =:. GO(x)). The semi-normal default theory is ordered in the same way as that of Etherington and Reiter. Therefore it has an extension. The algorithm provided by these authors applies here. Note that we can use the same network for another purpose: we might wish not to interpret exceptions as assertions of negative facts, but only as blocking the transitivity of the is-a inference path. In this case, we can suppress from the justification of every semi-normal default the predicate that equals to the consequent. We thus get a taxonomic default theory, whose defaults are neither normal nor semi-normal but which still has a unique extension and enjoys some other good properties (cf. Froidevaux, 1986). The formalization using semi-normal defaults and assertion predicates has the same useful properties as the NETL system: as new exceptions are added, old rules remain valid; it is merely necessary to introduce new exception default rules; there is thus no need to know all the exceptions beforehand, thanks to the use of the assertion predicates R;(x). One could object to the introduction of these assertion predicates R;(x) because of the large number of assertions. The objection should then also address another non-monotonic treatment of is-a hierarchies with exceptions: the "circumscription" proposed by McCarthy (1986). He supposes that every object is abnormal in some way and hence "wants to allow some aspects of the object to be abnormal and still assume the normality of the rest". To this end he introduces a predicate ab and many functions aspect;; let us notice that there are as many such
200
C. Froidevaux and D. Kay.l'er
functions aspect; as there are assertions, and hence as many as the assertion predicates R;. Recently, Lukaszewicz (1986) proposed a system for default reasoning that uses many abnormality predicates that are very similar to our assertion predicates. Another interesting feature of our formalism, besides a closer correspondence with elements of the NETL network (namely handle modes), is that it provides a means for handling ambiguities.
3.4.2 Solving ambiguity Consider the following example, taken from Reiter: (29)
If you don't know where a person lives then you can assume that he/she lives where his/her spouse does
(30)
You can also assume that he/she lives where his/her employer 1s located
( 31)
Mary works-in Vancouver
(32)
Spouse(M ary) lives-in Toronto
(33)
(x lives-in u
(34)
!(Vancouver= Toronto)
1\
x lives-in v)
=>
u=v
The statements (29) and (30) translate into defaults as follows: _
(j
1 -
(j 2
lives-in (spouse(x), y): R 1 (x, y) lives-in(x, y)
1\
lives-in(x, y)
works-in (x, y): R (x, y) 1\ lives-in(x, y) lives-in(x, y)
2 = -----'----=-::----=---,--------'--
The assertions (31 )-(34) translate naturally into first-order formulae. With Reiter's classical translation into default logic, two alternatives are obtained: more precisely, we get two extensions. With our translation, we can impose a priority among these extensions. For this, we add either the default () 3 or the default () 4 :
With the presence of the default () 3 , we get Mary lives-in Toronto; with () 4 , we get Mary lives-in Vancouver. Obviously we cannot add simultaneously both defaults () 3 and () 4 .
7
Inheritance in Semantic Networks and Default Logic
201
3.4.3 Conclusion Our formalization improves on the others, because (i)
it preserves the modularity of links in NETL; and
(ii)
it reflects the possibility of inhibiting links ("marking handle-node" in Fahlman, 1979).
It is, however, worth noting that none of the formalizations exactly translates the results of the marker-passing algorithm. For instance, in the theory A2 of Section 3.2, the default theory concludes on -,PO( b), while no M 2 marker could possibly reach node PO.
4
DOMAINS OF APPLICATION
As has already been explained in Section 1, semantic nets with exceptions are useful every time a domain is described with statements of typicality and contains incompletely specified objects. This does not mean that they apply equally well under all circumstances. Consider the following cases. 4.1 "Typically" is understood as "true unless otherwise stated or deduced" Example (After McDermott and Doyle, 1980) "In France, it is daylight at noon". The only possible exceptions are eclipses, and these are sufficiently rare and predictable to be announced; so the information is true unless it is explicitly known to be false. This is exactly the case where all the mechanisms presented here work at their best. 4.2 "Typically" means "most plausible unless otherwise stated or deduced" Example "Seminars are held every Monday at 2.30 p.m." This should be Understood as "true except during holidays and except when a cancellation has been sent to the prospective attendees". Nevertheless, an attendee who knows: (i)
that next Monday is not holiday;
(ii) that he/she is on the mailing list of the seminar and did not receive a cancellation,
202
C. Froidevaux and D. Kayser
is entitled to assume that the seminar will take place next Monday, but he/she might still consider the possibility of the contrary (for example, the cancellation did not reach him/her, the speaker has cancelled too late to send mail, ... ). This situation corresponds to a decision with incomplete information and. unfortunately, this case is much more frequent than case 4.1. The mechanism presented here still works, but, depending on whether or not default rules have been used to reach it, the conclusion should be taken as "true" or as "plausible". If an estimation of the probability/plausibility is possible then the semantic nets should be augmented with number-passing capabilities (Fahlman, 1982) in order to compute a degree of confidence in the result. When "typically" does not even have the meaning of "plausible", but reflects only an asymmetry in favour of one of the possible outcomes, the conclusion-when reached by means of defaults-has only the meaning of "the best conclusion available", which does not guarantee its plausibility. 4.3
Abduction
In troubleshooting or diagnosis systems, many rules read "if a is observed. then b is a plausible hypothesis". Of course, b might be ruled out by previous observations; the situation then bears some resemblance to the situations previously considered. Example "If lamp does not light when you turn on the switch then check the bulb." "If lamp does not light when you turn on the switch and the bulb seems OK, then check the fuse." "If lamp does not light when you turn on the switch, if the bulb and the fuse seem OK, then check whether there is some light somewhere else in the vicinity".
The word "seem" here is important, since successful troubleshooting involves taking into account the fact that first-order verifications do not preclude further investigations. In other words, if "observed" is taken as a "modal operator" then one never has a strict rule: "observed" P::::;. P, but a default rule: P to be true
!I' observed P then by default consider
Poole defines a logical system for problems of diagnosis that involves default assumptions that are analogous to some of Reiter's defaults. In this case, defaults are treated as possible hypotheses in a scientific theory that explains the results. We give only a few indications on the notion of
7
Inheritance in Semantic Networks and Default Logic
203
explainable: an answer is explainable if it follows logically from some consistent set of default instances together with the facts. Poole (1985) provides a semantic characterization of the notion of the most specific theory to explain the results, in the case where reasoning with defaults leads to more than one extension and hence produces different answers. As far as semantic networks are concerned, Poole's formalism provides a semantics for the syntactical notion of inferential ordering (cf. Section 3.3). However, this formalism is not restricted to default links (prototypical facts) as in Touretzky's, but also handles hard facts. 4.4
Typical elements in a class
Much recent work in cognitive psychology (e.g. Rosch, 1975; Dubois, 1986) is based on the fact that human knowledge makes an intensive use of classes in which some elements play a special role; these elements are sometimes said to be prototypical. In what respect is it useful to know which individuals or subclasses are prototypical of a class? The answer is clear when class inclusion is considered as a default statement: an object is prototypical in a class if all (or at least most) default rules specific to that class apply to the object, i.e. if it cancels no links starting from the node corresponding to the class. In a system having no intermediary degrees of truth, this fact adds nothing new. Any individual or subclass of the class already inherits all the default properties of the class; so the only information gained through knowledge of the fact "a is a typical B" is a higher degree of confidence in the default assumptions on a property inherited by a from B. However, the knowledge should also be exploited in a different way, which could be named "bottomup" inheritance, along the following lines: "if you want to know what a B is, and you know that a is a typical member of class B, then look at a". (This is used when answering a child's question: "What is a fir tree?": If you happen to see one good example of a fir then you may answer "look, this is a fir fir tree", i.e. fir trees inherit many properties of the tree.) The inference scheme would then be: if x is a B, if it is consistent that x has property P if a is a typical element of B and has property P then, it is plausible that x has property P. Unfortunately, this inference scheme is much too crude: it amounts to assuming that a typical element has only typical properties. The same Problem arises in analogical reasoning: analogy using irrelevant features Yields invalid conclusions. The problem of finding the "typical properties" or the "relevant features" remains open.
5 CONCLUSION
We have focused on certain patterns of default reasoning essentially based on the notion of inheritance. These patterns occur frequently, especially in commonsense reasoning as it is represented in semantic networks (assertions of typicality). We have shown that networks can provide a graphical notation that distinguishes inheritance and exceptions, and that these networks can be handled at reasonable computational cost. Moreover, default logic yields a formal tool for representing default reasoning. Default logic works well when normal defaults are used. Unfortunately, normal defaults can interact in such a way that the use of non-normal defaults becomes necessary. We have emphasized the close links between semantic networks and default logic by giving translations of net arcs into default-logic formulae. The formalizations using default logic show the correctness of the NETL-like semantic networks in the case of hierarchies with exceptions. In general, default theories are computationally intractable because of the need to check for consistency in the proof of a non-monotonic theorem. Reiter, as early as 1980, pointed out the need for heuristics (cf. Section 3) for handling default theories. Semantic networks, by providing indexing schemes on formulae, could yield an efficient heuristic for the consistency checks required in default reasoning (Reiter and Criscuolo, 1983). Potential derivation chains in the logical formalism correspond to directed paths in the network representation. Recall Example 1 and repeat what Reiter and Criscuolo suggest: "if node "¬flies" cannot be found within a sufficiently large radius r of the node "tweety" (i.e. if no directed path of length r or less from "tweety" to "¬flies" exists in the index structure) then it is likely that flies(tweety) is consistent with the given first-order database". According to them, "an heuristic of this kind is precisely the sort of resource limited computation required for common sense reasoning", as Winograd (1980) advocates. This proposition is interesting insofar as it evokes an efficient way to combine the heuristic power of semantic networks and the formal power of default logic.
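The quoted heuristic is easy to make concrete. The sketch below is not from the original text: it treats the network index as a plain adjacency-list digraph and runs a breadth-first search bounded by the radius r; all node names are hypothetical.

from collections import deque

def within_radius(graph, start, target, r):
    """True iff a directed path of length r or less links start to target."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        node, dist = frontier.popleft()
        if node == target:
            return True
        if dist < r:
            for nxt in graph.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))
    return False

# flies(tweety) is heuristically taken to be consistent if "~flies" is not
# reachable from "tweety" within radius r in the network index
graph = {"tweety": ["bird"], "bird": ["animal"], "ostrich": ["~flies"]}
print(not within_radius(graph, "tweety", "~flies", r=3))  # True: likely consistent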
BIBLIOGRAPHY
AI (1980). Special issue on non-monotonic logic. Artificial Intelligence 13, no. 1-2. (Historically, the first important publication in non-monotonic logic. Various aspects of the field are covered: the ideas behind the notion of non-monotonicity are generally well discussed, even in the most technical papers. Some of the contributors have since given more elaborate versions of their systems.)
Besnard, P. (1987). An Introduction to Default Logic. Springer-Verlag, Berlin.
Brachman, R. J. and Schmolze, J. G. (1985). An overview of the KL-ONE knowledge representation system. Cogn. Sci. 9, 171-216.
Brachman, R. J., Gilbert, V. P. and Levesque, H. J. (1985). An essential hybrid reasoning system: knowledge and symbol level accounts of KRYPTON. Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-85), Los Angeles, pp. 532-539. Morgan Kaufmann, Los Altos, California.
Doyle, L. B. (1962). Indexing and abstracting by association. American Documentation (October), pp. 378-390.
Dubois, D. (1986). Compréhension de phrases: représentation sémantique et processus. Thèse d'Etat, Univ. Paris 8.
Dubois, D., Farreny, H. and Prade, H. (1985). Sur divers problèmes inhérents à l'automatisation des raisonnements de sens commun. Congrès AFCET-RFIA, Grenoble, Vol. 1, pp. 321-328.
Etherington, D. W. and Reiter, R. (1983). On inheritance hierarchies with exceptions. Proc. American Association for Artificial Intelligence Conf. (AAAI-83), Washington, DC, pp. 104-108.
Fahlman, S. E. (1979). NETL: A System for Representing and Using Real-World Knowledge. MIT Press, Cambridge, Mass. (NETL was intended for actual implementation on massively parallel architectures. The ambitions were probably too high for the hardware existing in the late 1970s. This work announces both non-monotonic logic and connectionism. Very easy to read, and full of interesting remarks concerning knowledge-representation issues.)
Fahlman, S. E., Touretzky, D. S. and Van Roggen, W. (1981). Cancellation in a parallel semantic network. Proc. 7th Int. Joint Conf. on Artificial Intelligence (IJCAI-81), Vancouver, pp. 257-263.
Fahlman, S. E. (1982). Three flavors of parallelism. Proc. Canadian Soc. for Computational Studies of Intelligence-82, Saskatoon, Sask., pp. 230-235.
Findler, N. V. (1979). Associative Networks: Representation and Use of Knowledge by Computers. Academic Press, New York. (Although rather old, compared with the timescale of the field, this collection of 14 contributions is one of the best introductions to semantic networks. As for AI (1980), most of the authors have more recently described newer versions of their systems, but the basic ideas underlying their research are generally better presented in this volume.)
Froidevaux, C. (1985). Exceptions dans les hiérarchies SORTE-DE. Congrès AFCET-RFIA, Grenoble, Vol. 2, pp. 1127-1138.
Froidevaux, C. (1986). Taxonomic default theory. Proc. European Conf. on Artificial Intelligence (ECAI-86), Brighton, pp. 123-129.
Hayes, P. J. (1977). In defence of logic. Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-77), Cambridge, Mass., pp. 559-565.
Hull, R. (1985). A survey of research on semantic database models. Technical Report, Comp. Sci. Dept, Univ. Southern California, May.
Israel, D. J. (1980). What's wrong with non-monotonic logic? Proc. American Association for Artificial Intelligence Conf. (AAAI-80), Stanford, pp. 99-101.
Kayser, D. (1984). Examen de diverses méthodes utilisées en représentation des connaissances. Congrès AFCET-RFIA, Paris, Vol. 2, pp. 115-144.
Łukaszewicz, W. (1985). Two results on default logic. Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-85), Los Angeles, pp. 459-461. Morgan Kaufmann, Los Altos, California.
Łukaszewicz, W. (1986). Minimization of abnormality: a simple system for default reasoning. Proc. European Conf. on Artificial Intelligence (ECAI-86), Brighton, pp. 382-389.
McCarthy, J. (1980). Circumscription: a form of non-monotonic reasoning. Artificial Intelligence 13, 295-323.
McCarthy, J. (1986). Applications of circumscription to formalizing common-sense knowledge. Artificial Intelligence 28, 89-116.
McDermott, D. and Doyle, J. (1980). Non-monotonic logic I. Artificial Intelligence 13, 41-72.
McDermott, D. (1982). Non-monotonic logic II: Non-monotonic modal theories. JACM 29, 33-57.
Moore, R. (1985). Semantical considerations on non-monotonic logic. Artificial Intelligence 25, 75-94.
Poole, D. (1984). A logical system for default reasoning. Workshop on Non-Monotonic Reasoning, AAAI, October, pp. 373-384.
Poole, D. (1985). On the comparison of theories: preferring the most specific explanation. Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-85), Los Angeles, pp. 144-147. Morgan Kaufmann, Los Altos, California.
Quillian, M. R. (1968). Semantic memory. Semantic Information Processing (ed. M. Minsky), pp. 227-270. MIT Press, Cambridge, Mass.
Reiter, R. (1978). On reasoning by default. Proc. 2nd Symp. on Theoretical Issues in Natural Language Processing, Urbana, Illinois, pp. 210-218.
Reiter, R. (1980). A logic for default reasoning. Artificial Intelligence 13, 81-132. (This paper, as well as McCarthy's, is certainly the most influential paper of the collection. The most important theorems concerning normal default theories are proved (existence of extensions, mutual inconsistency of extensions, semi-monotonicity, elements of proof theory). The reasons for the choices are provided, which makes the paper, despite its technicality, very readable.)
Reiter, R. and Criscuolo, G. (1981). On interacting defaults. Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-81), Vancouver, pp. 270-276.
Reiter, R. and Criscuolo, G. (1983). Some representational issues in default reasoning. Comp. Maths Applics 9, 15-27.
Rosch, E. (1975). Cognitive representations of semantic categories. J. Exp. Psychol. 104, 192-233.
Shapiro, S. C. (1971). The MIND system: a data structure for semantic information processing. Report R-837-PR, The Rand Corp.
Schubert, L. K. (1976). Extending the expressive power of semantic networks. Artificial Intelligence 7, 163-198.
Touretzky, D. S. (1984). Implicit ordering of defaults in inheritance systems. Proc. American Association for Artificial Intelligence Conf. (AAAI-84), pp. 322-325.
Vilain, M. (1984). KL-TWO: a hybrid knowledge representation system. BBN Technical Report 5694.
Winograd, T. (1980). Extended inference modes in reasoning. Artificial Intelligence 13, 5-26.
Woods, W. A. (1975). What's in a link: foundations for semantic networks. Representation and Understanding (ed. D. G. Bobrow and A. Collins), pp. 35-82. Academic Press, New York.
DISCUSSION

Didier Dubois and Henri Prade: Default logic offers a formal mechanism for dealing with rules having unspecified exceptions. However, some limitations or problems
seem to exist from a knowledge-representation point of view. In the following, five questions are briefly mentioned.
(1) The approach in its present state does not seem able to take into account various modalities, such as "typically" and "very typically", that might enable some ambiguities to be resolved, as in the following example:
typically, Republicans are not pacifists
very typically, Quakers are pacifists
Richard is both a Quaker and a Republican
What default conclusion can be obtained about Richard in the absence of any other information?
(2) Default logic enables ordinary conclusions to be derived as in standard logic, as well as default conclusions. But it seems that default logic itself does not provide any mechanism for distinguishing between an ordinary conclusion and a default one, which would be useful in case new information becomes available.
(3) A troublesome question is the choice of the right translation of a default assertion in default logic, since, according to the choice that is made, the derivation capacities may be different, as is mentioned by Froidevaux and Kayser. For instance, the rule "typically, if Mary attends a meeting then Peter does not" can be written in default logic as
AMM : ¬AMP / ¬AMP

or as

: AMM ⇒ ¬AMP / AMM ⇒ ¬AMP

or as

: AMM ∧ ¬AMP / ¬AMP
where AMM (resp. AMP) is short for "Attends Meeting Mary" (resp. "Attends Meeting Peter"). The question of the existence of a best translation seems to be worth investigating.
(4) In Reiter and Criscuolo (1981) it is claimed that default logic is more oriented towards the treatment of typicality than towards usuality (where a frequentist interpretation is assumed). It seems acceptable to say that "generally, birds fly" is more a question of a property typical of birds than of frequency (especially because there is considerable ambiguity about the referential on which this frequency would be defined). However, there are many examples where the choice between the two interpretations is more difficult; we can say "typically, students are unemployed" as well as "most students are unemployed".
It also seems that when we say that flying is a typical capability of birds we are saying not only that "generally, birds fly" but also that "generally, flying animals are birds" (although there are some exceptions, e.g. bats). It suggests that typicality is perhaps a
matter of default equivalence. Another question related to typicality is to know which is the primary notion: "typical property" or "typical element in a class". Note that a typical element may have a non-typical property too!
(5) Default logic, as discussed by Froidevaux and Kayser, is basically concerned with the application of default rules (e.g. typically high-school dropouts are adults), which are general even if they have exceptions, to particular situations or cases (e.g. John is a high-school dropout). Farreny and Prade (1986) have proposed a numerical treatment of this kind of problem using possibility measures. The problem of producing a new default rule from already known default rules is not usually considered in the default-logic literature. This latter problem has been addressed and discussed by Zadeh (1985), who models the fuzzy proportions present in default rules such as "most x that are As are Bs" in the setting of possibility theory. See the end of Chapter 10 for a brief account of this approach. Numerical approaches offer the advantage of quantifying the probability or the possibility of encountering exceptions.
Philippe Smets: (1) In the case of a closed normal default theory with a finite number of defaults, could Theorem 1, given Theorem 4, provide a constructive way of getting the extension, considering each default one after the other?
(2) Belief functions and default logic: many examples in this chapter could be handled with belief functions. To see the power of the belief-function approach within the context of default reasoning, I treat the example "Where does Mary live" (see Section 3.4.2, (29)-(34)). Let

X = "city where Mary lives"
H = "city where Mary's husband lives"
W = "city where Mary works"

Let Bel1(X = H) = α, Bel1(X ≠ H) = 1 − α, and Bel2(X = W) = β, Bel2(X ≠ W) = 1 − β. I have a degree of belief α that Mary lives with her husband and 1 − α that she does not. I have a degree of belief β that Mary lives where she works and 1 − β that she does not. The domain of X relevant here is 𝒯 × 𝒱, where 𝒯 = {T, ¬T} and 𝒱 = {V, ¬V}; T means Toronto, V means Vancouver, and a pair like (T, ¬V) in 𝒯 × 𝒱 means that Mary lives in Toronto and not in Vancouver. Evidence E3 is "H = T". Combined with evidence E1, on which Bel1 was derived, it induces a belief Bel13 with masses m13(cyl(T)) = α and m13(cyl(¬T)) = 1 − α, where cyl(T) is the cylindrical extension of T on 𝒯 × 𝒱, i.e. cyl(T) = (T, V) ∨ (T, ¬V). Evidence E4 is "W = V". Combined with evidence E2, on which Bel2 was derived, it induces a belief Bel24 with masses m24(cyl(V)) = β and m24(cyl(¬V)) = 1 − β. The combination of Bel13 with Bel24 leads to the belief function Bel1234 with masses m1234(T, ¬V) = α(1 − β), m1234(¬T, V) = (1 − α)β, m1234(T, V) = αβ, m1234(¬T, ¬V) = (1 − α)(1 − β). We further have evidence E5, "the place where Mary lives is unique". It induces a belief function Bel5 with mass m5(¬(T, V)) = 1. The combination of Bel5 with Bel1234 leads to the belief function Bel12345 with masses
m12345(T, ¬V) = α(1 − β)/(1 − αβ)
m12345(¬T, V) = (1 − α)β/(1 − αβ)
m12345(¬T, ¬V) = (1 − α)(1 − β)/(1 − αβ)
This last belief function quantifies our degree of belief that Mary lives in Toronto, in Vancouver, or somewhere else, a case not considered in the authors' analysis. Note that the same final solution would be derived if one had proposed Bel1(X = H) = α and Bel1(X ≠ H) = 0, Bel2(X = W) = β and Bel2(X ≠ W) = 0, i.e. Bel1 (Bel2) gives mass α (β) to "X = H" ("X = W") and leaves the remaining masses undetermined. Furthermore, owing to the symmetry and associativity of Dempster's rule of combination, the order in which the five pieces of evidence are combined is irrelevant. The only question that remains is to evaluate α and β, i.e. the strength of our belief that a wife lives with her husband and that she lives where she works, respectively. This evaluation problem is similar to the one encountered and theoretically solved with subjective probabilities. Our solution has the advantage of
(i) not neglecting the case "Mary lives neither in Toronto nor in Vancouver";
(ii) handling the case where there is some ordering between the default rules (29) and (30); and
(iii) providing a measure of the incoherence between the two pieces of evidence E1 and E2 (the term αβ).
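This combination is mechanical enough to check by machine. The sketch below is not part of the discussion itself: it is one possible encoding of Dempster's rule over subsets of the four-element frame, with illustrative values of α and β, and it reproduces the masses above.

from itertools import product

alpha, beta = 0.8, 0.6  # illustrative strengths of Bel13 and Bel24

FRAME = frozenset(product(["T", "~T"], ["V", "~V"]))

def cyl(t=None, v=None):
    """Cylindrical extension, e.g. cyl(t="T") = {(T, V), (T, ~V)}."""
    return frozenset((a, b) for a, b in FRAME
                     if (t is None or a == t) and (v is None or b == v))

def dempster(m1, m2):
    """Dempster's rule: intersect focal elements, renormalize the conflict."""
    combined, conflict = {}, 0.0
    for s1, w1 in m1.items():
        for s2, w2 in m2.items():
            inter = s1 & s2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

m13 = {cyl(t="T"): alpha, cyl(t="~T"): 1 - alpha}   # husband lives in Toronto
m24 = {cyl(v="V"): beta, cyl(v="~V"): 1 - beta}     # Mary works in Vancouver
m5 = {FRAME - cyl(t="T", v="V"): 1.0}               # living place is unique

m = dempster(dempster(m13, m24), m5)
for s, w in m.items():
    print(sorted(s), round(w, 4))
# m(T,~V) = a(1-b)/(1-ab), m(~T,V) = (1-a)b/(1-ab), m(~T,~V) = (1-a)(1-b)/(1-ab)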
Didier Dubois and Robert Valette: Anybody familiar with the literature on discrete asynchronous systems should be struck by the analogy between a NETL representation and a Petri net (Peterson, 1982). In both types of representation we have networks and markers (or tokens) moving from node to node. Moreover, the example given in Section 2 about cross-road traffic modelling is a typical example given in introductory courses on Petri nets. The following is a Petri net representing the behaviour of vehicles in the example:
[Figure: a Petri net for the cross-road traffic example, built from places, transitions and tokens; dotted arrows represent the environment.]
A Petri net is generally the model of a sequential process in an evolving environment. In the above network, the environment is represented by the dotted arrows, which express the facts that the light may switch from red to green, the policeman may say to go, and the road may get cleared, at some points of time.
There is a patent difference in the modelling purpose between the above network and the one in Section 2 (p. 185). The latter describes a reasoning process and is not meant to simulate the passing of a traffic junction. However, in the NETL formalism, as well as in Reiter's default logic, the knowledge representation is oriented. The reasoning problem is always based on facts about what the policeman says, how the road is and how the traffic light is, and aims at deciding whether to go or not to go. This is clearly very different from representations in classical logic formalisms, which are not directed. This directed feature of NETL representations makes it possible to envisage a Petri-net model of the reasoning processes embedded in a NETL network. However, the resulting Petri net will have special features, not shared by all kinds of Petri nets, namely:

(i) coloured tokens: these account for the three kinds of markers in NETL (see Jensen (1981) for an introduction to coloured Petri nets);

(ii) propagation by "contamination", i.e. a token crossing a transition leaves behind a copy of itself.

[Figure: firing of a transition in the usual Petri-net convention versus the contamination mode; the latter is obtained by a modification of the graph that keeps the same firing convention.]

Using these conventions, it is possible to systematically translate NETL arrows into the Petri-net formalism. We shall use three kinds of tokens: the "yes" (Y) and the "no" (N), which stand for M1 and M2, and a "nil" (0) token standing for the state of a place for which nothing is proved so far. Any place in the Petri net must contain one and only one of Y, N, 0. Links between places and transitions are also coloured, with the following meaning: a link labelled {x1, ..., xn} from a place to a transition means that the transition is enabled only by a token of a colour x ∈ {x1, ..., xn} in the upstream place, and a link labelled x from a transition to a place means that upon firing the transition a token of colour x appears in the downstream place.
In the following, we give the translation of NETL arrows:
[Figure: translations of the NETL arrows into the Petri-net formalism: set inclusion (A is-a B), typical set inclusion with an inhibitor arc for the exception link, and set exclusion, using the token colours Y, N and 0.]
Note that the term "set exclusion" is rather ambiguous, because in NETL it is an oriented notion, while it is usually a symmetrical one. What would be the advantages of using a Petri-net formalism rather than the NETL representation? In fact, mathematical and computerized tools are available to analyse Petri nets from the point of view of their consistency and the discovery of structural properties such as invariants (i.e. parts of the network that always contain the same number of tokens). But in the case of coloured Petri nets, these tools are still in their infancy. Moreover, most of the theoretical results and analytical algorithms exclude the contamination mode for propagating tokens. In conclusion, although it may seem interesting to know that NETL networks do belong to the Petri-net family, the consequences of this result do not look as promising as one might have thought beforehand.
Reply: Regarding Comment (1) by Smets, the answer is yes: it is proved in Łukaszewicz (1985)† that all and only the extensions of a closed normal theory Δ = (D, W) can be constructed by considering the n defaults of D in a well-specified order, and this solution is close to the one that is sought by Smets. Let E0 := Th(W) and D0 := D; for 0 ≤ i ≤ n:

if there exists at least one default d in Di, reading u : v / v, such that u ∈ Ei and ¬v ∉ Ei
then pick one such default d; let Di+1 := Di − {d}, Ei+1 := Th(Ei ∪ {v});
else (i.e. Di is empty, or for all defaults d in Di either u ∉ Ei or ¬v ∈ Ei) E := Ei is an extension of Δ; exit.

† We are indebted to P. Besnard for bringing this proof to our attention.
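A minimal propositional sketch of this procedure follows; it is not from the original text. Facts and default consequents are coded as literals, so Th(·) collapses to set membership and the consistency test is just the absence of a complementary literal; a real implementation would need a theorem prover.

def neg(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def extension(facts, defaults):
    """Build one extension of a closed normal default theory (D, W).

    Each default is a pair (u, v), read u : v / v.  Repeatedly apply an
    applicable default, i.e. one whose prerequisite u is in Ei while ~v is
    not, until none applies; the order of D decides which extension results.
    """
    e, d = set(facts), list(defaults)
    while True:
        applicable = [(u, v) for (u, v) in d if u in e and neg(v) not in e]
        if not applicable:
            return e          # E = Ei is an extension; exit
        u, v = applicable[0]  # pick one such default
        d.remove((u, v))      # D_{i+1} = D_i - {d}
        e.add(v)              # E_{i+1} = Th(E_i U {v})

# invented example: an ostrich is a bird, birds fly by default,
# ostriches do not fly by default
facts = {"bird", "ostrich"}
defaults = [("ostrich", "~flies"), ("bird", "flies")]
print(extension(facts, defaults))   # {'bird', 'ostrich', '~flies'}
# reversing the list of defaults yields the other extension, containing "flies"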
Comments (1), (4) and (5) by Dubois and Prade and Comment (2) by Smets are variants of one and the same statement: default logic is non-numerical. They seem to consider this fact as a weakness; it is not. There are, in practice, situations where one knows what is "normal" without finding it natural to give a numerical estimate of what "normal" amounts to. In the Toronto/Vancouver example, for instance, as Smets says, "the only question that remains [in his approach] is to evaluate α and β"; the default approach spares us the burden of computing such an evaluation. However, as Dubois and Prade's Comment (4) points out, it is not always obvious whether a given situation is better modelled with or without numbers. Default logic postulates that "abnormal" situations are made explicit, and this might yield trouble when two abnormalities conflict with each other, as in the Republican/Quaker example. If one considers that one of them is "more" abnormal than the other (the situation considered in Dubois and Prade's Comment (1)) then one can declare it (see Section 3.4.2). Reiter, quoted in Section 3.1.2, presents defaults as "meta-rules": Dubois and Prade (Comment (5)) ask whether defaults might produce new defaults, i.e. whether the "meta-meta" level makes sense; we are not aware of any work in this direction, but it might be an interesting one. Comment (2) by Dubois and Prade emphasizes the fact that, as there is no degree of truth in this approach, every conclusion, be it "hard" or "default", gets the same status. If this poses problems, we mentioned (Section 4.2) that the system can be augmented with number-passing capabilities, in order to differentiate the "true" from the "plausible". Finally, we agree with Dubois and Prade's Comment (3): knowing how to correctly translate natural-language statements into default-logic formulae would certainly be a good thing, but we cannot tell how much effort this might require, or even whether it is at all feasible. Another interesting direction is to use default logic as a tool for linguistic investigations, and there is some evidence that it could be a very promising one, at least for several linguistic issues (e.g. coreferentiality, ambiguity, typicality).
Additional references
Farreny, H. and Prade, H. (1986). Default and inexact reasoning with possibility degrees. IEEE Trans. Syst. Man Cyber. 16, 270-276.
Jensen, K. (1981). Coloured Petri nets and the invariant method. Theor. Comp. Sci. 14, 317-336.
Peterson, J. L. (1982). Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs, N.J.
Zadeh, L. A. (1985). Syllogistic reasoning in fuzzy logic and its application to usuality and reasoning with dispositions. IEEE Trans. Syst. Man Cyber. 15, 754-763.
8

Probabilistic Logic

GERHARD PAASS
Gesellschaft für Mathematik und Datenverarbeitung, Sankt Augustin, Federal Republic of Germany
Abstract
In this chapter the degree of belief in propositions is expressed by numerical probabilities. Probability theory is employed to establish a consistent probability measure in agreement with available evidence. The lines of reasoning as well as the inherent assumptions of different evaluation methods are discussed. Finally the case of uncertain and partially contradictory probabilities is considered.
1 INTRODUCTION
Expert systems commonly employ some means of drawing inferences from domain and problem knowledge where both the knowledge and its implications are less than certain. There are many principles of how to evaluate such "weak" knowledge. The approach discussed in this chapter is based on logic as well as probability theory. Assume that there are two propositions A and B, both of which may be either true or false. It is the goal of mathematical logic to determine whether combined expressions, such as A ∧ B, are true or false. Now suppose that because of incomplete knowledge the expert does not know whether the propositions A and B are true or false but can specify the "probability" for the truth of A and the truth of B. Then it is the aim of probabilistic logic to evaluate the "probability" that expressions such as A ∧ B are true. In colloquial use "probability" is just another expression for "belief", "likelihood" or "chance". In probability theory and mathematical statistics, however, there is a precise definition of probability. By means of a probability distribution, the chances of several facts being jointly true or false may be characterized. The main assumption of probabilistic logic is that there exists a consistent probability distribution over the possible states in the domain of interest that represents the current knowledge of the decision-maker. It describes his information about the relative chances of facts, rules and consequences being jointly true or false. If all probabilities are equal to zero
or one, we again get the case of classical logic. Therefore probabilistic logic is an extension of classical logic. The concept of probabilistic logic may be applied to expert systems. A typical expert system consists of rules and facts that can be stated as logical propositions. They form the inference net on which the analysis of the decision-maker is based. Usually the validity of rules and facts is not known exactly. Then one or more domain experts have to specify a probability for their validity, which may be based on theory, experience or judgement. Because of limited information and the differing experience of the different experts, these assessments are error-prone, and the resulting probabilities may be inexact or uncertain themselves. If several experts supply uncertain probabilities then contradictory judgements may even occur. To resolve such conflicts the decision-maker, the user of the expert system, has to estimate the reliability of the domain experts' judgement. The aim of the analysis is the joint evaluation of the rules and facts in the inference net and the associated probability distribution, to arrive at the desired probabilities of some consequences. This process of reasoning is sketched in Fig. 1.
[Fig. 1. Reasoning with probabilistic logic in expert systems: experts 1, ..., n assign probabilities to the rules and facts of the knowledge base; the decision-maker assesses the reliability of the experts; evaluation by probabilistic logic then yields the probabilities of the consequences.]

In the first section some basic concepts of probabilistic logic are discussed. Starting with an example, an interpretative framework for subjective probabilities is given and it is shown how a probability measure on propositions can be constructed. The second section gives a formal framework for the evaluation of inference nets.
After a concise vector notation has been specified, the different types of information that may be available to the decision-maker (structural information, "data" from experts) are compiled, and principles for the evaluation of the inference net are formulated. The following two sections describe methods for the evaluation of inference nets in more detail. If the available probabilities of rules and facts are exact then the decision-maker can perform a "worst-case" analysis, which yields upper and lower bounds on the probabilities in question without making additional assumptions. If the decision-maker can state reasonable assumptions about the statistical "correlation" between different propositions (and hence about the structure of the probability distribution) then he can rule out implausible alternatives and arrive at much more informative results in a "restricted" analysis. The following section applies these concepts to the case in which the probabilities themselves are uncertain. The uncertainty is represented by "error models", which can be evaluated by statistical concepts such as the likelihood approach, the maximum-entropy principle or Bayesian statistics. In the last section of the chapter the main features of probabilistic logic are summarized and compared with similar techniques.
2 BASIC CONCEPTS OF PROBABILISTIC LOGIC
The aim of probabilistic logic is the definition and evaluation of a probability distribution over logical propositions. Comprehensive discussions of the topic are given by Cheeseman (1985), Spiegelhalter (1986a, b) and Nilsson (1986). The main intention of this chapter is to describe the basic concepts and assumptions inherent in probabilistic logic. They will be illustrated by the following small-scale example.
2.1 Example: rule-based diagnostic system
Suppose that a doctor has to decide whether or not a patient has a disease D. The relation between the two symptoms A and B and the disease D is specified in the form of the following rules F1, ..., F5, which hold with a certain probability:

F1 := "If A then D follows" holds with probability π1
F2 := "If ¬A then D follows" holds with probability π2
F3 := "If B then D follows" holds with probability π3
F4 := "If D then B follows" holds with probability π4
F5 := "If A then B follows" holds with probability π5
These probabilities πi reflect the subjective degree of belief of the doctor in the truth of the rules for a certain universe, for instance the people of a town. They are based on medical theory, experience or intuition. Assume in addition that it is known that symptom A has a certain probability with respect to this universe:

F6 := "A" holds with probability π6

The rules and facts F1, ..., F6 contain all available information about the probabilistic relation of A, B and D in the universe. The probability π1 for F1 := "If A then D follows" might be interpreted as the probability that the logical implication A ⇒ D holds. This implication is true if (A ∧ D) ∨ ¬A is valid. This proposition, however, has no intuitive meaning to the physician. For him it is more natural to consider only the situation where the antecedent A holds and assess the chance that D will be true. Therefore the probability associated with a rule is always defined as the conditional probability of the consequence given the antecedent. For F1 we have for example π1 = p(D | A) := p(A ∧ D)/p(A). The probabilistic relation between A and D is completely defined by the "joint" probabilities p(A ∧ D), p(A ∧ ¬D), p(¬A ∧ D) and p(¬A ∧ ¬D). These probabilities can always be expressed by the conditional probabilities with respect to A as well as D, e.g. p(A ∧ D) = p(D | A)p(A) = p(A | D)p(D). Therefore the specification of p(D | A) does not necessarily mean that D is caused by A, as the identical relation between A and D could be characterized using p(A | D). Assume that the probability of the diagnosis D is to be determined for a specific patient who exhibits the symptoms ¬A and B:

F7 := "Diagnose D" holds for the patient with probability π7

This probability is the conditional probability of the disease given the symptoms: π7 = p(D | ¬A ∧ B). The resulting inference net is shown in Fig. 2. Historically, probabilistic reasoning in expert systems has been centred around the Bayesian formula. To avoid computational difficulties, this formula was in practice employed together with a number of severe structural restrictions, especially the assumption of conditional independence, which are usually highly unrealistic and invalidate the results of the analysis. These restrictions are relaxed for the approaches presented in this chapter:
"symptoms" are not required to be conditionally independent given the disease"; "diagnoses" do not have to be exclusive; the inference net may contain cycles, i.e. multiple chains of reasoning for the same "facts";
(iv) no prior probabilities are required for diagnoses and intermediary facts.

[Fig. 2. Example from medical diagnostics: an inference net with nodes "symptom A", "symptom B" and "disease D", linked by the conditional probabilities p(B | A), p(D | A) and p(D | ¬A).]
Therefore the present approach seems to be much better suited for application in real expert systems. For a more comprehensive discussion see Section 6.1.
2.2 Interpretation of subjective probabilities
Suppose that B means "the patient has hypertension". Now assume that Fred Miller comes to the expert, a physician, who states after some investigation: p(B) = 0.3. What is the meaning of this assertion: "Fred Miller has hypertension with probability 0.3"? The usual concept of probability involves a long sequence of repetitions of a given situation. For example, saying that a fair coin has the probability 1/2 of coming up heads means that in a long series of independent flips of the coin heads will occur about half of the time. This frequency concept, however, is not adequate when dealing with the probability of a proposition like B. It is not possible to make identical "copies" of Fred Miller with identical life histories and count the relative frequency of Fred Millers with hypertension. The theory of subjective probability (Berger, 1980, pp. 61ff) has been created to enable one to talk about probabilities when the frequency viewpoint does not apply. The main idea is to let the probability of a proposition reflect the personal belief in the "chance" that the proposition is true. In this context a proposition can be characterized as a clear statement that is capable of being either true or false. For example, the expert may have a personal feeling as to the chance of B being about 0.3, even though no frequency probability can be assigned to the event. Such probability assessment is very common in everyday life, for instance when estimating the chance of rain for the next day. The calculation of a frequency probability is theoretically straightforward.
One simply determines the relative frequency of the event of interest. A subjective probability, however, can be determined only by introspection. There are several concepts available that illustrate the actual meaning of subjective probabilities. The simplest way of determining subjective probability values is to compare the probabilities of propositions. The expert, for example, can compare B with ¬B. If ¬B is felt to be twice as likely to occur as B then he would define p(B) = 1/3 and p(¬B) = 2/3. Betting situations are especially useful to consider because they tend to make the mind evaluate more carefully. To determine p(B) by this mechanism, imagine a situation in which the expert will receive 1 − d dollars if B occurs and d dollars if ¬B occurs, where 0 ≤ d ≤ 1. The idea is then to choose d until the expert is indifferent between the two possibilities. Assuming rational behaviour (and linear utility of money), this is the case if the expected gain is equal for both alternatives:

(1 − d) p(B) = d p(¬B)   (1)
where p(¬B) = 1 − p(B). Solving (1 − d)p(B) = d(1 − p(B)) gives p(B) = d. The practical difficulties with the elicitation of subjective probabilities are discussed by Kahneman et al. (1982). Assume that an expert has specified his personal degree of belief in the truth of the propositions under consideration. It is not at all clear that this degree of belief should constitute a probability measure p. It can, however, be shown that a probability measure will result if the specification of beliefs obeys a set of axioms reflecting rational behaviour. Such axioms, for example, contain postulates for the behaviour of a rational gambler, who is able to specify preferences between the "bets" described above. The attractiveness of these axioms, and hence of the laws of probability derived from them, arises from the fact that someone who violates them, using a different scalar measure of uncertainty, is liable to demonstrable loss and hence irrational. For discussions see Cheeseman (1985), Genest and Zidek (1986), Horvitz et al. (1986), Good (1982) and Fishburn (1986).
2.3 Probability measures on propositions
Assume that a single rational expert provided consistent probabilities π1, ..., πnF for the facts and rules F1, ..., FnF of the inference net. For an evaluation of the desired probabilities of diseases it is necessary to integrate all these probabilities into a joint probability measure, which is assumed to exist. In this section the construction of such a probability measure is discussed. Consider rule F1 in the example, for which the conditional probability π1 = p(D | A) := p(D ∧ A)/p(A) is known to the expert. To express the conditional probabilities, we have to determine the probabilities of A and A ∧ D.
All such propositions that occur in a probability (or conditional probability) specified for the inference net (including probabilities to be determined as a result of the analysis) form the set 𝒰 := {U1, ..., Unu} of relevant propositions. In the example consisting of F1, ..., F7 the set 𝒰 comprises the following nu = 10 propositions:

U1 := A,  U2 := A ∧ D,  U3 := ¬A,  U4 := ¬A ∧ D,
U5 := B,  U6 := B ∧ D,  U7 := D,  U8 := A ∧ B,
U9 := ¬A ∧ B,  U10 := D ∧ ¬A ∧ B

To define a probability distribution for the propositions in 𝒰, we have to construct the smallest set 𝒲 = {W1, ..., Wnw} of "elementary" propositions that fulfil the following conditions:
(i) each Ui is the disjunction of some of the Wj: Ui = ⋁_{j∈J(i)} Wj;
(ii) the Wj are exclusive: Wj1 ∧ Wj2 is false for j1 ≠ j2;
(iii) the Wj are exhaustive: W1 ∨ ... ∨ Wnw is true.
As the Wj are exclusive and exhaustive, they represent the spectrum of all possible situations that are of interest for the decision problem. Therefore each Wj is called a possible world with respect to the problem of interest. The set ℱ that can be formed from disjunctions, conjunctions and negations of the elements of 𝒲 is called the Boolean algebra of propositions in 𝒲. ℱ contains the Ui as well as T, the proposition that is always true, and F, the proposition that is always false. Using the above properties, the set of possible worlds can be constructed by the generation of conjunctions u1 ∧ ... ∧ unu with ui = Ui or ui = ¬Ui. If 𝒰 contains only propositions from propositional calculus then it suffices to consider the conjunctions of the atomic propositions (A, B and D in our example) and their negations. A probability measure p(·) is a numerically valued set function p: ℱ → [0, 1] satisfying the following axioms for all A, B ∈ ℱ:
p(A) If A " B
=F
~
0,
then
0
0
(2)
p(T) = I
p(A v B) = p(A)
+ p(B)
(3)
The sum of the probabilities p(Wj) is equal to 1 because the Wj are exclusive and exhaustive. The probability p(Ui) is just the sum of the probabilities of the possible worlds that constitute Ui:
p(Ui) = Σ_{j∈J(i)} p(Wj),   where Ui = ⋁_{j∈J(i)} Wj   (4)
Conditions under which sample spaces can be generated from infinite sets are
explored in extension theorems (Fishburn, 1986). Sample spaces can also be constructed if we have sentences, i.e. closed well-formed formulae in some logical language L, for instance first-order logic (Nilsson, 1986; Grosof, 1986). Here it is required that the consistency of the finite set 𝒰 = {U1, ..., Unu} of sentences of interest can be established (which is not always possible in first-order logic). The probability measure is defined over the set 𝒲 of equivalence classes of interpretations for the sentences in 𝒰, i.e. of consistent conjunctions
u1 ∧ u2 ∧ ... ∧ unu   (5)

where each uj may take the value Uj or ¬Uj, and consistency is defined with respect to L. Grosof (1986) points out that 𝒲 is isomorphic to the notion of the power set of a "frame of discernment" in Shafer-Dempster theory. With respect to the evaluation of probabilities in an inference net, there is no difference between propositions from propositional calculus and sentences from first-order logic after the set of possible worlds 𝒲 has been constructed.
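To make the construction concrete, here is a small sketch, not from the chapter and with world probabilities invented for illustration, that enumerates the eight possible worlds generated by the atomic propositions A, B and D and evaluates p(Ui) according to (4):

from itertools import product

WORLDS = list(product([False, True], repeat=3))          # truth values of A, B, D
p_w = [0.10, 0.05, 0.15, 0.10, 0.20, 0.10, 0.15, 0.15]   # invented; sums to one

def p(prop):
    """p(U_i) = sum of p(W_j) over the worlds W_j that constitute U_i, eq. (4)."""
    return sum(w for (world, w) in zip(WORLDS, p_w) if prop(*world))

U1 = lambda a, b, d: a            # U1 := A
U2 = lambda a, b, d: a and d      # U2 := A ∧ D
pi1 = p(U2) / p(U1)               # pi_1 = p(D | A) = p(A ∧ D)/p(A)
print(p(U1), p(U2), pi1)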
3 FRAMEWORK FOR THE EVALUATION OF INFERENCE NETS

3.1 Vector notation
For simplicity, we assume that the set 𝒰 of relevant propositions contains the certain proposition T, as p(T) = 1 is always known. Because a non-conditional probability p(A) is equal to the conditional probability p(A | T), all specified probabilities πi, i = 1, ..., nF, can be considered as conditional probabilities. For each πi = p(Ai1 | Ai2) = p(Ai1 ∧ Ai2)/p(Ai2) the set 𝒰 contains the propositions Ui+ := Ai1 ∧ Ai2 as well as Ui− := Ai2. The linear equations p(Ui+) = Σ_{j∈J+(i)} p(Wj) and p(Ui−) = Σ_{j∈J−(i)} p(Wj) suggest the representation of p(Ui+) and p(Ui−) in terms of the p(Wj). We define a (0-1)-matrix R+ := (r+ij), where the ith row ri+ contains the representation of p(Ui+) in terms of the p(Wj), i.e. r+ij = 1 iff j ∈ J+(i). Then we have p(Ui+) = ri+'pw, where pw := (p(W1), ..., p(Wnw))' is the vector of probabilities of the possible worlds. In a similar way, we can define R− := (r−ij), where r−ij = 1 iff j ∈ J−(i). This yields p(Ui−) = ri−'pw, and therefore
πi = ri+'pw / ri−'pw   (6)
If we denote the elementwise division of vectors by ÷, we get a relation for the vector π := (π1, ..., πnF)' of probabilities supplied by the experts:

π = (R+pw) ÷ (R−pw)   (7)
This equation is the basis of most of the procedures discussed later. It describes
which probabilities pw of possible worlds are compatible with the numbers πi supplied by the experts. As an additional restriction on pw, we have the requirement that the probabilities must sum to one and are not negative:

1'pw = 1,   pw ≥ 0,   with 1 := (1, ..., 1)'   (8)
The probability measures for which the relations (7) and (8) hold form the set 𝒫0 of feasible probability measures. If π contains conditional probabilities, the relation (7) between pw and π is nonlinear. If, however, the value of π is known exactly, (7) can be transformed into a linear restriction on pw (cf. Grosof, 1986). From (6), we have 0 = (πi ri−' − ri+')pw =: ci'pw. If the ith row of a matrix Cπ is defined by ci' then we arrive at

Cπ pw = 0   (9)
In the special case that there are no "genuine" conditional probabilities, R−pw consists of a vector of ones, and the nonlinear relation (7) reduces to a linear one: π = R+pw. This case was discussed by Nilsson (1986, p. 74). For use in later sections, let us introduce random variables associated with propositions. As shown above, there are atomic propositions X1, ..., Xnx whose conjunctions are the possible worlds Wj. (In the example we had X1 := A, X2 := B and X3 := D.) For each Xi we define a binary random variable xi that can take the values Xi and ¬Xi. Then marginal and conditional probability measures can be specified in a simple way: p(xi, xj), for instance, denotes the marginal probability measure given by p(Xi ∧ Xj), p(Xi ∧ ¬Xj), p(¬Xi ∧ Xj) and p(¬Xi ∧ ¬Xj), while p(xi | xj) symbolizes the corresponding conditional probability measures. The joint distribution is denoted by p(x). Usually it is the aim of the decision-maker to determine the probability of some proposition U* = ⋁_{j∈J*} Wj, the desired "diagnosis". Let g* be a (0-1)-vector with g*j = 1 iff j ∈ J*. Then we have
p(U*) = g*'pw   (10)
In principle, p(U*) can be estimated by first determining some "optimal" pw from the given π using (7) and (8), and then calculating p(U*) from (10). Before the probabilities for an inference net are specified, it is necessary to fix the relevant universe that is to be described by the joint distribution p(x). With respect to evaluation, the decision-maker is usually interested in the probability of the diagnosis U* only for that part of the distribution p(x) where the specific available evidence, the symptoms of the "patient", holds. If these symptoms B1, ..., Bl are known with certainty then the desired probability of U* for the patient is given by p(U* | B1 ∧ ... ∧ Bl). If only the probability of the symptoms is known then an auxiliary proposition E may
be introduced that is true iff the specific evidence holds. The information about the probabilities of the symptoms may then be introduced into the inference net by rules according to the conditional probabilities p(Bi | E). The desired probability of the diagnosis will then be p(U* | E) (for a similar approach cf. Spiegelhalter, 1986b). This avoids the usual approach (Nilsson, 1986), where all 2^l terms of the factorization
p(U* | E) = Σ_{b1=B1,¬B1; ...; bl=Bl,¬Bl} p(U* | b1, ..., bl) p(b1, ..., bl | E)   (11)
have to be determined. Even for a moderate number l of symptoms, this approach is no longer practical because of the sheer number of combinations. Of course, as our approach involves less information than (11), it leads to a unique solution only if appropriate additional structural restrictions are imposed.
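As an illustration of this vector notation, the following sketch (again with invented world probabilities; the encoding is ours, not prescribed by the chapter) builds the rows of R+ and R− for the rules of the medical example and evaluates the elementwise relation (7):

import numpy as np
from itertools import product

WORLDS = list(product([False, True], repeat=3))   # (a, b, d)
p_w = np.array([0.10, 0.05, 0.15, 0.10, 0.20, 0.10, 0.15, 0.15])

def row(prop):
    """0-1 row r_i with r_ij = 1 iff world j belongs to the proposition."""
    return np.array([1.0 if prop(a, b, d) else 0.0 for a, b, d in WORLDS])

# one row pair per rule pi_i = p(A_i1 | A_i2): U_i+ = A_i1 ∧ A_i2, U_i- = A_i2
rules = [
    (lambda a, b, d: d and a,      lambda a, b, d: a),      # F1: p(D | A)
    (lambda a, b, d: d and not a,  lambda a, b, d: not a),  # F2: p(D | ~A)
    (lambda a, b, d: d and b,      lambda a, b, d: b),      # F3: p(D | B)
    (lambda a, b, d: b and d,      lambda a, b, d: d),      # F4: p(B | D)
    (lambda a, b, d: b and a,      lambda a, b, d: a),      # F5: p(B | A)
    (lambda a, b, d: a,            lambda a, b, d: True),   # F6: p(A) = p(A | T)
]
R_plus = np.vstack([row(u) for u, _ in rules])
R_minus = np.vstack([row(v) for _, v in rules])
pi = (R_plus @ p_w) / (R_minus @ p_w)   # elementwise form of relation (7)
print(pi)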
3.2 Available information and evaluation strategies
Suppose that one or more domain experts have supplied a vector of probabilities π = (π1, ..., πnF) for the rules and facts F1, ..., FnF. Usually the dimension nF of π will be smaller than the dimension nw of pw. Then, if (7) and (8) are consistent and there is a solution pw for them, it will not be unique, and the set 𝒫0 of feasible probability measures contains many elements. As it is impossible to distinguish between these solutions using the information contained in π, the vector pw of probabilities is not identifiable in this case. Often, however, the decision-maker has additional information that can be used for the evaluation of the inference net. One type of information concerns the structure of the probability measure.
(i) From theoretical or heuristic arguments, the decision-maker often knows that the true distribution is a member of a smaller, restricted class of distributions. This is the case, for instance, if the experts are independent and use different sources of information. The probability measures in such a class can be described by a smaller vector θ of parameters, from which the whole vector pw of probabilities can be recovered by the "parametrization", a function pw(θ) of θ.
(ii) Alternatively, the decision-maker can consider the vector pw of unknown probabilities itself as a random quantity with distribution Pr(pw). Pr(pw) describes the prior knowledge of the decision-maker about pw before the information in π supplied by the experts is known to him, and allows a very flexible specification of structural knowledge about the probability distribution.
In addition, the decision-maker needs some information on the precision of the probabilities πi supplied by the domain experts. Two cases can be distinguished. First, the πi may be known exactly from definitional or theoretical reasoning. Secondly, the πi may be uncertain and erroneous to some extent. These errors may be a consequence of the limited information of the experts about the subject, and do not imply that the experts violate the axioms of rational decision. To describe the extent of such errors, the decision-maker has to assess the reliability of the experts from his own subjective view. This assessment can be formalized by an error model that gives the stochastic relation between the true probability and the numbers π̂i supplied by the experts. Suppose that the decision-maker wants to estimate the probability p(U*) of some proposition U* from a given π̂. Depending on his information on the structure of the distribution p(x) and on the errors, there are several strategies for the evaluation of the inference net. Let us first consider the case where the probabilities πi supplied by the experts are exact. If the decision-maker has no additional information about the internal structure of p(x) then it might be sufficient for his purposes to determine the lowest value p(U*)low and the highest value p(U*)high that are compatible with the information in π. Then the probability of U* has to lie in the interval [p(U*)low, p(U*)high]. This approach is called worst-case analysis. It is closely related to the minimax principle of statistical decision theory, where the decision-maker tries to limit the maximal loss that could result from a decision. Assume that the decision-maker has some knowledge about the structure of p(x) and knows that p(x) is contained in a restricted class 𝒫1 of distributions. In a restricted analysis he can exploit this knowledge to determine the set of solutions 𝒫2 := 𝒫0 ∩ 𝒫1 from that class for which (9) and (8) hold. If the restrictions are defined in an appropriate way and their number is large enough then multiple solutions may be eliminated, and 𝒫2 contains a unique element, the "solution". If the supplied probabilities are uncertain because of the limited information of the experts then we assume that the decision-maker can formulate his subjective knowledge about this uncertainty as an error model. This specifies the random relation between the "true" probabilities π that would be specified by completely informed experts and the uncertain "data" π̂. Again, structural restrictions may be taken into account by a parametrization pw(θ). This is the usual set-up of statistics, where inferences about parameters θ are drawn from "observed" data. Consequently, the evaluation of the inference net can be done according to the different approaches of statistical analysis:
(i) classical statistical analysis relies on the frequency interpretation of probability and employs point estimation, hypothesis testing and confidence methods to arrive at statistical conclusions;
(ii) Bayesian statistical analysis starts with a prior distribution Pr(θ) on the parameters; the information contained in Pr(θ) as well as in the "data" π̂ is summarized by a posterior distribution Pr(θ | π̂).
The characteristics of the different types of analysis are summarized in Table 1. In the next sections these types of analysis are discussed in more detail.
Table 1 Different types of analysis.

Information available to the decision-maker:

on the structure of the               on the precision of
probability distribution              supplied probabilities              Type of analysis
Missing                               π is exact                          Worst-case analysis
Restricted class of distributions     π is exact                          Restricted analysis
Restricted class of distributions     π is uncertain; error model for π   Classical statistical analysis
Prior distribution on probabilities   π is uncertain; error model for π   Bayesian analysis

4 EXACT KNOWLEDGE ABOUT PROBABILITIES

4.1 Worst-case analysis
Assume that π is known exactly and that for some U* the decision-maker wants to estimate the smallest and the largest values of p(U*) compatible with π. From (10), we have p(U*) = g*'pw, and because of (8) and (9) the decision-maker can determine the highest value p(U*)high consistent with π by solving the following linear-programming problem:

determine p(U*)high := max over pw of g*'pw
subject to the restrictions Cπ pw = 0, pw ≥ 0, 1'pw = 1
For the solution of this problem, efficient algorithms from operations research are available. In the same way, p(U*)low can be calculated by maximizing −g*'pw. The resulting bounds on p(U*) are correct without additional assumptions about the structure of the probability distribution. If
enough restrictions are available then the resulting interval [p(U*)low, p(U*)high] for p(U*) will be sufficiently narrow. In general, the specification of an exact probability, e.g. πi = 0.73184675..., is impossible in reality for an expert, as infinitely fine probability comparisons are then needed. It is realistic, however, that an expert knows with certainty that the true πi is contained in the interval [πi,low, πi,high] = [0.70, 0.75]. This yields exact upper and lower bounds πlow ≤ π ≤ πhigh. Because of (6), we get πi,low ≤ ri+'pw / ri−'pw ≤ πi,high. This leads to the inequality Clow pw ≤ 0, where the matrix Clow is defined according to (9). In a similar way, we get Chigh pw ≥ 0. Then the determination of p(U*)high with p(U*) = g*'pw amounts to the following linear-programming problem:

determine p(U*)high := max over pw of g*'pw
subject to the restrictions Clow pw ≤ 0, Chigh pw ≥ 0, pw ≥ 0, 1'pw = 1
The attractive feature of this approach is that it involves information combination and processing only via probabilistic means, while explicitly recognizing the limited precision of probability elicitation. For a discussion see Fishburn (1986, pp. 340, 346, 351). The linear-programming approach gives a solution that exploits all available information in an optimal way. The analysis is in terms of the bounds on the joint vector pw in ℝ^nw instead of bounds on single p(Wj). As the pw that meet the restrictions form an nw-dimensional simplex, whose boundaries in general are not parallel to the coordinates, it contains more information and leads to smaller intervals for p(U*) than a lower-dimensional approach. This concept is a generalization of the convex Bayesian approach (Thompson, 1985; Kyburg, 1987), where probability intervals are propagated using the Bayesian formula. In the area of statistics, Smith (1961) has discussed a theory of interval-valued upper and lower probabilities. Other features of the distribution can also be characterized by restrictions. If, for instance, A and B are known to be independent then this yields p(A ∧ B) = p(A)p(B) (cf. Grosof, 1986). Such restrictions, however, are nonlinear, and instead of efficient linear-programming algorithms, general programs for constrained optimization have to be employed. Another type of nonlinearity is induced by the evaluation of conditional probabilities, e.g. p(D | A ∧ B) := p(D ∧ A ∧ B)/p(A ∧ B). For both the numerator and the denominator, probability bounds may be determined separately, yielding a coarse bound on the ratio. This bound, however, may be refined arbitrarily by partitioning the denominator range into a series of small intervals and determining intervals for the numerator given the denominator intervals. This technique is demonstrated in the following example.
Example. Assume that the following probability bounds are known for our example:

0.0 ≤ p(D | A) ≤ 0.4,   0.7 ≤ p(D | ¬A) ≤ 0.9
0.6 ≤ p(B | D) ≤ 0.8,   0.3 ≤ p(B | ¬D) ≤ 0.5
0.0 ≤ p(B | A) ≤ 0.2,   0.3 ≤ p(A) ≤ 0.7

We have to determine an interval [v0, v1] for p(D | ¬A ∧ B) = p(D ∧ ¬A ∧ B)/p(¬A ∧ B). First we can determine ranges for the denominator and the numerator by the linear-programming method:

0.13 ≤ p(D ∧ ¬A ∧ B) ≤ 0.60,   0.23 ≤ p(¬A ∧ B) ≤ 0.69
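These two ranges can be reproduced mechanically. The sketch below is one possible encoding (using scipy.optimize.linprog as the LP solver; the encoding itself is ours, not the chapter's): each interval bound lo ≤ p(X | Y) ≤ hi becomes the two linear inequalities p(X ∧ Y) − hi·p(Y) ≤ 0 and lo·p(Y) − p(X ∧ Y) ≤ 0 over the eight world probabilities.

import itertools
import numpy as np
from scipy.optimize import linprog

WORLDS = list(itertools.product([False, True], repeat=3))   # (a, b, d)

def indicator(event):
    """0-1 row vector over the eight worlds for a predicate on (a, b, d)."""
    return np.array([1.0 if event(a, b, d) else 0.0 for a, b, d in WORLDS])

rows, rhs = [], []
def bound_cond(x_and_y, y, lo, hi):
    num, den = indicator(x_and_y), indicator(y)
    rows.append(num - hi * den)
    rhs.append(0.0)
    rows.append(lo * den - num)
    rhs.append(0.0)

bound_cond(lambda a, b, d: d and a,     lambda a, b, d: a,     0.0, 0.4)
bound_cond(lambda a, b, d: d and not a, lambda a, b, d: not a, 0.7, 0.9)
bound_cond(lambda a, b, d: b and d,     lambda a, b, d: d,     0.6, 0.8)
bound_cond(lambda a, b, d: b and not d, lambda a, b, d: not d, 0.3, 0.5)
bound_cond(lambda a, b, d: b and a,     lambda a, b, d: a,     0.0, 0.2)
bound_cond(lambda a, b, d: a,           lambda a, b, d: True,  0.3, 0.7)

A_ub, b_ub = np.array(rows), np.array(rhs)
A_eq, b_eq = np.ones((1, 8)), np.array([1.0])   # probabilities sum to one

def prob_range(event):
    """Smallest and largest p(event) over all feasible p_w."""
    c = indicator(event)
    lo = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq).fun
    hi = -linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq).fun
    return lo, hi

print(prob_range(lambda a, b, d: d and not a and b))   # text reports [0.13, 0.60]
print(prob_range(lambda a, b, d: not a and b))         # text reports [0.23, 0.69]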
As a lower bound for v0 we get 0.13/0.69 = 0.19, and as an upper bound 0.13/0.23 = 0.56. For v1 we get the bounds 0.60/0.69 = 0.87 and 0.60/0.13, which exceeds 1, so the bound is 1. To get tighter bounds, the interval [0.23, 0.69] of the denominator can be partitioned into k equally spaced subintervals [ti, ti+1], i = 1, ..., k. Subsequently, k different LP problems can be solved, with the additional restriction ti ≤ p(¬A ∧ B) ≤ ti+1 in the ith problem. From each solution, bounds for v0 and v1 can be determined. The maximum of all upper bounds and the minimum of all lower bounds give a globally valid range for v0 and v1 respectively. For k = 10 we get 0.477 ≤ v0 ≤ 0.544 and 0.998 ≤ v1 ≤ 1.0, while k = 100 yields 0.531 ≤ v0 ≤ 0.538 and 0.999 ≤ v1 ≤ 1.0. Hence we arrive at p(D | ¬A ∧ B) ∈ [0.531, 1.000]. In this way we can get bounds for the solution of a nonlinear problem by linear methods. Although linear-programming problems may be solved for quite large systems, the computational effort increases roughly as a cubic function of the number of constraints. Therefore methods to reduce the computational burden are desirable. An obvious alternative is to specify the linear-programming problem in terms of marginal probabilities instead of the complete vector pw. This means that for each rule like p(Xi | Xj) = πk the corresponding restriction p(Xi ∧ Xj) = πk p(Xj) is formulated in terms of the marginal distribution p(xi, xj) of the two variables xi and xj. Additional restrictions are needed to ensure the compatibility of the marginal probabilities for the different subsets of variables. If, for example, the marginal distributions p(xi, xj) and p(xi, xk) are used then p(xi) has to be identical for both marginals. As an additional simplification, the calculations for each of the marginal distributions could be performed separately, using the results to narrow the range for the probability of each proposition in a stepwise manner. The INFERNO approach (Quinlan, 1983) works in this way by providing new bounds for propositions belonging to a single rule. Assume, for instance, that
Although linear-programming problems may be solved for quite large systems, the computational effort increases roughly as a cubic function of the number of constraints. Therefore methods to reduce the computational burden are desirable. An obvious alternative is to specify the linear-programming problem in terms of marginal probabilities instead of the complete vector p_W. This means that for each rule like p(X_i | X_j) = π_k the corresponding restriction p(X_i ∧ X_j) = π_k p(X_j) is formulated in terms of the marginal distribution p(x_i, x_j) of the two variables X_i and X_j. Additional restrictions are needed to ensure the compatibility of the marginal probabilities for the different subsets of variables. If, for example, the marginal distributions p(x_i, x_j) and p(x_i, x_k) are used then p(x_i) has to be identical for both marginals. As an additional simplification, the calculations for each of the marginal distributions could be performed separately, using the results to narrow the range for the probability of each proposition in a stepwise manner. The INFERNO approach (Quinlan, 1983) works in this way by providing new bounds for propositions belonging to a single rule.
Assume, for instance, that p(D | A) ≥ α_1 and p(A) ≥ α_2. Then the relation p(D) ≥ p(D ∧ A) = p(D | A)p(A) yields p(D) ≥ α_2 α_1 =: α_3. By this and similar arguments, the bounds on probabilities may successively be tightened. The resulting intervals are still valid, but in general they are not as narrow as the optimal ones.
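Such a propagation step is easy to mechanize locally. The following fragment is a minimal sketch (the function is hypothetical, not Quinlan's implementation):

```python
# INFERNO-style propagation step: lower bounds on p(D | A) and p(A) give a
# lower bound on p(D), since p(D) >= p(D & A) = p(D | A) * p(A).
def tighten_lower_d(lb_d_given_a, lb_a, current_lb_d=0.0):
    alpha3 = lb_d_given_a * lb_a          # the bound alpha_3 derived above
    return max(current_lb_d, alpha3)      # keep whichever bound is tighter

print(tighten_lower_d(0.7, 0.3))          # p(D) >= 0.21
```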
4.2 Restricted analysis
Worst-case analysis often takes into account situations that are highly unlikely. This sometimes leads to very conservative bounds for the desired probabilities. One way to rule out such improbable situations is the imposition of structural restrictions on p(x). To define such a restricted class of distributions, two different strategies may be used.
(i) A new parametrization is chosen in such a way that some of the new parameters can be restricted to a fixed value (e.g. zero) by theoretical or heuristic arguments, defining a restricted class of distributions. To arrive at a solution, only the remaining smaller vector θ of parameters has to be determined, from which the whole vector p_W of probabilities can be recovered by the "parametrization", a function p_W(θ). If the number of restrictions is large enough then under certain conditions a unique solution θ̂ may exist.
(ii) One probability distribution from the set of feasible solutions is selected by some reasonable criterion. It is reasonable, for example, to select the probability measure with the smallest "information" content, as it imposes minimal additional assumptions. The resulting set of solutions again forms a restricted class of distributions.
Let us first discuss a convenient parametrization and appropriate solution methods. The distribution p(x) concerns binary variables x_1, ..., x_{n_x}, and without loss of generality can be assumed to be multinomial with probabilities p_W. A frequently used parametrization of p(x) is closely related to a standard measure of the association between two binary random variables X_i and X_j, the logarithm of the cross-product ratio:

α(x_i, x_j) = log { p(¬X_i ∧ ¬X_j) p(X_i ∧ X_j) / [ p(¬X_i ∧ X_j) p(X_i ∧ ¬X_j) ] }    (12)

This has values between −∞ and ∞. Values > 0 indicate a positive association (if x_i = X_i then x_j tends to be X_j, and vice versa), while values < 0 indicate a negative association (if x_i = X_i then x_j tends to be ¬X_j, and vice versa). α(x_i, x_j) takes the value 0 if the variables are independent, i.e. p(x_i, x_j) = p(x_i)p(x_j).
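For concreteness, formula (12) can be transcribed directly; the following sketch takes the four cell probabilities of the joint distribution (encoding X_i = true as 1) and checks that independence gives α = 0 (the values are invented for illustration):

```python
# Direct transcription of formula (12); cells are indexed by (x_i, x_j).
import math

def assoc2(p):
    """Log cross-product ratio alpha(x_i, x_j) from four cell probabilities."""
    return math.log((p[(0, 0)] * p[(1, 1)]) / (p[(0, 1)] * p[(1, 0)]))

# Independent variables (p(X_i) = 0.4, p(X_j) = 0.6) give alpha = 0:
p_ind = {(0, 0): 0.24, (0, 1): 0.36, (1, 0): 0.16, (1, 1): 0.24}
print(assoc2(p_ind))    # 0.0 up to rounding
```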
In the case of three variables, we could consider the conditional distributions p(x;, xi I xk = -, Xd and p(x;, xi I xk = Xk) and determine the two-dimensional association between X; and xi for these conditional distributions. a(x;, xi, xk), defined as the difference of these "conditional" associations, is independent of the variable used for conditioning. If a(x;, xi, xk) =f. 0 then the two-dimensional associations of the conditional distributions p(x;, xi I xk = -, Xk) and p(x;, xi I xk = Xk) are different. Consequently, there is a characteristic of the distribution that cannot be""explained" in terms of the two-dimensional margins. Hence a(x;, xi, xd is a sort of "higher-order" interaction between X;, xi and xk. This way to construct measures for higher order associations can be extended to any number of variables (Fienberg, 1980, pp. 27ff). They have the attractive feature that their value is not changed if only a lower-order marginal is modified. This means that a higher-order interaction is not affected if only information about lower-dimensional marginals is taken into account. Now consider the case where no information jointly concerning X;, xi and xk is available. It is then plausible to assume a(x;, xi, xk) = 0 as there is no reason to suppose p(x;, xi I xk = Xd =f. p(x;, xi I xk = -, Xk). The distribution p(x) can be reparametrized in terms of the interactions or functions thereof, for instance as log-linear models (Bishop et al., 197 5 ). Higher-order interactions (and the corresponding parameters) are set equal to zero if no data for the determination of these interactions are available and there are no theoretical reasons indicating the presence of these interactions. In this way, the class of probability distributions may be restricted such that to each vector 1t of consistent values for the marginal probabilities there exists an unique solution Pw· Using this approach, the decision-maker may take into account all higher-order interactions between variables that he thinks are relevant and for which he has some information .
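The three-way measure can be sketched in the same style, as the difference of the conditional two-way associations given x_k; since the cross-product ratio is invariant under rescaling, the conditional distributions need not be normalized. A fully independent distribution (invented here for illustration) gives the value 0:

```python
# Three-way interaction as the difference of the two "conditional" two-way
# associations; the joint distribution is a dict over {0, 1}^3.
import math

def cond_assoc2(p, k_val):
    q = {(i, j): p[(i, j, k_val)] for i in (0, 1) for j in (0, 1)}
    return math.log((q[(0, 0)] * q[(1, 1)]) / (q[(0, 1)] * q[(1, 0)]))

def assoc3(p):
    return cond_assoc2(p, 1) - cond_assoc2(p, 0)

# A fully independent distribution has no three-way interaction:
p = {(i, j, k): (0.3 if i else 0.7) * 0.5 * (0.6 if k else 0.4)
     for i in (0, 1) for j in (0, 1) for k in (0, 1)}
print(assoc3(p))    # 0.0
```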
Fig. 3  Inference net with cycle.    Fig. 4  Inference net without cycles.
The procedure for obtaining a solution p_W depends on the structure of the inference net. Let us first consider the small example shown in Fig. 3, where only the marginal distribution p(x_1, x_2) and the conditional distributions p(x_3 | x_1) and p(x_3 | x_2) are known to the decision-maker, who wants to estimate the joint distribution p(x_1, x_2, x_3). Starting with the marginal distribution p(x_1, x_2), there are two possible "lines of reasoning" to estimate p(x_1, x_2, x_3):
(i) ignore p(x_3 | x_2) and use p(x_3 | x_1):   p(x_1, x_2, x_3) = p(x_3 | x_1)p(x_1, x_2);
(ii) ignore p(x_3 | x_1) and use p(x_3 | x_2):   p(x_1, x_2, x_3) = p(x_3 | x_2)p(x_1, x_2).
Obviously a compromise between the two lines of reasoning has to be found. If such cycles are present in an inference net then it can be shown that a solution for the joint distribution can only be obtained by iterative procedures. This corresponds to the result of Dubois and Prade (see Chapter 10) that a logic of uncertainty is generally not truth-functional. Haber and Brown (1986) present algorithms for the general case, while Pearl (1986) discusses simplified procedures for distributions with special interaction patterns. If there are no cycles, as in Fig. 4, then p_W can be determined by a noniterative formula, which, however, may be complicated (Bishop et al., 1975, pp. 74ff). In the case of one "disease" x_1 and "symptoms" x_2, ..., x_k that are conditionally independent with respect to x_1 (i.e. p(x_i, x_j | x_1) = p(x_i | x_1)p(x_j | x_1) for i > j > 1), this formula reduces to the famous Bayesian formula employed in many expert systems.

The second way to restrict the class of distributions is the selection of a distribution from the set of feasible distributions by some criterion. Konolidge (1982) proposed the selection of the distribution whose entropy H(p_W) := −p_W′ log p_W is maximized subject to the conditions (8) and (9). As maximum entropy corresponds to minimum (statistical) information in the probability distribution, this approach is reasonable. The maximum-entropy approach and similar criteria are discussed by Hunter (1986), Shore (1986), Gokhale and Kullback (1978), Diaconis and Zabell (1982) and Dalkey (1986). The notion of "minimum information" is, however, a bit misleading, as it suggests that we get something (a unique solution) for nothing. In most cases it leads to the same estimates as the assumption of a log-linear model in which the higher-order associations for which no data are available have been set to zero. Consequently,
(i) the selection of a distribution with "minimum information" implies similar or identical restrictions as if higher-order interactions were set to zero;
(ii) the restrictions used in the log-linear model are in this respect the least demanding;
(iii) if the decision-maker does not know whether particular interactions are zero and no information about them is available from the experts, then the utilization of the "minimum-information" approach or the restriction of those interactions to zero may be misleading.
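To make the selection criterion concrete, the following sketch picks the maximum-entropy distribution with scipy.optimize.minimize; the single restriction p(A) = 0.3 is invented for illustration and merely stands in for the actual conditions (8) and (9):

```python
# Selecting the feasible distribution with maximum entropy.
import numpy as np
from scipy.optimize import minimize, LinearConstraint

worlds = [(0, 0), (0, 1), (1, 0), (1, 1)]        # worlds over (A, B)

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

A = np.vstack([np.ones(4),                                 # sum to 1
               [1.0 if a else 0.0 for a, b in worlds]])    # p(A)
constraint = LinearConstraint(A, [1.0, 0.3], [1.0, 0.3])

res = minimize(neg_entropy, x0=np.full(4, 0.25),
               bounds=[(0, 1)] * 4, constraints=[constraint])
print(res.x)    # mass spread as evenly as the restriction allows:
                # approximately [0.35, 0.35, 0.15, 0.15]
```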
5 UNCERTAIN KNOWLEDGE ABOUT PROBABILITIES
Assume that a number of experts have supplied probabilities π̂_i expressing their subjective probability concerning the facts or rules of interest to the decision-maker. As the experts have different subjective probability distributions, according to their different limited states of information and personal experience, their judgements are uncertain and may be erroneous and conflicting to some extent. This does not mean that the experts violate the laws of rational decision-making; the differences may simply arise from their different expertise and knowledge about the topic in question. The consequences of inconsistent probabilities are demonstrated in the following example:

p(B | A) = 0.1,    p(A) = 0.9,    p(B) = 0.9
Assume that these specifications hold with certainty. Then obviously p(¬B | A) = 0.9 holds. We have p(¬B) ≥ p(¬B ∧ A) = p(¬B | A)p(A) = 0.81, which contradicts p(B) = 0.9 (i.e. p(¬B) = 0.1). Hence specifications from different experts may lead to contradictions even if facts and rules are only assumed to hold with some probability. If an estimate of p(U*) depends too much on highly uncertain probabilities then it will be unreliable itself. Therefore, during the evaluation of the inference net, the extent of possible errors in the π̂_j should be taken into account. For the decision-maker the probabilities π̂_i are some sort of data. We assume that from his subjective knowledge about the experts and their state of information he can specify the relative precision of the π̂_i. In the next section it is proposed to formalize this assessment by an "error model".
5.1 Error models
The value π̂_i provided by an expert can be considered as a fixed but unknown uncertain quantity whose distribution is completely specified by the conditional distribution p(π̂_i | π_i) defining the error model. The "true" probability π_i can be considered as the subjective probability estimate that would be supplied by a rational expert with complete information about all aspects of the problem. With respect to statistical analysis, each π̂_i can be considered as a statistic of a hypothetical sample (Paass, 1986) containing one or more elements and generated according to p(π̂_i | π_i). Hamburger (1986) shows that this representation of uncertainty satisfies a list of general desiderata. We assume that the decision-maker is able to assess the precision of the vector π̂ of values supplied by the experts and knows the corresponding p(π̂ | π) or some of its parameters. In addition, we suppose that he can restrict the class of distributions according to theoretical reasons, which yields a parametrization p_W(θ). Because he could choose θ := p_W, this comprises the unrestricted case. Obviously this task of the decision-maker is very difficult. He has to assess the "performance" of the experts and their honesty. Moreover, he has to evaluate whether the experts share some information which will jointly influence their judgement. Genest and Zidek (1986, pp. 120f) and French (1985) discuss these issues in connection with the "group-consensus problem", where a consensus has to be found in a group of experts. They also consider the situation where the decision-maker is an expert himself. There are different types of error models and associated error distributions p(π̂_i | π_i(θ)). The assumption of a model by the decision-maker merely implies that the figures supplied by the expert have the same distribution as if they resulted from a chance process according to that model. It is not necessary that the corresponding experiment or chance process actually took place.

Piecewise-uniform error density, e.g.
p(π̂_i | π_i(θ)) = 5γ            if |π̂_i − π̃_i(θ)| ≤ 0.1
                  (1 − γ)/0.8    otherwise

where π̃_i(θ) := min(max(π_i(θ), 0.1), 0.9) and 0 ≤ γ ≤ 1. The probability that π̂_i is inside the interval I(π_i(θ)) := [π̃_i(θ) − 0.1, π̃_i(θ) + 0.1] is γ. Inside as well as outside the interval, all values are equally probable. For γ = 1 we know for certain that π̂_i is contained in I(π_i(θ)), yielding the interval restrictions of worst-case analysis as a special case (cf. Loui, 1986). Note, however, that the error model assumes a uniform distribution over this interval, whereas in worst-case analysis it is only known that π̂_i follows an arbitrary distribution confined to the interval.
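The density is straightforward to transcribe; in the following sketch the parameter vector θ is summarized by the single value π_i(θ):

```python
# The piecewise-uniform error density, transcribed from the definition above.
def piecewise_uniform(pi_hat, pi_i, gamma):
    pi_tilde = min(max(pi_i, 0.1), 0.9)   # clip so the interval fits in [0, 1]
    if abs(pi_hat - pi_tilde) <= 0.1:
        return 5.0 * gamma                # mass gamma over a width of 0.2
    return (1.0 - gamma) / 0.8            # mass 1 - gamma over a width of 0.8

# gamma = 1 recovers the hard interval restriction of worst-case analysis:
print(piecewise_uniform(0.85, 0.8, 1.0))  # 5.0 (inside the interval)
print(piecewise_uniform(0.50, 0.8, 1.0))  # 0.0 (outside the interval)
```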
Additive error with constant variance, e.g.

π̂_j = π_j(θ) + ε_j,    E(ε_j) = 0,    var(ε_j) = σ_j²

Here π̂_j is simply assumed to be a measurement value, without respect to its probability properties. This model may only serve as a crude approximation, as probability values outside the interval [0, 1] are not excluded (cf. Rauch, 1984).
Binomial errors, e.g.

n_i π̂_i ∼ B(n_i, π_i(θ))

where B(n_i, π_i(θ)) indicates a binomial distribution. This error model presupposes an experiment involving n_i independent random drawings according to the "true" probability π_i of the rule or fact F_i. For each observation in the resulting sample S_i it can only be determined whether F_i is true or false. Let c_i be the number of times that F_i holds and c_i/n_i the relative frequency. The probability of c_i and n_i for a fixed π_i(θ) is then proportional to π_i(θ)^{c_i} (1 − π_i(θ))^{n_i − c_i}. The error model was proposed by Ginsberg (1985) and Paass (1986). It can be justified if the experts get their information by experience and base their probability estimates on relative frequencies that they have observed in practice. Because of the limited number of observed cases, there will be a deviation of c_i/n_i from the true value π_i(θ), the sampling error, which is assumed to be the only cause of errors. It decreases with growing sample size n_i.

Many variants of these error distributions (e.g. transformed normal errors, discussed by Genest and Zidek, 1986, and Lindley, 1985) may be used to represent the knowledge of the decision-maker about the precision of experts' judgements. The parameters of the error distributions can on the one hand be specified directly if they have an intuitive meaning, like the standard deviation σ_j. On the other hand, they can be expressed in terms of quantiles of the error distribution. Assume, for instance, that the decision-maker wants to determine the sample size n_i for the binomial error model and he knows that with probability 0.95 the value π̂_i will lie in the interval [0.7, 0.9] if π_i(θ) = 0.8. Then, using the definition of the binomial distribution (or appropriate approximations), the corresponding sample size n_i can be determined as 66.5.

Assume that an expert has specified π̂_i = 1 but the decision-maker thinks that this statement is uncertain to some extent. By means of an error model, it is possible to specify this uncertainty without necessarily stating evidence against π̂_i = 1. This means that an estimate of π_i equal to 1 will be the "best" estimate if the other probability assignments do not imply some evidence in favour of π_i < 1.

Usually the errors for different π̂_i are assumed to be statistically independent. In other words, it is supposed that the actual deviation π̂_i − π_i for some π_i is not influenced by the actual deviation π̂_j − π_j for any other π_j, j ≠ i. This is reasonable if the experts use different sources of information and do not collaborate. Clemen and Winkler (1985) analyse the impact of dependence between experts on the precision of final estimates for probabilities. They show that dependent sources of information considerably reduce the precision of estimates in comparison with the independent case.
An alternative is to model the joint distribution p(π̂_i, π̂_j | π_i(θ), π_j(θ)) with the corresponding covariances or interactions. The statistical techniques discussed below can be applied directly to this case (cf. Paass, 1986). There are procedures for combining the opinions of experts that use no explicit statistical model but start from desirable properties of combination methods. One example is the linear opinion pool, where the estimated final distribution p̂_W is a convex combination of the distributions p_{W,i} specified by the experts: p̂_W = Σ_i β_i p_{W,i}. However, the same result would arise if additive error with constant variance were assumed. For a discussion of these and other approaches see Genest and Zidek (1986).
5.2 Evaluation by statistical methods
Let us first discuss some statistical evaluation methods (for a short survey see Dawid, 1983). A widespread method is the likelihood approach. Its direct appeal lies in the idea that it is a good way to compare parameter values θ_1 and θ_2 by means of the probability that they assign to the observed "data". For given "data" π̂_i specified by independent experts and error distributions p(π̂_i | π_i(θ)), we can define the likelihood function L(θ) by

L(θ) := p(π̂ | π(θ)) := ∏_{i=1}^{n_π} p(π̂_i | π_i(θ))
This function summarizes all information present in the data. θ_1 is more compatible with the data than θ_2 if L(θ_1) > L(θ_2) (in the absence of other information). On the other hand, the same inferences should result for θ_1 and θ_2 if L(θ_1) = L(θ_2). Let Θ_max be the set of parameters where L(θ) is maximal, and let us assume that Θ_max contains only one element. This will happen if the number n_θ of parameters is not too large in relation to the number (and structure) of the data items π̂_i; it can be checked by examining the derivatives of L(θ). The unique maximal parameter value θ̂ is called the maximum-likelihood estimate:

L(θ̂) = max_θ L(θ)    (13)
It utilizes all available information in an efficient way and yields the true parameter θ_0 if the sample sizes go to infinity. The likelihood function may be maximized by different optimization methods using first- or second-order derivatives (McIntosh, 1982) or by simple iterative algorithms (e.g. the EM algorithm; Dempster et al., 1977) that work without derivatives. If the unknown distribution contains very many parameters then simplified algorithms have to be used to reduce the computational effort (cf. Paass, 1986).
There are several ways to evaluate the likelihood function. First, of course, θ̂ can be utilized to calculate the associated value of p(U*) = g*′p_W(θ̂) for a proposition U*. The precision of the estimate p̂(U*) can be determined by a confidence interval. It measures the information contained in the data and enables the decision-maker to distinguish between well-established but equal probabilities and ignorance caused by missing information. It can be estimated using the appropriate likelihood-ratio statistic, whose accuracy depends on the accuracy of the approximation of L(θ) by a normal distribution, which increases with growing sample sizes n_i. To explore the stochastic relation between some variables of interest, the marginal probabilities for these variables may be estimated. In this way one can determine, for instance, the information content of yet-unknown symptoms for a diagnosis (Paass, 1986). If two experts consider identical rules F_i and F_j and give the same uncertain probability statement π̂_i = π̂_j with the same variance σ², the combination of these two statements is equivalent to a single statement with the same value but variance ½σ². Therefore pieces of evidence are accumulated, and the true π_i is more likely to be located near π̂_i. An inherent feature of the maximum-likelihood approach is that "contradictions" between given probabilities π̂_i are resolved and a unique estimate p̂_W := p_W(θ̂) is determined. Estimated values π̃_i may differ from the specified values π̂_i. Contradictions are resolved in such a way that the extent of modification is largest for the less reliable π̂_i. Hence by checking the difference between π̂_i and π̃_i, the "most contradictory" probabilities can be detected.
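For a single unknown probability, the likelihood of several independent expert reports under the binomial error model can be evaluated and maximized directly. The following sketch uses a grid search rather than the EM algorithm, and the figures are invented for illustration:

```python
# Likelihood of independent expert reports under the binomial error model.
import numpy as np

def log_likelihood(pi, pi_hats, ns):
    pi_hats, ns = np.asarray(pi_hats), np.asarray(ns)
    cs = pi_hats * ns                     # hypothetical success counts
    return float(np.sum(cs * np.log(pi) + (ns - cs) * np.log(1.0 - pi)))

pi_hats, ns = [0.70, 0.80], [50, 50]      # two equally reliable experts
grid = np.linspace(0.01, 0.99, 981)
best = grid[np.argmax([log_likelihood(g, pi_hats, ns) for g in grid])]
print(best)    # about 0.75: the conflicting reports are reconciled, and the
               # combined evidence behaves like one report of higher precision
```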
Table 2

Rule or fact F_i    Value π̂_i supplied by expert    Interval [π_{i,low}, π_{i,high}]    Sample size n_i
p(D | A)            0.20                             [0.00, 0.40]                        11
p(D | ¬A)           0.80                             [0.70, 0.90]                        43
p(B | D)            0.70                             [0.60, 0.80]                        57
p(B | ¬D)           0.40                             [0.30, 0.50]                        65
p(B | A)            0.10                             [0.00, 0.20]                        24
p(A)                0.50                             [0.30, 0.70]                        17
Example
Suppose that for our example the probabilities π̂_i listed in Table 2 have been assigned by independent experts. Assuming binomial errors, the decision-maker assesses the reliability of the experts by specifying a "probable interval" [π_{i,low}, π_{i,high}] that will contain the true probability value π_i with the prescribed probability P_i = 0.90. From this interval, the size n_i of the corresponding hypothetical sample can be determined (a normal approximation was used); that is, the hypothetical sample size n_i is based on the assumption that the interval [π_{i,low}, π_{i,high}] corresponds to the 90% confidence interval for π_i.
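The conversion from a probable interval to a hypothetical sample size can be sketched with the symmetric normal approximation d = z sqrt(π(1 − π)/n); with the 90% quantile z = 1.645 this reproduces the n_i column of Table 2 up to rounding (the exact approximation used in the chapter is not stated):

```python
# Hypothetical sample size from a "probable interval", normal approximation.
def sample_size(pi_hat, lo, hi, z=1.645):
    d = (hi - lo) / 2.0                  # half-width of the probable interval
    return pi_hat * (1.0 - pi_hat) * (z / d) ** 2

rows = [(0.20, 0.0, 0.4), (0.80, 0.7, 0.9), (0.70, 0.6, 0.8),
        (0.40, 0.3, 0.5), (0.10, 0.0, 0.2), (0.50, 0.3, 0.7)]
print([round(sample_size(*r)) for r in rows])   # [11, 43, 57, 65, 24, 17]
```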
The EM algorithm yields the maximum-likelihood estimate p(D | ¬A ∧ B) = 0.75 with an estimated standard deviation of 0.07. Compared with the worst-case solution of the example given above, this interval is much tighter. If the information about p(B | A) is not taken into account and A and B are assumed to be conditionally independent given D, the estimate p(D | ¬A ∧ B) = 0.87 results. This effect is quite general, as the assumption of conditional independence has a tendency to overstate the information content of the available "data". Now suppose that an additional expert states his knowledge π̂_8 := p(B | ¬A) = 0.50 and the decision-maker assumes this expert to be rather reliable, with π_{8,low} = 0.45 and π_{8,high} = 0.55. The corresponding hypothetical sample size is 270.7, yielding an estimate p(D | ¬A ∧ B) = 0.88. However, the π̂_i are now contradictory to some extent, and the estimated values deviate from the specified values: for example, p(B | A) = 0.35 instead of 0.10 and p(B | ¬A) = 0.52 instead of 0.50. According to a χ² test, the deviation of p(B | A) is more significant than the deviation of p(B | ¬A). In this sense the specified p(B | A) is "more contradictory" than p(B | ¬A).

An alternative statistical evaluation principle is Bayesian analysis. Here the vector θ of unknown parameters is itself considered as an n_θ-dimensional random variable with an unknown distribution Pr(θ). Probability vectors are treated as points in ℝ^{n_θ}, and Pr(θ) is a probability measure on that space, which can usually be described by a probability density. It induces a distribution Pr(π) on π because p_W = p_W(θ) and π = (R⁺p_W) ÷ (R⁻p_W). The aim of Bayesian analysis is the combination of a given prior distribution Pr(θ) with some data. In our context, Pr(θ) may result from two different lines of reasoning. First, it can encode structural information about the distribution, as discussed above (e.g. the absence of higher-order associations, or the higher plausibility of specific distributions). On the other hand, it can reflect complete ignorance and give the same density to all θ-values. However, the definition of such "non-informative" priors is a controversial topic (Berger, 1980, pp. 68ff). The crucial and arguable feature of Bayesian reasoning is that the prior distribution Pr(θ) is always assumed to exist. For a given prior distribution and a known error model p(π̂ | θ), the posterior density for the unknown parameter θ is
Pr(θ | π̂) = p(π̂ | θ) Pr(θ) / ∫ p(π̂ | θ) Pr(θ) dθ    (14)
It specifies the density value or "relative" probability of θ-vectors after the information contained in π̂ has been taken into account. Its maximum, the maximum-posterior estimate, if it exists, gives the "most probable" parameter value. Moreover, posterior regions may be determined where the posterior density is highest and which contain the true parameter with a prescribed probability. This can even be done in the case where there is no unique maximum of the posterior density. If the decision-maker has a constant "non-informative" prior density Pr(θ) = constant, indicating missing information about the parameter, then the maximum-posterior estimate is identical with the maximum-likelihood estimate (13).

The determination of the maximum-posterior estimate can be done with the same methods and comparable effort as the calculation of the maximum-likelihood estimate. The determination of posterior regions involves the evaluation of multivariate integrals, which is computationally very demanding in the general case. Simplifications arise if Pr(θ) is assumed to be normally distributed, and least-squares methods like the extended Kalman filter (Lederman, 1984, pp. 902ff) may be used. Gokhale and Kullback (1978, pp. 199ff) discuss the application of the maximum-entropy approach.

A new class of algorithms for the solution of the nonlinear optimization problems employs statistical simulation techniques. With relatively little effort they yield approximate solutions whose quality increases as the optimization progresses. Since the variance of θ̂ is usually large in the case of uncertain probabilities, suboptimal solutions are often sufficient. An example is the simulated-annealing algorithm (Aarts and van Laarhoven, 1985), where a cost function C(θ) is to be minimized. The algorithm consists of successive modifications of a single component θ_i of θ to a new value θ̃_i. If C(θ̃) < C(θ) then the modification is accepted; otherwise the modification is accepted with probability exp([C(θ) − C(θ̃)]/t). If the control parameter t is slowly decreased towards zero, it can be shown that the resulting values concentrate on the set {θ | C(θ) = min C(θ)} of minimal-cost parameters.
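A minimal sketch of the algorithm follows; the quadratic cost and the linear cooling schedule are placeholders for the Bayesian or likelihood costs C_B and C_L and for a real schedule, neither of which the chapter fixes:

```python
# Minimal simulated-annealing sketch.
import math
import random

def anneal(cost, theta, steps=20000, t0=1.0, scale=0.05):
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-6     # slowly decrease t towards 0
        cand = theta[:]                           # modify a single component
        i = random.randrange(len(cand))
        cand[i] += random.uniform(-scale, scale)
        delta = cost(theta) - cost(cand)
        if delta > 0 or random.random() < math.exp(delta / t):
            theta = cand                          # accept the modification
    return theta

cost = lambda th: sum((x - 0.3) ** 2 for x in th)  # toy cost, minimum at 0.3
print(anneal(cost, [0.9, 0.1]))                    # both components near 0.3
```

With t held fixed at 1 instead of cooled, the accepted values form the posterior sample mentioned below.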
In a Bayesian context we may define C_B(θ) := −log(p(π̂ | θ) Pr(θ)), and the procedure yields the maximum-posterior estimate. In the same way, the negative log-likelihood function C_L(θ) := −log p(π̂ | θ) or other cost functions may be utilized. If, for a suitable parametrization θ and the Bayesian cost function C_B(θ), the value of t is set to 1, it can be shown (Paass, 1987) that the resulting simulated sequence of parameter values can be considered as a sample from the posterior distribution Pr(θ | π̂). By observing the evolution of θ after convergence to the steady state, the marginal posterior distribution and corresponding posterior regions of any parameter can be obtained. In the same way, confidence regions can be established if the likelihood cost function C_L(θ) is employed as the criterion function. In contrast to the sample-based procedure of Bundy (1985), the values of higher-order interactions not affected by the criterion function can be controlled and, for example, can be set to their "least informative" zero values.
6 SUMMARY AND COMPARISON

6.1 Assessment of probabilistic logic
In this section we want to summarize and discuss the main features of probabilistic logic in expert systems. It has been demonstrated above that the probability of a proposition A can be interpreted as the "subjective degree of belief" of the decision-maker in A. No frequency concept is needed, as the existence of a probability measure p(x) can be derived from a few axioms of rational behaviour, and there exists a clear framework for the interpretation of probabilities. An expert system seems to be particularly suitable for the application of probability, as it is a closed, simplified system with fixed states that can react to the real world only according to a limited number of "pieces of evidence". In contrast with the early utilization of Bayes' rule, where a large number of prior probabilities were required, the approaches discussed in this chapter need only the information about the structure and probability values of the inference net that is logically necessary to arrive at a result at all. Of course, the price paid for such a relaxation of the requirements is that in general the resulting probabilities may no longer be unique point values but may only be known to be located within some interval.

The inference net can have an arbitrary form, with cycles. Structural assumptions may be stated by restricting the class of distributions considered (if, for instance, higher-order interactions are zero) or by assuming informative prior distributions. In worst-case analysis no structural assumptions are necessary. The inference net can be revised and enlarged by simply exchanging rules or adding variables. If new propositions occur in the course of reasoning then the probability measure can be extended consistently. Hence probabilistic logic is not confined to situations where all propositions are known in advance. It is not necessary to stick to a single number for the description of the degree of belief in a proposition. The precision of probabilities can be characterized by a whole function, the likelihood function or the related posterior density. Unlike classical logic, in general no single possible world will emerge as the "true" world; instead, all possible worlds have to be taken into account, with differing chances of being the "true" world. Because of this complexity, no simple "explanation" of a result is usually feasible.
It is, however, possible to identify the main reasons that led to a specific result by considering the probabilities of the antecedents of rules. An inherent feature of the approaches for handling uncertain probabilities is their ability to resolve contradictions and to take into account the relative precision of "inputs" for the determination of the resulting probabilities. During the process of probabilistic reasoning, the probability p(A), and hence the truth value assigned to a proposition A, can change if new evidence arrives and uncertain probabilities are assumed. Consequently, probabilistic logic with uncertain pieces of evidence is a sort of non-monotonic logic.

The intention of this chapter was to discuss the different evaluation principles and inherent assumptions that may be chosen in probabilistic logic. The algorithms presented often involve a large computational effort. They give a sort of reference solution, which may be used to derive simpler, computationally feasible procedures. Methods that may be employed for larger inference nets are:
(i) the linear-programming approach, where the restrictions are specified in terms of marginal probabilities;
(ii) the INFERNO approach of Quinlan (1983);
(iii) the Bayesian network technique proposed by Pearl (1986), where a specific interaction pattern is assumed;
(iv) statistical simulation techniques (Paass, 1987), which give approximate solutions for the most general case.
As many research groups are currently working on new algorithms, significant progress can be expected in the near future.

6.2 Relation to similar approaches
The characteristic of the Shafer-Dempster approach (Shafer, 1976; Chapter 9 of the present book) is that it does not lead to a probability distribution over the exclusive and exhaustive possible worlds W_i ∈ 𝒲, but rather starts with a "probability mass function" m(A) ≥ 0 on the Boolean algebra ℱ over 𝒲. This function is supplied by the experts, and the masses have to sum to 1.

Example
Consider the possible worlds 𝒲 = {W_1, W_2, W_3}. Then the corresponding Boolean algebra is given by ℱ = {∅, W_1, W_2, W_3, W_1 ∨ W_2, W_1 ∨ W_3, W_2 ∨ W_3, W_1 ∨ W_2 ∨ W_3}. Assume the probability mass function m(W_1) = 0.3, m(W_3) = 0.1, m(W_2 ∨ W_3) = 0.4 and m(W_1 ∨ W_2 ∨ W_3) = 0.2.
If a mass is given to W_i ∨ W_j, this means that this "amount" of belief can be attributed jointly to W_i and W_j, but the decision-maker does not have enough information to allocate proportions of m(W_i ∨ W_j) to the single propositions W_i and W_j. Consequently, the mass given to the disjunction of all W_i cannot be allocated at all. This situation can also be modelled by means of probabilistic logic. Assume for our example that there is a random variable w with possible "values" W_1, W_2 and W_3, and that there exists a probability measure p(w) for w. The values of p(w) are not known. There exists, however, another variable v that has the "values" v_1, v_3, v_23 and v_123, corresponding to the elements of ℱ with positive mass. w and v are assumed to have a joint distribution p(v | w)p(w) = p(w, v) = p(w | v)p(v). About p(w | v), the decision-maker has some structural information:
p(W_1 | v_1) = 1,        p(W_2 | v_1) = 0,        p(W_3 | v_1) = 0
p(W_1 | v_3) = 0,        p(W_2 | v_3) = 0,        p(W_3 | v_3) = 1
p(W_1 | v_23) = 0,       p(W_2 | v_23) = α,       p(W_3 | v_23) = 1 − α
p(W_1 | v_123) = β_1,    p(W_2 | v_123) = β_2,    p(W_3 | v_123) = 1 − β_1 − β_2
The free parameters α, β_1 and β_2, however, are unknown to him. They represent the information necessary for an allocation of probability masses to the probabilities of the W_i. The basic probability assignment defines the marginal distribution p(v). It is then the task of the decision-maker to derive conclusions about the probabilities of the elements of the Boolean algebra. Obviously there is no unique solution, but upper and lower bounds on these probabilities can be established by use of worst-case analysis. The resulting bounds are the narrowest possible without introducing additional information into p(v, w). Grosof (1986) and Kyburg (1987) point out that every Shafer-Dempster belief function may be expressed by inequality constraints on an underlying probability measure. The converse, however, is not true, and hence the Shafer-Dempster scheme has less expressive power than the probability approach with upper and lower bounds. In the case of two experts supplying two different probability mass functions, an error model could be formulated according to the precision of their assignments. By convex Bayesian analysis (Thompson, 1985), the mass functions can be combined (cf. Grosof, 1986). It would be interesting to compare the results of Dempster's rule of combination with these probabilistic techniques. Lemmer (1986) has already shown that the combination rule yields results contradictory to a probability interpretation.
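The worst-case bounds obtained by ranging over the free parameters coincide with the belief and plausibility computed directly from the masses: the lower bound of p(E) sums the masses of focal elements contained in E, the upper bound those intersecting E. A small sketch, with frozensets of world indices standing in for disjunctions of the W_i:

```python
# Lower and upper probability bounds from the masses of the example.
masses = {frozenset({1}): 0.3, frozenset({3}): 0.1,
          frozenset({2, 3}): 0.4, frozenset({1, 2, 3}): 0.2}

def bounds(event):
    lower = sum(m for s, m in masses.items() if s <= event)   # belief
    upper = sum(m for s, m in masses.items() if s & event)    # plausibility
    return lower, upper

print(bounds(frozenset({1})))       # (0.3, 0.5): 0.3 <= p(W_1) <= 0.5
print(bounds(frozenset({2, 3})))    # (0.5, 0.7): 0.5 <= p(W_2 v W_3) <= 0.7
```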
Many of the structural features of other approaches to uncertain reasoning are similar to probabilistic logic. Gaines (1978) shows that probabilistic logic as well as fuzzy logic (see Chapter 10) can be considered as special cases of a "standard uncertainty logic", which he defines by a set of axioms. To arrive at probabilistic logic, the axiom of the excluded middle, p(A ∨ ¬A) = 1, has to be added, while another axiom is added to arrive at a variant of fuzzy logic. Goodman and Nguyen (1985) develop generalized set-membership functions with probabilistic and fuzzy logic as special cases. Horvitz et al. (1986, pp. 212f) discuss the relation of two distinct forms of fuzzy logic to probabilistic logic. The first type (Zadeh, 1983) allows beliefs to be assigned to propositions that are fuzzy, i.e. remain ill-defined. Proponents of probability theory have pointed out that imprecision in the specification of a proposition could always be converted to uncertainty about a related precise event with similar or identical semantic content. Cheeseman (1986) proposed that probability distributions over variables of interest can capture the characteristics of fuzziness within the framework of probability. The fuzzy proposition "Mary is young", for instance, can be represented by a distribution specifying the probability that Mary has age z. He claims that "fuzzy logic is unnecessary for representing and reasoning about uncertainty." The second type of fuzzy logic (Gaines, 1978) interprets the degree μ_T(A) of membership of a proposition A in the set of true propositions as the degree of belief in A. In this approach the degree of belief in a conjunction is defined by μ_T(A ∧ B) = min(μ_T(A), μ_T(B)). Obviously, this is not consistent with the factorization p(A ∧ B) = p(A | B)p(B) of probability theory. It is, however, comparable to features of the worst-case analysis discussed above. The relation between probabilistic logic and default logic is discussed in Chapter 10.

There are many other approaches to uncertain reasoning, some of which are rather ad hoc. Horvitz et al. (1986) showed, for example, that the certainty-factors approach is inconsistent. Heckerman (1986) modified the certainty-factor method in such a way that it satisfies the requirements for a Bayesian probability interpretation. The above arguments show that probabilistic logic is able to exhibit features similar to those of some "new" concepts of non-monotonic reasoning. Hence the concepts seem to be complementary rather than contradictory. For each approach it is most important to clarify and check all the inherent assumptions (independence, absence of higher-order associations, etc.) before applying it to a concrete problem. There has been great progress in this field during the last few years, and it is to be hoped that parallel research on different concepts will stimulate progress in uncertain reasoning as a whole.
ACKNOWLEDGMENTS
I should like to thank Didier Dubois, Gabor Gyarfas, Hermann Quinke, Philippe Smets and Frank Veltman for their valuable comments. In addition, I am grateful to the Gesellschaft für Mathematik und Datenverarbeitung, who provided the opportunity to work on this subject.

BIBLIOGRAPHY

The following references provide an introduction to the basic principles and problems of probabilistic logic and compare it with other approaches.

Cheeseman, P. (1985). In defense of probability. Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-85), Los Angeles, pp. 1002-1009. (Starting with the notion of probability as a measure of belief in the truth of a proposition, the positive features of probabilistic reasoning in comparison with other approaches are compiled. It is argued that probability theory, when used correctly, is sufficient for the task of uncertain reasoning.)
Fishburn, P. C. (1986). The axioms of subjective probability (with discussion). Statist. Sci. 1, 335-358. (Gives an up-to-date survey of axiom systems, including comparative probability relations, decision-theoretic approaches, interval probabilities, etc. The discussion gives an impression of the controversies in this field.)
French, S. (1985). Group consensus probability distributions: a critical survey. Bayesian Statistics 2 (ed. J. M. Bernardo et al.), pp. 183-202. North-Holland, Amsterdam. (Two main versions of the group-consensus problem are considered. In the expert problem a group of experts submits probability judgements to a decision-maker outside the group, who has to aggregate the experts' opinions. In the group-decision problem the group itself is responsible for aggregating their probability judgements into a consistent probability distribution.)
Genest, C. and Zidek, J. V. (1986). Combining probability distributions: a critique and an annotated bibliography. Statist. Sci. 1, 114-148. (This paper discusses the problem of aggregating a number of probability distributions specified by different experts. In contrast with probabilistic reasoning, no marginal or conditional distributions are specified by the experts. The different approaches are compared from the point of view of decision theory. The extensive bibliography and the discussion give a comprehensive picture of this field.)
Kanal, L. N. and Lemmer, J. F. (eds) (1986). Uncertainty in Artificial Intelligence. North-Holland, Amsterdam. (This collection contains nearly 40 papers and gives a representative impression of current developments. The topics of probabilistic reasoning, belief functions, maximum entropy and interval probabilities are covered and compared in depth.)
Nilsson, N. J. (1986). Probabilistic logic. Artificial Intelligence 28, 71-87. (Defines the truth value of sentences of first-order logic by their probability in probabilistic reasoning systems. The derivation applies to any logical system for which the consistency of a finite set of sentences can be established.)
Paass, G. (1986). Consistent evaluation of uncertain reasoning systems. Proc. 6th Int. Workshop on Expert Systems and their Applications, Avignon, pp. 73-94. (Inference nets are considered where the probabilities of facts and rules are not known exactly but are subject to error. These probabilities are modelled as random samples, where the number of elements determines their reliability. The inference net is evaluated according to the maximum-likelihood principle, allowing conflicting evidence to be processed.)
Quinlan, J. R. (1983). INFERNO: a cautious approach to uncertain reasoning. Comp. J. 26, 255-269. (Specifies a method for the evaluation of inference nets where intervals are specified for marginal and conditional probabilities, yielding intervals that have to contain the true probabilities. The approach is computationally cheap, but may yield intervals that are larger than optimal.)
Spiegelhalter, D. J. (1986a). A statistical view of uncertainty in expert systems. Artificial Intelligence and Statistics (ed. W. Gale), pp. 17-55. Addison-Wesley, Reading, Mass. (Different approaches to uncertain reasoning (e.g. probabilistic reasoning, fuzzy reasoning, belief functions and the theory of endorsements) are compared from a statistical point of view. It is argued that a subjectivist Bayesian view of uncertainty can provide many features demanded by expert systems. Some examples as well as numerically feasible methods for the evaluation of inference nets are discussed.)
Other references
Aarts, E. H. L. and van Laarhoven, P. J. M. (1985). Statistical cooling: a general approach to combinatorial optimization problems. Philips J. Res. 40, 193-226.
Berger, J. O. (1980). Statistical Decision Theory. Springer-Verlag, New York.
Bernardo, J. M., DeGroot, M. H., Lindley, D. V. and Smith, A. F. M. (eds) (1985). Bayesian Statistics 2. North-Holland, Amsterdam.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, Mass.
Bundy, A. (1985). Incidence calculus: a mechanism for probabilistic reasoning. J. Autom. Reasoning 1, 263-283.
Cheeseman, P. (1986). Probabilistic versus fuzzy reasoning. In Kanal and Lemmer (1986), pp. 85-102.
Clemen, R. T. and Winkler, R. L. (1985). Limits for the precision and value of information from dependent sources. Operations Res. 33, 427-442.
Dalkey, N. C. (1986). Inductive inference and the representation of uncertainty. In Kanal and Lemmer (1986), pp. 393-397.
Dawid, A. P. (1983). Statistical inference. Encyclopedia of Statistics, Vol. 4 (ed. S. Kotz and N. L. Johnson), pp. 80-105. Wiley, New York.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B39, 1-38.
Diaconis, P. and Zabell, S. L. (1982). Updating subjective probability. J. Am. Statist. Assn 77, 822-830.
Fienberg, S. E. (1980). The Analysis of Cross-Classified Categorical Data. MIT Press, Cambridge, Mass.
Gaines, B. R. (1978). Fuzzy and probability uncertainty logics. Info. Control 38, 154-169.
Gale, W. (ed.) (1986). Artificial Intelligence and Statistics. Addison-Wesley, Reading, Mass.
Ginsberg, M. L. (1985). Does probability have a place in nonmonotonic reasoning? Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-85) (ed. A. Joshi), Los Angeles, pp. 107-110.
Gokhale, D. V. and Kullback, S. (1978). The Information in Contingency Tables. Marcel Dekker, New York.
Good, I. J. (1982). Axioms of probability. Encyclopedia of Statistical Sciences, Vol. 1 (ed. S. Kotz and N. L. Johnson), pp. 169-176. Wiley, New York.
Goodman, I. R. and Nguyen, H. T. (1985). Uncertainty Models for Knowledge-Based Systems. North-Holland, Amsterdam.
Grosof, B. N. (1986). An inequality paradigm for probabilistic reasoning. In Kanal and Lemmer (1986), pp. 259-275.
Haber, M. and Brown, M. B. (1986). Maximum likelihood methods for log-linear models when expected frequencies are subject to linear constraints. J. Am. Statist. Assn 81, 477-482.
Hamburger, H. (1986). Representing, combining and using uncertain estimates. In Kanal and Lemmer (1986), pp. 399-414.
Heckerman, D. (1986). Probabilistic interpretations for MYCIN's certainty factors. In Kanal and Lemmer (1986), pp. 167-196.
Horvitz, E. J., Heckerman, D. E. and Langlotz, C. P. (1986). A framework for comparing alternative formalisms for plausible reasoning. Proc. American Association for Artificial Intelligence Conf. (AAAI-86), pp. 210-214.
Hunter, D. (1986). Uncertain reasoning using maximum entropy inference. In Kanal and Lemmer (1986), pp. 203-209.
Kahneman, D., Slovic, P. and Tversky, A. (1982). Judgement under Uncertainty: Heuristics and Biases. Cambridge University Press.
Konolidge, K. (1982). An information-theoretic approach to subjective Bayesian inference in rule-based systems. Draft, SRI International, Menlo Park.
Kyburg, H. E. (1987). Bayesian and non-Bayesian evidential updating. Artificial Intelligence 31, 271-293.
Lederman, E. (ed.) (1984). Handbook of Applicable Mathematics, Vol. VI, Part B: Statistics. Wiley, Chichester.
Lemmer, J. F. (1986). Confidence factors, empiricism and the Dempster-Shafer theory of evidence. In Kanal and Lemmer (1986), pp. 117-125.
Lindley, D. V. (1985). Reconciliation of discrete probability distributions. In Bernardo et al. (1985), pp. 375-390.
Loui, R. P. (1986). Interval-based decisions for reasoning systems. In Kanal and Lemmer (1986), pp. 459-472.
McIntosh, A. A. (1982). Fitting Linear Models: An Application of Conjugate Gradient Algorithms. Springer-Verlag, New York.
Paass, G. (1988). Uncertain reasoning by stochastic simulation. Working Paper GMD/F3, St Augustin, FRG.
Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence 29, 241-288.
Rauch, H. E. (1984). Probability concepts for an expert system used for data fusion. AI Magazine (Fall), pp. 55-60.
Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press.
Shore, J. E. (1986). Relative entropy, probabilistic inference, and AI. In Kanal and Lemmer (1986), pp. 211-215.
Smith, C. A. B. (1961). Consistency in statistical inference and decision. J. R. Statist. Soc. B23, 1-25.
Spiegelhalter, D. J. (1986b). Probabilistic reasoning in predictive expert systems. In Kanal and Lemmer (1986), pp. 47-67.
Thompson, T. R. (1985). Parallel formulation of evidential reasoning theories. Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-85), Los Angeles, pp. 321-327.
Zadeh, L. A. (1983). The role of fuzzy logic in the management of uncertainty in expert systems. Fuzzy Sets and Systems 11, 199-227.
DISCUSSION

Frank Veltman: Since most decisions must be made under circumstances of uncertainty, the designer of an expert system is immediately confronted with the problem of representing the modes of inference typical of such circumstances. Gerhard Paass's chapter offers an encyclopaedic survey of the statistical techniques that become available once one has taken for granted that the kind of "uncertainty" at stake here is best captured by probability theory. I shall not dispute the idea that probability theory offers the best characterization of uncertainty; indeed, there are strong arguments in favour of this position (see e.g. Lindley, 1982). My comments, one remark and one question, only pertain to the particular way in which Paass develops this idea.

(i) The remark that I want to make is this: the domain of application of the mathematical framework offered in Sections 2.3 and 3.1 is rather limited, much more limited than the author seems to think.

Amplification. The theory presented in Sections 2.3 and 3.1 requires for its application that probabilities be assigned to things that bear a truth value: sentences (or propositions, as the author prefers to call them). This is established by choosing as the sample space a set of so-called possible worlds, each determined by some maximal consistent set of sentences. The practice, however, does not fit this theory. For one thing, there is no way to make sense of the example presented in Section 2.1 if the symbols "A", "B" and "D" are really to be interpreted as sentences. This is what Paass says:

Suppose that a doctor has to decide whether or not a patient has a disease D. The relation between the two symptoms A and B and the disease D is specified in the form of the following rules F_1, ..., F_5, which hold with a certain probability:

F_1 := "If A then D follows" holds with probability π_1
F_2 := "If ¬A then D follows" holds with probability π_2

[....] These probabilities π_i reflect the subjective degree of belief of the doctor in the truth of the rules for a certain universe, for instance the people of a town. [....] the probability associated with a rule is always defined as the conditional probability of the consequence given the antecedent. For F_1 we have for example π_1 = p(D | A) := p(A ∧ D)/p(A).
One might try to interpret "A" as "a person chosen at random shows symptom A", and "D" as "a person chosen at random suffers from disease D", but in this manner the sentence (A ∧ D) is not going to mean "a person chosen at random has both the symptom A and the disease D", let alone that p(D | A) can serve as a formalization of the conditional probability that a person chosen at random suffers from disease D, given that this person shows symptom A. There are two alternative, and appropriate, ways to interpret the quoted paragraphs, but in neither of these can A and D be understood as sentences. On the first reading, A and D are to be interpreted as predicates, and the sample space concerned is not a set of possible worlds but a set of possible "patients", each given by some maximal consistent set of predicates. Fortunately, the statistical techniques discussed in Section 4 work just as well, mutatis mutandis, for this set-up as they do for the original. However, this only holds if we restrict ourselves to one-place predicates. For more complex predicates things go wrong:

Example 1  Some people think that the kissing disease is transmitted by kissing. Clearly, these people, and those who disagree with them, will want to talk about the conditional probability that a person x will be infected with the kissing disease by a person y who has the kissing disease, and who has kissed x. As far as I can see, there is no way to handle this probability within the framework offered by Paass, not even if we interpret it in a different manner.

The second way to read the quoted paragraphs is to think of A and D as formulae Ax and Dx in which the free variable x is suppressed. Ax can be considered as an abbreviation of "x shows symptom A", and Dx as an abbreviation of "x suffers from disease D". Where Paass writes (A ∧ D), we read (Ax ∧ Dx), and instead of p(D | A) we read p(Dx | Ax). Perhaps, at first sight, there is not much difference between this reading and the one discussed above. However, the advantage of not suppressing the free variables becomes clear if we want to formalize examples that involve both formulae with free variables and sentences, i.e. formulae without free variables.

Example 2  Compare the following:
(a) the probability that every male smoker dies from cancer before the age of 60;
(b) the conditional probability that a randomly chosen person dies from cancer before the age of 60, given that this person is a male smoker.
Note that these two probabilities are not necessarily the same. In fact, we can be pretty sure that an expert will assign probability 0 to (a) and a positive probability to (b). Still, there are logical relations between these two probabilities: if one of them is 1 then so is the other. Within Paass's framework, these logical relations cannot be made explicit. What one needs, for this example as well as for the previous one, is a fully fledged probabilistic semantics for arbitrary first-order formulae with or without free variables: a semantics that assigns to a formula φ(x_1, ..., x_n) the probability of finding for a randomly chosen n-tuple of objects that the property expressed by φ applies to them. Given such a framework, we could safely write "Dx" for "x dies before the age of 60" and "Sx" for "x is a male smoker", and so arrive at p(∀x(Sx → Dx)) for the probability under (a), and p(Dx | Sx) for the probability under (b). The probability of Example 1 could be formalized as p(Ixy | Py ∧ Kyx), where "Ixy" is an abbreviation of "x will be infected by y", "Px" is an abbreviation of "x has the kissing disease", and "Kxy" is an abbreviation of "x has kissed y". A probability semantics with the desired properties was devised in the early sixties by the Polish logician Jerzy Łoś. Unfortunately, lack of space prevents me from describing his theory here. Let me just mention the relevant literature: the locus classicus is Łoś (1963); the theory is further developed in Fenstad (1967) and in Gaifman and Snir (1982); Cooke (1986) discusses the theory with a view to application in expert systems.

(ii) The question that I want to ask is this: how does the discussion of Section 5 relate to other work that has been done on the subject of expert resolution? More precisely, is it meant as an alternative to the approach that tries to develop criteria for evaluating expert probability assessment?

Amplification. At several places in his chapter, Paass emphasizes that the probabilities at stake are supposed to reflect the expert's subjective degree of belief in the propositions concerned. Now, clearly, different experts may have different opinions about the same propositions, and therefore assign different numbers to them. In one way or another the decision-maker will have to resolve this conflict of opinions. I must confess that I do not fully understand the strategies that Paass recommends to this purpose. Actually, I find myself already at a loss with the way he introduces the problem. He says that the probabilities supplied by the experts may be erroneous to some extent, and he introduces so-called error models that give the stochastic relation between the true probability and the numbers supplied by the experts. I find it odd to find the word "error" in a context where subjective probabilities are involved (can people really be mistaken about their own degree of belief?), and I should be greatly helped if the author could give an example of the kind of errors he has in mind. Neither do I see what can possibly be meant by "the true probability", and it does not help much when the author says on p. 230 that
The "true" probability 1t; can be considered as the subjective probability estimate that would be supplied by a rational expert with complete information about all aspects of the problem. As far as I can see, the only feasible subjective probability estimates that any rational being-expert or not-will supply in these ideal circumstances are 0 for the propositions known to be false, and I for the propositions known to be true. But this is not what the author has in mind, I am afraid.t Anyway, the question arises as to why the author takes recourse to these error models to solve a problem that first and foremost seems to be a selection problem: which of the experts is the most reliable-or, perhaps better: which one has so far been found to be the most reliable? This question has received a lot of attention recently. and several criteria have been proposed for evaluating expert probability assessment. Important contributions can be found in Lichtenstein et a/. (1982), De Groot and Fienberg (1983) and Cooke (1985). I am grateful to Roger Cooke and Michie! van Lambalgen for their help.
Didier Dubois and Henri Prade: Are probability measures inevitable for modelling subjective uncertainty? Several authors cited by Paass, such as Cheeseman (1983) and Horvitz et al. (1986), have claimed that axioms of rational behaviour force an uncertainty measure to be a probability measure.
8
247
Probabilistic Lof?ic
uncertainty measure to be a probability measure. To do so, they put forward Cox's (1946) axiom system for the modelling of "reasonable expectation". However, Cox's axioms are not always exactly reported. Namely, he starts from the following requirements. Letting .f(a I b) be a measure of the "reasonable credibility" of the proposition b when the proposition a is known to be true, Cox proposes two basic axioms: Cl:
there is some operation * such that .f(c " b I a)
C2:
=
.f(c I b " a) *f(b Ia)
(I)
there is a function S such that .f(--,bla)
=
S(f(bla))
(2)
where --, b denotes not b. The following additional technical requirements are needed: C3:
* and S both have continuous second-order derivatives.
Then .f is proved to be isomorphic to a probability measure. Cheeseman (1985) proposes Cox's results as a formal proof that no other set functions than probability measures are reasonable for the modelling of subjective uncertainty. This claim can be disputed for two reasons. Although (I) seems very sensible as a definition of conditional credibility function, the purely technical assumption (C3) is very strong and cannot be justified on commonsense arguments. For instance* =minimum is a solution of (I) that does not violate the algebra of propositions, but it certainly violates C3. It is recovered as a valid solution as soon as C3 is relaxed to a more intuitive continuity assumption. A second objection concerns Axiom C2, which explicitly states that only one number is enough to describe both the uncertainties of b and --,b. Clearly, this statement rules out the ability to distinguish between the notions of possibility and certainty. This distinction is the very purpose of belief functions, possibility measures, and any kind of upper and lower probability system. Hence Cox's setting, although an interesting attempt at recovering probability measures from a purely non-frequentist point of view, does not provide the ultimate answer to the problem of justifying subjective probabilities. Dubois and Prade (1982) should be consulted for another axiomatic setting encompassing both probability and possibility measures as admissible models of subjective uncertainty. Namely the degree g(a) attached to proposition a should satisfy the following decomposability axiom: Dl:
if a " b =
0
then there is an operation .l such that g(a v b)= g(a) .l g(b)
(3)
These are called decomposable measures. They are usually different from belief functions in the sense of Shafer, although possibility and probability measures do belong to both settings. But many belief functions are not decomposable. Another comment concerns the interpretation of degrees of probability as degrees of intermediate truth. In our contribution to this book (Chapter to) we strongly argue against such a confusion. A degree of truth is not a degree of uncertainty about truth. In particular, the logical propositions considered in Paass's chapter are only either true or false. This distinction, in a probabilistic setting, is at least as old as Carnap's 0945) paper. However, Carnap's view of probability of a proposition a as the ratio between the number of possible worlds where a is true over the number of possible Worlds is not really convincing, since it assumes that the possible worlds are equally
248
G. Paass
probable-an assumption that is very difficult to check, and which turns out to be false in many cases. Actually, a probability measure on an algebra of propositions is justified if possible worlds are identified with a set of outcomes in a random process, and statistical data about this process are available.
Philippe Smets: F:
The author, like probabilists, translates the sentence
"if W then N" holds with probability p
as P(N I W) = p. This interpretation is questionable. One can translate F in at least two ways: 1:
P(NI W)
2:
P(--, W
=p
v N) = p
Suppose that an urn contains 100 balls. Let W =the ball is white, N =the ball is numbered. Suppose that there are 60 W" N balls, 15 W" --,N, 12 --, W" N, 13--, W" --,N. A ball is going to be taken randomly (each ball having a probability of O.Ql of being selected). If one translates F into "if I have extracted a white ball then the probability that the ball is numbered is p" then one has P(N I W) = p = 0.80. But one can also consider all extractions where the proposition W -+ N is true, i.e. whenever the ball is either W " N or --, W. One obtains the second translation with P(--, W v N) = p = 0.85. The two translations correspond respectively to 1':
N-+P(W)=p
2':
P( W-+ N) = p
The decision as to which translation is relevant can only be derived from information external to F. In medicine, with G = "if symptom S then diseaseD holds with probability p", one usually derives G from the fact that among those with symptomS a proportion p have disease D, in which case interpretation I holds. But it could also be that one receives the rule S -+ D from a professor whose proportion of correct assertions is p. Then interpretation 2 holds. A study of which proposition receives the I - p can be enlightening. In the case G, r is the proportion of time the professor tells the truth and I - p is the proportion of time the professor does not tell the truth. To be a good probabilist, one must then construct the probability distribution on all the sentences that the professor could assert given that he does not tell the truth, often an unreal requirement, and distribute the probability I - p among these sentences. With belief functions, the distinction between I and 2 disappears. One has bei(NI W) = c(bel(--, W v N)- bel(--, W)) (see Chapter 9, Section 3.6), with c = I or c = I - bel(--, W), depending on the openor closed-world assumption (Chapter 9, Section 4). . Let W and N be the spaces on which W and N are defined (i.e. W is the set of colours). When I have only the information "my belief is p that W-+ N is true", I build a belief function on W x N such that bei(N 1 W) = p and such that bel(cyl( W)) "" bel(cyl(--, W)) = 0 (Chapter 9, Section 6), with cyi(W) = (W, N) v (W, 1N). Then c = I and bei(N I W) =bel(--, W v N)
8
Probabilistic Logic
249
Therefore in the G case, the belief-function approach consists in allocating the mass p to S--+ D and the mass I - p to the tautology, avoiding in fact the obligation to enumerate the set of sentences that might be uttered by the professor when he lies, and to allocate to each of them a probability. To assimilate the probability of a conditional A --+ C to a conditional probability P(C 1 A) is hazardous, especially when considering iterated conditionals like A --+ (B--+ C) as shown by Lewis (1976) in his first triviality result. Let us define the--+ operator such that P(A --+ C)= P(C I A) for every A and C. Then one gets P(A --+ C I C) = P(C I A " C) = I P(A--+ CI--,C) = P(CI A
1\ I
C)= 0
For any D one has P(D) = P(DIC)P(C)
+ P(DI--,C)P(--,C)
If D is A --+ C then one obtains P(C I A)= I· P(C)
+ 0· P(--,C) =
P(C)
so A and C are probabilistically independent! Conditionals are highly delicate concepts (see Harper et a/., 1981 ), and their direct translation as conditional probabilities can be misleading.
Reply: First, I should like to discuss the objections of Frank Veltman to the definition of a probability measure using propositions. Recall that the set "'Y = {Wt. .. . , w. } of elementary propositions was defined as an exhaustive collection of consrstent and mutually exclusive statements about the world. Exactly one of these statements, Wj, is true. To remain in our example of medical diagnosis, the world is described by the characteristics of the next patient in the waiting room of the doctor. The possible worlds are the possible, logically consistent combinations W; of symptoms and diseases of this patient. The subjective probability measure of the doctor assigns a number 0 ~ p( U) ~ I to each subset U of "'Y according to the subjective degree to which he believes this subset to contain the fixed, but yet unobserved, realization Wj E "'f'", the true symptoms and diseases of the patient. Hence relevant propositions are stated with respect to a specific situation (a specific patient). A sample or a population is not necessary for the specification and evaluation of subjective beliefs by subjective probability measures. As long as the set of elementary propositions is finite, this conceptually simple setup may be utilized. Nilsson (1986, pp. 77ft"), for instance, shows how problems of firstorder logic may be solved by this approach. However, first-order logic and probability are only loosely connected in Nilsson's theory, as in a first step the "internal" consistency of possible worlds "'Y; has to be established within first-order logic, and in a second step probability theory is applied to derive the desired probability of consequences. Therefore I agree that integrated probability semantics for arbitrary first-order formulae-as proposed by Veltman-are more appropriate in this case. Let me now discuss the remarks of Veltman concerning the case of uncertain knowledge about probabilities. Here an independent external decision-maker is postulated who is able to judge the reliability of different experts. This decision-maker IS assumed to specify his subjective belief about the reliability of the ith expert for a series of hypothetical situations. In each such hypothetical situation he is asked to assume a specific "true" value, e.g. n; = 0.3, for the probability p(A) in question.
250
G. Paass
Subsequently, he is asked to specify his subjective probability that the number rr 1 that the expert wiii assign to p(A) will be lower then a specific value Ct. for example c, == 0.1. By specifying his subjective probability for different values c1, the decisionmaker can formulate his conditional subjective probability measure p(if 1 ln 1 = 0.3) for the situation that n 1 = 0.3. This process is repeated for other hypothetical situations with different values of n1• In this way, a subjective conditional probability measure p(if1 ln 1) can be gained, givinga complete picture of how the decision-maker judges the performance of the ith expert in different circumstances. Note that p(ifd n 1) contains no indication of the "true" probability in question. In the literature on group decision making, such an external decision-maker has been called "supra-Bayesian" (Genest and Zidek, pp. 120ff). It has several advantages over other approaches. (i)
If such a decision-maker exists then the pooling process is not a problem, as he can treat the experts' judgements as data and update his prior via Bayes' theorem (Genest and Zidek, 1986, p. 120).
(ii)
The likelihood solution corresponds to a Bayesian solution with noninformative priors.
(iii)
Many known approaches to combining subjective probability estimates can be understood as special cases. Forming a weighted average of the probabilities of the experts (linear opinion pool), for example, corresponds to the assumption of a normal error model with the inverse weights as variances.
I agree with Dubois and Prade that there are different ways to model subjective uncertainty. The selection of such a theory depends on the characteristics of the problem at hand. It has to be demonstrated, however, that a new formalism is internally consistent and offers more than well established approaches (e.g. probability theory). Otherwise it may have a confusing effect and ignore the rich theory developed for established paradigms. The interpretation of a sentence F: "if W then N" holds with probability p as P(N I W) =pis not essential to the approach of the paper. Depending on the situation of interest, the interpretation P(-, W v N) = p may be more appropriate, as demonstrated in the comment of Smets. Both interpretations state some characteristics of the joint probability measure, and may be used without problems with the techniques discussed in this chapter.
Additional references Carnap, R. (1945). The two concepts of probability. Phil. Phenomenol. Res. 5, 513-532. Cooke, R. M. (1985). Expert resolution: Proc. 2nd Conf on Analysis, Design and Evaluation of Man-Machine Systems. Pergamon Press, New York. Cooke, R. M. (1986). Probabilistic reasoning in expert systems reconstructed in probability semantics. Philosophy of Science Association 1986, Vol. I. Cox, R. (1946). Probability, frequency and reasonable expectation. Am. J. Phys. 14. 1-13. De Groot, M. and Fienberg, S. E. (1983). The comparison and evaluation of forecasters. Statistician, 32, 12-22. Dubois, D. and Prade, H. (1982). A class of fuzzy measures based on triangular norrns.
8
Probabilistic Logic
251
A general framework for the combination of uncertain information. Int. J. Gen. Syst. 8, 43-61. Fens tad, J. E. ( 1967). Representations of probabilities defined on first order languages. Sets, Models, and Recursion Theory (ed. J. N. Crossley), pp. 156-172. NorthHolland, Amsterdam. Gaifman, H. and Snir, M. (1982). Probabilitities over rich languages, testing and randomness. J. Symbolic Logic 47, 495-548. Harper, W. L., Stalnaker, R. and Pearce, G. (1981). Ifs: Conditionals, Belief, Decision, Chance, and Time. Reidel, Dordrecht. Lewis, D. ( 1976). Probabilities of conditionals and conditional probabilities. Phil. Rev. 85, 297-315. Also in Harper eta/. (1981), pp. 129-147. Lichtenstein, S., Fisch hoff, B. and Phillips, D. (1982). Calibration of probabilities: the state of the art to 1980. Judgement under Uncertainty: Heuristics and Biases (ed. D. Kahneman, P. Slovic and A. Tversky), pp. 306-335. Cambridge University Press. Lindley, D. V. (1982). Scoring rules and the inevitability of probability. Int. Statist. Rev. 50, 1-26. Los, J. (1963). Semantic representation of the probability of formulas in formalized theories. Studia Logica 14, 183-194.
9
Belief Functions PHILIPPE SMETS /RID/A, Universite Libre de Bruxel/es, Belgium
Abstract This chapter is a short self-contained presentation of the use of belief functions, a mathematical tool for the quantification of subjective, personal credibility.
1
INTRODUCTION
In order to delimit the problems covered by belief functions, we briefly describe various types of ignorance closely related to belief. There are at least three forms: possibilistic, probabilistic or credibilistic, each endowed with its own mathematical model. 1.1
Possibility
The information that "John's height is over 170 em" implies that, in describing John, any height h over 170 is possible and any height equal to or below 170 is impossible. This can be represented by a possibility function on the height domain whose value is 0 for h ~ 170 and I for h > 170 (where 0 = impossible and I = possible). Ignorance is due to the lack of precision, of specificity of the information "over 170". This type of ignorance can be generalized with statements like "John is tall". It implies that a height less than 160 em is impossible (value= 0) and a height above 180 em is possible (value= 1). In between, one may consider that the possibility takes some intermediate value between its extrema 0 and 1, the greater the height, the greater the possibility. Ignorance is due to the imprecision that results from the use of the fuzzy, vague, ill-defined term "tall". This type of possibilistic ignorance is covered by Dubois and Prade in Chapter I 0, and will not be discussed here. 1.2
Probability
Another form of ignorance results from randomness encountered in chance
254
P. Snwrs
set-ups. For example, when throwing a dice, the probability that the outcome . . I IS one IS "6· This model can be generalized by considering that the probability of each event is not known as a real value between 0 and I, but as belonging to an interval. This results in the "upper and lower probability theory" (Good 1950; Smith, 1961; Dempster, 1967, 1968). This theory should not be confused with the one covered by belief functions. It requires the existence of an underlying probability whose value is known only to be within a crisp interval. It has been further generalized when probability is known as a fuzzy number (close to 0.6), as a linguistic variable (small) or to lie within a fuzzy interval (approximately between 0.4 and 0.5) (Zadeh, 1975). Another generalization is obtained by the introduction of some metaprobability that describes our knowledge about the value of the unknown but existing probability (Lindley et al., 1979). This meta-probability expresses in fact our degree of belief about an unknown probability, where degrees of belief are quantified by a probability function. It is a particular form of Bayesian probability. Furthermore, it is also a special case of the theory of belief functions when belief is quantified by a probability function, a particular form of belief function described below. 1.3
Credibility
Belief functions aim to model and to quantify the subjective, personal credibility (called belief hereinafter) induced in us by evidence. Some evidence is strong enough to induce knowledge: if it is II a.m. then I know it is daytime. Other not so definite evidence may induce only a belief: given the information available on 15 July, 1986, I believe that I will be in Cordes on 22 September, 1986. This belief can be more or less strong, thus admitting degrees of belief. Bayesian probabilists have claimed that this degree of belief can be quantified by probability functions whose major axiom, the additivity axiom. states that the probability of the union of two disjoint events is the sum of the probabilities of each event (Fine, 1973). The Bayesian approach is usually justified by axioms describing decision processes or betting behaviour Within such a context, our belief can indeed be described by a probability function, but it does not follow that our belief should always be so modelled Belief can exist outside any decision or betting context. It is a cognitive process that exists per se. The Bayesian argument implies only that when we face a decision problem we must be able to construct a probability function based on our belief. This chapter presents a model to quantify someone's degree of belief based on belief functions (Shafer, 1976). Within the AI community, it is often called
9
Belief Functions
255
the Dempster-Shafer model, an unfortunate denomination which allows too widespread a confusion between upper and lower probabilities and belief functions, the first dealing with an imprecisely known underlying probability, the second with the intensity of our credibility. Some modifications of Shafer's initial model are introduced, essentially the distinction between the open- and the closed-world assumptions and its impact on the normalization. This chapter is a short self-contained exposition of the whole theory, but Shafer's ( 1976) highly readable seminal book should be read before really pursuing the topic. The model developed here should not be confused with those found in recent AI research papers on belief networks (Pearl, 1986a, b). In these papers beliefs are quantified by classical Bayesian probabilities, and the problem under consideration is their implementation for AI applications. Section 2 of this chapter discusses the nature of the frame of discernment on which a degree of belief will be established, and presents the distinction between open- and closed-world assumptions. Section 3 introduces the general model and presents an example. Section 4 presents the relevant mathematical definitions and properties. Our presentation considers algebra of propositions and not sets as is often done. Both approaches could have been used. Our choice reflects a personal taste and the idea that the concept of the truth of a proposition precedes that of belonging to a set. Section 5 presents Dempster's rules of conditioning and combination. Section 6 presents Bayes theorem, generalized within a belief-function framework (Smets, 1978). Section 7 discusses discounting evidence, i.e. what to do when the available evidence is not reliable. Section 8 presents canonical experiments that can explain the meaning of the numerical value of the belief given to a proposition A. Section 9 concludes and presents some hints about the use of this model for automated reasoning.
2 THE FRAME OF DISCERNMENT 2.1
Open- and closed-world assumptions
In most probability theories, as well as in Shafer's theory, one starts by postulating some frame of discernment A (also called the Universe of Discourse or the Domain of Reference) on which evidence induces some belief. In reality, the cognitive process is hardly as simple. When faced with a cognitive problem, one starts by constructing the set KP of those propositions Known as Possible. But there is also (I) the set UP of Unknown Propositions for which we have no idea whether they are possible or impossible, and (2) the set K I of those propositions Known as Impossible. In
256
P. Smets
the classical approach, one considers that UP is empty and accepts the highly idealized closed-world assumption, i.e. that the truth is necessarily in KP and that A is KP. The content of the three sets depends not only on the problem studied, but also on the pieces of evidence available. As evidence becomes available, propositions are redistributed between the three sets. (1)
A proposition A is transferred from KP to Kl when the evidence permits the claim that A is impossible. This corresponds to the classical concept of conditioning.
(2)
A proposition A is transferred from UP to KP if the evidence induces us to consider as possible some forgotten propositions.
(3)
A proposition A is transferred from UP to Kl if the evidence induces us to consider that some forgotten propositions are in fact impossible. In practice, this has no direct impact, as the degrees of belief are constructed only on KP.
(4)
Transfer from K I to K P or UP and from K P to UP would be inconsistent with the definition of the three sets, if one accepts, as here, that the allocation of any proposition to one of the three sets is always correct. A true proposition may be correctly allocated to KP and UP. and a false proposition may be correctly allocated to K P, K I or UP.
A true proposition may not be allocated to Kl, and any propositiOn allocated to Kl will stay in Kl, inducing monotonicity for the impossible (false) propositions. The generalization could be considered by accepting that a true proposition might be in Kl and constructing some meta-belief function on the set of all propositions that expresses the degree of belief that each proposition can belong to any of the three sets. The closed-world assumption postulates an empty UP set. The open-world assumption admits the existence of a non-empty UP set, and the fact that the truth might be in UP. 2.2
Notation
This presentation of the frame of discernment is formalized as follows. One writes -,, v, 1\ and => for the negation, the disjunction, the conjunction and the material-implication connectives. The set K P will be based on A, a finite set of elementary propositions. Let D be the boolean algebra of propositions derived from A, i.e. Q contains the
9
Belief Functions
257
conjunctions, disjunctions and negations of any set of propositions of A. Let ln be the tautology relative to n, i.e. ln is the disjunction of all elementary propositions of A. Let On be the contradiction relative to n, i.e. none of the propositions of A implies On. Then the conjunction of any two distinct propositions of A is On. The set UP will be denoted by e. No details about its structure and about Kl are needed. Any support given by some piece of evidence to some proposition A of n is in fact given to A v e. In order to simplify the notation, we shall not repeat the disjunction with e, but it must be unstood that whenever a proposition A of Q is mentioned, it corresponds to A v e. The proposition On is not the contradiction, as it corresponds to On v e. There would be a contradiction if e was empty (the closed-world assumption). The proposition ln corresponds to ln v e and is thus a tautology as all propositions in KJ are false by definition. Negation of any proposition A of n, symbolized by --,A, is taken relatively to A. So --,A is the disjunction of e and any elementary proposition of A not implying A. The E symbol is used with the following meanings: A E A means that A is an elementary proposition of A; A E n means that A is a proposition of n;
for BEn, A E B means that A is an elementary proposition implying B. Thus OnE n is true but OnE A and OnE Bare false as On is not an elementary proposition. (Being an element of the algebra n is different from being an element of an element of Q.) For any A En, IAI is the number of elementary propositions BE A such that BE A. For A, BEn, the symbol A ..... B means "it is true that A implies B", i.e. A and B are such that whenever X E A, then X E B. Note that On -+ A can be asserted for all A in Q. We say that a proposition BEn is based on some elementary proposition A of A if A E B.
3 3.1
QUANTIFICATION OF DEGREE OF BELIEF General model
Suppose that there is a piece of evidence that induces in us some belief concerning the truth of propositions defined on a finite frame of discernment
258
P. Smers
A with !l being the Boolean algebra derived from A. It is postulated that there exists some finite amount of belief that is spread among the various propositions A of n according to the available evidence. For instance suppose that Mrs Jones has been murdered and we, the judges, know that the suspects arc Peter, Paul and Mary. Thus A= {Peter, Paul, Mary}. Given the available evidence, parts of the amount of belief are allocated to each of the three potential murderers, as in Bayesian model. But some evidence might support something other than only one of the three persons. Such is the case of the evidence "the mufderer is a male". This evidence supports A ="Peter or Paul", and we allocate some part of m of our total mass of belief to A without being able to split it between the two components of A. In such a case, probabilists usually invoke the Principle of Insufficient Reason or an argument of symmetry to decide that the mass m must be split into two equal parts, one for Peter and one for Paul. The originality (and the power) of Shafer's model is that it does not evoke these principles and leaves the mass m allocated to the proposition A. The total amount (mass) of belief is arbitrary, but is conveniently scaled to I without any loss of generality. The non-negative mass m(A) allocated to the proposition A En that cannot be allocated to any proposition A' such that A' -+ A, A' # A is called a basic probability number by Shafer ( 1976). (A -+ B is short for "it is true that A implies B'.) The function m: n-+ [0, I] is called a basic probability assignment whenever
L
m(A) =I
A-1 11
where ln is the tautology relative to n. The notation
L
means that the sum
A-B
is taken over all propositions A En that imply BEn, or over all propositions BEn implied by A E !l, depending on which symbol A orB is not fixed by the context. Any A such that m(A) > 0 is called a focal proposition.
3.2
Practical example
As a practical example, suppose that we are the judges and must analyse the available evidence concerning Mrs Jones' case. Three witnesses provide evidence (testimonies). Let the three pieces of evidence be symbolized by E ,. E2 and E 3 . E1:
Witness I is a janitor, who claims he heard the victim yelling and then saw a small man running out of the victim's house.
9
259
Belief Functions
E2 :
Witness 2 is an old lady, who lives across the street from the victim and who saw the crime through her window and claims the murderer was much taller than the victim.
E3 :
Witness 3 is Peter's girlfriend, who testifies that Peter was at her home far away from the victim's house when the crime happened.
How do we evaluate the meaning of these three pieces of evidence, how do we quantify their respective support for the potential murderers, how do we combine these supports, and what do we do if doubt can be cast about the quality of the testimonies. Let k symbolize the killer. E 1 supports that k is a man. Furthermore, k looks small, which fits Paul or Mary, both being small, but not Peter, who is quite tall. But as the janitor was far from the house, his opinion about the tallness of the man he saw running is doubtful, as is his testimony about the sex, as Mary has short hair and could have worn slacks. The impact of E 1 on n can be summarized by three masses, one pointing to {Peter or Paul}, one pointing to {Paul or Mary} and one unallocated, i.e. pointing to 10 . E 2 suggests Peter, but as the witness is short-sighted and claims she had taken off her glasses just before looking through the window, some reservation must be allocated concerning the value of her testimony. The impact of E 2 can be summarized by two masses, one pointing to {Peter}, the other being unallocated. E 3 suggests Paul or Mary. But as the witness is Peter's girlfriend, serious doubts must be put on her testimony. The impact of E 3 can be summarized by two masses, one pointing to {Paul or Mary}, the other being unallocated. Indeed, if the witness is lying, E 3 does not support that Peter is the killer, it only makes her testimony meaningless. Table 1 (columns m" m 2 and m3 ) presents the masses quantifying the impact of the three pieces of evidence on n. The evaluation of the masses is not discussed in this chapter; it will be briefly discussed in Section 8. Table 1
Masses derived from the three pieces of evidence, and their combination Q
m,
mz
mJ
On Peter Paul Mary Peter or Paul Peter or Mary Paul or Mary Peter, Paul or Mary
0.6
0.5 0.2 0.3
m,z
m,zJ
0.12 0.48
0.36 0.24 0.10 0.00 0.10 0.00 0.14 0.06
0.20
0.4
0.5 0.5
0.08 0.12
260
P. Sme1s
The present example is based on external evidence (testimonies). But one could just as well have used internal (objective) evidence like the fingerprints found on the weapon or the knowledge that the killer smokes a certain brand of cigarettes. 3.3
Combination of evidence
Pieces of evidence are combined by the application of Dempster's rule of combination on the basic probability assignments. The product of the masses induced by two distinct pieces of evidence is allocated after combination to the conjunction of the two focal propositions. Let mi(A) be the masses derived from evidence Ei, i = 1, 2, and let mu(A) be the mass obtained after the combination of pieces of evidence E 1 and E 2 . So m 1 (A) * m2 (B) is allocated to the conjunction A 1\ B. All such possible products are computed and all masses allocated to the same proposition are added together: mu(A) =
L
m 1 (A v X)m 2 (A v Y) =
x--,A
L
m 1(X)m 2 (Y)
x"'r=A
y--,A X"' Y =011
m123 is computed by combining m 12 with m3 in the same way. Table presents the results. Dempster's rule of combination is associative: whatever the order in which basic probability assignments are combined, the results are identical. 3.4
Belief and plausibility
The quantity m(A) measures the amount of belief that one commits specifically to A, not the total belief that one commits to A. Each mass m(A) supports also any proposition implied by A. Therefore the total degree of support (belief) that we have about the fact that a proposition A is true is obtained by adding all the masses m(B) allocated to propositions B that imply A without implying -,A (which means that On must be discarded from the sum). The degree of belief given to A is quantified by the belief function bel: n ..... [0, 1], with bel(A) =
L
m(B)
(3.1)
B-A
B,<011
By definition, bei(On) = 0, even though m(On) might be positive (see Section 4.8). The plausibility of a proposition A is the sum of the amounts of belief that are allocated to proposition B and that do not contradict A, i.e. do not imply-, A. The degree of plausibility given to A is quantified by the plausibility
9
Belief Function.'
function pi:
n---+
261
[0, 1], with
pi(A)
L
=
m(B)
B" A ;<011
It is related to bel through
pl(A) = bel(l 0 ) - bel(-, A)= 1- m(00 ) - bel(1A) The meanings of "belief" and "plausibility" are still controversial. One might prefer to call bel( A) the degree of minimal or necessary support (entailment, commitment) for A, and pl(A) the degree of maximal or potential support (entailment, commitment) for A. We shall use hereinafter the words belief and plausibility as they are those most often used in the present context, even though that usage might be subjected to criticisms. Table 2 presents the degrees of belief and plausibility derived from the data of Table 1. Table 2
Belief and plausibility functions derived from the data of Table I Q
On Peter Paul Mary Peter or Paul Peter or Mary Paul or Mary Peter, Paul or Mary
4
bel 1
pl.
bel 2
pl2
bel 3
pl3
bel 123
pll23
0.0 0.0 0.0 0.0 0.5 0.0 0.2 1.0
0.0 0.8 1.0 0.5 1.0 1.0 1.0 1.0
0.0 0.6 0.0 0.0 0.6 0.6 0.0 1.0
0.0 1.0 0.4 0.4 1.0 1.0 0.4 1.0
0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0
0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0
0.00 0.24 0.10 0.00 0.44 0.24 0.24 0.64
0.00 0.40 0.40 0.20 0.64 0.54 0.40 0.64
MATHEMATICAL PROPERTIES OF BELIEF FUNCTIONS
4.1
Belief functions
In Section 3 three functions have been defined on 0: the basic probability assignment m, the belief function bel and the plausibility function pl. Belief functions satisfy the following inequalities: = 1 - m(00 ) ~ 1
(1)
bel(1 0 )
(2)
for every n > 0 and every collection A 1 , A 2 ,
be{
y Ai) ~ f bel(AJ - L bel(Ai 1\ i>j
Aj) + .. · + ( -1)-n bel(AI
.. . ,
1\
An En,
A2
1\ ... 1\
An)
(4.1)
262
P. Smets
Shafer starts with the idea that degrees of belief satisfy these inequalities, arguing for instance that the belief in the disjunction of two propositions should at least contain the sum of belief allocated to each reduced by the belief allocated to both, the corresponding equality encountered for probability functions being unjustified. Unfortunately, this requirement is not sufficient to define belief functions, and one must postulate these inequalities for all n. Critics of Shafer's approach argue against having to postulate all these inequalities, an excessive and not very natural requirement. These criticisms justify why this presentation starts with non-negative basic probability numbers, rather than with belief functions. When all focal propositions are elementary propositions, the inequalities (4.1) become equalities, bel(A) = pl(A) for all A en, and the belief function becomes a probability function. Therefore Bayesian probability theory is a particular case of the theory of belief functions. One could consider the use of belief functions that must satisfy the inequalities (4.1) only for n:::; K < oo. In that case, the representation of bel based on the relation (3.1) might imply that some masses m are negative (Chateauneuf and Jaffray, 1986). A theory of belief where negative masses are allowed could be considered, but its meaning is not yet clear for the present author. The model based on the basic probability assignment could be abandoned, and another theory based on monotone capacities of finite order could be advocated. But the inequalities then have to be justified for all n :::; K. Shafer, in his book, indeed starts by postulating these inequalities for K = oo. That approach has been criticized as unnatural and is not proposed here. It seems easier to grasp the concept of positive masses spread among propositions then of the inequalities. When K = oo, both approaches lead to the same solution, but this is not the case when K is finite. 4.2
Communality function
A fourth function, the communality function, is defined on n, but its meaning is not obvious, which probably explains why it is so rarely mentioned. Its importance is nevertheless enormous from a computational point of view and for proving theorems. The communality function q is a function q: n-+ [0, I] such that q(A)
L
=
m(A v B)
B-•A
The four functions m, bel, pi and q define each other uniquely. Among other relations, one has m(A)=
L B-A
8;<00
(-lt-bbei(B)
9
Belief Functions
with a- b = lA
263 1\
1BI; m(A) =
L (-l)bq(A
v B)
n--,A
bel(A)
+ m(On) = L (-ltq(B) n--,A
q(A)=
L
(-l)bbel(-,B)
B-A
with b = IBI. 4.3
Vacuous belief function
The vacuous belief function is defined such that m(ln) = I bel(ln) = I bei(A) = 0
for all A -:f. ln
pi(A) = I
for all A -:f. On
q(A) =I
for all A en
It describes the belief one obtains in cases of total ignorance. Total ignorance has always been troublesome for the Bayesians, leading to strong controversies. It is either just rejected as non-existent-a Procrustean solution not followed here-or solved by the application of the Principle of Insufficient Reason: if one has k elementary propositions and there is no reason why any should be supported more than (more credible than) any others, then split the probability mass equally among them. But this does not represent Total Ignorance. There is no reason why some disjunction of elementary propositions should be more supported than any other. So one must have bei(A) equals some constant c ~ 0 for all A En, and not only for the elementary propositions A E A. Of course, this is not possible with probability functions. With belief functions, this means that with A and B such that A 1\ B =On, one has the inequality bei(A v B) ~ bei(A) + bei(B) and thus c ~ 2c; therefore c = 0 is the only solution, and it indeed satisfies all the inequalities characterizing belief functions. It corresponds to the highly logical basic probability assignment by which m(ln) = I and all other masses are null, ln being the only supported proposition. 4.4
Simple support functions
A belief function is called a simple support function (SSF) if it has at most one focal proposition different from ln. This focal proposition is called the focus
264
P. Sme1s
of the SSF. The pieces of evidence E 2 and E 3 in the example of Section 3.2 are such SSFs. SSFs correspond to a very elementary form of belief functions, the case where the evidence points partially toward a unique proposition. SSFs are a particular case of consonant belief functions. 4.5
Consonant belief functions
Consonant belief functions (Shafer, 1976) are belief functions for which the masses are allocated on focal propostions A 1 , A 2 , .•• , A. such that A 1 --+ A 2 • A 2 --+ A3 , ... , A.- 1 --+A •. In that case, bei(A " B)= min {bei(A), bei(B)}}
(4.2)
pi(A v B)= max {pi(A), pi(B)} Such bel and pi functions are called respectively necessity and possibility functions in Dubois and Prade ( 1985). Consonant belief functions are only particular cases of belief functions, but they are too restrictive as a general model to be used to quantify degrees of belief. The fact that degrees of necessity and possibility should satisfy the relation (4.2) and that some particular pair or family of belief functions also satisfies these relations does not mean that the two concepts, possibility/plausibility and necessity/credibility, can be confused. Different concepts can share the same mathematical model without sharing the same interpretation. 4.6
Dempster's rule of combination
In this presentation, each time one of the functions m, bel, pi or q is introduced with some supplementary symbols, we shall abstain from defining each one in relation to the others. This avoids the need to explicitly define m 1. pl 1 and q 1 as being the basic probability assignment m 1 , the plausibility function pl 1 and the communality function q 1 related to the belief function bel 1• The simple declaration of one of them automatically implies the others. the supplementary symbols being sufficient to know which ones are interrelated. Given two belief functions bel 1 and bel 2 induced by two distinct pieces of evidence, the belief function bel 12 that results from their combination is obtained by Dempster's rule of combination (see Section 3.3). Expressed with communality functions, Dempster's rule of combination becomes
a relation whose simplicity explains the advantage of these functions.
9
Belief Functions
4.7
265
Dempster's rule of conditioning
Suppose that we have a basic probability assignment m on n obtained after considering some initial evidence. Then suppose that we learn from a new piece of evidence that the truth is necessarily in B e !l, and thus that all propositions based on the elementary proposition A E -, B are impossible. How does this evidence modify our basic probability assignment? Let m' be the basic probability assignment obtained after taking the new evidence into account. To construct m'(A), three situations must be considered, depending on the relation between A en and the conditioning proposition BE !l. ( 1)
A -+ B. The evidence that the truth is in B does not modify the part of our total belief mass supporting A.
(2)
A 1\ B = A 1 =f. On and A 1\ -,B = A 2 =f. On. The mass A was allocated by m to A 1 v A 2 with A 1 -+ Band A 2 -+ 1B. We learn that the truth is in B; and therefore the mass that was allocated to A 1 v A 2 is transferred to A 1 , the only part of A that is compatible with the new evidence that asserts "the truth is in B".
(3)
A -+ -,B. The evidence that the truth is in B tells that all elementary propositions in A are impossible. Thus the mass m(A) is transferred to On.
Therefore m'(A) =
L m(A { c--,B
v C)
0
for all A
-+
B
otherwise
This rule is called Dempster's rule of conditioning. It implies m'(On) =bel(-, B)+ m(On) I' A _ {bel(A v 1B)- bel(1B) be ( ) - bel'(A 1\ B)
pl'(A) = pl(A q'(A) =
{
1\
q(A)
0
B)
for all A -+ B otherwise
for all A E !l
for all A -+ B otherwise
Returning to the murder case, suppose that a definitive piece of evidence tells that Peter is not the murderer; then B = {Paul, Mary}. For instance, the portion of belief that was allocated to Paul and/or Mary are unchanged, the portion that was given to "Peter or Paul" now supports Paul alone, and the portion that was given to Peter is transferred to On.
266
P. Smets
Conditioning on a proposition B is in fact a special case of Dempster's rule of combination where one combines bel with a belief function with only one focal proposition B receiving the whole mass. 4.8
Differences from Shafer's model
Shafer's original model includes the requirement m(On) = 0. (Remember that On is short for On v e, but Shafer postulates that e is empty-he always postulates the closed-world assumption.) We feel that this is unnecessary and may lead to unsatisfactory results (see Section 5.2). In the present murder case of Section 3.2, m(On) corresponds to that amount of belief allocated to none of the three suspects. We must always keep in mind that the murderer might be someone else, for example all pieces of evidence pointing to Mary and not to Peter and Paul, point in fact to "Mary or someone else other than Peter or Paul". In particular, m(On) is the amount of belief allocated to the proposition that none of the three suspects is the murderer. Had we received the evidence that the murderer must be one of the three suspects (the closed-word assumption of Section 2) then this new evidence would induce some conditioning that would imply m(On) = 0. The fact that m(On) might be non-null implies that the evidence is by nature essentially negative in that it allows one to discard some propositions. Indeed, all pieces of evidence pointing to Mary are essentially not supporting "Peter or Paul". The method of reasoning simulated by this approach is closer to an elimination procedure than to a constructive procedure. A support for a proposition is in fact a non-support for its negation taken relatively to 11. To understand m(On) > 0, one must accept the open-world assumption and consider that any amount of belief allocated to a proposition A E Q is in fact allocated to A v e, where e is the set UP (Section 2.1 ). Then m(O!l) represents the mass allocated to e. In the open-world context, ---,A ED means the complement of A relative to 11, and the mass allocated to ---,A E D is in fact allocated to ---,A v e. To say that A En and BEn contradict each other means that there are no elementary propositions in 11 that simultaneously imply A and B. If witness I points to Peter and witness 2 points to Paul, they contradict each other. If they are perfectly reliable, it means that the murderer is not in 11 = {Peter, Paul, Mary} (see Section 5.2). Shafer's approach postulates beforehand the closed-world assumption. If one were to define n such that it included e then this would lead to the same basic probability assignment as with the open-world assumption if one took care never to allocate some masses to propositions of Q that did not include e. We feel that it is easier to use the restricted nand to allow positive masses for On, keeping in mind that all masses given to propositions A E Q are always
9
267
Belief Functions
allocated to A v 8, except if the closed-world assumption is explicitly expressed. The major impact of Shafer's postulate m(On) = 0 is that the results obtained from Dempster's rule of combination must be renormalized in order to keep the total mass equal to I. Therefore he divides each mass obtained by Dempster's rule of combination by a constant corresponding to 1 - m(00 ).
5
COMBINATION OF TWO BELIEF FUNCTIONS
5.1
Axiomatic justification of Dempster's rule of combination
Suppose that we have two belief functions bel 1 and bel 2 induced by two distinct pieces of evidence. The question is to define a belief function bel 12 = bel 1 EB beh resulting from the combination of the two belief functions, where EB symbolizes the combination operator. Dempster's rule of combination can be justified by the following axioms. A1:
compositionality: beidA) is a function of A, bel 1 and beh only;
A2:
symmetry:
A3:
associativity:
(bel 1 EB beh) EB bel3 = bel. EB (beh EB beh) A4:
conditioning: if bel 2 is such that m 2 (B) = 1 then
m12(A) =
L m { c--,B 0
1 (A
v C)
for all A
--+
B
otherwise
The axiom of compositionality A 1 claims that the combination is a functional of both belief functions and maybe A, but nothing else. The axiom of symmetry A2 and the axiom of associativity A3 tell as that the result of the combination of pieces of evidence is independent of the order in which they are considered and/or they are associated. The axiom of conditioning A4 has been justified in Section 4.7. It implies that if bel 2 is vacuous then bel 12 = bel 1. Axioms A1-A4 imply that
ql2(A) =f(A, {q 1 (B): B--+ A}, {q2(B): B--+ A})
268
P. Smets
Axiom AS expresses the idea that the result of the combination will not be modified by a permutation among the elementary propositions of A:
internal symmetry: let A be a set of distinct elementary propositions A" A 2, .. . , An E A; let the propositions B" B 2, .. . , Bn be a permutation of the propositions A 1 , A 2, .. . , An; and let qi and q; be the sequences of communalities used in qu(A), with
AS:
qi = {qi(Ad, qi(A2),qi(A, v A2),qi(A3), ... ,qi(A, v A2 v ... vAn)} q;
=
{qi(B,),qi(B2),qi(B, v B2),qi(B3), ... , qi(B, v B2 v ... v Bn)}
then
f(A, q,, q2) = f(A, q',, q2) Axiom A6 considers that the mass m!2(A) given to A En is independent of the masses given by m 1 (and m2) to propositions B-+ 1A: A6:
autofunctionality: \fA E!l, A =1-lr19 m12 (A) does not depend on m 1(X) for all X-+ --,A;
A7:
three-elements: there are at least three elementary propositions in A.
A8:
continuity: for all q2(A), q 12 (A) is continuous as q 1(A)-+ 1.
The three-element axiom seems hardly critical. The continuity axiom 1s needed only to eliminate an uninteresting degenerate solution.
Theorem Uniqueness of Dempster's rule of combination: given axioms A 1A8, for all A E Q
All proofs and details about this axiomatic justification are in Smets ( 1986a).
5.2
Normalization
In order to distinguish between Shafer's definitions and ours, we use capital letters as first letter for the four functions as defined by Shafer. When Shafer introduced his model to quantify degrees of belief, he postulated M (00 ) = 0 and Bel(l 0 ) = 1. So, after combining two belief functions, he had to normalize the results in order to get Bel 12 (1 0 ) = 1. This is obtained by computing mu(A) as done here and then proportionally rescaling it into M !2(A) = mu(A)/{ l - m!2(0 0 )}. This normalization seems natural, but has been seriously criticized by Zadeh (1984) with the next counter-example. Suppose that we have a murder case with three suspects: Peter, Paul and
9
269
Belief Functions
Table 3
Peter Paul Mary
Witness I
Witness 2
M12
m12
0.99 O.ol 0.00
0.00 O.ol 0.99
0.00 1.00 0.00
0.00 0.0001 0.00
Mary, and two witnesses. Table 3 presents the degrees of belief of each witness about who might be the murderer. Witness 1 is sure that it is not Mary, that it is most probably Peter, but that it might also be Paul. Witness 2 holds similar beliefs except for the permutation between Peter and Mary. How can these two quite contradictory pieces of evidence be combined. Shafer's original solution M 12 leads to the conclusion that Paul is certainly the murderer. Zadeh does not accept this solution, as it gives full certainty to a solution (Paul) that is hardly supported at all. In fact, in the totally different situation where both witnesses had been sure that Paul was the murderer, the combined solution would have been the same M 12 . The solution m 12 within the present theory seems much more realistic as it shows a little support for the conclusion Paul, but On is highly supported (mdOn) = 0.9999). Keeping in mind the meaning of On given in Section 2.1, the most obvious conclusion one should have in the present situation is that the real murderer must be a fourth person, i.e. the solution is in the set e = UP and not in the set Q = KP ={Peter, Paul, Mary}. There is of course another way to handle the present inconsistency. The pieces of evidence are combined by a judge who obtains evidence from two witnesses, each expressing his own belief. The judge must also consider his own belief about the reliability of the witnesses. So one could introduce a meta-belief function representing the degree of belief held by the judge about the assertions of each witness. Discounting (Shafer, 1976) is one way to take into account this meta-belief (see Section 7). What represents the normalization in the present theory. Suppose that we are presented with the further piece of evidence: "The murderer is necessarily one of the group Peter, Paul or Mary". How might we accommodate this "closed-world conditioning" (UP is empty), i.e. how do we transform m12 into m'1 2 such that m'12 (0n) = 0. We must somehow reallocate m 12 (0n) to propositions of n in order to keep the sum of all the masses m'12 equal to 1. The general solution is given by m'dA) = mdA) m'12 (0n) = 0
+ c(A, m., m2)m.2(0n)
\fA en, A #-On
270
P. Sme1s
Shafer's solution corresponds to c(A, m" m2 ) = mu(A)/{1 - mu(00 )}. It can be obtained if one requires that relative degrees of belief (or plausibility) should stay constant after considering the closed world conditioning. Definition The closed-world conditioning corresponds to the impact of the absolutely certain proposition "UP = 0".
We have another axiom: A9:
let bel' be the belief function obtained from bel: !l --+ [0, I] after closed-world conditioning; then m'(0 0 ) = 0 and VA, BE !l, A, B i= On bel'(A)/bel'(B)
=
bel(A)/bel(B)
The last equality can be equivalently replaced by pl'(A)/pl'(B)
=
pl(A)/pl(B)
Axiom A9 implies that bel'( A) = c ·bel( A) with c independent of A. As bel'(l 0 ) = c(l- m'(00 )) =I, c = 1/{1- m(00 )}, as in Shafer's solution. In this chapter the combination operator EB has been considered under the open-world assumption and it has been shown that Shafer's normalization can be assimilated to the impact of the closed-world conditioning. It takes into account Zadeh's critics because if the closed-world assumption is true then the only murderer is Paul, as Peter and Mary have been eliminated by witnesses I and 2 respectively. By elimination, only Paul remains, as shown by M12· The real paradox in the counter-example lays not so much in the normalization but in the acceptance of the closed-world assumption. In a real-world situation it is obvious that if one can really believe both witnesses then one should seriously question the closed-world assumption. The solution m 12 has the advantage of showing the practical impact of the closedworld conditioning, which was not visible with Shafer's solution.
6
GENERALIZED BAYES THEOREM
Suppose that we have two finite sets of elementary propositions X and Yand let !lx and !lr be the finite Boolean algebras derived from X and Y. Let x E X and y E Y denote the elementary propositions of X and Y. Within probability theory, Bayes' theorem permits the computation of P(y I A), the a posteriori conditional probability distribution on Y given A E !lx from the set {P(x I y): y E Y} of conditional probability distributions on X given each elementary proposition y E Y and P(y), an a priori
9
271
Belief Function.\·
probability distribution on Y. One has P(y I A)= P(A I y)P(y)
/Jr
P(A I z)P(z)
This formula is based on the idea that there is an underlying joint probability distribution on the product space W = X x Y (with !lw the corresponding Boolean algebra) such that the various conditional probability distributions and the a priori probability distribution can be deduced from it respectively by conditioning and by marginalization. We have generalized this theorem when all probability distributions are replaced by belief functions (Smets, 1978). For each singleton y E Y let belx(. I y) be a belief function on the space X. (The dot corresponds to the variable whose domain is indicated by the subscript in belx(. I y).) Suppose that the a priori belief on Y is vacuous. Given these belief functions, the belief function belw on W = X x Y is constructed by the following steps. (1°)
Build the vacuous extension belw,y:2w-+[O,I] of belx(-ly) such that (a) its conditioning on the set {(x, y): x E X} is equal to belx(. I y), and (b) belw,y is the least iriformative belief function among all the ?elief functions bel~ that satisfies (a), i.e. belw,y(w) ~ bel~(w) Vwe!lw.
(2°)
liB-Combine the belw,y: belw
(3°)
Condition belw on x E !lx, the result being belw,..,: 2w-+ [0, 1].
(4°)
Marginalize belw,.., on the space Y into belr( .1 x), with belr(A I x) = belw,..(w), where w = {(x;, y;): X; Ex, y; E A}
=
belw,y 1 tB belw,y2 tB ... tB belw,yn·
The result is VA E !ly,X E nx belr(A I x) = {
fl
belx(lx I y)-
ye•A
plr(A I x) =
fl
belx(lx I y)}
yeY
{t - fl
(I - plx(x I y))}
yeA
qy(A I x) =
fl
plx(x I y)
yeA
If the closed-world assumption is accepted then the terms in the three relationships must be divided by I - c, where
c
=
f1 yeY
belx(lx I y)
272
P.
Sm<'ls
An interesting case is obtained when beJx(. I y) is vacuous for one y. Let it be so for y'. Then c = 0. The results do not depend on the closed- or open-world assumptions. (See also the diagnosis of a still unknown disease at the end of this section.) This generalized Bayes theorem satisfies the following requirements: (I
0
)
bely(. I x) is a function of {(belx(x I y), plx(x I y): y E Y };
(2°)
ply(y I x) is proportional to plx(x I y) for y E Y, Vx E !lx;
(3°)
if (i) plx(x I y) is a probability distribution function P(x I y) on X Vy E Y; (ii) x 1 and x 2 are two independent observations on X randomly selected according to P(x I y); (iii) belr(. I x;), i = I, 2, is the belief function describing the impact of the observations X; on the set Y, i.e. derived from P(x; I y); and (iv) belr(. I x 1, x 2) is the belief function describing the impact of both observations x 1 and x 2 considered simultaneously, i.e. derived from P(x" x 2 1 y) = P(x 1 I y) · P(x 2 I y); then bely(.l x1, x2) = belr(.l
xd EB bely(.l x2)
If there is a non-vacuous a priori belief function on Y, it must be EB combined with bely(. I x). If all belx(. I y) and the a priori belief function are probability functions then bely(A I x) = ply(A I x) and the generalized Bayes theorem reduces to the classical Bayes theorem. Let A E !lx and BE !ly. The vacuous-extension requirement is such that belx(A I B)= belw(A or 1B) = belw(B ::::J A), where (A or 1B) = (B ::::J A)= {(x;, y;): X; E A or y; E 1B) defined on Wand belw(B ::::J A) is the degree of belief of the material implication (B ::::J A). In the present situation at least, conditional belief and belief of conditionals (material implications) are identical concepts (Lewis, 1976). From belw, one can also compute belx(A I B) for A E !lx and BE !lr (i.e. when B is not an elementary proposition of Y). Suppose there is some a priori belief function bel 0 on Ythat has been combined with belw. Conditioning on B E !ly gives (Smets, 1978) belx(A
I B)
=
t.);om
(m 0 (y') ,.Dy' belx(A I y))}
I
pl 0 (B)
(6.1)
The importance of the generalized Bayes theorem is obvious for any inference when the available information is quantified by belief functions. For instance, for medical diagnosis problems Y is the set of mutually exclusive diagnosis, X is the set of symptoms, belx(x I y) describes our belief about which symptoms x are present in disease y, and belr(Y I x) describes our a posteriori belief about the diseases given the observed symptom x and an a priori belief on Y (Smets, 1978).
9
273
Belief Functions
Suppose that one introduces a diagnosis y' = the set of still unknown diseases. Given y', the belief on X is obviously vacuous: what can we know about the symptoms when the patient suffers from a still unknown disease? Then belr(Y' I x) is the belief that the patient presenting symptoms x belongs to the set y' of still unknown diseases. If the value is high, it means that one might be discovering a new disease. 7
DISCOUNTING EVIDENCE
Going back to the example of Section 3.2, suppose that we receive the information that the janitor was drunk. What can one say about evidence E 1 . Shafer proposes discounting this evidence by a factor c, the higher the value of c, the more it is discounted. Let rn be the basic probability assignment before discounting and m' that after discounting. Then rn'(A) = (1 -c)· rn(A) m'(l 0 ) = m(l 0 )
+c
L
\fA E Q, A -:f. 10 (7.1)
rn(A)
A"ln
This rule corresponds to the idea that each focal proposition sees its basic probability mass proportionally reduced, except 10 , which incorporate all missing masses. This rule can be justified as follows. Let the Y-space have two elements: y' = "witness tells the truth" and y" = "witness lies". The belief function bei(A) considered before discounting corresponds to belx(A I y'), and belx(A I y") is vacuous (a lie doesn't support anything). Let an a priori belief on Y be such that bel 0 (y') = 1 - c (the value of bel 0 (y") turns to be irrelevant). Compute belx(A I Y) from equation (6.1) and EB-combine the result with bel 0 . The result is them' function obtained after discounting, as described in equation (7.1 ). Suppose that, for our example, the judge discounts E 1 by a factor of 0.7, considering that the drunkeness of the janitor highly reduced the fiability of his initial testimony. Table 4 presents the impact of such discounting. Table 4 Masses and belief and plausibility functions derived from data of Table I when evidence E 1 has been discounted by a factor of 0.7 Q
    Ω                       m'_1    bel'_1   pl'_1    m_123    bel_123   pl_123
    0_Ω                     0.00    0.00     0.00     0.318    0.000     0.000
    Peter                   0.00    0.00     0.94     0.282    0.282     0.470
    Paul                    0.00    0.00     1.00     0.030    0.030     0.400
    Mary                    0.00    0.00     0.85     0.000    0.000     0.340
    Peter or Paul           0.15    0.15     1.00     0.030    0.342     0.682
    Peter or Mary           0.00    0.00     1.00     0.000    0.282     0.652
    Paul or Mary            0.06    0.06     1.00     0.182    0.212     0.400
    Peter, Paul or Mary     0.79    1.00     1.00     0.158    0.682     0.682
Note that discounting with a factor 1 results in a vacuous belief function, whereas a factor 0 leaves the belief function unchanged.
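For concreteness, here is a minimal sketch of the discounting rule (7.1). The masses used for E_1 (0.5 on "Peter or Paul", 0.2 on "Paul or Mary" and 0.3 on the tautology) are those implied by the m'_1 column of Table 4, and should be read as an assumption since Table 1 is not reproduced here.

    def discount(m, c, tautology):
        # eq. (7.1): scale every focal mass by (1 - c) and transfer the
        # missing mass onto the tautology 1_Omega
        m2 = {A: (1 - c) * v for A, v in m.items() if A != tautology}
        m2[tautology] = m.get(tautology, 0.0) + c * sum(
            v for A, v in m.items() if A != tautology)
        return m2

    TAUT = frozenset({"Peter", "Paul", "Mary"})
    m1 = {frozenset({"Peter", "Paul"}): 0.5,      # assumed masses of E_1
          frozenset({"Paul", "Mary"}): 0.2,
          TAUT: 0.3}
    for A, v in discount(m1, 0.7, TAUT).items():
        print(sorted(A), round(v, 2))             # 0.15, 0.06 and 0.79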
8 MEANING OF bel: A CANONICAL EXAMPLE
"When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it in numbers, your knowledge is of a meager and unsatisfactory kind." (Lord Kelvin, 1883). In order to understand what bel( A) is, the degree of belief of proposition A, we must have some canonical scale of propositions in which degrees of belief are well defined, and with which we can compare proposition A. Shafer and Tversky (1985) provide such a canonical scale. Let there be an unknown message X E 11 and a set of n translators t;: i = 1, 2, ... , n. Let p; be the probability that translator t; is selected. We can observe the result of the translation of the original message only by the translator that was selected. We ignore which translator has been used, we know only the probability with which each translator can be the one selected. Given the observed message Y, we can construct the set A; of messages from Q, the Boolean algebra derived from 11, which would have been translated into Y by translator t;. The belief bel( A) that the original message X is in the set A is obtained by adding the probabilities p; of the translators t; such that A; implies A. Such a bel is indeed a belief function. From it one can derive the functions m and pl. Furthermore, if the same message is translated twice by two independently selected translators then one can construct bel, based on the observed message Y 1 , bel 2 based on the observed message Y 2 , and bel 1 2 based on observed messages Y 1 and Y 2 considered simultaneously. Shafer and Tversky showed that bel 12 = bel 1 EB beh, the result obtained through the application of Dempster's rule of combination.
9 CONCLUSIONS
This chapter has presented a model to quantify someone's degree of belief that a proposition is true. A finite amount of belief is distributed among the propositions of a frame of discernment Λ. The non-negative mass m(A) quantifies the amount of belief specifically allocated to proposition A that cannot be allocated to any proposition B ≠ A that implies A. The degree of belief in a proposition A is the sum of the masses allocated to propositions B that imply A without implying ¬A. The degree of plausibility of a
proposition A is the sum of the masses allocated to propositions B that are compatible with A. One particular characteristic of this model is that it allows a positive mass to be allocated to the contradiction 0_Ω relative to Ω, the Boolean algebra derived from Λ. The meaning of such an allocation can be understood if one gives due consideration to the difference between the open- and closed-world assumptions. The frame of discernment Λ is an a priori construct on which one distributes one's belief. But one should not ignore the fact that this frame is usually nothing but an intellectual construct and that it may happen that none of the propositions of Ω is true. The impact of the closed-world assumption has been studied and a normalization coefficient derived; the result is Shafer's model. The advantage of the present approach is that it permits the evaluation of the degree of conflict among the propositions of Ω, and therefore a decision upon the appropriateness of the frame of discernment Λ and of the closed-world assumption. Axioms have been presented that justify Dempster's rule of combination, used to combine two belief functions derived from two distinct pieces of evidence. Distinctness is somehow defined in the entailment functionality axiom A1. The major axiom is the conditioning axiom A4, which postulates the impact of conditioning on the mass allocation. Given these axioms, the uniqueness of Dempster's rule of combination has been derived. The case of non-distinct evidence is considered in Smets (1986b). Bayes' theorem has been generalized within the framework of belief functions. Discounting is a particular case of such a generalization. Belief functions provide a model that seems promising for the development of Expert Systems that need to handle uncertainty. Their use for medical diagnosis was described in Smets (1978, 1979, 1981); more recent applications in AI can be found in Barnett (1981), Garvey et al. (1981), Gordon and Shortliffe (1984, 1985), Lowrance (1982) and Strat (1984).
BIBLIOGRAPHY

Bonissone, P. P. and Brown, A. L. (1985). Expanding the horizons of expert systems. Technical Information Series, Report 85CRD219, General Electric.

Bonissone, P. P. and Decker, K. S. (1985). Selecting uncertainty calculi and granularity: an experiment in trading-off precision and complexity. Technical Information Series, Report 85CRD171, General Electric.

Bonissone, P. P. (1986). Plausible reasoning: coping with uncertainty in expert systems. Technical Information Series, Report 86CRD053, General Electric.

Bonissone, P. P. (1986). Summarizing and propagating uncertain information with triangular norms. (Submitted for publication.) (The four papers by Bonissone and collaborators present an up-to-date survey of approximate reasoning techniques and their implementation in Expert Systems. Very clear. They should be read in the order given above.)

Chatalic, P. (1986). Raisonnement déductif en présence de connaissances imprécises et incertaines. Un système basé sur la théorie de Dempster-Shafer. Thèse, Université Paul Sabatier, Toulouse. (A thesis presenting the use of belief functions in an inference engine based on believed implications and facts.)

Dempster, A. P. (1967). Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Statist. 38, 325-339.

Dempster, A. P. (1968). A generalization of Bayesian inference. J. R. Statist. Soc. B30, 205-247. (Dempster's two papers introduced the idea of belief functions as developed by Shafer.)

Fine, T. (1973). Theories of Probability. Academic Press, New York. (A highly critical comparison of the foundations of the various theories that have been proposed to justify the use of probability functions.)

Gordon, J. and Shortliffe, E. H. (1985). A method for managing evidential reasoning in a hierarchical hypothesis space. Artificial Intelligence 26, 323-357. (A clear presentation of belief functions oriented toward the AI community, with discussion of its use and computational tractability in practical problems when a hierarchical structure can be imposed on the frame of discernment.)

Kyburg, H. E. (1987). Bayesian and non-Bayesian evidential updating. Artificial Intelligence 31, 271-294. (A critique of belief functions, showing that many of the advantages of belief functions could also be obtained with probability functions. But belief functions are equated to interval-valued probabilities.)

Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press. (THE book on belief functions. Highly readable, a must!)

Shafer, G. and Tversky, A. (1985). Languages and designs for probability judgment. Cognitive Sci. 9, 309ff. (A presentation of an experimental set-up that permits the derivation of degrees of belief that satisfy the model based on belief functions.)

Smets, P. (1978). Un modèle mathématico-statistique simulant le processus du diagnostic médical. Doctoral dissertation, Université Libre de Bruxelles, Bruxelles. (Available through University Microfilm International, 30-32 Mortimer Street, London W1N 7RA, Thesis 80-70,003.) (Thesis presenting, among others, the generalized Bayesian theorem based on belief functions.)

Smets, P. (1981a). Medical diagnosis: fuzzy sets and degree of belief. Fuzzy Sets and Systems 5, 259-266. (Which types of ignorance must be considered in a model for medical diagnosis, and which models cover them.)

Smets, P. (1981b). The degree of belief in a fuzzy event. Info. Sci. 25, 1-19. (Definition of the crisp degree of belief in a fuzzy event.)

Smets, P. (1986a). Belief functions and their combination. (Submitted for publication.) (An axiomatic justification of Dempster's rule of combination and a plea for unnormalized belief functions.)

Smets, P. (1986b). Combining non-distinct evidence. Proc. North American Fuzzy Information Processing (NAFIPS 1986), New Orleans. (What distinctness means for belief functions and how they should be combined when evidences are not distinct (correlated).)

Smets, P. (1986c). Bayes' theorem generalized for belief functions. Proc. European Conf. on Artificial Intelligence (ECAI-86), Vol. II, pp. 169-170. (A presentation of the generalized Bayes theorem.)

Smith, C. A. B. (1961). Consistency in statistical inference and decision. J. R. Statist. Soc. B23, 1-37. (One of the papers that stimulated the work of Dempster on upper and lower probabilities.)
Zadeh, L. (1984). A mathematical theory of evidence (book review). AI Magazine 5(3), 81-83. (A critique of the normalization factor in Dempster's rule of combination.)

Other references
Barnett, J. A. (1981). Computational methods for a mathematical theory of evidence. Proc. 7th Int. Joint Conf. on Artificial Intelligence (IJCAI-81), Vancouver, pp. 868-875.

Chateauneuf, A. and Jaffray, J. Y. (1986). Some characterizations of lower probabilities and other monotone capacities through the use of Möbius inversion. Proc. Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Paris, pp. 229-230.

Dubois, D. and Prade, H. (1985). Théorie des possibilités. Masson, Paris.

Garvey, T. D., Lowrance, J. D. and Fischler, M. A. (1981). An inference technique for integrating knowledge from disparate sources. Proc. 7th Int. Joint Conf. on Artificial Intelligence (IJCAI-81), Vancouver, pp. 319-325.

Good, I. J. (1950). Probability and the Weighing of Evidence. Griffin, London.

Gordon, J. and Shortliffe, E. H. (1984). The Dempster-Shafer theory of evidence. In Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project (ed. B. G. Buchanan and E. H. Shortliffe), pp. 272-292. Addison-Wesley, Reading, Mass.

Lewis, D. (1976). Probabilities of conditionals and conditional probabilities. Phil. Rev. 85, 297-315.

Lindley, D. V., Tversky, A. and Brown, R. V. (1979). On the reconciliation of probability assessments. J. R. Statist. Soc. A, 146-180.

Lowrance, J. D. (1982). Dependency-graph models of evidential support. COINS Technical Report 82-26.

Pearl, J. (1986a). On evidential reasoning in a hierarchy of hypotheses. Artificial Intelligence 28, 9-15.

Pearl, J. (1986b). Fusion, propagation and structuring in belief networks. Artificial Intelligence 29, 241-288.

Smets, P. (1979). Modèle quantitatif du diagnostic médical. Bull. Acad. R. Méd. Belg. 134, 330-343.

Strat, T. M. (1984). Continuous belief functions for evidential reasoning. Proc. 4th American Association for Artificial Intelligence Conf., Austin, Texas, pp. 308-313.

Zadeh, L. (1975). The concept of a linguistic variable and its application to approximate reasoning. Parts I, II and III. Info. Sci. 8, 199-249; 8, 301-357; 9, 43-80.
DISCUSSION

M. R. B. Clarke: This chapter gives a very clear explanation of the theory, with some good examples and interesting technical results. I shall deal first with what I see as the weak point of the theory, its semantic basis, and secondly with the question of normalization.

Semantic basis of belief functions. There is no doubt that the Bayesian formulations of classical probability theory that have been used for evidence propagation in expert systems such as Prospector do not deal easily with ignorance. To get started, one is
forced to give prior probabilities to events or hypotheses even if one has no prior knowledge at all. These arbitrary assignments are difficult to make consistently and are propagated into the conclusions. A further problem with the classical theory is that commitment of probability to a hypothesis implies commitment of all remaining probability to its negation. In both these respects the Shafer theory seems superior. However, whereas the Bayesian theory has a well-defined semantics in terms of rational betting behaviour (Lindley, 1985), the Shafer theory of evidence has no such semantic basis. It provides a self-consistent method of manipulating belief numbers, but says nothing about how the results are to be used or what they mean. If probabilities are used then, in theory at least, expected gain or loss can be computed and compared for each diagnosis or projected course of action, but it is not clear that belief functions can be used in an analogous way without making equivalent assumptions. Neither is it clear what Dempster's rule is actually doing in some situations. Section 8 of the chapter discusses the translator example of Shafer and Tversky. In the (meta-)language of logic one would say that this rather special problem in probability is being put forward as a model for the theory. But the intended applications of belief functions, such as medical diagnosis, seem to be unrelated to this model. It would be useful to see an example from medical diagnosis, showing how the computed belief numbers are finally interpreted and used. Gordon and Shortliffe (1985) have done this for hierarchically structured hypotheses and have also shown that for this special case the computations are tractable. The chapter strongly maintains that belief and plausibility are not upper and lower subjective probabilities. Yet in some cases, by reassigning masses, they can be directly interpreted as such (Kyburg, 1987). It is Dempster's rule that gives rise to departures from probability theory. The intervals resulting from Dempster-Shafer updating are subintervals of those resulting from application of conditioning to upper and lower probabilities. In some cases Dempster's rule gives intuitively reasonable results, while conditioning on upper and lower probabilities gives vacuously wide bounds. In other cases, however, Dempster's rule forces decisions whose expected utility is negative (Kyburg gives an example).

Table D1
              m_1     m_2     M_12    m_12      Mean
    Peter     0.99    0       0       0         0.495
    Paul      0.01    0.01    1.0     0.0001    0.01
    Mary      0       0.99    0       0         0.495
Normalization. The counter-example given in the chapter is not very convincing (Table D1). The unnormalized value m_12 seems as wrong as the normalized M_12. Suppose that 50 witnesses all give belief 0.01 to Paul. Would we want the combined belief to be 10^-100? One could argue that in this case the best answer is given by classical estimation theory. The widely varying beliefs in Peter and Mary, which average 0.495, could be shown by giving standard errors. The real mistake here is in assigning zero beliefs (or probabilities) to any event without thinking carefully about what happens as they tend to zero. Suppose that we have the situation shown in Table D2, and now consider carefully the way that δ²/ε behaves. The result given holds when δ²/ε is large. If both δ and ε are small in such a
way that δ²/ε tends to unity (δ = 0.01, ε = 0.0001 for example) then M_12 is 1/3 for all three suspects. Paradoxes like the one quoted in the chapter are always likely to arise from Dempster's rule when zero rather than very small belief is assigned to an elementary proposition.

Table D2
              m_1        m_2        M_12
    Peter     1-δ-ε      ε          (1-δ-ε)/(δ²/ε + 2[1-δ-ε])
    Paul      δ          δ          (δ²/ε)/(δ²/ε + 2[1-δ-ε])
    Mary      ε          1-δ-ε      (1-δ-ε)/(δ²/ε + 2[1-δ-ε])
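The sensitivity Clarke describes is easy to reproduce; the following sketch (our own names, not the chapter's) applies the normalized Dempster rule to the masses of Table D2.

    def dempster(m1, m2, normalize=True):
        # combine two basic probability assignments over frozensets
        m = {}
        for A, a in m1.items():
            for B, b in m2.items():
                C = A & B
                m[C] = m.get(C, 0.0) + a * b
        if normalize:
            k = m.pop(frozenset(), 0.0)        # mass of the contradiction
            m = {A: v / (1.0 - k) for A, v in m.items()}
        return m

    P, L, M = frozenset({"Peter"}), frozenset({"Paul"}), frozenset({"Mary"})

    def table_d2(delta, eps):
        m1 = {P: 1 - delta - eps, L: delta, M: eps}
        m2 = {P: eps, L: delta, M: 1 - delta - eps}
        return {min(A): round(v, 3) for A, v in dempster(m1, m2).items()}

    print(table_d2(0.01, 1e-8))   # delta**2/eps large: Paul gets almost 1
    print(table_d2(0.01, 1e-4))   # delta**2/eps = 1: about 1/3 each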
Gerhard Paass: Philippe Smets' chapter on belief functions gives a highly informative introduction to that topic. He shows that belief functions are a flexible tool to represent simultaneously information and ignorance about the facts of interest. I want to comment on some aspects of belief functions from the viewpoint of the probability formalism. In Section 1.3 Smets postulates a difference between upper/lower probabilities on the one hand and belief functions on the other hand. Upper and lower probabilities, which assume an underlying probability measure, are contrasted with belief functions, which deal with the intensity of credibility. This difference is not quite evident to me. In Section 8 Smets discusses a canonical example, where beliefs are calculated from certain probabilities associated with a special hypothetical experiment. Hence one can define beliefs in terms of these probabilities, and in principle the whole theory could be stated in terms of probabilities. This view is backed by Shafer (1986, p. 133), who states that "the advantage gained by the belief-function generalization of the Bayesian language is the ability to use certain kinds of incomplete probability models". In fact Kyburg (1987) proves that the representation of belief states by a mass function m(·) is a special case of the upper/lower probabilities approach. Belief functions are an elegant way to define upper and lower probabilities. In practice, however, the specification of bel(·) by an expert may be difficult, as a lot of restrictions have to be observed. For the specification in terms of m(·) the expert has to subtract the "masses" allocated to "smaller" elements of the Boolean algebra, which also seems somewhat unnatural. In addition, Grosof (1986, p. 269) shows by an example that specific sets of upper/lower probabilities cannot be specified in terms of a belief function. In this sense upper/lower distributions are more expressive than the Shafer/Dempster approach. In many cases, however, a representation of an inference net in terms of m(·) or bel(·) may have advantages. Therefore it may be desirable for a user to be able to switch between the different representations. The attractive point of belief functions is the simple way in which evidence involving incompletely specified probability measures may be combined by Dempster's rule. This rule, however, is only one special way to do this. According to Kyburg (1987), it yields in general narrower probability intervals than the evaluation of upper/lower distributions. As Shafer (1986) points out, it assumes that the basic probability assignments m_1(·) and m_2(·) to be combined are independent, i.e. are based on independent arguments or independent items of evidence. Because of this independence assumption (which is not obvious from the axioms of Section 5.1), the rule corresponds to the formation of the product probability measure from the
probability measures underlying m_1(·) and m_2(·). Dependences can be taken into account by forming joint measures different from the product measure. In the same way as in the case of upper/lower distributions, it must be decided whether the pieces of evidence are independent or a more complicated design has to be used. Let us consider the example given in Section 2.2 and assume that the first two witnesses come to identical conclusions and both state m(Peter) = m(Paul) = m(Mary) = 1/3. As combined evidence, Dempster's rule yields m_12(Peter) = m_12(Paul) = m_12(Mary) = 1/9 and m_12(0_Ω) = 2/3. Following the interpretation of Smets, two-thirds of the amount of belief is allocated to outside suspects, although both witnesses give identical statements in which outside suspects are not mentioned. This seems implausible, as identical statements from independent experts should lead to a reinforced joint statement and not indicate "contradictions" of sizeable degree. In this case, Shafer's original approach seems more sensible, as it yields m_12(Peter) = m_12(Paul) = m_12(Mary) = 1/3. This approach, however, is plagued by Zadeh's paradox. In summary, the utilization of Dempster's rule does not seem to be very attractive. In this chapter Smets does not address the question of the numerical feasibility of the belief-function approach. As belief functions are defined on the complete Boolean algebra of events with 2^k elements, it seems that the approach becomes unfeasible if the size k of the frame of discernment gets larger. It may, however, be possible to evaluate an inference net in terms of "marginal" belief functions, as can be done for the evaluation of upper/lower distributions by linear-programming methods.
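Paass's figures can be checked by reusing the dempster() sketch and the P, L, M singletons from the earlier code, with normalization switched off:

    thirds = {P: 1/3, L: 1/3, M: 1/3}
    m12 = dempster(thirds, thirds, normalize=False)
    print(m12[frozenset()])   # 2/3 on the contradiction; each singleton
    # gets 1/9, whereas normalizing would give 1/3 to each suspect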
Didier Dubois and Henri Prade: Smets' chapter proposes a view of belief functions as generalized probability measures with a subjectivist interpretation. Parallel to this measure-theoretic view, one can develop a set-theoretic view of belief functions, introduced by Nguyen (1978) (see also Goodman and Nguyen, 1985). Considering a belief function Bel on a Boolean algebra Ω of propositions, built from a finite set of (mutually exclusive) elementary propositions, it is possible to interpret Bel as a generalized logical proposition. Indeed, let 𝔅(Ω) be the set of belief functions on Ω, and let Bel_A be the belief function with basic probability assignment m_A such that m_A(A) = 1. It is then easy to define a canonical injection Ω → 𝔅(Ω) such that ∀A ∈ Ω, A ↦ Bel_A, so that Ω can be identified with a subset of 𝔅(Ω). To any belief function Bel we associate an equivalent generalized proposition 𝒜 (or set, if any proposition represents a subset of a frame of evidence), where 𝒜 = {(A, m(A)) | m(A) > 0}. A generalized proposition is thus a weighted set of propositions. As a consequence, one may think of extending the 16 binary connectives of logic from Ω to 𝔅(Ω). Let □ stand for any binary logical connective, and let Bel_1 and Bel_2 be two belief functions expressing distinct bodies of evidence, corresponding to generalized propositions 𝒜_1 and 𝒜_2; Bel_1 and Bel_2 can be combined via □ into Bel, which corresponds to 𝒜_1 □ 𝒜_2, and is defined by the basic probability assignment

m(A) = Σ_{X □ Y = A} m_1(X) m_2(Y)   (1)
This definition is a generalized version of Dempster's rule (Section 3.3). Actually, Dempster's rule (in its non-normalized version) corresponds to a conjunction of generalized propositions (e.g. intersection of random sets). But union, implication, etc. can be extended by (1), as well as negation. Several notions of logical entailment between generalized propositions (or belief functions) can be defined, with various levels of strength. They correspond to several concepts of generalized set-inclusion.
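A sketch of the connective-indexed combination (1), under the assumption that masses are stored on frozensets: intersection as the connective recovers the unnormalized Dempster rule, while union gives a disjunctive rule. The example masses are hypothetical.

    def combine(m1, m2, op):
        # eq. (1): m(A) = sum over X op Y = A of m1(X) * m2(Y)
        m = {}
        for X, a in m1.items():
            for Y, b in m2.items():
                A = op(X, Y)
                m[A] = m.get(A, 0.0) + a * b
        return m

    m1 = {frozenset({"a"}): 0.7, frozenset({"a", "b"}): 0.3}
    m2 = {frozenset({"b"}): 0.5, frozenset({"a", "b"}): 0.5}
    conj = combine(m1, m2, frozenset.intersection)  # unnormalized Dempster
    disj = combine(m1, m2, frozenset.union)         # disjunctive rule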
Note that if Bel is consonant (Section 4.5) then the corresponding generalized logical proposition 𝒜 can be interpreted as a fuzzy subset of elementary propositions. In particular, the generalized notions of logical entailment mentioned above encompass fuzzy-set inclusion. See Dubois and Prade (1986a) for more details on this logical view of belief functions; in particular, the algebraic structure of 𝔅(Ω) (which is no longer a Boolean algebra) is studied. Equation (1) is one way of generalizing Dempster's rule of combination. As proved by Smets, the product operation combining m_1 and m_2 in (1) is unique. We have recently and independently obtained this result with another set of axioms (Dubois and Prade, 1986b). Namely, given two sets {a_i | i = 1, ..., m} and {b_j | j = 1, ..., n} of positive numbers such that Σ_{i=1,...,m} a_i = 1 and Σ_{j=1,...,n} b_j = 1, construct the set {c_ij | i = 1, ..., m; j = 1, ..., n} of positive numbers such that ∀i, j, c_ij = a_i * b_j for some continuous operation *. It is proved that if ∀m, ∀n, Σ c_ij = 1 then * is the product. Regarding the question of unnormalized results from the application of Dempster's rule, the main problem is what to do with the m(0_Ω) obtained in (1) (with □ = ∧). Smets criticizes Shafer's technique for normalization in Section 5.2. We have also noticed the lack of numerical stability of the normalized version of Dempster's rule in the presence of conflicting information (Dubois and Prade, 1985a). Actually, the aggregation operation is not continuous in the vicinity of the total-conflict situation. This is one more piece of evidence against a systematic use of Dempster's rule in its normalized form. Yager (1987) has proposed an alternative normalization procedure, which reallocates the weight m(0_Ω) in (1) to the tautology 1_Ω, thus interpreting the amount of conflict as an extra amount of ignorance. This new rule is no longer discontinuous and can deal with the paradoxical case in Section 5.2. Another worthwhile addendum to Smets' contribution is the emergence of a generalized information theory "à la Shannon" in the setting of the theory of evidence. The cardinality |S| of a finite set S is a measure of the imprecision of the statement "v ∈ S", where v is a single-valued variable. A logical proposition A is a formal representation of "v ∈ S" provided that S is a model of A. An extended notion of cardinality can be defined for belief functions, yielding a measure of imprecision
"a
HI(Bel)
=
L m(S) log2 lSI s
This measure is an extension of one proposed by Higashi and Klir (1983) in the setting of fuzzy sets. It is the working tool for a belief-function elicitation technique called the principle of minimum specificity (Dubois and Prade, 1986c), which works from an incomplete specification. Note that HI(Bel) = 0 when Bel is a probability measure, so that HI does not generalize Shannon's entropy. Shannon's entropy measure has been generalized for belief functions by Yager (1983). It is a measure of the dispersion of focal elements:

HD(Bel) = - Σ_{A ∈ Ω} m(A) log_2 Pl(A)
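Both measures are immediate to compute from a mass function m on non-empty frozensets, as in this sketch:

    from math import log2

    def pl(m, A):
        return sum(v for B, v in m.items() if B & A)

    def HI(m):
        # imprecision: sum of m(S) * log2 |S| over focal elements
        return sum(v * log2(len(S)) for S, v in m.items())

    def HD(m):
        # dispersion (Yager): - sum of m(A) * log2 Pl(A)
        return -sum(v * log2(pl(m, A)) for A, v in m.items())

    m = {frozenset({"a"}): 0.5, frozenset({"b"}): 0.5}   # a probability
    print(HI(m), HD(m))   # 0.0 and 1.0: HI vanishes, HD is Shannon entropy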
HD reduces to Shannon's entropy for probability measures. Note that if Bel is consonant then HD(Bel) = 0 (no dispersion). More details can be found in the cited papers; see also Dubois and Prade (1985b, 1987). Our final comment concerns Shafer's interpretation of the functions Bel and Pl in terms of subjective, personal credibility and plausibility, which Smets shares. As mathematical objects, the functions Bel and Pl do not convey any a priori meaning, i.e. "subjectivist" or "objectivist". Now these functions were first derived by Dempster
(1967) in a statistical framework, by carrying a probability measure through a many-valued mapping. Dempster calls these functions "upper and lower probabilities" because the precise location of the original probability measure is lost owing to the many-valued mapping; however, owing to the axiomatics of Bel and Pl (as deriving from a basic probability assignment), they are only special cases of upper and lower probability measures. Interpreting the focal elements as imprecise observations, m(A) being the frequency of observing A, one can come up with a purely frequentist view of belief functions (Dubois and Prade, 1986d), which are then special kinds of ill-known probability measures. Shafer has reinterpreted Dempster's upper and lower probabilities in terms of personal plausibility and belief. However, he has just modified the terminology. In particular, there is no attempt to justify this subjectivist view in a formal setting, for example in terms of betting behaviour or comparative belief relations, as already exists for probability measures (Savage, 1972). Such attempts exist for possibility measures (see Dubois (1986) for comparative possibility relations, and Giles (1982) for a betting-behaviour interpretation that encompasses notions of upper and lower probabilities more general than belief functions). Hence, so far, the subjectivist view of belief functions is not supported (and the canonical experiments of Section 8 are not enough to do the job), although this way of modelling subjective uncertainty judgment looks very convenient in practice, much more convenient than the rigid framework of probability theory.
Reply: (1) The relation between belief functions and lower probabilities is really delicate (as seen in the three comments). The Bayesian credo is that a credal state can be described by a unique probability function that gives to any proposition a unique number that measures its degree of belief, and that these numbers obey the additivity rule (and the other properties) of probability functions. The credo for our transferable-beliefs model (which we think corresponds to Shafer's model) is that the degree of belief in any proposition can also be described by a unique number, but that these numbers obey the superadditivity property (and the other properties) of belief functions:

bel(A ∨ B) ≥ bel(A) + bel(B) - bel(A ∧ B)
The credo of the proponents of the upper and lower probabilities model is that there exists a family 𝒫 of probability functions that describes our credal state. The probability P(A) of any proposition A is between the upper P*(A) and lower P_*(A) probabilities, with

P_*(A) = min {P(A) : P ∈ 𝒫}
P*(A) = max {P(A) : P ∈ 𝒫}

Conditioning on B is obtained by the conditioning of each P in 𝒫; therefore

P_*(A | B) = min {P(A | B) : P ∈ 𝒫}
P*(A | B) = max {P(A | B) : P ∈ 𝒫}
These results are not those obtained by Dempster's rule of conditioning (see Section 4.7). Finally, Dempster's model for upper and lower probabilities is often interpreted as a special case of the former when the family 𝒫 of potential probability functions quantifying our degree of belief is restricted to the probability functions such that the
lower probabilities satisfy the inequalities of Section 4.1. But Dempster's rule of conditioning is not justified in this interpretation. The origin of Dempster's restricted family 𝒫 can better be explained by considering a probability function on a space X and a one-to-many mapping G from space X to space Y. If one defines (→ is the material implication)

P*(A) = P({x : G(x) ∧ A ≠ ∅})
P_*(A) = P({x : G(x) → A})
one gets Dempster's upper and lower probabilities. Even though Dempster's model and the transferable-beliefs model share syntactical properties, they are not identical. Dempster's model presupposes a probability function on a certain space X and a one-to-many mapping G, which is not the case for the transferable-beliefs model. That both models share the same mathematical properties is not an argument for them being the same concept. Remember that water flow and electricity can be described mathematically by the same differential equations, but water is not electricity. Dempster's model and the transferable-beliefs model stand in the same relation to each other as an urn and a subjective probability. A subjective probability is not an objective probability. That they are numerically identical is not an argument. It results from the so-called frequency principle (Hacking, 1965, p. 135), which states (with chance and belief for objective and subjective "probability")
=
p)
= p
that is, the belief in A is numerically equal to the chance of A. By choosing various A, one constructs a scale of belief, which can be used to assess the degree of belief of any further proposition (the concepts of urn and chance becoming unnecessary). To be acceptable, this principle also requires that belief and chance share the same mathematical properties (which Bayesians accept). In Dempster's model, the probability P(x) for x ∈ X induces on the space Y a mass m(x) = P(x) such that
P_*(A) = Σ_{G(x) → A} m(x)
Thus m(x) corresponds to the mass m postulated in the moving-masses model. The Shafer-Tversky translator example permits the construction of a scale for a degree of belief. It uses an analogue of the "frequency principle": the degree of belief given to message y in Y is numerically equal to the lower probability of y computed from the translator example. The mathematical properties of both models are the same. In order to assess the degree of belief in any other proposition (unrelated to any assumed translator), one uses that scale. Considering these remarks, one may wonder if the "Shafer-Dempster" qualification is not as unfortunate as the use of "probability" to describe both objective and subjective concepts.

(2) In his comments, Clarke complains about the absence of well-defined semantics for Shafer's theory. In fact, this complaint covers two problems solved in the probability domain by the urn model and exchangeable bets, or by axioms like those of Savage. An analogue of the urn model exists in Shafer's theory: the Shafer-Tversky translator provides it. Its operationality may be weak ... but exchangeable bets are not an efficient method to assess someone's belief either. The operationality of a
canonical experimental model is not required. The experimental model is described in order to give a meaning to the value p in statements like "the degree of belief in A is p". Bayesians will claim that p has the properties of a probability function. We claim that p has the properties of a belief function. What about decision? We claim that beliefs obey the transferable-beliefs model. When someone must take a decision, he must then construct a probability function derived from the belief function that describes his credal state. This probability function is then used to make decisions. One obvious (if not yet fully justified) way to build this probability function is

P(A) = Σ_{B ∈ Ω} m(B) |A ∧ B| / |B|
where |A| is the number of elementary propositions in A. It corresponds to a Generalized Insufficient Reason principle: a mass given to the disjunction of n elementary propositions is split equally among these n propositions. Bayesians might then argue: why bother with belief functions if their only observable translation is derived from a probability function; let us use a probability function from the start. But see the following counter-example. Peter, Paul and Mary are the three potential killers, and evidence 1 says there is the same support for the boys as for the girl, i.e.

m_1(Peter ∨ Paul) = m_1(Mary) = 0.5

The derived probability is the highly reasonable P_1:

P_1(Peter) = P_1(Paul) = 0.25,   P_1(Mary) = 0.50

Evidence 2 is "Peter cannot be the killer" (he was in America), i.e.

m_2(Paul ∨ Mary) = 1
P_2(Paul) = P_2(Mary) = 0.5

Combining the two pieces of evidence gives

m_12(Paul) = m_12(Mary) = 0.5
i.e.

P_12(Paul) = P_12(Mary) = 0.5

a highly reasonable solution. Let us use the probabilist approach with P_1 and the conditioning evidence that the killer is "Paul or Mary". We get

P'_12(Paul) = 1/3,   P'_12(Mary) = 2/3

with P'_12(·) = P_1(· | Paul or Mary). This solution is unsatisfactory. Just think about the results one would get if the two pieces of evidence had been considered in the reverse order! Criticism that we have changed the terms of reference is not adequate either ... except if one completely redefines the conditioning process. Therefore we provide here an example where the moving-masses approach is OK whereas the strict probabilist approach leads to paradoxical results.
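The counter-example can be replayed mechanically. The sketch below (with names of our choosing) conditions the belief function first and applies the Generalized Insufficient Reason principle last, which is the route the reply defends:

    def condition(m, E):
        # Dempster conditioning: transfer each mass m(B) to B & E, renormalize
        out = {}
        for B, v in m.items():
            C = B & E
            out[C] = out.get(C, 0.0) + v
        k = out.pop(frozenset(), 0.0)
        return {B: v / (1 - k) for B, v in out.items()}

    def betp(m, A):
        # Generalized Insufficient Reason: split m(B) equally over B
        return sum(v * len(A & B) / len(B) for B, v in m.items())

    Peter, Paul, Mary = (frozenset({n}) for n in ("Peter", "Paul", "Mary"))
    m1 = {Peter | Paul: 0.5, Mary: 0.5}
    print(betp(m1, Peter), betp(m1, Mary))    # 0.25 0.5  (this is P_1)
    m12 = condition(m1, Paul | Mary)
    print(betp(m12, Paul), betp(m12, Mary))   # 0.5 0.5
    # Conditioning P_1 itself on "Paul or Mary" gives 1/3 and 2/3 instead:
    # the two routes do not commute, which is the point of the example.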
(3) The second complaint of Clarke, the absence of an axiomatic characterization
like Savage's, is real. Such a characterization would be useful ... but unfortunately it is not yet available. Is this a criticism of the theory? I do not think so. It would give it an added bonus, but if this absence is taken as a weakness of the theory, one might well wonder how people have managed to use probability theory for so many years without a well-defined characterization. Belief-function theory is in its infancy, and it is to be hoped that such a characterization will be formulated. The fact that belief functions are defined by an infinite number of postulates (the inequalities given in Section 4.1 must be satisfied for all n) will surely lead to serious difficulties.

(4) A question was raised about what "distinct" means. Evidence corresponds to a restriction on some underlying space. Consider spaces X and Y. Given evidence 1 (subset x of X is true) and evidence 2 (subset y of Y is true), we build two belief functions bel_1 and bel_2 on some space Z. Consider the belief on Y induced by evidence 1 (x is true) and the belief on X induced by evidence 2 (y is true). If these two belief functions are vacuous, then the pieces of evidence are distinct. Therefore two pieces of evidence are distinct if each leaves one totally ignorant about the particular value the other will take (Smets, 1986b).

(5) The normalization problem is delicate. The idea proposed by Clarke is another way to solve it, but the limiting result when ε goes to 0 depends on the ratio δ²/ε, an unpleasant situation. I think the value of Zadeh's paradox lies in showing that a blind application of normalization is dangerous. We defend the idea that normalization should not be used except if one accepts explicitly the closed-world assumption (Section 2). In the example of Paass, the witnesses are really very ignorant about who is the killer, and the mass of 2/3 given to the contradiction (open-world assumption) is not necessarily unsavoury. Should we accept the closed-world assumption, the results of the normalization would be what Paass feels reasonable.

(6) Dubois and Prade present some other syntactical rules to combine pieces of evidence. Their relevance to the moving-masses model is not obvious, but merits further exploration. Their justification of Dempster's rule of combination is based on the assumption that c_ij is a function of a_i and b_j, a strong assumption that needs justification.
Additional references
Dubois, D. (1986). Belief structures, possibility theory and decomposable confidence measures on finite sets. Computers and Artificial Intelligence (Bratislava) 5, 403-416.

Dubois, D. and Prade, H. (1985a). Combination and propagation of uncertainty with belief functions: a reexamination. Proc. 9th Int. Joint Conf. on Artificial Intelligence (IJCAI-85), Los Angeles, pp. 111-113.

Dubois, D. and Prade, H. (1985b). A note on measures of specificity for fuzzy sets. Int. J. Gen. Syst. 10, 279-283.

Dubois, D. and Prade, H. (1986a). A set-theoretic view of belief functions. Logical operations and approximations by fuzzy sets. Int. J. Gen. Syst. 12, 193-226.

Dubois, D. and Prade, H. (1986b). On the unicity of Dempster's rule of combination. Int. J. Intelligent Syst. 1, 133-142.
Dubois, D. and Prade, H. (1986c). The principle of minimum specificity as a basis for evidential reasoning. Proc. Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Paris, pp. 40-43.

Dubois, D. and Prade, H. (1986d). Fuzzy sets and statistical data. Eur. J. Operational Res. 25, 345-356.

Dubois, D. and Prade, H. (1987). Properties of measures of information in evidence and possibility theories. Fuzzy Sets and Systems 24, 161-182.

Giles, R. (1982). Foundations for a theory of possibility. In Fuzzy Information and Decision Processes (ed. M. M. Gupta and E. Sanchez), pp. 183-195. North-Holland, Amsterdam.

Goodman, I. R. and Nguyen, H. T. (1985). Uncertainty Models for Knowledge-Based Systems. A Unified Approach to the Measurement of Uncertainty. North-Holland, Amsterdam.

Grosof, B. N. (1986). An inequality paradigm for probabilistic knowledge. In Uncertainty in Artificial Intelligence (ed. L. N. Kanal and J. F. Lemmer), pp. 259-275. North-Holland, Amsterdam.

Higashi, M. and Klir, G. J. (1983). Measures of uncertainty and information based on possibility distributions. Int. J. Gen. Syst. 9, 43-58.

Lindley, D. V. (1985). Making Decisions. Wiley, London.

Nguyen, H. T. (1978). On random sets and belief functions. J. Math. Anal. Applic. 65, 531-542.

Savage, L. J. (1972). The Foundations of Statistics. Dover, New York.

Shafer, G. (1986). Probability judgement in artificial intelligence. In Uncertainty in Artificial Intelligence (ed. L. N. Kanal and J. F. Lemmer), pp. 127-135. North-Holland, Amsterdam.

Yager, R. R. (1983). Entropy and specificity in a mathematical theory of evidence. Int. J. Gen. Syst. 9, 249-260.

Yager, R. R. (1987). On the Dempster-Shafer framework and new combination rules. Information Sciences 41, 93-137.
10 An Introduction to Possibilistic and Fuzzy Logics

DIDIER DUBOIS and HENRI PRADE
Laboratoire Langages et Systèmes Informatiques, Université Paul Sabatier, Toulouse, France
Abstract

This chapter discusses the notions of degree of truth for vague propositions and of degree of uncertainty in the presence of partial information. Possibility and fuzzy logics are then introduced for the treatment of uncertainty and vagueness respectively. Vague quantifiers are also considered. Some applications are mentioned.
INTRODUCTION
The expression "fuzzy logic" is used to refer to a variety of approaches proposing a logical treatment of imperfect knowledge usually referring explicitly to fuzzy-set theory. However, a distinction among these approaches can be made between those that deal primarily with vagueness and those whose primary concern is uncertainty. Issues of vagueness and uncertainty have become important with the emergence of advanced information systems equipped with some reasoning capabilities. Fuzzy and possibilistic logics have developed on their own for some 20 years now, and it has become worthwhile to compare this methodology with others that were independently suggested to address related problems of knowledge representation and processing. This chapter describes the basic features of a logic of vagueness and a logic of uncertainty, which can be related via the basic principles of fuzzy-set and possibility theory. Possibilistic logic is a logic of partial ignorance that contrasts with betterknown probabilistic logic systems. It is not possible to represent ignorance in terms of known probability values. Within probability, ignorance is often wrongly interpreted as randomness, where outcomes are equally probable. However, the state of knowledge where there is an equal lack of certainty about all events (including non-elementary ones) that are liable to occur cannot be expressed by a single probability measure. In contrast, possibility theory captures, in a very simple way, states of knowledge ranging from NON-STANDARD LOGICS FOR AUTOMATED REASONING ISBN O·I2-649520·3
complete information to total ignorance. Fuzzy logic, in contrast, is a logic of vague predicates, and escapes the laws of Boolean algebra. As such, it cannot be paralleled with probabilistic logic, because they address radically different, and not mutually exclusive, issues. For instance, one could think of assigning a grade of probability to a vague proposition. Moreover, fuzzy logic is very deviant because automated reasoning processes are based on meaning computation, and can no longer be based on the usual symbolic theorem-proving methodologies in a straightforward manner. The first section discusses the notion of truth, and what happens to it when the available information is incomplete and/or the propositions to be evaluated contain vague predicates. Emphasis is put on the question of truth-functionality. It is indicated that a logic of uncertainty is generally not truth-functional, while the logic of vagueness based on fuzzy sets, in the presence of complete information, is truth-functional. Section 2 introduces possibility theory and describes elementary patterns of reasoning with possibility-qualified propositions. The expressive power of this approach is tentatively compared with other numerical or non-numerical approaches, i.e. probability, evidence theory, modal and default logics. The third section reviews two approaches to the treatment of vagueness in logic: a "syntactic" approach, where intermediary grades of truth are attached to propositions, and a "semantic" approach, where the meaning of logical propositions is explicitly represented in the reasoning mechanisms. Basic patterns of reasoning are also described under the two approaches.
1 DEGREE OF TRUTH AND TRUTH-FUNCTIONALITY
Truth is generally understood as the conformity between a statement and the actual state of facts to which it supposedly refers. Here, however, a degree of truth is rather a measure of agreement between the representation of the meaning of a statement and the representation of what is actually known about reality. This is a practical view of truth, in an information-system perspective. A statement describes the properties of objects and their interrelationships, in the form of a symbolic expression. The meaning of a statement can be viewed as a constraint restricting the value of variables that are implicit in the statement. This view is supported by Zadeh (1981), who defines procedures for the computation of meaning. What is known of reality is supposedly stored in a database ℬ, in the form of statements. What can be said of the truth of a query statement S depends upon our state of knowledge (the information in ℬ), and derives from a matching procedure between the meaning of S and the contents of ℬ. According to the respective precision of S and the information in ℬ, the truth of S is asserted or refuted, but may also
be only partially known (pervaded with uncertainty), or may be a matter of degree (S is vague).

1.1 A semantic approach to the computation of truth (Dubois and Prade, 1985a)
It is assumed that S and ℬ refer to the same universe of discourse (domain) U. U is generally a multidimensional space. A statement w is translated into a constraint "X is M(w)" by means of a meaning computation procedure such as PRUF (Zadeh, 1978b). X is a vector of variables to which w refers, and M(w) is a relation linking the values of these variables. M(w) is a subset of U called the meaning of w. According to whether M(w) is a singleton, a subset or a fuzzy subset (Zadeh, 1965) of U, w is said to be precise, imprecise or vague. Vagueness explicitly refers to the existence of borderline elements of U to which neither the statement nor its contrary can be completely applied, while imprecision refers to the lack of specification of the meaning of w. The meaning M(ℬ) of the set of statements stored in ℬ is obtained by performing the join of the relations M(w) for all w in ℬ. M(ℬ) can be viewed as the set of possible states of the world.

Example Consider the statement w = "John is tall". It is represented by the statement "height[John] is TALL", where TALL is a vague predicate and height[John] is a variable taking its values on the set U of heights between 0.5 and 2.5 m. M(w) is characterized by the membership function of the fuzzy set of "tall" heights, i.e. to each height is allocated a degree of tallness between 0 (non-membership) and 1 (complete membership).
Note that the notion of precision is not absolute. It depends on the way the universe of discourse is defined. In the above example w is a vague statement because U = [0.5, 2.5]. But if U = {SHORT, MEDIUM, TALL} then w becomes precise. In the framework of classical logic, predicates are always precise or imprecise, but are not vague; they will be referred to as crisp predicates. In the following, a proposal initiated in Bellman and Zadeh (1977) is made to represent the truth of crisp or vague statements in the presence of precise, imprecise or vague information. The approach is illustrated using the example of a database ℬ containing an item of information about John's height, and of a statement S pertaining to his height.

(i) Crisp statement; precise information
Consider the situation shown in Fig. 1. M(ℬ) is then a singleton of U, say {u_0}. For instance, u_0 = 1.7 m; i.e. we know that John is 1.7 m tall. M(S) is an ordinary subset A of U. In the example, S = "John is more than 1.65 m tall" and M(S) = [1.65, 2.5].

Fig. 1 [figure: the crisp set A = M(S) = [1.65, 2.5] and the singleton M(ℬ) = {u_0} on the axis of sizes]

Let t(S | ℬ) be the degree of truth of statement S w.r.t. ℬ. The degree of truth is then

t(S | ℬ) = μ_A(u_0)   (1)
where μ_A is the characteristic function of A, the set of heights compatible with S. In the example, μ_A(u_0) = 1, i.e. S is true.
A= M(S)
o------~~~---=======~1.65 o b Size (b)
A= M(S)
0
1.65
0
b
Size
(c)
r----oI
I
I I I I I I
0
0
I I
A= M(S)
I I
I I
b 1.65
Size
Fig. 2 (a) S is necessarily true. (b) S is possibly true and possibly false. (c) S is impossible.
(ii) Crisp statement; imprecise information
Let us assume that S is a crisp statement (e.g. a standard wff) and that ℬ contains only crisp information, but M(ℬ) is not precise; i.e. in the example John's height is known only to lie in some subset of U. In Fig. 2, M(ℬ) = [a, b]. Several situations may occur.

(a) M(ℬ) ⊆ A; then S is surely true, i.e. t(S | ℬ) = 1. For instance, see Fig. 2(a): S is as in (i), but M(ℬ) = [1.68, 1.72].

(b) A ∩ M(ℬ) ≠ ∅ but A does not contain M(ℬ). S is possibly true or possibly false. This situation is closely related to the case when a wff S is consistent with the contents of ℬ taken as axioms, but cannot be inferred from them. See Fig. 2(b), where for instance John's height is known to be between 1.60 and 1.70, S being as in (i).

(c) A ∩ M(ℬ) = ∅; then S is surely false, i.e. t(S | ℬ) = 0. For instance, in Fig. 2(c) S is as in (i) and M(ℬ) = [1.50, 1.60]. Here M(ℬ) ⊆ Ā, where Ā, the complement of A, is the interpretation of "not S", denoted ¬S.
To account for these various situations, a set function Π can be introduced, defined by

Π(E) = 1 if E ∩ M(ℬ) ≠ ∅, and 0 otherwise   (2)
By convention, Π(S) is short for Π(M(S)). Π is called a possibility measure, in the sense of Zadeh (1978a), because Π(S) evaluates to what extent it is possible that S is true. The degree of truth is here extended and becomes for each S a possibility distribution, i.e. the characteristic function of a subset of {0, 1}, denoted μ_{t(S|ℬ)}, such that

μ_{t(S|ℬ)}(0) = Π(¬S),   μ_{t(S|ℬ)}(1) = Π(S)   (3)
The above three cases can be characterized in the following way:

(a) Π(S) = 1, Π(¬S) = 0, i.e. t(S | ℬ) = {1}, which is identified with t(S | ℬ) = 1;

(b) Π(S) = 1, Π(¬S) = 1, i.e. t(S | ℬ) = {0, 1};

(c) Π(S) = 0, Π(¬S) = 1, i.e. t(S | ℬ) = {0}, which is identified with t(S | ℬ) = 0.

Note that in the example considered here the case when Π(S) = Π(¬S) = 0 cannot occur. It corresponds to logical inconsistency, and appears only if M(ℬ) = ∅.
An equivalent description could have been made in terms of a set function N defined by

N(E) = 1 if M(ℬ) ⊆ E, and 0 otherwise   (4)

It is called a necessity (certainty) measure, and is related to the above possibility measure by the following identity:

N(S) = 1 - Π(¬S)   (5)

which just stresses that "S is necessarily true" means that ¬S is impossible.
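A tiny sketch of (2), (4) and (5) over a finite grid of heights; the grid and the interval bounds are assumptions made for illustration.

    U = set(range(150, 251))              # heights in cm (assumed grid)
    A = {u for u in U if u > 165}         # "John is more than 1.65 m tall"
    MB = set(range(160, 171))             # M(B): "between 1.60 and 1.70"

    def Pi(E):                            # eq. (2)
        return 1 if E & MB else 0

    def N(E):                             # eq. (4)
        return 1 if MB <= E else 0

    print(Pi(A), N(A))                    # 1 0: possibly, but not surely, true
    assert N(A) == 1 - Pi(U - A)          # eq. (5)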
Fig. 3 [figure: the fuzzy set A = M(S) for "tall" and the precise height u_0 = 1.75 on the axis of sizes]
(iii) Vague statement; precise information
M(ℬ) is again a singleton {u_0} of U. In the example u_0 = 1.75 m. M(S) is a fuzzy subset A of U, sketched in Fig. 3, which represents the sizes more or less compatible with the concept "tall" in the current context. The statement to be evaluated is S = "John is tall". Consistently with (1), its degree of truth is defined by
t(S | ℬ) = μ_A(u_0) ∈ [0, 1]   (6)
Intermediary degrees of truth thus appear when vague statements are considered. These degrees are strictly between zero and one as long as u_0 is a borderline case for the vague category A, being a prototype neither of the objects satisfying the predicate A nor of the objects satisfying the opposite predicate Ā.
Fig. 4 [figure: the possibility distribution π = μ_F describing John's height on the axis of sizes]
(iv) Crisp statement; fuzzy information
In contrast with the previous case, it is the available information that is vague; i.e. M(ℬ) is a fuzzy set F, describing what is known of John's height. F is pictured in Fig. 4. The membership function of F is here interpreted as a possibility distribution π = μ_F, because the values in the support of F (defined by {u ∈ U | μ_F(u) > 0}) are mutually exclusive candidates for John's height. In contrast, in (iii) the values in the support of A were simultaneously somewhat compatible with the query statement S; i.e. μ_A was not a possibility distribution. Here μ_F(u) = π(u) is the degree of possibility that John's height is exactly equal to u. Because the information is not precise, but the statement S is crisp as in (i) and (ii), the truth or falsity of S may not be known with certainty, as in case (ii). But, because the information is fuzzy, this certainty is graded. μ_{t(S|ℬ)} is a possibility distribution over {0, 1}, ranging on the interval [0, 1], defined by

μ_{t(S|ℬ)}(1) = sup {π(u) | u ∈ A} = Π(S)   (7)

μ_{t(S|ℬ)}(0) = sup {π(u) | u ∉ A} = Π(¬S)   (8)
with A = M(S), consistently with (2) and (3). Note that in this case, strictly speaking, we do not have intermediary degrees of truth. Π is a set function called a possibility measure (Zadeh, 1978a), and Π(A) is the degree of possibility that S is true when A = M(S). The degree of necessity that S is true is N(A) = 1 - Π(Ā); i.e. it is the degree of impossibility that S is false. Possibility measures are such that ∀A, B, Π(A ∪ B) = max (Π(A), Π(B)), and max (Π(A), Π(Ā)) = 1; i.e. at least one of the two numbers defined by (7) and (8) is equal to 1.
(v) Vague statement; fuzzy information (general case)
In this case both M(S) and M(ℬ) are represented by fuzzy sets. In the example, we want to know whether John is tall knowing that his height is about 1.70 m. See Fig. 5.

Fig. 5 [figure: the fuzzy sets A = M(S) and M(ℬ) on the axis of sizes]
This case combines cases (iii) and (iv), in the sense that there are intermediary degrees of truth that are not precisely known. μ_A takes values on the unit interval [0, 1], but because the value of John's height is ill-located in M(ℬ), t(S | ℬ) is itself a fuzzy set of [0, 1], which can be interpreted as a fuzzy truth-value (Bellman and Zadeh, 1977), whose membership function is defined by the extension principle (see e.g. Dubois and Prade, 1980a)

μ_{t(S|ℬ)}(v) = sup {π(u) | μ_A(u) = v}   (9)

μ_{t(S|ℬ)}(v) is the grade of possibility that the degree of truth of S given ℬ is v. The fuzzy truth value t(S | ℬ) can be approximated by means of two numbers Π(S) and N(S), which extend (7) and (8) to the case of a vague statement (Zadeh, 1978b; Dubois and Prade, 1985a), with A = M(S):
Π(S) = sup_u min (μ_A(u), π(u))   (10)

N(S) = 1 - Π(¬S) = inf_u max (μ_A(u), 1 - π(u))   (11)
Namely, given a truth-value v* with grade of possibility 1, it is easy to prove that N(S) ≤ v* ≤ Π(S) (Dubois et al., 1986). More specifically, when M(ℬ) = {u_0}, Π(S) = N(S) = μ_A(u_0). Moreover, Π(S) and N(S) can be obtained from t(S | ℬ) directly (Dubois and Prade, 1985c). Π(S) and N(S) can be viewed as the degrees of possibility and necessity that S is "true", if we interpret "true" by extending its definition from {0, 1} (i.e. μ_true(1) = 1, μ_true(0) = 0) to [0, 1] by letting μ_true(v) = v, ∀v ∈ [0, 1]. We then have in any case

Π(S) = sup_v min (μ_{t(S|ℬ)}(v), v)   (12)

N(S) = inf_v max (1 - μ_{t(S|ℬ)}(v), v)   (13)
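A discretized sketch of (9)-(11); the membership values chosen for "tall" and "about 1.70 m" are assumed shapes, not values read off the figures.

    U    = [1.60, 1.65, 1.70, 1.75, 1.80]
    mu_A = [0.00, 0.25, 0.50, 0.75, 1.00]     # "tall" (assumed)
    pi   = [0.00, 0.50, 1.00, 0.50, 0.00]     # "about 1.70 m" (assumed)

    def fuzzy_truth(mu_A, pi):
        # eq. (9): possibility that the degree of truth equals v
        t = {}
        for a, p in zip(mu_A, pi):
            t[a] = max(t.get(a, 0.0), p)
        return t

    Pi_S = max(min(a, p) for a, p in zip(mu_A, pi))       # eq. (10)
    N_S  = min(max(a, 1 - p) for a, p in zip(mu_A, pi))   # eq. (11)
    print(Pi_S, N_S)                          # 0.5 0.5, with N(S) <= Pi(S)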
Table 1 summarizes the above discussion. Note that in the case of statistical knowledge (the information in ℬ is represented by a histogram) instead of fuzzy knowledge, and of a crisp query, we recover probabilistic logic, where the probability that S is true is equal to 1 minus the probability that S is false.

Table 1

    Query    Knowledge    Truth                                 Logic
    Crisp    Precise      0 or 1                                Classical
    Crisp    Imprecise    0, 1 or {0, 1}                        Close to modal
    Crisp    Fuzzy        Possibility of 0, possibility of 1    "Possibilistic"
    Vague    Precise      Between 0 and 1                       Many-valued
    Vague    Fuzzy        Fuzzy truth-value                     Fuzzy

    (Arrows indicate the orders of increasing complexity)
1.2 Truth-functionality issues
In this section we examine the status of logics of uncertainty and vagueness with respect to the existence of truth-functional connectives. It is proved that logics of uncertainty, such as the one arising in case (iv) of the previous section, cannot be truth-functional, while this property can be preserved for the logic of vagueness based on fuzzy sets in the presence of complete information.

Let us assume that S is a crisp statement. S can only be true or false, eventually. However, the available information in ℬ may prevent us from concluding clearly about this matter. For instance, when ℬ contains vague statements and M(ℬ) is fuzzy, only degrees of possibility and necessity that S is true can be computed. One may think that, in order to model the lack of knowledge about the truth or falsity of S, an intermediary degree of truth in some many-valued logic could be used. One advantage would be that, using the truth-functionality property, it would be easy to assess the state of uncertainty about the truth of a compound statement in terms of the "degree of truth" of elementary statements. Let us denote by r(p) such a degree of truth of a classical proposition p. For simplicity, we assume that p belongs to a Boolean algebra, supposedly finite; i.e. we restrict ourselves to propositional calculus. r(p) is computed for instance out of some measure of uncertainty pertaining to its meaning M(p) = A, for instance

r(p) = ½[Π(A) + N(A)]
as implicitly suggested in Gaines (1976). In the following we make no assumption about the way τ(p) is actually elicited. The following result proves the uselessness of any truth-functional [0, 1]-valued logic with continuous connectives as a model of a logic of uncertainty on a Boolean algebra of standard propositions.

Proposition.  Let 𝒫 be a finite Boolean algebra of propositions and let τ be a truth-assignment function 𝒫 → [0, 1], supposedly truth-functional via continuous connectives. Then ∀p ∈ 𝒫, τ(p) ∈ {0, 1}. Moreover, τ is an interpretation in the sense of propositional calculus, i.e. τ(p) = 1 ⇔ τ(¬p) = 0.
Proof.  Truth-functionality implies that there exist a function f: [0, 1] → [0, 1] and a two-place operation * on [0, 1] such that

∀p, τ(¬p) = f(τ(p)),  with f(1) = 0 and f(0) = 1
∀p, q, τ(p ∨ q) = τ(p) * τ(q),  with 1 * 1 = 1 = 0 * 1 = 1 * 0 and 0 * 0 = 0
Using results in Dubois and Prade (1982a), f is a continuous order-reversing involution on [0, 1] (because ¬¬p = p), and * is a continuous monotone semigroup operation on [0, 1] called a triangular conorm (Schweizer and Sklar, 1963). Letting p = q leads to τ(p) = τ(p) * τ(p), i.e. * is idempotent. The only idempotent conorm is "max". Hence τ is a possibility measure on E(𝒫), the set of atoms of 𝒫. The truth-functionality of negation implies, moreover, τ(p) = f(τ(¬p)), while in possibility theory max(τ(p), τ(¬p)) = 1. Hence ∀p, max(τ(p), f(τ(p))) = 1, and the result follows. QED

Thus assuming truth-functionality leads back to the case where propositions are all known as being either true or false, which, using the semantic view of truth-values developed in the preceding section, implies that the available information in the database ℬ is precise. As a consequence, logics of uncertainty cannot be truth-functional. This result is a reminder of the well-known fact in mathematics that a non-trivial Boolean algebra that is linearly ordered has only two elements.

When vague propositions are allowed, the result is no longer valid. Indeed, algebras of vague propositions (whose interpretations are fuzzy sets) are no longer Boolean. The properties of a family of fuzzy subsets of a given universe depend upon the choice of set-theoretic operations. Whatever this choice may be, the necessity of relaxing the Boolean structure is proved in Dubois and Prade (1980b). The most popular choice of operations is (Zadeh, 1965)
union            μ_{A ∪ B} = max(μ_A, μ_B)    (14)
intersection     μ_{A ∩ B} = min(μ_A, μ_B)    (15)
complementation  μ_Ā = 1 − μ_A                (16)
Using these connectives, all the properties of a Boolean algebra are preserved except A ∪ Ā = U and A ∩ Ā = ∅. In other words, we get a complete distributive lattice with a pseudocomplementation. The above choice of operations is the unique one that keeps this structure, except for the complementation, which can be more general. In that context it was proved by Ponasse (1978) that fuzzy-set theory provides a proper representation of Łukasiewicz many-valued algebras; it is a counterpart of the famous Stone representation theorem for Boolean algebras. Such a work validates the interpretation of vague predicates in terms of fuzzy sets.
Although in the following fuzzy set-theoretic operations are always defined by (14)-(16), other connectives exist. Axiomatic settings for fuzzy-set-theoretic operations are reviewed in Dubois and Prade (1985b), based on extensive use of results on functional equations. However, the algebraic structures obtained when departing from (14)-(16) are poorer.

Truth-functionality is recovered when evaluating the grade of truth of a fuzzy predicate in the presence of precise information (case (iii) in Section 1.1). Indeed, we get, as a consequence of (14)-(16) and (6), the minimum, the maximum and the complement to 1 as models for conjunction, disjunction and negation respectively. Hence many-valued logics look useful for accommodating a logic of vagueness, when uncertainty is ruled out. An intermediate degree of truth for a vague proposition p is then interpreted by the fact that p does not perfectly match precisely described facts, which clearly differs from the case where truth is ignored because the actual facts are ill-known (i.e. uncertainty), although p is a perfectly clear-cut statement. In that respect, fuzzy-set theory offers an interpretive framework for many-valued logics (Rescher, 1969).

As suggested earlier, when both vagueness of statements and uncertainty about actual facts are present, the grade of truth can be defined as a fuzzy number, interpreted as a possibility distribution over truth-values. This representation nicely combines the existence of intermediate truth-values and the lack of knowledge about which is the actual one. As proved by Dubois and Prade (1985c), truth-functionality is also generally lost in this case; for instance, t(S or S′ | ℬ) cannot be expressed in terms of t(S | ℬ) and t(S′ | ℬ), except in the following special case of decomposability: M(ℬ) is a fuzzy Cartesian product F × G on U × V with F defined on U, G on V, μ_{F×G} = min(μ_F, μ_G), M(S) is a fuzzy set of U and M(S′) is a fuzzy set of V; we then have
t(S or S′ | ℬ) = max(t(S | ℬ′), t(S′ | ℬ″))    (17)

where ℬ′ is the part of ℬ such that M(ℬ′) = F, ℬ″ is the part such that M(ℬ″) = G, and max is here the maximum operation extended to fuzzy numbers (Dubois and Prade, 1980a). Under the same assumptions, we have

t(S and S′ | ℬ) = min(t(S | ℬ′), t(S′ | ℬ″))    (18)

where min is similarly extended.
When measures of possibility are used, note that we always have Π(A ∪ B) = max(Π(A), Π(B)), but generally Π(A ∩ B) < min(Π(A), Π(B)) when A and B are fuzzy or crisp sets. As a consequence, generally N(A ∪ B) > max(N(A), N(B)). Thus the pair (N(A), Π(A)) approximating the fuzzy truth-value is globally not truth-functional either.
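As a small illustration (our own assumed grades, not the chapter's), the following Python fragment applies the connectives (14)-(16) on a three-element universe, showing both the failure of the excluded-middle and contradiction laws and the strict inequality Π(A ∩ B) < min(Π(A), Π(B)).

```python
# Illustrative check (assumed grades, not from the chapter): Zadeh's
# connectives (14)-(16) on a three-element universe.

U = ["u1", "u2", "u3"]
mu_A = {"u1": 0.3, "u2": 0.8, "u3": 1.0}        # a fuzzy subset A of U
mu_B = {"u1": 0.9, "u2": 0.2, "u3": 0.7}        # a fuzzy subset B of U
pi   = {"u1": 1.0, "u2": 0.6, "u3": 0.2}        # a possibility distribution

# Excluded middle and contradiction fail for the max/min/1- connectives:
print({u: max(mu_A[u], 1 - mu_A[u]) for u in U})   # A u comp(A): not all 1
print({u: min(mu_A[u], 1 - mu_A[u]) for u in U})   # A n comp(A): not all 0

def Pi(mu):   # possibility of a fuzzy event, as in (10)
    return max(min(mu[u], pi[u]) for u in U)

mu_AB = {u: min(mu_A[u], mu_B[u]) for u in U}      # intersection (15)
print(Pi(mu_AB), min(Pi(mu_A), Pi(mu_B)))          # 0.3 < 0.6: not truth-functional
```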
2  POSSIBILITY LOGIC AND OTHER LOGICS OF UNCERTAINTY
In this section we consider Boolean algebras of crisp propositions. Let g(p) be the grade of uncertainty about the truth of proposition p (we no longer call it a degree of truth!). g can be chosen among monotonic set-functions, i.e. those that respect logical entailment: if p → q = ⊤ (the tautology) then g(p) ≤ g(q). Practically, the nature of g will be dictated by the nature of the available information in the database ℬ. For instance, when ℬ contains vague predicates (case (iv) of Section 1.1), modelled by possibility distributions, g is naturally a possibility measure. Possibilistic logic is a logic of uncertainty based on the use of possibility measures. The idea of using possibility theory as a basis for uncertain deductive reasoning was first proposed by Prade (1983) and then systematically developed by the authors (Dubois and Prade, 1984c, 1985a, 1986a, 1987a).
2.1  Basic axioms and interpretations of possibility measures
In the case of possibilistic logic, each axiom p_i is assigned a grade of possibility Π(p_i) and a grade of necessity N(p_i) = 1 − Π(¬p_i) that p_i can be taken as true. These numbers can be computed by comparing the meaning of p_i with a description of some actual state of facts, as argued in Section 1. The laws of possibility theory state that

Π(⊤) = 1,    Π(⊥) = 0

and

∀p, q,  Π(p ∨ q) = max(Π(p), Π(q))    (19)

where ⊤ and ⊥ stand respectively for the propositions true and false in any interpretation. Thus, as long as classical logical propositions are used, the relation max(Π(p), 1 − N(p)) = 1 must be satisfied. More specifically,

N(p) = 1 entails that p is true;
Π(p) = 0 entails that p is false;
N(p) = 0, Π(p) = 1 means total ignorance about the truth or falsity of p.
Here the notions of possibility and necessity are given a logical interpretation related to the lack of precision of the available information. Other interpretations of the same mathematical model have been put forward. In the nineteen-fifties, in the framework of decision theory, Shackle, an English economist, advocated a non-probabilistic model of subjective uncertainty (see Shackle, 1961); in this work the notion of epistemic possibility, expressed
in terms of degrees of potential surprise, was introduced. This proposal is very close to the above model. Basically, Shackle claims that human decisions are taken on the basis of available possibilities rather than probabilities. Zadeh (1978a) proposes a physical view of possibility in terms of ease of attainment, or feasibility (e.g. the possibility of squeezing a given number of tennis balls into a box). Frequentist views of possibility measures are suggested by Dubois and Prade (1986b); in that perspective, a possibility measure is a special case of a random set (Goodman and Nguyen, 1985). Possibility measures can also be considered as a limiting case of decomposable set-functions g, i.e. such that g(A ∪ B) = g(A) * g(B) for some operation * when A ∩ B = ∅; such an axiomatic approach to "subjective distorted probabilities" was initiated by Dubois and Prade (1982a). A view of possibility in the spirit of measurement theory (comparative possibility) is proposed by Dubois (1987). Finally, Giles (1982) develops an interpretation of possibility measures in terms of betting behaviour in the spirit of decision theory.
2.2  Uncertain deductive reasoning with possibility degrees
A paradigm of deductive reasoning under uncertainty can be described as follows. Given a set of axioms p_1, p_2, ..., p_n consisting of wffs in, say, first-order logic on finite domains, attach to each p_i a number g(p_i) expressing a grade of confidence regarding the truth of p_i, with the convention that g(⊤) = 1 for a tautology and g(⊥) = 0 for a contradiction. g is a function from the Boolean algebra 𝒫 generated from {p_1, ..., p_n} to [0, 1], as defined in the introduction to this section. The problem of deductive inference under uncertainty comes down to computing g(p) (and g(¬p)) for any proposition p of interest, knowing g(p_1), ..., g(p_n). This computation must take advantage of the properties of the uncertainty measure g, here a possibility or a necessity measure. Basic patterns of inference of classical logic have been extended to possibilistic logic, namely:

modus ponens (Prade, 1983; Dubois and Prade, 1984c)

Π(q) ≥ N(q) ≥ min(N(p), N(p → q))    (20)

modus tollens (ibid.)

N(p) ≤ Π(p) ≤ max(Π(q), 1 − N(p → q))    (21)
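A minimal sketch of how the bounds (20) and (21) propagate; the numeric grades below are illustrative assumptions.

```python
# Sketch of the possibilistic modus ponens (20) and modus tollens (21);
# the numeric grades are illustrative assumptions.

def modus_ponens(N_p, N_rule):
    """Lower bound on N(q) from N(p) and N(p -> q), formula (20)."""
    return min(N_p, N_rule)

def modus_tollens(Pi_q, N_rule):
    """Upper bound on Pi(p) from Pi(q) and N(p -> q), formula (21)."""
    return max(Pi_q, 1 - N_rule)

print(modus_ponens(N_p=0.8, N_rule=0.9))    # q is at least 0.8-certain
print(modus_tollens(Pi_q=0.1, N_rule=0.9))  # p is at most 0.1-possible
```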
More generally, from knowledge of N(p), N(¬p) and the grades of necessity of all the ways of relating p and q by means of implication, the inference process
can be put into matrix form (Dubois and Prade, 1986a):

[ N(q)  ]     [ n_11  n_10 ]   [ N(p)  ]
[ N(¬q) ]  ≥  [ n_01  n_00 ] ∘ [ N(¬p) ]    (22)

where the matrix product ∘ is a sup-min composition and n_ij is a lower bound on N(p^j → q^i), with p^1 = p, p^0 = ¬p (and similarly for q). Proper behaviour of the inference rule requires that at least one diagonal of the matrix contains zeros (see Dubois and Prade, 1986a).

Along the same lines of thought, the resolution principle has been extended in this context (Dubois and Prade, 1987a) for ground clauses, as well as first-order ones. For ground clauses it can be proved that

N(q ∨ r) ≥ min(N(p ∨ q), N(¬p ∨ r))    (23)
which gives back the resolution principle when the right-hand side of the inequality is 1. The refutation method, as a proof methodology, can be extended to uncertain clauses. Here, to an uncertain clause is attached a degree of uncertainty interpreted as a positive lower bound on its grade of necessity. It was proved that the grade of necessity attached to the empty clause is a lower bound on the grade of necessity of the proposition to be proved. This lower bound is obtained by applying the (extended) resolution principle to the set of axioms equipped with their uncertainty levels, together with the negation of the proposition to be proved, taken with necessity 1. Moreover, a set of uncertain ground clauses is said to be inconsistent when the allocation of lower bounds on grades of necessity violates the axioms of possibility theory. (N(p) ≥ α, N(¬p) ≥ β and min(α, β) > 0 is an example of such a violation.) It can be proved that this notion of inconsistency is equivalent to the classical inconsistency of the set of clauses where the lower bounds on the grades of necessity are removed. Based on these results, it is possible to envisage automated-reasoning techniques in the style of theorem-proving. But there is a clear problem of strategy in order to maximize the lower bound attached to the empty clause (Dubois et al., 1987a).
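As an illustration of the weighted resolution rule (23) and of the refutation scheme just described, here is a small Python sketch. The clause encoding and the example knowledge base are our own assumptions, and no search strategy is implemented: the naive loop simply saturates the clause set.

```python
# Toy refutation with necessity-weighted ground clauses, after formula (23).
# The clause encoding and the example knowledge base are our own assumptions.

def neg(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Yield resolvents weighted by min of the parents' bounds, as in (23)."""
    (lits1, n1), (lits2, n2) = c1, c2
    for l in lits1:
        if neg(l) in lits2:
            yield (frozenset(lits1 - {l}) | frozenset(lits2 - {neg(l)}),
                   min(n1, n2))

def refute(clauses):
    """Best necessity lower bound derived for the empty clause."""
    best, agenda, seen = 0.0, list(clauses), set(clauses)
    while agenda:
        c = agenda.pop()
        for d in list(seen):
            for r in resolve(c, d):
                if not r[0]:
                    best = max(best, r[1])
                elif r not in seen:
                    seen.add(r)
                    agenda.append(r)
    return best

kb = [(frozenset({"~p", "q"}), 0.8),   # N(p -> q) >= 0.8, i.e. N(~p v q) >= 0.8
      (frozenset({"p"}), 0.6)]         # N(p) >= 0.6
# Refutation: add the negated goal q with necessity 1, derive the empty clause.
print(refute(kb + [(frozenset({"~q"}), 1.0)]))   # 0.6 = min(0.8, 0.6)
```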
2.3  Relationship with other logics of uncertainty
It is clear that other types of measures of uncertainty can be used instead of possibility measures, for instance probability measures and belief functions; but non-numerical approaches such as modal logic or non-monotonic logic can also account for uncertainty. The links between possibility measures and these other proposals are briefly discussed in this section.
2.3.1  Probabilistic logic

Probabilistic logic is considered for automated-reasoning purposes by several
authors (see Chapter 8). One of the problems is to give some interpretation to grades of probability; this is achieved by interpreting logical formulae as subsets of elementary events referred to as sets of possible worlds, or even by directly relating the probabilities to statistical experiments. In other words, these approaches define some way of capturing the meaning of logical propositions, in the same spirit as is done here, although with different terminologies and assumptions. Interestingly enough, the automated-reasoning techniques proposed by most probability theorists strongly depart from the classical theorem-proving methodology. Namely, the meaning of propositions is explicitly used in the reasoning procedure, which then comes down to a constrained-optimization problem. However, it can be proved (see e.g. Suppes, 1966; Dubois and Prade, 1987a) that basic patterns of inference can be extended to probabilistic logic; for example the resolution principle becomes

Prob(q ∨ r) ≥ max(0, Prob(p ∨ q) + Prob(¬p ∨ r) − 1)    (24)
Note that the lower bound is smaller than with necessity degrees. Quinlan's INFERNO system, mentioned in Chapter 8, is closer to the spirit of symbolic reasoning, but it lacks completeness, since it does not always compute the best bounds on probability values. Possibilistic reasoning techniques turn out to be far simpler, and their completeness can be conjectured. Moreover, the use of the additive law in probabilistic logic may lead to an increase of errors from input data to conclusions, while errors remain constant in the possibilistic setting. Finally, the normalization rule for probability measures (Σ_{i=1,...,n} Prob(u_i) = 1, where u_i is an elementary event) is more difficult to satisfy than the normalization rule for possibility measures (max_{i=1,...,n} Π(u_i) = 1). Namely, inconsistent uncertainty assignments may be more frequent in the probabilistic setting than in the possibilistic one. But the major difference between possibilistic and probabilistic logics is that in the latter there is no absolute convention for modelling ignorance about the truth of a proposition in terms of a single probability value. One can only do it in terms of upper and lower probabilities: P*(p) = 1 ≥ Prob(p) ≥ P_*(p) = 0; but this is exactly the convention that is adopted in the possibilistic setting.
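The difference between the bounds (23) and (24) is easy to see numerically; the grades below are assumptions.

```python
# Comparing the resolution bounds (23) and (24) on the same grades;
# the numbers are illustrative assumptions.

def possibilistic_bound(n1, n2):     # formula (23), necessity degrees
    return min(n1, n2)

def probabilistic_bound(p1, p2):     # formula (24), probabilities
    return max(0.0, p1 + p2 - 1.0)

for a, b in [(0.9, 0.8), (0.7, 0.6), (0.5, 0.5)]:
    print(a, b, possibilistic_bound(a, b), probabilistic_bound(a, b))
# 0.7 and 0.6 give 0.6 possibilistically, but only about 0.3 probabilistically
```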
2.3.2  Belief functions
Both settings can be reconciled within the theory of evidence; see Chapter 9. Axioms p_1, ..., p_n are given grades of credibility (belief, support) Cr(p_i) and grades of plausibility Pl(p_i) ≥ Cr(p_i), which implicitly define a basic probability assignment m: 𝒫 → [0, 1] such that m(⊥) = 0 and

Σ_{p ∈ 𝒫} m(p) = 1    (25)
Namely, denoting by I(p) the set of implicants of p (I(p) = {q | q → p = ⊤}), the functions Cr and Pl are defined by

Cr(p_i) = Σ_{q ∈ I(p_i)} m(q)    (26)

Pl(p_i) = 1 − Σ_{q ∈ I(¬p_i)} m(q)    (27)

Note that (26) and (27) do not generally characterize a unique basic probability assignment m from knowledge of {(Cr(p_i), Pl(p_i)), i = 1, ..., n}. The good point about this approach is that the evidence supporting p can be only loosely related to the evidence supporting ¬p, in contrast with the probabilistic setting, where Prob(p) + Prob(¬p) = 1. Intervals [Cr(p), Pl(p)] can be viewed as constraining ill-known probability values. In particular, if m(p) = 0 for any non-atomic p ∈ 𝒫, then only elementary propositions are weighted by m, and Cr = Pl are probability measures. In contrast, if the set of focal propositions ℱ = {q | m(q) > 0} can be ordered as q_1, ..., q_m such that ∀i = 1, ..., m − 1, q_i ∈ I(q_{i+1}), then the function Pl is a possibility measure, and conversely. In this case Cr is also a necessity measure (see e.g. Dubois and Prade, 1982b).
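A small sketch of (25)-(27), representing propositions extensionally as subsets of a set of atoms; the mass assignment is an illustrative assumption with nested (consonant) focal elements, so that Pl behaves as a possibility measure.

```python
# Sketch of (25)-(27) with propositions represented extensionally as subsets
# of a small set of atoms; the basic probability assignment m is an
# illustrative assumption with nested (consonant) focal elements.

atoms = frozenset({"a", "b", "c"})

m = {frozenset({"a"}): 0.5,                 # masses sum to 1, as in (25)
     frozenset({"a", "b"}): 0.3,
     frozenset({"a", "b", "c"}): 0.2}

def Cr(p):   # (26): total mass of the implicants of p (the subsets of p)
    return sum(w for q, w in m.items() if q <= p)

def Pl(p):   # (27): 1 minus the mass committed to the implicants of not-p
    return 1.0 - sum(w for q, w in m.items() if q <= atoms - p)

p = frozenset({"a", "c"})
print(Cr(p), Pl(p))                         # 0.5 1.0

# With nested focal elements, Pl is max-decomposable over unions,
# i.e. a possibility measure (and Cr a necessity measure):
x, y = frozenset({"a"}), frozenset({"c"})
print(Pl(x | y), max(Pl(x), Pl(y)))         # both equal 1.0
```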
2.3.3  Modal logic
It is interesting to discuss the links between possibility logic and modal logics, which provide a syntactic modelling of the concepts of possibility and necessity, usually referring to possible-worlds semantics (see Appendix B of the Introduction). The main differences between the two approaches seem to be as follows.

(i)  In modal logic, possibility and necessity are all-or-nothing concepts. As a consequence they can be introduced as special symbols in the language: □p reads "p is necessary", while ◊p reads "p is possible". In contrast, in possibility theory, possibility is a graded notion, as is necessity; whence the use of numbers.

(ii)  Modal logics propose numerous axiomatic settings, while the axioms of possibility theory are well-defined and unique. Along this line, it is relevant to define qualitative counterparts of the possibility-theory axioms, in the style of modal logic, restricting ourselves to the case where Π(p) ∈ {0, 1}. One way of doing this is to use the following translation rule:

⊢ □p  translates into  N(p) = 1
⊢ ◊p  translates into  Π(p) = 1
Clearly, the classical identity ¬◊p = □¬p translates into 1 − Π(p) = N(¬p), which is a basic relationship in possibility theory. Moreover, a numerical translation of Lewis' implication □(p → q) is clearly N(p → q) = 1, which implies that p → q is true in possibility logic. The basic axiom of possibility logic can be expressed as

⊢ ◊(p ∨ q) ↔ (◊p ∨ ◊q)    (28)

This is one of the basic axioms of the modal-logic system T according to von Wright (see Hughes and Cresswell, 1968). In addition, possibility theory recovers the square of Aristotelian modalities, as does the S5 system. It would be interesting to relate possibility theory to some existing formal systems of modal logic.
2.3.4  Default reasoning using logics of uncertainty
Probabilistic logic and possibility logic have both been suggested as possible approaches to default reasoning (Rich, 1983; Farreny and Prade, 1986). The idea is to interpret the weight bearing on an "if-then" rule as a measure of the extent to which the rule has no exception. The rule then models an imperfect "is-a" link in a semantic network; see Chapter 7. This interpretation of logics of uncertainty faces several problems.

(i)  It is not clear that in default logic the grade of uncertainty must be attached to a logical implication p → q. The use of conditioning instead of implication may appear more natural for modelling imperfect "is-a" relations. For instance, Prob(q | p) expresses the proportion of q's among those that are p's. See Zadeh (1983) for a treatment of default rules in terms of fuzzy proportions. The problem raised here is the difference between what can be called a "conjecture" (i.e. a universal assertion that is true or false, but cannot yet be proved nor refuted) and what Zadeh (1985) calls a "disposition" (an assertion that generally holds, but sometimes does not).

(ii)  Default rules do not always underlie a statistical interpretation. In particular, typicality (Chapter 7) seems to be of a different nature. In that case a possibilistic treatment of default rules, as done by Farreny and Prade (1986), may be more satisfactory. A default rule is then modelled by knowledge of the quantity Π(q | p), defined from the knowledge of a possibility measure by the relation

Π(p ∧ q) = min(Π(p), Π(q | p))    (29)
However, as indicated
by Dubois and Prade (1986a), this notion of conditioning is very close to logical implication, since Π(q | p) = 1 − N(p → ¬q) or 1.
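In min-based conditioning, Π(q | p) is usually taken as the greatest solution of (29); a two-line sketch with assumed values:

```python
# Min-based conditioning after formula (29): take Pi(q | p) as the greatest
# solution of Pi(p and q) = min(Pi(p), Pi(q | p)). Values are assumptions.

def cond_possibility(Pi_pq, Pi_p):
    """Greatest x such that min(Pi_p, x) = Pi_pq (assuming Pi_pq <= Pi_p)."""
    return 1.0 if Pi_pq == Pi_p else Pi_pq

print(cond_possibility(0.4, 0.4))   # 1.0: q is fully possible given p
print(cond_possibility(0.2, 0.4))   # 0.2
```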
3  FUZZY LOGIC
Uncertain propositions, considered in the preceding section, must not be confused with fuzzy propositions. In the first case we have propositions that are true or false (thus involving non-vague predicates), but, owing to the lack of precision of the available information, we can in general only estimate to what extent it is possible or necessary that a proposition is true. In the second case the available information is precise, but the vagueness of the predicates leads to propositions with intermediary degrees of truth. Obviously we may encounter a fuzzy proposition for which the available reference information is not precise; then we have the general case of an uncertain fuzzy proposition; the study of such propositions is outside the scope of this introduction. This situation is the most complicated one, and it leads to fuzzy truth-values, as indicated in Section 1. See also Prade (1985b) and Yager (1984) for a representation and a treatment of uncertain fuzzy propositions in terms of possibility distributions.
3.1  Formal reasoning with vague predicates
Patterns of reasoning in the style of modus ponens can be developed for fuzzy propositions, i.e. bounds on t(q) can be computed from knowledge of t(p) and t(p → q); see Dubois and Prade (1980a) for instance. However, there are different natural ways of defining t(p → q) from (14)-(16), as discussed by Dubois and Prade (1984a); moreover, we may have t(¬q → ¬p) ≠ t(p → q) for some definitions of the implication operator, when t(p → q) is not defined as t(¬p ∨ q). It can be proved that t(q) ≥ min(t(¬p ∨ q), t(p)) if and only if t(¬p ∨ q) + t(p) > 1 (see Dubois and Prade, 1980a, p. 167), a result that contrasts with (20). Quite early in the development of fuzzy-set theory, an extension of Robinson's resolution principle was proposed by Lee (1972) for ground clauses in the framework of the fuzzy logic defined by (14)-(16), i.e. for dealing with fuzzy propositions; note that the resolution principle avoids the explicit use of the implication connective in the representation of the knowledge. Basically, Lee proved that if all the truth-values of the parent clauses are strictly greater than 0.5 then a resolvent clause derived by the resolution principle always has a truth-value between the maximum and the minimum of those of the parent clauses. See Dubois and Prade (1987a) for a bibliography of subsequent works along this line.
Note that a set of fuzzy propositions is generally not a Boolean algebra. In particular, the law of contradiction is not valid, so that the refutation method, which is the basis of many logic-programming techniques, seems hard to implement here. This fact, and also the fact that the above results are applicable only to ground clauses, may restrict the applicability of formal proving methodologies for fuzzy logic.
3.2  Fuzzy logic based on meaning computation
Another approach to reasoning with fuzzy statements, i.e. statements containing fuzzy predicates, is described by Zadeh (1979). Let S_1, S_2, ..., S_n be n statements expressed, say, in natural language. Let {M(S_i), i = 1, ..., n} be the meanings of S_1, ..., S_n, as defined in Section 1. M(S_i) is obtained by translating S_i into a meaning-representation language such as PRUF (Zadeh, 1978b), and is a fuzzy restriction on a variable vector X_i taking its values on a universe U_i. Let U be the universe of discourse built from U_1, ..., U_n, and let V be some subuniverse from which a variable Y in which we are interested takes its values. Reasoning is here viewed as using the statements S_1, ..., S_n to say something about Y. This is done in three basic steps:
(i)  compute the meanings M(S_i) of the S_i, i = 1, ..., n, and their cylindrical extensions on U, say M₀(S_i);

(ii)  calculate the join of the M(S_i); this yields a fuzzy relation R on U, interpreted as a possibility distribution;

(iii)  project R on the universe V, to obtain the fuzzy relation Proj_V(R), which is the meaning of the conclusion statement S about Y.
This scheme is very general. R is obtained by intersection of the fuzzy relations M₀(S_i). Let X be the vector variable pertaining to U; X can be denoted (Y, Y′), where Y′ pertains to variables taking their values in V′ such that U = V × V′. Proj_V(R) is defined by

∀Y, μ_{Proj_V(R)}(Y) = sup_{Y′} μ_R(Y, Y′)    (30)
consistently with possibility theory. The classical modus-ponens pattern has been generalized to the following fuzzy-logic pattern:

S_1: X is A′
S_2: if X is A then Y is B
S:  Y is B′

where M(S_1) = A′, a fuzzy set on U_1.
M(S_2) is obtained by means of a multiple-valued implication connective denoted → (see e.g. Dubois and Prade (1984b) for a review). We have

μ_{M(S_2)}(X, Y) = μ_A(X) → μ_B(Y)    (31)

The universe U is U_1 × V, and the relation R on U is such that

μ_R(X, Y) = μ_{A′}(X) * (μ_A(X) → μ_B(Y))    (32)

where * is an intersection operation such as (15). Thus we get

μ_{B′}(Y) = sup_X μ_{A′}(X) * (μ_A(X) → μ_B(Y))    (33)
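The following discretized Python sketch of (31)-(33) uses * = min together with the Gödel implication discussed just below; the universes and membership functions are illustrative assumptions.

```python
# Discretized sketch of the generalized modus ponens (31)-(33), with * = min
# and the Godel implication; universes and membership functions are
# illustrative assumptions.

U = [i / 10.0 for i in range(11)]     # universe for X
V = [i / 10.0 for i in range(11)]     # universe for Y

def mu_A(x):  return max(0.0, 1.0 - 2 * abs(x - 0.3))   # "X is around 0.3"
def mu_B(y):  return max(0.0, 1.0 - 2 * abs(y - 0.7))   # "Y is around 0.7"
def mu_A1(x): return max(0.0, 1.0 - 2 * abs(x - 0.4))   # fact A': near A

def godel(a, b):                      # a -> b = 1 if a <= b, else b
    return 1.0 if a <= b else b

def mu_B1(y):                         # formula (33): sup-min composition
    return max(min(mu_A1(x), godel(mu_A(x), mu_B(y))) for x in U)

print([round(mu_B1(y), 2) for y in V])   # B': a widened version of B
```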
It is shown by Dubois and Prade (1984b, 1985a) that the choice of the implication operation → is dictated by the choice of the intersection * as soon as the meaning of S_2 is defined from μ_A and μ_B under the following constraints:

(i)  from "X is A" one should conclude S: "Y is B";

(ii)  M(S_2) should be as unspecific as possible (i.e. as large a fuzzy set as possible) so as not to be arbitrary.

In particular, if * = min then → should be Gödel implication (a → b = 1 if a ≤ b, and b otherwise). An axiomatic approach to the definition of many-valued-logic implication connectives (Rescher, 1969) is given in Trillas and Valverde (1985) for instance. A unified view of several classes of many-valued implication functions is proposed by Dubois and Prade (1984a). The pattern of the resolution principle can be dealt with using the same methodology:
S_1: X is A′ or Z is C
S_2: X is not A or Y is B
S:  (Y, Z) is D

This pattern can be called a generalized resolution principle. Indeed, here A′ ≠ A, and the predicates are fuzzy. It can be checked that, using (14)-(16) for the basic set-theoretic operations,

μ_D(Y, Z) = sup_X min(max(μ_{A′}(X), μ_C(Z)), max(1 − μ_A(X), μ_B(Y)))    (34)

Note that μ_D(Y, Z) ≥ max(μ_C(Z), μ_B(Y)) always holds; equality is obtained when A′ = A is a crisp subset. The classical resolution principle is then recovered. Note that when C is empty, the pattern of the generalized modus ponens (33) is recovered, with * = min and a → b = max(1 − a, b). When the implication used is not Gödel's, A′ = A does not yield B′ = B in (33), nor D = B or C in (34). In other words, the elimination of fuzzy predicates
is not always permitted in fuzzy counterparts of the resolution principle. Recovering the elimination property requires once again a proper choice of the implication operations in the pattern

S_1: if X is not A then Z is C
S_2: if X is A then Y is B
S:  (Y, Z) is B or C

Note that the inference mechanism in fuzzy logic is generally a nonlinear-programming technique. Examples of systems based on these ideas are proposed by Baldwin (1979, 1983), Yager (1984) and Martin-Clouaire and Prade (1986) for example. See also Prade and Negoita (1986) and Sanchez and Zadeh (1987) for application-oriented papers, and Prade (1985a) for a larger bibliography.
3.3  Illustrative example
Let us consider the following example, which illustrates various aspects of possibility and fuzzy logics. We have two rules and two facts:

(a)  if a person is a professional jockey (j) then his/her weight is approximately between 45 and 50 kg (A′);

(b)  if a person is a male (m) and his weight is between 40 and 50 kg (A) then it is likely he is a teenager (t);

(c)  John (J) is a male (m) and a professional jockey (j).
The possible weights of a professional jockey specified by rule (a) are represented by means of a bell-shaped possibility distribution μ_{A′}, like that pictured in Fig. 6, whose support is the interval [45, 50]. The conclusion part of rule (b) is pervaded with uncertainty; this can be modelled using the notation introduced above:

(a): j(x) → A′(x);
(b): m(x) ∧ A(x) → t(x)    (α);
(c): m(J), j(J)

where α = N(∀x, m(x) ∧ A(x) → t(x)). A particular case of the generalized modus ponens enables us to deduce from (a) and (c) that A′(J), i.e. John's weight is fuzzily restricted by A′. Then we compute to what extent we are certain that the condition part of (b) holds, as

N(m(J) ∧ A(J); m(J) ∧ A′(J)) = min(N(m(J); m(J)), N(A(J); A′(J)))
                             = min(1, N(A(J); A′(J)))
                             = N(A(J); A′(J))

which can be easily computed using (11) with π = μ_{A′(J)}. Then by (20) we propagate the uncertainty along rule (b):

N(t(J)) ≥ min(N(A(J); A′(J)), α)
However, here we may find it highly undesirable to conclude that John being a teenager is somewhat certain. The problem of controlling transitivity effects is classical in default logic; see Chapter 7. The way of coping with this problem in our framework is as follows. First, as others do, we substitute for (b) a more precise statement, namely

(b′)  if a person is a male and his weight is between 40 and 50 kg and he is not a jockey then it is likely that he is a teenager.

Then if we know that John is a jockey, rule (b′) can no longer be applied to John. Secondly, if we just know that John is a male and that his weight is fuzzily restricted by A′ then, in order to be able to use (b′), we keep the piece of default knowledge that the (a priori) possibility that a person is a jockey is very low, say ε. So the corresponding a priori certainty that a person is not a jockey is high (1 − ε). The evaluation of the condition part of (b′) now yields min(N(A(J); A′(J)), 1 − ε), and finally our certainty that John is a teenager will be min(N(A(J); A′(J)), 1 − ε, α), i.e. a strong certainty if α and N(A(J); A′(J)) are close to 1.
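A numerical sketch of this second scenario (John known only as a male whose weight is restricted by A′); the shapes of μ_A and μ_{A′} and the grades ε and α below are assumptions, since the chapter does not fix these numbers.

```python
# Numerical sketch of the jockey example, for the case where John is only
# known to be a male whose weight is restricted by A'. The membership
# functions and the grades eps and alpha are illustrative assumptions.

weights = [30 + 0.5 * i for i in range(81)]       # universe: 30..70 kg

def mu_A(u):    # condition of rule (b'): weight between 40 and 50 kg (crisp)
    return 1.0 if 40 <= u <= 50 else 0.0

def mu_A1(u):   # rule (a): "approximately between 45 and 50 kg" (triangular)
    if 45 <= u <= 47.5:
        return (u - 45) / 2.5
    if 47.5 < u <= 50:
        return (50 - u) / 2.5
    return 0.0

# N(A(J); A'(J)) by formula (11), with pi = mu_A'
N_match = min(max(mu_A(u), 1 - mu_A1(u)) for u in weights)

alpha, eps = 0.9, 0.1   # assumed rule certainty and prior possibility of "jockey"
print(N_match, min(N_match, 1 - eps, alpha))      # 1.0 and 0.9 here
```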
3.4  Reasoning with fuzzy quantifiers
Often items of knowledge are expressed in the form of statements involving quantifiers different from the universal or the existential ones. These quantifiers can be viewed as proportions that may be only vaguely specified. They translate linguistic terms such as "most of" and "some". Zadeh (1983, 1985) has considered syllogisms with propositions involving fuzzy quantifiers modelled by fuzzy subsets of the unit interval, for example the so-called intersection/product syllogism of the form

Q_1 As are Bs
Q_2 (A and B)s are Cs
Q_1 ⊗ Q_2 As are (B and C)s

where ⊗ is the extended product of fuzzy numbers (Dubois and Prade, 1980b). In the above syllogism Q_1 restricts the possible values of the proportion |A ∩ B|/|A| (where | | denotes cardinality), or more generally of the conditional probability P(B | A). Q_2 is defined in a similar way. The resulting quantifier Q_1 ⊗ Q_2 is justified by the well-known probabilistic identity P(B ∩ C | A) = P(B | A)·P(C | A ∩ B). Note that patterns of reasoning that are valid when universally quantified may no longer hold, even in a weaker form, with fuzzy or numerical quantifiers. For instance the syllogism, where ∀ means "all",
∀ As are Bs
Q_1 Bs are Cs
Q As are Cs

is valid with Q_1 = Q = ∀. However, as soon as Q_1 ≠ ∀, nothing can be said about the proportion of As that are Cs, which may take any value in the unit interval, i.e. Q = [0, 1]. Moreover, if we add the supplementary piece of knowledge

Q_2 Bs are As
then it can be established that, when μ_{Q_i} is increasing for i = 1, 2 (i.e. Q_1 and Q_2 are variants of "most", as in Fig. 6),

Q = max(0, 1 − (1 − Q_1)/Q_2)    (35)

where max, − and / are respectively the maximum, the subtraction and the quotient extended to fuzzy numbers. This result only expresses the fact that if A ⊆ B, P(A | B) ≥ q_2 and P(C | B) ≥ q_1 then, from the laws of probability theory, we conclude that P(C | A) ≥ max(0, 1 − (1 − q_1)/q_2). In (35) the laws of probability theory are combined with results in fuzzy arithmetics (Dubois and Prade, 1980a). Consequently, Zadeh's theory of fuzzy syllogisms is nothing but probabilistic logic expressed in terms of conditional probabilities (as in the approach described in Chapter 8), with the assumption that the knowledge of probability values is in the form of fuzzy intervals, i.e. fuzzy probabilities (Zadeh, 1984), instead of point-probabilities or interval-valued ones. Based on this modelling, some forms of reasoning with default rules can be defined, when the general rules of the form "Q As are Bs" are instantiated.
Fig. 6  Q = "most".

Note that, in
their linguistic forms, the quantifiers may not explicitly appear in the rules (e.g. "snow is white" is short for "usually, snow is white"). Rules with explicit or implicit quantifiers are called "dispositions" by Zadeh (1985). The degree of truth of a statement of the form S = "Q As are Bs" is computed by Zadeh as follows:

t(S | ℬ) = μ_Q(|A ∩ B| / |A|) = μ_Q( Σ_{u∈U} min(μ_A(u), μ_B(u)) / Σ_{u∈U} μ_A(u) )

provided that all the values {(μ_A(u), μ_B(u)) | u ∈ U} are stored in ℬ. Yager (1983) proposes another treatment of quantified statements of the form "Q As are Bs", which does not relate to conditional probabilities.
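A short sketch of this evaluation for Q = "most"; the universe, the membership grades and the parameters of μ_Q are assumptions.

```python
# Sketch of Zadeh's evaluation of S = "Q As are Bs" via the relative
# sigma-count; the universe, the grades and the quantifier are assumptions.

U = ["u%d" % i for i in range(6)]
mu_A = dict(zip(U, [1.0, 0.8, 0.7, 1.0, 0.2, 0.0]))
mu_B = dict(zip(U, [1.0, 0.9, 0.3, 0.8, 0.1, 0.9]))

def mu_most(r):   # increasing quantifier "most", as in Fig. 6
    return min(1.0, max(0.0, (r - 0.5) / 0.4))    # 0 below 0.5, 1 above 0.9

ratio = (sum(min(mu_A[u], mu_B[u]) for u in U) /  # sigma-count of A and B
         sum(mu_A[u] for u in U))                 # over sigma-count of A
print(round(ratio, 3), round(mu_most(ratio), 3))  # degree of truth of S
```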
4  EXAMPLES OF APPLICATIONS IN THE AUTHORS' RESEARCH GROUP

Implementing the generalized modus ponens can be very simply achieved using parametrized representations of membership functions (Dubois et al., 1987b); the order of complexity of fuzzy production systems is thus not higher than for usual production systems. Of course the processing time for each rule is higher than in the classical case. But this is counterbalanced by the better expressive power of fuzzy production rules. A few fuzzy rules can generally account for the behaviour of a larger set of non-fuzzy rules. Implementing possibilistic logic in the production-system style is quite efficient, owing to the max-min matrix-calculation scheme for uncertainty propagation. In the resolution style, linear resolution strategies can be adapted, and a heuristic search algorithm has been developed (analogous to Nilsson's A*) to maximize the certainty of the resulting empty clause (Dubois et al., 1987a).

The design of the inference engine SPII (Martin-Clouaire and Prade, 1986) has been motivated by the need for a sufficiently general inference system able to (i) deal both with the imprecision and the uncertainty pervading factual and expert knowledge, and (ii) combine symbolic reasoning with numerical computation. SPII-2 is capable of treating pieces of information (facts or rules) that are imprecise (since they are expressed by means of vague predicates) or uncertain (since their truth is not fully guaranteed). SPII-2 works in backward-chaining. Possibility theory is used for representing imprecision in terms of possibility distributions, and uncertainty by means of a pair of possibility and necessity measures. More technically, SPII-2 (i) propagates uncertainty and imprecision in the reasoning process via deductive inferences; (ii) estimates the degree of matching between facts and the condition parts of rules in the presence of vagueness; (iii) combines imprecise or
uncertain pieces of information relative to the same matter; and (iv) performs computation on ill-known numerical quantities using fuzzy arithmetics. SPII-2 has been developed and experimentally tested on a realistic prospect-appraisal problem in petroleum geology involving fuzzy rules (Lebailly et al., 1987). SPII-2 is written in LELISP and is running on a VAX 11-780 computer as well as a Macintosh microcomputer.

DIABETO (Buisson et al., 1987) is a medical expert system, accessible from the French videotex network TELETEL, which is a decision-aid tool for the treatment of diabetes. In DIABETO-III, imprecise/uncertain rules and facts are represented in a unified manner using possibility distributions. In particular, DIABETO-III deals with expert rules involving fuzzy conditions, which are understood as "the more the condition is satisfied, the more certain is the conclusion". Besides, an interpolation method enables the system to build, from a given set of fuzzy rules, a new fuzzy rule that is better adapted to the current situation if necessary. Presently the knowledge base contains about 300 rules (the full knowledge base should contain about 1000 rules). The system is designed for use by sick people themselves. It is implemented in NIL (a dialect of LISP) on a VAX 11-780.

The inference engine TAIGER (Farreny et al., 1986) is not only able to handle uncertain rules but also imprecise and uncertain factual pieces of knowledge concerning the values of logical or numerical variables. The possibilistic representation of uncertainty that is used is somewhat similar to the MYCIN one (Buchanan and Shortliffe, 1984), but the chaining and combination operations of the possibility-theory-based approach differ somewhat from the empirical choices (obtained as distorted probabilistic laws) made in MYCIN. Besides, imprecision is dealt with in the same possibilistic framework in TAIGER. TAIGER manipulates numerical values pervaded with imprecision and uncertainty, while inference engines like that of MYCIN treat uncertain rules and facts only. TAIGER maintains a representation of imprecise or uncertain facts in terms of possibility distributions, while the uncertainty of a rule is modelled by the numbers appearing in a 2 × 2 matrix representation of the rule (Farreny and Prade, 1986). TAIGER works in backward-chaining. TAIGER is currently implemented on an IBM-PC microcomputer in MULISP.
5  CONCLUSION

Possibility theory offers a common setting for modelling uncertainty and imprecision in reasoning systems. However, the reasoning methodology in fuzzy logic drastically differs from the theorem-proving approach. In the latter, statements are translated into logical formulae. Inference is then performed symbolically, regardless of the meaning of the formulae. In fuzzy
logic, in contrast, statements are translated into elastic constraints in a meaning-representation language, and the meaning of the conclusion is directly computed via nonlinear-programming techniques. However, in possibility logic, as soon as no vagueness pervades the knowledge, it seems that part of the theorem-proving methodology can be extended, as stressed in Section 2. Finally, we have pointed out that the notion of truth can be viewed as the result of a semantic pattern-matching process. This view leads to the definition of operational procedures in order to compute degrees of truth and degrees of uncertainty that can feed approximate-reasoning systems.

BIBLIOGRAPHY

Baldwin, J. A. (1979). A new approach to approximate reasoning using a fuzzy logic. Fuzzy Sets and Systems 2, 309-325. (An extensive treatment of the generalized modus ponens based on fuzzy truth-values.)
Dubois, D. and Prade, H. (1980a). Fuzzy Sets and Systems: Theory and Applications. Academic Press, New York. (An account of the fuzzy-set literature in the nineteen-seventies. Covers a broad range of topics.)
Dubois, D. and Prade, H. (1985a) (avec la collaboration de H. Farreny, R. Martin-Clouaire, C. Testemale). Théorie des Possibilités: Applications à la Représentation des Connaissances en Informatique. Collection Méthode + Programmes, Masson, Paris. (English translation to be published by Plenum Press, New York.) (A complement to the previous reference on some aspects of fuzzy-set theory, especially possibility measures, fuzzy arithmetics and fuzzy-set-theoretic operations. Focuses on applications to approximate reasoning, heuristic search, fuzzy programming and relational databases.)
Dubois, D. and Prade, H. (1986a). Possibilistic inference under matrix form. Fuzzy Logic in Knowledge Engineering (ed. H. Prade and C. V. Negoita), pp. 112-126. Verlag TÜV Rheinland, Köln. (An extensive presentation of possibilistic logic. Also deals with the question of the possibility of conditionals and conditional possibility.)
Dubois, D. and Prade, H. (1987a). Necessity measures and the resolution principle. IEEE Trans. Syst. Man Cyber. 17, 474-478. (The theorem-proving approach to possibilistic logic.)
Gaines, B. R. (1976). Foundations of fuzzy reasoning. Int. J. Man-Machine Stud. 8, 623-668. (A basic reference on the links between multiple-valued logics and fuzzy-set theory.)
Lee, R. C. T. (1972). Fuzzy logic and the resolution principle. J. Assoc. for Computing Machinery 19, 109-119. (The main and oldest reference on the theorem-proving approach to the max-min multiple-valued logic underlying fuzzy-set theory.)
Ponasse, D. (1978). Algèbres floues et algèbres de Łukasiewicz. Rev. Roum. Math. Pures Appl. 23, 103-113. (A fuzzy counterpart of Stone's theorem for Boolean algebras.)
Prade, H. (1985a). A computational approach to approximate and plausible reasoning with applications to expert systems. IEEE Trans. Pattern Anal. Machine Intelligence 7, 260-283 (corrections in 7, 747-748). (An overview of approximate-reasoning methodologies related to possibility theory and fuzzy logic. Includes a very large bibliography.)
Prade, H. and Negoita, C. V. (eds) (1986). Fuzzy Logic in Knowledge Engineering.
Verlag TÜV Rheinland, Köln. (A collection of up-to-date contributions by major researchers in the area of possibility theory and fuzzy logic applied to approximate reasoning, databases and expert systems.)
Sanchez, E. and Zadeh, L. A. (eds) (1987). Approximate Reasoning in Intelligent Systems, Decision and Control. Pergamon Press, Oxford. (A similar collection, with other contributions.)
Yager, R. R. (1983). Quantified propositions in a linguistic logic. Int. J. Man-Machine Stud. 19, 195-227. (An alternative approach to fuzzy quantifiers, extending the substitution method in logic.)
Zadeh, L. A. (1965). Fuzzy sets. Info. Control 8, 338-353. (The founding paper on fuzzy-set theory. It is still recommended reading to capture the basic intuitions.)
Zadeh, L. A. (1978a). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1, 3-28. (The first paper on possibility measures. Stresses the links between possibility distributions and linguistic information. Adopts a physical point of view on possibility, as opposed to statistical knowledge.)
Zadeh, L. A. (1979). A theory of approximate reasoning. Machine Intelligence 9 (ed. J. E. Hayes, D. Michie and L. I. Mikulich), pp. 149-194. Elsevier, Amsterdam. (Zadeh's approach to reasoning with vague information. Describes in detail the combination/projection methodology sketched in Section 3.2.)
Zadeh, L. A. (1985). Syllogistic reasoning in fuzzy logic and its application to usuality and reasoning with dispositions. IEEE Trans. Syst. Man Cyber. 15, 754-763. (The most up-to-date paper on the theory of dispositions and the treatment of fuzzy quantifiers.)

Other references
Baldwin, J. (1983). A fuzzy relational inference language for expert systems. Proc. 13th IEEE Int. Symp. on Multiple-Valued Logic, Kyoto, pp. 416-423. IEEE, New York.
Bellman, R. E. and Zadeh, L. A. (1977). Local and fuzzy logics. Modern Uses of Multiple-Valued Logics (ed. J. M. Dunn and G. Epstein), pp. 103-165. Reidel, Dordrecht.
Buchanan, B. G. and Shortliffe, E. H. (1984). Rule-based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley, Reading, Mass.
Buisson, J. C., Farreny, H., Prade, H., Turnin, M. C., Tauber, J. P. and Bayard, F. (1987). TOULMED, an inference engine which deals with imprecise and uncertain aspects of medical knowledge. Proc. Eur. Conf. on Artificial Intelligence in Medicine (AIME-87), Marseilles. Springer-Verlag, Berlin.
Dubois, D. (1987). Possibility theory: towards normative foundations. Risk, Decision and Rationality (ed. B. Munier). Reidel, Dordrecht. (To appear.)
Dubois, D. and Prade, H. (1980b). New results about properties and semantics of fuzzy-set-theoretic operators. Fuzzy Sets: Theory and Applications to Policy Analysis and Information Systems (ed. P. P. Wang and S. K. Chang), pp. 59-75. Plenum Press, New York.
Dubois, D. and Prade, H. (1982a). A class of fuzzy measures based on triangular norms. Int. J. Gen. Syst. 8, 43-61.
Dubois, D. and Prade, H. (1982b). On several representations of an uncertain body of evidence. Fuzzy Information and Decision Processes (ed. M. M. Gupta and E. Sanchez), pp. 167-181. North-Holland, Amsterdam.
Dubois, D. and Prade, H. (1984a). A theorem on implication functions defined from triangular norms. Stochastica 8, 267-279.
Dubois, D. and Prade, H. (1984b). Fuzzy logics and the generalized modus ponens revisited. Cybernetics and Systems 15, 293-331.
Dubois, D. and Prade, H. (1984c). The management of uncertainty in expert systems: the possibilistic approach. Operational Research '84: Proc. 10th Triennial IFORS Conf., Washington, DC (ed. J. P. Brans), pp. 949-964. North-Holland, Amsterdam.
Dubois, D. and Prade, H. (1985b). A review of fuzzy set aggregation connectives. Info. Sci. 36, 85-121.
Dubois, D. and Prade, H. (1985c). Evidence measures based on fuzzy information. Automatica 31, 547-562.
Dubois, D. and Prade, H. (1986b). Fuzzy sets and statistical data. Eur. J. Operational Res. 25, 345-356.
Dubois, D., Prade, H. and Testemale, C. (1986). Weighted fuzzy pattern matching. Proc. Journée Nationale sur les Ensembles Flous, la Théorie des Possibilités et leurs Applications, Toulouse, pp. 115-145. (To appear in Fuzzy Sets and Systems, 1988.)
Dubois, D., Lang, J. and Prade, H. (1987a). Theorem proving under uncertainty: a possibility theory-based approach. Proc. 10th Int. Joint Conf. on Artificial Intelligence (IJCAI-87), Milan.
Dubois, D., Martin-Clouaire, R. and Prade, H. (1987b). Practical computing in fuzzy logic. Fuzzy Computing (ed. M. M. Gupta and T. Yamakawa). North-Holland, Amsterdam. (To appear.)
Farreny, H. and Prade, H. (1986). Default and inexact reasoning with possibility degrees. IEEE Trans. Syst. Man Cyber. 16, 270-276.
Farreny, H., Prade, H. and Wyss, E. (1986). Approximate reasoning in a rule-based expert system using possibility theory: a case study. Information Processing '86 (ed. H. J. Kugler), pp. 407-413. North-Holland, Amsterdam.
Giles, R. (1982). Foundations for a theory of possibility. Fuzzy Information and Decision Processes (ed. M. M. Gupta and E. Sanchez), pp. 183-195. North-Holland, Amsterdam.
Goodman, I. R. and Nguyen, H. T. (1985). Uncertainty Models for Knowledge-Based Systems. North-Holland, Amsterdam.
Hughes, G. E. and Cresswell, M. J. (1968). An Introduction to Modal Logic. Methuen, London.
Lebailly, J., Martin-Clouaire, R. and Prade, H. (1987). Use of fuzzy logic in rule-based systems in petroleum geology. Approximate Reasoning in Intelligent Systems, Decision and Control (ed. E. Sanchez and L. A. Zadeh), pp. 125-144. Pergamon Press, Oxford.
Martin-Clouaire, R. and Prade, H. (1986). SPII-1: a simple inference engine capable of accommodating both imprecision and uncertainty. Computer-Assisted Decision-Making (ed. G. Mitra), pp. 117-131. North-Holland, Amsterdam.
Prade, H. (1983). Data bases with fuzzy information and approximate reasoning in expert systems. Proc. IFAC Int. Symp. on Artificial Intelligence, Leningrad, pp. 113-120.
Prade, H. (1985b). Reasoning with fuzzy default values. Proc. 15th IEEE Int. Symp. on Multiple-Valued Logic, Kingston, Ontario, pp. 191-197. IEEE, New York.
Rescher, N. (1969). Many-Valued Logic. McGraw-Hill, New York.
Rich, E. (1983). Default reasoning as likelihood reasoning. Proc. American Association for Artificial Intelligence Conf. (AAAI-83), Washington, DC, pp. 348-351.
Schweizer, B. and Sklar, A. (1963). Associative functions and abstract semi-groups. Publ. Math. Debrecen 10, 69-81.
Shackle, G. L. S. (1961). Decision, Order and Time in Human Affairs, 2nd edn. Cambridge University Press.
Suppes, P. (1966). Probabilistic inference and the concept of total evidence. Aspects of Inductive Logic (ed. J. Hintikka and P. Suppes), pp. 49-65. North-Holland, Amsterdam.
Trillas, E. and Valverde, L. (1985). On implication and indistinguishability in the setting of fuzzy logic. Management Decision Support Systems Using Fuzzy Sets and Possibility Theory (ed. J. Kacprzyk and R. R. Yager), pp. 198-212. Verlag TÜV Rheinland, Köln.
Yager, R. R. (1984). Approximate reasoning as a basis for rule-based expert systems. IEEE Trans. Syst. Man Cyber. 14, 636-643.
Zadeh, L. A. (1978b). PRUF: a meaning representation language for natural languages. Int. J. Man-Machine Stud. 10, 395-460.
Zadeh, L. A. (1981). Test-score semantics for natural languages and meaning representation via PRUF. Technical Note 247, SRI International, Menlo Park, California. Also in Empirical Semantics (ed. B. B. Rieger), pp. 281-349. Brockmeyer, Bochum, 1982.
Zadeh, L. A. (1983). The role of fuzzy logic in the management of uncertainty in expert systems. Fuzzy Sets and Systems 11, 199-228.
Zadeh, L. A. (1984). Fuzzy probabilities. Info. Proc. Mgmt 19, 148-153.
DISCUSSION

Marie-Odile Cordier:  Dubois and Prade make in their paper a clear distinction between uncertain reasoning and vague (or approximate?) reasoning.

Uncertain reasoning is precise reasoning on a given situation (or world) incompletely described in a database. The answer to a precise query can then be: "surely" true, "surely" false, or "possibly" true or false, i.e. a degree of certainty. If one knows, for example, that John is tall, then the query "is John's height more than 1.80 m?" is answered in terms of whether it is more or less possible that the statement is true.

Vague reasoning is reasoning with vague predicates on a precise database. The predicates are defined approximately and are more or less verified by precise data, i.e. are more or less true. An answer to a query is then a degree of truth. If one knows that John's height is 1.70 m, then a statement such as "John is tall" can be said to be true with a to-be-determined degree of truth.

Possibilistic logic and fuzzy logic are two ways of reasoning on imperfect knowledge; they use fuzzy-set theory as a common tool, and for that reason are quite often confused. They are clearly distinguished in Dubois and Prade's chapter. Possibilistic logic is concerned with uncertain reasoning where the database is a fuzzy description of a given world; it is a logic of uncertainty, as are probabilistic logic and the logic of evidence. Fuzzy logic is concerned with fuzzy reasoning on precise information; it is a logic of vagueness. Both are numerical approaches in the sense that degrees of certainty and degrees of truth are both estimated by numerical values. In possibilistic logic, an assertion is labelled by two values, the possibility and the necessity, which describe an interval of certainty for this assertion. In fuzzy logic, an assertion is labelled with a degree of truth describing its conformity with reality. An important result given by Dubois and Prade states that logics of uncertainty cannot be truth-functional.

Uncertain reasoning: evaluation of uncertainty versus use of dependency links.  Uncertain reasoning means reasoning on incomplete information: the value of
what Dubois and Prade call a variable (such as the height of John) is unknown, but can be restricted to a set of values. These restrictions reduce the set of possible worlds that can be modelled by such an incomplete database. One way of handling this is to evaluate the possible values of the variable, using numerical estimations, which are easy to combine and to manipulate. These numbers can be obtained using fuzzy-set theory, as is done in possibilistic logic, or probability theory, as in probabilistic logic, and can describe the possibility, probability, credibility etc. of the corresponding assertion. Symbolic estimations can also be used, such as the modalities proposed by Kodratoff et al. (1985). Another way is to consider unknown values as possible hypotheses and to use hypothetical reasoning; assertions are then labelled by hypothetical contexts that describe precisely under what conditions the result can be obtained. Instead of producing answers such as "likes(Mary, Paul) with possibility α and necessity β", it would produce "likes(Mary, Paul) if Paul earns more than $10000 or Paul is less than 40 years old", which can be more useful or more instructive.
One of the problems with logics of uncertainty is how to transform uncertain information into information labelled by a certainty degree. In probabilistic logic the certainty degree describes a probability that can be obtained by using the well-known probability theory. It seems that probabilities are well adapted to describing the certainty of a fact such as "the king of diamonds is in West's hand" (game of bridge), which can be computed precisely. In possibilistic logic the possibility and necessity are obtained through the use of fuzzy-set theory. The meaning of a fuzzy predicate is described by a fuzzy set; the justification of these values is a question of agreement on the meaning of a word; a fuzzy set describing "John is tall" in terms of precise heights or "Peter is rich" in terms of amount of salary can be said to be good only if it satisfies the user. The meaning of a word is a matter of opinion and cannot be formally justified. The choice between logics of uncertainty seems to be quite dependent on the domain; Dubois and Prade argue that some results are worse (using the resolution principle) in probabilistic logic than in possibilistic logic, but what if the inputs could be more precisely determined in the first case?

Comparison with other logics of uncertainty.
Possibility and necessity of an implication p ⊃ q (p → q).  Dubois and Prade show how to get the couple (possibility, necessity) for an element of a fuzzy database: the meaning of a fuzzy predicate is given by a fuzzy set, represented by a trapezoidal curve; these curves are used to obtain, for a given fuzzy fact represented by a ground atomic formula, the two certainty measures. It is not so clear when one considers an implication like p → q: where do possibility and necessity come from? What is their (intuitive) meaning? Possibility and necessity of an implication can be seen as extensions of the possibility and necessity of a ground fact; they would describe the fuzzy relation between two propositions p and q, and express the uncertainty of expressions such as "it is possible that", "it can be that", "probably ...". It would be expected that an implication would be labelled by possibility and
necessity measures, as is the case for ground assertions; it seems that a matrix is proposed instead; this matrix is said to express "the grade of necessity of all ways of relating p and q". But no means (such as fuzzy sets for expressing the fuzzy predicates) are given for determining these measures. What is the intuitive meaning of these values? Where does this matrix come from? Are fuzzy sets used to compute it? For example, what is n_11? Is it the necessity of (p → q) or the necessity of q knowing p? Or are these two notions equivalent? In Farreny and Prade (1986) it seems that the matrix corresponds to conditional possibilities, with the relation Π(p and q) = min(Π(q | p), Π(p)). Is this always the case? What are the properties of the matrix induced by the properties of possibilities and necessities? Does the matrix replace the possibility and necessity of an implication? Or can these measures be obtained from it?

Possibility and necessity on first-order implications.  It is not so difficult to imagine the possibility and necessity of a propositional implication like
"if attends-to-the-meeting(Peter) then very probably attends-to-the-meeting(Mary)". It is not so easy when one considers first-order implications. What is now the meaning of possibility Poss and necessity Nee for: Vx P(x) --+ Q(x). (I) It could be the possibility (respective necessity) of the global formula
Poss [Vx P(x)
--+
Q(x)]
This means that "it can be" that all P(x) are Q(x) as in "it can be that all planes are on strike today"; it cannot be used to express the uncertainty on "all birds fly" for example. Poss [Vx p/ane(x)
--+
in-strike(x)]
(2) It is more probably something like the possibility of the conclusion when the conditions are verified, which is not so far from conditional possibilities:

∀x P(x) → Poss [Q(x)]

as in

∀x attends-to-the-meeting(Peter, x) → Poss [attends-to-the-meeting(Mary, x)]

or

∀x smokes(x) → Poss [diedbefore60(x)].
These implications could be rewritten as: smokes(Toto)
--+
Poss [diedbefore60(Toto)]
smokes(Lulu)
--+
Poss [diedbefore60(Lulu)] .
.
The possibility does not depend on the x concerned, and remains the same after instantiation. (3) But what about when the possibility depends on the domain of x (as is the case when one considers statistical measures)? For example, in: Vx bird(x)
--+
Poss [flies(x)],
the possibility reflects the fact that bird is a superclass, a union of classes of birds that fly and of classes of birds that do not fly; Poss is only valid for x being a bird, but changes when the domain of x is restricted to a subset of bird such as duck or penguin; the implication cannot be specialized, via the classical rule of specialization, without an update of the possibility. Let us suppose

∀x penguin(x) → bird(x)
∀x bird(x) → Poss[flies(x)]

One cannot derive from this

∀x penguin(x) → Poss[flies(x)]
This is the same argument used in Duval and Kodratoff (1986): the French usually drink coffee, but it cannot be used to derive that there is some possibility that someone (who is French) drinks coffee. More generally, the problem is that of an evaluation that is true for a groupE of x, but cannot be used for a subset of E. The same problem is met with the use of fuzzy quantifiers; and it seems to be difficult to reason on such information. Implementation issues.
Implementation issues. Let us suppose two certain inferences,

∀x height(x) > 1.80 m → basketball-player(x)
∀x height(x) > 1.80 m → likes(Mary, x)

and suppose that we know from a database that height(John) > 1.80 m with Π = α and N = β. If a contradiction such as ¬basketball-player(John) is added, then it seems that we have to

(i) update the certainty measures of height(John) > 1.80 m;
(ii) update the certainty measures of the derived assertions, such as likes(Mary, John);
(iii) update all the certainty measures concerned with the height of John, such as height(John) > 1.90 m, ...;
(iv) if the first inference were ∀x height(x) > 1.80 m ∧ weight(x) < 80 kg → basketball-player(x), then this update could be done on weight(John) too.

A toy sketch of such dependency-directed updating is given below.
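The following minimal sketch (in Python) makes points (i)-(ii) concrete. The fact names, certainty pairs and the revision policy (withdrawing support, i.e. resetting a pair to total ignorance N = 0, Π = 1) are hypothetical simplifications for illustration, not an algorithm proposed by the authors.

# Each assertion carries a certainty pair (N, Pi) and dependency links to
# its antecedents; on contradiction, support is withdrawn by resetting
# pairs to total ignorance N = 0, Pi = 1.  All names are hypothetical.

facts = {
    "height(John)>1.80": {"N": 0.8, "Pi": 1.0, "support": []},
    "basketball-player(John)": {"N": 0.8, "Pi": 1.0,
                                "support": ["height(John)>1.80"]},
    "likes(Mary,John)": {"N": 0.8, "Pi": 1.0,
                         "support": ["height(John)>1.80"]},
}

def contradict(fact):
    """Withdraw certainty from a contradicted fact, its antecedents, and
    every assertion resting on those antecedents (points (i)-(ii))."""
    suspect = {fact} | set(facts[fact]["support"])
    for name, entry in facts.items():
        if name in suspect or suspect & set(entry["support"]):
            entry["N"], entry["Pi"] = 0.0, 1.0  # back to ignorance

contradict("basketball-player(John)")
print(facts["likes(Mary,John)"]["N"])  # 0.0: derived assertion updated too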
If, for dealing with a contradiction, a complete TMS algorithm has to be implemented, requiring the use of dependency links to antecedents, is it not easier to use these links to reason directly, as in hypothetical reasoning, on the unknown or incompletely known values?

In conclusion, this chapter seems to be an up-to-date treatment of a crucial problem, that of reasoning on imperfect knowledge. A clear presentation is made of possibilistic and fuzzy logics, and a number of exciting problems remain to be explored.

Paul Gochet: Dubois and Prade acknowledge that the notion of truth with which they operate has been tailored for a special purpose. They define truth as the agreement, which can be partial (graded), between the representation of the meaning of a statement and the representation of what is actually known. If that definition is combined with the standard definition of knowledge as justified true belief, then a vicious circle is generated. This objection, however, can be dismissed. The authors are entitled to take the notion of "knowledge base" as primitive and to define knowledge as the content of a knowledge base.

They present the correspondence theory of truth as the standard concept: "Truth is generally understood as the conformity between a statement and the actual state of affairs it supposedly refers to." At first sight, that presentation can be questioned. Tarski's semantic concept of truth and Ramsey's earlier redundancy theory of truth have won wider agreement, at least among logicians and philosophers, than the traditional correspondence theory of truth. For Tarski, truth consists of satisfaction by all sequences of objects of the domain, and satisfaction, in turn, is given a recursive definition (Gochet, 1986). That definition enables him to do without the metaphoric expression "conformity" and to avoid commitment to dubious entities such as facts (Gochet, 1980) or states of affairs. For Tarski, the predicate "true in language L" can be defined either absolutely, i.e. independently of a model, or relatively, i.e. with respect to a model. The second definition is more often used today, as it plays a crucial role in formal semantics, where it serves to define validity (truth in all models). That definition of validity is very general. It applies also to non-classical logics in which the models have been enriched by the introduction of possible worlds, moments of time, accessibility relations, and modified by a non-standard interpretation of logical constants.

A version of the correspondence theory of truth was recently defended by Perry and Barwise within the framework of their situation semantics. It has been shown, however, that situation semantics and Montague semantics, despite significant differences, can both be subsumed under a slightly modified version of the framework that Montague provided in his Universal Grammar (Muskens, 1988). Since Montague's framework embodies and enlarges Tarski's definition of truth, this result shows that Dubois and Prade's correspondence theory of truth can fit in with the "received view", i.e. with Tarski's definition of truth.

Dubois and Prade's theory, however, is incompatible with Ramsey's theory. This is worth examining, since Ramsey has taken a stance on the issue raised by the concept of degree of truth. According to Ramsey, and also according to Ayer (Gochet, 1988), who has much improved on Ramsey's theory, saying that a statement is true is nothing more than reasserting the statement. The sentence "It is true that p" means nothing more than "p". The predicate "true" is redundant. Haack (1980) observes that the very notion of degree of truth ceases to make sense if we take up the redundancy theory: "... given that he holds that 'It is true that p' means that p, it is natural that Ramsey should say that 'It is ½ true that p' means nothing at all, since there seems to be no way of modifying the right-hand side of Ramsey's definition to give a sense to the modified left-hand side." One might question the claim that the adverbial modification of the truth-predicate cannot be transferred meaningfully to the asserted sentence. Instead of saying "It is half true that the flag is white", one could say "The flag is half white".
But this counterexample fails to refute Haack's claim. By cancelling out the expression "It is true that" and displacing the degree adverb, we change the meaning. The former sentence allowed one interpretation only ("The flag is grey"), whereas the latter allows several, and the preferred reading is "Half of the flag is white". Moreover, there are cases where the syntactic shift is really impossible. We can say "It is ½ true that France is hexagonal", but the sentence "France is ½ hexagonal" is sheer nonsense.

The clash between Dubois and Prade's admission of degrees of truth and Ramsey's redundancy theory is definitely not an argument against the former view, since it is open to us to abandon Ramsey's theory and retain the idea that truth comes in degrees. We make rough statements such as the above-mentioned sentence "France is hexagonal", borrowed from J. L. Austin. Two strategies are possible to cope with that linguistic use. We can say that such a statement fits the facts to a certain degree, and decide that statements that fit the facts to a degree ranging between 50% and 100% are true, whereas statements whose "degree of fit" falls below 50% are to be ascribed the truth-value False. Or we can collapse the two dimensions of assessment (fit to the facts and truth-value) into one and introduce the notion of degrees of truth. This is Dubois and Prade's policy. It enables them to exhibit the interconnections between classical logic, modal logic, possibilistic logic, many-valued logic and fuzzy logic. This fully justifies their choice, even if it departs from ordinary parlance.
Flash Sheridan: The problem with fuzzy logics is not that they are bad logics, but that they are not logics at all. It is hard to define what logic is, but it can be clear that something isn't a logic: if it has significant empirical consequences, or if it doesn't have connectives satisfying the most basic properties (see below) of and and or. Dubois and Prade's logic proves something that I claim is an empirical statement about the nature of colour. An alternative fuzzy logic has connectives that claim to be and and or, but are nothing like them. (I shall restrict my attacks to or; it is the easier target.)

The two most basic things about or are that p or p is the same as p, and that p or q is no less true than either p or q. Call the first "idempotence", the second "monotonicity". And must also be idempotent, and monotonic the other way: p and q is no more true than either p or q. Say we have a pencil that is fairly red, and fairly orange. I claim that it is at least conceivable that it is very red or orange. (In fact, there is such a pencil, but that doesn't matter.) With the "most popular choice of operations" (Dubois and Prade's equations (14) and (15)) this is impossible: the pencil is only fairly red or orange (t(P ∨ Q) = max(t(P), t(Q)), where t(P) is the degree of truth of the proposition P). Dubois and Prade do have a theory of colour that makes sense of this; I think it is arbitrary and wrong, but that doesn't matter. What matters is that one can deduce their theory of colour from their logic. I claim that this theory is empirical, so this version of fuzzy logic is not a logic. I am not going to discuss the meaning of "empirical"; if you feel the existence of such a pencil is not an empirical matter, you need not believe my argument. (I think one could even make a case that it isn't.) But if you agree that it is empirical, you may be intrigued by fuzzy logic's usefulness, but you must believe that this usefulness is accidental.

I know of a different version of or. It uses + instead of max. The obvious problem with this is that one may then get truth values greater than 1; it dodges the problem by fiat: if the value one gets is greater than 1, pretend it is 1: t(p ∨ q) = min(t(p) + t(q), 1). This is not idempotent.

I am not here attacking the idea of vagueness; I should be interested to see a good logic of vagueness, although there are strong reasons to believe that there can be no such thing. The best philosopher to address the issue of vagueness has concluded that it is incoherent (Dummett, 1975; see also Fine, 1975, to which latter article, through Frank Veltmann, I am indebted for the colour example). If there is a way to axiomatize vagueness, it seems it would have to be far more radical than you would be willing to accept. (It would probably have to be an extreme version of an extreme philosophical position called "strict finitism".) But, except for computational convenience, I see no reason for it to be truth-functional.
Reply: The three discussants have each focused on different aspects of our paper; their respective comments can be summarized into the following questions.

(i) Can there be a logic of vague propositions (Sheridan)?
(ii) What is the expressive power of possibilistic logic, and what is its relevance for commonsense reasoning (Cordier)?
(iii) What is the meaning of graded truth (Gochet)?
All three questions are very much relevant to a proper understanding of fuzzy-set and possibility theory, and we are grateful to the discussants for raising them. First of all, fuzzy-set theory has no special claim to stand as a general theory of vagueness. To build a membership function one needs three objects: a referential set Ω, a set of membership values V, and a mapping μ_A from Ω to V that discriminates between membership and non-membership in A. The set V is usually taken to be the unit interval, but this is clearly a matter of convenience. More generally, V should be allowed to be a lattice (Goguen, 1967); then one can build purely qualitative models of vague concepts. Ω can be any kind of set, but in practice the use of the fuzzy-set approach is made easier whenever Ω is what we shall call a "simple set", i.e. either a finite set with small cardinality, or a linear numerical scale, or a Cartesian product thereof. Outside these cases, it is difficult to find a procedure that enables the membership function to be elicited in a reasonable way. Fortunately the above cases occur quite often in practice, especially when the predicate A can be expressed by means of some clearly identified attribute a of objects in Ω, ranging on some scale S that is a simple set. For instance, Ω is a set of (possibly numerous) people, a(ω) evaluates the size of ω ∈ Ω, A means "tall", and A is defined on S rather than Ω using the membership function μ_A: S → V; μ_A(a(ω)) is then the degree of tallness of the individual ω. The identification of a membership function on a simple set is a problem in empirical psychometry, which is not especially difficult (see Smithson, 1987; Norwich and Turksen, 1984). Ancestors of membership functions have been suggested by philosophers of vagueness (for example, Black's (1937) consistency profiles) as a reasonable way of capturing the meaning of vague concepts. But note that the expression of membership functions in terms of random sets (Kampe de Feriet, 1982; Goodman and Nguyen, 1985) enables statistical interpretations to live alongside purely psychometric interpretations of fuzzy sets. This dependence of fuzzy logic upon empirical or statistical matters may look disgraceful to fully fledged logicians. In particular, our theory of vagueness offers no "ontological", absolute definition of graded truth. From a philosophical point of view, fuzzy or possibilistic logic may appear to be accidental. But all practical problems are philosophically accidental too, and the purpose of the logic systems in this book is the solving of practical problems rather than the solving of philosophical issues, although Gochet tends to suggest that our classification of logic systems may have some philosophical relevance.
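For illustration, here is a minimal sketch (in Python) of such a membership function on a height scale S in metres. The breakpoints 1.70 and 1.80 are hypothetical values a user might agree on; for a monotone predicate such as "tall" the usual trapezoid degenerates to a ramp.

# The meaning of "tall" defined on a linear height scale S (metres),
# with hypothetical breakpoints agreed on with the user.

def mu_tall(height_m):
    """Degree of membership of a height in the fuzzy set 'tall'."""
    a, b = 1.70, 1.80        # below a: not tall at all; above b: fully tall
    if height_m <= a:
        return 0.0
    if height_m >= b:
        return 1.0
    return (height_m - a) / (b - a)  # linear in between

# mu_A(a(w)): degree of tallness of an individual of height 1.76 m
print(mu_tall(1.76))  # 0.6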
Let us turn to the question of truth-functionality. Sheridan stands strongly against the idea that a logic of vagueness can be truth-functional. First, we have never said that it always is: truth-functionality is preserved only in the presence of complete information, and it fails whenever the available information is not complete, even for standard formulae. Another point is that gradedness of truth is not compatible with Boolean algebra. An algebra of vague propositions necessarily has fewer properties than a Boolean algebra. Here there is a choice about which structural properties we are to give up. Let t(p) be the truth value of the (vague) proposition p. Sheridan thinks that the basic properties of disjunction are monotonicity (t(p ∨ q) ≥ max(t(p), t(q))) and idempotency. Taking truth-functionality for granted leads to t(p ∨ q) = t(p) ⊥ t(q), where ⊥ is continuous, coincides with the logical "or" for binary truth values, is commutative, and is monotonically increasing in the wide sense. If we further assume that x ⊥ 0 = x (i.e. p or "false" = p), then the only possible choice, given Sheridan's requirements of monotonicity and idempotency, is t(p ∨ q) = max(t(p), t(q)). Moreover, idempotency is incompatible with the excluded-middle law (Dubois and Prade, 1984d). If the latter must be preserved then we must drop idempotency, and the only possible solution, up to an isomorphism, becomes t(p ∨ q) = min(1, t(p) + t(q)). These proposals are not arbitrary, but are dictated by the algebraic properties that one wishes to keep. Hence there are several possible algebraic structures for a set of vague propositions, and they are all compatible with the unit interval. This is why truth-functionality can be preserved for vague propositions. But we acknowledge that the truth-functionality assumption is made in order to get a simple theory of vagueness. We make it because it is not self-inconsistent and because it makes computations easy to carry out. We agree with Sheridan that 0.4 red and 0.4 orange may lead to 0.8 red or orange: we could always pick a disjunction operation that satisfies this condition. However, we believe that nobody would state this property in this way. We prefer the approach that first states which algebraic properties of the fuzzy "or" are sensible in a given situation, and then derives the proper class of "or"s accordingly. If this class contradicts the available evidence, then maybe the truth-functionality assumption should be dropped. But we are aware that truth-functionality is here a matter of convenience and is clearly an assumption. See Osherson and Smith (1981, 1982) for a discussion of its limitations from a psychological point of view, and the discussions by Zadeh (1982) and Cohen and Murphy (1985) of the extensionality of the logical combination of vague concepts.
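A quick numerical check of this trade-off, assuming t(¬p) = 1 − t(p); the 0.4/0.4 values echo Sheridan's pencil example.

# The idempotent "or" (max) versus the bounded sum min(1, a + b); both
# coincide with classical disjunction on {0, 1}.

def or_max(a, b):
    return max(a, b)            # idempotent, monotone

def or_bsum(a, b):
    return min(1.0, a + b)      # keeps excluded middle, not idempotent

t_red, t_orange = 0.4, 0.4
print(or_max(t_red, t_orange))   # 0.4: the pencil stays only "fairly" red or orange
print(or_bsum(t_red, t_orange))  # 0.8: "very" red or orange

p = 0.4                          # excluded middle, with t(not p) = 1 - t(p)
print(or_max(p, 1 - p))          # 0.6 < 1: excluded middle fails
print(or_bsum(p, 1 - p))         # 1.0: excluded middle holds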
Let us now turn to possibilistic logic. Cordier raises a number of very interesting issues in her comments. First, why use numbers instead of symbols to express uncertainty? A purely symbolic approach to uncertainty, such as that of Duval and Kodratoff (1986), faces a challenging problem, namely how to combine the symbolic modalities: at the end of a proof one gets a list of modalities to be interpreted as a whole, and to our knowledge there is no guideline as to how this should be done. In contrast, here, uncertainty propagation is done according to the rules of a given theory of uncertainty (whose choice depends upon the nature of the available information). Of course, the degree of uncertainty bearing on a conclusion may not be informative enough, and, as Cordier stresses, one may wish to get the reasons for uncertainty as well. But handling uncertainty is not incompatible with maintaining the hypothetical assumptions under which this uncertainty would be removed. The two tasks are not redundant: uncertainty expresses to what extent information is lacking, while hypothetical reasoning is useful for characterizing what extra information is needed to remove the uncertainty. In that sense, the suggestion of using Truth Maintenance System-like approaches in conjunction with uncertain reasoning is certainly valuable. Note also that in the estimation of the uncertainty of a compound proposition with respect to a given (incomplete) state of information using a fuzzy pattern-matching technique, it is possible not only to compute a possibility and a necessity degree, but also to determine what part of the information needs to be made more precise, and in what manner, in order to come closer to complete certainty.

Another important issue is that of properly interpreting degrees of necessity and possibility, and being able to get them out of the available evidence. As mentioned earlier, possibilistic information usually stems from linguistic information involving vague terms referring to simple sets. Incompleteness and vagueness of the available evidence lead to grades of uncertainty obtained through fuzzy pattern-matching. This is true for elementary facts as well as for rules, since a fuzzy linguistic rule also translates into a possibility distribution. Moreover, the interval between the necessity and the possibility of an assertion p, say [N(p), Π(p)], can be viewed as bounds on an unknown probability, either lower-bounded (if Π(p) = 1) or upper-bounded (if N(p) = 0).
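For concreteness, here is a minimal sketch of how such a pair [N(p), Π(p)] can be read off a possibility distribution over a finite set of interpretations; the distribution values and the set of models of p are hypothetical.

# Reading Pi(p) and N(p) off a possibility distribution pi over a
# finite set of interpretations.

pi = {"w1": 1.0, "w2": 0.7, "w3": 0.2}   # normalized: max value is 1
models_of_p = {"w1", "w2"}               # interpretations where p holds

def possibility(models):
    return max((pi[w] for w in models), default=0.0)

def necessity(models):
    return 1.0 - possibility(set(pi) - models)  # N(p) = 1 - Pi(not p)

print(possibility(models_of_p))  # 1.0  (p cannot be ruled out)
print(necessity(models_of_p))    # 0.8  (not-p is rather impossible)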
Why is possibilistic logic interesting at all compared with probabilistic logic?
(i) Possibilistic logic offers an absolute reference point for expressing ignorance, namely N(p) = N(¬p) = 0. Expressing some belief about p comes down to choosing a point in [0, 1] between certainty (N(p) = 1, N(¬p) = 0) and ignorance (N(p) = 0, N(¬p) = 0). Ignorance cannot be modelled in probability theory, where it is approximated by randomness. Moreover, in the case of randomness one can say nothing about Prob(p) compared with Prob(¬p) unless one knows how many alternatives are offered by p and ¬p. Note that possibility cannot model randomness, i.e. probability theory has its own usefulness, of course. Upper and lower probability systems can model ignorance, and possibility theory can be viewed as the simplest of upper and lower probability systems.
(ii) Possibility theory offers a nice framework for attaching weights of uncertainty to rules "if p then q" in complete accordance with classical logic. An uncertain "if ... then" rule is better expressed by a conditional measure g(q|p) than by the measure of a conditional g(p → q). The quantity N(p → q) is very close to a conditional possibility measure, as explained in Dubois and Prade (1986a). Namely, we have

N(p → q) = N(q|p) ≜ 1 − Π(¬q|p)

as soon as Π(p ∧ q) ≠ Π(p), where Π(q|p) is defined as the greatest solution of Π(p ∧ q) = min(Π(q|p), Π(p)). Hence the necessity of a conditional is close to a definition of conditional necessity, and possibilistic logic, when we restrict the pair (N(p), N(¬p)) to be either (0, 1) or (1, 0), reduces to classical logic. The probability of a conditional is seldom equal to the conditional probability (Prob(q|p) = Prob(p → q) only if both are 1, or if Prob(p) = 1). Hence translating uncertain rules into conditional probabilities does not yield a logic that generalizes classical logic, strictly speaking.

(iii) Possibilistic logic is a quasi-qualitative calculus where numbers are compared and not added or multiplied. Numbers are useful only to model gradedness, and no great precision is required. In contrast, probabilistic logic requires sufficiently precise inputs in order to be able to carry out long inferences that remain informative.
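The min-based conditioning in point (ii) is easy to compute. The following sketch (with hypothetical input values) solves Π(p ∧ q) = min(x, Π(p)) for its greatest solution and derives N(q|p) = 1 − Π(¬q|p).

# Min-based conditioning: Pi(q | p) is the greatest x solving
# Pi(p and q) = min(x, Pi(p)); then N(q | p) = 1 - Pi(not-q | p).

def cond_possibility(pi_p_and_q, pi_p):
    """Greatest x with min(x, pi_p) == pi_p_and_q (needs pi_p_and_q <= pi_p)."""
    return 1.0 if pi_p_and_q == pi_p else pi_p_and_q

def cond_necessity(pi_p_and_not_q, pi_p):
    return 1.0 - cond_possibility(pi_p_and_not_q, pi_p)

# With Pi(p) = 1, Pi(p and q) = 1, Pi(p and not-q) = 0.3:
print(cond_possibility(1.0, 1.0))  # 1.0: q is fully possible given p
print(cond_necessity(0.3, 1.0))    # 0.7: q is fairly certain given p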
Let us now consider the problem of first-order implications. It is clear that N(∀x, P(x) → Q(x)) = α is not the same as ∀x N(P(x) → Q(x)) = α (where P and Q are non-vague predicates, for simplicity). The first expression represents a conjecture that is refuted by finding an x₀ such that P(x₀) → Q(x₀) is false. The other expression is closer to a default rule when α is close to 1, since it means that for all x, "P(x) → Q(x) is true" is almost sure (but there may be exceptions). That is exactly equivalent to saying that when P(x) is true then Q(x) is almost surely true, putting the necessity on Q(x) only. This identity of meaning is reflected by the fact that both approaches, i.e. putting the necessity on the rule or on the rule conclusion, are equivalent. To see this, let M(P) = {x | P(x) is true}. N(Q(x)) = α is expressed by the fuzzy set M(Q_α) defined by (Prade, 1985b)

μ_{M(Q_α)}(x) = 1 if x ∈ M(Q), and 1 − α otherwise, i.e. μ_{M(Q_α)}(x) = max(μ_{M(Q)}(x), 1 − α),

which is the same as the one translating the fuzzy rule P(x) → Q_α(x), where Q_α is the fuzzy predicate whose meaning is defined by μ_{M(Q_α)}.
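A small sketch of this translation, with a hypothetical two-element domain (the crisp memberships and the value of α are illustrative only).

# Stating N(Q(x)) = alpha turns the crisp set M(Q) into the fuzzy set
# M(Q_alpha) with membership max(mu_MQ(x), 1 - alpha).

alpha = 0.9                              # Q is "almost sure" when P holds
mu_MQ = {"sparrow": 1.0, "tweety": 0.0}  # crisp membership in M(Q)

def mu_MQ_alpha(x):
    return max(mu_MQ[x], 1.0 - alpha)

print(mu_MQ_alpha("sparrow"))  # 1.0: a regular instance fits fully
print(mu_MQ_alpha("tweety"))   # 0.1: exceptions keep a residual degree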
[...] clear statistical interpretation (even if the probability values are fuzzily known). This is not really the case for the "most birds fly" example, since one does not know how to count events. If it means "most bird species can fly", performing a statistic makes no sense, because the species are more a matter of definition than objective facts. In that case, we may think that possibilistic logic offers a reasonable framework for automated reasoning with instantiated default information, provided that we can cope with non-monotonicity, i.e. we can devise sophisticated schemes for combining pieces of information with various levels of specificity/generality. See Dubois and Prade (1987c) for a first step in this direction.

Let us end this reply by saying a few more words about our notion of truth. Once again, it is very pragmatic and refers to what we know about the world, and not to the actual world. Our working assumption is that the database contains vague, incomplete information, but no utterly wrong statements. We agree with Haack about the lack of meaning of degrees of truth attached to classical formulae. But the incompleteness of the available information may make us ignorant about truth; we model it by a {0, 1}-valued possibility measure on the set {true, false}, such that Π(true) = Π(false) = 1. Vagueness of the available information leads us to let Π(true) and Π(false) lie in the unit interval. So far, truth is a binary notion. It is only when evaluating the truth of vague statements that grades of truth become meaningful, to model the fact that vagueness pertains to the existence of borderline instances or interpretations. The example "France is hexagonal" mentioned by Gochet is very interesting. It is a typical instance of a word that has a very precise meaning in some contexts (mathematics) and a fuzzy meaning in ordinary talk. "Hexagonal" then refers to an ill-bounded set of shapes that look more or less like the mathematical hexagon. A natural way of evaluating the degree of truth of "France is hexagonal" is to evaluate the relative amount of distortion to which a hexagon must be submitted in order to obtain France's shape. This idea is actually implemented in our team for the purpose of analysing verbal designations of objects in scene analysis (Dubois and Jaulent, 1985).
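As a compact restatement of the distinction between ignorance and vagueness drawn above, a minimal sketch (the numbers are hypothetical).

# Incompleteness modelled by a possibility measure on {true, false}:
# total ignorance gives pi(true) = pi(false) = 1; vagueness lets both
# values slide in the unit interval.

ignorance = {"true": 1.0, "false": 1.0}  # neither value can be ruled out
vague_fit = {"true": 1.0, "false": 0.4}  # borderline interpretations exist

def necessity_of_truth(dist):
    return 1.0 - dist["false"]           # N(true) = 1 - pi(false)

print(necessity_of_truth(ignorance))  # 0.0: no commitment at all
print(necessity_of_truth(vague_fit))  # 0.6: partially certain truth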
Additional references

Black, M. (1937). Vagueness. An exercise in logical analysis. Phil. Sci. 4, 427-455.
Cohen, B. and Murphy, G. (1985). Models of concepts. Cognitive Sci. 8, 27-58.
Dubois, D. and Jaulent, M. C. (1985). Shape understanding via fuzzy models. Proc. 2nd IFAC/IFIP/IFORS/IEA Conf. on Analysis, Design and Evaluation of Man-Machine Systems, Varese, Italy, pp. 302-307.
Dubois, D. and Prade, H. (1984d). Criteria aggregation and ranking of alternatives in the setting of fuzzy set theory. Fuzzy Sets and Decision Analysis (ed. H. J. Zimmermann, L. A. Zadeh and B. R. Gaines), pp. 209-240. TIMS Studies in the Management Sciences Vol. 20.
Dubois, D. and Prade, H. (1987b). On fuzzy syllogisms. Computational Intelligence (to appear).
Dubois, D. and Prade, H. (1987c). Possibility theory and default reasoning. Artificial Intelligence (to appear).
Dummett, M. A. E. (1975). Wang's paradox. Synthese 30, 301-324; also in Truth and Other Enigmas. Duckworth Press, London, 1978.
Duval, B. and Kodratoff, Y. (1986). Automated deduction in an uncertain and inconsistent data basis. Proc. 7th Eur. Conf. on Artificial Intelligence, Brighton, pp. 101-108.
Fine, K. (1975). Vagueness, truth, and logic. Synthese 30, 266-300.
Gochet, P. (1980). Outline of a Nominalist Theory of Propositions, pp. 73-86. Reidel, Dordrecht.
Gochet, P. (1986). Ascent to Truth, pp. 81-83. Philosophia Verlag, Munich.
Gochet, P. (1988). On Sir Alfred Ayer's theory of truth. The Philosophy of A. J. Ayer (ed. L. E. Hahn). The Library of Living Philosophers, Open Court, La Salle, Illinois.
Goguen, J. A. (1967). L-fuzzy sets. J. Math. Anal. Applics 18, 145-174.
Goodman, I. and Nguyen, H. T. (1985). Uncertainty Models for Knowledge Based Systems. North-Holland, Amsterdam.
Haack, S. (1980). Is truth flat or bumpy? Prospects for Pragmatism (ed. D. H. Mellor), pp. 17-18. Cambridge University Press.
Kampe de Feriet, J. (1982). Interpretation of membership functions of fuzzy sets in terms of plausibility and belief. Fuzzy Information and Decision Processes (ed. M. M. Gupta and E. Sanchez), pp. 93-98. North-Holland, Amsterdam.
Kodratoff, Y., Perdrix, H. and Franova, M. (1985). Traitement symbolique du raisonnement incertain. Proc. AFCET Informatique Congrès, Paris, pp. 33-45. AFCET, Paris.
Muskens, R. (1988). Going partial and relational in Montague grammar. Proc. 6th Amsterdam Colloq.
Norwich, A. B. and Turksen, I. B. (1984). A model for the measurement of membership and the consequences of its empirical implementation. Fuzzy Sets and Systems 12, 1-25.
Osherson, D. N. and Smith, E. E. (1981). On the adequacy of prototype theory as a theory of concepts. Cognition 9, 35-58.
Osherson, D. N. and Smith, E. E. (1982). Gradedness and conceptual combination. Cognition 12, 299-318.
Smithson, M. (1987). Fuzzy Set Analysis for Behavioral and Social Sciences. Springer-Verlag, Berlin.
Zadeh, L. A. (1982). A note on prototype theory and fuzzy sets. Cognition 12, 291-297.