BRIAN SKYRMS
DISCOVERING “WEIGHT, OR THE VALUE OF KNOWLEDGE”
I. INTRODUCTION

In the summer of 1986, I went to Cambridge to look through Ramsey’s unpublished papers. It turned out that the Ramsey archives existed only on microfilm at Cambridge – the originals having been purchased at auction by Nicholas Rescher for the Archives of Logical Positivism at the University of Pittsburgh. Hugh Mellor kindly arranged for me to be allowed to study the microfilm. I still have my temporary library card as a memento. Later, I was able to see the originals at the University of Pittsburgh.

I was looking for something, but found something different. What I found of most importance consisted of two manuscript pages, the first of which was entitled “Weight, or the Value of Knowledge.” They were not consecutive in the numbering that had been given to these unpublished manuscripts, but they clearly went together. A few years later, Nils-Eric Sahlin published a transcription of these in the British Journal for the Philosophy of Science [Ramsey (1990)]. There is also a transcription contained in Maria Carla Galavotti’s collection of Ramsey’s papers [Ramsey (1991)]. I included a facsimile of the second manuscript page in my book, The Dynamics of Rational Deliberation.

Today I will tell you something about what I was looking for, what I found, and the relationship between the two. This is an old story, but I will add a few new twists.

II. WHAT I WAS LOOKING FOR

I was looking to see whether Ramsey had any discussion of coherence of beliefs across time – diachronic coherence – or of the related question of coherent rules for updating belief. These questions were the focus of intense philosophical discussion at the time (and to some extent still are). Ramsey says just enough in “Truth and Probability” to whet the imagination, and to hold out the promise of something more. Ramsey introduced the question of coherence to the discussion of degrees of belief, noting that violation of the laws of probability allows a Dutch book:
55 M. C. Galavotti (ed.), Cambridge and Vienna: Frank P. Ramsey and the Vienna Circle, 55–65. © 2006 Springer. Printed in the Netherlands.
If anyone’s mental condition violated these laws, his choice would depend on the precise form in which the options were offered to him, which would be absurd. He could have a book made against him by a cunning bettor and would then stand to lose in any event.
The converse, that coherent degrees of belief preclude a Dutch book, is stated two paragraphs later:
Having any definite degree of belief implies a certain measure of consistency, namely willingness to bet on a given proposition at the same odds for any stake, the stakes being measured in terms of ultimate values. Having degrees of belief obeying the laws of probability implies a further measure of consistency, namely such a consistency between the odds acceptable on different propositions as shall prevent a book being made against you.
Ramsey was writing a paper for a philosophy club, and did not provide the mathematical details, but they were soon independently supplied by de Finetti. Regarding coherent change in degrees of belief, Ramsey has only this to say:
Obviously, if p is the fact observed, my degree of belief in q after the observation should be equal to my degree of belief in q given p before, or by the multiplication law to the quotient of my degree of belief in p&q by my degree of belief in p. When my degrees of belief change in this way we can say that they have been changed consistently by my observation.
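In modern notation, the rule in this passage can be written as follows. The symbols P_old and P_new (degrees of belief before and after the observation) are our labels, not Ramsey's, and the rule presupposes that the prior degree of belief in p is positive.

```latex
P_{\text{new}}(q) \;=\; P_{\text{old}}(q \mid p)
\;=\; \frac{P_{\text{old}}(p \,\&\, q)}{P_{\text{old}}(p)},
\qquad P_{\text{old}}(p) > 0 .
```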
He is saying that belief change by conditioning on “the fact observed” is coherent belief change. Again, proofs were only later supplied by others. [See Freedman and Purves (1969), Putnam (1975), Teller (1973), and the review in Lane and Sudderth (1984).] But what about the case of uncertain evidence, where there is no clear-cut “fact observed” within the domain of an individual’s probability function? [Jeffrey (1968), Armendt (1980), Diaconis and Zabell (1982)] Did Ramsey even consider the possibility? And what exactly is the coherence claim in terms of which conditioning on the fact observed constitutes coherent belief change in the cases where it does apply?

III. WHAT I FOUND

In those two manuscript pages I found an account of the Value of Knowledge – that is to say, the expected utility of pure, cost-free information. The word “Weight” in the heading of the first page is a clear reference to Keynes’ Treatise on Probability. Keynes thought that degrees of belief needed two dimensions: probability, which showed how likely an event was judged to be, and weight, which measured the quantity of evidential support behind the probability judgement. The following passages indicate both the general nature of Keynes’ concerns and the degree to which he has misconceived the problem:
The second difficulty... is the neglect of the ‘weights’ in the conception of ‘mathematical expectation.’ ... if two probabilities are equal in degree, ought we, in choosing our course of action, to prefer that one that is based on the greater body of knowledge? The question appears to me to be highly perplexing, and it is difficult to say much that is useful about it. ... Bernoulli’s maxim that in reckoning a probability we must take into account all the information which we have, even when reinforced by Locke’s maxim that we must get all the information that we can, does not seem completely to meet the case. [Keynes (1921) p. 313]
Ramsey clearly saw the answer. Thirty years after Keynes, Bernoulli’s maxim was restated as Carnap’s “total evidence condition” [Carnap (1950)], and the correct analysis was presented to the philosophical community by I. J. Good (1967). Good cites the treatment of the expected value of new information in Raiffa and Schlaifer (1961). The basic case of pure, cost-free information is already treated by Savage (1954) in Ch. 7 and appendix 2, but Ramsey’s unpublished note anticipates Savage by three decades.

The basic principles at work are easy to illustrate in the case of two acts. Suppose that you are going to either buy or not buy an item on eBay. You are inclined to buy it. But you have the option of postponing your decision for a minute and reading others’ reports of past transactions with this individual. This information costs nothing more than a mouse click. This could bring negative information about the reliability of the seller that would cause you to forgo the purchase, or it might bring information that would confirm your predisposition to buy. To simplify the exposition, we suppose that there are only two possible pieces of information that may come up if you look, one positive and one negative. Then the expected values of buying and of not buying can be plotted as functions of the probability that the information is positive – as shown in figure 1.
Figure 1
If you look and get positive information you move to the right side; if you look and get negative information you move to the left side; before looking you are somewhere in the middle. The lines plotting the expected values of the acts are straight lines because they are averages. Which act is optimal depends on the probabilities of good and bad information. If we take the optimal act at every value and plot it we get the expected value of the Bayes (optimal) act as a function of the probability of good information, as shown in bold in figure 2. You can see that its shape is that of a convex function (it dishes down in the middle). At any point, your expected value of choosing the act that looks best to you in that informational state is just the value of this function.
Figure 2
What is the expected value now, of clicking the mouse, getting the additional information and then deciding? It is a point falling on the dashed line connecting the value of buying with good information on the right, and not buying with bad information on the left in figure 3. It is obviously greater than the value of acting now without the additional information, because the dashed line falls above the bold line except at the endpoints. That is because the dashed line, being an average, is straight, while the bold line – as noted previously – is convex. One can even measure the expected value of information at any point, as the difference between the two curves. The principles illustrated in this simple example hold in more complex cases. There might, for instance, be an infinite number of possible acts, resulting from the setting of some continuously variable control parameter. That would make no real difference in the argument. The expected utility of the Bayes act would still be convex, and the argument would still work. This is the case addressed by
Ramsey in his two pages of notes. The second page contains a diagram that you could immediately recognize by its resemblance to figure 3.
Figure 3
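The two-act argument can be checked numerically. The following is a minimal sketch, assuming illustrative utilities that are not given in the text (10 for buying from a reliable seller, -20 for buying from an unreliable one, 0 for not buying); the function names are hypothetical. It computes the bold curve (value of the Bayes act), the dashed line (value of looking first), and their difference.

```python
# Assumed, purely illustrative utilities; the text gives no numbers.
U_BUY_GOOD = 10.0   # buy, and the information would have been positive
U_BUY_BAD = -20.0   # buy, and the information would have been negative
U_PASS = 0.0        # do not buy (same value either way)


def eu_buy(p):
    """Expected utility of buying; a straight line in p."""
    return p * U_BUY_GOOD + (1 - p) * U_BUY_BAD


def eu_pass(p):
    """Expected utility of not buying; constant in p."""
    return U_PASS


def bayes_value(p):
    """Value of the best act at probability p (the bold, convex curve)."""
    return max(eu_buy(p), eu_pass(p))


def value_with_information(p):
    """Value of looking first: the straight line joining the endpoint values."""
    return p * bayes_value(1.0) + (1 - p) * bayes_value(0.0)


def value_of_information(p):
    """Vertical gap between the dashed line and the bold curve."""
    return value_with_information(p) - bayes_value(p)


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.8, 1.0):
        print(p, bayes_value(p), value_with_information(p), value_of_information(p))
```

Under these assumed numbers the gap is zero at the endpoints and positive in between, which is the pattern the figures are described as showing.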
Ramsey goes a bit further. He supposes that what is to be learned does not take us all the way to the left or right side of the diagram but only part way in each direction. This might be thought of as a model of less than ideal evidence, but the evidence learned is still modeled as evidence that is learned with certainty. The effect of learning the imperfect evidence is assumed to be a shift to a new probability by conditioning on what was learned. [See my (1990) book for a fuller discussion.]

IV. DIACHRONIC COHERENCE

We can, by now, say quite a bit about the topic I was looking for but didn’t find – the question of diachronic coherence of degrees of belief. I will take an indirect, but scenic route to the central result. There is a close connection between Bayesian coherence arguments and the theory of arbitrage. [See Shin (1992).] Suppose we have a market in which a finite number of assets are bought and sold. Assets can be anything – stocks and bonds, pigs and chickens, apples and oranges. The market determines a unit price for each asset, and this information is encoded in a price vector x = <x1, ..., xn>. You may trade these assets today in any (finite) quantity. You are allowed to take a short position in an asset, that is to say that you sell it today for delivery tomorrow. Tomorrow, the assets may have different prices, y1, ..., ym. To keep things simple, we initially suppose that there are a finite number of possibilities for tomorrow’s price vector. A portfolio,
p, is a vector of real numbers that specifies the amount of each asset you hold. Negative numbers correspond to short positions. You would like to arbitrage the market, that is to construct a portfolio today whose cost is negative (you can take out money) and such that tomorrow its value is non-negative (you are left with no net loss), no matter which of the possible price vectors is realized. The fundamental theorem of asset pricing states that you can arbitrage the market if and only if the price vector today falls outside the convex cone spanned by the possible price vectors tomorrow. [If we were to allow an infinite number of states tomorrow we would have to substitute the closed convex cone generated by the possible future price vectors.]

The value of a portfolio, p, according to a price vector, y, is the sum over the assets of quantity times price, that is the dot product of the two vectors. If the vectors are orthogonal the value is zero. If they make an acute angle, the value is positive; if they make an obtuse angle, the value is negative. An arbitrage portfolio, p, is one such that p•x is negative and p•yi is non-negative for each possible yi; p makes an obtuse angle with today’s price vector and is orthogonal to, or makes an acute angle with, each of the possible price vectors tomorrow. If today’s price vector x is outside the convex cone spanned by the yi, then there is a hyperplane which separates x from that cone. An arbitrage portfolio can be found as a vector normal to the hyperplane. It has zero value according to a price vector on the hyperplane, negative value according to today’s prices and non-negative value according to each possible price tomorrow. On the other hand, if today’s price vector is in the convex cone spanned by tomorrow’s possible price vectors, then (by Farkas’ lemma) no arbitrage portfolio is possible. The matter is easy to understand visually in simple cases. Suppose the market deals in only two goods, apples and oranges.
One possible price vector tomorrow is $1 for an apple and $1 for an orange. Another is $2 for an apple and $1 for an orange. These two possibilities generate a convex cone, as shown in figure 4. (We could add lots of intermediate possibilities, but that wouldn’t make any difference to what follows.)
Figure 4
Let’s suppose that today’s price vector lies outside the convex cone, say apples at $1, oranges at $3. Then it can be separated from the cone by a hyperplane (in 2 dimensions, a line), for example the line oranges = 2 apples, as shown in figure 5.
Figure 5
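Whether a price vector lies in the cone can be checked directly in this two-good case. The sketch below is one simple way to do it, assuming price vectors are written as (apple price, orange price) and that the two generating vectors are linearly independent; it solves x = a*y1 + b*y2 by Cramer's rule and tests whether both coefficients are non-negative. The function names are hypothetical.

```python
def cone_coefficients(x, y1, y2):
    """Solve x = a*y1 + b*y2 for (a, b) in two dimensions by Cramer's rule."""
    det = y1[0] * y2[1] - y1[1] * y2[0]
    if det == 0:
        raise ValueError("y1 and y2 must be linearly independent")
    a = (x[0] * y2[1] - x[1] * y2[0]) / det
    b = (y1[0] * x[1] - y1[1] * x[0]) / det
    return a, b


def in_cone(x, y1, y2):
    """True if x is a non-negative combination of y1 and y2."""
    a, b = cone_coefficients(x, y1, y2)
    return a >= 0 and b >= 0


if __name__ == "__main__":
    y1, y2 = (1, 1), (2, 1)            # tomorrow's possible prices from the text
    print(in_cone((1, 3), y1, y2))     # today's prices in the example
    print(in_cone((3, 2), y1, y2))     # an assumed price vector inside the cone
```

For today's prices of $1 and $3 one coefficient comes out negative, so the vector is outside the cone and an arbitrage should exist.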
Normal to that hyperplane we find the vector <2 apples, -1 orange>, as in figure 6.
Figure 6
This should be an arbitrage portfolio, so we sell one orange short and use the proceeds to buy 2 apples. But at today’s prices, an orange is worth $3, so we can pocket a dollar, or – if you prefer – buy 3 apples and eat one. Tomorrow we have to deliver an orange. If tomorrow’s prices were to fall exactly on the hyperplane, we would be covered. We could sell our two apples and use the proceeds to buy the orange. But in our example, things are even better. The worst that can happen tomorrow is that apples and oranges trade 1-to-1, so we might as well eat another apple and use the remaining one to cover our obligation for an orange.
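The arithmetic of this example can be verified in a few lines. This is a small sketch using the numbers from the text, with vectors written as (apples, oranges); the helper names `value` and `is_arbitrage` are ours.

```python
def value(portfolio, prices):
    """Dot product: sum over assets of quantity times price."""
    return sum(q * p for q, p in zip(portfolio, prices))


def is_arbitrage(portfolio, today, tomorrow_prices):
    """Negative cost today and non-negative value under every price tomorrow."""
    return value(portfolio, today) < 0 and all(
        value(portfolio, y) >= 0 for y in tomorrow_prices
    )


today = (1, 3)                 # apples $1, oranges $3
tomorrow = [(1, 1), (2, 1)]    # the two possible price vectors tomorrow
portfolio = (2, -1)            # long 2 apples, short 1 orange

if __name__ == "__main__":
    print(value(portfolio, today))                  # cost today
    print([value(portfolio, y) for y in tomorrow])  # value in each case tomorrow
    print(value(portfolio, (1, 2)))                 # a price vector on the line oranges = 2 apples
    print(is_arbitrage(portfolio, today, tomorrow))
```

The cost today is -1 (the dollar pocketed), the value on the separating line is 0, and the value tomorrow is 1 or 3, so the portfolio meets the definition of an arbitrage.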
In the foregoing, assets could be anything. As a special case they could be tickets paying $1 if p, nothing otherwise, for various contingent propositions, p. The price of such a ticket can be thought of as the market’s collective degree-of-belief or subjective probability for p. We have not said anything about the market except that it will trade arbitrary quantities at the market price. The market could be implemented by a single individual – the bookie of the familiar Bayesian metaphor. Without yet any commitment to the mathematical structure of degrees of belief, or to the nature of belief revision, we can say that arbitrage-free degrees of belief today must fall within the convex cone of degrees of belief tomorrow. This is the fundamental diachronic coherence requirement.

We might go further and suppose that tomorrow we learn the truth. In that case a ticket worth $1 if p, nothing otherwise, would be worth either $1 or $0 depending on whether we learn that p is true or false. By itself this does not tell us a great deal, only that arbitrage-free prior degrees of belief must be non-negative.

Now suppose that we have three assets being traded which have a Boolean logical structure. There are tickets worth $1 if p, nothing otherwise; $1 if q, nothing otherwise; and $1 if p or q, nothing otherwise. Furthermore, p and q are incompatible. This additional structure constrains the possible price vectors tomorrow, so that the convex cone becomes the two-dimensional object z = x + y, with x and y non-negative, as shown in figure 7. Arbitrage-free degrees of belief must be additive. Additivity of subjective probability comes from the additivity of truth value and the fact that additivity is preserved under convex combination. One can then complete the coherence argument for probability by noting that coherence requires a ticket that pays $1 if a tautology is true to have the value $1.
Figure 7
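The additivity requirement can be illustrated with a small sketch. It assumes prices are listed in the order (ticket on p, ticket on q, ticket on p-or-q) and that tomorrow the truth is learned, so the only possible price vectors are the three truth-value patterns below. The function `book_against` and the sample prices are hypothetical.

```python
# Possible ticket values tomorrow when p and q are incompatible:
# p true, q true, or neither true.
OUTCOMES = [(1, 0, 1), (0, 1, 1), (0, 0, 0)]


def value(portfolio, prices):
    """Sum over tickets of quantity times price."""
    return sum(n * c for n, c in zip(portfolio, prices))


def book_against(prices, tol=1e-9):
    """Return an arbitrage portfolio if the prices are non-additive, else None."""
    price_p, price_q, price_p_or_q = prices
    gap = price_p_or_q - (price_p + price_q)
    if gap > tol:        # the disjunction is overpriced: sell it, buy the parts
        return (1, 1, -1)
    if gap < -tol:       # the disjunction is underpriced: buy it, sell the parts
        return (-1, -1, 1)
    return None


if __name__ == "__main__":
    prices = (0.25, 0.5, 1.0)                         # assumed non-additive prices
    book = book_against(prices)
    print(book, value(book, prices))                  # negative cost today
    print([value(book, out) for out in OUTCOMES])     # zero in every outcome
    print(book_against((0.25, 0.5, 0.75)))            # additive prices: no book
```

In both non-additive cases the portfolio returned has negative cost today and value exactly zero whichever outcome is realized, which is what the cone condition predicts.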
Notice that from this point of view the synchronic Dutch books are really special cases of diachronic arguments. After all, you really do need the time when you find out the truth for the synchronic argument to be complete. (This point has been raised by some as an objection to the application of subjective probability to the confirmation of scientific laws.) If anything, the assumption that there is a time when the truth is revealed is a much stronger assumption than anything that preceded it in this development. (One who rejects this assumption might reject additivity, but still require degrees of belief today to fall within the convex cone spanned by degrees of belief tomorrow.)

Today the market trades tickets that pay $1 if pi, nothing otherwise, where the pi are some assertions about the world. All sorts of news comes in and tomorrow the price vector may realize a number of different possibilities. (We have not, at this point, imposed any model of belief change.) The price vector for these tickets tomorrow is itself a fact about the world, and there is no reason why we could not have trade in tickets that pay off $1 if tomorrow’s price vector is p, or if tomorrow’s price vector is in some set of possible price vectors, for the original set of propositions. The prices of these tickets represent the market’s subjective probabilities about its subjective probabilities tomorrow. Some philosophers have been suspicious about such entities, but they arise quite naturally. And in fact, they may be less problematic than the first-order probabilities over which they are defined. The first-order propositions, pi, could be such that their truth value might or might not ever be settled. But the question of tomorrow’s price vector for unit wagers over them is settled tomorrow. Coherent probabilities of tomorrow’s probabilities should be additive no matter what.
Let us restrict ourselves to the case where we eventually do find out the truth about everything [perhaps on Judgment Day], so degrees of belief today and tomorrow are genuine probabilities. We can now consider tickets that are worth $1 if the probability tomorrow of p = a and p, nothing otherwise, as well as tickets that are worth $1 if the probability tomorrow of p = a, nothing otherwise. These tickets are logically related. Projecting to the 2 dimensions that represent these tickets, we find that there are only two possible price vectors tomorrow. Either the probability tomorrow of p is not equal to a, in which case both tickets are worth nothing tomorrow, or the probability tomorrow of p is equal to a, in which case the former ticket has a price of $a and the latter has a price of $1. The cone spanned by these two vectors is just a ray, as shown in figure 8. So today, the ratio of these two probabilities (provided they are well-defined) is a. In other words, today the conditional probability of p, given probability tomorrow of p = a, is a. It then follows that to avoid a Dutch book, probability today must be the expectation of probability tomorrow. [See Goldstein (1983) and van Fraassen (1984).]
Figure 8
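The step from the ratio condition to the expectation claim can be spelled out as follows. This is a sketch under the simplifying assumption that tomorrow's probability of p takes only finitely many values a; the notation P for today's probability and P_tom for tomorrow's is ours. The second line uses the ray condition: today's price of the ticket on "p and P_tom(p) = a" is a times today's price of the ticket on "P_tom(p) = a".

```latex
\begin{aligned}
P(p) &= \sum_{a} P\bigl(p \wedge [P_{\mathrm{tom}}(p) = a]\bigr) \\
     &= \sum_{a} a \, P\bigl(P_{\mathrm{tom}}(p) = a\bigr) \\
     &= E\bigl[P_{\mathrm{tom}}(p)\bigr].
\end{aligned}
```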
V. DIACHRONIC COHERENCE AND THE VALUE OF KNOWLEDGE

The traditional setting for the Value of Knowledge theorem is one in which it is assumed that experience delivers up an evidence statement, one of a set of possible evidence statements which partition the space of possibilities. The decision maker then adopts new degrees of belief equal to the old degrees of belief conditional on the evidence. If a decision maker believes that the impending learning situation answers to this description, then his probability today will be his expectation of his probability tomorrow.

But probability today can well be the expectation of probability tomorrow in a far less structured learning situation. Following Dick Jeffrey [1968], we should pay attention to the fact that evidence may not arrive with certainty. The decision maker may not have in her grasp the observation sentences required to implement that classical model – or even if she has them, she may lack the probabilities conditional on them to implement probability change by conditioning on the evidence. In these cases too, diachronic coherence still has a bite. Probability today must still be the current expectation of probability tomorrow.

This happens to be all that is required to prove the theorem that pure, cost-free information has non-negative expected value. This was first shown by Paul Graves (1989) in the context of a discussion of Jeffrey’s probability kinematics. The following proof highlights the essential features of our example, and generalizes smoothly to more complicated cases. Let B(p) be the expected utility of the Bayes act according to probability p. We write E for current expectation. Because of the convexity of B (by Jensen’s inequality):

E[B(probability after learning)] >= B[E(probability after learning)]

By diachronic coherence we can replace E(probability after learning) with (probability before learning). So,

E[B(probability after learning)] >= B[probability before learning]
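The inequality can be checked numerically in a case where the new probabilities need not come from conditioning on an evidence statement. The sketch below assumes a single proposition, two acts with illustrative utilities, and an assumed distribution over tomorrow's probability; the only constraint imposed is that today's probability equals the expectation of tomorrow's. All names and numbers are hypothetical.

```python
def bayes_value(p, acts):
    """B(p): the best expected utility over acts at probability p.

    Each act is a pair (utility if the proposition is true, utility if false).
    """
    return max(p * u_true + (1 - p) * u_false for u_true, u_false in acts)


def expectation(f, distribution):
    """Current expectation of f over (value, weight) pairs."""
    return sum(weight * f(value) for value, weight in distribution)


# Assumed acts: "buy" and "do not buy".
ACTS = [(10.0, -20.0), (0.0, 0.0)]

# Assumed distribution over tomorrow's probability: (probability, weight).
POSTERIORS = [(0.9, 0.5), (0.5, 0.25), (0.3, 0.25)]

# Diachronic coherence: today's probability is the expectation of tomorrow's.
prior = expectation(lambda q: q, POSTERIORS)

uninformed = bayes_value(prior, ACTS)
informed = expectation(lambda q: bayes_value(q, ACTS), POSTERIORS)

if __name__ == "__main__":
    print(prior, uninformed, informed, informed >= uninformed)
```

With these assumed numbers the prior is 0.65, acting now is worth 0, and deciding after learning is worth 3.5 in expectation, consistent with the inequality.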
In other words, ex ante an informed decision is surely at least as good as, and perhaps better than, an uninformed one. Diachronic coherence implies the value of knowledge.

REFERENCES

Armendt, B. (1980) “Is There a Dutch Book Theorem for Probability Kinematics?” Philosophy of Science 47: 583-588.
Carnap, R. (1950) Logical Foundations of Probability. Chicago: University of Chicago Press.
de Finetti, B. (1970) Teoria Della Probabilità v. I. Giulio Einaudi editori: Torino, tr. as Theory of Probability by Antonio Machi and Adrian Smith (1974) Wiley: New York.
Diaconis, P. and Zabell, S. (1982) “Updating Subjective Probability” Journal of the American Statistical Association 77: 822-830.
Freedman, D.A. and R.A. Purves (1969) “Bayes’ Method for Bookies” Annals of Mathematical Statistics 40: 1177-1186.
Graves, P. (1989) “The Total Evidence Principle for Probability Kinematics” Philosophy of Science 56: 317-324.
Goldstein, M. (1983) “The Prevision of a Prevision” Journal of the American Statistical Association 78: 817-819.
Good, I.J. (1967) “On the Principle of Total Evidence” British Journal for the Philosophy of Science 17: 319-321.
Jeffrey, R. (1968) “Probable Knowledge” In The Problem of Inductive Logic ed. I. Lakatos. Amsterdam: North Holland.
Keynes, J.M. (1921) A Treatise on Probability. Harper Torchbook edition (1962). New York: Harper and Row.
Lane, D.A. and Sudderth, W. (1984) “Coherent Predictive Inference” Sankhya, ser. A, 46: 166-185.
Levi, I. (2002) “Money Pumps and Diachronic Dutch Books” Philosophy of Science 69 [PSA 2000 ed. J.A. Barrett and J.M. Alexander] S235-S264.
Putnam, H. (1975) “Probability Theory and Confirmation” in Mathematics, Matter and Method. Cambridge: Cambridge University Press.
Raiffa, H. and R. Schlaifer (1961) Applied Statistical Decision Theory. Boston: Harvard School of Business Administration.
Ramsey, F.P. (1990) “Weight or the Value of Knowledge” Transcribed by N.-E. Sahlin. The British Journal for the Philosophy of Science 41: 1-3.
Ramsey, F.P. (1991) Notes on Philosophy, Probability and Mathematics. Edited by Maria Carla Galavotti. Bibliopolis: Napoli.
Shin, H.S. (1992) “Review of The Dynamics of Rational Deliberation” Economics and Philosophy 8: 176-183.
Skyrms, B. (1990) The Dynamics of Rational Deliberation. Cambridge, Mass.: Harvard University Press.
Teller, P. (1973) “Conditionalization and Observation” Synthese 26: 218-258.
van Fraassen, B. (1984) “Belief and the Will” Journal of Philosophy 81: 235-256.
Logic and Philosophy of Science University of California, Irvine 3151 Social Science Plaza Irvine, California U.S.A.