MATHEMATICAL METHODS IN SAMPLE SURVEYS
Series on Multivariate Analysis • Vol. 3
MATHEMATICAL METHODS IN SAMPLE SURVEYS
Howard G. Tucker, University of California, Irvine
World Scientific
Singapore • New Jersey • London • Hong Kong
SERIES ON MULTIVARIATE ANALYSIS
Editor: M. M. Rao
Published
Vol. 1: Martingales and Stochastic Analysis (J. Yeh)
Vol. 2: Multidimensional Second Order Stochastic Processes (Y. Kakihara)
Forthcoming
Convolution Structures and Stochastic Processes (R. Lasser)
Topics in Circular Statistics (S. R. Jammalamadaka and A. SenGupta)
Abstract Methods in Information Theory (Y. Kakihara)
Published by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 912805 USA office: Suite IB, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Library of Congress Cataloging-in-Publication Data Tucker, Howard G. Mathematical methods in sample surveys / Howard G. Tucker. p. cm. — (Series on multivariate analysis : vol. 3) Includes bibliographical references (p. - ) and index. ISBN 9810226179 1. Sampling (Statistics) I. Title. II. Series. QA276.6.T83 1998 519.5*2--dc21 98-29452 CIP
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
First published 1998 Reprinted 2002
Copyright © 1998 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
Printed in Singapore by Uto-Print
Preface

As the title of this book suggests, it is a textbook about some mathematical methods in sample surveys. It is not about the nuts and bolts of setting up a sample survey, but it does introduce students (or readers) to some basic methodology of doing sample surveys. The mathematics is both elementary and rigorous, and it requires as a prerequisite the satisfactory experience of one or two years of university mathematics courses. It is suitable for a one year junior-senior level course for mathematics and statistics majors; it is also suitable for students in the social sciences who are not handicapped by a fear of proofs in mathematics. It requires no previous knowledge of statistics, and it could actually serve as both an intuitive and mathematically rigorous introduction to statistics. A sizable part of the book covers only those topics in discrete probability that are needed for the sampling methods treated here. Topics in sampling that are covered in depth include simple random sampling with and without replacement, sampling with unequal probabilities, various linear relationships, stratified sampling, cluster sampling and two stage sampling.

There is just enough material included here for a one year undergraduate course, and it has been used as such at the University of California at Irvine for the last twenty years. The first five chapters cover the discrete probability needed for the next six chapters; these can be covered in an academic quarter. It should be pointed out that a usual one quarter course in discrete probability cannot replace what is developed in these five chapters. For one thing, considerable emphasis on working with multivariate discrete densities was needed because of the dependence that arises when the sampling is done without replacement. Also the material on conditional expectation and conditional variance and conditional covariance as random variables is rarely, if at all, treated at the elementary level as it is here. It is this body of results that is so important in developing the material in the sample survey
part of the book and without any handwaving. This is particularly true for Chapters 7 through 11. It should also be stated that there is no fat in Chapters 1 through 5. Indeed, the topics covered in these chapters were not settled upon until the material in Chapters 6 through 11 was finally in place. Indeed, great care was taken to insure that Chapters 1 through 5 contained the minimal amount of material needed for the remaining chapters.

There is no doubt as to the importance of the topics covered in this text for students specializing in statistics and biostatistics. Awareness of them is also important for students in the social sciences and in the various areas of business administration. But I would like to include some comments on the importance of a course based on this text for students majoring in pure mathematics. Except for the unproved central limit theorem in Chapter 5 (which is not invoked in the proofs of any of the results following that chapter), this text can be claimed to be an example of an undergraduate course that teaches utmost mathematical rigor. What is more, the development is a vertical one, and very few of the chapters can be taken out of order. I call everyone's attention to Chapter 4 where results on conditional expectation and conditional variance as random variables are developed. In this chapter conditional expectation is defined as a number and as a random variable. As a random variable, all properties that are usually obtained by a certain amount of measure-theoretic prowess elsewhere are here obtained by rather elementary methods. In addition, in this setting basic results are obtained on conditional variance and conditional covariance which culminate with the Rao-Blackwell theorem.

I have two hopes connected with this text and the course it serves. One hope is that the student who is primarily applications oriented will appreciate and enjoy the mathematical ideas behind the problems of estimation in sample surveys. At the same time I hope that those who are primarily oriented in the direction of pure and abstract mathematics will see that one can keep this orientation and at the same time enjoy how well it touches on real life.

I wish to express my appreciation to Mrs. Mary Moore who did the original LaTeX typesetting for almost all of this document. Professors Mark Finkelstein and Jerry A. Veeh contributed greatly to my entrance
into the age of computer typesetting; indeed, the completion of this document might never have taken place without their help. This book is dedicated to my wife, Marcia. Howard G. Tucker Irvine, California November 20, 1997
Contents

1  Events and Probability
   1.1  Introduction to Probability
   1.2  Combinatorial Probability
   1.3  The Algebra of Events
   1.4  Probability
   1.5  Conditional Probability

2  Random Variables
   2.1  Random Variables as Functions
   2.2  Densities of Random Variables
   2.3  Some Particular Distributions

3  Expectation
   3.1  Properties of Expectation
   3.2  Moments of Random Variables
   3.3  Covariance and Correlation

4  Conditional Expectation
   4.1  Definition and Properties
   4.2  Conditional Variance

5  Limit Theorems
   5.1  The Law of Large Numbers
   5.2  The Central Limit Theorem

6  Simple Random Sampling
   6.1  The Model
   6.2  Unbiased Estimates for Y and Ȳ
   6.3  Estimation of Sampling Errors
   6.4  Estimation of Proportions
   6.5  Sensitive Questions

7  Unequal Probability Sampling
   7.1  How to Sample
   7.2  WR Probability Proportional to Size Sampling
   7.3  WOR Probability Proportional to Size Sampling

8  Linear Relationships
   8.1  Linear Regression Model
   8.2  Ratio Estimation
   8.3  Unbiased Ratio Estimation
   8.4  Difference Estimation
   8.5  Which Estimate? An Advanced Topic

9  Stratified Sampling
   9.1  The Model and Basic Estimates
   9.2  Allocation of Sample Sizes to Strata

10 Cluster Sampling
   10.1  Unbiased Estimate of the Mean
   10.2  The Variance
   10.3  An Unbiased Estimate of Var(Y)

11 Two-Stage Sampling
   11.1  Two-Stage Sampling
   11.2  Sampling for Non-Response
   11.3  Sampling for Stratification

A  The Normal Distribution

Index
Chapter 1

Events and Probability

1.1 Introduction to Probability
The notion of the probability of an event may be approached by at least three methods. One method, perhaps the first historically, is to repeat an experiment or game (in which a certain event might or might not occur) many times under identical conditions and compute the relative frequency with which the event occurs. This means: divide the total number of times that the specific event occurs by the total number of times the experiment is performed or the game is played. This ratio is called the relative frequency and is really only an approximation of what would be considered as the probability of the event. For example, if one tosses a penny 25 times, and if it comes up heads exactly 13 times, then we would estimate the probability that this particular coin will come up heads when tossed is 13/25 or 0.52. Although this method of arriving at the notion of probability is the most primitive and unsophisticated, it is the most meaningful to the practical individual, in particular, to the working scientist and engineer who have to apply the results of probability theory to real-life situations. Accordingly, whatever results one obtains in the theory of probability and statistics, one should be able to interpret them in terms of relative frequency. A second approach to the notion of probability is from an axiomatic point of view. That is, a minimal list of axioms is set down which assumes certain properties of probabilities. From this minimal set of assumptions
the further properties of probability are deduced and applied. A third approach to the notion of probability is limited in application but is sufficient for our study of sample surveys. This approach is that of probability in the "equally likely" case. Let us consider some game or experiment which, when played or performed, has among its possible outcomes a certain event E. For example, in tossing a die once, the event E might be: the outcome is an even number. In general, we suppose that the experiment or game has a certain number of mutually exclusive "equally likely" outcomes. Let us further suppose that a certain event E can occur in any one of a specified number of these "equally likely" outcomes. Then the probability of the event is defined to be the number of "equally likely" ways in which the event can occur divided by the total number of possible "equally likely" outcomes. It must be emphasized here that the number of equally likely ways in which the event can occur must be from among the total number of equally likely outcomes. For example, if, as above, the experiment or game is the single toss of a fair die in which the "equally likely" outcomes are the numbers {1, 2, 3, 4, 5, 6}, and if the event E considered is that the outcome is an even number, i.e., is 2, 4 or 6, then the probability of E here is defined to be 3/6 or 1/2. This approach is limited, as was mentioned above, because in many games and experiments the possible outcomes are not equally likely. The probability model used in this course is the "equally likely" model.

EXERCISES

1. A (possibly loaded) die was tossed 150 times. The number 1 came up 27 times, 2 came up 26 times, 3 came up 24 times, 4 came up 20 times, 5 came up 29 times and 6 came up 24 times.
a) Compute the relative frequency of the event that on the toss of this die the outcome is 1.
b) Find the relative frequency of the event that the outcome is even.
c) Find the relative frequency of the event that the outcome is not less than 5.
2. Twenty numbered tags are in a hat. The number 1 is on 7 of the tags, the number 2 is on 5 of the tags, and the number 3 is on 8 of the tags. The experiment is to stir the tags without looking and to select one tag "at random". a) What are the total number of equally likely outcomes of the experiment? b) From among these 20 equally likely outcomes what is the total number of ways in which the outcome is the number 1? c) Compute the probability of selecting a tag numbered 1. Do the same for 2 and 3. d) What is the sum of the probabilities obtained in (c)?
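As a quick illustration of the relative-frequency interpretation discussed above, here is a minimal Python sketch. It uses the hat of Exercise 2 (7 tags numbered 1, 5 tags numbered 2, 8 tags numbered 3); the number of repetitions is an arbitrary choice.

    import random
    from collections import Counter

    # The hat of Exercise 2: 7 tags numbered 1, 5 tags numbered 2, 8 tags numbered 3.
    hat = [1] * 7 + [2] * 5 + [3] * 8

    trials = 100_000                       # arbitrary number of repetitions
    counts = Counter(random.choice(hat) for _ in range(trials))

    for tag in (1, 2, 3):
        rel_freq = counts[tag] / trials    # observed relative frequency
        prob = hat.count(tag) / len(hat)   # "equally likely" probability N_A / N
        print(f"tag {tag}: relative frequency {rel_freq:.4f}, probability {prob:.4f}")

As the number of repetitions grows, the printed relative frequencies settle near the "equally likely" probabilities 7/20, 5/20 and 8/20, which is exactly the interpretation suggested in the text.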
1.2 Combinatorial Probability
We now consider the computation of probabilities in the "equally likely" case. Let us suppose that we have n different objects, and we want to arrange k of these in a row (where, of course, k ≤ n). We wish to know in how many ways this can be accomplished. As an example, suppose there are five members of a committee, call them A, B, C, D, E, and we want to know in how many ways we can select a chairman and a secretary. When we select the arrangement (C, A), we mean that C is the chairman and A is the secretary. In this case n = 5 and k = 2. The different arrangements are listed as follows:
    (A,B) (A,C) (A,D) (A,E)
    (B,A) (B,C) (B,D) (B,E)
    (C,A) (C,B) (C,D) (C,E)
    (D,A) (D,B) (D,C) (D,E)
    (E,A) (E,B) (E,C) (E,D)
One sees that there are 20 such arrangements. The number 20 can also be obtained by the following reasoning: there are five ways in which the chairman can be selected (which accounts for the five horizontal rows of pairs), and for each chairman selected there are four ways of selecting the secretary (which accounts for the four vertical columns).
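The count of 20 arrangements just obtained (and the count of 10 unordered subcommittees discussed in the paragraphs that follow) can be confirmed by brute-force enumeration. A minimal Python sketch using the standard itertools module:

    from itertools import permutations, combinations

    members = ["A", "B", "C", "D", "E"]

    # Ordered pairs (chairman, secretary): permutations of 5 members taken 2 at a time.
    arrangements = list(permutations(members, 2))
    print(len(arrangements))       # 20

    # Unordered pairs (two-person subcommittees): combinations of 5 members taken 2 at a time.
    subcommittees = list(combinations(members, 2))
    print(len(subcommittees))      # 10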
Consequently there are 20 such pairs. In general, if we want to determine in how many ways we can arrange k out of n objects, we reason as follows. There are n ways of selecting the first object. For each way we select the first object there are n - 1 ways of selecting the second object. Hence the total number of ways in which the first two objects can be selected is n(n - 1). For every way in which the first two objects are selected there are n - 2 ways of selecting the third object. Thus the number of ways in which the first three objects can be selected is n(n - 1)(n - 2). From this one can easily conclude that the number of ways in which k out of n objects can be laid in a row is n(n - 1)(n - 2) ··· (n - (k - 1)), which can be written as the ratio of factorials: n!/(n - k)! (Recall: 5! = 1 × 2 × 3 × 4 × 5.) This is also referred to as the number of permutations of n things taken k at a time.

In the above arrangements (or permutations) of n things taken k at a time, we counted each way in which we could arrange the same k objects in a row. Suppose, however, that one is interested only in the number of ways k objects can be selected out of n objects and is not interested in order or arrangement. In the case of the committee discussed above, the ways in which two members can be selected out of the five to form a subcommittee are as follows:

    (A,B) (A,C) (A,D) (A,E)
    (B,C) (B,D) (B,E)
    (C,D) (C,E)
    (D,E)
We do not list (D,B) as before, because the subcommittee denoted by (D,B) is the same as that denoted by (B,D), which is already listed. Thus, now we have only half the number of selections. In general, if we want to find the number of ways in which one can select k objects out of n objects, we reason as follows. As before, there are n!/(n - k)! ways of arranging (or permuting) n objects taken k at a time. However, all k! ways of arranging each k objects are included here. Hence we must divide the n!/(n - k)! ways of arranging k out of n objects by k! to obtain the number of ways in which we can make the k selections. This number of ways in which we can select k objects out of n objects without regard to order is usually referred to as the number of combinations of n objects or things taken k at a time. It is usually denoted by the
binomial coefficient:

    \binom{n}{k} = \frac{n!}{k!\,(n-k)!}.

This binomial coefficient is encountered in the binomial theorem, which states:

    (a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k},
where 0! is defined to be 1.

Now we apply these two notions to some combinatorial probability problems, i.e., the computation of probabilities in the "equally likely" case. In each problem, the cautious approach is first to determine the number of equally likely outcomes in the game or experiment. Then one computes the number of equally likely ways from among these in which the particular event can occur. Then the ratio of this second number to the first number is computed in order to obtain the probability of the event.

Example 1. The numbers 1, 2, ..., n are arranged in random order, i.e., the n! ways in which these numbers can be arranged are assumed to be equally likely. We are to find the probability that the numbers 1 and 2 appear as neighbors with 1 followed by 2. As was mentioned in the problem, there are n! equally likely outcomes. In order to compute the number of these ways in which the indicated event can occur, we reason as follows: there are n - 1 positions permitted for 1; for each position available for 1 there is only one position available for 2, and for every selection of positions for 1 and 2, there are (n - 2)! ways for arranging the remaining n - 2 integers in the remaining n - 2 positions. Consequently, there are (n - 1) · 1 · (n - 2)! ways in which this event can occur, and its probability is

    p = \frac{(n-1) \cdot 1 \cdot (n-2)!}{n!} = \frac{(n-1)!}{n!} = \frac{1}{n}.
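The answer 1/n in Example 1 is easy to check empirically. A minimal Python sketch (the value of n and the number of repetitions are arbitrary choices):

    import random

    n = 8                  # arbitrary choice
    trials = 200_000       # arbitrary number of repetitions
    hits = 0

    for _ in range(trials):
        order = list(range(1, n + 1))
        random.shuffle(order)                  # a random arrangement of 1, ..., n
        i = order.index(1)
        if i + 1 < n and order[i + 1] == 2:    # 1 immediately followed by 2
            hits += 1

    print(hits / trials, 1 / n)                # the two numbers should be close

The simulated relative frequency should be close to 1/n, in line with the relative-frequency interpretation of Section 1.1.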
Before beginning Example 2, we should explain what is meant by selecting a random digit (or random number). In effect, one takes 10 tags and marks 0 on the first tag, 1 on the second tag, 2 on the third tag, • • •, and 9 on the tenth tag. Then these tags are put into a hat
(or urn). If we say "select n random digits" or "sample n times with replacement", we mean that one selects a tag "at random", notes the number on it and records it, returns it to the container, and repeats this action n - 1 times more.

Example 2. We are to find the probability p that among k random digits neither 0 nor 1 appears. The total number of possible outcomes is obtained as follows. There are 10 possibilities for selecting the first digit. For each way in which the first digit is selected there are 10 ways of selecting the second digit. So there are 10^2 ways of selecting the first two digits. In general, then, the number of ways in which the first k digits can be selected is 10^k. Now we consider the event: neither 0 nor 1 appears. In how many "equally likely" ways from among the 10^k possible outcomes can this event occur? In selecting the k random digits, it is clear that with the first random digit there are eight ways in which it can occur. The same goes for the second, third, and on up to the kth random digit. Hence, out of the 10^k total possible "equally likely" outcomes there are 8^k outcomes in which this event can occur. Thus p = 8^k/10^k.

Example 3. Now let us determine the probability P that among k random digits the digit zero appears exactly 3 times (where 3 ≤ k). Again, the total number of equally likely outcomes is 10^k. Among the k trials (i.e., k different objects) there are \binom{k}{3} ways of selecting the 3 trials in which the zeros appear. For each way of selecting the 3 trials in which only zeros occur there are 9^{k-3} ways in which the outcomes of the remaining k - 3 trials can occur. Thus P = \binom{k}{3} 9^{k-3}/10^k.

Example 4. A box contains 90 white balls and 10 red balls. If 9 balls are selected at random without replacement, what is the probability P that 6 of them are white? In this problem there are \binom{100}{9} ways of selecting the 9 balls out of 100. Since there are \binom{90}{6} ways of selecting 6 white balls out of 90 white balls, and since for each way one selects 6 white balls there are \binom{10}{3} ways of selecting 3 red balls out of the 10 red balls, we see that there
are \binom{90}{6}\binom{10}{3} ways of getting 6 white balls when we select 9 without replacement. Consequently,

    P = \frac{\binom{90}{6}\binom{10}{3}}{\binom{100}{9}}.
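Probabilities of this form are easy to evaluate numerically. A small Python check of Example 4, where math.comb gives the binomial coefficient:

    from math import comb

    # Example 4: 90 white and 10 red balls, 9 drawn without replacement,
    # probability that exactly 6 of the drawn balls are white.
    p = comb(90, 6) * comb(10, 3) / comb(100, 9)
    print(p)    # approximately 0.0393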
Example 5. There are n men standing in a row, among whom are two men named A and B. We would like to find the probability P that there are r people between A and B. There are two ways of solving this problem. In the first place there are \binom{n}{2} ways in which one can select two places for A and B to stand, and among these there are n - r - 1 ways in which one can pick two positions with r positions between them. So P = (n - r - 1)/\binom{n}{2}. Another way of solving this problem is to observe that there are n! ways of arranging the n men, and that among these n! ways there are two ways of selecting one of the men A or B. For each way of selecting one of A or B there are n - r - 1 ways of placing him, and for each way of selecting one of A or B and for each way of placing him there is one way in which the other man can be placed in order that there be r men between them, and there are (n - 2)! ways of arranging the remaining n - 2 men. So

    P = \frac{2(n-r-1)(n-2)!}{n!} = \frac{n-r-1}{\binom{n}{2}}.
EXERCISES 1. An urn contains 4 black balls and 6 white balls. Two balls are selected without replacement. What is the probability that a) one ball is black and one ball is white? b) both balls are black? c) both balls are white? d) both balls are the same color? 2. In tossing a pair of fair dice what is the probability of throwing a 7 or an 11?
3. Two fair coins are tossed simultaneously. What is the probability that a) they are both heads? b) they match? c) one is heads and one is tails?

4. The numbers 1, 2, ..., n are placed in random order in a straight line. Find the probability that a) the numbers 1, 2, 3 appear as neighbors in the order given, and b) the numbers 1, 2, 3 appear as neighbors in any order.

5. Among k random digits find the probability that a) no even digit appears, b) no digit divisible by 3 appears.

6. Among k random digits (k ≥ 5) find the probability that a) the digit 1 appears exactly five times, b) the digit 0 appears exactly two times and the digit 1 appears exactly three times.

7. A box contains 10 white tags and 5 black tags. Three tags are selected at random without replacement. What is the probability that two are black and one is white?

8. There are n people standing in a circle, among whom are two people named A and B. What is the probability that there are r people between them?

9. Six random digits are selected. In the pattern that emerges, find the probability that the pattern will contain the sequence 4, 5, 6.
1.3 The Algebra of Events
Before we may adequately discuss probabilities of events we must discuss the algebra of events. Then we are able to establish the properties of probability. Connected with any game or experiment is a set or space of all possible individual outcomes. We shall consider only those games or experiments where these individual outcomes are equally likely. Such a collection of all possible individual outcomes is called a fundamental probability set or sure event. It will be denoted by the Greek letter omega, Ω. We shall also use the expression fundamental probability set (or sure event) for any representation we might construct of all individual outcomes. For example, in a game consisting of one toss of an unbiased coin, a fundamental probability set consists of two individual outcomes which can be conveniently referred to as H (for heads) and T (for tails). If the game consists in tossing a fair coin twice, then the fundamental probability set consists of four individual outcomes. One of these outcomes could be denoted by (T, H), which means that tails occurs on the first toss of the coin and heads occurs on the second toss. The remaining three individual outcomes may be denoted by (H, H), (H, T) and (T, T). In general, an arbitrary individual outcome will be denoted by ω and will be referred to as an elementary event. Thus, Ω denotes the set of all elementary events.

An event is simply a collection of certain elementary events. Different events are different collections of elementary events. Consider the game again where a fair coin is tossed twice. Then, as indicated above, the sure event consists of the following four elementary events:

    (H,H)  (H,T)  (T,H)  (T,T).

If A denotes the event: [heads occurs in the first toss], then A consists of two elementary events, (H, H) and (H, T), and we write this as A = {(H,H),(H,T)}. If B denotes the event: [at least one head appears], then B consists of the three elementary events (H, H), (H, T) and (T, H), i.e., B = {(H,H),(H,T),(T,H)}.
If C denotes the event: [no heads appear], then C consists of one elementary event, i.e., C = {(T,T)}. If D denotes the event: [at least three heads occur], this is clearly impossible and is an empty collection of elementary events; we denote this by D = ∅, where ∅ always means the empty set. In general, we shall denote the fact that an elementary event ω belongs to the collection of elementary events which determine an event A by ω ∈ A. If an elementary event ω occurs, and if ω ∈ A, then we say that the event A occurs. It might be noted at this point that just because an event A occurs, it does not mean that no other events occur. In the example above, if (H, H) occurs, then A occurs and so does B. The fundamental probability set Ω is also called the sure event for the basic reason that whatever elementary event ω does occur, then always ω ∈ Ω.

We now introduce some algebraic operations of events. If A is an event, then A^c will denote the event that A does not occur. Thus A^c consists of all those elementary events in the fundamental probability set which are not in A. For every elementary event ω in the fundamental probability set and for every event A, one and only one of the following is true: ω ∈ A or ω ∈ A^c. An equivalent way of writing ω ∈ A^c is ω ∉ A, and we say that ω is not in A. Also, A^c is called the negation of A or the complement of A. If A and B are events, then A ∪ B will denote the event that at least one of the two events A, B occurs. By this we mean that A can occur and B not occur, or B can occur and A not occur, or both A and B can occur. In the previous example, if E denotes the event that heads occurs in the second trial, then

    E = {(H,H),(T,H)}   and   A ∪ E = {(H,H),(H,T),(T,H)}.
In other words, A ∪ E is the event that heads occurs at least once, and we may write A ∪ E = B. In general, if A_1, ..., A_n are any n events, then A_1 ∪ A_2 ∪ ··· ∪ A_n
denotes the event that at least one of these n events occurs. This event will also be written as ⋃_{i=1}^{n} A_i.
Suppose A and B are events which cannot both occur, i.e., if ω ∈ A, then ω ∉ B, and if ω ∈ B, then ω ∉ A. In this case, A and B are said to be incompatible or disjoint or mutually exclusive. Events A_1, ..., A_n are said to be disjoint if and only if every pair of these events has this property. The notation A ⊂ B means: if event A occurs, then event B occurs. Other ways of stating this are: A implies B and B is implied by A. Thus A ⊂ B is true if and only if for every ω ∈ A, then ω ∈ B. In any situation where it is desired to prove A ⊂ B, one should select an arbitrary ω ∈ A and prove that this implies ω ∈ B. We define the equality of two events A and B, namely A = B, to occur if A ⊂ B and B ⊂ A, i.e., A and B share the same elementary events. Finally we define the event that A and B both occur, which we denote by A ∩ B, to be the event consisting of all elementary events ω in both A and B. This is frequently referred to as the intersection of A and B. If A_1, ..., A_n are any n events, then the event that they all occur is denoted in two ways by

    A_1 ∩ A_2 ∩ ··· ∩ A_n = ⋂_{j=1}^{n} A_j.

We sometimes write AB instead of A ∩ B and A_1 A_2 ··· A_n instead of A_1 ∩ A_2 ∩ ··· ∩ A_n. We now prove some propositions on the algebra of events.

PROPOSITION 1. For every event A, A ⊂ A.

Proof: Let ω ∈ A. Then this same ω ∈ A. Hence every elementary event in the left event is an elementary event in the right event.

PROPOSITION 2. If A, B, C are events, if A ⊂ B and if B ⊂ C, then A ⊂ C.
Proof: Let ω ∈ A; we must show that ω ∈ C. Since A ⊂ B and ω ∈ A,
Now, ,snce B C C and ssnce u € B, ,hen n € C.
P R O P O S I T I O N 3. For every event A, AD A = A, A U A = A, and (Ac)c = A. Proof: These are obvious. P R O P O S I T I O N 4. If A is any event, then
CACSfi Proof: The trick here involves the fact that any u one might ffnd in <j> is certainly in A, since <j> containn no w'so The implication A C ft is obvious. We noted above that if A and B are two events, and if we wished to prove A C B, then we should take an arbitrary elementary event u in A and prove that it is in B. Now suppose we have two events A and B, and suppose we wish to prove A = B. Because of the definition of equality of two events given above, one is required to do the following: (i) take an arbitrary u G A a n d prove that u> e B, and ((ii take an arbitrary u € B and prove that u e A. P R O P O S I T I O N 5. IfAx, A*---,
A* are events then
(u*V-rK-<
/ /
i=l i=l
(Ihese are known as the DeMorgan formulae.) Proof: In order to prove the first equation, let u be any elementary event in the left hand side. Then u is not in the event that at least one of Ai, • ■ •• An occurr This means that u ii not an ellment of any of the At\ i.e., u <£ A{ for i = 1,2,, • • ,n. Hence u e A\ for all i, i.e., u e H?=iAf. Thus we have shown that (UJL1At)c C n?=i4 5 . Now let w be any ellmentary event in n? =1 4?. Then u € A? for all ii Hence w ^ A,- for t = 1,2, , • •, n, and dhen this happens, then nertainly u 4. U?_=Ai. But this means u € ( U ^ A , ) 0 . . h u s n^^A? ? CU^nAi)0, and since we have proved the reverse implication, we therefore have the first equality of our proposition. In order to prove the second equation,
1.3. THE ALGEBRA
OF EVENTS
13
we use the first one that is aheady proved and replace each A,- by A?. Thus we obtain (U?=1A?)C = n? =1 (A?) c , and by Proposition 3, this becomes (UJL^A?)0 = flJ^A,-. Now take the complement of both sides to obtain the second equation. Q.E.D. P R O P O S I T I O N 6. c = ft and ftc = . Proof: Since (j> contains no elementary events, its complement must be the set of all elementary events, i.e., ft. Also, the negative or comple ment of the sure event is the event consisting of no elementary events. Q.E.D. P R O P O S I T I O N 7. If A and B are events, and if A C B, Af\B = A.
then
Proof: Since A C J5, the every elementary event in both A and B is in A, also every elementary event in A is also in 5 , and therefore in A and B. Q.E.D. P R O P O S I T I O N 8. If A and B are events, then AD B C A. Proof: If u G A fl i?, then u> G A and w 6 B , which implies u G A. P R O P O S I T I O N 9. If A and B are events, and if A C B, Bc C A c . IfA = B, then Ac = £ c .
then
Proof: Let u; G 5 C . Then u £ B. This implies that CJ ^ A, since if to the contrary u G A, then A C 5 would imply u € B which contradicts the fact established above that u £ B. But u £ A implies u G A c . Thus Bc C A c . Next, if A = 5 , then A C 5 and B C A , which, by the first conclusion aheady proved, imply Bc C A c and Ac C 5 C , which in turn imply £ c = A c . Q.E.D. P R O P O S I T I O N 10. / / A and 5 are evenfc, then AUB and AH B = BOA.
=
BUA
Proof: If v G A U S , then a; G A or u G 5 . If a; G A, then a? is in at
14
CHAPTER 1. EVENTS AND
PROBABILITY
least one of the events 5 , A, namely A; if u € 1?, then u is in at least one of the events JB, A, namely 2?. Thus u; € JB U A, and we have shown AU B C B l) A. Since this holds for any two events, we may replace A above by B and B by A, and we obtain B U A C A U B. These two inclusions imply AUB = BUA. In order to obtain the second equation, replace A and B by A c and Bc respectively in the first equation, take the complements of both sides and apply Propositions 3 and 5. Q.E.D. P R O P O S I T I O N 11. IfA.B
and C are events, then
A(J(£UC) = (AUB)UC and
An(Bnc) = (AnB)nc. Proof: If u) € A U (B U C), then u € A or CJ € B U C. If a; € A, then u> is an element of at least one of A, 1?, namely A. Hence w G i U B . This in turn implies u; is in at least one of A U B and C, namely, A U 5 . Thus u; 6 (A U 5 ) U C. If w G 5 U C, then CJ is in at least one of B or C. If it is in C, then it is in (A U 5 ) U C If it is in 5 , then it is in A U i?, namely 5 . Thus we have established the inclusion AU(5UC)C(AUB)UC. In order to establish the reverse inclusion, and hence the first equation, we use Proposition 10 and the above inclusion to obtain (AUB)UC
= = =
CU(AUB) CU(BUA)C(CUB)UA (BUC)UA = AU(BUC).
In order to establish the second equation, replace A, B and C in both sides of the first equation by A c , Bc and Cc respectively, take the com plements of both sides, and apply Propositions 9 and 5 to obtain the conclusions. P R O P O S I T I O N 12. IfA.B An(BDC)
and C are events, then =
ABuAC
and

    A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

Proof: If ω ∈ A ∩ (B ∪ C), then ω ∈ A and ω is in at least one of B, C. If ω ∈ B, then ω ∈ AB; if ω ∈ C, then ω ∈ AC. Hence ω ∈ AB ∪ AC. If ω ∈ AB ∪ AC, then ω is in at least one of AB and AC. If ω ∈ AB, then ω ∈ A and ω ∈ B; now ω ∈ B implies ω ∈ B ∪ C, and hence ω ∈ A ∩ (B ∪ C). If ω ∈ AC, then replace C by B in the previous sentence to obtain the same conclusion. In order to prove the second equation, replace A, B and C in the first equation by A^c, B^c and C^c respectively, take the complements of both sides and apply Propositions 9 and 5.

PROPOSITION 13. If A and B are events, then A ∪ B = A ∪ A^cB, and A and A^cB are disjoint.

Proof: If ω ∈ A ∪ B, then ω ∈ A or ω ∈ B. If ω ∈ A, then ω ∈ A ∪ A^cB. If ω ∈ B, then two cases occur: ω ∈ A also, or ω ∉ A. In the first case ω ∈ A ∪ A^cB. In the second case, ω ∈ A^c while yet ω ∈ B, i.e., ω ∈ A^cB and thus ω ∈ A ∪ A^cB. Thus, A ∪ B ⊂ A ∪ A^cB. Now let ω ∈ A ∪ A^cB. Then ω ∈ A or ω ∈ A^cB. If ω ∈ A, then ω ∈ A ∪ B. If ω ∈ A^cB, then ω ∈ B, and hence ω ∈ A ∪ B. Thus A ∪ A^cB ⊂ A ∪ B, and the equation is established. Also, A and A^cB are disjoint since if ω ∈ A^cB, then ω ∈ A^c, i.e., ω ∉ A. Q.E.D.

EXERCISES

1. Prove: if B is any event, then ∅ ∩ B = ∅ and B ∩ Ω = B. (See Propositions 4 and 7.)

2. If A is an event, use Problem 1 and Propositions 3 and 6 to prove ∅ ∪ A = A and Ω ∪ A = Ω.

3. Use Propositions 10 and 8 to prove: if A and B are events, then A ∩ B ⊂ B.

4. Use Problem 3 and two propositions of this section to prove: if C and D are events, then C ⊂ C ∪ D.
5. Prove: if A, B, C and D are events, if A ⊂ C and if B ⊂ D, then AB ⊂ CD.

6. Let A_1, A_2, A_3, A_4, A_5, A_6, A_7 be events. Match these three events: A_1^c A_2^c A_3, A_6 A_7^c and A_2 A_5 with the following statements: (i) A_2 and A_5 both occur, (ii) A_3 is the first among the seven events to occur, and (iii) A_6 is the last event to occur.

7. Let A_1, A_2 and A_3 be events, and define B_i to be the event that A_i is the first of these events to occur, i = 1, 2, 3. Write each of B_1, B_2, B_3 in terms of A_1, A_2, A_3 and prove that A_1 ∪ A_2 ∪ A_3 = B_1 ∪ B_2 ∪ B_3.

8. Prove: if A is any event, then A ∪ A^c = Ω and A ∩ A^c = ∅.

9. Prove: if A and B are events, then B = AB ∪ A^cB. (Hint: Use Problem 8, Proposition 12 and Problem 1.)

10. In Problem 6, construct the event: A_5 is the last of these events to occur.

11. Five tags, numbered 1 through 5, are in a bowl. The game is to select a tag at random and, without replacing it, select a second tag. (I.e., take a sample of size two without replacement.) After you list all 20 elementary events in Ω, list the elementary events in each of the following events:

A: the sum of the two numbers is < 6
B: the sum of the numbers is 5
C: the larger of the two numbers is < 3
D: the smaller of the two numbers is 2
E: the first number selected is 5
F: the second number selected is 4 or 5.
12. In Problem 11, list the elementary events in each of the following events:
A ∪ E, A ∩ C, D^c, (A ∪ E)^c, E ∩ F.
13. Prove Proposition 3.

14. Prove the converse to the second statement in Proposition 9: If A and B are events, and if A^c = B^c, then A = B.

15. Prove: If A, B and C are events, and if A and B are disjoint, then AC and BC are disjoint.

16. Prove: If A, H_1, ..., H_n are events, if A ⊂ ⋃_{i=1}^{n} H_i and if H_1, ..., H_n are disjoint, then AH_1, ..., AH_n are disjoint, and A = ⋃_{i=1}^{n} AH_i.
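Because events are just sets of elementary events, the identities of this section can be spot-checked with Python's built-in set type. A minimal sketch (the sample space and the particular events A and B are arbitrary choices used only for the check):

    # Two tosses of a coin, as in the text.
    omega = {("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")}

    A = {w for w in omega if w[0] == "H"}          # heads on the first toss
    B = {w for w in omega if "H" in w}             # at least one head

    def complement(event):
        return omega - event

    # DeMorgan formulae (Proposition 5) and Proposition 13, checked on this example:
    assert complement(A | B) == complement(A) & complement(B)
    assert complement(A & B) == complement(A) | complement(B)
    assert A | B == A | (complement(A) & B)
    assert A & (complement(A) & B) == set()        # A and A^c B are disjoint

Such checks are not proofs, of course; they only illustrate the identities on one particular Ω.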
1.4 Probability
The only notion of probability that we shall use in this course is that where the elementary events are all equally likely. In most cases these equally likely outcomes will be apparent. In others, they will be difficult to find, but in most of these cases we shall not have to find them. In any game or experiment, if N denotes the total number of equally likely outcomes (in Ω), and if N_A denotes the number of equally likely outcomes in the event A, then we define the probability of A by
    P(A) = \frac{N_A}{N}.

Concrete examples of this were given in Section 1.2. The following propositions will be used repeatedly in this course.

PROPOSITION 1. If A is an event, then 0 ≤ P(A) ≤ 1.

Proof: Since 0 ≤ N_A ≤ N, divide through by N, and obtain 0 ≤ P(A) ≤ 1. Q.E.D.

PROPOSITION 2. If A is an event, then P(A^c) = 1 -
P(A).
Proof: Since N = N_A + N_{A^c}, we have, upon dividing through by N, 1 = P(A) + P(A^c), from which the conclusion follows. Q.E.D.

PROPOSITION 3. P(Ω) = 1 and P(∅) = 0.

Proof: This follows from the fact that N_∅ = 0 and N_Ω = N.

PROPOSITION 4. If A_1, ..., A_r
are disjoint events, then
    P\left(\bigcup_{i=1}^{r} A_i\right) = \sum_{i=1}^{r} P(A_i).

Proof: Disjointness of A_1, ..., A_r implies
    N_{A_1 ∪ ··· ∪ A_r} = \sum_{i=1}^{r} N_{A_i}.
Dividing through by N yields the result.
Q.E.D.
P R O P O S I T I O N 5. / / A and B are events, then P(A) = P(BA) (BCA).
+
Proof: Because ft = B\JBC, then A = Anil = A(BUBC) = ABUABC. Since AB and ABC are disjoint, it follows that P(A) = P(ABUABC) = P(AB) + P(ABC). Q.E.D. P R O P O S I T I O N 6. If A and B are events, then P(AUB)
= P{A) +
P(ACB).
Proof: By Proposition 13 in Section 1.3, AliB = AU ACB, and A and ACB are disjoint. Applying Proposition 4 above we obtain P(AUB) = P(A) + P{ACB) . Q.E.D. P R O P O S I T I O N 7. If A and B are events then P(A UB) = P(A) + P{B) -
P{AB).
1.4.
PROBABILITY
19
Proof: By Proposition 6, P{A U B) = P(A) + P(ACB) . By Propo sition 5, P(B) = P(AB) + P(ACB) , or P(ACB) = P{B) - P{AB). Substituting this into the first formula, we get the result. Q.E.D. P R O P O S I T I O N 8. (Boole's inequality). If A and B are events, then P(AU B) < P(A) + P(B). Proof: By Propositions 7 and 1, P(Al)B) P(A) + P{B).
= P(A) + P(B)-P(AB) < Q.E.D.
EXERCISES 1. Prove: if A, B and C are events, then P(AUBUC)
= P{A) + P(B) + P(C) -P(AB) - P{AC) - P(BC) +
P(ABC).
2. Prove: if A, B and C are events, then P(A U B U C) < P{A) + P{B) + P(C). 3. Use the principle of mathematical induction to prove: if Ai, • • •, An
P (^)3,thenpfu^)<E \«=i / «=i
are events,
4. Prove: if A and B are events, and if A C B, then P(A) < P(B). 5. Prove: if A and B are events, then P(AB) < P(A) < P(A U B). 6. Prove: if A and B are events, if A C B, and if P(B) < then P(A) = P(B). 7. Prove: if Ai,A2 PiAJ + PiA^
P(A),
and A3 are events, then P(A X U A2 U A3) = + PiAiAlAs).
8. Another way to prove Proposition 6 is as follows. First prove that A fl (A U B) = A and A c (A UB) = A c 5 . Then apply Proposition 5. Now fill in complete details of this proof.
1.5 Conditional Probability
We now define the conditional probability that an event A occurs, given that a certain event B occurs. Since we are given that B occurs, we can only define this conditional probability when P(B) > 0 or N_B > 0. Since we are given that B occurs, then the total possible number of equally likely outcomes is N_B, which we place in the denominator. Among these we wish to find the total number of equally likely ways in which A occurs. This is seen to be N_{A∩B} or N_{AB}. Thus, the conditional probability that A occurs, given that B occurs, which we denote by P(A|B), is obtained from the formula

    P(A|B) = \frac{N_{AB}}{N_B}.
R E M A R K l.IfA
.
^
and B are events, and if P(B) > 0, then
p(A\mP{m
P AB
"> ~ 1^ W
Proof: By the definition above, we have P(A\B)
=
NAB
NB
NABIN
NB/N
P(AB)
P{B) " Q.E.D.
R E M A R K 2. If A and B are events with positive probabilities, then P{AB) = P(A\B)P(B)
=
P(B\A)P(A).
Proof: The first equality follows from Remark 1, and the second is the same as the first with A replaced by B and B by A. Q.E.D. The three most useful theorems in connection with conditional prob abilities will now be presented along with some applications.
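Conditional probabilities in the equally likely model are just ratios of counts, which is easy to mirror in code. A minimal Python sketch using the two-toss sample space from Section 1.3 (the particular events are arbitrary illustrations of Remark 1):

    omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

    A = [w for w in omega if w[0] == "H"]       # heads on the first toss
    B = [w for w in omega if "H" in w]          # at least one head

    AB = [w for w in omega if w in A and w in B]

    def P(event):
        return len(event) / len(omega)

    # P(A | B) = N_AB / N_B = P(AB) / P(B)
    print(len(AB) / len(B), P(AB) / P(B))       # both give 2/3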
THEOREM 1. (The Multiplication Rule). If A_0, A_1, ..., A_n are any n + 1 events such that P(A_0 A_1 ··· A_{n-1}) > 0, then

    P(A_0 A_1 ··· A_n) = P(A_0) P(A_1|A_0) P(A_2|A_0 A_1) ··· P(A_n|A_0 ··· A_{n-1}).
Proof: We prove this by induction on n. For n = l,P(A0) > 0, so P{Ax\Ao) = P(A 0 A!)/P(Ao), or P{AoA\) = P(A0)P(A1\A0). Now let us assume the theorem is true for n — 1 (where n > 2); we shall show it is also true for n. By induction hypothesis, P(A0A1 •.. An-i) = P(A0)P(A1\Ao)
- • • P(An^\A0
• • • A n _ 2 ).
By Remark 1, letting B = A0 • • • A n _ 1? A = An, we obtain PfAoAx.-.An)
=
P(i4o---i4 n _ 1 )P(i4 w |i4 0 ---i4 n - 1 )
=
P(A0)P(A1\Ao)
• • • P(A n |A 0 - • • A n _x),
which proves that the theorem then holds for all n.
Q.E.D.
Example 1. (Polya Urn Scheme). An urn contains r red balls and b black balls originally. At each trial, one selects a ball at random from the urn, notes its color and replaces it along with c balls of the same color. Let R_i denote the event that a red ball is selected at the ith trial, and let B_i denote the event that a black ball is obtained at the ith trial. We wish to compute P(R_1 B_2 B_3). Using the multiplication rule, we obtain

    P(R_1 B_2 B_3) = P(R_1) P(B_2|R_1) P(B_3|R_1 B_2)
        = \frac{r}{r+b} \cdot \frac{b}{r+b+c} \cdot \frac{b+c}{r+b+2c}.
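The multiplication-rule computation in Example 1 can be confirmed by simulating the urn. A minimal Python sketch (the values of r, b, c and the repetition count are arbitrary choices):

    import random

    r, b, c = 3, 5, 2            # arbitrary choices
    trials = 200_000

    hits = 0
    for _ in range(trials):
        urn = ["R"] * r + ["B"] * b
        colors = []
        for _ in range(3):                     # three trials of the Polya scheme
            ball = random.choice(urn)
            colors.append(ball)
            urn += [ball] * c                  # the drawn ball stays, plus c more of its color
        if colors == ["R", "B", "B"]:
            hits += 1

    exact = (r / (r + b)) * (b / (r + b + c)) * ((b + c) / (r + b + 2 * c))
    print(hits / trials, exact)                # both should be close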
E x a m p l e 2. (Sampling Without Replacement). An urn contains N tags, numbered 1 through N. One selects at random three tags with out replacement. This means: one first selects a tag at random, then without replacing it one selects a tag at random from those remaining, and again, without replacing it, one selects yet another tag from the
22
CHAPTER 1. EVENTS AND
PROBABILITY
N — 2 remaining tags. If ii, Z2, ^3 are three distinct positive integers not greater than JV, then, by the multiplication rule, the probabiUty that i\ is selected on the first trial, z*2 on the second trial and i% on the third trial is
L
!
1
N* N-l'
N-2
T H E O R E M 2. (Theorem of Total Probabilities). IfHir-,Hn are are disjoint events with positive probabilities, and if A is an event sat isfying A C U^Hi, then P(A) =
±P(A\Hi)P(Hi). 1=1
Proof: First note, using Propositions 7 and 12 in Section 1.3, that A = A n (U?=1fff-) =
U^AHi.
Further, AH\,- • • ,AHn are disjoint; this comes from the hypothesis that J/i, • • •, Hn are disjoint. Hence, by Proposition 4 in Section 1.4,
P(A) = p((jAHt)=J2P(AHi). \t=l
/
t=l
Now by Remark 1 or 2 above, P(AHt) =
P(A\Hi)P(Hi)
P(A) =
'£P(A\Hi)P(Hi).
for 1 < i < n . Thus
t=l
Q.E.D. Example 3. In the Polya urn scheme in Example 1, P(R1) = ^—
r+b
and P ( £ i ) =
&
r + b'
1.5. CONDITIONAL
23
PROBABILITY
In order to compute P{R2), we first note that R2 C R\ U B\ and R\ and l?i are disjoint. Hence by the theorem of total probabilities, P(R2) = P(R2\R1)P(Rl)
+
P(R2\B1)P(B1).
Since P(R2\Ri) =
r r | C and P{R2\B1) = r + 6+ c r + b-\- c
we have PfFM-
r + C
r r + 6+ c r + 6
r b r + 6+ c r + 6
r r + 6"
Example 4. In Example 2 above on sampling without replacement, let us compute the probability that 1 is selected on the second trial. Using the theorem of total probabilities we obtain N
P[\ in trial#2]
=
£P([1
in
trial#2]|[i in trial#l]) P([i in trial#l])
t=2
= f-JL. 1 = 1 T H E O R E M 3. (Bayes' Theorem) IfHi,-,Hn are disjoint events with positive probabilities, if A is an event satisfying A C U^Hi, and if P(A) > 0, then for j = l , 2 , - - - , n ,
_
WTO) £?„p(/iiff()nffi)
Proof: By the definition of conditional probability, we have, by our hypotheses and by Theorem 2, that P(H \A\] =
r{ni{
P{AHj)
P(A)
=
P A H P H
( \ i) ( >) Y2=iP{A\Hi)P(Hiy Q.E.D.
24
CHAPTER 1. EVENTS AND
PROBABILITY
In rather loose terminology, Bayes' theorem is applied in this general situation. An event A is known to have occurred. There are n disjoint events, called the possible causes of A, and since A has occurred it is known that one of Hu-,Hn "caused it" (i.e., A C U? = 1 # n ). If one wishes to determine which of the possible causes really caused it, one might wish to evaluate P(Hj\A), for 1 < j < n, and select as a possible cause an Hj for which P(Hj\A) is maximum. Example 5. Consider the Polya urn scheme again. Suppose one ob serves that the event R2 has occurred and wishes to determine the probability that Bi was the "cause" of it, i.e., to evaluate P(i?i|i? 2 )By Bayes' theorem we find that
P(B,W
=
W + W P(R) \R )P(R )
P(R2\B1)P(B1) b r +b+c
One should note that P(Bi|i? 2 ) =
2
1
1
P(R2\B1).
Example 6. Consider the sampling without replacement that occurred in Examples 2 and 4. Suppose one observes that 1 is selected in the second trial and wishes to find the probability that selecting 3 in the first trial is its "cause", i.e., to evaluate P([3 in 1st trial] | [1 in 2nd trial]). Using Bayes' theorem this turns out to be

    \frac{P([1 \text{ in 2nd trial}] \mid [3 \text{ in 1st trial}]) \, P([3 \text{ in 1st trial}])}{\sum_{j=2}^{N} P([1 \text{ in 2nd trial}] \mid [j \text{ in 1st trial}]) \, P([j \text{ in 1st trial}])}
    = \frac{1}{N-1}.
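Bayes' theorem is also easy to apply numerically. The sketch below (Python) redoes Example 6 by brute force for a small N, enumerating all equally likely ordered pairs of distinct tags; the value of N is an arbitrary choice:

    from itertools import permutations

    N = 6                                              # arbitrary choice
    pairs = list(permutations(range(1, N + 1), 2))     # equally likely (1st, 2nd) selections

    second_is_1 = [p for p in pairs if p[1] == 1]
    first_is_3_and_second_is_1 = [p for p in second_is_1 if p[0] == 3]

    # P([3 in 1st trial] | [1 in 2nd trial]) = N_AB / N_B
    print(len(first_is_3_and_second_is_1) / len(second_is_1), 1 / (N - 1))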
EXERCISES 1. In the hypothesis of Theorem 1, it was assumed that P(AoAi - • • A n _i) > 0 so that the last conditional probability was well-defined. Prove that this assumption implies that P(C\j=0Aj) > 0 for 0 < k < n — 2, so that all the other conditional probabilities are also well-defined.
1.5. CONDITIONAL
PROBABILITY
25
2. In the proof of the theorem of total probabilities, the statement is made that since H\, • • •, Hn are disjoint, then AH\, • • •, AHn are disjoint. Prove this statement. 3. In sampling without replacement considered in Examples 2 and 4, suppose a simple random sample of size 3 is selected. Prove that the probability of getting a 1 in the third trial is 1/N. 4. In the Polya urn scheme, find P(Rs). 5. An urn contains four objects: A, 2?, C, D. Each trial consists of selecting at random an object from the urn, and, without replac ing it, proceeding to the next trial. If X is one of those four objects, and if i = 1 or 2 or 3 or 4, let X{ denote that event that X is selected at the zth trial. Compute the following: i) p ( ^ ) . ii)
P(A2\A1),
iii) P(A 2 |Bx), iv) P(A2) and v)
P(B3).
6. In the Polya urn scheme, compute
P(Ri\R2).
7. In the Polya urn scheme, compute i) P(lfe|fll), ii) P(R3\R2) iii)
and
P(R1R3)>
8. An urn contains 2 black balls and 4 white balls. At each trial a ball is selected at random from the urn and is not replaced for the next trial. Let B{ denote the event that the first black ball selected is on the ith trial. Compute i)
P(B2),
ii) P(Pn),
26
CHAPTER 1. EVENTS AND
PROBABILITY
iii) P(B5) and iv)
P(B6).
9. In Problem 8, let C,- denote the event that the second black ball selected is selected at the ith trial. Compute i) P(C 2 ), ii) iii)
P(C3), P{BXCZ),
iv) P(B2\C3) v)
and
P{CX).
10. An absent-minded professor has five keys on a key ring. What is the probability he will have to try all five of them in order to open his office door? 11. Urn # 1 contains 2 white balls and 4 black balls, and urn # 2 contains 5 white balls and 4 black balls. An urn is selected at random, and then a ball is selected at random from it. What is the probability that the ball selected is white? 12. In Example 2, find the probability that 2 is selected in the third trial, where N = 5.
Chapter 2

Random Variables

2.1 Random Variables as Functions
In a sample survey, when we select individuals at random, we are really not interested in the particular individuals selected. Rather, we are interested in some numerical characteristic (or characteristics) of the individual selected. This numerical characteristic is a function, in that to each elementary event selected there is a number assigned to it.

Definition. A random variable X is a function which assigns to every element ω ∈ Ω a real number X(ω).

The following examples illustrate the idea of random variable.

(i) Take a sample of size one from the set Ω of all registered students at your university. In this case, X(ω) might be the grade point average of ω. Thus, corresponding to student ω ∈ Ω is the number X(ω).

(ii) Sample three times without replacement from the set of all registered students at your university. In this case, Ω will consist of the set of all ordered triples (ω_1, ω_2, ω_3), where no repetitions are allowed. If Y denotes the age of the third student selected, i.e., if Y assigns to (ω_1, ω_2, ω_3) the number: the age of ω_3, then Y is a random variable.
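In code, a random variable is literally a function on the sample space. The following Python sketch mirrors the definition, using the three-coin-toss sample space and the number-of-heads variable that appear later in this section (both are illustrative choices):

    omega = [("H", "H", "H"), ("H", "H", "T"), ("H", "T", "H"), ("H", "T", "T"),
             ("T", "H", "H"), ("T", "H", "T"), ("T", "T", "H"), ("T", "T", "T")]

    def X(w):
        # A random variable: the number of heads in the three tosses.
        return w.count("H")

    # The event [X = 2] is the collection of elementary events mapped to 2.
    print([w for w in omega if X(w) == 2])       # three elementary events

    # The range of X in the sense of the text: the set of values X takes on omega.
    print(sorted({X(w) for w in omega}))         # [0, 1, 2, 3]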
27
28
CHAPTER 2. RANDOM
VARIABLES
(iii) In example (ii), if Z assigns to {wi,W2,wz) the total indebtedness of u>i and u>2 and u>3, then Z is a random variable. We are usually interested in the values that random variables take and the probabilities with which these values are taken. Thus we have the following definition. Definition. If X is a random variable defined over some fundamen tal probabiUty space fi, then the range of X , which we denote by range(X), is defined as the set of numbers X(u>) for all u> G fi, i.e., range(X) = {X{u) : u € f)}. This is also denoted by X(£i) or {x : x = ^(ci>) for some u> € fi}. Since fi is finite, then the range of a random variable X is finite and has at most as many members as does Q. Random variables are func tions, and so, like functions, they admit algebraic operations. These are given in the following definition, Definition. If fi is a fundamental probability space, if X and Y are random variables defined over f), and if c is a constant, then we de fine the random variables X + F , XY, X/Y, cX, max{X,Y}, and min{X, Y} as follows: (i) X + Y assigns to every u 6 fi the number X(u) +
Y(u),
(ii) XY assigns to every UJ £ Q the product X(u)Y(u) X(UJ) and Y(u),
of the numbers
(iii) X/Y assigns to every u € 0 the quotient X{u)/Y(u)\ for at least one u> & fi, then X/Y is not defined,
if Y(u) = 0
(iv) cX assigns to u the number CX{UJ) , (v) max{X,Y} assigns to each u> e 0, the larger of the numbers X(u>),Y(u>), and (vi) min{X, y } assigns to each u € ft the smaller of the numbers X(u>) and y(u;) .
2.1.
RANDOM
VARIABLES
AS
29
FUNCTIONS
In general, if X, Y, Z are random variables, and if /(u,v,u;) is any function of three variables, then / ( X , Y, Z) is a random variable which assigns to every w G f i the number f(X(Lo), Y(u), Z{u)). Among the random variables defined over 0 , some very important ones are the indicator random variables, defined as follows. Definition. If A C fi is an event, the indicator of A , denoted by 7^, is defined as the random variable that assigns t o w G f l the number 1 if u> G A and 0 if u £ A , i.e., T
/
N
/ 1
if u> G A
P R O P O S I T I O N 1. If A is an event, then l\ = IA. Proof: If u> G A, then 7i(w) = l 2 = 1 = iyt(u>), and if u; ^ A, then I\(w) = 02 = 0 = 7^(a>). Q.E.D. P R O P O S I T I O N 2. If A and B are events, then IAIB = IABProof: liuj E A and u E B, then u; e AB, and thus /A/BM
= IA(U)IB{U)
= 1-1 = 1 =
IAB{u).
If u> is not in A S , then IAB{W) = 0 and, since UJ is not in at least one of A, 5 , then at least one of the numbers I A (<*>), IB{&) is zero, in which case IAIB(W) = IA{U)IB(U) = 0 = /^(w). Q.E.D. P R O P O S I T I O N 3. 7/A and J5 are events, then IAB = m i n { / ^ , / B } . Proof: Note that the minimum of IA{W) and IB{W) is 1 if and only if IA{U>) = 1 and IB(W) = 1, which is true if and only if u G A and UJ € B, i.e., a; G A S o r / ^ u ; ) = 1. Q.E.D. P R O P O S I T I O N 4. / / A aradB are events, then IAuB = m a x { J A , / B } . Proof: IAUB(V) = 1 if and only if UJ G A U 2?, which is true if and only if CJ is in at least one of A, B. This means at least one of IA{U), IB{W) is 1. Q.E.D.
30
CHAPTER 2. RANDOM
VARIABLES
In our notation we generally suppress the symbol ω. Thus, if X is a random variable, we shall write [X = x] instead of {ω ∈ Ω : X(ω) = x}, we shall write [X ≤ x] instead of {ω ∈ Ω : X(ω) ≤ x} and, in general, for any set of numbers A, we shall write [X ∈ A] instead of {ω ∈ Ω : X(ω) ∈ A}. An example of this is the game where Ω denotes the set of outcomes of three tosses of an unbiased coin. In this case Ω consists of (HHH) (HHT) (HTH) (HTT) (THH) (THT) (TTH) (TTT). Let X denote the number of heads in the three tosses. For example, if ω = (HTH), then X(ω) = 2. Also X(THH) = 2, while X(TTT) = 0. Thus [X = 2] = {(HHT), (HTH), (THH)}, and [X ≤ 1] = {(HTT), (THT), (TTH),
(TTT)}
EXERCISES 1. Let X denote the sum of the outcomes of two tosses of a fair die, and let Y be their product. Let (u, v) denote that outcome of the two tosses in which the first number that comes up is u and the second number that comes up is v. (i) List all 36 elementary events in ft. (ii) What are the elementary events in each of these events: [X < 4], [X > 30] and [X G (1,4]]? 2. A box contains five tags, numbered 1 through 5. A tag is selected at random, and, without replacing it, a second tag is selected. (i) List all 20 ordered pairs of outcomes in this game of sampling twice without replacement. (ii) Let X denote the first number selected, and let Y denote the second number selected. List the elementary events in each of the following events: [X = 2], [Y = 2], [X = 3], [Y = 3]
and [X + Y < 4].
2.2. DENSITIES
OF RANDOM VARIABLES
31
3. Prove: if A is an event, then I A* = 1 — 7^. 4. Prove: if A and B are events, then they are disjoint if and only if IAUB = IA + IB-
5. Prove: if X and Y are random variables, and if x £ range(X), then [X = x] = U{[X = x][Y = y]:ye range(Y)}. 6. Prove: if X is a random variable, and if A is any set of numbers, then [X e A] = U{[X = x] : x e A n ransre(X)}. 7. Prove: if X is a random variable, then
X = J2xIlx=*h X
where the symbol ^
means the sum for all x G range(X).
X
2.2
Densities of Random Variables
What we are primarily interested in are the probabihties with which a random variable takes certain values. This set of probabihties is usually referred to as the density of the random variable, which we now define. Definition. If X is a random variable, its density fx{x) is defined by r / \ _ f P[X = x] if x G range(X) -{0 iix£ range(X)
}x[x)
In the example presented at the end of Section 2.1 (where X denotes the number of heads occurring in three tosses of an unbiased coin) the density of X is the following: fx(0) fx(l)
= P[X = 0] = P{(TTT)} = l/8, = P[X = l] = P{(HTT),(THT),(TTH)}
= 3/8,
32
CHAPTER 2. RANDOM fx(2)
=
P[X = 2] = P{(HHT),(HTH),(THH)}
fx(S)
= P[X = 3] = P{(HHH)}
fx(x)
=
VARIABLES = 3/8,
= 1/8, and
O i f s £ {0,1,2,3}.
Note that for every x € range(X), fx(x) > 0. We shall also need a definition of range of two or more random variables considered jointly (or as a random vector). Definition. If X and Y are random variables, their (joint) range, denoted by range(X, F ) , is defined by range(X,Y)
= {(X(LJ),Y(CJ))
: UJ G ft}.
In general, if X\, • • •, Xn are n random variables, then we define range(Xu--,Xn)
= {(Xi(o;), • • • ,Xn(u>)) : w G ft}.
One should note that range(X, Y) is a set of number pairs, i.e., a set of points in 2-dimensional Euclidean space R2. Likewise, range(Xi,..., Xn) is a set of points in n-dimensional Euclidean space Rn. Definition. If X and Y are random variables, their joint density fxy{x,y) is defined by fxA*, V) = f P{[X 10
= XU Y = d )
if
(*' y\ G range{X, otherwise.
Y),
This is referred to as a bivariate density. As an example, consider an urn with the following nine number pairs: (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1) (3,2) (3,3) A number pair is selected at random. Let X denote the smaller of the two numbers (e.g., X assigns to (3,1) the number 1), and let Y denote the sum of the two numbers (for example, Y assigns to (3,2) the number 5). We shall find the joint density of X and Y .
2.2. DENSITIES
OF RANDOM
VARIABLES
33
First note that range(X) = {1,2,3}, range(Y) = {2,3,4,5,6}, and range(X,Y) = {(1,2),(1,3),(1,4),(2,4),(2,5),(3,6)}. Next observe that

[X = 1][Y = 2] = {(1,1)}
[X = 1][Y = 3] = {(1,2),(2,1)}
[X = 1][Y = 4] = {(1,3),(3,1)}
[X = 2][Y = 4] = {(2,2)}
[X = 2][Y = 5] = {(2,3),(3,2)}
[X = 3][Y = 6] = {(3,3)}.

Thus f_{X,Y}(1,2) = 1/9, f_{X,Y}(1,3) = 2/9, f_{X,Y}(1,4) = 2/9, f_{X,Y}(2,4) = 1/9, f_{X,Y}(2,5) = 2/9, f_{X,Y}(3,6) = 1/9, and f_{X,Y}(x,y) = 0 for all other pairs (x,y). Notice in this bivariate case that f_{X,Y}(2,3) = 0 although 2 ∈ range(X) and 3 ∈ range(Y). What matters now is range(X,Y).

Joint densities of more than two random variables are similarly defined. For example, if X, Y, Z are random variables, their joint density is defined, for all (x,y,z) ∈ range(X,Y,Z), by f_{X,Y,Z}(x,y,z) = P([X = x][Y = y][Z = z]); otherwise we define f_{X,Y,Z}(x,y,z) = 0.

It is important to keep in mind the notation used from here on. If U_1, ..., U_r are random variables, the joint density of U_1, ..., U_r is denoted by f_{U_1,...,U_r}(u_1, ..., u_r). The subscripts U_1, ..., U_r indicate the random variables of which it is the joint density. Within the parentheses will be the points (u_1, ..., u_r) in the range of U_1, ..., U_r at which the density is evaluated.

Many times we have (or start out with) a joint density of several random variables, but we wish to have the densities of each single random variable. This can be accomplished by using theorems like the following.

THEOREM 1. If X and Y are random variables with joint density f_{X,Y}(x,y), then the densities f_X(x) and f_Y(y) are

f_X(x) = Σ{f_{X,Y}(x,y) : y ∈ range(Y)} if x ∈ range(X)
and

f_Y(y) = Σ{f_{X,Y}(x,y) : x ∈ range(X)} if y ∈ range(Y).
Proof: We first observe that for each x ∈ range(X),

[X = x] = ∪{[X = x][Y = y] : y ∈ range(Y)}.

Indeed, if ω is in the right hand side, then X(ω) = x; and if ω is in the left hand side, i.e., X(ω) = x, then Y(ω) ∈ range(Y), say Y(ω) = y₁, in which case ω ∈ [X = x][Y = y₁] ⊂ ∪{[X = x][Y = y] : y ∈ range(Y)}. Since the right hand side is a disjoint union, we have

f_X(x) = P[X = x] = Σ{P([X = x][Y = y]) : y ∈ range(Y)} = Σ{f_{X,Y}(x,y) : y ∈ range(Y)}.
The proof of the second equation of the theorem is similar.
Q.E.D.
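Theorem 1 is the kind of statement that is easy to check by machine. The following Python sketch (our illustration; the dictionary layout is arbitrary) recovers the marginal densities of the urn example above by summing the joint density over the other variable.

```python
from fractions import Fraction

# Joint density of (X, Y) from the urn example: X = smaller number, Y = sum.
f_XY = {
    (1, 2): Fraction(1, 9), (1, 3): Fraction(2, 9), (1, 4): Fraction(2, 9),
    (2, 4): Fraction(1, 9), (2, 5): Fraction(2, 9), (3, 6): Fraction(1, 9),
}

# Theorem 1: f_X(x) = sum over y of f_{X,Y}(x, y), and similarly for f_Y(y).
f_X, f_Y = {}, {}
for (x, y), p in f_XY.items():
    f_X[x] = f_X.get(x, Fraction(0)) + p
    f_Y[y] = f_Y.get(y, Fraction(0)) + p

print(f_X)  # f_X(1) = 5/9, f_X(2) = 1/3, f_X(3) = 1/9
print(f_Y)  # f_Y(2) = 1/9, f_Y(3) = 2/9, f_Y(4) = 1/3, f_Y(5) = 2/9, f_Y(6) = 1/9
```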
In Theorem 1, f_X(x) is called a marginal or marginal density of f_{X,Y}(x,y). As an example, consider the random variables X, Y whose joint density is given by:

f_{X,Y}(1,1) = 1/8, f_{X,Y}(2,1) = 1/8, f_{X,Y}(2,2) = 1/4, f_{X,Y}(3,2) = 1/8, f_{X,Y}(3,3) = 1/4, f_{X,Y}(4,1) = 1/8.
Graphically, this is represented by plotting the six points (1,1), (2,1), (2,2), (3,2), (3,3) and (4,1) in the (x,y)-plane and attaching to them the probability masses 1/8, 1/8, 1/4, 1/8, 1/4 and 1/8, respectively.
The marginals for X and Y are: f_X(1) = 1/8, f_X(2) = 3/8, f_X(3) = 3/8, f_X(4) = 1/8, and f_Y(1) = 3/8, f_Y(2) = 3/8, f_Y(3) = 1/4.

Individual and joint densities of random variables are useful in computing the probabilities that certain functions of these random variables "take values" in certain sets. We consider particular cases of this in the following theorems.

THEOREM 2. If X is any random variable, and if A is any set of real numbers, then

P[X ∈ A] = Σ{f_X(x) : x ∈ A ∩ range(X)}.
Proof: It is clear that

[X ∈ A] = ∪{[X = x] : x ∈ A ∩ range(X)}

and that the right hand side is a disjoint union. Taking probabilities of both sides, we get

P[X ∈ A] = Σ{P[X = x] : x ∈ A ∩ range(X)} = Σ{f_X(x) : x ∈ A ∩ range(X)},
which proves the theorem.

One can prove in a similar fashion the following theorem, whose proof we omit.

THEOREM 3. If X and Y are random variables, and if S is any subset of 2-dimensional Euclidean space R², then

P[(X,Y) ∈ S] = Σ{f_{X,Y}(x,y) : (x,y) ∈ S and (x,y) ∈ range(X,Y)}.

Another result of this type that we shall use is the following.

THEOREM 4. If X and Y are random variables, and if g(x,y) is a function defined over range(X,Y), then

P[g(X,Y) = z] = Σ{f_{X,Y}(x,y) : g(x,y) = z and (x,y) ∈ range(X,Y)}.
Proof: First observe that the right hand side of the above equation is summed over all number pairs (x,y) such that (x,y) ∈ range(X,Y) and g(x,y) = z. One can easily verify that

[g(X,Y) = z] = ∪{[X = x][Y = y] : g(x,y) = z and (x,y) ∈ range(X,Y)}.

One is also able to verify that the union on the right side is a disjoint union. Taking probabilities of both sides yields the conclusion of the theorem. Q.E.D.

The above theorem shows how to obtain the density of a function of one or more random variables. We next need to develop the idea of independence of random variables.

Definition. If X_1, ..., X_m are random variables defined over Ω, we shall call them independent if and only if, for every y_i ∈ range(X_i), 1 ≤ i ≤ m, the events [X_1 = y_1], ..., [X_m = y_m] are independent.

THEOREM 5. If X_1, ..., X_m are random variables, they are
independent if and only if

f_{X_1,...,X_m}(y_1, ..., y_m) = ∏_{j=1}^m f_{X_j}(y_j) for all y_i ∈ range(X_i), 1 ≤ i ≤ m.
Proof: The equation is a consequence of the definition of independence. The converse is obtained, starting from the equation given, by summing both sides over various indices y_i ∈ range(X_i) to obtain

f_{X_{i_1},...,X_{i_r}}(y_{i_1}, ..., y_{i_r}) = ∏_{j=1}^r f_{X_{i_j}}(y_{i_j}) for 1 ≤ i_1 < ... < i_r ≤ m, 1 ≤ r ≤ m.
Q.E.D.
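Theorem 5 gives a purely computational test for independence: the joint density must factor into the product of the marginals at every point. As an illustrative sketch (the toy densities below are ours), the factorization criterion can be checked as follows in Python.

```python
from fractions import Fraction
from itertools import product

def marginals(f_xy):
    """Marginal densities of a bivariate density given as a dict {(x, y): mass}."""
    f_x, f_y = {}, {}
    for (x, y), p in f_xy.items():
        f_x[x] = f_x.get(x, Fraction(0)) + p
        f_y[y] = f_y.get(y, Fraction(0)) + p
    return f_x, f_y

def independent(f_xy):
    """Factorization test of Theorem 5 for a pair of random variables."""
    f_x, f_y = marginals(f_xy)
    return all(f_xy.get((x, y), Fraction(0)) == f_x[x] * f_y[y]
               for x, y in product(f_x, f_y))

# Two independent fair-coin indicators: each pair carries mass 1/4.
indep = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}
# The urn example of this section: X and Y are clearly dependent.
dep = {(1, 2): Fraction(1, 9), (1, 3): Fraction(2, 9), (1, 4): Fraction(2, 9),
       (2, 4): Fraction(1, 9), (2, 5): Fraction(2, 9), (3, 6): Fraction(1, 9)}

print(independent(indep), independent(dep))  # True False
```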
THEOREM 6. If X_1, ..., X_n are independent random variables, and if c_1, ..., c_n are constants, then c_1 X_1, ..., c_n X_n are independent.

Proof: First assume that none of the c_i's are zero. Then for 1 ≤ j_1 < ... < j_r ≤ n and by the hypothesis of independence of X_1, ..., X_n, we have

P(∩_{ℓ=1}^r [c_{j_ℓ} X_{j_ℓ} = u_ℓ]) = P(∩_{ℓ=1}^r [X_{j_ℓ} = u_ℓ/c_{j_ℓ}]) = ∏_{ℓ=1}^r P[X_{j_ℓ} = u_ℓ/c_{j_ℓ}] = ∏_{ℓ=1}^r P[c_{j_ℓ} X_{j_ℓ} = u_ℓ].

If c_i = 0, then P[c_i X_i = 0] = 1 and P([c_i X_i = 0] ∩ A) = P(A) for any event A. If u_ℓ ≠ 0, then P[c_i X_i = u_ℓ] = 0, and P([c_i X_i = u_ℓ] ∩ A) = 0 for any event A. Putting all these statements together gives us the theorem. Q.E.D.

The most concrete examples of both independent random variables and non-independent (i.e., dependent) random variables occur in sample survey theory. Let us look at the basic model. We have a population U of N units denoted by U_1, ..., U_N, all of which are equally likely to be selected under random sampling. If we sample n times with
replacement, then the fundamental probability space Ω is the set of all possible ordered n-tuples of units from U, with repetition of units in each n-tuple allowed. Let X be a function defined over U which assigns to the unit U_i the number u_i, 1 ≤ i ≤ N. The u_i's need not be distinct and usually are not. For 1 ≤ j ≤ n let X_j be a function (or random variable) defined over Ω by assigning to the n-tuple (U_{t_1}, ..., U_{t_n}) the number u_{t_j}, i.e.,

X_j((U_{t_1}, ..., U_{t_n})) = X(U_{t_j}) = u_{t_j}.
The random variables X_1, ..., X_n are referred to as a sample of size n on X, where the sampling is done with replacement. The fundamental probability space Ω contains Nⁿ equally likely elementary events.

THEOREM 7. In sampling n times with replacement as defined above, the random variables X_1, ..., X_n are independent, and all have the same density.

Proof: Let N_x denote the number of units U_i such that X(U_i) = x, i.e., N_x = #{U ∈ U : X(U) = x}. Then the joint density is

f_{X_1,...,X_n}(x_1, ..., x_n) = (∏_{i=1}^n N_{x_i}) / Nⁿ = ∏_{i=1}^n (N_{x_i}/N) = ∏_{i=1}^n P[X_i = x_i] = ∏_{i=1}^n f_{X_i}(x_i).

By Theorem 5, X_1, ..., X_n are independent. Clearly, because of the replacement after each selection, all the densities f_{X_i}(x) are the same. Q.E.D.

If the sampling is done without replacement, then it is clear that the random variables X_1, ..., X_n are not independent. However, there is the following interesting and useful result.

THEOREM 8. In sampling n times without replacement, all univariate densities {f_{X_i}(x)} are the same, and all bivariate densities {f_{X_i,X_j}(u,v), i ≠ j} are the same.
Proof: Both conclusions follow from this remark: in sampling n times without replacement from U, the set of equally likely outcomes is the same as that obtained by recording the selection of the ith unit first, then the jth unit (i ≠ j), and then the remaining n - 2 units from left to right. Q.E.D.

EXERCISES

1. Consider a game in which an unbiased coin is tossed four times.
(i) List the set Ω of all equally likely outcomes.
(ii) Let X denote the number of heads in the four tosses. Find the density f_X(·) of X.
(iii) Let U denote the smaller of the number of heads and the number of tails in the four tosses. Find the density f_U(·) of U.
(iv) Find the joint density of X and U.

2. Let X and Y be random variables whose joint density is given by

f_{X,Y}(0,0) = 1/3, f_{X,Y}(0,1) = 1/4, f_{X,Y}(1,1) = 1/6, f_{X,Y}(1,2) = 1/6, and f_{X,Y}(2,0) = 1/12.
(i) Find the density, f_X(·), of X.
(ii) Find the density, f_Y(·), of Y.
(iii) If U = min(X, Y), find the joint density of U and X, f_{U,X}(·,·).

3. Compute P([Y = 1]|[X = 1]) and P([X = 2]|[Y = 0]), where X and Y are as in Problem 2.

4. In Problem 2, find the density of X + Y.

5. An urn contains 3 red balls and 4 black balls. One ball after another is drawn without replacement from the urn until the first black ball is drawn. Let X denote the number of balls drawn from the urn at the time the first black ball is drawn. Find the density of X.
6. Prove: if X is a random variable, and if g is a function defined over range(X), then the density of g(X) is

f_{g(X)}(z) = Σ{f_X(x) : g(x) = z}.

7. Prove: if X, Y and Z are independent random variables, if g is a function defined over range(Z), and if h is a function defined over range(X,Y), then h(X,Y) and g(Z) are independent random variables.

8. Prove: if X, Y and Z are random variables, if φ(x,y) is a function defined over range(X,Y) and if ψ(x,y,z) is a function defined over range(X,Y,Z), then the joint density, f(u,v), of the random variables φ(X,Y), ψ(X,Y,Z) is

f(u,v) = Σ{f_{X,Y,Z}(x,y,z) : φ(x,y) = u, ψ(x,y,z) = v}.

9. Let U : U_1, ..., U_N be a population with N units, and let X be a real-valued function defined over U. Sampling is done n times without replacement. Let X_i denote the value of X for the ith unit selected. If 3 ≤ n ≤ N, prove that the joint densities of all triples (X_i, X_j, X_k) of observations are the same when i, j, k are distinct.

10. Let U, X and N of Problem 9 be as follows:

U:  U_1   U_2   U_3   U_4   U_5   U_6   U_7
X:  2.3   2.1   3.0   2.3   3.0   3.5   2.1
One samples three times without replacement. Letting X_1, X_2, X_3 be as in Problem 9, determine the joint densities of (i) (X_1, X_2, X_3), (ii) (X_1, X_2) (by taking a suitable marginal), (iii) (X_2, X_3), (iv) (X_3, X_1), (v) X_1, (vi) X_2 and (vii) X_3.

11. Compute P[X_2 = 2.1] and P([X_2 = 2.1]|[X_1 = 2.1]), where X_1 and X_2 are as in Problem 10. Ponder over the intuitive reason for the difference.
12. Let X, Y, Z be random variables with joint density as follows:

f_{X,Y,Z}(0,0,0) = 1/21,  f_{X,Y,Z}(0,0,1) = 2/21,
f_{X,Y,Z}(0,1,1) = 1/7,   f_{X,Y,Z}(2,1,0) = 4/21,
f_{X,Y,Z}(2,0,1) = 5/21,  f_{X,Y,Z}(1,0,2) = 2/7.
(i) Find the joint density of X and Z.
(ii) Find the marginals f_X(·), f_Y(·), f_Z(·).

13. An urn contains 3 red, 2 white and 2 blue balls. Sampling is done without replacement until no balls are left in the urn. Let X denote the trial number at which the first blue ball is selected, let Y be the trial number at which the first red ball is selected, and let Z be the trial number at which the first white ball is selected.
(i) Determine the joint density of X, Y, Z.
(ii) Determine the joint density of X, Y.
(iii) Compute the probability that blue is the last color selected.
(iv) Compute the probability that white appears before red, i.e., P[Z < Y].

14. Prove: If X_1, ..., X_n are independent random variables, if a_1, ..., a_n are constants, and if Y_j = X_j + a_j, 1 ≤ j ≤ n, then Y_1, ..., Y_n are independent random variables.
2.3 Some Particular Distributions
Some particular formulae for densities arise in practice again and again. Those that appear in sample survey mathematics are singled out here. Definition. A sequence of Bernoulli trials refers to repeated plays of a game such that i) with each play there are two disjoint outcomes, S and F , of which exactly one occurs,
ii) P(S) does not vary from play to play, and
iii) the outcomes of the plays are independent events.

An example of a sequence of Bernoulli trials is the repeated tossing of a die where, say, S denotes the outcome that the die comes up 1 or 2 and F denotes the outcome that the die comes up 3, 4, 5 or 6. Independence of outcomes is clear, and P(S) = 1/3 for each toss of the die.

Definition. Let X denote the number of times S occurs in n Bernoulli trials, and let p = P(S). Then the random variable X is said to have the binomial distribution, denoted by B(n,p). We shall sometimes write: X is B(n,p) or X ~ B(n,p). We now find the density of X when X is B(n,p).

THEOREM 1. If X is B(n,p), then

f_X(x) = (n choose x) p^x (1 - p)^(n-x) if 0 ≤ x ≤ n, and f_X(x) = 0 otherwise.
Proof: The event [X = x] is the disjoint union of (n choose x) events, namely, those represented by all n-tuples of patterns of S and F such that each pattern contains exactly x S's. Each such pattern occurs with probability p^x (1 - p)^(n-x). Thus P[X = x] = (n choose x) p^x (1 - p)^(n-x) for x = 0, 1, 2, ..., n. Q.E.D.

THEOREM 2. If X and Y are independent random variables, if X is B(m,p) and if Y is B(n,p), then X + Y is B(m + n, p).

Proof: Consider a sequence of m + n Bernoulli trials in which P(S) = p, and let Z denote the number of times S occurs. Then Z is B(m + n, p). Let X′ denote the number of times S occurs in the first m trials, and let Y′ denote the number of times S occurs in the last n trials. Now X′ and Y′ are independent and have, respectively, the same distributions
as X and Y. Hence X + Y has the same distribution as X′ + Y′ = Z, which is B(m + n, p). Q.E.D.

The binomial distribution occurs in sample survey theory when sampling is done with replacement. In the more frequent case when sampling is done without replacement, the following distribution arises.

Definition. If an urn has r red balls and b black balls, if n balls are selected at random without replacement, and if X denotes the number of red balls found among the n (n ≤ r + b), then X is said to have the hypergeometric distribution.

THEOREM 3. If X is a random variable with the hypergeometric distribution of the last definition, then

f_X(x) = (r choose x)(b choose n-x) / (r+b choose n) if max{0, n - b} ≤ x ≤ min{n, r}, and f_X(x) = 0 otherwise.
Proof: There are (r choose x) equally likely ways of selecting x red balls out of the r red balls in the urn. For each way in which x red balls are selected from the r there are (b choose n-x) ways of selecting n - x black balls. Thus the total number of equally likely ways in which x red balls and n - x black balls can be selected is (r choose x)(b choose n-x). There are (r+b choose n) equally likely ways of selecting n balls out of the total of r + b balls. Thus

P[X = x] = (r choose x)(b choose n-x) / (r+b choose n) if x ∈ range(X), and P[X = x] = 0 otherwise.
We must always verify the range of values of X. The most important part of computing a hypergeometric distribution is not that of computing the ratio involving three binomial coefficients, but that of computing the allowable values of x, i.e., range(X). Clearly x ≥ 0 always. However, if b < n, then the minimal number of red balls in a sample of size n has to be n - b. Thus we see that the smallest value of x is max{0, n - b}. Again, it is clear that x ≤ n. But also x ≤ r. Hence x ≤ min{n, r}. Q.E.D.
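The point about range(X) is easy to make concrete. The Python sketch below (our illustration; the helper name is arbitrary) computes the hypergeometric density while restricting x to max{0, n-b} ≤ x ≤ min{n, r}, exactly as in the proof.

```python
from fractions import Fraction
from math import comb

def hypergeometric_density(r, b, n):
    """Density of X = number of red balls when n balls are drawn without
    replacement from an urn with r red and b black balls (n <= r + b)."""
    total = comb(r + b, n)
    lo, hi = max(0, n - b), min(n, r)          # allowable values of x
    return {x: Fraction(comb(r, x) * comb(b, n - x), total)
            for x in range(lo, hi + 1)}

f = hypergeometric_density(r=3, b=4, n=5)       # only x = 1, 2, 3 are possible
print(f)                                        # {1: 1/7, 2: 4/7, 3: 2/7}
print(sum(f.values()))                          # Fraction(1, 1)
```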
One further important univariate distribution is the uniform distribution.

Definition. A random variable X is said to have the uniform distribution over {1, ..., N} if

P[X = x] = 1/N for 1 ≤ x ≤ N, and P[X = x] = 0 otherwise.
A most important multivariate distribution is the multinomial distribution.

Definition. Suppose a game has r + 1 disjoint outcomes A_0, A_1, ..., A_r, with ∪_{i=0}^r A_i = Ω (i.e., one and only one of the A's must occur). Suppose the game is played n times under identical conditions, so that the n outcomes are independent. Let X_i denote the number of times A_i occurs in the n plays, 1 ≤ i ≤ r, and denote p_i = P(A_i). Then X_1, ..., X_r are said to have the multinomial distribution, which we denote by MN(n, p_1, ..., p_r).

It should be noted that p_1 + ··· + p_r ≤ 1. Also, if r = 1, then MN(n, p_1) and B(n, p_1) are the same. We conclude this section by determining the joint density of multinomially distributed random variables.

THEOREM 4. If X_1, ..., X_r are MN(n, p_1, ..., p_r), then their joint density is

f_{X_1,...,X_r}(x_1, ..., x_r) = [n! / (x_1! ··· x_r! (n - Σ_{i=1}^r x_i)!)] p_1^{x_1} ··· p_r^{x_r} (1 - Σ_{i=1}^r p_i)^{n - Σ_{i=1}^r x_i}

for x_1 ≥ 0, ..., x_r ≥ 0, x_1 + ··· + x_r ≤ n.
pr) Pr)n-n-XlXl-.-"-XrXr.
2.3. SOME PARTICULAR
DISTRIBUTIONS
45
Now how many such n-tuples do we have? There are f M selections of trial numbers possible for A\ to occur. For each x\ trial numbers for Ai to occur there are ( n ~ x i ) selections of trial numbers for A2 to occur. Continuing, we see that the total number of n-tuple outcomes in which there are X\ Ai's,- • •, xr A r 's is fr (n-T£{xt\ =
MV
*i
)
n!
*i!---* r !(n-ELi*<)!' Q.E.D.
EXERCISES 1. In the Polya urn scheme, let X denote the number of times a red ball is selected in the first two trials. Find the density of X. 2. If X and Y are independent random variables, if X is and if Y is 5 ( n , p ) , find P([X = k]\[X + Y = r]).
B(m,p)
3. If X and Y are independent random variables, and if each is uni formly distributed over {1, • • •, iV}, find the density of max{X, Y}. 4. If X and Y are independent and uniformly distributed over {0,1,2}, find the density of X + Y. 5. In Problem 4, find the density of XY. 6. Prove that ^ ( « ) = («: 1 1 )7. In a sequence of Bernoulli trials with P(S) = p, let T denote the trial at which S occurs for the first time, and let X = min{T, 3}. Find the density of X. 8. An absent-minded professor has five keys on a key ring. When he wishes to open his office door, he tries one key after another until he finds the one that works. Let X denote the number of tries needed. Find the density of X. (Note: He doesn't try the same key twice; he is not that absent-minded.)
46
CHAPTER 2. RANDOM
VARIABLES
9. Urn # 1 has two white balls and three black balls, and urn # 2 contains two white balls and two black balls. An urn is selected at random, and two balls are selected from it without replacement. Let X denote the number of white balls in the selection. Find the density of X, 10. IfXuX2,X3 areMAf(n,p u p 2 ,p 3 ) for 0 < u < v < n.
find P([XX = u]\[X1+X2 = v])
11. In Problem 10, prove that Xi,X3
are
MN(n,pi,p3).
12. An urn contains r red balls, w white balls and b blue balls. One samples n times with replacement. Let X denote the number of red balls selected, and let Y denote the number of white balls selected. Find the joint density of X, Y. 13. Solve problem 12 if the sampling is done without replacement. 14. Let X, Y be independent random variables, each uniformly dis tributed over {0,1,2} , and let U = X + Y, V = X - Y, R = min{X, Y} and S = max{X, Y). Find i) the joint density ofU,X
,
ii) the joint density of £7, V, and iii) the joint density of i?, S. 15. Prove: if X,Y are random variables with the same joint density as random variables X1, Y\ and if g is a function defined over range(X, F ) , then the densities of g(X, Y) and g(X', Y') are the same.
Chapter 3 Expectation 3.1
Properties of Expectation
Expectation is a weighted average. Its definition is justified by the law of large numbers, which will be encountered later. Definition. If X is a random variable with density /A-(X), we define its expectation, EX or E(X), by E(X) = £ { * / * ( * ) • x €
range(X)}
or, more simply,
£(*) = 5>/*(s). X
As an example, suppose X is a random variable with density /x(-2.3) fx(2.79)
= l/10,/ j r (0) = l / 5 , / * ( 1 . 3 4 ) = 3 / 1 0 , = 2/5, fx(x) = 0 for x$ {-2.3,0,1.34,2.79}.
Then, in accordance with the above definition, EX = ( - 2 . 3 ) ^ + 0 • | + ( 1 . 3 4 ) ^ + (2.79) • | = 1.288. We might have random variables which are functions of one or more random variables. The following theorem gives the formulae for their expectations. 47
48
CHAPTER 3.
EXPECTATION
T H E O R E M 1. If X andY are random variables, ifh is a function defined over the range of X, and if g is a function defined over the range ofX,Y, then Eh(X) =
Y,h(x)fx(x) X
and Eg(X,Y)
=
J2g(x,y)fx,Y(x,y).
Proof: Using the definition and Exercise 6 in Section 2.2, we obtain Eh{X)
=
y
jTtzP[h(X)
= z]
z Z
= EH*)fx(x). x
The proof of the second formula is accomplished in a similar manner. Q.E.D. As an example, let X be a random variable with density given just above the statement of Theorem 1, and let hhe & function defined by h(x) = 2ex + x2. Then Eh(X)
= *(-2.3)-L + h(0) • I + Ml-34) - A +
k(2.79).
|
=
(2e"2-3 + (-2.3) 2 )(1/10) + (2e° + 0 2 )(l/5)
=
+(2e 1 - 34 + (1.34) 2 )(3/10) + (2e2-79 + (2.79) 2 )(2/5) 19.918
T H E O R E M 2. If X and Y are random variables, and if C is a constant, then i) E(X + Y) = E(X) + E(Y), and ii) E(CX)
=
CE(X).
3.1. PROPERTIES
OF EXPECTATION
49
Proof: By Theorem 1 above and by Theorem 1 of Section 2.2, if we take the function g defined by g(x, y) = x + y,we have
E(X Y) = J2(* J2(x ++y)f £(x ++ y) y)fxA*,y) XX(x,y)
= E D */*.>-(*» y) = Ei: yfxA*, v) = EX xDy */*.r (*»x y) =y Ex E y/*.K*,») = Yl (52fx,y( 'y)) + 12y(J2fx,Y(x>y)) x
x
y
y
x
= Yl fx(x) + Y,yfY(y) = E(x) + E(Y). y = Ex */*(*)+ EyMs/) = £W + £ ( n
Again, by Theorem 1, if h(x) = Cx, then Again, by Theorem 1, if h(x) = Cx, then E(CX) = Y,Cxfx(x) X
= C£xfx(x)
= CE{X).
X
Q.E.D. T H E O R E M 3. If X is B(n,p) then EX = np. Proof: By the definition of B(n,p) and expectation we have
T=0 x=0
and, letting fc = x - 1, we eave eusing gth einomial lheorem) - 1n 1 k n £(X) = npY,( ,1)pk(l-p) -E(X) = npE^r^Al-P)" "* k k=o ^ ^ * ' ' fc=o = np{p^-(1-p)) = np(p + ( 1 - p ) ) n -n1-1==np. np.
Q.E.D.
50
CHAPTER 3.
EXPECTATION
Two little examples should be mentioned now. Example 1. If X = IA for some event A, then, because [IA = 1] = A and [IA = 0] = Ac, we have
E(IA) = 1 • P(A) + 0 • P(AC) = P(A).
Example 2. If X is uniformly distributed over {1,2, • • •, N} , then P[X = x) = l/N for x = 1,2, • • •, N, and hence
x=1
=
JV
iV
1 JV(iV N(N + l) 1) N' JV" 2
=
N+ JV +1l ~ 2 ~ -
EXERCISES 1. If X and Y are random variables with joint density given by / x y ( 0 , 0 ) = 1/36, / x r ( 0 , l ) = 1/18, / y y ( 0 , 2 ) = 1/12, / * y ( l , 2 ) = 1/9, /x,y(2,2) = 5/36, /x,y(2,l) = 1/6, /Ar,y(2,0) = 7/36, and /x,y(l,0) = 2/9, compute (i) E(X), (ii) E(Y), (iii) E(X% (iv) E ( X y ) , (v) £;(2X+y+l), (vi) E (^y) and (vii) E(max{X,Y}). 2. Prove: if a and b are constants, and if X and F are random variables, then E(aX + bY) = aE(X) + bE(Y). 3. Prove: If X and Y are random variables which satisfy X(u) Y(w) for all u e 0 , then £(X) < £;(F).
<
4. Prove: if X is a random variable, then rmn{range{X)}
< E{X) <
m&x{range(X)}.
5. If X and Y are independent random variables, and if each is uniformly distributed over {0,1,2,3}, compute (i) E(min{X, Y}) and (ii) E{XY). 6. Prove: E(X - E(X)) = 0 for any random variable X.
3.2. MOMENTS
3.2
OF RANDOM
VARIABLES
51
M o m e n t s of Random Variables
Powers of random variables are also random variables, and their expec tations are of considerable interest. In this section we discuss moments, central moments, variance and standard deviation. Definition. If X is a random variable, and if n is a nonnegative integer, we define the nth moment of X (or of the distribution or density of X) by mn = E(Xn). We define the nth central moment of X by Hn = E((X EX)n). The first moment, E(X), is really the center of gravity of the mass distributed by fx(x)> Of special interest is the second central moment and its square root. Definition. If X is a random variable, its second central moment, /i 2 , is called its variance and is denoted by VarX or Var(X). The standard deviation of X , denoted by s.d.X or s.d.(X), is defined by
s.d.(X) = Vv^x. The following theorems are constantly used in all of probability and statistics. T H E O R E M 1. If X is a random variable then Var(X) (E(X))\
= E(X2)
-
Proof: By the definition of variance and properties of expectation from Section 3.1, we have
Var(X)
= E((X-E(X))2) = E(X2-2E(X)X = E{X2) - 2E(X)E(X) + {E{X)f = E{X2)-(E(X))2.
+ (EX)2)
Q.E.D. T H E O R E M 2. If X is a random variable, and if C is a constant, then Var(CX) = C2VarX.
CHAPTER 3.
52
EXPECTATION
Proof: By Theorem 1, and by Theorem 2 of Section 3.1, we have Var(CX)
= = = =
E((CX)2)-(E(CX))2 E(C2X2)-C2(E(X))2 C2(E(X2) (E(X))2) C2Var{X). Q.E.D.
Thus, if VarX = 10, and C = 3, then Var(SX) = 32Var(X) also Var(-2X) = (-2)2VarX = 40. T H E O R E M 3. IfX Var(X).
= 90;
andC are as in Theorem2, then Var(X + C) =
Proof: By the definition of variance, = E((X + C - E(X + C))2) = E((X + C - E{X) - C)2) = E((X-E(X))2) = Var(X).
Var(X + C)
Q.E.D. Theorem 3 tells us that variance is not changed by adding a constant to the random variable. T H E O R E M 4. If X and Y are independent random variables, then E(XY) = E(X)E(Y). Proof: In Theorem 1 of Section 3.1, let g(x,y) = xy. In addition we use Theorem 5 of Section 2.2. Thus
E(XY) = Eg(X,Y) = 52g(x,y)fXtY(x,y) x,y
x
y
= (E^W)(E^W) = ^ O T x
y
53
3.2. MOMENTS OF RANDOM VARIABLES
Q.E.D. THEOREM 5. IfX1, ■ • •, Xn are independent random variables, then VarltxA^tvariXi). Proof: By Theorems 1 and 4,
= E = E || ££ X) X) ++gg XXUUXX EiX^/ V/ V -/ -\ ^\Y^EiX^/
= £,E(X]) + J2E(XUXV) 2 - £ ( £ T O2)-££(*«)£(*«) -££W*W -E(£(*;)) 2 2 = = £it{E(X])-(E(X {E(X]) - (£(*;))j)) } }==£ itvar(X Var{X)).j).
i=i
i=i
Q.E.D. Note that, by Theorems 2,4 and 5, if X and F are independent random variables, and if a, b and c are constants then Var(aX + bY + c) = a2Far(X) + 62Var(F). Thus, Var(2X - 15F + 376) = 4Var(X) + 225Var{Y). (A very common mistake committed by unwary students is the following: if they are given that X and Y are independent, they might write that Var(X - 3Y) = Var(X) - 9Var(Y) instead of Var{X - 3Y) = Var{X) + {-3)2VarY = Var(X) + War(Y).) Note: sometimes, when X, Y are not independent, it still happens that Var{X + Y) = Var(X) + Var(Y). This occurs only when X and Y are uncorrelated, which will be discussed in the next section. There will be problems at the end of this section in which it will occur that Var(X + Y)^ Var(X) + Var(Y). THEOREM 6. If X is B(n,p), then Var(X) = np(l
-p).
CHAPTER 3.
54
EXPECTATION
Proof: Let Ai denote the event that S occurs on the zth trial, 1 < i < n. By the definition of Bernoulli trials, Ai, • • •, An are independent events, P(A t ) = p for 1 < i < n, and IA^'" ,lAn a r e independent random variables. Now I\. = J^., so E(I\.) = E(IA^) = P- Hence Var(IAi)
= E{I%) - {E{IAi)f
= p{\ - p).
By Theorem 5, since X = ]££=i /A,, we have Var(X)
= Y,Var{IAi)
= np(l
-p). Q.E.D.
Another result we anticipate using in the future is the following. T H E O R E M 7. In sampling from a population n times without re placement, ifX\, - • • , X n denote the observed outcomes, then E(XiXj) = E(X1X2)fori^j. Proof: We apply Theorem 1 of Section 3.1 with g(x,y) Theorem 8 of Section 2.2 to obtain
E(XiXj)
=
= xy and
^uvfx^x^v) UyV
= E«"/X,A(«,«) = ^ A ) . UyV
Q.E.D. EXERCISES 1. If X is a random variable, prove that the value of t that minimizes E((X - t)2) is t = E(X). 2. Prove: If X is a random variable, then Var (yjxX)
= 1.
3. Prove: If X is a random variable, then {E{X)f
<
E(X2).
4. Prove: If X is a random variable, then Var(X)
> 0.
55
3.2. MOMENTS OF RANDOM VARIABLES
5. Prove: If X is B(n,p), then the value of p that maximizes Var(X) isp = 1/2. 6. Prove: If X is a random variable, and if Y is defined by Y =
1 (X s.d.X
E(X)),
then E(Y) = 0 and Var(Y) = 1. 7. If X and Y are independent random variables, if E(X2) = 5, E(X) = 2,E(Y2) = 10 and £ ( y ) = 3, compute (i) Var(X + Y), (ii) Var(X + 2Y), and (iii) F a r ( X - 3 r ) . 8. Let X and V be random variables with the following joint density:
.2/9
3-\
.5/36
.1/12
0 11/36 0
.7/36
.!/6
24
1/18
1/9 3
T
X
Compute Var(X), Var(Y) and Var(2X - 3Y). (Note: X and F are not independent.)
CHAPTER 3.
56
EXPECTATION
9. Prove: If X is a random variable, then Hn = £ ) ( " I
rn{mn-j{-\y,
where, as in this section, rrij denotes the jth. moment of X and Hn denotes the nth central moment of X. 10. Use the fact that
" , 2 _ n(n + l)(2n + l)
h3 ~
6
to derive the formula of the variance of a random variable which has the uniform distribution over {1, • • •, n } . 11. Prove: if X\, • • • ,Xn are independent random variables, then E(X1---Xn)
=
f[E(Xj). i=i
12. Prove that £?=oC; - np)2 ( ] ) p>(\ - p)n^ 0
3.3
= np(l - p), where
Covariance and Correlation
Covariances are needed when one wishes to compute the variance of a sum of random variables which are not independent. Correlation is sometimes correctly used and is sometimes misused as a measure of the degree of dependence between two random variables. Both covariance and correlation are important, and their definitions and basic properties are developed in this section. In order to do so, we shall need Schwarz's inequality. L E M M A 1. If X is a random variable such that E(X2) P[X = 0] = 1.
= 0, then
3.3. COVARIANCE
AND
57
CORRELATION
Proof: If the conclusion were not true, then P[X ^ 0] > 0. Hence there exists a number xo € range(X) such that XQ ^ 0 and P[X = x0] > 0. Then E(X2)
= Y,x2fx(x)
> xlfx(x0)
> 0,
X
which contradicts the hypothesis that E(X2) = 0.
Q.E.D.
T H E O R E M 1. (Schwarz's Inequality). If X andY are random vari ables, then {E(XY))2
<
E(X2)E{Y2),
with equality holding if and only if there are real numbers a and b, not both zero, such that P[aX + bY = 0] = 1. Proof: If either random variable is zero, then the theorem is easily seen to be true. Hence we need to prove the theorem when neither random variable is the zero random variable. We first recall that the quadratic equation ax2 + 2bx + c = 0, with a ^ 0, has a double real root if and only if b2 — ac = 0 and has no real roots if and only if b2 — ac < 0. Now, for any real t, P[(tX + Y)2 > 0] = 1, and hence E((tX + Y)2) > 0 for all t. Thus the second degree polynomial in t, E({tX + Y)2) = E(X2)t2
+ 2E(XY)t
+ E(Y2)
is always non-negative, i.e., it either has a double real root or no real root. In either case, b2 - ac = (E{XY))2 - E(X2)E(Y2) < 0, in 2 2 2 which case (E(XY)) < E(X )E(Y ). Equahty holds if and only if the polynomial has one real double root, i.e., there exists a value of t, call it t 0 , such that E((t0X + Y)2) = 0. By Lemma 1, this is true if and only if P[t0X + Y = 0] = 1 . Q.E.D. An easy application of the theorem is an alternate proof of the fact that for every random variable X, (E(X))2 < E(X2). Indeed, let Y = 1 with probability 1. Then X = XY and E(Y2) = 1, so (E{X))2 = {E(XY))2 < E{X2)E{Y2) = E(X2).
CHAPTER 3.
58
EXPECTATION
Definition. If X and Y are random variables, we define the covariance of X and Y by Cov{X,Y)
= E((X - E(X))(Y
-
E(Y))).
T H E O R E M 2. If X andY are random variables, then Cov(X,Y) E(XY) - E{X)E(Y) and Cov(X,X) = Var(X).
=
Proof: Using properties of expectation, we have Cov(X,Y)
= =
E((X - E(X))(Y - E(Y))) E(XY - E(X)Y - E(Y)X +
=
E(XY)
-
E(X)E(Y))
E(X)E(Y).
The second conclusion follows from the two definitions.
Q.E.D.
Definition. If X and Y are non-constant random variables, then their correlation or correlation coefficient, pxy, is defined by PX Y
'
Cov(X,Y) ~~ s.d.(X)s.d.(Y)'
T H E O R E M 3. IfX andY are independent (and non-constant) dom variables then Cov{X,Y) = 0 and pxy = 0.
ran
Proof: Using the definition of covariance, Theorem 4 of Section 3.2 and Theorem 2 above, we have Cov(X,Y)
= E(XY) =
This implies pXly = 0.
-
E(X)E(Y)
E(X)E(Y) - E(X)E(Y)
= 0. Q.E.D.
It is important to note that the converse is not necessarily true, namely, pxy = 0 does not necessarily imply that X and Y are inde pendent. There are examples (see Exercise # 1 ) where X and Y are not independent and yet px,Y = 0-
3.3. COVARIANCE
AND
CORRELATION
59
T H E O R E M 4. IfX andY are non-constant random variables, then —1 < PxyY < 1- Further, pxy = I if and only ifY = aX + b for some constants a > 0 and b, and pxy — —1 &/ a^d onZy ifY = cX + d /or some constants c < 0 and d. Proof: In Theorem 1 (Schwarz's inequality) replace Xand Y by X — E(X) and y—E(Y) respectively, and one immediately obtains p2XY ^ 1 or — 1 < /OX,Y < 1. By Theorem 1, /o^ y = 1 if and only if there are real numbers u and v, not both zero, such that u{X - E(X)) + v(Y - E(Y)) = 0. Since by hypothesis X and Y are non-constant, this last condition implies that both u and v are non-zero. Hence we may write Y = aX+b when PXY = 1- Now
cw(x,r) = £((* - £(x))(y - £(y))) = a £ ( ( X - E(X))2) = aVar(X).
+ bE(X -
E(X))
Since Var(X) > 0, we may conclude (i) Pxy = 1 if and only if Cov(X, Y) > 0, which is true if and only if a > 0, and (ii) />x,y = —1 if and only if Cov(X, Y) < 0, which is true if and only it a < 0. Q.E.D. In sample survey theory we shall need to know the formula for the correlation coefficient of two multinomially connected random variables. L E M M A 2. If X\j • • •,Xr are random variables whose joint distribu tion is MAf(n,pi, • • • ,p r ), then (Xi,X2) are MAf(n,pi,p2) and X± is S(n,pi). Proof: In the definition of multinomial distribution in Section 2.3, we could consider the events B0 = A0 U UJ=3<^*> B\ = A\ and B2 = A2. Then Xi and X2 are the number of times B\ occurs in n trials and the number of times that B2 occurs in n trials, respectively. Further B0, B\
60
CHAPTER 3.
EXPECTATION
and B2 are disjoint, and one of them must occur. Thus, by definition (X\,Xi) are MN(n,p\,p2). Also, X\ is the number of times event A\ occurs in n trials. The definition of Bernoulli trials is satisfied, and thus Xx is B{n,pi). Q.E.D. T H E O R E M 5. IfXu---,Xr Cov(Xi,X2) = —npip2, and
are MAf(n,pu-
Pxux2 = - 4
• - ,p r )(r > 2), then
P1P2
(l-ft)(l-ft)-
Proof: By Lemma 2, the random variables X\,X2 are A4J\f(n,pi,p2). Now Xt- denotes the number of times Ai occurs in n trials, i = 1,2. Let Ct- denote the event that A\ occurs in the zth trial, and let Di denote the event that A2 occurs in the ith trial, 1 < i < n. Hence X\ = J^JLi ^C and X 2 = YA=I ID{,
and
£(x^2) = £((£>,)(£;/i,,)) t=i
i=i
For each i, C, and JD, are disjoint, so Ic(Di = 0- For the n2 — n pairs {(z? i ) : * 7^ i}? since Ct- and Z)j are independent, £ ( / c , ^ ) = P(dDi)
=
P{Ci)P{Dj)=Pm.
2
Hence E(XiX2) = (n — n)pip 2 - Also, by Lemma 2, X t is B{n,pi), and hence E(Xi) = npt-. Thus, by the above and Theorem 3 of Section 3.1 we have Cov(XuX2)
= = =
E(X1X2) (n 2 - n)PlP2 -npxp2,
E(X1)E(X2) - nPlnp2
which yields the first formula. By Theorem 6 of Section 3.2, Var(Xi) npi(l — pi), * = 1,2, and thus Cov(XuX2) Pxux2
=
y/VariXJVariX^
=
3.3. COVARIANCE
AND
CORRELATION
61
-npip2
^npi(l-p1)np2(l
-P2)
P1P2
Q.E.D.
EXERCISES
1. If X is uniformly distributed over {—2, —1,0,1,2}, and if Y = X2, find the joint density of X, Y (draw a graph of it), and show that X, Y are not independent.
2. In Problem 1, compute px,Y-
3. Let X, Y be random variables with joint density given by
/x,y(0,2) = l / 6 , / * , y ( 5 , l ) = 1/3, fx,Y (10,0) = 1/2.
Compute px,Y-
4. If X and Y are random variables whose joint density is graphed below, compute E(X), E(Y), Cov{X,Y), Var(X), Var(Y), px,Y and Var(X + Y).
CHAPTER
62
.1/4
Y 4-
.I/6
3-
,1/12
EXPECTATION
.1/5
2-
,1/10
1-
1
-1
3.
()
.1/5
1
1
1
1
2
3
5. Let X and F be random variables whose joint density is given by /x,y(-5,-l) /*,y(l,3)
= =
2 / 1 3 , y * , y ( - 2 , l ) = 3/13, 5/13,/jr,y(5,5) = 3/13.
Compute pxyY6. Determine whether the following polynomials have two distinct real zeros, one double real zero or no real zeros: (i) 2x2 - 3x - 1 (ii) x2 + x + l (iii) x2 + 25z + 625/4. 7. Prove: if Z is a random variable, and if Z > 0, then -E(Z) > 0 . 8. Prove: if X and y are random variables, then Var(X Var(X) + V a r ( r ) if and only if pXy = 0.
+ Y) =
3,3, COVARIANCE
AND CORRELATION
63
9. If X and Y are random variables, and if Var(X+Y) = Var(X) + Var(Y), does this imply that X and Y are independent?
Chapter 4 Conditional Expectation 4.1
Definition and Properties
A frequent goal in statistical inference is to determine as accurately as possible the expectation of an observable random variable. Many times all one can do is observe the random variable and declare that this is the best one can do. However, on occasion one might be able to observe the conditional expectation of the random variable, given prior information. This has a tendency to be closer to the expectation that one wishes to ascertain. The Rao-Blackwell theorem at the end of this chapter renders these remarks more precise. We shall define two kinds of conditional expectation. One conditional expectation of a random variable X , given a value y of another random variable Y, is a number which we denote by E(X\Y = y). Another conditional expectation, that of a random variable X , given a random variable y , is a random variable E(X\Y) which assigns to each u> 6 fi the number E(X\Y)(u>). These two are interrelated and determine each other. Definition. If X and Y are random variables, and if y G range(Y), we define E(X\Y = y), the conditional expectation of X given Y = j / , by
E(X\Y = y) = J2*P([X = *W = y})X
65
66
CHAPTER 4. CONDITIONAL
EXPECTATION
The definition above has a corollary which is essentially the theorem of total probabilities extended to conditional expectation. T H E O R E M 1. If X and Y are random variables, then E(X) ZyE(X\Y = y)P[Y = y].
=
Proof: Using the identities X = ^2X xI[x=x] a ^ d 1 = J2y I[Y=y], w e have
E(X) = £((E*fc«-i)(£W) x
E
y
xI
= (E12 [x=*W=v]) y
y
x
x
= E(E*P(lx = *W = v]))P\r = v] y
x
= J2E(X\Y = y)P[Y = y}. y
Q.E.D. Loosely speaking, then, if one knows the conditional expectation of X given any particular value of F , and if one knows the distribution (i.e., density) of F , then one is able to compute E(X). Remark 1. If X andY are random variables, and if g is a function defined over range(X), then E(g(X)\Y
= y) = j:9(x)P([X
= x]\[Y = y))
Proof: By the definition of conditional expectation and by Theorem 4 in Section 2.2,
E(g(X)\Y = y) = 5>P(far(*) = z]\[Y = „]) * r
P([g(X) = z][Y = y}) P[Y = y]
4.1. DEFINITION AND PROPERTIES =
J^yjY,*Y,{P([X
= E9(*)P([X
67 = x}lY = y)):9(x) = z}
= x]\[Y = y]).
Q.E.D. LEMMA 1. If X andY are random variables, and ifyE then E X Y
( \
range{Y),
= y) = T ^ y ^ W ™ ) -
Proof: By the definition of conditional probabiUty and the above def inition,
=
p[Y = y] TsXEVlx^W^y})
=
P[Y
= y^E(Y,xIlx=x)I[)r=y])
Q.E.D. Example. In the joint density that is pictured on the next page, one easily obtains E(X\Y = 0) = 3/4, E(X\Y = 1) = 4/3, and E(X\Y = 2) = 19/12. (See if you can work these out intuitively.)
CHAPTER 4. CONDITIONAL
68
,5/28
,7/28
12/28
,4/28
,6/28
jl/28
f 3/28
EXPECTATION
1
T H E O R E M 2. If X,Y and Z are random variables, and if a and b are constants, then for every z G range(Z), E(aX + bY\Z = z) = aE(X\Z
= z) + bE(Y\Z = z).
Proof: By Lemma 1 and properties of expectation we have E(aX + bY\Z = z)
=
p[z
= x]E{{aX
=
aE(X\Z
+
bY)I[z=z])
= z) + bE(Y\Z = z). Q.E.D.
T H E O R E M 3. If X and Y are random variables, and if g is a func tion defined over range(X, Y), then for y € range(Y) E(g(X, Y)\Y = y) = E(g(X, y)\Y = y).
4.1. DEFINITION
AND
PROPERTIES
69
Proof: Again by Lemma 1,
E(g(X,Y)\Y
= y) =
p
^
=
y]E{g{X,Y)Ip=y])
=
p^yl^iX^I^y])
= E(g(X,y)\Y = y). Q.E.D. T H E O R E M A. If X andY are independent random variables, if g is a function defined over range(X), and ify€ range(Y), then E(g(X)\Y
= y) =
E(g(X)).
Proof: Remark 1 and the hypothesis of independence of X and Y imply E(g(X)\Y
= y)
= ^g(x)P([X
= x)\[Y = y})
X
= E9^)P[X
= x) = E(g(X)).
X
Q.E.D. We now define conditional expectation as a random variable. Definition. If X and Y are random variables, the conditional expec tation of X given Y is defined to be the random variable E(X\Y)
= Y;E(X\Y
= y)I[y=y],
y
where the summation is taken over all y 6 range(Y). E(X\Y) is a function of Y.) T H E O R E M 5. If X,Y are constants, then
(Note that
and Z are random variables, and if a and b
E(aX + bY\Z) = aE{X\Z)
+
bE(Y\Z).
70
CHAPTER 4. CONDITIONAL
EXPECTATION
Proof: By the definition and Theorem 2,
E{aX + bY\Z) = Y,E(aX +
hY Z=zZ
\
)hz=A
Z
=
£ ( a £ ( X | Z = z) + bE(Y\Z = z))I[Zzzz] z
= a J2 E(X\Z = z)I[z=z] + 6 £ E{Y\Z = z)I[z=z] Z
=
Z
aE{X\Z) + bE{Y\Z). Q.ED.
Since E(X\Y) expectation.
is a random variable, we shall be concerned about its
T H E O R E M 6. IfX andY are random variables, then E(E(X\Y)) E{X).
=
Proof: By the definitions of expectation and conditional expectation, and by Theorem 1, we have E(E(X\Y))
= E(£E(X\Y
=
y)Ipr=a)
y
= Y.E(x\Y = y)p[Y = y] = E{x). y
Q.E.D. T H E O R E M 7.I/X andY tion over range(X), then
are random variables, and if g is a func-
E(g(X)Y\X)
=
g(X)E(Y\X).
Proof: Using the definitions and the properties already proved of ex pectation and conditional expectation, we have E{g{X)Y\X)
= Y,E{9{X)Y\X
= x)I[x=x]
X
= Y,E{g{x)Y\X = x)I[x=x]
4.1. DEFINITION AND PROPERTIES -£g(x)E(Y\X
71 = x)I[x=x]
X
Y,g(X)E(Y\X
= x)I[x=x]
X
g(X)Y,E(Y\X
= x)I[x=x]
g(X)E(Y\X). Q.E.D. COROLLARY TO THEOREM 7. If X is a random variable, and if g is a function defined overrange(X), then E(g(X)\X) = g(X). Proof: This follows by taking Y = 1 and noting that E(Y\X) = 1. Q.E.D. EXERCISES 1. Prove: if X and Y are random variables, if a and b are constants, and if Y = aX + b, then E(Y\X) = Y. 2. If fx,Y(x,y) is as displayed graphically below, compute E(Y\X = 0), E(Y\X = 1), E(Y\X = 2), E(Y\X = 3) and E(Y). Y 3 11/2
ll/8
.V8
ll/24
.1/24
.1/24
11/32
,1/32
,1/32
_^IZ32
X
72
CHAPTER 4. CONDITIONAL 3. In Problem 2, compute E(^\X
EXPECTATION
= 2).
4. In Problem 2, find the density of the random variable 5. In Problem 2, compute E(X2\X
E(Y\X).
+ Y = 2).
6. Prove: if Y = c, where c is some constant, and if X is a random variable, then E(Y\X) = c. 7. Prove: if X is a random variable, then XI[x=c] = d[x=c]8. Prove: if X is a random variable, then X = y%2xllx=x]. x
4.2
Conditional Variance
Loosely speaking, the conditional expectation of a random variable given another replaces more evenly spread probability masses by more concentrated point masses but leaves the expectation the same. A happy consequence is that the variance is decreased, which is some thing much to be desired in sample survey theory. We shall see just how that happens in this section. The notion of conditional variance is a principal tool used in multi-stage sampling, and thus what is about to unfold is of utmost importance in sample survey theory. Definition. If [/, V and W are random variables, then the conditional covariance of (7, V given W = w, Cov(U, V\W = w), is defined by Cov(U, V\W = w) = E(UV\W
= w) - E(U\W = w)E{V\W
= w).
An equivalent definition of Cov(U, V\W = w) is given by the fol lowing theorem. T H E O R E M 1. IfU,V,W Cov(U,V\W
are random variables, then
= w) = E((U-E(U\W
= w))(V-E(V\W
= w))\W = w).
4.2. CONDITIONAL
73
VARIANCE
Proof: By Theorem 2 in Section 4.1, the right hand side of the above equation becomes E(UV-UE(V\W
= w)
- E(U\W = w)V + E(U\W = w)E(V\W = w)\W = w) = E(UV\W = w) - E(U\W = w)E(V\W = Cov{U,V\W = w).
= w)
Q.E.D. Definition. If X and W are random variables, the conditional variance of X given W = w,Var(X\W = w), is defined by yar(X|VT = w) = Cov(X,X\W = w). T H E O R E M 2. If X and W are random variables, then Var{X\W
= w) = E{(X - E{X\W
= w)f\W
= w).
Proof: This is a direct consequence of the definition of conditional variance and of Theorem 1. Q.E.D. Corollary to Theorem 2. If X and W are random variables, then Var(X\W
=
w)>0.
Proof: By Theorem 2, using the definition of conditional expectation and the facts that I[w=w] = I\w=w] a n d E(Y2) > 0 for any random variable F , we have Var(X\W
= w)
= E{(X - E(X\W
= w))2\W = w)
Q.E.D. Conditional variance has much the same properties as does variance, plus: any function of the conditioning random variable behaves very much like a constant.
74
CHAPTER 4. CONDITIONAL
EXPECTATION
T H E O R E M 3. If X and W are random variables, and if c is a con stant, then Var(c + X\W = w) = Var(X\W = w) and Var(cX\W = w) = c2Var(X\W = w) . Proof: Since E(c + X\W = w) = c + E{X\W = w), we apply The orem 2 to obtain the first conclusion. Also, since E(cX\W = w) = cE(X\W = «i), we again apply Theorem 2 to obtain the second equa tion. Q.E.D. T H E O R E M 4. If X,Y and W are random variables, and if X is any function of W, say, X = f(W), then Var(X + Y\W = w) = Var(Y\W = w), and Var(XY\W = w) = {f{w))2Var(Y\W = w). Proof: By Theorem 3 in Section 4.1, E(f(W)
+ =
Y\W = w) = E(f(w) + Y\W - w) f(w) + E(Y\W = w),
and E{{f(W)
+ Y-E(f(W)
+ Y\W =
=
E((Y - E(Y\W
=
Var(Y\W
tv))2\W=:w)
= w))\W = w)
= w).
Also E(f(W)Y\W and Var(f(W)Y\W
= w) =
f(w)E(Y\W
= w),
= w) = E{(f(W)Y - f(w)E(Y\W = (f(w))2Var(Y\W = w).
= w))2\W = w)
Q.E.D. A result to be used frequently in multi-stage sampling is the follow ing. T H E O R E M 4A. If X and Y are random variables, and if f(x,y) any function of two variables then Var(f(X,Y)\Y
= y) = Var(f(X,y)\Y
= y)
is
4.2. CONDITIONAL
VARIANCE
75
for all y G rangeY. Proof: Using the definition of conditional variance and Theorem 3 of Section 4.1, we have Var(f(X,Y)\Y
= E{{f{X,Y))2\Y
= y)
= =
= y)-{E(f(X,Y)\Y
E({f{X,y))2\Y Var{f(X,y)\Y
= y) - (E(f(X,y)\Y = y)
for all y G rangeY.
= y))2 = y)f
Q.E.D.
Definition. If U, V and H are random variables, then the conditional covariance of U and V, given H, is a random variable defined by Cov(U,V\H)
= E(UV\H)
-
E(U\H)E(V\H).
An immediate corollary of this definition is the following result. T H E O R E M 5. IfU.V
and H are random variables, then
Cov(U,V\H)
= ^Cov(U,V\H
= h)I[H=h].
h
Proof: If h' ^ h", then easily I[H=h']I[H=h"] = 0. Thus, E(U\H)E(V\H)
( ^ E(U\H = />') W l ) ( £ E(V\H = hff)I[H=h>l]
=
h'
=
£
h" E U H
( \
= h)E(V\H
= h)I[H=h].
h
Also by the definition, E(UV\H)
= 2 > ( t f V | f f = *)'[*=*]• h
Thus, by the definition above, Cov{U, V\H)
=
£ ( £ ( # V|ff = h) - E(U\H h h
= h)E(V\H
= h))I[H=h]
76
CHAPTER 4. CONDITIONAL EXPECTATION Q.E.D.
Analogous to Theorem 1 is the following result for conditional covariance given a random variable. THEOREM 6. IfU,V and H are random variables, then Cov(U, V\H) = E({U - E(U\H))(V -
E{V\H))\H).
Proof: Remembering that E(X\Y) is a function of Y, then by Theo rems 5 and 7 of Section 4.1 we have E((U - E{U\H))(V - E{V\H))\H) = = E(UV - E{U\H)V - E{V\H)U + E(U\H)E{V\H)\H) = E(UV\H) - E(E(U\H)V\H) -E(E(V\H)U\H) + E(E(U\H)E(V\H)\H) = E{UV\H) - E{U\H)E{V\H) -E(V\H)E(U\H) + E{U\H)E{V\H) = E{UV\H) - E(U\H)E{V\H) = Cov(U,V\H). Q.E.D. The fundamental theorem of this section is the following. THEOREM 7. IfU,V and H are random variables, then Cov(U, V) = E(Cov(U, V\H)) + Cov{E{U\H), E(V\H)). Proof: By Theorem 6 of Section 4.1, E(UV) = E(E(UV\H)), E{U) = E(E(U\H)) and E(V) = E(E(V\H)). Thus Cov(U,V) = E(UV) - E{U)E{V) = E(E(UV\H)) E(E(U\H))E(E(V\H)) = E(E(UV\H)) - E{E(U\H)E{V\H)) +E(E(U\H)E(V\H)) E(E(U\H))E(E(V\H)) = E(Cov(U, V\H)) + Cov{E{U\H), E{V\H)).Q.E.D.
4.2. CONDITIONAL
77
VARIANCE
Definition. If X and H are random variables, then the conditional variance of X given H is the random variable denned by Var{X\H) = Cov(X,X\H). T H E O R E M 8. If X and H are random variables then (i) Var(X\H) (ii) Var(X\H) (iii) Var(X)
= ZkVar(X\H
= h)I[H=hh
= E((X - E{X\H))2\H), = E{Var{X\H))
+
and
Var(E(X\H)).
Proof: These three results are special cases of Theorems 5,6 and 7. Q.E.D. Conclusion (iii) in Theorem 8 is applied again and again in multi stage methods in sample survey theory. The following theorem should be given here since it is an immediate corollary to Theorem 8 and is widely used in mathematical statistics. T H E O R E M 9. (Rao-Blackwell Theorem). If X andY variables then Var(X) > Var(E(X\Y)).
are random
Proof: Since Var(X\Y = y) > 0, it follows that Var(X\Y) > 0 and thus E(Var(X\Y)) > 0. The conclusion follows now from Theorem 8. Q.E.D. The following extension of theorem 3 is a standard tool used in sample survey theory. T H E O R E M 10. If X,Y function of Z, then
and Z are random variables, and if X is a
Var(X + Y\Z) =
Var(Y\Z)
and Var(XY\Z)
=
X2Var(Y\Z).
78
CHAPTER 4. CONDITIONAL
EXPECTATION
Proof: Suppose X = f(Z). Then by Theorems 4 and 8, Var(X + Y\Z)
= £ Var(f(Z)
+ Y\Z =
z)I[z=z]
Z
=
£ Var(Y\Z = z)/ I Z = ,] =
Var(Y\Z),
Z
and Var(XF|Z)
=
^Var(f(Z)Y\Z
=
z)I[z=z]
Z
=
T,(M?Var
=
z)I[z=z]
=
z)I[z=z]
z
=
Y,U{Z)fVar{Y\Z Z
=
(f(Z)fj:Var(Y\Z
=
z)I[z=z]
Z
=
X2Var(Y\Z). Q.E.D.
Finally, an indispensable tool in certain parts of sample survey the ory is that of conditional independence, Definition. If £/i, • • •, £/n, Z are random variables, then £/i, • • •, Un are said to be conditionally independent given Z (or with respect to Z) if P ([W< = UMZ = z])=f[ \t=l
/
P([Ut = ut]\[Z = *]) t=l
for all U{ £ range(Ui), 1 < i < n, and all 2 €
range(Z).
Conditional variance relates to conditional independence in just the same way that variance relates to independence. T H E O R E M 11. IfUu • • •, Un> Z are random variables, and if U\, - - -, Un are conditionally independent given Z, then VariUi + • • • + Un \Z = z) = £ Var{Ui\Z = z)
4.2. CONDITIONAL
VARIANCE
for all z € range(Z),
79
and
Var(J2Ui\z)
=J2Var{Ui\Z).
\t=l
/
1=1
Proof: We first note that for n > 2, if i =^ j , then also Ui and Uj are conditionally independent given Z. Thus E(UiUj\Z = z) = J2uvP([Ui = u][Uj = v]\[Z = z]) tt,v
= E « ^ ( [ ^ = «]|[Z = ^ ( P i = f]|[Z = *]) =
fe*PW
= u]\[Z = ,])) fevP([Ui = v]\[Z = *]))
= E(Ui\Z = z)E{Uj\Z = z). Using this we obtain, for i ^ j , Cov(Ui,Uj\Z
= z) = 0.
Thus Var(Ui + --- + Un\Z = z)
= £ Var(^|Z = z) + £ tf(w(l/i, l/,-|Z = z)
= £Var(C/i|Z = *), t=l
thus establishing the first conclusion. The second conclusion follows from the first by multiplying both sides by I[z=z] a n d summing over all z e range(Z). Q.E.D. EXERCISES 1. Prove: If £7, V and W are random variables, and if W is a function of V, then E(U\W) = E(E{U\V)\W).
80
CHAPTER 4. CONDITIONAL
EXPECTATION
2. Let X, Y be random variables whose joint density is given by the graph below. Compute (i) Var(E(X\Y)), (ii) the density of Var{X\Y) and (Hi) E(Var(X\Y)).
Y
4-
.1/7
3-
.1/7
.1/7
.I/7
2-
1H
,1/7
~i
-1
()
.1/7
.1/7
i
i
i
1
2
3
X
3. Prove: If X and Y are random variables, then
{E{X\Y)Y = Y,{E{X\Y = y)yi[Y=y]. 4. Prove: If X, Y and H are random variables, and if X and Y are conditionally independent given H, then £(AT|#) = £(X|#)£(F|#). 5. Prove: If X, y and Z are random variables, and if {X, F } and Z are independent, then £(X|Y, Z) = £ ( X | F ) . 6. Let X and Y* be random variables with joint density given by /y,y(3,6)
= =
/y,y(4,6) = / y ,K(5,6) = / Y , y ( - 2 , 0 ) / x , y ( - 4 , 0 ) = / x , y ( - 6 , 0 ) = l/6.
4.2. CONDITIONAL
VARIANCE
(i) Compute Var(X\Y
= 6) and Var(X\Y
(ii) Find the density of (iii) Compute E(X\Y
81 = 0).
Var(X\Y).
= 6) and E(X\Y
(iv) Find the density of
E(X\Y).
(v) Compute E(Var(X\Y))
and
= 0).
Var(E(X\Y)).
(vi) Verify for this example that Var(X)
= Var(E(X\Y))
+
E(Var(X\Y)).
Chapter 5 Limit Theorems 5.1
T h e Law of Large Numbers
Two limit theorems, known as the law of large numbers and the central limit theorem, occupy key positions in statistical inference. The law of large numbers provides a method of estimating certain unknown constants. The central limit theorem, among its many uses, gives us a means of determining how accurate these estimates are. This section is devoted to a most accessible law of large numbers. L E M M A 1. (Chebishev's Inequality). If X is a random variable, then for every e > 0, P([\X - E(X)\ > e)) <
Var(X)/e2.
Proof: We easily observe that E(X2)
=
"£x2P[X
= x]
X
>
x2P[X = x]
£ {*:M> £ }
> =
e2£{P[X = z]:|z|>e} ^P[\X\>e}. 83
84
CHAPTER 5. LIMIT
THEOREMS
Thus, P[\X\ > e] < E(X2)/e2. Now, since this inequality holds for every random variable X , replace X by X — EX to obtain Var(X)/e2.
P[|X - E(X)\ >e]<
Q.E.D. Chebishev's inequality gives loose confidence intervals for the expec tation of an observable random variable when one knows its variance. Namely, if one wishes to find an interval (X — e, X + e) for the expecta tion of an observable random variable X when one knows its variance, one uses the following equivalent form of Chebishev's inequality. T H E O R E M 1. If X is a random variable and if e> 0 then P[X-t<
Var(X)/e2.
E(X) < X + e] > 1 -
Proof: By Chebishev's inequality, 1 - P[\X - E(X)\ > e] > 1 -
Var(X)/e2.
But l-P[\X-E(X)\>e]
=
P[\X - E(X)\ < e]
=
P [ - e < X - E(X) < e]
= P[X-e<E(X)<X
+ e].
Substituting this into the above inequality yields the theorem. Q.E.D. T H E O R E M 2. (The Law of Large Numbers) Let Xx, - • -, X n be inde pendent random variables with the same density, i.e., a sample of size n when sampling is done with replacement. Then for every e > 0, Jiim P{[\(XX + ■■■ + Xn)/n
- EiXJ]
> e]) = 0
Proof: Let Xn = (Xx + ■ ■ ■ + Xn)/n. Then E{Xn) = \E{XX Xn) = ^nE(Xt) = E(Xt), and, by Theorem 5 of Section 3.2,
Var(Xn)
= LVar
( ±
X
\
=
L2pvar(Xj)
+ ■■■ +
85
5.1. THE LAW OF LARGE NUMBERS NUMBERS =
\nVar{X11)) \nVar(X
=
-Var(X1).
Now by Chebishev's inequality, E(X11)\>e]e]
0 < P[\Xn= Var(Xi)/ne2
Thus P[\X P[|Xnn -- E(X!)\ E(X{)\ > e] -> 0 as n -> oo. oo.
Q.E.D. Q.E.D.
The law of large numbers is popularly known as the "law of aver ages". Our next theorem provides a rigorous justification for the first approach to probability described in Section 1.1. T H E O R E M 3. (BernouUi's theorem) In a sequence of Bernoulli trials involving the outcome S possible at each trial whose probability is p, if Sn denotes the number of times S occurs in the first n trials, then
IimP([|^-p|>e]) = 0 n->oo
n
for every e > 0. Proof: Let AA denote the event that S occurs in the zth trial. Then Then IAl,,•-••IA • ■ ••IA areindependent independentrandom randomvarrabless varrablessallallwith withthe thesame sameden den den n n are sity, namely [p ifx \p \ix = = 11
H [l~
[l~pP M RU *JTo°i>
Their Their common common expectation expectation is is £(/.Ai) £(/.Ai) = = PP- We We observe observe that that
A A
Hence, by the Law of Large Numbers, P[|—-p\ n for every e > 0.
>> ee]e- ^> 00 aa s n - > o o > Q.E.D.
86
CHAPTER 5. LIMIT
THEOREMS
EXERCISES 1. Prove: if a\, • • •, a n , &i, • • •, bn are positive numbers, and if aj > e > 0 for 1 < ;' < n, then £ ? = 1 Oik > « E;=i h2. Prove: if X is a random variable, then (i) E{*:**>e} *2fx(x) (ii)
> eP[X> > e], and 2
E{X:\x\>,yx fX(x)>e2P[\X\>e].
3. If Y is a random variable, and if Var{Y) e > 0 such that
= 1, find a value of
P ( [ £ ( y ) E ( F - e , r + e)])>.95. 4. Prove: if Ai, • • •, An are independent events with the same proba bility p, then IA1 , • • •, /A„ are independent random variables with the same density.
5.2
The Central Limit Theorem
The central limit theorem is of utmost importance in statistical infer ence. Its proof is fairly advanced and is the only theorem in this text whose proof will not be given. It is proved in more advanced courses. T H E O R E M 1. (Central Limit Theorem) IfXu---,Xn satisfy the hypothesis of the Law of Large Numbers, and if we denote Sn = £™=1 Xj and a2 = Var(Xi), then, for every real number x, lim P
Sn — ESn 7 = —
< X
= J = f e-«lHt. V2ir J-oo
The integral in Theorem 1 cannot be integrated in closed form. However, it is tabulated and appears in standard statistical tables. Let us denote
(x) = -£=[* e~t2/2dt. V27T
5.2. THE CENTRAL
LIMIT
THEOREM
87
We should point out that $(oo) = 1. In order to prove this, we shall prove that
We do this by writing the left hand side as
then writing this product as a double integral
r r e-Wdudv,
J—oo
J—oo
then changing to polar coordinates via the change of variables u = r cos 0, v = r sin0 (and replacing dudv by rdrdQ) to obtain /•27T
tOO
r I00 e~r2l2rdr d6 = 2TT. Jo Jo Clearly $(—oo) = 0, and since the integrand is positive, it follows that $ is non-decreasing, i.e., if —oo < xi < x 0. Values of $ are given in the Appendix of this book. The function $ is called the normal distribution. An observation should be made about limit theorems such as, for example, the central limit theorem. A limit theorem helps one in an approximation problem, as we now il lustrate. Suppose one plans on taking 100 observations on a population by sampling with replacement. The observations then become indepen dent random variables Xi, • • • ,Xioo. We consider the problem: given EXi = 10 and VarXi = 9 for 1 < i < 100, to find an approximation of the probability P[Sioo< 1,038.46]
88
CHAPTER 5. LIMIT
THEOREMS
where 5i0o = X\ + • • • + Xi00. We first observe that E(Sioo) £ ( E ; = i * j ) = 1,000 and Var^ioo) = E ; = i ^ K * j ) = 900. Hence P[Sioo< 1,038.46] =
P
=
P
ffioo - £(ffioo)
<
=
1,038.46-1,000 '900
^IOO — ESioo < 1.282 xA'arS'ioo
which by the central limit theorem is approximated by $(1,282). The table of the normal distribution in the Appendix of this book yields $(1,282) = .9001, and thus an approximate value of P[S100 < 1,038.46] is .9001. In problems connected with proportions in sample survey theory, the following special case of the central limit theorem will be of use. T H E O R E M 2. (Laplace-DeMoivre theorem) If Sn denotes the num ber of times S occurs in n Bernoulli trials where P(S) = p, then Sn — np
<x
$(z) as n —» oo.
y/np(l-p) Proof: Let At denote the event that S occurs in the zth trial. Then IAX , • • *, I An a r e independent random variables with the same density.
//.» =
p if x = 1 1 — p if x = 0 0 if x £ { 0 , 1 }
Clearly E(IAi) = p, Var(IAi) = p ( l - p ) , and Sn = IAl+. • -+IAn. Thus, as noted before, E(Sn) = np and Var(Sn) = np(l — p). Applying the central limit theorem we obtain the conclusion of the theorem. Q.E.D. EXERCISES 1. If T64 denotes the number of times a head turns up in 64 tosses of an unbiased coin, find an approximation of P[T&4 < 35].
5.2. THE CENTRAL
LIMIT
THEOREM
89
2. An unbiased die is tossed 100 times. Let Sioo denote the sum of the 100 faces that turn up. Find an approximate value of P[Sioo < 330]. 3. It costs two dollars to enter a game. A coin is tossed. If it comes up heads, you win $3.90. Assuming the coin to be fair, find the probability that you will suffer no total loss at the end of 20 plays.
Chapter 6 Simple Random Sampling 6.1
The Model
Sample survey theory deals with populations of units. The population under consideration might be a set of people, in which case each person in this set is a unit. At other times the population might be the set of family farms in the state of Kansas, in which case each farm is a unit. We shall always denote the population, i.e., the set of units, by U. The letter N will always denote the number of units in a population. Usually, the value of N is known; there are cases where it is not. The individual units are denoted by U_1, U_2, ..., U_N, and
U : U_1 U_2 ... U_N
will denote the population and its individual units. Associated with each population that we shall deal with in this course is a function y that assigns to each unit U_i a number Y_i, i.e., y(U_i) = Y_i, 1 ≤ i ≤ N. This function, in the case of a population of people, might mean that Y_i is the total income of person U_i during the previous year. In case U is the population of family farms in the state of Kansas, Y_i might denote the size in acres of the individual farm U_i. In practice, one usually has a list of the units (called a frame) but does not know any of the Y_i's. One can determine Y_i only if one observes (in some sense) the unit U_i directly. The population U and the numerical
characteristic y are sometimes displayed completely as
U : U_1 U_2 ... U_N
y : Y_1 Y_2 ... Y_N.
Given such a population as that above and the limitation on determining the values {Y_i, 1 ≤ i ≤ N}, the problem before us is to determine the sum of all Y_i's, i.e., to determine Y, where Y is defined by
Y = \sum_{i=1}^{N} Y_i,
or to determine Ȳ, the average of the Y_i's, i.e.,
\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = Y/N.
In the first example given above, Y denotes the total annual income of the particular population. In the second example, Y denotes the total acreage of all family farms in Kansas. This problem of determining the value of Y is completely and most satisfactorily solved when it is possible to observe each unit Ui and measure or determine the value Y{. This is what happens in a complete census. If a complete census is impossible, then one has to estimate Y based on some selection of a few of the units and on the determination of the values of y for these units. This is what the theory of sample surveys is about. An element of randomness is introduced in the selection of these few units, and from the y-values obtained one uses some formula which depends on the randomizing procedure to compute an estimate Y of Y. The quantity Y is an observable random variable. The problems that will beset us are these. We shall want the distribution of Y to be centered (in some sense) around the unknown number Y. Usually, this is done by requiring that its expectation be Y , i.e., E(Y) = Y. Among procedures which will do this for us, we shall seek a procedure such that "most" values of Y are "close" to Y. This is usually done by requiring a procedure for which the variance of K, Var(Y), is small. Finally, we shall wish to have a formula for estimating the maximum error we can make in reporting Y to be equal to the unknown Y. There
are yet other problems that might arise, say, that of minimizing cost or designing the randomizing procedure to keep the entire survey within a certain cost limitation. Before beginning our study of these randomization procedures we should make sure that we know how to make use of a random number generator, which is a program in a hand-held calculator or computer which generates random numbers. In actual fact, these programs gen erate what are called pseudo-random numbers. In practice we act or pretend that they are truly random and are obtained as follows. Con sider a bowl with ten tags in it, numbered 0,1,2, • • •, 9. Let Xi, • • •, Xk denote a simple random sample taken with replacement.This means: take a tag at random, let X\ denote the number observed on the tag, replace it, and repeat this processfc—1more times. Clearly, the random variables X i , • • •, Xn are independent, and each is uniformly distributed over {0,1, • • •, 9}, i.e., the density of each X{ is /
f_{X_i}(x) = \begin{cases} 1/10 & \text{if } x = 0, 1, \ldots, 9 \\ 0 & \text{otherwise.} \end{cases}
Then the random number obtained is
\frac{X_1}{10} + \frac{X_2}{10^2} + \cdots + \frac{X_k}{10^k}.
This number appears in decimal form as .X1X2 • • • Xk just as a con stant number (2/10) + (5/100) + (4/1000) appears as .254. If k = 3, then the above procedure yields a number selected at random from {.000, .001, • • •, .998, .999}. The probability then of selecting any par ticular number, such as .682, is .001, and the probability of selecting any number less than .682 is .682 (Don't forget .000.) In general, if one obtains a random number with k decimal places, then the probability of selecting any fixed constant x, expressed in decimal form as X = .X\X2 • • ■ Xk
is 1/10*, and the probability of selected a number less than x is the very same X = ,X\X2
' Xk*
The above shows how to select an n-digit random number in the unit interval [0,1).
Now, suppose we have N distinct units which one can number from 1 to N. The problem we now address is how to select a number at random from the set {1, 2, ..., N}. If N is a power of 10, then the above procedure provides us with an answer; for example, if N = 10^4, a 4-digit random number .X_1X_2X_3X_4 determines the selected integer directly. In general, if X^{(n)} denotes an n-digit random number in [0, 1), one selects the integer
[X^{(n)} N] + 1,
where [x] means the largest integer ≤ x.
PROPOSITION 1. If c = .c_1 c_2 ... is a number (in decimal form) in [0, 1), and if X^{(n)} is an n-digit random number in [0, 1), then P[X^{(n)} ≤ c] → c as n → ∞.
Proof: It is easy to verify that
[X^{(n)} < .c_1 c_2 \cdots c_n] \subset [X^{(n)} \le c] \subset [X^{(n)} < .c_1 \cdots c_n + \tfrac{1}{10^n}].
Taking probabilities of the three events we have
.c_1 \cdots c_n \le P[X^{(n)} \le c] \le .c_1 \cdots c_n + \tfrac{1}{10^n}.
Now 0 ≤ c − .c_1 ... c_n < 1/10^n for every n, which implies that .c_1 ... c_n → c as n → ∞. Taking limits in the above displayed inequality as n → ∞ yields c ≤ lim_{n→∞} P[X^{(n)} ≤ c] ≤ c, which yields the conclusion. Q.E.D.
PROPOSITION 2. If N is a positive integer, if 1 ≤ k ≤ N, and if Y^{(n)} = [N X^{(n)}] + 1, then P[Y^{(n)} = k] → 1/N as n → ∞.
Proof: We observe that
[Y^{(n)} = k] = [[N X^{(n)}] = k-1] = [k-1 \le N X^{(n)} < k] = \left[\frac{k-1}{N} \le X^{(n)} < \frac{k}{N}\right],
so
P[Y^{(n)} = k] = P\left[\frac{k-1}{N} \le X^{(n)} < \frac{k}{N}\right].
Now one easily verifies that
P\left[X^{(n)} < \frac{k}{N}\right] = P\left[X^{(n)} < \frac{k-1}{N}\right] + P\left[\frac{k-1}{N} \le X^{(n)} < \frac{k}{N}\right],
or
P\left[\frac{k-1}{N} \le X^{(n)} < \frac{k}{N}\right] = P\left[X^{(n)} < \frac{k}{N}\right] - P\left[X^{(n)} < \frac{k-1}{N}\right].
Substituting this into the expression above for P[Y^{(n)} = k] and applying Proposition 1, we obtain the conclusion. Q.E.D.
Proposition 2 shows that if one wishes to select a number at random from {1, 2, ..., N} one should select an n-digit random number X^{(n)} with n as large as the calculator or machine can support, multiply it by N, then take the integer part of it and add 1. Thus, if we are confronted with N units denoted by U_1, U_2, ..., U_N, and if we wished to take a simple random sample with replacement of size n from it, we would perform the above procedure n times to get an ordered n-tuple of units (U_{k_1}, ..., U_{k_n}). The probability of obtaining any particular n-tuple of units is (1/N)^n, because of the independence of successive draws of a random number. Now suppose we wish to take a simple random sample from {1, 2, ..., N} without replacement. If these were just numbered tags in a bowl, and if we selected n of them at random without replacement, we know that the probability of selecting the distinct integers k_1, ..., k_n in this order is 1/(N!/(N-n)!). The problem arises on how
one could use a random number generator to select a simple random sample without replacement of size n. Let us consider this procedure. Select a number at random from {1, 2, ..., N} using the random number generator; suppose this number is k_1. Now select a second random number. If it is different from k_1, call it k_2. If it is equal to k_1, disregard it and select another random number; if that one is unequal to k_1, call it k_2, and if not, continue sampling until you obtain a second distinct number. One continues selecting random numbers from {1, ..., N} until n distinct integers k_1, ..., k_n are obtained (n ≤ N).
PROPOSITION 3. By sampling with the above procedure, the probability of obtaining the ordered n-tuple k_1, ..., k_n of distinct numbers from {1, 2, ..., N} is
\frac{1}{N!/(N-n)!}.
Proof: In this one instance in this book we make use of the countable additivity of probability which was not covered in Chapter 1. In Chapter 1 we demonstrated that if {A_1, ..., A_n} is a finite set of disjoint events, then P(\cup_{j=1}^{n} A_j) = \sum_{j=1}^{n} P(A_j). Now we make use of an extended property which is: if {A_1, A_2, ...} is an infinite sequence of disjoint events, then
P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n).
Let [k_1, ..., k_n] denote the event that in continued sampling from {1, 2, ..., N} by simple random sampling with replacement the first integer selected is k_1, the second distinct integer selected is k_2, ..., and the nth distinct integer selected is k_n, where, as stated in the hypothesis, k_1, ..., k_n are distinct numbers from {1, 2, ..., N}. One easily sees that the event [k_1, ..., k_n] can be represented as the following countable union of disjoint events:
[k_1, \cdots, k_n] = \bigcup_{m_1=0}^{\infty} \cdots \bigcup_{m_{n-1}=0}^{\infty} A(m_1, \cdots, m_{n-1}),
where A(m_1, ..., m_{n-1}) denotes this event: k_1 is selected on the first trial, which is followed by m_1 k_1's, which is followed by k_2, which is
followed by m_2 numbers consisting only of k_1's and k_2's, followed by k_3, followed by m_3 numbers consisting only of k_1's, k_2's and k_3's, followed by ..., followed by k_n. Because of independence,
P(A(m_1, \cdots, m_{n-1})) = \frac{1}{N}\left(\frac{1}{N}\right)^{m_1}\frac{1}{N}\left(\frac{2}{N}\right)^{m_2}\frac{1}{N}\cdots\left(\frac{n-1}{N}\right)^{m_{n-1}}\frac{1}{N} = \left(\frac{1}{N}\right)^{n}\prod_{q=1}^{n-1}\left(\frac{q}{N}\right)^{m_q}.
Since the events {A(m_1, ..., m_{n-1}) : m_1 ≥ 0, ..., m_{n-1} ≥ 0} are disjoint, it follows that
P([k_1, \cdots, k_n]) = \sum_{m_1=0}^{\infty}\cdots\sum_{m_{n-1}=0}^{\infty}\left(\frac{1}{N}\right)^{n}\prod_{q=1}^{n-1}\left(\frac{q}{N}\right)^{m_q}
= \left(\frac{1}{N}\right)^{n}\prod_{q=1}^{n-1}\sum_{m=0}^{\infty}\left(\frac{q}{N}\right)^{m}
= \left(\frac{1}{N}\right)^{n}\prod_{q=1}^{n-1}\frac{1}{1-q/N}
= \frac{1}{N}\prod_{q=1}^{n-1}\frac{1}{N-q}
= \frac{1}{N!/(N-n)!}.
Q.E.D.
Proposition 3 thus validates the algorithm laid out before it for taking a simple random sample without replacement. As noted earlier, in our model
U : U_1 U_2 ... U_N
y : Y_1 Y_2 ... Y_N,
y as a function whose domain is U assigns to U_i the number Y_i, i.e., y(U_i) = Y_i, 1 ≤ i ≤ N. We shall henceforth let y_j denote the random variable that gives the y-value of the jth unit selected in a random sample of size n. Thus, y_j assigns to the elementary event (U_{i_1}, ..., U_{i_n}) the number Y_{i_j}. In Section 2.2 we observed that in sampling with replacement y_1, ..., y_n are independent and have the same density, namely
P[y_\ell = x] = \#\{i : Y_i = x\}/N, \qquad 1 \le \ell \le n.
We also observed that in the case of sampling without replacement, 2/i, • * • ? J/n are not independent, yet they all have the same density as in the case of independence. In the sequel we shall refer to the random variables j/i, • • •, yn as a sample of size n. We shall use only two abbreviations: WR will denote with replace ment, and WOR will denote without replacement. EXERCISES 1. Suppose the population to be sampled is U: Y:
U_1 U_2 U_3 U_4 U_5 U_6 U_7 U_8
1.2 2.3 4.0 1.2 1.2 2.3 4.0 1.2.
Let j/i, j/2? 2/3 denote a sample of size three WOR from the popu lation. Compute i) the joint density of yu y2,2/3, ii) the joint densities of 2/1,2/2, of 2/2,2/3, and of 2/1,2/3, iii) the densities of 2/1, of 2/2 and of 2/3, iv) v)
E(y1)JE(y2)9E(y3)9 Var(y1),Var(y2),Var(y3),
vi) the correlation coefficients pyuy2,py2m
and
pyim,
vii) Y and F , 2. In using a random number generator to obtain a sample of size 2 from {1,2,3,4,5} WOR, find the probability that the sample is 1,4 and that it is obtained on or before the fourth random number is observed. 3. In using a random number generator to obtain a sample of size three WOR from {2,3,4,5}, find the probability that the sample is 3,1,4 and that sampling terminates when the fifth random number is observed. 4. In using a random number generator to obtain a sample of size two WOR from {2,3,4,5}, compute the probability that at least five successive random numbers must be observed in order to obtain the sample.
5. Prove: if x is any real number, then [x] + 1 = [x + 1].
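The selection rules behind Propositions 2 and 3 are easy to mechanize. The sketch below (Python, added here for illustration and not part of the text) uses the language's built-in pseudo-random number generator in place of the hand-held calculator described above; the function names are purely illustrative.

```python
import random

def random_label(N):
    # Proposition 2: multiply a random number in [0, 1) by N,
    # take the integer part, and add 1.
    return int(N * random.random()) + 1

def sample_wr(N, n):
    # Simple random sample WITH replacement of size n from {1, ..., N}.
    return [random_label(N) for _ in range(n)]

def sample_wor(N, n):
    # Procedure validated by Proposition 3: keep drawing labels and discard
    # any label already obtained, until n distinct labels have appeared.
    chosen = []
    while len(chosen) < n:
        k = random_label(N)
        if k not in chosen:
            chosen.append(k)
    return chosen

random.seed(0)
print(sample_wr(8, 3))    # an ordered triple, repeats allowed
print(sample_wor(8, 3))   # three distinct labels, in order of first appearance
```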
6.2 Unbiased Estimates for Y and Ȳ
We shall obtain here unbiased estimates for Y and Ȳ in WR and WOR simple random sampling. Then we shall derive the formulae for their variances and show that greater precision is obtained in WOR sampling.
Definition: If z is an observable random variable, and if A is some real number, then z is called an unbiased estimate of A if E(z) = A, whatever the value of A.
Now consider the population U and the numerical characteristic y:
U : U_1 U_2 ... U_i ... U_N
y : Y_1 Y_2 ... Y_i ... Y_N.
If the function y : U_i → Y_i is regarded as a random variable defined over the fundamental probability space U, then the density of y is
f_y(t) = \#\{i : Y_i = t\}/N.
Now let y_1, ..., y_n be a sample of size n on this population. (This was defined in Section 6.1.) If the sampling is done either with or without replacement, then, by Theorems 7 and 8 of Section 2.2, all y_i's have the same density as y. We shall use the following notation: ȳ = (y_1 + ... + y_n)/n.
THEOREM 1. In sampling WR or WOR, Nȳ is an unbiased estimate of Y and ȳ is an unbiased estimate of Ȳ.
Proof: Regarding y as a random variable, we have
E(y) = \sum_{t} t\, f_y(t) = \frac{1}{N}\sum_{i=1}^{N} Y_i = \bar{Y}.
Thus, E(ȳ) = (1/n) \sum_{i=1}^{n} E(y_i) = (1/n)\, n\bar{Y} = Ȳ, i.e., ȳ is an unbiased estimate of Ȳ. Now, E(Nȳ) = N E(ȳ) = NȲ = Y, and hence Nȳ is an unbiased estimate of Y. Q.E.D.
Among sample survey statisticians the ratio n/N is referred to as the sampling fraction. This will appear in our formulae from time to time. A basic numerical characteristic of our population is S_y^2, defined by
S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2.
THEOREM 2. In sampling WR or WOR,
Var(y_i) = \frac{N-1}{N}\,S_y^2, \qquad 1 \le i \le n.
Proof:
Var(y_i) = E\left((y_i - \bar{Y})^2\right) = \sum_{j=1}^{N}\frac{1}{N}(Y_j - \bar{Y})^2 = \frac{N-1}{N}\,S_y^2. \qquad Q.E.D.
THEOREM 3. In sampling WR,
Var(\bar{y}) = \frac{1}{n}\,\frac{N-1}{N}\,S_y^2 \qquad \text{and} \qquad Var(N\bar{y}) = \frac{1}{n}\,N(N-1)\,S_y^2.
Proof: This follows from Theorem 2 and the fact that in WR sampling, y_1, ..., y_n are independent random variables, all with the same density. Q.E.D.
THEOREM 4. In WOR simple random sampling,
Var(\bar{y}) = \frac{1}{n}\left(1 - \frac{n}{N}\right)S_y^2, \qquad \text{and} \qquad Var(N\bar{y}) = \frac{N^2}{n}\left(1 - \frac{n}{N}\right)S_y^2.
Proof: Let us denote z_i = y_i − Ȳ, 1 ≤ i ≤ n. Then z_1, ..., z_n is a sample of size n on y − Ȳ WOR and E(z_i) = 0, 1 ≤ i ≤ n. By Theorems 3 and 7 of Section 3.2, Var(z_i) = Var(y_i) for 1 ≤ i ≤ n, and E(z_i z_j) = E(z_1 z_2) for i ≠ j. Thus
Var(\bar{y}) = E\left((\bar{y} - \bar{Y})^2\right) = \frac{1}{n^2}E\left(\left(\sum_{i=1}^{n}(y_i - \bar{Y})\right)^2\right) = \frac{1}{n^2}E\left(\left(\sum_{i=1}^{n}z_i\right)^2\right) = \frac{1}{n^2}\left\{ nE(z_1^2) + n(n-1)E(z_1 z_2) \right\}.
Note that this formula for Var(ȳ) holds for every value of n ≤ N. If we select n = N, then ȳ = Ȳ is a constant, and thus, when n = N, Var(ȳ) = 0. In other words, when n = N, E(z_1^2) + (N-1)E(z_1 z_2) = 0. Thus
E(z_1 z_2) = -\frac{1}{N-1}E(z_1^2) = -\frac{1}{N-1}\,\frac{N-1}{N}\,S_y^2 = -\frac{S_y^2}{N}.
Substituting this in the formula for Var(ȳ) above, we obtain
Var(\bar{y}) = \frac{1}{n^2}\left\{ n\,\frac{N-1}{N}\,S_y^2 - n(n-1)\,\frac{1}{N}\,S_y^2 \right\} = \frac{1}{n}\left(1 - \frac{n}{N}\right)S_y^2,
which yields the first conclusion. The second follows from
Var(N\bar{y}) = N^2\,Var(\bar{y}) = \frac{N^2}{n}\left(1 - \frac{n}{N}\right)S_y^2. \qquad Q.E.D.
We observe that, in both WR and WOR simple random sampling, y is an unbiased estimate of Y and Ny is an unbiased estimate of
Y. However, as shown in Theorems 3 and 4, the variance of y in WOR sampling is smaller than that in WR by a multiplicative factor of (1 — n/N). In Section 6.3 we shall show why the square root of the variance is a measure of the error and shall show how to estimate the maximum error made in using Ny to estimate Y. EXERCISES 1. If X is an observable random variable whose distribution is B(n, p), where n is known, and p is unknown, prove that X/n is an unbi ased estimate of p. 2. If X is an observable random variable whose distribution is uni form over {1,2, • • •, iV}, where N is unknown, show that 2X — 1 is an unbiased estimate of N. 3. In simple random sampling, find the value of the sampling fraction so that the standard deviation of y in WOR sampling is one half that of WR sampling. 4. Prove: in sampling WOR, if n = iV, then Ny = Y. 5. Consider the following population: U: y:
U_1 U_2 U_3 U_4 U_5 U_6
10  11  9   10  12  11.
(i) Compute Y and Y. (Remember: these are unknown to the statistician in real life.) (ii) Compute the variance of Y in a sample of size 3 when the sampling is done WR and also when it is done WOR. 6. Compute a value of Y in WR sampling in Problem 5 when the units selected are Us,U2,Us . 7. In Problem 5 compute the value of Y in WOR sampling when the units selected are U$, C/i, Us .
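The following Python sketch (an illustration added here, not part of the text) checks Theorems 1, 3 and 4 on the small population of Exercise 5: both sampling schemes give E(ȳ) = Ȳ, and the WOR variance is the smaller of the two.

```python
import random

# Population of Exercise 5: N = 6 units with the y-values below.
Y = [10, 11, 9, 10, 12, 11]
N, n = len(Y), 3
Ybar = sum(Y) / N
S2 = sum((v - Ybar) ** 2 for v in Y) / (N - 1)

# Theoretical variances of ybar from Theorems 3 and 4.
print("Var(ybar) WR :", (1 / n) * (N - 1) / N * S2)
print("Var(ybar) WOR:", (1 / n) * (1 - n / N) * S2)   # the smaller of the two

# Simulation: both schemes are unbiased for Ybar (Theorem 1).
random.seed(2)
reps = 20000
mean_wr = sum(sum(random.choices(Y, k=n)) / n for _ in range(reps)) / reps
mean_wor = sum(sum(random.sample(Y, n)) / n for _ in range(reps)) / reps
print("Ybar =", Ybar, " simulated WR:", round(mean_wr, 3),
      " WOR:", round(mean_wor, 3))
```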
6.3 Estimation of Sampling Errors
In any estimation problem in sample surveys we shall wish to have an estimate of the maximum error made. Indeed, one of the advantages of a sample survey over a complete census is that we have a means of estimating errors. We should recall in our present setting the central limit theorem stated in Chapter 5. In our present notation and model it states: in WR sampling, if y_1, ..., y_n is a sample of size n, then
\lim_{n\to\infty} P\left[\frac{n\bar{y} - n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le x\right] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt
for all real x. The function
\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt
that appears on the right side of the limit expression is the standard normal distribution function, and its values for different values of x are given in the table in the Appendix. Particular values to keep in mind are Φ(1.645) = .95, Φ(−1.645) = .05, Φ(1.96) = .975, Φ(−1.96) = .025, Φ(2.57) = .995 and Φ(−2.57) = .005. Now, the value to us of a limit theorem like the central limit theorem above is that certain functions of n may be approximated for large values of n by the limit quantity. For x > 0,
P\left[-x \le \frac{n\bar{y} - n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le x\right] = P\left[\frac{n\bar{y} - n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le x\right] - P\left[\frac{n\bar{y} - n\bar{Y}}{\sqrt{n\,Var(y_1)}} < -x\right]
is approximated by Φ(x) − Φ(−x). Thus, for large values of n, the probability of the event
\left[-1.96 \le \frac{n\bar{y} - n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le 1.96\right]
is approximated by Φ(1.96) − Φ(−1.96) = .95, a probability close to one. Also, the probability of
\left[-2.57 \le \frac{n\bar{y} - n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le 2.57\right]
is approximated by Φ(2.57) − Φ(−2.57) = .99, a probability even closer to one. Thus it is almost certain that the following inequality holds:
-3 \le \frac{n\bar{y} - n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le 3,
or (after using a little algebra)
|N\bar{y} - Y| \le 3N\sqrt{Var(y_1)/n}.
Let us recall Chebishev's inequality from Chapter 5: if Z is a random variable, then for every ε > 0,
P([|Z - E(Z)| < \varepsilon]) \ge 1 - \frac{Var(Z)}{\varepsilon^2}.
If we replace ε by \sqrt{Var(Z)/t}, we obtain
P\left(\left[|Z - E(Z)| < \sqrt{Var(Z)/t}\,\right]\right) \ge 1 - t.
If we let t = 0.01, then P([|Z − E(Z)| < 10\sqrt{Var(Z)}]) ≥ .99, which gives us a larger bound to the error of estimating E(Z) by Z, namely 10\sqrt{Var(Z)}. In any case, we see that if we obtain an unbiased estimate Z for some unknown constant, the smaller the variance, the smaller the error. If we wish to be right in stating the maximum error of estimates in at least 99 out of every 100 cases we would state that the error is less than 10\sqrt{Var(Z)}, provided we know the value Var(Z). In practice, we should prefer to use 2.57\sqrt{Var(Z)} for rather good theoretical reasons.
Thus it becomes necessary to be able to estimate the variance of an unbiased estimate if we wish to estimate the maximum error in using it.
THEOREM 1. In WR sampling, if y_1, ..., y_n is a sample of size n, and if s_y^2 is defined by
s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2,
then s_y^2 is an unbiased estimate of Var(y_1).
Proof: We may write
s_y^2 = \frac{1}{n-1}\left\{\sum_{j=1}^{n} y_j^2 - n\bar{y}^2\right\}.
Now E(y_j^2) = E(y_1^2) for 1 ≤ j ≤ n. Since y_1, ..., y_n are independent, then
E\left(\sum_{j=1}^{n} y_j^2\right) = nE(y_1^2),
and
E(\bar{y}^2) = \frac{1}{n^2}E\left(\left(\sum_{j=1}^{n} y_j\right)^2\right) = \frac{1}{n^2}\left\{ nE(y_1^2) + n(n-1)(E(y_1))^2 \right\}.
Thus, with a little algebra we obtain E(s_y^2) = Var(y_1).
Q.E.D.
COROLLARY 1. In WR sampling, an unbiased estimate of the variance of the unbiased estimate Ŷ = Nȳ is (N^2/n)\,s_y^2.
Proof: In WR sampling, Var(Nȳ) = N^2 Var(y_1)/n. By Theorem 1, E(s_y^2) = Var(y_1). Thus N^2 s_y^2/n is an unbiased estimate of Var(Nȳ). Q.E.D.
The situation for WOR is a bit more complicated.
THEOREM 2. In WOR sampling, if s_y^2 = \frac{1}{n-1}\sum_{j=1}^{n}(y_j - \bar{y})^2, then E(s_y^2) = S_y^2, where S_y^2 = \frac{1}{N-1}\sum_{j=1}^{N}(Y_j - \bar{Y})^2.
Proof: We first observe that
(n-1)s_y^2 = \sum_{j=1}^{n}(y_j - \bar{y})^2 = \sum_{j=1}^{n} y_j^2 - n\bar{y}^2.
Since Var(\bar{y}) = E(\bar{y}^2) - (E(\bar{y}))^2, we use Theorem 4 in Section 2 to obtain
E(\bar{y}^2) = Var(\bar{y}) + (E(\bar{y}))^2 = \frac{1}{n}\left(1 - \frac{n}{N}\right)S_y^2 + \bar{Y}^2.
We also have
E(y_j^2) = Var(y_j) + (E(y_j))^2 = \frac{N-1}{N}S_y^2 + \bar{Y}^2.
Using these two equations we have
(n-1)E(s_y^2) = E\left(\sum_{j=1}^{n} y_j^2\right) - nE(\bar{y}^2) = n\left(\frac{N-1}{N}S_y^2 + \bar{Y}^2\right) - n\left(\frac{1}{n}\left(1 - \frac{n}{N}\right)S_y^2 + \bar{Y}^2\right) = \left(\frac{n(N-1)}{N} - 1 + \frac{n}{N}\right)S_y^2 = (n-1)S_y^2.
This yields E(s_y^2) = S_y^2. Q.E.D.
COROLLARY. In WOR sampling, an unbiased estimate of the variance of Ŷ = Nȳ is
\widehat{Var}(\hat{Y}) = \frac{N^2}{n}\left(1 - \frac{n}{N}\right)s_y^2,
and an unbiased estimate of Var(ȳ) is \widehat{Var}(\bar{y}) = (1 - n/N)\,s_y^2/n.
Proof: By Theorem 2 above and Theorem 4 of Section 2,
E\left(\frac{N^2}{n}\left(1 - \frac{n}{N}\right)s_y^2\right) = \frac{N^2}{n}\left(1 - \frac{n}{N}\right)E(s_y^2) = \frac{N^2}{n}\left(1 - \frac{n}{N}\right)S_y^2 = Var(N\bar{y}).
The second conclusion follows by known properties of variance. Q.E.D.
EXERCISES
1. Consider the following population:
U : U_1 U_2 U_3 U_4 U_5 U_6 U_7 U_8
y : 1.2 1.5 1.4 1.6 1.3 1.4 1.2 1.4
(i) Compute the values of Y and F ; remember, in real life these values are unknown to the statistician. (ii) Compute S*. (iii) Compute
Var(y).
2. Suppose in Problem 1 a sample of size 4 is taken WR and the units selected turn out to be f/5, E/7, C/7, C/4. (i) Use the sample to compute a value of the unbiased estimate ofF . (ii) Use the sample to compute the value of the unbiased estimate of Var(Y). 3. Do Problem 2 when the sampling is done WOR and the units selected are f/3, C/5, f/i, C/7.
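As a worked illustration of the Corollary above, the Python sketch below (not part of the text) computes the unbiased estimate Ŷ = Nȳ together with the estimated variance (N²/n)(1 − n/N)s_y² and a 2.57·√(variance) error bound. The population size and sample values used in the example are hypothetical.

```python
import math

def wor_estimates(sample, N):
    """Unbiased estimates of Y and of Var(N*ybar) for a WOR simple random sample."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((v - ybar) ** 2 for v in sample) / (n - 1)
    Y_hat = N * ybar
    var_hat = (N ** 2 / n) * (1 - n / N) * s2     # Corollary of this section
    return Y_hat, var_hat

# Hypothetical survey: N = 420 units, a WOR sample of n = 5 observed y-values.
sample = [1.2, 1.5, 1.4, 1.6, 1.3]
Y_hat, var_hat = wor_estimates(sample, N=420)
print("estimate of Y:", Y_hat,
      " error bound (2.57 sigma):", round(2.57 * math.sqrt(var_hat), 1))
```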
6.4 Estimation of Proportions
The present section provides special applications of the previous two sections. Here we are interested in the proportion of a population that satisfies a certain property. For example, we might be interested in the proportion of a population that is unemployed or the proportion of a population in a certain age group. We shall let A denote the subset of U which has the particular property in question, and we shall let π denote the proportion of units in U that are in A, i.e., the number of units in A divided by N. Corresponding to U_i in U we shall take Y_i = 1 if U_i ∈ A and Y_i = 0 if U_i ∉ A. Thus we see that Y = \sum_{i=1}^{N} Y_i is the number of units in A, i.e., the total number of units in U with the property in question, and it also turns out that Ȳ = π. Now consider a random sample of size n taken WR. If y_1, ..., y_n are the observations on y, then they are independent, and P[y_i = 1] = π and P[y_i = 0] = 1 − π for 1 ≤ i ≤ n. Thus, \sum_{i=1}^{n} y_i is B(n, π), and π̂ = ȳ is the proportion of units sampled that are in A.
THEOREM 1. In a WR sample of size n to estimate the proportion π of a population in a subset A, the proportion π̂ of units in the sample that are in A is an unbiased estimate of π, its variance is π(1 − π)/n, and an unbiased estimate of the variance of π̂ is π̂(1 − π̂)/(n − 1).
Proof: From Theorem 1 of Section 2, π̂ = ȳ is an unbiased estimate of Ȳ = π. Now Var(π̂) = Var(ȳ) = Var(y_1)/n = π(1 − π)/n, and by Theorem 1 in Section 3, s_y^2 is an unbiased estimate of Var(y_1). Thus, an unbiased estimate of Var(π̂) is \widehat{Var}(\hat{\pi}) = s_y^2/n. But
s_y^2 = \frac{1}{n-1}\left\{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right\}.
Now y_i^2 = y_i for 1 ≤ i ≤ n, and thus
s_y^2 = \frac{1}{n-1}\left\{ n\bar{y} - n\bar{y}^2 \right\} = \frac{n\bar{y}(1 - \bar{y})}{n-1} = \frac{n}{n-1}\,\hat{\pi}(1 - \hat{\pi}).
From this we have shown that \widehat{Var}(\hat{\pi}) defined by
\widehat{Var}(\hat{\pi}) = \frac{s_y^2}{n} = \frac{\hat{\pi}(1 - \hat{\pi})}{n-1}
is an unbiased estimate of Var(π̂).
Q.E.D.
We next turn our attention to WOR simple random sampling. T H E O R E M 2. In WOR sampling of size n to estimate the proportion 7r of a population in a subset A of U, the proportion it of units in
the sample that are in A is an unbiased estimate of π, its variance is
Var(\hat{\pi}) = \frac{N-n}{N-1}\,\frac{\pi(1-\pi)}{n},
and an unbiased estimate of this variance is
\widehat{Var}(\hat{\pi}) = \frac{1}{n-1}\left(1 - \frac{n}{N}\right)\hat{\pi}(1 - \hat{\pi}).
Proof: Again by Theorem 1 of Section 2, π̂ is an unbiased estimate of Ȳ = π. By Theorem 4 of Section 2,
Var(\hat{\pi}) = \frac{1}{n}\left(1 - \frac{n}{N}\right)S_y^2.
But
S_y^2 = \frac{1}{N-1}\left\{\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right\} = \frac{1}{N-1}\left(N\pi - N\pi^2\right) = \frac{N}{N-1}\,\pi(1-\pi),
and thus
Var(\hat{\pi}) = \frac{1}{n}\left(1 - \frac{n}{N}\right)\frac{N}{N-1}\,\pi(1-\pi) = \frac{N-n}{N-1}\,\frac{\pi(1-\pi)}{n}.
By Theorem 2 of Section 3, E(s_y^2) = S_y^2, and thus an unbiased estimate of Var(π̂) is
\widehat{Var}(\hat{\pi}) = \frac{1}{n}\left(1 - \frac{n}{N}\right)s_y^2 = \frac{1}{n-1}\left(1 - \frac{n}{N}\right)\hat{\pi}(1 - \hat{\pi}).
Q.E.D. We now relate the results we have obtained to results in a different area. We first show how the results of this section give us the expectation and variance of the hypergeometric distribution that we discussed in section 2.3. Recall that if an urn contains r red balls and b black
balls, if n of these are selected WOR, and if X denotes the number of red balls in the sample, then the density of X is
P([X = x]) = \frac{\binom{r}{x}\binom{b}{n-x}}{\binom{r+b}{n}}
if max{0, n − b} ≤ x ≤ min{r, n} and is zero otherwise. The proportion of red balls in the population (urn) is π = r/(r + b), and the proportion of red balls in the sample is π̂ = X/n. By Theorem 2 above, E(π̂) = π, or
n^{-1}E(X) = r/(r+b).
Thus, E(X) = nr/(r + b). Also by Theorem 2,
Var(X) = Var(n\hat{\pi}) = n^2\,Var(\hat{\pi}) = \frac{n(r+b-n)}{r+b-1}\,\pi(1-\pi),
or
Var(X) = \frac{n(r+b-n)}{r+b-1}\,\frac{rb}{(r+b)^2}.
o r
^
=
( m +
n +
1 ) / 2 >
Now W^ = ft + - - • + yw,
so £(W0 = E?=i E(Vi) = n(m + n + l ) / 2 , i.e., £ ( W ) = n(m + n + l)
6.4. ESTIMATION
OF PROPORTIONS PROPORTIONS
111
2 • •■• •++ynyn ==ny, Since W = Vl + ■ ny,ititfollows followsthat thatVar(W) Var{W) ==nn2Var(y). Var(y). By By Theorem 4 of Section 2, Var(y) = 1i (l - ^ _) ) S*. SJ. Now
1
fm+n
si = yY
==
^^{^y-
{m+n)
1
n^
11
1l V -V^ (rn TY= (m + + n)(m nXm + + nn + + 1l)) 3 m + n jfr{^ m+n 2 rn ++ nn ++1l m 2
and, and, from irom aa known known identity identity usually usually proved proved by by induction, induction, ^ • £n?T .
n •£? ^ .,.„
n)(m + n + l)(2m + 2n + 11)) (m + n)(m
Thus, _, _2 1 f(m + n)(m + n + l)(2ro l)(2m + 2n + l) ^* == mm + n _ l \ 6 5
(m + n){m n)(m + + nn + + 1) 2 \ 4 /•
After some algebraic simplification, we obtain a2_ yv
~
{m (m + +n+ + l)(m l)(m + + n) 12 *
Thus, Var(W)
= =
~
n2Var(y) n
S V~^nTnJ ' ill T i*>/ nm (m (m + n + l)(m l)(m + n) m + nn' 12 ' \
or mn(m + n + l) 12 ' The formulae for E(W) and Var(W) are useful in non-parametric statistical inference when one wishes to approximate the distribution of W for large m and n. Var{W) vy
;
=
112
CHAPTER 6. SIMPLE RANDOM
SAMPLING
EXERCISES 1. Let U denote the sum of 10 numbers selected at random WOR from {11,12, • • •, 29,30}. Compute E(U) and Var(U) . 2. In a population of 1000 units, 40 units were selected at random WOR. Among these 40 units, 15 units had a certain property. Compute a value of the unbiased estimate of the proportion of units in the entire population with this property, and compute a value of the unbiased estimate of the variance. 3. In an area considered for urban renewal, time and money were available to inspect 50 out of 420 homes. Among 50 of these 420 homes selected at random W72, 18 were found to be substandard. Compute an estimate of the total number of substandard homes among the 420 and an estimate of the variance. 4. Among the integers {1,2, • • •, m + n}, let W denote the sum of n of them selected at random without replacement, and let S denote the sum of m of them selected at random without replacement. Prove that Var(W) = Var(S) and that E(W) = (™-Hfo+"+i) _ E(S).
6.5
Sensitive Questions
We pause now for a special application of results of Section 4. In Sec tion 4 we assumed that if we were sampling a human population, and if we asked each person in our sample a simple yes/no question, then the response would always be truthful. But there are many situations in which a truthful answer could not be elicited from a sizable propor tion of the sample. Suppose one is conducting a survey on a university campus to determine what proportion of students cheat during exami nations whenever the opportunity presents itself. If you took a simple random sample of 300 students and asked them the question in person, the answers would undoubtedly be all no's. The same would be the case if the question were: do you practice safe sex, except in this case you would receive a unanimous 'yes' answer. However, if the respondent
6.5. SENSITIVE
QUESTIONS
113
can be made to feel anonymous in some way, he or she would feel no danger with a truthful answer. Thus the statistician is challenged to come up with a survey technique which provides the respondent with a feeling of safety and yet provides an unbiased estimate of the propor tion of individuals in that population who have what we might refer to as a "secret sin". Whatever the sensitive question we consider in a survey, we shall use the generic expression "secret sinner" to denote an individual who would be reluctant to give a truthful answer to a direct question. Let us begin with a population of N individuals; N might or might not be known. Let N0 denote the number of secret sinners among the N. The problem is: based on a sample of size n, to estimate the ratio NQ/N. One method of eliciting a response from an individual in the sample is to provide the respondent with an urn containing a known number r of red balls and a known number b of black balls with r ^ b; the composition of the urn is the same for all respondents. The instructions to each individual in the sample are these: (i) Select a ball at random from the urn out of sight of the interviewer, note its color and return it to the urn. (ii) If the ball you selected was red, answer T (for true) or F (for false) to the statement: I am a secret sinner. (iii) If the ball you selected was black, answer T or F to the statement: I am not a secret sinner. Our assumption here is that the person interviewed will respond truth fully, since he or she knows that the interviewer does not know the color of the ball selected and hence does not know which statement the answer T or F applies to. However, the statistician, having at his or her disposal the size n of the sample, the ratio r/(r + b) and the number of T answers is able to obtain an unbiased estimate of the proportion of secret sinners in the population. Let us denote IT = No/N and p = r/(r + b). If one individual is selected at random from the population, this person will respond T if and only if one of the two disjoint events A and B occurs: A is the event that the individual selected is a secret sinner and that he or she
114
CHAPTER 6. SIMPLE RANDOM
SAMPLING
selected a red ball, and B is the event that the respondent is not a secret sinner and that he or she selected a black ball. If we let ip denote the probability of obtaining a T answer from an individual at random, then
=
y - (1 - p) 2p-l
Notice that the denominator is not zero, since p = r/(r + b) ^ 1/2 by the foresight of having r ^ b. If (p denotes the proportion of T answers from the sample of size n, then no matter whether the sampling was done with or without replacement, E((p) = (p. Thus it defined by _ *A
y-(l-p) 2p-l
is an unbiased estimate of 7r. We may formalize this development with the following theorem. T H E O R E M . An unbiased estimate of ir is y-(l-p) *~ 2p-l ' If the sampling is done without replacement, then an unbiased estimate ofVar(it) is v°n*)
(2p_i)2
n
- l
[
Nh
6.5. SENSITIVE
QUESTIONS
115
and if the sampling is done with replacement, then an unbiased estimate ofVar(it) is Vann)
(2p - l)^
n-1
"
Proof: This follows from the above development and Theorem 1 and 2 of Section 4. EXERCISES Suppose the procedure outlined in this section is changed somewhat. Suppose the urn contains r red balls, w white balls and b blue balls. The procedure is for the person interviewed to draw a ball at random from the urn and out of sight of the interviewer, to note its color and then to return it to the urn. If the ball drawn is a red ball, the respondent is to answer T or F to the statement: I am a secret sinner. If the color of the ball drawn is white or blue, he or she is instructed to respond T or F to the statement: the color of the ball that I obtained was white. Let 7r and
+
^ .
2. Prove that (r + w + b)ip > w. 3. Find an unbiased estimate it of 7r, and show that it could have negative values. 4. Find the formulas for an unbiased estimate of Var(jr) when the sampling is done with replacement and without replacement.
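For completeness, here is a Python sketch (added for illustration, not from the text) of the estimator of the preceding theorem. The variance estimate is obtained by applying Theorems 1 and 2 of Section 6.4 to φ̂ and dividing by (2p − 1)², which is how the displayed estimates above arise; the survey numbers in the example are hypothetical.

```python
def secret_sinner_estimate(t_answers, n, r, b, N=None):
    """Unbiased estimate of the proportion pi of 'secret sinners'.

    t_answers -- number of T answers observed,  n -- sample size,
    r, b      -- red and black balls in the urn (r != b),
    N         -- population size for WOR sampling; None means WR sampling.
    """
    p = r / (r + b)
    phi_hat = t_answers / n
    pi_hat = (phi_hat - (1 - p)) / (2 * p - 1)
    # Estimated Var(pi_hat) = (estimated Var(phi_hat)) / (2p - 1)^2.
    fpc = 1.0 if N is None else (1 - n / N)
    var_hat = fpc * phi_hat * (1 - phi_hat) / ((n - 1) * (2 * p - 1) ** 2)
    return pi_hat, var_hat

# Hypothetical survey: 300 respondents, an urn with 7 red and 3 black balls,
# and 120 T answers observed.
print(secret_sinner_estimate(120, 300, r=7, b=3))
```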
Chapter 7 Unequal Probability Sampling 7.1
H o w to Sample
In Chapter 6, whenever we sampled from a population of N units, whether WR or WOR, each item had the same probability of being selected. For reasons to be elaborated on later, we shall sometimes find it desirable to sample with unequal probabilities. The purpose of this section is to explain how this is done when the sampling is done WR and when it is done WOR. In simple random sampling on the population
we sample so that for each i(l < i < N) the probability of selecting Ui on the first observation is l/N. Now we wish to sample in such a way as to give some units larger probabilities of being selected than others. For example, suppose our population U has four units; U\,U
118
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
Thus we see that the probability of selecting U2 is 1/3 which is twice the probabiUty of selecting t/4, which is 1/6. The relative sizes of the units in this example are 2,2,1 and 1; dividing each size by the sum of the sizes gives us the probability of selecting that particular unit. Now consider the general case. We have a population of N units, Z7i, • • •, UN, and suppose we wish to select one unit, where the probabil ity of selecting unit Uj is pj > 0, where p\, p2, • • •, PN are given nonnegative numbers which satisfy p\ H \~PN = 1 and, for practical purposes, are all representable by fc-digit decimals for some positive integer k. Let us denote ip = 0, ii = p\, t2 = p\ +#2? • • • 5 t>N = Pi H \~PN = 1 • Then k select a fc-digit random number X^ \ as discussed in Chapter 6. If X^ is observed to be in the interval [i/-i,i;), then one decides that the unit selected in Uj. One easily sees that proceeding in this manner, the probabiUty of selecting unit U{ is pi for 1 < i < N. If sampUng is done WR, then aU n units in the sample are selected in his man ner. The probabiUties pi, • • • ,PN usuaUy come about in the following way. Associated with each unit U% is a positive number X{ that is known before any sampUng is contemplated. We shall wish to sample so that the probabiUty pi of obtaining unit £/,- on the first observation is proportional to X{. Thus pi must satisfy Pi =
Xi/X,
where X = X^jLi-^j- If w e sample like this n times, it is called WR probability proportional to size sampUng. In this method of sampling we essentially return the unit to the population after each observation and take the next observation in the same manner. The sizes referred to in this method are the positive numbers X\, - • •, Xjq. We now describe what is meant by WOR probability proportional to size sampling. Again the units are U\, • • •, C//v, and the corresponding "sizes" are known positive numbers X\,- •• ,-XJV. Let n denote the number of observations desired in the sample. The first unit is selected in the same manner as described above. Suppose that the unit selected on this first observation is U%x. Now the remaining units are U \ {Uix} = {C/i, • • •, UN} \ {Uix}. A unit is now selected from among these with probabiUty proportional to size, i.e., unit Uj is selected with probability Xj/(X — Xix) for 1 < j < N, j 7^ t'i. If the unit selected on this
7.1. HOW TO SAMPLE
119
second observation is Ui2, then the third unit selected is wiih with probability proportional to its size among the sizes of the remaining N - 2 uniis, i.e.. unit Ur is selected with probability orobabilitv XJ(X-X. - X2), ) -X forh)iox1
p(u U-'jzx]. . P(Uhh,---,u ,---,Uinin)=|nx./ ) = | n ^ / U-T,Xir) Proof: Let P(U P({/,J£/,,,#••,[/,,_,) ik\Uh,#•■ ,Uik^) denote the conditional probability that Uik is selected on the Jbth observation given that Uh,• •,f, , • ••••• U _x ikw were selected in the first k - 1 trials. As shown above
k P(u f^xXt\. PiUiAu^-'-M^^xj ik\uil,---,uik_j^xj\x-(x-Y;x Applying the multiplication rule, we obtain the conclusion of the theorem. Q.E.D The question next arises on how to use a random number generator too obtain a sample of size n from among units Uu • ■ •• UN by means of WOR VOR probability proportional to sizes Xu • • •• XN. One method whiih appears appears as obvious is to generate n random numbers; let ri, r 2 , • • •, r n denote lenote these numbers. Now define tiyy = {Xi (*i + + -• ••• ++ Xj)/X, Xj)/X, 11 << jj << N, N, and where we take tw = 0. In this case we t nd suppose rx £ [tlh.utlh), shall hall declare that unit Uh is the first unit selected. Next, with unit UJhh removed removed from from the the population, population, use use the the remaining remaining units units with with their their corresponding :orresponding sizes and r 2 in the same way as above to select a second unit, init, call it Uh. Proceed to select Ujs,---,Ujn in the same manner.
120
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
However, the re-computation of the t-values needed in order to select each unit can be cumbersome. Since random numbers r x , r 2 , • • • are easy to come by, the skipping method of WOR probability proportional to size sampling, on which we now elaborate, will prove to be convenient; the correctness of this procedure will be proved in Theorem 2. The skipping method is as follows. Again, let tj = (Xi -\ \-Xj)/X, and if a random number r falls in the interval [t/-i,i;), we declare that unit Uj is selected. The sampling is continued in this manner with as many random numbers as needed until n distinct units have been selected. We designate these n units as our sample. Justification of the use of this skipping method is achieved by comparing the following theorem, a generalization of Proposition 1 in Section 6.1, with Theorem 1. T H E O R E M 2. Using the skipping method described above, the prob ability of selecting n distinct units £/*!,-••, Uin is P(Uh,---,Uin)
=
ffiM'-IN}
Proof: Let
B(j2j3r"Jn) denote the event that the first random number selected is in the interval ['u-i>*ti)> the next j 2 random numbers are in [i t l _i,t t l ), the next ran dom number is in [it 2 -i, U2) where i2 ^ ii, the next j 3 random numbers are in [ ' u - i j ' t i ) U [
followed by a random number in LW3-IJW3)}
where U3 ^ U2,ti3 ^ £ tl , and continuing until the (n + X^r=2>)*l1 r a n ' dom number is in [Un-iiUn)- It i s e a s y to see that the event that the distinct units f/tl, • • •, U{n are selected is equal to the following disjoint union: 00
00
U ••• U J2=0
i„=0
B(j2,---,jn).
121
7.1. HOW TO SAMPLE SAMPLE Thus oo OO
p(uh,---,uin)
OO oo
=
j:---i;p(B(j2,...,jn)) j2=0
jn=0
j2=0
jj„=0 n=0
X
r=2 \[
X
V \
X
/
Jj
xX 4W\ X X ' 1 - (Xith + •■ •• •• ++ Xi X, __Xt)/X )IX
Q.E.D. Theorem 2 justifies the use of the skipping method in WOR probability proportional to size sampling, and this method eliminates the need to recompute the probabilities and probability intervals for each successive observation. A few words of a preliminary nature are in order about the desirability of this kind of sampling. Most frequently one has other, or previous, information about every unit of the population, and in many cases the numerical results of this previous information are roughly proportional to the numerical characteristics one anticipates observing. Schematically this is represented as follows: U: x: y:
Ux U2 . . . Un Xxt X2 •■• Xn X Yx Yt Y2 . . . Yn.
Here, every X{ is already known. The function x might be the results of a previous census, and hence it is already known. If y is proportional to x approximately, and if one wishes to estimate Y = E i l i ^ , one would certainly expect more accuracy by sampling the units not with equal probabilities but with probabilities proportional to the values of y, i.e., probabilities proportional to the values of x. This advantage will be given in a more precise form in a later section.
122
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
EXERCISES 1. Suppose a population U, preliminary information x, and y are as follows: U: x: y:
Ui U2 U3 f/4 U5 U6 U7 31.4 62.9 17.1 18.2 33.0 71.1 65.2 48.1 85.2 30.6 25.1 51.3 100.1 99.6
(i) If one is selecting a sample of size 3 from U by WR probability proportional to size, find the probability that the sample is (U6,U3,U6). (ii) If one is selecting a sample of size 3 from U by WOR proba bility proportional to size, find the probability that the sampleis(f/ 2 ,C/ 6 ,f/ 3 ). 2. In Problem 1 (ii) find the probability that sampling is terminated on or before the fifth observation when the skipping method is used. 3. In Problem 1 (ii) find the probability that Ue is obtained before the fourth trial and U3 is selected after the fourth trial.
7.2
WR Probability Proportional to Size Sampling
If we have a population U-.Ux x : X\ y:Y1
••• UN • • • XN . . . YN
where we have reason to believe that x is "almost" proportional to y so that we wish to sample with probabilities proportional to size, then we shall refer to x as a predictor of y. We shall abbreviate our notation for the above population by simply writing (ZY;x,y). In this section we obtain a formula for an unbiased estimate of Y when our
7.2. W R PROBABILITY PROPORTIONAL TO...
123 123
probability proportional to size sampling is done WR. We shall also derive a formula for the variance of this estimate and then obtain an unbiased estimate of this variance. Given (U;x,y) above, we assign to unit U £/,•t the probability Pi = Xi/X. The function y may be regarded as a random variable; its density is seen to be
P[y = t] = UPi--Yi = t} and its expectation is
E(y) = J:Y J:YrTX Xrr/X. /X r=l
Let yi,-.?J/n yi,--?Jy w denote the numerical outcomes of our sample of size •yn n in WR probability proportional to size sampling. Then yyi,• 1 , • •• •• •y„ are independent random variables, each with the same density as y. In addition, let u- denote the number of the unit selected on the ith m,u • • •• un are seen to be independent random trial, 1 < i < n. Then u variables with common density P[Ul = = *] = = Xi/X = Pi,l
N.
For 11
Pi = E M - i - 1 , rr = = ll
One might refer to pi fi as the probability of selecting the unit obtained in the ith observation. T H E O R E M 1. An unbiased estimate ofY in WR probability proportional to size sampling on (U;x,y) (W;x,y) is Y defined by
* - ; ! > / * , and its variance is
y#)-;£»$-*)''■
124
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
Proof: For each i, the random variables y,- and p* depend only on the unit selected on the zth trial. Since the sampling is WR, it follows that the n random variables j/i/pj, • • •, Vn/Pn are independent and all have the same density. Observe that
in
f
Yr
from which we obtain E(y1/p*1) = £?=i Yr = Y. Thus
E(Y) = (l/n)J2E(yr/P;) = Y, r=l
proving that Y is an unbiased estimate of Y. Note that
Var(yi/Pl) =
E((^-YJ\
- se-')'from which the formula for Var(Y) in the theorem follows.
Q.E.D.
R E M A R K 1. Let us consider the special case when X{ = 1,1 < i < N. In this case pr = 1/JV, 1 < r < JV, and p* is the constant 1/JV. In this case,
Y = (N/n)J2yi = Ny, t=l
and
V"-* =
U
lEliXYr-Y? r=l
n ~
iV
n
y
7.2. W R PROBABILITY
PROPORTIONAL
TO... TO...
125 125
These formulae are the same as those given in Theorems 1 and 3 in Section 6.2. In WR sampling, an unbiased estimate of Var(Y) is fairly easy to obtain. 2.j\n An unbiased estimate of Var(Y) T H E O R E M 2. random variable Var(Y) defined by
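The estimator of Theorem 1 can be sketched in a few lines of Python (an illustration added here, not part of the text). The small population (x, y) used below is hypothetical, and random.choices is used to draw units with probabilities p_i = X_i/X.

```python
import random

def wr_pps_estimate(y, x, n, seed=None):
    """WR probability-proportional-to-size sampling and the unbiased estimate
    (1/n) * sum of y_i / p_i, with p_i = X_i / X."""
    rng = random.Random(seed)
    X = sum(x)
    p = [xi / X for xi in x]
    total = 0.0
    for _ in range(n):
        i = rng.choices(range(len(x)), weights=p, k=1)[0]   # unit drawn with prob. p_i
        total += y[i] / p[i]
    return total / n

# Hypothetical small population (U; x, y): x is the known predictor.
x = [3.0, 5.0, 2.0, 6.0, 4.0]
y = [6.1, 9.8, 4.2, 12.0, 8.3]
print("estimate:", round(wr_pps_estimate(y, x, n=3, seed=4), 2), "  true Y =", sum(y))
```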
is the observable
!
^=^s(H!t) ^ / p j , • • • ,yjp*n ,yn/P*n are independent and identically disProof: Since yi/pj,•• tributed (i.e., they all have the same density), it follows that
yar(y) = \±Var( * £ Var^/p]) Var{Y) yj/p*)
= l-Var (yVA A•
By Theorem 1 in Section 6.3, the random variable
J-±(yj..l±li!i\2 n-l^VPj
n^p'J
is an unbiased estimate of Var(y,/pJ). Var{yx/p\). E(Va7(Y)) = Var{Y).
From this it follows that Q.E.D.
There is a formula for Var(Y) other than that given in Theorem 1 to which we now turn our attention. L E M M A 1. If ai, • • •• arr , a , , • •• br are any numbers, then
,=1
\j=l
) \k=l
/
Proof: Onee observes observes that mat \»=1 \i=l
/ \fc=l \Jk=l
/
j=lk=l j=lA:=l
j#fc
126
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
r
= Ea a«'6ft« + E aa i66 *' * * i *' = tE= i «' « +j E Q.E.D. Q.E.D.
from which the result follows. from which the result follows.
L E M M A 2. / / Ax, • • •• Arr Pli , • •, PP rre eeal numbers, ,f P, > > for 11
£<-*)'-E(*-£)V, Proof: We observe that T
/ A
\2
V i V i
''
r
A%
rr
A2 A2
T
r
2 D *(£-*) = E # 2 E ^ + E ^ 1=1 = ll ■*« ■*« «=1 i=i ** tt = t1=1 =i «=i rr
\2 \2
= Y) — - 2 A22 + A22 = J2 — - A22 P = Y) i^lP— i - 2 A + A = £J2 l — i - A On the other hand, by easy algebra and Lemma 1, On the other hand, by easy algebra and Lemma 1,
2 +
§(M)'*'< - §(f''- ^ f«) =
Y\[—P- + ^-P\ - 2 V M -
= E4h-E^4
p
= (sf)fe Hf* / r
- U I> \»=i
\2 /
r
+EA2 + tE^2 =i
_= y* £L _ 42 Y]—A2. t = i *"**
These two strings of equalities establish the equation in the lemma. Q.E.D.
7.2. W R PROBABILITY T H E O R E M 3. IfY
PROPORTIONAL
TO ...
127
is as in Theorem 1, then 2
n
i<j
\Pi
PjJ
Proof: This follows immediately from Theorem 1 and Lemma 2 by substituting in the latter pi for P t ,K for Ai and Y for A in the latter. Q.E.D. The advantage of unequal probability sampling is seen by means of the formula for Var(Y) given in Theorem 3. Suppose that Y{ is approximately KX{ for some constant K. Then for every i < j , ^ *■ is very close to zero, from which it follows that Var(Y) is quite small. EXERCISES 1. Consider the population U:
x: y:
Ux
U2
U3
U4
Us
U6
U7
17.1 21.3 30.1 15.6 40.9 25.0 35.2 12.3 13.2 19.8 12.1 28.3 15.2 26.8
Let 2/1,2/2? 2/3 denote a sample of size three on y in WR probability proportional to size sampling using the predictor x (i) Find the density p^ (ii) Compute E{pl) (iii) Compute Y. (iv) Find the density of 2/2/^2 (v) Compute E(y2/pl) (vi) Compute
and
Varfa/pi).
E(Y).
(vii) Compute Var(Y)
by the formulae in Theorems 1 and 3.
2. In Problem 1, suppose one does simple random sampling WR. In this case ly is an unbiased estimate of Y. Compute Var(7y), and compare it with the answer you obtained in l(vii).
128
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
3. For population (W; x, y) with probability proportional to size sampling, prove that E{yx) = E-Ii YXi/X. 4. Suppose (U;x,y) satisfies the foUowing: there exists a constant K > 0 and a small number e > 0 such that K(l-e)<^
+ e)
X{ Xi
(i.e., the ratios {15/A,-} are all aU close to K.). A".). If HY V" isis the the unbiased unbiased estimate of Y F given in Theorem 1, prove that Var(Y) <
7.3
^K2X2e2.
WOR Probability Proportional tto o Size Sampling
Surely we must be able to do better by doing probability proportional to size sampling without replacement rather than with replacement. We show that this can be done in this section. We shall let (U;x,y),{Pi},{ui} i},{Ui} and {j/t} {y,} be as in Section 7.2. Our basic assumptions are that Xft/Yi is not identically constant in i, we are taking a sample of size n by WOR probability proportional to size sampling, and Xi > 0 for all *. To begin with, let us define
P R O P O S I T I O N 1. Ifti
h=y
is as defined above, then
£ x~pcI{Ui=*
and E
W
=Y
-
7.3. WOR
PROBABILITY PROPORTIONAL
TO ...
129
Proof: Recall that y\ can be written as y\ = ]CjLi^j^[ui=j]- Thus for any elementary event u>, if u € [i^i = j ] , then j/i(o;) = 1} Yj and # ( w ) = ^P j = Xj/X, i.e., (yiM)(u,) Pi(w) (yM)(u) = Yjl{XjlX), which yields the first conclusion. The second conclusion follows from Theorem 1 in Section 7.2 by taking n = 1. Q.E.D. Next, let us define
t,
„,f
r n i=i Xi/(X — X Ul) iUi=,h h Xi/(x-x y
2 yi+
~
*
r •
u
and, in general for 2 < j < n, we define
.•. ++ »„,•_!+ «!i , = - *yi ++ ••• -,+
] E E
t=l,tg{t
„■-■■ Xllx
y
' - l xy .hH-* ,/hHIL-,
Let tj <,■ = (
™ » W ee —
— —
„o/K. V.
fc^). Note that for Proof: Let us compute £ ( <* > ,!==* „ • • •fc,-.^ , » , ■ _ , = *,-_,). 1 < iz < <j —1 E(yt\ui = h,---,
u,-_i = *,-_!) = Ffci.
Also, for »rf i £ {{Jfci,...,*,-.!} *!,•••,*,•_!}
P([Uj P([Uj = =i ] »]!«! |Ul = = *!,-..,U,-_ *x, • • -,«,•_! = *,-_!) = Xi/ = X,7 ( X -( X ^X - f£c r jX*, . J . X = *,-_!) Thus we obtain
E{tj\ = *!,...,«,-_! *!,...,«,-_!==Vi) v o==n,n,++• •• ••+• v+ ,v +. + E{tj\Ul Ul =
EE ^ ^ = Y=-
F
-
130
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
Hence E(tj) = Y. This proves that each tj is unbiased. Easily,
E(ij) = E((t1 + --- + tj)/j) = Y Q.E.D. The statistic tn is what we intend to use as an estimate of Y. The first question that arises is whether this estimate is "better" than the es timate Y obtained in Section 7.2 where the samphng was done WR. By the definition of Y in Section 7.2 it follows that Var(Y) = Var(ti)/n. Thus we shall have proved in to be a "better" estimate than Y when we have proved that- Var(in) < Var(ti)/n. T H E O R E M 2. If {U, 1 < i < n} are as defined above and then U and tj have correlation zero for i ^ j , and Var(tn) <
n
ifn>2,
-Varfa).
Proof: Let 2 < j < n, let &i, • • •, kj be j distinct integers in { 1 , • • • , N}. By what was proved in the proof of Theorem 1, E(tj\ui = A?i, • • •, Uj-i = %_i) = Y for 2 < j < n. Thus the random variable ^(
=
= k
Z)
E(tj\ux = fci,--->fi-i = I H J / ^ . I ^ J
£
YI
ct:>^] = Y'
a constant. We use this fact now to show that if 1 < i < j < n, then Cov(ti,tj) = 0. Indeed, by properties of conditional expectation established in Chapter 4, since U is a function of Ui, • • •, u,-, and since i < j , we have E{Utj)
=
£(£(W>ir.., t^))
= EiUEitjluu-.tUj-t)) = E(tiY) = YE(ti)^Y29
7.3. WO R PROBABILITY
PROPORTIONAL
TO ...
131
this last equality by Theorem 1. Since E(t{) = E(tj) = y , then Cov(th ts) = E(Utj) - E(ti)E(tj) = y 2 - y 2 = 0, i.e., the covariance of U and tj is zero. We next show that for 2 < i < n, Var(i t ) < Var(t\). We first recall the theorem proved in Chapter 4: if U is a random variable, and if H is any vector random variable, then Var(U) = E(Var(U\H))
+
Var(E(U\H)).
To apply this result here, let U = U and let H be the random vec tor whose coordinates are Ui, • • •, ut-_i. As shown earlier in this proof, E(ti\ui, • • •, Ui-i) = Y which is a constant random variable, and thus Var(E(ti\ui, • • •, Ut-i)) = 0. This implies by the above-recalled result that Var(U)
=
E(Yar(ti\uir^,Ui^i)) {E((ti)2\ui
£
= ki,-~,Ui-i
=
k-i)
= *!,-••, wt_i = fc^x))2} J^-i [ u r = k r ] ) .
-(E(ti\Ul
For fixed fc1? • • •, fct-_i, the expression inside the curly brackets, {•}, is the variance of yi/p* when samphng WOR and with probability propor tional to size after units Ukx, • • •, Uki_x have been removed. By Theorem 3 in Section 7.2, Var\
M-
E l
E
=
l
p.Pi
- -— \p« Pi/
^(i-i)
2
Hence the expression in the curly brackets above equals
E i
-1>
and because Yu/Xu is assumed to be not identically constant in u, then at least one of the L ^ ) expressions above is strictly less than
Ku
132
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
Thus
Var(ti)<E[
£
Varit^-j^^Varih).
Now, putting everything together that we have proved so far, we have
Var(tJ
=
^Var(j2ti\
= -AjlVar(ti) U
= lb
+2 £
\*=1
"^<
Cov^tj))
l<*
/
±-2f:Var{U)<±-nVar{t1)=l-Var{t1). TL
lh
Q.E.D. Comparing this theorem with Theorem 1 in Section 7.2, we see that WOR sampling yields smaller variance for the unbiased estimate than would have been obtained for the unbiased estimate in WR sampling obtained in Section 7.2. Last, we shall need an unbiased estimate of Var(Q. T H E O R E M 3. An unbiased estimate ofVar(tn) Var
is
1
Proof: Recall that in the proof of Theorem 2 we proved that E{titj) = Y2 for i ^ j . Let us define
Then E(Y2) = Y2. Now let us define Var(in) = f„2 - Y2. Then E(Var(Q)
= E(Pn) - (E(Q)2
=
Var(Q.
7.3. WO R PROBABILITY
PROPORTIONAL
TO ...
133
But tn-Y*
=
1 n(n — 1
n(n-l)fn2-2
J2
Mi)
1 | n 2 f n 2 - 2 X) Wj-nfB3l n(n — 1 I l
X> - Q2. t=i
Q.E.D. EXERCISES 1. Prove: If U and V are random variables, if E(U\V = v) = c for all v G ran(7e(V), where c is some constant, then E(U\V) = c. 2. Prove: If Z is a random variable, if X
=
XI
X
zI[Z=z],
z£range(Z)
and if z£range(Z)
where yz ^ 0 for all z G range(Z), then
z€range(Z)
y z
3. Consider the population W: J7i «73 % ^4 #5 #6 x : 8.1 12.2 20.1 10.3 8.2 6.8 y : 11.9 19.3 28.1 17.0 11.1 10.3.
134
CHAPTER
7. UNEQUAL PROBABILITY
SAMPLING
Suppose one were to sample twice WOR with probabiUty propor tional to size sampling. (i) Compute Var(Y), Y.
where y = f2 is the unbiased estimate of
(ii) Suppose in sampling (WOR) the units selected were first: U3 and second: C^. Compute t^t2 and ?24. Prove: If 1 < i < j — 1, then E(yi\ui = *!,-•-, UJ^ = fy-i) = Yki
Chapter 8 Linear Relationships 8.1
Linear Regression Model
One of the happiest circumstances there can be in the problem of esti mating Y for a population (W; x, y) with a predictor variable x is that in which a linear relationship connects x and y. This means that there are constants a and b such that y = a + bx. Of course, in practice this never occurs in a precise manner; yet there are many times when there is good reason to believe such a relationship exists in an approxi mate manner. This chapter is devoted to some advantages that can be reaped when there is reason to believe that such a linear relationship is close to reality. In this section we consider the problem of estimating a and b by means of a simple random sample on U. Given a simple random sample t*i, • • •, un on ZY, either WR or WOR, we address ourselves to the problem of finding numbers a and 6 such that y and a + bx are "close" to each other in some sense so that we could almost write y = a + bx. Now j/ t - is recalled to be defined by = Yr,\ < i < n , l < r < JV, or j/t(w5) = Yu„l < s < n. We yi(Ur) night define X{ by Xi(us) = XUs. Thus a and b should be such that we could almost write j/ t - = a + bx{ for 1 < i < n. The wish to write these equations is the same as that of making all absolute values of differences |t/i1 — a — bxi I as small as possible. One mathematically precise way of stating this is the following problem: to find a and b as functions of #i> * •') #n? J/i? * • •, Vn that will minimize the sum of the squares of the 135
136
CHAPTER 8. LINEAR
RELATIONSHIPS
errors, Q, where n
Q = XXj/f - a - 6x ) 2 . t It is this problem that we solve now. L E M M A 1. If Ci,- — ,Cn are real numbers, and if c is defined by c = (d H f- Cn)/n, then the value ofm that minimizes YA=I(CI"~~m)2 is m = c. Proof: We denote T = E?=i(c* - m) 2 . Then T
= £((ct--c) + (c-m))2 t=i n
z
2 d
m
n
= E t e - f + ( - ) E t e - *) + n ( a - m ) 2 Since EJL1(cl- — c) = 0, it follows that the smallest value of T is achieved when m = c. Q.E.D. Let us denote j
n X
-n £ * Ui -i
j
n
andj/ = - £ y , - , n
t=i
n
- J2xl - *2 and *» = V^J' i
n
- «=i 5 3 y? - y2 and a„ = ^ * , and
J E 1 * - xv «=1
T H E O R E M 1. The values of b and a which minimize Q above are given, respectively, by i °v * i * - Lo = T-px,y ana a = y — ox. °x
8.1. LINEAR REGRESSION
MODEL
137
This "regression equation" can be written in the form y—y -7
x—x
A
= Px
•
Proof: Rewriting Q as Q = Y27=i(yi ~~ ^x%— a ) 2 ? a n d denoting c, = yt- — 6xt-, we obtain from Lemma 1 that whatever value b is, Q is minimized when a is set equal to 1
n
Ui$3(w
n
" 6*0 = y -bx.
Thus we need only find the value of b that minimizes n
Now
Q = E((w - y) - ^ « - *))2 n
= E(w - *)a - 2*E(** - *)(*• - y) +* 2 E ( * - *)2i=l
i=l
t=l
Since Q is a sum of squares, then Q > 0. But Q is quadratic in 6, i.e., Q is of the form Q = Ab2 + 2Bb+C. We can rearrange this expression for Q to obtain
B\2
0-A(»+3) +
AC-B2
Since A > 0, it follows that the value of b that minimizes Q is 6 = - B / A . Thus n
h
A~
"
'
CHAPTER 8. LINEAR
138
RELATIONSHIPS
Since $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, we obtain, by easy algebra, $\hat{b} = \frac{s_y}{s_x}\beta_{x,y}$, which completes the proof of the first conclusion of the theorem. The second conclusion follows by some easy algebraic manipulation. Q.E.D.

It should be noticed that $\hat{a}$ and $\hat{b}$ are functions defined over the fundamental probability space over which the random variables $x_1, \ldots, x_n, y_1, \ldots, y_n$ are defined, and thus are themselves random variables. They are called the least squares solutions for a and b.

EXERCISES

1. The equation y = a + bx considered in this section is called the regression line of y on x. Find the least squares estimates of the constants in the regression line of x on y.

2. In our least squares treatment of x, y, it might be more reasonable to assume that y is a second degree polynomial function of x, i.e., $y = a + bx + cx^2$. If so, find the least squares estimates $\hat{a}, \hat{b}, \hat{c}$ of a, b, c.

3. Prove: if $c_1, \ldots, c_n$ are numbers, and if $\bar{c} = (c_1 + \cdots + c_n)/n$, then $\sum_{i=1}^{n}(c_i - \bar{c}) = 0$.

4. Prove: if X is a random variable, then the value of m that minimizes $E((X - m)^2)$ is $m = E(X)$.
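As a quick numerical illustration of Theorem 1 (not part of the text), the least squares solutions can be computed directly from a sample. The sketch below uses the divisor-n definitions of $s_x$, $s_y$ and $\beta_{x,y}$ given above; the data values are made up.

```python
import math

def least_squares(x, y):
    """Least squares solutions for a and b in y ~ a + b x,
    following the formulas of Theorem 1 (divisor n, not n - 1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sx = math.sqrt(sum(xi**2 for xi in x) / n - xbar**2)
    sy = math.sqrt(sum(yi**2 for yi in y) / n - ybar**2)
    beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n * sx * sy)
    b_hat = (sy / sx) * beta
    a_hat = ybar - b_hat * xbar
    return a_hat, b_hat

# Hypothetical sample values, for illustration only.
x = [3.1, 13.2, 24.1, 41.6]
y = [3.5, 14.8, 27.5, 46.2]
print(least_squares(x, y))
```

Because $n s_x^2 = \sum_i (x_i - \bar{x})^2$, the value returned for $\hat{b}$ agrees with the ratio $\sum_i(x_i-\bar{x})(y_i-\bar{y})/\sum_i(x_i-\bar{x})^2$ derived in the proof.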
8.2 Ratio Estimation
In this and the subsequent sections we shall concern ourselves with special cases of the linear regression model y = a + bx plus a small error, as considered in Section 1. In this section we shall deal with the case a = 0. When this situation holds, y = bx, or $Y_i = bX_i$ plus a small error. This is just the situation where probability proportional to size sampling is so effective. Another effective method here, if one chooses to do simple random sampling either WR or WOR, is the ratio estimate, which is the subject of this section. As before we consider the population (U; x, y) with predictor. A sample of size n is taken, yielding observations $(x_1, y_1), \ldots, (x_n, y_n)$, and the problem is to estimate Y.
Definition: The ratio estimate $\hat{Y}$ of Y is defined to be

$$\hat{Y} = \frac{\bar{y}}{\bar{x}}\,X.$$

Roughly speaking, $\bar{y}$ estimates $\bar{Y}$, $\bar{x}$ estimates $\bar{X}$, so $\bar{y}/\bar{x}$ estimates $\bar{Y}/\bar{X} = Y/X$, and hence the ratio estimate should estimate Y. A word of caution is in order here. It is not necessarily true that the expectation of a quotient of two random variables is equal to the quotient of the expectations. Hence the ratio estimate $\hat{Y}$ of Y is not necessarily unbiased. In Theorem 2 we shall give conditions under which the ratio estimate is unbiased.

The literature on ratio estimates appears to some as unsatisfying. The results are arrived at without sufficient rigor. Also, the approximations are found to be wanting. In Theorem 1 below we shall show that if the assumption of the model that y/x is uniformly close to b is true, then the ratio estimate $\hat{Y}$ of Y attains remarkable precision and has very small variance. This theorem is not of much practical use, but an understanding of it should give the student the courage to use ratio estimates when the model above applies. But first we need a lemma.

LEMMA 1. If V is a random variable satisfying $P[a \le V \le b] = 1$, then

$$Var(V) \le (b - a)^2.$$
Proof: Suppose the density of V is $f(v_i) = p_i$, $1 \le i \le r$. Then $\sum_{i=1}^{r} p_i = 1$ and $a \le v_i \le b$ for $1 \le i \le r$. Thus $ap_i \le v_i p_i \le bp_i$, and summing over i, we obtain $a \le E(V) \le b$. These last two inequalities yield $a - b \le v_i - E(V) \le b - a$ for $1 \le i \le r$. Thus $(v_i - E(V))^2 \le (b - a)^2$ for $1 \le i \le r$, and

$$Var(V) = \sum_{i=1}^{r} (v_i - E(V))^2 p_i \le (b - a)^2 \sum_{i=1}^{r} p_i = (b - a)^2. \quad\text{Q.E.D.}$$
THEOREM 1. Let (U; x, y) be a population with positive predictor x-values and positive y-values, and suppose a sample of size n is obtained, either WR or WOR. Suppose there exist numbers $\delta, K$, where $0 < \delta < 1$ and $K > 0$, such that $(1 - \delta)K \le Y_i/X_i \le (1 + \delta)K$ for $1 \le i \le N$. Then the ratio estimate $\hat{Y} = (\bar{y}/\bar{x})X$ of Y satisfies

$$\text{(i)}\qquad \frac{1 - \delta}{1 + \delta} \le \frac{\hat{Y}}{Y} \le \frac{1 + \delta}{1 - \delta},$$

and

$$\text{(ii)}\qquad Var(\hat{Y}) \le \left(\frac{4\delta}{1 - \delta^2}\right)^2 Y^2.$$
Proof: The hypotheses imply

$$(1 - \delta)K X_i \le Y_i \le (1 + \delta)K X_i, \qquad 1 \le i \le N.$$

Summing both sides of each inequality from 1 through N, we have

$$(1 - \delta)K X \le Y \le (1 + \delta)K X.$$

Similar relationships are seen to hold for the observations $x_1, \ldots, x_n, y_1, \ldots, y_n$, namely, $(1 - \delta)K x_i \le y_i \le (1 + \delta)K x_i$, $1 \le i \le n$, and hence $(1 - \delta)K \bar{x} \le \bar{y} \le (1 + \delta)K \bar{x}$. Thus

$$\hat{Y} = \frac{\bar{y}}{\bar{x}}X \le (1 + \delta)K X = \frac{1 + \delta}{1 - \delta}(1 - \delta)K X \le \frac{1 + \delta}{1 - \delta}\,Y,$$

and

$$\hat{Y} \ge (1 - \delta)K X = \frac{1 - \delta}{1 + \delta}(1 + \delta)K X \ge \frac{1 - \delta}{1 + \delta}\,Y,$$

from which we obtain conclusion (i). We finally apply Lemma 1 to conclusion (i) to obtain

$$Var(\hat{Y}) \le \left(\frac{1 + \delta}{1 - \delta}\,Y - \frac{1 - \delta}{1 + \delta}\,Y\right)^2 = \left(\frac{4\delta}{1 - \delta^2}\right)^2 Y^2. \quad\text{Q.E.D.}$$

Truly, more general theorems than Theorem 1 can be proved. However, Theorem 1 should give a good indication of how accurate the ratio estimate can be when x and y have an almost constant ratio. It is important to note that in using the ratio estimate we are taking advantage of the fact that y/x is very close to being constant, while y itself need not be. It should be noted that the ratio estimate need not be unbiased, and examples can easily be constructed where $\bar{y}/\bar{x}$ is not an unbiased estimate of Y/X. Just for curiosity's sake we can state and prove a theorem giving conditions under which $E(\bar{y}/\bar{x}) = Y/X$. One should continue to keep in mind that in our model (U; x, y) of a population with predictor, all $X_i$ are assumed to be positive.

THEOREM 2. Let (U; x, y) be a population with a predictor, and suppose that a simple random sample $(x_1, y_1), \ldots, (x_n, y_n)$ of size n is selected WR. Suppose there exists a constant $\beta > 0$ such that, if $z_i$ is defined by $z_i = y_i - \beta x_i$, then $z_i$ and $x_i$ are independent random variables, and $E(z_i) = 0$, $1 \le i \le n$. Then the ratio estimate $\hat{Y} = (\bar{y}/\bar{x})X$ is an unbiased estimate of Y.

Proof: Recall that for each i, $E(y_i) = \bar{Y}$, $E(x_i) = \bar{X}$ and (by hypothesis) $E(z_i) = 0$, from which it follows that $E(\bar{y}) = \bar{Y}$, $E(\bar{x}) = \bar{X}$ and $E(\bar{z}) = 0$ (where $\bar{z} = (z_1 + \cdots + z_n)/n$). Now, from the fact that $y_i = \beta x_i + z_i$, $1 \le i \le n$, we obtain $\bar{y} = \beta\bar{x} + \bar{z}$, and, taking expectations, we have $\bar{Y} = \beta\bar{X}$, or $Y = \beta X$.
Thus

$$E\!\left(\frac{\bar{y}}{\bar{x}}\right) = E\!\left(\beta + \frac{\bar{z}}{\bar{x}}\right) = \beta + E\!\left(\frac{\bar{z}}{\bar{x}}\right).$$
We now apply three results from Chapter 4. We first observe that

$$E\!\left(\frac{\bar{z}}{\bar{x}}\right) = E\!\left(E\!\left(\frac{\bar{z}}{\bar{x}}\,\Big|\,x_1, \ldots, x_n\right)\right).$$

Next, since $1/\bar{x}$ is a function of $x_1, \ldots, x_n$, it follows that

$$E\!\left(\frac{\bar{z}}{\bar{x}}\,\Big|\,x_1, \ldots, x_n\right) = \frac{1}{\bar{x}}\,E(\bar{z} \mid x_1, \ldots, x_n),$$

or

$$E\!\left(\frac{\bar{z}}{\bar{x}}\right) = E\!\left(\frac{1}{\bar{x}}\,E(\bar{z} \mid x_1, \ldots, x_n)\right).$$

Since by hypothesis $\{x_1, \ldots, x_n\}$ and $\{z_1, \ldots, z_n\}$ are independent, it follows that $\bar{z}$ and $\{x_1, \ldots, x_n\}$ are independent, and thus

$$E(\bar{z} \mid x_1, \ldots, x_n) = E(\bar{z}) = 0.$$
From this we obtain $E(\bar{z}/\bar{x}) = 0$, and hence $E(\bar{y}/\bar{x}) = \beta$. But earlier we established that $\beta = Y/X$, and thus $\bar{y}/\bar{x}$ is an unbiased estimate of Y/X. Q.E.D.

Models do exist that satisfy the hypothesis of Theorem 2. For such a model, z is a random error whose expectation is zero. Thus unbiasedness of the ratio estimate occurs when y is a constant multiple of x plus an independent random error. An example of such a population is given in the following set of exercises.

EXERCISES

1. Consider the following population:
U:  U_1   U_2   U_3   U_4   U_5
x:  3.1  13.2  24.1  41.6  62.1
y:  3.5  14.8  27.5  46.2  71.42
A sample of size three is observed WOR.
(i) List all ten outcomes of selecting three units from among those in U.
(ii) Compute X and Y.
(iii) Find the smallest and largest values of the unbiased estimate $5\bar{y}$ of Y.
(iv) Find the smallest and largest values of the ratio estimate $(\bar{y}/\bar{x})X$ of Y.

2. Verify that the population in Problem 1 satisfies the hypotheses of Theorem 1 for K = 1.13 and $\delta = .03$.

3. Consider the following population:

U:  U_1  U_2  U_3
x:   2    3    5
y:   4    6   11

A sample of size two is taken WOR. Compute $E(\bar{y}/\bar{x})$ and Y/X and compare.

4. Consider the population:
U:  U_1  U_2  U_3  U_4  U_5  U_6  U_7  U_8  U_9
x:   30   40   50   30   40   50   30   40   50
y:   44   59   74   45   60   75   41   61   76
Define the random variable z by y = 1.5x + z, i.e., z = y - 1.5x.
(i) Find the density of x.
(ii) Find the density of z.
(iii) Find the joint density of x and z.
(iv) Verify that x and z are independent random variables.
(v) Verify that E(z) = 0.
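The caution about bias in the ratio estimate is easy to see numerically. The following minimal Python sketch (not part of the text) enumerates all WOR samples of size two from the small population of Exercise 3 and compares $E(\bar{y}/\bar{x})$ with Y/X; in general the two differ.

```python
from itertools import combinations
from statistics import mean

# Population of Exercise 3 above.
X = [2, 3, 5]
Y = [4, 6, 11]

# Enumerate all equally likely WOR samples of size two.
ratios = []
for idx in combinations(range(len(X)), 2):
    xbar = mean(X[i] for i in idx)
    ybar = mean(Y[i] for i in idx)
    ratios.append(ybar / xbar)

print("E(ybar/xbar) =", mean(ratios))      # expectation over the three samples
print("Y/X          =", sum(Y) / sum(X))   # generally not equal: the ratio estimate is biased
```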
8.3 Unbiased Ratio Estimation
In Section 8.2 we treated the ratio estimate obtained by simple random sampling. It was not necessarily unbiased; we could not compute its variance, nor could we provide an unbiased estimate for its variance. However, if we are willing to modify our sampling by selecting our first unit by means of probability proportional to size sampling and then selecting n - 1 distinct units by WOR simple random sampling from the remaining N - 1 units, then it turns out that the ratio estimate is unbiased, we can determine the formula for the variance of this estimate, and we can find an unbiased estimate of this variance.

DEFINITION: Let (U; x, y) be a population with predictor. We shall say that a sample of size n is selected by p.p.a.s. sampling (probability proportional to aggregate size) if the first unit is selected by probability proportional to size sampling and the remaining n - 1 units of the sample are selected from the remaining N - 1 units of the population by WOR simple random sampling.

Of use here and in subsequent results is the following lemma.

LEMMA 1. If $a_1, \ldots, a_N, b_1, \ldots, b_N$ are any numbers, and if f(u, v, x, y) is any function of four variables, then, for fixed $n \le N$,
$$\sum_{1 \le i_1 < \cdots < i_n \le N}\;\sum_{j=1}^{n} a_{i_j} = \binom{N-1}{n-1}\sum_{i=1}^{N} a_i,$$

and

$$\sum_{1 \le i_1 < \cdots < i_n \le N}\;\sum_{1 \le u < v \le n} f(a_{i_u}, a_{i_v}, b_{i_u}, b_{i_v}) = \binom{N-2}{n-2}\sum_{1 \le i < j \le N} f(a_i, a_j, b_i, b_j).$$
Proof: For each $i \in \{1, \ldots, N\}$, there are $\binom{N-1}{n-1}$ ways of selecting n - 1 other distinct integers in $\{1, \ldots, N\} \setminus \{i\}$ such that the n of them form an n-tuple $i_1 < i_2 < \cdots < i_n$. Thus $a_i$ is included in $\binom{N-1}{n-1}$ sums on the left-hand side of the first equation, which proves the first identity. Similarly, take arbitrary
indices i, j with $1 \le i < j \le N$. The term $f(a_i, a_j, b_i, b_j)$ occurs in $\binom{N-2}{n-2}$ of the tuples $(k_1, \ldots, k_n)$, where $1 \le k_1 < \cdots < k_n \le N$, on the left-hand side of the second equation, which proves the second identity. Q.E.D.

THEOREM 1. In p.p.a.s. sampling, the ratio estimate $\hat{Y} = (\bar{y}/\bar{x})X$ is an unbiased estimate of Y.
Proof: Let $\{U_{i_1}, \ldots, U_{i_n}\}$ denote the event that units $U_{i_1}, \ldots, U_{i_n}$ are selected in any order, and let $B_i$ denote the event that $U_i$ is the unit selected first (by probability proportional to size sampling). Then

$$P(\{U_{i_1}, \ldots, U_{i_n}\}) = \sum_{r=1}^{n} P(B_{i_r})\,\frac{1}{\binom{N-1}{n-1}} = \sum_{r=1}^{n} \frac{X_{i_r}}{X}\cdot\frac{1}{\binom{N-1}{n-1}} = \frac{\sum_{r=1}^{n} X_{i_r}}{X\binom{N-1}{n-1}}.$$

By a property of conditional expectation given in Chapter 4, by Lemma 1, and by this last equation we have

$$E(\hat{Y}) = \sum_{1 \le i_1 < \cdots < i_n \le N} E(\hat{Y} \mid \{U_{i_1}, \ldots, U_{i_n}\})\,P(\{U_{i_1}, \ldots, U_{i_n}\}) = \sum_{1 \le i_1 < \cdots < i_n \le N} \frac{\sum_{r=1}^{n} Y_{i_r}}{\sum_{r=1}^{n} X_{i_r}}\,X\cdot\frac{\sum_{r=1}^{n} X_{i_r}}{X\binom{N-1}{n-1}} = \frac{1}{\binom{N-1}{n-1}}\sum_{1 \le i_1 < \cdots < i_n \le N}\sum_{r=1}^{n} Y_{i_r} = \sum_{i=1}^{N} Y_i = Y.$$

Q.E.D.

We shall omit computation of $Var(\hat{Y})$ and shall proceed directly to obtain an unbiased estimate of $Var(\hat{Y})$. It should be noted that in so doing we need only find an unbiased estimate of $Y^2$, since $E(\hat{Y}^2) - Y^2 = Var(\hat{Y})$. We shall need to define the random variable $p^*(N, n)$; it is the random variable that equals $P(\{U_{i_1}, \ldots, U_{i_n}\})$ whenever the event $\{U_{i_1}, \ldots, U_{i_n}\}$ occurs. This event was defined in the proof of Theorem 1. Thus

$$p^*(N, n) = \sum_{1 \le i_1 < \cdots < i_n \le N} P(\{U_{i_1}, \ldots, U_{i_n}\})\,I_{\{U_{i_1}, \ldots, U_{i_n}\}}.$$
THEOREM 2. In p.p.a.s. sampling, an unbiased estimate of $Var(\hat{Y})$ is

$$\widehat{Var}(\hat{Y}) = \hat{Y}^2 - \frac{z}{p^*(N, n)},$$

where

$$z = \frac{1}{\binom{N-1}{n-1}}\sum_{r=1}^{n} y_r^2 + \frac{2}{\binom{N-2}{n-2}}\sum_{1 \le u < v \le n} y_u y_v.$$
Proof: We compute $E(z/p^*(N, n))$ in two stages. Since $p^*(N, n)$ equals $P(\{U_{i_1}, \ldots, U_{i_n}\})$ on the event $\{U_{i_1}, \ldots, U_{i_n}\}$, we have

$$E\!\left(\frac{z}{p^*(N, n)}\right) = \sum_{1 \le i_1 < \cdots < i_n \le N} E(z \mid \{U_{i_1}, \ldots, U_{i_n}\}).$$

By Lemma 1, we have

$$\sum_{1 \le i_1 < \cdots < i_n \le N} \frac{1}{\binom{N-1}{n-1}}\sum_{r=1}^{n} Y_{i_r}^2 = \sum_{i=1}^{N} Y_i^2.$$

Also, by Lemma 1,

$$\sum_{1 \le i_1 < \cdots < i_n \le N} \frac{2}{\binom{N-2}{n-2}}\sum_{1 \le u < v \le n} Y_{i_u} Y_{i_v} = 2\sum_{1 \le r < s \le N} Y_r Y_s = \sum_{r \ne s} Y_r Y_s.$$

Thus $E(z/p^*(N, n)) = \sum_{i=1}^{N} Y_i^2 + \sum_{r \ne s} Y_r Y_s = Y^2 = (E(\hat{Y}))^2$. Hence

$$E(\widehat{Var}(\hat{Y})) = E(\hat{Y}^2) - (E(\hat{Y}))^2 = Var(\hat{Y}). \quad\text{Q.E.D.}$$
The ratio estimate obtained for p.p.a.s. sampling thus has a decided advantage if one is able to take that first observation by probability proportional to size sampling.

EXERCISES

1. Prove: If $N \ge n \ge 3$, and if $d_1, \ldots, d_N$ are real numbers, then

$$\sum_{1 \le i_1 < \cdots < i_n \le N}\;\sum_{u, v, w\ \text{distinct}} d_{i_u} d_{i_v} d_{i_w} = \binom{N-3}{n-3}\sum_{u, v, w\ \text{distinct}} d_u d_v d_w.$$

2. Prove: In p.p.a.s. sampling,
3. Consider the following population:

U:  U_1  U_2  U_3  U_4  U_5  U_6
x:   20   23   17   14   31   26
y:   39   48   33   30   60   40

Suppose that in taking a sample of size 4 by p.p.a.s., the units obtained are $U_5, U_3, U_6, U_1$.

(i) Compute the unbiased estimate $\hat{Y}$ of Y.
(ii) Compute the unbiased estimate $\widehat{Var}(\hat{Y})$ of $Var(\hat{Y})$.

4. In the proof of Theorem 1, prove that the n events $B_{i_1} \cap \{U_{i_1}, \ldots, U_{i_n}\}, \ldots, B_{i_n} \cap \{U_{i_1}, \ldots, U_{i_n}\}$ are disjoint and that $\{U_{i_1}, \ldots, U_{i_n}\} = \bigcup_{r=1}^{n}\,(B_{i_r} \cap \{U_{i_1}, \ldots, U_{i_n}\})$.

5. The proof given for Theorem 2 relies on the following observation: if $A_1, \ldots, A_r$ are r disjoint events such that $\sum_{j=1}^{r} P(A_j) = 1$, and if U and V are random variables defined by

$$U = \sum_{j=1}^{r} u_j I_{A_j} \quad\text{and}\quad V = \sum_{j=1}^{r} v_j I_{A_j},$$
where $u_1, \ldots, u_r, v_1, \ldots, v_r$ are constants, and if $v_j \ne 0$ for $1 \le j \le r$, then

$$\frac{U}{V} = \sum_{j=1}^{r} \frac{u_j}{v_j}\,I_{A_j}.$$
Prove that this observation is true.
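For concreteness, the p.p.a.s. estimates of Theorems 1 and 2 can be evaluated on the data of Exercise 3 above. The following Python sketch (not part of the text) assumes the sampled units are $U_5, U_3, U_6, U_1$ as stated there, and it uses the formulas as reconstructed in this section.

```python
from math import comb

# Population from Exercise 3; the observed y-values are keyed by unit index (0-based).
X = [20, 23, 17, 14, 31, 26]          # predictor values X_1, ..., X_N (all known)
Y_obs = {4: 60, 2: 33, 5: 40, 0: 39}  # y-values of the sampled units U_5, U_3, U_6, U_1
N, n = len(X), len(Y_obs)

x_sample = [X[i] for i in Y_obs]
y_sample = list(Y_obs.values())
X_total = sum(X)

# Ratio estimate of Y, unbiased under p.p.a.s. sampling (Theorem 1).
Y_hat = (sum(y_sample) / sum(x_sample)) * X_total

# p*(N, n): probability of the observed (unordered) sample.
p_star = sum(x_sample) / (X_total * comb(N - 1, n - 1))

# z as in Theorem 2, and the unbiased estimate of Var(Y_hat).
z = (sum(y**2 for y in y_sample) / comb(N - 1, n - 1)
     + 2 * sum(y_sample[u] * y_sample[v]
               for u in range(n) for v in range(u + 1, n)) / comb(N - 2, n - 2))
var_hat = Y_hat**2 - z / p_star

print(Y_hat, var_hat)
```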
8.4 Difference Estimation
We continue our investigation of estimation of Y when x and y are connected linearly, i.e., when the equation y = a + bx is almost true. In this section we shall deal with the special case where b = 1, i.e., when the difference y - x is almost constant, or $Y_i - X_i$ is almost constant for $1 \le i \le N$. If y - x is almost constant, its variance will be small. Hence the variance of an unbiased estimate of the difference Y - X ought to be small; adding this estimate to X, which we already know, would give us an unbiased estimate of Y with a relatively small variance.

THEOREM 1. In WOR simple random sampling on a population (U; x, y) with predictor, the observable random variable
$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n}(y_i - x_i) + X$$

is an unbiased estimate of Y, and

$$Var(\hat{Y}) = N^2\,\frac{1}{n}\left(1 - \frac{n}{N}\right)(S_x^2 + S_y^2 - 2\rho S_x S_y),$$

where $\rho$ is the correlation coefficient of x and y.

Proof: Since $\hat{Y} = N(\bar{y} - \bar{x}) + X$, and since $E(\bar{y}) = \bar{Y}$ and $E(\bar{x}) = \bar{X}$, we obtain $E(\hat{Y}) = N\bar{Y} - N\bar{X} + X = Y - X + X = Y$. Now

$$Var(\hat{Y}) = N^2\,Var(\bar{y} - \bar{x}) = N^2\,(Var(\bar{y}) + Var(\bar{x}) - 2Cov(\bar{x}, \bar{y})).$$

By Theorem 4 in Section 6.2,

$$Var(\bar{y}) = \frac{1}{n}\left(1 - \frac{n}{N}\right)S_y^2 \quad\text{and}\quad Var(\bar{x}) = \frac{1}{n}\left(1 - \frac{n}{N}\right)S_x^2,$$

where $S_x^2$ and $S_y^2$ are as defined in Chapter 6 for the x-values and the y-values, respectively. Letting $\rho$ denote the correlation coefficient of x and y, we have

$$Cov(\bar{x}, \bar{y}) = \rho\sqrt{Var(\bar{x})\,Var(\bar{y})} = \frac{1}{n}\left(1 - \frac{n}{N}\right)S_x S_y\rho.$$

Thus

$$Var(\hat{Y}) = N^2\,\frac{1}{n}\left(1 - \frac{n}{N}\right)(S_x^2 + S_y^2 - 2\rho S_x S_y). \quad\text{Q.E.D.}$$
THEOREM 2. An unbiased estimate of $Var(\hat{Y})$, where $\hat{Y}$ is the difference estimate in Theorem 1, is

$$\widehat{Var}(\hat{Y}) = N^2\left\{\frac{1}{n}\left(1 - \frac{n}{N}\right)(s_x^2 + s_y^2) - 2(\bar{x}\bar{y} - \bar{X}\bar{y})\right\}.$$

Proof: Since the sampling is WOR, $E(s_x^2) = S_x^2$ and $E(s_y^2) = S_y^2$. Also

$$E(\bar{x}\bar{y} - \bar{X}\bar{y}) = E(\bar{x}\bar{y}) - \bar{X}E(\bar{y}) = E(\bar{x}\bar{y}) - \bar{X}\bar{Y} = Cov(\bar{x}, \bar{y}).$$

Thus $E(\widehat{Var}(\hat{Y})) = Var(\hat{Y})$. Q.E.D.
EXERCISES

1. In Theorem 1, assume that the sampling is done WR; show that the difference estimate given there is still unbiased, and derive the formula for the variance of $\hat{Y}$.

2. Consider the following population:

U:  U_1  U_2  U_3  U_4  U_5  U_6
x:   81  110   50   92  120   65
y:   84  111   52   96  122   68

A sample of size 3 is taken WOR, and the units selected are $U_6, U_1, U_4$.

(i) Compute the value of $N\bar{y}$.
(ii) Compute the value of the difference estimate,

$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n}(y_i - x_i) + X.$$

(iii) Compute $\widehat{Var}(\hat{Y})$.
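A short numerical sketch (not part of the text) of the difference estimate and its variance estimate, using the population of Exercise 2 above; the sampled units are taken to be $U_6, U_1, U_4$, and the formulas follow Theorems 1 and 2 as reconstructed in this section.

```python
from statistics import mean

# Population from Exercise 2 above; sampled indices (0-based) correspond to U_6, U_1, U_4.
X = [81, 110, 50, 92, 120, 65]
Y = [84, 111, 52, 96, 122, 68]
N = len(X)
sample = [5, 0, 3]
n = len(sample)

xs = [X[i] for i in sample]
ys = [Y[i] for i in sample]

# Difference estimate of Y (Theorem 1): N * mean(y - x) + X_total.
Y_hat = N * mean(y - x for x, y in zip(xs, ys)) + sum(X)

# Unbiased variance estimate (Theorem 2), with s^2 the usual divisor-(n-1) sample variances.
sx2 = sum((x - mean(xs))**2 for x in xs) / (n - 1)
sy2 = sum((y - mean(ys))**2 for y in ys) / (n - 1)
Xbar = sum(X) / N
var_hat = N**2 * ((1/n) * (1 - n/N) * (sx2 + sy2)
                  - 2 * (mean(xs) * mean(ys) - Xbar * mean(ys)))

print(Y_hat, var_hat)
```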
8.5 Which Estimate? An Advanced Topic
This section is for advanced readers only, namely, those who are acquainted with the multivariate normal distribution and the theory of linear models. Others may proceed directly to the next chapter. Let us consider a very large population with predictor, (U; x, y), for which we know as usual the values of $N, X_1, \ldots, X_N$ and hence of X and $\bar{X}$. We shall assume that a simple random sample of n units has been taken with replacement, yielding observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Our problem as usual is to use this sample of n pairs to estimate the value of Y. We might have good reason in some cases to suspect that x and y are almost related by a linear relationship y = a + bx for some unknown constants a and b. More precisely, this relationship could be of the form
y = a + bx + z,
where z is an "error" random variable that is independent of x. If this were the case, and if we could use the n observations to conclude that a = 0, then by Theorem 2 in Section 8.2 the ratio estimate $\hat{Y} = (\bar{y}/\bar{x})X$ would be an unbiased estimate of Y and could provide us with greater precision than $\hat{Y} = N\bar{y}$. Also, if this is the case, and if we could use the n observations to determine that b = 1, then we could use the difference estimate $\hat{Y} = N(\bar{y} - \bar{x}) + X$ as an unbiased estimate with greater precision than the usual $\hat{Y} = N\bar{y}$. Thus there are two tests of hypothesis that we would wish to make. We are especially able to make these two tests of hypothesis if it is not too unreasonable to assume that $(x_1, y_1), \ldots, (x_n, y_n)$ is a sample from a bivariate normal population with unknown mean vector and unknown covariance matrix $\Sigma$. (Of course, this is never true, but it could be close, in which case the following is quite reasonable for practical purposes.) From this bivariate normality assumption it is known that $y_i = a + bx_i + z_i$, $1 \le i \le n$, where a and b are constants, $x_i$ and $z_i$ are independent, and $z_1, \ldots, z_n$ are independent and identically distributed with common distribution function being normal with mean 0 and unknown variance $\sigma^2 > 0$. In order to determine if either linear model outlined above is true, we shall perform two tests of hypothesis: the first is to test the null hypothesis $H_0 : a = 0$ vs. the alternative $a \ne 0$, and the second is to test the null hypothesis $H_0 : b = 1$ vs. the alternative $b \ne 1$. In what follows below we shall use this notation:

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} a \\ b \end{pmatrix}, \qquad z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}.$$
Thus the n equations written above may be written in matrix form as $y = X\beta + z$. In order to test the null hypothesis $H_0 : a = 0$ vs. the alternative $a \ne 0$, let us denote

$$X_0 = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \quad\text{and}\quad \gamma = (b).$$
If the null hypothesis $H_0 : a = 0$ is true, then the linear model becomes

$$y = X_0\gamma + z.$$

The least squares estimate $\hat{\beta}$ of $\beta$ under the general linear model and $\hat{\gamma}$ of $\gamma$ under the null hypothesis are given by $\hat{\beta} = (X'X)^{-1}X'y$ and $\hat{\gamma} = (X_0'X_0)^{-1}X_0'y$, respectively. The F-statistic is

$$T = \frac{\|y - X_0\hat{\gamma}\|^2 - \|y - X\hat{\beta}\|^2}{\|y - X\hat{\beta}\|^2/(n - 2)},$$

and under the null hypothesis the distribution of T is the F-distribution with (1, n - 2) degrees of freedom. The test is to reject $H_0 : a = 0$ if T is too large. Note: if we accept $H_0 : a = 0$, and if our assumptions hold, then by Theorem 2 in Section 8.2 the ratio estimate of Y will have very little bias.

The test for the null hypothesis $H_0 : b = 1$ has an extra ingredient. Under the assumption that (x, y) is bivariate normal, it follows that (y - x, x) is also bivariate normal. Hence one is able to write $y_i - x_i = c + dx_i + w_i$, $1 \le i \le n$, where $w_1, \ldots, w_n$ are independent and identically distributed with common distribution being normal with mean zero and unknown variance $\tau^2 > 0$, and where $x_i$ and $w_i$ are independent for $1 \le i \le n$. In order to test the null hypothesis that b = 1 against the alternative that $b \ne 1$ we may do the same thing by testing the null hypothesis $H_0 : d = 0$ against the alternative $d \ne 0$. Let us denote

$$w = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} \quad\text{and}\quad \delta = \begin{pmatrix} c \\ d \end{pmatrix}.$$

Now our general linear model is written as

$$y - x = X\delta + w,$$
and when $H_0 : d = 0$ is true, we have

$$y - x = 1_n\theta + w,$$

where $1_n$ denotes the column of n ones and $\theta = c$. The least squares estimates $\hat{\delta}$ of $\delta$ and $\hat{\theta}$ of $\theta$ are given by

$$\hat{\delta} = (X'X)^{-1}X'(y - x) \quad\text{and}\quad \hat{\theta} = (1_n'1_n)^{-1}1_n'(y - x) = \bar{y} - \bar{x}.$$

The F-statistic is

$$T = \frac{\|y - x - 1_n\hat{\theta}\|^2 - \|y - x - X\hat{\delta}\|^2}{\|y - x - X\hat{\delta}\|^2/(n - 2)},$$

and, again, under the null hypothesis $H_0 : d = 0$, T has the F-distribution with (1, n - 2) degrees of freedom. The test is to reject $H_0 : d = 0$ if T is too large. If both hypotheses are rejected decisively, it would undoubtedly be best to use $\hat{Y} = N\bar{y}$ as an estimate of Y.
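The two F-tests described above are easy to carry out numerically. The following sketch (not from the text) assumes NumPy and SciPy are available and uses made-up observations; it illustrates the general-versus-reduced-model F-statistic with (1, n - 2) degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (x_i, y_i); values are for illustration only.
x = np.array([3.1, 13.2, 24.1, 41.6, 62.1])
y = np.array([3.5, 14.8, 27.5, 46.2, 71.4])
n = len(x)

def rss(design, response):
    """Residual sum of squares after a least squares fit of response on design."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ beta
    return resid @ resid

X_full = np.column_stack([np.ones(n), x])   # columns (1, x_i): the general model
X_null_a = x.reshape(-1, 1)                 # no intercept: the model under H0: a = 0

# Test H0: a = 0.
T_a = (rss(X_null_a, y) - rss(X_full, y)) / (rss(X_full, y) / (n - 2))
p_a = 1 - stats.f.cdf(T_a, 1, n - 2)

# Test H0: d = 0 in y - x = c + d x + w, i.e., b = 1 in the original model.
d_resp = y - x
X_null_d = np.ones((n, 1))                  # intercept only: the model under H0: d = 0
T_d = (rss(X_null_d, d_resp) - rss(X_full, d_resp)) / (rss(X_full, d_resp) / (n - 2))
p_d = 1 - stats.f.cdf(T_d, 1, n - 2)

print(T_a, p_a, T_d, p_d)
```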
Chapter 9 Stratified Sampling

9.1 The Model and Basic Estimates
In Chapter 7 we discovered that it is possible to decrease substantially the variance of an unbiased estimate of Y by using predictor data x whenever such are available. Without the predictor data it is possible to obtain an unbiased estimate of Y with a much smaller variance than that of the usual unbiased estimate, $N\bar{y}$, by means of stratification of the population U. Let us consider the following extreme example to introduce this method. Suppose one has a population,

U:  U_1 ... U_50    U_51 ... U_150    U_151 ... U_300
y:   u  ...  u       v   ...  v        w    ...  w

of 300 units. Suppose it is known that units numbered from 1 through 50 have the same (unknown) y-value, u, that units numbered from 51 through 150 have the same (unknown) y-value, v, and that units numbered 151 through 300 have the same (unknown) y-value, w. If you are allowed to take a sample of size three on U in order to estimate Y, you would be foolish to take a simple random sample of size three, either WR or WOR, especially if there were vast differences suspected between the pairs of values among u, v, w. The smartest thing to do would be to select one unit from among $U_1, \ldots, U_{50}$, one from among $U_{51}, \ldots, U_{150}$ and one from among $U_{151}, \ldots, U_{300}$. Doing this, you would have observed the values of u, v, w, and you would get the exact value of Y by calculating Y = 50u + 100v + 150w.
CHAPTER
9. STRATIFIED
SAMPLING
Although t h e above example appears to be extreme, traces of it occur in problems met by the sample survey statistician. One wishes to estimate Y by means of an estimate Y t h a t is as close t o Y as possible. Sometimes t h e population U can be represented as a disjoint union of subsets where t h e t/-values for t h e units of each subset are almost equal. In such a situation, it seems sensible to take a few observations on each subset, compute the sample mean for t h e observations on each subset, multiply these sample means by the corresponding sizes of these subsets and then add these products to obtain an estimate of Y. Decomposition of U into such a union of subsets is known as stratification. Each subset of nearly uniform units is called a stratum. Let (U; y) be a population of units with t h e associated y-values. We shall assume t h a t U is t h e disjoint union of L disjoint subsets U\, • • •, WL, each subset being a s t r a t u m , i.e., U = U^=1Z<4. We assume t h a t t h e units (and hence number of units) in each s t r a t u m are known. Our original population, recall, is U:
UX-'-UN
y:
Y^-YN,
where y(Ui) = Y{ for 1 < t < N. We shall let Nh denote t h e n u m b e r of units in t h e hth s t r a t u m , so t h a t N = ?2h=i Nh. T h e hth s t r a t u m a n d corresponding j/-values are represented now by Kh : Uhi'" y\uh :
UhNh
Yhi--YhNh,
where y\uh{Uhi) = Yhi. The sum of the j/-values for t h e hth s t r a t u m , Uh, is denoted by Nh t=l
from which it necessarily follows t h a t Y = Y$U\ Yh- T l i e s i z e o f t i l e sample observed on the /ith s t r a t u m will be denoted b y nh. In other words, just as a sample of size n is taken WR or WOR on U, as in Chap ter 6, so now, under stratification, a sample of size nh is taken (WR or WOR) on £ 4 . T h e observations on any one s t r a t u m are assumed
9.1. THE MODEL AND BASIC
ESTIMATES
157
(because they are so taken) to be independent of the sets of observa tions taken on all the other strata. In other words, letting y/a, • • •, yhnh denote the n^ observations on £4, then the L sets of random vari ables (yii, • • •, yim)> • • • ? (yLi? • • • ? 2/LnL) are L independent sets of ran dom variables, i.e., are L independent random vectors. (Note: In the terminology of Chapter 2, the two random vectors ({/, V) and (X, Y, Z) are independent random vectors if the joint density of all five random variables factors as follows: /I/,V,X,Y,ZK
v, x, y, z) = /tf,v(u, V)/X,Y,Z(S, y, *).
Note that independence of these random vectors does not mean that the random variables within any random vector are independent.) If the sampling on each stratum is done WR, then, for fixed fe, y^i, • • •, yhnh are independent and identically distributed random variables; if the sampling is done WOR, they are not independent. We yet need further notation. Denote n
L
= Ysn^ h=i
y'h
=
Y =
—^2VHU
J2Nhyl h=i Nh
1
1 Nh
l
t=i
and s
1
lh =
^/»
1 , , 2 -—rEfe-i'O — r £ ( y w - y'h] Tl}i
1
t=i
The meanings of these are clear. The total size of the sample is n. The sample mean for the hth. stratum is y'h. The sum of products referred to previously is just what Y is. The average of the y-values of all units
158
CHAPTER 9. STRATIFIED
SAMPLING
in Uh is Y{. yA'. Specializing S% S2y ofofChapter Chapter66totoUh Uhgives givesususS* Sft2,yhand , ands2s\ yh hisis the sample variance for the sample from Uh. Definition. We shall say that the sampling on a stratified population is obtained WR if the sampling on each stratum is obtained WR , i.e., if, for 1 < h < L, the nh observations on (Uh,y\Uh) are obtained WR. The same definition holds for sampling WOR on a stratified population. T H E O R E M 1. In stratified sampling, either WR or WOR, Y is an unbiased estimate of Y. Proof: Applying Theorem 1 of Section 6.2 to {Uh,y\uh) we obtain E{Nhy'h) = Y^. Y{. Thus,
E{Y) EiY)
, ( B(±M) £ « )
= ^E(N = ^E(N hy'hh) hy' h=i L
= =
s
f
= yy rV yY = v^.JZi.U.
T H E O R E M 2. In /» stratified tferftf erf sampling WR the variance of the unbiased estimate YofYis
Var{Y) = £ and an Mnt^ea unbiased7 estimate esh'mate ofVar(Y) o/Var(K)
iVft(iVft
~ 1 ) ^,
is
« r (ry)) = £ — — *lhajfc. i^ M = L, nh /l=l h=i
n h
Proof: Applying Theorem 3 of Section 6.2 to (U (%,2/k) h,y\uh) we obtain Nh{Nh Var(N )= M ^~ z1] S* JS*lyhyh.^. . Var(Nhy'hh)y'h=
9.1.
THE MODEL AND BASIC
ESTIMATES
159
Because of the independence in sampling on the different strata, iVij/J, • • •, NLV'L a r e independent random variables. Thus Var(Y)
=
Var L
(IH
h=l
nh
h=l
If we apply Corollary 1 to Theorem 1 in Section 6.3 to (Uh^y\un) we obtain E(NZs2yh/nh) = Var(Nhy'h). Thus E(vZr{Y))
J2E(^slh)=J:Var(Nhyfh)
= h=i
=
\
n h
/
Var(Y,Nhy'h)
h=i
=Var{Y). Q.E.D.
T H E O R E M 3. In stratified sampling WOR, the variance of the un biased estimate YofY is
and an unbiased estimate of Var(Y)
is
Proof: Applying Theorem 4 in Section 6.2 to (Uh^y\uh) we obtain
CHAPTER 9. STRATIFIED
160
SAMPLING
and, because of the independence of sampling between strata, we have Var(Y)
=
Varfc^Nhy',)
=
^Var(Nhy'h) h=i L
^
N2
nh \
Nh,
Next, applying Theorem 2 in Section 6.3 we obtain
E^Y)) - g f ( l - t ) ^ ) N?
' ££0-£)*-""<*>• Q.E.D. It might be possible to stratify a population (U; x, y) based on the values of a predictor x . In this case it is possible to do probability proportional to size sampling within each stratum. This is not terribly exciting now, since we would essentially by re-doing Chapter 7 word for word for the stratified model. EXERCISES 1. Consider the following stratified population: Ui: UiX U12 U13 U14 U15 U16 U„ y\Ul : 12 15 13 10 12 13 14 Ui :
U21
y\u2 : 35
U22
U23
U24
U25
U26
U27
38
36
39
34
37
34
(i) Compute I ? , ! ? , Y ^
U18 16
and S*,.
(ii) If a sample of size three is taken WOR from each stratum, if f/i3, Un and U\\ are the units selected from ZYi, and if £^23 > U22 and C/27 are the units selected from % , evaluate Y and Var(Y).
9.2. ALLOCATION
OF SAMPLE SIZES TO STATA
161
2. In Problem 1, suppose the population was not stratified, suppose a random sample of size six WOR was taken, and suppose the sample consisted of the same units as in l(ii). Compute Y and
Var(Y). 3. Compute Var(Y) for both problems l(ii) and 2, and explain the difference in values.
9.2
Allocation of Sample Sizes to Stata
So far in this text we have never mentioned how one selects n. Some times the value of n depends on how small one wishes the value of Var(Y) to be. This is hard to do unless one has taken a preliminary sample and has made an estimate of the various terms that appear in the formula for the variance of Y. Quite often the sample size is determined by simple cost considerations. Suppose the initial cost of performing a survey is CQ and the cost of each observation is c\. Then the total cost of performing the survey is C = CQ + nc\. If the total cost is decided in advance, then one solves for n in the above equation to decide on the sample size. In the case of a stratified population, the cost of an observation can be different for different strata. Thus, if the cost C of the entire survey is given in advance, if the initial cost of performing the survey is Co, and if the cost of each observation in the /ith stratum is Ch (where CQ and Ci, • • •, CL are given in advance), the problem becomes one of deciding values of ni, • • •, n^, and therefore of n = J2h=ira&,which satisfy the equation L
C = °o + Y^c^nhh=i
The particular problem we consider here is that of determining, among all sets of values ni, • • •, ni which satisfy the above equation, a set which minimizes the variance of the unbiased estimate Y of Y. The following lemma was proved in a random variable form in the Section 3.3. It is given here with proof for the sake of completeness and in order to emphasize its importance.
162
CHAPTER 9. STRATIFIED
SAMPLING
L E M M A 1. (Cauchy-Schwarz Inequality). 7/a x , • • •• aTr bu , • •, br are real numbers, then
rt !<s (s ) -(S"0(p-)(s°4 (!>•)(!>•)• with equality holding if and only if there exist numbers u, v, not both zeros, such that ua- + vb{ = 0for1 0 has a double real root if and only if b2 - ac = 0, and it has no real roots if and only if b2 - ac < 0. If a,t = 6< = 0,1 < i < r, there is nothing to prove. If this is not the case, then we may assume that, say, not all at-'s are zeros. In this case, £J = 1 aj > 0- F°r all eall ?, ^t is *rue that E<=i(*at + fc)a > 0, or,
(E«, ? )* ! +2 ( i > * ) <+x>? > o. Since, as a function of t, this parabola is non-negative, the quadratic has either no real roots or a double real root. In either case,by the above discussion on quadratic equations, it follows that
M'-isM')
Equality holds if and only if there is a double root, t0; in this case E(*oa, + 6,)2 = 0. But this last equation holds if and only if each summand is zero, i.e., toai + bi = 0,1 Var(Y) is minimized for preassigned C, and C is
9.2. ALLOCATION
OF SAMPLE SIZES TO STATA
minimized for preassigned value ofVar(Y), positive constant K such that
163
if and only if there is some
nh = KNhSyh/^,
\
Proof: By Theorem 3 in Section 1
from which we obtain
h=l
h=l
Uh
Letting Q = (var(Y) we have
+ £ NhS^
(C - «>),
«-(sM(s*)'
If C is a preassigned constant, then Var(Y) is minimized by minimizing the right hand side of the above equation. (Note: the term J2h=i NhS%h in the definition of Q does not depend on ni, • • •, n^.) The right side of the above equation is just the right side of the following application of the Cauchy-Schwarz inequality (Lemma 1):
Since the left term in this inequality does not really depend on the rifc's (the y/rihS cancel), the right side is minimized over rai,---,ra£, when the inequality becomes an equality. By the Cauchy-Schwarz in equality this occurs if and only if there exists a constant K such that y/chnh = KNhSyh/y/rih for 1 < /i < L, or nh = KNhSyh/y/chThe same condition is seen to be necessary and sufficient for minimizing C when Var(Y) is a given constant. Q.E.D.
CHAPTER 9. STRATIFIED
164
SAMPLING
This theorem does not solve our practical problem for the simple reason that in contemplating the sample survey, we do not know the values of K and n. This result however is one step in the right direction. Before we proceed further we state the theorem in the case of WR stratified sampling. T H E O R E M 2. In WR stratified sampling with linear cost function C = Co + ^2h=iChTih, Var(Y) is minimized for preassigned C, and C is minimized for a preassigned value ofVar(Y), if and only if there is some constant K such that nh = K-*
—
Syh, 1 < h < L.
y/Ch
The proof of this theorem depends on Theorem 2 of Section 1 in just the same way that the proof of Theorem 1 depended on Theorem 3 of Section 1, and the steps of the proof are similar. It is left as an exercise. T H E O R E M 3. (Neyman-Tschuprow Allocation Theorem). In WOR stratified sampling with linear cost function C = Co + Z)^=i ch^h and with C given, the values of n, n 1? • • • ,n^ which minimize Var(Y) are
and n
h
NhSyhly/c^ = « ^Z^m=l z ~ ™m&ym/
y/cm
Proof: By Theorem 1, nh = KNhSyh/y/ch, 1
from which we obtain L
L
n = Y^nh = K^2 NhSyhlyfe.
9.2. ALLOCATION
OF SAMPLE SIZES TO STATA
165
Solving for K, we find that K = Efc=l NhSyh/y/Ch Thus n
^ = ^z,
Since C — Co = 2^=1 cfcn/n
we
have
^ ~ V^
N S
I /El'
from which we solve for n to obtain E£=l
__(C~CQ)
n =
NrnSyn/y/^
?2h=lNhSyhy/Ch Q.E.D.
Theorem 3 gives a solution to the allocation problem except for one thing: we do not know the values of S ^ , 1 < h < L. However, if we have a predictor variable x of which y is a linear function, then we shall be able to solve for rth and n in the allocation problem. Consider now a stratified population with a predictor variable x. We denote the number Xhi as the value that x assigns to the unit Uhi in the hih stratum. As in Chapter 7, all numbers {Xhi,\
are known. Analogous to Y, we denote L
Nh
* = ££*« h=i t = i
and l
Nh
CHAPTER 9. STRATIFIED
166
SAMPLING
where 1 Nhh
T H E O R E M 4. In WOR stratified sampling with linear cost function C = co + J2h=i °hnh and with C given, if a predictor x is available and if y = ax + b for some constants a and b, then the values of n, ni, • • •, n^ which minimize Var(Y) are _ n =
(C-c0)J%slNhSxh/y/c£ J2h=l NhSxhy/Ch
and NhSxh/y/c^
Proof: From Theorem 3 we see that Theorem 4 will easily follow if we can show that Syh = aSxh for all h. Now y = ax + b simply means that Yhi = aXhi + b for all h and i. Thus c2
i
Nh
l
t=i Nh
_
Nh
— -
UaXhi-*Xff
n2^2
i.e., Syh = aSxhQ.E.D. Theorem 4 gives us the possibility of practical application. If we have previous information in the form of x and if y is anywhere near to being a linear function of x, then Theorem 4 shows us how to allocate sample sizes to the strata in order to keep within a cost restriction and minimize the variance of Y.
9.2. ALLOCATION
OF SAMPLE SIZES TO STATA
167
EXERCISES 1. Prove Theorem 2. 2. In Theorem 4, if WOR is replaced by WR, correct the conclusions of the theorem, and prove it. 3. Consider the following stratified population with predictor: Wi: Uu U12 U13 Ul4 Uu U16 U17 x: 14 18 20 15 12 15 18 y: 21 26 32 20 20 22 29 TAi :
C/21
V22
U23
U24
U25
U26
x: y:
41 58
50 76
45 64
40 63
44 69
51 72
IA3 '• U31
U32
U33
U34
U35
U36
U37
x: y:
59 91
64 92
60 93
58 90
65 95
60 70 89 100
61 88
U38
Each observation in U\ costs $10, each in U2 costs $14 and each in U3 costs $18. Initial cost of the survey is $20, and the cost of the survey may not exceed $200. find the sample sizes for the strata that minimize the variance of Y for WOR.
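The allocation of Theorem 3 (and its predictor-based variant, Theorem 4) is easy to compute once the stratum quantities are known or estimated. The following Python sketch (not from the text) uses the allocation formulas as reconstructed in this section, with made-up stratum sizes, standard deviations and costs; in practice the resulting $n_h$ must still be rounded to integers.

```python
from math import sqrt

def neyman_tschuprow(N, S, c, C_total, c0):
    """Sample sizes minimizing Var(Y_hat) in WOR stratified sampling under the
    linear cost C = c0 + sum(c_h * n_h): n_h is proportional to N_h * S_h / sqrt(c_h)."""
    w = [Nh * Sh / sqrt(ch) for Nh, Sh, ch in zip(N, S, c)]
    n = (C_total - c0) * sum(w) / sum(Nh * Sh * sqrt(ch) for Nh, Sh, ch in zip(N, S, c))
    return n, [n * wh / sum(w) for wh in w]

# Hypothetical strata: sizes, y-standard deviations (or a*S_xh when y = a x + b), and costs.
N = [8, 6, 8]
S = [1.9, 6.6, 3.5]
c = [10, 14, 18]
print(neyman_tschuprow(N, S, c, C_total=200, c0=20))
```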
Chapter 10 Cluster Sampling

10.1 Unbiased Estimate of the Mean
In cluster sampling the population U is represented as a disjoint union of subsets, each subset being called a cluster. We shall denote the clusters by $C_1, \ldots, C_N$, and thus

$$U = \bigcup_{i=1}^{N} C_i.$$

As with the whole population it is assumed that one knows which units are in each cluster, and thus knows the number of units in each cluster. The sampling procedure consists of selecting n clusters by simple random sampling WOR and then taking a WOR simple random sample from each cluster selected. With these data, the problem consists of obtaining an unbiased estimate $\hat{Y}$ of Y, then to obtain $Var(\hat{Y})$, and, last, to find an unbiased estimate, $\widehat{Var}(\hat{Y})$, of $Var(\hat{Y})$. These three tasks are accomplished in this chapter. We shall not dwell too long here on the cases when cluster sampling is needed. Suffice it to say that there are cases when, once an observation is made on a unit in a population, it is cheaper, easier and/or quicker to take a number of observations on the cluster in which it is located than to take even one other observation on the population at large. An example of this is when there are great distances between clusters but negligible distances between any two units within a cluster.
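A minimal Python sketch (not from the text) of the sampling procedure just described: select n clusters WOR, then take a WOR subsample within each selected cluster. The cluster contents and subsample sizes are invented, and the final line anticipates the unbiased estimate $\hat{Y} = (N/n)\sum_j M_{\nu_j}\bar{y}_{\nu_j\cdot}$ developed later in this chapter.

```python
import random

# Illustrative clusters (the y-values would be observed, not known in advance),
# and how many units to take from each cluster if it is selected.
clusters = {1: [20, 31, 42, 39], 2: [31, 16, 20, 20, 31], 3: [21, 45, 31, 18]}
m = {1: 2, 2: 3, 3: 2}
n = 2                                       # number of clusters to select

chosen = random.sample(sorted(clusters), n)                           # WOR sample of clusters
observations = {i: random.sample(clusters[i], m[i]) for i in chosen}  # WOR within each cluster

# Unbiased estimate of Y (Theorem 2 of Section 10.1): (N/n) * sum of M_i * cluster sample mean.
N = len(clusters)
Y_hat = (N / n) * sum(len(clusters[i]) * (sum(observations[i]) / m[i]) for i in chosen)
print(chosen, observations, Y_hat)
```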
It is important to observe the difference between cluster sampling and stratified sampling. In cluster sampling, the units in each cluster need not have values that are relatively close to each other. The cluster is there because of the easy accessibility of observations once the cluster or a unit in the cluster has been selected. In stratified sampling, observations are taken on each stratum; this is not true in cluster sampling. One could say that stratified sampling is a special case of cluster sampling in the case when the sample of clusters in cluster sampling consists of all clusters.

Let us establish once and for all the notation to be used in this chapter. We shall denote the number of units in cluster $C_i$ by $M_i$. Thus M, defined by

$$M = \sum_{i=1}^{N} M_i,$$
is the total number of units in U (for this chapter only). The units in the ith cluster Ct- will be denoted by t/,j,l < j < Mt-, and we shall denote y(Uij) — Y^. The following notation will be used: Mi
is the sum of the y-values for the iih cluster, 1
Mi
lv±i
i=l
is the average of the y-values of the ith cluster, 1 N
N t=l
is the average total y-value per cluster, and N
^
=
Z^**'
N Mu
j
_
Y =
N
t=l
2^i=l
2LfZ^, *™» lvl
i
and !
S? =
2^t=iiVi*
N i
Mu
u=i v = i
u=l
v=l
10.1.
UNBIASED ESTIMATE
OF THE MEAN
171
are as in previous chapters. A sample of size n clusters will be selected WOR, and we shall refer to the numbers of the clusters so selected by ^1? • * * j ^71 > where 1 < V\ < • • • < vn < N. These are of course random variables with joint density given by *r
n
L
J 1/ ( ? ) ^ 0
if 1 < *i < • • • < ** < N otherwise.
Given that the cluster Ct is in our sample, then a sample of size ratis taken WOR of units in C,-; these observations (random variables) are assumed to be independent of the observations taken on any other clusters and, indeed, will be independent of the other clusters selected. This can be expressed mathematically as follows. If 1 < v\ < • • • < vn < N are the clusters selected, and if j/fc-i, • • •, j/jbimfc. denote the m ^ observations taken on Ck{, then P
( P [W.'i = '•*> " ' > Vknui = '*»*.■] I f l ^
=
^ )
= II P ( bfkil = **> • * * > y*.-m*. = *imfcJI f ]fa= *il ) Let {y t i, • • •, j/tmj} denote a sample of size m,- on C,\ We shall further denote 1 J/f. = —2/»., ratand
n
1 Let us finally denote M
1 iKI
J
>
x
g=l
Note that, for 1 < i < n, 1 < kx < • • • < kn < N,
v-*,in[„-y)-^(i-^)5i,.
CHAPTER 10. CLUSTER
172
SAMPLING
In addition, because of the independence assumption above, yVl.. • ■ •, y„n. are independent relative to the conditional probability
P i I Qh- = *;] J , and thus we have, e.g.,
Var (±MV]yv,\J>,- = *,•]) = E ^ ^ " (l ~ J%) #*,This last fact will be made use of in Section 2. It should be noted that in case rat = M t for 1 < i < TV, i.e., we take a complete census of each cluster in the sample, then we are back in the case of WOR simple random sampling that was discussed in Chapter 6. Thus, we may state without proof the following theorem. T H E O R E M 1. In WOR cluster sampling, z/m t = M t , l < i < N, then an unbiased estimate of Y is
U
t=l
its variance is
and an unbiased estimate of Var(Y)
is
^-T(I-£)dr|?»»-*»>•• We now look at the general case. As indicated above, the numbers of the clusters obtained in our sample of n clusters are z/i, • • •, j / n , where 1 < v\ < V2 < - • • < vn 5* N. In the i/jih cluster selected which has Mu. units [note: the subscript is a random variable], a sample of size mUj is selected WOR, and thus Mv.yVj. should be an unbiased estimate
10A.
UNBIASED ESTIMATE
OF THE MEAN
173
of the sum of y-values of the Ujth cluster, ^ Z)j=i M^y^. should be an unbicised estimate of the average of the cluster totals of revalues, and, finally, ^ Ej=i MUiyVj. should be an unbicised estimate of Y. We now prove this last "should be" in the following theorem. T H E O R E M 2. In cluster sampling, an unbiased estimate ofY N
is
n
Proof: We shall apply results from Chapter 4. We observe that E(Y)
=
E(E{Y\vu-••,»»))
£
E(Y\v1 = k1,---,vn = kn)p[f)[vj
i
= kj])
V=i n
/ 1
= — 52 52 E{MViyVj\vx = *!,...,!/„ = hn)jjj n i
CO
n
n
N
= —
]C
\n)
^2E(Mkjykj.\i/1 = kir-,un
1
= kn)
CO'
l
Now by Lemma 1 of Section 8.3,
Q.E.D.
CHAPTER 10. CLUSTER SAMPLING
174
EXERCISES 1. Consider the following population consisting of six clusters: Ci: y ■
0n 20
u12
^13
014
015
016
017
018
31
42
39
25
28
28
34
u22
023
024
025
026
027
16
20
20
31
50
C2:
02i
y-
31
C3: y-
u31 u32
033
034
035
036
037
038
21
45
31
18
42
36
38
U39 51
C4: y-
43
22
C5:
051
y-
23
0 5 2 053 054 055 056 057 058 81 46 52 31 46 40 35
C6: y-
u61
Ufa 063
064
065
066
067
068
41
52
43
45
47
52
56
28
28
u41 u42
043
044
045
046
047
048
U49
£^4,10
49
33
36
41
52
16
18
20
35
The plan is to select three clusters WOR. If cluster C\ is in the sample, 4 units are to be selected from it WOR, if C2 is in the sample, 3 units are selected from it WOR, if Cs is included, 4 units are selected from it WOR, if C4 is included, 5 units are selected WOR, if C5 is included, 3 units are selected WOR, and this last holds for CQ. i) If the sample of clusters consists of C2,C4,Cs, if U24,U26 and U22 are the units selected in C2 if U47, U4S, f/42, U45 and U41 are the units selected in C4, and if t/55, f/52 and U57 are the units selected in C5, compute Y. ii) Compute Y{., and Y{., 1 < i < N, and Yet2. Prove: £(£„,..|i/i = ku - - ■, i/n = kn) = Yk... 3i Evaluate: E{yx + • + vn).
10.2. THE VARIANCE
10.2
175
The Variance
In this section we derive the formula for the variance of Y. This will involve M 1 > w = * M _ 1 2*/\Yjg ~~ Yj-J > 1V1
L
3
g=l
which was given in Section 1, and S$, which is defined by 1 2 N
^6 =
_ j XX^- - Yd) .
N
T H E O R E M 1. In cluster sampling, ifY N
is defined by
n
then
Proof: We shall prove this using a result from Chapter 4, namely by using Var(Y)
= E{Var{Y\vu
•••,!/„)) + Var(E{Y\uu
■•-, vn)).
Indeed, by the independence assumption and its consequence given in Section 1, E(Var(Y\vU'--,vn)) =
£ l
£ l<*l<».<*n<JV
Var (Y\ f) Wi = kj]) P ( f) fo = kj]) \
j=i
/
\i=i
/
Var (% ±MViyVi.\ f) fc = *,■]) yjL\"i=l
J=l
/
W
CHAPTER 10. CL USTER
176
SAMPLING
Now we use Lemma 1 of Section 8.3 to obtain
We next wish to evaluate Var(E(Y\vi, • • •, vn)). We first note that for fixed j , 1 < j < n, by what was proved in the proof of Theorem 2 of Section 1, E{MUjyVj\v1,-
• -,i/ n ) =
£
E [MViyyi.\
l
\ M
E
f][u{ = k]) / n ? = 1 h=^] i=l
/
* ^ (&;•! H ^ = *»']) 7n?=:1h=*,]
l
Hence Var(E(Y\vu-
•-,»„))
=
VarBl^M^S,,,!!/,,--,^)
Now Y^., • • •, l^ n . constitute a simple random sample WOR of size n on the total y-values on the clusters. Hence, by the results of Chapter 6,
Thus by the formula for Var(Y) stated at the beginning of this proof, we obtain our conclusion.
10.3. AN UNBIASED ESTIMATE
OF VAR(Y)
177 177
EXERCISES 1. In Problem 1 of Section 1, compute Var(Y) procedure outlined.
for the sampling
2. Prove: for nxeci 7« 1 ^^ 7 ^^ ft,
10.3
A n Unbiased Estimate of Var(Y)
Let us establish the notation i
rrti
T H E O R E M 1. An unbiased estimate ofVar(Y)
is given by
va^rix) =( I^-I^()I^- ^| :)(^M| :^( -M^ ^| :-M ^ |^: )M ^ ) 2 = ^I 2
t^)
Proof: We recall from Chapter 6 that, in the case of simple random sampling WOR, E{s2) = S%. Thuss ,n cluster rsmpllng, ,fo rixed jj
EU Vj\f][u ^(^ i n hi ==^k]i]\=Sl ) = ^ krTaking the expectation of the second term in the formula for Va~r(Y), we obtain
■es«±(-aw
178
CHAPTER 10. CLUSTER
!<*!<•••<*„
U
i=l
m
\
"i
\
m
»iJ
SAMPLING
j=l
)
\n)
~l
We now take the expectation of the first expression in the formula for V a r ( y ) , which we denote by F . In addition we may write N2 *J = £ ( l - i ) -L- (A-B), n n \ NJ n - r where 1=1
and B
= 1-(±M,y,,
Now, by properties of conditional expectation,
E{A) =
Y. l
E A
( \f) te = *.•]) P f (11* =k'}) ■ \
t=l
/
\t=l
/
Then we use the formula E(X2\H
= h) = (E{X\H = h)f + Var(X\H
= h)
to obtain
* (*XI Qfo - w) = Wl + Kik.i'-w)81 1 L 'mki \
f2 k
rnki Mki)
wk
>
179
10.3. 10.3. AN UNBIASED UNBIASED ESTIMATE ESTIMATE OF VAR(Y) OF VAR(Y) This and Lemma 1 of Section 8.3 imply E(A)
= —N
V
V
22 + ¥*■ ^ (l (1 _- ^?*L\ + L \ SSwk "\\
(Y2
{ n) i
M Mj
*J wk'J'J
To find E(B), we first observe
E{B) = E{B)=
^im
n \n) { nJ
52 ( (£^i = *;])• £ ^^f(E^^)
l
\\\t=l \i=l
//
As in the computation of E(A) , we obtain
ij=l =l
//
\i=i V=i
+
«=i
1-
/
5
s5( ^) -''
= i;>2.+Ey».-tt.. t= l
u^V
+r^L(i-^)s
2
Now, using Lemma 1 in Section 8.3, we obtain
♦tpsSo-s)*} N "-
AT £^ ^ iV
+
AT(AT - 1) £
y Yvy Y
-- --
+
m,- V ^ A £& m
M,J Mj
^^
CHAPTER 10. CLUSTER SAMPLING
180 Since
(
N
\
2
JN V
i=i
.•=1 " /
and since YCi = Y/N, we have
+ +
l^M?(i rm\Q2 N(n-l)-2 wi + + Yci Nti N£i ™i V ~ Mi) M ^ ""' N-l J V - 1 Yci ''
The first term on the right hand side becomes
+
-
n
N
-n-^
T^TYce (I _ r \ 5C2i +, ^Ll£y2 F
~ "In
ivJ
iV-l ^
Hence
^-a-^^^sfo-i)**-^
Thus
*«£**)) - J* (I -1) 1)--1^. ((^^-j^- ^ £) afc. Biv^n .«.(!_ j +Hgg£f (,(l_- ^) .
10.3. AN UNBIASED ESTIMATE
OF VAR(Y)
181
Let us work some more on the expression for E(A) which we obtained above. As it was in the case for E(B) we have 2 wi
Substituting these formulae for E( A) and E(B) in the above expression for E(Var(Y)), we see that the coefficient of YQI is 0, the coefficient of 5? is \n
Njn-1\
N
V
NJJ
which becomes n V
JV7'
and the coefficient of M
M?
becomes U
NJn-l
\N
NJ+
'
which in turn is N/n. Putting all this together we obtain E(Var(Y)) = Var(Y). Q.E.D. EXERCISES 1. Prove: If X and H are random variables, then E(X2\H
= h) = Var(X\H
= h) + (E{X\H
2. In Problem l(i) of Section 1, evaluate
Var(Y).
3. Prove: £ £ i Y<2 = E £ i ( ^ - Yet)2 + N?3t.
=
h)f.
Chapter 11 Two-Stage Sampling 11.1
Two-Stage Sampling
Two-stage or double sampling involves sampling the population ac cording to some scheme and then sampling this sample. In each such procedure we have a population (£/,x,i/) with a predictor. However, no complete knowledge of the predictor is available as in Chapter 7. In this chapter three such procedures are presented. For the procedure considered in this section we know neither the x-values nor the j/-values of the units before we start sampling. As usual, we wish to estimate y , and it is assumed that y and x are roughly linearly related, just as in Chapter 7. However, it might be very easy or cheap to determine the x-value of a unit, but expensive or difficult to determine its j/-value. Thus we take an initial simple random sample WOR of size n' and observe the x-values of the units selected: x^, x'2, • • •, x'n,. From the units in this first sample we select a sample of size n by WR probability proportional to size. We shall use the following notation: j/J, • • •, y'n, will denote the j/-values of the units selected in the first sample corresponding to the x-values x^, • • •, x'n,. Not all of these y-values are observable. We let x' denote the sum of the x-values of the units in the first sample; this value is observable. The random variables u[, • • •, u'n, will denote the subscripts of the units selected in the first sample, 1 < u[ < u'2 < • • • < u'n, < N. These n' random variables might or might not be observable; their values do 183
CHAPTER 11. TWO-STAGE
184
SAMPLING
not enter into the formulae for the estimates of Y and Var(Y) that we shall obtain. We shall let ui, • • •, t/n, where 1 < ux < • • • < un < iV, denote the subscripts of the units selected in the second sample from among the units selected in the first sample. As before, they might or might not be observable. We shall let #i, • • •, xn and t/i, • • •, yn denote the corresponding x— and y— values of these units. We pause to develop the intuitive idea behind the unbiased esti mate Y to be obtained for Y. Referring back to Chapter 7, it appears reasonable that, for 1 < i < n, then x'yi/xi should be an unbiased estimate of y1 = Y%=i Hit the unobservable sum of the y-values of the first sample. Hence — Y^L-i ^ should also be an unbiased estimate of yf. The first sample should be a representative sample of the entire population, and Ny'In' should be an unbiased estimate of Y. Hence Y defined by Y = ^7— £?-i ^ should an unbiased estimate of Y, and it is this that we shall prove in our first theorem. T H E O R E M 1. In the double sampling procedure described above, an unbiased estimate YofY is given by
Proof: By Theorem 1 of Section 7.2,
E
{^\<---^=fliy'Jiorl
Hence
and
E
(h£* )=yn
Hence E(Y) = NY = Y.
i = i
Q.E.D.
11.1.
TWO-STAGE
SAMPLING
185
T H E O R E M 2. In the double sampling procedure defined above, the variance of the unbiased estimate Y = ^7 — S?=i ^ l 5 given by
r^-i£&£H&-*h"!L& Proof: In the proof of Theorem 1 we found that
n
' ,=1
Thus from the results in Chapter 6, = i V 2 i ( l - | ) S2,
Var(E(Y\u[,■■■,u'n,))
which is the second term in the right side of the formula stated in the theorem. We now wish to obtain E(yar(y\u'x, • • •, u'n,)). By Theorem 3 of Section 7.2,
^(fK,-X,) = -
Var^-t^K,..^ N21
Hence by Lemma 1 of Section 8.3, E(Var(Y\u[j
_ iv 2 1 (;:;) v
y'A2
Trr.r>(yj
• • •, u'n,)) =
/2i _ nV
Now by Theorems 1 and 3 of Section 7.2, EiVariYK,
• • •, «„,)) = - j ^ - g y (
^
- r ]
CHAPTER 11. TWO-STAGE
186
SAMPLING Q.E.D.
T H E O R E M 3. IfYis as in Theorem 1, then Var(Y), estimate ofVar(Y), is equal to the following
iV2 {x'f
f»rf
(«0»n(n-l)\£i»?
1/^i/A 2 ) n\frzi)
J
, N(N-n'± \x,j-y±_ (Q2 1 nn^n' — 1 ) 1
i^i xi
an unbiased
n' n — \
m-m
Proof: We first establish two formulas. The first is that
This comes about by replacing the function y by y2 and using results from Chapter 7. The second formula is:
where y1 = YX=i Vi- I n order to prove this, we make use of the fact that ^ (xMfx'\uii"" >Mn') = YA=I Vi == y't properties of conditional expecta tion, the fact that x1 is a function of u^, • • • yu'n, and the conditional independence of j/,/xt- and yj/xj given u'1? • • •, v!n, to obtain
- M^K.-,*,) = M!£(gK,-,<,)£:(|iK,--,<,) =
A2
(*/')
11.1. TWO-STAGE SAMPLING
187
Now let B denote the second term on the right side in the formula given in the statement of the theorem. By the two formulas given above,
£(BK,...X,)
= NJ"r_1hhy'i)2-Am =
N N n
( - ">
1
Y{y>-j>y
where y' = (y[ + • • • + y'n,)/nf. Since the first stage sampling was simple random sampling WOR, then, using a property of conditional expectation, we have
E(B) = £(£(5K, • • •, <,)) =
"^J^Sl,
which is the second term in the expression for Var(Y) Now let us denote
__ JV2 (*Q2 [" y? A (n'^nCn-l)^^
in Theorem 2.
1 ("yi\2} n {% xt) j '
we need to prove
N
n ' - l f Jd(
Y{
V
We first rewrite A as A =
N2 1 2 (n') 2n(n — i) ') n(n-l)\V
--^—id-'Atl^-X-'-y:^-^-). nJ^yxi/x'J n %. x /x' /x> J
By the definition of conditional variance,
t
Xj
CHAPTER 11. TWO-STAGE
188
SAMPLING
Applying Theorem 1 of Section 7.2 to the right side of the above, we obtain
Also, since, for i ^ j , yi/xi and yj/xj are conditionally independent given i4, • • •, u'n,, and since x1 is a function of u[, • • •, u'n,, we again apply Theorem 1 of Section 7.2, a property of conditional expectation, and the second formula established earlier to obtain E
=
\XilX' Xj/X'
( ^ ( ^ K , \XiXj
)
■■■,<) )
= (y'YHence
E(A\u'---,u')
=
^ - - y ^ - ( ^ - - y '
By Lemma 2 of Section 7.2,
w,-.<.>=^:E>:*;.(fj-f Now, applying Lemma 1 of Section 8.3, the definition of conditional expectation given a random variable and its properties and the fact that x\ = Xui. and y\ = Yui., we have E(A)
=
E{E{A\u\,
•••,<,))
11.2.
SAMPLING
FOR
NON-RESPONSE
189
l
{„,)
- J_ZLI
v
r v vk &
" (5) Wn1
2
1 iV 1 (N-2\ (»)(nyn{n>-2)ltXiX>{xi-Xj)
2kV
^ ' U ~*J (Yj
YA2 '
Again, applying Lemma 2 of Section 7.2, this last equation becomes
Q.E.D. EXERCISES 1. Find the joint density of t^, • • •, ufn,. 2. Find the joint density of ui, • • •, un. 3. Use Lemma 1 in Section 8.1 to prove directly that E(Ny) = Y in simple random sampling WOR on
y:
11.2
Yi--YN.
Sampling for Non-Response
It sometimes happens that a simple random sample of a population (ZY; y) does not elicit a one hundred percent response. This occurs in the case of questionnaires sent out by mail. It also occurs when sampling is done by visiting households, since it is possible that the occupants are not at home. There is sometimes reason to believe that those units of a population that do not respond are actually a sample from a stratum of units that would not respond and that is qualitatively different from the stratum of units that would respond. In any case, the statistician
190
CHAPTER 11. TWO-STAGE
SAMPLING
then takes a simple random sample from the non-respondents, going to considerable effort and expense to obtain a 100% response from this subsample. An unbiased estimate of Y is then possible. So much for generalities; now we shall spell out everything in detail. Our population as usual is given by U:
U!---UN
We assume that U can be represented as a disjoint union of two sets, call them TZ and Af , where % contains every unit of U that would respond if it were included in the initial sample, and Af contains every unit of U that would not respond. Neither % nor Af is known ahead of time. We shall let Ni denote the number of units in 7£, and N2 will denote the number of units in Af; neither Ni nor N2 are known. Since U = H U Af is a disjoint union, it follows that N = JVi + N2. The first stage consists of taking a simple random sample WOR from U of size n. We shall let ni denote the number of units in the sample that respond and n2 the number of units that do not respond in this initial sample; ni and n2 are observable and are random variables. Clearly n = ni+n2. One easily observes that rt\ is a random variable that has the hypergeometric distribution which was defined in Chapter 2. Thus E{nx) = (Ni/N)n and E(n2) = nN2/N. We shall denote the observable y values of those units in the sample that do respond yiy'"->y,n1' O n e should note that ni could equal zero with positive probability if n < N2. If r > 1, and if the event [nx = r] occurs, then yi, • • •, yr is a simple random sample WOR from 71. This means: the joint conditional density of yj, • • •, y'r given the event [rti = r] is that of a simple random sample WOR from K. Formally, we define I
ni
y'ni = — 52ViI[ni>H-
We also let j/J', • • •, y"2 denote the unobservable y-values of the units from Af in this first stage sample, and we likewise define 1
12
V* = ~ n
2 ,=1
L,Vi'fan]-
11.2. SAMPLING
FOR
NON-RESPONSE
191
If y denotes the arithmetic mean (it is not observable) of the y-values of the n units in the first stage sample, then •
y
n From the n2 units that do not respond we shall select a simple random sample WOR of size u. The size u is assumed to be a function of n2 that satisfies: u = 0 if n2 = 0 1 < u < n2 < n if n2 > 1. For example, u is sometimes selected as a fraction of n 2 , say, u = l | n 2 | + 1, where [x] denotes the largest integer < x. We shall let 2/i> • • • > Vu denote the y-values of the u units in this second sample and shall denote 1 u u
t=i
We shall let 1 < u[ < u'2 < • • • < u'ni < N\ denote the numbers (subscripts) of the n\ units that are among the n units obtained in the first stage sample and that are in 7£, and we shall let 1 < u'{ < u'^ < • • • < v!^2 < N2 denote the subscripts of the n2 units from M in the firststage sample. Note that u[^ u2, • • •, w^ are random variables (actually, a random number of them as well, since rt\ is a random variable) and P[u[ = ku - • •, u'T = fcr,ni = r] = -r^-yP[ni = r) if 1 < fci < • • • < kr < Nu 1 < r < Nu and Ukl € 11, • • •, Ukr G ^ , and = 0 otherwise. A similar statement can be made for u", • • •, u" 2 . We shall need in the proofs that follow a convention on some nota tion for summation. If the conditioning event is K = mi, • • •, v!n_r = ran_r, u'{ = fci, • • •, < = fcr, n 2 = r], the summation will be over all indices mi,---,m n _ r ,fci,---,A; r ,r
CHAPTER 11. TWO-STAGE SAMPLING
192 satisfying
max{0, n — Ni} < r < min{ra, N2} 1 < mi < • • • < m n _ r < Ni 1 < h < ■ ■ ■ < k r < N2
umi eii,---,umn_r en ukleM,---,ukreN.
L E M M A 1. The following holds: E(yu\u'{,---,u'l2,n2)
= y':2.
Proof: For fixed r > 1, we obtain from Chapter 6 and the definition of u that 1 r E(yu\u" = fci, • • • , < = K,n2 = r) = - ^ F f c | . r
.=i
Now, if £ = max{0, n — Ni} and m = min{ra, N2}, and upon observing that E(yu\[n2 = 0]) = 0, we have E{yu
|
< , • • • , < , "2) =
=
E E ( y * \ u " - ^i) • • • »u" = kr,n2 = r)/[„i/=fcl)...)U»=Ar)n2=r]
=
E ( ~ JC y.'') ^K-fcl ,-,<=*r,»2=r]
*
-
/
™2
— 2^«• y l»2>l] := ^/n 2 • n 2 «=i
Q.E.D.
11.2. SAMPLING
FOR
NON-RESPONSE
193
L E M M A 2. In the notation already established, E(yu\u'l9 • ■ • X a X i ' , • • • , < , , n 2 ) = y"2Proof: This lemma has the same proof as Lemma 1, except the con ditioning set is now [ui = mi, • • •,u'n_r = mn_r,u'l =kir-
,u"= kr,n2 = r].
T H E O R E M 1. An unbiased estimate ofY
is given by
n
Y = N
iy'ni + n-iVu n
Proof: We observe that E(Y) = -(E(my') n
■
JniJ
+
E(n2yu)).
Now, by Lemma 1, E{n2yu)
= = =
E(E(n2yu\u'(,---,u'l2,n2)) E(n2E(yu\u'{,---,uZ2,n2)) E{n2yl).
Thus E(Y)
= ^(E(niy-'ni) =
NE'niV
+ E(n2y';2))
»1+"2<\
But -(niy'ni+n2jCt)
= y,
the arithmetic mean of a simple random sample of size n WOR, its expectation is Y, and hence E(Y) = NY = Y. Q.E.D. In order to emphasize that u is a random variable that is a function of n2, we shall sometimes denote it by u(n2).
194
CHAPTER 11. TWO-STAGE
T H E O R E M 2. The variance o/Y = N(niy'ni + n2yu)/n
SAMPLING is given by
Var(Y) . * ! (l - £) $ + pl,E («, ( ^ - l) , f e l ] ) , w/iere ^w
=
AJI_I X/ (^ X 2 {,:t/ieV} \
~ AE
7V
iV2
Yi ^ I • {fctfaV} /
Proof: Let B = K , • • • , < , « » , • • •, < 2 , n 2 } . Then £ ( F | 5 ) = = ^E(niy'ni \B) + ^E(n2yu\B). Since n i , n 2 and y ^ are functions of the random variables in B, it follows that E(Y\B) =
-{n1y'ni+n2E(yu\B)}. TV
By Lemma 2 we obtain
E(Y\B) = £ { n < + n < } 71
\t=i
t=i
/
Now applying the formula for variance of the arithmetic mean of a simple random sample WOR, we have Var(E(Y\B))
= N>± ( l - £ ) S}.
We next wish to compute E(Var(Y\B)). Since n i j / ^ and n 2 are func tions of the random variables in 5 , we obtain (using results from Section 4.2) that Var (-(niy-'ni
+ n2yu)\B)
=
?LJ±Var{yu\B).
Now Var(yu\B) = "£Var(yu\B
= b)I[B=b],
11.2.
SAMPLING
FOR
NON-RESPONSE
195
where the sum is taken over all $b$ of the form $b = \{m_1, \ldots, m_{n-r}, k_1, \ldots, k_r, r\}$. By results from Chapter 6,
$$Var(\bar{y}_u \mid B = b) = \frac{1}{u(r)}\left(1 - \frac{u(r)}{r}\right)\frac{1}{r-1}\sum_{i=1}^{r}\left(Y_{k_i} - \frac{1}{r}\sum_{j=1}^{r} Y_{k_j}\right)^2.$$
If we define
$$s_{y''}^2 = \frac{1}{n_2 - 1}\sum_{i=1}^{n_2}\left(y''_i - \bar{y}''_{n_2}\right)^2 I_{[n_2 \ge 2]},$$
the above expressions for $Var(\bar{y}_u \mid B)$ and $Var(\bar{y}_u \mid B = b)$ imply
$$Var(\bar{y}_u \mid B) = \frac{1}{u(n_2)}\left(1 - \frac{u(n_2)}{n_2}\right)s_{y''}^2.$$
Note that $s_{y''}^2$ is the sample variance of a sample of size $n_2$ from $\mathcal{N}$. Thus, from Chapter 6, for $r \ge 2$,
$$E(s_{y''}^2 \mid n_2 = r) = S_{y''}^2,$$
where $S_{y''}^2$ is as defined in the statement of the theorem, and $E(s_{y''}^2 \mid n_2 = r) = 0$ if $r = 0$ or $1$ by the definition above of $s_{y''}^2$. Thus
$$E(s_{y''}^2 \mid n_2) = \sum_{r = \max\{0,\, n - N_1\}}^{\min\{n,\, N_2\}} E(s_{y''}^2 \mid n_2 = r)\, I_{[n_2 = r]} = S_{y''}^2\, I_{[n_2 \ge 2]}.$$
Finally,
$$E(Var(\hat{Y} \mid B)) = E\left(\frac{N^2 n_2^2}{n^2}\cdot\frac{1}{u(n_2)}\left(1 - \frac{u(n_2)}{n_2}\right)s_{y''}^2\right) = \frac{N^2}{n^2}\,E\left(n_2^2\left(\frac{1}{u(n_2)} - \frac{1}{n_2}\right)E(s_{y''}^2 \mid n_2)\right),$$
and, since $u(1) = 1$, the factor $n_2^2\left(\frac{1}{u(n_2)} - \frac{1}{n_2}\right)$ vanishes on the event $[n_2 = 1]$; hence
$$E(Var(\hat{Y} \mid B)) = \frac{N^2}{n^2}\,S_{y''}^2\, E\left(n_2\left(\frac{n_2}{u(n_2)} - 1\right) I_{[n_2 \ge 2]}\right).$$
Since $Var(\hat{Y}) = Var(E(\hat{Y} \mid B)) + E(Var(\hat{Y} \mid B))$, we obtain the conclusion of the theorem. Q.E.D.
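Since $n_2$ has a hypergeometric distribution, the expectation appearing in Theorem 2 can be evaluated exactly. The Python sketch below is only an illustration: it treats the population quantities in the formula as known (which is the standpoint from which the theorem is stated), and the synthetic data, function names, and the rule $u(n_2) = \lfloor n_2/2 \rfloor + 1$ are assumptions made for the example.

```python
import math

def theorem2_variance(Y_resp, Y_nonresp, n, u):
    """Var(Y-hat) from Theorem 2 for a population split into respondents and
    non-respondents, a first-stage SRS WOR of size n, and a second-stage
    sample-size rule u(n2)."""
    N1, N2 = len(Y_resp), len(Y_nonresp)
    N = N1 + N2
    Y_all = Y_resp + Y_nonresp
    Ybar = sum(Y_all) / N
    S2_y = sum((y - Ybar) ** 2 for y in Y_all) / (N - 1)          # population variance S^2_y
    Ybar_non = sum(Y_nonresp) / N2
    S2_ypp = sum((y - Ybar_non) ** 2 for y in Y_nonresp) / (N2 - 1)  # S^2_{y''} over the non-respondents

    # E( n2 (n2/u(n2) - 1) I[n2 >= 2] ), with n2 hypergeometric over the non-respondents.
    expect = sum(
        r * (r / u(r) - 1) * math.comb(N2, r) * math.comb(N1, n - r) / math.comb(N, n)
        for r in range(max(0, n - N1), min(n, N2) + 1)
        if r >= 2
    )
    return N**2 * S2_y / n * (1 - n / N) + N**2 / n**2 * S2_ypp * expect

# Illustrative use with a small synthetic population and u(n2) = floor(n2/2) + 1.
u = lambda n2: 0 if n2 == 0 else n2 // 2 + 1
print(theorem2_variance([3.0, 5.0, 4.0, 6.0], [1.0, 2.0, 2.5], n=4, u=u))
```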
An unbiased estimate of $Var(\hat{Y})$ is beyond the scope of this course.

EXERCISES

1. Prove: $E(\bar{y}'_{n_1}) = \frac{1}{N_1}\,P[n_1 \ge 1]\sum\{Y_i : U_i \in \mathcal{R}\}$.

2. Prove: If $\tau$ is a non-negative integer-valued random variable with $\mathrm{range}(\tau) = \{0, 1, 2, \ldots, n\}$, if $y_1, \ldots, y_n$ is a simple random sample taken WOR, and if $\tau$ and $\{y_1, \ldots, y_n\}$ are independent, then $E(y_1 + \cdots + y_\tau) = E(\tau)E(y_1)$.

3. Prove that $n_1 \bar{y}'_{n_1} + n_2 \bar{y}''_{n_2}$ is the sum of a simple random sample of size $n$ on $(\mathcal{U}; y)$.

4. Prove: $\bar{y}_u = \sum_k Y_k I_{[u''_1 = k]}$ when $n_2 = 1$.

5. Prove that $E(n_2) = nN_2/N$.

11.3 Sampling for Stratification
In the stratified sampling considered in Chapter 9, it was assumed that we knew which units were in which strata. In this section we shall stratify a population by means of a predictor variable $x$. However, unlike previous cases where we knew the values of $x$ for all the units, we now know the value of $x$ for a unit only if we can observe that unit. It turns out that it is sometimes quick, cheap and easy to take a large sample of units in order to observe the values of $x$. Then, upon stratification of this sample, we may take a subsample of the sample within each stratum and thus take some advantage of uniform values
of $y$ within each stratum. Let us begin to make these general remarks a little more precise. We assume that we are dealing with a population of units with a predictor variable, $x$, namely
$$\mathcal{U} : U_1 \cdots U_N, \qquad x : X_1 \cdots X_N, \qquad y : Y_1 \cdots Y_N.$$
We do not know the $x$-value of any unit until we obtain it in a sample. Before any sampling is done, we decide on numbers $a_1 < a_2 < \cdots < a_{L-1}$ by which the population is to be stratified. This means that $\mathcal{U}_1$ will consist of those units in $\mathcal{U}$ whose $x$-values are equal to or less than $a_1$, i.e., $\mathcal{U}_1 = \{U_i \in \mathcal{U} : x(U_i) = X_i \le a_1\}$, then $\mathcal{U}_j = \{U_i \in \mathcal{U} : a_{j-1} < x(U_i) = X_i \le a_j\}$ for $2 \le j \le L - 1$, and finally
$$\mathcal{U}_L = \{U_i \in \mathcal{U} : a_{L-1} < x(U_i)\}.$$
Thus $\mathcal{U} = \cup_{j=1}^{L}\mathcal{U}_j$. We shall let $N_h$ denote the number of units in stratum $\mathcal{U}_h$; note that we do not know the values of $N_1, \ldots, N_L$. However, $N = N_1 + \cdots + N_L$. The notation $S_y^2$ and $S_{y_h}^2$ used in Chapter 9 will continue to be used in this section. We now take a simple random sample WOR of size $n$ from this population and observe the $x$-values of the $n$ units obtained. These observations of the $x$-values are used only to decide which strata the units in the sample came from. We shall let $n_h$ denote the number of units in this sample that are in stratum $\mathcal{U}_h$, $1 \le h \le L$. Thus $n = n_1 + \cdots + n_L$. It should be emphasized that each $n_h$ is a random variable; indeed, each $n_h$ has a hypergeometric distribution, and the joint density of $n_1, \ldots, n_{L-1}$ is that of a multivariate hypergeometric distribution. Notice that it is possible for $n_h$ to be zero. For each $h$ let $\nu_h$ be a function of $n_h$ which satisfies the following two requirements:
i) $\nu_h = 0$ if $n_h = 0$;

ii) if $n_h > 0$, then $1 \le \nu_h \le n_h$.

Thus $\nu_h$ is a random variable that satisfies $[\nu_h = 0] = [n_h = 0]$. In order to emphasize that $\nu_h$ is a deterministic function of $n_h$ we shall sometimes write $\nu_h(n_h)$. Thus, when $n_h$ takes the value $i$, then $\nu_h$ takes the value $\nu_h(i)$. In the second stage of our sampling we take a simple random sample WOR of size $\nu_h$ from among the $n_h$ units that we selected in $\mathcal{U}_h$, and we observe the $y$-values of these units. We shall denote the unit numbers from the initial sample of size $n$ that are in $\mathcal{U}_h$ by $1 \le u'_{h,1} < u'_{h,2} < \cdots < u'_{h,n_h} \le N_h$. Note that these unit numbers are observable random variables. For each $h$ we shall let $y'_{h,1}, \ldots, y'_{h,n_h}$ denote the $y$-values of these units, none of which is observable until after the subsample of size $\nu_h$ is taken. We shall let $y_{h,1}, \ldots, y_{h,\nu_h}$ denote the observable $y$-values of the $\nu_h$ units obtained from the $n_h$ units. Further notation used is:

i) $y_1, \ldots, y_n$ will denote the unobservable $y$-values of the $n$ units selected in the first stage of our sampling,

ii) $s_{y'_h}^2 = \frac{1}{n_h - 1}\sum_{i=1}^{n_h}\left(y'_{h,i} - \bar{y}'_{n_h}\right)^2 I_{[n_h \ge 2]}$, where $\bar{y}'_{n_h} = \frac{1}{n_h}\sum_{i=1}^{n_h} y'_{h,i}\, I_{[n_h \ge 1]}$, and

iii) $\bar{y}_h = \frac{1}{\nu_h}\sum_{i=1}^{\nu_h} y_{h,i}\, I_{[\nu_h \ge 1]}$.
It should be noticed that two-stage sampling for non-response is but a special case of the scheme considered here. For the former scheme, $L = 2$ and $\nu_1 = n_1$ if we let $\mathcal{U}_1 = \mathcal{R}$ and $\mathcal{U}_2 = \mathcal{N}$.

THEOREM 1. An unbiased estimate of $Y$ is
$$\hat{Y} = \frac{N}{n}\sum_{h=1}^{L} n_h \bar{y}_h.$$

Proof: Using results obtained in Chapter 4 on conditional expectation, we have, for $1 \le h \le L$,
$$E(n_h \bar{y}_h) = E\left(E(n_h \bar{y}_h \mid u'_{h,1}, \ldots, u'_{h,n_h}, n_h)\right) = E\left(n_h E(\bar{y}_h \mid u'_{h,1}, \ldots, u'_{h,n_h}, n_h)\right).$$
By Lemma 1 in Section 2,
$$E(\bar{y}_h \mid u'_{h,1}, \ldots, u'_{h,n_h}, n_h) = \bar{y}'_{n_h}.$$
Thus
$$E(\hat{Y}) = \frac{N}{n}\sum_{h=1}^{L} E(n_h \bar{y}'_{n_h}) = N\,E\left(\frac{1}{n}\sum_{h=1}^{L} n_h \bar{y}'_{n_h}\right) = N\,E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = N\,E(\bar{y}) = N\bar{Y} = Y.$$
Q.E.D.
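The following minimal Python sketch runs the whole scheme once on a synthetic population and returns the estimate $\hat{Y} = \frac{N}{n}\sum_h n_h \bar{y}_h$ of Theorem 1. The population, the cutpoints $a_1 < a_2$, the sample size, and the rule $\nu_h(n_h) = \lceil n_h/2 \rceil$ are illustrative assumptions, not prescriptions from the text.

```python
import math
import random
from bisect import bisect_left

random.seed(1)

# Synthetic population: x is the predictor, y is the variable of interest.
population = [(x, 2.0 * x + random.gauss(0.0, 1.0))
              for x in (random.uniform(0, 10) for _ in range(200))]
N = len(population)
cutpoints = [3.0, 7.0]                                   # a_1 < a_2, giving L = 3 strata
L = len(cutpoints) + 1

def stratum(x):
    """Stratum index h in {0, ..., L-1}: h = 0 iff x <= a_1, h = 1 iff a_1 < x <= a_2, etc."""
    return bisect_left(cutpoints, x)

def estimate(n, nu=lambda nh: math.ceil(nh / 2)):
    first = random.sample(population, n)                 # first stage: SRS WOR, observe x only
    y_hat = 0.0
    for h in range(L):
        in_h = [y for x, y in first if stratum(x) == h]  # the n_h units falling in stratum h
        nh = len(in_h)
        if nh == 0:                                      # nu_h = 0 when n_h = 0
            continue
        second = random.sample(in_h, nu(nh))             # second stage: observe y on nu_h units
        y_hat += nh * (sum(second) / len(second))
    return N * y_hat / n

print(estimate(n=60), "vs", sum(y for _, y in population))
```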
THEOREM 2. If $\hat{Y} = \frac{N}{n}\sum_{h=1}^{L} n_h \bar{y}_h$, then
$$Var(\hat{Y}) = \frac{N^2 S_y^2}{n}\left(1 - \frac{n}{N}\right) + \frac{N^2}{n^2}\sum_{h=1}^{L} S_{y_h}^2\, E\left(n_h\left(\frac{n_h}{\nu_h(n_h)} - 1\right) I_{[n_h \ge 2]}\right).$$

Proof: We shall let $B$ denote the collection of random variables consisting of all unit numbers in the first stage sample, $1 \le u_1 < u_2 < \cdots < u_n \le N$. Clearly every random variable in $\{u_1, \ldots, u_n\}$ is a function of the random variables in
$$\bigcup_{h=1}^{L}\{u'_{h,1}, \ldots, u'_{h,n_h}, n_h\},$$
and every random variable in the latter is a function of $u_1, \ldots, u_n$. Thus we may use either set of random variables as a definition of $B$ as far as conditioning is concerned. By Lemma 2 in Section 2, and by properties of conditional expectation, $E(n_h \bar{y}_h \mid B) = n_h E(\bar{y}_h \mid B) = n_h \bar{y}'_{n_h}$. Thus
$$E(\hat{Y} \mid B) = \frac{N}{n}\sum_{h=1}^{L} E(n_h \bar{y}_h \mid B) = \frac{N}{n}\sum_{h=1}^{L} n_h \bar{y}'_{n_h} = \frac{N}{n}\sum_{i=1}^{n} y_i = N\bar{y}.$$
Thus, from Chapter 6 we have
$$Var(E(\hat{Y} \mid B)) = N^2\, Var(\bar{y}),$$
i.e.,
$$Var(E(\hat{Y} \mid B)) = \frac{N^2 S_y^2}{n}\left(1 - \frac{n}{N}\right).$$
We next wish to evaluate $E(Var(\hat{Y} \mid B))$. Because of the way in which our sampling is done, namely, the joint distribution of $y_{h,1}, \ldots, y_{h,\nu_h}$ is independent of the joint distribution of the observations on the other strata once $n_1, \ldots, n_L$, and hence $\nu_1, \ldots, \nu_L$, are known, it follows that $n_1\bar{y}_1, \ldots, n_L\bar{y}_L$ are conditionally independent given $B$. (For the definition and the next step in connection with conditional independence, see Section 4.2.) Thus,
$$Var(\hat{Y} \mid B) = \frac{N^2}{n^2}\,Var\!\left(\sum_{h=1}^{L} n_h \bar{y}_h \,\Big|\, B\right) = \frac{N^2}{n^2}\sum_{h=1}^{L} Var(n_h \bar{y}_h \mid B),$$
and since $n_h$ is a function of the random variables in $B$,
$$Var(\hat{Y} \mid B) = \frac{N^2}{n^2}\sum_{h=1}^{L} n_h^2\, Var(\bar{y}_h \mid B).$$
In the proof of Theorem 2 in Section 2, we proved that
$$Var(\bar{y}_h \mid B) = \frac{1}{\nu_h}\left(1 - \frac{\nu_h}{n_h}\right)s_{y'_h}^2.$$
Thus
$$E(n_h^2\, Var(\bar{y}_h \mid B)) = E\left(E(n_h^2\, Var(\bar{y}_h \mid B) \mid n_h)\right).$$
In the proof of Theorem 2 in Section 2 we also showed that
$$E(s_{y'_h}^2 \mid n_h) = S_{y_h}^2\, I_{[n_h \ge 2]}.$$
Hence
$$E(n_h^2\, Var(\bar{y}_h \mid B)) = S_{y_h}^2\, E\left(n_h\left(\frac{n_h}{\nu_h} - 1\right)I_{[n_h \ge 2]}\right) = S_{y_h}^2\, E\left(n_h\left(\frac{n_h}{\nu_h(n_h)} - 1\right)I_{[n_h \ge 2]}\right).$$
Thus
$$E(Var(\hat{Y} \mid B)) = \frac{N^2}{n^2}\sum_{h=1}^{L} S_{y_h}^2\, E\left(n_h\left(\frac{n_h}{\nu_h(n_h)} - 1\right)I_{[n_h \ge 2]}\right).$$
Now we make use of the result that states
$$Var(\hat{Y}) = Var(E(\hat{Y} \mid B)) + E(Var(\hat{Y} \mid B))$$
in order to obtain the conclusion. Q.E.D.
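Since each $n_h$ has a hypergeometric distribution, the expectations in Theorem 2 can again be computed exactly. The sketch below is illustrative only: it treats the stratum sizes $N_h$, the stratum variances $S_{y_h}^2$, and the overall variance $S_y^2$ as known inputs (which is how the theorem's formula is stated, even though a surveyor would not know them), and the rule $\nu_h(n_h) = \lceil n_h/2 \rceil$ in the example call is an assumption.

```python
import math

def hypergeom_pmf(r, N, K, n):
    """P[n_h = r] when n_h counts stratum-h units (K of the N) in an SRS WOR of size n."""
    return math.comb(K, r) * math.comb(N - K, n - r) / math.comb(N, n)

def theorem2_variance(S2_y, S2_yh, N_h, n, nu):
    """Var(Y-hat) from Theorem 2: S2_y is the population variance of y, S2_yh and N_h
    list the stratum variances and sizes, and nu(nh) is the second-stage size rule."""
    N = sum(N_h)
    var = N**2 * S2_y / n * (1 - n / N)
    for S2, Nh in zip(S2_yh, N_h):
        # E( n_h (n_h / nu_h(n_h) - 1) I[n_h >= 2] ) over the hypergeometric support.
        expect = sum(
            r * (r / nu(r) - 1) * hypergeom_pmf(r, N, Nh, n)
            for r in range(max(2, n - (N - Nh)), min(n, Nh) + 1)
        )
        var += N**2 / n**2 * S2 * expect
    return var

# Illustrative use: three strata, nu_h(n_h) = ceil(n_h / 2).
print(theorem2_variance(S2_y=4.0, S2_yh=[1.0, 1.5, 2.0], N_h=[40, 80, 80], n=30,
                        nu=lambda nh: math.ceil(nh / 2)))
```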
An unbiased estimate for $Var(\hat{Y})$ is beyond the scope of this course.

EXERCISES

1. Find the formula for the joint density of $n_1, \ldots, n_{L-1}$.

2. If $L \ge 3$, find the formula for the joint density of $\nu_1, \nu_2$.

3. Find the joint density of $(u'_{h,1}, \ldots, u'_{h,n_h}, n_h)$.

4. Write $n_h$ as a function of $u_1, \ldots, u_n$.

5. Write $u_i$ as a function of the random variables in $\bigcup_{h=1}^{L}\{u'_{h,1}, \ldots, u'_{h,n_h}, n_h\}$.
Appendix A

The Normal Distribution

The table gives the standard normal distribution function $\Phi(x) = P[Z \le x]$: the row label gives $x$ to one decimal place and the column heading gives the second decimal, so that, for example, $\Phi(1.96)$ is found in row 1.9, column .06.

  x     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
 0.0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 0.1  .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 0.2  .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 0.3  .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 0.4  .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
 0.5  .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 0.6  .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 0.7  .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
 0.8  .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 0.9  .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
 1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 1.1  .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
 1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
 1.3  .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
 1.4  .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
 1.5  .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
 1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
 1.7  .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
 1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
 1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
 2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 2.1  .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
 2.2  .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
 2.3  .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
 2.4  .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
 2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
 2.6  .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
 2.7  .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
 2.8  .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
 2.9  .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
 3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
 3.1  .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
 3.2  .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
 3.3  .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
 3.4  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
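For readers who prefer to compute rather than interpolate, the table values can be reproduced (to the four decimals shown) with the standard relation $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$; a minimal Python check:

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Spot-check a few entries of the table above.
for x, table_value in [(0.00, 0.5000), (1.00, 0.8413), (1.96, 0.9750), (3.40, 0.9997)]:
    assert abs(Phi(x) - table_value) < 5e-5
    print(f"Phi({x:.2f}) = {Phi(x):.4f}")
```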
Index

Bin(n,p) 42
p.p.a.s. sampling 146
Bayes' theorem 23
Bernoulli trials 41
Bernoulli's theorem 85
binomial coefficient 5
binomial distribution 42
binomial theorem 5
bivariate density 32
bivariate normal population 153
Cauchy-Schwarz inequality 164
central limit theorem 86
central moment 51
Chebishev's inequality 83, 104
cluster sampling 171
combinatorial probability 3
complement of a set or event 10
conditional covariance as a number 72
conditional covariance as a random variable 75
conditional expectation as a number 65
conditional expectation as a random variable 69
conditional independence 78
conditional probability, definition of 20
conditional variance as a number 73
conditional variance as a random variable 77
correlation coefficient 58
covariance 58
DeMorgan formulae 12
density of a random variable, definition of 31
disjoint events 11
double sampling 185
elementary event 9
empty set 10
equality of two events, definition of 11
event 9
expectation, definition of 47
fundamental probability set 9
fundamental probability space 38
hypergeometric distribution 43, 110
independent random variables, definition of 36
indicator of an event 29
individual outcomes 9
intersection 11
Laplace-DeMoivre theorem 88
Law of large numbers 84
marginal or marginal density 34
moments of a random variable 51
multinomial distribution 44
multiplication rule 21
Neyman-Tchuprow Theorem 166
normal distribution 87
permutation 4
Polya urn scheme 21, 45
population, definition of 91
probability of an event 5
probability proportional to aggregate size sampling 146
probability proportional to size sampling WOR 120
probability proportional to size sampling WR 119, 125
proportions, estimations of 108
random number generator 93
random number 5, 93
random variable, definition of 27
range of a random variable, definition of 28
ratio estimation 140
relative frequency, definition of 1
sample of size n 38
sampling without replacement 21
Schwarz' inequality 57
skipping method 122
standard deviation 51
standard normal distribution 103
stratification 158
stratified sampling 157
stratum 158
subset, definition of 11
sure event 9, 10
total probabilities, theorem of 22
two stage sampling 185
two-stage sampling for stratification 198
unbiased estimate, definition of 99
uniform distribution 44
variance of a random variable 51
Wilcoxon distribution 111
WOR means 'without replacement' 98
WR means 'with replacement' 98