Two Plus Two Newer Archives  

Two Plus Two Newer Archives > General Gambling > Probability
  #11  
Old 05-10-2007, 12:22 PM
carbone1853 carbone1853 is offline
Junior Member
 
Join Date: May 2007
Posts: 5
Default Re: Simple Probability Question

MtDon
The trick comes in deciding what "the best estimate of the probability distribution" means. You give one way to estimate the distribution; I could come up with another. We would have to agree on a method for deciding which estimate is better.

That said, the method you propose is a reasonable approach.
  #12  
Old 05-10-2007, 02:01 PM
MtDon MtDon is offline
Senior Member
 
Join Date: Jul 2004
Posts: 138
Default Re: Simple Probability Question

I need some clarification.

What is meant by the "probability distribution of P"?

Do you mean "What is the best estimate of the probability P?"

Or do you mean "What is the best function f(p), where p ranges over the possible probabilities from 0 to 1, and f(p) is the probability that P = p?"

Or do you mean something else?

--Don
  #13  
Old 05-10-2007, 06:15 PM
f97tosc f97tosc is offline
Senior Member
 
Join Date: Oct 2006
Posts: 120
Default Re: Simple Probability Question

[ QUOTE ]
There are multiple approaches to this question. One approach is to regard P as a uniform random variable on (0,1),

[/ QUOTE ]

Good post Jason1990. One quick comment: it has been argued (by E. T. Jaynes) that if we really know nothing about the proportion (before we get the 100 data points), then the right prior to use isn't uniform but one with density proportional to

1/(u*(1-u)).

If, on the other hand, we know that there are at least some 0s and some 1s (as opposed to all 0s or all 1s), then we get the uniform distribution as you suggested. Also, 1/(u*(1-u)) reduces to the uniform prior once we make one 1 observation and one 0 observation. This implies that if, for example, you get 42 ones and 58 zeros, the alternative prior gives you the same belief as if you had gotten only 98 data points, 41 ones and 57 zeros. So for most ratios it doesn't matter much; the 100 observations are usually far more important than the exact nature of our prior ignorance.
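The prior-equivalence claim can be stated in terms of Beta parameters: the Jaynes prior 1/(u(1-u)) is the improper limit Beta(0, 0), the uniform prior is Beta(1, 1), and a Beta(a, b) prior updated with k ones and m zeros becomes Beta(a + k, b + m). A minimal Python sketch (my own framing, not from the post):

```python
# Under a Beta(a, b) prior, observing k ones and m zeros gives a
# Beta(a + k, b + m) posterior. The Jaynes prior 1/(u*(1-u)) is the
# improper limit Beta(0, 0); the uniform prior is Beta(1, 1).

def posterior(prior, ones, zeros):
    a, b = prior
    return (a + ones, b + zeros)

HALDANE = (0, 0)   # Jaynes's "complete ignorance" prior
UNIFORM = (1, 1)

# Jaynes prior with 42 ones and 58 zeros...
p1 = posterior(HALDANE, 42, 58)
# ...equals the uniform prior with one fewer of each (98 points total).
p2 = posterior(UNIFORM, 41, 57)
print(p1, p2)  # both (42, 58)
```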
  #14  
Old 05-10-2007, 08:03 PM
PairTheBoard PairTheBoard is offline
Senior Member
 
Join Date: Dec 2003
Posts: 3,460
Default Re: Simple Probability Question

[ QUOTE ]
There are multiple approaches to this question. One approach is to regard P as a uniform random variable on (0,1), and to condition on the 100 results in order to refine the distribution of P. Instead of P, let me use U, since P stands for "probability." In this case, we want

P(U < x | X = k) = P(U < x, X = k)/P(X = k).

The numerator and denominator can be calculated from the observation that

P(X = k | U = p) = C(100,k)p^k(1 - p)^{100-k},

where C(n,k) is the choose function. Therefore,

P(U < x, X = k) = \int_0^x C(100,k)p^k(1 - p)^{100-k}dp

and

P(X = k) = \int_0^1 C(100,k)p^k(1 - p)^{100-k}dp = 1/101.

Taking the quotient and differentiating with respect to x gives the conditional density of U, given X = k. In other words, if you observe k 1's, then the density of U is

f(x) = 101*C(100,k)x^k (1 - x)^{100-k}.

That is, U has the Beta distribution with parameters k + 1 and 101 - k.

[/ QUOTE ]

Suppose you took that resultant Beta distribution and used it for your prior distribution, then recalculated. Would you get the same Beta distribution back again? Seems like that would be a nice property if you did.

PairTheBoard
  #15  
Old 05-10-2007, 09:36 PM
AaronBrown AaronBrown is offline
Senior Member
 
Join Date: May 2005
Location: New York
Posts: 2,260
Default Re: Simple Probability Question

You can get philosophical about this question if you like; strictly speaking, it is not well-defined, and in some problems skipping the philosophy can lead you into error. But this particular problem is simple enough that I think you can give a good answer without inquiring deeply into the nature of probability.

A key unstated assumption is that all the data come from the same distribution. To make that concrete, assume that all the observations are written on 101 cards. You have to assume that the deck is shuffled to say anything useful; that is what I mean by all the data coming from the same distribution. If the deck is not shuffled, the first 100 cards tell you no more about the last card than they do about the color of your eyes or whether it will rain tomorrow.

The deck initially started with either X or X+1 cards marked "1". "You have no idea where this data comes from" could be interpreted to mean you think X or X+1 are equally likely. This is actually a dangerous assumption in some circumstances, but I think it's justified in this case. If you have some reason to suspect that X or X+1 is more likely, for any X, then you know something about where the data come from.

If the deck started with X cards marked "1", then the probability of seeing all X in the first 100 is 1 - X/101. If the deck started with X+1 cards marked "1" then the probability of seeing exactly X of them among the first 100 is (X + 1)/101.

The sum of these two probabilities is 102/101. If I multiply each one by 101/102, I get two probabilities that add to 1 (as they must) without favoring either the X or X+1 hypothesis. That suggests estimating P as (X + 1)/102.

I think it's reasonable to say any other estimate for P implies some knowledge of the data. For example, if you estimate P as X/100, you must believe that X is 1 + 2*(50 - X)/[X*(101 - X)] times as likely as X + 1. This implies you think P is more likely to be near 0 or 1 than 0.5. Actually, this belief is inconsistent because it cannot be what you think for X = 0.

If estimating P as anything but (X + 1)/102 implies some knowledge of the data, then so does having any non-trivial probability distribution for P.

You have to be careful about how you use this information. If someone comes and offers to bet you about the last card at some odds, and you think they know P, you should not take their bet. Computing the expected value of your bet using your estimate of P is incorrect, and will cost you money. On the other hand, if the person offering to bet knows no more than you do, and hasn't seen the first 100 cards, you can use your estimate of P to make money.

If your estimate of P were the probability that the last card is a "1", then you could use it to compute the expected value whether you were betting with someone who knew more than you, the same as you, or less than you. So your estimate of P is not a probability. Therefore your estimate of P is not P, because P is a probability.
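Aaron's counting argument lands on the same (X + 1)/102 as the Bayesian calculation earlier in the thread: under a uniform prior, the probability that the next card is a 1 is the mean of the Beta(X + 1, 101 - X) posterior. A short exact check (a Python sketch of my own; the helper names are assumptions):

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    # Exact value of the integral of p^a * (1-p)^b over [0, 1].
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def prob_next_is_one(k, n=100):
    # P(next = 1 | k ones in n draws), uniform prior on p:
    # integral of p * p^k * (1-p)^(n-k) divided by
    # integral of p^k * (1-p)^(n-k).
    return beta_integral(k + 1, n - k) / beta_integral(k, n - k)

for k in (0, 42, 100):
    assert prob_next_is_one(k) == Fraction(k + 1, 102)
print("matches (X + 1)/102 for every X checked")
```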
  #16  
Old 05-10-2007, 11:31 PM
jason1990 jason1990 is offline
Senior Member
 
Join Date: Sep 2004
Posts: 932
Default Re: Simple Probability Question

[ QUOTE ]
[ QUOTE ]
There are multiple approaches to this question. One approach is to regard P as a uniform random variable on (0,1), and to condition on the 100 results in order to refine the distribution of P. Instead of P, let me use U, since P stands for "probability." In this case, we want

P(U < x | X = k) = P(U < x, X = k)/P(X = k).

The numerator and denominator can be calculated from the observation that

P(X = k | U = p) = C(100,k)p^k(1 - p)^{100-k},

where C(n,k) is the choose function. Therefore,

P(U < x, X = k) = \int_0^x C(100,k)p^k(1 - p)^{100-k}dp

and

P(X = k) = \int_0^1 C(100,k)p^k(1 - p)^{100-k}dp = 1/101.

Taking the quotient and differentiating with respect to x gives the conditional density of U, given X = k. In other words, if you observe k 1's, then the density of U is

f(x) = 101*C(100,k)x^k (1 - x)^{100-k}.

That is, U has the Beta distribution with parameters k + 1 and 101 - k.

[/ QUOTE ]

Suppose you took that resultant Beta distribution and used it for your prior distribution, then recalculated. Would you get the same Beta distribution back again? Seems like that would be a nice property if you did.

[/ QUOTE ]
In this problem, if your prior has density f(x), then your posterior will have density proportional to

x^k (1 - x)^{100-k} f(x).

So if you take the above Beta distribution as your prior, then your posterior will look like

x^{2k} (1 - x)^{200-2k}.

Repeat that n times and you will have a density proportional to

x^{nk} (1 - x)^{100n-nk}.

This is Beta(nk + 1, 100n - nk + 1). Its mean is

(nk + 1)/(100n + 2),

and its variance is

(nk + 1)(100n - nk + 1)/[(100n + 2)^2 (100n + 3)].

So the mean tends to k/100, and the variance tends to 0. In other words, the distribution tends to a point mass at k/100.
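The limiting behavior can be checked numerically. This sketch (Python with exact rational arithmetic; my own illustration, not from the post) evaluates the mean and variance of Beta(nk + 1, 100n - nk + 1) for k = 42 and growing n:

```python
from fractions import Fraction

def beta_mean_var(a, b):
    # Mean and variance of a Beta(a, b) distribution, exactly.
    mean = Fraction(a, a + b)
    var = Fraction(a * b, (a + b) ** 2 * (a + b + 1))
    return mean, var

k = 42
for n in (1, 10, 1000):
    a, b = n * k + 1, 100 * n - n * k + 1
    mean, var = beta_mean_var(a, b)
    # mean approaches k/100 = 0.42; variance shrinks toward 0
    print(f"n={n}: mean={float(mean):.6f}, var={float(var):.2e}")
```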
  #17  
Old 05-11-2007, 12:20 AM
jason1990 jason1990 is offline
Senior Member
 
Join Date: Sep 2004
Posts: 932
Default Re: Simple Probability Question

Here is a little more on this topic. Imagine the data is arriving sequentially in time. If you want, you could update your prior after each data point. So start with your original prior. After one data point, compute the posterior. Now use your posterior as a new prior. Get another data point and compute again. And so on. This is equivalent to just doing one computation at the end.

So your idea of using the Beta as a prior and recalculating is equivalent to just taking an identical copy of the data set and appending it to the old data, creating twice as much data with the same proportion of 1's and 0's.

Therefore, in fact, it would be a bad property if you got the same Beta distribution again. It would mean that the method does not produce better and better estimates as the sample size grows. What actually happens, as I described above, is that if the sample size grows and the proportion remains the same, then the method produces distributions which converge to the point mass at that proportion.
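The sequential-equals-batch point is easy to verify: under a Beta prior, each observed 1 increments the first parameter and each 0 the second, so the order and grouping of updates cannot matter. A minimal Python sketch (the data and helper names are invented for illustration):

```python
# Updating a Beta(a, b) prior one observation at a time is equivalent
# to a single batch update: each 1 adds to a, each 0 adds to b.

def update(prior, bit):
    a, b = prior
    return (a + bit, b + 1 - bit)

data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

sequential = (1, 1)          # uniform prior, Beta(1, 1)
for bit in data:
    sequential = update(sequential, bit)

batch = (1 + sum(data), 1 + len(data) - sum(data))
print(sequential, batch)  # identical posteriors
```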
  #18  
Old 05-11-2007, 05:45 AM
HP HP is offline
Senior Member
 
Join Date: Oct 2004
Location: DZ-015
Posts: 2,783
Default Re: Simple Probability Question

[ QUOTE ]

A key unstated assumption is that all the data come from the same distribution.

[/ QUOTE ]My apologies, yes; I should have said that. It's what I had in mind.
  #19  
Old 05-11-2007, 12:00 PM
HP HP is offline
Senior Member
 
Join Date: Oct 2004
Location: DZ-015
Posts: 2,783
Default Re: Simple Probability Question

hey btw all, I'd like your answer to this:

Same as OP but with 4 data points. They are all from the same distribution.

You are only given 3. They are:

1
1
1

What's the chance the last data point is a 1?
  #20  
Old 05-11-2007, 02:22 PM
AaronBrown AaronBrown is offline
Senior Member
 
Join Date: May 2005
Location: New York
Posts: 2,260
Default Re: Simple Probability Question

To use my approach, say there are four cards, each with a zero or one on it. They are shuffled and the first three are dealt, all are ones.

Now we know the deck started with either three or four ones. Suppose before making any observations we think these two cases are equally likely. Starting with three ones, 3/4 of the time we would have had a different result (that is, a zero would have shown up among the first three cards); starting with four ones, we get this result 100% of the time. So we should now believe there is a 20% chance the last card is a zero and an 80% chance it is a one. This is the same formula I posted above, (X + 1)/102: let K = number of observed 1's = 3 and N = number of observations = 3; then (K + 1)/(N + 2) = 4/5 = 80% is the chance the last card is a 1.

It would be different if we knew something about how the cards were assigned a one or a zero. For example, if we thought they were assigned by flips of a fair coin, then 3 ones is initially four times as likely as 4 ones. That exactly offsets the information from the observations, so we think there is a 50% probability that the last card is a one. That's not a coincidence: since we know how the ones and zeros were assigned, we know there's a 50% probability that the last card is a one regardless of what observations we make. If we do the calculation for any set of observations, we find the probability is 50%.
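Both cases can be verified by brute-force enumeration. The sketch below (Python; the weighting scheme and helper names are my own, not from the thread) enumerates every assignment of zeros and ones to four cards and every deal order, conditions on the first three dealt cards being ones, and recovers both answers exactly:

```python
from fractions import Fraction
from itertools import permutations, product
from math import comb

def p_last_is_one(weight):
    # weight: prior probability weight assigned to each specific deck.
    num = den = Fraction(0)
    for deck in product((0, 1), repeat=4):
        w = weight(deck)
        for order in permutations(range(4)):
            dealt = tuple(deck[i] for i in order)
            if dealt[:3] == (1, 1, 1):  # condition on three ones dealt
                den += w
                num += w * dealt[3]
    return num / den

# "No idea where the data came from": the two surviving hypotheses
# (3 ones, 4 ones) equally likely, weight split evenly among the
# decks within each hypothesis.
equal = p_last_is_one(
    lambda d: Fraction(1, comb(4, sum(d))) if sum(d) in (3, 4) else Fraction(0)
)
# Fair-coin assignment: all 16 decks equally likely.
coin = p_last_is_one(lambda d: Fraction(1))
print(equal, coin)  # 4/5 and 1/2
```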