Simple Probability Question

HP · #1 05-03-2007, 05:32 AM

This has no poker in it so if mods want to move this to SMP I understand

Suppose you have some data on a variable. The only possible values for the variable are 0 and 1

You have no idea where this data comes from

We get 100 data points. We end up with X 1's and (100-X) 0's. There is a 101st data point, but we haven't been told what it is yet

Call the chance that the 101st data point is a 1 'P'

How do we find the probability distribution for P?

jason1990 · #2 05-03-2007, 08:38 AM

There are multiple approaches to this question. One approach is to regard P as a uniform random variable on (0,1), and to condition on the 100 results in order to refine the distribution of P. Instead of P, let me use U, since P stands for "probability." In this case, we want

P(U < x | X = k) = P(U < x, X = k)/P(X = k).

The numerator and denominator can be calculated from the observation that

P(X = k | U = p) = C(100,k)p^k(1 - p)^{100-k},

where C(n,k) is the choose function. Therefore,

P(U < x, X = k) = \int_0^x C(100,k)p^k(1 - p)^{100-k}dp

and

P(X = k) = \int_0^1 C(100,k)p^k(1 - p)^{100-k}dp = 1/101.

Taking the quotient and differentiating with respect to x gives the conditional density of U, given X = k. In other words, if you observe k 1's, then the density of U is

f(x) = 101*C(100,k)x^k (1 - x)^{100-k}.

That is, U has the Beta distribution with parameters k + 1 and 101 - k.
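
As a quick numerical check of the formulas above, here is a minimal Python sketch using only the standard library. The function names and the example count k = 37 are illustrative assumptions, not part of the derivation. It evaluates f(x) = 101*C(100,k)x^k (1 - x)^{100-k}, checks with a midpoint Riemann sum that the density integrates to about 1, and compares the numerical mean with (k + 1)/102, the mean of a Beta distribution with parameters k + 1 and 101 - k.

```python
from math import comb

N = 100  # number of observed data points, as in the thread


def posterior_density(x, k, n=N):
    """Density of U given X = k under a uniform prior: (n+1)*C(n,k)*x^k*(1-x)^(n-k)."""
    return (n + 1) * comb(n, k) * x**k * (1 - x) ** (n - k)


def midpoint_integral(f, steps=20_000):
    """Approximate the integral of f over (0, 1) with the midpoint rule."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))


k = 37  # hypothetical number of 1's, chosen only for illustration
total = midpoint_integral(lambda x: posterior_density(x, k))
mean = midpoint_integral(lambda x: x * posterior_density(x, k))

print(f"integral of density ~ {total:.6f}")
print(f"posterior mean ~ {mean:.6f}, (k + 1)/102 = {(k + 1) / 102:.6f}")
```

Under this uniform prior, the posterior mean (k + 1)/102 is also the probability that the 101st data point is a 1 given the first 100. It is close to, but not the same as, the naive estimate k/100.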

KipBond · #3 05-03-2007, 08:59 AM

My guess: P=X/100
Look forward to seeing the right answer, though. :)

EDIT: I didn't see the previous reply before I replied.

HP · #4 05-03-2007, 09:27 AM

ty jason

edit: btw jason do you mind briefly describing the other approaches you alluded to?

jason1990 · #5 05-03-2007, 09:47 AM

What I wrote is a Bayesian estimation using a uniform distribution as the "prior." A Bayesian estimation is the only method which will give you a probability distribution for P (or U, in my notation). Alternatives within the Bayesian framework would be given by simply choosing a different prior. (In practice, however, I doubt there are many so-called Bayesian statisticians that would choose something non-uniform in the situation you described.) If you choose a different prior, so that the unconditioned density of U is g(p), then the formulas change to

P(U < x, X = k) = \int_0^x C(100,k)p^k(1 - p)^{100-k}g(p)dp

and

P(X = k) = \int_0^1 C(100,k)p^k(1 - p)^{100-k}g(p)dp.

The conditional density of U, given X = k, is then proportional to

C(100,k)x^k(1 - x)^{100-k}g(x),

but the constant of proportionality may not be easy to compute.
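
When the constant has no convenient closed form, one option is to compute it numerically. The sketch below is an illustration under assumptions that are not in the post: the prior g(p) = 6p(1 - p) is a hypothetical choice, the count k = 37 is made up, and the helper names are mine. It normalizes C(100,k)x^k(1 - x)^{100-k}g(x) with a midpoint Riemann sum.

```python
from math import comb

N = 100  # number of observed data points


def prior(p):
    """Hypothetical non-uniform prior density g(p) = 6p(1-p) on (0, 1)."""
    return 6.0 * p * (1.0 - p)


def unnormalized_posterior(x, k, n=N):
    """C(n,k) * x^k * (1-x)^(n-k) * g(x), proportional to the posterior density."""
    return comb(n, k) * x**k * (1 - x) ** (n - k) * prior(x)


def midpoint_integral(f, steps=20_000):
    """Approximate the integral of f over (0, 1) with the midpoint rule."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))


k = 37  # hypothetical observed count of 1's
evidence = midpoint_integral(lambda x: unnormalized_posterior(x, k))  # P(X = k)


def posterior_density(x):
    """Normalized conditional density of U given X = k."""
    return unnormalized_posterior(x, k) / evidence


posterior_mean = midpoint_integral(lambda x: x * posterior_density(x))
print(f"P(X = k) ~ {evidence:.6f}")
print(f"posterior mean ~ {posterior_mean:.6f}")
```

This particular prior can be checked by hand, because 6p(1 - p) is the Beta density with parameters 2 and 2. The posterior is then Beta with parameters k + 2 and 102 - k, whose mean is (k + 2)/104. For a prior outside the Beta family, the same numerical approach still applies, though the grid may need to be finer.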

Another approach is to use "classical" estimation techniques, by computing confidence intervals and so forth. But this will not produce a probability distribution for U.
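
For comparison, here is a sketch of one classical interval estimate, the Wilson score interval for a binomial proportion. The choice of this particular interval, the 95% level (z = 1.96), and the example count are my assumptions; the post does not name a specific method. The output is an interval for a fixed, unknown parameter, not a distribution for it.

```python
from math import sqrt


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width


k, n = 37, 100  # hypothetical data: 37 ones out of 100
low, high = wilson_interval(k, n)
print(f"approximate 95% interval for P: ({low:.3f}, {high:.3f})")
```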

carbone1853 · #6 05-05-2007, 12:48 AM

The question you asked is not so simple, as some of jason1990's comments point out. It is not possible to "find the probability distribution for P" in the same way you might find the probability of tossing a coin and getting 2 heads in a row. The classical approach would be to assume a distribution for P and then find its parameters, such as the mean and variance. With the Bayesian method, as given above, you must assume a prior distribution. There are also other ways... 1) You could form this as a hypothesis testing problem. (You want to be 95% sure the distribution is the X distribution.) 2) A crude method would be to plot your data in a histogram and see if it looks like a particular distribution.

Anyway this is not such a simple question.

jason1990 · #7 05-05-2007, 09:55 AM

[ QUOTE ]
The question you asked is not so simple, as some of jason1990's comments point out.

[/ QUOTE ]
I did not mean to imply the question is simple. It is relatively simple, though, compared to many problems statisticians work on.

[ QUOTE ]
The classical approach would be to assume a distribution for P and then find its parameters.

[/ QUOTE ]
I suppose this could depend on your definition of "classical." But by the classical definition of "classical," the classical approach is to assume P itself is a parameter. In that case, it does not have a distribution. Assuming P has a distribution is assuming P is a random variable. This is a distinctly Bayesian approach. Given OP's question ("How do we find the probability distribution for P?"), it makes sense to do this. But this would not normally be called a classical approach.

[ QUOTE ]
With the Bayesian method, as given above, you must assume a prior distribution. There are also other ways... 1) You could form this as a hypothesis testing problem. (You want to be 95% sure the distribution is the X distribution.)

[/ QUOTE ]
Perhaps you could elaborate on this, because it does not make sense to me as stated. If you are going to hypothesize that P has some distribution, and then you want to test this hypothesis, then the distribution you are testing must be a prior distribution. (Otherwise, you have no data to do your test.) If your test confirms the hypothesis, then you must still go through the procedure of calculating the posterior distribution in order to answer OP's question.

But even the initial testing step looks suspect. Consider the hypothesis that P has a uniform distribution. Under this hypothesis, all numbers of 1's are equally likely to show up in the data set. How many 1's have to appear for you to reject this hypothesis?
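
The claim that every count is equally likely under a uniform prior can be checked exactly. The sketch below (the helper name is my own) expands (1 - p)^{100-k} with the binomial theorem, integrates term by term using exact fractions, and tests whether P(X = k) = 1/101 for every k from 0 to 100.

```python
from fractions import Fraction
from math import comb

N = 100  # number of observed data points


def prob_count(k, n=N):
    """Exact P(X = k): integral over (0, 1) of C(n,k) p^k (1-p)^(n-k) dp, uniform prior.

    Expands (1-p)^(n-k) as the sum over j of C(n-k, j) (-p)^j and integrates
    each power of p separately, so the result is an exact fraction.
    """
    integral = sum(
        Fraction((-1) ** j * comb(n - k, j), k + j + 1) for j in range(n - k + 1)
    )
    return comb(n, k) * integral


probs = [prob_count(k) for k in range(N + 1)]
# True exactly when every count from 0 to 100 has probability 1/101
print(all(p == Fraction(1, N + 1) for p in probs))
```

Since no count is more surprising than any other under this hypothesis, the count by itself offers no obvious rejection region.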

[ QUOTE ]
2) A crude method would be to plot your data in a histogram and see if it looks like a particular distribution.

[/ QUOTE ]
The data are 1's and 0's. A histogram would just show the numbers of 1's and 0's. The distribution would look like a Bernoulli with parameter X/100. This would approximate the distribution of the data, given P. It would not approximate the distribution of P.

HP · #8 05-06-2007, 01:28 AM

I agree with jason here.

It seems the only logical method is to assume P has some kind of distribution before looking at the data, then simply use the Bayesian method to get the new distribution for P
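
This prior-then-update recipe can be sketched in a few lines when the prior is a Beta distribution, since a Beta prior combined with 0/1 data gives a Beta posterior. This is an illustration rather than something from the posts above: the data list is made up, and `update` is a hypothetical helper. Starting from Beta(1, 1), which is the uniform prior, each observed 1 adds one to the first parameter and each 0 adds one to the second.

```python
from fractions import Fraction


def update(a, b, observation):
    """One Bayesian update of a Beta(a, b) belief about P after seeing a 0 or a 1."""
    if observation == 1:
        return a + 1, b
    return a, b + 1


# Made-up data for illustration: 37 ones and 63 zeros (the order does not matter).
data = [1] * 37 + [0] * 63

a, b = 1, 1  # Beta(1, 1) is the uniform prior on (0, 1)
for x in data:
    a, b = update(a, b, x)

k = sum(data)
print(f"posterior: Beta({a}, {b})")             # parameters k + 1 and 101 - k
print(f"posterior mean: {Fraction(a, a + b)}")  # equals (k + 1)/102
```

Starting from a different Beta prior only changes the initial values of `a` and `b`; the update step stays the same.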

carbone1853 · #9 05-09-2007, 12:39 PM

Sorry for the delay but I’ve been out of town.

HP, since your original question was “How do we find the probability distribution for P?” you have already made the assumption that P has a distribution, so Bayesian methods would be the way to go. I was talking about the other stuff because there are other ways to look at the problem.

To answer Jason’s question:

As a hypothesis testing problem we have a set of data consisting of 0’s and 1’s and we want to know if the 0’s and 1’s are of equal prob. So, formulating the problem,

H1: The 0’s and 1’s have equal probability.
H0: The 0’s and 1’s do not have equal probability.

We will accept H1 if the number of ones falls in a range that has probability of about 0.95 under H1.
(Note: 0.95 is an arbitrary but somewhat common level.)

To make the calculations simple, it is common to use the normal approximation to the binomial distribution. Using this approximation, the mean of the Normal is N*p and the variance is N*p*(1-p). In our case N (the number of data points) is 100 and p = 0.5, where p is the probability of a one under H1.

So under H1:
We have a Normal prob distribution with mean 50 and var of 25.
If we integrate the Normal from 40 to 60 we get about 0.95.

So,

Accept H1 if the number of ones is between 40 and 60.
Accept H0 if the number of ones is less than 40 or greater than 60.
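
The 40 to 60 region can be checked numerically. The sketch below (function and constant names are mine) computes the probability of that region when p = 0.5 in two ways: with the normal approximation described above, and with the exact binomial distribution, which the post does not use but which makes a natural cross-check.

```python
from math import comb, erf, sqrt

N, P_EQUAL = 100, 0.5                     # sample size and the equal-probability value of p
MEAN = N * P_EQUAL                        # N*p = 50
SD = sqrt(N * P_EQUAL * (1 - P_EQUAL))    # sqrt(N*p*(1-p)) = sqrt(25) = 5


def normal_cdf(x, mean, sd):
    """CDF of a Normal distribution with the given mean and standard deviation."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


def binomial_pmf(k, n, p):
    """Probability of exactly k ones in n independent trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


normal_prob = normal_cdf(60, MEAN, SD) - normal_cdf(40, MEAN, SD)
exact_prob = sum(binomial_pmf(k, N, P_EQUAL) for k in range(40, 61))

print(f"normal approximation, 40 to 60: {normal_prob:.4f}")
print(f"exact binomial, 40 <= X <= 60: {exact_prob:.4f}")
```

The two numbers differ slightly because the exact sum includes the endpoints 40 and 60 as whole outcomes, while the normal integral treats the count as continuous.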

MtDon · #10 05-10-2007, 01:07 AM

I haven't studied probability since college, so I guess I must be missing something.

You have a set of 100 data points of 0's and 1's.

It is known that 0 and 1 are the only values that can occur, or specifically that the 101st data point must be either a 0 or a 1.

You don't know where the data points came from.

There are X 1's, 100-X 0's

P is the probability that the 101st data point is a 1.

The question is what the probability distribution of P is.

A general answer would need to consider all possible values of X - correct?

X can range in value from 0 to 100 - correct?

P is unknown - correct?

Or is P a parameter like X?

Assuming that P is unknown, rather than a parameter, it seems to me that the best estimate for the probability P is either X/100, or else it is indeterminable.

So wouldn't the best estimate of the probability distribution of P be represented in table form as:

variable:    0              1
P:           (100-X)/100    X/100

Or else say it was impossible to say anything about the probability distribution of P.

What am I missing?

-- Don