Two Plus Two Newer Archives  

  #21  
05-11-2007, 10:35 PM
HP
Senior Member
 
Join Date: Oct 2004
Location: DZ-015
Posts: 2,783
Re: Simple Probability Question

ty, interesting
  #22  
05-12-2007, 02:12 PM
TomCowley
Senior Member
 
Join Date: Sep 2004
Posts: 354
Re: Simple Probability Question

Aaron: I agree with your method and results for exactly 101 cards, but I'm not sure that assumption is valid. Assuming the data is sequential uncorrelated observations from the same distribution (which seems quite reasonable to me), then it's equivalent to ask "Given these first 100 observations, what is the probability observation #102 will be a 1?" The answer must be the same as for #101.

If you approach this with combinatorics for 102 cards (100 with X 1's, then 11 or 10 or 01 or 00) as you did with 101 cards, you don't get the same answer (unless my algebra was hideously wrong twice), so there's an issue somewhere.
  #23  
05-12-2007, 02:19 PM
HP
Senior Member
 
Join Date: Oct 2004
Location: DZ-015
Posts: 2,783
Re: Simple Probability Question

[ QUOTE ]
Assuming the data is sequential uncorrelated observations from the same distribution

[/ QUOTE ]

for the record, this is indeed the case I am interested in
  #24  
05-12-2007, 04:16 PM
TomCowley
Senior Member
 
Join Date: Sep 2004
Posts: 354
Re: Simple Probability Question

If we remove the "small-deck" bias, then the answer appears to be as follows (i is the total number of observations possible, J is the total number of 1s in those i observations, and c is the choose operator, e.g. 5c2 = 5!/(2!(5-2)!) = 10).

The expected number of 1s in the i observations is this fraction (from combinatorics, with an NcX cancelled out of every term in the numerator and denominator):

Numerator = Sum from J=X to i-(N-X) of J*((i-N)c(J-X))/(icJ)
Denominator = Sum from J=X to i-(N-X) of ((i-N)c(J-X))/(icJ)

So from this, P is the expected number of 1s left divided by the number of observations left,

(fraction - X)/(i - N)

taken in the limit as i -> Inf.

I don't have any tools that can calculate this for big i, but it shouldn't be hard for somebody with a math program to get some numbers. For the i=101 case that Aaron solved, my formula gives

numerator = X/(101cX) + (X+1)/(101c(X+1))
denominator = 1/(101cX) + 1/(101c(X+1))

Cancelling a lot of factorials gives the fraction as

(103X+1)/102

That's the expected total number of 1s in the 101 observations. Given that we know X are in the first 100, that means

(103X+1)/102 - X = (X+1)/102 1s are expected in the remaining (one) observation, so P = ((X+1)/102)/1 = (X+1)/102, as expected, so I think my formula's right. I'm expecting it to approach P = X/N as i -> Inf, but we'll see.
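
For anyone with a math program handy, here is a minimal Python sketch of the formula above (the function name and the sample values of N, X, and i are illustrative choices, not from the thread; as I read the derivation, the weights correspond to a uniform prior over the total count J). Exact rational arithmetic keeps the huge binomial coefficients from losing precision:

[ CODE ]
from fractions import Fraction
from math import comb

def p_next_one(N, X, i):
    # TomCowley's fraction: expected total number of 1s among i observations,
    # given X ones in the first N; then P = (fraction - X)/(i - N).
    num = Fraction(0)
    den = Fraction(0)
    for J in range(X, i - (N - X) + 1):
        w = Fraction(comb(i - N, J - X), comb(i, J))  # ((i-N)c(J-X))/(icJ)
        num += J * w
        den += w
    fraction = num / den                # expected total 1s in all i observations
    return (fraction - X) / (i - N)     # expected 1s per remaining observation

N, X = 100, 30
for i in (101, 102, 110, 200, 500):
    p = p_next_one(N, X, i)
    print(i, p, float(p))
[/ CODE ]

For i = 101 this returns (X+1)/102 = 31/102, matching Aaron's result, and larger i lets you probe the i -> Inf behavior directly.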
  #25  
05-12-2007, 05:33 PM
carbone1853
Junior Member
 
Join Date: May 2007
Posts: 5
Re: Simple Probability Question

"I'm expecting it to approach P=X/N as i->inf"
I think that will be the case. If

P = E[a single observation] (i.e., the probability of a 1)
X = the total number of 1s observed
N = the total number of observations (0s and 1s)

then by the law of large numbers,

X/N -> P

The law of large numbers is the probability theorem that says the sample average converges to the expected value.
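
A quick simulation illustrates the convergence (a minimal sketch; the true probability p, the seed, and the sample sizes are arbitrary choices):

[ CODE ]
import random

p = 0.3                  # true probability of a 1 (arbitrary choice)
random.seed(1)
ones, n = 0, 0
for target in (100, 10_000, 1_000_000):
    while n < target:
        ones += random.random() < p   # draw one Bernoulli(p) observation
        n += 1
    print(n, ones / n)                # the sample average X/N settles toward p
[/ CODE ]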
  #26  
05-13-2007, 11:37 PM
PairTheBoard
Senior Member
 
Join Date: Dec 2003
Posts: 3,460
Re: Simple Probability Question

[ QUOTE ]
Here is a little more on this topic. Imagine the data is arriving sequentially in time. If you want, you could update your prior after each data point. So start with your original prior. After one data point, compute the posterior. Now use your posterior as a new prior. Get another data point and compute again. And so on. This is equivalent to just doing one computation at the end.

So your idea of using the Beta as a prior and recalculating is equivalent to just taking an identical copy of the data set and appending it to the old data, creating twice as much data with the same proportion of 1's and 0's.

Therefore, in fact, it would be a bad property if you got the same Beta distribution again. It would mean that the method does not produce better and better estimates as the sample size grows. What actually happens, as I described above, is that if the sample size grows and the proportion remains the same, then the method produces distributions which converge to the point mass at that proportion.

[/ QUOTE ]

Thanks. That makes sense. As I thought more about it I had a hunch that the rinse and repeat method I described would probably converge to a point mass. But I didn't realize it amounted to what you describe.

PairTheBoard
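
The sequential-versus-batch equivalence described in the quote above is easy to check numerically. With a Beta(a, b) prior, each observed 1 sends (a, b) to (a+1, b) and each 0 sends it to (a, b+1), so updating point-by-point or all at once lands on the same posterior. A minimal sketch (the Beta(1,1) starting prior and the sample data are illustrative choices):

[ CODE ]
data = [1, 0, 1, 1, 0, 1]        # illustrative 0/1 observations

# Sequential updating: the posterior after each point becomes the next prior.
a, b = 1, 1                      # Beta(1, 1), i.e. a uniform prior
for x in data:
    a, b = a + x, b + (1 - x)

# Batch updating: one computation at the end.
a2 = 1 + sum(data)
b2 = 1 + len(data) - sum(data)

assert (a, b) == (a2, b2)        # identical Beta posteriors
print(a, b)                      # Beta(5, 3) for this data
[/ CODE ]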
  #27  
05-14-2007, 09:00 PM
AaronBrown
Senior Member
 
Join Date: May 2005
Location: New York
Posts: 2,260
Re: Simple Probability Question

[ QUOTE ]
Aaron: I agree with your method and results for exactly 101 cards, but I'm not sure that assumption is valid. Assuming the data is sequential uncorrelated observations from the same distribution (which seems quite reasonable to me), then it's equivalent to ask "Given these first 100 observations, what is the probability observation #102 will be a 1?" The answer must be the same as for #101.

If you approach this with combinatorics for 102 cards (100 with X 1's, then 11 or 10 or 01 or 00) as you did with 101 cards, you don't get the same answer (unless my algebra was hideously wrong twice), so there's an issue somewhere.

[/ QUOTE ]
The trouble is "serially uncorrelated observations from the same distribution" is not as well-defined as it sounds. That's why I like my shuffled deck of 101 cards. That's well-defined. It certainly meets the definition of the problem, so the well-defined answer to that question must be at least an admissible answer to the original problem.

As I turn the cards over one at a time, are the results "serially uncorrelated" and from "the same distribution"? They are from the point of view of a person with no opinion about the number of zeros and ones on the cards. But to someone who knows how many ones are in the deck, each observation is from a different distribution (since each card revealed changes the odds of getting a one on the next card), and of course they have an inverse correlation.

I would say that in order to clearly define what you mean by "serially uncorrelated" and "from the same distribution" you have to know or assume something about the data generating process. The only assumption I want to make is that each observation is like the others; that is all the ones and zeros were written down in advance and shuffled randomly.

In the case of 102 cards in the deck, it must have started with X, X+1, or X+2 ones. The relative probabilities of observing X ones among the first 100 observations are (1 - X/102)*(1 - X/101), 2*(1 - X/101)*(X + 1)/102, and (X + 2)*(X + 1)/(102*101).

These add to 103/101, so I multiply each by 101/103: the probability of two zeros left is (102 - X)*(101 - X)/(102*103), of one one and one zero is 2*(101 - X)*(X + 1)/(102*103), and of two ones is (X + 1)*(X + 2)/(102*103).

That makes the chance that the 101st card (or the 102nd, but only computing after the first 100 cards have been dealt) is a one equal to half the mixed case plus all of the two-ones case, since in the mixed case the one is equally likely to be in either position: (X + 1)*(101 - X)/(102*103) + (X + 1)*(X + 2)/(102*103) = (X + 1)*[(101 - X) + (X + 2)]/(102*103) = (X + 1)*103/(102*103) = (X + 1)/102, the same as with the 101-card deck.
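
This arithmetic is easy to verify in exact rationals (a minimal sketch; the function name is an illustrative choice):

[ CODE ]
from fractions import Fraction

def p_101st_is_one(X):
    # Relative weights for the 102-card deck holding X, X+1, or X+2 ones,
    # given X ones observed in the first 100 cards.
    two_zeros = Fraction((102 - X) * (101 - X), 102 * 101)
    mixed     = Fraction(2 * (101 - X) * (X + 1), 102 * 101)
    two_ones  = Fraction((X + 2) * (X + 1), 102 * 101)
    total = two_zeros + mixed + two_ones        # adds to 103/101
    # Card 101 is a one in half the mixed cases and in all two-ones cases.
    return (mixed / 2 + two_ones) / total

for X in range(101):
    assert p_101st_is_one(X) == Fraction(X + 1, 102)
print("102-card deck gives (X+1)/102 for every X, same as the 101-card deck")
[/ CODE ]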
  #28  
05-15-2007, 02:48 AM
TomCowley
Senior Member
 
Join Date: Sep 2004
Posts: 354
Re: Simple Probability Question

My algebra's awful.

Sequential uncorrelated observations are common in the physical sciences (and simulations of physical sciences) when studying random processes/methods. You really are sampling the same distribution over and over again and can continue to do so at will.
  #29  
05-15-2007, 10:57 AM
AaronBrown
Senior Member
 
Join Date: May 2005
Location: New York
Posts: 2,260
Re: Simple Probability Question

[ QUOTE ]
Sequential uncorrelated observations are common in the physical sciences (and simulations of physical sciences) when studying random processes/methods. You really are sampling the same distribution over and over again and can continue to do so at will.

[/ QUOTE ]
It's trickier than you might think.

Suppose you are giving a pill to patients with a well-defined condition, and it either kills or cures them instantly. You might consider these serially uncorrelated observations. But are they?

Consider three views of probability (and there are many others). You might consider the results physically random, as people sometimes do in quantum mechanics. In that case, it's only an assumption that the distribution is constant and the results uncorrelated. It either is or it isn't based on physical reality, and you don't know which unless you know a lot about the process, and you said that wasn't true.

Or you might view the outcome as deterministic, but based on unknown factors. For example, the pill might cure people who acquired the condition through behavior, but not people who inherited the condition. You might eventually notice that in your study, but say you don't know it yet. That factor might not have a constant distribution. Perhaps the behavior is increasing, so you are more likely to get acquired conditions as time goes by. Perhaps the people who acquired the condition have other similarities, so you tend to get them in bunches. Perhaps word will spread through networks and the congenital people will learn from experience that the pill kills while the acquired people will learn that it cures; so your trials will self-select to be successful.

Third, you could say that you're not concerned about physical reality at all, only your own subjective knowledge. The phenomena might be random or deterministic; all you're concerned about is making decisions about whether or not to give the pill, knowing only what you know. In that case, each observation changes your subjective belief, which changes the interpretation of future observations. You assume the observations are serially uncorrelated from the same distribution, but this is not true of your knowledge. That is, before you make the first observation, you might bet with even odds that the first pill will cure the patient. But you wouldn't say the probability is 1/1,024 that the first ten pills will cure the first ten patients. You think the probability is much higher than that, because each cure increases your subjective assessment of the probability of future cures. So you assume the observations are independent in order to compute your belief, but this is not true of your subjective belief.
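
To put numbers on that last point (an illustrative aside using Laplace's rule of succession with a uniform prior, which is one standard way to formalize this, not something stated in the thread): after k cures in k trials the rule assigns probability (k+1)/(k+2) to the next cure, so ten straight cures come out to 1/11, far more than 1/1,024.

[ CODE ]
from fractions import Fraction

# Chain the rule of succession: after k cures in k trials,
# P(next cure) = (k + 1)/(k + 2). The product telescopes to 1/11.
p = Fraction(1)
for k in range(10):
    p *= Fraction(k + 1, k + 2)
print(p)   # 1/11, versus (1/2)**10 = 1/1024 for independent coin flips
[/ CODE ]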

All of this gets exceedingly technical. That's why I like a simple, well-defined thought experiment like shuffling a deck of cards. It gives an unambiguous answer, that we know is right for at least one interpretation of the question. The other approaches are a lot of work and might end up with an absurd answer. If you find these subtleties fascinating, as I do, it's worth slogging through the hard stuff. But you asked a simple question, so I think the simple answer is best.
  #30  
05-15-2007, 04:39 PM
TomCowley
Senior Member
 
Join Date: Sep 2004
Posts: 354
Re: Simple Probability Question

I wouldn't have questioned your answer if I could have added, subtracted, and multiplied. Only coming up with a different answer for 102 cards (twice, ugh) made me think there was an issue.

[ QUOTE ]
Consider three views of probability (and there are many others). You might consider the results physically random, as people sometimes do in quantum mechanics. In that case, it's only an assumption that the distribution is constant and the results uncorrelated. It either is or it isn't based on physical reality, and you don't know which unless you know a lot about the process, and you said that wasn't true.

[/ QUOTE ]

OP said he didn't know (and in my background, which is coincidentally computational random-walk quantum work, it really is true as long as you look at data points far enough apart). I would say that as long as you know nothing else about the data points, all the objective views are roughly identical. The frequency can't be changing over your sample time, because that would violate the original premise of the same distribution (which, in itself, means you have to know something about the data collection for this to be true in the real world, so it's a bit contradictory). If it could change, the data analysis would get a whole lot harder.

If the data is deterministic but based on an unknown factor, then as long as you don't know which data points belong to which factor, the correlation to a factor can't change your expectation of the next answer. Of course the answer you come up with may have no relationship to reality because of (assumed constant) sampling bias, but it gives a fair estimate of what you'd expect if you continued with your sampling method. And obviously that's why medical trials actively try to make sure enough "different groups" are represented, because in reality different groups can be affected quite differently, and any sample average isn't nearly as meaningful.

As far as subjective belief goes... the inability to accurately process sequences of random events into an accurate, objective picture is one of the big reasons poker is profitable. :) I'm quite glad if their temporary subjective belief is wrong, and even happier if they never figure it out.