
View Full Version : Probability question


Oct0puz
05-09-2007, 05:23 PM
Let's say that you have a random sequence of M bits (0101....0110).
What is the probability that this sequence contains at least one predetermined sequence of N bits, N<M?

thylacine
05-09-2007, 05:46 PM
[ QUOTE ]
Let's say that you have a random sequence of M bits (0101....0110).
What is the probability that this sequence contains at least one predetermined sequence of N bits, N<M?

[/ QUOTE ]

Define exactly what you mean by "contains" here.

Also consider

N=2

01

11

Oct0puz
05-09-2007, 05:57 PM
For example: The predetermined sequence is a=111
The random sequence 1110111 contains a twice.
The random sequence 1100101 does not contain a.

edit: 1111000 also contains a twice if that makes any difference.

bigpooch
05-09-2007, 06:04 PM
You have to be a bit more specific about the predetermined
sequence to get a precise answer. [One thing we can say is
that the probability is bounded above by (M-N+1)/(2^N) which
is not hard to show.]

As an example, consider M=4. For the strings of length N=2,
"01" occurs in 11 of the 2^M possible sequences whereas
"00" only occurs in 8 of them (and by symmetry, "10" occurs
in 11 and "11" occurs in only 8).
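These M=4, N=2 counts are easy to verify by brute force. A short sketch (not from the thread; plain Python enumerating all 2^M strings) reproduces them and checks the (M-N+1)/2^N upper bound:

```python
from itertools import product

# enumerate all 2^M binary strings of length M=4 and count, for each
# pattern of length N=2, how many strings contain it as a substring
M, N = 4, 2
strings = ["".join(bits) for bits in product("01", repeat=M)]
patterns = ["".join(bits) for bits in product("01", repeat=N)]
counts = {p: sum(p in s for s in strings) for p in patterns}

print(counts)  # {'00': 8, '01': 11, '10': 11, '11': 8}

# the upper bound (M-N+1)/2^N corresponds to (M-N+1)*2^(M-N) = 12 strings
bound = (M - N + 1) * 2 ** (M - N)
assert all(c <= bound for c in counts.values())
```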

Oct0puz
05-09-2007, 06:35 PM
[ QUOTE ]
You have to be a bit more specific about the predetermined
sequence to get a precise answer. [One thing we can say is
that the probability is bounded above by (M-N+1)/(2^N) which
is not hard to show.]

As an example, consider M=4. For the strings of length N=2,
"01" occurs in 11 of the 2^M possible sequences whereas
"00" only occurs in 8 of them (and by symmetry, "10" occurs
in 11 and "11" occurs in only 8).

[/ QUOTE ]

What if we assume that the predetermined sequence is just randomly generated and we don't know what it is? Or does this get too complicated?

bigpooch
05-09-2007, 07:25 PM
For the general answer, I wouldn't know what ideas to use
to find the closed-form solution for all M and N. On the
other hand, the more we know about the sequence and about
M and N, the better the bounds we might be able to find.

PairTheBoard
05-09-2007, 08:32 PM
I could be wrong, but I suspect there's an easy combinatoric solution once you look at it the right way. My problem is I never seem to see the right way to look at it very easily. My brute force approach would be to look at the probability that the string "a" does Not occur in the first N places of the random string "R" of length M. Call that Event E1. It's easy to compute P(E1). Then look at similarly defined events E2,E3,... and multiply the P(Ei) together. The trouble is, those Events are not independent. You really need,

P(E1)P(E2|E1)P(E3|E1,E2)...P(E_(M-N+1)|E1,E2,...,E_(M-N))

which looks intractable.

There's got to be a better way to look at it. I don't know though. You might have to get into the Group theory for permutations or something. Combinatoric guys know all that stuff.

PairTheBoard

arahant
05-10-2007, 12:25 AM
[ QUOTE ]
Let's say that you have a random sequence of M bits (0101....0110).
What is the probability that this sequence contains at least one predetermined sequence of N bits, N<M?

[/ QUOTE ]


Pull out N bits.
Count the remaining possible sequences.

The number of sequences containing the N-bit pattern is (M-N+1)*2^(M-N).
The total number of sequences is 2^M.
The answer should be (M-N+1)*2^(M-N)/2^M...no?
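One caveat worth checking (an added check, not part of the thread): (M-N+1)*2^(M-N) counts a string once per occurrence position, so strings where the pattern fits in more than one position are counted with multiplicity, and the formula overshoots. Bigpooch's M=4, N=2 counts show this directly:

```python
from itertools import product

M, N = 4, 2
claimed = (M - N + 1) * 2 ** (M - N)   # 12 of the 16 strings, i.e. 12/16

# actual number of length-M strings containing each pattern, by enumeration
actual = {pat: sum(pat in "".join(bits) for bits in product("01", repeat=M))
          for pat in ("00", "01")}
print(actual, claimed)  # {'00': 8, '01': 11} versus the claimed 12
```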

Oct0puz
05-10-2007, 07:32 AM
[ QUOTE ]
I could be wrong, but I suspect there's an easy combinatoric solution once you look at it the right way. My problem is I never seem to see the right way to look at it very easily. My brute force approach would be to look at the probability that the string "a" does Not occur in the first N places of the random string "R" of length M. Call that Event E1. It's easy to compute P(E1). Then look at similarly defined events E2,E3,... and multiply the P(Ei) together. The trouble is, those Events are not independent. You really need,

P(E1)P(E2|E1)P(E3|E1,E2)...P(E_(M-N+1)|E1,E2,...,E_(M-N))

which looks intractable.

There's got to be a better way to look at it. I don't know though. You might have to get into the Group theory for permutations or something. Combinatoric guys know all that stuff.

PairTheBoard

[/ QUOTE ]

If we don't know what the N-bit sequence looks like, don't you think that we can treat the events as independent?

PairTheBoard
05-10-2007, 11:50 PM
[ QUOTE ]
[ QUOTE ]
I could be wrong, but I suspect there's an easy combinatoric solution once you look at it the right way. My problem is I never seem to see the right way to look at it very easily. My brute force approach would be to look at the probability that the string "a" does Not occur in the first N places of the random string "R" of length M. Call that Event E1. It's easy to compute P(E1). Then look at similarly defined events E2,E3,... and multiply the P(Ei) together. The trouble is, those Events are not independent. You really need,

P(E1)P(E2|E1)P(E3|E1,E2)...P(E_(M-N+1)|E1,E2,...,E_(M-N))

which looks intractable.

There's got to be a better way to look at it. I don't know though. You might have to get into the Group theory for permutations or something. Combinatoric guys know all that stuff.

PairTheBoard

[/ QUOTE ]

If we don't know what the N-bit sequence looks like, don't you think that we can treat the events as independent?

[/ QUOTE ]

That might be the case and there might be a way to look at it so as to see it easily. Say you denote "a" by

"a" = (a_1,a_2,...,a_N)

Then I suppose you could argue that the a_i are IID Bernoulli random variables with parameter 0.5. Also denote the longer string by say B, with

B = (B_1,...,B_M) also IID Bernoulli random variables with parameter 0.5.

Then the events
E1 = { a_i <> B_i for some i = 1...N }
E2 = { a_i <> B_(i+1) for some i = 1...N }

Are those events independent? The condition B_(i+1) <> a_i should not depend on whether B_i <> a_i because B_(i+1) and B_i are independent. That's not much of a proof but I'm going to take it. Looks right to me.

Then P( (a_1,...,a_N) = (B_1,...,B_N) ) = (.5)^N
and P(E1) = 1-(.5)^N

Therefore,

P(E1)P(E2)...P(E_(M-N+1)) = [1-(.5)^N]^(M-N+1) for the probability of NoMatch.

Finally, the probability of 1 or more matches is:

1 - [1-(.5)^N]^(M-N+1)

Looks good to me. But I have trouble actually doing any math these days. I'm much better at watching others.

PairTheBoard
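A brute-force sanity check of this formula (an addition, not part of the thread): for small M and N we can enumerate every pair of a random N-bit pattern A and a random M-bit string B, compute the true no-match probability exactly, and compare it against the independence approximation [1-(.5)^N]^(M-N+1):

```python
from fractions import Fraction
from itertools import product

def exact_no_match(M, N):
    # exact P(A does not occur in B), averaged over all 2^N patterns A
    # and all 2^M strings B, by full enumeration
    patterns = ["".join(p) for p in product("01", repeat=N)]
    strings = ["".join(b) for b in product("01", repeat=M)]
    misses = sum(a not in b for a in patterns for b in strings)
    return Fraction(misses, len(patterns) * len(strings))

N, M = 3, 5
exact = exact_no_match(M, N)
approx = (1 - Fraction(1, 2 ** N)) ** (M - N + 1)
print(exact, approx)  # 85/128 versus 343/512: close, but not equal
```

So for N=3, M=5 the approximation gives 343/512 while the exact value is 85/128 = 340/512; the events are close to independent but not quite.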

jason1990
05-11-2007, 01:08 AM
[ QUOTE ]
If we don't know what the N-bit sequence looks like, don't you think that we can treat the events as independent?

[/ QUOTE ]
If I understand correctly what you mean, I do not think it is true. Here is a complete solution for N = 3.

Our short sequence is A = (a_1,a_2,a_3). Our long sequence is B = (b_1,...,b_M). Both are iid Bernoulli(1/2), and are independent of each other. Let B_j = (b_j,b_{j+1},b_{j+2}). First, some preliminary calculations:

P(A = B_j) = 1/8,
P(A = B_j = B_{j+1}) = 1/64,
P(A = B_j = B_{j+2}) = 1/64, and
P(A = B_j = B_{j+1} = B_{j+2}) = 1/128.

Thus,

P(A != B_1) = 7/8.

Also,

P(A != B_2, A != B_1) = 1 - P(A = B_1 or A = B_2)
= 1 - P(A = B_1) - P(A = B_2) + P(A = B_1 = B_2)
= 1 - 1/8 - 1/8 + 1/64 = 49/64.

This means

P(A != B_2 | A != B_1) = (49/64)/(7/8) = 7/8.

Similarly,

P(A != B_3, A != B_2, A != B_1) = 1 - 3/8 + 3/64 - 1/128 = 85/128

so that

P(A != B_3 | A != B_2, A != B_1) = (85/128)/(49/64) = 85/98.

(Notice that 85/98 != 7/8, which shows that these events are not independent.) Finally, notice that

P(A != B_j | A != B_{j-1}, ..., A != B_1)
= P(A != B_j | A != B_{j-1}, A != B_{j-2})
= P(A != B_3 | A != B_2, A != B_1) = 85/98.

Hence,

P(A does not appear in B)
= P(A != B_1)P(A != B_2 | A != B_1)
*\Product_{j=3}^{M-2} P(A != B_j | A != B_{j-1}, ..., A != B_1)
= (7/8)*(7/8)*(85/98)^{M-4}.

jason1990
05-11-2007, 01:47 AM
Scratch that. This is bunk:

[ QUOTE ]
P(A != B_j | A != B_{j-1}, ..., A != B_1)
= P(A != B_j | A != B_{j-1}, A != B_{j-2})
= P(A != B_3 | A != B_2, A != B_1) = 85/98.

[/ QUOTE ]
It is much more complicated than this. For example, consider

P(A != B_4 | A != B_3, A != B_2, A != B_1).

We have

[ QUOTE ]
P(A = B_j) = 1/8,
P(A = B_j = B_{j+1}) = 1/64,
P(A = B_j = B_{j+2}) = 1/64, and
P(A = B_j = B_{j+1} = B_{j+2}) = 1/128.

[/ QUOTE ]
Also,

P(A = B_1 = B_4) = 1/64,
P(A = B_1 = B_2 = B_4) = 1/256,
P(A = B_1 = B_3 = B_4) = 1/256, and
P(A = B_1 = B_2 = B_3 = B_4) = 1/256.

Hence,

P(A != B_4, A != B_3, A != B_2, A != B_1)
= 1 - 4/8 + 6/64 - 6/256 + 1/256 = 147/256.

From before,

[ QUOTE ]
P(A != B_3, A != B_2, A != B_1) = 1 - 3/8 + 3/64 - 1/128 = 85/128.

[/ QUOTE ]
Hence,

P(A != B_4 | A != B_3, A != B_2, A != B_1)
= (147/256)/(85/128) = 147/170.

I suspect that we could find the limit as j -> infty of

P(A != B_j | A != B_{j-1}, ..., A != B_1)

by solving a recurrence relation. This would give an asymptotic estimate for N = 3 and M large.
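These fractions can be double-checked by direct enumeration (a sketch added here, not from the thread): average over all 3-bit patterns A and all bit strings b_1..b_{k+2}, using exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

def p_no_match_first_k(k):
    # exact P(A != B_1, ..., A != B_k) for a random 3-bit pattern A and
    # random bits b_1..b_{k+2}, where B_j = (b_j, b_{j+1}, b_{j+2})
    misses = total = 0
    for a in product("01", repeat=3):
        A = "".join(a)
        for b in product("01", repeat=k + 2):
            B = "".join(b)
            total += 1
            if all(B[j:j + 3] != A for j in range(k)):
                misses += 1
    return Fraction(misses, total)

p3 = p_no_match_first_k(3)
p4 = p_no_match_first_k(4)
print(p3, p4, p4 / p3)  # 85/128, 147/256, and the conditional 147/170
```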

PairTheBoard
05-11-2007, 02:50 AM
Can you give an intuitive sense for why

P(A != B_3 | A != B_2, A != B_1) != 7/8

For example, suppose M=5 and A=(a_1,a_2,a_3)

(1) A != B_1 means b_1 != a_1 or b_2 != a_2 or b_3 != a_3
(2) A != B_2 means b_2 != a_1 or b_3 != a_2 or b_4 != a_3

Now asking if

(3) A != B_3 means b_3 != a_1 or b_4 != a_2 or b_5 != a_3

Which of those last three !='s is affected by any of the information in the two lines above? Certainly, b_5 doesn't care what relation any of the previous b_i's have to a_3. Considering b_4, we know there is a greater than normal chance that b_4 != a_3 by (2), but why should that impact the question of whether b_4 != a_2 in (3)? Same for b_3. We know there's a greater than normal chance that it's not equal to a_3 or a_2 from (1) and (2), but why should that impact its chances of being not equal to a_1 in (3)?

I realize this is not a proof. But still, there should be some intuitively direct way to see how the information provided in (1) and (2) affects the chances of things in (3).

It must be some mix of things, like b_3 != a_3 and b_4 != a_3 in (1) and (2) implying there's a higher chance that b_4 = b_3, and thus, with b_3 != a_2 in (2), a higher chance that b_4 != a_2 in (3). I was afraid of something like that. Boy, that gets complicated fast.

PairTheBoard

Siegmund
05-11-2007, 05:40 AM
The thing that matters is whether there are sub-strings in the middle of the N-digit string that are one character different from the leftmost part of the N-digit string.

Suppose you are trying to build "1111": if you ever get a digit wrong, you start completely over, and need at least four more digits before you can possibly build your string.

But, suppose you are trying to build "1101": getting 1-1-1 instead of 1-1-0 only sets you back one digit, because you can use the second and third 1s as the start of a new sequence.

Given an N-digit sequence, you can write a Markov chain with N+1 states, and get the transition probabilities. Taking 1101 as an example:

State A: wrong digits
State B: ....1
State C: ....11
State D: ....110
State E: success

transition matrix: 0.5 each for
A->A or B; B->A or C; C->C or D; D->A or E. E is absorbing.

Start with the vector [1 0 0 0 0], multiply by the transition matrix raised to the Mth power, and the 5th element of the new vector is the probability that a string of M random digits contains your chosen string.

The 2^N possible N-digit strings can be grouped according to which of the N(N+1)/2 possible transition matrices they give rise to.

Somebody who has gotten more sleep tonight than I have can actually calculate a closed form for the Mth power of a typical transition matrix.
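In the meantime, the construction can be sketched in plain Python (an illustration added to the thread, using the 1101 example above): propagate the state distribution M steps and check the absorbing-state probability against brute-force enumeration for small M:

```python
from itertools import product

# states: A (no useful suffix), B "...1", C "...11", D "...110", E (success);
# each entry lists the (next_state, probability) pairs for one more random bit
T = {
    "A": [("A", 0.5), ("B", 0.5)],
    "B": [("A", 0.5), ("C", 0.5)],
    "C": [("C", 0.5), ("D", 0.5)],  # an extra 1 keeps the "11" suffix
    "D": [("A", 0.5), ("E", 0.5)],
    "E": [("E", 1.0)],              # success is absorbing
}

def p_contains(M):
    # probability that a random M-bit string contains "1101"
    dist = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 0.0}
    for _ in range(M):
        new = dict.fromkeys(dist, 0.0)
        for state, p in dist.items():
            for nxt, q in T[state]:
                new[nxt] += p * q
        dist = new
    return dist["E"]

def brute(M):
    return sum("1101" in "".join(b) for b in product("01", repeat=M)) / 2 ** M

for M in range(4, 10):
    assert abs(p_contains(M) - brute(M)) < 1e-12
print(p_contains(4))  # 0.0625: only the string 1101 itself
```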