Simple storage problem (math)

bigpooch · #12 11-25-2007, 12:16 PM

This is a worthwhile idea to consider. Define g(p) to be a
"modified gap" function g(p) = (q-p-2)/2 where q is the
least prime >p. [ Take advantage that the difference
between successive primes is even. ]

First, instead of stopping at the prime 13, go up to 23
[ 2x3x5x...x23 < 2^32 but 2x3x5x...x29 > 2^32 ] so the
relevant factor is about 6.112910518 (ratio of highest
"candidate" to bits used). Now encode all primes <10^8
using the basic idea. Note the average gap at about 10^8
between primes is about 18.42 or g(p) is around 8.21 on
average already.

Second, using the fact that pi(100 million) = 5 761 455,
use pzhon's idea so that the encoding takes only about
log_2(C(16 358 820, 5 761 446)) bits or about 15 312 000
bits which is a little bit better than 16 358 820 bits
(we save more than 1 megabit). [ How efficiently are the
primes encoded? Ignoring 2,3,5,..., 29 for now, on average
about 2.6577 bits per prime are used. If pzhon's idea is
not used, it's about 2.8394 bits per prime and if just the
basic idea is used for all primes <2.29x10^9, the average
would be about 3.3532 bits per prime.]

So there remains about 424.688 megabits for the remaining
105 959 352 primes or just above 4.008 bits per prime. Is
that enough to naively encode g(p)? The "jumping champion"
is 6 (most common difference between primes) for any "big"
range <1.74x10^35 (see http://mathworld.wolfram.com) , so
the most common value for g(p) is 2, so it's possible that
for many values of p in (10^8, 2.29x10^9), 0<=g(p)<=14 so
only four bits are used for the vast majority of p. Also
note the maximum value of g(p) is 145 in the range
( see: http://primes.utm.edu/notes/GapsTable.html ).

On the other hand, the average for g(p) should be around
9.5 since ln(2.29x10^9)/2 is about 10.776, i.e., we should
expect many values of p in the relevant range such that
g(p)>14 (where we would need 4 bits + several more to
encode the value of g(p)). It seems that the "budget" of
4.008 bits per value of g(p) will be slightly exceeded to
naively encode the difference. OTOH, there could also be
enough p such that 0<=g(p)<=6 so that a 3-bit scheme could
be used at first instead, but I suspect that there are too
many p in the range for which g(p)>=7.

Yes, it's close at 100 million. If 200 million is used
instead, we should be there.