Two Plus Two Newer Archives  

General Gambling > Probability
#1 - 12-31-2006, 01:51 AM
Daisydog (Member; joined May 2006; 60 posts)
Statistics Question (hard, I think)

Let X be a random variable with an unknown distribution

Let X_bar(n) be a sample mean calculated based on n samples from the distribution of X

Now, suppose your goal is to estimate Var(X) and the only thing you are allowed to observe is X_bar(n) 50 times (but not the underlying 50*n individual samples used to calculate the 50 sample means). You are allowed to choose n.

The estimator you are using is n*(sample variance of X_bar(n)), which should be an unbiased estimator of Var(X), I think.

The question is: what value of n will minimize the variance of the estimator?

Based on some simulations I've done in Excel, I'm pretty sure the answer depends on the distribution of X, but on exactly what I am not sure.
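For concreteness, here is a minimal simulation sketch of the setup (my own, not from the thread), using Uniform(0,1) as a stand-in for X since its distribution is unspecified:

```python
# Sketch of the setup: observe 50 sample means of size n, and estimate
# Var(X) as n * (sample variance of the 50 means).
import random
import statistics

def estimate_var(n, num_means=50, draw=random.random):
    """Observe num_means sample means of size n; return n * their sample variance."""
    means = [statistics.fmean(draw() for _ in range(n)) for _ in range(num_means)]
    # statistics.variance uses the unbiased (N-1) form, so because
    # E[sample variance of the means] = Var(X)/n, this estimator is
    # unbiased for Var(X).
    return n * statistics.variance(means)

random.seed(42)
print(estimate_var(1))    # true Var of Uniform(0,1) is 1/12 ~ 0.0833
print(estimate_var(100))  # unbiased for any choice of n
```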
#2 - 12-31-2006, 05:02 AM
Siegmund (Senior Member; joined Feb 2005; 1,850 posts)
Re: Statistics Question (hard, I think)

Sure looks to me like for large n it ought to be independent of n. You are taking the variance of 50 observations, either way.

Do you have some particular reason to believe it *isn't* independent of n (other than some "n vs. n-1" effects if n is extremely small)?
#3 - 12-31-2006, 12:13 PM
AaronBrown (Senior Member; New York; joined May 2005; 2,260 posts)
Re: Statistics Question (hard, I think)

In theory, the larger the n, the smaller the variance of your estimator. However, there are practical considerations that might lead you to reduce n. For example, if you cannot measure with infinite precision, if n gets too large, all your means will be the same and you will get an estimate of zero.
#4 - 12-31-2006, 05:57 PM
Daisydog (Member; joined May 2006; 60 posts)
Re: Statistics Question (hard, I think)

Based on simulations, I have noticed the following (which I haven't proven theoretically):

1. When X is distributed normally, the variance of the estimator appears to be independent of n, so you get just as good an estimate with n=1 as with n=1,000,000.

2. When X is not normally distributed, the variance of the estimator appears to be independent of n if n is large. I assume this is because the sample mean becomes approximately normally distributed for large n.

3. For some distributions of X, the estimator appears to perform best when n=1. For example, this seems to occur when X is distributed uniformly and when X is distributed Bernoulli with p=.5. Note that skewness = 0 for both of these.

4. For distributions of X that are highly skewed, like Bernoulli with p=.01, the estimator appears to perform best when n is large. Using n=1 appears to give a very poor estimate.

I'd be interested if anyone can back this up with any theory.
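A sketch (mine, not from the thread) that reproduces observations 3 and 4 empirically, by measuring the spread of the estimator itself for symmetric vs. highly skewed Bernoulli distributions:

```python
# Compare the spread (sd) of the estimator n * var(50 sample means)
# for small vs large n, for X ~ Bernoulli(p).
import random
import statistics

def estimator_sd(n, p, trials=200, num_means=50):
    """Empirical standard deviation of the estimator over many trials."""
    ests = []
    for _ in range(trials):
        means = [sum(random.random() < p for _ in range(n)) / n
                 for _ in range(num_means)]
        ests.append(n * statistics.variance(means))
    return statistics.stdev(ests)

random.seed(0)
# Symmetric Bernoulli(0.5) (skewness 0): n=1 is already good.
print(estimator_sd(1, 0.5), estimator_sd(100, 0.5))
# Highly skewed Bernoulli(0.01): n=1 is much noisier than large n.
print(estimator_sd(1, 0.01), estimator_sd(100, 0.01))
```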
#5 - 12-31-2006, 06:51 PM
AaronBrown (Senior Member; New York; joined May 2005; 2,260 posts)
Re: Statistics Question (hard, I think)

You are correct, I was not.

The variance of the sample variance of a Normal distribution is 2*s^4*(N-1)/N^2, where N = 50 in this case. That means 0.0392*s^4. If the standard deviation of X is s, then the standard deviation of X_bar is s/n^0.5. That makes the variance of the sample variance of X_bar 0.0392*s^4/n^2. But when you multiply the sample variance of X_bar by n to get an unbiased estimate of the variance of X, its variance is multiplied by n^2, so you end up with the same variance for the estimator, for all n.

For non-Normal distributions, the result depends on the kurtosis (the central fourth moment divided by the variance squared). The variance of the sample variance is:

s^4*[(k - 1)*N^2 - 2*(k - 2)*N + k - 3]/N^3

where k is the kurtosis. k = 3 for a Normal distribution, so this reduces to the formula above. The s^4 term makes no difference, as shown above for the Normal case. So the only thing that changes with n is the kurtosis of X_bar: if X has a finite fourth moment, the kurtosis of X_bar(n) is 3 + (k - 3)/n, which tends toward 3 as n grows.

Therefore, if the distribution has "fat tails" (k > 3), increasing n will pull the kurtosis of X_bar down toward 3 and reduce the variance of the estimator. Thin-tailed distributions, with k < 3, will see the opposite effect.
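As a numeric check (my own sketch, using the biased-sample-variance convention of this post and the fact that the kurtosis of X_bar(n) is 3 + (k - 3)/n for i.i.d. samples with finite fourth moment):

```python
# Variance of the estimator n * (sample variance of N means of size n),
# as a function of the kurtosis k of X. Note n^2 * (s/sqrt(n))^4 = s^4,
# so only the kurtosis term actually depends on n.
def var_of_estimator(k, n, N=50, s4=1.0):
    kbar = 3 + (k - 3) / n  # kurtosis of the sample mean X_bar(n)
    return s4 * ((kbar - 1) * N**2 - 2 * (kbar - 2) * N + (kbar - 3)) / N**3

for n in (1, 10, 100):
    print(n, var_of_estimator(k=9.0, n=n))   # fat tails: shrinks as n grows
for n in (1, 10, 100):
    print(n, var_of_estimator(k=1.8, n=n))   # uniform X (k = 1.8): grows with n
```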
#6 - 01-02-2007, 01:13 AM
Daisydog (Member; joined May 2006; 60 posts)
Re: Statistics Question (hard, I think)

Thank you. Your posts are always enlightening. I had been trying to derive a theoretical formula for the variance of the sample variance of the sample mean and it was getting too complicated. It looked like it involved E(X^4), so the formula with the kurtosis is not a surprise. I was guessing it had something to do with skewness too, but I guess not. Just curious: how were you able to come up with that formula so quickly? I'm guessing an old textbook, or did you just have it committed to memory?

Follow-up question: if X is the distribution of win/loss on individual hands of poker, would you agree that X would have a high kurtosis, and thus that the best estimator would use a large n, say greater than 1,000? When I talk about poker hands here I mean hands simulated from something like Turbo Texas Hold'em, where I think individual hands should be independent.
#7 - 01-03-2007, 08:36 PM
AaronBrown (Senior Member; New York; joined May 2005; 2,260 posts)
Re: Statistics Question (hard, I think)

Thanks for the kind words, especially since I was dead wrong in my first response.

I do remember the relations of the first four cumulants, and it's just a little algebra to derive the variance of the sample variance.

Yes, I agree that the more hands you use per simulation, the more accurate your sample variance will be. But you're better off putting all the hands in one big pool.
#8 - 01-03-2007, 10:10 PM
Daisydog (Member; joined May 2006; 60 posts)
Re: Statistics Question (hard, I think)

[ QUOTE ]
Yes, I agree that the more hands you use per simulation, the more accurate your sample variance will be. But you're better off putting all the hands in one big pool.

[/ QUOTE ]

With the simulation program you can't put all the hands in one big pool. All it tells you is the total win/loss after n hands. Given this, I think the best way to estimate the variance is to run the simulation, say, 50 times with n=10,000 or more and use that to derive a per-hand sample variance.

These simulation programs are not the best, but I think you can use them to get a reasonable approximation of the variance and of how certain game conditions affect it.
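A sketch of that batch approach (mine, not from the thread; the simulator is replaced by a Normal stand-in with a known per-hand variance): each reported total is a sum of n i.i.d. per-hand results, so Var(total) = n * Var(per-hand), and dividing the sample variance of the totals by n recovers a per-hand variance estimate.

```python
# Convert the variance of batch totals into a per-hand variance estimate.
import random
import statistics

def per_hand_variance(batch_totals, hands_per_batch):
    """Var(total of n i.i.d. hands) = n * Var(per-hand), so divide by n."""
    return statistics.variance(batch_totals) / hands_per_batch

# Stand-in for the simulator: 50 runs of 10,000 hands each,
# per-hand results with true variance 4.0 (sd of 2 units per hand).
random.seed(0)
n = 10_000
totals = [sum(random.gauss(0.05, 2.0) for _ in range(n)) for _ in range(50)]
print(per_hand_variance(totals, n))  # should be near 4.0
```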