part of the population is excluded from sampling, now what?

mrmr · #1 10-31-2007, 08:25 PM

You take a reasonably sized random sample, n, from a population of size N. Then you find an additional x*N elements of the population. (I don't think it will matter too much, but let's say x is somewhere between 0 and 0.4.)

So the population is now (1 + x)*N sized, but your samples all came from an N sized sub-population. There is no particular reason to think the newly found portion of the population is distributed any differently than the original N elements.

What do you do? Is it reasonable to "pretend" the sample came from the entire population (the one that has (1 + x)*N elements) and proceed as normal? If not, how do you calculate a confidence interval for, and the level of precision of, your estimates (of whatever parameter(s) interest you) for the entire population?

So a concrete example of what I am asking would go like this. You have 2000 baseballs that were hit through your window by the neighbor kids over the years. You look at 100 of them (randomly selected) and record the number of bits of glass embedded in each one. You head to your office to estimate the number of bits of glass per baseball for the entire 2000 baseball collection, and of course you want to know how accurate and precise that estimate is, so you will want to calculate those things too. And then you remember that you have 650 baseballs that you forgot to count because the neighbor kids ... eh, enough with this cheesy story. You remember you have 650 more. So you've actually got 2650, but you only sampled from 2000 of them, and you want to get an estimate, and know how precise and accurate it is, but you don't want to do any more legwork or re-examining of baseballs.

David Sklansky · #2 10-31-2007, 08:42 PM

If the 650 baseballs are randomly excluded, then it's just a random sample of 100 out of 2650. Ironically, the only way we can know if they are random would be if you had chosen not to cut short your cheesy story.
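
To make "just a random sample of 100 out of 2650" concrete, here is a minimal sketch of the usual estimate for a simple random sample drawn without replacement, with the finite population correction. It assumes the 650 really were excluded at random, uses a normal approximation (z = 1.96) for the interval, and runs on made-up glass counts; the function name and the data are hypothetical, not from the thread.

```python
import math
import random
import statistics

def srs_mean_ci(sample, population_size, z=1.96):
    """Mean and approximate 95% CI for a simple random sample drawn
    without replacement from a finite population of the given size."""
    n = len(sample)
    mean = statistics.mean(sample)
    s2 = statistics.variance(sample)   # sample variance (n - 1 denominator)
    fpc = 1 - n / population_size      # finite population correction
    se = math.sqrt(s2 / n * fpc)       # standard error of the sample mean
    return mean, mean - z * se, mean + z * se

# Hypothetical data: glass-bit counts for the 100 sampled baseballs.
rng = random.Random(0)
glass_counts = [rng.randint(0, 10) for _ in range(100)]

mean_2000, lo_2000, hi_2000 = srs_mean_ci(glass_counts, 2000)
mean_2650, lo_2650, hi_2650 = srs_mean_ci(glass_counts, 2650)

print(f"N=2000: mean {mean_2000:.2f}, CI ({lo_2000:.2f}, {hi_2000:.2f})")
print(f"N=2650: mean {mean_2650:.2f}, CI ({lo_2650:.2f}, {hi_2650:.2f})")
```

Under these assumptions the point estimate is identical for both population sizes, and the interval for N = 2650 is only about 0.6% wider than the one for N = 2000, because the only thing that changes is the correction factor 1 - n/N.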

knowledgeORbust · #3 10-31-2007, 09:54 PM

More cheese!

Out of curiosity - where'd this problem come from?

madnak · #4 10-31-2007, 11:56 PM

[ QUOTE ]
Is it reasonable to "pretend" the sample came from the entire population (the one that has (1 + x)*N elements) and proceed as normal?

[/ QUOTE ]

Yes. If "There is no particular reason to think the newly found portion of the population is distributed any differently than the original N elements" then taking a sample from the N elements and taking a sample from all (1 + x)*N elements is the same. As we expect the distributions of the two populations to be the same, we expect the distribution of the sample to be the same as well.
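
One way to check this claim is a small simulation. The sketch below assumes the excluded 650 are a random subset of the 2650 (the condition David Sklansky named above) and compares two procedures: sampling 100 directly from all 2650, versus setting 650 aside at random and sampling 100 from the remaining 2000. The population values, seed, and function names are hypothetical.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: glass-bit counts for all 2650 baseballs.
population = [rng.randint(0, 10) for _ in range(2650)]
true_mean = statistics.mean(population)

def direct_sample_mean():
    """Mean of a simple random sample of 100 from all 2650."""
    return statistics.mean(rng.sample(population, 100))

def two_stage_sample_mean():
    """Randomly set aside 650, then sample 100 from the remaining 2000."""
    kept = rng.sample(population, 2000)  # the 2000 that were not excluded
    return statistics.mean(rng.sample(kept, 100))

trials = 1000
direct = [direct_sample_mean() for _ in range(trials)]
two_stage = [two_stage_sample_mean() for _ in range(trials)]

print(f"true mean of all 2650: {true_mean:.3f}")
print(f"direct:    mean {statistics.mean(direct):.3f}, "
      f"sd {statistics.stdev(direct):.3f}")
print(f"two-stage: mean {statistics.mean(two_stage):.3f}, "
      f"sd {statistics.stdev(two_stage):.3f}")
```

If the reasoning holds, both procedures should centre on the true mean with about the same spread, up to simulation noise. The simulation says nothing about the case where the 650 differ systematically from the 2000; it only illustrates the random-exclusion case.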

mrmr · #5 11-01-2007, 12:08 AM

Okay, okay, let's see what we can do here.

The full cheesy story: You store the baseballs in crates along with various other sports collectables, sorted by year. You replaced your windows with plexiglass in 2001, so you don't bother looking through the 2002+ crates. You gather all the baseballs (it turns out there are 2000) then devise and employ some clever technique to randomly select 100 of them.

You collect your data, and then retire to your office to estimate and calculate, when it hits you (like a bad pitch through a living room window) that it was 2003, not 2001, when you installed the plexiglass. You go back and pull out your 2002 and 2003 crates, and count 650 baseballs.

You begin to wonder, must I take some separate sample out of the 650 to get a reasonable estimate, and to determine how accurate and precise said estimate is? Or throw out my data and start over with a new random sample out of 2650?

Before you can think the problem through, much less begin some re-sampling process, the ghost of Babe Ruth appears before you and says: "I'm hungry for baseballs," and eats all of the baseballs. Before disappearing, the ghost of the Babe tells you, in one long sustained belch, "I have some hidden wisdom for you from beyond the grave: there is no reason to think that the 2002 and 2003 baseballs are any different from the older ones, and before some stickler gets clever on us, let me add that the baseballs, the windows, and the neighbor kids, Ceteris and Paribus, all remained relatively unchanged over the years, and had no serial correlations to speak of."

-----

I don't know how or why I thought of this problem, but I asked it here because after reading "Sampling," by S. Thompson, I still couldn't answer the question in a mathematically (or logically, if it doesn't come to math) rigorous way. I think I could figure it out if I keep working on it, but why would I do that, when I've got 2+2 to answer it for me?!

Actually, it seemed like a good logical puzzle, so I thought I'd share it.

Still interested in an answer, and more importantly, a line of reasoning that supports it.

Edit: posted before madnak's reply. I welcome more input, though.

madnak · #6 11-01-2007, 12:59 AM

Well, I would think about it like this - you have a population z. That population is composed of subpopulation N and subpopulation x. You have no information on how the members of z are sorted into N and x.

There are at least two different questions here. If you take a sample n from N, will the expected distribution be the same as if you took a sample from z? Also, what conclusions can you draw about z from n, and are they the same as the conclusions you can draw about N?

I think the answer to the first question is yes, but I'm not so sure about the second question. I think that the concrete example is more certain because it specifies that x and N were generated by the same process. This is the kind of information that is probably necessary for any sort of rigorous evaluation.

In a way, it depends on what you mean by "there is no particular reason..." In the Babe Ruth situation, it seems that you mean all things are equal, in which case the boxes are essentially arbitrary. The groupings by year are meaningless and used solely for the purpose of categorization. As a result, you can definitely take a sample from only some of the boxes rather than all of them. In fact, you could just take one box - say, the 1998 box - and that would be a random sample. You could make statistical statements on the basis of that sample.

But nothing is well-defined. Your disclaimer against clever sticklers would need to be much more thorough for anything to be certain. You'd need to say more about what you're looking at and how it's determined. I think the best mathematical approach might be to describe a very specific case and generalize from there.
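
As one such specific case, here is a rough sketch of what is at stake if the "no particular reason" assumption fails. It treats the 2000 older baseballs and the 650 newer ones as two groups with possibly different mean glass counts, and computes the error made by reporting the older group's mean as the mean of all 2650. The means used are hypothetical, and the function names are made up for illustration.

```python
def whole_population_mean(mean_old, mean_new, n_old=2000, n_new=650):
    """Size-weighted mean of the two sub-populations combined."""
    return (n_old * mean_old + n_new * mean_new) / (n_old + n_new)

def bias_from_ignoring_new(mean_old, mean_new, n_old=2000, n_new=650):
    """Error made by reporting the old sub-population's mean as the
    mean of the whole collection."""
    return mean_old - whole_population_mean(mean_old, mean_new, n_old, n_new)

# Hypothetical case 1: both groups average 3 bits of glass per ball.
same = bias_from_ignoring_new(3.0, 3.0)

# Hypothetical case 2: the 2002-2003 balls average only 1 bit per ball.
different = bias_from_ignoring_new(3.0, 1.0)

print(f"bias when the groups match:  {same:.3f}")
print(f"bias when the groups differ: {different:.3f}")
```

The bias works out to (650/2650) times the difference between the two group means, so it vanishes when the groups match and cannot be detected from the sample of 100 alone when they do not. That is why the answer rests entirely on how x and N came to be separated.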