Two Plus Two Newer Archives  

  #1  
03-20-2007, 03:54 AM
mosta
Senior Member
 
Join Date: Feb 2003
Location: outplaying 300bb downswing
Posts: 1,687
convenience samples and abuse of statistics--Help?



I had a graduate social science statistics seminar with Richard Berk (who's close to David Freedman, who wrote a famous textbook), and he expounded on the prevalent misuse of statistical significance tests in social science. I didn't make a point of retaining it because I didn't expect to use quantitative methods. Now it nags at me because I don't feel fully equipped to call out specious analysis.

Basically the point was that if you are using your basic t-test, for example, you need to be able to define specifically the population that you sampled from by simple random sampling. As Freedman put it in his book: you need to be able to apply the box model. In effect, you put all the names in a box and drew lots. Your inference is from your sample to the box.
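To make the box model concrete, here is a minimal simulation sketch, not anything from Freedman's book. The box contents, the cutoff of 45 for "easy to reach" tickets, and the function names `mean_ci` and `coverage` are all made up for illustration. It checks how often a textbook 95% interval catches the true box mean when the tickets are drawn by simple random sampling versus from a convenient subset of the box.

```python
import random
import statistics

def mean_ci(sample, z=1.96):
    """Approximate 95% interval for the mean (normal approximation)."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - z * se, m + z * se

def coverage(tickets, true_mean, n=100, trials=2000, seed=0):
    """Share of trials whose interval contains the true mean of the whole box."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        lo, hi = mean_ci(rng.sample(tickets, n))
        hits += lo <= true_mean <= hi
    return hits / trials

# Hypothetical box: 10,000 tickets with made-up values.
gen = random.Random(42)
box = [gen.gauss(50, 10) for _ in range(10_000)]
box_mean = statistics.fmean(box)

# Assumption for illustration: only tickets above 45 are "easy to reach".
reachable = [x for x in box if x > 45]

srs_coverage = coverage(box, box_mean)
convenience_coverage = coverage(reachable, box_mean)

print(f"simple random sample: {srs_coverage:.3f}")
print(f"convenience sample:   {convenience_coverage:.3f}")
```

Under these assumptions the simple random sample should cover the box mean roughly 95% of the time, while the convenience sample almost never does. The formula is the same in both cases; only the link between the sample and the box has changed.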

Aside from formal survey research and randomized biomedical experiments, most social science works with convenience samples. Two examples that come to mind are law review articles on trial practices. Each author collected as much info "as he could" from a set of courts. But the data were not a simple random sample from a defined population. They were whatever the authors could come up with by asking as much as they could. And at the end they click the significance test in the software, and voilà, it's "scientific."

Social scientists who are sophisticated enough to know that there is a problem here posit "superpopulations" in a metaphysical argument to try to tie the analysis to what looks like math. Although he tried to be nice and allow you to disagree, Berk's position was that the whole superpopulation thing was nonsense. The role of the superpopulation is supposed to be this: since you're not inferring from a sample to a concrete and specific population (the "box" that you sampled from), you claim to infer to the "superpopulation" of, like, possible worlds or such.

I have a couple of acquaintances in epidemiology and bioinformatics I thought about asking about this. But I'm not confident I can discuss this competently enough.

Is this criticism familiar to anyone here who's a real statistician? Convenience samples being inappropriate for statistical inference? Superpopulation being nonsense? Most social science significance claims being bunk? (Now I know it's a bit different when you get into bootstrapping techniques, but I only remember the general gist of that. Not whether it applies to convenience samples or not.)

What about when you look at your poker tracker data and compute a confidence interval? Where is the box model for that statistical inference?
  #2  
03-20-2007, 04:55 AM
Siegmund
Senior Member
 
Join Date: Feb 2005
Posts: 1,850
Re: convenience samples and abuse of statistics--Help?

I'll take a stab at it...

Every statistical test comes with a list of assumptions. For most of them, the list of assumptions includes assuming that each of your observations was drawn independently from the same pool of possible outcomes.

When you write up your results, you are supposed to not only report the test statistic, but also show that the assumptions behind that particular test either were satisfied, or were violated in some harmless way that doesn't invalidate the conclusion.

Different fields have different standards for what they will accept. When you do a telephone survey, you've got an actual description of the method you used to select the phone numbers and the procedure you used to follow up with the people who didn't answer the first time you called. In a lot of scientific applications, convenience sampling has a built-in truly random aspect (the water in a turbulent river is constantly flowing downstream and mixing itself, so a bucket lowered from a bridge will work just as well as a bucket lowered to a depth of exactly 0.45m below the surface, 3.83m from the north bank). In other applications, there are hidden hazards (if you took that bucketful right at the mouth of a river as it empties into the ocean, was it high or low tide? Are you sure that's river water?)

And, finally, there are times when we invent a useful fiction, so that we can use a number we know how to calculate as a proxy for one we can't. I was recently asked, for instance, to calculate the percentage of freshmen at a certain university who I expect to get a bachelor's degree in four years. OK, we can count how many enrolled in fall 2002 and got degrees in spring 2006, fine. This gives us some sort of an estimate of how successful students are in the program. To calculate an uncertainty of that estimate, I might regard the students who ACTUALLY enrolled in 2002 as somehow being a sample drawn from a mythical infinite pool of high school seniors who might choose to attend my university... I'm doing it for my convenience, and admitting I am doing so - not actually claiming that pool of seniors actually ever existed. It's a mathematical formalization of the armwaving statement "assuming that the students who enrolled in 2002 were typical."
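As a rough sketch of what that "useful fiction" looks like in arithmetic: treat each enrolled student as an independent draw from the imagined pool and use the usual binomial standard error. The counts below (1,200 enrolled, 540 graduated) are hypothetical, not figures from the post, and `proportion_ci` is just an illustrative helper name.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% interval for a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

# Hypothetical counts: 1,200 freshmen enrolled, 540 graduated in four years.
rate, lo, hi = proportion_ci(540, 1200)
print(f"four-year rate {rate:.3f}, interval {lo:.3f} to {hi:.3f}")
```

The interval only means something if you accept the premise that the cohort behaves like independent draws from one pool. The code cannot check that premise.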

Is it a good or a bad formalization? That depends entirely on whether you believe the argument that led up to it. The statistical test explicitly disclaims all responsibility for any mistakes that occur if your data don't really behave the way you claimed they did when you said they were suitable for use of that test.

In the case of poker tracker data, I think most of us are aware that there are a number of issues (consecutive hands are not independent if anybody at the table is tilting... game selection is better at certain times of day than others... and on and on) BUT, since we know our estimates are going to have large built-in uncertainties, as long as we know that the errors we are making are small compared with the uncertainties, we don't lose sleep over it.
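For a sense of how large those built-in uncertainties are, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: 50,000 hands, a true win rate of 5 bb/100, a standard deviation of 8 bb per hand, and the helper name `winrate_ci`. It summarizes results in blocks of 100 hands and treats each block as one observation, which is one common way to soften (not remove) dependence between consecutive hands.

```python
import random
import statistics

def winrate_ci(results_bb, block=100, z=1.96):
    """Win rate in bb/100 with an approximate 95% interval from block totals."""
    n_blocks = len(results_bb) // block
    totals = [sum(results_bb[i * block:(i + 1) * block]) for i in range(n_blocks)]
    scale = 100 / block  # convert a per-block total to bb per 100 hands
    rate = statistics.fmean(totals) * scale
    se = statistics.stdev(totals) / n_blocks ** 0.5 * scale
    return rate, rate - z * se, rate + z * se

# Hypothetical data: per-hand results in big blinds, simulated as independent.
rng = random.Random(1)
hands = [rng.gauss(0.05, 8.0) for _ in range(50_000)]

rate, lo, hi = winrate_ci(hands)
print(f"win rate {rate:.1f} bb/100, interval {lo:.1f} to {hi:.1f}")
```

With these made-up parameters the interval is on the order of 14 bb/100 wide even after 50,000 hands, so modest violations of the assumptions are small next to the sampling noise.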
  #3  
03-20-2007, 01:18 PM
mosta
Senior Member
 
Join Date: Feb 2003
Location: outplaying 300bb downswing
Posts: 1,687
Re: convenience samples and abuse of statistics--Help?

Thanks a lot. This is a really good explanation--just what I was hoping for. I'll see if I can come back with a little more exploration later.
All times are GMT -4.

