Friday, December 16, 2011

When Standing At the Brink of the Abyss, Staring Into the Great Unknown, We Randomize

by Chris Masterjohn

In any experiment, randomization is the central criterion necessary to make an inference about cause and effect.  This is true whether we are studying inanimate objects, isolated proteins, cells, animals, or people.  

Randomization helps us remove the influence of both known and unknown confounders.  The ultimate confounders are choice and the passage of time.  People (or animals) who choose one thing may be constitutionally different in myriad known and unknown ways from people (or animals) who choose another thing.  As a result, "self-selection" or choice acts as a super-confounder.  The passage of time also acts as a super-confounder because of its association with a near infinitude of both known and unknown time-dependent trends that can irrevocably mangle our interpretation of any observation if they aren't somehow accounted for.  The principal reason that randomization is such a useful tool is that it can account not just for the known confounders but even for those unknown.

The Great Unknown

There is unfortunately no way to quantify how much we don't know, but humility, wisdom, and scientific caution all require us to assume that the unknown is likely to vastly exceed the known in breadth, depth, and importance.  We can imagine that the total pool of truth looks something like this:

[Figure: the total pool of truth drawn as five stacked layers.  From top to bottom: things known; controversial things under investigation; things we haven't thought of; things we can't test; and things we can't even imagine.  The "known" layer occupies only a tiny sliver at the top.]

We have probably only begun to investigate a small portion of what can be investigated, and there may be a great deal of truth that is impossible to investigate.  Indeed, there may be a great deal of truth that is impossible to even imagine, and there is obviously no way to imagine what such truth might be or just how much of it there is.  But that doesn't make it any less true, and doesn't make it any less able to confound our observations.  

Epidemiologists study the world around us without performing any experiments.  They make observations, which is the critical first step in the scientific method.  They establish facts, without establishing cause-and-effect phenomena.  Facts are critical.  

Epidemiologists also calculate statistical associations between putative causes and putative effects.  In so doing, they try to adjust for confounders, but their adjustments necessarily come from the top two layers of truth, things known and controversial things under investigation.  Adjustments coming from the second layer, controversial things under current investigation, may simply be wrong, and the adjustments themselves may thus become confounders.  Regardless of whether the adjustments improve the analysis in the way they are intended to, scientific caution would require us to assume that the vast majority of confounders lie in the bottom three layers: things we haven't thought of, things we can't test, and things we can't even imagine.

These are the "unknown unknowns" that Donald Rumsfeld talks about:

[Video: Donald Rumsfeld discussing "unknown unknowns"]

Epidemiologists do make some attempt to account for unmeasured confounding, but when doing so they seem to naively assume that any unmeasured confounding is likely to be simple and straightforward.  Consider this paragraph from the "Eco-Atkins" paper by Frank Hu, Walter Willett, and other researchers from the Harvard School of Public Health (1):
We also considered the influence of unmeasured confounding by using a sensitivity analysis. We found that for [the Health Professionals' Follow-up Study], the unmeasured confounder would have to have a prevalence of 40% among those at the highest decile of animal score and a [hazard ratio] of 2.0 with total mortality to attenuate the association to nonstatistical significance. In [the Nurses' Health Study], the unmeasured confounder would have a prevalence of 20% and a [hazard ratio] of 2.0 to attenuate the association to nonstatistical significance. Because important confounders for the analyses of total and disease-specific mortality were controlled for, it is unlikely that such strong confounding would remain to explain the observed associations. 
I blogged about this study last year in my post, "New Study Shows that Lying About Your Hamburger Intake Prevents Disease and Death When You Eat a Low-Carb Diet High in Carbohydrate."  We can see from the language above that these researchers seem to think they've identified almost everything important already.  Thus, they refer not to a potential sea of unknown confounders but to "the unmeasured confounder," and conclude that such a confounder would have to be so prevalent and powerful that it strains the imagination to think it could have escaped their notice.
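
To see what such a sensitivity analysis amounts to arithmetically, here is a minimal sketch in Python of a classic bias formula for a single binary unmeasured confounder, in the spirit of the calculation the authors describe.  Only the 40% prevalence and the hazard ratio of 2.0 come from the quoted paragraph; the observed hazard ratio and the confounder's prevalence in the reference group are hypothetical values chosen purely for illustration.

```python
def bias_factor(hr_confounder, prev_exposed, prev_unexposed):
    """Classic bias factor for a binary unmeasured confounder:
    B = (1 + (HR_c - 1) * p_exposed) / (1 + (HR_c - 1) * p_unexposed)."""
    return (1 + (hr_confounder - 1) * prev_exposed) / (
        1 + (hr_confounder - 1) * prev_unexposed
    )

# From the quoted sensitivity analysis: HR 2.0 and 40% prevalence in the
# highest decile of animal score.  The 10% prevalence in the reference
# group and the observed HR of 1.23 are hypothetical illustrations.
b = bias_factor(hr_confounder=2.0, prev_exposed=0.40, prev_unexposed=0.10)
observed_hr = 1.23
print(f"bias factor = {b:.2f}")                             # ~1.27
print(f"confounder-corrected HR = {observed_hr / b:.2f}")   # ~0.97, near null
```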

After all, folks, they're experts!  And what expert wouldn't spot such a ginormous confounder?  Only an expert who is human, I suppose.  One who doesn't know the unknown.

These researchers seem to assume an extremely implausible model of the progress of knowledge, one in which the limit of attainable knowledge is equal to the totality of truth and in which we are rapidly and asymptotically approaching this limit:

[Figure: a curve of accumulated knowledge rising steeply and asymptotically toward a limit equal to the totality of truth]

There is, of course, no way to disprove such a model at any given moment because we have no way of quantifying what we don't know.  Such a model is nevertheless likely to be eternally disprovable in retrospect.  We can prove with great confidence, for example, that nineteenth century scientists were not a finger snap away from knowing everything, and people a hundred years from now will likely be able to prove the same about us with just as much confidence.  We will never be able to test the untestable or imagine the unimaginable, however, so we will never be able to argue with someone who claims that nothing untestable is true or that nothing unimaginable exists.  We nevertheless have the liberty to opt out of foolishness, a liberty for which we should be forever grateful.

The Randomized Experiment

Most experts agree that we should use observational evidence to generate hypotheses about cause-and-effect phenomena, but not to confirm or refute such hypotheses.  We use randomized, controlled experiments, by contrast, to support or refute our hypotheses about cause and effect. 

It is easy to get hung up on the "controlled" or "experiment" parts and miss the point that the key feature of such an experiment, the one that allows the inference of cause and effect, is actually the randomization.  To randomize simply means that we start with a group of people (or rats, or rocks, or whatever we are studying) and then randomly allocate the members of this group to one or another treatment to study the effects of that treatment.  If all subjects receive two or more treatments and thereby act as "their own controls," we would randomly allocate them to receive these treatments in different orders.
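
In code, the allocation step is almost trivial.  Here is a minimal sketch in Python; the subject names and treatment labels are made up for illustration:

```python
import random

subjects = [f"subject_{i}" for i in range(1, 9)]  # hypothetical subjects

# Parallel-group design: shuffle the pool, then split it in half.
random.shuffle(subjects)
half = len(subjects) // 2
group_a, group_b = subjects[:half], subjects[half:]
print("group A:", group_a)
print("group B:", group_b)

# Crossover design: every subject receives both treatments, so we
# instead randomize the ORDER in which each subject receives them.
for subject in subjects:
    order = random.sample(["treatment", "control"], k=2)
    print(subject, "->", " then ".join(order))
```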

If the sample size is large enough, randomization ensures that each treatment group is a random sample of the initial group from which it is drawn.  If each group is a random sample drawn from the same initial pool of subjects, all of the groups will have a more or less identical pattern of confounding variables, regardless of whether those confounding variables are known to us.  

In a small study, randomization can sometimes fail to distribute confounders evenly.  We can imagine that if we took four people off the street and randomly allocated two of them to group A and two to group B, the two groups might be radically different from one another!
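
A quick simulation makes the point concrete.  In this toy sketch, each hypothetical subject has a 30% chance of being a smoker (an invented figure), and we watch how the average imbalance in smoking between two randomized groups shrinks as the sample grows:

```python
import random

def smoker_imbalance(n, p_smoker=0.3):
    """Randomize n subjects into two equal groups and return the absolute
    difference in the proportion of smokers between the groups."""
    smokers = [random.random() < p_smoker for _ in range(n)]
    random.shuffle(smokers)
    a, b = smokers[:n // 2], smokers[n // 2:]
    return abs(sum(a) / len(a) - sum(b) / len(b))

for n in (4, 40, 400, 4000):
    trials = [smoker_imbalance(n) for _ in range(10_000)]
    print(f"N = {n:>4}: average imbalance = {sum(trials) / len(trials):.3f}")
```

With N = 4 the two groups routinely differ wildly; by N = 4000 the imbalance is negligible.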

For a more realistic example, the LA Veterans Administration Hospital Study randomly allocated just under 850 men to receive a diet based on butter and other animal fats or a diet based on vegetable oils.  I discussed this study at great length in my "Good Fats, Bad Fats" talk at Wise Traditions.  The randomization failed to distribute smoking habits evenly, so that there were more moderate and heavy smokers in the animal fat group and more light smokers and non-smokers in the vegetable oil group (2):

[Table: smoking habits at baseline in the animal fat and vegetable oil groups (2)]

This failure makes it more difficult to determine whether any differences between the two groups result from the different diets or from the different smoking habits. 

At this point, we should tremble in amazement, realizing we are standing on the brink of an invisible abyss, the ocean of the unknown, full of creatures of confounding great and small.  Smoking is a known and rather obvious confounder.  If a study with almost a thousand people in it is vulnerable to such an error in the distribution of smoking, are not all studies of this size likely to err in distributing untestable confounders, confounders we haven't thought of yet, and unimaginable confounders? 

When small randomized studies are repeated, however, the likelihood of such failures occurring in a consistent pattern is infinitesimal.  And when randomized studies are large enough, the likelihood of such failures even within one study is very low.  It is therefore important for us to give greater emphasis to the results of large studies, and to repeat studies so that we can pool their results together and examine a broad totality of evidence.
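
The same toy simulation shows why repetition works: any single small trial may be unbalanced, but the direction of the imbalance is random, so across many trials it averages out to nearly zero.  (The trial size and smoking prevalence are again invented for illustration.)

```python
import random

def signed_imbalance(n=100, p_smoker=0.3):
    """Signed difference in smoker proportion between two randomized groups."""
    smokers = [random.random() < p_smoker for _ in range(n)]
    random.shuffle(smokers)
    a, b = smokers[:n // 2], smokers[n // 2:]
    return sum(a) / len(a) - sum(b) / len(b)

imbalances = [signed_imbalance() for _ in range(1_000)]
print(f"worst single trial: {max(abs(x) for x in imbalances):.3f}")
print(f"mean across trials: {sum(imbalances) / len(imbalances):+.4f}")  # ~0
```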

As C.S. Lewis once wrote (3), "two heads are better than one, not because either is infallible, but because they are unlikely to go wrong in the same direction."  

With scientific studies, this is only true if the error is random.  If a study is not randomized properly, error can result from systematic bias, so that repeated studies of the same type are likely to repeat this bias over and over.  If the study is properly randomized, the possibility for error is still very great, but such error will be random rather than a result of systematic bias.  None of these randomized studies will be infallible, but they are incredibly unlikely to all go wrong in the same direction.

All this said, it would be a gross error to say that randomized studies constitute "better" evidence than observational studies.  All forms of evidence have their strengths and limitations.  The strength of randomized studies is our ability to use them to infer that one thing causes another.  This is hardly the only useful type of knowledge.

In future posts, I will elaborate on the "super-confounding" nature of choice and the passage of time, common errors in interpreting randomized studies, the inherent drawbacks of experimental studies, and the value of observational research.

Read more about the author, Chris Masterjohn, PhD, here.


1.  Fung TT, van Dam RM, Hankinson SE, Stampfer M, Willett WC, Hu FB.  Low-carbohydrate diets and all-cause and cause-specific mortality: two cohort studies.  Ann Intern Med. 2010;153(5):289-98.

2.  Dayton S, Pearce ML, Hashimoto S, Dixon WJ, Tomiyasu U.  A Controlled Clinical Trial of a Diet High in Unsaturated Fat in Preventing Complications of Atherosclerosis.  Circulation. 1969;40(1 Suppl 2):II-1-II-63.

3.  C.S. Lewis, Introduction to the English translation of St. Athanasius the Great's On the Incarnation published by St. Vladimir's Seminary Press, quoted from the 1953 edition.


  1. Thanks so much for your articles. I learn a lot from them. Appreciate your insights. And I enjoyed the "Unknown Unknowns" clip of Rumsfeld.
    One typo: in the quotation from the "Eco-Atkins" paper, one word says "morality" (mid-paragraph), but I believe it should be "mortality". :)

  2. Thanks Barbara! You must have clicked on this right after I published it, because I fixed "morality" to "mortality" almost immediately. Good eye!


  3. Chris, apropos the example of smoking confounding results, I'm wondering what you think about the possibility of enterotypes being a confounder. Have you read the Nature paper or any others on the enterotype concept? Lately I've become more and more interested in the mysteries of the gut, and I wonder if you agree with me that we are on the brink of opening a large pocket of ideas that was previously not imagined, but testable.

  4. Hi Ed,

    I haven't read the "enterotype" paper yet, but I agree with the principle that intestinal bacteria vary, and I believe they are somewhat heritable but also influenced by the environment, and as a result are likely to confound heritability measurements as well as, like you point out here, any randomized study that is too small to distribute enterotypes equally between groups. That said, if randomized trials are repeated, it is incredibly unlikely that they would all introduce the same error in the distribution of enterotypes, so the repetition of randomized studies should remove this as a confounder.


  5. Why bother with confounding factors? Plant food is the best food, that's all! (joking...)

    Great article Chris! The hardest part is to wait for the next ones.

  6. A little addendum, if I may:

  7. Superb as always. I really have to scratch my head sometimes at the kinds of things that are passed off as the definitive proof of a hypothesis and the overconfidence of some depending on their vocation or ideology. With the "eco-atkins" study the first thought to come to my head was the seeming inability of the researchers to actually be able to tell what the subjects ate, not just the amount of animal foods as you pointed out, but trans fats, oxidized vs. non-oxidized oils, harshly cooked vs. minimally cooked meats, and food quality. They claim to have accounted for trans fats, but we both know that the food frequency questionnaire does no such thing. Not to mention micronutrition. When macronutrition changes, so does micronutrition. And then there are unmeasurable lifestyle factors like sleep quality and stress. If I can realize that without any sort of scientific training at all, why can't some scientists?

    I think that Willett has a lot of talent, but he also has a lot of hubris. He claims to be able to account for confounding factors, but doesn't account for them. Like in this study comparing white rice and brown rice in relation to incidence of type 2 diabetes

    This appears to be good work by the low standards of epidemiology. But where's sugar?! Why wouldn't you include sugar in your analysis for diabetes? So if those who prefer white rice to brown rice also prefer soda to water, or vice versa, we have an unaccounted for confounder.

    All I want to see is a little more humility from scientists. I'm glad we have you, Chris! (I know you'll respond to that humbly as always)

  8. I don't understand the smoking example -- So they randomized the people and an important variable, smoking prevalence, was not evenly spread out -- so the study is insufficient? What if something equally important but unknown was also unevenly spread? And isn't it possible a repeat test had the same random distribution? It seems the only way to avoid this issue is to increase N dramatically?

  9. Hi Laurent,

    Yes, of course, the solution is not to study anything at all! ;-) Thanks for your kind words.

    Hi Stabby,

    Great points. And indeed, I have my own tendency to hubris, so humility is something to strive for, but well worth the struggle. Thank you for your appreciation and contributions.

    Hi Anonymous,

    Yes, I think you do understand the smoking example because your conclusion is correct. If an important variable was not evenly distributed, this means that some unknown number of other variables, some unknown proportion of which are important, are always likely to be unevenly distributed in a study of that size. There are two ways to solve this. One is to increase N dramatically. The other is to repeat the studies.

    The study was random, so the uneven distribution of smoking should have resulted from random error, not systematic error. This means that if it were repeated, the magnitude of error in the distribution of smoking may be similar, but shouldn't be in a consistent direction. So if you repeated the study a number of times, you should have some that are biased towards more smokers in the vegetable oil group and others where more smokers are in the animal fat group, and it should all even out. Thus, repeating the studies has exactly the same effect as dramatically increasing the N.

    "Insufficient" is a poor word because it implies that there is something "sufficient." But it is not an all or nothing matter. It is a continuum of convincingness.


  10. > "The randomization failed to distribute smoking habits evenly..."

    This reflects a failure of what should be common-sense methodology: Randomization (of the necessary kind) should not be random in the way they implemented it.

    The better methodology:

    1) From your total sample, pair off people who are maximally similar with respect to whatever you think (or know) are (or are correlated with) likely confounders.

    2) For each pair, randomly choose which member is in the experimental group, placing the other in the control group.

    A few moments' thought will show that this accomplishes the purpose of randomization perfectly well, and will also reduce the effect of confounders to a substantial but not-fully-quantifiable extent.
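
    A minimal sketch of this pair-then-randomize procedure in Python (the subjects, covariates, and crude distance function are all hypothetical stand-ins for whatever a real trial would match on):

    ```python
    import random

    # Hypothetical subjects: (id, cigarettes_per_day, age).  Assumes an
    # even number of subjects so everyone can be paired.
    subjects = [("s1", 0, 45), ("s2", 0, 47), ("s3", 20, 60),
                ("s4", 25, 58), ("s5", 5, 50), ("s6", 8, 52)]

    def distance(a, b):
        """Crude dissimilarity score over the matching covariates."""
        return abs(a[1] - b[1]) + abs(a[2] - b[2])

    # 1) Greedily pair off maximally similar subjects.
    pool = subjects[:]
    pairs = []
    while pool:
        first = pool.pop(0)
        nearest = min(pool, key=lambda s: distance(first, s))
        pool.remove(nearest)
        pairs.append((first, nearest))

    # 2) Within each pair, flip a coin to assign arms.
    experimental, control = [], []
    for a, b in pairs:
        if random.random() < 0.5:
            a, b = b, a
        experimental.append(a[0])
        control.append(b[0])

    print("experimental:", experimental)
    print("control:     ", control)
    ```

    (Greedy nearest-neighbor matching is only the simplest option; the coin flip within each pair is what preserves the randomization.)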

    In the present instance, the number of smokers (and their intensities of smoking) could have been balanced almost perfectly -- and the ages, genders, and income levels of the groups could also have been relatively well balanced.

    This procedure improves the genuine significance of the results (i.e., the quality of the scientific information gained), but, being less likely to produce spurious correlations, will decrease the incidence of exciting results. This may be one of the reasons that the superior procedure hasn't penetrated the research culture very far -- but this is the opposite of a valid excuse.

    Scientists should be pushed to use the superior methodology, and pushed hard. It would, in the end, save lives.

  11. Hi Anonymous,

    Yes, I am aware of such methods of stratification, but they do not solve the problem at all. This would work to distribute smoking evenly. My point, however, was that most confounders are likely to be unknown. Thus, all you will arrive at is the illusion of even distribution of confounders as only the ones that have been chosen as the most obvious and important will be evenly distributed.

    With respect to reducing confounding from unknown variables, which is the main reason randomization is necessary, there is no substitute for increasing the sample size either within a study or by performing numerous repetitions of the study.


  12. What are your thoughts on the use of epidemiological studies to debunk causation?

    As an example, the NHANES study into salt consumption found no correlation between salt consumption and mortality, stroke or heart-disease. I would argue that this evidence is as close to disproving the salt/high-blood-pressure/heart-disease theory as any clinical study could. The logic flows that without correlation, how can there be causation?

    Yet, I also see that reverse confounders can exist. It is possible that people who eat more salt may be associated with unknown factor X which protects against the nasty outcome. Hence, the low-salt diet pushers, who were always happy to link correlations to causations when it suits them, suddenly become all sceptical of population studies when there is no correlation.

    However, the power of the null study is surely greater than the positive study. The correlation in a positive study could mean anything, but for a null study to be invalid, a non-direct cause would have to exactly offset a direct cause. The probability of this happening seems low to me - but here I am losing my humility?

    Supposing the non-direct cause cancels out the supposed direct cause, it still leaves us with the problem of application.
    If people eating the most salt in their diets are just as healthy as people who eat the least, do we really need to do anything about what appears to be a non-issue?

  13. > “Yes, I am aware of such methods of stratification, but they do not solve the problem at all.”

    Yes, stratification doesn’t “solve” the problem of small sample size, but the method can greatly ameliorate it. It gives some of the effect of a larger sample, thereby making studies of a given cost more reliable. The problem is that the improvement, though obviously substantial (the smoking example!) isn’t strictly quantifiable (though statistical estimators could be constructed), and the improvement gained doesn’t show up as a benefit in the p-value mechanism. It’s merely a way to extract more reliable information, a way to form a more appropriate control group, a way to reduce a known source of statistical noise.

    It will also *tend* to reduce the effects of unknown confounders, not by reducing bias (which is already gone), but by reducing noise.

    In the main post, your point is that randomization is essential in order to minimize the effect of confounders, avoid systematic bias, and infer causality. This is true, of vital importance, and needs to be more widely recognized.

    My point in no way takes away from this. It merely says that simple randomization is inferior to stratified randomization.

    You’re aware of stratified sampling methods, and so are many other researchers and evaluators of research results. These methods, if used with any intelligence, are strictly superior. Even if applied in reverse (!) these methods can’t bias a study.

    Is there any valid reason not to stratify before randomizing? As I mentioned, it does have the effect of reducing the incidence of spurious exciting results. Too bad about that.

  14. Hi Anonymous,

    I think we are essentially in agreement, if perhaps placing emphasis a little differently. I agree that stratification is a good thing and improves the methodology of randomization, especially in studies that are not very, very large. All I am saying is that this necessarily biases us towards stratifying based on variables that we expect to be confounders, but does little if anything to ensure the equal distribution of unknown confounders. Given the possibility that unknown confounders greatly exceed the importance of expected confounders, I think we should utilize stratification but should recognize that it provides no guarantee of equivalence between groups, and thus must rely on the repetition of studies to increase the total number of data points. You are emphasizing the first part of the point and I the second, but I think we are in agreement. Thank you very much for raising the issue and making a valuable contribution to the discussion.


  15. Hi Gordon,

    Far be it from me to accuse someone else of losing humility; that would certainly be an example of the pot calling the kettle black. But I do not think you can infer lack of causation from lack of correlation. It raises the equal and opposite problem found with the inference of causation from correlation. I wrote about this here:

    Lack of Correlation Does Not Show Lack of Causation

    I think you are correct, in the case of salt, that there could be confounding variables masking a true association. But there are other possibilities. The effect might not be dose-dependent or linear within the range studied, the range studied might simply be too small, or the indicators of salt intake may not be accurate enough. In the case of salt, there is another obvious issue, which is that there is a well developed body of evidence showing that some people are sensitive to an increase in blood pressure in response to salt and others are not. So an analysis that does not divide the salt sensitive from the salt insensitive is not going to provide very precise information. We can see something like this in the case of blood cholesterol and stroke. For many years it was thought there was no correlation. It turned out that there are two correlations: low cholesterol is associated with hemorrhagic stroke and high cholesterol is associated with ischemic stroke. If one does not make this distinction, the two correlations cancel each other out and the illusion of "no correlation" is created.
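
    That cancellation is easy to demonstrate with a toy simulation (every number below is invented purely for illustration): if hemorrhagic risk rises as cholesterol falls and ischemic risk rises as cholesterol rises, the pooled "stroke" outcome shows essentially no correlation with cholesterol even though both component correlations are real.

    ```python
    import random

    random.seed(1)
    cholesterol = [random.gauss(200, 30) for _ in range(10_000)]

    # Invented risks: hemorrhagic stroke rises as cholesterol falls,
    # ischemic stroke rises as cholesterol rises.
    hemorrhagic = [max(0.0, (220 - c) / 1000) for c in cholesterol]
    ischemic = [max(0.0, (c - 180) / 1000) for c in cholesterol]
    combined = [h + i for h, i in zip(hemorrhagic, ischemic)]

    def corr(xs, ys):
        """Pearson correlation coefficient."""
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    print(f"cholesterol vs hemorrhagic risk: {corr(cholesterol, hemorrhagic):+.2f}")
    print(f"cholesterol vs ischemic risk:    {corr(cholesterol, ischemic):+.2f}")
    print(f"cholesterol vs combined risk:    {corr(cholesterol, combined):+.2f}")  # ~0
    ```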

    That said, I think when one incorporates the totality of the evidence, that body of evidence does not make a convincing case that salt consumption itself causes high blood pressure or contributes to heart disease. On the other hand, traditional diets tended to contain some salt, usually natural salt with a greater variety of background minerals and without noxious additives, but did not contain as much salt as is found in modern processed foods. Moreover, salt is used by "flavorists" to manufacture addictive foods, which I believe likely contributes to obesity, which in turn contributes to metabolic dysfunction and heart disease.


  16. Nice post Chris. Kinda related, not sure if you've read this yet -

    So have you submitted anything to present at AHS12? Just curious.

    Warm Regards,
    Aravind (the young looking 42 year old)

  17. Hi Aravind,

    Yes, I read it earlier today. Fantastic article! I did submit a presentation to AHS12. See you there?


  18. This comment has been removed by the author.

  19. Yes I hope to be there too. Looking forward to it!

  20. It's also interesting to think about this at a meta-level.

    While any single research study might randomize subjects in order to reduce the risk that confounding variables are driving the significant result, the selection of WHICH research studies ever see the light of day isn't random at all. Rather, it's just the opposite! That is, researchers tend to submit for publication those studies showing significant and interesting results. Likewise, journals favor the publication of studies showing interesting and significant results. Given that the effects of a confounding variable can be quite interesting and significant, it sets up an odd paradox: those (published) studies with the most interesting and significant results may be the ones at greater risk for containing a confounding variable's influence. This is in spite of each individual study showing proper randomized design.

    To begin to address this, one needs to first recall the statistical idea of multiple comparison procedures. For example, if a given journal contains 20 studies each showing proper design and Type I error of, say, .05, then chances are that at least one of those twenty studies does in fact contain Type I error--due to a confounding variable or other cause. But even after these controls, there's simply always going to be some degree of selection bias present in published research. ...But like you said, specifically which journal or specific study contains an error is entirely unknown. We just don't know what we don't know!
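
    The arithmetic behind that multiple-comparisons point is worth making explicit; a quick sketch, assuming 20 independent tests, each at alpha = .05:

    ```python
    alpha, n_studies = 0.05, 20

    # Probability that at least one of 20 independent studies produces a
    # false positive at the .05 level: 1 - (probability that none do).
    p_any_false_positive = 1 - (1 - alpha) ** n_studies
    print(f"{p_any_false_positive:.2f}")  # ~0.64
    ```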

  21. MNL, good point. Thankfully, statisticians are developing ways to deal with this, such as funnel plots for detecting publication bias. Clinical trials must also now be registered beforehand, which will increase transparency.


