Let me start with a confession. For the better part of a decade, I practiced medicine under a comforting belief: the randomized controlled trial is the gold standard of medical evidence, and everything else is lesser. I did not just believe this. I participated in building the temple. As a Principal Investigator on Phase III dupilumab trials that contributed to an FDA approval, I was inside the machinery of randomization. I saw its elegance. I administered its protocols. I trusted its output.
And I was wrong. Not about randomization being useful. It is. But about what it actually does, what it cannot do, and the uncomfortable gap between the two that most clinicians, regulators, and even many researchers never examine closely enough.
This post is not anti-RCT. It is anti-complacency. It is a careful, evidence-based argument that the pedestal we have placed randomization on is simultaneously deserved and dangerously misleading. If you prescribe medications, design trials, review evidence, or make policy based on “Level 1 evidence,” you need to understand the five structural problems I am about to lay out.
What Randomization Actually Does (and Doesn’t Do)
The standard teaching goes like this: randomization eliminates confounding. You flip a coin, assign patients to treatment or control, and any difference in outcome must be caused by the treatment because the groups are balanced on all variables, measured and unmeasured. It is elegant. It is intuitive. And it is an oversimplification that borders on mythology.
Here is what randomization actually guarantees, in precise language: it ensures that treatment assignment is statistically independent of baseline covariates in expectation. Over infinite repetitions of the randomization procedure, the groups will be balanced. In any single trial, they may not be. This is not a pedantic distinction. It is the entire foundation of what makes small trials unreliable and what makes subgroup analyses from even large trials treacherous.
Randomization does not eliminate confounding. It makes confounding random rather than systematic. That is a meaningful improvement. It is not the same thing as elimination.
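The "in expectation, not in any single trial" point is easy to verify for yourself. Below is a minimal simulation sketch (the covariate name and thresholds are illustrative, not from any real trial): we repeatedly randomize patients to two arms and measure how far apart the arms land on a single baseline covariate. In small trials, chance imbalances of a quarter standard deviation or more are routine; in large trials, they become rare.

```python
import numpy as np

rng = np.random.default_rng(0)

def baseline_imbalance(n_patients, n_trials=10_000):
    """Simulate repeated coin-flip randomizations and return the
    between-arm difference in the mean of one standardized baseline
    covariate (think of it as disease severity) for each simulated trial."""
    diffs = np.empty(n_trials)
    for i in range(n_trials):
        severity = rng.standard_normal(n_patients)   # baseline covariate
        arm = rng.integers(0, 2, n_patients)         # coin-flip assignment
        if arm.sum() in (0, n_patients):             # degenerate draw: all one arm
            diffs[i] = 0.0
            continue
        diffs[i] = severity[arm == 1].mean() - severity[arm == 0].mean()
    return diffs

for n in (20, 200, 2000):
    d = baseline_imbalance(n)
    print(f"n={n:5d}  mean diff={d.mean():+.4f}  "
          f"P(|diff| > 0.25 SD)={np.mean(np.abs(d) > 0.25):.3f}")
```

Averaged over the 10,000 simulated trials, the difference is essentially zero at every sample size: that is the "in expectation" guarantee. But the probability that any *single* small trial is meaningfully imbalanced is substantial, which is exactly why small trials and post hoc subgroups are treacherous.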
But the deeper problem is not statistical. It is conceptual. The reverence for randomization has calcified into a hierarchy of evidence that treats the RCT as categorically superior to all other study designs. This hierarchy, taught in every medical school and enshrined in every guideline committee, contains a logical error that I did not fully appreciate until I began studying causal inference formally.
The Five Cracks in the Crown
Allow me to lay out, one by one, the structural limitations of randomized trials that are not limitations of execution (those exist too) but limitations of the design itself. These are features, not bugs. They are baked into what an RCT is.
The RCT tells you that a drug works. It does not tell you for whom it works, why it works, or what to do when it doesn’t. Those are the questions that actually matter in clinical practice.
The Hierarchy Problem
The evidence hierarchy, pyramid-shaped, with systematic reviews of RCTs at the top and case reports at the bottom, is one of the most influential conceptual models in modern medicine. It has also, I believe, done subtle but significant damage to how we think about evidence.
The hierarchy conflates study design with study quality. It assumes that an RCT is always superior to an observational study, regardless of how the RCT was conducted or what question it asked. But a well-designed observational study (appropriate causal methods, clear assumptions, sensitivity analyses, and a question that matches the clinical decision) can be more informative than a poorly designed RCT that answers the wrong question in the wrong population over the wrong time horizon.
Sir Austin Bradford Hill, whose criteria are often cited in support of the RCT hierarchy, actually warned against exactly this rigidity. In his 1965 address to the Royal Society of Medicine, he argued that the demand for experimental evidence before acting on observational findings could sometimes cost lives. The relationship between smoking and lung cancer, one of the most consequential causal findings in medical history, was established entirely through observational data.
What I Saw Inside the Machine
Let me bring this back to lived experience. When I was running Phase III dupilumab trials, I observed things that the final publications do not capture.
I watched the eligibility criteria exclude the very patients I most wanted to treat. I saw how the controlled environment of a trial site, with its protocol visits, adherence monitoring, and nurse follow-ups, created conditions that would never exist in a busy dermatology clinic in Casablanca or Brooklyn. I noticed that the patients who responded beautifully in the trial were not always the same type of patients who responded in my practice afterward.
None of this means the trial was wrong. The dupilumab trials were rigorous, well-designed, and their conclusions were sound within the scope of the question they asked. But the scope of that question was narrower than we typically acknowledge. And the gap between what the trial told us and what clinicians needed to know was wider than the publications suggested.
The trial proved that dupilumab works. It did not tell me which of my patients in Marrakech, with their different genetic backgrounds, different comorbidity profiles, different environmental exposures, would respond the way the average trial participant did.
The Alternative Is Not Nihilism
I anticipate the objection: if not the RCT, then what? Are you arguing for anecdote-based medicine? For abandoning rigor?
Absolutely not. I am arguing for expanding our definition of rigor. The field of causal inference, developed over the past four decades by statisticians, epidemiologists, and computer scientists, provides formal mathematical frameworks for extracting causal conclusions from observational data under clearly stated assumptions. These are not informal, hand-waving arguments. They are precise, testable, and in many cases falsifiable.
Methods like target trial emulation allow researchers to design an observational study as if it were a trial, specifying eligibility criteria, treatment strategies, time zero, and outcomes with the same discipline applied to protocol design. The parametric g-formula handles time-varying confounding and competing risks in ways that standard regression cannot. Directed Acyclic Graphs make causal assumptions explicit, and some of their implications testable. These tools do not replace the RCT. They complement it. They fill the gaps that randomization, by its nature, cannot reach: the long-term question, the combinatorial question, the real-world population question, the ethical boundary question.
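To make the g-formula less abstract, here is a sketch of its simplest case: a single point-in-time treatment and one measured confounder, where the g-formula reduces to standardization. Everything in this example is synthetic and illustrative (the variable names, effect sizes, and the confounder `L` are assumptions for the demo, not data from any study); the full parametric g-formula for time-varying treatment is considerably more involved. The setup mimics confounding by indication: sicker patients are more likely to be treated, so the crude comparison can make an effective drug look useless.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic observational data. A binary confounder L (severe disease)
# raises the outcome risk AND makes treatment A more likely, so the
# crude treated-vs-untreated comparison is confounded.
n = 200_000
L = rng.binomial(1, 0.4, n)                      # confounder: severe disease
A = rng.binomial(1, np.where(L == 1, 0.7, 0.2))  # sicker patients treated more
Y = rng.binomial(1, 0.3 + 0.3 * L - 0.15 * A)    # true treatment effect: -0.15
df = pd.DataFrame({"L": L, "A": A, "Y": Y})

# Crude (confounded) risk difference.
crude = df.loc[df.A == 1, "Y"].mean() - df.loc[df.A == 0, "Y"].mean()

# g-formula (standardization): average the L-stratum-specific risks under
# "treat everyone" vs "treat no one", weighted by the distribution of L.
risk = df.groupby(["L", "A"])["Y"].mean()
pL = df["L"].value_counts(normalize=True)
risk_treat   = sum(pL[l] * risk[(l, 1)] for l in (0, 1))
risk_control = sum(pL[l] * risk[(l, 0)] for l in (0, 1))
adjusted = risk_treat - risk_control

print(f"crude RD:    {crude:+.3f}")
print(f"adjusted RD: {adjusted:+.3f}  (truth in this simulation: -0.150)")
```

With these particular numbers, the confounding almost exactly cancels the treatment effect in the crude comparison, while standardization recovers the true risk difference. The point is not the arithmetic but the discipline: the causal claim holds only under stated assumptions (here, that `L` is the only confounder), and those assumptions are written down where anyone can attack them.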
The FDA’s Real-World Evidence Program, formalized under the 21st Century Cures Act, explicitly encourages the use of observational data for regulatory decisions. The European Medicines Agency has issued guidance on the use of real-world data for effectiveness evaluation. The field is moving not away from rigor, but toward a broader, more honest conception of what rigor means.
So Where Does This Leave Us?
If you are a clinician, it leaves you here: the evidence from RCTs is necessary but not sufficient. It answers one version of the question and leaves others unanswered. The next time you read a trial, ask yourself: does this population look like my patient? Does this time horizon match my clinical decision? Does the comparison reflect my actual choice? If the answer to any of these is no, you are extrapolating beyond what randomization can guarantee.
If you are a researcher, it leaves you here: the hierarchy of evidence is a starting heuristic, not a final judgment. The quality of a study depends on the quality of its assumptions and the transparency with which those assumptions are stated. A well-designed causal analysis with explicit assumptions can be more informative than a trial with hidden ones.
If you are a patient, it leaves you here: the drug that was proven effective in a trial was proven effective for the average participant in that trial. Whether it will work for you depends on factors the trial may not have measured. This is not a reason for despair. It is a reason to demand better evidence, evidence that accounts for your complexity, not just the population average.
What Comes Next
I did not write this post just to critique the RCT. I wrote it because I am building something on the other side of that critique. In my doctoral research, I am applying formal causal inference methods to real-world patient data, methods designed to answer exactly the questions that trials cannot.
In my next post, I will show you what happens when you take the same clinical question and run it through three different causal architectures. The results do not just differ. They tell fundamentally different stories about who benefits and who does not. And one of those stories changes the clinical calculus entirely.
If you found this argument compelling, or if you violently disagree, I want to hear from you. The best science happens at the intersection of disagreement and rigor.