Let me start with a confession. For the better part of a decade, I practiced medicine under a comforting belief: the randomized controlled trial is the gold standard of medical evidence, and everything else is lesser. I did not just believe this. I participated in building the temple. As a Principal Investigator on Phase III dupilumab trials that contributed to an FDA approval, I was inside the machinery of randomization. I saw its elegance. I administered its protocols. I trusted its output.

And I was wrong. Not about randomization being useful. It is. But about what it actually does, what it cannot do, and the uncomfortable gap between the two that most clinicians, regulators, and even many researchers never examine closely enough.

This post is not anti-RCT. It is anti-complacency. It is a careful, evidence-based argument that the pedestal we have placed randomization on is simultaneously deserved and dangerously misleading. If you prescribe medications, design trials, review evidence, or make policy based on “Level 1 evidence,” you need to understand the five structural problems I am about to lay out.

What Randomization Actually Does (and Doesn’t Do)

The standard teaching goes like this: randomization eliminates confounding. You flip a coin, assign patients to treatment or control, and any difference in outcome must be caused by the treatment because the groups are balanced on all variables, measured and unmeasured. It is elegant. It is intuitive. And it is an oversimplification that borders on mythology.

Here is what randomization actually guarantees, in precise language: it makes treatment assignment statistically independent of baseline covariates, which means the groups are balanced in expectation. Over infinite repetitions of the randomization procedure, the groups would be balanced on average. In any single trial, they may not be. This is not a pedantic distinction. It is the entire foundation of what makes small trials unreliable and what makes subgroup analyses from even large trials treacherous.

Randomization does not eliminate confounding. It makes confounding random rather than systematic. That is a meaningful improvement. It is not the same thing as elimination.
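
To see the distinction concretely, here is a minimal simulation sketch in Python. The numbers are entirely hypothetical, not data from any trial: we randomize many small two-arm trials and track how a single prognostic covariate balances on average while drifting badly in individual trials.

```python
import numpy as np

rng = np.random.default_rng(0)

n_trials, n_patients = 10_000, 40
imbalances = []

for _ in range(n_trials):
    # A binary prognostic covariate (say, diabetes), present in ~30% of patients.
    covariate = rng.random(n_patients) < 0.30
    # 1:1 randomization, independent of the covariate by construction.
    treated = rng.permutation(n_patients) < n_patients // 2
    # Absolute difference in covariate prevalence between the two arms.
    imbalances.append(abs(covariate[treated].mean() - covariate[~treated].mean()))

imbalances = np.asarray(imbalances)
# Balance in expectation: the mean imbalance across trials is near zero.
print(f"Mean imbalance across trials: {imbalances.mean():.3f}")
# Balance in fact: many individual trials are visibly imbalanced.
print(f"Trials with a >10-point imbalance: {(imbalances > 0.10).mean():.1%}")
```

Across ten thousand simulated trials the mean imbalance is essentially zero, yet a large fraction of the individual 40-patient trials show a double-digit imbalance in the covariate. That gap is exactly the gap between balance in expectation and balance in fact.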

But the deeper problem is not statistical. It is conceptual. The reverence for randomization has calcified into a hierarchy of evidence that treats the RCT as categorically superior to all other study designs. This hierarchy, taught in every medical school and enshrined in every guideline committee, contains a logical error that I did not fully appreciate until I began studying causal inference formally.

The Five Cracks in the Crown

Allow me to lay out, one by one, the structural limitations of randomized trials that are not limitations of execution (those exist too) but limitations of the design itself. These are features, not bugs. They are baked into what an RCT is.

1. The Eligibility Paradox

To conduct a clean trial, you must exclude messy patients. Patients with comorbidities, polypharmacy, extremes of age, non-adherence risk, psychiatric conditions. The very patients who will receive the drug once it is approved. The typical Phase III trial enrolls a population that represents, at best, 30 to 40 percent of the patients who will eventually be prescribed the medication. Some estimates put it as low as 15 percent. This means the internal validity that randomization provides comes at the direct expense of external validity. The trial answers a question precisely, but it may not be the question that matters in the clinic.
2. The Time Horizon Problem

Most RCTs run for 12 to 52 weeks. Cardiovascular prevention trials are an exception, sometimes extending to 3 to 5 years. But chronic diseases last decades. A 52-week trial of a biologic for psoriasis tells you whether it works at one year. It tells you almost nothing about what happens at year five: whether the benefit persists, whether tolerance develops, whether long-term safety signals emerge. And yet clinical decisions are made for lifelong therapy based on one-year data. The RCT, by its finite design, generates a snapshot, and we treat it as a movie.
3. The Average Treatment Effect Illusion

This is the crack that, once you see it, you cannot unsee. An RCT reports an average treatment effect: Drug A reduces LDL by 40 mg/dL on average, or achieves a 75% reduction in PASI score in 60% of patients. But averages are fictions. No individual patient has the average treatment effect. Some patients benefit enormously. Some benefit modestly. Some do not benefit at all. And some are actively harmed. The average hides all of this. When I prescribed dupilumab to patients in practice, I did not give them the average response. I gave them their response. And their response depended on variables that the trial was not designed to detect. The simulation sketch just after this list makes the point concrete.

The RCT tells you that a drug works. It does not tell you for whom it works, why it works, or what to do when it doesn’t. Those are the questions that actually matter in clinical practice.
4. The Comparison Constraint

An RCT can compare A to B. It cannot easily compare A to B to C to D in every possible sequence, combination, and duration. In cardiovascular prevention, a patient might be on an ACE inhibitor, a statin, aspirin, and a beta-blocker simultaneously. The number of possible combinations, doses, and sequences is combinatorially explosive: with four drug classes, three dose levels each, and the option to omit any of them, a single prescribing decision already has 4^4 = 256 possible regimens, before sequencing over time enters at all. No trial can test them all. Yet clinical practice requires exactly these decisions. Should I add drug C or switch drug B? The RCT answers the question it was designed to ask, which is rarely the question the clinician is facing in the exam room.
5. The Ethical Boundary

There are questions that matter enormously and that we cannot ethically randomize. Can we randomize patients to receive no treatment for severe hypertension to see what happens? Can we randomize patients to a polypill versus their physician’s individualized choice? Can we randomize patients by their genetic profile to treatment arms when we suspect harm in one subgroup? The answer, often, is no. And so the most important clinical questions, the ones where lives are most at stake, are precisely the ones where randomization cannot help us. We are left with observational data. And the evidence hierarchy says observational data is inferior. This is where the myth becomes dangerous.
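
Here is the promised sketch of the average treatment effect illusion from item 3, in Python with made-up numbers: an imaginary LDL-lowering drug, not dupilumab or any real trial. A mixture of responder types produces a healthy-looking average effect while a meaningful fraction of patients is actively harmed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Four hypothetical responder types (proportions are illustrative, not from any trial):
# strong responders, modest responders, non-responders, and patients who are harmed.
proportions = [0.40, 0.35, 0.15, 0.10]
effect_by_type = np.array([-60.0, -25.0, 0.0, +15.0])  # change in LDL, mg/dL

types = rng.choice(4, size=n, p=proportions)
individual_effects = effect_by_type[types] + rng.normal(0.0, 5.0, size=n)

# The trial reports this single number...
print(f"Average treatment effect: {individual_effects.mean():+.1f} mg/dL")
# ...which hides all of this.
print(f"Patients actively harmed (LDL rises): {(individual_effects > 0).mean():.1%}")
print(f"Patients with little or no benefit:   {(individual_effects > -5).mean():.1%}")
```

The average, roughly a 31 mg/dL reduction in this toy world, looks like an unambiguous win. It is also an effect that almost no simulated patient actually experiences.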

The Hierarchy Problem

The evidence hierarchy, pyramid-shaped, with systematic reviews of RCTs at the top and case reports at the bottom, is one of the most influential conceptual models in modern medicine. It has also, I believe, done subtle but significant damage to how we think about evidence.

The hierarchy conflates study design with study quality. It assumes that an RCT is always superior to an observational study, regardless of how the RCT was conducted or what question it asked. But a well-designed observational study, one with appropriate causal methods, clear assumptions, sensitivity analyses, and a question that matches the clinical decision, can be more informative than a poorly designed RCT that answers the wrong question in the wrong population over the wrong time horizon.

The key insight: the quality of causal inference does not depend on the study design alone. It depends on the transparency and plausibility of the assumptions required for a causal interpretation. Randomization is one way to justify those assumptions. It is not the only way. And when it comes at the cost of relevance, generalizability, and clinical applicability, the trade-off may not be worth it.

Sir Austin Bradford Hill, whose criteria are often cited in support of the RCT hierarchy, actually warned against exactly this rigidity. In his 1965 address to the Royal Society of Medicine, he argued that the demand for experimental evidence before acting on observational findings could sometimes cost lives. The relationship between smoking and lung cancer, one of the most consequential causal findings in medical history, was established entirely through observational data.

What I Saw Inside the Machine

Let me bring this back to lived experience. When I was running Phase III dupilumab trials, I observed things that the final publications do not capture.

I watched the eligibility criteria exclude the very patients I most wanted to treat. I saw how the controlled environment of a trial site, with its protocol visits, adherence monitoring, and nurse follow-ups, created conditions that would never exist in a busy dermatology clinic in Casablanca or Brooklyn. I noticed that the patients who responded beautifully in the trial were not always the same type of patients who responded in my practice afterward.

None of this means the trial was wrong. The dupilumab trials were rigorous, well-designed, and their conclusions were sound within the scope of the question they asked. But the scope of that question was narrower than we typically acknowledge. And the gap between what the trial told us and what clinicians needed to know was wider than the publications suggested.

The trial proved that dupilumab works. It did not tell me which of my patients in Marrakech, with their different genetic backgrounds, different comorbidity profiles, different environmental exposures, would respond the way the average trial participant did.

The Alternative Is Not Nihilism

I anticipate the objection: if not the RCT, then what? Are you arguing for anecdote-based medicine? For abandoning rigor?

Absolutely not. I am arguing for expanding our definition of rigor. The field of causal inference, developed over the past four decades by statisticians, epidemiologists, and computer scientists, provides formal mathematical frameworks for extracting causal conclusions from observational data under clearly stated assumptions. These are not informal, hand-waving arguments. They are precise, testable, and in many cases falsifiable.

Methods like target trial emulation allow researchers to design an observational study as if it were a trial, specifying eligibility criteria, treatment strategies, time zero, and outcomes with the same discipline applied to protocol design. The parametric g-formula handles time-varying confounding and competing risks in ways that standard regression cannot. Directed Acyclic Graphs make assumptions explicit and testable. These tools do not replace the RCT. They complement it. They fill the gaps that randomization, by its nature, cannot reach: the long-term question, the combinatorial question, the real-world population question, the ethical boundary question.
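
To show what explicit assumptions look like in code, here is a minimal sketch of standardization, the point-treatment special case of the g-formula, on simulated data with hypothetical variable names. The causal assumption is stated right in the code: L is the only confounder of treatment A and outcome Y. If that assumption fails, so does the estimate, and that honesty is precisely what the method enforces.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 50_000

# Simulated observational data. Assumption, stated explicitly:
# L (baseline severity) is the ONLY confounder of treatment A and outcome Y.
L = rng.binomial(1, 0.4, n)                 # 40% of patients are severe
A = rng.binomial(1, 0.2 + 0.5 * L)          # sicker patients are treated more often
Y = rng.normal(2.0 * L - 1.0 * A, 1.0)      # true treatment effect: -1.0

df = pd.DataFrame({"L": L, "A": A, "Y": Y})

# Naive comparison: confounded, because treated patients are sicker at baseline.
naive = df.loc[df["A"] == 1, "Y"].mean() - df.loc[df["A"] == 0, "Y"].mean()

# Standardization (point-treatment g-formula):
#   E[Y | do(A=a)] = sum over l of E[Y | A=a, L=l] * P(L=l)
p_L = df["L"].value_counts(normalize=True)
mean_Y = df.groupby(["A", "L"])["Y"].mean()
standardized = sum((mean_Y[(1, l)] - mean_Y[(0, l)]) * p_L[l] for l in p_L.index)

print(f"Naive difference (confounded): {naive:+.2f}")
print(f"Standardized estimate:         {standardized:+.2f}  (truth: -1.00)")
```

In this toy world the naive comparison lands near zero, because confounding by severity cancels the true benefit, while standardization recovers it. The point is not that the code is clever. The point is that the assumption it rests on is written down where anyone can attack it.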

The emerging consensus

The FDA’s Real-World Evidence Program, formalized under the 21st Century Cures Act, explicitly encourages the use of observational data for regulatory decisions. The European Medicines Agency has issued guidance on the use of real-world data for effectiveness evaluation. The field is moving not away from rigor, but toward a broader, more honest conception of what rigor means.

So Where Does This Leave Us?

If you are a clinician, it leaves you here: the evidence from RCTs is necessary but not sufficient. It answers one version of the question and leaves others unanswered. The next time you read a trial, ask yourself: does this population look like my patient? Does this time horizon match my clinical decision? Does the comparison reflect my actual choice? If the answer to any of these is no, you are extrapolating beyond what randomization can guarantee.

If you are a researcher, it leaves you here: the hierarchy of evidence is a starting heuristic, not a final judgment. The quality of a study depends on the quality of its assumptions and the transparency with which those assumptions are stated. A well-designed causal analysis with explicit assumptions can be more informative than a trial with hidden ones.

If you are a patient, it leaves you here: the drug that was proven effective in a trial was proven effective for the average participant in that trial. Whether it will work for you depends on factors the trial may not have measured. This is not a reason for despair. It is a reason to demand better evidence, evidence that accounts for your complexity, not just the population average.

What Comes Next

I did not write this post just to critique the RCT. I wrote it because I am building something on the other side of that critique. In my doctoral research, I am applying formal causal inference methods to real-world patient data, methods designed to answer exactly the questions that trials cannot.

In my next post, I will show you what happens when you take the same clinical question and run it through three different causal architectures. The results do not just differ. They tell fundamentally different stories about who benefits and who does not. And one of those stories changes the clinical calculus entirely.

If you found this argument compelling, or if you violently disagree, I want to hear from you. The best science happens at the intersection of disagreement and rigor.


Dr. Hafsa Benzzi

Board-certified dermatologist. Former Principal Investigator on Phase III dupilumab trials. PhD candidate in Clinical Research at Icahn School of Medicine at Mount Sinai, studying causal inference for cardiovascular disease prevention. She has practiced medicine across Morocco, France, and the United States.