What an Effect Size Actually Tells a Leader (And What It Doesn't)
Somewhere in the last two decades, "effect size" became a piece of leadership vocabulary, a number leaders cite the way they used to cite intuition. It shows up in professional development slides, in initiative rollouts, in the justification for why this year's focus is this and not that. Used well, it's one of the more honest tools available for prioritizing scarce time. Used poorly, it becomes a way of dressing up an already-made decision in a language of certainty it doesn't deserve.
It's worth understanding what the number is actually doing before leaning on it in front of a team.
The basic idea, stripped of jargon
An effect size is a way of measuring how much difference a practice makes, expressed in a standardized unit so that different studies — measuring different things, with different tests — can be compared on the same scale. Rather than saying "test scores went up," which tells you nothing about whether that increase was large or trivial, an effect size tells you the size of the shift relative to the normal spread of outcomes you'd see anyway.
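To make the idea concrete, one common way to compute a standardized effect is Cohen's d: the difference between two group means divided by their pooled standard deviation. Individual studies may use other estimators, so treat the choice of Cohen's d, the function name `cohens_d`, and the scores below as illustrative assumptions, not as any particular study's method or data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, comparison):
    """Standardized mean difference using a pooled standard deviation."""
    n1, n2 = len(treatment), len(comparison)
    # Sample variances (n - 1 in the denominator)
    v1, v2 = variance(treatment), variance(comparison)
    # Pool the two variances, weighting each by its degrees of freedom
    pooled_sd = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(comparison)) / pooled_sd


# Hypothetical scores, invented purely for illustration
with_practice = [72, 75, 78, 80, 85]
without_practice = [70, 72, 75, 78, 80]

d = cohens_d(with_practice, without_practice)
print(f"effect size: {d:.2f}")
```

The raw difference here is only three points. What makes it interpretable is dividing by the spread: the same three points would be a much smaller effect if scores normally varied twice as widely.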
A widely cited rule of thumb treats an effect size of about 0.4 as roughly equivalent to a year's worth of typical growth in a year's time: in other words, the average effect of simply getting a year older and having some instruction, regardless of what that instruction specifically was. Practices below that line may still be worth doing, but they're not outperforming the baseline of normal development. Practices meaningfully above that line are candidates for real prioritization, because they're moving outcomes faster than time and ordinary instruction alone would.
Why this number is genuinely useful
Every leader operates inside a scarcity problem: finite meeting time, finite coaching cycles, finite attention from an already-stretched team. Effect size research offers something rare — a rough, evidence-based way to rank competing initiatives instead of choosing based on which one is newest, loudest, or most recently pitched by a compelling speaker. If you have to pick one or two things to go deep on this year, "which of these has the strongest evidence of actually moving outcomes" is a far better filter than "which of these do I have the most conviction about."
It's also a useful tool for saying no. Leaders who can point to a number when declining the fifteenth well-intentioned initiative of the year have an easier time protecting focus than leaders relying purely on judgment calls that can be second-guessed.
Where it gets misused
The number invites a kind of false precision. A few cautions worth holding onto:
Context doesn't transfer perfectly. An effect size calculated across a broad research base doesn't guarantee the same result in your specific setting, with your specific population, implemented at your specific level of fidelity. The number describes an average across many contexts, not a guarantee for any one of them.
Implementation quality swamps the number. A high-effect-size practice implemented poorly can easily underperform a moderate-effect-size practice implemented with real fidelity and coherence. The research typically measures the practice under conditions of reasonably faithful implementation, which is exactly the variable most likely to break down in a real building under real time pressure.
It measures average effect, not effect for your specific gap. A practice with strong average effects may do very little for the particular subgroup or particular skill deficit you're most worried about. The number is a starting filter, not a diagnosis.
Comparing effect sizes across very different studies is riskier than it looks. Different measurement instruments, different populations, and different research designs all shift what a given number actually represents. Treating a league table of effect sizes as a precise ranking rather than a rough guide overstates what the underlying research can support.
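The last caution is easy to see with arithmetic. Because an effect size divides a raw gain by the spread of scores, the same raw gain produces a different number depending on the test used and on how varied the studied population was. The sketch below uses a hypothetical helper, `standardized_effect`, and invented figures.

```python
def standardized_effect(raw_gain, outcome_sd):
    """Raw gain expressed in units of the outcome's standard deviation."""
    if outcome_sd <= 0:
        raise ValueError("outcome_sd must be positive")
    return raw_gain / outcome_sd


# Hypothetical: two studies report the same 5-point raw gain.
raw_gain = 5.0

# Study A sampled a fairly uniform group, so scores cluster tightly.
narrow_sample = standardized_effect(raw_gain, outcome_sd=10.0)

# Study B sampled a mixed group, so scores are widely spread.
broad_sample = standardized_effect(raw_gain, outcome_sd=20.0)

print(narrow_sample, broad_sample)
```

Study A's effect size comes out twice as large as Study B's even though students gained the same number of points in both. A ranking built from those two numbers would partly reflect who was studied and how, not just what was done.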
How to use it without overselling it
The strongest use of effect-size data isn't as proof but as a starting filter: use it to narrow a long list of possible priorities down to a short one worth investigating more closely, then pair the number with your own local data before committing significant time to any one practice. When you introduce a practice to a team on the strength of its effect size, be honest about the caveat: this is evidence that it tends to work, not a guarantee it will work here, which is exactly why implementation fidelity and local monitoring matter as much as the initial decision.
Leaders who treat effect size as one input among several — alongside context, capacity, and coherence with what's already underway — get real value from it. Leaders who treat it as an oracle tend to be surprised, a year later, when the number didn't deliver on its own.
A worked example
Imagine two competing practices under consideration, one with an effect size around 0.6 and one around 0.3, both requiring, on paper, roughly the same investment of training and time. The number alone suggests a clear choice. But suppose the 0.3 practice is already partially embedded in existing routines, well understood by staff, and aligned with three other current priorities, while the 0.6 practice would require building an entirely new structure from scratch, with no existing staff expertise to draw on. The raw number still favors the 0.6 practice, but a leader weighing implementation reality alongside the evidence might reasonably choose the 0.3 practice, betting that high-fidelity execution of a moderately effective practice will outperform low-fidelity execution of a theoretically stronger one. This isn't ignoring the evidence. It's using it as one input inside a fuller judgment, which is precisely what the research base itself would recommend if you read the caveats as carefully as the headline number.
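One way to pressure-test that bet is to put rough numbers on it. The sketch below assumes, purely for illustration, that a practice's realized effect shrinks in proportion to how faithfully it is implemented. That proportional model, the function name `realized_effect`, and the fidelity guesses are all assumptions introduced here, not findings from the research. It is a thinking aid, not a forecast.

```python
def realized_effect(nominal_effect, fidelity):
    """ASSUMPTION: realized effect scales linearly with fidelity (0.0 to 1.0)."""
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("fidelity must be between 0 and 1")
    return nominal_effect * fidelity


# Hypothetical fidelity guesses; a leader would supply their own.
embedded = realized_effect(0.3, fidelity=0.9)   # familiar, already in routines
new_build = realized_effect(0.6, fidelity=0.4)  # new structure, no expertise yet

# Fidelity the 0.6 practice would need just to match the embedded one
break_even_fidelity = embedded / 0.6

print(f"embedded: {embedded:.2f}, new build: {new_build:.2f}")
print(f"break-even fidelity for the 0.6 practice: {break_even_fidelity:.2f}")
```

Under these guesses the embedded practice comes out slightly ahead, at about 0.27 versus 0.24. The more useful output is the break-even figure: the question stops being "which number is bigger" and becomes "can we realistically implement the new practice at better than roughly 45 percent fidelity in year one?" That is a question local knowledge can actually answer.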