Understanding Heterogeneity in Meta-Analysis: A Structured Methodological Tutorial

Submitted: 19 August 2025. Posted: 21 August 2025.


Abstract
Meta-analysis is frequently read from the diamond down. The forest plot’s tidy alignment gives the illusion of certainty, with the pooled diamond suggesting a single definitive answer. Yet the forest is rarely uniform: some trunks lean, others twist, and a few tower or collapse, reshaping the skyline. This metaphor illustrates heterogeneity—the unevenness between studies—that ultimately determines the reliability of pooled estimates. This tutorial recenters interpretation on that variability: Q signals its existence, I² describes the proportion beyond chance, and τ² quantifies its magnitude, while prediction intervals extend these measures into practice by showing the range that future studies may realistically occupy. In diagnostic test accuracy, hierarchical models such as Reitsma’s bivariate model and the HSROC are highlighted, as they preserve the correlation between sensitivity and specificity and capture threshold-driven heterogeneity. Beyond numerical measures, visual and analytical approaches provide complementary insights into the underlying sources of heterogeneity, helping to explain why studies diverge. From these tools emerge practical lessons: the need for transparent reporting, robust estimators, prediction intervals, and caution in interpreting subgroup claims, while routine pitfalls—such as defaulting to DerSimonian–Laird, selecting the model solely on the basis of a heterogeneity statistic, or reporting I² in isolation—should be avoided. The message is simple: the diamond is not the compass—meta-analysis earns credibility not by multiplying averages, but by explaining the uneven forest behind them.

Introduction

Imagine walking through two forests. In the first, all trunks stand vertical, of equal height and girth, aligned in neat rows. The skyline is smooth, giving the impression of perfect order. This is homogeneity: studies pointing in the same direction, with little variation.
Now picture a second forest. Here one trunk is thicker, another taller, several lean at different angles. The skyline is uneven and irregular. This is heterogeneity: differences in study results that go beyond what would be expected by chance. In this metaphor, each tree represents a study: its tilt reflects the effect estimate, its height or girth the sample size and precision, and the skyline the pooled evidence. What at first glance may look like tidy alignment is, in truth, a landscape of variation—the heterogeneity that ultimately shapes how trustworthy meta-analytic conclusions are (Figure 1).
It is tempting to admire the canopy without a closer look. Meta-analysis invites the same shortcut: our eyes go straight to the diamond at the bottom of the forest plot, the promise of a single answer. But just as the forest is shaped by the tilt and health of individual trees, the pooled estimate is only as meaningful as the variation behind it. Heterogeneity is what lies between the trunks, what bends the landscape, and what can quietly change the story told by the diamond.

How is Heterogeneity Measured?

But how do we measure the irregularities of a forest? One might count trees, compare their heights, or note the angles at which they lean. Each measure captures part of the picture, but none tells the whole story. Meta-analysis is the same: heterogeneity can be quantified in several ways, each with strengths and limitations.

The Q Statistic

Definition.
The Q statistic, introduced by Cochran in 1954 [1], is the oldest test for heterogeneity. It asks whether the differences between study results are larger than expected by chance. In forest terms, a few random tilts are normal, but if several trunks lean sharply in different directions, something real is making the forest uneven. Q distinguishes these scenarios. In practice, Q sums the squared deviations of each study from the pooled mean, weighting more precise studies more strongly. If studies align, Q stays small; if some diverge, Q grows.
Interpretation.
Under the null hypothesis of homogeneity, Q follows approximately a chi-square (χ²) distribution with degrees of freedom equal to the number of studies minus one (df = n–1) [2,3]. The χ² distribution may sound intimidating, but it is simply a reference curve for the scatter we would expect by chance. It is like a “null model” of a perfectly straight forest: if the observed Q is much larger than this baseline, the variation is unlikely to be random. Degrees of freedom (df) serve to calibrate this test, because they define the χ² curve against which Q is compared when calculating the p-value. In a small grove (few studies, low df), we expect nearly perfect alignment, so even a single leaning trunk feels alarming. In a vast forest (many studies, high df), some tilting is anticipated as part of the baseline variation, and concern arises only when the leaning systematically exceeds this expectation across many trees.
In summary: the higher the Q value, the greater the heterogeneity; the df simply adjust the yardstick used to judge whether that value is extreme enough to reject homogeneity. A small p-value (<0.05) suggests real heterogeneity; a large one means the scatter may still be due to chance.
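For readers who want to see the arithmetic, the sketch below computes Q, its df, and its p-value in Python; the log risk ratios and within-study variances are invented values used purely for illustration, not data from any real meta-analysis.

    import numpy as np
    from scipy import stats

    # Illustrative (invented) data: log risk ratios and within-study variances
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30])
    v = np.array([0.04, 0.02, 0.09, 0.03, 0.06])

    w = 1 / v                            # inverse-variance weights
    mu_fe = np.sum(w * y) / np.sum(w)    # fixed-effect pooled estimate
    Q = np.sum(w * (y - mu_fe) ** 2)     # Cochran's Q
    df = len(y) - 1
    p = stats.chi2.sf(Q, df)             # p-value against the chi-square reference
    print(f"Q = {Q:.2f}, df = {df}, p = {p:.3f}")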
Limitations.
Because Q depends directly on df, it is strongly influenced by the number of studies. With few, it has low power and may miss true heterogeneity; with many, it becomes oversensitive, flagging trivial differences [4]. Another key limitation is that Q reflects the amount of excess variation but not its structure. Two meta-analyses can have the same Q value but very different patterns of irregularity—one with many small deviations in height, another with a single extreme outlier. In forest terms, Q can tell us that the skyline is uneven, but not why: whether the irregularity arises from many trees differing slightly in height, or from a single trunk that is dramatically shorter than the rest (Figure 2). Q is therefore useful as a first signal, but never sufficient on its own.

The I² Statistic

Definition.
The I² statistic, introduced by Higgins and Thompson in 2002 [5,6], expresses the proportion of total variability in effect estimates that is due to heterogeneity rather than chance. It is derived from Q and its df. I² is a percentage: 0% means all scatter could be random; higher values mean that some of the variation reflects real differences. For example, I² = 50% suggests that half of the observed variability is genuine.
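Because I² is a simple transformation of Q and df, it takes one extra line of code; the sketch below reuses the same invented data as the Q example above, shown only to make the formula concrete.

    import numpy as np

    # Same illustrative (invented) data as in the Q sketch above
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30])
    v = np.array([0.04, 0.02, 0.09, 0.03, 0.06])
    w = 1 / v
    Q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)
    df = len(y) - 1

    # I^2: proportion of total variability beyond chance, truncated at zero
    I2 = max(0.0, (Q - df) / Q) * 100
    print(f"I^2 = {I2:.0f}%")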
Interpretation.
Guidelines sometimes classify I² values as: 0–40% possibly unimportant, 30–60% moderate, 50–90% substantial, and 75–100% considerable. These thresholds are only rough guides. Two forests may both yield I² = 50% and still look very different: one where tilts are barely noticeable, another where several trunks are almost falling. For this reason, the Cochrane Handbook cautions against applying thresholds rigidly [7].
Limitations.
I² does not measure the actual size of heterogeneity, only its proportion. In meta-analyses with very precise studies, even small absolute differences can yield high I²; with small, imprecise studies, I² may appear low despite obvious spread, or it may overestimate inconsistency because of noise [8]. Another limitation is instability with few studies: with a small forest of only a handful of trees, I² may underestimate or exaggerate the true inconsistency, and its apparent precision is misleading. In summary: I² is a useful quick signal of inconsistency, but it does not tell us how large the heterogeneity really is, and it should never be interpreted in isolation.

The τ² Statistic

Definition.
The τ² statistic is the main measure of between-study variance in a meta-analysis. Variance, in simple terms, describes how spread out a set of numbers is. If all studies give almost identical results, the variance is close to zero; if results differ widely, the variance is larger. While Q detects whether heterogeneity exists, and I² expresses the proportion of variation beyond chance, τ² quantifies its absolute magnitude, expressed in the same units as the effect size (risk ratios, mean differences, log odds ratios, etc.).
In forest terms: I² may tell you that half the trunks lean because of real conditions, but τ² tells you how much they lean—whether a few degrees or almost toppling over.
Interpretation.
Fixed-effect models assume that all studies estimate the same underlying effect. Any differences are attributed to sampling error, so between-study variance is assumed to be zero and τ² is not estimated. In forest terms, this model treats every trunk as perfectly straight, and any tilt we see is dismissed as random noise.
Random-effects models, by contrast, acknowledge that true effects may differ across studies because of variations in populations, interventions, or methods. Here τ² captures the actual variance of these true effects—the average squared distance between them. If τ² = 0, the forest is uniform; as τ² grows, the trunks lean at increasingly different angles.
Estimating τ² is not trivial. Several methods exist (see the sketch after this list):
  • DerSimonian–Laird (DL). The most widely known and historically the default [9]. It is simple—plugging the observed Q into a formula—but biased: with few studies or large true heterogeneity, it tends to underestimate τ², pulling values toward zero and giving overly narrow confidence intervals. Given its well-documented biases, the DL estimator should no longer be regarded as the default choice in modern meta-analysis.
  • Restricted maximum likelihood (REML). Now considered the standard [10,11]. Unlike DL, it iteratively searches for the τ² that makes the observed results most plausible under a random-effects model. This reduces bias and produces more reliable estimates, especially in small meta-analyses or when heterogeneity is substantial.
  • Paule–Mandel (PM). A classical moment-based estimator that, like REML, improves on DL and is currently accepted as an alternative to REML.
  • Bayesian estimators. In Bayesian statistics, τ² is not treated as a fixed number but as something uncertain, described by a probability distribution. We start with a prior (what we already know or assume) and update it with the data to obtain a posterior (what seems plausible after seeing the evidence) [12]. The advantage is that we can make direct probability statements, such as “there is a 70% chance that heterogeneity is above a clinically important level.” This approach is flexible, especially when data are scarce, but the results depend on how the prior is chosen, so it must be specified transparently.
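As promised above, here is a minimal numerical sketch comparing the three frequentist estimators on the same invented data; the REML and Paule–Mandel routines are simplified illustrations of the iterative logic, not reference implementations.

    import numpy as np
    from scipy.optimize import minimize_scalar, brentq

    # Illustrative (invented) data: effect estimates and within-study variances
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30])
    v = np.array([0.04, 0.02, 0.09, 0.03, 0.06])

    def dersimonian_laird(y, v):
        """Moment-based DL estimate of tau^2, truncated at zero."""
        w = 1 / v
        mu = np.sum(w * y) / np.sum(w)
        Q = np.sum(w * (y - mu) ** 2)
        c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
        return max(0.0, (Q - (len(y) - 1)) / c)

    def neg_restricted_loglik(tau2, y, v):
        """Negative REML log-likelihood of tau^2 (constants dropped)."""
        w = 1 / (v + tau2)
        mu = np.sum(w * y) / np.sum(w)
        return 0.5 * (np.sum(np.log(v + tau2)) + np.log(np.sum(w))
                      + np.sum(w * (y - mu) ** 2))

    def reml(y, v):
        """Search for the tau^2 that maximizes the restricted likelihood."""
        res = minimize_scalar(neg_restricted_loglik, bounds=(0.0, 10.0),
                              args=(y, v), method="bounded")
        return res.x

    def paule_mandel(y, v):
        """Solve generalized Q(tau^2) = k - 1; tau^2 = 0 if no excess variation."""
        def gen_q(tau2):
            w = 1 / (v + tau2)
            mu = np.sum(w * y) / np.sum(w)
            return np.sum(w * (y - mu) ** 2) - (len(y) - 1)
        if gen_q(0.0) <= 0:
            return 0.0
        return brentq(gen_q, 0.0, 10.0)

    for name, est in [("DL", dersimonian_laird), ("REML", reml), ("PM", paule_mandel)]:
        print(f"{name}: tau^2 = {est(y, v):.4f}")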
Once τ² is estimated, it influences not only descriptive statistics but also inference. The Hartung–Knapp–Sidik–Jonkman (HKSJ) adjustment, now recommended by the Cochrane Handbook [7], uses τ² to provide more robust confidence intervals in random-effects models. This correction is particularly important when the number of studies is small, where conventional Wald-type intervals—the default output in software such as RevMan—systematically underestimate uncertainty and give an illusion of precision. HKSJ intervals are typically wider, but they reflect the real instability that arises when between-study variance is nonzero.
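A sketch of the HKSJ correction follows, contrasting the conventional Wald-type interval with the HKSJ interval so the widening is visible; the data are invented and the τ² value of 0.05 is assumed (in practice, take it from REML).

    import numpy as np
    from scipy import stats

    # Illustrative (invented) data and an assumed tau^2
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30])
    v = np.array([0.04, 0.02, 0.09, 0.03, 0.06])
    tau2 = 0.05    # assumed here; use the REML estimate in practice
    k = len(y)

    w = 1 / (v + tau2)                 # random-effects weights
    mu = np.sum(w * y) / np.sum(w)     # pooled random-effects estimate

    # Conventional Wald-type interval (normal quantile)
    se_wald = np.sqrt(1 / np.sum(w))
    ci_wald = mu + np.array([-1, 1]) * stats.norm.ppf(0.975) * se_wald

    # HKSJ: rescaled variance combined with a t quantile on k - 1 df
    var_hksj = np.sum(w * (y - mu) ** 2) / ((k - 1) * np.sum(w))
    ci_hksj = mu + np.array([-1, 1]) * stats.t.ppf(0.975, k - 1) * np.sqrt(var_hksj)

    print("Wald:", np.round(ci_wald, 3), " HKSJ:", np.round(ci_hksj, 3))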
A key conceptual point is that τ² and I² are not interchangeable. Two meta-analyses can both report I² = 50%, meaning that half the observed variability is real. But τ² will reveal whether that variability is modest (a few degrees of tilt: an orderly forest) or extreme (trunks leaning at 30–40°: a chaotic woodland). This contrast is illustrated in Figure 3: in both panels, the proportion of leaning trunks is the same (I² = 50%), yet in one the tilt is barely perceptible while in the other it is dramatic. This shows why τ², not I², determines the real scale of heterogeneity and directly governs the width of pooled confidence intervals, the span of prediction intervals, and the reliability of meta-regression analyses.
Limitations.
τ² is less intuitive than Q or I² because its value depends on the scale of the effect size. A τ² of 0.04 may be trivial for a risk ratio but very large for a mean difference in kilograms. The number itself has meaning only relative to the chosen metric. Another limitation is instability with few studies. When the forest has only a handful of trees, τ² can swing wildly depending on the estimator—sometimes suggesting that trunks are almost perfectly aligned, other times that the woodland is chaotic. In forest terms: two woods may both show I² = 50%, but τ² tells you whether the tilts are mild (all trunks leaning just a little, still an orderly skyline) or dramatic (several trunks close to falling, creating real disorder). Finally, τ² is often reported without confidence intervals. Yet its uncertainty can be large, especially in small meta-analyses, and ignoring it risks giving a false sense of certainty. In summary: τ² is harder to read than I², but it is the most structural parameter, because it measures the real size of heterogeneity and directly governs the reliability of pooled results. Because τ² is the foundation for any forward-looking inference, prediction intervals (PIs) represent its most direct clinical extension.

Prediction Intervals: An Extension of τ²

A confidence interval (CI) reflects the precision of the pooled effect, but says nothing about what a new study might show. PIs extend τ² by translating between-study variance into a range of plausible effects for future settings [13,14,15]. Because they incorporate both sampling error and heterogeneity, PIs are almost always wider than CIs.
In forest terms, the CI tells us how precisely we have measured the average tilt of the trunks, while the PI shows the range of tilts we are likely to encounter if we keep walking deeper into the forest. Clinically, this matters: a pooled risk ratio may look beneficial (CI entirely <1.0), yet the PI can cross the line of no effect, warning that in some contexts the intervention may not work—or could even harm.
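The sketch below computes a 95% PI from a pooled random-effects estimate, following the t-based formula popularized by Riley et al. [13]; the data and the τ² value are invented for illustration.

    import numpy as np
    from scipy import stats

    # Illustrative (invented) data and an assumed tau^2
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30])
    v = np.array([0.04, 0.02, 0.09, 0.03, 0.06])
    tau2 = 0.05
    k = len(y)

    w = 1 / (v + tau2)
    mu = np.sum(w * y) / np.sum(w)
    se = np.sqrt(1 / np.sum(w))

    # 95% prediction interval: t quantile with k - 2 df, combining
    # between-study variance with the uncertainty of the pooled mean
    t = stats.t.ppf(0.975, k - 2)
    pi = mu + np.array([-1, 1]) * t * np.sqrt(tau2 + se ** 2)
    print("95% PI on the log scale:", np.round(pi, 3))
    print("95% PI as risk ratios:", np.round(np.exp(pi), 2))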
Limitations. PIs are fragile when based on few studies (<10–15), often giving misleading coverage. They should be read as a conceptual tool that illustrates the plausible range of effects, not as a statistically robust interval in small meta-analyses.

Which Statistic Matters Most: Q, I², or τ²?

No single statistic can capture the full complexity of heterogeneity. Q tells us whether the variability across studies exceeds what chance alone would explain. I² expresses what proportion of the observed scatter is real rather than random. τ² measures the absolute magnitude of that variability on the scale of the chosen effect size.
Of these, τ² is the most structural parameter. It directly governs the width of pooled confidence intervals, the span of prediction intervals, and the performance of meta-regression. I² is easy to report and widely recognized, but it is scale-dependent and unstable when only a few studies are available. Q provides a formal test, but it is driven by sample size and degrees of freedom rather than by the actual importance of heterogeneity.
In forest terms: Q asks if the woodland looks unusual, I² tells what fraction of trunks are leaning beyond chance, and τ² measures how far they actually tilt. For interpretation, this means Q and I² are useful signals, but τ²—together with prediction intervals—is the key to understanding what heterogeneity really means for practice.
But do these conventional metrics of heterogeneity apply equally to diagnostic accuracy studies?
In therapeutic meta-analysis, heterogeneity is usually unidimensional: all studies estimate the same type of effect. Diagnostic test accuracy (DTA) is different. Here two outcomes—sensitivity and specificity—must be analyzed together, because they are linked by the diagnostic threshold. Raising the threshold makes specificity climb but sensitivity fall; lowering it has the opposite effect. This correlation means that univariate pooling of sensitivity or specificity is misleading. Instead, hierarchical random-effects models—the bivariate model and the hierarchical summary ROC (HSROC)—are required [16,17,18].
Interpretation.
In DTA meta-analysis, heterogeneity is described by the τ² for sensitivity and specificity, plus a correlation parameter that captures threshold effects. Unlike intervention reviews, where Q, I², or τ² summarize a single outcome, diagnostic data are inherently bivariate. Applying univariate statistics such as Q or I² to sensitivity or specificity alone ignores this structure and often exaggerates or misrepresents heterogeneity. Metrics like the bivariate I² proposed by Zhou et al. [19] have been suggested, but the most robust approach is to interpret the variance and correlation parameters yielded directly by hierarchical models, as these respect the joint nature of accuracy data.
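A full bivariate fit is best left to dedicated implementations (for example, the reitsma function in R's mada package); the Python sketch below shows only the preliminary step, logit-transforming Se and Sp from hypothetical 2×2 counts and inspecting their correlation as a crude hint of a threshold effect, not a substitute for a hierarchical model.

    import numpy as np

    # Hypothetical 2x2 counts per study: TP, FN, FP, TN (invented values)
    tp = np.array([45, 30, 60, 22, 51])
    fn = np.array([5, 10, 15, 3, 9])
    fp = np.array([8, 4, 20, 2, 12])
    tn = np.array([92, 86, 105, 73, 88])

    # Logit-transformed Se and Sp with a 0.5 continuity correction
    logit_se = np.log((tp + 0.5) / (fn + 0.5))
    logit_sp = np.log((tn + 0.5) / (fp + 0.5))

    # A negative correlation between logit(Se) and logit(Sp) across studies
    # is a crude hint of a threshold effect; hierarchical models estimate
    # this correlation properly, together with tau^2(Se) and tau^2(Sp).
    r = np.corrcoef(logit_se, logit_sp)[0, 1]
    print(f"Observed logit(Se)-logit(Sp) correlation: {r:.2f}")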
Limitations.
Quantifying heterogeneity in DTA is inherently more complex than in intervention reviews. Even when hierarchical models are used, estimates of τ²(Se), τ²(Sp), and the threshold correlation become unstable with few studies, producing wide or imprecise variance estimates. This makes heterogeneity harder to measure, yet also more crucial, because threshold-driven differences are often the main source of inconsistency in diagnostic accuracy research [20].
Table 1 summarizes the main statistics for assessing heterogeneity, highlighting their interpretation, strengths, and limitations.

So, We Found Heterogeneity. What Now?

Detecting heterogeneity is only the first step; the real challenge is what to do with it. Reporting Q, I², or τ² tells us that the forest is uneven, but not why. To move forward, we need to ask three questions: how much, where from, and what it means.

How Much?

Several statistics can be used, each with its strengths and limitations. Cochran’s Q formally tests whether variation exceeds chance, but is highly dependent on the number of studies. I² expresses the proportion of observed variability due to heterogeneity, but does not reflect its magnitude. τ² directly measures the absolute variance of true effects, while prediction intervals extend this information to show the range that future studies might plausibly fall into. None of these measures alone gives a full picture, but together they provide a structured way to judge whether variability is modest, substantial, or large enough to alter interpretation.

Where from?

Subgroup analyses
Splitting studies into categories helps test whether effects differ systematically—for example by risk of bias, study design, population, or outcome definition. In the forest metaphor, it is like comparing slopes, clearings, or groves to see if trees lean more in one environment than another. The credibility of subgroup findings depends on prespecification, biological plausibility, consistency across studies, and the magnitude of difference [21].
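Mechanically, a between-subgroup comparison can be expressed as a Q test on the subgroup pooled estimates; the sketch below uses invented effects and a hypothetical design label, with fixed-effect pooling within groups purely for simplicity (random-effects versions follow the same logic).

    import numpy as np
    from scipy import stats

    # Illustrative (invented) data with a hypothetical design label per study
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30, -0.50])
    v = np.array([0.04, 0.02, 0.09, 0.03, 0.06, 0.05])
    rct = np.array([True, True, False, False, True, False])

    def fe_pool(y, v):
        """Fixed-effect pooled estimate and its variance."""
        w = 1 / v
        return np.sum(w * y) / np.sum(w), 1 / np.sum(w)

    mu_a, var_a = fe_pool(y[rct], v[rct])       # RCT subgroup
    mu_b, var_b = fe_pool(y[~rct], v[~rct])     # observational subgroup

    # Q_between compares the subgroup means against their weighted average
    mus = np.array([mu_a, mu_b])
    ws = 1 / np.array([var_a, var_b])
    grand = np.sum(ws * mus) / np.sum(ws)
    Q_between = np.sum(ws * (mus - grand) ** 2)
    p = stats.chi2.sf(Q_between, df=1)
    print(f"Q_between = {Q_between:.2f}, p = {p:.3f}")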
Leave-one-out checks
This method re-runs the analysis while omitting one study at a time [22,23]. If the skyline of the forest remains stable, results are robust; if a single missing tree reshapes the canopy, that study is an outlier. Its greatest value is identifying such influential cases. But it can exaggerate the impact of random noise or small studies, so it should be read as a sensitivity test rather than grounds for exclusion.
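The procedure is simple enough to show in a few lines; the sketch below re-pools invented data while omitting one study at a time, using the DL estimator purely for brevity.

    import numpy as np

    # Illustrative (invented) data: effect estimates and within-study variances
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30])
    v = np.array([0.04, 0.02, 0.09, 0.03, 0.06])

    def dl_pool(y, v):
        """Random-effects pooled estimate with a DL tau^2 (for brevity)."""
        w = 1 / v
        mu_fe = np.sum(w * y) / np.sum(w)
        Q = np.sum(w * (y - mu_fe) ** 2)
        c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
        tau2 = max(0.0, (Q - (len(y) - 1)) / c)
        w_re = 1 / (v + tau2)
        return np.sum(w_re * y) / np.sum(w_re)

    full = dl_pool(y, v)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i       # omit study i
        loo = dl_pool(y[mask], v[mask])
        print(f"Without study {i + 1}: pooled = {loo:+.3f} (full: {full:+.3f})")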
Meta-regression
When multiple factors may explain inconsistency, meta-regression relates study-level variables (e.g., design, age, quality) to effect size [24]. It is like putting on colored lenses that reveal different leaning patterns. Yet with too few studies it easily produces spurious results; a pragmatic rule is at least 10 studies per covariate.
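Mechanically, meta-regression is weighted least squares with random-effects weights; the sketch below relates invented effects to a hypothetical covariate (mean age), taking τ² as given rather than estimating the residual heterogeneity jointly, as full implementations do.

    import numpy as np
    from scipy import stats

    # Illustrative (invented) data: effects, variances, and a covariate (mean age)
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30])
    v = np.array([0.04, 0.02, 0.09, 0.03, 0.06])
    age = np.array([34.0, 52.0, 41.0, 63.0, 47.0])
    tau2 = 0.05                                   # assumed residual heterogeneity

    X = np.column_stack([np.ones_like(y), age])   # intercept + covariate
    W = np.diag(1 / (v + tau2))                   # random-effects weights

    # Weighted least squares: beta = (X'WX)^-1 X'Wy
    XtWX = X.T @ W @ X
    beta = np.linalg.solve(XtWX, X.T @ W @ y)
    se = np.sqrt(np.diag(np.linalg.inv(XtWX)))    # Wald standard errors
    p = 2 * stats.norm.sf(np.abs(beta / se))
    print("slope per year of age:", np.round(beta[1], 4), " p =", np.round(p[1], 3))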
Table 2 provides an overview of the main analytical approaches available to explore heterogeneity in meta-analysis.
Visual tools
Plots can reveal patterns at a glance. Forest plots show inconsistency through wide or non-overlapping confidence intervals. Baujat plots highlight which studies drive heterogeneity [25], like oversized trees skewing the grove. Galbraith (radial) plots show departures from the central trend [26], and L’Abbé plots reveal scatter in event rates [27].
Funnel plots are perhaps the most widely recognized visual tool. They display study size (or precision) on the vertical axis against effect size on the horizontal, forming an inverted funnel when results are balanced. Large studies cluster near the pooled effect at the top, while smaller studies scatter widely at the bottom. When the funnel is distorted—lopsided or hollow—it may signal small-study effects such as publication bias, selective reporting, or true differences tied to small samples. Because small studies can lean systematically in one direction, they may not only create funnel asymmetry but also inflate statistical heterogeneity (Q, I², τ²). Yet funnel plots have limits: they need a sufficient number of studies (generally ≥10) to be informative, and their interpretation is subjective. Formal statistical tests—Egger’s, Begg’s, or Deeks’ for diagnostic accuracy [28,29,30]—can complement visual inspection, but none are definitive; asymmetry must always be judged in context. Table 3 summarizes the main visual tools to detect and explore heterogeneity.
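As one concrete example, the sketch below runs Egger’s regression test [29] on invented data: the standardized effect is regressed on precision, and a nonzero intercept suggests funnel asymmetry. Note that the eight studies used here fall below the ≥10 usually advised, so this is purely illustrative.

    import numpy as np
    from scipy import stats

    # Illustrative (invented) data: effect estimates and standard errors
    y = np.array([-0.60, -0.10, -0.45, 0.15, -0.30, -0.55, -0.70, 0.05])
    se = np.sqrt(np.array([0.04, 0.02, 0.09, 0.03, 0.06, 0.12, 0.15, 0.02]))

    # Egger's regression: standardized effect (y/se) on precision (1/se);
    # a nonzero intercept points to small-study effects
    precision = 1 / se
    snd = y / se
    res = stats.linregress(precision, snd)

    t_int = res.intercept / res.intercept_stderr
    p_int = 2 * stats.t.sf(abs(t_int), len(y) - 2)
    print(f"Egger intercept = {res.intercept:.2f}, p = {p_int:.3f}")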
What does it mean?
Finding heterogeneity is not the end of the story—the key is how to interpret it. Not all variability is harmful. Some reflects the natural diversity of patients, settings, or interventions: different soils and climates that make trees grow differently. This kind of variation can increase generalizability, showing how effects behave across real-world conditions.
But heterogeneity caused by bias or flawed methods is another matter. If trunks are bent because they were measured with a crooked ruler, the irregularity reflects error, not biology. Pooling such studies risks embedding bias into the summary result.
Numbers (Q, I², τ²) can only signal that inconsistency exists; they cannot say if it is acceptable or fatal. Interpretation requires judgment:
  • If differences arise from valid but diverse contexts, pooling may be reasonable, provided conclusions are nuanced.
  • If differences stem from systematic flaws, pooling misleads, and sometimes the correct choice is not to pool at all.
In practice, heterogeneity should trigger caution, not automatic exclusion or blind pooling. Sometimes the wisest approach is to present a qualified conclusion; other times, to stop at the treeline and refuse to merge trees that clearly do not belong to the same forest.

Transparency, Reproducibility, and Caution

Meta-analysis is not marketing—it is medicine. A polished pooled estimate with a narrow CI may look convincing, but if heterogeneity is concealed the result is misleading. Transparency means documenting every analytic choice so others can retrace the path. Reproducibility means that the same data and code should yield the same findings if re-run independently. And caution means recognizing that heterogeneity is the rule, not the exception: sometimes it is acceptable, sometimes it undermines trust, and sometimes it means pooling should not be done at all.
Table 4 lists key reporting practices that enhance transparency and reproducibility in meta-analysis. Table 5 lists common pitfalls (‘don’ts’) in reporting and interpreting heterogeneity.

Conclusions

This tutorial has used the forest metaphor to make statistical concepts of heterogeneity—Q, I², τ², and beyond—accessible while preserving rigor. Misapplied, these measures inflate certainty and distort results; applied wisely, they clarify when differences are trivial, meaningful, or prohibitive for pooling.
The strength of meta-analysis does not lie in producing a single neat number, but in presenting variability truthfully and interpreting it with care. By embracing transparency, reproducibility, and caution, evidence synthesis can remain both statistically rigorous and clinically relevant.

CRediT Author Statement

Javier Arredondo Montero (JAM): Conceptualization; Methodology; Validation; Investigation; Writing—Original Draft; Writing—Review & Editing; Visualization; Supervision; Project administration.

Ethical Statement

This study did not involve human subjects or animals. As only simulated data were used, no ethical approval or informed consent was required.

Original Work

The manuscript’s author declares that it is an original contribution, not previously published.

Informed Consent

N/A.

AI Use Disclosure

Artificial intelligence (ChatGPT-4, OpenAI) was used to improve the clarity and style of the language.

Data Availability Statement

No new datasets were generated or analyzed for the purposes of this work.

Conflict of Interest

There is no conflict of interest or external funding to declare. The author has nothing further to disclose.

References

  1. Cochran, W.G. The Combination of Estimates from Different Experiments. Biometrics 1954, 10, 101–129.
  2. Mood, A.M.; Graybill, F.A.; Boes, D.C. Introduction to the Theory of Statistics, 3rd ed.; McGraw-Hill: New York, 1974.
  3. Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Duxbury Press, 2002.
  4. Hoaglin, D.C. Misunderstandings about Q and ‘Cochran’s Q test’ in meta-analysis. Stat. Med. 2015, 35, 485–495.
  5. Higgins, J.P.T.; Thompson, S.G. Quantifying heterogeneity in a meta-analysis. Stat. Med. 2002, 21, 1539–1558.
  6. Higgins, J.P.T.; Thompson, S.G.; Deeks, J.J.; Altman, D.G. Measuring inconsistency in meta-analyses. BMJ 2003, 327, 557–560.
  7. Higgins, J.P.T.; Thomas, J.; Chandler, J.; Cumpston, M.; Li, T.; Page, M.J.; Welch, V.A. (editors). Cochrane Handbook for Systematic Reviews of Interventions, version 6.5 (updated August 2024). Cochrane, 2024. Available from www.cochrane.org/handbook.
  8. von Hippel, P.T. The heterogeneity statistic I² can be biased in small meta-analyses. BMC Med. Res. Methodol. 2015, 15, 1–8.
  9. DerSimonian, R.; Laird, N. Meta-analysis in clinical trials. Control. Clin. Trials 1986, 7, 177–188.
  10. Viechtbauer, W. Bias and Efficiency of Meta-Analytic Variance Estimators in the Random-Effects Model. J. Educ. Behav. Stat. 2005, 30, 261–293.
  11. Veroniki, A.A.; Jackson, D.; Viechtbauer, W.; Bender, R.; Bowden, J.; Knapp, G.; Kuss, O.; Higgins, J.P.; Langan, D.; Salanti, G. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Res. Synth. Methods 2015, 7, 55–79.
  12. Higgins, J.P.T.; Thompson, S.G.; Spiegelhalter, D.J. A re-evaluation of random-effects meta-analysis. J. R. Stat. Soc. Ser. A 2009, 172, 137–159.
  13. Riley, R.D.; Higgins, J.P.T.; Deeks, J.J. Interpretation of random effects meta-analyses. BMJ 2011, 342, d549.
  14. Nagashima, K.; Noma, H.; Furukawa, T.A. Prediction intervals for random-effects meta-analysis: a confidence distribution approach. Stat. Methods Med. Res. 2018, 28, 1689–1702.
  15. IntHout, J.; Ioannidis, J.P.A.; Rovers, M.M.; Goeman, J.J. Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open 2016, 6, e010247.
  16. Reitsma, J.B.; Glas, A.S.; Rutjes, A.W.; Scholten, R.J.; Bossuyt, P.M.; Zwinderman, A.H. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J. Clin. Epidemiol. 2005, 58, 982–990.
  17. Rutter, C.M.; Gatsonis, C.A. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat. Med. 2001, 20, 2865–2884.
  18. Harbord, R.M.; Deeks, J.J.; Egger, M.; Whiting, P.; Sterne, J.A.C. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2006, 8, 239–251.
  19. Zhou, Y.; Dendukuri, N. Statistics for quantifying heterogeneity in univariate and bivariate meta-analyses of binary data: the case of meta-analyses of diagnostic accuracy. Stat. Med. 2014, 33, 2701–2717.
  20. Deeks, J.J.; Bossuyt, P.M.; Leeflang, M.M.; Takwoingi, Y. (editors). Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy, version 2.0 (updated July 2023). Cochrane, 2023. Available from https://training.cochrane.org/handbook-diagnostic-test-accuracy/current.
  21. Oxman, A.D.; Guyatt, G.H. A Consumer’s Guide to Subgroup Analyses. Ann. Intern. Med. 1992, 116, 78–84.
  22. Viechtbauer, W.; Cheung, M.W.-L. Outlier and influence diagnostics for meta-analysis. Res. Synth. Methods 2010, 1, 112–125.
  23. Meng, Z.; Wang, J.; Lin, L.; Wu, C. Sensitivity analysis with iterative outlier detection for systematic reviews and meta-analyses. Stat. Med. 2024, 43, 1549–1563.
  24. Thompson, S.G.; Higgins, J.P.T. How should meta-regression analyses be undertaken and interpreted? Stat. Med. 2002, 21, 1559–1573.
  25. Baujat, B.; Mahé, C.; Pignon, J.; Hill, C. A graphical method for exploring heterogeneity in meta-analyses: application to a meta-analysis of 65 trials. Stat. Med. 2002, 21, 2641–2652.
  26. Galbraith, R.F. A note on graphical presentation of estimated odds ratios from several clinical trials. Stat. Med. 1988, 7, 889–894.
  27. L’Abbé, K.A.; Detsky, A.S.; O’Rourke, K. Meta-analysis in clinical research. Ann. Intern. Med. 1987, 107, 224–233.
  28. Sterne, J.A.; Egger, M. Funnel plots for detecting bias in meta-analysis: guidelines on choice of axis. J. Clin. Epidemiol. 2001, 54, 1046–1055.
  29. Egger, M.; Smith, G.D.; Schneider, M.; Minder, C. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997, 315, 629–634.
  30. Begg, C.B.; Mazumdar, M. Operating Characteristics of a Rank Correlation Test for Publication Bias. Biometrics 1994, 50, 1088–1101.
Figure 1. The upper panel depicts a perfectly aligned forest, where all trunks stand vertical and of equal height—an analogy for homogeneous studies with minimal heterogeneity (low Q, low I², τ² ≈ 0). The lower panel shows the same number of trunks, but several are tilted at different angles, while others vary in height or trunk thickness. This uneven woodland represents heterogeneous studies, where differences across results may arise from multiple potential sources of variability rather than chance alone.
Figure 2. Different structures of heterogeneity with similar Q. The upper panel depicts several trunks with small variations in height, creating mild irregularity across the skyline. The lower panel shows otherwise symmetric trunks of equal height, except for one that is extremely short. Both scenarios could yield a similar Q statistic, yet they represent very different realities: the same quantification may arise from the progressive accumulation of many small deviations or from a single disproportionate outlier. This underscores that Q reflects the presence of excess variation but does not capture its underlying structure.
Figure 3. Same I², different τ². The upper panel shows six trunks: three perfectly vertical and three with a slight tilt. The lower panel mirrors this arrangement, but the tilts are pronounced. Although the proportion of leaning trunks is identical (conceptually the same I²), the degree of tilt differs, representing small versus large τ². This illustrates that I² captures only the proportion of heterogeneity, whereas τ² reflects its magnitude: two meta-analyses may share the same I² yet differ greatly in absolute variability between studies.
Table 1. Key statistics for assessing heterogeneity in meta-analysis.
Measure | What it measures | How it works | Strengths | Limitations | Forest metaphor
Q (Cochran’s Q) | Tests if variability between studies is greater than chance alone | χ² test, df = n–1 | Simple, widely implemented | Low power with few studies; too sensitive with many; only a test (yes/no) | Spotting whether the grove looks uneven at all
I² (Higgins & Thompson) | Proportion of observed variability due to real heterogeneity (not chance) | Derived from Q and df | Intuitive %, widely reported | Distorted with very small/large studies; unstable with few studies; does not tell absolute size | What fraction of the leaning goes beyond natural randomness
τ² (tau-squared) | Between-study variance (absolute amount of heterogeneity) | Estimated via formulas (DL, REML, etc.) | Gives scale of dispersion in same units as effect size | Harder to interpret; estimator-dependent; unstable with few studies | How much the trunks lean (a few degrees vs. 40°)
Prediction interval (PI) | Likely range of true effects in a new study | Extends random-effects model using τ² | Adds realism: shows what to expect in future contexts | Wide intervals with few studies; often omitted in practice | Walking deeper: what leaning we may see in the next part of the forest
DTA: univariate Q/I² | Applied separately to sensitivity (Se) and specificity (Sp) | Same formulas as above, but only for one dimension | Easy, familiar | Misleading: ignores Se–Sp correlation, inflates heterogeneity | Looking only east–west, ignoring north–south bends
DTA: bivariate model (Reitsma) | Joint modeling of Se and Sp with correlation (ρ) | Bivariate random-effects linear mixed model on the logit scale of sensitivity and specificity | Preserves correlation; handles threshold | Requires more data; computationally heavier | Viewing the forest in 2D, not along one axis
DTA: HSROC (Rutter & Gatsonis) | Models accuracy across thresholds | Curve-based hierarchical model | Captures threshold explicitly | Less intuitive for clinicians; complex | Not just leaning trunks, but the whole slope of the ground
DTA: bivariate I² (Zhou) | Extends I² to the joint Se–Sp space | Formula from the bivariate variance–covariance matrix | Provides an “intuitive %” in DTA | Newer, less familiar, rarely a software default | Proportion of the mess in both directions simultaneously
Table 2. Analytical approaches to exploring heterogeneity in meta-analysis.
Approach | Description | Forest metaphor | Strengths | Limitations
Structured subgroup analysis | Compare effect sizes across categories (e.g., low vs. high RoB, RCT vs. observational). | Comparing clearings: do trees in rocky vs. fertile soil lean differently? | Easy to interpret; highlights clinically meaningful differences. | Strong assumptions; oversimplifies heterogeneity as binary; can be misleading if used alone.
Sensitivity analyses | Explore robustness under different assumptions (e.g., estimators, corrections for zero events). | Testing different rulers to measure the same lean. | Reveals how assumptions affect results. | Can be misleading with sparse data; requires multiple studies; exploratory, not confirmatory.
Leave-one-out analysis | Re-runs the meta-analysis omitting one study at a time. | Temporarily removing one tree to see if the skyline changes. | Simple, widely implemented, shows robustness. | Harder to interpret; depends on scale of effect size; unstable with a small number of studies.
Meta-regression | Relates effect size to study-level covariates (e.g., mean age, year, quality). | Putting on colored lenses: seeing if tilt changes with soil, wind, or slope. | Handles multiple covariates, quantifies trends. | Requires at least 10 studies per covariate.
Prediction intervals | Estimates the range of effects in a future study, considering heterogeneity. | Tells you what tilt angles you might encounter in the next grove. | Clinically meaningful, forward-looking. | Requires a reliable estimate of τ², which can be difficult with only a few studies.
Table 3. Visual approaches to exploring heterogeneity in meta-analysis.
Approach | Description | Forest metaphor | Strengths | Limitations
Forest plot visual inspection | Visual inspection of confidence interval overlap across studies. | Looking at tree trunks: do their shadows overlap, or are they scattered apart? | Simple first step; immediately shows obvious dispersion. | Subjective; poor reliability when few studies are available.
Baujat plot | Plots each study’s contribution to overall heterogeneity (Q) against influence on effect size. | Spotting which trees lean most and distort the forest skyline. | Identifies outliers and influential studies. | Exploratory; requires enough studies; interpretation not always straightforward.
Galbraith (radial) plot | Plots standardized effect sizes against precision. | Like drawing rays from the forest center: outliers stand apart from the main bundle. | Highlights heterogeneity and small-study effects. | Assumes linearity; less intuitive for non-statisticians.
L’Abbé plot | Scatterplot of event rates in treatment vs. control groups across studies. | Two groves side by side: do trees from one lean consistently more than the other? | Good for binary outcomes; intuitive clinical insight. | Not suitable for continuous outcomes; harder to interpret with sparse data.
Funnel plot | Plots study effect size against precision to assess asymmetry (often for publication bias). | Like looking up at the treetops: symmetry suggests balance, asymmetry suggests something missing. | Can hint at bias or small-study effects; widely recognized. | Low power with few studies; asymmetry ≠ publication bias per se.
Table 4. Good Practices for Reporting and Interpreting Heterogeneity in Meta-Analysis.
Step | Recommended action | Rationale
1. Test for presence | Report Cochran’s Q statistic with degrees of freedom and p-value. | Provides a formal test, but acknowledge its limitations (low power with few studies, excessive sensitivity with many).
2. Quantify inconsistency | Report I² together with its 95% confidence interval. | I² quantifies the proportion of variability due to heterogeneity. The CI communicates the considerable uncertainty of this estimate.
3. Quantify magnitude | Report the between-study variance (τ²) and specify the estimator used (e.g., REML). | τ² measures the absolute magnitude of heterogeneity. Justify using a robust estimator (REML) over biased methods (DL).
4. Assess predictive impact | Report the 95% prediction interval. | Translates heterogeneity into a clinically interpretable range of expected effects in future studies.
5. Visualize data | Always present a forest plot. Consider additional plots (Baujat, Galbraith) if enough studies are available. | Visual inspection complements statistical metrics, helping to identify patterns, outliers, and inconsistencies.
6. Explore sources with caution | If subgroup or meta-regression analyses are conducted, explicitly state that they are exploratory and hypothesis-generating. | Prevents overinterpretation of findings with low statistical power and a high risk of ecological fallacy or spurious results.
Table 5. Common Pitfalls (“Don’ts”) in Reporting and Interpreting Heterogeneity.
Pitfall | Example | Consequence
Choosing the model based on a statistical threshold | Switching to random-effects only if the Q-test p < 0.10 | Misleading inference; model choice should be conceptually justified, not threshold-driven
Using DerSimonian–Laird by default | Applying the DL estimator in small or heterogeneous meta-analyses | Underestimation of τ², overly narrow CIs, false precision
Overinterpreting subgroup or meta-regression results | Treating subgroup differences as confirmatory | False positives due to low power and ecological bias
Ignoring prediction intervals | Reporting only the pooled effect and CI | Misses clinical implications of between-study variability
Excluding studies based on funnel plot asymmetry alone | Removing “outliers” because of the funnel plot | Conflates publication bias with heterogeneity; risks cherry-picking
Interpreting or performing funnel plots with few studies (<10) | Drawing conclusions about publication bias from a funnel plot when k < 10 | Funnel plots are unreliable with few studies; risk of false inference of bias or asymmetry