Submitted: 27 August 2025
Posted: 02 September 2025
Abstract
Keywords:
1. Introduction
- DerSimonian–Laird (DL). The most widely known and historically the default [9]. It is simple—plugging the observed Q into a formula—but biased: with few studies or large true heterogeneity, it tends to underestimate τ², pulling values toward zero and giving overly narrow confidence intervals. Given its well-documented biases, the DL estimator should no longer be regarded as the default choice in modern meta-analysis.
- Paule–Mandel (PM). An iterative estimator that chooses τ² so that the generalized Q statistic equals its expected value of k − 1 (for k studies). Although classical, it improves over DL and is currently accepted as an alternative to REML (restricted maximum likelihood); both DL and PM are sketched in code after this list.
- Bayesian estimators. In Bayesian statistics, τ² is not treated as a fixed number but as something uncertain, described by a probability distribution. We start with a prior (what we already know or assume) and update it with the data to get a posterior (what seems plausible after seeing the evidence) [12]. The advantage is that we can make direct probability statements, like “there is a 70% chance that heterogeneity is above a clinically important level.” This approach is flexible, especially when data are scarce, but the results depend on how the prior is chosen, so it must be done transparently.
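To make these estimators concrete, the following minimal Python sketch computes Q, I², and τ² by the DL and PM methods from first principles. The effect sizes and variances are invented for illustration; real analyses should use validated software, and REML, usually the preferred default, requires iterative likelihood maximization not shown here.

```python
# A minimal sketch of moment-based heterogeneity estimators, assuming
# each study supplies an effect estimate and its within-study variance.
# Illustrative only; not a substitute for validated packages.
import numpy as np

def cochran_q(y, v):
    """Cochran's Q: weighted squared deviations around the
    fixed-effect (inverse-variance) pooled estimate."""
    w = 1.0 / v
    mu_fe = np.sum(w * y) / np.sum(w)
    return np.sum(w * (y - mu_fe) ** 2)

def tau2_dl(y, v):
    """DerSimonian-Laird: plug the observed Q into a closed-form
    moment equation and truncate negative solutions at zero."""
    w = 1.0 / v
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (cochran_q(y, v) - (len(y) - 1)) / c)

def tau2_pm(y, v, tol=1e-10):
    """Paule-Mandel: find tau2 such that the generalized Q, computed
    with weights 1/(v_i + tau2), equals its expectation k - 1.
    Q_gen decreases in tau2, so a simple bisection suffices."""
    k = len(y)
    def q_gen(tau2):
        w = 1.0 / (v + tau2)
        mu = np.sum(w * y) / np.sum(w)
        return np.sum(w * (y - mu) ** 2)
    if q_gen(0.0) <= k - 1:      # no excess dispersion: truncate at zero
        return 0.0
    lo, hi = 0.0, 1.0
    while q_gen(hi) > k - 1:     # expand upper bound to bracket the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if q_gen(mid) > k - 1 else (lo, mid)
    return 0.5 * (lo + hi)

# Hypothetical data: five studies reporting log odds ratios
y = np.array([0.80, 0.10, 0.55, -0.20, 0.33])
v = np.array([0.04, 0.02, 0.09, 0.03, 0.05])

q, k = cochran_q(y, v), len(y)
i2 = max(0.0, (q - (k - 1)) / q) * 100   # I2 as a percentage of Q
print(f"Q = {q:.2f}, I2 = {i2:.1f}%, "
      f"tau2(DL) = {tau2_dl(y, v):.4f}, tau2(PM) = {tau2_pm(y, v):.4f}")
```

Running the sketch on these invented data shows the typical pattern: PM returns a somewhat larger τ² than DL when the observed Q is well above its expectation.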
But do these conventional metrics of heterogeneity apply equally to diagnostic accuracy studies? In therapeutic meta-analysis, heterogeneity is typically unidimensional, as all studies estimate the same type of effect. Diagnostic test accuracy (DTA) is different. Here, two outcomes, sensitivity and specificity, must be analyzed together, because the diagnostic threshold links them: raising the threshold increases specificity but decreases sensitivity, and lowering it has the opposite effect. This correlation means that univariate pooling of sensitivity or specificity is misleading. Instead, hierarchical random-effects models are required, namely the bivariate model and the hierarchical summary ROC (HSROC) model [16,17,18].

Interpretation. In DTA meta-analysis, heterogeneity is described by τ² for sensitivity and for specificity, along with a correlation parameter that captures threshold effects. Unlike intervention reviews, where Q, I², or τ² summarize a single outcome, diagnostic data are inherently bivariate. Applying univariate statistics such as Q or I² to sensitivity or specificity alone ignores this structure and often exaggerates or misrepresents heterogeneity. While bivariate I² statistics have been proposed [19], their interpretation is problematic and they have not achieved consensus as reliable measures. Alternative approaches have been discussed in the literature, such as the area of the 95% prediction ellipse in ROC space, or median odds ratios (MORs) for sensitivity and specificity. These methods attempt to provide more intuitive metrics, but each carries limitations and lacks standardization. Ultimately, the most robust and transparent strategy remains the direct interpretation of the variance and correlation parameters from the bivariate or HSROC model, which avoids the pitfalls of oversimplified summary statistics.

Limitations. Quantifying heterogeneity in DTA is inherently more complex than in intervention reviews. Even when hierarchical models are used, estimates of τ²(Se), τ²(Sp), and the threshold correlation become unstable with few studies, resulting in wide or imprecise variance estimates. This makes heterogeneity harder to measure, yet also more crucial, because threshold-driven differences are often the main source of inconsistency in diagnostic accuracy research [20].
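In its approximate normal form, the bivariate model [16] treats each study's logit-transformed sensitivity and specificity as a draw from a bivariate normal distribution. The display below uses a common notation and is included to make the parameters that should be reported explicit:

$$
\begin{pmatrix} \operatorname{logit}(\mathrm{Se}_i) \\ \operatorname{logit}(\mathrm{Sp}_i) \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_{\mathrm{Se}} \\ \mu_{\mathrm{Sp}} \end{pmatrix},
\; \Sigma + C_i \right),
\qquad
\Sigma =
\begin{pmatrix}
\tau^2_{\mathrm{Se}} & \rho\,\tau_{\mathrm{Se}}\tau_{\mathrm{Sp}} \\
\rho\,\tau_{\mathrm{Se}}\tau_{\mathrm{Sp}} & \tau^2_{\mathrm{Sp}}
\end{pmatrix},
$$

where $C_i$ is the within-study covariance matrix and $\rho$ is the between-study correlation, typically negative because of the threshold effect. The heterogeneity parameters to interpret and report are $\tau^2_{\mathrm{Se}}$, $\tau^2_{\mathrm{Sp}}$, and $\rho$.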
And when the focus shifts to prognosis, do the same rules of heterogeneity still apply? Prognostic systematic reviews differ fundamentally from intervention or diagnostic reviews. Their aims vary: some quantify the overall prognosis of a population (e.g., survival at fixed time points), others evaluate the prognostic impact of a single factor (e.g., the hazard ratio for biomarker expression), and still others assess or validate multivariable prognostic models. Unlike therapeutic or diagnostic settings, prognostic outcomes are often time-to-event, involve censoring, and depend strongly on the duration of follow-up.

Interpretation. In prognostic meta-analysis, heterogeneity is reflected in several distinct statistics. For single prognostic factors, random-effects pooling of hazard ratios is common, with τ² quantifying between-study variance. For prognostic models, measures of discrimination (e.g., the c-statistic or AUC) and calibration (e.g., the calibration slope or the overall observed/expected ratio) are frequently synthesized. Each has its own scale and requires a variance-stabilizing transformation before pooling. Importantly, heterogeneity may arise not only from sampling error but also from differences in case-mix (patient populations with varying baseline risk), variations in model specification (which predictors are included and how), and differences in follow-up length or censoring mechanisms. Multivariate meta-analysis has been proposed to jointly synthesize discrimination and calibration measures, enabling a more comprehensive interpretation of model performance across settings.

Limitations. Quantifying heterogeneity in prognostic reviews is particularly challenging. With few studies, estimates of between-study variance in hazard ratios or c-statistics are highly unstable, leading to imprecise prediction intervals. Furthermore, heterogeneity is often driven by clinical and methodological diversity (differences in baseline risk, predictor definitions, and statistical modeling choices) rather than sampling variability alone. This makes heterogeneity not just harder to measure but also more crucial, because prognostic evidence is especially vulnerable to misinterpretation when applied across populations with differing risk structures. As a result, transparent reporting of τ², prediction intervals, and sources of case-mix variation is essential for trustworthy prognostic meta-analysis.
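As an illustration of the variance-stabilizing step for discrimination measures, the sketch below pools hypothetical c-statistics on the logit scale with a delta-method variance and DL random-effects weights, then back-transforms the result. All study values are invented.

```python
# Pooling c-statistics on the logit scale: a minimal sketch, assuming
# each study reports a c-statistic and its standard error.
import numpy as np

c = np.array([0.72, 0.68, 0.80, 0.75])      # hypothetical c-statistics
se_c = np.array([0.03, 0.02, 0.04, 0.05])   # hypothetical standard errors

# Logit transform; delta-method variance: var(logit c) ~ var(c) / (c(1-c))^2
y = np.log(c / (1.0 - c))
v = (se_c / (c * (1.0 - c))) ** 2

# DerSimonian-Laird tau2 and random-effects pooling on the logit scale
w = 1.0 / v
q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)
cc = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(y) - 1)) / cc)
w_re = 1.0 / (v + tau2)
mu = np.sum(w_re * y) / np.sum(w_re)
se_mu = np.sqrt(1.0 / np.sum(w_re))

# Back-transform the pooled logit and its 95% CI to the c scale
expit = lambda x: 1.0 / (1.0 + np.exp(-x))
lo, hi = mu - 1.96 * se_mu, mu + 1.96 * se_mu
print(f"pooled c = {expit(mu):.3f} (95% CI {expit(lo):.3f} to {expit(hi):.3f}); "
      f"tau2 on the logit scale = {tau2:.4f}")
```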
Beyond Heterogeneity: Inconsistency in Network Meta-Analysis

In network meta-analysis (NMA), heterogeneity coexists with inconsistency, a distinct but related concept. Heterogeneity reflects variability within pairwise comparisons, while inconsistency arises when direct and indirect evidence disagree. Both are rooted in the assumption of transitivity: treatment effects can only be compared if effect modifiers are sufficiently similar across trials. When this assumption fails, two types of inconsistency may appear: loop inconsistency, where closed loops of evidence yield conflicting estimates, and design inconsistency, where treatment effects differ depending on comparator sets. Distinguishing heterogeneity from inconsistency, and reporting both, is critical for trustworthy interpretation of NMA results.
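As a concrete illustration of loop inconsistency, the sketch below applies Bucher's adjusted indirect comparison to a single A-B-C loop. All numbers are invented; real networks call for node-splitting or design-by-treatment interaction models fitted in dedicated software.

```python
# Loop-inconsistency check for one A-B-C evidence loop (Bucher's
# adjusted indirect comparison). A minimal sketch with invented numbers.
import numpy as np
from scipy import stats

# Direct estimates (log odds ratios) and their variances, hypothetical:
d_ab, v_ab = 0.50, 0.04   # A vs B, direct
d_ac, v_ac = 0.80, 0.05   # A vs C, direct
d_bc, v_bc = 0.10, 0.03   # B vs C, direct

# Indirect A vs B estimate formed through the common comparator C
d_ab_ind = d_ac - d_bc
v_ab_ind = v_ac + v_bc

# Inconsistency: disagreement between direct and indirect evidence
diff = d_ab - d_ab_ind
z = diff / np.sqrt(v_ab + v_ab_ind)
p = 2 * stats.norm.sf(abs(z))
print(f"direct = {d_ab:.2f}, indirect = {d_ab_ind:.2f}, "
      f"inconsistency = {diff:.2f} (z = {z:.2f}, p = {p:.3f})")
```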
Whether to pool at all depends on where the observed differences come from:
- If differences arise from valid but diverse contexts, pooling may be reasonable, provided conclusions are nuanced.
- If differences stem from systematic flaws, pooling misleads, and sometimes the correct choice is not to pool at all.
2. Conclusions
Original work
Conflict of interest
Informed consent
AI Use Disclosure
Data Availability Statement
Ethical Statement
CRediT author statement: Javier Arredondo Montero (JAM)
References
1. Cochran WG. The combination of estimates from different experiments. Biometrics. 1954;10(1):101-29. doi: 10.2307/3001666.
2. Mood AM, Graybill FA, Boes DC. Introduction to the Theory of Statistics. 3rd ed. New York: McGraw-Hill; 1974.
3. Casella G, Berger RL. Statistical Inference. 2nd ed. Duxbury Press; 2002.
4. Hoaglin DC. Misunderstandings about Q and 'Cochran's Q test' in meta-analysis. Stat Med. 2016;35(4):485-95. doi: 10.1002/sim.6632.
5. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21:1539-58.
6. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557-60. doi: 10.1136/bmj.327.7414.557.
7. Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA, editors. Cochrane Handbook for Systematic Reviews of Interventions. Version 6.5 (updated August 2024). Cochrane; 2024. Available from: www.cochrane.org/handbook.
8. von Hippel PT. The heterogeneity statistic I² can be biased in small meta-analyses. BMC Med Res Methodol. 2015;15:35. doi: 10.1186/s12874-015-0024-z.
9. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7(3):177-88. doi: 10.1016/0197-2456(86)90046-2.
10. Viechtbauer W. Bias and efficiency of meta-analytic variance estimators in the random-effects model. J Educ Behav Stat. 2005;30(3):261-93. doi: 10.3102/10769986030003261.
11. Veroniki AA, Jackson D, Viechtbauer W, Bender R, Bowden J, Knapp G, Kuss O, Higgins JPT, Langan D, Salanti G. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Res Synth Methods. 2016;7(1):55-79. doi: 10.1002/jrsm.1164.
12. Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. J R Stat Soc Ser A Stat Soc. 2009;172(1):137-59. doi: 10.1111/j.1467-985X.2008.00552.x.
13. Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342:d549. doi: 10.1136/bmj.d549.
14. Nagashima K, Noma H, Furukawa TA. Prediction intervals for random-effects meta-analysis: a confidence distribution approach. Stat Methods Med Res. 2019;28(6):1689-702. doi: 10.1177/0962280218773520.
15. IntHout J, Ioannidis JPA, Rovers MM, Goeman JJ. Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open. 2016;6(7):e010247. doi: 10.1136/bmjopen-2015-010247.
16. Reitsma JB, Glas AS, Rutjes AWS, Scholten RJPM, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol. 2005;58(10):982-90.
17. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med. 2001;20(19):2865-84. doi: 10.1002/sim.942.
18. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JAC. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics. 2007;8(2):239-51. doi: 10.1093/biostatistics/kxl004. Erratum in: Biostatistics. 2008;9(4):779.
19. Zhou Y, Dendukuri N. Statistics for quantifying heterogeneity in univariate and bivariate meta-analyses of binary data: the case of meta-analyses of diagnostic accuracy. Stat Med. 2014;33(16):2701-17. doi: 10.1002/sim.6115.
20. Deeks JJ, Bossuyt PM, Leeflang MM, Takwoingi Y, editors. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. Version 2.0 (updated July 2023). Cochrane; 2023. Available from: https://training.cochrane.org/handbook-diagnostic-test-accuracy/current.
21. Oxman AD, Guyatt GH. A consumer's guide to subgroup analyses. Ann Intern Med. 1992;116(1):78-84. doi: 10.7326/0003-4819-116-1-78.
22. Viechtbauer W, Cheung MWL. Outlier and influence diagnostics for meta-analysis. Res Synth Methods. 2010;1(2):112-25. doi: 10.1002/jrsm.11.
23. Meng Z, Wang J, Lin L, Wu C. Sensitivity analysis with iterative outlier detection for systematic reviews and meta-analyses. Stat Med. 2024;43(8):1549-63. doi: 10.1002/sim.10008.
24. Thompson SG, Higgins JPT. How should meta-regression analyses be undertaken and interpreted? Stat Med. 2002;21(11):1559-73. doi: 10.1002/sim.1187.
25. Baujat B, Mahé C, Pignon JP, Hill C. A graphical method for exploring heterogeneity in meta-analyses: application to a meta-analysis of 65 trials. Stat Med. 2002;21(18):2641-52. doi: 10.1002/sim.1221.
26. Galbraith RF. A note on graphical presentation of estimated odds ratios from several clinical trials. Stat Med. 1988;7(8):889-94. doi: 10.1002/sim.4780070807.
27. L'Abbé KA, Detsky AS, O'Rourke K. Meta-analysis in clinical research. Ann Intern Med. 1987;107(2):224-33. doi: 10.7326/0003-4819-107-2-224.
28. Sterne JAC, Egger M. Funnel plots for detecting bias in meta-analysis: guidelines on choice of axis. J Clin Epidemiol. 2001;54(10):1046-55. doi: 10.1016/s0895-4356(01)00377-8.
29. Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629-34. doi: 10.1136/bmj.315.7109.629.
30. Begg CB, Mazumdar M. Operating characteristics of a rank correlation test for publication bias. Biometrics. 1994;50(4):1088-101.



| Measure | What it measures | How it works | Strengths | Limitations | Forest metaphor |
|---|---|---|---|---|---|
| Q (Cochran’s Q) | Tests if variability between studies is greater than chance alone | χ² test with k−1 df (k = number of studies) | Simple, widely implemented | Low power with few studies; too sensitive with many; only a test (yes/no) | Spotting whether the grove looks uneven at all |
| I² (Higgins & Thompson) | Proportion of observed variability due to real heterogeneity (not chance) | Derived from Q and df | Intuitive %, widely reported | Distorted with very small/large studies; unstable with few studies; does not tell absolute size | What fraction of the leaning goes beyond natural randomness |
| τ² (tau-squared) | Between-study variance (absolute amount of heterogeneity) | Estimated via formulas (DL, REML, etc.) | Gives scale of dispersion in same units as effect size | Harder to interpret; estimator-dependent; unstable with few studies | How much the trunks lean (a few degrees vs. 40°) |
| Prediction Interval (PI) | Likely range of true effects in a new study | Extends random-effects model using τ² | Adds realism: shows what to expect in future contexts | Wide intervals with few studies; often omitted in practice | Walking deeper: what leaning we may see in the next part of the forest |
| DTA: Univariate Q/I² | Applied separately to sensitivity (Se) and specificity (Sp) | Same formulas as above, but only for one dimension | Easy, familiar | Misleading: ignores correlation Se–Sp, inflates heterogeneity | Looking only east–west, ignoring north–south bends |
| DTA: Bivariate model (Reitsma) | Joint modeling of Se & Sp with correlation (ρ) | Bivariate random-effects linear mixed model on the logit scale of sensitivity and specificity. | Preserves correlation; handles threshold | Requires more data; computationally heavier | Viewing forest in 2D, not one axis |
| DTA: HSROC (Rutter & Gatsonis) | Models accuracy across thresholds | Curve-based hierarchical model | Captures threshold explicitly | Less intuitive for clinicians; complex | Not just leaning trunks, but whole slope of the ground |
| DTA: Bivariate I² (Zhou) | Extends I² to joint Se–Sp space | Formula from bivariate variance-covariance | Provides “intuitive %” in DTA | Newer, less familiar, rarely in software defaults | Proportion of the mess in both directions simultaneously |
| Approach | Description | Forest Metaphor | Strengths | Limitations |
|---|---|---|---|---|
| Structured subgroup analysis | Compare effect sizes across categories (e.g. low vs. high RoB, RCT vs. observational). | Comparing clearings: do trees in rocky vs. fertile soil lean differently? | Easy to interpret; highlights clinically meaningful differences. | Strong assumptions; oversimplifies heterogeneity as binary; can be misleading if used alone. |
| Sensitivity analysis: alternative assumptions | Explores robustness under different assumptions (e.g., estimators, corrections for zero events). | Testing different rulers to measure the same lean. | Reveals how assumptions affect results. | Can be misleading with sparse data; requires multiple studies; exploratory, not confirmatory. |
| Sensitivity analysis: leave-one-out | Re-runs the meta-analysis omitting one study at a time (sketched in code after this table). | Temporarily removing one tree to see if the skyline changes. | Simple, widely implemented, shows robustness. | Harder to interpret; depends on scale of effect size; unstable with small numbers of studies. |
| Meta-regression | Relates effect size to study-level covariates (e.g., mean age, year, quality). | Putting on colored lenses: seeing if tilt changes with soil, wind, or slope. | Handles multiple covariates, quantifies trends. | Requires roughly ten studies per covariate; exploratory, with risk of ecological fallacy and spurious findings. |
| Prediction intervals | Estimates the range of effect in a future study, accounting for heterogeneity (sketched in code after this table). | Tells you what tilt angles you might encounter in the next grove. | Clinically meaningful, forward-looking. | Requires a reliable estimate of τ², which is difficult with only a few studies. |
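The leave-one-out and prediction-interval rows above can be illustrated together. The sketch below uses invented effect sizes; the prediction interval follows the t-based formula with k − 2 degrees of freedom [12,15].

```python
# Leave-one-out sensitivity analysis plus a 95% prediction interval.
# A minimal sketch with invented data, not a validated implementation.
import numpy as np
from scipy import stats

y = np.array([0.80, 0.10, 0.55, -0.20, 0.33, 0.47])  # hypothetical effects
v = np.array([0.04, 0.02, 0.09, 0.03, 0.05, 0.06])   # hypothetical variances

def re_pool(y, v):
    """Random-effects pooling with the DL estimator of tau2."""
    w = 1.0 / v
    mu_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fe) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)
    w_re = 1.0 / (v + tau2)
    mu = np.sum(w_re * y) / np.sum(w_re)
    return mu, np.sqrt(1.0 / np.sum(w_re)), tau2

mu, se_mu, tau2 = re_pool(y, v)
k = len(y)

# 95% prediction interval for the true effect in a new setting
half = stats.t.ppf(0.975, df=k - 2) * np.sqrt(tau2 + se_mu ** 2)
print(f"pooled = {mu:.3f}, 95% PI = ({mu - half:.3f}, {mu + half:.3f})")

# Leave-one-out: does any single study drive the pooled result?
for i in range(k):
    keep = np.arange(k) != i
    mu_i, _, tau2_i = re_pool(y[keep], v[keep])
    print(f"omitting study {i + 1}: pooled = {mu_i:.3f}, tau2 = {tau2_i:.4f}")
```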
| Approach | Description | Forest Metaphor | Strengths | Limitations |
|---|---|---|---|---|
| Forest plot visual inspection | Visual inspection of confidence interval overlap across studies. | Looking at tree trunks: do their shadows overlap, or are they scattered apart? | Simple first step; immediately shows obvious dispersion. | Subjective; poor reliability when few studies are available. |
| Baujat plot | Plots each study’s contribution to overall heterogeneity (Q) against influence on effect size. | Spotting which trees lean most and distort the forest skyline. | Identifies outliers and influential studies. | Exploratory; requires enough studies; interpretation not always straightforward. |
| Galbraith (radial) plot | Plots standardized effect sizes against precision. | Like drawing rays from the forest center—outliers stand apart from the main bundle. | Highlights heterogeneity and small-study effects. | Assumes linearity; less intuitive for non-statisticians. |
| L'Abbé plot | Scatterplot of event rates in treatment vs. control groups across studies. | Two groves side by side: do trees from one lean consistently more than the other? | Good for binary outcomes; intuitive clinical insight. | Not suitable for continuous outcomes; harder to interpret with sparse data. |
| Funnel plot | Plots study effect size against precision to assess asymmetry (often for publication bias); Egger's regression test, sketched after this table, is the usual companion statistic. | Like looking up at the treetops: symmetry suggests balance, asymmetry suggests something missing. | Can hint at bias or small-study effects; widely recognized. | Low power with few studies; asymmetry does not in itself prove publication bias. |
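For the funnel-plot row above, Egger's regression test [29] regresses the standardized effect on precision and asks whether the intercept differs from zero. A minimal sketch with invented data follows; as the pitfalls table below warns, it should not be trusted with fewer than about ten studies.

```python
# Egger's regression test for funnel-plot asymmetry: a minimal sketch
# with invented data, computed with plain least squares.
import numpy as np
from scipy import stats

effects = np.array([0.60, 0.35, 0.48, 0.15, 0.40, 0.22, 0.55, 0.30,
                    0.44, 0.28])
se = np.array([0.30, 0.18, 0.25, 0.10, 0.22, 0.12, 0.28, 0.15,
               0.24, 0.14])

# Regress the standardized effect (effect / SE) on precision (1 / SE);
# under symmetry the intercept is zero.
snd, precision = effects / se, 1.0 / se
X = np.column_stack([np.ones_like(precision), precision])
beta, *_ = np.linalg.lstsq(X, snd, rcond=None)

# t-test for the intercept from ordinary least-squares theory
resid = snd - X @ beta
dof = len(snd) - 2
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(X.T @ X)
t_int = beta[0] / np.sqrt(cov[0, 0])
p_int = 2 * stats.t.sf(abs(t_int), df=dof)
print(f"Egger intercept = {beta[0]:.3f} (t = {t_int:.2f}, p = {p_int:.3f})")
```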
| Step | Recommended Action | Rationale |
|---|---|---|
| 1. Test for Presence | Report Cochran’s Q statistic with degrees of freedom and p-value. | Provides a formal test, but acknowledge its limitations (low power with few studies, excessive sensitivity with many). |
| 2. Quantify Inconsistency | Report I² together with its 95% confidence interval. | I² quantifies the proportion of variability due to heterogeneity. The CI communicates the considerable uncertainty of this estimate. |
| 3. Quantify Magnitude | Report between-study variance (τ²) and specify the estimator used (e.g., REML). | τ² measures the absolute magnitude of heterogeneity. Justify using a robust estimator (REML) over biased methods (DL). |
| 4. Assess Predictive Impact | Report the 95% Prediction Interval (a consolidated computation of Steps 1-4 is sketched after this table). | Translates heterogeneity into a clinically interpretable range of expected effects in future studies. |
| 5. Visualize Data | Always present a forest plot. Consider additional plots (Baujat, Galbraith) if enough studies are available. | Visual inspection complements statistical metrics, helping to identify patterns, outliers, and inconsistencies. |
| 6. Explore Sources with Caution | If subgroup or meta-regression analyses are conducted, explicitly state they are exploratory and hypothesis-generating. | Prevents overinterpretation of findings with low statistical power and high risk of ecological fallacy or spurious results. |
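Steps 1-4 of the checklist can be produced in a single pass. The sketch below consolidates the formulas from the earlier snippets into one reporting helper; DL is used for brevity (REML is preferable in practice) and the I² confidence interval of Step 2 is omitted.

```python
# Consolidated reporting of Q, I2, tau2, and the 95% prediction interval.
# A sketch with invented data; illustrative only.
import numpy as np
from scipy import stats

def heterogeneity_report(y, v, alpha=0.05):
    k, w = len(y), 1.0 / v
    mu_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fe) ** 2)
    p_q = stats.chi2.sf(q, df=k - 1)                  # Step 1: Q test
    i2 = max(0.0, (q - (k - 1)) / q) * 100            # Step 2: I2 (%)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # Step 3: tau2 (DL)
    w_re = 1.0 / (v + tau2)
    mu = np.sum(w_re * y) / np.sum(w_re)
    se_mu = np.sqrt(1.0 / np.sum(w_re))
    half = stats.t.ppf(1 - alpha / 2, k - 2) * np.sqrt(tau2 + se_mu ** 2)
    return (f"Q = {q:.2f} (df = {k - 1}, p = {p_q:.3f}); I2 = {i2:.1f}%; "
            f"tau2 = {tau2:.4f}; pooled = {mu:.3f}; "
            f"95% PI = ({mu - half:.3f}, {mu + half:.3f})")       # Step 4

y = np.array([0.80, 0.10, 0.55, -0.20, 0.33])
v = np.array([0.04, 0.02, 0.09, 0.03, 0.05])
print(heterogeneity_report(y, v))
```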
| Pitfall | Example | Consequence |
|---|---|---|
| Choosing model based on statistical threshold | Switching to random-effects only if Q-test p < 0.10 | Misleading inference; model choice should be conceptually justified, not threshold-driven |
| Using DerSimonian–Laird by default | Applying DL estimator in small or heterogeneous meta-analyses | Underestimation of τ², overly narrow CIs, false precision |
| Overinterpreting subgroup or meta-regression results | Treating subgroup differences as confirmatory | False positives due to low power and ecological bias |
| Ignoring prediction intervals | Reporting only pooled effect and CI | Misses clinical implications of between-study variability |
| Excluding studies based on funnel plot asymmetry alone | Removing “outliers” due to funnel plot | Conflates publication bias with heterogeneity; risks cherry-picking |
| Interpreting or performing funnel plots with few studies (<10) | Drawing conclusions about publication bias from funnel plot when k < 10 | Funnel plots are unreliable with few studies; risk of false inference of bias or asymmetry |
