Submitted:
04 September 2025
Posted:
05 September 2025
Abstract
Keywords:
Introduction
Understanding Basic Diagnostic Test Accuracy Metrics: Beyond Sensitivity and Specificity
The Diagnostic Odds Ratio: Interpretation and Limitations
Why Hierarchical Models Are Needed in Diagnostic Test Accuracy Meta-Analyses
Hierarchical Models for DTA Meta-Analyses: Structure and Parameterization
Which Metrics Should Be Reported in Diagnostic Test Accuracy Meta-Analyses
Selecting the Appropriate Statistical Software for Diagnostic Accuracy Meta-Analysis
- metandi [27]: The first Stata command for hierarchical DTA meta-analysis. It is more limited in scope than later commands but reports all core model parameters in detail. It does not support meta-regression and offers only limited graphical customization.
- midas [28]: Although limited in some respects, it remains a useful complement for hierarchical modeling. It integrates exploratory tools, including heterogeneity plots, goodness-of-fit checks, Fagan nomograms, likelihood ratio scattergrams, and Deeks’ regression test for publication bias. It can estimate the AUC with a 95% CI, but these values may be biased because they are not obtained from a formal HSROC model. Meta-regression is available but restricted to univariable analyses (although several univariable models can be displayed on screen simultaneously, as illustrated in this article). A practical caveat is that midas may occasionally return an AUC of 1.0 with a CI of 0–1 if the imported Excel file contains hidden rows; deleting unused rows or copying the data into a new, clean sheet resolves this issue.
- metadta [29,30]: The most modern and versatile Stata command for DTA meta-analysis. It offers extensive functionality, including bivariate I² estimation (Zhou and Dendukuri), advanced meta-regression capabilities, and highly customizable graphical outputs. It is currently the preferred tool for conducting methodologically rigorous DTA meta-analyses within the Stata environment. Nonetheless, it has limitations in specific analyses, particularly regarding graphical utilities, where midas provides complementary features (e.g., Fagan nomograms, likelihood ratio scattergrams, and integrated diagnostic plots).
Heterogeneity in Diagnostic Test Accuracy Meta-Analyses
Meta-Regression in Diagnostic Test Accuracy Meta-Analyses
Publication Bias in Diagnostic Test Accuracy Meta-Analyses
Complementary Tools for Interpreting DTA Meta-Analyses: Fagan Nomograms, Scatterplots, and Beyond
- Upper Left Quadrant: LR+ < 10, LR− < 0.1 — diagnostic exclusion only
- Upper Right Quadrant: LR+ > 10, LR− < 0.1 — both exclusion and confirmation
- Lower Right Quadrant: LR+ > 10, LR− > 0.1 — diagnostic confirmation only
- Lower Left Quadrant: LR+ < 10, LR− > 0.1 — neither exclusion nor confirmation
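The quadrant rules listed above can be written as a small classifier. The following is a minimal sketch, not part of any of the Stata commands discussed here: the function name `lr_quadrant` and its parameters are hypothetical, and the handling of values lying exactly on a cut-off (treated as not meeting it) is an assumption, since the list only defines strict inequalities.

```python
def lr_quadrant(lr_pos: float, lr_neg: float,
                confirm_cut: float = 10.0, exclude_cut: float = 0.1) -> str:
    """Map a (LR+, LR-) pair to one of the four scattergram quadrants.

    Cut-offs follow the list above: LR+ > 10 supports confirmation,
    LR- < 0.1 supports exclusion. Values exactly on a cut-off are
    treated as NOT meeting it (an assumption of this sketch).
    """
    if lr_pos <= 0 or lr_neg <= 0:
        raise ValueError("likelihood ratios must be positive")

    confirms = lr_pos > confirm_cut   # test can rule the disease in
    excludes = lr_neg < exclude_cut   # test can rule the disease out

    if confirms and excludes:
        return "upper right: exclusion and confirmation"
    if excludes:
        return "upper left: exclusion only"
    if confirms:
        return "lower right: confirmation only"
    return "lower left: neither exclusion nor confirmation"


# Illustrative values only, not results from any study.
print(lr_quadrant(4.0, 0.4))
```

In a meta-analysis, the pair passed to such a function would normally be the summary LR+ and LR− from the hierarchical model, and the uncertainty around both values matters as much as the quadrant itself.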
Application of Hierarchical Models in Surgical Diagnostic Research: A Practical Example
The Conventional (Flawed) Analysis: Meta-DiSc
The Hierarchical Analysis: A Robust Alternative with Stata
Investigating Heterogeneity: Influence Diagnostics and Outlier Detection
Exploring Sources of Heterogeneity: Meta-Regression
Translating Results into Clinical Practice: Fagan Nomograms and Likelihood Ratios
Assessing Publication Bias: Deeks' Test
Common Pitfalls, Expertise Requirements, and Practical Recommendations
- Select appropriate, validated software for hierarchical modeling, prioritizing tools such as metandi, midas, or metadta in Stata, or equivalent R packages (mada, diagmeta), depending on available expertise and project complexity.
- Report comprehensive diagnostic metrics, including pooled sensitivity and specificity with confidence intervals, HSROC curves, and likelihood ratios. The DOR may be included as a secondary indicator, but should not be the sole measure of test performance. If reporting the AUC, clarify whether it derives from a rigorously parameterized HSROC model or a simplified approximation, as interpretations differ substantially.
- Use prediction intervals and heterogeneity estimates, such as τ² or the bivariate I² of Zhou and Dendukuri, to convey uncertainty and between-study variability transparently. Avoid overreliance on the conventional univariate I², which exaggerates heterogeneity by ignoring the correlation between sensitivity and specificity.
- Incorporate complementary graphical and analytical tools, including Fagan nomograms, likelihood ratio scatterplots, Cook’s distance, and standardized residuals, to enhance interpretability, identify influential studies, and explore sources of heterogeneity. It is essential to distinguish post hoc sensitivity analyses from post hoc meta-regression. The former are an accepted strategy for testing the robustness of primary findings when influential or outlying studies are identified: their role is to confirm whether conclusions hold when such studies are excluded, not to produce a ‘new correct’ result. In contrast, selecting covariates for meta-regression post hoc constitutes data dredging and increases the risk of spurious associations. Robust practice requires pre-specifying covariates in the protocol, whereas outlier-based sensitivity analyses should be presented transparently as exploratory robustness checks. Even when pre-specified, meta-regression findings must be interpreted with caution, particularly in small meta-analyses, where introducing multiple covariates risks overfitting. A commonly recommended rule of thumb is to include at least 10 studies per covariate to achieve stable inference.
- Recognize and mitigate additional biases beyond publication bias, such as spectrum bias (non-representative patient populations), selection bias, partial verification bias, misclassification, information bias, or disease progression bias. These sources of error frequently lead to overestimation of diagnostic performance, underscoring the need for rigorous methodological safeguards.
- Collaborate with experienced biostatisticians or methodologists throughout all phases of the meta-analysis, from dataset preparation to statistical modeling and interpretation, to ensure adherence to best practices and maximize methodological rigor.
- Explicitly verify the statistical assumptions underpinning hierarchical models. Many users apply BRMA or HSROC frameworks without assessing residual distribution, bivariate normality, or the influence of individual studies. Tools such as midas modchk or the influence diagnostics in metandi facilitate this evaluation and should be part of routine analysis. If bivariate normality is seriously violated (e.g., highly skewed distributions), consider data transformations, alternative modeling strategies such as more flexible Bayesian approaches, or, at the very least, interpret the results with great caution.
- Interpret results with caution in small meta-analyses. When fewer than 10 studies are included, hierarchical models remain the preferred approach, but confidence intervals widen, prediction regions expand, and estimates of heterogeneity or threshold effects become less stable. In such scenarios, emphasis should shift towards transparency, identification of evidence gaps, and cautious interpretation rather than overconfident conclusions.
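To make the metrics named in these recommendations concrete, the sketch below computes sensitivity, specificity, likelihood ratios, and the DOR from a single 2×2 table, and reproduces the arithmetic behind a Fagan nomogram (post-test odds = pre-test odds × LR). It is an illustration under stated assumptions: the names `TwoByTwo`, `accuracy_metrics`, and `post_test_probability` are hypothetical, the counts are invented for the example, and the code deliberately refuses tables with empty cells rather than applying a continuity correction. It describes one study only; pooled values should still come from a hierarchical model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # diseased, test positive
    fp: int  # non-diseased, test positive
    fn: int  # diseased, test negative
    tn: int  # non-diseased, test negative


def accuracy_metrics(t: TwoByTwo) -> dict:
    """Single-study accuracy metrics; no continuity correction is applied."""
    if min(t.tp, t.fp, t.fn, t.tn) <= 0:
        raise ValueError("this sketch requires all four cells to be > 0")
    sens = t.tp / (t.tp + t.fn)
    spec = t.tn / (t.tn + t.fp)
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return {
        "sensitivity": sens,
        "specificity": spec,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": lr_pos / lr_neg,  # equals (tp * tn) / (fp * fn)
    }


def post_test_probability(pre_test_prob: float, lr: float) -> float:
    """Bayes' theorem in odds form, i.e. what a Fagan nomogram reads off."""
    if not 0 < pre_test_prob < 1 or lr <= 0:
        raise ValueError("need 0 < pre_test_prob < 1 and lr > 0")
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Hypothetical counts, for illustration only.
m = accuracy_metrics(TwoByTwo(tp=90, fp=20, fn=10, tn=80))
print(m)
print(post_test_probability(0.20, m["lr_pos"]))  # after a positive result
print(post_test_probability(0.20, m["lr_neg"]))  # after a negative result
```

Because the post-test probability depends on the assumed pre-test probability, it is worth repeating the calculation for the range of prevalences that is plausible in the intended clinical setting.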
Limitations and Scope
Conclusions
Supplementary Materials
CRediT Authorship Contribution Statement
Conflicts of Interest
Financial Statement/Funding
Ethical Approval
Statement of Availability of the Data Used
Declaration of Generative AI and AI-Assisted Technologies in the Writing Process
References
- Dinnes J, Deeks J, Kirby J, Roderick P. A methodological review of how heterogeneity has been examined in systematic reviews of diagnostic test accuracy. Health Technol Assess. 2005 Mar;9(12):1-113, iii.
- Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med. 2008 Dec 16;149(12):889-97.
- Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med. 2001 Oct 15;20(19):2865-84.
- Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol. 2005 Oct;58(10):982-90.
- Deeks JJ, Bossuyt PM, Leeflang MM, Takwoingi Y (editors). Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. Version 2.0 (updated July 2023). Cochrane, 2023. Available from https://training.cochrane.org/handbook-diagnostic-test-accuracy/current.
- Altman DG, Bland JM. Diagnostic tests. 1: Sensitivity and specificity. BMJ. 1994 Jun 11;308(6943):1552.
- Akobeng AK. Understanding diagnostic tests 1: sensitivity, specificity and predictive values. Acta Paediatr. 2007 Mar;96(3):338-41.
- Singh PP, Zeng IS, Srinivasa S, Lemanu DP, Connolly AB, Hill AG. Systematic review and meta-analysis of use of serum C-reactive protein levels to predict anastomotic leak after colorectal surgery. Br J Surg. 2014 Mar;101(4):339-46. Epub 2013 Dec 5.
- Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982 Apr;143(1):29-36.
- Mandrekar JN. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol. 2010 Sep;5(9):1315-6.
- Arredondo Montero J, Martín-Calvo N. Diagnostic performance studies: interpretation of ROC analysis and cut-offs. Cir Esp (Engl Ed). 2023 Dec;101(12):865-867. Epub 2022 Nov 24.
- Altman DG, Bland JM. Diagnostic tests 2: Predictive values. BMJ. 1994 Jul 9;309(6947):102.
- Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ. 2004 Jul 17;329(7458):168-9.
- Akobeng AK. Understanding diagnostic tests 2: likelihood ratios, pre- and post-test probabilities and their use in clinical practice. Acta Paediatr. 2007 Apr;96(4):487-91. Epub 2007 Feb 14.
- Bai S, Hu S, Zhang Y, Guo S, Zhu R, Zeng J. The Value of the Alvarado Score for the Diagnosis of Acute Appendicitis in Children: A Systematic Review and Meta-Analysis. J Pediatr Surg. 2023 Oct;58(10):1886-1892. Epub 2023 Mar 6.
- Glas AS, Lijmer JG, Prins MH, Bonsel GJ, Bossuyt PM. The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol. 2003 Nov;56(11):1129-35.
- Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol. 2004 May 1;159(9):882-90.
- Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med. 1993 Jul 30;12(14):1293-316.
- Dinnes J, Mallett S, Hopewell S, Roderick PJ, Deeks JJ. The Moses-Littenberg meta-analytical method generates systematic differences in test accuracy compared to hierarchical meta-analytical models. J Clin Epidemiol. 2016 Dec;80:77-87. Epub 2016 Jul 30.
- Harbord RM, Whiting P, Sterne JA, Egger M, Deeks JJ, Shang A, Bachmann LM. An empirical comparison of methods for meta-analysis of diagnostic accuracy showed hierarchical models are necessary. J Clin Epidemiol. 2008 Nov;61(11):1095-103.
- Wang J, Leeflang M. Recommended software/packages for meta-analysis of diagnostic accuracy. J Lab Precis Med. 2019;4:22.
- Zamora J, Abraira V, Muriel A, Khan K, Coomarasamy A. Meta-DiSc: a software for meta-analysis of test accuracy data. BMC Med Res Methodol. 2006 Jul 12;6:31.
- Review Manager (RevMan) [Computer program]. Version 5.4. The Cochrane Collaboration, 2020.
- R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2024.
- SAS Institute Inc. SAS Software, Version 9.4. Cary, NC: SAS Institute Inc.; 2024.
- StataCorp. Stata Statistical Software: Release 19. College Station, TX: StataCorp LLC; 2025.
- Harbord RM, Whiting P. Metandi: Meta-analysis of Diagnostic Accuracy Using Hierarchical Logistic Regression. Stata J. 2009;9(2):211-229.
- Dwamena B. MIDAS: Stata module for meta-analytical integration of diagnostic test accuracy studies. Statistical Software Components S456880, Boston College Department of Economics; 2007 (revised 05 Feb 2009).
- Nyaga VN, Arbyn M. Metadta: a Stata command for meta-analysis and meta-regression of diagnostic test accuracy data - a tutorial. Arch Public Health. 2022 Mar 29;80(1):95. Erratum in: Arch Public Health. 2022 Sep 27;80(1):216.
- Nyaga VN, Arbyn M. Comparison and validation of metadta for meta-analysis of diagnostic test accuracy studies. Res Synth Methods. 2023 May;14(3):544-562. Epub 2023 Apr 18.
- Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003 Sep 6;327(7414):557-60.
- Zhou Y, Dendukuri N. Statistics for quantifying heterogeneity in univariate and bivariate meta-analyses of binary data: the case of meta-analyses of diagnostic accuracy. Stat Med. 2014 Jul 20;33(16):2701-17. Epub 2014 Feb 19.
- Begg CB, Mazumdar M. Operating characteristics of a rank correlation test for publication bias. Biometrics. 1994 Dec;50(4):1088-101.
- Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997 Sep 13;315(7109):629-34.
- Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J Clin Epidemiol. 2005 Sep;58(9):882-93.
- van Enst WA, Ochodo E, Scholten RJ, Hooft L, Leeflang MM. Investigation of publication bias in meta-analyses of diagnostic test accuracy: a meta-epidemiological study. BMC Med Res Methodol. 2014 May 23;14:70.
- Fagan TJ. Letter: Nomogram for Bayes's theorem. N Engl J Med. 1975 Jul 31;293(5):257.
- Rubinstein ML, Kraft CS, Parrott JS. Determining qualitative effect size ratings using a likelihood ratio scatter matrix in diagnostic test accuracy systematic reviews. Diagnosis (Berl). 2018 Nov 27;5(4):205-214.


| Feature | metandi (Harbord & Whiting, 2009) | midas (Dwamena, 2007) | metadta (Nyaga & Arbyn, 2022) |
|---|---|---|---|
| Primary Model | BRMA only. HSROC parameters/curve derived from BRMA output (not a separately fitted HSROC model). | BRMA only. ROC curve is an indirect approximation derived from the BRMA output, not a formal HSROC model. | BRMA (bivariate framework); HSROC curve obtained via formal re-parameterization (mathematically equivalent to Rutter & Gatsonis). |
| Meta-Regression | Not supported. | Univariable only. Cannot fit multivariable models. | Supports both Univariable and Multivariable meta-regression. |
| Heterogeneity Metrics | Reports between-study variance components (τ²). Does not calculate a bivariate I² statistic. | Reports univariate I² statistics, ICCs for sensitivity and specificity, and median sensitivity/specificity estimates. Does not calculate between-study variance components (τ²). | Reports between-study variance (τ²) and the bivariate I² (Zhou). |
| Influence Diagnostics | Supported via the predict post-estimation command to obtain Cook's distance and standardized residuals. | Supported via the integrated modchk option, which generates a panel of four diagnostic plots. | No built-in commands for influence diagnostics. Requires manual post-estimation calculations. |
| Publication Bias Test | Not supported. | Supported via the pubbias subcommand, which implements the recommended Deeks' test. | Not supported. |
| Clinical Utility Tools | Not supported. | Provides Fagan nomograms, a likelihood ratio scattergram, a bivariate boxplot for outlier detection, and calculates the diagnostic odds ratio (DOR). | Not supported. |
| Graphical Output | Generates a formal HSROC plot with correctly calculated confidence and prediction regions. | Generates an approximate ROC plot, Fagan nomograms, and various other diagnostic plots. | Generates high-quality forest plots and an HSROC plot with correctly calculated confidence and prediction regions. |
| Key Limitation | Lacks meta-regression and many modern analytical features. Can be prone to model convergence issues, especially with sparse data or zero-events. | The prediction region is methodologically flawed and inflated due to ignoring covariance. The reported AUC is an extrapolation based on untestable assumptions. | Lacks built-in functions for publication bias assessment and influence diagnostics, requiring the use of other packages (like midas) to complete a full analysis. |
| Aspect | Traditional Models (DerSimonian-Laird, Moses-Littenberg) | Hierarchical Models (HSROC, BRMA) |
|---|---|---|
| Software Availability | Multiple platforms, including outdated tools: Meta-DiSc 1.4, RevMan 5.4, Stata, R, SAS. | Stata, R, Meta-DiSc 2.0 (limited but specific support). |
| Reported Metrics | Sensitivity and specificity (analyzed separately), symmetric ROC curve, DOR. | Joint modeling of sensitivity and specificity, LR+, LR−, DOR; hierarchical ROC curve (data-restricted); AUC estimated in select cases. |
| Confidence Intervals | Often narrow and symmetric, prone to underestimating uncertainty. | Data-dependent, typically wider and asymmetric, better reflect true variability. |
| Heterogeneity Assessment | Cochran's Q and I² statistics (may misrepresent variability due to inability to disentangle threshold and non-threshold heterogeneity). | Bivariate I² (Zhou), variance estimates for sensitivity and specificity, prediction regions for visual assessment. |
| Threshold Effect Handling | Ignored; assumes a common threshold across studies, which may distort pooled estimates. | Models threshold heterogeneity by allowing study-specific operating points via random effects; HSROC separates accuracy and threshold components. A negative Se–Sp correlation often arises under threshold variability but is a consequence, not the mechanism. |
| Interpretation of ROC Curve | Assumes a symmetric ROC curve as a mathematical simplification, which may not reflect the true asymmetry or variability present in empirical data. The curve is often extrapolated beyond the observed data range. | The ROC curve is typically constrained to the empirical data range, incorporating asymmetry, threshold effects, and between-study heterogeneity. Best practice is to display the HSROC primarily over the observed operating range to avoid misleading extrapolation; extrapolation is a plotting choice, not a model property. |
| Summary Estimates Robustness | Pooled estimates (Se, Sp, DOR) are often misleading when there is high heterogeneity. | Summary points and prediction regions account for between-study variability and correlation. |
| Advanced Diagnostics | Not available. | Influence diagnostics (Cook's distance, standardized residuals), bivariate boxplots, LR+ and LR− scattergrams, model fit assessment. |
| Meta-Regression | Not available. | Available (univariable and multivariable). |
| Outlier Detection and Robustness | Limited capacity. | Systematic outlier identification supports sensitivity analyses and model refinement. |
| Publication Bias Assessment | Begg's test, Egger's regression, funnel plots (limited validity for DTA) | Deeks' test (specifically designed for DTA). |
| Study-Level Variance Handling | Often underestimated; ignores study clustering. | Explicit variance modeling for sensitivity and specificity; accounts for within- and between-study variability. |
| Recommendation | Rationale | Practical Implication |
|---|---|---|
| Plan and report analyses transparently | Post hoc choices increase the risk of bias and reduce reproducibility | Best practice includes protocol registration (e.g., PROSPERO) and reporting per PRISMA-DTA, with all statistical analyses pre-specified in the protocol |
| Mandate hierarchical models (BRMA/HSROC) | Separate pooling ignores correlation and underestimates uncertainty | Always apply hierarchical models (BRMA/HSROC) as the gold standard; avoid separate pooling approaches that ignore threshold effects and heterogeneity |
| Select appropriate software | Not all platforms support hierarchical modelling or advanced diagnostics | Use Stata (metandi/metadta/midas) or R (mada/diagmeta). Meta-DiSc 2.0, although more limited, allows hierarchical modelling |
| Report paired Se and Sp, LR+/LR− and CI/prediction regions | Se or Sp alone masks trade-offs; the two should not be analyzed separately | Always show Se and Sp with 95% CI, plus LR+ and LR−; add joint CI/prediction regions |
| Use AUC and DOR cautiously | AUC extrapolates beyond observed data; DOR has limited interpretability | If reported, include CIs and note limitations; prioritize primary metrics (Se, Sp, LR+, LR−) |
| Quantify heterogeneity with τ² and bivariate I² | Univariate I² exaggerates heterogeneity in DTA | Report τ² for sens/spec and Zhou’s bivariate I²; interpret in context |
| Assess threshold effects | A negative Se–Sp correlation (ρₐᵦ) suggests threshold-driven heterogeneity | Report and interpret ρₐᵦ |
| Perform influence and sensitivity analyses | Outliers distort pooled estimates and heterogeneity | Use Cook’s distance, residuals, and bivariate boxplots to identify influential studies. Report the primary analysis with all studies, and then provide a transparent sensitivity analysis excluding them. The purpose is to test robustness, not to replace the primary result |
| Assess publication bias | Egger’s and Begg’s tests are invalid for DTA; Deeks’ test is recommended only with ≥10 studies | Apply Deeks’ test with caution; discuss small-study effects and low power |
| Consider performing meta-regression | Pre-specify covariates and modeling strategy; avoid post hoc data dredging. Data-driven dichotomization inflates Type I error and biases effect estimates. | Model continuous covariates on their natural scale; treat findings as exploratory unless pre-planned and adequately powered (roughly ≥10 studies per covariate). |
| Use complementary graphical tools | Graphical outputs enhance clinical applicability | Include HSROC curves, Fagan nomograms, and LR scatterplots |
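The Deeks' test recommended in the table above is commonly described as a regression of the log diagnostic odds ratio on the inverse square root of the effective sample size (ESS), weighted by ESS. The sketch below illustrates that calculation in plain Python. It is not the midas implementation: all function names are hypothetical, the 0.5 continuity correction applied only to tables with an empty cell is an assumption, and the code stops at the slope, its standard error, and the t statistic, which must still be compared against a t distribution with the reported degrees of freedom.

```python
import math


def effective_sample_size(n_diseased: float, n_non_diseased: float) -> float:
    """ESS = 4 * n1 * n2 / (n1 + n2)."""
    return 4 * n_diseased * n_non_diseased / (n_diseased + n_non_diseased)


def log_dor(tp: float, fp: float, fn: float, tn: float) -> float:
    """ln(DOR); adds 0.5 to every cell if any cell is zero (assumption)."""
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (c + 0.5 for c in (tp, fp, fn, tn))
    return math.log((tp * tn) / (fp * fn))


def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (slope, intercept, se_slope)."""
    if len(x) < 3:
        raise ValueError("need at least 3 points")
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum(wi * (yi - intercept - slope * xi) ** 2
              for wi, xi, yi in zip(w, x, y))
    se_slope = math.sqrt(rss / (len(x) - 2) / sxx)
    return slope, intercept, se_slope


def deeks_regression(studies):
    """studies: iterable of (tp, fp, fn, tn) tuples, one per primary study."""
    studies = list(studies)
    ess = [effective_sample_size(tp + fn, fp + tn) for tp, fp, fn, tn in studies]
    x = [1 / math.sqrt(e) for e in ess]
    y = [log_dor(*s) for s in studies]
    slope, intercept, se = weighted_linear_fit(x, y, ess)
    t_stat = slope / se if se > 0 else None
    return {"slope": slope, "intercept": intercept, "se": se,
            "t": t_stat, "df": len(studies) - 2}


# Hypothetical 2x2 counts, for illustration only.
example = [(90, 20, 10, 80), (45, 5, 15, 60), (30, 12, 8, 70), (120, 30, 25, 200)]
print(deeks_regression(example))
```

A slope clearly different from zero points to funnel plot asymmetry, but with few studies the test has low power, which is why the table restricts its use to meta-analyses with at least 10 studies.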
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).