Preprint
Article

This version is not peer-reviewed.

Finite-Sample Precision Limits for Expected Shortfall Forecast Comparisons

Submitted:

02 June 2026

Posted:

03 June 2026


Abstract
Expected Shortfall (ES) is a tail functional whose estimation precision is governed by the effective tail sample size $n\alpha$ rather than by the nominal calibration size $n$. The resulting $(n\alpha)^{-1/2}$ information limit is well established, yet no practical framework exists for deciding whether two ES forecasts can be meaningfully distinguished over a finite calibration window. This paper converts the asymptotic rate into four operational diagnostics: a plug-in precision benchmark, a sample-size rule, a precision-fragile pairwise comparison screen, and a VaR-first diagnostic linking excess ES dispersion to first-stage quantile miscalibration. An empirical application to global financial assets and heterogeneous forecasters under standard regulatory tail parameters shows that roughly one in five pairwise ES comparisons is precision-fragile, with excess dispersion concentrated in cells with poor VaR calibration. The results suggest that ES forecast rankings at typical tail levels can be constrained by effective tail information rather than by model sophistication.

1. Introduction

Expected Shortfall (ES) at tail level $\alpha$ is a coherent risk measure widely used in financial regulation, portfolio management, and quantitative risk modelling. Because ES averages losses beyond the $\alpha$-quantile, its estimation draws on at most $n\alpha$ effective tail observations from a calibration sample of size $n$. At tail levels common in practice, $\alpha \in \{1\%, 2.5\%, 5\%\}$, this effective count is small even when $n$ is moderately large. The statistical difficulty of ES estimation is therefore governed by $n\alpha$, not by $n$.
The $(n\alpha)^{-1/2}$ scaling of ES estimation error is well established. Zwingmann and Holzmann [1] derive this rate through a central limit theorem for tail averages, Bartl and Eckstein [2] obtain a concentration lower bound for nonparametric ES estimation, and related kernel estimators [3] obey the same effective-tail-count rate. These results characterise the information limit, but they do not translate it into a practical diagnostic for deciding whether two ES forecasts can be meaningfully distinguished over a finite calibration window.
ES forecast evaluation relies on the joint elicitability of VaR and ES [4]. Two-stage regression frameworks [5] and dynamic semiparametric VaR–ES models [6] provide tools for estimation and evaluation under dependence. Pele et al. [7] show that Fissler–Ziegel-based recalibration can attain the $(n\alpha)^{-1/2}$ rate under geometric mixing. That paper is concerned with achievability of the rate under temporal dependence; the present paper takes the rate as given and converts it into an operational comparison audit comprising the four diagnostics developed in Section 3 and Section 4. ES backtesting from a regulatory perspective is studied by Acerbi and Székely [8] and Nolde and Ziegel [9]. The question addressed here is how much pairwise ES model comparison is statistically supportable once the effective tail count is fixed by the calibration window.
The contribution is operational rather than rate-theoretic. The paper converts the known ES information limit into a precision audit for ES model comparison. The audit has four components:
1. a plug-in precision benchmark based on the effective tail count, calibrated from tail residuals;
2. a finite-sample sample-size rule that determines the calibration window needed for a target ES precision tolerance;
3. a precision-fragile pairwise comparison screen that classifies a forecast pair as precision-fragile when the observed difference in ES recalibration corrections falls below the precision floor implied by the available effective tail count;
4. a VaR-first diagnostic for attributing excess ES recalibration dispersion to first-stage quantile miscalibration.
The novelty is not the $(n\alpha)^{-1/2}$ rate itself but its conversion into a practical precision audit for ES model comparison. The Le Cam two-point construction in Appendix A (Theorem A1) certifies that no estimator can beat the $(n\alpha)^{-1/2}$ rate over the distribution class, establishing that the plug-in benchmark $\hat C/\sqrt{n\alpha}$ has the correct functional form for a precision floor. The operational constant $\hat C$ is a CLT-motivated plug-in rather than the exact minimax constant $c_L$, but the rate it instantiates is sharp.
The natural application domain for this audit is the Fundamental Review of the Trading Book (FRTB; [10]), which sets ES at $\alpha = 2.5\%$ as the capital-determining market-risk measure evaluated over a 250-day calibration window. These parameters imply $n\alpha = 6.25$, placing the problem firmly in the finite-sample tail-scarcity regime. FRTB serves as the motivating application throughout this paper, but the audit framework applies to any setting in which ES forecasts are compared over a finite window at a fixed tail level.
The empirical analysis uses a parametric GJR-GARCH-t baseline [11] and three zero-shot time-series foundation models: TimesFM 2.5 [12], Chronos-Small [13], and Moirai 2.0 [14,15]. The forecasters are not ranked as a model-selection exercise. They provide heterogeneous first-stage forecast sequences against which the precision audit can be evaluated. Chronos-Small is retained as a stress case because it exhibits severe first-stage VaR miscalibration, allowing the VaR-first diagnostic to be evaluated under a clear failure mode.
Across 24 global assets and four forecasters evaluated at $\alpha = 2.5\%$, between 16% and 29% of pairwise ES rankings fall below the plug-in precision tolerance implied by the effective tail count; the paired block-bootstrap estimate is 22.9%. Excess ES recalibration dispersion is concentrated in cells with poor first-stage VaR calibration. For the median asset, the 250-day calibration window delivers approximately 37 bp of precision, but seven of 24 high-volatility assets require more than 250 days to reach a 50 bp tolerance.
From a mathematical-statistical perspective, the paper studies how an asymptotic information bound for a tail functional can be converted into a finite-sample diagnostic for forecast comparison. The empirical application to market-risk forecasts serves to illustrate the consequences of this effective-sample-size constraint in a realistic dependent time-series setting.
The remainder of the paper is organised as follows. Section 2 develops the theoretical framework, linking the effective tail sample size to ES recalibration through an oracle-equivalence argument. Section 3 defines the plug-in precision benchmark and the operational sample-size rule. Section 4 introduces the precision-fragile comparison screen and the VaR-first diagnostic. Section 5 reports a controlled VaR-miscalibration simulation. Section 6 presents the financial-risk application. Section 7 collects robustness and sensitivity results. Section 8 states the main limitations, and Section 9 concludes.

2. Expected Shortfall as a Tail Functional Under Effective Sample-Size Scarcity

This section establishes the theoretical framework. It defines the effective tail sample size, derives the CLT-based precision scaling, and states the oracle-equivalence argument that transfers the rate to additive ES recalibration.

2.1. Definitions and Effective Tail Sample Size

Let $X_1, \dots, X_n$ be observations from a distribution $P$ on $\mathbb{R}$ with distribution function $F$ and Lebesgue density $f$. Fix $\alpha \in (0, 1/2)$. The Value-at-Risk at level $\alpha$ is the $\alpha$-quantile $\mathrm{VaR}_\alpha(P) := \inf\{x : F(x) \ge \alpha\}$, and the Expected Shortfall is $\mathrm{ES}_\alpha(P) := \alpha^{-1}\int_0^\alpha \mathrm{VaR}_u(P)\,du = E[X \mid X \le \mathrm{VaR}_\alpha(P)]$. Both $\mathrm{VaR}_\alpha$ and $\mathrm{ES}_\alpha$ are negative for losses throughout; the FZ$_0$ loss in Section 6.1 assumes $e < 0$ accordingly.
The $(n\alpha)^{-1/2}$ rate for ES estimation is established by Zwingmann and Holzmann [1] via a CLT for the tail average and by Bartl and Eckstein [2] via a concentration lower bound. The mechanism is the tail identification residual
$$\phi(X) := \frac{(\mathrm{VaR}_\alpha - X)\,\mathbf{1}\{X \le \mathrm{VaR}_\alpha\}}{\alpha} + (\mathrm{ES}_\alpha - \mathrm{VaR}_\alpha), \tag{1}$$
which satisfies $E[\phi] = 0$ and $\mathrm{Var}(\phi) = \sigma_{\mathrm{tail}}^2/\alpha$, where $\sigma_{\mathrm{tail}}^2 := E[(\mathrm{VaR}_\alpha - X)^2 \mid X \le \mathrm{VaR}_\alpha]$. Only an $\alpha$-fraction of observations falls in the tail, and each carries $O(1/\alpha)$ weight in the ES average; the CLT for the sample average of $\phi(X_1), \dots, \phi(X_n)$ gives standard error $\sigma_{\mathrm{tail}}/\sqrt{n\alpha}$. The effective sample size for ES estimation is therefore $n\alpha$, not $n$.
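The scaling above can be illustrated with a minimal Python sketch. It assumes a plain empirical quantile and tail mean as the VaR and ES estimates; the function name `es_standard_error` and the simulated Gaussian sample are illustrative assumptions, not part of the paper's procedure.

```python
import math
import random

def es_standard_error(sample, alpha):
    """CLT-style standard error sigma_tail / sqrt(n * alpha) for an empirical ES.

    Assumption: VaR is the k-th smallest observation with k = floor(n * alpha),
    and ES is the mean of the k smallest observations.
    """
    n = len(sample)
    xs = sorted(sample)
    k = max(1, math.floor(n * alpha))   # number of tail points actually used
    var_hat = xs[k - 1]                 # empirical alpha-quantile (lower tail)
    tail = xs[:k]
    es_hat = sum(tail) / k
    # sigma_tail^2 = E[(VaR - X)^2 | X <= VaR], estimated on the tail points
    sigma_tail = math.sqrt(sum((var_hat - x) ** 2 for x in tail) / k)
    return {
        "n_alpha": n * alpha,
        "var": var_hat,
        "es": es_hat,
        "sigma_tail": sigma_tail,
        "se": sigma_tail / math.sqrt(n * alpha),
    }

# Illustrative use: 250 simulated daily returns at the FRTB tail level.
random.seed(1)
demo = es_standard_error([random.gauss(0.0, 0.01) for _ in range(250)], 0.025)
```

With $n = 250$ and $\alpha = 2.5\%$ the sketch uses only six tail points, which is the scarcity the section describes.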

2.2. Oracle Equivalence for Additive Recalibration

In the two-stage recalibration framework of Dimitriadis and Bayer [5], a first-stage model produces base VaR and ES forecasts, and the second stage estimates an additive ES correction $r^* := \mathrm{ES}_\alpha(P) - \bar E$ by minimising a Fissler–Ziegel loss [4].
Remark 1
(Oracle equivalence). Fix a first-stage ES forecast $\bar E$ that does not depend on the second-stage calibration sample $X_1, \dots, X_n$. Estimating $r^*(P) := \mathrm{ES}_\alpha(P) - \bar E$ is equivalent in $L^1$ risk to estimating $\mathrm{ES}_\alpha(P)$: since $\bar E$ is a fixed constant, $E_P|\hat r_n - r^*(P)| = E_P|\hat T_n - \mathrm{ES}_\alpha(P)|$ for $\hat T_n := \hat r_n + \bar E$. Any minimax lower bound for ES estimation therefore transfers directly to $r^*(P)$.
The conceptual chain of the paper has four links: (i) the known $(n\alpha)^{-1/2}$ rate for ES estimation; (ii) oracle equivalence (Remark 1), by which additive ES recalibration inherits this rate under a fixed first-stage forecast; (iii) the plug-in precision benchmark $\hat C/\sqrt{n\alpha}$, which instantiates the rate empirically; and (iv) the operational diagnostics built on this benchmark (Section 6 and Section 7). The novelty lies in links (iii)–(iv), not in the rate itself.

2.3. Distinction Between $c_L$ and $\hat C$

An alternative Le Cam two-point construction yielding a closed-form constant $c_L$ is provided in Appendix A for completeness. The plug-in precision benchmark $\hat C/\sqrt{n\alpha}$ used in the empirics is calibrated from the CLT, not from $c_L$. The distinction matters: $c_L$ is the exact minimax constant from a worst-case two-point argument, whereas $\hat C$ is a CLT-motivated plug-in scale. Throughout, “information limit” and “precision floor” refer to the $(n\alpha)^{-1/2}$ rate; “plug-in benchmark” refers to $\hat C/\sqrt{n\alpha}$.
The fixed-forecast condition in Remark 1 is approximated in the empirical design: the 1000-day GJR-GARCH training window ends before the 250-day calibration window, and the foundation-model weights are not updated on the calibration sample (zero-shot inference). The benchmark is therefore an oracle-style precision diagnostic, not a formal finite-sample guarantee for the full rolling procedure. The operational benchmark $\hat C/\sqrt{n\alpha}$ inherits the rate from oracle equivalence; it does not inherit the exact Le Cam constant $c_L$. When the first stage is mis-specified, the first-stage bias $\eta_t$ does not vanish and the oracle equivalence breaks down; this is precisely the VaR-miscalibration channel that the VaR-first diagnostic detects (Section 4.2).

3. Precision Benchmark and Sample-Size Rule

3.1. Plug-In Precision Benchmark

For each asset, forecaster, and tail level, the plug-in constant is defined as
$$\hat C = \hat\sigma_{\mathrm{tail}}, \tag{2}$$
the sample standard deviation of the tail residuals
$$\{\hat V_t - X_t : X_t \le \hat V_t\}. \tag{3}$$
The corresponding precision benchmark is
$$B_{i,m,\alpha} = \frac{\hat C_{i,m,\alpha}}{\sqrt{n\alpha}}. \tag{4}$$
This quantity instantiates the effective-tail-count rate in an empirical cell and provides the precision floor against which ES recalibration dispersion is evaluated. The benchmark is an operational diagnostic, not a formal finite-sample confidence interval.
In the main analysis, $\hat C$ is computed once over the full evaluation sample for each cell. This yields an ex-post diagnostic benchmark, not a real-time estimate. In a live implementation, $\hat C$ would instead be estimated on each rolling calibration window. Appendix C reports this rolling-window variant and shows that the qualitative precision-fragile comparison results are preserved, although the absolute benchmark ratios change.
Because $\hat C$ is a sample standard deviation of tail residuals, it carries its own sampling variance, and the precision floor $B = \hat C/\sqrt{n\alpha}$ inherits this uncertainty. The concern is most acute in the rolling-window variant, where $\hat C_t$ is re-estimated on each 250-day window from roughly $n\alpha \approx 6.25$ tail points. The main analysis mitigates the problem by computing $\hat C$ once over the full evaluation sample, which aggregates far more than 6.25 tail observations per cell. The rolling-window results in Appendix C (Table A4) confirm that the qualitative conclusions survive when $\hat C$ is instead estimated window-by-window, even though individual benchmark ratios shift, consistent with the additional noise in the floor estimate. A Monte Carlo study in Appendix B (Table A1) confirms that the coefficient of variation of $\hat C$ is 0.76 under Student-$t_5$ at FRTB parameters, declining to 0.43 at $n = 1000$. Where the floor itself is uncertain, marginal precision-fragile classifications should be treated as tentative rather than decisive, reinforcing the conservative reading of the screen.
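The plug-in benchmark can be computed in a few lines. The following Python sketch assumes aligned lists of realised returns and first-stage VaR forecasts for one cell; the function name `plug_in_benchmark` and the toy inputs are hypothetical.

```python
import math
import statistics

def plug_in_benchmark(returns, var_forecasts, n, alpha):
    """Return (C_hat, B) for one (asset, forecaster, tail level) cell.

    C_hat is the sample standard deviation of the tail residuals
    V_hat_t - X_t over the dates with X_t <= V_hat_t, and
    B = C_hat / sqrt(n * alpha) is the precision floor.
    """
    residuals = [v - x for x, v in zip(returns, var_forecasts) if x <= v]
    if len(residuals) < 2:
        raise ValueError("need at least two tail residuals to estimate C_hat")
    c_hat = statistics.stdev(residuals)   # sample SD (n - 1 denominator)
    return c_hat, c_hat / math.sqrt(n * alpha)

# Toy example: three of the five returns breach a constant -2% VaR forecast.
c_hat, floor = plug_in_benchmark(
    returns=[-0.03, 0.01, -0.05, 0.002, -0.04],
    var_forecasts=[-0.02] * 5,
    n=250,
    alpha=0.025,
)
```

Here `n` is the calibration-window length entering the floor, which may differ from the number of dates used to estimate `c_hat` (the full evaluation sample in the main analysis).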

3.2. Empirical Dispersion Measure

The raw rolling-window ES corrections $\hat r_n$ contain both local estimation noise and slow-moving volatility-regime shifts. The precision benchmark concerns the former. To isolate within-regime estimation variability, we subtract a 252-day moving average from the correction path before computing its standard deviation. Robustness checks with 126- and 504-day filters in Appendix C show that the forecaster ranking is stable, although absolute ratios vary with the detrending horizon.
For each asset $i$, forecaster $m$, and tail level $\alpha$, the empirical benchmark ratio is
$$R_{i,m,\alpha} = \frac{\widehat{\mathrm{SD}}_{\mathrm{det}}(\hat r_{i,m,\alpha})}{\hat B_{i,m,\alpha}}, \tag{5}$$
where $\hat B_{i,m,\alpha}$ is the plug-in precision benchmark from Equation (4), and $\widehat{\mathrm{SD}}_{\mathrm{det}}$ denotes the standard deviation of the detrended correction path. Values $R < 1$ indicate that the plug-in scale is conservative for that cell; this does not contradict the Le Cam lower bound, which uses a different constant $c_L$. Values $R > 1$ indicate excess dispersion, potentially due to VaR misalignment, non-stationarity, or forecast-extraction noise.
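The detrending step and the ratio $R$ can be sketched as follows. The paper does not say whether the moving average is trailing or centred, so this Python sketch assumes a trailing average over `window` consecutive points of the correction path; both function names are illustrative.

```python
import statistics

def detrended_sd(path, window):
    """SD of a correction path after subtracting a trailing moving average.

    Assumption: the moving average at time t covers path[t-window+1 .. t];
    the first window-1 points are dropped because no full average exists.
    """
    if window < 1 or len(path) < window + 1:
        raise ValueError("path must contain at least window + 1 points")
    detrended = []
    running = sum(path[:window])
    for t in range(window - 1, len(path)):
        if t >= window:
            running += path[t] - path[t - window]   # slide the window by one
        detrended.append(path[t] - running / window)
    return statistics.stdev(detrended)

def benchmark_ratio(path, window, benchmark):
    """Empirical ratio R = detrended SD of the corrections / plug-in floor."""
    return detrended_sd(path, window) / benchmark
```

A purely linear drift in the correction path is removed entirely by this filter, so such a path yields a detrended standard deviation of essentially zero.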

3.3. Finite-Sample Correction

At FRTB parameters ($\alpha = 2.5\%$, $n = 250$), the effective tail count is $k := n\alpha = 6.25$. Because the number of tail observations is random (binomial$(n, \alpha)$ under correct VaR calibration), the CLT scale requires an inflation factor:
$$f(n, \alpha) = \sqrt{1 + \frac{1 - \alpha}{n\alpha}}. \tag{6}$$
At FRTB parameters, $f = 1.075$, raising the required window by approximately 15%. Monte Carlo calibration (Table 1, 50,000 replications under Student-$t_5$ and GARCH(1,1)-$t_5$ DGPs) shows that the CLT plug-in is accurate to roughly 10–15% at $k = 6.25$ under i.i.d. sampling. Under GARCH dynamics, the unconditional $\sigma_{\mathrm{tail}}$ can exceed the within-regime value at short windows, producing $\hat f < 1$; this motivates the detrending step in Section 3.2. Full details are in Appendix B.
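As a quick check of the quoted figure, and reading the inflation factor as $f(n,\alpha) = \sqrt{1 + (1-\alpha)/(n\alpha)}$ (the form that reproduces $f = 1.075$), the FRTB arithmetic can be written out as

$$f(250,\, 0.025) = \sqrt{1 + \frac{0.975}{6.25}} = \sqrt{1.156} \approx 1.075, \qquad f^2 \approx 1.156.$$

Because the required window scales with $f^2$, the correction lengthens it by roughly 15 to 16%, in line with the approximate figure quoted above.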

3.4. Operational Sample-Size Rule

The Le Cam constant $c_L$ in Appendix A depends on local density geometry and is not directly estimable. For operational sample-size calculations, we therefore instantiate the $(n\alpha)^{-1/2}$ rate with the CLT-motivated plug-in scale $C = \hat\sigma_{\mathrm{tail}}$, estimated from tail residuals (see [16], Ch. 6). Incorporating the random-count correction from Section 3.3, the required window length for tolerance $\varepsilon$ solves
$$f(n, \alpha)\,\frac{C}{\sqrt{n\alpha}} \le \varepsilon \quad\Longleftrightarrow\quad n \ge \frac{f(n, \alpha)^2}{\alpha}\left(\frac{C}{\varepsilon}\right)^2, \tag{7}$$
where
$$f(n, \alpha) = \sqrt{1 + \frac{1 - \alpha}{n\alpha}}. \tag{8}$$
Because $f(n, \alpha)$ depends on $n$, the rule is implicit; in practice, fixed-point iteration converges in a few steps. At $\alpha = 2.5\%$ and $n = 250$, the correction factor is $f(250, 0.025) = 1.075$, which raises the required window by approximately 15%. For $k = n\alpha \ge 25$, the correction is below 2% and can usually be ignored.
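The fixed-point iteration mentioned above can be sketched in Python as follows. The sketch assumes the inflation factor has the form $f(n,\alpha) = \sqrt{1 + (1-\alpha)/(n\alpha)}$; the function names, the starting value and the stopping tolerance are illustrative choices.

```python
import math

def inflation_factor(n, alpha):
    """Random-tail-count inflation factor f(n, alpha) (assumed form)."""
    return math.sqrt(1.0 + (1.0 - alpha) / (n * alpha))

def required_window(c, eps, alpha, n0=250.0, tol=1e-6, max_iter=100):
    """Smallest real-valued n with f(n, alpha) * c / sqrt(n * alpha) <= eps.

    Iterates n <- f(n, alpha)^2 / alpha * (c / eps)^2 until successive
    values differ by less than tol. c and eps must share the same units.
    """
    n = n0
    for _ in range(max_iter):
        n_next = inflation_factor(n, alpha) ** 2 / alpha * (c / eps) ** 2
        if abs(n_next - n) < tol:
            return n_next
        n = n_next
    raise RuntimeError("fixed-point iteration did not converge")

# Illustrative inputs: C = 1.3%, tolerance 50 bp, FRTB tail level.
n_star = required_window(c=0.013, eps=0.005, alpha=0.025)
```

For these illustrative inputs the iteration settles at roughly 305 trading days; in practice the result would be rounded up to a whole number of days.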
Table 2 reports the implied window lengths for four tail levels and four tolerances, calibrated from a Student-$t_5$ reference distribution. The table has two immediate implications. First, sub-FRTB tail levels are effectively infeasible on realistic calibration windows: at $\alpha = 0.5\%$ and $\varepsilon = 10\%$, the required window is about 44,700 trading days. Second, at the FRTB level $\alpha = 2.5\%$, a 250-day window implies a corrected tolerance of approximately $0.43\,C$, which is of the same order as many inter-model ES differences.
Asset-specific calibrations are reported in Table A11 in Appendix I and summarised in Section 6. Empirical ratios should be interpreted as distances to the CLT-motivated plug-in benchmark, not as tests of the exact Le Cam constant.

4. Precision-Fragile Pairwise Comparison Screen

4.1. Screen Definition

We call a comparison precision-fragile if its absolute ES-recalibration difference is smaller than the plug-in tolerance. The screen is not a hypothesis test and does not attach a nominal size. It is a conservative diagnostic: it asks whether the observed difference is larger than the precision floor implied by the available effective tail count.
The threshold is forecaster-agnostic, conditional on approximate first-stage calibration: it depends on the per-cell tail dispersion $\hat C$ and the effective tail count $n\alpha$, not on the model class itself. Adding or substituting forecasters changes the set of pairwise comparisons, but not the precision floor used to evaluate each pair. Severely miscalibrated forecasters can still inflate the precision-fragile share through the VaR-miscalibration channel, which the VaR-first diagnostic examines in Section 6.3.
Formally, for a given asset and tail level $\alpha$, the comparison between forecasters 1 and 2 is classified as precision-fragile when
$$\left|\bar r_n^{(1)} - \bar r_n^{(2)}\right| < \sqrt{\frac{\hat C_1^2 + \hat C_2^2}{n\alpha}}, \tag{9}$$
where $\bar r_n^{(j)}$ denotes the mean ES recalibration correction for forecaster $j$ over the relevant evaluation windows. The threshold treats the two correction estimates as independently noisy. Because the same return realisation enters both forecasters’ recalibration losses, pairwise correction estimates are likely positively correlated. Ignoring this covariance inflates the tolerance and therefore classifies marginal cases as precision-fragile; the screen is conservative by design.
Although the algebraic form of the threshold resembles a pooled standard error, its substantive content differs in three respects. First, the scale is anchored to the effective tail count $n\alpha$ (6.25 observations at FRTB parameters, not the 250 days that constitute the nominal window), so the tolerance reflects the information actually available in the tail rather than the total calibration length. Second, the precision floor is a fixed comparison standard: it is invariant to the forecaster set, whereas the realised precision-fragile share depends on which models are compared. Third, the screen is offered as a conservative diagnostic, not a hypothesis test with nominal size; it flags comparisons whose observed difference is smaller than the precision floor, without attaching a rejection probability.
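A minimal Python sketch of the screen follows. The function name `is_precision_fragile` is hypothetical, and the optional multiplier `kappa` corresponds to the cutoff multiplier varied in Section 7 (the baseline screen uses `kappa = 1`).

```python
import math

def is_precision_fragile(r1, r2, c1, c2, n, alpha, kappa=1.0):
    """Classify one pairwise ES comparison.

    r1, r2: mean ES recalibration corrections of the two forecasters
    c1, c2: their plug-in tail-dispersion constants C_hat
    Returns (fragile, tolerance) with
    tolerance = kappa * sqrt((c1^2 + c2^2) / (n * alpha)).
    """
    tolerance = kappa * math.sqrt((c1 ** 2 + c2 ** 2) / (n * alpha))
    return abs(r1 - r2) < tolerance, tolerance

# Illustrative values: both forecasters have C_hat = 1.3% at FRTB parameters.
fragile, tol = is_precision_fragile(-0.0040, 0.0010, 0.013, 0.013, 250, 0.025)
```

With these illustrative inputs the pairwise tolerance is about 74 bp, so a 50 bp difference in corrections is flagged as precision-fragile.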

4.2. Rate Tests and VaR-First Diagnostic

Three diagnostics assess whether the $(n\alpha)^{-1/2}$ rate governs empirical recalibration dispersion and whether deviations from the benchmark are linked to first-stage quantile failure.
First, a cross-sectional scaling regression across all (asset, forecaster, $\alpha$) cells:
$$\log \widehat{\mathrm{SD}}_{\mathrm{det}}(\hat r_n) = a + b \log(n\alpha) + \gamma \log(\hat\sigma_{\mathrm{tail}}) + u. \tag{10}$$
The rate prediction is $b = -1/2$, while the CLT plug-in scale implies $\gamma = 1$. Standard errors are clustered by asset.
Second, a non-overlapping window-length scaling test at $\alpha = 1\%$, where variation in the effective tail count is most visible. The calibration length is varied over
$$n \in \{250, 500, 750, 1000\},$$
with step size equal to $n$, so adjacent windows do not overlap. For each (asset, forecaster) pair, we estimate
$$\log \widehat{\mathrm{SD}}(\hat r_n) = a + b \log(n) + u,$$
using HC1 robust standard errors. The pooled fixed-effects specification is
$$\log \widehat{\mathrm{SD}}(\hat r_{i,j,n}) = \alpha_i + \beta_j + b \log(n) + u_{i,j,n},$$
where $\alpha_i$ and $\beta_j$ denote asset and forecaster fixed effects.
Third, the VaR-first diagnostic distinguishes intrinsic ES estimation noise from first-stage quantile failure. For each (asset, forecaster) pair, VaR backtesting statistics are computed at $\alpha = 1\%$. The main diagnostic uses the Christoffersen conditional-coverage statistic [17]; the Kupiec unconditional-coverage statistic [18] is reported as a robustness check. Excess ES recalibration dispersion should concentrate in cells with poor first-stage VaR calibration.
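For concreteness, the simpler of the two backtests can be written down directly. The Python sketch below computes the Kupiec unconditional-coverage likelihood-ratio statistic from a hit count; it does not implement the Christoffersen conditional-coverage statistic used as the main diagnostic, and the function name is illustrative.

```python
import math

def kupiec_lr(hits, n, alpha):
    """Kupiec unconditional-coverage LR statistic.

    hits: number of VaR violations among n forecasts at nominal level alpha.
    Under correct coverage the statistic is asymptotically chi-square with
    one degree of freedom (5% critical value about 3.84).
    """
    def loglik(p):
        # Binomial log-likelihood without the constant term; 0 * log(0) := 0.
        ll = 0.0
        if hits > 0:
            ll += hits * math.log(p)
        if n - hits > 0:
            ll += (n - hits) * math.log(1.0 - p)
        return ll

    p_hat = hits / n
    return -2.0 * (loglik(alpha) - loglik(p_hat))

# Illustrative use: 60 violations in 1000 days against a nominal 2.5% level.
lr = kupiec_lr(60, 1000, 0.025)
```

A large value of the statistic marks a cell as poorly calibrated in the first stage, which is where the diagnostic expects excess ES recalibration dispersion to concentrate.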

5. Simulation Evidence

This section isolates the mechanism behind the VaR-first diagnostic. The empirical results in Section 6 show that excess ES recalibration dispersion is concentrated in poorly VaR-calibrated cells. The simulation below asks whether this pattern can arise when the first-stage VaR model fails to track time-varying volatility.
The simulation generates 30 independent paths of length 10,000 from a GARCH(1,1)-$t_5$ process with $\omega = 10^{-6}$, $\alpha_g = 0.10$, and $\beta_g = 0.85$. For each path, the first-stage VaR/ES model uses a convex blend of the true conditional volatility $\sigma_t$ and the unconditional volatility $\bar\sigma$:
$$\sigma_{\mathrm{model},t} = (1 - \delta)\,\sigma_t + \delta\,\bar\sigma, \qquad \delta \in \{0, 0.70, 0.85, 1\}.$$
At $\delta = 0$, the model uses the true conditional volatility. At $\delta = 1$, it ignores conditional volatility dynamics and uses the unconditional volatility. Additive ES corrections are estimated on rolling 250-day windows, advanced in 21-day steps, at $\alpha = 2.5\%$. The ES dispersion ratio $R$ is computed as the raw standard deviation of the ES correction divided by the oracle benchmark
$$\frac{\sigma_{\mathrm{tail}}}{\sqrt{n\alpha}}.$$
Table 3 shows that $R$ increases from 0.95 under the oracle volatility specification to 1.76 under the unconditional-volatility specification. The hit rate rises at the same time, indicating worsening VaR calibration. The transition through $R = 1$ occurs between the $\delta = 0.70$ and $\delta = 0.85$ designs, where the hit rate is already far above the nominal 2.5%. The simulation therefore supports the interpretation that regime-dependent VaR error can inflate ES recalibration dispersion, consistent with the VaR-first diagnostic.
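The first-stage design can be sketched in a few lines of Python. This is a simplified illustration rather than a reproduction of Table 3: it simulates a single path, reports only the VaR hit rate (not the ratio $R$), hardcodes the 2.5% Student-$t_5$ quantile as $-2.5706$, and rescales innovations to unit variance, which is an assumption about the design. The function name is hypothetical.

```python
import math
import random

# 2.5% quantile of a Student-t5 variable rescaled to unit variance
# (the raw t5 quantile -2.5706 is hardcoded; Var(t5) = 5/3).
T5_Q025_UNIT_VAR = -2.5706 * math.sqrt(3.0 / 5.0)

def simulate_hit_rate(delta, n=20000, omega=1e-6, a=0.10, b=0.85, seed=0):
    """VaR hit rate at 2.5% when the model volatility is the convex blend
    (1 - delta) * sigma_t + delta * sigma_bar on a GARCH(1,1)-t5 path."""
    rng = random.Random(seed)
    sigma_bar = math.sqrt(omega / (1.0 - a - b))   # unconditional volatility
    var_t = sigma_bar ** 2                          # start at the unconditional variance
    hits = 0
    for _ in range(n):
        sigma_t = math.sqrt(var_t)
        # t5 draw: standard normal over sqrt(chi-square_5 / 5), then unit variance
        z = rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(2.5, 2.0) / 5.0)
        z *= math.sqrt(3.0 / 5.0)
        r = sigma_t * z
        sigma_model = (1.0 - delta) * sigma_t + delta * sigma_bar
        if r <= sigma_model * T5_Q025_UNIT_VAR:
            hits += 1
        var_t = omega + a * r * r + b * var_t       # GARCH(1,1) recursion
    return hits / n

oracle_hit_rate = simulate_hit_rate(0.0)
unconditional_hit_rate = simulate_hit_rate(1.0)
```

At `delta = 0` the hit rate should sit close to the nominal 2.5% by construction; comparing it with the rates at larger `delta` shows how the blend degrades first-stage calibration.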

6. Financial-Risk Application

This section illustrates the precision-audit framework on 24 global financial assets and four forecasters under FRTB calibration parameters. The forecasters are not ranked as a model-selection exercise; they provide heterogeneous first-stage forecast sequences against which the precision audit can be evaluated.

6.1. Data and Forecasting Setup

The empirical analysis uses daily log returns for 24 global assets: equity indices, bonds, commodities, cryptocurrencies, and currencies. The sample period is 2002–2026, with exact start and end dates varying by asset; details are reported in Appendix C. Forecasts are computed at $\alpha \in \{1\%, 2.5\%, 5\%\}$, with $\alpha = 2.5\%$ as the primary ES analysis level. Additive Fissler–Ziegel recalibration is performed on rolling 250-day calibration windows ($n = 250$), advanced in monthly steps of 21 trading days.
The parametric benchmark is a GJR-GARCH(1,1) model with Student-t innovations [11], estimated by maximum likelihood on a rolling 1000-day training window, with VaR and ES computed analytically from the fitted conditional distribution. The remaining forecasters are time-series foundation models run zero-shot. TimesFM 2.5 [12] uses nine quantile heads with a Student-t fit by quantile matching. Chronos-Small [13] generates 1000 Monte Carlo forecast samples, from which empirical quantiles and conditional tail means yield VaR and ES directly. Moirai 2.0 [14,15] produces 1000 forecast samples with a Student-t fit to extract VaR and ES. Implementation details of the extraction procedures are reported in Appendix C.
Given base forecasts $(\hat V_t, \hat E_t)$, the second stage estimates an additive correction pair $(\hat q_n, \hat r_n)$ by minimising the Fissler–Ziegel FZ$_0$ loss [4,5]. The corrected forecasts are
$$\tilde V_t = \hat V_t + \hat q_n, \qquad \tilde E_t = \hat E_t + \hat r_n,$$
where $\hat q_n$ and $\hat r_n$ are held fixed within each calibration window. The loss function is
$$S(v, e, x) = \frac{\mathbf{1}\{x \le v\}\,(x - v)}{\alpha e} + \frac{v}{e} + \log(-e) - 1, \qquad e < 0.$$
The optimisation uses Nelder–Mead with three random restarts per window. Convergence failure is below 0.5% across all cells.
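The objective being minimised can be sketched in Python as follows, reading the loss in its usual FZ$_0$ form, $S(v,e,x) = \mathbf{1}\{x \le v\}(x - v)/(\alpha e) + v/e + \log(-e) - 1$ with $e < 0$. The function names are illustrative, and the optimiser itself (Nelder–Mead with restarts in the paper) is deliberately left out.

```python
import math

def fz0_loss(v, e, x, alpha):
    """FZ0 loss for a VaR forecast v, an ES forecast e < 0 and a return x."""
    if e >= 0:
        raise ValueError("the FZ0 loss requires a negative ES forecast")
    hit = 1.0 if x <= v else 0.0
    return hit * (x - v) / (alpha * e) + v / e + math.log(-e) - 1.0

def mean_fz0(q, r, base_var, base_es, returns, alpha):
    """Average FZ0 loss of the corrected forecasts (V_hat + q, E_hat + r)
    over one calibration window; this is the objective in (q, r)."""
    total = 0.0
    for v, e, x in zip(base_var, base_es, returns):
        total += fz0_loss(v + q, e + r, x, alpha)
    return total / len(returns)
```

Any derivative-free optimiser can then be applied to `mean_fz0` as a function of `(q, r)`, provided candidate values that push the corrected ES forecast to zero or above are rejected.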

6.2. Precision-Fragile ES Comparisons

At the FRTB ES level $\alpha = 2.5\%$, 29 of 144 pairwise comparisons are precision-fragile under the plug-in screen in Equation (9), corresponding to a share of 20.1% (Table 4). The share increases as the tail level becomes more extreme, consistent with the $(n\alpha)^{-1/2}$ precision limit.
The 20.1% headline should be interpreted as a baseline diagnostic, not as a universal constant. Table 5 reports the precision-fragile share under alternative noise scales and detrending choices. At $\alpha = 2.5\%$, the share ranges from 16.0% to 29.2%, with the paired block bootstrap at 22.9%.
Table 6 summarises the three main choices for estimating the precision scale at $\alpha = 2.5\%$. The qualitative conclusion is unchanged: a material fraction of ES comparisons is smaller than the precision floor supported by the available tail data.
The sample-size rule reveals substantial cross-asset heterogeneity. At $\alpha = 2.5\%$, the estimated tail-dispersion scale spans more than an order of magnitude: USD/JPY requires roughly 84 trading days for 50 bp precision, whereas BTC requires roughly 4,900 days. For the median asset, the 250-day calibration window delivers approximately 37 bp precision. Seven of 24 assets require more than 250 days to reach a 50 bp tolerance, so the sample-size rule is as much an asset-selection diagnostic as a window-length diagnostic.

6.3. Empirical Diagnostics

Figure 1 compares the detrended standard deviation of the ES correction $\hat r_n$ with the plug-in precision benchmark $\hat C/\sqrt{n\alpha}$ across all (asset, forecaster, $\alpha$) cells. The 45-degree line corresponds to $R = 1$. Points below the line indicate that the plug-in benchmark is conservative for that cell; points above the line indicate excess dispersion, which is examined through the VaR-first diagnostic below.
Figure A3 in Appendix I disaggregates the same comparison by forecaster. GJR-GARCH-t and Moirai 2.0 mostly lie below the 45-degree line, while Chronos-Small straddles it.
The cross-sectional scaling regression in Equation (10) tests whether empirical ES recalibration dispersion follows the predicted effective-tail-count rate. Table 7 reports the results across 288 cells. The estimated slope is $\hat b = -0.436$ with clustered standard error 0.039, close to the theoretical value $-1/2$. The tail-dispersion coefficient is $\hat\gamma = 0.870$ with standard error 0.036, below the plug-in prediction $\gamma = 1$.
The estimate $\hat\gamma < 1$ does not contradict the $(n\alpha)^{-1/2}$ rate, which concerns scaling in the effective tail count. Two mechanisms can reduce the estimated tail-dispersion coefficient: full-sample estimation of $\hat\sigma_{\mathrm{tail}}$ may average across volatility regimes, and rolling-window overlap may reduce the measured dispersion of $\hat r_n$.
Table 8 reports the benchmark ratio $R$ at $\alpha = 2.5\%$. GJR-GARCH-t, TimesFM 2.5, and Moirai 2.0 have median ratios below one, whereas Chronos-Small has median $R = 1.13$ and reaches $R = 2.69$ for Natural Gas. This pattern is consistent with excess ES recalibration dispersion being linked to first-stage VaR failure rather than to the precision benchmark itself.
The controlled simulation in Section 5 establishes the mechanism: as the first-stage volatility model moves from the oracle specification ($\delta = 0$) toward the unconditional-volatility specification ($\delta = 1$), VaR calibration degrades and the ES dispersion ratio rises from $R = 0.95$ to $R = 1.76$ (Table 3). This regime-dependent VaR error channel is independent of any single forecaster and provides the causal logic for the VaR-first diagnostic.
The empirical data corroborate this mechanism. Figure 2 plots the Christoffersen conditional-coverage statistic against $R$ at $\alpha = 1\%$. The pooled Spearman correlation is $\rho = 0.776$ ($p < 0.001$). Excluding Chronos-Small, the correlation remains positive and significant ($\rho = 0.513$, $p < 0.001$), whereas within Chronos-Small alone it is small and insignificant. All 12 cells with $R > 1$ belong to Chronos-Small and exhibit severe VaR miscalibration (Table A16 in Appendix I). The pooled association is strengthened by the separation of Chronos-Small from the remaining forecasters, and the ex-Chronos correlation becomes insignificant at $\alpha = 2.5\%$ and $5\%$ (Table 11), consistent with the reduced power of the diagnostic at less extreme tail levels.
Chronos-Small is therefore best interpreted as a stress case for the diagnostic, not as a competitive benchmark. Its excess dispersion is consistent with time-varying VaR misalignment; after separating these severely miscalibrated cells, ES recalibration dispersion is broadly comparable across the remaining forecasters.
Table 9 reports the headline results under forecaster subsets. Excluding Chronos-Small increases the precision-fragile share because the remaining forecasters produce ES corrections that are harder to distinguish. The precision-fragile share is not a model-free empirical constant; it depends on the dispersion of the forecasters included in the comparison set. The forecaster-agnostic object is the precision threshold, not the realised share.

6.4. Implications for Tail-Risk Practice

At $\alpha = 2.5\%$ with $n = 250$ and a typical $\hat C \approx 1.3\%$, the implied recalibration tolerance is
$$\varepsilon_n = \frac{\hat C}{\sqrt{n\alpha}} \approx 52\ \mathrm{bp}.$$
This tolerance is of the same order as many inter-model ES differences, so it provides a simple way to report the sampling uncertainty attached to ES recalibration and model comparison.
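Written out with the stated inputs, the arithmetic behind the tolerance is

$$\varepsilon_n = \frac{1.3\%}{\sqrt{250 \times 0.025}} = \frac{1.3\%}{\sqrt{6.25}} = \frac{1.3\%}{2.5} = 0.52\% = 52\ \mathrm{bp}.$$

If the random-count inflation factor from Section 3.3 ($f \approx 1.075$ at these parameters) were applied as well, the same inputs would give roughly $1.075 \times 52 \approx 56$ bp; this second figure is an illustration derived here, not a value reported in the tables.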
The precision audit can be implemented as a four-step workflow:
1. report the effective tail count $n\alpha$, not only the window length $n$;
2. estimate $\hat C = \hat\sigma_{\mathrm{tail}}$ from tail residuals;
3. compute the plug-in precision floor $\hat C/\sqrt{n\alpha}$;
4. flag ES comparisons whose absolute recalibration difference falls below the corresponding pairwise tolerance as precision-fragile.
A precision-fragile comparison does not imply that the two models are equivalent. It means that the observed ES-recalibration difference is smaller than the precision budget supported by the available tail data. Such comparisons should not be used as decisive evidence for model replacement without additional data, stronger structural assumptions, or complementary diagnostics. Appendix J summarises the benchmark, sample-size rule, and diagnostic screen as a workflow.

7. Robustness and Sensitivity

This section collects robustness checks on the main empirical findings.
Cutoff multiplier sensitivity. Replacing the baseline tolerance in Equation (9) by $\kappa\sqrt{(\hat C_1^2 + \hat C_2^2)/(n\alpha)}$ gives the shares in Table 10. Even at $\kappa = 0.5$, 16.0% of comparisons remain precision-fragile.
VaR-first diagnostic across tail levels. Table 11 reports the diagnostic at all three tail levels. The association between the conditional-coverage statistic and $R$ is strongest at $\alpha = 1\%$, supporting the choice of this level as the diagnostic anchor. At $\alpha = 2.5\%$ and $5\%$, the full-sample correlation remains significant, but the correlation excluding Chronos-Small becomes insignificant.
Table 11. VaR-first diagnostic robustness across tail levels. Entries report Spearman rank correlations between the Christoffersen conditional-coverage statistic and benchmark ratio $R$ at the diagnostic $\alpha$ indicated. *** denotes $p < 0.001$.
| Diagnostic $\alpha$ | $n$ pairs | Spearman $\rho$ (all) | Spearman $\rho$ (excl. Chronos) | Comment |
| --- | --- | --- | --- | --- |
| 1% | 76 | 0.776*** | 0.513*** | FRTB VaR gatekeeper |
| 2.5% | 87 | 0.590*** | 0.192 | Matched to FRTB ES level |
| 5.0% | 95 | 0.559*** | 0.146 | Scaling validation |
Window-length scaling. A non-overlapping window-length scaling test at $\alpha = 1\%$ provides a complementary check; details are reported in Appendix D. Three of four forecasters are consistent with the $(n\alpha)^{-1/2}$ rate. The pooled test rejects $H_0: b = -0.50$ at $p = 0.033$, driven by TimesFM 2.5; excluding TimesFM 2.5, all subsamples are consistent with the theoretical slope within their 95% confidence intervals.
HS-250 naive benchmark. As a robustness check, Appendix H reports results for a naive Historical Simulation benchmark (HS-250) that uses the empirical quantile and tail mean from the same 250-day window. HS-250 lies close to the plug-in precision floor ( R = 0.96 at α = 2.5 % ), and all of its comparisons with TimesFM-2.5 and Moirai-2.0 are precision-fragile, reinforcing the interpretation that the binding constraint is often the effective tail count rather than model complexity.

8. Limitations

Six limitations delimit the interpretation.
First, the empirical benchmark uses the CLT-motivated plug-in scale $\hat{C}$, not the Le Cam constant $c_L$. The latter depends on local density geometry and is not estimated from data. Hence values $R < 1$ indicate that the plug-in benchmark is conservative for that cell; they do not contradict the lower-bound argument.
Second, the non-overlapping window-scaling test is data-intensive. At n = 1000 , it requires roughly 5,000 trading days, leaving too few usable assets for Chronos-Small to estimate a meaningful model-specific slope.
Third, the theoretical benchmark is oracle-style: it is conditional on a fixed first-stage forecast sequence. The empirical rolling design approximates this condition because forecasts are predictable from past information, but they are not independent of the full calibration path. A strictly held-out design would isolate the oracle assumption more cleanly.
Fourth, absolute benchmark ratios depend on how low-frequency volatility-regime shifts are removed. The forecaster ranking is stable across the 126-, 252-, and 504-day filters reported in Appendix C, but the levels of R and the precision-fragile share vary with the detrending choice.
Fifth, the pooled window-scaling test rejects $H_0: b = -0.50$ at $p = 0.033$, driven by TimesFM 2.5 (Appendix D). The rejection disappears in restricted samples.
Sixth, the VaR-first diagnostic is partly a between-group result. The pooled association between the Christoffersen conditional-coverage statistic and R is strengthened by the separation of Chronos-Small from the remaining forecasters.
These limitations define the scope of the audit. The precision floor is a necessary condition for meaningful ES comparison, not a sufficient condition for model superiority. The proposed quantities should be read as precision diagnostics for ES model comparison under tail-data scarcity.

9. Conclusions

This paper studies how the effective tail sample size constrains pairwise comparison of Expected Shortfall forecasts. The relevant precision budget is $n\alpha$, not the nominal calibration window length $n$. At the FRTB tail level $\alpha = 2.5\%$ and $n = 250$, the effective tail count is only $n\alpha = 6.25$.
The paper converts the known $(n\alpha)^{-1/2}$ information limit into a precision-audit framework with four components: a plug-in precision benchmark, a finite-sample sample-size rule, a precision-fragile pairwise comparison screen, and a VaR-first diagnostic. The audit is diagnostic, not a replacement for formal model validation.
Across 24 global assets and four forecasters, roughly one in five pairwise ES comparisons is precision-fragile: the observed difference in recalibrated ES corrections is smaller than the precision floor supported by the available tail data. This share ranges from 16% to 29% across estimation variants and remains economically material across alternative noise scales, detrending methods, and bootstrap specifications.
The VaR-first diagnostic shows that excess ES recalibration dispersion is concentrated in cells with poor first-stage VaR calibration. VaR miscalibration is therefore a major channel for inflated ES dispersion, consistent with the simulation evidence in Section 5. A naive Historical Simulation benchmark lies close to the plug-in precision floor, reinforcing the interpretation that at standard tail levels the effective number of tail observations can be more binding than model complexity.
The precision-fragile share is not a universal constant. It depends on the asset universe, forecaster set, tail-dispersion estimate, and detrending convention. What is invariant is the information constraint: ES recalibration precision scales with n α , not with the nominal window length alone.
The analysis is an ex-post oracle-style precision audit, not a deployable real-time rule. Future work should develop fully held-out, real-time, dependence-aware versions of the audit that relax the fixed-first-stage assumption. Extensions to multivariate tail-risk settings and to formal decision-theoretic frameworks for the comparison screen are also left for future investigation.

Author Contributions

Conceptualization, D.T.P.; methodology, D.T.P.; software, D.T.P. and M.M.-M.-P.; validation, D.T.P. and M.M.-M.-P.; formal analysis, D.T.P.; investigation, D.T.P.; data curation, D.T.P. and M.M.-M.-P.; writing—original draft preparation, D.T.P.; writing—review and editing, D.T.P. and M.M.-M.-P.; visualization, D.T.P.; supervision, D.T.P.; project administration, D.T.P.; funding acquisition, D.T.P. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding from the Marie Skłodowska-Curie Actions under the European Union’s Horizon Europe research and innovation program for the Industrial Doctoral Network on Digital Finance, acronym DIGITAL, Project No. 101119635; the project “IDA Institute of Digital Assets”, CF166/15.11.2022, contract number CN760046/23.05.2023; the project “AI for Energy Finance (AI4EFin)”, CF162/15.11.2022, contract number CN760048/23.05.2023; the project “Accountable Governance and Responsible Innovation in Artificial Intelligence”, CF158/15.11.2022, contract number CN760047/23.05.2023, financed under Romania’s National Recovery and Resilience Plan, Apel nr. PNRR-III-C9-2022-I8. We acknowledge the support of the project “MA’AT — Autonomous Model for Textual Assistance”, SMIS Code 2021+: 330941, funding contract no. 390090/11.11.2025, project co-financed by the European Regional Development Fund through the Smart Growth, Digitalisation and Financial Instruments Programme 2021–2027 (POCIDIF).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Daily return data are sourced from Yahoo Finance for the assets listed in Appendix C. Replication code and intermediate data files are available on the QuantLet platform. Slides are available on the Quantinar platform.

Acknowledgments

During the preparation of this manuscript, the authors used generative AI tools for language editing, LaTeX assistance, and code-checking support. The authors reviewed and edited all generated content and take full responsibility for the final manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
ES Expected Shortfall
VaR Value-at-Risk
FRTB Fundamental Review of the Trading Book
FZ Fissler–Ziegel
CLT Central limit theorem
CC Christoffersen conditional coverage
HS Historical Simulation

Appendix A. Minimax Lower Bound and Proof

The distribution class used in the Le Cam construction is
$$\mathcal{P}(\alpha, w, m_0) := \Big\{ P : P \ll \lambda,\ f_P = \frac{dP}{d\lambda},\ \mathbb{E}_P[X] < \infty,\ f_P(x) \ge m_0 \ \ \forall x \in [\mathrm{VaR}_\alpha(P) - w,\ \mathrm{VaR}_\alpha(P)] \Big\},$$
where $w > 0$ is the tail window width and $m_0 > 0$ is a density lower bound on the tail window.
Theorem A1 (Minimax lower bound). There exists a constant $c_L > 0$, depending only on the tail-window width $w$ and density lower bound $m_0$, such that for any estimator $T_n$ of $\mathrm{ES}_\alpha(P)$ based on $n$ i.i.d. observations from $P \in \mathcal{P}(\alpha, w, m_0)$,
$$\inf_{T_n} \sup_{P \in \mathcal{P}(\alpha, w, m_0)} \mathbb{E}_P \big| T_n - \mathrm{ES}_\alpha(P) \big| \ \ge\ \frac{c_L}{\sqrt{n\alpha}}.$$
By Remark 1, the same bound applies to any additive recalibration estimator $\hat{r}_n$ under a fixed first-stage forecast.
To illustrate the magnitude of $c_L$, consider a Student-$t_5$ return distribution with $w = \sigma_{\mathrm{tail}}$ and $m_0 = f_{t_5}(\mathrm{VaR}_\alpha - w)$. Then $c_L$ evaluates to 0.0017 at $\alpha = 1\%$, 0.0024 at $\alpha = 2.5\%$, and 0.0031 at $\alpha = 5\%$. The plug-in constant $C = \sigma_{\mathrm{tail}}$ at the same calibration is 1.39, 1.16, and 1.03 respectively, giving $c_L / C \approx 0.1$–$0.3\%$. The Le Cam constant is conservative by construction: the two-point perturbation uses the smallest density on the tail window, while the CLT variance integrates the full tail distribution. The plug-in benchmark $\hat{C}/\sqrt{n\alpha}$ is therefore the operational object.
Lemma A1 (Antisymmetric two-point perturbation). Fix $\alpha \in (0, 1/2)$ and let $P_0$ have density $g_0$ satisfying $g_0(x) \ge 2 m_0$ on $[v_0 - w, v_0]$, where $v_0 := \mathrm{VaR}_\alpha(P_0)$. For $\delta > 0$, define
$$f_\pm(x) := g_0(x) \pm \frac{\delta}{w}\, h\!\left(\frac{x - v_0 + w}{w}\right),$$
where $h(u) = \sin(2\pi u)$ on $[0, 1]$ and $h = 0$ elsewhere. Let $\kappa := \inf_{u \in [0,1)} 4\pi(1-u)/[1 - \cos(2\pi u)] \approx 2.76$ (attained near $u \approx 0.63$). Then, provided $\delta \le \min(1, \kappa)\, m_0 w$:
(i) $f_\pm$ are valid densities with $f_\pm(x) \ge m_0$ on $[v_0 - w, v_0]$, so $P_\pm \in \mathcal{P}(\alpha, w, m_0)$;
(ii) $\mathrm{VaR}_\alpha(P_+) = \mathrm{VaR}_\alpha(P_-) = v_0$;
(iii) $|\mathrm{ES}_\alpha(P_+) - \mathrm{ES}_\alpha(P_-)| = w\delta/(\pi\alpha)$.
Proof. (i) Since $\int_0^1 h(u)\,du = 0$ and the perturbation is supported on $[v_0 - w, v_0]$, both $f_\pm$ integrate to 1. On the support, $f_\pm(x) \ge 2 m_0 - \delta/w \ge m_0$ when $\delta \le m_0 w$, so $P_\pm \in \mathcal{P}(\alpha, w, m_0)$.
(ii) The antisymmetry of $\sin(2\pi u)$ about $u = 1/2$ ensures that the perturbation adds zero net mass to $(-\infty, v_0]$: $\int_{v_0 - w}^{v_0} (\delta/w)\, h((x - v_0 + w)/w)\,dx = \delta \int_0^1 \sin(2\pi u)\,du = 0$. Hence $F_\pm(v_0) = \alpha$. It remains to show $F_\pm(x) < \alpha$ for all $x < v_0$. For $x \le v_0 - w$ the perturbation vanishes, so $F_\pm(x) = F_0(x) < \alpha$. For $x \in (v_0 - w, v_0)$, set $u := (x - v_0 + w)/w \in [0, 1)$ and compute the cumulative perturbation exactly:
$$\int_{v_0 - w}^{x} \frac{\delta}{w} \sin\!\left(\frac{2\pi (t - v_0 + w)}{w}\right) dt = \frac{\delta}{2\pi}\,\big[1 - \cos(2\pi u)\big].$$
The baseline CDF margin satisfies $F_0(v_0) - F_0(x) = \int_x^{v_0} g_0(t)\,dt \ge 2 m_0 (v_0 - x) = 2 m_0 w (1 - u)$. Hence $F_+(x) < \alpha$ holds whenever $\delta [1 - \cos(2\pi u)]/(2\pi) < 2 m_0 w (1 - u)$, i.e.
$$\delta < \frac{4\pi m_0 w (1 - u)}{1 - \cos(2\pi u)}.$$
The ratio $(1 - u)/[1 - \cos(2\pi u)]$ is bounded below by a positive constant on $[0, 1)$: it behaves like $1/(2\pi^2 u^2)$ as $u \to 0$ and like $1/(2\pi^2 (1 - u))$ as $u \to 1$ (the cosine perturbation vanishes faster than the linear baseline margin), so it tends to $+\infty$ at both ends and is continuous and positive in between. Let $\kappa := \inf_{u \in [0,1)} 4\pi(1 - u)/[1 - \cos(2\pi u)] > 0$. Then $\delta < \kappa\, m_0 w$ suffices. Since $\kappa \approx 2.76 > 1$, the constraint $\delta \le m_0 w$ from part (i) already guarantees this, so the lemma's condition $\delta \le \min(1, \kappa)\, m_0 w$ covers both parts. For $F_-$, the cumulative perturbation has the opposite sign, so $F_-(x) = F_0(x) - \delta [1 - \cos(2\pi u)]/(2\pi) < F_0(x) < \alpha$ on $(v_0 - w, v_0)$. Therefore $F_\pm(x) < \alpha$ for all $x < v_0$, and $\mathrm{VaR}_\alpha(P_\pm) = v_0$.
(iii) The ES difference is
$$\mathrm{ES}_\alpha(P_+) - \mathrm{ES}_\alpha(P_-) = \frac{1}{\alpha} \int_{v_0 - w}^{v_0} x\,\big[f_+(x) - f_-(x)\big]\,dx = \frac{2\delta}{\alpha w} \int_{v_0 - w}^{v_0} x \sin\!\big(2\pi (x - v_0 + w)/w\big)\,dx.$$
Substituting $u = (x - v_0 + w)/w$, the integral becomes
$$w \int_0^1 (v_0 - w + w u) \sin(2\pi u)\,du.$$
The constant term vanishes because $\int_0^1 \sin(2\pi u)\,du = 0$. Integration by parts gives $\int_0^1 u \sin(2\pi u)\,du = -1/(2\pi)$, so $|\mathrm{ES}_\alpha(P_+) - \mathrm{ES}_\alpha(P_-)| = w\delta/(\pi\alpha)$. □
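The two numerical facts used in Lemma A1 can be checked directly. The sketch below evaluates $\kappa$ on a fine grid and approximates $\int_0^1 u \sin(2\pi u)\,du$ with a hand-written trapezoid rule; the grid sizes and the values of $\delta$, $w$, and $\alpha$ are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# kappa = inf over u in [0, 1) of 4*pi*(1 - u) / (1 - cos(2*pi*u)).
# The endpoints are excluded because the ratio diverges there.
u = np.linspace(1e-6, 1.0 - 1e-6, 200_001)
ratio = 4.0 * np.pi * (1.0 - u) / (1.0 - np.cos(2.0 * np.pi * u))
kappa = float(ratio.min())
u_star = float(u[ratio.argmin()])

# Trapezoid approximation of the integral of u*sin(2*pi*u) over [0, 1].
g = np.linspace(0.0, 1.0, 200_001)
f = g * np.sin(2.0 * np.pi * g)
integral = float(np.sum((f[1:] + f[:-1]) * np.diff(g)) / 2.0)

# ES separation implied by part (iii), for hypothetical delta, w, alpha.
delta, w, alpha = 0.001, 1.0, 0.025
gap_numeric = 2.0 * delta * w * abs(integral) / alpha
gap_closed_form = w * delta / (np.pi * alpha)
```

The grid minimum should sit near $u \approx 0.63$, and the numerical gap should agree with the closed form $w\delta/(\pi\alpha)$ up to discretisation error.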
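Proof of Theorem A1. Apply Lemma A1 with perturbation amplitude
$$\delta = c \sqrt{\alpha / n}$$
for a constant $c > 0$ to be chosen (the constraint $\delta \le m_0 w$ is satisfied for $n$ large enough).
Step 1: ES separation. By Lemma A1(iii),
$$\Delta := |\mathrm{ES}_\alpha(P_+) - \mathrm{ES}_\alpha(P_-)| = \frac{w \delta}{\pi \alpha} = \frac{w c \sqrt{\alpha / n}}{\pi \alpha} = \frac{w c}{\pi \sqrt{n \alpha}}.$$
Step 2: $\chi^2$ divergence. Since $f_+$ and $f_-$ differ only on $[v_0 - w, v_0]$,
$$\chi^2(P_+ \,\|\, P_-) = \int_{v_0 - w}^{v_0} \frac{(f_+ - f_-)^2}{f_-}\,dx = \int_{v_0 - w}^{v_0} \frac{4 \delta^2 \sin^2(2\pi u)}{w^2 f_-(x)}\,dx.$$
Using $f_- \ge m_0$ on the support and $\int_0^1 \sin^2(2\pi u)\,du = 1/2$, this gives
$$\chi^2(P_+ \,\|\, P_-) \le \frac{4 \delta^2}{w^2} \cdot \frac{1}{m_0} \cdot w \cdot \frac{1}{2} = \frac{2 \delta^2}{m_0 w} = \frac{2 c^2 \alpha}{m_0 w n}.$$
Step 3: TV control. For product measures,
$$1 + \chi^2(P_+^n \,\|\, P_-^n) = \big(1 + \chi^2(P_+ \,\|\, P_-)\big)^n,$$
so $\chi^2(P_+^n \,\|\, P_-^n) \le \exp\!\big(n \chi^2(P_+ \,\|\, P_-)\big) - 1$, using $1 + x \le e^x$. By the total-variation/$\chi^2$ inequality,
$$\mathrm{TV}(P_+^n, P_-^n)^2 \le \chi^2(P_+^n \,\|\, P_-^n) \le \exp\!\big(2 c^2 \alpha / (m_0 w)\big) - 1.$$
For small $c$, this is approximately $2 c^2 \alpha / (m_0 w)$. To ensure $\mathrm{TV}(P_+^n, P_-^n) \le 1 - 1/\sqrt{2}$, it suffices to take $c$ small enough that $\exp\!\big(2 c^2 \alpha / (m_0 w)\big) - 1 \le (1 - 1/\sqrt{2})^2$; since $\alpha < 1/2$, this is satisfied when $c^2 \le (m_0 w / 2) \log\!\big(1 + (1 - 1/\sqrt{2})^2\big)$.
Step 4: Le Cam bound. By Le Cam's two-point lemma,
$$\inf_{T_n} \max_{j \in \{+, -\}} \mathbb{E}_{P_j} \big| T_n - \mathrm{ES}_\alpha(P_j) \big| \ \ge\ \frac{1}{2} \Delta \big(1 - \mathrm{TV}(P_+^n, P_-^n)\big) \ \ge\ \frac{\Delta}{2\sqrt{2}}.$$
Substituting the expression for $\Delta$ from Step 1 and taking $c$ at its maximum admissible value from Step 3,
$$\inf_{T_n} \sup_{P \in \mathcal{P}} \mathbb{E}_P \big| T_n - \mathrm{ES}_\alpha(P) \big| \ \ge\ \frac{w}{2\sqrt{2}\,\pi} \cdot \frac{\sqrt{(m_0 w / 2) \log\!\big(1 + (1 - 1/\sqrt{2})^2\big)}}{\sqrt{n \alpha}} =: \frac{c_L}{\sqrt{n \alpha}},$$
where $c_L = w \sqrt{(m_0 w / 2) \log\!\big(1 + (1 - 1/\sqrt{2})^2\big)} / (2\sqrt{2}\,\pi) > 0$ depends only on $w$ and $m_0$. □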
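The closed form for $c_L$ obtained in Step 4 is straightforward to evaluate. The sketch below implements it together with the resulting floor $c_L/\sqrt{n\alpha}$; the function names and the inputs $w = 1$, $m_0 = 0.5$ are hypothetical and are not the Student-$t_5$ calibration discussed after Theorem A1.

```python
import math


def le_cam_constant(w, m0):
    """c_L = w * sqrt((m0*w/2) * log(1 + (1 - 1/sqrt(2))^2)) / (2*sqrt(2)*pi)."""
    c_max = math.sqrt(0.5 * m0 * w * math.log(1.0 + (1.0 - 1.0 / math.sqrt(2.0)) ** 2))
    return w * c_max / (2.0 * math.sqrt(2.0) * math.pi)


def minimax_floor(w, m0, n, alpha):
    """Lower bound c_L / sqrt(n * alpha) on the minimax absolute ES error."""
    return le_cam_constant(w, m0) / math.sqrt(n * alpha)


# Hypothetical inputs, chosen only to exercise the formula.
c_l = le_cam_constant(w=1.0, m0=0.5)
floor_250 = minimax_floor(1.0, 0.5, n=250, alpha=0.025)
floor_1000 = minimax_floor(1.0, 0.5, n=1000, alpha=0.025)
```

Quadrupling the window from 250 to 1000 observations halves the floor, which is the $(n\alpha)^{-1/2}$ scaling in its simplest form.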

Appendix B. Finite-Sample Calibration Details

The random-count correction in Section 3.3 arises because the tail count $K_n := \sum_{i=1}^{n} \mathbf{1}\{X_i \le \mathrm{VaR}_\alpha\} \sim \mathrm{Bin}(n, \alpha)$ is random. Conditional on $K_n = k$, the tail-average variance is $\sigma_{\mathrm{tail}}^2 / k$; a delta-method expansion of $\mathbb{E}[1/K_n]$ using $\mathrm{Var}(K_n) = n\alpha(1-\alpha)$ gives the unconditional variance $\sigma_{\mathrm{tail}}^2 f(n, \alpha)^2 / (n\alpha)$, where $f(n, \alpha) = \sqrt{1 + (1-\alpha)/(n\alpha)}$.
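The random-count factor and the corrected benchmark can be written in a few lines. The sketch below assumes $f(n, \alpha) = \sqrt{1 + (1-\alpha)/(n\alpha)}$ and a benchmark of the form $\sigma_{\mathrm{tail}}\, f(n, \alpha)/\sqrt{n\alpha}$; the function names are hypothetical. The example input is the tail dispersion reported for TLT in Table A11 (0.45%, i.e. 45 bp).

```python
import math


def random_count_factor(n, alpha):
    """Delta-method inflation factor f(n, alpha) for a Binomial(n, alpha) tail count."""
    return math.sqrt(1.0 + (1.0 - alpha) / (n * alpha))


def corrected_benchmark(sigma_tail, n, alpha):
    """Plug-in benchmark sigma_tail * f(n, alpha) / sqrt(n * alpha), in the units of sigma_tail."""
    return sigma_tail * random_count_factor(n, alpha) / math.sqrt(n * alpha)


f_frtb = random_count_factor(250, 0.025)             # n * alpha = 6.25
tlt_bp = corrected_benchmark(45.0, 250, 0.025)       # sigma_tail given in basis points
```

With these inputs the factor is about 1.08 and the corrected benchmark rounds to 19 bp, in line with the $\varepsilon_{250}$ entry for TLT in Table A11.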
An Edgeworth correction is not used. At tail levels relevant in practice, the conditional tail distribution of heavy-tailed returns is highly skewed (conditional γ 1 4 for Student- t 5 at α 5 % ). At k 10 , the one-term Edgeworth approximation shifts the quantile inward and produces intervals narrower than the Gaussian approximation, contrary to Monte Carlo evidence. The random-count factor and direct Monte Carlo calibration are therefore preferred.
Under GARCH dynamics, the pattern in Table 1 separates finite-sample tail-count effects from volatility-regime effects. At short windows ($n = 250$), the unconditional $\sigma_{\mathrm{tail}}$ can exceed the within-regime tail dispersion, producing $\hat{f} < 1$. At longer windows, paths traverse multiple volatility regimes and the empirical SD exceeds the unconditional asymptotic prediction ($\hat{f} > 1.2$, flagged with †). This is the serial-dependence inflation that the detrending analysis in Section 3.2 is designed to remove.
An important consequence is that at short windows under realistic GARCH dynamics, the plug-in benchmark overstates the precision floor ($\hat{f} < 1$), so the precision-fragile screen uses a wider tolerance than the within-regime noise alone would justify. Its conclusions remain valid as an upper bound on the fraction of unreliable model rankings.
Table A1 quantifies the sampling variability of the precision floor itself. Because $B = \hat{C}/\sqrt{n\alpha}$ and $\hat{C}$ is a sample standard deviation of a small tail subsample, $B$ inherits non-negligible sampling variance. Under Student-$t_5$ at FRTB parameters ($n = 250$, $n\alpha = 6.25$), the coefficient of variation of $\hat{C}$ across 50,000 replications is 0.76, and the 90% simulation range of $B/\bar{B}$ spans $[0.25, 2.32]$. Doubling the window to $n = 500$ ($n\alpha = 12.5$) reduces the CV to 0.56, and at $n = 1000$ it falls to 0.43. GARCH(1,1)-$t_5$ dynamics produce similar or slightly larger CVs at each window length, consistent with serial-dependence effects. These results reinforce the conservative reading: at standard FRTB windows, the precision floor is itself a noisy estimate, and marginal precision-fragile classifications should be treated as tentative.
Table A1. Sampling variability of the precision floor $B = \hat{C}/\sqrt{n\alpha}$ at $\alpha = 2.5\%$. $\mathrm{CV}(\hat{C})$ is the coefficient of variation across 50,000 Monte Carlo replications. The $B/\bar{B}$ 90% range gives the 5th–95th percentile ratio of the realised floor to its mean.
| $n$ | $n\alpha$ | Student-$t_5$: $\mathrm{CV}(\hat{C})$ | Student-$t_5$: $B/\bar{B}$ 90% range | GARCH(1,1)-$t_5$: $\mathrm{CV}(\hat{C})$ | GARCH(1,1)-$t_5$: $B/\bar{B}$ 90% range |
|---|---|---|---|---|---|
| 250 | 6.25 | 0.76 | [0.25, 2.32] | 0.78 | [0.17, 2.43] |
| 500 | 12.50 | 0.56 | [0.40, 1.99] | 0.64 | [0.32, 2.13] |
| 750 | 18.75 | 0.49 | [0.48, 1.84] | 0.57 | [0.39, 2.00] |
| 1000 | 25.00 | 0.43 | [0.52, 1.76] | 0.52 | [0.44, 1.94] |
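The kind of Monte Carlo exercise behind Table A1 can be sketched as follows for the i.i.d. Student-$t_5$ case. This is an illustration under stated assumptions rather than the paper's code: the tail subsample is taken to be the $\lfloor n\alpha \rfloor$ smallest observations, $\hat{C}$ is their sample standard deviation, and the replication count and seed are arbitrary, so the output need not reproduce the tabulated values.

```python
import numpy as np


def tail_sd_cv(n, alpha=0.025, df=5, reps=2000, seed=0):
    """Coefficient of variation of the tail-subsample SD across simulated windows.

    Assumption: the tail subsample is the floor(n * alpha) smallest draws
    (at least 2, so that a sample SD exists).
    """
    rng = np.random.default_rng(seed)
    k = max(2, int(np.floor(n * alpha)))
    x = rng.standard_t(df, size=(reps, n))
    tail = np.sort(x, axis=1)[:, :k]        # k smallest returns per replication
    c_hat = tail.std(axis=1, ddof=1)        # one tail SD per replication
    return float(c_hat.std(ddof=1) / c_hat.mean())


cv_short = tail_sd_cv(250)    # n * alpha = 6.25
cv_long = tail_sd_cv(1000)    # n * alpha = 25
```

The qualitative pattern to look for is that the coefficient of variation falls as the window grows, because the tail subsample contains more observations.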

Appendix C. Data and Computational Details

Table A2 lists the 24 assets with tickers, asset class, sample period, and number of observations. All returns are daily log returns. Data source: Yahoo Finance.
Table A2. Asset universe.
Ticker Name Class Sample
SP500 S&P 500 Equity 2000-01 – 2026-03
STOXX Euro Stoxx 50 Equity 2004-04 – 2026-03
GDAXI DAX Equity 2000-01 – 2026-03
FCHI CAC 40 Equity 2000-01 – 2026-03
FTSE100 FTSE 100 Equity 2000-01 – 2026-03
NIKKEI Nikkei 225 Equity 2000-01 – 2026-03
HSI Hang Seng Equity 2000-01 – 2026-03
BOVESPA Bovespa Equity 2000-01 – 2026-03
NIFTY Nifty 50 Equity 2007-09 – 2026-03
ASX200 ASX 200 Equity 2000-01 – 2026-03
ICLN iShares Clean Energy Equity 2008-06 – 2026-03
TLT US 20Y+ Treasury Bond 2002-07 – 2026-03
IBGL Euro Gov Bond Bond 2008-01 – 2026-03
DJCI DJ Commodity Commodity 2009-10 – 2021-01
GOLD Gold Commodity 2000-08 – 2026-03
WTI WTI Crude Commodity 2000-08 – 2026-03
NATGAS Natural Gas Commodity 2000-08 – 2026-03
CBU0 Copper Commodity 2011-03 – 2026-03
BTC Bitcoin Crypto 2014-09 – 2026-03
ETH Ethereum Crypto 2017-11 – 2026-03
EURUSD EUR/USD FX 2003-12 – 2026-03
GBPUSD GBP/USD FX 2003-12 – 2026-03
USDJPY USD/JPY FX 2000-01 – 2026-03
AUDUSD AUD/USD FX 2006-05 – 2026-03
All returns are daily log returns $r_t = \log(P_t / P_{t-1})$, expressed in decimal form (not percentages). Each asset uses its own trading calendar; no cross-asset calendar synchronisation is imposed. Missing observations (holidays, halts) are dropped, and the rolling-window count $n$ reflects actual trading days.
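As a small illustration of this convention, the sketch below drops missing prices and then takes log differences, so that each return spans consecutive available trading days. The function name and the price series are hypothetical.

```python
import numpy as np


def log_returns(prices):
    """Daily log returns in decimal form, computed after dropping missing prices."""
    p = np.asarray(prices, dtype=float)
    p = p[~np.isnan(p)]              # holidays and halts are simply removed
    return np.diff(np.log(p))        # r_t = log(P_t / P_{t-1})


# Hypothetical price path with one missing observation.
r = log_returns([100.0, np.nan, 110.0, 99.0])
```

Here the missing day is skipped, so the first return is computed from 100 to 110 and the series has two returns.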
The forecasting setup, base forecasters, VaR/ES extraction, FZ recalibration specification, detrended SD measure, and plug-in benchmark are described in Sections 2–6. Additional implementation details follow.
TimesFM 2.5 quantile levels are $\tau \in \{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95\}$; the Student-t fit minimises the sum of squared quantile deviations with unconstrained degrees of freedom. For the non-overlapping window-length scaling test (Appendix D), each block of size $n$ is treated as an independent calibration sample; all four forecasters are included.
For each asset, forecaster, and tail level, base forecasts $(\hat{V}_t, \hat{E}_t)$ are extracted from the corresponding predictive distribution. GJR-GARCH-t yields VaR and ES analytically from the fitted conditional Student-t distribution. Chronos-Small computes both quantities nonparametrically from 1000 simulated forecast paths. TimesFM 2.5 and Moirai 2.0 require a tail-completion step: a Student-t distribution is fitted to the model output, by quantile matching for TimesFM 2.5 and by sample fitting for Moirai 2.0, and VaR and ES are then extracted from the fitted distribution. The Student-t tail-completion step is a forecaster-specific extraction device, not part of the proposed audit.
As a robustness check, the headline ratios are recomputed using moving-average windows of 126 and 504 days (Table A3). The qualitative ranking is preserved in every specification.
Table A3. Sensitivity of the median benchmark ratio $R$ at $\alpha = 2.5\%$ to the detrending moving-average window. The forecaster ranking is invariant across specifications; the absolute level shifts materially.
| Forecaster | MA-126 | MA-252 | MA-504 |
|---|---|---|---|
| GJR-GARCH-t | 0.40 | 0.53 | 0.76 |
| TimesFM-2.5 | 0.42 | 0.63 | 1.07 |
| Chronos-Small | 0.81 | 1.13 | 1.70 |
| Moirai-2.0 | 0.41 | 0.59 | 0.98 |
Table A4 reports the effect of estimating $\hat{C}$ on each rolling calibration window rather than once over the full evaluation sample.
Table A4. Median benchmark ratio $R$ under full-sample vs. rolling-window estimation of $\hat{C} = \hat{\sigma}_{\mathrm{tail}}$.
| Forecaster | $R_{\mathrm{full}}$ ($\alpha = 1\%$) | $R_{\mathrm{roll}}$ ($\alpha = 1\%$) | $R_{\mathrm{full}}$ ($\alpha = 2.5\%$) | $R_{\mathrm{roll}}$ ($\alpha = 2.5\%$) |
|---|---|---|---|---|
| GJR-GARCH-t | 0.63 | 0.83 | 0.57 | 1.06 |
| TimesFM 2.5 | 0.57 | 1.34 | 0.62 | 1.35 |
| Chronos-Small | 1.00 | 1.41 | 1.13 | 1.62 |
| Moirai 2.0 | 0.57 | 1.49 | 0.58 | 1.26 |
Table A5. Sensitivity of the tail-dispersion coefficient $\hat{\gamma}$ to the $\sigma_{\mathrm{tail}}$ estimation method. Cross-sectional regression (10) with SEs clustered by asset (24 clusters).
| $\sigma_{\mathrm{tail}}$ measure | $\hat{\gamma}$ | SE | $t$ ($\gamma = 1$) | $\hat{b}$ |
|---|---|---|---|---|
| Full-sample | 0.866 | 0.055 | −2.43 | 0.440 |
| Rolling (250-day) | 0.758 | 0.055 | −4.40 | 0.520 |
| Rolling + forecaster FE | 0.727 | 0.064 | −4.29 | 0.516 |
Table A6. TimesFM first-stage extraction noise diagnostic. Dependent variable: $\log \mathrm{SD}(\hat{r}_n)$ across 24 assets at $\alpha = 1\%$. Column (A) regresses on $\log \sigma_{\mathrm{tail}}$ only; column (B) adds $\log \overline{\nu^{-1}}$ (mean inverse fitted Student-t degrees of freedom) as a proxy for quantile-fit noise. HC1 standard errors in parentheses.
| | (A) | (B) |
|---|---|---|
| $\log \sigma_{\mathrm{tail}}$ | 0.915 (0.058) | 0.854 (0.094) |
| $\log \overline{\nu^{-1}}$ | | 0.162 (0.145) |
| $R^2$ | 0.820 | 0.828 |
| $N$ | 24 | 24 |

Appendix D. Window-Length Scaling Test

A non-overlapping window-length scaling test at $\alpha = 1\%$ varies the calibration-window length across $n \in \{250, 500, 750, 1000\}$ for all 24 assets and four forecasters, with step size equal to $n$ so that adjacent windows share zero observations. Per-forecaster median slopes are −0.55 (GJR-GARCH-t, 21 assets), −0.72 (TimesFM 2.5, 20 assets), and −0.56 (Moirai 2.0, 20 assets); Chronos-Small retains only 2 assets and is excluded from per-model inference. Across 63 retained (asset, forecaster) pairs, 63% of 95% CIs contain −0.50 (Table A9).
The pooled fixed-effects estimate is $\hat{b} = -0.585$ (SE $= 0.040$), with 95% CI $[-0.66, -0.51]$ (Table A7). Restricting the panel to GJR-GARCH-t and Moirai 2.0 gives $\hat{b} = -0.542$ ($p = 0.298$), which does not reject $-0.50$. Excluding TimesFM alone yields $\hat{b} = -0.532$ ($p = 0.530$). Table A8 reports all four specifications.
Table A7. Pooled fixed-effects rate test. $\log(\mathrm{SD}(\hat{r}_n)) = \alpha_i + \beta_j + b \log(n)$ across 4 forecasters, 24 assets and window lengths $n \in \{250, 500, 750, 1000\}$, non-overlapping windows only.
| Statistic | Value |
|---|---|
| Pooled slope $\hat{b}$ | −0.585 |
| Standard error | 0.040 |
| 95% CI | [−0.663, −0.507] |
| t-stat ($H_0: b = -0.50$) | −2.13 |
| p-value (two-sided) | 0.033 |
| $R^2$ | 0.811 |
| Forecasters | 4 |
| Assets | 24 |
| Observations | 280 |
Table A8. Restricted pooled FE rate tests. Same specification as Table A7, restricted to subsamples that isolate different sources of contamination.
| Sample | Cells | $\hat{b}$ | SE | 95% CI | $p$ ($b = -0.5$) |
|---|---|---|---|---|---|
| All forecasters | 280 | −0.585 | 0.040 | [−0.66, −0.51] | 0.033 |
| Excl. TimesFM | 198 | −0.532 | 0.052 | [−0.63, −0.43] | 0.530 |
| VaR-pass only | 35 | −0.780 | 0.131 | [−1.04, −0.52] | 0.032 |
| GJR + Moirai only | 167 | −0.542 | 0.041 | [−0.62, −0.46] | 0.298 |
Table A9. Non-overlapping window-length scaling: per-forecaster summary. Median OLS slope from $\log(\mathrm{SD}) = a + b \log(n)$ with HC1 robust standard errors. Chronos-Small retains only 2 assets due to short forecast histories.
| Forecaster | Assets | Median $\hat{b}$ | IQR | $-0.5 \in$ CI | $R^2$ (med) |
|---|---|---|---|---|---|
| GJR-GARCH-t | 21 | −0.55 | [−0.64, −0.45] | 67% | 0.911 |
| TimesFM-2.5 | 20 | −0.72 | [−0.85, −0.56] | 50% | 0.919 |
| Chronos-Small | 2 | −0.38 | [−0.39, −0.37] | 50% | 0.610 |
| Moirai-2.0 | 20 | −0.56 | [−0.68, −0.44] | 75% | 0.913 |
| Pooled | 63 | −0.57 | [−0.73, −0.45] | 63% | 0.918 |
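The per-asset slope is an ordinary least-squares fit of log dispersion on log window length. The sketch below applies this to the four detrended SD values listed for HSI in Table A14; it uses a plain `numpy.polyfit` line fit and does not compute the HC1 standard errors used in the paper. The variable names are hypothetical.

```python
import numpy as np

# Window lengths and the HSI detrended SDs as listed in Table A14.
n = np.array([250.0, 500.0, 750.0, 1000.0])
sd = np.array([0.0144, 0.0092, 0.0068, 0.0051])

# OLS fit of log(SD) = a + b * log(n); polyfit returns [slope, intercept].
slope, intercept = np.polyfit(np.log(n), np.log(sd), 1)

# Distance from the theoretical (n * alpha)^(-1/2) slope of -0.5.
gap_from_theory = slope - (-0.5)
```

The fitted slope is negative, since dispersion shrinks as the window grows, and its magnitude of roughly 0.73 matches the HSI entry in Table A14, which is steeper than the theoretical value of one half.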

Appendix E. CC Statistic Details and Fisher Exact Test

The Christoffersen (1998) conditional-coverage statistic decomposes as $\mathrm{CC} = \mathrm{UC} + \mathrm{IND}$, where $\mathrm{UC} = -2 \log\!\big[\alpha^{n_1} (1-\alpha)^{n_0} / \big(\hat{\pi}^{n_1} (1-\hat{\pi})^{n_0}\big)\big]$ is the Kupiec unconditional-coverage likelihood ratio and IND is a first-order Markov independence test; under the null, $\mathrm{CC} \sim \chi^2_2$. For Chronos-Small, hit rates of 37% (vs. the nominal 1%) produce $\mathrm{UC} > 7{,}000$, so the large CC values in Figure 2 are arithmetically expected, not anomalous.
The Fisher exact $2 \times 2$ test on the binary Kupiec-pass vs. $R > 1$ contingency table gives $p = 0.35$, underpowered because only 12 of 96 (asset, forecaster) pairs pass Kupiec at $\alpha = 1\%$. The Christoffersen CC statistic, being continuous, provides a finer-grained diagnostic than the binary pass/reject classification (Table A15).
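The unconditional-coverage component is simple to compute from the hit count. The sketch below implements the Kupiec likelihood ratio with the convention $0 \log 0 = 0$ and a $\chi^2_1$ p-value via the complementary error function; the function names and the sample of 4,000 forecasts with 1,480 hits (a 37% hit rate) are hypothetical and chosen only to show the order of magnitude.

```python
import math


def _xlogy(x, y):
    """x * log(y) with the convention that the term is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_uc(hits, n, alpha):
    """Kupiec unconditional-coverage likelihood ratio for `hits` violations in n forecasts."""
    n1, n0 = hits, n - hits
    pi_hat = n1 / n
    loglik_null = _xlogy(n1, alpha) + _xlogy(n0, 1.0 - alpha)
    loglik_alt = _xlogy(n1, pi_hat) + _xlogy(n0, 1.0 - pi_hat)
    return -2.0 * (loglik_null - loglik_alt)


def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-square(1) variable."""
    return math.erfc(math.sqrt(stat / 2.0))


uc_extreme = kupiec_uc(hits=1480, n=4000, alpha=0.01)   # 37% hit rate vs. nominal 1%
uc_exact = kupiec_uc(hits=10, n=1000, alpha=0.01)       # hit rate equal to nominal
```

At a 37% hit rate each observation contributes about 2.1 to the statistic, so values in the thousands arise with only a few thousand forecasts.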

Appendix F. Conceptual Overview

Figure A1 summarises the paper’s logical chain from the known ES rate to the operational diagnostics.
Figure A1. Logical chain of the paper: from known ES rate to operational diagnostics.

Appendix G. Detrending Illustration

Figure A2 shows the raw correction path r ^ n , its 252-day moving average, and the detrended residual for one representative asset (S&P 500, GJR-GARCH-t, α = 2.5 % ).
Figure A2. Raw ES correction path, 252-day moving average, and detrended correction for S&P 500 under GJR-GARCH-t at α = 2.5 % .

Appendix H. HS-250 Naive Benchmark

Table A10 adds a Historical Simulation benchmark (HS-250) as a diagnostic reference for the effective tail-count constraint, not as a production recommendation. For each forecast date, the HS-250 base VaR and ES are the empirical quantile and conditional tail mean from the trailing 250-day window of observed returns. The FZ recalibration correction is then estimated on the same rolling audit window as for the other forecasters, so the benchmark ratio R is computed on an identical footing.
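A minimal sketch of the HS-250 base forecast, assuming that VaR is the empirical $\alpha$-quantile of the trailing window (with NumPy's default linear interpolation) and that ES is the mean of the returns at or below that quantile. The function name and the synthetic return series are hypothetical; the paper's exact quantile convention is not stated here.

```python
import numpy as np


def hs_var_es(returns, alpha, window=250):
    """Historical Simulation VaR and ES from the trailing `window` returns.

    VaR is the empirical alpha-quantile (left tail of returns);
    ES is the average of the returns at or below VaR.
    """
    r = np.asarray(returns, dtype=float)
    if r.size < window:
        raise ValueError("not enough observations for the requested window")
    w = r[-window:]
    var = float(np.quantile(w, alpha))
    es = float(w[w <= var].mean())
    return var, es


# Synthetic, evenly spaced returns from -12.5% to +12.4%, for illustration only.
synthetic = (np.arange(250) - 125) / 1000.0
var_hs, es_hs = hs_var_es(synthetic, alpha=0.025)
```

With 250 observations and $\alpha = 2.5\%$, only the seven smallest returns enter the tail mean in this example, which makes the dependence on the effective tail count concrete.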
At α = 2.5 % , HS-250 achieves a median benchmark ratio R = 0.96 , close to the plug-in precision floor. Pairwise comparisons between HS-250 and the four model-based forecasters at α = 2.5 % yield 50 of 96 additional precision-fragile comparisons (52.1%).
Table A10. Forecaster comparison at $\alpha = 2.5\%$ including the HS-250 naive benchmark. $R$ is the median benchmark ratio; $\bar{r}_n$ is the mean recalibration shift; $\hat{\sigma}_{\mathrm{tail}}$ is the median empirical tail dispersion. HS-250 is included as a diagnostic reference, not as a model-selection recommendation.
| Forecaster | Median $R$ | Mean $\bar{r}_n$ | Median $\hat{\sigma}_{\mathrm{tail}}$ |
|---|---|---|---|
| GJR-GARCH-t | 0.53 | 0.0081 | 0.85% |
| TimesFM-2.5 | 0.63 | −0.0015 | 1.09% |
| Moirai-2.0 | 0.59 | −0.0045 | 1.10% |
| Chronos-Small | 1.13 | −0.0319 | 0.91% |
| HS-250 (naive) | 0.96 | −0.0040 | 1.25% |

Appendix I. Additional Tables and Figures

Figure A3. Detrended SD vs. plug-in benchmark by forecaster ($2 \times 2$ panels). Each panel shows one forecaster; the inset reports the median ratio $R$ across all assets and all three tail levels $\alpha \in \{1\%, 2.5\%, 5\%\}$.
Figure A4. Rolling-window ES correction $\hat{r}_n$ across asset classes at $\alpha = 1\%$ under GJR-GARCH-t ($n = 250$, step = 21 days), shown as a scaling illustration. Shaded bands are plug-in 95% reference intervals $\bar{r} \pm 1.96\, \hat{C}/\sqrt{n\alpha}$.
Figure A5. Required calibration window for 50 bp ES precision by asset at α = 2.5 % . The vertical line marks the 250-day default; assets to the right require longer windows.
Table A11. Asset-specific minimum calibration window $n$ at $\alpha = 2.5\%$. $\hat{\sigma}_{\mathrm{tail}}$ is the empirical tail dispersion from GJR-GARCH-t residuals; $\varepsilon_{250}$ is the plug-in benchmark at the 250-day window (with finite-sample correction); right-hand columns report the required $n$ for the stated tolerance $\varepsilon$ in basis points of return. Assets sorted by $\hat{\sigma}_{\mathrm{tail}}$.
Asset  $\hat{\sigma}_{\mathrm{tail}}$ (%)  $\varepsilon_{250}$ (bp)  $n$ ($\varepsilon$ = 25 bp)  $n$ ($\varepsilon$ = 50 bp)  $n$ ($\varepsilon$ = 100 bp)  $n$ ($\varepsilon$ = 200 bp)
TLT 0.45 19 161 56 23 9
CBU0 0.55 24 228 74 29 12
USDJPY 0.60 26 262 84 32 13
ASX200 0.60 26 266 85 33 13
DJCI 0.63 27 287 91 34 14
FTSE100 0.65 28 306 96 36 15
SP500 0.70 30 348 107 39 16
AUDUSD 0.72 31 367 112 41 17
STOXX 0.75 32 392 119 43 18
GDAXI 0.81 35 453 135 48 20
EURUSD 0.81 35 455 135 48 20
GBPUSD 0.82 35 465 138 49 20
NIFTY 0.88 38 527 154 54 22
FCHI 0.91 39 561 163 56 23
GOLD 1.03 44 717 203 68 27
NIKKEI 1.16 50 893 248 80 31
BOVESPA 1.16 50 898 249 80 31
IBGL 1.28 55 1,091 298 94 35
ICLN 1.33 57 1,175 319 100 37
HSI 1.59 68 1,657 441 132 47
NATGAS 2.31 99 3,463 894 248 80
WTI 2.60 112 4,371 1,121 306 96
ETH 3.45 148 7,665 1,945 513 151
BTC 5.54 238 19,656 4,943 1,264 342
Median 0.85 36 496 146 51 21
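One way to obtain window lengths of this kind is to invert the corrected benchmark. Assuming the benchmark is $\sigma_{\mathrm{tail}}\sqrt{1 + (1-\alpha)/(n\alpha)}/\sqrt{n\alpha}$, setting it equal to $\varepsilon$ gives a quadratic in $m = n\alpha$, namely $m^2 - \rho m - \rho(1-\alpha) = 0$ with $\rho = (\sigma_{\mathrm{tail}}/\varepsilon)^2$. The sketch below solves it and rounds up; the function name is hypothetical, and because the tabulated $\hat{\sigma}_{\mathrm{tail}}$ values are rounded, results can differ from Table A11 by an observation or so.

```python
import math


def required_window(sigma_tail_bp, eps_bp, alpha):
    """Smallest n with sigma_tail * f(n, alpha) / sqrt(n * alpha) <= eps.

    sigma_tail_bp and eps_bp must be in the same units (basis points here).
    """
    rho = (sigma_tail_bp / eps_bp) ** 2
    # Positive root of m^2 - rho*m - rho*(1 - alpha) = 0, where m = n * alpha.
    m = 0.5 * (rho + math.sqrt(rho ** 2 + 4.0 * rho * (1.0 - alpha)))
    return math.ceil(m / alpha)


# TLT row of Table A11: tail dispersion 0.45% = 45 bp, alpha = 2.5%.
n_25bp = required_window(45.0, 25.0, 0.025)
n_50bp = required_window(45.0, 50.0, 0.025)
```

For TLT these inputs give roughly 161 and 56 trading days, close to the tabulated requirements, and halving the tolerance roughly quadruples the required window once $n\alpha$ is large.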
Table A12. Asset-specific minimum calibration window $n$ at $\alpha = 1\%$. $\hat{\sigma}_{\mathrm{tail}}$ is the empirical tail dispersion from GJR-GARCH-t residuals; $\varepsilon_{250}$ is the plug-in benchmark at the 250-day window; right-hand columns report the required $n$ for the stated tolerance $\varepsilon$ in basis points of return. Assets sorted by $\hat{\sigma}_{\mathrm{tail}}$. For standardised-unit requirements see Table 2.
Asset  $\hat{\sigma}_{\mathrm{tail}}$ (%)  $\varepsilon_{250}$ (bp)  $n$ ($\varepsilon$ = 25 bp)  $n$ ($\varepsilon$ = 50 bp)  $n$ ($\varepsilon$ = 100 bp)  $n$ ($\varepsilon$ = 200 bp)
USDJPY 0.54 34 473 119 30 8
ASX200 0.56 35 495 124 31 8
FTSE100 0.58 36 530 133 34 9
TLT 0.59 37 561 141 36 9
DJCI 0.68 43 739 185 47 12
CBU0 0.72 46 840 210 53 14
SP500 0.74 47 882 221 56 14
STOXX 0.79 50 998 250 63 16
GDAXI 0.83 52 1,094 274 69 18
AUDUSD 0.94 60 1,423 356 89 23
EURUSD 0.98 62 1,544 386 97 25
FCHI 0.98 62 1,549 388 97 25
GOLD 1.04 66 1,726 432 108 27
NIKKEI 1.06 67 1,802 451 113 29
NIFTY 1.12 71 1,992 498 125 32
BOVESPA 1.22 77 2,366 592 148 37
GBPUSD 1.22 77 2,379 595 149 38
ICLN 1.37 87 3,014 754 189 48
IBGL 1.76 112 4,979 1,245 312 78
NATGAS 1.78 112 5,061 1,266 317 80
HSI 2.29 145 8,414 2,104 526 132
ETH 3.00 190 14,404 3,601 901 226
WTI 3.04 192 14,746 3,687 922 231
BTC 7.20 456 83,038 20,760 5,190 1,298
Median 1.01 64 1,636 409 103 26
Table A13. Asset-level benchmark ratio $R$ at $\alpha = 1\%$ under GJR-GARCH-t. Assets are sorted by $R$.

| Asset | Detr. SD | Bound | Ratio $R$ | Kupiec $p$ |
| --- | --- | --- | --- | --- |
| IBGL | 0.0029 | 0.0112 | 0.26 | 0.000 |
| GBPUSD | 0.0023 | 0.0077 | 0.29 | 0.000 |
| AUDUSD | 0.0020 | 0.0060 | 0.33 | 0.000 |
| NIFTY | 0.0028 | 0.0071 | 0.39 | 0.000 |
| WTI | 0.0075 | 0.0192 | 0.39 | 0.000 |
| HSI | 0.0057 | 0.0145 | 0.39 | 0.000 |
| BTC | 0.0192 | 0.0456 | 0.42 | 0.003 |
| GOLD | 0.0029 | 0.0066 | 0.44 | 0.000 |
| EURUSD | 0.0028 | 0.0062 | 0.46 | 0.000 |
| BOVESPA | 0.0037 | 0.0077 | 0.48 | 0.000 |
| ASX200 | 0.0017 | 0.0035 | 0.48 | 0.000 |
| NIKKEI | 0.0034 | 0.0067 | 0.50 | 0.000 |
| STOXX | 0.0025 | 0.0050 | 0.51 | 0.000 |
| TLT | 0.0020 | 0.0037 | 0.53 | 0.000 |
| FCHI | 0.0033 | 0.0062 | 0.54 | 0.000 |
| GDAXI | 0.0028 | 0.0052 | 0.54 | 0.000 |
| SP500 | 0.0028 | 0.0047 | 0.59 | 0.000 |
| FTSE100 | 0.0024 | 0.0036 | 0.67 | 0.000 |
| USDJPY | 0.0023 | 0.0034 | 0.67 | 0.000 |
| ICLN | 0.0059 | 0.0087 | 0.68 | 0.000 |
| DJCI | 0.0030 | 0.0043 | 0.71 | 0.000 |
| CBU0 | 0.0034 | 0.0046 | 0.74 | 0.143 |
| ETH | 0.0158 | 0.0190 | 0.83 | 0.006 |
| NATGAS | 0.0101 | 0.0112 | 0.90 | 0.000 |
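The Kupiec $p$ column refers to the unconditional-coverage test of [18]. As a minimal sketch, assuming the standard proportion-of-failures likelihood-ratio form with a $\chi^2(1)$ reference distribution (the authors' exact implementation is not shown in this section), the statistic and its p-value can be computed as follows. The function name `kupiec_pof` and its arguments are hypothetical.

```python
import math


def kupiec_pof(hits: int, n_obs: int, alpha: float) -> tuple[float, float]:
    """Kupiec proportion-of-failures LR statistic and chi-square(1) p-value.

    hits  : number of VaR exceedances observed
    n_obs : number of forecast days
    alpha : nominal tail level (e.g. 0.01)
    """

    def loglik(p: float) -> float:
        # Binomial log-likelihood up to a constant; 0 * log(0) is treated as 0.
        ll = 0.0
        if hits > 0:
            ll += hits * math.log(p)
        if n_obs - hits > 0:
            ll += (n_obs - hits) * math.log(1.0 - p)
        return ll

    p_hat = hits / n_obs
    lr = max(0.0, -2.0 * (loglik(alpha) - loglik(p_hat)))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# A perfectly calibrated hit rate gives LR = 0; a hit rate far above alpha
# gives a large LR and a p-value near zero.
lr_ok, p_ok = kupiec_pof(hits=25, n_obs=1000, alpha=0.025)
lr_bad, p_bad = kupiec_pof(hits=177, n_obs=1000, alpha=0.025)
```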
Table A14. Non-overlapping window-length scaling for 21 assets (GJR-GARCH-t, $\alpha = 1\%$, step $= n$). OLS slope $\hat{b}$ from $\log(\mathrm{SD}) = a + b \log(n)$ with HC1 robust standard errors. Assets are sorted by slope. †: dropped for fewer than 3 valid window lengths.

| Asset | Windows | $n = 250$ | $n = 500$ | $n = 750$ | $n = 1000$ | $\hat{b}$ | 95% CI | $R^2$ | $-0.5 \in$ CI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NIFTY | 30 | 0.0067 | 0.0043 | 0.0028 | | $-0.78$ | $[-0.92, -0.63]$ | 0.982 | N |
| AUDUSD | 34 | 0.0049 | 0.0032 | 0.0021 | | $-0.77$ | $[-0.92, -0.61]$ | 0.978 | N |
| HSI | 50 | 0.0144 | 0.0092 | 0.0068 | 0.0051 | $-0.73$ | $[-0.81, -0.65]$ | 0.993 | N |
| FCHI | 51 | 0.0089 | 0.0073 | 0.0043 | 0.0037 | $-0.66$ | $[-0.84, -0.49]$ | 0.911 | Y |
| GOLD | 50 | 0.0061 | 0.0045 | 0.0022 | 0.0030 | $-0.64$ | $[-1.02, -0.27]$ | 0.741 | Y |
| EURUSD | 45 | 0.0074 | 0.0045 | 0.0048 | 0.0026 | $-0.64$ | $[-0.95, -0.32]$ | 0.824 | Y |
| GBPUSD | 45 | 0.0068 | 0.0044 | 0.0029 | 0.0032 | $-0.62$ | $[-0.84, -0.39]$ | 0.906 | Y |
| STOXX | 43 | 0.0050 | 0.0039 | 0.0039 | 0.0019 | $-0.58$ | $[-1.08, -0.08]$ | 0.679 | Y |
| USDJPY | 53 | 0.0040 | 0.0034 | 0.0025 | 0.0017 | $-0.56$ | $[-0.83, -0.29]$ | 0.875 | Y |
| IBGL | 30 | 0.0054 | 0.0042 | 0.0029 | | $-0.56$ | $[-0.78, -0.34]$ | 0.926 | Y |
| FTSE100 | 51 | 0.0058 | 0.0044 | 0.0030 | 0.0028 | $-0.55$ | $[-0.65, -0.46]$ | 0.952 | Y |
| NATGAS | 50 | 0.0226 | 0.0135 | 0.0112 | 0.0110 | $-0.54$ | $[-0.72, -0.37]$ | 0.940 | Y |
| TLT | 45 | 0.0046 | 0.0023 | 0.0029 | 0.0020 | $-0.53$ | $[-0.78, -0.27]$ | 0.712 | Y |
| WTI | 50 | 0.0166 | 0.0104 | 0.0098 | 0.0079 | $-0.51$ | $[-0.59, -0.43]$ | 0.960 | Y |
| ASX200 | 51 | 0.0051 | 0.0042 | 0.0030 | 0.0027 | $-0.46$ | $[-0.55, -0.38]$ | 0.954 | Y |
| ICLN | 29 | 0.0122 | 0.0105 | 0.0072 | | $-0.45$ | $[-0.70, -0.20]$ | 0.861 | Y |
| BOVESPA | 50 | 0.0088 | 0.0079 | 0.0069 | 0.0050 | $-0.36$ | $[-0.60, -0.13]$ | 0.804 | Y |
| BTC | 27 | 0.0429 | 0.0353 | 0.0287 | | $-0.36$ | $[-0.44, -0.27]$ | 0.973 | N |
| NIKKEI | 50 | 0.0089 | 0.0077 | 0.0054 | 0.0063 | $-0.32$ | $[-0.50, -0.14]$ | 0.753 | N |
| SP500 | 51 | 0.0073 | 0.0064 | 0.0060 | 0.0048 | $-0.27$ | $[-0.41, -0.14]$ | 0.879 | N |
| GDAXI | 51 | 0.0077 | 0.0065 | 0.0064 | 0.0056 | $-0.21$ | $[-0.26, -0.16]$ | 0.943 | N |
| CBU0† | | | | | | | | | |
| DJCI† | | | | | | | | | |
| ETH† | | | | | | | | | |
| Median | | | | | | $-0.55$ | IQR $[-0.64, -0.45]$ | | 67% |
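The slope column of Table A14 comes from an OLS fit of $\log(\mathrm{SD})$ on $\log(n)$. The sketch below shows only the point estimate, in plain Python, using the three SD values reported for NIFTY. Pairing those values with $n = 250, 500, 750$ is an assumption inferred from the table layout (it reproduces the reported slope magnitude of 0.78 to two decimals); the HC1 robust standard errors are not reproduced, and the function name is hypothetical.

```python
import math


def loglog_slope(window_lengths, sds):
    """OLS slope b in log(SD) = a + b * log(n)."""
    xs = [math.log(n) for n in window_lengths]
    ys = [math.log(s) for s in sds]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# NIFTY row: SD falls as n grows, so the fitted slope is negative.
nifty_slope = loglog_slope([250, 500, 750], [0.0067, 0.0043, 0.0028])
```

A slope near $-1/2$ is what the $(n\alpha)^{-1/2}$ rate predicts at fixed $\alpha$; steeper slopes indicate that SD shrinks faster with $n$ than that rate implies.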
Figure A6. Distribution of OLS slopes for overlapping (grey, step $= 21$) and non-overlapping (red, step $= n$) windows. Eliminating overlap shifts the median from $-0.73$ to $-0.55$, moving it toward the theoretical $-1/2$ (solid vertical line).
Figure A7. Log–log scaling of recalibration SD versus window length using non-overlapping windows. Dashed grey: theoretical $b = -0.50$; solid colour: OLS fit.
Figure A8. Distribution of non-overlapping OLS slopes across all four forecasters. Pooled median $= -0.57$ (63 asset-forecaster pairs); vertical line: theoretical $-1/2$.
Figure A9. Non-overlapping window scaling by forecaster. Each grey line is one asset; dashed: theoretical $b = -0.50$.
Table A15. Directional VaR-first contingency check at $\alpha = 1\%$. A cell is classified as “excess” if $R > 1$. The Spearman rank correlation between the CC statistic and $R$ supersedes the binary Fisher test; see Figure 2.

| | Ratio $\le 1$ | Ratio $> 1$ |
| --- | --- | --- |
| VaR pass (Kupiec $p > 0.05$) | 12 | 0 |
| VaR reject (Kupiec $p \le 0.05$) | 72 | 12 |

Spearman rank correlation: $\rho = 0.776$, $p < 0.001$.
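The binary Fisher test mentioned in the caption of Table A15 can be checked directly from the four cell counts. The sketch below implements a two-sided Fisher exact test in plain Python, assuming the usual convention of summing all tables whose probability does not exceed that of the observed one; the function name is hypothetical. Applied to the counts above, it gives a p-value of about 0.35, in line with the value quoted for Figure A10.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability that the top-left cell equals x,
        # holding all row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: VaR pass / VaR reject; columns: ratio <= 1 / ratio > 1.
p_fisher = fisher_exact_two_sided(12, 0, 72, 12)
```

With only 12 VaR-pass cells, even a complete absence of excess-dispersion cells among them is not statistically unusual, which is why the rank correlation is the more informative diagnostic here.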
Table A16. Excess-dispersion case audit: all (asset, forecaster) cells with $R > 1$ at $\alpha = 1\%$. CC stat is the Christoffersen conditional-coverage statistic; $\hat{\sigma}_{\text{tail}}$ is the tail dispersion scale. All 12 excess cells belong to Chronos-Small.

| Asset | Forecaster | $R$ | Kupiec $p$ | CC stat | $\hat{\sigma}_{\text{tail}}$ | Cause |
| --- | --- | --- | --- | --- | --- | --- |
| EURUSD | Chronos-Small | 3.02 | <0.001 | 11636.6 | 0.0057 | Severe VaR miscalibration |
| USDJPY | Chronos-Small | 2.86 | <0.001 | 14189.2 | 0.0059 | Severe VaR miscalibration |
| FTSE100 | Chronos-Small | 2.24 | <0.001 | 13251.3 | 0.0087 | Severe VaR miscalibration |
| ICLN | Chronos-Small | 1.68 | <0.001 | 8581.9 | 0.0123 | Severe VaR miscalibration |
| ASX200 | Chronos-Small | 1.61 | <0.001 | 14112.9 | 0.0078 | Severe VaR miscalibration |
| NIKKEI | Chronos-Small | 1.48 | <0.001 | 12923.6 | 0.0111 | Severe VaR miscalibration |
| SP500 | Chronos-Small | 1.32 | <0.001 | 12789.4 | 0.0099 | Severe VaR miscalibration |
| BTC | Chronos-Small | 1.30 | <0.001 | 7186.8 | 0.0297 | Severe VaR miscalibration |
| GBPUSD | Chronos-Small | 1.30 | <0.001 | 11699.6 | 0.0044 | Severe VaR miscalibration |
| HSI | Chronos-Small | 1.08 | <0.001 | 12690.9 | 0.0106 | Severe VaR miscalibration |
| NIFTY | Chronos-Small | 1.08 | <0.001 | 9146.5 | 0.0080 | Severe VaR miscalibration |
| IBGL | Chronos-Small | 1.02 | <0.001 | 8993.9 | 0.0061 | Severe VaR miscalibration |
Figure A10. Kupiec $p$-value vs. benchmark ratio $R$ at $\alpha = 1\%$ (binary variant of Figure 2). Each point is one (asset, forecaster) pair. The Fisher exact test gives $p = 0.35$ because of the small number of VaR-pass cells.
Table A17. Overlapping window-length scaling (step $= 21$). Same design as Table A14 but with overlapping windows. The steeper median slope ($-0.73$ vs. $-0.55$) is attributable to rolling-window overlap.

| Asset | $n = 250$ | $n = 500$ | $n = 1000$ | $n = 2000$ | $\hat{b}$ | 95% CI | $R^2$ | $-0.5 \in$ CI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DJCI | 0.0065 | 0.0054 | 0.0041 | 0.0002 | $-1.31$ | $[-2.38, -0.23]$ | 0.586 | Y |
| NIFTY | 0.0055 | 0.0032 | 0.0016 | 0.0005 | $-1.18$ | $[-1.40, -0.96]$ | 0.964 | N |
| AUDUSD | 0.0045 | 0.0028 | 0.0017 | 0.0005 | $-1.02$ | $[-1.25, -0.78]$ | 0.946 | N |
| ETH | 0.0266 | 0.0215 | 0.0200 | 0.0016 | $-1.01$ | $[-1.95, -0.06]$ | 0.523 | Y |
| HSI | 0.0133 | 0.0076 | 0.0039 | 0.0019 | $-0.97$ | $[-1.07, -0.87]$ | 0.989 | N |
| STOXX | 0.0057 | 0.0039 | 0.0024 | 0.0008 | $-0.93$ | $[-1.19, -0.66]$ | 0.921 | N |
| GDAXI | 0.0075 | 0.0054 | 0.0032 | 0.0012 | $-0.89$ | $[-1.11, -0.67]$ | 0.941 | N |
| ICLN | 0.0127 | 0.0081 | 0.0041 | 0.0021 | $-0.86$ | $[-0.96, -0.76]$ | 0.986 | N |
| NIKKEI | 0.0079 | 0.0058 | 0.0038 | 0.0011 | $-0.85$ | $[-1.20, -0.51]$ | 0.853 | N |
| BOVESPA | 0.0091 | 0.0063 | 0.0037 | 0.0019 | $-0.77$ | $[-0.87, -0.66]$ | 0.981 | N |
| USDJPY | 0.0046 | 0.0031 | 0.0021 | 0.0009 | $-0.74$ | $[-0.91, -0.57]$ | 0.948 | N |
| GOLD | 0.0058 | 0.0029 | 0.0018 | 0.0012 | $-0.73$ | $[-0.84, -0.61]$ | 0.975 | N |
| NATGAS | 0.0201 | 0.0128 | 0.0087 | 0.0046 | $-0.71$ | $[-0.79, -0.63]$ | 0.987 | N |
| FCHI | 0.0083 | 0.0065 | 0.0040 | 0.0024 | $-0.64$ | $[-0.76, -0.52]$ | 0.966 | N |
| EURUSD | 0.0060 | 0.0044 | 0.0027 | 0.0017 | $-0.63$ | $[-0.70, -0.57]$ | 0.990 | N |
| GBPUSD | 0.0057 | 0.0040 | 0.0025 | 0.0017 | $-0.61$ | $[-0.64, -0.57]$ | 0.997 | N |
| TLT | 0.0046 | 0.0033 | 0.0023 | 0.0013 | $-0.60$ | $[-0.70, -0.50]$ | 0.971 | Y |
| IBGL | 0.0056 | 0.0036 | 0.0023 | 0.0017 | $-0.57$ | $[-0.62, -0.52]$ | 0.991 | N |
| ASX200 | 0.0051 | 0.0043 | 0.0030 | 0.0016 | $-0.55$ | $[-0.69, -0.41]$ | 0.938 | Y |
| SP500 | 0.0072 | 0.0056 | 0.0042 | 0.0023 | $-0.54$ | $[-0.67, -0.42]$ | 0.947 | Y |
| FTSE100 | 0.0057 | 0.0038 | 0.0027 | 0.0020 | $-0.50$ | $[-0.54, -0.46]$ | 0.994 | Y |
| BTC | 0.0428 | 0.0325 | 0.0277 | 0.0186 | $-0.37$ | $[-0.44, -0.29]$ | 0.959 | N |
| WTI | 0.0143 | 0.0101 | 0.0086 | 0.0066 | $-0.34$ | $[-0.40, -0.28]$ | 0.969 | N |
| Median | | | | | $-0.73$ | IQR $[-0.91, -0.58]$ | | 26% |
Figure A11. Overlapping-window slope distribution (step $= 21$). Median $-0.73$; only 26% of CIs contain $-0.50$. Compare Figure A6.
Figure A12. Overlapping-window log–log scaling for S&P 500, BTC, and Natural Gas (step $= 21$). All three slopes are steeper than in the non-overlapping design (Figure A7).

Appendix J. ES Recalibration Precision Audit


References

  1. Zwingmann, T.; Holzmann, H. Asymptotics for the expected shortfall. arXiv 2016, arXiv:1611.07222.
  2. Bartl, D.; Eckstein, S. Optimal nonparametric estimation of the expected shortfall risk. arXiv 2024, arXiv:2405.00357.
  3. Chen, S.X. Nonparametric estimation of expected shortfall. J. Financ. Econom. 2008, 6, 87–107.
  4. Fissler, T.; Ziegel, J.F. Higher order elicitability and Osband’s principle. Ann. Stat. 2016, 44, 1680–1707.
  5. Dimitriadis, T.; Bayer, S. A joint quantile and expected shortfall regression framework. Electron. J. Stat. 2019, 13, 1823–1871.
  6. Patton, A.J.; Ziegel, J.F.; Chen, R. Dynamic semiparametric models for expected shortfall (and Value-at-Risk). J. Econom. 2019, 211, 388–413.
  7. Pele, D.T.; Bolovaneanu, V.; Ginavar, A.T.; Lessmann, S.; Härdle, W.K. Recalibrating tail risk forecasts under temporal dependence, 2026. Available online: https://ssrn.com/abstract=6757685.
  8. Acerbi, C.; Székely, B. Back-testing expected shortfall. Risk 2014, 27, 76–81.
  9. Nolde, N.; Ziegel, J.F. Elicitability and backtesting: Perspectives for banking regulation. Ann. Appl. Stat. 2017, 11, 1833–1874.
  10. Basel Committee on Banking Supervision. Minimum capital requirements for market risk. Technical Report d457; Bank for International Settlements, January 2019.
  11. Glosten, L.R.; Jagannathan, R.; Runkle, D.E. On the relation between the expected value and the volatility of the nominal excess return on stocks. J. Financ. 1993, 48, 1779–1801.
  12. Das, A.; Kong, W.; Sen, R.; Zhou, Y. A decoder-only foundation model for time-series forecasting. Proc. 41st Int. Conf. Mach. Learn. (ICML) 2024, Vol. 235, PMLR, 10148–10167.
  13. Ansari, A.F.; Stella, L.; Turkmen, C.; Zhang, X.; Mercado, P.; Shen, H.; Shchur, O.; Rangapuram, S.S.; Pineda Arango, S.; Kapoor, S.; et al. Chronos: Learning the language of time series. Trans. Mach. Learn. Res. 2024, arXiv:2403.07815.
  14. Woo, G.; Liu, C.; Kumar, A.; Xiong, C.; Savarese, S.; Sahoo, D. Unified training of universal time series forecasting transformers. Proc. 41st Int. Conf. Mach. Learn. (ICML) 2024, Vol. 235, PMLR, 53140–53164; arXiv:2402.02592.
  15. Liu, C.; Aksu, T.; Liu, J.; Liu, X.; Yan, H.; Pham, Q.; Savarese, S.; Sahoo, D.; Xiong, C.; Li, J. Moirai 2.0: When less is more for time series forecasting. arXiv 2025, arXiv:2511.11698.
  16. McNeil, A.J.; Frey, R.; Embrechts, P. Quantitative Risk Management: Concepts, Techniques and Tools, revised ed.; Princeton Series in Finance; Princeton University Press, 2015.
  17. Christoffersen, P.F. Evaluating interval forecasts. Int. Econ. Rev. 1998, 39, 841–862.
  18. Kupiec, P.H. Techniques for verifying the accuracy of risk measurement models. J. Deriv. 1995, 3, 73–84.
Figure 1. Detrended standard deviation of $\hat{r}_n$ versus the plug-in precision benchmark $\hat{C}/\sqrt{n\alpha}$. Each point is one (asset, forecaster, $\alpha$) cell. The dashed 45-degree line marks $R = 1$.
Figure 2. Christoffersen conditional-coverage statistic versus benchmark ratio $R$ at $\alpha = 1\%$. Each point is one (asset, forecaster) pair. The pooled Spearman correlation is $\rho = 0.776$; excluding Chronos-Small, $\rho = 0.513$.
Table 1. Finite-sample inflation factor $\hat{f} = \widehat{\mathrm{SD}}(\bar{X}_{\text{tail}}) / (\sigma_{\text{tail}} / \sqrt{n\alpha})$ from 50,000 Monte Carlo replications. †: $\hat{f} > 1.20$, indicating serial-dependence inflation beyond the i.i.d. finite-sample effect.

| $\alpha$ | $n$ | $k = n\alpha$ | Student-$t_5$ | Skewed-$t_5$ | GARCH-$t_5$ |
| --- | --- | --- | --- | --- | --- |
| 1.0% | 250 | 2.50 | 1.129 | 1.134 | 0.653 |
| 1.0% | 500 | 5.00 | 1.124 | 1.119 | 0.918 |
| 1.0% | 1000 | 10.00 | 1.063 | 1.044 | 1.080 |
| 1.0% | 2000 | 20.00 | 1.032 | 1.022 | 1.217† |
| 2.5% | 250 | 6.25 | 1.101 | 1.117 | 1.006 |
| 2.5% | 500 | 12.50 | 1.050 | 1.043 | 1.150 |
| 2.5% | 1000 | 25.00 | 1.028 | 1.011 | 1.284† |
| 2.5% | 2000 | 50.00 | 1.012 | 1.008 | 1.441† |
| 5.0% | 250 | 12.50 | 1.039 | 1.042 | 1.145 |
| 5.0% | 500 | 25.00 | 1.022 | 1.016 | 1.339† |
| 5.0% | 1000 | 50.00 | 1.003 | 1.005 | 1.494† |
| 5.0% | 2000 | 100.00 | 1.006 | 1.008 | 1.587† |
Table 2. Minimum calibration window $n$ (trading days) for tolerance $\varepsilon$ at tail level $\alpha$. Cells are computed from Equation (7), with $C$ calibrated from a Student-$t_5$ reference distribution and $f(n, \alpha)$ from Equation (8). The implicit equation is solved by fixed-point iteration. $\varepsilon$ is expressed in standardised units.

| $\alpha$ | $C$ | $\varepsilon = 10\%$ | $\varepsilon = 20\%$ | $\varepsilon = 30\%$ | $\varepsilon = 50\%$ |
| --- | --- | --- | --- | --- | --- |
| 0.5% | 1.491 | 44,660 | 11,311 | 5,132 | 1,958 |
| 1.0% | 1.335 | 17,921 | 4,553 | 2,075 | 800 |
| 2.5% | 1.152 | 5,348 | 1,365 | 627 | 246 |
| 5.0% | 1.038 | 2,174 | 558 | 257 | 102 |
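The fixed-point iteration mentioned in the caption of Table 2 can be sketched as follows, assuming the implicit equation has the form $\varepsilon = C\, f(n, \alpha) / \sqrt{n\alpha}$, so that each step updates $n \leftarrow (C\, f(n, \alpha) / \varepsilon)^2 / \alpha$. Equations (7) and (8) are not reproduced in this section, so the inflation function `toy_inflation` below is a purely hypothetical placeholder and not the paper's $f(n, \alpha)$; the function names and the starting value are likewise assumptions.

```python
import math


def min_window_fixed_point(c, eps, alpha, inflation, n0=250.0, tol=0.5, max_iter=200):
    """Solve n = (c * inflation(n, alpha) / eps)**2 / alpha by fixed-point iteration."""
    n = n0
    for _ in range(max_iter):
        n_next = (c * inflation(n, alpha) / eps) ** 2 / alpha
        if abs(n_next - n) < tol:
            return math.ceil(n_next)
        n = n_next
    raise RuntimeError("fixed-point iteration did not converge")


def no_inflation(n, alpha):
    """Asymptotic case: f = 1 for every n."""
    return 1.0


def toy_inflation(n, alpha):
    """HYPOTHETICAL stand-in for Equation (8): decays to 1 as n * alpha grows."""
    return 1.0 + 0.5 / (n * alpha)


# alpha = 1 %, C = 1.335 (Table 2), tolerance of 50 % in standardised units.
n_asymptotic = min_window_fixed_point(1.335, 0.50, 0.01, no_inflation)
n_inflated = min_window_fixed_point(1.335, 0.50, 0.01, toy_inflation)
```

Any inflation factor above one raises the required window relative to the asymptotic formula, and the effect is largest where $n\alpha$ is small, that is, at loose tolerances and extreme tail levels.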
Table 3. VaR-miscalibration simulation. The data-generating process is GARCH(1,1)-$t_5$ with $\alpha = 2.5\%$ and $n = 250$, using 30 paths of 10,000 days each. $R$ is the ratio of raw ES-correction SD to the oracle benchmark $\sigma_{\text{tail}} / \sqrt{n\alpha}$.

| VaR model | Hit rate | Median UC stat | ES dispersion ratio ($R$) |
| --- | --- | --- | --- |
| Correct (true $\sigma_t$) | 0.0250 | 0.5 | 0.95 |
| Mild (30% cond. + 70% uncond.) | 0.0961 | 23.9 | 0.97 |
| Moderate (15% cond. + 85% uncond.) | 0.1314 | 45.9 | 1.33 |
| Severe (unconditional $\bar{\sigma}$) | 0.1770 | 80.0 | 1.76 |
Table 4. Precision-fragile ES model comparisons. Each tail level has $\binom{4}{2} \times 24 = 144$ pairwise comparisons.

| $\alpha$ | Comparisons | Fragile | % Fragile |
| --- | --- | --- | --- |
| 1.0% | 144 | 39 | 27.1% |
| 2.5% | 144 | 29 | 20.1% |
| 5.0% | 144 | 26 | 18.1% |
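The comparison count and the percentages in Table 4 follow from simple counting: four forecasters give six unordered pairs per asset, and there are 24 assets. A short check of that arithmetic, using only the counts reported in the table (the variable names are illustrative):

```python
from math import comb

n_forecasters, n_assets = 4, 24
n_pairs = comb(n_forecasters, 2) * n_assets  # 6 pairs per asset, 24 assets

# Fragile counts by tail level, as reported in Table 4.
fragile_counts = {0.01: 39, 0.025: 29, 0.05: 26}
fragile_share_pct = {
    alpha: round(100.0 * count / n_pairs, 1)
    for alpha, count in fragile_counts.items()
}
```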
Table 5. Precision-fragile share under alternative noise scales and detrending methods. The plug-in screen uses Equation (9). The paired bootstrap uses a block length of 21 days and 1000 replications.

| Noise scale | Fragile at $\alpha = 1\%$ | Fragile at $\alpha = 2.5\%$ | Fragile at $\alpha = 5\%$ |
| --- | --- | --- | --- |
| Plug-in $\hat{C}/\sqrt{n\alpha}$ | 27.1% | 20.1% | 18.1% |
| Detrended SD: MA-126 | 16.0% | 16.0% | 15.3% |
| Detrended SD: MA-252 | 17.4% | 17.4% | 15.3% |
| Detrended SD: MA-504 | 20.8% | 17.4% | 17.4% |
| Detrended SD: HP ($\lambda = 1600$) | 17.4% | 17.4% | 16.0% |
| Detrended SD: Rolling median 252 | 18.1% | 17.4% | 15.3% |
| Raw SD (no detrending) | 36.8% | 29.2% | 25.0% |
| Paired block bootstrap (block = 21) | | 22.9% | |
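The paired block bootstrap row uses a 21-day block length and 1000 replications. The following is a minimal sketch of one way to obtain a bootstrap noise scale for the mean difference between two forecasters' series, assuming a moving-block scheme in which the same blocks of dates are drawn for both series so that their dependence is preserved. The statistic, the block scheme, and the function name are assumptions; the paper's exact procedure is not given in this section.

```python
import random
import statistics


def paired_block_bootstrap_sd(x, y, block_len=21, n_boot=1000, seed=0):
    """Bootstrap SD of mean(x - y) using paired moving blocks.

    Resampling the differences block by block is equivalent to drawing
    the same date blocks from both series.
    """
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    n = len(x)
    if n < block_len:
        raise ValueError("series shorter than one block")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    n_blocks = -(-n // block_len)  # ceiling division
    boot_means = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(n - block_len + 1)
            sample.extend(diffs[start:start + block_len])
        sample = sample[:n]  # trim to the original length
        boot_means.append(sum(sample) / n)
    return statistics.stdev(boot_means)


# Illustrative synthetic inputs only (not data from the paper).
_rng = random.Random(1)
series_a = [_rng.gauss(0.0, 1.0) for _ in range(504)]
series_b = [_rng.gauss(0.0, 1.0) for _ in range(504)]
noise_scale = paired_block_bootstrap_sd(series_a, series_b)
```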
Table 6. Precision-fragile share under alternative $\hat{C}$ estimation methods at $\alpha = 2.5\%$.

| $\hat{C}$ estimation | Precision-fragile share |
| --- | --- |
| Full-sample $\hat{C}$ | 20.1% |
| Rolling-window $\hat{C}_t$ | 16.0% |
| Paired block bootstrap | 22.9% |
Table 7. Cross-sectional scaling regression. The sample contains 288 (asset, forecaster, $\alpha$) cells. Standard errors are clustered by asset. Theoretical predictions are $b = -0.50$ and $\gamma = 1.00$.

| Model | $\hat{b}$ (se) | $\hat{\gamma}$ (se) | $R^2$ |
| --- | --- | --- | --- |
| $\log(n\alpha) + \log(\hat{\sigma}_{\text{tail}})$ | $-0.436$ (0.039) | 0.870 (0.036) | 0.676 |
| + forecaster FE | $-0.429$ (0.028) | 0.943 (0.027) | 0.842 |
| $\log(n) + \log(\hat{\sigma}_{\text{tail}})$ | | 0.926 (0.047) | 0.529 |
| $\log(\hat{\sigma}_{\text{tail}})$ only | | 0.926 (0.047) | 0.529 |
Table 8. Benchmark ratio $R_{i,m,\alpha}$ at $\alpha = 2.5\%$, across 24 assets. The numerator is the detrended standard deviation of $\hat{r}_n$ using a 252-day moving-average filter; the denominator is the plug-in precision benchmark.

| Forecaster | Median | Q1 | Q3 | Min | Max |
| --- | --- | --- | --- | --- | --- |
| Chronos-Small | 1.13 | 0.83 | 1.53 | 0.60 | 2.69 |
| GJR-GARCH-t | 0.53 | 0.48 | 0.63 | 0.34 | 0.94 |
| Moirai-2.0 | 0.59 | 0.52 | 0.67 | 0.35 | 0.75 |
| TimesFM-2.5 | 0.63 | 0.55 | 0.77 | 0.35 | 0.90 |
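The numerator of the benchmark ratio is a detrended standard deviation obtained with a 252-day moving-average filter. A minimal sketch of that step is given below, assuming a trailing (one-sided) moving average over the preceding `window` observations; whether the paper uses a trailing or a centred filter is not stated in this section, and the function name is hypothetical.

```python
import statistics


def detrended_sd(series, window=252):
    """SD of the series after subtracting a trailing moving average.

    Assumption: the trend at time t is the mean of the `window`
    observations immediately before t.
    """
    if len(series) < window + 2:
        raise ValueError("need at least window + 2 observations")
    residuals = []
    running = sum(series[:window])
    for t in range(window, len(series)):
        residuals.append(series[t] - running / window)
        running += series[t] - series[t - window]  # slide the window forward
    return statistics.stdev(residuals)


def benchmark_ratio(series, plug_in_benchmark, window=252):
    """R = detrended SD divided by the plug-in precision benchmark."""
    return detrended_sd(series, window) / plug_in_benchmark
```

Detrending removes slow drifts in the recalibration series, so a pure linear trend contributes nothing to the numerator, whereas the raw SD would count it as dispersion.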
Table 9. Headline results under forecaster subsets. Precision-fragile share is computed at $\alpha = 2.5\%$. Spearman $\rho$ is the correlation between the Christoffersen conditional-coverage statistic and benchmark ratio $R$ at $\alpha = 1\%$.

| Sample | Precision-fragile share | Median $R$ | VaR-diagnostic $\rho$ |
| --- | --- | --- | --- |
| All forecasters | 20.1% | 0.64 | 0.776*** |
| Excl. Chronos-Small | 40.3% | 0.58 | 0.513*** |
| Only calibrated (GJR, Moirai) | 8.3% | 0.56 | 0.509** |
Table 10. Precision-fragile share at $\alpha = 2.5\%$ under alternative cutoff multipliers $\kappa$. Each specification has $\binom{4}{2} \times 24 = 144$ pairwise comparisons.

| $\kappa$ | Precision-fragile | Share (%) | $n$ comparisons |
| --- | --- | --- | --- |
| 0.5 | 23 | 16.0% | 144 |
| 1.0 | 29 | 20.1% | 144 |
| 1.5 | 45 | 31.2% | 144 |
| 2.0 | 65 | 45.1% | 144 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.