Finite-Sample Precision Limits for Expected Shortfall Forecast Comparisons

Daniel Traian Pele; Miruna Mazurencu-Marinescu-Pele

doi:10.20944/preprints202606.0259.v1

Submitted: 02 June 2026

Posted: 03 June 2026

Abstract

Expected Shortfall (ES) is a tail functional whose estimation precision is governed by the effective tail sample size nα rather than by the nominal calibration size n. The resulting (nα)^{−1/2} information limit is well established, yet no practical framework exists for deciding whether two ES forecasts can be meaningfully distinguished over a finite calibration window. This paper converts the asymptotic rate into four operational diagnostics: a plug-in precision benchmark, a sample-size rule, a precision-fragile pairwise comparison screen, and a VaR-first diagnostic linking excess ES dispersion to first-stage quantile miscalibration. An empirical application to global financial assets and heterogeneous forecasters under standard regulatory tail parameters shows that roughly one in five pairwise ES comparisons is precision-fragile, with excess dispersion concentrated in cells with poor VaR calibration. The results suggest that ES forecast rankings at typical tail levels can be constrained by effective tail information rather than by model sophistication.

Keywords: effective tail sample size; Expected Shortfall; precision-fragile comparison; finite-sample precision; sample-size rule; Fissler–Ziegel score; VaR miscalibration diagnostic

Subject: Business, Economics and Management - Econometrics and Statistics

MSC: 62G32; 62P05; 91G70

1. Introduction

Expected Shortfall (ES) at tail level

α

is a coherent risk measure widely used in financial regulation, portfolio management, and quantitative risk modelling. Because ES averages losses beyond the

α

-quantile, its estimation draws on about

n α

effective tail observations from a calibration sample of size n. At tail levels common in practice—

α \in \{1 %, 2.5 %, 5 %\}

—this effective count is small even when n is moderately large. The statistical difficulty of ES estimation is therefore governed by

n α

, not by n.

The

{(n α)}^{- 1 / 2}

scaling of ES estimation error is well established. Zwingmann and Holzmann [1] derive this rate through a central limit theorem for tail averages, Bartl and Eckstein [2] obtain a concentration lower bound for nonparametric ES estimation, and related kernel estimators [3] obey the same effective-tail-count rate. These results characterise the information limit, but they do not translate it into a practical diagnostic for deciding whether two ES forecasts can be meaningfully distinguished over a finite calibration window.

ES forecast evaluation relies on the joint elicitability of VaR and ES [4]. Two-stage regression frameworks [5] and dynamic semiparametric VaR–ES models [6] provide tools for estimation and evaluation under dependence. Pele et al. [7] show that Fissler–Ziegel-based recalibration can attain the

{(n α)}^{- 1 / 2}

rate under geometric mixing. That paper is concerned with achievability of the rate under temporal dependence; the present paper takes the rate as given and converts it into an operational comparison audit comprising the four diagnostics developed in Section 3 and Section 4. ES backtesting from a regulatory perspective is studied by Acerbi and Székely [8] and Nolde and Ziegel [9]. The question addressed here is how much pairwise ES model comparison is statistically supportable once the effective tail count is fixed by the calibration window.

The contribution is operational rather than rate-theoretic. The paper converts the known ES information limit into a precision audit for ES model comparison. The audit has four components:

1. a plug-in precision benchmark based on the effective tail count, calibrated from tail residuals;
2. a finite-sample sample-size rule that determines the calibration window needed for a target ES precision tolerance;
3. a precision-fragile pairwise comparison screen that classifies a forecast pair as precision-fragile when the observed difference in ES recalibration corrections falls below the precision floor implied by the available effective tail count;
4. a VaR-first diagnostic for attributing excess ES recalibration dispersion to first-stage quantile miscalibration.

The novelty is not the

{(n α)}^{- 1 / 2}

rate itself. The novelty is the conversion of this rate into a practical precision audit for ES model comparison. The Le Cam two-point construction in Appendix A (Theorem A1) certifies that no estimator can beat the

{(n α)}^{- 1 / 2}

rate over the distribution class, establishing that the plug-in benchmark

\hat{C} / \sqrt{n α}

has the correct functional form for a precision floor. The operational constant

\hat{C}

is a CLT-motivated plug-in rather than the exact minimax constant

c_{L}

, but the rate it instantiates is sharp.

The natural application domain for this audit is the Fundamental Review of the Trading Book (FRTB; [10]), which sets ES at

α = 2.5 %

as the capital-determining market-risk measure evaluated over a 250-day calibration window. These parameters imply

n α = 6.25

, placing the problem firmly in the finite-sample tail-scarcity regime. FRTB serves as the motivating application throughout this paper, but the audit framework applies to any setting in which ES forecasts are compared over a finite window at a fixed tail level.
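The effective tail count at the tail levels used in this paper can be tabulated directly. The following is a minimal sketch; the function name `effective_tail_count` is illustrative and not part of any referenced implementation.

```python
def effective_tail_count(n: int, alpha: float) -> float:
    """Expected number of tail observations in a calibration window of n days."""
    return n * alpha


# Effective tail count and the implied (n * alpha)^(-1/2) scale for a 250-day window.
for alpha in (0.01, 0.025, 0.05):
    k = effective_tail_count(250, alpha)
    print(f"alpha={alpha:.3f}  n*alpha={k:.2f}  (n*alpha)^-1/2={k ** -0.5:.3f}")
```

Even at the least extreme level considered, a 250-day window carries only about a dozen expected tail observations.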

The empirical analysis uses a parametric GJR-GARCH-t baseline [11] and three zero-shot time-series foundation models: TimesFM 2.5 [12], Chronos-Small [13], and Moirai 2.0 [14,15]. The forecasters are not ranked as a model-selection exercise. They provide heterogeneous first-stage forecast sequences against which the precision audit can be evaluated. Chronos-Small is retained as a stress case because it exhibits severe first-stage VaR miscalibration, allowing the VaR-first diagnostic to be evaluated under a clear failure mode.

Across 24 global assets and four forecasters evaluated at

α = 2.5 %

, between 16% and 29% of pairwise ES rankings fall below the plug-in precision tolerance implied by the effective tail count; the paired block-bootstrap estimate is 22.9%. Excess ES recalibration dispersion is concentrated in cells with poor first-stage VaR calibration. For the median asset, the 250-day calibration window delivers approximately 37 bp of precision, but seven high-volatility assets out of the 24 require more than 250 days to reach a 50 bp tolerance.

From a mathematical-statistical perspective, the paper studies how an asymptotic information bound for a tail functional can be converted into a finite-sample diagnostic for forecast comparison. The empirical application to market-risk forecasts serves to illustrate the consequences of this effective-sample-size constraint in a realistic dependent time-series setting.

The remainder of the paper is organised as follows. Section 2 develops the theoretical framework, linking the effective tail sample size to ES recalibration through an oracle-equivalence argument. Section 3 defines the plug-in precision benchmark and the operational sample-size rule. Section 4 introduces the precision-fragile comparison screen and the VaR-first diagnostic. Section 5 reports a controlled VaR-miscalibration simulation. Section 6 presents the financial-risk application. Section 7 collects robustness and sensitivity results. Section 8 states the main limitations, and Section 9 concludes.

2. Expected Shortfall as a Tail Functional Under Effective Sample-Size Scarcity

This section establishes the theoretical framework. It defines the effective tail sample size, derives the CLT-based precision scaling, and states the oracle-equivalence argument that transfers the rate to additive ES recalibration.

2.1. Definitions and Effective Tail Sample Size

Let

X_{1}, \dots, X_{n}

be observations from a distribution P on

R

with Lebesgue density f. Fix

α \in (0, 1 / 2)

. The Value-at-Risk at level

α

is the

α

-quantile

{VaR}_{α} (P) : = \inf \{x : F (x) \geq α\}

, and the Expected Shortfall is

{ES}_{α} (P) : = α^{- 1} \int_{0}^{α} {VaR}_{u} (P) d u = E [X ∣ X \leq {VaR}_{α} (P)]

. Both

{VaR}_{α}

and

{ES}_{α}

are negative for losses throughout; the

{FZ}_{0}

loss in Section 6.1 assumes

e < 0

accordingly.

The

{(n α)}^{- 1 / 2}

rate for ES estimation is established by Zwingmann and Holzmann [1] via a CLT for the tail average and by Bartl and Eckstein [2] via a concentration lower bound. The mechanism is the tail identification residual

ϕ (X) : = ({VaR}_{α} - X) 1 \{X \leq {VaR}_{α}\} / α - ({VaR}_{α} - {ES}_{α}),

(1)

which satisfies

E [ϕ] = 0

and

Var (ϕ) = σ_{tail}^{2} / α - {({VaR}_{α} - {ES}_{α})}^{2} \approx σ_{tail}^{2} / α

, where

σ_{tail}^{2} : = E [{({VaR}_{α} - X)}^{2} ∣ X \leq {VaR}_{α}]

. Only an

α

-fraction of observations falls in the tail, and each carries

O (1 / α)

weight in the ES average; the CLT for the sample average of

ϕ (X_{1}), \dots, ϕ (X_{n})

gives standard error

σ_{tail} / \sqrt{n α}

. The effective sample size for ES estimation is therefore

n α

, not n.
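The scaling can be illustrated with a nonparametric sketch that computes the empirical VaR, the empirical ES, and the CLT scale σ_tail / √(nα) from a sample. This is a hedged illustration under i.i.d. sampling, not the estimator used in the paper; the function names and the simulated Gaussian returns are assumptions made for the example.

```python
import math
import random


def empirical_var_es(sample, alpha):
    """Empirical alpha-quantile (VaR), mean of observations at or below it (ES), and the tail."""
    xs = sorted(sample)
    # Order-statistic index ceil(n * alpha); the small offset guards against float round-up.
    k = max(1, math.ceil(len(xs) * alpha - 1e-12))
    tail = xs[:k]
    return xs[k - 1], sum(tail) / k, tail


def tail_standard_error(sample, alpha):
    """CLT scale sigma_tail / sqrt(n * alpha), sigma_tail^2 = mean squared shortfall below VaR."""
    var, _, tail = empirical_var_es(sample, alpha)
    sigma_tail = math.sqrt(sum((var - x) ** 2 for x in tail) / len(tail))
    return sigma_tail / math.sqrt(len(sample) * alpha)


# Illustrative data only: 250 simulated daily returns with 1% volatility.
rng = random.Random(0)
returns = [rng.gauss(0.0, 0.01) for _ in range(250)]
var_hat, es_hat, _ = empirical_var_es(returns, 0.025)
print(var_hat, es_hat, tail_standard_error(returns, 0.025))
```

Quadrupling the window length halves the reported scale, because the denominator depends on n only through nα.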

2.2. Oracle Equivalence for Additive Recalibration

In the two-stage recalibration framework of Dimitriadis and Bayer [5], a first-stage model produces base VaR and ES forecasts, and the second stage estimates an additive ES correction

r^{*} : = {ES}_{α} (P) - \bar{E}

by minimising a Fissler–Ziegel loss [4].

Remark 1

(Oracle equivalence). Fix a first-stage ES forecast

\bar{E}

that does not depend on the second-stage calibration sample

X_{1}, \dots, X_{n}

. Estimating

r^{*} (P) : = {ES}_{α} (P) - \bar{E}

is equivalent in

L^{1}

risk to estimating

{ES}_{α} (P)

: since

\bar{E}

is a fixed constant,

E_{P} | {\hat{r}}_{n} - r^{*} (P) | = E_{P} | {\hat{T}}_{n} - {ES}_{α} (P) |

for

{\hat{T}}_{n} : = {\hat{r}}_{n} + \bar{E}

. Any minimax lower bound for ES estimation therefore transfers directly to

r^{*} (P)

.

The conceptual chain of the paper has four links: (i) the known

{(n α)}^{- 1 / 2}

rate for ES estimation; (ii) oracle equivalence (Remark 1), by which additive ES recalibration inherits this rate under a fixed first-stage forecast; (iii) the plug-in precision benchmark

\hat{C} / \sqrt{n α}

, which instantiates the rate empirically; and (iv) the operational diagnostics built on this benchmark (Section 3 and Section 4). The novelty lies in links (iii)–(iv), not in the rate itself.

2.3. Distinction Between $c_{L}$ and $\hat{C}$

An alternative Le Cam two-point construction yielding a closed-form constant

c_{L}

is provided in Appendix A for completeness. The plug-in precision benchmark

\hat{C} / \sqrt{n α}

used in the empirics is calibrated from the CLT, not from

c_{L}

. The distinction matters:

c_{L}

is the exact minimax constant from a worst-case two-point argument, whereas

\hat{C}

is a CLT-motivated plug-in scale. Throughout, “information limit” and “precision floor” refer to the

{(n α)}^{- 1 / 2}

rate; “plug-in benchmark” refers to

\hat{C} / \sqrt{n α}

.

The fixed-forecast condition in Remark 1 is approximated in the empirical design: the 1000-day GJR-GARCH training window ends before the 250-day calibration window, and the foundation-model weights are not updated on the calibration sample (zero-shot inference). The benchmark is therefore an oracle-style precision diagnostic, not a formal finite-sample guarantee for the full rolling procedure. The operational benchmark

\hat{C} / \sqrt{n α}

inherits the rate from oracle equivalence; it does not inherit the exact Le Cam constant

c_{L}

. When the first stage is mis-specified, the first-stage bias

η_{t}

does not vanish and the oracle equivalence breaks down—precisely the VaR-miscalibration channel that the VaR-first diagnostic detects (Section 4.2).

3. Precision Benchmark and Sample-Size Rule

3.1. Plug-In Precision Benchmark

For each asset, forecaster, and tail level, the plug-in constant is defined as

\hat{C} = {\hat{σ}}_{tail},

(2)

the sample standard deviation of the tail residuals

\{{\hat{V}}_{t} - X_{t} : X_{t} \leq {\hat{V}}_{t}\} .

(3)

The corresponding precision benchmark is

B_{i, m, α} = \frac{{\hat{C}}_{i, m, α}}{\sqrt{n α}} .

(4)

This quantity instantiates the effective-tail-count rate in an empirical cell and provides the precision floor against which ES recalibration dispersion is evaluated. The benchmark is an operational diagnostic, not a formal finite-sample confidence interval.
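Equations (2)–(4) translate into a few lines of code. The sketch below assumes aligned sequences of realised returns and first-stage VaR forecasts for one cell; the function name `plug_in_benchmark` and the toy numbers are illustrative.

```python
import math
import statistics


def plug_in_benchmark(returns, var_forecasts, n, alpha):
    """Return (C_hat, B) for one (asset, forecaster, alpha) cell.

    C_hat is the sample standard deviation of the tail residuals V_hat_t - X_t
    over days with X_t <= V_hat_t; B = C_hat / sqrt(n * alpha), where n is the
    calibration window length (not the length of the evaluation sample).
    """
    residuals = [v - x for x, v in zip(returns, var_forecasts) if x <= v]
    if len(residuals) < 2:
        raise ValueError("need at least two tail residuals to estimate C_hat")
    c_hat = statistics.stdev(residuals)
    return c_hat, c_hat / math.sqrt(n * alpha)


# Toy inputs: three of the five days breach the (constant) VaR forecast of -2%.
c_hat, floor = plug_in_benchmark(
    returns=[-0.03, 0.01, -0.05, 0.002, -0.04],
    var_forecasts=[-0.02] * 5,
    n=250,
    alpha=0.025,
)
print(c_hat, floor)
```

Passing the window length `n` separately from the data reflects the ex-post design of the main analysis, in which the tail residuals come from the full evaluation sample while the floor refers to a 250-day window.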

In the main analysis,

\hat{C}

is computed once over the full evaluation sample for each cell. This yields an ex-post diagnostic benchmark, not a real-time estimate. In a live implementation,

\hat{C}

would instead be estimated on each rolling calibration window. Appendix C reports this rolling-window variant and shows that the qualitative precision-fragile comparison results are preserved, although the absolute benchmark ratios change.

Because

\hat{C}

is a sample standard deviation of tail residuals, it carries its own sampling variance, and the precision floor

B = \hat{C} / \sqrt{n α}

inherits this uncertainty. The concern is most acute in the rolling-window variant, where

{\hat{C}}_{t}

is re-estimated on each 250-day window from roughly

n α \approx 6.25

tail points. The main analysis mitigates the problem by computing

\hat{C}

once over the full evaluation sample, which aggregates far more than 6.25 tail observations per cell. The rolling-window results in Appendix C (Table A4) confirm that the qualitative conclusions survive when

\hat{C}

is instead estimated window-by-window, even though individual benchmark ratios shift—consistent with the additional noise in the floor estimate. A Monte Carlo study in Appendix B (Table A1) confirms that the coefficient of variation of

\hat{C}

is 0.76 under Student-

t_{5}

at FRTB parameters, declining to 0.43 at

n = 1000

. Where the floor itself is uncertain, marginal precision-fragile classifications should be treated as tentative rather than decisive, reinforcing the conservative reading of the screen.

3.2. Empirical Dispersion Measure

The raw rolling-window ES corrections

{\hat{r}}_{n}

contain both local estimation noise and slow-moving volatility-regime shifts. The precision benchmark concerns the former. To isolate within-regime estimation variability, we subtract a 252-day moving average from the correction path before computing its standard deviation. Robustness checks with 126- and 504-day filters in Appendix C show that the forecaster ranking is stable, although absolute ratios vary with the detrending horizon.

For each asset i, forecaster m, and tail level

α

, the empirical benchmark ratio is

R_{i, m, α} = \frac{{\hat{SD}}^{\det} ({\hat{r}}_{i, m, α})}{{\hat{B}}_{i, m, α}},

(5)

where

{\hat{B}}_{i, m, α}

is the plug-in precision benchmark from Equation (4), and

{\hat{SD}}^{\det}

denotes the standard deviation of the detrended correction path. Values

R < 1

indicate that the plug-in scale is conservative for that cell; this does not contradict the Le Cam lower bound, which uses a different constant

c_{L}

. Values

R > 1

indicate excess dispersion, potentially due to VaR misalignment, non-stationarity, or forecast-extraction noise.
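A minimal sketch of the detrended dispersion and the ratio in Equation (5) follows. Two choices are assumptions of the sketch rather than details confirmed by the text: the moving average is trailing (it uses only the preceding `window` values), and `window` is measured in steps of the correction path.

```python
import statistics


def detrended_sd(path, window=252):
    """Standard deviation of the correction path after subtracting a trailing moving average."""
    if len(path) < window + 2:
        raise ValueError("path must contain at least window + 2 points")
    detrended = []
    running = sum(path[:window])
    for t in range(window, len(path)):
        detrended.append(path[t] - running / window)  # subtract mean of previous `window` values
        running += path[t] - path[t - window]          # slide the window forward by one step
    return statistics.stdev(detrended)


def benchmark_ratio(path, benchmark, window=252):
    """Ratio R: detrended dispersion of the ES corrections over the plug-in benchmark."""
    return detrended_sd(path, window) / benchmark


# A slowly drifting path with no local noise has zero detrended dispersion.
print(benchmark_ratio([0.001 * t for t in range(40)], benchmark=0.005, window=12))
```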

3.3. Finite-Sample Correction

At FRTB parameters (

α = 2.5 %

,

n = 250

), the effective tail count is

k : = n α = 6.25

. Because the number of tail observations is random—binomial

(n, α)

under correct VaR calibration—the CLT scale requires an inflation factor:

f (n, α) = \sqrt{1 + \frac{1 - α}{n α}} .

(6)

At FRTB parameters,

f = 1.075

, raising the required window by approximately 15%. Monte Carlo calibration (Table 1, 50,000 replications under Student-

t_{5}

and GARCH(1,1)-

t_{5}

DGPs) shows that the CLT plug-in is accurate to roughly 10–15% at

k = 6.25

under i.i.d. sampling. Under GARCH dynamics, the unconditional

σ_{tail}

can exceed the within-regime value at short windows, producing

\hat{f} < 1

; this motivates the detrending step in Section 3.2. Full details are in Appendix B.
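As an arithmetic check of Equation (6) at FRTB parameters, the factor and its square (the quantity that multiplies the required window length) work out as follows:

```latex
f(250, 0.025) = \sqrt{1 + \frac{1 - 0.025}{250 \times 0.025}}
              = \sqrt{1 + \frac{0.975}{6.25}}
              = \sqrt{1.156} \approx 1.075,
\qquad
f(250, 0.025)^{2} \approx 1.156 .
```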

3.4. Operational Sample-Size Rule

The Le Cam constant

c_{L}

in Appendix A depends on local density geometry and is not directly estimable. For operational sample-size calculations, we therefore instantiate the

{(n α)}^{- 1 / 2}

rate with the CLT-motivated plug-in scale

C = {\hat{σ}}_{tail}

, estimated from tail residuals (see [16], Ch. 6). Incorporating the random-count correction from Section 3.3, the required window length for tolerance

ε

solves

\frac{f (n, α) C}{\sqrt{n α}} \leq ε ⟺ n \geq \frac{f {(n, α)}^{2}}{α} {(\frac{C}{ε})}^{2},

(7)

where

f (n, α) = \sqrt{1 + \frac{1 - α}{n α}} .

(8)

Because

f (n, α)

depends on n, the rule is implicit; in practice, fixed-point iteration converges in a few steps. At

α = 2.5 %

and

n = 250

, the correction factor is

f (250, 0.025) = 1.075

, which raises the required window by approximately 15%. For

k = n α \geq 25

, the correction is below 2% and can usually be ignored.
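The implicit rule in Equation (7) can be solved by the fixed-point iteration mentioned above. The sketch below is one way to do so; the function names, the starting value (the uncorrected window), and the stopping tolerance are assumptions of the example.

```python
import math


def inflation_factor(n, alpha):
    """Random-tail-count correction f(n, alpha)."""
    return math.sqrt(1.0 + (1.0 - alpha) / (n * alpha))


def required_window(c, eps, alpha, tol=1e-9, max_iter=1000):
    """Real-valued n solving f(n, alpha) * c / sqrt(n * alpha) = eps by fixed-point iteration."""
    base = (c / eps) ** 2 / alpha            # uncorrected window length (f = 1)
    n = base
    for _ in range(max_iter):
        n_next = inflation_factor(n, alpha) ** 2 * base
        if abs(n_next - n) < tol:
            return n_next
        n = n_next
    raise RuntimeError("fixed-point iteration did not converge")


# Round trip: the tolerance delivered by a 250-day window maps back to n = 250.
eps_250 = inflation_factor(250, 0.025) / math.sqrt(250 * 0.025)   # in units of C
print(eps_250, required_window(1.0, eps_250, 0.025))
```

Because the left-hand side of Equation (7) is decreasing in n, the fixed point is the smallest admissible window; in practice it would be rounded up to a whole number of trading days.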

Table 2 reports the implied window lengths for four tail levels and four tolerances, calibrated from a Student-

t_{5}

reference distribution. The table has two immediate implications. First, sub-FRTB tail levels are effectively infeasible on realistic calibration windows: at

α = 0.5 %

and

ε = 10 %

, the required window is about 44,700 trading days. Second, at the FRTB level

α = 2.5 %

, a 250-day window implies a corrected tolerance of approximately

0.43 C

, which is of the same order as many inter-model ES differences.

Asset-specific calibrations are reported in Table A11 in Appendix I and summarised in Section 6. Empirical ratios should be interpreted as distances to the CLT-motivated plug-in benchmark, not as tests of the exact Le Cam constant.

4. Precision-Fragile Pairwise Comparison Screen

4.1. Screen Definition

We call a comparison precision-fragile if its absolute ES-recalibration difference is smaller than the plug-in tolerance. The screen is not a hypothesis test and does not attach a nominal size. It is a conservative diagnostic: it asks whether the observed difference is larger than the precision floor implied by the available effective tail count.

The threshold is forecaster-agnostic, conditional on approximate first-stage calibration: it depends on the per-cell tail dispersion

\hat{C}

and the effective tail count

n α

, not on the model class itself. Adding or substituting forecasters changes the set of pairwise comparisons, but not the precision floor used to evaluate each pair. Severely miscalibrated forecasters can still inflate the precision-fragile share through the VaR-miscalibration channel, which the VaR-first diagnostic examines in Section 6.3.

Formally, for a given asset and tail level

α

, the comparison between forecasters 1 and 2 is classified as precision-fragile when

|{\bar{r}}_{n}^{(1)} - {\bar{r}}_{n}^{(2)}| < \frac{\sqrt{{\hat{C}}_{1}^{2} + {\hat{C}}_{2}^{2}}}{\sqrt{n α}},

(9)

where

{\bar{r}}_{n}^{(j)}

denotes the mean ES recalibration correction for forecaster j over the relevant evaluation windows. The threshold treats the two correction estimates as independently noisy. Because the same return realisation enters both forecasters’ recalibration losses, pairwise correction estimates are likely positively correlated. Ignoring this covariance inflates the tolerance and therefore classifies marginal cases as precision-fragile; the screen is conservative by design.

Although the algebraic form of the threshold resembles a pooled standard error, its substantive content differs in three respects. First, the scale is anchored to the effective tail count

n α

—6.25 observations at FRTB parameters, not the 250 days that constitute the nominal window—so the tolerance reflects the information actually available in the tail rather than the total calibration length. Second, the precision floor is a fixed comparison standard: it is invariant to the forecaster set, whereas the realised precision-fragile share depends on which models are compared. Third, the screen is offered as a conservative diagnostic, not a hypothesis test with nominal size; it flags comparisons whose observed difference is smaller than the precision floor, without attaching a rejection probability.
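The screen in Equation (9) reduces to a single comparison. The sketch below is illustrative; the function name and the numerical inputs are hypothetical.

```python
import math


def is_precision_fragile(r1_mean, r2_mean, c1, c2, n, alpha):
    """Return (flag, tolerance) for one forecaster pair.

    The pair is flagged when the absolute difference in mean ES corrections is
    below sqrt(c1^2 + c2^2) / sqrt(n * alpha), treating the two estimates as
    independently noisy.
    """
    tolerance = math.sqrt(c1 ** 2 + c2 ** 2) / math.sqrt(n * alpha)
    return abs(r1_mean - r2_mean) < tolerance, tolerance


# Hypothetical cell: both forecasters have C_hat = 1.3%, corrections differ by 50 bp.
flag, tol = is_precision_fragile(0.010, 0.015, 0.013, 0.013, n=250, alpha=0.025)
print(flag, round(tol * 1e4, 1), "bp")
```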

4.2. Rate Tests and VaR-First Diagnostic

Three diagnostics assess whether the

{(n α)}^{- 1 / 2}

rate governs empirical recalibration dispersion and whether deviations from the benchmark are linked to first-stage quantile failure.

First, a cross-sectional scaling regression across all (asset, forecaster,

α

) cells:

log ({\hat{SD}}^{\det} ({\hat{r}}_{n})) = a + b log (n α) + γ log ({\hat{σ}}_{tail}) + u .

(10)

The rate prediction is

b = - 1 / 2

, while the CLT plug-in scale implies

γ = 1

. Standard errors are clustered by asset.

Second, a non-overlapping window-length scaling test at

α = 1 %

, where variation in the effective tail count is most visible. The calibration length is varied over

n \in \{250, 500, 750, 1000\},

(11)

with step size equal to n, so adjacent windows do not overlap. For each (asset, forecaster) pair, we estimate

log (\hat{SD} ({\hat{r}}_{n})) = a + b log (n) + u,

(12)

using HC1 robust standard errors. The pooled fixed-effects specification is

log (\hat{SD} ({\hat{r}}_{i, j, n})) = α_{i} + β_{j} + b log (n) + u_{i, j, n},

(13)

where

α_{i}

and

β_{j}

denote asset and forecaster fixed effects.
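The point estimate of the slope in Equation (12) is an ordinary least-squares fit in logs. The sketch below computes only that point estimate; the HC1 standard errors and the fixed-effects specification in Equation (13) are omitted, and the input values are synthetic.

```python
import math


def loglog_slope(ns, sds):
    """OLS slope b in log(sd) = a + b * log(n) + u."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in sds]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic dispersions that follow the n^(-1/2) rate exactly.
windows = [250, 500, 750, 1000]
dispersions = [0.02 / math.sqrt(n) for n in windows]
print(loglog_slope(windows, dispersions))
```

At a fixed tail level, log(nα) differs from log(n) only by a constant, so the slope on n is the same as the slope on the effective tail count.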

Third, the VaR-first diagnostic distinguishes intrinsic ES estimation noise from first-stage quantile failure. For each (asset, forecaster) pair, VaR backtesting statistics are computed at

α = 1 %

. The main diagnostic uses the Christoffersen conditional-coverage statistic [17]; the Kupiec unconditional-coverage statistic [18] is reported as a robustness check. Excess ES recalibration dispersion should concentrate in cells with poor first-stage VaR calibration.
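For reference, the two backtesting statistics can be sketched from a 0/1 hit sequence (1 when the return falls at or below the VaR forecast). The code follows the textbook likelihood-ratio forms of the Kupiec and Christoffersen tests; it is not taken from the paper's implementation, and the helper names are illustrative.

```python
import math


def _xlogp(count, p):
    """count * log(p) with the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(p)


def kupiec_lr(hits, alpha):
    """Unconditional-coverage likelihood ratio (Kupiec)."""
    n, x = len(hits), sum(hits)
    pi = x / n
    log_null = _xlogp(n - x, 1 - alpha) + _xlogp(x, alpha)
    log_alt = _xlogp(n - x, 1 - pi) + _xlogp(x, pi)
    return -2.0 * (log_null - log_alt)


def independence_lr(hits):
    """First-order Markov independence likelihood ratio (Christoffersen)."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(hits, hits[1:]):
        counts[(prev, cur)] += 1
    n00, n01, n10, n11 = (counts[k] for k in ((0, 0), (0, 1), (1, 0), (1, 1)))
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    log_null = _xlogp(n00 + n10, 1 - pi) + _xlogp(n01 + n11, pi)
    log_alt = (_xlogp(n00, 1 - pi01) + _xlogp(n01, pi01)
               + _xlogp(n10, 1 - pi11) + _xlogp(n11, pi11))
    return -2.0 * (log_null - log_alt)


def conditional_coverage_lr(hits, alpha):
    """Conditional-coverage statistic: unconditional coverage plus independence."""
    return kupiec_lr(hits, alpha) + independence_lr(hits)


# Same hit count (5 of 200), evenly spread versus clustered at the start.
spread = [1 if i % 40 == 0 else 0 for i in range(200)]
clustered = [1] * 5 + [0] * 195
print(conditional_coverage_lr(spread, 0.025), conditional_coverage_lr(clustered, 0.025))
```

Both sequences have the nominal hit rate, so only the independence component separates them; this is the sense in which conditional coverage detects failures that an unconditional test misses.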

5. Simulation Evidence

This section isolates the mechanism behind the VaR-first diagnostic. The empirical results in Section 6 show that excess ES recalibration dispersion is concentrated in poorly VaR-calibrated cells. The simulation below asks whether this pattern can arise when the first-stage VaR model fails to track time-varying volatility.

The simulation generates 30 independent paths of length 10,000 from a GARCH(1,1)-

t_{5}

process with

ω = 10^{- 6}

,

α_{g} = 0.10

, and

β_{g} = 0.85

. For each path, the first-stage VaR/ES model uses a convex blend of the true conditional volatility

σ_{t}

and the unconditional volatility

\bar{σ}

:

σ_{model, t} = (1 - δ) σ_{t} + δ \bar{σ}, δ \in \{0, 0.70, 0.85, 1\} .

(14)

At

δ = 0

, the model uses the true conditional volatility. At

δ = 1

, it ignores conditional volatility dynamics and uses the unconditional volatility. Additive ES corrections are estimated on rolling 250-day windows, advanced in 21-day steps, at

α = 2.5 %

. The ES dispersion ratio R is computed as the raw standard deviation of the ES correction divided by the oracle benchmark

\frac{σ_{tail}}{\sqrt{n α}} .

(15)

Table 3 shows that R increases from 0.95 under the oracle volatility specification to 1.76 under the unconditional-volatility specification. The hit rate rises at the same time, indicating worsening VaR calibration. The transition through

R = 1

occurs between the

δ = 0.70

and

δ = 0.85

designs, where the hit rate is already far above the nominal 2.5%. The simulation therefore supports the interpretation that regime-dependent VaR error can inflate ES recalibration dispersion, consistent with the VaR-first diagnostic.
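The data-generating step and the blended first-stage volatility of Equation (14) can be sketched as below. Several elements are assumptions of the sketch, not details stated in the text: the Student-t5 innovations are rescaled to unit variance, the first-stage VaR is the blended volatility times the 2.5% quantile of that rescaled distribution, the recursion starts at the unconditional variance, and the seed is arbitrary. The rolling recalibration and the ratio R are not reproduced; only the first-stage hit rate is returned.

```python
import math
import random

# 2.5% quantile of Student-t with 5 degrees of freedom, rescaled to unit variance.
T5_Q025_STD = -2.5706 * math.sqrt(3.0 / 5.0)


def simulate_hit_rate(delta, n_obs=10_000, omega=1e-6, a=0.10, b=0.85, seed=0):
    """Simulate a GARCH(1,1)-t5 path and return the VaR hit rate under blended volatility."""
    rng = random.Random(seed)
    uncond_var = omega / (1.0 - a - b)
    var_t = uncond_var                      # start the recursion at the unconditional variance
    hits = 0
    for _ in range(n_obs):
        sigma_t = math.sqrt(var_t)
        # t5 draw as Z / sqrt(chi2_5 / 5), then rescaled to unit variance.
        t5 = rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(2.5, 2.0) / 5.0)
        x = sigma_t * t5 * math.sqrt(3.0 / 5.0)
        sigma_model = (1.0 - delta) * sigma_t + delta * math.sqrt(uncond_var)
        if x <= sigma_model * T5_Q025_STD:
            hits += 1
        var_t = omega + a * x * x + b * var_t
    return hits / n_obs


for delta in (0.0, 0.70, 0.85, 1.0):
    print(delta, simulate_hit_rate(delta))
```

At δ = 0 the model volatility coincides with the true conditional volatility, so the hit rate should sit near the nominal level up to Monte Carlo error; the remaining designs show how the first-stage calibration responds as the blend moves toward the unconditional volatility.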

6. Financial-Risk Application

This section illustrates the precision-audit framework on 24 global financial assets and four forecasters under FRTB calibration parameters. The forecasters are not ranked as a model-selection exercise; they provide heterogeneous first-stage forecast sequences against which the precision audit can be evaluated.

6.1. Data and Forecasting Setup

The empirical analysis uses daily log returns for 24 global assets: equity indices, bonds, commodities, cryptocurrencies, and currencies. The sample period is 2002–2026, with exact start and end dates varying by asset; details are reported in Appendix C. Forecasts are computed at

α \in \{1 %, 2.5 %, 5 %\}

, with

α = 2.5 %

as the primary ES analysis level. Additive Fissler–Ziegel recalibration is performed on rolling

n = 250

-day calibration windows, advanced in monthly steps of 21 trading days.

The parametric benchmark is a GJR-GARCH(1,1) model with Student-t innovations [11], estimated by maximum likelihood on a rolling 1000-day training window, with VaR and ES computed analytically from the fitted conditional distribution. The remaining forecasters are time-series foundation models run zero-shot. TimesFM 2.5 [12] uses nine quantile heads with a Student-t fit by quantile matching. Chronos-Small [13] generates 1000 Monte Carlo forecast samples, from which empirical quantiles and conditional tail means yield VaR and ES directly. Moirai 2.0 [14,15] produces 1000 forecast samples with a Student-t fit to extract VaR and ES. Implementation details of the extraction procedures are reported in Appendix C.

Given base forecasts

({\hat{V}}_{t}, {\hat{E}}_{t})

, the second stage estimates an additive correction pair

({\hat{q}}_{n}, {\hat{r}}_{n})

by minimising the Fissler–Ziegel

{FZ}_{0}

loss [4,5]. The corrected forecasts are

{\tilde{V}}_{t} = {\hat{V}}_{t} + {\hat{q}}_{n}, {\tilde{E}}_{t} = {\hat{E}}_{t} + {\hat{r}}_{n},

(16)

where

{\hat{q}}_{n}

and

{\hat{r}}_{n}

are held fixed within each calibration window. The loss function is

S (v, e, x) = \frac{1 {x \leq v} (x - v)}{α e} + \frac{v}{e} + log (- e) - 1, e < 0 .

(17)

The optimisation uses Nelder–Mead with three random restarts per window. Convergence failure is below 0.5% across all cells.
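The objective in Equation (17) and its window average under an additive correction pair can be written as follows. The sketch only evaluates the loss; the Nelder–Mead search with restarts is not reproduced, and the function names are illustrative.

```python
import math


def fz0_loss(v, e, x, alpha):
    """FZ0 loss for a VaR forecast v, an ES forecast e < 0, and a realised return x."""
    if e >= 0:
        raise ValueError("FZ0 loss requires e < 0")
    hit = 1.0 if x <= v else 0.0
    return hit * (x - v) / (alpha * e) + v / e + math.log(-e) - 1.0


def mean_fz0_loss(var_fc, es_fc, returns, alpha, q=0.0, r=0.0):
    """Average FZ0 loss of the additively corrected forecasts (V_t + q, E_t + r)."""
    losses = [fz0_loss(v + q, e + r, x, alpha)
              for v, e, x in zip(var_fc, es_fc, returns)]
    return sum(losses) / len(losses)


# One breach day (x below v) and one non-breach day, in arbitrary units.
print(fz0_loss(-2.0, -3.0, -4.0, 0.025), fz0_loss(-2.0, -3.0, 0.0, 0.025))
```

Minimising `mean_fz0_loss` over `(q, r)` on a calibration window, subject to the corrected ES staying negative, corresponds to the second-stage estimate of the correction pair.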

6.2. Precision-Fragile ES Comparisons

At the FRTB ES level

α = 2.5 %

, 29 of 144 pairwise comparisons are precision-fragile under the plug-in screen in Equation (9), corresponding to a share of 20.1% (Table 4). The share increases as the tail level becomes more extreme, consistent with the

{(n α)}^{- 1 / 2}

precision limit.

The 20.1% headline should be interpreted as a baseline diagnostic, not as a universal constant. Table 5 reports the precision-fragile share under alternative noise scales and detrending choices. At

α = 2.5 %

, the share ranges from 16.0% to 29.2%, with the paired block bootstrap at 22.9%.

Table 6 summarises the three main choices for estimating the precision scale at

α = 2.5 %

. The qualitative conclusion is unchanged: a material fraction of ES comparisons is smaller than the precision floor supported by the available tail data.

The sample-size rule reveals substantial cross-asset heterogeneity. At

α = 2.5 %

, the estimated tail-dispersion scale spans more than an order of magnitude: USD/JPY requires roughly 84 trading days for 50 bp precision, whereas BTC requires roughly 4,900 days. For the median asset, the 250-day calibration window delivers approximately 37 bp precision. Seven of 24 assets require more than 250 days to reach a 50 bp tolerance, so the sample-size rule is as much an asset-selection diagnostic as a window-length diagnostic.

6.3. Empirical Diagnostics

Figure 1 compares the detrended standard deviation of the ES correction

{\hat{r}}_{n}

with the plug-in precision benchmark

\hat{C} / \sqrt{n α}

across all (asset, forecaster,

α

) cells. The 45-degree line corresponds to

R = 1

. Points below the line indicate that the plug-in benchmark is conservative for that cell; points above the line indicate excess dispersion, which is examined through the VaR-first diagnostic below.

Figure A3 in Appendix I disaggregates the same comparison by forecaster. GJR-GARCH-t and Moirai 2.0 mostly lie below the 45-degree line, while Chronos-Small straddles it.

The cross-sectional scaling regression in Equation (10) tests whether empirical ES recalibration dispersion follows the predicted effective-tail-count rate. Table 7 reports the results across 288 cells. The estimated slope is

\hat{b} = - 0.436

with clustered standard error

0.039

, close to the theoretical value

- 1 / 2

. The tail-dispersion coefficient is

\hat{γ} = 0.870

with standard error

0.036

, below the plug-in prediction

γ = 1

.

The estimate

\hat{γ} < 1

does not contradict the

{(n α)}^{- 1 / 2}

rate, which concerns scaling in the effective tail count. Two mechanisms can reduce the estimated tail-dispersion coefficient: full-sample estimation of

{\hat{σ}}_{tail}

may average across volatility regimes, and rolling-window overlap may reduce the measured dispersion of

{\hat{r}}_{n}

.

Table 8 reports the benchmark ratio R at

α = 2.5 %

. GJR-GARCH-t, TimesFM 2.5, and Moirai 2.0 have median ratios below one, whereas Chronos-Small has median

R = 1.13

and reaches

R = 2.69

for Natural Gas. This pattern is consistent with excess ES recalibration dispersion being linked to first-stage VaR failure rather than to the precision benchmark itself.

The controlled simulation in Section 5 establishes the mechanism: as the first-stage volatility model moves from the oracle specification (

δ = 0

) toward the unconditional-volatility specification (

δ = 1

), VaR calibration degrades and the ES dispersion ratio rises from

R = 0.95

to

R = 1.76

(Table 3). This regime-dependent VaR error channel is independent of any single forecaster and provides the causal logic for the VaR-first diagnostic.

The empirical data corroborate this mechanism. Figure 2 plots the Christoffersen conditional-coverage statistic against R at

α = 1 %

. The pooled Spearman correlation is

ρ = 0.776

(

p < 0.001

). Excluding Chronos-Small, the correlation remains positive and significant (

ρ = 0.513

,

p < 0.001

), whereas within Chronos-Small alone it is small and insignificant. All 12 cells with

R > 1

belong to Chronos-Small and exhibit severe VaR miscalibration (Table A16 in Appendix I). The pooled association is strengthened by the separation of Chronos-Small from the remaining forecasters, and the ex-Chronos correlation becomes insignificant at

α = 2.5 %

and

5 %

(Table 11), consistent with the reduced power of the diagnostic at less extreme tail levels.

Chronos-Small is therefore best interpreted as a stress case for the diagnostic, not as a competitive benchmark. Its excess dispersion is consistent with time-varying VaR misalignment; after separating these severely miscalibrated cells, ES recalibration dispersion is broadly comparable across the remaining forecasters.

Table 9 reports the headline results under forecaster subsets. Excluding Chronos-Small increases the precision-fragile share because the remaining forecasters produce ES corrections that are harder to distinguish. The precision-fragile share is not a model-free empirical constant; it depends on the dispersion of the forecasters included in the comparison set. The forecaster-agnostic object is the precision threshold, not the realised share.

6.4. Implications for Tail-Risk Practice

At

α = 2.5 %

with

n = 250

and a typical

\hat{C} \approx 1.3 %

, the implied recalibration tolerance is

ε_{n} = \frac{\hat{C}}{\sqrt{n α}} \approx 52 bp .

(18)

This tolerance is of the same order as many inter-model ES differences, so it provides a simple way to report the sampling uncertainty attached to ES recalibration and model comparison.

The precision audit can be implemented as a four-step workflow:

1. report the effective tail count $n α$, not only the window length n;
2. estimate $\hat{C} = {\hat{σ}}_{tail}$ from tail residuals;
3. compute the plug-in precision floor $\hat{C} / \sqrt{n α}$;
4. flag ES comparisons whose absolute recalibration difference falls below the corresponding pairwise tolerance as precision-fragile.
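The four steps can be strung together for a single asset as in the sketch below. The forecaster labels, the mean corrections, and the tail scales are hypothetical values chosen for illustration, and step 2 is taken as given (the tail scales are supplied as inputs).

```python
import math
from itertools import combinations


def precision_audit(cells, n, alpha):
    """Run the audit for one asset.

    `cells` maps a forecaster label to (mean ES correction, C_hat).
    """
    k = n * alpha                                        # step 1: effective tail count
    report = {"effective_tail_count": k, "floors": {}, "fragile_pairs": []}
    for name, (_, c_hat) in cells.items():               # steps 2-3: per-forecaster floor
        report["floors"][name] = c_hat / math.sqrt(k)
    for a, b in combinations(sorted(cells), 2):          # step 4: pairwise screen
        (r_a, c_a), (r_b, c_b) = cells[a], cells[b]
        tolerance = math.sqrt(c_a ** 2 + c_b ** 2) / math.sqrt(k)
        if abs(r_a - r_b) < tolerance:
            report["fragile_pairs"].append((a, b))
    return report


# Hypothetical asset with three forecasters, each with C_hat = 1.3%.
cells = {"A": (0.000, 0.013), "B": (0.004, 0.013), "C": (0.030, 0.013)}
print(precision_audit(cells, n=250, alpha=0.025))
```

In this example only the pair (A, B) is flagged: its 40 bp difference lies below the pairwise tolerance, whereas both comparisons involving C exceed it.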

A precision-fragile comparison does not imply that the two models are equivalent. It means that the observed ES-recalibration difference is smaller than the precision budget supported by the available tail data. Such comparisons should not be used as decisive evidence for model replacement without additional data, stronger structural assumptions, or complementary diagnostics. Appendix J summarises the benchmark, sample-size rule, and diagnostic screen as a workflow.

7. Robustness and Sensitivity

This section collects robustness checks on the main empirical findings.

Cutoff multiplier sensitivity. Replacing the baseline tolerance in Equation (9) by $κ \sqrt{{\hat{C}}_{1}^{2} + {\hat{C}}_{2}^{2}} / \sqrt{n α}$ gives the shares in Table 10. Even at $κ = 0.5$, 16.0% of comparisons remain precision-fragile.

VaR-first diagnostic across tail levels. Table 11 reports the diagnostic at all three tail levels. The association between the conditional-coverage statistic and R is strongest at $α = 1 %$, supporting the choice of this level as the diagnostic anchor. At $α = 2.5 %$ and $5 %$, the full-sample correlation remains significant, but the correlation excluding Chronos-Small becomes insignificant.

Table 11. VaR-first diagnostic robustness across tail levels. Entries report Spearman rank correlations between the Christoffersen conditional-coverage statistic and benchmark ratio R at the diagnostic $α$ indicated. *** denotes $p < 0.001$.


Diagnostic $α$	n pairs	Spearman $ρ$ (all)	Spearman $ρ$ (excl. Chronos)	Comment
1%	76	$0.776^{***}$	$0.513^{***}$	FRTB VaR gatekeeper
2.5%	87	$0.590^{***}$	$0.192$	Matched to FRTB ES level
5.0%	95	$0.559^{***}$	$0.146$	Scaling validation

Window-length scaling. A non-overlapping window-length scaling test at $α = 1 %$ provides a complementary check; details are reported in Appendix D. Three of four forecasters are consistent with the ${(n α)}^{- 1 / 2}$ rate. The pooled test rejects $H_{0} : b = - 0.50$ at $p = 0.033$, driven by TimesFM 2.5; excluding TimesFM 2.5, all subsamples are consistent with the theoretical slope within their 95% confidence intervals.

HS-250 naive benchmark. As a robustness check, Appendix H reports results for a naive Historical Simulation benchmark (HS-250) that uses the empirical quantile and tail mean from the same 250-day window. HS-250 lies close to the plug-in precision floor ($R = 0.96$ at $α = 2.5 %$), and all of its comparisons with TimesFM-2.5 and Moirai-2.0 are precision-fragile, reinforcing the interpretation that the binding constraint is often the effective tail count rather than model complexity.

8. Limitations

Six limitations delimit the interpretation.

First, the empirical benchmark uses the CLT-motivated plug-in scale $\hat{C}$, not the Le Cam constant $c_{L}$. The latter depends on local density geometry and is not estimated from data. Hence values $R < 1$ indicate that the plug-in benchmark is conservative for that cell; they do not contradict the lower-bound argument.

Second, the non-overlapping window-scaling test is data-intensive. At $n = 1000$, it requires roughly 5,000 trading days, leaving too few usable assets for Chronos-Small to estimate a meaningful model-specific slope.

Third, the theoretical benchmark is oracle-style: it is conditional on a fixed first-stage forecast sequence. The empirical rolling design approximates this condition because forecasts are predictable from past information, but they are not independent of the full calibration path. A strictly held-out design would isolate the oracle assumption more cleanly.

Fourth, absolute benchmark ratios depend on how low-frequency volatility-regime shifts are removed. The forecaster ranking is stable across the 126-, 252-, and 504-day filters reported in Appendix C, but the levels of R and the precision-fragile share vary with the detrending choice.

Fifth, the pooled window-scaling test rejects $H_{0} : b = - 0.50$ at $p = 0.033$, driven by TimesFM 2.5 (Appendix D). The rejection disappears in the restricted samples that exclude TimesFM 2.5 (Table A8).

Sixth, the VaR-first diagnostic is partly a between-group result. The pooled association between the Christoffersen conditional-coverage statistic and R is strengthened by the separation of Chronos-Small from the remaining forecasters.

These limitations define the scope of the audit. The precision floor is a necessary condition for meaningful ES comparison, not a sufficient condition for model superiority. The proposed quantities should be read as precision diagnostics for ES model comparison under tail-data scarcity.

9. Conclusions

This paper studies how the effective tail sample size constrains pairwise comparison of Expected Shortfall forecasts. The relevant precision budget is $n α$, not the nominal calibration window length n. At the FRTB tail level $α = 2.5 %$ and $n = 250$, the effective tail count is only $n α = 6.25$.

The paper converts the known ${(n α)}^{- 1 / 2}$ information limit into a precision-audit framework with four components: a plug-in precision benchmark, a finite-sample sample-size rule, a precision-fragile pairwise comparison screen, and a VaR-first diagnostic. The audit is diagnostic, not a replacement for formal model validation.

Across 24 global assets and four forecasters, roughly one in five pairwise ES comparisons is precision-fragile: the observed difference in recalibrated ES corrections is smaller than the precision floor supported by the available tail data. This share ranges from 16% to 29% across estimation variants and remains economically material across alternative noise scales, detrending methods, and bootstrap specifications.

The VaR-first diagnostic shows that excess ES recalibration dispersion is concentrated in cells with poor first-stage VaR calibration. VaR miscalibration is therefore a major channel for inflated ES dispersion, consistent with the simulation evidence in Section 5. A naive Historical Simulation benchmark lies close to the plug-in precision floor, reinforcing the interpretation that at standard tail levels the effective number of tail observations can be more binding than model complexity.

The precision-fragile share is not a universal constant. It depends on the asset universe, forecaster set, tail-dispersion estimate, and detrending convention. What is invariant is the information constraint: ES recalibration precision scales with $n α$, not with the nominal window length alone.

The analysis is an ex-post oracle-style precision audit, not a deployable real-time rule. Future work should develop fully held-out, real-time, dependence-aware versions of the audit that relax the fixed-first-stage assumption. Extensions to multivariate tail-risk settings and to formal decision-theoretic frameworks for the comparison screen are also left for future investigation.

Author Contributions

Conceptualization, D.T.P.; methodology, D.T.P.; software, D.T.P. and M.M.-M.-P.; validation, D.T.P. and M.M.-M.-P.; formal analysis, D.T.P.; investigation, D.T.P.; data curation, D.T.P. and M.M.-M.-P.; writing—original draft preparation, D.T.P.; writing—review and editing, D.T.P. and M.M.-M.-P.; visualization, D.T.P.; supervision, D.T.P.; project administration, D.T.P.; funding acquisition, D.T.P. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding from the Marie Skłodowska-Curie Actions under the European Union’s Horizon Europe research and innovation program for the Industrial Doctoral Network on Digital Finance, acronym DIGITAL, Project No. 101119635; the project “IDA Institute of Digital Assets”, CF166/15.11.2022, contract number CN760046/23.05.2023; the project “AI for Energy Finance (AI4EFin)”, CF162/15.11.2022, contract number CN760048/23.05.2023; the project “Accountable Governance and Responsible Innovation in Artificial Intelligence”, CF158/15.11.2022, contract number CN760047/23.05.2023, financed under Romania’s National Recovery and Resilience Plan, Apel nr. PNRR-III-C9-2022-I8. We acknowledge the support of the project “MA’AT — Autonomous Model for Textual Assistance”, SMIS Code 2021+: 330941, funding contract no. 390090/11.11.2025, project co-financed by the European Regional Development Fund through the Smart Growth, Digitalisation and Financial Instruments Programme 2021–2027 (POCIDIF).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Daily return data are sourced from Yahoo Finance for the assets listed in Appendix C. Replication code and intermediate data files are available on the QuantLet platform. Slides are available on the Quantinar platform.

Acknowledgments

During the preparation of this manuscript, the authors used generative AI tools for language editing, LaTeX assistance, and code-checking support. The authors reviewed and edited all generated content and take full responsibility for the final manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

ES	Expected Shortfall
VaR	Value-at-Risk
FRTB	Fundamental Review of the Trading Book
FZ	Fissler–Ziegel
CLT	Central limit theorem
CC	Christoffersen conditional coverage
HS	Historical Simulation

Appendix A. Minimax Lower Bound and Proof

The distribution class used in the Le Cam construction is

P (α, w, m_{0}) : = \{P : P ≪ λ, f_{P} = \frac{d P}{d λ}, E_{P} [X^{-}] < \infty, f_{P} (x) \geq m_{0} \; \forall x \in [{VaR}_{α} (P) - w, {VaR}_{α} (P)]\},

(A1)

where $w > 0$ is the tail window width and $m_{0} > 0$ is a density lower bound on the tail window.

Theorem A1 (Minimax lower bound). There exists a constant $c_{L} > 0$, depending only on the tail-window width w and density lower bound $m_{0}$, such that for any estimator $T_{n}$ of ${ES}_{α} (P)$ based on n i.i.d. observations from $P \in P (α, w, m_{0})$,

inf_{T_{n}} sup_{P \in P (α, w, m_{0})} E_{P} | T_{n} - {ES}_{α} (P) | \geq \frac{c_{L}}{\sqrt{n α}} .

(A2)

By Remark 1, the same bound applies to any additive recalibration estimator ${\hat{r}}_{n}$ under a fixed first-stage forecast.

To illustrate the magnitude of $c_{L}$, consider a Student-$t_{5}$ return distribution with $w = σ_{tail}$ and $m_{0} = f_{t_{5}} ({VaR}_{α} - w)$. Then $c_{L}$ evaluates to $0.0017$ at $α = 1 %$, $0.0024$ at $α = 2.5 %$, and $0.0031$ at $α = 5 %$. The plug-in constant $C = σ_{tail}$ at the same calibration is $1.39$, $1.16$, and $1.03$ respectively, giving $c_{L} / C \approx 0.1$–$0.3 %$. The Le Cam constant is conservative by construction: the two-point perturbation uses the smallest density on the tail window, while the CLT variance integrates the full tail distribution. The plug-in benchmark $\hat{C} / \sqrt{n α}$ is therefore the operational object.

Lemma A1 (Antisymmetric two-point perturbation). Fix $α \in (0, 1 / 2)$ and let $P_{0}$ have density $g_{0}$ satisfying $g_{0} (x) \geq 2 m_{0}$ on $[v_{0} - w, v_{0}]$, where $v_{0} : = {VaR}_{α} (P_{0})$. For $δ > 0$, define

f_{\pm} (x) : = g_{0} (x) \pm \frac{δ}{w} h (\frac{x - v_{0} + w}{w}),

(A3)

where $h (u) = sin (2 π u)$ on $[0, 1]$ and $h = 0$ elsewhere. Let $κ : = {inf}_{u \in [0, 1)} 4 π (1 - u) / [1 - cos (2 π u)] \approx 2.76$ (attained near $u \approx 0.63$). Then, provided $δ \leq min (1, κ) m_{0} w$:

(i) $f_{\pm}$ are valid densities with $f_{\pm} (x) \geq m_{0}$ on $[v_{0} - w, v_{0}]$, so $P_{\pm} \in P (α, w, m_{0})$;
(ii) ${VaR}_{α} (P_{+}) = {VaR}_{α} (P_{-}) = v_{0}$;
(iii) $| {ES}_{α} (P_{+}) - {ES}_{α} (P_{-}) | = w δ / (π α)$.

Proof. (i) Since $\int_{0}^{1} h (u) d u = 0$ and the perturbation is supported on $[v_{0} - w, v_{0}]$, both $f_{\pm}$ integrate to 1. On the support, $f_{\pm} (x) \geq 2 m_{0} - δ / w \geq m_{0}$ when $δ \leq m_{0} w$, so $P_{\pm} \in P (α, w, m_{0})$.

(ii) The antisymmetry of $sin (2 π u)$ about $u = 1 / 2$ ensures that the perturbation adds zero net mass to $(- \infty, v_{0}]$: $\int_{v_{0} - w}^{v_{0}} (δ / w) h ((x - v_{0} + w) / w) d x = δ \int_{0}^{1} sin (2 π u) d u = 0$. Hence $F_{\pm} (v_{0}) = α$. It remains to show $F_{\pm} (x) < α$ for all $x < v_{0}$. For $x \leq v_{0} - w$ the perturbation vanishes, so $F_{\pm} (x) = F_{0} (x) < α$. For $x \in (v_{0} - w, v_{0})$, set $u : = (x - v_{0} + w) / w \in [0, 1)$ and compute the cumulative perturbation exactly:

\int_{v_{0} - w}^{x} \frac{δ}{w} sin (\frac{2 π (t - v_{0} + w)}{w}) d t = \frac{δ}{2 π} [1 - cos (2 π u)] .

(A4)

The baseline CDF margin satisfies $F_{0} (v_{0}) - F_{0} (x) = \int_{x}^{v_{0}} g_{0} (t) d t \geq 2 m_{0} (v_{0} - x) = 2 m_{0} w (1 - u)$. Hence $F_{+} (x) < α$ is guaranteed whenever $δ [1 - cos (2 π u)] / (2 π) < 2 m_{0} w (1 - u)$, i.e.

δ < \frac{4 π m_{0} w (1 - u)}{1 - cos (2 π u)} .

(A5)

The ratio $(1 - u) / [1 - cos (2 π u)]$ is bounded below by a positive constant on $(0, 1)$: it behaves like $1 / (2 π^{2} u^{2})$ near $u = 0$ and like $1 / (2 π^{2} (1 - u))$ as $u ↑ 1$, so it tends to $+ \infty$ at both ends (the cosine perturbation vanishes faster than the linear baseline margin) and is continuous and positive in between. Let $κ : = {inf}_{u \in [0, 1)} 4 π (1 - u) / [1 - cos (2 π u)] > 0$. Then $δ \leq κ m_{0} w$ suffices. Since $κ$ is a universal positive constant, the constraint $δ \leq m_{0} w$ (from part (i)) can be tightened to $δ \leq min (1, κ) m_{0} w$ if needed. For $F_{-}$, the cumulative perturbation has the opposite sign, so $F_{-} (x) = F_{0} (x) - δ [1 - cos (2 π u)] / (2 π) < F_{0} (x) < α$ on $(v_{0} - w, v_{0})$. Therefore $F_{\pm} (x) < α$ for all $x < v_{0}$, and ${VaR}_{α} (P_{\pm}) = v_{0}$.

(iii) The ES difference is

\begin{aligned} {ES}_{α} (P_{+}) - {ES}_{α} (P_{-}) & = \frac{1}{α} \int_{v_{0} - w}^{v_{0}} x [f_{+} (x) - f_{-} (x)] d x \\ & = \frac{2 δ}{α w} \int_{v_{0} - w}^{v_{0}} x sin (2 π (x - v_{0} + w) / w) d x . \end{aligned}

(A6)

Substituting $u = (x - v_{0} + w) / w$, the integral becomes

w \int_{0}^{1} (v_{0} - w + w u) sin (2 π u) d u .

The constant term vanishes because $\int_{0}^{1} sin (2 π u) d u = 0$. Integration by parts gives $\int_{0}^{1} u sin (2 π u) d u = - 1 / (2 π)$, so the integral equals $- w^{2} / (2 π)$ and $| {ES}_{α} (P_{+}) - {ES}_{α} (P_{-}) | = \frac{2 δ}{α w} \cdot \frac{w^{2}}{2 π} = w δ / (π α)$. □
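The two numerical facts used in Lemma A1, the value $κ \approx 2.76$ and the integral $\int_{0}^{1} u sin (2 π u) d u = - 1 / (2 π)$, can be checked with a short script. The sketch below is only a sanity check under simple assumptions (a uniform grid on the open interval and a midpoint rule); the helper names are illustrative and not part of the paper.

```python
import math

def ratio(u):
    """4*pi*(1-u) / (1 - cos(2*pi*u)), the bound on delta / (m0 * w) in (A5)."""
    return 4.0 * math.pi * (1.0 - u) / (1.0 - math.cos(2.0 * math.pi * u))

# Grid search on the open interval (0, 1); the ratio diverges at both ends.
grid = [i / 100000 for i in range(1, 100000)]
u_star = min(grid, key=ratio)
kappa = ratio(u_star)

def midpoint_integral(func, a, b, steps=20000):
    """Composite midpoint rule for a smooth integrand on [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

# Integral of u*sin(2*pi*u) on [0, 1]; integration by parts gives -1/(2*pi).
moment = midpoint_integral(lambda u: u * math.sin(2.0 * math.pi * u), 0.0, 1.0)
```

Because $κ > 1$, the condition $δ \leq min (1, κ) m_{0} w$ reduces to $δ \leq m_{0} w$ in this construction.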

Proof of Theorem A1.

Apply Lemma A1 with perturbation amplitude

δ = c \sqrt{α / n}

(A7)

for a constant $c > 0$ to be chosen (the constraint $δ \leq m_{0} w$ is satisfied for n large enough).

Step 1: ES separation. By Lemma A1(iii),

Δ : = | {ES}_{α} (P_{+}) - {ES}_{α} (P_{-}) | = \frac{w δ}{π α} = \frac{w c \sqrt{α / n}}{π α} = \frac{w c}{π \sqrt{n α}} .

(A8)

Step 2: $χ^{2}$ divergence. Since $f_{+}$ and $f_{-}$ differ only on $[v_{0} - w, v_{0}]$,

χ^{2} (P_{+} ∥ P_{-}) = \int_{v_{0} - w}^{v_{0}} \frac{{(f_{+} - f_{-})}^{2}}{f_{-}} d x = \int_{v_{0} - w}^{v_{0}} \frac{4 δ^{2} {sin}^{2} (2 π u)}{w^{2} f_{-} (x)} d x .

(A9)

Using $f_{-} \geq m_{0}$ on the support and $\int_{0}^{1} {sin}^{2} (2 π u) d u = 1 / 2$, this gives

χ^{2} (P_{+} ∥ P_{-}) \leq \frac{4 δ^{2}}{w^{2}} \cdot \frac{1}{m_{0}} \cdot w \cdot \frac{1}{2} = \frac{2 δ^{2}}{m_{0} w} = \frac{2 c^{2} α}{m_{0} w n} .

(A10)

Step 3: TV control. For product measures,

1 + χ^{2} (P_{+}^{n} ∥ P_{-}^{n}) = {(1 + χ^{2} (P_{+} ∥ P_{-}))}^{n},

(A11)

so $χ^{2} (P_{+}^{n} ∥ P_{-}^{n}) \leq exp (n χ^{2} (P_{+} ∥ P_{-})) - 1$, using $1 + x \leq e^{x}$. By the total-variation/$χ^{2}$ inequality,

TV {(P_{+}^{n}, P_{-}^{n})}^{2} \leq χ^{2} (P_{+}^{n} ∥ P_{-}^{n}) \leq exp (2 c^{2} α / (m_{0} w)) - 1 .

(A12)

For small c, this is approximately $2 c^{2} α / (m_{0} w)$. To ensure $TV (P_{+}^{n}, P_{-}^{n}) \leq 1 - 1 / \sqrt{2}$, it suffices to take c small enough that $exp (2 c^{2} α / (m_{0} w)) - 1 \leq {(1 - 1 / \sqrt{2})}^{2}$; since $α < 1 / 2$, this is satisfied when $c^{2} \leq (m_{0} w / 2) log (1 + {(1 - 1 / \sqrt{2})}^{2})$.

Step 4: Le Cam bound. By Le Cam’s two-point lemma,

inf_{T_{n}} max_{j \in {+, -}} E_{P_{j}} | T_{n} - {ES}_{α} (P_{j}) | \geq \frac{1}{2} Δ (1 - TV (P_{+}^{n}, P_{-}^{n})) \geq \frac{Δ}{2 \sqrt{2}} .

(A13)

Substituting the expression for $Δ$ from Step 1 and taking c at its maximum admissible value from Step 3,

\begin{aligned} inf_{T_{n}} sup_{P \in P} E_{P} | T_{n} - {ES}_{α} (P) | & \geq \frac{w}{2 \sqrt{2} π} \cdot \frac{\sqrt{(m_{0} w / 2) log (1 + {(1 - 1 / \sqrt{2})}^{2})}}{\sqrt{n α}} \\ & = : \frac{c_{L}}{\sqrt{n α}}, \end{aligned}

(A14)

where $c_{L} = w \sqrt{(m_{0} w / 2) log (1 + {(1 - 1 / \sqrt{2})}^{2})} / (2 \sqrt{2} π) > 0$ depends only on w and $m_{0}$. □
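The closed form for $c_{L}$ can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, $w = 1.39$ is the Student-$t_{5}$ value of $σ_{tail}$ quoted earlier in this appendix for $α = 1 %$, and the 1% quantile of $t_{5}$ (about $- 3.365$) is a standard table value supplied here as an assumed input. Because these inputs are rounded, the output should be read as an order-of-magnitude check against the reported constant, not as a reproduction of it.

```python
import math

def le_cam_constant(w, m0):
    """c_L = w * sqrt((m0*w/2) * log(1 + (1 - 1/sqrt(2))**2)) / (2*sqrt(2)*pi)."""
    tv_margin = (1.0 - 1.0 / math.sqrt(2.0)) ** 2
    c_max = math.sqrt(0.5 * m0 * w * math.log1p(tv_margin))
    return w * c_max / (2.0 * math.sqrt(2.0) * math.pi)

def student_t_pdf(x, nu):
    """Density of a standard Student-t distribution with nu degrees of freedom."""
    norm = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return norm * (1.0 + x * x / nu) ** (-(nu + 1) / 2)

# Illustrative inputs for Student-t5 at alpha = 1% (rounded; see lead-in).
w = 1.39            # tail window width, set to sigma_tail as in the text
var_1pct = -3.365   # assumed table value for the 1% quantile of t5
m0 = student_t_pdf(var_1pct - w, nu=5)  # density at the lower edge of the window
c_l = le_cam_constant(w, m0)
ratio_to_plug_in = c_l / w  # c_L relative to the plug-in constant C = sigma_tail
```

The ratio comes out at a fraction of one percent, which is why the plug-in benchmark rather than $c_{L}$ is used as the operational scale.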

Appendix B. Finite-Sample Calibration Details

The random-count correction in Section 3.3 arises because the tail count $K_{n} : = \sum_{i = 1}^{n} 1 {X_{i} \leq {VaR}_{α}} \sim Bin (n, α)$ is random. Conditional on $K_{n} = k$, the tail-average variance is $σ_{tail}^{2} / k$; a delta-method expansion of $E [1 / K_{n}]$ using $Var (K_{n}) = n α (1 - α)$ gives the unconditional variance $σ_{tail}^{2} f {(n, α)}^{2} / (n α)$, where $f (n, α) = \sqrt{1 + (1 - α) / (n α)}$.
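The random-count factor and one way of inverting the corrected floor for a target tolerance can be sketched as follows. The inversion is an assumption made for illustration (the exact sample-size rule of Section 3.3 is not reproduced here), and the function names are hypothetical. Writing $x = n α$ and $A = {(σ_{tail} / ε)}^{2}$, the requirement $σ_{tail} f (n, α) / \sqrt{n α} \leq ε$ is equivalent to $x^{2} - A x - A (1 - α) \geq 0$, so the smallest admissible x is the positive root of that quadratic.

```python
import math

def random_count_factor(n, alpha):
    """f(n, alpha) = sqrt(1 + (1 - alpha) / (n * alpha))."""
    return math.sqrt(1.0 + (1.0 - alpha) / (n * alpha))

def corrected_floor(sigma_tail, n, alpha):
    """Precision floor sigma_tail * f(n, alpha) / sqrt(n * alpha)."""
    return sigma_tail * random_count_factor(n, alpha) / math.sqrt(n * alpha)

def required_window(sigma_tail, eps, alpha):
    """Smallest integer n with corrected_floor(sigma_tail, n, alpha) <= eps.

    Solves x**2 - A*x - A*(1 - alpha) = 0 for x = n * alpha,
    with A = (sigma_tail / eps)**2, and rounds n up.
    """
    a = (sigma_tail / eps) ** 2
    x = 0.5 * (a + math.sqrt(a * a + 4.0 * a * (1.0 - alpha)))
    return math.ceil(x / alpha)

# Hypothetical asset with sigma_tail = 0.45% at alpha = 2.5% (decimal returns).
floor_250_bp = 1e4 * corrected_floor(0.0045, 250, 0.025)  # about 19 bp
n_for_50bp = required_window(0.0045, 0.0050, 0.025)
```

For these inputs the corrected floor at $n = 250$ is about 19 bp and the required window for a 50 bp tolerance is 56 days; a larger $σ_{tail}$ or a tighter tolerance lengthens the window roughly in proportion to ${(σ_{tail} / ε)}^{2} / α$.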

An Edgeworth correction is not used. At tail levels relevant in practice, the conditional tail distribution of heavy-tailed returns is highly skewed (conditional $γ_{1} \approx - 4$ for Student-$t_{5}$ at $α \leq 5 %$). At $k \leq 10$, the one-term Edgeworth approximation shifts the quantile inward and produces intervals narrower than the Gaussian approximation, contrary to Monte Carlo evidence. The random-count factor and direct Monte Carlo calibration are therefore preferred.

Under GARCH dynamics, the pattern in Table 1 separates finite-sample tail-count effects from volatility-regime effects. At short windows ($n = 250$), the unconditional $σ_{tail}$ can exceed the within-regime tail dispersion, producing $\hat{f} < 1$. At longer windows, paths traverse multiple volatility regimes and the empirical SD exceeds the unconditional asymptotic prediction ($\hat{f} > 1.2$, flagged with †). This is the serial-dependence inflation that the detrending analysis in Section 3.2 is designed to remove.

An important consequence is that at short windows under realistic GARCH dynamics, the plug-in benchmark overstates the precision floor ($\hat{f} < 1$), so the precision-fragile screen uses a wider tolerance than the within-regime noise alone would justify. Its conclusions remain valid as an upper bound on the fraction of unreliable model rankings.

Table A1 quantifies the sampling variability of the precision floor itself. Because $B = \hat{C} / \sqrt{n α}$ and $\hat{C}$ is a sample standard deviation of a small tail subsample, B inherits non-negligible sampling variance. Under Student-$t_{5}$ at FRTB parameters ($n = 250$, $n α = 6.25$), the coefficient of variation of $\hat{C}$ across 50,000 replications is 0.76, and the 90% simulation range of $B / \bar{B}$ spans $[0.25, 2.32]$. Doubling the window to $n = 500$ ($n α = 12.5$) reduces the CV to 0.56, and at $n = 1000$ it falls to 0.43. GARCH(1,1)-$t_{5}$ dynamics produce similar or slightly larger CVs at each window length, consistent with serial-dependence effects. These results reinforce the conservative reading: at standard FRTB windows, the precision floor is itself a noisy estimate, and marginal precision-fragile classifications should be treated as tentative.

Table A1. Sampling variability of the precision floor $B = \hat{C} / \sqrt{n α}$ at $α = 2.5 %$. CV$(\hat{C})$ is the coefficient of variation across 50,000 Monte Carlo replications. $B / \bar{B}$: 90% range gives the 5th–95th percentile ratio of the realised floor to its mean.


		Student- $t_{5}$		GARCH(1,1)- $t_{5}$
n	$n α$	CV $(\hat{C})$	$B / \bar{B}$ : 90% range	CV $(\hat{C})$	$B / \bar{B}$ : 90% range
250	6.25	0.76	$[0.25, 2.32]$	0.78	$[0.17, 2.43]$
500	12.50	0.56	$[0.40, 1.99]$	0.64	$[0.32, 2.13]$
750	18.75	0.49	$[0.48, 1.84]$	0.57	$[0.39, 2.00]$
1000	25.00	0.43	$[0.52, 1.76]$	0.52	$[0.44, 1.94]$

Appendix C. Data and Computational Details

Table A2 lists the 24 assets with tickers, asset class, sample period, and number of observations. All returns are daily log returns. Data source: Yahoo Finance.

Table A2. Asset universe.

Ticker	Name	Class	Sample
SP500	S&P 500	Equity	2000-01 – 2026-03
STOXX	Euro Stoxx 50	Equity	2004-04 – 2026-03
GDAXI	DAX	Equity	2000-01 – 2026-03
FCHI	CAC 40	Equity	2000-01 – 2026-03
FTSE100	FTSE 100	Equity	2000-01 – 2026-03
NIKKEI	Nikkei 225	Equity	2000-01 – 2026-03
HSI	Hang Seng	Equity	2000-01 – 2026-03
BOVESPA	Bovespa	Equity	2000-01 – 2026-03
NIFTY	Nifty 50	Equity	2007-09 – 2026-03
ASX200	ASX 200	Equity	2000-01 – 2026-03
ICLN	iShares Clean Energy	Equity	2008-06 – 2026-03
TLT	US 20Y+ Treasury	Bond	2002-07 – 2026-03
IBGL	Euro Gov Bond	Bond	2008-01 – 2026-03
DJCI	DJ Commodity	Commodity	2009-10 – 2021-01
GOLD	Gold	Commodity	2000-08 – 2026-03
WTI	WTI Crude	Commodity	2000-08 – 2026-03
NATGAS	Natural Gas	Commodity	2000-08 – 2026-03
CBU0	Copper	Commodity	2011-03 – 2026-03
BTC	Bitcoin	Crypto	2014-09 – 2026-03
ETH	Ethereum	Crypto	2017-11 – 2026-03
EURUSD	EUR/USD	FX	2003-12 – 2026-03
GBPUSD	GBP/USD	FX	2003-12 – 2026-03
USDJPY	USD/JPY	FX	2000-01 – 2026-03
AUDUSD	AUD/USD	FX	2006-05 – 2026-03

All returns are daily log returns $r_{t} = log (P_{t} / P_{t - 1})$, expressed in decimal form (not percentages). Each asset uses its own trading calendar; no cross-asset calendar synchronisation is imposed. Missing observations (holidays, halts) are dropped, and the rolling-window count n reflects actual trading days.

The forecasting setup, base forecasters, VaR/ES extraction, FZ recalibration specification, detrended SD measure, and plug-in benchmark are described in Sections 2 to 6. Additional implementation details follow.

TimesFM 2.5 quantile levels are $τ \in {0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95}$; the Student-t fit minimises the sum of squared quantile deviations with unconstrained degrees of freedom. For the non-overlapping window-length scaling test (Appendix D), each block of size n is treated as an independent calibration sample; all four forecasters are included.

For each asset, forecaster, and tail level, base forecasts $({\hat{V}}_{t}, {\hat{E}}_{t})$ are extracted from the corresponding predictive distribution. GJR-GARCH-t yields VaR and ES analytically from the fitted conditional Student-t distribution. Chronos-Small computes both quantities nonparametrically from 1000 simulated forecast paths. TimesFM 2.5 and Moirai 2.0 require a tail-completion step: a Student-t distribution is fitted to the model output, by quantile matching for TimesFM 2.5 and by sample fitting for Moirai 2.0, and VaR and ES are then extracted from the fitted distribution. The Student-t tail-completion step is a forecaster-specific extraction device, not part of the proposed audit.

As a robustness check, the headline ratios are recomputed using moving-average windows of 126 and 504 days (Table A3). The qualitative ranking is preserved in every specification.

Table A3. Sensitivity of median benchmark ratio R at $α = 2.5 %$ to the detrending moving-average window. The forecaster ranking is invariant across specifications; the absolute level shifts materially.


Forecaster	MA-126	MA-252	MA-504
GJR-GARCH-t	0.40	0.53	0.76
TimesFM-2.5	0.42	0.63	1.07
Chronos-Small	0.81	1.13	1.70
Moirai-2.0	0.41	0.59	0.98

Table A4 reports the effect of estimating $\hat{C}$ on each rolling calibration window rather than once over the full evaluation sample.

Table A4. Median benchmark ratio R under full-sample vs. rolling-window estimation of $\hat{C} = {\hat{σ}}_{tail}$.


	$α = 1 %$		$α = 2.5 %$
Forecaster	$R_{full}$	$R_{roll}$	$R_{full}$	$R_{roll}$
GJR-GARCH-t	0.63	0.83	0.57	1.06
TimesFM 2.5	0.57	1.34	0.62	1.35
Chronos-Small	1.00	1.41	1.13	1.62
Moirai 2.0	0.57	1.49	0.58	1.26

Table A5. Sensitivity of the tail-dispersion coefficient $\hat{γ}$ to the $σ_{tail}$ estimation method. Cross-sectional regression (10) with SEs clustered by asset (24 clusters).


$σ_{tail}$ measure	$\hat{γ}$	SE	$t (γ = 1)$	$\hat{b}$
Full-sample	0.866	0.055	$- 2.43$	$- 0.440$
Rolling (250-day)	0.758	0.055	$- 4.40$	$- 0.520$
Rolling + forecaster FE	0.727	0.064	$- 4.29$	$- 0.516$

Table A6. TimesFM first-stage extraction noise diagnostic. Dependent variable: $log SD ({\hat{r}}_{n})$ across 24 assets at $α = 1 %$. Column (A) regresses on $log σ_{tail}$ only; column (B) adds $log \bar{ν^{- 1}}$ (mean inverse fitted Student-t degrees of freedom) as a proxy for quantile-fit noise. HC1 standard errors in parentheses.


	(A)	(B)
$log σ_{tail}$	0.915	0.854
	(0.058)	(0.094)
$log \bar{ν^{- 1}}$	—	0.162
		(0.145)
$R^{2}$	0.820	0.828
N	24	24

Appendix D. Window-Length Scaling Test

A non-overlapping window-length scaling test at $α = 1 %$ varies the calibration-window length across $n \in {250, 500, 750, 1000}$ for all 24 assets and four forecasters, with step size equal to n so that adjacent windows share zero observations. Per-forecaster median slopes are $- 0.55$ (GJR-GARCH-t, 21 assets), $- 0.72$ (TimesFM 2.5, 20 assets), and $- 0.56$ (Moirai 2.0, 20 assets); Chronos-Small retains only 2 assets and is excluded from per-model inference. Across 63 retained (asset, forecaster) pairs, 63% of 95% CIs contain $- 0.50$ (Table A9).

The pooled fixed-effects estimate is $\hat{b} = - 0.585$ (SE $= 0.040$), with 95% CI $[- 0.66, - 0.51]$ (Table A7). Restricting the panel to GJR-GARCH-t and Moirai 2.0 gives $\hat{b} = - 0.542$ ($p = 0.298$), which does not reject $- 0.50$. Excluding TimesFM alone yields $\hat{b} = - 0.532$ ($p = 0.530$). Table A8 reports all four specifications.
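The mechanics of the rate test can be sketched as below. This is a simplified illustration, not the pooled fixed-effects estimator used in the paper: it fits a single log-log line by ordinary least squares and then tests a reported slope against $- 0.50$ with a normal approximation to the t-distribution. The function names and the synthetic SD series are assumptions made for the example.

```python
import math

def ols_slope(xs, ys):
    """Slope and intercept of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return b, y_bar - b * x_bar

def rate_test(b_hat, se, b_null=-0.5):
    """t-statistic and two-sided normal-approximation p-value for H0: b = b_null."""
    t_stat = (b_hat - b_null) / se
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return t_stat, p_value

# Synthetic SDs that follow the n**-0.5 rate exactly (illustration only).
windows = [250, 500, 750, 1000]
sds = [0.02 * (n / 250) ** -0.5 for n in windows]
slope, _ = ols_slope([math.log(n) for n in windows], [math.log(s) for s in sds])

# Pooled slope and standard error as reported in the text.
t_stat, p_value = rate_test(-0.585, 0.040)
```

On the synthetic series the fitted slope is $- 0.5$ by construction. For the reported pooled estimate the sketch gives a t-statistic of about $- 2.13$ and a p-value between 0.03 and 0.04; small differences from the tabulated p-value are expected because the inputs are rounded and the reference distribution is approximated.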

Table A7. Pooled fixed-effects rate test. $log (SD ({\hat{r}}_{n})) = α_{i} + β_{j} + b log (n)$ across 4 forecasters, 24 assets and window lengths $n \in {250, 500, 750, 1000}$, non-overlapping windows only.


Statistic	Value
Pooled slope $\hat{b}$	$- 0.585$
Standard error	$0.040$
95% CI	$[- 0.663, - 0.507]$
t-stat ( $H_{0} : b = - 0.50$ )	$- 2.13$
p-value (two-sided)	$0.033$
$R^{2}$	$0.811$
Forecasters	4
Assets	24
Observations	280

Table A8. Restricted pooled FE rate tests. Same specification as Table A7, restricted to subsamples that isolate different sources of contamination.

Sample	Cells	$\hat{b}$	SE	95% CI	$p (b = - 0.5)$
All forecasters	280	$- 0.585$	$0.040$	$[- 0.66, - 0.51]$	$0.033$
Excl. TimesFM	198	$- 0.532$	$0.052$	$[- 0.63, - 0.43]$	$0.530$
VaR-pass only	35	$- 0.780$	$0.131$	$[- 1.04, - 0.52]$	$0.032$
GJR + Moirai only	167	$- 0.542$	$0.041$	$[- 0.62, - 0.46]$	$0.298$

Table A9. Non-overlapping window-length scaling: per-forecaster summary. Median OLS slope from $log (SD) = a + b log (n)$ with HC1 robust standard errors. Chronos-Small retains only 2 assets due to short forecast histories.


Forecaster	Assets	Median $\hat{b}$	IQR	$- 0.5 \in$ CI	$R^{2}$ (med)
GJR-GARCH-t	21	$- 0.55$	$[- 0.64, - 0.45]$	67%	$0.911$
TimesFM-2.5	20	$- 0.72$	$[- 0.85, - 0.56]$	50%	$0.919$
Chronos-Small	2	$- 0.38$	$[- 0.39, - 0.37]$	50%	$0.610$
Moirai-2.0	20	$- 0.56$	$[- 0.68, - 0.44]$	75%	$0.913$
Pooled	63	$- 0.57$	$[- 0.73, - 0.45]$	63%	$0.918$

Appendix E. CC Statistic Details and Fisher Exact Test

The Christoffersen (1998) conditional-coverage statistic decomposes as $CC = UC + IND$, where $UC = - 2 log (α^{n_{1}} {(1 - α)}^{n_{0}} / [{\hat{π}}^{n_{1}} {(1 - \hat{π})}^{n_{0}}])$ is the Kupiec unconditional-coverage likelihood ratio (with $n_{1}$ VaR violations, $n_{0}$ non-violations, and $\hat{π} = n_{1} / (n_{0} + n_{1})$) and $IND$ is a first-order Markov independence test; under the null, $CC \sim χ_{2}^{2}$. For Chronos-Small, hit rates of $\approx 37 %$ (vs. the nominal $1 %$) produce $UC > 7000$, so the large CC values in Figure 2 are arithmetically expected, not anomalous.
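The decomposition can be sketched in Python using the standard Kupiec and first-order Markov likelihood ratios. The code is a minimal illustration under the usual textbook definitions, with hypothetical function names and the convention $0 \cdot log 0 = 0$; the hit sequence in the example is synthetic (a 37% hit rate over an assumed 4,000 forecast days) and is not the Chronos-Small data.

```python
import math

def _xlogy(x, y):
    """Return x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def kupiec_uc(hits, alpha):
    """Kupiec unconditional-coverage LR for a 0/1 hit sequence."""
    n1 = sum(hits)
    n0 = len(hits) - n1
    pi_hat = n1 / (n0 + n1)
    ll_null = _xlogy(n1, alpha) + _xlogy(n0, 1.0 - alpha)
    ll_alt = _xlogy(n1, pi_hat) + _xlogy(n0, 1.0 - pi_hat)
    return -2.0 * (ll_null - ll_alt)

def christoffersen_ind(hits):
    """First-order Markov independence LR (needs at least two observations)."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(hits[:-1], hits[1:]):
        counts[(prev, cur)] += 1
    n00, n01 = counts[(0, 0)], counts[(0, 1)]
    n10, n11 = counts[(1, 0)], counts[(1, 1)]
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    ll_null = _xlogy(n00 + n10, 1.0 - pi) + _xlogy(n01 + n11, pi)
    ll_alt = (_xlogy(n00, 1.0 - pi01) + _xlogy(n01, pi01)
              + _xlogy(n10, 1.0 - pi11) + _xlogy(n11, pi11))
    return -2.0 * (ll_null - ll_alt)

def conditional_coverage(hits, alpha):
    """CC = UC + IND."""
    return kupiec_uc(hits, alpha) + christoffersen_ind(hits)

# Synthetic example: 1,480 hits in 4,000 days (37% hit rate) at nominal 1%.
hits = [1] * 1480 + [0] * 2520
uc = kupiec_uc(hits, 0.01)
cc = conditional_coverage(hits, 0.01)
```

At a 37% hit rate each observation contributes roughly 2.1 to UC, so a few thousand forecast days are enough to push UC above 7,000.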

The Fisher exact $2 \times 2$ test on the binary Kupiec-pass vs. $R > 1$ contingency table gives $p = 0.35$; the test is underpowered because only 12 of 96 (asset, forecaster) pairs pass Kupiec at $α = 1 %$. The Christoffersen CC statistic, being continuous, provides a finer-grained diagnostic than the binary pass/reject classification (Table A15).

Appendix F. Conceptual Overview

Figure A1 summarises the paper’s logical chain from the known ES rate to the operational diagnostics.

Figure A1. Logical chain of the paper: from known ES rate to operational diagnostics.

Appendix G. Detrending Illustration

Figure A2 shows the raw correction path ${\hat{r}}_{n}$, its 252-day moving average, and the detrended residual for one representative asset (S&P 500, GJR-GARCH-t, $α = 2.5 %$).

Figure A2. Raw ES correction path, 252-day moving average, and detrended correction for S&P 500 under GJR-GARCH-t at $α = 2.5 %$.


Appendix H. HS-250 Naive Benchmark

Table A10 adds a Historical Simulation benchmark (HS-250) as a diagnostic reference for the effective tail-count constraint, not as a production recommendation. For each forecast date, the HS-250 base VaR and ES are the empirical quantile and conditional tail mean from the trailing 250-day window of observed returns. The FZ recalibration correction is then estimated on the same rolling audit window as for the other forecasters, so the benchmark ratio R is computed on an identical footing.
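A Historical Simulation base forecast of this kind can be sketched as follows. The tail convention is an assumption made for the example (the tail is taken to be the $k = max (1, \lfloor n α \rfloor)$ smallest returns in the window, with VaR the largest of them and ES their mean); other quantile conventions are equally common and the paper's exact choice is not stated here. Function names are hypothetical.

```python
import math

def hs_var_es(window, alpha):
    """Empirical alpha-quantile and tail mean of a window of returns.

    Assumed convention: the tail is the k = max(1, floor(n * alpha)) smallest
    returns; VaR is the largest of them and ES is their average.
    """
    k = max(1, math.floor(len(window) * alpha))
    tail = sorted(window)[:k]
    return tail[-1], sum(tail) / k

def rolling_hs(returns, alpha, n=250):
    """(VaR, ES) for each date index t >= n, using only returns[t - n:t]."""
    return [hs_var_es(returns[t - n:t], alpha) for t in range(n, len(returns))]

# Synthetic window of 250 equally spaced returns from -12.5% to +12.4%.
window = [i / 1000 for i in range(-125, 125)]
var_hs, es_hs = hs_var_es(window, 0.025)
```

With $n = 250$ and $α = 2.5 %$ this convention averages only six observations, which makes the effective tail-count constraint concrete.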

At $α = 2.5 %$, HS-250 achieves a median benchmark ratio $R = 0.96$, close to the plug-in precision floor. Pairwise comparisons between HS-250 and the four model-based forecasters at $α = 2.5 %$ yield 50 of 96 additional precision-fragile comparisons (52.1%).

Table A10. Forecaster comparison at $α = 2.5 %$ including the HS-250 naive benchmark. R is the median benchmark ratio; ${\bar{r}}_{n}$ is the mean recalibration shift; ${\hat{σ}}_{tail}$ is the median empirical tail dispersion. HS-250 is included as a diagnostic reference, not as a model-selection recommendation.


Forecaster	Median R	Mean ${\bar{r}}_{n}$	Median ${\hat{σ}}_{tail}$
GJR-GARCH-t	0.53	0.0081	0.85%
TimesFM-2.5	0.63	−0.0015	1.09%
Moirai-2.0	0.59	−0.0045	1.10%
Chronos-Small	1.13	−0.0319	0.91%
HS-250 (naive)	0.96	−0.0040	1.25%

Appendix I. Additional Tables and Figures

Figure A3. Detrended SD vs. plug-in benchmark by forecaster ($2 \times 2$ panels). Each panel shows one forecaster; the inset reports the median ratio R across all assets and all three tail levels $α \in {1 %, 2.5 %, 5 %}$.


Figure A4. Rolling-window ES correction $\hat{r}_{n}$ across asset classes at $α = 1\%$ under GJR-GARCH-t ($n = 250$, step $= 21$ days), shown as a scaling illustration. Shaded bands are plug-in 95% reference intervals $\bar{r} \pm 1.96\, \hat{C} / \sqrt{n α}$.

Figure A5. Required calibration window for 50 bp ES precision by asset at $α = 2.5\%$. The vertical line marks the 250-day default; assets to the right require longer windows.

Table A11. Asset-specific minimum calibration window n at $α = 2.5\%$. $\hat{σ}_{tail}$ is the empirical tail dispersion from GJR-GARCH-t residuals; $ε_{250}$ is the plug-in benchmark at the 250-day window (with finite-sample correction); right-hand columns report the required n for the stated tolerance $ε$ in basis points of return. Assets sorted by $\hat{σ}_{tail}$.

Asset	${\hat{σ}}_{tail}$ (%)	$ε_{250}$ (bp)	$ε = 25$ bp	$ε = 50$ bp	$ε = 100$ bp	$ε = 200$ bp
TLT	0.45	19	161	56	23	9
CBU0	0.55	24	228	74	29	12
USDJPY	0.60	26	262	84	32	13
ASX200	0.60	26	266	85	33	13
DJCI	0.63	27	287	91	34	14
FTSE100	0.65	28	306	96	36	15
SP500	0.70	30	348	107	39	16
AUDUSD	0.72	31	367	112	41	17
STOXX	0.75	32	392	119	43	18
GDAXI	0.81	35	453	135	48	20
EURUSD	0.81	35	455	135	48	20
GBPUSD	0.82	35	465	138	49	20
NIFTY	0.88	38	527	154	54	22
FCHI	0.91	39	561	163	56	23
GOLD	1.03	44	717	203	68	27
NIKKEI	1.16	50	893	248	80	31
BOVESPA	1.16	50	898	249	80	31
IBGL	1.28	55	1,091	298	94	35
ICLN	1.33	57	1,175	319	100	37
HSI	1.59	68	1,657	441	132	47
NATGAS	2.31	99	3,463	894	248	80
WTI	2.60	112	4,371	1,121	306	96
ETH	3.45	148	7,665	1,945	513	151
BTC	5.54	238	19,656	4,943	1,264	342
Median	0.85	36	496	146	51	21

Table A12. Asset-specific minimum calibration window n at $α = 1\%$. $\hat{σ}_{tail}$ is the empirical tail dispersion from GJR-GARCH-t residuals; $ε_{250}$ is the plug-in benchmark at the 250-day window; right-hand columns report the required n for the stated tolerance $ε$ in basis points of return. Assets sorted by $\hat{σ}_{tail}$. For standardised-unit requirements see Table 2.

Asset	${\hat{σ}}_{tail}$ (%)	$ε_{250}$ (bp)	$ε = 25$ bp	$ε = 50$ bp	$ε = 100$ bp	$ε = 200$ bp
USDJPY	0.54	34	473	119	30	8
ASX200	0.56	35	495	124	31	8
FTSE100	0.58	36	530	133	34	9
TLT	0.59	37	561	141	36	9
DJCI	0.68	43	739	185	47	12
CBU0	0.72	46	840	210	53	14
SP500	0.74	47	882	221	56	14
STOXX	0.79	50	998	250	63	16
GDAXI	0.83	52	1,094	274	69	18
AUDUSD	0.94	60	1,423	356	89	23
EURUSD	0.98	62	1,544	386	97	25
FCHI	0.98	62	1,549	388	97	25
GOLD	1.04	66	1,726	432	108	27
NIKKEI	1.06	67	1,802	451	113	29
NIFTY	1.12	71	1,992	498	125	32
BOVESPA	1.22	77	2,366	592	148	37
GBPUSD	1.22	77	2,379	595	149	38
ICLN	1.37	87	3,014	754	189	48
IBGL	1.76	112	4,979	1,245	312	78
NATGAS	1.78	112	5,061	1,266	317	80
HSI	2.29	145	8,414	2,104	526	132
ETH	3.00	190	14,404	3,601	901	226
WTI	3.04	192	14,746	3,687	922	231
BTC	7.20	456	83,038	20,760	5,190	1,298
Median	1.01	64	1,636	409	103	26
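
The entries of Table A12 can be approximated from the tabulated tail dispersion alone. The sketch below assumes the plug-in benchmark takes the form $\hat{σ}_{tail} / \sqrt{n α}$ with no finite-sample correction, so that the required window is $n = (\hat{σ}_{tail} / ε)^{2} / α$. The function names are illustrative, and small differences from the table are to be expected because the tabulated $\hat{σ}_{tail}$ is rounded to two decimals.

```python
import math


def plug_in_benchmark_bp(sigma_tail_pct, n, alpha):
    """Plug-in benchmark sigma_tail / sqrt(n * alpha), converted from % to bp."""
    return 100.0 * sigma_tail_pct / math.sqrt(n * alpha)


def required_window(sigma_tail_pct, eps_bp, alpha):
    """Smallest n with sigma_tail / sqrt(n * alpha) <= eps (no correction)."""
    ratio = 100.0 * sigma_tail_pct / eps_bp
    return math.ceil(ratio ** 2 / alpha)


# USDJPY row of Table A12: sigma_tail = 0.54%, alpha = 1%.
eps_250 = plug_in_benchmark_bp(0.54, n=250, alpha=0.01)
n_25bp = required_window(0.54, eps_bp=25, alpha=0.01)
n_50bp = required_window(0.54, eps_bp=50, alpha=0.01)
print(round(eps_250), n_25bp, n_50bp)  # compare with 34, 473, 119 in Table A12
```

Under this form, halving the tolerance roughly quadruples the required window, which is the pattern visible across the right-hand columns of the table.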

Table A13. Asset-level benchmark ratio R at $α = 1\%$ under GJR-GARCH-t. Assets sorted by R.

Asset	Detrended SD	Plug-in benchmark	Ratio R	Kupiec p
IBGL	0.0029	0.0112	0.26	0.000
GBPUSD	0.0023	0.0077	0.29	0.000
AUDUSD	0.0020	0.0060	0.33	0.000
NIFTY	0.0028	0.0071	0.39	0.000
WTI	0.0075	0.0192	0.39	0.000
HSI	0.0057	0.0145	0.39	0.000
BTC	0.0192	0.0456	0.42	0.003
GOLD	0.0029	0.0066	0.44	0.000
EURUSD	0.0028	0.0062	0.46	0.000
BOVESPA	0.0037	0.0077	0.48	0.000
ASX200	0.0017	0.0035	0.48	0.000
NIKKEI	0.0034	0.0067	0.50	0.000
STOXX	0.0025	0.0050	0.51	0.000
TLT	0.0020	0.0037	0.53	0.000
FCHI	0.0033	0.0062	0.54	0.000
GDAXI	0.0028	0.0052	0.54	0.000
SP500	0.0028	0.0047	0.59	0.000
FTSE100	0.0024	0.0036	0.67	0.000
USDJPY	0.0023	0.0034	0.67	0.000
ICLN	0.0059	0.0087	0.68	0.000
DJCI	0.0030	0.0043	0.71	0.000
CBU0	0.0034	0.0046	0.74	0.143
ETH	0.0158	0.0190	0.83	0.006
NATGAS	0.0101	0.0112	0.90	0.000
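
For reference, the Kupiec p-values in the last column of Table A13 come from the unconditional-coverage likelihood-ratio test. A minimal sketch of that test follows; the function name and the example counts are hypothetical and are not taken from the paper's data.

```python
import math


def kupiec_pof(n_obs, n_hits, alpha):
    """Kupiec unconditional-coverage LR statistic and chi-square(1) p-value."""
    def xlogy(x, y):
        # Convention 0 * log(0) = 0, so zero hits or all hits are handled.
        return 0.0 if x == 0 else x * math.log(y)

    pi_hat = n_hits / n_obs  # observed violation frequency
    log_l0 = xlogy(n_obs - n_hits, 1 - alpha) + xlogy(n_hits, alpha)
    log_l1 = xlogy(n_obs - n_hits, 1 - pi_hat) + xlogy(n_hits, pi_hat)
    lr = max(0.0, -2.0 * (log_l0 - log_l1))
    p_value = math.erfc(math.sqrt(lr / 2.0))  # survival function of chi2(1)
    return lr, p_value


# Hypothetical example: 25 violations in 1000 days against a 1% VaR.
lr, p = kupiec_pof(1000, 25, 0.01)
print(f"LR = {lr:.2f}, p = {p:.5f}")
```

The chi-square(1) survival function is written with `math.erfc` so that the sketch needs only the standard library.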

Table A14. Non-overlapping window-length scaling for 21 assets (GJR-GARCH-t, $α = 1\%$, step $= n$). OLS slope $\hat{b}$ from $\log(SD) = a + b \log(n)$ with HC1 robust standard errors. Assets sorted by slope. †: dropped for $< 3$ valid window lengths.

Asset	Windows	$n = 250$	$n = 500$	$n = 750$	$n = 1000$	$\hat{b}$	95% CI	$R^{2}$	$- 0.5 \in$ CI
NIFTY	30	0.0067	0.0043	0.0028	–	$- 0.78$	$[- 0.92, - 0.63]$	$0.982$	N
AUDUSD	34	0.0049	0.0032	0.0021	–	$- 0.77$	$[- 0.92, - 0.61]$	$0.978$	N
HSI	50	0.0144	0.0092	0.0068	0.0051	$- 0.73$	$[- 0.81, - 0.65]$	$0.993$	N
FCHI	51	0.0089	0.0073	0.0043	0.0037	$- 0.66$	$[- 0.84, - 0.49]$	$0.911$	Y
GOLD	50	0.0061	0.0045	0.0022	0.0030	$- 0.64$	$[- 1.02, - 0.27]$	$0.741$	Y
EURUSD	45	0.0074	0.0045	0.0048	0.0026	$- 0.64$	$[- 0.95, - 0.32]$	$0.824$	Y
GBPUSD	45	0.0068	0.0044	0.0029	0.0032	$- 0.62$	$[- 0.84, - 0.39]$	$0.906$	Y
STOXX	43	0.0050	0.0039	0.0039	0.0019	$- 0.58$	$[- 1.08, - 0.08]$	$0.679$	Y
USDJPY	53	0.0040	0.0034	0.0025	0.0017	$- 0.56$	$[- 0.83, - 0.29]$	$0.875$	Y
IBGL	30	0.0054	0.0042	0.0029	–	$- 0.56$	$[- 0.78, - 0.34]$	$0.926$	Y
FTSE100	51	0.0058	0.0044	0.0030	0.0028	$- 0.55$	$[- 0.65, - 0.46]$	$0.952$	Y
NATGAS	50	0.0226	0.0135	0.0112	0.0110	$- 0.54$	$[- 0.72, - 0.37]$	$0.940$	Y
TLT	45	0.0046	0.0023	0.0029	0.0020	$- 0.53$	$[- 0.78, - 0.27]$	$0.712$	Y
WTI	50	0.0166	0.0104	0.0098	0.0079	$- 0.51$	$[- 0.59, - 0.43]$	$0.960$	Y
ASX200	51	0.0051	0.0042	0.0030	0.0027	$- 0.46$	$[- 0.55, - 0.38]$	$0.954$	Y
ICLN	29	0.0122	0.0105	0.0072	–	$- 0.45$	$[- 0.70, - 0.20]$	$0.861$	Y
BOVESPA	50	0.0088	0.0079	0.0069	0.0050	$- 0.36$	$[- 0.60, - 0.13]$	$0.804$	Y
BTC	27	0.0429	0.0353	0.0287	–	$- 0.36$	$[- 0.44, - 0.27]$	$0.973$	N
NIKKEI	50	0.0089	0.0077	0.0054	0.0063	$- 0.32$	$[- 0.50, - 0.14]$	$0.753$	N
SP500	51	0.0073	0.0064	0.0060	0.0048	$- 0.27$	$[- 0.41, - 0.14]$	$0.879$	N
GDAXI	51	0.0077	0.0065	0.0064	0.0056	$- 0.21$	$[- 0.26, - 0.16]$	$0.943$	N
CBU0	–	–	–	–	–	–	–	–	–^†
DJCI	–	–	–	–	–	–	–	–	–^†
ETH	–	–	–	–	–	–	–	–	–^†
Median						$- 0.55$	IQR $[- 0.64, - 0.45]$		67%
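
The slope column can be checked approximately from the tabulated SD values. The sketch below fits $\log(SD) = a + b \log(n)$ by ordinary least squares to the four HSI entries. It assumes the regression is run on these per-window-length SDs; because the table is rounded, the result may differ from the reported slope in the second decimal. The HC1 standard errors are not reproduced, and the function name is illustrative.

```python
import math


def loglog_slope(window_lengths, sds):
    """OLS slope b in log(SD) = a + b * log(n)."""
    x = [math.log(n) for n in window_lengths]
    y = [math.log(s) for s in sds]
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# HSI row of Table A14 (non-overlapping windows).
b_hsi = loglog_slope([250, 500, 750, 1000], [0.0144, 0.0092, 0.0068, 0.0051])
print(f"slope = {b_hsi:.3f}  (Table A14 reports -0.73; theory predicts -0.50)")
```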

Figure A6. Distribution of OLS slopes for overlapping (grey, step $= 21$) and non-overlapping (red, step $= n$) windows. Eliminating overlap shifts the median from $-0.73$ to $-0.55$, moving it toward the theoretical $-1/2$ (solid vertical line).

Figure A7. Log–log scaling of recalibration SD versus window length using non-overlapping windows. Dashed grey: theoretical $b = -0.50$; solid colour: OLS fit.

Figure A8. Distribution of non-overlapping OLS slopes across all four forecasters. Pooled median $= -0.57$ (63 asset-forecaster pairs); vertical line: theoretical $-1/2$.

Figure A9. Non-overlapping window scaling by forecaster. Each grey line is one asset; dashed: theoretical $b = -0.50$.

Table A15. Directional VaR-first contingency check at $α = 1\%$. A cell is classified as “excess” if $R > 1$. The Spearman rank correlation between the CC statistic and R supersedes the binary Fisher test; see Figure 2.

	Ratio $\leq 1$	Ratio $> 1$
VaR pass (Kupiec $p > 0.05$ )	12	0
VaR reject (Kupiec $p \leq 0.05$ )	72	12
Spearman rank correlation	$ρ = 0.776, p < 0.001$
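
The binary Fisher test that the Spearman correlation supersedes can be reproduced from the four counts in Table A15. The sketch below is a pure-Python two-sided Fisher exact test that sums the hypergeometric probabilities of all tables no more likely than the observed one (the usual two-sided convention, assumed here); the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k.
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    k_min = max(0, row1 + col1 - total)
    k_max = min(row1, col1)
    p_obs = prob(a)
    # A small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


# Table A15: rows = VaR pass / VaR reject, columns = R <= 1 / R > 1.
p_fisher = fisher_exact_two_sided(12, 0, 72, 12)
print(f"Fisher exact p = {p_fisher:.2f}")
```

With no excess cells among only 12 VaR-pass cells, the binary test has little power, which is consistent with the Fisher p-value reported alongside Figure A10 and with the preference for the rank correlation.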

Table A16. Excess-dispersion case audit: all (asset, forecaster) cells with $R > 1$ at $α = 1\%$. CC stat is the Christoffersen conditional-coverage statistic; $\hat{σ}_{tail}$ is the tail dispersion scale. All 12 excess cells belong to Chronos-Small.

Asset	Forecaster	R	Kupiec p	CC stat	${\hat{σ}}_{tail}$	Cause
EURUSD	Chronos-Small	3.02	<0.001	11636.6	0.0057	Severe VaR miscalibration
USDJPY	Chronos-Small	2.86	<0.001	14189.2	0.0059	Severe VaR miscalibration
FTSE100	Chronos-Small	2.24	<0.001	13251.3	0.0087	Severe VaR miscalibration
ICLN	Chronos-Small	1.68	<0.001	8581.9	0.0123	Severe VaR miscalibration
ASX200	Chronos-Small	1.61	<0.001	14112.9	0.0078	Severe VaR miscalibration
NIKKEI	Chronos-Small	1.48	<0.001	12923.6	0.0111	Severe VaR miscalibration
SP500	Chronos-Small	1.32	<0.001	12789.4	0.0099	Severe VaR miscalibration
BTC	Chronos-Small	1.30	<0.001	7186.8	0.0297	Severe VaR miscalibration
GBPUSD	Chronos-Small	1.30	<0.001	11699.6	0.0044	Severe VaR miscalibration
HSI	Chronos-Small	1.08	<0.001	12690.9	0.0106	Severe VaR miscalibration
NIFTY	Chronos-Small	1.08	<0.001	9146.5	0.0080	Severe VaR miscalibration
IBGL	Chronos-Small	1.02	<0.001	8993.9	0.0061	Severe VaR miscalibration

Figure A10. Kupiec p-value vs. benchmark ratio R at $α = 1\%$ (binary variant of Figure 2). Each point is one (asset, forecaster) pair. The Fisher exact test gives $p = 0.35$ due to the small number of VaR-pass cells.

Table A17. Overlapping window-length scaling (step $= 21$). Same design as Table A14 but with overlapping windows. The steeper median slope ($-0.73$ vs. $-0.55$) is attributable to rolling-window overlap.

Asset	$n = 250$	$n = 500$	$n = 1000$	$n = 2000$	$\hat{b}$	95% CI	$R^{2}$	$- 0.5 \in$ CI
DJCI	0.0065	0.0054	0.0041	0.0002	$- 1.31$	$[- 2.38, - 0.23]$	$0.586$	Y
NIFTY	0.0055	0.0032	0.0016	0.0005	$- 1.18$	$[- 1.40, - 0.96]$	$0.964$	N
AUDUSD	0.0045	0.0028	0.0017	0.0005	$- 1.02$	$[- 1.25, - 0.78]$	$0.946$	N
ETH	0.0266	0.0215	0.0200	0.0016	$- 1.01$	$[- 1.95, - 0.06]$	$0.523$	Y
HSI	0.0133	0.0076	0.0039	0.0019	$- 0.97$	$[- 1.07, - 0.87]$	$0.989$	N
STOXX	0.0057	0.0039	0.0024	0.0008	$- 0.93$	$[- 1.19, - 0.66]$	$0.921$	N
GDAXI	0.0075	0.0054	0.0032	0.0012	$- 0.89$	$[- 1.11, - 0.67]$	$0.941$	N
ICLN	0.0127	0.0081	0.0041	0.0021	$- 0.86$	$[- 0.96, - 0.76]$	$0.986$	N
NIKKEI	0.0079	0.0058	0.0038	0.0011	$- 0.85$	$[- 1.20, - 0.51]$	$0.853$	N
BOVESPA	0.0091	0.0063	0.0037	0.0019	$- 0.77$	$[- 0.87, - 0.66]$	$0.981$	N
USDJPY	0.0046	0.0031	0.0021	0.0009	$- 0.74$	$[- 0.91, - 0.57]$	$0.948$	N
GOLD	0.0058	0.0029	0.0018	0.0012	$- 0.73$	$[- 0.84, - 0.61]$	$0.975$	N
NATGAS	0.0201	0.0128	0.0087	0.0046	$- 0.71$	$[- 0.79, - 0.63]$	$0.987$	N
FCHI	0.0083	0.0065	0.0040	0.0024	$- 0.64$	$[- 0.76, - 0.52]$	$0.966$	N
EURUSD	0.0060	0.0044	0.0027	0.0017	$- 0.63$	$[- 0.70, - 0.57]$	$0.990$	N
GBPUSD	0.0057	0.0040	0.0025	0.0017	$- 0.61$	$[- 0.64, - 0.57]$	$0.997$	N
TLT	0.0046	0.0033	0.0023	0.0013	$- 0.60$	$[- 0.70, - 0.50]$	$0.971$	Y
IBGL	0.0056	0.0036	0.0023	0.0017	$- 0.57$	$[- 0.62, - 0.52]$	$0.991$	N
ASX200	0.0051	0.0043	0.0030	0.0016	$- 0.55$	$[- 0.69, - 0.41]$	$0.938$	Y
SP500	0.0072	0.0056	0.0042	0.0023	$- 0.54$	$[- 0.67, - 0.42]$	$0.947$	Y
FTSE100	0.0057	0.0038	0.0027	0.0020	$- 0.50$	$[- 0.54, - 0.46]$	$0.994$	Y
BTC	0.0428	0.0325	0.0277	0.0186	$- 0.37$	$[- 0.44, - 0.29]$	$0.959$	N
WTI	0.0143	0.0101	0.0086	0.0066	$- 0.34$	$[- 0.40, - 0.28]$	$0.969$	N
Median					$- 0.73$	IQR $[- 0.91, - 0.58]$		26%

Figure A11. Overlapping-window slope distribution (step $= 21$). Median $-0.73$; only 26% of CIs contain $-0.50$. Compare Figure A6.

Figure A12. Overlapping-window log–log scaling for S&P 500, BTC, and Natural Gas (step $= 21$). All three slopes are steeper than in the non-overlapping design (Figure A7).

Appendix J. ES Recalibration Precision Audit

References

Zwingmann, T.; Holzmann, H. Asymptotics for the expected shortfall. arXiv 2016, arXiv:1611.07222.
Bartl, D.; Eckstein, S. Optimal nonparametric estimation of the expected shortfall risk. arXiv 2024, arXiv:2405.00357.
Chen, S.X. Nonparametric estimation of expected shortfall. J. Financ. Econom. 2008, 6, 87–107.
Fissler, T.; Ziegel, J.F. Higher order elicitability and Osband’s principle. Ann. Stat. 2016, 44, 1680–1707.
Dimitriadis, T.; Bayer, S. A joint quantile and expected shortfall regression framework. Electron. J. Stat. 2019, 13, 1823–1871.
Patton, A.J.; Ziegel, J.F.; Chen, R. Dynamic semiparametric models for expected shortfall (and Value-at-Risk). J. Econom. 2019, 211, 388–413.
Pele, D.T.; Bolovaneanu, V.; Ginavar, A.T.; Lessmann, S.; Härdle, W.K. Recalibrating tail risk forecasts under temporal dependence, 2026. Available online: https://ssrn.com/abstract=6757685.
Acerbi, C.; Székely, B. Back-testing expected shortfall. Risk 2014, 27, 76–81.
Nolde, N.; Ziegel, J.F. Elicitability and backtesting: Perspectives for banking regulation. Ann. Appl. Stat. 2017, 11, 1833–1874.
Basel Committee on Banking Supervision. Minimum capital requirements for market risk. Technical Report d457; Bank for International Settlements, January 2019.
Glosten, L.R.; Jagannathan, R.; Runkle, D.E. On the relation between the expected value and the volatility of the nominal excess return on stocks. J. Financ. 1993, 48, 1779–1801.
Das, A.; Kong, W.; Sen, R.; Zhou, Y. A decoder-only foundation model for time-series forecasting. Proc. 41st Int. Conf. Mach. Learn. (ICML) 2024, Vol. 235, PMLR, 10148–10167.
Ansari, A.F.; Stella, L.; Turkmen, C.; Zhang, X.; Mercado, P.; Shen, H.; Shchur, O.; Rangapuram, S.S.; Pineda Arango, S.; Kapoor, S.; et al. Chronos: Learning the language of time series. Trans. Mach. Learn. Res. 2024, arXiv:2403.07815.
Woo, G.; Liu, C.; Kumar, A.; Xiong, C.; Savarese, S.; Sahoo, D. Unified training of universal time series forecasting transformers. Proc. 41st Int. Conf. Mach. Learn. (ICML), oral presentation, 2024, Vol. 235, PMLR, 53140–53164; arXiv:2402.02592.
Liu, C.; Aksu, T.; Liu, J.; Liu, X.; Yan, H.; Pham, Q.; Savarese, S.; Sahoo, D.; Xiong, C.; Li, J. Moirai 2.0: When less is more for time series forecasting. arXiv 2025, arXiv:2511.11698.
McNeil, A.J.; Frey, R.; Embrechts, P. Quantitative Risk Management: Concepts, Techniques and Tools, revised ed.; Princeton Series in Finance; Princeton University Press, 2015.
Christoffersen, P.F. Evaluating interval forecasts. Int. Econ. Rev. 1998, 39, 841–862.
Kupiec, P.H. Techniques for verifying the accuracy of risk measurement models. J. Deriv. 1995, 3, 73–84.

Figure 1. Detrended standard deviation of $\hat{r}_{n}$ versus the plug-in precision benchmark $\hat{C} / \sqrt{n α}$. Each point is one (asset, forecaster, $α$) cell. The dashed 45-degree line marks $R = 1$.

Figure 2. Christoffersen conditional-coverage statistic versus benchmark ratio R at $α = 1\%$. Each point is one (asset, forecaster) pair. The pooled Spearman correlation is $ρ = 0.776$; excluding Chronos-Small, $ρ = 0.513$.

Table 1. Finite-sample inflation factor $\hat{f} = \widehat{SD}(\bar{X}_{tail}) / (σ_{tail} / \sqrt{n α})$ from 50,000 Monte Carlo replications. †: $\hat{f} > 1.20$, indicating serial-dependence inflation beyond the i.i.d. finite-sample effect.

$α$	n	$k = n α$	Student-$t_{5}$	Skewed-$t_{5}$	GARCH-$t_{5}$
$1.0\%$	250	2.50	1.129	1.134	0.653
	500	5.00	1.124	1.119	0.918
	1000	10.00	1.063	1.044	1.080
	2000	20.00	1.032	1.022	$1.217^{†}$
$2.5\%$	250	6.25	1.101	1.117	1.006
	500	12.50	1.050	1.043	1.150
	1000	25.00	1.028	1.011	$1.284^{†}$
	2000	50.00	1.012	1.008	$1.441^{†}$
$5.0\%$	250	12.50	1.039	1.042	1.145
	500	25.00	1.022	1.016	$1.339^{†}$
	1000	50.00	1.003	1.005	$1.494^{†}$
	2000	100.00	1.006	1.008	$1.587^{†}$

Table 2. Minimum calibration window n (trading days) for tolerance $ε$ at tail level $α$. Cells are computed from Equation (7), with C calibrated from a Student-$t_{5}$ reference distribution and $f(n, α)$ from Equation (8). The implicit equation is solved by fixed-point iteration. $ε$ is expressed in standardised units.

$α$	C	$ε = 10 %$	$20 %$	$30 %$	$50 %$
0.5%	1.491	44,660	11,311	5,132	1,958
1.0%	1.335	17,921	4,553	2,075	800
2.5%	1.152	5,348	1,365	627	246
5.0%	1.038	2,174	558	257	102
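
The fixed-point solution mentioned in the caption can be sketched as follows. Equations (7) and (8) are not reproduced in this section, so the code assumes the rule has the form $n = (C f(n, α) / ε)^{2} / α$ and takes the correction f as a user-supplied function. The placeholder `f_placeholder` is purely hypothetical and is not the paper's Equation (8); all names are illustrative.

```python
import math


def min_window(c, eps, alpha, f=lambda n, alpha: 1.0, tol=0.5, max_iter=100):
    """Solve n = (c * f(n, alpha) / eps)**2 / alpha by fixed-point iteration."""
    n = (c / eps) ** 2 / alpha  # start from the uncorrected asymptotic rule
    for _ in range(max_iter):
        n_next = (c * f(n, alpha) / eps) ** 2 / alpha
        if abs(n_next - n) < tol:
            return math.ceil(n_next)
        n = n_next
    raise RuntimeError("fixed-point iteration did not converge")


def f_placeholder(n, alpha):
    # HYPOTHETICAL correction that decays in the tail count n * alpha;
    # it stands in for Equation (8), which is not given in this section.
    return 1.0 + 0.5 / (n * alpha)


n_plain = min_window(1.335, 0.10, 0.01)                  # no correction
n_corr = min_window(1.335, 0.10, 0.01, f=f_placeholder)  # with placeholder
print(n_plain, n_corr)  # Table 2 reports 17,921 for this cell
```

Because any reasonable correction shrinks toward one as the tail count grows, the iteration settles within a few steps when started from the uncorrected rule.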

Table 3. VaR-miscalibration simulation. The data-generating process is GARCH(1,1)-$t_{5}$ with $α = 2.5\%$ and $n = 250$, using 30 paths of 10,000 days each. R is the ratio of raw ES-correction SD to the oracle benchmark $σ_{tail} / \sqrt{n α}$.

VaR model	Hit rate	Median UC stat	ES dispersion ratio (R)
Correct (true $σ_{t}$ )	0.0250	0.5	0.95
Mild (30% cond. + 70% uncond.)	0.0961	23.9	0.97
Moderate (15% cond. + 85% uncond.)	0.1314	45.9	1.33
Severe (unconditional $\bar{σ}$ )	0.1770	80.0	1.76

Table 4. Precision-fragile ES model comparisons. Each tail level has $\binom{4}{2} \times 24 = 144$ pairwise comparisons.

$α$	Comparisons	Fragile	% Fragile
1.0%	144	39	27.1%
2.5%	144	29	20.1%
5.0%	144	26	18.1%
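
The counts and shares in Table 4 follow from simple arithmetic, which the short check below reproduces. The forecaster and asset counts (4 and 24) are taken from the caption; the variable names are illustrative.

```python
from math import comb

n_forecasters, n_assets = 4, 24
n_pairs = comb(n_forecasters, 2) * n_assets  # 6 forecaster pairs per asset

fragile_counts = {0.010: 39, 0.025: 29, 0.050: 26}  # counts from Table 4
shares = {a: 100 * k / n_pairs for a, k in fragile_counts.items()}

for a, share in shares.items():
    print(f"alpha = {a:.1%}: {fragile_counts[a]}/{n_pairs} = {share:.1f}% fragile")
```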

Table 5. Precision-fragile share under alternative noise scales and detrending methods. The plug-in screen uses Equation (9). The paired bootstrap uses block length 21 days and 1000 replications.

Noise scale	Fragile @ $α = 1 %$	Fragile @ $α = 2.5 %$	Fragile @ $α = 5 %$
Plug-in $\hat{C} / \sqrt{n α}$	27.1%	20.1%	18.1%
Detrended SD: MA-126	16.0%	16.0%	15.3%
Detrended SD: MA-252	17.4%	17.4%	15.3%
Detrended SD: MA-504	20.8%	17.4%	17.4%
Detrended SD: HP ( $λ = 1600$ )	17.4%	17.4%	16.0%
Detrended SD: Rolling median 252	18.1%	17.4%	15.3%
Raw SD (no detrending)	36.8%	29.2%	25.0%
Paired block bootstrap (block=21)	—	22.9%	—

Table 6. Precision-fragile share under alternative $\hat{C}$ estimation methods at $α = 2.5\%$.

$\hat{C}$ estimation	Precision-fragile share
Full-sample $\hat{C}$	20.1%
Rolling-window ${\hat{C}}_{t}$	16.0%
Paired block bootstrap	22.9%

Table 7. Cross-sectional scaling regression. The sample contains 288 (asset, forecaster, $α$) cells. Standard errors are clustered by asset. Theoretical predictions are $b = -0.50$ and $γ = 1.00$.

Model	$\hat{b}$ (se)	$\hat{γ}$ (se)	$R^{2}$
$log (n α) + log ({\hat{σ}}_{tail})$	$- 0.436$ $(0.039)$	$0.870$ $(0.036)$	$0.676$
+ forecaster FE	$- 0.429$ $(0.028)$	$0.943$ $(0.027)$	$0.842$
$log (n) + log ({\hat{σ}}_{tail})$	—	$0.926$ $(0.047)$	$0.529$
$log ({\hat{σ}}_{tail})$ only	—	$0.926$ $(0.047)$	$0.529$

Table 8. Benchmark ratio $R_{i, m, α}$ at $α = 2.5\%$, across 24 assets. The numerator is the detrended standard deviation of $\hat{r}_{n}$ using a 252-day moving-average filter; the denominator is the plug-in precision benchmark.

Forecaster	Median	Q1	Q3	Min	Max
Chronos-Small	1.13	0.83	1.53	0.60	2.69
GJR-GARCH-t	0.53	0.48	0.63	0.34	0.94
Moirai-2.0	0.59	0.52	0.67	0.35	0.75
TimesFM-2.5	0.63	0.55	0.77	0.35	0.90
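
A minimal sketch of the benchmark ratio described in the caption is given below. It assumes a trailing moving average (whether the filter is trailing or centred is not stated here), a sample standard deviation of the detrended series, and a plug-in benchmark of the form $\hat{C} / \sqrt{n α}$. All function names and the synthetic path are illustrative.

```python
import math
import statistics


def detrended_sd(series, ma_window=252):
    """SD of the series after subtracting a trailing moving average."""
    resid = []
    for t in range(ma_window - 1, len(series)):
        ma = sum(series[t - ma_window + 1:t + 1]) / ma_window
        resid.append(series[t] - ma)
    return statistics.stdev(resid)


def benchmark_ratio(series, c_hat, n, alpha, ma_window=252):
    """R = detrended SD of the ES correction / plug-in benchmark."""
    return detrended_sd(series, ma_window) / (c_hat / math.sqrt(n * alpha))


# Synthetic ES-correction path: slow linear drift plus an alternating shock.
path = [1e-5 * t + (0.002 if t % 2 else -0.002) for t in range(1000)]
# c_hat = 0.0085 is only a stand-in value (an assumption), not an estimate.
r = benchmark_ratio(path, c_hat=0.0085, n=250, alpha=0.025)
print(f"R = {r:.2f}")
```

The filter removes the slow drift, so only the high-frequency variation of the correction enters the numerator; values of R below one then indicate dispersion inside the plug-in benchmark.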

Table 9. Headline results under forecaster subsets. Precision-fragile share is computed at $α = 2.5\%$. Spearman $ρ$ is the correlation between the Christoffersen conditional-coverage statistic and benchmark ratio R at $α = 1\%$.

Sample	Precision-fragile share	Median R	VaR-diagnostic $ρ$
All forecasters	20.1%	0.64	$0.776^{***}$
Excl. Chronos-Small	40.3%	0.58	$0.513^{***}$
Only calibrated (GJR, Moirai)	8.3%	0.56	$0.509^{**}$

Table 10. Precision-fragile share at $α = 2.5\%$ under alternative cutoff multipliers $κ$. Each specification has $\binom{4}{2} \times 24 = 144$ pairwise comparisons.

$κ$	Precision-fragile	Share (%)	n comparisons
0.5	23	16.0%	144
1.0	29	20.1%	144
1.5	45	31.2%	144
2.0	65	45.1%	144

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
