Preprint
Article

This version is not peer-reviewed.

Stratified Fréchet Distance: A Stratified Evaluation Framework for Conditional Time Series Generation Models

Submitted: 28 April 2026
Posted: 28 April 2026


Abstract
The Fréchet Inception Distance (FID), the standard metric for evaluating deep generative models, aggregates all data into a single score and thereby masks quality degradation in safety-critical minority conditions and in specific temporal regions of generated time series. We trace this dilution problem to a single cause—the absence of stratification—and propose Stratified Fréchet Distance (SFD), which partitions evaluation data into strata along a chosen axis and computes the Fréchet distance within each stratum. The choice of axis determines the diagnosis: stratifying by operating condition detects minority-condition failures (generalizing the existing Conditional FID), by temporal segment localizes late-cycle quality breakdown, and by their cross-product yields a two-dimensional condition×time quality map. Comparing SFD at different granularities further enables quantitative detection of inter-condition confounding. Experiments on four battery datasets (161 cells) with CVAE models show that SFD detects condition-dependent quality gaps of 1.97× where FID registers only 1.01×, with up to 79× higher sensitivity for minority conditions. Condition×time stratification reveals that the largest gap (8.69×) occurs in the latter half of 35 °C degradation curves—a physically interpretable failure to reproduce accelerated high-temperature degradation. Granularity comparison further detects temperature–C-rate (charge/discharge rate) confounding (T/J = 1.72×), providing actionable guidance on which conditioning variables a generative model should include. These findings are robust across three feature extractors and four datasets.

1. Introduction

The safety of lithium-ion batteries depends critically on operating temperature. High temperatures accelerate SEI (solid electrolyte interphase, a passivation layer that forms on the anode surface) growth and cathode decomposition, increasing the risk of thermal runaway (uncontrolled exothermic reaction)[1]; low temperatures promote lithium plating (metallic lithium deposition on the anode) and dendrite-driven internal short circuits[2]. A recent review[3] has made clear that quantitative characterization of degradation is a prerequisite for safety assessment.
Yet experimental data under the extreme conditions most critical for safety—high or low temperature, high C-rate (the ratio of charge/discharge current to battery capacity)—are costly to acquire and inevitably underrepresented in existing datasets[4,5]. Deep generative models (GANs[6], VAEs[7], diffusion models[8], flow matching[9]) have advanced to the point where synthesizing data for such underrepresented conditions is becoming feasible[10,11,12,13,14], as illustrated by DiffBatt[15] for battery degradation. This capability is not merely academic: in the emerging paradigm of battery digital twins, synthetic degradation data are increasingly used to augment limited experimental datasets and to simulate untested operating scenarios[16,17]. As safety and economic decisions increasingly rely on such digital representations, the trustworthiness of synthetic data becomes a practical engineering concern, not just a research question.
There is, however, a problem that is easy to overlook. Mayer et al.[18] showed that generative models trained by maximum likelihood carry an intrinsic bias toward majority conditions, under-representing minorities. The data generated at high temperature for safety evaluation may, in reality, be an inaccurate echo of the dominant 25 ° C pattern. How can we detect this?
The standard evaluation metric, FID[19], cannot answer this question: it aggregates all data into a single scalar irrespective of condition. Stein et al.[20] have documented further limitations of FID’s feature space. CFID[21] and FJD[22] address condition-level evaluation for discrete image classes, but their extension to continuous physical parameters, temporal diagnosis, and confounding detection remains unexplored.
Consider a model that generates good 25 ° C curves but poor 43 ° C curves (12% of the dataset; Figure 1). FID dilutes the 43 ° C failure into the 88% of correct samples and remains virtually unchanged. Dilution also occurs along the temporal axis: battery degradation proceeds through linear fade followed by nonlinear acceleration beyond the knee point (the inflection where capacity fade transitions from gradual to rapid)[23], and FID lets early-phase success mask late-phase failure. Furthermore, when temperature and C-rate are confounded in the data[5], single-condition evaluation cannot isolate their effects.
These three forms of dilution share one root cause: pooling heterogeneous subsets without distinction. This motivates Stratified Fréchet Distance (SFD), which brings stratification into the Fréchet distance: data are partitioned into strata along a chosen axis, the Fréchet distance is computed per stratum, and the results are averaged. Choosing different axes yields different diagnostics—condition failures, temporal-segment failures, condition×time failures, or inter-condition confounding.
We validate SFD on four battery datasets (161 cells) with CVAE models. Key findings: (i) SFD detects condition-dependent degradation at 1.97× where FID shows 1.01×, with up to 79× higher sensitivity for 6%-minority conditions; (ii) condition×time stratification localizes the worst gap (8.69×) to the latter half of 35 ° C curves, revealing failure to reproduce high-temperature acceleration; (iii) granularity comparison detects temperature–C-rate confounding (T/J = 1.72×), guiding model design.
The contributions are:
1. Unified framework. SFD traces FID's dilution to the absence of stratification and resolves it via per-stratum evaluation. It subsumes CFID (λ = 0) and generalizes it with a Between-SFD term (λ > 0) for inter-condition consistency.
2. Novel stratification axes. Temporal (SFD_t), condition×time (SFD_{c×t}), and joint multi-condition (SFD_joint) stratification, plus the Confounding Index (CI), are introduced—none available in prior metrics.
3. Mathematical interpretation. We connect SFD to the within/between covariance decomposition and the mutual information I(X; C), clarifying when SFD is most beneficial.
4. Empirical validation. Eight experiments across four datasets and three feature extractors confirm robustness.
The paper is organized as follows: Section 2 reviews background and formalizes the dilution problem; Section 3 presents SFD; Sections 4 and 5 describe the experiments and results; Section 6 discusses implications and limitations; Section 7 concludes.

2. Background: Conditional Generation and Its Evaluation Challenges

2.1. Conditional Time Series Generation Models

A conditional time series generation model takes a condition parameter c (such as temperature or C-rate) as input and learns the conditional distribution p ( x | c ) of the corresponding time series x. This framework was established in image generation through Conditional GAN (CGAN)[24] and Conditional VAE (CVAE)[25], and has since evolved into conditional diffusion models with Classifier-free guidance[26] and Conditional Flow Matching (CFM)[9].
In the adaptation to time series, TimeGAN[11] introduced an embedding network for capturing temporal dynamics, and TimeVAE[12] proposed a CVAE architecture tailored to time series structure. CSDI[13] applied conditional score-based diffusion models to time series imputation. In the battery domain, DiffBatt[15] is a recent effort that integrates conditional generation of degradation curves with lifetime prediction, demonstrating the practical effectiveness of synthetic data augmentation. These technical advances are making it increasingly feasible to synthesize degradation curves under conditions for which no experimental data exist.
However, if the output of a generative model is to be used for safety evaluation, a means of verifying generation quality on a per-condition basis is indispensable. In what follows, we trace the design principles of the current standard evaluation metric to analyze why it falls short of meeting this requirement.

2.2. Design and Limitations of the Fréchet Inception Distance (FID)

FID[19] is the most widely used metric for evaluating generative models in image synthesis. It approximates the distributions of real and generated data as Gaussian distributions in a feature space and computes their Fréchet distance (equivalently, the Wasserstein-2 distance):
\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)
where μ_r, Σ_r are the mean and covariance of features extracted from real data, and μ_g, Σ_g are their counterparts from generated data.
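The formula can be computed directly from sample moments. Below is a minimal NumPy/SciPy sketch using the `scipy.linalg.sqrtm` route common to public FID implementations; it is an illustration of the formula above, not the authors' code:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet (Wasserstein-2) distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; may carry tiny imaginary parts
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

For identical Gaussians the distance is 0; shifting one mean by a unit vector while keeping covariances equal adds exactly 1 to the score.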
This design carries an implicit assumption that is consequential in the context of conditional generation evaluation. Because FID approximates all data by a single Gaussian distribution, it does not distinguish the internal structure of the data—neither which operating condition a sample belongs to, nor which temporal segment of a time series it represents. When the dataset is homogeneous, this approximation is reasonable. For data such as battery degradation curves, where distributions differ markedly across conditions, fitting a single Gaussian to the mixture distribution is itself an inappropriate modeling choice.
When FID is applied to time series, time-series-specific feature extractors such as InceptionTime[27] and TS2Vec[28] replace the Inception network used in image generation[29]. However, changing the feature extractor does not alter the structural limitation of FID—the design of aggregating a mixture distribution into a single scalar. Stein et al.[20] empirically analyzed FID’s limitations and reported that the Inception-V3 feature space can fail to reflect the perceptual quality of generated samples. The issue we address in this work is even more fundamental: it originates not in the choice of feature extractor, but in the manner of aggregation itself.
Several metrics have been proposed for evaluating conditional generation. CFID[21] computes FID separately for each class label and averages the results, thereby addressing dilution across conditions. FJD[22] evaluates the joint distribution of images and conditions, and FPD[30] is designed for physical simulation data. However, all of these assume discrete class conditions (e.g., object category labels such as “dog” or “cat”). Their extension to continuous physical parameters (temperature, C-rate), temporal quality diagnosis, and detection of inter-condition confounding has not been investigated.

2.3. Formalization of the Dilution Problem

To gain a more precise understanding of FID’s limitation, we formalize the dilution problem.
Suppose there are K conditions {c_1, c_2, …, c_K}, each with a data proportion p(c_k). Consider a generative model that fails only under condition c^* while producing perfect output under all other conditions. Because FID evaluates the mixture distribution over all conditions, we have:
\mathrm{FID} \approx p(c^*) \cdot d_F\big(p(x \mid c^*),\, q(x \mid c^*)\big) + \sum_{k \neq *} p(c_k) \cdot 0
That is, the contribution of the failed condition c^* to FID is weighted by its proportion p(c^*).
Two important consequences follow from this formulation. First, when p ( c * ) = 0.12 (i.e., 43 ° C constitutes 12% of the dataset), a complete generation failure at 43 ° C contributes only 12% to the overall FID. This is condition-axis dilution. A paradoxical situation arises: the rarer a condition is, the more severely its failures are diluted, yet it is precisely these rare conditions—the extreme temperatures—that matter most for safety evaluation.
Second, this dilution structure is not confined to the condition axis. When a time series is divided into an early half and a late half, a failure in the late segment (the period of accelerated degradation) can be diluted by success in the early segment (the stable period). Furthermore, when multiple condition parameters such as temperature and C-rate are confounded, evaluating by temperature alone allows the effect of C-rate to leak into the temperature-stratified evaluation, preventing accurate diagnosis.
At their core, these problems all stem from pooling subsets with fundamentally different characteristics and evaluating them without distinction—a situation analogous to what stratified sampling in classical statistics is designed to address by respecting the internal structure of the population. In the following section, we formalize SFD based on this insight.
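The dilution arithmetic can be made concrete with a toy calculation. The numbers below are hypothetical (an assumed per-condition Fréchet distance of 50 at the failed 43 °C stratum), chosen only to illustrate the weighting in the approximation above:

```python
# Hypothetical illustration of condition-axis dilution.
d_fail = 50.0   # assumed Fréchet distance at the failed 43 °C condition
p_fail = 0.12   # 43 °C is 12% of the NASA dataset
K = 5           # number of temperature conditions

# Mixture-level view (FID): the other conditions contribute ~0,
# so the failure shrinks to 12% of its size.
fid_approx = p_fail * d_fail        # 6.0

# Stratified view: the failed stratum's distance is reported undiluted (50.0),
# and the equal-weight Within-SFD average still carries it at weight 1/K.
within_sfd = d_fail / K             # 10.0
```

The per-stratum report (50.0) is what surfaces the failure; the rarer the condition, the larger the gap between the two views.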

3. Proposed Method: Stratified Fréchet Distance

3.1. Core Idea

The root cause of the dilution problem analyzed in the previous section is that FID treats all data as a single distribution. In statistics, when a population is composed of multiple heterogeneous subsets (strata), it has long been known that stratified sampling can improve estimation accuracy by evaluating each stratum separately. SFD introduces this principle of stratification into the computation of the Fréchet distance.
Concretely, the data are partitioned into subsets according to a stratification variable s, the Fréchet distance is computed independently within each stratum, and the results are averaged:
\mathrm{SFD}(s) = \underbrace{\frac{1}{|S|} \sum_{s \in S} d_F\big(p(x \mid s),\, q(x \mid s)\big)}_{\text{Within-SFD}} + \lambda \cdot \underbrace{d_F\big(p(x),\, q(x)\big)}_{\text{Between-SFD}}
where S is the set of strata, d_F denotes the Fréchet distance, and λ ≥ 0 controls the weight assigned to the overall distributional consistency term.
A crucial feature of Within-SFD is that it assigns equal weight to every stratum. Under FID, the contribution of 43 ° C data (12% of the dataset) is roughly one-third that of 4 ° C data (39%), but under Within-SFD both receive the same weight. As a result, quality degradation in minority conditions is detected without being diluted by the majority.
An alternative design would weight each stratum by its sample proportion p ( c k ) , but this would reproduce precisely the dilution that SFD is designed to avoid. The equal-weight design reflects a deliberate choice to treat all conditions as equally important for quality assurance, which aligns with the safety-critical setting motivating this work: a failure at 43 ° C is no less consequential than a failure at 4 ° C simply because fewer batteries were tested at 43 ° C. When domain knowledge suggests that certain conditions are more important than others, a weighted average with user-specified weights w k can be substituted without altering the framework.
A practical concern with equal weighting is that strata with very few samples may yield unreliable Fréchet distance estimates, which then receive the same weight as well-estimated strata. Since the Fréchet distance requires fitting a Gaussian distribution (mean and covariance), strata containing fewer than approximately d + 1 samples (where d is the feature dimension) cannot produce a full-rank covariance estimate. In such cases, regularization (e.g., adding a small diagonal term ϵ I to the covariance matrix) is required. In our experiments, the smallest stratum (NASA 22 ° C) contains only 2 batteries; for this stratum, we apply covariance regularization and note that the resulting FD estimates should be interpreted with caution. As a practical guideline, we recommend that each stratum contain at least 5 samples for 17-dimensional features.
Between-SFD is the Fréchet distance over the entire mixture distribution and is identical to FID. By setting λ > 0 , the evaluation incorporates not only per-condition quality but also the consistency of inter-condition distributional relationships—for example, the physically expected monotonicity that degradation proceeds faster at higher temperatures. When λ = 0 , SFD reduces to Within-SFD alone, which is mathematically equivalent to CFID[21]. In other words, SFD can be viewed as a generalization of CFID through the introduction of the λ term.

3.2. Choice of Stratification Variable: What Question Does It Answer?

The diagnostic capability of SFD is determined by the choice of stratification variable s. Table 1 lists the stratification variables and their corresponding diagnostic roles. The key point is that every variant is a special case of Eq. (3): changing s alone yields a different diagnosis within the same unified framework.
Stratification by condition parameter ( SFD c ) answers the question: “Under which operating conditions does the generative model fail?” In the battery degradation context, setting s to be the temperature condition allows generation quality to be assessed separately at 15 ° C, 25 ° C, and 35 ° C. The information that generation is adequate at 25 ° C but inaccurate at 15 ° C cannot be obtained from FID, but it can be read directly from the per-stratum Fréchet distances of SFD c .
Stratification by temporal segment ( SFD t ) answers the question: “In which portion of the generated degradation curve does quality break down?” The time series is divided into K segments, each of which becomes a stratum. In battery degradation, the early-life phase of approximately linear capacity fade and the late-life phase of nonlinear acceleration (beyond the knee point) involve qualitatively different degradation mechanisms[23]. A failure pattern in which the generative model reproduces the early phase but fails to capture the abrupt acceleration in the late phase is difficult to detect with FID or SFD c , both of which compute features over the entire curve. SFD t detects this failure directly as an elevated Fréchet distance in the late-segment stratum.
Stratification by condition × time ( SFD c × t ) provides the finest granularity of diagnosis. The information that quality is poor “only in the latter half of the 35 ° C curves” is diluted under SFD c (which treats all of 35 ° C as one stratum) and under SFD t (which pools all conditions within each time segment). Only by stratifying along both the condition and temporal axes simultaneously does this cross-sectional quality gap become apparent.
Stratification by multiple condition parameters ( SFD joint ) serves as a tool for detecting inter-condition confounding. When stratification is performed by temperature alone, batteries tested at C-rate = 0.5C and those at C-rate = 3C coexist within the same 25 ° C stratum, allowing the effect of C-rate to infiltrate the temperature-based evaluation. Stratifying by temperature×C-rate yields strata in which both variables are held constant, enabling evaluation that is free from the confounding influence.

3.3. Confounding Detection: Confounding Index

SFD joint can be further leveraged to quantify inter-condition confounding. Specifically, we define the Confounding Index (CI) as the ratio of SFD values computed at two different stratification granularities:
\mathrm{CI} = \frac{\mathrm{SFD}_{\mathrm{joint}}(c_1, c_2)}{\mathrm{SFD}_{\mathrm{single}}(c_1)}
The interpretation of this index is straightforward. If CI is close to 1, adding c 2 as an additional stratification variable does not materially change the result, suggesting that little confounding exists between c 1 and c 2 . If CI > 1 , problems that were invisible under the coarser stratification by c 1 alone have surfaced upon the finer stratification by c 1 × c 2 , indicating that confounding between c 1 and c 2 is affecting generation quality. This provides a direct and actionable guide for the model design question: “Which conditioning variables should be included?”
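The index itself is a one-line ratio; the sketch below uses hypothetical SFD values chosen only to reproduce the 1.72× figure reported in the experiments:

```python
def confounding_index(sfd_joint: float, sfd_single: float) -> float:
    """Confounding Index: CI = SFD_joint(c1, c2) / SFD_single(c1).

    CI close to 1 -> little confounding between c1 and c2;
    CI > 1        -> finer stratification surfaced problems hidden under c1 alone.
    """
    return sfd_joint / sfd_single

# Hypothetical values chosen to yield the paper's reported 1.72x ratio:
ci = confounding_index(sfd_joint=34.4, sfd_single=20.0)
```

A CI well above 1 would translate directly into the design recommendation of adding c_2 as a conditioning variable of the generative model.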

3.4. Feature Extraction

Computing SFD requires the Fréchet distance, which in turn requires mapping time series data to feature vectors. Because the choice of feature extractor substantially affects the values of both SFD and FID, we compare three types of feature extractors in this study to verify that the advantage of SFD over FID is not an artifact of a particular feature space.
The first is a set of hand-crafted features (17 dimensions). It consists of segment-wise degradation rates obtained by dividing the degradation curve into 10 equal intervals (10 dimensions), summary statistics of the curve (mean, standard deviation, terminal value, and total degradation; 4 dimensions), and first-difference statistics (mean, standard deviation, and minimum of the first-order differences; 3 dimensions). These features are designed on the basis of domain knowledge to capture the rate and shape of degradation.
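A possible implementation of this 17-dimensional extractor is sketched below. The exact per-segment "degradation rate" definition (here, the capacity change across each segment) is our reading of the text, not a verified reference implementation:

```python
import numpy as np

def handcrafted_features(curve: np.ndarray) -> np.ndarray:
    """17-dim hand-crafted features for one normalized degradation curve.

    Sketch: 10 segment-wise degradation rates + 4 curve summary statistics
    + 3 first-difference statistics, per the description in the text.
    """
    segments = np.array_split(curve, 10)
    seg_rates = np.array([seg[-1] - seg[0] for seg in segments])   # 10 dims
    summary = np.array([curve.mean(), curve.std(),                 # mean, std
                        curve[-1],                                 # terminal value
                        curve[0] - curve[-1]])                     # total degradation
    d = np.diff(curve)
    diff_stats = np.array([d.mean(), d.std(), d.min()])            # 3 dims
    return np.concatenate([seg_rates, summary, diff_stats])        # shape (17,)
```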
Algorithm 1 Stratified Fréchet Distance
Require: Real data {(x_i, s_i)}, generated data {(x̂_j, s_j)}, stratification variable s, λ
1: Stratum set S ← unique({s_i})
2: within ← 0
3: for s ∈ S do
4:     X_s, X̂_s ← features of real and generated data in stratum s
5:     within ← within + d_F(X_s, X̂_s)
6: end for
7: within ← within / |S|
8: between ← d_F(all real data, all generated data)
9: return within + λ · between
The second is an InceptionTime-style 1D-CNN (32 dimensions). A network employing multi-scale convolutions (kernel sizes 1, 3, 5, 7) with residual connections is trained on a proxy task of temperature condition classification; the output of the fully connected layer following global average pooling serves as the feature representation. Training on a classification task is expected to yield a feature space that emphasizes inter-condition differences.
The third is a Temporal Autoencoder (16 dimensions). A 1D convolutional autoencoder is trained in an unsupervised manner on a reconstruction task, and the bottleneck-layer output is used as the feature representation. This approach does not require classification labels and captures the intrinsic structure of the data.

3.5. Algorithm and Computational Cost

The computation of SFD is summarized in Algorithm 1.
Computing the Fréchet distance requires the mean vector (O(nd)), the covariance matrix (O(nd²)), and the matrix square root (O(d³)). Because SFD performs |S| Fréchet distance computations, the total cost is O(nd² + |S|·d³). In settings such as battery degradation data, where the number of strata |S| is typically 3–6 and the feature dimension d is around 17, the additional cost relative to FID is negligible in practice.
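Algorithm 1 translates to a few lines of NumPy/SciPy. The following is a sketch under the choices stated in the text (equal stratum weights, λ = 0.5 default, εI covariance regularization for small strata), not the authors' code:

```python
import numpy as np
from scipy import linalg

def _fd(X, Y, eps=1e-6):
    """Fréchet distance between Gaussian fits of feature matrices X, Y of shape (n, d).

    The eps*I term regularizes the covariance of very small strata (n < d + 1),
    as the paper does for the 2-battery NASA 22 °C stratum.
    """
    d = X.shape[1]
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    sx = np.cov(X, rowvar=False) + eps * np.eye(d)
    sy = np.cov(Y, rowvar=False) + eps * np.eye(d)
    covmean, _ = linalg.sqrtm(sx @ sy, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(sx + sy - 2.0 * covmean))

def sfd(X_real, s_real, X_gen, s_gen, lam=0.5):
    """Algorithm 1: equal-weight Within-SFD average + lam * Between-SFD."""
    strata = np.unique(s_real)
    within = float(np.mean([_fd(X_real[s_real == s], X_gen[s_gen == s])
                            for s in strata]))
    between = _fd(X_real, X_gen)   # Fréchet distance on the mixture: identical to FID
    return within + lam * between
```

Setting `lam=0.0` recovers the CFID-equivalent Within-SFD alone.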

4. Experimental Setup

We validate SFD through eight experiments. The experiments are organized to correspond to the stratification variables introduced in Table 1, progressing from condition-axis verification (Experiments 1–6) through temporal-axis verification (Experiment 7) to confounding detection (Experiment 8), thereby increasing the granularity of stratification in a step-by-step manner.

4.1. Datasets

The NASA Battery Dataset[31] consists of capacity degradation curves from 33 batteries acquired at NASA Ames Research Center (Table 3). It spans five temperature conditions (4 ° C, 22 ° C, 24 ° C, 43 ° C, 44 ° C) with a pronounced imbalance in the number of batteries per condition. This imbalanced composition makes it well suited for testing SFD’s ability to detect dilution along the condition axis.
The BatteryLife Dataset[32] integrates voltage profile data from 128 batteries collected at three institutions: SNL (Sandia National Laboratories), MICH (University of Michigan), and CALB (China Aviation Lithium Battery), as shown in Table 2. The mean normalized voltage per cycle was used as a proxy for capacity degradation. Of particular interest for this study is the SNL subset, which possesses a composite structure of temperature (15/25/35 ° C) and C-rate (0.5/1/2/3C). As Table 4 reveals, however, the 0.5C and 3C discharge rates were tested only at 25 ° C. This constraint of the experimental design introduces a confound between temperature and C-rate that we exploit in Experiment 8 to test SFD’s confounding detection capability.
Table 2. Composition of the BatteryLife dataset (number of batteries by temperature condition).
0 ° C 15 ° C 25 ° C 35 ° C 45 ° C Total
SNL 9 34 18 61
MICH 21 19 40
CALB 8 2 14 3 27
Table 3. Composition of the NASA battery dataset.
Temperature   No. batteries   Proportion
4 °C          13              39.4%
22 °C         2               6.1%
24 °C         11              33.3%
43 °C         4               12.1%
44 °C         3               9.1%
Total         33              100%
Table 4. Temperature×C-rate structure of the SNL dataset. Empty cells indicate that no measurements were conducted under that combination, constituting a confound between temperature and C-rate.
        0.5C   1C   2C   3C
15 °C    –      4    5    –
25 °C    8     12    6    8
35 °C    –     12    6    –

4.2. Data Granularity

In this study, the entire degradation curve of a single battery is treated as one sample, and degradation curves from multiple batteries are collected to train a conditional generative model. This granularity differs from that of most battery ML studies—such as state-of-health (SOH) prediction and remaining useful life (RUL) estimation—which treat individual cycles within a single battery as separate samples (Table 5). Our setting corresponds to the application scenario of synthesizing degradation curves under unobserved conditions from a small number of experimentally tested batteries. In this scenario, the imbalance in the number of samples per condition (e.g., 9 batteries at 15 ° C versus 34 at 25 ° C in the SNL subset) can directly affect generation quality, making per-condition evaluation particularly important. All degradation curves were interpolated to 50 equally spaced points and normalized by their initial values. A 17-dimensional hand-crafted feature vector was then extracted from each curve and standardized using StandardScaler.
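The preprocessing described above (resample each curve to 50 points, normalize by the initial value) can be sketched as follows, assuming linear interpolation over a normalized cycle axis; standardization with scikit-learn's StandardScaler would then be applied to the extracted 17-dimensional features:

```python
import numpy as np

def preprocess(curves):
    """Resample each battery's degradation curve to 50 equally spaced points
    and normalize by its initial value (Section 4.2). A sketch assuming
    linear interpolation over a normalized cycle axis."""
    grid = np.linspace(0.0, 1.0, 50)
    out = []
    for c in curves:                              # one variable-length curve per battery
        c = np.asarray(c, dtype=float)
        x = np.linspace(0.0, 1.0, len(c))
        out.append(np.interp(grid, x, c) / c[0])  # initial value maps to 1.0
    return np.stack(out)                          # shape (n_batteries, 50)
```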

4.3. CVAE Experiment Setup

Starting from Experiment 2, we validate SFD using an actual deep generative model rather than simulated generators. A Conditional VAE (CVAE) is trained on the SNL subset of BatteryLife (61 batteries, 3 temperature conditions). The CVAE consists of a 2-layer encoder and decoder (hidden dimension 64, latent dimension 8) and is trained for 1500 epochs with the Adam optimizer (learning rate 10^-3). To create controlled degradations in generation quality, we exclude specific temperature conditions from the training data and compare the following four models (results are reported as the mean ± standard deviation over 3 random seeds):
  • Model A (baseline): trained on all three temperature conditions (15 ° C + 25 ° C + 35 ° C)
  • Model B (single-condition exclusion): the minority condition 15 ° C (14.8%) is excluded from training
  • Model C (label swap): Model A is used, but at generation time 15 ° C is supplied as 25 ° C
  • Model D (multi-condition exclusion): both 15 ° C and 35 ° C are excluded; training uses 25 ° C data only
Model D represents the most extreme case of condition omission, deliberately engineered to test whether FID is capable of detecting the resulting quality gap. The central question is: when a model has been trained on a single temperature but is asked to generate curves for all temperatures, does FID flag the problem—or does it remain silent?

5. Results

5.1. Can Condition-Axis SFD c Detect Failures in Minority Conditions? (Experiments 1–4)

The most fundamental question about condition-axis SFD c is whether it can reveal condition-dependent quality degradation that FID fails to detect. We address this question first through a simulation (Experiment 1) that establishes the principle, and then through a CVAE-based experiment (Experiment 2) that demonstrates practical effectiveness with a real generative model.

5.1.1. Experiment 1: Condition Confusion Simulation

We conducted a condition confusion simulation using the NASA dataset. A normal generator adds small noise ( σ = 0.02 ) to the real data, while a confused generator replaces data from a target temperature condition T fail with patterns from a different temperature T wrong . This setup mimics the scenario in which a generative model has failed to learn the degradation pattern at T fail and instead outputs data resembling T wrong .
The results are presented in Table 6. To quantify SFD’s advantage over FID, we introduce the detection advantage (DA), defined as the ratio of the failed condition’s per-stratum FD ratio to the overall FID ratio. DA measures how many times more sensitively SFD c detects the confusion compared to FID. DA exhibits a clear inverse correlation with the proportion of the minority condition. For confusions involving 22 ° C (6.1% of the dataset), DA reaches 64–79, meaning that SFD c ’s per-condition Fréchet distance detected the confusion with 64–79 times the sensitivity of FID. For 43 ° C (12.1%), DA remains high at 25–30. In contrast, for 4 ° C (39.4%, nearly a majority condition), DA is only 2.46, and SFD’s advantage over FID is modest.
These results are consistent with the predictions of the dilution analysis in Section 2. Failures under minority conditions contribute little to FID and are therefore difficult to detect, whereas SFD c evaluates each condition with equal weight and thereby avoids this dilution. From a practical standpoint, the alignment is noteworthy: extreme conditions—those that matter most for safety evaluation—are typically the minority, which is precisely where SFD c ’s advantage over FID is greatest.
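Under our reading of the definition, DA is a simple ratio of ratios; the numbers below are illustrative, not values from Table 6:

```python
def detection_advantage(fd_fail_conf, fd_fail_norm, fid_conf, fid_norm):
    """Detection advantage (DA), per our reading of the text: the failed
    condition's per-stratum FD ratio (confused / normal generator) divided
    by the overall FID ratio. DA >> 1 means the stratified metric reacts
    far more strongly to the confusion than FID does."""
    return (fd_fail_conf / fd_fail_norm) / (fid_conf / fid_norm)

# Illustrative (hypothetical) numbers: a 50x per-stratum blow-up that FID
# registers only as 2x yields DA = 25.
da = detection_advantage(fd_fail_conf=50.0, fd_fail_norm=1.0,
                         fid_conf=2.0, fid_norm=1.0)
```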

5.1.2. Experiment 2: CVAE-Based Generative Model Experiment

Experiment 1 established the principle through simulation, but real generative models produce subtler quality differences. Table 7 presents the results of the CVAE experiment with four models.
The result that commands the most attention is that of Model D. Trained on 25 ° C data alone, Model D exhibits an overall FID of 1.01× relative to the baseline Model A—virtually indistinguishable. Judged by FID alone, this model would appear to generate data of comparable quality to Model A, which was trained on all three temperature conditions. The picture painted by SFD c ’s per-condition Fréchet distances, however, is starkly different. The FD for 15 ° C, a condition excluded from training, has deteriorated to 1.97×. At 35 ° C, also excluded, the FD is 1.84×. Meanwhile, at 25 ° C—the sole training condition—the FD has actually improved to 0.82× relative to the baseline, revealing that Model D has overfit to 25 ° C at the expense of the other conditions.
Model B (15 ° C excluded) shows FID = 1.02× against FD(15 ° C) = 1.25×. The gap is less dramatic than for Model D, but this scenario is arguably more realistic: a practitioner who has collected data predominantly at 25 ° C and 35 ° C but has sparse coverage at 15 ° C might reasonably train a model on the available data and rely on FID to assess the result. FID’s 1.02× would provide false reassurance, whereas SFD c ’s 1.25× at the excluded condition would signal that 15 ° C generation quality requires scrutiny. The ability to detect this kind of subtle, graded degradation—not just catastrophic failure—is important for practical deployment.
This result provides empirical confirmation of the concern raised in the Introduction. A model that FID declares “no problem” is in fact generating inaccurate data under the minority conditions (15 ° C, 35 ° C) that are indispensable for safety evaluation. SFD c resolves this blind spot by identifying which conditions are problematic.
Table 7. CVAE experiment results (SNL, 3-seed average). The top row shows absolute values for Model A; subsequent rows show ratios relative to A. Even when FID shows no change, SFD c ’s per-condition FD detects condition-dependent quality degradation.
Model                FID     SFD_c   FD(15 °C)  FD(25 °C)  FD(35 °C)
A (all conditions)   12.52   22.47   23.52      15.06      10.06
Ratio relative to A:
B (15 °C excl.)      1.02×   1.09×   1.25×      0.97×      0.98×
C (label swap)       1.04×   1.03×   1.07×      1.00×      1.00×
D (25 °C only)       1.01×   1.43×   1.97×      0.82×      1.84×
Figure 2. Comparison of detection sensitivity between FID and SFD_c. FID fails to detect Model D's condition-dependent failure, whereas SFD_c's per-condition FD clearly reveals quality degradation at 15 °C and 35 °C.

5.1.3. Experiment 3: The Role of λ

The parameter λ in the SFD formulation (Eq. 3) controls the balance between Within-SFD (per-stratum quality) and Between-SFD (overall distributional consistency). We analyzed the effect of λ on detection performance using the 43 °C→4 °C confusion in the NASA dataset (Table 8).
At λ = 0 (Within-SFD only, equivalent to CFID), the confused-to-normal ratio reaches 507×, the highest detection sensitivity observed. As λ increases, the contribution of Between-SFD, which is identical to FID and therefore subject to dilution, causes the detection ratio to decrease gradually (327× at λ = 2.0). Nevertheless, the detection ratio exceeds FID by a wide margin at every value of λ.
Setting λ = 0 maximizes per-condition detection sensitivity but foregoes the evaluation of inter-condition distributional consistency. For example, the physically expected monotonicity, namely that degradation proceeds faster at higher temperatures, cannot be assessed when λ = 0. A value of λ = 0.5 offers a practical compromise between condition-level sensitivity and physical consistency, and is adopted as the default in this study.
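A hedged sketch of this trade-off, following the description above (Within-SFD as a frequency-weighted mean of per-stratum FDs, Between-SFD as the pooled FD, which coincides with FID); the exact weighting in the paper's Eq. (3) may differ, so treat this as an illustration:

```python
import numpy as np

def _sqrtm_psd(a):
    # Symmetric-PSD matrix square root via eigendecomposition.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(x, y):
    # Standard Frechet (Wasserstein-2) distance between Gaussian fits.
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1, s2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    s1h = _sqrtm_psd(s1)
    covmean = _sqrtm_psd(s1h @ s2 @ s1h)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

def stratified_fd(real_x, real_c, gen_x, gen_c, lam=0.5):
    """Within-SFD + lam * Between-SFD.  lam = 0 reduces to the CFID-style
    per-condition average; larger lam re-introduces the pooled (FID) term,
    which is the part subject to dilution."""
    conds, counts = np.unique(real_c, return_counts=True)
    weights = counts / counts.sum()
    within = sum(w * frechet_distance(real_x[real_c == c], gen_x[gen_c == c])
                 for c, w in zip(conds, weights))
    between = frechet_distance(real_x, gen_x)  # pooled over all strata = FID
    return within + lam * between
```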

5.1.4. Experiment 4: Cross-Dataset Validation

To confirm that the advantage of SFD_c is not specific to a particular dataset, we performed analogous confusion simulations on the three sub-datasets of BatteryLife (Table 9). In each case, the rarest condition was selected as the confusion target, and DA was computed.
The results are consistent with Experiment 1: the smaller the minority proportion, the larger the DA. At CALB (minority proportion 7.4%), DA = 17.4; at SNL (14.8%), DA = 11.4; both exceed a tenfold advantage of SFD_c over FID. At MICH (47.5%, nearly balanced), DA = 1.76, reflecting the fact that when minority and majority are roughly equal in size, dilution is inherently mild and FID itself retains some detection capability. This pattern confirms that the effectiveness of SFD_c is not an artifact of a specific dataset but depends on a structural property of the data: the degree of condition imbalance.

5.2. Relationship Between SFD and Existing Metrics (Experiments 5–6)

Experiments 1–4 established the effectiveness of SFD_c, but two natural questions arise. First, what is the precise relationship between SFD_c and the existing CFID[21]? Second, is the advantage of SFD_c an artifact of the particular feature extractor used? Experiments 5 and 6 address these questions.

5.2.1. Experiment 5: Relationship with CFID

As noted in Section III, SFD_c with λ = 0 is mathematically equivalent to CFID, and we confirmed numerically that the difference is exactly zero. Table 10 shows the D/A ratio of SFD_c as λ is varied.
Comparing Model D (trained on 25 °C only) to Model A (trained on all conditions), the detection ratio at λ = 0 (CFID-equivalent) is 1.99×, substantially exceeding FID's 1.14×. As λ increases and the Between-SFD term, which is identical to FID and therefore subject to dilution, gains influence, the detection ratio decreases gradually to 1.62× at λ = 1.0. Importantly, the detection ratio remains above FID at every value of λ.
The distinctive value that SFD adds beyond CFID lies in the ability, when λ > 0, to simultaneously assess the consistency of inter-condition distributional relationships. In battery degradation, it is physically established that degradation proceeds faster at higher temperatures. If a generative model produces reasonable data within each temperature condition but violates this monotonic relationship between conditions, for example by generating 35 °C curves that degrade more slowly than 15 °C curves, CFID would fail to detect this inconsistency because it evaluates conditions independently. The Between-SFD term in SFD_c with λ > 0 captures such cross-condition distributional misalignment. Moreover, the generalization from CFID to SFD is not limited to the addition of the λ term: the framework naturally extends to temporal stratification (SFD_t), cross-product stratification (SFD_c×t), and confounding detection (CI), capabilities that are entirely outside CFID's scope.
The per-condition FD breakdown also provides informative detail. The excluded conditions 15 °C and 35 °C show deterioration of 2.72× and 1.89× respectively, while 25 °C, the sole training condition, shows an improvement to 0.90×. This per-condition profile makes it immediately apparent that Model D's quality varies drastically across conditions, information that is entirely lost in the single FID score.

5.2.2. Experiment 6: Feature Extractor Comparison

To verify that the advantage of SFD_c is not specific to the hand-crafted features used in the preceding experiments, we repeated the CVAE experiment with three different feature extractors (Table 11). The outcome is unambiguous: across all three feature extractors, SFD_c's per-condition FD ratio exceeds the corresponding FID ratio.
An observation worth highlighting is that the choice of feature extractor dramatically affects FID's own detection capability. With hand-crafted features (17 dimensions), the FID ratio is 1.01×, completely unable to detect Model D's failure. With the InceptionTime-style 1D-CNN (32 dimensions), trained on a temperature classification proxy task that emphasizes inter-condition differences, the FID ratio rises to 12.18×. This shows that improving the feature extractor can substantially boost FID's sensitivity.
However, a crucial point emerges. Even with the InceptionTime-style features, SFD_c's per-condition FD surpasses FID further still, reaching 78.82× at 35 °C. Improving the feature extractor raises the floor for FID, but the diagnostic information that SFD_c provides, namely which conditions are problematic, remains inaccessible from FID regardless of how sophisticated the feature extractor becomes. This diagnostic capability stems from SFD's stratification structure and holds independently of the feature space.

5.3. Is the Condition Axis Sufficient?—The Need for Temporal-Axis SFD (Experiment 7)

Experiments 1–6 demonstrated that condition-axis stratification (SFD_c) resolves FID's blind spot for minority-condition failures. However, as discussed in Section III, dilution also occurs along the temporal axis. If a generative model accurately reproduces the early portion of a degradation curve (the linear fade phase) but fails in the late portion (the nonlinear acceleration phase), SFD_c, which computes features over the entire curve, may miss this failure. Experiment 7 tests whether temporal stratification (SFD_t) and its combination with condition stratification (SFD_c×t) can address this limitation.

5.3.1. (a) Partial Confusion Simulation

We first verify that SFD_t can correctly localize a failure confined to a specific temporal segment. Using the NASA dataset, we replaced only the first half or only the second half of 43 °C degradation curves with 4 °C patterns, leaving the other half intact (Table 12).
The results confirm the intended behavior. Under SFD_t with K = 2, only the corrupted segment shows a large increase in FD (173.7× for first-half corruption, 70.7× for second-half corruption), while the intact segment remains at exactly 1.00×. FID, which evaluates the curve as a whole, cannot distinguish which segment is problematic; SFD_t pinpoints the corrupted segment precisely.
When the number of segments is increased to K = 5, the highest detection sensitivity, FD = 181.5×, is observed for corruption in seg3 (corresponding to the 60–80% region of the cycle life, i.e., the vicinity of the knee point). This correspondence between SFD_t's peak sensitivity and the physical transition from linear to nonlinear degradation suggests that SFD_t is capable of reflecting changes in the underlying degradation mechanism within its evaluation.
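The segment-wise evaluation can be sketched as follows. As a simplification that is ours (the paper applies a feature extractor per segment), the raw curve values inside each segment stand in for the feature vector:

```python
import numpy as np

def _sqrtm_psd(a):
    # Symmetric-PSD matrix square root via eigendecomposition.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(x, y):
    # Frechet distance between Gaussian fits of two (n, d) matrices.
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1, s2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    s1h = _sqrtm_psd(s1)
    covmean = _sqrtm_psd(s1h @ s2 @ s1h)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

def sfd_temporal(real_curves, gen_curves, n_segments=2):
    """SFD_t sketch: split (n, T) curve batches into K equal time segments
    and compute the Frechet distance within each segment separately."""
    T = real_curves.shape[1]
    edges = np.linspace(0, T, n_segments + 1).astype(int)
    return {k: frechet_distance(real_curves[:, edges[k]:edges[k + 1]],
                                gen_curves[:, edges[k]:edges[k + 1]])
            for k in range(n_segments)}
```

A corruption confined to one half of the curves inflates only that segment's FD, which is the localization behavior the partial-confusion simulation verifies.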

5.3.2. (b) CVAE Condition×Time Diagnosis

We now apply the principle verified in simulation to a real CVAE model. Using the same Model A versus Model D setup as in Experiment 2, we compute SFD_c×t as a two-dimensional condition×time table with K = 2 (Table 13).
This experiment yields what we consider the most physically meaningful result of the study. The largest value in Table 13 is 8.69×, appearing in the latter half of the 35 °C curves. To appreciate the significance of this number, consider the physics involved. At 35 °C, battery degradation accelerates, and this acceleration becomes especially pronounced in the late cycles. Model D, having been trained exclusively on 25 °C data, has learned only the moderate degradation pattern of 25 °C and lacks the capacity to reproduce the accelerated degradation characteristic of 35 °C. Because this failure is most severe in the late cycles, the emergence of 8.69× at the 35 °C × second-half cell is physically natural.
A comparison across metrics makes the relationship between stratification granularity and detection power strikingly clear. FID (no stratification) stands at 1.01×, effectively blind to the problem. SFD_t (temporal stratification only) reaches 1.11×, a faint signal that is weak because all temperature conditions are pooled within each time segment, allowing the adequate quality at 25 °C to dilute the poor quality at 15 °C and 35 °C. SFD_c×t (condition×time stratification) achieves 3.18×, a clear detection enabled by avoiding dilution along both axes simultaneously.
This stepwise increase in detection power provides experimental confirmation of SFD’s design intent: the finer the stratification granularity, the more localized quality problems are brought to light.
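A minimal sketch of this condition×time map follows; as before, this is our simplification in which raw per-segment curve values stand in for extracted features:

```python
import numpy as np

def _sqrtm_psd(a):
    # Symmetric-PSD matrix square root via eigendecomposition.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(x, y):
    # Frechet distance between Gaussian fits of two (n, d) matrices.
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1, s2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    s1h = _sqrtm_psd(s1)
    covmean = _sqrtm_psd(s1h @ s2 @ s1h)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

def sfd_condition_time(real_curves, real_c, gen_curves, gen_c, n_segments=2):
    """SFD_c×t sketch: one FD per (condition, temporal segment) cell."""
    T = real_curves.shape[1]
    edges = np.linspace(0, T, n_segments + 1).astype(int)
    table = {}
    for c in np.unique(real_c):
        r, g = real_curves[real_c == c], gen_curves[gen_c == c]
        for k in range(n_segments):
            table[(c, k)] = frechet_distance(r[:, edges[k]:edges[k + 1]],
                                             g[:, edges[k]:edges[k + 1]])
    return table
```

Taking `max(table, key=table.get)` then points directly at the worst cell; in the Model D experiment above that cell is 35 °C × second half.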
5.3.3. (c) K Sensitivity

The number of temporal segments K controls a trade-off between detection sensitivity and statistical stability. At K = 3, the FD of the corrupted segment reaches 548×, the highest sensitivity observed, but at K ≥ 5, each segment contains too few data points for reliable feature estimation, and sensitivity decreases. For battery degradation curves interpolated to 50 points, K = 2 (25 points per half) offers a practical balance between stability and interpretability.

5.4. Can SFD Detect Inter-Condition Confounding? (Experiment 8)

In all experiments so far, the stratification variables have been temperature and temporal segment. In real-world battery data, however, temperature and C-rate are often not independently controlled (Table 4). Even if a model conditioned on temperature alone yields a favorable SFD_c, this may mask the fact that C-rate effects are entangled within each temperature stratum. Experiment 8 tests whether varying the stratification granularity of SFD can detect this confounding.

5.4.1. (a) CVAE: Temperature-Only vs. Temperature+C-Rate Conditioning

Using the confound-free subset of the SNL data (1C and 2C only; 3 temperatures × 2 C-rates; 45 batteries), we trained two CVAE variants. Model T is conditioned on temperature alone (cdim = 1), while Model J is conditioned on both temperature and C-rate (cdim = 2). Both models' outputs were evaluated using SFD_joint (temperature×C-rate stratification, 6 strata) (Table 14).
The result is unambiguous. In every temperature×C-rate cell, Model T's FD exceeds that of Model J, with an overall average ratio of T/J = 1.72×. This means that conditioning on temperature alone leaves C-rate effects as residual confounds that degrade generation quality.
Examining the per-cell ratios reveals a telling pattern. The largest T/J ratio, 1.80×, occurs at 15 °C/1C, while the smallest, 1.15×, occurs at 35 °C/1C. The 15 °C/1C combination represents a distinctive degradation pattern (low temperature and low C-rate) that temperature information alone cannot distinguish from the 25 °C average. By contrast, 35 °C/1C is relatively close to 25 °C/1C in degradation behavior, so temperature-only conditioning incurs less penalty.
This experiment demonstrates that simply comparing SFD values at different stratification granularities can yield a concrete model design recommendation: “temperature-only conditioning is insufficient; C-rate should be included.” SFD thus functions not only as a post-hoc quality assessment tool but also as an upstream design guide for choosing which conditioning variables to incorporate.
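The paper's Confounding Index is not spelled out in this section, but the granularity comparison it rests on can be sketched as follows: evaluate the same generated set under a coarse stratification (temperature only) and a fine one (temperature×C-rate) and compare the average within-stratum FDs. The function names and the ratio form are our assumptions; a ratio well above 1 suggests the unmodelled variable confounds quality. For simplicity the sketch assumes real and generated sets share the same label arrays:

```python
import numpy as np

def _sqrtm_psd(a):
    # Symmetric-PSD matrix square root via eigendecomposition.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(x, y):
    # Frechet distance between Gaussian fits of two (n, d) matrices.
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1, s2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    s1h = _sqrtm_psd(s1)
    covmean = _sqrtm_psd(s1h @ s2 @ s1h)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

def mean_within_fd(real_x, gen_x, labels):
    """Average per-stratum FD for an arbitrary stratification labelling
    (labels are assumed to apply to both real and generated samples)."""
    labels = np.asarray(labels)
    return float(np.mean([frechet_distance(real_x[labels == v],
                                           gen_x[labels == v])
                          for v in np.unique(labels)]))

def confounding_ratio(real_x, gen_x, temp, crate):
    """Hypothetical CI-style sketch: fine (T x C-rate) over coarse (T only)."""
    coarse = mean_within_fd(real_x, gen_x, temp)
    fine_labels = np.array([f"{t}/{c}" for t, c in zip(temp, crate)])
    fine = mean_within_fd(real_x, gen_x, fine_labels)
    return fine / coarse
```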

5.4.2. (b) Cycle-Count Confounding

A second form of inter-condition confounding arises from differences in cycle lifetime. When comparing degradation curves across batteries with different lifetimes, some form of temporal normalization is required. The most common approach is fractional normalization, which expresses each point as a percentage of total lifetime. However, this normalization can conceal physically meaningful differences in time scale.
In the NASA dataset, the mean cycle lifetime at 24 °C is 122 cycles, compared to only 40 cycles at 43 °C. Under fractional normalization, the “50% point” corresponds to cycle 61 for 24 °C but cycle 20 for 43 °C. Data from physically distinct degradation stages are mapped onto the same normalized time coordinate, obscuring the underlying difference.
To quantify this effect, we computed the FD ratio for a 43 °C→24 °C confusion under both fractional and physical (absolute cycle count) normalization. The FD ratio under fractional normalization is 3.66×, but under physical normalization it explodes to 463,920× (TCS = 126,724). This orders-of-magnitude discrepancy reveals that fractional normalization was suppressing the time-scale confound arising from the 3.0-fold difference in cycle lifetime. In physical time, the degradation patterns at 43 °C and 24 °C are vastly different, but fractional normalization compresses this difference into a deceptively small signal.
By contrast, in the SNL dataset, where cycle counts are approximately uniform across batteries (around 300 cycles), TCS ≈ 1.0 and normalization strategy makes no difference. The juxtaposition of NASA and SNL yields two lessons. First, when cycle lifetimes vary substantially across conditions, researchers must be aware that the choice of normalization strategy can critically affect evaluation outcomes. Second, the SFD framework itself can serve as a tool for validating normalization choices post hoc, by computing SFD under different normalizations and examining whether the results diverge.
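The two normalizations can be sketched as follows. This is a simplified illustration with linear interpolation; the grid choices, and the convention of holding the last observed value past end of life on the physical grid, are ours:

```python
import numpy as np

def fractional_resample(curve, n_points=50):
    """Resample a per-cycle capacity curve onto a 0..100% lifetime grid."""
    frac = np.linspace(0.0, 1.0, len(curve))
    return np.interp(np.linspace(0.0, 1.0, n_points), frac, curve)

def physical_resample(curve, max_cycles, n_points=50):
    """Resample onto an absolute cycle grid shared across all cells.
    np.interp clamps to the last value beyond the curve's final cycle."""
    grid = np.linspace(0.0, float(max_cycles), n_points)
    return np.interp(grid, np.arange(len(curve), dtype=float), curve)
```

A cell that fades from 1.0 to 0.8 in 40 cycles and one that takes 120 cycles become nearly identical under fractional normalization, while the physical grid preserves the threefold time-scale gap, which is exactly the confound the TCS analysis above quantifies.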

6. Discussion

6.1. Stratification Granularity and Detection Power

The most important insight to emerge from the eight experiments is the consistent pattern that FID’s dilution problem traces back to the absence of stratification, and that detection power improves as stratification granularity increases. Table 15 summarizes the detection performance of each SFD variant in the CVAE Model D experiment.
Starting from a situation where FID judges the model as “no problem” at 1.01×, stratification by condition reveals a 1.97× quality gap, and adding the temporal axis raises it further to 3.18×. This progressive increase demonstrates that SFD’s framework can deliver diverse diagnostics through a single control parameter: the choice of stratification granularity.
There is, however, a counterbalancing consideration. Finer stratification reduces the number of samples per stratum, which in turn degrades statistical stability. As the K-sensitivity analysis showed, feature estimation became unreliable at K ≥ 5. In practice, stratification granularity should be chosen to align with physically meaningful divisions (e.g., condition parameters, early-phase versus late-phase degradation) while ensuring that each stratum contains a sufficient number of samples (a rough guideline is at least five).

6.2. Positioning SFD Within the Landscape of Existing Metrics

The mathematical equivalence between SFD_c (λ = 0) and CFID means that SFD does not supersede prior work but rather extends it. CFID provides detection power through condition-axis stratification; SFD adds to this the Between-SFD term (λ > 0) for assessing inter-condition consistency, temporal stratification (SFD_t), and confounding detection (CI). All of these capabilities are realized by varying the stratification variable s in Eq. (3), requiring no new mathematical apparatus and positioning SFD as a natural generalization of CFID.
The feature extractor comparison (Experiment 6) showed that SFD's advantage is independent of the feature space. With InceptionTime-style features, FID's own detection power reaches 12.18×, yet SFD_c's per-condition FD surpasses it further still (FD(35 °C) = 78.82×). This result provides experimental support for the analysis in Section II: FID's limitation originates not in the inadequacy of the feature extractor but in the manner of aggregation, namely collapsing a mixture distribution into a single scalar.

6.3. Mathematical Structure of Dilution: Variance Decomposition and Entropy

Why does dilution occur in FID? A deeper understanding can be obtained by connecting SFD to well-established results in multivariate statistics and information theory. The analysis below does not introduce new mathematics; rather, it draws on the classical covariance decomposition and the entropy of Gaussian mixtures to provide an interpretive framework that clarifies why SFD avoids dilution and when SFD is most beneficial.
Let the conditional distribution under condition $c \in \mathcal{C}$ be $p(x \mid c) = \mathcal{N}(\mu_c, \Sigma_c)$. The covariance matrix of the mixture distribution $p(x) = \sum_c p(c)\, p(x \mid c)$ decomposes into within-group and between-group components:

$$
\Sigma_{\mathrm{mix}} \;=\; \underbrace{\sum_{c} p(c)\, \Sigma_c}_{\Sigma_W \ (\mathrm{within})} \;+\; \underbrace{\sum_{c} p(c)\, (\mu_c - \bar{\mu})(\mu_c - \bar{\mu})^{\mathsf{T}}}_{\Sigma_B \ (\mathrm{between})}
$$

where $\bar{\mu} = \sum_c p(c)\, \mu_c$ is the overall mean.
This decomposition has the same structure as the total-variance decomposition in analysis of variance (ANOVA): total variance = between-group variance + within-group variance. Because FID is computed using the mixture covariance $\Sigma_{\mathrm{mix}}$, it incorporates the between-group component $\Sigma_B$. When the conditional means $\mu_c$ differ substantially across conditions, as they do for degradation patterns at 4 °C versus 43 °C, $\Sigma_B$ becomes large and $\Sigma_{\mathrm{mix}}$ inflates well beyond any individual $\Sigma_c$.
This inflation is the mathematical substance of dilution. The Fréchet distance in FID is computed on the basis of $\Sigma_{\mathrm{mix}}$, so changes in a particular condition's covariance $\Sigma_c$ can be dwarfed by the sheer magnitude of $\Sigma_B$. Within-SFD, by contrast, computes the Fréchet distance directly from each condition's $\Sigma_c$, entirely bypassing $\Sigma_B$ and thereby detecting per-condition quality changes without dilution.
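The decomposition is easy to verify numerically. The following sketch (synthetic strata with identity conditional covariance; all parameter choices are ours) draws from a three-condition Gaussian mixture and checks that the empirical mixture covariance matches the within plus between sum:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # mu_c per condition
p = np.array([0.5, 0.3, 0.2])                           # condition weights p(c)
n = 200_000
labels = rng.choice(3, size=n, p=p)
x = rng.normal(size=(n, 2)) + means[labels]             # Sigma_c = I for all c

mu_bar = p @ means                                      # overall mean
sigma_w = np.eye(2)                                     # sum_c p(c) Sigma_c
dev = means - mu_bar
sigma_b = (p[:, None] * dev).T @ dev                    # sum_c p(c) dev_c dev_c^T

# The empirical mixture covariance should equal Sigma_W + Sigma_B
# up to sampling noise.
gap = np.max(np.abs(np.cov(x, rowvar=False) - (sigma_w + sigma_b)))
```

With well-separated means, the between term dominates the sum, which is the inflation the text identifies as the substance of dilution.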
This variance decomposition also connects to entropy. The differential entropy of a multivariate Gaussian is determined by the determinant of its covariance matrix: $H = \tfrac{1}{2} \ln\!\left( (2\pi e)^d \, |\Sigma| \right)$. Consequently, the entropy of the mixture distribution is always at least as large as the conditional entropy:

$$
H(X) \;\geq\; H(X \mid C) \;=\; \sum_c p(c)\, H(X \mid C = c)
$$

The gap $I(X; C) = H(X) - H(X \mid C)$ is the mutual information, which quantifies how strongly the condition parameter C influences the distribution of the time series X. The larger $I(X; C)$ is, that is, the more the distribution varies across conditions, the more severe the dilution in FID becomes.
From this analysis, the structural meaning of SFD comes into sharp focus. FID measures a “whole-distribution distance” corresponding to $H(X)$; the more the conditions contribute (large $I(X; C)$), the coarser this evaluation becomes. Within-SFD measures a “per-condition distribution distance” corresponding to $H(X \mid C)$ and is unaffected by the magnitude of $I(X; C)$. Between-SFD evaluates the consistency of inter-condition distributional relationships and provides complementary information related to $I(X; C)$.
We note that this correspondence is a structural analogy rather than a strict equality. The Fréchet distance is a Wasserstein-2 distance and possesses different geometric properties from the KL divergence, which contains entropy differences directly. Nevertheless, the analogy yields an important practical implication: the larger $I(X; C)$ is for a given dataset, the more severely FID is affected by dilution and the greater the benefit of introducing SFD. Battery degradation data, where degradation patterns differ qualitatively across temperature conditions, represent a domain with high $I(X; C)$ and thus one where SFD offers the greatest advantage.
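The entropy relationships can also be made concrete. Since the mixture entropy has no closed form, a convenient device (ours, not the paper's) is to fit a single Gaussian to the mixture covariance: its entropy upper-bounds $H(X)$ by the maximum-entropy property, so the resulting gap upper-bounds $I(X;C)$. A small numeric sketch with unit conditional covariances:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy H = 0.5 * ln((2*pi*e)^d * |cov|) of N(mu, cov)."""
    d = cov.shape[0]
    return 0.5 * np.log((2.0 * np.pi * np.e) ** d * np.linalg.det(cov))

p = np.array([0.5, 0.3, 0.2])                           # condition weights
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # well-separated mu_c
dev = means - p @ means
sigma_mix = np.eye(2) + (p[:, None] * dev).T @ dev      # Sigma_W + Sigma_B

h_cond = gaussian_entropy(np.eye(2))   # H(X|C): every Sigma_c = I
h_mix_gauss = gaussian_entropy(sigma_mix)  # >= true mixture entropy H(X)
mi_bound = h_mix_gauss - h_cond        # upper bound on I(X;C)
```

The further apart the conditional means, the larger `mi_bound`, mirroring the claim that high $I(X;C)$ datasets are exactly where FID's dilution is worst.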

6.4. Practical Guidelines

We summarize practical guidelines for incorporating SFD into generative model development.
The need for condition-specific quality assurance is not hypothetical. In the battery digital twin paradigm, synthetic degradation data are increasingly used to augment training datasets, to simulate untested operating scenarios, and to support lifecycle management decisions[16,17]. When a digital twin generates synthetic 43 °C degradation curves to fill a gap in the experimental database, the downstream safety assessment is only as reliable as the quality of those curves. Conventional evaluation via FID would not flag a quality failure localized to 43 °C; SFD provides the means to do so.
When the goal is to identify which operating conditions exhibit poor generation quality, SFD_c is the appropriate choice. When the quality of a specific temporal region, such as the late-life acceleration phase, is of particular concern, as in safety evaluation, SFD_t or SFD_c×t should be employed. When the question is which conditioning variables to include in the model, the Confounding Index (CI) provides a direct answer.
Regarding λ: to maximize per-condition detection sensitivity, λ = 0 (CFID-equivalent) is optimal. To simultaneously verify physical consistency across conditions, λ = 0.5 offers a practical compromise. Regarding K: the number of temporal segments should be chosen in light of the time series length and physically meaningful boundaries (e.g., the transition from linear fade to accelerated degradation), rather than purely on the basis of detection sensitivity.

6.5. Limitations and Future Directions

The current formulation of SFD has several limitations.
The most significant is the assumption of discrete strata. In this study, the temperature conditions of the battery data were naturally discrete (15 °C, 25 °C, 35 °C), so discretization posed no issue. For data in which temperature varies continuously in, say, 0.1 °C increments, appropriate binning is required, and the choice of bin width can influence evaluation outcomes. Extending SFD to continuous stratification via kernel density estimation is an important direction for broadening its applicability.
Our experimental validation relies exclusively on CVAE as the generative model. While CVAE was chosen for its controllability—the ability to systematically exclude conditions from training—the question of whether SFD’s advantages hold for other generative architectures (GANs, diffusion models, flow matching) remains empirically open. We emphasize, however, that SFD is a property of the evaluation metric, not of the generative model. The dilution problem analyzed in Section II arises from FID’s aggregation structure and is independent of how the data were generated. Thus, we expect SFD’s advantage to persist across generative architectures, though experimental confirmation with other model families is a natural direction for future work.
A related concern is statistical rigor. Results are reported as 3-seed averages, but confidence intervals and significance tests are not provided. For ratios such as FID = 1.01× and SFD_c = 1.97×, the practical significance is visually apparent from the magnitude of the gap, but formal statistical testing, for example bootstrap confidence intervals on the FD ratio, would strengthen the claims and is planned for an extended version of this work.
Strata with very few samples pose a further challenge. The NASA dataset contains only 2 batteries at 22 °C, making Gaussian covariance estimation unreliable in that stratum. While covariance regularization mitigates numerical instability, the resulting FD values carry inherent uncertainty that is not reflected in the reported numbers. Future work should incorporate uncertainty quantification, for instance through bootstrap resampling of the per-stratum FD.
The choice of feature extractor also substantially affects SFD values. As Experiment 6 demonstrated, the FID ratio increased from 1.01× with hand-crafted features (17-dim.) to 12.18× with InceptionTime-style features (32-dim.)—a change of over an order of magnitude depending on the feature space. While the structural advantage of SFD (per-condition FD > FID) holds regardless of the feature extractor, the absolute detection sensitivity is strongly feature-dependent. Leveraging large-scale pre-trained time series encoders such as TS2Vec[28], or developing domain-agnostic feature extractors, are important avenues for improving SFD’s generality.
Although our validation is limited to battery degradation data, the problem that SFD addresses—condition-dependent quality degradation being diluted by FID in conditional time series generation—is not specific to batteries. SFD is applicable wherever condition parameters and distributional imbalance coexist. We outline several concrete scenarios below.
In industrial fault diagnosis, normal operating data are abundant, but sensor time series for specific fault modes (bearing damage, shaft misalignment, etc.) are scarce[33]. Generative models are increasingly used to synthesize fault data and rebalance training sets, but verifying whether the synthetic fault data faithfully reproduce actual fault patterns requires per-mode evaluation. Computing SFD_c by fault mode would enable diagnoses such as “bearing damage generation is adequate, but shaft misalignment generation is insufficient.” Moreover, because vibration patterns change qualitatively between the early and late stages of fault progression, SFD_t can provide temporally localized quality assessment.
In the synthesis of medical time series (ECG, EEG, etc.), data for rare diseases inevitably form the minority class[10,34]. If synthetic data quality is inadequate for a specific disease subtype, a classifier trained on such data risks failing to detect that disease. Stratifying by disease type (SFD_c) and by temporal phase (e.g., pre-seizure versus post-seizure for SFD_t) provides a means of verifying generation quality at a clinically meaningful granularity.
In materials science, time series data such as stress–strain curves and thermal analysis curves vary with chemical composition and processing conditions (sintering temperature, pressure, etc.)[35]. Data for extreme compositions or high-temperature processes are costly to acquire and tend to be underrepresented, giving rise to the same dilution structure observed in battery degradation.
Empirical validation in these domains is left for future work. However, the formulation of SFD is data-agnostic: given an appropriate feature extractor, it can be applied directly.

7. Conclusions

In this work, we proposed Stratified Fréchet Distance (SFD) as a unified framework for resolving the dilution problem of FID in the evaluation of conditional time series generation models. FID aggregates all data into a single score, causing it to overlook quality degradation in safety-critical minority conditions and in late-cycle regions where degradation accelerates. SFD introduces the classical statistical concept of stratification into the Fréchet distance, providing a framework in which data are partitioned into strata along a chosen axis and evaluated within each stratum.
Through eight experiments using four battery datasets (NASA, SNL, MICH, CALB; 161 cells in total) and CVAE models, we obtained the following findings (summarized in Table 15).
First, even in a setting where FID judges the model as “no problem” at 1.01×, SFD_c stratified by condition detects quality degradation in excluded temperature conditions at 1.97×. This result calls for a reconsideration of the current practice of relying on FID as the sole evaluation metric in generative model development, particularly from the standpoint of safety.
Second, SFD_c×t stratified jointly by condition and time detects the largest quality gap, 8.69×, in the latter half of the 35 °C degradation curves. This value signifies that the generative model has failed to learn the accelerated degradation pattern at elevated temperatures, and quantitatively demonstrates the risk of using its output for safety evaluation.
Third, comparing SFD at different stratification granularities (Confounding Index) reveals that confounding between temperature and C-rate degrades generation quality by a factor of 1.72×. This demonstrates that SFD can provide actionable guidance for the model design decision of which conditioning variables to include.
These findings hold robustly across three feature extractors and four datasets, confirming that SFD’s advantage does not depend on the choice of feature space or data source.
SFD raises a fundamental question that has been largely overlooked in the evaluation of conditional time series generation models. That overall average quality is satisfactory does not guarantee that quality is adequate under every condition and in every temporal region. In applications where safety is at stake, it is precisely the latter guarantee that is needed—and SFD provides the means to verify it, within the unified framework of choosing a stratification variable.

References

  1. Roman, D.; Saxena, S.; Robu, V.; Pecht, M.; Flynn, D. Machine learning pipeline for battery state-of-health estimation. Nat. Mach. Intell. 2021, 3, 447–456.
  2. Plett, G.L. Battery Management Systems, Volume II: Equivalent-Circuit Methods; Artech House, 2015.
  3. Han, X.; Lu, L.; Zheng, Y.; et al. A review on the key issues of the lithium ion battery degradation among the whole life cycle. eTransportation 2019, 1, 100005.
  4. Severson, K.A.; Attia, P.M.; Jin, N.; Perkins, N.; Jiang, B.; Yang, Z.; Chen, M.H.; Aber, M.; Chueh, W.C.; Ermon, S.; et al. Data-driven prediction of battery cycle life before capacity degradation. Nat. Energy 2019, 4, 383–391.
  5. Dos Reis, G.; Strange, C.; Sheridan, M.; Hynes, V. Lithium-ion battery data and where to find it. Energy AI 2022, 5, 100081.
  6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014, Vol. 27.
  7. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2014, arXiv:1312.6114.
  8. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020, Vol. 33, pp. 6840–6851.
  9. Lipman, Y.; Chen, R.T.Q.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow matching for generative modeling. In Proceedings of the International Conference on Learning Representations, 2023.
  10. Brophy, E.; Wang, Z.; She, Q.; Ward, T. Generative adversarial networks in time series: A systematic literature review. ACM Comput. Surv. 2023, 55, 1–31.
  11. Yoon, J.; Jarrett, D.; van der Schaar, M. Time-series generative adversarial networks. In Advances in Neural Information Processing Systems, 2019, Vol. 32.
  12. Desai, A.; Freeman, C.; Wang, Z.; Beaver, I. TimeVAE: A variational auto-encoder for multivariate time series generation. arXiv 2021, arXiv:2111.08095.
  13. Tashiro, Y.; Song, J.; Song, Y.; Ermon, S. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Advances in Neural Information Processing Systems, 2021, Vol. 34.
  14. Iwana, B.K.; Uchida, S. An empirical survey of data augmentation for time series classification with neural networks. PLoS ONE 2021, 16, e0254841.
  15. Eivazi, H.; Hebenbrock, A.; Wildermuth, R.; et al. DiffBatt: A diffusion model for battery degradation prediction and synthesis. arXiv 2024, arXiv:2410.23893.
  16. Wu, B.; Widanage, W.D.; Yang, S.; Liu, X. Battery digital twins: Perspectives on the fusion of models, data and artificial intelligence for smart battery management systems. Energy AI 2020, 1, 100016.
  17. Howey, D.A.; Roberts, S.A.; Viswanathan, V.; et al. Enabling battery digital twins at the industrial scale. Joule 2023, 7, 928–934.
  18. Mayer, P.; Luzi, L.; Siahkoohi, A.; Johnson, D.H.; Baraniuk, R.G. Improving fairness and mitigating MADness in generative models. arXiv 2024, arXiv:2405.13977.
  19. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017, Vol. 30.
  20. Stein, G.; Cresswell, J.C.; Hosseinzadeh, R.; Sui, Y.; Ross, B.L.; Villecroze, V.; Liu, Z.; Caterini, A.L.; Taylor, J.E.T.; Loaiza-Ganem, G. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. Adv. Neural Inf. Process. Syst. 2023, 36.
  21. Soloveitchik, M.; Diskin, T.; Morin, E.; Wiesel, A. Conditional Fréchet inception distance. arXiv 2021, arXiv:2103.11521.
  22. DeVries, T.; Romero, A.; Pineda, L.; Taylor, G.W.; Drozdzal, M. On the evaluation of conditional image generation. arXiv 2019, arXiv:1907.08175.
  23. Attia, P.M.; Grover, A.; Jin, N.; Severson, K.A.; Marber, T.M.; Liao, W.; Huber, M.H.; Ermon, S.; Braatz, R.D.; Chueh, W.C. Closed-loop optimization of fast-charging protocols for batteries with machine learning. Nature 2020, 578, 397–402.
  24. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
  25. Sohn, K.; Lee, H.; Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, 2015, Vol. 28.
  26. Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598.
  27. Fawaz, H.I.; Lucas, B.; Forestier, G.; Pelletier, C.; Schmidt, D.F.; Weber, J.; Webb, G.I.; Idoumghar, L.; Muller, P.A.; Petitjean, F. InceptionTime: Finding AlexNet for time series classification. Data Min. Knowl. Discov. 2020, 34, 1936–1962.
  28. Yue, Z.; Wang, Y.; Duan, J.; Yang, T.; Huang, C.; Tong, Y.; Xu, B. TS2Vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022, Vol. 36, pp. 8980–8987.
  29. Paul, D.; et al. Evaluating time series generative models using discriminative metrics. arXiv 2022, arXiv:2210.10227.
  30. Kansal, R.; Li, J.; Parise, B.; Duarte, J.; Nachman, B. Evaluating generative models in high energy physics. Phys. Rev. D 2023, 107, 076017.
  31. Saha, B.; Goebel, K. Battery data set. Technical report, NASA Ames Prognostics Data Repository, 2007.
  32. Tan, R.; Hong, W.; Tang, J.; Lu, X.; Ma, R.; Zheng, X.; Li, J.; Huang, J.; Zhang, T.Y. BatteryLife: A comprehensive dataset and benchmark for battery life prediction. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, KDD ’25, 2025, pp. 5789–5800.
  33. Pang, G.; Shen, C.; Cao, L.; van den Hengel, A. Deep learning for anomaly detection: A review. ACM Comput. Surv. 2021, 54, 1–38.
  34. Habashi, A.G.; Azab, A.M.; Eldawlatly, S.; Aly, G.M. Generative adversarial networks in EEG analysis: An overview. J. Neuroeng. Rehabil. 2023, 20, 40.
  35. Chen, C.; Zuo, Y.; Ye, W.; Li, X.; Deng, Z.; Ong, S.P. A critical review of machine learning of energy materials. Adv. Energy Mater. 2019, 10, 1903242.
Figure 1. Temperature-dependent capacity degradation curves from the NASA battery dataset. Degradation rate and pattern vary substantially across temperatures. Accurate reproduction of minority-condition behavior (43 °C, 44 °C) is a prerequisite for safety evaluation.
Table 1. Stratification variables of SFD and their correspondence to existing metrics. Every variant is a special case of the unified SFD formulation (Eq. 3).

Stratification variable s | Name | Diagnostic question | Prior work
None (pooled) | FID | What is the overall quality? | FID[19]
Condition c | SFD_c | Which conditions fail? | CFID[21]
Temporal segment k | SFD_t | Which time window fails? | (novel)
(c, k) | SFD_c×t | Which condition in which window? | (novel)
(c1, c2) | SFD_joint | Is there inter-condition confounding? | (novel)
Table 5. Comparison of data granularity across studies. In this work, each battery’s full degradation curve constitutes a single sample, and conditional generation is learned across multiple batteries.

Study | No. cells | Unit sample | Data points | Learning scope
SOH/RUL prediction | 4–8 | 1 cycle | Hundreds | Cycle sequence within a single battery
Severson[4] | 124 | 1 cycle | ∼96,700 | Cycle sequences across all batteries
DiffBatt[15] | ∼300 | Full curve | ∼300 | Degradation curves across all batteries
This work (NASA) | 33 | Full curve | 1,650 | Degradation curves across all batteries
This work (SNL) | 61 | Full curve | 3,050 | Degradation curves across all batteries
This work (all) | 161 | Full curve | 8,050 | Degradation curves across all batteries
Table 6. Condition confusion simulation results (NASA, representative patterns). DA (detection advantage) is the ratio of the failed condition’s FD ratio to the FID ratio, quantifying how much more sensitively SFD_c detects the confusion compared to FID.

Confusion pattern | Proportion | FID ratio | FD ratio | DA
22 °C→24 °C | 6.1% | 2.74 | 215.9 | 78.8
22 °C→44 °C | 6.1% | 82.2 | 5287.6 | 64.3
43 °C→24 °C | 12.1% | 6.41 | 192.4 | 30.0
43 °C→4 °C | 12.1% | 95.2 | 2344.4 | 24.6
4 °C→24 °C | 39.4% | 2700.1 | 6629.9 | 2.46
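The DA column in Table 6 is simply the FD-ratio column divided by the FID-ratio column. A minimal helper makes the arithmetic explicit (the name `detection_advantage` is ours, used purely for illustration):

```python
def detection_advantage(fd_ratio: float, fid_ratio: float) -> float:
    """Detection advantage (DA), as defined in Table 6: the failed
    condition's FD inflation divided by the pooled FID inflation
    under the same condition confusion."""
    return fd_ratio / fid_ratio

# First row of Table 6 (22 °C→24 °C confusion): 215.9 / 2.74 ≈ 78.8
```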
Table 8. Sensitivity analysis of λ (NASA, 43 °C→4 °C confusion). λ = 0 yields the highest sensitivity, but all values of λ substantially outperform FID.

λ | SFD (normal) | SFD (confused) | Ratio
0.0 | 0.006 | 3.21 | 507×
0.5 | 0.008 | 3.31 | 441×
1.0 | 0.009 | 3.41 | 392×
2.0 | 0.011 | 3.61 | 327×
Table 9. Cross-dataset validation (BatteryLife). DA is inversely correlated with minority proportion across all three sub-datasets, confirming that the advantage of SFD_c generalizes across datasets.

Dataset | N | Minority | Confusion | Proportion | DA
CALB | 27 | 25 °C | 25 °C→35 °C | 7.4% | 17.4
SNL | 61 | 15 °C | 15 °C→25 °C | 14.8% | 11.4
MICH | 40 | 45 °C | 45 °C→25 °C | 47.5% | 1.76
Table 10. Comparison of CFID and SFD_c (SNL, CVAE Model D / Model A ratio). At λ = 0, SFD_c coincides with CFID. For λ > 0, Between-SFD additionally evaluates inter-condition consistency. All values of λ exceed FID (1.14×).

Metric | D/A ratio
FID | 1.14×
CFID (= SFD_c, λ = 0) | 1.99×
SFD_c (λ = 0.25) | 1.85×
SFD_c (λ = 0.5) | 1.75×
SFD_c (λ = 0.75) | 1.68×
SFD_c (λ = 1.0) | 1.62×
Table 11. Feature extractor comparison (SNL, CVAE Model D/A ratio, 3-seed average). The per-condition FD ratio of SFD_c exceeds the FID ratio under all three feature extractors, confirming that SFD’s advantage is independent of feature space design.

Feature extractor | FID ratio | SFD_c ratio | FD(15 °C) | FD(35 °C)
Hand-crafted (17-dim.) | 1.01× | 1.43× | 1.97× | 1.84×
InceptionTime-style (32-dim.) | 12.18× | 18.79× | 17.49× | 78.82×
Autoencoder (16-dim.) | 2.30× | 6.38× | 12.61× | 20.50×
Table 12. Detection of partial confusion (NASA, 43 °C→4 °C, K = 2). SFD_t elevates the FD only for the corrupted segment; the intact segment remains precisely at 1.00×.

Corrupted segment | FID ratio | FD (corrupted) | FD (intact)
First half only | 39.2× | 173.7× | 1.00×
Second half only | 79.3× | 70.7× | 1.00×
Table 13. Two-dimensional diagnosis by SFD_c×t (SNL, CVAE Model D/A ratio, K = 2, 3-seed average). The maximum of 8.69× at 35 °C in the latter half indicates that the generative model fails to reproduce the accelerated degradation pattern at elevated temperature.

Condition | First half (0–50%) | Second half (50–100%)
15 °C (excluded) | 4.45× | 3.79×
25 °C (trained) | 0.75× | 0.91×
35 °C (excluded) | 5.79× | 8.69×
Table 14. Confounding detection experiment (SNL, 1C+2C subset, 45 batteries, 2-seed average). Model T (temperature-only conditioning) and Model J (temperature+C-rate conditioning) are evaluated by SFD_joint. Model T exceeds Model J in every cell, indicating that temperature-only conditioning leaves C-rate confounding that degrades generation quality.

Condition | Model T | Model J | T/J ratio
15 °C/1C | 1176.7 | 653.5 | 1.80×
15 °C/2C | 50.1 | 34.0 | 1.47×
25 °C/1C | 105.3 | 77.2 | 1.36×
25 °C/2C | 28.1 | 19.7 | 1.42×
35 °C/1C | 18.2 | 15.9 | 1.15×
35 °C/2C | 43.4 | 28.2 | 1.54×
Mean | 237.0 | 138.1 | 1.72×
Table 15. Stratification granularity and detection capability of SFD (CVAE Model D/A ratio). As the granularity of stratification increases, quality problems invisible to FID progressively come to light.

Stratification | Diagnostic question | D/A ratio
None (FID) | What is the overall quality? | 1.01×
Condition c (SFD_c) | Which conditions fail? | 1.97×
Time k (SFD_t) | Which time window fails? | 1.11×
c × k (SFD_c×t) | Which condition in which window? | 3.18×
c1 × c2 (CI) | Is there inter-condition confounding? | 1.72×
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.