Energy-Guided Counterfactual Generation for Faithful Model Interpretation

Abstract
This paper introduces a novel framework for generating counterfactual explanations that prioritize faithfulness to the underlying model behavior. The approach leverages energy-based modeling to constrain the generation of counterfactuals, ensuring they align with the model’s learned data distribution. By integrating this energy constraint with conformal prediction, the methodology achieves a balance between faithfulness and plausibility. Empirical evaluations demonstrate the effectiveness of the proposed framework in producing counterfactuals that accurately reflect model decision boundaries, offering improved interpretability for complex machine learning models. The method provides a way to distinguish trustworthy from unreliable models by focusing on the energy constraint.

I. Introduction

The growing reliance on machine learning models in high-stakes decision-making—such as loan approvals, hiring, and medical diagnosis—necessitates the development of methods that can explain their predictions in a manner that is both interpretable and actionable. Counterfactual explanations have emerged as a powerful paradigm in this regard, enabling users to understand model decisions by posing hypothetical "what-if" scenarios. Rather than attempting to deconstruct or approximate the internal workings of a model, counterfactual methods generate alternative inputs that lead to different outcomes, thereby illuminating how slight changes in input attributes could alter a model's decision. For example, one might ask: "This person was denied a loan. What minimal changes to their financial profile would have resulted in approval?"
Formally, given an individual instance x ∈ X = R^D and a target outcome y+ ∈ Y, a counterfactual explanation aims to find a modified instance x′ such that the black-box model Mθ : X → Y outputs the desired prediction, i.e., Mθ(x′) = y+, while x′ remains close to the original x. This typically involves optimizing a loss function that penalizes deviations from the target prediction while maintaining proximity to the original input: yloss(Mθ(x′), y+) + distance(x′, x). Since these explanations are generated using the original model, they inherently guarantee *local fidelity*—the ability of the explanation to match the model's prediction at the given input [1].
While local fidelity is an important criterion, it does not fully capture the quality or trustworthiness of an explanation. Techniques such as LIME [2] and SHAP [3] rely on local surrogate models, which may fail to accurately reflect the behaviour of the original model in regions of the input space with sparse data. Counterfactual explanations, by contrast, avoid such approximations. However, even with full fidelity, counterfactuals can still be misleading. This occurs because complex models, especially deep neural networks, are often underdetermined by the data—multiple different explanations can yield identical predictions, making it difficult to determine which explanation best reflects the model’s learned structure.
This observation shifts attention toward a more nuanced criterion: *faithfulness*. While fidelity refers to whether an explanation reproduces the model’s decision, faithfulness addresses whether the explanation aligns with the model’s true internal reasoning and learned relationships in the data. A counterfactual is considered faithful if it remains consistent with the statistical patterns and causal structures the model has internalized, rather than simply achieving the target output through arbitrary perturbations.
Much of the literature has focused on various desiderata for counterfactuals, including minimality or *closeness* to the original input, *sparsity* in the number of features changed [4], *actionability* in terms of real-world feasibility [5], and *plausibility* in remaining within the data manifold [6]. However, these criteria often overlook whether the explanation genuinely reflects what the model has learned. As a result, there is an urgent need to formalize and evaluate *faithfulness* as a distinct and essential property of counterfactuals.
In this work, we address this gap by introducing a new framework that centers faithfulness in the generation and evaluation of counterfactual explanations. Specifically, we contribute the following:
  • We demonstrate that fidelity, while necessary, is insufficient as a standalone metric for evaluating counterfactual explanations (Section III).
  • We formally define the notion of faithfulness and propose novel evaluation metrics that assess how well counterfactuals align with the model's learned representations and decision boundaries (Section IV).
  • We introduce ECCCo (Energy-Constrained Conformal Counterfactuals), a new algorithm designed to generate counterfactuals that are not only faithful but also computationally efficient and interpretable (Section V).
  • We validate our approach through extensive empirical experiments, demonstrating that ECCCo consistently produces counterfactuals that align with both model reasoning and human intuition, and exhibit plausibility only when contextually appropriate (Section VI).
To the best of our knowledge, this is the first approach explicitly designed to promote and evaluate faithfulness in counterfactual explanations. We anticipate that our proposed framework and algorithm will serve as a strong foundation for future work in this space, and offer valuable tools for practitioners seeking to improve the transparency, accountability, and trustworthiness of machine learning systems.

II. Background

Counterfactual explanations (CEs) have become a central tool in interpretable machine learning, particularly for classification tasks. While their applicability extends to regression settings as well [7], the bulk of research has concentrated on classification problems where the output domain is often represented in a one-hot-encoded format Y = (0,1)^K, with K denoting the number of classes. In such cases, counterfactual generation is frequently formulated as an optimization problem aimed at identifying alternative inputs that would result in a desired prediction by the model.
Most counterfactual methods adopt a gradient-based optimization framework, where a counterfactual instance is derived by minimizing a combination of prediction loss and a regularization term that governs feasibility or proximity. A general formulation of this optimization can be expressed as:
min_{Z′ ∈ Z^L} {yloss(Mθ(f(Z′)), y+) + λ cost(f(Z′))}     (1)
Here, yloss(·) is a loss function that encourages the model to output the target class y+, and cost(·) refers to one or more regularization terms that impose desired properties on the solution. The function f(·) maps from a latent counterfactual space to the input feature space, allowing for search either directly in the input domain or in some learned representation. The variable Z′ = {z_l}_{l=1}^L denotes a set of L candidate counterfactual representations, which provides a mechanism for exploring diverse or multiple counterfactual solutions.
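To make this formulation concrete, the sketch below implements the generic objective of Equation 1 for a differentiable PyTorch classifier with f taken as the identity map. The function name `generate_counterfactual`, the choice of cross-entropy for yloss, the L1 proximity cost, and the optimizer settings are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def generate_counterfactual(model, x, target, lam=0.1, lr=0.05, steps=200):
    """Minimal gradient-based counterfactual search (cf. Equation 1) with
    f(.) = identity: minimize yloss(M(x'), y+) + lam * cost(x', x)."""
    x_cf = x.clone().detach().requires_grad_(True)   # candidate counterfactual x'
    y_plus = torch.tensor([target])                  # desired target class y+
    opt = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x_cf.unsqueeze(0))            # M_theta(f(z'))
        yloss = F.cross_entropy(logits, y_plus)      # pushes the prediction toward y+
        cost = torch.norm(x_cf - x, p=1)             # L1 proximity penalty
        (yloss + lam * cost).backward()
        opt.step()
    return x_cf.detach()
```

Searching over a set of L candidates Z′ then amounts to running this loop from multiple initializations and keeping the preferred solutions.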
This formulation builds on the foundational approach introduced by Wachter et al., which we refer to as the Wachter method. In its simplest form, the method searches directly in the input feature space (f is the identity function), seeking a minimal perturbation that changes the model's output. Although effective, this baseline has inspired numerous extensions designed to address a variety of desiderata—key criteria that counterfactuals should ideally fulfill to be interpretable, actionable, and trustworthy.
Among the commonly considered desiderata are sparsity (minimizing the number of features changed), proximity (minimizing the size of the change), actionability (ensuring that suggested changes are feasible in practice), diversity (providing a range of alternative explanations), robustness (resilience to model perturbations), plausibility (adherence to the data distribution), and causality (respecting the causal structure of the data) [4,5,6,8,9,10,11]. Comprehensive reviews of these techniques and their comparative evaluations can be found in [12].
Of these criteria, plausibility is often considered a foundational property, as it relates to the realism of the generated counterfactuals. Explanations that are implausible—falling into regions of the input space that are not representative of the real-world data—can be misleading or even harmful. Prior studies have shown that enhancing plausibility often contributes positively to other desiderata, including robustness and causality.
To formalize this concept, we define a plausible counterfactual as follows:
Definition II.1 (Plausible Counterfactuals).
Let X|y+ = p(x|y+) denote the true conditional distribution of data points given the target class y+. A counterfactual instance x is said to be plausible if it is sampled from this distribution, i.e., x ∼ X|y+.
Ensuring plausibility thus requires an estimate or model of the conditional data distribution. Several methods have been proposed to approximate this distribution using generative models or other density estimation techniques. For instance, [6] propose REVISE, which operates not in the raw input space but within a latent space learned using a Variational Autoencoder (VAE). The idea is that traversing a well-structured latent space can help restrict counterfactuals to regions of high data density, thereby promoting plausibility.
Other techniques take similar latent-space approaches. For example, [13] utilize normalizing flows to construct invertible mappings that facilitate counterfactual search, while others adopt graph-based methods to navigate high-density paths in the data manifold. Additionally, [10] incorporate causal knowledge, ensuring that counterfactuals obey the underlying causal relationships among features.
An alternative line of work challenges the necessity of explicitly modeling the data distribution. Instead, it leverages the predictive behavior of the model itself to guide the generation of plausible counterfactuals. One such approach, introduced by [4], argues that if the black-box model is well-calibrated and its confidence estimates are reliable, then high-confidence predictions for the counterfactual instance imply plausibility. Their method, referred to here as Schut, applies a Jacobian-based saliency technique—originally used for adversarial attacks—to iteratively perturb the input using cross-entropy loss, with no added regularization. The result is a counterfactual that the model confidently classifies as the target class, suggesting it lies in a dense, meaningful region of the input space.
This model-centric approach offers a practical advantage: it avoids the engineering complexity of training and validating generative models or requiring access to causal graphs. However, its effectiveness relies on the assumption that the black-box model has been trained in a way that yields trustworthy confidence scores and reasonable generalization in unseen regions.
In summary, the quest for plausible counterfactuals has led to the development of diverse methodologies. Some prioritize explicit modeling of the data distribution, while others leverage implicit cues from model behavior. Each approach brings unique trade-offs in terms of complexity, interpretability, and empirical performance. In the following sections, we further investigate how plausibility interacts with model fidelity and propose a novel method that integrates both considerations while also addressing faithfulness as a distinct and critical property of counterfactual explanations.

III. Why Fidelity Is Not Enough: A Motivational Example

As highlighted in the introduction, counterfactual explanations are typically evaluated by checking whether they lead to the desired prediction from a model. In this context, fidelity refers to the alignment between the model’s output for a counterfactual input and the target class specified by the user. By design, optimization-based approaches such as Equation 1 ensure that this condition is satisfied. Therefore, any generated counterfactual that succeeds in flipping the model’s prediction to the intended class is said to possess full fidelity.
However, achieving high fidelity alone does not guarantee that the explanation meaningfully reflects the model’s internal logic or aligns with real-world data semantics. In fact, different methods can produce counterfactuals that all meet the fidelity criterion but differ substantially in terms of interpretability, realism, and utility. To illustrate this point, we present a motivational example using a basic image classification scenario.
We trained a straightforward Multi-Layer Perceptron (MLP) classifier Mθ on the MNIST dataset [14], a benchmark comprising grayscale handwritten digits. The model achieves a test accuracy of over 90%. Notably, no techniques were applied to enhance adversarial robustness or to calibrate the model’s predictive uncertainty. Figure 1 (leftmost image) displays a randomly selected test image classified as the digit ‘9’ by our model.
To explore how various counterfactual generation methods behave, we apply three popular techniques—Wachter, Schut [4], and REVISE [6]—to produce counterfactuals targeting the class ‘7’. Each method produces a high-fidelity counterfactual, i.e., an image that the model classifies as a ‘7’ with high confidence. These counterfactuals are displayed in Figure 1, from left to right next to the original image. The captions above each counterfactual indicate the generating method and the predicted probability assigned to the target class.
Despite identical fidelity, the generated counterfactuals differ dramatically in visual appearance and explanatory value:
  • Wachter: This method minimizes the perturbation between the original and counterfactual input. As a result, the generated image appears nearly identical to the original digit ‘9’. Although the classifier now predicts a ‘7’, the transformation lacks semantic clarity and does not provide meaningful insight into what distinguishes the digits according to the model. The high similarity suggests that the classifier’s decision boundary might be vulnerable to small, imperceptible changes—similar to adversarial perturbations.
  • Schut: The Schut method assumes the classifier offers calibrated uncertainty estimates and utilizes gradient-based saliency techniques to drive the counterfactual toward high-confidence regions. However, our classifier is not designed to quantify uncertainty, and the resulting counterfactual resembles an adversarial input more than a legitimate digit ‘7’. Although the model is confident in its prediction, the visual artifacts raise concerns about the plausibility and interpretability of the explanation.
  • REVISE: This approach leverages a generative model—typically a Variational Autoencoder (VAE)—to guide the counterfactual search within a latent space believed to represent the data-generating distribution. The resulting counterfactual looks significantly more plausible, bearing visual resemblance to a typical handwritten ‘7’. However, since the method operates through a surrogate model rather than the classifier itself, it is unclear whether the counterfactual truly reflects the decision boundaries of the original classifier or the biases of the generative model.
These observations underscore a fundamental limitation of using fidelity as the sole criterion for evaluating counterfactual explanations. While all three counterfactuals successfully induce the desired class prediction, their underlying mechanisms and implications differ substantially. The Wachter method provides no insight into class semantics, Schut’s method risks generating adversarial-like examples, and REVISE introduces a surrogate that might distort the interpretation of the model’s actual behavior.
Thus, fidelity alone cannot serve as a reliable indicator of the faithfulness of a counterfactual explanation—i.e., how accurately the explanation reveals the model’s internal reasoning. Fidelity is a necessary but not sufficient condition for trust in counterfactuals. A more holistic evaluation must also account for aspects such as plausibility, interpretability, consistency with the data distribution, and alignment with causal structure. In the following sections, we explore methods that move beyond fidelity by incorporating these dimensions to produce more faithful and informative explanations.

IV. Faithful First, Plausible Second

As demonstrated earlier, fidelity alone does not guarantee that a counterfactual explanation (CE) truthfully reflects the model’s learned decision boundaries. We now introduce a refined criterion—faithfulness—to better assess the alignment of counterfactuals with the model’s internal representation.
Definition IV.1 (Faithful Counterfactuals).
Let Xθ|y+ = pθ(x|y+) denote the model-induced conditional distribution of inputs for a given target label y+. A counterfactual x is considered faithful if it follows this distribution, i.e., x ∼ Xθ|y+.
This shifts the notion of plausibility—from merely aligning with the data distribution (Definition II.1)—to reflecting what the model has actually internalized about the data.

Sampling from the Model’s Conditional Distribution

To evaluate faithfulness, we approximate pθ(x|y+) using techniques from energy-based models (EBMs) [15].
Specifically, we apply Stochastic Gradient Langevin Dynamics (SGLD) to generate samples that align with the model's learned conditional:

x_{j+1} ← x_j − (ϵ_j²/2) ∇_x Eθ(x_j | y+) + ϵ_j r_j,   j = 1, ..., J,

where r_j ∼ N(0, I), and Eθ(x_j | y+) denotes the class-conditional energy, taken as the negative logit for y+. While theoretical convergence requires small step sizes and many iterations, practical sampling with modest ϵ_j and J provides useful approximations.
The resulting set of samples X̂_{θ,y+} serves as an empirical proxy for the model's generative belief about the target class, enabling evaluation of how faithfully a counterfactual reflects this learned structure.
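As a rough illustration of this sampling step, the sketch below draws SGLD samples from the model-induced conditional, using the negative target logit as the class-conditional energy. The fixed step size, iteration count, and Gaussian initialization are simplifying assumptions.

```python
import torch

def sgld_samples(model, target, shape, n_samples=16, steps=200, eps=0.05):
    """Approximate samples from p_theta(x | y+) via SGLD, with
    E_theta(x | y+) taken as the negative logit for the target class."""
    x = torch.randn(n_samples, *shape)                 # initialize from noise
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        energy = -model(x)[:, target].sum()            # sum of E_theta(x_j | y+)
        grad = torch.autograd.grad(energy, x)[0]       # grad_x E_theta
        x = x - 0.5 * eps**2 * grad + eps * torch.randn_like(x)
    return x.detach()                                  # empirical proxy for X_hat_{theta, y+}
```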

Estimating Predictive Uncertainty

Plausibility often improves when counterfactuals align with high-certainty regions of the model. Traditional methods like [4] rely on calibrated uncertainty estimates, which many models lack. To address this, we adopt conformal prediction (CP) [16], a post hoc, model-agnostic technique for uncertainty quantification.
CP uses a calibration dataset D_cal to compute nonconformity scores S = {s(x_i, y_i)}_{i ∈ D_cal}. Prediction sets are then constructed as:

Cθ(x_i; α) = {y : s(x_i, y) ≤ q̂},

where q̂ is the (1 − α)-quantile of S and α controls the desired error rate. Larger sets imply greater uncertainty. To integrate this into gradient-based optimization, we penalize the set size:

Ω(Cθ(x; α)) = max(0, |Cθ(x; α)| − κ),
where κ is a hyperparameter. This penalty guides counterfactual search toward high-confidence regions, improving plausibility without compromising model faithfulness.
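The split-CP quantile and a differentiable surrogate for the set-size penalty might be implemented as follows; the softmax-based nonconformity score and the sigmoid relaxation with temperature `temp` are one possible choice of smoothing, not necessarily the exact one used in our experiments.

```python
import torch
import torch.nn.functional as F

def conformal_quantile(model, x_cal, y_cal, alpha=0.1):
    """Split conformal prediction: nonconformity s(x, y) = 1 - softmax
    probability of class y; q_hat is the corrected (1 - alpha) quantile."""
    with torch.no_grad():
        probs = F.softmax(model(x_cal), dim=-1)
    scores = 1.0 - probs[torch.arange(len(y_cal)), y_cal]
    n = len(scores)
    q_level = min(1.0, (n + 1) * (1 - alpha) / n)      # finite-sample correction
    return torch.quantile(scores, q_level)

def set_size_penalty(model, x, q_hat, kappa=1.0, temp=0.1):
    """Smooth surrogate for Omega(C_theta(x; alpha)) = max(0, |C| - kappa):
    y belongs to the set iff p_y >= 1 - q_hat, relaxed here with a sigmoid."""
    probs = F.softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)
    soft_membership = torch.sigmoid((probs - (1.0 - q_hat)) / temp)
    return torch.clamp(soft_membership.sum() - kappa, min=0.0)
```

In a typical setup, q̂ is computed once on the calibration split and held fixed, while the penalty is differentiated with respect to the counterfactual candidate.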

Measuring Plausibility and Faithfulness

Because both plausibility and faithfulness depend on proximity to respective conditional distributions, similar evaluation metrics can be applied.
To evaluate plausibility, we measure the average distance between a counterfactual x′ and real samples X_{y+} from the target class:

(1/|X_{y+}|) Σ_{x ∈ X_{y+}} dist(x′, x).

Similarly, we define a faithfulness metric by computing the average distance to the model-generated samples X̂_{θ,y+}:

(1/|X̂_{θ,y+}|) Σ_{x ∈ X̂_{θ,y+}} dist(x′, x).
In both cases, dist(·) is typically the Euclidean norm, though alternatives may be more appropriate depending on data type (see Section VI for details).
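Both metrics reduce to an average pairwise distance against a reference set; a minimal sketch, assuming flattened feature vectors and the Euclidean norm, is shown below. The variable names `X_target` and the reuse of `sgld_samples` from the earlier sketch are illustrative.

```python
import torch

def avg_distance(x_cf, reference):
    """Mean Euclidean distance between a counterfactual x' and each row of a
    reference set (real samples X_{y+} or SGLD samples X_hat_{theta, y+})."""
    return torch.norm(reference - x_cf, dim=-1).mean()

# plausibility_score   = avg_distance(x_cf, X_target)   # distance to real data of class y+
# unfaithfulness_score = avg_distance(x_cf, sgld_samples(model, target, x_cf.shape))
```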

V. Energy-Constrained Conformal Counterfactuals (ECCCo)

Building on our formulation of faithfulness in counterfactual explanations, we introduce ECCCo—a framework for generating counterfactuals that prioritize alignment with the model’s learned data distribution. In ECCCo, the primary goal is to produce faithful counterfactuals, with plausibility being a beneficial byproduct when the underlying model has captured a meaningful representation of the true data-generating process.

Objective Formulation

To begin, we adapt the general counterfactual optimization framework by incorporating a loss function that encourages both predictive accuracy and consistency with the model's internal representation of the data. Specifically, we redefine the optimization objective as:

min_{Z′ ∈ Z^L} {LJEM(f(Z′); Mθ, y+) + λ cost(f(Z′))}

Here, f(Z′) denotes the counterfactual candidate in the input space (possibly via a decoder or identity mapping), and LJEM is a joint loss that integrates predictive and generative components. It is expressed as:

LJEM(f(Z′); ·) = Lclf(f(Z′); ·) + Lgen(f(Z′); ·)
The classification loss Lclf ensures that the counterfactual belongs to the desired output class y+. The generative loss Lgen is inspired by energy-based modeling (EBM), where samples that are more consistent with the model’s understanding of the data have lower energy scores. In traditional EBM training, this involves contrasting real data with model-generated samples via methods like SGLD. However, during counterfactual generation, we do not retrain the model; instead, we use the model’s learned energy landscape to guide counterfactual search.

Faithful, Plausible, and Uncertain

Recognizing that we only require our counterfactual to lie in a region of low energy (i.e., high faithfulness), we simplify the optimization objective by removing the need to draw SGLD samples during inference. Thus, the ECCCo framework proposes the following loss function:
min_{Z′ ∈ Z^L} {Lclf(f(Z′); Mθ, y+) + λ1 cost(f(Z′)) + λ2 Eθ(f(Z′) | y+) + λ3 Ω(Cθ(f(Z′); α))}

This function contains four key components:
  • The classification loss (Lclf) ensures the counterfactual is labeled as the desired target y+ by the model.
  • The proximity cost (cost(·)) encourages the counterfactual to stay close to the original input, following the principle used in methods like Wachter.
  • The energy term (Eθ(·)) enforces faithfulness by ensuring the counterfactual lies in a region with low model energy—i.e., it is likely under the model's learned conditional distribution.
  • The uncertainty penalty (Ω(·)) uses conformal prediction to penalize predictions with large or ambiguous label sets, thereby improving plausibility by steering the counterfactual away from uncertain regions of the input space.
The coefficients λ1, λ2, and λ3 control the influence of each respective component and are tuned via grid search or cross-validation.
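Putting the pieces together, one gradient step on the full ECCCo objective might look as follows. This sketch reuses the `set_size_penalty` function from the conformal-prediction sketch above, treats the negative target logit as the energy term, and uses placeholder default values for λ1–λ3 and a plain gradient-descent update; it is not the reference implementation.

```python
import torch
import torch.nn.functional as F

def eccco_step(model, x_cf, x_orig, target, q_hat,
               lam1=0.1, lam2=0.1, lam3=0.5, lr=0.05):
    """One descent step on L_clf + lam1*cost + lam2*E_theta + lam3*Omega."""
    x_cf = x_cf.detach().requires_grad_(True)
    logits = model(x_cf.unsqueeze(0))
    l_clf = F.cross_entropy(logits, torch.tensor([target]))  # classification loss
    cost = torch.norm(x_cf - x_orig, p=1)                    # proximity to the factual
    energy = -logits[0, target]                              # E_theta(x' | y+), lower = more faithful
    omega = set_size_penalty(model, x_cf, q_hat)             # conformal set-size penalty
    loss = l_clf + lam1 * cost + lam2 * energy + lam3 * omega
    grad = torch.autograd.grad(loss, x_cf)[0]
    return (x_cf - lr * grad).detach()
```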

On the Role of Feature Representations

ECCCo does not require the search for counterfactuals in a latent or compressed feature space. The default mapping function f(·) is simply the identity transformation, allowing direct search in the input space. This retains interpretability and is often sufficient if the model has learned a faithful and plausible representation of the data. However, in cases where additional structure is desired, ECCCo can be extended to include dimensionality reduction, such as through autoencoders. We refer to this variant as ECCCo+, which performs optimization in a latent space to further encourage plausibility while maintaining faithfulness.
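For the ECCCo+ variant, one simple way to realize a non-identity mapping f(·) is a PCA decoder fitted on training data (the experiments in Section VI use PCA with the latent dimension matched to REVISE's VAE). The synthetic training matrix and latent dimensionality below are placeholders for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 20))      # placeholder training data (n x D)

pca = PCA(n_components=8).fit(X_train)    # latent dimension is an assumption here

def f(z):
    """Decode a latent candidate z' back to the input space; the ECCCo+
    objective is then optimized over z rather than over x directly."""
    return pca.inverse_transform(z.reshape(1, -1)).ravel()

z0 = pca.transform(X_train[:1]).ravel()   # initialize z' from the factual instance
```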

Illustrative Example

Figure 2 presents a synthetic example illustrating how different components of the ECCCo objective function affect the counterfactual generation process. The underlying model is a Joint Energy Model (JEM), trained both for classification and to learn the conditional input distribution. The task is to generate a counterfactual that transitions from the orange class to the blue class.
Four counterfactual generation methods are compared:
1) Wachter: Uses only the classification loss and proximity term, resulting in valid counterfactuals that may not align well with model knowledge.
2) ECCCo (no EBM): Adds uncertainty minimization but ignores the model's energy landscape, which may reduce uncertainty but can still yield unfaithful results.
3) ECCCo (no CP): Enforces low energy to ensure faithfulness but skips uncertainty constraints. When the model is well-calibrated, this yields both faithful and plausible counterfactuals.
4) ECCCo (full): Incorporates all components, benefiting from both the energy constraint and uncertainty reduction for better convergence and more meaningful counterfactuals.
The gradients depicted in the figure show how each method updates the input to reach a viable counterfactual. ECCCo consistently guides the search toward low-energy, low-uncertainty, and class-consistent regions, producing high-quality counterfactuals.
Figure 2. Gradient fields and counterfactual paths for the orange-to-blue class shift using a Joint Energy Model.
Figure 3. Counterfactuals turning a 3 into a 5 using ECCCo, ECCCo+, REVISE, Schut, and Wachter.

VI. Empirical Analysis

We investigate two key research questions about counterfactual explanation generation:
Research Question VI.1 (Faithfulness). How does ECCCo compare to state-of-the-art methods in generating faithful counterfactual explanations?
Research Question VI.2 (Balancing Objectives). Can ECCCo effectively balance faithfulness and plausibility in its generated explanations?

Experimental Setup

We evaluate ECCCo against three baseline methods:
  • Schut: Focuses on predictive uncertainty minimization
  • REVISE: State-of-the-art for plausibility
  • Wachter: Standard baseline approach
For ECCCo+, we use PCA for dimensionality reduction with latent dimension matching REVISE’s VAE. Experiments employ:
  • Models: MLPs, Deep Ensembles, JEMs, and LeNet-5 CNNs
  • Datasets:
    – Tabular: GMSC, German Credit, California Housing
    – Vision: MNIST, Fashion MNIST
  • Metrics:
    – Faithfulness (lower is better)
    – Plausibility (lower is better)
    – Predictive uncertainty (lower is better)

Results and Analysis

Our analysis reveals three key findings:
1) Superior Faithfulness: Across all datasets, ECCCo variants consistently achieve the highest faithfulness scores. The full objective (combining energy and uncertainty constraints) performs best, though the energy constraint alone also shows strong results.
2) Balanced Performance: While REVISE achieves slightly better plausibility (0.58 vs 0.60 for California Housing), ECCCo+ maintains better faithfulness (1.28 vs 1.39) and comparable uncertainty metrics.
3) Model-Adaptive Behavior: ECCCo automatically adapts to model quality: standard ECCCo works better with simpler models (MLP), while ECCCo+ excels with stronger models (JEM, CNN).

Discussion

ECCCo successfully addresses the faithfulness-plausibility tradeoff by:
  • Generating plausible explanations only when justified by the model
  • Maintaining high faithfulness across all experimental conditions
  • Providing flexible variants (ECCCo, ECCCo+) for different use cases
Visual examples demonstrate ECCCo’s ability to produce meaningful counterfactuals while preserving relevant features from the original input.
The tables show representative results while referring to the appendix for complete data. The analysis focuses on the most important findings while maintaining academic rigor.
Table 1. Performance on tabular datasets (mean values, lower is better). Best results in bold. Standard deviations omitted for clarity (full results in Appendix).
                        California Housing                       GMSC
Model         Method    Faithfulness Plausibility Uncertainty    Faithfulness Plausibility Uncertainty
MLP Ensemble  ECCCo     3.69         1.94         0.09           3.84         2.13         0.23
              ECCCo+    3.88         1.20         0.15           3.79         1.81         0.30
              REVISE    3.96         0.58         0.17           4.09         0.63         0.33
JEM Ensemble  ECCCo     1.40         0.69         0.11           1.20         0.78         0.38
              ECCCo+    1.28         0.60         0.11           1.01         0.70         0.37
              REVISE    1.39         0.59         0.25           1.01         0.63         0.33

VII. Limitations

While our methodology demonstrates promising results, several important limitations warrant discussion to guide future research directions and practical applications.

Metric Dependence

The evaluation framework we employ relies fundamentally on distance-based metrics for assessing both plausibility and faithfulness. This approach presents three key challenges:
  • Metric Selection: The effectiveness of our measurements depends heavily on the choice of distance metric, which may not translate consistently across different data modalities. For image data, we utilize structural dissimilarity indices, while for tabular data we employ Euclidean distances—each with their own assumptions and limitations.
  • Universal Standards: The field currently lacks standardized evaluation metrics for counterfactual explanations. Our work shares this limitation with prior research, highlighting the need for community-wide efforts to establish robust benchmarking protocols.
  • Interpretability Tradeoffs: Certain distance metrics, while mathematically sound, may not align perfectly with human intuitions about similarity, particularly in high-dimensional spaces.

Model Compatibility

Our faithfulness evaluation methodology requires gradient access through Stochastic Gradient Langevin Dynamics (SGLD), which imposes certain constraints:
  • Non-Differentiable Models: The current implementation cannot evaluate classifiers that lack differentiable decision boundaries, such as decision trees or random forests, despite ECCCo's theoretical applicability to these models.
  • Computational Overhead: The sampling process introduces additional computational costs compared to simpler evaluation methods, potentially limiting scalability for very large models.

Implementation Challenges

The energy-based modeling approach underlying ECCCo presents several practical considerations:
  • Hyperparameter Sensitivity: During development, we observed that the energy constraint requires careful tuning to prevent overshooting, particularly when dealing with models that have sharp decision boundaries.
  • Training Stability: Like many energy-based approaches, ECCCo can exhibit training instabilities that require mitigation through techniques such as gradient clipping and learning rate scheduling.
  • Scale Dependence: The energy scores can vary significantly across different datasets, necessitating dataset-specific calibration of the energy penalty term.

Table 2. Performance on the MNIST dataset (mean values, lower is better).

Model         Method   Faithfulness  Plausibility
MLP           ECCCo    0.243         0.420
              ECCCo+   0.246         0.306
              REVISE   0.248         0.301
LeNet-5 CNN   ECCCo    0.248         0.387
              ECCCo+   0.248         0.310
              REVISE   0.248         0.301

Future Research Directions

Our work suggests several promising avenues for future investigation:
  • Alternative Conformal Methods: While we employed split conformal prediction, other variants such as cross-conformal or jackknife+ prediction might offer improved efficiency or stability.
  • Adaptive Thresholding: Developing dynamic approaches for setting the conformal prediction error rate based on data characteristics could enhance performance.
  • Gradient-Free Extensions: Exploring approximations that would allow application to non-differentiable models while maintaining theoretical guarantees.

VIII. Conclusion

Our work makes several important contributions to the field of explainable AI through the development of ECCCo, a novel counterfactual explanation framework that successfully balances competing objectives of faithfulness and plausibility. The key insights from our research include:
  • Model-Explanation Alignment: We have demonstrated that counterfactual explanations should reflect not just data characteristics but also the underlying model's learned representations. ECCCo automatically adapts its outputs based on the quality of the model's decision boundaries.
  • Theoretical Foundation: By combining energy-based modeling with conformal prediction, we provide a principled approach to counterfactual generation that offers both theoretical guarantees and practical flexibility.
  • Empirical Validation: Our extensive experiments across multiple datasets and model architectures show consistent improvements over existing methods, particularly in maintaining faithfulness without sacrificing plausibility.
The implications of this work extend beyond technical contributions:
  • Model Diagnostics: Practitioners can use ECCCo outputs as diagnostic tools to identify when models may be relying on spurious or unrealistic patterns.
  • Trustworthy AI: Our approach provides a pathway for developing explanation methods that accurately represent model behavior rather than presenting potentially misleading plausible alternatives.
  • Research Community: The limitations we identify point to important open challenges in the field, particularly around evaluation standards and non-differentiable model support.
While challenges remain in making counterfactual explanations fully robust and universally applicable, ECCCo represents a significant step toward explanations that are both interpretable and faithful to the underlying models. We believe this work establishes a foundation for future research into reliable explanation methods that can keep pace with increasingly complex AI systems.

References

  1. C. Molnar, Interpretable Machine Learning, 2nd ed., 2022. [Online]. Available: https://christophm.github.io/interpretable-ml-book.
  2. M. T. Ribeiro, S. Singh, and C. Guestrin, “"Why should I trust you?" Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
  3. S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 4768–4777.
  4. L. Schut, O. Key, R. Mc Grath, L. Costabello, B. Sacaleanu, Y. Gal et al., “Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 1756–1764.
  5. B. Ustun, A. Spangher, and Y. Liu, “Actionable recourse in linear classification,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019, pp. 10–19.
  6. S. Joshi, O. Koyejo, W. Vijitbenjaronk, B. Kim, and J. Ghosh, “Towards realistic individual recourse and actionable explanations in black-box decision making systems,” 2019.
  7. T. Spooner, D. Dervovic, J. Long, J. Shepard, J. Chen, and D. Magazzeni, “Counterfactual explanations for arbitrary regression models,” 2021.
  8. R. K. Mothilal, A. Sharma, and C. Tan, “Explaining machine learning classifiers through diverse counterfactual explanations,” in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 607–617.
  9. S. Upadhyay, S. Joshi, and H. Lakkaraju, “Towards robust and reliable algorithmic recourse,” Advances in Neural Information Processing Systems, vol. 34, pp. 16926–16937, 2021.
  10. A.-H. Karimi, B. Schölkopf, and I. Valera, “Algorithmic recourse: From counterfactual explanations to interventions,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 353–362.
  11. P. Altmeyer, G. Angela, A. Buszydlik, K. Dobiczek, A. van Deursen, and C. C. Liem, “Endogenous macrodynamics in algorithmic recourse,” in 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 2023, pp. 418–431.
  12. A.-H. Karimi, G. Barthe, B. Schölkopf, and I. Valera, “A survey of algorithmic recourse: definitions, formulations, solutions, and prospects,” 2021.
  13. A.-K. Dombrowski, J. E. Gerken, and P. Kessel, “Diffeomorphic explanations with normalizing flows,” in ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
  14. Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
  15. Y. Du and I. Mordatch, “Implicit generation and generalization in energy-based models,” 2020.
  16. A. N. Angelopoulos and S. Bates, “A gentle introduction to conformal prediction and distribution-free uncertainty quantification,” 2022.
Figure 1. Counterfactual examples transforming a 9 into a 7 using Wachter, Schut, and REVISE methods.