Submitted: 01 May 2026
Posted: 06 May 2026
Abstract
Keywords:
1. Introduction
- (Section 3) We introduce a unified variational formulation of machine learning, in which predictive risk, structural regularization, and additional functional constraints are combined within a single objective.
- (Section 4) We show how several classical paradigms arise as special cases of the proposed framework.
- (Section 5) We formalize robustness and fairness as functionals over hypothesis spaces and analyze their interaction, including a characterization of inherent trade-offs between predictive accuracy and fairness.
- (Section 6) We introduce a multi-criteria interpretability functional combining simplicity, information relevance, and stability of explanations.
- (Section 7) We discuss consequences of the unified formulation, including stability, generalization, and connections between robustness and regularity under suitable assumptions.
2. Related Work
- (i) functional integration of multiple structural properties (robustness, fairness, interpretability),
- (ii) variational well-posedness guarantees (existence, compactness, semicontinuity),
- (iii) explicit connection to Pareto optimality at the level of the learning objective.
- a functional analytic formulation amenable to tools from the calculus of variations,
- existence results for optimal predictors under structural constraints,
- a direct link between scalarized optimization and Pareto optimality, and
- a framework in which trade-offs arise intrinsically from the objective, rather than being imposed externally.
3. A Unified Variational Framework
3.1. Unified Functional Formulation
- the expected risk of the predictor,
- Ω, a structural regularizer,
- a family of functionals encoding robustness, fairness, or other constraints,
- I, an interpretability score,
- trade-off parameters weighting these terms.
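One way to write an objective with the components listed above is sketched below. Only Ω and I appear by name in the surrounding text; the symbols $\mathcal{R}$, $\mathcal{C}_j$, $\lambda$, $\gamma_j$, $\beta$ and the hypothesis space $\mathcal{F}$ are notation assumed here for illustration. The minus sign on $I$ reflects the assumption that interpretability is a score to be rewarded rather than a penalty.

```latex
\[
  \mathcal{J}(f) \;=\; \mathcal{R}(f) \;+\; \lambda\,\Omega(f)
  \;+\; \sum_{j=1}^{m} \gamma_j\,\mathcal{C}_j(f) \;-\; \beta\, I(f),
  \qquad f \in \mathcal{F}, \quad \lambda,\ \gamma_j,\ \beta \ \ge 0 .
\]
```

Setting $\gamma_j = 0$ and $\beta = 0$ in this sketch leaves ordinary regularized risk minimization.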
3.2. Well-Posedness and Basic Properties
- (H1) Ω is coercive on the hypothesis space,
- (H2) the risk, Ω, and each constraint functional are lower semicontinuous,
- (H3) the relevant sublevel sets are relatively compact,
- (H4) I is upper semicontinuous.
- (i) Existence. There exists a minimizer of the unified objective over the hypothesis space.
- (ii) Trade-off inequality. Let Then
- (iii) Pareto optimality. The minimizer is Pareto-optimal for the corresponding multi-objective problem.
- predictors can be characterized as minimizers of a composite functional,
- structural constraints such as robustness, fairness, and interpretability can be incorporated without compromising well-posedness,
- trade-offs between predictive accuracy and additional constraints arise naturally from the objective,
- the unified formulation induces Pareto-optimal solutions in the corresponding multi-objective space.
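The Pareto-optimality claim can be illustrated with the standard scalarization argument. The sketch below assumes notation not fixed in the text ($\mathcal{R}$ for the risk, $\mathcal{C}_j$ for the constraint functionals, weights $\lambda, \gamma_j, \beta$) and, importantly, assumes that all weights are strictly positive and that the objective is finite at the minimizer $f^\ast$.

```latex
% Criteria to be minimized jointly (interpretability enters with a minus sign):
\[
  \Phi(f) = \big(\mathcal{R}(f),\ \Omega(f),\ \mathcal{C}_1(f),\dots,\mathcal{C}_m(f),\ -I(f)\big),
  \qquad
  c = (1,\ \lambda,\ \gamma_1,\dots,\gamma_m,\ \beta), \quad c_k > 0 .
\]
% The scalarized objective is the weighted sum of the criteria:
\[
  \mathcal{J}(f) = \sum_{k} c_k\, \Phi_k(f).
\]
% Suppose some g dominates f^*: \Phi_k(g) \le \Phi_k(f^*) for all k,
% with strict inequality for at least one index. Since every c_k > 0,
\[
  \mathcal{J}(g) = \sum_{k} c_k\, \Phi_k(g) \;<\; \sum_{k} c_k\, \Phi_k(f^\ast) = \mathcal{J}(f^\ast),
\]
% which contradicts the minimality of f^*. Hence no such g exists.
```

If some weight is zero, the same argument only yields weak Pareto optimality, since a strict improvement in the unweighted criterion would not change the scalarized value.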
3.3. Intuitive Interpretation: The Control Panel View
- The risk term measures predictive accuracy: how well the model fits the data.
- Ω controls model complexity: how simple or regular the predictor is.
- The constraint functionals quantify violations of structural constraints, such as lack of robustness or fairness.
- I measures interpretability: how understandable or stable the model is.
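The control panel picture can be made concrete with a small sketch in which each dial is a weight on one term of a composite objective. Everything named here is a hypothetical stand-in chosen for illustration, not the paper's definitions: a linear model, squared loss as the risk, a squared $\ell_2$ norm as Ω, the $\ell_1$ norm of the weights as a crude sensitivity penalty, and the fraction of zero coefficients as the interpretability score.

```python
from dataclasses import dataclass
from typing import Sequence

import numpy as np


@dataclass(frozen=True)
class Knobs:
    """Trade-off weights: one dial per term of the composite objective."""
    lam: float = 0.0              # weight on model complexity (Omega)
    gammas: Sequence[float] = ()  # one weight per structural penalty
    beta: float = 0.0             # reward for interpretability (I)


def squared_risk(w, X, y):
    """Empirical squared-error risk of the linear predictor x -> x @ w."""
    return float(np.mean((X @ w - y) ** 2))


def l2_complexity(w):
    """Illustrative choice of Omega: squared Euclidean norm of the weights."""
    return float(w @ w)


def sensitivity_penalty(w, X, y):
    """Hypothetical robustness proxy: ||w||_1 bounds the change of a linear
    prediction under an l_inf input perturbation of size one."""
    return float(np.sum(np.abs(w)))


def sparsity_score(w, tol=1e-8):
    """Hypothetical interpretability score: fraction of zero coefficients."""
    return float(np.mean(np.abs(w) <= tol))


def unified_objective(w, X, y, knobs, penalties=()):
    """risk + lam * Omega + sum_j gamma_j * C_j - beta * I."""
    if len(knobs.gammas) != len(penalties):
        raise ValueError("need exactly one weight per penalty")
    value = squared_risk(w, X, y) + knobs.lam * l2_complexity(w)
    value += sum(g * p(w, X, y) for g, p in zip(knobs.gammas, penalties))
    return value - knobs.beta * sparsity_score(w)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, 0.0, -2.0])
y = X @ w_true

# All dials at zero: the objective is the plain risk.
print(unified_objective(w_true, X, y, Knobs()))
# Turning the dials up changes how the same predictor is scored.
panel = Knobs(lam=0.1, gammas=(0.05,), beta=0.3)
print(unified_objective(w_true, X, y, panel, penalties=(sensitivity_penalty,)))
```

Keeping the weights in a separate immutable object makes it easy to sweep one dial while holding the others fixed, which is how the trade-offs discussed in the text would be explored in practice.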

4. Reinterpreting Existing Paradigms
- $\ell_2$ regularization (ridge regression): $\Omega(f) = \|w\|_2^2$,
- $\ell_1$ regularization (lasso): $\Omega(f) = \|w\|_1$,
- RKHS norms in kernel methods: $\Omega(f) = \|f\|_{\mathcal{H}_K}^2$.
- A Gaussian prior on parameters induces $\ell_2$ regularization,
- A Laplace prior induces $\ell_1$ regularization.
- Weight decay: corresponds to $\Omega(\theta) = \|\theta\|_2^2$,
- Dropout: can be viewed as a stochastic regularization that approximates an ensemble of subnetworks,
- Batch normalization: implicitly controls the geometry of the optimization landscape,
- Early stopping: acts as an implicit regularizer by restricting effective model complexity.
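The weight-decay item in the list above can be checked numerically. The sketch below assumes a linear model with squared loss and plain gradient descent (all names are illustrative); under those assumptions, shrinking the weights by a factor $(1 - \eta\lambda)$ before each step is the same update as a gradient step on the loss plus $(\lambda/2)\|w\|_2^2$.

```python
import numpy as np


def grad_squared_loss(w, X, y):
    """Gradient of the mean squared error of the linear model x -> x @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def step_with_l2_penalty(w, X, y, lr, lam):
    """One gradient step on  loss(w) + (lam / 2) * ||w||_2^2."""
    return w - lr * (grad_squared_loss(w, X, y) + lam * w)


def step_with_weight_decay(w, X, y, lr, lam):
    """Shrink the weights by (1 - lr * lam), then take a plain loss step."""
    return (1.0 - lr * lam) * w - lr * grad_squared_loss(w, X, y)


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)
w_penalty = w_decay = rng.normal(size=5)

for _ in range(100):
    w_penalty = step_with_l2_penalty(w_penalty, X, y, lr=0.05, lam=0.1)
    w_decay = step_with_weight_decay(w_decay, X, y, lr=0.05, lam=0.1)

# The two trajectories coincide up to floating-point rounding.
print(np.max(np.abs(w_penalty - w_decay)))
```

The equivalence is specific to plain (stochastic) gradient descent. With adaptive optimizers such as Adam, adding the penalty to the loss and decaying the weights directly generally give different updates.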
4.1. Comparison of Paradigms Within the Unified Framework
| Paradigm | Hypothesis space | Loss ℓ | Key functional term(s) |
|---|---|---|---|
| Support Vector Machines (SVM) | RKHS (measurable representatives) | Hinge loss | RKHS norm penalty (Tikhonov regularization) |
| Physics-Informed Neural Networks (PINNs) | Sobolev-type space (realized by NN parametrizations) | Data MSE / likelihood loss | PDE/operator residual penalty |
| Fairness-aware learning (Demographic Parity) | Measurable predictors | Cross-entropy / logistic loss | Demographic-parity penalty (or an independence surrogate) |
| Dictionary learning / sparse coding | Dictionary and code pairs | Reconstruction MSE | Sparsity of the code |
| Deep learning heuristics | Neural networks | Task loss (CE/MSE) | Weight decay; dropout (stochastic regularization); early stopping (implicit regularization) |
4.2. A Fully Rigorous Instance in a Reproducing Kernel Hilbert Space
- (A1) The kernel K is bounded, i.e. $\sup_{x} K(x, x) < \infty$,
- (A2) The loss is convex and Lipschitz in its first argument,
- (A3) The output space is finite.
- The risk: $\mathcal{R}(f) = \mathbb{E}[\ell(f(X), Y)]$,
- The regularizer: $\Omega(f) = \|f\|_{\mathcal{H}_K}^2$,
- A representative structural functional (e.g. fairness), where A denotes a protected attribute.
- The interpretability score is defined as in Section 6, under the assumptions ensuring finiteness.
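To make the ingredients of this instance tangible, the following sketch evaluates a risk, a regularizer, and a fairness term for a kernel predictor $f = \sum_i \alpha_i K(x_i, \cdot)$. The concrete choices are assumptions made for illustration only: a Gaussian kernel (bounded by 1, in the spirit of (A1)), the hinge loss (convex and 1-Lipschitz, in the spirit of (A2)), labels in $\{-1, +1\}$ (a finite output space, as in (A3)), and a demographic-parity gap on predicted labels as the fairness functional.

```python
import numpy as np


def gaussian_kernel(X1, X2, bandwidth=1.0):
    """Gaussian kernel matrix; every entry lies in (0, 1]."""
    sq_dist = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def hinge_risk(scores, y):
    """Empirical hinge risk for labels y in {-1, +1}."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * scores)))


def rkhs_norm_sq(alpha, K):
    """Squared RKHS norm of f = sum_i alpha_i K(x_i, .): alpha^T K alpha."""
    return float(alpha @ K @ alpha)


def demographic_parity_gap(scores, a):
    """|P(f(X) > 0 | A = 0) - P(f(X) > 0 | A = 1)| estimated on the sample."""
    if not (np.any(a == 0) and np.any(a == 1)):
        raise ValueError("both groups must be present")
    positive = (scores > 0).astype(float)
    return float(abs(positive[a == 0].mean() - positive[a == 1].mean()))


def objective_terms(alpha, X, y, a, bandwidth=1.0):
    """Evaluate the three terms for the kernel expansion with weights alpha."""
    K = gaussian_kernel(X, X, bandwidth)
    scores = K @ alpha
    return {
        "risk": hinge_risk(scores, y),
        "regularizer": rkhs_norm_sq(alpha, K),
        "fairness": demographic_parity_gap(scores, a),
    }


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # protected attribute
alpha = 0.5 * y                          # arbitrary coefficients for the demo
terms = objective_terms(alpha, X, y, a)
print(terms)
```

Restricting attention to finite kernel expansions is itself a simplification; it is the form a minimizer would take when a representer-type result applies, which the sketch does not verify.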
5. Robustness and Fairness as Structural Functionals
5.1. Robustness Functionals
5.2. A Fundamental Trade-Off: Fairness vs Accuracy
- ,
- the hypothesis class is sufficiently rich to approximate the Bayes optimal predictor.
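A simple version of the fairness versus accuracy tension can be worked out by hand. The setting below is an assumption made for illustration: binary label $Y$, binary protected attribute $A$, a binary predictor $\hat{Y}$, and demographic parity as the fairness notion. The symbols $p_a$, $q_a$ and $\mathrm{err}_a$ are introduced here and are not taken from the text.

```latex
% Group-wise base rates, acceptance rates and errors:
\[
  p_a = \mathbb{P}(Y = 1 \mid A = a), \qquad
  q_a = \mathbb{P}(\hat{Y} = 1 \mid A = a), \qquad
  \mathrm{err}_a = \mathbb{P}(\hat{Y} \neq Y \mid A = a), \qquad a \in \{0, 1\}.
\]
% Within each group the error is at least the gap between acceptance rate and base rate:
\[
  |q_a - p_a|
  = \big| \mathbb{E}\big[\mathbf{1}\{\hat{Y}=1\} - \mathbf{1}\{Y=1\} \,\big|\, A = a\big] \big|
  \le \mathbb{E}\big[\,|\mathbf{1}\{\hat{Y}=1\} - \mathbf{1}\{Y=1\}| \,\big|\, A = a\big]
  = \mathrm{err}_a .
\]
% Under exact demographic parity, q_0 = q_1 = q, so by the triangle inequality
\[
  \mathrm{err}_0 + \mathrm{err}_1 \;\ge\; |q - p_0| + |q - p_1| \;\ge\; |p_0 - p_1| .
\]
```

In this toy setting, a predictor that satisfies demographic parity exactly cannot have zero error in both groups unless the base rates coincide, however rich the hypothesis class is.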
5.3. Discussion
- explicit control of robustness under perturbations and distributional shifts,
- formal guarantees of fairness through statistical dependence constraints,
- intrinsic trade-offs between predictive performance and structural properties.
6. Interpretability as a Variational Functional
6.1. Axiomatic Setup and Notation
- A1 (Simplicity). The score should be larger for models of lower effective complexity.
- A2 (Relevance). The score should reward predictors that preserve information relevant to the target variable Y.
- A3 (Stability of explanations). The score should be larger for models whose explanations are stable under small perturbations of the input.
6.2. Definition of the Interpretability Score
6.3. Well-Posedness Considerations
- (a) If is finite, then is finite and bounded by .
- (b) If is measurable and then is well-defined and satisfies .
- (c) If and , then is finite.
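A finiteness condition of this kind can be illustrated numerically under one loudly stated assumption: that the relevance term is a mutual information between a discrete representation $T = f(X)$ and the label $Y$. The text does not confirm this choice. Under it, $I(T; Y) \le H(Y) \le \log |\mathcal{Y}|$ whenever the output space is finite, and the plug-in estimate below respects that bound.

```python
import numpy as np


def mutual_information(joint):
    """Plug-in mutual information (in nats) from a joint table of T and Y.

    Rows index values of T, columns index values of Y. The table may hold
    counts or probabilities; it is normalized internally.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_t = joint.sum(axis=1, keepdims=True)  # marginal of T, shape (n_t, 1)
    p_y = joint.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, n_y)
    product = p_t @ p_y                     # independent reference joint
    mask = joint > 0                        # 0 * log 0 is treated as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


# T carries no information about Y (independent variables).
independent = np.outer([0.3, 0.7], [0.2, 0.5, 0.3])
# T determines Y exactly, with three equally likely labels.
deterministic = np.eye(3) / 3.0
# A random table with four values of T and three labels.
rng = np.random.default_rng(0)
random_table = rng.random((4, 3))

for name, table in [("independent", independent),
                    ("deterministic", deterministic),
                    ("random", random_table)]:
    n_labels = np.asarray(table).shape[1]
    print(name, mutual_information(table), "bound:", np.log(n_labels))
```

The deterministic case attains the bound $\log 3$, the independent case gives zero up to rounding, and any other table falls in between.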
6.4. Integration into the Unified Objective
6.5. Multi-Objective Interpretation
7. Refinements and Consequences of the Unified Variational Principle
7.1. Uniform Stability of Empirical Minimizers
- the loss is L-Lipschitz with respect to a parameter norm,
- Ω is μ-strongly convex with $\mu > 0$,
- each constraint functional is convex,
- $-I$ is convex (that is, $I$ is concave).
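The classical argument behind uniform stability under assumptions of this type can be sketched as follows. The notation is assumed for illustration: $\theta$ denotes the parameters, $z_k$ the training examples, $\lambda > 0$ the weight on Ω, and $G$ collects $\lambda\Omega$ together with the remaining convex terms, which are assumed here not to depend on the sample. Then $G$ is $\sigma$-strongly convex with $\sigma = \lambda\mu$.

```latex
% S and S' differ only in the i-th example (z_i is replaced by z_i').
\[
  F_S(\theta) = \frac{1}{n}\sum_{k=1}^{n} \ell(\theta; z_k) + G(\theta),
  \qquad
  \hat\theta = \arg\min_\theta F_S(\theta), \quad
  \hat\theta' = \arg\min_\theta F_{S'}(\theta).
\]
% Strong convexity of each objective at its own minimizer:
\[
  F_S(\hat\theta') - F_S(\hat\theta) \ge \tfrac{\sigma}{2}\,\|\hat\theta' - \hat\theta\|^2,
  \qquad
  F_{S'}(\hat\theta) - F_{S'}(\hat\theta') \ge \tfrac{\sigma}{2}\,\|\hat\theta' - \hat\theta\|^2 .
\]
% Adding both inequalities, all shared terms cancel and only the swapped example remains;
% the last step uses that the loss is L-Lipschitz in \theta:
\[
  \sigma\,\|\hat\theta' - \hat\theta\|^2
  \le \frac{1}{n}\big[\ell(\hat\theta'; z_i) - \ell(\hat\theta; z_i)\big]
    + \frac{1}{n}\big[\ell(\hat\theta; z_i') - \ell(\hat\theta'; z_i')\big]
  \le \frac{2L}{n}\,\|\hat\theta' - \hat\theta\| .
\]
% Hence, for every test point z,
\[
  \|\hat\theta' - \hat\theta\| \le \frac{2L}{\sigma n},
  \qquad
  \big|\ell(\hat\theta; z) - \ell(\hat\theta'; z)\big|
  \le \frac{2L^2}{\sigma n} = \frac{2L^2}{\lambda \mu\, n}.
\]
```

The constant depends on the convention used (replace-one versus leave-one-out), so the rate $O\!\big(L^2 / (\lambda\mu n)\big)$ is the robust takeaway. Penalties estimated from the sample, such as empirical fairness gaps, fall outside this sketch and need a separate treatment.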
7.2. Implications for Generalization
7.3. Refined Trade-Off Inequality
7.4. Bias–Variance Interpretation
7.5. Robustness and Regularity
7.6. Summary
- recovers classical stability and generalization behavior under convexity assumptions,
- induces explicit and quantifiable trade-offs between predictive accuracy and structural constraints,
- introduces bias through structural penalties while preserving statistical rates in favorable settings,
- connects robustness constraints with regularity properties of predictors.
8. Discussion and Open Problems
8.1. Optimization of Nonconvex and Nonsmooth Objectives
- Algorithmic convergence under composite structure. Establish convergence guarantees for principled algorithms (proximal gradient, alternating minimization, primal–dual schemes, mirror descent) when the objective contains multiple competing functionals, some of which may be only lower semicontinuous or only available through stochastic estimators.
- Provably correct surrogates. Many practically used substitutes (e.g. replacing $\ell_0$ by $\ell_1$ penalties, mutual information by neural estimators, Wasserstein balls by tractable relaxations) change the geometry of the problem. A natural question is how closely minimizers (or Pareto frontiers) of surrogate objectives approximate those of the original formulation, and at what rate.
- Stationarity notions and certificates. For nonsmooth/nonconvex formulations, classical first-order optimality conditions are insufficient. Developing appropriate notions (Clarke stationarity, variational inequalities, weak KKT-type conditions under constraints) and computable certificates is essential for both theory and reproducibility.
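As a reference point for the composite algorithms mentioned above, here is a minimal proximal gradient (ISTA) sketch for the simplest composite objective, a smooth least-squares term plus a nonsmooth $\ell_1$ penalty. It illustrates the algorithmic template in the convex case only and is not the method analyzed in the text; the function names and the choice of objective are assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinate-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(w, X, y, lam):
    """(1 / 2n) * ||X w - y||^2 + lam * ||w||_1."""
    return 0.5 * np.mean((X @ w - y) ** 2) + lam * np.sum(np.abs(w))


def ista(X, y, lam, n_iter=500):
    """Proximal gradient descent with constant step 1 / L.

    L = ||X||_2^2 / n is the Lipschitz constant of the gradient of the
    smooth part, so each iteration cannot increase the objective.
    """
    n, d = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    history = [lasso_objective(w, X, y, lam)]
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                     # gradient of smooth part
        w = soft_threshold(w - step * grad, step * lam)  # prox step on l1 part
        history.append(lasso_objective(w, X, y, lam))
    return w, np.array(history)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

w_hat, history = ista(X, y, lam=0.05)
print(np.round(w_hat, 3))
print(history[0], history[-1])
```

The monotone decrease relies on convexity and on the step size $1/L$. The open problems listed above concern what survives of such guarantees when several penalties compete and some of them are nonconvex or only available through stochastic estimates.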
8.2. Choice of Trade-Off Parameters and Identifiability of the Pareto Frontier
- Principled calibration of weights. Develop approaches that connect weights to interpretable quantities (e.g. a bound on worst-case distribution shift size, a target fairness gap, or an interpretability budget). This suggests studying Lagrange-multiplier interpretations and dual formulations whenever constraints are used.
- Sensitivity and stability of solutions. Analyze how minimizers vary with the trade-off parameters, including continuity/discontinuity of minimizers, bifurcations in nonconvex regimes, and conditions ensuring a well-behaved Pareto frontier.
- Recovering the Pareto set. Linear scalarization recovers only supported Pareto optima under convexity. For nonconvex objectives, a substantial part of the frontier may be missed. Designing algorithms that explore non-supported Pareto points (e.g. $\varepsilon$-constraint methods, adaptive scalarizations, or multiobjective proximal methods) remains open.
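The limitation of linear scalarization noted above can be seen on a toy front. The sketch uses an assumed two-objective minimization problem whose Pareto front is the quarter circle $f_2 = \sqrt{1 - f_1^2}$, sampled on a grid; both the front and the grid are illustrative. Every sampled point is Pareto-optimal, yet weighted sums only ever select the two endpoints, while an $\varepsilon$-constraint rule reaches interior points.

```python
import numpy as np

# Sampled Pareto front for two objectives to be minimized: f2 decreases as f1 grows.
t = np.linspace(0.0, 1.0, 101)
front = np.column_stack([t, np.sqrt(1.0 - t ** 2)])


def weighted_sum_choice(points, w1):
    """Index of the point minimizing w1 * f1 + (1 - w1) * f2."""
    scores = w1 * points[:, 0] + (1.0 - w1) * points[:, 1]
    return int(np.argmin(scores))


def epsilon_constraint_choice(points, eps):
    """Index of the point minimizing f1 subject to f2 <= eps."""
    feasible = np.flatnonzero(points[:, 1] <= eps)
    if feasible.size == 0:
        raise ValueError("no point satisfies the constraint")
    return int(feasible[np.argmin(points[feasible, 0])])


# Sweep the scalarization weight and the constraint level.
scalarized = {weighted_sum_choice(front, w) for w in np.linspace(0.01, 0.99, 99)}
constrained = {epsilon_constraint_choice(front, e) for e in np.linspace(0.2, 0.9, 8)}

print("weighted sums select indices:", sorted(scalarized))
print("epsilon-constraint selects indices:", sorted(constrained))
```

The weighted sum is a concave function along this front, so its minimum always sits at an endpoint. That is the geometric reason non-supported Pareto points are invisible to linear scalarization.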
8.3. Scalability and Computational Complexity
- Efficient estimation of dependence penalties. Fairness functionals based on mutual information or conditional constraints require estimating high-dimensional dependence, often under distribution shift. Establishing sample complexity bounds and scalable estimators compatible with stochastic optimization is an important direction.
- Robustness at scale. Distributional robustness over Wasserstein balls can be costly, and adversarial robustness may require expensive inner maximizations. A key challenge is to identify computationally tractable approximations with explicit error bounds, and to understand when robustness objectives lead to manageable training dynamics.
- Sparse/structured interpretability for large models. Interpretability functionals that promote sparsity, modularity, or explanation stability may be natural for linear or kernel methods but become subtle for deep networks. Determining which structural constraints scale (and which collapse into vacuous penalties) is largely unresolved.
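One case where the inner maximization of adversarial robustness is not expensive is a linear classifier under $\ell_\infty$-bounded input perturbations, where the worst case has a closed form. The sketch below assumes labels in $\{-1, +1\}$ and the hinge loss, and all function names are illustrative. The worst-case margin equals the clean margin minus $\varepsilon\|w\|_1$, so the robust loss is the clean loss with an $\ell_1$-type term inside, a simple instance of robustness acting like a regularity constraint.

```python
import numpy as np


def hinge(w, b, X, y):
    """Per-example hinge loss of the linear classifier x -> w @ x + b."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))


def robust_hinge(w, b, X, y, eps):
    """Closed-form worst-case hinge loss over ||delta||_inf <= eps."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins + eps * np.sum(np.abs(w)))


def worst_case_perturbation(w, y, eps):
    """Perturbation attaining the worst case: push each coordinate of x
    by eps against the label, along the sign of the weight."""
    return -eps * y[:, None] * np.sign(w)[None, :]


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
w = rng.normal(size=4)
b = 0.1
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
eps = 0.1

closed_form = robust_hinge(w, b, X, y, eps)
attacked = hinge(w, b, X + worst_case_perturbation(w, y, eps), y)
print(np.max(np.abs(closed_form - attacked)))  # agreement up to rounding

# Random admissible perturbations never exceed the closed-form worst case.
delta = rng.uniform(-eps, eps, size=X.shape)
print(np.all(hinge(w, b, X + delta, y) <= closed_form + 1e-12))
```

For nonlinear models no such closed form is available in general, which is where the cost of the inner maximization discussed above comes from.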
8.4. Alignment Between Mathematical Definitions and Human-Centric Notions
- Fairness: incompatibilities and context dependence. Different fairness definitions (demographic parity, equalized odds, calibration, individual fairness) can be mutually incompatible depending on the data-generating process. A systematic functional analysis of these incompatibilities (including necessary and sufficient conditions for feasibility and quantitative “prices of fairness”) remains an important theoretical agenda.
- Interpretability: what is the object being stabilized? Stability of predictions is not the same as stability of explanations. Formalizing the explanation object (saliency, concept vectors, local surrogates) and validating that its stability corresponds to meaningful human understanding is still open. The functional approach helps articulate the question, but does not resolve the measurement problem.
- Robustness: choosing the right perturbation model. Wasserstein balls, adversarial perturbations, and distribution shift sets are mathematical proxies for deployment uncertainty. Selecting perturbation classes that accurately reflect real-world shifts (while remaining analyzable) is a key bridge between theory and practice.
8.5. Further Theoretical Directions
- Existence and compactness. Provide general conditions (coercivity, lower semicontinuity, tightness) ensuring existence of minimizers for objectives combining the risk, robustness and fairness penalties, and interpretability scores.
- Generalization under structural constraints. Extend stability and complexity-based generalization analyses to objectives with Wasserstein robust risk, dependence-based fairness penalties, and explanation-based interpretability terms, including sharp rates and minimax optimality where possible.
- Duality and certificates. Identify settings where robust and fair objectives admit strong dual representations. Duality can yield both computational algorithms and verifiable certificates (e.g. worst-case shift witnesses, fairness-violation witnesses).
- Axiomatic completeness. Determine whether there exist “complete” axiom systems for interpretability functionals (analogous to characterizations in risk measures), and whether different axiom choices lead to equivalent or genuinely distinct notions of interpretability.
9. Conclusions
- (i) begin by specifying desired properties as well-posed functionals;
- (ii) analyze feasibility and trade-offs through the induced Pareto structure;
- (iii) derive algorithms and guarantees aligned with these specifications.
Funding
Conflicts of Interest
| Framework | Unified Objective | Functional Formulation | Existence Guarantees | Pareto Structure | Joint Treatment (R/F/I) |
|---|---|---|---|---|---|
| ERM / regularization | Partial | Yes | Yes | No | No |
| DRO / adversarial robustness | Partial | Yes | Yes | No | No |
| Fairness-constrained optimization | Partial | Yes | Limited | No | No |
| Interpretability methods | No | No | No | No | No |
| Multi-objective optimization | Yes | No | No | Yes | Partial |
| This work | Yes | Yes | Yes | Yes | Yes |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).