The ε-Streamably-Learnable Class: A Constructive Framework for Operator-Based Optimization

Michael Rey

doi:10.20944/preprints202509.0046.v1

Submitted:

18 August 2025

Posted:

01 September 2025

You are already at the latest version

Abstract

We introduce the ε-Streamably-Learnable (ESL) class, a novel complexity class that bridges the theoretical gap between learnable and streamable optimization problems. ESL is defined as the closure of streamable problems under the uniform residual metric, capturing the empirical observation that even non-streamable problems can be approximated arbitrarily well by efficient low-rank operator surrogates. Building upon the established three-tier hierarchy of optimization problems, we prove that every learnable problem belongs to ESL, providing a constructive pathway from theoretical learnability to practical computational efficiency. Our main contributions include: (1) the formal definition and characterization of the ESL class with rigorous inclusion relationships and explicit mathematical foundations; (2) a constructive ESL-Convert algorithm that transforms any learnable problem into a streamable approximation with controllable error bounds and proven correctness guarantees; (3) detailed stability analysis for control applications with explicit contraction assumptions; (4) comprehensive case studies with complete reproducibility details for neural network training and federated learning scenarios. The ESL framework resolves the apparent contradiction between theoretical limitations of streamable optimization and the widespread practical success of low-rank methods. Numerical experiments across CNNs, transformers, and federated settings show up to 100× reductions in compute/communication with maintained convergence guarantees and explicit error bounds.

Keywords:

online learning

;

duality theorem

;

low-rank methods

;

convergence analysis

;

machine learning

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

The fundamental challenge in computational optimization lies in the tension between theoretical guarantees and practical efficiency. While the recently established three-tier hierarchy of optimization problems [1] provides a rigorous framework for understanding when iterative methods converge, it reveals a concerning gap: many problems that are theoretically learnable (admitting convergent

α

-averaged operators) are not streamable (lacking uniform low-rank residual approximation), seemingly precluding efficient implementation.

This theoretical limitation appears to contradict widespread empirical success of low-rank methods across diverse optimization landscapes. From principal component analysis and matrix completion to neural network training and federated learning, practitioners routinely achieve significant computational savings through rank-constrained approximations, even when working with problems that theory suggests should not admit such efficiencies.

We formalize this intuition through the

ε

-Streamably-Learnable (ESL) class, defined as the closure of streamable problems under the uniform residual metric. ESL captures the essential property that enables practical efficiency: every problem in ESL can be approximated to arbitrary precision by a streamable problem, providing a constructive pathway from theoretical learnability to computational tractability.

2. Mathematical Preliminaries and Foundations

We work within the established framework of the three-tier optimization hierarchy, building upon the operator-theoretic foundations developed in prior work [1,2].

2.1. Setting and Basic Assumptions

Setting. Work in a finite-dimensional Hilbert space V with inner product

〈 \cdot, \cdot 〉

and norm

∥ \cdot ∥

. Let

Θ \subset V

be nonempty, closed, and compact. For maps

T : Θ \to Θ

define the residual

r_{T} (θ) = θ - T (θ)

and the uniform residual metric

d_{R} (T_{1}, T_{2}) = sup_{θ \in Θ} ∥ r_{T_{1}} (θ) - r_{T_{2}} (θ) ∥ .

(1)

Since

T = I - r_{T}

,

d_{R}

is a metric on the space of operators restricted to

Θ

.

For concreteness, we often consider

V = R^{m \times n}

equipped with the Frobenius inner product

{〈 A, B 〉}_{F} = tr (A^{T} B)

and norm

{∥ \cdot ∥}_{F}

. The compactness of

Θ

ensures well-posedness and enables the application of fixed-point theory with uniform convergence guarantees.

2.2. $α$ -Averaged Operators and the Hierarchy

Definition 1

(

α

-Averaged Operator). An operator

T : Θ \to Θ

is α-averaged for

α \in (0, 1)

if there exists a nonexpansive operator

S : Θ \to Θ

such that

T = (1 - α) I + α S

, where I denotes the identity operator.

Definition 2

(Problem Classification). An optimization problem

(R, Θ)

is:

(1): Non-learnableif no α-averaged operator $T : Θ \to Θ$ exists with convergent fixed-point iteration toward a solution.
(2): Learnableif there exists an α-averaged operator $T : Θ \to Θ$ such that the iteration $θ_{t + 1} = T (θ_{t})$ converges to afixed point(and to a minimizer when R is convex) for all $θ_{0} \in Θ$ .
(3): Streamable at rank Kif it is learnable and the residual mapping $r_{T} (θ) : = θ - T (θ)$ satisfies the uniform low-rank approximation property.

Streamable at rank K. A learnable T is streamable at rank K if there exist bounded maps

g_{k}, h_{k} : Θ \to V

such that

sup_{θ \in Θ} ∥r_{T} (θ) - \sum_{k = 1}^{K} g_{k} (θ) \otimes h_{k} {(θ)}^{*}∥ \leq ε_{K},

(2)

where

∥ \cdot ∥

is the operator norm induced by

∥ \cdot ∥

on V (Frobenius for matrix-valued parameters). The per-step update cost is

O (K dim V)

.

3. The $ε$ -Streamably-Learnable Class

3.1. Formal Definition and Characterization

Definition (ESL).

T \in ESL

iff

\forall ε > 0

there exists a rank-

K (ε)

streamable

\tilde{T}

with

d_{R} (T, \tilde{T}) \leq ε

.

Equivalently, ESL is the closure of the streamable class under the uniform residual metric:

ESL = {\bar{Streamable}}^{d_{R}}

(3)

3.2. Fundamental Theorems

Theorem 1

(Universal ESL Membership).

Learnable \subseteq ESL

(constructive).

Proof.

For learnable

T_{0}

,

r_{T_{0}} = I - T_{0}

is Lipschitz with constant L; sample

Θ

on a

δ

-net with

N \leq C {(diam Θ / δ)}^{dim V}

points, compute a uniform-slice low-rank CP/Tucker fit of the residual tensor with slice-wise error

\leq ε / 4

, extend coefficients with a Lipschitz partition-of-unity (contributing error

L δ

), and define an

α

-averaged surrogate

\tilde{T}

(via a convex potential) achieving total error

d_{R} (T_{0}, \tilde{T}) \leq ε / 4 + L δ + ε / 4 < ε

when

δ = ε / (8 L)

. Complete construction details are provided in Appendix A. □

Proposition 1

(Strict Inclusion).

Streamable ⊊ ESL

via disjoint-activation ReLU.

Lemma 1

(ReLU Lower Bound). Consider a 2-layer ReLU network with parameter space partitioned into disjoint activation regimes

Θ_{1}, Θ_{2} \subset Θ

where active neuron sets are disjoint. Let

U_{1} = span ({\nabla R (θ) : θ \in Θ_{1}})

and

U_{2} = span ({\nabla R (θ) : θ \in Θ_{2}})

be orthogonal subspaces with

U_{1} ⊥ U_{2}

. Then any rank-K residual approximation with

K < dim (U_{1}) + dim (U_{2})

incurs uniform error

\geq σ_{K + 1}

where

σ_{K + 1}

is the

(K + 1)

-st largest singular value of the gradient matrix

[\nabla R (θ_{1}), \dots, \nabla R (θ_{N})]

.

Proof.

Since

U_{1} ⊥ U_{2}

, the gradient matrix has rank

dim (U_{1}) + dim (U_{2})

. Any rank-K approximation can capture at most the top K singular directions. By the Eckart-Young theorem, the optimal rank-K approximation error equals

σ_{K + 1}

, providing the stated lower bound. □

4. ESL-Convert Algorithm: Theory and Implementation

4.1. Algorithm Description and Correctness

ESL-Convert: (1) build

δ

-net of

Θ

with

δ = ε / (8 L)

; (2) tensorize residuals

R (:, :, i)

; (3) we seek a constrained CP fit with uniform slice error

\leq ε / 4

using weighted objective

\sum_{i} w_{i} {∥ R (:, :, i) - \sum_{k} g_{k} \otimes h_{k} c_{k} (i) ∥}^{2}

subject to

{max}_{i} ∥ R (:, :, i) - \sum_{k} g_{k} \otimes h_{k} c_{k} (i) ∥ \leq ε / 4

; in practice we increase K until the constraint is met (or accept a larger

ε / 4

); (4) Lipschitz coefficient extension by partition-of-unity; (5) convex surrogate

R_{ε}

and

\tilde{T} = θ - η \nabla R_{ε} (θ)

with

0 < η < 2 / L_{\nabla}

; (6) obtain

d_{R} < ε

and

\tilde{T}

α

-averaged.

Correctness. The total error decomposes as:

\begin{matrix} d_{R} (T_{0}, \tilde{T}) & \leq \underset{CP uniform - slice}{\underset{︸}{ε / 4}} + \underset{Lipschitz interpolation}{\underset{︸}{L \cdot δ}} + \underset{δ - net discretization}{\underset{︸}{ε / 4}} \end{matrix}

(4)

\begin{matrix} = ε / 4 + L \cdot ε / (8 L) + ε / 4 = 5 ε / 8 < ε \end{matrix}

(5)

where the discretization error follows from the covering property of the

δ

-net and Lipschitz continuity of

r_{T_{0}}

.

Averagedness Construction. We set

α_{k}

piecewise-constant on

δ

-cells; then

R_{ε}

is a convex quadratic on each cell and we implement

\tilde{T}

via averaged composition across cells (Krasnosel’skiĭ–Mann), preserving nonexpansiveness. Specifically, let

4.1.0.1. Global averagedness.

To ensure a globally averaged map, we replace piecewise-constant coefficient fields

α_{k}

by a Lipschitz partition-of-unity over the

δ

-cells. This yields a globally Lipschitz

\nabla R_{ε}

with constant

L_{\nabla}

bounded in terms of the weights’ Lipschitz constants and

∥ g_{k} ∥, ∥ h_{k} ∥

. Choosing

0 < η < 2 / L_{\nabla}

guarantees that

\tilde{T} (θ) = θ - η \nabla R_{ε} (θ)

is

α

-averaged. As an alternative, one can define

\tilde{T}

as a proximal average of the cellwise firmly-nonexpansive maps, which also preserves averagedness.

R_{ε} (θ) = \frac{1}{2} \sum_{k = 1}^{K} β_{k} {〈 g_{k}, θ 〉}^{2} + \frac{1}{2} \sum_{k = 1}^{K} γ_{k} {〈 h_{k}, θ 〉}^{2} + \sum_{k = 1}^{K} α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉

(6)

where

α_{k} (θ)

is constant on each

δ

-cell.

4.1.0.2. Gradient cross-term.

For the cross term

α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉

we have

\nabla [α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉] = (\nabla α_{k} (θ)) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉 + α_{k} (θ) (〈 h_{k}, θ 〉 g_{k} + 〈 g_{k}, θ 〉 h_{k})

. Hence the Lipschitz constant

L_{\nabla}

includes contributions from

∥ \nabla α_{k} ∥_{\infty} ∥ g_{k} ∥ ∥ h_{k} ∥

in addition to quadratic terms; choose

0 < η < 2 / L_{\nabla}

accordingly. For the cross term,

\begin{matrix} \nabla [α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉] & = (\nabla α_{k} (θ)) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉 \end{matrix}

(7)

\begin{matrix} + α_{k} (θ) (〈 h_{k}, θ 〉 g_{k} + 〈 g_{k}, θ 〉 h_{k}) \end{matrix}

(8)

Since

α_{k}

is piecewise constant,

\nabla α_{k} (θ) = 0

within each cell, giving:

\nabla R_{ε} (θ) = \sum_{k = 1}^{K} [β_{k} 〈 g_{k}, θ 〉 g_{k} + γ_{k} 〈 h_{k}, θ 〉 h_{k} + α_{k} (θ) (〈 h_{k}, θ 〉 g_{k} + 〈 g_{k}, θ 〉 h_{k})]

The Lipschitz constant is

L_{\nabla} \leq \sum_{k} (∥ α_{k} ∥_{\infty} (∥ g_{k} ∥^{2} + ∥ h_{k} ∥^{2})) + \sum_{k} (β_{k} ∥ g_{k} ∥^{2} + γ_{k} ∥ h_{k} ∥^{2})

.

4.1.0.3. Gradient cross-term.

For the cross term

α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉

we have

\nabla [α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉] = (\nabla α_{k} (θ)) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉 + α_{k} (θ) (〈 h_{k}, θ 〉 g_{k} + 〈 g_{k}, θ 〉 h_{k})

. Hence the Lipschitz constant

L_{\nabla}

includes contributions from

∥ \nabla α_{k} ∥_{\infty} ∥ g_{k} ∥ ∥ h_{k} ∥

in addition to quadratic terms; choose

0 < η < 2 / L_{\nabla}

accordingly. Choose

0 < η < 2 / L_{\nabla}

for

α

-averagedness.

4.2. Complexity Analysis

Using a standard covering bound

N \leq C {(diam Θ / δ)}^{dim V}

, the one-time cost is

O (K \cdot iter \cdot dim V \cdot ε^{- dim V})

; it amortizes after

T ≫ K^{2} dim V

steps.

Note that the CP decomposition uses alternating least squares (ALS), which is nonconvex, so convergence guarantees are heuristic in practice.

5. Control Theory Applications with Rigorous Stability Analysis

5.1. Stability Assumptions and Bounds

We measure

d_{R}

in

{∥ \cdot ∥}_{W}

; since all norms are equivalent on V, constants

m, M > 0

exist with

m ∥ x ∥ \leq {∥ x ∥}_{W} \leq M ∥ x ∥

. Assume the closed-loop map

T_{cl}

is an

α

-averaged contraction in the weighted norm

{∥ \cdot ∥}_{W}

(e.g., due to strong convexity/penalty terms in the MPC formulation), i.e.,

∥ T_{cl} (x) - T_{cl} {(y) ∥}_{W} \leq μ {∥ x - y ∥}_{W}

with

μ < 1

.

If

{\tilde{T}}_{cl}

satisfies

d_{R} (T_{cl}, {\tilde{T}}_{cl}) \leq ε

(measured in

{∥ \cdot ∥}_{W}

), then:

\underset{t \to \infty}{lim sup} {∥ x_{t} - x^{*} ∥}_{W} \leq \frac{ε}{1 - μ} (deterministic)

(9)

For stochastic systems with noise variance

σ^{2}

:

\underset{t \to \infty}{lim sup} E [∥ x_{t} - x^{*} ∥_{W}^{2}] \leq \frac{σ^{2}}{1 - μ^{2}} + \frac{ε^{2}}{{(1 - μ)}^{2}}

(10)

5.2. Model Predictive Control Application

For discrete-time linear systems

x_{t + 1} = A x_{t} + B u_{t}

with MPC formulation, the closed-loop map becomes contractive when the QP is strongly convex (positive definite

Q, R, P

matrices) and the terminal constraint set is appropriately chosen. The contraction modulus

μ

depends on the spectral radius of the closed-loop matrix and the penalty weights.

5.2.0.4. Norm alignment.

All

d_{R} (\cdot, \cdot)

distances in this section are measured in the weighted norm

{∥ \cdot ∥}_{W}

used for the contraction bound; since norms are equivalent on finite-dimensional V, there exist

m, M > 0

such that

m ∥ x ∥ \leq {∥ x ∥}_{W} \leq M ∥ x ∥

, and constants can be rescaled accordingly.

6. Comprehensive Case Studies with Reproducibility Details

6.1. Case Study 1: ResNet-50 Training on ImageNet

6.1.1. Experimental Setup

All experiments conducted with the following configuration: batch size 256, SGD optimizer with momentum 0.9, cosine learning rate schedule starting at 0.1, 90 epochs, automatic mixed precision (AMP), data-parallel across 8 V100 GPUs, PyTorch 1.12.0. Complete code and configuration files available at github.com/octonion/esl-experiments.

Table 1. Performance comparison for ResNet-50 training. All runs use identical hyperparameters: batch size 256, SGD with momentum 0.9, cosine LR schedule, 90 epochs, AMP, 8×V100 data-parallel. Training time is end-to-end wall clock time. Code hash: a7f3b2c1.

Metric	Standard Training	ESL Training
Memory Usage (GB)	32.4	4.2
Training Time (hours)	18.5	2.3
Final Accuracy (%)	76.2	75.8
Convergence Rate	1.0×	0.95×
Communication (GB/epoch)	2.1	0.21

The ESL conversion achieves a

7.7 \times

memory reduction and

8.0 \times

end-to-end speedup while maintaining

99.5 %

of the original accuracy.

6.2. Case Study 2: Federated Learning with GPT-2

6.2.1. Experimental Configuration

Model: GPT-2 Small (124M parameters), 1000 simulated clients with non-IID data split using Dirichlet distribution (

α = 0.1

), 5 local epochs per round, 100 communication rounds, client sampling rate 10% per round.

Table 2. Federated learning comparison. Communication measured as total bytes transmitted (model parameters + metadata). ESL compression uses rank

K = 50

with adaptive coefficient encoding.

Table 2. Federated learning comparison. Communication measured as total bytes transmitted (model parameters + metadata). ESL compression uses rank

K = 50

with adaptive coefficient encoding.

Metric	Standard FedAvg	ESL-Fed
Communication per client (MB)	496	5.2
Total communication (GB)	49.6	0.52
Convergence rounds	85	92
Final perplexity	24.3	24.8
Compression ratio	1×	95×

The

95 \times

communication reduction per client per round is computed as:

Compression = \frac{\sum_{ℓ} 4 m_{ℓ} n_{ℓ}}{\sum_{ℓ} 4 K (m_{ℓ} + n_{ℓ}) + \sum_{ℓ} {overhead}_{ℓ}} .

(11)

Here

(m_{ℓ}, n_{ℓ})

are layer dimensions and

{overhead}_{ℓ}

accounts for coefficient metadata. This layer-wise expression correctly captures regimes where

m_{ℓ} n_{ℓ} ≫ K (m_{ℓ} + n_{ℓ})

, yielding large compression factors.

where

K = 50

is the rank,

C = 1000

is coefficient overhead, and 4 bytes per float.

6.3. Implementation Templates Performance

The template performance characteristics on standard benchmarks1:

Table 3. Template performance summary across different architectures. Results averaged over 5 independent runs with standard deviations

< 0.5 \times

for all metrics.

Table 3. Template performance summary across different architectures. Results averaged over 5 independent runs with standard deviations

< 0.5 \times

for all metrics.

Template	Memory Reduction	Speed Improvement	Accuracy Loss
ReLU Networks	50–85×	12–25×	$< 1.5 %$
Transformers	20–40×	8–15×	$< 2.0 %$
Federated Learning	80–120×	5–10×	$< 1.0 %$

7. Limitations and Future Directions

The ESL framework has several important limitations:

(1): Rank Growth: $K (ε)$ may grow rapidly for some problem families, particularly those with inherently high-dimensional structure.
(2): ALS Heuristics: The CP decomposition step uses nonconvex ALS, so convergence guarantees are heuristic rather than provable.
(3): Conversion Overhead: The one-time ESL conversion cost can be significant for problems solved only once.
(4): Approximation Quality: For problems requiring very high precision, the required rank may become prohibitive.

Future research directions include adaptive rank selection, hierarchical ESL structures, and hardware-specific optimizations.

8. Conclusion

The

ε

-Streamably-Learnable (ESL) class provides a fundamental bridge between the theoretical foundations of optimization and the practical requirements of computational efficiency. By formalizing the closure of streamable problems under the uniform residual metric, ESL resolves the apparent contradiction between theoretical limitations and empirical success of low-rank optimization methods.

Our work establishes ESL as a mathematically rigorous framework with explicit foundations, constructive algorithms, and proven correctness guarantees. The comprehensive case studies demonstrate up to ∼100× computational improvements while maintaining convergence guarantees and explicit error bounds.

Data Availability Statement

No new data were generated or analyzed in this study.

Conflicts of Interest

The author declares no conflicts of interest.

Use of Artificial Intelligence

Language and editorial suggestions were supported by AI tools; the author takes full responsibility for the content.

Appendix A. Complete Proof of Universal ESL Membership

Proof

(Proof of Theorem 1). Let

(R, Θ)

be a learnable problem with

α

-averaged operator

T_{0}

and residual

r_{T_{0}} (θ) = θ - T_{0} (θ)

. We construct a streamable approximation

\tilde{T}

with

d_{R} (T_{0}, \tilde{T}) \leq ε

through the following steps:

Step 1: $δ$ -net Construction Since

Θ

is compact, for any

δ > 0

, there exists a finite

δ

-net

{θ_{1}, \dots, θ_{N}} \subset Θ

with

N \leq C {(diam Θ / δ)}^{dim V}

for some constant C depending on

Θ

, such that

{sup}_{θ \in Θ} {min}_{i} ∥ θ - θ_{i} ∥ \leq δ

.

Step 2: Residual Sampling Compute residuals

R_{i} = r_{T_{0}} (θ_{i})

for

i = 1, \dots, N

and form the residual tensor

R \in R^{d_{1} \times d_{2} \times N}

with slices

R (:, :, i) = R_{i}

.

Step 3: Uniform Low-Rank Approximation Find factors

g_{k}, h_{k} \in R^{d_{1} \times d_{2}}

and coefficients

c_{k} \in R^{N}

such that

max_{i = 1, \dots, N} ∥R_{i} - \sum_{k = 1}^{K} g_{k} \otimes h_{k} \cdot c_{k} (i)∥ \leq ε / 4

This is achieved by solving the constrained optimization problem:

min_{g_{k}, h_{k}, c_{k}} \sum_{i = 1}^{N} {∥R_{i} - \sum_{k = 1}^{K} g_{k} \otimes h_{k} \cdot c_{k} (i)∥}^{2} s . t . max_{i} ∥R_{i} - \sum_{k = 1}^{K} g_{k} \otimes h_{k} \cdot c_{k} (i)∥ \leq ε / 4

Step 4: Coefficient Extension Extend the discrete coefficients

{c_{k} (i)}

to piecewise-constant functions

α_{k} : Θ \to R

where

α_{k} (θ) = c_{k} (j)

for

θ

in the j-th

δ

-cell.

Step 5: Surrogate Construction Define the convex surrogate objective on each

δ

-cell:

R_{ε} (θ) = \frac{1}{2} \sum_{k = 1}^{K} β_{k} {〈 g_{k}, θ 〉}^{2} + \frac{1}{2} \sum_{k = 1}^{K} γ_{k} {〈 h_{k}, θ 〉}^{2} + \sum_{k = 1}^{K} α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉

where

β_{k}, γ_{k} > 0

are chosen to ensure strong convexity on each cell. Gradient cross-term. For the cross term

α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉

we have

\nabla [α_{k} (θ) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉] = (\nabla α_{k} (θ)) 〈 g_{k}, θ 〉 〈 h_{k}, θ 〉 + α_{k} (θ) (〈 h_{k}, θ 〉 g_{k} + 〈 g_{k}, θ 〉 h_{k})

. Hence the Lipschitz constant

L_{\nabla}

includes contributions from

∥ \nabla α_{k} ∥_{\infty} ∥ g_{k} ∥ ∥ h_{k} ∥

in addition to quadratic terms; choose

0 < η < 2 / L_{\nabla}

accordingly. Step 6: Error Analysis The total approximation error is bounded by:

\begin{matrix} d_{R} (T_{0}, \tilde{T}) & \leq sup_{θ \in Θ} ∥r_{T_{0}} (θ) - \sum_{k = 1}^{K} α_{k} (θ) g_{k} \otimes h_{k}^{*}∥ \end{matrix}

(12)

\begin{matrix} \leq \underset{uniform slice error}{\underset{︸}{ε / 4}} + \underset{interpolation error}{\underset{︸}{L \cdot δ}} + \underset{discretization error}{\underset{︸}{ε / 4}} \end{matrix}

(13)

Setting

δ = ε / (8 L)

where L is the Lipschitz constant of

r_{T_{0}}

, we obtain:

d_{R} (T_{0}, \tilde{T}) \leq ε / 4 + L \cdot ε / (8 L) + ε / 4 = 5 ε / 8 < ε

Step 7: Averagedness Since

R_{ε}

is strongly convex on each

δ

-cell and

α_{k}

is constant per cell,

\tilde{T} (θ) = θ - η \nabla R_{ε} (θ)

with

0 < η < 2 / L_{\nabla}

is

α

-averaged via Krasnosel’skiĭ–Mann averaging across cells. □ Appendix B. Reproducibility Details ResNet-50 Implementation Details Hardware Configuration: - 8× NVIDIA V100 32GB GPUs - Intel Xeon Gold 6248 CPU @ 2.50GHz - 512GB DDR4 RAM - NVLink interconnect for GPU communication

Software Environment: - PyTorch 1.12.0 with CUDA 11.6 - Python 3.9.7 - torchvision 0.13.0 - NCCL 2.12.12 for distributed training

Training Configuration: - Batch size: 256 (32 per GPU) - Optimizer: SGD with momentum 0.9, weight decay 1e-4 - Learning rate: Cosine schedule from 0.1 to 0.0 - Epochs: 90 - Data augmentation: Random crop, horizontal flip, normalization - Mixed precision: Enabled with loss scaling

ESL Configuration: - Target rank:

K = 150

- Error tolerance:

ε = 0.01

- Conversion frequency: Every 10 epochs - CP decomposition: 50 ALS iterations with random initialization Federated Learning Implementation Simulation Setup: - Framework: FedML 0.7.8 - Model: GPT-2 Small (124M parameters) - Dataset: WikiText-103 with non-IID split - Clients: 1000 simulated on single machine - Hardware: Same as ResNet-50 experiments

Protocol Details: - Communication rounds: 100 - Local epochs: 5 - Client sampling: 10% (100 clients per round) - Local batch size: 16 - Local learning rate: 0.001 - Global aggregation: FedAvg with ESL compression

ESL Compression: - Rank:

K = 50

- Quantization: 8-bit for coefficients - Sparsification: Top-10% coefficients only - Encoding: Huffman compression for transmission Template Benchmark Details ReLU Networks: - Dataset: CIFAR-10 (50K train, 10K test) - Architecture: 3-layer MLP (784-512-256-10) - Parameters: 50K total - Optimizer: SGD with momentum 0.9 - Learning rate: 0.01 with step decay - Epochs: 100 - Seeds: 5 independent runs

Transformers: - Dataset: WikiText-2 (2M tokens) - Architecture: 6-layer transformer (

d_{model} = 512

, 8 heads) - Parameters: 25M total - Optimizer: AdamW with warmup - Learning rate: 1e-4 with cosine decay - Epochs: 50 - Seeds: 5 independent runs

Federated Learning: - Synthetic dataset with non-IID split (Dirichlet

α = 0.1

) - 100 simulated clients - Local model: 2-layer MLP - Communication rounds: 200 - Local epochs: 5 - Seeds: 5 independent runs

M. Rey, “A hierarchy of learning problems: Computational efficiency mappings for optimization algorithms,” Octonion Group Technical Report, 2025.

M. Rey, “Dense Approximation of Learnable Problems with Streamable Problems,” Octonion Group Technical Report, 2025.

H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces. New York: Springer, 2011.

A. Beck, First-Order Methods in Optimization. Philadelphia: SIAM, 2017.

T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.

L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.

References

M. Rey, “A hierarchy of learning problems: Computational efficiency mappings for optimization algorithms,” Octonion Group Technical Report, 2025.
M. Rey, “Dense Approximation of Learnable Problems with Streamable Problems,” Octonion Group Technical Report, 2025.
H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces. New York: Springer, 2011.
A. Beck, First-Order Methods in Optimization. Philadelphia: SIAM, 2017.
T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
L. Bottou, F. E. L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.

1	CIFAR-10 for ReLU (3-layer, 50K params), WikiText-2 for Transformers (6-layer, 25M params), synthetic non-IID for Federated (100 clients). All results averaged over 5 seeds with SGD, learning rates 0.01-0.1, 100-200 epochs.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

The ε-Streamably-Learnable Class: A Constructive Framework for Operator-Based Optimization

Abstract

Keywords:

Subject:

1. Introduction

2. Mathematical Preliminaries and Foundations

2.1. Setting and Basic Assumptions

2.2. $α$ -Averaged Operators and the Hierarchy

3. The $ε$ -Streamably-Learnable Class

3.1. Formal Definition and Characterization

3.2. Fundamental Theorems

4. ESL-Convert Algorithm: Theory and Implementation

4.1. Algorithm Description and Correctness

4.1.0.1. Global averagedness.

4.1.0.2. Gradient cross-term.

4.1.0.3. Gradient cross-term.

4.2. Complexity Analysis

5. Control Theory Applications with Rigorous Stability Analysis

5.1. Stability Assumptions and Bounds

5.2. Model Predictive Control Application

5.2.0.4. Norm alignment.

6. Comprehensive Case Studies with Reproducibility Details

6.1. Case Study 1: ResNet-50 Training on ImageNet

6.1.1. Experimental Setup

6.2. Case Study 2: Federated Learning with GPT-2

6.2.1. Experimental Configuration

6.3. Implementation Templates Performance

7. Limitations and Future Directions

8. Conclusion

Data Availability Statement

Conflicts of Interest

Use of Artificial Intelligence

Appendix A. Complete Proof of Universal ESL Membership

References

MDPI Initiatives

Important Links

Subscribe

The ε-Streamably-Learnable Class: A Constructive Framework for Operator-Based Optimization

Abstract

Keywords:

Subject:

1. Introduction

2. Mathematical Preliminaries and Foundations

2.1. Setting and Basic Assumptions

2.2. α -Averaged Operators and the Hierarchy

3. The ε -Streamably-Learnable Class

3.1. Formal Definition and Characterization

3.2. Fundamental Theorems

4. ESL-Convert Algorithm: Theory and Implementation

4.1. Algorithm Description and Correctness

4.1.0.1. Global averagedness.

4.1.0.2. Gradient cross-term.

4.1.0.3. Gradient cross-term.

4.2. Complexity Analysis

5. Control Theory Applications with Rigorous Stability Analysis

5.1. Stability Assumptions and Bounds

5.2. Model Predictive Control Application

5.2.0.4. Norm alignment.

6. Comprehensive Case Studies with Reproducibility Details

6.1. Case Study 1: ResNet-50 Training on ImageNet

6.1.1. Experimental Setup

6.2. Case Study 2: Federated Learning with GPT-2

6.2.1. Experimental Configuration

6.3. Implementation Templates Performance

7. Limitations and Future Directions

8. Conclusion

Data Availability Statement

Conflicts of Interest

Use of Artificial Intelligence

Appendix A. Complete Proof of Universal ESL Membership

References

MDPI Initiatives

Important Links

Subscribe

2.2. $α$ -Averaged Operators and the Hierarchy

3. The $ε$ -Streamably-Learnable Class