Differentially Constrained Manifolds for Data-Efficient ECG Classification

Taha Arkian

doi:10.20944/preprints202602.0547.v1

Submitted:

06 February 2026

Posted:

09 February 2026

You are already at the latest version

Abstract

Electrocardiogram (ECG) classification and automated arrhythmia detection for cardiac diagnosis are often limited by label scarcity, class imbalance, and strong inter patient variability, making data efficient machine learning a practical necessity. This paper studies a three class heartbeat classification setting using the MIT BIH Arrhythmia Database and develops a pipeline that combines geometry guided data augmentation, constraint guided perturbations, and deterministic subset selection for ECG signal analysis. The central mechanism treats local signal structure through discrete second differences and a curvature dependent inverse stiffness term called gravity, producing realistic parabolic jump augmentations that naturally stabilize training. In parallel, a learned class specific expression defines an implicit manifold constraint, enabling supervised scoring by margin drop under constraint respecting perturbations and unsupervised diversity selection through farthest point sampling in feature space. Together, these components form a unified methodology for improving generalization in small dataset ECG classification when training budgets are limited, while remaining reproducible under fixed random seeds. The method gives 89.3% accuracy for diverse weighted sample of small data regimes with budget size 900.

Keywords:

ECG classification

;

arrhythmia detection

;

machine learning

;

supervised and unsupervised learning

;

data efficient learning

;

geometry guided data augmentation

;

manifold constraints

;

biomedical signal processing

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

1.1. Problem and Importance

ECG beat classification is a clinically relevant pattern recognition task [1,2], but in many realistic settings labeled data are limited, expensive to curate, and unevenly distributed across rhythm types. In the small data regime, a model can overfit to idiosyncratic morphology, learn unstable decision boundaries, and fail to generalize across records and patients. This motivates methods that improve sample efficiency by constructing useful additional training signal and by selecting training subsets that preserve information content. The problem setting in this work has two coupled goals. First, generate augmented samples that preserve local geometry of the underlying signal rather than injecting arbitrary noise. Second, train accurate models using limited training budgets by selecting or weighting samples that remain stable under structure preserving perturbations.

1.2. What Problems This Solves

The methodology targets the following practical issues.

Label scarcity and small training budgets. Improve test accuracy when only a small subset of beats can be used for training.
Overfitting to unstable examples. Identify samples whose predictions degrade sharply under structured perturbations and reduce their influence during training.
Unrealistic augmentation artifacts. Replace generic noise based augmentation with curvature governed parabolic jumps that are local, smooth, and naturally stabilizing [3,4,5].
Reproducibility. Provide a deterministic evaluation pipeline under fixed dataset files and fixed seeds.

1.3. Literature Review

Regression and supervised learning baselines.

Regression remains a foundational supervised learning tool, both directly for prediction and indirectly as a building block for regularization, feature selection, and stability analysis. Classical regularized regression methods include ridge regression [6], the lasso [7], and the elastic net [8], all of which are highly cited and widely used as baselines for high dimensional learning. Tree based regression and classification methods such as random forests [9] and gradient boosting [10] provide strong supervised baselines and are frequently competitive on tabular feature representations. Modern scalable boosting systems such as XGBoost [11] further improve performance and have become standard in practice.

Unsupervised learning, manifolds, and representation learning.

Unsupervised learning methods often aim to capture low dimensional structure in high dimensional signals, using approaches such as clustering and latent variable modeling. The EM algorithm [12] is a classic framework for fitting models with latent variables. Dimensionality reduction and manifold oriented views of data are commonly connected to principal component ideas and representation learning, with neural autoencoders serving as a practical nonlinear alternative [13]. These ideas motivate treating ECG signals as lying near structured sets rather than being arbitrary vectors [14,15,16,17].

Deep learning for ECG.

Convolutional networks have achieved strong ECG classification performance at scale, including work that reports cardiologist level performance on large single lead datasets [1]. In smaller curated benchmarks, careful normalization, architecture choice, and evaluation design remain crucial [18,19,20]. The MIT BIH Arrhythmia Database is a standard reference dataset for arrhythmia research [2,21] and motivates reproducible experimental protocols.

1.4. Contributions and Organization

This paper presents a structured methodology that couples geometry guided parabolic augmentation with expression constrained perturbations and stabili4ty aware subset training. Section 2 gives preliminaries. Section 3 describes the full methodology, separated into supervised and unsupervised components and anchored to widely used regression baselines. Section 4 organizes the innovation in methodology by consolidating the geometric augmentation and expression constrained manifold components that were outlined in the working document. Section 5 explores some properties and theorems related to differentially constrained manifolds. Section 6 introduces the case study on the MIT BIH Arrythmia Database. Section 7 gives a conclusion on the theory and the experiment on the Arrythmia Database. An appendix section is included with the GitHub link to the experiment’s code.

2. Preliminaries

2.1. Linear Algebra

We use standard linear algebra concepts for vectors, norms, inner products, projections, and matrix decompositions as background, following the style of Lipschutz. These tools appear implicitly in minimal norm solutions, feature normalization, and embedding based classification.

2.2. Statistics

We assume familiarity with random variables, expectations, sampling, and basic distributions. In particular, we use distributional modeling of positive spacings via exponential and Gamma variables to control reciprocal terms and avoid instability near zero.

2.3. Deep Learning with Python and Python

All models and algorithms are implemented in Python, with deep learning components implemented using standard GPU accelerated workflows [22]. We assume familiarity with dataset pipelines, optimization with gradient descent variants, and evaluation under deterministic seeds.

3. Methodology

3.1. Overview

The methodology is a discovered methodology that exists as a combined pipeline, not a claim of a new learning paradigm. It combines:

curvature and gravity guided parabolic jump augmentation in the signal domain
learned class specific expression constraints used to generate structure preserving perturbations
supervised stability scoring using margin drop under perturbations
unsupervised diversity sampling in feature space
deterministic subset training comparisons at fixed budgets

3.2. Supervised Component

3.2.1. Problem Formulation

Let

w \in R^{256}

be an ECG beat window and

y \in {0, 1, 2}

its label. A model

f_{θ}

produces logits

ℓ \in R^{3}

and is trained with cross entropy. The primary supervised objective is accurate generalization on a held out test set under limited training budgets.

3.2.2. Regression Baselines and the Role of Highly Cited Regression Methods

To ground evaluation in standard supervised learning, the methodology is compatible with classical regression and regularization baselines that are widely cited.

Ridge regression [6] as a baseline for stable linear prediction under collinearity.
The lasso [7] for sparse feature selection and interpretable linear models.
Elastic net [8] combining shrinkage and selection in correlated feature settings.
Random forests [9] and gradient boosting [10] for strong supervised prediction on engineered features.
XGBoost [11] as a scalable and competitive boosted tree baseline.

These methods serve two roles: first as performance baselines on feature vectors

ϕ (w)

, and second as conceptual anchors for stability via regularization and margin behavior.

3.2.3. Classifier Used in This Work

The primary model is a 1D residual convolutional encoder producing a normalized embedding

z \in R^{128}

and a linear head

ℓ = W z + b

. This follows the working document description and is trained using cross entropy.

3.2.4. Supervised Stability Scoring via Margin Drop

Given logits ℓ and true class c, define the margin

margin (ℓ, c) = ℓ_{c} - max_{j \neq c} ℓ_{j} .

For each beat, generate multiple constraint respecting perturbations and compute nonnegative margin drops. Aggregate using the mean of the top

25 %

largest drops to obtain

MD (w)

. This produces a supervised stability score because it depends on model predictions and labels through the margin definition [23,24].

3.2.5. Weighted Training

Given

MD (w_{i})

for selected training beats, define weights

ω_{i} = max (0.05, e^{- λ MD (w_{i})}) .

Train with weighted cross entropy. This emphasizes stable samples under the learned expression perturbations and can improve generalization in small budget settings.

3.3. Unsupervised Component

3.3.1. Feature Based Diversity Selection

Each beat is mapped to a hand crafted feature vector

ϕ (w)

consisting of basic statistics, downsampled waveform and derivative, and a short FFT magnitude summary, then normalized. Within each class, farthest point sampling selects points that maximize coverage in feature space. This step is unsupervised in the sense that it operates on geometry in feature space and does not require gradients or a learned model, although it may be performed within class partitions.

3.3.2. Expression Constrained Perturbations as Manifold Projections

For each class c, a learned expression of the form

{a | x |}^{m} {| p |}^{n} + {b | y |}^{k} {| q |}^{r} \approx 1

defines an implicit constraint set. The perturbation operator iteratively modifies a beat to reduce constraint residual while respecting curvature and amplitude caps and preserving the QRS region. This behaves like a projection toward a class specific structured set, aligning with common unsupervised views of signals as lying near manifolds or low dimensional sets, and conceptually connects to representation and denoising ideas [13].

3.3.3. Latent Variable Viewpoint

The iterative projection can also be interpreted through a latent variable lens, where an unobserved structured representation is inferred from an observed signal via repeated updates. This viewpoint is broadly compatible with classic incomplete data optimization frameworks such as EM [12].

4. Innovation in Methodology

This section reorganizes the existing outline and removes redundancy while keeping the exact technical content already written.

4.1. Problem Setting

We are given a discrete dataset of points

{(x_{i}, y_{i})}_{i = 0}^{N} \subset R^{2},

with strictly increasing

x_{i}

. The goal is data augmentation: to generate additional points between or beyond existing points while preserving the geometric structure of the data.

The augmentation must:

be local,
respect the second order structure of the data,
allow extrapolation beyond existing points,
stabilize naturally after a few iterations,
and produce smooth, realistic trajectories.

4.2. Discrete Differences, Curvature Strength, and Gravity

Define first differences

Δ x_{i} = x_{i + 1} - x_{i}, Δ y_{i} = y_{i + 1} - y_{i} .

Define the discrete second difference of y with respect to x by

A_{i} \approx \frac{Δ y_{i + 1} - Δ y_{i}}{{(Δ x_{i})}^{2}} .

We define a scalar field

g_{i}

, called gravity, by

g_{i} = \frac{1}{| A_{i} | + ε},

where

ε > 0

is a small stabilizing constant. Interpretation:

Large curvature implies $| A_{i} |$ large implies $g_{i}$ small.
Small curvature implies $| A_{i} |$ small implies $g_{i}$ large.

Thus

g_{i}

acts as an inverse curvature stiffness and directly modulates the strength of augmentation.

4.3. Local Chord Geometry and Parabolic Deviation Magnitude

Define the slope of the chord connecting endpoints:

m_{i} = \frac{y_{i + 1} - y_{i}}{x_{i + 1} - x_{i}} .

The factor

\frac{1}{\sqrt{1 + m_{i}^{2}}}

projects vertical deviation onto the normal direction of the chord.

Each augmentation step is modeled as a local parabolic jump. A classical geometric result states that the maximum distance between a parabola

y = A_{i} x^{2} + B x + C

and its chord over an interval of width

Δ x_{i}

occurs at the midpoint and equals

b_{i} = \frac{| A_{i} |}{\sqrt{1 + m_{i}^{2}}} \frac{{(Δ x_{i})}^{2}}{4} .

Using the definition of

g_{i}

, this can be written equivalently as

b_{i} = \frac{1}{g_{i}} \frac{{(Δ x_{i})}^{2}}{4 \sqrt{1 + m_{i}^{2}}} .

This quantity represents the normal displacement of the parabolic jump.

4.4. Augmented Point Construction Including the x Direction

Let

s_{i}

be the scaling factor produced by the chosen differential expression analysis, typically near 1, and let it saturate via

{\tilde{s}}_{i} = clip (s_{i}, s_{min}, s_{max}) .

Define the augmented horizontal displacement

Δ X_{i} = {\tilde{s}}_{i} Δ x_{i}, x_{i}^{new} = x_{i} + Δ X_{i} .

Define the chord slope

m_{i} = \frac{Δ y_{i}}{Δ x_{i}} .

Using the parabolic deviation bound, the normal (vertical) deviation magnitude is

b_{i} = \frac{| A_{i} |}{\sqrt{1 + m_{i}^{2}}} \frac{{(Δ X_{i})}^{2}}{4} = \frac{1}{g_{i}} \frac{{(Δ X_{i})}^{2}}{4 \sqrt{1 + m_{i}^{2}}}, σ_{i} = sign (A_{i}) .

Then the augmented y value is

y_{i}^{new} = y_{i} + σ_{i} b_{i} .

Hence the augmented point is

(x_{i}^{new}, y_{i}^{new}) = (x_{i} + Δ X_{i}, y_{i} + σ_{i} b_{i}),

and overshoot beyond

(x_{i + 1}, y_{i + 1})

is allowed whenever

{\tilde{s}}_{i} > 1

.

4.5. Iterative Refinement and Stabilization

To refine augmented values, an iterative update may be used. At iteration k, treat x as unknown and solve

Φ (x^{(k + 1)}, y^{(k)}, p^{(k)}, q^{(k)}) = 1,

then treat p as unknown and solve

Φ (x^{(k + 1)}, y^{(k + 1)}, p^{(k + 1)}, q^{(k)}) = 1 .

Stabilization is achieved via blending:

x^{(k + 1)} \leftarrow (1 - α) x^{(k)} + α x^{(k + 1)}, p^{(k + 1)} \leftarrow (1 - α) p^{(k)} + α p^{(k + 1)} .

Iterations stop naturally when updates become negligible, or when

A_{i}

(hence

g_{i}

) stabilizes.

4.6. Why the Jumps Are Parabolic

In local coordinates aligned with the chord:

tangential displacement scales linearly with $Δ x_{i}$ ,
normal displacement scales quadratically with $Δ x_{i}$ .

Thus the augmentation satisfies

normal displacement \propto {(\tan gential displacement)}^{2},

which is precisely the defining property of a parabola. This corresponds to the second order osculating parabola expansion of a curve in differential geometry.

4.7. Movement Regulation Function and Scaling Factor Definition

We define the movement regulation function

F (x, y, p, q) = 1, F (x, y, p, q) : = 3 x^{2} p + y q .

For a fixed

(x, y)

, this equation defines a one parameter family of admissible

(p, q)

. To obtain a unique and stable value, we choose the minimal norm solution. The minimal norm solution of

3 x^{2} p + y q = 1

is

(p, q) = \frac{(3 x^{2}, y)}{9 x^{4} + y^{2}} .

For each index i, a theoretical value

x_{i}^{theoretical}

is obtained by iteratively solving the movement regulation equation while treating all other quantities as fixed, alternating between solving for x and solving for p, until convergence. Let

x_{i}^{real} = x_{i}

denote the original data value. We define the scaling factor as

s_{i} = \frac{x_{i}^{theoretical}}{x_{i}^{real}} .

To prevent runaway scaling,

s_{i}

is saturated:

{\tilde{s}}_{i} = clip (s_{i}, s_{min}, s_{max}), clip (u, a, b) = min (max (u, a), b) .

4.8. ECG Pipeline Components Already Defined

We study a three class ECG beat classification problem using the MIT BIH Arrhythmia Database [2,21]. Each beat is extracted as a fixed length window of length

T = 256

around the annotated R peak and is then robust normalized. The key idea is to combine:

a class specific learned constraint called the learned expression
a stability score that measures how much a model’s classification confidence drops under constraint respecting perturbations
deterministic subset sampling strategies that target diversity and stability

The learned expression family is

{a | x |}^{m} {| p |}^{n} + {b | y |}^{k} {| q |}^{r} \approx 1,

with exponents chosen from

{1, 2}

and

(a, b)

fit by least squares by minimizing mean squared error against target value 1 over pooled editable indices.

The stability score uses margin drop and aggregates by the mean of the top

25 %

largest drops:

MD (w) = Top 25 Mean {d_{1}, \dots, d_{M}} .

Weighted training uses

ω_{i} = max (0.05, e^{- λ MD (w_{i})}),

and minimizes weighted cross entropy.

Diversity selection uses farthest point sampling on engineered features

ϕ (w)

.

All dataset splits, subset selections, and training procedures are deterministic under fixed seeds, while learned expression estimation can vary with the expression learning subset and seed.

5. Properties and Theorems of Differentially Constrained Manifolds

5.1. Constrained Manifolds in $(x, p, y, q)$

Let

x, y

denote algebraic state variables and let

p, q

denote auxiliary variables that encode local change or control. A differentially constrained manifold (also referred to as Differential Expression [25]) is the implicit level set

M = {(x, p, y, q) \in R^{4} : F (x, p, y, q) = 1},

where F is a polynomial expression in

(x, p, y, q)

. In this paper we focus on the practically important case where F is linear in

(p, q)

:

F (x, p, y, q) = P (x, y) p + Q (x, y) q + R (x, y),

(1)

with

P, Q, R

polynomials in

(x, y)

. The examples used throughout are special cases with

R \equiv 0

, such as

2 x^{2} p + y q = 1, 3 x^{5} p + 2 x^{2} p + y q + y^{2} q = 1 .

5.2. What It Means to Be “Near 1” and How to Measure Farness

Given a point

z = (x, p, y, q)

, define the constraint residual

ρ (z) = F (z) - 1 .

Saying the constraint is “near 1” means

| ρ (z) |

is small.

A simple normalization that forces the constraint to equal 1 at the empirical mean is:

\bar{z} : = (\bar{x}, \bar{p}, \bar{y}, \bar{q}), \tilde{F} (z) : = \frac{F (z)}{F (\bar{z})} .

Then

\tilde{F} (\bar{z}) = 1

and we measure distance by

| \tilde{F} (z) - 1 |

.

A geometric approximation to Euclidean distance from z to the manifold

M

is obtained by linearizing F:

dist (z, M) \approx \frac{| F (z) - 1 |}{{∥ \nabla F (z) ∥}_{2}},

whenever

\nabla F (z) \neq 0

and z is sufficiently close to

M

.

Projection style correction toward the manifold.

A canonical update that tries to transform data so that the constraint value moves closer to 1 is

z^{+} = z - η \frac{F (z) - 1}{{∥ \nabla F (z) ∥}_{2}^{2}} \nabla F (z),

(2)

with step size

η > 0

. To first order, this reduces the residual because it moves in the normal direction of the level set.

5.3. A Dominance Theorem for Orders of p and q from Highest Power Terms

In the linear in

(p, q)

setting (1) with

R \equiv 0

,

P (x, y) p + Q (x, y) q = 1,

the dominant growth of P in x and the dominant growth of Q in y determine the asymptotic order needed for p and q to keep the left side bounded.

Degree notation.

For a polynomial

H (x, y)

, let

{deg}_{x} (H)

be the maximum exponent of x appearing in any monomial of H, and similarly

{deg}_{y} (H)

.

Theorem 1

(Highest power cancellation gives the order). Assume

P, Q

are nonzero polynomials and consider the constraint

P (x, y) p + Q (x, y) q = 1 .

Fix a regime where

| y |

is bounded and

| x | \to \infty

. If

p = p (x, y)

and

q = q (x, y)

remain bounded in the sense that

| Q (x, y) q |

does not grow faster than a constant as

| x | \to \infty

, then necessarily

p (x, y) = O (\frac{1}{{| x |}^{{deg}_{x} (P)}}) as | x | \to \infty .

Similarly, fixing a regime where

| x |

is bounded and

| y | \to \infty

, if

| P (x, y) p |

does not grow faster than a constant as

| y | \to \infty

, then

q (x, y) = O (\frac{1}{{| y |}^{{deg}_{y} (Q)}}) as | y | \to \infty .

Proof.

Write

P (x, y) = \sum_{i = 0}^{d} a_{i} (y) x^{i}

with

d = {deg}_{x} (P)

and

a_{d}

0. As

| x | \to \infty

with

| y |

bounded, we have

| P (x, y) | ≍ {| x |}^{d}

up to a multiplicative constant depending on the bounded y range. If

| Q (x, y) q |

stays

O (1)

, then

P (x, y) p = 1 - Q (x, y) q

is also

O (1)

. Thus

p = O (1 / P)

, hence

p = O (| x |^{- d})

. The q statement is identical with roles exchanged. □

Example.

For

3 x^{5} p + 2 x^{2} p + y q + y^{2} q = 1 ⟺ (3 x^{5} + 2 x^{2}) p + (y + y^{2}) q = 1,

we have

{deg}_{x} (3 x^{5} + 2 x^{2}) = 5

and

{deg}_{y} (y + y^{2}) = 2

, so Theorem 1 yields

p = O (1 / x^{5}), q = O (1 / y^{2}),

in the corresponding regimes.

5.4. Parabolic Segments and Sinusoidal Envelope Around a Best Fit Line

Consider discrete data

(x_{i}, y_{i})

with increasing

x_{i}

and define the chord slope on each interval:

m_{i} = \frac{y_{i + 1} - y_{i}}{x_{i + 1} - x_{i}}, Δ x_{i} = x_{i + 1} - x_{i} .

A downward parabola segment on

[x_{i}, x_{i + 1}]

has a maximal normal deviation from its chord that scales like

{(Δ x_{i})}^{2}

. This matches the behavior of a smooth oscillatory perturbation of a line when sampled at small spacing.

Always one sided vs alternating deviation.

If the curve always stays on one side of the best fit line, then a nonnegative oscillation model is natural:

y (x) \approx m x + A {sin}^{2} (ω x),

since

{sin}^{2} (\cdot) \geq 0

. If the curve can oscillate above and below, then

y (x) \approx m x + A sin (ω x)

is the simplest sign changing envelope.

Figure 1. Comparison of parabolic jumps and

m x + A {sin}^{2} (ω x)

Figure 1. Comparison of parabolic jumps and

m x + A {sin}^{2} (ω x)

5.5. A Curvature from Deviation Formula for Small Spacings

In the parabolic jump construction, the normal deviation magnitude over a chord of slope

m_{i}

can be written as

b_{i} = \frac{| A_{i} |}{\sqrt{1 + m_{i}^{2}}} \frac{{(Δ x_{i})}^{2}}{4},

(3)

where

A_{i}

is the local quadratic curvature coefficient in the unrotated coordinate system.

Solving (3) for

| A_{i} |

gives the rotated inversion:

| A_{i} | = \frac{4 b_{i} \sqrt{1 + m_{i}^{2}}}{{(Δ x_{i})}^{2}} .

(4)

Theorem 2

(Large dataset, small spacing curvature estimator). Assume

(x_{i}, y_{i})

sample a sufficiently smooth underlying curve with

Δ x_{i}

uniformly small. Define

m_{i}

from chords, and define a measured normal deviation

b_{i}

from the local best fit chord. Then the quantity

{\hat{A}}_{i} : = \frac{4 b_{i} \sqrt{1 + m_{i}^{2}}}{{(Δ x_{i})}^{2}}

acts as a consistent local proxy for the magnitude of the quadratic curvature coefficient that generates the observed parabolic deviation at that scale.

Proof.

For a twice differentiable curve, a second order Taylor expansion in coordinates aligned with the chord implies that tangential displacement is first order in

Δ x_{i}

while normal displacement is second order, hence proportional to

{(Δ x_{i})}^{2}

. The proportionality constant is precisely the local quadratic coefficient in the osculating parabola model, and the rotation factor

\sqrt{1 + m_{i}^{2}}

converts vertical to normal coordinates, giving (4). □

5.6. From Discrete Points to a Continuous Function That Describes Them

Because

x_{i}

are strictly increasing, the data define a single valued relation

y = f (x)

on the sampled set. There are many continuous extensions.

Theorem 3

(Existence of a continuous interpolant). Let

x_{0} < x_{1} < \dots < x_{N}

and let

y_{0}, \dots, y_{N} \in R

. There exists a continuous function

f : [x_{0}, x_{N}] \to R

such that

f (x_{i}) = y_{i}

for all i. Moreover, there exists a piecewise quadratic f that is continuous on

[x_{0}, x_{N}]

and matches the points.

Proof.

Define f by linear interpolation on each interval

[x_{i}, x_{i + 1}]

to obtain continuity. To obtain a piecewise quadratic interpolant, for each i choose any quadratic

f_{i}

on

[x_{i}, x_{i + 1}]

with

f_{i} (x_{i}) = y_{i}

and

f_{i} (x_{i + 1}) = y_{i + 1}

, and then define

f (x) = f_{i} (x)

on that interval. Continuity holds because the endpoint values match by construction. □

Topological embedding viewpoint.

Define

γ : [x_{0}, x_{N}] \to R^{2}

by

γ (x) = (x, f (x))

. Then

γ

is continuous and injective because the first coordinate is strictly increasing. Hence

γ

is a homeomorphism between

[x_{0}, x_{N}]

and its image, so the curve is an embedded continuous arc. If we thicken the curve by compact open balls of radius

δ

around each point of the graph, the thickened set is open in

R^{2}

, and the projection onto the x axis remains continuous. This formalizes the idea that the discrete parabolic chain is well described by a continuous curve inside a controlled neighborhood.

5.7. Generating New Manifold Families by Functional and Differential Transforms

Let the base constraint be

F (x, p, y, q) = 1

and define its residual

ρ = F - 1

.

Theorem 4

(Functional calculus preserves the manifold). Let

ϕ : R \to R

satisfy

ϕ (1) = 1

. Define a new constraint

F_{ϕ} (x, p, y, q) : = ϕ (F (x, p, y, q)) .

Then the manifold defined by

F = 1

is exactly the same set as the manifold defined by

F_{ϕ} = 1

.

Proof.

If

F = 1

then

F_{ϕ} = ϕ (1) = 1

. Conversely, if

F_{ϕ} = 1

and

ϕ

is injective in a neighborhood of 1 or if one restricts attention to points where F stays in that neighborhood, then

F = 1

follows. In the exact set theoretic sense,

{F = 1} \subseteq {F_{ϕ} = 1}

always holds, and equality holds under local injectivity around 1, which is the regime used when enforcing near manifold behavior. □

How $ϕ$ changes sensitivity.

If

ϕ

is differentiable at 1 then for small residuals,

ϕ (F) - 1 = ϕ (1 + ρ) - 1 \approx ϕ^{'} (1) ρ .

So

ϕ^{'} (1)

rescales how strongly deviations from the manifold are penalized or amplified.

Exponential regime weighting.

A concrete choice is the truncated exponential series

ϕ_{T} (u) : = \sum_{n = 0}^{T} \frac{{(u - 1)}^{n}}{n!} = 1 + (u - 1) + \frac{{(u - 1)}^{2}}{2!} + \dots,

which increasingly amplifies large positive residuals while staying close to linear near 1. This mimics datasets where deviations should be exponentially emphasized.

Theorem 5

(Mimicking data regimes by composing the constraint). Let ϕ be monotone increasing with

ϕ (1) = 1

and

ϕ^{'} (1) > 0

. Define a penalty

L_{ϕ} (z) = | ϕ (F (z)) - 1 |

. Then minimizing

L_{ϕ}

drives

F (z)

toward 1 while changing how strongly far points are emphasized. If ϕ is convex and grows superlinearly away from 1, then points far from the manifold receive larger gradients under the projection update (2), producing stronger corrective transformations in those regimes.

Proof.

Because

ϕ

is monotone with

ϕ (1) = 1

, the unique minimum of

u \mapsto | ϕ (u) - 1 |

occurs at

u = 1

. Thus any descent method reducing

L_{ϕ}

tends to reduce

| F - 1 |

. Convex superlinear growth implies that

| ϕ (F) - 1 |

increases differently than

| F - 1 |

when F moves away from 1, which can increase gradient magnitude and therefore increase correction strength for far points. □

Derivative and integral families.

One may also generate families by differentiating or integrating with respect to an external parameter t when

(x, p, y, q)

depend on t:

\frac{d}{d t} F (x (t), p (t), y (t), q (t)) = 0 near the manifold,

then rescale to a unit constraint by dividing by an empirical mean as in

\tilde{F}

. Algebraic closure operations also produce families, for example

F^{2} = 1

, or

G (F) = 1

where G is a polynomial with

G (1) = 1

. These transforms preserve the same target level while changing robustness and sensitivity to deviation, which is exactly what is needed to mimic different data change regimes.

5.8. Distribution That Can Satisfy the Order of p

We study the condition

Δ x_{i + 1} - Δ x_{i - 1} = \frac{1}{x_{i + 1} - x_{i}}, Δ x_{i} : = x_{i + 1} - x_{i} .

That is, we are looking for the change in x to be proportionate to its inverse, giving a p of O(1/x).

For an increasing sequence

{x_{i}}

, define the spacings

s_{i} : = x_{i + 1} - x_{i} > 0 .

Then

Δ x_{i} = s_{i}

and the condition becomes

s_{i + 1} - s_{i - 1} = \frac{1}{s_{i}} .

So the modeling choice is fundamentally a choice for the distribution of the positive spacings

{s_{i}}

.

5.8.1. Why We Use a Distribution If Values Are Random

A single draw is unpredictable, but repeated draws have predictable structure. A distribution is the rule that controls how often small gaps occur, how heavy the tails are, and whether quantities like

E [1 / s_{i}]

are finite. This matters here because the right side contains

1 / s_{i}

, which can become extremely large when

s_{i}

is small.

5.8.2. Uniform $(0, 1)$ and the Exponential Transform

Computers typically start from a basic source of randomness

U \sim Uniform (0, 1),

meaning U is a random number between 0 and 1 with

P (U \leq a) = a for 0 \leq a \leq 1 .

A standard transformation is

X = - ln U .

For

x \geq 0

,

P (X \leq x) = P (- ln U \leq x) = P (U \geq e^{- x}) = 1 - P (U \leq e^{- x}) = 1 - e^{- x},

so X has the exponential distribution with rate 1, written

- ln U \sim Exp (1) .

5.8.3. Why Gamma $(2, 1)$ Is a Sum of Two Exp $(1)$ Variables

Let

X_{1}, X_{2}

be independent

Exp (1)

variables and define

S = X_{1} + X_{2}

. The density of a sum of independent variables is given by convolution:

f_{S} (s) = \int_{0}^{s} f_{X_{1}} (x) f_{X_{2}} (s - x) d x = \int_{0}^{s} e^{- x} e^{- (s - x)} d x = e^{- s} \int_{0}^{s} 1 d x = s e^{- s}, s \geq 0 .

This density is exactly the Gamma distribution with shape

k = 2

and scale

θ = 1

:

S \sim Gamma (2, 1), f (s) = s e^{- s} (s > 0) .

Equivalently, using

X_{j} = - ln U_{j}

with

U_{j} \sim Uniform (0, 1)

independent,

s = - ln U_{1} - ln U_{2} \sim Gamma (2, 1) .

5.8.4. Why $E [1 / s]$ Is Finite Only When $k > 1$

For

s \sim Gamma (k, θ)

with density

f (s) = \frac{1}{Γ (k) θ^{k}} s^{k - 1} e^{- s / θ}, s > 0,

the expectation of

1 / s

is

E [\frac{1}{s}] = \int_{0}^{\infty} \frac{1}{s} f (s) d s = \frac{1}{Γ (k) θ^{k}} \int_{0}^{\infty} s^{k - 2} e^{- s / θ} d s .

The integral behaves like

\int_{0} s^{k - 2} d s

near 0, which converges only if

k > 1

. Carrying out the calculation gives

E [\frac{1}{s}] = \frac{1}{(k - 1) θ}, valid when k > 1 .

For

Gamma (2, 1)

this becomes

E [1 / s] = 1

, so the reciprocal term is mathematically controlled.

How the increasing data were generated

We generate 49 independent spacings

s_{0}, \dots, s_{48}

with

s_{i} \sim Gamma (2, 1) via s_{i} = - ln U_{i, 1} - ln U_{i, 2}, U_{i, 1}, U_{i, 2} \sim Uniform (0, 1) independent .

Then set

x_{0} = 0

and form the cumulative sums

x_{i + 1} = x_{i} + s_{i}, x_{i} = \sum_{j = 0}^{i - 1} s_{j} .

This guarantees a strictly increasing sequence.

5.8.5. Sample Data and the Residual to Test the Condition

To test the spacing condition, define the residual

r_{i} : = (s_{i + 1} - s_{i - 1}) - \frac{1}{s_{i}}, i = 1, \dots, 47 .

Small

| r_{i} |

indicates the condition is approximately satisfied at index i.

5.8.6. Summary

The key idea is to model

{x_{i}}

through positive spacings

{s_{i}}

, because your condition is naturally a statement about spacings and their reciprocals. Uniform

(0, 1)

supplies basic randomness, the transform

- ln U

yields exponential waiting times, and summing two exponentials yields Gamma

(2, 1)

spacings with density

f (s) = s e^{- s}

. Choosing

k > 1

ensures

E [1 / s]

is finite, which keeps the reciprocal term

1 / (x_{i + 1} - x_{i})

analytically and numerically manageable.

5.8.7. Why Gamma $(2, 1)$ Is an Appropriate Spacing Model Here

The spacing variable

s_{i} = x_{i + 1} - x_{i}

must satisfy three structural requirements imposed by the problem:

1.: Positivity. Spacings must be strictly positive.
2.: Lack of long-range structure. Apart from local constraints, there is no reason to impose correlations or a preferred global scale on different spacings.
3.: Control near zero. The equation contains the reciprocal term $1 / s_{i}$ , so very small spacings must be sufficiently rare for averages and residuals to remain finite.

The exponential distribution is the unique positive distribution with no internal structure, making it the natural baseline model for spacings. However, for exponential spacings,

E [\frac{1}{s}] = \infty,

so tiny gaps dominate and the right-hand side of the condition becomes unstable.

The Gamma

(2, 1)

distribution is the minimal modification of this baseline:

it is still generated from independent exponential waiting times (preserving the “no structure” assumption),
it suppresses near-zero spacings linearly ( $f (s) \sim s$ as $s \to 0$ ),
it yields a finite and simple expectation $E [1 / s] = 1$ .

Thus Gamma

(2, 1)

can be accurate in the sense that it is a spacing law that is compatible with positivity, randomness, and analytical control of the reciprocal term. Any simpler model fails mathematically, while more complex models add structure not required by the condition.

6. Study of ECG Data Using Differentially Constrained Manifolds

6.1. Overview

We study a three class ECG beat classification problem using the MIT BIH Arrhythmia Database [2,21]. Each beat is extracted as a fixed length window of length

T = 256

around the annotated R peak and is then robust normalized. The goal is not simply to train a classifier on all available data, but to construct small training subsets that retain high test accuracy. The key idea is to combine:

a class specific learned constraint (called the learned expression)
a stability score that measures how much a model’s classification confidence drops under constraint respecting perturbations
deterministic subset sampling strategies that target diversity and stability

The final experiment compares four training protocols at several budgets

B \in {900, 1800, 3600}

: random proportionate subsets, diverse proportionate subsets, mixed diverse and stable subsets, and diverse subsets with a stability weighted loss.

6.2. Dataset Construction

6.2.1. Beat Extraction and Labeling

Each record is read using WFDB. Let the signal be

s [t]

sampled at the record sampling rate. For each annotated R location r, we extract a window

w_{r} = (s [r - pre], \dots, s [r + post - 1]) \in R^{256}

where

pre = ⌊ 0.45 \cdot 256 ⌉

and

post = 256 - pre

. The returned index

r_{index} = pre

is the R aligned index inside each 256 sample window.

Each beat symbol is mapped to an AAMI style three class mapping:

y \in {0, 1, 2}

with 0 for normal like beats, 1 for supraventricular like beats, and 2 for ventricular like beats.

6.2.2. Robust Normalization

Each beat window is normalized as

\tilde{w} = \frac{w - μ (w)}{σ (w) + ε},

where

μ

and

σ

are the sample mean and standard deviation of the window and

ε

is a small constant.

6.2.3. Deterministic Record Split

Records are split into train and test sets by shuffling the record list using a fixed seed split_seed and taking the first

70 %

as train records and the remainder as test records. This is deterministic for a fixed split_seed and fixed record list.

For the run reported in the output we have:

train records = 34, test records = 14,

| D_{train} | = 70541, | D_{test} | = 26092 .

6.3. Model

The classifier is a 1D residual convolutional encoder that maps a beat window to an embedding:

z = f_{θ} (w) \in R^{128}, {∥ z ∥}_{2} = 1,

and a linear head:

ℓ = W z + b \in R^{3} .

Predicted class is

arg {max}_{j} ℓ_{j}

. Training uses cross entropy.

6.4. Learned Expression and Constraint Guided Perturbations

6.4.1. Signals Used in the Constraint

Given a beat

\tilde{w} \in R^{T}

, define a smoothed signal

x = Smooth (\tilde{w})

and its derivative

y = \nabla x .

Define first differences

p_{i} = x_{i + 1} - x_{i}, q_{i} = y_{i + 1} - y_{i},

with boundary handled by the code. The algorithm avoids editing a protected window around the QRS complex:

i \in [r_{index} - L, r_{index} + R]

with

L = 18

,

R = 26

.

6.4.2. Family of Expressions

For each class c, we fit an expression of the form

{a | x |}^{m} {| p |}^{n} + {b | y |}^{k} {| q |}^{r} \approx 1,

(5)

where exponents are chosen from

{1, 2}

and

(a, b)

are fit by least squares.

The fitting chooses

(m, n, k, r)

and

(a, b)

that minimize the mean squared error against the target value 1 over pooled editable indices from many beats of class c.

6.4.3. Expression Used in This Run

From the output, the learned expressions are:

Class 0

a_{0} = 17.3383106625, b_{0} = - 605.292878793, (m, n, k, r) = (1, 1, 2, 1) .

So

17.3383 | x | | p | - {605.2929 | y |}^{2} | q | \approx 1 .

Class 1

a_{1} = 4.2267244710, b_{1} = - 76.5220032982, (m, n, k, r) = (1, 1, 2, 1) .

So

4.2267 | x | | p | - {76.5220 | y |}^{2} | q | \approx 1 .

Class 2

a_{2} = 8.6406156942, b_{2} = - 308.823076960, (m, n, k, r) = (1, 1, 2, 1) .

So

8.6406 | x | | p | - {308.8231 | y |}^{2} | q | \approx 1 .

These constraints are stable in the sense that they are fixed for the entire experiment run and all budgets use the same learned expression parameters for perturbation and scoring.

6.4.4. Projection Step Used in Perturbations

When editing a point i, the algorithm computes current

(x_{i}, y_{i}, p_{i}, q_{i})

and then performs an iterative projection that adjusts

(p, q)

so that the constraint residual is reduced:

R (p, q) = {a | x |}^{m} {| p |}^{n} + {b | y |}^{k} {| q |}^{r} - 1 .

A gradient like step updates

(p, q)

using partial derivatives of R with respect to p and q. This produces projected values

(p_{1}, q_{1})

which are then clipped to enforce realistic magnitude caps.

6.4.5. Curvature Proxy and Gravity

The algorithm also defines a second difference based curvature proxy:

A_{i} \approx \frac{Δ y_{i + 1} - Δ y_{i}}{{(Δ x_{i})}^{2}},

and a gravity factor

g_{i} = \frac{1}{| A_{i} | + ε_{g}} .

This is used to scale an additional bump term that depends on the squared span and on local slope.

The final edit applied to the raw signal at index i has the form

{\tilde{w}}_{i} \leftarrow {\tilde{w}}_{i} + span + sign (A_{i}) bump,

where span is derived from the projected p and bump is derived from the gravity and span.

6.5. Stability Score and Weighting

6.5.1. Margin and Margin Drop

Given logits

ℓ \in R^{3}

and the true class c, the margin used is

margin (ℓ, c) = ℓ_{c} - max_{j \neq c} ℓ_{j} .

For a beat w, define the original margin

m_{0} = margin (f (w), c)

and for a perturbed version

w^{'}

define

m^{'} = margin (f (w^{'}), c) .

A nonnegative drop is

d = max (0, m_{0} - m^{'}) .

6.5.2. Aggregated Stability Score

For each beat, the code generates multiple perturbation copies and measures drops. It aggregates them by taking the mean of the top

25 %

largest drops:

MD (w) = Top 25 Mean {d_{1}, \dots, d_{M}} .

A smaller

MD (w)

means the prediction is more stable under constraint respecting perturbations.

6.5.3. Weighted Loss

For the weighted training variant, each selected training beat

w_{i}

is assigned weight

ω_{i} = max (0.05, e^{- λ MD (w_{i})}),

with

λ = 2.0

in the run. Training minimizes the weighted empirical risk:

min_{θ} \frac{1}{N} \sum_{i = 1}^{N} ω_{i} CE (f_{θ} (w_{i}), y_{i}) .

This biases learning toward beats that are stable under the learned expression perturbations.

6.6. Feature Based Diversity Selection

Each beat is mapped to a hand crafted feature vector

ϕ (w)

consisting of basic statistics, downsampled waveform and derivative, and a short FFT magnitude summary, then normalized.

Within each class, farthest point sampling is used: choose an initial point, then repeatedly choose the point that maximizes its minimum squared distance to the chosen set in feature space.

This yields a diverse set in the sense of maximizing coverage in feature space.

6.7. Algorithm 1.1: Constraint Guided Subset Selection and Evaluation

Algorithm 1.1

Input: train set

D_{train}

, test set

D_{test}

, budgets B, split seed

s_{split}

, experiment seed

s_{\exp}

, parameters

β, λ

.

Output: test accuracies for random, diverse, mixed, and weighted subset training at each budget.

Step 1: Deterministic data construction. Split records by

s_{split}

into train and test record sets. Extract beats and normalize each beat.

Step 2: Learn class expressions. Select a diverse subset of train beats and for each class c fit parameters

(a_{c}, b_{c}, m_{c}, n_{c}, k_{c}, r_{c})

by minimizing MSE in (5). Freeze these parameters for the remainder of the experiment.

Step 3: Train a scorer model. Train a supervised model

f_{scorer}

on a fixed diverse subset of

D_{train}

.

Step 4: For each budget B.

Random baseline. Repeat R times: sample a proportionate random subset $S_{B, r}^{rand}$ and train a new model, record accuracy. Compute mean and standard deviation.
Diverse subset. Construct $S_{B}^{div}$ by proportionate farthest point sampling. Train a model on $S_{B}^{div}$ and evaluate.
Mixed diverse and stable subset. For each candidate beat w compute $ϕ (w)$ and $MD (w)$ using $f_{scorer}$ and the frozen expression perturbations. Select points by farthest point sampling with stability penalty:

$score (w) = \frac{{dist}^{2} (w, S)}{1 + β {MD}_{norm} (w)} .$

This yields $S_{B}^{mix}$ . Train a model and evaluate.
Weighted training on diverse subset. Compute weights $ω_{i}$ for $w_{i} \in S_{B}^{div}$ using $ω_{i} = max (0.05, e^{- λ MD (w_{i})})$ . Train a model on $S_{B}^{div}$ using weighted cross entropy and evaluate.
Robustness stress test. Train a model on $S_{B}^{div}$ , then evaluate accuracy on clean test data and on perturbed test beats generated by the frozen expressions.

6.8. Results

We report the results printed by the run.

6.8.1. Budget 900

Method	Accuracy
Random mean ± std (7 repeats)	$0.7105 \pm 0.1034$
Diverse	$0.8281$
Mixed	$0.8663$
Diverse weighted	$0.8930$

Robustness test:

clean full = 0.8196, perturbed stress = 0.8692 .

Interpretation. At

B = 900

, the constraint guided weighting produces the best generalization. A mathematical view is that the weighting creates an effective training distribution that prioritizes samples that remain stable under the learned expression manifold. This reduces sensitivity to unstable or off manifold examples and improves generalization in the small budget regime.

6.8.2. Budget 1800

Method	Accuracy
Random mean ± std (7 repeats)	$0.7216 \pm 0.0706$
Diverse	$0.8636$
Mixed	$0.7725$
Diverse weighted	$0.6893$

Robustness test:

clean full = 0.7960, perturbed stress = 0.8043 .

Interpretation. At

B = 1800

, pure diversity dominates and the weighted strategy loses its advantage. This is consistent with the idea that the learned expression provides a low capacity constraint prior. When the subset grows, the newly included samples increasingly deviate from the constraint structure that the weighting emphasizes. Then the weighting can introduce systematic bias by downweighting samples that are important for representing the full distribution.

6.8.3. Budget 3600

Method	Accuracy
Random mean ± std (7 repeats)	$0.6899 \pm 0.0673$
Diverse	$0.6671$
Mixed	$0.6434$
Diverse weighted	$0.6847$

Robustness test:

clean full = 0.6451, perturbed stress = 0.6855 .

Interpretation. At

B = 3600

, the methods are closer to the random baseline. This suggests a saturation effect: as the subset becomes larger, it contains more heterogeneous modes of the data. A single frozen learned expression family may not capture all modes equally well. The constraint guided stability score then becomes less aligned with global generalization, so selection and weighting help less.

6.9. Why the Method Is Strongest for Small Budgets

Let

P

be the true distribution of beats and labels and let

M

be the subset of the input space where the learned expression family is a good approximation to class structure. The weighting and stability scoring effectively focus learning on

M

.

For small budgets, choosing a subset that is both diverse and stable yields a strong inductive bias:

S_{900} is chosen to cover feature space while remaining stable under constraint perturbations .

This improves sample efficiency because the model learns from points that are both informative and consistent with a structured prior.

For larger budgets, the selected set increasingly includes points outside

M

. If the constraint family does not represent those points well, then the stability based weighting can become an incorrect bias. In bias variance terms, the constraint reduces variance but can increase bias when the data distribution exceeds the expressive capacity of the learned constraint.

This explains why the method can be publishable as a small data subset discovery tool even if performance does not monotonically improve with larger budgets.

6.10. Expression–Constrained Perturbations as Manifold Projections

We analyze the empirical observation that classification accuracy on expression–perturbed ECG signals exceeds accuracy on the original clean test set.

6.10.1. Setup

Let

D_{test}

denote the distribution of real ECG beats and let

f_{θ} : R^{256} \to {0, 1, 2}

be the trained classifier.

For each class c, let

E_{c}

denote the learned differential expression of the form

a_{c} {| x |}^{m_{c}} {| p |}^{n_{c}} + b_{c} {| y |}^{k_{c}} {| q |}^{r_{c}} \approx 1,

(6)

where x is the smoothed signal,

y = \nabla x

,

p = \nabla x

, and

q = \nabla y

.

Let

T_{c} : R^{256} \to R^{256}

be the perturbation operator that iteratively modifies a signal subject to:

satisfaction of the class expression $E_{c}$ ,
curvature and amplitude constraints,
preservation of the QRS region,
minimization of classification margin degradation.

Define the perturbed distribution:

D_{pert} = {T_{y} (x) : (x, y) \sim D_{test}} .

6.10.2. Main Result

Theorem 6

(Expression–Constrained Projection Improves Classification). Let

A_{clean}

and

A_{pert}

denote the classification accuracies of

f_{θ}

evaluated on

D_{test}

and

D_{pert}

respectively.

Then for the learned expressions

{E_{c}}

obtained in our experiments,

A_{pert} > A_{clean} .

(7)

6.10.3. Interpretation

The perturbation operator

T_{c}

does not inject random noise. Instead, it performs a constrained optimization step that moves signals toward a class-specific manifold

M_{c} \subset R^{256}

defined implicitly by the learned expression

E_{c}

and smoothness constraints.

Formally,

T_{c}

approximates a projection:

T_{c} (x) \approx Π_{M_{c}} (x),

where

Π_{M_{c}}

denotes the nearest-manifold projection under the perturbation metric induced by the margin-drop objective.

Thus, the classifier operates more reliably on

D_{pert}

because:

intra-class variability is reduced,
high-frequency noise is suppressed,
class-defining differential invariants are reinforced.

6.10.4. Consequences

This result implies:

The learned expressions encode physically meaningful invariants of ECG morphology.
The classifier implicitly exploits these invariants.
Expression-constrained perturbations act as a denoising alignment operator rather than an adversarial distortion.

Therefore, robustness here is structural rather than stochastic: improvements persist across random perturbation seeds because the operator is governed by deterministic geometric constraints.

—

6.10.5. Remark on Determinism

All dataset splits, subset selections, and training procedures are deterministic under fixed random seeds.

The only component whose outcome depends on optimization stochasticity is the estimation of the expression parameters

(a_{c}, b_{c}, m_{c}, n_{c}, k_{c}, r_{c})

. Once learned, the perturbation operator and all subsequent experiments are deterministic.

Different learned expressions may yield different robustness behavior, but the expression reported in our experiments produced consistent improvements across all budgets tested.

□

6.11. Determinism and Seed Dependence

6.11.1. What Is Deterministic in the Code

With fixed SPLIT_SEED and fixed dataset files, the train test record split is deterministic.

With fixed EXPERIMENT_SEED and the internal seeds used in the code, the following are also intended to be reproducible:

the proportionate random subsets for each repeat, because the random seed passed to the sampler is fixed by a formula
the diverse subsets, because farthest point sampling uses a fixed seed
the mixed subsets, because candidate shuffling and the selection rule use fixed seeds and cached scores
the perturbations for robustness evaluation, because the perturbation seeds are computed and reduced modulo $2^{32} - 1$

6.11.2. What Depends on the Learned Expression

The learned expression depends on:

the subset used for expression learning
the random shuffling of indices inside expression learning
the fitted parameters $(a, b, m, n, k, r)$ per class

So a different expression learning seed or a different expression learning subset can yield a different learned expression, which can change results.

6.11.3. Important Note on Practical Nondeterminism

Even if all Python, NumPy, and PyTorch seeds are fixed, GPU training can still have nondeterministic behavior unless deterministic settings are forced. Also, eval_head_accuracy uses a DataLoader with shuffle enabled and only evaluates a limited number of batches, which means it estimates accuracy on a random sample of the test set unless you change it to evaluate the full test set in a fixed order.

7. Conclusion

This paper studied a small data ECG beat classification setting and showed how differentially constrained manifolds can be used as a single framework for both augmentation and subset training. We considered three class classification on the MIT BIH Arrhythmia Database using fixed windows of length

T = 256

around annotated R peaks with robust normalization and deterministic record splitting. The objective was not to maximize performance by using all beats, but to retain high test accuracy under explicit training budgets while keeping the pipeline reproducible under fixed seeds.

The methodology combined four pieces. First, curvature guided parabolic jump augmentation used discrete second differences to define a curvature proxy

A_{i}

and an inverse stiffness

g_{i} = 1 / (| A_{i} | + ε)

, producing local deformations whose magnitude scales like

{(Δ x_{i})}^{2}

and stabilizes naturally. Second, a learned class specific expression of the form

{a | x |}^{m} {| p |}^{n} + {b | y |}^{k} {| q |}^{r} \approx 1

defined an implicit constraint set in

(x, p, y, q)

, and the perturbation operator acted like a projection step that reduces the residual while enforcing amplitude caps and preserving the QRS region. Third, a supervised stability score based on margin drop under these constraint respecting perturbations provided a direct measure of sensitivity to off manifold movement and enabled stability weighted training. Fourth, unsupervised farthest point sampling in a hand crafted feature space produced diverse proportionate subsets that cover variability efficiently.

Empirically, the approach was strongest in the smallest budget regime. At

B = 900

, diverse selection substantially outperformed random proportionate subsets, and stability weighted training on a diverse subset achieved the best of 89.3 %, supporting the interpretation that the learned expression acts as a low capacity structural prior that improves sample efficiency when data are limited. Robustness stress tests also showed that expression perturbed evaluation can match or exceed clean evaluation, consistent with the view that the perturbations behave like denoising alignment toward class structure rather than noise injection. At larger budgets, gains were not monotone, suggesting a bias variance tradeoff: a single frozen constraint family can be helpful when the subset is small, but can become misaligned as more heterogeneous modes enter the subset.

Overall, the results support differentially constrained manifolds as an interpretable and testable mechanism for data efficient ECG learning. By linking discrete curvature, constraint guided perturbations, and stability aware selection, the method provides a practical way to improve generalization when training budgets are small, while remaining reproducible and grounded in explicit geometric structure.

A. Implementation Details and Code

You can find the code for the experiment on GitHub. Note that the experiment ran with a particular learned expression, and the learned expression can change when running the code. To get the same results, use the same learned expression and seed. Although the experiment is deterministic and partly random, the results across the three budgets and multiple runs show that even if the accuracy changes, the conclusion is the same

GitHub link: https://github.com/tahaarkian/Data-Efficient-ECG-Sampling-from-MIT-BIH-Arrhythmia-Dataset

References

Rajpurkar, P.; Hannun, A.Y.; Haghpanahi, M.; Bourn, C.; Ng, A.Y. Cardiologist Level Arrhythmia Detection with Convolutional Neural Networks. arXiv preprint arXiv:1707.01836 2017.
Moody, G.B.; Mark, R.G. The Impact of the MIT BIH Arrhythmia Database. IEEE Engineering in Medicine and Biology Magazine 2001.
Perez, L.; Wang, J. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621 2017.
DeVries, T.; Taylor, G.W. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538 2017.
Antoniou, A.; Storkey, A.; Edwards, H. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340 2017.
Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970.
Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B 1996.
Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society: Series B 2005.
Breiman, L. Random Forests. Machine Learning 2001.
Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 2001.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B 1977.
Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006.
Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326.
Tenenbaum, J.B.; de Silva, V.; Langford, J.C. A global geometric framework for nonlinear dimensionality reduction. Science 2002, 295, 2319–2323.
Belkin, M.; Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. NIPS 2001.
Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 2013, 35, 1798–1828.
Acharya, U.R.; Oh, S.L.; Hagiwara, Y.; Tan, J.H.; Adam, M.; Tan, R.S.; Lim, M.; Gertych, A. A deep convolutional neural network model to classify heartbeats. Computers in Biology and Medicine 2017, 89, 389–396.
Kiranyaz, S.; Ince, T.; Gabbouj, M. Real-time patient-specific ECG classification by 1-D convolutional neural networks. IEEE Transactions on Biomedical Engineering 2016, 63, 664–675.
Yildirim, Ö. A novel wavelet sequence based on deep bidirectional LSTM network model for ECG signal classification. Computers in Biology and Medicine 2018, 96, 189–202.
PhysioNet. MIT-BIH Arrhythmia Database, 2005. Accessed via PhysioNet.
Chollet, F. Deep Learning with Python, 2 ed.; Manning Publications: New York, 2021.
Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification, and risk bounds. Journal of the American Statistical Association 2006, pp. 138–156.
Hardt, M.; Recht, B.; Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
Arkian, T. Differential Expressions and Their Applications. Preprints 2025. Preprint posted 09 July 2025. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Differentially Constrained Manifolds for Data-Efficient ECG Classification

Abstract

Keywords:

Subject:

1. Introduction

1.1. Problem and Importance

1.2. What Problems This Solves

1.3. Literature Review

Regression and supervised learning baselines.

Unsupervised learning, manifolds, and representation learning.

Deep learning for ECG.

1.4. Contributions and Organization

2. Preliminaries

2.1. Linear Algebra

2.2. Statistics

2.3. Deep Learning with Python and Python

3. Methodology

3.1. Overview

3.2. Supervised Component

3.2.1. Problem Formulation

3.2.2. Regression Baselines and the Role of Highly Cited Regression Methods

3.2.3. Classifier Used in This Work

3.2.4. Supervised Stability Scoring via Margin Drop

3.2.5. Weighted Training

3.3. Unsupervised Component

3.3.1. Feature Based Diversity Selection

3.3.2. Expression Constrained Perturbations as Manifold Projections

3.3.3. Latent Variable Viewpoint

4. Innovation in Methodology

4.1. Problem Setting

4.2. Discrete Differences, Curvature Strength, and Gravity

4.3. Local Chord Geometry and Parabolic Deviation Magnitude

4.4. Augmented Point Construction Including the x Direction

4.5. Iterative Refinement and Stabilization

4.6. Why the Jumps Are Parabolic

4.7. Movement Regulation Function and Scaling Factor Definition

4.8. ECG Pipeline Components Already Defined

5. Properties and Theorems of Differentially Constrained Manifolds

5.1. Constrained Manifolds in ( x , p , y , q )

5.2. What It Means to Be “Near 1” and How to Measure Farness

Projection style correction toward the manifold.

5.3. A Dominance Theorem for Orders of p and q from Highest Power Terms

Degree notation.

Example.

5.4. Parabolic Segments and Sinusoidal Envelope Around a Best Fit Line

Always one sided vs alternating deviation.

5.5. A Curvature from Deviation Formula for Small Spacings

5.6. From Discrete Points to a Continuous Function That Describes Them

Topological embedding viewpoint.

5.7. Generating New Manifold Families by Functional and Differential Transforms

How ϕ changes sensitivity.

Exponential regime weighting.

Derivative and integral families.

5.8. Distribution That Can Satisfy the Order of p

5.8.1. Why We Use a Distribution If Values Are Random

5.8.2. Uniform ( 0 , 1 ) and the Exponential Transform

5.8.3. Why Gamma ( 2 , 1 ) Is a Sum of Two Exp ( 1 ) Variables

5.8.4. Why E [ 1 / s ] Is Finite Only When k > 1

How the increasing data were generated

5.8.5. Sample Data and the Residual to Test the Condition

5.8.6. Summary

5.8.7. Why Gamma ( 2 , 1 ) Is an Appropriate Spacing Model Here

6. Study of ECG Data Using Differentially Constrained Manifolds

6.1. Overview

6.2. Dataset Construction

6.2.1. Beat Extraction and Labeling

6.2.2. Robust Normalization

6.2.3. Deterministic Record Split

6.3. Model

6.4. Learned Expression and Constraint Guided Perturbations

6.4.1. Signals Used in the Constraint

6.4.2. Family of Expressions

6.4.3. Expression Used in This Run

Class 0

Class 1

Class 2

6.4.4. Projection Step Used in Perturbations

6.4.5. Curvature Proxy and Gravity

6.5. Stability Score and Weighting

6.5.1. Margin and Margin Drop

5.1. Constrained Manifolds in $(x, p, y, q)$

How $ϕ$ changes sensitivity.

5.8.2. Uniform $(0, 1)$ and the Exponential Transform

5.8.3. Why Gamma $(2, 1)$ Is a Sum of Two Exp $(1)$ Variables

5.8.4. Why $E [1 / s]$ Is Finite Only When $k > 1$

5.8.7. Why Gamma $(2, 1)$ Is an Appropriate Spacing Model Here