A Leak-Safe Within-User Benchmark for Compact Surface-EMG Grasp Decoding, with a Causal Sequence-Reasoning Decoder

Sean Barrett; William Hartley

doi:10.20944/preprints202606.1365.v2

Submitted: 18 June 2026

Posted: 23 June 2026

Abstract

Surface electromyography (sEMG) is the dominant control modality for myoelectric hand prostheses, yet a persistent gap remains between high offline classification accuracy and real-time, on-device control: models are often large, evaluated under leaky window-level splits, and reported with a single inflated metric that hides instantaneous behaviour. Our contribution is primarily methodological: a strictly leak-safe within-user protocol (calibrate on a user's earlier repetitions, test on a held-out later one, split by contiguous recording segment so no overlapping window crosses the split), paired with honest dual reporting of per-window (instantaneous) and per-execution balanced accuracy alongside the rest false-activation rate. We instantiate it with a compact (28k-parameter, 27.5 kB int8) decoder for six hand grips plus rest: a depthwise-separable CNN with a causal, population-learned, zero-parameter transition grammar that reasons over per-window evidence. On NinaPro DB2, five calibration repetitions give 94.0% per-execution balanced accuracy at 2.2% false-activation and 75.2% per-window; a wider 20–450 Hz acquisition band adds 2.6 per-window points (Wilcoxon p=0.005) at the same footprint. On 12 trans-radial amputees (NinaPro DB3, DB7) accuracy is 63.8–74.2% with high between-subject variance and no evidence that the deep decoder's able-bodied advantage over a classical baseline transfers. Most consequentially, on NinaPro DB6 we expose the cross-session gap the within-session protocol cannot see: per-execution accuracy collapses from 93.4% (day 1) to 38.8% on day-5 re-donning, but a brief per-don recalibration fully restores it, statistically level with the within-session ceiling at every separation, turning the collapse into an explicit deployment recipe. We position the work honestly against recent ultra-low-power and foundation-model decoders, reporting compute cost rather than claiming on-device measurements. 
Code and configurations: https://github.com/seanb9/emg-leaksafe-benchmark.

Keywords: surface electromyography; prosthetic control; gesture recognition; lightweight neural networks; sequence decoding; leak-safe evaluation; NinaPro

Subject: Engineering - Electrical and Electronic Engineering

1. Introduction

Myoelectric prostheses decode a user’s movement intent from forearm sEMG to drive a robotic hand. Pattern-recognition control has matured substantially, and deep networks now report high within-subject accuracy on public benchmarks such as NinaPro [1,2,3]. Three issues, however, separate these numbers from a device a person can actually use.

(i) Deployability. A prosthetic hand carries a microcontroller-class processor; many high-accuracy models are too large to run on it, motivating a body of work on lightweight and ultra-low-power EMG decoders [4,5,6].

(ii) Evaluation integrity. Overlapping analysis windows split by window index leak information between train and test, inflating reported accuracy (we measure this inflation directly at +4.1 per-window points on DB2, §4.2). A leak-safe protocol must split by recording segment, and, for the deployment question, must extrapolate from calibration data to later, unseen data rather than interpolating within a session.

(iii) Metric honesty. A decoder emits a label every few tens of milliseconds, but a grasp-and-hold device makes one functional decision per hold. Reporting only the per-execution (voted) accuracy hides the instantaneous behaviour the user feels, which is dominated by the onset transient: the fraction of a second at the start of a movement during which the muscle is still recruiting and the window is genuinely ambiguous with rest.

We address all three with an honest evaluation and a compact system. Our contributions are:

A leak-safe within-user extrapolation protocol and a paired-metric reporting standard. Splitting by contiguous recording segment and testing on later repetitions removes the window-overlap leakage and within-session interpolation that inflate standard splits; reporting per-window and per-execution balanced accuracy with the rest false-activation rate exposes the onset-transient behaviour a single voted number hides.
An architectural choice for edge decoding: rather than scaling a parameter-heavy network to learn temporal structure, we offload it into a zero-parameter, population-learned transition grammar combined online with per-window evidence by a causal forward filter. It recovers boundary windows by inference, requires no extra calibration, and keeps the model at a microcontroller-class footprint (28k params, $27.5$ kB int8).
A controlled, honest benchmark: within-user and cross-subject (LOSO) results on NinaPro DB2 with per-subject variance and significance tests, a classical-baseline and component ablation (which transparently shows masked pretraining contributes little at this data scale and the grammar’s gain over voting is small but significant), and a hardware-agnostic compute-cost breakdown, positioned against recent embedded and foundation-model EMG decoders. Code, configurations, and a synthetic positive-control are released at https://github.com/seanb9/emg-leaksafe-benchmark.
An acquisition and cross-population study. We isolate signal bandwidth as a significant, model-free accuracy lever ($+2.6$ per-window points from a wider analysis band, $p = 0.005$); we validate the decoder as a causal real-time stream; and we extend the leak-safe benchmark to 12 amputees, where we report an honest negative result: we find no evidence that the deep model’s able-bodied advantage over a classical baseline transfers.

We are explicit about scope: these are offline, single-session results on able-bodied public data, not a clinical validation. Because NinaPro DB2 presents isolated repetitions separated by rest, the learned grammar is hub-and-spoke (rest ↔ grasp) and does not model direct grasp-to-grasp transitions; we show in synthesis (Section 5) that the mechanism supports them once such data is available, but empirical validation on unconstrained, cross-day functional data remains future work. The results are intended as a rigorous, reproducible benchmark that motivates that next step on target hardware.

2. Related Work

Deep EMG decoding. CNNs and their variants are the standard for sEMG gesture recognition [2,3], with within-subject accuracy on NinaPro DB2 reported in the high-80s to ∼90% for large models [3]. Such accuracies are typically obtained with hundreds of thousands to millions of parameters and under within-session protocols.

Lightweight and embedded EMG. Efficiency is an active area. Zanghieri et al. [5] deploy a temporal convolutional network on a multicore IoT processor; Bioformers [4] run an ultra-low-power transformer in 94.2 kB on a GAP8 microcontroller at 2.72 ms / 0.14 mJ per inference; TinyMyo [6] is a 3.6M-parameter EMG foundation model using self-supervised masked reconstruction. Our model is smaller in parameter count than these learned decoders, but we report compute cost only (Section 4.11) and do not claim measured on-device energy; Bioformers remains the reference point for hardware-measured efficiency.

Cross-subject and adaptation. Cross-subject generalization is hard; EMGBench [7] benchmarks out-of-distribution and few-shot adaptation on DB2. Test-time adaptation [8], test-time training [9], and Riemannian recentering [10] are established transfer tools; a recent lightweight TCN targets inter-session adaptation on DB6 [11], and multimodal grasp benchmarks such as MeganePro [12] extend cross-subject/-task evaluation. We use batch-norm / entropy test-time adaptation and a learned sequence grammar; we do not claim a novel adaptation mechanism. Unlike online-adaptive HMMs that re-estimate transition parameters during use, our grammar is fixed at deployment (a zero-parameter, population-learned prior), trading adaptivity for a constant, microcontroller-friendly inference cost.

Sequence models and the onset transient. Temporal models capture gesture onset and maintenance [13], and the effect of transient misclassifications on real-time control and user perception is well studied [14]. HMM decoding of myoelectric signals is itself long-established [15,16]; we do not claim it as novel. Our position is deliberately architectural: rather than scaling a temporal network to learn these dependencies, we offload them into a zero-parameter learned grammar, trading model capacity for a structural prior. This design strategy is motivated by the microcontroller-class footprint of a hand prosthesis.

Evaluation integrity. The weak correlation between offline accuracy and online control is well documented [17,18]; we therefore report instantaneous behaviour and false-activation explicitly and frame our offline numbers accordingly.

3. Methods

3.1. Data and Preprocessing

We use NinaPro DB2 [1] (40 intact-limb subjects, 12 Delsys electrodes; publicly available at http://ninapro.hevs.ch). Labels are stimulus-cued: the annotated movement onset leads true muscle-recruitment onset, so the first windows of a labelled segment carry little movement signal. This is the mechanical origin of the onset transient that the per-window metric exposes (§5). From Exercise 2 we select a functional set of six grips (medium-wrap (Power), tip pinch (Pinch), tripod, lateral, index-point, and fixed hook) plus rest, giving $K = 7$ classes; this functional grip set is used for all within-user (clinical) results. For the cross-subject benchmark (Section 4.5, Table 1) we instead use the standard 10-class movement set adopted by the cross-subject literature (e.g. EMGBench): rest plus nine basic finger and wrist movements (finger abduction, finger adduction, fist, mid-axis supination, mid-axis pronation, wrist flexion, wrist extension, radial deviation, and ulnar deviation). We use this distinct set for LOSO purely for comparability with cross-subject baselines; the two protocols are never mixed. Signals are resampled to 250 Hz, band-pass filtered (20–120 Hz, with a 50 Hz notch), and segmented into 256 ms windows at 64 ms stride. The 256 ms window with 64 ms decision stride keeps the controller delay within the <300 ms budget established for real-time myoelectric control [19]. Each window additionally carries per-channel root-mean-square and waveform-length envelopes as extra channels [20].
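The window geometry above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the released preprocessing code: the function name `window_starts` and the 1000-sample toy recording are illustrative assumptions; only the 250 Hz rate, 256 ms window, and 64 ms stride come from the text.

```python
import numpy as np

FS_HZ = 250        # resampled rate stated in the text
WIN_MS = 256       # analysis window length
STRIDE_MS = 64     # decision stride

def window_starts(n_samples, fs_hz=FS_HZ, win_ms=WIN_MS, stride_ms=STRIDE_MS):
    """Return (window start indices, window length, stride), all in samples."""
    win = int(round(win_ms * fs_hz / 1000))
    stride = int(round(stride_ms * fs_hz / 1000))
    starts = np.arange(0, n_samples - win + 1, stride)
    return starts, win, stride

# Hypothetical 4-second recording (1000 samples at 250 Hz).
starts, win, stride = window_starts(n_samples=1000)
overlap = 1.0 - stride / win   # fraction of samples shared by neighbouring windows
print(win, stride, overlap, len(starts))
```

Under these settings a window spans 64 samples and advances by 16, so neighbouring windows share 75% of their raw samples, which is the overlap the leak-safe split in Section 3.2 has to guard against.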

Class frequencies and exclusions. Counts are post-preprocessing. The 7-class within-user set comprises 430,777 windows (Rest 316,751; Power 18,007; Pinch 21,772; Tripod 19,372; Lateral 22,595; Point 14,938; Hook 17,342), a rest:grasp window ratio of 2.8:1 (Rest 73.5%). The 10-class LOSO set comprises 388,703 windows (Rest 252,518; the nine movements 12,192–17,480 each), a ratio of 1.85:1 (Rest 65.0%). This imbalance is why we report balanced accuracy (§3.6). No subjects, channels, or repetitions were excluded; an RMS-based artifact guard (threshold 500 μV) is applied during preprocessing, and there are no missing windows after preprocessing.

3.2. Leak-Safe Windowing and Splits

Every window is assigned to exactly one contiguous recording segment (a maximal run of constant stimulus and repetition). Splitting by segment guarantees that no two windows sharing raw samples (e.g. 75%-overlap neighbours) fall on opposite sides of any split. For the within-user protocol, calibration windows come from the user’s repetitions 1–r and test windows from a disjoint, later repetition (extrapolation); per-channel z-score statistics are fit on calibration windows only. For cross-subject LOSO, the test subject is held out entirely and few-shot calibration uses a disjoint calibration pool. Crucially, even at the 0%-calibration operating point the held-out subject’s per-channel z-score is fit on that subject’s own calibration-pool windows (which are used for normalisation statistics only, never for supervision and never including the evaluation repetitions), rather than on global cross-subject statistics; this avoids injecting cross-subject amplitude variance that would artificially penalise the held-out subject. We therefore use “0% calibration” to mean zero labelled adaptation: no held-out-subject windows enter training, fine-tuning, or the contrastive alignment; only an unsupervised per-channel scale is estimated from a short baseline acquisition. This baseline is exactly what a real device collects once at don time (a few seconds of rest and free movement before use) and is operationally lightweight; it is a per-channel gain normalisation, not a learned classifier. The evaluation repetitions are strictly excluded from this pool, so no test-segment sample ever informs the statistics.
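The segment-level split can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the released implementation: the helper names (`segment_ids`, `leak_safe_split`, `covered`), the toy recording, and the choice to drop windows that straddle a segment boundary are all assumptions made for the example.

```python
import numpy as np

def segment_ids(stimulus, repetition):
    """Index of the contiguous (stimulus, repetition) run each sample belongs to."""
    stimulus, repetition = np.asarray(stimulus), np.asarray(repetition)
    change = (np.diff(stimulus) != 0) | (np.diff(repetition) != 0)
    return np.concatenate([[0], np.cumsum(change)])

def leak_safe_split(stimulus, repetition, win, stride, cal_reps, test_rep):
    """Window start indices for calibration and test, split by whole segment."""
    seg = segment_ids(stimulus, repetition)
    repetition = np.asarray(repetition)
    cal, test = [], []
    for s in range(0, len(seg) - win + 1, stride):
        if seg[s] != seg[s + win - 1]:
            continue                      # assumption: drop boundary-straddling windows
        rep = int(repetition[s])
        if rep in cal_reps:
            cal.append(s)
        elif rep == test_rep:
            test.append(s)
    return cal, test

def covered(starts, win):
    """Set of raw sample indices touched by a list of windows."""
    return {i for s in starts for i in range(s, s + win)}

# Toy recording: rest/grasp alternating, two repetitions of 80 samples each.
stimulus = np.repeat([0, 1, 0, 1], 40)
repetition = np.repeat([1, 2], 80)
cal, test = leak_safe_split(stimulus, repetition, win=16, stride=4,
                            cal_reps={1}, test_rep=2)
shared = covered(cal, 16) & covered(test, 16)   # raw samples seen by both sides

# Per-channel z-score statistics are fit on calibration samples only.
x = np.random.default_rng(0).normal(size=(160, 2))
cal_idx = sorted(covered(cal, 16))
mu, sd = x[cal_idx].mean(axis=0), x[cal_idx].std(axis=0)
```

Because every retained window lies inside one segment and whole segments go to one side, `shared` is empty by construction, whereas a random window-level split at 75% overlap would almost always place neighbouring windows with common samples on opposite sides.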

Selection provenance. To keep selection leak-safe, model and training hyper-parameters were chosen on a validation split held out from the pretraining subjects, never on the held-out test users or repetitions. The deployment operating point is a single fixed configuration (Section 3.5) applied identically to every subject and fold; it is not tuned per subject, per fold, or on the evaluation repetitions. We do not claim its constants are optimal; they are set to a pre-specified false-activation target, and the gate contributes only +1.3 per-execution points (Section 4.6), bounding any selection effect.

3.3. Model

The backbone is a depthwise-separable 1-D temporal CNN (widths 32-64-128, embedding dimension 128) producing a window embedding, with a linear classifier head. The inference model (backbone + classifier) has 28,199 parameters; a training-only self-supervised decoder and projection head are discarded at inference.

3.4. Training

Self-supervised masked pretraining. Following masked autoencoding [21], random time-spans of each window are hidden and a lightweight transposed-convolution decoder reconstructs the missing channels, minimising

$$L_{ssl} = \frac{1}{|M|} \sum_{(t, c) \in M} \left(\hat{x}_{t, c} - x_{t, c}\right)^{2}$$

over masked positions $M$. This learns the spatio-temporal structure of the signal before any labels are seen; we note up front that its measured contribution at our 20-subject pretraining scale is negligible (Section 4.6), and retain it only as a harmless component expected to help with larger unlabelled corpora.
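As a worked instance of the masked reconstruction objective, the sketch below evaluates the mean squared error over masked positions only. It assumes a window stored as a (time, channel) array and a boolean mask of the same shape; the function name `masked_mse` and the span-across-all-channels mask layout are illustrative assumptions, not details taken from the released code.

```python
import numpy as np

def masked_mse(x_hat, x, mask):
    """Mean squared reconstruction error over masked (t, c) positions only."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no positions")
    err = (np.asarray(x_hat, dtype=float) - np.asarray(x, dtype=float)) ** 2
    return float(err[mask].mean())

# Toy window: 8 time steps, 2 channels, all zeros.
x = np.zeros((8, 2))
mask = np.zeros((8, 2), dtype=bool)
mask[2:5, :] = True                     # hide one time span (assumed layout)

wrong_inside = x.copy()
wrong_inside[mask] = 1.0                # errors only where the input was hidden
wrong_outside = x.copy()
wrong_outside[~mask] = 1.0              # errors only at visible positions

loss_inside = masked_mse(wrong_inside, x, mask)
loss_outside = masked_mse(wrong_outside, x, mask)
```

Errors at visible positions do not enter the loss, so the decoder is graded only on what it had to infer.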

Supervised contrastive + cross-entropy. The backbone is then trained with the supervised contrastive loss [22] to align same-grip embeddings across users, followed by class-weighted cross-entropy with label smoothing. For the cross-subject protocol the held-out subject is excluded from this stage entirely: no window from the test subject contributes to the contrastive alignment, the cross-entropy, or the masked pretraining; the population base is trained only on the other subjects, so the reported LOSO numbers contain no test-subject information beyond the unsupervised normalisation scale above.

Per-user calibration. For each test user the population base is fine-tuned on the user’s calibration repetitions with gentle intra-session augmentation. A dedicated rest/active detector and a movement-only grasp classifier are adapted separately, decoupling the false-activation decision from grasp discrimination.

3.5. Causal Sequence-Reasoning Decoder

Let $p_{t} \in \Delta^{K}$ be the classifier’s per-window posterior, $p_{t}(j) = P(j \mid x_{t})$, at time $t$. A grasp is not an isolated pattern but a temporally-structured event; a person rests, forms a grip, holds it, and relaxes, and does not jump directly between grips. We encode this as a first-order Markov model with a transition matrix $A$ whose off-diagonal structure is learned from the training population’s label sequences (which transitions actually occur) and whose self-transition (“hold”) probability is a single interpretable knob. We call the grammar zero-parameter in the sense that it adds no gradient-trained weights to the network: the $K \times K$ transition matrix is estimated by counting population label transitions, not learned by backpropagation.

Emissions: a safety-aware design choice. The classical hybrid NN–HMM formulation converts a softmax posterior to an emission likelihood via the scaled-likelihood trick [15,16,23], $e_{t}(j) \propto P(j \mid x_{t}) / P(j)$, dividing out the empirical class priors $P(j)$. In a safety-critical prosthetic decoder, however, the dominant prior $P(\mathrm{rest})$ is not a nuisance to be removed: it encodes the operationally desirable default that the hand stays still. Dividing it out makes the decoder eager to leave rest. We therefore use the per-frame posterior directly as the emission, $e_{t}(j) = P(j \mid x_{t})$ (equivalent to a uniform-prior, per-frame MAP estimate), and report the scaled-likelihood variant as an ablation: empirically it raises per-window balanced accuracy by ≈2 points but inflates rest false-activation roughly ten-fold (from 4.5% to 52%), which is unacceptable for control (Section 4.6). Retaining the rest prior is thus a deliberate, empirically-justified choice. Decoding is the causal forward recursion (in log-space for stability):

$$\alpha_{t}(j) = \log e_{t}(j) + \log \sum_{i = 1}^{K} \exp\left(\alpha_{t - 1}(i) + \log A_{i j}\right), \qquad (1)$$

with $\hat{y}_{t} = \arg\max_{j} \alpha_{t}(j)$. Because $\alpha_{t}$ depends only on windows $0{:}t$, decoding is causal and real-time, uses no test labels, and adds negligible cost. A grasp window with weak emission (an onset ramp) is held at the current grip because leaving and re-entering is implausible under $A$; a spurious cross-grip flicker is overruled because direct grasp-to-grasp transitions are near-zero.
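The recursion in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the released decoder: the uniform initial state, the per-step renormalisation of $\alpha_t$ (which leaves the argmax unchanged), the `eps` floor on zero probabilities, and the toy three-class matrix and posteriors are all assumptions made for the example.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def causal_forward_decode(post, A, eps=1e-12):
    """Forward filter of Eq. (1): emissions are the per-window posteriors."""
    post = np.asarray(post, dtype=float)
    log_e = np.log(post + eps)
    log_A = np.log(np.asarray(A, dtype=float) + eps)
    T, K = post.shape
    alpha = log_e[0] - np.log(K)            # assumed uniform initial state
    y_hat = np.empty(T, dtype=int)
    for t in range(T):
        if t > 0:
            # alpha[i] + log A[i, j], summed over the previous state i
            alpha = log_e[t] + _logsumexp(alpha[:, None] + log_A, axis=0)
        alpha = alpha - _logsumexp(alpha, axis=0)   # renormalise; argmax unaffected
        y_hat[t] = int(np.argmax(alpha))
    return y_hat

# Toy hub-and-spoke grammar: class 0 is rest, classes 1 and 2 are grasps.
A = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.90, 0.00],
              [0.10, 0.00, 0.90]])
post = np.array([[0.9, 0.05, 0.05],
                 [0.9, 0.05, 0.05],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.4, 0.5],     # single-window flicker toward grasp 2
                 [0.1, 0.8, 0.1]])
raw = post.argmax(axis=1)
decoded = causal_forward_decode(post, A)
```

In this toy stream the raw argmax jumps to grasp 2 for one window, while the filter keeps grasp 1 because a direct grasp-to-grasp transition has near-zero probability. Decoding a prefix of the stream gives the same labels as the corresponding prefix of the full decode, which is the causality property stated above.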

Using the posterior as emission does not double-count the prior, because our transition matrix $A$ carries no class marginal: its off-diagonal rows are conditional destination distributions, independent of class frequency (Eq. 2), so $P(j)$ enters the filter exactly once. Formally this is a linear-chain structured decoder with fixed potentials (unary $\log P(j \mid x_{t})$, pairwise $\log A_{i j}$); we do not claim a jointly-trained, calibrated CRF, as the transition prior is fixed ahead of time and population-learned rather than optimised end-to-end.

Transition-matrix structure. We decouple stickiness from structure. With $O$ the row-normalised off-diagonal grammar (the conditional destination distribution given the state is left, learned from population label sequences) and $h$ a single hold probability,

$$A_{i i} = h, \qquad A_{i j} = (1 - h)\, O_{i j} \quad (i \neq j), \qquad (2)$$

so $A$ carries no class-frequency marginal. Because NinaPro DB2 records discrete, isolated repetitions in which the subject returns to rest between movements, the learned $O$ is a hub-and-spoke grammar: probability flows rest ↔ grasp and direct grasp-to-grasp entries are near-zero. We state this explicitly; it is a property of the dataset, not the architecture. The same machinery admits arbitrary grasp-to-grasp transitions once $O$ is estimated on unconstrained functional data, which we verify with a controlled synthetic experiment (Section 5).
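Eq. (2) amounts to counting state changes and rescaling. The sketch below is one way to do that under stated assumptions: the function name `build_transition_matrix`, the value $h = 0.9$, the uniform fallback for states that are never left in the training labels, and the toy label sequence are illustrative and not taken from the paper.

```python
import numpy as np

def build_transition_matrix(label_seqs, K, h):
    """Eq. (2): A_ii = h, A_ij = (1 - h) * O_ij, with O counted from label changes."""
    counts = np.zeros((K, K))
    for seq in label_seqs:
        seq = np.asarray(seq, dtype=int)
        src, dst = seq[:-1], seq[1:]
        moved = src != dst                       # ignore self-transitions
        np.add.at(counts, (src[moved], dst[moved]), 1)
    row = counts.sum(axis=1, keepdims=True)
    # Assumption: a state never left in training falls back to uniform destinations.
    uniform = (1.0 - np.eye(K)) / (K - 1)
    O = np.where(row > 0, counts / np.maximum(row, 1), uniform)
    A = (1.0 - h) * O
    A[np.diag_indices(K)] = h
    return A

# Toy population labels: rest (0) -> grasp 1 -> rest -> grasp 2 -> rest.
seqs = [[0, 0, 1, 1, 0, 0, 2, 2, 0]]
A = build_transition_matrix(seqs, K=3, h=0.9)    # h = 0.9 is a hypothetical value
```

Because only label changes are counted, long rest periods do not inflate any entry of $O$, and with isolated repetitions the grasp-to-grasp entries come out as zero, giving the hub-and-spoke structure described above.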

Deployment hysteresis. To set the false-activation operating point we add an asymmetric hysteresis state machine downstream of the decoder, with an activation score $s_{t}$ (the rest/active detector’s $P(\mathrm{active})$) and state $z_{t} \in \{\mathrm{rest}\} \cup G$: entry from rest to grasp $g$ requires $n_{on}$ consecutive windows with $s_{t} \geq \theta_{on}$ and $\hat{y}_{t} = g$; release to rest requires $n_{off}$ consecutive windows with $s_{t} < \theta_{off}$; a grip switch requires $n_{sw}$ consecutive windows voting a different grasp. The band $\theta_{off} < \theta_{on}$ gives the hysteresis (hard to start, easy to hold); $n_{on}$ is the debounce that suppresses false activations. These five constants are the clinical operating point. They are set by a coarse sweep to a pre-specified rest-false-activation target on a validation split held out from the pretraining subjects (we sweep $\theta_{on} \in [0.5, 0.7]$, $\theta_{off} \in [0.25, 0.35]$, $n_{on} \in \{2, 3\}$, $n_{off} \in \{4, 6\}$; deployed point $\theta_{on} = 0.6$, $\theta_{off} = 0.35$, $n_{on} = 3$, $n_{off} = 4$, $n_{sw} = 3$). They are neither learned by backpropagation nor tuned on the evaluation repetitions, and the single deployed configuration is then applied identically to every subject and fold. The decoder-sweep tables printed in the released logs are diagnostic, showing the chosen point is not a fragile optimum; they are not the selection mechanism. Because the debounce defers commitment until $n_{on}$ consecutive windows, it contributes up to ≈192 ms at the deployed $n_{on} = 3$ to the onset latency; this is the debounce component specifically. The full empirical onset latency, dominated by analysis-window fill and the muscle-recruitment ramp, is larger and is reported in Section 4.4. Increasing $n_{on}$ trades onset latency for fewer false activations.
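The state machine described above can be sketched as a small loop over windows. This is an interpretation of the prose rather than the released gate: the function name `hysteresis_gate`, the reading of a grip switch as $n_{sw}$ consecutive votes for the same other grasp, and the toy score streams are assumptions; the default constants are the deployed values quoted in the text.

```python
REST = 0

def hysteresis_gate(scores, y_hat, theta_on=0.6, theta_off=0.35,
                    n_on=3, n_off=4, n_sw=3):
    """Asymmetric hysteresis over activation scores s_t and decoder labels y_hat_t."""
    state, cand = REST, REST
    on_run = off_run = sw_run = 0
    out = []
    for s, y in zip(scores, y_hat):
        if state == REST:
            if s >= theta_on and y != REST:
                on_run = on_run + 1 if y == cand else 1   # same grasp must repeat
                cand = y
                if on_run >= n_on:
                    state, on_run, off_run, sw_run = y, 0, 0, 0
            else:
                on_run, cand = 0, REST
        else:
            off_run = off_run + 1 if s < theta_off else 0
            if y != REST and y != state:
                sw_run = sw_run + 1 if y == cand else 1   # assumed: same other grasp
                cand = y
            else:
                sw_run = 0
            if off_run >= n_off:
                state, cand = REST, REST
                on_run = off_run = sw_run = 0
            elif sw_run >= n_sw:
                state, sw_run = cand, 0
        out.append(state)
    return out

# Toy stream: activation rises, dips into the hysteresis band, then falls.
scores = [0.2, 0.7, 0.7, 0.7, 0.7, 0.5, 0.3, 0.3, 0.3, 0.3, 0.3]
y_hat = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
states = hysteresis_gate(scores, y_hat)

# A two-window blip never reaches n_on = 3 and is suppressed.
blip = hysteresis_gate([0.9, 0.9, 0.1, 0.1], [1, 1, 1, 1])
```

In the toy stream the grasp is committed on the third qualifying window (3 × 64 ms = 192 ms of debounce), held while the score sits at 0.5 between the two thresholds, and released only after four consecutive windows below $\theta_{off}$.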

3.6. Evaluation Metrics

We report balanced accuracy: the unweighted mean of per-class recalls, which weights each of the grasp classes and rest equally and is therefore invariant to the rest:grasp window-count ratio (a model that simply predicts the majority rest class scores at chance, $1/C$ for $C$ classes, not high). Per-window accuracy grades every 64 ms decision independently (instantaneous behaviour). Per-execution accuracy collapses each contiguous movement to a single majority decision (the per-repetition unit, as in NinaPro/EMGBench reporting). Rest false-activation is the fraction of true-rest windows that emit a movement: the clinically critical safety metric.
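The three metrics can be written down compactly. The sketch below is illustrative: the function names, the inclusion of rest segments as executions, the lowest-label tie-break in the majority vote, and the toy labels are assumptions for the example rather than details of the released evaluation code.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def per_execution(y_true, y_pred, seg):
    """Collapse each contiguous segment to (true label, majority prediction)."""
    y_true, y_pred, seg = map(np.asarray, (y_true, y_pred, seg))
    t, p = [], []
    for s in np.unique(seg):
        m = seg == s
        t.append(y_true[m][0])
        p.append(np.bincount(y_pred[m]).argmax())   # ties go to the lowest label
    return np.array(t), np.array(p)

def rest_false_activation(y_true, y_pred, rest=0):
    """Fraction of true-rest windows on which a movement is emitted."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_pred[y_true == rest] != rest))

# Toy stream: rest, grasp 1, rest, grasp 2, with errors at the segment edges.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 2, 2, 2])
seg    = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3])

ba_window = balanced_accuracy(y_true, y_pred)
ba_exec = balanced_accuracy(*per_execution(y_true, y_pred, seg))
fa_rest = rest_false_activation(y_true, y_pred)
ba_all_rest = balanced_accuracy(y_true, np.zeros_like(y_true))
```

Here the onset windows of each grasp are misread as rest, so the per-window score is 7/9 while every execution is still voted correctly; an always-rest predictor scores 1/3 on this three-class toy stream, which is the chance level the definition above refers to.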

4. Experiments and Results

4.1. Setup

The population base is trained on 20 subjects disjoint from the 20 test users. The partition is fixed a priori by subject index (subjects 1–20 are the test users, 21–40 the base), not selected on results or resampled to favour an outcome; it is fixed in the released configuration. For each test user we sweep $r \in \{1, \dots, 5\}$ calibration repetitions and test on the held-out later repetition. Training uses SGD/Adam with early stopping; the deployed configuration uses a single fixed seed (1337). To show the conclusions do not depend on that seed, we repeat the headline within-user pipeline and both small-effect comparisons (the sequence-grammar gain and the acquisition-band gain) across three seeds (1337, 7, 2024; reported in §4.2, §4.6 and §4.8). Full configurations are provided in the released repository.

4.2. Within-User Results

Figure 1 shows the calibration curve. With five calibration repetitions the system reaches 94.0% per-execution balanced accuracy (deployed gated decoder; 95% CI [88.6, 99.3] over 20 subjects), 75.2% per-window accuracy (gate-free stream; CI [71.5, 78.8]), and 2.2% rest false-activation (CI [1.0, 3.5]; Figure 1). The result does not depend on the random seed: repeating the full within-user pipeline at two further seeds (7, 2024) gives per-execution 94.0 ± 0.7% and per-window 74.6 ± 0.2% across the three seeds (the per-window value reproduces the §4.8 baseline condition, within the run-to-run variation of the 75.2% headline run). These within-user numbers are a within-session bound: calibration and test share one recording, so they do not capture cross-day electrode shift and should be read as optimistic relative to re-donning deployment, which we measure separately on DB6 (§4.10). The headline per-execution figure is reported on a single held-out repetition (rep 6); to check it is not an artefact of that repetition we additionally run leave-one-repetition-out cross-validation, holding out each of the six repetitions in turn (this relaxes the strict earlier-to-later extrapolation to interpolation and is reported only as a robustness check). The six-fold means match the headline: 94.8 ± 6.6% per-execution and 75.5 ± 6.8% per-window (+logic), so the single-repetition number is representative rather than a favourable draw.

Quantifying the leakage. To put a number on the inflation our protocol removes, we re-ran the identical pipeline under a leaky window-level split: windows assigned to calibration and test at random (the common practice that lets 75%-overlap neighbours straddle the split), at the same calibration fraction. On the same 20 subjects this inflates per-window balanced accuracy by +4.1 ± 4.9 points (71.7% → 75.8%; Wilcoxon $p = 0.001$, inflated for 16/20 subjects). The inflation is larger than the entire sequence-grammar gain (+2.6 points), so a window-shuffled split would both over-report instantaneous accuracy and swamp the architectural effect the paper studies; the leak-safe split is what keeps the headline honest.

Per-execution accuracy crosses 90% at three calibration repetitions. Point and hook reach 100% per-execution recall and power/pinch/lateral 90–95%; tripod is the weakest at 80% (Figure 5), consistent with the known difficulty of separating multi-finger grasps in forearm EMG (Figure 4).

Figure 1. Within-user calibration curve (20 users, held-out later repetition; balanced accuracy, mean ± SD, shaded 95% CI). Per-execution accuracy crosses 90% at three calibration repetitions; the gap to the per-window curves is the onset transient. At five repetitions the gate-free decoder stream reaches 72.6 ± 7.8% per-window (raw) and 75.2 ± 7.8% (+HMM logic), while the deployed gated decoder reaches 94.0 ± 11.4% per-execution (Section 3.5) at 2.2% rest false-activation; per-repetition false-activation stays within the 1.2–2.2% band from one to five reps. The per-window numbers isolate the classifier and the zero-parameter grammar; the per-execution figure additionally includes the operating-point gate (cf. Table 2).

Table 2. Baselines and ablations (within-user, 5 cal. reps; per-window and per-execution balanced accuracy %, mean ± SD over the 20 subjects; rest false-activation %). No row in this table uses the deployment hysteresis gate: every row is the classifier/grammar stream only, so the comparison isolates the zero-parameter sequence grammar from the hand-tuned operating-point logic. The deployed gate (Section 3.5) sets the false-activation operating point and adds only +1.3 per-execution points over the gate-free HMM stream (92.7 → 94.0%, Figure 1), confirming the sequence accuracy is delivered by the grammar, not by the gate. All rows are mean ± SD over the 20 subjects, including the scaled-likelihood and no-pretraining ablations. Significance (Wilcoxon signed-rank across subjects, with rank-biserial effect size $r$): ours vs. matched $w = 5$ vote, per-window $p = 0.001$, $r = 0.79$; ours vs. classical TD+LDA (Holm-corrected), per-window $p < 0.001$, $r = 1.00$ (significant), per-execution $p = 0.78$, $r = 0.08$ (n.s.).

Configuration	Per-window	Per-exec	False-act
Classical: TD + LDA	66.9 ± 8.1	94.7 ± 7.8	–
Causal TCN, raw (57k params)	76.3 ± 7.8	95.2 ± 2.1	–
CNN, raw evidence (no logic)	72.6 ± 7.8	94.0 ± 11.4	–
CNN + majority vote ( $w = 3$ )	73.1 ± 7.7	94.0 ± 11.4	–
CNN + majority vote ( $w = 5$ )	73.7 ± 7.6	94.7 ± 9.3	–
CNN + HMM, gate-free (ours)	75.2 ± 7.8	92.7 ± 11.5	4.5 ± 2.0
CNN + HMM, scaled-likelihood	77.4 ± 7.1	–	52.3 ± 15.4
ours, no masked pretraining	75.0 ± 7.6	93.4 ± 11.5	4.6 ± 2.1

4.3. Effect of Sequence Reasoning

The learned transition grammar improves per-window balanced accuracy by +2.6 points over raw argmax evidence (72.6 → 75.2%, Figure 2) and by +1.5 points over the matched-window ($w = 5$) majority vote (73.7 → 75.2%, Table 2); all figures are per-subject means. The gain over raw argmax is robust to the random seed and consistent at the subject level: pooled over three seeds (1337, 7, 2024) it is +2.8 points (95% CI [2.3, 3.3]; Cohen’s $d = 1.43$, a large paired effect; rank-biserial $r = 0.96$), positive in 53/60 (subject, seed) cases (17–18 of 20 per seed), with a median per-subject gain of 2.6 points (IQR [1.6, 3.8]). It is thus small in magnitude but highly reliable, not a seed- or subject-specific fluke. The margin over voting is small in magnitude but statistically significant and consistent across the 20 subjects (Wilcoxon signed-rank $W = 22$, $p = 0.001$, rank-biserial $r = 0.79$, a large effect: the grammar wins for the large majority of subjects), and comes with no additional calibration and zero added parameters. The accuracy gain over naive smoothing is therefore modest; the decoder’s advantage is principled and causal. A $w$-window majority vote imposes a symmetric tracking lag of $\approx w/2$ windows (e.g. ≈128 ms at $w = 5$), applied equally to onset and offset, whereas the causal forward filter updates at every 64 ms stride from accumulated evidence with no added smoothing lag, a meaningful responsiveness advantage relative to the <300 ms control-delay budget [19] (this concerns the added smoothing lag; the end-to-end onset latency is reported and discussed in Section 4.4). On the saturating per-execution metric the grammar offers no advantage over voting, underscoring that the per-window metric is where the distinction lies.
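The smoothing lag of the voting baseline can be illustrated on an ideal step. The sketch assumes the vote is taken over a trailing window of the last $w$ decisions, which is one plausible reading of the baseline; the function name and the step input are illustrative.

```python
import numpy as np

STRIDE_MS = 64   # decision stride from Section 3.1

def causal_majority_vote(labels, w):
    """Majority label over the trailing window of the last w decisions."""
    labels = np.asarray(labels, dtype=int)
    out = np.empty_like(labels)
    for t in range(len(labels)):
        recent = labels[max(0, t - w + 1): t + 1]
        out[t] = np.bincount(recent).argmax()
    return out

# Ideal step: rest for six windows, then grasp 1 from index 6 onward.
step = [0] * 6 + [1] * 6
voted = causal_majority_vote(step, w=5)
true_onset = step.index(1)
voted_onset = int(np.argmax(voted == 1))
lag_windows = voted_onset - true_onset
lag_ms = lag_windows * STRIDE_MS
```

With $w = 5$ the vote needs three grasp windows before the majority flips, so even on a perfectly clean step the output changes two windows late, i.e. 128 ms at the 64 ms stride, and the same delay applies at release.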

4.4. Operating Point

Figure 3 shows the trade-off between grasp accuracy on acted windows and rest false-activation as the activation gate is swept. We recommend the gate at 0.9, which meets the ≈2% clinical false-activation bar (2.2%) while retaining 86.7% grasp accuracy on acted windows at 66% movement coverage; loosening the gate trades false-activation for coverage monotonically, so the single threshold is the clinician-facing tuning knob.

Empirical onset latency. We measure onset latency directly, as the median number of leading windows the decoder still calls rest from the labelled movement start until it commits to a grasp, across the 20 subjects. The gate-free sequence stream incurs

8.8 \pm 5.4

windows (

\approx 560 \pm 350

ms at the

64

ms stride) and the deployed gated decoder

9.7 \pm 5.1

windows (

\approx 620 \pm 330

ms), with large between-subject variance. We report this plainly: this end-to-end latency exceeds the per-decision

< 300

ms processing budget. These are distinct quantities: the budget bounds the

64

ms decision cadence, whereas onset latency is dominated by analysis-window fill and the muscle-recruitment ramp plus the deliberate

n_{on}

debounce, and is tunable (loosening the gate lowers latency at the cost of false activations, Figure 3). It is a primary limitation for fast closed-loop control; we decompose it into design-controlled and physiological parts and report attempts to reduce it in §4.7, where the residual proves to be an acquisition floor rather than a decoder limit. We note, however, that onset latency trades against stability: debounce and decision-based velocity ramps deliberately accept added delay to suppress the transient misclassifications that most erode myoelectric usability [14,19], so a stable commitment that is slightly slower can be preferable in closed-loop use to a faster but flickering one. Quantifying this latency-versus-stability trade in real-time operation is part of the follow-up.
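The onset-latency measurement described above reduces to counting leading rest decisions. The sketch below is illustrative only: the function names are hypothetical, each execution is assumed to be a list of decoded labels starting at the labelled movement start, and executions in which the decoder never commits are simply skipped (an assumption, not a documented rule).

```python
from statistics import median

STRIDE_MS = 64  # decision stride stated in the text


def onset_latency_windows(decoded, rest=0):
    """Leading windows still decoded as rest, counted from the labelled movement start."""
    for i, label in enumerate(decoded):
        if label != rest:
            return i
    return None  # the decoder never committed to a grasp


def median_onset_latency_ms(executions, rest=0):
    """Median onset latency over executions, converted from windows to milliseconds."""
    lat = [onset_latency_windows(e, rest) for e in executions]
    lat = [x for x in lat if x is not None]
    return median(lat) * STRIDE_MS


# Window counts reported in the text, converted at the 64 ms stride
gate_free_ms = 8.8 * STRIDE_MS   # about 563 ms, reported as ~560 ms
gated_ms = 9.7 * STRIDE_MS       # about 621 ms, reported as ~620 ms
```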

Figure 3. Operating-point trade-off as the gate threshold (labelled) is swept. Grasp accuracy on acted windows (left axis) rises as the gate tightens, but movement coverage (right axis) falls: the recommended gate

0.9

clears the

2 %

false-activation bar at the cost of coverage, which drops to

66 %

. Both recommended points are ringed; the gain is not free.

Figure 4. Per-window confusion (row-normalised %) for the raw CNN evidence, before the sequence logic (its mean diagonal is the raw per-window balanced accuracy,

\sim 72 %

; the grammar lifts this to the

75.2 %

headline). Off-diagonal mass concentrates in the rest column: the onset/offset transient, not grip–grip confusion.

Figure 5. Per-grip per-execution recall at five calibration repetitions.

4.5. Cross-Subject (LOSO) Results

The cross-subject benchmark uses the standard 10-class movement set (rest plus nine wrist/hand movements) for comparability with the LOSO literature (e.g. EMGBench), rather than the seven-class within-user functional set (six grips plus rest; Section 3.1); Table 1 reports it. Under 40-fold leave-one-subject-out evaluation with the same pipeline, balanced accuracy with only unsupervised target-subject normalisation (no labelled calibration) is

38.6 %

per-window, rising to

64.2 %

with

20 %

labelled calibration; the sequence-reasoning layer lifts these to

42.9 %

and

71.1 %

respectively, and per-execution voting reaches

50.6 %

(

95 %

CI

[46.5, 54.7]

over 40 folds) and

89.1 %

(CI

[86.7, 91.6]

), up to

91.3 %

at

30 %

(Table 1). Macro-F1 tracks balanced accuracy closely (per-window

0.37

at

0 %

and

0.62

at

20 %

calibration), confirming the gains are not an artefact of the balanced-averaging choice. The sequence layer contributes

+ 4.2

to

+ 7.2

per-window points with zero additional calibration, a larger margin than the within-user

+ 1.5

, consistent with sequence priors mattering more when the per-window classifier is weaker (the cross-subject regime). This LOSO gain is significant at every calibration level (paired Wilcoxon signed-rank across the 40 folds,

p < 10^{- 10}

, rank-biserial

r \approx 1.0

). All calibration uses leak-safe per-subject statistics; no held-out-subject data informs training, and only an unsupervised per-channel scale is fit on the held-out subject.

Comparison to EMGBench. Because we adopt EMGBench’s protocol, the comparison is matched on dataset, class set (10-class DB2), subject count, and the 40-fold leave-one-subject-out plus few-shot-adaptation structure. Across its evaluated architectures (e.g. ResNet18, EfficientNet), EMGBench reports DB2 cross-subject accuracy of

19.5

–

19.9 %

with no adaptation (its Table 2) and

52.3 %

when fine-tuning on the first

20 %

of the held-out subject’s data (its Table 3, FT-

20 %

) [7]. The controlled, like-for-like comparison is per-window: our per-window balanced accuracy at the same operating points is

38.6 %

and

64.2 %

, at a 28k-parameter footprint. (Per-execution voting, an additional unit EMGBench does not report, reaches

50.6 %

and

89.1 %

.) One nuance keeps this honest: EMGBench reports overall test accuracy whereas we report balanced accuracy (mean per-class recall). On this rest-heavy set balanced accuracy is the stricter number (it cannot be inflated by the dominant rest class), and here it sits below the overall accuracy we obtain under EMGBench’s exact metric. For completeness we report both: our overall per-window accuracy (EMGBench’s metric) is

69.1 %

at

0 %

and

77.4 %

at

20 %

calibration, versus EMGBench’s

19.5

–

19.9 %

and

52.3 %

; our stricter balanced accuracy (

38.6 %

,

64.2 %

) still clears their overall figure, so beating them on either metric is in the conservative direction. We read the gap as indicative rather than a fully controlled head-to-head, but on the matched protocol our compact model is at least competitive with, and on these numbers exceeds, the benchmark’s reported results.
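The difference between the two metrics is easy to see on a rest-heavy confusion matrix. The following is a minimal sketch with placeholder counts (not benchmark data): rows are true classes, columns are predictions, and class 0 stands in for rest.

```python
def overall_accuracy(conf):
    """Fraction of all windows classified correctly (dominated by frequent classes)."""
    total = sum(sum(row) for row in conf)
    return sum(conf[i][i] for i in range(len(conf))) / total


def balanced_accuracy(conf):
    """Mean per-class recall: every class counts equally, however rare."""
    recalls = [conf[i][i] / sum(conf[i]) for i in range(len(conf)) if sum(conf[i]) > 0]
    return sum(recalls) / len(recalls)


# Placeholder rest-heavy example: 1000 rest windows, 100 windows per grip
toy_conf = [
    [900, 50, 50],  # true rest
    [60, 30, 10],   # true grip 1
    [50, 10, 40],   # true grip 2
]
overall = overall_accuracy(toy_conf)    # 970 / 1200
balanced = balanced_accuracy(toy_conf)  # (0.9 + 0.3 + 0.4) / 3
```

In this toy case the well-recognised rest class lifts overall accuracy to about 0.81 while balanced accuracy stays near 0.53, the same direction as the gap between the two figures reported above.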

4.6. Baselines and Ablations

We compare against two baselines and two ablations (Table 2). To separate the contribution of the principled sequence grammar from that of the hand-tuned deployment logic, every row reports the classifier/grammar stream with the hysteresis state machine removed; the gate enters only at the deployed operating point (Section 3.5), where it raises per-execution accuracy by just

+ 1.3

points (

92.7 \to 94.0 %

) while cutting rest false-activation to

2.2 %

. The accuracy is thus delivered by the zero-parameter grammar; the gate is a safety/operating-point layer, not an accuracy crutch. These two contributions act on distinct axes and should not be conflated: the grammar’s gain is the per-window improvement (

+ 2.6

points over raw argmax evidence, Figure 2), whereas the gate’s

+ 1.3

is a per-execution operating-point adjustment that buys a much lower false-activation rate at the cost of movement coverage (Section 4.4).

Classical baseline: an LDA on the Hudgins time-domain feature set [20] (MAV, WL, ZC, SSC). We test it against our decoder with a Wilcoxon signed-rank test across the 20 subjects (Holm-corrected over the two metrics). On the saturating per-execution metric we do not detect a difference (

92.7 \pm 11.5

vs.

94.7 \pm 7.8

;

p = 0.78

, rank-biserial

r = 0.08

). We are careful here: with a single held-out execution per class per subject, this per-execution comparison is coarse and underpowered, so we claim only the absence of a detectable difference, not statistical equivalence. The discriminating per-window metric, by contrast, has a large per-subject sample and shows a clear and significant advantage for the learned representation (

75.2 \pm 7.8

vs.

66.9 \pm 8.1

,

+ 8.3

points;

p < 0.001

,

r = 1.00

: our decoder wins for every subject). This is the empirical core of the paper’s metric-honesty thesis: the deep model and the classical baseline are separable only under per-window evaluation, so reporting the per-execution number alone would erase a real and significant difference. We state the corollary plainly: for steady-state grip maintenance, the regime the per-execution metric rewards, classical TD+LDA features remain entirely adequate, and the learned representation earns its footprint specifically in the instantaneous, onset-transient regime that the per-window metric exposes.

Learned temporal baseline (TCN): to test whether a stronger temporal backbone is needed, we train a standard dilated causal temporal convolutional network [24] under the identical leak-safe protocol. Its raw per-window balanced accuracy (

76.3 \pm 7.8 %

) is competitive with our compact CNN plus the zero-parameter grammar (

75.2 %

), and per-execution saturates as expected (

95.2 %

). Crucially it reaches this with

\sim 2 \times

the parameters (57k versus 28k): baking temporal context into the network and offloading it to a parameter-free transition grammar land in the same per-window regime, but the latter does so at half the deployed footprint, and the grammar is backbone-agnostic (it composes with the TCN exactly as with the CNN). The same pattern holds under the wider acquisition band (§4.8): the wideband TCN reaches

78.8 %

per-window versus our wideband CNN-plus-grammar at

77.4 %

– again within

\sim 1.4

points at twice the parameter count – so the conclusion is not band-specific. This supports, rather than undercuts, the compactness thesis: the contribution is reaching this accuracy band on a microcontroller-class model, not a claim that the CNN backbone is uniquely optimal.

Temporal-smoothing ablation: a sliding majority vote at matched window lengths (

w = 3, 5

, spanning the decoder’s effective memory) isolates the learned grammar from naive smoothing; the grammar’s per-window margin over the matched vote is small but significant (Wilcoxon signed-rank,

p = 0.001

,

r = 0.79

, across the 20 subjects).

Emission ablation: the scaled-likelihood prior correction raises per-window balanced accuracy slightly (

75.2 \to 77.4 %

) but inflates rest false-activation from

4.5 \pm 2.0 %

to

52.3 \pm 15.4 %

, empirically justifying retention of the rest prior in the clinical decoder.

Self-supervision ablation: removing masked pretraining changes per-window accuracy by

< 0.2

points and per-execution by

< 1

point, a negligible effect at this dataset scale (20 pretraining subjects). This is consistent with self-supervision’s reliance on large unlabelled corpora; we report it as a component that is harmless and expected to help with larger pretraining data, not as a driver of the present results.
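The rank-biserial effect sizes quoted alongside the Wilcoxon tests can be sanity-checked from the test statistic. The sketch below assumes the common matched-pairs definition r = (W+ − W−) / (W+ + W−) with zero differences dropped and tied magnitudes given average ranks; the text does not state which formula was used, so this is an assumption. Under it, the W = 22 over 20 subjects reported earlier for the grammar-versus-voting comparison gives r ≈ 0.79, matching the quoted value.

```python
def signed_rank_sums(diffs):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    nz = [d for d in diffs if d != 0]  # zero differences are dropped
    order = sorted(range(len(nz)), key=lambda i: abs(nz[i]))
    ranks = [0.0] * len(nz)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(order) and abs(nz[order[j + 1]]) == abs(nz[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nz) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nz) if d < 0)
    return w_plus, w_minus


def rank_biserial(w_plus, w_minus):
    """Matched-pairs rank-biserial correlation (assumed definition)."""
    return (w_plus - w_minus) / (w_plus + w_minus)


# W = 22 is the smaller rank sum; with n = 20 the total rank sum is n(n+1)/2 = 210
n, w_small = 20, 22
r_grammar_vs_vote = rank_biserial(n * (n + 1) / 2 - w_small, w_small)
```

An r of 1.00, as reported for the per-window comparison against the classical baseline, corresponds to W− = 0: every subject's difference has the same sign.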

4.7. Online Behaviour and the Onset-Latency Floor

Streaming validation. To confirm that the reported accuracy is genuinely causal and real-time and not an artefact of batched offline scoring, we replay a held-out repetition through the deployed pipeline one

256

ms window at a time in temporal order, updating the forward-filter belief and the gate online. Across all 20 test users the streamed decode matches the batched offline decode exactly (

100 %

of windows for every subject). Because the online forward filter is mathematically equivalent to the batched computation, this is an implementation-correctness check that the reported accuracy carries no future information, not a separate accuracy result. The pipeline sustains a median per-decision latency of

3.7 \pm 0.1

ms (mean ± SD across users; p95

4.8 \pm 0.3

ms) on a desktop processor (Apple M4 Pro), i.e. about

6 %

of the

64

ms cadence (Figure 6). This is a desktop compute-headroom figure, distinct from the microcontroller workload (the

11.8

MMAC/s of Table 6); we report compute headroom, not measured on-device energy, which remains future work.
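The streamed-versus-batched check can be sketched as follows. This is an illustrative forward filter, not the released code: the transition matrix, prior, and emissions are placeholder values, the class and function names are hypothetical, and the per-window emission is assumed to be the classifier's class scores.

```python
def forward_step(belief, emission, A):
    """One causal update: predict through A, weight by the emission, renormalise."""
    K = len(belief)
    predicted = [sum(belief[i] * A[i][j] for i in range(K)) for j in range(K)]
    unnorm = [predicted[j] * emission[j] for j in range(K)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def argmax(v):
    return max(range(len(v)), key=v.__getitem__)


def decode_batched(emissions, A, prior):
    """Offline scoring: run the whole recording, then read out the decisions."""
    beliefs, b = [], list(prior)
    for e in emissions:
        b = forward_step(b, e, A)
        beliefs.append(b)
    return [argmax(b) for b in beliefs]


class StreamingDecoder:
    """Online use: one window in, one decision out, state carried between calls."""

    def __init__(self, A, prior):
        self.A, self.belief = A, list(prior)

    def push(self, emission):
        self.belief = forward_step(self.belief, emission, self.A)
        return argmax(self.belief)


# Placeholder 3-state example (rest, grip A, grip B)
A = [[0.90, 0.05, 0.05], [0.10, 0.90, 0.00], [0.10, 0.00, 0.90]]
prior = [0.8, 0.1, 0.1]
emissions = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]

stream = StreamingDecoder(A, prior)
streamed = [stream.push(e) for e in emissions]
batched = decode_batched(emissions, A, prior)

latency_share = 3.7 / 64  # reported median per-decision latency over the stride
```

The two decodes agree by construction, because each decision depends only on past and current windows. The reported 3.7 ms per decision is about 5.8% of the 64 ms cadence, consistent with the "about 6%" quoted above.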

Onset latency has both a decoder component and an acquisition floor. The gap between per-window and per-execution accuracy is the onset transient, and the clinically relevant quantity is how quickly the controller commits to a grasp. The measured

\sim 600

–

750

ms onset latency decomposes, approximately, into a deterministic algorithmic part and a residual: the

256

ms window must first fill with movement, and the deployed hysteresis debounce adds

n_{on} \times \text{stride} \approx 192

ms, so roughly

450

ms is design-controlled latency before any physiology, with the remainder the muscle-recruitment ramp and a cue-based-labelling artefact (NinaPro defines onset relative to a stimulus prompt, so the measured value also conflates stimulus-reaction time). We attacked the design-controlled part: a one-sided CUSUM sequential change-detector on the activation log-likelihood-ratio (minimum-latency for a given false-alarm rate [25]) lowers onset latency by

\sim 60

–

160

ms at a matched false-activation rate, and we also tried a shorter

128

ms window at a

32

ms stride, an activation-velocity feature, and fixed-lag smoothing. None of these brought median latency at a clinically-acceptable false-activation rate (2–

5 %

) below

\sim 600

ms. The remaining floor is not removable by the decoder on this data: shortening the window trades away identity accuracy, and the cue-based-labelling artefact cannot be resolved on cue-prompted offline data at all. Quantifying and reducing onset latency therefore requires acquisition with movement-synchronised ground truth; we report this as a limitation with an identified mechanism rather than as a solved problem.
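The latency decomposition and the change-detector idea can be sketched together. The detector below is a textbook one-sided CUSUM, offered as an assumption about the general form rather than the paper's exact detector; the log-likelihood-ratio stream and threshold are placeholders.

```python
WINDOW_MS = 256    # analysis window stated in the text
DEBOUNCE_MS = 192  # hysteresis debounce stated in the text

# Design-controlled latency before any physiology: window fill plus debounce
design_controlled_ms = WINDOW_MS + DEBOUNCE_MS  # 448, quoted as roughly 450 ms


def cusum_onset(llr, threshold):
    """One-sided CUSUM on an activation log-likelihood-ratio stream.

    The statistic accumulates evidence for 'active', is clamped at zero so that
    rest evidence cannot build up a deficit, and raises an alarm at the first
    window where it reaches the threshold. Returns that index, or None.
    """
    s = 0.0
    for t, x in enumerate(llr):
        s = max(0.0, s + x)
        if s >= threshold:
            return t
    return None


# Placeholder stream: two rest-like windows, then growing evidence of activation
toy_llr = [-1.0, -1.0, 0.5, 2.0, 2.0, 2.0]
alarm_index = cusum_onset(toy_llr, threshold=3.0)
```

Raising the threshold lowers the false-alarm rate and delays the alarm, which is the latency-versus-false-activation trade that the matched comparison above holds fixed.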

Pseudo-online control metrics. Beyond classification accuracy, we report the controller-level metrics a rehabilitation reviewer expects, computed per held-out grasp attempt from the same causal decode. An attempt counts as completed only if the decoder reaches and holds the target grip for the majority of the attempt’s second half: a settle-and-hold criterion, distinct from a whole-segment majority vote (per-execution accuracy), in that it requires the controller to converge on the target and stay there rather than merely be correct on average. Across the 20 within-user test subjects the controller completes

90.0 \pm 16.2 %

of grasp attempts under this criterion, with a median selection time (onset to first correct output) of

640 \pm 362

ms and a rest stability of

93.7 \pm 2.8 %

(the complement of the false-activation rate). Completion rate and rest stability are deployment-ready; the selection time reflects the onset-latency floor above and is the main responsiveness limitation, consistent with the per-window/per-execution gap.
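The settle-and-hold criterion and its difference from a whole-segment vote can be made concrete. This is a minimal sketch under stated assumptions: each attempt is a list of decoded labels starting at the labelled onset, "majority" is read as a strict majority, and the function names are hypothetical.

```python
from collections import Counter


def attempt_completed(decoded, target):
    """Settle-and-hold: target grip held for a strict majority of the second half."""
    second_half = decoded[len(decoded) // 2:]
    hits = sum(1 for d in second_half if d == target)
    return hits > len(second_half) / 2


def whole_segment_vote(decoded):
    """Per-execution decision: majority label over the whole attempt."""
    return Counter(decoded).most_common(1)[0][0]


def selection_time_ms(decoded, target, stride_ms=64):
    """Time from onset to the first correct output, or None if never correct."""
    for i, d in enumerate(decoded):
        if d == target:
            return i * stride_ms
    return None


def rest_stability(decoded_rest, rest=0):
    """Fraction of true-rest windows decoded as rest."""
    return sum(1 for d in decoded_rest if d == rest) / len(decoded_rest)


# An attempt that converges late but holds: completed
settles = [0, 0, 0, 3, 3, 1, 3, 3, 3, 3]
# An attempt that is right on average but drops out near the end: not completed
drops_out = [3, 3, 3, 3, 3, 3, 0, 0, 0, 3]
```

The second attempt is counted correct by the whole-segment vote yet fails settle-and-hold, which is why the completion rate can sit below per-execution accuracy.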

4.8. Acquisition Bandwidth

The within-user and ablation results above use the deployment-oriented 20–

120

Hz band at a

250

Hz sampling rate, which captures the dominant sEMG power. Surface EMG, however, carries discriminative fibre-recruitment and conduction-velocity content above

120

Hz. We test whether that content is useful by re-deriving the same pipeline at a

1

kHz sampling rate with a wider 20–

450

Hz band, holding the model, calibration protocol, and decoder fixed; only the acquisition front-end changes, and the deployed footprint is unchanged at

28,199

parameters.

The wider band is a significant, model-free gain in instantaneous accuracy, and an iso-sampling-rate control shows the gain is the analysis band itself rather than the finer temporal sampling that comes with it (Table 3, Figure 7). End to end, per-window balanced accuracy rises by

2.6

points for the causal decoder (

74.8 % \to 77.4 %

; Wilcoxon

p = 0.005

,

r = 0.70

). To separate band from sampling resolution we add a control arm at

1

kHz with the original 20–

120

Hz band, so that the two

1

kHz arms differ only in the low-pass cutoff at identical sampling rate and window length. The wider band alone then accounts for

1.9

of those points (

75.5 % \to 77.4 %

; Wilcoxon

p = 0.027

,

r = 0.56

, better in

14 / 20

subjects), while raising the sampling rate at fixed band contributes the remaining

0.7

(

74.8 % \to 75.5 %

). The per-execution figure, already near ceiling, does not move significantly. Because the gain appears in the per-window metric at zero parameter cost and survives the iso-rate control, it is attributable to the spectral content above

120

Hz rather than to the model, isolating analysis bandwidth as a concrete, inexpensive sensor-front-end lever. This gain is not a single-seed artifact: re-running the full wideband-versus-narrowband comparison with fresh population bases at two additional seeds reproduces a significant per-window improvement in every case (per-seed gains

+ 2.6

,

+ 2.3

,

+ 3.5

points; paired Wilcoxon

p = 0.005

,

0.014

,

< 0.001

; rank-biserial

r = 0.70, 0.62, 0.92

; Cohen’s

d = 0.78, 0.68, 1.13

;

95 %

CIs on the mean gain all exclude zero, e.g.

[1.0, 4.1]

at the deployed seed; better in

15 / 20

,

14 / 20

,

18 / 20

subjects; mean

+ 2.8

points over the three seeds), so the rest of the paper’s single-seed results are reported against a band effect that is stable to re-seeding.

4.9. Amputee Benchmark and Cross-Population Transfer

Able-bodied benchmarks are an imperfect proxy for prosthetic users. We therefore apply the identical leak-safe within-user protocol to 12 trans-radial amputees, pooled from NinaPro DB3 [1] (10 subjects with the complete grip set; one further subject lacked three grips and was excluded) and DB7 [26] (2 amputees), all recorded with the same 12-electrode montage and global movement labelling. The population base is the able-bodied DB2 model (the realistic deployment case, in which a new amputee user cannot be in the pretraining set); each amputee is then calibrated and tested on their own held-out repetition.

Two findings stand out (Table 4). First, amputee decoding is both substantially harder and far more variable than able-bodied: per-execution balanced accuracy is

63.8

–

74.2 %

(versus

94.0 %

able-bodied) with a large between-subject spread (

\pm 25

points; individual amputees range from

\sim 14 %

to

100 %

per-execution; Figure 8). Disaggregating by source dataset is itself informative and we report it plainly: the ten DB3 amputees decode at

61.7 \pm 25.9 %

per-execution, whereas the two DB7 amputees are much stronger (

92.9 %

; individually

86 %

and

100 %

). The pooled figure is thus lifted by a two-subject cohort, and on the larger DB3 cohort alone the gap to able-bodied is wider than the pooled number implies. Because DB7 contributes only two subjects, no per-dataset statistical comparison is meaningful for it, and the cross-population claim below rests on DB3; we keep the datasets pooled only for the aggregate effect-size estimate, and flag the confound explicitly. Second, and contrary to the able-bodied result where the learned decoder significantly beats the classical Hudgins time-domain baseline (Table 2;

p < 0.0001

), on amputees we find no evidence that this advantage transfers: per-window balanced accuracy is statistically comparable (

54.6 %

classical versus

55.0 %

learned,

p = 0.68

) and per-execution the classical baseline is, if anything, ahead by ten points, though not significantly (

74.2 %

versus

63.8 %

,

p = 0.12

). With

n = 12

and a

\pm 25

-point between-subject spread the comparison is underpowered (detectable only for very large effects); neither metric is significant even before correcting for the two comparisons (Holm leaves both n.s.), so this is absence of evidence of a deep advantage, not evidence of equivalence, with the point estimates weakly favouring the classical baseline. Given the pooling of two distinct datasets and the dominant DB3-versus-DB7 difference above, we read the deep ≈ classical observation as a tentative trend, not an authoritative finding. Pretraining the deep features on able-bodied data does not recover the gap. The interpretation is that amputee sEMG (attempted/phantom movements, altered anatomy, reduced repeatability) shifts the signal distribution enough that features learned on intact limbs do not transfer, while a simple per-user time-domain classifier remains robust. This is a cautionary cross-population result: able-bodied performance over-states amputee performance, and architectural advantages measured on intact-limbed benchmarks do not necessarily carry over.
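The Holm step mentioned above is short enough to show in full. This sketch implements the standard Holm step-down adjustment and applies it to the two amputee p-values quoted in the text (0.68 per-window, 0.12 per-execution); the function name is hypothetical.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # the smallest p is multiplied by m, the next by m-1, and so on;
        # the running maximum keeps the adjusted values monotone
        running = max(running, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running
    return adjusted


per_window_adj, per_execution_adj = holm_adjust([0.68, 0.12])
```

The adjusted values are 0.68 and 0.24, so both comparisons remain non-significant at the 0.05 level, as stated.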

4.10. Cross-Session Re-Donning Gap

Every within-user number above shares a single recording session and is therefore an optimistic bound: it does not capture the electrode shift that occurs when a device is removed and re-applied. We measure that gap directly on NinaPro DB6 [27], which records each subject over five days with the electrodes re-donned between sessions. DB6 has ten subjects; we run two folds of five, training the population base on one half and testing on the other, then swapping the halves, so every subject is tested under a base that never saw their data (leak-free). For each test subject we calibrate on day 1 and evaluate on (a) held-out day-1 repetitions (within-session) and (b) day 5, four days and several re-donnings later, with no day-5 calibration (the worst case).
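The two-fold arrangement can be sketched as follows. Which subjects land in which half is an assumption (the text says only that the ten subjects are split into two folds of five), and the subject identifiers are placeholders.

```python
def two_fold_subject_splits(subjects):
    """Train the population base on one half, test on the other, then swap."""
    half = len(subjects) // 2
    a, b = subjects[:half], subjects[half:]
    return [{"train": a, "test": b}, {"train": b, "test": a}]


db6_subjects = [f"s{i:02d}" for i in range(1, 11)]  # placeholder IDs for ten subjects
folds = two_fold_subject_splits(db6_subjects)
```

Every subject appears in exactly one test half and never in the training half of the same fold, which is the leak-free property the protocol relies on.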

The gap is large but recoverable, and we characterise it across the full five-day span on all ten subjects (Table 5, Figure 9); we attach subject-bootstrap

95 %

CIs (

10^{4}

resamples over the ten subjects) to the key quantities. Within-session day-1 per-execution balanced accuracy is

93.4 %

(bootstrap CI

[90.6, 95.8]

), in line with the DB2 headline. Applying that day-1 decoder on a later day with no adaptation collapses it, and the drop appears immediately on the first re-donning (day 2:

47.9 %

) and persists to day 5 (

38.8 %

, CI

[28.3, 50.1]

): a

54.6

-point fall. The collapse is a step change the moment the electrodes are re-applied, not gradual drift, which is the signature of donning-induced shift and puts a number on the paper’s central thesis: within-session accuracy is a weak proxy for deployment.
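A subject-level bootstrap interval of the kind attached to these figures can be sketched as below. The percentile method is an assumption (the text does not name the interval type), and the per-subject accuracies are placeholders, not DB6 results.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder per-subject accuracies for ten subjects (illustrative only)
toy_acc = [0.90, 0.95, 0.85, 1.00, 0.92, 0.97, 0.88, 0.94, 0.99, 0.91]
toy_mean = sum(toy_acc) / len(toy_acc)
ci_lo, ci_hi = bootstrap_ci(toy_acc)
```

Resampling whole subjects rather than windows keeps the interval honest about the real unit of replication: with ten subjects, the width is driven by between-subject spread.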

Crucially it is fully recoverable. A short per-session recalibration, fine-tuning on repetitions from the test day, restores per-execution accuracy to 93–

95 %

at every separation. The recovered accuracy is statistically indistinguishable from the within-session ceiling (recalibrated

94.2 \pm 3.4 %

versus ceiling

93.4 \pm 4.3 %

, pooled over re-donned days; paired Wilcoxon

p = 0.63

,

n = 10

;

p \geq 0.32

at every individual day separation), so re-donning is fully reversible rather than merely mitigated (Figure 9). The recovery is cheap, and we quantify its price as a recalibration-budget curve at the worst (day-5) separation (subject-bootstrap

95 %

CIs in brackets): a single recalibration repetition already lifts accuracy from

37.7 %

to

66.5 %

[60.2, 72.7]

, two reach

79.7 %

[72.5, 86.5]

, four reach

86.9 %

[81.2, 91.9]

, and eight restore the full within-session ceiling (

93.6 %

[91.2, 96.0]

; Figure 9b). In wall-clock terms, at DB6’s

\approx 4

–

6

s held contraction per grasp one repetition of the seven-grasp set is roughly a minute of guided movement, so the four repetitions that recover

\sim 87 %

cost on the order of a few minutes of recalibration per don and the eight that fully restore the ceiling under ten minutes; a clinician can trade that one-time per-don cost against accuracy. Even label-free re-normalisation, re-fitting only the per-channel z-score on the new session with no labels, recovers a partial

+ 10.1

points (

n = 10

; to

47.7 \pm 15.0 %

). The deployment recipe is therefore explicit: a decoder calibrated once does not survive re-donning, but a brief recalibration at each don fully recovers accuracy (four repetitions already get most of the way), with automatic renormalisation as a partial label-free fallback. The day-1 numbers, ours and the field’s, are a ceiling. We state the scope: DB6 uses its own 8-class grasp set and a different electrode montage, so the absolute values are not directly comparable to the DB2 sections.

4.11. Deployability

The inference model (backbone + classifier) has

28,199

parameters (

27.5

kB at int8;

110

kB fp32) and

0.76

million multiply–accumulates per window. At the

64

ms decision stride this is a sustained

11.8

MMAC/s, well within a Cortex-M-class budget; the sequence-reasoning layer adds only a

K \times K

matrix–vector update per window. We report hardware-agnostic compute metrics (parameters, MACs, quantised size) deliberately, so that the efficiency comparison is not confounded by MCU-specific SDK optimisations, sleep states, or accelerator availability; measured on-device energy on the target prosthesis is future work. Table 6 positions our footprint against recent embedded and foundation-model EMG decoders on the axis that is comparable across papers (model size), where ours is the smallest by a wide margin; we deliberately do not tabulate accuracy across these systems, as they use different datasets, class sets, and evaluation protocols, and a cross-protocol accuracy comparison is precisely the apples-to-oranges the paper argues against. Bioformers [4] and TinyMyo [6] report hardware-measured energy that we do not. We tabulate the per-inference energy for both (the per-decision quantity, in mJ), not the device power envelope (in mW): TinyMyo’s

44.9

mJ is the energy of a single forward pass of its

3.6

M-parameter model on GAP9, more than two orders of magnitude (roughly 320×) above Bioformers’

0.14

mJ on the smaller GAP8, which is the comparison that motivates a compact model.

Table 6. Footprint positioning against recent embedded/foundation-model EMG decoders. Sizes are comparable across papers; accuracy is not (different datasets and protocols) and is deliberately omitted. We report hardware-agnostic compute cost; measured on-device energy is future work.

Model	Params	Int8 size	Measured on-device
Ours	28.2 k	27.5 kB	compute only (0.76 MMAC/window)
Bioformers [4]	–	94.2 kB	GAP8: 0.14 mJ, 2.72 ms
TinyMyo [6]	3.6 M	–	GAP9: 44.9 mJ
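The footprint figures follow from simple arithmetic on the parameter count, the per-window MAC count, and the decision stride. The sketch below recomputes them from the stated inputs, assuming one byte per int8 weight, four bytes per fp32 weight, and sizes expressed in units of 1024 bytes (an assumption that reproduces the quoted 27.5 and 110).

```python
PARAMS = 28_199            # inference-model parameters stated in the text
MACS_PER_WINDOW = 0.76e6   # multiply-accumulates per window stated in the text
STRIDE_S = 0.064           # decision stride (64 ms)


def footprint(params, macs_per_window, stride_s):
    """Hardware-agnostic size and sustained compute for a strided decoder."""
    return {
        "int8_kib": params * 1 / 1024,  # one byte per weight
        "fp32_kib": params * 4 / 1024,  # four bytes per weight
        "mmac_per_s": macs_per_window / stride_s / 1e6,
    }


ours = footprint(PARAMS, MACS_PER_WINDOW, STRIDE_S)

# Ratio of the two hardware-measured per-inference energies quoted in the text
energy_ratio = 44.9 / 0.14
```

This gives about 27.5 and 110.2 for the two sizes and about 11.9 MMAC/s from the rounded 0.76 MMAC per window, in line with the quoted 11.8 once rounding of the per-window count is allowed for. The energy ratio is roughly 320.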

5. Discussion

What the numbers mean, and a fail-safe property. The

94.0 %

per-execution figure is the standard voted reporting unit and reflects how a grasp-and-hold device decides; the

75.2 %

per-window figure reflects instantaneous behaviour, and the gap between them is the onset transient. The confusion matrix (Figure 4) shows that this residual error is localised almost entirely in the rest column (e.g. pinch and tripod lose

22 %

and

25 %

of their per-window mass to rest) rather than in grip–grip confusion. This is a desirable failure mode: during the brief muscle-recruitment delay the system fails safe, momentarily withholding activation rather than executing the wrong grip. Reporting both metrics is essential, as the per-execution number alone would overstate the moment-to-moment experience.

On the transition grammar. Because NinaPro DB2 contains discrete, isolated repetitions separated by rest, the learned transition matrix is a hub-and-spoke model (rest ↔ grasp); direct grasp-to-grasp transitions do not occur in the data and are therefore near-zero in A. This is a property of the dataset, not a limitation of the method. The mechanism, in which a fixed transition prior over per-window posteriors suppresses instantaneous switching noise, supports arbitrary grasp-to-grasp transitions once A is re-estimated on unconstrained functional data. Concretely, the off-diagonal grammar is the population maximum-likelihood estimate

O_{i j} = n_{i j} / \sum_{k \neq i} n_{i k}

, where

n_{i j}

counts observed state-i-to-state-j label transitions; the hub-and-spoke pattern is precisely the regime in which the grasp-to-grasp counts

n_{g g^{'}}

(

g, g^{'} \in G

) are zero, forcing

O_{g g^{'}} = 0

and routing all probability through rest. On continuous data those counts are nonzero, so O acquires direct grasp-to-grasp mass with no change to Eq. 2 or to the forward-filter recursion, and because the hold h and the destination distribution O are decoupled, the self-transition stickiness is unchanged while only the routing adapts. We state the hub-and-spoke structure explicitly to avoid overstating generality. To verify that this is a data property and not an architectural limit, we ran a controlled synthetic experiment: injecting a direct grip-to-grip path into O and presenting an emission stream that switches grip without an intervening rest, the causal decoder follows the transition directly (rest → grip A → grip B, never falling back to rest between grips). The mechanism thus supports unconstrained grip-to-grip control whenever the training data contains such transitions; no execution code changes. We nonetheless stress that this generality is shown only in synthesis: because no public NinaPro-scale dataset contains naturalistic, rest-free grasp sequencing, empirically validating the grammar on unconstrained functional transitions, and quantifying any accuracy cost of grip-to-grip switching, remains an open problem and a limitation of the present benchmark, which we flag for future study on continuous-task data (e.g. MeganePro [12]).
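The estimate of O and its hub-and-spoke behaviour can be illustrated on toy label sequences. This is a minimal sketch: the counting and normalisation follow the formula above, while the way the hold h and O combine into a full transition matrix (A_ii = h, A_ij = (1 − h) O_ij) is an assumption made for illustration, since Eq. 2 is not reproduced here. Function names and sequences are placeholders.

```python
def count_transitions(label_seqs, K):
    """n[i][j]: number of observed state-i-to-state-j label transitions."""
    n = [[0] * K for _ in range(K)]
    for seq in label_seqs:
        for a, b in zip(seq, seq[1:]):
            n[a][b] += 1
    return n


def off_diagonal_grammar(n):
    """O[i][j] = n[i][j] / sum_{k != i} n[i][k] for j != i, with O[i][i] = 0."""
    K = len(n)
    O = [[0.0] * K for _ in range(K)]
    for i in range(K):
        denom = sum(n[i][k] for k in range(K) if k != i)
        for j in range(K):
            if j != i and denom > 0:
                O[i][j] = n[i][j] / denom
    return O


def transition_matrix(O, hold):
    """ASSUMED composition: stay with probability hold, else route by O."""
    K = len(O)
    return [[hold if i == j else (1.0 - hold) * O[i][j] for j in range(K)]
            for i in range(K)]


# States: 0 = rest, 1 = grip A, 2 = grip B
isolated = [[0, 0, 1, 1, 1, 0, 0, 2, 2, 0]]  # repetitions separated by rest
continuous = [[0, 1, 1, 2, 2, 0]]            # includes a direct grip A -> grip B switch

O_hub = off_diagonal_grammar(count_transitions(isolated, 3))
O_cont = off_diagonal_grammar(count_transitions(continuous, 3))
A_hub = transition_matrix(O_hub, hold=0.9)
```

On the isolated sequences every grip row of O routes entirely through rest, so the grip-to-grip entries are exactly zero; on the continuous sequence the same code yields nonzero grip-to-grip mass with the hold left untouched.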

Limitations. (i) Our evaluation is offline and pseudo-online: as the closest proxy that public, cue-prompted data permits, we report controller-level metrics (completion rate, selection time, rest stability; §4.7) and a window-by-window causal stream, but offline and pseudo-online accuracy correlate only weakly with closed-loop, human-in-the-loop control [17,18], which requires a user reacting to real-time feedback and therefore a live recording setup that no pre-recorded public benchmark contains. A closed-loop validation is consequently out of scope for a public-data study by construction; it is the designated next step (below) rather than a gap closable on these datasets. (ii) The headline within-user protocol is within-session (calibration and test share a recording); we measure the cross-day cost directly on DB6 (§4.10,

n = 10

over five days): a

54.6

-point per-execution drop on re-donning without adaptation, fully recovered by a short per-don recalibration at every day separation. The within-session numbers should thus be read as a ceiling and per-session recalibration treated as a deployment requirement. (iii) We report compute cost, not measured on-device energy. (iv) The individual learning components (masked pretraining, contrastive learning, HMM decoding) are established; our contribution is the architecture-level decision to offload temporal structure from the network into a zero-parameter grammar for edge deployment, together with the leak-safe extrapolation protocol and paired-metric reporting. (v) The wider-band gain (§4.8) is shown within-session on able-bodied data; its interaction with electrode shift is untested, and is not obviously benign: the high-frequency content (120–

450

Hz) that carries the gain reflects fibre-recruitment and conduction-velocity detail that is plausibly more sensitive to electrode displacement than the dominant low-frequency power, so whether the band advantage survives re-donning (§4.10) is an open question that a cross-day wideband acquisition would need to settle. (vi) The amputee benchmark (§4.9) pools 12 subjects across two distinct datasets with large variance; the bulk is DB3 (

n = 10

) and DB7 contributes only 2 subjects that decode markedly better, so the pooled accuracy is DB7-inflated and the cross-population observation (deep ≈ classical on amputees) is a tentative trend resting on the DB3 cohort, not a powered comparison. We frame this as a microcontroller-compatible, honestly-evaluated system and benchmark, not as a clinical validation or a new learning primitive.

Toward deployment. The natural next steps are real-time closed-loop evaluation (concretely, a Fitts-style target-acquisition or grasp-completion task with a small cohort of users reacting to live visual feedback, the one regime our pseudo-online proxy cannot capture), and on-device energy measurement on the target prosthesis: the two regimes that determine clinical viability. The cross-day re-donning regime is already measured directly (§4.10). Two concrete, evidence-backed front-end requirements follow from this study: extend the acquisition band to at least

\sim 450

Hz (sampled at 1 kHz, as in §4.8) to recover the per-window accuracy the narrow band discards, and acquire movement-synchronised onset ground truth so that onset latency can be measured at all (§4.7); both motivate dedicated acquisition hardware over re-use of cue-prompted public benchmarks.

6. Conclusions

We presented a compact (28k-parameter) within-user sEMG grasp decoder, a depthwise-separable CNN with a causal sequence-reasoning layer (and an optional self-supervised pretraining component that we show contributes little at this scale), evaluated under a strictly leak-safe extrapolation protocol with honest paired metrics. It reaches 94.0% per-execution accuracy at 2.2% false-activation within-user and 89.1% per-execution (20% labelled calibration; 50.6% with unsupervised normalisation only) under 40-fold cross-subject LOSO, at a microcontroller-class footprint. Three further studies sharpen the picture: a wider acquisition band (20–450 Hz) raises per-window accuracy by 2.6 points at no model-size cost (p = 0.005); the decoder runs as a causal real-time stream that reproduces the offline decision exactly at a median of 3.7 ms per decision; and on 12 amputees we find no evidence that its able-bodied advantage over a classical baseline transfers (the two perform comparably), while onset latency is bounded partly by the decoder’s own window and debounce and partly by an acquisition floor. Most consequentially, a cross-session study on DB6 measures the re-donning gap our within-session protocol cannot: per-execution accuracy falls from 93.4% to 38.8% when the electrodes are re-applied four days later without adaptation, but a short per-don recalibration fully restores it (93.1% at day 5, and 94.0–95.3% at the shorter separations), making per-session recalibration the explicit deployment recipe. Code and configurations are released at https://github.com/seanb9/emg-leaksafe-benchmark to support reproducible, deployment-oriented EMG research.

Reproducibility

All reported numbers are produced by a single command per experiment under a fixed global seed (1337) with deterministic kernels; the environment is PyTorch 2.12, NumPy 2.4, SciPy 1.17, and scikit-learn 1.9. Runs were executed on an Apple M4 Pro (24 GB unified memory, MPS backend); a within-user run completes in $\sim 4$ min and each LOSO worker (ten folds) in $\sim 85$ min. We specify the full protocol below; the pipeline is released at https://github.com/seanb9/emg-leaksafe-benchmark and comprises leak-safe windowing/split code, run configurations, per-subject result tables, figure scripts, and the synthetic positive-control generator.
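As an illustration of what a fixed global seed with deterministic kernels can look like in this environment, the following is a minimal sketch and not the released implementation. The helper name `seed_everything` is hypothetical; NumPy and PyTorch are seeded only if they are importable, so the snippet also runs on a bare Python install.

```python
import random


def seed_everything(seed: int = 1337) -> int:
    """Seed every random number generator that is importable here.

    Hypothetical helper: the released pipeline may seed differently.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy generator
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch default generators
        # Raise an error if an op has no deterministic implementation.
        torch.use_deterministic_algorithms(True)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
    return seed


# Re-seeding reproduces the same random stream.
seed_everything()
first = random.random()
seed_everything()
second = random.random()
```

Calling the helper once at the top of each experiment entry point is the usual pattern; the repeated call here only demonstrates that the stream restarts identically.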

Data and splits. NinaPro DB2 (40 subjects, 12 channels), resampled to 250 Hz, band-pass 20–120 Hz with a 50 Hz notch, windowed at 256 ms / 64 ms stride; each window additionally carries per-channel RMS and waveform-length envelopes (feature window 16 samples). Windows are assigned to contiguous (stimulus, repetition) segments and split by segment. Within-user: a fixed disjoint 20/20 subject partition (pretraining vs. test, enumerated in the released configs); for each test subject, calibration uses repetitions 1–r ($r \leq 5$) and evaluation the held-out repetition 6. LOSO: 40-fold leave-one-subject-out on the 10-class set; the held-out subject contributes no window to training or contrastive alignment. A held-out subject’s calibration pool is a class-balanced sample of its non-evaluation windows; the 0% condition draws from it only an unsupervised per-channel z-score, and the {5, 10, 20, 30}% conditions additionally fine-tune on that fraction of labelled calibration windows.
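The segment-level windowing and repetition split described above can be sketched as follows. This is an illustrative reading of the protocol, not the released code: the function names, the `Window` record, and the synthetic 100-sample repetitions are assumptions made for the example. At 250 Hz, the 256 ms window and 64 ms stride correspond to 64 and 16 samples.

```python
from collections import namedtuple

FS_HZ = 250                    # sampling rate after resampling (from the text)
WIN = 256 * FS_HZ // 1000      # 256 ms window -> 64 samples
STRIDE = 64 * FS_HZ // 1000    # 64 ms stride  -> 16 samples

# Hypothetical record: a window is identified by its segment and start sample.
Window = namedtuple("Window", "stimulus repetition start")


def segment_windows(labels, reps, win=WIN, stride=STRIDE):
    """Cut windows that lie wholly inside one contiguous (stimulus, repetition) run."""
    windows, seg_start, n = [], 0, len(labels)
    for i in range(1, n + 1):
        # A segment ends where the (stimulus, repetition) pair changes or the data ends.
        if i == n or (labels[i], reps[i]) != (labels[seg_start], reps[seg_start]):
            for s in range(seg_start, i - win + 1, stride):
                windows.append(Window(labels[seg_start], reps[seg_start], s))
            seg_start = i
    return windows


def split_by_repetition(windows, calib_reps, test_rep):
    """Assign whole segments to calibration or test; no window is shared."""
    calib = [w for w in windows if w.repetition in calib_reps]
    test = [w for w in windows if w.repetition == test_rep]
    return calib, test


# Synthetic stream: one stimulus, six repetitions of 100 samples each.
labels = [1] * 600
reps = [r for r in range(1, 7) for _ in range(100)]
windows = segment_windows(labels, reps)
calib, test = split_by_repetition(windows, calib_reps={1, 2, 3, 4, 5}, test_rep=6)
```

Because windows are generated per segment rather than over the whole stream, overlapping windows can never straddle the calibration/test boundary, which is the property the leak-safe protocol relies on.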

Model and training. Depthwise-separable 1-D CNN (widths 32–64–128, embedding 128); masked self-supervised pretraining (25 epochs, 40% span masking); supervised contrastive plus class-weighted cross-entropy; per-user fine-tune at learning rate $3 \times 10^{-4}$; prototype blend 0.4; 8-view test-time augmentation. LOSO uses 20 SupCon epochs and up to 60 early-stopped fine-tune epochs. Sequence decoder: causal forward filter with hold $h = 0.97$ and a population-learned off-diagonal grammar, posterior-as-emission (rest prior retained). Deployment hysteresis: $\theta_{\mathrm{on}} = 0.6$, $\theta_{\mathrm{off}} = 0.35$, $n_{\mathrm{on}} = 3$, $n_{\mathrm{off}} = 4$, $n_{\mathrm{sw}} = 3$.
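To make the sequence decoder concrete, the sketch below implements a causal HMM forward filter with a fixed self-transition ("hold") probability and an off-diagonal grammar that shares out the remaining mass. Several points are assumptions for illustration: the grammar here is uniform (the paper's is learned from the population), the initial belief is uniform, and "posterior-as-emission" is read as using the network's softmax output directly as the emission term without dividing by a class prior. Posteriors are assumed strictly positive.

```python
def make_transition(grammar, hold=0.97):
    """Row-stochastic transition matrix: `hold` on the diagonal, the
    remaining 1 - hold spread over off-diagonal entries in proportion
    to `grammar` (a stand-in for the population-learned grammar)."""
    k = len(grammar)
    rows = []
    for i in range(k):
        off = [grammar[i][j] if j != i else 0.0 for j in range(k)]
        total = sum(off)
        row = [(1.0 - hold) * v / total for v in off]
        row[i] = hold
        rows.append(row)
    return rows


def forward_filter(posteriors, transition):
    """Causal forward recursion: each belief uses only past and current windows."""
    k = len(transition)
    belief = [1.0 / k] * k  # assumed uniform start
    out = []
    for p in posteriors:
        predicted = [sum(belief[i] * transition[i][j] for i in range(k))
                     for j in range(k)]
        unnorm = [predicted[j] * p[j] for j in range(k)]  # posterior as emission
        z = sum(unnorm)
        belief = [u / z for u in unnorm]
        out.append(belief)
    return out


def argmax(v):
    return max(range(len(v)), key=v.__getitem__)


# Three classes; the fourth window is a one-window glitch towards class 2.
A = make_transition([[1.0] * 3 for _ in range(3)])
posteriors = [[0.2, 0.7, 0.1]] * 3 + [[0.2, 0.1, 0.7]] + [[0.2, 0.7, 0.1]]
beliefs = forward_filter(posteriors, A)
raw = [argmax(p) for p in posteriors]      # [1, 1, 1, 2, 1]
smooth = [argmax(b) for b in beliefs]      # [1, 1, 1, 1, 1]
```

The filter adds no trainable parameters: the hold and grammar are fixed, so only the per-window evidence comes from the network. Because the recursion never looks ahead, decoding a prefix of the stream yields the same beliefs as decoding the full sequence.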

Hyperparameter selection. Backbone width, embedding size, learning rate ($\{1, 3, 10\} \times 10^{-4}$), prototype blend ({0, 0.2, 0.4}), and SSL mask fraction ({0.3, 0.4, 0.5}) were selected by validation balanced accuracy on a split held out from the pretraining subjects; the HMM hold $h$ was swept on the same pretraining validation split (0.80–0.99, $h = 0.97$ chosen). The deployment-hysteresis constants were swept to a pre-specified rest-false-activation target, not optimised on test data. Final values and the search grids are provided in the released configurations.
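The deployment-hysteresis constants listed under Model and training can be read as a small debounce state machine, sketched below together with a rest false-activation measure of the kind such a sweep would target. The semantics are assumptions, since this section does not define them: a grip activates after `n_on` consecutive windows at or above `th_on`, releases to rest after `n_off` consecutive windows below `th_off`, and switches to another grip after `n_sw` consecutive windows in which that grip is at or above `th_on`. Class index 0 is taken to be rest, and the false-activation rate is taken to be the fraction of true-rest windows with a non-rest output.

```python
REST = 0  # assumed index of the rest class


def hysteresis_gate(beliefs, th_on=0.6, th_off=0.35, n_on=3, n_off=4, n_sw=3):
    """Debounced grip decision per window (assumed semantics, see lead-in)."""
    state, cand, cand_run, off_run = REST, REST, 0, 0
    out = []
    for b in beliefs:
        top = max(range(1, len(b)), key=lambda j: b[j])  # strongest non-rest class
        if b[top] >= th_on and top != state:
            # Count consecutive windows in which the same new grip is confident.
            cand_run = cand_run + 1 if top == cand else 1
            cand = top
        else:
            cand_run = 0
        if state == REST:
            if cand_run >= n_on:
                state, cand_run = cand, 0
        else:
            off_run = off_run + 1 if b[state] < th_off else 0
            if cand_run >= n_sw:
                state, cand_run, off_run = cand, 0, 0
            elif off_run >= n_off:
                state, off_run = REST, 0
        out.append(state)
    return out


def false_activation_rate(true_labels, gated):
    """Fraction of true-rest windows on which the gate outputs a grip."""
    rest = [g for t, g in zip(true_labels, gated) if t == REST]
    return sum(g != REST for g in rest) / len(rest)


rest_w, spur_w, grip_w = [0.9, 0.05, 0.05], [0.3, 0.65, 0.05], [0.1, 0.85, 0.05]
beliefs = [rest_w] * 5 + [spur_w] + [rest_w] * 2 + [grip_w] * 6 + [rest_w] * 6
truth = [0] * 8 + [1] * 6 + [0] * 6
gated = hysteresis_gate(beliefs)
far = false_activation_rate(truth, gated)
```

In this toy stream the isolated confident window never activates the hand, while the real grip is accepted on its third window and released on the fourth rest window. Under the assumed definition, the release lag is what shows up as false activation, which illustrates the trade-off between debounce length and onset and offset latency.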

Institutional Review Board Statement

This work is a secondary analysis of the publicly available NinaPro DB2 dataset [1], collected under the ethics approval reported by its original authors. No new human-subjects data were collected for this study and no identifiable personal information was used.

Data Availability Statement

NinaPro DB2 is publicly available at http://ninapro.hevs.ch. The full pipeline (leak-safe windowing and split code, run configurations, per-subject result tables, figure-generation scripts, and the synthetic positive-control generator) is released under the MIT licence at https://github.com/seanb9/emg-leaksafe-benchmark; the complete protocol is given in the Reproducibility section.

Author Contributions

S.B. conceived the study, designed the evaluation protocol and decoder, implemented the experiments, and wrote the manuscript. W.H. (Chief Medical Officer, ReAble Labs) provided clinical and rehabilitation-domain guidance, including the functionally-motivated grip set and the clinical false-activation and operating-point requirements, and revised the manuscript. Both authors reviewed and approved the final manuscript.

Funding

This work was supported by ReAble Labs. No external grant funding was received.

Conflicts of Interest

The authors are affiliated with ReAble Labs, which is developing assistive hand-prosthesis technology related to this work; this constitutes a competing financial interest. All results reported here use only public data and are presented without commercial claims.

Acknowledgments

The authors thank the NinaPro consortium for making the DB2 dataset publicly available.

References

1. Atzori, M.; Gijsberts, A.; Castellini, C.; Caputo, B.; Mittaz Hager, A.G.; Elsig, S.; Giatsidis, G.; Bassetto, F.; Müller, H. Electromyography data for non-invasive naturally-controlled robotic hand prostheses. Scientific Data 2014, 1, 1–13.
2. Atzori, M.; Cognolato, M.; Müller, H. Deep learning with convolutional neural networks applied to electromyography data: A resource for the classification of movements for prosthetic hands. Frontiers in Neurorobotics 2016, 10, 9.
3. Jiang, B.; Wu, H.; Xia, Q.; Xiao, H.; Peng, B.; Wang, L.; Zhao, Y. An efficient surface electromyography-based gesture recognition algorithm based on multiscale fusion convolution and channel attention. Scientific Reports 2024, 14, 30867.
4. Burrello, A.; Bianco Morghet, F.; Scherer, M.; Benatti, S.; Benini, L.; Macii, E.; Poncino, M.; Jahier Pagliari, D. Bioformers: Embedding Transformers for Ultra-Low Power sEMG-based Gesture Recognition. In Proceedings of the Design, Automation & Test in Europe Conference (DATE), 2022.
5. Zanghieri, M.; Benatti, S.; Burrello, A.; Kartsch, V.; Conti, F.; Benini, L. Robust Real-Time Embedded EMG Recognition Framework Using Temporal Convolutional Networks on a Multicore IoT Processor. IEEE Transactions on Biomedical Circuits and Systems 2020, 14, 244–256.
6. Fasulo, M.; Spacone, G.; Ingolfsson, T.M.; Li, Y.; Benini, L.; Cossettini, A. TinyMyo: A Tiny Foundation Model for Flexible EMG Signal Processing at the Edge. arXiv preprint arXiv:2512.15729 2025.
7. Yang, J.; Soh, M.; Lieu, V.; Weber, D.J.; Erickson, Z. EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024.
8. Wang, D.; Shelhamer, E.; Liu, S.; Olshausen, B.; Darrell, T. Tent: Fully Test-Time Adaptation by Entropy Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
9. Sun, Y.; Wang, X.; Liu, Z.; Miller, J.; Efros, A.; Hardt, M. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
10. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Multiclass Brain–Computer Interface Classification by Riemannian Geometry. IEEE Transactions on Biomedical Engineering 2012, 59, 920–928.
11. Touko, N.; Ellis, M.O.A.; Capone, C.; Burrello, A.; Donati, E.; Manneschi, L. Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition. arXiv preprint arXiv:2601.04181 2026.
12. Cognolato, M.; Gijsberts, A.; Gregori, V.; Saetta, G.; Giacomino, K.; Mittaz Hager, A.G.; Gigli, A.; Faccio, D.; Tiengo, C.; Bassetto, F.; et al. Gaze, visual, myoelectric, and inertial data of grasps for intelligent prosthetics. Scientific Data 2020, 7, 43.
13. Betthauser, J.L.; Krall, J.T.; Bannowsky, S.G.; Levay, G.; Kaliki, R.R.; Fifer, M.S.; Thakor, N.V. Stable Responsive EMG Sequence Prediction and Adaptive Reinforcement With Temporal Convolutional Networks. IEEE Transactions on Biomedical Engineering 2020, 67, 1707–1717.
14. Simon, A.M.; Hargrove, L.J.; Lock, B.A.; Kuiken, T.A. A Decision-Based Velocity Ramp for Minimizing the Effect of Misclassifications During Real-Time Pattern Recognition Control. IEEE Transactions on Biomedical Engineering 2011, 58, 2360–2368.
15. Chan, A.D.C.; Englehart, K.B. Continuous Myoelectric Control for Powered Prostheses Using Hidden Markov Models. IEEE Transactions on Biomedical Engineering 2005, 52, 121–124.
16. Rabiner, L.R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 1989, 77, 257–286.
17. Ortiz-Catalan, M.; Rouhani, F.; Brånemark, R.; Håkansson, B. Offline accuracy: A potentially misleading metric in myoelectric pattern recognition for prosthetic control. In Proceedings of the 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2015.
18. Vujaklija, I.; Shalchyan, V.; Kamavuako, E.N.; Jiang, N.; Marateb, H.R.; Farina, D. Online mapping of EMG signals into kinematics by autoencoding. Journal of NeuroEngineering and Rehabilitation 2018, 15, 21.
19. Englehart, K.; Hudgins, B. A Robust, Real-Time Control Scheme for Multifunction Myoelectric Control. IEEE Transactions on Biomedical Engineering 2003, 50, 848–854.
20. Hudgins, B.; Parker, P.; Scott, R.N. A New Strategy for Multifunction Myoelectric Control. IEEE Transactions on Biomedical Engineering 1993, 40, 82–94.
21. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
22. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020.
23. Bourlard, H.A.; Morgan, N. Connectionist Speech Recognition: A Hybrid Approach; Kluwer Academic Publishers, 1994.
24. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv preprint arXiv:1803.01271 2018.
25. Page, E.S. Continuous Inspection Schemes. Biometrika 1954, 41, 100–115.
26. Krasoulis, A.; Kyranou, I.; Erden, M.S.; Nazarpour, K.; Vijayakumar, S. Improved prosthetic hand control with concurrent use of myoelectric and inertial measurements. Journal of NeuroEngineering and Rehabilitation 2017, 14, 71.
27. Palermo, F.; Cognolato, M.; Gijsberts, A.; Müller, H.; Caputo, B.; Atzori, M. Repeatability of grasp recognition for robotic hand prosthesis control based on sEMG data. In Proceedings of the 2017 International Conference on Rehabilitation Robotics (ICORR). IEEE, 2017, pp. 1154–1159.

Figure 2. Per-window gain from the causal sequence-reasoning layer over raw argmax evidence (annotated +Δ; e.g. +2.6 at five reps). This is the gain over no logic, distinct from the smaller +1.5 gain over a matched-window majority vote (Table 2).

Figure 6. Online streaming validation. Top: temporal-segmentation ribbon for a representative held-out repetition decoded window-by-window as a causal stream; the upper strip is the ground-truth intent and the lower strip the streamed decode, colour-coded by grip (inter-grasp rest periods compressed for legibility). The decode strip matches the intent strip across grips, with brief colour slivers at onset/offset transients. Bottom: per-decision compute latency pooled across all 20 users (79,897 windows), shown over its actual range; the median is 3.7 ms (p95 4.8 ms), $17\times$ under the 64 ms decision budget. The streamed decode equals the batched offline decode for 100% of windows in every subject.

Figure 7. Per-subject per-window balanced accuracy for the end-to-end front-end change (250 Hz / 20–120 Hz vs 1 kHz / 20–450 Hz). Each point is one user; points above the diagonal improve with the wider, faster front-end (15/20 users; +2.6 points, Wilcoxon p = 0.005) at an unchanged model footprint. Table 3 isolates the analysis band (+1.9, p = 0.027) from the sampling rate (+0.7).

Figure 8. Per-execution balanced accuracy of the CNN evidence decoder (raw, gate-free) for each of the 12 amputees (sorted), against the able-bodied reference (dashed). The shaded band marks the pooled per-execution range across decoders (63.8–74.2%, Table 4); this pooled statistic is dominated by DB3 (n = 10) and is lifted by the two DB7 amputees (which decode markedly better but cannot be assessed as a separate cohort). Amputee decoding is markedly lower on average and highly variable between subjects, from near-perfect to near-chance, which is the dominant feature of amputee EMG decoding rather than the choice of decoder.

Figure 9. Cross-session re-donning on DB6 (n = 10, mean ± SD). (a) A day-1 decoder (orange) collapses the moment the electrodes are re-donned, from 93% per-execution to roughly 40–50% from day 2 onward (Table 5), with the lowest value at day 5; a short per-don recalibration (green) holds within-session accuracy (93–95%) at every separation. (b) Recalibration-budget curve at the worst (day-5) separation: accuracy recovered versus the number of recalibration repetitions the user provides at each don. One repetition recovers most of the gap; four reach $\sim 87\%$; eight restore the full within-session ceiling. The gap is real but fully fixable, and cheaply so.

Table 1. Cross-subject 40-fold LOSO balanced accuracy (%) on NinaPro DB2. The cross-subject benchmark is run on the standard 10-class movement set (rest plus nine wrist/hand movements) used by cross-subject baselines such as EMGBench, chosen for comparability with the LOSO literature; this differs deliberately from the 7-grip functional set of the within-user clinical evaluation (Section 3.1). All entries are mean ± SD over the 40 held-out subjects. Per-window is the raw window classifier; +sequence logic is the causal-HMM-decoded per-window accuracy; per-execution is the majority-voted accuracy (the EMGBench protocol). All calibration is leak-safe (per-subject calibration-pool statistics, never global). The 0% row means zero labelled calibration: an unsupervised per-channel normalisation is still fit on the held-out subject’s calibration pool (a brief don-time baseline), but no labelled target-subject data is used; it is not a strict zero-information condition. None of the columns use the deployed hysteresis gate (Section 3.5); they isolate the classifier and the zero-parameter grammar.

Calibration ratio	Per-window	+sequence logic	Per-execution
0%	38.6 ± 7.7	42.9 ± 8.5	50.6 ± 12.9
5%	53.0 ± 7.7	59.8 ± 7.8	74.9 ± 11.5
10%	59.3 ± 7.8	66.6 ± 6.9	84.3 ± 8.3
20%	64.2 ± 8.2	71.1 ± 7.1	89.1 ± 7.7
30%	66.4 ± 8.2	73.2 ± 7.6	91.3 ± 6.7
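The per-window and per-execution columns differ only in the unit that is scored. A minimal sketch of both, assuming balanced accuracy is the unweighted mean of per-class recalls and that an execution is one movement repetition whose label is the majority vote of its windows (function names and the toy data are illustrative, and ties in the vote are not handled specially):

```python
from collections import Counter, defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so rare classes weigh as much as common ones."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def per_execution(y_true, y_pred, exec_ids):
    """Collapse each execution to one (true label, majority-voted prediction) pair."""
    votes, truth = defaultdict(Counter), {}
    for t, p, e in zip(y_true, y_pred, exec_ids):
        votes[e][p] += 1
        truth[e] = t
    ids = sorted(votes)
    return ([truth[e] for e in ids],
            [votes[e].most_common(1)[0][0] for e in ids])


# Two executions of five windows each; a few windows are misclassified.
y_true = [0] * 5 + [1] * 5
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 0, 0]
exec_ids = [0] * 5 + [1] * 5

per_window_ba = balanced_accuracy(y_true, y_pred)        # (0.8 + 0.6) / 2
exec_true, exec_pred = per_execution(y_true, y_pred, exec_ids)
per_exec_ba = balanced_accuracy(exec_true, exec_pred)    # both votes correct
```

The toy case shows why the per-execution column sits well above the per-window one: scattered window errors are outvoted within an execution, which is also why reporting only the voted figure can hide instantaneous behaviour.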

Table 3. Acquisition bandwidth, isolated from sampling rate (within-user, 20 subjects, 5 calibration repetitions). Balanced accuracy (%), mean ± SD over subjects. All rows use the identical model, calibration, and gate-free decoder; only the front-end differs. The two 1 kHz rows differ only in the low-pass cutoff, isolating the analysis band from sampling resolution: the band adds +1.9 per-window points (+logic; Wilcoxon p = 0.027, r = 0.56), while the rate alone (rows 1→2) adds +0.7. Per-execution here is the gate-free voted unit (cf. the gated 94.0% of Figure 1). The 250 Hz / 20–120 Hz row reproduces the headline condition of Figure 1 on the same 20 subjects; the $\sim 0.4$-point offset (74.8 vs 75.2) is run-to-run variation from independent training.

Sampling	Band (Hz)	Per-window (raw)	Per-window (+logic)	Per-execution
250 Hz	20–120	71.9 ± 8.1	74.8 ± 7.9	93.3 ± 11.5
1 kHz	20–120	72.4 ± 8.0	75.5 ± 7.6	92.6 ± 11.5
1 kHz	20–450	74.4 ± 7.5	77.4 ± 7.3	95.3 ± 7.9
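The decomposition quoted in the caption can be checked directly from the per-window (+logic) column of the three rows. The symbols $\Delta_{\text{rate}}$, $\Delta_{\text{band}}$, and $\Delta_{\text{total}}$ are labels introduced here for the worked arithmetic and do not appear in the paper.

```latex
\begin{align*}
\Delta_{\text{rate}}  &= 75.5 - 74.8 = +0.7
  && \text{(250 Hz $\to$ 1 kHz, band fixed at 20--120 Hz)} \\
\Delta_{\text{band}}  &= 77.4 - 75.5 = +1.9
  && \text{(20--120 Hz $\to$ 20--450 Hz, rate fixed at 1 kHz)} \\
\Delta_{\text{total}} &= 77.4 - 74.8
   = \Delta_{\text{rate}} + \Delta_{\text{band}} = +2.6
  && \text{(end-to-end front-end change)}
\end{align*}
```

The total matches the +2.6 points reported for the end-to-end comparison, with roughly three quarters of the gain attributable to the wider analysis band rather than to the faster sampling rate.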

Table 4. Within-amputee benchmark (12 trans-radial amputees, NinaPro DB3+DB7, leak-safe, 5 calibration repetitions). Balanced accuracy (%), mean ± SD over subjects. Unlike the able-bodied case, the learned decoder shows no significant advantage over the classical TD+LDA baseline (paired Wilcoxon n.s.; underpowered at n = 12, point estimates weakly favouring the classical baseline).

Method	Per-window	Per-execution
Classical TD+LDA	54.6 ± 16.5	74.2 ± 24.6
CNN, raw evidence	55.0 ± 16.8	66.9 ± 26.5
CNN + HMM grammar (ours)	55.0 ± 18.4	63.8 ± 26.0
Able-bodied (ours), reference	77.4 ± 7.3	94.0 ± 11.4

Table 5. Cross-session collapse and recovery across five days (NinaPro DB6, n = 10, two-fold). Per-execution balanced accuracy (%), mean ± SD over subjects. A day-1 decoder collapses the moment the electrodes are re-donned (day 2 onward); a short per-don recalibration restores within-session accuracy at every separation.

Test day	Day-1 decoder (no adapt.)	+ per-don recalibration
D1 (within-session)	93.4 ± 4.3	94.0 ± 4.0
D2	47.9 ± 17.8	94.0 ± 3.8
D3	49.4 ± 10.9	94.4 ± 2.6
D4	47.7 ± 15.2	95.3 ± 3.8
D5	38.8 ± 17.8	93.1 ± 5.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
