Preprint
Article

This version is not peer-reviewed.

A Leak-Safe Within-User Benchmark for Compact Surface-EMG Grasp Decoding, with a Causal Sequence-Reasoning Decoder

Submitted: 18 June 2026

Posted: 23 June 2026

Abstract
Surface electromyography (sEMG) is the dominant control modality for myoelectric hand prostheses, yet a persistent gap remains between high offline classification accuracy and real-time, on-device control: models are often large, evaluated under leaky window-level splits, and reported with a single inflated metric that hides instantaneous behaviour. Our contribution is primarily methodological: a strictly leak-safe within-user protocol (calibrate on a user's earlier repetitions, test on a held-out later one, split by contiguous recording segment so no overlapping window crosses the split), paired with honest dual reporting of per-window (instantaneous) and per-execution balanced accuracy alongside the rest false-activation rate. We instantiate it with a compact (28k-parameter, 27.5 kB int8) decoder for six hand grips plus rest: a depthwise-separable CNN with a causal, population-learned, zero-parameter transition grammar that reasons over per-window evidence. On NinaPro DB2, five calibration repetitions give 94.0% per-execution balanced accuracy at 2.2% false-activation and 75.2% per-window; a wider 20–450 Hz acquisition band adds 2.6 per-window points (Wilcoxon p=0.005) at the same footprint. On 12 trans-radial amputees (NinaPro DB3, DB7) accuracy is 63.8–74.2% with high between-subject variance and no evidence that the deep decoder's able-bodied advantage over a classical baseline transfers. Most consequentially, on NinaPro DB6 we expose the cross-session gap the within-session protocol cannot see: per-execution accuracy collapses from 93.4% (day 1) to 38.8% on day-5 re-donning, but a brief per-don recalibration fully restores it, statistically level with the within-session ceiling at every separation, turning the collapse into an explicit deployment recipe. We position the work honestly against recent ultra-low-power and foundation-model decoders, reporting compute cost rather than claiming on-device measurements. 
Code and configurations: https://github.com/seanb9/emg-leaksafe-benchmark.

1. Introduction

Myoelectric prostheses decode a user’s movement intent from forearm sEMG to drive a robotic hand. Pattern-recognition control has matured substantially, and deep networks now report high within-subject accuracy on public benchmarks such as NinaPro [1,2,3]. Three issues, however, separate these numbers from a device a person can actually use.
(i) Deployability. A prosthetic hand carries a microcontroller-class processor; many high-accuracy models are too large to run on it, motivating a body of work on lightweight and ultra-low-power EMG decoders [4,5,6].
(ii) Evaluation integrity. Overlapping analysis windows split by window index leak information between train and test, inflating reported accuracy (we measure this inflation directly at +4.1 per-window points on DB2, §4.2). A leak-safe protocol must split by recording segment, and, for the deployment question, must extrapolate from calibration data to later, unseen data rather than interpolating within a session.
(iii) Metric honesty. A decoder emits a label every few tens of milliseconds, but a grasp-and-hold device makes one functional decision per hold. Reporting only the per-execution (voted) accuracy hides the instantaneous behaviour the user feels, which is dominated by the onset transient: the fraction of a second at the start of a movement during which the muscle is still recruiting and the window is genuinely ambiguous with rest.
We address all three with an honest evaluation and a compact system. Our contributions are:
  • A leak-safe within-user extrapolation protocol and a paired-metric reporting standard. Splitting by contiguous recording segment and testing on later repetitions removes the window-overlap leakage and within-session interpolation that inflate standard splits; reporting per-window and per-execution balanced accuracy with the rest false-activation rate exposes the onset-transient behaviour a single voted number hides.
  • An architectural choice for edge decoding: rather than scaling a parameter-heavy network to learn temporal structure, we offload it into a zero-parameter, population-learned transition grammar combined online with per-window evidence by a causal forward filter. It recovers boundary windows by inference, requires no extra calibration, and keeps the model at a microcontroller-class footprint (28k params, 27.5 kB int8).
  • A controlled, honest benchmark: within-user and cross-subject (LOSO) results on NinaPro DB2 with per-subject variance and significance tests, a classical-baseline and component ablation (which transparently shows masked pretraining contributes little at this data scale and the grammar’s gain over voting is small but significant), and a hardware-agnostic compute-cost breakdown, positioned against recent embedded and foundation-model EMG decoders. Code, configurations, and a synthetic positive-control are released at https://github.com/seanb9/emg-leaksafe-benchmark.
  • An acquisition and cross-population study. We isolate signal bandwidth as a significant, model-free accuracy lever (+2.6 per-window points from a wider analysis band, p = 0.005); we validate the decoder as a causal real-time stream; and we extend the leak-safe benchmark to 12 amputees, where we report an honest negative result: we find no evidence that the deep model’s able-bodied advantage over a classical baseline transfers.
We are explicit about scope: these are offline, single-session results on able-bodied public data, not a clinical validation. Because NinaPro DB2 presents isolated repetitions separated by rest, the learned grammar is hub-and-spoke (rest ↔ grasp) and does not model direct grasp-to-grasp transitions; we show in synthesis (Section 5) that the mechanism supports them once such data is available, but empirical validation on unconstrained, cross-day functional data remains future work. The results are intended as a rigorous, reproducible benchmark that motivates that next step on target hardware.

3. Methods

3.1. Data and Preprocessing

We use NinaPro DB2 [1] (40 intact-limb subjects, 12 Delsys electrodes; publicly available at http://ninapro.hevs.ch). Labels are stimulus-cued: the annotated movement onset leads true muscle-recruitment onset, so the first windows of a labelled segment carry little movement signal. This is the mechanical origin of the onset transient that the per-window metric exposes (§5). From Exercise 2 we select a functional set of six grips (medium-wrap (Power), tip pinch (Pinch), tripod, lateral, index-point, and fixed hook) plus rest, giving K = 7 classes; this functional grip set is used for all within-user (clinical) results. For the cross-subject benchmark (Section 4.5, Table 1) we instead use the standard 10-class movement set adopted by the cross-subject literature (e.g. EMGBench): rest plus nine basic finger and wrist movements (finger abduction, finger adduction, fist, mid-axis supination, mid-axis pronation, wrist flexion, wrist extension, radial deviation, and ulnar deviation). We use this distinct set for LOSO purely for comparability with cross-subject baselines; the two protocols are never mixed. Signals are resampled to 250 Hz, band-pass filtered (20–120 Hz, with a 50 Hz notch), and segmented into 256 ms windows at 64 ms stride. This window length and decision stride keep the controller delay within the <300 ms budget established for real-time myoelectric control [19]. Each window additionally carries per-channel root-mean-square and waveform-length envelopes as extra channels [20].
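The window geometry described above reduces to a little arithmetic, sketched here in Python. This is an illustration under stated assumptions, not the released pipeline: the helper names (`ms_to_samples`, `sliding_windows`, `rms`, `waveform_length`) are hypothetical, and the sketch computes one RMS and one waveform-length value per window for a single channel, whereas the text appends envelopes as extra channels without specifying how they are formed.

```python
import math

FS_HZ = 250                  # sampling rate after resampling (from the text)
WIN_MS, STRIDE_MS = 256, 64  # window length and decision stride (from the text)

def ms_to_samples(ms, fs_hz=FS_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * fs_hz / 1000.0))

def sliding_windows(signal, win, stride):
    """Yield (start_index, window) pairs for one channel given as a list."""
    for start in range(0, len(signal) - win + 1, stride):
        yield start, signal[start:start + win]

def rms(window):
    """Root-mean-square amplitude of one window."""
    return math.sqrt(sum(v * v for v in window) / len(window))

def waveform_length(window):
    """Sum of absolute sample-to-sample differences (Hudgins-style WL)."""
    return sum(abs(b - a) for a, b in zip(window, window[1:]))

WIN = ms_to_samples(WIN_MS)        # samples per window
STRIDE = ms_to_samples(STRIDE_MS)  # samples per stride
OVERLAP = 1.0 - STRIDE / WIN       # fraction of samples shared by neighbours
```

At 250 Hz the 256 ms window is 64 samples and the 64 ms stride is 16 samples, so neighbouring windows share 75% of their samples. That overlap is what the leak-safe split in Section 3.2 has to respect.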
Class frequencies and exclusions. Counts are post-preprocessing. The 7-class within-user set comprises 430,777 windows (Rest 316,751; Power 18,007; Pinch 21,772; Tripod 19,372; Lateral 22,595; Point 14,938; Hook 17,342), a rest:grasp window ratio of 2.8:1 (Rest 73.5%). The 10-class LOSO set comprises 388,703 windows (Rest 252,518; the nine movements 12,192–17,480 each), a ratio of 1.85:1 (Rest 65.0%). This imbalance is why we report balanced accuracy (§3.6). No subjects, channels, or repetitions were excluded; an RMS-based artifact guard (threshold 500 μV) is applied during preprocessing, and there are no missing windows after preprocessing.

3.2. Leak-Safe Windowing and Splits

Every window is assigned to exactly one contiguous recording segment (a maximal run of constant stimulus and repetition). Splitting by segment guarantees that no two windows sharing raw samples (e.g. 75%-overlap neighbours) fall on opposite sides of any split. For the within-user protocol, calibration windows come from the user’s repetitions 1–r and test windows from a disjoint, later repetition (extrapolation); per-channel z-score statistics are fit on calibration windows only. For cross-subject LOSO, the test subject is held out entirely and few-shot calibration uses a disjoint calibration pool. Crucially, even at the 0%-calibration operating point the held-out subject’s per-channel z-score is fit on that subject’s own calibration-pool windows (which are used for normalisation statistics only, never for supervision and never including the evaluation repetitions), rather than on global cross-subject statistics; this avoids injecting cross-subject amplitude variance that would artificially penalise the held-out subject. We therefore use “0% calibration” to mean zero labelled adaptation: no held-out-subject windows enter training, fine-tuning, or the contrastive alignment; only an unsupervised per-channel scale is estimated from a short baseline acquisition. This baseline is exactly what a real device collects once at don time (a few seconds of rest and free movement before use) and is operationally lightweight; it is a per-channel gain normalisation, not a learned classifier. The evaluation repetitions are strictly excluded from this pool, so no test-segment sample ever informs the statistics.
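A minimal Python sketch of the segment-wise split described above, for a single channel. It is not the released code: the function names are hypothetical, the toy labels are made up, and it assumes that windows straddling a segment boundary are simply dropped and that every sample (rest included) carries a repetition index, neither of which the text specifies.

```python
import math
from itertools import groupby

def segment_ids(stimulus, repetition):
    """Give every sample the index of its maximal constant (stimulus, repetition) run."""
    ids = []
    for seg, (_, run) in enumerate(groupby(zip(stimulus, repetition))):
        ids.extend(seg for _ in run)
    return ids

def windows_inside_segments(seg, win, stride):
    """Return (start, segment) for windows lying wholly inside one segment.

    Assumption: windows that straddle a segment boundary are dropped.
    """
    out = []
    for start in range(0, len(seg) - win + 1, stride):
        # Segment ids never decrease, so equal endpoints imply a single segment.
        if seg[start] == seg[start + win - 1]:
            out.append((start, seg[start]))
    return out

def split_by_repetition(windows, seg_to_rep, cal_reps, test_rep):
    """Calibration = windows from repetitions in cal_reps; test = the later repetition."""
    cal = [w for w in windows if seg_to_rep[w[1]] in cal_reps]
    test = [w for w in windows if seg_to_rep[w[1]] == test_rep]
    return cal, test

def fit_zscore(values):
    """Mean and standard deviation estimated from calibration samples only."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, (std if std > 0 else 1.0)

# Toy recording: rest and one grip alternating over two repetitions.
stimulus = [0] * 8 + [1] * 8 + [0] * 8 + [1] * 8
repetition = [1] * 16 + [2] * 16
signal = [float(i % 5) for i in range(32)]

seg = segment_ids(stimulus, repetition)
seg_to_rep = dict(zip(seg, repetition))
windows = windows_inside_segments(seg, win=4, stride=1)
cal, test = split_by_repetition(windows, seg_to_rep, cal_reps={1}, test_rep=2)

# Raw sample indices touched by each side of the split.
cal_samples = {i for s, _ in cal for i in range(s, s + 4)}
test_samples = {i for s, _ in test for i in range(s, s + 4)}
mean, std = fit_zscore([signal[i] for i in sorted(cal_samples)])
```

Because each window belongs to one segment and each segment to one repetition, `cal_samples` and `test_samples` cannot intersect, and the normalisation statistics never see a test sample.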
Selection provenance. To keep selection leak-safe, model and training hyper-parameters were chosen on a validation split held out from the pretraining subjects, never on the held-out test users or repetitions. The deployment operating point is a single fixed configuration (Section 3.5) applied identically to every subject and fold; it is not tuned per subject, per fold, or on the evaluation repetitions. We do not claim its constants are optimal; they are set to a pre-specified false-activation target, and the gate contributes only +1.3 per-execution points (Section 4.6), bounding any selection effect.

3.3. Model

The backbone is a depthwise-separable 1-D temporal CNN (widths 32 → 64 → 128, embedding dimension 128) producing a window embedding, with a linear classifier head. The inference model (backbone + classifier) has 28,199 parameters; a training-only self-supervised decoder and projection head are discarded at inference.

3.4. Training

Self-supervised masked pretraining. Following masked autoencoding [21], random time-spans of each window are hidden and a lightweight transposed-convolution decoder reconstructs the missing channels, minimising $\mathcal{L}_{\mathrm{ssl}} = \frac{1}{|M|} \sum_{(t,c) \in M} (\hat{x}_{t,c} - x_{t,c})^2$ over masked positions $M$. This learns the spatio-temporal structure of the signal before any labels are seen; we note up front that its measured contribution at our 20-subject pretraining scale is negligible (Section 4.6), and retain it only as a harmless component expected to help with larger unlabelled corpora.
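The masked reconstruction objective can be illustrated with a few lines of plain Python. This is a sketch of the loss only, with hypothetical helper names and toy values; it assumes a masked time-span hides every channel at the masked time steps, and it stands in for neither the encoder nor the transposed-convolution decoder.

```python
def time_span_mask(start, length, n_channels):
    """All (t, c) positions hidden when time steps [start, start + length) are masked."""
    return {(t, c) for t in range(start, start + length) for c in range(n_channels)}

def masked_mse(x, x_hat, mask):
    """Mean squared reconstruction error over the masked positions only."""
    if not mask:
        raise ValueError("mask must contain at least one (t, c) position")
    return sum((x_hat[t][c] - x[t][c]) ** 2 for t, c in mask) / len(mask)

# Toy window: 3 time steps, 2 channels. Row 0 is badly reconstructed but unmasked.
x = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
x_hat = [[9.0, 9.0], [2.0, 1.0], [2.0, 4.0]]
mask = time_span_mask(start=1, length=2, n_channels=2)
loss = masked_mse(x, x_hat, mask)
```

Only the masked positions contribute, so the large error at the visible first time step does not affect `loss`.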
Supervised contrastive + cross-entropy. The backbone is then trained with the supervised contrastive loss [22] to align same-grip embeddings across users, followed by class-weighted cross-entropy with label smoothing. For the cross-subject protocol the held-out subject is excluded from this stage entirely: no window from the test subject contributes to the contrastive alignment, the cross-entropy, or the masked pretraining; the population base is trained only on the other subjects, so the reported LOSO numbers contain no test-subject information beyond the unsupervised normalisation scale above.
Per-user calibration. For each test user the population base is fine-tuned on the user’s calibration repetitions with gentle intra-session augmentation. A dedicated rest/active detector and a movement-only grasp classifier are adapted separately, decoupling the false-activation decision from grasp discrimination.

3.5. Causal Sequence-Reasoning Decoder

Let $p_t \in \Delta^{K}$ be the classifier’s per-window posterior, $p_t(j) = P(j \mid x_t)$, at time $t$. A grasp is not an isolated pattern but a temporally-structured event; a person rests, forms a grip, holds it, and relaxes, and does not jump directly between grips. We encode this as a first-order Markov model with a transition matrix $A$ whose off-diagonal structure is learned from the training population’s label sequences (which transitions actually occur) and whose self-transition (“hold”) probability is a single interpretable knob. We call the grammar zero-parameter in the sense that it adds no gradient-trained weights to the network: the $K \times K$ transition matrix is estimated by counting population label transitions, not learned by backpropagation.
Emissions: a safety-aware design choice. The classical hybrid NN–HMM formulation converts a softmax posterior to an emission likelihood via the scaled-likelihood trick [15,16,23], $e_t(j) \propto P(j \mid x_t) / P(j)$, dividing out the empirical class priors $P(j)$. In a safety-critical prosthetic decoder, however, the dominant prior $P(\mathrm{rest})$ is not a nuisance to be removed: it encodes the operationally desirable default that the hand stays still. Dividing it out makes the decoder eager to leave rest. We therefore use the per-frame posterior directly as the emission, $e_t(j) = P(j \mid x_t)$ (equivalent to a uniform-prior, per-frame MAP estimate), and report the scaled-likelihood variant as an ablation: empirically it raises per-window balanced accuracy by about 2 points but inflates rest false-activation roughly ten-fold (from 4.5% to 52%), which is unacceptable for control (Section 4.6). Retaining the rest prior is thus a deliberate, empirically-justified choice. Decoding is the causal forward recursion (in log-space for stability):
$$\alpha_t(j) = \log e_t(j) + \log \sum_{i=1}^{K} \exp\!\big(\alpha_{t-1}(i) + \log A_{ij}\big), \tag{1}$$
with $\hat{y}_t = \arg\max_j \alpha_t(j)$. Because $\alpha_t$ depends only on windows $0{:}t$, decoding is causal and real-time, uses no test labels, and adds negligible cost. A grasp window with weak emission (an onset ramp) is held at the current grip because leaving and re-entering is implausible under $A$; a spurious cross-grip flicker is overruled because direct grasp-to-grasp transitions are near-zero.
Using the posterior as emission does not double-count the prior, because our transition matrix $A$ carries no class marginal: its off-diagonal rows are conditional destination distributions, independent of class frequency (Eq. 2), so $P(j)$ enters the filter exactly once. Formally this is a linear-chain structured decoder with fixed potentials (unary $\log P(j \mid x_t)$, pairwise $\log A_{ij}$); we do not claim a jointly-trained, calibrated CRF, as the transition prior is fixed ahead of time and population-learned rather than optimised end-to-end.
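The forward recursion above can be sketched in pure Python as follows. Several details are assumptions rather than the paper's stated choices: the function names, a uniform initial belief, an `eps` floor that stands in for exact zeros in $A$, and a per-step renormalisation of the log-belief (which guards against numerical drift and leaves the argmax unchanged). The 3-class matrix and posteriors are toy values chosen for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of finite floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_filter(posteriors, A, init=None, eps=1e-12):
    """Causal forward decoding: one label per window, using windows 0..t only.

    posteriors: list of length-K per-window posteriors, used directly as emissions.
    A: K x K row-stochastic transition matrix.
    init: initial belief over classes (assumed uniform if omitted).
    """
    K = len(A)
    log_A = [[math.log(max(a, eps)) for a in row] for row in A]
    alpha = [math.log(max(p, eps)) for p in (init or [1.0 / K] * K)]
    decoded = []
    for p_t in posteriors:
        new_alpha = []
        for j in range(K):
            predicted = logsumexp([alpha[i] + log_A[i][j] for i in range(K)])
            new_alpha.append(math.log(max(p_t[j], eps)) + predicted)
        norm = logsumexp(new_alpha)
        alpha = [a - norm for a in new_alpha]  # renormalise; argmax is unaffected
        decoded.append(max(range(K), key=lambda j: alpha[j]))
    return decoded

# Toy hub-and-spoke grammar over (rest, grip A, grip B) with hold probability 0.9.
A = [[0.90, 0.05, 0.05],
     [0.10, 0.90, 0.00],
     [0.10, 0.00, 0.90]]
posteriors = [[0.80, 0.10, 0.10],
              [0.05, 0.90, 0.05],
              [0.10, 0.80, 0.10],
              [0.30, 0.30, 0.40],   # ambiguous window: raw argmax flickers to grip B
              [0.10, 0.80, 0.10]]
raw = [max(range(3), key=lambda j: p[j]) for p in posteriors]
filtered = forward_filter(posteriors, A)
```

In this toy stream the raw argmax jumps to grip B on the ambiguous fourth window, while the filtered stream holds grip A because a direct A-to-B transition has near-zero probability under the grammar.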
Transition-matrix structure. We decouple stickiness from structure. With $O$ the row-normalised off-diagonal grammar (the conditional destination distribution given the state is left, learned from population label sequences) and $h$ a single hold probability,
$$A_{ii} = h, \qquad A_{ij} = (1 - h)\, O_{ij} \quad (i \neq j), \tag{2}$$
so A carries no class-frequency marginal. Because NinaPro DB2 records discrete, isolated repetitions in which the subject returns to rest between movements, the learned O is a hub-and-spoke grammar: probability flows rest ↔ grasp and direct grasp-to-grasp entries are near-zero. We state this explicitly; it is a property of the dataset, not the architecture. The same machinery admits arbitrary grasp-to-grasp transitions once O is estimated on unconstrained functional data, which we verify with a controlled synthetic experiment (Section 5).
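One way the counting construction of the transition matrix could look in Python is sketched below. The function names, the toy label sequences, the value h = 0.9, and the uniform fallback for a state that is never left in the training sequences are all assumptions made for illustration; the paper does not state its value of $h$ or how such rows are handled.

```python
def count_off_diagonal(label_sequences, K):
    """Count i -> j label changes (i != j) over population label sequences."""
    counts = [[0] * K for _ in range(K)]
    for seq in label_sequences:
        for a, b in zip(seq, seq[1:]):
            if a != b:
                counts[a][b] += 1
    return counts

def build_transition_matrix(counts, h):
    """A_ii = h and A_ij = (1 - h) * O_ij, with O the row-normalised counts."""
    K = len(counts)
    A = [[0.0] * K for _ in range(K)]
    for i in range(K):
        total = sum(counts[i])
        for j in range(K):
            if i == j:
                A[i][j] = h
            elif total > 0:
                A[i][j] = (1.0 - h) * counts[i][j] / total
            else:
                # Assumption: a state never left in training gets a uniform exit.
                A[i][j] = (1.0 - h) / (K - 1)
    return A

# Toy sequences in which every grip returns to rest (class 0) before the next one.
sequences = [[0, 0, 1, 1, 0, 2, 2, 0], [0, 1, 0, 0, 2, 0]]
counts = count_off_diagonal(sequences, K=3)
A = build_transition_matrix(counts, h=0.9)
```

Because the toy sequences always pass through rest, the resulting matrix is hub-and-spoke: the grip-to-grip entries are exactly zero, and each row still sums to one. Changing `h` rescales the exit probabilities without altering where the exits lead.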
Deployment hysteresis. To set the false-activation operating point we add an asymmetric hysteresis state machine downstream of the decoder, with an activation score $s_t$ (the rest/active detector’s $P(\mathrm{active})$) and state $z_t \in \{\mathrm{rest}\} \cup G$, where $G$ is the set of grasps: entry from rest to grasp $g$ requires $n_{\mathrm{on}}$ consecutive windows with $s_t \geq \theta_{\mathrm{on}}$ and $\hat{y}_t = g$; release to rest requires $n_{\mathrm{off}}$ consecutive windows with $s_t < \theta_{\mathrm{off}}$; a grip switch requires $n_{\mathrm{sw}}$ consecutive windows voting a different grasp. The band $\theta_{\mathrm{off}} < \theta_{\mathrm{on}}$ gives the hysteresis (hard to start, easy to hold); $n_{\mathrm{on}}$ is the debounce that suppresses false activations. These five constants are the clinical operating point. They are set by a coarse sweep to a pre-specified rest-false-activation target on a validation split held out from the pretraining subjects (we sweep $\theta_{\mathrm{on}} \in [0.5, 0.7]$, $\theta_{\mathrm{off}} \in [0.25, 0.35]$, $n_{\mathrm{on}} \in \{2, 3\}$, $n_{\mathrm{off}} \in \{4, 6\}$; deployed point $\theta_{\mathrm{on}} = 0.6$, $\theta_{\mathrm{off}} = 0.35$, $n_{\mathrm{on}} = 3$, $n_{\mathrm{off}} = 4$, $n_{\mathrm{sw}} = 3$); they are neither learned by backpropagation nor tuned on the evaluation repetitions, and the single deployed configuration is then applied identically to every subject and fold. The decoder-sweep tables printed in the released logs are diagnostic, showing the chosen point is not a fragile optimum; they are not the selection mechanism. Because the debounce defers commitment until $n_{\mathrm{on}}$ consecutive windows, it contributes up to 192 ms at the deployed $n_{\mathrm{on}} = 3$ to the onset latency; this is the debounce component specifically. The full empirical onset latency, dominated by analysis-window fill and the muscle-recruitment ramp, is larger and is reported in Section 4.4. Increasing $n_{\mathrm{on}}$ trades onset latency for fewer false activations.
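The three rules above (debounced entry, release, grip switch) can be sketched as a small state machine. The class name `HysteresisGate` and its internals are hypothetical; the defaults are the deployed constants quoted in the text. Two points the text leaves open are filled in as assumptions: a grip switch requires consecutive votes for the same different grasp, and the release check is evaluated before the switch check.

```python
REST = 0  # class index assumed for rest

class HysteresisGate:
    """Asymmetric hysteresis: hard to start, easy to hold."""

    def __init__(self, theta_on=0.6, theta_off=0.35, n_on=3, n_off=4, n_sw=3):
        self.theta_on, self.theta_off = theta_on, theta_off
        self.n_on, self.n_off, self.n_sw = n_on, n_off, n_sw
        self.state = REST
        self._candidate, self._run, self._off_run = None, 0, 0

    def _vote(self, grasp):
        """Count consecutive windows voting for the same candidate grasp."""
        if grasp == self._candidate:
            self._run += 1
        else:
            self._candidate, self._run = grasp, 1
        return self._run

    def _commit(self, new_state):
        """Change the held state and clear all run counters."""
        self.state = new_state
        self._candidate, self._run, self._off_run = None, 0, 0

    def step(self, s_t, y_hat):
        """Consume one window (activation score, decoded label); return the held state."""
        if self.state == REST:
            if s_t >= self.theta_on and y_hat != REST:
                if self._vote(y_hat) >= self.n_on:
                    self._commit(y_hat)
            else:
                self._candidate, self._run = None, 0  # debounce run broken
            return self.state
        # Holding a grasp: check release first, then a possible grip switch.
        self._off_run = self._off_run + 1 if s_t < self.theta_off else 0
        if self._off_run >= self.n_off:
            self._commit(REST)
        elif y_hat not in (REST, self.state):
            if self._vote(y_hat) >= self.n_sw:
                self._commit(y_hat)
        else:
            self._candidate, self._run = None, 0
        return self.state

gate = HysteresisGate()
stream = [(0.7, 1)] * 3 + [(0.5, 1)] + [(0.2, 0)] * 4
held = [gate.step(s, y) for s, y in stream]
```

In this toy stream the gate commits to grasp 1 on the third consecutive confident window (the 3 × 64 ms = 192 ms debounce), keeps holding when the score dips into the band between the two thresholds, and releases only after four consecutive low-score windows.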

3.6. Evaluation Metrics

We report balanced accuracy: the unweighted mean of per-class recalls, which weights each of the grasp classes and rest equally and is therefore invariant to the rest:grasp window-count ratio (a model that simply predicts the majority rest class scores at chance, $1/C$ for $C$ classes, not high). Per-window accuracy grades every 64 ms decision independently (instantaneous behaviour). Per-execution accuracy collapses each contiguous movement to a single majority decision (the per-repetition unit, as in NinaPro/EMGBench reporting). Rest false-activation is the fraction of true-rest windows that emit a movement: the clinically critical safety metric.
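The three metrics defined above can be written down compactly. The sketch below uses hypothetical function names and toy labels, and makes two assumptions the text does not settle: only movement runs (not rest runs) are collapsed into executions, and ties in the majority vote go to the label seen first.

```python
from collections import Counter
from itertools import groupby

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recalls over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def per_execution(y_true, y_pred, rest=0):
    """Collapse each contiguous movement run to (true label, majority prediction)."""
    exec_true, exec_pred = [], []
    i = 0
    for label, run in groupby(y_true):
        n = len(list(run))
        if label != rest:
            votes = Counter(y_pred[i:i + n])
            exec_true.append(label)
            exec_pred.append(votes.most_common(1)[0][0])
        i += n
    return exec_true, exec_pred

def rest_false_activation(y_true, y_pred, rest=0):
    """Fraction of true-rest windows on which a movement was emitted."""
    idx = [i for i, y in enumerate(y_true) if y == rest]
    return sum(y_pred[i] != rest for i in idx) / len(idx)

# Toy stream: rest, grip 1, rest, grip 2, with onset windows still called rest.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 2, 2, 2]

per_window_ba = balanced_accuracy(y_true, y_pred)
per_exec_ba = balanced_accuracy(*per_execution(y_true, y_pred))
false_act = rest_false_activation(y_true, y_pred)
always_rest_ba = balanced_accuracy(y_true, [0] * len(y_true))
```

The toy stream reproduces the pattern the paper reports in miniature: onset windows misread as rest pull the per-window score down while the per-execution vote is unaffected, and a predictor that always outputs rest scores exactly at chance for three classes.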

4. Experiments and Results

4.1. Setup

The population base is trained on 20 subjects disjoint from the 20 test users. The partition is fixed a priori by subject index (subjects 1–20 are the test users, 21–40 the base), not selected on results or resampled to favour an outcome; it is fixed in the released configuration. For each test user we sweep $r \in \{1, \dots, 5\}$ calibration repetitions and test on the held-out later repetition. Training uses SGD/Adam with early stopping; the deployed configuration uses a single fixed seed (1337). To show the conclusions do not depend on that seed, we repeat the headline within-user pipeline and both small-effect comparisons (the sequence-grammar gain and the acquisition-band gain) across three seeds (1337, 7, 2024; reported in §4.2, §4.6 and §4.8). Full configurations are provided in the released repository.

4.2. Within-User Results

Figure 1 shows the calibration curve. With five calibration repetitions the system reaches 94.0% per-execution balanced accuracy (deployed gated decoder; 95% CI [88.6, 99.3] over 20 subjects), 75.2% per-window accuracy (gate-free stream; CI [71.5, 78.8]), and 2.2% rest false-activation (CI [1.0, 3.5]; Figure 1). The result does not depend on the random seed: repeating the full within-user pipeline at two further seeds (7, 2024) gives per-execution 94.0 ± 0.7% and per-window 74.6 ± 0.2% across the three seeds (the per-window value reproduces the §4.8 baseline condition, within the run-to-run variation of the 75.2% headline run). These within-user numbers are a within-session bound: calibration and test share one recording, so they do not capture cross-day electrode shift and should be read as optimistic relative to re-donning deployment, which we measure separately on DB6 (§4.10). The headline per-execution figure is reported on a single held-out repetition (rep 6); to check it is not an artefact of that repetition we additionally run leave-one-repetition-out cross-validation, holding out each of the six repetitions in turn (this relaxes the strict earlier-to-later extrapolation to interpolation and is reported only as a robustness check). The six-fold means match the headline: 94.8 ± 6.6% per-execution and 75.5 ± 6.8% per-window (+logic), so the single-repetition number is representative rather than a favourable draw.
Quantifying the leakage. To put a number on the inflation our protocol removes, we re-ran the identical pipeline under a leaky window-level split: windows assigned to calibration and test at random (the common practice that lets 75%-overlap neighbours straddle the split), at the same calibration fraction. On the same 20 subjects this inflates per-window balanced accuracy by +4.1 ± 4.9 points (71.7% → 75.8%; Wilcoxon p = 0.001, inflated for 16/20 subjects). The inflation is larger than the entire sequence-grammar gain (+2.6 points), so a window-shuffled split would both over-report instantaneous accuracy and swamp the architectural effect the paper studies; the leak-safe split is what keeps the headline honest. Per-execution accuracy crosses 90% at three calibration repetitions. Point and hook reach 100% per-execution recall and power/pinch/lateral 90–95%; tripod is the weakest at 80% (Figure 5), consistent with the known difficulty of separating multi-finger grasps in forearm EMG (Figure 4).
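Why a window-shuffled split leaks can be seen by counting shared raw samples directly. The sketch below is a toy illustration, not the experiment reported above: the helper names, the 3200-sample recording, the two equal-length segments, and the fixed shuffle seed are all hypothetical, and it measures sample overlap rather than accuracy inflation.

```python
import random

WIN, STRIDE = 64, 16  # samples at 250 Hz (256 ms window, 64 ms stride)

def shares_samples(start_a, start_b, win=WIN):
    """Two equal-length windows overlap iff their starts are closer than win."""
    return abs(start_a - start_b) < win

def leaked_fraction(test_starts, cal_starts, win=WIN):
    """Fraction of test windows sharing raw samples with any calibration window."""
    leaked = sum(
        any(shares_samples(t, c, win) for c in cal_starts) for t in test_starts
    )
    return leaked / len(test_starts)

# Leaky split: shuffle every window of one recording, half to each side.
starts = list(range(0, 3200 - WIN + 1, STRIDE))
rng = random.Random(0)
shuffled = starts[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
leaky = leaked_fraction(shuffled[half:], shuffled[:half])

# Segment split: windows confined to two contiguous segments, one per side.
seg_a = list(range(0, 1600 - WIN + 1, STRIDE))
seg_b = list(range(1600, 3200 - WIN + 1, STRIDE))
safe = leaked_fraction(seg_b, seg_a)
```

With 75% overlap each window shares samples with up to three neighbours on either side, so under a random split almost every test window has at least one calibration neighbour, whereas the segment split shares none.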
Figure 1. Within-user calibration curve (20 users, held-out later repetition; balanced accuracy, mean ± SD, shaded 95% CI). Per-execution accuracy crosses 90% at three calibration repetitions; the gap to the per-window curves is the onset transient. At five repetitions the gate-free decoder stream reaches 72.6 ± 7.8% per-window (raw) and 75.2 ± 7.8% (+HMM logic), while the deployed gated decoder reaches 94.0 ± 11.4% per-execution (Section 3.5) at 2.2% rest false-activation; per-repetition false-activation stays within the 1.2–2.2% band from one to five reps. The per-window numbers isolate the classifier and the zero-parameter grammar; the per-execution figure additionally includes the operating-point gate (cf. Table 2).
Table 2. Baselines and ablations (within-user, 5 cal. reps; per-window and per-execution balanced accuracy %, mean ± SD over the 20 subjects; rest false-activation %). No row in this table uses the deployment hysteresis gate: every row is the classifier/grammar stream only, so the comparison isolates the zero-parameter sequence grammar from the hand-tuned operating-point logic. The deployed gate (Section 3.5) sets the false-activation operating point and adds only +1.3 per-execution points over the gate-free HMM stream (92.7% → 94.0%, Figure 1), confirming the sequence accuracy is delivered by the grammar, not by the gate. All rows are mean ± SD over the 20 subjects, including the scaled-likelihood and no-pretraining ablations. Significance (Wilcoxon signed-rank across subjects, with rank-biserial effect size r): ours vs. matched w = 5 vote, per-window p = 0.001, r = 0.79; ours vs. classical TD+LDA (Holm-corrected), per-window p < 0.001, r = 1.00 (significant), per-execution p = 0.78, r = 0.08 (n.s.).
| Configuration | Per-window | Per-exec | False-act |
|---|---|---|---|
| Classical: TD + LDA | 66.9 ± 8.1 | 94.7 ± 7.8 | – |
| Causal TCN, raw (57k params) | 76.3 ± 7.8 | 95.2 ± 2.1 | – |
| CNN, raw evidence (no logic) | 72.6 ± 7.8 | 94.0 ± 11.4 | – |
| CNN + majority vote (w = 3) | 73.1 ± 7.7 | 94.0 ± 11.4 | – |
| CNN + majority vote (w = 5) | 73.7 ± 7.6 | 94.7 ± 9.3 | – |
| CNN + HMM, gate-free (ours) | 75.2 ± 7.8 | 92.7 ± 11.5 | 4.5 ± 2.0 |
| CNN + HMM, scaled-likelihood | 77.4 ± 7.1 | – | 52.3 ± 15.4 |
| ours, no masked pretraining | 75.0 ± 7.6 | 93.4 ± 11.5 | 4.6 ± 2.1 |

4.3. Effect of Sequence Reasoning

The learned transition grammar improves per-window balanced accuracy by +2.6 points over raw argmax evidence (72.6% → 75.2%, Figure 2) and by +1.5 points over the matched-window (w = 5) majority vote (73.7% → 75.2%, Table 2); all figures are per-subject means. The gain over raw argmax is robust to the random seed and consistent at the subject level: pooled over three seeds (1337, 7, 2024) it is +2.8 points (95% CI [2.3, 3.3]; Cohen’s d = 1.43, a large paired effect; rank-biserial r = 0.96), positive in 53/60 (subject, seed) cases (17–18 of 20 per seed), with a median per-subject gain of 2.6 points (IQR [1.6, 3.8]). It is thus small in magnitude but highly reliable, not a seed- or subject-specific fluke. The margin over voting is small in magnitude but statistically significant and consistent across the 20 subjects (Wilcoxon signed-rank W = 22, p = 0.001, rank-biserial r = 0.79, a large effect: the grammar wins for the large majority of subjects), and comes with no additional calibration and zero added parameters. The accuracy gain over naive smoothing is therefore modest; the decoder’s advantage is principled and causal. A w-window majority vote imposes a symmetric tracking lag of $\lfloor w/2 \rfloor$ windows (e.g. 128 ms at w = 5), applied equally to onset and offset, whereas the causal forward filter updates at every 64 ms stride from accumulated evidence with no added smoothing lag, a meaningful responsiveness advantage relative to the <300 ms control-delay budget [19] (this concerns the added smoothing lag; the end-to-end onset latency is reported and discussed in Section 4.4). On the saturating per-execution metric the grammar offers no advantage over voting, underscoring that the per-window metric is where the distinction lies.
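Two derived figures in this paragraph can be checked with short arithmetic. The sketch below assumes the rank-biserial effect size is the matched-pairs form computed from the smaller signed-rank sum W with all n = 20 differences non-zero, and that the vote lag is half the window count rounded down; both are assumptions that happen to reproduce the quoted values, not a statement of how the authors computed them. The function names are hypothetical.

```python
def rank_biserial_from_w(w_min, n):
    """Matched-pairs rank-biserial r from the smaller signed-rank sum.

    Assumes n non-zero paired differences, so the ranks sum to n(n+1)/2.
    """
    total = n * (n + 1) / 2
    return 1.0 - 2.0 * w_min / total

def vote_lag_ms(w, stride_ms=64):
    """Tracking lag of a w-window majority vote, taken as floor(w/2) strides."""
    return (w // 2) * stride_ms

r_vs_vote = rank_biserial_from_w(22, 20)   # grammar vs. matched w = 5 vote
r_all_wins = rank_biserial_from_w(0, 20)   # every subject favours one method
lag_w5 = vote_lag_ms(5)
```

Under these assumptions W = 22 over 20 subjects gives r of about 0.79, W = 0 gives r = 1.00 (the "wins for every subject" case in Table 2), and a five-window vote lags by 128 ms.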

4.4. Operating Point

Figure 3 shows the trade-off between grasp accuracy on acted windows and rest false-activation as the activation gate is swept. We recommend the gate at 0.9, which meets the 2% clinical false-activation bar (2.2%) while retaining 86.7% grasp accuracy on acted windows at 66% movement coverage; loosening the gate trades false-activation for coverage monotonically, so the single threshold is the clinician-facing tuning knob.
Empirical onset latency. We measure onset latency directly, as the median number of leading windows the decoder still calls rest from the labelled movement start until it commits to a grasp, across the 20 subjects. The gate-free sequence stream incurs 8.8 ± 5.4 windows (560 ± 350 ms at the 64 ms stride) and the deployed gated decoder 9.7 ± 5.1 windows (620 ± 330 ms), with large between-subject variance. We report this plainly: this end-to-end latency exceeds the per-decision <300 ms processing budget. These are distinct quantities: the budget bounds the 64 ms decision cadence, whereas onset latency is dominated by analysis-window fill and the muscle-recruitment ramp plus the deliberate $n_{\mathrm{on}}$ debounce, and is tunable (loosening the gate lowers latency at the cost of false activations, Figure 3). It is a primary limitation for fast closed-loop control; we decompose it into design-controlled and physiological parts and report attempts to reduce it in §4.7, where the residual proves to be an acquisition floor rather than a decoder limit. We note, however, that onset latency trades against stability: debounce and decision-based velocity ramps deliberately accept added delay to suppress the transient misclassifications that most erode myoelectric usability [14,19], so a stable commitment that is slightly slower can be preferable in closed-loop use to a faster but flickering one. Quantifying this latency-versus-stability trade in real-time operation is part of the follow-up.
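The onset-latency measurement described above can be sketched as follows. The function names and toy labels are hypothetical, and two details are assumptions: committing to any non-rest label counts as commitment (the text does not say whether the grasp must be correct), and a movement the decoder never leaves rest for is charged its full length.

```python
from itertools import groupby
from statistics import median

def onset_latencies(y_true, y_pred, rest=0):
    """Leading windows still called rest at the start of each labelled movement."""
    latencies = []
    i = 0
    for label, run in groupby(y_true):
        n = len(list(run))
        if label != rest:
            lead = 0
            for p in y_pred[i:i + n]:
                if p != rest:
                    break
                lead += 1
            latencies.append(lead)
        i += n
    return latencies

def windows_to_ms(n_windows, stride_ms=64):
    """Convert a window count to milliseconds at the decision stride."""
    return n_windows * stride_ms

# Toy stream with two movements; labels are made up for illustration.
y_true = [0, 0, 1, 1, 1, 1, 0, 2, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0, 2, 2]
lat = onset_latencies(y_true, y_pred)
median_ms = windows_to_ms(median(lat))
debounce_ms = windows_to_ms(3)  # deployed n_on = 3
```

The same conversion links the reported figures: 8.8 windows at the 64 ms stride is roughly 560 ms, and the three-window debounce accounts for 192 ms of that.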
Figure 3. Operating-point trade-off as the gate threshold (labelled) is swept. Grasp accuracy on acted windows (left axis) rises as the gate tightens, but movement coverage (right axis) falls: the recommended gate 0.9 clears the 2% false-activation bar at the cost of coverage, which drops to 66%. Both recommended points are ringed; the gain is not free.
Figure 4. Per-window confusion (row-normalised %) for the raw CNN evidence, before the sequence logic (its mean diagonal is the raw per-window balanced accuracy, 72%; the grammar lifts this to the 75.2% headline). Off-diagonal mass concentrates in the rest column: the onset/offset transient, not grip–grip confusion.
Figure 5. Per-grip per-execution recall at five calibration repetitions.

4.5. Cross-Subject (LOSO) Results

The cross-subject benchmark uses the standard 10-class movement set (rest plus nine wrist/hand movements) for comparability with the LOSO literature (e.g. EMGBench), rather than the 7-grip within-user functional set (Section 3.1); Table 1 reports it. Under 40-fold leave-one-subject-out evaluation with the same pipeline, balanced accuracy with only unsupervised target-subject normalisation (no labelled calibration) is 38.6% per-window, rising to 64.2% with 20% labelled calibration; the sequence-reasoning layer lifts these to 42.9% and 71.1% respectively, and per-execution voting reaches 50.6% (95% CI [46.5, 54.7] over 40 folds) and 89.1% (CI [86.7, 91.6]), up to 91.3% at 30% (Table 1). Macro-F1 tracks balanced accuracy closely (per-window 0.37 at 0% and 0.62 at 20% calibration), confirming the gains are not an artefact of the balanced-averaging choice. The sequence layer contributes +4.2 to +7.2 per-window points with zero additional calibration, a larger margin than the within-user +1.5, consistent with sequence priors mattering more when the per-window classifier is weaker (the cross-subject regime). This LOSO gain is significant at every calibration level (paired Wilcoxon signed-rank across the 40 folds, $p < 10^{-10}$, rank-biserial r ≈ 1.0). All calibration uses leak-safe per-subject statistics; no held-out-subject data informs training, and only an unsupervised per-channel scale is fit on the held-out subject.
Comparison to EMGBench. Because we adopt EMGBench’s protocol, the comparison is matched on dataset, class set (10-class DB2), subject count, and the 40-fold leave-one-subject-out plus few-shot-adaptation structure. Across its evaluated architectures (e.g. ResNet18, EfficientNet), EMGBench reports DB2 cross-subject accuracy of 19.5–19.9% with no adaptation (its Table 2) and 52.3% when fine-tuning on the first 20% of the held-out subject’s data (its Table 3, FT-20%) [7]. The controlled, like-for-like comparison is per-window: our per-window balanced accuracy at the same operating points is 38.6% and 64.2%, at a 28k-parameter footprint. (Per-execution voting, an additional unit EMGBench does not report, reaches 50.6% and 89.1%.) One nuance keeps this honest: EMGBench reports overall test accuracy whereas we report balanced accuracy (mean per-class recall). On this rest-heavy set balanced accuracy is the stricter number (it cannot be inflated by the dominant rest class), so our balanced figure sits below the overall accuracy we obtain under EMGBench’s exact metric. For completeness we report both: our overall per-window accuracy (EMGBench’s metric) is 69.1% at 0% and 77.4% at 20% calibration, versus EMGBench’s 19.5–19.9% and 52.3%; our stricter balanced accuracy (38.6%, 64.2%) still clears their overall figure, so the comparison is conservative on either metric. We read the gap as indicative rather than a fully controlled head-to-head, but on the matched protocol our compact model is at least competitive with, and on these numbers exceeds, the benchmark’s reported results.
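As a minimal illustration of why the two metrics diverge on a rest-heavy set, the sketch below computes overall accuracy and balanced accuracy (mean per-class recall) for a hypothetical degenerate decoder that always outputs rest. The labels, class proportions, and function names are illustrative assumptions, not the released evaluation code.

```python
from collections import Counter, defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of windows whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    totals = Counter(y_true)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical rest-heavy stream: 80 rest windows, 10 windows for each of two grips.
y_true = ["rest"] * 80 + ["pinch"] * 10 + ["tripod"] * 10
# A degenerate decoder that never leaves rest.
y_pred = ["rest"] * 100

overall = overall_accuracy(y_true, y_pred)    # rewarded by the dominant class
balanced = balanced_accuracy(y_true, y_pred)  # recall is 1 for rest, 0 for both grips
print(f"overall={overall:.3f}  balanced={balanced:.3f}")
```

In this toy case the overall figure is carried entirely by the rest class, whereas the balanced figure exposes that neither grip is ever decoded.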

4.6. Baselines and Ablations

We compare against two baselines and three ablations (Table 2). To separate the contribution of the principled sequence grammar from that of the hand-tuned deployment logic, every row reports the classifier/grammar stream with the hysteresis state machine removed; the gate enters only at the deployed operating point (Section 3.5), where it raises per-execution accuracy by just +1.3 points (92.7 → 94.0%) while cutting rest false-activation to 2.2%. The accuracy is thus delivered by the zero-parameter grammar; the gate is a safety/operating-point layer, not an accuracy crutch. These two contributions act on distinct axes and should not be conflated: the grammar’s gain is the per-window improvement (+2.6 points over raw argmax evidence, Figure 2), whereas the gate’s +1.3 is a per-execution operating-point adjustment made chiefly to lower the false-activation rate. Classical baseline: an LDA on the Hudgins time-domain feature set [20] (MAV, WL, ZC, SSC). We test it against our decoder with a Wilcoxon signed-rank test across the 20 subjects (Holm-corrected over the two metrics). On the saturating per-execution metric we do not detect a difference (ours 92.7 ± 11.5 vs. LDA 94.7 ± 7.8; p = 0.78, rank-biserial r = 0.08). We are careful here: with a single held-out execution per class per subject, this per-execution comparison is coarse and underpowered, so we claim only the absence of a detectable difference, not statistical equivalence. The discriminating per-window metric, by contrast, has a large per-subject sample and shows a clear and significant advantage for the learned representation (ours 75.2 ± 7.8 vs. LDA 66.9 ± 8.1, +8.3 points; p < 0.001, r = 1.00: our decoder wins for every subject). This is the empirical core of the paper’s metric-honesty thesis: the deep model and the classical baseline are separable only under per-window evaluation, so reporting the per-execution number alone would erase a real and significant difference. 
We state the corollary plainly: for steady-state grip maintenance, the regime the per-execution metric rewards, classical TD+LDA features remain entirely adequate, and the learned representation earns its footprint specifically in the instantaneous, onset-transient regime that the per-window metric exposes. Learned temporal baseline (TCN): to test whether a stronger temporal backbone is needed, we train a standard dilated causal temporal convolutional network [24] under the identical leak-safe protocol. Its raw per-window balanced accuracy (76.3 ± 7.8%) is competitive with our compact CNN plus the zero-parameter grammar (75.2%), and per-execution saturates as expected (95.2%). Crucially it reaches this with 2× the parameters (57k versus 28k): baking temporal context into the network and offloading it to a parameter-free transition grammar land in the same per-window regime, but the latter does so at half the deployed footprint, and the grammar is backbone-agnostic (it composes with the TCN exactly as with the CNN). The same pattern holds under the wider acquisition band (§4.8): the wideband TCN reaches 78.8% per-window versus our wideband CNN-plus-grammar at 77.4% (again within 1.4 points at twice the parameter count), so the conclusion is not band-specific. This supports, rather than undercuts, the compactness thesis: the contribution is reaching this accuracy band on a microcontroller-class model, not a claim that the CNN backbone is uniquely optimal. Temporal-smoothing ablation: a sliding majority vote at matched window lengths (w = 3, 5, spanning the decoder’s effective memory) isolates the learned grammar from naive smoothing; the grammar’s per-window margin over the matched vote is small but significant (Wilcoxon signed-rank, p = 0.001, r = 0.79, across the 20 subjects). 
Emission ablation: the scaled-likelihood prior correction raises per-window balanced accuracy slightly (75.2 → 77.4%) but inflates rest false-activation from 4.5 ± 2.0% to 52.3 ± 15.4%, empirically justifying retention of the rest prior in the clinical decoder. Self-supervision ablation: removing masked pretraining changes per-window accuracy by < 0.2 points and per-execution by < 1 point, a negligible effect at this dataset scale (20 pretraining subjects). This is consistent with self-supervision’s reliance on large unlabelled corpora; we report it as a component that is harmless and expected to help with larger pretraining data, not as a driver of the present results.
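For reference, the four Hudgins time-domain features used by the classical baseline (MAV, WL, ZC, SSC) can be sketched for a single channel of a single window as follows. The sketch uses the common textbook definitions with a dead-band threshold `eps` on the zero-crossing and slope-sign-change counts; the threshold value and the function name are assumptions and may differ from the released baseline.

```python
def hudgins_features(x, eps=0.0):
    """Return (MAV, WL, ZC, SSC) for one channel of one window.

    x   : sequence of samples (floats)
    eps : dead-band threshold that suppresses noise-induced counts (assumed value)
    """
    n = len(x)
    # Mean absolute value.
    mav = sum(abs(v) for v in x) / n
    # Waveform length: cumulative absolute first difference.
    wl = sum(abs(x[i + 1] - x[i]) for i in range(n - 1))
    # Zero crossings: sign change between neighbours, above the dead band.
    zc = sum(
        1
        for i in range(n - 1)
        if x[i] * x[i + 1] < 0 and abs(x[i] - x[i + 1]) > eps
    )
    # Slope sign changes: local extrema whose slope on either side exceeds the dead band.
    ssc = sum(
        1
        for i in range(1, n - 1)
        if (x[i] - x[i - 1]) * (x[i] - x[i + 1]) > 0
        and (abs(x[i] - x[i - 1]) > eps or abs(x[i] - x[i + 1]) > eps)
    )
    return mav, wl, zc, ssc


# Toy alternating signal: every neighbouring pair crosses zero and every
# interior sample is a local extremum.
toy = [1.0, -1.0, 1.0, -1.0, 1.0]
features = hudgins_features(toy)
print(features)
```

Concatenating these four values over all channels gives the per-window feature vector that an LDA classifier would consume.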

4.7. Online Behaviour and the Onset-Latency Floor

Streaming validation. To confirm that the reported accuracy is genuinely causal and real-time and not an artefact of batched offline scoring, we replay a held-out repetition through the deployed pipeline one 256 ms window at a time in temporal order, updating the forward-filter belief and the gate online. Across all 20 test users the streamed decode matches the batched offline decode exactly (100% of windows for every subject). Because the online forward filter is mathematically equivalent to the batched computation, this is an implementation-correctness check that the reported accuracy carries no future information, not a separate accuracy result. The pipeline sustains a median per-decision latency of 3.7 ± 0.1 ms (mean ± SD across users; p95 4.8 ± 0.3 ms) on a desktop processor (Apple M4 Pro), i.e. about 6% of the 64 ms cadence (Figure 6). This is a desktop compute-headroom figure, distinct from the microcontroller workload (the 11.8 MMAC/s of Table 6); we report compute headroom, not measured on-device energy, which remains future work.
Onset latency has both a decoder component and an acquisition floor. The gap between per-window and per-execution accuracy is the onset transient, and the clinically relevant quantity is how quickly the controller commits to a grasp. The measured 600–750 ms onset latency decomposes, approximately, into a deterministic algorithmic part and a residual: the 256 ms window must first fill with movement, and the deployed hysteresis debounce adds on_frames × stride = 192 ms (three frames at the 64 ms stride), so roughly 450 ms is design-controlled latency before any physiology, with the remainder the muscle-recruitment ramp and a cue-based-labelling artefact (NinaPro defines onset relative to a stimulus prompt, so the measured value also conflates stimulus-reaction time). We attacked the design-controlled part: a one-sided CUSUM sequential change-detector on the activation log-likelihood-ratio (minimum-latency for a given false-alarm rate [25]) lowers onset latency by 60–160 ms at a matched false-activation rate, and we also tried a shorter 128 ms window at a 32 ms stride, an activation-velocity feature, and fixed-lag smoothing. None of these brought median latency at a clinically acceptable false-activation rate (2–5%) below 600 ms. The remaining floor is not removable by the decoder on this data: shortening the window trades away identity accuracy, and the cue-based-labelling artefact cannot be resolved on cue-prompted offline data at all. Quantifying and reducing onset latency therefore requires acquisition with movement-synchronised ground truth; we report this as a limitation with an identified mechanism rather than as a solved problem.
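The one-sided CUSUM detector mentioned above can be sketched in a few lines. This is the textbook recursion applied to a hypothetical stream of per-window activation log-likelihood ratios; the threshold, the optional drift term, and the example values are assumptions chosen for illustration rather than the tuned deployment settings.

```python
def cusum_onset(llr, threshold, drift=0.0):
    """One-sided CUSUM: return the index of the first alarm, or None.

    llr       : per-window log-likelihood ratios log p(x | active) / p(x | rest)
    threshold : alarm level; higher values lower the false-alarm rate
    drift     : optional allowance subtracted at each step (assumed 0 here)
    """
    g = 0.0
    for t, s in enumerate(llr):
        # Accumulate evidence for "active", clamped at zero so that long
        # stretches of rest cannot build up a negative reserve.
        g = max(0.0, g + s - drift)
        if g >= threshold:
            return t
    return None


# Hypothetical stream: three rest windows followed by a contraction.
llr_stream = [-0.5, -0.4, -0.6, 0.8, 1.1, 1.3, 1.2]
onset_index = cusum_onset(llr_stream, threshold=2.0)

STRIDE_MS = 64  # decision stride used throughout the paper
print(onset_index, onset_index * STRIDE_MS)
```

Because the statistic resets at zero during rest, the alarm time depends only on the evidence accumulated since the contraction began, which is the property that makes the detector attractive for onset latency.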
Pseudo-online control metrics. Beyond classification accuracy, we report the controller-level metrics a rehabilitation reviewer expects, computed per held-out grasp attempt from the same causal decode. An attempt counts as completed only if the decoder reaches and holds the target grip for the majority of the attempt’s second half: a settle-and-hold criterion, distinct from a whole-segment majority vote (per-execution accuracy), in that it requires the controller to converge on the target and stay there rather than merely be correct on average. Across the 20 within-user test subjects the controller completes 90.0 ± 16.2% of grasp attempts under this criterion, with a median selection time (onset to first correct output) of 640 ± 362 ms and a rest stability of 93.7 ± 2.8% (the complement of the false-activation rate). Completion rate and rest stability are deployment-ready; the selection time reflects the onset-latency floor above and is the main responsiveness limitation, consistent with the per-window/per-execution gap.
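The difference between the settle-and-hold criterion and a whole-segment vote is easiest to see on a toy decode. The following is a minimal sketch under the definitions given above; the function names, the strict-majority reading of "majority", and the example sequences are assumptions.

```python
def whole_segment_vote(pred):
    """Per-execution label: the most frequent output over the whole attempt."""
    return max(set(pred), key=pred.count)


def settle_and_hold(pred, target):
    """True if the target is output for a strict majority of the second half."""
    second_half = pred[len(pred) // 2:]
    hits = sum(p == target for p in second_half)
    return hits > len(second_half) / 2


def selection_time_ms(pred, target, stride_ms=64):
    """Time from attempt start to the first correct output, or None if never correct."""
    for i, p in enumerate(pred):
        if p == target:
            return i * stride_ms
    return None


# Hypothetical attempt that is right on average but drifts away at the end.
wobbly = ["pinch"] * 6 + ["tripod"] * 4
# Hypothetical attempt that starts late but then converges and stays.
settled = ["rest"] * 3 + ["pinch"] * 7

print(whole_segment_vote(wobbly), settle_and_hold(wobbly, "pinch"))
print(whole_segment_vote(settled), settle_and_hold(settled, "pinch"),
      selection_time_ms(settled, "pinch"))
```

The first attempt would count as correct under per-execution voting yet fails settle-and-hold, which is the behaviour the completion-rate metric is designed to penalise.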

4.8. Acquisition Bandwidth

The within-user and ablation results above use the deployment-oriented 20–120 Hz band at a 250 Hz sampling rate, which captures the dominant sEMG power. Surface EMG, however, carries discriminative fibre-recruitment and conduction-velocity content above 120 Hz. We test whether that content is useful by re-deriving the same pipeline at a 1 kHz sampling rate with a wider 20–450 Hz band, holding the model, calibration protocol, and decoder fixed; only the acquisition front-end changes, and the deployed footprint is unchanged at 28,199 parameters.
The wider band is a significant, model-free gain in instantaneous accuracy, and an iso-sampling-rate control attributes most of the gain to the analysis band itself rather than to the finer temporal sampling that comes with it (Table 3, Figure 7). End to end, per-window balanced accuracy rises by 2.6 points for the causal decoder (74.8% → 77.4%; Wilcoxon p = 0.005, r = 0.70). To separate band from sampling resolution we add a control arm at 1 kHz with the original 20–120 Hz band, so that the two 1 kHz arms differ only in the low-pass cutoff at identical sampling rate and window length. The wider band alone then accounts for 1.9 of those points (75.5% → 77.4%; Wilcoxon p = 0.027, r = 0.56, better in 14/20 subjects), while raising the sampling rate at fixed band contributes the remaining 0.7 (74.8% → 75.5%). The per-execution figure, already near ceiling, does not move significantly. Because the gain appears in the per-window metric at zero parameter cost and survives the iso-rate control, it is attributable to the spectral content above 120 Hz rather than to the model, isolating analysis bandwidth as a concrete, inexpensive sensor-front-end lever. This gain is not a single-seed artefact: re-running the full wideband-versus-narrowband comparison with fresh population bases at two additional seeds reproduces a significant per-window improvement in every case (per-seed gains +2.6, +2.3, +3.5 points; paired Wilcoxon p = 0.005, 0.014, < 0.001; rank-biserial r = 0.70, 0.62, 0.92; Cohen’s d = 0.78, 0.68, 1.13; 95% CIs on the mean gain all exclude zero, e.g. [1.0, 4.1] at the deployed seed; better in 15/20, 14/20, 18/20 subjects; mean +2.8 points over the three seeds), so the rest of the paper’s single-seed results are reported against a band effect that is stable to re-seeding.
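The effect sizes quoted alongside the Wilcoxon tests can be computed from per-subject paired differences. The sketch below assumes the matched-pairs rank-biserial correlation built from the signed-rank sums, $r = (W^+ - W^-)/(W^+ + W^-)$, and Cohen's $d$ on the paired differences; this section does not state the exact estimators, so both formulas are assumptions, and the difference values are invented for illustration.

```python
from statistics import mean, stdev


def signed_rank_sums(diffs):
    """Return (W+, W-) using average ranks for ties; zero differences are dropped."""
    d = [v for v in diffs if v != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        # Extend j over a run of tied absolute differences.
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    return w_plus, w_minus


def rank_biserial(diffs):
    """Matched-pairs rank-biserial correlation in [-1, 1]."""
    w_plus, w_minus = signed_rank_sums(diffs)
    return (w_plus - w_minus) / (w_plus + w_minus)


def cohens_d_paired(diffs):
    """Mean paired difference divided by the sample SD of the differences."""
    return mean(diffs) / stdev(diffs)


# Hypothetical per-subject gains (wideband minus narrowband), in points.
gains = [3.0, 1.0, -2.0, 4.0, 5.0]
print(signed_rank_sums(gains), rank_biserial(gains), cohens_d_paired(gains))
```

Under this definition a value of $r = 1.00$ means every non-zero difference has the same sign, which matches the reading "wins for every subject".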

4.9. Amputee Benchmark and Cross-Population Transfer

Able-bodied benchmarks are an imperfect proxy for prosthetic users. We therefore apply the identical leak-safe within-user protocol to 12 trans-radial amputees, pooled from NinaPro DB3 [1] (10 subjects with the complete grip set; one further subject lacked three grips and was excluded) and DB7 [26] (2 amputees), all recorded with the same 12-electrode montage and global movement labelling. The population base is the able-bodied DB2 model (the realistic deployment case, in which a new amputee user cannot be in the pretraining set); each amputee is then calibrated and tested on their own held-out repetition.
Two findings stand out (Table 4). First, amputee decoding is both substantially harder and far more variable than able-bodied: per-execution balanced accuracy is 63.8–74.2% across the two decoders (versus 94.0% able-bodied) with a large between-subject spread (±25 points; individual amputees range from 14% to 100% per-execution; Figure 8). Disaggregating by source dataset is itself informative and we report it plainly: the ten DB3 amputees decode at 61.7 ± 25.9% per-execution, whereas the two DB7 amputees are much stronger (92.9%; individually 86% and 100%). The pooled figure is thus lifted by a two-subject cohort, and on the larger DB3 cohort alone the gap to able-bodied is wider than the pooled number implies. Because DB7 contributes only two subjects, no per-dataset statistical comparison is meaningful for it, and the cross-population claim below rests on DB3; we keep the datasets pooled only for the aggregate effect-size estimate, and flag the confound explicitly. Second, and contrary to the able-bodied result where the learned decoder significantly beats the classical Hudgins time-domain baseline (Table 2; p < 0.0001), on amputees we find no evidence that this advantage transfers: per-window balanced accuracy is not detectably different (54.6% classical versus 55.0% learned, p = 0.68) and per-execution the classical baseline is, if anything, ahead by ten points, though not significantly (74.2% versus 63.8%, p = 0.12). With n = 12 and a ±25-point between-subject spread the comparison is underpowered (detectable only for very large effects); neither metric is significant even before correcting for the two comparisons (Holm leaves both n.s.), so this is absence of evidence of a deep advantage, not evidence of equivalence, with the point estimates weakly favouring the classical baseline. 
Given the pooling of two distinct datasets and the dominant DB3-versus-DB7 difference above, we read the deep ≈ classical observation as a tentative trend, not an authoritative finding. Pretraining the deep features on able-bodied data does not recover the gap. The interpretation is that amputee sEMG (attempted/phantom movements, altered anatomy, reduced repeatability) shifts the signal distribution enough that features learned on intact limbs do not transfer, while a simple per-user time-domain classifier remains robust. This is a cautionary cross-population result: able-bodied performance over-states amputee performance, and architectural advantages measured on intact-limbed benchmarks do not necessarily carry over.
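The Holm step-down correction referred to in the amputee comparison is simple enough to verify by hand. The sketch below applies the standard procedure to the two raw p-values quoted in the text (0.68 per-window, 0.12 per-execution); the function name is an assumption.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted


# Raw p-values from the text: per-window, then per-execution.
p_raw = [0.68, 0.12]
p_holm = holm_adjust(p_raw)
print(p_holm)
```

Since the smaller raw p-value already exceeds 0.05, the correction can only move both values further from significance, consistent with the statement that Holm leaves both comparisons non-significant.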

4.10. Cross-Session Re-Donning Gap

Every within-user number above shares a single recording session and is therefore an optimistic bound: it does not capture the electrode shift that occurs when a device is removed and re-applied. We measure that gap directly on NinaPro DB6 [27], which records each subject over five days with the electrodes re-donned between sessions. DB6 has ten subjects; we run two folds of five, training the population base on one half and testing on the other, then swapping the halves, so every subject is tested under a base that never saw their data (leak-free). For each test subject we calibrate on day 1 and evaluate on (a) held-out day-1 repetitions (within-session) and (b) day 5, four days and several re-donnings later, with no day-5 calibration (the worst case).
The gap is large but recoverable, and we characterise it across the full five-day span on all ten subjects (Table 5, Figure 9); we attach subject-bootstrap 95% CIs ($10^4$ resamples over the ten subjects) to the key quantities. Within-session day-1 per-execution balanced accuracy is 93.4% (bootstrap CI [90.6, 95.8]), in line with the DB2 headline. Applying that day-1 decoder on a later day with no adaptation collapses it, and the drop appears immediately on the first re-donning (day 2: 47.9%) and persists to day 5 (38.8%, CI [28.3, 50.1]): a 54.6-point fall. The collapse is a step change the moment the electrodes are re-applied, not gradual drift, which is the signature of donning-induced shift and puts a number on the paper’s central thesis: within-session accuracy is a weak proxy for deployment.
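A subject-level bootstrap of the kind described above resamples whole subjects with replacement and recomputes the mean each time. The following is a minimal percentile-bootstrap sketch; the percentile method, the function name, and the ten per-subject accuracies are assumptions for illustration, not the study's data.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=1337):
    """Percentile bootstrap CI for the mean, resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-subject per-execution accuracies (%), one value per subject.
subject_acc = [95.2, 91.0, 88.5, 97.1, 93.3, 90.4, 96.0, 94.8, 89.9, 97.6]
point_estimate = sum(subject_acc) / len(subject_acc)
ci_low, ci_high = bootstrap_ci(subject_acc)
print(f"{point_estimate:.1f}% [{ci_low:.1f}, {ci_high:.1f}]")
```

Resampling at the subject level, rather than the window level, keeps the interval honest about the fact that only ten independent units are available.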
Crucially it is fully recoverable. A short per-session recalibration, fine-tuning on repetitions from the test day, restores per-execution accuracy to 93–95% at every separation. The recovered accuracy is not statistically distinguishable from the within-session ceiling (recalibrated 94.2 ± 3.4% versus ceiling 93.4 ± 4.3%, pooled over re-donned days; paired Wilcoxon p = 0.63, n = 10; $p \ge 0.32$ at every individual day separation), so on this cohort the re-donning loss is recovered in full rather than merely mitigated (Figure 9). The recovery is cheap, and we quantify its price as a recalibration-budget curve at the worst (day-5) separation (subject-bootstrap 95% CIs in brackets): a single recalibration repetition already lifts accuracy from 37.7% to 66.5% [60.2, 72.7], two reach 79.7% [72.5, 86.5], four reach 86.9% [81.2, 91.9], and eight restore the full within-session ceiling (93.6% [91.2, 96.0]; Figure 9b). In wall-clock terms, at DB6’s 4–6 s held contraction per grasp one repetition of the seven-grasp set is roughly a minute of guided movement, so the four repetitions that recover 87% cost on the order of a few minutes of recalibration per don and the eight that fully restore the ceiling under ten minutes; a clinician can trade that one-time per-don cost against accuracy. Even label-free re-normalisation, re-fitting only the per-channel z-score on the new session with no labels, recovers a partial +10.1 points (n = 10; to 47.7 ± 15.0%). The deployment recipe is therefore explicit: a decoder calibrated once does not survive re-donning, but a brief recalibration at each don fully recovers accuracy (four repetitions already get most of the way), with automatic renormalisation as a partial label-free fallback. The day-1 numbers, ours and the field’s, are a ceiling. We state the scope: DB6 uses its own 8-class grasp set and a different electrode montage, so the absolute values are not directly comparable to the DB2 sections.
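The recalibration-budget curve can be read as a lookup from a target accuracy to a per-don time cost. The sketch below encodes only the four measured day-5 points quoted above and the text's rough figure of one minute per repetition; it does not interpolate between measured points, and the helper names are assumptions.

```python
# Day-5 recalibration-budget curve from the text: repetitions -> accuracy (%).
BUDGET_CURVE = {1: 66.5, 2: 79.7, 4: 86.9, 8: 93.6}

# "Roughly a minute of guided movement" per repetition, as stated in the text.
MINUTES_PER_REPETITION = 1.0


def min_repetitions(target_acc):
    """Smallest measured repetition count reaching target_acc, or None."""
    for reps in sorted(BUDGET_CURVE):
        if BUDGET_CURVE[reps] >= target_acc:
            return reps
    return None


def recalibration_minutes(reps):
    """Approximate guided-movement time per don for a given repetition count."""
    return reps * MINUTES_PER_REPETITION


for target in (65.0, 85.0, 93.0):
    reps = min_repetitions(target)
    print(target, reps, recalibration_minutes(reps))
```

A clinician choosing an operating point would read the same trade-off directly from Figure 9b; the lookup simply makes the cost side of it explicit.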

4.11. Deployability

The inference model (backbone + classifier) has 28,199 parameters (27.5 kB at int8; 110 kB fp32) and 0.76 million multiply–accumulates per window. At the 64 ms decision stride this is a sustained 11.8 MMAC/s, well within a Cortex-M-class budget; the sequence-reasoning layer adds only a K × K matrix–vector update per window. We report hardware-agnostic compute metrics (parameters, MACs, quantised size) deliberately, so that the efficiency comparison is not confounded by MCU-specific SDK optimisations, sleep states, or accelerator availability; measured on-device energy on the target prosthesis is future work. Table 6 positions our footprint against recent embedded and foundation-model EMG decoders on the axis that is comparable across papers (model size), where ours is the smallest by a wide margin; we deliberately do not tabulate accuracy across these systems, as they use different datasets, class sets, and evaluation protocols, and a cross-protocol accuracy comparison is precisely the apples-to-oranges the paper argues against. Bioformers [4] and TinyMyo [6] report hardware-measured energy that we do not. We tabulate the per-inference energy for both (the per-decision quantity, in mJ), not the device power envelope (in mW): TinyMyo’s 44.9 mJ is the energy of a single forward pass of its 3.6 M-parameter model on GAP9, roughly 320 times (between two and three orders of magnitude) Bioformers’ 0.14 mJ on the smaller GAP8, which is the comparison that motivates a compact model.
Table 6. Footprint positioning against recent embedded/foundation-model EMG decoders. Sizes are comparable across papers; accuracy is not (different datasets and protocols) and is deliberately omitted. We report hardware-agnostic compute cost; measured on-device energy is future work.
| Model | Params | Int8 size | Measured on-device |
|---|---|---|---|
| Ours | 28.2 k | 27.5 kB | compute only (0.76 MMAC/win) |
| Bioformers [4] | | 94.2 kB | GAP8: 0.14 mJ, 2.72 ms |
| TinyMyo [6] | 3.6 M | | GAP9: 44.9 mJ |
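The footprint figures in Section 4.11 follow from simple arithmetic on the parameter count, the per-window MAC count, and the decision stride. The sketch below re-derives them; reading "kB" as 1024 bytes is an assumption (it is the reading that reproduces the quoted 27.5 kB and 110 kB), and the MAC count is the rounded 0.76 M figure, so the sustained rate comes out marginally above the quoted 11.8 MMAC/s.

```python
PARAMS = 28_199            # parameter count quoted in the text
MACS_PER_WINDOW = 0.76e6   # rounded per-window multiply-accumulates from the text
STRIDE_S = 0.064           # 64 ms decision stride

BYTES_PER_KB = 1024        # assumption: "kB" is used in the binary sense

int8_kb = PARAMS * 1 / BYTES_PER_KB   # one byte per parameter
fp32_kb = PARAMS * 4 / BYTES_PER_KB   # four bytes per parameter
mmac_per_s = MACS_PER_WINDOW / STRIDE_S / 1e6

# Ratio of the two per-inference energies listed in Table 6 (mJ).
energy_ratio = 44.9 / 0.14

print(f"int8: {int8_kb:.1f} kB, fp32: {fp32_kb:.0f} kB")
print(f"sustained compute: {mmac_per_s:.3f} MMAC/s")
print(f"TinyMyo / Bioformers energy per inference: {energy_ratio:.0f}x")
```

The small difference between the derived and quoted sustained rate is what one would expect if the unrounded MAC count sits slightly below 0.76 million.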

5. Discussion

What the numbers mean, and a fail-safe property. The 94.0% per-execution figure is the standard voted reporting unit and reflects how a grasp-and-hold device decides; the 75.2% per-window figure reflects instantaneous behaviour, and the gap between them is the onset transient. The confusion matrix (Figure 4) shows that this residual error is localised almost entirely in the rest column (e.g. pinch and tripod lose 22% and 25% of their per-window mass to rest) rather than in grip–grip confusion. This is a desirable failure mode: during the brief muscle-recruitment delay the system fails safe, momentarily withholding activation rather than executing the wrong grip. Reporting both metrics is essential, as the per-execution number alone would overstate the moment-to-moment experience.
On the transition grammar. Because NinaPro DB2 contains discrete, isolated repetitions separated by rest, the learned transition matrix is a hub-and-spoke model (rest ↔ grasp); direct grasp-to-grasp transitions do not occur in the data and are therefore near-zero in $A$. This is a property of the dataset, not a limitation of the method. The mechanism, in which a fixed transition prior over per-window posteriors suppresses instantaneous switching noise, supports arbitrary grasp-to-grasp transitions once $A$ is re-estimated on unconstrained functional data. Concretely, the off-diagonal grammar is the population maximum-likelihood estimate $O_{ij} = n_{ij} / \sum_{k \neq i} n_{ik}$, where $n_{ij}$ counts observed state-$i$-to-state-$j$ label transitions; the hub-and-spoke pattern is precisely the regime in which the grasp-to-grasp counts $n_{gg'}$ ($g, g' \in G$) are zero, forcing $O_{gg'} = 0$ and routing all probability through rest. On continuous data those counts are nonzero, so $O$ acquires direct grasp-to-grasp mass with no change to Eq. 2 or to the forward-filter recursion, and because the hold $h$ and the destination distribution $O$ are decoupled, the self-transition stickiness is unchanged while only the routing adapts. We state the hub-and-spoke structure explicitly to avoid overstating generality. To verify that this is a data property and not an architectural limit, we ran a controlled synthetic experiment: injecting a direct grip-to-grip path into $O$ and presenting an emission stream that switches grip without an intervening rest, the causal decoder follows the transition directly (rest → grip A → grip B, never falling back to rest between grips). The mechanism thus supports unconstrained grip-to-grip control whenever the training data contains such transitions; no change to the execution code is needed. 
We nonetheless stress that this generality is shown only in synthesis: because no public NinaPro-scale dataset contains naturalistic, rest-free grasp sequencing, empirically validating the grammar on unconstrained functional transitions, and quantifying any accuracy cost of grip-to-grip switching, remains an open problem and a limitation of the present benchmark, which we flag for future study on continuous-task data (e.g. MeganePro [12]).
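The grammar estimate and the synthetic grip-to-grip check described above can be illustrated with a three-state toy (rest, grip A, grip B). Eq. 2 is not reproduced in this section, so the transition form used here, $A_{ii} = h$ and $A_{ij} = (1 - h)\,O_{ij}$ for $i \neq j$, is an assumption motivated by the statement that the hold $h$ and the destination distribution $O$ are decoupled; the emission values and function names are likewise hypothetical.

```python
def estimate_grammar(labels, n_states):
    """Off-diagonal MLE: O[i][j] = n_ij / sum_{k != i} n_ik from a label sequence."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(labels, labels[1:]):
        if a != b:  # only state changes contribute to the off-diagonal grammar
            counts[a][b] += 1
    grammar = []
    for row in counts:
        total = sum(row)
        grammar.append([c / total if total else 0.0 for c in row])
    return grammar


def transition_matrix(grammar, hold):
    """Assumed form: stay with probability `hold`, otherwise route via the grammar."""
    n = len(grammar)
    return [
        [hold if i == j else (1 - hold) * grammar[i][j] for j in range(n)]
        for i in range(n)
    ]


def forward_filter(emissions, trans, prior):
    """Causal forward recursion; returns the argmax state after each window."""
    n = len(trans)
    belief, path = list(prior), []
    for e in emissions:
        predicted = [sum(belief[i] * trans[i][j] for i in range(n)) for j in range(n)]
        posterior = [predicted[j] * e[j] for j in range(n)]
        z = sum(posterior)
        belief = [p / z for p in posterior]
        path.append(max(range(n), key=belief.__getitem__))
    return path


REST, GRIP_A, GRIP_B = 0, 1, 2

# Hub-and-spoke labels (every grip is followed by rest), as in isolated repetitions.
hub = estimate_grammar([0, 0, 1, 1, 0, 0, 2, 2, 0], n_states=3)

# Synthetic injection of direct grip-to-grip paths into the grammar.
injected = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
trans = transition_matrix(injected, hold=0.97)

# Hypothetical per-window posteriors: rest, then grip A, then grip B with no rest between.
e_rest, e_a, e_b = [0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]
stream = [e_rest] + [e_a] * 3 + [e_b] * 3
decoded = forward_filter(stream, trans, prior=[1 / 3] * 3)
print(hub[GRIP_A][GRIP_B], decoded)
```

With the injected grammar the decoded path moves from grip A to grip B after a short lag set by the hold, without revisiting rest, which is the behaviour the synthetic experiment is meant to demonstrate.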
Limitations. (i) Our evaluation is offline and pseudo-online: as the closest proxy that public, cue-prompted data permits, we report controller-level metrics (completion rate, selection time, rest stability; §4.7) and a window-by-window causal stream, but offline and pseudo-online accuracy correlate only weakly with closed-loop, human-in-the-loop control [17,18], which requires a user reacting to real-time feedback and therefore a live recording setup that no pre-recorded public benchmark contains. A closed-loop validation is consequently out of scope for a public-data study by construction; it is the designated next step (below) rather than a gap closable on these datasets. (ii) The headline within-user protocol is within-session (calibration and test share a recording); we measure the cross-day cost directly on DB6 (§4.10, n = 10 over five days): a 54.6-point per-execution drop on re-donning without adaptation, fully recovered by a short per-don recalibration at every day separation. The within-session numbers should thus be read as a ceiling and per-session recalibration treated as a deployment requirement. (iii) We report compute cost, not measured on-device energy. (iv) The individual learning components (masked pretraining, contrastive learning, HMM decoding) are established; our contribution is the architecture-level decision to offload temporal structure from the network into a zero-parameter grammar for edge deployment, together with the leak-safe extrapolation protocol and paired-metric reporting. 
(v) The wider-band gain (§4.8) is shown within-session on able-bodied data; its interaction with electrode shift is untested, and is not obviously benign: the high-frequency content (120–450 Hz) that carries the gain reflects fibre-recruitment and conduction-velocity detail that is plausibly more sensitive to electrode displacement than the dominant low-frequency power, so whether the band advantage survives re-donning (§4.10) is an open question that a cross-day wideband acquisition would need to settle. (vi) The amputee benchmark (§4.9) pools 12 subjects across two distinct datasets with large variance; the bulk is DB3 (n = 10) and DB7 contributes only 2 subjects that decode markedly better, so the pooled accuracy is DB7-inflated and the cross-population observation (deep ≈ classical on amputees) is a tentative trend resting on the DB3 cohort, not a powered comparison. We frame this as a microcontroller-compatible, honestly-evaluated system and benchmark, not as a clinical validation or a new learning primitive.
Toward deployment. The natural next steps are real-time closed-loop evaluation (concretely, a Fitts-style target-acquisition or grasp-completion task with a small cohort of users reacting to live visual feedback, the one regime our pseudo-online proxy cannot capture), and on-device energy measurement on the target prosthesis: the regimes that determine clinical viability. The cross-day re-donning regime, previously listed here as future work, is now measured directly (§4.10). Two concrete, evidence-backed front-end requirements follow from this study: acquire an analysis band extending to at least 450 Hz, which by the Nyquist criterion requires a sampling rate above 900 Hz (§4.8 used 1 kHz), to recover the per-window accuracy the narrow band discards, and acquire movement-synchronised onset ground truth so that onset latency can be measured at all (§4.7); both motivate dedicated acquisition hardware over re-use of cue-prompted public benchmarks.

6. Conclusions

We presented a compact (28k-parameter) within-user sEMG grasp decoder, a depthwise-separable CNN with a causal sequence-reasoning layer (and an optional self-supervised pretraining component that we show contributes little at this scale), evaluated under a strictly leak-safe extrapolation protocol with honest paired metrics. It reaches 94.0% per-execution accuracy at 2.2% false-activation within-user and 89.1% per-execution (20% labelled calibration; 50.6% with unsupervised normalisation only) under 40-fold cross-subject LOSO, at a microcontroller-class footprint. Three further studies sharpen the picture: a wider acquisition band (20–450 Hz) raises per-window accuracy by 2.6 points at no model-size cost (p = 0.005); the decoder runs as a causal real-time stream that reproduces the offline decision exactly at a few milliseconds per decision; and on 12 amputees we find no evidence that its able-bodied advantage over a classical baseline transfers (the two perform comparably), while onset latency is bounded partly by the decoder’s own window and debounce and partly by an acquisition floor. Most consequentially, a cross-session study on DB6 measures the re-donning gap our within-session protocol cannot: per-execution accuracy falls from 93.4% to 38.8% when the electrodes are re-applied four days later without adaptation, but a short per-don recalibration fully restores it (93.1%, at every day separation), making per-session recalibration the explicit deployment recipe. Code and configurations are released at https://github.com/seanb9/emg-leaksafe-benchmark to support reproducible, deployment-oriented EMG research.

Reproducibility

All reported numbers are produced by a single command per experiment under a fixed global seed (1337) with deterministic kernels; the environment is PyTorch 2.12, NumPy 2.4, SciPy 1.17, and scikit-learn 1.9. Runs were executed on an Apple M4 Pro (24 GB unified memory, MPS backend); a within-user run completes in 4 min and each LOSO worker (ten folds) in 85 min. We specify the full protocol below; the pipeline is released at https://github.com/seanb9/emg-leaksafe-benchmark: leak-safe windowing/split code, run configurations, per-subject result tables, figure scripts, and the synthetic positive-control generator.
Data and splits. NinaPro DB2 (40 subjects, 12 channels), resampled to 250 Hz, band-pass 20–120 Hz with a 50 Hz notch, windowed at 256 ms / 64 ms stride; each window additionally carries per-channel RMS and waveform-length envelopes (feature window 16 samples). Windows are assigned to contiguous (stimulus, repetition) segments and split by segment. Within-user: a fixed disjoint 20/20 subject partition (pretraining vs. test, enumerated in the released configs); for each test subject, calibration uses repetitions 1–$r$ ($r \le 5$) and evaluation the held-out repetition 6. LOSO: 40-fold leave-one-subject-out on the 10-class set; the held-out subject contributes no window to training or contrastive alignment. A held-out subject’s calibration pool is a class-balanced sample of its non-evaluation windows; the 0% condition draws from it only an unsupervised per-channel z-score, and the {5, 10, 20, 30}% conditions additionally fine-tune on that fraction of labelled calibration windows.
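The segment-level split can be sketched as follows: windows are generated strictly inside each contiguous (stimulus, repetition) segment, and whole segments are then assigned to calibration or test by repetition number. This is a minimal illustration, not the released windowing code; the segment schema (`stimulus`, `repetition`, `start`, `stop`), the function names, and the toy segment lengths are assumptions.

```python
FS_HZ = 250
WIN = 256 * FS_HZ // 1000     # 256 ms window  -> 64 samples
STRIDE = 64 * FS_HZ // 1000   # 64 ms stride   -> 16 samples


def windows_in_segment(start, stop, win=WIN, stride=STRIDE):
    """Sample ranges [s, s + win) lying entirely inside the segment [start, stop)."""
    return [(s, s + win) for s in range(start, stop - win + 1, stride)]


def leak_safe_split(segments, n_calib_reps, test_rep=6):
    """Assign whole segments to calibration (reps 1..n_calib_reps) or test (test_rep)."""
    calib, test = [], []
    for seg in segments:
        wins = [(seg["stimulus"], w)
                for w in windows_in_segment(seg["start"], seg["stop"])]
        if seg["repetition"] <= n_calib_reps:
            calib.extend(wins)
        elif seg["repetition"] == test_rep:
            test.extend(wins)
    return calib, test


# Toy recording: one stimulus, six back-to-back repetitions of 1000 samples each.
segments = [
    {"stimulus": 1, "repetition": r, "start": (r - 1) * 1000, "stop": r * 1000}
    for r in range(1, 7)
]
calib, test = leak_safe_split(segments, n_calib_reps=5)

# No calibration window may share a sample with any test window.
last_calib_sample = max(stop for _, (_, stop) in calib)
first_test_sample = min(start for _, (start, _) in test)
print(len(calib), len(test), last_calib_sample, first_test_sample)
```

Because windows never straddle a segment boundary, the 75% overlap between consecutive windows cannot carry samples across the calibration/test split, which is the leak that window-level random splits introduce.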
Model and training. Depthwise-separable 1-D CNN (widths 32–64–128, embedding 128); masked self-supervised pretraining (25 epochs, 40% span masking); supervised contrastive plus class-weighted cross-entropy; per-user fine-tune at learning rate $3 \times 10^{-4}$; prototype blend 0.4; 8-view test-time augmentation. LOSO uses 20 SupCon epochs and up to 60 early-stopped fine-tune epochs. Sequence decoder: causal forward filter with hold $h = 0.97$ and a population-learned off-diagonal grammar, posterior-as-emission (rest prior retained). Deployment hysteresis: $\theta_{\text{on}} = 0.6$, $\theta_{\text{off}} = 0.35$, $n_{\text{on}} = 3$, $n_{\text{off}} = 4$, $n_{\text{sw}} = 3$.
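One plausible reading of the deployment hysteresis constants is a debounced state machine over the per-window posteriors. The sketch below is an assumption about the gate's semantics, not the released implementation: a grip activates after $n_{\text{on}}$ consecutive frames at or above $\theta_{\text{on}}$, releases after $n_{\text{off}}$ consecutive frames below $\theta_{\text{off}}$, and switches to another grip after $n_{\text{sw}}$ consecutive confident frames of that grip.

```python
def hysteresis_gate(posteriors, rest=0, theta_on=0.6, theta_off=0.35,
                    n_on=3, n_off=4, n_sw=3):
    """Debounced output state per window (assumed gate semantics)."""
    state, cand = rest, None
    on_run = off_run = sw_run = 0
    out = []
    for p in posteriors:
        grips = [k for k in range(len(p)) if k != rest]
        best = max(grips, key=p.__getitem__)
        if state == rest:
            # Count consecutive frames in which the same grip clears theta_on.
            if p[best] >= theta_on:
                on_run = on_run + 1 if best == cand else 1
                cand = best
            else:
                on_run, cand = 0, None
            if on_run >= n_on:
                state, on_run, off_run, sw_run = cand, 0, 0, 0
        else:
            # Count consecutive frames in which the active grip falls below theta_off.
            off_run = off_run + 1 if p[state] < theta_off else 0
            # Count consecutive confident frames of one competing grip.
            if best != state and p[best] >= theta_on:
                sw_run = sw_run + 1 if best == cand else 1
                cand = best
            else:
                sw_run = 0
            if sw_run >= n_sw:
                state, off_run, sw_run = cand, 0, 0
            elif off_run >= n_off:
                state, cand, off_run, sw_run = rest, None, 0, 0
        out.append(state)
    return out


# Hypothetical posteriors over (rest, grip 1, grip 2): rest, a held grip, rest again.
p_rest, p_grip = [0.9, 0.05, 0.05], [0.1, 0.8, 0.1]
gated = hysteresis_gate([p_rest] * 2 + [p_grip] * 5 + [p_rest] * 5)
print(gated)
```

Under this reading the grip is committed on its third confident frame, which corresponds to the 192 ms debounce at the 64 ms stride, and is released on the fourth consecutive low-confidence frame.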
Hyperparameter selection. Backbone width, embedding size, learning rate ($\{1, 3, 10\} \times 10^{-4}$), prototype blend ({0, 0.2, 0.4}), and SSL mask fraction ({0.3, 0.4, 0.5}) were selected by validation balanced accuracy on a split held out from the pretraining subjects; the HMM hold $h$ was swept on the same pretraining validation (0.80–0.99, $h = 0.97$ chosen). The deployment-hysteresis constants were swept to a pre-specified rest-false-activation target, not optimised on test data. Final values and the search grids are provided in the released configurations.

Institutional Review Board Statement

This work is a secondary analysis of the publicly available NinaPro DB2 dataset [1], collected under the ethics approval reported by its original authors. No new human-subjects data were collected for this study and no identifiable personal information was used.

Data Availability Statement

NinaPro DB2 is publicly available at http://ninapro.hevs.ch. The full pipeline (leak-safe windowing and split code, run configurations, per-subject result tables, figure-generation scripts, and the synthetic positive-control generator) is released under the MIT licence at https://github.com/seanb9/emg-leaksafe-benchmark; the complete protocol is given in the Reproducibility section.

Author Contributions

S.B. conceived the study, designed the evaluation protocol and decoder, implemented the experiments, and wrote the manuscript. W.H. (Chief Medical Officer, ReAble Labs) provided clinical and rehabilitation-domain guidance, including the functionally-motivated grip set and the clinical false-activation and operating-point requirements, and revised the manuscript. Both authors reviewed and approved the final manuscript.

Funding

This work was supported by ReAble Labs. No external grant funding was received.

Conflicts of Interest

The authors are affiliated with ReAble Labs, which is developing assistive hand-prosthesis technology related to this work; this constitutes a competing financial interest. All results reported here use only public data and are presented without commercial claims.

Acknowledgments

The authors thank the NinaPro consortium for making the DB2 dataset publicly available.

References

  1. Atzori, M.; Gijsberts, A.; Castellini, C.; Caputo, B.; Mittaz Hager, A.G.; Elsig, S.; Giatsidis, G.; Bassetto, F.; Müller, H. Electromyography data for non-invasive naturally-controlled robotic hand prostheses. Scientific Data 2014, 1, 1–13. [CrossRef]
  2. Atzori, M.; Cognolato, M.; Müller, H. Deep learning with convolutional neural networks applied to electromyography data: A resource for the classification of movements for prosthetic hands. Frontiers in Neurorobotics 2016, 10, 9. [CrossRef]
  3. Jiang, B.; Wu, H.; Xia, Q.; Xiao, H.; Peng, B.; Wang, L.; Zhao, Y. An efficient surface electromyography-based gesture recognition algorithm based on multiscale fusion convolution and channel attention. Scientific Reports 2024, 14, 30867. [CrossRef]
  4. Burrello, A.; Bianco Morghet, F.; Scherer, M.; Benatti, S.; Benini, L.; Macii, E.; Poncino, M.; Jahier Pagliari, D. Bioformers: Embedding Transformers for Ultra-Low Power sEMG-based Gesture Recognition. In Proceedings of the Design, Automation & Test in Europe Conference (DATE), 2022. [CrossRef]
  5. Zanghieri, M.; Benatti, S.; Burrello, A.; Kartsch, V.; Conti, F.; Benini, L. Robust Real-Time Embedded EMG Recognition Framework Using Temporal Convolutional Networks on a Multicore IoT Processor. IEEE Transactions on Biomedical Circuits and Systems 2020, 14, 244–256. [CrossRef]
  6. Fasulo, M.; Spacone, G.; Ingolfsson, T.M.; Li, Y.; Benini, L.; Cossettini, A. TinyMyo: A Tiny Foundation Model for Flexible EMG Signal Processing at the Edge. arXiv preprint arXiv:2512.15729 2025.
  7. Yang, J.; Soh, M.; Lieu, V.; Weber, D.J.; Erickson, Z. EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024. [CrossRef]
  8. Wang, D.; Shelhamer, E.; Liu, S.; Olshausen, B.; Darrell, T. Tent: Fully Test-Time Adaptation by Entropy Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  9. Sun, Y.; Wang, X.; Liu, Z.; Miller, J.; Efros, A.; Hardt, M. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
  10. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Multiclass Brain–Computer Interface Classification by Riemannian Geometry. IEEE Transactions on Biomedical Engineering 2012, 59, 920–928. [CrossRef]
  11. Touko, N.; Ellis, M.O.A.; Capone, C.; Burrello, A.; Donati, E.; Manneschi, L. Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition. arXiv preprint arXiv:2601.04181 2026.
  12. Cognolato, M.; Gijsberts, A.; Gregori, V.; Saetta, G.; Giacomino, K.; Mittaz Hager, A.G.; Gigli, A.; Faccio, D.; Tiengo, C.; Bassetto, F.; et al. Gaze, visual, myoelectric, and inertial data of grasps for intelligent prosthetics. Scientific Data 2020, 7, 43. [CrossRef]
  13. Betthauser, J.L.; Krall, J.T.; Bannowsky, S.G.; Levay, G.; Kaliki, R.R.; Fifer, M.S.; Thakor, N.V. Stable Responsive EMG Sequence Prediction and Adaptive Reinforcement With Temporal Convolutional Networks. IEEE Transactions on Biomedical Engineering 2020, 67, 1707–1717. [CrossRef]
  14. Simon, A.M.; Hargrove, L.J.; Lock, B.A.; Kuiken, T.A. A Decision-Based Velocity Ramp for Minimizing the Effect of Misclassifications During Real-Time Pattern Recognition Control. IEEE Transactions on Biomedical Engineering 2011, 58, 2360–2368. [CrossRef]
  15. Chan, A.D.C.; Englehart, K.B. Continuous Myoelectric Control for Powered Prostheses Using Hidden Markov Models. IEEE Transactions on Biomedical Engineering 2005, 52, 121–124. [CrossRef]
  16. Rabiner, L.R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 1989, 77, 257–286. [CrossRef]
  17. Ortiz-Catalan, M.; Rouhani, F.; Brånemark, R.; Håkansson, B. Offline accuracy: A potentially misleading metric in myoelectric pattern recognition for prosthetic control. In Proceedings of the 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2015. [CrossRef]
  18. Vujaklija, I.; Shalchyan, V.; Kamavuako, E.N.; Jiang, N.; Marateb, H.R.; Farina, D. Online mapping of EMG signals into kinematics by autoencoding. Journal of NeuroEngineering and Rehabilitation 2018, 15, 21. [CrossRef]
  19. Englehart, K.; Hudgins, B. A Robust, Real-Time Control Scheme for Multifunction Myoelectric Control. IEEE Transactions on Biomedical Engineering 2003, 50, 848–854. [CrossRef]
  20. Hudgins, B.; Parker, P.; Scott, R.N. A New Strategy for Multifunction Myoelectric Control. IEEE Transactions on Biomedical Engineering 1993, 40, 82–94. [CrossRef]
  21. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [CrossRef]
  22. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020.
  23. Bourlard, H.A.; Morgan, N. Connectionist Speech Recognition: A Hybrid Approach; Kluwer Academic Publishers, 1994.
  24. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv preprint arXiv:1803.01271 2018.
  25. Page, E.S. Continuous Inspection Schemes. Biometrika 1954, 41, 100–115. [CrossRef]
  26. Krasoulis, A.; Kyranou, I.; Erden, M.S.; Nazarpour, K.; Vijayakumar, S. Improved prosthetic hand control with concurrent use of myoelectric and inertial measurements. Journal of NeuroEngineering and Rehabilitation 2017, 14, 71. [CrossRef]
  27. Palermo, F.; Cognolato, M.; Gijsberts, A.; Müller, H.; Caputo, B.; Atzori, M. Repeatability of grasp recognition for robotic hand prosthesis control based on sEMG data. In Proceedings of the 2017 International Conference on Rehabilitation Robotics (ICORR). IEEE, 2017, pp. 1154–1159. [CrossRef]
Figure 2. Per-window gain from the causal sequence-reasoning layer over raw argmax evidence (annotated +Δ; e.g. +2.6 at five reps). This is the gain over no logic, distinct from the smaller +1.5 gain over a matched-window majority vote (Table 2).
Figure 6. Online streaming validation. Top: temporal-segmentation ribbon for a representative held-out repetition decoded window-by-window as a causal stream; the upper strip is the ground-truth intent and the lower strip the streamed decode, colour-coded by grip (inter-grasp rest periods compressed for legibility). The decode strip matches the intent strip across grips, with brief colour slivers at onset/offset transients. Bottom: per-decision compute latency pooled across all 20 users (79,897 windows), shown over its actual range; the median is 3.7 ms (p95 4.8 ms), 17× under the 64 ms decision budget. The streamed decode equals the batched offline decode for 100% of windows in every subject.
Figure 7. Per-subject per-window balanced accuracy for the end-to-end front-end change (250 Hz / 20–120 Hz vs 1 kHz / 20–450 Hz). Each point is one user; points above the diagonal improve with the wider, faster front-end (15/20 users; +2.6 points, Wilcoxon p = 0.005) at an unchanged model footprint. Table 3 isolates the analysis band (+1.9, p = 0.027) from the sampling rate (+0.7).
Figure 8. Per-execution balanced accuracy of the CNN evidence decoder (raw, gate-free) for each of the 12 amputees (sorted), against the able-bodied reference (dashed). The shaded band marks the pooled per-execution range across decoders (63.8–74.2%, Table 4); this pooled statistic is dominated by DB3 (n = 10) and is lifted by the two DB7 amputees (which decode markedly better but cannot be assessed as a separate cohort). Amputee decoding is markedly lower on average and highly variable between subjects, from near-perfect to near-chance, which is the dominant feature of amputee EMG decoding rather than the choice of decoder.
Figure 9. Cross-session re-donning on DB6 (n = 10, mean ± SD). (a) A day-1 decoder (orange) collapses the moment the electrodes are re-donned, from 93% per-execution to 40% by day 2, and stays there through day 5; a short per-don recalibration (green) holds within-session accuracy (93–95%) at every separation. (b) Recalibration-budget curve at the worst (day-5) separation: accuracy recovered versus the number of recalibration repetitions the user provides at each don. One repetition recovers most of the gap; four reach 87%; eight restore the full within-session ceiling. The gap is real but fully fixable, and cheaply so.
Table 1. Cross-subject 40-fold LOSO balanced accuracy (%) on NinaPro DB2. The cross-subject benchmark is run on the standard 10-class movement set (rest plus nine wrist/hand movements) used by cross-subject baselines such as EMGBench, chosen for comparability with the LOSO literature; this differs deliberately from the 7-grip functional set of the within-user clinical evaluation (Section 3.1). All entries are mean ± SD over the 40 held-out subjects. Per-window is the raw window classifier; +sequence logic is the causal-HMM-decoded per-window accuracy; per-execution is the majority-voted accuracy (the EMGBench protocol). All calibration is leak-safe (per-subject calibration-pool statistics, never global). The 0% row means zero labelled calibration: an unsupervised per-channel normalisation is still fit on the held-out subject’s calibration pool (a brief don-time baseline), but no labelled target-subject data is used; it is not a strict zero-information condition. None of the columns use the deployed hysteresis gate (Section 3.5); they isolate the classifier and the zero-parameter grammar.
Calib. ratio   Per-window    +sequence logic   Per-exec.
0%             38.6 ± 7.7    42.9 ± 8.5        50.6 ± 12.9
5%             53.0 ± 7.7    59.8 ± 7.8        74.9 ± 11.5
10%            59.3 ± 7.8    66.6 ± 6.9        84.3 ± 8.3
20%            64.2 ± 8.2    71.1 ± 7.1        89.1 ± 7.7
30%            66.4 ± 8.2    73.2 ± 7.6        91.3 ± 6.7
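The gap between the per-window and per-execution columns follows from how the two units are scored. The sketch below shows one way to compute both, assuming each window carries a true label, a predicted label, and an identifier for the movement execution it belongs to. The function names are hypothetical, ties in the vote are broken arbitrarily here, and the released pipeline may differ on both points.

```python
from collections import Counter, defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so frequent classes (e.g. rest) do not dominate."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def per_execution(y_true, y_pred, execution_id):
    """Collapse windows to one (true, voted) pair per execution by majority vote."""
    votes, label = defaultdict(Counter), {}
    for t, p, e in zip(y_true, y_pred, execution_id):
        votes[e][p] += 1
        label[e] = t  # all windows of one execution share a single true label
    ids = sorted(votes)
    return [label[e] for e in ids], [votes[e].most_common(1)[0][0] for e in ids]


if __name__ == "__main__":
    # Two executions of five windows each; 3/5 and 4/5 windows are correct.
    y_true = [1] * 5 + [2] * 5
    y_pred = [1, 1, 1, 2, 2, 2, 2, 2, 1, 2]
    exec_id = [0] * 5 + [1] * 5
    print(balanced_accuracy(y_true, y_pred))
    print(balanced_accuracy(*per_execution(y_true, y_pred, exec_id)))
```

In this toy case 70% of windows are correct but both executions are voted correctly, which is why per-execution figures sit well above per-window figures and why the two are reported side by side.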
Table 3. Acquisition bandwidth, isolated from sampling rate (within-user, 20 subjects, 5 calibration repetitions). Balanced accuracy (%), mean ± SD over subjects. All rows use the identical model, calibration, and gate-free decoder; only the front-end differs. The two 1 kHz rows differ only in the low-pass cutoff, isolating the analysis band from sampling resolution: the band adds +1.9 per-window points (+logic; Wilcoxon p = 0.027, r = 0.56), while the rate alone (rows 1→2) adds +0.7. Per-execution here is the gate-free voted unit (cf. the gated 94.0% of Figure 1). The 250 Hz / 20–120 Hz row reproduces the headline condition of Figure 1 on the same 20 subjects; the 0.4-point offset (74.8 vs 75.2) is run-to-run variation from independent training.
Sampling   Band (Hz)   Per-win. (raw)   Per-win. (+logic)   Per-exec.
250 Hz     20–120      71.9 ± 8.1       74.8 ± 7.9          93.3 ± 11.5
1 kHz      20–120      72.4 ± 8.0       75.5 ± 7.6          92.6 ± 11.5
1 kHz      20–450      74.4 ± 7.5       77.4 ± 7.3          95.3 ± 7.9
Table 4. Within-amputee benchmark (12 trans-radial amputees, NinaPro DB3+DB7, leak-safe, 5 calibration repetitions). Balanced accuracy (%), mean ± SD over subjects. Unlike the able-bodied case, the learned decoder shows no significant advantage over the classical TD+LDA baseline (paired Wilcoxon n.s.; underpowered at n = 12, point estimates weakly favouring the classical baseline).
Method                          Per-window     Per-execution
Classical TD+LDA                54.6 ± 16.5    74.2 ± 24.6
CNN, raw evidence               55.0 ± 16.8    66.9 ± 26.5
CNN + HMM grammar (ours)        55.0 ± 18.4    63.8 ± 26.0
Able-bodied (ours), reference   77.4 ± 7.3     94.0 ± 11.4
Table 5. Cross-session collapse and recovery across five days (NinaPro DB6, n = 10, two-fold). Per-execution balanced accuracy (%), mean ± SD over subjects. A day-1 decoder collapses the moment the electrodes are re-donned (day 2 onward); a short per-don recalibration restores within-session accuracy at every separation.
Test day              Day-1 decoder (no adapt.)   + per-don recalibration
D1 (within-session)   93.4 ± 4.3                  94.0 ± 4.0
D2                    47.9 ± 17.8                 94.0 ± 3.8
D3                    49.4 ± 10.9                 94.4 ± 2.6
D4                    47.7 ± 15.2                 95.3 ± 3.8
D5                    38.8 ± 17.8                 93.1 ± 5.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.