5. Experimental Setup
To test our algorithm, we used two datasets:
In the following sections, we describe the datasets and the experimental protocols used.
5.1. Dataset 1: MIT-BIH Arrhythmia + Noise Stress Test
5.1.1. MIT-BIH Arrhythmia Database
The MIT-BIH Arrhythmia Database [
14] contains 48 half-hour two-channel ambulatory ECG recordings from 47 subjects (25 men 32-89, 22 women 23-89) with comprehensive beat-by-beat annotations. We focus on five arrhythmia classes aligned with the AAMI standards:
Normal (N): Normal sinus rhythm with regular RR intervals
Atrial Fibrillation (AF): Irregular rhythm with absent P-waves
Ventricular Tachycardia (VT): Rapid regular ventricular rhythm > 100 bpm
Premature Ventricular Contractions (PVC): Ectopic beats with wide QRS
Other: Supraventricular arrhythmias, blocks, and rare rhythms
Records are digitized at 360 Hz with 11-bit resolution over the 10 mV range. We extract 30-second segments centered on annotated arrhythmia episodes, yielding 8,247 segments distributed as Normal (4,123), AF (987), VT (623), PVC (1,758), Other (756).
5.1.2. MIT-BIH Noise Stress Test Database
The MIT-BIH NST Database provides three types of realistic noise recorded from actual ambulatory monitoring:
Electrode motion (em): Artifacts of electrode displacement and skin-electrode interface motion
Baseline wander (bw): Low-frequency drift from respiration and body movement
Muscle artifact (ma): High-frequency interference from skeletal muscle contraction
We use the "em" (electrode motion) noise source as it most closely resembles artifacts encountered during physical activity in wearable monitoring. Noise is added to clean MIT-BIH signals at six signal-to-noise ratios: 24, 18, 12, 6, 0, and -6 dB, creating systematically degraded versions enabling controlled evaluation of motion robustness.
SNR is defined as
where signal power is computed over QRS complexes and noise power over baseline segments. At 24 dB, the artifacts are barely perceptible; at 6 dB, substantial corruption is evident; at -6 dB, noise overwhelms the cardiac signal.
5.1.3. Accelerometer Synthesis
Since MIT-BIH lacks synchronized accelerometer data, we synthesize realistic acceleration patterns correlated with noise artifacts. For each noise segment, we generate accelerometer signals by:
where
matches the power spectrum of typical human motion (dominant frequency 1-3 Hz for walking, 1-2 Hz for running), and random phase
introduces realistic variability. Acceleration magnitude scales with noise intensity:
with
.
5.2. Dataset 2: ScientISST MOVE
The ScientISST MOVE dataset [
21] provides synchronized single-lead ECG and tri-axial accelerometer recordings from 20 healthy subjects (10 male, 10 female, ages 20-35, BMI 23.4 kg/m
2) performing six activities:
Sitting (5 minutes)
Standing (5 minutes)
Walking 3 km/h (5 minutes)
Walking 5 km/h (5 minutes)
Running 8 km/h (5 minutes)
Cycling 60 RPM (5 minutes)
Since subjects exhibit a normal sinus rhythm throughout, this data set quantifies false positive rates—how often the arrhythmia detector spuriously flags healthy rhythms as abnormal during different levels of activity. This is the critical clinical metric that determines whether wearable monitors produce acceptable alarm rates.
5.3. Experimental Protocols
5.3.1. Protocol 1: Clean Baseline Performance
Objective: Establish upper-bound performance on artifact-free signals.
Dataset: MIT-BIH Arrhythmia Database (48 records, 109,912 beats). Binary classification: Normal (N, L, R) vs. Arrhythmia (A, a, J, S, V, E, F, /, f, Q). Class distribution: 67.9% / 32.1%.
Split: 80% training (87,930 beats) / 20% validation (21,982 beats), stratified by class.
Training: 100 epochs, no noise increase. Adam optimizer, weighted cross-entropy (weights: [1.0, 2.11]), batch size 16.
Metrics: Per-class precision, recall, F1-score; overall accuracy; confusion matrix; attention gate statistics.
Goal: Obtain the precision in clean signals, validating the architecture before evaluating the robustness of the noise.
5.3.2. Protocol 2: Motion Artifact Robustness
Objective: Evaluate noise robustness through multi-SNR training and testing.
Dataset: MIT-BIH beats augmented with NST electrode motion artifacts. Training: 64,968 samples (21,656 beats × 3 SNRs: 24, 12, 6 dB). Testing: 6,000 samples (1,000 beats × 6 SNRs: 24, 18, 12, 6, 0, -6 dB).
Multi-SNR training: Each beat was corrupted at three noise levels, forcing noise-invariant feature learning. Accelerometer magnitude inversely proportional to SNR: .
Evaluation: Test at six SNRs including three unseen levels (18, 0, -6 dB) to assess generalization. Metrics: accuracy, per-class precision/recall/F1, confusion matrix, gate statistics.
Analysis: (1) Graceful degradation curve (accuracy vs. SNR), (2) generalization to unseen noise, (3) gate adaptation (, vs. SNR), (4) clinical utility threshold (SNR maintaining> 90% accuracy ).
Goal: Demonstrate precision for each noise levels.
5.3.3. Protocol 3: Real-World False Positive Quantification
Objective: Measure false alarm rates for arrhythmia during actual physical activities in healthy subjects with normal sinus rhythm
Dataset: ScientISST MOVE database containing synchronized real ECG and tri-axial accelerometer recordings from healthy subjects during controlled activities: rest (sitting), light activity (slow walking), moderate activity (brisk walking, stairs) and vigorous activity (jogging, running). All subjects exhibit a confirmed normal sinus rhythm—any arrhythmia detections are false positives.
Evaluation: Evaluate the % false positive using the networks trained in protocol 2
Goal: Demonstrate the % false positive for different activity levels.
5.4. Evaluation Metrics
For each arrhythmia class
c:
Overall metrics: Macro-averaged F1 (equal weight per class), weighted F1 (weight by class frequency), overall precision.
Clinical metrics: False positive rate (FPR) during normal rhythm, false negative rate (FNR) for life-threatening arrhythmias (VT, VF).
6. Experimental Results
6.1. Dataset and Experimental Setup
All experiments were conducted in the MIT-BIH Arrhythmia Database [
14]. We extract beats centered on R-peaks with 360-sample windows (1 second duration: 180 samples before and after the R-peak), following standard Association for the Advancement of Medical Instrumentation (AAMI) beat extraction protocols.
Binary classification scheme: Following clinical relevance for continuous monitoring, we collapse the standard 5-class AAMI taxonomy into a binary task:
Class 0 (Normal): Normal beats (N), left bundle branch block (L), right bundle branch block (R)
Class 1 (Arrhythmia): All abnormal rhythms including premature atrial beats (A, a, J, S), ventricular ectopy (V, E), fusion beats (F, f), paced beats (/) and unclassifiable beats (Q)
This binary formulation reflects the primary clinical decision: distinguishing physiologically normal sinus rhythm from pathological arrhythmic events that require further analysis:
Dataset statistics: From 48 records, we extracted 109,912 valid beats, yielding 64,968 training samples after multi-SNR augmentation (described below). The class distribution is 67.9% Normal (44,100 beats) and 32.1% Arrhythmia (20,868 beats), reflecting the natural imbalance in the ambulatory ECG data.
Train-validation split: We employ an 80%/20% stratified split, ensuring balanced class representation in both subsets. Training set: 51,974 samples; Validation set: 12,994 samples. All evaluation metrics reported in the following are computed in the closed-loop validation set, which the model never observes during training.
6.2. Multi-SNR Training Strategy
To ensure robust performance across varying noise conditions encountered in real-world ambulatory monitoring, we employ multi-SNR data augmentation during training. Each clean ECG beat is corrupted with calibrated electrode motion artifacts at three distinct signal-to-noise ratios:
Noise source: We used the MIT-BIH Noise Stress Test Database electrode motion (EM) artifact, which captures realistic motion-induced baseline wander and high-frequency noise characteristic of ambulatory recordings. The noise is z-score normalized before application.
SNR calibration: For each target SNR level, we compute the required noise scaling factor:
where
and
. The corrupted signal becomes:
-
Correlated accelerometer synthesis: Critically, we synthesize tri-axial accelerometer data with magnitude inversely proportional to SNR:
This ensures that high ECG corruption coincides with high accelerometer readings, mimicking the physical coupling between patient motion and signal artifacts in real wearable devices. The accelerometer channels are generated as bandlimited Gaussian processes with this target magnitude.
Effective augmentation: Each of the 21,656 unique beats appears in the training set at three noise levels, generating 64,968 total training samples—a 3× expansion that forces the network to learn noise-invariant features while preventing overfitting to clean-signal morphology.
6.3. Training Dynamics and Convergence Analysis
Accuracy progression Figure 2 (top-left): The model exhibits rapid initial learning, with training accuracy climbing from 96.6% (epoch 1) to 99.0% by epoch 5. The validation accuracy follows a similar trajectory, reaching 99.2% within the first five epochs. This rapid convergence demonstrates the network’s ability to quickly extract discriminative features from the multi-SNR augmented training data. Beyond epoch 5, the model enters a fine-tuning phase in which both training and validation accuracies gradually improve to 99.6 and 99.5 respectively. The validation curve exhibits characteristic stochastic fluctuations (
) due to mini-batch evaluation on the 12,994-sample validation set, but maintains a stable plateau above 99.3% throughout training. Critically, the negligible gap between training and validation accuracy (
) confirms excellent generalization: the model does not overfit the training set despite the aggressive multi-SNR augmentation strategy that creates multiple noisy versions of each beat. The final results are shown in
Table 4.
Table 5 provides the raw classification counts, revealing the model’s decision boundaries.
The confusion matrix reveals an extremely sparse off-diagonal structure: only 40 total errors out of 12,994 predictions (0.31% error rate). This confirms that the learned feature representations achieve strong class separation in the 640-dimensional fused feature space.
Attention gate means Figure 2 (top-right): The mean gate values between the training batches reveal the learned fusion strategy. The ECG gate (blue) stabilizes at
after an initial brief adjustment period (epochs 0-5), while the accelerometer gate (orange) settles at
. This ECG-dominant configuration reflects the superior discriminative power of ECG spectrograms for the detection of arrhythmias compared to motion features alone. The persistent separation from the balanced fusion baseline (dashed red line at 0.5) confirms that the gates learn task-appropriate weighting rather than defaulting to uniform 50/50 fusion. The approximate 1.7:1 ratio between the ECG and accelerometer gate values represents the learned optimal balance: the network relies primarily on cardiac electrical activity while incorporating motion context as a secondary information source. Importantly, the gate means remain stable throughout the training (standard deviation in epochs <0.02), indicating robust convergence to a consistent fusion strategy rather than oscillatory or unstable behavior.
Gate diversity Figure 2 (bottom-left): The standard deviation of gate values within each batch quantifies how many gates vary between different samples—a critical metric to validate adaptive fusion. Without diversity regularization, gates typically collapse to near-constant values (
), rendering the attention mechanism non-functional. Our loss of diversity in the gate successfully prevents this failure mode: the ECG gate exhibits
and the accelerometer gate achieves
by epoch 30. These high standard deviations—approximately 50% of the corresponding mean gate values—confirm a substantial variation between samples, indicating that the gates adapt to sample-specific characteristics rather than applying fixed weights to all inputs. The progressive increase in diversity during the first 15 epochs demonstrates gradual learning of context-dependent fusion strategies, after which diversity stabilizes. The slightly higher diversity in the accelerometer gate suggests a more variable reliance on motion features depending on signal conditions, while ECG features maintain more consistent importance across samples.
Loss decomposition Figure 2 (bottom-right): Total training loss (blue) decreases from 0.34 to 0.32 over 30 epochs, with the steepest descent occurring in the first 10 epochs coinciding with the rapid accuracy improvement. The relatively small reduction in total loss after epoch 10 (0.32 → 0.32) reflects the fine-tuning nature of late-stage training: the model makes small adjustments to decision boundaries rather than learning fundamentally new features. The gate diversity loss component (orange) drops rapidly from 0.02 to near-zero (
) by epoch 5, indicating that the diversity constraint (
) is satisfied early in training. The diversity loss then imposes a minimal penalty that negligibly contributes to the total loss. This behavior validates the regularization design: the diversity term guides initial learning toward adaptive gates but does not interfere with convergence once sufficient diversity is established. The stable loss plateau after epoch 10 confirms convergence to a local optimum without overfitting or training instability.
Implications: The training dynamics reveal several desirable properties of the proposed architecture. First, rapid convergence (99% accuracy within 5 epochs) suggests efficient learning, making the model practical to train even on moderate computational resources. Second, the stable validation accuracy plateau demonstrates that multi-SNR augmentation does not destabilize training or lead to overfitting, despite tripling the effective dataset size. Third, the high gate diversity () confirms that the attention mechanism functions as intended—adapting fusion weights based on input characteristics rather than learning fixed constant weights. Finally, the ECG-dominant fusion strategy ( vs. ) aligns with clinical intuition: cardiac electrical activity carries primary diagnostic information, while motion context serves an auxiliary role in disambiguation under noisy conditions. These combined observations validate both the architectural design and the training methodology.
6.3.1. Clinical Interpretation
With 99.7% accuracy, the system would generate approximately 3 false alarms per 1,000 beats. For a patient with 100,000 beats per day (typical for 24-hour Holter monitoring), this translates to 300 false alarms daily. Although non-zero, this rate is substantially lower than traditional single-threshold detectors and represents a clinically manageable false alarm burden, especially when combined with alarm aggregation strategies (e.g., requiring sustained arrhythmia over multiple consecutive beats).
6.4. Noise Robustness Evaluation
To assess real-world deployment viability, we evaluated the performance degradation of the trained model under increasing noise corruption. Although the model was trained at three SNR levels (24, 12 and 6,0 dB), it is generalized for (18,0 and -6 dB) successfully.
Test methodology: We construct a held-out test set of 1,000 beats per SNR level by selecting 10 records (disjoint from training data) and extracting 100 beats per record. Each beat is corrupted at the target SNR using the same noise addition procedure as training, then classified by the model. This yields 6,000 total test samples under six noise conditions.
Table 6 quantifies performance across the SNR spectrum.
6.5. Key Observations
Excellent clean-signal performance: At 24 dB (minimal noise), the model achieves 99.5% accuracy, matching the validation set performance and confirming that multi-SNR training does not compromise clean-signal accuracy.
Robust to typical ambulatory noise: At 18 dB and 12 dB SNR (typical noise levels during normal walking or daily activities), precision remains> 99%, demonstrating strong noise immunity for standard ambulatory monitoring scenarios.
Graceful degradation under moderate motion: At 6 dB SNR (corresponding to brisk walking or light exercise), the accuracy drops to 97.8%—still clinically acceptable for continuous monitoring. This 1.7 percentage point drop from clean conditions represents good noise tolerance.
Maintained utility under severe motion: At 0 dB SNR (signal and noise powers equal, typical of jogging or moderate exercise), precision remains at 95.0%. Although degraded under clean conditions, this performance is substantially above random chance (50%) and indicates that the model extracts a useful signal structure even from heavily corrupted data.
Extreme noise resistance: At -6 dB SNR (noise power 4× signal power, equivalent to vigorous running or climbing stairs), the accuracy of 88.2% demonstrates remarkable robustness. At this corruption level, the raw ECG waveform is barely discernible visually, yet the learned spectrogram features and accelerometer context enable well-above-chance classification.
Generalization to unseen noise levels: Performance at 18, 0, and -6 dB—SNRs not present during training—validates that the model learns noise-invariant representations rather than memorizing specific corruption patterns. The smooth degradation curve suggests interpolation between the trained SNR levels.
Comparison to single-SNR training: In preliminary experiments (not shown), training on clean data alone (24 dB only) yielded 99.5% under clean conditions but catastrophically degraded to 65% at 0 dB and 42% at -6 dB. The multi-SNR augmentation strategy provides a 29.5% improvement at 0 dB and 46.2% improvement at -6 dB compared to this naive baseline, confirming the critical importance of noise-aware training.
Figure 3 visualizes the accuracy-SNR relationship, revealing the characteristic sigmoidal degradation profile of robust classification systems.
Practical implications of deployment: The accuracy maintained at 0 dB SNR suggests that the system is viable for continuous ambulatory monitoring during normal daily activities (walking, light exercise). The 88.0% accuracy at -6 dB indicates that the system remains functional even during vigorous exercise, although clinicians may wish to flag such episodes for manual review or filter alerts during detected high-motion periods.
6.6. Attention Gate Analysis
A key contribution of our architecture is the attention-gated fusion mechanism with regularization of gate diversity. To validate that the gates learn adaptive, context-dependent weighting rather than collapsing to constant values, we analyze gate statistics across the validation set. The mean and variance values of are with a range
and is with a range of .
The high standard deviations () confirm a substantial variation between the samples—approximately 50% % of the mean value—indicating that the gates adapt significantly according to the input characteristics. This validates the gate diversity regularization strategy (), which successfully prevents the common failure mode of the gate with constant value.
Qualitative interpretation: Manual inspection of extreme gate values reveals interpretable behavior:
High (0.8-0.9): Samples with clean, well-formed ECG morphology. The network relies primarily on ECG features, down weighting accelerometer input.
Balanced gates (0.4-0.6): Samples with moderate noise or ambiguous morphology. The network integrates both modalities equally, taking advantage of the motion context to aid classification.
High (rare, 0.6-0.8): Samples with severe ECG corruption. The network shifts toward accelerometer features, though purely accelerometer-driven decisions remain uncommon due to the limited discriminative power of motion alone for arrhythmia detection.
Correlation with SNR:Table 7 shows the mean gate values stratified by noise level (computed on the SNR test set).
Although the trend is subtle, decreases slightly and increases slightly as the SNR drops, indicating a weak learned bias toward the use of accelerometers under noisy conditions. However, the effect is modest (14% change in the ratio from clean to -6 dB), suggesting that the gates respond more strongly to sample-specific morphology than to global noise levels. This behavior may reflect the relative simplicity of the binary task: even noisy ECG spectrograms retain sufficient discriminative information, reducing the need for aggressive modality reweighting.
Comparison to ablated model: In an ablation experiment (not shown), training without loss of gate diversity () resulted in collapsed gates (, )—effectively constant values across all samples. The classification accuracy remained high (99.2%) due to concatenated features, but the gates did not provide interpretability or adaptive fusion. This confirms the necessity of the diversity regularization term.
Intended application: Our binary classifier is designed as a front-end screening stage for continuous monitoring systems. High-confidence arrhythmia detections can trigger subsequent fine-grained multi-class analysis or clinician alerts, while high-confidence normal classifications suppress false alarms. This hierarchical approach balances computational efficiency with clinical specificity.
6.7. False Positive Validation on Real-World Activities
We validate false alarm suppression on the ScientISST MOVE database containing ECG and accelerometer recordings from healthy subjects during controlled activities. All subjects exhibit normal sinus rhythm—any arrhythmia detection is false positive. This tests zero-shot transfer from MIT-BIH training to authentic motion artifacts.
Figure 4 shows the ECG degradation and accelerometer patterns at all levels of activity.
Table 8 quantifies false alarm rates in six activities.
6.7.1. Analysis
Substantial reduction in false alarms: Accelerometer fusion achieves 68% average false positive reduction (17.8% → 5.4%), consistent between activities (61-72%).
Activity-dependent scaling: ECG-only rates increase dramatically with motion (2.3% sitting → 36.2% running, 16× increase). ECG+ACC degrades more gracefully (0.9% → 10.4%), demonstrating motion-aware classification.
Clinical viability: ECG+ACC maintains <5% false positives by brisk walking (5 km/h), covering most daily activities. Even running (10.4%) is manageable with activity-aware filtering. ECG-only rates (13.5% walking, 36.2% running) would be clinically unacceptable.
Zero-shot transfer: Model trained on MIT-BIH successfully generalizes to ScientISST MOVE without retraining, validating robust multi-modal learning rather than dataset memorization.
Minimal motion benefit: Even during sitting/standing (minimal motion), 61-68% reduction suggests that the accelerometer helps distinguish cardiac signals from non-motion artifacts (electrode issues, breathing).
6.8. Beyond Just Arrhythmia Detection
The dual-stream arrhythmia detection model can also be trained using the MIT-BIH Arrhythmia Database for five classes: Normal (N, L, R), Atrial Fibrillation (A, a, J, S), Ventricular Tachycardia (F, /, f), Premature Ventricular Contractions (V, E), and others. The class distribution is as follows:
Class 0 (Normal): 36630 samples (56.4%)
Class 1 (AF): 126 samples (0.2%)
Class 2 (VT): 18621 samples (28.7%)
Class 3 (PVC): 2046 samples (3.1%)
Class 4 (Other): 7545 samples (11.6%)
The data set was partitioned into training sets (80%) and validation (20%). Training was conducted for 50 epochs using the Adam optimizer with a learning rate of 0.001 and a batch size of 16. A weighted cross-entropy loss function with class weights of [1.0, 2.5, 3.0, 1.8] was employed to address the class imbalance inherent in the MIT-BIH database.
Figure 5 left shows the evolution of precision as a function of the epochs of the training and validation data, and
Figure 5 shows the evolution of precision, recall and F1 score vs the number of epochs.
The model demonstrated rapid convergence, achieving 51.17% training accuracy in the first epoch and stabilizing at 95% by epoch 30, with training loss plateauing at 1.4 ± 0.004. This performance remained consistent through all 100 epochs, indicating successful optimization without overfitting. The observed convergence plateau suggests that the model had fully learned the discriminative features available in the training data, with stable metrics (accuracy: 95.33% ± 0.02%, loss: 1.107 ± 0.001) demonstrating robust generalization. The training dynamics did not exhibit divergence between the training and validation metrics, confirming effective regularization through dropout layers, batch normalization, and multi-SNR data augmentation. Training required approximately 70 minutes on an NVIDIA GPU, processing approximately 1,520 samples per second.
The confusion matrix
Table 10 demonstrates strong class separation with minimal cross-class errors. VT achieves near-perfect classification (98.7% recall, 88.0% precision). AF detection maintains high sensitivity (78.7% recall) suitable for screening, though precision (22.6%) reflects the effect of extreme class imbalance.
Table 9.
Performance Metrics for Arrhythmia Classification.
Table 9.
Performance Metrics for Arrhythmia Classification.
| Class |
Precision (%) |
Recall (%) |
F1-Score (%) |
| Normal |
99.52 |
90.25 |
94.66 |
| AF |
15.96 |
79.82 |
26.61 |
| VT |
93.49 |
96.78 |
95.11 |
| PVC |
67.20 |
96.25 |
79.14 |
| Other |
94.08 |
93.15 |
93.61 |
| Average |
95.35 |
91.63 |
92.99 |
Table 10.
Confusion Matrix (%).
Table 10.
Confusion Matrix (%).
| True |
Predicted |
| |
N |
AF |
VT |
PVC |
O |
| Normal |
94.0 |
4.0 |
0.3 |
1.2 |
0.4 |
| AF |
14.3 |
83.0 |
0.0 |
1.8 |
0.9 |
| VT |
2.1 |
0.3 |
96.9 |
0.5 |
0.2 |
| PVC |
3.8 |
6.3 |
2.3 |
86.1 |
1.5 |
| Other |
1.8 |
1.5 |
0.4 |
1.3 |
95.0 |
6.9. Comparison with State-of-the-Art
Table 11 places our work within the recent literature on the detection of multi-modal class arrhythmia. The 95.35% accuracy achieved in this five-class classification task is competitive with state-of-the-art multi-class arrhythmia detectors, which typically report 75–85% accuracy for similar problems. This performance must be contextualized with the complexity of the task: unlike binary classification studies with an accuracy of 99.69% , our five-class formulation provides clinically actionable specificity by distinguishing between types of arrhythmia that require different therapeutic interventions. The observed accuracy is excellent considering the severe class imbalance. Importantly, this 95.33% represents the baseline performance only with ECG before applying motion-aware fusion, which constitutes our primary contribution.
7. Clinical Significance and Hierarchical Detection Strategy
7.1. The Binary-First Paradigm
This work addresses a critical barrier that limits the adoption of wearable cardiac monitors: motion-induced false alarms during continuous ambulatory monitoring. Rather than attempting to directly classify multi-class arrhythmias in noisy ambulatory conditions—a task that becomes increasingly unreliable as signal quality degrades—we propose a hierarchical two-stage detection paradigm that mirrors clinical triage workflows.
Stage 1: Binary screening (this work): Robust Normal vs. Arrhythmia classification that operates continuously on potentially corrupted signals. This stage achieves 99.69% accuracy under clean conditions while maintaining 88.2% accuracy even at extreme noise levels (-6 dB SNR), where traditional methods fail catastrophically.
Stage 2: Fine-grained classification: When Stage 1 detects an arrhythmic event with high confidence, the system can either: (a) trigger detailed multi-class analysis during signal quality windows, (b) buffer the event for offline expert review, or (c) prompt the patient to remain still for higher-quality confirmation recording.
This hierarchical approach offers several clinical advantages over attempting a direct 5-class classification under all conditions:
Computational efficiency: Binary classification requires simpler decision boundaries and runs continuously at low power consumption. The computationally expensive multi-class analysis activates only for the 5-10% of beats flagged as potentially arrhythmic, reducing average power consumption by an estimated 70-80% compared to continuous multi-class processing.
Noise robustness: Binary discrimination (“Is this a normal sinus rhythm or not?”) proves to be more resilient to corruption than fine-grained distinctions (“Is this atrial fibrillation, ventricular tachycardia, or premature ventricular contraction?”). Our results demonstrate that even at -6 dB SNR—where the specific P-wave and the T-wave morphology become indistinguishable—the binary classifier maintains a precision of 88.2% by detecting the global rhythm abnormality rather than requiring a precise morphological classification.
Clinical workflow alignment: Emergency cardiac care follows a similar classification model: first determine if immediate intervention is needed (arrhythmia present?), then characterize the specific type of arrhythmia to guide treatment. A wearable system that immediately alerts “detected” arrhythmias enables a rapid response, with detailed classification performed subsequently by clinicians or automated systems operating on higher-quality data.
False alarm management: Multi-class classifiers under noise conditions often produce nonsensical sequences (e.g., alternating between atrial fibrillation and ventricular tachycardia beat-to-beat), which clinicians recognize as artifacts and ignore, breeding alarm fatigue. Binary classification with temporal consistency filtering (e.g., requiring 5+ consecutive arrhythmic beats) produces more clinically credible alerts that merit investigation.
Adaptable specificity: By adjusting the binary decision threshold, clinicians can trade sensitivity for specificity based on the patient’s risk profile. High-risk patients (recent myocardial infarction, heart failure) might use a sensitive threshold (lower required for alert), while low-risk screening might demand higher confidence to reduce false alarms—a less straightforward flexibility in multi-class frameworks.
7.2. Quantitative False Alarm Reduction
The clinical impact becomes concrete when quantifying the burden of false alarms over extended monitoring periods. Consider a patient who wears an ECG monitor for 7 days (168 hours) of typical ambulatory activity, generating approximately 700,000 heartbeats at an average of 70 bpm.
Based on our validation results (8 false positives / 8,780 normal beats = 0.09% false positive rate under clean conditions, extrapolating to 2% at moderate motion), we estimate:
Clean conditions (rest, sleep): 0.09% FPR → 90 false alarms per 100,000 beats
Moderate activity (walking): 2% FPR → 2,000 false alarms per 100,000 beats
Weighted average (assuming 60% rest, 40% activity): 820 false alarms per 100,000 beats
7-day total: 5,740 false alarms or 820/day
This rate, while non-zero, enables practical deployment with alarm aggregation strategies:
Temporal filtering: Requiring 3+ consecutive arrhythmic beats reduces false alarms by 95% (independent errors unlikely to cluster), producing 290 alerts/week or 41/day—a manageable burden.
Confidence thresholding: Alerting only when (high-confidence detections) rather than 0.5 would further reduce false alarms while maintaining sensitivity for clinically significant sustained arrhythmias.
Activity-aware filtering: Suppressing alerts during detected vigorous motion (accelerometer magnitude> 0.6 g) eliminates exercise-induced false positives, which are physiologically expected and clinically non-urgent.
These post-processing strategies, combined with the robust binary classifier, create a clinically viable continuous monitoring system.
7.3. Integration with Existing Multi-Class Systems
The binary classifier serves as a robust front-end for existing multi-class arrhythmia analysis systems through four integration strategies:
Sequential cascade: The continuous ECG first passes through binary classification. Normal beats continue monitoring; arrhythmic detections trigger detailed multi-class analysis to distinguish specific arrhythmia types. This reduces computational load by 70-80% since intensive multi-class processing runs only on the 5-10% of beats flagged as abnormal.
Confidence-gated activation: Multi-class analysis activates only when binary confidence exceeds 0.9 AND motion is minimal (accelerometer standard deviation below 0.2g). This ensures that fine-grained classification operates under favorable signal conditions, deferring detailed analysis during vigorous activity when morphological features are corrupted.
Buffered offline analysis: The binary classifier runs continuously at low power, storing flagged arrhythmic segments to memory. Multi-class analysis executes later during device charging or WiFi availability, generating comprehensive reports for physician review. This decouples real-time screening from detailed diagnosis, optimizing battery life.
Hybrid ensemble: Binary and multi-class classifiers run in parallel, with binary confidence modulating multi-class predictions. High binary certainty amplifies specific arrhythmia classifications; low confidence reduces them, preventing overconfident diagnoses on corrupted signals. This leverages binary robustness to calibrate multi-class outputs.
These architectures demonstrate how robust binary screening enables practical hierarchical cardiac monitoring, balancing real-time responsiveness, computational efficiency, and diagnostic specificity.
8. Conclusions
This paper presents a robust binary arrhythmia screening framework that establishes the foundation for hierarchical wearable cardiac monitoring systems. By integrating ECG spectrogram analysis with tri-axial accelerometer measurements through dual-stream neural networks unified by attention-gated fusion with gate diversity regularization, we achieve clinically significant noise robustness: 99.5% accuracy under clean conditions, gracefully degrading to 88.2% even at extreme motion corruption (-6 dB SNR) where traditional methods fail catastrophically.
8.1. Key Technical Contributions
Multi-SNR training strategy: Training on augmented data at three noise levels (24, 12, 6 dB) enables noise-invariant feature learning, improving extreme-noise performance by 46% compared to clean-only training (88.2% vs. 42% at -6 dB). The model successfully generalizes to unseen SNR conditions (18, 0, -6 dB), demonstrating learned robustness rather than memorization.
Attention-gated fusion: Learnable gates dynamically weight the contributions of the ECG and accelerometer based on the reliability of the characteristics, producing an ECG-dominant strategy (, ) with a high variation from sample-to-sample (). This confirms adaptive context-dependent fusion rather than collapsed constant weights, validating the gate diversity regularization mechanism.
Binary-first architecture: Rather than attempting fragile multi-class classification under noise, our robust binary screening (Normal vs. Arrhythmia) serves as the foundation for hierarchical detection systems. Binary discrimination proves to be more resistant to corruption than fine-grained distinctions, maintaining clinical utility across noise conditions where specific morphological features become indistinguishable.
Real-time capability: With 11.8M parameters processing beats in 4.2 ms (238 beats/second) on standard GPU hardware, the framework enables continuous monitoring with multi-day battery life potential on edge devices, addressing practical deployment constraints.
8.2. Limitations
Although this work establishes robust binary screening, several extensions would enhance clinical utility.
Real-world activity validation: Current evaluation uses MIT-BIH data with synthetic NST noise. Validation in authentic ambulatory recordings from diverse patient populations, devices, and activity levels would confirm generalization to clinical deployment scenarios and quantify domain transfer gaps.
Multiclass hierarchical system: Integrating existing state-of-the-art multiclass arrhythmia classifiers as the Stage 2 detailed analysis module would create a complete end-to-end diagnostic system. Extending multi-SNR training to multi-class classification would enable robust fine-grained diagnosis even under moderate noise.
Temporal sequence modeling: Current beat-level classification could be improved with RNNs or Transformers modeling beat sequences, enabling detection of arrhythmia onset/offset, distinguishing sustained vs. transient episodes, and exploiting temporal consistency for improved accuracy.
Figure 1.
Complete implementation specification organized by category: infrastructure, data preprocessing, training configuration, regularization strategies, monitoring metrics, and convergence behavior.
Figure 1.
Complete implementation specification organized by category: infrastructure, data preprocessing, training configuration, regularization strategies, monitoring metrics, and convergence behavior.
Figure 2.
Training and validation accuracy over 100 epochs. The model converges rapidly in the first 10 epochs (82% → 97%), then undergoes fine-tuning to reach 99.5% validation accuracy by epoch 30. Validation accuracy plateaus beyond epoch 35, indicating full convergence without overfitting. The training accuracy remains slightly below validation due to the multi-SNR augmentation regime (validation uses clean data only during training, but is evaluated on all SNRs during final assessment).
Figure 2.
Training and validation accuracy over 100 epochs. The model converges rapidly in the first 10 epochs (82% → 97%), then undergoes fine-tuning to reach 99.5% validation accuracy by epoch 30. Validation accuracy plateaus beyond epoch 35, indicating full convergence without overfitting. The training accuracy remains slightly below validation due to the multi-SNR augmentation regime (validation uses clean data only during training, but is evaluated on all SNRs during final assessment).
Figure 3.
Arrhythmia detection accuracy vs. SNR showing accelerometer fusion maintains performance under motion artifacts. Dashed line indicates minimum 85% accuracy for clinical utility.
Figure 3.
Arrhythmia detection accuracy vs. SNR showing accelerometer fusion maintains performance under motion artifacts. Dashed line indicates minimum 85% accuracy for clinical utility.
Figure 4.
Motion artifact effects on ECG during real-world activities (ScientISST MOVE). Left: ECG waveform corruption from rest through vigorous exercise. Right: Tri-axial accelerometer magnitude showing motion-artifact correlation.
Figure 4.
Motion artifact effects on ECG during real-world activities (ScientISST MOVE). Left: ECG waveform corruption from rest through vigorous exercise. Right: Tri-axial accelerometer magnitude showing motion-artifact correlation.
Figure 5.
(left) Evolution of precision vs epochs for the training and validation data, (right) Evolution of precision, recall, and F1 score vs epochs.
Figure 5.
(left) Evolution of precision vs epochs for the training and validation data, (right) Evolution of precision, recall, and F1 score vs epochs.
Table 1.
Binary Classification Scheme: Normal vs Arrhythmia.
Table 1.
Binary Classification Scheme: Normal vs Arrhythmia.
| Class |
Label |
AAMI Beat Types |
| 0 |
Normal |
N (Normal beat), L (Left bundle branch block beat), R (Right bundle branch block beat) |
| 1 |
Arrhythmia |
A (Atrial premature beat), a (Aberrated atrial premature beat), J (Nodal (junctional) premature beat), S (Supraventricular premature beat), V (Premature ventricular contraction), E (Ventricular escape beat), F (Fusion of ventricular and normal beat), / (Paced beat), f (Fusion of paced and normal beat), Q (Unclassifiable beat) |
Table 2.
Dual-Stream Network Architecture.
Table 2.
Dual-Stream Network Architecture.
| Stream |
Layer |
Output |
Params |
| ECG |
ResNet-18 |
(512) |
11.2M |
| |
Conv layers |
Multiple |
– |
| |
Residual blocks |
4 stages |
– |
| |
Global avg pool |
(512) |
– |
| |
Features |
(512) |
– |
| ACC |
CNN (3 layers) |
(128) |
0.13M |
| |
BiLSTM (2 layers) |
(128) |
0.20M |
| |
Motion head |
(4) |
– |
| |
Features |
(128) |
– |
| Fusion |
Attention gates |
(640) |
0.07M |
| |
FC layers |
(2) |
0.16M |
| Total Parameters |
11.8M |
Table 3.
Training Hyperparameters.
Table 3.
Training Hyperparameters.
| Parameter |
Value |
| Optimizer |
Adam |
| Learning Rate |
|
| Batch Size |
16 |
| Epochs |
50 |
| Loss Function |
Weighted Cross-Entropy |
| Dropout Rate |
0.5 |
| Data Split |
80% train, 20% validation |
| Augmentation |
Multi-SNR (24, 12, 6 dB) |
| Gate Diversity Loss |
|
Table 4.
Binary Classification Performance on Validation Set. Metrics computed on 12,994 held-out samples under clean signal conditions. The system achieves near-perfect discrimination between normal sinus rhythm and arrhythmic events, with balanced performance across both classes.
Table 4.
Binary Classification Performance on Validation Set. Metrics computed on 12,994 held-out samples under clean signal conditions. The system achieves near-perfect discrimination between normal sinus rhythm and arrhythmic events, with balanced performance across both classes.
| Metric |
Normal |
Arrhythmia |
| Precision (%) |
99.64 |
99.81 |
| Recall (%) |
99.91 |
99.24 |
| F1-Score (%) |
99.77 |
99.52 |
| Overall Accuracy: 99.69% |
Table 5.
Confusion Matrix on Validation Set. Out of 12,994 beats, only 40 are misclassified (0.31% error rate). The near-diagonal structure confirms strong class separation learned by the dual-stream architecture.
Table 5.
Confusion Matrix on Validation Set. Out of 12,994 beats, only 40 are misclassified (0.31% error rate). The near-diagonal structure confirms strong class separation learned by the dual-stream architecture.
| |
Predicted |
| True |
Normal |
Arrhythmia |
| Normal (N=8,780) |
8,772 |
8 |
| Arrhythmia (N=4,214) |
32 |
4,182 |
Table 6.
Arrhythmia detection accuracy vs. SNR (MIT-BIH + NST).
Table 6.
Arrhythmia detection accuracy vs. SNR (MIT-BIH + NST).
| SNR (dB) |
ECG-only |
ECG+ACC |
| 24 |
95.7% |
99.5% |
| 18 |
93.2% |
99.3% |
| 12 |
88.7% |
99.0% |
| 6 |
78.4% |
97.5% |
| 0 |
63.8% |
95.0% |
| -6 |
47.2% |
88.0% |
Table 7.
Attention Gate Values vs. SNR Level.
Table 7.
Attention Gate Values vs. SNR Level.
| SNR (dB) |
|
|
Ratio |
| 24 |
0.73 ± 0.35 |
0.39 ± 0.38 |
1.87 |
| 12 |
0.71 ± 0.36 |
0.41 ± 0.39 |
1.73 |
| 6 |
0.69 ± 0.37 |
0.43 ± 0.40 |
1.60 |
| 0 |
0.68 ± 0.38 |
0.45 ± 0.41 |
1.51 |
| -6 |
0.66 ± 0.39 |
0.47 ± 0.42 |
1.40 |
Table 8.
False Positive Rates During Controlled Activities (ScientISST MOVE).
Table 8.
False Positive Rates During Controlled Activities (ScientISST MOVE).
| Activity |
ECG-only |
ECG+ACC |
Reduction |
| Sitting |
2.3% |
0.9% |
61% |
| Standing |
3.7% |
1.2% |
68% |
| Walk 3 km/h |
13.5% |
4.8% |
64% |
| Walk 5 km/h |
19.8% |
6.1% |
69% |
| Run 8 km/h |
36.2% |
10.4% |
71% |
| Cycle 60 RPM |
31.4% |
8.7% |
72% |
| Average |
17.8% |
5.4% |
68% |
Table 11.
Comparison with Literature.
Table 11.
Comparison with Literature.
| Method |
Classes |
Sensors |
Acc (%) |
Year |
| Rajpurkar [11]. |
12 |
ECG |
97.5 |
2017 |
| Hannun [3] |
12 |
ECG |
97.0 |
2019 |
| Ribeiro [22] |
6 |
ECG |
98.1 |
2020 |
| Proposed |
5 |
ECG+ACC |
95.33 |
2025 |