Submitted: 09 June 2026
Posted: 10 June 2026
Abstract
Keywords:
1. Introduction
2. Background and Related Work
2.1. IoT Systems and Challenges
2.2. Taxonomy of FDD Methods
- Statistical and rule-based detectors apply fixed thresholds, control charts, or pointwise statistical tests (e.g., z-score, CUSUM, Shewhart charts) to individual signals. These methods are computationally trivial, require no training data, and produce interpretable alerts, which is why they remain the default in many embedded and IoT systems [9,10]. For systems where reliability requirements drive the design from the hardware layer upward, these conventional detectors are necessary but rarely sufficient [11,12]. Their limitations are also well known [1,2]:
  - samples are assumed to be independent and are evaluated one at a time;
  - the detectors are insensitive to gradual drift and temporally structured deviations; and
  - thresholds must be tuned manually, and the tuned values do not generalize across devices or operating conditions.
- Supervised classification methods learn a mapping from labeled fault examples to fault categories using algorithms ranging from support vector machines and random forests to deep neural networks. When sufficient labeled data exists, these methods can achieve high accuracy and fine-grained fault categorization. However, supervised classifiers are practically constrained, particularly for IoT systems, by the requirement for labeled data. Deployed IoT systems rarely accumulate the volume and variety of labeled fault events necessary to train competitive supervised models, and the fault surface is often unknown at the time of deployment [3,4]. Industrial fault diagnosis reviews consistently identify label scarcity as the primary obstacle to deploying supervised learning in real-world monitoring contexts [3].
- Unsupervised representation learning methods, such as autoencoders, variational autoencoders, generative adversarial networks, and self-supervised transformers, learn a compressed representation of normal behavior and flag inputs that the model reconstructs poorly or to which it assigns low likelihood. These methods are powerful for high-dimensional or visually complex telemetry, but they introduce significant deployment overhead: nontrivial architecture and hyperparameter selection, GPU training requirements, opaque anomaly scores that are difficult to interpret, and sensitivity to distribution shifts within the training window [14]. Recent comparative studies of multivariate time-series classification note that no single representation-learning approach dominates across datasets, and that simpler baselines often remain competitive [16].
- Similarity-based methods compare new observations to a library of reference examples using a distance or similarity measure, flagging observations that are dissimilar to all references. This family includes nearest-neighbor methods, kernel density estimators, and shape-matching methods such as Dynamic Time Warping. Similarity-based methods are label-free (the reference library requires only assumed-healthy data), computationally efficient, and naturally interpretable; the closest reference example provides a direct explanation for why an observation was flagged. The main limitations are sensitivity to feature scaling, the need to manage the reference library size, and reduced effectiveness when faults preserve the overall shape of the signal.
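The similarity-based idea can be illustrated with a minimal sketch. It assumes fixed-length windows and plain Euclidean distance; the function name `nearest_reference_score` and the toy data are hypothetical and are not taken from the paper. The returned index is what makes the alert explainable: it points to the closest assumed-healthy example.

```python
import numpy as np

def nearest_reference_score(x, references):
    """Return (distance, index) of the reference window closest to x.

    Euclidean distance is an assumption made for this sketch; any
    distance or similarity measure could be substituted.
    """
    refs = np.asarray(references, dtype=float)
    x = np.asarray(x, dtype=float)
    dists = np.linalg.norm(refs - x, axis=1)  # distance to every reference
    idx = int(np.argmin(dists))
    return float(dists[idx]), idx

# Hypothetical library of assumed-healthy windows (one per row).
references = np.array([
    [0.0, 1.0, 0.0, -1.0],
    [0.0, 0.9, 0.1, -1.0],
    [0.1, 1.1, 0.0, -0.9],
])
normal_like = np.array([0.0, 1.0, 0.1, -1.0])   # close to the library
faulty_like = np.array([3.0, 3.0, 3.0, 3.0])    # dissimilar to every reference

score_ok, match_ok = nearest_reference_score(normal_like, references)
score_bad, match_bad = nearest_reference_score(faulty_like, references)
```

An observation is flagged when its score exceeds a threshold learned from healthy data; because the distance is computed on raw values, rescaling one feature changes the ranking, which is the scaling sensitivity noted above.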
2.3. Dynamic Time Warping in Modern Time-Series Analytics
3. Motivation and Problem Formulation
4. Method: DTW-Based System-Level Fault Detection for IoT Telemetry
4.1. Dataset and Experimental Design
4.2. Preprocessing and Normalization
4.3. Layered Feature Construction
- Hardware (HW): the four primary sensor channels (temperature, humidity, light, voltage) after robust normalization. These form the direct observables of the physical system.
- Communication (COM): health proxies derived from the telemetry stream. In the implementation, these include per-channel stuck-value rates (rolling detection of repeated identical readings, which may indicate transport-layer freezes) and low-variance indicators (rolling windows where variance drops below the 10th percentile of the channel's variance distribution, suggesting irregular sampling or data staleness). These proxies approximate missingness and gap behavior without requiring packet-level instrumentation.
- Software/Firmware (SW/FW): derived key performance indicators (KPIs) that capture operational dynamics not directly modeled by physics-of-failure. For each sensor channel, five rolling-window features are computed: rolling variance, absolute change rate, duty cycle (fraction of time above median), signal energy (sum of squared values), and zero-crossing rate. These features are sensitive to behavioral regime changes such as firmware duty-cycle shifts, workload transitions, or sensor degradation patterns that alter signal dynamics rather than magnitude.
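The COM and SW/FW features described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: the window length of 12 samples, the use of the channel-wide median as the reference level for duty cycle and zero crossings, and the function names are all assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def swfw_features(x, window=12):
    """Five rolling-window KPIs for one channel (one value per full window)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)                     # assumption: channel-wide median
    win = sliding_window_view(x, window)   # shape: (len(x) - window + 1, window)
    sign = np.sign(win - med)              # assumption: crossings of the median level
    return {
        "rolling_var": win.var(axis=1),
        "abs_change_rate": np.abs(np.diff(win, axis=1)).mean(axis=1),
        "duty_cycle": (win > med).mean(axis=1),
        "energy": (win ** 2).sum(axis=1),
        "zero_cross_rate": (sign[:, 1:] != sign[:, :-1]).mean(axis=1),
    }

def com_proxies(x, window=12):
    """Stuck-value rate and low-variance indicator for one channel."""
    x = np.asarray(x, dtype=float)
    win = sliding_window_view(x, window)
    roll_var = win.var(axis=1)
    return {
        # fraction of consecutive readings in the window that are identical
        "stuck_rate": (np.diff(win, axis=1) == 0).mean(axis=1),
        # variance below the 10th percentile of this channel's rolling variance
        "low_variance": roll_var < np.percentile(roll_var, 10),
    }

# Synthetic channel with a short frozen segment (hypothetical data).
t = np.linspace(0.0, 8.0 * np.pi, 200)
x = np.sin(t)
x[120:140] = x[119]

feats = swfw_features(x)
com = com_proxies(x)
```

With four sensor channels, two COM proxies and five SW/FW features per channel give 8 + 20 derived channels, which together with the four HW channels matches the 32 channels quoted in Section 6.2.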
4.4. Episode Segmentation
4.5. Reference Library of Normal Templates
4.6. Constrained DTW Scoring
- Per-channel best-match distance: For each channel, the DTW distances to all templates are computed and the best-match score is taken as a low percentile (5th percentile) of these distances rather than the strict minimum, improving robustness to individual poorly matched templates.
- Layer aggregation: Channel scores are grouped by layer (HW, COM, SW/FW) and averaged within each layer to produce per-layer scores.
- Standardization: Each channel's best-match DTW distance is standardized against its own training-normal distribution (a robust z-score using the training median and MAD). This places all channels on a comparable scale, so the informative shape features (rolling variance, change rate, zero-crossing rate, duty cycle) are not overwhelmed by the high-magnitude signal-energy features. The episode score is the mean of these standardized distances over the 28 non-energy channels, that is, all channels except the four per-sensor signal-energy features (Section 4.3). Excluding the energy features, which carry little shape information, gives the strongest separation (Section 5.2).
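The per-channel scoring steps can be sketched for a single channel as below. Several details are assumptions rather than statements about the paper's code: the absolute-difference local cost, the 1.4826 MAD scaling factor, the leave-one-out construction of the training distribution, and all function names. The band half-width `w=6` and episode length of 60 follow the tuned configuration quoted in Section 6.2.

```python
import numpy as np

def dtw_band(a, b, w):
    """DTW distance between two 1-D sequences under a Sakoe-Chiba band of half-width w."""
    n, m = len(a), len(b)
    w = max(w, abs(n - m))                    # band must reach the end cell
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(a[i - 1] - b[j - 1])   # assumption: absolute-difference cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def best_match(episode, templates, w, pct=5.0):
    """Low-percentile (default 5th) DTW distance to the template library."""
    dists = [dtw_band(episode, tpl, w) for tpl in templates]
    return float(np.percentile(dists, pct))

def robust_z(value, train_values):
    """Robust z-score from the training median and MAD (1.4826 scaling assumed)."""
    train_values = np.asarray(train_values, dtype=float)
    med = np.median(train_values)
    mad = np.median(np.abs(train_values - med))
    return float((value - med) / (1.4826 * mad + 1e-12))

# Hypothetical single-channel example: 15 healthy templates of length 60.
t = np.linspace(0.0, 2.0 * np.pi, 60)
templates = [np.sin(t + s) for s in np.linspace(-0.2, 0.2, 15)]

# Training-normal distribution, scored leave-one-out to avoid self-matches.
train = np.array([
    best_match(tpl, templates[:i] + templates[i + 1:], w=6)
    for i, tpl in enumerate(templates)
])

normal = np.sin(t + 0.05)                          # healthy-looking episode
drifting = np.sin(t) + np.linspace(0.0, 2.0, 60)   # episode with injected drift

z_normal = robust_z(best_match(normal, templates, w=6), train)
z_faulty = robust_z(best_match(drifting, templates, w=6), train)
```

Under this reading, the episode score would be the mean of such standardized values over the scored channels.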
4.7. Robust Threshold Learning and Temporal Validation
- (i) a high-percentile rule;
- (ii) an interquartile range (IQR) rule; and
- (iii) a standard deviation (sigma) rule.
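A minimal sketch of the three rules follows. The default percentile and multipliers (95, 1.5, 3.0) are placeholder assumptions, not the paper's tuned values, and taking the median of the three candidates follows the recommendation in Section 6.3.

```python
import numpy as np

def candidate_thresholds(train_scores, pct=95.0, k_iqr=1.5, k_sigma=3.0):
    """Three candidate thresholds computed from training-normal episode scores."""
    s = np.asarray(train_scores, dtype=float)
    q1, q3 = np.percentile(s, [25, 75])
    return {
        "percentile": float(np.percentile(s, pct)),       # (i) high-percentile rule
        "iqr": float(q3 + k_iqr * (q3 - q1)),             # (ii) IQR rule
        "sigma": float(s.mean() + k_sigma * s.std()),     # (iii) sigma rule
    }

def robust_threshold(train_scores, **kwargs):
    """Median of the three candidates, so one unstable rule cannot dominate."""
    return float(np.median(list(candidate_thresholds(train_scores, **kwargs).values())))

# Hypothetical training scores: the integers 1..100.
scores = np.arange(1, 101, dtype=float)
cands = candidate_thresholds(scores)
threshold = robust_threshold(scores)
```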
4.7.1. Layer and Channel Attribution
4.7.2. Parameter Tuning for Deployment
- Episode parameters: episode length L and stride S (control noise sensitivity and detection delay).
- Threshold sensitivity: percentile, IQR multiplier, and sigma multiplier (control false positives).
- Temporal validation: minimum consecutive anomalies and score multiplier (filter transient excursions).
- Channel standardization and selection: per-channel robust-z scaling of DTW distances and the set of channels scored (control feature dominance and noise).
5. Results and Findings
5.1. Configuration and Design Rationale
5.2. Parameter Sensitivity
5.3. Operating Point and Confusion Matrix
5.4. Fault-Type Detection and Detection Delay
5.5. Anomaly-Score Timeline
5.6. Quantitative Performance Summary
5.6.1. Summary of Detection Performance
5.6.2. Layer Attribution Behavior
5.7. Explainability: DTW Alignments and Localized Distance Profiles
6. Discussion: Practical Impact, Limitations, and Deployment Guidance
6.1. Value of DTW in IoT FDD
6.2. DTW Challenges and Potential Solutions
- Computational cost scales with episode length and the number of reference templates. Constrained DTW with a Sakoe–Chiba band of width w has time complexity O(L·w) per template comparison, so the per-episode scoring cost is O(C · N · L · w), where C is the number of channels and N is the per-channel template cap. In the tuned configuration (C=32, N=15, L=60, w=6), this corresponds to 32 × 15 × 60 × 6 = 172,800 (roughly 173,000) cell computations per episode, with an empirical scoring time of approximately 11 seconds per episode on a consumer-grade workstation. Doubling the episode length quadruples the per-episode cost when the band width is scaled proportionally, while doubling the template cap or the number of channels scales the cost linearly. High-rate telemetry or very large template libraries may require approximate DTW variants (e.g., FastDTW) [32], scalable exact-search techniques [33], or alternative feature-based detectors. The memory footprint is correspondingly small: the reference library stores C × N × L values (here, 32 × 15 × 60 = 28,800 samples), on the order of a few hundred kilobytes, and scoring requires only a single L-by-w cost band in working memory per comparison. The method is designed to run on an edge gateway or fleet backend that ingests device telemetry rather than in situ on the sensor node, and it scores complete episodes spanning minutes to hours rather than operating at the sample rate. Under this model, on-node compute and memory limits are not the binding constraints, and the relevant timing question is whether scoring keeps pace with the monitoring cadence: at approximately 11 seconds per episode, against a 20-minute stride and a sub-60-minute detection-delay target, it does so with a substantial margin. The small footprint also makes gateway-class edge deployment feasible where desired; on-device deployment on resource-constrained sensor nodes and benchmarking on specific embedded targets are left to future work.
- DTW is sensitive to feature scaling and feature dominance. In the reported experiments, the derived SW/FW energy features dominated both scoring and attribution when all layers were weighted equally, due to their larger numeric scale. The tuned configuration mitigates this by standardizing each channel's DTW distance to its own training-normal distribution and excluding high-magnitude signal-energy features (Section 4.6); the resulting non-energy channel set yields the strongest separation (Section 5.2). As a complementary or alternative strategy, practitioners should consider normalizing features after computation or applying per-layer weighting, and should validate that attribution aligns with engineering expectations rather than artifactually reflecting feature magnitude.
- Threshold calibration is nontrivial. Percentile-based thresholds computed on training scores can become unstable when training episodes are highly similar to reference templates (e.g., when identical or near-identical episodes are compared). Practical mitigations include:
  - scoring training episodes against a disjoint template subset,
  - leave-one-out scoring during threshold estimation, and
  - selecting thresholds based on an alert-rate budget rather than a fixed percentile.
- DTW is a shape-matching method and is structurally blind to flat or missing-signal faults: constant-value (stuck-at) and dropout faults score near chance (AUROC about 0.66). This is a scope boundary rather than a tuning deficiency. Such faults are routed to complementary detectors (a variance or range check for stuck-at, a missingness monitor for dropout), which run alongside DTW at negligible cost. DTW is therefore best deployed as the first-line detector for temporally structured faults within a small ensemble.
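The complementary detectors mentioned in the last item can be very small. The sketch below is one possible form, assuming that missing samples are encoded as NaN; the function names and both cutoffs (`min_range`, `max_missing_frac`) are hypothetical and would need to be set per channel.

```python
import numpy as np

def stuck_at_flag(episode, min_range=1e-6):
    """Range check: flag an episode whose observed values barely move."""
    x = np.asarray(episode, dtype=float)
    x = x[~np.isnan(x)]                       # ignore missing samples
    return bool(x.size > 0 and (x.max() - x.min()) < min_range)

def dropout_flag(episode, max_missing_frac=0.2):
    """Missingness monitor: flag an episode with too many missing samples."""
    x = np.asarray(episode, dtype=float)
    return bool(np.mean(np.isnan(x)) > max_missing_frac)

# Hypothetical 60-sample episodes.
healthy = np.sin(np.linspace(0.0, 2.0 * np.pi, 60))
stuck = np.full(60, 21.5)
dropped = healthy.copy()
dropped[10:40] = np.nan                       # half of the episode missing
```

Both checks are linear in the episode length, which is consistent with running them alongside DTW at negligible cost.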
6.3. Deployment and Tuning Guidelines
- Start conservatively. Use a high threshold percentile and validate the false-positive rate on a known-normal window before lowering the threshold to increase sensitivity.
- Tune episode length deliberately. Longer episodes tend to improve precision by emphasizing sustained behavioral change, while shorter episodes reduce detection delay but increase noise sensitivity.
- Use robust thresholding. Combine the percentile-based threshold with IQR- and sigma-based thresholds (Section 4.7) and select the median of the candidates to reduce sensitivity to regime shifts.
- Add temporal validation. Require consecutive anomalous episodes before alerting to suppress isolated spikes and improve operational stability.
- Preserve attribution outputs. Layer and channel rankings should be retained and reviewed as part of the diagnostic workflow; they are essential for root-cause analysis and iterative feature refinement.
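The temporal-validation guideline can be sketched as a run-length filter over episode scores. The function name and the default of two consecutive episodes are assumptions; the score multiplier mentioned in Section 4.7.2 is not modeled here because the text does not specify how it is applied.

```python
def validated_alerts(scores, threshold, min_consecutive=2):
    """Raise an alert only once `min_consecutive` episodes in a row exceed the threshold."""
    alerts, run = [], 0
    for s in scores:
        run = run + 1 if s > threshold else 0   # length of the current anomalous run
        alerts.append(run >= min_consecutive)
    return alerts

# Hypothetical episode scores: one isolated spike, then a sustained excursion.
episode_scores = [1.0, 5.0, 1.0, 5.0, 5.0, 5.0, 1.0]
alerts = validated_alerts(episode_scores, threshold=3.0, min_consecutive=2)
```

The isolated spike at the second episode is suppressed, while the sustained excursion is reported from its second episode onward. The trade-off is added detection delay of (min_consecutive − 1) strides.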
7. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Operating-Point Trade-Off and Score Separation
| Alert budget (train-normal pct) | Threshold | Precision | Recall | F1 | False-alarm rate |
|---|---|---|---|---|---|
| 90th | 9.84 | 0.811 | 0.905 | 0.856 | 0.070 |
| 95th | 12.71 | 0.853 | 0.853 | 0.853 | 0.049 |
| 97.5th | 16.11 | 0.867 | 0.758 | 0.809 | 0.038 |
| 99th | 20.27 | 0.883 | 0.716 | 0.791 | 0.031 |
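The columns of this table follow from episode-level confusion counts. As a check, the sketch below recomputes them from the counts in the reported confusion matrix (81 true positives, 14 false negatives, 14 false positives, 273 true negatives), which appear to correspond to the 95th-percentile row. Defining the false-alarm rate as FP / (FP + TN) is an assumption that is consistent with the tabulated value.

```python
def operating_point(tp, fn, fp, tn):
    """Precision, recall, F1, and false-alarm rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    false_alarm_rate = fp / (fp + tn)          # assumption: FP / (FP + TN)
    return precision, recall, f1, false_alarm_rate

precision, recall, f1, far = operating_point(tp=81, fn=14, fp=14, tn=273)
```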

References
- [1] Cook AA, Mısırlı G, Fan Z. Anomaly detection for IoT time-series data: A survey. IEEE Internet of Things Journal. 2020;7(7):6481-6494.
- [2] Trapani N, Longo L. Fault detection and diagnosis methods for sensor systems: a scientific literature review. IFAC-PapersOnLine. 2023;56(2):1253-1263.
- [3] Liu C, Cichon A, Królczyk G, Li Z. Technology development and commercial applications of industrial fault diagnosis system: A review. International Journal of Advanced Manufacturing Technology. 2022;118:3497-3529.
- [4] Mercorelli P. Recent advances in intelligent algorithms for fault detection and diagnosis. Sensors. 2024;24(8):2656.
- [5] Niggemann O, Zimmering B, Steude H, Augustin JL, Windmann A, Multaheb S. Machine learning for cyber-physical systems. In: Vogel-Heuser B, Wimmer M, eds. Digital Transformation. Springer Vieweg; 2023.
- [6] Ma P, Hou P, Liu F, Yi X. A review of methods for intermittent fault feature recognition. In: Proceedings of the 6th International Conference on System Reliability and Safety Engineering (SRSE). 2024.
- [7] Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans Acoust, Speech, Signal Process. 1978;26(1):43-49.
- [8] Berndt DJ, Clifford J. Using dynamic time warping to find patterns in time series. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. AAAIWS’94. AAAI Press; 1994:359-370.
- [9] Lee EA, Seshia SA. Introduction to Embedded Systems: A Cyber-Physical Systems Approach. 2nd ed. MIT Press; 2016. Accessed April 2, 2026. https://mitpress.mit.edu/9780262533812/introduction-to-embedded-systems/.
- [10] Serpanos D, Wolf M. Internet-of-Things (IoT) Systems: Architectures, Algorithms, Methodologies. Springer International Publishing; 2018. Accessed April 2, 2026. https://link-springer-com.ezproxy2.library.colostate.edu/book/10.1007/978-3-319-69715-4.
- [11] Aalund R, Paglioni VP. Enhancing Reliability in Embedded Systems Hardware: A Literature Survey. IEEE Access. 2025;13:17285-17302.
- [12] Aalund R, Paglioni VP. Systems Engineering Approach to DfR. In: 35th European Safety and Reliability Conference (ESREL 2025) and the 33rd Society for Risk Analysis Europe Conference (SRA-E 2025). Research Publishing Services; 2025:619-626.
- [13] Aalund R, Paglioni VP. A Reliability Ontology for IoT Systems. Social Science Research Network. Preprint posted online March 17, 2026:6429821.
- [14] Fahim M, Sillitti A. Anomaly Detection, Analysis and Prediction Techniques in IoT Environment: A Systematic Literature Review. IEEE Access. 2019;7:81664-81681.
- [15] Aalund R, Paglioni VP. Fault Detection and Diagnosis Method Selection for Varying Organizational Needs. IEEE Access. 2026;in review:17285-17302.
- [16] Pasos Ruiz A, Flynn M, Large J, Middlehurst M, Bagnall A. The great multivariate time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery. 2021;35:401-449.
- [17] Su S, Sun Y, Gao X, Qiu J, Tian Z. A correlation-change based feature selection method for IoT equipment anomaly detection. Applied Sciences. 2019;9(3):437.
- [18] He S, Li Z, Wang J, Xiong NN. Intelligent detection for key performance indicators in industrial-based cyber-physical systems. IEEE Transactions on Industrial Informatics. 2021;17(8):5799-5809.
- [19] Paparrizos J, Li H, Yang F, Wu K, d’Hondt JE, Papapetrou O. A survey on time-series distance measures. arXiv preprint. Published online 2024.
- [20] Li X, Xu S, Yang Y, Lin T, Mba D, Li C. Spherical-dynamic time warping—A new method for similarity-based remaining useful life prediction. Expert Systems with Applications. 2024;238:121913.
- [21] Niu G, Dong X, Chen Y. Motor fault diagnostics based on current signatures: A review. IEEE Transactions on Instrumentation and Measurement. 2023;72:3520919.
- [22] Sowmya S, Saimurugan M, Edinbarough I. Rotational machine fault diagnosis using artificial intelligence (AI) strategies for the operational challenges under variable speed condition: A review. IEEE Access. 2024;12:144870-144889.
- [23] Si Y, Chen Z, Sun J, Zhang D, Qian P. A data-driven fault detection framework using Mahalanobis distance based dynamic time warping. IEEE Access. 2020;8:108359-108370.
- [24] Choudhury MD, Hong L, Dhupia JS. A methodology to handle spectral smearing in gearboxes using adaptive mode decomposition and dynamic time warping. IEEE Transactions on Instrumentation and Measurement. 2021;70:3510910.
- [25] Sun B, Li H, Wang C, Zhang K, Chen S. Current-aided dynamic time warping for planetary gearbox fault detection at time-varying speeds. IEEE Sensors Journal. 2024;24(1):390-402.
- [26] Kim Y, Choi G. Anomaly detection in predictive maintenance using dynamic time warping. Asia-Pacific Journal of Convergent Research Interchange. 2024;10(1):173-183.
- [27] Yang NC, Yang JM. Fault classification in distribution systems using deep learning with data preprocessing methods based on fast dynamic time warping and short-time Fourier transform. IEEE Access. 2023;11:63612-63622.
- [28] Kumar N, Sood SK, Saini M. Internet of Vehicles (IoV) based framework for vehicle degradation using multidimensional dynamic time warping (MDTW). Expert Systems with Applications. 2023;224:120038.
- [29] Hong W, Xu Y, Huang Z, Fang X, Hong D, Zhang J. DT-FTA-ARM: A collaborative framework for real-time fault diagnosis in subway environmental control systems. In: 2025.
- [30] Babu BM, Sandhya B. A novel time series classification for multivariate data using improved deep belief–recurrent neural network with optimal dynamic time warping. In: Vol 392. 2024:01161.
- [31] Bodik P, Hong W, Guestrin C, Madden S, Paskin M, Thibaux R. Intel Lab Data. Published online 2004. http://db.csail.mit.edu/labdata/labdata.html.
- [32] Salvador S, Chan P. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis. 2007;11(5):561-580.
- [33] Rakthanmanon T, Campana B, Mueen A, et al. Searching and mining trillions of time series subsequences under dynamic time warping. In: 2012:262-270.

| Sweep | Setting | CORE AUROC | +spikes AUROC |
|---|---|---|---|
| Episode length L (min) | 30 | 0.912 | 0.899 |
| Episode length L (min) | 45 | 0.944 | 0.928 |
| Episode length L (min) | 60 | 0.964 | 0.945 |
| Episode length L (min) | 90 | 0.970 | 0.948 |
| Stride S (min) | 10 | 0.950 | n/a |
| Stride S (min) | 15 | 0.939 | n/a |
| Stride S (min) | 20 | 0.964 | n/a |
| Stride S (min) | 30 | 0.928 | n/a |
| Channel set | HW (4) | 0.625 | n/a |
| Channel set | all (32) | 0.945 | n/a |
| Channel set | SW/FW (20) | 0.944 | n/a |
| Channel set | non-energy (28) | 0.964 | n/a |

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | 81 | 14 |
| Actual negative | 14 | 273 |

| Fault type | AUROC | Fault-level detection | Mean delay (min) | DTW scope |
|---|---|---|---|---|
| bias_shift | 0.988 | 100% | 10 | appropriate |
| interaction | 0.989 | 100% | 10 | appropriate |
| drift | 0.966 | 100% | 35 | appropriate |
| spikes | 0.934 | 100% | 30 | appropriate |
| dropout | 0.667 | 0% | n/a | out of scope (missingness monitor) |
| stuck_at | 0.655 | 0% | n/a | out of scope (variance/range check) |

| Metric | Value | Source |
|---|---|---|
| AUROC (DTW-appropriate, threshold-independent) | 0.970 | healthy-window eval |
| Fault-level detection (DTW-appropriate) | 100% (16/16) | healthy-window eval |
| Mean detection delay | 21.2 min | end-of-episode decision time |
| Episode-level operating point | see Appendix Table A1 | alert-rate budget |
| Stuck-at / dropout AUROC | 0.66 / 0.67 (blind spots) | healthy-window eval |

| Method family | Requires labeled faults | Detects temporal faults | Cross-layer attribution | Interpretable evidence | Low deployment cost |
|---|---|---|---|---|---|
| Rule-based | No | No | No | Yes | Yes |
| Supervised classifiers (SVM, RF, DNN) | Yes | Only with sequence models | No | Varies by model | No |
| Isolation Forest / One-Class SVM | No | No (ordering discarded) | Feature importance only | Score-level only | Yes |
| Autoencoders | No | Only with temporal architectures | Per-feature error only | No | No |
| This work | No | Yes | Yes | Yes | Yes |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).