Wearable Sleep Staging: From Neurophysiology to Clinical Translation—A Comprehensive Review

Zohreh Abbasi; Maryam OstadSharif Memar; Ramtin Hamavar; Nadine Steingraeber; Reza Shahshahani; Joachim Gross; Omid Abbasi

doi:10.20944/preprints202606.0999.v1

Submitted: 11 June 2026

Posted: 12 June 2026


Abstract

Accurate sleep staging is fundamental to the diagnosis of sleep disorders, the evaluation of therapeutic interventions, and the understanding of sleep’s role in health and disease. While polysomnography (PSG) remains the clinical gold standard, its cost, technical complexity, and reliance on controlled laboratory settings have motivated the development of wearable alternatives capable of continuous, home-based monitoring. This comprehensive review spans the full translational pipeline, from the neurophysiological foundations of sleep architecture through signal acquisition, preprocessing, and algorithmic classification to device-level validation and population-specific clinical evidence. We begin by establishing the physiological basis of sleep stages, including macro-architectural organization, microstructural markers (spindles, K-complexes, cyclic alternating pattern), and the autonomic, respiratory, and movement correlates that wearable sensors can capture. We then examine the signal processing chain (artifact removal, filtering, feature extraction, and data augmentation) that underpins reliable staging from noisy, real-world recordings. Building on this foundation, we evaluate the performance of consumer and clinical-grade wearable devices across diverse populations, including healthy adults, children and adolescents, older adults, clinical cohorts (insomnia, obstructive sleep apnea, neurodegenerative disease), pregnant individuals, shift workers, and athletes. Finally, we identify persistent challenges (signal quality gaps between laboratory and consumer sensors, algorithmic opacity, population bias, and the emerging phenomenon of orthosomnia) and outline future directions encompassing multimodal sensing, standardized benchmarking, digital sleep biomarkers, and equitable regulatory frameworks.
By synthesizing evidence from review articles published from 2018 onward, this work provides a single, integrated reference for researchers, engineers, and clinicians working to advance robust, generalizable, and clinically meaningful wearable sleep monitoring. A companion paper (OstadSharif Memar et al., 2026) offers a detailed algorithmic analysis of computational models for wearable-based sleep stage detection.

Keywords: sleep staging; polysomnography; wearable devices; physiological signals; sleep sensors; clinical translation

Subject: Biology and Life Sciences - Neuroscience and Neurology

1. Introduction

Sleep is a fundamental biological process essential for maintaining physical health, cognitive performance, and emotional regulation. Disruptions in sleep quality or architecture have been associated with a wide range of adverse health outcomes, including cardiovascular disease, metabolic disorders, and impaired cognitive function (Xu et al., 2022). From a neurophysiological perspective, sleep is broadly categorized into NREM and REM sleep. NREM sleep is further divided into three stages (N1, N2, and N3), representing a progression from light to deep sleep, followed by REM sleep. These stages cycle throughout the night in structured patterns, each characterized by distinct electrophysiological signatures and brain oscillatory activity. The balanced alternation between NREM and REM stages is critical for restorative sleep, whereas disruptions in these cycles are closely linked to sleep disorders and long-term health consequences (Masad et al., 2024; R. Zhang et al., 2024).

Accurate identification of sleep stages is therefore central to both clinical diagnosis and sleep research. PSG remains the gold-standard method for sleep staging, as it simultaneously records multiple physiological signals, including EEG, EOG, EMG, and cardiorespiratory activity. These multimodal recordings enable detailed characterization of sleep architecture and facilitate the diagnosis of disorders such as obstructive sleep apnea, insomnia, and narcolepsy. In particular, EEG-derived frequency bands, including delta, theta, and alpha rhythms, serve as key biomarkers for distinguishing sleep stages and transitions. Despite its high diagnostic accuracy and comprehensive nature, PSG is limited by high cost, technical complexity, and the requirement for overnight laboratory-based assessments, which restrict its scalability and accessibility in large populations (Birrer et al., 2024; Markov et al., 2025).

In response to these limitations, there has been growing interest in home-based and wearable sleep monitoring technologies. Advances in wearable sensing and embedded systems have enabled the development of compact devices such as wristbands, rings, and head-mounted systems capable of continuously recording physiological signals in naturalistic environments. These devices typically capture parameters such as heart rate, heart rate variability, motion through accelerometry, and, in some cases, simplified EEG signals. Compared to PSG, wearable devices offer significant advantages, including lower cost, ease of use, and the ability to perform long-term and large-scale sleep monitoring outside clinical settings. However, these advantages come with important limitations. Wearable systems generally rely on fewer and often indirect physiological signals, which can reduce their ability to accurately differentiate between detailed sleep stages. Recent studies indicate that while wearable devices perform reasonably well in detecting sleep–wake transitions, their accuracy in classifying finer sleep stages, particularly N2 and N3, remains variable (Depner et al., 2020). Multimodal approaches that combine signals such as accelerometry and photoplethysmography (PPG) have shown improved performance compared to single-sensor systems, highlighting the importance of integrating complementary physiological data (Depner et al., 2020).

The rapid adoption, ongoing development, and positive reception of wearable health technologies further underscore their potential for large-scale sleep monitoring. With a substantial proportion of adults using consumer-grade devices such as smartwatches and fitness trackers (Dhingra et al., 2023), wearable sleep staging has emerged as a promising alternative to traditional laboratory-based assessments. At the same time, ongoing advances in machine learning and data-driven modeling have significantly improved the capability of these systems to infer sleep stages from limited and noisy physiological signals. Against this background, this review aims to provide a comprehensive overview of home-based sleep staging using wearable technologies. Specifically, this paper (1) summarizes the neurophysiological foundations of sleep and conventional PSG-based staging, (2) reviews alternative sensing modalities used in wearable systems, including EEG, EOG, EMG, ECG, PPG, and accelerometry, (3) discusses key signal processing and preprocessing considerations in a dedicated section, and (4) evaluates the performance, challenges, and future directions of wearable sleep staging across diverse populations and clinical contexts. By synthesizing findings from recent studies published since 2018, this review provides an integrated perspective on current capabilities, limitations, and future directions in wearable sleep monitoring.

Given the breadth of this review, different readers may benefit from different entry points. Readers already familiar with sleep neurophysiology may proceed directly to Section 4 (Preprocessing and Digital Filtering) or Section 5 (Performance Evaluation) without loss of continuity. Engineers and algorithm developers may find Section 4 and Section 5 most immediately relevant, while clinicians and translational researchers may wish to focus on Section 6 (Device Types) and 7 (Population-Specific Evidence) before consulting Section 8 (Challenges) and 9 (Future Directions). Section 3 (Sleep Physiology and Stage Architecture) provides a self-contained primer for readers entering the field from a signal processing or engineering background who seek to understand why certain physiological signals carry sleep-stage information and where the fundamental limits of peripheral sensing lie. For readers interested in the computational modeling details of wearable sleep staging algorithms, we refer to our companion paper (OstadSharif Memar et al., 2026), which provides a systematic analysis of traditional machine learning, deep learning, and transfer learning approaches evaluated across 33 datasets.

2. Methods

2.1. Search Strategy

A comprehensive literature search was conducted to identify peer-reviewed studies, systematic reviews, and publicly available datasets relevant to sleep staging, with a particular focus on wearable-based and home-based sleep monitoring systems.

The following electronic databases were searched: PubMed/Medline, Scopus, Web of Science, and Google Scholar. The search included articles published between 2018 and 2026 to capture recent advances in wearable sleep monitoring, physiological signal analysis, and machine-learning-based sleep staging methods. Only articles written in English were included.

Search terms combined sleep-related keywords (e.g., “sleep staging,” “sleep classification,” “polysomnography”) with technology-oriented terms (e.g., “wearable,” “EEG,” “photoplethysmography,” “accelerometry”) using Boolean operators.
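As an illustration of how such a combined query can be assembled, the sketch below joins the example terms quoted above with Boolean operators. The grouping logic and the helper names (`or_group`, `build_query`) are assumptions made for illustration; the exact query strings used in each database are not reported here.

```python
# Illustrative only: the terms are the examples quoted in the text.
# The actual database-specific query strings are not given in the source.
SLEEP_TERMS = ["sleep staging", "sleep classification", "polysomnography"]
TECH_TERMS = ["wearable", "EEG", "photoplethysmography", "accelerometry"]


def or_group(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(sleep_terms, tech_terms):
    """Require at least one term from each group (group AND group)."""
    return f"{or_group(sleep_terms)} AND {or_group(tech_terms)}"


query = build_query(SLEEP_TERMS, TECH_TERMS)
print(query)
```

Field tags and truncation syntax differ between databases, so a string of this kind would normally be adapted for each one.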

Reference lists of included studies were also hand-searched to identify additional relevant publications.

2.2. Inclusion and Exclusion Criteria

Studies were selected according to the following predefined criteria.

Inclusion criteria:

Peer-reviewed original research, systematic reviews, and meta-analyses published between 2018 and 2026.
Studies addressing sleep staging using PSG or wearable devices validated against PSG.
Studies conducted on human subjects with clear reporting of participant numbers or dataset size.
Studies describing signal-processing or machine-learning approaches for sleep stage classification.
Studies using publicly available datasets or code (preferred for reproducibility).

Exclusion criteria: Non-English publications and conference abstracts without an available full-text manuscript were excluded, as were conceptual or opinion-based papers lacking quantitative analysis. Studies were also excluded if they lacked validation against gold-standard PSG or provided insufficient methodological detail to allow a robust evaluation.

2.3. Screening and Selection Process

All records identified through the search strategy were imported into the reference management software Sciwheel, where duplicate entries were systematically removed. The remaining records underwent an initial screening of titles and abstracts to identify potentially relevant studies. Subsequently, full-text articles were retrieved for all records deemed eligible and were rigorously assessed against the predefined inclusion and exclusion criteria. Any disagreements during screening were resolved through discussion among the review team until consensus was reached.

For each study meeting the inclusion criteria, data were systematically extracted using a standardized template. The extracted information included:

Study Characteristics: Citation, year of publication, and study type (e.g., clinical trial, algorithm development, validation, or review).
Dataset Details: Number of subjects, demographics and dataset size.
Technical Parameters: Signals recorded (e.g., EEG, EOG, EMG, ECG, PPG, accelerometry, and respiration) and the algorithm or model architecture employed.
Validation and Performance: Gold-standard reference (manual PSG scoring) and quantitative performance metrics, including accuracy, Cohen’s kappa, F1-score, sensitivity, and specificity.
Transparency: Availability of open-source code or public datasets.

This structured extraction framework ensured consistency across reviewers and facilitated systematic comparison of studies spanning different sensing modalities and algorithmic approaches.
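Two of the extracted metrics, accuracy and Cohen's kappa, can be computed directly from a pair of epoch-by-epoch hypnograms. The following is a minimal sketch using the standard kappa definition; the two 10-epoch label sequences are invented for illustration and do not come from any reviewed study.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of epochs on which the two hypnograms agree."""
    agree = sum(r == p for r, p in zip(reference, predicted))
    return agree / len(reference)


def cohens_kappa(reference, predicted):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(reference)
    p_o = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement from the marginal label frequencies of each scorer.
    p_e = sum(ref_counts[s] * pred_counts[s] for s in ref_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical 10-epoch hypnograms (30 s epochs), for illustration only.
psg = ["W", "N1", "N2", "N2", "N3", "N3", "N2", "REM", "REM", "W"]
device = ["W", "N2", "N2", "N2", "N3", "N2", "N2", "REM", "REM", "W"]

print(accuracy(psg, device), cohens_kappa(psg, device))
```

In this toy case the raw agreement (0.80) is higher than kappa (about 0.73), because kappa discounts the agreement expected by chance from the stage distribution. This is one reason kappa is commonly reported alongside accuracy.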

2.4. Data Synthesis

Due to the heterogeneity in study designs and technological approaches, a narrative synthesis approach was employed rather than a formal meta-analysis. Studies were grouped and analyzed according to the following thematic frameworks:

Data Acquisition Context: Comparing clinical PSG, home-based PSG, and various wearable device categories.
Methodological Pipeline: Evaluating signal preprocessing, feature extraction techniques, and the implementation of machine-learning or deep-learning models.
Comparative Performance: Analyzing validation strategies and the consistency of performance metrics across different hardware and software configurations.

Where quantitative metrics were reported consistently across studies, ranges and trends are presented to facilitate cross-study comparison, although formal statistical pooling was not performed owing to methodological heterogeneity.

3. Sleep Physiology and Stage Architecture

3.1. Sleep Architecture and Physiological Foundations

Human sleep is a complex and dynamic biological process characterized by recurrent cycles between NREM and REM sleep. A typical night consists of approximately 4 to 6 sleep cycles, each lasting around 90–110 minutes (Wang et al., 2024). According to the AASM guidelines, NREM sleep is divided into three stages (N1, N2, and N3), while REM sleep represents a distinct physiological state characterized by low-amplitude, mixed-frequency EEG activity, rapid eye movements, and muscle atonia (Shi et al., 2023).

Across the night, sleep architecture follows a predictable pattern, with deep NREM (N3) dominating early cycles and REM periods progressively lengthening toward the morning. This dynamic organization reflects the underlying neurophysiological regulation of sleep and is essential for cognitive and physiological restoration. From a physiological perspective, sleep stages are defined by distinct neural and autonomic signatures.

Wakefulness and NREM/REM sleep can be differentiated using multimodal biosignals, including EEG, EOG, EMG, and autonomic measures. These signals provide complementary information for characterizing brain activity, eye movements, muscle tone, and systemic physiological changes across sleep stages (Lambert and Peter-Derex, 2023; Stephansen et al., 2018).

The structure and quality of sleep architecture are clinically significant and have been associated with various neurological and psychiatric conditions, including depression and neurodegenerative disorders (Xiao et al., 2025). Therefore, accurate characterization of sleep stages requires both an understanding of sleep architecture and the physiological signals that define it, which form the basis for automated sleep staging systems discussed in later sections.

3.2. Sleep Stages and Macro-Architecture

Wakefulness is characterized by low-amplitude, mixed-frequency EEG with prominent posterior alpha rhythm (8–13 Hz) during relaxed, eyes-closed states and increased beta activity during active cognition or with eyes open. Muscle tone, heart rate, and responsiveness to external stimuli are relatively high in wakefulness, and eye movements are frequent (Lambert and Peter-Derex, 2023; Sun et al., 2023).

Stage N1 (light sleep) marks the transition from wake to sleep, defined by the replacement of more than half of the alpha rhythm with low-amplitude mixed-frequency activity dominated by theta (4–7 Hz). Muscle tone is slightly reduced but preserved, breathing remains regular, and this stage is brief (around 1–5 minutes), comprising roughly 5% of total sleep time in healthy adults (Lambert and Peter-Derex, 2023; Patel et al., 2026; Stephansen et al., 2018).

As sleep deepens, Stage N2 becomes the predominant stage, accounting for approximately 40–55% of total sleep time. It is defined by the presence of sleep spindles (11–16 Hz, sigma band) and K-complexes on a background of theta activity, accompanied by reduced heart rate and body temperature, absence of eye movements, and further relaxation of skeletal muscle tone (Lambert and Peter-Derex, 2023; Stephansen et al., 2018).

Stage N3 (slow-wave or deep sleep) is dominated by high-amplitude, low-frequency delta activity (0.5–2 Hz) occupying at least 20% of the epoch. During N3, arousal thresholds are highest, muscle tone, pulse, and respiratory rate reach their lowest levels, and this stage is strongly associated with physical restoration, immune function, and homeostatic sleep pressure dissipation (Lambert and Peter-Derex, 2023; Stephansen et al., 2018).
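The 20% occupancy criterion can be expressed as a simple check on the time covered by slow waves within one epoch. The sketch below assumes that slow-wave intervals have already been detected by some upstream method and are non-overlapping; the function names and example intervals are hypothetical.

```python
EPOCH_S = 30.0          # scoring epoch length in seconds
N3_MIN_FRACTION = 0.20  # "at least 20% of the epoch", as stated in the text


def slow_wave_fraction(intervals, epoch_s=EPOCH_S):
    """Fraction of an epoch covered by slow-wave intervals.

    `intervals` holds (start_s, end_s) pairs relative to the epoch start.
    They are assumed to be non-overlapping and already detected.
    """
    covered = sum(
        max(0.0, min(end, epoch_s) - max(start, 0.0)) for start, end in intervals
    )
    return covered / epoch_s


def meets_n3_criterion(intervals):
    """True when slow waves occupy at least 20% of the epoch."""
    return slow_wave_fraction(intervals) >= N3_MIN_FRACTION


# Hypothetical detections: 2 s + 3 s + 1.5 s = 6.5 s of slow waves.
example = [(1.0, 3.0), (10.0, 13.0), (20.0, 21.5)]
print(slow_wave_fraction(example), meets_n3_criterion(example))
```

With 6.5 s of slow-wave activity in a 30 s epoch the fraction is about 0.22, so the epoch passes the occupancy check; 5 s (about 0.17) would not.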

Following the deepest NREM stages, REM sleep is defined by a low-amplitude, mixed-frequency EEG resembling wakefulness, often with characteristic sawtooth waves, co-occurring with rapid, conjugate eye movements and profound muscle atonia. Autonomic activity becomes more variable, with irregular breathing and heart rate, and REM episodes lengthen over successive cycles, typically constituting around 20–25% of total sleep time in adults (Lambert and Peter-Derex, 2023; Stephansen et al., 2018).

Table 1. Sleep EEG patterns.

Sleep Stage	EEG Characteristics	Dominant Frequency Bands
Wakefulness	Low-amplitude, mixed-frequency activity with alpha rhythm (8–13 Hz) during relaxed wake	Alpha, Beta
N1	Transition from alpha to low-voltage mixed-frequency theta (4–7 Hz)	Theta
N2	Presence of sleep spindles (11–16 Hz) and K-complexes	Sigma, Theta
N3	High-amplitude slow waves (>75 μV, 0.5–2 Hz)	Delta
REM	Low-amplitude, mixed-frequency EEG similar to wake; sawtooth waves may appear	Theta, Beta
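The stage proportions quoted in this section are expressed relative to total sleep time (TST). As a minimal sketch, they can be derived from an epoch-by-epoch hypnogram as follows; the 20-epoch hypnogram is invented for illustration and is far shorter than a real night.

```python
from collections import Counter

EPOCH_S = 30  # scoring epoch length in seconds


def total_sleep_minutes(hypnogram):
    """Total sleep time in minutes: all non-wake epochs."""
    return sum(stage != "W" for stage in hypnogram) * EPOCH_S / 60


def stage_percentages(hypnogram):
    """Percent of total sleep time spent in each sleep stage.

    Wake epochs ("W") are excluded from the denominator.
    """
    sleep = [stage for stage in hypnogram if stage != "W"]
    counts = Counter(sleep)
    return {stage: 100.0 * n / len(sleep) for stage, n in counts.items()}


# Hypothetical hypnogram: 20 epochs, 4 of them wake.
hyp = ["W"] * 2 + ["N1"] * 1 + ["N2"] * 8 + ["N3"] * 3 + ["REM"] * 4 + ["W"] * 2
pct = stage_percentages(hyp)
print(total_sleep_minutes(hyp), pct)
```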

3.3. Sleep Microstructures: Spindles, K-Complexes, and CAP

While the previous section described the overall organization of sleep stages across the night, this section focuses on transient microstructural events occurring within these stages. Beyond macro-stage scoring, sleep contains characteristic microstructures (short-lived, stereotyped EEG patterns) that offer additional insight into neural processing, sleep stability, and pathology. Among these, sleep spindles, K-complexes, micro-arousals, and the cyclic alternating pattern (CAP) are particularly well studied and increasingly targeted by automated detection methods (Ghermezian et al., 2023).

3.3.1. Sleep Spindles

Sleep spindles are transient bursts of 11–16 Hz (sigma-band) oscillations, lasting roughly 0.5–2 seconds, and are generated through thalamocortical interactions during N2 sleep. These oscillations are widely regarded as a neural substrate for memory consolidation, synaptic plasticity, and cortical network reorganization. Quantitative features of spindles (density, duration, amplitude/frequency, and topographic distribution) are used not only in basic research but also in automated sleep-staging algorithms and clinical studies of sleep quality and cognitive function (Bernhard et al., 2022; Deibel et al., 2020).
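The duration criterion (roughly 0.5–2 seconds) lends itself to a simple rule-based check. The sketch below is not a method taken from the cited studies: it assumes that a sigma-band (11–16 Hz) amplitude envelope has already been computed, and the threshold value and function name are hypothetical.

```python
def candidate_spindles(envelope, fs, threshold, min_s=0.5, max_s=2.0):
    """Return (start_s, end_s) for runs where envelope > threshold.

    Only runs lasting between min_s and max_s seconds are kept. `envelope`
    is assumed to be the amplitude envelope of sigma-band filtered EEG;
    computing it is outside the scope of this sketch.
    """
    events = []
    run_start = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for i, value in enumerate(list(envelope) + [float("-inf")]):
        above = value > threshold
        if above and run_start is None:
            run_start = i
        elif not above and run_start is not None:
            duration = (i - run_start) / fs
            if min_s <= duration <= max_s:
                events.append((run_start / fs, i / fs))
            run_start = None
    return events


# Hypothetical envelope at 10 Hz: a 0.2 s blip, a 1.0 s burst, a 3.0 s plateau.
fs = 10
env = [0] * 5 + [5] * 2 + [0] * 5 + [5] * 10 + [0] * 5 + [5] * 30 + [0] * 5
print(candidate_spindles(env, fs, threshold=1))
```

Only the 1.0 s burst is retained. The 0.2 s blip is too short and the 3.0 s plateau too long to satisfy the duration window.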

3.3.2. K-Complexes

K-complexes are high-amplitude, biphasic waveforms consisting of a sharp negative deflection followed by a positive component, typically occurring in N2 and often preceding or co-occurring with spindles. They can occur spontaneously or be evoked by sensory stimuli, and they are regarded as a defining graphoelement of stage N2 (Deibel et al., 2020).

Physiologically, K-complexes are thought to reflect brief cortical down-states that transiently suppress widespread neural activity, thereby protecting sleep in the face of environmental perturbations while still allowing selective information processing. Their morphology, amplitude, timing, and distribution are routinely assessed in clinical sleep staging and in research on sleep stability, sensory processing during sleep, and memory consolidation (Dumitrescu et al., 2021; Leach et al., 2024). While spindles and K-complexes are discrete events, sleep stability can also be assessed at a broader temporal scale through patterns of periodic EEG fluctuation.

3.3.3. Micro-Arousals and Cyclic Alternating Pattern (CAP)

Micro-arousals are brief intrusions of wake-like EEG activity during NREM sleep, usually accompanied by changes in autonomic and muscle activity and often triggered by respiratory events, movements, or external stimuli. Although individually short, frequent micro-arousals fragment sleep, reduce its restorative quality, and are a hallmark of many sleep disorders (Zitting et al., 2023).

CAP describes a periodic NREM microstructure characterized by sequences of EEG activation (phase A) followed by intervals of relative quiescence (phase B), typically lasting 20–40 seconds per cycle. CAP rate and phase-specific metrics provide a sensitive index of sleep instability, with elevated CAP observed in conditions such as obstructive sleep apnea, periodic limb movement disorder, epilepsy, and psychophysiological insomnia (Mendonça et al., 2023).

By integrating macro-architecture (stages and cycles) with these microstructural elements, PSG and EEG analyses yield a richer, multi-scale characterization of sleep physiology and its alterations in health and disease (Ghermezian et al., 2023). In addition to these central nervous system markers, sleep stages are also accompanied by systematic changes in autonomic and respiratory function, which form the basis of several wearable sensing approaches.

Beyond these neurophysiological features, accurate characterization of sleep requires dedicated measurement modalities.

3.4. Sleep Measurement and Monitoring Modalities

3.4.1. Polysomnography (PSG)

PSG is the gold-standard technique for assessing sleep architecture and diagnosing sleep disorders, combining multiple synchronized physiological signals. Standard scoring rules from the AASM specify how these signals are used to classify each 30-second epoch as wake, N1, N2, N3, or REM sleep (Ellis et al., 2021; Imtiaz, 2021; Qian et al., 2021; Satapathy et al., 2024; Yubo et al., 2022).

A typical adult PSG montage includes multiple EEG derivations (e.g., F3-M2, F4-M1, C3-M2, C4-M1, O1-M2, O2-M1), bilateral EOG channels to capture horizontal and vertical eye movements, and submental EMG electrodes to quantify muscle tone. Additional channels usually comprise a single-lead ECG, thoracic and abdominal respiratory effort belts, airflow sensors (nasal pressure and/or thermistor), pulse oximetry for oxygen saturation, and accelerometer or body-position sensors for movement detection (Baumgartner et al., 2021; Lambert and Peter-Derex, 2023; Stephansen et al., 2018; Yildirim et al., 2019). Although EEG alone provides a powerful basis for sleep staging, combining it with EOG and EMG in a multimodal PSG framework has been shown to significantly improve automated sleep-stage classification, especially when using explainable deep-learning or machine-learning models (Ellis et al., 2021; Satapathy et al., 2024; Yubo et al., 2022).

All PSG channels are sampled at relatively high frequencies—commonly 200–500 Hz for EEG, EOG, EMG, and ECG—and are time-synchronized using dedicated acquisition hardware. This multimodal configuration enables the simultaneous assessment of sleep stages, respiratory events, cardiac dynamics, and motor phenomena, but also increases complexity, cost, and potential discomfort for the sleeper (de Gans et al., 2024; Penzel, 2024).

These practical constraints have motivated the development of simpler recording approaches. Because full-night, in-lab PSG is resource-intensive and may alter natural sleep, alternative sensing modalities have been explored, including single-channel EEG, wrist actigraphy, and photoplethysmography-based wearables. These approaches can approximate sleep staging and enable large-scale or home-based monitoring, albeit typically with reduced granularity compared to full PSG (Fu et al., 2021; Imtiaz, 2021; Melo et al., 2024). The following subsections examine each of the core signal modalities, beginning with EEG, and their respective roles in sleep stage classification.

3.4.2. Core Physiological Signals in Sleep Recording

3.4.2.1. Electroencephalography (EEG)

EEG records scalp voltage fluctuations generated by synchronized postsynaptic activity in cortical pyramidal neurons and remains the primary modality for sleep staging due to its high temporal resolution and sensitivity to oscillatory brain dynamics. Its ability to capture rapid neural fluctuations makes it essential for identifying sleep stages, arousals, and characteristic graphoelements such as sleep spindles and K-complexes (Qian et al., 2021; Yubo et al., 2022).

In clinical sleep analysis, EEG signals are typically segmented into fixed 30-second epochs and classified based on spectral content, waveform morphology, and the presence of specific sleep-related features. Distinct frequency bands are associated with different vigilance states, including alpha activity (8–13 Hz) during relaxed wakefulness, theta activity (4–7 Hz) in N1, delta activity (<4 Hz) dominating N3, and sigma-band oscillations (11–16 Hz) corresponding to sleep spindles (Qian et al., 2021; X. Zhang et al., 2024). These oscillatory signatures form the basis of both manual scoring guidelines and automated sleep staging algorithms.
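To make the epoch-and-band description concrete, the sketch below cuts a signal into 30-second epochs and estimates the relative power in each band with a naive discrete Fourier transform. It uses the band limits quoted above, with two assumptions: a 0.5 Hz lower bound for delta, and normalization by total power between 0.5 and 16 Hz. The synthetic test signal and all function names are hypothetical.

```python
import cmath
import math

# Band limits in Hz as quoted in the text; the 0.5 Hz delta floor is assumed.
BANDS_HZ = {"delta": (0.5, 4.0), "theta": (4.0, 7.0),
            "alpha": (8.0, 13.0), "sigma": (11.0, 16.0)}


def split_epochs(signal, fs, epoch_s=30):
    """Cut a signal into consecutive non-overlapping epochs; drop the remainder."""
    n = int(fs * epoch_s)
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def power_spectrum(epoch, fs, f_max=16.0):
    """Naive DFT power for bins up to f_max; returns {frequency_hz: power}."""
    n = len(epoch)
    spectrum = {}
    for k in range(1, int(f_max * n / fs) + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(epoch))
        spectrum[k * fs / n] = abs(coeff) ** 2
    return spectrum


def relative_band_powers(epoch, fs):
    """Band power divided by total power between 0.5 and 16 Hz."""
    spectrum = power_spectrum(epoch, fs)
    total = sum(p for f, p in spectrum.items() if f >= 0.5)
    return {name: sum(p for f, p in spectrum.items() if lo <= f < hi) / total
            for name, (lo, hi) in BANDS_HZ.items()}


# Synthetic 65 s signal at 40 Hz: strong 10 Hz rhythm plus a weaker 2 Hz wave.
fs = 40
signal = [math.sin(2 * math.pi * 10 * t / fs) + 0.3 * math.sin(2 * math.pi * 2 * t / fs)
          for t in range(65 * fs)]
epochs = split_epochs(signal, fs)
rel = relative_band_powers(epochs[0], fs)
print(len(epochs), max(rel, key=rel.get))
```

Because the alpha and sigma ranges overlap between 11 and 13 Hz, the relative powers need not sum to one. In practice a fast Fourier transform or an averaged periodogram would normally replace the naive DFT, which is used here only to keep the example dependency-free.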

Although standard polysomnography employs multi-channel EEG montages to capture spatially distributed cortical activity, there is increasing interest in reduced-channel and single-channel EEG configurations for ambulatory and wearable sleep monitoring. These simplified setups can still support reliable sleep staging and microstructure detection when combined with advanced signal processing and machine-learning approaches; however, they may reduce sensitivity to spatially localized abnormalities and subtle neural patterns. This highlights a fundamental trade-off between diagnostic accuracy and practical usability in real-world sleep monitoring systems (Deibel et al., 2020; Lucey et al., 2016; Tapia-Rivas et al., 2024).

3.4.2.2. Electrooculography (EOG) and Electromyography (EMG)

EOG captures the corneo-retinal potential difference that changes with eye movements, providing a sensitive marker of state-dependent oculomotor behavior. In N1 sleep, slow rolling eye movements are typical, whereas REM sleep is distinguished by bursts of rapid, conjugate eye movements that help differentiate it from both wake and NREM when EEG features alone may be ambiguous (Fan et al., 2021; Imtiaz, 2021).

EMG, usually recorded from submental (chin) muscles and sometimes from limbs, measures the electrical activity associated with muscle tone and phasic contractions. EMG shows relatively high, variable activity in wake, intermediate tonic levels with phasic events in NREM, and near-complete atonia in REM sleep, making it crucial for distinguishing REM from other stages and for identifying motor abnormalities (Fan et al., 2021; Imtiaz, 2021; Modarres et al., 2021).

Together with EEG, EOG and EMG form the core PSG triad that underpins manual scoring and automated classification of sleep stages. Moreover, EMG is indispensable for diagnosing conditions such as REM sleep behavior disorder and periodic limb movement disorder, where the expected pattern of muscle atonia or movement suppression is disrupted (Lambert and Peter-Derex, 2023). Beyond these macro-level stage markers, sleep EEG also contains transient microstructural elements that encode additional information about neural processing and sleep stability (see Section 3.3).

3.5. Autonomic and Respiratory Correlates

3.5.1. Heart Rate, Heart Rate Variability (HRV), and Respiratory Correlates of Sleep Stages

Sleep is accompanied by systematic, stage-dependent modulations in autonomic cardiovascular and respiratory regulation, which provide valuable physiological markers for sleep depth and transitions between stages. During NREM sleep, particularly in deeper stages, heart rate progressively declines under predominant parasympathetic control, reflecting increased cardiovascular stability. In contrast, REM sleep is characterized by pronounced autonomic instability, manifested as elevated HRV and greater fluctuations in cardiac dynamics. Consequently, HRV metrics, especially high-frequency (HF) and low-frequency (LF) components, have been widely adopted as indirect indicators of sleep depth and stage transitions (Kerkering et al., 2022; Yan et al., 2022).
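As a simple illustration of how variability is quantified from beat-to-beat (RR) intervals, the sketch below computes mean heart rate and two time-domain HRV statistics, SDNN and RMSSD. These are used here as a stand-in for the frequency-domain HF and LF components named above, which additionally require spectral estimation on an evenly resampled RR series. The RR values are invented for illustration and are not labelled with any sleep stage.

```python
import math


def mean_heart_rate(rr_ms):
    """Mean heart rate in beats per minute from RR intervals in milliseconds."""
    return 60000.0 / (sum(rr_ms) / len(rr_ms))


def sdnn(rr_ms):
    """Sample standard deviation of RR intervals (ms)."""
    mean = sum(rr_ms) / len(rr_ms)
    return math.sqrt(sum((x - mean) ** 2 for x in rr_ms) / (len(rr_ms) - 1))


def rmssd(rr_ms):
    """Root mean square of successive RR differences (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical RR series (ms), invented for illustration.
steady = [1000, 1010, 990, 1005, 995, 1000]
irregular = [900, 1050, 850, 1100, 880, 1020]

for name, rr in (("steady", steady), ("irregular", irregular)):
    print(name, round(mean_heart_rate(rr), 1), round(sdnn(rr), 1), round(rmssd(rr), 1))
```

Both statistics are larger for the irregular series, which is the kind of contrast that HRV-based staging features attempt to capture over windows of many beats.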

Respiratory activity exhibits parallel stage-dependent changes that closely interact with autonomic cardiovascular regulation. During deep NREM sleep (e.g., N3), breathing patterns are typically slow, regular, and stable, whereas REM sleep is associated with more irregular respiration, including transient pauses and bursts. These variations reflect alterations in central respiratory control and autonomic balance across sleep stages. Accordingly, monitoring respiratory effort using thoracic and abdominal belts remains a core component of polysomnography for characterizing sleep architecture and detecting sleep-disordered breathing (Kazemi et al., 2024; Krauss et al., 2025).

Recent advances in wearable and unobtrusive sensing technologies have enabled continuous, ambulatory monitoring of both cardiovascular and respiratory signals outside the sleep laboratory. In particular, photoplethysmography-based devices allow estimation of heart rate and HRV, while newer wearable systems facilitate indirect assessment of respiratory patterns. Accumulating evidence indicates increasing agreement between PPG-derived sleep staging and full polysomnography, supporting the integration of HRV-based autonomic features and respiratory dynamics as a promising, less intrusive approach for long-term, real-world sleep monitoring and arousal detection (Kazemi et al., 2024; Kerkering et al., 2022; Krauss et al., 2025; Yan et al., 2022). Among these wearable-friendly modalities, PPG has received particular attention owing to its low cost and ease of integration into consumer devices.

3.5.2. Photoplethysmogram (PPG)

PPG is a non-invasive optical technique that measures blood volume changes in the skin by illuminating the tissue (typically with green, red, or infrared light) and detecting the transmitted or reflected light; the resulting waveform reflects cardiovascular and respiratory activity. PPG has been increasingly used for sleep assessment, with automated multi-stage sleep classification showing high accuracy across two-, three-, and four-stage models. Its widespread integration in wearable oximeters facilitates accessible, long-term home sleep monitoring and sleep apnea diagnosis. Derived metrics, such as Pulse Transit Time (PTT) and Pulse Wave Amplitude (PWA), serve as proxies for autonomic activity and arousal during sleep (Liu et al., 2023). Beyond cardiovascular signals, thermoregulatory and electrodermal measures offer additional physiological windows into sleep-stage dynamics.

3.5.3. Skin Temperature and Thermal Biomarkers

Skin temperature, particularly distal skin temperature measured at the wrist or finger, provides a physiologically informative marker of circadian phase and sleep–wake transitions. During sleep onset, peripheral vasodilation produces a characteristic rise in distal skin temperature of approximately 0.3–0.5 °C, reflecting core-to-periphery heat redistribution. Continuous temperature monitoring is now available in several consumer platforms. Recent large-scale studies demonstrate that longitudinal temperature trends capture circadian and hormonal variations relevant to sleep architecture and metabolic health (Adaimi et al., 2025). When integrated with PPG and accelerometry, temperature features improve multimodal sleep–wake classification accuracy, particularly for REM sleep detection.

3.5.4. Electrodermal Activity (EDA)

Electrodermal activity (EDA), reflecting sympathetic cholinergic innervation of eccrine sweat glands, has emerged as a complementary modality for wearable sleep monitoring. Recent studies have demonstrated that EDA exhibits clear stage-dependent variations across the sleep cycle, with increased electrodermal responses commonly observed during NREM sleep, particularly slow-wave sleep, whereas REM sleep is generally characterized by reduced sympathetic electrodermal activity (Romine et al., 2019). In addition, wrist-based EDA recordings have shown strong potential for unobtrusive and long-term sleep assessment, supporting the feasibility of wrist-worn devices for sleep-related autonomic monitoring. Subsequent work confirmed that these stage-dependent EDA patterns are sufficiently robust to support automated sleep–wake classification, with an optimized algorithm achieving 97% sensitivity and 86% overall accuracy (Herlan et al., 2019). More recently, machine learning approaches applied to wrist EDA have demonstrated promising multi-stage sleep classification performance. Anusha et al. (2022) reported approximately 88% accuracy for autonomic sleep staging using EDA alone; when skin temperature was incorporated as a complementary thermoregulatory feature, classification accuracy improved to approximately 96%. EDA-derived features have also shown utility beyond sleep staging, including obstructive sleep apnea screening using gradient-boosted decision models (Piccini et al., 2023). These findings establish EDA as a physiologically distinct autonomic marker of sleep depth, whose contribution to multimodal wearable systems is discussed further in Section 3.6.3.

3.6. Movement and Actigraphy

3.6.1. Accelerometer (Actigraphy)

Accelerometers are wearable sensors that detect movement and are widely used for objective sleep monitoring. By measuring periods of activity and inactivity, they can estimate sleep and wake times, providing a practical proxy for sleep–wake patterns. Wrist-worn accelerometers are commonly used to estimate sleep timing, total sleep duration, sleep regularity, and circadian rest–activity rhythms, offering high user acceptability and enabling longitudinal assessment in real-world environments. When processed with validated algorithms, accelerometer data provide reliable estimates of sleep parameters and support large-scale studies of sleep health (Liu and Benjamin-Neelon, 2023; Patterson et al., 2023; Yin et al., 2023).

Despite its utility, actigraphy exhibits inherent limitations. In particular, periods of quiet wakefulness, such as motionless resting in bed, are frequently misclassified as sleep, leading to systematic overestimation of sleep duration and reduced sensitivity to sleep fragmentation, especially in light sleep stages. Consequently, actigraphy demonstrates lower specificity for wake detection and reduced staging precision compared with polysomnography (Li et al., 2025). These constraints on movement-based sensing alone have been a central motivation for integrating accelerometry with physiological signals in modern wearable sleep devices, as described in the following subsections.

3.6.2. Emergence of Wearable and Home-Based Sleep Technologies

Over the past decade, rapid advances in wearable and home-based sensing technologies have substantially transformed sleep monitoring paradigms. Devices such as headbands, wristbands, smartwatches, rings, earbuds, and mattress-embedded sensors enable longitudinal data collection in naturalistic settings. Many of these systems integrate subsets of standard PSG channels, often combining PPG, accelerometry, and limited EEG or ECG leads to infer sleep stages indirectly. More recent platforms, including research-grade wristbands and consumer smart rings, have expanded their sensor arrays to incorporate EDA and skin temperature alongside PPG and accelerometry, broadening the range of physiological information available for sleep assessment.

These technologies increase accessibility for both consumers and researchers, facilitating large-scale and home-based sleep assessment. Nevertheless, validation studies show variable accuracy compared with PSG, particularly for light sleep and REM sleep (Birrer et al., 2024; de Zambotti et al., 2024). The extent to which accuracy improves when multiple sensor modalities are combined, rather than used in isolation, has become a central question in the field, addressed by the growing body of multimodal fusion research.

3.6.3. Multimodal Fusion Approaches

Building on the complementary strengths of individual sensing modalities, multimodal fusion approaches that integrate movement signals with autonomic features (e.g., PPG-derived heart rate and interbeat intervals), and, where feasible, additional cardiorespiratory measures, offer a pragmatic trade-off between user comfort and sleep staging accuracy. Recent studies indicate that PPG + accelerometer–based systems achieve moderate agreement with standard PSG for simplified 2- or 3-stage classification in healthy adults. In this dual-sensor approach, accelerometry improves detection of sleep–wake transitions, while PPG captures autonomic changes related to sleep stages. Such fusion strategies represent a promising avenue for accurate, home-based sleep monitoring and broader applications in population-level sleep research (Liu et al., 2023; Li et al., 2021).

Incrementally expanding the sensor array further improves classification performance. As discussed in Section 3.5.3 and Section 3.5.4, skin temperature and EDA each carry physiologically distinct information about sleep depth (thermoregulatory heat redistribution and sympathetic sudomotor activity, respectively) that is not fully captured by cardiovascular or movement signals. When these modalities are combined with PPG and accelerometry, studies report meaningful gains in staging accuracy: Anusha et al. (2022) demonstrated that adding skin temperature to EDA-based models increased classification accuracy from approximately 88% to 96%, while broader evaluations of four-sensor configurations (PPG + accelerometry + EDA + skin temperature) suggest that each additional modality contributes non-redundant discriminative information, particularly for REM and slow-wave sleep detection, where dual-sensor systems tend to underperform (Liu et al., 2023; Li et al., 2021).

Taken together, these findings underscore a clear trend: as wearable platforms incorporate a wider array of physiological sensors, the gap between home-based and laboratory-grade sleep staging continues to narrow, though full parity with PSG has not yet been achieved for fine-grained multi-stage classification.

Beyond conventional EEG-derived sleep signatures, modern wearable sleep staging systems increasingly integrate multimodal physiological sensing, including cardiovascular, autonomic, respiratory, thermoregulatory, and movement-related signals. Table 2 summarizes the principal wearable-compatible biosignals used in contemporary sleep staging systems.

4. Preprocessing and Digital Filtering

The physiological signals described in Section 3, whether recorded by laboratory PSG or consumer wearables, invariably contain noise and artifacts that must be addressed before any meaningful analysis can be performed. This section outlines the preprocessing, filtering, and feature extraction techniques that form the methodological bridge between raw signal acquisition and automated sleep-stage classification.

4.1. Signal Preprocessing

4.1.1. Artifact Detection and Removal

Signal preprocessing constitutes a foundational step in sleep research, as the validity of downstream analyses, such as automated sleep-stage classification and detection of physiological events, depends critically on signal quality. The primary objective is to obtain clean, temporally synchronized multimodal biosignals (EEG, EOG, EMG, ECG/PPG, and movement) while preserving physiologically meaningful information and inter-channel relationships. To facilitate synchronized epoching and multimodal feature extraction, signals are commonly resampled to a unified sampling frequency (typically 100–256 Hz), a practice that is particularly important in long-duration recordings and wearable-based sleep monitoring (Cox et al., 2024; Jiang et al., 2019). Artifact detection and correction procedures are then applied to mitigate noise introduced by motion, electrode displacement, poor sensor–skin contact, and environmental interference.

Figure 1. Signal processing pipeline for wearable-based sleep staging. Multimodal physiological signals — EEG/Ear-EEG (neural oscillations), PPG/HRV (cardiovascular), and accelerometry, EDA, and skin temperature (motion and physiological) — are passed through a sequential preprocessing pipeline consisting of artifact removal (ICA, ASR, adaptive noise cancellation), digital filtering (bandpass, notch, and wavelet denoising), and segmentation into 30-second epochs per AASM standards. The conditioned signals are then fed into feature extraction and machine learning or deep learning classification models to produce automated sleep stage estimates.

Contemporary preprocessing pipelines increasingly rely on automated and semi-automated quality-control procedures, such as amplitude thresholding, statistical deviation metrics (e.g., variance, kurtosis), and signal quality indices, to enable scalable, reproducible processing of large sleep datasets and reduce dependence on subjective manual inspection (Cox et al., 2024; Jiang et al., 2019). For EEG, blind source separation techniques, most notably Independent Component Analysis (ICA), are widely employed to disentangle neural activity from ocular, muscular, and cardiac contaminants. More recent sleep-oriented methods, including Artifact Subspace Reconstruction (ASR), have demonstrated effectiveness in cleaning whole-night EEG recordings while preserving sleep-relevant neural dynamics and characteristic oscillatory patterns, and are increasingly integrated into open-source preprocessing toolkits (Cox et al., 2024; Somervail et al., 2023). Moving beyond traditional decompositions, the state-of-the-art has rapidly shifted toward deep learning frameworks. Architectures such as multi-stage autoencoders and detection-guided U-Nets are now heavily utilized to perform automated artifact removal, successfully isolating ocular and muscular noise without compromising underlying sleep-related neural signatures (Farabbi et al., 2025; Nyanney et al., 2026).
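The rule-based quality-control step described above (amplitude thresholding combined with statistical deviation metrics) can be illustrated with a minimal sketch. The function name `flag_artifact_epochs`, the 500 µV amplitude limit, and the robust z-score cutoff of 3 are assumptions chosen for illustration; they are not values taken from the cited studies, and practical pipelines tune them per device and channel.

```python
import numpy as np

def flag_artifact_epochs(epochs, amp_limit_uv=500.0, z_limit=3.0):
    """Return a boolean mask marking epochs that look like artifacts.

    epochs: array of shape (n_epochs, n_samples), assumed to be in microvolts.
    amp_limit_uv and z_limit are illustrative defaults, not recommendations.
    """
    epochs = np.asarray(epochs, dtype=float)

    # Rule 1: absolute peak amplitude exceeds a fixed limit.
    peak = np.max(np.abs(epochs), axis=1)
    too_large = peak > amp_limit_uv

    # Rule 2: log-variance or excess kurtosis is an outlier relative to
    # the rest of the recording (robust z-score from median and MAD).
    centered = epochs - epochs.mean(axis=1, keepdims=True)
    var = centered.var(axis=1)
    kurt = (centered ** 4).mean(axis=1) / np.maximum(var, 1e-12) ** 2 - 3.0

    def robust_z(x):
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        return (x - med) / (1.4826 * mad + 1e-12)

    outlier = (np.abs(robust_z(np.log(var + 1e-12))) > z_limit) | (
        np.abs(robust_z(kurt)) > z_limit
    )
    return too_large | outlier
```

Flagged epochs would typically be excluded or passed to a correction method such as ICA or ASR rather than discarded outright; that choice depends on the study design.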

4.1.2. Segmentation and Epoching

Once artifacts have been mitigated and signals conditioned, continuous physiological recordings are segmented into discrete epochs, conventionally 30 s windows in accordance with AASM scoring guidelines. Each epoch is labeled with a vigilance state (Wake, N1, N2, N3, or REM) to support supervised learning and algorithmic sleep staging. In some analytical pipelines, overlapping windows (e.g., 50% overlap) are employed to improve temporal resolution and boundary sensitivity, whereas shorter epochs (5–10 s) are increasingly adopted in real-time and wearable-based sleep-monitoring applications to enable near-continuous state estimation (A. H. Zhang et al., 2024). The choice of epoch length and overlap strategy directly affects the granularity of subsequent feature extraction, which in turn determines the discriminative power available to classification algorithms.
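The segmentation step can be sketched as follows. This is a minimal illustration assuming a single-channel signal held in a NumPy array; the function name `segment_epochs` and its parameters are hypothetical. Trailing samples that do not fill a complete window are dropped.

```python
import numpy as np

def segment_epochs(signal, fs, epoch_s=30.0, overlap=0.0):
    """Split a 1-D signal into fixed-length epochs.

    fs: sampling frequency in Hz.
    epoch_s: epoch length in seconds (30 s follows AASM convention).
    overlap: fraction of each window shared with the next (0.5 = 50%).
    """
    signal = np.asarray(signal)
    win = int(round(epoch_s * fs))
    step = int(round(win * (1.0 - overlap)))
    if win <= 0 or step <= 0:
        raise ValueError("epoch_s must be positive and overlap must be < 1")
    if signal.shape[0] < win:
        return np.empty((0, win), dtype=signal.dtype)
    n_epochs = 1 + (signal.shape[0] - win) // step
    starts = np.arange(n_epochs) * step
    return np.stack([signal[s:s + win] for s in starts])
```

For a 5-minute recording sampled at 100 Hz, non-overlapping 30 s windows yield 10 epochs, whereas 50% overlap yields 19.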

4.2. Digital Filtering Techniques for Sleep Signal Preprocessing

Frequency-domain filtering represents a cornerstone of sleep-signal preprocessing, aimed at preserving physiologically informative frequency content while attenuating noise and artifacts originating from instrumentation, environmental sources, and subject motion. Careful filter design is essential to minimize phase distortion and amplitude attenuation of sleep-specific oscillatory phenomena, such as slow waves and sleep spindles, which constitute key biomarkers for sleep staging and microstructural analysis (Jiang et al., 2019).

In both PSG-based and wearable sleep studies, filtering strategies are typically applied sequentially, combining linear filters for baseline stabilization with nonlinear or adaptive techniques to accommodate the inherently non-stationary nature of sleep-related biosignals (Chaddad et al., 2023; Faust et al., 2018). Linear filtering remains the most widely adopted approach for isolating frequency bands of interest. Band-pass filters are routinely applied to extract physiologically relevant frequency ranges (e.g., 0.5–35 Hz for EEG/EOG and 10–100 Hz for EMG). Specifically, high-pass filters, often around 0.3 Hz in EEG preprocessing, are applied to reduce baseline drift and slow direct-current offsets arising from electrode polarization, skin potentials, or amplifier instabilities (Chaddad et al., 2023; Phan et al., 2019a). At the opposite end of the spectrum, low-pass filters are commonly used to attenuate high-frequency noise and myogenic contamination; in EEG and EOG preprocessing for sleep staging, cutoff frequencies in the range of 30–45 Hz are frequently selected to balance retention of physiologically relevant neural activity against effective noise suppression (Faust et al., 2018). Additionally, notch filters centered at the local mains frequency (50 or 60 Hz) are routinely applied to remove power-line interference, with narrow-band or adaptive designs preferred to minimize spectral distortion of adjacent frequency bins.
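A minimal sketch of the sequential linear filtering described above, using SciPy, is given here. It assumes a fourth-order Butterworth band-pass (0.5–35 Hz, the EEG/EOG range quoted in the text) followed by a notch at the mains frequency, both applied forward and backward for zero phase. The filter order, the notch quality factor, and the function name `bandpass_notch` are illustrative assumptions rather than settings reported by the cited studies.

```python
import numpy as np
from scipy import signal

def bandpass_notch(x, fs, low_hz=0.5, high_hz=35.0, mains_hz=50.0,
                   order=4, notch_q=30.0):
    """Zero-phase band-pass followed by a mains notch (illustrative defaults)."""
    # Second-order sections are numerically safer for low cutoff frequencies.
    sos = signal.butter(order, [low_hz, high_hz], btype="bandpass",
                        fs=fs, output="sos")
    y = signal.sosfiltfilt(sos, x)

    # The notch is only meaningful if the mains frequency is below Nyquist.
    if mains_hz < fs / 2:
        b, a = signal.iirnotch(mains_hz, notch_q, fs=fs)
        y = signal.filtfilt(b, a, y)
    return y
```

Forward-backward filtering doubles the effective attenuation and removes phase delay, which is why filter order and the use of zero-phase filtering should both be reported.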

Beyond filter topology, both FIR and IIR filters are widely used in sleep-signal preprocessing. FIR filters are often favored for their linear-phase characteristics, which preserve the temporal morphology of sleep-related waveforms, whereas IIR filters offer greater computational efficiency and steeper roll-off profiles. Transparent reporting of filter specifications, including filter type, order, cutoff frequencies, and the use of zero-phase (forward–backward) filtering, is essential for reproducibility and accurate interpretation of downstream analyses (Karpiel et al., 2021; Kessler et al., 2025). However, given the non-stationary nature of sleep biosignals, linear filtering is increasingly complemented by nonlinear and adaptive techniques to further enhance signal quality (Chaddad et al., 2023; Jain et al., 2024).

Unlike laboratory EEG, which benefits from controlled electrode placement and shielded environments, wearable-derived signals, including PPG and inertial measurements, require additional preprocessing due to heightened susceptibility to motion artifacts and unstable sensor–skin coupling. Therefore, linear filtering is increasingly complemented by adaptive filtering and nonlinear motion-artifact suppression techniques to enhance signal fidelity under real-world recording conditions (de Zambotti et al., 2019; Jain et al., 2024; Seok et al., 2021). Nonlinear methods, such as empirical mode decomposition (EMD), wavelet-based denoising, and adaptive noise cancellation, have demonstrated utility in isolating stage-specific oscillatory activity and suppressing transient artifacts in both EEG and wearable-derived signals. These approaches adapt to time-varying signal characteristics, thereby facilitating improved preservation of physiologically meaningful components (Elshekhidris et al., 2023). Among these, wavelet-based denoising is particularly well suited for transient sleep phenomena, including sleep spindles and K-complexes, as it enables joint time–frequency representations that selectively attenuate noise while preserving brief, stage-defining oscillatory events critical for accurate sleep staging (Abdolahniya et al., 2025; Hu et al., 2022).

4.3. Feature Extraction, Transformation, and Signal Quality Assessment

With clean, filtered, and segmented signals in hand, the next stage of the pipeline transforms raw or conditioned recordings into compact, informative representations that support automated sleep-stage classification. Feature extraction aims to summarize salient temporal, spectral, and nonlinear properties of physiological signals, whereas transformation techniques enhance class separability across vigilance states (Boostani et al., 2017; Phan et al., 2019a).

Commonly used spectral features include power spectral density (PSD), estimated using the Fast Fourier Transform (FFT) or Welch’s method, as well as band-specific power ratios such as delta/theta and sigma/beta, which reflect stage-dependent oscillatory dynamics. Time–frequency features derived using the Short-Time Fourier Transform (STFT) or wavelet transform are widely employed to capture transient sleep phenomena, including sleep spindles and K-complexes. In addition, nonlinear metrics such as sample entropy, permutation entropy, and fractal dimension are used to quantify signal complexity and irregularity across sleep stages (El Hadiri et al., 2024).
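The spectral features mentioned above can be sketched with Welch's method from SciPy. The band edges in `BANDS` are one common convention and are an assumption here, since definitions vary between studies; the function names and the 4 s Welch segment length are likewise illustrative.

```python
import numpy as np
from scipy.signal import welch

# Assumed band edges in Hz; conventions differ across the literature.
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 12.0),
    "sigma": (12.0, 15.0),
    "beta": (15.0, 30.0),
}

def band_powers(epoch, fs, bands=BANDS):
    """Absolute power per band, integrated from a Welch PSD estimate."""
    nperseg = min(len(epoch), int(4 * fs))  # 4 s segments give 0.25 Hz resolution
    freqs, psd = welch(epoch, fs=fs, nperseg=nperseg)
    df = freqs[1] - freqs[0]
    return {
        name: float(psd[(freqs >= lo) & (freqs < hi)].sum() * df)
        for name, (lo, hi) in bands.items()
    }

def spectral_features(epoch, fs):
    """Relative band powers plus the delta/theta and sigma/beta ratios."""
    p = band_powers(epoch, fs)
    total = sum(p.values()) + 1e-12
    feats = {f"rel_{name}": value / total for name, value in p.items()}
    feats["delta_theta"] = p["delta"] / (p["theta"] + 1e-12)
    feats["sigma_beta"] = p["sigma"] / (p["beta"] + 1e-12)
    return feats
```

Applied to each 30 s epoch, this yields a small feature vector in which, for example, a high relative delta power is consistent with slow-wave sleep.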

4.4. Data Augmentation for Limited Wearable Datasets

A recurring challenge in wearable sleep research is the scarcity of large, expert-annotated datasets, a constraint that is far less acute in laboratory PSG studies. Data augmentation techniques are increasingly employed to compensate for this shortage. Common approaches include overlapping window strategies (50–75% overlap), signal-level augmentations such as random amplitude scaling and time shifting, and advanced techniques including Mixup (interpolation between signal pairs), Random Cutout (zeroing segments), and synthetic time-series pretraining using frequency-domain generation. These strategies improve model robustness and generalization when training data are limited to fewer than 100 recordings, which is common in wearable sleep-staging research.
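The signal-level augmentations listed above can be sketched in a few lines of NumPy. All function names, scaling ranges, and the Mixup `alpha` value are illustrative assumptions; the time shift uses a circular roll for simplicity, which is one of several possible implementations.

```python
import numpy as np

def amplitude_scale(x, rng, low=0.8, high=1.2):
    """Multiply the whole epoch by one random gain factor."""
    return x * rng.uniform(low, high)

def time_shift(x, rng, max_shift):
    """Circularly shift the epoch by up to max_shift samples either way."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(x, shift)

def random_cutout(x, rng, max_len):
    """Zero one randomly placed segment of at most max_len samples."""
    out = x.copy()
    length = int(rng.integers(1, max_len + 1))
    start = int(rng.integers(0, len(x) - length + 1))
    out[start:start + length] = 0.0
    return out

def mixup(x1, y1, x2, y2, rng, alpha=0.2):
    """Blend two epochs and their one-hot labels with a Beta-distributed weight."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

Augmentations are normally applied to training data only, so that validation against PSG reflects unmodified recordings.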

In parallel with neurophysiological features, autonomic and movement-related indices are increasingly incorporated, particularly in wearable-based sleep monitoring frameworks. HRV metrics (e.g., LF/HF ratio, RMSSD), respiration rate, and motion amplitude provide complementary information regarding sleep–wake dynamics and autonomic regulation, enriching multimodal representations of sleep physiology (de Zambotti et al., 2019; El Hadiri et al., 2024).
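As a concrete example of the autonomic indices mentioned above, the following sketch computes basic time-domain HRV measures from a sequence of interbeat intervals. It assumes the intervals are given in milliseconds and have already been cleaned of ectopic beats and artifacts. The function name is hypothetical, and the LF/HF ratio is omitted because it additionally requires resampling and spectral estimation.

```python
import numpy as np

def hrv_time_domain(ibi_ms):
    """Mean heart rate, SDNN, and RMSSD from interbeat intervals in ms."""
    ibi = np.asarray(ibi_ms, dtype=float)
    successive = np.diff(ibi)  # differences between adjacent intervals
    return {
        "mean_hr_bpm": 60000.0 / ibi.mean(),
        "sdnn_ms": ibi.std(ddof=1),
        "rmssd_ms": float(np.sqrt(np.mean(successive ** 2))),
    }
```

For the intervals 1000, 1010, 990, and 1000 ms, the successive differences are 10, -20, and 10 ms, giving an RMSSD of about 14.1 ms at a mean heart rate of 60 bpm.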

Regardless of feature type, however, the reliability of extracted features is only as good as the underlying signal quality, a concern that is amplified in ambulatory and home-based settings. Systematic signal quality assessment is critical to ensure reliable downstream modeling, especially in long-term and home-based recordings. Automated algorithms commonly evaluate metrics such as signal-to-noise ratio (SNR), artifact burden per epoch, channel dropout rate, and physiological plausibility constraints (e.g., heart rate 40–120 bpm), enabling exclusion of corrupted epochs or channels before feature computation (Birrer et al., 2024; Robbins et al., 2024). For wearable PPG signals, signal quality indices (SQIs), such as template-matching correlation, perfusion indices, and signal-to-noise metrics, have been proposed to flag unreliable epochs before feature extraction. Quality-aware adaptive preprocessing pipelines that dynamically adjust filtering based on real-time SQI assessment represent an emerging approach for improving wearable robustness. Real-time or offline feedback supports adaptive preprocessing strategies, including channel reweighting, selective epoch rejection, and dynamic re-recording, improving robustness and classification performance in unconstrained conditions (Birrer et al., 2024; Robbins et al., 2024).
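Two of the quality checks named above, the physiological plausibility constraint and a template-matching correlation index, can be sketched as follows. The sketch assumes that beats have already been detected and resampled to a common length; the function names and the 90% acceptance fraction are illustrative assumptions, and the 40–120 bpm range is the example given in the text.

```python
import numpy as np

def hr_plausible(ibi_ms, lo_bpm=40.0, hi_bpm=120.0, min_fraction=0.9):
    """True if most instantaneous heart rates fall inside the plausible range."""
    hr = 60000.0 / np.asarray(ibi_ms, dtype=float)
    in_range = (hr >= lo_bpm) & (hr <= hi_bpm)
    return bool(in_range.mean() >= min_fraction)

def template_sqi(beats):
    """Mean correlation between each beat and the epoch's average beat.

    beats: array of shape (n_beats, n_samples), one segmented pulse per row.
    Values near 1 indicate consistent pulse morphology; constant rows
    would produce NaN and should be rejected beforehand.
    """
    beats = np.asarray(beats, dtype=float)
    template = beats.mean(axis=0)
    corrs = [np.corrcoef(beat, template)[0, 1] for beat in beats]
    return float(np.mean(corrs))
```

An epoch might then be retained only if both checks pass, with the SQI threshold chosen empirically for the sensor in question.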

It is worth noting that the feature-engineering paradigm described above is not the only path to classification. In deep-learning pipelines, handcrafted feature engineering is often minimized or omitted. Instead, normalized raw or minimally processed time-series data are provided directly to architectures such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, or Transformer-based models, which learn hierarchical representations directly from the data (Liu et al., 2024; Roy et al., 2019). Even in these end-to-end frameworks, however, signal quality assessment remains essential to prevent model degradation due to noise-dominated inputs (Martins et al., 2025).

4.5. Challenges and Future Directions

Despite substantial methodological progress, several challenges persist in preprocessing pipelines for automated sleep staging. Wearable sensors remain vulnerable to motion artifacts and ambient light interference, while the absence of standardized preprocessing workflows across datasets limits reproducibility and cross-study comparability. Edge-computing constraints necessitate computationally efficient algorithms for real-time sleep analysis on low-power devices. Furthermore, cross-modality harmonization, particularly for integrating heterogeneous signals such as EEG, PPG, and accelerometry, requires robust normalization, synchronization, and domain-adaptation strategies.

Emerging directions include self-supervised preprocessing, neural filtering architectures, and on-device adaptive signal enhancement, collectively offering promising avenues for advancing scalable, robust sleep analytics. On the software infrastructure side, the development of integrated open-source preprocessing frameworks, such as YASA, SleepEEGpy, and MNE-Python-based pipelines, represents an important step toward standardization, enabling reproducible and modular preprocessing workflows across research groups (Imtiaz, 2021; Phan and Mikkelsen, 2022; A. H. Zhang et al., 2024).

5. Performance Evaluation and Validation of Wearable Sleep Staging

Algorithmic advances in wearable sleep staging are only as clinically meaningful as the validation frameworks used to assess them. This section examines how wearable sleep-staging systems are evaluated, the performance currently achievable across different sensor modalities, and the sources of variability that complicate interpretation of reported results.

5.1. Performance Metrics and Validation Frameworks

Evaluation of wearable-based sleep-staging algorithms is most commonly conducted via epoch-by-epoch (EBE) comparisons against manually scored PSG, which serves as the clinical reference standard. Widely reported performance metrics include accuracy, sensitivity, specificity, precision, and F1-score for discriminating wake, NREM, and REM sleep. Given the inherent class imbalance in sleep datasets, particularly the predominance of N2 sleep, Cohen’s kappa (κ) is frequently reported to quantify agreement beyond chance. For whole-night sleep architecture indices, including total sleep time, sleep efficiency, and stage-specific durations, Bland–Altman analysis is commonly employed to assess systematic bias and limits of agreement between wearable-derived estimates and PSG references (Lee et al., 2025; Walch et al., 2019). It is worth noting that even expert human scorers achieve only approximately 82–85% inter-rater agreement on PSG, establishing a practical ceiling against which automated systems should be benchmarked rather than expecting perfect concordance.
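The two agreement statistics discussed above can be computed directly from their definitions. Cohen's kappa is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed epoch-by-epoch agreement and p_e the agreement expected by chance from the marginal stage frequencies. Bland–Altman analysis reports the mean difference (bias) and the limits of agreement, bias ± 1.96 standard deviations. The sketch below is a minimal illustration with hypothetical function names and does not reproduce the implementation of any cited study.

```python
import numpy as np

def cohens_kappa(reference, predicted):
    """Chance-corrected epoch-by-epoch agreement between two hypnograms."""
    ref = np.asarray(reference)
    pred = np.asarray(predicted)
    labels = np.unique(np.concatenate([ref, pred]))
    p_o = np.mean(ref == pred)
    p_e = sum(np.mean(ref == lab) * np.mean(pred == lab) for lab in labels)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return float("nan")
    return float((p_o - p_e) / (1.0 - p_e))

def bland_altman(device, reference):
    """Bias and 95% limits of agreement for a nightly summary metric."""
    diff = np.asarray(device, dtype=float) - np.asarray(reference, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return float(bias), float(bias - 1.96 * sd), float(bias + 1.96 * sd)
```

As a small worked case, if PSG scores six epochs as W, N2, N2, REM, N2, W and a device outputs W, N2, N2, N2, N2, REM, then p_o = 4/6 and p_e = 15/36, so κ = 3/7 ≈ 0.43 even though raw accuracy is about 67%.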

Beyond the choice of metrics, however, the design of validation studies plays a decisive role in determining the reliability and interpretability of reported performance metrics. The most rigorous validation framework involves simultaneous laboratory-based PSG recordings, enabling precise temporal synchronization and controlled acquisition conditions. Such protocols are frequently applied in clinical cohorts to benchmark algorithmic performance and signal fidelity. Yet laboratory conditions may not reflect typical use. Complementary home-based and real-world validation studies are increasingly conducted to evaluate algorithm robustness under naturalistic conditions, where motion artifacts, environmental noise, and night-to-night variability substantially influence signal quality. Device-to-device comparisons against established consumer trackers are also used to contextualize performance; however, their interpretability is limited by proprietary algorithms and potential deviations from PSG-based references (Chen et al., 2023; Chinoy et al., 2021; de Zambotti et al., 2019; Wulterkens et al., 2021). With these frameworks in mind, the following subsection summarizes the performance levels currently achieved across different sensing approaches.

5.2. Performance Characteristics Across Sensor Modalities

Reported sleep-staging performance varies considerably across sensor modalities, target sleep stages, and study populations. Wrist-worn devices integrating accelerometry and PPG typically achieve overall staging accuracies in the range of approximately 69–77%, with Cohen’s κ values between 0.42 and 0.65 across healthy adult cohorts, depending on staging granularity and algorithmic approach (Beattie et al., 2017; Robbins et al., 2024; Wulterkens et al., 2021). These systems generally demonstrate robust performance for wake and light NREM detection but show reduced discriminative ability for REM sleep and deep NREM sleep, particularly in the absence of direct neurophysiological measurements. This limitation reflects the constrained capacity of peripheral autonomic and movement signals to fully capture cortical state transitions underlying sleep architecture.

Recent multi-device validation studies offer more granular, device-specific performance data. In a head-to-head comparison of three consumer devices against PSG in 35 healthy adults, Robbins et al. (2024) reported four-stage Cohen’s κ values of 0.65 for the Oura Ring Gen 3 (with sleep-stage sensitivity ranging from 76.0% to 79.5%), 0.60 for the Apple Watch Series 8, and 0.55 for the Fitbit Sense 2. In a larger six-device study involving 62 adults at a tertiary sleep center, Schyvens et al. (2025) found lower agreement overall: Apple Watch Series 8 (κ = 0.53), Fitbit Sense (κ = 0.42), Fitbit Charge 5 (κ = 0.41), WHOOP 4.0 (κ = 0.37), Withings Scanwatch (κ = 0.22), and Garmin Vivosmart 4 (κ = 0.21). The discrepancy between studies likely reflects differences in cohort composition, firmware versions, and scoring protocols, underscoring that no single kappa value can be taken as definitive for a given device. Importantly, these device-level values should be interpreted against the practical ceiling of human expert scoring: a meta-analysis of 11 studies found that inter-rater κ for manual PSG sleep staging is 0.76 (95% CI: 0.71–0.81), with stage-specific agreement as low as κ = 0.24 for N1 (Lee et al., 2022). By this benchmark, the best-performing consumer wearables now approach, though do not yet match, expert-level agreement for overall staging, while substantial room for improvement remains in differentiating individual NREM substages.

In contrast, wearable platforms incorporating EEG or ear-EEG channels exhibit higher concordance with gold-standard PSG, particularly for wake and deep sleep classification, owing to their direct measurement of cortical dynamics. These modalities consistently outperform purely peripheral-signal-based devices in clinical validation studies and demonstrate improved generalizability in populations with altered sleep physiology, such as individuals with insomnia, sleep-disordered breathing, or neurodegenerative conditions (Chen et al., 2023; de Gans et al., 2024; Lee et al., 2023).

Figure 2. Cohen’s kappa (κ) agreement for sleep-stage classification across consumer wearable devices compared to the human inter-rater benchmark. Cohen’s kappa values are shown for six consumer wearable devices — Oura Ring Gen 3 (κ = 0.65), Apple Watch Series 8 (κ = 0.60), Fitbit Sense (κ = 0.42), WHOOP 4.0 (κ = 0.37), Withings ScanWatch (κ = 0.22), and Garmin Vivosmart (κ = 0.21) — benchmarked against the human inter-rater ceiling of κ = 0.76. Data are derived from multi-device validation studies against polysomnography in adult cohorts. The substantial spread in kappa values across devices underscores the heterogeneity in sleep-staging accuracy among commercially available wearables, with the best-performing devices approaching but not yet matching expert human agreement.

5.3. Sources of Variability and Clinical Interpretability

Substantial inter-study heterogeneity remains a defining characteristic of the wearable sleep-staging literature. This variability arises from both biological and methodological sources, each of which can substantially influence reported performance.

On the biological side, variability in participant characteristics, including age, cardiometabolic status, and the presence of sleep disorders, significantly modulates physiological signal properties and, consequently, algorithmic performance. Moreover, pronounced night-to-night variability has been reported for key sleep parameters, including sleep onset latency, wake after sleep onset (WASO), and stage-specific durations, indicating that single-night recordings may inadequately represent habitual sleep patterns (Arnal et al., 2020; Chen et al., 2023; Chouraki et al., 2023; Lee et al., 2023). On the methodological side, differences in epoch length (e.g., 20-, 30-, or 60-s epochs), manual versus automated PSG scoring protocols, and variability in preprocessing pipelines, artifact-rejection strategies, feature extraction procedures, and machine-learning architectures directly influence class boundaries, confusion patterns, and agreement statistics (Arnal et al., 2020; Chen et al., 2023; Imtiaz, 2021; Lee et al., 2023). Updates to proprietary algorithms and device firmware introduce an additional source of variability. Together, these biological and methodological factors make direct comparison of reported accuracies across studies difficult, underscoring the need for standardized benchmarking protocols and publicly shared evaluation datasets. The broader implications of population bias and generalizability for wearable sleep staging are discussed in Section 8.3.

From a translational and clinical perspective, important distinctions must be drawn between medical-grade and consumer-oriented sleep-monitoring devices. Clinical validation studies typically employ research-grade or medically certified wearables, include participants with diagnosed sleep disorders, report detailed epoch-level performance metrics, and provide transparent descriptions of preprocessing and algorithmic pipelines. In contrast, consumer devices, although increasingly subjected to validation efforts, often report only summary sleep metrics and rarely disclose algorithmic methodologies. Consequently, sleep-staging outputs from consumer devices should be interpreted with caution, particularly in clinical populations in which physiological abnormalities, comorbidities, and signal artifacts may substantially degrade performance (Chen et al., 2023; Chinoy et al., 2021; de Zambotti et al., 2019; Imtiaz, 2021; Lee et al., 2023). This distinction between research-grade transparency and consumer-grade opacity becomes especially relevant in the next section, which examines specific device platforms and their validation evidence in greater detail.

6. Device Types: Consumer Trackers and Clinical Wearables

The performance characteristics reviewed in Section 5 are inseparable from the hardware that generates the underlying signals. This section surveys the two broad categories of wearable sleep-monitoring devices, consumer-grade trackers and clinical-grade wearables, and synthesizes the head-to-head evidence comparing them.

6.1. Consumer Wrist-Worn Devices (Fitbit, Apple Watch, Garmin, Whoop, etc.)—Strengths & Limitations

Consumer wrist-worn devices are typically multi-sensor platforms that combine accelerometry and PPG to algorithmically infer sleep–wake states and, in some cases, sleep stages (Haghayegh et al., 2019). Their main strengths include high wearability, extended battery life, relatively low cost, and suitability for long-term, large-scale monitoring in naturalistic, non-laboratory environments (de Zambotti et al., 2019).

Validation studies and systematic reviews indicate that many modern consumer devices reliably detect sleep versus wake and provide reasonably accurate estimates of total sleep time and sleep efficiency. As detailed in Section 5.2, the best-performing consumer devices now achieve four-stage Cohen’s κ values of 0.53–0.65 in healthy adults, though performance varies substantially across brands: the Oura Ring Gen 3 leads at κ = 0.65, while some wrist-worn devices fall as low as κ = 0.21 (Schyvens et al., 2025). However, stage-level classification performance, particularly for REM sleep and deep NREM sleep, remains inconsistent across devices, firmware versions, and study populations (Chinoy et al., 2021). A frequently overlooked source of this inconsistency is the role of firmware and algorithm updates: the same physical device can produce meaningfully different staging results after a software update, yet most published validations report results for a single firmware version without longitudinal follow-up. Additional limitations include the use of proprietary algorithms, limited transparency in signal preprocessing pipelines, and heterogeneity in validation cohorts, all of which constrain clinical interpretability and limit cross-study comparability (Chinoy et al., 2021; de Zambotti et al., 2019). Despite these limitations, consumer devices offer an unmatched advantage in longitudinal data collection, capturing weeks to months of sleep data that would be impractical to obtain with PSG, making them valuable for tracking trends even when individual-night staging accuracy is imperfect.

6.2. Clinical-Grade Wearables and Wearable EEG Systems

Where consumer devices trade precision for convenience, clinical-grade wearables and wearable EEG systems prioritize signal fidelity and diagnostic accuracy for sleep monitoring in clinical and research contexts. These devices often include single-channel EEG headbands (such as the Dreem headband and Sleep Profiler), ear-EEG sensors, or regulatory-cleared medical devices that directly measure electrophysiological signals. Direct acquisition of neural activity enables substantially higher accuracy in sleep-stage classification compared with accelerometer- and PPG-based consumer devices (Fonseca et al., 2023).

Unlike most consumer trackers, clinical-grade devices typically provide greater transparency in preprocessing pipelines and algorithmic methodologies and are validated in both healthy individuals and patients with clinically diagnosed sleep disorders (de Gans et al., 2024). These systems can reliably discriminate between wakefulness, N1–N3, and REM sleep, making them suitable for diagnostic assessment and controlled experimental studies. Nevertheless, challenges related to user comfort, long-term adherence, and the reliability of electrode placement remain important considerations for real-world and longitudinal deployment (Ross et al., 2023).

A particularly promising development in this space is in-ear EEG (ear-EEG) technology. Recent studies in older adults have demonstrated that single-channel ear-EEG sleep staging can achieve κ = 0.64 using transfer learning models adapted from scalp-EEG data (Hammour et al., 2024), a value at the lower end of the range conventionally interpreted as substantial agreement (κ = 0.61–0.80). These devices leverage ear canal proximity to temporal lobe cortical generators for sleep-relevant EEG acquisition, and early validation of commercial platforms has confirmed strong signal concordance with standard PSG derivations (Palo et al., 2024). Because ear-EEG sensors are less visible and less obtrusive than forehead-mounted headbands, they may improve long-term adherence, addressing one of the key barriers to clinical-grade wearable adoption. Additionally, contactless and under-mattress systems, such as the Withings Sleep Analyzer and Emfit QS, offer an alternative approach by capturing ballistocardiographic and respiratory signals without any body-worn sensor, though their staging granularity is generally limited to two- or three-class classification.

6.3. Head-to-Head Validation Evidence Synthesis

The preceding subsections suggest a fundamental trade-off between wearability and staging precision. Head-to-head comparisons confirm this pattern while revealing important nuances.

Direct comparisons highlight systematic trade-offs between consumer and clinical-grade device categories. Consumer wrist-worn devices and nearables perform well for estimating aggregate sleep duration and long-term sleep–wake patterns, whereas EEG-based wearables provide superior stage-level classification accuracy and greater clinical utility (Wiley et al., 2024). Multicenter validation studies further reveal substantial performance variability across device models and firmware versions, even within the same manufacturer. For instance, the Apple Watch Series 8 achieved κ = 0.60 in one study (Chen et al., 2023; Chinoy et al., 2021; de Zambotti et al., 2019; Imtiaz, 2021; Lee et al., 2023; Schyvens et al., 2025) but only κ = 0.53 in another (Schyvens et al., 2025), a discrepancy that likely reflects differences in cohort composition, concurrent sleep pathology, and scoring protocols rather than device hardware alone.

These findings underscore the need for standardized benchmarking datasets, harmonized evaluation protocols, and transparent reporting practices to enable robust cross-device comparisons (de Zambotti et al., 2019). Ideally, future validation studies would adopt multi-night protocols capturing night-to-night variability, report firmware and algorithm version numbers, include both healthy and clinical cohorts, and make epoch-level data publicly available. For clinical applications such as diagnosis, treatment monitoring, and outcome assessment, devices that incorporate direct electrophysiological measurements or rigorously validated cardiorespiratory sensing algorithms currently offer the highest reliability (Chinoy et al., 2021; Ross et al., 2023). However, for population-level screening, longitudinal trend monitoring, and research contexts where scalability outweighs single-night precision, consumer devices remain an indispensable tool, particularly as their algorithms continue to improve with each firmware generation.

7. Population- and Use-Case–Specific Evidence for Wearable Sleep Staging

The preceding sections established the validation frameworks (Section 5) and device landscape (Section 6) for wearable sleep staging. However, aggregate performance metrics can obscure critical variation: validation performance and clinical utility differ substantially across the populations, age groups, and clinical contexts in which these devices are actually deployed. Accordingly, population- and use-case–specific evidence is essential for accurate interpretation of wearable-derived sleep metrics and for determining their suitability for research applications, clinical screening, and longitudinal monitoring. This section reviews the evidence base across eight distinct population groups and use cases, progressing from well-studied cohorts to emerging application domains.

7.1. Healthy Adults and General Population Studies

Most wearable-based sleep-staging validation studies have been conducted in healthy adult cohorts. Recent comparative evaluations of multiple consumer wearables (e.g., ring-, smartwatch-, and watch-style devices) against PSG in healthy adults demonstrate overall moderate-to-good agreement, with high sensitivity (≥95%) for sleep–wake detection but only moderate concordance for stage-level classification (Robbins et al., 2024).
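High sensitivity for sleep–wake detection does not by itself imply good wake detection. The sketch below, which assumes binary 'S'/'W' epoch labels and uses invented data rather than results from any cited study, computes epoch-by-epoch sensitivity (sleep scored as sleep) and specificity (wake scored as wake) side by side.

```python
def sleep_wake_performance(reference, device):
    """Return (sensitivity, specificity, accuracy) for 'S'/'W' epoch labels.

    Sensitivity: share of reference sleep epochs the device scores as sleep.
    Specificity: share of reference wake epochs the device scores as wake.
    """
    if len(reference) != len(device):
        raise ValueError("sequences must be equally long")
    pairs = list(zip(reference, device))
    tp = sum(r == "S" and d == "S" for r, d in pairs)
    fn = sum(r == "S" and d == "W" for r, d in pairs)
    tn = sum(r == "W" and d == "W" for r, d in pairs)
    fp = sum(r == "W" and d == "S" for r, d in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both sleep and wake epochs")
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(pairs)


# Hypothetical 20-epoch night: 16 sleep epochs and 4 wake epochs per PSG.
psg = ["W"] * 2 + ["S"] * 8 + ["W"] + ["S"] * 8 + ["W"]
# The device detects every sleep epoch but scores 2 of the 4 wake epochs as sleep.
device = ["W"] + ["S"] * 18 + ["W"]
sensitivity, specificity, accuracy = sleep_wake_performance(psg, device)
print(sensitivity, specificity, accuracy)
```

Because sleep epochs dominate a typical night, overall accuracy here stays at 90% even though half of the wake epochs are missed, which is why validation reports benefit from stating both quantities.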

Large-scale validation studies further indicate that consumer devices provide robust estimates of total sleep time and sleep efficiency at the population level, despite substantial variability in stage-specific classification accuracy across device models and firmware versions (Chinoy et al., 2021). Overall, these findings support the use of wearables for large-scale epidemiological investigations and longitudinal monitoring of sleep–wake patterns in the general population, while highlighting their current limitations for fine-grained sleep-stage discrimination and clinical diagnostic applications. Importantly, the healthy adult cohort serves as the reference baseline against which performance in all other populations is compared, and as the following subsections demonstrate, accuracy tends to decline as physiological complexity increases.

7.2. Pediatric and Adolescent Populations

Among the first departures from this adult baseline are developmental populations, where sleep architecture differs fundamentally from that of adults. Wearable sleep tracking in children and adolescents remains comparatively underexplored, and available validation evidence is limited in both scale and methodological consistency. Existing studies suggest that consumer wearables can achieve moderate agreement with PSG in adolescent populations; however, sleep-stage classification accuracy is typically lower than that observed in adults (Lee et al., 2019). Developmental differences in sleep architecture, greater nocturnal motor activity, and age-dependent changes in circadian regulation introduce additional sources of variability that complicate wearable-based sleep staging in pediatric cohorts. Standardized validation frameworks, including epoch-by-epoch comparisons and age-stratified performance reporting, have been proposed to improve methodological consistency and interpretability in this domain (Nguyen et al., 2021).

Further large-scale, age-stratified validation studies are needed before wearable sleep staging can be reliably deployed in pediatric clinical settings or school-based screening programs. At the other end of the age spectrum, older adults present a distinct but equally challenging set of physiological changes.

7.3. Older Adults and Movement-Related Neurological Disorders

In older adults, age-related changes in sleep architecture, such as reduced sleep efficiency, increased WASO, and greater nocturnal sleep fragmentation, systematically degrade the accuracy of wearable-based sleep estimates. Consumer wrist-worn devices tend to overestimate total sleep time and underestimate wakefulness in geriatric cohorts, largely due to altered movement patterns and prolonged periods of quiet wakefulness (Wei and Boger, 2021).
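The direction of this bias follows directly from how summary metrics are derived from a hypnogram. As a minimal sketch, assuming 30-second epochs, binary 'S'/'W' labels, and a definition of WASO as all wake epochs after the first sleep epoch until the end of the recording, the following example shows how scoring a single block of quiet wakefulness as sleep inflates total sleep time and sleep efficiency while removing WASO entirely. The data are invented for illustration.

```python
def summary_metrics(hypnogram, epoch_sec=30):
    """Total sleep time, WASO (minutes), and sleep efficiency (%) from 'S'/'W' epochs.

    The hypnogram is assumed to span time in bed. WASO counts every wake epoch
    after the first sleep epoch, including wake after the final awakening.
    """
    n_sleep = sum(s == "S" for s in hypnogram)
    if n_sleep == 0:
        raise ValueError("hypnogram contains no sleep epochs")
    onset = hypnogram.index("S")          # first epoch scored as sleep
    to_min = epoch_sec / 60               # minutes per epoch
    tst = n_sleep * to_min
    waso = sum(s == "W" for s in hypnogram[onset:]) * to_min
    time_in_bed = len(hypnogram) * to_min
    return {"tst_min": tst, "waso_min": waso, "se_pct": 100 * tst / time_in_bed}


# Hypothetical 2-hour recording: 10 min latency, 10 min of quiet wake mid-night.
psg = ["W"] * 20 + ["S"] * 100 + ["W"] * 20 + ["S"] * 100
# A movement-based device that scores the motionless wake block as sleep.
device = ["W"] * 20 + ["S"] * 220
psg_summary = summary_metrics(psg)
device_summary = summary_metrics(device)
print(psg_summary)
print(device_summary)
```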

Accuracy declines further in individuals with movement-related sleep disorders and neurodegenerative diseases, including Parkinson’s disease, REM sleep behavior disorder, and restless legs syndrome. In these populations, abnormal nocturnal movements, frequent state transitions, and pronounced sleep fragmentation lead to systematic misclassification of sleep and wake by actigraphy-based systems (de Zambotti et al., 2019; Högl et al., 2018). These findings underscore the need for disorder-aware algorithms, multimodal sensing strategies, and population-specific calibration when deploying wearable sleep staging in aging and neurologically vulnerable populations. Beyond neurological conditions, a broader range of clinical sleep disorders further challenges wearable performance in distinct ways.

7.4. Clinical Populations and Disorder-Specific Performance (Insomnia, OSA, Narcolepsy, Psychiatric Comorbidity)

Wearable sleep-staging performance is further reduced in clinical populations, with distinct disorder-specific limitations. In insomnia, wrist-worn devices frequently overestimate total sleep time and sleep efficiency because periods of quiet wakefulness are misclassified as sleep in the absence of EEG-based arousal detection (de Zambotti et al., 2019). This particular failure mode, confusing motionless wakefulness with sleep, represents a fundamental limitation of accelerometry-based approaches that no amount of algorithmic refinement can fully overcome without additional signal modalities.

In obstructive sleep apnea (OSA), cardiopulmonary signals provide informative physiological features for sleep–wake and coarse sleep-stage classification. Studies leveraging PPG, heart rate variability, and respiratory effort demonstrate moderate discrimination between REM and NREM sleep when validated against PSG, although performance remains inferior to EEG-based systems, particularly for fine-grained stage transitions (Fonseca et al., 2020). Notably, the autonomic disruptions caused by recurrent apneic events can paradoxically provide additional discriminative features for OSA screening, even as they degrade conventional sleep-staging accuracy.

Sleep staging remains particularly challenging in narcolepsy and psychiatric disorders, where markedly fragmented sleep architecture, frequent state transitions, and dysregulated REM sleep complicate both PSG scoring and wearable-based inference. Current evidence supporting accurate and clinically reliable wearable sleep staging in these populations remains limited, highlighting an important gap for future algorithmic development and disorder-specific clinical validation (Bassetti et al., 2019; de Zambotti et al., 2019). Nevertheless, recent machine learning studies have demonstrated that wearable-derived digital biomarkers, including sleep patterns, heart rate, and activity, can predict mood episodes in bipolar disorder with 83% accuracy for depressive symptoms (area under the receiver operating characteristic curve [AUROC] = 0.89) and 91% for manic symptoms, though manic episode prediction showed a low F1-score (0.25) reflecting class imbalance in the evaluation cohort (Wu et al., 2025). This finding suggests that even when fine-grained staging remains challenging, wearable-derived features retain clinical utility for mood episode prediction. More broadly, it illustrates a principle that recurs across the emerging populations discussed below: wearable sleep data may prove most clinically valuable not as a replacement for PSG staging but as a longitudinal biomarker of disease trajectory.
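The coexistence of high accuracy and a low F1-score noted above is a generic consequence of class imbalance rather than a contradiction. The sketch below uses invented confusion-matrix counts, chosen only to reproduce the same pattern and not taken from the cited study, to show how a rare positive class allows accuracy to remain high while precision and recall are poor.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical evaluation: 200 windows, only 8 of which are true positive-class episodes.
# The classifier finds 3 of them (5 missed) and raises 13 false alarms.
metrics = binary_metrics(tp=3, fp=13, fn=5, tn=179)
print(metrics)
```

With these assumed counts the large number of correctly rejected negative windows dominates accuracy, whereas F1 ignores true negatives and therefore reflects the weak detection of the rare class.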

7.5. Pregnancy and Perinatal Populations

Pregnancy represents a physiological context in which laboratory PSG is particularly impractical, making continuous wearable monitoring especially valuable. Wearable sleep monitoring during pregnancy has gained increasing research attention. A large-scale retrospective analysis of 10,318 Oura Ring users revealed that sleep duration peaked at approximately 9 weeks gestation, with an increase of roughly 15 minutes of sleep time relative to prepregnancy, before declining through the second and third trimesters, while time awake during the night progressively increased (Wu et al., 2025). These findings highlight the value of continuous monitoring for characterizing gestational sleep changes at a scale that would be infeasible with traditional laboratory-based assessment. However, dedicated validation of wearable staging accuracy against PSG during pregnancy, when hormonal and respiratory changes may alter signal characteristics, remains limited and warrants further investigation.

7.6. Shift Workers and Circadian Disruption

Shift workers represent a priority population due to chronic circadian misalignment, which disrupts the typical relationship between physiological signals and sleep stages that wearable algorithms are trained to recognize. Affecting an estimated 15–20% of the workforce in industrialized nations (Boivin et al., 2022), shift work disorder is associated with fragmented sleep architecture and attenuated circadian amplitude, both of which degrade the performance of standard sleep-staging algorithms calibrated on conventional sleepers. Recent validation work has demonstrated that wrist-worn actigraphy combined with ambient light photometry can predict dim-light melatonin onset (DLMO) — the gold-standard circadian phase marker — with a Lin’s concordance coefficient of 0.70 in night shift workers (Cheng et al., 2021). This finding suggests that wearable-derived circadian phase estimates could serve as an auxiliary input to adaptive sleep-staging models, enabling algorithms to account for the phase-shifted physiology characteristic of this population. Nevertheless, dedicated validation studies comparing wearable sleep-staging accuracy between day-shift and night-shift conditions remain scarce, representing a critical gap given the well-documented attenuation of sleep spindles and slow-wave activity under circadian misalignment.
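Lin's concordance coefficient differs from a Pearson correlation in that it penalizes systematic offsets between predicted and measured values. A minimal sketch, assuming population (1/n) moments and using invented phase times rather than data from the cited study, is given below.

```python
from statistics import fmean


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mx, my = fmean(x), fmean(y)
    var_x = fmean([(a - mx) ** 2 for a in x])
    var_y = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    # The squared mean difference in the denominator penalizes constant bias.
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


# Hypothetical phase times in clock hours (24.0 denotes midnight).
measured = [20.0, 21.0, 22.0, 23.0, 24.0]
# Predictions that track the measured values perfectly but run one hour late.
predicted = [t + 1.0 for t in measured]
ccc = lins_ccc(measured, predicted)
print(f"CCC = {ccc:.2f}")
```

In this example the Pearson correlation would equal 1, but the constant one-hour offset reduces the concordance coefficient to 0.80, which is the property that makes the statistic suitable for judging whether an estimate can substitute for the reference measurement.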

7.7. Athletes and Sports Science

Sleep optimization is increasingly recognized as a performance variable in competitive sports, driving demand for field-deployable sleep monitoring in athletic populations. Validation of EEG-based wearables (Somfit) in athletes demonstrated 79% accuracy with optimal signal quality (Roach et al., 2025). Consumer device comparisons in endurance athletes confirm adequate sleep detection but variable stage accuracy, consistent with the general population. A unique challenge in this population is the confounding effect of intense physical activity on autonomic signals: elevated resting heart rate and altered HRV following heavy training sessions may reduce the discriminative value of PPG-based features for sleep-stage classification, though this hypothesis has not yet been rigorously tested.

7.8. Longitudinal Monitoring, Therapy Follow-Up, and Population Surveillance

While Sections 7.1–7.7 focused on cross-sectional performance in specific populations, a distinct and arguably more impactful application of wearables lies in sustained, longitudinal data collection. Continuous, long-term sleep monitoring in naturalistic home environments enables characterization of night-to-night variability and intra-individual fluctuations beyond what is feasible with single-night PSG (Chinoy et al., 2021).

Large-scale population studies further demonstrate that wearable-derived sleep regularity and variability metrics are independently associated with cardiovascular and metabolic risk, beyond total sleep duration alone, supporting the utility of longitudinal digital sleep biomarkers for population-level health surveillance (Chen et al., 2022). In this context, the absolute epoch-level staging accuracy that dominates single-night validation studies becomes less critical than the device’s ability to reliably track relative changes within individuals over time — a measurement property that has received comparatively little formal evaluation.
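The cited studies are not described here in enough detail to say which regularity metric they used. As one commonly used formulation, and purely as an illustrative assumption, the Sleep Regularity Index compares sleep–wake state at the same clock time on consecutive days and rescales the proportion of matches so that 100 indicates identical days and 0 indicates chance-level consistency. The hourly binning and the example schedule below are hypothetical.

```python
def sleep_regularity_index(days):
    """Sleep Regularity Index from per-day binary state vectors (1 = asleep).

    Each day must be sampled at the same clock times. The index rescales the
    proportion of matching states between consecutive days to [-100, 100].
    """
    if len(days) < 2:
        raise ValueError("at least two days are required")
    if len({len(d) for d in days}) != 1:
        raise ValueError("all days must have the same number of bins")
    matches = total = 0
    for today, tomorrow in zip(days, days[1:]):
        for a, b in zip(today, tomorrow):
            matches += int(a == b)
            total += 1
    return -100 + 200 * matches / total


def hourly_day(sleep_start, sleep_end):
    """Hypothetical helper: 24 hourly bins, asleep for start <= hour < end."""
    return [1 if sleep_start <= h < sleep_end else 0 for h in range(24)]


regular = [hourly_day(0, 8)] * 3                       # asleep 00:00-08:00 every day
shifted = [hourly_day(0, 8), hourly_day(2, 10), hourly_day(0, 8)]  # one late night
sri_regular = sleep_regularity_index(regular)
sri_shifted = sleep_regularity_index(shifted)
print(sri_regular, round(sri_shifted, 1))
```

Both schedules contain eight hours of sleep per day, so the difference in the index reflects timing regularity alone, independent of sleep duration.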

Wearable sleep measures are increasingly applied to therapy monitoring and outcome assessment, including Continuous Positive Airway Pressure (CPAP) adherence in sleep apnea, longitudinal cognitive aging trajectories, and the evaluation of behavioral and lifestyle interventions (de Zambotti et al., 2019). Overall, these applications support emerging paradigms of personalized and precision sleep medicine, in which individual baseline sleep profiles and longitudinal trajectories are leveraged to tailor interventions and monitor treatment response over time. The challenges and opportunities identified across these diverse populations and use cases converge on a common set of open problems, ethical, technical, and translational, that are addressed in the following section.

8. Challenges and Limitations of Wearable Sleep Staging

8.1. Signal and Sensor Constraints

A central limitation of current wearable sleep technologies is the absence of direct neurophysiological measurements, particularly EEG, EOG, and EMG, which remain the cornerstone of conventional sleep stage scoring. Most consumer wearable platforms instead infer sleep stages from peripheral biosignals such as accelerometry, heart rate, and photoplethysmography. While these modalities support coarse estimation of sleep–wake patterns and global sleep timing, they provide limited insight into cortical dynamics, microarousals, and REM-specific neurophysiological features. Consequently, wearable-based sleep staging demonstrates reduced fidelity in resolving fine-grained sleep architecture. A multi-device validation study comparing six commercial wrist-worn devices against polysomnography reported epoch-by-epoch sleep stage agreement ranging from 38% to 56%, with the poorest performance observed for REM and light sleep classification (Chinoy et al., 2021). These limitations are largely attributable to the use of indirect physiological signals, such as accelerometry and photoplethysmography, rather than direct neurophysiological recordings. Accordingly, accelerometry-based approaches are generally sufficient for sleep–wake discrimination but remain limited in their ability to reliably differentiate NREM sub-stages and detect REM sleep (Chinoy et al., 2021; Depner et al., 2020; de Zambotti et al., 2019).

8.2. Real-World Environmental Confounds

In naturalistic home environments, wearable performance is further constrained by motion artifacts, variability in sensor placement, and fluctuations in skin–sensor contact quality, all of which compromise signal fidelity. Additional uncontrolled contextual factors, including bed-sharing, irregular sleep schedules, ambient light and noise, nocturnal device displacement, and even smartphone use in bed, introduce variance that is largely absent in controlled laboratory PSG recordings (Chinoy et al., 2021). These challenges disproportionately affect individuals with fragmented sleep, parasomnias, or elevated nocturnal movement, leading to systematic overestimation of total sleep time and underestimation of wake after sleep onset (WASO). Indeed, de Zambotti et al. (2024) noted that misclassification of wakefulness during the sleep period and poor tracking outside the main sleep bout remain among the most persistent failure modes of consumer devices. Skin pigmentation, tattoo density, and body composition can also degrade optical sensor performance, raising equity concerns when devices are deployed across diverse populations (Chinoy et al., 2021; de Zambotti et al., 2019).

8.3. Population Bias and Generalizability

Inter-individual and population-specific heterogeneity presents a further obstacle to the generalizability of wearable sleep-staging algorithms. Models trained predominantly on data from healthy young adults in laboratory settings often fail to generalize reliably to children, older adults, and clinical populations with altered sleep architecture and autonomic regulation. Age-related changes in sleep depth, REM density, and movement patterns, together with disorder-specific physiological alterations observed in insomnia, obstructive sleep apnea, and neurodegenerative conditions, introduce systematic biases that degrade classification accuracy and limit clinical interpretability (Chinoy et al., 2021; Depner et al., 2020; de Zambotti et al., 2019). The available landscape of validation studies remains disproportionately composed of young, healthy, predominantly White cohorts under controlled conditions, with minimal representation of racial and ethnic minorities or individuals with chronic illness (de Zambotti et al., 2024). This demographic skew not only limits confidence in cross-population performance claims but also risks exacerbating existing healthcare disparities if wearable-derived metrics inform clinical decision-making.

8.4. Algorithmic Opacity and Validation Gaps

A less visible but equally consequential challenge is the proprietary nature of most consumer sleep-staging algorithms. Manufacturers typically treat their classification models as trade secrets, precluding independent audit of training data composition, feature engineering choices, and model architecture. This algorithmic opacity complicates device-to-device comparisons and hinders reproducibility in research settings. Menghini et al. (2021) proposed a standardized analytical framework for evaluating sleep-tracker performance, yet adoption remains inconsistent across the field. Furthermore, regulatory pathways for consumer sleep trackers remain fragmented: the U.S. Food and Drug Administration (FDA) classifies most devices as general wellness products exempt from premarket review, while simultaneously clearing AI-based sleep staging software (e.g., Beacon Biosignals’ SleepStageML in 2024) under the 510(k) pathway for clinical-grade EEG devices. This regulatory asymmetry means that consumer devices making implicit clinical claims operate under substantially less scrutiny than their clinical counterparts.

8.5. Orthosomnia and Behavioral Iatrogenesis

Finally, the growing ubiquity of consumer sleep trackers has given rise to an emergent clinical concern: orthosomnia, a preoccupation with achieving perfect sleep data that paradoxically disrupts sleep quality. Recent evidence suggests that orthosomnia-related behaviors may occur among a subset of wearable sleep-tracker users, particularly individuals with perfectionistic traits, heightened sleep-related anxiety, or pre-existing psychological vulnerability (Jahrami et al., 2024). Clinicians have reported cases in which patients fixate on nightly sleep scores, interpret normal night-to-night variability as pathological, or resist evidence-based treatments that conflict with tracker-derived metrics. This behavioral iatrogenesis underscores that wearable sleep data, however well-intentioned, require careful contextual interpretation and should complement, not replace, clinical assessment.

9. Future Directions and Clinical Translation

9.1. Multimodal Sensing Architectures

Future advances in wearable sleep staging are expected to converge around multimodal sensing frameworks that fuse peripheral biosignals with emerging wearable neurophysiological recordings. Dry-electrode and ear-EEG systems, such as the IDUN Guardian and the Muse S headband, have demonstrated moderate-to-substantial agreement with PSG for sleep staging while maintaining sufficient comfort for longitudinal home use (Hammour et al., 2024; Mikkelsen et al., 2019, 2017). In parallel, integrated wearable polysomnography prototypes, exemplified by the SOMNIIA sleep mask, which combines dry-electrode EEG, ECG, and PPG in a single form factor, point toward a future in which multi-channel neurophysiological recordings are captured as unobtrusively as a sleep mask (Markov et al., 2025). Combining these neurophysiological channels with contextual signals from accelerometry, skin temperature, and electrodermal activity enables more robust characterization of sleep microstructure and enhances discrimination between REM and NREM sub-stages (Chambon et al., 2018; Radha et al., 2019).

9.2. Adaptive and Generalizable Machine Learning

In parallel with hardware advances, recent developments in machine learning offer promising avenues to address the generalization gap that plagues current wearable algorithms. Transfer learning and domain adaptation techniques enable models pre-trained on large PSG datasets to be fine-tuned with small amounts of device-specific or population-specific data, substantially improving cross-device and cross-demographic performance (Phan et al., 2019a, 2019b). Self-supervised and contrastive learning approaches further reduce reliance on costly expert-annotated labels by extracting robust feature representations from unlabeled wearable data. Looking further ahead, federated learning architectures enable collaborative model training across distributed populations of wearable users without centralizing sensitive physiological data, addressing the twin challenges of data scarcity and privacy regulation (Moon et al., 2023). Edge-deployable architectures, including knowledge-distilled and quantized transformer variants, will be critical to enabling real-time, on-device sleep staging without reliance on cloud connectivity, an important requirement for usability in remote or resource-limited settings.
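To illustrate how federated training can avoid centralizing raw physiological data, the sketch below shows only the server-side aggregation step of federated averaging (FedAvg), in which locally trained parameter vectors are combined in proportion to each client's sample count. This is a generic illustration under simplifying assumptions (flat parameter lists, a single round, no secure aggregation) and is not the method of the cited work. All names and values are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted mean of per-client parameter vectors (FedAvg aggregation).

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples held by each client.
    Only parameters and counts reach the server, never the underlying recordings.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one size per client and at least one client")
    n_params = len(client_weights[0])
    if any(len(w) != n_params for w in client_weights):
        raise ValueError("all clients must share the same model shape")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical devices: one contributes 1 night of data, the other 3 nights.
local_models = [[1.0, 2.0], [3.0, 4.0]]
nights = [1, 3]
global_model = federated_average(local_models, nights)
print(global_model)
```

In a full system this step would alternate with local training on each device over many rounds, and the weighting by sample count means that users with more recorded nights exert proportionally more influence on the shared model.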

9.3. Standardization and Transparent Benchmarking

For wearable sleep staging to achieve widespread clinical adoption, the field must coalesce around standardized reporting practices and rigorous validation frameworks. The analytical framework proposed by Menghini et al. (2021), which decomposes measurement error into systematic bias and limits of agreement, provides a foundation, but broader uptake and extension to multi-class sleep staging metrics (e.g., per-stage F1 scores, Cohen’s κ, and confusion matrices) are needed. The World Sleep Society’s 2025 recommendations represent a landmark step in this direction, defining seven Fundamental Sleep Measures that every consumer device should report in a standardized format while distinguishing these from proprietary exploratory metrics (World Sleep Society Task Force, 2025). Transparent benchmarking across diverse demographic and clinical cohorts, combined with open evaluation protocols and publicly available reference datasets, will be critical to establishing clinical reliability and reproducibility (de Zambotti et al., 2024).
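The decomposition into systematic bias and limits of agreement can be sketched in its simplest Bland–Altman form, which assumes that device-minus-reference differences are approximately normally distributed with constant bias and constant spread. A full analytical framework would also need to handle cases where these assumptions fail, which this sketch does not attempt. The total sleep time values below are invented for illustration.

```python
from statistics import mean, stdev


def bland_altman(device, reference):
    """Mean bias and 95% limits of agreement for paired device/reference values."""
    if len(device) != len(reference) or len(device) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)            # systematic over- or underestimation
    sd = stdev(diffs)             # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical total sleep time (minutes) for five nights.
device_tst = [430, 415, 390, 460, 445]
psg_tst = [410, 400, 385, 430, 420]
bias, lower, upper = bland_altman(device_tst, psg_tst)
print(f"bias = {bias:.1f} min, limits of agreement = [{lower:.1f}, {upper:.1f}] min")
```

Here the device overestimates total sleep time by 19 minutes on average, and the limits of agreement indicate the range within which most individual-night discrepancies would be expected to fall.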

9.4. Digital Sleep Biomarkers and Clinical Integration

As methodological robustness improves, the clinical value proposition of wearable sleep staging extends well beyond single-night classification accuracy. Longitudinal wearable data enable the derivation of digital sleep biomarkers, composite metrics capturing night-to-night variability, sleep regularity, circadian phase stability, and temporal trends in sleep architecture, that are inaccessible through traditional single-night PSG. Large-scale initiatives such as the National Institutes of Health (NIH) All of Us Research Program, which links longitudinal Fitbit data to electronic health records, exemplify the epidemiological potential of wearable-derived sleep phenotyping. In clinical practice, the World Sleep Society Task Force (2025) recommends that clinicians emphasize behavioral trends and multi-day averages rather than nightly readings, integrating wearable data as a complement to, not a replacement for, validated clinical assessments. As these recommendations gain traction and device performance matures, wearable systems are poised to support treatment response monitoring, early detection of sleep disorder exacerbations, and scalable population-level sleep health surveillance, enabling ecologically valid sleep phenotyping beyond the constraints of laboratory-based polysomnography (Babrak et al., 2019; Chinoy et al., 2021; Depner et al., 2020; de Zambotti et al., 2019; Khosla et al., 2018).

9.5. Regulatory Evolution and Equitable Access

The regulatory landscape for wearable sleep technology is evolving rapidly. The FDA’s 2024 clearance of AI-based sleep staging algorithms under predetermined change control plans signals a pathway for continuous algorithm improvement without repeated premarket submissions, potentially accelerating the clinical maturation of wearable sleep platforms. However, equitable access remains a concern: high-performing devices with clinical-grade validation are often priced beyond the reach of underserved populations who stand to benefit most from scalable sleep assessment. Addressing this inequity will require concerted efforts in device cost reduction, inclusive validation study design that reflects the full spectrum of age, ethnicity, body habitus, and sleep pathology, and culturally informed deployment strategies that account for diverse sleep practices and environments.

10. Discussion

Recent advances in wearable sleep staging reflect a clear transition from single-sensor, movement-based estimation toward multimodal physiological integration, combining accelerometry, PPG, HRV, and emerging wearable EEG platforms (Chinoy et al., 2021; de Zambotti et al., 2019; Liang and Chapa-Martell, 2019). This evolution represents a broader effort to reduce the performance gap between ambulatory monitoring technologies and laboratory-based PSG, while preserving ecological validity and scalability for home-based use. The synthesis of evidence reviewed in Sections 3 through 9 of this work reveals a field at a pivotal juncture: wearable technologies have matured sufficiently to yield clinically meaningful sleep–wake data, yet fundamental constraints in signal physiology, algorithmic design, and population generalizability continue to limit their diagnostic applicability.

Validation studies conducted after 2018 consistently demonstrate that both consumer- and clinical-grade wearable devices achieve acceptable agreement with PSG in estimating total sleep time and sleep–wake classification. However, important limitations persist, particularly in reduced specificity for wake detection and limited accuracy in distinguishing N1 from N2 sleep (Chinoy et al., 2021; de Zambotti et al., 2019; Liang and Chapa-Martell, 2019). A recent scoping review of 62 wearable setups confirmed that devices relying solely on accelerometry remain effective for binary sleep–wake detection but fall short of multi-stage classification, whereas PPG-augmented platforms improve REM and deep sleep discrimination yet still exhibit stage-level accuracy ranging from only 38% to 56% (Birrer et al., 2024; Schyvens et al., 2025). Quiet wakefulness remains systematically misclassified as sleep, largely due to the reliance on motor inactivity and autonomic stability as indirect proxies of cortical state. This limitation is especially pronounced in populations with insomnia or fragmented sleep, where motionless wake periods are common, leading to systematic overestimation of total sleep time and underestimation of wake after sleep onset (Willoughby et al., 2024). Willoughby and colleagues demonstrated that detection accuracy deteriorated significantly on nights with lower sleep efficiency, reinforcing concerns that the very populations most in need of accurate sleep monitoring, those with disturbed or fragmented sleep, are least well served by current wearable algorithms.

These performance limitations can be partly explained by underlying physiological constraints. The integration of autonomic biomarkers, particularly PPG-derived HRV, has improved the classification of REM versus NREM sleep, as REM sleep is characterized by increased autonomic variability and sympathetic modulation (Radha et al., 2019). However, autonomic signals are physiologically downstream from thalamocortical oscillatory activity and therefore lack direct sensitivity to microstructural events such as sleep spindles, K-complexes, and CAP, which require EEG-based measurement (Fiorillo et al., 2019). Consequently, while estimation of sleep macro-architecture has improved, accurate detection of sleep microstructure remains largely beyond the capabilities of non-EEG consumer wearables. This distinction carries clinical significance: sleep spindle density and slow-wave activity are established biomarkers for memory consolidation, neurodegenerative risk, and treatment response in insomnia (Fernandez and Lüthi, 2020; Lucey et al., 2021), yet these parameters remain invisible to the peripheral biosignal modalities that dominate the current consumer wearable landscape.

To address these limitations, emerging wearable EEG systems, including dry-electrode headbands and ear-EEG platforms, represent a significant step toward physiologically richer ambulatory monitoring (Markov et al., 2024). A meta-analysis of 43 validation studies evaluating wearable EEG devices against PSG reported moderate-to-substantial inter-method agreement, with performance varying across sleep stages but consistently exceeding that of PPG-only devices (Markov et al., 2025). These systems demonstrate improved agreement with PSG compared with purely peripheral devices and show promise in capturing sleep oscillatory signatures outside laboratory settings. Nevertheless, their performance remains constrained by practical challenges, including signal quality variability, electrode displacement, and increased susceptibility to motion and environmental artifacts in unsupervised home environments.
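Inter-method agreement of this kind is commonly summarized with Cohen's kappa, which corrects the observed epoch-by-epoch agreement p_o for the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below is illustrative only and is not the pipeline of the cited meta-analysis. The stage labels, the ten-epoch example, and the function names are hypothetical, and the verbal bands follow the widely used Landis and Koch convention (0.41 to 0.60 moderate, 0.61 to 0.80 substantial).

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(reference: Sequence[str], test: Sequence[str]) -> float:
    """Cohen's kappa between two epoch-aligned stage sequences."""
    n = len(reference)
    if n == 0 or n != len(test):
        raise ValueError("sequences must be non-empty and of equal length")
    # Observed agreement: fraction of epochs with identical labels.
    p_o = sum(1 for r, t in zip(reference, test) if r == t) / n
    # Chance agreement: product of the marginal label frequencies.
    ref_counts, test_counts = Counter(reference), Counter(test)
    p_e = sum(ref_counts[s] * test_counts[s] for s in ref_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one label occurs
    return (p_o - p_e) / (1.0 - p_e)


def landis_koch_label(kappa: float) -> str:
    """Verbal band for kappa, following the Landis and Koch convention."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Hypothetical ten-epoch excerpt; the device confuses N1 with N2 and
# misses one wake epoch, a pattern typical of the errors discussed here.
psg_stages = ["W", "W", "N1", "N2", "N2", "N2", "N2", "N3", "N3", "REM"]
dev_stages = ["W", "N2", "N2", "N2", "N2", "N2", "N3", "N3", "N3", "REM"]

kappa = cohens_kappa(psg_stages, dev_stages)
print(f"kappa = {kappa:.3f} ({landis_koch_label(kappa)})")
```

Here raw agreement is 70%, but because N2 dominates both sequences the chance-corrected value is about 0.58, which falls in the moderate band. This is one reason stage-level accuracy and kappa should be reported together when devices are compared.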

Given the signal-level, population-specific, and algorithmic constraints, the clinical and translational utility of wearable sleep staging is best understood within specific use cases. Current evidence supports their application in longitudinal monitoring, behavioral intervention tracking, circadian rhythm assessment, and large-scale epidemiological studies (de Zambotti et al., 2019; Liang and Chapa-Martell, 2019). The World Sleep Society’s 2025 landmark recommendations formalize this positioning, advocating that clinicians emphasize behavioral trends and multi-day averages rather than nightly readings, and distinguishing seven Fundamental Sleep Measures that consumer devices should report from proprietary exploratory metrics (World Sleep Society Task Force, 2025). However, these systems are not yet suitable replacements for PSG in the diagnostic evaluation of parasomnias, narcolepsy, REM sleep behavior disorder, or detailed respiratory event characterization, where high-resolution neurophysiological data remain essential.
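The emphasis on multi-day averages over single-night readings can be illustrated with a trailing mean of nightly total sleep time. This is a minimal sketch only: the 7-night window, the requirement of at least 5 valid nights, the function name, and the example values are assumptions made for illustration and are not specified by the recommendations cited above.

```python
from typing import List, Optional, Sequence


def trailing_mean(
    nightly: Sequence[Optional[float]],
    window: int = 7,      # assumption: one-week trailing window
    min_nights: int = 5,  # assumption: minimum valid nights to report a value
) -> List[Optional[float]]:
    """Trailing mean over the last `window` nights, skipping missing nights."""
    averages: List[Optional[float]] = []
    for i in range(len(nightly)):
        start = max(0, i - window + 1)
        valid = [v for v in nightly[start:i + 1] if v is not None]
        # Report nothing until enough valid nights have accumulated.
        averages.append(sum(valid) / len(valid) if len(valid) >= min_nights else None)
    return averages


# Hypothetical total sleep time in minutes; None marks a night when the
# device was not worn.
tst_minutes = [420.0, 390.0, None, 450.0, 405.0, 360.0, 435.0, 415.0]
weekly_trend = trailing_mean(tst_minutes)
print(weekly_trend)
```

In this example the nightly values range from 360 to 450 minutes, whereas the reported trailing means stay between roughly 405 and 410 minutes. Averaging dampens night-to-night measurement error and natural variability, at the cost of responding more slowly to genuine changes in sleep.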

In addition to these technological and physiological constraints, this review is subject to several methodological limitations. The restriction to English-language, peer-reviewed full-text articles may introduce language and publication bias, potentially excluding relevant findings reported in other languages or preliminary conference proceedings. The requirement for PSG-based validation, while ensuring methodological rigor, may bias the analysis toward more established technologies and underrepresent emerging approaches that have not yet undergone gold-standard validation. Furthermore, the focus on studies published between 2018 and 2026 reflects the rapidly evolving nature of the field but may limit the inclusion of earlier foundational work as well as very recent contributions appearing after the search window. Finally, the preference for studies with publicly available datasets or detailed methodological reporting enhances reproducibility but may lead to underrepresentation of proprietary, industry-driven developments, whose algorithms, as discussed above, remain largely opaque to independent scrutiny. Despite these constraints, the breadth of the included literature (spanning consumer wearables, clinical-grade devices, and emerging EEG platforms across diverse populations and use cases) provides a comprehensive foundation for evaluating the current state and trajectory of home-based wearable sleep staging.

11. Conclusion

Wearable sleep staging has evolved rapidly from simple motion-based sleep–wake discrimination to multimodal systems integrating accelerometry, PPG, HRV, and wearable EEG, supported by increasingly sophisticated machine learning and signal processing frameworks. These technologies enable scalable and ecologically valid sleep monitoring in real-world environments and serve as a complementary approach to PSG, the clinical gold standard. By design, this review spans the full translational pipeline — from sleep neurophysiology and signal processing through device validation and population-specific evidence — to serve as an integrated reference for researchers, engineers, and clinicians entering or working across the field.

Current evidence shows that wearable systems can reliably estimate sleep–wake states and capture macro-architectural sleep features, particularly total sleep time, sleep onset latency, and the gross distribution of REM and deep sleep. However, their reliance on peripheral physiological signals limits direct access to neural activity, resulting in persistent challenges including misclassification of quiet wakefulness, reduced accuracy in distinguishing light sleep stages (particularly N1 from N2), and limited sensitivity to sleep microstructure, such as spindles, K-complexes, and the cyclic alternating pattern. Although advanced preprocessing and artifact reduction techniques improve signal quality, variability in proprietary algorithmic implementations continues to affect reproducibility and cross-study comparability. Performance is further influenced by population- and context-specific factors, with reduced accuracy observed in older adults and in individuals with darker skin pigmentation or tattoos. Given these constraints, wearable sleep technologies are currently best suited for longitudinal monitoring, behavioral tracking, circadian rhythm assessment, and large-scale epidemiological studies rather than detailed clinical diagnostics, a positioning now formalized in the World Sleep Society’s 2025 recommendations.

Future progress will depend on several converging developments: multimodal sensor fusion, integration of wearable EEG, federated and privacy-preserving learning paradigms, and the development of physiologically grounded and externally validated machine learning models capable of generalizing across devices, recording environments, and diverse demographic and clinical populations. Equally important is the adoption of standardized validation frameworks, such as the seven Fundamental Sleep Measures proposed by the World Sleep Society Task Force (2025) and the analytical framework of Menghini et al. (2021), to ensure robustness, transparency, and generalizability across devices and populations. Addressing the current demographic skew in validation studies and the algorithmic opacity of proprietary sleep-staging models will be essential to avoid exacerbating existing healthcare disparities. With continued methodological refinement, wearable sleep staging is positioned to become an increasingly reliable component of precision sleep medicine, enabling continuous and personalized assessment of sleep health in real-world settings. For a complementary, algorithm-focused perspective on the computational approaches underlying wearable sleep staging, we refer readers to our companion review (OstadSharif Memar et al., 2026).

Acknowledgments

JG and OA were supported by the consortium grant Trajectories of Affective Disorders from the German Research Foundation (DFG) SFB/TRR 393 (project grant no. 521379614).

Abbreviations

AASM American Academy of Sleep Medicine

AUROC Area Under the Receiver Operating Characteristic Curve

ASR Artifact Subspace Reconstruction

CNNs Convolutional Neural Networks

CPAP Continuous Positive Airway Pressure

CAP Cyclic Alternating Pattern

DLMO Dim-Light Melatonin Onset

DL Deep Learning

ECG Electrocardiography

EEG Electroencephalography

EMG Electromyography

EOG Electrooculography

EMD Empirical Mode Decomposition

FDA Food and Drug Administration

EBE Epoch-by-Epoch

FFT Fast Fourier Transform

FIR Finite Impulse Response

HF High Frequency

HR Heart Rate

HRV Heart Rate Variability

ICA Independent Component Analysis

IIR Infinite Impulse Response

κ Cohen’s kappa

LF Low Frequency

LSTM Long Short-Term Memory

NREM Non-Rapid Eye Movement

OSA Obstructive Sleep Apnea

PSD Power Spectral Density

PPG Photoplethysmography

PSG Polysomnography

PTT Pulse Transit Time

PWA Pulse Wave Amplitude

REM Rapid Eye Movement

SNR Signal-to-Noise Ratio

STFT Short-Time Fourier Transform

NIH National Institutes of Health

ML Machine Learning

SQIs Signal Quality Indices

WASO Wake After Sleep Onset

References

Abdolahniya, H., Khazaei, A.A., Azarnoosh, M., Razavi, S.E., 2025. Electroencephalogram denoising using discrete wavelet transform and adaptive noise cancellation based on information theory. IJ-AI 14, 769. [CrossRef]
Adaimi, R., Thigpen, N., Clausel, A., Gotlieb, N., Patel, K., de Zambotti, M., 2025. Temporal Trajectories in Sleep, Temperature Trends, Cardiorespiratory, and Activity Metrics Measured via Oura Ring During Pregnancy: Large-Scale Observational Analysis. JMIR Mhealth Uhealth 13, e80213. [CrossRef]
Anusha, A.S., Preejith, S.P., Akl, T.J., Sivaprakasam, M., 2022. Electrodermal activity based autonomic sleep staging using wrist wearable. Biomed. Signal Process. Control 75, 103562. [CrossRef]
Arnal, P.J., Thorey, V., Debellemaniere, E., Ballard, M.E., Bou Hernandez, A., Guillot, A., Jourde, H., Harris, M., Guillard, M., Van Beers, P., Chennaoui, M., Sauvet, F., 2020. The Dreem Headband compared to polysomnography for electroencephalographic signal acquisition and sleep staging. Sleep 43. [CrossRef]
Babrak, L.M., Menetski, J., Rebhan, M., Nisato, G., Zinggeler, M., Brasier, N., Baerenfaller, K., Brenzikofer, T., Baltzer, L., Vogler, C., Gschwind, L., Schneider, C., Streiff, F., Groenen, P.M.A., Miho, E., 2019. Traditional and digital biomarkers: two worlds apart? Digit. Biomark. 3, 92–102. [CrossRef]
Bassetti, C.L.A., Adamantidis, A., Burdakov, D., Han, F., Gay, S., Kallweit, U., Khatami, R., Koning, F., Kornum, B.R., Lammers, G.J., Liblau, R.S., Luppi, P.H., Mayer, G., Pollmächer, T., Sakurai, T., Sallusto, F., Scammell, T.E., Tafti, M., Dauvilliers, Y., 2019. Narcolepsy - clinical spectrum, aetiopathophysiology, diagnosis and treatment. Nat. Rev. Neurol. 15, 519–539. [CrossRef]
Baumgartner, A.J., Kushida, C.A., Summers, M.O., Kern, D.S., Abosch, A., Thompson, J.A., 2021. Basal ganglia local field potentials as a potential biomarker for sleep disturbance in Parkinson’s disease. Front. Neurol. 12, 765203. [CrossRef]
Beattie, Z., Oyang, Y., Statan, A., Ghoreyshi, A., Pantelopoulos, A., Russell, A., Heneghan, C., 2017. Estimation of sleep stages in a healthy adult population from optical plethysmography and accelerometer signals. Physiol. Meas. 38, 1968–1979. [CrossRef]
Bernhard, H., Schaper, F.L.W.V.J., Janssen, M.L.F., Gommer, E.D., Jansma, B.M., Van Kranen-Mastenbroek, V., Rouhl, R.P.W., de Weerd, P., Reithler, J., Roberts, M.J., DBS study group, 2022. Spatiotemporal patterns of sleep spindle activity in human anterior thalamus and cortex. Neuroimage 263, 119625. [CrossRef]
Birrer, V., Elgendi, M., Lambercy, O., Menon, C., 2024. Evaluating reliability in wearable devices for sleep staging. npj Digital Med. 7, 74. [CrossRef]
Boivin, D.B., Boudreau, P., Kosmadopoulos, A., 2022. Disturbance of the circadian system in shift work and its health impact. J. Biol. Rhythms 37, 3–28. [CrossRef]
Boostani, R., Karimzadeh, F., Nami, M., 2017. A comparative review on sleep stage classification methods in patients and healthy individuals. Comput. Methods Programs Biomed. 140, 77–91. [CrossRef]
Chaddad, A., Wu, Y., Kateb, R., Bouridane, A., 2023. Electroencephalography signal processing: A comprehensive review and analysis of methods and techniques. Sensors 23. [CrossRef]
Chambon, S., Galtier, M.N., Arnal, P.J., Wainrib, G., Gramfort, A., 2018. A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series. IEEE Trans. Neural Syst. Rehabil. Eng. 26, 758–769. [CrossRef]
Cheng, P., Walch, O., Huang, Y., Mayer, C., Sagong, C., Cuamatzi Castelan, A., Burgess, H.J., Roth, T., Forger, D.B., Drake, C.L., 2021. Predicting circadian misalignment with wearable technology: validation of wrist-worn actigraphy and photometry in night shift workers. Sleep 44. [CrossRef]
Chen, J., Ricardo, A.C., Reid, K.J., Lash, J., Chung, J., Patel, S.R., Daviglus, M.L., Huang, T., Liu, L., Hernandez, R., Li, Q., Redline, S., 2022. Sleep, cardiovascular risk factors, and kidney function: The Multi-Ethnic Study of Atherosclerosis (MESA). Sleep Health 8, 648–653. [CrossRef]
Chen, X., Jin, X., Zhang, J., Ho, K.W., Wei, Y., Cheng, H., 2023. Validation of a wearable forehead sleep recorder against polysomnography in sleep staging and desaturation events in a clinical sample. J. Clin. Sleep Med. 19, 711–718. [CrossRef]
Chinoy, E.D., Cuellar, J.A., Huwa, K.E., Jameson, J.T., Watson, C.H., Bessman, S.C., Hirsch, D.A., Cooper, A.D., Drummond, S.P.A., Markwald, R.R., 2021. Performance of seven consumer sleep-tracking devices compared with polysomnography. Sleep 44. [CrossRef]
Chouraki, A., Tournant, J., Arnal, P., Pépin, J.-L., Bailly, S., 2023. Objective multi-night sleep monitoring at home: variability of sleep parameters between nights and implications for the reliability of sleep assessment in clinical trials. Sleep 46. [CrossRef]
Cox, R., Weber, F.D., Van Someren, E.J.W., 2024. Customizable automated cleaning of multichannel sleep EEG in SleepTrip. Front. Neuroinformatics 18. [CrossRef]
Deibel, S.H., Rota, R., Steenland, H.W., Ali, K., McNaughton, B.L., Tatsuno, M., McDonald, R.J., 2020. Assessment of Sleep, K-Complexes, and Sleep Spindles in a T21 Light-Dark Cycle. Front. Neurosci. 14, 551843. [CrossRef]
Depner, C.M., Cheng, P.C., Devine, J.K., Khosla, S., de Zambotti, M., Robillard, R., Vakulin, A., Drummond, S.P.A., 2020. Wearable technologies for developing sleep and circadian biomarkers: a summary of workshop discussions. Sleep 43. [CrossRef]
de Gans, C.J., Burger, P., van den Ende, E.S., Hermanides, J., Nanayakkara, P.W.B., Gemke, R.J.B.J., Rutters, F., Stenvers, D.J., 2024. Sleep assessment using EEG-based wearables - A systematic review. Sleep Med. Rev. 76, 101951. [CrossRef]
de Zambotti, M., Cellini, N., Goldstone, A., Colrain, I.M., Baker, F.C., 2019. Wearable sleep technology in clinical and research settings. Med. Sci. Sports Exerc. 51, 1538–1557. [CrossRef]
de Zambotti, M., Goldstein, C., Cook, J., Menghini, L., Altini, M., Cheng, P., Robillard, R., 2024. State of the science and recommendations for using wearable technology in sleep and circadian research. Sleep 47. [CrossRef]
Dhingra, L.S., Aminorroaya, A., Oikonomou, E.K., Nargesi, A.A., Wilson, F.P., Krumholz, H.M., Khera, R., 2023. Use of wearable devices in individuals with or at risk for cardiovascular disease in the US, 2019 to 2020. JAMA Netw. Open 6, e2316634. [CrossRef]
Dumitrescu, C., Costea, I.-M., Cormos, A.-C., Semenescu, A., 2021. Automatic Detection of K-Complexes Using the Cohen Class Recursiveness and Reallocation Method and Deep Neural Networks with EEG Signals. Sensors 21. [CrossRef]
Ellis, C.A., Zhang, R., Carbajal, D.A., Miller, R.L., Calhoun, V.D., Wang, M.D., 2021. Explainable Sleep Stage Classification with Multimodal Electrophysiology Time-series. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2021, 2363–2366. [CrossRef]
Elshekhidris, I.H., MohamedAmien, M.B., Fragoon, A., 2023. Wavelet transforms for EEG signal denoising and decomposition. IJASIS 9, 11–28. [CrossRef]
El Hadiri, A., Bahatti, L., El Magri, A., Lajouad, R., 2024. Sleep stages detection based on analysis and optimisation of non-linear brain signal parameters. Results in Engineering 23, 102664. [CrossRef]
Fan, J., Sun, C., Long, M., Chen, C., Chen, W., 2021. EOGNET: A Novel Deep Learning Model for Sleep Stage Classification Based on Single-Channel EOG Signal. Front. Neurosci. 15, 573194. [CrossRef]
Faust, O., Hagiwara, Y., Hong, T.J., Lih, O.S., Acharya, U.R., 2018. Deep learning for healthcare applications based on physiological signals: A review. Comput. Methods Programs Biomed. 161, 1–13. [CrossRef]
Fernandez, L.M.J., Lüthi, A., 2020. Sleep spindles: mechanisms and functions. Physiol. Rev. 100, 805–868. [CrossRef]
Fiorillo, L., Puiatti, A., Papandrea, M., Ratti, P.-L., Favaro, P., Roth, C., Bargiotas, P., Bassetti, C.L., Faraci, F.D., 2019. Automated sleep scoring: A review of the latest approaches. Sleep Med. Rev. 48, 101204. [CrossRef]
Fonseca, P., Ross, M., Cerny, A., Anderer, P., van Meulen, F., Janssen, H., Pijpers, A., Dujardin, S., van Hirtum, P., van Gilst, M., Overeem, S., 2023. A computationally efficient algorithm for wearable sleep staging in clinical populations. Sci. Rep. 13, 9182. [CrossRef]
Fonseca, P., van Gilst, M.M., Radha, M., Ross, M., Moreau, A., Cerny, A., Anderer, P., Long, X., van Dijk, J.P., Overeem, S., 2020. Automatic sleep staging using heart rate variability, body movements, and recurrent neural networks in a sleep disordered population. Sleep 43. [CrossRef]
Fu, M., Wang, Y., Chen, Z., Li, J., Xu, F., Liu, X., Hou, F., 2021. Deep learning in automatic sleep staging with a single channel electroencephalography. Front. Physiol. 12, 628502. [CrossRef]
Ghermezian, A., Nami, M., Shalbaf, R., Khosrowabadi, R., Nasehi, M., Kamali, A.-M., 2023. Sleep Micro-Macro-structures in Psychophysiological Insomnia. PSG Study. Sleep Vigilance 1–9. [CrossRef]
Haghayegh, S., Khoshnevis, S., Smolensky, M.H., Diller, K.R., Castriotta, R.J., 2019. Accuracy of Wristband Fitbit Models in Assessing Sleep: Systematic Review and Meta-Analysis. J. Med. Internet Res. 21, e16273. [CrossRef]
Hammour, G., Davies, H., Atzori, G., Della Monica, C., Ravindran, K.K.G., Revell, V., Dijk, D.-J., Mandic, D.P., 2024. From Scalp to Ear-EEG: A Generalizable Transfer Learning Model for Automatic Sleep Scoring in Older People. IEEE J. Transl. Eng. Health Med. 12, 448–456. [CrossRef]
Herlan, A., Ottenbacher, J., Schneider, J., Riemann, D., Feige, B., 2019. Electrodermal activity patterns in sleep stages and their utility for sleep versus wake classification. J. Sleep Res. 28, e12694. [CrossRef]
Högl, B., Stefani, A., Videnovic, A., 2018. Idiopathic REM sleep behaviour disorder and neurodegeneration - an update. Nat. Rev. Neurol. 14, 40–55. [CrossRef]
Hu, Q., Li, M., Li, Y., 2022. Single-channel EEG signal extraction based on DWT, CEEMDAN, and ICA method. Front. Hum. Neurosci. 16, 1010760. [CrossRef]
Imtiaz, S.A., 2021. A systematic review of sensing technologies for wearable sleep staging. Sensors 21. [CrossRef]
Jahrami, H., Trabelsi, K., Husain, W., Ammar, A., BaHammam, A.S., Pandi-Perumal, S.R., Saif, Z., Vitiello, M.V., 2024. Prevalence of Orthosomnia in a General Population Sample: A Cross-Sectional Study. Brain Sci. 14. [CrossRef]
Jain, A., Raja, R., Srivastava, S., Sharma, P.C., Gangrade, J., R, M., 2024. Analysis of EEG signals and data acquisition methods: a review. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 12. [CrossRef]
Jiang, X., Bian, G.-B., Tian, Z., 2019. Removal of Artifacts from EEG Signals: A Review. Sensors 19. [CrossRef]
Karpiel, I., Kurasz, Z., Kurasz, R., Duch, K., 2021. The Influence of Filters on EEG-ERP Testing: Analysis of Motor Cortex in Healthy Subjects. Sensors 21. [CrossRef]
Kazemi, K., Abiri, A., Zhou, Y., Rahmani, A., Khayat, R.N., Liljeberg, P., Khine, M., 2024. Improved sleep stage predictions by deep learning of photoplethysmogram and respiration patterns. Comput. Biol. Med. 179, 108679. [CrossRef]
Kerkering, E.M., Greenlund, I.M., Bigalke, J.A., Migliaccio, G.C.L., Smoot, C.A., Carter, J.R., 2022. Reliability of heart rate variability during stable and disrupted polysomnographic sleep. Am. J. Physiol. Heart Circ. Physiol. 323, H16–H23. [CrossRef]
Kessler, R., Enge, A., Skeide, M.A., 2025. How EEG preprocessing shapes decoding performance. Commun. Biol. 8, 1039. [CrossRef]
Khosla, S., Deak, M.C., Gault, D., Goldstein, C.A., Hwang, D., Kwon, Y., O’Hearn, D., Schutte-Rodin, S., Yurcheshen, M., Rosen, I.M., Kirsch, D.B., Chervin, R.D., Carden, K.A., Ramar, K., Aurora, R.N., Kristo, D.A., Malhotra, R.K., Martin, J.L., Olson, E.J., Rosen, C.L., American Academy of Sleep Medicine Board of Directors, 2018. Consumer sleep technology: an American Academy of Sleep Medicine position statement. J. Clin. Sleep Med. 14, 877–880. [CrossRef]
Krauss, D., Richer, R., Küderle, A., Jukic, J., German, A., Leutheuser, H., Regensburger, M., Winkler, J., Eskofier, B.M., 2025. Incorporating respiratory signals for machine learning-based multimodal sleep stage classification: a large-scale benchmark study with actigraphy and heart rate variability. Sleep 48. [CrossRef]
Lambert, I., Peter-Derex, L., 2023. Spotlight on sleep stage classification based on EEG. Nat. Sci. Sleep 15, 479–490. [CrossRef]
Leach, S., Krugliakova, E., Sousouri, G., Snipes, S., Skorucak, J., Schühle, S., Müller, M., Ferster, M.L., Da Poian, G., Karlen, W., Huber, R., 2024. Acoustically evoked K-complexes together with sleep spindles boost verbal declarative memory consolidation in healthy adults. Sci. Rep. 14, 19184. [CrossRef]
Lee, T., Cho, Y., Cha, K.S., Jung, J., Cho, J., Kim, H., Kim, D., Hong, J., Lee, D., Keum, M., Kushida, C.A., Yoon, I.-Y., Kim, J.-W., 2023. Accuracy of 11 wearable, nearable, and airable consumer sleep trackers: prospective multicenter validation study. JMIR Mhealth Uhealth 11, e50983. [CrossRef]
Lee, X.K., Chee, N.I.Y.N., Ong, J.L., Teo, T.B., van Rijn, E., Lo, J.C., Chee, M.W.L., 2019. Validation of a consumer sleep wearable device with actigraphy and polysomnography in adolescents across sleep opportunity manipulations. J. Clin. Sleep Med.
Lee, Y.J., Lee, J.Y., Cho, J.H., Choi, J.H., 2022. Interrater reliability of sleep stage scoring: a meta-analysis. J. Clin. Sleep Med. 18, 193–202. [CrossRef]
Lee, Y.J., Lee, J.Y., Cho, J.H., Kang, Y.J., Choi, J.H., 2025. Performance of consumer wrist-worn sleep tracking devices compared to polysomnography: a meta-analysis. J. Clin. Sleep Med. 21, 573–582. [CrossRef]
Liang, Z., Chapa-Martell, M.A., 2019. Accuracy of Fitbit Wristbands in Measuring Sleep Stage Transitions and the Effect of User-Specific Factors. JMIR Mhealth Uhealth 7, e13384. [CrossRef]
Liu, P., Qian, W., Zhang, H., Zhu, Y., Hong, Q., Li, Q., Yao, Y., 2024. Automatic sleep stage classification using deep learning: signals, data representation, and neural networks. Artif. Intell. Rev. 57, 301. [CrossRef]
Liu, P.-K., Ting, N., Chiu, H.-C., Lin, Y.-C., Liu, Y.-T., Ku, B.-W., Lee, P.-L., 2023. Validation of photoplethysmography- and acceleration-based sleep staging in a community sample: comparison with polysomnography and Actiwatch. J. Clin. Sleep Med. 19, 1797–1810. [CrossRef]
Liu, T., Benjamin-Neelon, S.E., 2023. A longitudinal study of infant 24-hour sleep: comparisons of sleep diary and accelerometer with different algorithms. Sleep 46. [CrossRef]
Li, Qiao, Li, Qichen, Cakmak, A.S., Da Poian, G., Bliwise, D.L., Vaccarino, V., Shah, A.J., Clifford, G.D., 2021. Transfer learning from ECG to PPG for improved sleep staging from wrist-worn wearables. Physiol. Meas. 42. [CrossRef]
Li, X., Halaki, M., Chow, C.M., 2025. Validation of MotionWatch8 actigraphy against polysomnography in menopausal women under warm conditions. Sensors 25. [CrossRef]
Lucey, B.P., Mcleland, J.S., Toedebusch, C.D., Boyd, J., Morris, J.C., Landsness, E.C., Yamada, K., Holtzman, D.M., 2016. Comparison of a single-channel EEG sleep study to polysomnography. J. Sleep Res. 25, 625–635. [CrossRef]
Lucey, B.P., Wisch, J., Boerwinkle, A.H., Landsness, E.C., Toedebusch, C.D., McLeland, J.S., Butt, O.H., Hassenstab, J., Morris, J.C., Ances, B.M., Holtzman, D.M., 2021. Sleep and longitudinal cognitive performance in preclinical and early symptomatic Alzheimer’s disease. Brain 144, 2852–2862. [CrossRef]
Markov, K., Elgendi, M., Menon, C., 2025. Evaluating the performance of wearable EEG sleep monitoring devices: a meta-analysis approach. npj Biomed. Innov. 2, 33. [CrossRef]
Markov, K., Elgendi, M., Menon, C., 2024. EEG-based headset sleep wearable devices. npj Biosensing 1, 12. [CrossRef]
Martins, N.R.A., Bauer, F., Baty, F., Boesch, M., Brutsche, M.H., Rossi, R.M., Annaheim, S., 2025. Introduction to electrocardiogram signal quality assessment and estimated accuracy for textile electrodes. Sci. Rep. 15, 41365. [CrossRef]
Masad, I.S., Alqudah, A., Qazan, S., 2024. Automatic classification of sleep stages using EEG signals and convolutional neural networks. PLoS ONE 19, e0297582. [CrossRef]
Melo, M.C., da Silva Vallim, J.R., Garbuio, S., Soster, L.A., Sousa, K.M.M., Bonaldi, R.R., Pires, G.N., 2024. Validation of a sleep staging classification model for healthy adults based on two combinations of a single-channel EEG headband and wrist actigraphy. J. Clin. Sleep Med. 20, 983–990. [CrossRef]
Mendonça, F., Mostafa, S.S., Morgado-Dias, F., Ravelo-García, A.G., Rosenzweig, I., 2023. Towards automatic EEG cyclic alternating pattern analysis: a systematic review. Biomed. Eng. Lett. 13, 273–291. [CrossRef]
Menghini, L., Cellini, N., Goldstone, A., Baker, F.C., de Zambotti, M., 2021. A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code. Sleep 44. [CrossRef]
Mikkelsen, K.B., Tabar, Y.R., Kappel, S.L., Christensen, C.B., Toft, H.O., Hemmsen, M.C., Rank, M.L., Otto, M., Kidmose, P., 2019. Accurate whole-night sleep monitoring with dry-contact ear-EEG. Sci. Rep. 9, 16824. [CrossRef]
Mikkelsen, K.B., Villadsen, D.B., Otto, M., Kidmose, P., 2017. Automatic sleep staging using ear-EEG. Biomed. Eng. Online 16, 111. [CrossRef]
Modarres, M.H., Elliott, J.E., Weymann, K.B., Pleshakov, D., Bliwise, D.L., Lim, M.M., 2021. Validation of Visually Identified Muscle Potentials during Human Sleep Using High Frequency/Low Frequency Spectral Power Ratios. Sensors 22. [CrossRef]
Moon, S., Kim, T.S., Ryu, J., Lee, W.H., 2023. Federated Learning for Sleep Stage Classification on Edge Devices via a Model-Agnostic Meta-Learning-Based Pre-Trained Model, in: 2023 IEEE 13th International Conference on Consumer Electronics - Berlin (ICCE-Berlin). IEEE, pp. 188–192. [CrossRef]
Nguyen, Q.N.T., Le, T., Huynh, Q.B.T., Setty, A., Vo, T.V., Le, T.Q., 2021. Validation Framework for Sleep Stage Scoring in Wearable Sleep Trackers and Monitors with Polysomnography Ground Truth. Clocks & Sleep 3, 274–288. [CrossRef]
Palo, G., Fiorillo, L., Monachino, G., Bechny, M., Wälti, M., Meier, E., Pentimalli Biscaretti di Ruffia, F., Melnykowycz, M., Tzovara, A., Agostini, V., Faraci, F.D., 2024. Comparison analysis between standard polysomnographic data and in-ear-electroencephalography signals: a preliminary study. SLEEP Advances 5. [CrossRef]
Patel, A.K., Reddy, V., Shumway, K.R., Araujo, J.F., 2026. Physiology, Sleep Stages, in: StatPearls. StatPearls Publishing, Treasure Island (FL).
Patterson, M.R., Nunes, A.A.S., Gerstel, D., Pilkar, R., Guthrie, T., Neishabouri, A., Guo, C.C., 2023. 40 years of actigraphy in sleep medicine and current state of the art algorithms. npj Digital Med. 6, 51. [CrossRef]
Penzel, T., 2024. Using the gold mine of sleep data recorded to increase our understanding of sleep. Sleep 47. [CrossRef]
Phan, H., Andreotti, F., Cooray, N., Chen, O.Y., De Vos, M., 2019a. SeqSleepNet: End-to-End Hierarchical Recurrent Neural Network for Sequence-to-Sequence Automatic Sleep Staging. IEEE Trans. Neural Syst. Rehabil. Eng. 27, 400–410. [CrossRef]
Phan, H., Chen, O.Y., Koch, P., Mertins, A., Vos, M.D., 2019b. Deep Transfer Learning for Single-Channel Automatic Sleep Staging with Channel Mismatch, in: 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, pp. 1–5. [CrossRef]
Phan, H., Mikkelsen, K., 2022. Automatic sleep staging of EEG signals: recent development, challenges, and future directions. Physiol. Meas. 43. [CrossRef]
Piccini, J., August, E., Óskarsdóttir, M., Arnardóttir, E.S., 2023. Using the electrodermal activity signal and machine learning for diagnosing sleep. Front. Sleep 2. [CrossRef]
Qian, X., Qiu, Y., He, Q., Lu, Y., Lin, H., Xu, F., Zhu, F., Liu, Z., Li, X., Cao, Y., Shuai, J., 2021. A review of methods for sleep arousal detection using polysomnographic signals. Brain Sci. 11. [CrossRef]
Radha, M., Fonseca, P., Moreau, A., Ross, M., Cerny, A., Anderer, P., Long, X., Aarts, R.M., 2019. Sleep stage classification from heart-rate variability using long short-term memory neural networks. Sci. Rep. 9, 14149. [CrossRef]
Roach, G.D., Miller, D.J., Shell, S.J., Miles, K.H., Sargent, C., 2025. Validation of a Neurophysiological-Based Wearable Device (Somfit) for the Assessment of Sleep in Athletes. Sensors 25. [CrossRef]
Robbins, R., Weaver, M.D., Sullivan, J.P., Quan, S.F., Gilmore, K., Shaw, S., Benz, A., Qadri, S., Barger, L.K., Czeisler, C.A., Duffy, J.F., 2024. Accuracy of three commercial wearable devices for sleep tracking in healthy adults. Sensors 24. [CrossRef]
Romine, W., Banerjee, T., Goodman, G., 2019. Toward Sensor-Based Sleep Monitoring with Electrodermal Activity Measures. Sensors 19. [CrossRef]
Ross, M., Fonseca, P., Overeem, S., Vasko, R., Cerny, A., Shaw, E., Anderer, P., 2023. Autonomic arousal detection and cardio-respiratory sleep staging improve the accuracy of home sleep apnea tests. Front. Physiol. 14. [CrossRef]
Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T.H., Faubert, J., 2019. Deep learning-based electroencephalography analysis: a systematic review. J. Neural Eng. 16, 051001. [CrossRef]
Satapathy, S.K., Brahma, B., Panda, B., Barsocchi, P., Bhoi, A.K., 2024. Machine learning-empowered sleep staging classification using multi-modality signals. BMC Med. Inform. Decis. Mak. 24, 119. [CrossRef]
Schyvens, A.-M., Peters, B., Van Oost, N.C., Aerts, J.-M., Masci, F., Neven, A., Dirix, H., Wets, G., Ross, V., Verbraecken, J., 2025. A performance validation of six commercial wrist-worn wearable sleep-tracking devices for sleep stage scoring compared to polysomnography. Sleep Adv. 6, zpaf021. [CrossRef]
Shi, Y., Lou, H., Wang, H., Zhou, Y., Wang, L., Li, Y., Han, D., 2023. Analysis of nasal resistance regulation mechanism during postural changes in patients with obstructive sleep apnea by measuring heart rate variability. Journal of Clinical Sleep Medicine 19, 643–650. [CrossRef]
Somervail, R., Cataldi, J., Stephan, A.M., Siclari, F., Iannetti, G.D., 2023. Dusk2Dawn: an EEGLAB plugin for automatic cleaning of whole-night sleep electroencephalogram using Artifact Subspace Reconstruction. Sleep 46.
Stephansen, J.B., Olesen, A.N., Olsen, M., Ambati, A., Leary, E.B., Moore, H.E., Carrillo, O., Lin, L., Han, F., Yan, H., Sun, Y.L., Dauvilliers, Y., Scholz, S., Barateau, L., Hogl, B., Stefani, A., Hong, S.C., Kim, T.W., Pizza, F., Plazzi, G., Mignot, E., 2018. Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy. Nat. Commun. 9, 5229.
Sun, H., Ye, E., Paixao, L., Ganglberger, W., Chu, C.J., Zhang, C., Rosand, J., Mignot, E., Cash, S.S., Gozal, D., Thomas, R.J., Westover, M.B., 2023. The sleep and wake electroencephalogram over the lifespan. Neurobiol. Aging 124, 60–70.
Tapia-Rivas, N.I., Estévez, P.A., Cortes-Briones, J.A., 2024. A robust deep learning detector for sleep spindles and K-complexes: towards population norms. Sci. Rep. 14, 263.
Walch, O., Huang, Y., Forger, D., Goldstein, C., 2019. Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device. Sleep 42.
Wang, T., Li, W., Deng, J., Zhang, Q., Liu, Y., Zheng, H., 2024. The impact of the physical activity intervention on sleep in children and adolescents with neurodevelopmental disorders: a systematic review and meta-analysis. Front. Neurol. 15, 1438786.
Wei, J., Boger, J., 2021. Sleep detection for younger adults, healthy older adults, and older adults living with dementia using wrist temperature and actigraphy: prototype testing and case study analysis. JMIR Mhealth Uhealth 9, e26462.
Wiley, K., Berger, P., Friehs, M.A., Mandryk, R.L., 2024. Measuring the reliability of a gamified stroop task: quantitative experiment. JMIR Serious Games 12, e50315.
Willoughby, A.R., Golkashani, H.A., Ghorbani, S., Wong, K.F., Chee, N.I.Y.N., Ong, J.L., Chee, M.W.L., 2024. Performance of wearable sleep trackers during nocturnal sleep and periods of simulated real-world smartphone use. Sleep Health 10, 356–368.
Wulterkens, B.M., Fonseca, P., Hermans, L.W.A., Ross, M., Cerny, A., Anderer, P., Long, X., van Dijk, J.P., Vandenbussche, N., Pillen, S., van Gilst, M.M., Overeem, S., 2021. It is All in the Wrist: Wearable Sleep Staging in a Clinical Population versus Reference Polysomnography. Nat. Sci. Sleep 13, 885–897.
Wu, Chia-Tung, Hsieh, M.H., Chen, I.-M., Jhao, L.-Y., Liu, D.-S., Wang, S.-M., Wu, Chia-Ting, Chien, Y.-L., 2025. Using wearable device and machine learning to predict mood symptoms in bipolar disorder: development and usability study. JMIR Med. Inform. 13, e66277.
Xiao, J., Ming, Y., Li, L., Huang, X., Zhou, Y., Ou, J., Kou, J., Feng, R., Ma, R., Zheng, Q., Shan, X., Meng, Y., Liao, W., Zhang, Y., Wang, T., Kuang, Y., Cao, J., Li, S., Lai, H., Chen, J., Duan, X., 2025. Personalized Theta Burst Stimulation Enhances Social Skills in Young Minimally Verbal Children With Autism: A Double-Blind Randomized Controlled Trial. Biol. Psychiatry 97, 1139–1149.
Xu, Z., Zhu, Y., Zhao, H., Guo, F., Wang, H., Zheng, M., 2022. Sleep Stage Classification Based on Multi-Centers: Comparison Between Different Ages, Mental Health Conditions and Acquisition Devices. Nat. Sci. Sleep 14, 995–1007.
Yan, C., Li, P., Yang, M., Li, Y., Li, J., Zhang, H., Liu, C., 2022. Entropy analysis of heart rate variability in different sleep stages. Entropy (Basel) 24.
Yildirim, O., Baloglu, U.B., Acharya, U.R., 2019. A deep learning model for automated sleep stages classification using PSG signals. Int. J. Environ. Res. Public Health 16.
Yin, J., Xu, J., Ren, T.-L., 2023. Recent Progress in Long-Term Sleep Monitoring Technology. Biosensors (Basel) 13.
Yubo, Z., Yingying, L., Bing, Z., Lin, Z., Lei, L., 2022. MMASleepNet: A multimodal attention network based on electrophysiological signals for automatic sleep staging. Front. Neurosci. 16, 973761.
Zhang, A.H., He-Mo, A., Yin, R.F., Li, C., Tang, Y., 2024. Mamba-based Deep Learning Approaches for Sleep Staging on a Wireless Multimodal Wearable System without Electroencephalography. arXiv preprint arXiv ….
Zhang, R., Dong, X., Zhang, L., Lin, X., Wang, X., Xu, Y., Wu, C., Jiang, F., Wang, J., 2024. Quantitative electroencephalography in term neonates during the early postnatal period across various sleep states. Nat. Sci. Sleep 16, 1011–1025.
Zhang, X., Zhang, Xizhen, Huang, Q., Lv, Y., Chen, F., 2024. A review of automated sleep stage based on EEG signals. Biocybern. Biomed. Eng. 44, 651–673.
Zitting, K.-M., Lockyer, B.J., Azarbarzin, A., Sands, S.A., Wang, W., Wellman, A., Quan, S.F., 2023. Association of cortical arousals with sleep-disordered breathing events. J. Clin. Sleep Med. 19, 899–912.

Table 2. Physiological Signals Used in Wearable Sleep Staging and Their Clinical Relevance.

Signal	Measurement Principle	Physiological Information	Relevance to Sleep Stages	Strengths	Limitations
HR (Heart Rate)	Derived from ECG or PPG	Cardiac activity	Decreases in NREM (especially N3); more variable in REM	Easy to measure; widely available	Limited specificity for stage classification
HRV (Heart Rate Variability)	Variation in beat-to-beat intervals	Autonomic nervous system balance	Higher parasympathetic activity in NREM; fluctuates in REM	Useful for distinguishing sleep depth	Sensitive to noise and artifacts
PPG (Photoplethysmography)	Optical measurement of blood volume changes	Blood flow, HR, HRV	Indirectly reflects sleep stages via cardiovascular dynamics	Low-cost; common in wearables	Motion artifacts; indirect measure
Accelerometer	Measures body movement	Physical activity and motion	Differentiates sleep vs. wake; limited stage resolution	Robust; low power consumption	Cannot distinguish NREM stages accurately
Skin Temperature	Peripheral temperature sensing	Thermoregulation	Gradual increase during sleep; varies across stages	Useful for circadian rhythm analysis	Low temporal resolution for staging
EEG (Single-channel)	Electrical brain activity	Neural oscillations (delta, theta, alpha)	Directly distinguishes N1, N2, N3, REM	High physiological relevance	Limited channels reduce accuracy vs PSG
EOG (in some headbands)	Eye movement detection	Ocular activity	Critical for REM detection	Improves REM classification	Less common in consumer devices
EMG (limited wearable use)	Muscle activity	Muscle tone	Reduced in REM; moderate in NREM	Helps identify REM atonia	Rare in wearable systems
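
As an illustration of the HR and HRV rows in Table 2, the following minimal Python sketch computes three common time-domain features from a list of inter-beat intervals within one scoring epoch. It assumes the intervals have already been extracted from ECG or PPG and cleaned of ectopic beats and motion artifacts; the function name `hrv_features` and the dictionary keys are hypothetical and do not correspond to any device or algorithm reviewed here. RMSSD is widely used as an index of short-term, vagally mediated variability, which is why it may be informative for the parasympathetic shift described for NREM sleep.

```python
import math


def hrv_features(ibi_ms):
    """Return mean heart rate (bpm), SDNN (ms), and RMSSD (ms) for one epoch.

    ibi_ms: inter-beat intervals in milliseconds (assumed artifact-free).
    """
    if len(ibi_ms) < 3:
        raise ValueError("need at least three inter-beat intervals")

    n = len(ibi_ms)
    mean_ibi = sum(ibi_ms) / n

    # Mean heart rate: 60,000 ms per minute divided by the mean interval.
    mean_hr = 60000.0 / mean_ibi

    # SDNN: sample standard deviation of the intervals (overall variability).
    sdnn = math.sqrt(sum((x - mean_ibi) ** 2 for x in ibi_ms) / (n - 1))

    # RMSSD: root mean square of successive differences (beat-to-beat variability).
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    return {"mean_hr_bpm": mean_hr, "sdnn_ms": sdnn, "rmssd_ms": rmssd}


if __name__ == "__main__":
    # Synthetic intervals alternating between 900 ms and 1100 ms.
    print(hrv_features([900, 1100, 900, 1100]))
```

In practice such features would be computed per epoch and passed to a classifier together with movement features; the epoch length and any artifact-rejection step are design choices that this sketch leaves open.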

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.