1. Introduction
As sleep is an important factor related to health, disease, overall quality of life and peak performance [
2], more and more people monitor it using wearable devices. The pursuit of good sleep can sometimes go too far, creating an obsession with monitoring and quantifying sleep that can result in increased stress and discomfort [
3], a condition termed orthosomnia (i.e., from the Greek “ortho”, meaning straight or correct, and the Latin “somnia”, meaning sleep). However, many users of wearables have been erroneously led to believe that commercial wearables can accurately and reliably measure sleep, even though almost all of them lack scientifically sound and/or independent validation studies [
4] against the gold standard, Polysomnography (PSG), a combination of electroencephalography (EEG), electrooculography (EOG) and electromyography (EMG).
Erroneous feedback about one’s sleep (e.g., substantial underestimation of deep sleep, or wrong wake classification at sleep onset or during the night) can have serious adverse effects, enhancing people’s sleep misperception and even leading to negative daytime consequences [
5]. Such wrong feedback on wearables may also lead to inappropriate suggestions for adjusting sleep habits and work against the aim of promoting better sleep health [
6]. This is particularly worrisome for people with sleep problems and preoccupations who are especially sensitive to feedback on their sleep [
7,
8,
9]. The potential adverse effects of inaccurate feedback in combination with the limited rigour of many validation studies against the PSG gold standard certainly justify the scepticism around the broad use of wearables in the clinical field [
10]. However, the potential benefits of using accurate wearable devices that capture daily sleep changes in natural environments and outside of the laboratory, combined with low costs, are undeniable, especially in light of recent studies showing that implementing such technologies together with sleep intervention protocols can have a beneficial effect on therapy outcomes [
11,
12,
13,
14]. It is our opinion that such technologies, if optimized and carefully validated, will soon play a central role in research and clinical practice, as they allow continuous sleep measurements (and feedback) in ecologically valid home environments at affordable cost.
However, only a few of the wearable technologies that provide multiclass epoch-by-epoch sleep classification have been transparent about their sensing technology and the algorithms used, and are indeed validated against PSG in suitably large (and heterogeneous) samples, using appropriate performance metrics (e.g., Cohen’s κ, sensitivity and specificity, F1) rather than solely relying on mean values per night or overall epoch-by-epoch accuracies across sleep stages. Among the few, Chinoy et al. [
15] have recently compared the performance of seven consumer sleep-tracking devices to PSG and reported that the reviewed devices displayed poor detection of sleep stages compared to PSG, with consistent under- or overestimation of the amount of REM or deep sleep. Chee et al. [
16] validated two widely used sensors, the Oura ring (v. 1.36.3) and Actiwatch (AW2 v. 6.0.9), against PSG in a sample of 53 participants with multiple recordings each. Compared to PSG, the Oura ring displayed an average underestimation of about 40 minutes for TST, 16 minutes for REM sleep, and 66 minutes for light sleep, while it overestimated Wake After Sleep Onset (WASO) by about 38 minutes and deep (N3) sleep by 39 minutes. In another study, Altini and Kinnunen [
17] examined 440 nights from 106 individuals wearing the Oura ring (v. Gen2M) and found a good 4-class overall accuracy of 79% when including various autonomic nervous system signals, such as heart rate variability, temperature, acceleration and circadian features.
We have recently developed a 4-class sleep stage classification model (wake, light, deep, REM) that reaches up to 81% accuracy and a Cohen’s κ
of 0.69 [
1], when using only data from low-cost Heart Rate Variability (HRV) sensors. Although such classification accuracies are approaching expert inter-rater reliability [
1], there are edge cases where classification is particularly difficult and prone to errors (e.g., longer periods of rest without movement while reading, as compared to falling asleep), which may result in erroneous feedback to the users. We suggest that, to deal with these cases, it is first important to implement advanced signal quality controls that detect artefacts related to sensor detachment or excessive movements during sleep, which lead to insufficient signal quality (i.e., unreliable inter-beat interval estimations) for meaningful classification. In fact, some of the previously suggested models developed for one-channel ECG recordings (e.g., Mathunjwa et al. [
18]; Habib et al. [
19]; Sridhar et al. [
20]) have already addressed such issues by implementing individual normalisation and simple outlier corrections (e.g., linearly interpolating too short or too long inter-beat intervals (IBIs)). In
Figure 1 we illustrate an example from our data and show how bad IBI signal quality can lead to erroneous sleep stage classification, and suggest incorporating advanced machine learning approaches for bad signal detection and better classification accuracy.
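As a toy illustration of such a simple outlier correction (the 300-2000 ms plausibility bounds and all values are hypothetical, not taken from our pipeline), implausible inter-beat intervals can be replaced by linear interpolation between their valid neighbours:

```python
def correct_ibi_outliers(ibis, lo=300.0, hi=2000.0):
    """Replace implausible IBIs (ms) by linear interpolation between valid neighbours."""
    valid = [i for i, v in enumerate(ibis) if lo <= v <= hi]
    corrected = list(ibis)
    for i, v in enumerate(ibis):
        if lo <= v <= hi:
            continue
        # nearest valid neighbours to the left and right of the outlier
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is not None and right is not None:
            # linear interpolation in index space
            frac = (i - left) / (right - left)
            corrected[i] = ibis[left] + frac * (ibis[right] - ibis[left])
        elif left is not None:
            corrected[i] = ibis[left]   # edge case: no valid right neighbour
        elif right is not None:
            corrected[i] = ibis[right]  # edge case: no valid left neighbour
    return corrected

ibis = [820.0, 810.0, 90.0, 830.0]  # 90 ms is physiologically implausible
fixed = correct_ibi_outliers(ibis)  # the outlier becomes (810 + 830) / 2 = 820
```

Such interpolation only repairs isolated spurious beats; longer stretches of bad signal instead require the quality-control detection discussed above.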
Additionally, to optimize sleep classification of edge cases we need to take into consideration the skewness in the distribution of sleep stages across the night (i.e., a disproportionate amount of sleep stages across the night, with light sleep dominating). This becomes evident when one acknowledges that an algorithm can reach epoch-by-epoch classification accuracies of up to 65-70% by simply classifying “light sleep” (N1 and N2 [
21]) throughout the night. A model that is trained for optimal overall classification accuracy, like the one suggested in [
1], will display a bias towards the majority class (light sleep), resulting in poorer performance on less populated classes such as wake and deep sleep. It is, however, crucial for the user that specific periods of wake and deep sleep are not misclassified, as this substantially decreases the user’s trust in the sleep analysis and consequently in any potential sleep intervention. We therefore suggest selecting the model with the minimum weighted cross-entropy loss, which encapsulates the skewness of the sleep stage distribution and thus reduces the bias towards the majority class.
Figure 2 illustrates the bias of the accuracy model towards the majority class and how opting for a loss function model resolves this issue.
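The class weighting behind such a weighted cross-entropy loss can be sketched as follows; the epoch counts and the inverse-frequency weighting scheme are illustrative assumptions, not our actual training configuration:

```python
import math

# Hypothetical per-class epoch counts (wake, light, deep, REM);
# light sleep dominates, mirroring the skew described in the text.
counts = {"wake": 1200, "light": 5600, "deep": 1400, "rem": 1800}
total = sum(counts.values())

# Inverse-frequency class weights (sklearn-style "balanced" heuristic):
# errors on rare classes (wake, deep) cost more than on light sleep.
weights = {c: total / (len(counts) * n) for c, n in counts.items()}

def weighted_cross_entropy(probs, target):
    """Weighted CE for one epoch: -w_target * log p(target)."""
    return -weights[target] * math.log(probs[target])

# A confident "light" prediction is penalized heavily when the true
# stage is the rare "wake" class, and only mildly when it is correct.
probs = {"wake": 0.1, "light": 0.7, "deep": 0.1, "rem": 0.1}
loss_if_truly_wake = weighted_cross_entropy(probs, "wake")
loss_if_truly_light = weighted_cross_entropy(probs, "light")
```

Selecting the model checkpoint that minimizes this weighted loss, rather than raw accuracy, is what the text refers to as the “loss function model”.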
We here explore the benefits of applying an IBI-correction procedure, as well as of opting for the loss function, in classifying sleep. We tested our updated model on a new sample of 136 subjects whose sleep was recorded in their homes using ambulatory PSG, for one or more nights. As previously, we use as input signals i) the gold-standard ECG (providing IBIs), ii) a chest-worn belt measuring IBIs (Polar® H10), and iii) a pulse-to-pulse interval device worn on the upper arm (Polar® Verity Sense, VS). We put particular emphasis on the upper-arm VS, as it is a more comfortable sensor to sleep with than the H10 chest belt, without any notable compromise on sleep classification. Finally, we assess whether wearables such as the ones tested here are accurate in classifying sleep even in users with health issues who, as a consequence, are taking central nervous system medication (psychoactive substances) and/or medication with expected effects on the heart (beta blockers, etc.).
2. Methods
2.1. Participants
We recorded ambulatory PSG from 136 participants (female = 40; mean age = 45.29, SD = 16.23, range = 20-76) who slept in their homes for one or more nights. In total, 265 nights of recordings included the gold standard PSG and ECG. Of these participants, 112 wore a Polar® H10 chest sensor (see Materials) and 99 wore the Polar® Verity Sense, with 178 and 135 nights, respectively. This sample was part of an ongoing study investigating the effects of online sleep training programs on participants with self-reported sleep complaints, which included participants with no self-reported acute mental or neurological disorders who were capable of using smartphones. While we did not set strict exclusion criteria, our sample naturally consists mainly of people with sleep difficulties. Specifically, 84.8% of our sample had a Pittsburgh Sleep Quality Index (PSQI) score above 5, with an average score of 9.36 (SD = 3.21). For a subset of participants, we took a medical history and grouped them into those who were on psychoactive and/or heart medication (with medication, N = 17, Nnights = 40) and those on no medication (without medication, N = 39, Nnights = 87). The study was conducted according to the ethical guidelines of the Declaration of Helsinki.
2.2. Materials
We recorded PSG using the ambulatory Varioport 16-channel EEG system (Becker Meditec®) with gold cup electrodes (Grass Technologies, Astro-Med GmbH®), according to the international 10-20 system, at frontal (F3, Fz, F4), central (C3, Cz, C4), parietal (P3, P4) and occipital (O1, O2) derivations. We recorded the EOG using two electrodes (placed 1 cm below the left outer canthus and 1 cm above the right outer canthus, respectively), and chin muscle activity from two EMG electrodes. Importantly, we recorded the ECG signal with two electrodes that we placed below the right clavicle and on the left side below the pectoral muscle at the lower edge of the left rib cage. Before actual sleep staging, we used BrainVision Analyzer 2.2 (Brain Products GmbH, Gilching, Germany) to re-reference the signal to the average mastoid signal, filter it according to the American Academy of Sleep Medicine (AASM) recommendations for routine PSG recordings (AASM Standards and Guidelines Committee, 2021), and downsample it to 128 Hz (original sampling rate: 512 Hz). Sleep was then automatically scored in 30-second epochs using standard AASM scoring criteria, as implemented in the Sleepware G3 software (Sleepware G3, Koninklijke Philips N.V., Eindhoven, The Netherlands). The G3 software is considered non-inferior to manual human staging and can be readily used without the need for manual adjustment [
22]. All PSG recordings in the current analysis have been carefully manually inspected for severe non-physiological artefacts in the EEG, EMG, as well as EOG, as such artefacts would render our PSG-based classification (serving as the gold standard) less reliable.
Having conducted extensive research on multiple sensors, we decided on two Polar® (Polar® Electro Oy, Kempele, Finland) sensors, as they provided the most accurate signal compared to the gold-standard ECG [
23,
24,
25]: the H10 chest strap (Model: 1W) and the Verity Sense (VS, Model: 4J). In addition to good signal quality, both sensors have good battery life (about 400 hours for the H10, 30 hours for the VS), and low overall weight and volume, which makes them comfortable to sleep with.
2.3. Data synchronization and missing data handling
For each recording, sleep staging was computed using PSG and the G3 software (see Materials), which served as the gold standard. As the beginning and end of PSG recordings and wearable recordings were set manually and therefore do not perfectly overlap in time, the inter-beat interval time series provided by the wearable sensors were time-synchronized to the inter-beat interval estimations from the ECG channels of the PSG recording using a windowed cross-correlation approach before sleep classification. We excluded sensor recordings for the following reasons: i) poor signal quality (25% criterion), ii) cases in which the sensor data failed to load, and iii) cases in which synchronization between sensor and PSG was not possible.
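The idea behind such lag-based alignment can be sketched as follows. This is a simplified, single-window version, not our actual windowed implementation, and all signal values are hypothetical: the lag that maximizes the correlation between the two IBI series is taken as the temporal offset.

```python
def best_lag(reference, signal, max_lag=20):
    """Lag (in samples) at which `signal` best matches `reference`,
    found by maximizing Pearson correlation over the overlap."""
    def corr_at(lag):
        pairs = [(reference[i], signal[i - lag])
                 for i in range(len(reference))
                 if 0 <= i - lag < len(signal)]
        if len(pairs) < 3:          # ignore lags with too little overlap
            return float("-inf")
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((x - mx) * (y - my) for x, y in pairs)
        den = (sum((x - mx) ** 2 for x in xs)
               * sum((y - my) ** 2 for y in ys)) ** 0.5
        return num / den if den else float("-inf")
    return max(range(-max_lag, max_lag + 1), key=corr_at)

# ECG-derived IBIs (ms) and a sensor stream that started two samples late
ecg_ibis = [800, 810, 790, 805, 950, 700, 820, 815, 800, 790]
sensor_ibis = ecg_ibis[2:]
lag = best_lag(ecg_ibis, sensor_ibis)  # recovers the 2-sample offset
```

In practice, repeating this over sliding windows (as in the windowed approach described above) also guards against clock drift between the two devices.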
2.4. Model optimization
Model (accuracy model) training has been described in Topalidis et al. [
1]. We have further optimized this model by accounting for model biases towards the majority class (light sleep). We counter such biases by selecting the final model based on the loss function value, which already incorporates the skewed class distribution (loss function model). In addition, we have implemented an IBI quality control procedure: we trained a random forest model on a subset of IBI windows that were manually labelled as good or bad IBI segments, using as input a set of IBI features calculated on a fixed time window of 10 minutes. We started with the feature set used by Radha et al. [
26] but reduced it to a set of 7 features based on permutation feature importance values. Furthermore, we adjusted the threshold of the output value to account for the model’s ability to deal with minor levels of noise and distortion. The IBI quality control was applied after the actual sleep staging, and sleep stage labels of 30-second segments including bad IBIs were then replaced with the surrounding scorings of clean segments (i.e., segments without bad IBIs). In case more than 25% of all 30-second segments of a single night included bad IBIs, the whole night was characterised as unstageable.
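The post-hoc replacement step and the unstageable rule can be sketched as follows; this is a simplified illustration with hypothetical labels, and only the 25% threshold is taken from the text (the nearest-neighbour tie-breaking is an assumption):

```python
UNSTAGEABLE_FRACTION = 0.25  # threshold from the text

def clean_hypnogram(stages, bad_mask):
    """Replace stage labels of bad-IBI 30-s segments with the nearest
    clean scoring; return None if the night is unstageable."""
    if sum(bad_mask) / len(stages) > UNSTAGEABLE_FRACTION:
        return None  # whole night characterised as unstageable
    cleaned = list(stages)
    clean_idx = [j for j, bad in enumerate(bad_mask) if not bad]
    for i, bad in enumerate(bad_mask):
        if not bad:
            continue
        # nearest clean segment; earlier segment wins on distance ties
        nearest = min(clean_idx, key=lambda j: (abs(j - i), j))
        cleaned[i] = stages[nearest]
    return cleaned

stages = ["light", "light", "deep", "rem", "light"]
bad    = [False,  False,  True,   False, False]
out = clean_hypnogram(stages, bad)  # the bad "deep" epoch is replaced
```

A night with, say, half of its segments flagged would return `None` and be excluded from further analysis.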
2.5. Sleep Parameters
We chose a few key sleep parameters extracted from PSG and the wearable sensors to explore their relationship and agreement. We focused primarily on the following objective sleep variables: Sleep Onset Latency (SOL), defined as the time from lights out until the start of the first epoch of any stage of sleep (an epoch of N1, N2, N3, or R); Sleep Efficiency (SE: total sleep time / time in bed × 100); Wake After Sleep Onset (WASO); Total Sleep Time (TST); as well as deep and REM sleep measured in minutes.
2.6. Model performance & statistical analysis
We used standard measures for estimating model performance, such as overall classification accuracy and Cohen’s κ
[
27], as well as per-class recall and precision values, which are displayed in the confusion matrix (see
Figure 3). Note that the performance metrics summarized in
Figure 3 are computed by averaging across all aggregated epochs. We further explored the performance of each sensor using a Wilcoxon signed-rank test where data for both the gold standard and each sensor existed. In addition, we examined the performance of the two models for each sleep stage separately by computing the F1 score and comparing them with a one-tailed non-parametric Wilcoxon signed-rank test, expecting higher F1 scores for the loss function model. A Wilcoxon signed-rank test was also used to compare the classification accuracies between the recordings of participants on psychoactive or heart medication and those on no medication, as well as to examine how these two groups differ in age and PSQI scores.
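As an illustration of the headline agreement metric, Cohen’s κ corrects the observed epoch-by-epoch agreement for the agreement expected by chance given each rater’s label distribution; the stage labels below are hypothetical:

```python
def cohens_kappa(truth, pred):
    """Cohen's kappa for two label sequences of equal length."""
    n = len(truth)
    labels = sorted(set(truth) | set(pred))
    po = sum(t == p for t, p in zip(truth, pred)) / n   # observed agreement
    # chance agreement from the marginal label frequencies
    pe = sum((truth.count(l) / n) * (pred.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

truth = ["light", "light", "deep", "rem", "wake", "light"]
pred  = ["light", "light", "deep", "rem", "light", "light"]
kappa = cohens_kappa(truth, pred)
```

Because chance agreement is high when one class (light sleep) dominates, κ is a far stricter criterion here than raw accuracy.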
To explore the relationship between the PSG and the two sensors on sleep parameters, we conducted Spearman’s rank-order correlations (denoted as ρ) and visualized the linear trends using scatterplots. We determined the agreement between the PSG and wearable sleep parameters using Bland-Altman plots, thus reporting biases, the Standard Measurement Error (SME), Limits of Agreement (LOA), the Minimal Detectable Change [
28] (MDC = 1.96 × √2 × SME), as well as the absolute intraclass correlation (ICC: two-way model with agreement type [
29]). The MDC, also known as the smallest detectable change, refers to the smallest change detected by a method that exceeds measurement error [
28]. The intraclass correlation coefficient encapsulates both the degree of correlation and the agreement between measurements [
29]. According to Koo and Li [
29], values less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and greater than 0.90 are indicative of poor, moderate, good, and excellent reliability, respectively. All data processing and statistical analyses were performed in R (R Core Team [
30]; version 4).
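For illustration, the Bland-Altman quantities can be computed as follows. The nightly TST values are hypothetical, and the SME is estimated here from the standard deviation of the differences (SD/√2), one common approach; our analysis was performed in R, so this Python sketch is not our actual code:

```python
import math

def agreement_stats(a, b):
    """Bias, 95% limits of agreement, SME, and MDC for paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # limits of agreement
    sme = sd / math.sqrt(2)            # SME estimated from difference SD
    mdc = 1.96 * math.sqrt(2) * sme    # smallest change exceeding error
    return {"bias": bias, "loa": loa, "sme": sme, "mdc": mdc}

psg_tst = [420.0, 390.0, 450.0, 405.0]   # minutes, hypothetical nights
vs_tst  = [425.0, 385.0, 455.0, 410.0]
stats = agreement_stats(vs_tst, psg_tst)  # VS - PSG, so positive bias = overestimation
```

Note that with this SME estimate, the MDC conveniently reduces to 1.96 × SD of the differences.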
4. Discussion
In the current study, we validated our previous findings in a new sample and showed that the model suggested in Topalidis et al. [
1] can be further optimized by including advanced IBI signal control and by opting for the loss function when choosing the final model. Results reveal statistically higher classification performance for the loss function model compared to choosing the model with the highest overall accuracy during validation (accuracy model), although numerically the benefit is small. We discuss why even a small but reliable increase is relevant here, while highlighting that we are optimizing a model that already displays very high classification performance. We further show that psychoactive and/or heart-affecting medication does not have a strong impact on sleep stage classification accuracy. Lastly, we evaluated our new optimized model for measuring sleep-related parameters and found that it shows substantial agreement and reliability with PSG-extracted sleep parameters.
In the following section, we discuss the results, acknowledge the limitations of the current study and provide an outlook on sleep classification using wearables in sleep research and clinical practice for the near future.
When looking at the confusion matrices (in
Figure 3) we observed a small increase in the overall accuracy for the loss function model, even though the accuracy model uses the overall accuracy for choosing the best model in the training phase. Descriptively, there is an increase in wake and deep sleep recall in the loss function model (in the magnitude of 1% to 5%), suggesting that it is indeed beneficial to opt for the loss function during model training. This also becomes apparent when comparing the κ values from the individual nights (
Figure 4), without epoch aggregation. In addition, we observed that the optimized model performs far above a model that would only predict the majority class (light sleep) for the entire recording. When comparing the two models statistically per class using F1 scores extracted per recording, we observed a benefit for the loss function model in all sleep stages when looking at ECG data. In H10 recordings we observed benefits for wake, REM, and deep sleep, while for the VS only the classification of wake improved with the optimised model.
One could discuss whether these small accuracy increases are meaningful, but one needs to acknowledge that the new model displays accuracy performance approaching human inter-rater agreement and, critically, improves in wake classification, which is crucial because wrongly classified wake causes the most irritation in the end-user. As has been pointed out in [
1], it is estimated that for four classes experts display an agreement of around 88%, as compared to the 84% (κ ≈ .75) on average in our study across the wearable devices, which corresponds to 95.45% of the expert agreement level for PSG-based human scorings. In the presented classification approach we implement the recommendation from [
31], a systematic review of automatic sleep staging using cardiorespiratory signals, which suggests taking into account signal delays and implementing sequence learning models that are capable of incorporating temporal relations.
In the current study, we also explored the effects of psychoactive and heart medication on sleep stage classification performance. We reasoned that such medication could have a direct effect on PSG and ECG [
32,
33] and thereby affect relevant features of the signal, resulting in decreased classification performance. We observed a small but statistically significant decrease in κ values for people on medication compared to those without, in the ECG data, but no such drop in classification accuracy for the H10 or VS sensor. It is important to note that, for the ECG data, age and PSQI differences were also observed between the medication and non-medication groups (those on medication being older and having worse sleep quality), and thus the drop in classification performance could be driven by these differences. However, also here the median κ for all classified recordings from participants on medication was above .70 for all devices, suggesting substantial classification agreement.
Furthermore, we evaluated the agreement and reliability of our model against PSG on primary sleep parameters. Particular emphasis was put on the VS sensor, as it is the more comfortable sensor to sleep with, while at the same time providing a good indirect measure of heart rate variability via the pulse-to-pulse intervals of the PPG. Surprisingly, the ECG-based H10 was found to be more accurate only for a few sleep variables. For all key sleep parameters but deep sleep (ρ = 0.5), high correlations (0.7 to 0.94) were found between PSG and VS. The highest correlation was found for total sleep time, showing almost perfect agreement (ρ = .94). These correlational analyses were complemented by intraclass correlations, which likewise indicated moderate (for deep sleep) to excellent reliability. Systematic bias of the VS against the PSG gold standard was visualised using Bland-Altman plots and was found to be minimal. It should be kept in mind that, especially if the end-user has a sleep issue, “deep sleep” is a crucial measure, as it has been related to physical and mental recovery, strengthening of the immune system, and the disposal of brain waste products via the glymphatic system (e.g., Reddy and van der Werf [
34]). Future emphasis should thus be put on further increasing the classification accuracy for the “deep sleep” class. We also provided the Minimal Detectable Change (MDC) metric, which indicates the smallest change detected by a method that exceeds the measurement error [
28], and found that both the H10 and the VS offer very good resolution, accurate up to ±5 minutes across sleep parameters.
In summary, society and Western health systems show a trend towards an increasing adoption of digital health interventions, including for sleep (see Arroyo and Zawadzki [
35] for a systematic review on mHealth interventions in sleep). Objective and reliable tracking of the effects of such interventions thus also becomes more and more relevant and allows ecologically valid and continuous measurements [
10] in natural home settings. Recently, for example, Spina et al. [
12] used sensor-based sleep feedback in a sample of 103 participants suffering from sleep disturbances and found that such sensor-based feedback can already reduce some insomnia symptoms. However, such feedback alone was not enough to induce changes in sleep–wake misperception, which may require additional interventions (see Hinterberger et al. [
36]). Given that people with sleep problems and preoccupations about their sleep are especially sensitive to such feedback, there is a high ethical necessity to provide only reliable and accurate feedback to patients, in order to prevent negative side effects [
5].
Figure 1.
Bad IBI signal, if not detected, can lead to erroneous sleep stage classification. There are epochs in (a) the raw IBI signal that can be identified using (b) an advanced IBI signal quality control procedure based on a trained random forest model. Due to bad signal quality, sleep staging these epochs results in erroneous classification (c), which misrepresents the ground truth (d).
Figure 2.
Example of a night where the accuracy model overestimates the majority class (light sleep). Panel (a) displays the actual PSG-based hypnogram, as sleep staged automatically using the G3 Sleepware gold standard. Panel (b) displays sleep staging using the “accuracy model”, while panel (c) displays the “loss function model”. Note that the accuracy model displays a bias towards the majority class (e.g., epochs marked in red shading), as it strives to maximize overall classification accuracy, especially in cases where the model is unsure. In contrast, the loss function incorporates the skewed class distribution using categorical cross-entropy weighted by class, thus correcting for a bias towards the majority class. In this example, both predicted hypnograms (b;c) use signals derived from the H10 sensor.
Figure 3.
Confusion matrices of the accuracy (upper) and loss function models (lower). The IBIs were extracted from the gold-standard ECG (left), the chest-belt H10 (middle), and the PPG-based VS (right), and were classified using the two models. In each confusion matrix, rows represent predicted classes (Output Class) and columns represent true classes (Target Class). Cells on the diagonal indicate correct classifications, while off-diagonal cells represent incorrect classifications. Each cell displays the count and percentage of classifications. Precision is displayed in the gray squares on the right, while recall is displayed at the bottom. The number of epochs has been equalized between the two models for a fairer comparison. Note that, next to the small improvement in overall accuracy compared to the accuracy model, the loss function model displays an increase in the recall of the wake and deep sleep stages. This is arguably enough to address a few of the nights that are difficult to classify.
Figure 4.
Performance metrics of the accuracy and loss function models as computed for ECG (left), H10 (middle), and VS (right). (a) The loss function model yielded a small but significant increase in the κ values for all sensors. (b) The loss function model displayed significantly higher classification accuracies compared to staging the majority class for the whole recording. (c) When considering the performance of the two models separately in each class, as reflected in per-class F1 scores, we observed small but significant increases in the performance of the loss function model. ns - not significant, <.1 - +, <.05 - *, <.01 - **, <.001 - ***, <.0001 - ****.
Figure 5.
The effects of heart or psychoactive medication on sleep stage classification using the optimized model. (a) Comparison of κ values between recordings obtained from people with and without medication, for each sensor separately. (b) Group descriptives including the number of subjects, age and PSQI group averages, as well as statistical effects of age and PSQI. Note that there is a significant difference between the two groups in the ECG recordings, but at the same time also a trend for age and a significant difference in the PSQI scores. ns - not significant, <.1 - +, <.05 - *, <.01 - **, <.001 - ***, <.0001 - ****.
Figure 6.
Correlations between VS and PSG sleep parameters as computed with Spearman’s rank correlations (ρ). PSG sleep parameters are plotted on the x-axis and VS metrics on the y-axis. Individual points reflect single recordings. The solid line reflects the corresponding linear model and the shaded areas the 95% confidence intervals. Note that all sleep metrics correlate highly with PSG, with deep sleep showing the weakest positive correlation.
Figure 7.
Table of agreement. Agreement between the gold-standard PSG and the wearable devices (a) H10 and (b) VS in measuring the sleep parameters of interest. Note that there is good to excellent reliability for all sleep parameters except deep sleep, which shows moderate reliability. LOA = Limits of Agreement, upper - lower; SME = Standard Measurement Error; MDC = Minimal Detectable Change; ICC = Intra-Class Correlation.
Figure 8.
Agreement between the PSG-based sleep metrics and the VS sensor, as visualized with Bland-Altman plots. The dashed red line represents the mean difference (i.e., bias) between the two measurements. The black solid line represents the point of equality (where the difference between the two devices is equal to 0), while the dotted lines represent the upper and lower limits of agreement (LOA). The shaded areas indicate the 95% confidence intervals of the bias and of the lower and upper agreement limits. A positive bias value indicates a VS overestimation, while a negative bias reflects a VS underestimation, using the gold standard, PSG, as the point of reference (VS-PSG). Note that the VS underestimates SOL and WASO, while it overestimates the remaining sleep parameters. However, the degree of bias is minimal.