Pitfalls in sleep and memory research and how to avoid them: A consensus paper

Understanding the complex relationship between sleep and memory is a major challenge in neuroscience. Thousands of studies on memory consolidation in humans suggest that sleep triggers offline memory processes, resulting in less forgetting of declarative memory and performance stabilization in non-declarative memory. However, an increasing number of contradictory findings reveal potential issues with how research is conducted in this field and call into question the reliability and interpretation of the results. In this consensus paper, we describe four sets of prevalent methodological pitfalls in human sleep and memory research: (i) non-optimal experimental designs, (ii) task complexity, (iii) fatigue effects in repetitive tasks, and (iv) inappropriate data analysis practices. We then offer solutions to each of these pitfalls. We believe that implementing these solutions in future research of sleep and memory will lead to more reliable results and significantly advance our understanding in this field.


Introduction
There is great interest in sleep in both the general public and the scientific community. The critical influence of sleep on health and some aspects of cognition is well established; moreover, our modern lifestyle and technologies affect our sleep habits and quality in new ways every day, increasing the prevalence of sleep deprivation and bad sleep habits. Memory is also a focus of societal interest, with respect to education and learning on one side of the developmental spectrum, and to aging and age-related memory decline on the other. As such, the effect of sleep on memory has gained much attention in psychology and neuroscience research over the last two decades, with thousands of dedicated publications. Additionally, a number of theories and models explaining the effect of sleep on memory have been developed (e.g., Ackermann & Rasch, 2014; Antony et al., 2019; Boyce et al., 2017; Diekelmann et al., 2009; Feld & Born, 2017; Lewis & Durrant, 2011; Mednick et al., 2011; Saletin & Walker, 2012; Siegel, 2001; Stickgold & Walker, 2005; Tononi & Cirelli, 2006; Tononi & Cirelli, 2014; Walker, 2005). The focus of this paper is on the effect of sleep on memory consolidation, that is, on how sleeping after having learned something (e.g., new vocabulary or playing the piano) benefits subsequent memory, compared to an equivalent time spent without sleep.
According to oft-cited empirical studies (e.g., Gais et al., 2006; Walker et al., 2003) and reviews on this topic (Diekelmann et al., 2009; King et al., 2017; Rasch & Born, 2013), in healthy adults declarative memory appears more resistant to forgetting when encoding is followed by a period of sleep compared to a period of wakefulness, whereas non-declarative memory performance can even be improved when sleep follows training. The evidence for sleep-related memory consolidation appears to be so convincing that it has been claimed that "While memory formation is not the only function of sleep, it seems to be the most important (...)" (Born & Wilhelm, 2012, p. 192) or that "(...) active system consolidation might be an evolutionary conserved function of sleep." (Vorster & Born, 2015, p. 103). While such judgements may be justified if based on a large body of rather heterogeneous experimental approaches, we claim here that the support such statements receive from individual experiments is not yet compelling, due to a multitude of experimental and methodological issues. Indeed, there has been critical discussion as to the actual impact of sleep on memory consolidation (e.g., Mantua, 2018; Pan & Rickard, 2015; Vertes & Siegel, 2005). A non-negligible number of studies have not found sleep-related consolidation effects, especially for non-declarative memory (e.g., Csabi et al., 2014; Nemeth et al., 2010; Robertson et al., 2004; Song et al., 2007; Viczko et al., 2018; Wilson et al., 2012). There is also increasing evidence that the effect of sleep on memory consolidation is perhaps more multifaceted than initially thought (e.g., King et al., 2017). In addition, the correlations observed between sleep physiology and memory consolidation are often not replicated across studies and are sometimes too numerous to reliably interpret the significant ones (Pan & Rickard, 2015). These correlations sometimes even go in the opposite direction to what is expected (Mantua, 2018; Payne et al., 2009). Contradictory findings in the field are not an issue per se, as they can highlight the complexity of the effect of sleep on memory consolidation. Those findings become an issue, however, if they point to systematic problems with the literature. Thus, providing guidelines for future studies is crucial to maintaining progress in the field.
In this consensus paper, we propose a guideline for future research on sleep and memory. Such a comprehensive guideline is lacking so far (but see King et al., 2017), hindering progress toward a better understanding of the effect of sleep on memory in fields ranging from psychology to biology and neuroscience. We highlight four sets of critical methodological pitfalls that could be responsible for some of the contradictory findings in the literature, and then propose solutions to prevent them and guide future research.

Pitfall 1: Non-optimal experimental designs
In this section, we identify five areas that could benefit from improvements in the experimental designs and suggest solutions for each of them. Figure 1 illustrates the main study conditions that can be included in the experimental designs testing the effect of sleep on memory consolidation. It is important to note from the outset that it is difficult to address all of these methodological caveats in one parsimonious experimental design. Thus, several types of studies may be necessary to draw strong conclusions about the function of a particular type of sleep for a particular type of memory (Peigneux & Smith, 2010). The issues with non-optimal experimental designs can be organized into three sets. The first one is related to the influence of the time of day when the tasks are performed (see section 1a). The second is about whether the observed benefit of sleep over wake intervals is just due to the fact that, during sleep, interference is much diminished (see sections 1b, 1c, and 1d). This issue is related to the fundamental question of whether sleep passively (through reduced interference) or actively (through sleep-specific neural processes) contributes to memory consolidation (for an in-depth discussion, see Ellenbogen et al., 2006). If sleep has only a passive role in this regard, then managing to create wake intervals with reduced external interference during the post-learning interval may be sufficient to trigger a level of consolidation comparable to that of a sleep interval. Finally, we also briefly discuss the effect of baseline measurements and feedback on performance changes in the retention interval, as differences in these aspects of the experimental design can further confound the observed relationship between sleep and memory (see section 1e).

Pitfall 1a. Time-of-day (circadian) effects
It has been shown that learning and memory performance is affected by the time of day when the task is performed (Schmidt et al., 2007). Time-of-day (circadian) effects can lead to confounds in sleep-related consolidation studies. A typical approach in these studies is to compare performance change in a Sleep condition (i.e., learning in the evening and testing memory the next morning; Figure 1, condition 1) with that in a Wake condition (i.e., learning in the morning and testing memory the next evening; condition 2). Importantly, however, a greater off-line improvement in a Sleep condition compared to a Wake condition in such a design may be, at least partially, explained by two confounds: 1) worse performance in the evening (i.e., when learning takes place in the Sleep condition) due to circadian effects, including day-long buildup of fatigue (Keisler et al., 2007), and/or 2) a better performance in the morning (i.e., when testing takes place in the Sleep condition) when participants are likely well rested. Probing circadian effects, Pan and Rickard (2015) have shown in their meta-analysis of sleep-related motor memory consolidation that performance is best if the test session occurs in the early afternoon. Notably, the issue of circadian effects can be even more pronounced over the course of the human lifespan, especially when comparing young vs. older groups of individuals with differences in chronotype and homeostatic sleep pressure (Taillard et al., 2021). In either case, it is unclear how much sleep per se contributes to improved performance compared to these circadian effects.
Moreover, hormones like cortisol strongly influence memory processes, and their release is subject to distinct circadian rhythms. For instance, growth hormone (GH) has its peak in the first half of the night, whereas GH concentrations in the second half are very low. By contrast, cortisol has its daily nadir in the first half of the night and its peak in the second half (Dresler et al., 2014). This confounds not only evening-morning vs. morning-evening (Figure 1, conditions 1 vs. 2, respectively) comparisons but also within-night comparisons that are permitted by the classical split-night paradigm, which compares the SWS-rich first half of the night with the REM sleep-rich second half of the night (Genzel & Robertson, 2015). Thus, the confounds of endocrine fluctuations across the circadian rhythm are difficult to avoid, rendering the split-night paradigm problematic.

Solutions to control for time-of-day (circadian) effects
These potential circadian effects are not easy to control because, in humans, sleep typically occurs during the night (although see Tucker et al., 2017 for an inverted 12-hour schedule with sleep occurring during daytime). Nevertheless, several solutions have been proposed to address this potential caveat. None is ideal or exhaustive, but converging results across studies employing these different strategies can increase confidence in the conclusions.
When focusing on the effects of nighttime sleep on memory consolidation, e.g., in an evening-morning (Sleep) vs. morning-evening (Wake) design (Figure 1, conditions 1 vs. 2, respectively), additional control groups should be tested to disentangle the circadian effects from the effect of sleep per se. Thus, a full design would include evening-morning (i.e., 12h Sleep; condition 1) and morning-evening (i.e., 12h Wake; condition 2) conditions together with evening-alone (condition 3) and morning-alone (condition 4) conditions, the latter ones with immediate testing. A difference in learning performance (i.e., how long it takes to learn or the overall performance during training) and/or in the immediate retrieval performance when learning/immediate testing takes place during the evening vs. in the morning would indicate potential circadian effects and preclude further interpretation about the effect of sleep on consolidation (e.g., Fenn et al., 2003; Hallgato et al., 2013; Talamini et al., 2008; Tucker et al., 2011). In such a design, only the following pattern of results would demonstrate a beneficial effect of sleep without any time-of-day effect: 1) an absence of difference in learning performance in the four conditions; 2) an absence of difference in testing performance between the morning-alone and the evening-alone conditions; and 3) a better retrieval in the evening-morning over the morning-evening condition.
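The decision logic of this four-condition design can be summarized in a short sketch. The function below is purely illustrative: its name, arguments, and return strings are assumptions introduced here, and each Boolean input is assumed to come from an appropriate statistical test carried out beforehand.

```python
def interpret_four_condition_design(
    learning_differs: bool,
    immediate_test_differs: bool,
    sleep_better_than_wake: bool,
) -> str:
    """Map the three comparisons of the full design onto a conclusion.

    learning_differs: learning performance differs across the four conditions.
    immediate_test_differs: immediate testing differs between the
        morning-alone and evening-alone conditions.
    sleep_better_than_wake: retrieval is better in the evening-morning
        (12h Sleep) than in the morning-evening (12h Wake) condition.
    """
    if learning_differs or immediate_test_differs:
        # A time-of-day difference precludes interpreting the sleep effect.
        return "possible circadian confound"
    if sleep_better_than_wake:
        return "sleep benefit without time-of-day effect"
    return "no evidence for a sleep benefit"


print(interpret_four_condition_design(False, False, True))
```

Note that the two "absence of difference" inputs are only as trustworthy as the statistical power of the tests behind them.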
Another possible solution is to include sleep-deprived wake control groups in evening-morning (12h Sleep-deprived) or evening-evening (24h Sleep-deprived) conditions (see Figure 1, conditions 5 vs. 6, respectively) and compare their performance with that of an evening-morning (12h Sleep; condition 1) group. In these sleep deprivation controls, learning takes place in the evening and testing takes place either in the morning or in the next evening, with subjects staying awake during the night (in the evening-morning condition) or during both the night and the following day (in the evening-evening condition). In both cases, testing can take place either after sleep deprivation, with subjects being acutely sleep deprived at testing, or testing can be delayed by another 24-48 hours to allow for one or two nights of recovery sleep (condition 7), with the sleep conditions likewise being tested after comparable delays. The advantage of sleep deprivation control designs is that learning and testing take place at the same time of day in the 12h Sleep and 12h Sleep-deprived groups, thus controlling for potential circadian differences that are an issue with the typical morning-evening (12h Wake; condition 2) controls. The same holds for the comparison of evening-evening groups, where one group could sleep (24h Sleep), while the other stayed awake (24h Sleep-deprived) in the delay period (see Figure 1, conditions 8 vs. 6, respectively). Including recovery sleep (condition 7) ensures that participants are not acutely sleep-deprived at testing, which reduces the negative impact of sleep deprivation on test performance. However, including recovery sleep comes at the price of extending the retention interval, which may likewise affect test performance through processes of decay, forgetting or memory restructuring. Both immediate and delayed testing solutions are limited because acute sleep deprivation introduces confounding influences on test performance, and recovery sleep may exert additional confounds with regard to longer retention intervals and potential compensatory effects of recovery sleep on memory consolidation, or confounds related to sleep rebound effects. That is, sleep during the recovery night(s) may compensate for the missed opportunity for sleep consolidation during the first night of sleep deprivation, thus masking the original effect on memory consolidation. Only a few studies have applied such sleep deprivation designs. For instance, Gais, Rasch, Dahmen, Sara, and Born (2011) examined the role of noradrenaline for sleep-dependent memory consolidation by pharmacologically blocking noradrenaline via clonidine administration in an evening-evening sleep vs. sleep deprivation design without recovery sleep (Figure 1, conditions 8 vs. 6, respectively). They observed impaired memory retention after clonidine administration compared to placebo in the sleep condition but no difference between clonidine and placebo in the sleep-deprived wake condition, suggesting that noradrenaline supports memory consolidation specifically during sleep but not during wakefulness.
A third possible solution is to focus on the effect of daytime sleep (i.e., napping) on learning and memory performance, as in this case training and testing occur at the same time in the nap and awake groups (Figure 1, conditions 9 vs. 10, respectively; Mednick et al., 2003; see also section 1c). Note, however, that daytime and nighttime sleep can affect memory differently (Payne et al., 2015), and daytime naps might vary largely across participants with regard to the duration, depth and composition of sleep (e.g., appearance of REM stage in some, but not all participants), which should be taken into account during analysis and interpretation of these studies.
Finally, beyond the inclusion of additional groups/conditions to control for circadian effects, further questionnaires/tasks should also be used to assess subjective sleepiness (e.g., Stanford Sleepiness Scale) and objective vigilance (e.g., Psychomotor Vigilance Test) before and after the encoding and testing sessions. These assessments could provide further useful information about the subjective states and vigilance of participants, and could be included in the analyses as covariates, or extreme values could be used for participant exclusion. It is important to note, however, that questionnaires/tasks with good psychometric properties should be selected for this purpose, and a null effect with these assessments alone (i.e., without the inclusion of control groups/conditions discussed above) should not be used to dismiss an alternative explanation entirely since null effects could be due to low statistical power (see also Pitfall 4).
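As one possible illustration of how extreme vigilance values might be flagged for exclusion, the sketch below applies a simple z-score rule to one summary score per participant. The function name, the cutoff of 2.5 standard deviations, and the reaction times are all assumptions made for this example, not recommendations derived from the studies cited here.

```python
from statistics import mean, stdev


def flag_extreme_participants(scores, z_cutoff=2.5):
    """Return IDs whose score lies more than z_cutoff SDs from the group mean.

    scores maps a participant ID to one summary value, for example the
    mean reaction time (ms) on a vigilance task. z_cutoff is an assumption.
    """
    values = list(scores.values())
    if len(values) < 2:
        return []  # a standard deviation needs at least two values
    group_mean, group_sd = mean(values), stdev(values)
    if group_sd == 0:
        return []  # identical scores: nobody is extreme
    return sorted(
        pid for pid, value in scores.items()
        if abs(value - group_mean) / group_sd > z_cutoff
    )


# Made-up mean reaction times (ms), for illustration only.
pvt_rt = {
    "P01": 290, "P02": 300, "P03": 310, "P04": 295, "P05": 305,
    "P06": 300, "P07": 298, "P08": 302, "P09": 300, "P10": 600,
}
print(flag_extreme_participants(pvt_rt))  # ['P10']
```

The same scores could alternatively be entered as covariates rather than used for exclusion, as noted above.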

Pitfall 1b. No control conditions in overnight studies
Studies investigating the effect of sleep on memory consolidation sometimes compare experimental conditions across sleep intervals only. For example, studies in clinical populations with patients suffering from sleep disorders (e.g., primary insomnia, obstructive sleep apnea, sleep-disordered breathing) often compare a pathological group with a healthy control group and often use only an evening-morning sleep condition (Figure 1, condition 1) (Backhaus et al., 2006; Csabi et al., 2014) (for reviews, see Ahuja et al., 2018; Cellini, 2017). Importantly, however, if patients show a smaller overnight performance benefit compared to control participants, one cannot disentangle whether it is caused by the specific effect of that overnight sleep (i.e., state-dependent consolidation) or by a trait-dependent effect of sleep disturbances on memory processes involving not only consolidation, but perhaps also encoding and retrieval or even other cognitive limitations (Ahuja et al., 2018; Csábi et al., 2013; Rosenzweig et al., 2015; Wallace & Bucks, 2013). An additional issue is that pathologies can also influence circadian cycles (Wulff et al., 2010), thus time of peak performance may be shifted in such populations. Many of these issues need to be likewise considered in aging populations, as aging impacts the prevalence of sleep disorders, cognitive performance, daytime functioning, and mnemonic functioning.
Similarly, studies investigating sleep interventions (e.g., using pharmacological agents or electrical stimulation) typically only compare sleep conditions with vs. without intervention in an evening-morning design (Figure 1, condition 1). However, with this design, it cannot be ascertained whether any observed effects are sleep-specific or whether the intervention exerts general effects that are independent of sleep.

Solutions: using appropriate control conditions in overnight studies
To assess the specificity of sleep-related consolidation and sleep interventions, it is essential to include appropriate control groups/conditions in which participants stay awake for a comparable period and, in the case of intervention studies, also receive the same experimental manipulations as in the sleep groups/conditions. There are two main classes of wake controls in overnight studies: morning-evening (wake) controls (Figure 1, condition 2), and sleep-deprived controls, either evening-morning (condition 5) or evening-evening (condition 6) (see section 1a).
For example, to demonstrate that a pathology (or an intervention) specifically affects sleep-related consolidation, we need to observe not only that test performance in the sleep condition is different in the pathological (or intervention) group than in the control group, but also that test performance or improvement in the wake control condition is similar in both groups. In other words, there should be a group difference in the sleep condition, but no group difference in the wake condition (i.e., a group-by-condition interaction; see Figure 2, panel a). However, if test performance in the sleep condition is similar in the pathological/intervention group and in the control group, then one cannot conclude that the pathology/intervention affects sleep-related consolidation processes (i.e., no interaction; panel c), unless there is also a significant group difference in the wake condition (i.e., an interaction; panel b). Finally, if test performance in the pathological/intervention group is different from that in the control group both in the Sleep and in the Wake conditions without a group-by-condition interaction (i.e., no interaction; panel d), then one would conclude that the pathology or the intervention has a more general effect, possibly influencing other processes (such as encoding or retrieval) but not sleep-related consolidation per se.
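The group-by-condition logic can be made concrete with a difference-of-differences contrast computed from the four cell means. The sketch below is descriptive only: the function name, the group and condition labels, and the scores are assumptions made for illustration, and a formal test of the interaction would still be required before drawing any conclusion.

```python
from statistics import mean


def interaction_contrast(scores):
    """Difference of differences for a 2 (group) x 2 (condition) design.

    scores maps (group, condition) to a list of retention scores, with
    group in {"control", "patient"} and condition in {"sleep", "wake"}.
    Returns (group difference in sleep, group difference in wake,
    interaction contrast).
    """
    cell = {key: mean(values) for key, values in scores.items()}
    sleep_diff = cell[("control", "sleep")] - cell[("patient", "sleep")]
    wake_diff = cell[("control", "wake")] - cell[("patient", "wake")]
    return sleep_diff, wake_diff, sleep_diff - wake_diff


# Made-up retention scores (arbitrary units), for illustration only.
example = {
    ("control", "sleep"): [8.0, 9.0, 10.0],
    ("patient", "sleep"): [5.0, 6.0, 7.0],
    ("control", "wake"): [4.0, 5.0, 6.0],
    ("patient", "wake"): [4.0, 5.0, 6.0],
}
sleep_diff, wake_diff, interaction = interaction_contrast(example)
print(sleep_diff, wake_diff, interaction)  # 3.0 0.0 3.0
```

In this made-up example the groups differ after sleep but not after wake, which corresponds to the pattern in which a sleep-specific effect can be concluded.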

Figure 2.
Schematic illustration of the different patterns of results and their conclusion in the case of studies comparing a pathological or intervention group vs. a control group in Wake vs. Sleep conditions. Vertical axis represents memory performance. Patterns depicted in Panels a and b display cases where it can be concluded that the pathology or intervention specifically affects sleep-related memory consolidation, compared to a control group (i.e., group-by-condition interaction). Patterns depicted in Panels c and d display cases where it cannot be concluded that the pathology or intervention specifically affects sleep-related memory consolidation, compared to a control group (i.e., no group-by-condition interaction). Note that in panels b and d, pathology and intervention may have opposite effects (i.e., hindering or improving performance, respectively) compared to the control group, without changing the overall logic of the figure.
Such a design has been used in only a handful of studies. For instance, Nissen et al. (2011) compared the effect of evening-morning (sleep) vs. morning-evening (wake) conditions (Figure 1, conditions 1 vs. 2) on consolidation of non-declarative/procedural and declarative memory in insomniac and healthy control participants. For procedural memory, they observed similar retention over the morning-evening (wake) interval in both groups (condition 2). However, the healthy control group showed better retention over the evening-morning (sleep) interval (condition 1) compared to the insomniac group. This corresponds to Figure 2, panel a. The authors concluded that insomnia specifically impaired sleep-related consolidation in procedural memory. Since declarative memory retention did not differ significantly between the two groups either in the wake or in the sleep condition (although it showed the same overall pattern as in Figure 2, panel a), no conclusion about the effect of insomnia on sleep-related consolidation of declarative memory could be drawn in this case.

Pitfall 1c. Control conditions in napping studies
Importantly, as described in section 1a, naps differ significantly with respect to the composition of sleep stages and hormonal concentrations depending on the time of day and the duration of the nap. Moreover, including wake controls at the same time of day and of the same duration does not rule out the possibility that factors other than sleep per se affect memory consolidation. For example, the general concern that reducing external interference during the post-learning interval may be sufficient to aid off-line consolidation also pertains to nap studies (e.g., Mednick et al., 2011; Wamsley, 2019).

Solutions: using appropriate control conditions in napping studies
First, depending on the aim of the study, it should be carefully considered at which time of day the nap is scheduled and for how long participants are allowed to nap. Attending to these variables allows, for example, comparing naps with and without REM sleep (e.g., McDevitt et al., 2015;Mednick et al., 2003).
Second, adding a carefully controlled quiet rest condition (e.g., keeping eyes open to avoid falling asleep; condition 11) helps to distinguish whether sleep is a specific state that actively triggers off-line memory improvement (i.e., Sleep > Quiet Rest) or a non-specific state that only passively protects memories from interference (i.e., Sleep = Quiet Rest) (Ellenbogen et al., 2006).
The benefits of such a design have only recently been recognized. Studies using it have led to mixed findings, with some studies showing better consolidation in the nap condition than in the quiet rest condition (Piosczyk et al., 2013; Schichl et al., 2011; Schönauer et al., 2014) and others showing that quiet rest produced effects on memory consolidation similar to those observed in nap conditions (Mednick et al., 2009; Simor, Zavecz, et al., 2019), suggesting that sleep per se may not be necessary for consolidation but rather only provides a favorable environment.
Observations that memory reactivation occurs not only during sleep but also during quiet rest (e.g., Schapiro et al., 2018) further highlight the need for such control conditions. Therefore, monitoring the wake (quiet rest) condition with polysomnography is essential to rule out any sleep-like activity during the quiet rest interval, as well as to examine whether polysomnographic indicators during wake (quiet rest) are specifically associated with memory consolidation (see also section 4b).
Note that a quiet rest condition would also be informative in a classic overnight design (Figure 1, condition 1) that employs control conditions, such as morning-evening (wake) or evening-morning with sleep deprivation (Figure 1, conditions 2 vs. 5, respectively; see also section 1a). However, this solution is not feasible since it would be extremely difficult to stay in quiet wakefulness for 12 hours in the morning-evening condition and it would be even more difficult (and stressful) to avoid falling asleep in a quiet environment in the evening-morning condition. Therefore, overnight studies typically use an active wake evening-morning (sleep-deprived; condition 5) control condition if they want to control for circadian effects in their design.
Finally, few studies have included an overnight sleep (condition 1), a daytime nap (condition 9), and a quiet rest (wake; condition 11) condition in a single experimental design to gain a better understanding of the specific effect of sleep on consolidation (van den Berg et al., 2021; Van den Berg et al., 2019). While this approach introduces its own set of challenges (e.g., the architectures of a nap and of overnight sleep are not comparable), it can be advantageous as it allows: 1) the direct comparison of the relative benefit of an overnight sleep vs. a nap, 2) a better control for time-of-day effects, and 3) the examination of specific benefits of napping and the minimum amount of sleep necessary to afford a benefit to memory consolidation.

Pitfall 1d. Time interval between memory encoding and sleep onset
The time interval between the learning task and sleep onset (see Figure 1, condition 1) may vary across experiments, conditions and individuals, potentially hindering the assessment of the true effect of sleep on memory consolidation. Although consolidation of procedural memory appears rather insensitive to such effects (King et al., 2017), it has been shown in declarative memory that the more time elapses between the end of the learning task and sleep onset, the smaller the sleep-related memory benefit (Gais et al., 2006). A longer wake interval before sleep onset may hinder the manifestation of the beneficial effect of sleep due to the participant's involvement in activities that may interfere with recently learned information by re-engaging the same cognitive processes and/or recruiting the same neural networks.

Solutions: experimental designs/procedures controlling the time interval between memory encoding and sleep onset
Experiments should control the duration of the interval between memory encoding and bedtime/sleep onset, as well as the participants' activities during this interval. To minimize interference during this interval, participants should go to bed as soon as possible after memory encoding. Such designs are more feasible when participants sleep in the lab during the experiment.
If, however, participants sleep at home after the learning session, then mobile actigraphy or, as a less compelling substitute, sleep diaries and post-experiment questionnaires, should be employed to assess the duration of this interval and the activities performed. This information should then be appropriately considered in data analysis.
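A minimal sketch of how this interval could be computed per participant is given below, assuming that the end of encoding and the estimated sleep onset are available as timestamps (for example, from the experiment log and from actigraphy or a sleep diary). The function name and the timestamp format are assumptions made for illustration.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"  # assumed logging format


def encoding_to_sleep_minutes(encoding_end: str, sleep_onset: str) -> float:
    """Minutes elapsed between the end of encoding and sleep onset."""
    start = datetime.strptime(encoding_end, TIMESTAMP_FORMAT)
    end = datetime.strptime(sleep_onset, TIMESTAMP_FORMAT)
    if end < start:
        raise ValueError("sleep onset precedes the end of encoding")
    return (end - start).total_seconds() / 60.0


# Made-up timestamps; full dates keep intervals that cross midnight correct.
print(encoding_to_sleep_minutes("2024-03-01 21:30", "2024-03-01 23:15"))  # 105.0
print(encoding_to_sleep_minutes("2024-03-01 23:30", "2024-03-02 00:45"))  # 75.0
```

The resulting value could then serve as a covariate or as a basis for matching groups in the analysis.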

Pitfall 1e. Baseline measurements and feedback effects in declarative memory paradigms
When designing a declarative memory paradigm, a critical question is what procedure to use to ensure that participants encode a sufficient number of items for a later reliable and valid test of retrieval performance. In most sleep-related declarative memory studies that use cued or free recall, a certain learning criterion is defined, for instance 60% of recall success. (Other methods involve restrictions of study time (e.g. Lahl et al., 2008) or of number of trials during encoding (e.g., Mikutta et al., 2019)). If the learning criterion is not met after the first run of trials, a common strategy is to repeat the whole run, until the learning criterion is met (e.g. Cordi et al., 2014;Prehn-Kristensen et al., 2014). An advantage of this procedure is that all participants encode a sufficient number of items for later retrieval testing. However, the quality of encoding can be significantly different between participants: A participant who met the learning criterion within the very first run studies all items only once and, therefore, encodes them rather weakly. Another participant, who needed several repetitions to meet the criterion, studies all items multiple times. In this latter example, a difference can arise in the strength/quality of encoding of different items: some items may have been successfully encoded already in the first run and further practiced during successive repetitions, while other items may have been encoded only in the final repetition. These differences in repetitions and encoding level could have an enormous impact on later retrieval performance (Ebbinghaus, 2013;Xue et al., 2010;Young & Bellezza, 1982). This effect deserves even more consideration in studies comparing different populations (e.g., healthy participants vs. patients, or children vs. adults) that presumably learn at a different pace.
Another issue that arises in declarative memory paradigms is that the performance level observed during the last run of encoding (i.e., just when the learning criterion is met) is frequently used as a baseline measurement to evaluate recall performance after the retention interval. However, encoding runs are often designed to give direct feedback, often in the form of providing the correct answer after each item, potentially resulting in further, unmeasured encoding. Therefore, the baseline measurement does not reflect the exact memory state at the end of the learning phase, but rather probably underestimates it in such cases (Chan & McDermott, 2007;Karpicke & Roediger, 2008;Soderstrom et al., 2016;Wiklund-Hornqvist et al., 2014).

Solutions: experimental procedures to minimize differences across participants/studies in baseline measurements and feedback effects
Most researchers investigating sleep-related consolidation of declarative memory choose a learning criterion between 40% and 80% (e.g., Fowler et al., 1973; Klinzing et al., 2016). While systematic studies are lacking, an unsystematic PubMed search yielded a modal criterion of 60% for word-pair learning or visuo-spatial learning tasks (e.g., Cordi et al., 2014; Fenn & Hambrick, 2012; Marshall et al., 2004; Payne et al., 2012; Prehn-Kristensen et al., 2014; Rasch et al., 2007; Wilhelm et al., 2008). Overall, based on these studies, the 60% learning criterion seems to be a reasonable choice to account for possible floor and ceiling effects.
An option to circumvent the use of a predefined learning criterion is the so-called selective reminding procedure (Buschke, 1973). All items are presented to the participant during a first study run. Subsequently, a first test run is conducted where all items are tested. In a second study run, only those items that were not recalled correctly during the first test run are presented. Then, in a third study run, only those items are presented that were not recalled correctly during the second test run and so on. New study runs proceed until all items are remembered correctly once. This procedure enables all participants to encode the same number of items while no item is 'over-learned' (for examples, see Mazza et al., 2016; Quan et al., 2018; Uguccioni et al., 2013). This procedure could be used to reach a 100% baseline encoding level in all participants. One limitation of such an approach (i.e., a 100% learning criterion) is, however, that it is only suitable for those memory studies where a loss in declarative memory is expected over the retention interval. In other cases, using the selective reminding procedure with a lower predefined learning criterion (e.g., 60%) could ensure similar encoding strength/quality across participants (no over-learned items), while avoiding potential floor and ceiling effects.
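The control flow of the selective reminding procedure can be sketched as follows. This is a simplified illustration rather than the procedure of Buschke (1973) as implemented in any particular study: the function name, the `recall` callback standing in for a participant's response, the `max_runs` safeguard, and the word pairs are all assumptions. As a further simplification, an item is dropped from subsequent study and test runs once it has been recalled correctly once.

```python
from typing import Callable, Dict, Iterable


def selective_reminding(
    items: Iterable[str],
    recall: Callable[[str, int], bool],
    criterion: float = 1.0,
    max_runs: int = 50,
) -> Dict[str, int]:
    """Return how often each item was presented before the criterion was met.

    recall(item, n_presentations) stands in for the participant and returns
    True when the item is recalled correctly in the test run.
    """
    presentations = {item: 0 for item in items}
    remaining = list(presentations)  # items not yet recalled correctly once
    n_items = len(presentations)
    for _ in range(max_runs):
        # Stop once the proportion of recalled items reaches the criterion.
        if n_items == 0 or (n_items - len(remaining)) / n_items >= criterion:
            break
        # Study run: present only the items not yet recalled.
        for item in remaining:
            presentations[item] += 1
        # Test run: keep only the items that are still not recalled.
        remaining = [i for i in remaining if not recall(i, presentations[i])]
    return presentations


# Hypothetical participant: each pair is recalled after a fixed number of
# presentations (made-up values, for illustration only).
needed = {"dog-chair": 1, "tree-lamp": 2, "house-river": 3}


def simulated_recall(item: str, n_presentations: int) -> bool:
    return n_presentations >= needed[item]


print(selective_reminding(needed, simulated_recall))
# {'dog-chair': 1, 'tree-lamp': 2, 'house-river': 3}
print(selective_reminding(needed, simulated_recall, criterion=0.6))
# {'dog-chair': 1, 'tree-lamp': 2, 'house-river': 2}
```

With the lower criterion, the procedure stops as soon as two of the three pairs have been recalled, so the third pair is presented only twice.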
Additionally, a test run of all items could be introduced immediately after the learning criterion is met, without any corrective feedback. This could provide an even more precise measure of the baseline. However, one should remember that test runs can also boost learning and subsequent consolidation even if no feedback is given, a phenomenon called 'test-enhanced learning' (McDermott, 2021;Roediger III & Karpicke, 2006).

Pitfall 2: Task complexity
A significant difficulty in sleep and memory research, and in cognitive neuroscience and psychology in general, is that practically every task involves several cognitive processes (e.g., Jacoby, 1991;Sigman & Dehaene, 2005). The learning/memory scores that are used to assess behavioral performances typically reflect a mixture of these cognitive processes (Cohen et al., 2005). For example, even a simple perceptual-motor learning task requires at least the processing of perceptual stimuli, acquisition of their serial order and/or transitional probabilities, perceptual-motor coordination, and selective attention. As learning progresses (i.e., as the involved neurocognitive system is being fine-tuned to the task), these processes could improve at a different pace and, therefore, could contribute to the behavioral performance assessed at the end of the learning session to a varying degree. Importantly, consolidation can differentially affect these processes involved in the task (e.g., Conte & Ficca, 2013;King et al., 2017;Stickgold, 2013;Stickgold & Walker, 2013), and this could be even further exacerbated by individual differences in their contribution to behavioral performance.
There are not only different types of learning and cognitive processes but also different types of retrieval processes that may determine whether beneficial effects of sleep on memory consolidation are detected or not. Specifically, declarative memory paradigms most frequently probe recall and recognition, with recall referring to the ability to retrieve a stimulus from memory with or without a cue, and recognition referring to the ability to decide whether a given stimulus has been previously encountered. Some evidence suggests that recall is more sensitive to sleep effects than recognition, possibly because sleep facilitates the integration of new memories into preexisting knowledge networks, thereby increasing potential access routes for recall (Diekelmann et al., 2009).
Differences in retrieval may be evident for procedural memory as well. For example, tasks may tap into explicit (i.e., conscious) vs. implicit (i.e., unconscious) aspects of the acquired knowledge to a varying degree (e.g., Fischer et al., 2006;Schendan et al., 2003). This could potentially lead to contradictory findings across studies and mask the differential effect of sleep on the consolidation of different aspects of knowledge.
Another factor to consider is that different memory systems are thought to interact with each other during learning and possibly also during consolidation (Freedberg et al., 2020). However, to optimize use of resources, researchers sometimes include memory tasks tapping into different memory systems (e.g., a non-declarative/procedural and a declarative memory task) in a given experiment. This is problematic because acquiring memories that tap into different memory systems shortly after one another may cause interference in their consolidation. This could potentially alter the observed effects of post-learning sleep and their interpretation. For instance, it has been shown that acquiring procedural memories just after a declarative memory task was affected by participants' memory performance in the latter and led to differences in consolidation over the wake vs. sleep conditions. A complementary pattern has been observed when the declarative memory task was performed after the acquisition of procedural memories (Brown & Robertson, 2007).

Solutions: disentangling and contrasting different cognitive processes and aspects of memory involved in a given task
Since there are no process-pure learning/memory tasks, attention should be paid to the specific cognitive processes that are involved in a particular task. We recommend that future research aiming at understanding the specific effect of sleep on consolidation should use tasks and designs that could help disentangle these different cognitive processes, and examine whether they are differentially affected by sleep. For example, in declarative memory tasks, different aspects of retrieval (free recall, cued recall, recognition) should be systematically compared within the same experimental design to examine how they can reveal (potentially different) sleep effects. These effects may also vary depending on the type of information to be encoded, for example, paired-associates learning (Feld et al., 2016;Plihal & Born, 1997), word-list learning (Abel & Bauml, 2012), emotional picture learning (Cairney et al., 2015;Payne et al., 2015) and object-location memory (Rasch et al., 2007;Rudoy et al., 2009), suggesting that they involve at least partially distinct cognitive processes. To better understand the differential effect of sleep on aspects of memory, contrasting the encoding/consolidation of different types of information within the same experimental design is warranted. In procedural learning/memory, for instance, research has disentangled and contrasted allocentric vs. egocentric representations (Viczko et al., 2018), perceptual vs. motor components of learning (Hallgato et al., 2013), transition vs. ordinal representations (Song & Cohen, 2014), and acquisition of statistical vs. sequential regularities (Simor, Zavecz, et al., 2019), and showed differential effects of sleep in some of these aspects (Albouy et al., 2013;Cohen et al., 2005;Song & Cohen, 2014).
Furthermore, to minimize the potential interactions between different memory systems, which could confound the identification of sleep effects, it may be beneficial to administer tasks tapping into different memory systems using a between-subject design. If, however, a within-subject design is chosen, the order of task administration should be counterbalanced across participants and included in data analysis as a separate factor.
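Counterbalancing task order in a within-subject design can be sketched as follows. The function name and the task labels are illustrative assumptions; in practice, participants would typically be assigned to orders at random within each block rather than in order of enrollment.

```python
from itertools import cycle, permutations


def counterbalance(participant_ids, tasks):
    """Assign task orders so that every possible order is used (nearly) equally often."""
    orders = cycle(permutations(tasks))
    return {pid: next(orders) for pid in participant_ids}


# Hypothetical study with eight participants and two tasks.
assignment = counterbalance(range(1, 9), ["declarative", "procedural"])
for pid, order in assignment.items():
    print(pid, order)
```

The assigned order can then be entered as a between-subject factor in the analysis.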

Pitfall 3: Fatigue effect in repetitive tasks
Some studies, particularly those investigating non-declarative/procedural learning, use tasks that involve continuous practice with a series of repetitions of the same action, for example, pressing keys (Nissen & Bullemer, 1987). Learning is measured as the improvement in accuracy or in reaction times as the task progresses. Usually, the performance at the end of the training session serves as a baseline to measure improvement at the test session that takes place after an interval involving sleep or wakefulness. Yet, after a certain amount of time spent performing the task, the subject's observed improvement is less marked, which can be interpreted as a reactive inhibition effect that reflects the build-up of fatigue over the trials (e.g., Brawn et al., 2010;Pan & Rickard, 2015). This effect often results in smaller improvement or even a decrease in performance as the task progresses. Thus, the measured performance after longer/extended practice is not representative of the level of expertise gained in the task and, therefore, comparing the performance at the test session with that of the end of the training session may lead to illusory sleep-related improvement and may also bias the quantification of the sleep benefit. Figure 3 upper panel illustrates this issue, which can be even further exacerbated by using performance measures averaged across multiple trials, instead of a trial-by-trial analysis. In several cases, after eliminating the reactive inhibition effect by releasing the presumed fatigue, the sleep-related off-line improvement was no longer observed (e.g., Cai & Rickard, 2009;Rickard et al., 2008). Rather than an actual performance improvement, after elimination of the reactive inhibition effect, the benefit of sleep was expressed as a stabilization of performance (Nettersheim et al., 2015;Rickard et al., 2008). 
Although this issue is primarily relevant in procedural learning studies, it is possible that reactive inhibition also affects performance in declarative memory studies, particularly if they include repetitive presentations of the same items or a long period of memorization (Abel & Bauml, 2012).

Experimental design solutions
Using post-rest performance at the end of the training session as a baseline. Resting for a few minutes after the training session appears to be sufficient to wash out the effect of reactive inhibition on performance. Measuring performance after a break is therefore a more appropriate baseline to assess subsequent off-line consolidation (e.g., Brawn et al., 2010;Simor, Zavecz, et al., 2019). Figure 3 middle panel illustrates this solution.
Learning through spaced rather than massed practice. Use of short (e.g., 10 s) performance intervals between longer (e.g., 30 s) rest intervals during the training session (often termed spaced practice) impedes the accumulation of reactive inhibition compared to experimental designs that use massed practice in which there are longer task intervals (e.g., Brawn et al., 2010;Rickard et al., 2008;Rieth et al., 2010). Figure 3 lower panel illustrates this solution.

Data analysis solutions
Using curve fitting methods. Here we highlight two such methods. First, a function-based model (e.g., a power function for reaction time improvement) can be fitted to the training session data and used to predict future performance (under the null hypothesis that the delay between training and test sessions has no effect on performance). This method enables a comparison between the predicted (under H0) and the actual outcomes measured during the test session. This way, one avoids averaging over data points to compute a pre-post gain, a procedure which may yield illusory off-line performance gains if performance is improving between the end of the training session and the beginning of the test session, wherein the data averaging is done (Pan & Rickard, 2015). As a second and more formal approach, a function can be fitted to the training and test session data and then a continuity test can be used to infer whether the performance is a simple continuation of that function from the training session to the test session, or whether there is an abrupt change between the sessions (see details on these approaches in Pan & Rickard, 2015).
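The first approach can be illustrated with a minimal sketch. It assumes a two-parameter power function, RT = a * trial^(-b), fitted by ordinary least squares on log-transformed data; this is a simplification chosen for brevity and not necessarily the fitting method used in the cited work. All data values and function names are invented for illustration.

```python
import math


def fit_power_law(trials, rts):
    """Least-squares fit of log(rt) = log(a) - b * log(trial); returns (a, b)."""
    xs = [math.log(t) for t in trials]
    ys = [math.log(r) for r in rts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope


def predict_rt(a, b, trial):
    """Reaction time expected under H0, i.e., if the training curve simply continued."""
    return a * trial ** (-b)


# Hypothetical training data: 20 trials following RT = 800 * trial^-0.3 (ms).
train_trials = list(range(1, 21))
train_rts = [800 * t ** -0.3 for t in train_trials]
a, b = fit_power_law(train_trials, train_rts)

# Hypothetical test-session data; trial numbering continues from training.
test_trials = [21, 22, 23, 24]
test_rts = [300.0, 298.0, 297.0, 295.0]
residuals = [obs - predict_rt(a, b, t) for t, obs in zip(test_trials, test_rts)]
mean_residual = sum(residuals) / len(residuals)

# A negative mean residual means faster responses than the extrapolated curve.
print(round(a, 1), round(b, 3), round(mean_residual, 1))
```

The comparison is made against the extrapolated curve rather than against an average of the last training trials, which is the point of the method.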
Performing computational modeling. Using computational models on trial-by-trial data can help overcome the issue of fatigue by directly including reactive inhibition as a separate parameter in the model. For instance, in a probabilistic sequence learning task, Török, Janacsek, Nagy, Orbán, & Nemeth (2017) used such a model, allowing the estimation of the actual magnitude of learning, independent of the effect of reactive inhibition. Such models can be used in a wide range of learning and memory tasks, including finger tapping and other sequence learning tasks.

Pitfall 4: Inappropriate data analysis practices
Studies of sleep and memory suffer from problems similar to those discussed in recent years across psychology and neuroscience as the 'replication crisis' (Ioannidis, 2005;Maxwell et al., 2015;Open Science Collaboration, 2015), and thus could similarly benefit from the updated practices that are currently evolving in the scientific community in general (see e.g., Rickard et al., in press about publication bias in sleep and motor sequence learning literature).

Pitfall 4a. Questionable practices related to sample size and interpretation of non-significant results
In the field of sleep-related consolidation, studies have typically used samples with 12-20 participants per group (e.g., Gais et al., 2006;Rickard et al., in press;Wagner et al., 2006), or in some cases even smaller samples (e.g., Csabi et al., 2015), which may be due to complicated or demanding study designs, difficulties recruiting clinical populations, and/or drop-outs of participants (i.e., experimental attrition). Moreover, sample sizes have usually not been determined by a priori power analysis based on expected effect sizes. Importantly, small sample sizes could result in low statistical power, potentially increasing Type 2 errors (i.e., not detecting an existing effect), and could also lead to non-replicable, spurious findings. On the other hand, a common but questionable practice of collecting additional data until a significant effect is reached could increase Type 1 errors (i.e., detecting an effect that does not exist), again, leading to non-replicable findings.
Another issue arises from the interpretation of non-significant findings. For example, non-significant effects could be observed in pre-sleep vs. post-sleep comparisons when consolidation results in stabilization of the acquired knowledge without forgetting or off-line performance improvement (i.e., no performance change). If one wants to conclude that sleep promotes stabilization of the acquired knowledge or has no effect on some aspects of memory consolidation compared to wakefulness, such a conclusion cannot be drawn by showing non-significant results in classical statistical approaches (e.g., frequentist t-test, ANOVA, correlation, etc.).

Solutions: a priori power analysis and Bayesian statistical approaches
Before data collection. It has long been recommended in guidelines (e.g., published by the American Psychological Association) that experimenters should determine the sample size before starting the experiment by computing power analyses based on the expected effect size estimated or found in previous studies that observed similar effects.
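As a rough illustration of an a priori sample size calculation, the sketch below uses the normal approximation for a two-sided, two-group comparison of means. The function name and default values are assumptions made for this example; dedicated power analysis software based on the t distribution typically returns slightly larger values.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided two-sample comparison.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d)^2.
    """
    if effect_size_d <= 0:
        raise ValueError("effect_size_d must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Expected effect sizes (Cohen's d) here are placeholders, not values from the literature.
for d in (0.3, 0.5, 0.8):
    print(d, n_per_group(d))
```

Even under this optimistic approximation, a medium effect (d = 0.5) requires far more than the 12-20 participants per group that are typical in the field.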
During data collection. For particularly costly experimental protocols, Bayesian statistical analyses (Dienes, 2016;Dienes & Mclatchie, 2018;Wagenmakers et al., 2018) computed in the course of data collection can be used to determine whether there is enough evidence in favor of a given a priori defined effect so that one can stop data collection (Rouder, 2014;Schönbrodt et al., 2017).
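A sequential stopping rule of this kind can be sketched as follows. For simplicity, the sketch swaps in the BIC approximation to the Bayes factor for a one-sample (or paired-difference) design instead of the default Bayes factors described in the cited papers; the evidence thresholds (10 and 1/10), the minimum sample size, and all function names are assumptions that would need to be fixed in advance in a real study.

```python
import math


def bf10_bic(differences):
    """BIC-approximate Bayes factor for H1 (mean != 0) over H0 (mean = 0)."""
    n = len(differences)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(differences) / n
    sse1 = sum((d - mean) ** 2 for d in differences)  # residuals around the sample mean
    sse0 = sum(d ** 2 for d in differences)           # residuals around zero
    if sse1 == 0:
        raise ValueError("differences have zero variance")
    # BIC(H1) - BIC(H0) for one additional free parameter under H1.
    delta_bic = n * math.log(sse1 / sse0) + math.log(n)
    return math.exp(-delta_bic / 2)


def sequential_stop(differences, n_min=10, upper=10.0, lower=0.1):
    """Check the Bayes factor after each new participant once n_min is reached."""
    if len(differences) < n_min:
        raise ValueError("fewer observations than n_min")
    bf = None
    for n in range(n_min, len(differences) + 1):
        bf = bf10_bic(differences[:n])
        if bf >= upper:
            return n, bf, "evidence for H1"
        if bf <= lower:
            return n, bf, "evidence for H0"
    return len(differences), bf, "inconclusive"


# Invented post-sleep minus pre-sleep difference scores, in order of data collection.
scores = [1.0, 1.2, 0.8, 1.1, 0.9] * 4
print(sequential_stop(scores))
```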
After data collection. Bayesian analyses, in particular the Bayes Factors, are rarely reported in the field of sleep-related memory consolidation (see Brown & Maylor, 2017 for an exception) whereas they are increasingly reported in other areas of psychology and neuroscience. The Bayes Factor indicates an odds ratio of relative probabilities in favor of the null hypothesis (i.e., the absence of a difference between the conditions or groups) vs. in favor of the alternative hypothesis (i.e., the difference between the conditions or groups) (Jarosz & Wiley, 2014;Rouder et al., 2009).
Bayesian statistics thus allow a more fine-grained quantitative evaluation of the effect of sleep on memory consolidation. Additionally, effect size measures should always be reported to provide an estimate of the relevance of the observed effect. Effect sizes can also be indicative of the true effect in cases of non-significant results with small samples and potential Type 2 errors.

Pitfall 4b. Spurious correlations between sleep parameters and memory consolidation
Beyond the comparison of groups or conditions, conclusions about the effect of sleep are often based on correlations between behavioral performance and sleep polysomnographic parameters (e.g., Scullin, 2013;Simor, Zavecz, et al., 2019) (see also section 4c). However, the concern has been raised that some of these correlations are a consequence of suboptimal statistical practices, leading to spurious correlations (e.g., Mantua, 2018;Pan & Rickard, 2015). Small sample sizes are a further source of spurious correlations. When sufficiently large sample sizes are used, these correlations may be greatly reduced (Ujma, 2021) or even disappear.
For example, Ackermann et al. (2015) did not find any significant correlation between sleep parameters and episodic memory consolidation, an area that has received much attention in the past decades of sleep and memory research, in a large sample of 929 participants.
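The link between small samples and spuriously large correlations can be illustrated with a short simulation. The sketch draws two independent (truly uncorrelated) normal variables and counts how often the sample correlation reaches |r| >= 0.5 by chance; the threshold, number of simulations, seed, and function names are arbitrary choices made for this illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def share_large_null_r(n, threshold=0.5, n_sims=500, seed=1):
    """Fraction of simulated null samples of size n with |r| >= threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(x, y)) >= threshold:
            hits += 1
    return hits / n_sims


for n in (12, 50, 200):
    print(n, share_large_null_r(n))
```

With samples in the range commonly used in the field, a noticeable share of purely random datasets produces correlations of this size, whereas such values become very rare with large samples.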

Solutions: a priori planning of appropriate statistical analyses and reporting all (significant and non-significant) findings
The correlations to be computed should be planned a priori (see also the section on further recommendations below) and corrected for multiple comparisons in order to avoid increases in Type 1 errors (Abdi, 2007). Non-significant planned correlations should also be systematically reported (Forstmeier et al., 2017). If no relationship is expected between certain sleep parameters and behavioral performance, Bayesian approaches should be used to draw conclusions in favor of the null hypothesis instead of (or in addition to) reporting non-significant p-values (see section 4a).
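A correction for multiple planned correlations can be sketched as follows. Holm's step-down procedure is used here as one possible choice; the text above does not prescribe a specific method, and the p-values and parameter labels in the example are invented.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keeps adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Invented p-values for four planned sleep-parameter correlations.
planned = {"spindle density": 0.01, "SWS duration": 0.04,
           "REM duration": 0.03, "slow oscillation power": 0.005}
adjusted = holm_adjust(list(planned.values()))
for (name, p), p_adj in zip(planned.items(), adjusted):
    print(name, p, round(p_adj, 3), "significant" if p_adj < 0.05 else "n.s.")
```

All planned correlations appear in the output, including those that do not survive the correction, in line with the recommendation to report non-significant planned results.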

Pitfall 4c. Individual differences in general cognitive abilities are not controlled for in correlational studies of sleep and memory
Certain features of sleep (e.g., sleep spindles) appear to be highly correlated with trait-like individual differences in cognitive abilities. Particularly strong relationships have been identified for cognitive abilities related to reasoning, problem solving, the ability to identify complex patterns and relationships, and the use of logic (i.e., 'fluid intelligence') (Bódizs et al., 2005;Fang et al., 2019;Fang et al., 2017;Fogel & Smith, 2006;Ujma et al., 2015). Since these cognitive abilities are not only associated with certain features of sleep but they have been shown to support memory functions as well, they may confound the associations revealed between sleep and memory consolidation. Therefore, when the specific effect of sleep on memory consolidation is tested, associations between sleep (e.g., spindles) and these cognitive abilities (e.g., intelligence) should be controlled for.
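One simple way to implement such statistical control is a first-order partial correlation, sketched below. The toy data are constructed (not measured) so that a sleep parameter and a memory score are related only through a shared ability score; the variable and function names are assumptions made for this illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Constructed example: both measures equal the ability score plus independent noise.
ability = [-1, -1, -1, -1, 1, 1, 1, 1]
noise_a = [-1, -1, 1, 1, -1, -1, 1, 1]
noise_b = [-1, 1, -1, 1, -1, 1, -1, 1]
spindles = [a + e for a, e in zip(ability, noise_a)]
memory = [a + e for a, e in zip(ability, noise_b)]

print(round(pearson_r(spindles, memory), 3))           # raw correlation
print(round(partial_r(spindles, memory, ability), 3))  # controlling for ability
```

In this constructed case the raw correlation is substantial, while the partial correlation vanishes once the shared ability score is taken into account.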

Solutions: employing appropriate control tasks and conditions
The problem of disentangling individual differences in the associations between sleep and general cognitive abilities from the associations between sleep and memory can be addressed in at least two ways. First, one can employ neurocognitive assessments (e.g., intelligence testing) and include these scores as covariates to statistically control for possible confounding effects when testing the specific associations between sleep and memory consolidation. Second, a comparable baseline night of sleep together with an appropriate control task can be included in the study design. This control task should be comparable to the experimental task without engaging the targeted specific processes that are the focus of sleep-related memory consolidation. Comparing the two experimental conditions can reveal the specific effect of sleep on the memory process of interest.
Research transparency is further hindered by the lack of pre-registration of the studies on sleep and memory.

Depositing research data on open-access platforms
Making research data publicly available enables the re-analysis of old data when new analysis techniques and/or new theories are developed. For example, there are at least two different types of REM microstates (tonic vs. phasic), each with different characteristics (Simor et al., 2017;Simor, van Der Wijk, et al., 2019). Access to previous sleep EEG data would make it possible to specifically test the role of REM microstructure in memory consolidation on previous datasets.
Publicly available sleep EEG and behavioral data could also provide solutions for at least some of the pitfalls discussed above by, for example, enabling data re-analysis to evaluate evidence for potentially non-significant results (see section 4a) and resolve at least some of the previous issues of spurious results (see section 4b). These open databases can also support reliable synthesis of the data through meta-analyses with larger sample sizes.
To maximize the benefits of previous research in the scientific community, we recommend that sleep researchers engage in open science (Nosek et al., 2015) and make data publicly available.

Pre-registration of studies
Another way of increasing transparency of research is to pre-register studies before data collection (Lindsay et al., 2016;Nosek et al., 2018). Pre-registration includes the specification of the research question, experimental design, subject population, sample size as well as planned analysis methods. Pre-registration is already the gold standard in many fields of research, including for clinical trials in medical research. The neurosciences and psychology fields increasingly recognize the importance of pre-registration as well. Yet, this option has been largely neglected in sleep and memory research so far.
Studies can be pre-registered in different ways. One option is pre-registration in independent online registries like the Open Science Framework or ClinicalTrials.gov. In these registries, researchers provide a detailed description of their planned study that can be accessed by other researchers as well as journal editors and reviewers to determine whether the pre-specified plan was followed adequately. Another option is to write a registered report, which is a novel publication type offered by an increasing number of journals (e.g., Plos Biology, eLife, eNeuro, Cortex). A registered report undergoes two stages of peer review, first before data collection to determine the appropriateness of the research plan and methodology, and then after data collection covering the full research report including the results. If the first round of peer review is successful, the authors are typically offered 'in principle acceptance' by the journal, allowing the results to be published irrespective of the actual findings.
Both procedures, pre-registration in online registries and registered reports, increase the quality of research by reducing inappropriate data analysis practices, including p-hacking, HARKing (hypothesizing after the results are known), and the application of unplanned statistical tests. Thus, pre-registration can promote the implementation of solutions to Pitfall 4 discussed above (see e.g., sections 4a and 4b). Additionally, registered reports could also reduce the file-drawer problem because the study, irrespective of finding significant or non-significant results, could be published in the target journal.
Assessing sleepiness and vigilance.

b. Overnight studies (pathological population or intervention): Including a morning-evening condition and/or an evening-morning deprivation condition. Including a control group.

c. Napping studies: Considering time of day and duration of nap. Including a quiet-wake control condition. Monitoring the nap with polysomnography.

d. Time interval between memory encoding and sleep onset: Controlling for duration of and subject's activity during the interval between end of task and sleep. Monitoring the activities during the interval with actigraphy and/or questionnaires.

e. Baseline measurements and feedback effects (declarative memory): Using a selective reminding procedure, possibly combined with a predefined learning criterion.

Task complexity
Every task involves multiple cognitive processes that need to be disentangled to better understand the specific effect of sleep.
Solutions: Disentangling and contrasting different cognitive processes and aspects of memory involved in a given task.

Fatigue effect in repetitive tasks (non-declarative memory)
May lead to a spurious beneficial effect of sleep by negatively affecting performance after a longer practice.
Solutions: Using appropriate experimental designs, e.g., including post-rest performance at the end of the training session as a baseline, and promoting learning through spaced rather than massed practice. Using appropriate data analysis methods, such as curve fitting and computational modeling.

Inappropriate data analysis practices
Small sample sizes and inappropriate analyses/reporting may lead to spurious correlations and incorrect conclusions.

a. Small sample size and reporting only significant results: Determining the required sample size a priori. Using Bayesian analyses to decide when to stop data collection. Reporting Bayes Factors and effect sizes.

b. Spurious correlations between sleep parameters and memory consolidation: Planning correlation analyses of interest in advance, correcting for multiple comparisons, and reporting non-significant planned comparisons.

c. Not controlling for individual differences in general cognitive abilities in correlational studies of sleep and memory: Including neurocognitive assessments of general cognitive abilities as covariates. Including a baseline night of sleep with an appropriate control task.

Conclusion
In this consensus paper, we highlighted four sets of critical methodological pitfalls that impede research in the field of sleep and memory, and we offered solutions to avoid them (for a summary, see Table 1). We believe that implementing the solutions presented in this paper will lead to more reliable results and significantly advance our understanding of the complex relationship between sleep and memory. Since some of the pitfalls described in this paper (such as those related to fatigue effects, task complexity, and data analysis practices) are relevant not only in sleep and memory research but also in other fields of psychology and neuroscience, applying these solutions where appropriate could benefit the broader scientific community as well.