Measuring the Impact of AI-Driven Wellbeing Apps—Instrument Development and Pilot Evidence from the Malu Prototype

Sarah Hatfield; Jeanette Tamm

doi:10.20944/preprints202605.2077.v1

Submitted: 29 May 2026

Posted: 29 May 2026

Abstract

The aim of the present study was to develop an instrument that enables evaluation of AI-based mental health apps, which are promising digital interventions for promoting psychological wellbeing. The instrument was used to conduct an initial evaluation of an early pilot stage of the wellbeing app Malu. As part of a non-representative hypothesis-testing longitudinal study, N = 11 participants aged 18 to 34 used the app over a period of two weeks. The participants were surveyed at three points regarding perceived stress (Perceived Stress Scale), sleep problems (short version of the Insomnia Severity Index), and chatbot usability (Chatbot Usability Scale). The results showed a significant decrease in perceived stress between the first and third measurement points (Z = –2.31, p = .01), as well as for perceived sleep problems between the second and third measurement points (Z = –1.86, p = .03). Perceived chatbot usability increased significantly over the course of the study (Z = 2.37, p = .01). The results suggest potential effectiveness of the app in reducing stress and sleep problems as well as an improvement in the user experience regarding the chatbot interaction over time. The evaluation instrument proved suitable for use in early development phases.

Keywords: wellbeing; artificial intelligence; mental health; perceived stress; sleep; chatbot interaction; usability; user experience; human-AI; app

Subject: Public Health and Healthcare - Public Health and Health Services

1. Introduction

The psychological well-being of the population is increasingly becoming a central focus of societal, political, and scientific discourse. Against the backdrop of growing pressures caused by digitisation, constant availability, global crises, and the accelerating pace of everyday life, the need for effective strategies to strengthen mental resources is increasing, not only in the context of illness but also as a preventive measure in daily life [1,2]. Numerous studies indicate that even moderate levels of stress, irregular sleep, or emotional imbalance can have long-term negative effects on well-being and physical health [3,4].

At the same time, substantial gaps remain in the provision of adequate psychotherapeutic care. These result, among other factors, from shortages of qualified professionals, long waiting times, and the unequal distribution of available services, particularly in rural or socially disadvantaged regions [2]. Young adults, students, and working professionals are especially affected, as factors such as digital overstimulation, performance pressure, and social comparison may contribute to stress, sleep disturbances, and emotional exhaustion [4,5].

Against this background, AI-based digital technologies for supporting mental health are gaining increasing importance. Well-being apps enable low-threshold, location-independent, and time-independent support for psychological balance in everyday life [6]. Development in this field is increasingly moving beyond static applications toward AI-supported programs which include psychological modules based on cognitive behavioural therapy (CBT) and AI-based chatbots for emotional support, thereby allowing flexible integration into everyday routines [2,7].

The present study aims to develop a practice-oriented evaluation instrument to assess the AI-based sleep and stress management app ‘Malu’. Malu is a mobile mental health application that combines ecological momentary assessment (EMA) self-reports with passive data collected from smartphones and wearable devices, including sleep, physical activity, psychophysiological markers, and contextual information. A statistical network model is used to identify person-specific associations between these factors and users’ well-being in order to provide personalized just-in-time interventions. A proactive AI-based chatbot delivers these interventions in daily life, supports user engagement, and in this way facilitates cognitive and behavioral change in real-world contexts. The evaluation instrument is intended to assess the effectiveness and usability of AI-supported well-being apps throughout the development process. Psychological effectiveness and technical usability are core criteria and contribute to quality assurance in a field that has so far been subject to limited methodological control.

2. Theoretical Background

Against the backdrop of increasing psychological strain and a growing demand for support services, digital interventions are receiving increasing attention in both research and practice. In the context of the present study, the terms “wellbeing app” and “mental health (MH) app” are used synonymously to describe digital applications that aim to promote general psychological wellbeing preventively and can also be used to reduce specific symptom burdens.

To evaluate these digital applications, theoretically grounded psychological constructs and valid measurement instruments are required, as they form the basis for the standardized evaluation questionnaire.

2.1. Current State of Research

Interventions based on cognitive behavioral therapy (CBT) or mindfulness-based approaches have shown significant effects in systematic reviews and meta-analyses regarding the effectiveness of digital applications [7,8,9]. MH apps can reduce depressive symptoms by specifically addressing cognitive and behavioral patterns [10,11].

AI-supported chatbots, which use machine learning methods and natural language processing to enable dialogue-based interactions, play an increasingly important role. They respond to user input in real time, provide targeted guidance for interventions such as breathing techniques or mindfulness exercises, and adapt their recommendations to individual needs [12,13,14]. In addition, they can automatically detect emotional states, thereby enabling individualised user communication [6]. Empirical studies have already demonstrated positive effects of chatbots on outcomes including sleep quality and stress levels [7,12,14,15]. Such interventions may help prevent conditions such as depression that are associated with chronic sleep deprivation and persistent stress exposure [16,17,18].

At the same time, the limitations of such systems have also been emphasized. Khawaja and Bélisle-Pipon (2023), for example, caution against equating chatbots with human psychotherapists. Although these systems may appear authentic, they lack central human competencies such as genuine empathy, in-depth case understanding, and ethical judgment [19].

2.2. Stress as a Psychological Determinant

Stress is one of the central influencing factors for both physical and mental health and is widespread in Germany. Data from Techniker Krankenkasse (TK) health insurance show, for example, that in 2025 around 66% of the German population reported feeling stressed frequently or sometimes [4]. These figures underscore the urgent need to address the phenomenon of stress. To prevent negative health consequences and promote well-being, it is essential to examine stress comprehensively and to develop effective strategies for its prevention and management.

2.2.1. Transactional Model of Stress

A central theoretical model for explaining the experience of stress is the transactional model of stress proposed by Lazarus and Folkman (1984). This model does not conceptualize stress as a direct response to external stimuli but rather as the result of a dynamic interaction between the individual and the environment. At its core is the individual cognitive appraisal of a situation, which occurs in a two-stage process. In the primary appraisal, an event is evaluated as threatening, challenging, harmful, or irrelevant. This is followed by the secondary appraisal, in which the individual assesses whether sufficient resources are available to cope with the demands. These resources may include, for example, personal competencies, social support, or health-related stability. According to this model, distress is experienced when there is an imbalance between the demands of a situation and the resources perceived to be available to manage them [16,20].

For MH apps, this model is highly relevant in several respects, as it provides concrete points of departure for interventions that address both the appraisal of stressors and coping competence. To influence primary appraisal, MH apps offer a range of functions, including guided self-reflections and mood diaries. These features support users in analyzing stressful situations more consciously, identifying dysfunctional interpretations, and developing alternative, less threatening appraisals [21].

To strengthen secondary appraisal, that is, the perception of resources and coping competence, MH apps employ a variety of evidence-based approaches. These include relaxation exercises such as breathing techniques and progressive muscle relaxation, which have been shown to reduce physiological stress responses [22,23,24]. In addition, problem-solving modules are used to strengthen active coping strategies. Many apps also provide psychoeducational content that offers users sound knowledge about stress mechanisms and health-promoting behaviors, thereby enhancing self-efficacy and coping potential [25].

AI-supported chatbots can further support the appraisal process through targeted, dialogue-based questioning, help identify dysfunctional thoughts, and assist users in recognizing existing resources or developing new ones. By specifically addressing cognitive appraisals and strengthening coping strategies, MH apps enable an active modulation of the stress experience [26,27].

2.2.2. Effects of Chronic Stress

Chronic stress represents a major risk factor for both physical and mental health and is a central topic in health research. In contrast to acute stress, which may enhance performance in the short term, chronic stress impairs physical and psychological processes in the long run [28,29].

At the physical level, chronic stress has negative effects on the cardiovascular system. Studies show that it increases blood pressure over time and elevates the risk of cardiovascular diseases such as myocardial infarction and stroke [30]. In addition, chronic stress weakens the immune system, resulting in an increased susceptibility to infections and delayed wound healing [31]. Psychological effects are also substantial. Chronic stress is closely associated with a variety of mental health disorders such as depression, anxiety disorders or insomnia [32,33].

2.3. Sleep as a Psychological Determinant

In addition to the relevance of stress, the importance of sleep for health is also well established. Healthy sleep is a fundamental component of well-being and essential for the body’s regeneration. During this phase, the body recovers physically, cells are repaired, and the immune system is particularly active. Moreover, sleep contributes substantially to psychological recovery by processing emotions and consolidating memories, thereby providing an essential foundation for mental health [34].

2.3.1. Harvey’s Cognitive Model of Insomnia

For the psychological conceptualization of sleep as a determinant for wellbeing, particularly in the context of sleep disturbances, Harvey’s Cognitive Model of Insomnia (2002) is of central importance. This model explains that insomnia is maintained less by primary physical causes than by specific patterns of thinking and behavior. One of the main mechanisms is nighttime worry and rumination, whether about sleep itself or other topics, which leads to increased physiological and mental arousal and impairs both sleep onset and sleep maintenance [35]. In addition, maladaptive sleep habits, such as spending excessive time in bed or maintaining irregular sleep–wake cycles, disrupt the natural sleep rhythm. Dysfunctional beliefs about sleep, for example exaggerated assumptions about the consequences of sleep deprivation, further intensify sleep-related anxiety and reinforce insomnia [18,35].

The Cognitive Model of Insomnia provides a practical foundation for digital sleep interventions, such as modules for cognitive restructuring aimed at addressing worry and dysfunctional beliefs. Mindfulness and breathing exercises can further help reduce physiological and cognitive arousal before bedtime. In addition, apps provide information on sleep hygiene and guide users in behavioral techniques such as stimulus control and sleep restriction.

2.3.2. Effects of Persistent Sleep Disturbances

Persistent sleep disturbances, particularly insomnia, have substantial effects on both mental and physical health. A growing body of scientific research demonstrates that chronic sleep problems not only impair overall well-being but also entail long-term health risks. Mental disorders are among the most frequently reported consequences of sleep disturbances. Studies have shown a significant association between insomnia and an increased risk of developing depression and anxiety disorders [17,18]. In addition, several studies report a strong relationship between poor sleep quality and suicidal ideation as well as suicidal behavior [36,37].

Beyond psychological effects, poor sleep quality also leads to impairments in cognitive functions such as attention, concentration, and memory performance [38]. Furthermore, far-reaching consequences are evident at the physical level. Insomnia is closely associated with the development of cardiovascular diseases such as hypertension and myocardial infarction [39,40,41]. Negative effects on the immune system, metabolic disorders, and neurological diseases have also been demonstrated [42,43].

2.4. Usability as a Success Factor of MH Apps

Usability plays a decisive role in the success of AI-supported mental health apps. Particularly in the context of chatbot integration, usability strongly influences whether an app is accepted by users, used regularly, and perceived as supportive [44,45]. High chatbot usability is reflected in intuitive, understandable, and reliable interactions that are clearly aligned with users’ needs [46]. Essential factors include simple language, an empathetic tone, technical stability, and a clear communication focus [47,48]. It is also crucial that users feel safe when sharing personal or sensitive information, which requires trust in both the chatbot and data protection mechanisms [48].

High usability has been shown to positively influence both usage and effectiveness. It increases acceptance, promotes regular use, and supports long-term adherence [44,49].

A positive and seamless user experience increases the likelihood that provided interventions are implemented, thereby directly enhancing psychological outcomes [7]. Conversely, unclear communication, technical issues, or a lack of relevance often lead to frustration and discontinuation [46].

2.5. Research Questions and Hypotheses

Building on the theoretical foundations of stress, sleep, and the role of usability in MH apps, the preceding sections have examined key psychological determinants in greater detail. Based on this theoretical framework, the following research questions are derived:

1. Is an increased use of Malu, an AI-based application designed to promote psychological well-being, significantly associated with a decrease in perceived stress and sleep problems?

2. Is an increased use of Malu significantly associated with an increase in the perceived usability of Malu?

To systematically address these questions, two central evaluation dimensions are considered:

(1) the psychological effectiveness of the application in terms of reducing stress and sleep problems, measured using the Perceived Stress Scale (PSS) and the Insomnia Severity Index (ISI), and

(2) the perceived usability of the integrated chatbot, assessed using the Chatbot Usability Scale (BUS).

Based on these evaluation dimensions, the following specific hypotheses are formulated:

1. Evaluation Dimension: Psychological Effectiveness (Perceived Stress and Sleep Problems)

H0a: With increasing duration of use, no significant reduction in perceived stress is observed.

H1a: With increasing duration of use, a significant reduction in perceived stress is observed.

H0b: With increasing duration of use, no significant reduction in perceived sleep problems is observed.

H1b: With increasing duration of use, a significant reduction in perceived sleep problems is observed.

2. Evaluation Dimension: Chatbot Usability

H0c: With increasing duration of use, no significant increase in perceived chatbot usability is observed.

H1c: With increasing duration of use, a significant increase in perceived chatbot usability is observed.

3. Methodology

The aim of this research is to examine the usability of an AI-supported application designed to promote psychological well-being, as well as its effectiveness in reducing perceived stress and sleep problems. The following section describes the methodological approach used for the development and initial testing of the questionnaire, which was designed to evaluate the Malu app. The questionnaire was compiled based on established, scientifically validated scales and adapted in content to the specific context of the app.

3.1. Development of the Evaluation Instrument

To address the research questions and test the formulated hypotheses, three central target constructs were defined. The effectiveness of the application in reducing stress and sleep problems was operationalized using the constructs of perceived stress and perceived sleep problems. Usability was represented by the distinct construct of perceived chatbot usability and its respective subscales.

3.1.1. Measurement of Perceived Stress

To assess subjectively perceived stress, the Perceived Stress Scale (PSS) was used. The PSS is an established instrument that captures the individual experience of stress over recent weeks. The German version of the PSS-10 consists of 10 items referring to the frequency of stress-related experiences during the past month. Responses are given on a 5-point Likert scale ranging from 0 (never) to 4 (very often) [50]. The total score (PSS-TOTAL) is calculated by reverse-coding the four positively formulated items (Items 4, 5, 7, and 8) and subsequently summing all items [50]. Higher scores indicate a higher level of perceived stress.
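The scoring rule described above can be illustrated with a minimal Python sketch. The function name `pss10_total` and the example responses are hypothetical and serve only to show the reverse-coding of Items 4, 5, 7, and 8 on the 0 to 4 response scale; this is not the authors' analysis code.

```python
# Minimal sketch of PSS-10 scoring (illustrative only, not the study's code).
REVERSED_ITEMS = {4, 5, 7, 8}  # positively formulated items, 1-based numbering


def pss10_total(responses):
    """Return the PSS-10 total score for ten responses coded 0 (never) to 4 (very often)."""
    if len(responses) != 10:
        raise ValueError("PSS-10 requires exactly 10 responses")
    total = 0
    for item_number, value in enumerate(responses, start=1):
        if value not in range(5):
            raise ValueError(f"item {item_number}: response must be an integer from 0 to 4")
        # Reverse-coded items contribute 4 - value; all others contribute value.
        total += (4 - value) if item_number in REVERSED_ITEMS else value
    return total


# Hypothetical participant, for illustration only.
example = [2, 1, 3, 3, 2, 1, 4, 3, 2, 1]
print(pss10_total(example))  # 14
```

Because six items are scored directly and four are reversed, the total ranges from 0 to 40, and a participant answering "never" to every item receives 16 rather than 0.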

For the purpose of this study, the items were adapted to a weekly assessment period. Specifically, the original wording “In the last month, how often have you…” was modified to “In the last week, how often have you…” (see Appendix A: Item overview).

The psychometric properties of the PSS-10 are considered very good. In the validation study by Klein et al. (2016), the scale demonstrated an internal consistency of α = .84 as well as acceptable to good item-total correlations (r_it = .43–.69) [50].

The use of the PSS-10 was also justified by its suitability for non-clinical populations, as the developed questionnaire was specifically designed for healthy participants. In a comparable non-clinical sample, the scale demonstrated an internal consistency of α = .88, indicating reliable measurement of the construct [51].

3.1.2. Measurement of Perceived Sleep Problems

To evaluate perceived sleep problems, symptoms were assessed using the German version of the Insomnia Severity Index (ISI) [52]. The ISI is a brief, validated questionnaire for the reliable assessment of sleep problems. Its internal consistency is rated as good, with Cronbach’s α of 0.83, and its test–retest reliability (r = 0.78) confirms its stability over time.

In constructing the questionnaire, only the first three items of the ISI were included, as these specifically assess sleep symptoms (problems falling asleep, problems maintaining sleep, early awakening). Items 4 to 7, in contrast, presuppose the presence of sleep problems and refer to their consequences and subjective evaluation. The item reliability (α = .80–.81) and item-total correlations (r_it = .60–.67) confirm the quality of the selected items [52].

3.1.3. Measurement of Perceived Chatbot Usability

Perceived chatbot usability describes the subjective evaluation of user-friendliness, comprehensibility, information quality, security, and efficiency in interacting with a chatbot [53]. This evaluation goes beyond classical usability aspects and takes into account specific characteristics of dialogue-based systems [53]. To assess this construct, the German version of the Chatbot Usability Scale, BUS-11, was used (see Appendix A: Item Overview; [53]). The BUS-11 comprises eleven items that capture five central dimensions of the user experience.

The scale was originally derived from a 15-item version and psychometrically optimized. The internal consistency of the total scale is rated as very good, with Cronbach’s α = .89. The subscale values were also all above α = .75 (Accessibility: α = .77, Functionality: α = .78, Conversation: α = .84, Privacy: α = .80, Responsiveness: α = .76). Due to its economical length, conceptual breadth, and empirically confirmed psychometric quality, the BUS-11 is particularly well suited for the evaluation of AI-based dialogue systems in research and practice, for example in the field of psychological health apps or virtual counseling.

3.1.4. Exclusion Criterion

An elevated PHQ-9 score constituted an exclusion criterion. Individuals with moderately severe depression were excluded (PHQ-9 total score ≥ 15, with each item scored 0–3; cf. [54]). In addition, a score of ≥1 on the suicidality item led to exclusion to avoid putting individuals with suicidal ideation at risk.
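The exclusion rule can be expressed as a short Python sketch. It assumes the standard PHQ-9 layout of nine items scored 0 to 3, with the suicidality item in ninth position; the function name `phq9_excluded` is hypothetical, and the study itself implemented this logic within the survey platform rather than in code like this.

```python
# Minimal sketch of the PHQ-9 exclusion rule (illustrative only).
def phq9_excluded(responses, cutoff=15):
    """Return True if a respondent meets either exclusion condition.

    responses: nine PHQ-9 item scores, each 0 to 3 (assumption: item 9,
    the last entry, is the suicidality item, as in the standard PHQ-9).
    """
    if len(responses) != 9 or any(r not in range(4) for r in responses):
        raise ValueError("PHQ-9 requires nine responses, each an integer from 0 to 3")
    total_score = sum(responses)
    suicidality_score = responses[8]
    # Excluded if the total reaches the cutoff OR the suicidality item is endorsed.
    return total_score >= cutoff or suicidality_score >= 1


print(phq9_excluded([1, 1, 1, 1, 1, 1, 1, 1, 0]))  # False: total 8, item 9 not endorsed
print(phq9_excluded([0, 0, 0, 0, 0, 0, 0, 0, 1]))  # True: item 9 endorsed
```

The two conditions are combined with a logical OR, so a low total score does not override an endorsed suicidality item.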

3.1.5. Structure of the Questionnaire and Further Items

To systematically assess perceived stress, perceived sleep problems, and perceived chatbot usability over a period of several weeks, three thematically coordinated online questionnaires were constructed. The technical implementation was carried out using SoSci Survey, a well-established software for scientific online surveys in German-speaking countries. The platform enables complex designs, randomization, and precise item control [55].

At measurement time 1 (T1), before app use, the instrument included questions on demographic characteristics, the screening items of the Patient Health Questionnaire-9 (PHQ-9), the Perceived Stress Scale (PSS), and the Insomnia Severity Index (ISI). The questionnaires administered during (T2) and after app use (T3) included the PSS and ISI as well as the Chatbot Usability Scale (BUS-11) (see Section 3.1, Development of the Evaluation Instrument).

In addition, the third questionnaire assessed frequency of app use as a control variable. For this purpose, a closed-response item on “app use per week” was included, with the categories “1–2 times”, “3–4 times”, “5–8 times”, and “daily or more often”. Optional free-text fields were also provided for individual feedback. These qualitative responses served the practice-oriented further development of the app and were not included in the statistical analyses.

The assessment of demographic characteristics served to describe the sample and to support the interpretation of the results [56]. Age, gender, and current occupation were recorded, as these variables may potentially influence app use and app evaluation and help define the target group of AI-based lifestyle apps (see Section 3.2.2, Sample and Recruitment). The items were implemented using closed-response formats with dropdown menus [55,56].

3.2. Pilot Study for Testing the Instrument

To apply the developed evaluation instrument, a practice-oriented pilot study was conducted to generate initial insights into feasibility and data quality. The object of investigation was the Malu app, an AI-supported application with an integrated chatbot designed to support users in promoting their psychological well-being in everyday life.

3.2.1. Study Design and Procedure

The present study was designed as a hypothesis-testing pilot study using a quantitative, quasi-experimental longitudinal design with repeated measures [56].

After registering for the study (see Section 3.2.2, Sample and Recruitment), participants first received access to the initial online questionnaire. If they met the inclusion criteria (see Section 3.3, Quality Criteria, Ethical Considerations and Data Protection Framework), the app developer provided them with access to the application and corresponding instructions for use. The follow-up surveys were sent to participants by e-mail after each additional week of app use.

3.2.2. Sample and Recruitment

Eligible participants were technology-affine individuals aged 18 to 45 years who use AI-based apps [57,58]. A total of 11 participants (N3 = 11) aged 18 to 34 years took part in the pilot study. Of these, 4 were female and 7 were male. Participants’ current occupation was also recorded, with 6 being students and 5 being employed. After 2 weeks of app use, the sample was composed as follows:

Figure 1. Sample N3 at measurement point 3.

A major limitation of the study resulted from the technical availability of the Malu app, which at the time of data collection was only available for the iOS operating system (see Section 5.3, Limitations). This restricted the pool of potential participants and led to a substantial reduction in sample size between recruitment (N0) and the first measurement time point (N1). In addition, the PHQ-9 was used as an exclusion criterion to ensure that only psychologically healthy participants took part in the study (see Section 3.3, Quality Criteria, Ethical Considerations and Data Protection Framework). Individuals with moderately severe depression (PHQ-9 total score ≥ 15, with each item scored 0–3; cf. [59]) were excluded. A score of ≥1 on the suicidality item likewise led to exclusion to avoid putting individuals with suicidal ideation at risk.

One person was excluded from further participation due to an elevated PHQ-9 score. Based on this, the sample developed as follows:

Figure 2. Sampling, taking into account registrations, exclusion criteria and participation levels.

Recruitment was conducted using an online posting designed by the research group, which presented the aim of the study, the benefits of participation, and the study procedure, and invited individuals to take part [56]. To ensure voluntariness, participants were able to register independently via a link. The posting also communicated the participation requirements as well as information on the voluntary nature of participation and the data protection-compliant conduct of the study. The posting was published on LinkedIn and Instagram.

3.3. Quality Criteria, Ethical Considerations and Data Protection Framework

The selection of established and validated scales served to ensure central quality criteria such as objectivity, reliability, and validity (see Section 3.1, Development of the Evaluation Instrument). SoSci Survey complies with data protection requirements in accordance with the General Data Protection Regulation (GDPR) and stores data exclusively on ISO/IEC 27001-certified servers located in Germany [60].

To enable the repeated assignment of individual data points to an anonymized person across multiple survey waves, a pseudonymized identification code was used. This code was based on the principle of the so-called “mother code”, in which several non-directly identifying components are combined. This procedure is methodologically recognized and considered an appropriate means of anonymous assignment in longitudinal studies. Under ideal conditions, the matching accuracy of such procedures exceeds 90% [61,62].
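How such a self-generated code can be assembled is sketched below in Python. The source does not list the components used in the study, so the three components here (first two letters of the mother's first name, day of birth, first two letters of the birthplace) are purely hypothetical, as is the function name `build_pseudonym`.

```python
# Minimal sketch of a self-generated identification code (illustrative only).
# The components are hypothetical; the study's actual components are not
# specified in the text.
def build_pseudonym(mother_first_name, birth_day, birthplace):
    """Combine non-directly identifying components into one stable code."""

    def first_two_letters(text):
        # Normalise case and ignore spaces or punctuation so that small
        # typing differences between survey waves do not change the code.
        letters = [ch for ch in text.upper() if ch.isalpha()]
        if len(letters) < 2:
            raise ValueError("each text component needs at least two letters")
        return "".join(letters[:2])

    if not 1 <= birth_day <= 31:
        raise ValueError("birth_day must be between 1 and 31")
    return f"{first_two_letters(mother_first_name)}{birth_day:02d}{first_two_letters(birthplace)}"


print(build_pseudonym("anna ", 7, "Berlin"))  # AN07BE
```

Normalising the input before concatenation is one way to reduce mismatches between waves, which is the main threat to the matching accuracy mentioned above.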

To comply with ethical guidelines, the PHQ-9 (Patient Health Questionnaire-9; [59]) was used as an exclusion criterion to exclude individuals with potentially treatment-relevant mental health conditions from participation. This was necessary because the instrument was designed to evaluate applications in the field of lifestyle and wellness. Automatic scoring and redirection to a separate exit page were implemented using the integrated logic function of SoSci Survey. On this page, excluded individuals received a brief explanation as well as references to support services such as telephone counselling and a depression support organisation [63,64].

Participants who met the inclusion criteria were asked to provide their e-mail address for app activation. The separation of this information from the actual dataset ensured anonymity and the data protection-compliant conduct of the study [56,65].

3.4. Planned Data Analysis

Statistical analyses were conducted using R software (version 2022.12.0). To examine the internal consistency of the scales used, Cronbach’s alpha was calculated for each measurement and interpreted according to the guidelines proposed by George and Mallery (2003): excellent (α > .9), good (α > .8), acceptable (α > .7), questionable (α > .6), poor (α > .5), and unacceptable (α < .5). In addition, descriptive statistics were calculated for all scales at each measurement time point, including median, interquartile range, minimum, and maximum [66].
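For illustration, Cronbach's alpha and the interpretation bands listed above can be computed with a few lines of Python. This is a sketch using the standard variance-based formula, not the R code used in the study; the function names and the small data set are hypothetical. The source leaves α = .5 exactly unassigned, and the sketch treats it as "unacceptable" by assumption.

```python
# Minimal sketch of Cronbach's alpha with George and Mallery's labels
# (illustrative only; the study used R).
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of k item scores per participant (complete data only)."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two participants")
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def interpret_alpha(alpha):
    """Map alpha to the labels listed in the text (alpha <= .5: assumption)."""
    bands = [(0.9, "excellent"), (0.8, "good"), (0.7, "acceptable"),
             (0.6, "questionable"), (0.5, "poor")]
    for threshold, label in bands:
        if alpha > threshold:
            return label
    return "unacceptable"


# Hypothetical responses of three participants to two items.
data = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(data)
print(round(alpha, 2), interpret_alpha(alpha))  # 0.67 questionable
```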

Due to the small sample size (N = 11), only non-parametric procedures were used [67]. Friedman tests were conducted to analyze differences across the three measurement times. This was followed by pairwise comparisons between individual time points using one-sided Wilcoxon signed-rank tests. The test direction was based on the hypotheses defined in advance. To explore group-specific differences with regard to sociodemographic characteristics and frequency of app use per week, two-sided Mann–Whitney U tests were applied. For both the Wilcoxon signed-rank tests and the Mann–Whitney U tests, effect size r was calculated and interpreted according to the thresholds proposed by Cohen (1988): small effect (r ≥ .1), medium effect (r ≥ .3), and large effect (r ≥ .5) [68].
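The effect size classification can be sketched in Python as follows. The text does not state how r was derived from the test statistic; the sketch assumes the common convention r = |Z| / √n and leaves the choice of n (all pairs, or only pairs with non-zero differences) to the caller. The label "negligible" for r < .1 is also an assumption, since the thresholds above do not name that range.

```python
# Minimal sketch of effect size r and Cohen's thresholds (illustrative only).
from math import sqrt


def effect_size_r(z, n):
    """Assumed convention: r = |Z| / sqrt(n); which n to use is not given in the text."""
    if n <= 0:
        raise ValueError("n must be positive")
    return abs(z) / sqrt(n)


def interpret_r(r):
    """Classify r using the thresholds r >= .1, r >= .3, and r >= .5."""
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "medium"
    if r >= 0.1:
        return "small"
    return "negligible"  # label assumed, not taken from the text


# Hypothetical values, for illustration only.
r = effect_size_r(-2.0, 16)
print(r, interpret_r(r))  # 0.5 large
```

With samples this small, the choice of n changes r noticeably, so reporting which n was used makes the effect sizes easier to reproduce.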

To explore statistical associations between variables, Spearman rank correlations were calculated. The correlation coefficient rs was interpreted according to Cohen (1988) as follows: small effect (rs ≥ .1), medium effect (rs ≥ .3), and large effect (rs ≥ .5). The significance level was set at p < .05. Since this was a pilot study with a small sample size, no Bonferroni correction was applied. However, to control for error accumulation in multiple correlations, a false discovery rate correction according to Benjamini and Hochberg (1995) was used. Participants who did not complete all three questionnaires were excluded by listwise deletion to ensure comparability across the measurements [56,69].
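The Benjamini-Hochberg step-up procedure mentioned above can be sketched in plain Python. The function name and the p-values are hypothetical; the study's analyses were run in R, where the same adjustment is available through `p.adjust(..., method = "BH")`.

```python
# Minimal sketch of the Benjamini-Hochberg adjustment (illustrative only).
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from four correlations.
raw = [0.01, 0.04, 0.03, 0.20]
print([round(p, 3) for p in benjamini_hochberg(raw)])  # [0.04, 0.053, 0.053, 0.2]
```

An adjusted value below .05 then indicates significance at a false discovery rate of 5%; in this hypothetical example only the first correlation would remain significant.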

4. Results

The following section presents the central results of the statistical analyses. The focus lies on the examination of changes in perceived stress, perceived sleep problems, and perceived chatbot usability across the measurement time points. In addition to temporal developments, the internal consistencies of the scales used are reported. Furthermore, group-specific differences and statistical associations between the assessed variables are analyzed.

4.1. Reliability Analysis

The reliability coefficients of the scales used are presented in Table 1. The Perceived Stress Scale showed good internal consistency across all three measurement times, with values ranging from α = .84 to α = .88. The scale assessing perceived sleep problems demonstrated acceptable to good internal consistency, with values between α = .76 and α = .82. The overall Chatbot Usability Scale showed good internal consistency at the second and third measurement times, with values ranging from α = .87 to α = .89. For the Accessibility subscale, values ranged from acceptable to excellent, from α = .74 to α = .91. The Functionality subscale showed acceptable to good internal consistency, with values ranging from α = .75 to α = .82. The Conversation subscale demonstrated acceptable internal consistency at both time points, with α = .75. The Privacy and Responsiveness subscales each consisted of a single item; therefore, no reliability analysis could be conducted for these dimensions.

4.2. Target Dimension Outcomes

4.2.1. Perceived Stress

The descriptive statistics for perceived stress are presented in Table 2. At the beginning of the study (T1), the median perceived stress score was 16 (IQR = 10). This value remained unchanged at T2, with a median of 16 (IQR = 9). At T3, the median was 15 (IQR = 9). A Friedman test (see Table 3) revealed a significant difference in perceived stress across the three measurement times, χ²(2) = 8.23, p = .02, with a medium effect size (Kendall’s W = .37).
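
The relationship between the Friedman statistic and Kendall's W can be checked with a short calculation. The sketch below assumes the standard conversion W = χ² / (N(k − 1)), with N = 11 participants and k = 3 measurement times; the function name `kendalls_w` is introduced here for illustration.

```python
def kendalls_w(chi_square, n_participants, k_conditions):
    """Kendall's W derived from a Friedman chi-square statistic."""
    return chi_square / (n_participants * (k_conditions - 1))


# Values reported for perceived stress: chi-square = 8.23, N = 11, k = 3.
w_stress = kendalls_w(8.23, 11, 3)
```

Rounded to two decimals, this reproduces the reported value of .37.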

Subsequent pairwise comparisons using Wilcoxon signed-rank tests (see Table 3) showed a significant reduction in perceived stress from the first to the second measurement (Z = −1.86, p = .03), accompanied by a large effect size (r = .70). A significant decrease was also found from the first to the third measurement (Z = −2.31, p = .01), again with a large effect size (r = .77). By contrast, the comparison between the second and third measurement did not reveal a statistically significant reduction (Z = −1.60, p = .05), although a large effect size was observed here as well (r = .53). Overall, perceived stress differed significantly across the three measurement times. Significant reductions were found between the first and second measurement as well as between the first and third.
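
Effect sizes of this kind are commonly derived from the test statistic as r = |Z| / √n. Which n entered the calculation is not stated in this section, so the sketch below treats it as an input; the function name `wilcoxon_effect_size` and the count of seven non-tied pairs are assumptions made purely for illustration.

```python
import math


def wilcoxon_effect_size(z, n):
    """Effect size r = |Z| / sqrt(n) for a Wilcoxon signed-rank test."""
    if n <= 0:
        raise ValueError("n must be positive")
    return abs(z) / math.sqrt(n)


z_t1_t2 = -1.86  # Z reported for the comparison of the first and second measurement
r_all_pairs = wilcoxon_effect_size(z_t1_t2, 11)  # n = all 11 participants
r_non_tied = wilcoxon_effect_size(z_t1_t2, 7)    # n = 7 is a hypothetical count of non-tied pairs
```

Under this convention the choice of n matters: the same Z yields roughly .56 with n = 11 and roughly .70 with the hypothetical n = 7.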

4.2.2. Perceived Sleep Problems

The descriptive statistics for perceived sleep problems are shown in Table 2. The median perceived sleep problems score remained constant at 4 (IQR = 3) across all three measurement time points. A Friedman test (see Table 4) showed no significant difference in perceived sleep problems across the three measurement time points, χ²(2) = 3.13, p = .21, Kendall’s W = .14.

Despite the absence of an overall significant effect, pairwise comparisons between individual measurement time points were subsequently conducted using Wilcoxon signed-rank tests (see Table 3). These analyses revealed a significant reduction in perceived sleep problems from the second to the third measurement (Z = −1.86, p = .03), accompanied by a large effect size (r = .70). In contrast, the comparison between the first and second measurement showed no significant decrease (Z = 0.31, p = .64, r = .09). Likewise, the comparison between the first and third measurement showed no significant reduction (Z = −1.43, p = .07), although a medium effect size was observed (r = .45). In summary, a significant reduction in perceived sleep problems was found between the second and third measurement.

4.3. Usability

4.3.1. Perceived Chatbot Usability

The descriptive statistics for perceived chatbot usability are presented in Table 2. The inferential statistical results are shown in Table 3. The median overall rating of perceived chatbot usability was 45 (IQR = 7) at measurement time 2 and 47 (IQR = 9.5) at measurement time 3. A Wilcoxon signed-rank test revealed a significant increase in the overall ratings from the second to the third measurement (Z = 2.37, p = .01), accompanied by a large effect size (r = .79).

In the Accessibility dimension, a median of 8 (IQR = 1.5) was observed at T2 and a median of 9 (IQR = 2) at T3. This dimension also showed a significant improvement in ratings from the second to the third measurement (Z = 2.03, p = .02), again with a large effect size (r = .72). In the Functionality dimension, the median was 11 (IQR = 2) at T2 and 12 (IQR = 3) at T3. This dimension likewise showed a significant increase from the second to the third measurement (Z = 1.78, p = .04), with the calculated effect size also being classified as large (r = .67).

In the Conversation dimension, a median of 16 (IQR = 3) was observed at T2 and a median of 18 (IQR = 2.5) at T3. The comparison between T2 and T3 did not reveal a statistically significant increase (Z = 1.33, p = .10), although a medium effect size was observed (r = .47).

In the Privacy dimension, the median at T2 was 4 (IQR = 0) and at T3 it was 5 (IQR = 0.5). In this dimension, a significant improvement in ratings from T2 to T3 was found (Z = 2.20, p = .01), accompanied by a large effect size (r = .90).

In the Responsiveness dimension, the median was 4 (IQR = 0.5) at T2 and 3 (IQR = 0.5) at T3. The comparison between T2 and T3 showed no significant increase in ratings of chatbot responsiveness (Z = −2.20, p = .99, r = .90).

Overall, a significant increase in the total score of perceived chatbot usability was found. This effect was also evident in the dimensions of Accessibility, Functionality, and Privacy.

4.4. Group Comparisons

Group-specific differences in the change in perceived stress between the first and third measurement, calculated as the difference T3 − T1, were found depending on the frequency of app use per week (see Table 5). A significant difference was observed between participants who used the app once or twice per week and those who used it three to four times per week (Z = 2.37, p = .01), with a large effect size (r = .72). In the low frequency usage group, the median change in perceived stress was 0 points (IQR = 1), whereas in the high frequency usage group, a median decrease of 2 points was observed (Mdn = −2, IQR = 0.75).

Group-specific differences depending on the frequency of app use per week were also found regarding the change in perceived sleep problems between the first and third measurement (T3 − T1) (see Table 5). A significant difference was observed between participants using the app once or twice and those using it three to four times weekly (Z = 2.37, p = .01), accompanied by a large effect size (r = .72). In the low frequency usage group, a median increase of 1 point in perceived sleep problems was observed (Mdn = 1, IQR = 1), whereas in the high frequency usage group, a median reduction of 1 point was observed (Mdn = −1, IQR = 0.75).

Regarding the other examined characteristics, namely gender, age, and occupational status, no significant differences were found between the respective groups regarding changes in perceived stress or sleep problems (see Table 5).

With respect to perceived chatbot usability at T3, group-specific differences were found depending on the frequency of app use per week (see Table 6). For the overall rating of perceived chatbot usability, a highly significant difference was observed between participants using the app once or twice per week and those using it three to four times per week (Z = −2.74, p = .005), with a large effect size (r = .83). The median overall rating was 40 (IQR = 0) in the low frequency usage group, whereas a median of 49.5 (IQR = 3.25) was observed in the high frequency usage group.

A highly significant difference between the two groups was also found in the Functionality dimension (Z = −2.74, p = .005, r = .83). Participants using the app once or twice per week rated Functionality with a median of 10 points (IQR = 1), whereas the high frequency usage group reached a median rating of 13 points (IQR = 0.75). Another highly significant difference between the two groups was found in the Conversation dimension (Z = −2.74, p = .005, r = .83). Here, the median was 15 (IQR = 1) in the low frequency usage group and 18 (IQR = 0.75) in the group using it three to four times per week.

For the remaining dimensions of perceived chatbot usability, no significant differences were found between the two groups. Likewise, no significant differences in perceived chatbot usability were found for the sociodemographic characteristics (see Table 6).
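
A two-group comparison of change scores can be illustrated as follows, assuming the Mann–Whitney U test is the procedure behind the reported Z values. The sketch computes only the U statistics with midranks for ties; the function name `mann_whitney_u` and both groups of change scores are hypothetical and are not the study data.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples, using midranks for ties."""
    pooled = sorted(group_a + group_b)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy ranks i+1 .. j; assign their average.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[value] for value in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical T3 - T1 change scores for a low-use and a high-use group.
low_use = [0, 0, 1]
high_use = [-2, -3, -1, -2]
u_low, u_high = mann_whitney_u(low_use, high_use)
```

A U value of zero for one group means that every score in that group lies below every score in the other group, which is the strongest possible separation.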

4.5. Correlation Analysis

As part of the correlation analysis, the association between the change in perceived stress and the change in perceived sleep problems was examined, with each change calculated as the difference T3 − T1. In addition, correlations of both change scores with perceived chatbot usability at T3 were analyzed. Several statistically significant associations were found despite the small sample (see Table 7).

A strong positive correlation was observed between the change in perceived stress and the change in perceived sleep problems (rs = .69, p = .02). In addition, the change in perceived stress was strongly negatively correlated with the total score of perceived chatbot usability (rs = −.72, p = .01) and with the Functionality dimension (rs = −.79, p = .02). The change in perceived sleep problems was also strongly negatively correlated with the total score of perceived chatbot usability (rs = −.88, p < .001). Furthermore, strong negative correlations were found between the change in perceived sleep problems and the dimensions of Accessibility (rs = −.80, p = .005), Functionality (rs = −.81, p = .005), and Conversation (rs = −.87, p = .005).

The total score of perceived chatbot usability was strongly positively correlated with the dimensions of Accessibility (rs = .81, p = .009), Functionality (rs = .90, p = .002), and Conversation (rs = .96, p < .001). In addition, within the dimensions of perceived chatbot usability, a strong positive correlation was found between Functionality and Conversation (rs = .86, p = .003).
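
For readers who want to retrace how a Spearman coefficient is formed, the following sketch ranks both variables (with midranks for ties) and then computes the Pearson correlation of the ranks. The function names `midranks` and `spearman_rs` and the example values are hypothetical; they are not the study data.

```python
from statistics import mean


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for pos in range(i, j):
            ranks[order[pos]] = (i + 1 + j) / 2
        i = j
    return ranks


def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the midranks."""
    rx, ry = midranks(x), midranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical change scores and usability ratings for five participants.
stress_change = [1, 2, 3, 4, 5]
usability = [5, 6, 7, 8, 7]
rs = spearman_rs(stress_change, usability)
```

Because only ranks enter the calculation, the coefficient is insensitive to the spacing of the raw scores, which suits ordinal questionnaire data and small samples.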

5. Discussion

5.1. Key Findings and Interpretation

The following discussion proceeds as it would if the sample were not merely a test sample for a prototype but a large user sample for an established wellbeing app, so that the conclusions a more valid evaluation might support can be anticipated at this stage.

5.1.1. Perceived Stress

The analysis of changes in perceived stress over the course of the study shows a clear development. The comparison between T1 and T3 revealed a significant reduction in perceived stress. The effect size was large, indicating a meaningful change beyond mere random fluctuation. Methodologically, this is also confirmed by the Friedman test, which demonstrated a significant change over time. The effect size measure (Kendall’s W = .37) indicates a medium change across the overall course, which can be regarded as relevant, given the relatively short duration of the intervention.

In addition, the comparison between the second and third measurement showed no significant difference (Z = −1.60, p = .05), although a large effect (r = .53) was also observed here. These findings suggest that the app’s effect unfolded gradually. It is possible that the app’s stress-reducing functions, such as reflection, reframing, or everyday support, require a certain amount of time to become effective and to intervene in existing behavioral and appraisal patterns [7]. Overall, these findings suggest that sustained use of the app is crucial for effectively changing subjectively experienced stress.

A further key finding emerges from the analysis of group-specific differences. Participants who used the app three to four times per week showed a significantly greater reduction in stress than those with a lower frequency of use. The observed difference, with an effect size of r = .72, is large and suggests a dose-dependent effect of the app. Whereas the group with regular use showed a larger reduction on the stress scale, the median in the group using the app once or twice per week remained unchanged. This group difference suggests that repeated, active engagement with the app content is a decisive factor in reducing psychological strain. Mere availability therefore does not seem to be sufficient, and only regular use appears to enable positive effects. A higher frequency of use is thus not a rigid prerequisite, but a substantial support for the effectiveness of the intervention in everyday life [70].

The correlation analysis provides additional indications of possible influencing factors for the change in perceived stress over the course of the study. A significant positive association was found with the change in sleep-related complaints, indicating that the reduction in stress tended to coincide with a simultaneous improvement in sleep quality.

In addition, a strong negative association emerged between the change in perceived stress and the rating of perceived chatbot usability at the third measurement. The more positively the app was perceived in terms of usability, the greater the reported reduction in stress on average. This effect was particularly pronounced for the subscales Functionality and Conversation, suggesting that intuitive operability and the quality of the linguistic interaction may represent key influencing factors. A clear structure of the content and an understandable, empathetic conversational design may therefore have contributed substantially to the app’s stress-reducing effect [71].

5.1.2. Perceived Sleep Problems

The analysis of perceived sleep problems initially showed no significant change across the three measurements. The Friedman test was not significant (p = .21) and indicated a small overall effect size (Kendall’s W = .14). However, the subsequent pairwise comparisons using Wilcoxon signed-rank tests revealed a significant decline in perceived sleep problems between T2 and T3. This effect thus occurred with some delay and was not visible in the overall test, but it points to a potential effectiveness of the app in reducing sleep problems over time.

When considering group-specific effects, it became evident that app use frequency had a significant influence on changes in perceived sleep problems. Whereas the high-frequency user group showed a median decrease in sleep problems, the low-frequency group even showed a slight median increase, and the difference between the two groups was significant. This again points to a possible dose–response relationship in which more frequent use of the app was associated with a stronger reduction in sleep problems.

Furthermore, the correlation analysis revealed strong associations between the change in sleep problems and perceived chatbot usability at the third measurement. The more positively the app was rated overall, especially regarding accessibility, functionality, and conversation, the more the reported sleep problems declined. These findings support the assumption that successful chatbot usability may represent a central mechanism underlying the app’s sleep-related effectiveness.

In addition, a significant association was found between the two usability dimensions Functionality and Conversation, as well as between these and perceived sleep problems. This may indicate that certain aspects of the app dialogue, such as structured sleep support, calming conversational offers, or the opportunity for emotional processing, played a key role in the app’s effect on sleep [14]. Although no significant overall reduction in sleep problems was initially found at the total sample level, the detailed analyses suggest that under certain conditions, particularly intensive use and a positive evaluation of the app, effective outcomes did emerge.

5.1.3. Perceived Chatbot Usability

The overall development of perceived chatbot usability showed a small but statistically significant increase between T2 and T3. Despite this small median change, a strong effect (r = .79) was observed. This points to a positive development of the user experience over time, which may have been driven by increased trust in and familiarity with the app [72].

Similar patterns also emerged at the level of three individual dimensions. The subscales Accessibility, Functionality, and Privacy showed significant improvements with large effect sizes, indicating a broad improvement in user-friendliness, whereas the increase in the Conversation dimension was not significant. Particularly noteworthy are the large effects for Accessibility (r = .72) and Privacy (r = .90), even though the median did not increase sharply in every case.

Group comparisons by app use frequency showed clear differences in favor of those participants who used the app three to four times per week compared with those who used it one to two times per week. Heavy users rated the app significantly better regarding overall usability as well as the dimensions Functionality and Conversation, each with a large effect size (r = .83). This suggests that more frequent app use is associated with a better subjective usage experience, possibly due to increased routine, trust, and learned operating competence [73].

In addition, the evaluation examined possible differences in perceived chatbot usability according to sociodemographic characteristics. The Mann–Whitney U tests showed no significant differences for gender, age, or occupational status, either for the total score or for the subdimensions; the median values of the subgroups were closely aligned in each case. This homogeneous evaluation across all groups suggests broad accessibility and intuitive usability of the app, both of which are central prerequisites for successful implementation across different population groups.

5.2. Integration into the Existing Research

The observed significant reduction in perceived stress can be interpreted based on the transactional model of stress already described in the theoretical section [20]. According to this model, stress arises from individual appraisals in which a situation is perceived as threatening and at the same time as difficult to cope with. The app may intervene at this point by contributing to reappraisal through reflective conversational offers, structured questions, and everyday cognitive prompts. Such impulses may help users reinterpret stressful situations and experience them subjectively as less stressful.

Furthermore, the findings can be situated within the context of current studies on digital interventions. Denecke et al. show that digital modules based on cognitive behavioral therapy are particularly effective when they combine cognitive restructuring processes with continuous feedback and regular self-monitoring [7]. Comparable mechanisms also seem to have played a role in the present app. The application offers structured support in dealing with stress-inducing thoughts and promotes the development of constructive behavioral strategies. The particularly strong effects among participants with higher usage frequency further support an intensity-dependent effectiveness, as also described in that study. These parallels underscore that digital applications built on cognitive change processes and reinforced through regular use may represent an effective means of stress management.

A key finding of the study is the significant association between the change in perceived stress and the change in sleep-related complaints. This correlation (rs = .69, p = .02) can be explained theoretically by the model of Riemann and Voderholzer [18]. The authors describe stress as a primary cause of difficulties initiating and maintaining sleep and emphasize the central role of cognitive and emotional burdens in the emergence and maintenance of sleep disturbances. In this sense, the stress-reducing effect of the app can also be interpreted as an indirect mechanism for improving sleep. This suggests that interventions targeting stress management may simultaneously represent an effective point of intervention for the treatment of sleep-related complaints.

A central result concerns perceived chatbot usability and its importance for the acceptance of digital interventions. Low usability can severely restrict the use of digital health applications, whereas high usability fosters trust, motivates regular use, and thereby promotes stronger effectiveness [7]. In the present study, a significant improvement in usability scores was found, especially among participants with high usage frequency. This may be interpreted as evidence of a successful, user-friendly chatbot design that not only facilitated use but may also have enhanced the app’s effects on stress and sleep. Additional evidence for the relevance of usability is provided by the association between user evaluation and intervention effect. These findings support the assumption that high perceived usability not only promotes acceptance of the application but also contributes to its effectiveness [74].

A complementary explanatory approach arises from the fact that not only functional usability, but also the nature and quality of interaction with digital systems, may influence the effectiveness of interventions. Studies report that continuous, personalized interaction with a virtual agent can lead to stronger emotional bonding, higher adherence, and better health outcomes [72]. In a controlled setting, participants who communicated with an empathetically designed digital coach over several weeks benefited significantly more from the intervention than those in a minimally interactive control condition. These findings suggest that not only technical usability, but also the subjective experience of social presence and digital relationship building may represent central mechanisms of action in digital health applications.

5.3. Limitations

A central methodological limitation of the present investigation is the small sample size. Only the complete datasets of eleven participants could be included in the statistical analyses. Although clear effects were observed within this group, the interpretability of the findings is limited by the small number of cases. The results should therefore be interpreted with caution and not overestimated. In addition, the composition of the sample is not representative of the general population. Participants were recruited predominantly from the student and working young adult population. This results in a bias toward a homogeneous target group and strongly limits the transferability of the findings to other age groups or population segments. Furthermore, the sample was self-selected. Participation was voluntary and took place in part through social networks such as LinkedIn. It therefore cannot be ruled out that individuals with a particular interest in topics such as artificial intelligence or mental health were especially likely to participate. This may also have biased the results and should be considered when interpreting the findings. In addition, the gender distribution of the sample (7 men, 4 women) differs from typical user profiles of digital psychological interventions, in which women are often overrepresented and tend to show greater interest in such offerings [75,76]. This may limit the generalizability of the findings, particularly regarding the question of how strongly the intervention would be accepted and evaluated in a more female-dominated target group.

A further methodological limitation of the present study lies in the technical restrictions of the application used. At the time of the investigation, only an iOS version of the app was available, meaning that Android users were excluded from participation. This resulted in a restricted target group and hindered broader recruitment. In addition, only an early, not yet fully mature version of the app was tested. This version contained only a limited range of functions, particularly with respect to interactive modules for promoting psychological well-being, such as a breathing exercise. It can be assumed that both user behavior and the app’s potential effects may change as future versions become more functional and stable. A later implementation of additional modules could individualize and intensify the usage experience, which would likely also result in more differentiated effects on psychological variables. These technical boundary conditions therefore limit the transferability of the findings to later stages of the app’s development.

The instruments used, namely the Perceived Stress Scale [50], the Insomnia Severity Index [52], and the Chatbot Usability Scale BUS-11 [53], were originally developed for larger and more heterogeneous populations. Their application to a small sample such as the one in the present study accordingly limits the informative value of the results. In addition, supplementary contextual data were lacking, for example regarding usage duration, usage intensity, or chat histories. Due to limited staff resources and data protection requirements, a systematic collection of this information could not be realized within the scope of the study. Moreover, the limited number of comparison studies with AI-supported chatbots in the psychological field complicates the interpretation of the results. Research in this area is still at an early stage, so only a few validated instruments for specific interaction characteristics with AI systems are currently available. Future studies could therefore include additional constructs such as perceived empathy, trust in AI, or experienced emotional support to enable a more comprehensive understanding of the mechanisms of action.

The assessment was based exclusively on participants’ self-reports, meaning that no objective control of actual use was possible. Since the data collection was conducted entirely online and without supervision, socially desirable responding cannot be ruled out either. Overall, the study had to rely on trust in the participants’ responses, which limits the interpretability of the findings regarding the app’s actual use and impact.

Several limitations are also evident in the study design. First, no control group was included, which means that no causal conclusions can be drawn regarding the app’s effectiveness. Second, the study duration of two weeks was very short. Particularly in the context of psychological change processes, this period is too brief to allow robust conclusions about sustainable effects. This was further complicated by the fact that participation in this prototype test was entirely voluntary and unpaid. App use therefore depended strongly on intrinsic motivation. Although this is close to the future use scenario, it may result in limited engagement when personal interest in testing a prototype is lacking. This poses a particular challenge for data completeness, especially in longer study periods. A financial incentive structure might not only have contributed to more reliable and intensive use, but also substantially facilitated recruitment and increased the number of participants considerably.

5.4. Outlook

Despite the small sample size (N = 11), the present investigation identified clear effects that are both statistically and theoretically plausible. Particularly notable were the significant reduction in perceived stress and the increase in perceived chatbot usability over the course of the study. These findings are consistent with the app’s fundamental effectiveness in the context of digital well-being interventions and suggest a potential scope of application in supporting users with everyday psychological burdens such as stress and sleep problems.

The validated scales used (PSS, ISI, BUS-11) proved to be suitable evaluation instruments in this pilot study. They provide a robust methodological foundation for future large-scale investigations of the app’s effectiveness. In the long term, the Malu app could, provided that strict clinical and regulatory requirements are fulfilled, also develop the potential to be used as a certified digital health application (DiGA). A prerequisite for this would be that appropriate protective mechanisms ensure that highly burdened or clinically relevant target groups, such as individuals with depressive symptoms, anxiety disorders, or post-traumatic stress, are not put at risk through its use. In such cases, the app could serve in the long term as a supplementary supportive intervention, for example to bridge periods between psychotherapy sessions or during times of limited availability of therapeutic care.

For future studies, participant compensation and an extension of the observation period beyond the two weeks used here appear advisable to capture long-term developments and stabilisation processes more comprehensively. In addition, more in-depth analyses of individual usage data, for example through app log files, should be conducted to enable more precise conclusions about usage patterns and their relationship to the app’s effectiveness. In this regard, a more balanced sample should also be considered, so that both differences between user groups and individual changes over time can be examined. Furthermore, as more validated scales become available in this context, additional dimensions such as empathy could be assessed [19]. A differentiated analysis according to levels of burden, for example low versus high stress or sleep problems, also appears useful to identify potential target groups more precisely and to explore possible limits of effectiveness. Moreover, combining quantitative scales with qualitative feedback may help to better understand the subjective mechanisms underlying app use [77].

The present findings indicate that use of the Malu app may be suitable for initiating positive changes in well-being, such as reductions in subjective stress experience. They also point to the app’s general potential as a component of digital health interventions. However, before any possible final implementation as a fully developed health or well-being app, targeted development work and more extensive studies with larger samples and longer observation periods are still required. Only through these next steps can the app’s effectiveness and everyday applicability be conclusively validated.

Supplementary Materials

The following supporting information can be downloaded at the website of this paper posted on Preprints.org.

Appendix A

Appendix A.1

Table A1. List of items.

Item Name	Items	Scale
Depression (PHQ-9)	Little interest in or enjoyment of your activities.	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Depression (PHQ-9)	Depression, melancholy or hopelessness.	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Depression (PHQ-9)	Difficulty falling asleep or staying asleep, or sleeping more than usual.	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Depression (PHQ-9)	Fatigue or a feeling of having no energy.	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Depression (PHQ-9)	A loss of appetite or an excessive urge to eat.	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Depression (PHQ-9)	A low opinion of oneself; a feeling of being a failure or of having let one’s family down.	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Depression (PHQ-9)	Difficulty concentrating on something, e.g. when reading the newspaper or watching television.	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Depression (PHQ-9)	Were your movements or speech so slowed down that others would have noticed? Or, on the contrary, were you ‘fidgety’ or restless, and did this make you feel a stronger urge to move than usual?	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Depression (PHQ-9)	Thoughts that you would rather be dead or that you want to harm yourself.	0 = Not at all 1 = On some days 2 = On more than half the days 3 = Almost every day
Sleep (ISI-3)	Difficulty falling asleep	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Sleep (ISI-3)	Difficulty staying asleep	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Sleep (ISI-3)	I have trouble waking up early in the morning.	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you been upset because of something that happened unexpectedly?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you felt that you were unable to control important things in your life?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you felt nervous and stressed?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you felt confident about your ability to handle personal problems?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you felt that things were going your way?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you found that you could not cope with all the things you had to do?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you been able to control irritations in your life?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you felt that you were on top of things?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you been angered because of things that were outside of your control?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Stress (PSS-10)	In the last week, how often have you felt difficulties were piling up so high that you could not overcome them?	0 = Never 1 = Hardly ever 2 = Sometimes 3 = Quite often 4 = Very often
Chatbot Usability (BUS-11)	The chatbot function was easy to recognize.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	It was easy to find the chatbot.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	The communication with the chatbot was clear.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	The chatbot was able to follow the context.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	The chatbot's responses were easy to understand.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	The chatbot understands what I want and helps me achieve my goal.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	The chatbot provides the right amount of information.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	The chatbot provides only the information I need.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	I feel that the chatbot's responses were correct.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	I trust that the chatbot informs me about potential data protection issues.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
Chatbot Usability (BUS-11)	The waiting time for a response from the chatbot was short.	1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
App Use	How often did you use the app in the last week?	1 = Not at all 2 = 1–2 times 3 = 3–4 times 4 = 5–6 times 5 = Every day or more often
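
The PSS-10 items listed above mix negatively and positively worded statements. A minimal scoring sketch follows, assuming the standard PSS-10 key in which the four positively worded items (positions 4, 5, 7 and 8 in the order shown above) are reverse-scored before summing. Whether the study applied exactly this procedure is not stated in this table, and the function name `pss10_total` is illustrative only.

```python
# Hypothetical helper: total score for one PSS-10 questionnaire.
# Assumption: standard PSS-10 key, positively worded items are reversed.
REVERSED_ITEMS = {4, 5, 7, 8}  # 1-based positions in the item order above


def pss10_total(responses):
    """Return the PSS-10 sum score (0-40) for ten responses coded 0-4."""
    if len(responses) != 10:
        raise ValueError("PSS-10 expects exactly 10 responses")
    if any(r not in (0, 1, 2, 3, 4) for r in responses):
        raise ValueError("responses must be integers from 0 to 4")
    # Reverse-score positive items (0 <-> 4, 1 <-> 3), then sum everything.
    return sum(
        4 - r if position in REVERSED_ITEMS else r
        for position, r in enumerate(responses, start=1)
    )


# Usage with made-up answers (not study data):
print(pss10_total([2, 1, 3, 3, 2, 1, 3, 2, 1, 2]))
```

Higher totals indicate more perceived stress, which is why the positively worded items have to be flipped before summing.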

References

1. World Health Organization. Comprehensive Mental Health Action Plan 2013–2030; World Health Organization: Geneva, Switzerland, 2021; Available online: https://www.who.int/publications/i/item/9789240031029 (accessed on 5 May 2026).
2. Mwogosi, A. Leveraging Digital Technologies in Public Mental Health: A Scoping Review. J. Public Ment. Health 2025, 24, 266–280.
3. Santomauro, D.F.; Mantilla Herrera, A.M.; Shadid, J.; Zheng, P.; Ashbaugh, C.; Pigott, D.M.; Whiteford, H.A.; et al. Global Prevalence and Burden of Depressive and Anxiety Disorders in 204 Countries and Territories in 2020 Due to the COVID-19 Pandemic. Lancet 2021, 398, 1700–1712.
4. Techniker Krankenkasse. TK-Stressreport 2025; Techniker Krankenkasse: Hamburg, Germany, 2025; Available online: https://www.tk.de/presse/themen/praevention/gesundheitsstudien/stressreport-2025-2206714 (accessed on 28 April 2026).
5. Keles, B.; McCrae, N.; Grealish, A. A systematic review: The influence of social media on depression, anxiety and psychological distress in adolescents. Int. J. Adolesc. Youth 2020, 25(1), 79–93.
6. D'Alfonso, S. AI in Mental Health. Curr. Opin. Psychol. 2020, 36, 112–117.
7. Bakker, D.; Kazantzis, N.; Rickwood, D.; Rickard, N. Mental Health Smartphone Apps: Review and Evidence-Based Recommendations for Future Developments. JMIR Ment. Health 2016, 3, e7.
8. Haaf, R.; Vock, P.; Wächtershäuser, N.; Correll, C.U.; Köhler, S.; Klein, J.P. Wirksamkeit in Deutschland verfügbarer internetbasierter Interventionen für Depressionen: Ein systematisches Review mit Metaanalyse. Nervenarzt 2024, 95, 206–215.
9. Chandrashekar, P. Do mental health mobile apps work: Evidence and recommendations for designing high-efficacy mental health mobile apps. mHealth 2018, 4, 6.
10. Cuijpers, P.; Kleiboer, A.; Karyotaki, E.; Riper, H. Internet and mobile interventions for depression: Opportunities and challenges. Depress. Anxiety 2017, 34, 596–602.
11. Andersson, G.; Cuijpers, P.; Carlbring, P.; Riper, H.; Hedman, E. Guided Internet-based vs. face-to-face cognitive behavior therapy for psychiatric and somatic disorders: A systematic review and meta-analysis. World Psychiatry 2014, 13, 288–295.
12. Boucher, E.M.; Harake, N.R.; Ward, H.E.; Stoeckl, S.E.; Vargas, J.; Minkel, J.; Parks, A.C.; Zilca, R. Artificially intelligent chatbots in digital mental health interventions: A review. Expert Rev. Med. Devices 2021, 18, 37–49.
13. Olawade, D.B.; Wada, O.Z.; Odetayo, A.; David-Olawade, A.C.; Asaolu, F.; Eberhardt, J. Enhancing Mental Health with Artificial Intelligence: Current Trends and Future Prospects. J. Med. Surg. Public Health 2024, 3, 100099.
14. Abd-Alrazaq, A.A.; Rababeh, A.; Alajlani, M.; Bewick, B.M.; Househ, M. Effectiveness and Safety of Using Chatbots to Improve Mental Health: Systematic Review and Meta-Analysis. J. Med. Internet Res. 2020, 22, e16021.
15. Feng, X.; Tian, L.; Ho, G.W.K.; Yorke, J.; Hui, V. The Effectiveness of AI Chatbots in Alleviating Mental Distress and Promoting Health Behaviors Among Adolescents and Young Adults: Systematic Review and Meta-Analysis. J. Med. Internet Res. 2025, 27, e79850.
16. Chen, M.Y.; Wang, E.K.; Jeng, Y.J. Adequate sleep among adolescents is positively associated with health status and health-related behaviors. BMC Public Health 2006, 6, 59.
17. Neckelmann, D.; Mykletun, A.; Dahl, A.A. Chronic insomnia as a risk factor for developing anxiety and depression. Sleep 2007, 30, 873–880.
18. Riemann, D.; Voderholzer, U. Primary insomnia: A risk factor to develop depression? J. Affect. Disord. 2002, 76, 255–259.
19. Khawaja, Z.; Bélisle-Pipon, J.-C. Your Robot Therapist Is Not Your Therapist: Understanding the Role of AI-Powered Mental Health Chatbots. Front. Digit. Health 2023, 5, 1278186.
20. Folkman, S. Stress: Appraisal and Coping. In Encyclopedia of Behavioral Medicine; Gellman, M.D., Turner, J.R., Eds.; Springer: New York, NY, USA, 2013; pp. 1913–1915.
21. Beck, J.S. Praxis der Kognitiven Verhaltenstherapie, 3., deutsche Erstausgabe; Psychologie Verlagsunion: München, Germany, 2024.
22. Kabat-Zinn, J. Mindfulness-based interventions in context: Past, present, and future. Clin. Psychol. Sci. Pract. 2003, 10(2), 144–156.
23. Principles and Practice of Stress Management, 3rd ed.; Lehrer, P.M., Woolfolk, R.L., Sime, W.E., Eds.; Guilford Press: New York, NY, USA, 2007.
24. Keng, S.L.; Smoski, M.J.; Robins, C.J. Effects of mindfulness on psychological health: A review of empirical studies. Clin. Psychol. Rev. 2011, 31, 1041–1056.
25. Khademian, F.; Aslani, A.; Bastani, P. The Effects of Mobile Apps on Stress, Anxiety, and Depression: Overview of Systematic Reviews. Int. J. Technol. Assess. Health Care 2021, 37, e4.
26. Kaluza, G. Gelassen und sicher im Stress: Das Stresskompetenz-Buch: Stress erkennen, verstehen, bewältigen, 8th ed.; Springer: Berlin/Heidelberg, Germany, 2023.
27. Gesundheitswissenschaften, 2nd ed.; Haring, R., Ed.; Springer: Berlin/Heidelberg, Germany, 2022.
28. McEwen, B.S. Protection and damage from acute and chronic stress: Allostasis and allostatic overload and relevance to the pathophysiology of psychiatric disorders. Ann. N. Y. Acad. Sci. 2004, 1032, 1–7.
29. Slavich, G.M. Life Stress and Health: A Review of Conceptual Issues and Recent Findings. Teach. Psychol. 2016, 43, 346–355.
30. Vaccarino, V.; Bremner, J.D. Stress and Cardiovascular Disease: An Update. Nat. Rev. Cardiol. 2024, 21, 603–616.
31. Yaribeygi, H.; Panahi, Y.; Sahraei, H.; Johnston, T.P.; Sahebkar, A. The Impact of Stress on Body Function: A Review. EXCLI J. 2017, 16, 1057–1072.
32. Pachi, A.; Sikaras, C.; Melas, D.; Alikanioti, S.; Soultanis, N.; Ivanidou, M.; Ilias, I.; Tselebis, A. Stress, Anxiety and Depressive Symptoms, Burnout and Insomnia Among Greek Nurses One Year After the End of the Pandemic: A Moderated Chain Mediation Model. J. Clin. Med. 2025, 14, 1145.
33. Palagini, L.; Miniati, M.; Caruso, V.; Alfi, G.; Geoffroy, P.A.; Domschke, K.; Riemann, D.; Gemignani, A.; Pini, S. Insomnia, Anxiety and Related Disorders: A Systematic Review on Clinical and Therapeutic Perspective with Potential Mechanisms Underlying Their Complex Link. Neurosci. Appl. 2024, 3, 103936.
34. Schuh, A. Gesunder Schlaf und die innere Uhr: Lebensstilbedingte Schlafstörungen und was man dagegen tun kann; Springer: Berlin/Heidelberg, Germany, 2022.
35. Harvey, A.G. A cognitive model of insomnia. Behav. Res. Ther. 2002, 40, 869–893.
36. Bernert, R.A.; Turvey, C.L.; Conwell, Y.; Joiner, T.E. Association of poor subjective sleep quality with risk for death by suicide during a 10-year period: A longitudinal, population-based study of late life. JAMA Psychiatry 2014, 71, 1129–1137.
37. Pigeon, W.R.; Pinquart, M.; Conner, K. Meta-Analysis of Sleep Disturbance and Suicidal Thoughts and Behaviors. J. Clin. Psychiatry 2012, 73, e1160–e1167.
38. Medic, G.; Wille, M.; Hemels, M.E. Short- and long-term health consequences of sleep disruption. Nat. Sci. Sleep 2017, 9, 151–161.
39. Zhang, X.; Sun, Y.; Ye, S.; Huang, Q.; Zheng, R.; Li, Z.; Yu, F.; Zhao, C.; Zhang, M.; Zhao, G.; Ai, S. Associations between Insomnia and Cardiovascular Diseases: A Meta-Review and Meta-Analysis of Observational and Mendelian Randomization Studies. J. Clin. Sleep Med. 2024, 20, 1975–1984.
40. Laugsand, L.E.; Strand, L.B.; Platou, C.; Vatten, L.J.; Janszky, I. Insomnia and the risk of incident heart failure: A population study. Eur. Heart J. 2014, 35, 1382–1393.
41. Meng, L.; Zheng, Y.; Hui, R. The relationship of sleep duration and insomnia to risk of hypertension incidence: A meta-analysis of prospective cohort studies. Hypertens. Res. 2013, 36, 985–995.
42. Irwin, M.R. Why sleep is important for health: A psychoneuroimmunology perspective. Annu. Rev. Psychol. 2015, 66, 143–172.
43. Chaput, J.P.; Dutil, C.; Sampasa-Kanyinga, H. Sleeping hours: What is the ideal number and how does age impact this? Nat. Sci. Sleep 2018, 10, 421–430.
44. Békés, V.; Aafjes-van Doorn, K. Who Wants to Have an AI Therapist? Acceptance of Using Artificial Intelligence for Mental Health Interventions Among Clinicians, Patients and the General Community. Clin. Psychol. Psychother. 2026, 33, e70220.
45. Casu, M.; Triscari, S.; Battiato, S.; Guarnera, L.; Caponnetto, P. AI Chatbots for Mental Health: A Scoping Review of Effectiveness, Feasibility, and Applications. Appl. Sci. 2024, 14, 5889.
46. Boyd, K.; Potts, C.; Bond, R.; Mulvenna, M.; Broderick, T.; Burns, C.; Bickerdike, A.; McTear, M.; Kostenius, C.; Vakaloudis, A.; Dhanapala, I.; Ennis, E.; Booth, F. Usability Testing and Trust Analysis of a Mental Health and Wellbeing Chatbot. In Proceedings of the 33rd European Conference on Cognitive Ergonomics (ECCE 2022), Vienna, Austria, 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1–8.
47. Lin, S.; Lin, L.; Hou, C.; Chen, B.; Li, J.; Ni, S. Empathy-Based Communication Framework for Chatbots: A Mental Health Chatbot Application and Evaluation. In Proceedings of the 11th International Conference on Human-Agent Interaction (HAI 2023), Gothenburg, Sweden, 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 264–272.
48. Zhang, H.; Mao, Y.; Lin, Y.; Zhang, D. E-Mental Health in the Age of AI: Data Safety, Privacy Regulations and Recommendations. Alpha Psychiatry 2025, 26, 44279.
49. Opie, J.E.; Vuong, A.; McIntosh, J.; Kuntsche, S. Brief Digital Mental Health Interventions for Adults with Emerging Symptoms: Part II. User Experience Outcomes Based on a Systematic Review. Ment. Health Digit. Technol. 2026, 3, 46–73.
50. Klein, E.M.; Brähler, E.; Dreier, M.; Reinecke, L.; Müller, K.W.; Schmutzer, G.; Wölfling, K.; Beutel, M.E. The German version of the Perceived Stress Scale - Psychometric characteristics in a representative German community sample. BMC Psychiatry 2016, 16, 159.
51. Schneider, E.E.; Schönfelder, S.; Domke-Wolf, M.; Wessa, M. Measuring Stress in Clinical and Nonclinical Subjects Using a German Adaptation of the Perceived Stress Scale. Int. J. Clin. Health Psychol. 2020, 20, 173–181.
52. Dieck, A.; Morin, C.M.; Backhaus, J. A German version of the Insomnia Severity Index: Validation and identification of a cut-off to detect insomnia. Somnologie 2018, 22, 27–35.
53. Borsci, S.; Schmettow, M. Re-examining the chatBot Usability Scale (BUS-11) to assess user experience with customer relationship management chatbots. Personal. Ubiquitous Comput. 2024, 28, 1033–1044.
54. Kroenke, K.; Spitzer, R.L.; Williams, J.B.W. The PHQ-9: Validity of a brief depression severity measure. J. Gen. Intern. Med. 2001, 16, 606–613.
55. Leiner, D.J. SoSci Survey (Version 3.7.06) [Computer software]. 2025. Available online: https://www.soscisurvey.de.
56. Kallus, K.W. Erstellung von Fragebogen, 2nd ed.; Facultas: Vienna, Austria, 2016.
57. Borges-Tiago, T.; Tiago, F.; Silva, O.; Guaita Martínez, J.M.; Botella-Carrubi, D. Online Users’ Attitudes toward Fake News: Implications for Brand Management. Psychol. Mark. 2020, 37, 1171–1184.
58. Shahil-Feroz, A.; Yasmin, H.; Saleem, S.; Bhutta, Z.; Seto, E. Remote Moderated Usability Testing of a Mobile Phone App for Remote Monitoring of Pregnant Women at High Risk of Preeclampsia in Karachi, Pakistan. Informatics 2023, 10, 79.
59. Kroenke, K.; Spitzer, R.L.; Williams, J.B.W. The PHQ-9: Validity of a brief depression severity measure. J. Gen. Intern. Med. 2001, 16, 606–613.
60. SoSci Survey. Available online: https://www.soscisurvey.de/ (accessed on 3 May 2026).
61. Schnell, R.; Bachteler, T.; Reiher, J. Improving the Use of Self-Generated Identification Codes. Eval. Rev. 2010, 34, 391–418.
62. Kristjansson, A.L.; Sigfusdottir, I.D.; Sigfusson, J.; Allegrante, J.P. Self-Generated Identification Codes in Longitudinal Prevention Research with Adolescents: A Pilot Study of Matched and Unmatched Subjects. Prev. Sci. 2014, 15, 205–212.
63. TelefonSeelsorge Deutschland. Startseite. Available online: https://www.telefonseelsorge.de/ (accessed on 5 May 2026).
64. Stiftung Deutsche Depressionshilfe und Suizidprävention. Wo finde ich Hilfe? Available online: https://www.deutsche-depressionshilfe.de/depression-infos-und-hilfe/wo-finde-ich-hilfe (accessed on 5 May 2026).
65. Datenschutz-Grundverordnung (DSGVO). Available online: https://dsgvo-gesetz.de/ (accessed on 30 April 2026).
66. George, D.; Mallery, P. SPSS for Windows Step by Step: A Simple Guide and Reference, 11.0 Update, 4th ed.; Allyn and Bacon: Boston, MA, USA, 2003.
67. Field, A. Discovering Statistics Using IBM SPSS Statistics, 6th ed.; SAGE Publications Ltd: London, UK, 2024.
68. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Lawrence Erlbaum Associates: Hillsdale, NJ, USA, 1988.
69. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodological) 1995, 57, 289–300.
70. Firth, J.; Torous, J.; Nicholas, J.; Carney, R.; Pratap, A.; Rosenbaum, S.; Sarris, J. The Efficacy of Smartphone-Based Mental Health Interventions for Depressive Symptoms: A Meta-Analysis of Randomized Controlled Trials. World Psychiatry 2017, 16, 287–298.
71. Zhang, H.; Mao, Y.; Lin, Y.; Zhang, D. E-Mental Health in the Age of AI: Data Safety, Privacy Regulations and Recommendations. Alpha Psychiatry 2025, 26, 44279.
72. Laranjo, L.; Dunn, A.G.; Tong, H.L.; Kocaballi, A.B.; Chen, J.; Bashir, R.; Surian, D.; Gallego, B.; Magrabi, F.; Lau, A.Y.S.; Coiera, E. Conversational Agents in Healthcare: A Systematic Review. J. Am. Med. Inform. Assoc. 2018, 25, 1248–1258.
73. Shahil-Feroz, A.; Yasmin, H.; Saleem, S.; Bhutta, Z.; Seto, E. Remote Moderated Usability Testing of a Mobile Phone App for Remote Monitoring of Pregnant Women at High Risk of Preeclampsia in Karachi, Pakistan. Informatics 2023, 10, 79.
74. Opie, J.E.; Vuong, A.; Welsh, E.; Esler, T.; Raza Khan, U.; Khalil, H. Outcomes of Best-Practice Guided Digital Mental Health Interventions for Youth and Young Adults with Emerging Symptoms: Part II. A Systematic Review of User Experience Outcomes. Clin. Child Fam. Psychol. Rev. 2024, 27, 1–33.
75. Andrew, L.; Dare, J.; Robinson, K.; Costello, L. Nursing practicum equity for a changing nurse student demographic: A qualitative study. BMC Nurs. 2022, 21, 37.
76. Knapstad, M.; Sivertsen, B.; Knudsen, A.K.; Smith, O.R.F.; Aarø, L.E.; Lønning, K.J.; Skogen, J.C. Trends in self-reported psychological distress among college and university students from 2010 to 2018. Psychol. Med. 2021, 51, 470–478.
77. Creswell, J.W.; Plano Clark, V.L. Designing and Conducting Mixed Methods Research, 3rd ed.; SAGE: Thousand Oaks, CA, USA, 2018.

Table 1. Cronbach’s alpha for PSS and BUS(11) at the measurement points T1–T3.

Scale	Number of items	T1	T2	T3
PSS	10	.84	.87	.88
Perceived Sleep Problems	3	.80	.76	.82
BUS(11) – Total Scale	11	-	.87	.89
BUS(11) – Accessibility	2	-	.74	.91
BUS(11) – Functionality	3	-	.82	.75
BUS(11) – Conversation	4	-	.75	.75
BUS(11) – Privacy	1	-	-	-
BUS(11) – Responsiveness	1	-	-	-

*Notes. PSS = Perceived Stress Scale. BUS(11) = Chatbot Usability Scale.
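
For readers who want to reproduce reliability coefficients like those in Table 1, here is a minimal sketch of the textbook Cronbach's alpha formula, α = k/(k − 1) · (1 − Σ item variances / variance of the sum score). The function name and the response matrix are hypothetical and are not study data. Because the formula needs at least two items, it cannot be evaluated for single-item subscales, which is consistent with the dashes shown for Privacy and Responsiveness.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha is undefined for fewer than two items")
    if any(len(r) != k for r in rows):
        raise ValueError("every respondent needs the same number of items")
    # Sample variance of each item across respondents.
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Sample variance of the per-respondent sum score.
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("sum scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 3-item responses from four respondents (illustration only).
example = [[1, 2, 2], [2, 3, 3], [3, 3, 4], [4, 5, 5]]
print(round(cronbach_alpha(example), 2))
```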

Table 2. Descriptive statistics on stress, sleep problems and chatbot usability across the measurement time points.

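Variable	Measurement time	Mdn (IQR)	Min.	Max.
Perceived stress	T1	16 (10)	7	25
Perceived stress	T2	16 (9)	6	24
Perceived stress	T3	15 (9)	7	23
Perceived sleep problems	T1	4 (3)	0	7
Perceived sleep problems	T2	4 (3)	0	7
Perceived sleep problems	T3	4 (3)	0	6
Chatbot usability – Overall score	T2	45 (7)	35	48
Chatbot usability – Overall score	T3	47 (9.5)	38	51
Chatbot usability – Accessibility	T2	8 (1.5)	6	10
Chatbot usability – Accessibility	T3	9 (2)	8	10
Chatbot usability – Functionality	T2	11 (2)	8	13
Chatbot usability – Functionality	T3	12 (3)	9	14
Chatbot usability – Conversation	T2	16 (3)	12	18
Chatbot usability – Conversation	T3	18 (2.5)	13	20
Chatbot usability – Privacy	T2	4 (0)	4	5
Chatbot usability – Privacy	T3	5 (0.5)	4	5
Chatbot usability – Responsiveness	T2	4 (0.5)	3	4
Chatbot usability – Responsiveness	T3	3 (0.5)	2	4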

*Notes. T1 = Measurement point 1. T2 = Measurement point 2. T3 = Measurement point 3.

Table 3. Results of the Wilcoxon signed-rank tests on changes in stress, sleep problems and chatbot usability between the individual measurement points.

Variable	Comparison	Z	p	r
Perceived stress	T2 – T1	-1.86	.03*	.70
Perceived stress	T3 – T2	-1.60	.05	.53
Perceived stress	T3 – T1	-2.31	.01*	.77
Perceived sleep problems	T2 – T1	0.31	.64	.09
Perceived sleep problems	T3 – T2	-1.86	.03*	.70
Perceived sleep problems	T3 – T1	-1.43	.07	.45
Chatbot usability – Overall score	T3 – T2	2.37	.01*	.79
Chatbot usability – Accessibility	T3 – T2	2.03	.02*	.72
Chatbot usability – Functionality	T3 – T2	1.78	.04*	.67
Chatbot usability – Conversation	T3 – T2	1.33	.10	.47
Chatbot usability – Privacy	T3 – T2	2.20	.01*	.90
Chatbot usability – Responsiveness	T3 – T2	-2.20	.99	.90

*Notes. T1 = Measurement point 1. T2 = Measurement point 2. T3 = Measurement point 3. * p < .05.
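
The r column in Table 3 is an effect size derived from the Wilcoxon Z statistic. The sketch below assumes the widely used convention r = |Z| / √n, where n is the number of paired observations entering the test. The table does not state which formula or which n was used per comparison, so the n in the usage line is a hypothetical value and the function names are illustrative. The labels follow Cohen's conventional benchmarks for r (.10, .30, .50).

```python
import math


def wilcoxon_effect_size(z, n):
    """Effect size r = |Z| / sqrt(n) for a Wilcoxon signed-rank test."""
    if n <= 0:
        raise ValueError("n must be positive")
    return abs(z) / math.sqrt(n)


def cohen_label(r):
    """Verbal label using Cohen's conventional benchmarks for r."""
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "medium"
    if r >= 0.1:
        return "small"
    return "negligible"


# Z taken from Table 3 (perceived stress, T3 - T1); n = 9 is an assumption.
r = wilcoxon_effect_size(-2.31, 9)
print(round(r, 2), cohen_label(r))
```

Because r divides by √n, the same Z yields a larger r when fewer pairs enter the test, so r values are only comparable across rows if the underlying n is known.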

Table 4. Results of the Friedman tests on differences in stress and sleep problems across the three measurement points.

Variable	Comparison	χ²(2)	p	Kendall’s W
Perceived stress	T1, T2, T3	8.23	.02*	.37
Perceived sleep problems	T1, T2, T3	3.13	.21	.14

*Notes. T1 = Measurement point 1. T2 = Measurement point 2. T3 = Measurement point 3. * p < .05.
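
The Friedman results in Table 4 can be cross-checked with two closed-form relations. Kendall's W follows from the test statistic as W = χ² / (N(k − 1)), and for two degrees of freedom the chi-square upper-tail probability reduces to exp(−χ²/2). The sketch assumes that all N = 11 participants entered the test with k = 3 measurement points; the function names are illustrative.

```python
import math


def kendalls_w(chi2, n, k):
    """Kendall's W from a Friedman chi-square with n subjects and k conditions."""
    return chi2 / (n * (k - 1))


def chi2_p_df2(chi2):
    """Upper-tail p-value of a chi-square statistic with exactly 2 df."""
    return math.exp(-chi2 / 2)


# Statistics from Table 4; N = 11 and k = 3 are the assumed design values.
for label, chi2 in [("Perceived stress", 8.23), ("Perceived sleep problems", 3.13)]:
    w = kendalls_w(chi2, n=11, k=3)
    p = chi2_p_df2(chi2)
    print(f"{label}: W = {w:.2f}, p = {p:.2f}")
```

Under these assumptions the sketch gives W ≈ .37 and p ≈ .02 for perceived stress, and W ≈ .14 and p ≈ .21 for perceived sleep problems.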

Table 5. Results of the Mann-Whitney U tests on group differences in changes in stress and sleep problems between measurement points 1 and 3.

Variable	Group characteristic	Group 1	Group 2	Mdn (IQR) of T3 – T1, Group 1	Mdn (IQR) of T3 – T1, Group 2	Z	p	r
Perceived stress	Gender	Male	Female	-1 (1.5)	-1.5 (1)	0.95	.33	.29
Perceived stress	Age	18-24	30-34	-1 (2)	-1 (0.5)	0.41	.67	.12
Perceived stress	Employment status	In employment	Student	-1 (1)	-1 (1.5)	-0.28	.78	.08
Perceived stress	Weekly app usage	1-2 times	3-4 times	0 (1)	-2 (0.75)	2.37	.01*	.72
Perceived sleep problems	Gender	Male	Female	-1 (2)	-1 (0.5)	-0.19	.84	.06
Perceived sleep problems	Age	18-24	30-34	-1 (1.25)	-1 (1.5)	0.31	.75	.09
Perceived sleep problems	Employment status	In employment	Student	-1 (1)	-1 (1.5)	-0.09	.92	.03
Perceived sleep problems	Weekly app usage	1-2 times	3-4 times	1 (1)	-1 (0.75)	2.37	.01*	.72

*Notes. T1 = Measurement point 1. T3 = Measurement point 3. * p < .05.

Table 6. Results of the Mann-Whitney U tests on group differences in chatbot usability at measurement point 3.

Variable	Group characteristic	Group 1	Group 2	Mdn (IQR) at T3, Group 1	Mdn (IQR) at T3, Group 2	Z	p	r
Chatbot usability – Overall score	Gender	Male	Female	45 (9.5)	47 (2.75)	-0.04	.63	.14
Chatbot usability – Overall score	Age	18-24	30-34	46 (9.25)	47 (6.5)	0	1	0
Chatbot usability – Overall score	Employment status	In employment	Student	47 (11)	46 (7.25)	0.27	.79	.08
Chatbot usability – Overall score	Weekly app usage	1-2 times	3-4 times	40 (0)	49.5 (3.25)	-2.74	.005**	.83
Chatbot usability – Accessibility	Gender	Male	Female	10 (2)	8.5 (1.25)	0.57	.53	.17
Chatbot usability – Accessibility	Age	18-24	30-34	9.5 (2)	8 (1)	0.61	.50	.19
Chatbot usability – Accessibility	Employment status	In employment	Student	8 (2)	9.5 (1.75)	-0.55	.54	.17
Chatbot usability – Accessibility	Weekly app usage	1-2 times	3-4 times	8 (0)	10 (0.75)	-1.64	.07	.50
Chatbot usability – Functionality	Gender	Male	Female	11 (3)	12.5 (1.75)	-1.13	.25	.34
Chatbot usability – Functionality	Age	18-24	30-34	11.5 (2.5)	13 (1.5)	-0.71	.47	.22
Chatbot usability – Functionality	Employment status	In employment	Student	13 (3)	11.5 (1.75)	0.73	.46	.20
Chatbot usability – Functionality	Weekly app usage	1-2 times	3-4 times	10 (1)	13 (0.75)	-2.24	.005**	.83
Chatbot usability – Conversation	Gender	Male	Female	16 (2.5)	18 (1)	-0.66	.50	.20
Chatbot usability – Conversation	Age	18-24	30-34	17 (2.25)	18 (3.5)	-0.31	.75	.09
Chatbot usability – Conversation	Employment status	In employment	Student	18 (3)	17 (2.75)	0.73	.45	.20
Chatbot usability – Conversation	Weekly app usage	1-2 times	3-4 times	15 (1)	18 (0.75)	-2.74	.005**	.83
Chatbot usability – Privacy	Gender	Male	Female	5 (0)	4.5 (1)	0.95	.22	.29
Chatbot usability – Privacy	Age	18-24	30-34	5 (0.25)	5 (0.5)	0.20	.80	.06
Chatbot usability – Privacy	Employment status	In employment	Student	5 (0)	5 (0.75)	0.37	.63	.11
Chatbot usability – Privacy	Weekly app usage	1-2 times	3-4 times	5 (1)	5 (0)	-0.64	.41	.19
Chatbot usability – Responsiveness	Gender	Male	Female	3 (0.5)	3 (0.25)	-0.19	.83	.05
Chatbot usability – Responsiveness	Age	18-24	30-34	3 (1)	3 (0)	0.61	.47	.19
Chatbot usability – Responsiveness	Employment status	In employment	Student	3 (0)	3.5 (1)	-1.64	.05	.50
Chatbot usability – Responsiveness	Weekly app usage	1-2 times	3-4 times	3 (0)	3.5 (1)	-1.64	.05	.50

*Notes. T3 = Measurement point 3. ** p < .01.

Table 7. Correlation matrix.

Variable	1	2	3	4	5	6	7
1. Change in perceived stress (T3 – T1)
2. Change in perceived sleep problems (T3 – T1)	.69*
3. Chatbot usability – Overall score (T3)	-.72*	-.88***
4. Chatbot usability – Accessibility (T3)	-.53	-.80**	.81**
5. Chatbot usability – Functionality (T3)	-.79*	-.81**	.90**	.66
6. Chatbot usability – Conversation (T3)	-.68	-.87**	.96***	.70	.86**
7. Chatbot usability – Privacy (T3)	.13	-.48	.49	.43	.30	.47
8. Chatbot usability – Responsiveness (T3)	-.57	-.46	.43	.50	.41	.31	-.15

* Notes. T1 = Measurement point 1. T3 = Measurement point 3. * p < .05. ** p < .01. *** p < .001.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
