A Clinically Guided Rule-Based Synthetic Dataset for Multi-Modal Longitudinal Treatment-Response Monitoring in Major Depressive Disorder

Elsie Kaaya; Jorge Marx Gómez

doi:10.20944/preprints202606.0316.v1

Submitted: 02 June 2026

Posted: 03 June 2026

Abstract

Monitoring treatment response in Major Depressive Disorder (MDD) remains challenging since treatment selection often follows a trial-and-error approach and access to real-world multimodal mental health data is limited by privacy, ethical, and availability constraints. This study presents a methodological approach for designing and generating a clinically guided, rule-based synthetic multimodal dataset to support early-stage experimentation in MDD treatment-response monitoring. Digital biomarkers relevant to depression were identified through literature and expert consultation. Patient Health Questionnaire-9 (PHQ-9) scores were used as the primary clinical anchor, while simulated smartphone and wearable indicators were organized into composite domains, including sleep, activity, mobility, physiology, social interaction, digital behavior, adherence, ecological momentary assessment, and missingness. The synthetic data schema guided the generation of a 12-week acute-phase dataset incorporating baseline characteristics, daily monitoring variables, biweekly PHQ-9 assessments, treatment review points, and derived clinical labels, including response, remission, and trajectory groups. The resulting dataset demonstrated statistical, distributional, temporal, dependency, and trajectory-level plausibility. This work contributes a transparent and reproducible framework for synthetic data generation in privacy-sensitive mental health research and provides a controlled testbed for future machine learning and federated learning experiments.

Keywords: major depressive disorder; synthetic data generation; digital phenotyping; PHQ-9; federated learning

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Major depressive disorder (MDD) is the leading cause of disability worldwide, significantly affecting individuals’ quality of life and contributing to a major public health burden [1,2]. It is characterized by recurrent thoughts of death, persistent depressed mood, loss of interest or pleasure in previously enjoyable activities, and physical and cognitive symptoms [3]. Additionally, MDD places individuals at risk of injury, disease, and early mortality, including suicide, as well as negative social, economic, professional, and academic outcomes [4].

Despite its serious effects, MDD treatment remains challenging and often relies on a trial-and-error approach. This contributes to low remission rates: about 60% of patients with MDD reportedly do not achieve remission during the acute phase, usually the first 12 weeks of treatment, and this has been observed even in antidepressant trials under optimal conditions [5]. Since no treatment is a panacea, several treatment trials are often needed to achieve remission with a treatment that has tolerable side effects [6]. This underscores the need for monitoring treatment response in individuals with MDD.

Prior studies have highlighted the importance of longitudinal treatment-response monitoring through Measurement-Based Care (MBC). In psychiatry, MBC is defined as the use of clinical measurement instruments to support objective assessment, treatment and clinical outcomes in patients with psychiatric disorders. It involves two processes: routine symptom assessment and the use of assessment results to guide clinical-decision making [7]. The Texas Medication Algorithm Project (TMAP) for MDD and the STAR*D trial both recommended patient monitoring at specific time points during treatment using clinical symptom-based rating scales such as PHQ-9 which measures depression severity [8,9]. During this period, clinicians were encouraged to assess symptom severity, treatment adherence and tolerability. Despite these efforts, clinical assessments alone may not fully capture the continuous, fluctuating, and context-dependent nature of depressive symptoms in everyday life.

Digital phenotyping has emerged as a promising approach to support clinical assessments by capturing continuous behavioral and physiological indicators. This can be achieved through smartphones and wearable devices, which are capable of passively collecting quantitative physiological and behavioral data [10]. In MDD, digital biomarkers such as mobility, location, sleep, and communication patterns, together with physiological signals such as heart rate and heart rate variability (HRV), have shown potential for providing insight into patient status and progression [11,12]. Busshart et al. reported that integrating clinical scores with digital phenotyping may improve predictive value and enable more individualized symptom monitoring over time [13]. Despite its potential, digital phenotyping faces important privacy and access challenges that impede mental health research. These limitations have encouraged the use of synthetic data generation as a privacy-preserving approach to address data scarcity and enable controlled methodological experimentation [14].

Synthetic data generation is increasingly considered a methodological response to data scarcity, privacy constraints, and limited access to representative patient datasets [15]. Synthetic data is increasingly being utilized in healthcare research where access to sensitive patient data is limited. Rather than relying on real-world data, synthetic data is generated to represent the statistical, clinical, and temporal properties of real-world data while preserving patient privacy [16]. In healthcare, synthetic data has been used to support privacy-preserving data sharing, data augmentation, clinical simulation, and predictive analytics. It has also been explored in medical domains, such as modelling complex patient scenarios that may be under-represented in real-world datasets [17]. Therefore, while synthetic data offers a promising avenue for privacy-preserving health research, its methodological value is shaped by careful alignment with clinically meaningful, methodologically transparent, and context-specific design principles.

Although digital phenotyping and synthetic data generation have gained increasing attention, prior studies have not explored the generation of a multimodal longitudinal dataset for monitoring treatment response in individuals with MDD. This study addresses this gap by generating a dataset that integrates PHQ-9 clinical anchoring, expert-informed behavioral and physiological indicators, treatment-response trajectories, and structured monitoring time points within a single rule-based framework. The main contribution of this study is a transparent, reproducible, and clinically grounded rule-based synthetic data generator that can support simulation, experimentation, and future modelling in MDD treatment-response monitoring.

The paper presents the methodological approach used to design and generate a rule-based synthetic dataset. Digital biomarkers relevant to MDD were systematically identified through literature and expert consultation. PHQ-9 scores served as the primary clinical anchor, while expert-informed behavioral and physiological indicators, treatment-response trajectories, and structured monitoring time points were integrated into a single rule-based framework to provide detailed insight into individual treatment response. Additionally, the paper investigates the distributional, temporal, dependency, and trajectory-group plausibility of the resulting dataset and analyzes its strengths and limitations.

1.1. Related Works

The above literature shows a clear need for synthetic datasets that are not only privacy-preserving, but also clinically anchored, longitudinal, multimodal, and structured around treatment-response monitoring. This study addresses that gap by developing a rule-based synthetic dataset that integrates PHQ-9 assessment logic, expert-informed behavioral and physiological indicators, treatment-response trajectories, and structured clinical monitoring time points.

Table 1. Summary of the key related studies across treatment-response monitoring, digital phenotyping, privacy and ethical challenges, and synthetic data generation. The table highlights that although each research strand contributes important methodological or clinical insights, limited work has integrated these areas into a clinically guided, multimodal longitudinal synthetic dataset for MDD treatment-response monitoring.

Study	Study Focus	Scope	Relevance	Gap
Rush et al. [6]	MDD treatment outcomes	STAR*D sequential treatment outcomes and remission across treatment steps	Supports the argument that MDD treatment often requires multiple steps and that remission is difficult to achieve	Does not address digital phenotyping or synthetic data generation
Trivedi et al. [9]	Measurement-based care	Clinical symptom monitoring, adherence, and side-effect assessment in STAR*D	Supports structured symptom assessment and treatment monitoring during MDD care	Relies on scheduled clinical assessments without passive smartphone/wearable monitoring
Kroenke et al. [18]	PHQ-9 validation	Validation of PHQ-9 as a depression severity measure	Justifies the use of PHQ-9 as a clinical anchor for symptom severity and treatment-response monitoring	Does not address passive behavioral data or synthetic longitudinal data generation
Zierer et al. [11]	Digital biomarkers	Review of passive digital biomarkers associated with depression	Supports the selection of mobility, activity, sleep, communication, and physiological variables	Reviews digital biomarkers but does not generate a synthetic treatment-response dataset
Leaning et al. [19]	Smartphone phenotyping	Smartphone-derived data for clinically relevant predictions in MDD	Supports the use of smartphone data for depression-related prediction and monitoring	Does not integrate synthetic data generation with PHQ-9-anchored treatment trajectories
Vignapiano et al. [20]	Digital biomarkers in mood disorders	Smartphone and wearable indicators for mood disorder management	Supports the role of digital biomarkers in tracking depression severity and treatment response	Does not provide a rule-based synthetic dataset for MDD treatment monitoring
Jung et al. [21]	Wearable/sensor monitoring	Sensor features for predicting depression and anxiety	Supports the relevance of wearable-derived features for mental health monitoring	Focuses on feature identification, not synthetic multimodal longitudinal data generation
Martinez-Martin et al. [22]	Ethics and privacy	Ethical guidance for mental health digital phenotyping	Supports privacy, consent, data protection, transparency, and accountability concerns	Does not propose a synthetic data framework for MDD treatment-response experimentation
Oudin et al. [23]	Digital psychiatry ethics	Privacy, confidentiality, consent, and data ownership in digital psychiatry	Supports the claim that psychiatric digital phenotyping raises important ethical and governance concerns	Does not develop a privacy-preserving synthetic dataset for MDD monitoring
Giuffrè and Shung [24]	Synthetic data in healthcare	Benefits, applications, and limitations of synthetic healthcare data	Supports synthetic data as a way to improve privacy, data sharing, and predictive analytics	Broad healthcare focus; not specific to MDD, PHQ-9, or digital phenotyping
Pezoulas et al. [25]	Healthcare synthetic data	Review of synthetic data generation methods in healthcare	Supports synthetic data generation as a response to data scarcity and privacy concerns	Does not focus on clinically guided MDD treatment-response monitoring
Mendes et al. [26]	Privacy-preserving synthetic data	Synthetic data for bridging data gaps and enabling simulation	Supports synthetic data as a privacy-preserving method for methodological experimentation	Not specific to MDD, smartphone/wearable data, or treatment-response trajectories
Qian et al. [27]	Privacy-preserving clinical modelling	Synthetic data for clinical risk prediction under privacy constraints	Demonstrates how synthetic data can support clinical modelling pipelines where real data access is limited	Focuses on clinical risk prediction, not MDD digital phenotyping or treatment-response monitoring
Loni et al. [28]	Synthetic health records	Review of synthetic medical text, time-series, and longitudinal health records	Relevant to longitudinal synthetic health data generation	Broad health-record focus; not specific to MDD treatment-response trajectories

2. Materials and Methods

2.1. Methodological Framework

The study employed a step-wise methodology in the design and generation of the synthetic dataset. First, evidence from earlier research on digital phenotyping in individuals with MDD was used to identify the mobile and wearable data most commonly used in current research. Second, clinical guidance was sought from three psychologists and three psychiatrists to determine clinically relevant variables and how they can be incorporated into day-to-day practice. Finally, following expert consultation, the PHQ-9 tool was used as a clinical anchor to guide treatment-response monitoring. Through the tool, clinical time points for observation and responder categories were identified. These findings resulted in a synthetic dataset schema, which guided the rule-based synthetic data generation.

2.2. Identifying Digital Phenotyping Variables

Literature on digital phenotyping was explored to identify behavioral and physiological biomarkers used in MDD research. Busshart et al. identified five categories of parameters for monitoring, assessing, and potentially predicting depression: (1) physical activity and location, (2) behavioral patterns, (3) physiological signals, (4) sleep indicators, and (5) sociability and self-reported assessments [13]. This is further elaborated in Figure 1. Similarly, Zierer et al. conducted a systematic review to identify primary biomarkers associated with depression, including physical activity, sleep disturbance, speech and language features, mobility and location entropy, HRV, and electrodermal activity [11]. These findings are consistent with Zhan et al., who identified location entropy, social app usage, disturbed sleep, and HRV as variables associated with depression severity [29].

Several studies have highlighted the significance of digital biomarkers in depression research. A study by Aledavood et al. reported that digital biomarkers are indicative of passive behavioral features linked to clinical scores. This was demonstrated by the correlation found between call duration, physical activity, and PHQ-9 scores [12]. Additionally, digital biomarkers can be used to support traditional gold-standard assessments and enable early intervention [30]. Moreover, traditional methods of mental health assessment rely on clinical interviews and self-reports, which are often subjective and dependent on recall, and may not capture dynamic psychological changes in daily life. However, through digital biomarkers, it is possible to capture continuous, real-time dynamic changes remotely [31,32,33]. These findings underscore the importance of digital biomarkers in MDD research.

The findings from previous studies informed the list of variables that were later used in the expert interviews to determine the most clinically relevant variables. A total of 30 variables were organized into six categories, including clinical anchors, physiological features (heart rate, HRV, and electrodermal activity), mobility and location, sleep and circadian rhythm, physical activity, and digital and social behavior, as shown in Figure 2. The inclusion and exclusion of candidate variables were based on feasibility for structured synthetic data generation and interpretability. The final list included variables that could support longitudinal modelling of symptom change, while excluding highly intrusive and technically complex features such as raw speech, facial expressivity, and detailed communication content.

2.3. Development of an Expert Interview Instrument

From the results obtained through literature on digital phenotyping in MDD, an open-ended questionnaire was designed to gain expert insight into the ground truth and how these variables can complement current clinical practices. The main focus of the interview was to understand clinically meaningful behavioral change, temporal patterns and variability, functional recovery versus symptom severity, and early warning signs of relapse. The interviews were conducted individually.

The first step was to understand the most prominent and alarming symptoms of MDD that clinicians observe in patients. This was intended to identify variables that might have been missed in the literature but are important to clinicians. Second, to bridge the knowledge gap between clinical experts and technical researchers, the list of variables obtained from the literature was shown to the experts, as they were not aware of the digital biomarkers that can provide insight into MDD beyond the PHQ-9 questionnaire. This opened room for discussion on how the variables relate to MDD according to reports from previous studies and how the experts envisioned their use in real-life settings. Finally, the expert interview questions were presented to the clinicians, who then provided insights into the topic, bridging the gap between digital phenotyping literature and real-world clinical monitoring.

2.4. Clinical Calibration Through Expert Input

The expert interview provided clinical guidance on the behavioral and physiological patterns that may indicate treatment progression in MDD. The expert emphasized that meaningful improvement should be interpreted through sustained patterns over time rather than isolated behavioral changes. Key indicators included sleep regularity, activity levels, sedentary behavior, mobility, routine stability, screen time, mood, energy, anhedonia, and autonomic signals such as HRV. The expert also highlighted the importance of interpreting digital behavioral data in relation to an individual’s baseline while maintaining PHQ-9 scores as a clinical anchor for symptom severity.

Additionally, one expert highlighted the importance of observing functional recovery and symptom reduction as a unit. This can be explained through instances where increased activity does not represent clinical improvement when it occurs abruptly or in an unstable pattern, emphasizing that improvement in PHQ-9 scores may not always correspond to restored daily functioning. The expert therefore recommended that behavioral signals be interpreted collectively and longitudinally. It was further emphasized that sustained improvement, reduced instability, and restoration of routine were considered more clinically meaningful than short-term fluctuations.

Expert input further informed the timing of data collection. Short-term daily data were considered useful in capturing variability and instability, while weekly and monthly data were considered more important in assessing treatment response. Based on this insight, daily patterns were included to capture the dynamic nature of MDD recovery, while weekly updates were incorporated to reflect broader treatment-response trends over time.

Clinical plausibility was preserved by ensuring that the synthetic dataset reflected realistic symptom-behavior relationships, clinically meaningful time scales, and expected patterns of treatment response. By considering coherence between depressive symptoms, behavioral functioning, and temporal progression, this ensured that synthetic profiles represented plausible clinical trajectories, including improvement, stagnation, unstable response, deterioration, and relapse risk.

2.5. PHQ-9 as the Clinical Anchor for Symptom Severity and Treatment Response

The PHQ-9 depression severity questionnaire provides insight into a person’s emotional state as it relates to depression. The questionnaire consists of nine questions that prompt patients to reflect on how they have been feeling over the past two weeks. Each question is scored from 0 to 3, with 0 representing the absence of the feeling and 3 representing a recurrent feeling experienced nearly every day for the past two weeks. The questionnaire investigates the presence of anhedonia, depressed mood, sleep disturbance, fatigue, appetite changes, worthlessness or excessive guilt, concentration difficulties, psychomotor changes, and suicidal ideation.
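The scoring rule described above can be illustrated with a minimal sketch. The item range (0 to 3) and the number of items (nine) follow the text; the severity bands are the conventional PHQ-9 cut-offs and are assumed here, since this section only states the ≥10 threshold. The function names are illustrative and are not taken from the study's generator.

```python
from typing import Sequence

# Conventional PHQ-9 severity bands; assumed for illustration, because the
# text above only states the >=10 inclusion threshold.
SEVERITY_BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_total(item_scores: Sequence[int]) -> int:
    """Sum nine item scores, each in 0..3, into a total in 0..27."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    return sum(item_scores)


def phq9_severity(total: int) -> str:
    """Map a total score to its (assumed, conventional) severity band."""
    for low, high, label in SEVERITY_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total must be in 0..27")
```

For example, item scores of `[2, 2, 1, 2, 1, 1, 2, 1, 0]` sum to 12, which falls in the moderate band under these assumed cut-offs.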

Baseline PHQ-9 was used to indicate the patient's initial depression severity. This study focused on individuals with a PHQ-9 score of ≥10, which indicates moderate to severe depression. Because the PHQ-9 is commonly administered every two weeks, the dataset reassesses it at the same interval to determine treatment response. This study focused on the acute phase, a 12-week period that is the most significant period in treatment monitoring. This choice was guided by both literature and expert input.

2.6. Definition of Measurement Frequencies and Assessment Time Points

Each participant was first assigned a baseline PHQ-9 score at Week 0, which determined the participant’s depression severity. Behavioral and physiological data were then generated from Week 0 to Week 12, while weekly summaries were computed at the end of Weeks 1–12. Clinical symptom assessments were represented at two-week intervals, including Weeks 0, 2, 4, 6, 8, 10, and 12. Treatment review windows were defined at Weeks 4, 8, and 12 to reflect clinically meaningful decision points, while the other clinical assessment windows were used only for symptom monitoring.

Treatment trajectories were simulated to represent early responders, delayed responders, partial responders, non-responders, and unstable responders. Early responders demonstrated symptom improvement within the initial treatment period, whereas delayed responders showed comparable gains later in the acute phase. Partial responders achieved only moderate improvement, non-responders showed little to no meaningful change, and unstable responders were characterised by inconsistent, fluctuating trajectories throughout the monitoring period. Daily behavioral and physiological variables were generated to align with these trajectories, including smartphone-derived measures such as screen time, night-time screen use, mobility, location entropy, call activity, and social interaction, as well as wearable-derived measures such as steps, sleep efficiency, WASO, resting heart rate, and heart-rate variability.

Random noise and participant-level variability were introduced to avoid overly deterministic patterns and to reflect heterogeneity in symptom expression and behavioral response. Missingness was also simulated for GPS and heart-rate data to approximate incomplete passive sensing due to device non-wear, sensor failure, or irregular phone use. Final treatment-response labels were assigned at Week 12 and categorized as responder, partial responder, or non-responder.

Table 2. Measurement Frequency and Assessment Time Points Used in the Synthetic Dataset.

Time Scale	Timepoints
Baseline only	Week 0
Daily observations	Every day from Week 0 to Week 12
Weekly summary	End of Weeks 1-12
Biweekly clinical assessment	Weeks 0, 2, 4, 6, 8, 10, 12
Clinical assessment points	Weeks 0, 2, 4, 6, 8, 10, 12
Treatment review points	Weeks 4, 8, 12
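
The schedule in Table 2 can be expressed compactly in code. The sketch below is illustrative only: the variable and function names are hypothetical, and the role labels are shorthand for the distinctions drawn in Section 2.6, not identifiers from the study's schema.

```python
ACUTE_PHASE_WEEKS = 12

# Biweekly PHQ-9 assessments: Weeks 0, 2, ..., 12.
assessment_weeks = list(range(0, ACUTE_PHASE_WEEKS + 1, 2))
# Treatment review points as listed in Table 2.
review_weeks = [4, 8, 12]
# Weekly summaries are computed at the end of Weeks 1..12.
summary_weeks = list(range(1, ACUTE_PHASE_WEEKS + 1))


def assessment_role(week: int) -> str:
    """Classify a week by the role it plays in the monitoring schedule."""
    if week == 0:
        return "baseline"
    if week in review_weeks:
        return "treatment_review"
    if week in assessment_weeks:
        return "symptom_monitoring"
    return "none"
```

Under this reading, Weeks 2, 6, and 10 are symptom-monitoring points only, while Weeks 4, 8, and 12 also act as treatment review points.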

2.7. Construction of the Synthetic Data Schema

The synthetic data schema served as a blueprint for the rule-based multimodal synthetic dataset. The schema specifies the variable domains, data types, measurement frequencies, expected ranges, clinical or behavioral meanings, and rule-based generation assumptions used to guide the synthetic dataset generation process. To support reproducibility, the full synthetic data schema is provided as supplementary material.

2.8. Rule-Based Synthetic Data Generation

The synthetic dataset was generated by a rule-based generator guided by the synthetic data schema. The schema defined the variables, frequency of measurement, clinical significance, and their association with MDD treatment-response monitoring. The rule-based generator was designed to generate participant characteristics, device information, clinical baseline information, daily behavioral and physiological data, treatment-response trajectories, and final treatment outcomes. The aim was to generate a multimodal longitudinal synthetic dataset that can be used for modelling in MDD research.

2.8.1. Participant-Level Initialization

The generation process began with each synthetic participant being assigned a unique ID, demographic characteristics, device-related characteristics, baseline PHQ-9 score, and initial depression severity group. Demographic and device-related characteristics were included to reflect representation across ethnic groups and device types, respectively, while baseline PHQ-9 scores were used to represent the initial depression severity of each participant. This set the basis for generating subsequent clinical and behavioral trajectories.

2.8.2. Assignment of Treatment-Response Trajectories

After participant initialization, each participant was assigned a treatment-response trajectory based on their initial depression severity group. These trajectories represented how each individual responded to treatment throughout the acute monitoring period. This study classified trajectory groups into five categories: early responders, delayed responders, partial responders, non-responders, and unstable responders. The trajectory group classification was linked to expected PHQ-9 change, behavioral progression, and treatment-response logic. For example, early responders were expected to show early improvement, while delayed responders exhibited improvement later in the acute phase. Partial responders showed moderate improvement, non-responders had minimal or no improvement, and unstable responders had fluctuating improvement patterns.
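
One way to picture how such trajectory groups could shape PHQ-9 scores is sketched below. This is not the study's generator: the logistic progress curve, the parameter names (`final_drop`, `midpoint`, `wobble`), and all numeric values are assumptions chosen only to reproduce the qualitative patterns described above.

```python
import math
import random

# Hypothetical shape parameters per trajectory group (NOT from the paper):
# final_drop = fraction of the baseline score lost by Week 12,
# midpoint   = week at which half of that drop has occurred,
# wobble     = size (in PHQ-9 points) of alternating up/down swings.
TRAJECTORY_PARAMS = {
    "early_responder":    {"final_drop": 0.65, "midpoint": 3.0, "wobble": 0.0},
    "delayed_responder":  {"final_drop": 0.60, "midpoint": 8.0, "wobble": 0.0},
    "partial_responder":  {"final_drop": 0.35, "midpoint": 6.0, "wobble": 0.0},
    "non_responder":      {"final_drop": 0.05, "midpoint": 6.0, "wobble": 0.0},
    "unstable_responder": {"final_drop": 0.40, "midpoint": 6.0, "wobble": 3.0},
}
ASSESSMENT_WEEKS = range(0, 13, 2)


def simulate_phq9(baseline, group, rng, noise_sd=1.0):
    """Return {week: PHQ-9 score} for one synthetic participant."""
    params = TRAJECTORY_PARAMS[group]
    scores = {0: baseline}  # Week 0 is the baseline itself, kept noise-free
    for index, week in enumerate(ASSESSMENT_WEEKS):
        if week == 0:
            continue
        # Logistic progress curve rising from ~0 to ~1 around the midpoint week
        progress = 1.0 / (1.0 + math.exp(-(week - params["midpoint"])))
        expected = baseline * (1.0 - params["final_drop"] * progress)
        # Alternating swing models the fluctuating "unstable" pattern
        swing = params["wobble"] * (1 if index % 2 else -1)
        noisy = expected + swing + rng.gauss(0.0, noise_sd)
        scores[week] = min(27, max(0, round(noisy)))  # clamp to valid range
    return scores
```

With a seeded `random.Random`, repeated calls yield reproducible trajectories; setting `noise_sd=0.0` exposes the underlying rule for each group.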

2.8.3. Generation of Clinical Assessment Scores

Clinical assessment was generated to show participant treatment progress. This was done by referencing the baseline PHQ-9 score. A baseline score was provided in Week 0, and following the PHQ-9 assessment timeline, clinical assessment was conducted biweekly. The assessment scores were determined by metrics such as change from baseline and percentage change from baseline. Additionally, trajectory groups were determined during the assessment period.
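
The two metrics named above, change from baseline and percentage change from baseline, can be computed as in the following sketch. The labelling thresholds are assumptions: a reduction of at least 50% is the conventional definition of response and a PHQ-9 score below 5 is a common remission criterion, while the 25% cut-off for partial response is one convention among several. This section does not state which thresholds the generator uses.

```python
def change_from_baseline(baseline: float, current: float) -> float:
    """Negative values indicate improvement (a lower PHQ-9 score)."""
    return current - baseline


def percent_reduction(baseline: float, current: float) -> float:
    """Percentage of the baseline score that has been lost."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - current) / baseline


def week12_label(baseline, week12, response_cut=50.0, partial_cut=25.0):
    """Assign a final label; both cut-offs are assumed, not from the paper."""
    reduction = percent_reduction(baseline, week12)
    if reduction >= response_cut:
        return "responder"
    if reduction >= partial_cut:
        return "partial_responder"
    return "non_responder"


def is_remission(week12: float, remission_cut: float = 5) -> bool:
    """Remission under the assumed PHQ-9 < 5 criterion."""
    return week12 < remission_cut
```

For a participant moving from 16 at baseline to 8 at Week 12, the change is -8 points and the reduction is 50%, which this sketch labels as a responder who has not reached remission.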

Clinical assessment was not based on improvement in PHQ-9 scores alone. Improvement in behavioral and physiological characteristics was also observed. Additionally, life events were included in the assessment. This was based on insights from clinical experts, who highlighted that a patient might initially respond well to treatment, but life-related stress can cause their condition to worsen and disrupt treatment progress.

2.8.4. Generation of Behavioral and Physiological Features

Daily behavioral and physiological variables were generated to reflect data collected through smartphones and wearables. The variables included sleep, physical activity, mobility, phone usage, and social interactions. The generation rules linked these variables to clinical states and treatment-response trajectories. For example, improved sleep quality, physical activity, mobility, and social interactions were linked to reduced PHQ-9 scores and better treatment-response trajectories. In contrast, non-response or worsening was associated with deterioration in sleep quality, increased sedentary patterns, increased night screen time, and irregular phone usage. This ensured that behavioral and physiological features were not generated independently of clinical progression but were aligned with plausible treatment-monitoring patterns.
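
A simple way to couple passive features to clinical state is to let each feature's expected value shift with the current PHQ-9 score and then add noise. The sketch below illustrates that idea under stated assumptions: the linear form, every coefficient, and the noise levels are hypothetical and are not the rules in the study's schema.

```python
import random


def daily_features(phq9, rng):
    """Generate one day of features whose expected values track PHQ-9.

    All coefficients are illustrative assumptions, not values from the paper.
    """
    severity = min(27, max(0, phq9)) / 27.0  # 0 = no symptoms, 1 = maximum
    return {
        # Higher symptom burden -> fewer steps, lower sleep efficiency and HRV
        "steps": max(0.0, rng.gauss(9000 - 6000 * severity, 800)),
        "sleep_efficiency": min(1.0, max(0.0, rng.gauss(0.88 - 0.25 * severity, 0.03))),
        "hrv": max(0.0, rng.gauss(55 - 25 * severity, 4)),
        # Higher symptom burden -> more sedentary time and night screen use
        "sedentary_minutes": min(1440.0, max(0.0, rng.gauss(480 + 300 * severity, 40))),
        "night_screen_minutes": max(0.0, rng.gauss(15 + 60 * severity, 10)),
    }
```

Feeding this function the PHQ-9 level implied by a participant's trajectory for each day would make behavioral change follow clinical change, which is the dependency structure the paragraph above describes.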

2.8.5. Introduction of Variability, Noise, and Missingness

Variability, noise, and missingness were intentionally incorporated into the synthetic data generation process to approximate real-world digital health data conditions. Participant-level variability was introduced through demographic, clinical, site-level, device-related, treatment, and adherence characteristics. Random noise was added to PHQ-9 trajectories, daily behavioral features, physiological signals, and adherence patterns to avoid overly deterministic data generation. Missingness was simulated for GPS and heart-rate-related variables through device non-wear, GPS outage, and passive dropout mechanisms, reflecting incomplete data capture commonly observed in smartphone and wearable sensing.
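
The three missingness mechanisms named above can be sketched as a masking step applied to complete daily records. The field names, the probabilities, and the modelling of passive dropout as a single cut-off day are assumptions made for illustration.

```python
import random

# Hypothetical field groupings; the actual schema may differ.
HEART_FIELDS = ("resting_hr", "hrv")
GPS_FIELDS = ("location_entropy", "time_at_home")


def apply_missingness(days, rng, p_nonwear=0.08, p_gps_outage=0.05, dropout_day=None):
    """Return masked copies of daily records.

    p_nonwear    : assumed daily probability the wearable is not worn
    p_gps_outage : assumed daily probability of a GPS outage
    dropout_day  : first day index of passive dropout (None = no dropout)
    """
    masked = []
    for day_index, record in enumerate(days):
        rec = dict(record)  # copy so the complete data stay untouched
        dropped_out = dropout_day is not None and day_index >= dropout_day
        hr_missing = dropped_out or rng.random() < p_nonwear
        gps_missing = dropped_out or rng.random() < p_gps_outage
        if hr_missing:
            for field in HEART_FIELDS:
                rec[field] = None
        if gps_missing:
            for field in GPS_FIELDS:
                rec[field] = None
        # Explicit indicators let downstream models treat missingness as a signal
        rec["hr_missing"] = hr_missing
        rec["gps_missing"] = gps_missing
        masked.append(rec)
    return masked
```

Keeping the indicators alongside the blanked values mirrors the inclusion of missingness as a composite domain in its own right.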

2.9. Dataset Validation and Quality Assessment

Prior to use in downstream ML experiments, the dataset underwent validation to assess whether rule-based interpretability was preserved alongside sufficient statistical and behavioural realism. Descriptive statistics were computed to characterise the range, central tendency, and variability of key variables, while distribution plots were examined to assess whether behavioural, physiological, and clinical variables exhibited plausible patterns. Correlations were inspected to determine whether inter-variable associations were consistent with established clinical and behavioural expectations, and longitudinal trajectory plots were reviewed to evaluate symptom progression and behavioural change across the observation period. Comparisons were conducted across the five trajectory groups (early responders, delayed responders, partial responders, non-responders, and unstable responders) to assess whether each exhibited a clinically distinct and internally coherent profile. Missingness patterns were also examined to verify that absent data occurred at realistic rates and reflected the conditional nature of certain variables rather than arbitrary gaps.

2.10. Ethical and Data Governance Considerations

The dataset was synthetically generated; hence, the study did not involve direct access to identifiable patient records. However, the synthetic dataset was designed to reflect clinically plausible patterns without being presented as a substitute for real-world clinical validation.

3. Results

3.1. Plausibility Assessment of the Generated Rule-Based Synthetic Dataset

The generated dataset was assessed to determine whether it was clinically interpretable, statistically plausible, and behaviorally coherent for treatment-response monitoring in MDD. The purpose of the assessment was not to establish the dataset as equivalent to real-world data, but rather to determine whether it is suitable for controlled modelling experiments.

The assessment focused on plausibility in relation to the synthetic dataset’s distribution, temporal patterns, clinical logic, variable correlation, responder group profiles, and missingness realism. This was intended to determine whether the synthetic dataset maintained clinical credibility without exhibiting overly deterministic patterns or discordant clinical relationships.

3.2. Dataset Structure and Composition

The generated dataset consists of five interconnected tables: participant-level attributes, daily behavioral and physiological monitoring, clinical assessments, weekly summary output, and final outcomes. The participant-level attributes table contains 1,000 synthetic participants, the daily monitoring table contains 85,000 entries reflecting repeated monitoring records, and the clinical assessment records contain 7,000 entries corresponding to PHQ-9 assessment schedules for each participant throughout the 12-week period.
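
The reported table sizes are consistent with the monitoring schedule, as the short check below shows. The reading of 85 daily records per participant as days 0 through 84 inclusive (12 weeks plus the baseline day) is an inference from the counts, not a statement in the text.

```python
N_PARTICIPANTS = 1000

# Assumption: daily records cover day 0 through day 84 inclusive.
DAYS_PER_PARTICIPANT = 12 * 7 + 1
# Biweekly assessments at Weeks 0, 2, ..., 12.
ASSESSMENTS_PER_PARTICIPANT = len(range(0, 13, 2))

daily_rows = N_PARTICIPANTS * DAYS_PER_PARTICIPANT
assessment_rows = N_PARTICIPANTS * ASSESSMENTS_PER_PARTICIPANT

print(daily_rows, assessment_rows)
```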

The generated participants exhibited a broad adult age distribution ranging from 18 to 65 years, with a mean age of 42.20 years and a standard deviation of 14.04 years. Baseline depression severity was represented by PHQ-9 scores ranging from 8 to 27, simulating participants with mild to severe depression. The observed mean baseline score was 14.85, with a standard deviation of 3.83. By Week 12, the mean PHQ-9 score had declined to 6.90, corresponding to a mean change from baseline of -7.95 points and a mean percentage reduction from baseline of 56.53%.
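
A brief arithmetic note may help when reading these summary figures. The mean change of -7.95 applied to the mean baseline of 14.85 reproduces the Week 12 mean of 6.90, but dividing one mean by the other gives roughly 53.5%, not 56.53%. The two are compatible if 56.53% is the average of per-participant percentages, which is an assumption here; the toy values below are invented solely to show that the two quantities generally differ.

```python
# Reported summary values
mean_baseline = 14.85
mean_change = -7.95

mean_week12 = mean_baseline + mean_change
ratio_of_means = 100.0 * -mean_change / mean_baseline

# Toy (invented) participants as (baseline, week-12) pairs
pairs = [(10, 2), (20, 12)]
mean_of_ratios = sum(100.0 * (b - w) / b for b, w in pairs) / len(pairs)
toy_ratio_of_means = 100.0 * sum(b - w for b, w in pairs) / sum(b for b, _ in pairs)

print(round(mean_week12, 2), round(ratio_of_means, 2))
print(mean_of_ratios, round(toy_ratio_of_means, 2))
```

In the toy case, both participants improve by 8 points, yet the mean of the individual percentages (60%) exceeds the percentage computed from the means (about 53.3%), because participants with lower baselines contribute larger relative reductions.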

3.3. Distribution Plausibility

Assessment of the distribution of the variables showed that they remained within plausible ranges. The assessed variables included PHQ-9 scores, sleep efficiency, WASO, steps, sedentary minutes, HRV, resting heart rate, time at home, location entropy, screen time, social contact variables, and missingness indicators.

Clinical variables displayed plausible ranges. PHQ-9 scores ranged from 0 to 27, while change from baseline ranged from -14.8 to 5.0, allowing the dataset to represent patients who exhibited improvement, partial improvement, and non-improvement. Plausible ranges were also observed for the sleep, physiology, daily behavior, HRV, activity, and mobility variables. For example, daily steps had a mean value of 6,254.70, sedentary time averaged 639.48 minutes, average HRV was 42.91, and resting heart rate averaged 71.79 bpm. Sleep efficiency had a mean of 0.721 (72.1% when expressed as a percentage), while WASO averaged 50.04 minutes.
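A range check of this kind can be expressed as a small validation routine. The sketch below is illustrative only: the column names are hypothetical, and apart from the PHQ-9 scale (0 to 27) and sleep efficiency as a proportion (0 to 1), the bounds are assumptions chosen for demonstration rather than the thresholds used in the study.

```python
# Column names and most bounds are assumptions for illustration.
PLAUSIBLE_RANGES = {
    "phq9": (0, 27),                 # PHQ-9 total score scale
    "sleep_efficiency": (0.0, 1.0),  # proportion of time in bed spent asleep
    "waso_min": (0, 480),            # assumed upper bound
    "steps": (0, 40000),             # assumed upper bound
    "sedentary_min": (0, 1440),      # cannot exceed the minutes in a day
    "resting_hr_bpm": (40, 120),     # assumed adult resting range
}


def out_of_range(records, ranges=PLAUSIBLE_RANGES):
    """Return (row_index, column, value) for every value outside its range.

    Missing values (None or absent keys) are skipped, because missingness
    is assessed separately from range plausibility.
    """
    violations = []
    for i, row in enumerate(records):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            if value is None:
                continue
            if not low <= value <= high:
                violations.append((i, column, value))
    return violations


rows = [
    {"phq9": 12, "sleep_efficiency": 0.72, "steps": 6254},
    {"phq9": 30, "sleep_efficiency": 1.2, "steps": None},
]
print(out_of_range(rows))
```

In the toy input, only the second row is flagged, once for the PHQ-9 score and once for sleep efficiency.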

3.4. Temporal Plausibility

The longitudinal assessment observed changes in behavioral and physiological patterns and their association with PHQ-9 scores over the 12-week period. At Week 12, the mean PHQ-9 score was 6.90, a decline from 14.93 at Week 0. This was accompanied by changes in behavioral and physiological indicators. Sleep efficiency increased from 0.713 to 0.751, steps increased from 5,300.95 to 7,355.92, and sedentary minutes decreased from 688.29 to 558.96. Average HRV increased from 39.34 to 47.71, while resting heart rate decreased from 73.27 bpm to 69.78 bpm.
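The Week 0 and Week 12 means reported above can be expressed as relative changes and compared with the direction expected under symptom improvement. The following sketch uses only the values stated in this section; the dictionary keys and the helper names are illustrative.

```python
# (Week 0 mean, Week 12 mean) as reported in this section.
WEEK0_WEEK12 = {
    "phq9": (14.93, 6.90),
    "sleep_efficiency": (0.713, 0.751),
    "steps": (5300.95, 7355.92),
    "sedentary_min": (688.29, 558.96),
    "hrv": (39.34, 47.71),
    "resting_hr_bpm": (73.27, 69.78),
}

# Expected direction of change with improvement: +1 increase, -1 decrease.
EXPECTED_DIRECTION = {
    "phq9": -1,
    "sleep_efficiency": +1,
    "steps": +1,
    "sedentary_min": -1,
    "hrv": +1,
    "resting_hr_bpm": -1,
}


def relative_change(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return 100.0 * (end - start) / start


def direction_matches(start: float, end: float, expected_sign: int) -> bool:
    """True if the observed change has the expected sign."""
    return (end - start) * expected_sign > 0


for name, (week0, week12) in WEEK0_WEEK12.items():
    pct = relative_change(week0, week12)
    ok = direction_matches(week0, week12, EXPECTED_DIRECTION[name])
    print(f"{name:18s} {pct:+6.1f}%  direction_ok={ok}")
```

All six indicators move in the expected direction, with the largest relative shifts in PHQ-9 and daily steps and the smallest in resting heart rate and sleep efficiency.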

3.5. Dependency Plausibility

The correlation analysis showed that the relationships between the variables were consistent with clinical, physiological, and behavioral assumptions. For example, PHQ-9 was negatively associated with sleep efficiency and steps, and positively associated with sedentary minutes and WASO. This suggests that the generator successfully represented clinically meaningful relationships between symptom burden and passive behavioral indicators. Similarly, the negative association between PHQ-9 and HRV, together with the positive association between PHQ-9 and resting heart rate, supports the intended link between symptom severity and autonomic regulation.
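The sign expectations described above lend themselves to an automated check. The sketch below computes Pearson correlations between PHQ-9 and each indicator and reports any whose sign contradicts the stated expectation. The column names and toy values are hypothetical, and the study's own analysis may have used a different correlation measure or tooling.

```python
from math import sqrt

# Expected sign of the correlation with PHQ-9, as described in the text.
EXPECTED_SIGN = {
    "sleep_efficiency": -1,
    "steps": -1,
    "hrv": -1,
    "sedentary_min": +1,
    "waso_min": +1,
    "resting_hr_bpm": +1,
}


def pearson(x, y):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def sign_mismatches(columns, target="phq9", expected=EXPECTED_SIGN):
    """Return (name, r) for indicators whose correlation sign is unexpected."""
    mismatches = []
    for name, sign in expected.items():
        if name not in columns:
            continue
        r = pearson(columns[target], columns[name])
        if r * sign <= 0:
            mismatches.append((name, r))
    return mismatches


# Hypothetical toy columns, for illustration only.
toy = {
    "phq9": [20, 15, 10, 5],
    "steps": [3000, 5000, 6000, 8000],
    "sedentary_min": [700, 650, 600, 500],
}
print(sign_mismatches(toy))  # []
```

A sign check is deliberately coarse: it confirms direction only and says nothing about whether the strength of each association is realistic.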

3.6. Responder-Group Plausibility

Responder groups were observed to behave consistently with clinical, physiological, and behavioral profiles. Early responders exhibited positive profiles, including higher sleep efficiency, lower WASO, higher step counts, lower sedentary time, higher HRV, greater mobility diversity, and lower night screen time. In contrast, non-responders showed less favorable profiles, including poorer sleep, lower activity, higher sedentary time, lower HRV, higher resting heart rate, more time at home, and lower location entropy. Delayed responders, partial responders, and unstable responders displayed mostly intermediate profiles.

3.7. Missingness and Data Quality Patterns

Missingness checks showed that missing data patterns were present but largely structured and interpretable. Most core behavioral and physiological variables had little or no missingness. Treatment review decisions showed missingness because they were only applicable at specific review points. Time-to-response had missingness for participants who did not meet response criteria. Medication and therapy adherence variables also showed missingness, likely reflecting differences in treatment modality or applicability.
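Structural missingness of this kind can be verified directly, because the missing values should follow from another field. As a minimal sketch, the check below tests the rule stated above for time-to-response: the value should be missing exactly when the participant did not meet response criteria. The field names `responder` and `time_to_response` are hypothetical and may differ from the published schema.

```python
def structural_missingness_violations(outcomes):
    """Return indices of rows that break the time-to-response rule.

    Rule: `time_to_response` is None if and only if `responder` is False.
    """
    violations = []
    for i, row in enumerate(outcomes):
        is_missing = row["time_to_response"] is None
        # A violation is a responder with no time-to-response,
        # or a non-responder with a recorded time-to-response.
        if is_missing == row["responder"]:
            violations.append(i)
    return violations


# Hypothetical toy outcome rows, for illustration only.
toy_outcomes = [
    {"responder": True, "time_to_response": 4},      # consistent
    {"responder": False, "time_to_response": None},  # consistent
    {"responder": True, "time_to_response": None},   # violation
    {"responder": False, "time_to_response": 6},     # violation
]
print(structural_missingness_violations(toy_outcomes))  # [2, 3]
```

The same pattern applies to treatment review decisions, where a value would be expected only at scheduled review points.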

4. Discussion

The study presents a clinically guided, rule-based, multimodal longitudinal synthetic dataset for MDD treatment-response monitoring. It integrates clinical, behavioral, and physiological indicators related to MDD. The behavioral and physiological variables observed include sleep efficiency, WASO, steps, sedentary minutes, HRV, resting heart rate, time at home, location entropy, screen time, and social contact. These variables were generated daily for each participant over a 12-week observation period. The PHQ-9 score was used as a clinical anchor to determine how an individual was responding to treatment. Correlations between symptom scores, behavioral indicators, and physiological indicators were examined to assess whether the generated variables followed clinically plausible and internally consistent patterns. The generated dataset was not developed to represent real data, but rather to support early-stage model development, pipeline testing, simulation, and reproducibility.

The generated dataset displayed internal plausibility. The variables stayed within expected ranges, and a decline in PHQ-9 trajectories was observed over time. Additionally, behavioral and physiological indicators changed in clinically interpretable ways. This was observed when a decline in PHQ-9 scores was associated with better sleep efficiency, reduced WASO, increased activity, shorter sedentary time, and increased mobility. Higher symptom scores were associated with poorer sleep, lower activity, higher sedentary time, lower HRV, reduced mobility, and lower social engagement.

These findings suggest that the dataset has potential to be utilised in controlled experimentation, simulation, early-stage modelling, and testing of Machine Learning (ML) and Federated Learning (FL) pipelines where real MDD data are difficult to access. This is supported by the rule-based approach employed in developing the dataset. Because the rule-based approach was informed by domain expertise and subject-specific literature, it allowed for transparency, reproducibility, clinical anchoring, longitudinal structure, and multimodality.

However, the generated dataset is not a substitute for real patient data. The patterns in the data reflect assumptions that were built into the rule-based generator. Although these assumptions are grounded in domain expertise and subject-specific literature, real patient data are not as clean or deterministic. Moreover, treatment response can be influenced by many factors, such as comorbidities, medication type and dosage, treatment side effects, adherence patterns, duration of illness, previous treatment history, socioeconomic conditions, social support, stressful life events, substance use, sleep disorders, physical health conditions, and access to care.

Overall, the generated rule-based synthetic dataset demonstrates strong internal plausibility for its intended purpose: providing a controlled, clinically guided, multimodal, longitudinal dataset for MDD treatment-response modelling. The dataset represents expected patterns of symptom change, behavioral activation, sleep disruption and recovery, mobility variation, physiological regulation, digital behavior, social engagement, and treatment-response heterogeneity.

However, the dataset should be presented as a research-enabling synthetic dataset, not as a replacement for real patient data. Its main value lies in supporting early-stage method development, testing modelling pipelines, exploring variable relationships, and preparing for privacy-preserving machine learning experiments in contexts where real mental health data are difficult to access.

Future work will focus on refining the generator using clinical expert feedback, published empirical distributions, greater heterogeneity, more realistic missingness mechanisms, and eventual comparison with real-world or clinically validated datasets. Additionally, the dataset will be employed in a federated learning setting to monitor treatment response using resource-constrained devices.

Table 3. Summary of the Main Strengths and Limitations of the Generated Rule-Based Synthetic Dataset.

Strengths	Limitations
Clinical interpretability: Through the PHQ-9 score, the dataset provides a clear clinical anchor for symptom severity and treatment-response monitoring.	Synthetic and rule-based: The results demonstrate internal plausibility, but they do not validate the dataset against actual real-world data.
Multimodal structure: The dataset goes beyond symptom scores to include behavioral and physiological indicators.	Potentially too coherent: In real-world mental health data, behavioral and physiological variables do not always align neatly with symptom improvement or worsening.
Longitudinal design: Observations are repeated throughout a 12-week treatment period.	Limited representation of real-world heterogeneity: The dataset does not fully represent the heterogeneity inherent in real-world populations.
Internal consistency: Distribution checks, correlation analysis, trajectory plots, and responder-group comparisons indicate that the generated data follow the intended clinical and behavioral logic.	Less complex than real-world digital phenotyping datasets: Smartphone and wearable data often contain irregular gaps, dropout, sensor-specific failure, and participant disengagement.
Reproducibility and controllability: Since the dataset is rule-based, the generation logic is transparent and can be adjusted, audited, or extended.	Limited external validity: Plausible ranges and trajectories support its use for simulation and modelling, but future validation against real-world or clinically reviewed datasets is necessary before drawing conclusions about clinical deployment.

5. Conclusion

This study introduces a clinically guided, multimodal longitudinal synthetic dataset designed to support research on MDD treatment-response monitoring. It combines PHQ-9 assessments with daily behavioral and physiological signals and offers a transparent and reproducible framework for developing and testing machine learning and federated learning workflows in settings where real-world mental health data are difficult to access. The plausibility assessment indicated that the generated variables followed reasonable ranges, coherent longitudinal trends, and internally consistent clinical-behavioral patterns. Nevertheless, as the dataset is synthetic and rule-based, its outputs reflect deliberate design assumptions rather than real patient data and should not be interpreted as clinical evidence. Improving realism in future iterations will require broader clinical input, empirical validation against real-world data, more sophisticated missingness mechanisms, and comparison with clinically reviewed datasets to strengthen external validity.

Supplementary Materials

The following supporting information can be downloaded at: https://github.com/Ngasomi/MDD_synthetic-data_generator.

Funding

This research is supported by the German Academic Exchange Service (DAAD) through the DAAD Doctoral Programme.

Data Availability Statement

The synthetic dataset, synthetic data schema, and code used for rule-based data generation and plausibility assessment are publicly available in a GitHub repository at: https://github.com/Ngasomi/MDD_synthetic-data_generator. The dataset is fully synthetic and does not contain real patient records or identifiable personal data.

Acknowledgments

During the preparation of this manuscript, the author used ChatGPT for grammar correction, language editing, and sentence clarity. Claude was used to generate a figure summarizing the literature-derived candidate variables for expert review. No generative AI tool was used to generate the study design, data, analysis, or interpretation. The author reviewed and edited all AI-assisted outputs and takes full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MDD	Major Depressive Disorder
EMA	Ecological Momentary Assessment
PHQ-9	Patient Health Questionnaire-9
MBC	Measurement-Based Care
TMAP	Texas Medication Algorithm Project
HRV	Heart Rate Variability

References

1. Erritzoe, D.; et al. ‘A short-acting psychedelic intervention for major depressive disorder: a phase IIa randomized placebo-controlled trial’. Nat. Med. 2026, vol. 32(no. 2), 591–598.
2. Malhi, G. S.; Mann, J. J. ‘Depression’. The Lancet 2018, vol. 392(no. 10161), 2299–2312.
3. Marx, W.; et al. ‘Major depressive disorder’. Nat. Rev. Dis. Primer 2023, vol. 9(no. 1), 44.
4. Santomauro, D. F.; Vos, T.; Whiteford, H. A.; Chisholm, D.; Saxena, S.; Ferrari, A. J. ‘Service coverage for major depressive disorder: estimated rates of minimally adequate treatment for 204 countries and territories in 2021’. Lancet Psychiatry 2024, vol. 11(no. 12), 1012–1021.
5. Kim, H.-Y.; et al. ‘Predictors of Remission in Acute and Continuation Treatment of Depressive Disorders’. Clin. Psychopharmacol. Neurosci. 2021, vol. 19(no. 3), 490–497.
6. Rush, J.; et al. ‘Acute and Longer-Term Outcomes in Depressed Outpatients Requiring One or Several Treatment Steps: A STAR*D Report’. Am. J. Psychiatry 2006, vol. 163(no. 11), 1905–1917.
7. Aboraya; et al. ‘Measurement-based Care in Psychiatry-Past, Present, and Future’. Innov. Clin. Neurosci. 2018, vol. 15(no. 11–12), 13–26.
8. Trivedi, M. H.; Daly, E. J. ‘Treatment strategies to improve and sustain remission in major depressive disorder’. Dialogues Clin. Neurosci. 2008, vol. 10(no. 4), 377–384.
9. Trivedi, M. H.; et al. ‘Evaluation of Outcomes With Citalopram for Depression Using Measurement-Based Care in STAR*D: Implications for Clinical Practice’. Am. J. Psychiatry 2006, vol. 163(no. 1), 28–40.
10. Taliaz, D.; Souery, D. ‘A New Characterization of Mental Health Disorders Using Digital Behavioral Data: Evidence from Major Depressive Disorder’. J. Clin. Med. 2021, vol. 10(no. 14), 3109.
11. Zierer; Behrendt, C.; Lepach-Engelhardt, A. C. ‘Digital biomarkers in depression: A systematic review and call for standardization and harmonization of feature engineering’. J. Affect. Disord. 2024, vol. 356, 438–449.
12. Aledavood, T.; et al. ‘Multimodal Digital Phenotyping Study in Patients With Major Depressive Episodes and Healthy Controls (Mobile Monitoring of Mood): Observational Longitudinal Study’. JMIR Ment. Health 2025, vol. 12, e63622.
13. Busshart, L.; Petrovic, M.; Amin, R.; Hegerl, U. ‘Distinguishing Common Digital Phenotyping and Self-Report Parameters for Monitoring and Predicting Depression: Scoping Review’. JMIR MHealth UHealth 2026, vol. 14, e70840.
14. Pezoulas, V. C.; et al. ‘Synthetic data generation methods in healthcare: A review on open-source tools and methods’. Comput. Struct. Biotechnol. J. 2024, vol. 23, 2892–2910.
15. Adams, T.; et al. ‘On the fidelity versus privacy and utility trade-off of synthetic patient data’. iScience 2025, vol. 28(no. 5), 112382.
16. Pezoulas, V. C.; et al. ‘Synthetic data generation methods in healthcare: A review on open-source tools and methods’. Comput. Struct. Biotechnol. J. 2024, vol. 23, 2892–2910.
17. Giuffrè, M.; Shung, D. L. ‘Harnessing the power of synthetic data in healthcare: innovation, application, and privacy’. npj Digit. Med. 2023, vol. 6(no. 1), 186.
18. Kroenke, K.; Spitzer, R. L.; Williams, J. B. ‘The PHQ-9: validity of a brief depression severity measure’. J. Gen. Intern. Med. 2001, vol. 16(no. 9), 606–613.
19. Leaning, E.; et al. ‘From smartphone data to clinically relevant predictions: A systematic review of digital phenotyping methods in depression’. Neurosci. Biobehav. Rev. 2024, vol. 158, 105541.
20. Vignapiano; et al. ‘A narrative review of digital biomarkers in the management of major depressive disorder and treatment-resistant forms’. Front. Psychiatry 2023, vol. 14, 1321345.
21. Jung, H. W.; et al. ‘Key Features of Digital Phenotyping for Monitoring Mental Disorders: Systematic Review’. J. Med. Internet Res. 2025, vol. 27, e77331.
22. Martinez-Martin, N.; Greely, H. T.; Cho, M. K. ‘Ethical Development of Digital Phenotyping Tools for Mental Health Applications: Delphi Study’. JMIR MHealth UHealth 2021, vol. 9(no. 7), e27343.
23. Oudin; et al. ‘Digital Phenotyping: Data-Driven Psychiatry to Redefine Mental Health’. J. Med. Internet Res. 2023, vol. 25, e44502.
24. Giuffrè, M.; Shung, D. L. ‘Harnessing the power of synthetic data in healthcare: innovation, application, and privacy’. npj Digit. Med. 2023, vol. 6(no. 1), 186.
25. Pezoulas, V. C.; et al. ‘Synthetic data generation methods in healthcare: A review on open-source tools and methods’. Comput. Struct. Biotechnol. J. 2024, vol. 23, 2892–2910.
26. Mendes, M.; Barbar, A.; Refaie, M. ‘Synthetic data generation: a privacy-preserving approach to accelerate rare disease research’. Front. Digit. Health 2025, vol. 7, 1563991.
27. Qian, Z.; Callender, T.; Cebere, B.; Janes, S. M.; Navani, N.; Van Der Schaar, M. ‘Synthetic data for privacy-preserving clinical risk prediction’. Sci. Rep. 2024, vol. 14(no. 1), 25676.
28. Loni, M.; Poursalim, F.; Asadi, M.; Gharehbaghi, A. ‘A review on generative AI models for synthetic medical text, time series, and longitudinal data’. npj Digit. Med. 2025, vol. 8(no. 1), 281.
29. Zhan, Y.; Liu, H.; Wang, Y. ‘Digital phenotyping of depression: A multi-modal passive sensing approach to identifying novel behavioral and physiological markers of treatment response’. J. Psychiatr. Res. 2026, vol. 194, 40–50.
30. Bufano, P.; Laurino, M.; Said, S.; Tognetti, A.; Menicucci, D. ‘Digital Phenotyping for Monitoring Mental Disorders: Systematic Review’. J. Med. Internet Res. 2023, vol. 25, e46778.
31. Shen, S.; et al. ‘Passive Sensing for Mental Health Monitoring Using Machine Learning With Wearables and Smartphones: Scoping Review’. J. Med. Internet Res. 2025, vol. 27, e77066.
32. Jung, H. W.; et al. ‘Key Features of Digital Phenotyping for Monitoring Mental Disorders: Systematic Review’. J. Med. Internet Res. 2025, vol. 27, e77331.
33. Taliaz, D.; Souery, D. ‘A New Characterization of Mental Health Disorders Using Digital Behavioral Data: Evidence from Major Depressive Disorder’. J. Clin. Med. 2021, vol. 10(no. 14), 3109.

Figure 1. Common parameters for digital phenotyping in depression monitoring based on Busshart et al. [13].

Figure 2. Literature-derived candidate variables used for expert review. The figure was generated using Claude to visually summarize candidate variables identified from the digital phenotyping literature.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.