Preprint
Article

This version is not peer-reviewed.

Technology-Enhanced Training for Prehospital Mass-Casualty Incident Preparedness: A Scoping Review

Submitted:

30 April 2026

Posted:

01 May 2026

Abstract
The use of technology-enhanced training for prehospital mass-casualty incident (MCI) preparedness has grown quickly, but there has been no comprehensive overview of how these technologies operate throughout the training process or how competencies are evaluated. This scoping review, conducted as part of the MCIPHER (Mass-Casualty Incident Prehospital Emergency Response) project, followed the Arksey and O'Malley framework and PRISMA-ScR guidelines. We searched seven databases and additional sources, screened 2,105 records, and included 28 studies published from 2015 to 2025. Virtual reality was the most common method (43%), followed by hybrid approaches (29%) and screen-based simulations (21%). We identified five key analytical constructs. Three were derived from the data: the Technology Function Spectrum revealed that half of the studies used dual-purpose platforms for both training and performance assessment; the Data Capture Architecture linked embedded data collection to advanced learning outcomes (L2+); and the Pedagogical Transparency Gap showed that 75% of studies did not specify a training design framework. Two other constructs — the Immersion-Evaluation Paradox and the Scalability-Rigor Tension — suggest areas for future research. Using a modified Kirkpatrick framework with an L2+ (Applied Learning) sub-level, 56% of completed studies demonstrated applied learning through embedded performance assessments. Overall, these findings suggest that investments in MCI preparedness should focus more on measurement capabilities than immersion, incorporate assessment into training platforms, and work to reduce geographic and resource disparities.

1. Introduction

Mass-casualty incidents (MCIs) are among the most complex and mentally demanding situations faced by prehospital care teams. During these events, responders must quickly evaluate and triage many patients, manage limited resources, communicate across various agencies, and make crucial decisions in unstable and risky environments [1]. While MCIs are rare, their impact is significant. Poor preparedness can cause preventable injuries, deaths, and system failures that affect healthcare and communities [2]. Therefore, disaster training for prehospital personnel is essential for strengthening health system resilience and reducing disaster risks worldwide [3].
Although MCI preparedness is crucial, traditional disaster education methods pose notable challenges. Didactic lectures, tabletop drills, and live simulations are still core training tools, but they tend to be resource-heavy, logistically complicated, hard to standardize, and difficult to scale or repeat, especially in remote or resource-limited areas [4]. Additionally, opportunities for repeated realistic MCI practice are scarce because of ethical, safety, and cost issues related to large-scale exercises. This training gap results in many prehospital systems being insufficiently prepared for mass-casualty emergencies.
Recent progress in educational technology has opened up new opportunities to bridge this gap. Tools such as virtual reality (VR), augmented and mixed reality (AR/MR), serious games, screen-based simulations, wearable sensors, mobile apps, and digital communication platforms are increasingly used in emergency and disaster medicine training [5,6]. These technologies offer immersive or interactive experiences of disaster scenarios, enabling learners to practice complex situations safely and providing digital records of their actions for feedback, reflection, and assessment. Technology here goes beyond immersive simulation, also including tools for remote learning, performance tracking, communication, and training assessment [7]. While these approaches have great potential to improve accessibility, standardization, and scalability of MCI training, significant evidence gaps still exist.
A recent systematic review analyzed the use of extended reality technologies, like VR and AR, for training first responders in MCIs, highlighting immersive tools [7]. Although it offered valuable insights, the review did not explore the wider variety of technologies involved in disaster education. It also did not assess how training frameworks, assessment methods, or evaluation models are reported across different studies. Consequently, key questions remain unanswered: How is technology practically applied for training in MCI preparedness? What specific educational functions does it fulfill? How are learning and performance results defined and measured? And what barriers or challenges influence its adoption and scalability?
Improving prehospital training capacity is a key focus in global disaster risk reduction efforts. The Sendai Framework for Disaster Risk Reduction 2015–2030 highlights the importance of enhanced disaster preparedness (Priority 4), advocating for investments in training, exercises, and evidence-based systems [60]. Likewise, the World Health Organization stresses workforce capacity building, simulation exercises, and train-the-trainer programs as essential to national health emergency preparedness [61]. Technology-driven methods could support these aims by improving accessibility, standardization, and scalability of MCI training. However, evidence on how these technologies function throughout the training process and whether they lead to measurable skill improvements remains limited.
Scoping reviews are well suited to such questions because they systematically chart diverse literature, clarify key concepts, and identify evidence gaps without requiring a synthesis of effectiveness [8]. This review was conducted within the MCIPHER (Mass-Casualty Incident Prehospital Emergency Response) project, a research initiative aimed at improving disaster medicine education and prehospital preparedness. Companion MCIPHER reviews address other aspects of MCI preparedness training: curricular design and competency frameworks, and tabletop and simulation exercises. Collectively, these reviews offer a comprehensive evidence synthesis to guide research priorities, curriculum development, and policy-making in MCI preparedness training.
This scoping review aims to systematically chart how technology is used for training and evaluating prehospital MCI preparedness. Specifically, we seek to (1) identify the technologies used in MCI training and describe their educational roles, such as training delivery, learner interaction, assessment, feedback, and debriefing methods; (2) map reported learner outcomes and measurement strategies, including alignment with frameworks like Kirkpatrick levels; and (3) summarize reported gaps, barriers, and limitations that affect the design, implementation, and scalability of technology-based MCI training. These findings aim to guide future research, support evidence-informed curriculum development, and assist policymakers in enhancing disaster preparedness within prehospital care systems.

2. Materials and Methods

2.1. Study Design and Methodological Framework

This scoping review was part of the MCIPHER (Mass-Casualty Incident Prehospital Emergency Response) project, which aims to enhance disaster medicine education and prehospital readiness. We adhered to the methodological framework by Arksey and O’Malley [9], further refined by Levac et al. [10], and reported our results following the PRISMA-ScR guidelines for scoping reviews [11].
We prospectively developed the protocol, registered it on Protocols.io, and published it on preprints.org to ensure transparency and reproducibility. We did not conduct the optional stakeholder consultation stage, although the extended Arksey and O’Malley framework recommends it, because the review aimed to map existing literature rather than gather experiential perspectives from stakeholders.

2.2. Stage 1: Identifying the Research Questions

The review was guided by one primary and three secondary research questions.

2.2.1. Primary Research Question

What technologies have been used to train and assess healthcare learners and professionals in prehospital mass-casualty incident (MCI) preparedness, and what outcomes have been reported in association with these interventions?

2.2.2. Secondary Research Questions

  • What learner outcomes are reported (such as knowledge acquisition, triage accuracy and performance, clinical decision-making, time-based performance metrics, stress, or cognitive workload), and how do these outcomes correspond with Kirkpatrick’s levels of evaluation?
  • What measurement approaches and instruments are employed (for example, knowledge assessments, performance metrics, structured observational tools, system-generated logs, physiological measures, or usability and acceptability scales)?
  • Which training design frameworks are reported, and how explicitly are they described within the included studies?
We organized these questions based on the Population–Concept–Context (PCC) framework. The population included healthcare professionals and trainees; the concept focused on technology-enhanced training or education; and the context was prehospital mass-casualty incidents.

2.3. Stage 2: Identification of Relevant Studies

We developed the search strategy on June 13, 2025, using combinations of keywords and Medical Subject Headings (MeSH). Preliminary scoping searches and review of the MeSH database in PubMed guided our selection of terms. We structured search terms around three core conceptual components.
  • Population (for example, “first responders”, “medical intern”, “healthcare provider”, “Paramedics”[MeSH], “Health Personnel”[MeSH], “Nurses”[MeSH], “Physicians”[MeSH])
  • Technology/Concept (for example, eye-tracking, virtual reality, augmented reality, mixed reality, biometric sensors, wearable devices, digital monitoring, serious games, mobile applications, artificial intelligence, immersive environments, haptic technology, smart wearables, performance dashboards)
  • Disaster/MCI context (for example, “mass casualty incident”, disaster, earthquake, flood, traffic accident, CBRN/CBRNe, terrorist attack, wildfire, multiple trauma, human-made disasters)
We developed tailored search strategies for PubMed, Embase, Scopus, CINAHL, PsycINFO, the Cochrane Library, and ClinicalTrials.gov. These strategies combined controlled vocabulary (such as MeSH terms) with free-text terms using Boolean operators to balance search sensitivity and specificity. The complete search strategies for each database are included in Appendix S1.
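As an illustration of how the three concept blocks can be combined, the following minimal Python sketch joins synonyms within a block with OR and joins the blocks with AND. The term lists are a small subset of the examples given above, and the function name `build_query` is hypothetical. The strategies actually run in each database are those reported in Appendix S1, which also use database-specific syntax that this sketch does not reproduce.

```python
from typing import Sequence

# Illustrative only: a subset of the example terms listed in Section 2.3.
# The strategies actually run in each database are reported in Appendix S1.
POPULATION = ['"first responders"', '"Paramedics"[MeSH]', '"Nurses"[MeSH]']
TECHNOLOGY = ['"virtual reality"', '"augmented reality"', '"serious games"']
CONTEXT = ['"mass casualty incident"', "disaster", "earthquake"]


def build_query(*blocks: Sequence[str]) -> str:
    """Join terms within each block with OR, then join the blocks with AND."""
    grouped = ["(" + " OR ".join(terms) + ")" for terms in blocks]
    return " AND ".join(grouped)


query = build_query(POPULATION, TECHNOLOGY, CONTEXT)
print(query)
```

Grouping each concept in parentheses before applying AND keeps the OR lists from binding to neighboring blocks, which is the usual reason for structuring a search around separate concept components.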
The initial search was conducted in June 2025, and the strategy was peer-reviewed following PRESS (Peer Review of Electronic Search Strategies) guidelines [14]. We updated the searches before the final analysis to capture newly published studies. We also manually screened Google Scholar and the reference lists of included articles and relevant systematic reviews to find additional eligible records.

2.4. Stage 3: Selection of Studies

We imported all retrieved references into Covidence (Veritas Health Innovation, Melbourne, Australia), an online platform for managing systematic and scoping reviews [15]. Covidence was used to identify and remove duplicates, as well as support title and abstract screening and full-text review.

2.4.1. Eligibility Criteria

We established eligibility criteria based on the PCC framework and applied them consistently during the screening process.
Inclusion criteria
  • Population: Medical first responders, paramedics, emergency medical technicians (EMTs), physicians, nurses, medical interns, residents, and students enrolled in health-related university programs (for example, medicine, nursing, EMS).
  • Concept: Studies describing, implementing, or evaluating educational or training programs that use any form of technology within the context of MCIs. Eligible technologies included virtual, augmented, or mixed reality (VR/AR/MR), serious games, mobile applications, wearable devices, artificial intelligence-based systems, and high-fidelity simulations incorporating digital components.
  • Context: Training or simulation activities that replicate prehospital settings, including field triage, ambulance scenarios, roadside emergencies, or disaster zones.
  • Study design: Original empirical research and study protocols employing quantitative, qualitative, or mixed-methods designs.
  • Other sources: Grey literature was eligible if it contained sufficient methodological detail for appraisal and data extraction.
Exclusion criteria
  • Studies conducted exclusively in in-hospital environments (for example, emergency departments, intensive care units, operating rooms) without a prehospital or extramural component.
  • Studies focused solely on non-disaster emergency care, routine trauma management, or non-MCI clinical procedures.
  • Studies involving only non-healthcare populations (for example, firefighters, police, military, laypersons), unless data pertaining to healthcare professionals were reported separately.
  • Studies published before 2015 or not available in English.
  • Secondary research, including systematic reviews, narrative reviews, and scoping reviews; these were used only for reference list screening.

2.4.2. Study Selection Process

Study selection was conducted in two phases within Covidence. First, two reviewers independently evaluated titles and abstracts against the predefined inclusion and exclusion criteria. Records deemed potentially eligible by both reviewers moved on to full-text review.
Second, two reviewers independently evaluated full-text articles for eligibility. Disagreements were addressed through discussion until consensus was reached. If the reviewers could not agree, a third reviewer made the final decision.
We recorded the reasons for exclusion during the full-text review. The PRISMA flow diagram illustrates the entire study selection process, showing the number of records identified, screened, evaluated for eligibility, and finally included in the synthesis.

2.5. Stage 4: Data Charting

We created a standardized data extraction form and tested it with five studies identified as relevant to technology-enhanced training in healthcare and emergency environments. This pilot phase allowed us to improve the data charting framework to gather important methodological, educational, and technological variables across various study designs and implementation contexts.
The final data extraction form included: study identification, publication year, country, study design, population type, purpose of technology use (training, assessment, or both), technological modality, training design and evaluation frameworks, operational frameworks, data collection instruments, timing of data collection, data type (quantitative, qualitative, or mixed), and reported outcomes mapped to Kirkpatrick levels.

2.5.1. Kirkpatrick Levels of Evaluation

To categorize reported outcomes, we used the four-level Kirkpatrick Evaluation Model [16], a popular framework that organizes training effects from participant satisfaction (Level 1) to knowledge acquisition (Level 2), workplace behavior change (Level 3), and organizational impact (Level 4).
Level 1: Reaction
Outcomes that reflect participants’ immediate reactions to the training, such as satisfaction, perceived engagement, relevance, or confidence reported after the intervention.
Level 2: Learning
Outcomes that assess knowledge acquisition, skill development, or attitudinal change, usually measured through structured methods such as quizzes, written tests, performance checklists, or validated questionnaires. Self-perceived learning was classified as Level 2 when it was framed as learning gain rather than satisfaction. Eye-tracking results were categorized as Level 2 when interpreted as related to attention or situational awareness.
Level 2+: Applied Learning
We introduced a sub-level between Level 2 and Level 3 to distinguish simulation-based performance assessment from real-world behavioral transfer. We classified outcomes as Level 2+ when they met all three of the following criteria: (C1) assessment occurred during or was embedded within the exercise, (C2) performance was measured against an external standard or rubric rather than self-report, and (C3) outcomes reflected the integrated application of knowledge to realistic decisions. This distinction recognizes that simulation-based performance assessment, while not equivalent to real-world transfer (Level 3), represents a more methodologically robust form of competency measurement than traditional knowledge tests. In prehospital MCI training, opportunities for Level 3 assessment (evaluating actual performance during real disasters) are limited for ethical and safety reasons. L2+ therefore serves as a practical proxy: it shows that learners not only gained knowledge but also applied it to make realistic decisions under pressure within an authentic scenario. The three criteria ensure that the outcome measured reflects competency demonstration rather than mere knowledge recall. This approach aligns with the companion MCIPHER Tabletop scoping review and allows for detailed interpretation of evidence strength across the learning continuum.
Level 3: Behavior
Outcomes reflect the application of learned skills in real-world practice. Because Level 3 is ideally assessed in actual work environments, we used a modified Kirkpatrick approach: we categorized observable simulation-based behaviors (such as triage accuracy, triage time, protocol adherence, communication behaviors) within technology-enhanced settings as practical proxies when real-world exposure is rare [17,18]. We viewed these outcomes as early signs of potential transfer rather than conclusive evidence of actual workplace behavior change.
Level 4: Results
Outcomes demonstrating system-level or organizational impact, such as better disaster response metrics, fewer errors, or increased operational efficiency. We did not assign studies to this level unless they reported measurable organizational outcomes.
When an outcome could potentially fit multiple Kirkpatrick levels, we assigned it to the level that most accurately reflected its main purpose and measurement method.
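The Level 2+ rule defined above is a strict conjunction of criteria C1 to C3. A minimal Python sketch makes the logic explicit. The class and field names are hypothetical, and the two example records are invented for illustration rather than drawn from the included studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeRecord:
    """One charted outcome; field names are hypothetical labels for C1 to C3."""
    embedded_in_exercise: bool    # C1: assessed during or within the exercise
    external_standard: bool       # C2: scored against a standard or rubric, not self-report
    integrated_application: bool  # C3: knowledge applied to realistic decisions


def is_level_2_plus(outcome: OutcomeRecord) -> bool:
    """Return True only when all three criteria are met."""
    return (
        outcome.embedded_in_exercise
        and outcome.external_standard
        and outcome.integrated_application
    )


# Invented examples: a logged in-scenario triage score versus a post-course quiz.
logged_triage = OutcomeRecord(True, True, True)
post_course_quiz = OutcomeRecord(False, True, False)

print(is_level_2_plus(logged_triage))
print(is_level_2_plus(post_course_quiz))
```

Because the rule is all-or-nothing, an outcome that meets only one or two criteria remains at Level 2 in this sketch. Judgment calls about an outcome's main purpose, as described in the preceding paragraph, are not captured here.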

2.5.2. Framework Extraction

During data charting, we extracted and categorized all named or explicitly described frameworks, models, and structured approaches that influenced (1) how training was designed, (2) how outcomes were assessed, and (3) which operational disaster-response models were taught or implemented. Included studies reported either broad evaluation models or only specific measurement tools. Therefore, we used an inclusive definition: we captured both high-level assessment frameworks and standardized instruments when they served as the main method of measuring outcomes.
Two reviewers independently extracted data from each included study. We compared the datasets and resolved discrepancies through discussion, consulting a third reviewer when needed. We used Microsoft Excel to manage, organize, and summarize the extracted data.

2.6. Stage 5: Collating, Summarizing, and Reporting the Results

We organized the extracted data into the following categories: general study characteristics; study design and participant groups; purpose and type of technology-enhanced training; outcomes classified using the Kirkpatrick model and L2+ sub-level; training design, assessment, and operational frameworks; and data collection instruments, assessment timing, and data types.

2.6.1. Data Synthesis Approach

We used descriptive and narrative methods aligned with scoping review standards. We summarized quantitative data (such as study counts by technology type, sample size, participant groups, and outcome areas) using descriptive statistics, mainly frequencies and proportions. We combined qualitative and contextual information (including descriptions of training design, implementation details, and evaluation methods) through narrative thematic synthesis, analyzing findings within and across different domains.
We included both completed studies and protocols or registrations. Record-based frequencies used a single denominator (all included records) for descriptive elements available across both study types, such as technology modality, study design, and reported frameworks or instruments. Outcome mapping, including Kirkpatrick alignment, was performed only for completed studies that reported measurable outcomes; protocols without outcomes were excluded from this component.

2.6.2. Development of Analytical Constructs

We compared and grouped studies repeatedly, identifying patterns, similarities, and differences. This process uncovered five analytical constructs that define the technology-enhanced training landscape.
  • Technology Function Spectrum (corpus-derived): We categorized technologies along a spectrum from delivery-only (content transmission without integrated assessment), through dual-use (simultaneous training and assessment), to assessment-dedicated (designed mainly for performance measurement).
  • Data Capture Architecture (corpus-derived): We characterized how outcomes were recorded: embedded within the technology (system-generated metrics), external to the technology (observer-rated or manual collection), or multimodal (combining multiple data sources).
  • Pedagogical Transparency Gap (corpus-derived): We assessed how explicitly studies reported their training design frameworks, assessment frameworks, and the operational models taught. Studies with high transparency detailed all three areas; those with low transparency omitted or only loosely linked one or more.
  • Immersion-Evaluation Paradox (hypothesis-generating): We observed an apparent tension between the level of immersion in the training setting and the rigor of assessment. High-immersion technologies often lacked strong evaluation methods, while standardized assessment tools sometimes operated in less immersive environments.
  • Scalability-Rigor Tension (hypothesis-generating): We identified recurring trade-offs between technology sophistication and practical implementation capacity. Complex, high-fidelity systems enable rigorous measurement but require significant resources; simpler technologies are more implementable but provide less granular outcome data.
The first three constructs were identified based on their systematic presence or absence within the corpus and represent observable features of reported interventions. The final two constructs emerged as patterns that could not be fully explained by corpus features alone and need to be tested through future implementation studies.
This iterative synthesis allowed for systematic mapping of the range and nature of technology-enhanced training methods for prehospital MCI responses, including the identification of evidence gaps, methodological limitations, and promising areas for future research.

2.7. Ethics

This scoping review gathered data from previously published and publicly available sources without direct contact with human participants. Formal ethical approval was not necessary. We conducted the review following established standards for ethical research, including transparent reporting of methods, accurate citation of original work, and avoiding plagiarism or duplicate publication. We identified and managed potential conflicts of interest within the research team in accordance with institutional policies.

3. Results

3.1. Literature Search and Screening

We identified 2,652 records from database searches (Embase, PubMed, Scopus, CINAHL, Cochrane Library, PsycINFO, and ClinicalTrials.gov) and 9 additional records from other sources. After removing 556 duplicates, we screened 2,105 records by title and abstract. We sought 173 full-text reports, of which 25 could not be retrieved. We assessed the remaining 148 reports for eligibility and excluded 120 because of ineligible setting, concept, context, population, or study design, or because they were not in English. We included 28 studies in the final synthesis [19–46]. Figure 1 shows the PRISMA flow diagram.
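The counts reported above are internally consistent, which can be checked with a few lines of Python. The variable names are illustrative; the numbers are those given in the text.

```python
# Record counts as reported in Section 3.1.
database_records = 2652
other_source_records = 9
duplicates_removed = 556
reports_sought = 173
reports_not_retrieved = 25
reports_excluded = 120

screened = database_records + other_source_records - duplicates_removed
assessed = reports_sought - reports_not_retrieved
included = assessed - reports_excluded

print(screened, assessed, included)  # 2105 148 28
```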

3.2. Study Characteristics

The 28 included studies were mainly published within the past three years. The United States contributed the most publications (7 studies), followed by Sweden (3 studies), while Canada, China, Germany, and Iran each contributed two studies. The remaining studies originated from Belgium, the Czech Republic, England, Hong Kong, Jordan, Saudi Arabia, South Korea, Taiwan, Thailand, and Turkey. Based on World Bank income classifications, 20 studies (71%) were conducted in high-income countries and 8 (29%) in upper-middle-income countries (China, Iran, Thailand, Turkey, Jordan). No studies were conducted in low- or lower-middle-income countries (Table 1).
Study designs varied considerably. Randomized controlled trials (RCTs), mixed-methods designs, and quasi-experimental designs were each used in six studies (21.4% each). Three studies (10.7%) were RCT protocols, and three (10.7%) used pre-post designs. The remaining studies comprised one cross-sectional study, one post-test-only intervention study, one prospective cohort study, and one usability study (3.6% each) (Table 2).
Participants came from various healthcare roles. Nursing students made up the largest group (32.1% of studies), followed by paramedics, physicians, and nurses (25% each). The physician and trainee groups included pediatric emergency medicine specialists, emergency medicine residents, anesthesiologists, and critical care providers. The diverse specialties highlight the wide relevance of MCI training across healthcare systems.

3.3. Technologies and Their Educational Functions

Twelve studies (42.9%) used virtual reality (VR) as the primary technology modality. Eight studies (28.6%) used hybrid methods that combined multiple formats, such as video, simulation, and in-person approaches. Six studies (21.4%) used screen-based simulations. Two studies (7.1%) involved hands-on or fully in-person, technology-supported training. Detailed specifications and applications for each study are listed in Table 3, with additional classifications provided in Appendix S2.

3.4. Technology Function Spectrum

Beyond modality, we categorized how technology operated throughout the training cycle. We identified three separate operational models along with a hybrid category.
Delivery-only systems (8 studies: ChiCTR2300072282, Hosseini 2023(2), Kyoung, NCT06253156, NCT06034184, Hermann, Alhawatmeh, Hosseini 2023(1)) served as content delivery channels without integrated performance capture. These systems showed scenarios, provided instructional content, or delivered simulations, but depended on external questionnaires or separate assessments to measure outcomes. Examples include VR glasses used only for scenario presentation and computer-based e-learning platforms.
Dual-use systems (14 studies: Cicero 2017, Chumvanichaya, Way, Heldring 2025, Shujuan, Jain, Cicero 2019, Baetzner, Hu, Bauchwitz, Chang, Foronda, Heldring 2024, Chevalier) operated simultaneously as training delivery and data collection tools. The technology platform itself recorded learner actions, decisions, timing, and performance metrics. Examples include VR systems that logged triage decisions and timing, game platforms that automatically captured learning events, and screen-based simulations that tracked user interactions through system logs. This integration allowed for real-time feedback, performance analytics, and proof of learning within a single platform.
Assessment-dedicated systems (2 studies: Wetherell, Sibley) were used mainly for measuring performance rather than delivering instruction. One used wearable technology (smartwatches) to track physiological stress responses during separate training sessions; the other used an aerial drone system for scene size-up and hazard identification without providing explicit instructional content.
Hybrid or multi-modal systems (4 studies: Goldberg, McCoy, Lochmannová, Bajow) integrated elements across these categories, combining delivery, data capture, and assessment through multiple coordinated technologies. These studies represent the most complex technology configurations in the corpus.
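The four groupings above account for all 28 included records. The following Python sketch tallies the study labels exactly as listed in this section and derives each category's share. The dictionary name and structure are illustrative, and the labels are short identifiers rather than full citations.

```python
# Study labels as listed in Section 3.4 (short labels, not full citations).
FUNCTION_SPECTRUM = {
    "delivery-only": [
        "ChiCTR2300072282", "Hosseini 2023(2)", "Kyoung", "NCT06253156",
        "NCT06034184", "Hermann", "Alhawatmeh", "Hosseini 2023(1)",
    ],
    "dual-use": [
        "Cicero 2017", "Chumvanichaya", "Way", "Heldring 2025", "Shujuan",
        "Jain", "Cicero 2019", "Baetzner", "Hu", "Bauchwitz", "Chang",
        "Foronda", "Heldring 2024", "Chevalier",
    ],
    "assessment-dedicated": ["Wetherell", "Sibley"],
    "hybrid": ["Goldberg", "McCoy", "Lochmannová", "Bajow"],
}

# Total number of records across the four categories.
total = sum(len(studies) for studies in FUNCTION_SPECTRUM.values())

# Percentage share of each category, rounded to one decimal place.
shares = {
    category: round(100 * len(studies) / total, 1)
    for category, studies in FUNCTION_SPECTRUM.items()
}

for category, share in shares.items():
    print(f"{category}: {len(FUNCTION_SPECTRUM[category])} studies ({share}%)")
```

Keeping the labels in one mapping also makes it easy to confirm that no record is assigned to more than one category.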

3.5. Data Capture Architecture: Embedded, External, and Multimodal Approaches

How the technology captured performance data directly affected the depth of performance evidence and, in turn, the outcome levels that studies could demonstrate.
Embedded capture systems (12 studies) utilized automatic logging within the training platform. VR system logs, game databases, and screen-based simulation platforms recorded learner actions, triage decisions, response times, and error patterns without needing separate assessment tools. This method produced continuous, detailed performance data throughout training.
External capture systems (11 studies) used separate assessment tools administered independently of the training technology. Pre- and post-tests, multiple-choice knowledge questions, questionnaires (such as the State-Trait Anxiety Inventory and NASA Task Load Index), and technology acceptance scales were conducted as separate evaluation events. These tools measured knowledge, confidence, workload, or satisfaction but remained disconnected from the training platform.
Multimodal capture systems (5 studies: Way, Baetzner, Lochmannová, Chevalier, Shujuan) combined embedded platform data with physiological measures or external assessments. Smartwatch biometrics paired with VR performance logs, eye-tracking combined with triage accuracy metrics, and haptic feedback synchronized with performance data all produced richer datasets linking behavioral, physiological, and cognitive outcomes.
Of the 17 studies using embedded or multimodal capture, 14 achieved L2+ classification (see below). Among the 11 studies relying only on external capture, none reached L2+.
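The relationship between capture architecture and L2+ classification can be laid out as a two-by-two table. The sketch below only restates the reported counts as proportions. It is descriptive, no inferential test is implied, and the variable names are illustrative.

```python
# Counts as reported in Section 3.5: (studies reaching L2+, studies in group).
capture_groups = {
    "embedded or multimodal": (14, 17),
    "external only": (0, 11),
}

for group, (l2_plus, n) in capture_groups.items():
    not_l2_plus = n - l2_plus
    share = round(100 * l2_plus / n, 1)
    print(f"{group}: L2+ = {l2_plus}, not L2+ = {not_l2_plus}, share = {share}%")
```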

3.6. Training Outcomes: L2+ Reclassification and Kirkpatrick Levels

We applied Kirkpatrick’s four-level evaluation framework to all 28 studies. Outcomes clustered at the lower levels, which prompted the additional L2+ distinction. Figure 2 shows publication trends over time and the evaluation depth achieved across included studies.
Level 1 (Reaction): Seventeen studies (61%) reported participant satisfaction, engagement, or usability outcomes through satisfaction surveys and technology acceptance scales.
Level 2 (Learning): Twenty-four studies (86%) evaluated knowledge gain or procedural skills using pre- and post-tests, multiple-choice questions, or performance measures like triage accuracy.
Level 3 (Behavior): Only one study (4%, Goldberg 2021) provided evidence of behavior change in clinical settings.
Level 4 (Results): No study (0%) reported organizational or patient-level outcomes.
Within Level 2, we established a new classification: L2+ (Applied Learning). This subcategory includes studies that demonstrated competency through authentic, performance-based assessments under conditions similar to real-world MCI response. L2+ required integrated data collection within realistic simulation scenarios, providing objective evidence of skill application rather than just knowledge recall.
Among the 25 completed studies, 14 (56%) met all three L2+ criteria: Goldberg 2021, McCoy 2019, Cicero 2017, Chumvanichaya 2025, Heldring 2025, Shujuan 2022, Jain 2016, Cicero 2019, Baetzner 2025, Hosseini 2023(1), Alhawatmeh 2025, Chevalier 2023, Lochmannová 2025, and Sibley 2018. Two protocols (NCT06034184 and NCT06253156) describe methods consistent with L2+ criteria pending outcome reporting (Table 4).
L2+ classification was based on three criteria: (C1) assessment took place during or was integrated into the exercise, (C2) performance was evaluated against an external standard or rubric rather than self-report, and (C3) outcomes demonstrated the combined application of knowledge to realistic decisions. Studies meeting all three criteria were categorized as L2+.

3.7. Pedagogical Transparency Gap

We identified significant gaps in how studies reported their educational frameworks.
Operational frameworks were reported by 22 studies (79%). START triage was reported in 11 studies (39%), either alone or in combination with other systems. JumpSTART appeared in 5 studies (18%). SALT, SIEVE/SORT, and STM each appeared in multiple studies. Broader operational structures, including the Incident Command System, the Sphere Project, and WHO guidance, appeared in 7 studies (25%). This consistent reporting enabled evaluation of whether the training aligned with recognized disaster response protocols.
Assessment frameworks were explicitly reported in only 9 studies (32%). These included named instruments such as NASA Task Load Index, State-Trait Anxiety Inventory, and ARCS Motivation Survey, as well as cognitive task analysis frameworks like Methodology for Annotated Skill Trees and specialized assessment tools like START-Assessment Global Scale and Sense of Presence questionnaire. Most studies described data collection methods but did not reference a specific assessment framework.
Training design frameworks were identified in only 7 studies (25%). Explicit references included Backward Curriculum Design (1 study), Keller’s ARCS Model combined with the System Development Life Cycle (1 study), ADDIE instructional design (1 study), game-based learning principles (1 study), and the Clinical Reasoning Cycle with reflective practice (1 study). However, 21 studies (75%) did not mention any training design model or learning theory, instead focusing on activities or content without explicit design structure.
Full framework integration across all three categories (training design, assessment, and operational) appeared in only 3 studies (11%): Chumvanichaya 2025, Kyoung 2023, and Chang 2022.
This gap hampers replication, implementation fidelity, and comparability across studies. Without specified training design frameworks, it remains unclear how training objectives were chosen, how content was ordered, or how feedback systems were developed, all of which affect training effectiveness.

3.8. Cross-Cutting Patterns: Immersion-Evaluation Paradox and Scalability-Rigor Tension

We identified two hypothesis-generating patterns across technology choices and outcome structures.

3.8.1. The Immersion-Evaluation Paradox

Thirteen studies (46%) used immersive VR technology, but immersion level did not predict the depth of evaluation. Studies combining immersive VR with external-only data collection (e.g., Kyoung 2023, Heldring 2024) achieved only Level 1–2 outcomes despite high fidelity. In contrast, non-immersive screen-based systems with integrated data collection (e.g., Cicero 2017, Cicero 2019, Baetzner 2025) reached L2+. Data architecture, rather than immersion level, appeared to determine outcome depth.

3.8.2. Scalability-Rigor Tension

Higher-fidelity multi-modal technologies (VR + eye-tracking + physiological monitoring) produce richer performance data but are mainly used in high-resource settings (United States, Sweden, Germany). Lower-cost, more scalable technologies (screen-based simulations, smartphone apps, game-based learning) appear in upper-middle-income settings but often depend on external-only data collection, resulting in Level 1–2 outcomes. This pattern indicates a tradeoff between technological sophistication and implementation reach (Figure 3).

3.9. Data Collection Methods

We documented a variety of quantitative, qualitative, and mixed-method data collection approaches across the 28 studies.
Quantitative measures were predominant. Pre- and post-test assessments appeared in 11 studies. Multiple-choice knowledge questions were included in 6 studies. Triage accuracy and performance metrics, such as speed, accuracy, and error rate, were documented in 8 studies. Structured questionnaires, including the State-Trait Anxiety Inventory (2 studies), NASA Task Load Index (2 studies), and technology acceptance scales (3 studies), assessed psychological and usability outcomes.
Performance-based data sources appeared in 14 studies. VR system logs recorded learner actions and decisions (5 studies). Game databases captured learning events and performance data (2 studies). Eye-tracking outputs measured visual attention patterns (1 study). Smartwatch biometrics and salivary cortisol sampling captured physiological stress (2 studies). Audio and video recordings enabled later coding and analysis (3 studies).
Qualitative methods were used in 8 studies. Interviews and chart-stimulated recall examined cognitive processes and decision-making (2 studies). Open-ended survey questions measured participant perceptions (3 studies). Debriefing notes recorded facilitator observations (1 study). Observational tools systematically tracked participant behaviors (2 studies).
Mixed-methods designs were used in 6 studies, typically combining surveys with performance metrics or observational data. This combination allowed for triangulation between subjective experiences and objective performance measures.
Comprehensive instrument-by-instrument documentation can be found in Appendix S3, outlining data types, collection timing, and study-specific modifications.

4. Discussion

This scoping review identified 28 studies covering virtual reality, screen-based simulations, wearables, and communication devices, with significant variation in transparency of training design and depth of evaluation. We structure this discussion around the five analytical constructs from the Results and highlight key gaps for future research and practice.

4.1. Technology Function Spectrum: From Delivery Vehicles to Measurement Instruments

Half of the studies included (14 of 28) used technology as a dual-use platform, functioning both as a training delivery system and an assessment tool. These dual-use studies consistently achieved L2+ classification, with objective performance metrics integrated directly into the training environment. In comparison, 8 studies (29%) used technology mainly for content delivery, with assessments conducted separately through post-tests or questionnaires. Two studies (7%) designed technology specifically for assessment, and 4 studies (14%) used hybrid multi-modal systems combining elements from different categories.
This distinction has significant implications. The field often focuses on technology choices based on immersion, visual quality, or vendor features rather than educational purpose. However, our data reveal that the crucial factor is not how technology appears or feels, but what it can measure. Radio communication devices that capture message accuracy [42], computer-based systems that log triage time and accuracy [19,21], and virtual platforms with built-in performance tracking all show deeper competency assessment than systems optimized solely for visual realism.
For disaster risk reduction professionals, this distinction is crucial. When choosing technology for MCI preparedness training, procurement should focus on dual-use platforms that both deliver training and provide evidence of competency. These systems justify the financial investment not only by enhancing training quality but also by delivering measurable outcomes that meet organizational and regulatory standards for training effectiveness. Organizations should request detailed specifications for the data automatically captured by the technology, rather than assuming that more expensive or immersive platforms will produce stronger evidence.
Future work should clearly report the technology’s role across the training spectrum: delivery-only, assessment-only, or integrated dual-use. This will allow systematic comparisons of how the role relates to competency outcomes and implementation feasibility.

4.2. Data Capture Architecture: Embedded Assessment as the Enabling Condition

The connection between data architecture and outcome depth was consistent across the included studies. Among studies that used embedded or multimodal capture, 14 out of 17 reached L2+; in contrast, none of the studies relying solely on external assessment did. Studies that used the technology’s native ability to generate performance data demonstrated competence at more advanced levels, while studies with external-only assessment plateaued at L1 to L2.
Embedded capture took various forms. Google Glass with EyeSight software enabled real-time recording of procedural steps during telesimulation [34]. VR system logs automatically documented triage decisions, time-to-triage, and error patterns [29,38]. Wearables recorded physiological responses (heart rate, stress indicators) during training [22,33]. Structured observation tools embedded within game or simulation interfaces gathered data on team communication and decision-making [44]. These methods share a key feature: the learner does not complete a separate assessment; instead, assessment data come directly from performance within the training environment.
The measurement capacity of technology, rather than its visual sophistication, is the main factor that determines the depth of evaluation. Among the 13 high-fidelity immersive VR studies, some relied entirely on post-training questionnaires for assessment and therefore reached only L1-L2 outcomes. Conversely, screen-based or lower-cost systems with strong built-in metrics achieved L2+ classification. The field has invested heavily in visual and sensory fidelity but has neglected measurement infrastructure.
This gap highlights a wider misalignment in how technology procurement and development are prioritized. Vendor marketing focuses on immersion and presence. Funders and buyers often judge “advanced training” by headset adoption. However, from a learning science perspective, measuring what learners actually do, such as their decisions, timing, errors, and interactions, is key to enhancing training effectiveness and accountability.
In practice, this means standardizing how embedded metrics are reported: what performance data does the platform automatically collect? How is that data used to give feedback during training, and how is it analyzed after training to assess competency? We advise that future studies specify which metrics they will collect and report, using common terms (triage accuracy, time-to-triage, error patterns, communication protocols) to allow comparison across studies.
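As an illustration of what standardized embedded metrics could look like, the sketch below derives triage accuracy, mean time-to-triage, and error patterns from a platform event log. The log schema (`TriageEvent` and its fields), the category labels, and the sample values are all hypothetical; no included study is known to use this exact format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageEvent:
    """One logged triage decision (hypothetical schema)."""
    patient_id: str
    assigned: str   # category chosen by the learner
    expected: str   # reference category from the scenario key
    seconds: float  # time from first contact to decision


def summarize(events: list[TriageEvent]) -> dict:
    """Compute common embedded metrics from a list of logged decisions."""
    if not events:
        raise ValueError("no triage events logged")
    n = len(events)
    correct = sum(e.assigned == e.expected for e in events)
    # Count each (expected, assigned) mismatch to expose error patterns,
    # for example under-triage of 'red' patients as 'yellow'.
    errors = Counter(
        (e.expected, e.assigned) for e in events if e.assigned != e.expected
    )
    return {
        "triage_accuracy": correct / n,
        "mean_time_to_triage_s": sum(e.seconds for e in events) / n,
        "error_patterns": dict(errors),
    }


# Made-up sample log for one learner in one scenario.
log = [
    TriageEvent("p1", assigned="red", expected="red", seconds=20.0),
    TriageEvent("p2", assigned="yellow", expected="red", seconds=30.0),
    TriageEvent("p3", assigned="green", expected="green", seconds=10.0),
    TriageEvent("p4", assigned="yellow", expected="yellow", seconds=20.0),
]
summary = summarize(log)
print(summary)
```

Reporting the same three fields across studies, whatever the platform, would allow the cross-study comparison recommended above without requiring a shared technology.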

4.3. The Pedagogical Transparency Gap: Knowing What to Teach but Not How

Operational frameworks, which define the content of MCI training, were consistently and clearly reported across studies. Seventy-nine percent (22 out of 28 studies) specified particular operational frameworks: START triage protocols [38,50], Incident Command System structures, WHO guidance, Sphere Project standards, or disaster-specific curricula [31,40]. Healthcare professionals and disaster managers throughout the review clearly understood which competencies were important: triage decisions under pressure, incident command, resource allocation, and inter-agency coordination.
Training design frameworks, which involve translating competencies into learning activities, were mostly absent. Only 25% of studies (7 out of 28) reported using designated training design frameworks. Just 11% incorporated all three domains: operational content framework, training design framework, and assessment framework. While studies mention training triage or command, they do not explain how they structured training objectives, sequenced content, provided feedback, or incorporated debriefing to enhance retention and transfer.
The few studies that explicitly identify training design frameworks stand out. Bauchwitz used ADDIE and the Methodology for Annotated Skill Trees (MAST) cognitive task analysis framework to decompose triage and incident command into hierarchical sub-skills, enabling scalable, role-specific modules. Kyoung et al. employed SDLC alongside the ARCS motivational design model to guide both training design and assessment targets [28]. Both reported detailed, replicable methods for translating operational knowledge into technology-enhanced learning experiences.
This transparency gap has two effects. First, it reduces comparability across studies. Two studies both training triage with VR may use entirely different teaching methods, making it impossible to determine whether effects are due to the technology or the training design. Second, it creates obstacles to implementation. Facilitators using a published method have access to the what but not the how. They can copy the operational content but not necessarily the training sequence that made the original study successful.
Previous syntheses on simulation-based medical education show that programs created with clear training design frameworks achieve better outcomes than those without structured guidance [51,52,53]. The ADDIE model, which involves defining learning objectives, designing content sequences, developing media, implementing delivery, and evaluating results, has been effectively adapted to MCI training contexts. However, our review found that most studies did not report any details of these processes.
For future research and practice, we suggest: (i) clearly naming and describing the training design framework guiding curriculum development, (ii) transparently connecting training objectives to content delivery strategies and assessment methods, and (iii) publishing design rationales alongside outcome data. This approach would shift the field from reporting isolated training successes to developing cumulative, reproducible, evidence-based methods for MCI preparedness.

4.4. The Immersion-Evaluation Paradox: Investing in Fidelity, Neglecting Measurement

Virtual reality dominated our review of technology: 13 out of 28 studies (46%) employed immersive VR platforms, frequently with high-end headsets, haptic controllers, and photorealistic environments. These studies indicate significant investment in sensory and cognitive immersion: the ability to create presence, situational awareness, and embodied learning in complex scenarios.
However, immersion did not predict the depth of evaluation. We found no consistent difference in Kirkpatrick classification between immersive VR studies and non-immersive alternatives such as screen-based systems, tabletop software, and communication simulators. In fact, several non-immersive screen-based systems with embedded performance capture achieved L2+ classification, whereas some immersive VR studies remained at L1-L2 because their assessments relied solely on post-training questionnaires.
This paradox, investing heavily in visual fidelity while neglecting measurement infrastructure, indicates a misalignment between resource allocation and how we assess competency. Immersive VR headsets are costly, with hardware ranging from $300 to $3,000 per unit and software development and customization costing $100,000 to over $1,000,000 per application. These expenses are concentrated in high-resource environments, limit scalability, and necessitate ongoing technical support. Conversely, once the training platform is developed, the cost of integrated metrics is minimal. A well-designed log-capture system adds little computational load and can be deployed across different platforms.
The Immersion-Evaluation Paradox is not an argument against VR or immersive methods. Immersion might boost engagement, retention, or transfer, all of which are hypotheses worth exploring. Rather, the paradox points to an ordering of priorities that may be less effective: invest first in immersion, then plan assessment afterward. A better strategy is to prioritize building measurement capacity to gather evidence of learning, and then use remaining resources to improve immersion within those limits.
This finding is hypothesis-generating and requires direct empirical testing. Researchers should prospectively compare equivalent training content delivered through (1) high-fidelity immersive VR with external-only assessment, (2) screen-based systems with embedded assessment, and (3) combinations of both, measuring identical competency outcomes and employing randomized or controlled designs. Such studies are scarce in the existing literature and would help determine whether immersion itself causes competency improvements or if measurement capacity is the more crucial factor.
For procurement decisions, this means a practical reframing: start with measurement requirements. What competencies need to be demonstrated? What performance data would be most useful (accuracy, timing, communication patterns, decision logic)? Choose technology platforms, immersive or not, based on their ability to produce those metrics. Once that foundation is secure, consider how immersion might improve learning within the budget and implementation limits of your environment.

4.5. The Scalability-Rigor Tension: Technology Sophistication Versus Implementation Reach

We identified a clear geographic concentration: high-fidelity technology systems such as immersive VR, advanced wearables, and custom software were mainly found in the United States and other high-income regions. Twenty of 28 studies (71%) were carried out in high-income countries; the remaining 8 (29%) took place in upper-middle-income countries including China, Iran, Thailand, Turkey, and Jordan. No research was conducted in low- or lower-middle-income countries. Additionally, no study examined cost-effectiveness, implementation feasibility, or the comparative effectiveness of high-cost versus lower-cost technological approaches.
This gap is vital for disaster risk reduction. MCIs are not evenly spread out; they happen worldwide, with higher rates of death and illness in places with the least preparedness capacity [58]. Training methods that only work in high-resource settings do not fairly serve all areas. However, the current evidence provides no guidance on how to implement technology-enhanced MCI training when hardware procurement is limited, technical support is scarce, digital literacy varies, and electricity or internet access is inconsistent.
A systematic review of educational technology implementation in developing contexts identified cost constraints, digital illiteracy, and cultural factors as primary barriers [58]. Other studies highlight the shortage of personnel trained to operate and maintain advanced systems (e.g., VR platforms), which limits sustained integration into educational programs [59]. In our review, the prevalence of VR systems and smart wearables indicates a significant additional financial burden for lower-resource settings, exacerbating existing disparities in training capacity.
The tension between scalability and rigor highlights a deeper policy issue: technology choices often prioritize pedagogical sophistication (measured by fidelity, immersion, or outcome depth) while overlooking implementation feasibility (considering cost, maintenance, training requirements, and contextual adaptability). A highly rigorous study using a $500,000 custom VR system in a well-funded urban hospital is not scalable; replicating it means replicating the entire resource infrastructure. A simpler option, such as a low-cost web-based triage simulator with integrated performance logging, might be less elegant pedagogically but more feasible to implement across 100 hospitals with diverse resources.
This distinction is crucial for disaster risk reduction. Disaster preparedness must be fair. Frameworks for choosing technology for MCI training should balance training effectiveness with ease of implementation. This requires evidence on: (1) the cost per learner for each technology, (2) how sustainable different platforms are in resource-limited settings, (3) effectiveness studies comparing similar competency outcomes in high- and low-resource environments, and (4) implementation science methods that identify barriers and enablers to adoption across various organizational settings.
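The first evidence need, cost per learner, reduces to simple arithmetic once cost components are reported. The sketch below shows one possible calculation; the cost model (development plus hardware plus recurring support, divided by learners trained) and every figure in it are assumptions chosen for illustration, not data from any included study.

```python
def cost_per_learner(
    development_cost: float,
    hardware_unit_cost: float,
    hardware_units: int,
    annual_support_cost: float,
    years: int,
    learners_trained: int,
) -> float:
    """Total cost of ownership over the period divided by learners trained."""
    if learners_trained <= 0:
        raise ValueError("learners_trained must be positive")
    total = (
        development_cost
        + hardware_unit_cost * hardware_units
        + annual_support_cost * years
    )
    return total / learners_trained


# Hypothetical immersive VR deployment over two years.
vr = cost_per_learner(
    development_cost=100_000,
    hardware_unit_cost=3_000,
    hardware_units=10,
    annual_support_cost=10_000,
    years=2,
    learners_trained=500,
)

# Hypothetical web-based simulator using learners' existing devices.
web = cost_per_learner(
    development_cost=20_000,
    hardware_unit_cost=0,
    hardware_units=0,
    annual_support_cost=2_000,
    years=2,
    learners_trained=500,
)

print(f"VR: ${vr:.2f} per learner; web: ${web:.2f} per learner")
```

A comparison of this kind is informative only when paired with the same competency outcome for both platforms, which is the second gap the evidence needs above identify.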
Future research should prioritize equity. Clearly assess lower-cost technological options (e.g., web-based simulations, mobile phone applications, offline-capable platforms) alongside high-fidelity alternatives. Carry out implementation research in developing and resource-limited settings, evaluating not only competency outcomes but also adoption rates, sustainability, and cost-effectiveness. Frame technology choices within the context of health systems strengthening and disaster risk reduction policies, not just within simulation science.

4.6. Kirkpatrick Distribution and the L2+ Classification

While all included studies could be associated with the Kirkpatrick framework, only one explicitly used it for evaluation, which restricts interpretability and cross-study comparison. Out of 28 studies, 17 (61%) assessed Level 1 outcomes (reaction), such as satisfaction, acceptability, usability, and attitudes. Although these metrics are vital for understanding adoption, they do not adequately demonstrate learning.
A meta-analysis revealed that the average correlation between participant satisfaction and immediate learning is very low (r = .08) [55]. High satisfaction ratings do not necessarily reflect effective learning. However, the field’s focus on reaction-level outcomes implies that proving technology’s effectiveness is less important than showing that learners enjoy it.
When we standardized outcome reporting using the L2+ classification, the evidence distribution became clearer. Under traditional Level 3 criteria (on-the-job behavior change), only one study qualified; no study reached Level 4 (organizational impact). With our L2+ framework—classifying simulation-based performance assessments against external standards as applied learning—14 of 25 completed studies (56%) met L2+ criteria. This highlights a methodological limitation: MCIs are inherently difficult to evaluate at Level 3 or 4 in real-world settings. Randomizing hospitals to intervention and control groups for a “live MCI” is neither practical nor ethical. Most studies, therefore, relied on simulated or tabletop MCIs as Level 3 proxies.
However, this limitation does not lessen the value of going beyond reaction-level assessment. Even in simulation settings, we can accurately measure learning: triage accuracy, response time, error patterns, communication protocols, and decision-making under pressure. These built-in metrics constitute Level 2 (learning) evidence and represent significant progress beyond satisfaction surveys.
The L2+ reclassification indicates that when technology allows for objective performance measurement, competency becomes observable. This shifts the focus from “Did participants enjoy the training?” to “Did participants acquire and apply the targeted competencies?” The latter question is crucial for disaster preparedness.
We recommend standardizing outcome reporting using the Kirkpatrick framework from the protocol stage onward. Researchers should specify which Kirkpatrick level each outcome targets, recognize limitations to higher-level evaluation in MCI contexts, and use objective performance metrics and embedded assessment as the most reliable proxies for Level 3 outcomes in simulation-based training.

4.7. Limitations and Strengths

This scoping review has several limitations. First, our search was limited to seven major databases (PubMed, Embase, Scopus, PsycINFO, CINAHL, the Cochrane Library, and ClinicalTrials.gov), so relevant studies in grey literature or in sources not indexed by these databases may have been missed. Second, consistent with scoping review methodology, we did not perform a formal quality assessment of the included studies. Third, although study selection, data extraction, and synthesis were independently carried out and verified by multiple researchers, some subjectivity remains in mapping outcomes to Kirkpatrick levels and interpreting study designs. We addressed these issues through transparent documentation, consensus-based resolution of disagreements, and adherence to established scoping review guidelines.
A major strength of this review is that it is, to our knowledge, the first systematic map of the technological devices used for MCI training, linking each technology to its function across training phases and examining whether suitable training design and operational frameworks were used. The review also lays a foundation for future research by highlighting explicit training design frameworks, validated assessment tools, and the mapping of outcomes to Kirkpatrick levels to enhance comparability and synthesis across studies.

5. Conclusions

This scoping review examined 28 studies that utilized technology-enhanced methods to train healthcare professionals for prehospital mass-casualty incident preparedness. We identified five key constructs that define how technology works, collects data, and interacts with training design and operational frameworks. Three of these constructs were consistently found across the studies: the Technology Function Spectrum, Data Capture Architecture, and the Pedagogical Transparency Gap. The remaining two constructs, the Immersion-Evaluation Paradox and the Scalability-Rigor Tension, were identified as patterns that need further testing through prospective studies.
Our findings show that the key factor influencing evaluation depth is not technology complexity or immersion level, but rather the technology’s ability to incorporate assessment within the training environment. Dual-use platforms that both deliver training and gather performance data consistently produce deeper competency results than systems focused solely on visual fidelity. However, 75% of the included studies did not specify a training design framework, creating a transparency gap that hinders replication, comparison, and evidence synthesis.
These findings are important for disaster risk reduction policy. Technology procurement for MCI preparedness training should focus on measurement capacity as well as delivery capability. Future research should address the uneven geographic and resource distribution of current evidence, explore lower-cost technological options with similar measured outcomes, and prioritize equity in implementation design. Standardizing outcome reporting based on the Kirkpatrick framework with L2+ classification will improve comparability across studies and support evidence-based investment in prehospital disaster preparedness worldwide.

Supplementary Materials

The following supporting information can be downloaded at the website of this paper posted on Preprints.org.

CRediT Authorship Contribution Statement

AS: Data curation, Investigation, Methodology, Writing – original draft, Writing – review & editing. AA: Data curation, Investigation, Methodology, Writing – review & editing. HF: Data curation, Investigation, Methodology, Writing – original draft. AO: Conceptualization, Data curation, Methodology, Formal analysis, Supervision, Writing – original draft. RN: Investigation, Resources, Data curation, Writing – review & editing. AY: Conceptualization, Project administration, Resources, Supervision, Writing – review & editing. IH: Supervision, Methodology, Writing – review & editing. NZ: Conceptualization, Visualization, Supervision, Methodology, Writing – review & editing.

Data Availability

All data supporting the findings of this review are included in the manuscript and its supplementary materials.

Acknowledgments

The authors thank Dr. Sarah Kazim, Chair of Emergency Medicine at Dubai Health, for her support and for fostering the departmental environment that helped make this work possible. They also thank Mohammed Bin Rashid University of Medicine and Health Sciences for its support and collaboration in this research, the College of Medicine for supporting the medical students, Shakeel Tegginmani of the Al Maktoum Medical Library for assistance with the literature search, and the Institute of Learning (IOL) for research support.

References

  1. DeNolf, R.L.; Kahwaji, C.I. EMS Mass Casualty Management, in: StatPearls, StatPearls Publishing, Treasure Island (FL), 2025. Available online: http://www.ncbi.nlm.nih.gov/books/NBK482373/.
  2. Agri, J.; Söderin, L.; Hammarberg, E.; Lennquist-Montán, K.; Montán, C. Prehospital preparedness for major incidents in Sweden: a national survey with focus on mass-casualty incidents. Prehosp. Disaster Med. 2023, 38(1), 65–72.
  3. Hwang, K.I.; Kim, J. The training effects of mass casualty triage in radiological events for 119 emergency medical team. Prehosp. Disaster Med. 2023, 38(S1), s129.
  4. Mills, B.; Dykstra, P.; Hansen, S.; Miles, A.; Rankin, T.; Hopper, L.; et al. Virtual reality triage training can provide comparable simulation efficacy for paramedicine students compared to live simulation-based scenarios. Prehosp. Emerg. Care 2020, 24(4), 525–536.
  5. Grimwood, T.; Snell, L. The use of technology in healthcare education: a literature review. MedEdPublish 2020, 9, 137.
  6. Duan, Y.Y.; Zhang, J.Y.; Xie, M.; Feng, X.B.; Xu, S.; Ye, Z.W. Erratum to: Application of virtual reality technology in disaster medicine. Curr. Med. Sci. 2020, 40(6), 1205.
  7. Del Carmen Cardós-Alonso, M.; Otero-Varela, L.; Redondo, M.; Uzuriaga, M.; González, M.; Vazquez, T.; et al. Extended reality training for mass casualty incidents: a systematic review on effectiveness and experience of medical first responders. Int. J. Emerg. Med. 2024, 17(1), 99.
  8. Khalil, H.; Jia, R.; Moraes, E.B.; Munn, Z.; Alexander, L.; Peters, M.D.J.; et al. Scoping reviews and their role in identifying research priorities. J. Clin. Epidemiol. 2025, 181, 111712.
  9. Arksey, H.; O’Malley, L. Scoping studies: towards a methodological framework. Int. J. Soc. Res. Methodol. 2005, 8(1), 19–32.
  10. Levac, D.; Colquhoun, H.; O’Brien, K.K. Scoping studies: advancing the methodology. Implement. Sci. 2010, 5(1), 69.
  11. Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): checklist and explanation. Ann. Intern. Med. 2018, 169(7), 467–473.
  12. Saouli, A.; Nour, R.; Farhan, H.; Omayer, A.; Yousif, A.; Hubloue, I.; et al. Tech-Based Training Approaches for Prehospital Mass Casualty Response: A Scoping Review Protocol, protocols.io. 2025. Available online: https://www.protocols.io/view/tech-based-training-approaches-for-prehospital-mas-g4pmbyvk7.
  13. Saouli, A.; AlRahma, A.; Nour, R.; Farhan, H.; Omayer, A.; Yousif, A.; et al. Technology-enhanced training for prehospital mass-casualty response: a scoping review protocol; Preprints, 2025.
  14. McGowan, J.; Sampson, M.; Salzwedel, D.M.; Cogo, E.; Foerster, V.; Lefebvre, C. PRESS Peer Review of Electronic Search Strategies: 2015 guideline statement. J. Clin. Epidemiol. 2016, 75, 40–46.
  15. Covidence Systematic Review Software, Veritas Health Innovation, Melbourne, Australia. Available online: www.covidence.org.
  16. Falletta, S.; Kirkpatrick, Donald L. Evaluating training programs: the four levels. Am. J. Eval. 1998, 19(2), 259–261.
  17. Carnell, S.; Gomes De Siqueira, A.; Miles, A.; Lok, B. Informing and evaluating educational applications with the Kirkpatrick model in virtual environments: using a virtual human scenario to measure communication skills behavior change. Front. Virtual Real. 2022, 3, 810797.
  18. Voicescu, G.T.; Lamine, H.; Loșonți, A.E.; Lupan-Mureșan, E.M.; Luka, S.; Ulerio, J.G.; et al. Monitoring and evaluation in disaster management courses: a scoping review. BMC Med. Educ. 2025, 25(1), 188.
  19. Cicero, M.X.; Whitfill, T.; Munjal, K.; Madhok, M.; Diaz, M.C.G.; Scherzer, D.J.; et al. 60 seconds to survival: a pilot study of a disaster triage video game for prehospital providers. Am. J. Disaster Med. 2017, 12(2), 75–83.
  20. Chumvanichaya, K.; Yuksen, C.; Nuanprom, P.; Aramvanitch, K. A comparison of SIEVE, SORT, and START triage training effectiveness between immersive interactive 3D learning materials using virtual reality (VR-SSST) and traditional methods in mass casualty incidents. Int. J. Emerg. Med. 2025, 18(1), 55.
  21. Heldring, S.; Jirwe, M.; Wihlborg, J.; Lindström, V. Acceptability and applicability of using virtual reality for training mass casualty incidents — a mixed method study. BMC Med. Educ. 2025, 25(1), 728. [Google Scholar] [CrossRef] [PubMed]
  22. Wetherell, M.A.; Williams, G.; Doran, J. Assessing the psychobiological demands of high-fidelity training in pre-hospital emergency medicine. Scand. J. Trauma Resusc. Emerg. Med. 2024, 32(1), 101. [Google Scholar] [CrossRef] [PubMed]
  23. ChiCTR2300072282, Virtual reality-based nursing training for mass casualty incidents, Chinese Clinical Trial Registry, 2023. Available online: https://www.chictr.org.cn/hvshowproject.html?id=228057&v=1.0.
  24. Alhawatmeh, H.N.; Rawashdeh, S.A.; Alwidyan, M.T.; Abuhammad, S. Comparing virtual reality and live standardized patient drill simulation-based triage training methods in terms of triage knowledge and performance. Clin. Simul. Nurs. 2025, 103, 101749.
  25. Jain, T.N.; Ragazzoni, L.; Stryhn, H.; Stratton, S.J.; Della Corte, F. Comparison of the Sacco Triage Method versus START triage using a virtual reality scenario in advance care paramedic students. CJEM 2016, 18(4), 288–292.
  26. Cicero, M.X.; Whitfill, T.; Walsh, B.; Diaz, M.C.G.; Arteaga, G.M.; Scherzer, D.J.; et al. Correlation between paramedic disaster triage accuracy in screen-based simulations and immersive simulations. Prehosp. Emerg. Care 2019, 23(1), 83–89.
  27. Hosseini, M.; Masoumian Hosseini, S.T.; Qayumi, K.; Hosseinzadeh, S.; Ahmady, S. Crossover design in triage education: the effectiveness of simulated interactive vs. routine training on student nurses’ performance in a disaster situation. BMC Res. Notes 2023, 16(1), 313.
  28. Park, S.K. Kyoung; Kim, H.J. Development and evaluation of virtual reality-based simulation content for nursing students regarding emergency triage. J. Korean Acad. Fundam. Nurs. 2023, 30(2), 292–301.
  29. Chang, C.W.; Lin, C.W.; Huang, C.Y.; Hsu, C.W.; Sung, H.Y.; Cheng, S.F. Effectiveness of the virtual reality chemical disaster training program in emergency nurses: a quasi experimental study. Nurse Educ. Today 2022, 119, 105613.
  30. NCT06034184, Enhancing mass casualty triage through virtual reality simulation, ClinicalTrials.gov. 2024. Available online: https://clinicaltrials.gov/study/NCT06034184.
  31. Bajow, N.; Djalali, A.; Ingrassia, P.L.; Ragazzoni, L.; Ageely, H.; Bani, I.; et al. Evaluation of a new community-based curriculum in disaster medicine for undergraduates. BMC Med. Educ. 2016, 16(1), 225.
  32. Heldring, S.; Lindström, V.; Jirwe, M.; Wihlborg, J. Exploring ambulance clinicians’ clinical reasoning when training mass casualty incidents using virtual reality: a qualitative study. Scand. J. Trauma Resusc. Emerg. Med. 2024, 32(1), 90.
  33. Lochmannová, A. Exploring the role of virtual reality in preparing emergency responders for mass casualty incidents. Isr. J. Health Policy Res. 2025, 14(1), 22.
  34. McCoy, E.; Alrabah, R.; Weichmann, W.; Langdorf, M.; Ricks, C.; Chakravarthy, B.; et al. Feasibility of telesimulation and Google Glass for mass casualty triage education and training. West. J. Emerg. Med. 2019, 20(3), 512–519.
  35. Chevalier, S.; Paquay, M.; Goffoy, J.; Servotte, J.C.; Stipulante, S.; Ghuysen, A. Impact of virtual reality on performance among undergraduate healthcare professionals: a cross-sectional study. Int. J. Healthc. Manag. 2025, 18(2), 185–194.
  36. Foronda, C.L.; Shubeck, K.; Swoboda, S.M.; Hudson, K.W.; Budhathoki, C.; Sullivan, N.; et al. Impact of virtual simulation to teach concepts of disaster triage. Clin. Simul. Nurs. 2016, 12(4), 137–144.
  37. Way, D.P.; Panchal, A.R.; Price, A.; Berezina-Blackburn, V.; Patterson, J.; McGrath, J.; et al. Learner evaluation of an immersive virtual reality mass casualty incident simulator for triage training. BMC Digit. Health 2024, 2(1), 56.
  38. Baetzner, A.S.; Hill, Y.; Roszipal, B.; Gerwann, S.; Beutel, M.; Birrenbach, T.; et al. Mass casualty incident training in immersive virtual reality: quasi-experimental evaluation of multimethod performance indicators. J. Med. Internet Res. 2025, 27, e63241.
  39. Hosseini, M.; Masoumian Hosseini, S.T.; Qayumi, K. Nursing student satisfaction with a crisis management game-based training; a quasi-experimental study. Iran. J. Emerg. Med. 2023, 10(1), e22.
  40. Hermann, S.; Gerstner, J.; Weiss, F.; Aichele, S.; Stricker, E.; Gorgati, E.; et al. Presentation and evaluation of a modern course in disaster medicine and humanitarian assistance for medical students. BMC Med. Educ. 2021, 21(1), 610.
  41. Sibley, A.K.; Jain, T.N.; Butler, M.; Nicholson, B.; Sibley, D.; Smith, D.; et al. Remote scene size-up using an unmanned aerial vehicle in a simulated mass casualty incident. Prehosp. Emerg. Care 2019, 23(3), 332–339.
  42. Goldberg, B.S.; Hall, J.E.; Pham, P.K.; Cho, C.S. Text messages by wireless mesh network vs voice by two-way radio in disaster simulations: a crossover randomized-controlled trial. Am. J. Emerg. Med. 2021, 48, 148–155.
  43. NCT06253156, The effect of virtual reality-based disaster education given to nursing students on disaster preparedness: randomized controlled study, ClinicalTrials.gov. 2024. Available online: https://clinicaltrials.gov/study/NCT06253156.
  44. Bauchwitz, B.; Nguyen, J.; Woods, K.; Albagli, K.; Sawitz, M.; Hatch, M.; et al. The use of smartphone-based highly realistic MCI training as an adjunct to traditional training methods. Mil. Med. 2024, 189 (Suppl. 3), 775–783.
  45. Shujuan, L.; Mawpin, T.; Meichan, C.; Weijun, X.; Jing, W.; Biru, L. The use of virtual reality to improve disaster preparedness among nursing students: a randomized study. J. Nurs. Educ. 2022, 61(2), 93–96.
  46. Hu, H.; Lai, X.; Yan, L. Training nurses in an international emergency medical team using a serious role-playing game: a retrospective comparative analysis. BMC Med. Educ. 2024, 24(1), 432.
  47. De Bruin, A.B.H.; Kok, E.M.; Lobbestael, J.; De Grip, A. The impact of an online tool for monitoring and regulating learning at university: overconfidence, learning strategy, and personality. Metacogn. Learn. 2017, 12(1), 21–43.
  48. Pekrun, R. Self-report is indispensable to assess students’ learning. Front. Learn. Res. 2020, 8(3), 185–193.
  49. Gilbert, E. New Technologies and Innovative Methods in Data Collection Scoping Review. 2020.
  50. Clarkson, L.; Williams, M. EMS mass casualty triage, in: StatPearls, StatPearls Publishing, Treasure Island (FL), 2025. Available online: http://www.ncbi.nlm.nih.gov/books/NBK459369/.
  51. Ebrahimi, F.; Masoudian, T.; Khiabani, M.M. Integrating ADDIE needs assessment with Kirkpatrick evaluation: a systematic review. Asian J. Educ. Soc. Stud. 2025, 51(3), 350–376.
  52. Cook, D.A.; Hamstra, S.J.; Brydges, R.; Zendejas, B.; Szostek, J.H.; Wang, A.T.; et al. Comparative effectiveness of instructional design features in simulation-based education: systematic review and meta-analysis. Med. Teach. 2013, 35(1), e867–e898.
  53. Costa, J.M.; Miranda, G.L.; Melo, M. Four-component instructional design (4C/ID) model: a meta-analysis on use and effect. Learn. Environ. Res. 2022, 25(2), 445–463.
  54. Bauchwitz, B.; Weyhrauch, P.; Niehaus, J.; Makivic, M.; Manning, W.; Broach, J.; et al. Performance assessment in a virtual simulation for integrated austere medical operations training, in: Proceedings of the Interservice/Industry Training, Simulation and Education Conference (I/ITSEC), 2022.
  55. Alliger, G.M.; Tannenbaum, S.I.; Bennett, W., Jr.; Traver, H.; Shotland, A. A meta-analysis of the relations among training criteria. Pers. Psychol. 1997, 50(2), 341–358.
  56. Rouse, D.N. Employing Kirkpatrick’s evaluation framework to determine the effectiveness of health information management courses and programs. Perspect. Health Inf. Manag. 2011, 8, 1c.
  57. Pek, J.H.; Quah, L.J.J.; Valente, M.; Ragazzoni, L.; Della Corte, F. Use of simulation in full-scale exercises for response to disasters and mass-casualty incidents: a scoping review. Prehosp. Disaster Med. 2023, 38(6), 792–806.
  58. Ndibalema, P. Constraints of transition to online distance learning in Higher Education Institutions during COVID-19 in developing countries: a systematic review. E-Learn. Digit. Media 2022, 19(6), 595–618.
  59. Nemani, S. Barriers and enablers to adopting virtual reality in lower secondary STEAM curricula. J. Adv. Res. Educ. 2025, 4(2), 1–14.
  60. United Nations Office for Disaster Risk Reduction, Sendai Framework for Disaster Risk Reduction 2015–2030, UNDRR, Geneva, 2015. Available online: https://www.undrr.org/publication/sendai-framework-disaster-risk-reduction-2015-2030.
  61. World Health Organization. Emergency Response Framework, 2nd ed.; WHO: Geneva, 2024; Available online: https://www.who.int/publications/i/item/9789240058064.
Figure 1. PRISMA 2020 flow diagram for the scoping review of technology-enhanced training for prehospital mass-casualty incident preparedness. Records were identified from seven electronic databases (Embase, PubMed, Scopus, CINAHL, Cochrane Library, PsycINFO, and ClinicalTrials.gov) and supplementary sources (Google Scholar and reference list screening). A total of 28 studies met all inclusion criteria and were included in the final synthesis.
Figure 2. Publication trends and evaluation depth of technology-enhanced MCI training studies (2016-2025). Panel A: Annual publication counts stratified by the highest Kirkpatrick evaluation level achieved per study. Colors indicate L1 only (gray), L2 (light blue), L2+ Applied Learning (green), and L3 Behavior (amber). Panel B: Cumulative number of included studies (solid line) and cumulative number of studies achieving L2+ classification (dashed line), with the shaded area representing the evaluation gap. Annotations mark the COVID-19 publication gap (2020) and key milestones. Note the sharp increase in publication volume from 2023, with 2025 studies demonstrating exclusively L2+ evaluation depth.
Figure 3. Alluvial diagram illustrating the flow from participant populations (left) through technology modalities (center) to evaluation depth achieved (right) across the 28 included studies. Bandwidth is proportional to the number of study-population pairs (n = 49; studies with multiple participant types contribute multiple flows). Left column categories reflect the seven major population groups identified across the corpus. The center column shows the four technology modality categories. Right column displays the highest Kirkpatrick level achieved per study: L1 Reaction only (gray), L2 Learning (light blue), L2+ Applied Learning (green), and L3 Behavior (amber; Goldberg 2021 only). The diagram visually demonstrates the Immersion-Evaluation Paradox: VR dominates the technology column but splits between L2+ and lower-level outcomes, while screen-based systems show strong flow to L2+, reflecting the primacy of data capture architecture over immersion level.
Table 1. Study Characteristics.
| Study ID | Study | Year | Country | Study Design | Population | Sample Size | Technology Type |
|---|---|---|---|---|---|---|---|
| 1 | Goldberg 2021 | 2021 | USA | RCT | Pediatric EM physicians, fellows, residents | 50 | Communication devices (goTenna) |
| 2 | McCoy 2019 | 2019 | USA | Mixed-methods | Physicians, nurses, EMTs, paramedics | 32 | Smart glasses (Google Glass), telesimulation |
| 3 | Cicero 2017 | 2017 | USA | RCT | Paramedics, paramedic students, EMTs | 47 | Video game/serious game |
| 4 | ChiCTR2300072282 2023 | 2023 | China | RCT protocol | Nurses | 160 | Virtual reality |
| 5 | Chumvanichaya 2025 | 2025 | Thailand | RCT | Paramedic students | 83 | Virtual reality |
| 6 | Hosseini 2023 (2) | 2023 | Iran | Quasi-experimental | Nursing students | 60 | Game-based training |
| 7 | Way 2024 | 2024 | USA | Mixed-methods | Paramedics, EMTs, medical students, EM physicians | 375 | VR (Meta Quest 2) |
| 8 | Heldring 2025 | 2025 | Sweden | Mixed-methods | Ambulance nurses, RNs, nursing students, EMTs | 95 | VR (HTC VIVE, GoSaveThem) |
| 9 | Shujuan 2022 | 2022 | China | RCT | Nursing students | 101 | Virtual reality |
| 10 | Jain 2016 | 2016 | Canada | Prospective cohort | Paramedic students | 26 | VR simulation (XVR) |
| 11 | Cicero 2019 | 2019 | USA | RCT | Paramedics, EMTs | 26 | Immersive + screen-based simulation |
| 12 | Baetzner 2025 | 2025 | Germany | Quasi-experimental | Paramedics, EM physicians, medical students, nurses | 76 | VR (XVR + Varjo Aero) + eye-tracking |
| 13 | Hermann 2021 | 2021 | Germany | Pre-post evaluation | Medical students | 102 | Computer-based simulation |
| 14 | Kyoung 2023 | 2023 | South Korea | Usability study | Nursing students | 30 | Virtual reality |
| 15 | Hosseini 2023 (1) | 2023 | Iran | Quasi-experimental | EM students | 120 | Screen-based simulation (SIG) |
| 16 | NCT06253156 2024 | 2024 | Turkey | RCT protocol | Nursing students | 67 | Virtual reality |
| 17 | NCT06034184 2024 | 2024 | Sweden | RCT protocol | Nursing students | 60 | Virtual reality |
| 18 | Wetherell 2024 | 2024 | England | Mixed-methods | EM/ICU physicians, paramedics | 15 | Smartwatches (Garmin) |
| 19 | Hu 2024 | 2024 | Hong Kong | Quasi-experimental | Nurses | 106 | Computer game simulation |
| 20 | Alhawatmeh 2025 | 2025 | Jordan | RCT | Paramedic students | 102 | Immersive VR |
| 21 | Bauchwitz 2024 | 2024 | USA | Quasi-experimental | Medical students, paramedics, nurses, EM residents/attendings | 21 | Smartphone simulation (EFECTIVE) |
| 22 | Chevalier 2023 | 2023 | Belgium | Cross-sectional | Ambulance attendants, nursing students, medical students | 83 | Virtual reality |
| 23 | Heldring 2024 | 2024 | Sweden | Mixed-methods | Ambulance clinicians | 11 | VR (HTC VIVE, GoSaveThem) |
| 24 | Lochmannová 2025 | 2025 | Czech Republic | Mixed-methods | Paramedic students | 37 | VR + Garmin smartwatches |
| 25 | Sibley 2018 | 2018 | Canada | Intervention (post-test) | EMTs, ED nurses, ED physicians | 96 | UAV/drone |
| 26 | Chang 2022 | 2022 | Taiwan | Quasi-experimental | ED nurses | 67 | 360° VR (HTC VIRTI) |
| 27 | Foronda 2016 | 2016 | USA | Pre-post evaluation | BSN students | 6 | Web-based 3D simulation (V-CAEST) |
| 28 | Bajow 2016 | 2016 | Saudi Arabia | Pre-post evaluation | Medical students | 29 | XVR, ISEE, video lectures/e-learning |
Characteristics of the 28 included studies on technology-enhanced training for prehospital mass-casualty incident preparedness. Studies are listed by study ID. Sample sizes reflect reported participant numbers; for protocols, planned enrollment is indicated. Country refers to the location where the study was conducted. Technology type describes the primary technology platform used.
Table 2. Study Design Frequencies.
| Study Design | Number of Studies | Percentage |
|---|---|---|
| Mixed-methods | 6 | 21.4% |
| Quasi-experimental design | 6 | 21.4% |
| Randomized Controlled Trial (RCT) | 6 | 21.4% |
| RCT protocol | 3 | 10.7% |
| Pre-post evaluation | 3 | 10.7% |
| Cross-sectional study | 1 | 3.6% |
| Intervention study (post-test only) | 1 | 3.6% |
| Prospective cohort study | 1 | 3.6% |
| Usability study | 1 | 3.6% |
| Total | 28 | 100% |
Distribution of study designs across the 28 included studies. Percentages are calculated from the total number of included studies (N = 28). Three studies were registered protocols reporting planned methodology without outcome data.
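The percentages in Table 2 follow directly from the counts and the denominator N = 28. The short Python sketch below reproduces that arithmetic as a consistency check; the variable and function names are illustrative and are not taken from the review's own analysis files.

```python
# Study-design counts as reported in Table 2 (N = 28 included studies).
design_counts = {
    "Mixed-methods": 6,
    "Quasi-experimental design": 6,
    "Randomized Controlled Trial (RCT)": 6,
    "RCT protocol": 3,
    "Pre-post evaluation": 3,
    "Cross-sectional study": 1,
    "Intervention study (post-test only)": 1,
    "Prospective cohort study": 1,
    "Usability study": 1,
}


def design_percentages(counts):
    """Return each design's share of all studies, rounded to one decimal place."""
    total = sum(counts.values())
    return {design: round(100 * n / total, 1) for design, n in counts.items()}


percentages = design_percentages(design_counts)
for design, pct in percentages.items():
    print(f"{design}: {design_counts[design]} ({pct}%)")
```

Because each value is rounded independently, the rounded percentages need not sum to exactly 100.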
Table 3. Technology Applications.
| Study ID | Study | Technology Modality | Purpose | Function Spectrum |
|---|---|---|---|---|
| 1 | Goldberg 2021 | In-person (communication devices) | Training | Hybrid |
| 2 | McCoy 2019 | Hybrid (smart glasses + telesimulation) | Both | Hybrid |
| 3 | Cicero 2017 | Screen-based (serious game) | Training | Dual-use |
| 4 | ChiCTR2300072282 2023 | VR | Training | Delivery-only |
| 5 | Chumvanichaya 2025 | VR | Training | Dual-use |
| 6 | Hosseini 2023 (2) | Screen-based (game-based) | Training | Delivery-only |
| 7 | Way 2024 | VR (Meta Quest 2) | Both | Dual-use |
| 8 | Heldring 2025 | VR (HTC VIVE) | Both | Dual-use |
| 9 | Shujuan 2022 | VR | Both | Dual-use |
| 10 | Jain 2016 | VR (XVR) | Training | Dual-use |
| 11 | Cicero 2019 | Hybrid (immersive + screen) | Training | Dual-use |
| 12 | Baetzner 2025 | VR (XVR + eye-tracking) | Both | Dual-use |
| 13 | Hermann 2021 | Screen-based + in-person | Both | Delivery-only |
| 14 | Kyoung 2023 | VR | Training | Delivery-only |
| 15 | Hosseini 2023 (1) | Screen-based (SIG simulation) | Both | Dual-use |
| 16 | NCT06253156 2024 | VR + in-person | Training | Delivery-only |
| 17 | NCT06034184 2024 | VR + in-person | Training | Delivery-only |
| 18 | Wetherell 2024 | In-person (smartwatches) | Assessment | Assessment-dedicated |
| 19 | Hu 2024 | Screen-based (game simulation) | Both | Dual-use |
| 20 | Alhawatmeh 2025 | VR + in-person | Training | Delivery-only |
| 21 | Bauchwitz 2024 | Screen-based (smartphone) | Training | Dual-use |
| 22 | Chevalier 2023 | VR | Training | Dual-use |
| 23 | Heldring 2024 | VR (HTC VIVE) | Both | Dual-use |
| 24 | Lochmannová 2025 | VR + smartwatches | Both | Hybrid |
| 25 | Sibley 2018 | Screen-based (UAV/drone) | Assessment | Assessment-dedicated |
| 26 | Chang 2022 | VR (360°) | Training | Dual-use |
| 27 | Foronda 2016 | Screen-based (web-based 3D) | Training | Dual-use |
| 28 | Bajow 2016 | Hybrid (XVR + ISEE + e-learning) | Training | Hybrid |
Technology applications and functional classification of the 28 included studies. Technology modality describes the primary delivery platform. Purpose of technology indicates whether the system was used for training delivery, assessment, or both. Technology Function Spectrum classifies each study according to the emergent analytical construct: delivery-only (content transmission without embedded assessment), dual-use (simultaneous training and data capture), assessment-dedicated (primarily designed for performance measurement), or hybrid (multi-modal systems combining elements across categories).
Table 4. Kirkpatrick Evaluation Levels with L2+ Reclassification.
| Study ID | Study | L1 (Reaction) | L2 (Learning) | L2+ (Applied) | L3 (Behavior) | L4 (Results) | Key Outcomes |
|---|---|---|---|---|---|---|---|
| 1 | Goldberg 2021 | | | | | | Communication accuracy, triage accuracy, workload |
| 2 | McCoy 2019 | | | | | | Triage accuracy, satisfaction, self-reported improvement |
| 3 | Cicero 2017 | | | | | | Triage accuracy |
| 4 | ChiCTR2300072282 2023 | | | | | | Attitudes, preparedness |
| 5 | Chumvanichaya 2025 | | | | | | Knowledge, triage accuracy and time, motivation |
| 6 | Hosseini 2023 (2) | | | | | | Satisfaction, attitude |
| 7 | Way 2024 | | | | | | Perceived realism, perceived learning |
| 8 | Heldring 2025 | | | | | | Attitude change, triage accuracy, triage time |
| 9 | Shujuan 2022 | | | | | | Knowledge, attitude, performance skills |
| 10 | Jain 2016 | | | | | | Time to triage, triage prioritization accuracy |
| 11 | Cicero 2019 | | | | | | Triage accuracy |
| 12 | Baetzner 2025 | | | | | | Visual attention, triage accuracy, triage speed |
| 13 | Hermann 2021 | | | | | | Knowledge, attitude, satisfaction |
| 14 | Kyoung 2023 | | | | | | Usability |
| 15 | Hosseini 2023 (1) | | | | | | Knowledge, triage performance |
| 16 | NCT06253156 2024 | | | | | | Attitude change, disaster preparedness |
| 17 | NCT06034184 2024 | | | | | | Knowledge, triage performance (protocol) |
| 18 | Wetherell 2024 | | | | | | Anxiety, workload, stress |
| 19 | Hu 2024 | | | | | | Knowledge, usability, motivation |
| 20 | Alhawatmeh 2025 | | | | | | Knowledge, triage performance |
| 21 | Bauchwitz 2024 | | | | | | Usability, fidelity, time pressure |
| 22 | Chevalier 2023 | | | | | | Knowledge, triage accuracy, triage time, stress |
| 23 | Heldring 2024 | | | | | | Perceived usefulness, perceived learning |
| 24 | Lochmannová 2025 | | | | | | METHANE reporting, triage performance, workload |
| 25 | Sibley 2018 | | | | | | Knowledge, triage |
| 26 | Chang 2022 | | | | | | Self-assessed disaster preparedness, self-efficacy |
| 27 | Foronda 2016 | | | | | | Satisfaction, knowledge |
| 28 | Bajow 2016 | | | | | | Satisfaction, knowledge, self-reported behavior |
| TOTAL | | 17 (61%) | 24 (86%) | 14 (56%*) | 1 (4%) | 0 (0%) | *Note: L2+ calculated from 25 completed studies |
Kirkpatrick evaluation levels with L2+ (Applied Learning) reclassification for the 28 included studies. Checkmarks (✓) indicate that the study reported outcomes at the corresponding level; dashes (—) indicate no outcomes at that level. L2+ classification required meeting all three criteria: (C1) assessment embedded within the exercise, (C2) performance measured against an external standard, and (C3) integrated application of knowledge to realistic decisions. L2+ percentages are calculated from the 25 completed studies (excluding 3 protocols). Goldberg 2021 is the only study achieving Level 3 (behavior transfer in real-world settings). No study achieved Level 4 (organizational impact).
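The L2+ rule and the two denominators used in the TOTAL row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the review's actual coding procedure: the class and function names are hypothetical, the three boolean fields simply mirror criteria C1 to C3 as worded in the caption, and the counts are copied from the TOTAL row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L2PlusCriteria:
    """Hypothetical container for the three L2+ criteria named in the caption."""
    embedded_assessment: bool     # C1: assessment embedded within the exercise
    external_standard: bool       # C2: performance measured against an external standard
    integrated_application: bool  # C3: knowledge applied to realistic decisions


def is_l2_plus(criteria):
    """A study is classified L2+ only when all three criteria are met."""
    return (criteria.embedded_assessment
            and criteria.external_standard
            and criteria.integrated_application)


TOTAL_STUDIES = 28
PROTOCOLS = 3
COMPLETED_STUDIES = TOTAL_STUDIES - PROTOCOLS  # 25 studies with outcome data

# Counts copied from the TOTAL row of Table 4.
level_counts = {"L1": 17, "L2": 24, "L2+": 14, "L3": 1, "L4": 0}

# L2+ uses the completed studies as denominator; all other levels use all 28.
level_percentages = {
    level: round(100 * n / (COMPLETED_STUDIES if level == "L2+" else TOTAL_STUDIES))
    for level, n in level_counts.items()
}
print(level_percentages)
```

Using the 25 completed studies as the denominator keeps protocols, which report no outcome data, from diluting the L2+ share.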
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.