Submitted: 15 June 2025
Posted: 16 June 2025
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Study Design
- Participants will report significantly higher scores in the measured perception parameters when using an AI chatbot compared to conventional study tools.
- The use of the AI chatbot will result in higher SBA performance scores compared to conventional tools.
- Perception scores from participants using the AI chatbot will correlate with their performance scores.
Additionally, our qualitative analysis addresses the following research questions:
- How do medical students perceive the usefulness and usability of AI chatbots compared to conventional study tools?
- What are students’ experiences with AI chatbots in supporting their learning, engagement, and information retention?
- How do students perceive the limitations or challenges of using AI chatbots for medical studies?
- What changes, if any, do students report in their attitudes toward AI in medical education after using the chatbot?
- To what extent do students feel the chatbot aligns with their curriculum and supports deeper learning and critical thinking?
2.2. Participants and Setting
2.3. Study Materials
2.3.1. Conventional Study Materials
1. Ellis, H., & Mahadevan, V. (2019). Clinical Anatomy: Applied Anatomy for Students and Junior Doctors (14th ed.). John Wiley & Sons. Pages 193-200; pages 264-267.
2. Moore, K. L., Dalley, A. F., & Agur, A. (2017). Clinically Oriented Anatomy (8th ed.). Lippincott Williams and Wilkins. Pages 1597-1599.
2.3.2. AI Chatbot: Lenny AI
2.4. Study Procedures
2.4.1. Task 0: Baseline AI Perception Assessment
2.4.2. Task 1 and Task 2: Randomised Crossover Academic Tasks
2.4.3. Post-Task Questionnaire
2.4.4. Focus Group Discussion
1. Changes in perceptions of AI
2. Comparative effectiveness of AI tools
3. Impact of AI on learning
4. Usability and engagement with AI
5. Challenges in using AI
6. Potential future influences of AI
7. Perceived role of AI in medical education
8. Suggestions for improving Lenny AI
2.5. Blinding and Data Anonymisation
2.6. Data Analysis
2.6.1. Quantitative Analysis
1. Between-arm performance in Task 2 (conventional tools vs. Lenny AI)
2. Within-arm performance change in Arm 1 (Lenny AI → conventional tools)
3. Within-arm performance change in Arm 2 (conventional tools → Lenny AI)
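
The three comparisons above can be laid out as a small data-selection sketch. It only shows which scores enter each comparison under the crossover order described (Arm 1: Lenny AI then conventional tools; Arm 2: the reverse). The `Score` class, the helper names, and every score value are hypothetical placeholders, not study data, and no particular statistical test is implied.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Score:
    participant: str
    arm: int        # 1 = Lenny AI first, 2 = conventional tools first
    task: int       # 1 or 2
    percent: float  # SBA score as a percentage


# Tool used by each (arm, task) cell, following the crossover order above.
TOOL = {
    (1, 1): "Lenny AI", (1, 2): "conventional",
    (2, 1): "conventional", (2, 2): "Lenny AI",
}


def select(scores, arm, task):
    """Return the percentage scores for one arm in one task."""
    return [s.percent for s in scores if s.arm == arm and s.task == task]


def between_arm_task2(scores):
    """Comparison 1: two independent groups, keyed by the tool used in Task 2."""
    return {TOOL[(arm, 2)]: select(scores, arm, 2) for arm in (1, 2)}


def within_arm(scores, arm):
    """Comparisons 2 and 3: (Task 1, Task 2) score pairs for each participant."""
    by_task = {
        t: {s.participant: s.percent for s in scores if s.arm == arm and s.task == t}
        for t in (1, 2)
    }
    shared = sorted(by_task[1].keys() & by_task[2].keys())
    return [(by_task[1][p], by_task[2][p]) for p in shared]


# Placeholder scores, invented purely to exercise the functions.
scores = [
    Score("P01", 1, 1, 70.0), Score("P01", 1, 2, 60.0),
    Score("P02", 1, 1, 80.0), Score("P02", 1, 2, 90.0),
    Score("P03", 2, 1, 50.0), Score("P03", 2, 2, 65.0),
    Score("P04", 2, 1, 60.0), Score("P04", 2, 2, 55.0),
]

task2_groups = between_arm_task2(scores)
arm1_change = mean(t2 - t1 for t1, t2 in within_arm(scores, 1))
arm2_change = mean(t2 - t1 for t1, t2 in within_arm(scores, 2))
```

Keeping the between-arm comparison as two independent groups and the within-arm comparisons as participant-matched pairs mirrors the distinction between unpaired and paired analyses in a crossover design.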
2.6.2. Qualitative Analysis
3. Results
3.1. Baseline Perceptions
3.2. Quantitative Findings
3.2.1. Overview
3.2.2. Dimensions Favouring Chatbot Use Across Both Arms
- Ease of Use: Participants rated the chatbot as significantly easier to use than traditional materials (Arm 1: Mean Difference (MD) = 1.40, p = 0.040; Arm 2: MD = 1.20, p = 0.030). However, this difference did not reach statistical significance when compared with baseline expectations (Arm 1: p = 0.170; Arm 2: p = 0.510).
- Satisfaction: Satisfaction scores were significantly higher in the chatbot condition (Arm 1: MD = 1.40, p = 0.030; Arm 2: MD = 1.10, p = 0.037).
- Quality of Information: Both arms rated the chatbot more highly in terms of information quality (Arm 1: MD = 1.20, p = 0.050; Arm 2: MD = 1.00, p = 0.050). Notably, Arm 1 participants reported a significant improvement in their perception of information quality from baseline (3.40 to 4.30; p = 0.020).
- Ease of Understanding: Both arms rated the chatbot condition higher on the question “How easy was it to understand the information provided by your given learning method?” (Arm 1: MD = 1.30; Arm 2: MD = 1.40; both p = 0.010).
- Engagement: Chatbot use was associated with significantly higher engagement scores (Arm 1: MD = 1.60, p = 0.010; Arm 2: MD = 1.50, p = 0.005).
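
As a reading aid, the reported values above can be tabulated and checked against a significance threshold. This is a minimal sketch: the threshold of 0.05 and the strict `p < alpha` rule are assumptions not stated in this section, and the function names are illustrative only.

```python
ALPHA = 0.05  # assumed significance threshold, not stated in this section

# (dimension, arm) -> (mean difference, p-value), copied from the list above.
reported = {
    ("Ease of Use", 1): (1.40, 0.040),
    ("Ease of Use", 2): (1.20, 0.030),
    ("Satisfaction", 1): (1.40, 0.030),
    ("Satisfaction", 2): (1.10, 0.037),
    ("Quality of Information", 1): (1.20, 0.050),
    ("Quality of Information", 2): (1.00, 0.050),
    ("Ease of Understanding", 1): (1.30, 0.010),
    ("Ease of Understanding", 2): (1.40, 0.010),
    ("Engagement", 1): (1.60, 0.010),
    ("Engagement", 2): (1.50, 0.005),
}


def significant_in_both_arms(results, alpha=ALPHA):
    """Dimensions whose p-value is strictly below alpha in Arm 1 and Arm 2."""
    dimensions = dict.fromkeys(dim for dim, _ in results)  # keeps listing order
    return [
        dim for dim in dimensions
        if all(results[(dim, arm)][1] < alpha for arm in (1, 2))
    ]


def on_threshold(results, alpha=ALPHA):
    """Dimensions with at least one p-value reported exactly at alpha."""
    return sorted({dim for (dim, _), (_, p) in results.items() if p == alpha})


both = significant_in_both_arms(reported)
edge = on_threshold(reported)
```

Under these assumptions, information quality is the one dimension that sits exactly on the threshold in both arms, so its classification depends on whether the rounded p-values fall just below or just above 0.05.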
3.2.3. Divergent Perceptions of Efficiency, Confidence, Performance and Future Use
- Efficiency:
  - Arm 1 (chatbot-first): Participants reported significantly greater perceived efficiency (MD = 1.70, 4.40 vs. 2.70; p = 0.020) whilst completing Task 1.
  - Arm 2 (conventional-first): There was no significant change (MD = 0.60, 3.60 vs. 3.00; p = 0.220).
- Confidence in Applying Information:
  - Arm 1: Participants felt significantly more confident applying information learned from the chatbot (MD = 0.90, 3.40 vs. 2.50; p = 0.020).
  - Arm 2: The increase was smaller and did not reach statistical significance (MD = 0.80, 3.30 vs. 2.50; p = 0.060).
- Perceived Performance Compared to Usual Methods:
  - Arm 1: The difference was not statistically significant (MD = 0.80; p = 0.110).
  - Arm 2: Participants reported a significant increase in perceived performance using the chatbot (MD = 1.00, 3.50 vs. 2.50; p = 0.040).
- Likelihood of Future Use:
  - Arm 1: Participants reported a significantly greater intention to use chatbots in future learning (MD = 1.20; p = 0.020).
  - Arm 2: The increase approached significance (MD = 0.90; p = 0.060).
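
Where both condition means are reported, the mean difference can be recomputed directly. The sketch below assumes that the first value in each reported pair is the chatbot rating and the second is the conventional-tools rating; the labels and function name are illustrative.

```python
# (label, chatbot mean, conventional mean) for the items that report both means.
reported_means = [
    ("Efficiency, Arm 1", 4.40, 2.70),
    ("Efficiency, Arm 2", 3.60, 3.00),
    ("Confidence, Arm 1", 3.40, 2.50),
    ("Confidence, Arm 2", 3.30, 2.50),
    ("Perceived performance, Arm 2", 3.50, 2.50),
]


def mean_difference(chatbot, conventional):
    """Chatbot mean minus conventional mean, rounded to two decimals."""
    return round(chatbot - conventional, 2)


differences = {
    label: mean_difference(chatbot, conventional)
    for label, chatbot, conventional in reported_means
}

for label, md in differences.items():
    print(f"{label}: MD = {md:.2f}")
```

For example, the confidence item in Arm 1 gives 3.40 − 2.50 = 0.90.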
3.2.4. Inconsistent Impacts on Perceived Accuracy, Depth, and Critical Thinking
3.2.5. Comparative Task Performance and Correlation with Perception
3.3. Thematic Analysis
3.3.1. Speed and Efficiency
3.3.2. Depth and Complexity
3.3.3. Functional Use Case and Focused Questions
3.3.4. Accuracy and Credibility
3.3.5. Openness to AI as a Learning Tool
3.3.6. Curriculum Fit
3.3.7. Further Development and Technical Limitations
4. Discussion
4.1. The Efficiency-Depth Paradox: When Speed Compromises Comprehension
4.2. Confidence Versus Competence: The Illusion of Mastery
4.3. Transparency and Traceability: The Foundations of Trust in AI Learning Tools
- Citation Toggles: Allowing users to reveal underlying references where applicable, supporting source traceability.
- Uncertainty Indicators: Signalling lower-confidence outputs to prompt additional verification.
- Expandable Explanations: Offering tiered content depth, enabling students to shift from summary to substantiated detail on demand.
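
One way to picture how the three features above could fit together is a response object that carries a summary, optional detail, references, and a confidence value. This is a purely hypothetical sketch, not a description of how Lenny AI is built: the class, field names, cut-off value, and placeholder strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

LOW_CONFIDENCE = 0.6  # hypothetical cut-off for showing an uncertainty notice


@dataclass
class TieredAnswer:
    summary: str                                          # always shown
    detail: str = ""                                      # expandable explanation
    references: list[str] = field(default_factory=list)   # behind a citation toggle
    confidence: float = 1.0                               # 0.0 to 1.0; estimation is out of scope


def render(answer, show_detail=False, show_citations=False):
    """Build the text shown to the student for the chosen display options."""
    lines = [answer.summary]
    if answer.confidence < LOW_CONFIDENCE:
        # Uncertainty indicator: prompt additional verification.
        lines.append("[Low confidence: please verify against a primary source]")
    if show_detail and answer.detail:
        lines.append(answer.detail)
    if show_citations and answer.references:
        lines.extend(f"[{i}] {ref}" for i, ref in enumerate(answer.references, 1))
    return "\n".join(lines)


example = TieredAnswer(
    summary="Short summary of the topic.",
    detail="Longer, substantiated explanation.",
    references=["Placeholder source A"],
    confidence=0.4,
)
collapsed = render(example)
expanded = render(example, show_detail=True, show_citations=True)
```

Keeping the summary as the default view and placing detail and citations behind explicit options reflects the tiered-depth idea: the student chooses when to move from a quick answer to substantiated content.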
4.4. No Consistent Performance Gains from Chatbot Use
5. Limitations
6. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbrev. | Full form |
|---|---|
| AKT | Applied Knowledge Test |
| AI | Artificial Intelligence |
| CI | Confidence Interval |
| CLT | Cognitive Load Theory |
| DBR | Design-Based Research |
| GDPR | General Data Protection Regulation |
| KCL | King’s College London |
| LLM(s) | Large Language Model(s) |
| LMIC(s) | Low- and Middle-Income Country(ies) |
| M | Mean |
| MD | Mean Difference |
| n | Sample size |
| OSCE(s) | Objective Structured Clinical Examination(s) |
| p | p-value |
| RAG | Retrieval-Augmented Generation |
| r | Correlation coefficient (effect size) |
| REMAS | Research Ethics Management Application System |
| SAQ(s) | Short Answer Question(s) |
| SBA | Single Best Answer |
| SD | Standard Deviation |
| SPSS | Statistical Package for the Social Sciences |
| TAM | Technology Acceptance Model |
| T0, T1, T2 | Task 0 (baseline), Task 1, Task 2 timepoints |
| UI | User Interface |
| Z | Z-statistic |
| κ (kappa) | Cohen’s kappa coefficient |



| Outcome Measures | Questions |
|---|---|
| Ease of Use | "How easy was it to use this learning method?" |
| Satisfaction | "Overall, how satisfied are you with this method for studying?" |
| Efficiency | "How efficient was this method in gathering info?" |
| Confidence in Applying Information | "How confident do you feel in applying the information learned?" |
| Quality of Information | "Rate the quality of the information provided." |
| Accuracy of Information | "Was the information provided accurate?" |
| Depth of Content | "Describe the depth of content provided by the learning tool." |
| Ease of Understanding | "Was the information easy to understand?" |
| Engagement | "How engaging was the learning method in maintaining your interest during the task?" |
| Performance Compared to Usual Methods | "Compared to usual study methods, how did this one perform?" |
| Critical Thinking | "How did this learning method affect your critical thinking?" |
| Likelihood of Future Use | "How likely are you to use this learning method again?" |
| Comparison | First-Listed Mean Score % (SD) | Second-Listed Mean Score % (SD) | Mean Difference (%) | 95% CI | p-value |
|---|---|---|---|---|---|
| Task 1: Arm 1 vs Arm 2 | 71.43 (15.06) | 54.29 (23.13) | 17.14 | -1.20 to 35.48 | 0.065 |
| Task 2: Arm 2 vs Arm 1 | 63.33 (18.92) | 68.33 (26.59) | -5.00 | -26.68 to 16.68 | 0.634 |
| Within Arm 1: Task 1 vs Task 2 | 71.43 (15.06) | 68.33 (26.59) | 3.10 | -15.41 to 21.60 | 0.714 |
| Within Arm 2: Task 1 vs Task 2 | 54.29 (23.13) | 63.33 (18.92) | -9.04 | -23.09 to 4.99 | 0.179 |
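
A quick arithmetic check can be run on any row of the performance table above. It assumes the confidence interval is symmetric about the point estimate (as t-based intervals are) and that the difference is taken as the first-listed mean minus the second-listed mean. The function name and tolerance are illustrative. The first row is used as the worked example.

```python
def difference_and_ci_check(mean_a, mean_b, ci_low, ci_high, tol=0.01):
    """Return (difference, CI midpoint, CI half-width, whether they agree)."""
    diff = round(mean_a - mean_b, 2)
    midpoint = round((ci_low + ci_high) / 2, 2)
    half_width = round((ci_high - ci_low) / 2, 2)
    return diff, midpoint, half_width, abs(diff - midpoint) <= tol


# Task 1, Arm 1 vs Arm 2: means 71.43 and 54.29, 95% CI -1.20 to 35.48.
row1 = difference_and_ci_check(71.43, 54.29, -1.20, 35.48)
print(row1)
```

For this row the difference (17.14) coincides with the interval midpoint, and the interval spans zero, in line with the reported p-value of 0.065. The same check can be applied to the remaining rows.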
| Ability | Features and Functions |
|---|---|
| Accuracy | Curriculum fit |
| Complexity | Focused questions |
| Credibility | Further development |
| Depth | Functional use case |
| Efficiency | Openness to AI as a learning tool |
| Speed | Technical limitations |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
