Submitted: 26 May 2025
Posted: 26 May 2025
Abstract
Keywords:
1. Introduction
- RQ1: To what extent does an LLM-assisted class diagram match a human-created diagram in terms of completeness and correctness?
- RQ2: To what extent does an LLM-assisted deployment diagram match a human-created diagram in terms of completeness and correctness?
- RQ3: To what extent does an LLM-assisted use case diagram match a human-created diagram in terms of completeness and correctness?
- RQ4: To what extent does an LLM-assisted sequence diagram match a human-created diagram in terms of completeness and correctness?
2. Related Work
3. Materials and Methods
3.1. Generate UML Diagrams Using GPT-4-Turbo and PlantUML
3.2. Prompt Engineering Rules
Algorithm 1: Generate UML Prompt
Algorithm 2: Extract Elements
Algorithm 3: Map Relationships
Algorithm 4: Define Constraints
3.3. Validation Functions
Algorithm 5: Validate Completeness
Algorithm 6: Validate Correctness
3.4. Generating LLM-Assisted Diagrams
3.5. Data Collection
4. Results
4.1. Statistical Analysis
4.1.1. Descriptive Statistics
4.1.2. Inferential Statistics
4.2. Comparing LLM-Assisted Diagrams to Human-Created Diagrams
4.2.1. Class Diagram
4.2.2. Deployment Diagram
4.2.3. Use Case Diagram
4.2.4. Sequence Diagram
4.3. Validation of the Proposed Approach
5. Discussion
6. Conclusions
Funding
Informed Consent Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| LLMs | Large Language Models |
| UML | Unified Modeling Language |
| GPT | Generative Pre-trained Transformer |
| SDLC | Software Development Life Cycle |
| NLP | Natural Language Processing |
| CR | Completeness Ratio |
| CS | Constraint Satisfaction |
| SR | System Requirements |
| SRS | Software Requirements Specification |
Appendix A. Class Diagram



| Rule Step | Applied Output |
|---|---|
| ExtractCriteria | Identified core classes: Customer, Member, NonMember, Car, CarModel, Reservation, Rental, CreditCard, Address, InternetAccount, Category, Vendor, CarModelDetails |
| MapRelationships | Defined generalization, aggregation, and associations based on domain semantics. Mapped actions like reserve/rent/cancel to operations. |
| DefineConstraints | Added multiplicities, role labels, and semantic constraints on class-to-class relationships. |
| ValidatePrompt | Satisfied completeness and correctness criteria based on coverage and UML alignment. |
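Section 3.1 names PlantUML as the rendering target. As a rough illustration of how the ExtractCriteria, MapRelationships, and DefineConstraints outputs above could be serialized, the sketch below turns a list of classes and relationships into PlantUML text. The `Relation` type, `to_plantuml()`, the chosen class subset, the aggregation, and the "makes" label are all hypothetical; this is not the authors' implementation.

```python
# Hypothetical sketch: turn extracted class-diagram elements into PlantUML text.
# Relation and to_plantuml() are illustrative names, not the paper's code.
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    arrow: str             # PlantUML arrow: "<|--" generalization, "o--" aggregation, "--" association
    source_mult: str = ""  # multiplicity at the source end, e.g. "1"
    target_mult: str = ""  # multiplicity at the target end, e.g. "0..*"
    label: str = ""        # optional association label


def to_plantuml(classes: list[str], relations: list[Relation]) -> str:
    """Emit a minimal PlantUML class diagram: one line per class, one per relation."""
    lines = ["@startuml"]
    lines += [f"class {name}" for name in classes]
    for rel in relations:
        left = f'{rel.source} "{rel.source_mult}"' if rel.source_mult else rel.source
        right = f'"{rel.target_mult}" {rel.target}' if rel.target_mult else rel.target
        suffix = f" : {rel.label}" if rel.label else ""
        lines.append(f"{left} {rel.arrow} {right}{suffix}")
    lines.append("@enduml")
    return "\n".join(lines)


# Illustrative subset of the Appendix A classes; the aggregation and the
# "makes" association are assumptions chosen for the example.
diagram = to_plantuml(
    ["Customer", "Member", "NonMember", "Reservation", "CarModel", "Car"],
    [
        Relation("Customer", "Member", "<|--"),
        Relation("Customer", "NonMember", "<|--"),
        Relation("Customer", "Reservation", "--", "1", "0..*", "makes"),
        Relation("CarModel", "Car", "o--", "1", "0..*"),
    ],
)
print(diagram)
```

Keeping multiplicities as explicit fields makes the "multiplicities, role labels" constraints easy to check mechanically before the text is handed to PlantUML.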
LLM-Generated Diagram:
- Implemented Elements: 19 out of 29
- Observed Coverage: Included major domain classes such as Customer, Member, Car, and Rental
- Missing Elements: Vendor, Category, CarModelDetails, and Make
- Satisfied Constraints: 22 out of 33
- Strengths: Correct usage of inheritance and multiplicity annotations
- Limitations: Lacked precise naming conventions, omitted some aggregations and compositions
Human-Generated Diagram:
- Implemented Elements: 24 out of 29
- Observed Coverage: Included nearly all expected elements, including Vendor, Make, CarModelDetails, and Category
- Satisfied Constraints: 27 out of 33
- Strengths: Correct use of inheritance, composition, aggregation, and multiplicity; improved naming and semantic precision
| Source | Implemented Elements | CR (%) | Satisfied Constraints | CS (%) |
|---|---|---|---|---|
| LLM-Generated | 19 / 29 | 65.52% | 22 / 33 | 66.67% |
| Human-Generated | 24 / 29 | 82.76% | 27 / 33 | 81.82% |
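The CR and CS columns in this appendix (and in Appendices B to D) are plain ratios: CR is implemented elements divided by expected elements, and CS is satisfied constraints divided by total constraints, both expressed as percentages. The sketch below recomputes them from the counts reported in the four appendix tables; the helper names `ratio()` and `cr_cs()` are illustrative.

```python
# Recompute CR and CS from the counts in the appendix tables.
# ratio() and cr_cs() are illustrative helper names.
from typing import NamedTuple


class Counts(NamedTuple):
    implemented: int   # implemented elements
    expected: int      # expected elements
    satisfied: int     # satisfied constraints
    constraints: int   # total constraints


def ratio(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as in the appendix tables."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 2)


def cr_cs(c: Counts) -> tuple[float, float]:
    """Return (CR %, CS %) for one diagram."""
    return ratio(c.implemented, c.expected), ratio(c.satisfied, c.constraints)


TABLES = {
    ("Class", "LLM"): Counts(19, 29, 22, 33),
    ("Class", "Human"): Counts(24, 29, 27, 33),
    ("Deployment", "LLM"): Counts(18, 18, 21, 26),
    ("Deployment", "Human"): Counts(14, 18, 24, 26),
    ("Use Case", "LLM"): Counts(16, 18, 22, 28),
    ("Use Case", "Human"): Counts(18, 18, 26, 28),
    ("Sequence", "LLM"): Counts(12, 14, 19, 24),
    ("Sequence", "Human"): Counts(14, 14, 22, 24),
}

for (diagram, source), counts in TABLES.items():
    cr, cs = cr_cs(counts)
    print(f"{diagram:<10} {source:<5} CR = {cr:6.2f}%  CS = {cs:6.2f}%")
```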
Appendix B. Deployment Diagram


| Rule Step | Applied Output |
|---|---|
| ExtractComponents | Identified nodes and environments: CootHTMLClient, CootServer, DBServer, with execution environments like WebServer, CootBusinessServer, and DBMS. |
| MapRelationships | Mapped inter-node communication (e.g., HTTP, internal links), deployment of artifacts (e.g., icoot.ear, cootschema.ddl), and WebServer → BusinessServer interactions. |
| DefineConstraints | Included deployment semantics, replicated nodes for reliability, layered structure (client → app → data), and use of artifacts per UML standard. |
| ValidatePrompt | Compared against UML 2.5 structure; evaluated for completeness (elements, relationships) and correctness (notation, stereotypes). |
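The evaluation of this diagram hinges largely on UML stereotypes (<<device>>, <<execution environment>>) and on where artifacts are deployed. A minimal sketch, assuming a simplified three-tier topology built from the nodes named above, shows how nested nodes with explicit stereotypes could be emitted as PlantUML text. `Node`, `render_node()`, `deployment_diagram()`, and the link labels are hypothetical, and the output should be checked against PlantUML itself.

```python
# Hypothetical sketch: nested deployment nodes with explicit UML stereotypes,
# rendered as PlantUML text. The topology is a simplified, assumed subset.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    stereotype: str                                   # e.g. "device", "executionEnvironment"
    children: list["Node"] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)


def render_node(node: Node, depth: int = 0) -> list[str]:
    """Render a node, its nested nodes, and its deployed artifacts."""
    pad = "  " * depth
    body: list[str] = []
    for child in node.children:
        body += render_node(child, depth + 1)
    for artifact in node.artifacts:
        alias = artifact.replace(".", "_")            # quote the file name, use a dot-free alias
        body.append(f'{pad}  artifact "{artifact}" as {alias}')
    head = f"{pad}node {node.name} <<{node.stereotype}>>"
    return [head] if not body else [head + " {", *body, pad + "}"]


def deployment_diagram(nodes: list[Node], links: list[tuple[str, str, str]]) -> str:
    lines = ["@startuml"]
    for node in nodes:
        lines += render_node(node)
    lines += [f"{a} -- {b} : {label}" for a, b, label in links]
    lines.append("@enduml")
    return "\n".join(lines)


client = Node("CootHTMLClient", "device")
server = Node("CootServer", "device", children=[
    Node("WebServer", "executionEnvironment"),
    Node("CootBusinessServer", "executionEnvironment", artifacts=["icoot.ear"]),
])
db = Node("DBServer", "device", children=[
    Node("DBMS", "executionEnvironment", artifacts=["cootschema.ddl"]),
])
# Link labels are assumptions for illustration.
diagram = deployment_diagram(
    [client, server, db],
    [("CootHTMLClient", "CootServer", "HTTP"),
     ("WebServer", "CootBusinessServer", "internal"),
     ("CootServer", "DBServer", "internal")],
)
print(diagram)
```

Replicated nodes such as CootServer2 would simply be additional `Node` values, which is one way the redundancy gap noted for the human diagram could be expressed.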
LLM-Generated Diagram:
- Implemented Elements: 18 out of 18
- Observed Coverage: All major nodes, execution environments, and artifacts included
- Missing Constraints: Lacked full UML stereotypes (e.g., <<device>>, <<execution environment>>), no <<manifest>> usage
- Satisfied Constraints: 21 out of 26
- Strengths: Clear three-tier layout, full redundancy modeling, artifact deployment clarity
- Limitations: Missing UML annotations (e.g., stereotypes), minor layout inconsistency
Human-Generated Diagram:
- Implemented Elements: 14 out of 18
- Observed Coverage: Core layers present, but no node redundancy, partial client-side representation
- Satisfied Constraints: 24 out of 26
- Strengths: Accurate UML stereotypes, manifest usage, well-formed internal structure
- Limitations: Missing replicated nodes (no DBServer2 or CootServer2), lacks scalability representation
| Source | Implemented Elements | CR (%) | Satisfied Constraints | CS (%) |
|---|---|---|---|---|
| LLM-Generated | 18 / 18 | 100.00% | 21 / 26 | 80.77% |
| Human-Generated | 14 / 18 | 77.78% | 24 / 26 | 92.31% |
Appendix C. Use Case Diagram


| Rule Step | Applied Output |
|---|---|
| ExtractComponents | Identified actors: Customer, Member, NonMember, Assistant. Identified use cases: Log On, Make Reservation, Cancel Reservation, Browse, Search, View Results, View CarModel Details, etc. |
| MapRelationships | Modeled associations between actors and use cases. Applied generalization between Customer, Member, and NonMember. Defined includes (<<include>>) and extends (<<extend>>) relationships. |
| DefineConstraints | Applied system boundary, actor-use case mapping constraints, logical grouping, and interaction coverage per domain. Checked for overlapping actor responsibilities and goal-oriented behavior. |
| ValidatePrompt | Evaluated based on UML completeness (actors, use cases, relationships) and syntactic correctness (notation, use of stereotypes, boundary box). |
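The ValidatePrompt step above checks for a system boundary, actor generalization, and <<include>>/<<extend>> relationships. The sketch below, offered as an illustration only, emits those constructs as PlantUML text. The boundary name "iCoot" and the specific include/extend pairs are assumptions, as are the helper names `alias()` and `use_case_diagram()`.

```python
# Hypothetical sketch: a use case diagram with a system boundary, actor
# generalization, <<include>> and <<extend>>, emitted as PlantUML text.


def alias(use_case: str) -> str:
    """Build a PlantUML-safe alias, e.g. "Log On" -> "UC_LogOn"."""
    return "UC_" + "".join(ch for ch in use_case if ch.isalnum())


def use_case_diagram(
    system: str,
    actors: list[str],
    generalizations: list[tuple[str, str]],
    use_cases: list[str],
    associations: list[tuple[str, str]],
    includes: list[tuple[str, str]],
    extends: list[tuple[str, str]],
) -> str:
    lines = ["@startuml", "left to right direction"]
    lines += [f"actor {a}" for a in actors]
    # Generalization: the child actor inherits the parent's associations.
    lines += [f"{parent} <|-- {child}" for parent, child in generalizations]
    # System boundary drawn as a rectangle around the use cases.
    lines.append(f'rectangle "{system}" {{')
    lines += [f'  usecase "{uc}" as {alias(uc)}' for uc in use_cases]
    lines.append("}")
    lines += [f"{actor} --> {alias(uc)}" for actor, uc in associations]
    # <<include>> points from the base use case to the included one;
    # <<extend>> points from the extending use case to the base.
    lines += [f"{alias(b)} ..> {alias(i)} : <<include>>" for b, i in includes]
    lines += [f"{alias(e)} ..> {alias(b)} : <<extend>>" for e, b in extends]
    lines.append("@enduml")
    return "\n".join(lines)


diagram = use_case_diagram(
    system="iCoot",
    actors=["Customer", "Member", "NonMember", "Assistant"],
    generalizations=[("Customer", "Member"), ("Customer", "NonMember")],
    use_cases=["Log On", "Make Reservation", "Browse", "View Results",
               "View CarModel Details"],
    associations=[("Customer", "Browse"), ("Member", "Make Reservation")],
    includes=[("Make Reservation", "Log On")],
    extends=[("View CarModel Details", "View Results")],
)
print(diagram)
```

Note the opposite arrow directions: an included use case is mandatory for its base, whereas an extending use case optionally augments its base, so the dependency runs the other way.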
LLM-Generated Diagram:
- Implemented Elements: 16 out of 18
- Observed Coverage: Covered all key actors and most use cases
- Missing Elements: Omitted abstraction of “Look for Car Models” (U13); fewer <<extend>> relations used
- Satisfied Constraints: 22 out of 28
- Strengths: Accurate actor generalization, proper naming, and most core use cases shown
- Limitations: Lacked a system boundary, made limited use of <<include>>/<<extend>> relationships, and missed scenario abstraction
Human-Generated Diagram:
- Implemented Elements: 18 out of 18
- Observed Coverage: All actors, use cases, and abstract/generalized goals (e.g., U13: Look for Car Models)
- Satisfied Constraints: 26 out of 28
- Strengths: Rich use of <<include>> and <<extend>>, clear boundary, correct generalization, scenario modularity
- Limitations: Minor layout clutter (does not affect correctness)
| Source | Implemented Elements | CR (%) | Satisfied Constraints | CS (%) |
|---|---|---|---|---|
| LLM-Generated | 16 / 18 | 88.89% | 22 / 28 | 78.57% |
| Human-Generated | 18 / 18 | 100.00% | 26 / 28 | 92.86% |
Appendix D. Sequence Diagram



| Rule Step | Applied Output |
|---|---|
| ExtractComponents | Identified lifelines: Member, Browser, AuthenticationServlet, AuthenticationServer, MemberHome, InternetAccount. |
| MapRelationships | Modeled message flows such as logoff(), retrieveMember(), setSessionId(0). Captured synchronous and return messages between components. |
| DefineConstraints | Validated ordering of messages, logical grouping of synchronous vs. asynchronous calls, inclusion of return messages, and correct use of activation bars. |
| ValidatePrompt | Assessed for proper UML 2.x notation, completeness of flow, and coverage of the "logoff" use case scenario based on system description. |
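The DefineConstraints step above checks message ordering, return messages, and activation bars, which the LLM diagram partly lacked. One way to guarantee those properties is to generate the diagram from nested calls, so that every synchronous message automatically receives an activation bar and a matching return. The sketch below assumes a call order for the logoff scenario using the lifelines and messages named above; the exact order, and sending `setSessionId(0)` to InternetAccount, are assumptions, and `Call`, `render()`, and `sequence_diagram()` are hypothetical names.

```python
# Hypothetical sketch: build a PlantUML sequence diagram from nested calls so
# that every synchronous message gets an activation bar and a return message.
from dataclasses import dataclass, field


@dataclass
class Call:
    caller: str
    callee: str
    message: str
    nested: list["Call"] = field(default_factory=list)


def render(call: Call) -> list[str]:
    lines = [f"{call.caller} -> {call.callee} : {call.message}",
             f"activate {call.callee}"]
    for inner in call.nested:                      # inner calls occur while the callee is active
        lines += render(inner)
    lines += [f"{call.callee} --> {call.caller}",  # dashed return message
              f"deactivate {call.callee}"]
    return lines


def sequence_diagram(actor: str, participants: list[str], root: Call) -> str:
    lines = ["@startuml", f"actor {actor}"]
    lines += [f"participant {p}" for p in participants]
    lines += render(root)
    lines.append("@enduml")
    return "\n".join(lines)


logoff = Call("Member", "Browser", "logoff()", [
    Call("Browser", "AuthenticationServlet", "logoff()", [
        Call("AuthenticationServlet", "AuthenticationServer", "logoff()", [
            Call("AuthenticationServer", "MemberHome", "retrieveMember()"),
            Call("AuthenticationServer", "InternetAccount", "setSessionId(0)"),
        ]),
    ]),
])
diagram = sequence_diagram(
    "Member",
    ["Browser", "AuthenticationServlet", "AuthenticationServer",
     "MemberHome", "InternetAccount"],
    logoff,
)
print(diagram)
```

Because activation and return are produced by the same recursive call, they cannot become unbalanced; an `alt` fragment for a failed retrieval would need an extra construct not modeled here.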
LLM-Generated Diagram:
- Implemented Elements: 12 out of 14
- Observed Coverage: Included major objects and messages for the logoff scenario
- Missing Elements: Activation bars were absent, and some labels were unclear (e.g., “store session as 0”)
- Satisfied Constraints: 19 out of 24
- Strengths: Good message ordering, full lifeline inclusion, and coverage of all involved actors
- Limitations: Mixed free-text labels with method calls, lacked activation semantics, and omitted interaction fragments
Human-Generated Diagram:
- Implemented Elements: 14 out of 14
- Observed Coverage: Full logoff interaction sequence modeled clearly
- Satisfied Constraints: 22 out of 24
- Strengths: Clear lifelines, activation bars, correct synchronous calls, consistent use of UML notation
- Limitations: Somewhat compact layout; no guard conditions or alt fragments for edge cases (e.g., failed session retrieval)
| Source | Implemented Elements | CR (%) | Satisfied Constraints | CS (%) |
|---|---|---|---|---|
| LLM-Generated | 12 / 14 | 85.71% | 19 / 24 | 79.17% |
| Human-Generated | 14 / 14 | 100.00% | 22 / 24 | 91.67% |
References
1. Gao, C.; Hu, X.; Gao, S.; Xia, X.; Jin, Z. The Current Challenges of Software Engineering in the Era of Large Language Models. ACM Transactions on Software Engineering and Methodology 2024.
2. Liu, J.; Xia, C.S.; Wang, Y.; Zhang, L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 2023, 36, 21558–21572.
3. Vaithilingam, P.; Zhang, T.; Glassman, E.L. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Proceedings of the CHI Conference on Human Factors in Computing Systems Extended Abstracts; 2022; pp. 1–7.
4. Ahmad, A.; Waseem, M.; Liang, P.; Fahmideh, M.; Aktar, M.S.; Mikkonen, T. Towards human-bot collaborative software architecting with ChatGPT. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering; 2023; pp.
5. Zimmermann, D.; Koziolek, A. Automating GUI-based software testing with GPT-3. In Proceedings of the 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW); IEEE; 2023; pp. 62–65.
6. Ozkaya, M.; Erata, F. A survey on the practical use of UML for different software architecture viewpoints. Information and Software Technology 2020, 121, 106275.
7. Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access 2024, 12, 26839–26874.
8. Gamage, M.Y.L. Automated Software Architecture Diagram Generator Using Natural Language Processing. BSc (Hons) dissertation in Computer Science, University of Westminster, UK.
9. Carvalho, G.; Dihego, J.; Sampaio, A. An integrated framework for analysing, simulating and testing UML models. In Proceedings of the Brazilian Symposium on Formal Methods; Springer; 2024; pp. 86–104.
10. Ambler, S.W. The Elements of UML™ 2.0 Style; Cambridge University Press, 2005.
11. Chen, Z.; Wang, C.; Sun, W.; Yang, G.; Liu, X.; Zhang, J.M.; Liu, Y. Promptware Engineering: Software Engineering for LLM Prompt Development. arXiv preprint arXiv:2503.02400, 2025.
12. Pornprasit, C.; Tantithamthavorn, C. Fine-tuning and prompt engineering for large language models-based code review automation. Information and Software Technology 2024, 175, 107523.
13. Liu, M.; Wang, J.; Lin, T.; Ma, Q.; Fang, Z.; Wu, Y. An empirical study of the code generation of safety-critical software using LLMs. Applied Sciences 2024, 14, 1046.
14. Boukhlif, M.; Kharmoum, N.; Hanine, M.; Kodad, M.; Lagmiri, S.N. Towards an Intelligent Test Case Generation Framework Using LLMs and Prompt Engineering. In Proceedings of the International Conference on Smart Medical, IoT & Artificial Intelligence; Springer; 2024; pp. 24–31.
15. Ferrari, A.; Abualhaija, S.; Arora, C. Model Generation from Requirements with LLMs: an Exploratory Study. arXiv preprint arXiv:2404.06371, 2024.
16. Cámara, J.; Troya, J.; Burgueño, L.; Vallecillo, A. On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML. Software and Systems Modeling 2023, 22, 781–793.
17. De Vito, G.; Palomba, F.; Gravino, C.; Di Martino, S.; Ferrucci, F. Echo: An approach to enhance use case quality exploiting large language models. In Proceedings of the 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA); IEEE; 2023; pp. 53–60.
18. Herwanto, G.B. Automating Data Flow Diagram Generation from User Stories Using Large Language Models. In Proceedings of the 7th Workshop on Natural Language Processing for Requirements Engineering; 2024.
19. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 2024, 33, 1–79.
20. Hussein, D. Usability of LLMs for Assisting Software Engineering: a Literature Review. 2024.
21. Lorenzo, C. Integrating large language models for real-world problem modelling: A comparative study. In INTED2024 Proceedings; IATED; 2024; pp. 3262–3272.
22. Nifterik, S.v. Exploring the Potential of Large Language Models in Supporting Domain Model Derivation from Requirements Elicitation Conversations. Master’s thesis, 2024.
23. Buchmann, R.; Eder, J.; Fill, H.G.; Frank, U.; Karagiannis, D.; Laurenzi, E.; Mylopoulos, J.; Plexousakis, D.; Santos, M.Y. Large language models: Expectations for semantics-driven systems engineering. Data & Knowledge Engineering 2024, 152, 102324.
24. Hemmat, A.; Sharbaf, M.; Kolahdouz-Rahimi, S.; Lano, K.; Tehrani, S.Y. Research directions for using LLM in software requirement engineering: a systematic review. Frontiers in Computer Science 2025, 7, 1519437.
25. Umar, M.A. Automated Requirements Engineering Framework for Model-Driven Development. PhD thesis, King’s College London, 2024.
26. Vega Carrazan, P.F. Large Language Models Capabilities for Software Requirements Automation. PhD thesis, Politecnico di Torino, 2024.
27. Conrardy, A.; Capozucca, A.; Cabot, J. User Modeling in Model-Driven Engineering: A Systematic Literature Review. arXiv preprint arXiv:2412.15871, 2024.
28. O’Docherty, M. Object-Oriented Analysis & Design; John Wiley & Sons, 2005.
29. Chen, K.; Yang, Y.; Chen, B.; López, J.A.H.; Mussbacher, G.; Varró, D. Automated Domain Modeling with Large Language Models: A Comparative Study. In Proceedings of the 2023 ACM/IEEE 26th International Conference on Model Driven Engineering Languages and Systems (MODELS); IEEE; 2023; pp. 162–172.
30. Wang, B.; Wang, C.; Liang, P.; Li, B.; Zeng, C. How LLMs Aid in UML Modeling: An Exploratory Study with Novice Analysts. arXiv preprint arXiv:2404.17739, 2024.
31. Conrardy, A.; Cabot, J. From Image to UML: First Results of Image Based UML Diagram Generation Using LLMs. arXiv preprint arXiv:2404.11376, 2024.







| Study | UML Diagram | Purpose |
|---|---|---|
| [31] | Class | Class diagrams generated from images |
| [30] | Class, Use Case, Sequence | Student-generated diagrams from requirements |
| [15] | Sequence | Sequence diagrams generated from SRS |
| [29] | Class | Fully automated domain modeling |
| [16] | Class | Understanding ChatGPT's capabilities in modeling |
| [17] | Use Case | Co-prompt engineering approach |
| This study | Class, Deployment, Use Case, Sequence | Student-centric evaluation survey |

| UML Diagram | Source | Sample Size (N) | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|---|---|
| Class | Human | 6 | 0.7988 | 0.0322 | 0.7568 | 0.8288 |
|  | LLM | 6 | 0.6502 | 0.0457 | 0.5721 | 0.6982 |
| Deployment | Human | 4 | 0.7005 | 0.0372 | 0.6577 | 0.7432 |
|  | LLM | 4 | 0.6486 | 0.0362 | 0.6171 | 0.6937 |
| Use Case | Human | 5 | 0.8072 | 0.0087 | 0.7928 | 0.8153 |
|  | LLM | 5 | 0.6712 | 0.0300 | 0.6351 | 0.7117 |
| Sequence | Human | 4 | 0.7320 | 0.0358 | 0.6847 | 0.7703 |
|  | LLM | 4 | 0.6768 | 0.0174 | 0.6532 | 0.6937 |

| UML Diagram | Source | Sample Size (N) | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|---|---|
| Class | Human | 6 | 0.7635 | 0.0220 | 0.7342 | 0.7928 |
|  | LLM | 6 | 0.6111 | 0.0854 | 0.4865 | 0.6847 |
| Deployment | Human | 4 | 0.7309 | 0.0174 | 0.7207 | 0.7568 |
|  | LLM | 4 | 0.6430 | 0.0279 | 0.6126 | 0.6802 |
| Use Case | Human | 4 | 0.8041 | 0.0222 | 0.7793 | 0.8288 |
|  | LLM | 4 | 0.6419 | 0.0298 | 0.6036 | 0.6757 |
| Sequence | Human | 4 | 0.7264 | 0.0245 | 0.6937 | 0.7523 |
|  | LLM | 4 | 0.6622 | 0.0337 | 0.6261 | 0.7072 |

| UML Diagram | Sample Size (N) | Degrees of Freedom (df) | t-statistic (t) | p-value | Cohen’s d |
|---|---|---|---|---|---|
| Class | 12 | 11 | 10.2576 |  | 2.9617 |
| Deployment | 8 | 7 | 8.5979 |  | 3.0393 |
| Use Case | 9 | 8 | 14.0198 |  | 4.6764 |
| Sequence | 8 | 7 | 6.3026 |  | 2.1703 |
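
In the t-test table, every row has df = N - 1, which fits a paired test in which N counts matched human/LLM pairs. For paired data, Cohen's d is often reported as d_z = mean difference / SD of differences = t / sqrt(N); for the class row, 10.2576 / sqrt(12) ≈ 2.961, close to the tabulated 2.9617, which suggests (but does not confirm) that d_z was used. The stdlib-only sketch below computes t, df, and d_z under that paired assumption; the function name and the input scores are made up for illustration and are not study data.

```python
# Hedged sketch: paired t statistic and Cohen's d for paired data (d_z).
# Assumes each pair is (human score, LLM score) for the same item.
import math
import statistics


def paired_t_and_dz(human: list[float], llm: list[float]) -> tuple[float, int, float]:
    """Return (t, degrees of freedom, d_z) for paired samples."""
    if len(human) != len(llm) or len(human) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [h - m for h, m in zip(human, llm)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample SD of the differences (n - 1 denominator)
    t = mean_d / (sd_d / math.sqrt(n))
    d_z = mean_d / sd_d                     # equivalently t / sqrt(n)
    return t, n - 1, d_z


# Made-up scores purely to exercise the function (not study data).
t, df, d_z = paired_t_and_dz([0.80, 0.78, 0.82, 0.79], [0.65, 0.66, 0.64, 0.67])
print(f"t = {t:.4f}, df = {df}, d_z = {d_z:.4f}")
```

The p-value needs the Student t distribution, which the standard library lacks; `scipy.stats.ttest_rel(human, llm)` returns both the statistic and the p-value for the same paired design.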
| UML Diagram | Sample Size (N) | Wilcoxon T-statistic (W) | p-value |
|---|---|---|---|
| Class | 12 | 0.0000 | |
| Deployment | 8 | 0.0000 | 0.0078 |
| Use Case | 9 | 0.0000 | 0.0039 |
| Sequence | 8 | 0.0000 | 0.0078 |
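
All four Wilcoxon rows report W = 0. If W is the smaller of the positive and negative rank sums, W = 0 means every non-zero difference has the same sign, and under the exact null distribution (assuming no zero differences and no ties) the two-sided p-value is 2 / 2^n. The sketch below computes that exact distribution by dynamic programming and reproduces the tabulated 0.0078 (n = 8) and 0.0039 (n = 9). The function names are illustrative.

```python
# Hedged sketch: exact two-sided p-value of the Wilcoxon signed-rank test,
# assuming no zero differences and no tied ranks (so the ranks are 1..n).


def signed_rank_null_counts(n: int) -> list[int]:
    """counts[w] = number of the 2**n sign patterns whose positive-rank sum is w."""
    max_sum = n * (n + 1) // 2
    counts = [1] + [0] * max_sum
    for rank in range(1, n + 1):
        for w in range(max_sum, rank - 1, -1):   # descending: each rank is used at most once
            counts[w] += counts[w - rank]
    return counts


def exact_two_sided_p(w: int, n: int) -> float:
    """Two-sided p for W = the smaller of the positive and negative rank sums."""
    counts = signed_rank_null_counts(n)
    max_sum = n * (n + 1) // 2
    w = min(w, max_sum - w)
    lower_tail = sum(counts[: w + 1])
    return min(1.0, 2 * lower_tail / 2 ** n)


for n in (8, 9):
    print(n, round(exact_two_sided_p(0, n), 4))
# 8 0.0078
# 9 0.0039
```

`scipy.stats.wilcoxon` implements the same test and can be used to cross-check results computed from the raw paired scores.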
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).