Submitted:
23 June 2025
Posted:
24 June 2025
Abstract
Keywords:
1. Introduction
- We designed an experiment that generates randomized datasets in three distinct scenarios (personal stories, medical records, and receipts) to evaluate how effectively each prompt style captures the required structure across diverse and context-specific scenarios. Datasets for personal stories contain attributes representing individual characteristics and pair them with corresponding valid narratives. Datasets for medical records contain medical attributes to produce valid and realistic entries. Datasets for receipts contain attributes to create valid examples reflecting real-world purchases, following the dataset construction methodology from our previous study [8].
- We used these three datasets to compare the outputs generated by leading LLMs (ChatGPT-4o, Claude, and Gemini) with the expected results. Accuracy was measured based on strict adherence to the original data attributes and values, ensuring the generated structured data matched the intended formats, including JSON, YAML, and CSV.
- We developed an automated validation framework to measure output fidelity, token consumption, and generation latency. These metrics are visualized through comparative graphs—Technique vs. Accuracy, Technique vs. Token Cost, and Technique vs. Time—highlighting the trade-offs across prompt styles and LLMs.
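The randomized dataset generation described above can be illustrated with a minimal sketch. The attribute pools, field names, and the `make_receipt` and `make_dataset` helpers are hypothetical and chosen only for illustration; the actual attribute sets follow the methodology in [8] and are not reproduced here.

```python
import random

# Hypothetical attribute pools; the study's real attribute sets are not shown here.
ITEMS = ["coffee", "notebook", "batteries", "umbrella"]
PAYMENT_METHODS = ["cash", "credit", "debit"]


def make_receipt(rng: random.Random) -> dict:
    """Build one randomized receipt record that serves as ground truth."""
    quantity = rng.randint(1, 5)
    unit_price_cents = rng.randint(100, 2500)
    return {
        "item": rng.choice(ITEMS),
        "quantity": quantity,
        "unit_price_cents": unit_price_cents,
        "total_cents": quantity * unit_price_cents,
        "payment_method": rng.choice(PAYMENT_METHODS),
    }


def make_dataset(size: int, seed: int) -> list[dict]:
    """Generate a reproducible dataset: the same seed yields the same records."""
    rng = random.Random(seed)
    return [make_receipt(rng) for _ in range(size)]


dataset = make_dataset(size=3, seed=42)
```

Seeding the generator keeps the ground truth reproducible, so the same records could be presented to each LLM and prompt style.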
2. Summary of Research Questions
3. Experiment Design
- Stage One: Data Generation and Prompt Testing. This first stage created randomized datasets tailored to the three contexts (personal stories, receipts, and medical records) and applied six distinct prompt styles (JSON, YAML, CSV, API function calls, Simple Prefixes, and Hybrid CSV/Prefix) to guide the LLMs in generating structured outputs. Three metrics (accuracy, token usage, and generation time) were recorded for each combination of LLM and prompt style.
- Stage Two: Assessment and Refinement. This second stage validated the outputs generated by each LLM against the original datasets to measure accuracy. The metrics collected during Stage One were assessed to identify the most efficient and effective prompt styles. The results were codified into actionable recommendations that highlight the strengths and trade-offs of each prompt style for different data contexts and LLMs.
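To make the prompt styles in Stage One concrete, the sketch below renders one record in three of the six styles (JSON, CSV, and Simple Prefixes). The record, the function names, and the `key: value` syntax used for the prefix style are assumptions for illustration, not the exact formats used in the experiment.

```python
import csv
import io
import json

# Hypothetical record; attribute names are illustrative only.
record = {"name": "Alice", "age": 34, "city": "Nashville"}


def as_json(rec: dict) -> str:
    """Hierarchical style: keys and values wrapped in JSON syntax."""
    return json.dumps(rec, indent=2)


def as_csv(rec: dict) -> str:
    """Tabular style: one header row followed by one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buffer.getvalue()


def as_prefixes(rec: dict) -> str:
    """Simple prefix style (assumed syntax): one 'key: value' line per attribute."""
    return "\n".join(f"{key}: {value}" for key, value in rec.items())


for render in (as_json, as_csv, as_prefixes):
    print(render.__name__)
    print(render(record))
```

Rendering the same record in each style keeps the attribute content fixed, so any difference in output length comes from the format alone.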
3.1. Stage One: Data Generation and Prompt Testing
3.1.1. Personal Stories Dataset and Validation
3.1.2. Medical Record Dataset
3.1.3. Receipt Dataset
3.2. Stage Two: Assessment and Refinement
- Accuracy measures, which calculated the percentage of attributes correctly included in the generated output,
- Token usage measures, which evaluated the number of tokens consumed by each prompt style for each LLM, as token efficiency directly correlates with cost,
- Time efficiency measures, which computed response times for generating outputs to assess the suitability of each LLM for real-time or batch processing tasks.
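The three measures above can be sketched as follows. This is an illustration under stated assumptions, not the validation framework itself: `attribute_accuracy`, `rough_token_count`, and `timed_call` are hypothetical names, exact-match comparison is assumed for accuracy, and the whitespace word count is only a crude stand-in for a provider's tokenizer.

```python
import time
from typing import Callable


def attribute_accuracy(expected: dict, generated: dict) -> float:
    """Percentage of expected attributes present in the output with the exact value."""
    if not expected:
        return 100.0
    matches = sum(
        1 for key, value in expected.items()
        if key in generated and generated[key] == value
    )
    return 100.0 * matches / len(expected)


def rough_token_count(text: str) -> int:
    """Whitespace word count; an assumption standing in for a real tokenizer."""
    return len(text.split())


def timed_call(generate: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Run a generation function and return its output with elapsed seconds."""
    start = time.perf_counter()
    output = generate(prompt)
    return output, time.perf_counter() - start


# Hypothetical ground truth and parsed model output.
expected = {"name": "Alice", "age": 34, "city": "Nashville", "job": "nurse"}
generated = {"name": "Alice", "age": 34, "city": "Memphis"}
accuracy = attribute_accuracy(expected, generated)  # 2 of 4 attributes match
```

In this example one attribute has a wrong value and one is missing, so both count against accuracy.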
4. Analysis of Experiment Results
4.1. Analysis of ChatGPT-4o Experiment Results
4.1.1. Accuracy Analysis for ChatGPT-4o
4.1.2. Token Usage Analysis for ChatGPT-4o
4.1.3. Time Analysis for ChatGPT-4o
4.2. Analysis of Experiment Results for Claude
4.2.1. Accuracy Analysis for Claude
4.2.2. Token Usage Analysis for Claude
4.2.3. Time Analysis for Claude
4.3. Analysis of Experiment Results for Gemini
4.3.1. Accuracy Analysis for Gemini
4.3.2. Token Usage Analysis for Gemini
4.3.3. Time Analysis for Gemini
5. Comparison of ChatGPT-4o, Claude, and Gemini LLM Models
5.1. Comparing the Accuracy of ChatGPT-4o, Claude, and Gemini
5.2. Comparing the Token Usage of ChatGPT-4o, Claude, and Gemini
5.3. Comparing Time Performance of ChatGPT-4o, Claude, and Gemini
5.4. Comparing ChatGPT-4o, Claude, and Gemini Performance Across All Metrics
5.4.1. Metric-by-Metric Comparison Using Bar Charts
5.4.2. Holistic Trade-Off Analysis Using Radar Charts
5.4.3. Performance Analysis
5.4.4. Trade-Offs and Practical Recommendations
5.4.5. Summarizing Insights in a Comparative Table
6. Related Work
7. Concluding Remarks
- Trade-offs in prompt design are context-dependent. Understanding the trade-offs associated with different prompting techniques helps developers and data scientists make informed choices that balance accuracy, efficiency, and cost-effectiveness. For instance, hierarchical formats like JSON and YAML [9] offer superior accuracy but at a higher token cost, whereas CSV and simple prefixes [5] provide cost-efficient alternatives with reduced flexibility for complex structures.
- Alternative formats deliver unique advantages. Alternative formats, such as simple prefixes and hybrid approaches, can offer high accuracy with reduced token costs in certain use cases. These formats are less verbose and strike a balance between clarity and conciseness, making them valuable for semi-structured data, such as receipts or transactional records.
- Consistent prompts enhance evaluation fairness. Using consistent prompts across different LLMs ensures fairness in evaluation and highlights the inherent capabilities of each LLM. Our findings demonstrate that standardized prompts provide a reliable baseline for performance comparison to avoid biases caused by prompt tailoring for specific LLMs.
- Efficiency gains vary across LLMs. The study revealed significant differences in token usage and processing times among ChatGPT-4o, Claude, and Gemini. While ChatGPT-4o had the highest efficiency in both token consumption and time, Claude excelled in accuracy and Gemini struck a balance between the two. These insights guide LLM selection for specific use cases where speed or cost constraints are critical.
- Applications influence prompt selection. The selection of prompt styles can vary significantly based on application domain. For example, our results demonstrate how JSON and YAML formats are better suited for domains like healthcare requiring hierarchical data representation. Conversely, CSV and simple prefixes excel in domains like e-commerce where token efficiency and processing speed are critical. While our datasets simulate healthcare (e.g., medical records) and e-commerce (e.g., receipt records), our results generalize to similar structured data across these domains.
- Prompt design impacts outcomes. While our study does not focus directly on iterative prompt refinement [15], the results indicate that careful initial prompt design plays a key role in determining the accuracy and efficiency of LLM outputs. This finding underscores the importance of selecting prompt styles that align with the dataset’s complexity and intended use case. Our future work will explore the role of iterative refinement in enhancing LLM output quality, which is beyond the scope of this paper.
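The trade-off between hierarchical and flat formats noted in the remarks above can be illustrated with a small sketch. The nested record, the `flatten` helper, the dotted column names, and the `;` list separator are all assumptions for illustration; character length is used only as a rough proxy for token cost.

```python
import json

# Hypothetical nested medical record; field names are illustrative only.
record = {
    "patient": {"name": "Alice", "age": 34},
    "visit": {"diagnosis": "flu", "medications": ["oseltamivir", "ibuprofen"]},
}


def flatten(value, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted column names; lists are joined with ';'."""
    if isinstance(value, dict):
        flat = {}
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return flat
    if isinstance(value, list):
        return {prefix: ";".join(str(item) for item in value)}
    return {prefix: value}


flat = flatten(record)
json_text = json.dumps(record)
csv_text = ",".join(flat) + "\n" + ",".join(str(v) for v in flat.values())

print(len(json_text), len(csv_text))
```

The flattened CSV form is shorter for this record, but the medication list collapses into a single string, so a consumer must already know the separator convention to recover the original structure.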
Acknowledgments
References
- Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
- Adebowale Jeremy Adetayo, Mariam Oyinda Aborisade, and Basheer Abiodun Sanni. Microsoft Copilot and Anthropic Claude AI in education and library service. Library Hi Tech News, 2024.
- Ramon Maria Garcia Alarcia and Alessandro Golkar. Optimizing token usage on large language model conversations using the design structure matrix. arXiv 2024, arXiv:2410.00749.
- Jo Inge Arnes and Alexander Horsch. Schema-based priming of large language model for data object validation compliance. Available at SSRN 4453361, 2023.
- Alexander Ball, Lian Ding, and Manjula Patel. Lightweight formats for product model data exchange and preservation. In PV 2007 Conference, pages 9–11, 2007.
- Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-shot Learners. Advances in Neural Information Processing Systems 2020, 33, 1877–1901. arXiv:2005.14165.
- John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S Rosen, Gerbrand Ceder, Kristin A Persson, and Anubhav Jain. Structured Information Extraction from Scientific Text with Large Language Models. Nature Communications 2024, 15, 1418.
- Ashraf Elnashar, Jules White, and Douglas C. Schmidt. Enhancing structured data generation with GPT-4o evaluating prompt efficiency across prompt styles. Frontiers in Artificial Intelligence, 2025; 8.
- Malin Eriksson and Victor Hallberg. Comparison between JSON and YAML for data serialization. The School of Computer Science and Engineering Royal Institute of Technology, 2011; 1–25.
- Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. arXiv 2018, arXiv:1805.04833.
- William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 2022, 23, 1–39.
- Lin Guo. The effects of the format and frequency of prompts on source evaluation and multiple-text comprehension. Reading Psychology 2023, 44, 358–387.
- Shashank Kedia, Aditya Mantha, Sneha Gupta, Stephen Guo, and Kannan Achan. Generating rich product descriptions for conversational e-commerce systems. In Companion Proceedings of the Web Conference 2021, pages 349–356, 2021.
- Gleb Kumichev, Pavel Blinov, Yulia Kuzkina, Vasily Goncharov, Galina Zubkova, Nikolai Zenovkin, Aleksei Goncharov, and Andrey Savchenko. MedSyn: LLM-based synthetic medical text generation framework. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 215–230. Springer, 2024.
- Zhexin Liang, Chongyi Li, Shangchen Zhou, Ruicheng Feng, and Chen Change Loy. Iterative prompt learning for unsupervised backlit image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8094–8103, 2023.
- Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 2023, 55, 1–35.
- Max Moundas, Jules White, and Douglas C. Schmidt. Prompt Patterns for Structured Data Extraction from Unstructured Text. In Proceedings of the 31st Pattern Languages of Programming (PLoP) conference, Columbia River Gorge, WA, October 2024.
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer. Journal of machine learning research 2020, 21, 1–67.
- Sindhu Tipirneni, Ming Zhu, and Chandan K. Reddy. StructCoder: Structure-aware transformer for code generation, 2024.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
- Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. arXiv 2023, arXiv:2309.10691.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
