Submitted:
27 October 2025
Posted:
28 October 2025
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Prompt Optimization for Chain-of-Thought Reasoning
2.2. Adversarial Training
3. Methods
3.1. Adversarial Learning
3.2. Feedback
3.3. Adversarial Learning Algorithm
| Algorithm 1: Adversarial Chain-of-Thought Optimization |
3.4. Verifier
4. Experiment Setup
4.1. Datasets
4.2. Implementation Details
4.3. Baselines
5. Results and Discussion
5.1. Results
5.2. Discussion
5.2.1. Ablation Study
5.2.2. Difference between verification and self-consistency
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| CoT | Chain-of-Thought |
| adv-CoT | Adversarial Chain-of-Thought |
| LLMs | Large Language Models |
| NLP | Natural Language Processing |
| GANs | Generative Adversarial Networks |
| adv-ICL | Adversarial In-Context Learning |
| SC | Self-Consistency |
Appendix A. Statistics of Datasets
| Dataset | Number of samples | Average words | Answer Format | License |
|---|---|---|---|---|
| CSQA | 1,221 | 27.8 | Multi-choice | Unspecified |
| StrategyQA | 2,290 | 9.6 | Yes or No | Apache-2.0 |
| OpenBookQA | 500 | 27.6 | Multi-choice | Unspecified |
| ARC-c | 1,172 | 47.5 | Multi-choice | CC BY SA-4.0 |
| Sports | 1,000 | 7.0 | Yes or No | Apache-2.0 |
| BoolQ | 3,270 | 8.7 | Yes or No | CC BY SA-3.0 |
| Last Letters | 500 | 15.0 | String | Unspecified |
| Coin Flip | 500 | 37.0 | Yes or No | Unspecified |
| GSM8K | 1,319 | 46.9 | Number | MIT License |
| SVAMP | 1,000 | 31.8 | Number | MIT License |
| AQuA | 254 | 51.9 | Multi-choice | Apache-2.0 |
| MultiArith | 600 | 31.8 | Number | CC BY SA-4.0 |
- CSQA [46]: a multiple-choice question answering dataset that evaluates a model's ability to apply commonsense knowledge in reasoning tasks. Its questions are diverse and require understanding everyday situations beyond factual recall. The homepage is https://github.com/jonathanherzig/commonsenseqa.
- StrategyQA [47]: a commonsense QA task with a Yes or No answer format. We use the open-domain setting (question-only set) from BIG-Bench [50]: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/strategyqa. The original dataset is from https://github.com/eladsegal/strategyqa and is released under the MIT license: https://github.com/eladsegal/strategyqa/blob/main/LICENSE.
- OpenBookQA [48]: a multiple-choice QA task that evaluates commonsense knowledge. The original dataset is from https://allenai.org/data/open-book-qa.
- ARC-c [49]: a multiple-choice commonsense QA task. The original dataset is from https://allenai.org/data/arc. CC BY SA-4.0 license: https://creativecommons.org/licenses/by-sa/4.0/.
- Sports understanding from BIG-Bench [50]: a task with a Yes or No answer format. The homepage is https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/sports_understanding. Apache License v.2: https://github.com/google/BIG-bench/blob/main/LICENSE.
- BoolQ [51]: a knowledge-intensive task with a Yes or No answer format. The original dataset is from https://github.com/google-research-datasets/boolean-questions. CC BY SA-3.0 license: https://creativecommons.org/licenses/by-sa/3.0/.
- Last Letters & Coin Flip [9]: benchmarks that evaluate whether an LLM can solve simple symbolic reasoning problems. The Last Letters dataset is from https://huggingface.co/datasets/ChilleD/LastLetterConcat. The Coin Flip dataset is from https://huggingface.co/datasets/skrishna/coin_flip.
- GSM8K [52]: a dataset of grade school math word problems that evaluates a model's ability to perform multi-step numerical reasoning. The homepage is https://github.com/openai/grade-school-math. MIT license: https://github.com/openai/grade-school-math/blob/master/LICENSE.
- SVAMP [53]: a benchmark of elementary math word problems designed to test whether a model's reasoning generalizes when problem structure and wording are altered. The homepage is https://github.com/arkilpatel/SVAMP. MIT license: https://github.com/arkilpatel/SVAMP/blob/main/LICENSE.
- AQuA [54]: a dataset of algebraic word problems with multiple-choice answers that evaluates mathematical reasoning and problem-solving skills. The homepage is https://github.com/deepmind/AQuA. License: https://github.com/deepmind/AQuA/blob/master/LICENSE.
- MultiArith [55]: a dataset of arithmetic word problems that require multi-step reasoning to combine numbers and operations into the correct answer. The homepage is https://huggingface.co/datasets/ChilleD/MultiArith.
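The four answer formats listed in the table above (Multi-choice, Yes or No, String, Number) each need a different comparison rule when model outputs are scored. The following is a minimal sketch of such scoring, not the evaluation code used in this work: the function names `normalize_answer` and `accuracy`, the assumption that options are labelled A to E, and the "take the last match" heuristic are all illustrative assumptions.

```python
import re


def normalize_answer(raw: str, answer_format: str) -> str:
    """Reduce a free-text model output to a comparable answer string.

    Hypothetical helper: the extraction heuristics are assumptions,
    not the rules used by the authors.
    """
    text = raw.strip()
    if answer_format == "Multi-choice":
        # Assumes options are labelled with capital letters A-E;
        # the last standalone letter is taken as the final answer.
        letters = re.findall(r"\b([A-E])\b", text)
        return letters[-1] if letters else ""
    if answer_format == "Yes or No":
        words = re.findall(r"\b(yes|no)\b", text.lower())
        return words[-1] if words else ""
    if answer_format == "Number":
        # Drop thousands separators, then take the last number mentioned.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        if not numbers:
            return ""
        value = float(numbers[-1])
        return str(int(value)) if value.is_integer() else str(value)
    if answer_format == "String":
        return text.lower()
    raise ValueError(f"unknown answer format: {answer_format}")


def accuracy(predictions, references, answer_format: str) -> float:
    """Exact-match accuracy in percent after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    hits = sum(
        normalize_answer(p, answer_format) == normalize_answer(r, answer_format)
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    preds = ["The answer is (B).", "I think (C)"]
    golds = ["B", "D"]
    print(accuracy(preds, golds, "Multi-choice"))
```

Taking the last match rather than the first is a common choice for chain-of-thought outputs, where the final answer usually follows the reasoning; it can still misfire on outputs that restate the options after answering.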
Appendix B. Extended Experiments
Appendix B.1. Applicability on Open-Source Models

Appendix B.2. Ablation Studies on Number of Iterations and Data Samples
|  |  | MultiArith (%) | ARC-c (%) | Sports (%) |
|---|---|---|---|---|
| 1 | 1 | 97.6 | 72.7 | 60.1 |
| 1 | 2 | 95.0 | 76.6 | 51.1 |
| 1 | 3 | 98.1 | 80.8 | 50.8 |
| 1 | 4 | 97.8 | 81.9 | 65.6 |
| 1 | 5 | 96.6 | 78.4 | 64.1 |
| 3 | 1 | 98.3 | 78.7 | 61.9 |
| 3 | 2 | 98.3 | 80.5 | 66.1 |
| 3 | 3 | 98.8 | 80.7 | 78.9 |
| 3 | 4 | 98.6 | 79.9 | 71.3 |
| 3 | 5 | 98.0 | 79.7 | 58.0 |
| 5 | 1 | 98.1 | 71.9 | 82.1 |
| 5 | 2 | 97.6 | 75.1 | 76.9 |
| 5 | 3 | 98.3 | 79.9 | 88.7 |
| 5 | 4 | 97.6 | 78.8 | 57.5 |
| 5 | 5 | 97.8 | 77.8 | 76.4 |
| 7 | 1 | 97.6 | 73.8 | 85.6 |
| 7 | 2 | 98.0 | 75.4 | 89.5 |
| 7 | 3 | 98.0 | 82.4 | 80.7 |
| 7 | 4 | 97.5 | 79.9 | 77.8 |
| 7 | 5 | 97.5 | 80.7 | 84.8 |
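To make the ablation grid easier to read, the sketch below looks up the best setting per dataset and the best setting by unweighted mean over the three datasets. It is an illustrative reading aid under stated assumptions: the table does not label its first two columns, so the code calls them `outer` and `inner` without claiming which is the number of iterations and which is the number of data samples, and the unweighted mean is an aggregation chosen here, not a metric reported in this work.

```python
# Keys are (outer, inner): the two unlabeled leading columns of the table.
# Values are (MultiArith, ARC-c, Sports) in percent, copied from the table.
ABLATION = {
    (1, 1): (97.6, 72.7, 60.1), (1, 2): (95.0, 76.6, 51.1),
    (1, 3): (98.1, 80.8, 50.8), (1, 4): (97.8, 81.9, 65.6),
    (1, 5): (96.6, 78.4, 64.1),
    (3, 1): (98.3, 78.7, 61.9), (3, 2): (98.3, 80.5, 66.1),
    (3, 3): (98.8, 80.7, 78.9), (3, 4): (98.6, 79.9, 71.3),
    (3, 5): (98.0, 79.7, 58.0),
    (5, 1): (98.1, 71.9, 82.1), (5, 2): (97.6, 75.1, 76.9),
    (5, 3): (98.3, 79.9, 88.7), (5, 4): (97.6, 78.8, 57.5),
    (5, 5): (97.8, 77.8, 76.4),
    (7, 1): (97.6, 73.8, 85.6), (7, 2): (98.0, 75.4, 89.5),
    (7, 3): (98.0, 82.4, 80.7), (7, 4): (97.5, 79.9, 77.8),
    (7, 5): (97.5, 80.7, 84.8),
}
DATASETS = ("MultiArith", "ARC-c", "Sports")


def best_per_dataset(table):
    """Return {dataset: ((outer, inner), score)} for the top score per column."""
    best = {}
    for idx, name in enumerate(DATASETS):
        setting = max(table, key=lambda key: table[key][idx])
        best[name] = (setting, table[setting][idx])
    return best


def best_by_mean(table):
    """Return the setting with the highest unweighted mean and that mean."""
    setting = max(table, key=lambda key: sum(table[key]) / len(table[key]))
    return setting, sum(table[setting]) / len(table[setting])


if __name__ == "__main__":
    for name, (setting, score) in best_per_dataset(ABLATION).items():
        print(f"{name}: best at {setting} with {score:.1f}%")
    setting, mean = best_by_mean(ABLATION)
    print(f"best mean at {setting}: {mean:.2f}%")
```

The per-dataset optima fall on different settings, and Sports varies far more across the grid than MultiArith, so a single mean hides most of that spread.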
Appendix C. Prompt
Appendix C.1. Generator Initial Prompt
Appendix C.2. Discriminator Initial Prompt
Appendix C.3. Proposer Prompt for Generator’s Instruction
Appendix C.4. Proposer Prompt for Generator’s Demonstrations
Appendix C.5. Proposer Prompt for Discriminator’s Instruction
Appendix C.6. Proposer Prompt for Discriminator’s Demonstrations
Appendix C.7. Modifier Prompt for Generator’s Instruction
Appendix C.8. Modifier Prompt for Generator’s Demonstrations
Appendix C.9. Modifier Prompt for Discriminator’s Instruction
Appendix C.10. Modifier Prompt for Discriminator’s Demonstrations
References
1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc., 2020; Volume 33, pp. 1877–1901.
2. Rae, J.W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoffmann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Young, S.; et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv 2021, arXiv:2112.11446.
3. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
4. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774.
5. Xiao, T.; Zhu, J. Foundations of Large Language Models. arXiv 2025, arXiv:2501.09223.
6. Min, S.; Lewis, M.; Zettlemoyer, L.; Hajishirzi, H. MetaICL: Learning to Learn In Context. arXiv 2022, arXiv:2110.15943.
7. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. A Survey on In-context Learning. arXiv 2024, arXiv:2301.00234.
8. Dherin, B.; Munn, M.; Mazzawi, H.; Wunder, M.; Gonzalvo, J. Learning without training: The implicit dynamics of in-context learning. arXiv 2025, arXiv:2507.16003.
9. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D.; et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 2022, 35, 24824–24837.
10. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 2022, 35, 22199–22213.
11. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 2023, 36, 11809–11822.
12. Weng, Y.; Zhu, M.; Xia, F.; Li, B.; He, S.; Liu, S.; Sun, B.; Liu, K.; Zhao, J. Large Language Models are Better Reasoners with Self-Verification. arXiv 2023, arXiv:2212.09561.
13. Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic Chain of Thought Prompting in Large Language Models. In Proceedings of the Eleventh International Conference on Learning Representations; 2023.
14. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys 2023, 55, 1–38.
15. Li, J.; Chen, J.; Ren, R.; Cheng, X.; Zhao, W.X.; Nie, J.Y.; Wen, J.R. The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models. arXiv 2024, arXiv:2401.03205.
16. Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv 2025, arXiv:2401.11817.
17. Sun, Y.; Yin, Z.; Guo, Q.; Wu, J.; Qiu, X.; Zhao, H. Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem. arXiv 2024, arXiv:2403.03558.
18. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171.
19. Zhang, X.; Du, C.; Pang, T.; Liu, Q.; Gao, W.; Lin, M. Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc., 2024; Volume 37, pp. 333–356.
20. Diao, S.; Wang, P.; Lin, Y.; Pan, R.; Liu, X.; Zhang, T. Active Prompting with Chain-of-Thought for Large Language Models. arXiv 2024, arXiv:2302.12246.
21. Ye, J.; Gong, S.; Chen, L.; Zheng, L.; Gao, J.; Shi, H.; Wu, C.; Jiang, X.; Li, Z.; Bi, W.; et al. Diffusion of Thought: Chain-of-Thought Reasoning in Diffusion Language Models. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc., 2024; Volume 37, pp. 105345–105374.
22. Luo, W.; Wang, W.; Li, X.; Zhou, W.; Jia, P.; Zhao, X. TAPO: Task-Referenced Adaptation for Prompt Optimization. In Proceedings of the ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2025; pp. 1–5.
23. Larionov, D.; Eger, S. PromptOptMe: Error-aware prompt compression for LLM-based MT evaluation metrics. arXiv 2024, arXiv:2412.16120.
24. Guo, P.F.; Tsai, Y.D.; Lin, S.D. Benchmarking large language model uncertainty for prompt optimization. arXiv 2024, arXiv:2409.10044.
25. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems 2014, 27.
26. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
27. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199.
28. Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; Roli, F. Evasion attacks against machine learning at test time. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer, 2013; pp. 387–402.
29. Long, D.; Zhao, Y.; Brown, H.; Xie, Y.; Zhao, J.; Chen, N.; Kawaguchi, K.; Shieh, M.; He, J. Prompt optimization via adversarial in-context learning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024; pp. 7308–7327.
30. Zhou, Y.; Muresanu, A.I.; Han, Z.; Paster, K.; Pitis, S.; Chan, H.; Ba, J. Large language models are human-level prompt engineers. In Proceedings of the Eleventh International Conference on Learning Representations; 2022.
31. Pryzant, R.; Iter, D.; Li, J.; Lee, Y.T.; Zhu, C.; Zeng, M. Automatic prompt optimization with "gradient descent" and beam search. arXiv 2023, arXiv:2305.03495.
32. Cheng, J.; Liu, X.; Zheng, K.; Ke, P.; Wang, H.; Dong, Y.; Tang, J.; Huang, M. Black-box prompt optimization: Aligning large language models without model training. arXiv 2023, arXiv:2311.04155.
33. Schneider, L.; Wistuba, M.; Klein, A.; Golebiowski, J.; Zappella, G.; Merra, F.A. Hyperband-based Bayesian optimization for black-box prompt selection. arXiv 2024, arXiv:2412.07820.
34. Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let's verify step by step. In Proceedings of the Twelfth International Conference on Learning Representations; 2023.
35. Li, M.; Wang, W.; Feng, F.; Cao, Y.; Zhang, J.; Chua, T.S. Robust prompt optimization for large language models against distribution shifts. arXiv 2023, arXiv:2305.13954.
36. Ashok, D.; May, J. Language models can predict their own behavior. arXiv 2025, arXiv:2502.13329.
37. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017; pp. 4681–4690.
38. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial training of neural networks. Journal of Machine Learning Research 2016, 17, 1–35.
39. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017; pp. 7167–7176.
40. Iglesias, G.; Talavera, E.; Díaz-Álvarez, A. A survey on GANs for computer vision: Recent research, analysis and taxonomy. Computer Science Review 2023, 48, 100553.
41. Ren, D.; Cai, Y.; Li, Q. Unlocking the Power of GANs in Non-Autoregressive Text Generation. arXiv 2023, arXiv:2305.03977.
42. Maus, N.; Chao, P.; Wong, E.; Gardner, J. Black box adversarial prompting for foundation models. arXiv 2023, arXiv:2302.04237.
43. Yu, L.; Zhang, W.; Wang, J.; Yu, Y. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence; 2017; Volume 31.
44. Nie, W.; Narodytska, N.; Patel, A. RelGAN: Relational generative adversarial networks for text generation. In Proceedings of the International Conference on Learning Representations; 2018.
45. Wang, J.; Sun, Q.; Li, X.; Gao, M. Boosting language models reasoning with chain-of-knowledge prompting. arXiv 2023, arXiv:2306.06427.
46. Talmor, A.; Herzig, J.; Lourie, N.; Berant, J. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the NAACL-HLT; 2019; pp. 4149–4158.
47. Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics 2021, 9, 346–361.
48. Mihaylov, T.; Clark, P.; Khot, T.; Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv 2018, arXiv:1809.02789.
49. Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv 2018, arXiv:1803.05457.
50. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research 2023.
51. Clark, C.; Lee, K.; Chang, M.W.; Kwiatkowski, T.; Collins, M.; Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv 2019, arXiv:1905.10044.
52. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv 2021, arXiv:2110.14168.
53. Patel, A.; Bhattamishra, S.; Goyal, N. Are NLP models really able to solve simple math word problems? arXiv 2021, arXiv:2103.07191.
54. Ling, W.; Yogatama, D.; Dyer, C.; Blunsom, P. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv 2017, arXiv:1705.04146.
55. Roy, S.; Roth, D. Solving general arithmetic word problems. arXiv 2016, arXiv:1608.01413.
56. AI@Meta. Llama 3 Model Card, 2024.





Task groups: Commonsense & Factual (CommonSense QA, Strategy QA, OpenBook QA, ARC-c, Sports, BoolQ); Symbolic (Letter, Coin); Arithmetic (GSM8K, SVAMP, AQuA, MultiArith).

| Model | Method | CommonSense QA | Strategy QA | OpenBook QA | ARC-c | Sports | BoolQ | Letter | Coin | GSM8K | SVAMP | AQuA | MultiArith |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5-turbo | CoT | 78.2 | 73.8 | 85.6 | 88.4 | 87.3 | 69.2 | 65.6 | 77.2 | 82.2 | 83.3 | 62.2 | 97.6 |
| GPT-3.5-turbo | CoT+SC | 79.4 | 74.5 | 86.0 | 88.3 | 89.4 | 67.0 | 66.6 | 98.8 | 87.5 | 86.8 | 65.3 | 98.6 |
| GPT-3.5-turbo | adv-ICL | 77.8 | 66.6 | 86.4 | 87.2 | 86.3 | 73.1 | 74.6 | 96.8 | 80.4 | 84.1 | 60.2 | 98.5 |
| GPT-3.5-turbo | adv-CoT | 78.2 | 73.3 | 86.8 | 89.5 | 91.8 | 73.5 | 76.0 | 98.4 | 85.9 | 85.3 | 66.1 | 99.1 |
| GPT-4o-mini | CoT | 84.9 | 77.4 | 94.1 | 94.3 | 92.3 | 77.1 | 86.8 | 100 | 93.4 | 93.6 | 77.1 | 98.1 |
| GPT-4o-mini | CoT+SC | 85.6 | 77.5 | 95.1 | 94.6 | 90.6 | 77.5 | 88.2 | 100 | 94.2 | 94.6 | 83.4 | 98.3 |
| GPT-4o-mini | adv-ICL | 83.9 | 76.1 | 94.0 | 94.7 | 89.8 | 77.1 | 87.4 | 100 | 92.7 | 93.4 | 78.7 | 98.6 |
| GPT-4o-mini | adv-CoT | 84.8 | 79.3 | 94.8 | 95.3 | 93.1 | 77.5 | 91.0 | 100 | 93.7 | 94.1 | 79.9 | 98.6 |
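As a reading aid for the results table, the sketch below computes an unweighted mean over the twelve datasets for each method and the per-dataset gain of adv-CoT over plain CoT. The scores are copied from the table; the helper names `mean_scores` and `gains_over` and the choice of an unweighted mean are assumptions made here for illustration, not an aggregate reported in this work.

```python
DATASETS = (
    "CommonSense QA", "Strategy QA", "OpenBook QA", "ARC-c", "Sports", "BoolQ",
    "Letter", "Coin", "GSM8K", "SVAMP", "AQuA", "MultiArith",
)

# Scores copied from the results table, in the order given by DATASETS.
RESULTS = {
    "GPT-3.5-turbo": {
        "CoT": (78.2, 73.8, 85.6, 88.4, 87.3, 69.2, 65.6, 77.2, 82.2, 83.3, 62.2, 97.6),
        "CoT+SC": (79.4, 74.5, 86.0, 88.3, 89.4, 67.0, 66.6, 98.8, 87.5, 86.8, 65.3, 98.6),
        "adv-ICL": (77.8, 66.6, 86.4, 87.2, 86.3, 73.1, 74.6, 96.8, 80.4, 84.1, 60.2, 98.5),
        "adv-CoT": (78.2, 73.3, 86.8, 89.5, 91.8, 73.5, 76.0, 98.4, 85.9, 85.3, 66.1, 99.1),
    },
    "GPT-4o-mini": {
        "CoT": (84.9, 77.4, 94.1, 94.3, 92.3, 77.1, 86.8, 100.0, 93.4, 93.6, 77.1, 98.1),
        "CoT+SC": (85.6, 77.5, 95.1, 94.6, 90.6, 77.5, 88.2, 100.0, 94.2, 94.6, 83.4, 98.3),
        "adv-ICL": (83.9, 76.1, 94.0, 94.7, 89.8, 77.1, 87.4, 100.0, 92.7, 93.4, 78.7, 98.6),
        "adv-CoT": (84.8, 79.3, 94.8, 95.3, 93.1, 77.5, 91.0, 100.0, 93.7, 94.1, 79.9, 98.6),
    },
}


def mean_scores(model_rows):
    """Unweighted mean over all datasets for each method of one model."""
    return {method: sum(s) / len(s) for method, s in model_rows.items()}


def gains_over(model_rows, method="adv-CoT", baseline="CoT"):
    """Per-dataset difference (method minus baseline), in percentage points."""
    return {
        name: round(a - b, 1)
        for name, a, b in zip(DATASETS, model_rows[method], model_rows[baseline])
    }


if __name__ == "__main__":
    for model, rows in RESULTS.items():
        means = mean_scores(rows)
        best = max(means, key=means.get)
        print(f"{model}: highest mean is {best} ({means[best]:.2f})")
        print(f"  adv-CoT minus CoT: {gains_over(rows)}")
```

An unweighted mean treats a 254-sample dataset such as AQuA the same as a 3,270-sample one such as BoolQ, so it should be read as a rough summary only; the per-dataset gains show where the methods actually differ.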
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).