Submitted: 12 November 2025
Posted: 18 November 2025
Abstract
Keywords:
1. Introduction

- We propose CRONUS, a black-box-friendly framework that improves the performance of general-purpose LLMs on complex vertical-domain decision-support tasks without requiring any modification of model internals.
- We introduce the Context-Aware Reasoning Agent (CARA), a lightweight, domain-specific model trained in three stages (DKLP, CRPIT, and DDPO) to generate contextual reasoning paths that guide black-box LLMs.
- We evaluate CRONUS on financial market analysis, where it outperforms all compared baselines on our self-constructed FinDecision-QA dataset, with the largest gains on tasks that demand situational reasoning.
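
The interaction described in these contributions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `cronus_answer`, `build_guided_prompt`, `stub_cara`, and `stub_llm` are hypothetical, and the fixed prompt template is a placeholder (in CRONUS, how the reasoning path is presented is what DDPO optimizes).

```python
from typing import Callable

# Hypothetical interfaces: both stages are modelled as plain text-to-text callables.
ReasoningAgent = Callable[[str, str], str]  # (context, question) -> reasoning path
BlackBoxLLM = Callable[[str], str]          # prompt -> answer


def build_guided_prompt(context: str, question: str, reasoning_path: str) -> str:
    """Assemble a prompt that places the reasoning path before the question.

    The template is an assumption made for illustration only.
    """
    return (
        f"Context:\n{context}\n\n"
        f"Suggested reasoning path:\n{reasoning_path}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def cronus_answer(
    context: str, question: str, cara: ReasoningAgent, llm: BlackBoxLLM
) -> str:
    """Run the two-step flow: the agent writes a reasoning path, then the
    black-box LLM is queried with a prompt that contains it."""
    reasoning_path = cara(context, question)
    prompt = build_guided_prompt(context, question, reasoning_path)
    return llm(prompt)  # only the LLM's text interface is used


def stub_cara(context: str, question: str) -> str:
    """Stand-in for CARA; a real agent would be a trained model."""
    return "1. Identify the policy change. 2. Relate it to the asset in question."


def stub_llm(prompt: str) -> str:
    """Stand-in for a black-box API call."""
    return f"(answer produced from a prompt of {len(prompt)} characters)"


if __name__ == "__main__":
    print(cronus_answer("Rates rose by 50 bps.", "What happens to bond prices?",
                        stub_cara, stub_llm))
```

Because the LLM is passed in as a callable that maps a prompt to text, any API-accessed model can be substituted without touching its weights, which is the black-box property the framework relies on.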
2. Related Work
2.1. Enhancing and Orchestrating Black-Box Large Language Models
2.2. Domain Adaptation and Specialized Language Models
3. Method

3.1. CRONUS Framework Overview
3.2. Context-Aware Reasoning Agent (CARA)
3.2.1. Domain Knowledge & Logic Pre-Training (DKLP)
3.2.2. Contextual Reasoning Path Instruction Tuning (CRPIT)
Pseudo-Data Generation and Filtering
Instruction Tuning
3.2.3. Dynamic Decision Prompt Optimization (DDPO)
4. Experiments
4.1. Experimental Setup
- ChatGPT (GPT-3.5/GPT-4 API): A leading commercial general-purpose LLM, accessed via its API.
- Baichuan2-13B-Chat: A powerful open-source LLM, accessed via its API to simulate a black-box scenario.
- Qwen-7B-Chat: Another competitive open-source LLM, also accessed via API.
- Domain Knowledge & Logic Pre-training (DKLP) Corpus: For CARA’s initial pre-training, we collect a large corpus of public financial news, company annual reports, economic forecasts, and academic papers in finance. This unlabeled data allows CARA to acquire foundational domain knowledge.
- Instruction Tuning and Evaluation Datasets:
  - FinDecision-QA (Self-constructed): This dataset is specifically designed for complex financial market decision support. It includes detailed situational descriptions, multi-step reasoning questions, and multiple potential answers. The dataset comprises two main categories of questions: "Fact Recall Type" questions, which test direct knowledge retrieval, and "Situational Reasoning Type" questions, which demand deeper logical inference and information synthesis.
  - FinCausal [13]: An existing dataset focused on financial causal inference, used to further evaluate the models’ ability to understand and reason about causal relationships between financial events.
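
As an illustration of the dataset description above, one possible record layout and a per-category accuracy helper are sketched below. The schema is an assumption: the field names, the `correct_index` gold label, and the function `accuracy_by_type` are hypothetical and are not taken from the FinDecision-QA release.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Sequence


class QuestionType(Enum):
    # The two categories named in the dataset description.
    FACT_RECALL = "Fact Recall Type"
    SITUATIONAL_REASONING = "Situational Reasoning Type"


@dataclass(frozen=True)
class FinDecisionExample:
    """Hypothetical record: a situation, a question, and candidate answers."""
    situation: str
    question: str
    candidate_answers: List[str]
    correct_index: int  # assumed gold label; not described in the source
    question_type: QuestionType

    def __post_init__(self) -> None:
        if not 0 <= self.correct_index < len(self.candidate_answers):
            raise ValueError("correct_index must point at a candidate answer")


def accuracy_by_type(
    examples: Sequence[FinDecisionExample], predicted: Sequence[int]
) -> Dict[QuestionType, float]:
    """Return accuracy (in percent) separately for each question type."""
    totals: Dict[QuestionType, int] = {}
    hits: Dict[QuestionType, int] = {}
    for example, prediction in zip(examples, predicted):
        qt = example.question_type
        totals[qt] = totals.get(qt, 0) + 1
        hits[qt] = hits.get(qt, 0) + int(prediction == example.correct_index)
    return {qt: 100.0 * hits[qt] / totals[qt] for qt in totals}
```

Reporting accuracy per question type in this way mirrors the split between fact recall and situational reasoning used in the results tables.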
4.2. Baselines
- Original Black-box LLMs: This baseline involves directly querying the black-box LLMs (ChatGPT, Baichuan2-13B-Chat, Qwen-7B-Chat) with the given context and question, without any external guidance or augmentation.
- Retrieval-Augmented Generation (RAG) LLMs: We integrate a traditional document retrieval system with the black-box LLMs. Relevant documents or passages, retrieved based on the query, are prepended to the prompt as additional context for the LLM [14]. We denote these as RAG-ChatGPT, RAG-Baichuan2-13B, and RAG-Qwen-7B.
- Domain-Specific LLMs: We include a representative domain-specific LLM (FinLLM-13B), which is pre-trained or fine-tuned extensively on large financial corpora. This baseline shows the performance achievable by a model with built-in domain expertise, which is typically obtained by modifying the model itself rather than by guiding it externally.
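
The RAG baseline's "retrieve, then prepend" step can be sketched as below. The retriever here is a deliberately simple word-overlap ranker standing in for the unspecified "traditional document retrieval system"; the function names and the prompt layout are assumptions made for illustration.

```python
import re
from typing import List


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, passage: str) -> int:
    """Count the distinct tokens shared by the query and the passage."""
    return len(tokenize(query) & tokenize(passage))


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k passages with the highest token overlap with the query."""
    ranked = sorted(corpus, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, corpus: List[str], k: int = 2) -> str:
    """Prepend the retrieved passages to the question as additional context."""
    passages = retrieve(query, corpus, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Retrieved passages:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    corpus = [
        "The central bank raised interest rates.",
        "Wheat harvests improved this year.",
        "Higher interest rates pressure bond prices.",
    ]
    print(build_rag_prompt("How do interest rates affect bond prices?", corpus))
```

Unlike CRONUS, this baseline supplies raw passages only; it does not tell the black-box LLM how to reason over them.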
4.3. Main Results
4.4. Ablation Study
- CRONUS (DKLP only): When CARA is trained only with Domain Knowledge & Logic Pre-training (DKLP) and its output (e.g., domain-relevant facts or summaries) is passed to the black-box LLM through a simple, fixed prompt, overall accuracy is slightly higher than that of the RAG-ChatGPT baseline (65.7 vs. 64.7). Basic domain knowledge within CARA therefore helps, but it does not provide structured reasoning guidance.
- CRONUS (DKLP + CRPIT, Fixed Prompt): Adding Contextual Reasoning Path Instruction Tuning (CRPIT) to CARA, while still using a fixed, non-optimized prompt to integrate its generated reasoning paths, raises overall accuracy to 68.0, with the larger share of the gain on situational reasoning (59.5 to 62.3). This highlights the role of CARA’s multi-step reasoning paths in guiding the black-box LLM through complex inferences.
- CRONUS (Full: DKLP + CRPIT + DDPO): The full CRONUS framework, incorporating Dynamic Decision Prompt Optimization (DDPO), achieves the best results (70.5 overall). Dynamically optimizing how CARA’s reasoning paths are presented to the black-box LLM adds a further 2.0 points on fact recall and 2.9 points on situational reasoning over the fixed-prompt variant.
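
To make the contribution of each stage explicit, the short script below computes the gain of every configuration over the one before it, using the figures reported in the ablation table. The helper name `stepwise_gains` is hypothetical; only the accuracy values come from the paper.

```python
# (configuration, fact recall, situational reasoning, overall accuracy),
# copied from the ablation table.
ABLATION = [
    ("RAG-ChatGPT (Baseline)", 72.5, 58.9, 64.7),
    ("CRONUS (DKLP only)", 73.8, 59.5, 65.7),
    ("CRONUS (DKLP + CRPIT, Fixed Prompt)", 75.1, 62.3, 68.0),
    ("CRONUS (Full: DKLP + CRPIT + DDPO)", 77.1, 65.2, 70.5),
]


def stepwise_gains(rows):
    """Return, for each row after the first, its gain over the previous row."""
    gains = []
    for previous, current in zip(rows, rows[1:]):
        deltas = tuple(
            round(c - p, 1) for p, c in zip(previous[1:], current[1:])
        )
        gains.append((current[0], deltas))
    return gains


if __name__ == "__main__":
    for name, (fact, situational, overall) in stepwise_gains(ABLATION):
        print(f"{name}: fact +{fact}, situational +{situational}, overall +{overall}")
```

On these figures, CRPIT and DDPO each add close to three points on situational reasoning (2.8 and 2.9), whereas DKLP alone adds 0.6.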
4.5. Human Evaluation
4.6. Analysis of CARA-Generated Reasoning Paths
4.7. Efficiency and Inference Latency
4.8. Performance on FinCausal Dataset
4.9. Robustness to Context Perturbations
5. Conclusion
References
- Ahuja, K.; Diddee, H.; Hada, R.; Ochieng, M.; Ramesh, K.; Jain, P.; Nambi, A.; Ganu, T.; Segal, S.; Ahmed, M.; et al. MEGA: Multilingual Evaluation of Generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023, pp. 4232–4267.
- Zhou, Y.; Shen, J.; Cheng, Y. Weak to strong generalization for large language models with multi-capabilities. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Zhu, F.; Lei, W.; Huang, Y.; Wang, C.; Zhang, S.; Lv, J.; Feng, F.; Chua, T.S. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 2021, pp. 3277–3287.
- Khot, T.; Khashabi, D.; Richardson, K.; Clark, P.; Sabharwal, A. Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2021, pp. 1264–1279.
- Qi, T.; Wu, F.; Wu, C.; Yang, P.; Yu, Y.; Xie, X.; Huang, Y. HieRec: Hierarchical User Interest Modeling for Personalized News Recommendation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 2021, pp. 5446–5456.
- Lin, Z.; Zhang, Q.; Tian, Z.; Yu, P.; Lan, J. DPL-SLAM: enhancing dynamic point-line SLAM through dense semantic methods. IEEE Sensors Journal 2024, 24, 14596–14607.
- Lin, Z.; Tian, Z.; Zhang, Q.; Zhuang, H.; Lan, J. Enhanced visual SLAM for collision-free driving with lightweight autonomous cars. Sensors 2024, 24, 6258.
- Lin, Z.; Zhang, Q.; Tian, Z.; Yu, P.; Ye, Z.; Zhuang, H.; Lan, J. Slam2: Simultaneous localization and multimode mapping for indoor dynamic environments. Pattern Recognition 2025, 158, 111054.
- Wang, P.; Zhu, Z.; Liang, D. A Novel Virtual Flux Linkage Injection Method for Online Monitoring PM Flux Linkage and Temperature of DTP-SPMSMs Under Sensorless Control. IEEE Transactions on Industrial Electronics 2025.
- Wang, P.; Zhu, Z.Q.; Feng, Z. Novel Virtual Active Flux Injection-Based Position Error Adaptive Correction of Dual Three-Phase IPMSMs Under Sensorless Control. IEEE Transactions on Transportation Electrification 2025.
- Wang, P.; Zhu, Z.; Liang, D. Improved position-offset based online parameter estimation of PMSMs under constant and variable speed operations. IEEE Transactions on Energy Conversion 2024, 39, 1325–1340.
- Huang, J.; Qiu, Y. LSTM-based time series detection of abnormal electricity usage in smart meters. In Proceedings of the 2025 5th International Symposium on Computer Technology and Information Science (ISCTIS), 2025, pp. 272–276.
- Li, X.; Chan, S.; Zhu, X.; Pei, Y.; Ma, Z.; Liu, X.; Shah, S. Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track. Association for Computational Linguistics, 2023, pp. 408–422.
- Gui, L.; Wang, B.; Huang, Q.; Hauptmann, A.; Bisk, Y.; Gao, J. KAT: A Knowledge Augmented Transformer for Vision-and-Language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2022, pp. 956–968.
- Ding, N.; Chen, Y.; Han, X.; Xu, G.; Wang, X.; Xie, P.; Zheng, H.; Liu, Z.; Li, J.; Kim, H.G. Prompt-learning for Fine-grained Entity Typing. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, 2022, pp. 6888–6901.
- Dhuliawala, S.; Komeili, M.; Xu, J.; Raileanu, R.; Li, X.; Celikyilmaz, A.; Weston, J. Chain-of-Verification Reduces Hallucination in Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024, pp. 3563–3578.
- Rubin, O.; Herzig, J.; Berant, J. Learning To Retrieve Prompts for In-Context Learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2022, pp. 2655–2671.
- Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual In-Context Learning for Large Vision-Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Association for Computational Linguistics, 2024, pp. 15890–15902.
- Onoe, Y.; Boratko, M.; McCallum, A.; Durrett, G. Modeling Fine-Grained Entity Types with Box Embeddings. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 2021, pp. 2051–2064.
- Turcan, E.; Muresan, S.; McKeown, K. Emotion-Infused Models for Explainable Psychological Stress Detection. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2021, pp. 2895–2909.
- Weng, Y.; Zhu, M.; Xia, F.; Li, B.; He, S.; Liu, S.; Sun, B.; Liu, K.; Zhao, J. Large Language Models are Better Reasoners with Self-Verification. In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023, pp. 2550–2575.
- Zhou, Y.; Geng, X.; Shen, T.; Tao, C.; Long, G.; Lou, J.G.; Shen, J. Thread of thought unraveling chaotic contexts. arXiv preprint arXiv:2311.08734 2023.
- Jiang, J.; Zhou, K.; Dong, Z.; Ye, K.; Zhao, X.; Wen, J.R. StructGPT: A General Framework for Large Language Model to Reason over Structured Data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023, pp. 9237–9251.
- Zhang, F.; Hua, X.S.; Chen, C.; Luo, X. A Statistical Perspective for Efficient Image-Text Matching. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 355–369.
- Zhang, F.; Zhou, H.; Hua, X.S.; Chen, C.; Luo, X. Hope: A hierarchical perspective for semi-supervised 2d-3d cross-modal retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024, 46, 8976–8993.
- Gururangan, S.; Lewis, M.; Holtzman, A.; Smith, N.A.; Zettlemoyer, L. DEMix Layers: Disentangling Domains for Modular Language Modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2022, pp. 5557–5576.
- Tedeschi, S.; Maiorca, V.; Campolungo, N.; Cecconi, F.; Navigli, R. WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, 2021, pp. 2521–2533.
- Hardalov, M.; Arora, A.; Nakov, P.; Augenstein, I. Cross-Domain Label-Adaptive Stance Detection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021, pp. 9011–9028.
- Labrak, Y.; Bazoge, A.; Morin, E.; Gourraud, P.A.; Rouvier, M.; Dufour, R. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024, pp. 5848–5864.
- Liu, C. Optimization of Adaboost cardiac disease prediction and classification based on long and short term memory network. IET Conference Proceedings 2025, 2025, 196–200, [https://digitallibrary.theiet.org/doi/pdf/10.1049/icp.2025.1034].
- Tian, Y.; Yang, Z.; Liu, C.; Su, Y.; Hong, Z.; Gong, Z.; Xu, J. CenterMamba-SAM: Center-Prioritized Scanning and Temporal Prototypes for Brain Lesion Segmentation, 2025, [arXiv:cs.CV/2511.01243].
- Zhuang, J.; Miao, S. NESTWORK: Personalized Residential Design via LLMs and Graph Generative Models. In Proceedings of the ACADIA 2024 Conference, November 16 2024, Vol. 3, pp. 99–100.
- Zhuang, J.; Li, G.; Xu, H.; Xu, J.; Tian, R. TEXT-TO-CITY Controllable 3D Urban Block Generation with Latent Diffusion Model. In Proceedings of the 29th International Conference of the Association for Computer-Aided Architectural Design Research in Asia (CAADRIA), Singapore, 2024, pp. 20–26.
- Wu, H.; Liu, C.; Zhang, W.; Zhou, L.; Song, Y.; Li, X.; Du, X. TMM-Net: An SAR Ship Detection Method Based on Multiscale Transformer Sampling. IEEE Geoscience and Remote Sensing Letters 2025, 22, 1–5.
- Luo, Z.; Hong, Z.; Ge, X.; Zhuang, J.; Tang, X.; Du, Z.; Tao, Y.; Zhang, Y.; Zhou, C.; Yang, C.; et al. Embroiderer: Do-It-Yourself Embroidery Aided with Digital Tools. In Proceedings of the Eleventh International Symposium of Chinese CHI, 2023, pp. 614–621.
- Bao, S.; He, H.; Wang, F.; Wu, H.; Wang, H.; Wu, W.; Guo, Z.; Liu, Z.; Xu, X. PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, 2021, pp. 2513–2525.
- Zhang, F.; Chen, C.; Hua, X.S.; Luo, X. FATE: Learning Effective Binary Descriptors With Group Fairness. IEEE Transactions on Image Processing 2024, 33, 3648–3661.
- Sciavolino, C.; Zhong, Z.; Lee, J.; Chen, D. Simple Entity-Centric Questions Challenge Dense Retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021, pp. 6138–6148.


| Model Type | Model | Fact Recall Type | Situational Reasoning Type | Overall Accuracy |
|---|---|---|---|---|
| General Black-box LLMs | ChatGPT | 68.2 | 52.1 | 58.7 |
| General Black-box LLMs | Baichuan2-13B-Chat | 65.5 | 49.8 | 56.4 |
| General Black-box LLMs | Qwen-7B-Chat | 63.9 | 48.5 | 55.2 |
| Retrieval-Augmented LLMs | RAG-ChatGPT | 72.5 | 58.9 | 64.7 |
| Retrieval-Augmented LLMs | RAG-Baichuan2-13B | 70.1 | 56.3 | 62.0 |
| Retrieval-Augmented LLMs | RAG-Qwen-7B | 68.8 | 55.1 | 60.9 |
| Domain-Specific LLMs (Baseline) | FinLLM-13B | 74.8 | 60.5 | 66.8 |
| Ours (CRONUS) | CRONUS + ChatGPT | 77.1 | 65.2 | 70.5 |
| Ours (CRONUS) | CRONUS + Baichuan2-13B | 76.0 | 63.5 | 69.1 |
| Ours (CRONUS) | CRONUS + Qwen-7B | 75.2 | 62.8 | 68.0 |

| CARA Configuration | Fact Recall Type | Situational Reasoning Type | Overall Accuracy |
|---|---|---|---|
| RAG-ChatGPT (Baseline) | 72.5 | 58.9 | 64.7 |
| CRONUS (DKLP only) | 73.8 | 59.5 | 65.7 |
| CRONUS (DKLP + CRPIT, Fixed Prompt) | 75.1 | 62.3 | 68.0 |
| CRONUS (Full: DKLP + CRPIT + DDPO) | 77.1 | 65.2 | 70.5 |

| Model | Total Latency (s) | CARA Overhead (s) | Black-box LLM Call (s) |
|---|---|---|---|
| ChatGPT (Original) | 3.25 | N/A | 3.25 |
| RAG-ChatGPT | 4.10 | 0.85 (Retrieval) | 3.25 |
| CRONUS + ChatGPT | 3.75 | 0.50 | 3.25 |
| Baichuan2-13B-Chat (Original) | 2.80 | N/A | 2.80 |
| RAG-Baichuan2-13B | 3.60 | 0.80 (Retrieval) | 2.80 |
| CRONUS + Baichuan2-13B | 3.30 | 0.50 | 2.80 |
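
The latency figures above decompose additively (total = overhead + black-box call). The sketch below checks that decomposition and expresses each overhead as a percentage of the LLM call time. The function name `relative_overhead` is hypothetical; the numbers are copied from the table.

```python
# (total latency, added overhead, black-box LLM call), all in seconds.
LATENCY = {
    "RAG-ChatGPT": (4.10, 0.85, 3.25),
    "CRONUS + ChatGPT": (3.75, 0.50, 3.25),
    "RAG-Baichuan2-13B": (3.60, 0.80, 2.80),
    "CRONUS + Baichuan2-13B": (3.30, 0.50, 2.80),
}


def relative_overhead(table, tolerance=1e-9):
    """Verify total = overhead + call, then return overhead as % of call time."""
    result = {}
    for name, (total, overhead, call) in table.items():
        if abs(total - (overhead + call)) > tolerance:
            raise ValueError(f"{name}: total does not equal overhead + call")
        result[name] = round(100.0 * overhead / call, 1)
    return result


if __name__ == "__main__":
    for name, percent in relative_overhead(LATENCY).items():
        print(f"{name}: +{percent}% over the bare LLM call")
```

On these figures, CARA adds roughly 15% to 18% to the bare call time, compared with roughly 26% to 29% for retrieval.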

| Model Type | Model | Accuracy |
|---|---|---|
| General Black-box LLMs | ChatGPT | 62.5 |
| General Black-box LLMs | Baichuan2-13B-Chat | 58.9 |
| General Black-box LLMs | Qwen-7B-Chat | 57.1 |
| Retrieval-Augmented LLMs | RAG-ChatGPT | 66.8 |
| Retrieval-Augmented LLMs | RAG-Baichuan2-13B | 63.5 |
| Retrieval-Augmented LLMs | RAG-Qwen-7B | 61.2 |
| Domain-Specific LLMs (Baseline) | FinLLM-13B | 70.1 |
| Ours (CRONUS) | CRONUS + ChatGPT | 73.5 |
| Ours (CRONUS) | CRONUS + Baichuan2-13B | 72.0 |
| Ours (CRONUS) | CRONUS + Qwen-7B | 71.1 |

| Model | Original Context | + Irrelevant Information | - Missing Key Information |
|---|---|---|---|
| ChatGPT (Original) | 52.1 | 45.8 | 35.2 |
| RAG-ChatGPT | 58.9 | 52.1 | 40.5 |
| FinLLM-13B | 60.5 | 55.3 | 42.8 |
| CRONUS + ChatGPT | 65.2 | 61.9 | 50.1 |
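
Robustness is easier to compare as a drop relative to the unperturbed context. The sketch below derives those drops from the table above; the helper name `accuracy_drops` is hypothetical, and the values are copied from the table.

```python
# (original context, + irrelevant information, - missing key information)
ROBUSTNESS = {
    "ChatGPT (Original)": (52.1, 45.8, 35.2),
    "RAG-ChatGPT": (58.9, 52.1, 40.5),
    "FinLLM-13B": (60.5, 55.3, 42.8),
    "CRONUS + ChatGPT": (65.2, 61.9, 50.1),
}


def accuracy_drops(table):
    """Return (drop with irrelevant info, drop with missing info) per model."""
    return {
        name: (round(original - noisy, 1), round(original - missing, 1))
        for name, (original, noisy, missing) in table.items()
    }


if __name__ == "__main__":
    for name, (noisy_drop, missing_drop) in accuracy_drops(ROBUSTNESS).items():
        print(f"{name}: -{noisy_drop} (irrelevant), -{missing_drop} (missing)")
```

CRONUS + ChatGPT shows the smallest drop under both perturbations (3.3 and 15.1 points), although removing key information still costs every system at least 15 points.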
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).