Submitted: 29 October 2025
Posted: 30 October 2025
Abstract
Keywords:
1. Introduction

- We propose PEL-SHA, a novel prompt engineering-enhanced LLM framework designed for the automated analysis of scientific hypotheses and their supporting evidence from research abstracts.
- We develop a sophisticated multi-stage prompt engineering pipeline that systematically guides LLMs through complex scientific text understanding, from hypothesis identification to high-level research direction reasoning.
- We introduce SciHypo-500, a new expert-annotated benchmark dataset for scientific hypothesis and evidence analysis, and empirically demonstrate that our PEL-SHA framework achieves state-of-the-art performance across multiple challenging tasks.
2. Related Work
2.1. Large Language Models for Scientific Text Analysis
2.2. Advanced Prompt Engineering Techniques
3. Method

3.1. Overview of the PEL-SHA Framework
3.2. Stage 1: Hypothesis Identification Prompt
3.3. Stage 2: Evidence and Method Classification Prompt
3.4. Stage 3: Potential Research Direction Reasoning Prompt
3.5. Prompt Engineering Principles
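As a concrete illustration of how the three stages named above can be chained, the following minimal sketch assumes a generic `complete(prompt)` chat-completion helper (e.g., any OpenAI-compatible client); the prompt texts are illustrative placeholders, not PEL-SHA's exact refined prompts.

```python
# Minimal sketch of the three-stage PEL-SHA flow.
# `complete` is a stand-in for a single LLM call; the prompts below are
# illustrative placeholders, not the paper's refined prompts.

def complete(prompt: str) -> str:
    """Stand-in for one chat-completion call; replace with a real API client."""
    raise NotImplementedError

def analyze_abstract(abstract: str) -> dict:
    # Stage 1: hypothesis identification.
    hypotheses = complete(
        "Identify each explicit scientific hypothesis or research question "
        "in the abstract below. Return one hypothesis per line.\n\n" + abstract
    )

    # Stage 2: evidence and method classification, conditioned on Stage 1 output.
    evidence_and_methods = complete(
        "For each hypothesis listed below, classify the type of supporting "
        "evidence (e.g., experimental, observational, computational) and the "
        "key research methods, using only the abstract.\n\n"
        f"Abstract:\n{abstract}\n\nHypotheses:\n{hypotheses}"
    )

    # Stage 3: research-direction reasoning, conditioned on Stages 1 and 2.
    directions = complete(
        "Given the abstract, its hypotheses, and the evidence/method analysis, "
        "suggest specific future research directions, open questions, or "
        "limitations.\n\n"
        f"Abstract:\n{abstract}\n\nHypotheses:\n{hypotheses}\n\n"
        f"Evidence and methods:\n{evidence_and_methods}"
    )

    return {
        "hypotheses": hypotheses,
        "evidence_and_methods": evidence_and_methods,
        "research_directions": directions,
    }
```

The essential design choice is that each stage conditions on the structured output of the preceding stage rather than re-deriving it from the raw abstract, which is what distinguishes the full sequential framework from a single-pass prompt (Section 4.6).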
4. Experiments
4.1. Experimental Setup
4.1.1. Models Evaluated
- Qwen-7B [5]: A popular open-source large language model, representing a strong general-purpose LLM.
- Claude [6]: A leading commercial large language model known for its advanced conversational and reasoning capabilities.
- Gemini [7]: Google’s latest multi-modal large language model, offering cutting-edge performance.
- LLM-X (Baseline): A generic, un-fine-tuned large language model (e.g., a standard GPT-4 or Llama series model) employed without any specific prompt engineering strategies beyond basic instructions. This serves as a direct baseline to quantify the impact of our prompt engineering.
- LLM-X + PEL-SHA (Our Method): The LLM-X model integrated with our meticulously designed PEL-SHA multi-stage prompt engineering framework, as described in Section 3. This configuration represents our proposed approach.
4.1.2. Dataset
- Composition: SciHypo-500 comprises 500 carefully selected scientific paper abstracts from diverse scientific domains, including biomedicine, materials science, and computer science. This interdisciplinary selection ensures the generalizability of our framework across varied scientific language and structures.
- Annotation: Each abstract within SciHypo-500 was meticulously annotated by three independent domain experts. The expert annotations include:
  - The original abstract text (unstructured description).
  - Structured hypothesis statements explicitly identified from the abstract.
  - The corresponding evidence types supporting each hypothesis.
  - The key research methods employed to investigate the hypotheses.
  - Expert commentaries on potential future research directions, open questions, knowledge gaps, or study limitations.
This rich, multi-faceted annotation provides a robust gold standard for evaluating complex scientific text understanding tasks.
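To make this annotation schema concrete, the following is a hypothetical record layout for one SciHypo-500 entry; the field names are illustrative and do not correspond to an officially released file format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for one SciHypo-500 entry; field names are
# illustrative, not an official release format.

@dataclass
class HypothesisAnnotation:
    statement: str                 # structured hypothesis extracted from the abstract
    evidence_types: List[str]      # e.g., ["experimental", "observational"]
    methods: List[str]             # key research methods used to investigate it

@dataclass
class SciHypoRecord:
    abstract: str                                                 # original unstructured abstract text
    domain: str                                                   # e.g., "biomedicine", "materials science"
    hypotheses: List[HypothesisAnnotation] = field(default_factory=list)
    future_directions: List[str] = field(default_factory=list)    # expert commentaries
```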
4.1.3. Evaluation Tasks and Metrics
1. Hypothesis Identification: This task evaluates the models’ ability to accurately identify and extract explicit scientific hypotheses or research questions from the abstracts.
   - Metrics: Precision (P), Recall (R), and F1-score (F1) are used to measure the overlap and correctness of extracted hypotheses compared to expert annotations.
2. Evidence and Method Classification: This task assesses the models’ capability to correctly associate identified hypotheses with their supporting evidence types and the research methodologies employed.
   - Metrics: Accuracy and Macro-F1 are utilized to evaluate the correctness of classifying evidence types and methods across all categories.
3. Potential Research Direction Reasoning: This task measures the models’ ability to infer meaningful future research directions, open questions, or study limitations based on the abstract’s content.
   - Metrics: ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence) is used to quantify the textual similarity between generated research directions and expert annotations. Additionally, a human evaluation score (1-5 points) is employed to assess the quality, relevance, and innovativeness of the generated directions. A minimal sketch of the F1 and ROUGE-L computations follows this list.
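The sketch below illustrates the scoring described above: set-overlap precision/recall/F1 for extracted hypotheses (the paper's exact matching criterion is not reproduced here, so exact string match is assumed) and a ROUGE-L F-measure computed from the longest common subsequence.

```python
from typing import List, Set, Tuple

def prf1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, F1 for extracted hypotheses (exact-match assumption)."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def lcs_length(a: List[str], b: List[str]) -> int:
    """Standard dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure between a generated direction and an expert annotation."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    prec = lcs / len(c) if c else 0.0
    rec = lcs / len(r) if r else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```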
4.2. Baseline Methods
- Qwen-7B, Claude, and Gemini represent state-of-the-art general-purpose LLMs. For these models, we used simple, direct prompts for each task (e.g., “Extract hypotheses from the following abstract,” “Classify evidence types,” “Suggest future research directions”), without the multi-stage, detailed prompt engineering inherent in PEL-SHA. This setup allows us to gauge their inherent capabilities on scientific text understanding tasks when used out-of-the-box.
- LLM-X (Baseline) serves as a controlled baseline, using the same underlying LLM as our proposed method but without the PEL-SHA framework’s advanced prompt engineering. This direct comparison is crucial for isolating the performance gains attributed solely to our multi-stage prompting strategy, demonstrating its incremental value over a generic LLM.
4.3. Main Results
4.4. Human Evaluation for Potential Research Direction Reasoning
4.5. Ablation Study: Impact of Prompt Refinement
- PEL-SHA (Full): Utilizes all three refined stage prompts and the full sequential processing.
- PEL-SHA w/o Stage 1 Refinement: Stage 1 uses a basic prompt (e.g., “Extract hypotheses from the abstract.”) instead of its detailed refined prompt. Subsequent stages receive the output from this simplified Stage 1.
- PEL-SHA w/o Stage 2 Refinement: Stage 2 uses a basic prompt (e.g., “Classify evidence and methods for the given hypothesis from the abstract.”) instead of its detailed refined prompt. Stage 1 uses its refined prompt, and Stage 3 receives output from this simplified Stage 2.
- PEL-SHA w/o Stage 3 Refinement: Stage 3 uses a basic prompt (e.g., “Suggest future research directions based on the abstract and its findings.”) instead of its detailed refined prompt. Stages 1 and 2 use their respective refined prompts. A sketch of how these variants can be configured follows this list.
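One way these variants can be organized is as a per-stage prompt table in which each ablation replaces exactly one refined prompt with its basic counterpart; the refined-prompt strings below are placeholders, since the actual refined prompts are not reproduced here.

```python
from typing import Dict, Optional

# Illustrative per-stage prompt table for the ablation. Basic prompts are
# taken from the variant descriptions above; refined prompts are placeholders.

BASIC_PROMPTS: Dict[int, str] = {
    1: "Extract hypotheses from the abstract.",
    2: "Classify evidence and methods for the given hypothesis from the abstract.",
    3: "Suggest future research directions based on the abstract and its findings.",
}

REFINED_PROMPTS: Dict[int, str] = {
    1: "<detailed Stage 1 hypothesis-identification prompt>",
    2: "<detailed Stage 2 evidence/method-classification prompt>",
    3: "<detailed Stage 3 research-direction reasoning prompt>",
}

def ablation_prompts(drop_stage: Optional[int] = None) -> Dict[int, str]:
    """Return the stage -> prompt mapping for one variant.

    drop_stage=None reproduces PEL-SHA (Full); drop_stage=k replaces only
    stage k's refined prompt with the basic one, as in the ablation above.
    """
    return {
        stage: BASIC_PROMPTS[stage] if stage == drop_stage else REFINED_PROMPTS[stage]
        for stage in (1, 2, 3)
    }
```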
4.6. Analysis of Sequential Information Flow
- PEL-SHA (Full Sequential): Our proposed method with distinct stages, each with its refined prompt and leveraging the structured outputs from preceding stages.
- PEL-SHA (Single-Pass Multi-Task Prompt): A single, comprehensive prompt given the raw abstract, requesting all three outputs in one go, without explicit intermediate feedback or structured output feeding. This prompt is more detailed than LLM-X (Baseline) but does not decompose the task into sequential steps; an illustrative single-pass prompt is sketched after this list.
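For contrast with the sequential sketch in Section 3, the single-pass variant can be expressed as one comprehensive prompt that requests all three outputs at once; the wording below is illustrative, not the exact prompt used in the experiments.

```python
# Illustrative single-pass multi-task prompt: all three outputs are requested
# in one call, with no intermediate structured feedback between stages.
# `complete` is the same stand-in LLM call used in the Section 3 sketch.

SINGLE_PASS_TEMPLATE = """You are analyzing a scientific abstract.
1. List every explicit hypothesis or research question.
2. For each hypothesis, classify its supporting evidence type and the key research methods.
3. Suggest specific future research directions, open questions, or limitations.

Abstract:
{abstract}
"""

def analyze_single_pass(abstract: str, complete) -> str:
    """Run the single-pass variant with any chat-completion callable."""
    return complete(SINGLE_PASS_TEMPLATE.format(abstract=abstract))
```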
4.7. Sensitivity to Base LLM
4.8. Error Analysis
4.8.1. Hypothesis Identification Errors
- Implicit Hypotheses Missed: While PEL-SHA performed well on explicitly stated hypotheses, it occasionally struggled to identify hypotheses that were deeply embedded or highly implicit within complex sentences, requiring significant inferential leaps.
- Over-extraction of Background/Results: In some cases, the LLM misidentified strong claims from the introduction or definitive statements from the results section as testable hypotheses, despite the Stage 1 prompt’s directives. This suggests that the fine line between a strong finding and a testable claim can still pose a challenge.
- Granularity Issues: Sometimes, a single complex hypothesis was split into multiple simpler statements by the LLM, or conversely, multiple distinct hypotheses were merged into one, leading to partial credit or mismatches with expert annotations.
4.8.2. Evidence and Method Classification Errors
- Ambiguity in Evidence Type: Abstracts often contain general statements about “data” or “findings” without explicit categorization (e.g., “experimental,” “observational”). The LLM sometimes struggled to infer the precise type of evidence when not directly stated.
- Method Specificity: While general methods (e.g., “statistical analysis”) were often correctly identified, very specific or novel methodological details (e.g., a custom algorithm name) were occasionally missed or misclassified if not widely known or clearly described in the abstract.
- Incorrect Association: Although less frequent due to the sequential feeding of hypotheses, there were instances where an evidence type or method was correctly identified but incorrectly linked to a hypothesis that it did not directly support, particularly in abstracts with multiple interwoven hypotheses.
4.8.3. Potential Research Direction Reasoning Errors
- Generality vs. Specificity: The LLM sometimes generated overly generic future directions (e.g., “more research is needed”) that lacked the specific actionable insights expected by experts.
- Lack of Novelty: While generally relevant, some suggestions lacked true innovativeness, instead reiterating obvious next steps or minor extensions, falling short of identifying deeper knowledge gaps.
- Hallucination of Limitations: In rare instances, the LLM inferred limitations or open questions that were not genuinely supported by the abstract’s content, potentially drawing on its general world knowledge rather than strictly abstract-confined reasoning.
- Redundancy: Multiple generated directions sometimes overlapped in meaning, indicating a need for better synthesis and de-duplication in the reasoning stage.
4.9. Computational Performance
5. Conclusion
References
1. Cai, H.; Cai, X.; Yang, S.; Wang, J.; Yao, L.; Gao, Z.; Chang, J.; Li, S.; Xu, M.; Wang, C.; et al. Uni-SMART: Universal Science Multimodal Analysis and Research Transformer. CoRR 2024.
2. Nicholas, G.; Bhatia, A. Lost in Translation: Large Language Models in Non-English Content Analysis. CoRR 2023.
3. Zhou, Y.; Shen, J.; Cheng, Y. Weak to strong generalization for large language models with multi-capabilities. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
4. Tamber, M.S.; Bao, F.S.; Xu, C.; Luo, G.; Kazi, S.; Bae, M.; Li, M.; Mendelevitch, O.; Qu, R.; Lin, J. Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards. CoRR 2025.
5. Yang, A.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Huang, H.; Jiang, J.; Tu, J.; Zhang, J.; Zhou, J.; et al. Qwen2.5-1M Technical Report. CoRR 2025.
6. LeBrun, C.; Poon, Y.S. Twistors, Kaehler Manifolds, and Bimeromorphic Geometry II. arXiv preprint arXiv:alg-geom/9202006v1, 1992.
7. Sivo, G.; Blakeslee, J.; Lotz, J.; Roe, H.; Andersen, M.; Scharwachter, J.; Palmer, D.; Kleinman, S.; Adamson, A.; Hirst, P.; et al. Entering into the Wide Field Adaptive Optics Era on Maunakea. arXiv preprint arXiv:1907.08169v3, 2019.
8. Peters, U.; Chin-Yee, B. Generalization Bias in Large Language Model Summarization of Scientific Research. CoRR 2025.
9. Li, Y.; Zhang, Y.; Zhao, Z.; Shen, L.; Liu, W.; Mao, W.; Zhang, H. CSL: A Large-scale Chinese Scientific Literature Dataset. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022; International Committee on Computational Linguistics, 2022; pp. 3917–3923.
10. Henning, S.; Macher, N.; Grünewald, S.; Friedrich, A. MiST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022; Association for Computational Linguistics, 2022; pp. 1305–1324.
11. Dunn, A.; Dagdelen, J.; Walker, N.; Lee, S.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured information extraction from complex scientific text with fine-tuned large language models. CoRR 2022.
12. Song, H.; Feng, J.; Li, G.; Province, M.A.; Payne, P.R.O.; Chen, Y.; Li, F. Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning. CoRR 2025.
13. Binder, A.; Verma, B.; Hennig, L. Full-Text Argumentation Mining on Scientific Publications. CoRR 2022.
14. Zavarella, V.; Gamero-Salinas, J.C.; Consoli, S. A Few-Shot Approach for Relation Extraction Domain Adaptation using Large Language Models. CoRR 2024.
15. Wang, C.; Zhou, Y.; Long, G.; Wang, X.; Xu, X. Unsupervised Knowledge Graph Construction and Event-centric Knowledge Infusion for Scientific NLI. CoRR 2022.
16. Zhou, Y.; Song, L.; Shen, J. Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback. arXiv preprint arXiv:2501.01377, 2025.
17. Wang, Q.; Hu, H.; Zhou, Y. MemoryMamba: Memory-Augmented State Space Model for Defect Recognition. arXiv preprint arXiv:2405.03673, 2024.
18. Wang, G.; Sun, Z.; Gong, Z.; Ye, S.; Chen, Y.; Zhao, Y.; Liang, Q.; Hao, D. Do Advanced Language Models Eliminate the Need for Prompt Engineering in Software Engineering? CoRR 2024.
19. Shin, J.; Tang, C.; Mohati, T.; Nayebi, M.; Wang, S.; Hemmati, H. Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code. In Proceedings of the 22nd IEEE/ACM International Conference on Mining Software Repositories, MSR@ICSE 2025, Ottawa, ON, Canada, April 28-29, 2025; IEEE, 2025; pp. 490–502.
20. Shi, F.; Qing, P.; Yang, D.; Wang, N.; Lei, Y.; Lu, H.; Lin, X.; Li, D. Prompt Space Optimizing Few-shot Reasoning Success with Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024; Association for Computational Linguistics, 2024; pp. 1836–1862.
21. Zhou, Y.; Geng, X.; Shen, T.; Tao, C.; Long, G.; Lou, J.G.; Shen, J. Thread of Thought Unraveling Chaotic Contexts. arXiv preprint arXiv:2311.08734, 2023.
22. Amatriain, X. Prompt Design and Engineering: Introduction and Advanced Methods. CoRR 2024.
23. Wang, J.; Hu, Z.; Bing, L. Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective. CoRR 2025.
24. Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual In-Context Learning for Large Vision-Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024; Association for Computational Linguistics, 2024; pp. 15890–15902.
25. Li, Y. A Practical Survey on Zero-Shot Prompt Design for In-Context Learning. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, RANLP 2023, Varna, Bulgaria, 4-6 September 2023; INCOMA Ltd., Shoumen, Bulgaria, 2023; pp. 641–647.
26. Austin, D.; Chartock, E. GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering. CoRR 2024.
27. Mirzaei, H.; Jafari, M.; Dehbashi, H.R.; Taghavi, Z.S.; Sabokrou, M.; Rohban, M.H. Killing it with Zero-Shot: Adversarially Robust Novelty Detection. CoRR 2025.


Table 1. Main results on SciHypo-500 across the three evaluation tasks (Section 4.3).

| Model | Hypothesis Identification (F1) | Evidence Classification (Macro-F1) | Research Direction Reasoning (ROUGE-L) |
|---|---|---|---|
| Qwen-7B | 69.5 | 64.2 | 25.1 |
| Claude | 73.1 | 68.9 | 28.5 |
| Gemini | 75.8 | 71.3 | 30.2 |
| LLM-X (Baseline) | 78.2 | 75.0 | 34.1 |
| LLM-X + PEL-SHA (Our Method) | 81.5 | 78.9 | 37.8 |

Table 2. Ablation study on prompt refinement (Section 4.5).

| Model Configuration | Hypothesis ID (F1) | Evidence Class (Macro-F1) | Research Direction (ROUGE-L) |
|---|---|---|---|
| LLM-X (Baseline) | 78.2 | 75.0 | 34.1 |
| PEL-SHA (Full) | 81.5 | 78.9 | 37.8 |
| PEL-SHA w/o Stage 1 Refinement | 79.7 | 76.5 | 35.3 |
| PEL-SHA w/o Stage 2 Refinement | 81.4 | 76.9 | 36.1 |
| PEL-SHA w/o Stage 3 Refinement | 81.5 | 78.9 | 35.9 |

Table 3. Sequential information flow versus a single-pass multi-task prompt (Section 4.6).

| Model Configuration | Hypothesis ID (F1) | Evidence Class (Macro-F1) | Research Direction (ROUGE-L) |
|---|---|---|---|
| LLM-X (Baseline) | 78.2 | 75.0 | 34.1 |
| PEL-SHA (Single-Pass Multi-Task Prompt) | 79.1 | 75.9 | 34.9 |
| PEL-SHA (Full Sequential) | 81.5 | 78.9 | 37.8 |

Table 4. Sensitivity to the base LLM: each base model with and without PEL-SHA (Section 4.7).

| Model | Hypothesis ID (F1) | Evidence Class (Macro-F1) | Research Direction (ROUGE-L) |
|---|---|---|---|
| Qwen-7B (Baseline) | 69.5 | 64.2 | 25.1 |
| Qwen-7B + PEL-SHA | 72.8 | 67.5 | 28.3 |
| Claude (Baseline) | 73.1 | 68.9 | 28.5 |
| Claude + PEL-SHA | 76.5 | 72.1 | 31.6 |
| Gemini (Baseline) | 75.8 | 71.3 | 30.2 |
| Gemini + PEL-SHA | 79.1 | 74.8 | 33.5 |
| LLM-X (Baseline) | 78.2 | 75.0 | 34.1 |
| LLM-X + PEL-SHA (Our Method) | 81.5 | 78.9 | 37.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).