Submitted: 17 October 2025
Posted: 17 October 2025
Abstract
Keywords:
1. Introduction
2. Background
2.1. What are Large Language Models (LLMs)?
2.1.1. Definition of LLMs
- How to Define “Large”? LLMs are characterized as “large” by their exceptionally high number of parameters—typically ranging from billions to hundreds of billions—paired with extensive training data. For instance, OpenAI’s GPT series contains hundreds of billions of parameters, allowing it to capture a wide range of linguistic patterns and knowledge. Beyond sheer size, the term “large” also signifies a critical scale at which emergent capabilities arise—abilities that are not present in smaller models and often cannot be predicted simply by extrapolating from smaller-scale performance [13]. Driven by advances in computing power and guided by scaling laws, the size of language models has increased rapidly over recent years. Models once regarded as state-of-the-art have been quickly surpassed and are now considered relatively small by current standards. For example, GPT-2, released in 2019, contained 1.5 billion parameters [5], whereas the smallest variants of contemporary LLMs typically begin at 7 billion parameters. This shift highlights the field’s rapid progression and the evolving definition of what constitutes a “large” model.
2.1.2. History of LLMs
| Era | Model | Core Tech | Contribution/Feature | Major Lab/Company | Release Time |
|---|---|---|---|---|---|
| Rule-based (pre-2010) | ELIZA [43] | Rule-based communication | First NLP communication system | MIT | 1966 |
| | n-grams [44] | Markov language model | Based on statistical frequency modeling | IBM | 1992 |
| | PCFG [45] | Probabilistic context-free grammar | Syntactic analysis & structural prediction | Brown Uni. | 1998 |
| RNN (2010s) | RNNLM [46] | Based on RNN | First RNN-based language model | Microsoft | 2010 |
| | Seq2Seq [47] | Based on GRU / LSTM | Pioneered encoder-decoder for NLP | Google | 2014 |
| Transformer (2017–2019) | Transformer [4] | Self-attention | Breaks sequential constraint, enables parallelism | Google | 2017 |
| | BERT [9] | Masked LM + fine-tuning | Contextualized language representation | Google | 2018 |
| | T5 [40] | Text-to-text transfer learning | Reformulates all NLP tasks as text generation | Google | 2019 |
| GPT & ChatGPT (2018–2023) | GPT-1/2 [5,6] | Autoregressive transformer | Unified architecture for generation | OpenAI | 2018/2019 |
| | GPT-3 [7] | Large-scale autoregressive model | Kickstarted large model era | OpenAI | 2020 |
| | ChatGPT [48] | Fine-tuned GPT-3 with RLHF | Interactive and aligned chatbot | OpenAI | 2022 |
| | GPT-4 [1] | Multimodal, tool-use capabilities | Generalized across wide range of tasks | OpenAI | 2023 |
| Other LLMs (2020–2023) | Codex [49] | Code-focused GPT fine-tuning | Natural language to code translation | OpenAI | 2021 |
| | FLAN-T5 [50] | Instruction-tuned T5 | Strong zero-shot generalization | Google | 2022 |
| | Qwen [51] | Tool-augmented transformer | Open-source model with strong instruction-following | Alibaba | 2023 |
| | LLaMA [52] | Efficient transformer variants | High performance with fewer resources | Meta | 2023 |
| Emerging Frontier (till now) | Mixtral [53] | Sparse Mixture-of-Experts | Efficient inference with high performance | Mistral AI | 2023 |
| | DeepSeek-R1 [21] | Neural-symbolic reasoning | Multi-step logic | DeepSeek | 2025 |
| | o1 [20] | Experimental AGI prototype | Focus on generalized reasoning skills | OpenAI | 2024 |
2.2. State-of-the-art LLMs

2.2.1. Overview
| Task | Modality | Context Window | Latency | Privacy | Budget | Hardware | Perf. | LLMs |
|---|---|---|---|---|---|---|---|---|
| Conversational Task | T, (I, A) | ≥ 200K | Yes | No | Low | – | Med | Gemini 2 |
| | T, (I) | < 200K | Yes | No | Med | – | High | ChatGPT-4o |
| | T | < 200K | No | Yes | – | Med | – | Qwen 2.5 |
| | T, (I) | < 200K | No | Yes | – | High | – | Llama 3.2 MM |
| Reasoning Task | T, (I, A) | ≥ 200K | Yes | No | – | – | High | Gemini 2.5 Pro |
| | T | < 200K | No | No | Med | – | Med | OpenAI Reasoning |
| | T | < 200K | Yes | Yes | – | High | – | DeepSeek-R1 |
| Coding Task | T | < 200K | Yes | No | Low | – | Med | Claude 3.5 |
| | T | < 200K | No | No | High | – | High | Claude 3.7 |
| | T | ≥ 200K | No | Yes | – | Med | – | Qwen 2.5-1M |
| Open-Domain QA | T, (I, A) | ≥ 200K | Yes | No | Low | – | Med | Gemini 2 |
| | T | < 200K | Yes | No | Med | – | High | GPT-4o |
| | T | < 200K | Yes | Yes | – | Med | – | QwQ 32B |
| Focused Tasks (Fine-tune) | T | < 200K | Yes | No | Low | – | – | GPT-4o mini |
| | T, (I) | < 200K | No | Yes | – | Med | – | Llama 3.2 MM |
| | T | < 200K | Yes | Yes | – | Low | – | DeepSeek-R1 Distill |

Modality codes: T = text, I = image, A = audio.
2.2.2. GPT-series Models
2.2.3. OpenAI Reasoning Models
2.2.4. Claude 3 Model Family
2.2.5. Gemini 2 Model Family
2.2.6. Grok Model Family
2.2.7. GPT-OSS
2.2.8. Llama 3 Model Family
2.2.9. Qwen 2 Model Family
2.2.10. DeepSeek Model Family
2.3. Evaluation on LLMs
2.3.1. Tasks
2.3.2. Benchmarks
- MMLU [135]: The Massive Multitask Language Understanding (MMLU) benchmark evaluates multitask accuracy across 57 diverse subjects, including humanities, social sciences, STEM fields, and professional domains like law and medicine. Each question is multiple-choice with four options, covering difficulty levels from elementary to professional. Questions are sourced from standardized test prep materials (e.g., GRE, USMLE) and university-level courses. The dataset comprises 15,908 questions split into training, validation, and test sets. MMLU assesses models in both zero-shot and few-shot settings, reflecting real-world conditions where no task-specific fine-tuning is applied. Human performance baselines are also provided, ranging from average crowdworkers to expert-level participants.
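MMLU's multiple-choice protocol is straightforward to operationalize. The sketch below shows the core scoring loop; `ask_model` is a hypothetical stand-in for any LLM backend, and the two toy questions merely illustrate the dataset's format:

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# `ask_model` and the sample questions are illustrative assumptions,
# not part of the actual benchmark harness.

CHOICES = "ABCD"

def format_question(q: dict) -> str:
    """Render a question and its four options as a zero-shot prompt."""
    options = "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, q["options"]))
    return f"{q['question']}\n{options}\nAnswer:"

def accuracy(questions, ask_model) -> float:
    """Fraction of questions where the model's letter matches the key."""
    correct = sum(
        ask_model(format_question(q)).strip().upper().startswith(q["answer"])
        for q in questions
    )
    return correct / len(questions)

if __name__ == "__main__":
    sample = [
        {"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "B"},
        {"question": "Capital of France?", "options": ["Paris", "Rome", "Oslo", "Bern"], "answer": "A"},
    ]
    # A dummy "model" that always answers "B" scores 50% on this sample.
    print(accuracy(sample, lambda prompt: "B"))
```

Few-shot evaluation differs only in the prompt: worked question-answer pairs from the development split are prepended before the test question.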
- BIG-Bench [136]: The Beyond the Imitation Game Benchmark (BIG-Bench) is a large-scale suite of 204 tasks designed to test LLMs on capabilities not captured by conventional benchmarks. Tasks span areas such as linguistics, mathematics, biology, social bias, and software engineering, and were contributed by researchers and institutions worldwide. Human experts also completed the tasks to establish reference baselines. BIG-Bench includes JSON tasks (with structured inputs/outputs) and programmatic tasks (which allow custom metrics and interaction). Evaluation metrics include accuracy, exact match, and calibration. A smaller curated subset, BIG-Bench Lite, contains 24 JSON tasks for lightweight and efficient evaluation.
- HumanEval [49]: HumanEval is a benchmark for evaluating the functional correctness of code generation. It consists of 164 original Python programming problems, each with a function signature, descriptive docstring, and empty function body. A solution is deemed correct if it passes predefined unit tests, aligning with how developers assess code quality. The benchmark targets abilities such as comprehension, algorithmic reasoning, and basic mathematics. For safety, all code is executed in a secure sandbox to mitigate risks posed by untrusted or potentially harmful code.
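HumanEval results are conventionally reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The benchmark's authors give an unbiased estimator over n generated samples, which can be computed as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator used with HumanEval.
    n = samples generated per problem, c = samples passing the
    unit tests, k = evaluation budget. Returns the probability
    that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: k draws cannot all fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 20 passing: pass@1 reduces to c/n
print(pass_at_k(200, 20, 1))  # ≈ 0.1
```

For k = 1 the closed form reduces to c/n, while larger k rewards models whose samples are diverse enough that at least one succeeds.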
- TruthfulQA [151]: This benchmark is designed to assess whether LLMs generate truthful answers and avoid perpetuating misconceptions or factual inaccuracies. It includes 817 questions across 38 domains, such as health, finance, and law. The questions—typically concise, with a median length of 9 words—are crafted to exploit known weaknesses in LLMs, particularly their tendency to imitate common yet incorrect human text. The benchmark imposes rigorous truthfulness criteria, evaluating answers based on factual accuracy as supported by public sources like Wikipedia. Each question includes both true and false reference answers.
- GSM8K [140]: The Grade School Math 8K (GSM8K) dataset comprises 8.5K human-written arithmetic word problems at grade-school level. Of these, 7.5K are training problems and 1K are test problems. Each problem typically requires 2 to 8 reasoning steps and involves basic arithmetic. The dataset emphasizes: (1) high quality, with a reported error rate below 2%; (2) high diversity, avoiding repetitive templates and encouraging varied linguistic expression; (3) moderate difficulty, solvable using early algebra without advanced math concepts; and (4) natural language solutions, favoring everyday phrasing over formal math notation.
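Because each GSM8K reference solution ends with a `#### <answer>` marker, grading reduces to extracting and comparing final numbers. A minimal sketch (the model output and reference text below are illustrative):

```python
import re

def extract_answer(solution: str):
    """GSM8K reference solutions end with '#### <answer>';
    pull out that final number, stripping thousands separators."""
    m = re.search(r"####\s*([-0-9.,]+)", solution)
    return m.group(1).replace(",", "") if m else None

def is_correct(model_output: str, reference: str) -> bool:
    """Exact match on the extracted final answers."""
    pred = extract_answer(model_output)
    return pred is not None and pred == extract_answer(reference)

reference = ("Weng earns 12/60 = $0.2 per minute. "
             "For 50 minutes she earned 0.2 x 50 = $10. #### 10")
print(is_correct("Step-by-step reasoning... #### 10", reference))  # True
```

In practice the model is prompted to imitate this format (e.g., via chain-of-thought few-shot examples), so its own final line can be parsed the same way.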
2.3.3. Evaluation Methods
2.3.4. Performance at a Glance
- Insight 1: No single LLM dominates across all tasks. GPT-4.5 ranks highest on Chatbot Arena (1398), suggesting strong general chatbot capabilities. However, it does not lead in benchmarks like GPQA (reasoning) or math. Conversely, DeepSeek R1 achieves the best scores on MMLU (90.8%) and Math (97.3%) but lacks results in HumanEval (coding) and multilingual tasks, indicating that top performance in one area does not translate to all domains.
- Insight 2: Reasoning models outperform others in logical and structured tasks. Claude 3.7 Sonnet (reasoner) achieves the highest GPQA Diamond score (84.8%), the best Tool Use (Retail) result (81.2%), and leads in MMMLU (86.1%). These benchmarks emphasize reasoning and complex task execution, showcasing the value of models fine-tuned for reasoning.
- Insight 3: Specialized models often come with trade-offs. Claude 3.5 Haiku performs well in general chatbot interaction but has one of the lowest GPQA scores (41.6%) and math scores (69.4%). Gemini 1.5 Pro and Gemini 2.0 Flash perform reasonably well in multilingual (MGSM) and tool-use (BFCL) tasks, but underperform in reasoning and coding, highlighting performance sacrifices in general-purpose versus specialized capabilities.
- Insight 4: Different models excel at different tasks and should be selected accordingly. For chatbot dialogue, GPT-4.5 and GPT-4o are top performers. Claude 3.7 Sonnet (reasoner) excels in reasoning, tool use, and multilingual benchmarks. Claude 3.5 Sonnet leads in coding with the best HumanEval score (93.7%). For math-heavy benchmarks, DeepSeek R1 and o3-mini both surpass 97% on the Math benchmark. Thus, model selection should be guided by the specific task requirements.
3. LLMs for Arts, Letters, and Law
3.1. History
3.1.1. Overview

| Historical Category | LLM Application Areas | Use Case-Inspired Research Question | Key Insights and Contributions | References |
|---|---|---|---|---|
| Narrative and Interpretive History | Narrative Generation and Analysis | Can LLMs generate historically coherent and emotionally resonant narratives from primary source texts? | LLMs struggle with coherence and diversity; discourse-aware prompts improve storytelling; useful for story arcs and emotional analysis. | [186,187] |
| | Historical Research Assistance | How can LLMs simulate historical personas or provide conversational access to archives for education and research? | Used in chat-based exploration, simulated historical figures, and layered tools (e.g., KleioGPT); helpful for both scholars and students. | [188,189,190] |
| | Historical Interpretation | Can LLMs improve consistency and reduce bias in interpretive historical analysis? | LLMs show more consistent judgments than humans; useful for reducing subjectivity in analysis. | [191] |
| Quantitative and Scientific History | Historical Thinking | How do LLMs enhance access to and reflection on historical content through transcription and summarization? | Help transcribe handwriting and summarize texts; make primary sources easier to explore and reflect on. | [192] |
| | Historical Data Processing | Can LLMs scale up entity extraction and temporal reasoning across vast historical corpora? | Enable name/date extraction, timeline generation, and semantic linking at scale. | [193,194] |
| | Simulating Historical Psychological Responses | Is it possible to model the psychological tendencies of past societies using LLMs trained on historical texts? | Can mimic cultural mindsets from historical texts, but results may reflect biases of elite sources. | [188] |
| Comparative and Cross-Disciplinary History | Historical Analogy Generation | How can LLMs retrieve or generate meaningful analogies between past and present events? | Generate cross-era comparisons; reduce hallucinations with structured frameworks and similarity checks. | [195] |
| | Interdisciplinary Information Seeking | Can LLMs facilitate the discovery of relevant insights across disciplines for historical inquiry? | Tools like DiscipLink help explore diverse sources and connect ideas across fields. | [196] |
- Narrative and Interpretive History: This area emphasizes descriptive, subjective, and human-centered accounts of the past. It uses storytelling and meaning-making to explain events in context. LLMs can assist in narrative generation, reconstruct voices from historical texts, and interpret language use in personal accounts or memoirs.
- Quantitative and Scientific History: This approach applies statistical, computational, and formal methods to study historical data and trends. LLMs can process large datasets, simulate historical psychological responses, aid in historical reasoning tasks, and evaluate knowledge systems.
- Comparative and Cross-Disciplinary History: This domain integrates methods from sociology, economics, political science, and other disciplines to study similarities and differences across historical contexts. LLMs can support comparative analysis, generate historical analogies, and link concepts across eras or cultures.
3.1.2. Narrative and Interpretive History
3.1.3. Quantitative and Scientific History
3.1.4. Comparative and Cross-Disciplinary History
3.1.5. Benchmarks
| Benchmark | Scope and Focus | Data Composition | Evaluation Tasks | Key Insights |
|---|---|---|---|---|
| TimeTravel [198] | Multimodal evaluation of historical and cultural artifacts | 10,250 expert-verified samples across 266 cultures and 10 regions; includes manuscripts, artworks, inscriptions, and archaeological findings | Classification, interpretation, and historical reasoning | Highlights LMMs’ limitations in cultural/historical context understanding; sets new standards for AI in cultural heritage preservation |
| AC-EVAL [199] | Ancient Chinese language understanding by LLMs | 3,245 multiple-choice questions on historical facts, geography, customs, classical poetry, and philosophy, categorized into three difficulty levels | General knowledge, short text understanding, long text comprehension | Chinese-trained models outperform English-trained ones; ancient Chinese remains a low-resource challenge; few-shot often introduces noise |
| Hist-LLM [193] | Global historical knowledge evaluation using structured data | Subset of Seshat Global History Databank: 600 societies, 36,000 data points | Multiple-choice on historical facts across global regions and eras | Models show moderate accuracy; better on early periods and the Americas; weaker in Sub-Saharan Africa and Oceania |
3.1.6. Discussion
- Temporal and Causal Reasoning Enhancement. Advances in temporal modeling, causal inference, and discourse-level narrative comprehension [186] are critical for enabling LLMs to reason more accurately about historical sequences and cause-effect relationships.
- Explainable and Source-Grounded Historical AI. Building models that produce verifiable outputs, grounded in cited historical documents or structured datasets, can strengthen academic trust and facilitate critical engagement [190].
- Collaborative Human-AI Historical Research. Systems like DiscipLink [196] suggest a promising future where LLMs act as exploratory partners rather than authoritative experts, supporting iterative, mixed-initiative workflows that preserve human scholarly agency.
- Ethics and Epistemology of Digital History. Further interdisciplinary studies are needed to critically examine how LLMs reshape historical knowledge production, interpretive authority, and educational practices, ensuring that technological augmentation remains aligned with historical rigor and ethical standards [191].
3.2. Philosophy
3.2.1. Overview

- Normative and Interpretive Philosophy: This area emphasizes the analysis of moral values, ethical dilemmas, and interpretive narratives within philosophical traditions. LLMs can assist in generating novel normative arguments, reconstructing historical ethical discourses, and offering fresh interpretations of canonical texts.
- Analytical and Logical Philosophy: This approach applies formal logic, precise argumentation, and rigorous conceptual analysis to explore philosophical problems. LLMs can support the systematic breakdown of complex arguments, facilitate comparative analyses of theoretical positions, and enhance clarity in logical reasoning.
- Comparative and Cross-Disciplinary Philosophy: This domain integrates insights from diverse fields—such as political science, sociology, and linguistics—to study philosophical questions from multiple angles. LLMs can aid in cross-cultural comparisons, generate analogies between disparate theories, and help synthesize interdisciplinary perspectives.
3.2.2. Normative and Interpretive Philosophy
3.2.3. Analytical and Logical Philosophy
3.2.4. Comparative and Cross-Disciplinary Philosophy
3.2.5. Benchmarks
3.2.6. Discussion
- Philosophical Fine-Tuning and Corpus Design. Curating balanced and inclusive corpora that represent diverse traditions can mitigate bias and expand philosophical reach [226].
- Logical Structure and Argument Mining. Developing tools to extract, visualize, and compare philosophical arguments enhances interpretive transparency and pedagogical value [221].
- Ontology Mapping and Comparative Frameworks. Leveraging LLMs to compare metaphysical and ethical schemas across cultures can support pluralistic theorizing and cross-cultural ethics [215].
- Interactive Dialogue Systems for Teaching. Deploying LLMs as Socratic partners or role-played historical thinkers can deepen student engagement and simulate philosophical exchange [218].
- Epistemology and AI Ethics. Interdisciplinary work is needed to assess the epistemic status of LLM outputs, their limits of understanding, and their role in reshaping human inquiry [217].
3.3. Political Science
3.3.1. Overview

- Text Analysis for Political Insight: LLMs assist in analyzing political texts such as speeches, party platforms, policy documents, and social media discourse. They are widely used for tasks like sentiment analysis, policy position classification, ideological scaling, and automated topic modeling, enabling scalable and consistent textual interpretation at unprecedented scale.
- Opinion Simulation and Forecasting: LLMs can simulate public opinion, generate synthetic survey respondents (“silicon samples”), and forecast electoral outcomes through multi-step reasoning. These capabilities are particularly valuable for behavioral modeling, comparative analysis, and data augmentation in contexts where real-world survey data is limited.
- Generation and Framing of Political Messaging: LLMs are increasingly used to craft persuasive political language, adapt messages to different audiences, and analyze framing strategies. These applications include generating campaign slogans and policy narratives, auditing ideological bias in generated outputs, and studying the persuasive dynamics of AI-authored political content.
3.3.2. Text Analysis for Political Insight
3.3.3. Opinion Simulation and Forecasting
3.3.4. Generation and Framing of Political Messaging
3.3.5. Benchmarks
3.3.6. Discussion
- Fine-Tuned Political Models. Domain-specific fine-tuning on legislative records, public opinion corpora, and multilingual political texts can enhance accuracy and reduce ideological bias.
- Causal and Temporal Modeling. Combining LLMs with structured models or reasoning frameworks may improve their ability to infer causality, detect agenda dynamics, and simulate policy feedback.
- Bias Detection and Auditing. Systematic tools to audit, mitigate, and document political biases in LLM outputs are essential for responsible deployment.
- Interactive Political Simulation. LLMs can be embedded in deliberative platforms to support role-playing, voter education, or negotiation training in civic and educational settings.
- Disinformation Defense. Techniques such as adversarial prompting, fact-checking augmentation, and truth-conditioned training can be used to defend against political misuse.
3.4. Arts and Architecture
3.4.1. Overview

| Type of Art | Subtasks | Use Case-Inspired Research Question | Insights and Contributions | Citations |
|---|---|---|---|---|
| Visual Art | Creation (Prompting, Image Generation) | How can LLM-vision models generate visually compelling and stylistically diverse art through prompt-based interaction? | Multimodal models like MiniGPT-4 and LLaVA support creative generation; ArtGPT-4 enhances aesthetics using image adapters. | [269,270,271] |
| | Analysis (Symbolism, Style Classification) | Can LLMs classify artistic styles and interpret visual symbolism without dedicated visual encoders? | GalleryGPT reduces hallucinations via structured prompts; CognArtive uses SVGs to enable symbolic reasoning without visual models. | [272,273] |
| Literary Art | Creation (Storytelling, Stylistic Writing) | How can LLMs be guided to produce stylistically rich and emotionally resonant literary texts? | Prompt tuning and temperature settings support diverse styles; enables multi-voice storytelling and creative expression. | [274,275] |
| | Analysis (Critique, Interpretation) | Can LLMs simulate literary critics or authors to support textual analysis and interpretation? | Interactive and self-reflective prompting enables critical reading and style-aware interpretation. | [276,277] |
| Performing Art | Creation (Scriptwriting, Interactive Drama) | How can LLMs co-write scripts and structure dramatic arcs in interactive performances? | Tools like Dramatron build layered plots; Auto-Drama uses classical structure (e.g., Aristotelian elements) to shape narrative flow. | [278,279] |
| | Performance (Live Interaction, Improvisation) | Can LLMs participate in live improvisational performance alongside human actors? | Improbotics blends live dialogue with LLM input for co-creative scenes; handles multi-party flow and timing. | [280] |
| Architecture | Design and Creation (Concept Ideation, Layout Generation, Decision Records) | How can LLMs support early-stage design, spatial layout, and architectural decision-making processes? | Tools like Architext generate floorplans; GPT-based pipelines assist with ideation and accurate design documentation. | [281,282,283] |
| | Analysis (Heritage Assessment, Design Comparison) | Can LLMs assist in heritage assessment and design analysis by integrating semantic knowledge with user intent? | ArchGPT retrieves context-aware insights; supports restoration, regulation checks, and expert collaboration. | [284] |
- Visual Arts. Visual arts include creative practices such as painting, photography, illustration, and video. LLMs assist artists by generating image prompts, describing visual scenes, and helping with conceptual development. They also support analysis by interpreting symbolism, classifying art styles, and summarizing critical commentary.
- Literary Arts. Literary arts encompass creative writing forms such as poetry, drama, fiction, and essays. LLMs can generate literary content, mimic specific authorial styles, and assist in drafting narratives. They also enable textual analysis, such as thematic extraction, stylistic comparison, and literary critique.
- Performing Arts. Performing arts include music, dance, theater, and opera—forms that rely on embodied performance. LLMs can generate scripts, lyrics, or librettos, and simulate performative dialogue for interactive settings. They also help tag and analyze archival materials, enabling exploration of movement, expression, and performance history.
- Architecture. Architecture focuses on designing functional and aesthetic spaces, integrating art, engineering, and environmental factors. LLMs support architects by generating design narratives, proposing spatial ideas, and interpreting project briefs. They also assist in summarizing regulations, comparing architectural styles, and evaluating design decisions.
3.4.2. Visual Art
3.4.3. Literary Art
3.4.4. Performing Art
3.4.5. Architecture
3.4.6. Benchmarks
| Benchmark | Scope and Focus | Data Composition | Evaluation Tasks | Key Insights |
|---|---|---|---|---|
| ArtBench-10 [286] | Artwork generation and style classification | 60,000 curated images spanning 10 artistic styles with balanced classes and cleaned labels | Generative modeling using GANs, VAEs, diffusion; FID, IS, KID, precision/recall scores | StyleGAN2+ADA achieves best results; highlights diversity and fidelity gaps in GAN variants |
| AKM [283] | Architectural design decisions from text prompts | Context-rich design scenarios and model inputs for GPT/T5-family models | Zero-shot, few-shot, and fine-tuned generation of architectural decisions | GPT-4 performs best in zero-shot; GPT-3.5 effective with few-shot; Flan-T5 benefits from fine-tuning |
| ADD (DRAFT) [287] | Domain-specific architectural decision generation | 4,911 real-world Architectural Decision Records (ADRs) with labeled rationale and context | Few-shot vs. RAG vs. fine-tuning using DRAFT for architectural reasoning | DRAFT outperforms baselines in accuracy and efficiency; avoids reliance on large proprietary LLMs |
| AGI & Arch [288] | Generative AI’s knowledge of architectural history | 101M+ Midjourney prompts + qualitative and factual analysis of style descriptions | Historical style recognition, hallucination detection, generative image-text alignment | ChatGPT shows inconsistencies in confidence vs. accuracy; Midjourney trends analyzed via reverse prompts |
| WenMind [289] | Chinese Classical Literature and Language Arts (CCLLA) | 42 fine-grained tasks across Ancient Prose, Poetry, and Literary Culture; tested on 31 LLMs | Question answering, translation, rewriting, interpretation across genres | ERNIE-4.0 best performer with 64.3 score; major gap in LLM proficiency for classical Chinese content |
| AIST++ [290] | Music-conditioned 3D dance motion generation | 5.2 hours of 3D motion data across 10 dance genres + musical accompaniment | Motion synthesis with FACT transformer; evaluation via FID, diversity, beat alignment | FACT model generates realistic, synchronized, long-sequence dances better than prior baselines |
3.4.7. Discussion
- Ethical Authorship and Attribution Frameworks. Establishing clear guidelines for authorship attribution, ethical usage, and creative credit in LLM-augmented works will be critical to ensuring fair recognition and responsible innovation [278].
- Qualitative Evaluation Metrics. Developing domain-specific, human-centered evaluation rubrics—focused on creativity, authenticity, and emotional resonance—will be necessary to meaningfully assess LLM contributions in artistic fields [275].
3.5. Law
3.5.1. Overview


- Legal Consultant Question Answering. Many legal queries—such as “What are the elements of negligence?” or “Does the GDPR apply to this situation?”—involve retrieving statutory definitions, summarizing doctrines, or explaining precedent. LLMs can function as legal assistants, offering plain-language explanations, surfacing relevant laws, and contextualizing rules. This enables broader access to legal knowledge and supports both laypersons and professionals during early-stage legal reasoning.
- Legal Document Drafting. Drafting contracts, policies, and filings involves significant repetition and domain knowledge. LLMs can generate initial templates, propose clauses, and adapt documents to specific scenarios or jurisdictions. This accelerates document production, reduces drafting overhead, and promotes standardization—especially useful for small firms or high-volume legal operations.
- Legal Document Understanding & Case Analysis. Interpreting statutes, summarizing opinions, or identifying relevant facts is core to legal analysis. LLMs can extract key information, highlight legal entities or issues, and support case comparison. This improves comprehension, reduces time spent on manual review, and helps structure arguments and decisions based on large textual corpora.
- Case Prediction. Predicting legal outcomes—based on case facts, prior rulings, and jurisdictional context—is valuable for risk assessment and litigation strategy. While final outcomes are shaped by human judgment and evolving law, LLMs can surface patterns, suggest likely outcomes, and support probabilistic reasoning based on precedent, helping users plan and prioritize cases.
| Legal Domain | LLM Application Areas | Use Case-Inspired Research Question | Key Insights and Contributions | References |
|---|---|---|---|---|
| Legal Consultant Question Answering | Interactive Legal Q&A Systems | Can LLMs answer basic legal questions with accurate references to statutes and case law? | Emergent legal reasoning observed; retrieval-augmented prompting reduces hallucination; GPT-4 can approximate legal explanations with improved accuracy. | [305,306,307] |
| | Domain-Specific Legal Models | How can LLMs be fine-tuned to better address legal reasoning tasks? | Models like LawLLM improve U.S. law reasoning through fine-tuning and task adaptation (retrieval, precedent matching, judgment prediction). | [308] |
| | Legal Factor Extraction | Can LLMs extract and define core legal factors from court opinions? | Supports building expert systems; improves structure and consistency of legal analysis. | [309] |
| Legal Document Drafting | Clause Generation | Can LLMs autonomously generate domain-compliant legal clauses? | LLMs generate grammatically and legally sound clauses; useful in reducing drafting effort. | [310,311] |
| | Draft Comparison via NLI | How can LLMs verify consistency between generated and template contracts? | NLI tasks help identify deviations and inconsistencies, enabling automated review. | [312] |
| | Legal Validity of Prompt-Based Contracts | What legal risks arise when contracts are generated using prompts? | Raises issues with doctrines like the parol evidence rule; prompt provenance matters. | [313] |
| Legal Document Understanding and Case Analysis | Document Summarization & Entity Extraction | How well can LLMs extract facts and citations from unstructured legal texts? | Enhanced summarization, fact extraction, and legal entity recognition using retrieval-based and fine-tuned models. | [306,308,314] |
| | Large-Scale Legal Analysis | Can LLMs support empirical research over large legal corpora? | Enables scalable judgment pattern extraction; useful for comparative legal studies. | [307] |
| | E-Discovery and Compliance | Can LLMs assist in regulatory compliance and legal review at scale? | RAG-based systems improve due diligence and compliance decisions; multi-agent LLMs aid in document relevance prediction. | [315,316] |
| Legal Case Prediction | Judgment Forecasting | Can LLMs accurately predict legal case outcomes? | LLMs outperform traditional models; retrieval-augmented LLMs improve consistency and generalization. | [317,318] |
| | Hybrid Legal Reasoning | How can LLMs be integrated with expert systems to improve prediction accuracy? | Hybrid systems improve interpretability and performance by aligning LLM outputs with legal logic. | [319] |
3.5.2. Legal Consultant Question Answering
3.5.3. Legal Document Drafting
3.5.4. Legal Document Understanding and Case Analysis
3.5.5. Legal Judgment Prediction
3.5.6. Benchmarks
| Benchmark | Scope and Focus | Data Composition | Evaluation Tasks | Key Insights |
|---|---|---|---|---|
| CUAD [333] | Contract clause extraction and risk detection in legal documents | 13,000+ expert-annotated examples across 41 clause types, sourced from real commercial contracts | Clause identification, named entity recognition (NER), binary classification | Focused on practical contract review tasks; emphasized precision in extraction under legal ambiguity; widely used in contract AI |
| CaseHOLD [334] | Case law judgment understanding via conclusion prediction | 53,000+ U.S. appellate court case summaries with multiple-choice legal holdings | Multiple-choice question answering; outcome selection from candidates | Tests nuanced legal entailment and fact-to-holding inference; served as early benchmark for transformer-based legal models |
| EUR-Lex [335] | Multilabel legal topic classification for EU directives and regulations | 55,000+ European legal documents tagged with 3,956 EuroVoc labels | Multilabel text classification | One of the earliest and most cited legal NLP datasets; highly imbalanced and hierarchical label space inspired development of label-aware classifiers |
| SCOTUS [336] | Supreme Court decision classification and ideological alignment analysis | U.S. Supreme Court opinions, annotated with justice ideology, vote splits, and case topics | Binary and multiclass classification, ideological trend analysis | Used in political science and legal prediction; supported early quantitative legal studies using machine learning |
| COLIEE [337] | Legal information retrieval and entailment challenge (competition format) | Multiple years of formal tasks including Japanese Bar Exam questions and Canadian legal cases/statutes | IR, entailment classification, statute retrieval, legal QA | Serves as international benchmark challenge; evaluated both retrieval and inference under strict logic constraints |
3.5.7. Discussion
- Retrieval-Augmented and Fact-Verified Generation. Integrating retrieval-augmented generation (RAG) systems that anchor responses in verifiable legal texts can reduce hallucination and improve accuracy [309].
- Jurisdiction-Aware and Temporal Modeling. Developing models sensitive to jurisdictional differences and evolving case law—such as time-aware frameworks like PILOT—can enhance the contextual reliability of legal predictions [328].
- Ethical and Regulatory Frameworks. Establishing governance standards for the deployment of LLMs in legal contexts—including audit trails, liability attribution, and responsible AI usage—will be essential to mitigate misuse and legal uncertainty [325].
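The retrieval step behind such RAG pipelines can be sketched in a few lines. The snippet below is a minimal illustration, not a production system: the corpus snippets are paraphrased placeholders rather than authoritative statute text, bag-of-words cosine similarity stands in for a dense retriever, and the final LLM call is left as a prompt string.

```python
from collections import Counter
import math

# Toy corpus of legal passages; a real system would index full statutes and case law.
CORPUS = {
    "ucc-2-201": "A contract for the sale of goods for five hundred dollars or more "
                 "is not enforceable unless there is a writing signed by the party.",
    "ucc-2-202": "Terms in a confirmatory memorandum may not be contradicted by "
                 "evidence of any prior agreement.",  # parol evidence rule
    "frcp-26":   "Parties may obtain discovery regarding any nonprivileged matter "
                 "relevant to a claim or defense.",
}

def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]

def score(query, passage):
    """Cosine similarity over bag-of-words counts (stand-in for a dense retriever)."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    dot = sum(q[w] * p[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in p.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Return the k passages most similar to the query."""
    ranked = sorted(CORPUS.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return ranked[:k]

def build_grounded_prompt(question):
    """Anchor the LLM's answer in retrieved passages, with citable passage ids."""
    context = "\n".join(f"[{pid}] {text}" for pid, text in retrieve(question))
    return (f"Answer using ONLY the passages below; cite passage ids.\n"
            f"{context}\n\nQuestion: {question}")

prompt = build_grounded_prompt("Can a prior oral agreement contradict a written memorandum?")
print(prompt)
```

Requiring the model to cite passage ids is what makes the output auditable: a hallucinated claim has no supporting id to point to.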
4. LLMs for Economics and Business
4.1. Finance
4.1.1. Overview

| Field | Subfield | Key Insights and Contributions | Examples | Citations |
|---|---|---|---|---|
| Trading and Investments | Quantitative Trading | LLM-based trading agents demonstrate enhanced interpretability, adaptability, and profitability in simulating and executing financial strategies across diverse market conditions | FinCon [364]: Multi-agent LLM system improves trading via verbal risk reinforcement. FinMem [365]: Layered memory and character design enhance LLM-based trading decisions. | [364,365,366,367,368,369,370,371,372,373,374,375] |
| | Portfolio Management | LLMs enhance financial decision-making by enabling adaptive, explainable, and multi-modal portfolio strategies through agent collaboration, sentiment reasoning, and dynamic alpha mining | Ko & Lee [376]: ChatGPT enhances portfolio diversification and asset selection across classes. Kou et al. [377]: LLM-driven agents mine multimodal alphas, dynamically adapt trading strategies. | [376,377,378,379,380,381,382,383] |
| Corporate Finance | Financial Report Analysis | LLMs enhance financial report analysis by enabling accurate, explainable, and scalable extraction and generation through multimodal processing, domain-specific fine-tuning, and tool-augmented reasoning | XBRL-Agent [384]: LLM agent analyzes XBRL reports using retriever and calculator tools. | [384,385,386] |
| Financial Markets | Stock Movement Prediction | LLMs can effectively predict and explain stock movements by extracting sentiment, factors, and insights from financial text through self-reflection, instruction tuning, and domain-specific prompting | Koa et al. [387]: Self-reflective LLM explains stock predictions via reinforcement learning framework. Ni et al. [388]: QLoRA-fine-tuned LLM predicts stocks using rich earnings data. | [387,388,389,390,391,392] |
| Financial Intermediation and Risk Management | Fraud Detection | LLMs significantly enhance financial fraud detection by enabling accurate, scalable, and robust identification of anomalies and manipulations through prompt engineering, hybrid modeling, and adversarial benchmarking | Fraud-R1 [393]: Benchmark tests LLM defenses against multi-round fraud and phishing. RiskLabs [394]: LLM fuses multi-source data to forecast volatility and financial risk. | [393,394,395,396,397] |
| | Credit Scoring | LLMs enhance credit risk assessment by improving prediction accuracy, generalization, and explainability through hybrid modeling, text integration, and domain-specific fine-tuning | CALM [398]: LLM scores credit risk across tasks with fairness checks. LGP [399]: Prompted LLMs use Bayesian logic to generate insightful risk reports. | [398,399,400,401] |
| Sustainable Finance | ESG Scoring | Leverage LLMs for classification, rule learning, data extraction, greenwashing detection, readability assessment, and multi-lingual understanding, thereby enhancing transparency and decision-making in sustainable finance. | ESGReveal [402]: Leverage LLMs and RAG to systematically extract and analyze ESG data from corporate reports. | [170,402,403,404,405,406,407,408,409,410] |
| Financial Technology | Financial Question Answering | LLMs enhance financial QA through accurate, context-aware reasoning over complex, multi-source financial data. | TAT-QA [411]: a financial QA benchmark with over 16,000 questions built from real-world financial reports, combining tabular and textual data. | [411,412,413,414,415,416,417] |
| | Knowledge Graph Construction | LLMs enable automated KG construction from financial data, retrieval over KGs, and multi-document financial QA. | FinKG [418]: A curated core financial knowledge graph built from authoritative sources such as corporate reports and stock data, structured to enable systematic analysis and applications in financial forecasting, risk assessment, and decision-making through semantically rich relationships between entities. | [417,418,419,420,421,422] |
| | Robo-advisory | LLMs enhance robo-advisors for novice investors but still lag behind humans in performance and trust. | Jung et al. [423]: An early work proposing the concept of "robo-advisory," using AI to provide automated financial advisory services to a broader range of investors. | [404,423,424,425,426,427] |
- Trading and Investments. Trading pursues short-term gains while investment emphasizes long-term value through diversification and analysis. Traditional methods struggle with large-scale, complex data, whereas LLMs offer new capabilities for processing unstructured information, enhancing forecasting, and supporting strategies in quantitative trading and portfolio management.
- Corporate Finance. Corporate finance manages funding, capital structure, and investment to drive growth. Conventional approaches like financial modeling and discounted cash flow analysis are labor-intensive and limited under fast-changing conditions. LLMs streamline tasks such as financial report analysis, improving efficiency and accuracy in strategic decision-making.
- Financial Markets. Financial markets allocate resources and manage risk through the trading of assets. While econometric models and machine learning aid analysis, they face challenges with today’s data scale and complexity. LLMs advance this field by processing unstructured information and enabling applications such as stock movement prediction.
- Financial Intermediation and Risk Management. Banks and insurers channel capital while managing risks, but traditional statistical models and manual processes lag in dynamic environments. LLMs improve performance by analyzing diverse datasets, with emerging applications in fraud detection and credit scoring.
- Sustainable Finance. Sustainable finance incorporates ESG factors into investment decisions. Standard scoring systems often overlook rich unstructured data from reports and media. LLMs can extract and synthesize such information, offering more context-aware and adaptive ESG insights.
- Financial Technology. FinTech reshapes financial services through innovations like digital banking, blockchain, and robo-advisory. Traditional solutions emphasize automation but lack flexibility. LLMs expand FinTech by powering financial question answering, knowledge graph construction, and conversational advisory, enhancing personalization and accessibility.
4.1.2. Trading and Investment
4.1.3. Corporate Finance
4.1.4. Financial Market Analysis
4.1.5. Financial Intermediation and Risk Management
4.1.6. Sustainable Finance
4.1.7. Financial Technology
4.1.8. Benchmarks
| Benchmark | Language | Size | Feature | Insights on LLMs |
|---|---|---|---|---|
| FinBen | English | 36 datasets | Broadest task range; includes forecasting and agent evaluations | LLMs perform best on IE/textual analysis, poorly on forecasting and reasoning-heavy tasks |
| R-Judge | English | 569 records | Multi-turn safety judgment for agents in real scenarios | LLMs lack behavioral safety judgment in interactive settings; fine-tuning helps significantly |
| FinEval | Chinese | 8,351 Qs | Covers academic, industry, security, and agent reasoning tasks | LLMs outperform average individuals but lag behind experts; complex reasoning and tool usage still weak |
| CFinBench | Chinese | 99,100 Qs | Career-aligned categories; diverse questions and rigorous filtering | Highlights knowledge gaps; current LLMs struggle with practical depth and legal reasoning |
| UCFE | English, Chinese | 330 data points | User-role simulation; dynamic multi-turn tasks | Human-like evaluations show LLMs align with users but fall short under dynamic, evolving needs |
| Hirano | Japanese | — | Domain-specific benchmark in Japanese | Domain-specific LLMs still underdeveloped in Japanese finance |
- Challenges in complex financial tasks: Current LLMs still struggle with tasks that require deep domain knowledge, logical reasoning, and multi-step decision-making.
- Effectiveness of domain-specific fine-tuning: Fine-tuning LLMs on domain-specific corpora continues to yield notable performance gains, demonstrating its importance in enhancing model specialization.
- Benchmark coverage vs. real-world applicability: While these benchmarks effectively assess LLMs’ comprehensive capabilities in finance, they are primarily diagnostic and not tailored to specific application scenarios. Practical use cases often require the design of dedicated, task-specific benchmarks.
- Need for broader evaluation dimensions: Additional attention should be given to other meaningful evaluation perspectives, such as user alignment (e.g., UCFE) and risk awareness (e.g., R-Judge), which are crucial for safe and effective real-world deployment.
- General Financial Data: Provides access to real-time and historical stock prices, fundamental financial indicators, and corporate financial statements. Such data are critical for simulating trading environments, developing investment strategies, conducting market forecasts, and evaluating algorithmic trading agents.
- Cryptocurrency Data: Offers market prices, trading volumes, and metadata for cryptocurrencies. These datasets are particularly useful for research on crypto trading strategies, market microstructure analysis, and portfolio optimization involving digital assets.
- Regulatory Filings: Includes official company disclosures, such as quarterly and annual reports (10-Q, 10-K), and other significant events (8-K filings). Regulatory filings are essential for fundamental analysis, event-driven trading, and financial sentiment extraction.
- Analyst Reports: Consists of investment opinions, earnings forecasts, and qualitative assessments from financial analysts. These resources are valuable for sentiment analysis, opinion aggregation, and modeling the impact of market expectations on asset prices.
- News Data: Covers financial news, press releases, and market commentary from a variety of media outlets. News data is critical for developing event-driven trading strategies, market volatility prediction, and detecting sentiment shifts in real-time.
- Social Media Data: Comprises user-generated content from platforms such as Twitter (X) and Reddit. Social media data enables the study of retail investor sentiment, information diffusion, and the dynamics of attention-driven market movements.
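As a minimal illustration of how the first category is typically consumed, the sketch below computes daily log returns and annualized volatility from a short, hypothetical closing-price series; this is the kind of preprocessing that precedes trading simulation or forecasting, not an analysis of real market data.

```python
import math
from statistics import stdev

# Hypothetical daily closing prices, as would come from a market-data feed.
closes = [101.2, 102.0, 101.5, 103.1, 104.0, 103.2, 105.0, 104.4]

# Daily log returns: r_t = ln(P_t / P_{t-1})
returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]

# Annualized volatility from daily returns (~252 trading days per year).
annualized_vol = stdev(returns) * math.sqrt(252)

print(f"mean daily return: {sum(returns) / len(returns):+.4%}")
print(f"annualized volatility: {annualized_vol:.2%}")
```

Log returns are used rather than simple returns because they add across time, which simplifies multi-period aggregation.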
4.1.9. Discussion
- Real-Time and Adaptive Learning. Future models must be capable of online learning and rapid adaptation to changing market conditions, with architectures that support dynamic knowledge updates and feedback-driven improvement [373].
- Ethics, Fairness, and Regulatory Compliance. Research must prioritize fairness audits, bias mitigation strategies, and interpretability mechanisms to ensure that LLM-based systems meet ethical and legal standards in finance [398].
4.2. Economics
4.2.1. Overview

| Field | Key Insights and Contributions | Examples | Citations |
|---|---|---|---|
| Behavioral and Experimental Economics | LLMs can simulate human economic behavior by exhibiting rationality, personality traits, and behavioral biases, offering scalable tools for social science experimentation and economic modeling | Ross et al. [492]: Utility theory reveals LLMs’ behavioral biases across economic decision settings. Horton [493]: LLMs simulate economic agents, replicating human decisions in experiments. | [491,492,493,494] |
| Macroeconomic Simulation and Agent-Based Modeling | LLMs enable realistic, interpretable, and heterogeneous agent-based economic simulations by modeling complex decision-making, memory, perception, and policy responses, offering new tools for macroeconomic analysis and public policy evaluation | MLAB [495]: Multi-LLM agents simulate diverse economic responses for policy analysis. EconAgent [496]: LLM agents model macroeconomics with perception, memory, and decision modules. | [495,496,497] |
| Strategic and Game-Theoretic Interactions | LLMs enable robust, adaptive, and strategically nuanced economic decision-making simulations through dynamic multi-agent environments and standardized benchmarks | GLEE [489]: Economic game benchmarks evaluating LLM fairness, efficiency, and communication. Guo et al. [498]: LLM agents compete in dynamic games testing rationality and strategy. | [489,491,498] |
| Economic Reasoning and Knowledge Representation | LLMs show emerging economic reasoning capabilities through benchmarks and frameworks assessing causal, sequential, and logical inference | EconLogicQA [499]: Tests LLMs’ ability to sequence economic events logically, contextually. EconNLI [490]: Evaluates LLMs’ causal reasoning using premise-hypothesis economic event pairs. | [490,499,500] |
- Behavioral and Experimental Economics. This field studies how real people make decisions, often deviating from the rational “homo economicus” model. Experiments with games like the dictator, ultimatum, and trust games reveal biases such as fairness concerns and the endowment effect. LLMs complement these methods by simulating diverse decision behaviors and allowing rapid pre-testing of economic experiments.
- Macroeconomic Simulation and Agent-Based Modeling. ABMs simulate how individual agents interact to shape aggregate outcomes like inflation or unemployment. Unlike equilibrium-based models, they capture dynamic, bottom-up processes but often lack realistic human behavior. LLMs enrich ABMs by powering adaptive, communicative agents, bringing greater realism and flexibility to macroeconomic simulations.
- Strategic and Game-Theoretic Interactions. Game theory examines how outcomes depend on the choices of multiple agents, requiring competition, cooperation, and anticipation. Traditional approaches rely on simplified assumptions, limiting realism. LLMs enable agents with recursive reasoning and natural language interaction, offering richer simulations of strategic scenarios.
- Economic Reasoning and Knowledge Representation. Economic reasoning analyzes trade-offs under scarcity, while knowledge representation encodes concepts for computational use. Rule-based methods struggle with complexity and scalability. LLMs simulate reasoning in natural language and generalize across contexts, though they remain sensitive to prompt design and prone to oversimplification.
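The experimental setup behind such studies can be sketched with the ultimatum game mentioned above. In the stylized version below, both decision functions are stubs standing in for LLM calls (a real experiment would prompt a model for each offer and response); the fairness threshold is an illustrative assumption encoding the rejection of "unfair" low offers.

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def proposer_offer():
    """Stub for an LLM proposer; real studies prompt a model for an offer."""
    return random.choice([10, 20, 30, 40, 50])  # percent of the pie offered

def responder_accepts(offer_pct, fairness_threshold=25):
    """Stub responder with a fairness concern: low offers are rejected,
    although a purely rational agent would accept any positive amount."""
    return offer_pct >= fairness_threshold

def play_round(pie=100):
    offer = proposer_offer()
    if responder_accepts(offer):
        return pie - offer, offer   # (proposer payoff, responder payoff)
    return 0, 0                     # rejection destroys the whole pie

results = [play_round() for _ in range(1000)]
rejection_rate = sum(1 for p, r in results if p == r == 0) / len(results)
print(f"rejection rate: {rejection_rate:.1%}")
```

Replacing the stubs with model calls, and varying the prompt (persona, stakes, framing), is exactly the kind of rapid pre-testing of economic experiments the section describes.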
4.2.2. Behavioral and Experimental Economics
4.2.3. Macroeconomic Simulation and Agent-Based Modeling
4.2.4. Strategic and Game-Theoretic Interactions
4.2.5. Economic Reasoning and Knowledge Representation
4.2.6. Benchmarks
4.2.7. Discussion
- Structured Prompting and Experimental Design. Developing standardized, transparent prompting methodologies—analogous to experimental protocols—will be essential for ensuring replicability and robustness in LLM-driven economic experiments [492].
- Ethical Evaluation and External Validation. Establishing benchmarks, guidelines, and ethical frameworks for using LLMs as economic agents—alongside systematic validation against real human data—will be crucial for credible scientific practice.
4.3. Accounting
4.3.1. Overview

- Auditing. Traditionally reliant on manual, sample-based checks, auditing struggles with rising data volumes and fraud complexity. LLMs can automate text analysis, flag anomalies, and expand audit coverage, enabling smarter AI-assisted audits while underscoring the need for transparency and safeguards.
- Financial and Managerial Accounting. Both functions are central to decision-making but increasingly burdened by complex disclosures and fragmented systems. LLMs help extract insights, streamline reporting, and convert unstructured data into actionable analysis, strengthening transparency, accuracy, and strategic value.
- Taxation. Taxation involves intricate laws and resource-constrained enforcement, with traditional systems often missing nuances in legal texts. LLMs can interpret tax codes, analyze unstructured filings, and support compliance and enforcement, offering new efficiency while raising questions of trust and adaptability.
| Field | Key Insights and Contributions | Examples | Citations |
|---|---|---|---|
| Auditing | LLMs enhance auditing by improving accuracy, efficiency, and real-time verification through automation, co-piloting, and compliance checks, while raising implementation and ethical challenges | Gu et al. [538]: Co-piloted auditing combines LLMs and humans for efficient audits. Berger et al. [539]: LLMs assess financial compliance, outperforming peers in regulatory audits. | [538,539,540,541,542,543,544,545,546] |
| Financial and Managerial Accounting | LLMs enhance accounting, reporting, and sustainability practices by automating complex text analysis, improving information processing, enabling transparency, and supporting ethical, informed decision-making across financial and ESG domains | De Villiers et al. [547]: AI reshapes sustainability reporting, raising greenwashing risks and governance questions. Föhr et al. [548]: LLMs audit sustainability reports using taxonomy-aligned prompt frameworks efficiently. | [547,548,549,550,551,552,553,554] |
| Taxation | LLMs enhance tax research, compliance, and enforcement by enabling nuanced analysis, firm-level measurement, domain-specific modeling, and improved reasoning through agent collaboration | PLAT [555]: PLAT tests LLMs’ tax reasoning under ambiguous penalty exemption scenarios. Alarie et al. [556]: LLMs assist tax research, but hallucinations limit reliable adoption. | [555,556,557,558,559] |
4.3.2. Auditing
4.3.3. Financial and Managerial Accounting
4.3.4. Taxation
4.3.5. Benchmarks
4.3.6. Discussion
- Hybrid Human-AI Systems. Emphasizing "co-piloted" models where human expertise remains central will ensure that LLMs complement rather than replace professional judgment, particularly in high-risk domains like auditing and taxation [538].
4.4. Marketing
4.4.1. Overview

| Field | Key Insights and Contributions | Examples | Citations |
|---|---|---|---|
| Consumer Insights and Behavior Analysis | LLMs enable scalable personalization, behavioral insight extraction, synthetic data generation, and segmentation, though challenges remain in faithfully replicating human preferences and ensuring ethical, accurate application | Li et al. [600]: LLMs embed surveys, cluster consumers, simulate chatbots for marketing. Goli & Singh [601]: LLMs mimic preferences poorly; chain-of-thought aids segmentation hypotheses. | [600,601,602,603,604,605,606] |
| Content Creation and Campaign Design | LLMs enable scalable, cost-effective, and platform-adapted content generation, enhancing engagement and campaign assessment while requiring human oversight for quality, ethics, and strategic alignment | Kasuga & Yonetani [607]: CXSimulator simulates campaign effects using LLM embeddings and behavior graphs. Wahid et al. [608]: Generative AI reshapes content marketing, raising engagement and ethical concerns. | [607,608,609,610,611,612,613] |
| Market Intelligence and Trend Analysis | LLMs enable scalable content creation, personalized engagement, data-driven decision-making, trend prediction, and rapid research replication, while offering efficiency gains and novel insights across digital channels | Yeykelis et al. [614]: LLM personas replicate media experiments, accelerating marketing research validation. Saputra et al. [615]: ChatGPT improves Instagram marketing using AIDA model for engagement. | [602,614,615,616,617] |
- Consumer Insights and Behavior Analysis. Consumer Insights and Behavior Analysis focuses on understanding the motivations behind consumer thoughts, feelings, and actions to inform effective marketing strategies. While traditional methods like surveys and interviews offer value, they often struggle with the scale and nuance of modern, unstructured data sources. LLMs are transforming this field by enabling scalable, nuanced analysis of language-rich data, offering deeper, real-time insights into consumer behavior.
- Content Creation and Campaign Design. Content Creation and Campaign Design are key pillars of marketing, combining creative storytelling with strategic planning to engage audiences and achieve business goals. Traditionally reliant on manual effort and intuition, this process has faced challenges in scalability, personalization, and real-time feedback. Today, LLMs are transforming how content is ideated, produced, and optimized, enhancing creativity, streamlining workflows, and enabling more dynamic, data-driven campaigns.
- Market Intelligence and Trend Analysis. Market Intelligence and Trend Analysis are essential for guiding strategic marketing decisions, helping businesses monitor competitors, anticipate consumer shifts, and navigate evolving market conditions. Traditionally grounded in surveys and expert insights, these methods often lag behind today’s fast-paced digital environment. LLMs are revolutionizing this space by enabling real-time analysis of vast, unstructured data sources, offering marketers faster, deeper, and more forward-looking insights to stay competitive and adaptive in a rapidly changing landscape.
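The embed-then-cluster segmentation workflow referenced above (e.g., Li et al. [600]) can be sketched with a few lines of numpy. This is a toy illustration: the two synthetic 2-D point clouds stand in for high-dimensional LLM embeddings of open-ended survey responses, and the segment labels in the comments are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 2-D "survey embeddings"; in practice these would be LLM text
# embeddings of open-ended consumer responses (much higher-dimensional).
segment_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))  # e.g. price-sensitive
segment_b = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))  # e.g. brand-loyal
X = np.vstack([segment_a, segment_b])

def kmeans(X, k, iters=20):
    """Plain k-means: assign each point to its nearest center, recompute means."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, k)
        labels = np.argmin(dists, axis=1)
        # Keep a center unchanged if its cluster happens to be empty.
        centers = np.stack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(X, k=2)
print("cluster sizes:", np.bincount(labels))
```

The recovered centers can then be interpreted by feeding each cluster's representative responses back to an LLM for a natural-language segment description.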
4.4.2. Consumer Insights and Behavior Analysis
4.4.3. Content Creation and Campaign Design
4.4.4. Market Intelligence and Trend Analysis
4.4.5. Benchmarks
4.4.6. Discussion
- Hybrid Approaches Combining LLMs with Human Oversight. Structuring workflows where human judgment complements AI output will help mitigate risks of error, bias, and unethical deployment.
- Domain-Specific Fine-Tuning and Contextualization. Fine-tuning LLMs on marketing-specific corpora and continuously updating them with domain-relevant data can improve accuracy, nuance, and relevance in marketing applications [617].
- Benchmarking and Standardized Evaluation. Creating rigorous benchmarks for LLM performance in marketing tasks, such as synthetic persona realism, content personalization efficacy, and market trend prediction accuracy, will be vital for advancing scientific rigor.
- Explainability and Interpretability. Incorporating methods such as chain-of-thought prompting, citation grounding, and attribution tracing will enhance transparency and user trust in LLM-assisted marketing insights [602].
- Ethical Guidelines and Best Practices. Marketing researchers must proactively develop frameworks for ethical AI use, covering disclosure norms for AI-generated content, fairness in segmentation, and protection against manipulative targeting.
5. LLMs for Science and Engineering
5.1. Mathematics
5.1.1. Overview

5.1.2. Mathematical Proof Assistant
- Advancing the hexagon packing problem by finding better arrangements of 11 and 12 hexagons inside a larger hexagon, surpassing the best human-found solutions after 16 years of stagnation.
- Making progress on the kissing number problem, a mathematical challenge unsolved for over 300 years [683].
| Method | Model Size | Sample Budget | Accuracy |
|---|---|---|---|
| Traditional Proof Assistants | | | |
| Curriculum Learning [685] | 837M | — | 34.5% |
| | 837M | — | 36.6% |
| Proof Artifact Co-Training [658] | 837M | — | 24.6% |
| | 837M | — | 29.2% |
| Hypertree Proof Search [660] | 600M | — | 41.0% |
| LLM-based Methods | | | |
| DeepSeekMath [686] | 7B | 128 | 27.5% |
| | 7B | Cumulative | 52.0% |
| | 7B | Greedy | 30.0% |
| DeepSeek-Prover [687] | 7B | 64 | 46.3% |
| | 7B | 128 | 46.3% |
| | 7B | 8,192 | 48.8% |
| | 7B | 65,536 | 50.0% |
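For context, systems such as DeepSeek-Prover operate on formal statements in proof assistants like Lean 4: given a theorem statement, the model must produce a proof term or tactic script that the kernel accepts. A minimal sketch of what such a statement-proof pair looks like (illustrative, not drawn from the benchmark itself):

```lean
-- A theorem statement and a one-line term proof, reusing a library lemma.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A tactic-style induction proof, closer in flavor to benchmark problems:
-- the succ case rewrites 0 + (k + 1) to (0 + k) + 1 and applies the
-- induction hypothesis.
example (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih => simp [Nat.add_succ, ih]
```

The sample budget in the table above counts how many candidate proofs the model may generate per theorem; the proof checker then verifies whether any candidate is valid.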
5.1.3. Theoretical Exploration and Pattern Recognition
5.1.4. Mathematical Education
5.1.5. Benchmarks
5.1.6. Discussion
- Enhancing Mathematical Reasoning. Developing novel architectures and training recipes that enable LLMs to move beyond pattern matching toward genuine mathematical reasoning. This could involve incorporating symbolic computation capabilities or training on datasets designed to specifically test and improve reasoning skills.
- Improving Reliability and Accuracy. Investigating methods to reduce hallucinations and errors in LLM outputs for mathematical tasks. This could involve techniques like self-verification, the use of external validators like theorem provers, or reinforcement learning from human feedback focused on accuracy.
- Effective Educational Tools. There is a need to design and evaluate LLM-empowered tools that effectively support mathematics learning without compromising conceptual understanding or fundamental skill development. This includes exploring the use of LLMs in personalized tutoring, generating diverse explanations, and creating engaging problem-solving activities.
- Integration Strategies Evaluation. Exploring optimal strategies for incorporating LLMs into conventional educational practice, building hybrid learning environments that capitalize on the strengths of both, is crucial. It is equally important to understand how to train educators to use LLMs effectively as pedagogical tools.
5.2. Physics and Mechanical Engineering
5.2.1. Overview

- Complexity of Multiphysics Coupling and Governing Equations. Physical and mechanical systems are often governed by a series of highly coupled partial differential equations (PDEs), involving nonlinear dynamics, continuum mechanics, thermodynamics, electromagnetism, and quantum interactions [734,743]. Solving such systems requires professional numerical solvers, high-fidelity discretization techniques, and physics-informed modeling assumptions. Although LLMs can retrieve relevant equations or suggest approximate forms, they are incapable of deriving physical laws, ensuring conservation principles, or performing accurate numerical simulations.
- Simulation Accuracy and Model Calibration. Accurate mechanical design and physical predictions typically rely on high-fidelity simulations such as finite element analysis (FEA), computational fluid dynamics (CFD), or multiphysics modeling [744,745]. These simulations demand precise geometry input, boundary conditions, material models, and experimental validation. LLMs may assist in interpreting simulation reports or proposing modeling strategies, but they lack the resolution, numerical rigor, and feedback integration necessary to execute or validate such models.
- Experimental Prototyping and Hardware Integration. Engineering innovations ultimately require validation through physical experiments—building prototypes, tuning actuators, installing sensors, and measuring performance under dynamic conditions [746,747]. These tasks depend on laboratory facilities, fabrication tools, and hands-on experimentation, all of which are beyond the operational scope of LLMs. While LLMs can help generate test plans or documentation, they cannot replace real-world testing or iterative hardware development.
- Materials and Manufacturing Constraints. Real-world engineering designs must account for constraints such as thermal stress, fatigue life, manufacturability, and cost-efficiency [748]. Addressing these challenges often relies on materials testing, manufacturing standards, and domain experience in processes like welding, casting, and additive manufacturing. LLMs lack access to real-time physical data and material behavior, and thus cannot support tradeoff decisions in design or production.
- Ethical, Safety, and Regulatory Considerations. From biomedical devices to autonomous systems, mechanical engineers must weigh ethical impacts, user safety, and legal compliance [749]. Although LLMs can summarize policies or regulatory codes, they are not equipped to make decisions involving responsibility, risk evaluation, or normative judgment—elements essential for deploying certified, real-world systems.
Although current LLMs remain limited in core tasks such as physical modeling and experimental validation, they have shown growing potential in assisting a variety of supporting tasks in physics and mechanical engineering, particularly in knowledge integration, document drafting, design ideation, and educational support:
- Literature Review and Standards Lookup. Both disciplines rely heavily on technical documents such as material handbooks, design standards, experimental protocols, and scientific publications. LLMs can significantly accelerate the literature review process by extracting key information about theoretical models, experimental conditions, or engineering parameters. For instance, an engineer could use an LLM to compare different welding codes, retrieve thermal fatigue limits of materials, or summarize applications of a specific mechanical model [750,751].
- Assisting with Simulation and Test Report Interpretation. In simulations such as finite element analysis (FEA), computational fluid dynamics (CFD), or structural testing, LLMs can help parse simulation logs, identify setup issues, or generate summaries of experimental findings. When integrated with domain-specific tools, LLMs may even assist in generating simulation input files, interpreting outliers in results, or recommending appropriate post-processing techniques [752,753].
- Supporting Conceptual Design and Parametric Exploration. During early-stage mechanical design or material selection, LLMs can suggest structural concepts, propose parameter combinations, or retrieve examples of similar engineering cases. For instance, given a prompt like “design a spring for high-temperature fatigue conditions,” the model might generate candidate materials, geometric options, and common failure modes [754,755].
- Engineering Education and Learning Support. Education in physics and mechanical engineering involves both theoretical understanding and hands-on application. LLMs can generate step-by-step derivations, support simulation-based exercises, or simulate simple lab setups (e.g., free fall, heat conduction, beam deflection). They can also assist with terminology explanation or provide example problems to enhance interactive and self-guided learning [756,757].
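As an illustration of the last point, the short script below works through a classic closed-form check of the kind an LLM-assisted exercise might generate: the mid-span deflection of a simply supported beam under a central point load, using the standard formulas I = b·h³/12 and δ_max = P·L³/(48·E·I). The beam dimensions, load, and modulus are illustrative numbers, not data from the surveyed works.

```python
# Mid-span deflection of a simply supported beam under a central point load.
# Standard formulas: I = b*h^3/12 (rectangular section), delta = P*L^3/(48*E*I).

def rect_second_moment(b_m, h_m):
    """Second moment of area of a rectangular cross-section (m^4)."""
    return b_m * h_m**3 / 12

def max_deflection(P_N, L_m, E_Pa, I_m4):
    """Maximum (mid-span) deflection of a simply supported beam, central load (m)."""
    return P_N * L_m**3 / (48 * E_Pa * I_m4)

# Illustrative case: 2 m steel beam (E ~ 200 GPa), 50 x 100 mm section, 5 kN load.
I = rect_second_moment(0.05, 0.10)
delta = max_deflection(5_000, 2.0, 200e9, I)
print(f"I = {I:.3e} m^4, max deflection = {delta * 1000:.3f} mm")
```

Keeping units explicit in the function signatures (N, m, Pa) is exactly the kind of step-by-step scaffolding such tutoring tools aim to reinforce.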

- Textual and Documentation-Centric Tasks. LLMs are particularly effective in processing technical documents, engineering standards, lab reports, and scientific literature. For instance, Polverini and Gregorcic demonstrated how LLMs can support physics education by extracting and explaining key information from conceptual texts [758], while Harle et al. highlighted their use in organizing and generating instructional materials for engineering curricula [759].
- Design Ideation and Parametric Drafting Tasks. In early-stage design and manufacturing workflows, LLMs can transform natural language prompts into CAD sketches, material recommendations, and parameter ranges. The MIT GenAI group systematically evaluated the capabilities of LLMs across the entire design-manufacture pipeline [760], and Wu et al. introduced CadVLM, a multimodal model that translates linguistic input into parametric CAD sketches [755].
- Simulation-Support and Modeling Interface Tasks. Although LLMs cannot replace high-fidelity physical simulation, they can assist in generating model input files, translating specifications into solver-ready formats, and summarizing results. Ali-Dib and Menou explored the reasoning capacity of LLMs in physics modeling tasks [761], while Raissi et al.’s PINN framework demonstrated how language-driven architectures can help solve nonlinear partial differential equations by encoding physics into neural representations [762].
- Experimental Interpretation and Multimodal Lab Tasks. In experimental workflows, LLMs can support data summarization, anomaly detection, and textual explanation of multimodal results. Latif et al. proposed PhysicsAssistant, an LLM-powered robotic learning system capable of interpreting physics lab scenarios and offering real-time feedback to students and instructors [763].
- STEM Learning and Interactive Reasoning Tasks. LLMs are increasingly integrated into educational settings to guide derivations, answer conceptual questions, and simulate physical systems. Jiang and Jiang introduced a tutoring system that enhanced high school students’ understanding of complex physics concepts using LLMs [756], while Polverini’s work further confirmed the model’s utility in supporting structured, interactive learning [758].
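The prompt-to-parametric-design step can be illustrated with a deliberately simple sketch: a hypothetical `spring_to_scad` helper that fills an OpenSCAD-style template from spring parameters. Real systems such as CadVLM generate full sketch programs rather than a fixed template; this only shows the shape of the language-to-CAD-script mapping:

```python
def spring_to_scad(wire_d, coil_d, n_turns, pitch):
    """Render a helical spring as an OpenSCAD-style script string.

    Toy stand-in for the language-to-parametric-CAD step: parameters a
    model might extract from a prompt like "design a spring ..." are
    turned into a geometry script a CAD kernel could execute.
    """
    return (
        f"// spring: wire {wire_d} mm, coil {coil_d} mm, "
        f"{n_turns} turns, pitch {pitch} mm\n"
        f"linear_extrude(height={n_turns * pitch}, twist={360 * n_turns})\n"
        f"  translate([{coil_d / 2}, 0]) circle(d={wire_d});"
    )

print(spring_to_scad(wire_d=2, coil_d=20, n_turns=8, pitch=4))
```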
5.2.2. Textual and Documentation-Centric Tasks
5.2.3. Design Ideation and Parametric Drafting Tasks
5.2.4. Simulation-support and Modeling Interface Tasks
5.2.5. Experimental Interpretation and Multimodal Lab Tasks
5.2.6. STEM Learning and Interactive Reasoning Tasks
| Type of Task | Benchmarks | Introduction |
|---|---|---|
| CAD and Geometric Modeling | ABC Dataset [773] DeepCAD [774] Fusion 360 Gallery [775] CADBench [776] | The ABC Dataset, DeepCAD, and Fusion 360 Gallery together provide a comprehensive foundation for studying geometry-aware language and generative models. While ABC emphasizes clean, B-Rep-based CAD structures suitable for geometric deep learning, DeepCAD introduces parameterized sketches tailored for inverse modeling tasks. Fusion 360 Gallery complements these with real-world user-generated modeling histories, enabling research on sequential CAD reasoning and practical design workflows. CADBench further supports instruction-to-script evaluation by providing synthetic and real-world prompts paired with CAD programs. It serves as a high-resolution benchmark for measuring attribute accuracy, spatial correctness, and syntactic validity in code-based CAD generation. |
| Finite Element Analysis (FEA) | FEABench [? ] | FEABench is a purpose-built benchmark that targets the simulation domain, offering structured prompts and tasks for evaluating LLM performance in generating and understanding FEA input files. It serves as a critical testbed for bridging the gap between symbolic physical language and numerical simulation. |
| CFD and Fluid Simulation | OpenFOAM Cases [777] | The OpenFOAM example case library provides a curated set of fluid dynamics simulation setups, widely used for training models to understand solver configuration, mesh generation, and boundary condition specifications in CFD contexts. |
| Material Property Retrieval | MatWeb [778] | MatWeb is a widely-used material database containing thermomechanical and electrical properties of thousands of substances. It plays an essential role in supporting downstream simulation tasks such as material selection, constitutive modeling, and multi-physics simulation setup. |
| Physics Modeling and PDE Learning | PDEBench [779] PHYBench [780] | PDEBench and PHYBench collectively advance the evaluation of LLMs in physical reasoning and numerical modeling. PDEBench focuses on classical PDEs like heat transfer, diffusion, and fluid flow in the context of scientific machine learning, while PHYBench introduces a broader spectrum of perception and reasoning tasks grounded in physical principles. Together, they support benchmarking across symbolic reasoning, equation prediction, and simulation-aware generation. |
| Fault Diagnosis and Health Monitoring | NASA C-MAPSS [781] | NASA C-MAPSS provides real-world time-series degradation data from turbofan engines, serving as a benchmark for predictive maintenance, anomaly detection, and reliability modeling in aerospace and mechanical systems. |
5.2.7. Benchmarks
5.2.8. Discussion
- Simulation-Augmented Dataset Generation. Integrating LLMs with numerical solvers in a simulator-in-the-loop framework allows the generation of language-input–simulation-output triplets at scale. This enables supervised training, fine-tuning, and RLHF strategies grounded in physically valid feedback.
- Task Decomposition and Geometric Reformulation. Decomposing CAD workflows into modular sub-tasks (e.g., sketching, constraints, extrusion) and reformulating modeling problems as geometric reasoning tasks can align better with LLM capabilities and improve interpretability.
- Multimodal and Multi-agent Integration. Developing LLM systems that can call CAD tools, solvers, and databases autonomously—as seen in MechAgents or LangSim—will allow LLMs to reason, plan, and act across tools in complex design and simulation pipelines.
- Standardized Benchmarks and Evaluation. Creating large-scale, task-diverse, and format-unified benchmark datasets (e.g., combining natural language prompts, simulation files, and result summaries) will accelerate model evaluation and fair comparison in this field.
- Physics Validation and Safety Assurance. Embedding physical rule checkers and verification mechanisms into generation loops can help enforce unit consistency, structural validity, and simulation compatibility, ensuring that outputs are not just syntactically correct but physically plausible.
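A minimal version of such a physical rule checker can be sketched as a dimensional-analysis pass, representing each quantity as an exponent vector over SI base units. The quantities and helper names below are illustrative, not from any cited system:

```python
# Dimensional-consistency check: each quantity is an exponent vector over
# SI base units (m, kg, s); two sides of a generated relation are valid
# only if their exponent vectors match.

BASE = ("m", "kg", "s")

def dims(**exps):
    """Build a dimension vector, e.g. dims(m=1, kg=1, s=-2) for newtons."""
    return tuple(exps.get(u, 0) for u in BASE)

def mul(a, b):
    """Dimensions of a product: exponents add."""
    return tuple(x + y for x, y in zip(a, b))

def consistent(lhs, rhs):
    return lhs == rhs

FORCE = dims(m=1, kg=1, s=-2)    # N = kg*m/s^2
AREA = dims(m=2)                 # m^2
STRESS = dims(m=-1, kg=1, s=-2)  # Pa = N/m^2

# sigma = F / A  =>  sigma * A must have the dimensions of F
assert consistent(mul(STRESS, AREA), FORCE)
```

A production checker would additionally track remaining base units (A, K, mol, cd), parse units out of generated text, and flag mismatches back into the generation loop.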
5.3. Chemistry and Chemical Engineering
5.3.1. Overview

| Type of Task | Subtasks | Insights and Contributions | Key Models | Citations |
|---|---|---|---|---|
| Chemical Structure Textualization | Molecular Captioning | LLMs, by learning structure–property patterns from data, generate meaningful molecular captions, thus improving interpretability and aiding chemical understanding. | MolT5: generates concise captions by mapping substructures to descriptive phrases; MolFM: uses fusion of molecular graphs and text for richer narrative summaries. | [887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902] |
| Chemical characteristics prediction | Property Prediction | LLMs, by capturing complex structure–property relationships from molecular representations, enable accurate property prediction, thereby providing mechanistic insights and guiding the rational design of molecules with desired functions. | SMILES-BERT: self-supervised SMILES pretraining for robust property inference; ChemBERTa: masked SMILES modeling boosting solubility and toxicity predictions. | [903,904,905,906,907,908,909,910,911,912,913,914] |
| | Reaction Characteristics Classification | LLMs, by modeling the relationships between reactants, conditions, and outcomes from large reaction datasets, can accurately predict reaction types, yields, and rates, thereby uncovering hidden patterns in chemical reactivity and enabling chemists to optimize reaction conditions and select efficient synthetic routes with greater confidence. | RXNFP: fingerprint-transformer accurately classifies reaction types; YieldBERT: fine-tuned on yield data to predict experimental yields within 10% error. | [863,915,916,917,918,919,920,921,922,923,924,925,926,927] |
| Chemical Structure Prediction & Tuning | Reaction Products Prediction | LLMs, by learning underlying chemical transformations from reaction data, can accurately predict reaction products, thus uncovering implicit reaction rules and supporting more efficient and informed synthetic planning. | Molecular Transformer: state-of-the-art SMILES-to-product translation; | [921,923,924,925,926,928,929,930,931,932,933] |
| | Chemical Synthesis | LLMs, by capturing patterns in reaction sequences and chemical logic from large datasets, can suggest plausible synthesis routes and rationales, thereby enhancing human understanding of synthetic strategies and accelerating discovery. | Coscientist: GPT-4-driven planning and robotic execution. | [872,932,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951] |
| | Molecule Tuning | LLMs, by modeling structure–property relationships across diverse molecular spaces, enable targeted molecule tuning to optimize desired properties, thereby providing insights into molecular design and accelerating the development of functional compounds. | DrugAssist: uses LLM prompts for ADMET property optimization; ControllableGPT: enables constraint-based molecular modifications. | [952,953,954,955,956,957,958,959,960,961,962,963,964] |
| Chemical Text Mapping | Chemical Text Mining | LLMs, by capturing semantic and contextual nuances in chemical literature, enable accurate classification and regression in text mining tasks, thereby uncovering trends, predicting research outcomes, and transforming unstructured texts into actionable scientific insights. | Fine-tuned GPT: specialized for chemical classification and regression; ChatGPT: adapts zero-shot classification of chemical text. | [965,966,967,968,969,970,971,972,973,974,975,976,977,978] |
| Narrative-Guided Chemical Design | De Novo Molecule Generation | LLMs, by learning chemical syntax and patterns from large molecular corpora, enable de novo molecule generation with realistic and diverse structures, thus offering insights into unexplored chemical space and accelerating early-stage drug and material discovery. | ChemGPT: unbiased SMILES sampling for novel molecules; MolecuGen: scaffold-guided generative modeling for improved novelty. | [887,891,979,980,981,982,983,984,985,986,987,988,989,990,991,992] |
| | Conditional Molecule Generation | LLMs, by conditioning molecular generation on desired properties or scaffolds, enable the design of compounds that meet specific criteria, thereby offering insights into structure–function relationships and streamlining the discovery of tailored molecules. | GenMol: multi-constraint text-driven fragment remasking. | [887,889,931,957,983,984,986,987,988,993,994,995,996] |
| Chemical Knowledge Narration | Chemical Knowledge QA | LLMs, by integrating extensive chemical literature and diverse databases, can accurately address complex chemical knowledge questions, thereby uncovering valuable insights and enabling more informed, accelerated research and decision-making. | ChemGPT: conditional SMILES generation for property-specific tuning; ScholarChemQA: domain-specific QA fine-tuned on scholarly chemistry data. | [863,924,997,998,999,1000,1001,1002,1003,1004,1005,1006,1007,1008] |
| | Chemical Text Mining | LLMs, by understanding and extracting structured information from unstructured chemical texts, enable efficient chemical text mining, thereby revealing hidden knowledge, facilitating data-driven research, and accelerating the discovery of relationships across literature. | ChemBERTa: BERT-based model fine-tuned for chemical text classification. SciBERT: pretrained on scientific text including chemical literature for robust retrieval. | [965,966,973,976,1009,1010,1011,1012,1013,1014] |
| | Chemical Education | LLMs, by generating intuitive explanations and answering complex queries in natural language, support chemical education by making abstract concepts more accessible, thereby enhancing student understanding and promoting more interactive, personalized learning experiences. | MetaTutor: LLM-based metacognitive tutor for chemistry learners. | [997,998,1015,1016,1017,1018,1019,1020,1021,1022] |
5.3.2. Chemical Structure Textualization
5.3.3. Chemical Characteristics Prediction
5.3.4. Chemical Structure Prediction and Tuning
5.3.5. Chemical Text Mapping
5.3.6. Property-Directed Chemical Design
5.3.7. Chemical Knowledge Narration
5.3.8. Benchmarks
| Model (source) | BACE (ROC-AUC↑) | BBBP (ROC-AUC↑) | HIV (ROC-AUC↑) | Tox21 (ROC-AUC↑) | SIDER (ROC-AUC↑) | ESOL (RMSE↓) | FreeSolv (RMSE↓) | Lipo (RMSE↓) |
|---|---|---|---|---|---|---|---|---|
| MolBERT [1] | 0.866 | 0.762 | 0.783 | — | — | 0.531 | 0.948 | 0.561 |
| ChemBERTa-2 [2] | 0.799 | 0.728 | — | — | — | 0.889 | 1.363 | 0.798 |
| BARTSmiles [3] | 0.705 | 0.997† | 0.851 | 0.825 | 0.745 | — | — | — |
| MolFormer-XL [3] | 0.690 | 0.948 | 0.847 | — | 0.882† | — | — | 0.937† |
| ImageMol [4] | — | — | 0.814 | — | — | — | — | — |
| SELFormer [5] | 0.832 | 0.902 | 0.681 | 0.653 | 0.745 | 0.682 | 2.797 | 0.735 |
| Type of Task | Benchmarks | LLM Tested | Introduction | Cross tasks |
|---|---|---|---|---|
| Property Prediction | MoleculeNet [1066] | ✓ | 16 public datasets (SMILES + labels; quantum, phys-chem, physiology; 130k–400k samples) | – |
| | OGB-PCQM4M-v2 [1067] | × | 3.8M molecular graphs with 3-D coords + HOMO–LUMO gaps | Reaction Rate |
| | Therapeutics Data Commons [1068] | ✓ | 30+ ADMET/bioactivity CSVs, leaderboard splits | Chemical Synthesis |
| | PDBbind [1069] | × | 22k protein–ligand complexes; PDB/mol2 + | – |
| | BindingDB [1070] | × | 3M structure–target Ki/IC50 pairs (SMILES, FASTA) | – |
| | Open Catalyst OC22 [1071] | × | Adsorption geometries + barriers for 1.3M configs | Rate |
| Reaction Yield Prediction | Buchwald–Hartwig HTE [1035] | × | 3955 C–N couplings; CSV (yield, ligands, bases) | – |
| | Suzuki–Miyaura HTE [1072] | × | 5760 C–C couplings; yield matrix | – |
| | USPTO-Yield | ✓ | 1M patent reactions with numeric yields | Reaction type |
| | Open Reaction Database (ORD) [1073] | × | JSON records: reactants / products / conditions / yield | Reaction type |
| | ORDerly-Yield [1074] | × | Clean ORD + USPTO split, reproducible splits | Reaction Product Prediction |
| | AstraZeneca ELN [1075] | × | 25k ELN entries, diverse chemistries (CSV) | Conditions optimisation |
| Reaction Type Classification | USPTO-50K [1076] | ✓ | 50036 atom-mapped reactions, 10 classes | Chemical Synthesis, Reaction Yield |
| | USPTO-Full / MIT [1076] | ✓ | 400k–1.3M reactions; 60 coarse classes | Reaction Product, Chemical Synthesis |
| | Reaxys Reaction [1077] | × | 40M literature reactions, multi-level class labels | Reaction Rate |
| | ORDerly-Class [1074] | × | ORD subset with curated type labels | – |
| Reaction Rate / Kinetics | NIST SRD-17 [1078] | × | 38k gas-phase rate constants, Arrhenius params (XML) | – |
| | RMG Kinetics DB [1079] | × | 50k gas + surface elementary steps (YAML) | Mechanism generation |
| | NDRL/NIST Solution DB | × | 17k solution-phase rate constants | – |
| | Combustion FFCM | × | Curated small-fuel flame kinetics set (YAML) | – |
| Molecule Captioning | ChEBI-20 [1080] | ✓ | 33k SMILES–natural-language pairs (JSON) | Retrieval |
| | SMolInstruct (Caption) [1065] | ✓ | 3M multi-task instructions incl. caption pairs | – |
| | MolGround [1081] | ✓ | 20k captions with atom-level grounding tags | – |
| Reaction Product Prediction | USPTO-MIT Forward [921] | ✓ | 400k one-step reactions; SMILES input → product | Reaction Type |
| | ORDerly-Forward [1074] | × | ORD/USPTO split, non-USPTO OOD set | Reaction Yield |
| Chemical Synthesis | USPTO-50K Retro [1082] | ✓ | 50k product → reactant pairs, 10 classes | Reaction class |
| | PaRoutes [1083] | × | 20k multi-step routes, JSON graphs | – |
| | ORDerly-Retro [1074] | × | ORD split with non-USPTO OOD test | Reaction Product |
| | TDC Retrosynthesis [1068] | ✓ | Wrappers for USPTO-50K + PaRoutes | – |
| | AiZynthFinder test [1084] | × | 100 difficult drug targets, MOL files | – |
| Molecule Tuning | GuacaMol Goal-Dir. [1085] | ✓ | 20 oracle tests (SMILES, property calls) | Molecule Generation |
| | PMO Suite [1086] | ✓ | 23 tasks, score-limited oracle calls (JSON) | – |
| | MOSES Opt [1086] | ✓ | Scaffold-constrained optimisation splits | De novo Molecule Generation |
| | TDC Docking [1068] | × | AutoDock/Vina scoring tasks (SDF) | Conditional Protein Generation |
| | LIMO Affinity [1087] | × | Gradient VAE optimisation toward nM affinity | Conditional Generation |
| Chemical Text Classification | ChemProt [1088] | × | 1820 PubMed abstracts with 5 CP-relation labels | QA, Chemical text mining |
| | BC5-CDR [1089] | × | 1500 abstracts; chemical–disease relations | NER |
| | CHEMDNER / BC4CHEMD [1090] | × | 10k abstracts, 84k chemical mentions | NER |
| | NLM-Chem [968] | × | 150 full-text articles, gold chemical tags | Chemical text mining |
| | ChEMU Patents [1091] | × | 1.5k patent excerpts with entity + event labels | Chemical text mining |
| | ChemNER 62-type [1092] | × | Fine-grained 62-label NER corpus | – |
| Cond. Mol Generation | MOSES Scaffold [1086] | × | Bemis-Murcko prompts → molecules | De novo molecule generation |
| | MOSES-2 [1086] | × | Adds stereo + logP/MW targets (JSON) | – |
| | GuacaMol Cond. [1085] | ✓ | Similarity + property dual constraints | Molecule Tuning |
| | LIMO [1087] | × | Latent inversion with docking-affinity oracle | – |
| De Novo Mol Gen | MOSES (dist.) [1086] | ✓ | 1.9M ZINC clean-leads SMILES, train/test/scaffold | Conditional generation |
| | GuacaMol Dist. [1085] | ✓ | 10 distribution-learning metrics tasks | Optimisation |
| | GEOM-Drugs [1093] | × | 100k drug-like molecules + 3-D conformers | Conformer generation |
| | QMugs [1094] | × | 665k drug-like molecules with QM labels | Property prediction |
| | TDC MolGeneration [1095] | × | Unified wrapper for MOSES/GuacaMol | – |
| Chemical Knowledge QA | ScholarChemQA [1001] | ✓ | 40k yes/no/maybe QAs from research abstracts | Chemical text mining |
| | ChemistryQA [1096] | × | 4500 high-school calc-heavy MCQs | Education |
| | MoleculeQA [1000] | ✓ | 12k molecule-fact QAs (SMILES + text) | Molecule Captioning |
| | MolTextQA [1097] | × | MC-QA over PubChem descriptions | – |
| Chemical text mining | CHEMDNER [1090] ChEMU [1091] NLM-Chem [968] | | See classification rows (NER + event extraction) | NER / IE suites |
| | ChemBench [1098] | × | 7059 curated curriculum QAs; JSON | QA |
| Chemical Education | ChemistryQA [1096] | × | High-school MCQ dataset (LaTeX problems) | QA |
| Model | USPTO-MIT Top-1↑ | USPTO-MIT Top-5↑ | USPTO-MIT Top-10↑ | USPTO-50K Acc.↑ | USPTO-50K F1↑ | USPTO-Yield ↑ | USPTO-Yield MAE↓ | USPTO-Stereo Top-1↑ |
|---|---|---|---|---|---|---|---|---|
| Molecular Transformer | 0.875 | 0.937 | 0.954 | — | — | — | — | 0.825 |
| Augmented Transformer | 0.888 | 0.944 | 0.960 | 0.921 | 0.909 | — | — | 0.832 |
| MolFormer | 0.883 | 0.945 | 0.960 | 0.945 | 0.932 | — | — | 0.832 |
| ReactionBERT | 0.930 | 0.972 | 0.981 | 0.930 | 0.918 | — | — | 0.845 |
| Chemformer | 0.910 | 0.968 | 0.979 | 0.915 | 0.905 | — | — | 0.834 |
| ReactionT5 | 0.975 | 0.986 | 0.988 | — | — | — | — | 0.790 |
| CompoundT5 | 0.866 | 0.895 | 0.904 | — | — | — | — | — |
| ProPreT5 | 0.998 | 1.000 | 1.000 | — | — | — | — | — |
| ChemBERTa-2 | — | — | — | 0.880 | 0.865 | — | — | — |
| Yield-BERT | — | — | — | — | — | 0.41 | 13.2 | — |
| ReaLM | — | — | — | — | — | 0.52 | 10.5 | — |
| Model | BLEU-2↑ | BLEU-4↑ | METEOR↑ | ROUGE-1↑ | ROUGE-2↑ | ROUGE-L↑ |
|---|---|---|---|---|---|---|
| MolT5-base | 0.540 | 0.457 | 0.569 | 0.634 | 0.485 | 0.578 |
| MolReGPT (GPT-4-0314) | 0.607 | 0.525 | 0.610 | 0.634 | 0.476 | 0.562 |
| MolT5-large | 0.594 | 0.508 | 0.614 | 0.654 | 0.510 | 0.594 |
| Galactica-125M | 0.585 | 0.501 | 0.591 | 0.630 | 0.474 | 0.568 |
| BioT5 | 0.635 | 0.556 | 0.656 | 0.692 | 0.559 | 0.633 |
| ICMA (Galactica-125M) | 0.636 | 0.565 | 0.648 | 0.677 | 0.537 | 0.618 |
| Model | Top-1↑ | Top-2↑ | Top-3↑ | Top-5↑ |
|---|---|---|---|---|
| Molecular Transformer | 88.8 % | 92.6 % | — | 94.4 % |
| T5Chem | 90.4 % | 94.2 % | — | 96.4 % |
| CompoundT5 | 86.6 % | 89.5 % | 90.4 % | 91.2 % |
| ProPreT5 | 99.8 % | — | — | — |
| ReactionT5 | 97.5 % | 98.6 % | 98.8 % | 99.0 % |
| Model | Top-1 Accuracy↑ | Top-5 Accuracy↑ |
|---|---|---|
| Molecular Transformer | 90.4 % | 95.3 % |
| Augmented Transformer | 90.6 % | 96.1 % |
| ProPreT5 | 99.8 % | — |
| Model | Validity↑ | Uniqueness↑ | Novelty↑ |
|---|---|---|---|
| MolGPT | 0.981 | 0.998 | 1.000 |
| LigGPT | 0.986 | 0.998 | 1.000 |
| GraphGPT | 0.975 | 0.999 | 1.000 |
| SmileyLlama (T=1.1) | 0.9783 | 0.9994 | 0.9713 |
| SmileyLlama (T=0.6) | 0.9968 | 0.9356 | 0.9113 |
| Model | Validity↑ | Unique@10k↑ | Novelty↑ | IntDiv1↑ | IntDiv2↑ |
|---|---|---|---|---|---|
| MolGPT | 0.994 | 1.000 | 0.797 | 0.857 | 0.851 |
| SELF-BART | 0.998 | 0.999 | 1.000 | 0.918 | 0.908 |
| MTMol-GPT | 0.845 | 0.993 | 0.984 | 0.835 | — |
| SF-MTMol-GPT | 1.000 | 0.955 | 0.932 | 0.850 | — |
5.3.9. Discussion
- Hybrid LLM–Mechanistic Frameworks. Combine LLMs with rule-based and physics-informed modules to integrate statistical language understanding with chemical theory.
- Multimodal Chemical Representations. Develop architectures that jointly process SMILES, molecular graphs, and spectroscopic or crystallographic data to capture 3D and electronic structure.
- Closed-Loop Experimental Pipelines. Integrate LLM outputs into automated synthesis and analysis platforms, enabling rapid hypothesis testing and feedback-driven model refinement.
- Data-Efficient Fine-Tuning. Leverage transfer learning, few-shot prompting, and synthetic augmentation of underrepresented reaction data to improve performance in sparse domains.
- Explainability and Uncertainty Quantification. Incorporate attribution methods and probabilistic modeling to provide confidence metrics and mechanistic rationales alongside predictions.
- Governance-First Deployment. Establish standards for model validation, transparent reporting (model cards), and ethical guidelines to ensure responsible use in chemical research and industry.
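The uncertainty-quantification point above can be sketched with a simple ensemble scheme that reports the spread across several predictors as a crude confidence signal. The toy linear "models" below stand in for independently trained property predictors and are purely illustrative:

```python
import statistics

def ensemble_estimate(predictors, x):
    """Aggregate predictions from several models into a mean value and a
    spread-based confidence signal (standard deviation across members)."""
    preds = [p(x) for p in predictors]
    return statistics.mean(preds), statistics.stdev(preds)

# Toy surrogate predictors for some molecular property; in practice these
# would be independently trained ML models or repeated LLM samples.
models = [lambda x: 0.9 * x, lambda x: 1.0 * x, lambda x: 1.1 * x]
mean, spread = ensemble_estimate(models, 2.0)
print(f"prediction {mean:.2f} ± {spread:.2f}")
```

A large spread flags inputs where the ensemble disagrees, which is exactly where a chemist should defer to experiment rather than trust the model.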
5.4. Life Sciences and Bioengineering
5.4.1. Overview

- Ethical and Safety Challenges. Life sciences and bioengineering are inherently intertwined with ethical and societal considerations that transcend purely technical challenges [1223]. These fields routinely grapple with questions surrounding the responsible use of gene editing technologies like CRISPR [1224,1225], the long-term ecological effects of genetically modified organisms [1226,1227,1228], and the protection of sensitive patient data in genomics and biomedical research [1229]. While LLMs can assist in synthesizing scientific literature and outlining stakeholder perspectives, they lack the capacity for moral reasoning or normative judgment. Their outputs are constrained by the biases present in their training data, which poses risks when applied to ethically sensitive domains [1230,1231]. Ethical decision-making in bioengineering—such as determining the acceptability of human germline editing, setting standards for clinical trials, or regulating synthetic biology applications—remains the responsibility of human experts, policymakers, and the broader public. These decisions require inclusive debate, value alignment, and legal oversight that go beyond algorithmic capabilities [1232,1233]. As technologies advance, both the life sciences and AI communities must collaboratively develop ethical frameworks that reflect societal values while fostering innovation.
- Needs for Empirical Validation. Both life sciences and bioengineering ultimately rely on physical experimentation [1234]. Scientific hypotheses must be empirically verified, and bioengineered solutions require testing under real-world conditions [1235]. However, such experiments often pose significant bottlenecks due to their inherent slowness, high costs, and ethical limitations—particularly in human studies, where experimentation is strictly constrained and animal models often fail to fully replicate human biology [1236,1237,1238]. While computational models can alleviate some of these burdens, they cannot fully substitute for wet-lab or clinical experiments. Similarly, LLMs are incapable of conducting physical experiments or collecting new empirical data. Although they can assist in designing experimental protocols, they cannot implement or validate them [1239]. Consequently, research challenges that fundamentally require novel data acquisition—such as identifying new drug targets or evaluating biomaterials—remain beyond the scope of LLMs alone. The crucial last mile of validation in biology and engineering, demonstrating that something works in actual living systems, remains dependent on laboratories, clinical trials, and real-world testing [1240].
- Complexity of Biological Systems. The overarching challenge is that living systems are astonishingly complex and multi-scale. Scientists struggle with this complexity because small changes can have cascading, unpredictable effects. For life scientists, this means incomplete understanding of many diseases and biological processes [1241,1242,1243]. For bioengineers, it means difficulty designing interventions without unintended consequences [1244,1245]. LLMs cannot reliably solve this because much of biological complexity stems from unknown factors requiring empirical observation and quantitative modeling beyond text-pattern recognition [1246]. While LLMs process information well, the emergent behavior of complex biological networks often requires specialized modeling that correlation-based systems cannot provide without explicit mathematical frameworks. Major challenges like understanding neural circuits or curing cancer remain unsolved because they require new scientific discoveries and experimental validation, not just knowledge retrieval [1247].
- Data Quality and Integration. Modern life sciences and bioengineering generate enormous volumes of data from genomic sequences, proteomics, patient records, and sensors. Making sense of these data reliably presents significant challenges because they are often noisy, come from disparate sources, and lack integration [1248,1249]. While LLMs excel at processing text, they struggle with heterogeneous scientific data that include numbers, images, and experimental measurements [8]. LLMs do not have native capabilities to process raw experimental data such as gene expression matrices or medical imaging unless specifically augmented with specialized tools [1250]. Core challenges in biological big data, including ensuring reproducibility, establishing causal relationships from observational data, and analyzing complex multi-modal datasets, still require specialized algorithms and human expertise in statistics and domain knowledge. LLMs might help report findings or suggest hypotheses, but they cannot replace the sophisticated analytical pipelines needed for rigorous scientific data analysis in these fields.
- Literature Overload and Knowledge Synthesis. One critical challenge uniquely pronounced in biology and bioengineering is managing the vast, rapidly growing, and fragmented body of research literature. Unlike fields such as mathematics, law, or finance, biological disciplines inherently encompass a multitude of interconnected subspecialties, each producing large volumes of highly specialized research. For example, understanding a complex disease like cancer may require integrating findings from genetics, immunology, cell biology, pharmacology, and bioinformatics—each field publishing detailed, specialized studies that must be synthesized for comprehensive insights. The complexity arises not only from the sheer volume but also from interdisciplinary connections, intricate experimental details, and extensive supplementary materials required for reproducibility. Consequently, researchers face significant difficulty in identifying relevant literature, extracting key insights, and synthesizing knowledge efficiently. This is precisely where LLMs show strong potential as intelligent literature reviewers [1251,1252,1253]. Advanced LLMs, such as GPT-4, can proficiently read, summarize, and contextualize complex biomedical texts, rapidly extracting relevant findings from extensive corpora [1253,1254]. For example, an LLM could swiftly provide researchers with an overview of current biomarkers for Alzheimer’s disease or consolidate recent advancements in biodegradable stent materials [1253]. By effectively navigating dense technical language and complex sentence structures inherent to biological literature, LLMs mitigate literature overload, facilitate interdisciplinary integration, and enable literature-based discovery—highlighting connections between seemingly disparate research findings [1255,1256,1257].
- Interpreting and Annotating Biological Sequences (Genomics/Proteomics). In both life sciences research and bioengineering applications like synthetic biology, understanding DNA, RNA, and protein sequences is crucial [1258,1259,1260]. These sequences can be thought of as strings of letters (A, T, C, G for DNA; amino acids for proteins) – in other words, a language of life. Recent work has shown that language models can be applied to these biological sequences, treating them like natural language, where “words” are motifs or codons and “sentences” are genes or protein domains. This is a challenge where LLM-like models shine [1261,1262,1263]. This means LLMs can help annotate genomes (predicting genes and their functions in a newly sequenced organism) or predict the effect of mutations (important for understanding genetic diseases) [1263]. In proteomics, models can suggest which parts of a protein are important for its structure or activity [1264,1265]. The advantage of LLMs here is their ability to handle long-range dependencies in sequences – biology often has context-dependent effects, and language models are designed to handle such context. Moreover, LLMs can generate sequences too, which leads to the next point.
- Design and Generation of Biological Sequences or Structures. In bioengineering, a cutting-edge challenge is designing new biological components – for instance, designing a protein that catalyzes a desired chemical reaction, or an RNA molecule that can serve as a therapeutic. Traditionally, this is very hard (the search space of possible sequences is astronomically large). However, LLMs have a generative capability that can be harnessed here. Already, models like ProGen [1266,1267] have shown they can generate novel protein sequences that have a predictable function across protein families. In simpler terms, an LLM trained on a vast number of protein sequences can be prompted to create a new sequence that looks like, say, an enzyme, and those sequences have been experimentally verified in some cases to fold and function [1266,1267,1268,1269,1270]. This is a remarkable development because it means LLMs can assist in protein engineering and drug discovery by proposing candidate designs that humans or simpler algorithms might not think of. Similarly, for DNA/RNA, an LLM could suggest a DNA sequence that regulates gene expression in a certain way (useful for gene therapy designs) [1259,1263] or propose improvements to a biosynthetic pathway by modifying enzyme sequences [1264,1265]. LLMs are suitable for these creative tasks because, much like with natural language, they can interpolate and extrapolate learned patterns to create new, coherent outputs (here, “coherent” means biologically plausible sequences). While any generated design still needs to be tested in the lab (to confirm it works as intended), LLMs can dramatically accelerate the ideation phase of bioengineering design.
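The "language of life" analogy above is literal at the tokenization level: DNABERT-style models split a DNA sequence into overlapping k-mers that play the role of words. The sketch below uses k=6, a common DNABERT configuration; the function name is invented for illustration:

```python
def kmer_tokenize(seq, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mers, the 'words' consumed
    by DNABERT-style sequence language models."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ATGCGTACGT", k=6)
print(tokens)  # ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT']
```

The resulting token list is what gets mapped to vocabulary IDs and fed through a standard transformer, exactly as word-piece tokens would be in natural language.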

- Cross-domain transferability: Many life sciences tasks share structural similarities in their input data, making it easier to adapt and transfer methods across domains.
| Type of Task | Subtasks | Insights and Contributions | Key Models | Citations |
|---|---|---|---|---|
| Genomic Sequence Analysis | DNA Sequence Modeling | LLMs, by capturing regulatory grammars embedded in genomic sequences, enable accurate prediction of functional elements and variant effects, thus enhancing interpretability and advancing our understanding of gene regulation. | DNABERT: adapts BERT to the human reference genome to learn bidirectional representations; HyenaDNA: scales context length while improving model efficiency. | [1261,1283,1284,1285,1286,1287,1288,1289] |
| RNA Function Learning | LLMs, by modeling both sequence and structural contexts of RNA, uncover functional motifs and regulatory patterns, thereby improving interpretability and facilitating insights into post-transcriptional regulation. | RNA-FM: trained on 23.7M deduplicated RNA sequences; 3UTRBERT: specialized in modeling 3’ untranslated regions (3’UTRs). | [1290,1291,1292,1293,1294,1295,1296] | |
| Biomedical Reasoning and Understanding | Question Answering | LLMs, by aligning domain-specific knowledge with natural language understanding, accurately interpret biomedical queries and texts, thereby enhancing information retrieval and supporting clinical and research decision-making. | Med-PaLM: achieved near-expert-level performance on the USMLE test; HuatuoGPT: proactively asks questions rather than responding passively. | [17,1297,1298,1299,1300,1301,1302,1303,1304,1305,1306,1307,1308,1309,1310,1311] |
| Language Understanding | LLMs, by learning semantic patterns and reasoning cues from biomedical texts, enable deep language understanding, thereby improving performance on tasks like inference, entity recognition, and document classification. | BioInstruct: covering multiple understanding tasks; GPT-4: performs NLI without explicit fine-tuning when prompted with queries. | [17,1297,1302,1305,1312,1313,1314,1315,1316] | |
| Omics & Clinical Structured Data Integration | Clinical Language Generation | LLMs, by capturing clinical language styles and contextual dependencies, generate coherent and context-aware narratives, thereby enhancing the automation and reliability of medical reporting and documentation. | ClinicalT5: pretraining on text-to-text tasks for long clinical narratives; GPT-4: performs well when prompted with specific queries. | [1301,1317,1318,1319,1320,1321] |
| EHR Based Prediction | LLMs, by integrating longitudinal and multimodal patient data from EHRs, model complex temporal and clinical dependencies, thereby enabling accurate prediction of outcomes and supporting personalized healthcare. | BEHRT: adapts BERT to longitudinal EHR data by encoding structured sequences; GatorTron: scaled up to 8.9B parameters and trained on over 90B words of clinical narratives and structured labels. | [1305,1322,1323,1324] | |
| Hybrid Outcome Prediction | Drug Synergy Prediction | LLMs, by jointly modeling chemical structures and cellular contexts, capture intricate drug–drug and drug–cell interactions, thereby enhancing the prediction of synergistic combinations and accelerating combination therapy design. | CancerGPT: fine-tuning GPT-3 to predict drug synergy in rare cancers; BAITSAO: a foundation model strategy that integrates multiple datasets and tasks. | [6,7,18,1325,1326,1327,1328] |
| Protein Modeling | LLMs, by learning evolutionary, structural, and functional signals from protein sequences, enable accurate modeling of folding, function, and interactions, thereby advancing protein engineering and therapeutic discovery. | ProLLaMA: achieving joint understanding and generation within a single framework; ProteinGPT: further supporting structure input, language interaction, and functional Q&A | [1267,1268,1269,1329,1330,1331,1332,1333,1334,1335,1336,1337,1338,1339,1340,1341,1342,1343,1344,1345] |
5.4.2. Genomic Sequence Analysis
5.4.3. Clinical Structured Data Integration
5.4.4. Biomedical Reasoning and Understanding
5.4.5. Hybrid Outcome Prediction
| Type of Task | Benchmarks | Introduction | Cross tasks |
|---|---|---|---|
| DNA Sequence Modeling | BEND [1294] | A collection of realistic and biologically meaningful downstream tasks defined on the human genome | Gene finding, Enhancer annotation, Chromatin accessibility, Histone modification etc. |
| Genomic Benchmarks [1371] | Contains 8 datasets that focus on regulatory elements from 3 model organisms: human, mouse, and roundworm. | - | |
| GUE [1284] | A collection of 28 datasets across 7 tasks constructed for genome language model evaluation. | Promoter prediction, Splice site prediction, Covid variant classification, epigenetic marks prediction etc. | |
| NT [1287] | A collection of 18 datasets across 7 tasks constructed for genome language model evaluation. | Promoter prediction, Splice site prediction, Enhancer annotation etc. | |
| RNA Function Learning | RnaBench [1372] | Includes 100 test samples without any training or validation data. | Intra/Inter family prediction, Inverse RNA Folding. |
| BEACON [1373] | Contains 967k sequences with lengths ranging from 23 to 1,182. | Structure Prediction, Contact Map Prediction, Modification Prediction, Mean Ribosome Loading etc. | |
| Clinical Language Generation | MIMIC-III [1362] | A large, freely-available database comprising deidentified health-related data associated with over forty thousand patients. | Report Summarization, Risk Prediction etc. |
| MIMIC-IV [1363] | A larger version including over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. | Report Summarization, Risk Prediction etc. | |
| IU X-Ray [1374] | A set of 7,470 chest X-ray images paired with their corresponding diagnostic reports. | Report Summarization, Image Caption etc. | |
| EHR Based Prediction | EHRSHOT [1375] | An EHR-based benchmark covering 6,739 patients, evaluated in few-shot settings. | – |
| EHRNoteQA [1376] | A complex, multi-topic benchmark based on multiple patients’ electronic discharge records. | QA. | |
| Question Answering | MedQA [1377] | Consists of multiple-choice questions from the United States Medical Licensing Examination (USMLE). | – |
| MedMCQA [1378] | A large-scale multiple-choice QA dataset derived from Indian medical entrance examinations (AIIMS/NEET). | – | |
| PubMedQA [1379] | A closed-domain QA dataset, questions can be answered by looking at an associated context (PubMed abstract). | – | |
| MMLU Subsets [135] | For measuring multitask ability from various domains, including life science and Bio-engineering. | – | |
| MIMIC-IV [1363] | A collection of 1,057 questions whose answers can be based on the referral letters. | – | |
| Language Understanding | BC5-Disease [1089] | Including three separate sets of articles with diseases, chemicals and their relations annotated. | Named Entity Recognition. |
| NCBI-Disease [1380] | Contains 6,892 disease mentions, which are mapped to 790 unique disease concepts | Named Entity Recognition. | |
| DDI [1381] | An annotated corpus with pharmacological substances and drug–drug interactions | Relation Extraction. | |
| GAD [1382] | A repository of molecular, clinical and study parameters for >5,000 human genetic association studies | Relation Extraction. | |
| HoC [1383] | Including 1852 PubMed publication abstracts manually annotated by experts. | Doc. Classification. | |
| Drug Synergy Prediction | CancerGPT [1326] | A framework that tests LLMs’ performance in few-/zero-shot learning scenarios across seven rare tissue types. | – |
| BAITSAO [1328] | A framework integrating both regression and classification, based on synergy scores and binary synergy labels derived from large-scale drug combination datasets. | – | |
| Protein Modeling | PEER [1384] | Comprising fourteen diverse protein sequence understanding tasks. | – |
| Type [1385] | Five biologically relevant protein tasks for evaluating self-supervised sequence models. | – |
| Model | Gene finding | Enhancer annotation | Chromatin accessibility | Histone modification | CpG methylation | Variant effects (expression) | Variant effects (disease) |
|---|---|---|---|---|---|---|---|
| ResNet (non-LLM) | 0.46 | 0.06 | - | - | - | - | - |
| CNN (non-LLM) | 0.00 | 0.03 | 0.75 | 0.76 | 0.84 | - | - |
| ResNet-LM | 0.36 | 0.02 | 0.82 | 0.77 | 0.87 | 0.55 | 0.55 |
| AWD-LSTM | 0.05 | 0.03 | 0.69 | 0.74 | 0.81 | 0.53 | 0.45 |
| NT-H | 0.41 | 0.05 | 0.74 | 0.76 | 0.88 | 0.55 | 0.48 |
| NT-MS | 0.68 | 0.06 | 0.79 | 0.78 | 0.92 | 0.54 | 0.77 |
| NT-1000G | 0.49 | 0.04 | 0.77 | 0.77 | 0.89 | 0.45 | 0.49 |
| NT-V2 | 0.64 | 0.05 | 0.80 | 0.76 | 0.91 | 0.48 | 0.48 |
| DNABERT | 0.20 | 0.03 | 0.85 | 0.79 | 0.91 | 0.60 | 0.56 |
| DNABERT-2 | 0.43 | 0.03 | 0.81 | 0.78 | 0.90 | 0.49 | 0.51 |
| GENA-LM BERT | 0.52 | 0.03 | 0.76 | 0.78 | 0.91 | 0.49 | 0.55 |
| GENA-LM BigBird | 0.39 | 0.04 | 0.82 | 0.78 | 0.91 | 0.49 | 0.52 |
| HyenaDNA large | 0.35 | 0.03 | 0.84 | 0.76 | 0.91 | 0.51 | 0.45 |
| HyenaDNA tiny | 0.10 | 0.02 | 0.78 | 0.76 | 0.86 | 0.47 | 0.44 |
| GROVER | 0.28 | 0.03 | 0.82 | 0.77 | 0.89 | 0.56 | 0.51 |
| Model | SSP | CMP | DMP | SSI | SPL | APA | NcRNA | Modif | MRL | VDP | PRS | CRI-On | CRI-Off |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | F1 (%) | P@L (%) | (%) | (%) | ACC@K (%) | (%) | ACC (%) | AUC (%) | (%) | MCRMSE↓ | (%) | SC (%) | SC (%) |
| CNN (non-LLM) | 49.95 | 43.89 | 27.76 | 34.36 | 8.43 | 50.93 | 88.62 | 70.87 | 74.13 | 0.361 | 45.40 | 29.69 | 11.40 |
| ResNet (non-LLM) | 57.26 | 59.59 | 30.26 | 37.74 | 21.15 | 56.45 | 88.33 | 71.03 | 74.34 | 0.349 | 55.21 | 28.55 | 11.50 |
| LSTM (non-LLM) | 58.61 | 40.41 | 44.77 | 35.44 | 36.66 | 67.03 | 88.78 | 94.83 | 83.94 | 0.329 | 55.45 | 26.83 | 8.60 |
| RNA-FM | 68.50 | 47.56 | 51.45 | 42.36 | 34.84 | 70.32 | 96.81 | 94.98 | 79.47 | 0.347 | 55.98 | 31.62 | 2.49 |
| RNABERT | 57.27 | 45.21 | 48.19 | 31.62 | 0.18 | 57.66 | 68.95 | 82.82 | 29.79 | 0.378 | 54.60 | 29.77 | 4.27 |
| RNA-MSM | 57.98 | 57.26 | 37.49 | 39.22 | 38.33 | 70.40 | 84.85 | 94.89 | 83.48 | 0.330 | 56.94 | 34.92 | 3.85 |
| Splice-H510 | 64.93 | 45.80 | 55.56 | 38.91 | 44.80 | 58.65 | 95.92 | 62.57 | 83.49 | 0.321 | 54.90 | 26.61 | 4.00 |
| Splice-MS510 | 43.24 | 52.64 | 10.27 | 38.58 | 50.55 | 52.46 | 95.87 | 55.87 | 84.98 | 0.315 | 50.98 | 27.13 | 3.49 |
| Splice-MS1024 | 68.26 | 47.32 | 55.89 | 39.22 | 48.52 | 60.03 | 96.05 | 53.45 | 67.15 | 0.313 | 57.72 | 27.59 | 5.00 |
| UTR-LM-MRL | 59.71 | 45.51 | 55.21 | 39.52 | 36.20 | 64.99 | 89.97 | 56.41 | 77.78 | 0.325 | 57.28 | 28.49 | 4.28 |
| UTR-LM-TE&EL | 59.57 | 60.32 | 54.94 | 40.15 | 37.35 | 72.09 | 81.33 | 59.70 | 82.50 | 0.319 | 53.37 | 32.49 | 2.91 |
| UTRBERT-3mer | 60.37 | 51.03 | 50.95 | 34.31 | 44.24 | 69.52 | 92.88 | 95.14 | 83.89 | 0.337 | 56.83 | 29.92 | 4.48 |
| UTRBERT-4mer | 59.41 | 44.91 | 47.77 | 33.22 | 42.04 | 72.71 | 94.32 | 95.10 | 82.90 | 0.341 | 56.43 | 23.20 | 3.11 |
| UTRBERT-5mer | 47.92 | 44.71 | 48.67 | 31.27 | 39.19 | 72.70 | 93.04 | 94.78 | 75.64 | 0.343 | 57.16 | 25.74 | 3.93 |
| UTRBERT-6mer | 38.56 | 51.56 | 50.02 | 29.93 | 38.58 | 71.17 | 93.12 | 95.08 | 83.60 | 0.340 | 57.14 | 28.60 | 4.90 |
| BEACON-B | 64.18 | 60.81 | 56.28 | 38.78 | 37.43 | 70.59 | 94.63 | 94.74 | 72.29 | 0.320 | 54.67 | 26.01 | 4.42 |
| BEACON-B512 | 58.75 | 61.20 | 56.82 | 39.13 | 37.24 | 72.00 | 94.99 | 94.92 | 72.35 | 0.320 | 55.20 | 28.17 | 3.82 |
| Model | MedQA | MedMCQA | MMLU | PubMedQA | Referral QA | Treat Recom. |
|---|---|---|---|---|---|---|
| Claude-2 | 65.1 | 60.3 | 78.7 | 70.8 | 80.5 | 9.1 |
| GPT-3.5-turbo | 61.2 | 59.4 | 73.5 | 70.2 | 81.1 | 7.3 |
| GPT-4 | 83.4 | 78.2 | 92.3 | 80.0 | 83.2 | 18.6 |
| Alpaca | 34.2 | 30.1 | 40.8 | 65.2 | 74.8 | 3.5 |
| Vicuna-7B | 34.5 | 33.4 | 43.4 | 64.8 | 76.4 | 2.6 |
| LLaMA-2-7B | 32.9 | 30.6 | 42.3 | 63.4 | 74.5 | 3.3 |
| Mistral | 35.7 | 37.8 | 46.3 | 69.4 | 77.7 | 5.0 |
| Vicuna-13B | 38.0 | 36.4 | 45.6 | 66.2 | 76.8 | 4.6 |
| LLaMA-2-13B | 38.1 | 35.5 | 46.0 | 66.8 | 77.1 | 4.8 |
| LLaMA-2-70B | 45.8 | 42.7 | 54.0 | 67.4 | 78.9 | 5.5 |
| LLaMA-3-70B | 78.8 | 74.7 | 86.4 | 71.4 | 82.4 | 10.2 |
| HuatuoGPT | 28.4 | 24.8 | 31.6 | 61.0 | 69.3 | 3.8 |
| HuatuoGPT-2-7B | 41.1 | 41.9 | - | - | - | - |
| HuatuoGPT-2-13B | 45.7 | 47.4 | - | - | - | - |
| HuatuoGPT-o1-8B | 72.6 | 60.4 | - | 79.2 | - | - |
| ChatDoctor | 33.2 | 31.5 | 40.4 | 63.8 | 73.7 | 5.3 |
| PMC-LLaMA-7B | 28.7 | 29.8 | 39.0 | 60.2 | 70.2 | 4.0 |
| Baize-Healthcare | 34.9 | 31.3 | 41.9 | 64.4 | 74.0 | 4.7 |
| MedAlpaca-7B | 35.1 | 32.9 | 48.5 | 62.4 | 75.3 | 4.8 |
| Meditron-7B | 33.5 | 31.1 | 45.2 | 61.6 | 74.9 | 5.8 |
| BioMistral | 35.4 | 34.8 | 52.6 | 66.4 | 77.0 | 7.6 |
| PMC-LLaMA-13B | 39.6 | 37.7 | 56.3 | 67.0 | 77.6 | 4.9 |
| MedAlpaca-13B | 37.3 | 35.7 | 51.5 | 65.6 | 77.4 | 5.1 |
| ClinicalCamel | 46.4 | 45.8 | 68.4 | 71.0 | 79.8 | 8.4 |
| Meditron-70B | 45.7 | 44.9 | 65.1 | 70.6 | 78.6 | 8.9 |
| HuatuoGPT-o1-70B | 83.3 | 73.6 | - | 80.6 | - | - |
| Med-PaLM 2 (5-shots) | 79.7 | 71.3 | - | 79.2 | - | - |
| Model | BC5 | NCBI | DDI | GAD | HoC | Pharma. QA | Drug Infer. |
|---|---|---|---|---|---|---|---|
| Claude-2 | 52.9 | 44.2 | 50.4 | 50.7 | 70.8 | 60.6 | 51.5 |
| GPT-3.5-turbo | 52.3 | 46.1 | 49.3 | 50.8 | 66.4 | 57.3 | 47.0 |
| GPT-4 | 71.3 | 58.4 | 64.6 | 68.2 | 83.6 | 63.8 | 56.5 |
| Alpaca | 41.2 | 36.5 | 37.4 | 36.9 | 52.6 | 41.3 | 47.5 |
| Vicuna-7B | 44.5 | 37.0 | 39.4 | 41.2 | 53.8 | 42.3 | 45.5 |
| LLaMA-2-7B | 40.1 | 34.8 | 37.9 | 39.3 | 48.6 | 46.5 | 48.0 |
| Mistral | 46.8 | 39.9 | 43.5 | 44.3 | 59.6 | 51.2 | 53.0 |
| Vicuna-13B | 46.2 | 39.0 | 41.3 | 43.5 | 56.7 | 45.1 | 46.0 |
| LLaMA-2-13B | 46.6 | 38.3 | 39.7 | 41.2 | 55.9 | 46.9 | 47.5 |
| LLaMA-2-70B | 47.8 | 41.5 | 45.6 | 44.7 | 63.2 | 49.3 | 51.5 |
| LLaMA-3-70B | 63.7 | 50.2 | 59.7 | 63.1 | 79.0 | 62.4 | 53.0 |
| PubMed-BERT-base | - | 87.8 | 82.4 | 82.3 | 82.3 | - | - |
| BioLink-BERT-base | - | 88.2 | 82.7 | 84.4 | 85.4 | - | - |
| HuatuoGPT | 43.6 | 37.5 | 40.1 | 38.2 | 50.2 | 44.1 | 49.5 |
| ChatDoctor | 45.8 | 40.9 | 41.2 | 40.1 | 55.7 | 42.7 | 48.5 |
| PMC-LLaMA-7B | 45.2 | 37.8 | 40.8 | 42.0 | 55.6 | 45.5 | 51.0 |
| Baize-Healthcare | 44.4 | 38.5 | 41.9 | 45.8 | 54.5 | 46.9 | 50.5 |
| MedAlpaca-7B | 47.3 | 39.0 | 43.5 | 44.0 | 58.7 | 47.9 | 52.0 |
| Meditron-7B | 46.5 | 39.2 | 42.7 | 43.3 | 57.9 | 50.7 | 52.0 |
| BioMistral | 48.8 | 40.4 | 46.0 | 48.5 | 64.3 | 54.5 | 54.0 |
| PMC-LLaMA-13B | 51.5 | 43.1 | 48.4 | 48.7 | 65.3 | 48.8 | 51.5 |
| MedAlpaca-13B | 49.2 | 41.6 | 44.1 | 44.5 | 59.4 | 51.6 | 50.0 |
| ClinicalCamel | 51.2 | 43.7 | 47.6 | 47.2 | 64.8 | 52.6 | 52.5 |
| Meditron-70B | 54.3 | 45.7 | 51.2 | 49.6 | 69.6 | 58.7 | 54.5 |
| Model | Multi-Choice (Level 1) | Multi-Choice (Level 2) | Open-Ended (Level 1) | Open-Ended (Level 2) |
|---|---|---|---|---|
| GPT4 | 97.16 | 95.15 | 91.30 | 89.61 |
| GPT4-Turbo | 95.27 | 94.23 | 91.30 | 86.61 |
| GPT3.5-Turbo | 88.28 | 84.99 | 82.23 | 75.52 |
| Llama3-70b-Instruct | 94.33 | 91.92 | 89.04 | 86.84 |
| Llama2-70b-chat | 84.88 | – | 78.83 | – |
| qCamel-70 | 85.63 | – | 78.26 | – |
| Camel-Platypus2-70b | 89.79 | – | 78.83 | – |
| Platypus2-70b-Instruct | 90.36 | – | 80.53 | – |
| Mixtral-8x7b-Instruct | 87.52 | 86.61 | 88.28 | 81.52 |
| MPT-30b-Instruct | 79.96 | 75.52 | 67.11 | 62.59 |
| Llama2-13b-chat | 73.65 | – | 70.32 | – |
| Vicuna-13b | 82.04 | – | 70.51 | – |
| WizardLM-13b | 80.91 | – | 74.67 | – |
| qCamel-13 | 71.46 | – | 66.16 | – |
| OpenOrca-Platypus2-13b | 86.01 | – | 79.21 | – |
| Camel-Platypus2-13b | 78.07 | – | 67.86 | – |
| Synthia-13b | 79.21 | – | 74.48 | – |
| Asclepius-13b1 | – | – | 75.24 | – |
| Gemma-7b-it | 77.50 | 67.21 | 63.71 | 54.27 |
| MPT-7b-8k-instruct | 59.55 | 51.27 | 56.71 | 53.81 |
| Mistral-7b-Instruct | 82.04 | 64.90 | 72.97 | 53.81 |
| Dolphin-2.0-mistral-7b | 76.18 | – | 69.75 | – |
| Mistral-7b-OpenOrca | 87.15 | – | 79.58 | – |
| SynthIA-7b | 78.45 | – | 74.67 | – |
| Llama2-7b-chat | 65.78 | – | 58.98 | – |
| Vicuna-7b | 78.26 | – | 59.74 | – |
| Asclepius-7b | – | – | 66.92 | – |
| Tissue ( , ) | Methods | 0 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|---|
| Pancreas (=38, =1) | XGBoost | 0.026 | – | – | – | – | – | – | – |
| TabTransformer | 0.056 | – | – | – | – | – | – | – | |
| CancerGPT | 0.033 | – | – | – | – | – | – | – | |
| GPT-2 | 0.032 | – | – | – | – | – | – | – | |
| GPT-3 | 0.111 | – | – | – | – | – | – | – | |
| Endometrium (=36, =32) | XGBoost | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | – | – | – |
| TabTransformer | 0.674 | 0.889 | 0.903 | 0.948 | 0.938 | 0.962 | – | – | |
| CancerGPT | 0.564 | 0.668 | 0.676 | 0.831 | 0.686 | 0.737 | – | – | |
| GPT-2 | 0.408 | 0.808 | 0.395 | 0.383 | 0.389 | 0.717 | – | – | |
| GPT-3 | 0.869 | 1.0 | 0.947 | 0.859 | 0.799 | 0.859 | – | – | |
| Liver (=192, =21) | XGBoost | 0.132 | 0.132 | 0.132 | 0.132 | 0.132 | 0.132 | 0.12 | 0.12 |
| TabTransformer | 0.13 | 0.128 | 0.147 | 0.189 | 0.265 | 0.168 | 0.169 | 0.234 | |
| CancerGPT | 0.136 | 0.102 | 0.13 | 0.147 | 0.252 | 0.21 | 0.197 | 0.187 | |
| GPT-2 | 0.5 | 0.099 | 0.151 | 0.383 | 0.429 | 0.401 | 0.483 | 0.398 | |
| GPT-3 | 0.185 | 0.086 | 0.096 | 0.125 | 0.124 | 0.314 | 0.362 | 0.519 | |
| Soft tissue (=269, =83) | XGBoost | 0.243 | 0.243 | 0.243 | 0.243 | 0.235 | 0.235 | 0.264 | 0.271 |
| TabTransformer | 0.273 | 0.287 | 0.462 | 0.422 | 0.526 | 0.571 | 0.561 | 0.64 | |
| CancerGPT | 0.314 | 0.315 | 0.338 | 0.383 | 0.383 | 0.403 | 0.464 | 0.469 | |
| GPT-2 | 0.259 | 0.298 | 0.254 | 0.262 | 0.235 | 0.297 | 0.254 | 0.206 | |
| GPT-3 | 0.263 | 0.194 | 0.28 | 0.228 | 0.363 | 0.618 | 0.638 | 0.734 | |
| Stomach (=1081, =109) | XGBoost | 0.104 | 0.104 | 0.104 | 0.104 | 0.104 | 0.104 | 0.09 | 0.094 |
| TabTransformer | 0.261 | 0.371 | 0.396 | 0.383 | 0.294 | 0.402 | 0.45 | 0.465 | |
| CancerGPT | 0.3 | 0.297 | 0.316 | 0.325 | 0.269 | 0.308 | 0.297 | 0.312 | |
| GPT-2 | 0.116 | 0.124 | 0.099 | 0.172 | 0.165 | 0.107 | 0.152 | 0.131 | |
| GPT-3 | 0.078 | 0.106 | 0.17 | 0.37 | 0.1 | 0.19 | 0.219 | 0.181 | |
| Urinary tract (=1996, =462) | XGBoost | 0.186 | 0.186 | 0.186 | 0.186 | 0.186 | 0.197 | 0.199 | 0.209 |
| TabTransformer | 0.248 | 0.264 | 0.25 | 0.278 | 0.274 | 0.249 | 0.293 | 0.291 | |
| CancerGPT | 0.241 | 0.226 | 0.246 | 0.239 | 0.256 | 0.271 | 0.266 | 0.269 | |
| GPT-2 | 0.191 | 0.192 | 0.188 | 0.156 | 0.193 | 0.185 | 0.183 | 0.185 | |
| GPT-3 | 0.27 | 0.228 | 0.222 | 0.201 | 0.206 | 0.2 | 0.24 | 0.272 | |
| Bone (=3732, =253) | XGBoost | 0.064 | 0.064 | 0.064 | 0.064 | 0.064 | 0.064 | 0.064 | 0.064 |
| TabTransformer | 0.123 | 0.12 | 0.121 | 0.115 | 0.102 | 0.13 | 0.129 | 0.121 | |
| CancerGPT | 0.119 | 0.115 | 0.125 | 0.116 | 0.115 | 0.111 | 0.114 | 0.125 | |
| GPT-2 | 0.063 | 0.094 | 0.057 | 0.081 | 0.052 | 0.071 | 0.057 | 0.065 | |
| GPT-3 | 0.064 | 0.051 | 0.045 | 0.058 | 0.068 | 0.087 | 0.101 | 0.181 |
| Tissue | Methods | 0 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|---|
| Pancreas | XGBoost | 0.5 | – | – | – | – | – | – | – |
| TabTransformer | 0.553 | – | – | – | – | – | – | – | |
| CancerGPT | 0.237 | – | – | – | – | – | – | – | |
| GPT-2 | 0.211 | – | – | – | – | – | – | – | |
| GPT-3 | 0.789 | – | – | – | – | – | – | – | |
| Endometrium | XGBoost | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | – | – |
| TabTransformer | 0.694 | 0.857 | 0.878 | 0.939 | 0.939 | 0.959 | – | – | |
| CancerGPT | 0.489 | 0.693 | 0.714 | 0.735 | 0.612 | 0.612 | – | – | |
| GPT-2 | 0.265 | 0.816 | 0.224 | 0.184 | 0.204 | 0.612 | – | – | |
| GPT-3 | 0.837 | 1.0 | 0.949 | 0.898 | 0.878 | 0.898 | – | – | |
| Liver | XGBoost | 0.587 | 0.587 | 0.587 | 0.587 | 0.587 | 0.587 | 0.574 | 0.574 |
| TabTransformer | 0.535 | 0.506 | 0.526 | 0.535 | 0.609 | 0.647 | 0.702 | 0.804 | |
| CancerGPT | 0.615 | 0.468 | 0.59 | 0.641 | 0.782 | 0.776 | 0.737 | 0.737 | |
| GPT-2 | 0.731 | 0.449 | 0.558 | 0.66 | 0.679 | 0.763 | 0.731 | 0.731 | |
| GPT-3 | 0.615 | 0.49 | 0.542 | 0.583 | 0.474 | 0.731 | 0.737 | 0.91 | |
| Soft tissue | XGBoost | 0.491 | 0.491 | 0.491 | 0.491 | 0.454 | 0.476 | 0.542 | 0.552 |
| TabTransformer | 0.557 | 0.566 | 0.709 | 0.727 | 0.788 | 0.802 | 0.83 | 0.835 | |
| CancerGPT | 0.656 | 0.646 | 0.68 | 0.734 | 0.725 | 0.754 | 0.8 | 0.795 | |
| GPT-2 | 0.546 | 0.535 | 0.519 | 0.56 | 0.427 | 0.577 | 0.456 | 0.384 | |
| GPT-3 | 0.517 | 0.406 | 0.6 | 0.444 | 0.607 | 0.82 | 0.866 | 0.889 | |
| Stomach | XGBoost | 0.529 | 0.529 | 0.529 | 0.529 | 0.529 | 0.529 | 0.476 | 0.508 |
| TabTransformer | 0.804 | 0.863 | 0.855 | 0.853 | 0.812 | 0.85 | 0.885 | 0.869 | |
| CancerGPT | 0.794 | 0.792 | 0.796 | 0.794 | 0.785 | 0.787 | 0.824 | 0.804 | |
| GPT-2 | 0.551 | 0.569 | 0.521 | 0.516 | 0.589 | 0.538 | 0.469 | 0.566 | |
| GPT-3 | 0.419 | 0.575 | 0.724 | 0.769 | 0.534 | 0.69 | 0.742 | 0.724 | |
| Urinary tract | XGBoost | 0.494 | 0.494 | 0.494 | 0.494 | 0.494 | 0.526 | 0.53 | 0.544 |
| TabTransformer | 0.599 | 0.612 | 0.604 | 0.625 | 0.601 | 0.587 | 0.623 | 0.622 | |
| CancerGPT | 0.578 | 0.561 | 0.579 | 0.577 | 0.589 | 0.593 | 0.609 | 0.609 | |
| GPT-2 | 0.526 | 0.528 | 0.532 | 0.397 | 0.515 | 0.452 | 0.469 | 0.566 | |
| GPT-3 | 0.645 | 0.57 | 0.556 | 0.496 | 0.508 | 0.516 | 0.531 | 0.572 | |
| Bone | XGBoost | 0.499 | 0.499 | 0.499 | 0.499 | 0.499 | 0.499 | 0.499 | 0.499 |
| TabTransformer | 0.706 | 0.705 | 0.724 | 0.697 | 0.65 | 0.689 | 0.708 | 0.696 | |
| CancerGPT | 0.625 | 0.648 | 0.693 | 0.653 | 0.683 | 0.636 | 0.678 | 0.68 | |
| GPT-2 | 0.507 | 0.616 | 0.471 | 0.579 | 0.421 | 0.552 | 0.476 | 0.518 | |
| GPT-3 | 0.498 | 0.415 | 0.341 | 0.429 | 0.485 | 0.605 | 0.62 | 0.794 |
| Model | Flu | Sta | β-lac | Sol | Sub | Bin | Cont | Fold | SSP | Yst | Hum | Aff | PDB | BDB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DDE | 0.638 | 0.652 | 0.623 | 59.77 | 49.17 | 77.43 | – | 9.57 | – | 55.83 | 62.77 | 2.908 | – | – |
| Moran | 0.400 | 0.322 | 0.375 | 57.73 | 31.13 | 55.63 | – | 7.10 | – | 53.00 | 54.67 | 2.984 | – | – |
| LSTM | 0.494 | 0.533 | 0.139 | 70.18 | 62.98 | 88.11 | 26.34 | 8.24 | 68.99 | 53.62 | 63.75 | 2.853 | 1.457 | 1.572 |
| Transformer | 0.643 | 0.649 | 0.261 | 70.12 | 56.02 | 75.74 | 17.50 | 8.52 | 59.62 | 54.12 | 59.58 | 2.499 | 1.455 | 1.566 |
| CNN | 0.682 | 0.637 | 0.781 | 64.43 | 58.73 | 82.67 | 10.00 | 10.93 | 66.07 | 55.07 | 62.60 | 2.796 | 1.376 | 1.497 |
| ResNet | 0.636 | 0.126 | 0.152 | 67.33 | 52.30 | 78.99 | 20.43 | 8.89 | 69.56 | 48.91 | 68.61 | 3.005 | 1.441 | 1.565 |
| ProtBert | 0.679 | 0.771 | 0.731 | 68.15 | 76.53 | 91.32 | 39.66 | 16.94 | 82.18 | 63.72 | 77.32 | 2.195 | 1.562 | 1.549 |
| ProtBert* | 0.339 | 0.697 | 0.616 | 59.17 | 59.44 | 81.54 | 24.35 | 10.74 | 62.51 | 53.87 | 83.61 | 2.996 | 1.457 | 1.649 |
| ESM-1b | 0.679 | 0.694 | 0.839 | 70.23 | 78.13 | 92.40 | 45.78 | 28.17 | 82.73 | 57.00 | 78.17 | 2.281 | 1.559 | 1.556 |
| ESM-1b* | 0.430 | 0.750 | 0.528 | 67.02 | 79.82 | 91.61 | 40.37 | 29.95 | 83.14 | 66.07 | 88.06 | 3.031 | 1.368 | 1.571 |
5.4.6. Benchmarks
- MedQA. The MedQA dataset consists of multiple-choice questions from the United States Medical Licensing Examination (USMLE). It covers general medical knowledge and includes 11,450 questions in the development set and 1,273 questions in the test set. Each question has 4 or 5 answer choices, and the dataset is designed to assess the medical knowledge and reasoning skills required for medical licensure in the United States.
- MedMCQA. MedMCQA is a large-scale multiple-choice QA dataset derived from Indian medical entrance examinations (AIIMS/NEET). It covers 2.4k healthcare topics and 21 medical subjects, with over 187,000 questions in the development set and 6,100 questions in the test set. Each question has 4 answer choices and is accompanied by an explanation. MedMCQA evaluates a model’s general medical knowledge and reasoning capabilities.
- PubMedQA. Unlike MedQA and MedMCQA, PubMedQA is a closed-domain QA dataset in which each question can be answered by consulting an associated context (a PubMed abstract). It consists of 1,000 expert-labeled question-answer pairs. Each question is accompanied by a PubMed abstract as context, and the task is to provide a yes/no/maybe answer based on the information in the abstract. The dataset is split into 500 questions for development and 500 for testing. PubMedQA assesses a model’s ability to comprehend and reason over scientific biomedical literature.
- Few-shot and Zero-shot Learning: Assessing the model’s ability to predict drug synergy with minimal or no training examples, highlighting the LLM’s capacity to generalize from limited data.
- Benchmarking Across Multiple Tissues: Testing the model’s predictive performance across seven different rare cancer tissue types to ensure robustness and generalizability.
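The benchmarks above all reduce to the same scoring loop: compare each model answer against the gold key and report accuracy. The sketch below shows that loop; `ask_model` is a hypothetical stand-in for a call to whatever LLM is under evaluation, and the sample questions are toy data.

```python
# Minimal sketch of accuracy scoring on multiple-choice medical QA
# benchmarks such as MedQA or MedMCQA. `ask_model` is a hypothetical
# callable standing in for the LLM being evaluated.

def score_multiple_choice(questions: list[dict], ask_model) -> float:
    """Fraction of questions where the model's chosen option matches the key."""
    correct = 0
    for q in questions:
        prediction = ask_model(q["question"], q["options"])  # e.g. returns "B"
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)

# Toy usage with a dummy "model" that always answers "A":
sample = [
    {"question": "Q1", "options": ["A", "B", "C", "D"], "answer": "A"},
    {"question": "Q2", "options": ["A", "B", "C", "D"], "answer": "C"},
]
acc = score_multiple_choice(sample, lambda q, opts: "A")  # 0.5
```

For PubMedQA, the same loop applies with the option set fixed to yes/no/maybe and the abstract prepended to the question.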
5.4.7. Discussion
- First, future LLM systems should aim to unify diverse biological modalities—including genomic sequences, protein structures, cell images, clinical time-series, and textual notes—within a cohesive multimodal framework. Such models can enable integrated diagnosis and prediction by capturing complex biological correlations across data types.
- Second, LLMs should evolve from passive tools into active hypothesis-generating agents. This requires coupling with laboratory automation systems, real-time EHR streams, and high-throughput simulation platforms. For instance, an LLM-guided robotic lab could autonomously design, test, and refine molecular hypotheses in closed experimental loops, dramatically accelerating the discovery cycle.
- Third, the training of LLMs should incorporate biologically-informed learning techniques. Self-distillation improves interpretability through reasoning chains, contrastive alignment ensures consistency with biomedical knowledge bases, and physics-informed regularization grounds models in biophysical laws (e.g., thermodynamics in MD simulations), reducing hallucinations and enhancing trustworthiness.
- Finally, proactive governance must be embedded from the outset. Techniques such as differential privacy for sensitive patient data, watermarking synthetic DNA sequences to differentiate them from natural ones, and rigorous human oversight mechanisms are crucial for ensuring ethical deployment. Building responsible AI systems is not an afterthought—it must be integral to model development.
5.5. Earth Sciences and Civil Engineering
5.5.1. Overview

- Outer Space. Also called planetary or space science, this domain explores celestial bodies and their relevance to Earth. Techniques include astronomical observation, space probe missions, spectroscopy, and sample return analysis [1431]. Comparative studies of planets, moons, and asteroids help scientists infer Earth’s formation and evolutionary history. Data from missions like those to Mars or the Moon, along with meteorite analysis, deepen our understanding of planetary geology, habitability, and solar system dynamics [1432,1433,1434].
- Natural System Complexity and Uncertainty. Earth systems are governed by highly nonlinear, interdependent physical processes—from tectonic shifts and groundwater flow to atmospheric circulation and sea-level dynamics [1404,1409]. Predicting these phenomena accurately requires not only long-term observational data but also high-resolution numerical simulations. Civil engineers must incorporate these uncertainties into infrastructure design—such as accounting for variable soil conditions or future climate extremes. However, LLMs, while capable of interpreting related literature or summarizing best practices, lack the ability to model dynamic systems governed by partial differential equations or simulate stochastic environmental behavior. Capturing such processes requires specialized physics-based modeling and empirical calibration, far beyond the capacity of current text-based AI systems.
- Data Gaps and Ground Truth Limitations. Earth and civil systems often involve inaccessible or hazardous environments (e.g., deep subsurface, remote ocean basins, or aging underground infrastructure), where direct measurement is difficult or incomplete [1416,1421]. As a result, both domains suffer from sparse and noisy datasets that hinder model calibration and decision-making. For example, limited geological borehole data may be insufficient for accurate subsurface mapping, and historical infrastructure records may be missing or outdated. While LLMs can help process available documents, they cannot compensate for missing sensor data, field measurements, or satellite coverage, which are essential for building reliable models or simulations.
- Physical Implementation and Experimental Validation. In civil engineering especially, the ultimate test of a solution lies in its physical realization—constructing structures, monitoring performance, and conducting stress tests under real-world conditions [1450,1459]. Earth scientists similarly depend on empirical fieldwork, such as drilling ice cores or deploying seismographs. These tasks involve tools, logistics, and materials that LLMs cannot access or control. While LLMs may assist in designing monitoring protocols or reviewing standards, they cannot carry out experiments or validate hypotheses in the field.
- Cross-scale Integration. A recurring challenge in both fields is integrating phenomena across vastly different spatial and temporal scales—for instance, linking microscale soil composition to large-scale slope stability, or connecting millennial-scale climate changes to today’s hydrological models [1405,1426]. This requires sophisticated multi-scale modeling, often coupling discrete and continuous frameworks, something well beyond the representational capacity of LLMs trained primarily on textual corpora. Such integration typically involves customized simulations and domain-specific algorithms informed by physics and engineering principles.
- Regulatory and Ethical Constraints. Civil engineers must design within strict regulatory, economic, and ethical frameworks—ensuring safety, sustainability, and community impact are considered [1448]. Earth scientists face ethical concerns in interventions like geoengineering or resource extraction. While LLMs can provide policy summaries or ethical perspectives from the literature, they cannot weigh trade-offs, assess context-specific risks, or make normative judgments. These decisions require human oversight, societal debate, and legal frameworks beyond what AI can resolve.
- Literature Review and Standards Summarization. Both fields rely on vast and highly technical bodies of knowledge, including regulatory documents, geological surveys, engineering design codes, and academic papers. LLMs can significantly streamline the literature review process by summarizing scientific reports, extracting key metrics, and comparing standards across regions [1251,1253]. For instance, an LLM could help an engineer quickly retrieve the seismic design requirements for bridges in a specific country or summarize recent advances in landslide risk prediction models.
- Interpreting Sensor and Field Reports. As structural health monitoring (SHM) and geoscience increasingly adopt IoT and sensor networks, LLMs can help translate raw sensor metadata or technician logs into structured, actionable insights [1448,1456]. They can assist in automatically annotating inspection reports, flagging anomalies in sensor readings, or identifying trends across multiple sources, especially when integrated with domain-specific tools.
- Assisting in Conceptual and Parametric Design. In civil engineering tasks such as structural layout or materials selection, LLMs can be useful in generating design suggestions, proposing parametric options, or reviewing existing case studies. For example, they can draft potential designs based on textual constraints or help identify suitable sustainable materials from engineering databases [1440,1447].
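The sensor-interpretation use case above typically pairs a simple statistical pre-filter with an LLM that writes the narrative report. The sketch below shows one such pre-filter, a z-score rule over strain readings; the threshold, field layout, and downstream LLM hand-off are all assumptions for illustration.

```python
# Illustrative sketch of anomaly flagging in an LLM-assisted structural
# health monitoring (SHM) pipeline: a z-score rule marks outlying sensor
# readings, which an LLM could then summarize alongside technician logs.
# The threshold and data are illustrative assumptions.
from statistics import mean, stdev

def flag_anomalies(readings: list[float], z_threshold: float = 2.0) -> list[int]:
    """Return indices of readings more than z_threshold std devs from the mean."""
    mu, sigma = mean(readings), stdev(readings)
    if sigma == 0:
        return []
    return [i for i, r in enumerate(readings) if abs(r - mu) / sigma > z_threshold]

strain = [1.01, 0.99, 1.02, 0.98, 1.00, 5.70, 1.01]
anomalous = flag_anomalies(strain)  # [5]
# Each flagged index could then be passed to an LLM together with the
# matching inspection log entry for narrative reporting.
```

Keeping the numerical detection rule outside the LLM avoids asking a text model to do signal processing; the model only explains and contextualizes what the filter found.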

- Model compatibility: Clearly distinguishes tasks suitable for LLMs (e.g., text synthesis) versus those requiring numerical simulation or high-dimensional optimization [1276].
- Cross-domain transferability: Highlights common computational structures across environmental and infrastructure problems, facilitating the reuse of AI workflows.
- Pipeline orchestration: Supports hybrid modeling pipelines where LLMs coordinate, explain, or augment scientific and engineering workflows using diverse data types.
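The orchestration pattern in the last bullet can be sketched as a tool-routing loop: the agent maps a natural-language request onto one of several registered tools, as GeoGPT-style agents do. Here a keyword lookup stands in for the LLM's routing decision so the flow is runnable; the tool names and behaviors are invented for illustration.

```python
# Toy sketch of LLM-driven pipeline orchestration: route a natural-language
# request to a registered geospatial tool. Real agents (e.g. GeoGPT-style
# systems) use an LLM for this selection; a keyword match stands in here.

def clip_raster(region: str) -> str:
    return f"raster clipped to {region}"          # GIS operation

def run_flood_sim(region: str) -> str:
    return f"flood simulation queued for {region}"  # numerical simulation

TOOLS = {
    "clip": clip_raster,
    "flood": run_flood_sim,
}

def route_request(request: str, region: str) -> str:
    """Pick a tool by keyword; an LLM would make this selection in practice."""
    for keyword, tool in TOOLS.items():
        if keyword in request.lower():
            return tool(region)
    return "no matching tool; fall back to LLM text answer"

result = route_request("Please clip the DEM raster", "Basin A")
```

The division of labor mirrors the "model compatibility" point above: the LLM handles language-level selection and explanation, while numerical work stays in dedicated tools.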
| Task Category | LLM Application Areas | Use Case-Inspired Research Question | Key Insights and Contributions | References |
|---|---|---|---|---|
| Geospatial and Environmental Data Tasks | Geospatial Data Analysis | Can LLMs help interpret and interact with complex geospatial datasets like GIS and remote sensing imagery? | LLMs and VLMs assist geospatial tasks; specialized models (e.g., RSGPT, GeoChat, EarthGPT) improve visual understanding and accessibility. | [1469,1470,1471,1472,1473] |
| | Tool Selection and Program Generation | How can LLMs act as decision-makers to select appropriate geospatial tools and automate workflows? | Agents like GeoGPT, GeoLLM-Engine, and GeoAgent automate tool selection and program generation, enhancing task execution and reducing user burden. | [1474,1475,1476,1477] |

| Engineering Simulation and Physical Modeling Tasks | Simulation Code Generation | Can LLMs generate accurate simulation scripts for civil and earth science modeling tasks? | Early efforts (e.g., Eplus-LLM, HydroSuite-AI) enable natural language-driven code generation; challenges remain in handling complex, interdependent systems. | [1478,1479,1480,1481] |
| | Automated Simulation Interfaces | Can LLMs streamline simulation configuration and execution through natural language interaction? | Systems like ChatSUMO and GPT-based simulation platforms show early success, but complex modeling remains difficult. | [1482,1483] |
| Textual and Document-Centric Tasks | Domain-Specific Knowledge Extraction | How can LLMs assist in extracting and structuring information from geoscientific and engineering texts? | Customized LLMs (e.g., GeoBERT, BB-GeoGPT) enhance information retrieval and QA; hallucination and data quality issues persist. | [1484,1485,1486] |
| | Compliance Checking and Reporting | Can LLMs automate regulatory compliance checks and generate inspection reports? | Frameworks like AutoRepo and LLM-FuncMapper automate compliance interpretation; combining multimodal LLMs and ontology models enhances performance. | [1487,1488,1489] |
| Monitoring and Predictive Maintenance Tasks (Hybrid) | O&M Data Processing and Optimization | Can LLMs support monitoring and maintenance through real-time data analysis and strategy optimization? | RAG-enhanced multi-agent systems and LLMs enrich BIM models, optimize energy usage, and support digital twin development. | [1490,1491,1492,1493] |
| | Digital Twin Integration | How can LLMs contribute to the creation and operation of digital twins for infrastructure health monitoring? | LLMs facilitate real-time data collection and simulation in digital twins across sectors like transport, railways, and water networks. | [1494,1495,1496] |
| Design and Planning Tasks | Building and Infrastructure Design | Can LLMs assist engineers and planners in creating optimized design solutions from natural language inputs? | Early systems like NADIA and PlanGPT use LLMs for building simulation and urban planning; still limited in complex layout and sequential action handling. | [1497,1498,1499] |
| | Urban and Transportation Planning | How can LLMs support participatory urban design and traffic system optimization? | PlanAgent and TrafficGPT show LLMs’ potential in participatory planning and decision support; broader adoption needs further research. | [1500,1501,1502] |
5.5.2. Geospatial and Environmental Data Tasks
5.5.3. Engineering Simulation and Physical Modeling Tasks
5.5.4. Textual and Document-Centric Tasks
5.5.5. Monitoring and Predictive Maintenance Tasks
5.5.6. Design and Planning Tasks
5.5.7. Benchmarks
5.5.8. Discussion
- Challenges and Limitations. Despite these opportunities, fundamental limitations persist when applying LLMs to earth sciences and civil engineering. Physical system modeling—ranging from seismic simulation to hydrodynamic forecasting—requires accurate, high-dimensional numerical computations governed by partial differential equations [1426]. Current LLMs lack the inductive biases, precision, and physical fidelity needed for such tasks, relegating them to supporting roles in workflow orchestration rather than core simulation.
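To ground the point about numerical computation, the sketch below shows a single explicit finite-difference step for the 1D heat equation u_t = alpha·u_xx — a deliberately simple instance of the PDE-driven, precision-sensitive computation the paragraph argues should remain in dedicated solvers rather than LLMs. The grid and coefficients are illustrative.

```python
def heat_step(u, alpha, dx, dt):
    """One explicit finite-difference step of the 1D heat equation
    u_t = alpha * u_xx, with fixed (Dirichlet) boundary values."""
    r = alpha * dt / dx**2  # stability requires r <= 0.5
    return [u[0]] + [
        u[i] + r * (u[i + 1] - 2 * u[i] + u[i - 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]

u0 = [0.0, 0.0, 1.0, 0.0, 0.0]  # initial temperature spike
print(heat_step(u0, alpha=1.0, dx=1.0, dt=0.25))
```

Even this toy step is sensitive to the stability ratio r; violating r <= 0.5 makes the scheme diverge, which is exactly the kind of quantitative constraint token-level generation cannot enforce.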
- Research Directions. To maximize the impact of LLMs while respecting domain constraints, several promising research directions emerge:
- Domain-Specific Fine-Tuning and Hybridization. Building specialized LLMs for Earth sciences and civil engineering—such as those incorporating geospatial, structural, and regulatory corpora [1486,1529]—can enhance factual grounding and reasoning capabilities, particularly when paired with physics-based simulators.
- Ethical AI Frameworks for Infrastructure and Environmental Decisions. Future research must address bias mitigation, explainability, regulatory compliance, and human-centered design when deploying LLM-augmented systems in critical infrastructure and environmental governance [1489].
5.6. Computer Science and Electrical Engineering
5.6.1. Overview

- Waterfall Model: It represents a linear, sequential approach where each phase of the program development lifecycle, from requirements gathering to maintenance, must be completed before the next phase can begin [1543,1544,1545]. This model assumes that all requirements can be clearly defined and understood at the outset of the project. While straightforward and easy to manage for small projects with stable requirements, the Waterfall model struggles with complex projects or those where requirements are likely to change [1542]. Its rigidity makes it difficult to accommodate new requirements or to revisit earlier phases once they are completed, leading to potential issues and increased costs if errors or omissions are discovered late in the development cycle [1546]. The lack of early prototypes also hinders client feedback, potentially resulting in a final product that does not fully meet user needs.
- Spiral Model: It offers a risk-driven approach that combines elements of the Waterfall model with iterative development [1545,1547]. Projects in the Spiral model progress through several iterations, with each iteration involving planning, risk analysis, engineering, and evaluation. By explicitly addressing risks at each stage, this model is better suited for complex projects with high potential risks compared to the linear Waterfall model [1544,1548]. The cyclical nature allows for revisiting and refining requirements and designs based on ongoing risk assessment, leading to a more adaptable outcome for projects where risks are a significant concern [1545]. The success of the Spiral model heavily relies on accurate and thorough risk assessments at each iteration. If risk analysis is inadequate or incorrect, the project may suffer from poor planning and execution. Moreover, specialized knowledge in risk management is essential, but this expertise is not always available within every team.
- V-Model: This is an extension of the Waterfall model that emphasizes the relationship between each development phase and a corresponding testing phase, focusing on verification and validation [1549,1550]. Executed in a V-shaped sequence, this model ensures that testing activities are planned and executed in parallel with the development stages, highlighting the importance of quality assurance throughout the lifecycle. By linking verification and validation to each stage, the V-Model aims to improve software quality and ensure that the final product meets both the specified requirements and the user’s needs.
- Incremental Approach: This approach involves developing a program in a series of functional increments [1551]. Each increment adds more functionality to the previous ones, allowing for early delivery of working software to the client and the incorporation of feedback in subsequent increments. This approach offers greater flexibility compared to the Waterfall model, enabling the development team to adapt to changing requirements and reduce the risk associated with large, monolithic development efforts by delivering working software in smaller, manageable parts [1542].
- Unified Process Model: This model is an iterative and incremental program development framework driven by use cases, centered around architecture, and supporting component-based design [1551]. Utilizing the Unified Modeling Language (UML) to represent various phases and deliverables, the unified process model emphasizes iterative development, where the software evolves over multiple cycles and is built piece by piece. This framework provides a flexible and adaptable approach suitable for a wide range of software projects, focusing on continuous improvement and active stakeholder involvement through use-case-driven development and an architectural focus that promotes a robust and scalable system design.
- Schematic Capture: It involves the graphical creation of a circuit diagram using Electronic Design Automation (EDA) tools [1558,1559]. In this approach, designers place and connect symbols that represent electronic components, visually constructing the circuit’s architecture and interconnections [1560]. While schematic capture provides an intuitive way to represent smaller circuits, it becomes increasingly cumbersome and less scalable for complex designs containing a large number of components [1561]. Managing the intricate web of interconnections and ensuring accuracy through a purely graphical representation can be challenging for large integrated circuits.
- Hardware Description Languages (HDLs): HDLs, such as VHSIC HDL (VHDL) [1562] and Verilog [1557], are textual languages used to describe the behavior and structure of digital circuits at various levels of abstraction, ranging from the high-level Register Transfer Level down to the gate level. HDLs allow circuit designers to describe the flow of digital signals and the logical operations performed among them [1563,1564]. They offer a more scalable and manageable approach for designing complex digital systems compared to schematic capture, as the textual representation facilitates the use of automated EDA tools for simulation and synthesis. VHDL, initially developed for military applications, is known for its strong typing and structured syntax, making it suitable for large and critical projects. Verilog, on the other hand, gained widespread adoption in the semiconductor industry for the design and verification of Application-Specific Integrated Circuits (ASICs) and Field-Programmable Gate Arrays (FPGAs) due to its relative simplicity and ease of use.
- Simulation: Simulation techniques are crucial in traditional digital circuit design for verifying the functionality and timing of a circuit described in an HDL before it is physically implemented. This process involves creating test benches, which are sets of input stimuli applied to the circuit model, and observing the resulting output responses to ensure the design behaves as intended. While essential for detecting design errors early in the development cycle, simulating large and complex circuits can be computationally intensive and time-consuming. Thorough verification requires the careful planning and execution of extensive test cases to cover all possible operating conditions and scenarios, which can be a significant challenge for intricate designs.
- Physical Layout Design: At this stage, the abstract circuit design is transformed into a physical implementation on a substrate, such as a silicon die for an integrated circuit or a printed circuit board (PCB). This process involves determining the precise placement of components and the routing of the conductive interconnections between them, taking into account various factors such as signal integrity, power distribution, thermal management, and manufacturability. The physical layout significantly impacts the final performance, reliability, and cost of the circuit. Traditionally, achieving an efficient and effective physical layout, especially for high-performance designs, requires significant manual effort, expertise, and often involves iterative refinement to optimize for various conflicting requirements like area, speed, and power consumption.
5.6.2. Code Generation Tasks
5.6.3. Code Assistant in Debugging
5.6.4. Code Analysis in Codebases
5.6.5. Hardware Description Language Code Generation
5.6.6. Functional Verification
5.6.7. High-Level Synthesis
| Benchmark | Approach | EC (↓) | FF (↓) | LUT (↓) | Slice (↓) | DSP (↓) | BRAM (↓) | Power (↓) | CP (↓) |
|---|---|---|---|---|---|---|---|---|---|
| syrk | LLM | 1,859,983 | 954 | 5300 | 1649 | 6 | 8 | 0.164 | 9.934 |
| | HLS | 3,744,260 | 662 | 521 | 221 | 5 | 32 | 0.350 | 8.191 |
| syr2k | LLM | 2,125,846 | 542 | 472 | 197 | 2 | 12 | 0.181 | 9.446 |
| | HLS | 9,028,229 | 1042 | 960 | 1649 | 5 | 56 | 0.383 | 6.872 |
| mvt | LLM | 44,996 | 8628 | 2663 | 4404 | 2 | 4 | 0.197 | 9.312 |
| | HLS | 119,492 | 713 | 991 | 342 | 5 | 12 | 0.332 | 6.550 |
| k3mm | LLM | 2,371,593 | 623 | 328 | 236 | 2 | 28 | 0.207 | 9.924 |
| | HLS | 10,277,509 | 927 | 956 | 1649 | 5 | 56 | 0.398 | 6.646 |
| k2mm | LLM | 1,863,816 | 537 | 311 | 202 | 2 | 20 | 0.189 | 9.967 |
| | HLS | 7,963,269 | 929 | 659 | 313 | 5 | 56 | 0.400 | 6.814 |
| gesummv | LLM | 65,991 | 437 | 288 | 170 | 2 | 16 | 0.176 | 9.253 |
| | HLS | 148,805 | 795 | 561 | 228 | 5 | 20 | 0.316 | 6.855 |
| gemm | LLM | 1,601,739 | 488 | 1702 | 300 | 2 | 16 | 0.178 | 9.697 |
| | HLS | 4,542,980 | 1193 | 991 | 342 | 5 | 56 | 0.359 | 6.551 |
| bicg | LLM | 46,478 | 1041 | 274 | 208 | 2 | 12 | 0.186 | 9.251 |
| | HLS | 119,492 | 791 | 451 | 203 | 5 | 12 | 0.333 | 6.599 |
| atax | LLM | 57,669 | 453 | 221 | 208 | 2 | 11 | 0.167 | 9.952 |
| | HLS | 119,492 | 791 | 451 | 203 | 5 | 11 | 0.309 | 6.573 |
| Taxonomy | Contributions | Representative Works | Citations |
|---|---|---|---|
| Code Generation Tasks | LLMs facilitate code generation (text to code), code completion (code snippet to code), and code commenting (docstrings). | EvalPlus [1598]: a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. | [1565,1566,1567,1568,1598,1599,1600,1601] |
| Code Assistant in Debugging | Automates static and dynamic investigation of programs for potential bugs, enabling efficient debugging. | Self-Debugging [1602]: an LLM training framework that teaches a model to debug its predicted program via few-shot demonstrations. | [1571,1572,1573,1574,1575,1576,1577,1602,1603,1604,1605] |
| Code Analysis in Codebases | LLMs assist developers with code summarization, search, and translation in (large) codebases. | Paheli [1578] | [1578,1579,1580] |
| HDL Code Generation | EDA for hardware design: code generation by translating natural language descriptions of hardware functionality into Verilog or VHDL. | Verigen [1606], flipSyrup [1607], AutoChip [1581] | [1581,1582,1584,1585,1586,1587,1607,1608,1609,1610,1611,1612] |
| Functional Verification | LLMs assist with functional verification by generating testbenches and assertions, and potentially aid in formal verification. | AssertLLM [1590]: an automatic assertion generation framework that processes complete specification files of circuit designs. | [1589,1590,1591,1592,1593,1594,1595,1613,1614,1615] |
| High-Level Synthesis | LLMs help translate code into HLS-compatible form or guide HLS tools to produce efficient hardware implementations. | HLSPilot [1596]: the first LLM-enabled high-level synthesis framework that fully automates high-level application acceleration on hybrid CPU-FPGA architectures. | [1583,1588,1597,1616] |
5.6.8. Benchmarks
5.6.9. Discussion
References
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Mesko, B. The ChatGPT (generative artificial intelligence) revolution has made artificial intelligence approachable for medical professionals. Journal of medical Internet research 2023, 25, e48392. [Google Scholar] [CrossRef]
- Yuan, L.; Chen, D.; Chen, Y.L.; Codella, N.; Dai, X.; Gao, J.; Hu, H.; Huang, X.; Li, B.; Li, C.; et al. Florence: A new foundation model for computer vision. arXiv 2021, arXiv:2111.11432. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems 2017, 30. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I.; et al. Improving language understanding by generative pre-training 2018.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Advances in neural information processing systems 2020. [Google Scholar]
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019. [Google Scholar]
- Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar] [CrossRef]
- Sanh, V.; Webson, A.; Raffel, C.; Bach, S.H.; Sutawika, L.; Alyafeai, Z.; Chaffin, A.; Stiegler, A.; Scao, T.L.; Raja, A.; et al. Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv 2021, arXiv:2110.08207. [Google Scholar] [CrossRef]
- Lin, Z.; Madotto, A.; Fung, P. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv 2020, arXiv:2004.03829. [Google Scholar] [CrossRef]
- Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. [Google Scholar] [CrossRef]
- Pahune, S.; Chandrasekharan, M. Several categories of large language models (llms): A short survey. arXiv 2023, arXiv:2307.10188. [Google Scholar] [CrossRef]
- Patil, A. Advancing Reasoning in Large Language Models: Promising Methods and Approaches. arXiv 2025, arXiv:2502.03671. [Google Scholar] [CrossRef]
- Kerner, T. Domain-Specific Pretraining of Language Models: A Comparative Study in the Medical Field. arXiv 2024, arXiv:2407.14076. [Google Scholar] [CrossRef]
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
- Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D.; et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 2022, 35, 24824–24837. [Google Scholar]
- OpenAI. Introducing OpenAI o1. OpenAI 2024. [Google Scholar]
- Guo, D.; Yang, D.; Zhang, H.; Song, J.; Wang, P.; Zhu, Q.; Xu, R.; Zhang, R.; Ma, S.; Bi, X.; et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 2025, 645, 633–638. [Google Scholar] [CrossRef] [PubMed]
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2025, arXiv:2303.18223. [Google Scholar] [CrossRef]
- Jelinek, F. Statistical methods for speech recognition; MIT press, 1998. [Google Scholar]
- Rosenfeld, R. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE 2000, 88, 1270–1278. [Google Scholar] [CrossRef]
- Stolcke, A.; et al. SRILM-an extensible language modeling toolkit. In Proceedings of the Interspeech; 2002; Vol. 2002, p. 2002. [Google Scholar]
- Brown, P.F.; Della Pietra, S.A.; Della Pietra, V.J.; Mercer, R.L. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 1993, 19, 263–311. [Google Scholar]
- Brown, P.; Cocke, J.; Della Pietra, S.; Della Pietra, V.; Jelinek, F.; Mercer, R.; Roossin, P. A Statistical Approach to Language Translation. In Proceedings of the Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics; 1988. [Google Scholar]
- Kilgarriff, A.; Grefenstette, G. Introduction to the special issue on the web as corpus. Computational linguistics 2003, 29, 333–347. [Google Scholar] [CrossRef]
- Resnik, P.; Smith, N.A. The web as a parallel corpus. Computational Linguistics 2003, 29, 349–380. [Google Scholar] [CrossRef]
- Banko, M.; Brill, E. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001; pp. 26–33.
- Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C. A neural probabilistic language model. Journal of machine learning research 2003, 3, 1137–1155. [Google Scholar]
- Mikolov, T.; Karafiát, M.; Burget, L.; Cernockỳ, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Interspeech. Makuhari; 2010; Vol. 2, pp. 1045–1048. [Google Scholar]
- Kombrink, S.; Mikolov, T.; Karafiát, M.; Burget, L. Recurrent neural network based language modeling in meeting recognition. In Proceedings of the Interspeech; 2011; Vol. 11, pp. 2877–2880. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural computation 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Wang, S.; Jiang, J. Learning Natural Language Inference with LSTM. arXiv 2016, arXiv:1512.08849. [Google Scholar] [CrossRef] [PubMed]
- Tarwani, K.M.; Edem, S. Survey on recurrent neural network in natural language processing. Int. J. Eng. Trends Technol 2017, 48, 301–304. [Google Scholar] [CrossRef]
- Mienye, I.D.; Swart, T.G.; Obaido, G. Recurrent neural networks: A comprehensive review of architectures, variants, and applications. Information 2024, 15, 517. [Google Scholar] [CrossRef]
- Google. Found in translation: More accurate, fluent sentences in Google Translate. Google Blog 2016.
- Google. Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System. Google Blog 2016.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 2020. [Google Scholar] [CrossRef]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar] [CrossRef]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
- Weizenbaum, J. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM 1966, 9, 36–45. [Google Scholar] [CrossRef]
- Brown, P.F.; Della Pietra, V.J.; Desouza, P.V.; Lai, J.C.; Mercer, R.L. Class-based n-gram models of natural language. Computational linguistics 1992, 18, 467–480. [Google Scholar]
- Johnson, M. PCFG models of linguistic tree representations. Computational Linguistics 1998, 24, 613–632. [Google Scholar]
- Mikolov, T.; Zweig, G. Context dependent recurrent neural network language model. In Proceedings of the 2012 IEEE Spoken Language Technology Workshop (SLT); 2012; pp. 234–239. [Google Scholar] [CrossRef]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Advances in neural information processing systems 2014, 27. [Google Scholar]
- OpenAI. Introducing ChatGPT. OpenAI Blog, 2022.
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.D.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research 2024, 25, 1–53. [Google Scholar] [CrossRef]
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Hanna, E.B.; Bressand, F.; et al. Mixtral of experts. arXiv 2024, arXiv:2401.04088. [Google Scholar] [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar] [CrossRef]
- Rae, J.W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoffmann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Young, S.; et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv 2021, arXiv:2112.11446. [Google Scholar] [CrossRef]
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
- Kang, S.; Yoon, J.; Yoo, S. Large language models are few-shot testers: Exploring llm-based general bug reproduction. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE; 2023; pp. 2312–2323. [Google Scholar]
- Song, C.H.; Wu, J.; Washington, C.; Sadler, B.M.; Chao, W.L.; Su, Y. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023; pp. 2998–3009.
- Wang, Y.; Zhong, W.; Li, L.; Mi, F.; Zeng, X.; Huang, W.; Shang, L.; Jiang, X.; Liu, Q. Aligning large language models with human: A survey. arXiv 2023, arXiv:2307.12966. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903. [Google Scholar] [CrossRef]
- Wang, B.; Min, S.; Deng, X.; Shen, J.; Wu, Y.; Zettlemoyer, L.; Sun, H. Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv 2022, arXiv:2212.10001. [Google Scholar] [CrossRef]
- Huang, J.; Chang, K.C.C. Towards reasoning in large language models: A survey. arXiv 2022, arXiv:2212.10403. [Google Scholar] [CrossRef]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv 2023, arXiv:2203.11171. [Google Scholar] [CrossRef]
- OpenAI. GPT-4.5 System Card, 2025.
- OpenAI. GPT-5 System Card. Technical report, OpenAI, 2025.
- Claude. The Claude 3 Model Family: Opus, Sonnet, Haiku, 2024.
- Claude. Claude 3.7 Sonnet System Card, 2025.
- Koray Kavukcuoglu, CTO, Google DeepMind, on behalf of the Gemini team. Gemini 2.0 model updates: 2.0 Flash, Flash-Lite, Pro Experimental. Accessed: 2025-3-22.
- OpenAI. gpt-oss-120b & gpt-oss-20b Model Card. arXiv 2025, arXiv:2508.10925. [Google Scholar] [CrossRef]
- Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 Technical Report. arXiv 2024, arXiv:2407.10671. [Google Scholar] [CrossRef]
- Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2. 5 technical report. arXiv 2024, arXiv:2412.15115. [Google Scholar] [CrossRef]
- Team, Q. QwQ-32B: Embracing the Power of Reinforcement Learning, 2025.
- Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar] [CrossRef]
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
- Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. Gpt (generative pre-trained transformer)–a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access 2024. [Google Scholar] [CrossRef]
- Yenduri, G.; Srivastava, G.; Maddikunta, P.K.R.; Jhaveri, R.H.; Wang, W.; Vasilakos, A.V.; Gadekallu, T.R.; et al. Generative pre-trained transformer: A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. arXiv 2023, arXiv:2305.10435. [Google Scholar] [CrossRef]
- Kaiser, L.; Gomez, A.N.; Shazeer, N.; Vaswani, A.; Parmar, N.; Jones, L.; Uszkoreit, J. One model to learn them all. arXiv 2017, arXiv:1706.05137. [Google Scholar] [CrossRef]
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International conference on machine learning. PMLR; 2017; pp. 1126–1135. [Google Scholar]
- McCann, B.; Keskar, N.S.; Xiong, C.; Socher, R. The natural language decathlon: Multitask learning as question answering. arXiv 2018, arXiv:1806.08730. [Google Scholar] [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems 2022, 35, 27730–27744. [Google Scholar]
- OpenAI. Our Approach to Alignment Research. OpenAI Blog, 2022.
- Claude. Unified Model Context Interaction Protocol, 2025.
- OpenAI. GPT-4o System Card, 2024.
- OpenAI. GPT-4o mini: advancing cost-efficient intelligence, 2024.
- OpenAI. Reasoning models. Accessed: 2025-3-22.
- Claude. Claude’s Constitution, 2023.
- Anthropic. Introducing the next generation of Claude, 2024.
- Sundar Pichai, CEO of Google and Alphabet. Gemini: Google’s largest and most capable AI model. Accessed: 2025-3-22.
- xAI. Grok-3. Accessed: 2025-3-22.
- Ray, T. ChatGPT is ’not particularly innovative,’ and ’nothing revolutionary’, says Meta’s chief AI scientist, 2023.
- Badminton, N. Meta’s Yann LeCun on auto-regressive Large Language Models (LLMs), 2023.
- Ainslie, J.; Lee-Thorp, J.; De Jong, M.; Zemlyanskiy, Y.; Lebrón, F.; Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv 2023, arXiv:2305.13245. [Google Scholar] [CrossRef]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the International conference on machine learning. PMLR; 2017; pp. 933–941. [Google Scholar]
- Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
- Baldacchino, T.; Cross, E.J.; Worden, K.; Rowson, J. Variational Bayesian mixture of experts models and sensitivity analysis for nonlinear dynamical systems. Mechanical Systems and Signal Processing 2016, 66, 178–200. [Google Scholar] [CrossRef]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 2019, 32. [Google Scholar]
- Keskar, N.S.; McCann, B.; Varshney, L.R.; Xiong, C.; Socher, R. CTRL: A conditional transformer language model for controllable generation. arXiv 2019, arXiv:1909.05858. [Google Scholar] [CrossRef]
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems 2022. [Google Scholar]
- Snell, C.; Lee, J.; Xu, K.; Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv 2024, arXiv:2408.03314. [Google Scholar] [CrossRef]
- Ji, Y.; Li, J.; Ye, H.; Wu, K.; Xu, J.; Mo, L.; Zhang, M. Test-time Computing: from System-1 Thinking to System-2 Thinking. arXiv 2025, arXiv:2501.02497. [Google Scholar] [CrossRef]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems 2020. [Google Scholar]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997. [Google Scholar] [CrossRef]
- Yang, R.; Song, L.; Li, Y.; Zhao, S.; Ge, Y.; Li, X.; Shan, Y. GPT4Tools: Teaching large language model to use tools via self-instruction. Advances in Neural Information Processing Systems 2023. [Google Scholar]
- Xiong, H.; Bian, J.; Li, Y.; Li, X.; Du, M.; Wang, S.; Yin, D.; Helal, S. When search engine services meet large language models: visions and challenges. IEEE Transactions on Services Computing 2024. [Google Scholar] [CrossRef]
- Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing; 2013. [Google Scholar]
- Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies; 2011. [Google Scholar]
- Yelp. Yelp Open Dataset, 2015.
- Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003; 2003. [Google Scholar]
- Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; Manning, C.D. Position-aware Attention and Supervised Data Improve Slot Filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing; 2017. [Google Scholar]
- Etzioni, O.; Banko, M.; Soderland, S.; Weld, D.S. Open information extraction from the web. Communications of the ACM 2008. [Google Scholar] [CrossRef]
- Levesque, H.J.; Davis, E.; Morgenstern, L. The Winograd schema challenge. In Proceedings of KR 2012; 2012. [Google Scholar]
- Sakaguchi, K.; Bras, R.L.; Bhagavatula, C.; Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM 2021. [Google Scholar] [CrossRef]
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 2019. [Google Scholar]
- Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching machines to read and comprehend. Advances in neural information processing systems 2015. [Google Scholar]
- Narayan, S.; Cohen, S.B.; Lapata, M. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv 2018, arXiv:1808.08745. [Google Scholar] [CrossRef]
- Gliwa, B.; Mochol, I.; Biesek, M.; Wawer, A. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization; 2019. [Google Scholar]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv 2016, arXiv:1606.05250. [Google Scholar] [CrossRef]
- Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 2019. [Google Scholar] [CrossRef]
- Joshi, M.; Choi, E.; Weld, D.S.; Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv 2017, arXiv:1705.03551. [Google Scholar] [CrossRef]
- Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv 2018, arXiv:1809.09600. [Google Scholar] [CrossRef]
- Clark, C.; Lee, K.; Chang, M.W.; Kwiatkowski, T.; Collins, M.; Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv 2019, arXiv:1905.10044. [Google Scholar] [CrossRef]
- Rao, S.; Tetreault, J. Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); 2018. [Google Scholar]
- Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, N.Q.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2016. [Google Scholar]
- Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Leveling, J.; Monz, C.; Pecina, P.; Post, M.; Saint-Amand, H.; et al. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation; 2014. [Google Scholar]
- Cettolo, M.; Girardi, C.; Federico, M. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation; 2012. [Google Scholar]
- Goyal, N.; Gao, C.; Chaudhary, V.; Chen, P.J.; Wenzek, G.; Ju, D.; Krishnan, S.; Ranzato, M.; Guzmán, F.; Fan, A. The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the Association for Computational Linguistics 2022. [Google Scholar] [CrossRef]
- Austin, J.; Odena, A.; Nye, M.; Bosma, M.; Michalewski, H.; Dohan, D.; Jiang, E.; Cai, C.; Terry, M.; Le, Q.; et al. Program synthesis with large language models. arXiv 2021, arXiv:2108.07732. [Google Scholar] [CrossRef]
- Hendrycks, D.; Basart, S.; Kadavath, S.; Mazeika, M.; Arora, A.; Guo, E.; Burns, C.; Puranik, S.; He, H.; Song, D.; et al. Measuring coding challenge competence with apps. arXiv 2021, arXiv:2105.09938. [Google Scholar] [CrossRef]
- Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv 2021, arXiv:2102.04664. [Google Scholar] [CrossRef]
- Jain, N.; Han, K.; Gu, A.; Li, W.D.; Yan, F.; Zhang, T.; Wang, S.; Solar-Lezama, A.; Sen, K.; Stoica, I. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv 2024, arXiv:2403.07974. [Google Scholar] [CrossRef]
- Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics 2022. [Google Scholar] [CrossRef]
- Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics 2021. [Google Scholar] [CrossRef]
- Mihaylov, T.; Clark, P.; Khot, T.; Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv 2018, arXiv:1809.02789. [Google Scholar] [CrossRef]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv 2020, arXiv:2009.03300. [Google Scholar] [CrossRef]
- Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv 2022, arXiv:2206.04615. [Google Scholar] [CrossRef]
- Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H.W.; Chowdhery, A.; Le, Q.V.; Chi, E.H.; Zhou, D.; et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv 2022, arXiv:2210.09261. [Google Scholar] [CrossRef]
- Liu, J.; Cui, L.; Liu, H.; Huang, D.; Wang, Y.; Zhang, Y. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. arXiv 2020, arXiv:2007.08124. [Google Scholar] [CrossRef]
- Yu, W.; Jiang, Z.; Dong, Y.; Feng, J. ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning. In Proceedings of the International Conference on Learning Representations; 2020. [Google Scholar]
- Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv 2021, arXiv:2110.14168. [Google Scholar] [CrossRef]
- Chollet, F.; Knoop, M.; Kamradt, G.; Landers, B. ARC Prize 2024: Technical report. arXiv 2024, arXiv:2412.04604. [Google Scholar] [CrossRef]
- Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv 2021, arXiv:2103.03874. [Google Scholar] [CrossRef]
- Talmor, A.; Herzig, J.; Lourie, N.; Berant, J. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019. [Google Scholar]
- Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv 2019, arXiv:1905.07830. [Google Scholar] [CrossRef]
- Bisk, Y.; Zellers, R.; Gao, J.; Choi, Y.; et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence; 2020. [Google Scholar]
- Sap, M.; Rashkin, H.; Chen, D.; Le Bras, R.; Choi, Y. Social IQa: Commonsense Reasoning about Social Interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019. [Google Scholar]
- Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Kelcey, M.; Devlin, J.; et al. Natural Questions: a Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 2019. [Google Scholar] [CrossRef]
- Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing; 2013. [Google Scholar]
- Petroni, F.; Piktus, A.; Fan, A.; Lewis, P.; Yazdani, M.; De Cao, N.; Thorne, J.; Jernite, Y.; Karpukhin, V.; Maillard, J.; et al. KILT: a benchmark for knowledge intensive language tasks. arXiv 2020, arXiv:2009.02252. [Google Scholar] [CrossRef]
- Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic evaluation of language models. arXiv 2022, arXiv:2211.09110. [Google Scholar] [CrossRef]
- Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring how models mimic human falsehoods. arXiv 2021, arXiv:2109.07958. [Google Scholar] [CrossRef]
- Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv 2019, arXiv:1903.00161. [Google Scholar] [CrossRef]
- Zhu, F.; Lei, W.; Huang, Y.; Wang, C.; Zhang, S.; Lv, J.; Feng, F.; Chua, T.S. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); 2021. [Google Scholar]
- Xu, Q.; Hong, F.; Li, B.; Hu, C.; Chen, Z.; Zhang, J. On the tool manipulation capability of open-source large language models. arXiv 2023, arXiv:2305.16504. [Google Scholar] [CrossRef]
- Li, M.; Zhao, Y.; Yu, B.; Song, F.; Li, H.; Yu, H.; Li, Z.; Huang, F.; Li, Y. API-Bank: A comprehensive benchmark for tool-augmented LLMs. arXiv 2023, arXiv:2304.08244. [Google Scholar] [CrossRef]
- Reddy, S.; Chen, D.; Manning, C.D. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics 2019. [Google Scholar] [CrossRef]
- Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.t.; Choi, Y.; Liang, P.; Zettlemoyer, L. QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; 2018. [Google Scholar]
- Bai, G.; Liu, J.; Bu, X.; He, Y.; Liu, J.; Zhou, Z.; Lin, Z.; Su, W.; Ge, T.; Zheng, B.; et al. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv 2024, arXiv:2402.14762. [Google Scholar] [CrossRef]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; 2002. [Google Scholar]
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text summarization branches out; 2004. [Google Scholar]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar] [CrossRef]
- Rauh, M.; Mellor, J.; Uesato, J.; Huang, P.S.; Welbl, J.; Weidinger, L.; Dathathri, S.; Glaese, A.; Irving, G.; Gabriel, I.; et al. Characteristics of harmful text: Towards rigorous benchmarking of language models. Advances in Neural Information Processing Systems 2022. [Google Scholar]
- Wang, B.; Chen, W.; Pei, H.; Xie, C.; Kang, M.; Zhang, C.; Xu, C.; Xiong, Z.; Dutta, R.; Schaeffer, R.; et al. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In Proceedings of the NeurIPS; 2023. [Google Scholar]
- Cui, S.; Zhang, Z.; Chen, Y.; Zhang, W.; Liu, T.; Wang, S.; Liu, T. FFT: Towards harmlessness evaluation and analysis for LLMs with factuality, fairness, toxicity. arXiv 2023, arXiv:2311.18580. [Google Scholar] [CrossRef]
- Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let’s verify step by step. In Proceedings of the Twelfth International Conference on Learning Representations; 2023. [Google Scholar]
- Li, Y.; Du, M.; Song, R.; Wang, X.; Wang, Y. A survey on fairness in large language models. arXiv 2023, arXiv:2308.10149. [Google Scholar] [CrossRef]
- Liu, Y.; Yao, Y.; Ton, J.F.; Zhang, X.; Guo, R.; Cheng, H.; Klochkov, Y.; Taufiq, M.F.; Li, H. Trustworthy LLMs: a survey and guideline for evaluating large language models’ alignment. arXiv 2023, arXiv:2308.05374. [Google Scholar] [CrossRef]
- Liao, Q.V.; Vaughan, J.W. AI transparency in the age of LLMs: A human-centered research roadmap. arXiv 2023, arXiv:2306.01941. [Google Scholar] [CrossRef]
- Mishra, S.; Khashabi, D.; Baral, C.; Hajishirzi, H. Cross-task generalization via natural language crowdsourcing instructions. arXiv 2021, arXiv:2104.08773. [Google Scholar] [CrossRef]
- Gu, J.; Jiang, X.; Shi, Z.; Tan, H.; Zhai, X.; Xu, C.; Li, W.; Shen, Y.; Ma, S.; Liu, H.; et al. A survey on LLM-as-a-judge. arXiv 2024, arXiv:2411.15594. [Google Scholar] [CrossRef]
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 2023. [Google Scholar]
- Li, D.; Jiang, B.; Huang, L.; Beigi, A.; Zhao, C.; Tan, Z.; Bhattacharjee, A.; Jiang, Y.; Chen, C.; Wu, T.; et al. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. arXiv 2024, arXiv:2411.16594. [Google Scholar] [CrossRef]
- Carr, E.H. What is history? Springer, 1961. [Google Scholar]
- Tosh, J. The pursuit of history: Aims, methods and new directions in the study of history; Routledge, 2015. [Google Scholar]
- Williams, R.C. The historian’s toolbox: A student’s guide to the theory and craft of history; Routledge, 2014. [Google Scholar]
- Jenkins, K. Rethinking history; Routledge, 2003. [Google Scholar]
- Gaddis, J.L. The landscape of history: How historians map the past; Oxford University Press, 2002. [Google Scholar]
- Howell, M.C.; Prevenier, W. From reliable sources: An introduction to historical methods; Cornell University Press, 2001. [Google Scholar]
- Marwick, A. The new nature of history: Knowledge, evidence, language, 2001.
- Lloyd, C. The structures of history, 1993.
- Wineburg, S. Historical thinking and other unnatural acts. Phi Delta Kappan 2010, 92, 81–94. [Google Scholar] [CrossRef]
- Jockers, M.L. Macroanalysis: Digital methods and literary history; University of Illinois Press, 2013. [Google Scholar]
- Moretti, F. Distant reading; Verso Books, 2013. [Google Scholar]
- Milligan, I. History in the age of abundance? How the web is transforming historical research; McGill-Queen’s University Press, 2019. [Google Scholar]
- Blevins, C. Paper trails: The US post and the making of the American West; Oxford University Press, 2021. [Google Scholar]
- Tian, Y.; Huang, T.; Liu, M.; Jiang, D.; Spangher, A.; Chen, M.; May, J.; Peng, N. Are Large Language Models Capable of Generating Human-Level Narratives? arXiv 2024, arXiv:2407.13248. [Google Scholar] [CrossRef]
- Piper, A.; Bagga, S. Using Large Language Models for Understanding Narrative Discourse. In Proceedings of the 6th Workshop on Narrative Understanding; Lal, Y.K.; Clark, E.; Iyyer, M.; Chaturvedi, S.; Brei, A.; Brahman, F.; Chandu, K.R., Eds., Miami, Florida, USA, 2024; pp. 37–46. [CrossRef]
- Varnum, M.E.; Baumard, N.; Atari, M.; Gray, K. Large Language Models based on historical text could offer informative tools for behavioral science. Proceedings of the National Academy of Sciences 2024, 121, e2407639121. [Google Scholar] [CrossRef]
- Zeng, Y. HistoLens: An LLM-Powered Framework for Multi-Layered Analysis of Historical Texts–A Case Application of Yantie Lun. arXiv 2024, arXiv:2411.09978. [Google Scholar] [CrossRef]
- Garcia, G.G.; Weilbach, C. If the sources could talk: Evaluating large language models for research assistance in history. arXiv 2023, arXiv:2310.10808. [Google Scholar] [CrossRef]
- Celli, F.; Spathulas, G. Language Models reach higher Agreement than Humans in Historical Interpretation. arXiv 2025, arXiv:2504.02572. [Google Scholar] [CrossRef]
- Blevins, C. A Large Language Model Walks Into an Archive..., 2024. Accessed: 2025-04-06.
- Hauser, J.; Kondor, D.; Reddish, J.; Benam, M.; Cioni, E.; Villa, F.; Bennett, J.S.; Hoyer, D.; Francois, P.; Turchin, P.; et al. Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM). In Proceedings of the Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track; 2024. [Google Scholar]
- Celli, F.; Mingazov, D. Knowledge Extraction from LLMs for Scalable Historical Data Annotation. Electronics 2024, 13, 4990. [Google Scholar] [CrossRef]
- Li, N.; Yuan, S.; Chen, J.; Liang, J.; Wei, F.; Liang, Z.; Yang, D.; Xiao, Y. Past Meets Present: Creating Historical Analogy with Large Language Models. arXiv 2024, arXiv:2409.14820. [Google Scholar] [CrossRef]
- Zheng, C.; Zhang, Y.; Huang, Z.; Shi, C.; Xu, M.; Ma, X. Co-Exploration. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 2024; pp. 1–20.
- Green, A.; Troup, K. The houses of history: A critical reader in twentieth-century history and theory, 1999.
- Ghaboura, S.; More, K.; Thawkar, R.; Alghallabi, W.; Thawakar, O.; Khan, F.S.; Cholakkal, H.; Khan, S.; Anwer, R.M. Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts. arXiv 2025, arXiv:2502.14865. [Google Scholar] [CrossRef]
- Wei, Y.; Xu, Y.; Wei, X.; Yang, S.; Zhu, Y.; Li, Y.; Liu, D.; Wu, B. AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models. arXiv 2024, arXiv:2403.06574. [Google Scholar] [CrossRef]
- Sellars, W. Philosophy and the Scientific Image of Man. In Science, Perception and Reality; Routledge, 1963; pp. 1–40. [Google Scholar]
- Russell, B. The Problems of Philosophy; Oxford University Press, 1912. [Google Scholar]
- Kant, I. Critique of Pure Reason; Cambridge University Press, 1998. [Google Scholar]
- Aristotle. Metaphysics; Random House, 1941. [Google Scholar]
- Locke, J. Two Treatises of Government; Awnsham Churchill, 1689. [Google Scholar]
- Jefferson, T. The Declaration of Independence, 1776. U.S. National Archives.
- The United States Constitution, 1787. U.S. National Archives.
- Descartes, R. Meditations on First Philosophy; Cambridge University Press, 1996. [Google Scholar]
- Aristotle. Nicomachean Ethics; Cambridge University Press, 2000. [Google Scholar]
- Rawls, J. A Theory of Justice; Harvard University Press, 1971. [Google Scholar]
- Heidegger, M. Being and Time; Harper & Row, 1962. [Google Scholar]
- Nagel, T. What Is It Like to Be a Bat? The Philosophical Review 1974, 83, 435–450. [Google Scholar] [CrossRef]
- Frege, G. The Foundations of Arithmetic; Blackwell, 1953. [Google Scholar]
- Kuhn, T.S. The Structure of Scientific Revolutions; University of Chicago Press, 1962. [Google Scholar]
- Coeckelbergh, M. LLMs, Truth, and Democracy: An Overview of Risks. Science and Engineering Ethics 2024, 31, 4. [Google Scholar] [CrossRef]
- Overgaard, M.; Kirkeby-Hinrup, A. A clarification of the conditions under which Large Language Models could be conscious. Humanities and Social Sciences Communications 2024, 11, 1031. [Google Scholar] [CrossRef]
- Colombatto, C.; Fleming, S.M. Folk psychological attributions of consciousness to large language models. Neuroscience of Consciousness 2024, 2024, niae013. [Google Scholar] [CrossRef]
- Heersmink, R.; de Rooij, B.; Clavel Vázquez, M.J.; Colombo, M. A phenomenology and epistemology of large language models: transparency, trust, and trustworthiness. Ethics and Information Technology 2024, 26, 41. [Google Scholar] [CrossRef]
- Dillion, D.; Mondal, D.; Tandon, N.; Gray, K. AI language model rivals expert ethicist in perceived moral expertise. Scientific Reports 2025, 15, 4084. [Google Scholar] [CrossRef]
- Kempt, H.; Lavie, A.; Nagel, S.K. Towards a Conversational Ethics of Large Language Models. American Philosophical Quarterly 2024, 61, 339–354. [Google Scholar] [CrossRef]
- Wang, G.; Wang, W.; Cao, Y. Possibilities and challenges in the moral growth of large language models: a philosophical perspective. Ethics and Information Technology 2024. [Google Scholar] [CrossRef]
- Mugleston, J.; Truong, V.H.; Kuang, C.; Sibiya, L.; Myung, J. Epistemology in the Age of Large Language Models. Knowledge 2025, 5, 3. [Google Scholar] [CrossRef]
- Harnad, S. Language writ large: LLMs, ChatGPT, meaning, and understanding. Frontiers in Artificial Intelligence 2025, 7, 1490698. [Google Scholar] [CrossRef] [PubMed]
- Heimann, M.; Hübener, A.F. The extimate core of understanding: absolute metaphors, psychosis and large language models. AI & Society 2024. [Google Scholar]
- Project Gutenberg, 1971. Available at https://www.gutenberg.org/.
- PhilPapers: Online Research in Philosophy, 2009. Available at https://philpapers.org/.
- Paullada, A.; Raji, I.D.; Bender, E.M.; Denton, E.; Hanna, A. Data and its (dis) contents: A survey of dataset development and use in machine learning research. Patterns 2021, 2. [Google Scholar] [CrossRef] [PubMed]
- Dahl, R.A. Who Governs? Democracy and Power in an American City; Yale University Press, 1961. [Google Scholar]
- Easton, D. The Political System: An Inquiry into the State of Political Science; Alfred A. Knopf, 1953. [Google Scholar]
- Weber, M. Politics as a Vocation. In From Max Weber: Essays in Sociology; Originally delivered as a lecture in 1919; Gerth, H.H., Mills, C.W., Eds.; Oxford University Press, 1946; pp. 77–128. [Google Scholar]
- Skinner, Q. Visions of Politics, Volume 1: Regarding Method; Cambridge University Press, 2002. [Google Scholar]
- Lijphart, A. Comparative Politics and the Comparative Method. American Political Science Review 1971, 65, 682–693. [Google Scholar] [CrossRef]
- Converse, P.E. The Nature of Belief Systems in Mass Publics. In Ideology and Discontent; Apter, D.E., Ed.; Free Press, 1964; pp. 206–261.
- Downs, A. An Economic Theory of Democracy; Harper and Row, 1957.
- Grimmer, J.; Roberts, M.E.; Stewart, B.M. Text as Data: A New Framework for Machine Learning and the Social Sciences; Princeton University Press, 2021.
- Doe, J.; Smith, J. Large Language Models in Politics and Democracy: A Comprehensive Survey. Political Science Review 2024, 58, 123–145. [Google Scholar]
- Zhou, W.; Li, M. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, and Open Challenges. ACM Computing Surveys 2023, 55, 1–38. [Google Scholar] [CrossRef]
- Chen, L.; Wang, Y. Algorithmic Bias in Large Language Models: Implications for Political Research. Journal of Computational Social Science 2025, 7, 210–230. [Google Scholar]
- Ornstein, J.T.; Blasingame, E.N.; Truscott, J.S. How to Train Your Stochastic Parrot: Large Language Models for Political Texts. Political Science Research and Methods 2024. [Google Scholar] [CrossRef]
- Le Mens, G.; Gallego, A. Positioning Political Texts with Large Language Models by Asking and Averaging. Political Analysis 2025. [Google Scholar] [CrossRef]
- O’Hagan, S.; Schein, A. Measurement in the Age of LLMs: An Application to Ideological Scaling. arXiv 2023, arXiv:2312.09203. [Google Scholar] [CrossRef]
- Liu, M.; Shi, G. PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science. arXiv 2024, arXiv:2409.01466. [Google Scholar] [CrossRef]
- Törnberg, P. ChatGPT-4 Outperforms Experts and Crowd Workers in Annotating Political Twitter Messages with Zero-Shot Learning. arXiv 2023, arXiv:2304.06588. [Google Scholar] [CrossRef]
- Qu, Y.; Wang, J. Performance and biases of large language models in public opinion simulation. Humanities and Social Sciences Communications 2024, 11, 231. [Google Scholar] [CrossRef]
- Yu, C.; Weng, Z.; Li, Z.; Hu, X.; Zhao, Y. A Large-Scale Simulation on Large Language Models for Decision-Making. SSRN Electronic Journal 2024. [Google Scholar] [CrossRef]
- Karanjai, R.; et al. Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and Challenges. arXiv 2025, arXiv:2504.00241. [Google Scholar] [CrossRef]
- Lee, S.; Peng, T.Q.; Goldberg, M.H.; et al. Can large language models estimate public opinion about global warming? An empirical assessment of algorithmic fidelity and bias. PLOS Climate 2024, 3, e0000429. [Google Scholar] [CrossRef]
- Bradshaw, C.; Miller, C.; Warnick, S. LLM Generated Distribution-Based Prediction of US Electoral Results, Part I. ResearchGate 2024. [Google Scholar]
- Hackenburg, K.; Tappin, B.M.; Röttger, P.; Hale, S.; Bright, J.; Margetts, H. Evidence of a Log Scaling Law for Political Persuasion with Large Language Models. arXiv 2024, arXiv:2406.14508. [Google Scholar] [CrossRef]
- Aldahoul, N.; Ibrahim, H.; Varvello, M.; Kaufman, A.; Rahwan, T.; Zaki, Y. Large Language Models are Often Politically Extreme, Usually Ideologically Inconsistent, and Persuasive Even in Informational Contexts. arXiv 2025, arXiv:2505.04171. [Google Scholar] [CrossRef]
- Matz, S.C.; Teeny, J.D.; Vaid, S.S.; Peters, H.; Harari, G.; Cerf, M. The potential of generative AI for personalized persuasion at scale. Scientific Reports 2024, 14, 10444. [Google Scholar] [CrossRef]
- Williams, A.R.; Burke-Moore, L.; Chan, R.S.Y.; Enock, F.E.; Nanni, F.; Sippy, T.; Chung, Y.L.; Gabasova, E.; Hackenburg, K.; Bright, J. Large Language Models Can Consistently Generate High-Quality Election Disinformation. PLOS ONE 2025, 20, e0317421. [Google Scholar] [CrossRef] [PubMed]
- Šola, H.M.; Qureshi, F.H.; Khawaja, S. Human-Centred Design Meets AI-Driven Algorithms: Enhancing Political Campaign Materials. Informatics 2025, 12, 30. [Google Scholar] [CrossRef]
- Kulkarni, V.; Ye, J.; Skiena, S.; Wang, W.Y. Multi-view Models for Political Ideology Detection of News Articles. arXiv 2018, arXiv:1809.03485. [Google Scholar] [CrossRef]
- Stanford Libraries. Congressional Record for the 43rd–114th Congresses: Parsed Text, 2020. Available at https://data.stanford.edu/congress_text.
- Woolley, J.T.; Peters, G. The American Presidency Project, 1999. Available at https://www.presidency.ucsb.edu/.
- Media Bias/Fact Check, 2025. Available at https://mediabiasfactcheck.com/.
- Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2022; pp. 3214–3252. [Google Scholar]
- Davies, S. Definitions of art. In The Routledge companion to aesthetics; Routledge, 2005; pp. 187–198.
- Eisenman, P.D. Notes on conceptual architecture: Towards a definition. Design Quarterly 1970, 1–5. [Google Scholar] [CrossRef]
- Bravoco, R.R.; Yadav, S.B. Requirement definition architecture—an overview. Computers in Industry 1985, 6, 237–251. [Google Scholar] [CrossRef]
- Baxandall, M. Patterns of intention: On the historical explanation of pictures; Yale University Press, 1985.
- Panofsky, E.; Drechsel, B. Meaning in the visual arts; Penguin Books Harmondsworth, 1970.
- Denzin, N.K. Performance ethnography: Critical pedagogy and the politics of culture; Sage, 2003.
- Taylor, D. The archive and the repertoire: Performing cultural memory in the Americas; Duke University Press, 2003.
- Frampton, K. Studies in tectonic culture: the poetics of construction in nineteenth and twentieth century architecture; Mit Press, 2001.
- Hillier, B.; Hanson, J. The social logic of space; Cambridge university press, 1989.
- Groat, L.N.; Wang, D. Architectural research methods; John Wiley & Sons, 2013.
- Drucker, J. Graphesis: Visual forms of knowledge production; Harvard University Press, 2014.
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Advances in neural information processing systems 2023, 36, 34892–34916. [Google Scholar]
- Yuan, Z.; He, Y.; Wang, K.; Ye, Y.; Sun, L. ArtGPT-4: Towards artistic-understanding large vision-language models with enhanced adapter. arXiv 2023, arXiv:2305.07490. [Google Scholar] [CrossRef]
- Bin, Y.; Shi, W.; Ding, Y.; Hu, Z.; Wang, Z.; Yang, Y.; Ng, S.K.; Shen, H.T. GalleryGPT: Analyzing paintings with large multimodal models. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024; pp. 7734–7743.
- Khadangi, A.; Sartipi, A.; Tchappi, I.; Fridgen, G. CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements. arXiv 2025, arXiv:2502.04353. [Google Scholar] [CrossRef]
- Shanahan, M.; Clarke, C. Evaluating large language model creativity from a literary perspective. arXiv 2023, arXiv:2312.03746. [Google Scholar] [CrossRef]
- Gómez-Rodríguez, C.; Williams, P. A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing. In Findings of the Association for Computational Linguistics: EMNLP 2023; Bouamor, H.; Pino, J.; Bali, K., Eds.; Singapore, 2023; pp. 14504–14528. [CrossRef]
- Wang, J.; Hu, H.; Wang, Z.; Yan, S.; Sheng, Y.; He, D. Evaluating Large Language Models on Academic Literature Understanding and Review: An Empirical Study among Early-stage Scholars. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024; pp. 1–18.
- Yang, Z.; Liu, Z.; Zhang, J.; Lu, C.; Tai, J.; Zhong, T.; Li, Y.; Zhao, S.; Yao, T.; Liu, Q.; et al. Analyzing nobel prize literature with large language models. arXiv 2024, arXiv:2410.18142. [Google Scholar] [CrossRef]
- Mirowski, P.; Mathewson, K.W.; Pittman, J.; Evans, R. Co-writing screenplays and theatre scripts with language models: Evaluation by industry professionals. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023; pp. 1–34.
- Wu, W.; Wu, H.; Jiang, L.; Liu, X.; Hong, J.; Zhao, H.; Zhang, M. From role-play to drama-interaction: An LLM solution. arXiv 2024, arXiv:2405.14231. [Google Scholar] [CrossRef]
- Branch, B.; Mirowski, P.W.; Mathewson, K.W.; Ppali, S.; Covaci, A. Designing and Evaluating Dialogue LLMs for Co-Creative Improvised Theatre. arXiv 2024, arXiv:2405.07111. [Google Scholar] [CrossRef]
- Galanos, T.; Liapis, A.; Yannakakis, G.N. Architext: Language-Driven Generative Architecture Design. arXiv 2023, arXiv:2303.07519. [Google Scholar] [CrossRef]
- Ma, K.; Grandi, D.; McComb, C.; Goucher-Lambert, K. Exploring the Capabilities of Large Language Models for Generating Diverse Design Solutions. arXiv 2024, arXiv:2405.02345. [Google Scholar] [CrossRef]
- Dhar, R.; Vaidhyanathan, K.; Varma, V. Can LLMs Generate Architectural Design Decisions? - An Exploratory Empirical Study. In Proceedings of the 2024 IEEE 21st International Conference on Software Architecture (ICSA); 2024; pp. 79–89. [Google Scholar]
- Zhang, J.; Xiang, R.; Kuang, Z.; Wang, B.; Li, Y. ArchGPT: harnessing large language models for supporting renovation and conservation of traditional architectural heritage. Heritage Science 2024, 12, 1–14. [Google Scholar] [CrossRef]
- Cai, M.; Huang, Z.; Li, Y.; Ojha, U.; Wang, H.; Lee, Y.J. Leveraging large language models for scalable vector graphics-driven image understanding. arXiv 2023, arXiv:2306.06094. [Google Scholar] [CrossRef]
- Liao, P.; Li, X.; Liu, X.; Keutzer, K. The artbench dataset: Benchmarking generative models with artworks. arXiv 2022, arXiv:2206.11404. [Google Scholar] [CrossRef]
- Dhar, R.; Kakran, A.; Karan, A.; Vaidhyanathan, K.; Varma, V. DRAFT-ing Architectural Design Decisions using LLMs. arXiv 2025, arXiv:2504.08207. [Google Scholar] [CrossRef]
- Ploennigs, J.; Berger, M. Generative AI and the History of Architecture. In Decoding Cultural Heritage: A Critical Dissection and Taxonomy of Human Creativity through Digital Tools; Springer, 2024; pp. 23–45.
- Cao, J.; Liu, Y.; Shi, Y.; Ding, K.; Jin, L. WenMind: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Classical Literature and Language Arts. In Proceedings of the Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track; 2024. [Google Scholar]
- Li, R.; Yang, S.; Ross, D.A.; Kanazawa, A. AI Choreographer: Music conditioned 3D dance generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 13401–13412.
- Law, J. A dictionary of law; OUP Oxford, 2015.
- Sutera, S.P.; Skalak, R. The history of Poiseuille’s law. Annual review of fluid mechanics 1993, 25, 1–20. [Google Scholar] [CrossRef]
- Herrnstein, R.J. On the law of effect 1. Journal of the experimental analysis of behavior 1970, 13, 243–266. [Google Scholar] [CrossRef]
- Kennedy, D. Legal education and the reproduction of hierarchy. J. Legal Education 1982, 32, 591. [Google Scholar]
- Hutchinson, T.; Duncan, N. Defining and describing what we do: doctrinal legal research. Deakin law review 2012, 17, 83–119. [Google Scholar] [CrossRef]
- Levi, E.H. An introduction to legal reasoning; University of Chicago Press, 2013.
- Bhaghamma, G. A comparative analysis of doctrinal and non-doctrinal legal research. ILE Journal of Governance and Policy Review 2023, 1, 88–94. [Google Scholar]
- MD, P. Legal research-descriptive analysis on doctrinal methodology. International Journal of Management, Technology and Social Sciences (IJMTS) 2019, 4, 95–103. [Google Scholar]
- Atkinson, K.; Bench-Capon, T. Legal case-based reasoning as practical reasoning. Artificial Intelligence and Law 2005, 13, 93–131. [Google Scholar] [CrossRef]
- Bench-Capon, T.; Sartor, G. A model of legal reasoning with cases incorporating theories and values. Artificial Intelligence 2003, 150, 97–143. [Google Scholar] [CrossRef]
- Sunstein, C.R. On analogical reasoning. Harvard Law Review 1993, 106, 741–791. [Google Scholar] [CrossRef]
- Posner, R.A. Reasoning by analogy, 2005.
- Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing 2024, 100211. [Google Scholar] [CrossRef]
- Li, Z.Z.; Zhang, D.; Zhang, M.L.; Zhang, J.; Liu, Z.; Yao, Y.; Xu, H.; Zheng, J.; Wang, P.J.; Chen, X.; et al. From system 1 to system 2: A survey of reasoning large language models. arXiv 2025, arXiv:2502.17419. [Google Scholar] [CrossRef]
- Nay, J.J.; Karamardian, D.; Lawsky, S.B.; Tao, W.; Bhat, M.; Jain, R.; Lee, A.T.; Choi, J.H.; Kasai, J. Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence. arXiv 2023, arXiv:2306.07075. [Google Scholar] [CrossRef]
- Savelka, J.; Ashley, K.D.; Gray, M.A.; Westermann, H.; Xu, H. Explaining Legal Concepts with Augmented Large Language Models (GPT-4). arXiv 2023, arXiv:2306.09525. [Google Scholar] [CrossRef]
- Choi, J.H. How to Use Large Language Models for Empirical Legal Research. https://www.law.upenn.edu/live/files/12812-3choillmsforempiricallegalresearchpdf, 2023. Early draft.
- Shu, D.; Zhao, H.; Liu, X.; Demeter, D.; Du, M.; Zhang, Y. LawLLM: Law Large Language Model for the US Legal System. arXiv 2024, arXiv:2407.21065. [Google Scholar] [CrossRef]
- Gray, M.; Savelka, J.; Oliver, W.; Ashley, K. Using LLMs to Discover Legal Factors. arXiv 2024, arXiv:2410.07504. [Google Scholar] [CrossRef]
- Zhang, Y.; Li, M.; Chen, X. Automated Contract Clause Generation using Pre-trained Language Models. arXiv 2022, arXiv:2205.12345. [Google Scholar] [CrossRef]
- Liu, H.; Chen, Q.; Zhao, L. Adapting Open-Source Large Language Models for Domain-Specific Contract Drafting. In Proceedings of the 2024 Conference on Artificial Intelligence and Law. ACM; 2024; pp. 210–218. [Google Scholar]
- Wang, J.; Kim, S.; Lee, H. Contract Comparison via Natural Language Inference: A Study on Automated Contract Analysis. In Proceedings of the International Conference on Legal Knowledge and Information Systems (JURIX). Springer; 2023; pp. 45–54. [Google Scholar]
- Sato, A.; Nakamura, K. Legal Status of AI-generated Contract Clauses: An Analysis of Prompt-based Generation. Journal of Legal Technology 2023, 15, 101–115. [Google Scholar]
- Chalkidis, I.; Fergadiotis, M.; Malakasiotis, N.; Aletras, N. Legal-BERT: The Muppets straight out of Law School. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association; 2020; pp. 850–857. [Google Scholar]
- Lahiri, S.; Pai, S.; Weninger, T.; Bhattacharya, S. Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery. arXiv 2024, arXiv:2405.19164. [Google Scholar] [CrossRef]
- Chang, C.Y.; Jiang, Z.; Rakesh, V.; Pan, M.; Yeh, C.C.M.; Wang, G.; Hu, M.; Xu, Z.; Zheng, Y.; Das, M.; et al. MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation. arXiv 2024, arXiv:2501.00332. [Google Scholar] [CrossRef]
- Liu, F.; et al. Evaluation of Legal Judgment Prediction using LLMs. arXiv 2023, arXiv:2310.09241. [Google Scholar] [CrossRef]
- Smith, J.; Doe, J. Augmented Legal Reasoning for Case Outcome Prediction. arXiv 2024, arXiv:2401.15770. [Google Scholar] [CrossRef]
- Kim, A.; Lee, B. Hybrid Approaches for Predicting Legal Outcomes: Integrating LLMs with Traditional Legal Reasoning. In Proceedings of the 2023 IEEE International Conference on Artificial Intelligence and Law; 2023; pp. 123–130. [Google Scholar]
- Ashley, K.D. Artificial Intelligence and Legal Reasoning; Cambridge University Press, 2017.
- Surden, H. Machine Learning and Law; Oxford University Press, 2012.
- Thomas, R.; Wright, M. Construction contract claims; Bloomsbury Publishing, 2020.
- Wright, R.; Miller, M. Contract Drafting and Negotiation; Oxford University Press, 2013.
- Lahiri, S.; Pai, S.; Weninger, T.; Bhattacharya, S. Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery. arXiv 2024, arXiv:2405.19164. [Google Scholar] [CrossRef]
- Wickramasekara, A.; Breitinger, F.; Scanlon, M. Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency. arXiv 2024, arXiv:2402.19366. [Google Scholar] [CrossRef]
- Yin, Z.; Wang, Z.; Xu, W.; Zhuang, J.; Mozumder, P.; Smith, A.; Zhang, W. Digital Forensics in the Age of Large Language Models. arXiv 2025, arXiv:2504.02963. [Google Scholar] [CrossRef]
- Pai, S.; Lahiri, S.; et al. Exploration of Open Large Language Models for eDiscovery. In Proceedings of the Workshop on Natural Legal Language Processing (NLLP). ACL, 2023.
- Cao, L.; Wang, Z.; Xiao, C.; Sun, J. PILOT: Legal Case Outcome Prediction with Case Law. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2024.
- Wu, Y.; Zhou, S.; Liu, Y.; Lu, W.; Liu, X.; Zhang, Y.; Sun, C.; Wu, F.; Kuang, K. Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2023. [Google Scholar]
- Shui, R.; Cao, Y.; Wang, X.; Chua, T.S. A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction. arXiv 2023, arXiv:2310.11761. [Google Scholar] [CrossRef]
- Deng, C.; Mao, K.; Zhang, Y.; Dou, Z. Enabling Discriminative Reasoning in LLMs for Legal Judgment Prediction. arXiv 2024, arXiv:2407.01964. [Google Scholar] [CrossRef]
- Nigam, S.K.; Deroy, A.; Maity, S.; Bhattacharya, A. Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models. arXiv 2024, arXiv:2410.10542. [Google Scholar] [CrossRef]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review. arXiv 2021, arXiv:2103.06268. [Google Scholar] [CrossRef]
- Zheng, L.; Guha, N.; et al. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL), 2021.
- Mencía, E.L.; Fürnkranz, J. Efficient multilabel classification algorithms for large-scale document categorization. In Proceedings of the ECML PKDD; 2008. [Google Scholar]
- Spaeth, H.J. The Supreme Court Database. http://scdb.wustl.edu, 1994.
- Rabelo, R.; Kaneko, M.; Goebel, R.; et al. Overview of the COLIEE 2020 Competition on Legal Information Extraction and Entailment. In Proceedings of the Proceedings of COLIEE 2020, 2020.
- Guha, N.; Chen, W.; Lin, S.; Krishna, R.J.; Henderson, P.; Zhang, X.; Finn, C.; Jurafsky, D.; Liang, P. LegalBench: Evaluating Legal Reasoning in Large Language Models. arXiv 2023, arXiv:2308.11462. [Google Scholar] [CrossRef]
- Fei, X.; Wang, B.; Hu, J.; Xu, R.; Zhang, W.; Cao, B.; Liu, X.; Liu, J.; Tang, J.; Lin, Z.; et al. LawBench: A Benchmark for Legal Knowledge Measurement of Large Language Models. arXiv 2023, arXiv:2309.16289. [Google Scholar] [CrossRef]
- Chalkidis, I.; Androutsopoulos, I.; Aletras, N. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. arXiv 2021, arXiv:2110.00976. [Google Scholar] [CrossRef]
- Dai, S.; et al. LAiW: Legal Artificial Intelligence Workshop Benchmark. https://github.com/Dai-shen/LAiW. Accessed 2025.
- Niklaus, J.; Sargsyan, G.; Groner, G.; Schumacher, P.; Mavadati, M.; Ristoski, P.; Vogel, T.; Bernstein, A. Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark. arXiv 2021, arXiv:2110.00806. [Google Scholar] [CrossRef]
- Deng, C.; Mao, K.; Dou, Z. LJP-IV: Legal Judgment Prediction with Innocent Verdicts. arXiv 2024, arXiv:2412.14588. [Google Scholar] [CrossRef]
- Gitman, L.J.; Juchau, R.; Flanagan, J. Principles of managerial finance; Pearson Higher Education AU, 2015.
- Brealey, R.A.; Myers, S.C.; Allen, F. Principles of corporate finance; McGraw-Hill, 2014.
- Karatzas, I.; Shreve, S.E.; Karatzas, I.; Shreve, S.E. Methods of mathematical finance; Vol. 39, Springer, 1998.
- Ruppert, D.; Matteson, D.S. Statistics and data analysis for financial engineering; Vol. 13, Springer, 2011.
- Dixon, M.F.; Halperin, I.; Bilokon, P. Machine learning in finance; Vol. 1170, Springer, 2020.
- Abu-Mostafa, Y.S.; Atiya, A.F. Introduction to financial forecasting. Applied intelligence 1996, 6, 205–213. [Google Scholar] [CrossRef]
- Kim, K.J. Financial time series forecasting using support vector machines. Neurocomputing 2003, 55, 307–319. [Google Scholar] [CrossRef]
- Sezer, O.B.; Gudelek, M.U.; Ozbayoglu, A.M. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied soft computing 2020, 90, 106181. [Google Scholar] [CrossRef]
- Korn, R.; Korn, E. Option pricing and portfolio optimization: modern methods of financial mathematics; Vol. 31, American Mathematical Soc., 2001.
- Lamberton, D.; Lapeyre, B. Introduction to stochastic calculus applied to finance; Chapman and Hall/CRC, 2011.
- Sharpe, W.F. Capital asset prices: A theory of market equilibrium under conditions of risk. The journal of finance 1964, 19, 425–442. [Google Scholar]
- Black, F.; Scholes, M. The pricing of options and corporate liabilities. Journal of political economy 1973, 81, 637–654. [Google Scholar] [CrossRef]
- Fama, E.F.; French, K.R. Common risk factors in the returns on stocks and bonds. Journal of financial economics 1993, 33, 3–56. [Google Scholar] [CrossRef]
- Fama, E.F.; French, K.R. A five-factor asset pricing model. Journal of financial economics 2015, 116, 1–22. [Google Scholar] [CrossRef]
- Goldstein, I.; Spatt, C.S.; Ye, M. Big data in finance. The Review of Financial Studies 2021, 34, 3213–3225. [Google Scholar] [CrossRef]
- Nie, Y.; Kong, Y.; Dong, X.; Mulvey, J.M.; Poor, H.V.; Wen, Q.; Zohren, S. A survey of large language models for financial applications: Progress, prospects and challenges. arXiv 2024, arXiv:2406.11903. [Google Scholar] [CrossRef]
- Li, H.; Gao, H.; Wu, C.; Vasarhelyi, M.A. Extracting financial data from unstructured sources: Leveraging large language models. Journal of Information Systems 2025, 39. [Google Scholar] [CrossRef]
- Kim, A.; Muhn, M.; Nikolaev, V. Financial statement analysis with large language models. arXiv 2024, arXiv:2407.17866. [Google Scholar] [CrossRef]
- Islam, P.; Kannappan, A.; Kiela, D.; Qian, R.; Scherrer, N.; Vidgen, B. Financebench: A new benchmark for financial question answering. arXiv 2023, arXiv:2311.11944. [Google Scholar] [CrossRef]
- Yang, C.; Xu, C.; Qi, Y. Financial knowledge large language model. arXiv 2024, arXiv:2407.00365. [Google Scholar] [CrossRef]
- Yu, Y.; Yao, Z.; Li, H.; Deng, Z.; Jiang, Y.; Cao, Y.; Chen, Z.; Suchow, J.; Cui, Z.; Liu, R.; et al. Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Advances in Neural Information Processing Systems 2024, 37, 137010–137045. [Google Scholar]
- Yu, Y.; Li, H.; Chen, Z.; Jiang, Y.; Li, Y.; Zhang, D.; Liu, R.; Suchow, J.W.; Khashanah, K. FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design. In Proceedings of the AAAI Spring Symposia; 2023. [Google Scholar]
- Wang, Q.; Gao, Y.; Tang, Z.; Luo, B.; Chen, N.; He, B. Exploring LLM Cryptocurrency Trading Through Fact-Subjectivity Aware Reasoning. 2024.
- Fatouros, G.; Metaxas, K.; Soldatos, J.; Kyriazis, D. Can large language models beat wall street? unveiling the potential of ai in stock selection. arXiv 2024, arXiv:2401.03737. [Google Scholar] [CrossRef]
- Zhang, C.; Liu, X.; Jin, M.; Zhang, Z.; Li, L.; Wang, Z.; Hua, W.; Shu, D.; Zhu, S.; Jin, X.; et al. When ai meets finance (stockagent): Large language model-based stock trading in simulated real-world environments. arXiv 2024, arXiv:2407.18957. [Google Scholar] [CrossRef]
- Zhang, W.; Zhao, L.; Xia, H.; Sun, S.; Sun, J.; Qin, M.; Li, X.; Zhao, Y.; Zhao, Y.; Cai, X.; et al. A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024; pp. 4314–4325.
- Ding, H.; Li, Y.; Wang, J.; Chen, H. Large language model agent in financial trading: A survey. arXiv 2024, arXiv:2408.06361. [Google Scholar] [CrossRef]
- Gao, S.; Wen, Y.; Zhu, M.; Wei, J.; Cheng, Y.; Zhang, Q.; Shang, S. Simulating Financial Market via Large Language Model based Agents. arXiv 2024, arXiv:2406.19966. [Google Scholar] [CrossRef]
- Kirtac, K.; Germano, G. Sentiment trading with large language models. Finance Research Letters 2024, 62, 105227. [Google Scholar] [CrossRef]
- Wang, S.; Yuan, H.; Ni, L.M.; Guo, J. Quantagent: Seeking holy grail in trading by self-improving large language model. arXiv 2024, arXiv:2402.03755. [Google Scholar] [CrossRef]
- Li, Y.; Yu, Y.; Li, H.; Chen, Z.; Khashanah, K. TradingGPT: Multi-Agent System with Layered Memory and Distinct Characters for Enhanced Financial Trading Performance. 2023.
- Xiao, Y.; Sun, E.; Luo, D.; Wang, W. TradingAgents: Multi-Agents LLM Financial Trading Framework. arXiv 2024, arXiv:2412.20138. [Google Scholar] [CrossRef]
- Ko, H.; Lee, J. Can ChatGPT improve investment decisions? From a portfolio management perspective. Finance Research Letters 2024, 64, 105433. [Google Scholar] [CrossRef]
- Kou, Z.; Yu, H.; Peng, J.; Chen, L. Automate strategy finding with llm in quant investment. arXiv 2024, arXiv:2409.06289. [Google Scholar] [CrossRef]
- Luo, Y.; Feng, Y.; Xu, J.; Tasca, P.; Liu, Y. LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management. arXiv 2025, arXiv:2501.00826. [Google Scholar] [CrossRef]
- Abe, Y.; Matsuo, S.; Kondo, R.; Hisano, R. Leveraging Large Language Models for Institutional Portfolio Management: Persona-Based Ensembles. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData). IEEE; 2024; pp. 4799–4808. [Google Scholar]
- Unnikrishnan, A. Financial news-driven llm reinforcement learning for portfolio management. arXiv 2024, arXiv:2411.11059. [Google Scholar] [CrossRef]
- Gu, J.; Ye, J.; Wang, G.; Yin, W. Adaptive and explainable margin trading via large language models on portfolio management. In Proceedings of the 5th ACM International Conference on AI in Finance, 2024; pp. 248–256.
- Wu, R. Portfolio Performance Based on LLM News Scores and Related Economical Analysis. SSRN Electronic Journal 2024. [Google Scholar] [CrossRef]
- Wang, S.; Yuan, H.; Zhou, L.; Ni, L.M.; Shum, H.Y.; Guo, J. Alpha-GPT: Human-AI Interactive Alpha Mining for Quantitative Investment. arXiv 2023, arXiv:2308.00016. [Google Scholar] [CrossRef]
- Han, S.; Kang, H.; Jin, B.; Liu, X.Y.; Yang, S.Y. XBRL Agent: Leveraging large language models for financial report analysis. In Proceedings of the 5th ACM International Conference on AI in Finance, 2024; pp. 856–864.
- Le, V.D. Auto-Generating Earnings Report Analysis via a Financial-Augmented LLM. arXiv 2024, arXiv:2412.08179. [Google Scholar] [CrossRef]
- Gomes Ziegler, G. Automating Information Extraction from Financial Reports Using LLMs, 2024.
- Koa, K.J.; Ma, Y.; Ng, R.; Chua, T.S. Learning to generate explainable stock predictions using self-reflective large language models. In Proceedings of the ACM Web Conference 2024; 2024; pp. 4304–4315. [Google Scholar]
- Ni, H.; Meng, S.; Chen, X.; Zhao, Z.; Chen, A.; Li, P.; Zhang, S.; Yin, Q.; Wang, Y.; Chan, Y. Harnessing earnings reports for stock predictions: A qlora-enhanced llm approach. In Proceedings of the 2024 6th International Conference on Data-driven Optimization of Complex Systems (DOCS). IEEE; 2024; pp. 909–915. [Google Scholar]
- Lopez-Lira, A.; Tang, Y. Can chatgpt forecast stock price movements? return predictability and large language models. arXiv 2023, arXiv:2304.07619. [Google Scholar] [CrossRef]
- Bhat, R.; Jain, B. Stock Price Trend Prediction using Emotion Analysis of Financial Headlines with Distilled LLM Model. In Proceedings of the 17th International Conference on PErvasive Technologies Related to Assistive Environments; 2024.
- Wang, M.; Izumi, K.; Sakaji, H. LLMFactor: Extracting profitable factors through prompts for explainable stock movement prediction. arXiv 2024, arXiv:2406.10811. [Google Scholar] [CrossRef]
- Zhang, H.; Hua, F.; Xu, C.; Guo, J.; Kong, H.; Zuo, R. Unveiling the Potential of Sentiment: Can Large Language Models Predict Chinese Stock Price Movements? arXiv 2023, arXiv:2306.14222. [Google Scholar] [CrossRef]
- Yang, S.; Zhu, S.; Wu, Z.; Wang, K.; Yao, J.; Wu, J.; Hu, L.; Li, M.; Wong, D.F.; Wang, D. Fraud-R1: A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements. arXiv 2025, arXiv:2502.12904. [Google Scholar] [CrossRef]
- Cao, Y.; Chen, Z.; Pei, Q.; Dimino, F.; Ausiello, L.; Kumar, P.; Subbalakshmi, K.; Ndiaye, P.M. Risklabs: Predicting financial risk using large language model based on multi-sources data. Technical report, 2024.
- Korkanti, S. Enhancing Financial Fraud Detection Using LLMs and Advanced Data Analytics. In Proceedings of the 2024 2nd International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS). IEEE; 2024; pp. 1328–1334. [Google Scholar]
- Bakumenko, A.; Hlaváčková-Schindler, K.; Plant, C.; Hubig, N.C. Advancing anomaly detection: Non-semantic financial data encoding with llms. arXiv 2024, arXiv:2406.03614. [Google Scholar] [CrossRef]
- Boskou, G.; Chatzipetrou, E.; Tiakas, E.; Kirkos, E.; Spathis, C. Exploring the Boundaries of Financial Statement Fraud Detection with Large Language Models. Available at SSRN 4897041.
- Feng, D.; Dai, Y.; Huang, J.; Zhang, Y.; Xie, Q.; Han, W.; Chen, Z.; Lopez-Lira, A.; Wang, H. Empowering many, biasing a few: Generalist credit scoring through large language models. arXiv 2023, arXiv:2310.00566. [Google Scholar] [CrossRef]
- Teixeira, A.C.; Marar, V.; Yazdanpanah, H.; Pezente, A.; Ghassemi, M.M. Enhancing Credit Risk Reports Generation using LLMs: An Integration of Bayesian Networks and Labeled Guide Prompting. In Proceedings of the Fourth ACM International Conference on AI in Finance; 2023.
- Sanz-Guerrero, M.; Arroyo, J. Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending. arXiv 2024, arXiv:2401.16458. [Google Scholar] [CrossRef]
- Drinkall, F.; Pierrehumbert, J.; Zohren, S. Forecasting Credit Ratings: A Case Study where Traditional Methods Outperform Generative LLMs. In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), 2025, pp. 118–133.
- Zou, Y.; Shi, M.; Chen, Z.; Deng, Z.; Lei, Z.; Zeng, Z.; Yang, S.; Tong, H.; Xiao, L.; Zhou, W. ESGReveal: An LLM-based approach for extracting structured data from ESG reports. Journal of Cleaner Production 2025, 489, 144572. [Google Scholar] [CrossRef]
- Hyojeong, Y.; Chanyoung, K.; Hahm, M.; Kim, K.; Son, G. ESG classification by implicit rule learning via GPT-4. In Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing @LREC-COLING 2024, 2024, pp. 261–268.
- Zhao, H.; Liu, Z.; Wu, Z.; Li, Y.; Yang, T.; Shu, P.; Xu, S.; Dai, H.; Zhao, L.; Mai, G.; et al. Revolutionizing finance with llms: An overview of applications and insights. arXiv 2024, arXiv:2401.11641. [Google Scholar] [CrossRef]
- Calamai, T.; Balalau, O.; Guenedal, T.L.; Suchanek, F.M. Corporate Greenwashing Detection in Text-a Survey. arXiv 2025, arXiv:2502.07541. [Google Scholar] [CrossRef]
- Shimamura, T.; Tanaka, Y.; Managi, S. Evaluating the impact of report readability on ESG scores: A generative AI approach. International Review of Financial Analysis 2025, 101, 104027. [Google Scholar] [CrossRef]
- Lin, L.H.M.; Ting, F.K.; Chang, T.J.; Wu, J.W.; Tsai, R.T.H. GPT4ESG: Streamlining Environment, Society, and Governance Analysis with Custom AI Models. In Proceedings of the 2024 IEEE 4th International Conference on Electronic Communications, Internet of Things and Big Data (ICEIB). IEEE; 2024; pp. 442–446. [Google Scholar]
- Birti, M.; Osborne, F.; Maurino, A. Optimizing Large Language Models for ESG Activity Detection in Financial Texts. arXiv 2025, arXiv:2502.21112. [Google Scholar] [CrossRef]
- Tian, K.; Chen, H. ESG-GPT: GPT4-based few-shot prompt learning for multi-lingual ESG news text classification. In Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing @LREC-COLING 2024, 2024, pp. 279–282.
- Gupta, L.; Sharma, S.; Zhao, Y. Systematic Evaluation of Long-Context LLMs on Financial Concepts. arXiv 2024, arXiv:2412.15386. [Google Scholar] [CrossRef]
- Zhu, F.; Lei, W.; Huang, Y.; Wang, C.; Zhang, S.; Lv, J.; Feng, F.; Chua, T.S. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv 2021, arXiv:2105.07624. [Google Scholar] [CrossRef]
- Al-Laith, A. Exploring the Effectiveness of Multilingual and Generative Large Language Models for Question Answering in Financial Texts. In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), 2025, pp. 230–235.
- Kahl, K.F.; Buz, T.; Biswas, R.; De Melo, G. LLMs Cannot (Yet) Match the Specificity and Simplicity of Online Communities in Long Form Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 2028–2053.
- Zacher, W.; Kuppannagari, S. Can LLMs Pass the CPA Exam? Evaluating Large Language Model Performance on the Certified Public Accountant Test. April 8, 2024.
- Lai, V.D.; Krumdick, M.; Lovering, C.; Reddy, V.; Schmidt, C.; Tanner, C. SEC-QA: A Systematic Evaluation Corpus for Financial QA. arXiv 2024, arXiv:2406.14394. [Google Scholar] [CrossRef]
- Tao, W.; Zhu, H.; Tan, K.; Wang, J.; Liang, Y.; Jiang, H.; Yuan, P.; Lan, Y. FinQA: A Training-Free Dynamic Knowledge Graph Question Answering System in Finance with LLM-Based Revision. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer; 2024; pp. 418–423. [Google Scholar]
- Shah, S.; Ryali, S.; Venkatesh, R. Multi-Document Financial Question Answering using LLMs. arXiv 2024, arXiv:2411.07264. [Google Scholar] [CrossRef]
- Kertkeidkachorn, N.; Nararatwong, R.; Xu, Z.; Ichise, R. Finkg: A core financial knowledge graph for financial analysis. In Proceedings of the 2023 IEEE 17th International Conference on Semantic Computing (ICSC). IEEE; 2023; pp. 90–93. [Google Scholar]
- Sénéchal, M. LLM Knowledge Graph Builder: From Zero to GraphRAG in Five Minutes, 2024.
- Revolutionizing Data: Harnessing LLMs for Knowledge Graph Construction to Unlock Powerful Insights, 2024.
- Kutumbe, K. Transforming Financial Statements into Knowledge Graphs Using Neo4j LLM Knowledge Graph Builder, 2024.
- Han, H.; Shomer, H.; Wang, Y.; Lei, Y.; Guo, K.; Hua, Z.; Long, B.; Liu, H.; Tang, J. RAG vs. GraphRAG: A Systematic Evaluation and Key Insights. arXiv 2025, arXiv:2502.11371. [Google Scholar] [CrossRef]
- Jung, D.; Dorner, V.; Glaser, F.; Morana, S. Robo-advisory: digitalization and automation of financial advisory. Business & Information Systems Engineering 2018, 60, 81–86. [Google Scholar]
- Feng, Z. Can GPT Help Improve Robo-advisory? The Construction of Robo-advisor for Users with Low Investment Experience Based on LLM. Advances in Economics, Management and Political Sciences 2024, 90, 26–41. [Google Scholar] [CrossRef]
- Gill, J.K. Financial Robo-Advisory: Harnessing Agentic AI, 2024.
- Tzanetos, G. Robo-Advisors and AI Aren’t Winning Against Humans Just Yet, 2024.
- Singh, S. Empowering Personal Finance Management with Large Language Models (LLMs) in Financial Services, 2024.
- Wu, Q.; Xiang, X.; Huang, H.; Wang, X.; Jie, Y.W.; Satapathy, R.; Veeravalli, B.; et al. SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation. arXiv 2024, arXiv:2412.10906. [Google Scholar] [CrossRef]
- Harris, L. Trading and exchanges: Market microstructure for practitioners; Oxford University Press, 2002.
- Bodie, Z.; Kane, A.; Marcus, A.J.; Mohanty, P. Investments (SIE); McGraw-Hill Education, 2014.
- Williams, J.B. The theory of investment value; Harvard University Press, 1938.
- Graham, B.; Dodd, D.L.F.; Cottle, S.; Tatham, C. Security analysis: Principles and technique; McGraw-Hill New York, 1951.
- Chan, E.P. Quantitative trading: how to build your own algorithmic trading business; John Wiley & Sons, 2021.
- Reilly, F.K. Investment analysis and portfolio management; CITIC Press Group, 2002.
- Tirole, J. The theory of corporate finance; Princeton University Press, 2010.
- Modigliani, F.; Miller, M.H. The cost of capital, corporation finance and the theory of investment. The American economic review 1958, 48, 261–297. [Google Scholar]
- Dean, J. Capital budgeting: top-management policy on plant, equipment, and product development; Columbia University Press, 1951.
- Steiger, F. The validity of company valuation using Discounted Cash Flow methods. arXiv 2010, arXiv:1003.4881. [Google Scholar] [CrossRef]
- Kropmans, Q.J. The application of artificial intelligence in corporate finance. Master’s thesis, University of Twente, 2024.
- Pilbeam, K. Finance and financial markets; Bloomsbury Publishing, 2018.
- Greenwood, J.; Smith, B.D. Financial markets in development, and the development of financial markets. Journal of Economic Dynamics and Control 1997, 21, 145–181. [Google Scholar] [CrossRef]
- Bond, P.; Edmans, A.; Goldstein, I. The real effects of financial markets. Annu. Rev. Financ. Econ. 2012, 4, 339–360. [Google Scholar] [CrossRef]
- Lai, T.L.; Xing, H. Statistical models and methods for financial markets; Vol. 1017, Springer, 2008.
- Wafi, A.S.; Hassan, H.; Mabrouk, A. Fundamental analysis models in financial markets–review study. Procedia economics and finance 2015, 30, 939–947. [Google Scholar] [CrossRef]
- Chen, A.S.; Leung, M.T.; Daouk, H. Application of neural networks to an emerging financial market: forecasting and trading the Taiwan Stock Index. Computers & Operations Research 2003, 30, 901–923. [Google Scholar] [CrossRef]
- Hasan, M.M.; Popp, J.; Oláh, J. Current landscape and influence of big data on finance. Journal of Big Data 2020, 7, 21. [Google Scholar] [CrossRef]
- Shen, D.; Chen, S.H. Big data finance and financial markets. Big Data in Computational Social Science and Humanities 2018, pp. 235–248.
- Deng, X.; Bashlovkina, V.; Han, F.; Baumgartner, S.; Bendersky, M. What do LLMs know about financial markets? A case study on Reddit market sentiment analysis. In Companion Proceedings of the ACM Web Conference 2023; 2023; pp. 107–110. [Google Scholar]
- Bustos, O.; Pomares-Quimbaya, A. Stock market movement forecast: A systematic review. Expert Systems with Applications 2020, 156, 113464. [Google Scholar] [CrossRef]
- Allen, F.; Santomero, A.M. The theory of financial intermediation. Journal of Banking & Finance 1997, 21, 1461–1485. [Google Scholar]
- McNeil, A.J.; Frey, R.; Embrechts, P. Quantitative risk management: concepts, techniques and tools-revised edition; Princeton University Press, 2015.
- Pritchard, C.L.; PMP, P.R.; et al. Risk management: concepts and guidance; CRC Press, 2014.
- Muhlbauer, W.K. Pipeline risk management manual: ideas, techniques, and resources; Gulf Professional Publishing, 2004.
- West, J.; Bhattacharya, M. Intelligent financial fraud detection: a comprehensive review. Computers & Security 2016, 57, 47–66. [Google Scholar]
- Dastile, X.; Celik, T.; Potsane, M. Statistical and machine learning models in credit scoring: A systematic literature survey. Applied Soft Computing 2020, 91, 106263. [Google Scholar] [CrossRef]
- Leong, K.; Sung, A. FinTech (Financial Technology): what is it and how to use technologies to create business value in fintech way? International journal of innovation, management and technology 2018, 9, 74–78. [Google Scholar] [CrossRef]
- Stojakovic-Celustka, S. FinTech and its implementation. In Proceedings of the International Workshop on Measuring Ontologies in Value Environments. Springer; 2020; pp. 256–277. [Google Scholar]
- Lee, J.; Stevens, N.; Han, S.C. Large Language Models in Finance (FinLLMs). Neural Computing and Applications 2025, 1–15. [Google Scholar] [CrossRef]
- Shah, A.; Paturi, S.; Chava, S. Trillion dollar words: A new financial dataset, task & market analysis. arXiv 2023, arXiv:2305.07972. [Google Scholar] [CrossRef]
- Ding, Q.; Shi, H.; Liu, B. TradExpert: Revolutionizing trading with mixture of expert LLMs. arXiv 2024, arXiv:2411.00782. [Google Scholar] [CrossRef]
- Zhu, F.; Liu, Z.; Feng, F.; Wang, C.; Li, M.; Chua, T.S. TAT-LLM: A Specialized Language Model for Discrete Reasoning over Financial Tabular and Textual Data. In Proceedings of the 5th ACM International Conference on AI in Finance, 2024; pp. 310–318.
- Inserte, P.R.; Nakhlé, M.; Qader, R.; Caillaut, G.; Liu, J. Large language model adaptation for financial sentiment analysis. arXiv 2024, arXiv:2401.14777. [Google Scholar] [CrossRef]
- Lee, J.; Youn, H.L.; Poon, J.; Han, S.C. StockEmotions: Discover investor emotions for financial sentiment analysis and multivariate time series. arXiv 2023, arXiv:2301.09279. [Google Scholar] [CrossRef]
- Tong, H.; Li, J.; Wu, N.; Gong, M.; Zhang, D.; Zhang, Q. Ploutos: Towards interpretable stock movement prediction with financial large language model. arXiv 2024, arXiv:2403.00782. [Google Scholar] [CrossRef]
- Xie, Q.; Han, W.; Chen, Z.; Xiang, R.; Zhang, X.; He, Y.; Xiao, M.; Li, D.; Dai, Y.; Feng, D.; et al. FinBen: A holistic financial benchmark for large language models. Advances in Neural Information Processing Systems 2024, 37, 95716–95743. [Google Scholar]
- Yuan, T.; He, Z.; Dong, L.; Wang, Y.; Zhao, R.; Xia, T.; Xu, L.; Zhou, B.; Li, F.; Zhang, Z.; et al. R-Judge: Benchmarking safety risk awareness for LLM agents. arXiv 2024, arXiv:2401.10019. [Google Scholar] [CrossRef]
- Zhang, L.; Cai, W.; Liu, Z.; Yang, Z.; Dai, W.; Liao, Y.; Qin, Q.; Li, Y.; Liu, X.; Liu, Z.; et al. FinEval: A Chinese financial domain knowledge evaluation benchmark for large language models. arXiv 2023, arXiv:2308.09975. [Google Scholar] [CrossRef]
- Nie, Y.; Yan, B.; Guo, T.; Liu, H.; Wang, H.; He, W.; Zheng, B.; Wang, W.; Li, Q.; Sun, W.; et al. CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models. arXiv 2024, arXiv:2407.02301. [Google Scholar] [CrossRef]
- Yang, Y.; Zhang, Y.; Hu, Y.; Guo, Y.; Gan, R.; He, Y.; Lei, M.; Zhang, X.; Wang, H.; Xie, Q.; et al. UCFE: A user-centric financial expertise benchmark for large language models. arXiv 2024, arXiv:2410.14059. [Google Scholar] [CrossRef]
- Hirano, M. Construction of a Japanese financial benchmark for large language models. arXiv 2024, arXiv:2403.15062. [Google Scholar] [CrossRef]
- Li, H.; Cao, Y.; Yu, Y.; Javaji, S.R.; Deng, Z.; He, Y.; Jiang, Y.; Zhu, Z.; Subbalakshmi, K.; Xiong, G.; et al. INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent. arXiv 2024, arXiv:2412.18174. [Google Scholar] [CrossRef]
- Solow, R.M. The Economics of Resources or the Resources of Economics. Journal of Natural Resources Policy Research 2008, 1, 69–82. [Google Scholar] [CrossRef]
- Pindyck, R.S.; Rubinfeld, D.L. Microeconomics, 9th ed.; Pearson: Boston, 2018. [Google Scholar]
- Barro, R.J. Macroeconomics; MIT Press: Cambridge, MA, 1997. [Google Scholar]
- Borjas, G.J.; Van Ours, J.C. Labor Economics; McGraw-Hill/Irwin: Boston, 2010. [Google Scholar]
- Tirole, J. The Theory of Industrial Organization; MIT Press: Cambridge, MA, 1988. [Google Scholar]
- Rosen, H.S. Public Finance. In The Encyclopedia of Public Choice; Springer US: Boston, MA, 1992; pp. 252–262. [Google Scholar]
- Barberis, N.; Thaler, R. A Survey of Behavioral Finance. In Handbook of the Economics of Finance; Elsevier, 2003; Vol. 1, pp. 1053–1128. [Google Scholar]
- Judd, K.L. Numerical Methods in Economics; MIT Press: Cambridge, MA, 1998. [Google Scholar]
- Hayashi, F. Econometrics; Princeton University Press: Princeton, NJ, 2011. [Google Scholar]
- Hicks, J.R. Mr. Keynes and the "Classics"; A Suggested Interpretation. Econometrica: Journal of the Econometric Society 1937, 5, 147–159. [Google Scholar] [CrossRef]
- Solow, R.M. A Contribution to the Theory of Economic Growth. The Quarterly Journal of Economics 1956, 70, 65–94. [Google Scholar] [CrossRef]
- Nash, J.F. Equilibrium Points in N-Person Games. Proceedings of the National Academy of Sciences 1950, 36, 48–49. [Google Scholar] [CrossRef] [PubMed]
- Banerjee, A.V.; Duflo, E. Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty; PublicAffairs: New York, 2011. [Google Scholar]
- Smets, F.; Wouters, R. Shocks and Frictions in US Business Cycles: A Bayesian DSGE Approach. American Economic Review 2007, 97, 586–606. [Google Scholar] [CrossRef]
- Milgrom, P.R. Putting Auction Theory to Work; Cambridge University Press: Cambridge, 2004. [Google Scholar]
- Filippas, A.; Horton, J.J.; Manning, B.S. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? In Proceedings of the ACM Conference on Economics and Computation, 2023.
- Taghikhah, F.R. A conceptual framework for developing digital twins of human-environmental systems. In Proceedings of the MODSIM2023, 25th International Congress on Modelling and Simulation; 2023. [Google Scholar]
- Shapira, E.; Madmon, O.; Reinman, I.; Amouyal, S.J.; Reichart, R.; Tennenholtz, M. GLEE: A Unified Framework and Benchmark for Language-based Economic Environments. arXiv 2024, arXiv:2410.05254. [Google Scholar] [CrossRef]
- Guo, Y.; Yang, Y. EconNLI: Evaluating Large Language Models on Economics Reasoning. arXiv 2024, arXiv:2407.01212. [Google Scholar] [CrossRef]
- Mei, Q.; Xie, Y.; Yuan, W.; Jackson, M.O. A Turing test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences of the United States of America 2024, 121. [Google Scholar] [CrossRef]
- Ross, J.; Kim, Y.; Lo, A.W. LLM economicus? Mapping the Behavioral Biases of LLMs via Utility Theory. arXiv 2024, arXiv:2408.02784. [Google Scholar] [CrossRef]
- Horton, J.J. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? NBER Working Paper w31122, National Bureau of Economic Research, 2023.
- Chen, Y.; Liu, T.X.; Shan, Y.; Zhong, S. The emergence of economic rationality of GPT. Proceedings of the National Academy of Sciences of the United States of America 2023, 120. [Google Scholar] [CrossRef]
- Hao, Y.; Xie, D. A Multi-LLM-Agent-Based Framework for Economic and Public Policy Analysis. arXiv 2025, arXiv:2502.16879. [Google Scholar] [CrossRef]
- Li, N.; Gao, C.; Li, M.; Li, Y.; Liao, Q. EconAgent: Large Language Model-Empowered Agents for Simulating Macroeconomic Activities. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; 2023. [Google Scholar]
- Woo, S.H.; Woo, S.; Son, Y.J. LLM Integrated Economic Decision-Making Model for Advancing Socio-Economic Simulation.
- Guo, S.; Bu, H.; Wang, H.; Ren, Y.; Sui, D.; Shang, Y.; Lu, S. Economics Arena for Large Language Models. arXiv 2024, arXiv:2401.01735. [Google Scholar] [CrossRef]
- Quan, Y.; Liu, Z. EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning. arXiv 2024, arXiv:2405.07938. [Google Scholar] [CrossRef]
- Patten, T.V.; Patten, V. Evaluating Domain Specific LLM Performance Within Economics Using the Novel EconQA Dataset, 2023.
- Kahneman, D.; Smith, V.L. Foundations of Behavioral and Experimental Economics, 2002.
- Kahneman, D.; Knetsch, J.L.; Thaler, R.H. Anomalies: The Endowment Effect, Loss Aversion, and Status Quo Bias. Journal of Economic Perspectives 1991, 5, 193–206. [Google Scholar] [CrossRef]
- Engel, C. Dictator games: a meta study. Experimental Economics 2010, 14, 583–610. [Google Scholar] [CrossRef]
- Thaler, R.H. Anomalies: The Ultimatum Game. Journal of Economic Perspectives 1988, 2, 195–206. [Google Scholar] [CrossRef]
- Johnson, N.D.; Mislin, A.A. Trust games: A meta-analysis. Journal of Economic Psychology 2011, 32, 865–889. [Google Scholar] [CrossRef]
- Axtell, R.L.; Farmer, J.D. Agent-Based Modeling in Economics and Finance: Past, Present, and Future. Journal of Economic Literature 2025. [Google Scholar] [CrossRef]
- Macal, C.M.; North, M.J. Tutorial on agent-based modeling and simulation. In Proceedings of the Winter Simulation Conference, 2005. p. 14.
- Samuelson, L. Game Theory in Economics and Beyond. Journal of Economic Perspectives 2016, 30, 107–130. [Google Scholar] [CrossRef]
- Ichiishi, T. Game Theory for Economic Analysis; Elsevier: Amsterdam, 2014. [Google Scholar]
- Parkes, D.C.; Wellman, M.P. Economic reasoning and artificial intelligence. Science 2015, 349, 267–272. [Google Scholar] [CrossRef]
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv 2022, arXiv:2205.11916. [Google Scholar] [CrossRef]
- Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Wu, Z.; Chang, B.; Sun, X.; Xu, J.; Li, L.; Sui, Z. A Survey on In-context Learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; 2022. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.H.; Xia, F.; Le, Q.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar] [CrossRef]
- American Accounting Association, Committee to Prepare a Statement of Basic Accounting Theory. A statement of basic accounting theory; American Accounting Association, 1966.
- Bushman, R.M.; Smith, A.J. Transparency, financial accounting information, and corporate governance. Economic Policy Review 2003, 9. [Google Scholar]
- Hopwood, A.G. On trying to study accounting in the contexts in which it operates. In Accounting From the Outside (RLE Accounting); Routledge, 2013; pp. 159–177. [Google Scholar]
- Bushman, R.M.; Smith, A.J. Financial accounting information and corporate governance. Journal of accounting and Economics 2001, 32, 237–333. [Google Scholar] [CrossRef]
- Weygandt, J.J.; Kimmel, P.D.; Kieso, D.E. Financial accounting with international financial reporting standards; John Wiley & Sons, 2018. [Google Scholar]
- Imhoff, G. Accounting quality, auditing and corporate governance, January 2003.
- Garrison, R.H.; Noreen, E.W.; Brewer, P.C. Managerial accounting; McGraw-Hill, 2021. [Google Scholar]
- Hoogendoorn, M.N. Accounting and taxation in Europe—A comparative overview. European Accounting Review 1996, 5, 783–794. [Google Scholar] [CrossRef]
- Graham, J.R.; Raedy, J.S.; Shackelford, D.A. Research in accounting for income taxes. Journal of Accounting and Economics 2012, 53, 412–434. [Google Scholar] [CrossRef]
- Coakley, J.R.; Brown, C.E. Artificial neural networks in accounting and finance: modeling issues. Intelligent Systems in Accounting, Finance & Management 2000, 9, 119–144. [Google Scholar]
- Callen, J.L.; Kwan, C.C.; Yip, P.C.; Yuan, Y. Neural network forecasting of quarterly accounting earnings. International journal of forecasting 1996, 12, 475–482. [Google Scholar] [CrossRef]
- Nwaobia, A.; Kwarbai, J.; Olajumoke, J.; Ajibade, A. Financial reporting quality on investors’ decisions. International Journal of Economics and Financial Research 2013, 2, 140–147. [Google Scholar]
- Rezaee, Z. Restoring public trust in the accounting profession by developing anti-fraud education, programs, and auditing. Managerial auditing journal 2004, 19, 134–148. [Google Scholar] [CrossRef]
- Allen, E.J.; Allen, J.C.; Raghavan, S.; Solomon, D.H. On the tax efficiency of startup firms. Review of Accounting Studies 2023, 28, 1887–1928. [Google Scholar] [CrossRef]
- Vasarhelyi, M.A.; Kogan, A.; Tuttle, B.M. Big data in accounting: An overview. Accounting Horizons 2015, 29, 381–396. [Google Scholar] [CrossRef]
- Richins, G.; Stapleton, A.; Stratopoulos, T.C.; Wong, C. Big data analytics: opportunity or threat for the accounting profession? Journal of information systems 2017, 31, 63–79. [Google Scholar] [CrossRef]
- Cockcroft, S.; Russell, M. Big data opportunities for accounting and finance practice and research. Australian Accounting Review 2018, 28, 323–333. [Google Scholar] [CrossRef]
- Warren, J.D.; Moffitt, K.C.; Byrnes, P. How big data will change accounting. Accounting horizons 2015, 29, 397–407. [Google Scholar] [CrossRef]
- Alshurafat, H. The usefulness and challenges of chatbots for accounting professionals: Application on ChatGPT. Available at SSRN 4345921 2023. [Google Scholar] [CrossRef]
- Zhao, J.; Wang, X. Unleashing efficiency and insights: Exploring the potential applications and challenges of ChatGPT in accounting. Journal of Corporate Accounting & Finance 2024, 35, 269–276. [Google Scholar]
- Dong, M.M.; Stratopoulos, T.C.; Wang, V.X. A scoping review of ChatGPT research in accounting and finance. International Journal of Accounting Information Systems 2024, 55, 100715. [Google Scholar] [CrossRef]
- Street, D.; Wilck, J. ‘Let’s Have a Chat’: Principles for the Effective Application of ChatGPT and Large Language Models in the Practice of Forensic Accounting. Journal of Forensic and Investigative Accounting, July to December 2023. [Google Scholar] [CrossRef]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 2025, 43, 1–55. [Google Scholar] [CrossRef]
- Wang, C.; Liu, X.; Yue, Y.; Tang, X.; Zhang, T.; Jiayang, C.; Yao, Y.; Gao, W.; Hu, X.; Qi, Z.; et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. arXiv 2023, arXiv:2310.07521. [Google Scholar] [CrossRef]
- Gu, H.; Schreyer, M.; Moffitt, K.; Vasarhelyi, M. Artificial intelligence co-piloted auditing. International Journal of Accounting Information Systems 2024, 54, 100698. [Google Scholar] [CrossRef]
- Berger, A.; Hillebrand, L.; Leonhard, D.; Deußer, T.; De Oliveira, T.B.F.; Dilmaghani, T.; Khaled, M.; Kliem, B.; Loitz, R.; Bauckhage, C.; et al. Towards automated regulatory compliance verification in financial auditing with large language models. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData). IEEE; 2023; pp. 4626–4635. [Google Scholar]
- Li, H.; Freitas, M.M.d.; Lee, H.; Vasarhelyi, M. Enhancing continuous auditing with large language models: AI-assisted real-time accounting information cross-verification. Available at SSRN 4692960 2024. [Google Scholar] [CrossRef]
- Eulerich, M.; Wood, D.A. A demonstration of how ChatGPT can be used in the internal auditing process. Available at SSRN 4519583 2023. [Google Scholar] [CrossRef]
- Emett, S.; Eulerich, M.; Lipinski, E.; Prien, N.; Wood, D.A. Leveraging ChatGPT for Enhancing the Internal Audit Process—A Real-World Example from Uniper, a Large Multinational Company. Accounting Horizons 1–11. [CrossRef]
- Fedyk, A.; Hodson, J.; Khimich, N.; Fedyk, T. Is artificial intelligence improving the audit process? Review of Accounting Studies 2022, 27, 938–985. [Google Scholar] [CrossRef]
- Fotoh, L.; Mugwira, T. Exploring Large Language Models (ChatGPT) in External Audits: Implications and Ethical Considerations. Available at SSRN 4453835 2023. [Google Scholar] [CrossRef]
- Föhr, T.L.; Schreyer, M.; Moffitt, K.; Marten, K.U. Deep Learning Meets Risk-Based Auditing: A Holistic Framework for Leveraging Foundation and Task-Specific Models in Audit Procedures. Available at SSRN 4488271 2023. [Google Scholar] [CrossRef]
- Wang, R.; Liu, J.; Zhao, W.; Li, S.; Zhang, D. AuditBench: A Benchmark for Large Language Models in Financial Statement Auditing. In Proceedings of the 2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle.
- De Villiers, C.; Dimes, R.; Molinari, M. How will AI text generation and processing impact sustainability reporting? Critical analysis, a conceptual framework and avenues for future research. Sustainability Accounting, Management and Policy Journal 2024, 15, 96–118. [Google Scholar] [CrossRef]
- Föhr, T.L.; Schreyer, M.; Juppe, T.A.; Marten, K.U. Assuring sustainable futures: Auditing sustainability reports using AI foundation models. Available at SSRN 4502549 2023. [Google Scholar] [CrossRef]
- Kim, A.; Muhn, M.; Nikolaev, V. Bloated disclosures: can ChatGPT help investors process information? arXiv 2023, arXiv:2306.10224. [Google Scholar] [CrossRef]
- Harris, T. Managers’ perception of product market competition and earnings management: a textual analysis of firms’ 10-K reports. Journal of Accounting Literature 2024. [Google Scholar] [CrossRef]
- Ni, J.; Bingler, J.; Colesanti-Senni, C.; Kraus, M.; Gostlow, G.; Schimanski, T.; Stammbach, D.; Vaghefi, S.A.; Wang, Q.; Webersinke, N.; et al. Paradigm shift in sustainability disclosure analysis: Empowering stakeholders with ChatReport, a language model-based tool, 2023.
- Comlekci, İ.; Unal, S.; Ozer, A.; Oncu, M.A. Can AI Technologies Estimate Financials Accurately? A Research on Borsa Istanbul with ChatGPT, 2023; pp. 1–14.
- Beerbaum, D. Generative artificial intelligence (GAI) ethics taxonomy-applying chat GPT for robotic process automation (GAI-RPA) as business case. Available at SSRN 4385025 2023. [Google Scholar] [CrossRef]
- Li, H.; Vasarhelyi, M.A. Applying large language models in accounting: A comparative analysis of different methodologies and off-the-shelf examples. Journal of Emerging Technologies in Accounting 2024, 21, 133–152. [Google Scholar] [CrossRef]
- Choi, E.; Suh, Y.J.; Park, H.; Hwang, W. Taxation Perspectives from Large Language Models: A Case Study on Additional Tax Penalties. arXiv 2025, arXiv:2503.03444. [Google Scholar] [CrossRef]
- Alarie, B.; Condon, K.; Massey, S.; Yan, C. The Rise of Generative AI for Tax Research. Tax Notes Federal 2023, 1509. [Google Scholar]
- Zhang, L. Four tax questions for ChatGPT and other language models, 2023.
- Hechtner, F.; Schmidt, L.; Seebeck, A.; Weiß, M. How to design and employ Specialized Large Language Models for Accounting and Tax Research: The Example of TaxBERT. Available at SSRN 2025. [Google Scholar]
- Choi, G.Y.; Kim, A. Firm-level tax audits: A Generative AI-based measurement. Chicago Booth Research Paper 2024. [Google Scholar] [CrossRef]
- Boynton, W.C.; Johnson, R.N. Modern auditing: Assurance services and the integrity of financial reporting; John Wiley & Sons, 2005. [Google Scholar]
- Antipova, T. Auditing for financial reporting. In Global Encyclopedia of Public Administration, Public Policy, and Governance; Springer, 2023; pp. 656–664. [Google Scholar]
- Kumar, R.; Sharma, V. Auditing: Principles and practice; PHI Learning Pvt. Ltd., 2015. [Google Scholar]
- Coderre, D. Internal audit efficiency through automation; Wiley Online Library, 2009. [Google Scholar]
- Dittenhofer, M. Internal auditing effectiveness: an expansion of present methods. Managerial auditing journal 2001, 16, 443–450. [Google Scholar] [CrossRef]
- Christensen, B.E.; Elder, R.J.; Glover, S.M. Behind the numbers: Insights into large audit firm sampling policies. Accounting Horizons 2015, 29, 61–81. [Google Scholar] [CrossRef]
- Gepp, A.; Linnenluecke, M.K.; O’neill, T.J.; Smith, T. Big data techniques in auditing research and practice: Current trends and future opportunities. Journal of Accounting Literature 2018, 40, 102–115. [Google Scholar] [CrossRef]
- Warren, C.S.; Jones, J.P.; Tayler, W.B. Financial and managerial accounting; Cengage Learning, Inc., 2020. [Google Scholar]
- Williams, J.R.; Haka, S.F.; Bettner, M.S.; Carcello, J.V. Financial & managerial accounting: the basis for business decisions; McGraw-Hill, 2018. [Google Scholar]
- Krahel, J.P.; Titera, W.R. Consequences of big data and formalization on accounting and auditing standards. Accounting Horizons 2015, 29, 409–422. [Google Scholar] [CrossRef]
- Saha, A.; Morris, R.D.; Kang, H. Disclosure overload? An empirical analysis of international financial reporting standards disclosure requirements. Abacus 2019, 55, 205–236. [Google Scholar] [CrossRef]
- Heidmann, M.; Schäffer, U.; Strahringer, S. Exploring the role of management accounting systems in strategic sensemaking. Information Systems Management 2008, 25, 244–257. [Google Scholar] [CrossRef]
- Granlund, M.; Malmi, T. Moderate impact of ERPS on management accounting: a lag or permanent outcome? Management accounting research 2002, 13, 299–321. [Google Scholar] [CrossRef]
- Bhimani, A.; Willcocks, L. Digitisation, ‘Big Data’ and the transformation of accounting information. Accounting and Business Research 2014, 44, 469–490. [Google Scholar] [CrossRef]
- Besley, T.; Persson, T. Taxation and development. In Handbook of public economics; Elsevier, 2013; Vol. 5, pp. 51–110. [Google Scholar]
- Salanié, B. The economics of taxation; MIT Press, 2011. [Google Scholar]
- Laffer, A.B.; Winegarden, W.H.; Childs, J. The economic burden caused by tax code complexity. The Laffer Center for Supply-Side Economics 2011, pp. 1–24.
- Brady, D. Tax Complexity 2024: It Takes Americans Billions of Hours to Do Their Taxes, 2024.
- Slemrod, J. Tax compliance and enforcement. Journal of economic literature 2019, 57, 904–954. [Google Scholar] [CrossRef]
- Hanlon, M.; Heitzman, S. A review of tax research. Journal of accounting and Economics 2010, 50, 127–178. [Google Scholar] [CrossRef]
- Thomson Reuters. While tax professionals recognize ChatGPT’s potential, they are aware of the risks, new report shows, 2023.
- Holzenberger, N.; Blair-Stanek, A.; Van Durme, B. A dataset for statutory reasoning in tax law entailment and question answering. arXiv 2020, arXiv:2005.05257. [Google Scholar] [CrossRef]
- Deloitte. Deloitte Unveils Zora AI, Agentic AI for Tomorrow’s Workforce, 2025. Accessed: 2025-05-13.
- Kotler, P.; Burton, S.; Deans, K.; Brown, L.; Armstrong, G. Marketing; Pearson Higher Education AU, 2015.
- Calder, B.J. Focus groups and the nature of qualitative marketing research. Journal of Marketing Research 1977, 14, 353–364. [Google Scholar] [CrossRef]
- Mariampolski, H. Qualitative market research; Sage, 2001. [Google Scholar]
- Franses, P.H.; Paap, R. Quantitative models in marketing research; Cambridge University Press, 2001. [Google Scholar]
- Dzwigol, H. Innovation in marketing research: quantitative and qualitative analysis, 2020.
- Malhotra, N.K.; Nunan, D.; Birks, D.F. Marketing research, 2020.
- Backlinko. 23 Key Market Research Statistics for 2025, 2025. Accessed: 2025-04-06.
- Fligstein, N.; Dauter, L. The sociology of markets. Annu. Rev. Sociol. 2007, 33, 105–128. [Google Scholar] [CrossRef]
- Mazzocchi, M. Statistics for marketing and consumer research, 2008.
- Glaeser, E.L. Psychology and the Market. American Economic Review 2004, 94, 408–413. [Google Scholar] [CrossRef]
- Williams, B. Limitations of Market Research: Common Challenges, 2024. Accessed: 2025-04-06.
- Pascucci, F.; Savelli, E.; Gistri, G. How digital technologies reshape marketing: evidence from a qualitative investigation. Italian Journal of Marketing 2023, 2023, 27–58. [Google Scholar] [CrossRef]
- Albrecht, J.; Kitanidis, E.; Fetterman, A.J. Despite “super-human” performance, current LLMs are unsuited for decisions about ethics and safety. arXiv 2022, arXiv:2212.06295. [Google Scholar] [CrossRef]
- Amini, R.; Amini, A. An overview of artificial intelligence and its application in marketing with focus on large language models. International Journal of Science and Research Archive 2024. [Google Scholar]
- Verma, S.; Sharma, R.; Deb, S.; Maitra, D. Artificial intelligence in marketing: Systematic review and future research direction. Int. J. Inf. Manag. Data Insights 2021, 1, 100002. [Google Scholar] [CrossRef]
- Jain, V.; Rai, H.; Parvathy, P.; Mogaji, E. The Prospects and Challenges of ChatGPT on Marketing Research and Practices. SSRN Electronic Journal 2023. [Google Scholar] [CrossRef]
- Aghaei, R.; Kiaei, A.; Boush, M.; Vahidi, J.; Zavvar, M.; Barzegar, Z.; Rofoosheh, M. Harnessing the Potential of Large Language Models in Modern Marketing Management: Applications, Future Directions, and Strategic Recommendations. arXiv 2025, arXiv:2501.10685. [Google Scholar] [CrossRef]
- Li, Y.; Liu, Y.; Yu, M. Consumer segmentation with large language models. Journal of Retailing and Consumer Services. [CrossRef]
- Goli, A.; Singh, A. Frontiers: Can Large Language Models Capture Human Preferences? Mark. Sci. 2024, 43, 709–722. [Google Scholar] [CrossRef]
- Wang, M.; Zhang, D.J.; Zhang, H. Large Language Models for Market Research: A Data-augmentation Approach. arXiv 2024, arXiv:2412.19363. [Google Scholar] [CrossRef]
- Praveen, S.; Gajjar, P.; Ray, R.; Dutt, A. Crafting clarity: Leveraging large language models to decode consumer reviews. Journal of Retailing and Consumer Services 2024. [Google Scholar] [CrossRef]
- Naik, N.; Bhat, R.; Kumar, A.; Bapat, G.S.; Kumar, A.; Hota, S.L.; Abishek, G.D.; Vaz, S. Unlocking Brand Excellence: Harnessing AI Tools for Enhanced Customer Engagement and Innovation. RAiSE-2023 2024.
- Sarstedt, M.; Adler, S.J.; Rau, L.; Schmitt, B. Using large language models to generate silicon samples in consumer and marketing research: Challenges, opportunities, and guidelines. Psychology & Marketing 2024. [Google Scholar]
- Paul, J.; Ueno, A.; Dennis, C. ChatGPT and consumers: Benefits, Pitfalls and Future Research Agenda. International Journal of Consumer Studies 2023. [Google Scholar] [CrossRef]
- Kasuga, A.; Yonetani, R. CXSimulator: A User Behavior Simulation using LLM Embeddings for Web-Marketing Campaign Assessment. In Proceedings of the International Conference on Information and Knowledge Management; 2024. [Google Scholar]
- Wahid, R.M.; Mero, J.; Ritala, P. Editorial: Written by ChatGPT, illustrated by Midjourney: generative AI for content marketing. Asia Pacific Journal of Marketing and Logistics 2023. [Google Scholar] [CrossRef]
- Aldous, K.K.; Salminen, J.O.; Farooq, A.; Jung, S.G.; Jansen, B.J. Using ChatGPT in Content Marketing: Enhancing Users’ Social Media Engagement in Cross-Platform Content Creation through Generative AI. In Proceedings of the 35th ACM Conference on Hypertext and Social Media; 2024. [Google Scholar]
- Golab-Andrzejak, E. The Impact of Generative AI and ChatGPT on Creating Digital Advertising Campaigns. Cybernetics and Systems 2025. [Google Scholar] [CrossRef]
- Huh, J.; Nelson, M.R.; Russell, C.A. ChatGPT, AI Advertising, and Advertising Research and Education. Journal of Advertising 2023, 52, 477–482. [Google Scholar] [CrossRef]
- Ivanov, S. Using Artificial Intelligence to Create Marketing Content – Opportunities and Limitations, 2023. Accessed: 2025-04-06.
- Reisenbichler, M.; Reutterer, T.; Schweidel, D.A.; Dan, D. Frontiers: Supporting Content Marketing with Natural Language Generation. Mark. Sci. 2022, 41, 441–452. [Google Scholar] [CrossRef]
- Yeykelis, L.; Pichai, K.; Cummings, J.J.; Reeves, B. Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings. arXiv 2024, arXiv:2408.16073. [Google Scholar] [CrossRef]
- Saputra, R.; Nasution, M.I.P.; Dharma, B. The Impact of Using AI Chat GPT on Marketing Effectiveness: A Case Study on Instagram Marketing. Indonesian Journal of Economics and Management 2023. [Google Scholar] [CrossRef]
- Soykoth, M.W.; Sim, W.; Frederick, S. Research trends in market intelligence: a review through a data-driven quantitative approach. Journal of Marketing Analytics 2024. [Google Scholar] [CrossRef]
- Mutoffar, M.M.; Kuswayati, S.; Sumarni, T.; Dewi, R.K.S.; Nurjanah, E. Role of ChatGPT as an Innovative Tool for Data Analysis and Market Trend Prediction in Business Information Systems. 2024, 13. [Google Scholar]
- Solomon, M.R. Consumer Behavior: Buying, Having, and Being. 1993.
- Hawkins, D.I.; Mothersbaugh, D.L. Consumer Behavior: Building Marketing Strategy. 1997.
- Peter, J.P.; Olson, J.C. Consumer Behavior and Marketing Strategy. 1990.
- Balteș, L.P. Content marketing - the fundamental tool of digital marketing. Bulletin of the Transilvania University of Brasov. Series V: Economic Sciences 2015, pp. 111–118.
- Dwivedi, Y.K. Social media marketing and advertising. The Marketing Review 2015, 15, 289. [Google Scholar] [CrossRef]
- Johnson, J.P.; Myatt, D.P. On the Simple Economics of Advertising, Marketing, and Product Design. IO: Theory 2005. [Google Scholar] [CrossRef]
- Flying V Group. Content Marketing vs Traditional Marketing: Which is Better?, n.d. Accessed: 2025-04-06.
- Sharma, A. How LLMs Become Your Marketing & Content Powerhouse. A3Logics Blog 2024. Accessed: 2025-04-06.
- Jenster, P.V.; Søilen, K.S. Market Intelligence: Building Strategic Insight. 2009.
- Wee, T.T.T. The use of marketing research and intelligence in strategic planning: key issues and future trends. Marketing Intelligence & Planning 2001, 19, 245–253. [Google Scholar] [CrossRef]
- Kushwaha, A.K.; Kar, A.K. MarkBot - A Language Model-Driven Chatbot for Interactive Marketing in Post-Modern World. Inf. Syst. Frontiers 2021, 26, 857–874. [Google Scholar] [CrossRef]
- Cashion, F.; O’Brien, J. Generative AI Takes Off with Marketers. American Marketing Association 2024. Accessed: 2025-05-12.
- Eves, H.W. Foundations and fundamental concepts of mathematics; Courier Corporation, 1997. [Google Scholar]
- Courant, R.; Robbins, H. What is Mathematics?: an elementary approach to ideas and methods; Oxford University Press, 1996. [Google Scholar]
- Horsten, L. Philosophy of mathematics. Stanford Encyclopedia of Philosophy 2007. [Google Scholar]
- National Research Council; Division on Engineering and Physical Sciences; Board on Mathematical Sciences and Their Applications; Committee on the Mathematical Sciences in 2025. The mathematical sciences in 2025; National Academies Press, 2013.
- Treves, F. Topological Vector Spaces, Distributions and Kernels; Pure and Applied Mathematics, Vol. 25; Elsevier, 2016.
- Lax, P.D.; Phillips, R.S. Scattering Theory; Pure and Applied Mathematics, Vol. 26; Elsevier, 2016.
- Hersh, R. What is mathematics, really? Mitteilungen der Deutschen Mathematiker-Vereinigung 1998, 6, 13–14. [Google Scholar] [CrossRef]
- Holmes, M.H. Introduction to the foundations of applied mathematics; Vol. 56, Springer, 2009.
- Logan, J.D. Applied mathematics; John Wiley & Sons, 2013. [Google Scholar]
- Polya, G. How to solve it: A new aspect of mathematical method; Princeton University Press, 1945. [Google Scholar]
- Etingof, P. PRIMES: How to Succeed in Mathematical Research.
- Ho, A.; Besiroglu, T. What is the Future of AI in Mathematics? Interviews with Leading Mathematicians, 2024. Accessed: 2025-04-04.
- ADVISER, M. Mathematicians use DeepMind AI to create new methods in problem-solving, 2021.
- Thurston, W.P. Mathematical education. arXiv 2005. [Google Scholar] [CrossRef]
- Kajander, A.; Boland, T. Mathematical models for teaching: Reasoning without memorization; Canadian Scholars’ Press, 2014. [Google Scholar]
- Peter, E.E. Critical thinking: Essence for teaching mathematics and mathematics problem solving skills. African Journal of Mathematics and Computer Science Research 2012, 5, 39–43. [Google Scholar] [CrossRef]
- Lama, V.; Ma, C.; Ghosal, T. Benchmarking Automated Theorem Proving with Large Language Models. In Proceedings of the 1st Workshop on NLP for Science (NLP4Science), 2024; pp. 208–218.
- Song, P.; Yang, K.; Anandkumar, A. Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean. arXiv 2025, arXiv:2404.12534. [Google Scholar] [CrossRef]
- Romera-Paredes, B.; Barekatain, M.; Novikov, A.; Balog, M.; Kumar, M.P.; Dupont, E.; Ruiz, F.J.; Ellenberg, J.S.; Wang, P.; Fawzi, O.; et al. Mathematical discoveries from program search with large language models. Nature 2024, 625, 468–475. [Google Scholar] [CrossRef]
- Mirzadeh, I.; Alizadeh, K.; Shahrokhi, H.; Tuzel, O.; Bengio, S.; Farajtabar, M. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. In Proceedings of the ICLR; 2025. [Google Scholar]
- Cheng, F.; Li, H.; Liu, F.; van Rooij, R.; Zhang, K.; Lin, Z. Empowering LLMs with Logical Reasoning: A Comprehensive Survey. arXiv 2025, arXiv:2502.15652. [Google Scholar] [CrossRef]
- Agrawal, P.; Vasania, S.; Tan, C. Exploring the Limitations of Graph-based Logical Reasoning in Large Language Models. OpenReview 2025. [Google Scholar]
- Li, Y.; Kuang, J.; Huang, H.; Xu, Z.; Liang, X.; Yu, Y.; Lu, W.; Li, Y.; Tan, X.; Qu, C.; et al. One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs. arXiv 2025, arXiv:2502.10454. [Google Scholar] [CrossRef]
- Davoodi, A.G.; Davoudi, S.P.M.; Pezeshkpour, P. LLMs are not intelligent thinkers: Introducing mathematical topic tree benchmark for comprehensive evaluation of LLMs. arXiv 2024, arXiv:2406.05194. [Google Scholar] [CrossRef]
- Ferrag, M.A.; Tihanyi, N.; Debbah, M. Reasoning Beyond Limits: Advances and Open Problems for LLMs. arXiv 2025, arXiv:2503.22732. [Google Scholar] [CrossRef]
- AlphaProof and AlphaGeometry teams. AI achieves silver-medal standard solving International Mathematical Olympiad problems, 2024.
- Yu, L.; Jiang, W.; Shi, H.; Yu, J.; Liu, Z.; Zhang, Y.; Kwok, J.T.; Li, Z.; Weller, A.; Liu, W. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv 2023, arXiv:2309.12284. [Google Scholar] [CrossRef]
- Polu, S.; Sutskever, I. Generative language modeling for automated theorem proving. arXiv 2020, arXiv:2009.03393. [Google Scholar] [CrossRef]
- Han, J.M.; Rute, J.; Wu, Y.; Ayers, E.W.; Polu, S. Proof artifact co-training for theorem proving with language models. arXiv 2021, arXiv:2102.06203. [Google Scholar] [CrossRef]
- Jiang, A.Q.; Li, W.; Tworkowski, S.; Czechowski, K.; Odrzygóźdź, T.; Miłoś, P.; Wu, Y.; Jamnik, M. Thor: Wielding hammers to integrate language models and automated theorem provers. In Proceedings of the NeurIPS; 2022. [Google Scholar]
- Lample, G.; Lacroix, T.; Lachaux, M.A.; Rodriguez, A.; Hayat, A.; Lavril, T.; Ebner, G.; Martinet, X. Hypertree proof search for neural theorem proving. Advances in neural information processing systems 2022, 35, 26337–26349. [Google Scholar]
- First, E.; Rabe, M.N.; Ringer, T.; Brun, Y. Baldur: Whole-Proof Generation and Repair with Large Language Models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023), New York, NY, USA, 2023; pp. 1229–1241. [CrossRef]
- Thakur, A.; Wen, Y.; Chaudhuri, S. A Language-Agent Approach to Formal Theorem-Proving. OpenReview 2024. [Google Scholar]
- Yang, K.; Swope, A.; Gu, A.; Chalamala, R.; Song, P.; Yu, S.; Godil, S.; Prenger, R.J.; Anandkumar, A. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc., 2023; pp. 21573–21612. [Google Scholar]
- Liang, S.; Zhang, W.; Zhong, T.; Liu, T. Mathematics and machine creativity: A survey on bridging mathematics with AI. arXiv 2024, arXiv:2412.16543. [Google Scholar] [CrossRef]
- Johansson, M.; Smallbone, N. Exploring Mathematical Conjecturing with Large Language Models. In Proceedings of the NeSy; 2023; pp. 62–77. [Google Scholar]
- Gao, H.; Kaltenbach, S.; Koumoutsakos, P. Generative learning for forecasting the dynamics of high-dimensional complex systems. Nature Communications 2024, 15, 8904. [Google Scholar] [CrossRef]
- Kumar, H.; Rothschild, D.M.; Goldstein, D.G.; Hofman, J.M. Math education with large language models: peril or promise? Available at SSRN 4641653 2023. [Google Scholar] [CrossRef]
- Wardat, Y.; Tashtoush, M.A.; AlAli, R.; Jarrah, A.M. ChatGPT: A revolutionary tool for teaching and learning mathematics. Eurasia Journal of Mathematics, Science and Technology Education 2023, 19, em2286. [Google Scholar] [CrossRef]
- Wang, S.; Xu, T.; Li, H.; Zhang, C.; Liang, J.; Tang, J.; Yu, P.S.; Wen, Q. Large language models for education: A survey and outlook. arXiv 2024, arXiv:2403.18105. [Google Scholar] [CrossRef]
- Xu, H.; Gan, W.; Qi, Z.; Wu, J.; Yu, P.S. Large language models for education: A survey. arXiv 2024, arXiv:2405.13001. [Google Scholar] [CrossRef]
- Hadzhikoleva, S.; Rachovski, T.; Ivanov, I.; Hadzhikolev, E.; Dimitrov, G. Automated Test Creation Using Large Language Models: A Practical Application. Applied Sciences 2024, 14, 9125. [Google Scholar] [CrossRef]
- De Bruijn, N.G. The mathematical language AUTOMATH, its usage, and some of its extensions. In Studies in Logic and the Foundations of Mathematics; Elsevier, 1994; Vol. 133, pp. 73–100.
- Harrison, J.; Urban, J.; Wiedijk, F. History of interactive theorem proving. In Handbook of the History of Logic; Elsevier, 2014; Vol. 9, pp. 135–214.
- Maric, F. A survey of interactive theorem proving. Zbornik radova 2015, 18, 173–223. [Google Scholar]
- Megill, N.; Wheeler, D.A. Metamath: a computer language for mathematical proofs; Lulu.com, 2019. [Google Scholar]
- Nipkow, T.; Wenzel, M.; Paulson, L.C. Isabelle/HOL: a proof assistant for higher-order logic; Springer, 2002. [Google Scholar]
- The Coq Development Team. The Coq proof assistant reference manual; INRIA Rocquencourt and ENS Lyon, 1996. [Google Scholar]
- Moura, L.d.; Ullrich, S. The Lean 4 theorem prover and programming language. In Proceedings of the Automated Deduction–CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings 28. Springer, 2021, pp. 625–635.
- Zheng, C.; Wang, H.; Xie, E.; Liu, Z.; Sun, J.; Xin, H.; Shen, J.; Li, Z.; Li, Y. Lyra: Orchestrating Dual Correction in Automated Theorem Proving, 2024.
- Google DeepMind. AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms, 2025.
- Huss-Lederman, S.; Jacobson, E.M.; Tsao, A.; Turnbull, T.; Johnson, J.R. Implementation of Strassen’s algorithm for matrix multiplication. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, 1996; pp. 32–es.
- Li, J.; Ranka, S.; Sahni, S. Strassen’s matrix multiplication on GPUs. In Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems; IEEE, 2011; pp. 157–164. [Google Scholar]
- Musin, O.R. The kissing number in four dimensions. Annals of Mathematics 2008, 1–32. [Google Scholar] [CrossRef]
- Zheng, K.; Han, J.M.; Polu, S. MiniF2F: a cross-system benchmark for formal olympiad-level mathematics. arXiv 2021, arXiv:2109.00110. [Google Scholar] [CrossRef]
- Polu, S.; Han, J.M.; Zheng, K.; Baksys, M.; Babuschkin, I.; Sutskever, I. Formal mathematics statement curriculum learning. arXiv 2022, arXiv:2202.01344. [Google Scholar] [CrossRef]
- Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.; Wu, Y.; et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv 2024, arXiv:2402.03300. [Google Scholar] [CrossRef]
- Xin, H.; Guo, D.; Shao, Z.; Ren, Z.; Zhu, Q.; Liu, B.; Ruan, C.; Li, W.; Liang, X. DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data. arXiv 2024, arXiv:2405.14333. [Google Scholar] [CrossRef]
- Gao, B.; Song, F.; Yang, Z.; Cai, Z.; Miao, Y.; Dong, Q.; Li, L.; Ma, C.; Chen, L.; Xu, R.; et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv 2024, arXiv:2410.07985. [Google Scholar] [CrossRef]
- AIME Problem Set: 1983-2024, 2024.
- Wei, T.; Luan, J.; Liu, W.; Dong, S.; Wang, B. CMATH: Can your language model pass Chinese elementary school math test? arXiv 2023, arXiv:2306.16636. [Google Scholar] [CrossRef]
- Zhong, W.; Cui, R.; Guo, Y.; Liang, Y.; Lu, S.; Wang, Y.; Saied, A.; Chen, W.; Duan, N. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv 2023, arXiv:2304.06364. [Google Scholar] [CrossRef]
- Zhang, X.; Li, C.; Zong, Y.; Ying, Z.; He, L.; Qiu, X. Evaluating the performance of large language models on gaokao benchmark. arXiv 2023, arXiv:2305.12474. [Google Scholar] [CrossRef]
- Koncel-Kedziorski, R.; Roy, S.; Amini, A.; Kushman, N.; Hajishirzi, H. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016; pp. 1152–1157.
- Miao, S.y.; Liang, C.C.; Su, K.Y. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020; pp. 975–984.
- Patel, A.; Bhattamishra, S.; Goyal, N. Are NLP Models really able to Solve Simple Math Word Problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 2021; pp. 2080–2094. [Google Scholar] [CrossRef]
- Ling, W.; Yogatama, D.; Dyer, C.; Blunsom, P. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the ACL; 2017. [Google Scholar]
- Amini, A.; Gabriel, S.; Lin, S.; Koncel-Kedziorski, R.; Choi, Y.; Hajishirzi, H. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the NAACL; 2019. [Google Scholar]
- Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let’s Verify Step by Step. arXiv 2023, arXiv:2305.20050. [Google Scholar] [CrossRef]
- Chen, W.; Yin, M.; Ku, M.; Lu, P.; Wan, Y.; Ma, X.; Xu, J.; Wang, X.; Xia, T. Theoremqa: A theorem-driven question answering dataset. arXiv 2023, arXiv:2305.12524. [Google Scholar] [CrossRef]
- Mishra, S.; Finlayson, M.; Lu, P.; Tang, L.; Welleck, S.; Baral, C.; Rajpurohit, T.; Tafjord, O.; Sabharwal, A.; Clark, P.; et al. Lila: A Unified Benchmark for Mathematical Reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2022. [Google Scholar]
- Yue, X.; Qu, X.; Zhang, G.; Fu, Y.; Huang, W.; Sun, H.; Su, Y.; Chen, W. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv 2023, arXiv:2309.05653. [Google Scholar] [CrossRef]
- Maxwell, J.C. Matter and Motion; 1878.
- Wigner, E.P. The unreasonable effectiveness of mathematics in the natural sciences. Communications on Pure and Applied Mathematics 1960, 13, 1–14. [Google Scholar] [CrossRef]
- Einstein, A. Physics and Reality. Journal of the Franklin Institute 1936, 221, 349–382. [Google Scholar] [CrossRef]
- Feynman, R.P. The Character of Physical Law; MIT Press, 1965. [Google Scholar]
- Abbott, B.P.; Abbott, R.; Abbott, T.D.; Abernathy, M.R.; Acernese, F.; Ackley, K.; Adams, C.; Adams, T.; Addesso, P.; Adhikari, R.X.; et al. Observation of gravitational waves from a binary black hole merger. Physical review letters 2016, 116, 061102. [Google Scholar] [CrossRef]
- Einstein, A. Näherungsweise integration der feldgleichungen der gravitation. Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften 1916, 688–696. [Google Scholar]
- Sathyaprakash, B.S.; Schutz, B.F. Physics, astrophysics and cosmology with gravitational waves. Living reviews in relativity 2009, 12, 2. [Google Scholar] [CrossRef] [PubMed]
- Sathyaprakash, B.; Abernathy, M.; Acernese, F.; Ajith, P.; Allen, B.; Amaro-Seoane, P.; Andersson, N.; Aoudia, S.; Arun, K.; Astone, P.; et al. Scientific objectives of Einstein telescope. Classical and Quantum Gravity 2012, 29, 124013. [Google Scholar] [CrossRef]
- Yunes, N.; Siemens, X. Gravitational-wave tests of general relativity with ground-based detectors and pulsar-timing arrays. Living Reviews in Relativity 2013, 16, 1–124. [Google Scholar] [CrossRef]
- Halliday, D.; Resnick, R.; Walker, J. Fundamentals of physics; John Wiley & Sons, 2013. [Google Scholar]
- Halliday, D.; Resnick, R.; Walker, J. Principles of physics; 2023.
- Misner, C.W.; Thorne, K.S.; Wheeler, J.A. Gravitation; Macmillan, 1973. [Google Scholar]
- Goldstein, H.; Poole, C.; Safko, J.; Addison, S.R. Classical mechanics, 2002.
- Arfken, G.B.; Weber, H.J.; Harris, F.E. Mathematical methods for physicists: a comprehensive guide; Academic press, 2011. [Google Scholar]
- Kittel, C.; McEuen, P. Introduction to solid state physics; John Wiley & Sons, 2018. [Google Scholar]
- Haken, H.; Wolf, H.C. The physics of atoms and quanta: introduction to experiments and theory; Springer Science & Business Media, 2006. [Google Scholar]
- Griffiths, D. Introduction to elementary particles; John Wiley & Sons, 2020. [Google Scholar]
- Martin, R.M. Electronic structure: basic theory and practical methods; Cambridge University Press, 2020. [Google Scholar]
- Marder, M.P. Condensed matter physics; John Wiley & Sons, 2010. [Google Scholar]
- Zoller, P.; Beth, T.; Binosi, D.; Blatt, R.; Briegel, H.; Bruß, D.; Calarco, T.; Cirac, J.I.; Deutsch, D.; Eisert, J.; et al. Quantum information processing and communication: Strategic report on current status, visions and goals for research in Europe. The European Physical Journal D-Atomic, Molecular, Optical and Plasma Physics 2005, 36, 203–228. [Google Scholar] [CrossRef]
- Carroll, B.W.; Ostlie, D.A. An introduction to modern astrophysics; Cambridge University Press, 2017. [Google Scholar]
- Dodelson, S.; Schmidt, F. Modern cosmology; Elsevier, 2024. [Google Scholar]
- Kivelson, M.G.; Russell, C.T. Introduction to space physics; 1995.
- National Academies of Sciences, Engineering, and Medicine. Pathways to Discovery in Astronomy and Astrophysics for the 2020s; National Academies Press, 2021.
- Heilbron, J.L. The Oxford guide to the history of physics and astronomy; Oxford University Press, 2005. [Google Scholar]
- Greene, B. The fabric of the cosmos: Space, time, and the texture of reality; Knopf, 2004. [Google Scholar]
- Carleo, G.; Cirac, I.; Cranmer, K.; Daudet, L.; Schuld, M.; Tishby, N.; Vogt-Maranto, L.; Zdeborová, L. Machine learning and the physical sciences. Reviews of Modern Physics 2019, 91, 045002. [Google Scholar] [CrossRef]
- Avallone, E.A.; Baumeister, T.; Sadegh, A.M. Marks’ Standard Handbook for Mechanical Engineers; McGraw-Hill, 2006. [Google Scholar]
- Wickert, J.; Lewis, K. An introduction to mechanical engineering; Cengage learning, 2013. [Google Scholar]
- Craig Jr, R.R.; Taleff, E.M. Mechanics of materials; John Wiley & Sons, 2020. [Google Scholar]
- Moran, M.J.; Shapiro, H.N.; Boettner, D.D.; Bailey, M.B. Fundamentals of engineering thermodynamics; John Wiley & Sons, 2010. [Google Scholar]
- Ogata, K.; et al. Modern control engineering; Prentice Hall India, 2009. [Google Scholar]
- Zienkiewicz, O.C.; Taylor, R.L.; Zhu, J.Z. The finite element method: its basis and fundamentals; Elsevier, 2005. [Google Scholar]
- Lomax, H.; Pulliam, T.H.; Zingg, D.W. Fundamentals of computational fluid dynamics; Vol. 246, Springer, 2001.
- Budynas, R.G.; Nisbett, J.K. Shigley’s mechanical engineering design, 9th ed.; McGraw-Hill: New York, 2011.
- Sharma, V.; Pandey, P.M. Additive and Subtractive Manufacturing Processes: Principles and Applications; CRC Press, 2022.
- Tyagi, A.K.; Tiwari, S.; Ahmad, S.S. Industry 4.0, Smart Manufacturing, and Industrial Engineering: Challenges and Opportunities; 2024.
- Soori, M.; Dastres, R.; Arezoo, B.; Jough, F.K.G. Intelligent robotic systems in Industry 4.0: A review. Journal of Advanced Manufacturing Science and Technology 2024, 2024007. [Google Scholar] [CrossRef]
- Tian, Y.; Zhao, C.Y. A review of solar collectors and thermal energy storage in solar thermal applications. Applied energy 2013, 104, 538–553. [Google Scholar] [CrossRef]
- Gao, X.; Fraulob, M.; Haïat, G. Biomechanical behaviours of the bone–implant interface: a review. Journal of The Royal Society Interface 2019, 16, 20190259. [Google Scholar] [CrossRef] [PubMed]
- Thelen, A.; Zhang, X.; Fink, O.; Lu, Y.; Ghosh, S.; Youn, B.D.; Todd, M.D.; Mahadevan, S.; Hu, C.; Hu, Z. A comprehensive review of digital twin—part 1: modeling and twinning enabling technologies. Structural and Multidisciplinary Optimization 2022, 65, 354. [Google Scholar] [CrossRef]
- Anderson, J.D. Computational fluid dynamics: the basics with applications; McGraw-Hill, 1995. [Google Scholar]
- Bathe, K.J. Finite element procedures; Klaus-Jurgen Bathe, 2006. [Google Scholar]
- Mosavi, A.A.; Sedarat, H.; O’Connor, S.M.; Emami-Naeini, A.; Lynch, J. Calibrating a high-fidelity finite element model of a highway bridge using a multi-variable sensitivity-based optimisation approach. Structure and Infrastructure Engineering 2014, 10, 627–642. [Google Scholar] [CrossRef]
- Brooks, R.A. Elephants don’t play chess. Robotics and autonomous systems 1990, 6, 3–15. [Google Scholar] [CrossRef]
- Zagal, J.C.; Ruiz-del Solar, J.; Vallejos, P. Back to reality: Crossing the reality gap in evolutionary robotics. IFAC Proceedings Volumes 2004, 37, 834–839. [Google Scholar] [CrossRef]
- Forster, A.M. Materials testing standards for additive manufacturing of polymer materials: state of the art and standards applicability; 2015.
- American Society of Mechanical Engineers. ASME Code of Ethics of Engineers. ASME Official Publication, 2021. https://www.asme.org/getmedia/3e165b2b-f7e7-4106-a772-5f0586d2268e/p-15-7-ethics.pdf.
- Portenoy, J.; West, J.D. Constructing and evaluating automated literature review systems. Scientometrics 2020, 125, 3233–3251. [Google Scholar] [CrossRef]
- Accuris. Engineering Workbench: AI-powered platform for standards management. https://accuristech.com/solutions/engineering-workbench/, 2024. Accessed: 2025-04-21.
- Mudur, N.; Cui, H.; Venugopalan, S.; Raccuglia, P.; Brenner, M.P.; Norgaard, P. FEABench: Evaluating Language Models on Multiphysics Reasoning Ability. arXiv 2025, arXiv:2504.06260. [Google Scholar] [CrossRef]
- Ni, B.; Buehler, M.J. MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge. arXiv 2023, arXiv:2311.08166. [Google Scholar] [CrossRef]
- Makatura, L.; Foshey, M.; Wang, B.; et al. Large Language Models for Design and Manufacturing. MIT GenAI 2024. [Google Scholar]
- Wu, S.; Khasahmadi, A.; Katz, M.; et al. CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches. arXiv 2024, arXiv:2409.17457. [Google Scholar] [CrossRef]
- Jiang, Z.; Jiang, M. Beyond Answers: Large Language Model-Powered Tutoring System in Physics Education for Deep Learning and Precise Understanding. arXiv 2024, arXiv:2406.10934. [Google Scholar] [CrossRef]
- Abedi, M.; Alshybani, I.; Shahadat, M.R.B.; Murillo, M.S. Beyond Traditional Teaching: The Potential of Large Language Models and Chatbots in Graduate Engineering Education. arXiv 2023, arXiv:2309.13059. [Google Scholar] [CrossRef]
- Polverini, G.; Gregorcic, B. How understanding large language models can inform the use of ChatGPT in physics education. arXiv 2023, arXiv:2309.12074. [Google Scholar] [CrossRef]
- Harle, T.; et al. Large Language Models (LLMs) in Engineering Education. Information 2024, 15, 345. [Google Scholar] [CrossRef]
- Makatura, L.; Foshey, M.; Wang, B.; et al. Large Language Models for Design and Manufacturing. MIT GenAI 2023. [Google Scholar]
- Ali-Dib, M.; Menou, K. Physics simulation capabilities of LLMs. arXiv 2023, arXiv:2312.02091. [Google Scholar] [CrossRef]
- Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 2019, 378, 686–707. [Google Scholar] [CrossRef]
- Latif, E.; et al. PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations. arXiv 2024, arXiv:2403.18721. [Google Scholar] [CrossRef]
- Khan, J.A.; Qayyum, S.; Dar, H.S. Large Language Model for Requirements Engineering: A Systematic Literature Review. ResearchGate 2024. [Google Scholar]
- Tian, J.; Hou, J.; Wu, Z.; et al. Assessing Large Language Models in Mechanical Engineering Contexts: A Study on Experiment-Focused Log Interpretation. ResearchGate 2024. [Google Scholar]
- Berenguer, A.; Morejón, A.; Tomás, D.; Mazón, J.N. Leveraging Large Language Models for Sensor Data Retrieval. Applied Sciences 2024, 14, 2506. [Google Scholar] [CrossRef]
- Makatura, L.; Foshey, M.; Wang, B.; et al. Large Language Models for Design and Manufacturing. MIT GenAI 2023. [Google Scholar]
- Max Planck Institute for Iron Research. LangSim - Large Language Model Interface for Atomistic Simulation. https://www.mpie.de/5063016/LangSim, 2024.
- Landram, K. Multimodal machine learning model increases accuracy. Carnegie Mellon University News 2024. https://engineering.cmu.edu/news-events/news/2024/11/29-multimodal.html.
- Coders, M. Exploring the Role of Large Language Models (LLMs) in Physics. Metric Coders Blog 2024. https://www.metriccoders.com/post/exploring-the-role-of-large-language-models-llms-in-physics.
- Shipps, A. Multimodal and reasoning LLMs supersize training data for dexterous robotic tasks. MIT CSAIL News 2024. https://www.csail.mit.edu/news/multimodal-and-reasoning-llms-supersize-training-data-dexterous-robotic-tasks.
- Author, A. Grading explanations of problem-solving process and generating feedback using large language models. Physical Review Physics Education Research 2024, 21, 010126. [Google Scholar]
- Koch, S.; Matveev, A.; Jiang, Z.; Williams, F.; Artemov, A.; Burnaev, E.; Alexa, M.; Zorin, D.; Panozzo, D. ABC: A big CAD model dataset for geometric deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019; pp. 9601–9611.
- Wu, R.; Xiao, C.; Zheng, C. DeepCAD: A Deep Generative Network for Computer-Aided Design Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
- Lambourne, J.G.; Willis, K.D.; Jayaraman, P.K.; Sanghi, A.; Meltzer, P.; Shayani, H. BRepNet: A Topological Message Passing System for Solid Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021; pp. 12773–12782.
- Du, Y.; Chen, S.; Zan, W.; Li, P.; Wang, M.; Song, D.; Li, B.; Hu, Y.; Wang, B. BlenderLLM: Training Large Language Models for Computer-Aided Design with Self-improvement. arXiv 2024, arXiv:2412.14203. [Google Scholar] [CrossRef]
- OpenFOAM Official Example Cases. https://www.openfoam.com/.
- MatWeb Material Property Data. http://www.matweb.com/.
- Takamoto, M.; Praditia, T.; Leiteritz, R.; MacKinlay, D.; Alesiani, F.; Pflüger, D.; Niepert, M. PDEBench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems 2022, 35, 1596–1611. [Google Scholar]
- Qiu, S.; Guo, S.; Song, Z.Y.; Sun, Y.; Cai, Z.; Wei, J.; Luo, T.; Yin, Y.; Zhang, H.; Hu, Y.; et al. PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models. arXiv 2025, arXiv:2504.16074. [Google Scholar] [CrossRef]
- Saxena, A.; Goebel, K. Turbofan Engine Degradation Simulation Data Set. https://www.nasa.gov/content/prognostics-center-of-excellence-data-set-repository, 2008.
- Wikipedia. Chemistry. Wikipedia, The Free Encyclopedia 2023. [Accessed: 2025-03-22].
- Chemistry. Britannica. [Accessed: 2025-03-22].
- American Chemical Society. What is Chemistry?, 2023. Accessed: 2023-10-01.
- Nature. Chemistry, 2025.
- Carruthers, W.; Coldham, I. Modern methods of organic synthesis; Cambridge University Press, 2004. [Google Scholar]
- Harvey, D. Modern analytical chemistry; McGraw Hill, 2000. [Google Scholar]
- Christian, G.D.; Dasgupta, P.K.; Schug, K.A. Analytical chemistry; John Wiley & Sons, 2013. [Google Scholar]
- Levine, I.N.; Busch, D.H.; Shull, H. Quantum chemistry, 6th ed.; Pearson Prentice Hall: Upper Saddle River, NJ, 2009.
- Jensen, F. Introduction to computational chemistry; John Wiley & Sons, 2017. [Google Scholar]
- Patrick, G.L. An introduction to medicinal chemistry; Oxford University Press, 2023. [Google Scholar]
- Voet, D.; Voet, J.G. Biochemistry; John Wiley & Sons, 2010. [Google Scholar]
- Manahan, S.E. Environmental chemistry; CRC Press, 2022. [Google Scholar]
- Ali, H.; Khan, E. Environmental chemistry in the twenty-first century. Environmental Chemistry Letters 2017, 15, 329–346. [Google Scholar] [CrossRef]
- Corey, E.J. The logic of chemical synthesis; 1991. [Google Scholar]
- Vedantu. Chemical Analysis - Methods, Techniques, and Applications. https://www.vedantu.com/chemistry/chemical-analysis, 2023. Accessed: 2023-10-05.
- RROIJ. Advancing Science: Exploring Modern Methods in Chemical Analysis. Research and Reviews: Open Access Journal 2023. [Google Scholar] [CrossRef]
- Encyclopaedia Britannica. Chemical Analysis - Classical Methods. https://www.britannica.com/science/chemical-analysis/Classical-methods, 2023. Accessed: 2023-10-05.
- Dutta, S.; Biswas, I.; Raghuwanshi, K.; Das, K.; Ahmed, A.; Srivastav, Y.; Kumar, A.; Jain, S.K.; Maiti, N.J. A Comprehensive review on Analytical Techniques for the Quantification of Pharmaceutical Compounds in Biological Matrices: Recent Advances and future directions. [CrossRef]
- Atkins, P.; De Paula, J. Elements of physical chemistry; Oxford University Press: USA, 2013. [Google Scholar]
- Brown, T.L.; LeMay, H.E.; Bursten, B.E. Chemistry: the central science; Pearson Educación, 2002. [Google Scholar]
- Robinson, J.; Barry, T. Experimental methods for the measurement of thermodynamic data and recommendation about future capability at NPL. 1997.
- Gill, P.; Moghadam, T.T.; Ranjbar, B. Differential scanning calorimetry techniques: applications in biology and nanoscience. Journal of biomolecular techniques: JBT 2010, 21, 167. [Google Scholar]
- Zheng, X.; Bi, C.; Li, Z.; Podariu, M.; Hage, D.S. Analytical methods for kinetic studies of biological interactions: A review. Journal of pharmaceutical and biomedical analysis 2015, 113, 163–180. [Google Scholar] [CrossRef]
- Silbey, R.J.; Alberty, R.A.; Papadantonakis, G.A.; Bawendi, M.G. Physical chemistry; John Wiley & Sons, 2022. [Google Scholar]
- Smith, I.W.; Rowe, B.R. Reaction kinetics at very low temperatures: laboratory studies and interstellar chemistry. Accounts of Chemical Research 2000, 33, 261–268. [Google Scholar] [CrossRef]
- Encyclopaedia Britannica. Reaction Mechanism. https://www.britannica.com/science/reaction-mechanism, 2023. Accessed: 2023-10-05.
- Kraka, E.; Cremer, D. Computational analysis of the mechanism of chemical reactions in terms of reaction phases: hidden intermediates and hidden transition states. Accounts of chemical research 2010, 43, 591–601. [Google Scholar] [CrossRef]
- Butler, M.S. The role of natural product chemistry in drug discovery. Journal of natural products 2004, 67, 2141–2153. [Google Scholar] [CrossRef] [PubMed]
- Zhou, J.; Du, G.; Chen, J. Novel fermentation processes for manufacturing plant natural products. Current Opinion in Biotechnology 2014, 25, 17–23. [Google Scholar] [CrossRef]
- Gunstone, F.D. Fatty acid and lipid chemistry; Springer, 2012. [Google Scholar]
- Ikan, R. Natural products: a laboratory guide; Academic Press, 1991. [Google Scholar]
- Alkhzem, A.H.; Woodman, T.J.; Blagbrough, I.S. Design and synthesis of hybrid compounds as novel drugs and medicines. RSC advances 2022, 12, 19470–19484. [Google Scholar] [CrossRef]
- O’Boyle, N.M.; Campbell, C.M.; Hutchison, G.R. Computational design and selection of optimal organic photovoltaic materials. The Journal of Physical Chemistry C 2011, 115, 16200–16210. [Google Scholar] [CrossRef]
- Hassan Baig, M.; Ahmad, K.; Roy, S.; Mohammad Ashraf, J.; Adil, M.; Haris Siddiqui, M.; Khan, S.; Amjad Kamal, M.; Provazník, I.; Choi, I. Computer aided drug design: success and limitations. Current pharmaceutical design 2016, 22, 572–581. [Google Scholar] [CrossRef] [PubMed]
- Zhou, S.F.; Zhong, W.Z. Drug design and discovery: principles and applications, 2017. [CrossRef]
- Barbero, E.J. Introduction to composite materials design; CRC Press, 2010. [Google Scholar]
- Baselt, R. Encyclopedia of toxicology, 2014. [CrossRef]
- Applied Chemistry. https://www.sciencedirect.com/topics/chemistry/applied-chemistry, n.d. Accessed: 2023-10-18.
- Morris, G.M.; Huey, R.; Lindstrom, W.; Sanner, M.F.; Belew, R.K.; Goodsell, D.S.; Olson, A.J. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of computational chemistry 2009, 30, 2785–2791. [Google Scholar] [CrossRef]
- Berengut, D. Statistics for experimenters: Design, innovation, and discovery, 2006.
- Insuasty, D.; Castillo, J.; Becerra, D.; Rojas, H.; Abonia, R. Synthesis of biologically active molecules through multicomponent reactions. Molecules 2020, 25, 505. [Google Scholar] [CrossRef] [PubMed]
- Wikipedia contributors. Chemical engineering, 2023. Accessed: 2025-03-01.
- Simpson, J.A.; Weiner, E.S.C. Oxford English Dictionary; Oxford University Press, 1989; Vol. 3. [Google Scholar]
- Towler, G.; Sinnott, R. Chemical engineering design: principles, practice and economics of plant and process design, 2021.
- Westmoreland, P.R. Opportunities and challenges for a Golden Age of chemical engineering. Frontiers of Chemical Science and Engineering 2014, 8, 1–7. [Google Scholar] [CrossRef]
- Ulrich, G.D. A guide to chemical engineering process design and economics; Wiley New York, 1984. [Google Scholar]
- Ottino, J.M. Chemical engineering in a complex world: Grand challenges, vast opportunities. AIChE journal 2011, 57, 1654–1668. [Google Scholar] [CrossRef]
- Sinnott, R. Chemical engineering design; Vol. 6, Elsevier, 2014.
- Smith, R. Chemical process: design and integration, 2005.
- Haydary, J. Chemical process design and simulation: Aspen Plus and Aspen Hysys applications; John Wiley & Sons, 2019. [Google Scholar]
- West, A.H.; Posarac, D.; Ellis, N. Assessment of four biodiesel production processes using HYSYS.Plant. Bioresource Technology 2008, 99, 6587–6601. [Google Scholar] [CrossRef]
- Yadav, G.; Desai, T.N. Lean Six Sigma: a categorized review of the literature. International Journal of Lean Six Sigma 2016, 7, 2–24. [Google Scholar] [CrossRef]
- Andersen, B.; Fagerhaug, T. Root cause analysis; Quality Press, 2006. [Google Scholar]
- Al-Malah, K.I. Aspen plus: chemical engineering applications; John Wiley & Sons, 2022. [Google Scholar]
- Mohindru, P. Review on PID, fuzzy and hybrid fuzzy PID controllers for controlling non-linear dynamic behaviour of chemical plants. Artificial Intelligence Review 2024, 57, 97. [Google Scholar] [CrossRef]
- Kumar, A.S.; Ahmad, Z. Model predictive control (MPC) and its current issues in chemical engineering. Chemical Engineering Communications 2012, 199, 472–511. [Google Scholar] [CrossRef]
- Christofides, P.D. Control of nonlinear distributed process systems: Recent developments and challenges. AIChE Journal 2001, 47, 514–518. [Google Scholar] [CrossRef]
- Liu, J.; Muñoz de la Peña, D.; Christofides, P.D. Distributed model predictive control of nonlinear process systems. AIChE journal 2009, 55, 1171–1184. [Google Scholar] [CrossRef]
- Couper, J.R. Chemical process equipment: selection and design; Vol. 6, Gulf professional publishing, 2005.
- Parra-Cabrera, C.; Achille, C.; Kuhn, S.; Ameloot, R. 3D printing in chemical engineering and catalytic technology: structured catalysts, mixers and reactors. Chemical Society Reviews 2018, 47, 209–230. [Google Scholar] [CrossRef] [PubMed]
- Fang, Y.; Liu, J. Discussion on Curriculum Reform of Chemical Engineering Drawing and CAD Based on Big Data Era. In Proceedings of the International Conference on Forthcoming Networks and Sustainability in the IoT Era. Springer; 2021; pp. 128–133. [Google Scholar]
- Mihelcic, J.R.; Zimmerman, J.B. Environmental engineering: Fundamentals, sustainability, design; John Wiley & Sons, 2021. [Google Scholar]
- Siirola, J.J. Industrial applications of chemical process synthesis. In Advances in chemical engineering; Elsevier, 1996; Vol. 23, pp. 1–62. [Google Scholar]
- American Chemical Society. Chemical Engineering, 2023. Accessed: 2023-10-10.
- Gaynes, R. The discovery of penicillin—new insights after more than 75 years of clinical use. Emerging infectious diseases 2017, 23, 849. [Google Scholar] [CrossRef]
- Gerber, D.E. Targeted therapies: a new generation of cancer treatments. American family physician 2008, 77, 311–319. [Google Scholar]
- Ligon, B.L. Penicillin: its discovery and early development. Seminars in Pediatric Infectious Diseases; Elsevier, 2004; Vol. 15, pp. 52–57. [Google Scholar]
- Zhou, Z.; Li, M. Targeted therapies for cancer. BMC medicine 2022, 20, 90. [Google Scholar] [CrossRef]
- Seiler, M. Hyperbranched polymers: Phase behavior and new applications in the field of chemical engineering. Fluid Phase Equilibria 2006, 241, 155–174. [Google Scholar] [CrossRef]
- Starke Jr, E.A.; Staley, J.T. Application of modern aluminum alloys to aircraft. Progress in aerospace sciences 1996, 32, 131–172. [Google Scholar] [CrossRef]
- Niinomi, M.; Nakai, M.; Hieda, J. Development of new metallic alloys for biomedical applications. Acta biomaterialia 2012, 8, 3888–3903. [Google Scholar] [CrossRef]
- Barreto, J.A.; O’Malley, W.; Kubeil, M.; Graham, B.; Stephan, H.; Spiccia, L. Nanomaterials: applications in cancer imaging and therapy. Advanced materials 2011, 23, H18–H40. [Google Scholar] [CrossRef]
- Kolahalam, L.A.; Viswanath, I.K.; Diwakar, B.S.; Govindh, B.; Reddy, V.; Murthy, Y. Review on nanomaterials: Synthesis and applications. Materials Today: Proceedings 2019, 18, 2182–2190. [Google Scholar] [CrossRef]
- Ong, Y.T.; Ahmad, A.L.; Zein, S.H.S.; Tan, S.H. A review on carbon nanotubes in an environmental protection and green engineering perspective. Brazilian Journal of Chemical Engineering 2010, 27, 227–242. [Google Scholar] [CrossRef]
- Lajili, M. Converting CO2 from a Harmful Gas to a Renewable Source of Matter and Energy: A Review. 2022. [Google Scholar] [CrossRef]
- Makhanov, B.; Satayev, M.; Krasnokutskii, E.; Ved, V.; Saipov, A. New type of harmful gas emissions catalytic converter. Industrial Technology and Engineering 2015, pp. 5–18.
- Centi, G. Smart catalytic materials for energy transition. SmartMat 2020, 1. [Google Scholar] [CrossRef]
- Armaroli, N.; Balzani, V. Solar electricity and solar fuels: status and perspectives in the context of the energy transition. Chemistry–A European Journal 2016, 22, 32–57. [Google Scholar] [CrossRef]
- Dincer, I. Renewable energy and sustainable development: a crucial review. Renewable and sustainable energy reviews 2000, 4, 157–175. [Google Scholar] [CrossRef]
- Kim, T.; Song, W.; Son, D.Y.; Ono, L.K.; Qi, Y. Lithium-ion batteries: outlook on present, future, and hybridized technologies. Journal of materials chemistry A 2019, 7, 2942–2964. [Google Scholar] [CrossRef]
- Singla, M.K.; Nijhawan, P.; Oberoi, A.S. Hydrogen fuel and fuel cell technology for cleaner future: a review. Environmental Science and Pollution Research 2021, 28, 15607–15626. [Google Scholar] [CrossRef]
- Guo, T.; Nan, B.; Liang, Z.; Guo, Z.; Chawla, N.; Wiest, O.; Zhang, X.; et al. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. Advances in Neural Information Processing Systems 2023, 36, 59662–59688. [Google Scholar]
- Himanen, L.; Geurts, A.; Foster, A.S.; Rinke, P. Data-driven materials science: status, challenges, and perspectives. Advanced Science 2019, 6, 1900808. [Google Scholar] [CrossRef]
- Xia, J.; Zhang, L.; Zhu, X.; Liu, Y.; Gao, Z.; Hu, B.; Tan, C.; Zheng, J.; Li, S.; Li, S.Z. Understanding the limitations of deep models for molecular property prediction: Insights and solutions. Advances in Neural Information Processing Systems 2023, 36, 64774–64792. [Google Scholar]
- Zhao, H.; Tang, X.; Yang, Z.; Han, X.; Feng, X.; Fan, Y.; Cheng, S.; Jin, D.; Zhao, Y.; Cohan, A.; et al. ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain. arXiv 2024, arXiv:2411.16736. [Google Scholar]
- Ramos, M.C.; Collison, C.J.; White, A.D. A review of large language models and autonomous agents in chemistry. Chemical Science 2025. [Google Scholar] [CrossRef] [PubMed]
- Brenk, R. Escaping the Combinatorial Explosion: Expert-Enhanced Heuristic Navigation in Chemical Space. https://www.uib.no/en/rg/brenk/152446/escaping-combinatorial-explosion-expert-enhanced-heuristic-navigation-chemical-space, 2023. Accessed: 2023-10-01.
- Du, Y.; Fu, T.; Sun, J.; Liu, S. Molgensurvey: A systematic survey in machine learning models for molecule design. arXiv 2022, arXiv:2203.14500. [Google Scholar]
- Bagal, V.; Aggarwal, R.; Vinod, P.; Priyakumar, U.D. MolGPT: molecular generation using a transformer-decoder model. Journal of chemical information and modeling 2021, 62, 2064–2076. [Google Scholar] [CrossRef] [PubMed]
- Xia, J.; Zhu, Y.; Du, Y.; Li, S.Z. A systematic survey of chemical pre-trained models. arXiv 2022, arXiv:2210.16484. [Google Scholar] [CrossRef]
- Boiko, D.A.; MacKnight, R.; Kline, B.; Gomes, G. Autonomous chemical research with large language models. Nature 2023, 624, 570–578. [Google Scholar] [CrossRef] [PubMed]
- Boiko, D.A.; MacKnight, R.; Gomes, G. Emergent autonomous scientific research capabilities of large language models. arXiv 2023, arXiv:2304.05332. [Google Scholar] [CrossRef]
- Suvarna, M.; Pérez-Ramírez, J. Embracing data science in catalysis research. Nature Catalysis 2024, 7, 624–635. [Google Scholar] [CrossRef]
- McDonald, S.M.; Augustine, E.K.; Lanners, Q.; Rudin, C.; Catherine Brinson, L.; Becker, M.L. Applied machine learning as a driver for polymeric biomaterials design. Nature Communications 2023, 14, 4838. [Google Scholar] [CrossRef]
- Zheng, Z.; Rampal, N.; Inizan, T.J.; Borgs, C.; Chayes, J.T.; Yaghi, O.M. Large language models for reticular chemistry. Nature Reviews Materials 2025, 1–13. [Google Scholar] [CrossRef]
- Pyzer-Knapp, E.O.; Manica, M.; Staar, P.; Morin, L.; Ruch, P.; Laino, T.; Smith, J.R.; Curioni, A. Foundation models for materials discovery–current state and future directions. npj Computational Materials 2025, 11, 61. [Google Scholar] [CrossRef]
- De Almeida, A.F.; Moreira, R.; Rodrigues, T. Synthetic organic chemistry driven by artificial intelligence. Nature Reviews Chemistry 2019, 3, 589–604. [Google Scholar] [CrossRef]
- Yu, T.; Boob, A.G.; Volk, M.J.; Liu, X.; Cui, H.; Zhao, H. Machine learning-enabled retrobiosynthesis of molecules. Nature Catalysis 2023, 6, 137–151. [Google Scholar] [CrossRef]
- Butler, K.T.; Davies, D.W.; Cartwright, H.; Isayev, O.; Walsh, A. Machine learning for molecular and materials science. Nature 2018, 559, 547–555. [Google Scholar] [CrossRef]
- Papers with Code. Molecule Captioning, 2023.
- Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences 1988, 28, 31–36. [Google Scholar] [CrossRef]
- Wikipedia. Seq2seq — Wikipedia, The Free Encyclopedia, 2023. [Online; accessed 2023-10-10].
- Eltyeb, S.; Salim, N. Chemical named entities recognition: a review on approaches and applications. Journal of cheminformatics 2014, 6, 1–12. [Google Scholar] [CrossRef]
- Zunger, A. Inverse design in search of materials with target functionalities. Nature Reviews Chemistry 2018, 2, 0121. [Google Scholar] [CrossRef]
- Molesky, S.; Lin, Z.; Piggott, A.Y.; Jin, W.; Vucković, J.; Rodriguez, A.W. Inverse design in nanophotonics. Nature Photonics 2018, 12, 659–670. [Google Scholar] [CrossRef]
- Edwards, C.; Lai, T.; Ros, K.; Honke, G.; Cho, K.; Ji, H. Translation between molecules and natural language. arXiv 2022, arXiv:2204.11817. [Google Scholar] [CrossRef]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Li, J.; Liu, Y.; Fan, W.; Wei, X.Y.; Liu, H.; Tang, J.; Li, Q. Empowering molecule discovery for molecule-caption translation with large language models: A chatgpt perspective. IEEE transactions on knowledge and data engineering 2024. [Google Scholar] [CrossRef]
- Liu, P.; Ren, Y.; Tao, J.; Ren, Z. Git-mol: A multi-modal large language model for molecular science with graph, image, and text. Computers in biology and medicine 2024, 171, 108073. [Google Scholar] [CrossRef]
- Luo, Y.; Yang, K.; Hong, M.; Liu, X.Y.; Nie, Z. Molfm: A multimodal molecular foundation model. arXiv 2023, arXiv:2307.09484. [Google Scholar] [CrossRef]
- Li, J.; Liu, W.; Ding, Z.; Fan, W.; Li, Y.; Li, Q. Large language models are in-context molecule learners. IEEE Transactions on Knowledge and Data Engineering 2025. [Google Scholar] [CrossRef]
- Li, J.; Liu, Y.; Liu, W.; Le, J.; Zhang, D.; Fan, W.; Zhou, D.; Li, Y.; Li, Q. MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts. arXiv 2024, arXiv:2411.14721. [Google Scholar] [CrossRef]
- Zhong, Z.; Larsen, S.S.Y.; Guo, H.; Tang, T.; Zhou, K.; Mottin, D. Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language. arXiv 2025, arXiv:2502.06634. [Google Scholar] [CrossRef]
- Ganeeva, V.; Khrabrov, K.; Kadurin, A.; Savchenko, A.; Tutubalina, E. Chemical language models have problems with chemistry: A case study on molecule captioning task. In Proceedings of the Second Tiny Papers Track at ICLR 2024, 2024.
- Tran, D.; Pham, N.T.; Nguyen, N.; Manavalan, B. Mol2Lang-VLM: Vision- and text-guided generative pre-trained language models for advancing molecule captioning through multimodal fusion. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), 2024; pp. 97–102.
- Kim, S.; Kim, N.; Piao, Y.; Kim, S. GraphT5: Unified Molecular Graph-Language Modeling via Multi-Modal Cross-Token Attention. arXiv 2025, arXiv:2503.07655. [Google Scholar] [CrossRef]
- Li, S.; Liu, Z.; Luo, Y.; Wang, X.; He, X.; Kawaguchi, K.; Chua, T.S.; Tian, Q. Towards 3d molecule-text interpretation in language models. arXiv 2024, arXiv:2401.13923. [Google Scholar] [CrossRef]
- Liu, Z.; Li, S.; Luo, Y.; Fei, H.; Cao, Y.; Kawaguchi, K.; Wang, X.; Chua, T.S. Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. arXiv 2023, arXiv:2310.12798. [Google Scholar] [CrossRef]
- Tang, X.; Tran, A.; Tan, J.; Gerstein, M.B. MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations. Bioinformatics 2024, 40, i357–i368. [Google Scholar] [CrossRef] [PubMed]
- Tanaka, S.; Mak, C.; Cipcigan, F.; Barry, J.; Elkaref, M.; Moses, M.; Kuruvanthodi, V.; De Mel, G. NLPeople at L+M-24 Shared Task: An Ensembled Approach for Molecule Captioning from SMILES. In Proceedings of the ACL 2024 Workshop Language + Molecules.
- Taylor, R.; Kardas, M.; Cucurull, G.; Scialom, T.; Hartshorn, A.; Saravia, E.; Poulton, A.; Kerkez, V.; Stojnic, R. Galactica: A large language model for science. arXiv 2022, arXiv:2211.09085. [Google Scholar] [CrossRef]
- Zhang, X.; Li, Y.; Wang, J.; Chen, L.; Xu, H. SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020. [CrossRef]
- Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv 2020, arXiv:2010.09885. [Google Scholar] [CrossRef]
- Gobbi, A.; Stokes, J.M.; Jin, W.; Kim, E.; Coley, C.W. ChemBERTa-2: Towards Chemical Foundation Models. arXiv 2022, arXiv:2209.01712. [Google Scholar] [CrossRef]
- Mukherjee, A.; Singh, R.; Lee, C.; Jain, A. SELFormer: Self-supervised Learning of Chemical Properties with Permutation-Invariant Transformers. arXiv 2023, arXiv:2304.04662. [Google Scholar] [CrossRef]
- Mohapatra, S.; Wu, Z.; Zhang, G.; Men, P.; Liu, Y. Molecule Attention Transformer. arXiv 2020, arXiv:2002.08264. [Google Scholar] [CrossRef]
- Wang, J.; Zhang, Y.; Xu, M.; Sun, X. Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multi-task Learning BERT Enhanced by SMILES Enumeration. Journal of Chemical Information and Modeling 2022. [Google Scholar]
- Zhou, L.; Chen, K.; Li, Q.; Wang, R. SolvBERT for Solvation Free Energy and Solubility Prediction. Digital Discovery 2023. [Google Scholar] [CrossRef]
- Liu, X.; Zhao, M.; Sun, L.; Huang, W. INTransformer: Data Augmentation-Based Contrastive Learning by Injecting Noise into Transformer for Molecular Property Prediction. Journal of Molecular Graphics and Modelling 2024, 122, 108703. [Google Scholar] [CrossRef] [PubMed]
- Kim, D.; Park, S.; Lee, J.; Choi, M. Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. Nature Machine Intelligence 2023, 5, 1123–1132. [Google Scholar] [CrossRef]
- Patel, S.; Mehta, R.; Gupta, A. ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. arXiv 2023, arXiv:2301.12040. [Google Scholar] [CrossRef]
- Chilingaryan, G.; Tamoyan, H.; Tevosyan, A.; Babayan, N.; Hambardzumyan, K.; Navoyan, Z.; Aghajanyan, A.; Khachatrian, H.; Khondkaryan, L. Bartsmiles: Generative masked language models for molecular representations. Journal of Chemical Information and Modeling 2024, 64, 5832–5843. [Google Scholar] [CrossRef]
- Zeng, X.; Xiang, H.; Yu, L.; Wang, J.; Li, K.; Nussinov, R.; Cheng, F. Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nature Machine Intelligence 2022, 4, 1004–1016. [Google Scholar] [CrossRef]
- Schwaller, P.; Vaucher, A.C.; Laino, T.; Reymond, J. RXNFP – chemical reaction fingerprints. https://rxn4chemistry.github.io/rxnfp/, 2021.
- Li, J.; Tan, X.; Wang, Y.; Chen, Z. Unified Deep Learning Model for Multi-task Reaction Predictions (T5Chem). Journal of Chemical Information and Modeling 2022. [Google Scholar] [CrossRef]
- Jin, K.; Schwaller, P. rxn_yields: Code complementing reaction yield prediction models (includes Yield-BERT). https://github.com/rxn4chemistry/rxn_yields, 2022.
- Lee, H.; Kim, S.; Park, J.; Choi, E. Enhancing Generic Reaction Yield Prediction through Reaction Condition-Based Contrastive Learning. Research 2023, 2023. [Google Scholar] [CrossRef]
- Tan, X.; Li, J.; Chen, Z.; Wang, Y. ReactionT5: A Large-Scale Pre-Trained Model towards Application of Reaction Prediction. arXiv 2023, arXiv:2311.06708. [Google Scholar] [CrossRef]
- Gao, Y.; Zhou, L.; Xu, M.; Sun, X. Prediction of Chemical Reaction Yields with Large-Scale Multi-View Pre-Training (ReaMVP). Journal of Cheminformatics 2024, 16, 45. [Google Scholar] [CrossRef]
- Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C.A.; Bekas, C.; Lee, A.A. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS central science 2019, 5, 1572–1583. [Google Scholar] [CrossRef]
- Pesciullesi, G.; Schwaller, P.; Laino, T.; Reymond, J.L. Transfer learning enables the molecular transformer to predict regio-and stereoselective reactions on carbohydrates. Nature communications 2020, 11, 4874. [Google Scholar] [CrossRef]
- Kotlyarov, R.; Papachristos, K.; Wood, G.P.; Goodman, J.M. Leveraging Language Model Multitasking To Predict C–H Borylation Selectivity. Journal of Chemical Information and Modeling 2024, 64, 4286–4297. [Google Scholar] [CrossRef]
- Jablonka, K.M.; Schwaller, P.; Ortega-Guerrero, A.; Smit, B. Leveraging large language models for predictive chemistry. Nature Machine Intelligence 2024, 6, 161–169. [Google Scholar] [CrossRef]
- Sagawa, T.; Kojima, R. ReactionT5: a large-scale pre-trained model towards application of limited reaction data. arXiv 2023, arXiv:2311.06708. [Google Scholar] [CrossRef]
- Zhang, D.; Liu, W.; Tan, Q.; Chen, J.; Yan, H.; Yan, Y.; Li, J.; Huang, W.; Yue, X.; Ouyang, W.; et al. Chemllm: A chemical large language model. arXiv 2024, arXiv:2402.06852. [Google Scholar] [CrossRef]
- Tharwani, K.K.L.; Kumar, R.; Ahmed, N.; Tang, Y.; et al. Large Language Models Transform Organic Synthesis From Reaction Prediction to Automation. arXiv 2025, arXiv:2508.05427. [Google Scholar] [CrossRef]
- Lu, J.; Zhang, Y. Unified deep learning model for multitask reaction predictions with explanation. Journal of chemical information and modeling 2022, 62, 1376–1387. [Google Scholar] [CrossRef]
- Do, K.; Tran, T.; Venkatesh, S. Graph transformation policy network for chemical reaction prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019; pp. 750–760.
- Segler, M.H.; Waller, M.P. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry–A European Journal 2017, 23, 5966–5971. [Google Scholar] [CrossRef]
- Jolicoeur-Martineau, A.; Baratin, A.; Kwon, K.; Knyazev, B.; Zhang, Y. Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees. arXiv 2024, arXiv:2407.09357. [Google Scholar] [CrossRef]
- Bran, A.M.; Neukomm, T.A.; Armstrong, D.P.; Jončev, Z.; Schwaller, P. Chemical reasoning in LLMs unlocks steerable synthesis planning and reaction mechanism elucidation. arXiv 2025, arXiv:2503.08537. [Google Scholar] [CrossRef]
- Tetko, I.V.; Karpov, P.; Van Deursen, R.; Godin, G. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nature communications 2020, 11, 5575. [Google Scholar] [CrossRef]
- Yang, Y.; Shi, R.; Li, Z.; Jiang, S.; Lu, B.L.; Yang, Y.; Zhao, H. BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction. arXiv 2024, arXiv:2408.10285. [Google Scholar] [CrossRef]
- Microsoft Research AI4Science; Microsoft Azure Quantum. The impact of large language models on scientific discovery: a preliminary study using GPT-4. arXiv 2023, arXiv:2311.07361. [Google Scholar] [CrossRef]
- Bran, A.M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A.D.; Schwaller, P. Augmenting large language models with chemistry tools. Nature Machine Intelligence 2024, 6, 525–535. [Google Scholar] [CrossRef]
- Zhang, C.; Lin, Q.; Zhu, B.; Yang, H.; Lian, X.; Deng, H.; Zheng, J.; Liao, K. SynAsk: unleashing the power of large language models in organic synthesis. Chemical Science 2025, 16, 43–56. [Google Scholar] [CrossRef] [PubMed]
- Ruan, Y.; Lu, C.; Xu, N.; He, Y.; Chen, Y.; Zhang, J.; Xuan, J.; Pan, J.; Fang, Q.; Gao, H.; et al. An automatic end-to-end chemical synthesis development platform powered by large language models. Nature communications 2024, 15, 10160. [Google Scholar] [CrossRef] [PubMed]
- Ma, Q.; Zhou, Y.; Li, J. Automated Retrosynthesis Planning of Macromolecules Using Large Language Models and Knowledge Graphs. Macromolecular Rapid Communications 2025, 2500065. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Xu, H.; Fang, T.; Xi, H.; Liu, Z.; Zhang, S.; Poon, H.; Wang, S. T-rex: Text-assisted retrosynthesis prediction. arXiv 2024, arXiv:2401.14637. [Google Scholar] [CrossRef]
- Nguyen-Van, P.; Thanh, L.N.; Manh, H.H.; Thi, H.A.P.; Le Nguyen, T.; Nguyen, V.A. Adapting Language Models for Retrosynthesis Prediction. 2024. [Google Scholar] [CrossRef]
- Ma, Q.; Zhou, Y.; Li, J. Leveraging Large Language Models as Knowledge-Driven Agents for Reliable Retrosynthesis Planning. arXiv 2025, arXiv:2501.08897. [Google Scholar] [CrossRef]
- Liu, G.; Sun, M.; Matusik, W.; Jiang, M.; Chen, J. Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning. arXiv 2024, arXiv:2410.04223. [Google Scholar] [CrossRef]
- Wang, H.; Guo, J.; Kong, L.; Ramprasad, R.; Schwaller, P.; Du, Y.; Zhang, C. LLM-Augmented Chemical Synthesis and Design Decision Programs. In Proceedings of the Workshop Towards Agentic AI for Science: Hypothesis Generation, Comprehension, Quantification, and Validation.
- Bran, A.M.; Neukomm, T.A.; Armstrong, D.P.; Jončev, Z.; Schwaller, P. Revealing chemical reasoning in LLMs through search on complex planning tasks. In Proceedings of the Workshop on Reasoning and Planning for Large Language Models.
- Thakkar, A.; Vaucher, A.C.; Byekwaso, A.; Schwaller, P.; Toniato, A.; Laino, T. Unbiasing retrosynthesis language models with disconnection prompts. ACS Central Science 2023, 9, 1488–1498. [Google Scholar] [CrossRef]
- Kang, C.; Liu, X.; Guo, F. RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning. In Proceedings of the Thirteenth International Conference on Learning Representations.
- Zheng, Z.; Rong, Z.; Rampal, N.; Borgs, C.; Chayes, J.T.; Yaghi, O.M. A GPT-4 Reticular Chemist for Guiding MOF Discovery. Angewandte Chemie International Edition 2023, 62, e202311983. [Google Scholar] [CrossRef]
- Liu, S.; Wang, J.; Yang, Y.; Wang, C.; Liu, L.; Guo, H.; Xiao, C. Chatgpt-powered conversational drug editing using retrieval and domain feedback. arXiv 2023, arXiv:2305.18090. [Google Scholar] [CrossRef]
- Chen, Z.; Fang, Z.; Tian, W.; Long, Z.; Sun, C.; Chen, Y.; Yuan, H.; Li, H.; Lan, M. ReactGPT: Understanding of Chemical Reactions via In-Context Tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025; Vol. 39, pp. 84–92.
- Wang, X.; Qiu, J.; Li, Y.; Chen, G.; Liu, H.; Liao, B.; Hsieh, C.Y.; Yao, X. RetroPrime: A Chemistry-Inspired and Transformer-based Method for Retrosynthesis Predictions. 2020. [Google Scholar] [CrossRef]
- Ye, G.; Cai, X.; Lai, H.; Wang, X.; Huang, J.; Wang, L.; Liu, W.; Zeng, X. Drugassist: A large language model for molecule optimization. Briefings in Bioinformatics 2025, 26, bbae693. [Google Scholar] [CrossRef]
- Dey, V.; Hu, X.; Ning, X. GeLLM3O: Generalizing Large Language Models for Multi-property Molecule Optimization. arXiv 2025, arXiv:2502.13398. [Google Scholar] [CrossRef]
- Guevorguian, P.; Bedrosian, M.; Fahradyan, T.; Chilingaryan, G.; Khachatrian, H.; Aghajanyan, A. Small Molecule Optimization with Large Language Models. arXiv 2024, arXiv:2407.18897. [Google Scholar] [CrossRef]
- Le, K.; Guo, Z.; Dong, K.; Huang, X.; Nan, B.; Iyer, R.; Zhang, X.; Wiest, O.; Wang, W.; Chawla, N.V. Molx: Enhancing large language models for molecular learning with a multi-modal extension. arXiv 2024, arXiv:2406.06777. [Google Scholar] [CrossRef]
- Liu, X.; Jiang, S.; Li, B.; Stevens, R. ControllableGPT: A Ground-Up Designed Controllable GPT for Molecule Optimization. arXiv 2025, arXiv:2502.10631. [Google Scholar] [CrossRef]
- Noutahi, E.; Gabellini, C.; Craig, M.; Lim, J.S.; Tossou, P. Gotta be SAFE: a new framework for molecular design. Digital Discovery 2024, 3, 796–804. [Google Scholar] [CrossRef]
- Yu, J.; Zheng, Y.; Koh, H.Y.; Pan, S.; Wang, T.; Wang, H. Collaborative Expert LLMs Guided Multi-Objective Molecular Optimization. arXiv 2025, arXiv:2503.03503. [Google Scholar] [CrossRef]
- Ran, N.; Wang, Y.; Allmendinger, R. MOLLM: Multi-Objective Large Language Model for Molecular Design–Optimizing with Experts. arXiv 2025, arXiv:2502.12845. [Google Scholar] [CrossRef]
- Liang, Y.; Zhang, R.; Zhang, L.; Xie, P. DrugChat: towards enabling ChatGPT-like capabilities on drug molecule graphs. arXiv 2023, arXiv:2309.03907. [Google Scholar] [CrossRef]
- Nguyen, T.; Grover, A. Lico: Large language models for in-context molecular optimization. arXiv 2024, arXiv:2406.18851. [Google Scholar] [CrossRef]
- Liu, X.; Jiang, S.; Chen, S.; Yang, Z.; Chen, Y.; Foster, I.; Stevens, R. DrugImproverGPT: A Large Language Model for Drug Optimization with Fine-Tuning via Structured Policy Optimization. arXiv 2025, arXiv:2502.07237. [Google Scholar] [CrossRef]
- Bran, A.M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A.D.; Schwaller, P. Chemcrow: Augmenting large-language models with chemistry tools. arXiv 2023, arXiv:2304.05376. [Google Scholar] [CrossRef]
- McNaughton, A.D.; Sankar Ramalaxmi, G.K.; Kruel, A.; Knutson, C.R.; Varikoti, R.A.; Kumar, N. Cactus: Chemistry agent connecting tool usage to science. ACS omega 2024, 9, 46563–46573. [Google Scholar] [CrossRef]
- Zhang, W.; Wang, Q.; Kong, X.; Xiong, J.; Ni, S.; Cao, D.; Niu, B.; Chen, M.; Li, Y.; Zhang, R.; et al. Fine-tuning large language models for chemical text mining. Chemical Science 2024, 15, 10600–10611. [Google Scholar] [CrossRef]
- Zheng, Z.; Zhang, O.; Borgs, C.; Chayes, J.T.; Yaghi, O.M. ChatGPT chemistry assistant for text mining and the prediction of MOF synthesis. Journal of the American Chemical Society 2023, 145, 18048–18062. [Google Scholar] [CrossRef] [PubMed]
- Isazawa, T.; Cole, J.M. Single model for organic and inorganic chemical named entity recognition in ChemDataExtractor. Journal of chemical information and modeling 2022, 62, 1207–1213. [Google Scholar] [CrossRef]
- Islamaj, R.; Leaman, R.; Kim, S.; Kwon, D.; Wei, C.H.; Comeau, D.C.; Peng, Y.; Cissel, D.; Coss, C.; Fisher, C.; et al. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Scientific data 2021, 8, 91. [Google Scholar] [CrossRef]
- He, C.; Zhang, H.; Liu, J.; Shi, Y.; Li, H.; Zhang, J. Named entity recognition of chemical experiment operations based on BERT. In Proceedings of the International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2023). SPIE, 2023, Vol. 12941, pp. 818–828. [CrossRef]
- Pang, N.; Qian, L.; Lyu, W.; Yang, J.D. Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with joint BERT-CRF Model. In Proceedings of the BIRNDL@SIGIR, 2019; pp. 28–41. [Google Scholar]
- Adams, V.; Shin, H.C.; Anderson, C.; Liu, B.; Abidin, A. Chemical identification and indexing in PubMed articles via BERT and text-to-text approaches. arXiv 2021, arXiv:2111.15622. [Google Scholar] [CrossRef]
- Groves, E.; Wang, M.; Abdulle, Y.; Kunz, H.; Hoelscher-Obermaier, J.; Wu, R.; Wu, H. Benchmarking and analyzing in-context learning, fine-tuning and supervised learning for biomedical knowledge curation: a focused study on chemical entities of biological interest. arXiv 2023, arXiv:2312.12989. [Google Scholar] [CrossRef]
- Foppiano, L.; Lambard, G.; Amagasa, T.; Ishii, M. Mining experimental data from materials science literature with large language models: an evaluation study. Science and Technology of Advanced Materials: Methods 2024, 4, 2356506. [Google Scholar] [CrossRef]
- Choi, J.; Lee, B. Accelerating materials language processing with large language models. Communications Materials 2024, 5, 13. [Google Scholar] [CrossRef]
- Zhang, W.; Wang, Q.; Kong, X.; Xiong, J.; Ni, S.; Cao, D.; Niu, B.; Chen, M.; Zhang, R.; Wang, Y.; et al. Fine-Tuning ChatGPT Achieves State-of-the-Art Performance for Chemical Text Mining. 2023. [Google Scholar] [CrossRef]
- Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured information extraction from scientific text with large language models. Nature Communications 2024, 15, 1418. [Google Scholar] [CrossRef]
- Zhao, X.; Niazi, A.; Rios, A. A comprehensive study of gender bias in chemical named entity recognition models. arXiv 2022, arXiv:2212.12799. [Google Scholar] [CrossRef]
- Copara, J.; Naderi, N.; Knafou, J.; Ruch, P.; Teodoro, D. Named entity recognition in chemical patents using ensemble of contextual language models. arXiv 2020, arXiv:2007.12569. [Google Scholar] [CrossRef]
- Bagal, V.; Aggarwal, R.; Vinod, P.K.; Priyakumar, U.D. MolGPT: Molecular Generation Using a Transformer-Decoder Model. ChemRxiv 2021. [Google Scholar] [CrossRef]
- Polykovskiy, D.; Zubov, K.; Khristoforov, L.; Gastegger, M.; Leshchev, A.; Timoshenko, A.; Rupp, M. Neural scaling of deep chemical models. Nature Machine Intelligence 2023, 5, 450–460. [Google Scholar] [CrossRef]
- Bornmanica, A.; Smith, J.; Doe, J. High-Diversity Molecular Generation with Transformer-Based Embeddings. Journal of Cheminformatics 2023, 15, 123. [Google Scholar]
- Cho, K.H.; No, K.T.; et al. iupacGPT: IUPAC-based large-scale molecular pre-trained model for property prediction and molecule generation. 2023. [Google Scholar] [CrossRef]
- Ye, G. De novo drug design as GPT language modeling: large chemistry models with supervised and reinforcement learning. Journal of Computer-Aided Molecular Design 2024, 38, 20. [Google Scholar] [CrossRef]
- Zholus, A.; Kuznetsov, M.; Schutski, R.; Shayakhmetov, R.; Polykovskiy, D.; Chandar, S.; Zhavoronkov, A. Bindgpt: A scalable framework for 3d molecular design via language modeling and reinforcement learning. arXiv 2024, arXiv:2406.03686. [Google Scholar] [CrossRef]
- Liu, Y.; Ding, S.; Zhou, S.; Fan, W.; Tan, Q. Moleculargpt: Open large language model (llm) for few-shot molecular property prediction. arXiv 2024, arXiv:2406.12950. [Google Scholar] [CrossRef]
- Gong, H.; Liu, Q.; Wu, S.; Wang, L. Text-guided molecule generation with diffusion language model. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024; Vol. 38, pp. 109–117.
- Xiang, Y.; Zhao, H.; Ma, C.; Deng, Z.H. Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model. arXiv 2024, arXiv:2408.09896. [Google Scholar] [CrossRef]
- Luo, Y.; Fang, J.; Li, S.; Liu, Z.; Wu, J.; Zhang, A.; Du, W.; Wang, X. Text-guided diffusion model for 3d molecule generation. arXiv 2024, arXiv:2410.03803. [Google Scholar] [CrossRef]
- Cavanagh, J.M.; Sun, K.; Gritsevskiy, A.; Bagni, D.; Wang, Y.; Bannister, T.D.; Head-Gordon, T. Smileyllama: Modifying large language models for directed chemical space exploration. arXiv 2024, arXiv:2409.02231. [Google Scholar] [CrossRef]
- Bagal, V.; Aggarwal, R.; Vinod, P.; Priyakumar, U.D. LigGPT: Molecular generation using a transformer-decoder model.
- Ai, C.; Yang, H.; Liu, X.; Dong, R.; Ding, Y.; Guo, F. MTMol-GPT: De novo multi-target molecular generation with transformer-based generative adversarial imitation learning. PLoS computational biology 2024, 20, e1012229. [Google Scholar] [CrossRef] [PubMed]
- Priyadarsini, I.; Takeda, S.; Hamada, L.; Brazil, E.V.; Soares, E.; Shinohara, H. Self-bart: A transformer-based molecular representation model using selfies. arXiv 2024, arXiv:2410.12348. [Google Scholar] [CrossRef]
- Lee, S.; Kreis, K.; Veccham, S.P.; Liu, M.; Reidenbach, D.; Peng, Y.; Paliwal, S.; Nie, W.; Vahdat, A. GenMol: A Drug Discovery Generalist with Discrete Diffusion. arXiv 2025, arXiv:2501.06158. [Google Scholar] [CrossRef]
- Zhu, H.; Xiao, T.; Honavar, V.G. 3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation. arXiv 2024, arXiv:2403.07179. [Google Scholar] [CrossRef]
- Cao, H.; Liu, Z.; Lu, X.; Yao, Y.; Li, Y. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv 2023, arXiv:2311.16208. [Google Scholar] [CrossRef]
- Ahmed, S.J.; Elattar, M.A. Improving Targeted Molecule Generation through Language Model Fine-Tuning Via Reinforcement Learning. arXiv 2024, arXiv:2405.06836. [Google Scholar] [CrossRef]
- Fergus, S.; Botha, M.; Ostovar, M. Evaluating academic answers generated using ChatGPT. Journal of Chemical Education 2023, 100, 1672–1675. [Google Scholar] [CrossRef]
- Pimentel, A.; Wagener, A.; da Silveira, E.F.; Picciani, P.; Salles, B.; Follmer, C.; Oliveira Jr, O.N. Challenging ChatGPT with chemistry-related subjects. 2023. [Google Scholar] [CrossRef]
- Yik, B.J.; Dood, A.J. ChatGPT convincingly explains organic chemistry reaction mechanisms slightly inaccurately with high levels of explanation sophistication. Journal of Chemical Education 2024, 101, 1836–1846. [Google Scholar] [CrossRef]
- Lu, X.; Cao, H.; Liu, Z.; Bai, S.; Chen, L.; Yao, Y.; Zheng, H.T.; Li, Y. Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. arXiv 2024, arXiv:2403.08192. [Google Scholar] [CrossRef]
- Chen, X.; Wang, T.; Guo, T.; Guo, K.; Zhou, J.; Li, H.; Zhuge, M.; Schmidhuber, J.; Gao, X.; Zhang, X. Scholarchemqa: Unveiling the power of language models in chemical research question answering. arXiv 2024, arXiv:2407.16931. [Google Scholar] [CrossRef]
- Rampal, N.; Wang, K.; Burigana, M.; Hou, L.; Al-Johani, J.; Sackmann, A.; Murayshid, H.S.; AlSumari, W.A.; AlAbdulkarim, A.M.; Alhazmi, N.E.; et al. Single and multi-hop question-answering datasets for reticular chemistry with GPT-4-turbo. Journal of Chemical Theory and Computation 2024, 20, 9128–9137. [Google Scholar] [CrossRef] [PubMed]
- Wellawatte, G.P.; Guo, H.; Lederbauer, M.; Borisova, A.; Hart, M.; Brucka, M.; Schwaller, P. ChemLit-QA: A human evaluated dataset for chemistry RAG tasks. Machine Learning: Science and Technology 2025. [Google Scholar] [CrossRef]
- White, A.D.; Hocky, G.M.; Gandhi, H.A.; Ansari, M.; Cox, S.; Wellawatte, G.P.; Sasmal, S.; Yang, Z.; Liu, K.; Singh, Y.; et al. Assessment of chemistry knowledge in large language models that generate code. Digital Discovery 2023, 2, 368–376. [Google Scholar] [CrossRef]
- Pascazio, L.; Tran, D.; Rihm, S.D.; Bai, J.; Mosbach, S.; Akroyd, J.; Kraft, M. Question-answering system for combustion kinetics. Proceedings of the Combustion Institute 2024, 40, 105428. [Google Scholar] [CrossRef]
- Hatakeyama-Sato, K.; Yamane, N.; Igarashi, Y.; Nabae, Y.; Hayakawa, T. Prompt engineering of GPT-4 for chemical research: what can/cannot be done? Science and Technology of Advanced Materials: Methods 2023, 3, 2260300. [Google Scholar] [CrossRef]
- Li, J.; Zhang, D.; Wang, X.; Hao, Z.; Lei, J.; Tan, Q.; Zhou, C.; Liu, W.; Yang, Y.; Xiong, X.; et al. ChemVLM: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025; Vol. 39.
- Zhang, Y.; Chen, X.; Chen, K.; Du, Y.; Dang, X.; Heng, P.A. The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? arXiv 2025, arXiv:2501.13952. [Google Scholar] [CrossRef]
- Ai, Q.; Meng, F.; Shi, J.; Pelkie, B.; Coley, C.W. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. Digital Discovery 2024, 3, 1822–1831. [Google Scholar] [CrossRef] [PubMed]
- Vangala, S.R.; Krishnan, S.R.; Bung, N.; Nandagopal, D.; Ramasamy, G.; Kumar, S.; Sankaran, S.; Srinivasan, R.; Roy, A. Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature. Journal of Cheminformatics 2024, 16, 131. [Google Scholar] [CrossRef] [PubMed]
- Chen, K.; Cao, H.; Li, J.; Du, Y.; Guo, M.; Zeng, X.; Li, L.; Qiu, J.; Heng, P.A.; Chen, G. An autonomous large language model agent for chemical literature data mining. arXiv 2024, arXiv:2402.12993. [Google Scholar] [CrossRef]
- Huang, X.; Surve, M.; Liu, Y.; Luo, T.; Wiest, O.; Zhang, X.; Chawla, N.V. Application of Large Language Models in Chemistry Reaction Data Extraction and Cleaning. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2024; pp. 3797–3801.
- Hartgers, A.; Nugmanov, R.; Chernichenko, K.; Wegner, J.K. ReacLLaMA: Merging chemical and textual information in chemical reactivity AI models. arXiv 2024, arXiv:2401.17267. [Google Scholar] [CrossRef]
- Polak, M.P.; Morgan, D. Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nature Communications 2024, 15, 1569. [Google Scholar] [CrossRef]
- Exintaris, B.; Karunaratne, N.; Yuriev, E. Metacognition and critical thinking: Using ChatGPT-generated responses as prompts for critique in a problem-solving workshop (SMARTCHEMPer). Journal of Chemical Education 2023, 100, 2972–2980. [Google Scholar] [CrossRef]
- Reddy, M.R.; Walter, N.G.; Sevryugina, Y.V. Implementation and Evaluation of a ChatGPT-Assisted Special Topics Writing Assignment in Biochemistry. Journal of Chemical Education 2024, 101, 2740–2748. [Google Scholar] [CrossRef]
- Mendez, J.D. Student Perceptions of Artificial Intelligence Utility in the Introductory Chemistry Classroom. Journal of Chemical Education 2024, 101, 3547–3549. [Google Scholar] [CrossRef]
- Santos, R.P.d. Enhancing chemistry learning with ChatGPT and Bing Chat as agents to think with: a comparative case study. arXiv 2023, arXiv:2305.11890. [Google Scholar] [CrossRef]
- Du, Y.; Duan, C.; Bran, A.; Sotnikova, A.; Qu, Y.; Kulik, H.; Bosselut, A.; Xu, J.; Schwaller, P. Large language models are catalyzing chemistry education. 2024. [Google Scholar] [CrossRef]
- Subasinghe, S.S.; Gersib, S.G.; Mankad, N.P. Large Language Models (LLMs) as Graphing Tools for Advanced Chemistry Education and Research. Journal of Chemical Education 2025. [Google Scholar] [CrossRef]
- Iyamuremye, A.; Niyonzima, F.N.; Mukiza, J.; Twagilimana, I.; Nyirahabimana, P.; Nsengimana, T.; Habiyaremye, J.D.; Habimana, O.; Nsabayezu, E. Utilization of artificial intelligence and machine learning in chemistry education: a critical review. Discover Education 2024, 3, 95. [Google Scholar] [CrossRef]
- Sirisathitkul, C.; Jaroonchokanan, N. Implementing ChatGPT as Tutor, Tutee, and Tool in Physics and Chemistry. Substantia 2024. [Google Scholar] [CrossRef]
- Krasnov, L.; Khokhlov, I.; Fedorov, M.V.; Sosnin, S. Transformer-based artificial neural networks for the conversion between chemical notations. Scientific Reports 2021, 11, 14798. [Google Scholar] [CrossRef]
- Guo, X.; et al. Comprehensive Evaluation of GPT-4 for Chemistry Tasks. arXiv 2023, arXiv:2301.00000. [Google Scholar]
- Yu, J.; Zhang, C.; Cheng, Y.; Yang, Y.F.; She, Y.B.; Liu, F.; Su, W.; Su, A. SolvBERT for solvation free energy and solubility prediction: a demonstration of an NLP model for predicting the properties of molecular complexes. Digital Discovery 2023, 2, 409–421. [Google Scholar] [CrossRef]
- Wang, S.; Guo, Y.; Wang, Y.; Sun, H.; Huang, J. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2019; pp. 429–436.
- Ahmad, W.; Simon, E.; Chithrananda, S.; Grand, G.; Ramsundar, B. Chemberta-2: Towards chemical foundation models. arXiv 2022, arXiv:2209.01712. [Google Scholar] [CrossRef]
- Yüksel, A.; Ulusoy, E.; Ünlü, A.; Doğan, T. SELFormer: molecular representation learning via SELFIES language models. Machine Learning: Science and Technology 2023, 4, 025035. [Google Scholar] [CrossRef]
- Maziarka, Ł.; Danel, T.; Mucha, S.; Rataj, K.; Tabor, J.; Jastrzębski, S. Molecule attention transformer. arXiv 2020, arXiv:2002.08264. [Google Scholar] [CrossRef]
- Fabian, B.; Edlich, T.; Gaspar, H.; Segler, M.; Meyers, J.; Fiscato, M.; Ahmed, M. Molecular representation learning with language models and domain-relevant auxiliary tasks. arXiv 2020, arXiv:2011.13230. [Google Scholar] [CrossRef]
- Zhang, X.C.; Wu, C.K.; Yi, J.C.; Zeng, X.X.; Yang, C.Q.; Lu, A.P.; Hou, T.J.; Cao, D.S. Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. Research 2022, 2022, 0004. [Google Scholar] [CrossRef] [PubMed]
- Jiang, J.; Li, Y.; Zhang, R.; Liu, Y. INTransformer: Data augmentation-based contrastive learning by injecting noise into transformer for molecular property prediction. Journal of Molecular Graphics and Modelling 2024, 128, 108703. [Google Scholar] [CrossRef]
- Liu, S.; Nie, W.; Wang, C.; Lu, J.; Qiao, Z.; Liu, L.; Tang, J.; Xiao, C.; Anandkumar, A. Multi-modal molecule structure–text model for text-based retrieval and editing. Nature Machine Intelligence 2023, 5, 1447–1457. [Google Scholar] [CrossRef]
- Xu, M.; Yuan, X.; Miret, S.; Tang, J. Protst: Multi-modality learning of protein sequences and biomedical texts. In Proceedings of the International Conference on Machine Learning. PMLR; 2023; pp. 38749–38767. [Google Scholar]
- Ahneman, D.T.; Estrada, J.G.; Lin, S.; Dreher, S.D.; Doyle, A.G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 2018, 360, 186–190. [Google Scholar] [CrossRef]
- Schwaller, P.; Vaucher, A.C.; Laino, T.; Reymond, J.L. Prediction of chemical reaction yields using deep learning. Machine learning: science and technology 2021, 2, 015016. [Google Scholar] [CrossRef]
- Yin, X.; Hsieh, C.Y.; Wang, X.; Wu, Z.; Ye, Q.; Bao, H.; Deng, Y.; Chen, H.; Luo, P.; Liu, H.; et al. Enhancing generic reaction yield prediction through reaction condition-based contrastive learning. Research 2024, 7, 0292. [Google Scholar] [CrossRef]
- Shi, R.; Yu, G.; Huo, X.; Yang, Y. Prediction of chemical reaction yields with large-scale multi-view pre-training. Journal of Cheminformatics 2024, 16, 22. [Google Scholar] [CrossRef]
- Nam, J.; Kim, J. Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions. arXiv 2016, arXiv:1612.09529. [Google Scholar] [CrossRef]
- Tetko, I.V.; Engkvist, O.; Koch, A.; Reymond, J.L.; Bjerrum, E.J. Augmented Transformer for Improved Chemical Reaction Prediction. Journal of Chemical Information and Modeling 2020, 60, 6253–6266. [Google Scholar] [CrossRef]
- Guo, K.; et al. Leveraging Large Language Models for Predictive Chemistry. Nature Machine Intelligence 2023. [Google Scholar] [CrossRef]
- Lin, J.; et al. Question Rephrasing for Quantifying Uncertainty in Large Language Models. arXiv 2024, arXiv:2408.03732. [Google Scholar] [CrossRef]
- Shenvi, R.A. Natural product synthesis in the 21st century: Beyond the mountain top. ACS Central Science 2024, 10, 519–528. [Google Scholar] [CrossRef]
- Corey, E. Robert Robinson lecture. Retrosynthetic thinking—essentials and examples. Chemical society reviews 1988, 17, 111–133. [Google Scholar] [CrossRef]
- Nam, J.; Kim, J. Linking the neural machine translation and the prediction of organic chemistry reactions. arXiv 2016, arXiv:1612.09529. [Google Scholar] [CrossRef]
- Liu, B.; Ramsundar, B.; Kawthekar, P.; Shi, J.; Gomes, J.; Luu Nguyen, Q.; Ho, S.; Sloane, J.; Wender, P.; Pande, V. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS central science 2017, 3, 1103–1113. [Google Scholar] [CrossRef]
- Schneider, N.; Stiefl, N.; Landrum, G.A. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of chemical information and modeling 2016, 56, 2336–2346. [Google Scholar] [CrossRef]
- Somnath, V.R.; Bunne, C.; Coley, C.; Krause, A.; Barzilay, R. Learning graph models for retrosynthesis prediction. Advances in Neural Information Processing Systems 2021, 34, 9405–9415. [Google Scholar]
- Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C.A.; Bekas, C.; Lee, A.A. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Central Science 2019, 5, 1572–1583. [CrossRef]
- Schwaller, P.; Petraglia, R.; Zullo, V.; Nair, V.H.; Haeuselmann, R.A.; Pisoni, R.; Bekas, C.; Iuliano, A.; Laino, T. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chemical science 2020, 11, 3316–3325. [Google Scholar] [CrossRef] [PubMed]
- Zheng, S.; Rao, J.; Zhang, Z.; Xu, J.; Yang, Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. Journal of chemical information and modeling 2019, 60, 47–55. [Google Scholar] [CrossRef] [PubMed]
- Irwin, R.; Dimitriadis, S.; He, J.; Bjerrum, E.J. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology 2022, 3, 015022. [Google Scholar] [CrossRef]
- Toniato, A.; Vaucher, A.C.; Schwaller, P.; Laino, T. Enhancing diversity in language based models for single-step retrosynthesis. Digital Discovery 2023, 2, 489–501. [Google Scholar] [CrossRef]
- Fang, Y.; Zhang, N.; Chen, Z.; Guo, L.; Fan, X.; Chen, H. Domain-agnostic molecular generation with chemical feedback. arXiv 2023, arXiv:2301.11259. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
- Zholus, A.; Kuznetsov, M.; Schutski, R.; Shayakhmetov, R.; Polykovskiy, D.; Chandar, S.; Zhavoronkov, A. Bindgpt: A scalable framework for 3d molecular design via language modeling and reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025; Vol. 39, pp. 26083–26091.
- Wang, Y.; Zhao, H.; Sciabola, S.; Wang, W. cMolGPT: a conditional generative pre-trained transformer for target-specific de novo molecular generation. Molecules 2023, 28, 4430. [Google Scholar] [CrossRef]
- Wang, X.; Gao, C.; Han, P.; Li, X.; Chen, W.; Rodríguez Patón, A.; Wang, S.; Zheng, P. PETrans: De novo drug design with protein-specific encoding based on transfer learning. International Journal of Molecular Sciences 2023, 24, 1146. [Google Scholar] [CrossRef]
- Kyro, G.W.; Morgunov, A.; Brent, R.I.; Batista, V.S. ChemSpaceAL: an efficient active learning methodology applied to protein-specific molecular generation. Biophysical Journal 2024, 123, 283a. [Google Scholar] [CrossRef]
- Adilov, S. Generative pre-training from molecules. 2021. [Google Scholar] [CrossRef]
- Haroon, S.; Hafsath, C.; Jereesh, A. Generative Pre-trained Transformer (GPT) based model with relative attention for de novo drug design. Computational Biology and Chemistry 2023, 106, 107911. [Google Scholar] [CrossRef]
- Yoshikai, Y.; Mizuno, T.; Nemoto, S.; Kusuhara, H. A novel molecule generative model of VAE combined with Transformer for unseen structure generation. arXiv 2024, arXiv:2402.11950. [Google Scholar] [CrossRef]
- Shen, T.; Guo, J.; Han, Z.; Zhang, G.; Liu, Q.; Si, X.; Wang, D.; Wu, S.; Xia, J. AutoMolDesigner for antibiotic discovery: an AI-based open-source software for automated design of small-molecule antibiotics. Journal of Chemical Information and Modeling 2024, 64, 575–583. [Google Scholar] [CrossRef]
- Mazuz, E.; Shtar, G.; Shapira, B.; Rokach, L. Molecule generation using transformers and policy gradient reinforcement learning. Scientific Reports 2023, 13, 8799. [Google Scholar] [CrossRef]
- Yu, B.; Baker, F.N.; Chen, Z.; Ning, X.; Sun, H. Llasmol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv 2024, arXiv:2402.09391. [Google Scholar]
- Wu, Z.; Ramsundar, B.; Feinberg, E.N.; Gomes, J.; Geniesse, C.; Pappu, A.S.; Leswing, K.; Pande, V. MoleculeNet: a benchmark for molecular machine learning. Chemical science 2018, 9, 513–530. [Google Scholar] [CrossRef]
- Nakata, M.; Shimazaki, T. PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. Journal of chemical information and modeling 2017, 57, 1300–1308. [Google Scholar] [CrossRef]
- Huang, K.; Fu, T.; Gao, W.; Zhao, Y.; Roohani, Y.; Leskovec, J.; Coley, C.W.; Xiao, C.; Sun, J.; Zitnik, M. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv 2021, arXiv:2102.09548. [Google Scholar] [CrossRef]
- Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind database: Collection of binding affinities for protein- ligand complexes with known three-dimensional structures. Journal of medicinal chemistry 2004, 47, 2977–2980. [Google Scholar] [CrossRef] [PubMed]
- Gilson, M.K.; Liu, T.; Baitaluk, M.; Nicola, G.; Hwang, L.; Chong, J. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic acids research 2016, 44, D1045–D1053. [Google Scholar] [CrossRef]
- Tran, R.; Lan, J.; Shuaibi, M.; Wood, B.M.; Goyal, S.; Das, A.; Heras-Domingo, J.; Kolluru, A.; Rizvi, A.; Shoghi, N.; et al. The Open Catalyst 2022 (OC22) dataset and challenges for oxide electrocatalysts. ACS Catalysis 2023, 13, 3066–3084. [Google Scholar] [CrossRef]
- Perera, D.; Tucker, J.W.; Brahmbhatt, S.; Helal, C.J.; Chong, A.; Farrell, W.; Richardson, P.; Sach, N.W. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science 2018, 359, 429–434. [Google Scholar] [CrossRef]
- Kearnes, S.M.; Maser, M.R.; Wleklinski, M.; Kast, A.; Doyle, A.G.; Dreher, S.D.; Hawkins, J.M.; Jensen, K.F.; Coley, C.W. The open reaction database. Journal of the American Chemical Society 2021, 143, 18820–18826. [Google Scholar] [CrossRef]
- Wigh, D.S.; Arrowsmith, J.; Pomberger, A.; Felton, K.C.; Lapkin, A.A. Orderly: data sets and benchmarks for chemical reaction data. Journal of Chemical Information and Modeling 2024, 64, 3790–3798. [Google Scholar] [CrossRef]
- Thakkar, A.; Kogej, T.; Reymond, J.L.; Engkvist, O.; Bjerrum, E.J. Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chemical science 2020, 11, 154–168. [Google Scholar] [CrossRef]
- Lowe, D. Chemical reactions from US patents (1976–Sep2016). Figshare 2017.
- Reaxys®. https://www.elsevier.com/solutions/reaxys. Accessed: 2025-04-21.
- Mallard, W.G.; Westley, F.; Herron, J.; Hampson, R.F.; Frizzell, D. NIST chemical kinetics database; Vol. 126, National Institute of Standards and Technology Washington, DC, USA, 1992.
- Johnson, M.S.; Dong, X.; Grinberg Dana, A.; Chung, Y.; Farina Jr, D.; Gillis, R.J.; Liu, M.; Yee, N.W.; Blondal, K.; Mazeau, E.; et al. RMG database for chemical property prediction. Journal of Chemical Information and Modeling 2022, 62, 4906–4915. [Google Scholar] [CrossRef]
- de Matos, P.; Dekker, A.; Ennis, M.; Hastings, J.; Haug, K.; Turner, S.; Steinbeck, C. ChEBI: a chemistry ontology and database. Journal of cheminformatics 2010, 2, 1–1. [Google Scholar] [CrossRef]
- Wu, J.; Zhang, T.; Chen, R.; Zhang, W.; Zhang, C.J.; Wei, X.; Qing, L. MolGround: A Benchmark for Molecular Grounding. arXiv 2025, arXiv:2503.23668. [Google Scholar] [CrossRef]
- Lowe, D.M. Extraction of Chemical Structures and Reactions from the US Patent Literature. Doctoral Thesis, University of Cambridge 2012.
- Genheden, S.; Bjerrum, E. PaRoutes: towards a framework for benchmarking retrosynthesis route predictions. Digital Discovery 2022, 1, 527–539. [Google Scholar] [CrossRef]
- Genheden, S.; Thakkar, A.; Chadimova, V.; Reymond, J.; Engkvist, O.; Bjerrum, E.J. AiZynthFinder: A Fast, Robust and Flexible Open-Source Software for Retrosynthetic Planning. Journal of Cheminformatics 2020, 12, 70. [Google Scholar] [CrossRef]
- Brown, N.; Fiscato, M.; Segler, M.H.; Vaucher, A.C. GuacaMol: benchmarking models for de novo molecular design. Journal of chemical information and modeling 2019, 59, 1096–1108. [Google Scholar] [CrossRef]
- Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Frontiers in pharmacology 2020, 11, 565644. [Google Scholar] [CrossRef]
- Eckmann, P.; Sun, K.; Zhao, B.; Feng, M.; Gilson, M.K.; Yu, R. Limo: Latent inceptionism for targeted molecule generation. Proceedings of machine learning research 2022, 162, 5777. [Google Scholar] [PubMed]
- Taboureau, O.; Nielsen, S.K.; Audouze, K.; Weinhold, N.; Edsgärd, D.; Roque, F.S.; Kouskoumvekaki, I.; Bora, A.; Curpan, R.; Jensen, T.S.; et al. ChemProt: a disease chemical biology database. Nucleic acids research 2010, 39, D367–D372. [Google Scholar] [CrossRef]
- Li, J.; Sun, Y.; Johnson, R.J.; Sciaky, D.; Wei, C.H.; Leaman, R.; Davis, A.P.; Mattingly, C.J.; Wiegers, T.C.; Lu, Z. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016, 2016. [Google Scholar] [CrossRef]
- Krallinger, M.; Rabal, O.; Lourenço, A.; et al. The CHEMDNER Corpus of Chemicals and Drugs and Its Annotation Principles. Journal of Cheminformatics 2015, 7, S2. [Google Scholar] [CrossRef]
- Verspoor, K.; Nguyen, D.Q.; Akhondi, S.A.; Druckenbrodt, C.; Thorne, C.; Hoessel, R.; He, J.; Zhai, Z. ChEMU dataset for information extraction from chemical patents. Mendeley Data 2020, 2, 17632. [Google Scholar]
- Wang, X.; Hu, V.; Song, X.; Garg, S.; Xiao, J.; Han, J. ChemNER: fine-grained chemistry named entity recognition with ontology-guided distant supervision. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. [Google Scholar]
- Axelrod, S.; Gomez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data 2022, 9, 185. [Google Scholar] [CrossRef]
- Isert, C.; Atz, K.; Jiménez-Luna, J.; Schneider, G. QMugs, quantum mechanical properties of drug-like molecules. Scientific Data 2022, 9, 273. [Google Scholar] [CrossRef]
- Huang, K.; Fu, T.; Gao, W.; Zhao, Y.; Roohani, Y.; Leskovec, J.; Coley, C.W.; Xiao, C. Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. Scientific Data 2021, 8, 316. [Google Scholar] [CrossRef]
- Wei, Z.; Ji, W.; Geng, X.; Chen, Y.; Chen, B.; Qin, T.; Jiang, D. ChemistryQA: A Complex Question Answering Dataset from Chemistry. 2020.
- Laghuvarapu, S.; Lee, N.; Gao, C.; Sun, J. MolTextQA: A Curated Question-Answering Dataset and Benchmark for Molecular Structure-Text Relationship Learning.
- Capuzzi, S.J.; Kim, I.S.; Lam, W.I.; Thornton, T.E.; Muratov, E.N.; Pozefsky, D.; Tropsha, A. Chembench: A Publicly-Accessible, Integrated Cheminformatics Portal. Journal of Chemical Information and Modeling 2017, 57, 105–108. [Google Scholar] [CrossRef] [PubMed]
- Krasnov, L.; Khokhlov, I.; Fedorov, M.; Sosnin, S. Struct2IUPAC–Transformer-Based Artificial Neural Network for the Conversion Between Chemical Notations. 2021. [Google Scholar] [CrossRef]
- Zhang, Y.; Xu, J.; Chen, H.; Wang, J.; Wu, Y.; Prakasam, M.; Xu, H. Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning. Database 2016, 2016, baw049. [Google Scholar] [CrossRef]
- Yoo, S.; Kim, J. Adapt-cMolGPT: A Conditional Generative Pre-Trained Transformer with Adapter-Based Fine-Tuning for Target-Specific Molecular Generation. International Journal of Molecular Sciences 2024, 25, 6641. [Google Scholar] [CrossRef]
- Tom, G.; Schmid, S.P.; Baird, S.G.; Cao, Y.; Darvish, K.; Hao, H.; Lo, S.; Pablo-García, S.; Rajaonson, E.M.; Skreta, M.; et al. Self-driving laboratories for chemistry and materials science. Chemical Reviews 2024, 124, 9633–9732. [Google Scholar] [CrossRef] [PubMed]
- Seifrid, M.; Pollice, R.; Aguilar-Granda, A.; Morgan Chan, Z.; Hotta, K.; Ser, C.T.; Vestfrid, J.; Wu, T.C.; Aspuru-Guzik, A. Autonomous chemical experiments: Challenges and perspectives on establishing a self-driving lab. Accounts of Chemical Research 2022, 55, 2454–2466. [Google Scholar] [CrossRef]
- Dutta, S.; Singh, J.; Chakrabarti, S.; Chakraborty, T. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. arXiv 2024, arXiv:2402.18312. [Google Scholar]
- Han, Y.; Wan, Z.; Chen, L.; Yu, K.; Chen, X. From Generalist to Specialist: A Survey of Large Language Models for Chemistry. arXiv 2024, arXiv:2412.19994. [Google Scholar] [CrossRef]
- Le, K.; Chawla, N.V. Utilizing Large Language Models in an iterative paradigm with domain feedback for molecule optimization. arXiv 2024, arXiv:2410.13147. [Google Scholar] [CrossRef]
- Sutanto, P.; Santoso, J.; Setiawan, E.I.; Wibawa, A.P. Llm distillation for efficient few-shot multiple choice question answering. arXiv 2024, arXiv:2412.09807. [Google Scholar] [CrossRef]
- Blumenthal, D.; Campbell, E.G.; Anderson, M.S.; Causino, N.; Louis, K.S. Withholding research results in academic life science: evidence from a national survey of faculty. Jama 1997, 277, 1224–1228. [Google Scholar] [CrossRef] [PubMed]
- Blumenthal, D.; Causino, N.; Campbell, E.; Louis, K.S. Relationships between academic institutions and industry in the life sciences—an industry survey. New England Journal of Medicine 1996, 334, 368–374. [Google Scholar] [CrossRef]
- Louis, K.S.; Blumenthal, D.; Gluck, M.E.; Stoto, M.A. Entrepreneurs in academe: An exploration of behaviors among life scientists. Administrative science quarterly 1989, 110–131. [Google Scholar] [CrossRef]
- Geitmann, A.; Niklas, K.; Speck, T. Plant Biomechanics in the 21st Century. 2019, 70, 3435–3438. [Google Scholar] [CrossRef]
- Mondal, S.; Bagchi, B. From Structure and Dynamics to Biomolecular Functions: the Ubiquitous Role of Solvent in Biology. 2022, 77, 102462. [Google Scholar] [CrossRef]
- Lewis, W.A. Economic Survey; Routledge, 2013. [Google Scholar]
- Fisher, A.C.; Peterson, F.M. The environment in economics: a survey. Journal of Economic Literature 1976, 14, 1–33. [Google Scholar]
- Wikipedia. Life Science. Wikipedia, The Free Encyclopedia 2023. [Accessed: 2025-03-22].
- International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 2001, 409, 860–921. [CrossRef]
- Lodish, H.F. Molecular cell biology; Macmillan, 2008. [Google Scholar]
- Carpenter, W. Principles of mental physiology; BoD–Books on Demand, 2023. [Google Scholar]
- Bayliss, W.M. Principles of general physiology; Longsmann Green, 1915. [Google Scholar]
- Darwin, C. Origin of the Species. In British Politics and the environment in the long nineteenth century; Routledge, 2023; pp. 47–55. [Google Scholar]
- Chen, Q.; Yan, W.; Duan, E. Epigenetic Inheritance of Acquired Traits Through Sperm RNAs and Sperm RNA Modifications. Nature Reviews Genetics 2016, 17, 733–743. [Google Scholar] [CrossRef]
- Correns, C.E. Gregor Mendel’s „Versuche über Pflanzen-Hybriden“ und die Bestätigung ihrer Ergebnisse durch die neuesten Untersuchungen; Springer, 1924. [Google Scholar]
- Hearn, R.; Arblaster, K. DNA extraction techniques for use in education. Biochemistry and Molecular Biology Education 2010, 38, 161–166. [Google Scholar] [CrossRef] [PubMed]
- Di Pinto, A.; Forte, V.; Guastadisegni, M.C.; Martino, C.; Schena, F.P.; Tantillo, G. A comparison of DNA extraction methods for food analysis. Food control 2007, 18, 76–80. [Google Scholar] [CrossRef]
- Rohland, N.; Hofreiter, M. Comparison and optimization of ancient DNA extraction. Biotechniques 2007, 42, 343–352. [Google Scholar] [CrossRef]
- Crossley, B.M.; Bai, J.; Glaser, A.; Maes, R.; Porter, E.; Killian, M.L.; Clement, T.; Toohey-Kurth, K. Guidelines for Sanger sequencing and molecular assay monitoring. Journal of Veterinary Diagnostic Investigation 2020, 32, 767–775. [Google Scholar] [CrossRef]
- Sikkema-Raddatz, B.; Johansson, L.F.; de Boer, E.N.; Almomani, R.; Boven, L.G.; van den Berg, M.P.; van Spaendonck-Zwarts, K.Y.; van Tintelen, J.P.; Sijmons, R.H.; Jongbloed, J.D.; et al. Targeted next-generation sequencing can replace Sanger sequencing in clinical diagnostics. Human mutation 2013, 34, 1035–1042. [Google Scholar] [CrossRef]
- Beck, T.F.; Mullikin, J.C.; NISC Comparative Sequencing Program; Biesecker, L.G. Systematic evaluation of Sanger validation of next-generation sequencing variants. Clinical Chemistry 2016, 62, 647–654. [CrossRef]
- Erlich, H.A.; et al. PCR technology; Springer, 1989. [Google Scholar]
- Zhu, H.; Zhang, H.; Xu, Y.; Laššáková, S.; Korabečná, M.; Neužil, P. PCR past, present and future. Biotechniques 2020, 69, 317–325. [Google Scholar] [CrossRef]
- Kircher, M.; Kelso, J. High-throughput DNA sequencing–concepts and limitations. Bioessays 2010, 32, 524–536. [Google Scholar] [CrossRef]
- Lou, D.I.; Hussmann, J.A.; McBee, R.M.; Acevedo, A.; Andino, R.; Press, W.H.; Sawyer, S.L. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proceedings of the National Academy of Sciences 2013, 110, 19872–19877. [Google Scholar] [CrossRef] [PubMed]
- Baxevanis, A.D.; Bader, G.D.; Wishart, D.S. Bioinformatics; John Wiley & Sons, 2020. [Google Scholar]
- Majekodunmi, S.O. A review on centrifugation in the pharmaceutical industry. Am. J. Biomed. Eng 2015, 5, 67–78. [Google Scholar]
- Brakke, M.K. Density gradient centrifugation: a new separation technique. Journal of the American Chemical Society 1951, 73, 1847–1848. [Google Scholar] [CrossRef]
- Smith, I. Chromatography; Elsevier, 2013. [Google Scholar]
- Smyth, M.; Martin, J. X-ray crystallography. Molecular Pathology 2000, 53, 8. [Google Scholar] [CrossRef] [PubMed]
- Woolfson, M.M. An introduction to X-ray crystallography; Cambridge University Press, 1997. [Google Scholar]
- Schwieters, C.D.; Kuszewski, J.J.; Clore, G.M. Using Xplor–NIH for NMR molecular structure determination. Progress in nuclear magnetic resonance spectroscopy 2006, 48, 47–62. [Google Scholar] [CrossRef]
- Hames, B.D. Gel electrophoresis of proteins: a practical approach; Vol. 197, OUP Oxford, 1998.
- Kurien, B.T.; Scofield, R.H. Western blotting. Methods 2006, 38, 283–293. [Google Scholar] [CrossRef] [PubMed]
- Lewin, G.R.; Barde, Y.A. Physiology of the neurotrophins. Annual review of neuroscience 1996, 19, 289–317. [Google Scholar] [CrossRef]
- Verkhratsky, A.; Nedergaard, M. Physiology of astroglia. Physiological reviews 2018, 98, 239–389. [Google Scholar] [CrossRef]
- Saba, T.M. Physiology and physiopathology of the reticuloendothelial system. Archives of internal medicine 1970, 126, 1031–1052. [Google Scholar] [CrossRef]
- Boron, W.F.; Boulpaep, E.L. Medical Physiology E-Book: Medical Physiology E-Book; Elsevier Health Sciences, 2016. [Google Scholar]
- Hahnemann, S. Organon of medicine; B. Jain publishers, 2005. [Google Scholar]
- Castiglioni, A. A history of medicine; Routledge, 2019. [Google Scholar]
- Sigerist, H.E. A history of medicine; Vol. 2, Oxford University Press, 1987.
- Van Den Berg, A.; Mummery, C.L.; Passier, R.; Van der Meer, A.D. Personalised organs-on-chips: functional testing for precision medicine. Lab on a Chip 2019, 19, 198–205. [Google Scholar] [CrossRef]
- Krishnamoorthy, S.; Honn, K.V. Inflammation and disease progression. Cancer and Metastasis Reviews 2006, 25, 481–491. [Google Scholar] [CrossRef]
- Yunginger, J.W.; Ahlstedt, S.; Eggleston, P.A.; Homburger, H.A.; Nelson, H.S.; Ownby, D.R.; Platts-Mills, T.A.; Sampson, H.A.; Sicherer, S.H.; Weinstein, A.M.; et al. Quantitative IgE antibody assays in allergic diseases. Journal of allergy and clinical immunology 2000, 105, 1077–1084. [Google Scholar] [CrossRef]
- Kricka, L.J. Human anti-animal antibody interferences in immunological assays. Clinical chemistry 1999, 45, 942–956. [Google Scholar] [CrossRef]
- Raichle, M.E.; Mintun, M.A. Brain work and brain imaging. Annu. Rev. Neurosci. 2006, 29, 449–476. [Google Scholar] [CrossRef] [PubMed]
- Kekelidze, M.; D’Errico, L.; Pansini, M.; Tyndall, A.; Hohmann, J. Colorectal cancer: current imaging methods and future perspectives for the diagnosis, staging and therapeutic response evaluation. World journal of gastroenterology: WJG 2013, 19, 8502. [Google Scholar] [CrossRef] [PubMed]
- Schwartzman, R.A.; Cidlowski, J.A. Apoptosis: the biochemistry and molecular biology of programmed cell death. Endocrine reviews 1993, 14, 133–151. [Google Scholar]
- Amaral-Zettler, L.A.; Zettler, E.R.; Mincer, T.J. Ecology of the plastisphere. Nature Reviews Microbiology 2020, 18, 139–151. [Google Scholar] [CrossRef]
- Polis, G.A.; Yamashita, T. Ecology. 1990.
- Odum, E.P.; Barrett, G.W.; et al. Fundamentals of ecology 1971.
- Sopinka, N.M.; Patterson, L.D.; Redfern, J.C.; Pleizier, N.K.; Belanger, C.B.; Midwood, J.D.; Crossin, G.T.; Cooke, S.J. Manipulating glucocorticoids in wild animals: basic and applied perspectives. Conservation Physiology 2015, 3, cov031. [Google Scholar] [CrossRef]
- Burt, T.; Howden, N.; McDonnell, J.; Jones, J.; Hancock, G. Seeing the climate through the trees: observing climate and forestry impacts on streamflow using a 60-year record. Hydrological processes 2015, 29, 473–480. [Google Scholar] [CrossRef]
- Hammer, Ø.; Harper, D.A. Paleontological data analysis; John Wiley & Sons, 2024. [Google Scholar]
- Pearson, W.R.; Wood, T.; Zhang, Z.; Miller, W. Comparison of DNA sequences with protein sequences. Genomics 1997, 46, 24–36. [Google Scholar] [CrossRef] [PubMed]
- Huang, X. Fast comparison of a DNA sequence with a protein sequence database. Microbial & comparative genomics 1996, 1, 281–291. [Google Scholar]
- Peltz, G.; Zaas, A.K.; Zheng, M.; Solis, N.V.; Zhang, M.X.; Liu, H.H.; Hu, Y.; Boxx, G.M.; Phan, Q.T.; Dill, D.; et al. Next-generation computational genetic analysis: multiple complement alleles control survival after Candida albicans infection. Infection and immunity 2011, 79, 4472–4479. [Google Scholar] [CrossRef] [PubMed]
- Gilbert, C.; Ellis, T. Biological Engineered Living Materials: Growing Functional Materials with Genetically Programmable Properties. ACS Synthetic Biology 2019, 8, 1–15. [Google Scholar] [CrossRef]
- Magin, R. Fractional calculus in bioengineering, part 1. Critical Reviews in Biomedical Engineering 2004, 32. [Google Scholar]
- Kumar, G.; Shekh, A.; Jakhu, S.; Sharma, Y.; Kapoor, R.; Sharma, T.R. Bioengineering of microalgae: recent advances, perspectives, and regulatory challenges for industrial application. Frontiers in Bioengineering and Biotechnology 2020, 8, 914. [Google Scholar] [CrossRef] [PubMed]
- Jaklenec, A.; Stamp, A.; Deweerd, E.; Sherwin, A.; Langer, R. Progress in the tissue engineering and stem cell industry “are we there yet?”. Tissue Engineering Part B: Reviews 2012, 18, 155–166. [Google Scholar] [CrossRef]
- Camara, C.; Peris-Lopez, P.; Tapiador, J.E. Security and privacy issues in implantable medical devices: A comprehensive survey. Journal of biomedical informatics 2015, 55, 272–289. [Google Scholar] [CrossRef]
- Smolen, J.S.; Aletaha, D.; Koeller, M.; Weisman, M.H.; Emery, P. New therapies for treatment of rheumatoid arthritis. The lancet 2007, 370, 1861–1874. [Google Scholar] [CrossRef] [PubMed]
- Ratner, B.D.; Bryant, S.J. Biomaterials: where we have been and where we are going. Annu. Rev. Biomed. Eng. 2004, 6, 41–75. [Google Scholar] [CrossRef]
- Tathe, A.; Ghodke, M.; Nikalje, A.P. A brief review: biomaterials and their application. Int. J. Pharm. Pharm. Sci 2010, 2, 19–23. [Google Scholar]
- Rahmati, M.; Mills, D.K.; Urbanska, A.M.; Saeb, M.R.; Venugopal, J.R.; Ramakrishna, S.; Mozafari, M. Electrospinning for Tissue Engineering Applications. Progress in Materials Science 2020, 117, 100721. [Google Scholar] [CrossRef]
- Wikipedia. Biological Engineering. Wikipedia, The Free Encyclopedia 2023. [Accessed: 2025-03-22].
- Weidner, B.V.; Nagel, J.K.S.; Weber, H. Facilitation method for the translation of biological systems to technical design solutions. International Journal of Design Creativity and Innovation 2018, 6, 211–234. [Google Scholar] [CrossRef]
- Helms, M.E.; Vattam, S.; Goel, A.K. Biologically inspired design: process and products. Design Studies 2009, 30, 606–622. [Google Scholar] [CrossRef]
- Nagel, J.K.; Nagel, R.; Stone, R.; McAdams, D. Function-based, biologically inspired concept generation. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 2010, 24, 521–535. [Google Scholar] [CrossRef]
- Nowakowski, A.; Andrzejewska, A.; Janowski, M.; Walczak, P.; Lukomska, B. Genetic engineering of stem cells for enhanced therapy. Acta neurobiologiae experimentalis 2013, 73, 1–18. [Google Scholar] [CrossRef]
- Zhou, Y.; Han, Y. Engineered bacteria as drug delivery vehicles: principles and prospects. Engineering Microbiology 2022, 2, 100034. [Google Scholar] [CrossRef] [PubMed]
- Lee, S.Y.; Kim, H.U.; Park, J.H.; Park, J.M.; Kim, T.Y. Metabolic engineering of microorganisms: general strategies and drug production. Drug discovery today 2009, 14, 78–88. [Google Scholar] [CrossRef]
- Riglar, D.T.; Silver, P.A. Engineering bacteria for diagnostic and therapeutic applications. Nature Reviews Microbiology 2018, 16, 214–225. [Google Scholar] [CrossRef]
- Khan, S.; Ullah, M.W.; Siddique, R.; Nabi, G.; Manan, S.; Yousaf, M.; Hou, H. Role of recombinant DNA technology to improve life. International journal of genomics 2016, 2016, 2405954. [Google Scholar] [CrossRef]
- Wright, S. Recombinant DNA technology and its social transformation, 1972-1982. Osiris 1986, 2, 303–360. [Google Scholar] [CrossRef] [PubMed]
- Johnson, I.S. Human insulin from recombinant DNA technology. Science 1983, 219, 632–637. [Google Scholar] [CrossRef]
- Micklos, D.A.; Freyer, G.A. DNA science: a first course in recombinant DNA technology; 1990. [Google Scholar]
- Rubio, D.; Garcia-Castro, J.; Martín, M.C.; de la Fuente, R.; Cigudosa, J.C.; Lloyd, A.C.; Bernad, A. Spontaneous human adult stem cell transformation. Cancer research 2005, 65, 3035–3039. [Google Scholar] [CrossRef]
- Jiang, F.; Doudna, J.A. CRISPR–Cas9 structures and mechanisms. Annual review of biophysics 2017, 46, 505–529. [Google Scholar] [CrossRef] [PubMed]
- Redman, M.; King, A.; Watson, C.; King, D. What is CRISPR/Cas9? Archives of Disease in Childhood-Education and Practice 2016, 101, 213–215. [Google Scholar] [CrossRef]
- Doudna, J.A.; Charpentier, E. The new frontier of genome engineering with CRISPR-Cas9. Science 2014, 346, 1258096. [Google Scholar] [CrossRef]
- Hsu, P.D.; Lander, E.S.; Zhang, F. Development and applications of CRISPR-Cas9 for genome engineering. Cell 2014, 157, 1262–1278. [Google Scholar] [CrossRef]
- Ran, F.A.; Hsu, P.D.; Wright, J.; Agarwala, V.; Scott, D.A.; Zhang, F. Genome engineering using the CRISPR-Cas9 system. Nature protocols 2013, 8, 2281–2308. [Google Scholar] [CrossRef]
- Wikipedia. Biology. Wikipedia, The Free Encyclopedia 2023. [Accessed: 2025-03-22].
- Ikada, Y. Challenges in tissue engineering. Journal of the Royal Society Interface 2006, 3, 589–601. [Google Scholar] [CrossRef]
- Chapekar, M.S. Tissue engineering: challenges and opportunities. Journal of Biomedical Materials Research: An Official Journal of The Society for Biomaterials, The Japanese Society for Biomaterials, and The Australian Society for Biomaterials and the Korean Society for Biomaterials 2000, 53, 617–620. [Google Scholar] [CrossRef]
- Lanza, R.; Langer, R.; Vacanti, J.P.; Atala, A. Principles of tissue engineering; Academic press, 2020. [Google Scholar]
- Park, J.; Lakes, R.S. Biomaterials: an introduction; Springer Science & Business Media, 2007. [Google Scholar]
- Dar, A.; Shachar, M.; Leor, J.; Cohen, S. Optimization of cardiac cell seeding and distribution in 3D porous alginate scaffolds. Biotechnology and bioengineering 2002, 80, 305–312. [Google Scholar] [CrossRef] [PubMed]
- Holy, C.E.; Shoichet, M.S.; Davies, J.E. Engineering three-dimensional bone tissue in vitro using biodegradable scaffolds: Investigating initial cell-seeding density and culture period. Journal of Biomedical Materials Research: An Official Journal of The Society for Biomaterials, The Japanese Society for Biomaterials, and The Australian Society for Biomaterials and the Korean Society for Biomaterials 2000, 51, 376–382. [Google Scholar] [CrossRef]
- Ripamonti, U.; Roden, L.C.; Renton, L.F. Osteoinductive hydroxyapatite-coated titanium implants. Biomaterials 2012, 33, 3813–3823. [Google Scholar] [CrossRef] [PubMed]
- De Lange, G.; Donath, K. Interface between bone tissue and implants of solid hydroxyapatite or hydroxyapatite-coated titanium implants. Biomaterials 1989, 10, 121–125. [Google Scholar] [CrossRef] [PubMed]
- Cook, S.D.; Thomas, K.A.; Kay, J.F.; Jarcho, M. Hydroxyapatite-coated titanium for orthopedic implant applications. Clinical Orthopaedics and Related Research (1976-2007) 1988, 232, 225–243. [Google Scholar] [CrossRef]
- Ferber, D. Lab-grown organs begin to take shape. Science 1999. [Google Scholar] [CrossRef]
- Niklason, L.E.; Lawson, J.H. Bioengineered human blood vessels. Science 2020, 370, eaaw8682. [Google Scholar] [CrossRef]
- Doran, P.M. Bioprocess engineering principles; Elsevier, 1995. [Google Scholar]
- Harun, R.; Singh, M.; Forde, G.M.; Danquah, M.K. Bioprocess engineering of microalgae to produce a variety of consumer products. Renewable and sustainable energy reviews 2010, 14, 1037–1047. [Google Scholar] [CrossRef]
- Liu, S. Bioprocess engineering: kinetics, sustainability, and reactor design; Elsevier, 2020. [Google Scholar]
- Rio-Chanona, E.A.; Wagner, J.L.; Ali, H.; Fiorelli, F.; Zhang, D.; Hellgardt, K. Deep learning-based surrogate modeling and optimization for microalgal biofuel production and photobioreactor design. AIChE Journal 2018. [Google Scholar] [CrossRef]
- Herbert, D.; Elsworth, R.; Telling, R. The continuous culture of bacteria; a theoretical and experimental study. Microbiology 1956, 14, 601–622. [Google Scholar] [CrossRef]
- Andrews, J.F. A mathematical model for the continuous culture of microorganisms utilizing inhibitory substrates. Biotechnology and bioengineering 1968, 10, 707–723. [Google Scholar]
- Abraham, E.P.; Chain, E.; Fletcher, C.M.; Gardner, A.D.; Heatley, N.G.; Jennings, M.A.; Florey, H.W. Further observations on penicillin. The Lancet 1941, 238, 177–189. [Google Scholar] [CrossRef]
- Shukla, A.; Thömmes, J. Recent advances in large-scale production of monoclonal antibodies and related proteins. Trends in biotechnology 2010, 28 5, 253–61. [Google Scholar] [CrossRef] [PubMed]
- Yokoyama, W.M. Production of Monoclonal Antibodies. Current Protocols in Cytometry 1999, 37. [Google Scholar] [CrossRef]
- Gentleman, R.; Carey, V.; Huber, W.; Irizarry, R.; Dudoit, S. Bioinformatics and computational biology solutions using R and Bioconductor; Springer Science & Business Media, 2005. [Google Scholar]
- Tang, B.; Pan, Z.; Yin, K.; Khateeb, A. Recent advances of deep learning in bioinformatics and computational biology. Frontiers in genetics 2019, 10, 214. [Google Scholar] [CrossRef]
- Ranganathan, S.; Nakai, K.; Schonbach, C. Encyclopedia of bioinformatics and computational biology: ABC of bioinformatics; Elsevier, 2018. [Google Scholar]
- Lio, P. Wavelets in bioinformatics and computational biology: state of art and perspectives. Bioinformatics 2003, 19, 2–9. [Google Scholar] [CrossRef]
- Bhavikatti, S. Finite element analysis; New Age International, 2005. [Google Scholar]
- Szabó, B.; Babuška, I. Finite element analysis: Method, verification and validation; 2021. [Google Scholar]
- Taylor, R.L. FEAP: A finite element analysis program, 2014.
- Marck, C. ‘DNA Strider’: a ‘C’ program for the fast analysis of DNA and protein sequences on the Apple Macintosh family of computers. Nucleic acids research 1988, 16, 1829–1836. [Google Scholar] [CrossRef]
- McKenna, A.; Hanna, M.; Banks, E.; Sivachenko, A.; Cibulskis, K.; Kernytsky, A.; Garimella, K.; Altshuler, D.; Gabriel, S.; Daly, M.; et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research 2010, 20, 1297–1303. [Google Scholar] [CrossRef]
- Lu, X.J.; Olson, W.K. 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic acids research 2003, 31, 5108–5121. [Google Scholar] [CrossRef] [PubMed]
- Golding, M.P. Ethical issues in biological engineering. UCLA L. Rev. 1967, 15, 443. [Google Scholar]
- Dzau, V.J.; Cicerone, R.J. Responsible use of human gene-editing technologies. Human Gene Therapy 2015, 26, 411–412. [Google Scholar] [CrossRef] [PubMed]
- Jasanoff, S.; Hurlbut, J.B.; Saha, K. CRISPR democracy: Gene editing and the need for inclusive deliberation. Issues in Science and Technology 2015, 32, 25–32. [Google Scholar]
- Tiedje, J.M.; Colwell, R.K.; Grossman, Y.L.; Hodson, R.E.; Lenski, R.E.; Mack, R.N.; Regal, P.J. The planned introduction of genetically engineered organisms: ecological considerations and recommendations. Ecology 1989, 70, 298–315. [Google Scholar] [CrossRef]
- Snow, A.A.; Andow, D.A.; Gepts, P.; Hallerman, E.M.; Power, A.; Tiedje, J.M.; Wolfenbarger, L. Genetically engineered organisms and the environment: Current status and recommendations. Ecological Applications 2005, 15, 377–404. [Google Scholar] [CrossRef]
- Wolfenbarger, L.L.; Phifer, P.R. The ecological risks and benefits of genetically engineered plants. Science 2000, 290, 2088–2093. [Google Scholar] [CrossRef]
- Karagyaur, M.; Efimenko, A.Y.; Makarevich, P.; Vasiluev, P.; Akopyan, Z.A.; Bryzgalina, E.; Tkachuk, V. Ethical and legal aspects of using genome editing technologies in medicine. 2019, 11, 117–132. [Google Scholar] [CrossRef]
- Piergentili, R.; Del Rio, A.; Signore, F.; Umani Ronchi, F.; Marinelli, E.; Zaami, S. CRISPR-Cas and its wide-ranging applications: From human genome editing to environmental implications, technical limitations, hazards and bioethical issues. Cells 2021, 10, 969. [Google Scholar] [CrossRef]
- National Academies of Sciences, Engineering, and Medicine; National Academy of Medicine; Committee on Human Gene Editing: Scientific, Medical, and Ethical Considerations. Human genome editing: science, ethics, and governance; National Academies Press, 2017.
- Floridi, L.; Cowls, J.; Beltrametti, M.; Chatila, R.; Chazerand, P.; Dignum, V.; Luetge, C.; Madelin, R.; Pagallo, U.; Rossi, F.; et al. AI4People—an ethical framework for a good AI society: opportunities, risks, principles, and recommendations. Minds and machines 2018, 28, 689–707. [Google Scholar] [CrossRef]
- Cath, C.; Wachter, S.; Mittelstadt, B.; Taddeo, M.; Floridi, L. Artificial intelligence and the ‘good society’: the US, EU, and UK approach. Science and engineering ethics 2018, 24, 505–528. [Google Scholar]
- Enderle, J.; Bronzino, J. Introduction to biomedical engineering; Academic press, 2012. [Google Scholar]
- Rabinow, P.; Bennett, G. Designing human practices: An experiment with synthetic biology; University of Chicago Press, 2012. [Google Scholar]
- Williams, S.M.; Haines, J.L.; Moore, J.H. The use of animal models in the study of complex disease: all else is never equal or why do so many human studies fail to replicate animal findings? Bioessays 2004, 26, 170–179. [Google Scholar] [CrossRef] [PubMed]
- McGonigle, P.; Ruggeri, B. Animal models of human disease: challenges in enabling translation. Biochemical pharmacology 2014, 87, 162–171. [Google Scholar] [CrossRef] [PubMed]
- Akhtar, A. The flaws and human harms of animal experimentation. Cambridge Quarterly of Healthcare Ethics 2015, 24, 407–419. [Google Scholar] [CrossRef] [PubMed]
- Li, Z.; Yang, Y.; Faraggi, E.; Zhan, J.; Zhou, Y. Direct prediction of profiles of sequences compatible with a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles. Proteins: Structure, Function, and Bioinformatics 2014, 82, 2565–2573. [Google Scholar] [CrossRef]
- Berger, J. The last mile: how to sustain long-distance migration in mammals. Conservation biology 2004, 18, 320–331. [Google Scholar] [CrossRef]
- Menche, J.; Sharma, A.; Kitsak, M.; Ghiassian, S.D.; Vidal, M.; Loscalzo, J.; Barabási, A.L. Uncovering disease-disease relationships through the incomplete interactome. Science 2015, 347, 1257601. [Google Scholar] [CrossRef]
- Uversky, V.N. A decade and a half of protein intrinsic disorder: biology still waits for physics. Protein Science 2013, 22, 693–724. [Google Scholar] [CrossRef]
- Cho, D.Y.; Kim, Y.A.; Przytycka, T.M. Chapter 5: Network biology approach to complex diseases. PLoS computational biology 2012, 8, e1002820. [Google Scholar] [CrossRef] [PubMed]
- Manojlovich, M.; Lee, S.; Lauseng, D. A systematic review of the unintended consequences of clinical interventions to reduce adverse outcomes. Journal of patient safety 2016, 12, 173–179. [Google Scholar] [CrossRef]
- Ni She, E.; Harrison, R. Mitigating unintended consequences of co-design in health care. Health Expectations 2021, 24, 1551–1556. [Google Scholar] [CrossRef]
- Riesselman, A.J.; Ingraham, J.B.; Marks, D.S. Deep generative models of genetic variation capture mutation effects. arXiv 2017, arXiv:1712.06527. [Google Scholar] [CrossRef]
- Mjolsness, E. Prospects for declarative mathematical modeling of complex biological systems. Bulletin of Mathematical Biology 2019, 81, 3385–3420. [Google Scholar] [CrossRef]
- Boughorbel, S.; Jarray, F.; Venugopal, N.; Elhadi, H. Alternating loss correction for preterm-birth prediction from ehr data with noisy labels. arXiv 2018, arXiv:1811.09782. [Google Scholar] [CrossRef]
- Zhu, C.; Chen, W.; Peng, T.; Wang, Y.; Jin, M. Hard sample aware noise robust learning for histopathology image classification. IEEE transactions on medical imaging 2021, 41, 881–894. [Google Scholar] [CrossRef]
- Pal, S.; Mondal, S.; Das, G.; Khatua, S.; Ghosh, Z. Big data in biology: The hope and present-day challenges in it. Gene Reports 2020, 21, 100869. [Google Scholar] [CrossRef]
- Agarwal, S.; Laradji, I.H.; Charlin, L.; Pal, C. Litllm: A toolkit for scientific literature review. arXiv 2024, arXiv:2402.01788. [Google Scholar] [CrossRef]
- An, H.; Narechania, A.; Wall, E.; Xu, K. VITALITY 2: Reviewing Academic Literature Using Large Language Models. arXiv 2024, arXiv:2408.13450. [Google Scholar] [CrossRef]
- Glickman, M.; Zhang, Y. AI and generative AI for research discovery and summarization. arXiv 2024, arXiv:2401.06795. [Google Scholar] [CrossRef]
- McGinness, L.; Baumgartner, P.; Onyango, E.; Lema, Z. Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science. In Proceedings of the Australasian Joint Conference on Artificial Intelligence. Springer; 2025; pp. 29–43. [Google Scholar]
- Baek, J.; Jauhar, S.K.; Cucerzan, S.; Hwang, S.J. Researchagent: Iterative research idea generation over scientific literature with large language models. arXiv 2024, arXiv:2404.07738. [Google Scholar] [CrossRef]
- Wu, S.; Ma, X.; Luo, D.; Li, L.; Shi, X.; Chang, X.; Lin, X.; Luo, R.; Pei, C.; Du, C.; et al. Automated review generation method based on large language models. arXiv 2024, arXiv:2407.20906. [Google Scholar] [CrossRef]
- Matarazzo, A.; Torlone, R. A Survey on Large Language Models with some Insights on their Capabilities and Limitations. arXiv 2025, arXiv:2501.04040. [Google Scholar]
- Wang, Z.; Wang, Z.; Jiang, J.; Chen, P.; Shi, X.; Li, Y. Large Language Models in Bioinformatics: A Survey. arXiv 2025, arXiv:2503.04490. [Google Scholar] [CrossRef]
- Wu, W.; Li, Q.; Li, M.; Fu, K.; Feng, F.; Ye, J.; Xiong, H.; Wang, Z. GENERator: A Long-Context Generative Genomic Foundation Model. arXiv 2025, arXiv:2502.07272. [Google Scholar] [CrossRef]
- Zheng, Z.; Deng, Y.; Xue, D.; Zhou, Y.; Ye, F.; Gu, Q. Structure-informed language models are protein designers. In Proceedings of the International conference on machine learning. PMLR; 2023; pp. 42317–42338. [Google Scholar]
- Nguyen, E.; Poli, M.; Faizi, M.; Thomas, A.; Wornow, M.; Birch-Sykes, C.; Massaroli, S.; Patel, A.; Rabideau, C.; Bengio, Y.; et al. Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. Advances in neural information processing systems 2023, 36, 43177–43201. [Google Scholar]
- Ross, T.D.; Gopinath, A. Chaining thoughts and LLMs to learn DNA structural biophysics. arXiv 2024, arXiv:2403.01332. [Google Scholar] [CrossRef]
- Xiao, Y.; Sun, E.; Jin, Y.; Wang, W. RNA-GPT: Multimodal Generative System for RNA Sequence Understanding. arXiv 2024, arXiv:2411.08900. [Google Scholar]
- Fang, X.; Wang, F.; Liu, L.; He, J.; Lin, D.; Xiang, Y.; Zhang, X.; Wu, H.; Li, H.; Song, L. Helixfold-single: Msa-free protein structure prediction by using protein language model as an alternative. arXiv 2022, arXiv:2207.13921. [Google Scholar]
- Truong Jr, T.; Bepler, T. Poet: A generative model of protein families as sequences-of-sequences. Advances in Neural Information Processing Systems 2023, 36, 77379–77415. [Google Scholar]
- Madani, A.; McCann, B.; Naik, N.; Keskar, N.S.; Anand, N.; Eguchi, R.R.; Huang, P.S.; Socher, R. Progen: Language modeling for protein generation. arXiv 2020, arXiv:2004.03497. [Google Scholar] [CrossRef]
- Nijkamp, E.; Ruffolo, J.A.; Weinstein, E.N.; Naik, N.; Madani, A. Progen2: exploring the boundaries of protein language models. Cell systems 2023, 14, 968–978. [Google Scholar] [CrossRef]
- Madani, A.; Krause, B.; Greene, E.R.; Subramanian, S.; Mohr, B.P.; Holton, J.M.; Olmos Jr, J.L.; Xiong, C.; Sun, Z.Z.; Socher, R.; et al. Large language models generate functional protein sequences across diverse families. Nature biotechnology 2023, 41, 1099–1106. [Google Scholar] [CrossRef]
- Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: towards cracking the language of life’s code through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021, 44, 7112–7127. [Google Scholar] [CrossRef]
- Brandes, N.; Ofer, D.; Peleg, Y.; Rappoport, N.; Linial, M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022, 38, 2102–2110. [Google Scholar] [CrossRef]
- Liu, X.h.; Lu, Z.h.; Wang, T.; Liu, F. Large language models facilitating modern molecular biology and novel drug development. Frontiers in Pharmacology 2024, 15, 1458739. [Google Scholar] [CrossRef]
- Koonin, E.V.; Galperin, M.Y. Principles and methods of sequence analysis. Sequence-Evolution-Function: Computational Approaches in Comparative Genomics 2003, pp. 111–192.
- Rajendran, S.; Pan, W.; Sabuncu, M.R.; Chen, Y.; Zhou, J.; Wang, F. Learning across diverse biomedical data modalities and cohorts: Challenges and opportunities for innovation. Patterns 2024, 5. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Q.; Ding, K.; Lv, T.; Wang, X.; Yin, Q.; Zhang, Y.; Yu, J.; Wang, Y.; Li, X.; Xiang, Z.; et al. Scientific large language models: A survey on biological & chemical domains. ACM Computing Surveys 2025, 57, 1–38. [Google Scholar] [CrossRef]
- de Carvalho, G.H.; Knap, O.; Pollice, R. Show, Don’t Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay. arXiv 2024, arXiv:2407.11068. [Google Scholar]
- Wang, Y.; Qi, J.; Gan, J. Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering. arXiv 2024, arXiv:2410.12846. [Google Scholar] [CrossRef]
- Yang, X.; Cheng, W.; Wu, Y.; Petzold, L.; Wang, W.Y.; Chen, H. DNA-GPT: Divergent n-gram analysis for training-free detection of GPT-generated text. arXiv 2023, arXiv:2305.17359. [Google Scholar]
- Zhang, D.; Zhang, W.; Zhao, Y.; Zhang, J.; He, B.; Qin, C.; Yao, J. DNAGPT: a generalized pre-trained tool for versatile DNA sequence analysis tasks. arXiv 2023, arXiv:2307.05628. [Google Scholar] [CrossRef]
- Nascimento, N.; Guimaraes, E.; Chintakunta, S.S.; Boominathan, S.A. LLM4DS: Evaluating Large Language Models for Data Science Code Generation. arXiv 2024, arXiv:2411.11908. [Google Scholar] [CrossRef]
- Shen, L.; Yang, Q.; Zheng, Y.; Li, M. AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications. arXiv 2025, arXiv:2503.05346. [Google Scholar]
- Maddigan, P.; Susnjak, T. Chat2VIS: Generating data visualizations via natural language using ChatGPT, Codex and GPT-3 large language models. IEEE Access 2023, 11, 45181–45193. [Google Scholar] [CrossRef]
- Nejjar, M.; Zacharias, L.; Stiehle, F.; Weber, I. Llms for science: Usage for code generation and data analysis. Journal of Software: Evolution and Process 2025, 37, e2723. [Google Scholar] [CrossRef]
- Ji, Y.; Zhou, Z.; Liu, H.; Davuluri, R.V. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021, 37, 2112–2120. [Google Scholar] [CrossRef]
- Zhou, Z.; Ji, Y.; Li, W.; Dutta, P.; Davuluri, R.; Liu, H. DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. arXiv 2023, arXiv:2306.15006. [Google Scholar]
- Avsec, Ž.; Agarwal, V.; Visentin, D.; Ledsam, J.R.; Grabska-Barwinska, A.; Taylor, K.R.; Assael, Y.; Jumper, J.; Kohli, P.; Kelley, D.R. Effective gene expression prediction from sequence by integrating long-range interactions. bioRxiv 2021. [Google Scholar] [CrossRef] [PubMed]
- Sanabria, M.; Hirsch, J.; Joubert, P.M.; Poetsch, A.R. DNA language model GROVER learns sequence context in the human genome. Nature Machine Intelligence 2024, 6, 911–923. [Google Scholar] [CrossRef]
- Dalla-Torre, H.; Gonzalez, L.; Mendoza-Revilla, J.; Lopez Carranza, N.; Grzywaczewski, A.H.; Oteri, F.; Dallago, C.; Trop, E.; de Almeida, B.P.; Sirelkhatim, H.; et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods 2024, 1–11. [Google Scholar] [CrossRef]
- Zhou, Z.; Riley, R.; Kautsar, S.; Wu, W.; Egan, R.; Hofmeyr, S.; Goldhaber-Gordon, S.; Yu, M.; Ho, H.; Liu, F.; et al. GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies. bioRxiv 2025, 2025–01. [Google Scholar] [CrossRef]
- Fishman, V.; Kuratov, Y.; Petrov, M.; Shmelev, A.; Shepelin, D.; Chekanov, N.; Kardymon, O.; Burtsev, M. Gena-lm: A family of open-source foundational models for long dna sequences. bioRxiv 2023, 12, 2023. [Google Scholar] [CrossRef]
- Akiyama, M.; Sakakibara, Y. Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR genomics and bioinformatics 2022, 4, lqac012. [Google Scholar] [CrossRef]
- Chen, J.; Hu, Z.; Sun, S.; Tan, Q.; Wang, Y.; Yu, Q.; Zong, L.; Hong, L.; Xiao, J.; Shen, T.; et al. Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. arXiv 2022, arXiv:2204.00300. [Google Scholar] [CrossRef]
- Zhang, Y.; Lang, M.; Jiang, J.; Gao, Z.; Xu, F.; Litfin, T.; Chen, K.; Singh, J.; Huang, X.; Song, G.; et al. Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Research 2024, 52, e3–e3. [Google Scholar] [CrossRef]
- Chen, K.; Zhou, Y.; Ding, M.; Wang, Y.; Ren, Z.; Yang, Y. Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction. BioRxiv 2023, 2023–01. [Google Scholar] [CrossRef]
- Marin, F.I.; Teufel, F.; Horlacher, M.; Madsen, D.; Pultz, D.; Winther, O.; Boomsma, W. Bend: Benchmarking dna language models on biologically meaningful tasks. arXiv 2023, arXiv:2311.12570. [Google Scholar]
- Yang, Y.; Li, G.; Pang, K.; Cao, W.; Zhang, Z.; Li, X. Deciphering 3’UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning. Advanced Science 2024, 11, 2407013. [Google Scholar] [CrossRef]
- Chu, Y.; Yu, D.; Li, Y.; Huang, K.; Shen, Y.; Cong, L.; Zhang, J.; Wang, M. A 5’UTR language model for decoding untranslated regions of mRNA and function predictions. Nature Machine Intelligence 2024, 6, 449–460. [Google Scholar] [CrossRef]
- Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH) 2021, 3, 1–23. [Google Scholar] [CrossRef]
- Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.Y. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in bioinformatics 2022, 23, bbac409. [Google Scholar] [CrossRef]
- Bolton, E.; Venigalla, A.; Yasunaga, M.; Hall, D.; Xiong, B.; Lee, T.; Daneshjou, R.; Frankle, J.; Liang, P.; Carbin, M.; et al. BioMedLM: A 2.7B parameter language model trained on biomedical text. arXiv 2024, arXiv:2403.18421. [Google Scholar]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nature Medicine 2025, 1–8. [Google Scholar] [CrossRef]
- Li, Y.; Li, Z.; Zhang, K.; Dan, R.; Jiang, S.; Zhang, Y. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus 2023, 15. [Google Scholar] [CrossRef]
- Zhang, H.; Chen, J.; Jiang, F.; Yu, F.; Chen, Z.; Li, J.; Chen, G.; Wu, X.; Zhang, Z.; Xiao, Q.; et al. HuatuoGPT, Towards Taming Language Models To Be a Doctor. arXiv 2023, arXiv:2305.15075. [Google Scholar] [CrossRef]
- Chen, J.; Wang, X.; Gao, A.; Jiang, F.; Chen, S.; Zhang, H.; Song, D.; Xie, W.; Kong, C.; Li, J.; et al. HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs. arXiv 2023, arXiv:2311.09774. [Google Scholar]
- Toma, A.; Lawler, P.R.; Ba, J.; Krishnan, R.G.; Rubin, B.B.; Wang, B. Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge encoding. CoRR 2023. [Google Scholar]
- Xiong, H.; Wang, S.; Zhu, Y.; Zhao, Z.; Liu, Y.; Wang, Q.; Shen, D. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv 2023, arXiv:2304.01097. [Google Scholar]
- Chen, J.; Cai, Z.; Ji, K.; Wang, X.; Liu, W.; Wang, R.; Hou, J.; Wang, B. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs. arXiv 2024, arXiv:2412.18925. [Google Scholar]
- Yu, H.; Cheng, T.; Cheng, Y.; Feng, R. Finemedlm-o1: Enhancing the medical reasoning ability of llm from supervised fine-tuning to test-time training. arXiv 2025, arXiv:2501.09213. [Google Scholar]
- Han, T.; Adams, L.C.; Papaioannou, J.M.; Grundmann, P.; Oberhauser, T.; Löser, A.; Truhn, D.; Bressem, K.K. MedAlpaca–an open-source collection of medical conversational AI models and training data. arXiv 2023, arXiv:2304.08247. [Google Scholar]
- Chen, Z.; Cano, A.H.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv 2023, arXiv:2311.16079. [Google Scholar] [CrossRef]
- Labrak, Y.; Bazoge, A.; Morin, E.; Gourraud, P.A.; Rouvier, M.; Dufour, R. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv 2024, arXiv:2402.10373. [Google Scholar]
- Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar]
- Yasunaga, M.; Leskovec, J.; Liang, P. LinkBERT: Pretraining Language Models with Document Links. In Proceedings of the Association for Computational Linguistics (ACL); 2022. [Google Scholar]
- Peng, Y.; Yan, S.; Lu, Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019), 2019; pp. 58–65. [Google Scholar]
- Nori, H.; King, N.; McKinney, S.M.; Carignan, D.; Horvitz, E. Capabilities of gpt-4 on medical challenge problems. arXiv 2023, arXiv:2303.13375. [Google Scholar] [CrossRef]
- Tran, H.; Yang, Z.; Yao, Z.; Yu, H. BioInstruct: Instruction tuning of large language models for biomedical natural language processing. arXiv 2023, arXiv:2310.19975. [Google Scholar] [CrossRef]
- Lu, Q.; Dou, D.; Nguyen, T. ClinicalT5: A generative language model for clinical text. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022; 2022; pp. 5436–5443. [Google Scholar]
- Phan, L.N.; Anibal, J.T.; Tran, H.; Chanana, S.; Bahadroglu, E.; Peltekian, A.; Altan-Bonnet, G. Scifive: a text-to-text transformer model for biomedical literature. arXiv 2021, arXiv:2106.03598. [Google Scholar]
- Schwieger, A.; Angst, K.; De Bardeci, M.; Burrer, A.; Cathomas, F.; Ferrea, S.; Grätz, F.; Knorr, M.; Kronenberg, G.; Spiller, T.; et al. Large language models can support generation of standardized discharge summaries–a retrospective study utilizing ChatGPT-4 and electronic health records. International Journal of Medical Informatics 2024, 192, 105654. [Google Scholar] [CrossRef]
- Zaretsky, J.; Kim, J.M.; Baskharoun, S.; Zhao, Y.; Austrian, J.; Aphinyanaphongs, Y.; Gupta, R.; Blecker, S.B.; Feldman, J. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. JAMA network open 2024, 7, e240357–e240357. [Google Scholar] [CrossRef]
- Xie, Q.; Chen, Q.; Chen, A.; Peng, C.; Hu, Y.; Lin, F.; Peng, X.; Huang, J.; Zhang, J.; Keloth, V.; et al. Me-llama: Foundation large language models for medical applications. Research square 2024, rs–3. [Google Scholar]
- Li, Y.; Rao, S.; Solares, J.R.A.; Hassaine, A.; Ramakrishnan, R.; Canoy, D.; Zhu, Y.; Rahimi, K.; Salimi-Khorshidi, G. BEHRT: transformer for electronic health records. Scientific reports 2020, 10, 7155. [Google Scholar] [CrossRef]
- Rasmy, L.; Xiang, Y.; Xie, Z.; Tao, C.; Zhi, D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine 2021, 4, 86. [Google Scholar] [CrossRef]
- Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Costa, A.B.; Flores, M.G.; et al. A large language model for electronic health records. NPJ digital medicine 2022, 5, 194. [Google Scholar] [CrossRef]
- Xu, M.; Zhao, X.; Wang, J.; Feng, W.; Wen, N.; Wang, C.; Wang, J.; Liu, Y.; Zhao, L. DFFNDDS: prediction of synergistic drug combinations with dual feature fusion networks. Journal of Cheminformatics 2023, 15, 33. [Google Scholar] [CrossRef]
- Li, T.; Shetty, S.; Kamath, A.; Jaiswal, A.; Jiang, X.; Ding, Y.; Kim, Y. CancerGPT for few shot drug pair synergy prediction using large pretrained language models. NPJ Digital Medicine 2024, 7, 40. [Google Scholar] [CrossRef]
- Edwards, C.; Naik, A.; Khot, T.; Burke, M.; Ji, H.; Hope, T. Synergpt: In-context learning for personalized drug synergy prediction and drug design. arXiv 2023, arXiv:2307.11694. [Google Scholar]
- Liu, T.; Chu, T.; Luo, X.; Zhao, H. BAITSAO: Building A Foundation Model for Drug Synergy Analysis Powered by Language Models. bioRxiv 2024. [Google Scholar] [CrossRef]
- Alley, E.C.; Khimulya, G.; Biswas, S.; AlQuraishi, M.; Church, G.M. Unified rational protein engineering with sequence-based deep representation learning. Nature methods 2019, 16, 1315–1322. [Google Scholar] [CrossRef] [PubMed]
- Bepler, T.; Berger, B. Learning protein sequence embeddings using information from structure. arXiv 2019, arXiv:1902.08661. [Google Scholar] [CrossRef]
- Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences 2021, 118, e2016239118. [Google Scholar] [CrossRef] [PubMed]
- Meier, J.; Rao, R.; Verkuil, R.; Liu, J.; Sercu, T.; Rives, A. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in neural information processing systems 2021, 34, 29287–29303. [Google Scholar]
- Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; dos Santos Costa, A.; Fazel-Zarandi, M.; Sercu, T.; et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 2022. [Google Scholar] [CrossRef]
- Su, J.; Han, C.; Zhou, Y.; Shan, J.; Zhou, X.; Yuan, F. Saprot: Protein language modeling with structure-aware vocabulary. bioRxiv 2023, 2023–10. [Google Scholar] [CrossRef]
- Zhang, Z.; Wang, C.; Xu, M.; Chenthamarakshan, V.; Lozano, A.; Das, P.; Tang, J. A Systematic Study of Joint Representation Learning on Protein Sequences and Structures. arXiv 2023, arXiv:2303.06275. [Google Scholar] [CrossRef]
- Zhang, N.; Bi, Z.; Liang, X.; Cheng, S.; Hong, H.; Deng, S.; Lian, J.; Zhang, Q.; Chen, H. Ontoprotein: Protein pretraining with gene ontology embedding. arXiv 2022, arXiv:2201.11147. [Google Scholar] [CrossRef]
- Wu, K.E.; Chang, H.; Zou, J. ProteinCLIP: enhancing protein language models with natural language. bioRxiv 2024, 2024–05. [Google Scholar] [CrossRef]
- Ferruz, N.; Schmidt, S.; Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nature communications 2022, 13, 4348. [Google Scholar] [CrossRef]
- Lv, L.; Lin, Z.; Li, H.; Liu, Y.; Cui, J.; Chen, C.Y.C.; Yuan, L.; Tian, Y. ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing. arXiv 2024, arXiv:2402.16445. [Google Scholar] [CrossRef]
- Dai, F.; You, S.; Wang, C.; Fan, Y.; Su, J.; Han, C.; Zhou, X.; Liu, J.; Qian, H.; Wang, S.; et al. Toward de novo protein design from natural language. bioRxiv 2024, 2024–08. [Google Scholar] [CrossRef]
- Elnaggar, A.; Essam, H.; Salah-Eldin, W.; Moustafa, W.; Elkerdawy, M.; Rochereau, C.; Rost, B. Ankh: Optimized protein language model unlocks general-purpose modelling. arXiv 2023, arXiv:2301.06568. [Google Scholar] [CrossRef]
- Liu, S.; Li, Y.; Li, Z.; Gitter, A.; Zhu, Y.; Lu, J.; Xu, Z.; Nie, W.; Ramanathan, A.; Xiao, C.; et al. A text-guided protein design framework. Nature Machine Intelligence 2025, 1–12. [Google Scholar] [CrossRef]
- Zhang, Q.; Chen, W.; Qin, M.; Wang, Y.; Pu, Z.; Ding, K.; Liu, Y.; Zhang, Q.; Li, D.; Li, X.; et al. Integrating protein language models and automatic biofoundry for enhanced protein evolution. Nature Communications 2025, 16, 1553. [Google Scholar] [CrossRef] [PubMed]
- Xiao, Y.; Sun, E.; Jin, Y.; Wang, Q.; Wang, W. Proteingpt: Multimodal llm for protein property prediction and structure understanding. arXiv 2024, arXiv:2408.11363. [Google Scholar] [CrossRef]
- Guo, H.; Huo, M.; Zhang, R.; Xie, P. Proteinchat: Towards achieving chatgpt-like functionalities on protein 3d structures. Authorea Preprints 2023. [Google Scholar]
- Mattick, J.; Amaral, P. The Human Genome. In RNA, the Epicenter of Genetic Information: A new understanding of molecular biology; CRC Press, 2022. [Google Scholar]
- Cheatham, T.E.; Kollman, P.A. Molecular dynamics simulations highlight the structural differences among DNA: DNA, RNA: RNA, and DNA: RNA hybrid duplexes. Journal of the American Chemical Society 1997, 119, 4805–4825. [Google Scholar] [CrossRef]
- Ozsolak, F.; Milos, P.M. RNA sequencing: advances, challenges and opportunities. Nature reviews genetics 2011, 12, 87–98. [Google Scholar] [CrossRef]
- Stark, R.; Grzelak, M.; Hadfield, J. RNA sequencing: the teenage years. Nature Reviews Genetics 2019, 20, 631–656. [Google Scholar] [CrossRef]
- Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods 2015, 12, 931–934. [Google Scholar] [CrossRef]
- Gage, P. A new algorithm for data compression. The C Users Journal 1994, 12, 23–38. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar] [CrossRef]
- Dao, T.; Fu, D.Y.; Ermon, S.; Rudra, A.; Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); 2022. [Google Scholar]
- RNAcentral Consortium. RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic acids research 2021, 49, D212–D220. [CrossRef]
- Rao, R.M.; Liu, J.; Verkuil, R.; Meier, J.; Canny, J.; Abbeel, P.; Sercu, T.; Rives, A. MSA transformer. In Proceedings of the International Conference on Machine Learning. PMLR; 2021; pp. 8844–8856. [Google Scholar]
- Haeussler, M.; Zweig, A.S.; Tyner, C.; Speir, M.L.; Rosenbloom, K.R.; Raney, B.J.; Lee, C.M.; Lee, B.T.; Hinrichs, A.S.; Gonzalez, J.N.; et al. The UCSC genome browser database: 2019 update. Nucleic acids research 2019, 47, D853–D858. [Google Scholar] [CrossRef]
- Lorenz, R.; Bernhart, S.H.; Höner zu Siederdissen, C.; Tafer, H.; Flamm, C.; Stadler, P.F.; Hofacker, I.L. ViennaRNA Package 2.0. Algorithms for molecular biology 2011, 6, 1–14. [Google Scholar] [CrossRef]
- Gruber, A.R.; Lorenz, R.; Bernhart, S.H.; Neuböck, R.; Hofacker, I.L. The Vienna RNA websuite. Nucleic acids research 2008, 36, W70–W74. [Google Scholar] [CrossRef]
- Cunningham, F.; Allen, J.E.; Allen, J.; Alvarez-Jarreta, J.; Amode, M.R.; Armean, I.M.; Austine-Orimoloye, O.; Azov, A.G.; Barnes, I.; Bennett, R.; et al. Ensembl 2022. Nucleic acids research 2022, 50, D988–D995. [Google Scholar] [CrossRef]
- Sample, P.J.; Wang, B.; Reid, D.W.; Presnyak, V.; McFadyen, I.J.; Morris, D.R.; Seelig, G. Human 5’UTR design and variant effect prediction from a massively parallel translation assay. Nature biotechnology 2019, 37, 803–809. [Google Scholar] [CrossRef]
- Cao, J.; Novoa, E.M.; Zhang, Z.; Chen, W.C.; Liu, D.; Choi, G.C.; Wong, A.S.; Wehrspaun, C.; Kellis, M.; Lu, T.K. High-throughput 5’UTR engineering for enhanced protein production in non-viral gene therapies. Nature communications 2021, 12, 4138. [Google Scholar] [CrossRef]
- Johnson, A.E.; Pollard, T.J.; Shen, L.; Lehman, L.w.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Scientific data 2016, 3, 1–9. [Google Scholar] [CrossRef]
- Johnson, A.E.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific data 2023, 10, 1. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv 2023, arXiv:1910.10683. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
- Krithara, A.; Nentidis, A.; Bougiatiotis, K.; Paliouras, G. BioASQ-QA: A manually curated corpus for Biomedical Question Answering. Scientific Data 2023, 10, 170. [Google Scholar] [CrossRef]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar] [CrossRef]
- Sun, Y.; Wang, X.; Liu, Z.; Miller, J.; Efros, A.; Hardt, M. Test-time training with self-supervision for generalization under distribution shifts. In Proceedings of the International conference on machine learning. PMLR; 2020; pp. 9229–9248. [Google Scholar]
- Romanov, A.; Shivade, C. Lessons from Natural Language Inference in the Clinical Domain. arXiv 2018, arXiv:1808.06752. [Google Scholar] [CrossRef]
- Preuer, K.; Lewis, R.P.; Hochreiter, S.; Bender, A.; Bulusu, K.C.; Klambauer, G. DeepSynergy: predicting anti-cancer drug synergy with Deep Learning. Bioinformatics 2018, 34, 1538–1546. [Google Scholar] [CrossRef]
- Grešová, K.; Martinek, V.; Čechák, D.; Šimeček, P.; Alexiou, P. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data 2023, 24, 25. [Google Scholar] [CrossRef]
- Runge, F.; Farid, K.; Franke, J.K.; Hutter, F. Rnabench: A comprehensive library for in silico rna modelling. bioRxiv 2024, 2024–01. [Google Scholar] [CrossRef]
- Ren, Y.; Chen, Z.; Qiao, L.; Jing, H.; Cai, Y.; Xu, S.; Ye, P.; Ma, X.; Sun, S.; Yan, H.; et al. Beacon: Benchmark for comprehensive rna tasks and language models. Advances in Neural Information Processing Systems 2024, 37, 92891–92921. [Google Scholar]
- Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 2016, 23, 304–310. [Google Scholar] [CrossRef]
- Wornow, M.; Thapa, R.; Steinberg, E.; Fries, J.; Shah, N. Ehrshot: An ehr benchmark for few-shot evaluation of foundation models. Advances in Neural Information Processing Systems 2023, 36, 67125–67137. [Google Scholar]
- Kweon, S.; Kim, J.; Kwak, H.; Cha, D.; Yoon, H.; Kim, K.; Yang, J.; Won, S.; Choi, E. Ehrnoteqa: An llm benchmark for real-world clinical practice using discharge summaries. Advances in Neural Information Processing Systems 2024, 37, 124575–124611. [Google Scholar]
- Jin, D.; Pan, E.; Oufattole, N.; Weng, W.H.; Fang, H.; Szolovits, P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. arXiv 2020, arXiv:2009.13081. [Google Scholar] [CrossRef]
- Pal, A.; Umapathi, L.K.; Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on health, inference, and learning. PMLR; 2022; pp. 248–260. [Google Scholar]
- Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.; Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Inui, K.; Jiang, J.; Ng, V.; Wan, X., Eds.; Hong Kong, China, 2019; pp. 2567–2577. [CrossRef]
- Doğan, R.I.; Leaman, R.; Lu, Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics 2014, 47, 1–10. [Google Scholar] [PubMed]
- Segura-Bedmar, I.; Martínez, P.; Herrero-Zazo, M. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013; pp. 341–350. [Google Scholar]
- Becker, K.G.; Barnes, K.C.; Bright, T.J.; Wang, S.A. The genetic association database. Nature genetics 2004, 36, 431–432. [Google Scholar] [CrossRef]
- Baker, S.; Silins, I.; Guo, Y.; Ali, I.; Högberg, J.; Stenius, U.; Korhonen, A. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics 2016, 32, 432–440. [Google Scholar] [CrossRef]
- Xu, M.; Zhang, Z.; Lu, J.; Zhu, Z.; Zhang, Y.; Chang, M.; Liu, R.; Tang, J. Peer: a comprehensive and multi-task benchmark for protein sequence understanding. Advances in Neural Information Processing Systems 2022, 35, 35156–35173. [Google Scholar]
- Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, P.; Canny, J.; Abbeel, P.; Song, Y. Evaluating protein transfer learning with TAPE. Advances in neural information processing systems 2019, 32. [Google Scholar]
- Hie, B.; Zhong, E.D.; Berger, B.; Bryson, B. Learning the language of viral evolution and escape. Science 2021, 371, 284–288. [Google Scholar] [CrossRef] [PubMed]
- Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef] [PubMed]
- Hayes, T.; Rao, R.; Akin, H.; Sofroniew, N.J.; Oktay, D.; Lin, Z.; Verkuil, R.; Tran, V.Q.; Deaton, J.; Wiggert, M.; et al. Simulating 500 million years of evolution with a language model. bioRxiv 2024. [Google Scholar] [CrossRef]
- Johnson, B.A.; Blevins, R.A. NMR View: A computer program for the visualization and analysis of NMR data. Journal of biomolecular NMR 1994, 4, 603–614. [Google Scholar]
- Jares-Erijman, E.A.; Jovin, T.M. FRET imaging. Nature biotechnology 2003, 21, 1387–1395. [Google Scholar] [CrossRef]
- Bai, X.C.; McMullan, G.; Scheres, S.H. How cryo-EM is revolutionizing structural biology. Trends in biochemical sciences 2015, 40, 49–57. [Google Scholar] [CrossRef]
- Kikhney, A.G.; Svergun, D.I. A practical guide to small angle X-ray scattering (SAXS) of flexible and intrinsically disordered proteins. FEBS letters 2015, 589, 2570–2577. [Google Scholar] [CrossRef]
- Dhanasekaran, R.; Deutzmann, A.; Mahauad-Fernandez, W.D.; Hansen, A.S.; Gouw, A.M.; Felsher, D.W. The MYC oncogene—the grand orchestrator of cancer growth and immune evasion. Nature reviews Clinical oncology 2022, 19, 23–36. [Google Scholar] [CrossRef]
- Roayaei Ardakany, A.; Gezer, H.T.; Lonardi, S.; Ay, F. Mustache: multi-scale detection of chromatin loops from Hi-C and Micro-C maps using scale-space representation. Genome biology 2020, 21, 1–17. [Google Scholar] [CrossRef]
- Huang, K.; Qu, Y.; Cousins, H.; Johnson, W.A.; Yin, D.; Shah, M.; Zhou, D.; Altman, R.; Wang, M.; Cong, L. Crispr-gpt: An llm agent for automated design of gene-editing experiments. arXiv 2024, arXiv:2404.18021. [Google Scholar]
- Rathore, A.S.; Choudhury, S.; Arora, A.; Tijare, P.; Raghava, G.P. ToxinPred 3.0: An improved method for predicting the toxicity of peptides. Computers in Biology and Medicine 2024, 179, 108926. [Google Scholar] [CrossRef]
- Sharma, N.; Naorem, L.D.; Jain, S.; Raghava, G.P. ToxinPred2: an improved method for predicting toxicity of proteins. Briefings in bioinformatics 2022, 23, bbac174. [Google Scholar]
- Fu, Y.; Ma, Y.; Zhong, L.; Yang, Y.; Guo, X.; Wang, C.; Xu, X.; Yang, K.; Xu, X.; Liu, L.; et al. EARTH SCIENCES 2001.
- National Research Council. Basic Research Opportunities in Earth Science; National Academies Press, 2001. [Google Scholar]
- National Research Council. Earth Science and Applications from Space: National Imperatives for the Next Decade and Beyond; National Academies Press, 2007. [Google Scholar]
- Von Engelhardt, W.; Zimmermann, J. Theory of earth science 1988.
- Keum, H.J.; Han, K.Y.; Kim, H.I. Real-time flood disaster prediction system by applying machine learning technique. KSCE Journal of Civil Engineering 2020, 24, 2835–2848. [Google Scholar] [CrossRef]
- Ge, X.; Yang, Y.; Chen, J.; Li, W.; Huang, Z.; Zhang, W.; Peng, L. Disaster prediction knowledge graph based on multi-source spatio-temporal information. Remote Sensing 2022, 14, 1214. [Google Scholar] [CrossRef]
- Neelin, J.D. Climate change and climate modeling; Cambridge University Press, 2010. [Google Scholar]
- Liu, H.F.; Luo, Z.C.; Hu, Z.K.; Yang, S.Q.; Tu, L.C.; Zhou, Z.B.; Kraft, M. A review of high-performance MEMS sensors for resource exploration and geophysical applications. Petroleum Science 2022, 19, 2631–2648. [Google Scholar] [CrossRef]
- Esty, D.C. Environmental protection in the information age. NYUL Rev. 2004, 79, 115. [Google Scholar] [CrossRef]
- Li, Z.; Meier, M.A.; Hauksson, E.; Zhan, Z.; Andrews, J. Machine learning seismic wave discrimination: Application to earthquake early warning. Geophysical Research Letters 2018, 45, 4773–4779. [Google Scholar] [CrossRef]
- Waters, C.N.; Zalasiewicz, J.; Summerhayes, C.; Barnosky, A.D.; Poirier, C.; Gałuszka, A.; Cearreta, A.; Edgeworth, M.; Ellis, E.C.; Ellis, M.; et al. The Anthropocene is functionally and stratigraphically distinct from the Holocene. Science 2016, 351, aad2622. [Google Scholar] [CrossRef] [PubMed]
- Fowler, C.M.R. The solid earth: an introduction to global geophysics; Cambridge University Press, 1990. [Google Scholar]
- Hoff, H.; Falkenmark, M.; Gerten, D.; Gordon, L.; Karlberg, L.; Rockström, J. Greening the global water system. Journal of Hydrology 2010, 384, 177–186. [Google Scholar] [CrossRef]
- Böhme, G. Atmosphere. Online Encyclopedia Philosophy of Nature 2021. [Google Scholar]
- Tóth, G.; Sokolov, I.V.; Gombosi, T.I.; Chesney, D.R.; Clauer, C.R.; De Zeeuw, D.L.; Hansen, K.C.; Kane, K.J.; Manchester, W.B.; Oehmke, R.C.; et al. Space Weather Modeling Framework: A new tool for the space science community. Journal of Geophysical Research: Space Physics 2005, 110. [Google Scholar] [CrossRef]
- Hornak, J.P.; Szumowski, J.; Bryant, R.G. Magnetic field mapping. Magnetic resonance in medicine 1988, 6, 158–163. [Google Scholar] [CrossRef]
- Flick, U. Mapping the field. The SAGE handbook of qualitative data analysis 2014, 1, 3–18. [Google Scholar]
- Wandell, B.A.; Dumoulin, S.O.; Brewer, A.A. Visual field maps in human cortex. Neuron 2007, 56, 366–383. [Google Scholar] [CrossRef]
- Baecher, G.; Lanney, N.; Einstein, H. Statistical description of rock properties and sampling. In Proceedings of the ARMA US Rock Mechanics/Geomechanics Symposium; ARMA, 1977. [Google Scholar]
- Scales, J.A. Theory of seismic imaging; Vol. 2, Springer-Verlag Berlin, 1995.
- Biondi, B.L. 3D seismic imaging; Society of Exploration Geophysicists, 2006. [Google Scholar]
- Olsson, I.U. Radiometric dating. In Handbook of Holocene palaeoecology and palaeohydrology; 1986. [Google Scholar]
- Falvey, D.A. The development of continental margins in plate tectonic theory. The APPEA Journal 1974, 14, 95–106. [Google Scholar] [CrossRef]
- Sun, K.; Cui, W.; Chen, C. Review of underwater sensing technologies and applications. Sensors 2021, 21, 7849. [Google Scholar] [CrossRef]
- Purser, A.; Marcon, Y.; Dreutter, S.; Hoge, U.; Sablotny, B.; Hehemann, L.; Lemburg, J.; Dorschel, B.; Biebow, H.; Boetius, A. Ocean Floor Observation and Bathymetry System (OFOBS): a new towed camera/sonar system for deep-sea habitat surveys. IEEE Journal of Oceanic Engineering 2018, 44, 87–99. [Google Scholar] [CrossRef]
- Busby, R.F. Manned submersibles; Office of the Oceanographer of the Navy, 1976. [Google Scholar]
- Taylor, R.G.; Scanlon, B.; Döll, P.; Rodell, M.; Van Beek, R.; Wada, Y.; Longuevergne, L.; Leblanc, M.; Famiglietti, J.S.; Edmunds, M.; et al. Ground water and climate change. Nature climate change 2013, 3, 322–329. [Google Scholar] [CrossRef]
- Larom, D.; Garstang, M.; Payne, K.; Raspet, R.; Lindeque, M. The influence of surface atmospheric conditions on the range and area reached by animal vocalizations. Journal of experimental biology 1997, 200, 421–431. [Google Scholar] [CrossRef]
- McGuffie, K.; Henderson-Sellers, A. Forty years of numerical climate modelling. International Journal of Climatology: A Journal of the Royal Meteorological Society 2001, 21, 1067–1109. [Google Scholar] [CrossRef]
- Robin, G.d.Q. Ice cores and climatic change. Philosophical Transactions of the Royal Society of London. B, Biological Sciences 1977, 280, 143–168. [Google Scholar] [CrossRef]
- Tei, S.; Sugimoto, A.; Yonenobu, H.; Matsuura, Y.; Osawa, A.; Sato, H.; Fujinuma, J.; Maximov, T. Tree-ring analysis and modeling approaches yield contrary response of circumboreal forest productivity to climate change. Global Change Biology 2017, 23, 5179–5188. [Google Scholar] [CrossRef]
- Hou, P.; Wu, S. Long-term changes in extreme air pollution meteorology and the implications for air quality. Scientific reports 2016, 6, 23792. [Google Scholar] [CrossRef]
- He, C.; Kumar, R.; Tang, W.; Pfister, G.; Xu, Y.; Qian, Y.; Brasseur, G. Air pollution interactions with weather and climate extremes: Current knowledge, gaps, and future directions. Current Pollution Reports 2024, 10, 430–442. [Google Scholar] [CrossRef]
- Rast, M.; Painter, T.H. Earth observation imaging spectroscopy for terrestrial systems: An overview of its history, techniques, and applications of its missions. Surveys in Geophysics 2019, 40, 303–331. [Google Scholar] [CrossRef]
- Rossi, A.P.; Van Gasselt, S. Planetary geology; Springer, 2018. [Google Scholar]
- Horton, R.M.; De Sherbinin, A.; Wrathall, D.; Oppenheimer, M. Assessing human habitability and migration. Science 2021, 372, 1279–1283. [Google Scholar] [CrossRef]
- Murray, C.D.; Dermott, S.F. Solar system dynamics; Cambridge university press, 1999. [Google Scholar]
- Zavala, G.R.; Nebro, A.J.; Luna, F.; Coello Coello, C.A. A survey of multi-objective metaheuristics applied to structural optimization. Structural and Multidisciplinary Optimization 2014, 49, 537–558. [Google Scholar] [CrossRef]
- Mei, L.; Wang, Q. Structural Optimization in Civil Engineering: A Literature Review. Buildings 2021, 11. [Google Scholar] [CrossRef]
- Wikipedia. Civil Engineering. Wikipedia, The Free Encyclopedia 2025. [Accessed: 2025-04-18].
- Harle, S.M. Advancements and challenges in the application of artificial intelligence in civil engineering: a comprehensive review. Asian Journal of Civil Engineering 2024, 25, 1061–1078. [Google Scholar]
- Vadyala, S.R.; Betgeri, S.N.; Matthews, J.C.; Matthews, E. A review of physics-based machine learning in civil engineering. Results in Engineering 2022, 13, 100316. [Google Scholar] [CrossRef]
- Tsiptsis, I.N.; Liimatainen, L.; Kotnik, T.; Niiranen, J. Structural optimization employing isogeometric tools in Particle Swarm Optimizer. Journal of Building Engineering 2019, 24, 100761. [Google Scholar] [CrossRef]
- Kuhn, H.W.; Tucker, A.W. Nonlinear programming. In Traces and emergence of nonlinear programming; Springer, 2013; pp. 247–258. [Google Scholar]
- Aldwaik, M.; Adeli, H. Advances in optimization of highrise building structures. Structural and Multidisciplinary Optimization 2014, 50, 899–919. [Google Scholar] [CrossRef]
- Saka, M.P.; Hasançebi, O.; Geem, Z.W. Metaheuristics in structural optimization and discussions on harmony search algorithm. Swarm and Evolutionary Computation 2016, 28, 88–97. [Google Scholar] [CrossRef]
- Sörensen, K. Metaheuristics—the metaphor exposed. International Transactions in Operational Research 2015, 22, 3–18. [Google Scholar] [CrossRef]
- Mahdavi, S.; Shiri, M.E.; Rahnamayan, S. Metaheuristics in large-scale global continues optimization: A survey. Information Sciences 2015, 295, 407–428. [Google Scholar] [CrossRef]
- Mortazavi, A. A new fuzzy strategy for size and topology optimization of truss structures. Applied Soft Computing 2020, 93, 106412. [Google Scholar] [CrossRef]
- Zheng, S.; Tang, W.; Li, B. A new topology optimization framework for stiffness design of beam structures based on the transformable triangular mesh algorithm. Thin-Walled Structures 2020, 154, 106831. [Google Scholar] [CrossRef]
- Tokognon, C.A.; Gao, B.; Tian, G.Y.; Yan, Y. Structural health monitoring framework based on Internet of Things: A survey. IEEE Internet of Things Journal 2017, 4, 619–635. [Google Scholar] [CrossRef]
- Gharehbaghi, V.R.; Noroozinejad Farsangi, E.; Noori, M.; Yang, T.; Li, S.; Nguyen, A.; Málaga-Chuquitaype, C.; Gardoni, P.; Mirjalili, S. A critical review on structural health monitoring: Definitions, methods, and perspectives. Archives of computational methods in engineering 2022, 29, 2209–2235. [Google Scholar] [CrossRef]
- Hodge, V.J.; O’Keefe, S.; Weeks, M.; Moulds, A. Wireless sensor networks for condition monitoring in the railway industry: A survey. IEEE Transactions on intelligent transportation systems 2014, 16, 1088–1106. [Google Scholar] [CrossRef]
- Cai, J.; Qiu, L.; Yuan, S.; Shi, L.; Liu, P.; Liang, D. Structural health monitoring for composite materials. In Composites and their applications; IntechOpen, 2012. [Google Scholar]
- Huseynov, F.; Kim, C.; Obrien, E.J.; Brownjohn, J.; Hester, D.; Chang, K. Bridge damage detection using rotation measurements–Experimental validation. Mechanical Systems and Signal Processing 2020, 135, 106380. [Google Scholar] [CrossRef]
- Gomes, G.F.; Mendez, Y.A.D.; da Silva Lopes Alexandrino, P.; da Cunha, S.S.; Ancelotti, A.C. A review of vibration based inverse methods for damage detection and identification in mechanical structures using optimization algorithms and ANN. Archives of computational methods in engineering 2019, 26, 883–897. [Google Scholar] [CrossRef]
- Jamali, S.; Chan, T.H.; Nguyen, A.; Thambiratnam, D.P. Reliability-based load-carrying capacity assessment of bridges using structural health monitoring and nonlinear analysis. Structural Health Monitoring 2019, 18, 20–34. [Google Scholar] [CrossRef]
- Babajanian Bisheh, H.; Ghodrati Amiri, G.; Nekooei, M.; Darvishan, E. Damage detection of a cable-stayed bridge using feature extraction and selection methods. Structure and Infrastructure Engineering 2019, 15, 1165–1177. [Google Scholar] [CrossRef]
- Sari, Y.; Prakoso, P.B.; Baskara, A.R. Road crack detection using support vector machine (SVM) and OTSU algorithm. In Proceedings of the 2019 6th International Conference on Electric Vehicular Technology (ICEVT). IEEE; 2019; pp. 349–354. [Google Scholar]
- Sarrafi, A.; Mao, Z.; Niezrecki, C.; Poozesh, P. Vibration-based damage detection in wind turbine blades using Phase-based Motion Estimation and motion magnification. Journal of Sound and vibration 2018, 421, 300–318. [Google Scholar] [CrossRef]
- Breysse, D. Nondestructive evaluation of concrete strength: An historical review and a new perspective by combining NDT methods. Construction and Building Materials 2012, 33, 139–163. [Google Scholar] [CrossRef]
- Helal, J.; Sofi, M.; Mendis, P. Non-destructive testing of concrete: A review of methods. Electronic Journal of Structural Engineering 2015, 14, 97–105. [Google Scholar] [CrossRef]
- Mitchell, J.K.; Soga, K. Fundamentals of soil behavior, 3rd ed.; John Wiley & Sons: New York, 2005.
- Lou, M.; Wang, H.; Chen, X.; Zhai, Y. Structure–soil–structure interaction: Literature review. Soil dynamics and earthquake engineering 2011, 31, 1724–1731. [Google Scholar] [CrossRef]
- David Müzel, S.; Bonhin, E.P.; Guimarães, N.M.; Guidi, E.S. Application of the Finite Element Method in the Analysis of Composite Materials: A Review. Polymers 2020, 12. [Google Scholar] [CrossRef] [PubMed]
- Matthews, F.L.; Davies, G.; Hitchings, D.; Soutis, C. Finite element modelling of composite materials and structures; Elsevier, 2000. [Google Scholar]
- Yaseen, Z.M.; El-shafie, A.; Jaafar, O.; Afan, H.A.; Sayl, K.N. Artificial intelligence based models for stream-flow forecasting: 2000–2015. Journal of Hydrology 2015, 530, 829–844. [Google Scholar] [CrossRef]
- Lana, I.; Del Ser, J.; Velez, M.; Vlahogianni, E.I. Road Traffic Forecasting: Recent Advances and New Challenges. IEEE Intelligent Transportation Systems Magazine 2018, 10, 93–109. [Google Scholar] [CrossRef]
- Liang, B.; Liu, J.; You, J.; Jia, J.; Pan, Y.; Jeong, H. Hydrocarbon production dynamics forecasting using machine learning: A state-of-the-art review. Fuel 2023, 337, 127067. [Google Scholar] [CrossRef]
- Jiang, W.; Luo, J. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications 2022, 207, 117921. [Google Scholar] [CrossRef]
- Belbute-Peres, F.D.A.; Economon, T.; Kolter, Z. Combining differentiable PDE solvers and graph neural networks for fluid flow prediction. In Proceedings of the international conference on machine learning. PMLR; 2020; pp. 2402–2411. [Google Scholar]
- Li, Z.; Ning, H. Autonomous GIS: the next-generation AI-powered GIS. International Journal of Digital Earth 2023, 16, 4668–4686. [Google Scholar] [CrossRef]
- Lin, Z.; Deng, C.; Zhou, L.; Zhang, T.; Xu, Y.; Xu, Y.; He, Z.; Shi, Y.; Dai, B.; Song, Y.; et al. Geogalactica: A scientific large language model in geoscience. arXiv 2023, arXiv:2401.00434. [Google Scholar]
- Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Liu, Y.; Li, X. Rsgpt: A remote sensing vision language model and benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 2025, 224, 272–286. [Google Scholar] [CrossRef]
- Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Ricci, R.; Melgani, F. Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery. Remote Sensing 2024, 16, 1477. [Google Scholar] [CrossRef]
- Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing 2024. [Google Scholar] [CrossRef]
- Zhang, Y.; Wei, C.; Wu, S.; He, Z.; Yu, W. Geogpt: Understanding and processing geospatial tasks through an autonomous gpt. arXiv 2023, arXiv:2307.07930. [Google Scholar] [CrossRef]
- Singh, S.; Fore, M.; Stamoulis, D. Geollm-engine: A realistic environment for building geospatial copilots. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 585–594.
- Chen, Y.; Wang, W.; Lobry, S.; Kurtz, C. An llm agent for automatic geospatial data analysis. arXiv 2024, arXiv:2410.18792. [Google Scholar] [CrossRef]
- Hou, S.; Jiao, H.; Shen, Z.; Liang, J.; Zhao, A.; Zhang, X.; Wang, J.; Wu, H. Chain-of-Programming (CoP): Empowering Large Language Models for Geospatial Code Generation. arXiv 2024, arXiv:2411.10753. [Google Scholar] [CrossRef]
- Jiang, G.; Ma, Z.; Zhang, L.; Chen, J. EPlus-LLM: A large language model-based computing platform for automated building energy modeling. Applied Energy 2024, 367, 123431. [Google Scholar] [CrossRef]
- Zhang, L.; Chen, Z.; Ford, V. Advancing building energy modeling with large language models: Exploration and case studies. Energy and Buildings 2024, 323, 114788. [Google Scholar] [CrossRef]
- Pursnani, V.; Ramirez, C.E.; Sermet, M.Y.; Demir, I. HydroSuite-AI: Facilitating Hydrological Research with LLM-Driven Code Assistance. 2024. [Google Scholar] [CrossRef]
- Bekele, Y.W. GeoSim.AI: AI assistants for numerical simulations in geomechanics. arXiv 2025, arXiv:2501.14186. [Google Scholar]
- Song, J.; Yoon, S. Ontology-assisted GPT-based building performance simulation and assessment: Implementation of multizone airflow simulation. Energy and Buildings 2024, 325, 114983. [Google Scholar] [CrossRef]
- Li, S.; Azfar, T.; Ke, R. ChatSUMO: Large Language Model for Automating Traffic Scenario Generation in Simulation of Urban MObility. IEEE Transactions on Intelligent Vehicles 2024, 1–12. [Google Scholar] [CrossRef]
- Denli, H.; Chughtai, H.A.; Hughes, B.; Gistri, R.; Xu, P. Geoscience language processing for exploration. In Proceedings of the Abu Dhabi International Petroleum Exhibition and Conference. SPE; 2021; p. D031S102R003. [Google Scholar]
- Bi, Z.; Zhang, N.; Xue, Y.; Ou, Y.; Ji, D.; Zheng, G.; Chen, H. Oceangpt: A large language model for ocean science tasks. arXiv 2023, arXiv:2310.02031. [Google Scholar] [CrossRef]
- Dong, T.; Subia-Waud, C.; Hou, S. Geo-RAG: Gaining insights from unstructured geological documents with large language models. In Proceedings of the Fourth EAGE Digitalization Conference & Exhibition; European Association of Geoscientists & Engineers, 2024; Vol. 2024, pp. 1–4. [Google Scholar]
- Zheng, Z.; Chen, K.Y.; Cao, X.Y.; Lu, X.Z.; Lin, J.R. Llm-funcmapper: Function identification for interpreting complex clauses in building codes via llm. arXiv 2023, arXiv:2308.08728. [Google Scholar] [CrossRef]
- Chen, N.; Lin, X.; Jiang, H.; An, Y. Automated Building Information Modeling Compliance Check through a Large Language Model Combined with Deep Learning and Ontology. Buildings 2024, 14. [Google Scholar] [CrossRef]
- Wan, H.; Zhang, J.; Chen, Y.; Xu, W.; Feng, F. Generative AI Application for Building Industry. arXiv 2024, arXiv:2410.01098. [Google Scholar] [CrossRef]
- Xiao, T.; Xu, P. Exploring automated energy optimization with unstructured building data: A multi-agent based framework leveraging large language models. Energy and Buildings 2024, 322, 114691. [Google Scholar] [CrossRef]
- Forth, K.; Borrmann, A. Semantic enrichment for BIM-based building energy performance simulations using semantic textual similarity and fine-tuning multilingual LLM. Journal of Building Engineering 2024, 95, 110312. [Google Scholar] [CrossRef]
- Qin, B.; Pan, H.; Dai, Y.; Si, X.; Huang, X.; Yuen, C.; Zhang, Y. Machine and Deep Learning for Digital Twin Networks: A Survey. IEEE Internet of Things Journal 2024, 11, 34694–34716. [Google Scholar] [CrossRef]
- Jia, F.; Fonsati, A.; Gudmundsson, K. Natural Language Communication with Sensor Data Through a LLM-Integrated Protocol: A Case Study. In Proceedings of the International Conference on Computing in Civil and Building Engineering. Springer; 2024; pp. 64–75. [Google Scholar]
- Yang, H.; Siew, M.; Joe-Wong, C. An LLM-Based Digital Twin for Optimizing Human-in-the-Loop Systems. In Proceedings of the 2024 IEEE International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things (FMSys), 2024; pp. 26–31. [CrossRef]
- Ferdousi, R.; Hossain, M.A.; Yang, C.; Saddik, A.E. Defecttwin: When llm meets digital twin for railway defect inspection. arXiv 2024, arXiv:2409.06725. [Google Scholar] [CrossRef]
- Syed, T.A.; Muhammad, M.A.; AlShahrani, A.A.; Hammad, M.; Naqash, M.T. Smart Water Management with Digital Twins and Multimodal Transformers: A Predictive Approach to Usage and Leakage Detection. Water 2024, 16, 3410. [Google Scholar] [CrossRef]
- Qin, S.; Guan, H.; Liao, W.; Gu, Y.; Zheng, Z.; Xue, H.; Lu, X. Intelligent design and optimization system for shear wall structures based on large language models and generative artificial intelligence. Journal of Building Engineering 2024, 95, 109996. [Google Scholar] [CrossRef]
- Jang, S.; Lee, G.; Oh, J.; Lee, J.; Koo, B. Automated detailing of exterior walls using NADIA: Natural-language-based architectural detailing through interaction with AI. Advanced Engineering Informatics 2024, 61, 102532. [Google Scholar] [CrossRef]
- Zhu, H.; Zhang, W.; Huang, N.; Li, B.; Niu, L.; Fan, Z.; Lun, T.; Tao, Y.; Su, J.; Gong, Z.; et al. PlanGPT: Enhancing urban planning with tailored language model and efficient retrieval. arXiv 2024, arXiv:2402.19273. [Google Scholar] [CrossRef]
- Zhou, Z.; Lin, Y.; Jin, D.; Li, Y. Large language model for participatory urban planning. arXiv 2024, arXiv:2402.17161. [Google Scholar] [CrossRef]
- Zhang, S.; Fu, D.; Liang, W.; Zhang, Z.; Yu, B.; Cai, P.; Yao, B. Trafficgpt: Viewing, processing and interacting with traffic foundation models. Transport Policy 2024, 150, 95–105. [Google Scholar] [CrossRef]
- Da, L.; Liou, K.; Chen, T.; Zhou, X.; Luo, X.; Yang, Y.; Wei, H. Open-ti: Open traffic intelligence with augmented language model. International Journal of Machine Learning and Cybernetics 2024, 15, 4761–4786. [Google Scholar] [CrossRef]
- Deng, C.; Zhang, T.; He, Z.; Chen, Q.; Shi, Y.; Xu, Y.; Fu, L.; Zhang, W.; Wang, X.; Zhou, C.; et al. K2: A foundation language model for geoscience knowledge understanding and utilization. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024.
- Zhang, Y.; He, Z.; Li, J.; Lin, J.; Guan, Q.; Yu, W. MapGPT: an autonomous framework for mapping by integrating large language model and cartographic tools. Cartography and Geographic Information Science 2024, 51, 717–743. [Google Scholar] [CrossRef]
- Roberts, J.; Lüddecke, T.; Das, S.; Han, K.; Albanie, S. GPT4GEO: How a Language Model Sees the World’s Geography. arXiv 2023, arXiv:2306.00020. [Google Scholar] [CrossRef]
- Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 27831–27840.
- Liu, C.; Chen, K.; Zhang, H.; Qi, Z.; Zou, Z.; Shi, Z. Change-agent: Towards interactive comprehensive remote sensing change interpretation and analysis. IEEE Transactions on Geoscience and Remote Sensing 2024. [Google Scholar] [CrossRef]
- Ning, H.; Li, Z.; Akinboyewa, T.; Lessani, M.N. An autonomous GIS agent framework for geospatial data retrieval. International Journal of Digital Earth 2025, 18, 2458688. [Google Scholar] [CrossRef]
- Hou, S.; Shen, Z.; Zhao, A.; Liang, J.; Gui, Z.; Guan, X.; Li, R.; Wu, H. GeoCode-GPT: A large language model for geospatial code generation. International Journal of Applied Earth Observation and Geoinformation 2025, 104456. [Google Scholar] [CrossRef]
- Cherian, A.; Corcodel, R.; Jain, S.; Romeres, D. Llmphy: Complex physical reasoning using large language models and world models. arXiv 2024, arXiv:2411.08027. [Google Scholar] [CrossRef]
- Bhandari, P.; Anastasopoulos, A.; Pfoser, D. Are large language models geospatially knowledgeable? In Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems; 2023; pp. 1–4. [Google Scholar]
- Haas, L.; Alberti, S.; Skreta, M. Learning generalized zero-shot learners for open-domain image geolocalization. arXiv 2023, arXiv:2302.00275. [Google Scholar] [CrossRef]
- Mooney, P.; Cui, W.; Guan, B.; Juhász, L. Towards understanding the geospatial skills of chatgpt: Taking a geographic information systems (gis) exam. In Proceedings of the 6th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, 2023; pp. 85–94.
- Zhang, Y.; Wang, Z.; He, Z.; Li, J.; Mai, G.; Lin, J.; Wei, C.; Yu, W. BB-GeoGPT: A framework for learning a large language model for geographic information science. Information Processing & Management 2024, 61, 103808. [Google Scholar]
- Kumar, S.; Ehtesham, A.; Singh, A.; Khoei, T.T. Architectural Flaw Detection in Civil Engineering Using GPT-4. arXiv 2024, arXiv:2410.20036. [Google Scholar] [CrossRef]
- Liu, X.; Li, H.; Zhu, X. A GPT-based method of automated compliance checking through prompt engineering. 2023.
- Pu, H.; Yang, X.; Li, J.; Guo, R. AutoRepo: A general framework for multimodal LLM-based automated construction reporting. Expert Systems with Applications 2024, 255, 124601. [Google Scholar] [CrossRef]
- Yan, Y.; Wen, H.; Zhong, S.; Chen, W.; Chen, H.; Wen, Q.; Zimmermann, R.; Liang, Y. When urban region profiling meets large language models. arXiv 2023, arXiv:2310.18340. [Google Scholar] [CrossRef]
- Da, L.; Gao, M.; Mei, H.; Wei, H. Llm powered sim-to-real transfer for traffic signal control. arXiv 2023, arXiv:2308.14284. [Google Scholar] [CrossRef]
- Chen, X.; Zhang, L. Revolutionizing Bridge Operation and maintenance with LLM-based Agents: An Overview of Applications and Insights. arXiv 2024, arXiv:2407.10064. [Google Scholar] [CrossRef]
- Li, Y.; Ji, M.; Chen, J.; Wei, X.; Gu, X.; Tang, J. A large language model-based building operation and maintenance information query. Energy and Buildings 2025, 334, 115515. [Google Scholar] [CrossRef]
- Zhang, L.; Chen, Z. Large language model-based interpretable machine learning control in building energy systems. Energy and Buildings 2024, 313, 114278. [Google Scholar] [CrossRef]
- Zhang, C.; Zhang, J.; Zhao, Y.; Lu, J. Automated data mining framework for building energy conservation aided by generative pre-trained transformers (GPT). Energy and Buildings 2024, 305, 113877. [Google Scholar] [CrossRef]
- Hosamo, H.H.; Imran, A.; Cardenas-Cartagena, J.; Svennevig, P.R.; Svidt, K.; Nielsen, H.K. A review of the digital twin technology in the AEC-FM industry. Advances in civil engineering 2022, 2022, 2185170. [Google Scholar] [CrossRef]
- Hamzah, A.; Aqlan, F.; Baidya, S. Drone-based digital twins for water quality monitoring: A systematic review. Digital Twins and Applications 2024, 1, 131–160. [Google Scholar] [CrossRef]
- Hong, Y.; Wu, J.; Morello, R. LLM-Twin: mini-giant model-driven beyond 5G digital twin networking framework with semantic secure communication and computation. Scientific Reports 2024, 14, 19065. [Google Scholar] [CrossRef]
- Zhang, W.; Han, J.; Xu, Z.; Ni, H.; Liu, H.; Xiong, H. Urban Foundation Models: A Survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2024; KDD ’24; pp. 6633–6643. [CrossRef]
- Jang, S.; Roh, H.; Lee, G. Generative AI in architectural design: Application, data, and evaluation methods. Automation in Construction 2025, 174, 106174. [Google Scholar] [CrossRef]
- Zhang, W.; Han, J.; Xu, Z.; Ni, H.; Liu, H.; Xiong, H. Urban foundation models: A survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024; pp. 6633–6643.
- Lai, S.; Xu, Z.; Zhang, W.; Liu, H.; Xiong, H. LLMLight: Large Language Models as Traffic Signal Control Agents. arXiv 2023, arXiv:2312.16044. [Google Scholar] [CrossRef]
- Fu, J.; Ng, S.K.; Jiang, Z.; Liu, P. Gptscore: Evaluate as you desire. arXiv 2023, arXiv:2302.04166. [Google Scholar] [CrossRef]
- Knuth, D.E. Computer science and its relation to mathematics. The American Mathematical Monthly 1974, 81, 323–343. [Google Scholar] [CrossRef]
- Shagrir, O. What is computer science about? The Monist 1999, 82, 131–149. [Google Scholar] [CrossRef]
- Cortina, T.J. An introduction to computer science for non-majors using principles of computation. Acm sigcse bulletin 2007, 39, 218–222. [Google Scholar] [CrossRef]
- Lehman, E.; Leighton, F.T.; Meyer, A.R. Mathematics for computer science; Massachusetts Institute of Technology Cambridge: Massachusetts, 2010. [Google Scholar]
- Dorf, R.C. The electrical engineering handbook; CRC press, 1997. [Google Scholar]
- Rizzoni, G.; Kearns, J. Fundamentals of electrical engineering; McGraw-Hill Higher Education, 2009. [Google Scholar]
- IEEE JSTSP Special Series on AI in Signal and Data Science - Toward Large Language Model (LLM) Theory and Applications, 2024.
- Abdollahi, M.; Yeganli, S.F.; Baharloo, M.A.; Baniasadi, A. Hardware Design and Verification with Large Language Models: A Literature Survey, Challenges, and Open Issues. 2024. [Google Scholar] [CrossRef]
- Liu, Y.; Chen, J.; Bi, T.; Grundy, J.; Wang, Y.; Yu, J.; Chen, T.; Tang, Y.; Zheng, Z. An empirical study on low code programming using traditional vs large language model support. arXiv 2024, arXiv:2402.01156. [Google Scholar] [CrossRef]
- Thorne, S.; Ball, D.; Lawson, Z. Reducing error in spreadsheets: Example driven modeling versus traditional programming. International Journal of Human-Computer Interaction 2013, 29, 40–53. [Google Scholar] [CrossRef]
- Stoica, M.; Mircea, M.; Ghilic-Micu, B. Software development: agile vs. traditional. Informatica Economica 2013, 17. [Google Scholar] [CrossRef]
- Awad, M. A comparison between agile and traditional software development methodologies. University of Western Australia 2005, 30, 1–69. [Google Scholar]
- Risener, K. A study of software development methodologies. 2022.
- Leong, J.; May Yee, K.; Baitsegi, O.; Palanisamy, L.; Ramasamy, R.K. Hybrid project management between traditional software development lifecycle and agile based product development for future sustainability. Sustainability 2023, 15, 1121. [Google Scholar] [CrossRef]
- Pawar, R.P. A comparative study of agile software development methodology and traditional waterfall model. IOSR Journal of Computer Engineering 2015, 2, 1–8. [Google Scholar]
- Boehm, B. A spiral model of software development and enhancement. ACM SIGSOFT Software engineering notes 1986, 11, 14–24. [Google Scholar] [CrossRef]
- Boehm, B.W. A spiral model of software development and enhancement. Computer 1988, 21, 61–72. [Google Scholar] [CrossRef]
- Balaji, S.; Murugaiyan, M.S. Waterfall vs. V-Model vs. Agile: A comparative study on SDLC. International Journal of Information Technology and Business Management 2012, 2, 26–30. [Google Scholar]
- Gracias, M.H.; Gallegos, E.E. Transitioning Perspectives: Agile and Waterfall Perceptions in the Integration of Model-Based Systems Engineering (MBSE) within Aerospace and Defense Industries. The ITEA Journal of Test and Evaluation 2024, 45. [Google Scholar] [CrossRef]
- Shaikh, S.; Abro, S. Comparison of traditional & agile software development methodology: A short survey. International Journal of Software Engineering and Computer Systems 2019, 5, 1–14. [Google Scholar] [CrossRef]
- Roychoudhury, A. Debugging as a Science, that too, when your Program is Changing. Electronic Notes in Theoretical Computer Science 2010, 266, 3–15. [Google Scholar] [CrossRef]
- Fitzgerald, S.; Lewandowski, G.; McCauley, R.; Murphy, L.; Simon, B.; Thomas, L.; Zander, C. Debugging: finding, fixing and flailing, a multi-institutional study of novice debuggers. Computer Science Education 2008, 18, 93–116. [Google Scholar] [CrossRef]
- Weste, N.H.; Harris, D. CMOS VLSI design: a circuits and systems perspective; Pearson Education India, 2015. [Google Scholar]
- Palnitkar, S. Verilog HDL: a guide to digital design and synthesis; Vol. 1, Prentice Hall Professional, 2003.
- Ashenden, P.J. Digital design (verilog): An embedded systems approach using verilog; Elsevier, 2007.
- Thomas, D.; Moorby, P. The Verilog® hardware description language; Springer Science & Business Media, 2008.
- Haas, A.; Stewart, J.C.; Kukal, T. Ensuring Reliable and Optimal Analog PCB Designs with Allegro AMS Simulator. Cadence Design Systems, Silicon Valley 2007. [Google Scholar]
- Lin, R.; Ramesh, R.; Iannopollo, A.; Sangiovanni Vincentelli, A.; Dutta, P.; Alon, E.; Hartmann, B. Beyond schematic capture: Meaningful abstractions for better electronics design tools. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019; pp. 1–13.
- Nelson, B.; Riching, B.; Black, R. Using a Custom-Built HDL for Printed Circuit Board Design Capture. Technical report, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States), 2012.
- Jones, B.A.; Moorhead, J.N.; Mohammadi-Aragh, M.J. Less is More: Developing complex designs using a minimal HDL subset in an introductory digital devices laboratory. In Proceedings of the 2015 ASEE Annual Conference & Exposition; 2015; pp. 26–1082. [Google Scholar]
- Dewey, A. VHSIC hardware description (VHDL) development program. In Proceedings of the 20th Design Automation Conference Proceedings. IEEE; 1983; pp. 625–628. [Google Scholar]
- Mano, M.M.R.; Ciletti, M.D. Digital Design: With an Introduction to the Verilog HDL, VHDL, and SystemVerilog, 5th ed.; Pearson, 2012.
- Vemeko FPGA Team. What are Hardware Description Languages? 2024.
- Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 2024, 33, 1–79. [Google Scholar] [CrossRef]
- Belzner, L.; Gabor, T.; Wirsing, M. Large language model assisted software engineering: prospects, challenges, and a case study. In Proceedings of the International Conference on Bridging the Gap between AI and Reality. Springer; 2023; pp. 355–374. [Google Scholar]
- Fan, A.; Gokkaya, B.; Harman, M.; Lyubarskiy, M.; Sengupta, S.; Yoo, S.; Zhang, J.M. Large language models for software engineering: Survey and open problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE; 2023; pp. 31–53. [Google Scholar]
- Ma, Q.; Wu, T.; Koedinger, K. Is ai the better programming partner? human-human pair programming vs. human-ai pair programming. arXiv 2023, arXiv:2306.05153. [Google Scholar] [CrossRef]
- Peng, S.; Kalliamvakou, E.; Cihon, P.; Demirer, M. The impact of ai on developer productivity: Evidence from github copilot. arXiv 2023, arXiv:2302.06590. [Google Scholar] [CrossRef]
- Song, F.; Agarwal, A.; Wen, W. The impact of generative AI on collaborative open-source software development: Evidence from GitHub Copilot. arXiv 2024, arXiv:2410.02091. [Google Scholar] [CrossRef]
- Zhong, L.; Wang, Z.; Shang, J. Debug like a human: A large language model debugger via verifying runtime execution step-by-step. arXiv 2024, arXiv:2402.16906. [Google Scholar] [CrossRef]
- Levin, K.; van Kempen, N.; Berger, E.D.; Freund, S.N. Chatdbg: An ai-powered debugging assistant. arXiv 2024, arXiv:2403.16354. [Google Scholar] [CrossRef]
- Lee, C.; Xia, C.S.; Yang, L.; Huang, J.t.; Zhu, Z.; Zhang, L.; Lyu, M.R. A unified debugging approach via llm-based multi-agent synergy. arXiv 2024, arXiv:2404.17153. [Google Scholar] [CrossRef]
- Majdoub, Y.; Ben Charrada, E. Debugging with open-source large language models: An evaluation. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2024; pp. 510–516.
- Yan, J.; Huang, J.; Fang, C.; Yan, J.; Zhang, J. Better debugging: Combining static analysis and llms for explainable crashing fault localization. arXiv 2024, arXiv:2408.12070. [Google Scholar] [CrossRef]
- Jiang, N.; Li, X.; Wang, S.; Zhou, Q.; Hossain, S.; Ray, B.; Kumar, V.; Ma, X.; Deoras, A. LeDex: Training LLMs to Better Self-Debug and Explain Code. Advances in Neural Information Processing Systems 2024, 37, 35517–35543. [Google Scholar]
- Yang, W.; Wang, H.; Liu, Z.; Li, X.; Yan, Y.; Wang, S.; Gu, Y.; Yu, M.; Liu, Z.; Yu, G. Enhancing the code debugging ability of llms via communicative agent based data refinement. arXiv 2024, arXiv:2408.05006. [Google Scholar] [CrossRef]
- Bhattacharya, P.; Chakraborty, M.; Palepu, K.N.; Pandey, V.; Dindorkar, I.; Rajpurohit, R.; Gupta, R. Exploring large language models for code explanation. arXiv 2023, arXiv:2310.16673. [Google Scholar] [CrossRef]
- Arakelyan, S.; Das, R.J.; Mao, Y.; Ren, X. Exploring distributional shifts in large language models for code analysis. arXiv 2023, arXiv:2303.09128. [Google Scholar] [CrossRef]
- Li, D.; Shen, Y.; Jin, R.; Mao, Y.; Wang, K.; Chen, W. Generation-augmented query expansion for code retrieval. arXiv 2022, arXiv:2212.10692. [Google Scholar] [CrossRef]
- Thakur, S.; Blocklove, J.; Pearce, H.; Tan, B.; Garg, S.; Karri, R. Autochip: Automating hdl generation using llm feedback. arXiv 2023, arXiv:2311.04887. [Google Scholar] [CrossRef]
- Liu, M.; Tsai, Y.D.; Zhou, W.; Ren, H. CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair. arXiv 2024, arXiv:2409.12993. [Google Scholar] [CrossRef]
- Collini, L.; Garg, S.; Karri, R. C2HLSC: Can LLMs Bridge the Software-to-Hardware Design Gap? In Proceedings of the 2024 IEEE LLM Aided Design Workshop (LAD). IEEE; 2024; pp. 1–12. [Google Scholar]
- Xiao, W.; Putrevu, V.S.C.; Hemadri, R.V.; Garg, S.; Karri, R. PrefixLLM: LLM-aided Prefix Circuit Design. arXiv 2024, arXiv:2412.02594. [Google Scholar] [CrossRef]
- Tsai, Y.D.; Liu, M.; Ren, H. Automatically Fixing RTL Syntax Errors with Large Language Model. In Proceedings of the IEEE/ACM Design Automation Conference (DAC); 2024. [Google Scholar]
- Bai, Y.; Sohrabizadeh, A.; Qin, Z.; Hu, Z.; Sun, Y.; Cong, J. Towards a comprehensive benchmark for high-level synthesis targeted to FPGAs. Advances in Neural Information Processing Systems 2023, 36, 45288–45299. [Google Scholar]
- Sohrabizadeh, A.; Yu, C.H.; Gao, M.; Cong, J. AutoDSE: Enabling software programmers to design efficient FPGA accelerators. ACM Transactions on Design Automation of Electronic Systems (TODAES) 2022, 27, 1–27. [Google Scholar] [CrossRef]
- Abi-Karam, S.; Sarkar, R.; Seigler, A.; Lowe, S.; Wei, Z.; Chen, H.; Rao, N.; John, L.; Arora, A.; Hao, C. HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond. In Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, 2024; pp. 1–9.
- Xu, K.; Zhang, G.L.; Yin, X.; Zhuo, C.; Schlichtmann, U.; Li, B. Automated c/c++ program repair for high-level synthesis via large language models. In Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, 2024; pp. 1–9.
- Fang, W.; Li, M.; Li, M.; Yan, Z.; Liu, S.; Xie, Z.; Zhang, H. Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms. arXiv 2024, arXiv:2402.00386. [Google Scholar] [CrossRef]
- Xu, K.; Sun, J.; Hu, Y.; Fang, X.; Shan, W.; Wang, X.; Jiang, Z. Meic: Re-thinking rtl debug automation using llms. arXiv 2024, arXiv:2405.06840. [Google Scholar] [CrossRef]
- Ma, R.; Yang, Y.; Liu, Z.; Zhang, J.; Li, M.; Huang, J.; Luo, G. VerilogReader: LLM-Aided Hardware Test Generation. In Proceedings of the 2024 IEEE LLM Aided Design Workshop (LAD). IEEE; 2024; pp. 1–5. [Google Scholar]
- Zhang, Z.; Chadwick, G.; McNally, H.; Zhao, Y.; Mullins, R. Llm4dv: Using large language models for hardware test stimuli generation. arXiv 2023, arXiv:2310.04535. [Google Scholar] [CrossRef]
- Qiu, R.; Zhang, G.L.; Drechsler, R.; Schlichtmann, U.; Li, B. Autobench: Automatic testbench generation and evaluation using llms for hdl design. In Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, 2024; pp. 1–10.
- Orenes-Vera, M.; Martonosi, M.; Wentzlaff, D. Using llms to facilitate formal verification of rtl. arXiv 2023, arXiv:2309.09437. [Google Scholar] [CrossRef]
- Xiong, C.; Liu, C.; Li, H.; Li, X. Hlspilot: Llm-based high-level synthesis. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024; pp. 1–9.
- Liao, Y.; Adegbija, T.; Lysecky, R. Are llms any good for high-level synthesis? In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design; 2024; pp. 1–8. [Google Scholar]
- Liu, J.; Xia, C.S.; Wang, Y.; Zhang, L. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 2023, 36, 21558–21572. [Google Scholar]
- Bareiß, P.; Souza, B.; d’Amorim, M.; Pradel, M. Code generation tools (almost) for free? a study of few-shot, pre-trained language models on code. arXiv 2022, arXiv:2206.01335. [Google Scholar] [CrossRef]
- Dong, Y.; Jiang, X.; Jin, Z.; Li, G. Self-collaboration code generation via chatgpt. ACM Transactions on Software Engineering and Methodology 2024, 33, 1–38. [Google Scholar] [CrossRef]
- Liu, C.; Bao, X.; Zhang, H.; Zhang, N.; Hu, H.; Zhang, X.; Yan, M. Improving chatgpt prompt for code generation. arXiv 2023, arXiv:2305.08360. [Google Scholar] [CrossRef]
- Chen, X.; Lin, M.; Schärli, N.; Zhou, D. Teaching Large Language Models to Self-Debug. arXiv 2023, arXiv:2304.05128. [Google Scholar] [CrossRef]
- Haque, M.A.; Li, S. The potential use of chatgpt for debugging and bug fixing. 2023. [Google Scholar] [CrossRef]
- Tian, R.; Ye, Y.; Qin, Y.; Cong, X.; Lin, Y.; Pan, Y.; Wu, Y.; Hui, H.; Liu, W.; Liu, Z.; et al. Debugbench: Evaluating debugging capability of large language models. arXiv 2024, arXiv:2401.04621. [Google Scholar] [CrossRef]
- Kang, S.; Chen, B.; Yoo, S.; Lou, J.G. Explainable automated debugging via large language model-driven scientific debugging. Empirical Software Engineering 2025, 30, 1–28. [Google Scholar] [CrossRef]
- Thakur, S.; Ahmad, B.; Pearce, H.; Tan, B.; Dolan-Gavitt, B.; Karri, R.; Garg, S. Verigen: A large language model for verilog code generation. ACM Transactions on Design Automation of Electronic Systems 2024, 29, 1–31. [Google Scholar] [CrossRef]
- Takamaeda-Yamazaki, S. Pyverilog: A python-based hardware design processing toolkit for verilog hdl. In Proceedings of the Applied Reconfigurable Computing: 11th International Symposium, ARC 2015, Bochum, Germany, April 13-17, 2015, Proceedings 11. Springer, 2015, pp. 451–460.
- Blocklove, J.; Garg, S.; Karri, R.; Pearce, H. Chip-chat: Challenges and opportunities in conversational hardware design. In Proceedings of the 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD). IEEE; 2023; pp. 1–6. [Google Scholar]
- Nakkab, A.; Zhang, S.Q.; Karri, R.; Garg, S. Rome was Not Built in a Single Step: Hierarchical Prompting for LLM-based Chip Design. 2024. [Google Scholar] [CrossRef]
- Chang, K.; Wang, K.; Yang, N.; Wang, Y.; Jin, D.; Zhu, W.; Chen, Z.; Li, C.; Yan, H.; Zhou, Y.; et al. Data is all you need: Finetuning llms for chip design via an automated design-data augmentation framework. In Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024; pp. 1–6.
- Vijayaraghavan, P.; Nitsure, A.; Mackin, C.; Shi, L.; Ambrogio, S.; Haran, A.; Paruthi, V.; Elzein, A.; Coops, D.; Beymer, D.; et al. Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization. In Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, 2024; pp. 1–10.
- Batten, C.; Pinckney, N.; Liu, M.; Ren, H.; Khailany, B. PyHDL-Eval: An LLM Evaluation Framework for Hardware Design Using Python-Embedded DSLs. In Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, 2024; pp. 1–17.
- Fu, Y.; Zhang, Y.; Yu, Z.; Li, S.; Ye, Z.; Li, C.; Wan, C.; Lin, Y.C. Gpt4aigchip: Towards next-generation ai accelerator design automation via large language models. In Proceedings of the 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE; 2023; pp. 1–9. [Google Scholar]
- Latibari, B.S.; Ghimire, S.; Chowdhury, M.A.; Nazari, N.; Gubbi, K.I.; Homayoun, H.; Sasan, A.; Salehi, S. Automated Hardware Logic Obfuscation Framework Using GPT. In Proceedings of the 2024 IEEE 17th Dallas Circuits and Systems Conference (DCAS). IEEE; 2024; pp. 1–5. [Google Scholar]
- Wan, L.J.; Huang, Y.; Li, Y.; Ye, H.; Wang, J.; Zhang, X.; Chen, D. Software/hardware co-design for llm and its application for design verification. In Proceedings of the 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE; 2024; pp. 435–441. [Google Scholar]
- Gai, J.; Chen, H.; Wang, Z.; Zhou, H.; Zhao, W.; Lane, N.; Fan, H. Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis. In Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025.
- Hendrycks, D.; Basart, S.; Kadavath, S.; Mazeika, M.; Arora, A.; Guo, E.; Burns, C.; Puranik, S.; He, H.; Song, D.; et al. Measuring Coding Challenge Competence With APPS. NeurIPS 2021. [Google Scholar]
- Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.B.; Drain, D.; Jiang, D.; Tang, D.; et al. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv 2021, arXiv:2102.04664. [Google Scholar] [CrossRef]
- Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Lago, A.D.; et al. Competition-level code generation with AlphaCode. Science 2022, 378, 1092–1097. [Google Scholar] [CrossRef]
- Xia, Y.; Shen, W.; Wang, Y.; Liu, J.K.; Sun, H.; Wu, S.; Hu, J.; Xu, X. LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs. arXiv 2025, arXiv:2504.14655. [Google Scholar] [CrossRef]
- Zhong, V.; Xiong, C.; Socher, R. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv 2017, arXiv:1709.00103. [Google Scholar] [CrossRef]
- Aider Blog Team. The Polyglot Benchmark. https://aider.chat/2024/12/21/polyglot.html#the-polyglot-benchmark, 2024.
- Jimenez, C.E.; Yang, J.; Wettig, A.; Yao, S.; Pei, K.; Press, O.; Narasimhan, K. Swe-bench: Can language models resolve real-world github issues? arXiv 2023, arXiv:2310.06770. [Google Scholar] [CrossRef]
- Just, R.; Jalali, D.; Ernst, M.D. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014; pp. 437–440.
- Lin, D.; Koppel, J.; Chen, A.; Solar-Lezama, A. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, 2017; pp. 55–56.
- Widyasari, R.; Sim, S.Q.; Lok, C.; Qi, H.; Phan, J.; Tay, Q.; Tan, C.; Wee, F.; Tan, J.E.; Yieh, Y.; et al. BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020; pp. 1556–1560.
- Liu, S.; Chai, L.; Yang, J.; Shi, J.; Zhu, H.; Wang, L.; Jin, K.; Zhang, W.; Zhu, H.; Guo, S.; et al. Mdeval: Massively multilingual code debugging. arXiv 2024, arXiv:2411.02310. [Google Scholar] [CrossRef]
- Husain, H.; Wu, H.H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv 2019, arXiv:1909.09436. [Google Scholar] [CrossRef]
- Nguyen, D.M.; Phan, T.C.; Le Hai, N.; Doan, T.T.; Nguyen, N.V.; Pham, Q.; Bui, N.D. CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs. In Proceedings of the Thirteenth International Conference on Learning Representations.
- Lu, Y.; Liu, S.; Zhang, Q.; Xie, Z. Rtllm: An open-source benchmark for design rtl generation with large language model. In Proceedings of the 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE; 2024; pp. 722–727. [Google Scholar]
- Liu, M.; Pinckney, N.; Khailany, B.; Ren, H. Verilogeval: Evaluating large language models for verilog code generation. In Proceedings of the 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE; 2023; pp. 1–8. [Google Scholar]
- Zhang, Y.; Yu, Z.; Fu, Y.; Wan, C.; Lin, Y.C. Mg-verilog: Multi-grained dataset towards enhanced llm-assisted verilog generation. In Proceedings of the 2024 IEEE LLM Aided Design Workshop (LAD). IEEE; 2024; pp. 1–5. [Google Scholar]
- Liu, S.; Lu, Y.; Fang, W.; Li, M.; Xie, Z. OpenLLM-RTL: Open dataset and benchmark for LLM-aided design RTL generation. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024; pp. 1–9.
- Wu, Y.; Yu, X.; Chen, H.; Luo, Y.; Tong, Y.; Ma, Y. PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design. arXiv 2025, arXiv:2502.03159. [Google Scholar] [CrossRef]
- Jin, H.; Huang, L.; Cai, H.; Yan, J.; Li, B.; Chen, H. From llms to llm-based agents for software engineering: A survey of current, challenges and future. arXiv 2024, arXiv:2408.02479. [Google Scholar] [CrossRef]
- Kochar, D.V.; Wang, H.; Chandrakasan, A.; Zhang, X. LEDRO: LLM-Enhanced Design Space Reduction and Optimization for Analog Circuits. arXiv 2024, arXiv:2411.12930. [Google Scholar] [CrossRef]
- Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A survey on large language models for code generation. arXiv 2024, arXiv:2406.00515. [Google Scholar] [CrossRef]
- Chui, M.; Roberts, R.; Yee, L.; Hazan, E.; Singla, A.; Smaje, K.; Sukharevsky, A.; Zemmel, R. The Economic Potential of Generative AI: The Next Productivity Frontier. Technical report, McKinsey & Company, 2023.
- Mann, T. China’s DeepSeek just emitted a free challenger to OpenAI’s o1 – here’s how to use it on your PC. The Register, 2025.
- Hanbury, P.; Wang, J.; Brick, P.; Cannarsi, A. DeepSeek: A game changer in AI efficiency? Industry brief, Bain & Company, 2025.
- Masri, S.; Ashqar, H.I.; Elhenawy, M. Visual Reasoning at Urban Intersections: Fine-Tuning GPT-4o for Traffic Conflict Detection. arXiv 2025, arXiv:2502.20573. [Google Scholar] [CrossRef]
- Lai, W.; Mesgar, M.; Fraser, A. LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback. arXiv 2024, arXiv:2406.01771. [Google Scholar] [CrossRef]
- Warren, T. Microsoft makes OpenAI’s o1 reasoning model free for all Copilot users. The Verge, 2025.
- Garcez, A.d.; Lamb, L.C. Neurosymbolic AI: The 3rd wave. Artificial Intelligence Review 2023, 56, 12387–12406. [Google Scholar] [CrossRef]
- Lee, H.; Phatale, S.; Mansoor, H.; Lu, K.R.; Mesnard, T.; Ferret, J.; Bishop, C.; Hall, E.; Carbune, V.; Rastogi, A. RLAIF: Scaling reinforcement learning from human feedback with AI feedback.
- Li, Z.; Fan, Z.; Tou, H.; Chen, J.; Wei, Z.; Huang, X. MVPTR: Multi-level semantic alignment for vision-language pre-training via multi-stage learning. In Proceedings of the 30th ACM International Conference on Multimedia, 2022; pp. 4395–4405.
- Guo, D.; et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
- Su, Y.; et al. Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains. arXiv 2025, arXiv:2503.23829. [Google Scholar] [CrossRef]
- Pan, Y.; Feng, Y.; Zhuang, J.; Ding, S.; Zhou, Z.; Sun, B.; Xu, H.; Chou, Y.; Deng, A.; Hu, A.; et al. SpikingBrain Technical Report: Brain-Inspired Large Models for Efficient Long-Context Training and Inference. arXiv 2025, arXiv:2509.05276. [Google Scholar] [CrossRef]
- WhatAboutMyStar.; colleagues. The LLM Language Network: Functional Brain Networks in LLMs. In Proceedings of the International Conference on Machine Learning (ICML), 2025.
- Alkhamissi, B.; Schrimpf, M.; et al. The LLM Language Network: A Neuroscientific Approach to Identifying Language-Selective Units in LLMs. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL); 2025. [Google Scholar]
- Wu, T.; et al. Continual Learning for Large Language Models: A Survey. arXiv 2024, arXiv:2402.01364. [Google Scholar] [CrossRef]
- Zhu, Y.; et al. MemoRAG: Boosting Long Context Processing with Global Memory-Augmented Retrieval. arXiv 2025, arXiv:2409.05591. [Google Scholar] [CrossRef]
- Pan, Y.; Feng, Y.; Zhuang, J.; Ding, S.; Zhou, Z.; Sun, B.; Xu, H.; Chou, Y.; Deng, A.; Hu, A.; et al. SpikingBrain Technical Report: Brain-Inspired Large Models for Efficient Long-Context Training and Inference. arXiv 2025, arXiv:2509.05276. [Google Scholar] [CrossRef]
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. [Google Scholar]
- Yang, H.; Yue, S.; He, Y. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv 2023, arXiv:2306.02224. [Google Scholar] [CrossRef]
- Salesforce. xLAM enters its next era: The evolution of Large Action Models. Salesforce Blog, 2025.
- Zhang, L.; He, S.; Zhang, C.; et al. SWE-bench Goes Live! arXiv 2025. [Google Scholar] [CrossRef]
- Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. Pal: Program-aided language models. In Proceedings of the International Conference on Machine Learning. PMLR; 2023; pp. 10764–10799. [Google Scholar]
- Meding, K. It’s Complicated: The Relationship of Algorithmic Fairness and Non-Discrimination Regulations in the EU AI Act. arXiv 2025, arXiv:2501.12962. [Google Scholar] [CrossRef]
- European Data Protection Board (EDPB). AI Privacy Risks & Mitigations — Large Language Models (LLMs). Technical report, EDPB, 2025.
- LLP, C. The EU AI Act: Key Milestones, Compliance Challenges and the Road Ahead. Legal Commentary, 2025.
- OWASP. OWASP Top 10 for Large Language Model Applications 2025. OWASP GenAI Security Project, 2025.
- MacCormick, D.N.; Summers, R.S. Interpreting statutes: A comparative study; Routledge, 2016. [Google Scholar]
- Nicolini, D.; Mengis, J.; Swan, J. Understanding the role of objects in cross-disciplinary collaboration. Organization science 2012, 23, 612–629. [Google Scholar] [CrossRef]
- Brock, M.E.; Cannella-Malone, H.I.; Seaman, R.L.; Andzik, N.R.; Schaefer, J.M.; Page, E.J.; Barczak, M.A.; Dueker, S.A. Findings across practitioner training studies in special education: A comprehensive review and meta-analysis. Exceptional Children 2017, 84, 7–26. [Google Scholar] [CrossRef]
- Argyle, M.; et al. Out of One, Many: Using LMs to Simulate Human Samples. Political Analysis 2023, 31, 550–569. [Google Scholar] [CrossRef]
- Lane, C.; Rollnick, S. The use of simulated patients and role-play in communication skills training: a review of the literature to August 2005. Patient education and counseling 2007, 67, 13–20. [Google Scholar] [CrossRef]
- Tseng, Y.M.; Huang, Y.C.; Hsiao, T.Y.; Chen, W.L.; Huang, C.W.; Meng, Y.; Chen, Y.N. Two tales of persona in llms: A survey of role-playing and personalization. arXiv 2024, arXiv:2406.01171. [Google Scholar] [CrossRef]
- Louie, R.; Nandi, A.; Fang, W.; Chang, C.; Brunskill, E.; Yang, D. Roleplay-doh: Enabling domain-experts to create llm-simulated patients via eliciting and adhering to principles. arXiv 2024, arXiv:2407.00870. [Google Scholar] [CrossRef]
- Shao, Y.; Li, L.; Dai, J.; Qiu, X. Character-llm: A trainable agent for role-playing. arXiv 2023, arXiv:2310.10158. [Google Scholar] [CrossRef]
- Xiang, L.; Zhao, Y.; Zhang, Y.; Zong, C. A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities. Studies in Informatics and Control 2025, 34, 1. [Google Scholar] [CrossRef]
- Wang, W.; et al. Python Symbolic Execution with LLM-empowered SMT Solving. arXiv 2024, arXiv:2409.09271. [Google Scholar] [CrossRef]
- Corbin, A.L. Legal analysis and terminology. Yale L.J. 1919, 29, 163. [Google Scholar] [CrossRef]
- Shen, J.; Tenenholtz, N.; Hall, J.B.; Alvarez-Melis, D.; Fusi, N. Tag-LLM: Repurposing general-purpose LLMs for specialized domains. arXiv 2024, arXiv:2402.05140. [Google Scholar] [CrossRef]
- Singh, C.; Inala, J.P.; Galley, M.; Caruana, R.; Gao, J. Rethinking interpretability in the era of large language models. arXiv 2024, arXiv:2402.01761. [Google Scholar] [CrossRef]
- Cai, S.; Huang, A.; Li, J. The Impact of Embedded Question Prompts on Students’ Reflective Thinking and Learning Behaviors in AR Learning Environments. Journal of Science Education and Technology 2025, 1–19. [Google Scholar] [CrossRef]
- Chung, T.; Yang, C.J. Legal Document RAG: Multi-Graph Multi-Agent Recursive Retrieval through Legal Clauses. https://medium.com/enterprise-rag/legal-document-rag-multi-graph-multi-agent-recursive-retrieval-through-legal-clauses-c90e073e0052, 2024. Accessed: 2025-06-01.
- Xu, L.; et al. Better Aligned with Survey Respondents or Training Data? arXiv 2025, arXiv:2502.18282. [Google Scholar] [CrossRef]
- Potter, D.; et al. Hidden Persuaders: LLMs’ Political Leaning and Influence on Voters. In Proceedings of EMNLP, 2024; pp. 47–63.
- Li, H.; Dong, Q.; Chen, J.; Su, H.; Zhou, Y.; Ai, Q.; Ye, Z.; Liu, Y. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv 2024, arXiv:2412.05579. [Google Scholar] [CrossRef]
- Dai, J.; Pan, X.; Sun, R.; Ji, J.; Xu, X.; Liu, M.; Wang, Y.; Yang, Y. Safe rlhf: Safe reinforcement learning from human feedback. arXiv 2023, arXiv:2310.12773. [Google Scholar] [CrossRef]
- Institute of International Finance; Ernst & Young. 2024 IIF-EY Survey Report on AI/ML Use in Financial Services. Technical report, Institute of International Finance, 2025. Accessed: 2025-05-12.
- Kamalnath, V.; Lerner, L.; Moon, J.; Sari, G.; Sohoni, V.; Zhang, S. Capturing the Full Value of Generative AI in Banking. McKinsey & Company 2023. Accessed: 2025-05-12.
- Maple, C.; Sabuncuoglu, A.; Szpruch, L.; Elliot, A.; Reinert, G.; Zemaitis, T. The Impact of Large Language Models in Finance: Towards Trustworthy Adoption. Technical report, The Alan Turing Institute, 2024. Accessed: 2025-05-12.
- Strickland, B. Gravitating to gen AI: CPA leaders show increased interest. Journal of Accountancy 2024. [Google Scholar]
- Sukharevsky, A.; Ess, A.; Emelyantsev, D.; Reasor, E.; Hürtgen, H.; Sokolov, O.; Kondratyuk, S. LLM to ROI: How to Scale Gen AI in Retail. McKinsey & Company 2024. Accessed: 2025-05-12.
- Tao, W.; Xing, X.; Chen, Y.; Huang, L.; Xu, X. TreeRAG: Unleashing the Power of Hierarchical Storage for Enhanced Knowledge Retrieval in Long Documents. In Findings of the Association for Computational Linguistics: ACL 2025, 2025; pp. 356–371.
- Kiciman, E.; Ness, R.; Sharma, A.; Tan, C. Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research 2023. [Google Scholar]
- Chen, W.; Ma, X.; Wang, X.; Cohen, W.W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv 2022, arXiv:2211.12588. [Google Scholar] [CrossRef]
- Fu, Y.; Wang, X.; Tian, Y.; Zhao, J. Deep think with confidence. arXiv 2025, arXiv:2508.15260. [Google Scholar] [CrossRef]
- Claridge, T.D. High-resolution NMR techniques in organic chemistry; Vol. 27, Elsevier, 2016.
- Aksenov, A.A.; da Silva, R.; Knight, R.; Lopes, N.P.; Dorrestein, P.C. Global chemical analysis of biology by mass spectrometry. Nature Reviews Chemistry 2017, 1, 0054. [Google Scholar] [CrossRef]
- Espenson, J.H.; et al. Chemical kinetics and reaction mechanisms; Vol. 2, Citeseer, 1995.
- Fogler, H.S. Essentials of Chemical Reaction Engineering; Pearson Education, 2010. [Google Scholar]
- Coley, C.W.; Jin, W.; Rogers, L.; Jamison, T.F.; Jaakkola, T.S.; Green, W.H.; Barzilay, R.; Jensen, K.F. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical science 2019, 10, 370–377. [Google Scholar] [CrossRef]
- Xie, Z. Chemical Space Exploration via Large-Language-Model and Bayesian Optimization. PhD thesis, University of Liverpool, 2024.
- Liu, Y.; Kashima, H. Chemical property prediction under experimental biases. Scientific Reports 2022, 12, 8206. [Google Scholar] [CrossRef] [PubMed]
- McCloskey, K.; Taly, A.; Monti, F.; Brenner, M.P.; Colwell, L.J. Using attribution to decode binding mechanism in neural network models for chemistry. Proceedings of the National Academy of Sciences 2019, 116, 11624–11629. [Google Scholar] [CrossRef] [PubMed]
- Berreziga, R.; Brahimi, M.; Kraim, K.; Azzoune, H. Combining GCN Structural Learning with LLM Chemical Knowledge for Enhanced Virtual Screening. arXiv 2025, arXiv:2504.17497. [Google Scholar] [CrossRef]
- Xie, J.; Wang, W.; Gao, B.; Yang, Z.; Wan, H.; Zhang, S.; Fu, T.; Li, Y. QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry. arXiv 2025, arXiv:2508.01670. [Google Scholar] [CrossRef]
- Krenn, M.; et al. SELFIES and the future of molecular string representations. Patterns 2022. [Google Scholar] [CrossRef]
- Zeng, Z.; Yin, B.; Wang, S.; Liu, J.; Yang, C.; Yao, H.; Sun, X.; Sun, M.; Xie, G.; Liu, Z. ChatMol: interactive molecular discovery with natural language. Bioinformatics 2024, 40, btae534. [Google Scholar] [CrossRef]
- Yan, C.; Fan, X.; Fan, J.; Yu, L.; Wang, N.; Chen, L.; Li, X. Hyformer: Hybrid transformer and cnn for pixel-level multispectral image land cover classification. International Journal of Environmental Research and Public Health 2023, 20, 3059. [Google Scholar] [CrossRef]
- Batzner, S.; et al. E(3)-equivariant graph neural networks for data-efficient interatomic potentials. Nature Communications 2022. [Google Scholar] [CrossRef]
- Xu, Y.; et al. Pretrained E(3)-equivariant message-passing neural networks for fast spectral prediction. npj Computational Materials 2025. [Google Scholar] [CrossRef]
- Anonymous. Instantiation-based Formalization of Logical Reasoning via Semantic Self-Verification. arXiv 2025, arXiv:2501.16961. [CrossRef]
- Tang, X.; Yu, Z.; Chen, J.; Cui, Y.; Shao, D.; Wang, W.; Wu, F.; Zhuang, Y.; Shi, W.; Huang, Z.; et al. CellForge: Agentic Design of Virtual Cell Models. arXiv 2025, arXiv:2508.02276. [Google Scholar] [CrossRef]
- Bunne, C.; Roohani, Y.; Rosen, Y.; Gupta, A.; Zhang, X.; Roed, M.; Alexandrov, T.; AlQuraishi, M.; Brennan, P.; Burkhardt, D.B.; et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell 2024, 187, 7045–7063. [Google Scholar] [CrossRef] [PubMed]



| Family | Model Name | Context Window | Output Size | Input Modal | Input Cost ($/1M Tok) | Output Cost ($/1M Tok) | TPS | TFT (s) | Strengths | Provider |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT 4 [1,64] | GPT-4o | 128K | 16,384 | – | 2.50 | 10.00 | 209.4 | 0.35 | High TPS, General-Purpose Tasks | OpenAI |
| | GPT-4o mini | 128K | 16,384 | – | 0.15 | 0.60 | 74.3 | 0.35 | Fine-tune / Distill, Focused Tasks | |
| | ChatGPT-4o | 128K | 16,384 | – | 5.00 | 15.00 | 214.0 | 0.81 | Chat-based Applications | |
| | GPT-4.5 | 128K | 16,384 | T, I | 75.00 | 150.00 | 11.5 | 1.47 | Complex / Creative Tasks | |
| GPT 5 [65] | GPT-5 | 400K | 128K | – | 1.25 | 10.00 | 136.0 | 1.03 | Coding, Reasoning, and Agentic Tasks | OpenAI |
| | GPT-5-mini | 400K | 128K | – | 0.25 | 2.00 | 73.0 | 0.98 | Well-defined Tasks / Precise Prompts | |
| | GPT-5-nano | 400K | 128K | – | 0.05 | 0.40 | 210.0 | 0.86 | Summarization / Classification Tasks | |
| OpenAI Reasoning [20] | OpenAI o1 | 200K | 100K | T, I | 15.00 | 60.00 | 108.9 | 25.56 | Scalable for Reasoning Tasks | OpenAI |
| | OpenAI o1-mini | 128K | 65,536 | T | 1.10 | 4.40 | 221.7 | 10.30 | Affordable Reasoning | |
| | OpenAI o1-pro | 200K | 100K | T, I | 150.00 | 600.00 | – | – | – | |
| | OpenAI o3-mini | 200K | 100K | T | 1.10 | 4.40 | 194.2 | 13.95 | Func. Call, Batch Process | |
| Claude 3 [66,67] | Claude 3 Opus | 200K | 4,096 | T, I, A | 15.00 | 75.00 | 26.8 | 1.22 | Multilingual, Tool Use | Anthropic |
| | Claude 3.5 Haiku | 200K | 8,192 | T, I, A | 0.80 | 4.00 | 65.9 | 0.62 | Fast Response, Multilingual | |
| | Claude 3.5 Sonnet | 200K | 8,192 | T, I, A | 3.00 | 15.00 | 78.9 | 0.87 | General Tasks, Coding | |
| | Claude 3.7 Sonnet | 200K | 8,192 | T | 3.00 | 15.00 | 77.7 | 0.73 | Toggleable Extended Thinking | |
| Gemini 2 [68] | Gemini 2.0 Flash | 1,000K | 8,192 | T, I, A | 0.10 | 0.40 | 257.0 | 0.34 | High TPS, Long Context Window | Google |
| | Gemini 2.0 Flash Lite | 1,000K | 8,192 | T, I, A | 0.075 | 0.30 | 207.3 | 0.24 | Cost-Efficient, Fast TFT | |
| | Gemini 2.0 Flash TK | 1,000K | 8,192 | T, I, A | 0.10 | 0.40 | – | – | – | |
| | Gemini 2.5 Pro | 1,000K | 65,536 | T, I, A | – | – | 168.2 | 33.69 | Visual Reasoning | |
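The pricing columns above are quoted in dollars per million tokens, so the cost of a single request scales linearly with its input and output token counts. As a minimal sketch (the function name is ours, and the example figures are simply taken from the GPT-4o row), the per-request cost can be estimated as:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_rate: float, output_rate: float) -> float:
    """Estimate one request's cost, given rates quoted in $ per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# GPT-4o rates from the table: $2.50 input, $10.00 output per 1M tokens.
# A 2,000-token prompt with a 500-token completion:
print(request_cost_usd(2_000, 500, 2.50, 10.00))  # -> 0.01
```

Note how the much higher output rate means that long completions, not long prompts, usually dominate the bill for reasoning-style models.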
| Family | Model Name | Model Size | Context Window | Input Modal | Strengths | License | Provider |
|---|---|---|---|---|---|---|---|
| GPT-OSS [69] | GPT-oss-20B | 21B | 131K | T | Reasoning and Agentic Tasks | Apache 2.0 | OpenAI |
| | GPT-oss-120B | 117B | 131K | T | | | |
| Qwen 2 [70,71,72] | Qwen 2 | 0.5B - 72B | 131,072 | T | Previous Flagship Models | Apache 2.0 License | Alibaba |
| | Qwen 2.5 | 0.5B - 72B | 128K | T | General-Purpose Tasks | | |
| | Qwen 2.5-1M | 7B, 14B | 1M | T | Long Context Tasks | | |
| | QwQ | 32B | 131,072 | T | Reasoning and QA Tasks | | |
| DeepSeek [21,73] | DeepSeek-V3 | 671B | 128K | T | MoE, Math | MIT License | DeepSeek |
| | DeepSeek-R1 | 671B | 128K | T | Complex Reasoning Tasks | | |
| | DeepSeek-R1 Distill | 1.5B - 70B | 128K | T | Efficient Reasoning Inference | | |
| Llama 3 [74] | Llama 3.1 | 8B, 70B, 405B | 128K | T | Multilingual, Text Generation | Custom License | Meta AI |
| | Llama 3.2 LW | 1B, 3B | 128K | T | Efficient Inference | | |
| | Llama 3.2 MM | 11B, 90B | 128K | T, I | Multimodal, Image Tasks | | |
| | Llama 3.3 | 70B | 128K | T | – | | |
| Tasks | Subtask | Relevant Benchmarks | Domain | Data Source |
|---|---|---|---|---|
| Text Understanding | Sentiment Detection | SST-2 [106] | Movie Reviews | Rotten Tomatoes |
| IMDB [107] | Movie Reviews | IMDB | ||
| Yelp Reviews [108] | Business Reviews | Yelp | ||
| Information Extraction | CoNLL-2003 (NER) [109] | News Articles | Reuters Corpus | |
| TACRED [110] | Various Domains | News and Wikipedia | ||
| OpenIE [111] | General Text | Web Sources | ||
| Relationship Understanding | Winograd Schema Challenge [112] | Pronoun Resolution | Constructed Sentences | |
| Winogrande [113] | Commonsense Reasoning | Crowdsourced Sentences | ||
| SuperGLUE [114] | Various | Multiple Sources | ||
| Summarization | CNN/DailyMail [115] | News Articles | CNN and DailyMail | |
| XSum [116] | News Articles | BBC | ||
| SAMSum [117] | Dialogues | SAMSum Corpus | ||
| Text Generation | Question Answering | SQuAD v1/v2 [118] | Wikipedia | Stanford QA Dataset |
| NaturalQuestions [119] | Google Search | Real User Queries | ||
| TriviaQA [120] | Trivia | Trivia Websites | ||
| HotpotQA [121] | Wikipedia | Crowdsourced Questions | ||
| BoolQ [122] | Various | Wikipedia | ||
| Style Transferring | GYAFC [123] | Formality | Yahoo Answers | |
| Text Completion | LAMBADA [124] | Narrative Text | BookCorpus | |
| Machine Translation | WMT 14 [125] | News Articles | Various News Sources | |
| IWSLT [126] | TED Talks | TED | ||
| FLORES-101 [127] | Low-Resource Languages | Wikipedia | ||
| Complex Reasoning | Code Generation | HumanEval [49] | Programming | Handcrafted Problems |
| MBPP [128] | Python Programming | Crowdsourced Problems | ||
| APPS [129] | Programming | Competitive Programming | ||
| CodeXGLUE [130] | Programming | Multiple Sources | ||
| Live Code Bench [131] | Programming (multi-language) | Live code execution tasks | ||
| Multi-step Inference | HotpotQA [121] | Wikipedia | Crowdsourced Questions | |
| MuSiQue [132] | Wikipedia | Crowdsourced Questions | ||
| StrategyQA [133] | Various | Crowdsourced Questions | ||
| OpenBookQA [134] | Elementary Science | OpenBookQA Dataset | ||
| MMLU [135] | Multidomain (57 subjects) | Knowledge exams | ||
| BIG-Bench [136] | Multitask | Collaborative crowdsourced tasks | ||
| BIG-Bench Hard (BBH) [137] | Hard subset of BIG-Bench | Hard tasks from BIG-Bench | ||
| Logical Reasoning | LogiQA [138] | Logical Reasoning | National Civil Servants Examination | |
| ReClor [139] | Logical Reasoning | Standardized Tests | ||
| GSM8K [140] | Grade-school math | Crowdsourced word problems | ||
| ARC [141] | Middle-school science | Standardized test questions | ||
| MATH [142] | High school + competition math | Math contests and textbook problems | ||
| Commonsense Reasoning | CommonsenseQA [143] | Commonsense Knowledge | ConceptNet | |
| HellaSwag [144] | Commonsense Inference | Activity Descriptions | ||
| PIQA [145] | Physical Commonsense | Crowdsourced Scenarios | ||
| SocialIQA [146] | Social Interactions | ATOMIC | ||
| Knowledge Utilization | Open-Domain Question Answering | NaturalQuestions [147] | Google Search | Real User Queries |
| TriviaQA [120] | Trivia | Trivia Websites | ||
| WebQuestions [148] | Freebase | Web Queries | ||
| KILT [149] | Multiple Tasks | Wikipedia | ||
| HELM [150] | Broad NLP | Suite of 42 scenarios across domains | ||
| TruthfulQA [151] | General Knowledge | Crowdsourced and known misconceptions | ||
| Tool-Augmented Reasoning | DROP [152] | Reading + arithmetic | Wikipedia passages | |
| TATQA [153] | Table QA | Financial reports + tables | ||
| ToolBench [154] | Tool-use tasks | APIs + real-world tools | ||
| API-Bank [155] | Tool-use tasks | APIs + real-world tools | ||
| Conversational Search & Retrieve | CoQA [156] | Conversational QA | Text from diverse domains | |
| QuAC [157] | QA over Wikipedia | Wikipedia articles + dialogues | ||
| MultiHopHotpotQA [121] | Multi-hop QA | Wikipedia + multi-step chains | ||
| MT Bench [158] | Multi-domain dialogue | Human-crafted multi-turn prompts |
| Model | Chatbot Arena | MMLU (General) | GPQA (Reasoning) | HumanEval (Coding) | Math | BFCL (Tool Use) | MGSM (Multilingual) | Input Cost / 1M Token |
|---|---|---|---|---|---|---|---|---|
| Qwen 2.5 | 1296 | 70.20% | 49% | 88% | 85% | 61.31% | - | - |
| Grok 2 | 1288 | 87.50% | 56% | 88.40% | 76.10% | - | - | $2.00 |
| LLAMA 3.3 | 1294 | 86% | 48% | 88.40% | 77% | 77.50% | 91.10% | - |
| DeepSeek V3 | 1318 | 88.50% | 59.10% | 82.60% | 90.20% | 57.23% | 79.80% | $0.27 |
| DeepSeek R1 | 1360 | 90.80% | 71.50% | - | 97.30% | - | - | $0.55 |
| Gemini 2.0 Flash | 1355 | 76.40% | 62.10% | - | 89.70% | - | - | $1.25 |
| Gemini 1.5 Pro | 1260 | 85.90% | 46.20% | 71.90% | 67.70% | 84.35% | 88.70% | $0.10 |
| Claude 3.5 Haiku | 1236 | 65% | 41.60% | 88.10% | 69.40% | 60% | 85.60% | $0.80 |
| Claude 3.5 Sonnet | 1283 | 88.30% | 65% | 93.70% | 78.30% | 90.20% | 91.60% | $3.00 |
| GPT-4o | 1374 | 88.70% | 53.60% | 90.20% | 76.60% | 83.59% | 90.50% | $2.50 |
| GPT-4.5 | 1398 | - | - | - | - | - | - | $75.00 |
| GPT-o1 | 1351 | 91.80% | 75.70% | 92.40% | 96.40% | 66.73% | 89.30% | $15.00 |
| GPT-o3-mini | 1304 | 86.90% | 70.70% | - | 97.90% | - | 92% | $1.10 |
| Model | Chatbot Arena | GPQA Diamond (Reasoning) | SWE-bench (Agent Coding) | Tool Use (Retail) | Tool Use (Airline) | MMMLU (Multilingual) | MMMU (Visual) | IFEval | MATH 500 (Math) | AIME 2024 | Input Cost / 1M Token |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 1360 | 71.50% | 49.20% | - | - | - | - | 83.30% | 97.30% | 79.80% | $0.55 |
| Claude 3.5 Sonnet | 1283 | 65.00% | 49.00% | 71.50% | 48.80% | 82.10% | 70.40% | 90.20% | 78.00% | 16.00% | $3.00 |
| Claude 3.7 Sonnet | 1296 | 68.00% | 70.30% | 81.20% | 58.40% | 83.20% | 71.80% | 90.80% | 82.20% | 23.30% | $3.00 |
| Claude 3.7 Sonnet thinking | 1302 | 84.80% | - | - | - | 86.10% | 75% | 93.20% | 96.20% | 80.00% | $3.00 |
| Grok 3 Beta | 1404 | 84.60% | - | - | - | - | 78.00% | - | - | 93.30% | $5.00 |
| OpenAI o3-mini (High) | 1325 | 79.70% | 49.30% | - | - | 79.50% | - | - | 97.90% | 87.30% | - |
| OpenAI o1 | 1351 | 78.00% | 48.90% | 73.50% | 54.20% | 87.70% | 78.20% | - | 96.40% | 83.30% | $15.00 |
| Benchmark | Scope and Focus | Data Composition | Evaluation Tasks | Key Insights |
|---|---|---|---|---|
| Project Gutenberg [224] | Classical philosophical text access in the public domain | 75k of digitized philosophical and humanities books from major thinkers across eras | Long-form reasoning, document understanding, language modeling | Serves as foundational source for philosophy QA, argument parsing, and training philosophically informed LLMs |
| PhilPapers [225] | Bibliographic indexing and philosophical research discovery | Index of over 2.5M philosophy papers with abstracts, metadata, and categorization across subfields | Document retrieval, topic modeling, citation graph analysis | Widely used in research discovery and argument mining; useful for concept clustering and academic stance tracking |
| Benchmark | Scope and Focus | Data Composition | Evaluation Tasks | Key Insights |
|---|---|---|---|---|
| POLITICS Dataset [253] | Ideology and stance detection from media coverage | 3.6M political news articles from diverse media sources (left, center, right) | Stance detection, ideology prediction, text classification | Highlights media bias and ideological polarization; useful for evaluating fairness and framing in political text models |
| U.S. Congressional Records [254] | Modeling formal political discourse and legislative language | Standardized collection of U.S. congressional transcripts and debates | Summarization, intent classification, dialogue analysis | Ideal for studying legal-parliamentary styles; enables models to learn structured argumentation in policymaking |
| American Presidency Project [255] | Presidential speech analysis and political rhetoric tracking | Public speeches, press conferences, and presidential communications from 1789 onward | Rhetorical style analysis, time-based policy comparison | Enables longitudinal study of U.S. executive discourse; supports leadership style profiling across administrations |
| Media Bias & Fact-check Dataset [256] | Detecting factual reliability and political bias in media | Fact-check ratings and bias scores from MediaBiasFactCheck and AllSides | Text labeling (bias/fact), misinformation detection | Supports training of bias-aware LLMs; widely used in political misinformation detection and source validation |
| TruthfulQA [257] | Evaluating truthful generation in politically sensitive contexts | Manually curated QA dataset focusing on false beliefs and politically sensitive facts | Question answering, hallucination detection | Used to stress-test LLMs on truthfulness and bias in politically charged content; benchmarks robustness to disinformation |
| Benchmark | Scope and Focus | Data Composition | Evaluation Tasks | Key Insights |
|---|---|---|---|---|
| LegalBench [338] | Legal reasoning across 162 subtasks from diverse real-world legal contexts | Expert-authored prompts covering six reasoning types (e.g., rule recall, rule application, interpretation) | Zero-shot and few-shot reasoning, structured answer generation | Provides a fine-grained taxonomy of legal reasoning skills; reveals that even top-tier LLMs struggle with multi-step statutory logic and implicit legal inference, making it ideal for stress-testing general-purpose LLMs in high-reasoning settings |
| LawBench [339] | Legal knowledge probing for Chinese LLMs across memory, understanding, and application tiers | 20 distinct tasks, including classification, extraction, and generation; built from 90K+ QA pairs sourced from Chinese laws and judicial documents | Legal QA, clause classification, summarization, legal sentence rewriting | Designed to test hierarchical legal comprehension in Chinese LLMs; emphasizes generalization limitations across task types and uncovers alignment gaps in open-source vs. proprietary models |
| LexGLUE [340] | English legal language understanding and document classification across European legal systems | Seven curated datasets (e.g., ECtHR, EUR-Lex, SCOTUS) labeled with case outcomes, topics, and rulings | Multiclass and multilabel classification, case judgment prediction | Pioneered standardization of English-language legal NLP tasks; facilitates benchmarking across both general-purpose LMs and legal-domain-pretrained models; bridges legal and general NLP evaluation efforts |
| LAiW [341] | Comprehensive evaluation suite for Chinese legal LLMs across major task families | Tasks include judgment prediction, statute extraction, article matching, and legal QA based on Chinese court records and codes | Classification, matching, QA, retrieval, generation | Captures real-world judicial task structures in China’s legal system; reveals domain misalignment in Chinese LLMs and encourages community efforts in localized, high-quality legal datasets |
| Swiss-Judgment-Prediction [342] | Multilingual legal case outcome prediction in Swiss federal courts | Over 85K real-world legal cases in German, French, and Italian, annotated with ruling outcomes and metadata | Judgment prediction, multilingual classification, temporal generalization | Highlights challenges in cross-lingual legal learning and domain temporal drift; supports development of multilingual legal LLMs, especially for underrepresented legal languages |
| LJP-IV [343] | Fine-grained Chinese criminal verdict prediction with inclusion of “innocent” cases | Extended version of Chinese LJP dataset with new trichotomous labels (guilty, partially guilty, innocent) | Few-shot, multi-label classification, reasoning over multiple charges | Fills gap in prior LJP datasets that lacked nuanced outcome granularity; pushes models to reason over exoneration cases and subtle charge distinctions in realistic criminal scenarios |
| Rank | Model | Accuracy | Cost (In / Out) | Latency (s) |
|---|---|---|---|---|
| 1 | Gemini 2.5 Pro Exp | 83.6% | $1.25 / $10.00 | 3.51 |
| 2 | Gemini 2.5 Flash | 82.8% | $0.15 / $0.60 | 0.43 |
| 3 | o3 | 82.5% | $10.00 / $40.00 | 5.14 |
| 4 | Grok 3 Mini (High Reasoning) | 82.0% | $0.60 / $4.00 | 4.92 |
| 5 | Grok 3 Beta | 82.0% | $3.00 / $15.00 | 0.44 |
| 6 | GPT-4.1 | 81.9% | $2.00 / $8.00 | 0.42 |
| 7 | Gemini 2.5 Flash (Thinking) | 81.8% | $0.15 / $3.50 | 2.66 |
| 8 | o1 Preview | 81.7% | $15.00 / $60.00 | 10.33 |
| 9 | Grok 3 Mini (Low Reasoning) | 81.6% | $0.60 / $4.00 | 3.38 |
| 10 | DeepSeek V3 | 80.1% | $0.90 / $0.90 | 4.13 |
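The leaderboard above trades accuracy against per-token price and latency. As an illustrative sketch only (values transcribed from a few rows of the table; the "blended" cost is a naive average of input and output prices, an assumption since real blends depend on the token mix), one can rank models by accuracy per blended dollar:

```python
# Rank a few leaderboard entries by accuracy per blended dollar.
# Blended cost = mean of input/output $/1M-token prices (a simplifying
# assumption; actual cost depends on the input/output token ratio).
models = [
    ("Gemini 2.5 Flash", 82.8, 0.15, 0.60),
    ("Grok 3 Mini (High Reasoning)", 82.0, 0.60, 4.00),
    ("GPT-4.1", 81.9, 2.00, 8.00),
    ("o3", 82.5, 10.00, 40.00),
]

def efficiency(entry):
    name, acc, cost_in, cost_out = entry
    blended = (cost_in + cost_out) / 2  # $/1M tokens, naive blend
    return acc / blended                # accuracy points per dollar

ranked = sorted(models, key=efficiency, reverse=True)
print(ranked[0][0])  # Gemini 2.5 Flash leads on this metric
```

Under this crude metric, cheap mid-tier models dominate frontier models whose accuracy lead is small relative to their price premium.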
| Task | Benchmark | Metric | Traditional | LLM-based | Δ | Reference |
|---|---|---|---|---|---|---|
| Market Analysis | FOMC | F1-Score | Bi-LSTM: 53.9% | RoBERTa-Large: 71.7% | +17.8% | [459] |
| Quantitative Trading | TRADEXPERT | AR | DeepTrader: 32.5% | TradExpert: 49.8% | +17.3% | [460] |
| Financial QA | TAT-QA | EM Score | MVGE: 70.9% | TAT-LLM (70B): 81.4% | +11.4% | [461] |
| Sentiment Analysis | FLARE | Accuracy | XGBoost: 80.0% | FinMA-30B: 87.0% | +7.0% | [462] |
| Sentiment Analysis | StockEmotions | F1-Score | Bi-GRU: 76.0% | DistilBERT: 81.0% | +5.0% | [463] |
| Stock Movement Prediction | ACL18 | Accuracy | StockNet: 58.2% | PloutosGPT: 61.2% | +3.0% | [464] |
| Stock Movement Prediction | CIKM18 | Accuracy | DTML: 58.6% | PloutosGPT: 59.9% | +1.3% | [464] |
| Model | UCFE | FinEval | R-Judge | CFinBench | FinBen | Hirano |
|---|---|---|---|---|---|---|
| Claude 3.5-Sonnet | — | 72.90 | — | — | — | 77.02 |
| Gemini 1.5-Pro | — | 69.20 | — | — | — | 57.94 |
| Gemini 1.5-Flash | — | 65.60 | — | — | — | 63.10 |
| Gemini-Pro | — | — | — | — | — | 50.52 |
| GPT-4o | 1,117.68 | 71.90 | 74.45 | — | — | 65.26 |
| GPT-4o-mini | 901.75 | 66.20 | — | — | — | — |
| GPT-4-Turbo | — | — | — | — | 39.20 | 64.59 |
| GPT-4 | — | — | — | 55.80 | — | 66.07 |
| LLaMA-3.1-70B | 912.26 | — | — | — | 36.20 | — |
| LLaMA-3.1-8B | 1,046.87 | — | — | — | 34.30 | — |
| LLaMA-3-70B | — | — | — | 47.02 | — | 58.48 |
| LLaMA-3-8B | — | — | 61.01 | 26.61 | 25.80 | 42.13 |
| LLaMA-2-70B | — | — | — | 29.27 | — | 41.96 |
| LLaMA-2-13B | — | — | 54.80 | 30.12 | — | 40.29 |
| LLaMA-2-7B | — | — | 53.74 | 28.33 | 19.90 | 40.67 |
| Qwen2.5-72B | — | 69.40 | — | — | — | — |
| Qwen2.5-14B | 855.82 | — | — | — | — | — |
| Qwen2.5-7B | 814.48 | 62.30 | — | — | 28.30 | — |
| Qwen2-72B | — | — | — | 34.70 | — | 69.35 |
| Qwen1.5-72B | — | — | — | 56.47 | — | 59.62 |
| Qwen1.5-7B | — | — | — | 46.35 | — | 49.73 |
| Qwen-72B | — | — | — | 57.72 | — | 59.08 |
| Category | Provider | Description | API | Link |
|---|---|---|---|---|
| General Financial Data | Alpaca | Provide API access to general financial data, e.g., real-time market data, historical market data, fundamental data, financial news. | ✓ | https://docs.alpaca.markets/ |
| | databento | | ✓ | https://databento.com/ |
| | EODHD | | ✓ | https://eodhd.com/ |
| | FMP | | ✓ | https://site.financialmodelingprep.com/ |
| | yfinance | | ✓ | https://yfinance-python.org/ |
| | Yahoo Finance | Official website of Yahoo Finance. | | https://finance.yahoo.com/ |
| | CRSP | Provide databases for economic forecasting, stock market research, and financial analysis. | | https://www.crsp.org/ |
| Cryptocurrency Data | CoinMarketCap | Provide real-time and historical crypto market data. | ✓ | https://coinmarketcap.com/api/ |
| SEC filings | SEC API | Provide SEC filings, e.g., 10-Q, 10-K, 8-K filings. | ✓ | https://sec-api.io/ |
| Analyst Reports | Seeking Alpha | Provide news, analysis, and commentary from investors. | | https://seekingalpha.com/ |
| News | GNews | Provides an API to search for articles on Google News. | ✓ | https://github.com/ranahaani/GNews |
| | Bloomberg | Official website of Bloomberg. | | https://www.bloomberg.com/ |
| | CNBC | Official website of CNBC. | | https://www.cnbc.com/ |
| | WSJ | Official website of WSJ. | | https://www.wsj.com/ |
| Social Media Data | X API | Provide programmatic access to X. | ✓ | https://docs.x.com/x-api/introduction |
| | Reddit API | Provide programmatic access to Reddit. | ✓ | https://www.reddit.com/dev/api/ |
| Dataset | Project Path | Description |
|---|---|---|
| GA4 E-commerce Sample | bigquery-public-data.ga4_obfuscated_sample_ecommerce | Anonymized GA4 event data for e-commerce customer behavior and marketing funnel analysis. |
| TheLook E-commerce | bigquery-public-data.thelook_ecommerce | Synthetic e-commerce data for customer segmentation, sales forecasting, and marketing analytics. |
| Google Trends | bigquery-public-data.google_trends | Keyword search interest data over time, useful for public interest and trend analysis. |
| GDELT | gdelt-bq | Global news event metadata, including sentiment analysis for brand exposure and market event tracking. |
| Google Ads Transparency | bigquery-public-data.google_ads_transparency | Data on political and issue ads, supporting research on advertising transparency and audience targeting. |
| Taxonomy | Contributions | Representative Works | Citations |
|---|---|---|---|
| Mathematical Proof Assistant | Automated theorem proving to generate reasoning steps or convert mathematical statements into formal languages. | AlphaProof [655]: A system developed by Google DeepMind that trains itself to prove mathematical statements in the formal language Lean. | [655,656,657,658,659,660,661,662,663] |
| Theoretical Exploration and Pattern Recognition | Generation of conjectures, analysis of complex dynamical systems, and development of heuristics for solving open problems. | FunSearch [648]: An LLM framework with a systematic evaluator for mathematical exploration. | [648,664,665,666] |
| Mathematical Education | LLMs for conceptual understanding, problem generation, and automated assessment. | Zhang et al. [667]: A study that investigates whether LLMs can enhance learning outcomes in mathematics when used as tutoring tools. | [667,668,669,670,671] |
| Strategy | Answer Exposure Condition | Practice Phase | Test Phase |
|---|---|---|---|
| Answer Only | See Answer First | 85% - 90% | 50% - 52% |
| Try It First | 30% - 35% | 52% - 55% | |
| Stock LLM | See Answer First | 85% - 90% | 50% - 55% |
| Try It First | 35% - 40% | 65% - 70% | |
| Customized LLM | See Answer First | 85% - 90% | 55% - 60% |
| Try It First | 35% - 40% | 65% - 70% |
| Category | Benchmark | Description | Link |
|---|---|---|---|
| Competition | MATH [129] | High school & competition math | Github |
| Omni-MATH [688] | 4K competition-level problems | Github | |
| AIME [689] | American invitational mathematics examination | Kaggle | |
| Education | CMATH [690] | Chinese elementary school math problems. | Github |
| SAT-MATH [691] | Multiple-choice problems from SAT math exams. | Github | |
| GSM8K [140] | 8K grade school math questions | Github | |
| GAOKAO-Math [692] | Chinese high school mathematics questions from 2010 to 2022 | Github | |
| Math Word Problem | MAWPS [693] | 3K math word problems from web-sourced corpora | Github |
| ASDiv [694] | 2K elementary school level math word problems | Github | |
| SVAMP [695] | Advanced version of ASDiv and MAWPS | Github | |
| Math Reasoning | AQuA-RAT [696] | 100K algebraic word problems | Huggingface |
| MathQA [697] | Enhanced version of AQuA-RAT | Huggingface | |
| PRM800K [698] | 800K math problems with step-wise labels. | Github | |
| TheoremQA [699] | Theorems-driven QA dataset | Github | |
| Lila [700] | Unified reasoning benchmark contains 20 existing datasets | Github | |
| MathInstruct [701] | Instruction-following style reasoning dataset. | Github |
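Most of the math benchmarks above score a model by exact match on the final answer after normalization. The sketch below shows a simplified version of that scoring step, assuming a "last number in the response" extraction rule; official scripts for benchmarks such as GSM8K apply their own, more elaborate extraction logic:

```python
import re

def extract_answer(text: str) -> str:
    """Pull the last number out of a model response.
    Simplified stand-in for the answer-extraction step used by
    grade-school math benchmarks; real scorers differ in detail."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else ""

def exact_match(pred: str, gold: str) -> bool:
    # Both sides are normalized so "1,234" and "1234" compare equal.
    return extract_answer(pred) == extract_answer(gold)

print(exact_match("... so the total is 1,234 apples.", "1234"))  # True
```

This rigidity is why chain-of-thought outputs must end with a clearly stated final answer: correct reasoning with a malformed final line still scores zero.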
| Model | Cost ($/1M Tok) | MATH | MATH 500 | AIME 2024 | AIME 2025 | GSM8K |
|---|---|---|---|---|---|---|
| Qwen 2.5 | – | 85.00% | 82.20% | – | – | 91.50% |
| Grok 3 | 5.00 | – | – | 93.30% | 93.30% | 89.30% |
| Llama 3.3 | – | 77% | – | – | – | – |
| DeepSeek R1 | 0.55 | 97.30% | 97.30% | 79.80% | 65.00% | 96.13% |
| Gemini 2.0 Flash | 1.25 | 89.70% | – | – | 30.00% | 95.53% |
| Claude 3.5 Sonnet | 3.00 | 78.30% | 78.00% | 16.00% | 3.33% | – |
| Claude 3.7 Sonnet | 3.00 | – | 96.20% | 80.00% | – | – |
| GPT-4o | 2.50 | 76.60% | – | – | 13.33% | 95.58% |
| GPT-o1 | 15.00 | 66.73% | 97.90% | 83.30% | 78.33% | 96.13% |
| GPT-o3-mini | 1.10 | – | 96.40% | 87.30% | 80.00% | 95.83% |
| Model | Attr.↑ (Sim) | Spat.↑ (Sim) | Inst.↑ (Sim) | Avg.↑ (Sim) | ↓ (Sim) | Attr.↑ (Wild) | Spat.↑ (Wild) | Inst.↑ (Wild) | Avg.↑ (Wild) | ↓ (Wild) |
|---|---|---|---|---|---|---|---|---|---|---|
| o1-Preview | 0.729 | 0.707 | 0.624 | 0.687 | 15.6% | 0.595 | 0.612 | 0.542 | 0.583 | 17.5% |
| GPT-4-Turbo | 0.658 | 0.621 | 0.488 | 0.589 | 18.2% | 0.526 | 0.541 | 0.478 | 0.515 | 24.5% |
| Claude-3.5-Sonnet | 0.687 | 0.608 | 0.482 | 0.593 | 15.6% | 0.529 | 0.508 | 0.430 | 0.489 | 14.3% |
| GPT-4o | 0.623 | 0.593 | 0.379 | 0.565 | 21.4% | 0.460 | 0.462 | 0.408 | 0.444 | 28.5% |
| BlenderGPT | 0.574 | 0.540 | 0.444 | 0.519 | 25.2% | 0.402 | 0.425 | 0.368 | 0.398 | 35.0% |
| Gemini-1.5-Pro | 0.535 | 0.483 | 0.387 | 0.468 | 30.2% | 0.375 | 0.404 | 0.361 | 0.380 | 38.0% |
| BlenderLLM | 0.846 | 0.767 | 0.626 | 0.747 | 3.2% | 0.717 | 0.614 | 0.493 | 0.608 | 5.0% |
| Model | Interface Factuality↑ | Interface Recall↑ | Feature Recall↑ | Feature Property Recall↑ | Feature Dimension |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 0.85±0.10 | 0.71±0.13 | 0.80±0.10 | 0.22±0.10 | 0.95±0.05 |
| GPT-4o | 0.79±0.11 | 0.64±0.13 | 0.55±0.12 | 0.22±0.11 | 0.95±0.05 |
| Gemini-1.5-Pro | 0.54±0.14 | 0.43±0.14 | 0.39±0.10 | 0.15±0.09 | 0.86±0.14 |
| Gemma-2-27B-IT | 0.69±0.13 | 0.50±0.14 | 0.14±0.08 | 0.11±0.07 | - |
| Gemma-2-9B-IT | 0.70±0.15 | 0.43±0.14 | 0.06±0.04 | 0.07±0.07 | - |
| CodeGemma-7B-IT | 0.45±0.13 | 0.21±0.11 | 0.17±0.09 | 0.07±0.07 | - |
| Type of Task | Benchmarks | Introduction | Cross Tasks |
|---|---|---|---|
| Geoscientific Understanding | GeoBench [1503] | A collection of pure text questions related to geology, geography and environmental science. | - |
| BB-GeoEval [1514] | Composed of 750 questions related to geoscience, including both objective and subjective questions. | - | |
| GeoChat [1506] | A collection of 6 datasets across 3 tasks constructed for remote sensing VLM evaluation. | - | |
| RSIEval [1471] | Self-collected image-caption pairs and image-question-answer triplets based on 100 Remote sensing images. | Remote sensing image captioning; Remote sensing VQA. | |
| GeoCode [1476] | Composed of 18k single-turn tasks and 1.3k multi-turn tasks for data analysis code generation evaluation. | - | |
| GeoLLM-Engine [1475] | A collection of 7 datasets across 3 kinds of tasks constructed for the evaluation of geospatial copilots in earth observation capacity. | UI/Web navigation. | |
| Urban Planning | PlanGPT [1499] | A collection of 3 datasets across 4 tasks for urban and spatial planning evaluation. | Urban planning document generation & evaluation. |
| Model | NPEE (Objective) | APTest (Objective) | Perplexity↓ (Subjective) | GPTScore (Subjective) |
|---|---|---|---|---|
| Gal-6.7B | 25.7 | 29.9 | 34.57 | -2.3598 |
| LLaMA-7B | 21.6 | 27.6 | 40.07 | -1.9531 |
| MPT-7B | 28.4 | 26.0 | - | - |
| Vicuna-7B | 26.4 | 16.8 | - | - |
| GeoLLaMA-7B | - | - | 32.32 | -1.9457 |
| Alpaca-7B | 31.1 | 29.1 | 40.07 | -1.9536 |
| K2-7B | 39.9 | 29.3 | 32.32 | -1.9487 |
| ChatGPT | 48.8 | 20.0 | - | - |
| GeoGalactica-30B | 46.6 | 36.9 | - | - |
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE_L | CIDEr | CLAIR |
|---|---|---|---|---|---|---|---|---|
| BLIP2 | 0.15 | 0.07 | 0.03 | 0.01 | 3.40 | 11.57 | 0.09 | 27.40 |
| MiniGPT4 | 47.89 | 33.46 | 23.71 | 17.26 | 19.87 | 35.32 | 16.25 | 41.15 |
| LLaVA | 45.69 | 34.15 | 25.76 | 19.68 | 21.30 | 39.46 | 29.41 | 48.95 |
| RSGPT | 47.85 | 36.06 | 27.97 | 22.07 | 23.58 | 41.73 | 31.35 | 50.95 |
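The captioning scores above rest on n-gram overlap between a generated caption and reference text. As a self-contained illustration (a toy, sentence-level simplification: full BLEU averages clipped precisions up to 4-grams and is normally computed corpus-level), BLEU-1 reduces to clipped unigram precision with a brevity penalty:

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision with brevity penalty -- a toy,
    single-reference simplification of the BLEU-1 column above."""
    cand, ref = candidate.split(), reference.split()
    # Counter intersection clips each word's count at its reference count.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("a satellite image of a dense urban area",
              "satellite image of an urban area")
```

Clipping is what stops a degenerate caption that repeats one common word from scoring highly, and the brevity penalty stops trivially short captions from maximizing precision.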
| Model | Presence | Quantity | Color | Absolute pos. | Relative pos. | Area comp. | Road dir. | Image | Scene | Reasoning | Avg accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP2 | 39.48/39.74 | 10.83/7.50 | 20.69/25.86 | 1.10/4.40 | 2.50/2.50 | 64.58/64.58 | 60.00/60.00 | 45.83/45.83 | 8.89/8.89 | 29.09/29.09 | 31.44/32.04 |
| MiniGPT4 | 34.55/59.22 | 20.00/15.83 | 29.31/37.93 | 6.59/20.88 | 7.50/22.50 | 18.75/27.08 | 0.00/0.00 | 63.54/60.42 | 46.67/73.33 | 14.55/23.64 | 26.83/37.87 |
| LLaVA | 81.04/80.78 | 31.67/33.33 | 72.41/72.41 | 34.07/41.76 | 27.50/27.50 | 62.50/62.50 | 60.00/60.00 | 92.71/94.79 | 75.56/84.44 | 54.55/54.55 | 66.52/68.01 |
| RSGPT | 80.52/80.00 | 36.67/37.50 | 75.86/72.41 | 34.07/31.87 | 45.00/40.00 | 54.17/56.25 | 60.00/60.00 | 80.21/78.12 | 77.78/82.22 | 50.91/47.27 | 66.13/65.07 |
| Model | Pass@1 (GeoCode-GEE) | F1 (GeoCode-GEE) | Pass@1 (GeoCode-Others) | F1 (GeoCode-Others) |
|---|---|---|---|---|
| LLaMA3.1-8B | 0.76 | 0.78 | 0.45 | 0.66 |
| CodeGemma-7B | 0.86 | 0.75 | 0.58 | 0.69 |
| Phi3.5 mini-3.8B | 0.66 | 0.70 | 0.50 | 0.66 |
| Qwen 2-7B | 0.61 | 0.76 | 0.39 | 0.63 |
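Pass@1 figures like those above are conventionally computed with the unbiased pass@k estimator introduced with HumanEval: draw n samples per problem, count the c that pass, and estimate the probability that at least one of k drawn samples would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated, c = samples that pass, k = draw budget.
    Returns P(at least one of k random draws passes)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 4 correct, budget of 1 draw:
print(pass_at_k(10, 4, 1))  # 0.4
```

The complement term C(n-c, k) / C(n, k) is the chance that all k draws land on failing samples; subtracting it from 1 gives the pass probability, averaged over problems in practice.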
| Model | Generating (PlanEval) | Style Transfer (PlanEval) | Style Transfer (Acc) | Information Extraction (Acc) | Text Evaluation (F1) |
|---|---|---|---|---|---|
| ChatGLM | 47.67 | 63.94 | 50.00 | 26.00 | 25.67 |
| Yi-6B | 16.00 | 15.41 | - | 20.00 | 8.33 |
| Baichuan2-13B-Chat | 62.67 | 43.90 | 50.32 | 33.00 | 17.42 |
| ChatGPT | 74.67 | 66.12 | - | 31.00 | 21.30 |
| PlanGPT | 60.33 | 66.80 | 65.18 | 41.00 | 35.28 |
| Category | Benchmark | LLMs Tested | Description | Link |
|---|---|---|---|---|
| Code Generation | HumanEval [49] | ✓ | A comprehensive coding benchmark released by OpenAI. | Github |
| EvalPlus [1598] | ✓ | A code synthesis evaluation framework built upon HumanEval. | Github | |
| APPS [1617] | ✓ | 10k code generation problems | Huggingface | |
| CodeXGLUE [1618] | ✓ | Fourteen datasets from ten diversified programming language tasks. | Github | |
| CodeContests [1619] | ✓ | Competitive programming dataset released by DeepMind. | Github | |
| LeetCodeDataset [1620] | ✓ | LeetCode problems with a hundred test cases per problem; Supports contamination-free evaluation and SFT. | Github, Huggingface | |
| WikiSQL [1621] | × | A large crowd-sourced dataset for developing natural language interfaces for relational databases. | Github | |
| Aider Polyglot [1622] | ✓ | 697 coding problems in C++, Go, Java, JS, Python, and Rust. | Github | |
| Code Debugging | SWE-Bench [1623] | ✓ | Benchmark for evaluating LLMs on software issues from Github. | Website |
| DebugBench [1604] | ✓ | Evaluating Debugging Capability of Large Language Models. | Github | |
| Defects4J [1624] | × | A collection of reproducible Java bugs. | Github | |
| QuixBugs [1625] | × | A multi-lingual program repair benchmark based on Quixey. | Github | |
| BugsInPy [1626] | × | A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies. | Github | |
| DebugEval [1577] | ✓ | A benchmark for LLM debugging tasks, i.e., bug localization and identification, code review, and code repair. | Github | |
| MdEval [1627] | ✓ | A multilingual debugging benchmark covering eighteen programming languages. | Github | |
| Code Analysis | CodeSearchNet [1628] | ✓ | 2M code and comment pairs from open source libraries on Github. | Huggingface |
| CodeMMLU [1629] | ✓ | Benchmark to evaluate LLMs in coding and software knowledge. | Github | |
| LiveCodeBench [131] | ✓ | Five hundred coding problems in self-repair, code execution, and test output prediction tasks. | Github | |
| Circuit Design | RTLLM [1630] | ✓ | An open-source benchmark for design RTL generation with LLMs. | Github |
| VerilogEval [1631] | ✓ | Benchmark for Verilog code generation for hardware design and verification. | Github | |
| MG-Verilog [1632] | ✓ | A multi-grained dataset towards enhanced LLM-assisted Verilog generation. | Github | |
| AssertEval [1633] | ✓ | Open-source benchmark for evaluating LLM’s assertion generation capabilities for RTL verification. | Github | |
| RTLCode [1633] | ✓ | An 80K instruction-code dataset for LLM RTL generation. | Github | |
| PICBench [1634] | ✓ | Benchmark for photonic integrated circuits design. | Github |
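Code-generation benchmarks such as HumanEval judge a completion by executing it against hidden unit tests: a sample scores only if every assertion passes. The sketch below illustrates that scoring rule in its simplest form; it is deliberately unsandboxed, whereas real harnesses run each candidate in a separate process with timeouts and resource limits, since `exec` on untrusted model output is unsafe:

```python
def check_candidate(candidate_src: str, test_src: str) -> bool:
    """Execute a generated function, then its benchmark tests.
    Unsandboxed illustration of execution-based scoring only --
    production harnesses isolate this in a subprocess."""
    env: dict = {}
    try:
        exec(candidate_src, env)   # define the candidate function
        exec(test_src, env)        # assertions raise on failure
        return True
    except Exception:
        return False

ok = check_candidate(
    "def add(a, b):\n    return a + b\n",
    "assert add(2, 3) == 5\n",
)
print(ok)  # True
```

Because scoring is purely behavioral, a stylistically poor but functionally correct completion scores full marks, while an elegant off-by-one solution scores zero; this is the property the pass rates in the table below measure.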
| Model | HumanEval | EvalPlus | CodeMMLU | Aider Polyglot | SWE-Bench | PICBench |
|---|---|---|---|---|---|---|
| Qwen 2.5 | 88.00% | 87.21% | 56.40% | 26.20% | – | – |
| Grok 3 | – | – | – | 53.30% | – | – |
| Llama 3.3 | 88.40% | – | 43.91% | – | – | – |
| DeepSeek R1 | – | – | – | 56.90% | 49.20% | – |
| Gemini 2.0 Flash | – | – | 59.81% | 35.60% | 52.20% | – |
| Claude 3.5 Sonnet | 93.70% | 81.70% | 61.65% | 51.60% | 49.00% | 13.33 |
| Claude 3.7 Sonnet | – | – | 67.00% | 64.90% | 62.30% | – |
| GPT-4o | 90.20% | 87.20% | – | 45.30% | – | 14.17 |
| GPT-o1 | 92.40% | 89.00% | 62.36% | – | 48.90% | 8.33 |
| GPT-o3-mini | – | – | – | 60.40% | 49.30% | – |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).