2. Evolution of AI: From Pre-Trained Models to Agentic Systems
This section provides a chronological overview of how AI evolved from static language models to agentic systems capable of autonomous reasoning and interaction.
2.1. Stage 1: Transformer-Based Pre-Trained Models (2018–2020)
The advent of transformer-based pre-trained models from 2018 to 2020 marked a transformative milestone in artificial intelligence (AI), enabling large-scale language processing and laying the groundwork for autonomous agentic systems. We argue that this stage introduced scalable architectures but exposed critical limitations in reasoning and intent alignment, driving subsequent advancements. This subsection examines the transformer architecture, Mixture of Experts (MoE), and pre-trained models like GPT-2 [3] and GPT-3 [4], detailing their technical components, performance, and constraints, with implications for applications in resource-constrained environments.
2.1.1. Transformer Architecture Fundamentals
The transformer architecture, introduced by Vaswani et al. [1], marked a turning point in natural language processing by abandoning recurrence in favor of a purely attention-based mechanism. Its design follows a modular pipeline, where each stage plays a distinct role in transforming raw text into contextualized semantic representations.
The pipeline begins with tokenization, a crucial preprocessing step that converts unstructured text into discrete sub-word units using algorithms such as Byte-Pair Encoding (BPE) [12]. This step ensures that rare and compound words are broken down into smaller, reusable segments, enabling the model to generalize across a large vocabulary while maintaining computational tractability.
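To make the merge procedure concrete, the following is a minimal sketch of BPE vocabulary learning on a toy word-frequency table. The corpus, the function names, and the tie-breaking rule are assumptions chosen for illustration; production tokenizers add byte-level handling and end-of-word markers that are omitted here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs over a {symbol-tuple: frequency} table."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # Ties are broken by comparing the pairs themselves (an arbitrary,
        # deterministic choice made for this sketch).
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Toy word frequencies (illustrative only).
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
```

After three merges the frequent suffix of "newest" and "widest" has become a single reusable symbol, which is the behaviour the paragraph above describes for rare and compound words.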
Once tokens are produced, they are mapped to input embeddings. This involves two components: (i) token embeddings, which encode the identity of each sub-word, and (ii) positional encodings, which encode the position of each token in the sequence (fixed sinusoidal functions of absolute position in the original design [1]). Combining the two allows the model to capture both the meaning and the order of the input text; without positional information, self-attention has no notion of token order. The representations become contextual only after passing through the attention layers described next.
The hallmark of the transformer is its multi-head self-attention mechanism. At each layer, every token computes weighted interactions with all other tokens in the sequence using three learned projections: query, key, and value. This allows the model to attend to relevant context at any distance within the input window, capturing long-range dependencies, syntactic structure, and semantic relationships. Unlike recurrent neural networks (RNNs), which process tokens sequentially and are prone to vanishing gradients, transformers process all positions of a sequence in parallel during training, dramatically improving training efficiency and scalability [13].
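The query/key/value interaction can be illustrated with a single attention head in plain Python. This is a sketch under simplifying assumptions: the learned projection matrices are replaced by the identity (queries, keys, and values are all the raw inputs), and the helper names are hypothetical. Multi-head attention runs several such heads with different learned projections and concatenates their outputs.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain-list matrix product of a (n x d) and b (d x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Three toy token vectors of dimension 2 (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention(x, x, x)
```

Each row of `weights` is a probability distribution over all tokens in the sequence, so every output vector is a weighted mix of the value vectors regardless of how far apart the tokens are.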
Following self-attention, each token's representation is passed through a position-wise feed-forward network (FFN). This component projects vectors into a higher-dimensional space, applies a non-linear activation (typically ReLU or GELU), and projects them back to the model dimension. It refines the features extracted by attention before they are propagated to the next layer.
An overview of the transformer block architecture, highlighting the flow from token embeddings through multi-head self-attention and feed-forward layers, is illustrated in Figure 1.
2.1.2. Mixture of Experts (MoE)
As transformer models grew deeper and wider, their computational demands became a limiting factor, especially during inference. To overcome this, the Mixture of Experts (MoE) architecture was introduced as a scalable alternative to dense feed-forward networks [2]. The key idea behind MoE is to divide the standard feed-forward layer into a collection of specialized sub-networks, known as experts, where each expert learns distinct patterns from the data.
Figure 2 provides a schematic representation of the MoE architecture, emphasizing the sparse expert activation controlled by the gating mechanism.
Instead of activating all experts for every token, MoE employs a learnable gating mechanism (a router) that dynamically selects only a small subset of experts, typically the top-1 or top-2, for each input token. Only the experts the router scores highest are evaluated, allowing the model to preserve high representational capacity while significantly reducing computation. For instance, an MoE layer with 8 experts and a top-2 gating strategy activates only 2 experts per token, so the expert feed-forward computation is 2/8 = 25% of what activating all experts would cost (ignoring the small routing overhead), with quality that can remain comparable in practice.
This sparse activation strategy introduces both efficiency and specialization. Efficiency comes from not having to evaluate every expert for every token, making it feasible to train and deploy much larger models. Specialization emerges as different experts focus on different linguistic, semantic, or structural aspects of the data, allowing the model to better adapt to diverse inputs.
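A minimal sketch of top-k routing is given below. It makes several assumptions for brevity: the router logits are supplied directly rather than computed from the token by a learned linear layer, the experts are toy scaling functions, and the gate weights are a softmax over only the selected logits (one common choice, not the only one).

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gate = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for index, weight in gate:
        y = experts[index](x)  # the other experts are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [index for index, _ in gate]

# Eight toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # illustrative router scores
output, active = moe_layer([1.0, -1.0], experts, logits, k=2)
fraction_active = len(active) / len(experts)
```

Here `fraction_active` equals 2/8, matching the 25% figure in the text: six of the eight expert functions are never executed for this token.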
The introduction of MoE layers proved to be a turning point in the architecture of large language models [14]. It enabled the training of models with billions of parameters while keeping inference tractable, an essential advancement for pushing the limits of generative AI in real-world, resource-constrained scenarios.
2.1.3. Pre-Trained Models: GPT-2 and GPT-3
The advent of GPT-2 and GPT-3 marked a major leap in generative language modeling. These models, built upon the transformer architecture and scaled to 1.5B parameters for GPT-2 and 175B for GPT-3, demonstrated impressive capabilities in generating fluent and contextually consistent text [3, 4]. Trained on massive corpora such as WebText (GPT-2) and filtered Common Crawl (GPT-3), they employed a simple yet powerful objective: next-token prediction.
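For reference, the next-token prediction objective can be written as the standard autoregressive negative log-likelihood. The symbols are notation introduced here rather than taken from the cited papers: $\theta$ denotes the model parameters, $x_t$ the $t$-th token, and $T$ the sequence length.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Minimizing this loss over a large corpus is the only training signal at this stage, which is why the resulting models are fluent but not explicitly aligned with user intent.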
GPT-2 achieved strong performance in zero-shot settings, including on benchmarks such as LAMBADA. GPT-3 extended these capabilities further, showcasing zero-shot, one-shot, and few-shot learning by conditioning on examples placed directly in the prompt. For instance, in the few-shot setting it reached 86.4% accuracy on LAMBADA and 71.2% on TriviaQA without fine-tuning [4].
Despite these advances, both models remained fundamentally limited. They operated on surface-level statistical patterns without understanding or reasoning. Their outputs, while coherent, were not aligned with user goals, lacked memory, and could not adapt dynamically to changing tasks or environments. These constraints became critical in real-world domains such as robotics or healthcare, where intent-awareness and external knowledge access are essential.
Nevertheless, GPT-2 and GPT-3 laid the groundwork for scalable generative AI. They proved that large-scale pretraining could generalize across tasks, inspiring the need for better alignment mechanisms—such as instruction fine-tuning—which we explore in the next section.
2.2. Stage 2: Instruction Fine-Tuning (2021)
The second stage in the evolution of large language models introduced a paradigm shift by moving from generic pre-trained text generation to models that follow explicit human instructions. This process, called instruction fine-tuning, aligns model behavior with task-specific goals and user expectations [5].
At the core, instruction fine-tuning consists of training a base generative model using supervised examples—pairs of prompts and corresponding high-quality responses. These examples span a wide range of tasks (e.g., classification, summarization, translation), enabling the model to learn not just what to predict, but how to structure its outputs based on explicit instructions.
Why it matters: Instruction fine-tuning enhances:
Controllability: The model responds more accurately to varied user intents.
Safety and factuality: Trained models are less likely to hallucinate or produce harmful content.
Task-specific performance: It enables models to generalize better across downstream tasks without retraining.
Prompt Engineering: A Complementary Technique
In parallel to instruction fine-tuning, prompt engineering emerged as a practical method to guide and shape model behavior without modifying weights. It plays a critical role in extracting reliable, interpretable, and domain-aligned outputs from a single model [15].
Several advanced prompting strategies have been introduced to improve reasoning, reduce hallucinations, and increase alignment:
- Chain-of-Thought (CoT): Introduces intermediate reasoning steps to help the model think through problems (e.g., math or logic). This makes its outputs more traceable and reduces logical errors [16].
  Example: To solve “What is 15% of 80?”, the model first converts 15% to decimal (0.15), then multiplies: $0.15 \times 80 = 12$.
- Tree-of-Thought (ToT): Explores multiple reasoning paths like a decision tree, which prevents premature conclusions and encourages diverse solution paths [17].
  Example: For the question “How can a company increase profits?”, the model evaluates different branches:
  - Option 1: Cut operational costs.
  - Option 2: Raise product prices.
  - Option 3: Expand to new markets.
- ReAct (Reasoning + Acting): Combines reasoning with real-time action (e.g., tool use or fact-checking), improving factual grounding [18].
  Example: Q: “What is the capital of France?”
  Thought: I need to verify.
  Action: Search “capital of France”.
  Observation: Paris.
  Final Answer: Paris.
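The ReAct trace above can be reproduced with a small control loop. The sketch below is illustrative only: `scripted_model` is a hypothetical stand-in for an LLM, the `search` tool is backed by a one-entry dictionary, and the `Action: tool["argument"]` syntax is an assumed convention rather than a format prescribed by the ReAct paper.

```python
def scripted_model(transcript):
    """Hypothetical stand-in for an LLM: emits the next ReAct step as text."""
    if "Observation:" not in transcript:
        return 'Thought: I need to verify.\nAction: search["capital of France"]'
    return "Final Answer: Paris"

def search(query):
    """Toy search tool backed by a tiny in-memory index."""
    toy_index = {"capital of France": "Paris"}
    return toy_index.get(query, "no result")

TOOLS = {"search": search}

def react(question, model, tools, max_steps=3):
    """Alternate model steps and tool calls until a final answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += "\n" + step
        if step.startswith("Final Answer:"):
            return step[len("Final Answer:"):].strip(), transcript
        # Parse the assumed `Action: tool["argument"]` line.
        action = next(line for line in step.splitlines()
                      if line.startswith("Action:"))
        name, _, arg = action[len("Action:"):].strip().partition("[")
        observation = tools[name](arg.rstrip("]").strip('"'))
        transcript += f"\nObservation: {observation}"
    return None, transcript

answer, transcript = react("What is the capital of France?", scripted_model, TOOLS)
```

The key design point is that the observation is appended to the transcript before the model is called again, so the final answer is conditioned on the tool result rather than on the model's parametric memory alone.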
Structuring Outputs: Techniques such as Few-Shot Prompting and Role-Based Prompting provide contextual anchors that help the model produce outputs in a well-defined format and tone. This is especially important in high-stakes domains such as legal drafting, scientific summarization, or instructional design, where consistency and precision are critical.
Iterative Techniques: Methods like Self-Consistency [19] and Prompt Chaining reduce generation variability by enforcing redundancy or decomposing tasks into sequential steps. By validating outputs across multiple runs or maintaining logical continuity through chained prompts, these techniques significantly mitigate hallucination risks.
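Self-Consistency reduces to a majority vote over the final answers of several independently sampled reasoning paths. The sketch below assumes the sampled answers are already available as strings (the list shown is invented for illustration) and that answers are numeric, so they can be normalized before voting.

```python
from collections import Counter

def normalize(answer):
    """Map equivalent numeric strings (e.g. '12' and '12.0') to one form."""
    return str(float(answer))

def self_consistent_answer(sampled_answers):
    """Return the most common normalized answer and its agreement rate."""
    votes = Counter(normalize(a) for a in sampled_answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(sampled_answers)

# Final answers from six hypothetical reasoning paths for "What is 15% of 80?".
samples = ["12", "12.0", "1.2", "12", "120", "12"]
answer, agreement = self_consistent_answer(samples)
```

The agreement rate doubles as a rough confidence signal: a low value suggests the question should be retried or escalated rather than answered.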
Key Benefit: Prompt engineering serves not only as a performance enhancer but also as a control mechanism against hallucinations. By explicitly guiding the model’s reasoning process, reinforcing structural expectations, and encouraging factual alignment with reference data or tools, prompt engineering transforms LLMs from stochastic generators into more reliable and trustworthy assistants.
In summary, instruction fine-tuning and prompt engineering together mark the transition from passive language generators to reliable, domain-aware assistants. While instruction tuning aligns the internal representations with human intent, prompt engineering provides an external layer of guidance to extract the best reasoning and factual quality from the model at runtime.
2.3. Stage 3: Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) represents a pivotal advancement in aligning large language models (LLMs) with human values, beyond static instruction tuning. Traditional supervised fine-tuning treats target outputs as ground truth, regardless of subjective quality. In contrast, RLHF introduces an adaptive loop that incorporates human evaluations into the optimization process, allowing models to be trained not only to follow instructions but also to exhibit preferred behavioral traits such as helpfulness, harmlessness, and honesty [5, 20].
The process begins by generating multiple candidate completions for a given prompt using the pre-trained model. Human annotators then rank these outputs according to preference. These ranked examples are transformed into pairwise comparisons and used to train a reward model $r_\phi(x, y)$, typically a neural network that estimates the alignment quality of a given output $y$ conditioned on input $x$. The reward model learns to assign higher scores to preferred outputs and lower scores to undesirable ones, effectively approximating human judgment [6].
Formally, given a pair of responses $(y_w, y_l)$ for the same prompt $x$, where $y_w$ is preferred over $y_l$, the reward model is optimized to satisfy:
$$r_\phi(x, y_w) > r_\phi(x, y_l)$$
using a loss function such as the pairwise logistic loss or Bradley-Terry loss [21]:
$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]$$
where $\sigma$ is the logistic sigmoid and $\mathcal{D}$ is the dataset of human comparisons.
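The pairwise logistic loss described above can be checked numerically with a few lines of Python. This is a sketch on scalar reward scores; the function names and the example scores are assumptions, and a real reward model would produce these scores from a neural network and average the loss over a batch.

```python
import math

def pairwise_logistic_loss(r_preferred, r_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed stably.

    Uses the identity -log(sigmoid(m)) = softplus(-m)
    = max(-m, 0) + log(1 + exp(-|m|)).
    """
    margin = r_preferred - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def mean_loss(score_pairs):
    """Average the loss over (preferred_score, rejected_score) pairs."""
    return sum(pairwise_logistic_loss(w, l) for w, l in score_pairs) / len(score_pairs)

well_ordered = pairwise_logistic_loss(2.0, 0.0)  # preferred output scored higher
mis_ordered = pairwise_logistic_loss(0.0, 2.0)   # preferred output scored lower
undecided = pairwise_logistic_loss(1.0, 1.0)     # equal scores
```

The loss is small when the preferred response already receives the higher score and grows roughly linearly with the margin when the ordering is wrong, which is what pushes the reward model toward the human ranking.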
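Once the reward model is trained, it serves as the objective function to guide the language model’s behavior. A policy $\pi_\theta$ (i.e., the language model’s distribution over outputs) is fine-tuned to maximize the expected reward:
$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D}_x,\; y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) \right]$$
where $r_\phi(x, y)$ is the reward model's score for output $y$ given prompt $x$, and $\mathcal{D}_x$ is the distribution of prompts.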
To perform this optimization safely and efficiently, RLHF commonly uses the Proximal Policy Optimization (PPO) algorithm [22], which limits the size of each policy update and thereby helps mitigate reward hacking [22, 23]. In practice, PPO is often regularized with a KL penalty that keeps the policy from deviating too far from the reference model [5].
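The PPO objective balances reward maximization with policy stability:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( \rho_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\left(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right) \right]$$
where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between new and old policies, $\hat{A}_t$ is the advantage estimate based on the reward model, and $\epsilon$ is the clipping range.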
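The effect of clipping is easiest to see on individual terms of the PPO objective. The sketch below evaluates the clipped surrogate for given probability ratios and advantage estimates; the value 0.2 for the clipping range and the sample numbers are assumptions for illustration, and the KL penalty mentioned above is not included.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_objective(ratios, advantages, eps=0.2):
    """Average clipped surrogate over a batch (to be maximized)."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# A good action whose probability rose a lot: the gain is capped at 1 + eps.
capped_gain = ppo_clip_term(1.5, 1.0)
# A bad action whose probability rose: the full penalty is kept (no clipping).
full_penalty = ppo_clip_term(1.5, -1.0)
# A bad action whose probability already fell: the term is clipped at
# (1 - eps) * A, so there is no incentive to push the ratio lower still.
limited_penalty = ppo_clip_term(0.5, -1.0)
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound: the policy cannot gain by moving far from the old policy, but it is still fully penalized when a move makes things worse.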
To illustrate, consider a prompt such as "How can I make money quickly?". The model might generate:
Human annotators may rank (2) and (3) higher due to safety and practicality, while downranking (1). Through training, the model internalizes such preferences and learns to prioritize ethical and safe responses in future generations — thus aligning with human intent beyond syntactic accuracy.
RLHF has proven instrumental in shaping models like ChatGPT, which exhibit reduced hallucination, enhanced factual grounding, and user-centric behavior. However, limitations persist: RLHF models are still reactive to prompts, lack autonomous planning capabilities, and remain constrained by the biases in human feedback [24]. These challenges open new directions for agentic AI systems, which seek to go beyond response generation and incorporate self-driven reasoning and goal execution.
2.4. Stage 4: Retrieval-Augmented Generation (RAG)
As large language models (LLMs) continued to scale, a key limitation persisted: their inability to access external knowledge beyond their training data. Despite the benefits of fine-tuning, LLMs remained static in nature—frozen after training on a fixed corpus. This led to several critical issues, including hallucinations (i.e., confident but incorrect outputs), outdated knowledge, and the inability to dynamically update facts. Fine-tuning alone proved insufficient to resolve these shortcomings, particularly in tasks requiring factual precision or domain-specific updates.
Figure 3 illustrates the retrieval-augmented pipeline that addresses these issues: a retriever fetches relevant context from an external knowledge base, and a generator produces grounded responses conditioned on this information.
Retrieval-Augmented Generation (RAG), introduced by Lewis et al. [7], addressed this gap by integrating a retrieval mechanism into the generative pipeline. Rather than relying solely on memorized knowledge, RAG systems connect LLMs to external, often up-to-date, information sources (e.g., databases, document stores, or search engines). This hybrid architecture introduces two main components: a retriever that identifies relevant documents based on a user query, and a generator that conditions its response on the retrieved content.
The retrieval step typically employs dense vector search (e.g., with FAISS [25]) or sparse lexical ranking (e.g., BM25 [26]) to identify relevant passages. The retrieved contexts are then appended to the prompt, or otherwise supplied to the generator as additional conditioning input, before a response is generated. Grounding the generation in retrieved evidence reduces hallucination and enables the model to access knowledge beyond its training cutoff.
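The retrieve-then-generate flow can be sketched without any external library. In the example below, a bag-of-words cosine similarity stands in for a learned dense encoder or BM25 scoring, the three-sentence corpus is invented for illustration, and the generator call is omitted: the sketch stops at the assembled prompt that would be sent to the model.

```python
import math
from collections import Counter

def embed(text):
    """Toy sparse representation: lowercase word counts (an assumption that
    stands in for a dense encoder or BM25 term weighting)."""
    cleaned = text.lower().replace("?", " ").replace(".", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, top_k=1):
    """Rank passages by similarity to the query and keep the best top_k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, passages):
    """Append the retrieved passages to the prompt as grounding context."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")

# Invented mini knowledge base (illustrative only).
corpus = [
    "The lease may be terminated with 60 days written notice.",
    "Employees accrue vacation days every month.",
    "The warranty covers manufacturing defects for two years.",
]
query = "How many days notice to terminate the lease?"
passages = retrieve(query, corpus, top_k=1)
prompt = build_prompt(query, passages)
```

Updating the system's knowledge only requires editing `corpus`; the generator itself is untouched, which is the modularity argument made in the text.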
From an engineering perspective, RAG offers several advantages over traditional fine-tuning:
Modularity: Knowledge updates can be made by modifying the retrieval corpus without retraining the model.
Efficiency: Retrieval systems are more computationally efficient than repeated fine-tuning for each domain or update.
Transparency: The retrieved documents provide interpretability and support for the generated response.
RAG has been widely adopted in applications such as document question answering, PDF chat systems, semantic search, and chatbot assistants in enterprise settings. For instance, in a legal assistant scenario, RAG enables the model to retrieve case law or policy clauses from a legal database, thereby avoiding hallucinated legal claims.
Overall, RAG represents a major evolution in LLM architectures. By blending search and generation, it bridges the gap between static models and dynamic knowledge environments, providing a more robust and factually grounded AI assistant.
Towards Tool Use and System Integration. The emergence of Retrieval-Augmented Generation (RAG) laid the foundational architecture for functional modularity in LLMs. By separating static model parameters from dynamic external knowledge, RAG introduced the idea that models could consult tools (in this case, search engines or vector databases) at inference time. This was a first step toward broader tool integration, where LLMs not only retrieve documents but also invoke APIs, perform calculations, read files, and interact with structured systems [27]. The conceptual shift from “memorize everything” to “consult when needed” opened the door for more general frameworks involving external function calls, which we will explore in the next stage.
2.5. Stage 5: Tools and APIs Integration
As we move beyond pure text generation, the need arises for language models to interact with external systems, perform actions, and access up-to-date information. This is where Tool and API integration becomes essential, building on earlier ideas of reasoning-augmented tool use such as ReAct [18].
While Retrieval-Augmented Generation (RAG) allows LLMs to fetch documents, many user tasks require executing real-time functions, such as calculations, scheduling, or searching a live database. To address this need, recent LLMs have been enhanced with the ability to call external tools through well-defined APIs.
An example of this tool integration workflow is shown in Figure 4, demonstrating how LLMs bridge text understanding with real-world actions via APIs.
From Text to Action.
In this stage, the LLM transitions from being just a text generator to becoming a reasoning engine that can plan, choose tools, call them, and process results. For example, when asked, “What’s the current exchange rate between USD and EUR?”, the model needs to recognize this task, invoke the appropriate tool (like getExchangeRate), and return a factually correct response.
Training for Intent Understanding.
A key requirement for enabling tool use is that the LLM must be trained to understand user intent and to select the right tool based on the request [27]. This is typically done using supervised fine-tuning datasets that teach the model to generate structured function calls instead of plain answers.
Not all language models support this capability out of the box. For example, OpenAI’s GPT-4 supports native tool calling through structured JSON-based function calls [28], while many other models do not yet support this feature natively.
How It Works.
Tool integration typically involves the following steps:
Tool Registration: Each tool is described with a name, description, and input/output schema.
Function Call Prediction: The LLM is prompted (or fine-tuned) to detect when a tool is needed and how to call it.
Tool Execution: A backend system receives the tool call, executes it, and returns the result.
Result Injection: The result is returned to the LLM, which then generates a complete response.
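The four steps above can be sketched end to end as follows. Everything in this example is a stand-in: `get_weather` is a stub rather than a real weather service, `fake_model_call` imitates the LLM emitting a JSON function call, the registry layout is an assumed format (not a specific vendor's schema), and a string template plays the role of the final LLM turn.

```python
import json

def get_weather(city, date):
    """Hypothetical stub; a real tool would query a weather service."""
    return {"city": city, "date": date, "forecast": "Sunny", "high_c": 24}

# Step 1, tool registration: name, description, input schema, callable.
REGISTRY = {
    "getWeather": {
        "description": "Return the forecast for a city on a given date.",
        "parameters": {"city": str, "date": str},
        "fn": get_weather,
    }
}

def fake_model_call(user_query):
    """Step 2, function call prediction: stand-in for the LLM's JSON output."""
    return json.dumps({"name": "getWeather",
                       "arguments": {"city": "Paris", "date": "tomorrow"}})

def execute(call_json, registry):
    """Step 3, tool execution: validate arguments against the schema, then run."""
    call = json.loads(call_json)
    spec = registry[call["name"]]
    args = call["arguments"]
    for key, expected_type in spec["parameters"].items():
        if not isinstance(args.get(key), expected_type):
            raise ValueError(f"bad or missing argument: {key}")
    return spec["fn"](**args)

def respond(user_query):
    result = execute(fake_model_call(user_query), REGISTRY)
    # Step 4, result injection: a real system would pass `result` back to
    # the LLM; here a template produces the final sentence.
    return (f"Tomorrow in {result['city']}, the weather will be "
            f"{result['forecast'].lower()} with a high of {result['high_c']}°C.")

reply = respond("Check the weather in Paris tomorrow and summarize the forecast.")
```

Validating the predicted call against the registered schema before execution is a deliberate choice: the model's output is untrusted text, and the backend is the last point at which a malformed or unexpected call can be rejected.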
Example.
User query:
"Check the weather in Paris tomorrow and summarize the forecast."
The model may internally generate:
getWeather(city="Paris", date="tomorrow") → "Sunny, 24°C"
Final output:
"Tomorrow in Paris, the weather will be sunny with a high of 24°C."
The Role of MCP.
Before the emergence of standardized protocols, the integration of external tools with language models was fragmented and ad hoc. Each system relied on custom interfaces, undocumented behaviors, and brittle assumptions about how models should call tools—making such integrations difficult to scale, reproduce, or secure.
To address this gap, Anthropic introduced the Model Context Protocol (MCP) [29], an open protocol intended to standardize how external tools, APIs, and data sources are exposed to language models during inference. MCP defines a machine-readable schema that describes each tool’s name, purpose, and input/output formats. It acts as a declarative interface layer through which models can dynamically discover and invoke available tools within a given context.
MCP can be viewed as a kind of REST API for AI resources: a conceptual framework that enables AI systems to reason over structured, discoverable, and securely defined resources, much like how web applications consume RESTful services.
MCP brings three key advantages:
Unification: A common protocol replaces hand-crafted tool-call mechanisms with a structured, extensible specification.
Context-awareness: Tools are only made available to the model within a relevant context, improving safety and reasoning efficiency.
Automation-readiness: Because tools are self-describing, models can auto-generate calls, parse responses, and chain actions in autonomous workflows.
By acting as a middleware layer between the model’s reasoning engine and the external environment, MCP is essential to enabling agentic behaviors. It allows LLMs not only to interpret natural language but also to interact with external systems in a secure, modular, and verifiable manner.
Insights
Stage 5 marks a pivotal shift in the evolution of language models, from passive generators of text to active participants in real-world tasks. By integrating tools through standardized interfaces like MCP, LLMs gain the capacity to retrieve information, perform computations, and interact with dynamic environments. This grounding significantly mitigates hallucinations by anchoring model outputs in factual and verifiable sources. Moreover, tool use lays the foundation for more advanced capabilities, setting the stage for models that not only access external tools but also coordinate reasoning, memory, and planning. This transition leads directly to Stage 6, where LLMs begin to operate as autonomous agents capable of multi-step task execution, decision-making, and self-driven prompt generation [30].
2.6. Stage 6: Single LLM Agents
Stage 6 represents a major paradigm shift in the evolution of large language models—from passive tool users to autonomous agents capable of reasoning, planning, acting, and learning within a single architectural framework. Unlike prior stages where models merely responded to prompts or executed tools when explicitly guided, single-agent systems embody a more integrated intelligence loop, where the LLM itself orchestrates all steps of the cognitive process.
Figure 5 illustrates this autonomous agentic loop, where a single LLM performs reasoning, planning, tool selection, action execution, and self-reflection in an iterative workflow.
At the core of these agents is the ability to reason through problems using mechanisms like Chain-of-Thought (CoT) prompting [16], which enables step-by-step logical decomposition of tasks. This reasoning is not static: the model can observe the outcomes of its previous actions, reflect on errors or progress, and update its internal state accordingly, a concept sometimes referred to as self-reflection or inner monologue [31].
Planning emerges as a crucial capability, where the agent decomposes complex goals into subgoals and dynamically selects tools or actions to achieve them. Acting, in this context, involves invoking APIs, retrieving data, or querying databases, all without explicit human instruction at each step. These capabilities are tightly interwoven in a loop [18]:
Reasoning: The agent breaks down the task into logical steps using CoT.
Acting: It takes concrete actions (e.g., API calls, tool invocation) based on its reasoning.
Observing: It assesses the outcomes of its actions and updates its knowledge or next steps.
Planning: Based on observations, it revises its strategy to continue toward the goal.
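The loop above can be sketched with a single agent that plans, acts, and observes until its goal is met. The example is deliberately small and rests on stated assumptions: `plan_next` is a hard-coded stand-in for the LLM's reasoning and planning, the only tool is a toy calculator, and the goal (take 15% of 80, then add 8) is invented for illustration.

```python
def calculator(expression):
    """Toy tool: evaluates 'a op b' for the operators +, -, and *."""
    left, op, right = expression.split()
    x, y = float(left), float(right)
    return {"+": x + y, "-": x - y, "*": x * y}[op]

def plan_next(memory):
    """Hypothetical planner standing in for the LLM: choose the next subgoal
    from what has been observed so far, or None when the goal is reached."""
    if "pct" not in memory:
        return ("pct", "calculator", "0.15 * 80")
    if "total" not in memory:
        return ("total", "calculator", f"{memory['pct']} + 8")
    return None

def run_agent(tools, max_steps=5):
    memory, trace = {}, []
    for _ in range(max_steps):
        step = plan_next(memory)                    # reasoning and planning
        if step is None:
            break
        key, tool_name, tool_input = step
        observation = tools[tool_name](tool_input)  # acting
        memory[key] = observation                   # observing
        trace.append((tool_name, tool_input, observation))
    return memory, trace

memory, trace = run_agent({"calculator": calculator})
```

Note that the second subgoal is built from the first observation, so the plan is revised from the agent's own results rather than fixed in advance; the `max_steps` bound guards against a planner that never terminates.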
Systems such as LangChain’s AgentExecutor [9] and Microsoft’s AutoGen [8] demonstrate practical implementations of this loop, enabling LLMs to autonomously solve multi-step tasks [32] with minimal human input. While still constrained by the capabilities of single LLMs, this stage establishes the foundational behavior of autonomous agents, setting the stage for multi-agent collaboration in Stage 7.
2.7. Stage 7: Agentic AI Workflows
As the capabilities of large language models (LLMs) evolved, a critical limitation emerged: even the most advanced single-agent systems could not sustain long, complex, or multitask workflows. Stage 7 addresses this limitation by introducing agentic AI workflows, where multiple specialized LLM agents collaborate to perform structured, multi-step tasks. This transition parallels the evolution from monolithic computing to distributed systems—enhancing flexibility, robustness, and scalability.
Figure 6. Multi-Agent Systems.
Unlike Stage 6, which focuses on self-contained LLMs that observe, reason, and act, Stage 7 introduces coordination and modularity. A task is no longer solved by one model alone, but instead distributed across a set of interacting agents. These agents operate independently, yet communicate through well-defined protocols, enabling task decomposition, specialization, and feedback across the system.
Core Capabilities.
Agentic AI systems bring several novel capabilities that overcome the constraints of monolithic LLM reasoning:
Task Decomposition: A central planner agent breaks down complex problems into subtasks.
Specialized Roles: Different agents are assigned domain-specific functions (e.g., coding, data analysis, summarization).
Memory Sharing: Shared context or memory buffers enable continuity across agents.
Collaboration Protocols: Orchestrators like LangGraph [9], AutoGen [8], or CrewAI [10] coordinate agent execution, feedback, and control flow.
Illustrative Example.
Consider a research assistant workflow:
The Planner Agent parses a user request: “Find recent AI papers on agentic systems and summarize their contributions.”
It dispatches subtasks to a Search Agent, a Summarizer Agent, and a Citation Verifier Agent.
Each agent performs its function independently—querying arXiv, writing summaries, or validating references.
The Supervisor Agent integrates results and decides whether to loop, revise, or finalize the response.
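The research-assistant workflow above can be sketched as a set of plain functions, one per agent, coordinated by a supervisor. All agents here are hypothetical stubs: the search agent returns placeholder records instead of querying arXiv, the summarizer and verifier apply trivial rules, and the supervisor's "revise or finalize" decision is reduced to dropping unverified items. No orchestration framework is used.

```python
def search_agent(topic):
    """Stub Search Agent: returns placeholder records (no real query is made)."""
    return [
        {"title": f"Paper A on {topic}", "ref": "ref-A"},
        {"title": f"Paper B on {topic}", "ref": ""},  # reference missing on purpose
    ]

def summarizer_agent(papers):
    """Stub Summarizer Agent: attaches a placeholder summary to each record."""
    return [{**p, "summary": f"Summary of {p['title']}"} for p in papers]

def verifier_agent(papers):
    """Stub Citation Verifier Agent: a record passes if it has a reference."""
    return [{**p, "verified": bool(p["ref"])} for p in papers]

def planner_agent(request):
    """Stub Planner Agent: decompose the request into ordered subtasks."""
    return [("search", search_agent),
            ("summarize", summarizer_agent),
            ("verify", verifier_agent)]

def supervisor_agent(request):
    """Run the planned subtasks in order, then keep only verified results."""
    data, log = request, []
    for name, agent in planner_agent(request):
        data = agent(data)
        log.append(name)
    final = [p for p in data if p["verified"]]
    return final, log

final, log = supervisor_agent("agentic systems")
```

Because each agent only consumes and returns plain records, any one of them can be replaced (for example, by an LLM-backed implementation) without changing the others, which is the modularity property discussed in this stage.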
This architecture mimics real-world team dynamics, where division of labor and iterative refinement lead to higher-quality outcomes.
Advantages Over Stage 6.
Agentic AI workflows offer several distinct advantages over single-agent systems introduced in Stage 6. First, they enable scalability by distributing subtasks across multiple agents, allowing for parallel execution and reducing the memory constraints typically faced by a single LLM. Second, they promote modularity, as each agent is designed with a specific role and can be independently updated, replaced, or reused in different workflows. This modularity enhances maintainability and flexibility across applications. Third, these workflows support seamless tool integration, where agents interact with external resources through standardized interfaces such as the Model Context Protocol (MCP) [29], enabling dynamic access to APIs and databases. Finally, agentic systems help in reducing hallucinations by incorporating dedicated agents for fact-checking and external verification, ensuring that responses are grounded in real-time data and evidence-based reasoning.
Strategic Impact.
Stage 7 marks a fundamental shift from isolated LLM capabilities to the orchestration of autonomous, multi-agent systems. Rather than relying on a single model with linear reasoning capabilities, agentic workflows distribute cognitive load across a collaborative network of agents, each specializing in a defined role. This architecture enables structured planning, dynamic task execution, self-correction through feedback loops, and informed decision-making through tool integration. As a result, we witness the emergence of LLM ecosystems, where coordination and role-based specialization [33] unlock a new generation of AI systems capable of handling end-to-end processes with minimal human oversight. These developments are already redefining applications in scientific discovery, automated software development, legal analysis, and personalized education.
Outlook.
Despite their transformative potential, current agentic workflows remain limited by static roles, ephemeral memory, and rigid context windows. Agents today operate in session-bound environments, lacking long-term identity [34], adaptability, or personal learning trajectories. This limitation opens the path toward Stage 8, which envisions persistent, memory-augmented agents capable of evolving over time. Such agents will not only collaborate, but also learn from experience, build domain expertise, retain user-specific preferences, and update their internal state across sessions. This shift represents the transition from reactive orchestration to proactive cognition, laying the groundwork for AI agents that operate as long-term collaborators in dynamic, real-world environments.