Enterprise Agentic AI Architecture Design Guidance (Part 1)

Courtesy of Shelly Palmer

... Agent Goals, RAG, Agent Architecture & Models    

By Lavanya Subbarayalu, Principal Solution Architect, PwC

Designing Enterprise Agentic AI Systems

Enterprise adoption of Agentic AI requires more than connecting large language models to tools. Production systems should coordinate reasoning, retrieval, multi-agent collaboration, tool execution, and runtime orchestration while maintaining reliability, security, and operational visibility.

Many early implementations fail because core architectural decisions are implicit rather than designed deliberately. Without clear boundaries between reasoning, coordination, execution, and governance layers, multi-agent systems quickly become fragile, difficult to evaluate, and expensive to operate.

This document provides practical guidance for architects and AI engineers designing enterprise-grade Agentic AI systems.

The framework organizes Agentic AI design into eight architecture areas.

Part 1 focuses on the foundational design areas, which define how agents reason, retrieve information, and interact with enterprise systems:

  • Agent goals, skills, and context management
  • Retrieval Augmented Generation (RAG) design
  • Agent architecture patterns and frameworks
  • Models, tools, and integration strategies

Part 2 covers the operational layers required to run agent systems in production, including communication protocols, evaluation, infrastructure, reliability, and security.

Enterprise Agentic AI Reference Architecture

The following diagram illustrates the Agentic AI reference architecture used throughout this design guidance.


Figure 1: Enterprise Agentic AI reference architecture

The diagram illustrates the runtime execution flow and supporting operational components used in Enterprise Agentic AI systems.

  • Numbered boxes represent the primary runtime execution path.
  • Solid arrows represent runtime execution.
  • Dashed arrows represent asynchronous data flows and feedback loops.
  • Left panels represent data pipelines.
  • Right panels represent evaluation and governance capabilities.

0. Should You Build an Agentic System?

Before designing an agent architecture, the suitability of Agentic AI for the problem should be validated. The use case can be evaluated across the following dimensions.

Dimension | Traditional AI Signal | Agentic AI Signal
Task Complexity | Single step: classify, predict, extract | Multi-step: plan → execute → verify → adapt
Decision Autonomy | Human directs every action | System makes intermediate decisions within guardrails
Tool Usage | 0–2 fixed integrations, hard-coded | 3+ tools, dynamically selected based on context
Workflow Shape | Linear, deterministic pipeline | Branching, loops, context-dependent paths
Error Handling | Fail fast, human retries | Self-corrects, tries alternatives, escalates intelligently

Decision Guidance (Illustrative):

Score each dimension 1 (traditional) to 5 (agentic). Total >20 → agentic justified. Total 12–20 → hybrid (agentic for specific sub-flows only). Total <12 → traditional AI.
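The scoring heuristic above can be sketched as a small helper. The dimension names and the function are illustrative, following the thresholds in the guidance (>20 agentic, 12–20 hybrid, <12 traditional):

```python
# Illustrative suitability scorer; thresholds follow the decision guidance
# in the text and are starting points, not fixed rules.
DIMENSIONS = [
    "task_complexity",
    "decision_autonomy",
    "tool_usage",
    "workflow_shape",
    "error_handling",
]

def assess_suitability(scores: dict) -> str:
    """Score each dimension 1 (traditional) to 5 (agentic) and classify."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    if any(not 1 <= v <= 5 for v in scores.values()):
        raise ValueError("each score must be between 1 and 5")
    total = sum(scores[d] for d in DIMENSIONS)
    if total > 20:
        return "agentic"
    if total >= 12:
        return "hybrid"
    return "traditional"
```

For example, a use case scoring 5/5/5/4/3 totals 22 and classifies as "agentic", while one scoring 2 on every dimension totals 10 and classifies as "traditional".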

⚠ Anti-pattern

Building agentic systems for simple classification or single-API tasks. Linear and deterministic workflows are better served by traditional pipelines.

0.1 Design Principles for Enterprise Agentic AI

Enterprise agent systems introduce new architectural complexity. The following principles help keep systems reliable, scalable, and maintainable.

Separate reasoning from execution
Planning, coordination, and execution should be implemented as distinct layers.

Separate orchestration from agent tasks
Orchestration manages workflow state while agents focus on reasoning and task execution.

Treat skills as contracts
Agent skills should expose structured inputs, outputs, and error handling.

Constrain agent autonomy
Agents should operate within defined tool permissions, task scope, and safety policies.

Design feedback loops early
Evaluation, human review, and telemetry should continuously improve the system.

Optimize model usage
Model routing or cascading can be used to balance capability, latency, and cost.

1. Agent Goals, Skills & Context

This area defines what the agent is trying to accomplish (goals), what discrete capabilities it has (skills), and how it manages information needed to act (context).

1.1 Goal Design

Every agent should have exactly one measurable primary goal. Multiple competing objectives lead to unpredictable prioritization. The goal should be decomposable into a sequence of skills that can be individually validated.

Goal Property | Requirement | Example
Measurability | Quantifiable success metric | Resolution rate >95% within 2 turns
Decomposability | Expressible as ordered skill sequence | Intake → Verify → Route → Resolve
Scope | Bounded — agent can fail gracefully | Single customer issue per session
Observability | Every step is loggable and traceable | Each skill invocation has a correlation ID

1.2 Skill Design & Contracts

Skills are atomic capabilities. Each skill has a defined contract: what it accepts, what it returns, and how it fails. This contract enables independent testing and composition.

The Skill Contract Schema:

Contract Element | Description | Example
Name | Unique identifier, verb-noun format | verify_identity
Input Schema | Typed input specification (JSON Schema) | {"customer_id": "string", "method": "enum[email,sms,push]"}
Output Schema | Typed success response | {"verified": "bool", "confidence": "float", "evidence": "string"}
Error Codes | Enumerated failure modes | NOT_FOUND, TIMEOUT, LOW_CONFIDENCE, RATE_LIMITED
SLA | Timeout and reliability targets | Timeout: 5s, Success rate: 99.5%
Fallback | What happens on failure | Try secondary provider → escalate if both fail
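As an illustration, the contract schema above might be captured as a simple data structure. The `SkillContract` type and its field names are assumptions of this sketch, not a standard interface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SkillContract:
    """Illustrative skill contract mirroring the schema in the text.

    Field names (input_schema, error_codes, timeout_s, ...) are
    assumptions for this sketch, not a standard interface.
    """
    name: str                   # verb-noun identifier, e.g. "verify_identity"
    input_schema: dict          # JSON Schema for accepted input
    output_schema: dict         # JSON Schema for the success response
    error_codes: tuple          # enumerated failure modes
    timeout_s: float            # SLA timeout
    success_rate_target: float  # SLA reliability target
    fallback: str               # what happens on failure

# The example contract from the table, expressed with this structure.
verify_identity = SkillContract(
    name="verify_identity",
    input_schema={
        "type": "object",
        "properties": {"customer_id": {"type": "string"},
                       "method": {"enum": ["email", "sms", "push"]}},
        "required": ["customer_id", "method"],
    },
    output_schema={
        "type": "object",
        "properties": {"verified": {"type": "boolean"},
                       "confidence": {"type": "number"},
                       "evidence": {"type": "string"}},
    },
    error_codes=("NOT_FOUND", "TIMEOUT", "LOW_CONFIDENCE", "RATE_LIMITED"),
    timeout_s=5.0,
    success_rate_target=0.995,
    fallback="try secondary provider, escalate if both fail",
)
```

Because the contract is an immutable value object, it can be registered, versioned, and validated independently of the skill implementation.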

How Many Skills Per Agent?

Skill Count | Action | Rationale
1–3 | Single agent, flat structure | Low complexity, minimal orchestration
3–5 | Single agent, optimal range | Sufficient capability without coordination overhead
6–8 | Consider splitting into sub-agents | Orchestration complexity increases significantly
>8 | Decompose — hierarchical routing required | Single agent cannot reliably manage >8 skills

⚠ Anti-pattern

Assigning 10+ skills to one agent. Orchestration complexity grows with skill count. At >8, route to specialized sub-agents instead.

1.3 Skill Routing & Invocation

When an agent receives a user intent, it should decide which skill to invoke. This routing decision is critical — incorrect routing wastes a full skill execution and may produce incorrect results.

Routing Method | How It Works | When to Use | Accuracy
Semantic Similarity | Embed intent, cosine-match to skill descriptions | General-purpose, <20 skills | 85–92%
Intent Classifier | Dedicated fine-tuned classifier model | >20 skills, high-volume | 92–97%
Keyword Rules | Regex/keyword matching to skill triggers | Simple, well-defined domains | 95%+ if domain is narrow
LLM Router | Frontier model selects skill from description set | Complex, ambiguous intents | 88–94%, higher latency
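A minimal sketch of semantic-similarity routing. Bag-of-words cosine is used here only as a stand-in for a real embedding model; in production, `_embed` would call an embedding service and the 0.2 confidence threshold would be tuned on labeled intents:

```python
import math
from collections import Counter

def _embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words token counts.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(intent: str, skills: dict, threshold: float = 0.2):
    """Match an intent against skill descriptions; return
    (best_skill, score, runner_up), with best_skill None below threshold."""
    scored = sorted(
        ((name, _cosine(_embed(intent), _embed(desc)))
         for name, desc in skills.items()),
        key=lambda x: x[1], reverse=True)
    best = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else None
    if best[1] < threshold:
        return None, best[1], runner_up
    return best[0], best[1], runner_up
```

Returning the runner-up alongside the winner supports the fallback logic described below the table: if the top match fails, the next-best candidate can be retried.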

Fallback Trigger Logic:

Route fails → retry with next-best match (if confidence gap <0.1) → if still failing, escalate to human. Circuit breaker: 3 consecutive failures on same skill → mark as degraded, bypass in routing.
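The circuit-breaker rule above (three consecutive failures on the same skill marks it degraded) can be sketched as a small tracker; the class and method names are assumptions of this sketch:

```python
class SkillCircuitBreaker:
    """Marks a skill degraded after N consecutive failures (3 in the
    guidance) so the router can bypass it; any success resets the count."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_failures = max_consecutive_failures
        self._failures = {}  # skill name -> consecutive failure count

    def record(self, skill: str, success: bool) -> None:
        # A success resets the streak; a failure extends it.
        self._failures[skill] = 0 if success else self._failures.get(skill, 0) + 1

    def is_degraded(self, skill: str) -> bool:
        return self._failures.get(skill, 0) >= self.max_failures
```

The router would consult `is_degraded` before each invocation and skip degraded skills until a health check or manual reset restores them.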

Routing Accuracy Measurement

Routing accuracy = (correct skill selections / total routing decisions) × 100

This metric should be evaluated against a labeled intent dataset and reviewed when skills or routing logic change.

⚠ Anti-pattern

Using an expensive frontier model for every routing decision. For <20 skills with clear descriptions, semantic similarity or a small, fine-tuned classifier is 5–10x cheaper and often more accurate.

Note: Frontier models are the most advanced and capable LLMs (e.g., GPT-class, Claude-class, or other flagship models). These offer the best reasoning but are the most expensive.

1.4 Context & Prompt Management

Context is the information available to an agent when making decisions or executing skills. Managing context is a resource allocation problem: insufficient context prevents action; excessive context degrades quality, increases cost, and risks hitting token limits.

Context Budget Calculation

Step 1 — Determine Model Window: The model’s maximum context window should be obtained from provider documentation (e.g., 32K, 128K, 200K tokens).

Step 2 — Calculate Token Budget: Typically, a maximum of 20% of the window is allocated to context. Example: 128K window → 25.6K token budget for context.

Step 3 — Allocate Context Budget: The context budget should be divided among conversation turns, retrieved chunks, and skill instructions. Typically, the remaining 80% is reserved for the model’s reasoning and output generation.

The allocations given below are illustrative starting points and should be adjusted based on use case, retrieval needs, and model behavior.

Model Window | Illustrative Context Budget | Example Allocation (Indicative)
32K tokens | 6.4K tokens | ~3 conversation turns + 1 retrieved chunk
128K tokens | 25.6K tokens | ~5 turns + 3–4 retrieved chunks + skill instructions
200K tokens | 40K tokens | ~8 turns + 5–6 chunks + full skill suite + examples

 

Note: The 20% guideline reserves 80% for the model’s reasoning, skill instructions, and output generation. Exceeding this threshold consistently can lead to quality degradation — models may truncate or ignore later context.
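The budgeting steps above can be sketched as a small helper. The 40/40/20 sub-split of the context budget across turns, chunks, and skill instructions is an assumption of this sketch, not part of the guidance:

```python
def context_budget(window_tokens: int, context_fraction: float = 0.20) -> dict:
    """Split a model window using the illustrative 20% context guideline.

    The sub-allocation percentages below are assumptions for this sketch
    and should be tuned per use case.
    """
    budget = int(window_tokens * context_fraction)
    turns = int(budget * 0.4)
    chunks = int(budget * 0.4)
    return {
        "context_budget": budget,
        "reasoning_and_output": window_tokens - budget,  # reserved 80%
        "conversation_turns": turns,
        "retrieved_chunks": chunks,
        "skill_instructions": budget - turns - chunks,
    }
```

For a 128K window this yields the 25.6K context budget shown in the table, with 102.4K reserved for reasoning and output.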

Hot vs Cold Context

Context Type | Where Context Is Stored | What It Contains | Retrieval
System Prompt | In-context (always) | Agent identity, skill list, behavioral rules | Not retrieved (pre-injected)
Hot Context | In-context (dynamic) | Last 3–5 turns, current task state | Maintained per session
Cold Context | Redis (TTL 24–72hr) | Previous task outcomes, user preferences | Retrieved on demand via similarity search
Knowledge Base | Vector DB | Domain documentation, policies | Retrieved per skill invocation

Prompt Engineering Techniques

While many prompt techniques exist, the following represent commonly used approaches in enterprise agent systems.

Technique | When to Use | Trade-off
Chain-of-Thought | Multi-step reasoning, math, logic | +25% accuracy vs costs 2× tokens
Few-shot (3–5 examples) | Consistent output formatting | +20% format compliance vs adds ~2K tokens cost
Role Prompting | Define agent scope, authority, and boundaries | Reduces drift vs minimal overhead
Goal-Oriented Prompting | Clear objectives with measurable outcomes | Improves consistency vs may reduce flexibility
Task Planning Prompts | Multi-step workflows requiring structured execution | Better control & structure vs adaptability (may be rigid)
Self-Correction | High-stakes decisions (financial, medical) | +10% reliability vs adds 50% latency per step

Decision Guidance:

Choose techniques based on the task’s accuracy requirements and acceptable cost increase. Chain-of-Thought is valuable for complex reasoning; few-shot ensures format consistency; self-correction adds reliability for critical decisions.

Structured Output Enforcement: Skill outputs should be validated against the defined JSON Schema. Use the model’s built-in structured output mode (e.g., response_format in OpenAI, or tool_use forcing in Anthropic) rather than post-hoc regex parsing. Post-hoc parsing fails on edge cases.
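As an illustrative fallback, a minimal validator for flat object schemas might look like the sketch below. Production systems should still prefer the model's structured-output mode plus a complete validator (such as the `jsonschema` package); this hand-rolled check only demonstrates the fail-fast principle:

```python
import json

# Minimal JSON Schema type mapping for flat object schemas (sketch only).
TYPE_MAP = {"string": str, "boolean": bool, "number": (int, float)}

def validate_output(raw: str, schema: dict) -> dict:
    """Parse a skill's raw output and validate it against a flat schema.

    Raises ValueError on missing required fields or wrong types, and
    json.JSONDecodeError on malformed JSON -- fail fast rather than
    letting a downstream skill parse free-form text.
    """
    data = json.loads(raw)
    required = schema.get("required", [])
    for key, spec in schema.get("properties", {}).items():
        if key in required and key not in data:
            raise ValueError(f"missing required field: {key}")
        if key in data and not isinstance(data[key], TYPE_MAP[spec["type"]]):
            raise ValueError(f"field {key} has wrong type")
    return data
```

Validation happens at the skill boundary, so a malformed output is rejected immediately instead of cascading through the pipeline.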

⚠ Anti-pattern

Free-form text outputs from skills that downstream skills should parse. Every skill output should be schema-validated. Unstructured outputs are a common cause of cascading failures in agent pipelines.

2. RAG, Memory & Vector Search

Retrieval-Augmented Generation (RAG) gives agents access to knowledge beyond their training data. Memory provides continuity across interactions. Together, these form the knowledge layer. This area covers pattern selection, component configuration, and production considerations.

2.1 RAG Design Decision Framework

The need for RAG should be evaluated based on the following criteria.

The thresholds and ranges presented below are example starting points and should be calibrated based on domain requirements, system constraints, and operational data.

Evaluation Criteria | No RAG Needed | RAG Required
Use Case Complexity | Single-turn Q&A, knowledge fits in prompt (<5K tokens) | Multi-document synthesis, knowledge exceeds prompt capacity
Data Freshness | Static knowledge, updates <monthly | Frequently updated content (daily/weekly)
Data Volume | Small corpus (<50 documents) | Large corpus (>100 documents)
Query Type | Broad conceptual questions | Specific fact retrieval, citation needed
Precision Requirements | Best-effort answers acceptable | Answers should be grounded in specific sources

Decision Guidance (Illustrative):

When multiple criteria indicate that retrieval is required, RAG-based approaches are typically considered. When criteria consistently indicate that retrieval is not required, alternatives such as few-shot prompting or fine-tuning may be sufficient.

These conditions should be validated against domain requirements and evaluation results rather than treated as fixed rules. The thresholds given in the table above are indicative starting points and should be calibrated based on domain complexity and risk tolerance.

Note: When the domain is narrow and well-defined with stable structured data, consider direct database queries or API calls instead of RAG. RAG is optimized for unstructured text retrieval, not structured data lookups.

2.2 RAG Pattern Selection

There are 25+ RAG patterns documented in research. These include patterns such as Standard RAG, Hybrid RAG, Self-RAG, Corrective RAG, Multi-hop RAG, Fusion RAG, and Agentic RAG. The commonly used production patterns fall into these categories based on complexity and quality requirements:

Pattern Category | Representative Patterns | When to Use
Basic Retrieval | Standard RAG, Sparse RAG, Constrained RAG | Fast fact lookup, protocol retrieval, <500ms latency
Advanced / Hybrid Retrieval / Query Optimization | Hybrid RAG (semantic + keyword), Fusion RAG, Adaptive RAG | Mixed technical and conceptual queries, variable query complexity; common in production
Context & Conversation | Conversation RAG, Memory-Augmented RAG, Contextual RAG | Multi-turn interactions, follow-up questions, session-aware retrieval
Quality & Reliability | Corrective RAG, Self-RAG, Citation-Aware RAG | High-stakes domains where auditability and verification of retrieved evidence are required
Multi-Document Reasoning | Iterative RAG, Multi-hop RAG, Hierarchical RAG, Chain-of-Retrieval RAG | Synthesis across multiple sources, comparative analysis, evidence chaining
Scale & Performance | Federated RAG, Long-Context RAG | Multi-system retrieval, large corpora, distributed knowledge environments
Agentic Retrieval | Agentic RAG, Speculative RAG, RL-RAG, ReFeed RAG | Complex workflows where retrieval should be planned, refined, or optimized over time
Specialized Domain Retrieval | Multimodal RAG, Reasoning RAG, Few-shot RAG, Prompt-Augmented RAG | Mixed-modality inputs, structured output generation, domain-specific reasoning

The patterns listed above represent the most commonly used production patterns. More advanced patterns should be adopted when complexity and quality requirements justify them.

RAG Pattern Selection Criteria

Factor | Consideration
Query complexity | Single-step lookup vs. multi-step evidence synthesis
Source requirements | Single trusted source vs. multiple sources requiring comparison or validation
Freshness requirements | Static content vs. frequently updated content
Latency tolerance | Real-time retrieval vs. multi-stage retrieval and reranking
Reliability requirements | Best-effort responses vs. grounded, auditable, citation-backed outputs
Modality | Text-only vs. multimodal inputs such as images, forms, or scanned documents

Production Guidance (Illustrative):

Hybrid RAG (semantic + keyword retrieval) is commonly used as a production baseline. A typical starting point is a weighted combination (for example, 0.7 semantic / 0.3 keyword), which should be calibrated based on query distribution and domain characteristics.
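The weighted combination described above can be sketched as a simple score-fusion helper, assuming both retrievers return per-document scores normalized to the 0–1 range:

```python
def hybrid_scores(semantic: dict, keyword: dict,
                  w_semantic: float = 0.7, w_keyword: float = 0.3) -> list:
    """Fuse semantic and keyword retrieval scores with a weighted sum.

    The 0.7/0.3 split is the illustrative starting point from the text;
    both weights should be calibrated on the real query distribution.
    Documents seen by only one retriever score 0 on the other.
    """
    docs = set(semantic) | set(keyword)
    fused = {
        d: w_semantic * semantic.get(d, 0.0) + w_keyword * keyword.get(d, 0.0)
        for d in docs
    }
    # Return (doc, score) pairs, best first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

In practice, raw scores from the two retrievers live on different scales (cosine similarity vs. BM25), so they must be normalized before fusion; rank-based methods such as reciprocal rank fusion are a common alternative that avoids this step.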

Precision and Recall in Retrieval

Precision: Of the documents retrieved, how many are relevant? High precision minimizes noise.

Recall: Of all relevant documents in the corpus, how many were retrieved? High recall ensures completeness.

Trade-off: Retrieval systems are typically optimized for recall to ensure relevant candidates are surfaced, while reranking improves precision by ordering results based on true relevance.

Production systems should explicitly balance recall and precision based on use case requirements. High-recall systems require reranking to maintain answer quality.

Retrieval Quality Measurement

Precision@k = relevant retrieved documents / k

Recall@k = relevant retrieved documents / total relevant documents

These metrics should be benchmarked on domain-specific queries prior to production deployment.
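The two formulas translate directly into code; here `retrieved` is an ordered result list and `relevant` is the labeled ground-truth set for the query:

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for d in top_k if d in relevant) / k

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for d in top_k if d in relevant) / len(relevant)
```

For example, if 2 of the top 4 results are relevant and 3 relevant documents exist in the corpus, precision@4 is 0.5 and recall@4 is about 0.67.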

2.3 Embedding Model Selection

The embedding model directly affects retrieval quality. This decision is frequently underestimated — using an inappropriate embedding model is a common cause of poor RAG performance.

The guidance below should be treated as indicative starting points rather than fixed rules. Embedding model selection should be benchmarked against domain queries, retrieval latency targets, storage constraints, and update frequency.

Selection Criteria | What to Evaluate | Guidance
Dimension Size | Vector dimensionality (384–1536+) | Higher dimensions = better quality, more storage & slower search. 768–1024 is a common practical balance.
Domain Fit | Performance on domain-specific queries | Embedding models should be benchmarked on at least 100 domain queries before selection. General models degrade on specialized domains.
Speed vs Quality | Encoding latency per batch | Example: for a <100ms retrieval SLO, use lighter models (384-dim). For quality-critical paths, accept 200–400ms with 1024-dim.
Update Cost | Re-embedding cost when docs change | If data updates frequently, choose models with incremental update support.

2.4 Chunking Strategy

Chunking strategy directly affects retrieval accuracy. Chunks that are too small can split crucial information away from its context; chunks that are too large introduce noise that interferes with retrieval.

Additional chunking mechanisms may also be used depending on document structure and retrieval objectives:

  • Semantic chunking: split on meaning or topic boundaries rather than fixed size
  • Sliding-window chunking: overlapping windows for continuity across adjacent spans
  • Hierarchical chunking: summary-level parent chunks linked to detailed child chunks
  • Structure-aware chunking: split on headings, tables, sections, or form boundaries

The chunk sizes in the table are illustrative starting points and should be calibrated using retrieval quality metrics such as precision@k, recall@k, and answer relevance, taking into account document structure and model context limits.

Content Type | Chunk Size (tokens) | Overlap | Strategy
Structured data (logs, records) | 100–200 | 0% | Each record is atomic and should not be split across chunks
Technical documentation | 300–500 | 15–20% | Split at section boundaries; overlap preserves cross-section references
Narrative or conversational content | 500–800 | 20% | Preserve topic continuity across paragraphs
Legal / policy documents | 400–600 | 25% | Higher overlap because clauses reference each other

Note: Overlap is not free — it increases index size and can cause duplicate retrievals. Retrieved chunks should be deduplicated by source ID before passing to the model.
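A minimal sketch of sliding-window chunking with proportional overlap, plus source-ID deduplication of retrieved chunks. The token-list representation and the `source_id` field are assumptions of this sketch; real pipelines operate on tokenizer output and carry richer metadata:

```python
def chunk_sliding(tokens: list, chunk_size: int = 400,
                  overlap_pct: float = 0.2) -> list:
    """Sliding-window chunking: each window overlaps the previous one
    by overlap_pct of the chunk size (sizes in the table are illustrative)."""
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

def dedupe_by_source(chunks: list) -> list:
    """Drop duplicate retrievals of the same source chunk, keeping the
    first (highest-ranked) occurrence."""
    seen, unique = set(), []
    for chunk in chunks:
        if chunk["source_id"] not in seen:
            seen.add(chunk["source_id"])
            unique.append(chunk)
    return unique
```

With a 400-token chunk and 20% overlap, the window advances 320 tokens per step, so adjacent chunks share an 80-token span that preserves continuity across boundaries.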

2.5 Retrieval & Reranking

The RAG pattern determines the retrieval method. For instance, Hybrid RAG requires both semantic (vector) and keyword (sparse) retrieval working together. The pattern dictates the retrieval architecture.

 

Initial retrieval (via vector similarity) is a recall operation — it surfaces candidates. Reranking is a precision operation — it reorders candidates by relevance. The two-stage approach (retrieve-then-rerank) is the production standard.

Stage | Method | Speed | Precision | When to Use
Retrieval | Vector similarity | Fast (<50ms) | Moderate | Used as stage 1
Reranking | Cross-encoder (pairwise scoring) | Slower (+100–300ms) | High | Top-K ≥ 5 candidates, quality-critical paths
Skip Reranking | Use retrieval output directly | Fastest | Moderate | Latency-critical paths (<100ms SLO), low-stakes queries


2.6 Index Maintenance & Drift

A vector index is not a static artifact — it degrades as source documents change. Index drift (retrieving stale or outdated content) degrades performance gradually and can be difficult to detect.

Source Update Frequency | Recommended Strategy | Implementation
Daily or less | Batch re-index nightly | Full re-embed during low-traffic window; swap atomically
Hourly | Incremental upsert | Track document versions; re-embed only changed docs; append to index
Real-time (<1 min) | Streaming pipeline | Kafka → embed worker → upsert to vector DB; accept eventual consistency
Rare (monthly+) | Manual trigger | Re-index on content release; validate with benchmark queries

2.7 Memory Architecture

Memory gives agents continuity. Without it, every interaction starts from scratch. Memory should be architected as a system — the wrong memory architecture leads to stale context, privacy violations, or unbounded storage growth.

 

Memory Type | Storage | Scope | TTL / Lifecycle | Use Case
Session (Short-term) | In-context window | Current conversation | Session end | Conversation flow, turn tracking
Working (Workspace) | Redis | Active task | 24–72 hours | Multi-step task state, intermediate results
Episodic (Long-term) | Postgres (encrypted) | Per-user persistent | Regulatory policy (e.g., 7 years) | User history, past decisions, preferences
Semantic | Vector DB | Cross-user knowledge | 2 years, periodic cleanup | Pattern recognition, experience sharing

Decision Guidance:

  • Working memory and episodic memory should be stored separately to prevent contamination of task state and historical records.
  • Episodic memory should be written only after validation to avoid propagating incorrect information.
  • Retrieved memory should be constrained to a small portion of the context window (for example <5%) to prevent context overflow.

⚠ Anti-pattern

Storing raw conversation transcripts as long-term memory. Transcripts grow unboundedly and contain noise. Extract and store structured summaries instead.

3. Agent Architecture, Patterns & Frameworks

This area covers structural decisions: how the agent system is organized as software, what reasoning patterns it uses, and whether to use an agent framework or build directly against model APIs. These decisions determine operational complexity, failure behavior, and long-term maintainability.

3.1 Event-Driven Microservices Architecture

Event-driven microservices are the de facto standard for production multi-agent systems at scale. This architecture provides fault isolation, independent scaling, and asynchronous workflows. The conceptual reference architecture consists of these layers:

Layer 1 — API Gateway: Authentication, rate limiting, request routing.

Layer 2 — Agent Services: Each agent type runs as an independent service (stateless or stateful with external state store)

Layer 3 — Event Bus (Kafka/SQS): Asynchronous communication between agents and services

Layer 4 — LLM Gateway: Model routing, prompt caching, token metering

Layer 5 — Data Layer: Vector DBs, Redis (working memory), Postgres (episodic memory), tool registries.

These layers represent a deployment view of the architecture. At runtime, the system follows the flow illustrated in the reference architecture diagram (planner → orchestration → agents → coordination → execution).

Key Design Principle: Agents publish events when tasks are complete; other agents subscribe and react. This decouples services and enables independent deployment.

3.2 Agent State Management

Stateless agents are simpler to operate (horizontal scaling, no sticky sessions, simple health checks). Use stateful approaches only when the use case requires conversation continuity or multi-step task tracking.

Approach | When to Use | State Store | Complexity
Stateless | Single-turn Q&A, classification, extraction | None | Low — fully horizontally scalable
Stateful (external) | Conversational agents, multi-step workflows | Redis (session) + Postgres (persistent) | Medium — state store becomes a dependency
Stateful (in-context) | Short sessions (<10 turns), low volume | Context window only | Low, but limited by context window size

⚠ Anti-pattern

Storing state in-process (agent instance memory). When the instance restarts or is load-balanced to a different pod, state is lost. Agent state should be externalized in production systems.

3.3 Handling Partial Failures

In multi-skill or multi-service agent workflows, partial failures are inevitable. If Skill A succeeds (e.g., writes to a database) and Skill B fails (e.g., a payment times out), the system is in an inconsistent state.

Saga patterns are commonly used in distributed agent workflows to manage long-running tasks and compensation logic when partial failures occur. Other recovery patterns, such as checkpoint-and-resume and idempotent retry, may be more appropriate depending on workflow duration, side effects, and rollback requirements.

State recovery mechanisms (such as checkpointing, retries, and compensation logic) should be selected based on workflow duration and failure impact.

Pattern | How It Works | When to Use
Saga (Choreography) | Each skill publishes success/failure events; other skills listen and compensate | Loosely coupled skills, async workflows
Saga (Orchestration) | A central orchestrator coordinates and triggers compensation on failure | Tightly coupled workflows that need clear ownership
Retry with Idempotency | Retry failed operations; each operation is idempotent (safe to repeat) | Simple failures (timeouts, transient errors)
Checkpoint & Resume | Save workflow state at each step; resume from last checkpoint on failure | Long-running workflows (minutes+)

Compensation Design Guidance: Every skill that performs a side effect (write, send, charge) should have a corresponding compensation action (rollback, cancel, refund). Define these at contract time — not at incident time.
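An orchestrated saga with compensation can be sketched as below. The `steps` protocol (an ordered list of name/action/compensation triples) is an assumption of this sketch; real orchestrators also persist progress so compensation survives a crash:

```python
def run_with_compensation(steps: list) -> dict:
    """Run (name, action, compensation) triples in order; on failure,
    run the compensations of completed steps in reverse order.

    Each action and compensation is a zero-argument callable; an action
    signals failure by raising. The failed step's own compensation is
    not run, since its side effect never completed.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception:
            for _done_name, comp in reversed(completed):
                comp()  # rollback, cancel, refund, ...
            return {"status": "compensated", "failed_step": name}
    return {"status": "committed"}
```

This mirrors the guidance above: each side-effecting step is registered together with its compensation at contract time, so rollback logic exists before the first incident.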

3.4 Planning & Reasoning Patterns

Agent Pattern Overview

Agent patterns apply across multiple layers of system behavior:

  • Interaction patterns define how agents communicate with users, tools, and other agents
  • Planning patterns define how tasks are decomposed before execution
  • Reasoning patterns define how decisions are made
  • Execution and coordination patterns define how tasks are carried out across workflows

Part 1 focuses on planning, reasoning, and interaction patterns. Execution and coordination patterns are covered in Part 2.

Planning Patterns

Pattern | Best For | Tradeoff
Plan-and-Execute | Predictable workflows with defined steps | Fast and structured, but less adaptive
Least-to-Most / Task Planning | Problems that can be decomposed into ordered sub-problems | Handles structured complexity, but adds planning overhead
Hierarchical Planning | Multi-level task decomposition and delegated sub-goals | Scales to complex workflows, but increases coordination overhead
Dynamic Replanning | Volatile workflows where conditions change during execution | Adaptive, but can increase compute cost and plan instability

Reasoning Patterns

The reasoning pattern determines how an agent breaks down and solves a problem. The choice affects accuracy, latency, and cost.

Selection Criteria:

Task Predictability:

Predictable or known workflows are typically suited to Plan-and-Execute patterns, while uncertain or adaptive workflows align with ReAct-style reasoning.

Solution Space:

Tasks with a single expected outcome are generally suited to Chain-of-Thought reasoning, whereas problems with multiple valid paths may benefit from Tree-of-Thoughts.

Quality Requirements:

Standard workflows may use single-pass reasoning, while high-stakes scenarios often use Reflection or multi-pass approaches.

Cost Tolerance:

Each reasoning iteration incurs an additional model call. Iteration limits are typically defined based on acceptable cost and latency constraints.

Additional considerations include latency tolerance, tool interaction complexity, and whether the workflow involves single-agent or multi-agent coordination.

Pattern | Best For | Tradeoff
ReAct | Adaptive, uncertain environments | Flexible, but can loop without iteration bounds
Chain-of-Thought | Step-by-step reasoning for logic or calculation | Improves reasoning quality, but increases token usage
Tree-of-Thoughts | Problems with multiple viable solution paths | Stronger exploration, but much higher compute cost
Reflection / Self-Critique | Quality refinement and validation | Improves output quality, but adds latency and cost
CodeAct | Tasks requiring code generation and execution | Enables precise computation and automation, but requires sandboxing and execution safeguards

Additional reasoning patterns exist, but the patterns above represent the most commonly used approaches in enterprise agent systems. Selection should be guided by task complexity, quality requirements, latency tolerance, and tool-use needs.

Note: ReAct iteration limits: Set max iterations based on cost tolerance. Each iteration costs one full LLM call. For example: for 10 iterations with a frontier model, cost can exceed $0.10 per query. Iteration counts should be monitored in production – consistently high counts indicate the agent is not making progress.
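A bounded ReAct loop might be sketched as below. The `decide`/`act` callables and the `("final" | "tool", payload)` tuple protocol are assumptions of this sketch; in a real agent, `decide` wraps an LLM call and `act` invokes a tool:

```python
def react_loop(decide, act, max_iterations: int = 5) -> dict:
    """Bounded reason-act loop.

    decide(observations) returns ("final", answer) to stop, or
    ("tool", tool_input) to take another action; act(tool_input)
    returns an observation appended for the next decision. The
    iteration cap bounds cost and escalates when progress stalls.
    """
    observations = []
    for i in range(max_iterations):
        kind, payload = decide(observations)
        if kind == "final":
            return {"answer": payload, "iterations": i + 1}
        observations.append(act(payload))
    # Budget exhausted without an answer: surface for escalation.
    return {"answer": None, "iterations": max_iterations, "escalate": True}
```

Emitting the iteration count with each result makes the "consistently high counts" signal from the note directly observable in telemetry.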

Interaction & Tool Use Patterns

Tool interaction patterns define how agents invoke external capabilities, including function calling, API orchestration, and tool chaining.

3.5 Agent Framework Selection

Agent frameworks (LangChain, LlamaIndex, AutoGen, CrewAI) abstract common patterns: tool calling, memory management, multi-agent coordination.

Direct implementation means custom code calling LLM APIs (OpenAI, Anthropic Claude, etc.) without framework abstraction.

Framework adoption is an important architectural decision. Frameworks simplify development by abstracting common agent patterns and accelerating implementation, but they also introduce an additional layer that can reduce transparency and fine-grained control in complex workflows.

Framework Selection Criteria

Factor | Framework | Direct Implementation
Use Case Coverage | >80% of patterns are standard (RAG, tool use, chat) | Custom patterns the framework does not support
Development Speed | Team is new to Agentic AI and needs rapid prototyping | Team has LLM API experience and is optimizing for performance
Latency Requirements | Can tolerate <20% overhead from the abstraction layer | Need <100ms end-to-end; every millisecond matters
Debugging Complexity | Can accept framework abstraction in stack traces | Need full visibility into every LLM call and token
Ecosystem Needs | Need 100+ pre-built integrations (databases, APIs, tools) | Custom integrations only, no ecosystem dependency

Decision Guidance (Illustrative):

Framework-based approaches are generally effective when most required capabilities are supported and the operational overhead remains within acceptable limits.

Direct implementation may be more suitable when workflows require highly specialized execution patterns, strict latency constraints, or full control over model interactions.

These thresholds should be validated against real workloads rather than treated as fixed rules.

When Frameworks vs Direct Implementation

Use Frameworks When: Building standard RAG pipelines, multi-agent coordination, or tool-heavy workflows where pre-built integrations save significant development time. Frameworks are the de facto choice for most production agent development.

Use Direct Implementation When: Latency is critical (<100ms SLO), custom reasoning patterns are required, or the team needs full control over every LLM call for optimization and debugging. Examples: high-frequency trading agents, real-time decision systems.

Framework Comparison

| Framework | Primary Strength | Best For |
| --- | --- | --- |
| LangChain | Broadest ecosystem, most integrations | General-purpose agents, tool-heavy workflows |
| LlamaIndex | RAG-optimized data pipelines | Knowledge-intensive agents, document Q&A |
| AutoGen | Multi-agent conversation patterns | Agent-to-agent workflows, debate/consensus |
| CrewAI | Role-based multi-agent teams | Structured team workflows, task delegation |

 

The frameworks listed above represent commonly used options and are not exhaustive.

Multi-Framework Strategy:

It is valid to combine frameworks for different layers. Example: LlamaIndex for RAG pipeline + LangChain for agent orchestration + direct API calls for latency-critical skills. The key is clear boundary ownership — each layer has one framework responsible.

⚠ Anti-pattern

Choosing a framework based on popularity without benchmarking against actual use cases. Every framework has domains where it excels and domains where it adds friction. Benchmark with real use cases.

4. Models, Tools & Integration

This area covers the components an agent calls: language models (reasoning engine), tools (action layer), and the infrastructure connecting them (LLM Gateway). The central challenge is cost optimization without sacrificing quality — model costs are typically the largest operating expense.

4.1 Model Selection & Cascading

No single model is optimal for all tasks. Small models are fast and economical but limited in complex reasoning. Frontier models are powerful but expensive. Model cascading routes each query to the cheapest model that can handle it reliably — this is one of the highest ROI cost optimizations for agent systems, alongside prompt caching and tool parallelization.

| Tier | Characteristics | Relative Cost | Best For |
| --- | --- | --- | --- |
| Small LM (7B–13B) | Fast inference (<200ms), limited reasoning | ~100× cheaper than frontier | Classification, extraction, simple formatting |
| Mid-tier LM (70B or equivalent) | Good reasoning, moderate speed | ~10× cheaper than frontier | Standard Q&A, tool selection, summarization |
| Frontier LM (GPT-4 class) | Best reasoning, slower | Baseline cost | Complex multi-step reasoning, ambiguous queries |
| Specialist / Fine-tuned | Domain-specific accuracy | Training cost amortized | High-volume, narrow-domain tasks |

Model Routing vs Cascading

Model routing selects the most appropriate model based on task characteristics. Cascading is a specific routing strategy that escalates queries across model tiers when confidence is insufficient.

Mixture-of-Experts (MoE) Routing

MoE-style routing selects different models based on task type, improving both cost efficiency and performance.

Selection Criteria

| Factor | Consideration |
| --- | --- |
| Task diversity | Different query types require different capabilities |
| Cost sensitivity | Whether routing reduces use of expensive models |
| Latency tolerance | Whether routing overhead is acceptable |
| Volume | Whether scale justifies multiple model tiers |
| Quality variance | Whether specialized models outperform general models for specific tasks |

MoE routing is most effective when workloads are heterogeneous and routing decisions can be measured and optimized.

Confidence Estimation for Cascading

Cascading relies on identifying when a lower-tier model is insufficient and escalation is required. Confidence signals should be derived from measurable indicators rather than assumed model certainty.

The methods below represent practical approaches to estimating confidence. Each method provides a signal that can be combined or calibrated based on evaluation data.

| Method | How It Works | Accuracy | Latency Cost | Recommendation |
| --- | --- | --- | --- | --- |
| Token Probabilities | Use model’s logprob output; low max-prob = uncertain | Moderate | Zero extra latency | Good first signal; combine with other methods |
| Consistency Check | Run same query 2× with temperature>0; disagreement = uncertain | High | +1 full LLM call | Use for mid→frontier escalation decisions |
| Separate Classifier | Fine-tuned binary model: ‘can small model handle this?’ | High (tunable) | Small LM latency only | Best for high-volume, well-defined domains |
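As a minimal sketch of the token-probability method, the helper below turns per-token logprobs (which some provider APIs can return) into a single score: the geometric-mean token probability. How you threshold that score is a calibration question, not a property of the formula.

```python
import math

def logprob_confidence(token_logprobs):
    # Geometric-mean token probability: exp of the average logprob.
    # Values near 1.0 suggest the model was confident token-by-token;
    # low values flag candidates for escalation.
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)
```

For example, a completion whose tokens each had probability 0.5 scores 0.5, while a completion of near-certain tokens scores close to 1.0.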

Confidence Calibration

Confidence signals should be calibrated against evaluation data rather than treated as intrinsic truth.

A practical approach is to measure:

Calibration accuracy = (correctly handled queries within a confidence band) / (total queries in that band)

This allows routing thresholds to be tuned based on acceptable error rates, escalation volume, latency, and cost.
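One way to compute this, sketched below with made-up band boundaries, is to bucket evaluation records of (confidence, was_correct) pairs and report per-band accuracy:

```python
from collections import defaultdict

def calibration_by_band(records, bands=((0.0, 0.5), (0.5, 0.8), (0.8, 1.0))):
    """records: iterable of (confidence, was_correct) pairs from evaluation.
    Returns {band: fraction of queries in that band handled correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for confidence, was_correct in records:
        for lo, hi in bands:
            # Half-open bands, with 1.0 folded into the top band.
            if lo <= confidence < hi or (hi == 1.0 and confidence == 1.0):
                totals[(lo, hi)] += 1
                correct[(lo, hi)] += int(was_correct)
                break
    return {band: correct[band] / totals[band] for band in totals}
```

If the top band's accuracy is below your acceptable error rate, the upper routing threshold is too generous and should be raised (at the cost of more escalations).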

Illustrative Routing Flow

Query → Small model tier

  • If confidence is above the upper threshold, return the result.
  • If confidence falls within the mid-range, escalate to a stronger model.
  • If confidence is below the lower threshold, escalate to the most capable model or to a human review path.

Threshold values should be calibrated using evaluation data rather than fixed globally.
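The flow above reduces to a thin routing function. In the sketch below, the model callables and `confidence_fn` are placeholders for your own stack, and the default thresholds are illustrative, not recommendations:

```python
def cascade(query, small_model, mid_model, frontier_model,
            confidence_fn, upper=0.85, lower=0.50):
    """Escalate through model tiers based on a calibrated confidence signal."""
    answer = small_model(query)
    score = confidence_fn(query, answer)
    if score >= upper:
        return answer, "small"                 # confident: cheapest tier wins
    if score >= lower:
        return mid_model(query), "mid"         # mid-range: escalate one tier
    return frontier_model(query), "frontier"   # low: strongest tier (or human review)
```

Returning the tier label alongside the answer makes the routing distribution directly observable, which is what the threshold-tuning loop needs.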

Implementation: Cascading can be implemented in the LLM Gateway layer, with frameworks such as LiteLLM (which supports model fallback routing), or with custom logic in the agent orchestration layer.

Note: The 80/15/5 split (80% small, 15% mid, 5% frontier) is a starting target. Measure the actual distribution after the first few weeks and tune thresholds. Some domains will land at 95/4/1; others at 50/30/20.

4.2 Tool Registry & Lifecycle

Tools are the agent’s interface to external systems. A tool registry is the source of truth for what tools exist, how to call them, and how they behave when they fail. Without a registry, tools quickly become undocumented and difficult to monitor.

Tool Discovery & Registry Patterns

Tools may be discovered and managed using different approaches depending on system scale.

| Discovery Pattern | Best For | Tradeoff |
| --- | --- | --- |
| Static registry | Small, stable tool sets | Simple, but manual to maintain |
| Dynamic registry | Frequently changing tool sets | Flexible, but adds a runtime dependency |
| MCP-based discovery | Multi-agent and interoperable systems | Standardized, but requires protocol support |

Decision Guidance (Illustrative)

Static registries are suitable for controlled environments. Dynamic or MCP-based discovery becomes more appropriate as tool count, update frequency, or interoperability requirements increase.

| Registry Element | What It Contains | Why It Matters |
| --- | --- | --- |
| Identity | Name, version, owner, deprecation date | Prevents calling stale or unmaintained tools |
| Contract | OpenAPI spec, input/output schema | Enables the model to call tools correctly without examples |
| SLA | Timeout (typically 30s), success rate target (99.5%) | Allows the agent to make informed retry/fallback decisions |
| Fallback Chain | Primary → Secondary → Escalate | Defines behavior before failure occurs, not during incident response |
| Cost | Per-call cost estimate | Enables cost-aware routing and budget enforcement |
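A hypothetical static-registry entry covering the five elements above might look like the following; every name and value here is illustrative, not a standard schema:

```python
# One entry in a static tool registry. In practice this would live in a
# JSON/YAML manifest or a registry service rather than in code.
TOOL_REGISTRY = {
    "crm_lookup": {
        "identity": {
            "version": "v2",
            "owner": "crm-platform-team",
            "deprecation_date": None,          # None = not scheduled
        },
        "contract": {
            "input_schema": {"customer_id": "string"},
            "output_schema": {"account": "object"},
        },
        "sla": {"timeout_s": 30, "success_rate_target": 0.995},
        "fallback_chain": ["crm_lookup_replica", "escalate_to_human"],
        "cost_per_call_usd": 0.002,
    }
}
```

The point is that retry logic, routing, and cost accounting can all read from one declarative source instead of being hard-coded per tool.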

Tool Discovery by Scale

| Tool Count | Discovery Method | Implementation |
| --- | --- | --- |
| <20 tools | Static JSON manifest | All tools listed in system prompt; model selects from full set |
| 20–100 tools | Dynamic retrieval | Query embedding → retrieve top-5 relevant tools → model selects |
| >100 tools | Capability-based routing + MCP | Intent → tool category → specific tool. MCP (Model Context Protocol) enables dynamic capability advertisement. |
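The middle tier's "retrieve top-5 relevant tools" step is a similarity search over tool-description embeddings. A dependency-free sketch, assuming the embeddings come from your own model:

```python
import math

def top_k_tools(query_vec, tool_vecs, k=5):
    """Rank tools by cosine similarity between the query embedding and each
    tool-description embedding. Assumes non-zero vectors of equal length."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    ranked = sorted(tool_vecs.items(),
                    key=lambda item: cos(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

Only the returned shortlist is placed in the prompt, which keeps the tool-selection context small regardless of total registry size. At production scale this would use a vector index rather than a linear scan.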

Tool Selection Considerations

Tool selection should consider factors such as the number of tools in the system, update frequency, interoperability requirements, latency constraints, and governance needs such as access control and auditability. These considerations influence whether static registries, dynamic discovery, or MCP-based approaches are appropriate.

Tool Versioning & Breaking Changes

Tool APIs change. When an external API changes its contract, the agent layer should remain backward compatible and resilient to the change.

Tool version management strategy:

  • Tools are typically versioned explicitly (for example v1, v2) rather than modified in place to ensure backward compatibility and safe upgrades.
  • Run shadow traffic: new version receives all requests in parallel (no user impact) for 48–72 hours before cutover. Compare outputs.
  • Deprecation path: old version remains callable for 30 days post-cutover. Log all calls to deprecated versions as warnings.
  • Fallback to previous version: if new version error rate exceeds 5%, automatically fall back to the previous version.

4.3 Retry, Timeout & Circuit Breaker

Every external tool call can fail. The retry strategy determines how the agent recovers — or knows when to stop trying.

| Parameter | Typical Configuration | Rationale |
| --- | --- | --- |
| Max Retries | 3 | Beyond 3, the tool is likely experiencing an outage, not a transient error |
| Base Backoff | 1 second | Exponential: 1s → 2s → 4s. Prevents load spikes on shared services. |
| Jitter | ±25% of backoff | Prevents retry storms when multiple agents hit the same failing tool simultaneously |
| Timeout (per call) | 30 seconds | Long enough for most APIs; short enough to not block workflow indefinitely |
| Circuit Breaker Open | After 3 failures in 60s | Immediately fail fast once a tool is known-broken; prevents wasted retries |
| Circuit Breaker Reset | After 30s half-open probe | Allow one test call to check if the tool has recovered |
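A minimal sketch wiring these parameters together, assuming a synchronous tool call; production systems would typically use a hardened resilience library instead:

```python
import random
import time

class CircuitOpen(Exception):
    pass

class ToolCaller:
    """Retry with exponential backoff and jitter, plus a simple circuit
    breaker, using the illustrative parameters from the table above."""

    def __init__(self, max_retries=3, base_backoff=1.0, jitter=0.25,
                 failure_threshold=3, window_s=60, reset_s=30):
        self.max_retries = max_retries
        self.base_backoff = base_backoff
        self.jitter = jitter
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.reset_s = reset_s
        self._failures = []      # timestamps of recent failures
        self._opened_at = None   # when the breaker last tripped

    def call(self, tool, *args):
        if self._opened_at is not None:
            if time.monotonic() - self._opened_at < self.reset_s:
                raise CircuitOpen("tool known-broken; failing fast")
            self._opened_at = None  # half-open: allow one probe call
        for attempt in range(self.max_retries + 1):
            try:
                result = tool(*args)
                self._failures.clear()  # success clears the failure window
                return result
            except Exception:
                now = time.monotonic()
                self._failures.append(now)
                recent = [t for t in self._failures if now - t <= self.window_s]
                if len(recent) >= self.failure_threshold:
                    self._opened_at = now
                    raise CircuitOpen("breaker opened after repeated failures")
                if attempt == self.max_retries:
                    raise
                # Exponential backoff (1s -> 2s -> 4s) with +/-25% jitter.
                delay = self.base_backoff * (2 ** attempt)
                delay *= 1 + random.uniform(-self.jitter, self.jitter)
                time.sleep(delay)
```

Note the half-open probe: once the reset window elapses, exactly one call is allowed through to test recovery, rather than releasing the full retry load at once.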

⚠ Anti-pattern

Retrying indefinitely or at fixed intervals. Infinite retries can block workflows indefinitely under persistent failures, and fixed intervals synchronize retries across agents, producing retry storms against widely shared tools.

4.4 The LLM Gateway

The LLM Gateway is the single point through which all model calls flow. It provides model routing (cascading), rate limiting, prompt caching, cost metering, and observability. Without it, there is no visibility into model usage and no ability to optimize costs.

Request Flow: User Request → API Gateway (auth, rate limiting) → LLM Gateway (model routing, caching, metering) → Model Provider APIs (for example, GPT-class or Claude-class models)

| Gateway Option | Best For | Key Features | Cost |
| --- | --- | --- | --- |
| LiteLLM | Teams starting out, open-source preference | 100+ provider support, unified API, model routing | Free (open source) |
| Portkey | Growing teams (10K–100K req/day) | Advanced routing, prompt caching, A/B testing, observability dashboards | Per-request pricing |
| Custom-built | High-volume (>100K req/day) with unique routing logic | Full control, custom algorithms, internal tooling integration | Engineering investment |

Cost Attribution: The LLM Gateway should tag every request with originating agent, skill invoked, workflow ID, and model tier used. This is the foundation for cost-per-workflow analysis — without it, expensive workflows cannot be identified and optimized.
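A sketch of the two halves of that loop, with illustrative field names rather than any real gateway API: tag each outbound call, then roll the resulting ledger up into cost per workflow.

```python
import time
import uuid

def tag_request(agent, skill, workflow_id, model_tier):
    """Attribution metadata the gateway would attach to every model call."""
    return {
        "request_id": str(uuid.uuid4()),
        "agent": agent,
        "skill": skill,
        "workflow_id": workflow_id,
        "model_tier": model_tier,
        "timestamp": time.time(),
    }

def cost_per_workflow(ledger):
    """Aggregate per-request cost records into totals per workflow ID."""
    totals = {}
    for record in ledger:
        wf = record["workflow_id"]
        totals[wf] = totals.get(wf, 0.0) + record["cost_usd"]
    return totals
```

With tags like these on every request, "which workflow is burning the frontier-model budget" becomes a one-line aggregation instead of a forensic exercise.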

Note: Calling model APIs directly from agent code without a gateway is acceptable for prototyping. For production systems, an LLM Gateway is essential for visibility, cost control, and optimization.

Key Takeaways

  • Use case complexity is a key factor in determining whether agentic AI and multi-step reasoning are appropriate.
  • Skill contracts benefit from structured inputs/outputs, error handling, and fallback paths to support testing and composition.
  • RAG pattern selection is typically driven by query complexity, quality requirements, and whether retrieval is necessary.
  • Agent frameworks are evaluated based on use case fit (e.g., capability coverage) and operational overhead (e.g., latency), using real workload benchmarks.
  • Model cascading is commonly used to improve cost efficiency by routing queries to the most cost-effective capable model.

Lavanya Subbarayalu is an enterprise AI architect specializing in large-scale AI platforms, agentic systems, and responsible AI design. She focuses on building practical, production-ready architectures that enable scalable, reliable, and governed deployment of AI systems across enterprise environments.