... Agent Goals, RAG, Agent Architecture & Models
By Lavanya Subbarayalu, Principal Solution Architect, PwC
Designing Enterprise Agentic AI Systems
Enterprise adoption of Agentic AI requires more than connecting large language models to tools. Production systems should coordinate reasoning, retrieval, multi-agent collaboration, tool execution, and runtime orchestration while maintaining reliability, security, and operational visibility.
Many early implementations fail because core architectural decisions are implicit rather than designed deliberately. Without clear boundaries between reasoning, coordination, execution, and governance layers, multi-agent systems quickly become fragile, difficult to evaluate, and expensive to operate.
This document provides practical guidance for architects and AI engineers designing enterprise-grade Agentic AI systems.
The framework organizes Agentic AI design into eight architecture areas.
Part 1 focuses on the foundational design areas, which define how agents reason, retrieve information, and interact with enterprise systems:
- Agent goals, skills, and context management
- Retrieval Augmented Generation (RAG) design
- Agent architecture patterns and frameworks
- Models, tools, and integration strategies
Part 2 covers the operational layers required to run agent systems in production, including communication protocols, evaluation, infrastructure, reliability, and security.
Enterprise Agentic AI Reference Architecture
The following diagram illustrates the Agentic AI reference architecture used throughout this design guidance.

Figure 1: Enterprise Agentic AI reference architecture
The diagram illustrates the runtime execution flow and supporting operational components used in Enterprise Agentic AI systems.
- Numbered boxes represent the primary runtime execution path.
- Solid arrows represent runtime execution.
- Dashed arrows represent asynchronous data flows and feedback loops.
- Left panels represent data pipelines.
- Right panels represent evaluation and governance capabilities.
0. Should You Build an Agentic System?
Before designing an agent architecture, the suitability of Agentic AI for the problem should be validated. The use case can be evaluated across the following dimensions.
| Dimension | Traditional AI Signal | Agentic AI Signal |
| Task Complexity | Single step: classify, predict, extract | Multi-step: plan → execute → verify → adapt |
| Decision Autonomy | Human directs every action | System makes intermediate decisions within guardrails |
| Tool Usage | 0–2 fixed integrations, hard-coded | 3+ tools, dynamically selected based on context |
| Workflow Shape | Linear, deterministic pipeline | Branching, loops, context-dependent paths |
| Error Handling | Fail fast, human retries | Self-corrects, tries alternatives, escalates intelligently |
Decision Guidance (Illustrative):
Score each dimension 1 (traditional) to 5 (agentic). Total >20 → agentic justified. Total 12–20 → hybrid (agentic for specific sub-flows only). Total <12 → traditional AI.
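The illustrative scoring rule above can be sketched as a small function. The dimension names in the example dictionary are shorthand for the five table dimensions; the thresholds are the indicative ones stated above, not fixed rules.

```python
# Illustrative classifier for the five suitability dimensions.
# Each dimension is scored 1 (traditional) to 5 (agentic).
def classify_use_case(scores: dict[str, int]) -> str:
    """Apply the illustrative thresholds: >20 agentic, 12-20 hybrid, <12 traditional."""
    total = sum(scores.values())
    if total > 20:
        return "agentic"
    if total >= 12:
        return "hybrid"
    return "traditional"

example = {
    "task_complexity": 5,
    "decision_autonomy": 4,
    "tool_usage": 5,
    "workflow_shape": 4,
    "error_handling": 4,
}
# total = 22 -> "agentic"
```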
⚠ Anti-pattern
Building agentic systems for simple classification or single-API tasks. Linear and deterministic workflows are better served by traditional pipelines.
0.1 Design Principles for Enterprise Agentic AI
Enterprise agent systems introduce new architectural complexity. The following principles help keep systems reliable, scalable, and maintainable.
Separate reasoning from execution
Planning, coordination, and execution should be implemented as distinct layers.
Separate orchestration from agent tasks
Orchestration manages workflow state while agents focus on reasoning and task execution.
Treat skills as contracts
Agent skills should expose structured inputs, outputs, and error handling.
Constrain agent autonomy
Agents should operate within defined tool permissions, task scope, and safety policies.
Design feedback loops early
Evaluation, human review, and telemetry should continuously improve the system.
Optimize model usage
Model routing or cascading can be used to balance capability, latency, and cost.
1. Agent Goals, Skills & Context
This area defines what the agent is trying to accomplish (goals), what discrete capabilities it has (skills), and how it manages information needed to act (context).
1.1 Goal Design
Every agent should have exactly one measurable primary goal. Multiple competing objectives lead to unpredictable prioritization. The goal should be decomposable into a sequence of skills that can be individually validated.
| Goal Property | Requirement | Example |
| Measurability | Quantifiable success metric | Resolution rate >95% within 2 turns |
| Decomposability | Expressible as ordered skill sequence | Intake → Verify → Route → Resolve |
| Scope | Bounded — agent can fail gracefully | Single customer issue per session |
| Observability | Every step is loggable and traceable | Each skill invocation has correlation ID |
1.2 Skill Design & Contracts
Skills are atomic capabilities. Each skill has a defined contract: what it accepts, what it returns, and how it fails. This contract enables independent testing and composition.
The Skill Contract Schema:
| Contract Element | Description | Example |
| Name | Unique identifier, verb-noun format | verify_identity |
| Input Schema | Typed input specification (JSON Schema) | {"customer_id": "string", "method": "enum[email,sms,push]"} |
| Output Schema | Typed success response | {"verified": "bool", "confidence": "float", "evidence": "string"} |
| Error Codes | Enumerated failure modes | NOT_FOUND, TIMEOUT, LOW_CONFIDENCE, RATE_LIMITED |
| SLA | Timeout and reliability targets | Timeout: 5s, Success rate: 99.5% |
| Fallback | What happens on failure | Try secondary provider → escalate if both fail |
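A skill contract can be represented directly in code. The sketch below is a minimal dataclass whose fields mirror the contract elements in the table; the `verify_identity` values are the table's examples, not a real implementation.

```python
# Minimal skill-contract representation; field names follow the table above.
from dataclasses import dataclass

@dataclass
class SkillContract:
    name: str                    # unique identifier, verb-noun format
    input_schema: dict           # JSON Schema for inputs
    output_schema: dict          # JSON Schema for the success response
    error_codes: list[str]       # enumerated failure modes
    timeout_s: float             # SLA timeout
    success_rate_target: float   # SLA reliability target
    fallback: str                # behavior on failure

verify_identity = SkillContract(
    name="verify_identity",
    input_schema={"customer_id": "string", "method": "enum[email,sms,push]"},
    output_schema={"verified": "bool", "confidence": "float", "evidence": "string"},
    error_codes=["NOT_FOUND", "TIMEOUT", "LOW_CONFIDENCE", "RATE_LIMITED"],
    timeout_s=5.0,
    success_rate_target=0.995,
    fallback="secondary_provider_then_escalate",
)
```

Because the contract is a typed object, it can be registered, validated, and tested independently of the agent that invokes it.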
How Many Skills Per Agent?
| Skill Count | Action | Rationale |
| 1–3 | Single agent, flat structure | Low complexity, minimal orchestration |
| 3–5 | Single agent, optimal range | Sufficient capability without coordination overhead |
| 6–8 | Consider splitting into sub-agents | Orchestration complexity increases significantly |
| >8 | Decompose — hierarchical routing required | Single agent cannot reliably manage >8 skills |
⚠ Anti-pattern
Assigning 10+ skills to one agent. Orchestration complexity grows with skill count. At >8, route to specialized sub-agents instead.
1.3 Skill Routing & Invocation
When an agent receives a user intent, it should decide which skill to invoke. This routing decision is critical — incorrect routing wastes a full skill execution and may produce incorrect results.
| Routing Method | How It Works | When to Use | Accuracy |
| Semantic Similarity | Embed intent, cosine-match to skill descriptions | General-purpose, <20 skills | 85–92% |
| Intent Classifier | Dedicated fine-tuned classifier model | >20 skills, high-volume | 92–97% |
| Keyword Rules | Regex/keyword matching to skill triggers | Simple, well-defined domains | 95%+ if domain is narrow |
| LLM Router | Frontier model selects skill from description set | Complex ambiguous intents | 88–94%, higher latency |
Fallback Trigger Logic:
Route fails → retry with next-best match (if confidence gap <0.1) → if still failing, escalate to human. Circuit breaker: 3 consecutive failures on same skill → mark as degraded, bypass in routing.
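The fallback and circuit-breaker logic above can be sketched as follows. This is an illustrative in-memory version; the class and method names are assumptions, and a production router would persist failure counts and emit telemetry.

```python
# Sketch of routing fallback with a per-skill circuit breaker, per the
# trigger logic above: retry next-best when the confidence gap is <0.1,
# and bypass a skill after 3 consecutive failures.
class SkillRouter:
    FAILURE_THRESHOLD = 3  # consecutive failures before a skill is degraded
    CONFIDENCE_GAP = 0.1   # retry next-best only when candidates are this close

    def __init__(self):
        self.failures: dict[str, int] = {}

    def is_degraded(self, skill: str) -> bool:
        return self.failures.get(skill, 0) >= self.FAILURE_THRESHOLD

    def record(self, skill: str, success: bool) -> None:
        self.failures[skill] = 0 if success else self.failures.get(skill, 0) + 1

    def pick(self, ranked: list[tuple[str, float]]) -> list[str]:
        """Return the primary skill, plus a retry candidate when the
        confidence gap to the next-best match is below the threshold.
        An empty result means: escalate to a human."""
        candidates = [(s, c) for s, c in ranked if not self.is_degraded(s)]
        if not candidates:
            return []
        picks = [candidates[0][0]]
        if (len(candidates) > 1
                and candidates[0][1] - candidates[1][1] < self.CONFIDENCE_GAP):
            picks.append(candidates[1][0])
        return picks
```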
Routing Accuracy Measurement
Routing accuracy = (correct skill selections / total routing decisions) × 100
This metric should be evaluated against a labeled intent dataset and reviewed when skills or routing logic change.
⚠ Anti-pattern
Using an expensive frontier model for every routing decision. For <20 skills with clear descriptions, semantic similarity or a small, fine-tuned classifier is 5–10x cheaper and often more accurate.
Note: Frontier models are the most advanced and capable LLMs (e.g., GPT-class, Claude-class, or other flagship models). They offer the best reasoning but are the most expensive.
1.4 Context & Prompt Management
Context is the information available to an agent when making decisions or executing skills. Managing context is a resource allocation problem: insufficient context prevents action; excessive context degrades quality, increases cost, and risks hitting token limits.
Context Budget Calculation
Step 1 — Determine Model Window: The model’s maximum context window should be obtained from provider documentation (e.g., 32K, 128K, 200K tokens).
Step 2 — Calculate Token Budget: As a rule of thumb, allocate at most 20% of the window to context. Example: 128K window → 25.6K token budget for context.
Step 3 — Allocate Context Budget: The context budget should be divided among conversation turns, retrieved chunks, and skill instructions. Typically, the remaining 80% is reserved for the model’s reasoning and output generation.
The allocations given below are illustrative starting points and should be adjusted based on use case, retrieval needs, and model behavior.
| Model Window | Illustrative Context Budget | Example Allocation (Indicative) |
| 32K tokens | 6.4K tokens | ~3 conversation turns + 1 retrieved chunk |
| 128K tokens | 25.6K tokens | ~5 turns + 3–4 retrieved chunks + skill instructions |
| 200K tokens | 40K tokens | ~8 turns + 5–6 chunks + full skill suite + examples |
Note: The 20% guideline reserves 80% for the model’s reasoning, skill instructions, and output generation. Exceeding this threshold consistently can lead to quality degradation — models may truncate or ignore later context.
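The three-step calculation above reduces to simple arithmetic. The sketch below applies the illustrative 20% guideline; the fraction is a starting point to be tuned, as the note says.

```python
# Illustrative context-budget arithmetic following the 20% guideline above.
def context_budget(window_tokens: int, context_fraction: float = 0.20) -> dict:
    budget = int(window_tokens * context_fraction)
    return {
        "context_budget": budget,
        "reserved_for_reasoning_and_output": window_tokens - budget,
    }

# 128K window -> 25.6K context budget, per the example in Step 2.
print(context_budget(128_000))
# {'context_budget': 25600, 'reserved_for_reasoning_and_output': 102400}
```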
Hot vs Cold Context
| Context Type | Where Context is Stored | What It Contains | Retrieval |
| System Prompt | In-context (always) | Agent identity, skill list, behavioral rules | Not retrieved (pre-injected) |
| Hot Context | In-context (dynamic) | Last 3–5 turns, current task state | Maintained per session |
| Cold Context | Redis (TTL 24–72hr) | Previous task outcomes, user preferences | Retrieved on demand via similarity search |
| Knowledge Base | Vector DB | Domain documentation, policies | Retrieved per skill invocation |
Prompt Engineering Techniques
While many prompt techniques exist, the following represent commonly used approaches in enterprise agent systems.
| Technique | When to Use | Trade-off |
| Chain-of-Thought | Multi-step reasoning, math, logic | +25% accuracy vs costs 2× tokens |
| Few-shot (3–5 examples) | Consistent output formatting | +20% format compliance vs adds ~2K tokens cost |
| Role Prompting | Define agent scope, authority, and boundaries | Reduces drift vs minimal overhead |
| Goal-Oriented Prompting | Clear objectives with measurable outcomes | Improves consistency vs may reduce flexibility |
| Task Planning Prompts | Multi-step workflows requiring structured execution | Better control & structure vs adaptability (may be rigid) |
| Self-Correction | High-stakes decisions (financial, medical) | +10% reliability vs adds 50% latency per step |
Decision Guidance:
Choose techniques based on the task’s accuracy requirements and acceptable cost increase. Chain-of-Thought is valuable for complex reasoning; few-shot ensures format consistency; self-correction adds reliability for critical decisions.
Structured Output Enforcement: Skill outputs should be validated against the defined JSON Schema. Use the model’s built-in structured output mode (e.g., response_format in OpenAI, or tool_use forcing in Anthropic) rather than post-hoc regex parsing. Post-hoc parsing fails on edge cases.
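As a fallback where a provider's structured output mode is unavailable, a minimal validation layer can be sketched with the standard library alone (a production system might use the `jsonschema` package instead). The field names reuse the `verify_identity` output example from the skill contract table; the function name is an assumption.

```python
# Minimal sketch: parse and type-check a skill's JSON output before it is
# passed downstream; raise rather than forward malformed output.
import json

REQUIRED_FIELDS = {"verified": bool, "confidence": float, "evidence": str}

def validate_skill_output(raw: str) -> dict:
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"bad type for {field}")
    return data

ok = validate_skill_output('{"verified": true, "confidence": 0.92, "evidence": "sms"}')
```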
⚠ Anti-pattern
Free-form text outputs from skills that downstream skills should parse. Every skill output should be schema-validated. Unstructured outputs are a common cause of cascading failures in agent pipelines.
2. RAG, Memory & Vector Search
Retrieval-Augmented Generation (RAG) gives agents access to knowledge beyond their training data. Memory provides continuity across interactions. Together, these form the knowledge layer. This area covers pattern selection, component configuration, and production considerations.
2.1 RAG Design Decision Framework
The need for RAG should be evaluated based on the following criteria.
The thresholds and ranges presented below are example starting points and should be calibrated based on domain requirements, system constraints, and operational data.
| Evaluation Criteria | No RAG Needed | RAG Required |
| Use Case Complexity | Single-turn Q&A, knowledge fits in prompt (<5K tokens) | Multi-document synthesis, knowledge exceeds prompt capacity |
| Data Freshness | Static knowledge, updates <monthly | Frequently updated content (daily/weekly) |
| Data Volume | Small corpus (<50 documents) | Large corpus (>100 documents) |
| Query Type | Broad conceptual questions | Specific fact retrieval, citation needed |
| Precision Requirements | Best-effort answers acceptable | Should be grounded in specific sources |
Decision Guidance (Illustrative):
When multiple criteria indicate that retrieval is required, RAG-based approaches are typically considered. When criteria consistently indicate that retrieval is not required, alternatives such as few-shot prompting or fine-tuning may be sufficient.
These conditions should be validated against domain requirements and evaluation results rather than treated as fixed rules. The thresholds in the table above are indicative starting points and should be calibrated based on domain complexity and risk tolerance.
Note: When the domain is narrow and well-defined with stable structured data, consider direct database queries or API calls instead of RAG. RAG is optimized for unstructured text retrieval, not structured data lookups.
2.2 RAG Pattern Selection
There are 25+ RAG patterns documented in research. These include patterns such as Standard RAG, Hybrid RAG, Self-RAG, Corrective RAG, Multi-hop RAG, Fusion RAG, and Agentic RAG. The commonly used production patterns fall into these categories based on complexity and quality requirements:
| Pattern Category | Representative Patterns | When to Use |
| Basic Retrieval | Standard RAG, Sparse RAG, Constrained RAG | Fast fact lookup, protocol retrieval, <500ms latency |
| Advanced / Hybrid Retrieval / Query Optimization | Hybrid RAG (semantic + keyword), Fusion RAG, Adaptive RAG | Mixed technical and conceptual queries, variable query complexity, common in production |
| Context & Conversation | Conversation RAG, Memory-Augmented RAG, Contextual RAG | Multi-turn interactions, follow-up questions, session-aware retrieval |
| Quality & Reliability | Corrective RAG, Self-RAG, Citation-Aware RAG | High-stakes domains, auditability, verification of retrieved evidence is required |
| Multi-Document Reasoning | Iterative RAG, Multi-hop RAG, Hierarchical RAG, Chain-of-Retrieval RAG | Synthesis across multiple sources, comparative analysis, evidence chaining |
| Scale & Performance | Federated RAG, Long-Context RAG | Multi-system retrieval, large corpora, distributed knowledge environments |
| Agentic Retrieval | Agentic RAG, Speculative RAG, RL-RAG, ReFeed RAG | Complex workflows where retrieval should be planned, refined, or optimized over time |
| Specialized Domain Retrieval | Multimodal RAG, Reasoning RAG, Few-shot RAG, Prompt-Augmented RAG | Mixed-modality inputs, structured output generation, domain-specific reasoning |
The patterns listed above represent the most commonly used production patterns. More advanced patterns should be adopted when complexity and quality requirements justify them.
RAG Pattern Selection Criteria
| Factor | Consideration |
| Query complexity | Single-step lookup vs. multi-step evidence synthesis |
| Source requirements | Single trusted source vs. multiple sources requiring comparison or validation |
| Freshness requirements | Static content vs. frequently updated content |
| Latency tolerance | Real-time retrieval vs. multi-stage retrieval and reranking |
| Reliability requirements | Best-effort responses vs. grounded, auditable, citation-backed outputs |
| Modality | Text-only vs. multimodal inputs such as images, forms, or scanned documents |
Production Guidance (Illustrative):
Hybrid RAG (semantic + keyword retrieval) is commonly used as a production baseline. A typical starting point is a weighted combination (for example, 0.7 semantic / 0.3 keyword), which should be calibrated based on query distribution and domain characteristics.
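The weighted combination above can be sketched as a score-fusion step. This assumes both retrievers return scores normalized to [0, 1]; the 0.7/0.3 weights are the illustrative starting point from the text, and `fuse` is a hypothetical helper name.

```python
# Sketch of weighted hybrid-score fusion (semantic + keyword).
def hybrid_score(semantic: float, keyword: float,
                 w_semantic: float = 0.7, w_keyword: float = 0.3) -> float:
    return w_semantic * semantic + w_keyword * keyword

def fuse(semantic_hits: dict[str, float], keyword_hits: dict[str, float]) -> list[str]:
    """Rank document IDs by combined score; missing scores default to 0."""
    ids = set(semantic_hits) | set(keyword_hits)
    return sorted(
        ids,
        key=lambda d: hybrid_score(semantic_hits.get(d, 0.0), keyword_hits.get(d, 0.0)),
        reverse=True,
    )
```

In practice the weights should be calibrated against a labeled query set, since the optimal balance shifts with query distribution (keyword-heavy technical queries vs conceptual ones).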
Precision and Recall in Retrieval
Precision: Of the documents retrieved, how many are relevant? High precision minimizes noise.
Recall: Of all relevant documents in the corpus, how many were retrieved? High recall ensures completeness.
Trade-off: Retrieval systems are typically optimized for recall to ensure relevant candidates are surfaced, while reranking improves precision by ordering results based on true relevance.
Production systems should explicitly balance recall and precision based on use case requirements. High-recall systems require reranking to maintain answer quality.
Retrieval Quality Measurement
Precision@k = relevant retrieved documents / k
Recall@k = relevant retrieved documents / total relevant documents
These metrics should be benchmarked on domain-specific queries prior to production deployment.
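The two formulas translate directly into code. In this sketch, `relevant` is the labeled set of relevant document IDs for a query and `retrieved` is the ranked retrieval output.

```python
# Direct translations of the Precision@k and Recall@k formulas above.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)
```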
2.3 Embedding Model Selection
The embedding model directly affects retrieval quality. This decision is frequently underestimated — using an inappropriate embedding model is a common cause of poor RAG performance.
The guidance below should be treated as indicative starting points rather than fixed rules. Embedding model selection should be benchmarked against domain queries, retrieval latency targets, storage constraints, and update frequency.
| Selection Criteria | What to Evaluate | Guidance |
| Dimension Size | Vector dimensionality (384–1536+) | Higher dimensions = better quality, more storage & slower search. 768–1024 is a common practical balance. |
| Domain Fit | Performance on domain-specific queries | Embedding models should be benchmarked on at least 100 domain queries before selection. General models degrade into specialized domains. |
| Speed vs Quality | Encoding latency per batch | Example: For <100ms retrieval SLO, use lighter models (384-dim). For quality-critical paths, accept 200–400ms with 1024-dim. |
| Update Cost | Re-embedding cost when docs change | If data updates frequently, choose models with incremental update support. |
2.4 Chunking Strategy
Chunking strategy directly affects retrieval accuracy. When the size is too small, crucial information might be overlooked; conversely, if it’s too large, excessive noise can interfere with retrieval.
Additional chunking mechanisms may also be used depending on document structure and retrieval objectives:
- Semantic chunking: split on meaning or topic boundaries rather than fixed size
- Sliding-window chunking: overlapping windows for continuity across adjacent spans
- Hierarchical chunking: summary-level parent chunks linked to detailed child chunks
- Structure-aware chunking: split on headings, tables, sections, or form boundaries
The chunk sizes in the table are illustrative starting points and should be calibrated using retrieval quality metrics (such as precision@k, recall@k, and answer relevance), document structure, and model context limits.
| Content Type | Chunk Size (tokens) | Overlap | Strategy |
| Structured data (logs, records) | 100–200 | 0% | Each record is atomic and should not be split across chunks |
| Technical documentation | 300–500 | 15–20% | Split at section boundaries; overlap preserves cross-section references |
| Narrative or conversational content | 500–800 | 20% | Preserve topic continuity across paragraphs |
| Legal / policy documents | 400–600 | 25% | Higher overlap because clauses reference each other |
Note: Overlap is not free — it increases index size and can cause duplicate retrievals. Retrieved chunks should be deduplicated by source ID before being passed to the model.
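Fixed-size chunking with overlap, and the source-ID deduplication mentioned in the note, can be sketched as below. Tokens are approximated by pre-split items here; a real pipeline would use the embedding model's tokenizer, and the function names are assumptions.

```python
# Sketch: sliding-window chunking with overlap, plus source-ID deduplication.
def chunk(tokens: list[str], size: int, overlap_pct: float) -> list[list[str]]:
    """Produce chunks of `size` tokens; consecutive chunks share overlap_pct."""
    step = max(1, int(size * (1 - overlap_pct)))
    return [tokens[i:i + size] for i in range(0, len(tokens), step) if tokens[i:i + size]]

def dedupe_by_source(chunks: list[dict]) -> list[dict]:
    """Keep only the first retrieved chunk per source_id."""
    seen, out = set(), []
    for c in chunks:
        if c["source_id"] not in seen:
            seen.add(c["source_id"])
            out.append(c)
    return out
```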
2.5 Retrieval & Reranking
The RAG pattern determines the retrieval method. For instance, Hybrid RAG requires both semantic (vector) and keyword (sparse) retrieval working together. The pattern dictates the retrieval architecture.
Initial retrieval (via vector similarity) is a recall operation — it surfaces candidates. Reranking is a precision operation — it reorders candidates by relevance. The two-stage approach (retrieve-then-rerank) is the production standard.
| Stage | Method | Speed | Precision | When to Use |
| Retrieval | Vector similarity | Fast (<50ms) | Moderate | Used as stage 1 |
| Reranking | Cross-encoder (pairwise scoring) | Slower (+100–300ms) | High | Top-K ≥ 5 candidates, quality-critical paths |
| Skip Reranking | Use retrieval output directly | Fastest | Moderate | Latency-critical paths (<100ms SLO), low-stakes |
2.6 Index Maintenance & Drift
A vector index is not a static artifact — it degrades as source documents change. Index drift (retrieving stale or outdated content) degrades performance gradually and can be difficult to detect.
| Source Update Frequency | Recommended Strategy | Implementation |
| Daily or less | Batch re-index nightly | Full re-embed during low-traffic window; swap atomically |
| Hourly | Incremental upsert | Track document versions; re-embed only changed docs; append to index |
| Real-time (<1 min) | Streaming pipeline | Kafka → embed worker → upsert to vector DB. Accept eventual consistency. |
| Rare (monthly+) | Manual trigger | Re-index on content release; validate with benchmark queries |
2.7 Memory Architecture
Memory gives agents continuity. Without it, every interaction starts from scratch. Memory should be architected as a system — the wrong memory architecture leads to stale context, privacy violations, or unbounded storage growth.
| Memory Type | Storage | Scope | TTL / Lifecycle | Use Case |
| Session (Short-term) | In-context window | Current conversation | Session end | Conversation flow, turn tracking |
| Working (Workspace) | Redis | Active task | 24–72 hours | Multi-step task state, intermediate results |
| Episodic (Long-term) | Postgres (encrypted) | Per-user persistent | Regulatory policy (e.g., 7 years) | User history, past decisions, preferences |
| Semantic | Vector DB | Cross-user knowledge | 2 years, periodic cleanup | Pattern recognition, experience sharing |
Decision Guidance:
- Working memory and episodic memory should be stored separately to prevent contamination of task state and historical records.
- Episodic memory should be written only after validation to avoid propagating incorrect information.
- Retrieved memory should be constrained to a small portion of the context window (for example <5%) to prevent context overflow.
⚠ Anti-pattern
Storing raw conversation transcripts as long-term memory. Transcripts grow unboundedly and contain noise. Extract and store structured summaries instead.
3. Agent Architecture, Patterns & Frameworks
This area covers structural decisions: how the agent system is organized as software, what reasoning patterns it uses, and whether to use an agent framework or build directly against model APIs. These decisions determine operational complexity, failure behavior, and long-term maintainability.
3.1 Event-Driven Microservices Architecture
Event-driven microservices are the de facto standard for production multi-agent systems at scale. This architecture provides fault isolation, independent scaling, and asynchronous workflows. The conceptual reference architecture consists of these layers:
Layer 1 — API Gateway: Authentication, rate limiting, request routing.
Layer 2 — Agent Services: Each agent type runs as an independent service (stateless or stateful with external state store)
Layer 3 — Event Bus (Kafka/SQS): Asynchronous communication between agents and services
Layer 4 — LLM Gateway: Model routing, prompt caching, token metering
Layer 5 — Data Layer: Vector DBs, Redis (working memory), Postgres (episodic memory), tool registries.
These layers represent a deployment view of the architecture. At runtime, the system follows the flow illustrated in the reference architecture diagram (planner → orchestration → agents → coordination → execution).
Key Design Principle: Agents publish events when tasks are complete; other agents subscribe and react. This decouples services and enables independent deployment.
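The publish/subscribe principle can be illustrated with a minimal in-process sketch. A production system would use Kafka or SQS as the event bus; the topic name and handler here are hypothetical.

```python
# Minimal in-process event bus illustrating publish/subscribe decoupling:
# the publisher does not know which agents react to the event.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
results = []
bus.subscribe("task.completed", lambda e: results.append(e["task_id"]))
bus.publish("task.completed", {"task_id": "t-42"})
# results == ["t-42"]
```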
3.2 Agent State Management
Stateless agents are simpler to operate (horizontal scaling, no sticky sessions, simple health checks). Use stateful approaches only when the use case requires conversation continuity or multi-step task tracking.
| Approach | When to Use | State Store | Complexity |
| Stateless | Single-turn Q&A, classification, extraction | None | Low — fully horizontally scalable |
| Stateful (external) | Conversational agents, multi-step workflows | Redis (session) + Postgres (persistent) | Medium — state store becomes a dependency |
| Stateful (in-context) | Short sessions (<10 turns), low volume | Context window only | Low, but limited by context window size |
⚠ Anti-pattern
Storing state in-process (agent instance memory). When the instance restarts or is load-balanced to a different pod, state is lost. Agent state should be externalized in production systems.
3.3 Handling Partial Failures
In multi-skill or multi-service agent workflows, partial failures are inevitable. If Skill A succeeds (for example, a write to a database) and Skill B fails (for example, a payment times out), the system is in an inconsistent state.
Saga patterns are commonly used in distributed agent workflows to manage long-running tasks and compensation logic when partial failures occur. Other recovery patterns, such as checkpoint-and-resume and idempotent retry, may be more appropriate depending on workflow duration, side effects, and rollback requirements.
State recovery mechanisms (such as checkpointing, retries, and compensation logic) should be selected based on workflow duration and failure impact.
| Pattern | How It Works | When to Use |
| Saga (Choreography) | Each skill publishes success/failure events; other skills listen and compensate | Loosely coupled skills, async workflows |
| Saga (Orchestration) | A central orchestrator coordinates and triggers compensation on failure | Tightly coupled workflows, need clear ownership |
| Retry with Idempotency | Retry failed operations; each operation is idempotent (safe to repeat) | Simple failures (timeouts, transient errors) |
| Checkpoint & Resume | Save workflow state at each step; resume from last checkpoint on failure | Long-running workflows (minutes+) |
Compensation Design Guidance: Every skill that performs a side effect (write, send, charge) should have a corresponding compensation action (rollback, cancel, refund). Define these at contract time — not at incident time.
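The orchestrated saga variant can be sketched as follows: on failure, previously completed steps are compensated in reverse order. The step tuples and names are hypothetical, and a real orchestrator would also persist checkpoints and make compensations idempotent.

```python
# Sketch of orchestrated saga execution with reverse-order compensation.
def run_saga(steps: list[tuple[str, callable, callable]]) -> tuple[bool, list[str]]:
    """Each step is (name, action, compensation). Returns (success, log)."""
    log, completed = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            # Roll back side effects of already-completed steps, newest first.
            for done_name, comp in reversed(completed):
                comp()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log
```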
3.4 Planning & Reasoning Patterns
Agent Pattern Overview
Agent patterns operate across multiple layers of system behavior:
- Interaction patterns define how agents communicate with users, tools, and other agents
- Planning patterns define how tasks are decomposed before execution
- Reasoning patterns define how decisions are made
- Execution and coordination patterns define how tasks are carried out across workflows
Part 1 focuses on planning, reasoning, and interaction patterns. Execution and coordination patterns are covered in Part 2.
Planning Patterns
| Pattern | Best For | Tradeoff |
| Plan-and-Execute | Predictable workflows with defined steps | Fast and structured, but less adaptive |
| Least-to-Most / Task Planning | Problems that can be decomposed into ordered sub-problems | Handles structured complexity, but adds planning overhead |
| Hierarchical Planning | Multi-level task decomposition and delegated sub-goals | Scales to complex workflows, but increases coordination overhead |
| Dynamic Replanning | Volatile workflows where conditions change during execution | Adaptive, but can increase compute cost and plan instability |
Reasoning Patterns
The reasoning pattern determines how an agent breaks down and solves a problem. The choice affects accuracy, latency, and cost.
Selection Criteria:
Task Predictability:
Predictable or known workflows are typically suited to Plan-and-Execute patterns, while uncertain or adaptive workflows align with ReAct-style reasoning.
Solution Space:
Tasks with a single expected outcome are generally suited to Chain-of-Thought reasoning, whereas problems with multiple valid paths may benefit from Tree-of-Thoughts.
Quality Requirements:
Standard workflows may use single-pass reasoning, while high-stakes scenarios often use Reflection or multi-pass approaches.
Cost Tolerance:
Each reasoning iteration incurs an additional model call. Iteration limits are typically defined based on acceptable cost and latency constraints.
Additional considerations include latency tolerance, tool interaction complexity, and whether the workflow involves single-agent or multi-agent coordination.
| Pattern | Best For | Tradeoff |
| ReAct | Adaptive, uncertain environments | Flexible, but can loop without iteration bounds |
| Chain-of-Thought | Step-by-step reasoning for logic or calculation | Improves reasoning quality, but increases token usage |
| Tree-of-Thoughts | Problems with multiple viable solution paths | Stronger exploration, but much higher compute cost |
| Reflection/Self-Critique | Quality refinement and validation | Improves output quality, but adds latency and cost |
| CodeAct | Tasks requiring code generation and execution | Enables precise computation and automation, but requires sandboxing and execution safeguards |
Additional reasoning patterns exist, but the patterns above represent the most commonly used approaches in enterprise agent systems. Selection should be guided by task complexity, quality requirements, latency tolerance, and tool-use needs.
Note: ReAct iteration limits: Set max iterations based on cost tolerance. Each iteration costs one full LLM call. For example: for 10 iterations with a frontier model, cost can exceed $0.10 per query. Iteration counts should be monitored in production; consistently high counts indicate the agent is not making progress.
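A bounded ReAct loop can be sketched as below. The `llm_step` callable is a hypothetical stand-in for a model call that returns either an action to take or a final answer; hitting the bound without an answer is the non-progress signal the note describes.

```python
# Sketch of a ReAct loop with a hard iteration cap.
def react_loop(llm_step, max_iterations: int = 5):
    """Run reason-act iterations; stop at the bound and flag non-progress."""
    observations = []
    for i in range(max_iterations):
        thought = llm_step(observations)
        if thought.get("final_answer") is not None:
            return {"answer": thought["final_answer"], "iterations": i + 1}
        observations.append(f"acted: {thought['action']}")
    # Bound reached without an answer: escalate / flag non-progress.
    return {"answer": None, "iterations": max_iterations}

# Toy driver: the fake model answers on its third step.
def fake_llm(obs):
    return {"final_answer": "done"} if len(obs) >= 2 else {"action": "search", "final_answer": None}

result = react_loop(fake_llm)
# result == {"answer": "done", "iterations": 3}
```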
Interaction & Tool Use Patterns
Tool interaction patterns define how agents invoke external capabilities, including function calling, API orchestration, and tool chaining.
3.5 Agent Framework Selection
Agent frameworks (LangChain, LlamaIndex, AutoGen, CrewAI) abstract common patterns: tool calling, memory management, multi-agent coordination.
Direct implementation means custom code calling LLM APIs (OpenAI, Anthropic Claude, etc.) without framework abstraction.
Framework adoption is an important architectural decision. Frameworks simplify development by abstracting common agent patterns and accelerating implementation, but they also introduce an additional layer that can reduce transparency and fine-grained control in complex workflows.
Framework Selection Criteria
| Factor | Framework | Direct Implementation |
| Use Case Coverage | >80% of patterns are standard (RAG, tool-use, chat) | Custom patterns the framework does not support |
| Development Speed | Team is new to Agentic AI, and needs rapid prototyping | Team has LLM API experience, optimizing for performance |
| Latency Requirements | Can tolerate <20% overhead from abstraction layer | Need <100ms end-to-end, every millisecond matters |
| Debugging Complexity | Can accept framework abstraction in stack traces | Need full visibility into every LLM call and token |
| Ecosystem Needs | Need 100+ pre-built integrations (databases, APIs, tools) | Custom integrations only, no ecosystem dependency |
Decision Guidance (Illustrative):
Framework-based approaches are generally effective when most required capabilities are supported and the operational overhead remains within acceptable limits.
Direct implementation may be more suitable when workflows require highly specialized execution patterns, strict latency constraints, or full control over model interactions.
These thresholds should be validated against real workloads rather than treated as fixed rules.
When to Use Frameworks vs Direct Implementation
Use Frameworks When: Building standard RAG pipelines, multi-agent coordination, or tool-heavy workflows where pre-built integrations save significant development time. Frameworks are the de facto choice for most production agent development.
Use Direct Implementation When: Latency is critical (<100ms SLO), custom reasoning patterns are required, or the team needs full control over every LLM call for optimization and debugging. Examples: high-frequency trading agents, real-time decision systems.
Framework Comparison
| Framework | Primary Strength | Best For |
| LangChain | Broadest ecosystem, most integrations | General-purpose agents, tool-heavy workflows |
| LlamaIndex | RAG-optimized data pipeline | Knowledge-intensive agents, document Q&A |
| AutoGen | Multi-agent conversation patterns | Agent-to-agent workflows, debate/consensus |
| CrewAI | Role-based multi-agent teams | Structured team workflows, task delegation |
The frameworks listed above represent commonly used options and are not exhaustive.
Multi-Framework Strategy:
It is valid to combine frameworks for different layers. Example: LlamaIndex for RAG pipeline + LangChain for agent orchestration + direct API calls for latency-critical skills. The key is clear boundary ownership — each layer has one framework responsible.
⚠ Anti-pattern
Choosing a framework based on popularity without benchmarking against actual use cases. Every framework has domains where it excels and domains where it adds friction. Benchmark with real use cases.
4. Models, Tools & Integration
This area covers the components an agent calls: language models (reasoning engine), tools (action layer), and the infrastructure connecting them (LLM Gateway). The central challenge is cost optimization without sacrificing quality — model costs are typically the largest operating expense.
4.1 Model Selection & Cascading
No single model is optimal for all tasks. Small models are fast and economical but limited in complex reasoning. Frontier models are powerful but expensive. Model cascading routes each query to the cheapest model that can handle it reliably — this is one of the highest ROI cost optimizations for agent systems, alongside prompt caching and tool parallelization.
| Tier | Characteristics | Relative Cost | Best For |
| Small LM (7B–13B) | Fast inference (<200ms), limited reasoning | ~100× cheaper than frontier | Classification, extraction, simple formatting |
| Mid-tier LM (70B or equivalent) | Good reasoning, moderate speed | ~10× cheaper than frontier | Standard Q&A, tool selection, summarization |
| Frontier LM (GPT-4 class) | Best reasoning, slower | Baseline cost | Complex multi-step reasoning, ambiguous queries |
| Specialist / Fine-tuned | Domain-specific accuracy | Training cost amortized | High-volume, narrow-domain tasks |
Model routing selects the most appropriate model based on task characteristics. Cascading is a specific routing strategy that escalates queries across model tiers when confidence is insufficient.
Mixture-of-Experts (MoE) Routing
MoE-style routing selects different models based on task type, improving both cost efficiency and performance.
Selection Criteria
| Factor | Consideration |
| Task diversity | Different query types require different capabilities |
| Cost sensitivity | Whether routing reduces use of expensive models |
| Latency tolerance | Whether routing overhead is acceptable |
| Volume | Whether scale justifies multiple model tiers |
| Quality variance | Whether specialized models outperform general models for specific tasks |
MoE routing is most effective when workloads are heterogeneous and routing decisions can be measured and optimized.
Confidence Estimation for Cascading
Cascading relies on identifying when a lower-tier model is insufficient and escalation is required. Confidence signals should be derived from measurable indicators rather than assumed model certainty.
The methods below represent practical approaches to estimating confidence. Each method provides a signal that can be combined or calibrated based on evaluation data.
| Method | How It Works | Accuracy | Latency Cost | Recommendation |
| Token Probabilities | Use model’s logprob output; low max-prob = uncertain | Moderate | Zero extra latency | Good first signal; combine with other methods |
| Consistency Check | Run same query 2× with temperature>0; disagreement = uncertain | High | +1 full LLM call | Use for mid→frontier escalation decisions |
| Separate Classifier | Fine-tuned binary model: ‘can small model handle this?’ | High (tunable) | Small LM latency only | Best for high-volume, well-defined domains |
Confidence Calibration
Confidence signals should be calibrated against evaluation data rather than treated as intrinsic truth.
A practical approach is to measure:
Calibration accuracy = correctly handled queries within a confidence band / total queries in that band.
This allows routing thresholds to be tuned based on acceptable error rates, escalation volume, latency, and cost.
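The calibration measurement above can be sketched as a small helper that buckets evaluation records by confidence band and computes per-band accuracy. The function name and record shape are illustrative, not from any specific library.

```python
def calibration_accuracy(records, bands):
    """Compute per-band calibration accuracy from evaluation data.

    records: iterable of (confidence, was_correct) pairs, one per evaluated query.
    bands:   list of (low, high) confidence intervals, low-inclusive.
    Returns {(low, high): correct / total} for each band, or None for empty bands.
    """
    results = {}
    for low, high in bands:
        in_band = [ok for conf, ok in records if low <= conf < high]
        results[(low, high)] = (sum(in_band) / len(in_band)) if in_band else None
    return results
```

Routing thresholds can then be placed where measured accuracy crosses the acceptable error rate for each tier.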
Illustrative Routing Flow
Query → Small model tier
If confidence is above the upper threshold, return result
If confidence falls within a mid-range, escalate to a stronger model
If confidence is below a lower threshold, escalate to the most capable model or a human review path
Threshold values should be calibrated using evaluation data rather than fixed globally.
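The routing flow above can be sketched as follows. The threshold values (0.9 and 0.6) are placeholders to be replaced with calibrated values, and the model arguments are any callables returning an answer plus a confidence score.

```python
# Illustrative thresholds only; calibrate against evaluation data per domain.
UPPER_THRESHOLD = 0.9
LOWER_THRESHOLD = 0.6

def route(query, small_model, mid_model, frontier_model):
    """Cascade a query across model tiers based on small-model confidence.

    Each model is a callable: query -> (answer, confidence).
    Returns (answer, tier_used) so tier distribution can be monitored.
    """
    answer, confidence = small_model(query)
    if confidence >= UPPER_THRESHOLD:
        return answer, "small"
    if confidence >= LOWER_THRESHOLD:
        # Mid-range confidence: escalate one tier
        answer, _ = mid_model(query)
        return answer, "mid"
    # Low confidence: escalate to the most capable model (or human review)
    answer, _ = frontier_model(query)
    return answer, "frontier"
```

Returning the tier alongside the answer supports measuring the actual tier distribution against the 80/15/5 starting target discussed below.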
Implementation: Cascading can be implemented in the LLM Gateway layer or using frameworks like LiteLLM (supports model fallback routing) or custom logic in the agent orchestration layer.
Note: The 80/15/5 split (80% small, 15% mid, 5% frontier) is a starting target. Measure actual distribution after initial few weeks and tune thresholds. Some domains will be 95/4/1; others 50/30/20.
4.2 Tool Registry & Lifecycle
Tools are the agent’s interface to external systems. A tool registry is the source of truth for what tools exist, how to call them, and how they behave when they fail. Without a registry, tools quickly become undocumented and difficult to monitor.
Tool Discovery & Registry Patterns
Tools may be discovered and managed using different approaches depending on system scale.
| Discovery Pattern | Best for | Tradeoff |
| Static registry | Small, stable tool sets | Simple, manual to maintain |
| Dynamic registry | Frequently changing tool sets | Flexible, adds runtime dependency |
| MCP-based discovery | Multi-agent and interoperable systems | Standardized, requires protocol support |
Decision Guidance (Illustrative)
Static registries are suitable for controlled environments. Dynamic or MCP-based discovery becomes more appropriate as tool count, update frequency, or interoperability requirements increase.
| Registry Element | What It Contains | Why It Matters |
| Identity | Name, version, owner, deprecation date | Prevents calling stale or unmaintained tools |
| Contract | OpenAPI spec, input/output schema | Enables the model to call tools correctly without examples |
| SLA | Timeout (typically 30s), success rate target (99.5%) | Allows the agent to make informed retry/fallback decisions |
| Fallback Chain | Primary → Secondary → Escalate | Defines behavior before failure occurs, not during incident response |
| Cost | Per-call cost estimate | Enables cost-aware routing and budget enforcement |
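The registry elements above can be captured in a typed record. This is a minimal sketch; the field names and defaults are illustrative (the defaults mirror the typical values in the table), not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToolRegistryEntry:
    # Identity: prevents calling stale or unmaintained tools
    name: str
    version: str
    owner: str
    deprecation_date: Optional[str] = None
    # Contract: enables the model to call the tool correctly
    openapi_spec_url: str = ""
    # SLA: informs retry/fallback decisions
    timeout_seconds: int = 30
    success_rate_target: float = 0.995
    # Fallback chain: primary -> secondary -> escalate, defined before failure
    fallback_chain: List[str] = field(default_factory=list)
    # Cost: enables cost-aware routing and budget enforcement
    cost_per_call_usd: float = 0.0

    def is_deprecated(self) -> bool:
        return self.deprecation_date is not None
```

A static registry is then simply a list of these entries serialized to a manifest; dynamic registries serve the same records from a runtime service.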
Tool Discovery by Scale
| Tool Count | Discovery Method | Implementation |
| <20 tools | Static JSON manifest | All tools listed in system prompt; model selects from full set |
| 20–100 tools | Dynamic retrieval | Query embedding → retrieve top-5 relevant tools → model selects |
| >100 tools | Capability-based routing + MCP | Intent → tool category → specific tool. MCP (Model Context Protocol) enables dynamic capability advertisement. |
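For the 20–100 tool tier above, dynamic retrieval can be sketched as embedding-similarity ranking over tool descriptions. The embedding vectors here are placeholders; in practice they would come from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve_tools(query_embedding, tool_embeddings, k=5):
    """Rank registered tools by similarity to the query and return the top k.

    tool_embeddings: {tool_name: description_embedding_vector}.
    Only the returned subset is presented to the model for selection.
    """
    ranked = sorted(tool_embeddings.items(),
                    key=lambda item: cosine(query_embedding, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

Presenting only the top-k candidates keeps the tool list within the model's effective selection capacity as the registry grows.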
Tool Selection Considerations
Tool selection should consider factors such as the number of tools in the system, update frequency, interoperability requirements, latency constraints, and governance needs such as access control and auditability. These considerations influence whether static registries, dynamic discovery, or MCP-based approaches are appropriate.
Tool Versioning & Breaking Changes
Tool APIs change. When an external API changes its contract, the agent must continue to operate correctly, which requires deliberate versioning and fallback handling rather than in-place modification.
Tool version management strategy:
- Tools are typically versioned explicitly (for example v1, v2) rather than modified in place to ensure backward compatibility and safe upgrades.
- Run shadow traffic: new version receives all requests in parallel (no user impact) for 48–72 hours before cutover. Compare outputs.
- Deprecation path: old version remains callable for 30 days post-cutover. Log all calls to deprecated versions as warnings.
- Fallback to previous version: if new version error rate exceeds 5%, automatically fall back to the previous version.
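The automatic fallback rule above can be sketched as a small router that tracks a rolling error rate for the new version. The class name, window size, and request shape are illustrative; the 5% threshold comes from the strategy above.

```python
from collections import deque

class VersionedToolRouter:
    """Route calls to the new tool version, falling back to the previous
    version when the rolling error rate exceeds the threshold."""

    def __init__(self, new_version, old_version, threshold=0.05, window=100):
        self.new_version = new_version
        self.old_version = old_version
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True = error, rolling window

    @property
    def error_rate(self):
        return (sum(self.outcomes) / len(self.outcomes)) if self.outcomes else 0.0

    def call(self, request):
        if self.error_rate > self.threshold:
            # Known-bad new version: route straight to the previous version
            return self.old_version(request)
        try:
            result = self.new_version(request)
            self.outcomes.append(False)
            return result
        except Exception:
            self.outcomes.append(True)
            # Per-call fallback so the current request still succeeds
            return self.old_version(request)
```

In production the same outcome log would also feed the deprecation warnings described above.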
4.3 Retry, Timeout & Circuit Breaker
Every external tool call can fail. The retry strategy determines how the agent recovers — or knows when to stop trying.
| Parameter | Typical configuration | Rationale |
| Max Retries | 3 | Beyond 3, the tool is likely experiencing an outage, not a transient error |
| Base Backoff | 1 second | Exponential: 1s → 2s → 4s. Prevents load spikes on shared services. |
| Jitter | ±25% of backoff | Prevents retry storms when multiple agents hit the same failing tool simultaneously |
| Timeout (per call) | 30 seconds | Long enough for most APIs; short enough to not block workflow indefinitely |
| Circuit Breaker Open | After 3 failures in 60s | Immediately fail fast once a tool is known-broken; prevents wasted retries |
| Circuit Breaker Reset | After 30s half-open probe | Allow one test call to check if the tool has recovered |
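The parameters in the table can be sketched as a retry helper plus a circuit breaker. This is a minimal illustration using the table's typical values as defaults; the `sleep`, `rng`, and `clock` parameters are injection points so the policy can be tested without real waiting, not part of any standard API.

```python
import random
import time

def call_with_retries(tool, max_retries=3, base_backoff=1.0,
                      jitter=0.25, sleep=time.sleep, rng=random.random):
    """Retry with exponential backoff (1s -> 2s -> 4s) and +/-25% jitter."""
    last_error = None
    for attempt in range(max_retries + 1):  # initial call plus retries
        try:
            return tool()
        except Exception as exc:
            last_error = exc
            if attempt == max_retries:
                break
            delay = base_backoff * (2 ** attempt)
            delay *= 1 + jitter * (2 * rng() - 1)  # spread out retry storms
            sleep(delay)
    raise last_error

class CircuitBreaker:
    """Open after `failure_threshold` failures within `window_seconds`;
    allow a half-open probe after `reset_seconds`."""

    def __init__(self, failure_threshold=3, window_seconds=60,
                 reset_seconds=30, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = []
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a probe call once the reset interval has elapsed
        return self.clock() - self.opened_at >= self.reset_seconds

    def record_failure(self):
        now = self.clock()
        self.failures = [t for t in self.failures if now - t < self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now

    def record_success(self):
        self.failures.clear()
        self.opened_at = None
```

In practice the breaker state is checked before `call_with_retries` so a known-broken tool fails fast instead of consuming its full retry budget.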
⚠ Anti-pattern
Retrying indefinitely or at fixed intervals. Infinite retries can block workflows under persistent failures, and fixed intervals can trigger retry storms when many agents hit the same heavily shared tool.
4.4 The LLM Gateway
The LLM Gateway is the single point through which all model calls flow. It provides model routing (cascading), rate limiting, prompt caching, cost metering, and observability. Without it, there is no visibility into model usage and no ability to optimize costs.
Request Flow: User Request → API Gateway (auth, rate limiting) → LLM Gateway (model routing, caching, metering) → LLM Provider APIs (for example, GPT-class or Claude-class models)
| Gateway Option | Best For | Key Features | Cost |
| LiteLLM | Teams starting out, open-source preference | 100+ provider support, unified API, model routing | Free (open source) |
| Portkey | Growing teams (10K–100K req/day) | Advanced routing, prompt caching, A/B testing, observability dashboards | Per-request pricing |
| Custom-built | High-volume (>100K req/day) with unique routing logic | Full control, custom algorithms, internal tooling integration | Engineering investment |
Cost Attribution: The LLM Gateway should tag every request with originating agent, skill invoked, workflow ID, and model tier used. This is the foundation for cost-per-workflow analysis — without it, expensive workflows cannot be identified and optimized.
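The tagging described above can be sketched as a thin wrapper in the gateway layer. The function and field names are illustrative; `provider_call` stands in for any model-provider client returning a response and a cost.

```python
def meter_llm_call(provider_call, *, agent, skill, workflow_id, model_tier, ledger):
    """Wrap a model call so every request is tagged for cost attribution.

    provider_call: callable prompt -> (response_text, cost_usd).
    ledger: any append-able sink (in practice, a metrics/billing pipeline).
    """
    def tagged_call(prompt):
        response, cost = provider_call(prompt)
        ledger.append({
            "agent": agent,
            "skill": skill,
            "workflow_id": workflow_id,
            "model_tier": model_tier,
            "cost_usd": cost,
        })
        return response
    return tagged_call
```

Aggregating the ledger by `workflow_id` is what enables the cost-per-workflow analysis described above.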
Note: Calling model APIs directly from agent code without a gateway is acceptable for prototyping. For production systems, an LLM Gateway is essential for visibility, cost control, and optimization.
Key Takeaways
- Use case complexity is a key factor in determining whether agentic AI and multi-step reasoning are appropriate.
- Skill contracts benefit from structured inputs/outputs, error handling, and fallback paths to support testing and composition.
- RAG pattern selection is typically driven by query complexity, quality requirements, and whether retrieval is necessary.
- Agent frameworks are evaluated based on use case fit (e.g., capability coverage) and operational overhead (e.g., latency), using real workload benchmarks.
- Model cascading is commonly used to improve cost efficiency by routing queries to the most cost-effective capable model.
Lavanya Subbarayalu is an enterprise AI architect specializing in large-scale AI platforms, agentic systems, and responsible AI design. She focuses on building practical, production-ready architectures that enable scalable, reliable, and governed deployment of AI systems across enterprise environments.
